Loading, Exploring and Transforming Data

Data

Signal strength: data dispersion

Like the plot function, R’s summary function is a little bit magical:

Take a moment to digest the output; what did we get?

Let’s think about this data set:

  • Is there signal? How disperse are the data for each variable?
  • Are there missing data?
  • Are all of our vectors of the correct type? What might we want to change?

Before we go on, let’s fix the Major_category variable. To make a factor, use the factor function:

How does the summary change when this variable is a factor?

Getting the median and quartile ranges for the variables in our data set is useful, but they don’t tell the whole data dispersion story.

To visualize dispersion for a quantitative variable we can use a histogram or density plot. Let’s look at the distribution of median incomes.

What does this plot tell us about median income distribution?

So, we’ve got some income variability to work with! Check data dispersion for other quantitative variables in the data set.

More than two dimensions

As we’ve seen, you can use color to transform a plot like this:

Into one like this:

We can also use point size to visualize additional quantitative variables:

Finally, values can be mapped to alpha to visually “weight” points:

Calculating new values

Let’s make a new column that holds the percentage of people who are employed for each major:

Take a moment to make sure you understand what we did there.

Now make three more columns on your own for:

  • Percentage of people with jobs that require a college degree
  • Percentage of people with jobs that don’t
  • Percentage of people in low wage jobs

Filtering

Let’s use filter to create a table holding data for just the Biology majors:

Take a look at the resulting table.

Now we have a reasonable number of majors to plot on a categorical axis:

See ?geom_bar for an explanation of the stat argument and ?theme for an explaination for how I tilted the axis text there.

Re-arranging

Take a moment to dissect what we did there. See bio_ordered$Major.

Which makes for a prettier plot:

The Pipe