Loading, Exploring and Transforming Data

Data

Like the plot function, R’s summary function is a little bit magical:

summary(recent_grads)

Take a moment to digest the output; what did we get?

Let’s think about this data set:

Before we go on, let’s fix the Major_category variable. To make a factor, use the factor function:

recent_grads$Major_category <- factor(recent_grads$Major_category)

How does the summary change when this variable is a factor?
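To see the difference in isolation, here is a minimal sketch on a made-up vector. It assumes the column started out as plain character data, which may not match how your copy of recent_grads was loaded; the names category and category_f are hypothetical.

```r
# Toy vector standing in for a character column such as Major_category
category <- c("Engineering", "Business", "Engineering",
              "Arts", "Business", "Engineering")

# For a character vector, summary() reports only length, class and mode
summary(category)

# After conversion to a factor, summary() counts observations per level
category_f <- factor(category)
summary(category_f)

# Levels are sorted alphabetically by default
levels(category_f)
```

The per-level counts are what make the factor version of the summary more informative for a categorical variable.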

The medians and quartiles of the variables in our data set are useful, but they don’t tell the whole story about how the data are dispersed.

To visualize dispersion for a quantitative variable we can use a histogram or density plot. Let’s look at the distribution of median incomes.

library(ggplot2)

ggplot(recent_grads, aes(Median)) + geom_density()

What does this plot tell us about median income distribution?

So, we’ve got some income variability to work with! Check data dispersion for other quantitative variables in the data set.
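A histogram is the other option mentioned above. The following is a minimal sketch on made-up values, not the real data: the toy data frame and the choice of bins = 5 are assumptions for illustration only.

```r
library(ggplot2)

# Toy data standing in for recent_grads; the values are made up
toy <- data.frame(ShareWomen = c(0.12, 0.35, 0.48, 0.52, 0.61, 0.77, 0.90))

# A histogram splits the range into bins and counts observations per bin;
# bins = 5 is an arbitrary choice for this tiny sample
p <- ggplot(toy, aes(ShareWomen)) + geom_histogram(bins = 5)
p
```

With the workshop data you would pass recent_grads instead of toy and try a few different bin counts, since the shape of a histogram depends on them.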

As we’ve seen, you can use color to transform a plot like this:

ggplot(recent_grads, aes(P25th, P75th)) + geom_point()

into plots like these, colored first by a categorical variable and then by a quantitative one:

ggplot(recent_grads, aes(P25th, P75th, color = Major_category)) + geom_point()

ggplot(recent_grads, aes(P25th, P75th, color = ShareWomen)) + geom_point()

We can also use point size to visualize additional quantitative variables:

ggplot(recent_grads, aes(P25th, P75th, color = Major_category, size = Total)) + geom_point()

Finally, values can be mapped to alpha to visually “weight” points:

ggplot(recent_grads, aes(P25th, P75th, color = Major_category, size = Total, alpha = Sample_size)) + geom_point()

Let’s make a new column that holds the share of people who are employed for each major. Despite the column name, this is a proportion between 0 and 1, not a percentage:

recent_grads$percent_employed <- recent_grads$Employed / recent_grads$Total

Take a moment to make sure you understand what we did there.
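If it helps, here is the same operation on a tiny made-up table. Dividing one column by another works element by element, so each row gets its own ratio. The toy data frame and its values are hypothetical.

```r
# Toy table standing in for recent_grads; the numbers are made up
toy <- data.frame(
  Major    = c("A", "B", "C"),
  Total    = c(200, 50, 1000),
  Employed = c(150, 40, 800),
  stringsAsFactors = FALSE
)

# Element-wise division: row 1 is 150 / 200, row 2 is 40 / 50, and so on
toy$percent_employed <- toy$Employed / toy$Total
toy$percent_employed

# Multiply by 100 if you want an actual percentage
toy$percent_employed * 100
```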

Now make three more columns on your own for:

recent_grads$employment_dev <- recent_grads$percent_employed - 
                               mean(recent_grads$percent_employed)
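One thing to watch for: mean returns NA if any value in the column is missing, and then every deviation becomes NA too. Whether that applies to your copy of the data is something to check rather than assume. A minimal sketch on a made-up vector (p and dev are hypothetical names):

```r
# Toy proportions with one missing value
p <- c(0.75, 0.80, NA, 0.85)

# A single NA makes the plain mean NA
mean(p)

# na.rm = TRUE drops missing values before averaging
dev <- p - mean(p, na.rm = TRUE)
dev
```

The row with the missing value still gets NA, but the other rows now have usable deviations.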

Let’s use dplyr’s filter function to create a table holding data for just the Biology majors:

bio <- filter(recent_grads, Major_category == "Biology & Life Science")

Take a look at the resulting table.
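As a rough illustration of what filter does, here is a sketch on a made-up table. It assumes dplyr is loaded; without it, the name filter refers to an unrelated function in the stats package. The toy data frame is hypothetical.

```r
library(dplyr)

# Toy table standing in for recent_grads; the rows are made up
toy <- data.frame(
  Major = c("Botany", "Civil Engineering", "Zoology"),
  Major_category = c("Biology & Life Science", "Engineering",
                     "Biology & Life Science"),
  stringsAsFactors = FALSE
)

# filter() keeps only the rows where the condition is TRUE
bio_toy <- filter(toy, Major_category == "Biology & Life Science")
nrow(bio_toy)

# Roughly the same thing in base R, using a logical index on the rows
toy[toy$Major_category == "Biology & Life Science", ]
```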

Now we have a reasonable number of majors to plot on a categorical axis:

ggplot(bio, aes(Major, percent_employed)) + 
  geom_bar(stat = 'identity') + 
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

See ?geom_bar for an explanation of the stat argument and ?theme for an explanation of how I tilted the axis text there.
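The next step uses a table called bio_sorted, which the text here never creates. One plausible way to build it, shown on made-up data, is dplyr’s arrange. Both the sort column (percent_employed) and the ascending direction are guesses, not something this document confirms.

```r
library(dplyr)

# Toy stand-in for the bio table; the values are made up
bio_toy <- data.frame(
  Major = c("Zoology", "Botany", "Genetics"),
  percent_employed = c(0.70, 0.85, 0.60),
  stringsAsFactors = FALSE
)

# arrange() reorders rows by the given column, ascending by default
bio_sorted_toy <- arrange(bio_toy, percent_employed)
bio_sorted_toy$Major

# With the workshop data, the analogous (assumed) call would be:
# bio_sorted <- arrange(bio, percent_employed)
```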

# Turn Major into an ordered factor whose levels follow the row order of bio_sorted
bio_ordered <- mutate(bio_sorted,
                      Major = factor(Major, levels = Major, ordered = TRUE))

Take a moment to dissect what we did there. See bio_ordered$Major.
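The key idea is that a factor’s levels, not the row order, decide the order on a categorical axis. A minimal sketch on a made-up vector (majors and ordered_majors are hypothetical names):

```r
# Toy vector already in the order we want on the axis
majors <- c("Genetics", "Zoology", "Botany")

# By default, factor() sorts the levels alphabetically
levels(factor(majors))

# Passing the vector itself as levels keeps its current order
ordered_majors <- factor(majors, levels = majors, ordered = TRUE)
levels(ordered_majors)
```

This only works when the values are unique, because factor rejects duplicated levels. That holds here as long as each major appears in a single row.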

This makes for a prettier plot:

ggplot(bio_ordered, aes(Major, percent_employed)) + 
  geom_bar(stat = 'identity') + 
  theme(axis.text.x = element_text(angle = 60, hjust = 1))