Visualizing Data

Introduction

Tables: data.frames

One of the data sets that ggplot2 comes with is diamonds. Load the package and preview the first rows with head:

library(ggplot2)
head(diamonds)

# A tibble: 6 x 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

You can see how many observations a data set has with nrow:

nrow(diamonds)

[1] 53940

That’s a few!

Each column in a data.frame is just a vector of values. To get the vector of data in a column you can use the shorthand $ syntax (head keeps the output to the first few values):

head(diamonds$cut)

[1] Ideal     Premium   Good      Premium   Good      Very Good
Levels: Fair < Good < Very Good < Premium < Ideal

You can also use the longer form [[..]] syntax:

diamonds[['cut']]
# TRUE when both forms return exactly the same vector
identical(diamonds[['cut']], diamonds$cut)

Note the use of quotes with [[..]] but not with $. See what happens if you use single [...] on a data.frame column:

head(diamonds['cut'])

# A tibble: 6 x 1
  cut      
  <ord>    
1 Ideal    
2 Premium  
3 Good     
4 Premium  
5 Good     
6 Very Good

What did you get? What’s the difference between double and single brackets?
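One way to check your answer is to inspect what each form returns. This is a minimal sketch; the names col_vec and col_tbl are made up for illustration:

```r
library(ggplot2)  # provides the diamonds data set

col_vec <- diamonds[["cut"]]  # double brackets: the column itself, as a vector
col_tbl <- diamonds["cut"]    # single brackets: a table with one column

is.factor(col_vec)      # is it a (factor) vector?
is.data.frame(col_tbl)  # is it still a table?
length(col_vec)         # one value per row
dim(col_tbl)            # rows, then columns
```

The double-bracket form pulls the values out of the table, while the single-bracket form keeps them wrapped in a table.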

RStudio also has a very nice interface for inspecting a data.frame with the View function (note the capital V):

View(diamonds)

Plotting

Base R's plot function can plot a quantitative variable against a categorical one:

plot(diamonds$cut, diamonds$price)

What did you get there?

The first component of any plot is the data. In ggplot2, you define the table by passing it as the first argument to the ggplot function:

ggplot(diamonds)

And you get a beautiful empty box. Exciting! We’ll get there.

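The second component is the aesthetic mapping, set with aes, which says which variables to plot. Map the price variable to the x-axis: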

ggplot(diamonds, aes(price))

Visualize the cut variable:

ggplot(diamonds, aes(cut))

Visualize price as a function of carat, passing the variables in (independent, dependent) order, that is, x then y:

ggplot(diamonds, aes(carat, price))

Why does this relationship make more sense than the inverse?

These plots stay empty until a geom layer says how to draw the data. A scatter plot with points:

ggplot(diamonds, aes(carat, price)) + geom_point()

A histogram:

ggplot(diamonds, aes(price)) + geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
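The message above suggests choosing the bin width yourself. Here is a minimal sketch, assuming a width of 500 (in the units of price) is a sensible starting point; that value is an illustrative guess, so try a few others:

```r
library(ggplot2)

# binwidth is given in units of the x variable (price);
# 500 is an assumed value chosen only for illustration
p <- ggplot(diamonds, aes(price)) + geom_histogram(binwidth = 500)
p
```

Smaller widths show more detail but more noise, and larger widths smooth the shape at the cost of detail.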

A density distribution:

ggplot(diamonds, aes(price)) + geom_density()

How do these two methods compare for visualizing the dispersion of values in a quantitative variable?
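To compare the two directly, one option is to draw them on the same axes. This sketch rescales the histogram to density units with after_stat(density) so both layers share a y-axis; the binwidth of 500 and alpha of 0.4 are assumed values for illustration:

```r
library(ggplot2)

p <- ggplot(diamonds, aes(price)) +
  # histogram on the density scale rather than raw counts
  geom_histogram(aes(y = after_stat(density)), binwidth = 500, alpha = 0.4) +
  # smoothed density estimate drawn on top
  geom_density()
p
```

The histogram shows where the observations actually fall, bin by bin, while the density curve smooths over those bins.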

Grouping variables

To group values by fill color, we add fill to the aesthetic mapping, because it’s about what we’re plotting:

ggplot(diamonds, aes(depth, fill = cut)) + geom_density()

The filled curves overlap and hide one another, so make them semi-transparent with alpha:

ggplot(diamonds, aes(depth, fill = cut)) + geom_density(alpha = 0.2)

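Finally, to focus on the main body of the data (leaving out very large and very small values), we can set the x-axis limits with xlim. Observations outside the limits are dropped before the density is computed, which is why R prints a warning: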

ggplot(diamonds, aes(depth, fill = cut)) + geom_density(alpha = 0.2) + xlim(55, 70)

Warning: Removed 45 rows containing non-finite values (stat_density).

And there it is!
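The warning reports how many rows xlim dropped. As a rough check, you can count the depth values outside the limits yourself; the result should correspond to the number in the warning. The sketch also shows coord_cartesian, which zooms the view without dropping any rows. The names outside and p_zoom are made up for illustration:

```r
library(ggplot2)

# TRUE for every diamond whose depth falls outside the 55 to 70 window
outside <- diamonds$depth < 55 | diamonds$depth > 70
sum(outside)  # number of rows xlim would drop

# Alternative: zoom in on the same window while keeping all rows
# in the density calculation
p_zoom <- ggplot(diamonds, aes(depth, fill = cut)) +
  geom_density(alpha = 0.2) +
  coord_cartesian(xlim = c(55, 70))
p_zoom
```

With so few rows affected, the two plots should look nearly identical here; the difference matters more when many values lie outside the window.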