For today’s class let’s move to working with a slightly larger data set; we’ll use the nycflights13 package, which contains information about every flight that departed from New York City in 2013. We’ll also use dplyr and ggplot2 today.
library(nycflights13)
library(dplyr)
library(ggplot2)
Loading nycflights13 brought several tables into our current environment:
airlines
airports
planes
weather
flights
For today we’ll focus on the flights data set, which lists all domestic flights out of the New York area in 2013. Next week we’ll see how we can merge data from the other tables with flights. For now, you might want to have a look at the airlines table to see the full names that go with each of the airline codes.
Over the last couple of weeks we’ve made a number of visualizations, making quantitative-categorical and quantitative-quantitative comparisons using histograms, density distributions, box plots, bar graphs and scatter plots. Embedded in each of these plots was an implicit question/hypothesis and mental model of how the comparison being visualized relates to that question/hypothesis.
A bit later on in the course we’ll explore sophisticated machine learning tools that we can use to build models based on data; for now, we’ll explore how far we can push simple data visualization as a modeling mechanism. It’s the best place to start when you’re first exploring a new data set.
What kind of model(s), and visualization(s), could address this question?
One option would be to visualize average departure delay by airline (a continuous-categorical comparison). dplyr has two functions that make it easy to do that: group_by and summarize (also spelled summarise; the two are synonyms). You’ll almost always want to use the two together. This code, using the pipe %>%, groups the rows of flights by carrier and then uses summarise with the mean function to calculate the average delay:
flights %>%
group_by(carrier) %>%
summarise(ave_delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 16 x 2
## carrier ave_delay
## <chr> <dbl>
## 1 9E 16.7
## 2 AA 8.59
## 3 AS 5.80
## 4 B6 13.0
## 5 DL 9.26
## 6 EV 20.0
## 7 F9 20.2
## 8 FL 18.7
## 9 HA 4.90
## 10 MQ 10.6
## 11 OO 12.6
## 12 UA 12.1
## 13 US 3.78
## 14 VX 12.9
## 15 WN 17.7
## 16 YV 19.0
Let’s unpack that! Use ?mean, ?summarise, and ?group_by to see the help pages for each.
Now, take a moment to test what happens if you leave out the na.rm = TRUE argument. Why?
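As a hint for the na.rm question, here is a minimal sketch on a made-up vector (the values are hypothetical, not taken from flights). It shows how base R's mean() behaves when one of its inputs is missing, with and without na.rm = TRUE.

```r
# Toy departure delays in minutes; NA stands in for a flight with no recorded delay
toy_delays <- c(5, -2, NA, 30)

with_na    <- mean(toy_delays)                # the missing value propagates
without_na <- mean(toy_delays, na.rm = TRUE)  # mean of 5, -2 and 30 only

print(with_na)
print(without_na)
```

The same thing happens inside summarise: any carrier with even one missing dep_delay would get a missing average.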
We can visualize our summary by piping the table into ggplot:
flights %>%
group_by(carrier) %>%
summarise(ave_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot(aes(carrier, ave_delay)) + geom_bar(stat = 'identity')
For anyone feeling stressed out about the pipe %>%, this code does exactly the same thing, but saves intermediate values in variables:
flights_by_carrier <- group_by(flights, carrier)
carrier_ave_delay <- summarize(flights_by_carrier,
ave_delay = mean(dep_delay, na.rm = TRUE)
)
ggplot(carrier_ave_delay, aes(carrier, ave_delay)) +
geom_bar(stat = 'identity')
For most of my example code I’ll be using the pipe because it’s shorter; but remember, you don’t have to!
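If it helps to see the equivalence directly, here is a small sketch on a made-up table (toy_flights and its values are hypothetical stand-ins for flights). The pipe simply passes the left-hand side in as the first argument of the next call, so the piped and nested versions should give the same result.

```r
library(dplyr)

# Made-up toy table standing in for flights (values are hypothetical)
toy_flights <- tibble(
  carrier   = c("AA", "AA", "UA", "UA"),
  dep_delay = c(10, 20, 0, NA)
)

# Piped version: the left-hand side becomes the first argument of the next call
piped <- toy_flights %>%
  group_by(carrier) %>%
  summarise(ave_delay = mean(dep_delay, na.rm = TRUE))

# Nested version: the same calls written inside out
nested <- summarise(
  group_by(toy_flights, carrier),
  ave_delay = mean(dep_delay, na.rm = TRUE)
)

# Compare the two results
all.equal(piped, nested)
```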
It might be nice to arrange that categorical x-axis by the ave_delay value, no? Let’s do that to make it easy to see which airline has the worst average delay. Remember from last class that we’ll use arrange to order the table and then mutate to change our character column of carrier names into an ordered factor:
flights %>%
group_by(carrier) %>%
summarise(ave_delay = mean(dep_delay, na.rm = TRUE)) %>%
arrange(ave_delay) %>%
mutate(carrier = factor(carrier, levels = carrier, ordered = TRUE)) %>%
ggplot(aes(carrier, ave_delay)) + geom_bar(stat = 'identity')
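To see why the arrange-then-factor step controls the bar order, here is a minimal sketch on a made-up summary table (toy_summary and its values are hypothetical). ggplot draws a discrete axis in the order of the factor levels, and levels = carrier takes the levels from the current, already sorted, row order.

```r
library(dplyr)

# Made-up summary table (hypothetical values)
toy_summary <- tibble(
  carrier   = c("AA", "B6", "UA"),
  ave_delay = c(8, 13, 3)
)

ordered_summary <- toy_summary %>%
  arrange(ave_delay) %>%
  # levels = carrier uses the sorted row order as the level order
  mutate(carrier = factor(carrier, levels = carrier, ordered = TRUE))

# The level order is the order in which the bars would be drawn
levels(ordered_summary$carrier)
```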
So Frontier (F9) and ExpressJet (EV) aren’t looking great. But we all know that using mean to summarize a value can be dangerous, because it’s sensitive to outliers! We should always ask about the variation in the variables in our data sets, but it’s especially important to do so if we’re going to use averages to summarize them.
What is the distribution of departure delays by airline? Visualized as a density distribution:
flights %>%
ggplot(aes(dep_delay, fill = carrier)) + geom_density(alpha = 0.5)
What did we learn? We have a small number of HUGE outliers! That makes using mean possibly very misleading.
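A tiny sketch of that sensitivity, using made-up delays rather than the flights data (all values here are hypothetical): adding a single extreme value moves the mean a long way, while the median barely changes.

```r
# Made-up delays in minutes: four ordinary flights, then one extreme outlier added
typical      <- c(-3, 0, 2, 5)
with_outlier <- c(typical, 900)

mean_typical   <- mean(typical)
mean_outlier   <- mean(with_outlier)    # dragged upward by the single outlier
median_typical <- median(typical)
median_outlier <- median(with_outlier)  # barely moves

print(c(mean_typical, mean_outlier))
print(c(median_typical, median_outlier))
```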
Variation in data like these, which are very sparse, is hard to visualize using density plots. Our two other options are geom_freqpoly and geom_histogram:
flights %>%
ggplot(aes(dep_delay, color = carrier)) + geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
flights %>%
ggplot(aes(dep_delay)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
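The stat_bin() message above is a prompt to choose the bins yourself. A minimal sketch, assuming a made-up data frame (toy and its values are hypothetical) and an arbitrary width of 15: binwidth is expressed in the units of the x variable, so here each bar spans 15 minutes of delay.

```r
library(ggplot2)

# Made-up delays in minutes (hypothetical values, not from flights)
toy <- data.frame(dep_delay = c(-5, -2, 0, 3, 8, 15, 22, 47, 95, 180))

# binwidth is in the units of the x variable: each bar covers 15 minutes
p <- ggplot(toy, aes(dep_delay)) +
  geom_histogram(binwidth = 15)

p
```

The same binwidth argument works with geom_freqpoly; which width reads best depends on the data, so it is worth trying a few.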
What would happen if we used median to average the delay time, instead of mean? This code is identical to that above, but we’ll use the median function to do our averaging.
flights %>%
group_by(carrier) %>%
summarise(ave_delay = median(dep_delay, na.rm = TRUE)) %>%
arrange(ave_delay) %>%
mutate(carrier = factor(carrier, levels = carrier, ordered = TRUE)) %>%
ggplot(aes(carrier, ave_delay)) + geom_bar(stat = 'identity')
That tells a bit of a different story! Fly SkyWest and you’ll get to leave six minutes early. Seemingly small, simple differences in the tools you choose when exploring data can lead to visualizations that tell very different stories.
So how many flights were really delayed and how does that break down by airline? Being delayed more than two hours really sucks, so let’s use that as our cutoff (dep_delay is in minutes):
flights %>%
filter(dep_delay > 120)
## # A tibble: 9,723 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 848 1835 853 1001
## 2 2013 1 1 957 733 144 1056
## 3 2013 1 1 1114 900 134 1447
## 4 2013 1 1 1540 1338 122 2020
## 5 2013 1 1 1815 1325 290 2120
## 6 2013 1 1 1842 1422 260 1958
## 7 2013 1 1 1856 1645 131 2212
## 8 2013 1 1 1934 1725 129 2126
## 9 2013 1 1 1938 1703 155 2109
## 10 2013 1 1 1942 1705 157 2124
## # … with 9,713 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
That’s a lot of flights! We can use the dplyr function named count to give us the number of rows that correspond to each value of a variable:
flights %>%
filter(dep_delay > 120) %>%
count(carrier)
## # A tibble: 16 x 2
## carrier n
## <chr> <int>
## 1 9E 772
## 2 AA 720
## 3 AS 17
## 4 B6 1621
## 5 DL 1093
## 6 EV 2443
## 7 F9 34
## 8 FL 151
## 9 HA 5
## 10 MQ 607
## 11 OO 2
## 12 UA 1364
## 13 US 238
## 14 VX 181
## 15 WN 452
## 16 YV 23
Note that count has created a column named n which contains the counts. We can ask it to sort that table for us:
flights %>%
filter(dep_delay > 120) %>%
count(carrier, sort = TRUE)
## # A tibble: 16 x 2
## carrier n
## <chr> <int>
## 1 EV 2443
## 2 B6 1621
## 3 UA 1364
## 4 DL 1093
## 5 9E 772
## 6 AA 720
## 7 MQ 607
## 8 WN 452
## 9 US 238
## 10 VX 181
## 11 FL 151
## 12 F9 34
## 13 YV 23
## 14 AS 17
## 15 HA 5
## 16 OO 2
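For anyone wondering how count relates to the group_by and summarise pattern from earlier, here is a sketch on a made-up table (toy_flights is hypothetical). It treats count(carrier, sort = TRUE) as shorthand for grouping, counting rows with n(), and sorting in descending order.

```r
library(dplyr)

# Made-up toy table (hypothetical values)
toy_flights <- tibble(carrier = c("EV", "UA", "EV", "B6", "EV", "UA"))

# count() with sort = TRUE ...
counted <- toy_flights %>%
  count(carrier, sort = TRUE)

# ... written out by hand: group, count rows with n(), sort descending
by_hand <- toy_flights %>%
  group_by(carrier) %>%
  summarise(n = n()) %>%
  arrange(desc(n))

counted
by_hand
```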
And we can visualize it with a bar plot (note we don’t need an arrange because count has already sorted the table for us):
flights %>%
filter(dep_delay > 120) %>%
count(carrier, sort = TRUE) %>%
mutate(carrier = factor(carrier, levels = carrier, ordered = TRUE)) %>%
ggplot(aes(carrier, n)) + geom_bar(stat = 'identity')
So now we’re starting to understand ExpressJet’s problem: they win at having a lot of very delayed flights.
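One caveat worth keeping in mind: a raw count also reflects how many flights a carrier operates in the first place. A possible follow-up, sketched here on a made-up table (toy_flights and its values are hypothetical), is to compute the share of each carrier's flights that pass the cutoff. Taking the mean of a logical vector gives the fraction of TRUE values.

```r
library(dplyr)

# Made-up toy table (hypothetical values, delays in minutes)
toy_flights <- tibble(
  carrier   = c("EV", "EV", "EV", "EV", "HA", "HA"),
  dep_delay = c(130, 5, 200, NA, 150, 0)
)

late_share <- toy_flights %>%
  group_by(carrier) %>%
  summarise(
    # number of flights delayed more than 120 minutes
    n_late     = sum(dep_delay > 120, na.rm = TRUE),
    # fraction of flights with a recorded delay that exceed 120 minutes
    share_late = mean(dep_delay > 120, na.rm = TRUE)
  )

late_share
```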
If you’re flying out of New York you might want to know which airport has the worst delays on average. One way to visualize covariation between a categorical variable (airport) and a continuous variable (delay) is with a box plot:
flights %>%
ggplot(aes(origin, dep_delay)) + geom_boxplot()
## Warning: Removed 8255 rows containing non-finite values (stat_boxplot).
Again, those extreme outliers are blowing out our dynamic range. Let’s limit the y-axis to get a better picture of the typical delay (note the warning: ylim drops the rows outside the range before the boxes are computed):
flights %>%
ggplot(aes(origin, dep_delay)) +
geom_boxplot() +
ylim(0, 60)
## Warning: Removed 218411 rows containing non-finite values (stat_boxplot).
They look pretty similar. But how does that break down by carrier?
flights %>%
ggplot(aes(origin, dep_delay, fill = carrier)) +
geom_boxplot() +
ylim(0, 60)
## Warning: Removed 218411 rows containing non-finite values (stat_boxplot).
SkyWest is pretty good, but you might not want to fly with them if you’re at LaGuardia.
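The warnings on the ylim plots above report 218411 removed rows, which means the boxes were computed from the remaining rows only. If you would rather zoom in without discarding data, ggplot2's coord_cartesian is one option. Here is a minimal sketch on made-up delays for a single airport (toy and its values are hypothetical) that compares the median each approach ends up drawing.

```r
library(ggplot2)

# Made-up delays for a single airport (hypothetical values, in minutes)
toy <- data.frame(
  origin    = "JFK",
  dep_delay = c(-10, -5, -2, 3, 40, 300)
)

base_plot <- ggplot(toy, aes(origin, dep_delay)) + geom_boxplot()

# ylim() drops values outside 0..60, so the box is built from 3 and 40 only
p_clip <- base_plot + ylim(0, 60)

# coord_cartesian() only zooms the view; the box is built from all six values
p_zoom <- base_plot + coord_cartesian(ylim = c(0, 60))

# layer_data() returns the computed box plot statistics; middle is the median
median_clip <- suppressWarnings(layer_data(p_clip)$middle)
median_zoom <- layer_data(p_zoom)$middle

print(median_clip)
print(median_zoom)
```

Neither choice is wrong, but they answer different questions: ylim describes only the flights inside the window, while coord_cartesian describes all flights and shows part of the picture.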
To explore covariation in two continuous (quantitative) variables, we can use the tried and true scatter plot:
flights %>%
ggplot(aes(sched_dep_time, dep_delay)) + geom_point()
## Warning: Removed 8255 rows containing missing values (geom_point).
By carrier?
flights %>%
ggplot(aes(sched_dep_time, dep_delay, color = carrier)) +
geom_point(alpha = 0.3)
## Warning: Removed 8255 rows containing missing values (geom_point).
We can compare two categorical variables by plotting counts as point size with geom_count:
flights %>%
ggplot(aes(origin, carrier)) + geom_count()
Notice how I chose to put the variable with more levels on the y-axis.
We can also make a heatmap using geom_tile. In this case, geom_tile doesn’t offer a way to calculate counts on its own, so we use the function count in our pipe:
flights %>%
count(origin, carrier) %>%
ggplot(aes(origin, carrier, fill = n)) + geom_tile()
Heat maps are awesome! Let’s switch out our summary function to look at delays instead of counts:
flights %>%
group_by(origin, carrier) %>%
summarize(ave_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot(aes(origin, carrier, fill = ave_delay)) +
geom_tile() +
scale_fill_continuous(low = "#31a354", high = "#e5f5e0")
I wonder if there’s a correlation between number of flights and average delay? Let’s combine what we reviewed in this section with the previous!
flights %>%
group_by(origin, carrier) %>%
summarize(n = n(),
ave_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot(aes(n, ave_delay, color = origin)) + geom_point()
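If you want a number to go with the scatter plot, one option is base R's cor() applied to the summary table. This is only a sketch on a made-up table (toy_flights and its values are hypothetical), so the value it prints says nothing about the real flights data.

```r
library(dplyr)

# Made-up toy table (hypothetical values, delays in minutes)
toy_flights <- tibble(
  origin    = c("JFK", "JFK", "JFK", "LGA", "LGA", "EWR"),
  carrier   = c("AA",  "AA",  "B6",  "AA",  "AA",  "UA"),
  dep_delay = c(10, 20, 5, 0, NA, 30)
)

toy_summary <- toy_flights %>%
  group_by(origin, carrier) %>%
  summarize(n = n(),
            ave_delay = mean(dep_delay, na.rm = TRUE),
            .groups = "drop")   # return an ungrouped table

# Pearson correlation between flight count and average delay across the groups
r <- cor(toy_summary$n, toy_summary$ave_delay)
r
```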
For the rest of class today, use these tools, what you learned in the last class, and the plots featured in the reading to make some more models.
Include answers to these questions: