Loading nycflights13 brought several tables into our current environment:
airlines
airports
planes
weather
flights
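As a first question, here is a sketch of how to compute the average departure delay for each carrier, assuming dplyr is loaded alongside nycflights13. It uses the same group_by and summarise calls as the plotting code further down; the name carrier_delays is just for illustration.

```r
library(dplyr)
library(nycflights13)

# Average departure delay per carrier; na.rm = TRUE skips missing delays
carrier_delays <- flights %>%
  group_by(carrier) %>%
  summarise(ave_delay = mean(dep_delay, na.rm = TRUE))

carrier_delays
```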
# A tibble: 16 x 2
   carrier ave_delay
   <chr>       <dbl>
 1 9E          16.7
 2 AA           8.59
 3 AS           5.80
 4 B6          13.0
 5 DL           9.26
 6 EV          20.0
 7 F9          20.2
 8 FL          18.7
 9 HA           4.90
10 MQ          10.6
11 OO          12.6
12 UA          12.1
13 US           3.78
14 VX          12.9
15 WN          17.7
16 YV          19.0
Let’s unpack that! Use ?mean, ?summarise, and ?group_by to see the help pages for each.
Now, take a moment to test what happens if you leave out the na.rm = TRUE argument. Why does that happen?
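As a hint, here is a tiny made-up vector (not the flights data) showing how mean treats a missing value:

```r
# Toy values, invented for illustration: one delay is missing
delays <- c(10, NA, 20)

mean(delays)                # NA: one missing value makes the whole mean unknown
mean(delays, na.rm = TRUE)  # 15: the NA is dropped before averaging
```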
We can visualize our summary by piping the table into ggplot:
flights %>%
  group_by(carrier) %>%
  summarise(ave_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(aes(carrier, ave_delay)) +
  geom_bar(stat = 'identity')
For anyone feeling stressed out about the pipe (%>%), this code does exactly the same thing, but saves intermediate values in variables:
flights_by_carrier <- group_by(flights, carrier)
carrier_ave_delay <- summarize(
  flights_by_carrier,
  ave_delay = mean(dep_delay, na.rm = TRUE)
)
ggplot(carrier_ave_delay, aes(carrier, ave_delay)) +
  geom_bar(stat = 'identity')
For most of my example code I’ll be using the pipe because it’s shorter, but remember, you don’t have to!
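flights %>%
  group_by(carrier) %>%
  summarise(ave_delay = mean(dep_delay, na.rm = TRUE)) %>%
  # Sort carriers from shortest to longest average delay
  arrange(ave_delay) %>%
  # Fix the factor levels in that sorted order so the bars are drawn sorted
  mutate(carrier = factor(carrier, levels = carrier, ordered = TRUE)) %>%
  ggplot(aes(carrier, ave_delay)) +
  geom_bar(stat = 'identity')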
What is the distribution of departure delays by airline? Visualized as a density distribution:
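One way to draw this, as a sketch: a histogram of dep_delay with one panel per carrier. The choice of geom_histogram and facet_wrap is my assumption here, and delay_hist is just an illustrative name.

```r
library(ggplot2)
library(nycflights13)

# Histogram of departure delays, one panel per carrier.
# Flights with a missing dep_delay are dropped by ggplot with a warning.
delay_hist <- ggplot(flights, aes(dep_delay)) +
  geom_histogram() +
  facet_wrap(~ carrier)

delay_hist
```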
What did we learn? We have a small number of HUGE outliers! That can make the mean very misleading.
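To see why, here is a toy example with made-up delays (in minutes) where a single extreme value drags the mean far away from what a typical flight looks like, while the median barely moves:

```r
# Toy departure delays in minutes (invented for illustration),
# with one extreme value at the end
delays <- c(-3, -2, 0, 1, 4, 853)

mean(delays)    # about 142: pulled upward by the single 853
median(delays)  # 0.5: hardly affected by the outlier
```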
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
flights %>%
  group_by(carrier) %>%
  summarise(ave_delay = median(dep_delay, na.rm = TRUE)) %>%
  arrange(ave_delay) %>%
  mutate(carrier = factor(carrier, levels = carrier, ordered = TRUE)) %>%
  ggplot(aes(carrier, ave_delay)) +
  geom_bar(stat = 'identity')
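To look at those outliers directly, one option is to keep only the flights that left more than two hours late. This is a sketch using the same filter(dep_delay > 120) call that appears in the bar plot code later on; very_delayed is an illustrative name.

```r
library(dplyr)
library(nycflights13)

# Keep only flights that departed more than 120 minutes late
very_delayed <- flights %>%
  filter(dep_delay > 120)

very_delayed
```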
# A tibble: 9,723 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1      848           1835       853     1001
 2  2013     1     1      957            733       144     1056
 3  2013     1     1     1114            900       134     1447
 4  2013     1     1     1540           1338       122     2020
 5  2013     1     1     1815           1325       290     2120
 6  2013     1     1     1842           1422       260     1958
 7  2013     1     1     1856           1645       131     2212
 8  2013     1     1     1934           1725       129     2126
 9  2013     1     1     1938           1703       155     2109
10  2013     1     1     1942           1705       157     2124
# … with 9,713 more rows, and 12 more variables: sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>
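How many of those very delayed flights belong to each carrier? A sketch with count, again assuming the dep_delay > 120 cutoff (delayed_counts is an illustrative name):

```r
library(dplyr)
library(nycflights13)

# Number of flights per carrier that left more than two hours late
delayed_counts <- flights %>%
  filter(dep_delay > 120) %>%
  count(carrier)

delayed_counts
```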
# A tibble: 16 x 2
   carrier     n
   <chr>   <int>
 1 9E        772
 2 AA        720
 3 AS         17
 4 B6       1621
 5 DL       1093
 6 EV       2443
 7 F9         34
 8 FL        151
 9 HA          5
10 MQ        607
11 OO          2
12 UA       1364
13 US        238
14 VX        181
15 WN        452
16 YV         23
Note that count has created a column named n which contains the counts. We can ask it to sort that table for us:
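Sorting is a single extra argument, sort = TRUE, which orders the rows from the largest count to the smallest. A sketch, with delayed_sorted as an illustrative name:

```r
library(dplyr)
library(nycflights13)

# Same count as before, but sorted from most to fewest very delayed flights
delayed_sorted <- flights %>%
  filter(dep_delay > 120) %>%
  count(carrier, sort = TRUE)

delayed_sorted
```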
# A tibble: 16 x 2
   carrier     n
   <chr>   <int>
 1 EV       2443
 2 B6       1621
 3 UA       1364
 4 DL       1093
 5 9E        772
 6 AA        720
 7 MQ        607
 8 WN        452
 9 US        238
10 VX        181
11 FL        151
12 F9         34
13 YV         23
14 AS         17
15 HA          5
16 OO          2
And we can visualize it with a bar plot (note we don’t need an arrange because count has done that for us):
flights %>%
  filter(dep_delay > 120) %>%
  count(carrier, sort = TRUE) %>%
  mutate(carrier = factor(carrier, levels = carrier, ordered = TRUE)) %>%
  ggplot(aes(carrier, n)) +
  geom_bar(stat = 'identity')
So now we’re starting to understand ExpressJet’s problem: they win at having a lot of very delayed flights.
Warning: Removed 8255 rows containing non-finite values (stat_boxplot).
Again, those extreme outliers are blowing out our dynamic range. Let’s use a little scaling to get a better picture of the average delay:
Warning: Removed 218411 rows containing non-finite values (stat_boxplot).
They look pretty similar. But how does that break down by carrier?
Warning: Removed 218411 rows containing non-finite values (stat_boxplot).
SkyWest is pretty good, but you might not want to fly with them if you’re at LaGuardia.
By carrier?
Warning: Removed 8255 rows containing missing values (geom_point).
Notice how I chose to put the variable with more levels on the y-axis.
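A heat map of flight counts per origin and carrier can be sketched like this. It follows the same origin-on-x, carrier-on-y layout as the geom_tile code below; using count(origin, carrier) as the summary is my assumption, and flight_counts and count_heatmap are illustrative names.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

# One row per origin/carrier pair, with the number of flights in column n
flight_counts <- flights %>%
  count(origin, carrier)

# Carrier has more levels than origin, so it goes on the y-axis
count_heatmap <- ggplot(flight_counts, aes(origin, carrier, fill = n)) +
  geom_tile()

count_heatmap
```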
Heat maps are awesome! Let’s switch out our summary function to look at delays instead of counts:
flights %>%
  group_by(origin, carrier) %>%
  summarize(ave_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(aes(origin, carrier, fill = ave_delay)) +
  geom_tile() +
  scale_fill_continuous(low = "#31a354", high = "#e5f5e0")
flights %>%
  group_by(origin, carrier) %>%
  summarize(
    n = n(),
    ave_delay = mean(dep_delay, na.rm = TRUE)
  ) %>%
  ggplot(aes(n, ave_delay, color = origin)) +
  geom_point()