Exploratory Data Analysis

Data

Loading nycflights13 brought several tables into our current environment:

airlines
airports
planes
weather
flights

Thinking about models

Which airlines have the worst delays? (group_by, summarize & count)

flights %>% 
  group_by(carrier) %>% 
  summarise(ave_delay = mean(dep_delay, na.rm = TRUE))

# A tibble: 16 x 2
   carrier ave_delay
   <chr>       <dbl>
 1 9E          16.7 
 2 AA           8.59
 3 AS           5.80
 4 B6          13.0 
 5 DL           9.26
 6 EV          20.0 
 7 F9          20.2 
 8 FL          18.7 
 9 HA           4.90
10 MQ          10.6 
11 OO          12.6 
12 UA          12.1 
13 US           3.78
14 VX          12.9 
15 WN          17.7 
16 YV          19.0

Let’s unpack that! Use ?mean, ?summarise, and ?group_by to see the help pages for each.
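To see what group_by() and summarise() are doing, it can help to run the same pipeline on a table small enough to check by hand. The sketch below uses a made-up four-row table (toy_flights and its values are invented for illustration, not taken from nycflights13):

```r
library(dplyr)

# Hypothetical toy data: two carriers, one missing delay
toy_flights <- tibble(
  carrier   = c("AA", "AA", "UA", "UA"),
  dep_delay = c(10, 20, 0, NA)
)

toy_summary <- toy_flights %>%
  group_by(carrier) %>%                                 # one group per carrier
  summarise(ave_delay = mean(dep_delay, na.rm = TRUE))  # one row per group

toy_summary
# AA: (10 + 20) / 2 = 15; UA: the NA is dropped, leaving mean(0) = 0
```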

Now, take a moment to see what happens if you leave out the na.rm = TRUE argument. Why does the result change?
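If you want to check your answer, here is a minimal sketch in base R with made-up numbers. The same thing happens inside summarise() for any carrier with at least one missing dep_delay:

```r
delays <- c(10, NA, 30)   # made-up delays; NA marks a missing value

mean(delays)                 # NA: one unknown value makes the mean unknown
mean(delays, na.rm = TRUE)   # 20: the NA is removed before averaging
```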

We can visualize our summary by piping the table into ggplot:

flights %>% 
  group_by(carrier) %>% 
  summarise(ave_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(aes(carrier, ave_delay)) + geom_bar(stat = 'identity')

For anyone feeling stressed out about the pipe %>%, this code does exactly the same thing, but saves intermediate values in variables:

flights_by_carrier <- group_by(flights, carrier)
carrier_ave_delay  <- summarize(flights_by_carrier, 
                                ave_delay = mean(dep_delay, na.rm = TRUE)
                               )
ggplot(carrier_ave_delay, aes(carrier, ave_delay)) +
  geom_bar(stat = 'identity')
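
The two versions match because the pipe simply passes its left-hand side as the first argument of the function on its right, so x %>% f(y) is another way of writing f(x, y). A minimal sketch on a made-up table (toy and its values are invented for illustration):

```r
library(dplyr)

toy <- tibble(carrier = c("AA", "AA", "UA"), dep_delay = c(10, 20, 5))

# Piped version
piped <- toy %>%
  group_by(carrier) %>%
  summarise(ave_delay = mean(dep_delay))

# Same steps written as nested function calls, no pipe
nested <- summarise(group_by(toy, carrier), ave_delay = mean(dep_delay))

all.equal(piped, nested)
```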

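For most of my example code I’ll be using the pipe because it’s shorter; but remember, you don’t have to!

The bars are easier to compare when they are sorted by delay, so here we arrange the summary and turn carrier into a factor whose levels follow that order: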

flights %>% 
  group_by(carrier) %>% 
  summarise(ave_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(ave_delay) %>%
  mutate(carrier = factor(carrier, levels = carrier, ordered = TRUE)) %>%
  ggplot(aes(carrier, ave_delay)) + geom_bar(stat = 'identity')
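
The mutate() step is what sorts the bars: ggplot2 draws a discrete axis in the order of the factor levels, and a plain character column is ordered alphabetically. As an alternative sketch, base R's reorder() sets the levels from a second variable in one step. The three-row table below is made up for illustration:

```r
# Hypothetical summary table, deliberately not in delay order
toy <- data.frame(
  carrier   = c("AA", "EV", "US"),
  ave_delay = c(8.6, 20.0, 3.8)
)

# Levels follow increasing ave_delay instead of the alphabet
by_delay <- reorder(toy$carrier, toy$ave_delay)
levels(by_delay)   # "US" "AA" "EV"
```

In a plot, the same idea would be written as aes(reorder(carrier, ave_delay), ave_delay).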

What is the distribution of departure delays by airline? Here it is visualized as a density plot:

flights %>%
  ggplot(aes(dep_delay, fill = carrier)) + geom_density(alpha = 0.5)

What did we learn? We have a small number of HUGE outliers! That can make the mean very misleading.
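
A quick way to see why outliers matter is to compare the mean with the median, which depends only on the middle of the data. The numbers below are made up for illustration:

```r
# Made-up delays in minutes: four ordinary flights and one extreme outlier
delays <- c(1, 2, 3, 4, 1000)

mean(delays)     # 202: dragged far above every typical value
median(delays)   # 3: unaffected by how large the outlier is
```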

flights %>%
  ggplot(aes(dep_delay, color = carrier)) + geom_freqpoly()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

flights %>%
  ggplot(aes(dep_delay)) + geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
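
The message is a reminder that 30 bins is only a default. A sketch of choosing the width explicitly follows. The 15-minute width and the -60 to 180 minute window are arbitrary choices for illustration, not values from the original analysis:

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

p_hist <- flights %>%
  filter(!is.na(dep_delay)) %>%          # drop missing delays up front
  ggplot(aes(dep_delay)) +
  geom_histogram(binwidth = 15) +        # each bar spans 15 minutes
  coord_cartesian(xlim = c(-60, 180))    # zoom in without discarding data

p_hist
```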

flights %>% 
  group_by(carrier) %>% 
  summarise(median_delay = median(dep_delay, na.rm = TRUE)) %>%
  arrange(median_delay) %>%
  mutate(carrier = factor(carrier, levels = carrier, ordered = TRUE)) %>%
  ggplot(aes(carrier, median_delay)) + geom_bar(stat = 'identity')

Which flights left more than two hours late?

flights %>%
  filter(dep_delay > 120)

# A tibble: 9,723 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1      848           1835       853     1001
 2  2013     1     1      957            733       144     1056
 3  2013     1     1     1114            900       134     1447
 4  2013     1     1     1540           1338       122     2020
 5  2013     1     1     1815           1325       290     2120
 6  2013     1     1     1842           1422       260     1958
 7  2013     1     1     1856           1645       131     2212
 8  2013     1     1     1934           1725       129     2126
 9  2013     1     1     1938           1703       155     2109
10  2013     1     1     1942           1705       157     2124
# … with 9,713 more rows, and 12 more variables: sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>

flights %>%
  filter(dep_delay > 120) %>%
  count(carrier)

# A tibble: 16 x 2
   carrier     n
   <chr>   <int>
 1 9E        772
 2 AA        720
 3 AS         17
 4 B6       1621
 5 DL       1093
 6 EV       2443
 7 F9         34
 8 FL        151
 9 HA          5
10 MQ        607
11 OO          2
12 UA       1364
13 US        238
14 VX        181
15 WN        452
16 YV         23

Note that count has created a column named n which contains the counts. We can ask it to sort that table for us:

flights %>%
  filter(dep_delay > 120) %>%
  count(carrier, sort = TRUE)

# A tibble: 16 x 2
   carrier     n
   <chr>   <int>
 1 EV       2443
 2 B6       1621
 3 UA       1364
 4 DL       1093
 5 9E        772
 6 AA        720
 7 MQ        607
 8 WN        452
 9 US        238
10 VX        181
11 FL        151
12 F9         34
13 YV         23
14 AS         17
15 HA          5
16 OO          2
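
It may help to think of count() as shorthand for the group_by() and summarise() pattern used earlier, with n() doing the counting. A minimal sketch on a made-up table (toy is invented for illustration):

```r
library(dplyr)

toy <- tibble(carrier = c("EV", "EV", "EV", "B6", "B6", "UA"))

# count() with sort = TRUE ...
counted <- toy %>% count(carrier, sort = TRUE)

# ... gives the same table as grouping, counting rows, and sorting
by_hand <- toy %>%
  group_by(carrier) %>%
  summarise(n = n()) %>%
  arrange(desc(n))

all.equal(counted, by_hand)
```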

And we can visualize it with a bar plot (note we don’t need an arrange because count has done that for us):

flights %>%
  filter(dep_delay > 120) %>%
  count(carrier, sort = TRUE) %>%
  mutate(carrier = factor(carrier, levels = carrier, ordered = TRUE)) %>%
  ggplot(aes(carrier, n)) + geom_bar(stat = 'identity')

So now we’re starting to understand ExpressJet’s problem: they win at having a lot of very delayed flights.

Are flight delays worse at different New York airports? (covariation: categorical-continuous)

flights %>%
  ggplot(aes(origin, dep_delay)) + geom_boxplot()

Warning: Removed 8255 rows containing non-finite values (stat_boxplot).

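Again, those extreme outliers are blowing out our dynamic range. Let’s limit the y-axis to get a better picture of typical delays. Be aware that ylim() does not just zoom: it drops every observation outside 0 to 60 minutes (including all early departures, which have negative delays) before the boxplot statistics are computed, which is why the warning below reports so many removed rows: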

flights %>%
  ggplot(aes(origin, dep_delay)) + 
  geom_boxplot() + 
  ylim(0, 60)

Warning: Removed 218411 rows containing non-finite values (stat_boxplot).
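
If the goal is only to zoom in, coord_cartesian() is an alternative: unlike ylim(), which drops out-of-range rows first, it changes the visible range but keeps every finite value in the boxplot calculation. In the sketch below the made-up vector shows how filtering can shift a median, and the -20 to 60 minute window is an arbitrary choice for illustration:

```r
library(ggplot2)
library(nycflights13)

# Made-up delays: dropping values outside 0-60 changes the median
delays <- c(-10, -5, 0, 30, 300)
median(delays)                               # 0
median(delays[delays >= 0 & delays <= 60])   # 15

# Zoom instead of filter: the boxplot statistics use all finite delays
p_zoom <- ggplot(flights, aes(origin, dep_delay)) +
  geom_boxplot() +
  coord_cartesian(ylim = c(-20, 60))

p_zoom
```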

They look pretty similar. But how does that break down by carrier?

flights %>%
  ggplot(aes(origin, dep_delay, fill = carrier)) + 
  geom_boxplot() + 
  ylim(0, 60)

Warning: Removed 218411 rows containing non-finite values (stat_boxplot).

SkyWest is pretty good, but you might not want to fly with them if you’re at LaGuardia.

Does departure time affect flight delays? (covariation: continuous-continuous)

By carrier?

flights %>%
  ggplot(aes(sched_dep_time, dep_delay, color = carrier)) + 
    geom_point(alpha = 0.3)

Warning: Removed 8255 rows containing missing values (geom_point).
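
One thing to keep in mind when reading this plot: sched_dep_time is stored as a clock time in HHMM form (1835 means 18:35, as in the rows printed earlier), so the x-axis has gaps between, say, 1859 and 1900. A sketch of converting it to minutes since midnight follows. hhmm_to_minutes is a hypothetical helper name, not part of any package:

```r
# Hypothetical helper: convert HHMM clock times to minutes since midnight
hhmm_to_minutes <- function(hhmm) {
  (hhmm %/% 100) * 60 + hhmm %% 100   # hours * 60 + minutes
}

hhmm_to_minutes(c(733, 1835))   # 453 1115
```

Inside a pipeline this could be used as mutate(sched_dep_min = hhmm_to_minutes(sched_dep_time)) before plotting, where sched_dep_min is likewise an invented column name.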

How many flights leave each New York airport for each carrier? (covariation: categorical-categorical)

flights %>%
  ggplot(aes(origin, carrier)) + geom_count()

Notice how I chose to put the variable with more levels on the y-axis.

flights %>%
  count(origin, carrier) %>%
  ggplot(aes(origin, carrier, fill = n)) + geom_tile()

Heat maps are awesome! Let’s switch out our summary function to look at delays instead of counts:

flights %>%
  group_by(origin, carrier) %>%
  summarize(ave_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(aes(origin, carrier, fill = ave_delay)) + 
    geom_tile() + 
    scale_fill_continuous(low = "#31a354", high = "#e5f5e0")

flights %>%
  group_by(origin, carrier) %>%
  summarize(n         = n(),
            ave_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(aes(n, ave_delay, color = origin)) + geom_point()
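
When summarise() follows a group_by() on two variables, it drops only the last grouping level, so the results above are still grouped by origin (recent dplyr versions print a message about this). A sketch of asking for an ungrouped result with .groups = "drop", assuming dplyr 1.0.0 or later; the toy table is made up for illustration:

```r
library(dplyr)

toy <- tibble(
  origin    = c("JFK", "JFK", "LGA", "LGA"),
  carrier   = c("AA", "AA", "AA", "UA"),
  dep_delay = c(5, 15, 0, 30)
)

toy_summary <- toy %>%
  group_by(origin, carrier) %>%
  summarise(n         = n(),
            ave_delay = mean(dep_delay, na.rm = TRUE),
            .groups   = "drop")   # return a plain, ungrouped tibble

is_grouped_df(toy_summary)   # FALSE
```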