Exploratory Data Analysis

Data

Loading nycflights13 brought several tables into our current environment:

  • airlines
  • airports
  • planes
  • weather
  • flights
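
A minimal sketch of that loading step, assuming the nycflights13 and dplyr packages are already installed:

```r
library(nycflights13)  # provides airlines, airports, planes, weather, flights
library(dplyr)

# Peek at the main table; glimpse() prints one line per column
glimpse(flights)

# Number of rows and columns in flights
dim(flights)
```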

Thinking about models

Which airlines have the worst delays? (group_by, summarize & count)
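
The code for this step is not shown, so here is a sketch of a pipeline that could produce a table like the one below. It assumes "delay" means departure delay (dep_delay); the name ave_delays is just for illustration.

```r
library(nycflights13)
library(dplyr)

ave_delays <- flights %>%
  group_by(carrier) %>%                                 # one group per airline code
  summarise(ave_delay = mean(dep_delay, na.rm = TRUE))  # mean delay in minutes, ignoring NAs

ave_delays
```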

# A tibble: 16 x 2
   carrier ave_delay
   <chr>       <dbl>
 1 9E          16.7 
 2 AA           8.59
 3 AS           5.80
 4 B6          13.0 
 5 DL           9.26
 6 EV          20.0 
 7 F9          20.2 
 8 FL          18.7 
 9 HA           4.90
10 MQ          10.6 
11 OO          12.6 
12 UA          12.1 
13 US           3.78
14 VX          12.9 
15 WN          17.7 
16 YV          19.0 

Let’s unpack that! Use ?mean, ?summarise, and ?group_by to see the help pages for each.

Now, take a moment to test and see what happens if you leave out the na.rm = TRUE argument. Why?
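
If you want a smaller case to experiment with first, here is a toy vector (made up for illustration, not taken from flights) that shows how mean treats a missing value:

```r
delays <- c(5, NA, 20)  # toy data; NA stands in for a delay that was never recorded

mean(delays)                # NA: a single missing value makes the mean unknown
mean(delays, na.rm = TRUE)  # 12.5: NAs are dropped before averaging
```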

We can visualize our summary by piping the table into ggplot:
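
A sketch of that pipeline, assuming a simple bar chart of mean departure delay per carrier (the original plot type is not shown):

```r
library(nycflights13)
library(dplyr)
library(ggplot2)

p <- flights %>%
  group_by(carrier) %>%
  summarise(ave_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(aes(x = carrier, y = ave_delay)) +  # ggplot layers are added with +, not %>%
  geom_col()

p
```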

For anyone feeling stressed out about the pipe (%>%), this code does exactly the same thing, but saves intermediate values in variables:
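
A sketch of the step-by-step version; the variable names by_carrier and ave_delays are hypothetical, and the bar chart is an assumption about the original plot:

```r
library(nycflights13)
library(dplyr)
library(ggplot2)

# Step 1: group the flights by airline
by_carrier <- group_by(flights, carrier)

# Step 2: collapse each group to its mean departure delay
ave_delays <- summarise(by_carrier, ave_delay = mean(dep_delay, na.rm = TRUE))

# Step 3: plot the summary table
p <- ggplot(ave_delays, aes(x = carrier, y = ave_delay)) +
  geom_col()

p
```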

For most of my example code I’ll be using the pipe because it’s shorter; but remember, you don’t have to!

What is the distribution of departure delays by airline? Visualized as a density distribution:
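
One way to draw such a plot, as a sketch: a density curve per carrier, distinguished by colour. The choice of colour (rather than facets or fill) is an assumption.

```r
library(nycflights13)
library(ggplot2)

# Flights with a missing dep_delay are dropped with a warning
p <- ggplot(flights, aes(x = dep_delay, colour = carrier)) +
  geom_density()

p
```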

What did we learn? We have a small number of HUGE outliers! That makes using the mean possibly very misleading.

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# A tibble: 9,723 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1      848           1835       853     1001
 2  2013     1     1      957            733       144     1056
 3  2013     1     1     1114            900       134     1447
 4  2013     1     1     1540           1338       122     2020
 5  2013     1     1     1815           1325       290     2120
 6  2013     1     1     1842           1422       260     1958
 7  2013     1     1     1856           1645       131     2212
 8  2013     1     1     1934           1725       129     2126
 9  2013     1     1     1938           1703       155     2109
10  2013     1     1     1942           1705       157     2124
# … with 9,713 more rows, and 12 more variables: sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>

# A tibble: 16 x 2
   carrier     n
   <chr>   <int>
 1 9E        772
 2 AA        720
 3 AS         17
 4 B6       1621
 5 DL       1093
 6 EV       2443
 7 F9         34
 8 FL        151
 9 HA          5
10 MQ        607
11 OO          2
12 UA       1364
13 US        238
14 VX        181
15 WN        452
16 YV         23

Note that count has created a column named n which contains the counts. We can ask it to sort that table for us:
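
A sketch of both calls. The filter is an assumption: the rows shown above all have departure delays over two hours, so this uses dep_delay > 120, but the original threshold is not shown. The names very_delayed and sorted_counts are hypothetical.

```r
library(nycflights13)
library(dplyr)

# Assumption: "very delayed" means more than 120 minutes late
very_delayed <- flights %>%
  filter(dep_delay > 120)

# Plain count: one row per carrier, with the tally in column n
very_delayed %>% count(carrier)

# sort = TRUE orders the rows by descending n
sorted_counts <- very_delayed %>% count(carrier, sort = TRUE)
sorted_counts
```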

# A tibble: 16 x 2
   carrier     n
   <chr>   <int>
 1 EV       2443
 2 B6       1621
 3 UA       1364
 4 DL       1093
 5 9E        772
 6 AA        720
 7 MQ        607
 8 WN        452
 9 US        238
10 VX        181
11 FL        151
12 F9         34
13 YV         23
14 AS         17
15 HA          5
16 OO          2

And we can visualize it with a bar plot (note we don’t need an arrange because count has done that for us):
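
A sketch of the bar plot, again assuming the 120-minute threshold. ggplot2 orders a character axis alphabetically rather than by row order, so this version also wraps carrier in reorder() to carry the sorted order into the plot; whether the original did the same is not shown.

```r
library(nycflights13)
library(dplyr)
library(ggplot2)

p <- flights %>%
  filter(dep_delay > 120) %>%       # assumed threshold for "very delayed"
  count(carrier, sort = TRUE) %>%
  ggplot(aes(x = reorder(carrier, -n), y = n)) +  # tallest bar first
  geom_col() +
  labs(x = "carrier")

p
```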

So now we’re starting to understand ExpressJet’s problem: they win at having a lot of very delayed flights.

Are flight delays worse at different New York airports? (covariation: categorical-continuous)
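
A sketch of a box plot for this question, with the categorical variable (origin) on the x-axis and the continuous one (dep_delay) on the y-axis:

```r
library(nycflights13)
library(ggplot2)

# Flights with a missing dep_delay are dropped with a warning
p <- ggplot(flights, aes(x = origin, y = dep_delay)) +
  geom_boxplot()

p
```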

Warning: Removed 8255 rows containing non-finite values (stat_boxplot).

Again, those extreme outliers are blowing out our dynamic range. Let’s use a little scaling to get a better picture of the average delay:
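
The exact scaling used here is not shown. One common option, offered purely as an assumption, is a log scale on the y-axis:

```r
library(nycflights13)
library(ggplot2)

p <- ggplot(flights, aes(x = origin, y = dep_delay)) +
  geom_boxplot() +
  scale_y_log10()  # assumption: log scale; zero, negative, and missing delays are dropped with a warning

p
```

A log scale can only show positive delays, so early and on-time departures disappear from this view.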

Warning: Removed 218411 rows containing non-finite values (stat_boxplot).

They look pretty similar. But how does that break down by carrier?

Warning: Removed 218411 rows containing non-finite values (stat_boxplot).

SkyWest is pretty good, but you might not want to fly with them if you’re at LaGuardia.

Does departure time affect flight delays? (covariation: continuous-continuous)

By carrier?
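
A sketch of a scatter plot split by carrier. It assumes the time variable is dep_time (stored as an integer in HHMM form) and uses one facet per carrier; the original may have used sched_dep_time or a different layout.

```r
library(nycflights13)
library(ggplot2)

p <- ggplot(flights, aes(x = dep_time, y = dep_delay)) +
  geom_point(alpha = 0.1) +  # transparency helps with heavy overplotting
  facet_wrap(~ carrier)      # one panel per airline

p
```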

Warning: Removed 8255 rows containing missing values (geom_point).

How many flights leave each New York airport for each carrier? (covariation: categorical-categorical)

Notice how I chose to put the variable with more levels on the y-axis.
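
A sketch of one way to build this kind of plot: count every origin and carrier pair, then map the count to the fill colour of a tile.

```r
library(nycflights13)
library(dplyr)
library(ggplot2)

p <- flights %>%
  count(origin, carrier) %>%                        # one row per airport/airline pair
  ggplot(aes(x = origin, y = carrier, fill = n)) +  # carrier has more levels, so it goes on y
  geom_tile()

p
```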

Heat maps are awesome! Let’s switch out our summary function to look at delays instead of counts:
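
A sketch of the same heat map with mean departure delay in place of the count. Using dep_delay and the mean is an assumption carried over from the earlier summary.

```r
library(nycflights13)
library(dplyr)
library(ggplot2)

p <- flights %>%
  group_by(origin, carrier) %>%
  summarise(ave_delay = mean(dep_delay, na.rm = TRUE), .groups = "drop") %>%
  ggplot(aes(x = origin, y = carrier, fill = ave_delay)) +
  geom_tile()

p
```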

Make some models