05-Data Transformation

Let’s start by loading the dplyr package:

library(dplyr)

Did you notice the warning messages? What’s going on there?

It turns out that the dplyr package has a function named filter(), but the stats package, which is automatically loaded when you start an R session, also has a function named filter()! So, if I type the command filter(dataset, ...), how does R know which filter() function to use?

R looks for the function filter() starting with the package that was loaded most recently, and going backwards in time. Since dplyr was the last package loaded, R will assume that we meant dplyr’s version of filter() and use that.

What if I meant the stats version of filter() instead? Is there a way that I can reference it? Yes! We can use “double colon” notation: stats::filter(). (The general syntax for this is packageName::functionName().)
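As a minimal sketch of the double-colon notation, the snippet below calls both versions of filter() explicitly. The toy data frame and the names toy, rows_kept and moving_avg are made up for illustration; they are not part of this lesson’s data.

```r
library(dplyr)

# Toy data, invented for this example
toy <- data.frame(x = 1:5)

# dplyr::filter() keeps the rows for which the condition is TRUE
rows_kept <- dplyr::filter(toy, x > 3)

# stats::filter() applies a linear filter to a series;
# with three equal weights it computes a centered moving average
moving_avg <- stats::filter(1:5, rep(1/3, 3))

rows_kept
moving_avg
```

Writing the package name in front of the function works regardless of the order in which the packages were loaded.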

nycflights13

Today we’ll be working with the flights dataset from the nycflights13 package. Let’s load the nycflights13 package and the flights dataset (use install.packages("nycflights13") if you don’t have the package yet):

library(nycflights13)

## Warning: package 'nycflights13' was built under R version 4.0.5

data(flights)

Next, use the ?, str() and View() functions to examine the dataset:

?flights
str(flights)
View(flights)

This dataset contains ~336,000 flights that departed from New York City (all 3 airports) in 2013.

Next, just key in the dataset name (i.e. flights):

flights

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Did you notice that the output format is different from what we’ve seen before? That’s because previous datasets were stored as data frames, while this one is stored as a tibble. A tibble is a data frame with a more compact print method: it shows only the first 10 rows and as many columns as fit on the screen. Don’t worry about the difference: for our purposes, tibbles behave the same as data frames.
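If you want to convince yourself that a tibble really is a data frame underneath, here is a small sketch. The data frame df is invented for illustration; as_tibble() comes from the tibble package and is re-exported by dplyr.

```r
library(dplyr)

# Toy data frame, invented for this example
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# Convert it to a tibble
tb <- as_tibble(df)

class(tb)          # a tibble carries the "data.frame" class as well
is.data.frame(tb)  # so functions that expect a data frame accept it
```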

`filter()` and logical operations

Since we are here at Stanford, we may only be interested in flights from NYC to SFO. We can use the filter() verb to achieve this:

flights %>% filter(dest == "SFO")

## # A tibble: 13,331 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      558            600        -2      923            937
##  2  2013     1     1      611            600        11      945            931
##  3  2013     1     1      655            700        -5     1037           1045
##  4  2013     1     1      729            730        -1     1049           1115
##  5  2013     1     1      734            737        -3     1047           1113
##  6  2013     1     1      745            745         0     1135           1125
##  7  2013     1     1      746            746         0     1119           1129
##  8  2013     1     1      803            800         3     1132           1144
##  9  2013     1     1      826            817         9     1145           1158
## 10  2013     1     1     1029           1030        -1     1427           1355
## # ... with 13,321 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

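Note that we used == to test whether dest was equal to "SFO". DO NOT USE =. In R, = is used for assignment and for naming function arguments, so filter(dest = "SFO") does not perform a comparison (dplyr stops with an error instead).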
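The sketch below shows what == actually returns, and what happens if = slips in by mistake. The toy data frame is invented for illustration, and the error behavior is what I would expect from recent dplyr versions rather than a guarantee for every release.

```r
library(dplyr)

# Toy data, invented for this example
toy <- data.frame(dest = c("SFO", "LAX", "SFO"), stringsAsFactors = FALSE)

# == compares element by element and returns a logical vector
is_sfo <- toy$dest == "SFO"
is_sfo

# filter() keeps the rows where that comparison is TRUE
sfo_rows <- toy %>% filter(dest == "SFO")

# With = the expression is treated as a named argument, not a comparison.
# tryCatch() captures the resulting error so the script keeps running.
mistake <- tryCatch(toy %>% filter(dest = "SFO"), error = function(e) e)
inherits(mistake, "error")
```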

There are two other international airports near Stanford, San Jose International Airport (“SJC”) and Oakland International Airport (“OAK”). So if we want to analyze flights that people take to get from NYC to Stanford, we should probably include these flights.

flights %>% filter(dest == "SFO" | dest == "SJC" | dest == "OAK")

## # A tibble: 13,972 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      558            600        -2      923            937
##  2  2013     1     1      611            600        11      945            931
##  3  2013     1     1      655            700        -5     1037           1045
##  4  2013     1     1      729            730        -1     1049           1115
##  5  2013     1     1      734            737        -3     1047           1113
##  6  2013     1     1      745            745         0     1135           1125
##  7  2013     1     1      746            746         0     1119           1129
##  8  2013     1     1      803            800         3     1132           1144
##  9  2013     1     1      826            817         9     1145           1158
## 10  2013     1     1     1029           1030        -1     1427           1355
## # ... with 13,962 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

The command above filters the dataset and prints it out, but does not retain the output. To keep the extracted dataset for further analysis, we have to assign it to a variable:

Stanford <- flights %>% filter(dest == "SFO" | dest == "SJC" | dest == "OAK")

We now have flights from NYC to SFO/SJC/OAK for the entire year. Let’s say that I’m only interested in flights when school is in session (Sep - Jun). Since month is a numeric variable, we could do this:

Stanford %>% filter(month <= 6 | month >= 9)

## # A tibble: 11,351 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      558            600        -2      923            937
##  2  2013     1     1      611            600        11      945            931
##  3  2013     1     1      655            700        -5     1037           1045
##  4  2013     1     1      729            730        -1     1049           1115
##  5  2013     1     1      734            737        -3     1047           1113
##  6  2013     1     1      745            745         0     1135           1125
##  7  2013     1     1      746            746         0     1119           1129
##  8  2013     1     1      803            800         3     1132           1144
##  9  2013     1     1      826            817         9     1145           1158
## 10  2013     1     1     1029           1030        -1     1427           1355
## # ... with 11,341 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

or this:

Stanford %>% filter(month != 7 & month != 8)

## # A tibble: 11,351 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      558            600        -2      923            937
##  2  2013     1     1      611            600        11      945            931
##  3  2013     1     1      655            700        -5     1037           1045
##  4  2013     1     1      729            730        -1     1049           1115
##  5  2013     1     1      734            737        -3     1047           1113
##  6  2013     1     1      745            745         0     1135           1125
##  7  2013     1     1      746            746         0     1119           1129
##  8  2013     1     1      803            800         3     1132           1144
##  9  2013     1     1      826            817         9     1145           1158
## 10  2013     1     1     1029           1030        -1     1427           1355
## # ... with 11,341 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
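The two filters above return the same 11,351 rows because their conditions are logically equivalent. A quick way to check this kind of equivalence is to evaluate the conditions on every possible month; the variable names below are made up for illustration.

```r
month <- 1:12

in_session_or  <- month <= 6 | month >= 9     # Jan-Jun OR Sep-Dec
in_session_and <- month != 7 & month != 8     # not Jul AND not Aug
in_session_not <- !(month == 7 | month == 8)  # NOT (Jul OR Aug)

# All three conditions pick out exactly the same months
identical(in_session_or, in_session_and)
identical(in_session_and, in_session_not)
month[in_session_or]
```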

select() & rename()

Let’s return to the Stanford dataset (i.e. all flights from NYC to SFO/SJC/OAK). Notice that we have a total of 19 variables. Sometimes our datasets will have hundreds or thousands of variables! Not all of them may be of interest to us. select() allows us to choose a subset of these variables to form a smaller dataset that may be easier to work with.

19 is a pretty small number so we could do our data analysis without dropping any columns, but let’s just try out some commands to get a feel for how select() works.

We can select columns by name: if we just want the year, month and day columns, we can use the following code:

Stanford %>% select(year, month, day)

## # A tibble: 13,972 x 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # ... with 13,962 more rows

If the columns we want form a contiguous block, then we can use simpler syntax. To select columns from year to arr_delay (inclusive):

Stanford %>% select(year:arr_delay)

## # A tibble: 13,972 x 9
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      558            600        -2      923            937
##  2  2013     1     1      611            600        11      945            931
##  3  2013     1     1      655            700        -5     1037           1045
##  4  2013     1     1      729            730        -1     1049           1115
##  5  2013     1     1      734            737        -3     1047           1113
##  6  2013     1     1      745            745         0     1135           1125
##  7  2013     1     1      746            746         0     1119           1129
##  8  2013     1     1      803            800         3     1132           1144
##  9  2013     1     1      826            817         9     1145           1158
## 10  2013     1     1     1029           1030        -1     1427           1355
## # ... with 13,962 more rows, and 1 more variable: arr_delay <dbl>

In this example, the year column is superfluous, since all the values are 2013. The code below drops the year column, keeping the rest:

Stanford %>% select(-year)

## # A tibble: 13,972 x 18
##    month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1     1     1      558            600        -2      923            937
##  2     1     1      611            600        11      945            931
##  3     1     1      655            700        -5     1037           1045
##  4     1     1      729            730        -1     1049           1115
##  5     1     1      734            737        -3     1047           1113
##  6     1     1      745            745         0     1135           1125
##  7     1     1      746            746         0     1119           1129
##  8     1     1      803            800         3     1132           1144
##  9     1     1      826            817         9     1145           1158
## 10     1     1     1029           1030        -1     1427           1355
## # ... with 13,962 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

select() can also be used to rearrange the columns. If, for example, I wanted to have the first 3 columns be day, month, year instead of year, month, day:

Stanford %>% select(day, month, year, everything())

## # A tibble: 13,972 x 19
##      day month  year dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1     1     1  2013      558            600        -2      923            937
##  2     1     1  2013      611            600        11      945            931
##  3     1     1  2013      655            700        -5     1037           1045
##  4     1     1  2013      729            730        -1     1049           1115
##  5     1     1  2013      734            737        -3     1047           1113
##  6     1     1  2013      745            745         0     1135           1125
##  7     1     1  2013      746            746         0     1119           1129
##  8     1     1  2013      803            800         3     1132           1144
##  9     1     1  2013      826            817         9     1145           1158
## 10     1     1  2013     1029           1030        -1     1427           1355
## # ... with 13,962 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

To rename columns, use the rename() function. The syntax is new_name = old_name:

Stanford %>% rename(tail_num = tailnum)

## # A tibble: 13,972 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      558            600        -2      923            937
##  2  2013     1     1      611            600        11      945            931
##  3  2013     1     1      655            700        -5     1037           1045
##  4  2013     1     1      729            730        -1     1049           1115
##  5  2013     1     1      734            737        -3     1047           1113
##  6  2013     1     1      745            745         0     1135           1125
##  7  2013     1     1      746            746         0     1119           1129
##  8  2013     1     1      803            800         3     1132           1144
##  9  2013     1     1      826            817         9     1145           1158
## 10  2013     1     1     1029           1030        -1     1427           1355
## # ... with 13,962 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tail_num <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
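Because the flights output is wide, it can be hard to see what each select() call did. Here is a recap on a tiny made-up tibble (toy and its tail numbers are invented), using names() to inspect the resulting columns.

```r
library(dplyr)

# Toy data, invented for this example
toy <- tibble(year = 2013L, month = 1:2, day = 1:2,
              tailnum = c("N123", "N456"))

by_name   <- toy %>% select(month, day)                      # pick columns by name
by_range  <- toy %>% select(year:day)                        # contiguous block
dropped   <- toy %>% select(-year)                           # everything except year
reordered <- toy %>% select(day, month, year, everything())  # move columns to the front
renamed   <- toy %>% rename(tail_num = tailnum)              # new_name = old_name

names(reordered)
names(renamed)
```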

arrange()

Often we get datasets which are not in order, or in an order which we are not interested in. The arrange() function allows us to reorder the rows according to an order we want.

The Stanford dataset looks like it is already ordered by actual departure time. Perhaps I’m most interested in the flights which had the longest departure delay. I could sort the dataset as follows:

Stanford %>% arrange(dep_delay)

## # A tibble: 13,972 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    12    11      710            730       -20     1039           1105
##  2  2013    11    16      712            730       -18     1025           1055
##  3  2013     9    11      712            730       -18      946           1045
##  4  2013    11    19      713            730       -17     1036           1055
##  5  2013     7    14     1151           1208       -17     1450           1515
##  6  2013    12    10      714            730       -16     1104           1110
##  7  2013     3    29     1050           1106       -16     1359           1431
##  8  2013     4    20     1420           1436       -16     1737           1755
##  9  2013     5    20      719            735       -16      951           1110
## 10  2013     1    23      545            600       -15      948            925
## # ... with 13,962 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Looks like the flights with the shortest delay (in fact, flights that left early) are at the top instead! That’s because arrange() sorts in ascending order by default. To sort in descending order, use desc():

Stanford %>% arrange(desc(dep_delay))

## # A tibble: 13,972 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     9    20     1139           1845      1014     1457           2210
##  2  2013     7     7     2123           1030       653       17           1345
##  3  2013     7     7     2059           1030       629      106           1350
##  4  2013     7     6      149           1600       589      456           1935
##  5  2013     7    10      133           1800       453      455           2130
##  6  2013     7    10     2342           1630       432      312           1959
##  7  2013     7     7     2204           1525       399      107           1823
##  8  2013     7     7     2306           1630       396      250           1959
##  9  2013     6    23     1833           1200       393       NA           1507
## 10  2013     7    10     2232           1609       383      138           1928
## # ... with 13,962 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

(Wow, that’s a really long delay! Almost 17 hours.) To extract just the flights with the top 10 departure delays, we can use the head() function:

Stanford %>% 
    arrange(desc(dep_delay)) %>%
    head(n = 10)

## # A tibble: 10 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     9    20     1139           1845      1014     1457           2210
##  2  2013     7     7     2123           1030       653       17           1345
##  3  2013     7     7     2059           1030       629      106           1350
##  4  2013     7     6      149           1600       589      456           1935
##  5  2013     7    10      133           1800       453      455           2130
##  6  2013     7    10     2342           1630       432      312           1959
##  7  2013     7     7     2204           1525       399      107           1823
##  8  2013     7     7     2306           1630       396      250           1959
##  9  2013     6    23     1833           1200       393       NA           1507
## 10  2013     7    10     2232           1609       383      138           1928
## # ... with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

arrange() also allows us to sort by more than one column: each additional column is used to break ties in the values of the preceding ones. For example, flights seems to be sorted by year, month, day, and actual departure time. If I wanted to sort by year, month, day and scheduled departure time instead:

Stanford %>% arrange(year, month, day, sched_dep_time)

## # A tibble: 13,972 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      558            600        -2      923            937
##  2  2013     1     1      611            600        11      945            931
##  3  2013     1     1      655            700        -5     1037           1045
##  4  2013     1     1      729            730        -1     1049           1115
##  5  2013     1     1      734            737        -3     1047           1113
##  6  2013     1     1      745            745         0     1135           1125
##  7  2013     1     1      746            746         0     1119           1129
##  8  2013     1     1      803            800         3     1132           1144
##  9  2013     1     1      826            817         9     1145           1158
## 10  2013     1     1     1029           1030        -1     1427           1355
## # ... with 13,962 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
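To see the tie-breaking rule on a table small enough to check by eye, here is a sketch on a made-up tibble (toy and its values are invented for illustration). It also shows a behavior worth knowing: arrange() places NA values last, even when sorting with desc().

```r
library(dplyr)

# Toy data, invented for this example
toy <- tibble(
    carrier   = c("UA", "AA", "UA", "AA"),
    dep_delay = c(5, NA, 20, 5)
)

# Longest delay first; ties in dep_delay are broken alphabetically by carrier
sorted <- toy %>% arrange(desc(dep_delay), carrier)
sorted
```

The two rows with a delay of 5 come out as AA then UA, and the row with the missing delay ends up at the bottom.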

mutate()

In this dataset we have both the time the plane spent in the air (air_time) and distance traveled (distance). From these two pieces of information, we can figure out the average speed of the plane for the flight using mutate().

mutate() adds new columns to the end of the dataset, so let’s work with a smaller dataset for now so that we can see the values of our new column.

Stanford_small <- Stanford %>% 
    select(month, carrier, origin, dest, air_time, distance) %>%
    mutate(speed = distance / air_time * 60)
Stanford_small

## # A tibble: 13,972 x 7
##    month carrier origin dest  air_time distance speed
##    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>
##  1     1 UA      EWR    SFO        361     2565  426.
##  2     1 UA      JFK    SFO        366     2586  424.
##  3     1 DL      JFK    SFO        362     2586  429.
##  4     1 VX      JFK    SFO        356     2586  436.
##  5     1 B6      JFK    SFO        350     2586  443.
##  6     1 AA      JFK    SFO        378     2586  410.
##  7     1 UA      EWR    SFO        373     2565  413.
##  8     1 UA      JFK    SFO        369     2586  420.
##  9     1 UA      EWR    SFO        357     2565  431.
## 10     1 AA      JFK    SFO        389     2586  399.
## # ... with 13,962 more rows

mutate() can be used to create several new variables at once. For example, the following code is valid syntax:

Stanford_small %>% mutate(speed_miles_per_min = distance / air_time,
                   speed_miles_per_hour = speed_miles_per_min * 60)

## # A tibble: 13,972 x 9
##    month carrier origin dest  air_time distance speed speed_miles_per_min
##    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>               <dbl>
##  1     1 UA      EWR    SFO        361     2565  426.                7.11
##  2     1 UA      JFK    SFO        366     2586  424.                7.07
##  3     1 DL      JFK    SFO        362     2586  429.                7.14
##  4     1 VX      JFK    SFO        356     2586  436.                7.26
##  5     1 B6      JFK    SFO        350     2586  443.                7.39
##  6     1 AA      JFK    SFO        378     2586  410.                6.84
##  7     1 UA      EWR    SFO        373     2565  413.                6.88
##  8     1 UA      JFK    SFO        369     2586  420.                7.01
##  9     1 UA      EWR    SFO        357     2565  431.                7.18
## 10     1 AA      JFK    SFO        389     2586  399.                6.65
## # ... with 13,962 more rows, and 1 more variable: speed_miles_per_hour <dbl>

If we only want to keep the newly created variables, use transmute() instead of mutate().
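Here is a short sketch contrasting the two functions on a made-up tibble (toy and the column names miles_per_min and miles_per_hour are invented for illustration). Note that the second new column is computed from the first one within the same call.

```r
library(dplyr)

# Toy data, invented for this example: air_time in minutes, distance in miles
toy <- tibble(air_time = c(360, 300), distance = c(2586, 2565))

# mutate() keeps the existing columns and appends the new ones
both <- toy %>%
    mutate(miles_per_min  = distance / air_time,
           miles_per_hour = miles_per_min * 60)

# transmute() returns only the newly created columns
only_new <- toy %>%
    transmute(miles_per_min  = distance / air_time,
              miles_per_hour = miles_per_min * 60)

names(both)
names(only_new)
```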

Exercises

How many flights arrived at their destination late? How many departed late?
What proportion of delayed flights made up the lost time and arrived on time?
Find the fastest and slowest flights in the dataset. How much does the average speed vary?

A digression: plotting our data

Let’s make use of our data visualization skills to see if there are any trends in air time. First, let’s make a histogram of air_time:

library(ggplot2)
ggplot(data = Stanford_small) + 
    geom_histogram(aes(x = air_time))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 162 rows containing non-finite values (stat_bin).

Did you notice the warning message about rows being removed for “containing non-finite values”? If you view the Stanford_small dataset and scroll all the way down, you’ll notice that there are some rows which have NA for air_time. Since we don’t know what the air time is, we won’t be able to compute the speed for those rows.

As a data analyst, you should always watch out for NAs, as they could invalidate your analysis. Why are these data missing? Is it completely at random, or is there something going on? For this session, we will just leave them in the dataset.
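Rather than scrolling through the dataset, we can count and extract the rows with missing values directly. The sketch below uses a tiny made-up tibble (toy is invented for illustration); the same calls should work on Stanford_small.

```r
library(dplyr)

# Toy data, invented for this example
toy <- tibble(carrier = c("UA", "AA", "DL"), air_time = c(361, NA, 350))

# Comparing with == does not find NAs, because NA == NA is itself NA
NA == NA

# is.na() is the reliable test for missing values
n_missing    <- sum(is.na(toy$air_time))
missing_rows <- toy %>% filter(is.na(air_time))

n_missing
missing_rows
```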

It seems like the air time of planes might vary depending on the origin and destination, so let’s facet on these 2 variables:

ggplot(data = Stanford_small) + 
    geom_histogram(aes(x = air_time)) + 
    facet_grid(origin ~ dest)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 162 rows containing non-finite values (stat_bin).

We learn 3 things from this plot: (i) there are no flights from LaGuardia (LGA) to any of the 3 airports; (ii) there are no flights from Newark (EWR) to SJC/OAK; and (iii) there are very few flights from NYC to SJC/OAK compared to SFO. It’s hard to tell if there are differences in the distributions from this plot. One alternative is to facet in the other direction, then let each facet have its own y-axis:

ggplot(data = Stanford_small) + 
    geom_histogram(aes(x = air_time)) + 
    facet_grid(dest ~ origin, scales = "free_y")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 162 rows containing non-finite values (stat_bin).

Exercises

Make some plots with this data: are any months or days of the week particularly susceptible to delays? How about airlines?
Plot the departure delay against the arrival delay, with total flight time also represented somehow. What do you notice?

summarize()

Instead of looking at plots, we can look at summary statistics. What was the mean/median air time for flights in our Stanford_small dataset? We can use the summarize() function to help us:

Stanford_small %>% summarize(mean_airtime = mean(air_time))

## # A tibble: 1 x 1
##   mean_airtime
##          <dbl>
## 1           NA

Stanford_small %>% summarize(median_airtime = median(air_time))

## # A tibble: 1 x 1
##   median_airtime
##            <dbl>
## 1             NA

The NAs are causing us trouble! We need to specify the na.rm = TRUE option to remove NAs from consideration:

Stanford_small %>% summarize(mean_airtime = mean(air_time, na.rm = TRUE))

## # A tibble: 1 x 1
##   mean_airtime
##          <dbl>
## 1         346.

Stanford_small %>% summarize(median_airtime = median(air_time, na.rm = TRUE))

## # A tibble: 1 x 1
##   median_airtime
##            <dbl>
## 1            345
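summarize() can compute several statistics in one call, which saves repeating the pipeline. A minimal sketch on made-up data (toy, summary_tbl and the column names are invented for illustration) that also reports how many values were dropped by na.rm = TRUE:

```r
library(dplyr)

# Toy data, invented for this example
toy <- tibble(air_time = c(340, 350, NA, 360))

summary_tbl <- toy %>% summarize(
    n_flights      = n(),                           # number of rows
    n_missing      = sum(is.na(air_time)),          # how many NAs were ignored
    mean_airtime   = mean(air_time, na.rm = TRUE),
    median_airtime = median(air_time, na.rm = TRUE)
)
summary_tbl
```

Reporting the number of missing values next to the mean makes it clear how much data the statistic is actually based on.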

summarize() gives me a summary of the entire dataset. If I want summaries by group, then I have to use summarize() in conjunction with group_by(). group_by() changes the unit of analysis from the whole dataset to individual groups. The following code groups the dataset by carrier, then computes the summary statistic for each group:

Stanford_small %>%
    group_by(carrier) %>%
    summarize(mean_airtime = mean(air_time, na.rm = TRUE)) %>%
    arrange(desc(mean_airtime))

## # A tibble: 5 x 2
##   carrier mean_airtime
##   <chr>          <dbl>
## 1 AA              348.
## 2 VX              348.
## 3 DL              347.
## 4 B6              347.
## 5 UA              344.

I can also group by more than one variable. For example, if I wanted to count the number of flights for each carrier in each month, I could use the following code:

Stanford_small %>%
    group_by(month, carrier) %>%
    summarize(count = n())

## `summarise()` has grouped output by 'month'. You can override using the `.groups` argument.

## # A tibble: 60 x 3
## # Groups:   month [12]
##    month carrier count
##    <int> <chr>   <int>
##  1     1 AA        120
##  2     1 B6        121
##  3     1 DL        142
##  4     1 UA        422
##  5     1 VX        124
##  6     2 AA        108
##  7     2 B6        106
##  8     2 DL        127
##  9     2 UA        378
## 10     2 VX        104
## # ... with 50 more rows
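The message about grouped output appears because, after summarizing by two variables, the result is still grouped by the first one (month). The sketch below, on made-up data (toy is invented for illustration), shows two ways to get an ungrouped result: the .groups argument mentioned in the message, and the count() shortcut.

```r
library(dplyr)

# Toy data, invented for this example
toy <- tibble(
    month   = c(1, 1, 1, 2, 2),
    carrier = c("UA", "UA", "AA", "UA", "AA")
)

# .groups = "drop" returns an ungrouped tibble and silences the message
counts <- toy %>%
    group_by(month, carrier) %>%
    summarize(count = n(), .groups = "drop")

# count() does the grouping, counting and ungrouping in one step
counts_short <- toy %>% count(month, carrier, name = "count")

counts
```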

We can even “pipe” the dataset to ggplot() to plot the data!

Stanford_small %>%
    group_by(month, carrier) %>%
    summarize(count = n()) %>%
    ggplot(mapping = aes(x = month, y = count, col = carrier)) +
        geom_line() +
        geom_point() +
        scale_x_continuous(breaks = 1:12)

## `summarise()` has grouped output by 'month'. You can override using the `.groups` argument.

Exercises

Each plane has a corresponding tail number. Find some interesting planes (maybe they have very many flights, or are chronically delayed), and look them up online. Try to find some that are still in service.
Do certain airlines prefer certain airports? Do some data operations and make the corresponding plots to find out.

Optional material

The `%in%` operator

Recall that we used the following line of code to extract flights that landed in SFO, SJC or OAK:

Stanford <- flights %>% filter(dest == "SFO" | dest == "SJC" | dest == "OAK")

We can use the %in% operator to make the code more succinct:

flights %>% filter(dest %in% c("SFO", "SJC", "OAK"))

## # A tibble: 13,972 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      558            600        -2      923            937
##  2  2013     1     1      611            600        11      945            931
##  3  2013     1     1      655            700        -5     1037           1045
##  4  2013     1     1      729            730        -1     1049           1115
##  5  2013     1     1      734            737        -3     1047           1113
##  6  2013     1     1      745            745         0     1135           1125
##  7  2013     1     1      746            746         0     1119           1129
##  8  2013     1     1      803            800         3     1132           1144
##  9  2013     1     1      826            817         9     1145           1158
## 10  2013     1     1     1029           1030        -1     1427           1355
## # ... with 13,962 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

The %in% operator is very useful, especially when we are checking if dest belongs to a long list of airports.
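Outside of filter(), %in% is an ordinary operator that returns one TRUE/FALSE per element on its left-hand side. A small sketch with made-up vectors (dest and bay_area are invented for illustration):

```r
# Toy vectors, invented for this example
dest     <- c("SFO", "LAX", "OAK", "BOS", "SJC")
bay_area <- c("SFO", "SJC", "OAK")

# %in% gives the same answer as the chain of == joined by |
with_in <- dest %in% bay_area
with_or <- dest == "SFO" | dest == "SJC" | dest == "OAK"
identical(with_in, with_or)

# Negate the whole expression to get "not in"
dest[!(dest %in% bay_area)]
```

Storing the list of airports in a variable such as bay_area also means it only has to be updated in one place.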

Joy plots

Let’s remove the rows with air_time being NA:

Stanford_small <- Stanford_small %>%
    filter(!is.na(air_time))

One theory we might have is that different carriers have different air times. Let’s do a facet on carrier:

ggplot(data = Stanford_small) + 
    geom_histogram(aes(x = air_time)) + 
    facet_grid(carrier ~ .)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The first thing we notice is that UA has many more flights than the other carriers. Because all 5 histograms have the same y-axis, this causes the other histograms to be obscured. To allow each histogram to have its own y-axis, we can add a scales argument to facet_grid():

ggplot(data = Stanford_small) + 
    geom_histogram(mapping = aes(x = air_time)) + 
    facet_grid(carrier ~ ., scales = "free_y")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As you can see, the histograms have very similar shapes, suggesting that the air times of the various carriers are roughly the same. The one thing that we might notice is the tails on the right.

A plot that is increasing in popularity for plotting multiple histograms or density plots is the joy plot (also known as a ridgeline plot, hence the name of the ggridges package). The plot looks like a series of overlapping mountain ranges which can be compared against each other more easily than the histograms. The code below produces a joy plot:

library(ggridges)

## Warning: package 'ggridges' was built under R version 4.0.5

ggplot(data = Stanford_small, aes(x = air_time, y = carrier)) +
    geom_density_ridges(scale = 5)

## Picking joint bandwidth of 3.24

(Play around with the scale parameter and see what happens.)

Session info

sessionInfo()

## R version 4.0.4 (2021-02-15)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19041)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggridges_0.5.3     ggplot2_3.3.3      nycflights13_1.0.2 dplyr_1.0.5       
## [5] knitr_1.31        
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.6        plyr_1.8.6        highr_0.8         pillar_1.5.1     
##  [5] compiler_4.0.4    tools_4.0.4       digest_0.6.27     evaluate_0.14    
##  [9] lifecycle_1.0.0   tibble_3.1.0      gtable_0.3.0      pkgconfig_2.0.3  
## [13] rlang_0.4.10      DBI_1.1.1         cli_2.3.1         rstudioapi_0.13  
## [17] yaml_2.2.1        xfun_0.22         withr_2.4.1       stringr_1.4.0    
## [21] generics_0.1.0    vctrs_0.3.6       grid_4.0.4        tidyselect_1.1.0 
## [25] glue_1.4.2        R6_2.5.0          fansi_0.4.2       rmarkdown_2.7    
## [29] farver_2.1.0      purrr_0.3.4       magrittr_2.0.1    scales_1.1.1     
## [33] ellipsis_0.3.1    htmltools_0.5.1.1 assertthat_0.2.1  colorspace_2.0-0 
## [37] labeling_0.4.2    utf8_1.2.1        stringi_1.5.3     munsell_0.5.0    
## [41] crayon_1.4.1

Apr 20, 2020
