STATS 32 Session 5: Data transformation and functions

Damian Pavlyshyn

April 28

http://web.stanford.edu/class/stats32/lectures/

Recap of sessions 3 and 4

Agenda for today

Loading files

Data tables are stored in files - generally .txt or .csv - as rows of values separated by commas (or some other character).

Warning: .xls files are not like this at all, and are much (and needlessly) more complicated.

Absolute file paths:

Unique addresses of files or folders on your system. When looking for a file, your machine starts at the “root directory” and follows the chain of folders until it reached the end.

You can always load the same file on your machine by loading it from its absolute file path:

df <- read_csv("~/Documents/example.csv")

URLs are also a type of absolute file path, and R can load files directly from the internet too:

df <- read_csv("http://web.stanford.edu/class/stats32/assets/lecture-3/data/worldbank_data_tidy.csv")

Relative file paths

If a file path doesn’t start with the root directory, your machine will instead start looking in whatever folder the program you’re using is running in. This is called the working directory.

You can change the working directory using the setwd() function, then access files relative to that folder:

setwd("~/Documents/")
df <- read_csv("example.csv")

Alternatively, you set your working directory in RStudio by navigating to your desired folder in the file browser in the lower right of the screen, and then clicking More > Set As Working Directory.

Functions: R’s workhorse

A function is a named block of code which

(Source: codehs.gitbooks.io)

We use functions in R all the time

We’ve already seen a number of functions in R! For example,

is.character("123")
## [1] TRUE

The function is.character takes the input given to it in the parentheses and returns TRUE or FALSE, depending on whether the input is of type character or not.

Others we’ve seen: str(), head(), sd(), ggplot(), is.list(), …

We can see what a function does by typing in ? followed by the function name in the R console.

?is.character

Function syntax

The most important syntax in R is the function call. All R syntax has function calls underlying it.

A function call consists of:

function_name(<positional arguments>, <keyword arguments>)

Positional arguments are the “main input” of the function - often these are data tables or vectors. These are supplied without keys and in a specific order. They are essential for the function to work.

If there are multiple positional arguments, is can be good practice to supply them as key-value pairs anyway.

x <- c(-5, -3, -1, 1, 3, NA)
mean(x)
## [1] NA

Keyword arguments provide supplementary information that modify what the function does, and are often strings or booleans. They are supplied as key-value pairs in arbitrary order, and can be omitted if you are satisfied with a function’s default behavior.

x <- c(-5, -3, -1, 1, 3, NA)
mean(x, na.rm = TRUE)
## [1] -1

Function calls read “inside out”

abs(x): If x is positive, return x. If x is negative, return x without the negative sign.

mean(abs(x), na.rm = TRUE)
## [1] 2.6

Function calls read “inside out”

abs(x): If x is positive, return x. If x is negative, return x without the negative sign.

mean(abs(x), na.rm = TRUE)
## [1] 2.6

The pipe operator %>%

library(magrittr)
x %>% abs() %>% mean(na.rm = TRUE)
## [1] 2.6

This specific example is silly and only for demonstrative purposes. We’ll see actual (extremely) useful applications of this soon.

Why do we need to transform data?

We rarely get data in exactly the form we need!

Transforming data in R is made easy by the dplyr package (“official” cheat sheet available here).

The 5 basic dplyr verbs

Most of the operations that you’d ever want to perform on a single table can be expressed with the following functions:

Toy example: Student scores

library(tidyverse)
scores <- data.frame(Name = c("Maedhros", "Maglor", "Celegorm", "Caranthir", "Curufin", "Amrod", "Amras"),
                     Year = c("Sen", "Sen", "Jun", "Jun", "Sen", "Sen", "Jun"),
                     English = c(60, 66, 92, 80, 80, 58, 81),
                     Math = c(96, 55, 63, 76, 80, 52, 64),
                     Science = c(80, 56, 70, 89, 82, 79, 90),
                     History = c(56, 64, 62, 55, 48, 90, 71),
                     Spanish = c(77, 77, 98, 40, 50, 61, 72),
                     stringsAsFactors = FALSE)
scores
##        Name Year English Math Science History Spanish
## 1  Maedhros  Sen      60   96      80      56      77
## 2    Maglor  Sen      66   55      56      64      77
## 3  Celegorm  Jun      92   63      70      62      98
## 4 Caranthir  Jun      80   76      89      55      40
## 5   Curufin  Sen      80   80      82      48      50
## 6     Amrod  Sen      58   52      79      90      61
## 7     Amras  Jun      81   64      90      71      72

select: pick subset of variables/columns by name

History teacher: “I just want their names and History scores”

  1. Take the scores dataset.
  2. Select the Name and History columns.
scores %>%
    select(Name, History)
##        Name History
## 1  Maedhros      56
## 2    Maglor      64
## 3  Celegorm      62
## 4 Caranthir      55
## 5   Curufin      48
## 6     Amrod      90
## 7     Amras      71

mutate: create new columns based on old ones

Teacher: “What are their total scores?”

  1. Take the scores dataset.
  2. Add a new column by mutating existing columns: for each row, Total = English + Math + Science + History + Spanish
scores <- scores %>%
    mutate(Total = English + Math + Science + History + Spanish)

scores
##        Name Year English Math Science History Spanish Total
## 1  Maedhros  Sen      60   96      80      56      77   369
## 2    Maglor  Sen      66   55      56      64      77   318
## 3  Celegorm  Jun      92   63      70      62      98   385
## 4 Caranthir  Jun      80   76      89      55      40   340
## 5   Curufin  Sen      80   80      82      48      50   340
## 6     Amrod  Sen      58   52      79      90      61   340
## 7     Amras  Jun      81   64      90      71      72   378

See also: transmute

Similar to mutate, but creates a totally new data table with only the variables explicitly mentioned:

scores %>%
    transmute(
        Name = Name,
        Total = English + Math + Science + History + Spanish,
        Mean = Total/5
    )
##        Name Total Mean
## 1  Maedhros   369 73.8
## 2    Maglor   318 63.6
## 3  Celegorm   385 77.0
## 4 Caranthir   340 68.0
## 5   Curufin   340 68.0
## 6     Amrod   340 68.0
## 7     Amras   378 75.6

Note: We can use variables that we just created!

arrange: reorder rows

Teacher: “Can I have the students in order of overall performance?”

  1. Take the scores dataset.
  2. Arrange rows by the Total column.
scores %>%
    arrange(Total)
##        Name Year English Math Science History Spanish Total
## 1    Maglor  Sen      66   55      56      64      77   318
## 2 Caranthir  Jun      80   76      89      55      40   340
## 3   Curufin  Sen      80   80      82      48      50   340
## 4     Amrod  Sen      58   52      79      90      61   340
## 5  Maedhros  Sen      60   96      80      56      77   369
## 6     Amras  Jun      81   64      90      71      72   378
## 7  Celegorm  Jun      92   63      70      62      98   385

Teacher: “No no, better students on top please…”

scores %>%
    arrange(desc(Total))
##        Name Year English Math Science History Spanish Total
## 1  Celegorm  Jun      92   63      70      62      98   385
## 2     Amras  Jun      81   64      90      71      72   378
## 3  Maedhros  Sen      60   96      80      56      77   369
## 4 Caranthir  Jun      80   76      89      55      40   340
## 5   Curufin  Sen      80   80      82      48      50   340
## 6     Amrod  Sen      58   52      79      90      61   340
## 7    Maglor  Sen      66   55      56      64      77   318

Form teacher: “Can I have them in descending order of total scores, but if students tie, then by alphabetical order?”

scores %>%
    arrange(desc(Total), Name)
##        Name Year English Math Science History Spanish Total
## 1  Celegorm  Jun      92   63      70      62      98   385
## 2     Amras  Jun      81   64      90      71      72   378
## 3  Maedhros  Sen      60   96      80      56      77   369
## 4     Amrod  Sen      58   52      79      90      61   340
## 5 Caranthir  Jun      80   76      89      55      40   340
## 6   Curufin  Sen      80   80      82      48      50   340
## 7    Maglor  Sen      66   55      56      64      77   318

filter: pick observations by their values

History teacher: “I want to see which students scored less than 60 for history”

  1. Take the scores dataset.
  2. Filter for rows where the value in the History column is less than 60.
scores %>%
    filter(History < 60)
##        Name Year English Math Science History Spanish Total
## 1  Maedhros  Sen      60   96      80      56      77   369
## 2 Caranthir  Jun      80   76      89      55      40   340
## 3   Curufin  Sen      80   80      82      48      50   340

More on filters

Other ways to make comparisons:

Warning!

sqrt(2)^2 == 2
## [1] FALSE

Don’t use == to compare doubles! This is because computers have only finite space to store doubles, so tiny rounding errors crop up when doing arithmetic. Normally this isn’t a problem, but 2.0000000001 does not equal 2, so watch out!

Much better to use the near() function, which allows a small difference between values

near(sqrt(2)^2, 2)
## [1] TRUE

Combining comparisons:

More filter examples

Maglor’s parents: “I just want Maglor’s scores”

scores %>% 
    filter(Name == "Maglor")
##     Name Year English Math Science History Spanish Total
## 1 Maglor  Sen      66   55      56      64      77   318

Language teacher: “I want to know which students scores < 50 for either English or Spanish”

scores %>% 
    filter(English < 50 | Spanish < 50)
##        Name Year English Math Science History Spanish Total
## 1 Caranthir  Jun      80   76      89      55      40   340

summarize: get summaries of data

Academic: “I want to know the correlation between math and science scores as well as their means”

  1. Take the scores dataset.
  2. Summarize the dataset by taking the correlation of the Math and Science columns.
scores %>%
    summarize(corr = cor(Math, Science))
##        corr
## 1 0.4137445

Science teacher: “I want to know the mean and standard deviation of the scores for science”

  1. Take the scores dataset.
  2. Summarize the dataset by taking the mean of the Science column and the standard deviation of the science column.
scores %>%
    summarize(Science_mean = mean(Science), 
              Science_sd = sd(Science))
##   Science_mean Science_sd
## 1           78   11.78983

Chaining multiple dplyr commands using %>%

Science teacher: “I want to know which students scored > 80 for Science, but I just want names”

  1. Take the scores dataset.
  2. Filter for the rows which have Science > 80.
  3. Select the Name column.
scores %>%
    filter(Science > 80) %>%
    select(Name)
##        Name
## 1 Caranthir
## 2   Curufin
## 3     Amras

group_by: use dplyr verbs on a group-by-group basis

Academic: “I want to know if the seniors scored better than the juniors in Spanish”

  1. Take the scores dataset.
  2. Group the dataset rows by year.
  3. Summarize each group of the dataset by taking the mean of the Spanish column.
scores %>%
    group_by(Year) %>%
    summarize(Spanish_mean = mean(Spanish))
## # A tibble: 2 x 2
##   Year  Spanish_mean
##   <chr>        <dbl>
## 1 Jun           70  
## 2 Sen           66.2

Various versions of summarize

We can save time by summarizing multiple variables at once:

scores %>%
    group_by(Year) %>%
    summarize_all(mean)
## Warning in mean.default(Name): argument is not numeric or logical: returning NA

## Warning in mean.default(Name): argument is not numeric or logical: returning NA
## # A tibble: 2 x 8
##   Year   Name English  Math Science History Spanish Total
##   <chr> <dbl>   <dbl> <dbl>   <dbl>   <dbl>   <dbl> <dbl>
## 1 Jun      NA    84.3  67.7    83      62.7    70    368.
## 2 Sen      NA    66    70.8    74.2    64.5    66.2  342.

Hmm, taking the mean names doesn’t make sense, so we’d probably prefer something like

scores %>%
    group_by(Year) %>%
    summarize_if(is.numeric, mean)
## # A tibble: 2 x 7
##   Year  English  Math Science History Spanish Total
##   <chr>   <dbl> <dbl>   <dbl>   <dbl>   <dbl> <dbl>
## 1 Jun      84.3  67.7    83      62.7    70    368.
## 2 Sen      66    70.8    74.2    64.5    66.2  342.

We can also pick several columns explicitly:

scores %>%
    group_by(Year) %>%
    summarize_at(c("English", "History", "Spanish"), mean)
## # A tibble: 2 x 4
##   Year  English History Spanish
##   <chr>   <dbl>   <dbl>   <dbl>
## 1 Jun      84.3    62.7    70  
## 2 Sen      66      64.5    66.2

Today’s dataset: Flights from New York City

Installation:

install.packages("nycflights13")
library(nycflights13)
data(flights)

This loads the following table:

str(flights)
## tibble [336,776 × 19] (S3: tbl_df/tbl/data.frame)
##  $ year          : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month         : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
##  $ day           : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time      : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
##  $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
##  $ dep_delay     : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time      : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
##  $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
##  $ arr_delay     : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier       : chr [1:336776] "UA" "UA" "AA" "B6" ...
##  $ flight        : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ tailnum       : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
##  $ origin        : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
##  $ dest          : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time      : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
##  $ distance      : num [1:336776] 1400 1416 1089 1576 762 ...
##  $ hour          : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
##  $ minute        : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
##  $ time_hour     : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...