Loading files

Data tables are stored in files - generally .txt or .csv - as rows of values separated by commas (or some other character).

Warning: .xls files are not like this at all, and are much (and needlessly) more complicated.

Absolute file paths:

Unique addresses of files or folders on your system. When looking for a file, your machine starts at the “root directory” and follows the chain of folders until it reached the end.

On Unix-based systems like Mac OS or Linux, the root directory is called /, and an absolute file path looks like:

/home/damian/Documents/example.csv

Usually, the /home/damian part is simply abbreviated as ~.
On Windows, each drive is its own root directory (called things like C: and D:), and an absolute file path looks like:

C:\Users\damian\Documents\example.csv

Though Windows natively uses backslashes to separate folder names, R uses backslashes to indicate special characters, so you should replace them with regular slashes when typing them into R.

You can always load the same file on your machine by loading it from its absolute file path:

df <- read_csv("~/Documents/example.csv")

URLs are also a type of absolute file path, and R can load files directly from the internet too:

df <- read_csv("http://web.stanford.edu/class/stats32/assets/lecture-3/data/worldbank_data_tidy.csv")

Relative file paths

If a file path doesn’t start with the root directory, your machine will instead start looking in whatever folder the program you’re using is running in. This is called the working directory.

You can change the working directory using the setwd() function, then access files relative to that folder:

setwd("~/Documents/")
df <- read_csv("example.csv")

Alternatively, you set your working directory in RStudio by navigating to your desired folder in the file browser in the lower right of the screen, and then clicking More > Set As Working Directory.

Functions: R’s workhorse

A function is a named block of code which

Takes in 1 or more inputs from the user,
Performs a specific task, and
Returns an output to the user.

(Source: practicalli.github.io)

(Source: codehs.gitbooks.io)

We use functions in R all the time

We’ve already seen a number of functions in R! For example,

is.character("123")

## [1] TRUE

The function is.character takes the input given to it in the parentheses and returns TRUE or FALSE, depending on whether the input is of type character or not.

Others we’ve seen: str(), head(), sd(), ggplot(), is.list(), …

We can see what a function does by typing in ? followed by the function name in the R console.

?is.character

Function syntax

The most important syntax in R is the function call. All R syntax has function calls underlying it.

A function call consists of:

Function name
Parentheses, and
A list of arguments within the parentheses

function_name(<positional arguments>, <keyword arguments>)

Positional arguments are the “main input” of the function - often these are data tables or vectors. These are supplied without keys and in a specific order. They are essential for the function to work.

If there are multiple positional arguments, is can be good practice to supply them as key-value pairs anyway.

x <- c(-5, -3, -1, 1, 3, NA)
mean(x)

## [1] NA

Keyword arguments provide supplementary information that modify what the function does, and are often strings or booleans. They are supplied as key-value pairs in arbitrary order, and can be omitted if you are satisfied with a function’s default behavior.

x <- c(-5, -3, -1, 1, 3, NA)
mean(x, na.rm = TRUE)

## [1] -1

Function calls read “inside out”

abs(x): If x is positive, return x. If x is negative, return x without the negative sign.

mean(abs(x), na.rm = TRUE)

## [1] 2.6

Function calls read “inside out”

abs(x): If x is positive, return x. If x is negative, return x without the negative sign.

mean(abs(x), na.rm = TRUE)

## [1] 2.6

The pipe operator `%>%`

%>% is implemented by the magrittr package, which is loaded as part of the tidyverse
%>% is “syntactic sugar”: makes code easier to understand
Whatever is on the left of %>% becomes the first argument in the function on the right of %>%

library(magrittr)
x %>% abs() %>% mean(na.rm = TRUE)

## [1] 2.6

This specific example is silly and only for demonstrative purposes. We’ll see actual (extremely) useful applications of this soon.

Why do we need to transform data?

We rarely get data in exactly the form we need!

Create new variables or summaries
Rename or reorder variables
Move thing around to maintain the “one observation-one row” rule

Transforming data in R is made easy by the dplyr package (“official” cheat sheet available here).

The 5 basic `dplyr` verbs

Most of the operations that you’d ever want to perform on a single table can be expressed with the following functions:

select(): pick variables by their names
mutate(): create new variables based on existing ones
arrange(): reorder rows
filter(): pick observations by their values
summarize(): collapse many values down to a single summary

Toy example: Student scores

library(tidyverse)
scores <- data.frame(Name = c("Maedhros", "Maglor", "Celegorm", "Caranthir", "Curufin", "Amrod", "Amras"),
                     Year = c("Sen", "Sen", "Jun", "Jun", "Sen", "Sen", "Jun"),
                     English = c(60, 66, 92, 80, 80, 58, 81),
                     Math = c(96, 55, 63, 76, 80, 52, 64),
                     Science = c(80, 56, 70, 89, 82, 79, 90),
                     History = c(56, 64, 62, 55, 48, 90, 71),
                     Spanish = c(77, 77, 98, 40, 50, 61, 72),
                     stringsAsFactors = FALSE)

scores

##        Name Year English Math Science History Spanish
## 1  Maedhros  Sen      60   96      80      56      77
## 2    Maglor  Sen      66   55      56      64      77
## 3  Celegorm  Jun      92   63      70      62      98
## 4 Caranthir  Jun      80   76      89      55      40
## 5   Curufin  Sen      80   80      82      48      50
## 6     Amrod  Sen      58   52      79      90      61
## 7     Amras  Jun      81   64      90      71      72

`select`: pick subset of variables/columns by name

History teacher: “I just want their names and History scores”

Take the scores dataset.
Select the Name and History columns.

scores %>%
    select(Name, History)

##        Name History
## 1  Maedhros      56
## 2    Maglor      64
## 3  Celegorm      62
## 4 Caranthir      55
## 5   Curufin      48
## 6     Amrod      90
## 7     Amras      71

`mutate`: create new columns based on old ones

Teacher: “What are their total scores?”

Take the scores dataset.
Add a new column by mutating existing columns: for each row, Total = English + Math + Science + History + Spanish

scores <- scores %>%
    mutate(Total = English + Math + Science + History + Spanish)

scores

##        Name Year English Math Science History Spanish Total
## 1  Maedhros  Sen      60   96      80      56      77   369
## 2    Maglor  Sen      66   55      56      64      77   318
## 3  Celegorm  Jun      92   63      70      62      98   385
## 4 Caranthir  Jun      80   76      89      55      40   340
## 5   Curufin  Sen      80   80      82      48      50   340
## 6     Amrod  Sen      58   52      79      90      61   340
## 7     Amras  Jun      81   64      90      71      72   378

`arrange`: reorder rows

Teacher: “Can I have the students in order of overall performance?”

Take the scores dataset.
Arrange rows by the Total column.

scores %>%
    arrange(Total)

##        Name Year English Math Science History Spanish Total
## 1    Maglor  Sen      66   55      56      64      77   318
## 2 Caranthir  Jun      80   76      89      55      40   340
## 3   Curufin  Sen      80   80      82      48      50   340
## 4     Amrod  Sen      58   52      79      90      61   340
## 5  Maedhros  Sen      60   96      80      56      77   369
## 6     Amras  Jun      81   64      90      71      72   378
## 7  Celegorm  Jun      92   63      70      62      98   385

Teacher: “No no, better students on top please…”

scores %>%
    arrange(desc(Total))

##        Name Year English Math Science History Spanish Total
## 1  Celegorm  Jun      92   63      70      62      98   385
## 2     Amras  Jun      81   64      90      71      72   378
## 3  Maedhros  Sen      60   96      80      56      77   369
## 4 Caranthir  Jun      80   76      89      55      40   340
## 5   Curufin  Sen      80   80      82      48      50   340
## 6     Amrod  Sen      58   52      79      90      61   340
## 7    Maglor  Sen      66   55      56      64      77   318

Form teacher: “Can I have them in descending order of total scores, but if students tie, then by alphabetical order?”

scores %>%
    arrange(desc(Total), Name)

##        Name Year English Math Science History Spanish Total
## 1  Celegorm  Jun      92   63      70      62      98   385
## 2     Amras  Jun      81   64      90      71      72   378
## 3  Maedhros  Sen      60   96      80      56      77   369
## 4     Amrod  Sen      58   52      79      90      61   340
## 5 Caranthir  Jun      80   76      89      55      40   340
## 6   Curufin  Sen      80   80      82      48      50   340
## 7    Maglor  Sen      66   55      56      64      77   318

`filter`: pick observations by their values

History teacher: “I want to see which students scored less than 60 for history”

Take the scores dataset.
Filter for rows where the value in the History column is less than 60.

scores %>%
    filter(History < 60)

##        Name Year English Math Science History Spanish Total
## 1  Maedhros  Sen      60   96      80      56      77   369
## 2 Caranthir  Jun      80   76      89      55      40   340
## 3   Curufin  Sen      80   80      82      48      50   340

More on filters

Other ways to make comparisons:

> greater than
< less than
>= greater than or equal to
<= less than or equal to
!= not equal to
== equal to (Do not use = to test for equality!!)

Warning!

sqrt(2)^2 == 2

## [1] FALSE

Don’t use == to compare doubles! This is because computers have only finite space to store doubles, so tiny rounding errors crop up when doing arithmetic. Normally this isn’t a problem, but 2.0000000001 does not equal 2, so watch out!

Much better to use the near() function, which allows a small difference between values

near(sqrt(2)^2, 2)

## [1] TRUE

Combining comparisons:

! not (!TRUE == FALSE and !FALSE == TRUE)
& and (returns TRUE if both comparisons are TRUE)
| or (returns TRUE if either comparison is TRUE)

More `filter` examples

Maglor’s parents: “I just want Maglor’s scores”

scores %>% 
    filter(Name == "Maglor")

##     Name Year English Math Science History Spanish Total
## 1 Maglor  Sen      66   55      56      64      77   318

Language teacher: “I want to know which students scores < 50 for either English or Spanish”

scores %>% 
    filter(English < 50 | Spanish < 50)

##        Name Year English Math Science History Spanish Total
## 1 Caranthir  Jun      80   76      89      55      40   340

`summarize`: get summaries of data

Academic: “I want to know the correlation between math and science scores as well as their means”

Take the scores dataset.
Summarize the dataset by taking the correlation of the Math and Science columns.

scores %>%
    summarize(corr = cor(Math, Science))

##        corr
## 1 0.4137445

Science teacher: “I want to know the mean and standard deviation of the scores for science”

Take the scores dataset.
Summarize the dataset by taking the mean of the Science column and the standard deviation of the science column.

scores %>%
    summarize(Science_mean = mean(Science), 
              Science_sd = sd(Science))

##   Science_mean Science_sd
## 1           78   11.78983

Chaining multiple `dplyr` commands using `%>%`

Science teacher: “I want to know which students scored > 80 for Science, but I just want names”

Take the scores dataset.
Filter for the rows which have Science > 80.
Select the Name column.

scores %>%
    filter(Science > 80) %>%
    select(Name)

##        Name
## 1 Caranthir
## 2   Curufin
## 3     Amras

`group_by`: use `dplyr` verbs on a group-by-group basis

Academic: “I want to know if the seniors scored better than the juniors in Spanish”

Take the scores dataset.
Group the dataset rows by year.
Summarize each group of the dataset by taking the mean of the Spanish column.

scores %>%
    group_by(Year) %>%
    summarize(Spanish_mean = mean(Spanish))

## # A tibble: 2 x 2
##   Year  Spanish_mean
##   <chr>        <dbl>
## 1 Jun           70  
## 2 Sen           66.2

Various versions of `summarize`

We can save time by summarizing multiple variables at once:

scores %>%
    group_by(Year) %>%
    summarize_all(mean)

## Warning in mean.default(Name): argument is not numeric or logical: returning NA

## Warning in mean.default(Name): argument is not numeric or logical: returning NA

## # A tibble: 2 x 8
##   Year   Name English  Math Science History Spanish Total
##   <chr> <dbl>   <dbl> <dbl>   <dbl>   <dbl>   <dbl> <dbl>
## 1 Jun      NA    84.3  67.7    83      62.7    70    368.
## 2 Sen      NA    66    70.8    74.2    64.5    66.2  342.

Hmm, taking the mean names doesn’t make sense, so we’d probably prefer something like

scores %>%
    group_by(Year) %>%
    summarize_if(is.numeric, mean)

## # A tibble: 2 x 7
##   Year  English  Math Science History Spanish Total
##   <chr>   <dbl> <dbl>   <dbl>   <dbl>   <dbl> <dbl>
## 1 Jun      84.3  67.7    83      62.7    70    368.
## 2 Sen      66    70.8    74.2    64.5    66.2  342.

We can also pick several columns explicitly:

scores %>%
    group_by(Year) %>%
    summarize_at(c("English", "History", "Spanish"), mean)

## # A tibble: 2 x 4
##   Year  English History Spanish
##   <chr>   <dbl>   <dbl>   <dbl>
## 1 Jun      84.3    62.7    70  
## 2 Sen      66      64.5    66.2

Today’s dataset: Flights from New York City

Installation:

install.packages("nycflights13")
library(nycflights13)
data(flights)

This loads the following table:

str(flights)

## tibble [336,776 × 19] (S3: tbl_df/tbl/data.frame)
##  $ year          : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month         : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
##  $ day           : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time      : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
##  $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
##  $ dep_delay     : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time      : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
##  $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
##  $ arr_delay     : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier       : chr [1:336776] "UA" "UA" "AA" "B6" ...
##  $ flight        : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ tailnum       : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
##  $ origin        : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
##  $ dest          : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time      : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
##  $ distance      : num [1:336776] 1400 1416 1089 1576 762 ...
##  $ hour          : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
##  $ minute        : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
##  $ time_hour     : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...

STATS 32 Session 5: Data transformation and functions

Recap of sessions 3 and 4

Agenda for today

Loading files

Absolute file paths:

Relative file paths

Functions: R’s workhorse

We use functions in R all the time

Function syntax

Function calls read “inside out”

Function calls read “inside out”

The pipe operator `%>%`

Why do we need to transform data?

The 5 basic `dplyr` verbs

Toy example: Student scores

`select`: pick subset of variables/columns by name

`mutate`: create new columns based on old ones

See also: `transmute`

`arrange`: reorder rows

`filter`: pick observations by their values

More on filters

Combining comparisons:

More `filter` examples

`summarize`: get summaries of data

Chaining multiple `dplyr` commands using `%>%`

`group_by`: use `dplyr` verbs on a group-by-group basis

Various versions of `summarize`

Today’s dataset: Flights from New York City

STATS 32 Session 5: Data transformation and functions

Recap of sessions 3 and 4

Agenda for today

Loading files

Absolute file paths:

Relative file paths

Functions: R’s workhorse

We use functions in R all the time

Function syntax

Function calls read “inside out”

Function calls read “inside out”

The pipe operator %>%

Why do we need to transform data?

The 5 basic dplyr verbs

Toy example: Student scores

select: pick subset of variables/columns by name

mutate: create new columns based on old ones

See also: transmute

arrange: reorder rows

filter: pick observations by their values

More on filters

Combining comparisons:

More filter examples

summarize: get summaries of data

Chaining multiple dplyr commands using %>%

group_by: use dplyr verbs on a group-by-group basis

Various versions of summarize

Today’s dataset: Flights from New York City

The pipe operator `%>%`

The 5 basic `dplyr` verbs

`select`: pick subset of variables/columns by name

`mutate`: create new columns based on old ones

See also: `transmute`

`arrange`: reorder rows

`filter`: pick observations by their values

More `filter` examples

`summarize`: get summaries of data

Chaining multiple `dplyr` commands using `%>%`

`group_by`: use `dplyr` verbs on a group-by-group basis

Various versions of `summarize`