STATS 32 Session 2: R objects, variables and tables

Damian Pavlyshyn

Apr 8, 2021

http://web.stanford.edu/class/stats32/lectures/

Recap of session 1: Variables

Types of variables in R
- Numeric (e.g. 1, -3.5, 200)
- Character/string (“abc”, “R”, “94305”)
- Boolean/logical (TRUE or FALSE)
- NAs
Variable assignment

Reminder!

Install R, RStudio and the relevant R packages!
Instructions on the course website.
Make sure you are using the latest versions!

Agenda for today

Basic R objects:

Vectors
Matrices/arrays
Lists
Tables: base R’s data frames and tidyverse’s tibbles

Vectors

One-dimensional array whose entries are the same type
Can be created using the c() function, or using the : shortcut

vec <- c("a", "b", "c")
vec

## [1] "a" "b" "c"

vec <- 1:100
vec

##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
##  [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
##  [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
##  [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
##  [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
##  [91]  91  92  93  94  95  96  97  98  99 100

Note: the bracketed number at the start of each output line is the index of the first vector entry printed on that line.

Vectors

R has no separate type for single values - even individual numbers are just length-one vectors. This is why we see [1] at the start of every output of a single number:

1 + 1

## [1] 2

The “c” in c() stands for “concatenate” - what we think of as building a vector out of numbers, R considers to be concatenating a bunch of length-one vectors into a single vector.

This means that we can concatenate longer vectors in the same way:

v1 <- 1:3
v2 <- c(10, 20)
v3 <- c(-1, -2)

c(v1, v2, v3)

## [1]  1  2  3 10 20 -1 -2

Vectors

We can manipulate vectors using arithmetic operations (just like numbers).
Arithmetic operations happen element-wise.

even <- 1:50 * 2
even

##  [1]   2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34  36  38
## [20]  40  42  44  46  48  50  52  54  56  58  60  62  64  66  68  70  72  74  76
## [39]  78  80  82  84  86  88  90  92  94  96  98 100

How can we get the odd numbers from 1 to 100 from even?

odd <- even - 1
odd

##  [1]  1  3  5  7  9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49
## [26] 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93 95 97 99
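Element-wise arithmetic also works between two vectors of the same length. The sketch below is illustrative only: the names a, b and odd_seq are not from the lecture, and seq() is a base R function used here as an alternative way to build the odd numbers directly.

```r
# Illustrative vectors (hypothetical names, not part of the lecture)
a <- c(1, 2, 3)
b <- c(10, 20, 30)

a + b   # adds matching entries: 11 22 33
a * b   # multiplies matching entries: 10 40 90

# Rebuild the lecture's vectors and compare with a direct construction
even <- 1:50 * 2
odd <- even - 1
odd_seq <- seq(1, 99, by = 2)   # 1, 3, 5, ..., 99
all(odd == odd_seq)             # TRUE
```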

Vectors: Indexing

To extract a subset of elements by their indices, put a vector of indices in square brackets

Warning: Unlike many programming languages, R indexes the first element of a vector by 1 rather than 0!

even

##  [1]   2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34  36  38
## [20]  40  42  44  46  48  50  52  54  56  58  60  62  64  66  68  70  72  74  76
## [39]  78  80  82  84  86  88  90  92  94  96  98 100

even[1]

## [1] 2

even[3:7]

## [1]  6  8 10 12 14

even[c(3,5)]

## [1]  6 10

Note: even[3,5] does not have a vector in the square brackets. R reads it as a row index and a column index, which a vector does not have, and so it will not work!
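A small sketch of how the index vector controls the result, and of what happens with the comma form. The name result is illustrative, and tryCatch() is base R's way of capturing an error instead of stopping.

```r
even <- 1:50 * 2

# Order and repetition in the index vector are respected
even[c(5, 3)]   # 10 6
even[c(1, 1)]   # 2 2

# even[3, 5] asks for row 3, column 5, but a vector has no dimensions,
# so R raises an error; tryCatch() captures the error message as text
result <- tryCatch(even[3, 5], error = function(e) conditionMessage(e))
result
```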

Vectors: Negative indexing

To extract all except a few indices, put a negative sign before the vector of indices

even

##  [1]   2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34  36  38
## [20]  40  42  44  46  48  50  52  54  56  58  60  62  64  66  68  70  72  74  76
## [39]  78  80  82  84  86  88  90  92  94  96  98 100

even[-c(1,2)]

##  [1]   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34  36  38  40  42
## [20]  44  46  48  50  52  54  56  58  60  62  64  66  68  70  72  74  76  78  80
## [39]  82  84  86  88  90  92  94  96  98 100
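One thing to be careful about with negative indexing is operator precedence: -1:2 means (-1):2, not -(1:2). A short sketch (the name mixed is illustrative):

```r
even <- 1:50 * 2

-1:2       # -1 0 1 2: the minus sign applies to the 1 only
-(1:2)     # -1 -2

# Parentheses give the intended "drop the first two entries" behaviour
identical(even[-(1:2)], even[-c(1, 2)])   # TRUE

# Mixing negative and positive indices is an error
mixed <- tryCatch(even[-1:2], error = function(e) "error")
mixed      # "error"
```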

Vectors: Logical indexing

If we index using a vector of logical values (TRUE or FALSE), this will extract all elements of the original vector corresponding to the TRUE indices

even

##  [1]   2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34  36  38
## [20]  40  42  44  46  48  50  52  54  56  58  60  62  64  66  68  70  72  74  76
## [39]  78  80  82  84  86  88  90  92  94  96  98 100

even < 20

##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE

even[even < 20]

## [1]  2  4  6  8 10 12 14 16 18
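Logical vectors can be combined and counted before they are used as an index. A minimal sketch using only base R functions:

```r
even <- 1:50 * 2

# Combine conditions element-wise with & (and) or | (or)
even[even > 10 & even < 20]   # 12 14 16 18

# TRUE counts as 1, so sum() counts how many entries satisfy a condition
sum(even < 20)                # 9

# which() converts a logical vector into the positions of its TRUE entries
which(even > 94)              # 48 49 50
```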

Vectors: Length

Use the length function to figure out how many elements there are in a vector

even

##  [1]   2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34  36  38
## [20]  40  42  44  46  48  50  52  54  56  58  60  62  64  66  68  70  72  74  76
## [39]  78  80  82  84  86  88  90  92  94  96  98 100

length(even)

## [1] 50
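Because indices start at 1, length() also gives the index of the last element. A brief sketch:

```r
even <- 1:50 * 2
n <- length(even)   # illustrative name for the vector's length

even[n]             # last element: 100
even[(n - 2):n]     # last three elements: 96 98 100
```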

Vectors: Things to watch out for

What happens if we try to extract an invalid index?

odd[0]

## numeric(0)

No error thrown!!

(In fact, this returns an empty vector of the same type as the original, which tells us the type of its elements and can be useful information.)

odd[51]

## [1] NA

No error thrown!!

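(This makes less sense: index 51 is past the end of the 50-element vector, yet R returns a missing value, NA, rather than an error.)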
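Since neither case raises an error, it can be worth checking indices yourself. The sketch below is one possible guard: safe_get is a hypothetical helper written for this example, not a base R function and not part of the lecture.

```r
odd <- 1:50 * 2 - 1

length(odd[0])     # 0: an empty vector, not an error
is.na(odd[51])     # TRUE: past the end gives NA, not an error

# Hypothetical helper: stop with an error if the index is out of range
safe_get <- function(v, i) {
  stopifnot(i >= 1, i <= length(v))
  v[i]
}

safe_get(odd, 50)  # 99
```

Calling safe_get(odd, 51) stops with an error instead of quietly returning NA.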

Vectors: Things to watch out for

If we try to create a vector from values of different types, type coercion happens.

c(1, 2, "a")

## [1] "1" "2" "a"

R coerces everything to the most flexible type present (logical, then integer, then floating point, then character), but the result is easy to misjudge, so I don’t recommend relying on this (and besides, if you are trying to use the numbers 1 and 2, and the string “a” in a single vector, something has probably already gone wrong!)

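It’s almost always safe and sensible to coerce integers into floating point numbers, though. You may not even realise that type coercion is happening when you combine an integer vector such as 1:2 with a decimal number. (R already stores a plain 1 or 2 as a floating point number, so c(1, 2, 3.5) involves no coercion at all.)

c(1:2, 3.5)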

## [1] 1.0 2.0 3.5
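When in doubt, typeof() reports the type a vector ended up with. A small sketch using base R only:

```r
typeof(c(1, 2, "a"))     # "character"
typeof(c(TRUE, 2L))      # "integer"  (2L is how an integer is written)
typeof(c(1L, 2.5))       # "double"   (floating point)
typeof(c(TRUE, "a"))     # "character"

# Going the other way must be requested explicitly, and can fail:
# as.numeric() gives NA (with a warning) for text that is not a number
suppressWarnings(as.numeric(c("1", "2", "a")))   # 1 2 NA
```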

Matrices and arrays

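Matrices are the two-dimensional analogs of vectors (arrays extend the idea to more dimensions). By default, matrix() fills in the entries column by column.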

A <- matrix(1:12, nrow = 3)
A

##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12

Indexing: put the rows you want before the comma, columns you want after the comma

A[1, 2]

## [1] 4
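The fill order matters when reading off entries. The sketch below compares the default with the byrow argument of matrix(); B is an illustrative name, not from the lecture.

```r
A <- matrix(1:12, nrow = 3)                 # filled column by column (default)
B <- matrix(1:12, nrow = 3, byrow = TRUE)   # filled row by row

dim(A)     # 3 4  (rows, columns)
A[1, 2]    # 4
B[1, 2]    # 2
```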

Matrices: Indexing example

##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12

What does A[c(1,3), c(2,4)] return?

A[c(1,3), c(2,4)]

##      [,1] [,2]
## [1,]    4   10
## [2,]    6   12

To extract whole rows (or columns), we just leave the column (or row) specification blank.

A[c(1,3),]

##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    3    6    9   12

A[,c(2,4)]

##      [,1] [,2]
## [1,]    4   10
## [2,]    5   11
## [3,]    6   12
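One detail to be aware of: extracting a single row or column returns a plain vector unless we ask otherwise. A sketch, with illustrative names row1 and row1_m, using the drop argument of the [ operator:

```r
A <- matrix(1:12, nrow = 3)

row1 <- A[1, ]                  # a plain vector: 1 4 7 10
is.matrix(row1)                 # FALSE

row1_m <- A[1, , drop = FALSE]  # still a matrix, with 1 row and 4 columns
dim(row1_m)                     # 1 4
```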

Lists

A collection of key-value pairs
Keys are character strings, values can be anything!
Values do not have to be of the same type for different keys
Created with the list() function

List example

cars <- list(make = "Honda", 
             models = c("Fit", "CR-V", "Odyssey"), 
             available = c(TRUE, TRUE, TRUE))

Extracting parts of a list

Use [[ or $ notation to extract the value stored under a specific key

cars$make         # no quotation marks

## [1] "Honda"

cars[["models"]]  # remember quotation marks!

## [1] "Fit"     "CR-V"    "Odyssey"
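A few more things we can do with the same list, as a rough sketch. The year entry is a made-up key added purely for illustration.

```r
cars <- list(make = "Honda",
             models = c("Fit", "CR-V", "Odyssey"),
             available = c(TRUE, TRUE, TRUE))

names(cars)            # "make" "models" "available"

cars[["models"]][2]    # "CR-V": extract the value, then index into it
class(cars["make"])    # "list": single brackets keep the list wrapper
class(cars[["make"]])  # "character"

cars$year <- 2021      # hypothetical new key-value pair
length(cars)           # 4
```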

What is a data table?

The most general definition is simply a 2-dimensional array of data (ok, that’s not especially enlightening).

It is good practice to have a standard format for all data tables so that we can compare and combine them, and write software that works generically. This standard format is:

Each row represents an individual or observation
Each column represents a variable
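As a rough illustration of this format, here is a tiny table built by hand with base R's data.frame(). The name small is illustrative; the values are copied from two rows of the election results table in these slides.

```r
# Two counties (rows) and four variables (columns)
small <- data.frame(
  county = c("Delta County", "Lipscomb County"),
  total  = c(18467, 1322),
  dem    = c(6431, 135),
  gop    = c(11112, 1159)
)

nrow(small)    # 2 observations (counties)
ncol(small)    # 4 variables
small[2, ]     # everything measured for the second county
small$dem      # one variable across all counties: 6431 135
```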

Example of a dataset

In this example, each row encodes the 2016 presidential election results of a single county.

The columns show the various quantities, or variables, that were measured in that county.

How R stores data tables

A data frame is R’s data structure for storing datasets
- First row: variable/covariate/feature names
- Each subsequent row represents one observation
- Each column contains the values of that variable across observations

##   fips_cod             county total  dem   gop other  dem_prop  gop_prop
## 1    26041       Delta County 18467 6431 11112   924 0.3482428 0.6017220
## 2    48295    Lipscomb County  1322  135  1159    28 0.1021180 0.8767020
## 3    01127      Walker County 29243 4486 24208   549 0.1534042 0.8278220
## 4    48389      Reeves County  3184 1659  1417   108 0.5210427 0.4450377
## 5    56017 Hot Springs County  2535  400  1939   196 0.1577909 0.7648915
## 6    20043    Doniphan County  3366  584  2601   181 0.1734997 0.7727273
##   other_prop state state_name
## 1 0.05003520    MI   Michigan
## 2 0.02118003    TX      Texas
## 3 0.01877372    AL    Alabama
## 4 0.03391960    TX      Texas
## 5 0.07731755    WY    Wyoming
## 6 0.05377302    KS     Kansas

Tibbles are the tidyverse’s implementation of data frames
- Each column’s type is shown explicitly when printed
- Faster to save and load

## # A tibble: 6 x 11
##   fips_cod county     total   dem   gop other dem_prop gop_prop other_prop state
##   <chr>    <chr>      <int> <int> <int> <int>    <dbl>    <dbl>      <dbl> <chr>
## 1 26041    Delta Cou~ 18467  6431 11112   924    0.348    0.602     0.0500 MI   
## 2 48295    Lipscomb ~  1322   135  1159    28    0.102    0.877     0.0212 TX   
## 3 01127    Walker Co~ 29243  4486 24208   549    0.153    0.828     0.0188 AL   
## 4 48389    Reeves Co~  3184  1659  1417   108    0.521    0.445     0.0339 TX   
## 5 56017    Hot Sprin~  2535   400  1939   196    0.158    0.765     0.0773 WY   
## 6 20043    Doniphan ~  3366   584  2601   181    0.173    0.773     0.0538 KS   
## # ... with 1 more variable: state_name <chr>

Structure of data frame

str(df)

## 'data.frame':    3112 obs. of  11 variables:
##  $ fips_cod  : chr  "26041" "48295" "01127" "48389" ...
##  $ county    : chr  "Delta County" "Lipscomb County" "Walker County" "Reeves County" ...
##  $ total     : int  18467 1322 29243 3184 2535 3366 510940 78264 24661 8171 ...
##  $ dem       : int  6431 135 4486 1659 400 584 298353 40967 3412 1093 ...
##  $ gop       : int  11112 1159 24208 1417 1939 2601 193607 35191 20655 6863 ...
##  $ other     : int  924 28 549 108 196 181 18980 2106 594 215 ...
##  $ dem_prop  : num  0.348 0.102 0.153 0.521 0.158 ...
##  $ gop_prop  : num  0.602 0.877 0.828 0.445 0.765 ...
##  $ other_prop: num  0.05 0.0212 0.0188 0.0339 0.0773 ...
##  $ state     : chr  "MI" "TX" "AL" "TX" ...
##  $ state_name: chr  "Michigan" "Texas" "Alabama" "Texas" ...

Data frames “under the hood”

To R, a data frame is simply a special type of list!
- Keys of the list are the variable/covariate names
- Values are vectors of the same length

Consider the following simple data frame that counts the total number of votes for the two major parties:

df

##   votes_dem votes_gop
## 1    486351     91189
## 2       318       211
## 3      5904     10239

Now let’s look at its structure:

str(df)

## 'data.frame':    3 obs. of  2 variables:
##  $ votes_dem: num  486351 318 5904
##  $ votes_gop: num  91189 211 10239

is.list(df)

## [1] TRUE

df$votes_dem

## [1] 486351    318   5904
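Because a data frame is a list of equal-length vectors, the list and vector tools from earlier carry over. A minimal sketch that rebuilds the small data frame above with data.frame():

```r
df <- data.frame(votes_dem = c(486351, 318, 5904),
                 votes_gop = c(91189, 211, 10239))

names(df)            # "votes_dem" "votes_gop": the list's keys
length(df)           # 2: one list entry per column
df[["votes_gop"]]    # 91189 211 10239
sapply(df, class)    # both columns are "numeric"

# Columns are vectors, so element-wise arithmetic works on them
df$votes_dem + df$votes_gop   # 577540 529 16143
```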

Today’s datasets

Fuel economy: no need to download anything, just run

install.packages("fueleconomy") # only needs to be run once
library(fueleconomy)
data(vehicles)

which loads a tibble called vehicles:

head(vehicles)

## # A tibble: 6 x 12
##      id make  model    year class   trans  drive     cyl displ fuel    hwy   cty
##   <dbl> <chr> <chr>   <dbl> <chr>   <chr>  <chr>   <dbl> <dbl> <chr> <dbl> <dbl>
## 1 13309 Acura 2.2CL/~  1997 Subcom~ Autom~ Front-~     4   2.2 Regu~    26    20
## 2 13310 Acura 2.2CL/~  1997 Subcom~ Manua~ Front-~     4   2.2 Regu~    28    22
## 3 13311 Acura 2.2CL/~  1997 Subcom~ Autom~ Front-~     6   3   Regu~    26    18
## 4 14038 Acura 2.3CL/~  1998 Subcom~ Autom~ Front-~     4   2.3 Regu~    27    19
## 5 14039 Acura 2.3CL/~  1998 Subcom~ Manua~ Front-~     4   2.3 Regu~    29    21
## 6 14040 Acura 2.3CL/~  1998 Subcom~ Autom~ Front-~     6   3   Regu~    26    17

County-level 2016 presidential election results

We’ll load the dataset directly from the course website (in later lectures we’ll see how to load files from your hard drive)

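library(readr)  # provides read_csv(); also loaded by library(tidyverse)

elections <- read_csv(
    "http://web.stanford.edu/class/stats32/assets/lecture-2/2016-presidential-election-county-results.csv",
    col_types = "cciiiidddcc"  # one letter per column: c = character, i = integer, d = double
)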

This is the data set from earlier in the lecture:

head(elections)

## # A tibble: 6 x 11
##   fips_cod county     total   dem   gop other dem_prop gop_prop other_prop state
##   <chr>    <chr>      <int> <int> <int> <int>    <dbl>    <dbl>      <dbl> <chr>
## 1 26041    Delta Cou~ 18467  6431 11112   924    0.348    0.602     0.0500 MI   
## 2 48295    Lipscomb ~  1322   135  1159    28    0.102    0.877     0.0212 TX   
## 3 01127    Walker Co~ 29243  4486 24208   549    0.153    0.828     0.0188 AL   
## 4 48389    Reeves Co~  3184  1659  1417   108    0.521    0.445     0.0339 TX   
## 5 56017    Hot Sprin~  2535   400  1939   196    0.158    0.765     0.0773 WY   
## 6 20043    Doniphan ~  3366   584  2601   181    0.173    0.773     0.0538 KS   
## # ... with 1 more variable: state_name <chr>

`fueleconomy`: Package information on CRAN

https://cran.r-project.org/web/packages/fueleconomy/index.html

Optional material

Measures of central tendency

Mean: sum of all values divided by the number of values
Mode: most commonly occurring value
$x$th percentile: value such that $x$% of the values fall below it
- Median: 50th percentile
- 1st quartile: 25th percentile
- 3rd quartile: 75th percentile
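These quantities can be computed in base R. The sketch below uses a made-up vector x. Note that quantile() interpolates between data points by default, so other software may give slightly different percentiles, and that base R has no built-in function for the statistical mode.

```r
x <- c(1, 2, 2, 3, 10)   # illustrative data, not from the lecture

mean(x)                      # 3.6
median(x)                    # 2
quantile(x, c(0.25, 0.75))   # 1st and 3rd quartiles: 2 and 3

# mode() in R reports how an object is stored, not the most common value,
# so count occurrences with table() and pick the most frequent one
counts <- table(x)
most_common <- as.numeric(names(counts)[which.max(counts)])
most_common                  # 2
```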

Measures of spread

Variance: average squared deviation from the mean
Standard deviation: square root of variance
Interquartile range: 3rd quartile - 1st quartile
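A sketch of the same quantities in base R, again on a made-up vector x. One caveat: R's var() and sd() divide by n - 1 rather than n, so they differ slightly from the plain average of squared deviations (called pop_var here, an illustrative name).

```r
x <- c(1, 2, 2, 3, 10)   # illustrative data, not from the lecture

pop_var <- mean((x - mean(x))^2)   # divides by n: 10.64
var(x)                             # divides by n - 1: 13.3
sd(x)                              # square root of var(x)
IQR(x)                             # 3rd quartile - 1st quartile: 1
```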
