STATS 32 Session 2: R objects, variables and tables

Damian Pavlyshyn

Apr 8, 2021

http://web.stanford.edu/class/stats32/lectures/

Recap of session 1: Variables

Reminder!

Agenda for today

Basic R objects:

Vectors

vec <- c("a", "b", "c")
vec
## [1] "a" "b" "c"

Vectors

vec <- 1:100
vec
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
##  [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
##  [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
##  [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
##  [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
##  [91]  91  92  93  94  95  96  97  98  99 100

Vectors

R treats everything as a vector - even individual numbers are just length-one vectors. This is why we see [1] after every output of a single number:

1 + 1
## [1] 2

The “c” in c() stands for “concatenate” - what we think of as building a vector out of numbers, R considers to be concatenating a bunch of length-one vectors into a single vector.

This means that we can concatenate longer vectors in the same way:

v1 <- 1:3
v2 <- c(10, 20)
v3 <- c(-1, -2)

c(v1, v2, v3)
## [1]  1  2  3 10 20 -1 -2
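
A small sketch to check both claims above: a single number really is a length-one vector, and c() simply joins its inputs end to end. The names x and combined are introduced here for illustration only.

```r
x <- 42
is.vector(x)   # a single number is itself a vector
length(x)      # with exactly one element

v1 <- 1:3
v2 <- c(10, 20)
combined <- c(v1, v2)

# the length of the result is the sum of the input lengths
length(combined) == length(v1) + length(v2)

# c() flattens its inputs, so nesting calls makes no difference
identical(c(c(1, 2), 3), c(1, 2, 3))
```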

Vectors

even <- 1:50 * 2
even
##  [1]   2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34  36  38
## [20]  40  42  44  46  48  50  52  54  56  58  60  62  64  66  68  70  72  74  76
## [39]  78  80  82  84  86  88  90  92  94  96  98 100

How can we get the odd numbers from 1 to 100 from even?

odd <- even - 1
odd
##  [1]  1  3  5  7  9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49
## [26] 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93 95 97 99
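
Subtracting 1 from even works because arithmetic on a vector is applied to every element at once. As a rough cross-check, the same numbers can be built directly with seq(); odd_direct is a name made up for this sketch.

```r
even <- 1:50 * 2        # 2, 4, ..., 100
odd <- even - 1         # subtracting 1 acts on every element at once

# seq() builds the same sequence without going through `even`
odd_direct <- seq(1, 99, by = 2)

all(odd == odd_direct)  # TRUE if both constructions agree
```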

Vectors: Indexing

To extract a subset of elements by their indices, put a vector of indices in square brackets

Warning: Unlike many programming languages, R indexes the first element of a vector by 1 rather than 0!

even
##  [1]   2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34  36  38
## [20]  40  42  44  46  48  50  52  54  56  58  60  62  64  66  68  70  72  74  76
## [39]  78  80  82  84  86  88  90  92  94  96  98 100
even[1]
## [1] 2

even[3:7]
## [1]  6  8 10 12 14

even[c(3,5)]
## [1]  6 10

Note: even[3,5] passes two separate indices rather than a single vector of indices. R reads this as a row and a column, which a vector does not have, so it fails with an error!
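
To see the failure without stopping a script, one option is to wrap the call in tryCatch(). This is only a sketch; msg and picked are illustrative names.

```r
even <- 1:50 * 2

# even[3, 5] asks for row 3, column 5, but a vector has no dimensions,
# so R signals an error; tryCatch() captures the message instead of stopping
msg <- tryCatch(even[3, 5], error = function(e) conditionMessage(e))
msg

# wrapping the indices in c() gives a single index vector
picked <- even[c(3, 5)]
picked
```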

Vectors: Negative indexing

To extract all except a few indices, put a negative sign before the vector of indices

even
##  [1]   2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34  36  38
## [20]  40  42  44  46  48  50  52  54  56  58  60  62  64  66  68  70  72  74  76
## [39]  78  80  82  84  86  88  90  92  94  96  98 100
even[-c(1,2)]
##  [1]   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34  36  38  40  42
## [20]  44  46  48  50  52  54  56  58  60  62  64  66  68  70  72  74  76  78  80
## [39]  82  84  86  88  90  92  94  96  98 100
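
A short sketch of two details of negative indexing that are easy to trip over: the original vector is left untouched, and positive and negative indices cannot be combined in one call. The names without_last and mixed are made up for this example.

```r
even <- 1:50 * 2

# drop the last element: its position is computed with length()
without_last <- even[-length(even)]

# the original vector is unchanged; indexing returns a new vector
length(even)
length(without_last)

# positive and negative indices cannot be mixed in one call,
# so this signals an error, which tryCatch() turns into a string
mixed <- tryCatch(even[c(-1, 2)], error = function(e) "error")
mixed
```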

Vectors: Logical indexing

If we index using a vector of logical values (TRUE or FALSE), this will extract all elements of the original vector corresponding to the TRUE indices

even
##  [1]   2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34  36  38
## [20]  40  42  44  46  48  50  52  54  56  58  60  62  64  66  68  70  72  74  76
## [39]  78  80  82  84  86  88  90  92  94  96  98 100
even < 20
##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE
even[even < 20]
## [1]  2  4  6  8 10 12 14 16 18
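
Logical vectors can be combined and summarised before they are used as an index. The sketch below shows a few common patterns; middle, positions and n_small are illustrative names.

```r
even <- 1:50 * 2

# combine conditions element-wise with & (and) or | (or)
middle <- even[even > 10 & even < 20]
middle

# which() turns a logical vector into the positions of its TRUE values
positions <- which(even < 20)
positions

# sum() counts the TRUE values, since TRUE counts as 1 and FALSE as 0
n_small <- sum(even < 20)
n_small
```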

Vectors: Length

Use the length function to figure out how many elements there are in a vector

even
##  [1]   2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34  36  38
## [20]  40  42  44  46  48  50  52  54  56  58  60  62  64  66  68  70  72  74  76
## [39]  78  80  82  84  86  88  90  92  94  96  98 100
length(even)
## [1] 50

Vectors: Things to watch out for

What happens if we try to extract an invalid index?

odd[0]
## numeric(0)

No error thrown!!

(In fact, this returns an empty vector of the same type as the original vector, which can be useful information.)

odd[51]
## [1] NA

No error thrown!!

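(This makes less sense: R returns NA, its marker for a missing value, instead of complaining that the index is out of range.)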
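
Since neither case throws an error, it can help to check results explicitly. The sketch below inspects both return values and shows one possible guard; the guard and the variable names are assumptions for illustration, not a standard idiom.

```r
odd <- 1:50 * 2 - 1

# index 0 selects nothing: the result is an empty vector of the same type
empty <- odd[0]
length(empty)
typeof(empty)

# an index past the end gives NA, R's marker for a missing value
past_end <- odd[51]
is.na(past_end)

# a hypothetical guard: check the index against length() before using it
i <- 51
if (i >= 1 && i <= length(odd)) odd[i] else "index out of range"
```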

Vectors: Things to watch out for

If we try to build a vector from elements of different types, type coercion happens: R converts them all to a single common type.

c(1, 2, "a")
## [1] "1" "2" "a"

It’s not always obvious how R will decide to do the type coercion, so I don’t recommend relying on this (and besides, if you are trying to use the numbers 1 and 2, and the string “a” in a single vector, something has probably already gone wrong!)

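It’s almost always safe and sensible to coerce integers into floating point numbers, though. You may not even have realised that you are doing type coercion when typing something like the following (the L suffix marks a value as an integer; plain 1 and 2 are already stored as floating point numbers):

c(1L, 2L, 3.5)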
## [1] 1.0 2.0 3.5
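
When in doubt about what coercion has done, typeof() reports how a vector is stored. The sketch below walks up the usual ordering (logical, integer, double, character) and then converts back explicitly; back is an illustrative name.

```r
# typeof() reports how R stores a vector's elements
typeof(c(TRUE, 2L))       # logical + integer   -> "integer"
typeof(c(2L, 3.5))        # integer + double    -> "double"
typeof(c(3.5, "a"))       # double + character  -> "character"

# converting back is explicit; text that is not a number becomes NA
# (suppressWarnings() hides the warning R would otherwise print)
back <- suppressWarnings(as.numeric(c("1", "2", "a")))
back
```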

Matrices and arrays

Matrices are the two-dimensional analogs of vectors (arrays extend the same idea to more dimensions)

A <- matrix(1:12, nrow = 3)
A
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12

Indexing: put the rows you want before the comma, columns you want after the comma

A[1, 2]
## [1] 4

Matrices: Indexing example

A
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12

What does A[c(1,3), c(2,4)] return?

A[c(1,3), c(2,4)]
##      [,1] [,2]
## [1,]    4   10
## [2,]    6   12

To extract whole rows (or columns), we just leave the column (or row) specification blank.

A[c(1,3),]
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    3    6    9   12
A[,c(2,4)]
##      [,1] [,2]
## [1,]    4   10
## [2,]    5   11
## [3,]    6   12
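
A few related details, sketched under the assumption that A is the same 3-by-4 matrix as above: how the matrix is filled, how to ask for its dimensions, and what happens to the shape when a single row is extracted. B, row_vec and row_mat are illustrative names.

```r
A <- matrix(1:12, nrow = 3)   # filled column by column by default

dim(A)                        # number of rows, then number of columns

# byrow = TRUE fills across rows instead
B <- matrix(1:12, nrow = 3, byrow = TRUE)
B[1, 2]

# a single row comes back as a plain vector unless drop = FALSE is given
row_vec <- A[1, ]
row_mat <- A[1, , drop = FALSE]
dim(row_vec)                  # NULL: no dimensions left
dim(row_mat)
```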

Lists

List example

cars <- list(make = "Honda", 
             models = c("Fit", "CR-V", "Odyssey"), 
             available = c(TRUE, TRUE, TRUE))

Extracting parts of a list

Use [[ or $ notation to refer to a specific key-value pair

cars$make         # no quotation marks
## [1] "Honda"
cars[["models"]]  # remember quotation marks!
## [1] "Fit"     "CR-V"    "Odyssey"
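
One point worth a quick sketch is the difference between double and single brackets on a list: [[ returns the stored value itself, while [ returns a smaller list. The name second_model is introduced here for illustration.

```r
cars <- list(make = "Honda",
             models = c("Fit", "CR-V", "Odyssey"),
             available = c(TRUE, TRUE, TRUE))

names(cars)              # the keys of the list

# [[ ]] returns the stored value itself; [ ] returns a smaller list
class(cars[["models"]])  # "character"
class(cars["models"])    # "list"

# extraction can be chained with ordinary vector indexing
second_model <- cars$models[2]
second_model
```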

What is a data table?

The most general definition is simply a 2-dimensional array of data (ok, that’s not especially enlightening).

It is good practice to have a standard format for all data tables so that we can compare and combine them, and write software that works generically. In the format used here, each row is one observation and each column is one variable.

Example of a dataset

In this example, each row encodes the 2016 presidential election results of a single county.

The columns show the various quantities, or variables, that were measured in that county.

How R stores data tables

##   fips_cod             county total  dem   gop other  dem_prop  gop_prop
## 1    26041       Delta County 18467 6431 11112   924 0.3482428 0.6017220
## 2    48295    Lipscomb County  1322  135  1159    28 0.1021180 0.8767020
## 3    01127      Walker County 29243 4486 24208   549 0.1534042 0.8278220
## 4    48389      Reeves County  3184 1659  1417   108 0.5210427 0.4450377
## 5    56017 Hot Springs County  2535  400  1939   196 0.1577909 0.7648915
## 6    20043    Doniphan County  3366  584  2601   181 0.1734997 0.7727273
##   other_prop state state_name
## 1 0.05003520    MI   Michigan
## 2 0.02118003    TX      Texas
## 3 0.01877372    AL    Alabama
## 4 0.03391960    TX      Texas
## 5 0.07731755    WY    Wyoming
## 6 0.05377302    KS     Kansas
## # A tibble: 6 x 11
##   fips_cod county     total   dem   gop other dem_prop gop_prop other_prop state
##   <chr>    <chr>      <int> <int> <int> <int>    <dbl>    <dbl>      <dbl> <chr>
## 1 26041    Delta Cou~ 18467  6431 11112   924    0.348    0.602     0.0500 MI   
## 2 48295    Lipscomb ~  1322   135  1159    28    0.102    0.877     0.0212 TX   
## 3 01127    Walker Co~ 29243  4486 24208   549    0.153    0.828     0.0188 AL   
## 4 48389    Reeves Co~  3184  1659  1417   108    0.521    0.445     0.0339 TX   
## 5 56017    Hot Sprin~  2535   400  1939   196    0.158    0.765     0.0773 WY   
## 6 20043    Doniphan ~  3366   584  2601   181    0.173    0.773     0.0538 KS   
## # ... with 1 more variable: state_name <chr>

Structure of data frame

str(df)
## 'data.frame':    3112 obs. of  11 variables:
##  $ fips_cod  : chr  "26041" "48295" "01127" "48389" ...
##  $ county    : chr  "Delta County" "Lipscomb County" "Walker County" "Reeves County" ...
##  $ total     : int  18467 1322 29243 3184 2535 3366 510940 78264 24661 8171 ...
##  $ dem       : int  6431 135 4486 1659 400 584 298353 40967 3412 1093 ...
##  $ gop       : int  11112 1159 24208 1417 1939 2601 193607 35191 20655 6863 ...
##  $ other     : int  924 28 549 108 196 181 18980 2106 594 215 ...
##  $ dem_prop  : num  0.348 0.102 0.153 0.521 0.158 ...
##  $ gop_prop  : num  0.602 0.877 0.828 0.445 0.765 ...
##  $ other_prop: num  0.05 0.0212 0.0188 0.0339 0.0773 ...
##  $ state     : chr  "MI" "TX" "AL" "TX" ...
##  $ state_name: chr  "Michigan" "Texas" "Alabama" "Texas" ...

Data frames “under the hood”

Consider the following simple data frame that counts the total number of votes for the two major parties:

df
##   votes_dem votes_gop
## 1    486351     91189
## 2       318       211
## 3      5904     10239

Now let’s look at its structure:

str(df)
## 'data.frame':    3 obs. of  2 variables:
##  $ votes_dem: num  486351 318 5904
##  $ votes_gop: num  91189 211 10239
is.list(df)
## [1] TRUE
df$votes_dem
## [1] 486351    318   5904
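
The output above suggests that a data frame is a list whose elements are equal-length column vectors. A minimal sketch that rebuilds the small table by hand (values copied from the printout above; the total column is an addition for illustration):

```r
# values copied from the small table above
df <- data.frame(votes_dem = c(486351, 318, 5904),
                 votes_gop = c(91189, 211, 10239))

is.list(df)            # a data frame is a list of columns
length(df)             # so its length is the number of columns
df[["votes_gop"]]      # list-style extraction returns a column vector

# matrix-style indexing also works: row 2, column "votes_dem"
df[2, "votes_dem"]

# columns are vectors, so vector arithmetic builds new columns
df$total <- df$votes_dem + df$votes_gop
df$total
```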

Today’s datasets

install.packages("fueleconomy") # only needs to be run once
library(fueleconomy)
data(vehicles)

This loads a tibble called vehicles:

head(vehicles)
## # A tibble: 6 x 12
##      id make  model    year class   trans  drive     cyl displ fuel    hwy   cty
##   <dbl> <chr> <chr>   <dbl> <chr>   <chr>  <chr>   <dbl> <dbl> <chr> <dbl> <dbl>
## 1 13309 Acura 2.2CL/~  1997 Subcom~ Autom~ Front-~     4   2.2 Regu~    26    20
## 2 13310 Acura 2.2CL/~  1997 Subcom~ Manua~ Front-~     4   2.2 Regu~    28    22
## 3 13311 Acura 2.2CL/~  1997 Subcom~ Autom~ Front-~     6   3   Regu~    26    18
## 4 14038 Acura 2.3CL/~  1998 Subcom~ Autom~ Front-~     4   2.3 Regu~    27    19
## 5 14039 Acura 2.3CL/~  1998 Subcom~ Manua~ Front-~     4   2.3 Regu~    29    21
## 6 14040 Acura 2.3CL/~  1998 Subcom~ Autom~ Front-~     6   3   Regu~    26    17

We’ll load the dataset directly from the course website (in later lectures we’ll see how to load files from your hard drive).

library(readr) # provides read_csv(); also attached by library(tidyverse)
elections <- read_csv(
    "http://web.stanford.edu/class/stats32/assets/lecture-2/2016-presidential-election-county-results.csv",
    col_types = "cciiiidddcc"
)
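
The col_types string is a compact code with one letter per column: in readr, c stands for character, i for integer and d for double. The sketch below decodes the string in base R so the mapping is visible; codes, letters_in_spec and col_kinds are names made up for this example, and only the three letters used here are listed.

```r
spec <- "cciiiidddcc"

# readr's compact codes used in this call
codes <- c(c = "character", i = "integer", d = "double")

# split the string into single letters and look each one up
letters_in_spec <- strsplit(spec, "")[[1]]
col_kinds <- unname(codes[letters_in_spec])

length(col_kinds)   # one entry per column
col_kinds
```

Reading fips_cod as character rather than integer is what preserves the leading zero in codes such as 01127.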

This is the data set from earlier in the lecture:

head(elections)
## # A tibble: 6 x 11
##   fips_cod county     total   dem   gop other dem_prop gop_prop other_prop state
##   <chr>    <chr>      <int> <int> <int> <int>    <dbl>    <dbl>      <dbl> <chr>
## 1 26041    Delta Cou~ 18467  6431 11112   924    0.348    0.602     0.0500 MI   
## 2 48295    Lipscomb ~  1322   135  1159    28    0.102    0.877     0.0212 TX   
## 3 01127    Walker Co~ 29243  4486 24208   549    0.153    0.828     0.0188 AL   
## 4 48389    Reeves Co~  3184  1659  1417   108    0.521    0.445     0.0339 TX   
## 5 56017    Hot Sprin~  2535   400  1939   196    0.158    0.765     0.0773 WY   
## 6 20043    Doniphan ~  3366   584  2601   181    0.173    0.773     0.0538 KS   
## # ... with 1 more variable: state_name <chr>

fueleconomy: Package information on CRAN

https://cran.r-project.org/web/packages/fueleconomy/index.html

Optional material

Measures of central tendency

Measures of spread