Vectors

For data analysis, we often have to work with multiple values at the same time. There are a number of different R objects which allow us to do this.

The vector is a 1-dimensional array whose entries are all of the same type. For example, the following code produces a vector containing the letters “a”, “b” and “c”:

vec <- c("a", "b", "c")
vec
## [1] "a" "b" "c"

Typing out all the elements can be tedious. Sometimes there are shortcuts we can use. The following code assigns a vector of the numbers 1 to 100 to vec:

vec <- 1:100
vec
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
##  [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
##  [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
##  [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
##  [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
##  [91]  91  92  93  94  95  96  97  98  99 100

What if I only want even numbers from 1 to 100 (inclusive)? We can manipulate vectors using arithmetic operations (just like numbers). Note that arithmetic operations happen element-wise.

even <- 1:50 * 2
even
##  [1]   2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34  36  38
## [20]  40  42  44  46  48  50  52  54  56  58  60  62  64  66  68  70  72  74  76
## [39]  78  80  82  84  86  88  90  92  94  96  98 100

We can also get the odd numbers:

odd <- even - 1
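As an aside, base R's seq() function offers another route to the same vectors. The sketch below is an alternative to the arithmetic approach, not a replacement for it; the names even2 and odd2 are made up for this example.

```r
# seq(from, to, by) builds a regular sequence directly
even2 <- seq(2, 100, by = 2)
odd2 <- seq(1, 99, by = 2)

# Same values as the element-wise arithmetic versions
all.equal(even2, 1:50 * 2)
## [1] TRUE
all.equal(odd2, 1:50 * 2 - 1)
## [1] TRUE
```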

We can use the c() function to combine (“concatenate”) several small vectors into one large vector. How many elements does the vector z have?

z <- 1:5
z <- c(z, 3, z)

R allows us to access individual elements in a vector. Unlike many other programming languages, indexing begins at 1, not 0. For example, to return the first even number, I would use the following code:

even[1]
## [1] 2

We can get multiple elements of a vector as well. The following code extracts the 3rd to 7th even number (inclusive), and assigns it to the variable y:

y <- even[3:7]
y
## [1]  6  8 10 12 14

This extracts just the 3rd and 5th even numbers:

even[c(3,5)]
## [1]  6 10

What if I want all even numbers except the first two? I can use negative indexing to achieve my goal:

even[-c(1,2)]
##  [1]   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34  36  38  40  42
## [20]  44  46  48  50  52  54  56  58  60  62  64  66  68  70  72  74  76  78  80
## [39]  82  84  86  88  90  92  94  96  98 100

Use the length function to figure out how many elements there are in a vector. What happens if I try to extract an element at an index greater than the vector’s length?

length(odd)
## [1] 50
odd[51]
## [1] NA

One last note about vectors: the elements in a vector have to be of the same type. How do you think R gets the results for the code below?

c(1, 2, "a")
## [1] "1" "2" "a"
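The result above is an instance of coercion: when c() is given mixed types, R converts every element to the most flexible type present (logical, then numeric, then character). A small sketch, using class() to inspect what comes out:

```r
class(c(1, 2, "a"))    # the numbers become strings
## [1] "character"
class(c(TRUE, 2))      # TRUE becomes the number 1
## [1] "numeric"
c(TRUE, 2)
## [1] 1 2
```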

Matrices and arrays

Matrices are just the 2-dimensional analogs of vectors while arrays are the \(n\)-dimensional analogs of vectors. We won’t be talking about them a whole lot in this class. As with vectors, elements of matrices and arrays have to be of the same type.

Use the matrix() command to change a vector into a matrix:

A <- matrix(1:12, nrow = 3)
A
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
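Notice that matrix() fills the matrix column by column. If the values should run along the rows instead, the byrow argument can be set to TRUE. A short sketch (the name B is made up for this example):

```r
# byrow = TRUE fills row by row instead of column by column
B <- matrix(1:12, nrow = 3, byrow = TRUE)
B
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## [3,]    9   10   11   12
```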

To get the dimensions of the matrix, we can use the dim, nrow and ncol functions:

dim(A)
## [1] 3 4
nrow(A)
## [1] 3
ncol(A)
## [1] 4

To access the element in the ith row and jth column of the matrix A, use the index i,j:

A[1, 2]  # for the element in the 1st row and 2nd column
## [1] 4

What do you think A[2,] returns? How about A[,2]? And A[c(1, 3), c(2,4)]?

Lists

In all the data structures so far, the elements have to be of the same type. To have elements of different types in one data structure, we can use a list, which we create with list(). We can think of a list as a collection of key-value pairs. Keys should be strings.

person <- list(name = "John Doe", age = 26)
person
## $name
## [1] "John Doe"
## 
## $age
## [1] 26

The str function can be used to inspect what is inside person:

str(person)
## List of 2
##  $ name: chr "John Doe"
##  $ age : num 26

To access the name element of person, we have 2 options:

person[["name"]]
## [1] "John Doe"
person$name
## [1] "John Doe"
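Single brackets also work on lists, but they behave differently from the two options above. The sketch below recreates the same person list so that it runs on its own, and uses class() to show the difference:

```r
person <- list(name = "John Doe", age = 26)

# Double brackets (and $) pull out the element itself
class(person[["name"]])
## [1] "character"

# Single brackets return a shorter list that still wraps the element
class(person["name"])
## [1] "list"
```

In short, use [[ or $ when you want the value, and [ when you want a sub-list.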

The elements of a list can be anything, even another data structure! Let’s add the names of John’s children to the person object:

person$children <- c("Ross", "Robert")
str(person)
## List of 3
##  $ name    : chr "John Doe"
##  $ age     : num 26
##  $ children: chr [1:2] "Ross" "Robert"

To see the keys associated with a list, use the names() function:

names(person)
## [1] "name"     "age"      "children"

Data frames

Packages not only give us access to user-created functions, but also user-created datasets. In R, datasets are usually stored as data frames.

Let’s load the fueleconomy package (if you haven’t installed this package yet, run this command first: install.packages("fueleconomy")):

library(fueleconomy)

Load the vehicles dataset with the data function (to find out more about the vehicles dataset, key in ?vehicles):

data(vehicles)

An entry vehicles pops up in the Environment tab. We can see that the dataset has ~33,000 observations with 12 variables.

Let’s view the data with the View() function (note the capital V). (Alternatively, we can click on “vehicles” in the Environment tab.) A new tab pops up in the top-left pane displaying the data. Clicking on the column names allows us to sort the data.

(Note: Some of you might not be able to click on “vehicles” in the Environment tab right away. Don’t worry about it: typing View(vehicles) into the console will still work, and you should be able to click on “vehicles” after that.)

Seeing parts of the data

33,000 observations is a lot of observations to look through. Instead of looking through all of it, we can use various functions to give us a feel for the data.

Use the head and tail functions to display the first few or last few rows of the dataset. To control the number of lines shown (default is 6), use the optional n argument.

head(vehicles)
##      id  make       model year           class           trans
## 1 13309 Acura 2.2CL/3.0CL 1997 Subcompact Cars Automatic 4-spd
## 2 13310 Acura 2.2CL/3.0CL 1997 Subcompact Cars    Manual 5-spd
## 3 13311 Acura 2.2CL/3.0CL 1997 Subcompact Cars Automatic 4-spd
## 4 14038 Acura 2.3CL/3.0CL 1998 Subcompact Cars Automatic 4-spd
## 5 14039 Acura 2.3CL/3.0CL 1998 Subcompact Cars    Manual 5-spd
## 6 14040 Acura 2.3CL/3.0CL 1998 Subcompact Cars Automatic 4-spd
##               drive cyl displ    fuel hwy cty
## 1 Front-Wheel Drive   4   2.2 Regular  26  20
## 2 Front-Wheel Drive   4   2.2 Regular  28  22
## 3 Front-Wheel Drive   6   3.0 Regular  26  18
## 4 Front-Wheel Drive   4   2.3 Regular  27  19
## 5 Front-Wheel Drive   4   2.3 Regular  29  21
## 6 Front-Wheel Drive   6   3.0 Regular  26  17
tail(vehicles, n = 2)
##         id make       model year            class        trans
## 33441 5498 Yugo      GV/GVX 1989  Subcompact Cars Manual 5-spd
## 33442 1745 Yugo Gy/yugo GVX 1986 Minicompact Cars Manual 4-spd
##                   drive cyl displ    fuel hwy cty
## 33441 Front-Wheel Drive   4   1.3 Regular  28  23
## 33442 Front-Wheel Drive   4   1.1 Regular  29  22

Under the hood, data frames are implemented as lists, with each column being one element in the list. Hence, whatever we can do with lists, we can do with data frames. For example, we can get the data frame’s column names using names():

names(vehicles)
##  [1] "id"    "make"  "model" "year"  "class" "trans" "drive" "cyl"   "displ"
## [10] "fuel"  "hwy"   "cty"

To access a particular column, we can use the [[ or $ notation:

vehicles$class[1:10]
##  [1] "Subcompact Cars" "Subcompact Cars" "Subcompact Cars" "Subcompact Cars"
##  [5] "Subcompact Cars" "Subcompact Cars" "Subcompact Cars" "Subcompact Cars"
##  [9] "Subcompact Cars" "Compact Cars"
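To see that the different notations give the same column, here is a small sketch on a made-up data frame (toy and its values are invented for illustration and are not part of fueleconomy):

```r
# A tiny made-up data frame
toy <- data.frame(make = c("Acura", "Yugo", "Audi"),
                  hwy = c(26, 28, 30))

# All three expressions return the same column as a vector
toy$hwy
toy[["hwy"]]
toy[[2]]   # by position: hwy is the 2nd column

identical(toy$hwy, toy[["hwy"]])
## [1] TRUE
```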

Since the number of columns in a data frame is just the number of elements in a list, we can get the number of columns using length():

length(vehicles)
## [1] 12

We can also use the ncol() and nrow() functions to get the number of columns and rows of the data frame:

ncol(vehicles)
## [1] 12
nrow(vehicles)
## [1] 33442

Interestingly, data frames can act a little like matrices too. For example, we can use dim() to figure out the number of rows and columns in the data frame:

dim(vehicles)
## [1] 33442    12

To access the 30th row, we can type

vehicles[30, ]
##       id  make model year        class           trans             drive cyl
## 30 15872 Acura 3.5RL 2000 Midsize Cars Automatic 4-spd Front-Wheel Drive   6
##    displ    fuel hwy cty
## 30   3.5 Premium  22  16
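The same row, column indexing we used for matrices carries over more generally. The sketch below uses a made-up plain data frame called toy (an assumption for illustration; vehicles itself also carries the tbl_df class, which may behave slightly differently when the tibble package is loaded):

```r
toy <- data.frame(make = c("Acura", "Yugo", "Audi"),
                  hwy = c(26, 28, 30),
                  cty = c(20, 23, 22))

toy[2, ]                         # 2nd row, all columns
toy[, "hwy"]                     # the hwy column, as a vector
toy[c(1, 3), c("make", "cty")]   # rows 1 and 3, two of the columns
```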

Getting an overview of the data

For an overview of the entire data set, the str function we introduced last session is very handy. For each column, str tells us what type of variable it is, as well as the first couple of values for the column.

str(vehicles)
## Classes 'tbl_df', 'tbl' and 'data.frame':    33442 obs. of  12 variables:
##  $ id   : num  13309 13310 13311 14038 14039 ...
##  $ make : chr  "Acura" "Acura" "Acura" "Acura" ...
##  $ model: chr  "2.2CL/3.0CL" "2.2CL/3.0CL" "2.2CL/3.0CL" "2.3CL/3.0CL" ...
##  $ year : num  1997 1997 1997 1998 1998 ...
##  $ class: chr  "Subcompact Cars" "Subcompact Cars" "Subcompact Cars" "Subcompact Cars" ...
##  $ trans: chr  "Automatic 4-spd" "Manual 5-spd" "Automatic 4-spd" "Automatic 4-spd" ...
##  $ drive: chr  "Front-Wheel Drive" "Front-Wheel Drive" "Front-Wheel Drive" "Front-Wheel Drive" ...
##  $ cyl  : num  4 4 6 4 4 6 4 4 6 5 ...
##  $ displ: num  2.2 2.2 3 2.3 2.3 3 2.3 2.3 3 2.5 ...
##  $ fuel : chr  "Regular" "Regular" "Regular" "Regular" ...
##  $ hwy  : num  26 28 26 27 29 26 27 29 26 23 ...
##  $ cty  : num  20 22 18 19 21 17 20 21 17 18 ...

The summary function gives us some useful statistics for each variable:

summary(vehicles)
##        id            make              model                year     
##  Min.   :    1   Length:33442       Length:33442       Min.   :1984  
##  1st Qu.: 8361   Class :character   Class :character   1st Qu.:1991  
##  Median :16724   Mode  :character   Mode  :character   Median :1999  
##  Mean   :17038                                         Mean   :1999  
##  3rd Qu.:25265                                         3rd Qu.:2008  
##  Max.   :34932                                         Max.   :2015  
##                                                                      
##     class              trans              drive                cyl        
##  Length:33442       Length:33442       Length:33442       Min.   : 2.000  
##  Class :character   Class :character   Class :character   1st Qu.: 4.000  
##  Mode  :character   Mode  :character   Mode  :character   Median : 6.000  
##                                                           Mean   : 5.772  
##                                                           3rd Qu.: 6.000  
##                                                           Max.   :16.000  
##                                                           NA's   :58      
##      displ           fuel                hwy              cty        
##  Min.   :0.000   Length:33442       Min.   :  9.00   Min.   :  6.00  
##  1st Qu.:2.300   Class :character   1st Qu.: 19.00   1st Qu.: 15.00  
##  Median :3.000   Mode  :character   Median : 23.00   Median : 17.00  
##  Mean   :3.353                      Mean   : 23.55   Mean   : 17.49  
##  3rd Qu.:4.300                      3rd Qu.: 27.00   3rd Qu.: 20.00  
##  Max.   :8.400                      Max.   :109.00   Max.   :138.00  
##  NA's   :57

We can also do summaries on just one column:

summary(vehicles$hwy)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00   19.00   23.00   23.55   27.00  109.00

For just the mean or median, use the mean and median functions on the column of interest:

mean(vehicles$hwy)
## [1] 23.55128
median(vehicles$hwy)
## [1] 23

The sd() and var() functions compute the standard deviation and variance of a vector for us:

sd(vehicles$hwy)
## [1] 6.211417
var(vehicles$hwy)
## [1] 38.5817
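To make the relationship between the two functions concrete, the sketch below computes the sample variance by hand on a small made-up vector and compares it with var() and sd(). The names x and manual_var are invented for this example.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # made-up values

# Sample variance: squared deviations from the mean, divided by n - 1
manual_var <- sum((x - mean(x))^2) / (length(x) - 1)

all.equal(manual_var, var(x))
## [1] TRUE
# The standard deviation is the square root of the variance
all.equal(sqrt(manual_var), sd(x))
## [1] TRUE
```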

Note that the default types for the variables don’t always make sense. For example, does it make sense to take the mean of id numbers? To change the type of a column, use the as.x function (where x is the type you want to change to):

vehicles$id <- as.character(vehicles$id)
str(vehicles)
## Classes 'tbl_df', 'tbl' and 'data.frame':    33442 obs. of  12 variables:
##  $ id   : chr  "13309" "13310" "13311" "14038" ...
##  $ make : chr  "Acura" "Acura" "Acura" "Acura" ...
##  $ model: chr  "2.2CL/3.0CL" "2.2CL/3.0CL" "2.2CL/3.0CL" "2.3CL/3.0CL" ...
##  $ year : num  1997 1997 1997 1998 1998 ...
##  $ class: chr  "Subcompact Cars" "Subcompact Cars" "Subcompact Cars" "Subcompact Cars" ...
##  $ trans: chr  "Automatic 4-spd" "Manual 5-spd" "Automatic 4-spd" "Automatic 4-spd" ...
##  $ drive: chr  "Front-Wheel Drive" "Front-Wheel Drive" "Front-Wheel Drive" "Front-Wheel Drive" ...
##  $ cyl  : num  4 4 6 4 4 6 4 4 6 5 ...
##  $ displ: num  2.2 2.2 3 2.3 2.3 3 2.3 2.3 3 2.5 ...
##  $ fuel : chr  "Regular" "Regular" "Regular" "Regular" ...
##  $ hwy  : num  26 28 26 27 29 26 27 29 26 23 ...
##  $ cty  : num  20 22 18 19 21 17 20 21 17 18 ...
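A few members of the as.x family, shown on single values as a rough sketch of how conversion behaves (the variable bad is made up for this example):

```r
as.numeric("3.5")
## [1] 3.5
as.integer(3.9)        # truncates toward zero, does not round
## [1] 3
as.character(42)
## [1] "42"
as.logical("TRUE")
## [1] TRUE

# Values that cannot be converted become NA (R also issues a warning,
# which is silenced here with suppressWarnings())
bad <- suppressWarnings(as.numeric("abc"))
bad
## [1] NA
```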

Factors

Look at the output of summary(vehicles) again. Note that for all the character variables, summary() doesn’t give us any information on them. One way to get information on character variables is to use the table() function:

table(vehicles$drive)
## 
##              2-Wheel Drive              4-Wheel Drive 
##                        507                        699 
## 4-Wheel or All-Wheel Drive            All-Wheel Drive 
##                       6647                       1267 
##          Front-Wheel Drive    Part-time 4-Wheel Drive 
##                      12233                         96 
##           Rear-Wheel Drive 
##                      11993

Another way we can get information on character variables is by converting them to factors. Factors represent categorical variables: i.e. values fall into one of several categories (e.g. gender, age group). Categories can be unordered (e.g. gender, we call them nominal variables), or ordered (e.g. age group, we call them ordinal variables).

We can make a character variable into a factor variable by using factor(). Notice now that summary() gives more useful information. (By default, factor variables are nominal variables.)

vehicles$drive <- factor(vehicles$drive)
summary(vehicles$drive)
##              2-Wheel Drive              4-Wheel Drive 
##                        507                        699 
## 4-Wheel or All-Wheel Drive            All-Wheel Drive 
##                       6647                       1267 
##          Front-Wheel Drive    Part-time 4-Wheel Drive 
##                      12233                         96 
##           Rear-Wheel Drive 
##                      11993

Let’s look at the internal structure of the factor variable:

str(vehicles$drive)
##  Factor w/ 7 levels "2-Wheel Drive",..: 5 5 5 5 5 5 5 5 5 5 ...

Notice that the words (“2-Wheel Drive”, etc.) have been changed into numbers! That’s because R assigns each category a number. We can see this assignment by calling levels(), which shows us the “levels”, or categories, for this variable in order:

levels(vehicles$drive)
## [1] "2-Wheel Drive"              "4-Wheel Drive"             
## [3] "4-Wheel or All-Wheel Drive" "All-Wheel Drive"           
## [5] "Front-Wheel Drive"          "Part-time 4-Wheel Drive"   
## [7] "Rear-Wheel Drive"

So 2-Wheel Drives are labeled 1, and so on. By default, R assigns this internal labeling by alphabetical order. This internal labeling is usually not a concern to us. See the optional material section for more details.
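To see the internal numbers directly, one option is as.integer(). A minimal sketch on a made-up factor (sizes is invented for illustration):

```r
sizes <- factor(c("small", "large", "medium", "small"))   # made-up data

levels(sizes)          # alphabetical order
## [1] "large"  "medium" "small"
as.integer(sizes)      # the internal code for each element
## [1] 3 1 2 3
```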

Working with NAs

Let’s compute the mean number of cylinders in our dataset:

mean(vehicles$cyl)
## [1] NA

Hmm, we get an NA? What’s happening? If you look through the cyl column, you’ll find that some of the entries are NA. Look at the documentation for the mean function and you’ll see that there is an na.rm option, with default value FALSE. This means that by default, mean will not remove any NAs that it sees, and will return NA if any one of the elements is NA.

We can get the mean as follows:

mean(vehicles$cyl, na.rm = TRUE)
## [1] 5.771867

Working with NAs can be tricky sometimes because they don’t always show up. For example, the output of table doesn’t show you the NAs, which could mislead you into thinking that there are no NAs in the column:

table(vehicles$cyl)
## 
##     2     3     4     5     6     8    10    12    16 
##    45   182 12381   718 11885  7550   138   478     7

The summary function does tell us though if there are NAs in a column:

summary(vehicles$cyl)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   2.000   4.000   6.000   5.772   6.000  16.000      58

To test if something is an NA or not, use the is.na function.

is.na(NA)
## [1] TRUE
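is.na also works element-wise on vectors, which makes it handy for counting missing values. The sketch below uses a made-up vector x; it also shows the useNA argument of table(), which asks table() to report NAs:

```r
x <- c(4, NA, 6, 8, NA)   # made-up values

is.na(x)
## [1] FALSE  TRUE FALSE FALSE  TRUE
sum(is.na(x))             # TRUE counts as 1, so this counts the NAs
## [1] 2
mean(x, na.rm = TRUE)
## [1] 6
table(x, useNA = "ifany") # adds an NA column when NAs are present
```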

Optional material

Computing the mode of a column

R doesn’t have a built-in function to compute the mode. We can either write our own function (a number of people have done that; a Google search will turn up examples), or we can use some other functions which allow us to figure out what the mode is.

First, the table function tells us how many times each value appeared in the column:

table(vehicles$hwy)
## 
##    9   10   11   12   13   14   15   16   17   18   19   20   21   22   23   24 
##   13   66   62  275  295  453  847 1257 2094 1547 1605 2314 1400 2672 2383 2788 
##   25   26   27   28   29   30   31   32   33   34   35   36   37   38   39   40 
## 1944 2712 1558 1448 1371  846  799  528  515  358  313  205  125  106  125   79 
##   41   42   43   44   45   46   47   48   49   50   51   52   53   54   58   59 
##   56   46   20   52   55    9   10    8   14    2    4    7    1    3    4    2 
##   60   61   62   64   65   68   69   74   79   90   92   93   96   97   99  101 
##    1    1    2    3    2    2    2    3    2    3    2    4    2    2    6    2 
##  102  105  108  109 
##    1    3    2    1

To find out which number appeared most often, we have to visually scan the whole table. We could sort the table to help us:

sort(table(vehicles$hwy))
## 
##   53   60   61  102  109   50   59   62   65   68   69   79   92   96   97  101 
##    1    1    1    1    1    2    2    2    2    2    2    2    2    2    2    2 
##  108   54   64   74   90  105   51   58   93   99   52   48   46   47    9   49 
##    2    3    3    3    3    3    4    4    4    6    7    8    9   10   13   14 
##   43   42   44   45   41   11   10   40   38   37   39   36   12   13   35   34 
##   20   46   52   55   56   62   66   79  106  125  125  205  275  295  313  358 
##   14   33   32   31   30   15   16   29   21   28   18   27   19   25   17   20 
##  453  515  528  799  846  847 1257 1371 1400 1448 1547 1558 1605 1944 2094 2314 
##   23   22   26   24 
## 2383 2672 2712 2788

The mode is the last entry (24, appearing 2788 times). To have the mode appear in front, add a decreasing = TRUE argument to the function call:

sort(table(vehicles$hwy), decreasing = TRUE)
## 
##   24   26   22   23   20   17   25   19   27   18   28   21   29   16   15   30 
## 2788 2712 2672 2383 2314 2094 1944 1605 1558 1547 1448 1400 1371 1257  847  846 
##   31   32   33   14   34   35   13   12   36   37   39   38   40   10   11   41 
##  799  528  515  453  358  313  295  275  205  125  125  106   79   66   62   56 
##   45   44   42   43   49    9   47   46   48   52   99   51   58   93   54   64 
##   55   52   46   20   14   13   10    9    8    7    6    4    4    4    3    3 
##   74   90  105   50   59   62   65   68   69   79   92   96   97  101  108   53 
##    3    3    3    2    2    2    2    2    2    2    2    2    2    2    2    1 
##   60   61  102  109 
##    1    1    1    1
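The steps above can be wrapped into a small helper. This is only a sketch: stat_mode is a made-up name (base R has no such function), it returns the mode as a string because table() stores the values as names, and in case of a tie it returns only the first of the tied values.

```r
# stat_mode is a hypothetical helper, not part of base R
stat_mode <- function(x) {
  counts <- table(x)
  # which.max() gives the position of the first largest count
  names(counts)[which.max(counts)]
}

stat_mode(c(24, 26, 24, 22, 24, 26))
## [1] "24"
```

With the dataset loaded, stat_mode(vehicles$hwy) would be one way to apply it.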

More on factors

By default, when we make a variable a factor, R assigns an internal labeling by alphabetical order. This usually doesn’t concern us. One instance where we might want to have more control over the ordering is when we plot the data: for a bar plot, the category labeled 1 goes on the left-most end, followed by 2, etc.

barplot(table(vehicles$drive))

If we want to, we can set the order ourselves by specifying a levels argument. Let’s flip the labeling:

vehicles$drive <- factor(vehicles$drive, 
                         levels = sort(unique(vehicles$drive), decreasing = TRUE))
levels(vehicles$drive)
## [1] "Rear-Wheel Drive"           "Part-time 4-Wheel Drive"   
## [3] "Front-Wheel Drive"          "All-Wheel Drive"           
## [5] "4-Wheel or All-Wheel Drive" "4-Wheel Drive"             
## [7] "2-Wheel Drive"

Note how the barplot is now “flipped”:

barplot(table(vehicles$drive))

For ordinal variables, we need to add an ordered = TRUE argument to factor():

vehicles$drive <- as.character(vehicles$drive)
vehicles$drive <- factor(vehicles$drive, ordered = TRUE)
str(vehicles$drive)
##  Ord.factor w/ 7 levels "2-Wheel Drive"<..: 5 5 5 5 5 5 5 5 5 5 ...
levels(vehicles$drive)
## [1] "2-Wheel Drive"              "4-Wheel Drive"             
## [3] "4-Wheel or All-Wheel Drive" "All-Wheel Drive"           
## [5] "Front-Wheel Drive"          "Part-time 4-Wheel Drive"   
## [7] "Rear-Wheel Drive"
head(vehicles$drive)
## [1] Front-Wheel Drive Front-Wheel Drive Front-Wheel Drive Front-Wheel Drive
## [5] Front-Wheel Drive Front-Wheel Drive
## 7 Levels: 2-Wheel Drive < 4-Wheel Drive < ... < Rear-Wheel Drive
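In the example above the order is still alphabetical, which is not a meaningful ranking for drive types. For a genuinely ordinal variable we would normally pass the levels in their natural order as well. A sketch on made-up data (rating and its values are invented for illustration):

```r
# The levels argument fixes the order low < medium < high
rating <- factor(c("medium", "low", "high", "low"),
                 levels = c("low", "medium", "high"),
                 ordered = TRUE)

rating < "high"        # comparisons respect the level order
## [1]  TRUE  TRUE FALSE  TRUE
max(rating)
## [1] high
## Levels: low < medium < high
```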

Session info

This section is for documentation purposes: By displaying my session info, others who read this document will know what the system set-up was when I ran the commands above.

sessionInfo()
## R version 4.0.4 (2021-02-15)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19041)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] fueleconomy_1.0.0
## 
## loaded via a namespace (and not attached):
##  [1] compiler_4.0.4    magrittr_2.0.1    tools_4.0.4       htmltools_0.5.1.1
##  [5] yaml_2.2.1        stringi_1.5.3     rmarkdown_2.7     highr_0.8        
##  [9] knitr_1.31        stringr_1.4.0     xfun_0.22         digest_0.6.27    
## [13] rlang_0.4.10      evaluate_0.14