Damian Pavlyshyn
ggplot2
%>%
dplyr
Data tables are stored in files - generally .txt or .csv - as rows of values separated by commas (or some other character).
Warning: .xls files are not like this at all, and are much (and needlessly) more complicated.
File paths are unique addresses of files or folders on your system. When looking for a file, your machine starts at the “root directory” and follows the chain of folders until it reaches the end.
On Unix-based systems like Mac OS or Linux, the root directory is called /, and an absolute file path looks like:
/home/damian/Documents/example.csv
Usually, the /home/damian part is simply abbreviated as ~.
On Windows, each drive is its own root directory (called things like C: and D:), and an absolute file path looks like:
C:\Users\damian\Documents\example.csv
Though Windows natively uses backslashes to separate folder names, R uses backslashes to indicate special characters, so you should replace them with forward slashes when typing them into R (for example, C:/Users/damian/Documents/example.csv).
You can always load the same file on your machine by loading it from its absolute file path:
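A minimal sketch of loading from an absolute path. Since /home/damian/Documents/example.csv only exists on one particular machine, this assumes a small stand-in file written to a temporary folder; the names example_path and example_data and the file contents are illustrative.

```r
# Write a small stand-in CSV to a temporary folder (hypothetical data)
example_path <- file.path(tempdir(), "example.csv")
write.csv(data.frame(x = 1:3, y = c("a", "b", "c")), example_path, row.names = FALSE)

# normalizePath() expands the location into a full absolute path
example_path <- normalizePath(example_path, winslash = "/")

# Load the file from its absolute path
example_data <- read.csv(example_path)
example_data
```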
URLs are also a type of absolute file path, and R can load files directly from the internet too:
If a file path doesn’t start with the root directory, your machine will instead start looking in whatever folder the program you’re using is running in. This is called the working directory.
You can change the working directory using the setwd() function, then access files relative to that folder:
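A short sketch of that workflow, assuming a stand-in folder created under the temporary directory; project_dir, old_wd and relative_data are hypothetical names used only for illustration.

```r
# Create a stand-in folder and file (hypothetical names)
project_dir <- file.path(tempdir(), "my_project")
dir.create(project_dir, showWarnings = FALSE)
write.csv(data.frame(x = 1:3), file.path(project_dir, "example.csv"), row.names = FALSE)

old_wd <- setwd(project_dir)               # setwd() returns the previous working directory
getwd()                                    # now points at the stand-in folder
relative_data <- read.csv("example.csv")   # found relative to the working directory
setwd(old_wd)                              # restore the original working directory
```

Saving the old working directory and restoring it afterwards keeps the rest of the session unaffected.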
Alternatively, you can set your working directory in RStudio by navigating to your desired folder in the file browser in the lower right of the screen, and then clicking More > Set As Working Directory.
A function is a named block of code which takes some inputs and returns an output.
We’ve already seen a number of functions in R! For example,
## [1] TRUE
The function is.character takes the input given to it in the parentheses and returns TRUE or FALSE, depending on whether the input is of type character or not.
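As a small illustration, assuming two arbitrary inputs (a string and a number):

```r
is.character("hello")   # TRUE: a string is of type character
is.character(42)        # FALSE: a number is not
```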
Others we’ve seen: str(), head(), sd(), ggplot(), is.list(), …
We can see what a function does by typing in ? followed by the function name in the R console.
The most important syntax in R is the function call. All R syntax has function calls underlying it.
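A function call consists of the function’s name followed by parentheses containing its arguments, which may be positional or keyword arguments: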
Positional arguments are the “main input” of the function - often these are data tables or vectors. These are supplied without keys and in a specific order. They are essential for the function to work.
If there are multiple positional arguments, it can be good practice to supply them as key-value pairs anyway.
## [1] NA
Keyword arguments provide supplementary information that modify what the function does, and are often strings or booleans. They are supplied as key-value pairs in arbitrary order, and can be omitted if you are satisfied with a function’s default behavior.
## [1] -1
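A minimal sketch of the distinction, assuming a hypothetical vector x with a missing value (the values are illustrative, chosen so that the results match the outputs shown above). In mean(), x is the positional argument and na.rm is a keyword argument whose default is FALSE.

```r
x <- c(1, NA, -3)   # hypothetical data with one missing value

mean(x)                 # positional argument only; the NA propagates, so the result is NA
mean(x = x)             # the same call with the argument supplied as a key-value pair
mean(x, na.rm = TRUE)   # keyword argument drops the NA: (1 + -3) / 2 = -1
```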
abs(x): If x is positive, return x. If x is negative, return x without the negative sign.
## [1] 2.6
%>%
%>% is implemented by the magrittr package, which is loaded as part of the tidyverse.
%>% is “syntactic sugar”: it makes code easier to understand.
The value on the left of %>% becomes the first argument in the function on the right of %>%.
## [1] 2.6
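A sketch of one computation written both ways, assuming an illustrative input of -7 and the base functions abs(), sqrt() and round(). These are chosen because they reproduce the 2.6 shown above; they are not necessarily the original example.

```r
library(magrittr)   # provides %>%; the tidyverse packages make it available too

x <- -7             # illustrative input

# Nested calls have to be read inside-out
nested <- round(sqrt(abs(x)), digits = 1)

# With the pipe, each left-hand side becomes the first argument on the right
piped <- x %>% abs() %>% sqrt() %>% round(digits = 1)

nested   # 2.6
piped    # 2.6
```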
This specific example is silly and only for demonstrative purposes. We’ll see actual (extremely) useful applications of this soon.
We rarely get data in exactly the form we need!
Transforming data in R is made easy by the dplyr package (“official” cheat sheet available here).
dplyr verbs
Most of the operations that you’d ever want to perform on a single table can be expressed with the following functions:
select(): pick variables by their names
mutate(): create new variables based on existing ones
arrange(): reorder rows
filter(): pick observations by their values
summarize(): collapse many values down to a single summary
library(tidyverse)
scores <- data.frame(Name = c("Maedhros", "Maglor", "Celegorm", "Caranthir", "Curufin", "Amrod", "Amras"),
                     Year = c("Sen", "Sen", "Jun", "Jun", "Sen", "Sen", "Jun"),
                     English = c(60, 66, 92, 80, 80, 58, 81),
                     Math = c(96, 55, 63, 76, 80, 52, 64),
                     Science = c(80, 56, 70, 89, 82, 79, 90),
                     History = c(56, 64, 62, 55, 48, 90, 71),
                     Spanish = c(77, 77, 98, 40, 50, 61, 72),
                     stringsAsFactors = FALSE)
## Name Year English Math Science History Spanish
## 1 Maedhros Sen 60 96 80 56 77
## 2 Maglor Sen 66 55 56 64 77
## 3 Celegorm Jun 92 63 70 62 98
## 4 Caranthir Jun 80 76 89 55 40
## 5 Curufin Sen 80 80 82 48 50
## 6 Amrod Sen 58 52 79 90 61
## 7 Amras Jun 81 64 90 71 72
select: pick a subset of variables/columns by name
History teacher: “I just want their names and History scores”
Using the scores dataset:
## Name History
## 1 Maedhros 56
## 2 Maglor 64
## 3 Celegorm 62
## 4 Caranthir 55
## 5 Curufin 48
## 6 Amrod 90
## 7 Amras 71
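A call along the following lines would produce a table like the one above. This is a sketch that assumes a trimmed copy of scores with only three columns; history_scores is a hypothetical name.

```r
library(dplyr)

# Trimmed copy of the scores table (values taken from the full table)
scores <- data.frame(
  Name    = c("Maedhros", "Maglor", "Celegorm", "Caranthir", "Curufin", "Amrod", "Amras"),
  Year    = c("Sen", "Sen", "Jun", "Jun", "Sen", "Sen", "Jun"),
  History = c(56, 64, 62, 55, 48, 90, 71)
)

# Keep only the named columns, in the order given
history_scores <- select(scores, Name, History)
history_scores
```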
mutate: create new columns based on old ones
Teacher: “What are their total scores?”
Using the scores dataset:
## Name Year English Math Science History Spanish Total
## 1 Maedhros Sen 60 96 80 56 77 369
## 2 Maglor Sen 66 55 56 64 77 318
## 3 Celegorm Jun 92 63 70 62 98 385
## 4 Caranthir Jun 80 76 89 55 40 340
## 5 Curufin Sen 80 80 82 48 50 340
## 6 Amrod Sen 58 52 79 90 61 340
## 7 Amras Jun 81 64 90 71 72 378
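One way to compute such a Total column is sketched below. It assumes the scores table as defined earlier and overwrites scores with the extended table, since the later outputs also show a Total column.

```r
library(dplyr)

scores <- data.frame(
  Name    = c("Maedhros", "Maglor", "Celegorm", "Caranthir", "Curufin", "Amrod", "Amras"),
  Year    = c("Sen", "Sen", "Jun", "Jun", "Sen", "Sen", "Jun"),
  English = c(60, 66, 92, 80, 80, 58, 81),
  Math    = c(96, 55, 63, 76, 80, 52, 64),
  Science = c(80, 56, 70, 89, 82, 79, 90),
  History = c(56, 64, 62, 55, 48, 90, 71),
  Spanish = c(77, 77, 98, 40, 50, 61, 72)
)

# Add a new column computed row by row from the existing ones
scores <- mutate(scores, Total = English + Math + Science + History + Spanish)
scores
```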
transmute
Similar to mutate, but creates a totally new data table with only the variables explicitly mentioned:
scores %>%
  transmute(
    Name = Name,
    Total = English + Math + Science + History + Spanish,
    Mean = Total / 5
  )
## Name Total Mean
## 1 Maedhros 369 73.8
## 2 Maglor 318 63.6
## 3 Celegorm 385 77.0
## 4 Caranthir 340 68.0
## 5 Curufin 340 68.0
## 6 Amrod 340 68.0
## 7 Amras 378 75.6
Note: We can use variables that we just created!
arrange: reorder rows
Teacher: “Can I have the students in order of overall performance?”
Using the scores dataset:
## Name Year English Math Science History Spanish Total
## 1 Maglor Sen 66 55 56 64 77 318
## 2 Caranthir Jun 80 76 89 55 40 340
## 3 Curufin Sen 80 80 82 48 50 340
## 4 Amrod Sen 58 52 79 90 61 340
## 5 Maedhros Sen 60 96 80 56 77 369
## 6 Amras Jun 81 64 90 71 72 378
## 7 Celegorm Jun 92 63 70 62 98 385
Teacher: “No no, better students on top please…”
## Name Year English Math Science History Spanish Total
## 1 Celegorm Jun 92 63 70 62 98 385
## 2 Amras Jun 81 64 90 71 72 378
## 3 Maedhros Sen 60 96 80 56 77 369
## 4 Caranthir Jun 80 76 89 55 40 340
## 5 Curufin Sen 80 80 82 48 50 340
## 6 Amrod Sen 58 52 79 90 61 340
## 7 Maglor Sen 66 55 56 64 77 318
Form teacher: “Can I have them in descending order of total scores, but if students tie, then by alphabetical order?”
## Name Year English Math Science History Spanish Total
## 1 Celegorm Jun 92 63 70 62 98 385
## 2 Amras Jun 81 64 90 71 72 378
## 3 Maedhros Sen 60 96 80 56 77 369
## 4 Amrod Sen 58 52 79 90 61 340
## 5 Caranthir Jun 80 76 89 55 40 340
## 6 Curufin Sen 80 80 82 48 50 340
## 7 Maglor Sen 66 55 56 64 77 318
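The three orderings above can be sketched as follows, assuming a trimmed table holding only the names and totals (called totals here for illustration). desc() reverses the sort order of a column, and additional columns act as tie-breakers.

```r
library(dplyr)

# Names and totals taken from the tables above
totals <- data.frame(
  Name  = c("Maedhros", "Maglor", "Celegorm", "Caranthir", "Curufin", "Amrod", "Amras"),
  Total = c(369, 318, 385, 340, 340, 340, 378)
)

ascending  <- arrange(totals, Total)               # lowest total first
descending <- arrange(totals, desc(Total))         # best students on top
tie_broken <- arrange(totals, desc(Total), Name)   # ties in Total sorted by Name

tie_broken
```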
filter: pick observations by their values
History teacher: “I want to see which students scored less than 60 for history”
Using the scores dataset:
## Name Year English Math Science History Spanish Total
## 1 Maedhros Sen 60 96 80 56 77 369
## 2 Caranthir Jun 80 76 89 55 40 340
## 3 Curufin Sen 80 80 82 48 50 340
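A sketch of the corresponding call, assuming a trimmed copy of scores with only the Name and History columns; low_history is a hypothetical name.

```r
library(dplyr)   # dplyr's filter() masks stats::filter()

scores <- data.frame(
  Name    = c("Maedhros", "Maglor", "Celegorm", "Caranthir", "Curufin", "Amrod", "Amras"),
  History = c(56, 64, 62, 55, 48, 90, 71)
)

# Keep only the rows where the condition is TRUE
low_history <- filter(scores, History < 60)
low_history
```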
Other ways to make comparisons:
> greater than
< less than
>= greater than or equal to
<= less than or equal to
!= not equal to
== equal to (Do not use = to test for equality!!)
Warning!
## [1] FALSE
Don’t use == to compare doubles! This is because computers have only finite space to store doubles, so tiny rounding errors crop up when doing arithmetic. Normally this isn’t a problem, but 2.0000000001 does not equal 2, so watch out!
Much better to use the near() function, which allows a small difference between values
## [1] TRUE
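A common illustration of this pitfall, assuming the expression sqrt(2)^2 as the test case (it may differ from the original example):

```r
library(dplyr)   # near() comes from dplyr

sqrt(2)^2 == 2       # FALSE: the squared root carries a tiny rounding error
near(sqrt(2)^2, 2)   # TRUE: the values are equal within a small tolerance
```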
! not (!TRUE is FALSE and !FALSE is TRUE)
& and (returns TRUE if both comparisons are TRUE)
| or (returns TRUE if either comparison is TRUE)
filter examples
Maglor’s parents: “I just want Maglor’s scores”
## Name Year English Math Science History Spanish Total
## 1 Maglor Sen 66 55 56 64 77 318
Language teacher: “I want to know which students scored < 50 for either English or Spanish”
## Name Year English Math Science History Spanish Total
## 1 Caranthir Jun 80 76 89 55 40 340
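Both requests can be sketched with == and |, assuming a trimmed copy of scores with only the columns involved; maglor and low_language are hypothetical names.

```r
library(dplyr)

scores <- data.frame(
  Name    = c("Maedhros", "Maglor", "Celegorm", "Caranthir", "Curufin", "Amrod", "Amras"),
  English = c(60, 66, 92, 80, 80, 58, 81),
  Spanish = c(77, 77, 98, 40, 50, 61, 72)
)

# Exact match on a string uses ==, never =
maglor <- filter(scores, Name == "Maglor")

# | keeps rows where at least one of the comparisons is TRUE
low_language <- filter(scores, English < 50 | Spanish < 50)

maglor
low_language
```

Note that Curufin’s Spanish score of exactly 50 does not satisfy the strict < 50 condition.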
summarize: get summaries of data
Academic: “I want to know the correlation between math and science scores as well as their means”
Using the scores dataset:
## corr
## 1 0.4137445
Science teacher: “I want to know the mean and standard deviation of the scores for science”
Using the scores dataset:
## Science_mean Science_sd
## 1 78 11.78983
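The two summaries above can be sketched as follows, assuming a trimmed copy of scores with only the Math and Science columns. Each argument to summarize() names one output column; correlation and science_summary are hypothetical names.

```r
library(dplyr)

scores <- data.frame(
  Math    = c(96, 55, 63, 76, 80, 52, 64),
  Science = c(80, 56, 70, 89, 82, 79, 90)
)

# One row, one column: the correlation between the two subjects
correlation <- summarize(scores, corr = cor(Math, Science))

# One row, two columns: mean and standard deviation of Science
science_summary <- summarize(scores,
                             Science_mean = mean(Science),
                             Science_sd   = sd(Science))

correlation
science_summary
```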
dplyr commands using %>%
Science teacher: “I want to know which students scored > 80 for Science, but I just want names”
Using the scores dataset:
## Name
## 1 Caranthir
## 2 Curufin
## 3 Amras
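A sketch of such a chain, assuming a trimmed copy of scores with only Name and Science. The result of filter() is piped straight into select(), so no intermediate variable is needed; top_science is a hypothetical name.

```r
library(dplyr)

scores <- data.frame(
  Name    = c("Maedhros", "Maglor", "Celegorm", "Caranthir", "Curufin", "Amrod", "Amras"),
  Science = c(80, 56, 70, 89, 82, 79, 90)
)

top_science <- scores %>%
  filter(Science > 80) %>%   # keep the high scorers
  select(Name)               # then keep only their names

top_science
```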
group_by: use dplyr verbs on a group-by-group basis
Academic: “I want to know if the seniors scored better than the juniors in Spanish”
Using the scores dataset:
## # A tibble: 2 x 2
## Year Spanish_mean
## <chr> <dbl>
## 1 Jun 70
## 2 Sen 66.2
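A sketch of the grouped summary, assuming a trimmed copy of scores with only Year and Spanish. After group_by(), summarize() returns one row per group; spanish_by_year is a hypothetical name.

```r
library(dplyr)

scores <- data.frame(
  Year    = c("Sen", "Sen", "Jun", "Jun", "Sen", "Sen", "Jun"),
  Spanish = c(77, 77, 98, 40, 50, 61, 72)
)

spanish_by_year <- scores %>%
  group_by(Year) %>%                          # split the rows by Year
  summarize(Spanish_mean = mean(Spanish))     # one mean per group

spanish_by_year
```

The senior mean is 66.25, which the tibble printout above rounds to 66.2.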
summarize
We can save time by summarizing multiple variables at once:
## Warning in mean.default(Name): argument is not numeric or logical: returning NA
## Warning in mean.default(Name): argument is not numeric or logical: returning NA
## # A tibble: 2 x 8
## Year Name English Math Science History Spanish Total
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Jun NA 84.3 67.7 83 62.7 70 368.
## 2 Sen NA 66 70.8 74.2 64.5 66.2 342.
Hmm, taking the mean of the names doesn’t make sense, so we’d probably prefer something like
## # A tibble: 2 x 7
## Year English Math Science History Spanish Total
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Jun 84.3 67.7 83 62.7 70 368.
## 2 Sen 66 70.8 74.2 64.5 66.2 342.
We can also pick several columns explicitly:
## # A tibble: 2 x 4
## Year English History Spanish
## <chr> <dbl> <dbl> <dbl>
## 1 Jun 84.3 62.7 70
## 2 Sen 66 64.5 66.2
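One way to write multi-column summaries in current dplyr is across(), sketched below under the assumption that dplyr 1.0.0 or later is installed. The calls that produced the outputs above may have used the older scoped variants (summarize_all(), summarize_if(), summarize_at()) instead. The table is a trimmed copy of scores, and numeric_means and chosen_means are hypothetical names.

```r
library(dplyr)

scores <- data.frame(
  Name    = c("Maedhros", "Maglor", "Celegorm", "Caranthir", "Curufin", "Amrod", "Amras"),
  Year    = c("Sen", "Sen", "Jun", "Jun", "Sen", "Sen", "Jun"),
  English = c(60, 66, 92, 80, 80, 58, 81),
  History = c(56, 64, 62, 55, 48, 90, 71),
  Spanish = c(77, 77, 98, 40, 50, 61, 72)
)

# Mean of every numeric column, per year (Name is skipped because it is not numeric)
numeric_means <- scores %>%
  group_by(Year) %>%
  summarize(across(where(is.numeric), mean))

# Mean of explicitly chosen columns, per year
chosen_means <- scores %>%
  group_by(Year) %>%
  summarize(across(c(English, Spanish), mean))

numeric_means
chosen_means
```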
Installation:
This loads the following table:
## tibble [336,776 × 19] (S3: tbl_df/tbl/data.frame)
## $ year : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
## $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
## $ dep_delay : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
## $ arr_time : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
## $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
## $ arr_delay : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
## $ carrier : chr [1:336776] "UA" "UA" "AA" "B6" ...
## $ flight : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
## $ tailnum : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
## $ origin : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
## $ dest : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
## $ air_time : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
## $ distance : num [1:336776] 1400 1416 1089 1576 762 ...
## $ hour : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
## $ minute : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
## $ time_hour : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
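The structure printed above matches the flights table from the nycflights13 package. Assuming that is the package meant under “Installation” (the package name and the CRAN mirror URL below are assumptions, not stated in the text), a sketch of installing and inspecting it:

```r
# Install once if the package is missing (assumed package: nycflights13)
if (!requireNamespace("nycflights13", quietly = TRUE)) {
  install.packages("nycflights13", repos = "https://cloud.r-project.org")
}

library(nycflights13)

str(flights)   # prints the column names, types and first few values
```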