Damian Pavlyshyn
May 7, 2020
Topics that I have had repeated questions about and that are of general utility for the project and data processing in general.
Often, when we load a dataset, the variables will not be split neatly into columns, and instead, multiple pieces of information will be contained in a single column.
For example, in the following dataset, the name
columns contains first and last names, which we might want in separate columns:
df <- tibble(
name = c("Frodo Baggins", "Peregrin Took", "Aragorn", "Gandalf"),
height = c("4'", "4' 7\"", "6' 6\"", "5' 11\""),
father = c("Drogo", "Paladin", "Arathorn", NA)
)
## # A tibble: 4 x 3
## name height father
## <chr> <chr> <chr>
## 1 Frodo Baggins "4'" Drogo
## 2 Peregrin Took "4' 7\"" Paladin
## 3 Aragorn "6' 6\"" Arathorn
## 4 Gandalf "5' 11\"" <NA>
This is done with the separate()
function, which splits a string into discrete bits that are continuous chunks of letters and numbers:
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 2 rows [3, 4].
## # A tibble: 4 x 4
## firstname lastname height father
## <chr> <chr> <chr> <chr>
## 1 Frodo Baggins "4'" Drogo
## 2 Peregrin Took "4' 7\"" Paladin
## 3 Aragorn <NA> "6' 6\"" Arathorn
## 4 Gandalf <NA> "5' 11\"" <NA>
You can manually specify the sep=
parameter to only split along certain characters. This is useful, for example, if you want to spit at commas, but not at spaces
Often, columns contain multiple pieces of information that are not just next to each other, but encoded in a more complicated way. Consider the height
column in the following dataframe, which is formatted as feet and inches.
## # A tibble: 4 x 4
## firstname lastname height father
## <chr> <chr> <chr> <chr>
## 1 Frodo Baggins "4'" Drogo
## 2 Peregrin Took "4' 7\"" Paladin
## 3 Aragorn <NA> "6' 6\"" Arathorn
## 4 Gandalf <NA> "5' 11\"" <NA>
For more complicated specifications, we use a function called extract()
together with a format specification string called a “regular expression”. (It turns out that separate()
works here as well, but this works as a relatively simple example of extract()
)
Regular expressions are strings specify the format of a string, while keeping its content generic. This allows us to extract parts of a formatted string for later use.
Regular expressions are extremely useful in almost any domain of computation, so do yourself a favour and spend a few hours learning them on websites such as https://regexone.com/ and https://regexr.com/.
Without going into much detail, the regex string "([0-9]+)' ([0-9]+)"
specifies a format that is "<some numbers>' <some other numbers>"
, and the parentheses indicate that we should keep track of the numbers. Let’s see this in action:
## # A tibble: 4 x 5
## firstname lastname height_feet height_inches father
## <chr> <chr> <chr> <chr> <chr>
## 1 Frodo Baggins <NA> <NA> Drogo
## 2 Peregrin Took 4 7 Paladin
## 3 Aragorn <NA> 6 6 Arathorn
## 4 Gandalf <NA> 5 11 <NA>
This isn’t bad, but not exactly what we want. The resulting columns are characters, so we should supply the convert=TRUE
argument to extract()
to convert them to numbers.
Moreover, Frodo’s height has not been matched, since "4'"
doesn’t have the inches part. The modified regex string "([0-9]+)' ?([0-9]*)"
makes matching the second number optional - I strongly recommend that you learn enough about regex that you understand why!
df %>%
extract(height, c("height_feet", "height_inches"), regex = "([0-9]+)' ?([0-9]*)", convert=TRUE)
## # A tibble: 4 x 5
## firstname lastname height_feet height_inches father
## <chr> <chr> <int> <int> <chr>
## 1 Frodo Baggins 4 NA Drogo
## 2 Peregrin Took 4 7 Paladin
## 3 Aragorn <NA> 6 6 Arathorn
## 4 Gandalf <NA> 5 11 <NA>
Much better!
A zero value is better than NA
for Frodo’s height_inches
, and perhaps we want to convert the heights to centimetres. We can achieve both of these modifications with a single mutate()
:
df <- df %>%
extract(height, c("height_feet", "height_inches"), regex = "([0-9]+)' ?([0-9]*)", convert=TRUE) %>%
mutate(
height_inches = replace_na(height_inches, 0),
height_cm = (12 * height_feet + height_inches) * 2.54
)
## # A tibble: 4 x 6
## firstname lastname height_feet height_inches father height_cm
## <chr> <chr> <int> <dbl> <chr> <dbl>
## 1 Frodo Baggins 4 0 Drogo 122.
## 2 Peregrin Took 4 7 Paladin 140.
## 3 Aragorn <NA> 6 6 Arathorn 198.
## 4 Gandalf <NA> 5 11 <NA> 180.
If we can split columns into constituent parts, we should also be able to combine them together. We do this with the unite()
function.
Tolkien likes to refer to his characters in the patronymic form “Aragorn, son of Arathorn”, so let’s try to create this from our data table:
## # A tibble: 4 x 5
## patronymic lastname height_feet height_inches height_cm
## <chr> <chr> <int> <dbl> <dbl>
## 1 Frodo_Drogo Baggins 4 0 122.
## 2 Peregrin_Paladin Took 4 7 140.
## 3 Aragorn_Arathorn <NA> 6 6 198.
## 4 Gandalf_NA <NA> 5 11 180.
This is a good start, but there are several things wrong:
firstname
and lastname
is "_"
, whereas we would like it to be ", son of "
instead.unite()
removes the constituent columns, whereas we probably want to hold on to them.All of these problems can be easily addressed with keyword arguments passed to unite()
:
df <- df %>%
unite("patronymic", c("firstname", "father"),
sep = ", son of ",
na.rm = TRUE,
remove = FALSE)
## # A tibble: 4 x 7
## patronymic firstname lastname height_feet height_inches father height_cm
## <chr> <chr> <chr> <int> <dbl> <chr> <dbl>
## 1 Frodo, son of D… Frodo Baggins 4 0 Drogo 122.
## 2 Peregrin, son o… Peregrin Took 4 7 Palad… 140.
## 3 Aragorn, son of… Aragorn <NA> 6 6 Arath… 198.
## 4 Gandalf Gandalf <NA> 5 11 <NA> 180.
## # A tibble: 4 x 1
## patronymic
## <chr>
## 1 Frodo, son of Drogo
## 2 Peregrin, son of Paladin
## 3 Aragorn, son of Arathorn
## 4 Gandalf
This is exactly what we wanted!
stat_summary
aesthetics in ggplotThe usual geom_
objects of ggplot
place visuals on the canvas based on the variable values of each tabled row.
Often, we want to plot things computed from the data, (such as aggregated means) directly instead of going through the hassle of creating a new dataframe containing those values.
ggplot
objects starting with stat_
work by first summarising variables in some way, and then plotting the result.
Observe the following stat_smooth()
function in action:
ggplot(mpg, aes(x = manufacturer, y = hwy)) +
geom_point() +
stat_summary(fun = mean, geom="point", color="red")
Since manufacturer
is a factor, stat_summary()
has grouped the data by manufacturer and computed the mean efficiency by manufacturer.
Even though this mean doesn’t appear in any data table, it is still plotted to the canvas.
stat_summary()
takes the same sort of arguments as regular geom_
objects, including aesthetic mappings.
ggplot(mpg, aes(x = manufacturer, y = hwy)) +
geom_point(position="jitter", alpha=0.2) +
stat_summary(
fun = mean,
mapping = aes(color=factor(cyl)),
size = 3,
geom="point")
stat_summary()
understands all sorts of functions, and you can find such functions in-built as well as in external packages.
Here, we see a function that produces bootstrapped confidence intervals for the mean (if you don’t understand this, I recommend taking some applied stats courses!)
stat_smooth
: interpolation and lines of best fitOther common transformations for out data are interpolations. These allow us to fit a smooth curve to potentially messy and noisy data to summarise an overall trend.
The function for this is stat_smooth()
. By default, it fits a curve to our data:
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Often, a curve will overfit, and we prefer to fit only straight lines. This is done by supplying the method = "lm"
(lm stands for linear model) option to the stat_smooth()
function.
## `geom_smooth()` using formula 'y ~ x'
As we see in the above, stat_smooth()
inherits aesthetics from the ggplot()
call, so in this case, we have three different lines of best fit.
If that’s what we want, then great! Otherwise, we need to make sure not to pass stat_smooth()
a grouping aesthetic.
ggplot(mpg, aes(x = displ, cty)) +
geom_point(aes(color = factor(cyl))) +
stat_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
lubridate
and parsing datesWorking with dates and times is generally an enormous headache - different parts of the world have different formats, we have distinctions between 12 and 24-hour times and every dataset seemingly has a different way of recording this information.
In a dataset, the ONE TRUE FORMAT for recording dates is year-month-day, so that today’s date is 2020-05-07
. You should always record dates in this standard format and always immediately convert dates in any other format to this one.
Our main tool for this is a library called lubridate
:
A comprehensive specification of lubridate
’s functionality is found here: https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_lubridate.pdf, but here are some examples:
## [1] "2020-05-07"
## [1] "2020-05-07"
## [1] "2020-05-07"
## [1] "2020-05-06"
As you can see, this is very flexible, and saves many headaches!
For date and time data, the ONE TRUE FORMAT is “year-month-day hour:minute:second”, where the hour is recorded in 24-hour time.
For example, the final project is due on 2020-05-16 23:59:59
, with the hyphens, spaces and colons in exactly that order.
The lubridate
functions for parsing datetimes are analogous to those for dates:
## [1] "2020-05-16 23:59:00 UTC"
## [1] "2020-05-16 23:59:59 UTC"
## [1] "2020-05-06 17:04:55 PDT"
We can also parse arbitrarily terrible formats:
## [1] "2020-05-16 00:59:59 UTC"
The UTC
refers to “Coordinated Universal Time,” which is basically Greenwich Mean Time. To be precise, we should specify the timezone of out project due date:
## [1] "2020-05-17 07:59:59 PST"
Once we have our due date in the ONE TRUE FORMAT, we can extract all kinds of information about it:
## [1] 5
## [1] FALSE
## [1] TRUE
We can also do timezone conversions:
## [1] "2020-05-17 07:59:59 UTC"
## [1] "2020-05-17 17:59:59 AEST"
R records missing values of any type as the special value NA
. Be aware that there are other non-values, such as Inf
and NaN
. These have the double type, and occur as outputs to mathematical operations.
NA
sNA
s represent missing values - that is, data that we don’t know. Thus, any operation including an NA
will produce an NA
## [1] NA
## [1] NA
## [1] NA
Remember, equality tests if two data points have the same value. If one or both values are missing we can’t know whether or not that’s tru
## [1] NA
## [1] NA
To check if a value missing, instead use the is.na()
function:
## [1] FALSE FALSE FALSE TRUE FALSE FALSE TRUE
NA
s## # A tibble: 3 x 3
## index x y
## <dbl> <dbl> <chr>
## 1 1 11 "Yes"
## 2 2 NA "No"
## 3 3 13 ""
We can replace NA
s with a specific value or vise-versa with the replace_na()
and na_if()
functions respectively:
## # A tibble: 3 x 3
## index x y
## <dbl> <dbl> <chr>
## 1 1 11 Yes
## 2 2 0 No
## 3 3 13 <NA>
We can call this function on the whole dataframe:
## # A tibble: 3 x 3
## index x y
## <dbl> <dbl> <chr>
## 1 1 11 Yes
## 2 2 NA No
## 3 3 13 <NA>
We can remove rows with missing values using the drop_na()
function on the whole data table, or only on some columns:
## # A tibble: 1 x 3
## index x y
## <dbl> <dbl> <chr>
## 1 1 11 Yes
## # A tibble: 2 x 3
## index x y
## <dbl> <dbl> <chr>
## 1 1 11 Yes
## 2 2 NA No
ggplot
- made up of data, aesthetic mapping and geomsdplyr
verbs, grouping and pipinggeom_polygon
s and making mapsThanks for your attention and for taking stats32!