Splitting columns

Often, when we load a dataset, the variables will not be split neatly into columns, and instead, multiple pieces of information will be contained in a single column.

For example, in the following dataset, the name columns contains first and last names, which we might want in separate columns:

df <- tibble(
    name = c("Frodo Baggins", "Peregrin Took", "Aragorn", "Gandalf"),
    height = c("4'", "4' 7\"", "6' 6\"", "5' 11\""),
    father = c("Drogo", "Paladin", "Arathorn", NA)
)

## # A tibble: 4 x 3
##   name          height    father  
##   <chr>         <chr>     <chr>   
## 1 Frodo Baggins "4'"      Drogo   
## 2 Peregrin Took "4' 7\""  Paladin 
## 3 Aragorn       "6' 6\""  Arathorn
## 4 Gandalf       "5' 11\"" <NA>

This is done with the separate() function, which splits a string into discrete bits that are continuous chunks of letters and numbers:

df <- df %>%
    separate(name, c("firstname", "lastname"))

## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 2 rows [3, 4].

## # A tibble: 4 x 4
##   firstname lastname height    father  
##   <chr>     <chr>    <chr>     <chr>   
## 1 Frodo     Baggins  "4'"      Drogo   
## 2 Peregrin  Took     "4' 7\""  Paladin 
## 3 Aragorn   <NA>     "6' 6\""  Arathorn
## 4 Gandalf   <NA>     "5' 11\"" <NA>

You can manually specify the sep= parameter to only split along certain characters. This is useful, for example, if you want to spit at commas, but not at spaces

Advanced column splitting with regular expressions

Often, columns contain multiple pieces of information that are not just next to each other, but encoded in a more complicated way. Consider the height column in the following dataframe, which is formatted as feet and inches.

## # A tibble: 4 x 4
##   firstname lastname height    father  
##   <chr>     <chr>    <chr>     <chr>   
## 1 Frodo     Baggins  "4'"      Drogo   
## 2 Peregrin  Took     "4' 7\""  Paladin 
## 3 Aragorn   <NA>     "6' 6\""  Arathorn
## 4 Gandalf   <NA>     "5' 11\"" <NA>

For more complicated specifications, we use a function called extract() together with a format specification string called a “regular expression”. (It turns out that separate() works here as well, but this works as a relatively simple example of extract())

Regular expressions are strings specify the format of a string, while keeping its content generic. This allows us to extract parts of a formatted string for later use.

Regular expressions are extremely useful in almost any domain of computation, so do yourself a favour and spend a few hours learning them on websites such as https://regexone.com/ and https://regexr.com/.

Without going into much detail, the regex string "([0-9]+)' ([0-9]+)" specifies a format that is "<some numbers>' <some other numbers>", and the parentheses indicate that we should keep track of the numbers. Let’s see this in action:

df %>%
    extract(height, c("height_feet", "height_inches"), regex = "([0-9]+)' ([0-9]+)")

## # A tibble: 4 x 5
##   firstname lastname height_feet height_inches father  
##   <chr>     <chr>    <chr>       <chr>         <chr>   
## 1 Frodo     Baggins  <NA>        <NA>          Drogo   
## 2 Peregrin  Took     4           7             Paladin 
## 3 Aragorn   <NA>     6           6             Arathorn
## 4 Gandalf   <NA>     5           11            <NA>

This isn’t bad, but not exactly what we want. The resulting columns are characters, so we should supply the convert=TRUE argument to extract() to convert them to numbers.

Moreover, Frodo’s height has not been matched, since "4'" doesn’t have the inches part. The modified regex string "([0-9]+)' ?([0-9]*)" makes matching the second number optional - I strongly recommend that you learn enough about regex that you understand why!

df %>%
    extract(height, c("height_feet", "height_inches"), regex = "([0-9]+)' ?([0-9]*)", convert=TRUE)

## # A tibble: 4 x 5
##   firstname lastname height_feet height_inches father  
##   <chr>     <chr>          <int>         <int> <chr>   
## 1 Frodo     Baggins            4            NA Drogo   
## 2 Peregrin  Took               4             7 Paladin 
## 3 Aragorn   <NA>               6             6 Arathorn
## 4 Gandalf   <NA>               5            11 <NA>

Much better!

A zero value is better than NA for Frodo’s height_inches, and perhaps we want to convert the heights to centimetres. We can achieve both of these modifications with a single mutate():

df <- df %>%
    extract(height, c("height_feet", "height_inches"), regex = "([0-9]+)' ?([0-9]*)", convert=TRUE) %>%
    mutate(
        height_inches = replace_na(height_inches, 0),
        height_cm = (12 * height_feet + height_inches) * 2.54
    )

## # A tibble: 4 x 6
##   firstname lastname height_feet height_inches father   height_cm
##   <chr>     <chr>          <int>         <dbl> <chr>        <dbl>
## 1 Frodo     Baggins            4             0 Drogo         122.
## 2 Peregrin  Took               4             7 Paladin       140.
## 3 Aragorn   <NA>               6             6 Arathorn      198.
## 4 Gandalf   <NA>               5            11 <NA>          180.

Combining columns

If we can split columns into constituent parts, we should also be able to combine them together. We do this with the unite() function.

Tolkien likes to refer to his characters in the patronymic form “Aragorn, son of Arathorn”, so let’s try to create this from our data table:

df %>%
    unite("patronymic", c("firstname", "father"))

## # A tibble: 4 x 5
##   patronymic       lastname height_feet height_inches height_cm
##   <chr>            <chr>          <int>         <dbl>     <dbl>
## 1 Frodo_Drogo      Baggins            4             0      122.
## 2 Peregrin_Paladin Took               4             7      140.
## 3 Aragorn_Arathorn <NA>               6             6      198.
## 4 Gandalf_NA       <NA>               5            11      180.

This is a good start, but there are several things wrong:

The separator between firstname and lastname is "_", whereas we would like it to be ", son of " instead.
Being an immortal being, Gandalf has no father, so we don’t want to add anything on to his name.
By default, unite() removes the constituent columns, whereas we probably want to hold on to them.

All of these problems can be easily addressed with keyword arguments passed to unite():

df <- df %>%
    unite("patronymic", c("firstname", "father"),
          sep = ", son of ",
          na.rm = TRUE,
          remove = FALSE)

## # A tibble: 4 x 7
##   patronymic       firstname lastname height_feet height_inches father height_cm
##   <chr>            <chr>     <chr>          <int>         <dbl> <chr>      <dbl>
## 1 Frodo, son of D… Frodo     Baggins            4             0 Drogo       122.
## 2 Peregrin, son o… Peregrin  Took               4             7 Palad…      140.
## 3 Aragorn, son of… Aragorn   <NA>               6             6 Arath…      198.
## 4 Gandalf          Gandalf   <NA>               5            11 <NA>        180.

df %>% select(patronymic)

## # A tibble: 4 x 1
##   patronymic              
##   <chr>                   
## 1 Frodo, son of Drogo     
## 2 Peregrin, son of Paladin
## 3 Aragorn, son of Arathorn
## 4 Gandalf

This is exactly what we wanted!

`stat_summary` aesthetics in ggplot

The usual geom_ objects of ggplot place visuals on the canvas based on the variable values of each tabled row.

Often, we want to plot things computed from the data, (such as aggregated means) directly instead of going through the hassle of creating a new dataframe containing those values.

ggplot objects starting with stat_ work by first summarising variables in some way, and then plotting the result.

Observe the following stat_smooth() function in action:

ggplot(mpg, aes(x = manufacturer, y = hwy)) +
    geom_point() +
    stat_summary(fun = mean, geom="point", color="red")

Since manufacturer is a factor, stat_summary() has grouped the data by manufacturer and computed the mean efficiency by manufacturer.

Even though this mean doesn’t appear in any data table, it is still plotted to the canvas.

stat_summary() takes the same sort of arguments as regular geom_ objects, including aesthetic mappings.

ggplot(mpg, aes(x = manufacturer, y = hwy)) +
    geom_point(position="jitter", alpha=0.2) +
    stat_summary(
        fun = mean,
        mapping = aes(color=factor(cyl)),
        size = 3,
        geom="point")

stat_summary() understands all sorts of functions, and you can find such functions in-built as well as in external packages.

Here, we see a function that produces bootstrapped confidence intervals for the mean (if you don’t understand this, I recommend taking some applied stats courses!)

library(Hmisc)
ggplot(mpg, aes(x = manufacturer, y = hwy)) +
    stat_summary(fun = mean, geom="bar", fill="white", color="black") +
    stat_summary(fun.data = "mean_cl_boot", color="red")

`stat_smooth`: interpolation and lines of best fit

Other common transformations for out data are interpolations. These allow us to fit a smooth curve to potentially messy and noisy data to summarise an overall trend.

The function for this is stat_smooth(). By default, it fits a curve to our data:

ggplot(mpg, aes(x = displ, y = cty)) +
    geom_point() +
    stat_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Often, a curve will overfit, and we prefer to fit only straight lines. This is done by supplying the method = "lm" (lm stands for linear model) option to the stat_smooth() function.

ggplot(mpg, aes(x = displ, cty, color = factor(cyl))) +
    geom_point() +
    stat_smooth(method = "lm")

## `geom_smooth()` using formula 'y ~ x'

As we see in the above, stat_smooth() inherits aesthetics from the ggplot() call, so in this case, we have three different lines of best fit.

If that’s what we want, then great! Otherwise, we need to make sure not to pass stat_smooth() a grouping aesthetic.

ggplot(mpg, aes(x = displ, cty)) +
    geom_point(aes(color = factor(cyl))) +
    stat_smooth(method = "lm")

## `geom_smooth()` using formula 'y ~ x'

`lubridate` and parsing dates

Working with dates and times is generally an enormous headache - different parts of the world have different formats, we have distinctions between 12 and 24-hour times and every dataset seemingly has a different way of recording this information.

In a dataset, the ONE TRUE FORMAT for recording dates is year-month-day, so that today’s date is 2020-05-07. You should always record dates in this standard format and always immediately convert dates in any other format to this one.

Our main tool for this is a library called lubridate:

library(lubridate)

A comprehensive specification of lubridate’s functionality is found here: https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_lubridate.pdf, but here are some examples:

mdy("5/7/20")

## [1] "2020-05-07"

dmy("7th of May 2020")

## [1] "2020-05-07"

ymd(20200507)

## [1] "2020-05-07"

today()

## [1] "2020-05-06"

As you can see, this is very flexible, and saves many headaches!

Parsing times and dates

For date and time data, the ONE TRUE FORMAT is “year-month-day hour:minute:second”, where the hour is recorded in 24-hour time.

For example, the final project is due on 2020-05-16 23:59:59, with the hyphens, spaces and colons in exactly that order.

The lubridate functions for parsing datetimes are analogous to those for dates:

dmy_hm("16/5/2020, 11.59 PM")

## [1] "2020-05-16 23:59:00 UTC"

ymd_hms("20~5~16 23:59.59")

## [1] "2020-05-16 23:59:59 UTC"

now()

## [1] "2020-05-06 17:04:55 PDT"

We can also parse arbitrarily terrible formats:

parse_date_time("AM 59:12, 59 16-2020, 5", "pMhsdym")

## [1] "2020-05-16 00:59:59 UTC"

The UTC refers to “Coordinated Universal Time,” which is basically Greenwich Mean Time. To be precise, we should specify the timezone of out project due date:

due_date <- ymd_hms("2020-05-16 23:59:59", tz = "PST")

## [1] "2020-05-17 07:59:59 PST"

Making use of the standard format

Once we have our due date in the ONE TRUE FORMAT, we can extract all kinds of information about it:

month(due_date)

## [1] 5

dst(due_date)

## [1] FALSE

leap_year(due_date)

## [1] TRUE

We can also do timezone conversions:

with_tz(due_date, "Universal")

## [1] "2020-05-17 07:59:59 UTC"

with_tz(due_date, "Australia/Melbourne")

## [1] "2020-05-17 17:59:59 AEST"

Dealing with missing data

R records missing values of any type as the special value NA. Be aware that there are other non-values, such as Inf and NaN. These have the double type, and occur as outputs to mathematical operations.

Operations with `NA`s

NAs represent missing values - that is, data that we don’t know. Thus, any operation including an NA will produce an NA

1 + NA

## [1] NA

NA - NA

## [1] NA

NA * 0

## [1] NA

Remember, equality tests if two data points have the same value. If one or both values are missing we can’t know whether or not that’s tru

1 == NA

## [1] NA

NA == NA

## [1] NA

To check if a value missing, instead use the is.na() function:

is.na(c(1, 2, 3, NA, 5, 6, NA))

## [1] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE

Introducing, filling and removing `NA`s

df <- tibble(
    index = c(1, 2, 3),
    x = c(11, NA, 13),
    y = c("Yes", "No", ""))

## # A tibble: 3 x 3
##   index     x y    
##   <dbl> <dbl> <chr>
## 1     1    11 "Yes"
## 2     2    NA "No" 
## 3     3    13 ""

We can replace NAs with a specific value or vise-versa with the replace_na() and na_if() functions respectively:

df %>%
    mutate(
        x = replace_na(x, 0),
        y = na_if(y, ""))

## # A tibble: 3 x 3
##   index     x y    
##   <dbl> <dbl> <chr>
## 1     1    11 Yes  
## 2     2     0 No   
## 3     3    13 <NA>

We can call this function on the whole dataframe:

df <- na_if(df, "")

## # A tibble: 3 x 3
##   index     x y    
##   <dbl> <dbl> <chr>
## 1     1    11 Yes  
## 2     2    NA No   
## 3     3    13 <NA>

We can remove rows with missing values using the drop_na() function on the whole data table, or only on some columns:

drop_na(df)

## # A tibble: 1 x 3
##   index     x y    
##   <dbl> <dbl> <chr>
## 1     1    11 Yes

drop_na(df, index, y)

## # A tibble: 2 x 3
##   index     x y    
##   <dbl> <dbl> <chr>
## 1     1    11 Yes  
## 2     2    NA No

STATS 32 Session 8: Common questions and useful packages

Agenda for today

Splitting columns

Advanced column splitting with regular expressions

Combining columns

`stat_summary` aesthetics in ggplot

`stat_smooth`: interpolation and lines of best fit

`lubridate` and parsing dates

Parsing times and dates

Making use of the standard format

Dealing with missing data

Operations with `NA`s

Introducing, filling and removing `NA`s

Overview of the course

STATS 32 Session 8: Common questions and useful packages

Agenda for today

Splitting columns

Advanced column splitting with regular expressions

Combining columns

stat_summary aesthetics in ggplot

stat_smooth: interpolation and lines of best fit

lubridate and parsing dates

Parsing times and dates

Making use of the standard format

Dealing with missing data

Operations with NAs

Introducing, filling and removing NAs

Overview of the course

`stat_summary` aesthetics in ggplot

`stat_smooth`: interpolation and lines of best fit

`lubridate` and parsing dates

Operations with `NA`s

Introducing, filling and removing `NA`s