STATS 32 Session 7: Data joining and maps

Damian Pavlyshyn

May 5, 2020

Agenda for today

Making maps in R: maps package

Example: Drawing CA and its counties (1)

library(tidyverse)
library(maps)
county_data <- map_data("county")
CA_data <- county_data %>% filter(region == "california")
head(CA_data)
##        long      lat group order     region subregion
## 1 -121.4785 37.48290   157  6965 california   alameda
## 2 -121.5129 37.48290   157  6966 california   alameda
## 3 -121.8853 37.48290   157  6967 california   alameda
## 4 -121.8968 37.46571   157  6968 california   alameda
## 5 -121.9254 37.45998   157  6969 california   alameda
## 6 -121.9483 37.47717   157  6970 california   alameda

Example: Drawing CA and its counties (2)

County outlines are drawn using geom_polygon.

ggplot(data = CA_data) + 
    geom_polygon(mapping = aes(x = long, y = lat, group = group))

Example: Drawing CA and its counties (3)

coord_quickmap() preserves the aspect ratio of the map.

ggplot(data = CA_data) + 
    geom_polygon(mapping = aes(x = long, y = lat, group = group)) + 
    coord_quickmap()

Example: Drought levels in CA (1)

drought_data <- read_csv("Drought data.csv")
head(drought_data, n = 3)
## # A tibble: 3 x 2
##   County  Drought_percent
##   <chr>             <dbl>
## 1 alameda             100
## 2 alpine              100
## 3 amador              100
head(CA_data, n = 3)
##        long     lat group order     region subregion
## 1 -121.4785 37.4829   157  6965 california   alameda
## 2 -121.5129 37.4829   157  6966 california   alameda
## 3 -121.8853 37.4829   157  6967 california   alameda

Our drought data and mapping information are in different datasets!

Joining datasets

Sometimes our data are spread across different datasets, making it difficult to answer some questions.

Question: Who scored the highest in English in each class?

bio <- tibble(
    Name = c("Andrew", "John", "Mary", "Jane"),
    Class = c("A", "B", "A", "B"))
scores <- tibble(
    Name = c("John", "Andrew", "John", "Mary"),
    Subject = c("English", "English", "Math", "English"),
    Score = c(76, 66, 85, 71))
bio
## # A tibble: 4 x 2
##   Name   Class
##   <chr>  <chr>
## 1 Andrew A    
## 2 John   B    
## 3 Mary   A    
## 4 Jane   B
scores
## # A tibble: 4 x 3
##   Name   Subject Score
##   <chr>  <chr>   <dbl>
## 1 John   English    76
## 2 Andrew English    66
## 3 John   Math       85
## 4 Mary   English    71

We join these tables together using the dplyr function left_join():

library(dplyr)
left_join(bio, scores, by = "Name")

Joining datasets (using dplyr)

library(dplyr)
left_join(bio, scores, by = "Name")
## # A tibble: 5 x 4
##   Name   Class Subject Score
##   <chr>  <chr> <chr>   <dbl>
## 1 Andrew A     English    66
## 2 John   B     English    76
## 3 John   B     Math       85
## 4 Mary   A     English    71
## 5 Jane   B     <NA>       NA

Joining datasets (using dplyr)

Question: Who scored the highest in English in each class?

library(dplyr)
bio %>% left_join(scores, by = "Name") %>%
    filter(Subject == "English") %>%
    group_by(Class) %>%
    top_n(1, Score)
## # A tibble: 2 x 4
## # Groups:   Class [2]
##   Name  Class Subject Score
##   <chr> <chr> <chr>   <dbl>
## 1 John  B     English    76
## 2 Mary  A     English    71

Example: Drought levels in CA (1)

drought_data <- read_csv("Drought data.csv")
head(drought_data, n = 3)
## # A tibble: 3 x 2
##   County  Drought_percent
##   <chr>             <dbl>
## 1 alameda             100
## 2 alpine              100
## 3 amador              100
head(CA_data, n = 3)
##        long     lat group order     region subregion
## 1 -121.4785 37.4829   157  6965 california   alameda
## 2 -121.5129 37.4829   157  6966 california   alameda
## 3 -121.8853 37.4829   157  6967 california   alameda

Our drought data and mapping information are in different datasets!

Example: Drought levels in CA (2)

Solution: Join the datasets together.

combined_data <- CA_data %>%
    left_join(drought_data, by = c("subregion" = "County"))
head(combined_data)
##        long      lat group order     region subregion Drought_percent
## 1 -121.4785 37.48290   157  6965 california   alameda             100
## 2 -121.5129 37.48290   157  6966 california   alameda             100
## 3 -121.8853 37.48290   157  6967 california   alameda             100
## 4 -121.8968 37.46571   157  6968 california   alameda             100
## 5 -121.9254 37.45998   157  6969 california   alameda             100
## 6 -121.9483 37.47717   157  6970 california   alameda             100

Example: Drought levels in CA (3)

Map the fill attribute of geom_polygon to the Drought_percent column.

ggplot(data = combined_data) + 
    geom_polygon(mapping = aes(x = long, y = lat, 
            group = group, fill = Drought_percent)) + 
    coord_quickmap()

Example: Drought levels in CA (4)

Use scale_fill_distiller to define a more appropriate color scale.

ggplot(data = combined_data) + 
    geom_polygon(mapping = aes(x = long, y = lat, 
            group = group, fill = Drought_percent)) + 
    scale_fill_distiller(palette = "YlOrRd", direction = 1) +
    coord_quickmap()

Other types of joins

Inner join: Matches pairs of observations with equal keys, drops everything else. Hence, only keeps observations which appear in both datasets.

Other types of joins

After matching pairs of observations with equal keys…

Long and wide data frames

Consider the following data frame that details estimated average rent and income across US states:

df <- us_rent_income %>%
    select(-moe)
## # A tibble: 104 x 4
##    GEOID NAME       variable estimate
##    <chr> <chr>      <chr>       <dbl>
##  1 01    Alabama    income      24476
##  2 01    Alabama    rent          747
##  3 02    Alaska     income      32940
##  4 02    Alaska     rent         1200
##  5 04    Arizona    income      27517
##  6 04    Arizona    rent          972
##  7 05    Arkansas   income      23789
##  8 05    Arkansas   rent          709
##  9 06    California income      29454
## 10 06    California rent         1358
## # … with 94 more rows

This table sticks very closely to the rule that there is a single variable of interest - estimate, and the rest of the variables are what are called “covariates” - that is, they contain supplementary information about the measurement.

This format is called a long data frame, and is well-suited to the filter and ggplot functions.

However, it’s a little strange that estimates of the rent and income - two quantities on quite different scales appear in the same column! We might imagine a different table format that contains the same information:

## # A tibble: 52 x 4
##    GEOID NAME                 income  rent
##    <chr> <chr>                 <dbl> <dbl>
##  1 01    Alabama               24476   747
##  2 02    Alaska                32940  1200
##  3 04    Arizona               27517   972
##  4 05    Arkansas              23789   709
##  5 06    California            29454  1358
##  6 08    Colorado              32401  1125
##  7 09    Connecticut           35326  1123
##  8 10    Delaware              31560  1076
##  9 11    District of Columbia  43198  1424
## 10 12    Florida               25952  1077
## # … with 42 more rows

This format has an additional column and half as many rows, so is called a wide data frame. It is well-suited to the select function, and is useful in situations where it’s not 100% clear what the right variable of interest is.

Switching between wide and long formats

## # A tibble: 104 x 4
##    GEOID NAME       variable estimate
##    <chr> <chr>      <chr>       <dbl>
##  1 01    Alabama    income      24476
##  2 01    Alabama    rent          747
##  3 02    Alaska     income      32940
##  4 02    Alaska     rent         1200
##  5 04    Arizona    income      27517
##  6 04    Arizona    rent          972
##  7 05    Arkansas   income      23789
##  8 05    Arkansas   rent          709
##  9 06    California income      29454
## 10 06    California rent         1358
## # … with 94 more rows

This is a process called pivoting, and is governed by the pivot_wider() (for changing to wide format) and pivot_longer() (for changing to long format) functions.

The essential arguments of pivot_wider() are

df_wide <- df %>%
    pivot_wider(names_from = variable, values_from = estimate)
## # A tibble: 52 x 4
##    GEOID NAME                 income  rent
##    <chr> <chr>                 <dbl> <dbl>
##  1 01    Alabama               24476   747
##  2 02    Alaska                32940  1200
##  3 04    Arizona               27517   972
##  4 05    Arkansas              23789   709
##  5 06    California            29454  1358
##  6 08    Colorado              32401  1125
##  7 09    Connecticut           35326  1123
##  8 10    Delaware              31560  1076
##  9 11    District of Columbia  43198  1424
## 10 12    Florida               25952  1077
## # … with 42 more rows

pivot_longer() is the inverse of this, and takes arguments

df_long <- df_wide %>%
    pivot_longer(cols = c(income, rent), names_to = "variable", values_to = "estimate")
## # A tibble: 104 x 4
##    GEOID NAME       variable estimate
##    <chr> <chr>      <chr>       <dbl>
##  1 01    Alabama    income      24476
##  2 01    Alabama    rent          747
##  3 02    Alaska     income      32940
##  4 02    Alaska     rent         1200
##  5 04    Arizona    income      27517
##  6 04    Arizona    rent          972
##  7 05    Arkansas   income      23789
##  8 05    Arkansas   rent          709
##  9 06    California income      29454
## 10 06    California rent         1358
## # … with 94 more rows