07-Data joining and maps

Getting started: drawing a map of the USA

Let’s start by drawing a map of the USA. The maps package contains a lot of outlines of continents, countries, states, and counties. ggplot2’s map_data function puts these outlines in data frame format, which then allows us to plot them with ggplot.

Let’s load the packages that we need:

library(tidyverse)
library(maps)

## Warning: package 'maps' was built under R version 4.0.5

Type in the following code:

map_USA <- map_data("usa")
head(map_USA)

##        long      lat group order region subregion
## 1 -101.4078 29.74224     1     1   main      <NA>
## 2 -101.3906 29.74224     1     2   main      <NA>
## 3 -101.3620 29.65056     1     3   main      <NA>
## 4 -101.3505 29.63911     1     4   main      <NA>
## 5 -101.3219 29.63338     1     5   main      <NA>
## 6 -101.3047 29.64484     1     6   main      <NA>

(If the code above doesn’t work, it could be that the maps package has not been installed yet. Install the maps package and try the code above again.) Each row in the map_USA dataset is one point on the outline of the USA. We are going to use ggplot2’s geom_polygon to connect these points.

It terms out that you can’t draw a map of the US with just one polygon: there are islands (e.g. Long Island) which form separate polygons. To draw just the “main” mainland, filter as follows:

map_USA_main <- map_USA %>% filter(region == "main")

In the same code chunk, type the following code:

ggplot() +
    geom_polygon(data = map_USA_main, mapping = aes(x = long, y = lat))

We get a map! By default, the fill of the polygon is black. Click on the “Zoom” button and resize the window. Do you notice what happens to the map?

In order to keep the correct aspect ratio, amend the code for the map by adding on an extra line:

ggplot() +
    geom_polygon(data = map_USA_main, mapping = aes(x = long, y = lat)) + 
    coord_quickmap()

Try resizing the window and see the difference in behavior from before.

County map

Let’s try to draw a county-level map. For the later part of this lab, we want to label these counties by FIPS codes (you can read more about FIPS county codes here). It turns out that it’s not so easy to get mapping data with FIPS codes, so I’ve put together a dataset to save you this trouble. Load it with the following code:

map_county_fips <- readRDS("county_map_fips.rds")
head(map_county_fips)

##        long      lat group order  region subregion fips
## 1 -86.50517 32.34920     1     1 alabama   autauga 1001
## 2 -86.53382 32.35493     1     2 alabama   autauga 1001
## 3 -86.54527 32.36639     1     3 alabama   autauga 1001
## 4 -86.55673 32.37785     1     4 alabama   autauga 1001
## 5 -86.57966 32.38357     1     5 alabama   autauga 1001
## 6 -86.59111 32.37785     1     6 alabama   autauga 1001

Let’s draw a county map using code that’s very similar to what we had for drawing the map of the USA:

ggplot() +
    geom_polygon(data = map_county_fips, 
                 mapping = aes(x = long, y = lat, group = group)) + 
    coord_quickmap()

Now, let’s draw a map with black outlines for the counties, and different colors for each state:

ggplot() +
    geom_polygon(data = map_county_fips,
                 mapping = aes(x = long, y = lat, group = group, fill = region),
                 col = "black") + 
    coord_quickmap() +
    theme(legend.position="none")

Exercises

What precisely does coord_quickmap() do? Find some other map coord_ commands and experimant with them.
What happens when you don’t include the group argument in geom_polygon()? Test is out on the data frame map_USA.
Generate a map of only the counties of California.
Find all the quadrilateral counties in the US and map them. Which state has the highest proportion of such counties?

Elections data

The goal of the rest of this lab is to visualize the 2016 US presidential elections results on a county-level map. Let’s load the elections data and have a peek at it:

df <- read_csv("2016_US_Presdential_Results_for_class.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   fips = col_double(),
##   percent_diff = col_double(),
##   state = col_character(),
##   county = col_character(),
##   votes_dem = col_double(),
##   votes_gop = col_double(),
##   total_votes = col_double()
## )

head(df)

## # A tibble: 6 x 7
##    fips percent_diff state county votes_dem votes_gop total_votes
##   <dbl>        <dbl> <chr> <chr>      <dbl>     <dbl>       <dbl>
## 1  2013         15.2 AK    Alaska     93003    130413      246588
## 2  2016         15.2 AK    Alaska     93003    130413      246588
## 3  2020         15.2 AK    Alaska     93003    130413      246588
## 4  2050         15.2 AK    Alaska     93003    130413      246588
## 5  2060         15.2 AK    Alaska     93003    130413      246588
## 6  2068         15.2 AK    Alaska     93003    130413      246588

In order to have the fills of the counties depend on our elections data, we need to somehow get information from our df dataset to the map_county_fips dataset. We can achieve this by using dplyr’s left-join.

The code below joins the two data frames on the fips column (map_county_fips is on the left, df is on the right):

# join elections data to mapping data
map_county_per_diff <- map_county_fips %>%
    left_join(df, by = "fips")

Next, we use ggplot to plot the data:

ggplot(data = map_county_per_diff, mapping = aes(x = long, y = lat, group = group)) +
    geom_polygon(aes(fill = percent_diff)) + 
    coord_quickmap() + 
    labs(title = "Election results by county")

Great, we got a map! There are two things we can do to improve on it:

The color scale is currently different shades of blue. Something more informative would be counties that voted very Republican being red, those that voted very Democrat being blue, and those that voted evenly being white.
The “lat” and “long” axes, as well as the grey background with grid lines, are not helpful for interpreting maps.

Let’s change the color scale first. Amend the map plotting code to the following (see ?scale_fill_gradient2 to understand how it works):

ggplot(data = map_county_per_diff, mapping = aes(x = long, y = lat, group = group)) +
    geom_polygon(aes(fill = percent_diff)) + 
    scale_fill_gradient2(low = "blue", high = "red") +
    coord_quickmap() + 
    labs(title = "Election results by county")

To remove parts of the plot which are not helpful, type in the following code in the same chunk before the ggplot call:

map_theme <- theme(
    axis.title.x = element_blank(),
    axis.text.x  = element_blank(),
    axis.ticks.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text.y  = element_blank(),
    axis.ticks.y = element_blank(),
    panel.background = element_rect(fill = "white")
)

Next, add map_theme to the ggplot function call:

ggplot(data = map_county_per_diff, mapping = aes(x = long, y = lat, group = group)) +
    geom_polygon(aes(fill = percent_diff)) + 
    scale_fill_gradient2(low = "blue", high = "red") +
    coord_quickmap() + 
    labs(title = "Election results by county") +
    map_theme

Play around with the labels, theme and color scales. Do any of the counties stand out to you? How can we modify the code above to draw a map of just one state?

We can see that large parts of the country voted more Republican than Democratic. At the same time, it doesn’t look like 85% of the map is red. This suggests that Trump won many counties which are geographically small.

Exercises

Highlight on the map the 20 counties whose results were closest to a tie.
Using group_by() and summarize(), create a map of the state-by-state vote totals (this is demonstrated in the next section, but you should try to do it yourself before moving on).
Suppose that we are interested in mapping the change in vote share between 2016 and 2020. Think about how you might do this, and try to implement it.

Optional material

County map with state boundaries

To make the map better, we could draw state boundaries as well. Create a new code chunk and type in the following code. (Compare this code with the code in the previous chunk and try to figure out how the state boundaries were drawn.)

map_state <- map_data("state")
ggplot(data = map_county_per_diff, mapping = aes(x = long, y = lat, group = group)) +
    geom_polygon(aes(fill = percent_diff)) + 
    geom_polygon(data = map_state, fill = NA, color = "black") +
    scale_fill_gradient2(low = "blue", high = "red") +
    coord_quickmap() + 
    labs(title = "Election results by county") + 
    map_theme

(If you’re curious, that very blue spot in the middle of the country is Oglala County, South Dakota. From Wikipedia: “The counties surrounding Oglala Lakota County are predominantly Republican, but, like most Native American counties, Oglala Lakota is heavily Democratic, giving over 75 percent of the vote to every Democratic presidential nominee in every election back to 1984, making it one of the most Democratic counties in the United States. No Republican has carried the county in a presidential election since 1952.”)

State-level map

Let’s try to make the same map but at the state level, instead of at the county level. Make a new dataframe containing state-level data:

state_df <- df %>% group_by(state) %>%
    summarize(votes_dem = sum(votes_dem),
              votes_gop = sum(votes_gop),
              total_votes = sum(total_votes)) %>%
    mutate(diff = votes_gop - votes_dem,
           percent_diff = diff / total_votes)

Notice that state_df and map_state contain their data on state differently: it’s abbreviated in state_df, while its the full name in map_state. We’ll need to do some data wrangling/transformation to get them to match.

R has 2 built-in variables, state.abb and state.name, which can help us. First, let’s augment these variables with “District of Columbia”:

state_abb <- c(state.abb, "DC")
state_name <- c(state.name, "District of Columbia")

Next, we use the following line of code to add a new column to state_df which has the state name in full. (See ?match to figure out what is going on here. We also have to use tolower() as the state names are all in small letters in the map_state dataset.)

state_df$region <- tolower(state_name[match(state_df$state, state_abb)])

We are now in a position to join the datasets:

combined_state_df <- map_state %>% left_join(state_df, by = "region")

The commands for plotting are almost the same as for the county map:

ggplot(data = combined_state_df, mapping = aes(x = long, y = lat, group = group)) +
    geom_polygon(aes(fill = percent_diff)) + 
    geom_polygon(fill = NA, color = "black") +
    scale_fill_gradient2(low = "blue", high = "red") +
    coord_quickmap() + 
    labs(title = "Election results by state") + 
    map_theme

We can also plot the state labels on the map (notice that the group can no longer be in the ggplot() call):

state_names <- combined_state_df %>% group_by(state) %>%
    summarize(lat = 0.5 * (max(lat) + min(lat)), 
              long = 0.5 * (max(long) + min(long)))

ggplot(data = combined_state_df, mapping = aes(x = long, y = lat)) +
    geom_polygon(aes(group = group, fill = percent_diff)) + 
    geom_polygon(aes(group = group), fill = NA, color = "black") +
    geom_text(data = state_names,
              mapping = aes(x = long, y = lat, label = state)) +
    scale_fill_gradient2(low = "blue", high = "red") +
    coord_quickmap() + 
    labs(title = "Election results by state") + 
    map_theme

Session info

sessionInfo()

## R version 4.0.4 (2021-02-15)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19041)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] maps_3.3.0      forcats_0.5.1   stringr_1.4.0   dplyr_1.0.5    
##  [5] purrr_0.3.4     readr_1.4.0     tidyr_1.1.3     tibble_3.1.0   
##  [9] ggplot2_3.3.3   tidyverse_1.3.0
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.1.0  xfun_0.22         haven_2.3.1       colorspace_2.0-0 
##  [5] vctrs_0.3.6       generics_0.1.0    htmltools_0.5.1.1 yaml_2.2.1       
##  [9] utf8_1.2.1        rlang_0.4.10      pillar_1.5.1      glue_1.4.2       
## [13] withr_2.4.1       DBI_1.1.1         dbplyr_2.1.0      modelr_0.1.8     
## [17] readxl_1.3.1      lifecycle_1.0.0   munsell_0.5.0     gtable_0.3.0     
## [21] cellranger_1.1.0  rvest_1.0.0       evaluate_0.14     labeling_0.4.2   
## [25] knitr_1.31        fansi_0.4.2       highr_0.8         broom_0.7.5      
## [29] Rcpp_1.0.6        scales_1.1.1      backports_1.2.1   jsonlite_1.7.2   
## [33] farver_2.1.0      fs_1.5.0          hms_1.0.0         digest_0.6.27    
## [37] stringi_1.5.3     grid_4.0.4        cli_2.3.1         tools_4.0.4      
## [41] magrittr_2.0.1    crayon_1.4.1      pkgconfig_2.0.3   ellipsis_0.3.1   
## [45] xml2_1.3.2        reprex_1.0.0      lubridate_1.7.10  assertthat_0.2.1 
## [49] rmarkdown_2.7     httr_1.4.2        rstudioapi_0.13   R6_2.5.0         
## [53] compiler_4.0.4

07-Data joining and maps

April 26, 2020

Getting started: drawing a map of the USA

County map

Exercises

Elections data

Exercises

Optional material

County map with state boundaries

State-level map

Session info