Introduction

This is an analysis of county-level data for the 2016 US presidential election. Since only a small percentage of votes went to other candidates, we will only compare Democratic and Republican vote shares.

The data for this analysis is taken from https://github.com/tonmcg/County_Level_Election_Results_12-16.

Data import and checking

Library imports:

library(tidyverse)
library(knitr)

Read in the data:

df <- read_csv("http://web.stanford.edu/class/stats32/assets/lecture-2/2016-presidential-election-county-results.csv")
kable(head(df))
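| fips_cod | county             | total |  dem |   gop | other |  dem_prop |  gop_prop | other_prop | state | state_name |
|:---------|:-------------------|------:|-----:|------:|------:|----------:|----------:|-----------:|:------|:-----------|
| 26041    | Delta County       | 18467 | 6431 | 11112 |   924 | 0.3482428 | 0.6017220 |  0.0500352 | MI    | Michigan   |
| 48295    | Lipscomb County    |  1322 |  135 |  1159 |    28 | 0.1021180 | 0.8767020 |  0.0211800 | TX    | Texas      |
| 01127    | Walker County      | 29243 | 4486 | 24208 |   549 | 0.1534042 | 0.8278220 |  0.0187737 | AL    | Alabama    |
| 48389    | Reeves County      |  3184 | 1659 |  1417 |   108 | 0.5210427 | 0.4450377 |  0.0339196 | TX    | Texas      |
| 56017    | Hot Springs County |  2535 |  400 |  1939 |   196 | 0.1577909 | 0.7648915 |  0.0773176 | WY    | Wyoming    |
| 20043    | Doniphan County    |  3366 |  584 |  2601 |   181 | 0.1734997 | 0.7727273 |  0.0537730 | KS    | Kansas     |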

There are 3112 rows in total, slightly fewer than the 3141 counties cited in other tallies of the 2016 results (Sources: http://www.snopes.com/trump-won-3084-of-3141-counties-clinton-won-57/ and http://www.wnd.com/2016/12/trumps-landslide-2623-to-489-among-u-s-counties/).
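A few quick consistency checks can catch a bad import early. The sketch below rebuilds the six preview rows by hand (the name `sample_rows` is an assumption made for illustration) and checks that FIPS codes are unique, that the party counts add up to the total, and that no counts are missing. The same checks could be run on the full `df`.

```r
# Six rows copied from the preview of the data; `sample_rows` is an illustrative name.
sample_rows <- data.frame(
  fips_cod = c("26041", "48295", "01127", "48389", "56017", "20043"),
  county   = c("Delta County", "Lipscomb County", "Walker County",
               "Reeves County", "Hot Springs County", "Doniphan County"),
  total    = c(18467, 1322, 29243, 3184, 2535, 3366),
  dem      = c(6431, 135, 4486, 1659, 400, 584),
  gop      = c(11112, 1159, 24208, 1417, 1939, 2601),
  other    = c(924, 28, 549, 108, 196, 181),
  stringsAsFactors = FALSE
)

# Each county should appear only once.
no_duplicates <- anyDuplicated(sample_rows$fips_cod) == 0

# Party counts should add up to the reported total in every row.
counts_add_up <- all(with(sample_rows, dem + gop + other == total))

# No vote count should be missing.
vote_cols <- c("total", "dem", "gop", "other")
no_missing <- !any(is.na(sample_rows[, vote_cols]))

c(no_duplicates = no_duplicates,
  counts_add_up = counts_add_up,
  no_missing    = no_missing)
```

Keeping `fips_cod` as a character column matters here: codes such as 01127 would lose their leading zero if read as numbers.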

The dataset contains the following columns:

names(df)
##  [1] "fips_cod"   "county"     "total"      "dem"        "gop"       
##  [6] "other"      "dem_prop"   "gop_prop"   "other_prop" "state"     
## [11] "state_name"

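Since we are interested in whether a given county had more Republican or Democrat votes, we compute two new columns, diff and per_point_diff. diff is the Republican vote count minus the Democrat vote count, and per_point_diff expresses that margin as a percentage of all votes cast in the county. Both are positive if there are more Republican votes than Democrat votes, and negative otherwise.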

df <- df %>% mutate(diff = gop - dem,
                    per_point_diff = diff / total * 100)
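As a worked example of the two formulas, the sketch below applies them to a single county, Delta County, MI, using the figures from the preview of the data. The variable names `vote_diff` and `from_props` are assumptions chosen for illustration.

```r
# Delta County, MI, taken from the first row of the data preview.
gop   <- 11112
dem   <- 6431
total <- 18467

vote_diff      <- gop - dem                 # raw vote margin, positive means Republican lead
per_point_diff <- vote_diff / total * 100   # margin as a percentage of all votes cast

# Cross-check: the same margin derived from the gop_prop and dem_prop columns.
from_props <- (0.6017220 - 0.3482428) * 100

vote_diff
round(per_point_diff, 2)
round(from_props, 2)
```

The two routes agree up to rounding, which confirms that per_point_diff is simply the gap between the two vote shares in percentage points.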

Summary statistics

Compute percentage of popular vote won by each party:

paste0("Republican % of popular vote: ", 
       round(sum(df$gop) / sum(df$total) * 100, digits = 1),
       "%")
## [1] "Republican % of popular vote: 47.3%"
paste0("Democrat % of popular vote: ", 
       round(sum(df$dem) / sum(df$total) * 100, digits = 1),
       "%")
## [1] "Democrat % of popular vote: 47.8%"

Although Clinton lost the presidential election, she actually won the popular vote!
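The two reported shares leave roughly 4.9% of the vote for other candidates. The share calculation itself can be wrapped in a small helper to avoid repeating the expression for each party. The sketch below is one possible way to do this; `vote_share` is a hypothetical helper, and it is run only on the six preview rows, so its output does not match the national figures.

```r
# Hypothetical helper: percentage of `votes` out of `total`, rounded to one decimal place.
vote_share <- function(votes, total) {
  round(sum(votes) / sum(total) * 100, digits = 1)
}

# Vote counts for the six preview rows only (not the full data set).
total <- c(18467, 1322, 29243, 3184, 2535, 3366)
dem   <- c(6431, 135, 4486, 1659, 400, 584)
gop   <- c(11112, 1159, 24208, 1417, 1939, 2601)
other <- c(924, 28, 549, 108, 196, 181)

shares <- c(Republican = vote_share(gop, total),
            Democrat   = vote_share(dem, total),
            Other      = vote_share(other, total))
shares
```

On the full data the call would be `vote_share(df$gop, df$total)`, and likewise for the other columns.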

Compute number of counties won by each party:

df %>% transmute(gop_won = gop > dem) %>%
    summarize(gop_won = sum(gop_won))
## # A tibble: 1 x 1
##   gop_won
##     <int>
## 1    2625

The county count paints a completely different picture: Trump won 2625 of the 3112 counties in the dataset (84.4%), while Clinton won only 487. This suggests that Clinton won in counties with large populations, or that the margin of victory was slimmer in the counties that Trump won compared with the counties that Clinton won.
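Counting only the rows where gop > dem leaves the Democrat count to be inferred by subtraction, which silently assumes there are no ties. A minimal sketch of a more explicit tally, run here on the six preview rows only (the `winner` and `counts` names are assumptions for illustration):

```r
# Six preview rows only; results on the full `df` will differ.
dem <- c(6431, 135, 4486, 1659, 400, 584)
gop <- c(11112, 1159, 24208, 1417, 1939, 2601)

# Label each county by its winner, keeping ties as a separate category.
winner <- ifelse(gop > dem, "Republican",
                 ifelse(dem > gop, "Democrat", "Tie"))

# Fixing the factor levels keeps empty categories visible in the table.
counts <- table(factor(winner, levels = c("Republican", "Democrat", "Tie")))

counts
round(prop.table(counts) * 100, 1)
```

If the "Tie" row is zero on the full data, the subtraction shortcut is safe.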

Histograms

We have Clinton winning the popular vote on one hand, but Trump winning many more counties. How can we reconcile these two facts?

One theory is that Clinton won her counties by a huge margin percentage-wise, while Trump won his counties by a slim margin percentage-wise. To test this theory, we could plot a histogram of the per_point_diff:

ggplot() +
    geom_histogram(data = df, mapping = aes(x = per_point_diff)) + 
    labs(title = "Histogram of % vote margin", 
         x = "% Republicans won by", y = "Frequency")

The chart does not support the theory that Trump had narrower margins of victory in the counties that he won: he won a sizeable number of counties by a margin of more than 50 percentage points.
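The reading of the histogram can be backed by numbers. The sketch below bins the percentage-point margins into 25-point intervals and computes the share of Republican-won counties with a margin above 50 points. It uses only the six preview rows, which are not representative of the full data; the bin width and the variable names are assumptions.

```r
# Six preview rows only (not representative of the full data set).
total <- c(18467, 1322, 29243, 3184, 2535, 3366)
dem   <- c(6431, 135, 4486, 1659, 400, 584)
gop   <- c(11112, 1159, 24208, 1417, 1939, 2601)

per_point_diff <- (gop - dem) / total * 100

# Count counties per 25-point interval, mirroring what a histogram displays.
breaks     <- seq(-100, 100, by = 25)
bin_counts <- table(cut(per_point_diff, breaks = breaks, include.lowest = TRUE))

# Among Republican-won counties, the share won by more than 50 points.
gop_won            <- per_point_diff > 0
share_large_margin <- mean(per_point_diff[gop_won] > 50)

bin_counts
share_large_margin
```

On the full data, `df$per_point_diff` can be passed to the same `cut()` and `mean()` calls.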

Let’s try plotting a histogram of diff to look at absolute differences instead:

ggplot() +
    geom_histogram(data = df, mapping = aes(x = diff)) + 
    labs(title = "Histogram of absolute vote margin", 
         x = "No. of votes Republicans won by", y = "Frequency")

This chart is very different! In the counties that Clinton won, she won them by extremely large margins in terms of absolute votes. Thus, even though she won very few counties compared to Trump, these large margins meant that she could still win the popular vote.
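The arithmetic behind this is easy to reproduce on a toy example. The counties and vote counts below are entirely hypothetical, made up to show how one party can win most counties while losing the overall vote.

```r
# Entirely hypothetical counties, invented to illustrate the arithmetic.
toy <- data.frame(
  county = c("Small A", "Small B", "Small C", "Big City"),
  dem    = c(400, 300, 450, 90000),
  gop    = c(600, 700, 550, 60000),
  stringsAsFactors = FALSE
)
toy$diff <- toy$gop - toy$dem

counties_won_gop <- sum(toy$diff > 0)   # counties with more Republican votes
counties_won_dem <- sum(toy$diff < 0)   # counties with more Democrat votes
net_margin       <- sum(toy$diff)       # overall Republican minus Democrat votes

c(gop_counties = counties_won_gop,
  dem_counties = counties_won_dem,
  net_margin   = net_margin)
```

Here the Republican side wins three of four counties, yet the single large county outweighs all three small margins combined, so the net margin is negative.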

The code below shows that the 44 counties with the largest absolute vote difference were all won by Clinton (number 45 was Montgomery County, TX, which went to Trump).

df %>% select(State = state, County = county, diff) %>%
    mutate(abs_diff = abs(diff)) %>%
    arrange(desc(abs_diff)) %>%
    select(State, County, `Vote difference` = diff) %>%
    head(n = 50) %>%
    kable()
| State | County                 | Vote difference |
|:------|:-----------------------|----------------:|
| CA    | Los Angeles County     |        -1112035 |
| IL    | Cook County            |        -1088369 |
| NY    | Kings County           |         -461433 |
| NY    | New York County        |         -456546 |
| PA    | Philadelphia County    |         -455124 |
| WA    | King County            |         -446227 |
| NY    | Queens County          |         -334839 |
| MA    | Middlesex County       |         -292756 |
| CA    | Santa Clara County     |         -289853 |
| FL    | Miami-Dade County      |         -289340 |
| MI    | Wayne County           |         -288709 |
| FL    | Broward County         |         -288435 |
| MD    | Prince George’s County |         -284337 |
| NY    | Bronx County           |         -283979 |
| CA    | Alameda County         |         -273514 |
| DC    | District of Columbia   |         -248670 |
| MN    | Hennepin County        |         -237515 |
| MD    | Montgomery County      |         -226776 |
| CA    | San Francisco County   |         -211139 |
| OR    | Multnomah County       |         -208699 |
| OH    | Cuyahoga County        |         -204080 |
| TX    | Dallas County          |         -196980 |
| VA    | Fairfax County         |         -196648 |
| GA    | DeKalb County          |         -191600 |
| MA    | Suffolk County         |         -191170 |
| TX    | Travis County          |         -179725 |
| GA    | Fulton County          |         -171503 |
| NJ    | Essex County           |         -168972 |
| WI    | Milwaukee County       |         -162895 |
| TX    | Harris County          |         -161511 |
| MD    | Baltimore City         |         -155836 |
| WI    | Dane County            |         -146236 |
| CA    | Contra Costa County    |         -144757 |
| OH    | Franklin County        |         -143633 |
| NC    | Mecklenburg County     |         -137955 |
| FL    | Orange County          |         -134488 |
| CO    | Denver County          |         -130974 |
| CA    | San Diego County       |         -130817 |
| NY    | Westchester County     |         -124027 |
| CA    | San Mateo County       |         -122275 |
| LA    | Orleans Parish         |         -109566 |
| MN    | Ramsey County          |         -106151 |
| PA    | Allegheny County       |         -105529 |
| NC    | Wake County            |         -104746 |
| TX    | Montgomery County      |          104444 |
| NJ    | Hudson County          |         -104365 |
| FL    | Palm Beach County      |         -100649 |
| MA    | Norfolk County         |          -99958 |
| TX    | El Paso County         |          -90942 |
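Rather than scanning the table by eye, the position of the first Republican-won county can be computed. The sketch below does this on four consecutive rows copied from the table (Wake County through Palm Beach County); the data frame name `top` and the variable `first_gop` are assumptions for illustration.

```r
# Four consecutive rows copied from the table of largest margins.
top <- data.frame(
  state  = c("NC", "TX", "NJ", "FL"),
  county = c("Wake County", "Montgomery County", "Hudson County", "Palm Beach County"),
  diff   = c(-104746, 104444, -104365, -100649),
  stringsAsFactors = FALSE
)

# Sort by absolute margin, largest first, as in the pipeline that built the table.
top <- top[order(-abs(top$diff)), ]

# Position of the first Republican-won county within this subset.
first_gop <- which(top$diff > 0)[1]
top[first_gop, ]
```

Applied to the full sorted data instead of this subset, the same `which()` call returns the overall rank of the first county that went to Trump.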

Conclusion

When analyzing elections, we have to examine the data from many different perspectives in order to get the full story.