Damian Pavlyshyn
May 5, 2020
maps
packagemaps
package contains a lot of outlines of continents, countries, states, and countiesggplot2
’s map_data()
function puts these outlines in data frame format, which then allows us to plot them with ggplot()
county_data <- map_data("county")
CA_data <- county_data %>% filter(region == "california")
head(CA_data)
## long lat group order region subregion
## 1 -121.4785 37.48290 157 6965 california alameda
## 2 -121.5129 37.48290 157 6966 california alameda
## 3 -121.8853 37.48290 157 6967 california alameda
## 4 -121.8968 37.46571 157 6968 california alameda
## 5 -121.9254 37.45998 157 6969 california alameda
## 6 -121.9483 37.47717 157 6970 california alameda
County outlines are drawn using geom_polygon
.
coord_quickmap()
preserves the aspect ratio of the map.
## # A tibble: 3 x 2
## County Drought_percent
## <chr> <dbl>
## 1 alameda 100
## 2 alpine 100
## 3 amador 100
## long lat group order region subregion
## 1 -121.4785 37.4829 157 6965 california alameda
## 2 -121.5129 37.4829 157 6966 california alameda
## 3 -121.8853 37.4829 157 6967 california alameda
Our drought data and mapping information are in different datasets!
Sometimes our data are spread across different datasets, making it difficult to answer some questions.
Question: Who scored the highest in English in each class?
bio <- tibble(
Name = c("Andrew", "John", "Mary", "Jane"),
Class = c("A", "B", "A", "B"))
scores <- tibble(
Name = c("John", "Andrew", "John", "Mary"),
Subject = c("English", "English", "Math", "English"),
Score = c(76, 66, 85, 71))
## # A tibble: 4 x 2
## Name Class
## <chr> <chr>
## 1 Andrew A
## 2 John B
## 3 Mary A
## 4 Jane B
## # A tibble: 4 x 3
## Name Subject Score
## <chr> <chr> <dbl>
## 1 John English 76
## 2 Andrew English 66
## 3 John Math 85
## 4 Mary English 71
We join these tables together using the dplyr
function left_join()
:
dplyr
)## # A tibble: 5 x 4
## Name Class Subject Score
## <chr> <chr> <chr> <dbl>
## 1 Andrew A English 66
## 2 John B English 76
## 3 John B Math 85
## 4 Mary A English 71
## 5 Jane B <NA> NA
dplyr
)Question: Who scored the highest in English in each class?
library(dplyr)
bio %>% left_join(scores, by = "Name") %>%
filter(Subject == "English") %>%
group_by(Class) %>%
top_n(1, Score)
## # A tibble: 2 x 4
## # Groups: Class [2]
## Name Class Subject Score
## <chr> <chr> <chr> <dbl>
## 1 John B English 76
## 2 Mary A English 71
## # A tibble: 3 x 2
## County Drought_percent
## <chr> <dbl>
## 1 alameda 100
## 2 alpine 100
## 3 amador 100
## long lat group order region subregion
## 1 -121.4785 37.4829 157 6965 california alameda
## 2 -121.5129 37.4829 157 6966 california alameda
## 3 -121.8853 37.4829 157 6967 california alameda
Our drought data and mapping information are in different datasets!
Solution: Join the datasets together.
combined_data <- CA_data %>%
left_join(drought_data, by = c("subregion" = "County"))
head(combined_data)
## long lat group order region subregion Drought_percent
## 1 -121.4785 37.48290 157 6965 california alameda 100
## 2 -121.5129 37.48290 157 6966 california alameda 100
## 3 -121.8853 37.48290 157 6967 california alameda 100
## 4 -121.8968 37.46571 157 6968 california alameda 100
## 5 -121.9254 37.45998 157 6969 california alameda 100
## 6 -121.9483 37.47717 157 6970 california alameda 100
Map the fill
attribute of geom_polygon
to the Drought_percent
column.
Use scale_fill_distiller
to define a more appropriate color scale.
Inner join: Matches pairs of observations with equal keys, drops everything else. Hence, only keeps observations which appear in both datasets.
After matching pairs of observations with equal keys…
Consider the following data frame that details estimated average rent and income across US states:
## # A tibble: 104 x 4
## GEOID NAME variable estimate
## <chr> <chr> <chr> <dbl>
## 1 01 Alabama income 24476
## 2 01 Alabama rent 747
## 3 02 Alaska income 32940
## 4 02 Alaska rent 1200
## 5 04 Arizona income 27517
## 6 04 Arizona rent 972
## 7 05 Arkansas income 23789
## 8 05 Arkansas rent 709
## 9 06 California income 29454
## 10 06 California rent 1358
## # … with 94 more rows
This table sticks very closely to the rule that there is a single variable of interest - estimate
, and the rest of the variables are what are called “covariates” - that is, they contain supplementary information about the measurement.
This format is called a long data frame, and is well-suited to the filter
and ggplot
functions.
However, it’s a little strange that estimates of the rent and income - two quantities on quite different scales appear in the same column! We might imagine a different table format that contains the same information:
## # A tibble: 52 x 4
## GEOID NAME income rent
## <chr> <chr> <dbl> <dbl>
## 1 01 Alabama 24476 747
## 2 02 Alaska 32940 1200
## 3 04 Arizona 27517 972
## 4 05 Arkansas 23789 709
## 5 06 California 29454 1358
## 6 08 Colorado 32401 1125
## 7 09 Connecticut 35326 1123
## 8 10 Delaware 31560 1076
## 9 11 District of Columbia 43198 1424
## 10 12 Florida 25952 1077
## # … with 42 more rows
This format has an additional column and half as many rows, so is called a wide data frame. It is well-suited to the select
function, and is useful in situations where it’s not 100% clear what the right variable of interest is.
## # A tibble: 104 x 4
## GEOID NAME variable estimate
## <chr> <chr> <chr> <dbl>
## 1 01 Alabama income 24476
## 2 01 Alabama rent 747
## 3 02 Alaska income 32940
## 4 02 Alaska rent 1200
## 5 04 Arizona income 27517
## 6 04 Arizona rent 972
## 7 05 Arkansas income 23789
## 8 05 Arkansas rent 709
## 9 06 California income 29454
## 10 06 California rent 1358
## # … with 94 more rows
This is a process called pivoting, and is governed by the pivot_wider()
(for changing to wide format) and pivot_longer()
(for changing to long format) functions.
The essential arguments of pivot_wider()
are
names_from
, which is the factor variables that you would like to make the headings of the new columns, andvalues_from
, which is the variable of interest that will be distributed into the newly-made columns.## # A tibble: 52 x 4
## GEOID NAME income rent
## <chr> <chr> <dbl> <dbl>
## 1 01 Alabama 24476 747
## 2 02 Alaska 32940 1200
## 3 04 Arizona 27517 972
## 4 05 Arkansas 23789 709
## 5 06 California 29454 1358
## 6 08 Colorado 32401 1125
## 7 09 Connecticut 35326 1123
## 8 10 Delaware 31560 1076
## 9 11 District of Columbia 43198 1424
## 10 12 Florida 25952 1077
## # … with 42 more rows
pivot_longer()
is the inverse of this, and takes arguments
cols
, which is a vector of the variables that you will want to fold into a single columns,names_to
, which is the name of the new column containing the old column headings, andvalues_to
, which is the name of the new column that contains the variable of interest.df_long <- df_wide %>%
pivot_longer(cols = c(income, rent), names_to = "variable", values_to = "estimate")
## # A tibble: 104 x 4
## GEOID NAME variable estimate
## <chr> <chr> <chr> <dbl>
## 1 01 Alabama income 24476
## 2 01 Alabama rent 747
## 3 02 Alaska income 32940
## 4 02 Alaska rent 1200
## 5 04 Arizona income 27517
## 6 04 Arizona rent 972
## 7 05 Arkansas income 23789
## 8 05 Arkansas rent 709
## 9 06 California income 29454
## 10 06 California rent 1358
## # … with 94 more rows