We do the same set-up as in the previous lab.
This time, we use the read_csv function instead of read.csv. This is the tidyverse version of that function, which produces a tidyverse tibble instead of a base R data frame:
library(tidyverse)
df <- read_csv("http://web.stanford.edu/class/stats32/assets/lecture-3/data/worldbank_data_tidy.csv")
As a reminder, the first few rows of the table are:
head(df)
## # A tibble: 6 x 11
## cty_name cty_code year elecAccess gdpPerCap compEduc educPri educTer
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan AFG 2009 45.2 1455. 9 NA NA
## 2 Afghanistan AFG 2010 42.7 1637. 9 NA NA
## 3 Afghanistan AFG 2011 43.2 1627. 9 NA NA
## 4 Afghanistan AFG 2012 69.1 1807. 9 NA NA
## 5 Afghanistan AFG 2013 70.2 1875. 9 NA NA
## 6 Afghanistan AFG 2014 89.5 1898. 9 NA NA
## # ... with 3 more variables: govEducExp <dbl>, popYoung <dbl>, pop <dbl>
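To see the difference between read_csv and read.csv for yourself, the following sketch reads the same small file both ways. The file contents are made up for illustration (they are not the World Bank data), and tmp, base_df and tidy_df are hypothetical names.

```r
library(readr)

# Made-up toy data written to a temporary file, so nothing is downloaded
tmp <- tempfile(fileext = ".csv")
writeLines(c("cty_name,year,gdpPerCap",
             "A,2009,1455.2",
             "B,2010,1637.4"), tmp)

base_df <- read.csv(tmp)   # base R reader
tidy_df <- read_csv(tmp)   # tidyverse reader

inherits(base_df, "tbl_df")  # FALSE: a plain data frame
inherits(tidy_df, "tbl_df")  # TRUE: a tibble
is.data.frame(tidy_df)       # TRUE: a tibble is still a data frame
```

Because a tibble is still a data frame, the base R indexing used in the previous lab should carry over unchanged.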
In the previous lab, we saw the following two plots:
ggplot(data = df, aes(x = factor(compEduc), y = log10(gdpPerCap))) +
  geom_boxplot()
ggplot(data = df, aes(x = factor(compEduc), y = log10(gdpPerCap))) +
  geom_point(alpha = 0.2, position = "jitter")
which present the same data.
The boxplots quickly convey a summary of the data, whereas the jittered scatterplot shows all the data, albeit in a somewhat messy way.
To get the strengths of both of these plots, we might want to overlay them, plotted on the same axes.
ggplot is very good at this kind of thing, and we can accomplish it just by adding both the geom_boxplot() and geom_point() layers to the same base plot.
ggplot(data = df, aes(x = factor(compEduc), y = log10(gdpPerCap))) +
  geom_boxplot() +
  geom_point(alpha = 0.2, position = "jitter")
The order of the layers is important, as later layers are drawn on top of earlier ones:
ggplot(data = df, aes(x = factor(compEduc), y = log10(gdpPerCap))) +
  geom_point(alpha = 0.2, position = "jitter") +
  geom_boxplot()
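One way to check the drawing order without looking at the picture is to ask ggplot for the data behind each layer, by position. This is only a sketch on made-up data: toy, p_box_first and p_points_first are hypothetical names, and layer_data() is a ggplot2 helper that returns the computed data for the i-th layer.

```r
library(ggplot2)

# Made-up toy data: two groups of five values each
toy <- data.frame(
  group = rep(c("a", "b"), each = 5),
  value = c(1, 2, 3, 4, 5, 2, 4, 6, 8, 10)
)

p_box_first <- ggplot(toy, aes(x = group, y = value)) +
  geom_boxplot() +
  geom_point(alpha = 0.2, position = "jitter")

p_points_first <- ggplot(toy, aes(x = group, y = value)) +
  geom_point(alpha = 0.2, position = "jitter") +
  geom_boxplot()

# Boxplot layers carry summary columns such as "middle" (the median);
# point layers do not.
"middle" %in% names(layer_data(p_box_first, 1))     # layer 1 is the boxplot
"middle" %in% names(layer_data(p_points_first, 2))  # layer 2 is the boxplot
```

Layers are numbered in the order they were added, and that is also the order in which they are drawn.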
Any variable assignment done by an aes() function in the base ggplot() function gets inherited by each of the geom_ layers added to the canvas:
ggplot(data = df, aes(x = factor(compEduc), y = log10(gdpPerCap), color = factor(year))) +
  geom_point(alpha = 0.2, position = "jitter") +
  geom_boxplot()
We may not want this, however: the boxplots in the above graph are way too skinny and hard to read, whereas for the scatterplot, the colors seem ok.
To address this, we can assign variables to specific geom_ layers by including the aes() specification for those variables in the geom_ function rather than in the ggplot() function.
In the following plot, we assign x and y for all layers, but color only for the scatterplot layer.
ggplot(data = df, aes(x = factor(compEduc), y = log10(gdpPerCap))) +
  geom_boxplot() +
  geom_point(aes(color = factor(year)), alpha = 0.2, position = "jitter")
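The boxplots became skinny because a color mapping placed in ggplot() also splits the boxplot layer into one box per color group. The sketch below uses made-up data and hypothetical names (toy, p_global, p_local), and counts the boxes each version draws using ggplot2's layer_data() helper.

```r
library(ggplot2)

# Made-up toy data: 2 education levels x 3 years, 2 values per combination
toy <- data.frame(
  educ = rep(c(9, 10), each = 6),
  year = rep(c(2009, 2013, 2017), times = 4),
  gdp  = c(3.1, 3.3, 3.2, 3.4, 3.0, 3.5, 4.0, 4.2, 4.1, 3.9, 4.3, 4.4)
)

# color mapped in ggplot(): inherited by the boxplot layer
p_global <- ggplot(toy, aes(x = factor(educ), y = gdp, color = factor(year))) +
  geom_boxplot()

# color mapped only in geom_point(): the boxplot layer ignores it
p_local <- ggplot(toy, aes(x = factor(educ), y = gdp)) +
  geom_boxplot() +
  geom_point(aes(color = factor(year)))

nrow(layer_data(p_global, 1))  # 6: one box per education level and year
nrow(layer_data(p_local, 1))   # 2: one box per education level
```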
As you know well, it is vital to label your axes! ggplot will automatically label each graphical element used (in this case, x, y and color) with the expression mapped to it, such as factor(compEduc) or log10(gdpPerCap). However, this is not necessarily human-readable, so the labs() function allows us to label the graphical elements with something more evocative.
The labs() function also takes an argument named title, which does what you would expect: it titles the plot.
ggplot(data = df, aes(x = factor(compEduc), y = log10(gdpPerCap))) +
  geom_boxplot() +
  geom_point(aes(color = factor(year)), alpha = 0.2, position = "jitter") +
  labs(
    title = "log-gdp per capita against compulsory years of education",
    x = "Compulsory years of education",
    y = "log10 of GDP per capita",
    color = "Year"
  )
The plot that we have produced now has all of the right data and labeling, but we can still play around with it to make it look pretty.
The overall look of a plot is dictated by its theme, and this theme is supplied as a layer, just like any other layer. In the following, we use theme_bw(), a personal preference of mine:
ggplot(data = df, aes(x = factor(compEduc), y = log10(gdpPerCap))) +
  geom_boxplot() +
  geom_point(aes(color = factor(year)), alpha = 0.2, position = "jitter") +
  labs(
    title = "log-gdp per capita against compulsory years of education",
    x = "Compulsory years of education",
    y = "log10 of GDP per capita",
    color = "Year"
  ) +
  theme_bw()
There are many themes available. The lecture slides also list theme_classic() and theme_dark(), but Googling “ggplot themes” will give you all sorts of options.
For the remainder of this lab, we will concentrate on three specific years — 2009, 2013 and 2017.
Recall that to do this, we must create a vector with values TRUE at those indices where the corresponding row has one of these years, and FALSE everywhere else.
We accomplish this using the %in% operator, which checks whether each value on its left is contained in the vector on its right. Play around with this operator, and convince yourself that the following command produces the correct vector:
df$year %in% c(2009, 2013, 2017)
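As a starting point for playing around, here is a minimal sketch of %in% on a short made-up vector; years and keep are hypothetical names.

```r
# A short made-up vector of years
years <- c(2009, 2010, 2013, 2016, 2017)

# One TRUE/FALSE per element of `years`
keep <- years %in% c(2009, 2013, 2017)
keep         # TRUE FALSE TRUE FALSE TRUE

# The logical vector can then be used to pick out the matching elements
years[keep]  # 2009 2013 2017
```

Note that the result always has the same length as the vector on the left, regardless of how many values are listed on the right.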
To select the correct rows, we use our usual two-dimensional indexing with this vector:
df_late <- df[df$year %in% c(2009, 2013, 2017), ]
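The same indexing pattern can be tried on a tiny made-up table, which makes it easy to verify by eye which rows survive. Here toy and toy_sub are hypothetical names, and the values are invented.

```r
# A tiny made-up table with four rows
toy <- data.frame(
  cty  = c("A", "A", "B", "B"),
  year = c(2009, 2010, 2013, 2014),
  val  = c(1, 2, 3, 4)
)

# The logical vector goes before the comma (rows);
# leaving the part after the comma empty keeps every column
toy_sub <- toy[toy$year %in% c(2009, 2013, 2017), ]

nrow(toy_sub)   # 2
toy_sub$year    # 2009 2013
ncol(toy_sub)   # 3
```

No row has the year 2017 in this toy table, and that is fine: %in% simply returns FALSE for every row that does not match.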
We have seen before how we can use aesthetics like color to distinguish data according to a categorical variable.
ggplot(data = df_late, mapping = aes(x = log10(gdpPerCap), y = educTer, color = factor(year))) +
  geom_point()
There is another way of doing this — namely, we can produce several sub-plots, split according to that categorical variable.
This is done using faceting. Though there are many faceting functions that do slightly different things, the most common is facet_wrap().
In the next plot, we facet by the factor(year) variable (remember: in order to facet, we need to ensure that we’re dealing with a categorical variable).
Unfortunately, faceting functions have somewhat strange notation, and we need to call facet_wrap(~factor(year)), making sure not to leave out the ~. There is a good reason for this notation, related to an R object known as a “formula,” but that is beyond the scope of this course.
ggplot(data = df_late, aes(x = log10(gdpPerCap), y = educTer)) +
  geom_point() +
  facet_wrap(~factor(year))
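To confirm what faceting does to the data, we can inspect which sub-plot each point was sent to. The following is a sketch on made-up data; toy, p and panels are hypothetical names, and layer_data() is the ggplot2 helper that returns the computed data for a layer.

```r
library(ggplot2)

# Made-up toy data: three points for each of three years
toy <- data.frame(
  year = rep(c(2009, 2013, 2017), each = 3),
  x    = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  y    = c(2, 4, 5, 1, 3, 6, 2, 2, 7)
)

p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~factor(year))

# PANEL records which sub-plot each point was assigned to
panels <- layer_data(p, 1)$PANEL
table(panels)  # three panels, three points in each
```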
Sometimes we would like to apply an aesthetic to every point in a graphic, rather than have it determined by a variable.
To do this, we supply the argument, for example color = "red", directly to the corresponding geom_ function, not inside an aes() function.
ggplot(data = df_late, aes(x = log10(gdpPerCap), y = educTer)) +
  geom_point(color = "red") +
  facet_wrap(~factor(year))
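It is worth seeing what goes wrong if the constant is placed inside aes() by mistake. The sketch below compares the two versions on made-up data; toy, p_const and p_mapped are hypothetical names.

```r
library(ggplot2)

# Made-up toy data
toy <- data.frame(x = c(1, 2, 3), y = c(2, 4, 5))

# Constant aesthetic: supplied outside aes(), applied to every point
p_const <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(color = "red")

# Inside aes(), "red" is treated as data (a category with one level),
# so ggplot picks a color from its default scale and adds a legend
p_mapped <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(aes(color = "red"))

unique(layer_data(p_const, 1)$colour)   # "red"
unique(layer_data(p_mapped, 1)$colour)  # a default palette color, not "red"
```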
This sort of thing allows us to distinguish between two different scatter plots, for example.
ggplot(data = df_late, aes(x = log10(gdpPerCap))) +
  geom_point(data = df_late, aes(y = educTer), color = "red") +
  geom_point(data = df_late, aes(y = educPri), color = "blue") +
  facet_wrap(~factor(year))
As a warning, this is not best practice, and it is extremely awkward to make a legend for this kind of incantation! Later in the course, we’ll see the more principled way to do this.
sessionInfo()
## R version 4.0.4 (2021-02-15)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19041)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.5 purrr_0.3.4
## [5] readr_1.4.0 tidyr_1.1.3 tibble_3.1.0 ggplot2_3.3.3
## [9] tidyverse_1.3.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.1.0 xfun_0.22 haven_2.3.1 colorspace_2.0-0
## [5] vctrs_0.3.6 generics_0.1.0 htmltools_0.5.1.1 yaml_2.2.1
## [9] utf8_1.2.1 rlang_0.4.10 pillar_1.5.1 glue_1.4.2
## [13] withr_2.4.1 DBI_1.1.1 dbplyr_2.1.0 modelr_0.1.8
## [17] readxl_1.3.1 lifecycle_1.0.0 munsell_0.5.0 gtable_0.3.0
## [21] cellranger_1.1.0 rvest_1.0.0 evaluate_0.14 labeling_0.4.2
## [25] knitr_1.31 curl_4.3 fansi_0.4.2 highr_0.8
## [29] broom_0.7.5 Rcpp_1.0.6 scales_1.1.1 backports_1.2.1
## [33] jsonlite_1.7.2 farver_2.1.0 fs_1.5.0 hms_1.0.0
## [37] digest_0.6.27 stringi_1.5.3 grid_4.0.4 cli_2.3.1
## [41] tools_4.0.4 magrittr_2.0.1 crayon_1.4.1 pkgconfig_2.0.3
## [45] ellipsis_0.3.1 xml2_1.3.2 reprex_1.0.0 lubridate_1.7.10
## [49] assertthat_0.2.1 rmarkdown_2.7 httr_1.4.2 rstudioapi_0.13
## [53] R6_2.5.0 compiler_4.0.4