We do the same set-up as in the previous lab.

This time, we use the read_csv function instead of read.csv. This is the tidyverse version of that function, and it produces a tidyverse tibble instead of a base R data frame.

library(tidyverse)
df <- read_csv("http://web.stanford.edu/class/stats32/assets/lecture-3/data/worldbank_data_tidy.csv")
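To see the difference between the two readers for yourself, the following sketch writes a tiny made-up CSV to a temporary file and reads it both ways. The file contents and the names toy_path, base_df and tidy_df are invented for illustration and are not part of the lab's data.

```r
library(readr)

# Write a small, made-up CSV to a temporary file (illustrative data only)
toy_path <- tempfile(fileext = ".csv")
writeLines(c("cty_name,year,value", "A,2009,1.5", "B,2010,2.5"), toy_path)

# Base R reader: returns a plain data frame
base_df <- read.csv(toy_path)

# Tidyverse reader: returns a tibble
tidy_df <- read_csv(toy_path)

class(base_df)  # a plain "data.frame"
class(tidy_df)  # includes "tbl_df" as well as "data.frame"
```

Because a tibble is still a data frame underneath, the base R indexing from the previous lab keeps working on it.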

As a reminder, the first few rows of the table are:

head(df)

## # A tibble: 6 x 11
##   cty_name    cty_code  year elecAccess gdpPerCap compEduc educPri educTer
##   <chr>       <chr>    <dbl>      <dbl>     <dbl>    <dbl>   <dbl>   <dbl>
## 1 Afghanistan AFG       2009       45.2     1455.        9      NA      NA
## 2 Afghanistan AFG       2010       42.7     1637.        9      NA      NA
## 3 Afghanistan AFG       2011       43.2     1627.        9      NA      NA
## 4 Afghanistan AFG       2012       69.1     1807.        9      NA      NA
## 5 Afghanistan AFG       2013       70.2     1875.        9      NA      NA
## 6 Afghanistan AFG       2014       89.5     1898.        9      NA      NA
## # ... with 3 more variables: govEducExp <dbl>, popYoung <dbl>, pop <dbl>

Overlaying multiple plots

In the previous lab, we saw the following two plots:

ggplot(data = df, aes(x = factor(compEduc), y = log10(gdpPerCap))) +
    geom_boxplot()

ggplot(data = df, aes(x = factor(compEduc), y = log10(gdpPerCap))) +
    geom_point(alpha = 0.2, position = "jitter")

which present the same data.

The boxplots quickly convey a summary of the data, whereas the jittered scatterplot shows every observation, albeit somewhat messily.

To get the strengths of both of these plots, we might want to overlay them on the same axes.

ggplot is very good at this kind of thing, and we can accomplish it just by adding both the geom_boxplot() and geom_point() layers to the same base plot.

ggplot(data = df, aes(x = factor(compEduc), y = log10(gdpPerCap))) +
    geom_boxplot() + 
    geom_point(alpha = 0.2, position = "jitter")

The order of the layers is important, as later layers are drawn on top of earlier ones:

ggplot(data = df, aes(x = factor(compEduc), y = log10(gdpPerCap))) +
    geom_point(alpha = 0.2, position = "jitter") +
    geom_boxplot()
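One way to convince yourself that the order is recorded is to look at the list of layers stored in the plot object. The sketch below uses mpg, a small dataset bundled with ggplot2, as a stand-in for df so that it runs without a download. The $layers element is an internal detail of ggplot objects, so treat this as a peek under the hood that may differ between ggplot2 versions rather than something to rely on.

```r
library(ggplot2)

# mpg ships with ggplot2; it stands in for df here
boxes_first <- ggplot(mpg, aes(x = factor(cyl), y = hwy)) +
    geom_boxplot() +
    geom_point(alpha = 0.2, position = "jitter")

points_first <- ggplot(mpg, aes(x = factor(cyl), y = hwy)) +
    geom_point(alpha = 0.2, position = "jitter") +
    geom_boxplot()

# Layers are stored in the order they were added; the last one
# in the list is drawn last, and so ends up on top
class(boxes_first$layers[[2]]$geom)[1]   # the point geom is on top
class(points_first$layers[[2]]$geom)[1]  # the boxplot geom is on top
```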

Any variable assignment made by an aes() function inside the base ggplot() call is inherited by each of the geom_ layers added to the plot:

ggplot(data = df, aes(x = factor(compEduc), y = log10(gdpPerCap), color = factor(year))) +
    geom_point(alpha = 0.2, position = "jitter") +
    geom_boxplot()

We may not want this, however: the boxplots in the above graph are far too narrow and hard to read, whereas the colors work well for the scatterplot.

To address this, we can assign variables to specific geom_ layers by including the aes() specification for those variables in the geom_ function rather than in the ggplot() function.

In the following plot, we assign x and y for all layers, but color only for the scatterplot layer.

ggplot(data = df, aes(x = factor(compEduc), y = log10(gdpPerCap))) +
    geom_boxplot() + 
    geom_point(aes(color = factor(year)), alpha = 0.2, position = "jitter")
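To check which assignments are shared and which belong to a single layer, we can inspect the plot object. This is a minimal sketch using the mpg dataset that ships with ggplot2 in place of df. The $mapping and $layers elements are internals of the plot object, so their exact form may vary between ggplot2 versions.

```r
library(ggplot2)

# mpg ships with ggplot2 and has cyl, hwy and year columns
p <- ggplot(mpg, aes(x = factor(cyl), y = hwy)) +
    geom_boxplot() +
    geom_point(aes(color = factor(year)), alpha = 0.2, position = "jitter")

# Assignments made in ggplot() are shared by every layer
names(p$mapping)

# Assignments made inside a geom_ function belong to that layer only
names(p$layers[[1]]$mapping)  # boxplot layer: nothing of its own
names(p$layers[[2]]$mapping)  # point layer: the colour assignment
```

Note that ggplot2 stores the aesthetic under the British spelling colour, whichever spelling we type.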

Labels and themes

As you know well, it is vital to label your axes! ggplot will automatically label each graphical element used (in this case, x, y and color) with the expression assigned to it, such as factor(compEduc). However, this is not necessarily human-readable, so the labs() function allows us to label the graphical elements with something more evocative.

The labs() function also takes an argument named title, which, as you would expect, sets the title of the plot.

ggplot(data = df, aes(x = factor(compEduc), y = log10(gdpPerCap))) +
    geom_boxplot() + 
    geom_point(aes(color = factor(year)), alpha = 0.2, position = "jitter") + 
    labs(
        title = "log-gdp per capita against compulsory years of education",
        x = "Compulsory years of education",
        y = "log10 of GDP per capita",
        color = "Year"
    )

The plot that we have produced now has all of the right data and labeling, but we can still play around with it to make it look pretty.

The overall look of a plot is dictated by its theme, which is added to the plot like any other layer. In the following, we use theme_bw(), a personal preference of mine:

ggplot(data = df, aes(x = factor(compEduc), y = log10(gdpPerCap))) +
    geom_boxplot() + 
    geom_point(aes(color = factor(year)), alpha = 0.2, position = "jitter") + 
    labs(
        title = "log-gdp per capita against compulsory years of education",
        x = "Compulsory years of education",
        y = "log10 of GDP per capita",
        color = "Year"
    ) + 
    theme_bw()

There are many themes available — the lecture slides also list theme_classic() and theme_dark(), but Googling “ggplot themes” will give you all sorts of options.

Faceting

For the remainder of this lab, we will concentrate on three specific years — 2009, 2013 and 2017.

Recall that to do this, we must create a vector with values TRUE at those indices where the corresponding row has one of these years, and FALSE everywhere else.

We accomplish this using the %in% operator, which checks whether each value on its left is contained in the vector on its right. Play around with this operator, and convince yourself that the following command produces the correct vector:

df$year %in% c(2009, 2013, 2017)
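A small experiment with a made-up vector of years (not taken from the dataset) shows how the operator behaves:

```r
# A few made-up years to test against the target set
years <- c(2009, 2010, 2013, 2016, 2017)

# One TRUE/FALSE per element of the left-hand vector
keep <- years %in% c(2009, 2013, 2017)
keep  # TRUE FALSE TRUE FALSE TRUE

# The result always has the same length as the left-hand vector
length(keep) == length(years)

# Unlike ==, %in% never returns NA for a missing value on the left
NA %in% c(2009, 2013, 2017)  # FALSE
NA == 2009                   # NA
```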

To select the correct rows, we use our usual two-dimensional indexing with this vector:

df_late <- df[df$year %in% c(2009, 2013, 2017), ]
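The same indexing pattern can be tried on a toy data frame, where it is easy to check the result by eye. The data frame toy and its values are invented for this sketch.

```r
# A made-up data frame with four rows
toy <- data.frame(
    cty  = c("A", "A", "B", "B"),
    year = c(2009, 2010, 2013, 2014)
)

# Logical vector: TRUE for the rows we want to keep
keep <- toy$year %in% c(2009, 2013, 2017)

# Rows go before the comma; leaving the column slot empty keeps all columns
toy_sub <- toy[keep, ]

sum(keep)      # number of TRUE values: 2
nrow(toy_sub)  # the subset has the same number of rows: 2
toy_sub$year   # 2009 2013
```

A year in the target set that never occurs in the data, such as 2017 here, simply matches no rows.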

We have seen before how we can use aesthetics like color to distinguish data according to a categorical variable.

ggplot(data = df_late, mapping = aes(x = log10(gdpPerCap), y = educTer, color = factor(year))) + 
    geom_point()

There is another way of doing this — namely, we can produce several sub-plots, split according to that categorical variable.

This is done using faceting. Though there are many faceting functions that do slightly different things, the most common is facet_wrap().

In the next plot, we facet by the factor(year) variable (remember — in order to facet, we need to ensure that we’re dealing with a categorical variable).

Unfortunately, faceting functions have somewhat strange notation, and we need to call facet_wrap(~factor(year)), making sure not to leave out the ~. There is a good reason for this notation, related to an R object known as a “formula,” but that is beyond the scope of this course.

ggplot(data = df_late, aes(x = log10(gdpPerCap), y = educTer)) + 
    geom_point() + 
    facet_wrap(~factor(year))
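Without going into the details of formulas, a quick look in base R shows what the ~ does: it captures the expression to its right without evaluating it. The name f is just a placeholder for this sketch.

```r
# The ~ captures the expression to its right without evaluating it,
# so no data frame is involved at this point
f <- ~factor(year)

class(f)     # "formula"
all.vars(f)  # "year"
```

facet_wrap() receives this unevaluated expression and evaluates it later against the plot's data, which is why leaving out the ~ does not work.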

Non-variable aesthetics

Sometimes we would like to apply an aesthetic to every point in a graphic, rather than assign it to a variable.

To do this, we supply the argument (for example, color = "red") directly to the corresponding geom_ function, not inside an aes() function.

ggplot(data = df_late, aes(x = log10(gdpPerCap), y = educTer)) + 
    geom_point(color = "red") + 
    facet_wrap(~factor(year))
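A common slip is to put the constant inside aes() anyway. The sketch below, which uses ggplot2's bundled mpg dataset rather than df_late, compares the two with layer_data(), a ggplot2 helper that returns the data a layer actually draws.

```r
library(ggplot2)

# Constant supplied directly to the geom: every point is red
fixed <- ggplot(mpg, aes(x = displ, y = hwy)) +
    geom_point(color = "red")

# Constant wrapped in aes(): "red" is treated as a one-level
# categorical variable and given a color from the default palette
mapped <- ggplot(mpg, aes(x = displ, y = hwy)) +
    geom_point(aes(color = "red"))

unique(layer_data(fixed)$colour)   # "red"
unique(layer_data(mapped)$colour)  # a default palette color, not "red"
```

In the second plot, ggplot also adds a legend with the single entry "red", which is a good hint that the value was treated as data.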

This sort of thing allows us to distinguish between two different scatter plots, for example.

ggplot(data = df_late, aes(x = log10(gdpPerCap))) + 
    geom_point(aes(y = educTer), color = "red") + 
    geom_point(aes(y = educPri), color = "blue") + 
    facet_wrap(~factor(year))

As a warning, this is not best practice, and it is extremely awkward to make a legend for this kind of incantation! Later in the course, we’ll see the more principled way to do this.

Session info

sessionInfo()

## R version 4.0.4 (2021-02-15)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19041)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] forcats_0.5.1   stringr_1.4.0   dplyr_1.0.5     purrr_0.3.4    
## [5] readr_1.4.0     tidyr_1.1.3     tibble_3.1.0    ggplot2_3.3.3  
## [9] tidyverse_1.3.0
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.1.0  xfun_0.22         haven_2.3.1       colorspace_2.0-0 
##  [5] vctrs_0.3.6       generics_0.1.0    htmltools_0.5.1.1 yaml_2.2.1       
##  [9] utf8_1.2.1        rlang_0.4.10      pillar_1.5.1      glue_1.4.2       
## [13] withr_2.4.1       DBI_1.1.1         dbplyr_2.1.0      modelr_0.1.8     
## [17] readxl_1.3.1      lifecycle_1.0.0   munsell_0.5.0     gtable_0.3.0     
## [21] cellranger_1.1.0  rvest_1.0.0       evaluate_0.14     labeling_0.4.2   
## [25] knitr_1.31        curl_4.3          fansi_0.4.2       highr_0.8        
## [29] broom_0.7.5       Rcpp_1.0.6        scales_1.1.1      backports_1.2.1  
## [33] jsonlite_1.7.2    farver_2.1.0      fs_1.5.0          hms_1.0.0        
## [37] digest_0.6.27     stringi_1.5.3     grid_4.0.4        cli_2.3.1        
## [41] tools_4.0.4       magrittr_2.0.1    crayon_1.4.1      pkgconfig_2.0.3  
## [45] ellipsis_0.3.1    xml2_1.3.2        reprex_1.0.0      lubridate_1.7.10 
## [49] assertthat_0.2.1  rmarkdown_2.7     httr_1.4.2        rstudioapi_0.13  
## [53] R6_2.5.0          compiler_4.0.4

04-Data Visualization part 2

Damian Pavlyshyn

Apr 15, 2021
