STATS 32 Session 3: Data Visualization with ggplot

Damian Pavlyshyn

April 21

http://web.stanford.edu/class/stats32/lectures/

Final project

Goal: Demonstrate that you know how to do data analysis in R

Can be done individually or in a pair.

Minimum requirements:

Project proposal

Recap of session 2: Data tables

A standard-form data table is a matrix of values where

In this lecture we will see how to use a data table to gain insight into the variables (corresponding to columns) and how they relate to each other.

This is essentially a definition of data presentation.

Data presentation

We start with a big and unwieldy table of numbers. How do we extract useful information from it?

Numerical summaries

Try this out on some vectors and dataframes

summary(efficiency)
##       mpg        cylinders     weight       horsepower         engine  
##  Min.   :10.40   4:11      Min.   :1513   Min.   : 52.0   V-shaped:18  
##  1st Qu.:15.43   6: 7      1st Qu.:2581   1st Qu.: 96.5   straight:14  
##  Median :19.20   8:14      Median :3325   Median :123.0                
##  Mean   :20.09             Mean   :3217   Mean   :146.7                
##  3rd Qu.:22.80             3rd Qu.:3610   3rd Qu.:180.0                
##  Max.   :33.90             Max.   :5424   Max.   :335.0                
##     transmission gears 
##  automatic:19    3:15  
##  manual   :13    4:12  
##                  5: 5  
##                        
##                        
## 
head(efficiency)
## # A tibble: 6 x 7
##     mpg cylinders weight horsepower engine   transmission gears
##   <dbl> <fct>      <dbl>      <dbl> <fct>    <fct>        <fct>
## 1  21   6           2620        110 V-shaped manual       4    
## 2  21   6           2875        110 V-shaped manual       4    
## 3  22.8 4           2320         93 straight manual       4    
## 4  21.4 6           3215        110 straight automatic    3    
## 5  18.7 8           3440        175 V-shaped automatic    3    
## 6  18.1 6           3460        105 straight automatic    3
sd(efficiency$weight)
## [1] 978.4574
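
A minimal sketch of the same summaries on data that ships with R. The vector x is made up for illustration, and the built-in mtcars table stands in for efficiency, which is not constructed in these notes:

```r
# A small made-up numeric vector (illustrative values only)
x <- c(2, 4, 4, 5, 10)

summary(x)   # minimum, quartiles, median, mean and maximum
mean(x)      # arithmetic mean
sd(x)        # sample standard deviation

# The same functions work on a data frame, column by column
summary(mtcars[, c("mpg", "wt", "hp")])
head(mtcars, 3)   # first three rows
sd(mtcars$wt)     # standard deviation of a single column
```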

Plots and visualizations

Each (variable) column specifies a graphical element of a plot

The layered grammar of graphics

Ingredients of a plot:

ggplot(data = efficiency, aes(x = weight, y = mpg, color = transmission)) +
    geom_point(size = 3) +
    ggtitle("Fuel efficiency vs vehicle weight")

ggplot(data = efficiency, aes(x = cylinders, fill = transmission)) +
    geom_bar() +
    coord_flip()

ggplot(data = efficiency, aes(x = horsepower)) +
    geom_histogram(bins = 10) +
    geom_freqpoly(aes(color = engine), bins = 10)

Two classes of variables in statistics

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Notice that the number of cylinders is stored as a number, not a factor, so it is treated as a continuous variable.

ggplot(data = mtcars, aes(x = wt, y = mpg, color = cyl)) +
    geom_point(size = 3)

But the only cylinder counts are 4, 6 and 8, so we probably want to treat them as discrete. After all, the graphic above has a color designation for 3.56 cylinders, which isn’t at all useful!

ggplot(data = mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
    geom_point(size = 3)

By converting the number of cylinders to the factor type, R now knows to treat it as a discrete variable and the resulting plot makes much more sense!

The factor type

R has an additional type, factor, that it uses to record a discrete variable with finitely many possible values, called “levels”.

mtcars$cyl
##  [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
factor(mtcars$cyl)
##  [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
## Levels: 4 6 8

We can add levels that do not appear in the data vector, to record values that are possible but were not observed:

factor(mtcars$cyl, levels = c(4, 6, 8, 12))
##  [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
## Levels: 4 6 8 12
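
One place where unused levels matter is counting. The sketch below (variable names are chosen here for illustration) shows that table() reports every level, including those with no observations:

```r
# Factor with only the observed levels
cyl_observed <- factor(mtcars$cyl)

# Factor that also allows 12 cylinders, which never occurs in mtcars
cyl_possible <- factor(mtcars$cyl, levels = c(4, 6, 8, 12))

levels(cyl_observed)    # the three observed levels
nlevels(cyl_possible)   # four levels, including the unused one

# table() counts every level, so the unused level shows up with count 0
table(cyl_possible)
```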

The levels of a factor are what matter, so we can supply arbitrary labels for them:

weekdays <- factor(
    c(1, 1, 2, 3, 6, 7, 1, 2, 4, 5, 5, 2, 1),
    levels = c(1, 2, 3, 4, 5, 6, 7),
    labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
    ordered = TRUE)

weekdays
##  [1] Mon Mon Tue Wed Sat Sun Mon Tue Thu Fri Fri Tue Mon
## Levels: Mon < Tue < Wed < Thu < Fri < Sat < Sun

We’ve also supplied the ordered argument to factor(). This lets R know about the order of the days of the week so that they can (for example) be plotted in the right order automatically.
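
An ordered factor also supports comparisons between levels. A short sketch, rebuilding the same weekdays vector so that it runs on its own:

```r
weekdays <- factor(
    c(1, 1, 2, 3, 6, 7, 1, 2, 4, 5, 5, 2, 1),
    levels = 1:7,
    labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
    ordered = TRUE)

is.ordered(weekdays)      # TRUE because ordered = TRUE was supplied
weekdays >= "Sat"         # element-wise comparison against a level
sum(weekdays >= "Sat")    # number of weekend entries
max(weekdays)             # latest day of the week that appears
```

These comparisons are only defined for ordered factors; on an unordered factor, R returns NA with a warning.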

We can also easily relabel factors:

levels(weekdays) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")

weekdays
##  [1] Monday    Monday    Tuesday   Wednesday Saturday  Sunday    Monday   
##  [8] Tuesday   Thursday  Friday    Friday    Tuesday   Monday   
## 7 Levels: Monday < Tuesday < Wednesday < Thursday < Friday < ... < Sunday

Barplots: counts for a categorical variable

What is the distribution of cylinders in my dataset?

ggplot(data = efficiency, aes(x = cylinders)) +
    geom_bar() +
    ggtitle("Count by cylinders") +
    xlab("No. of cylinders")

Histograms: counts for a continuous variable

What is the distribution of miles per gallon in my dataset?

ggplot(data = efficiency, aes(x = mpg)) + 
    geom_histogram() +
    ggtitle("Histogram of miles per gallon")

Not ideal: too many bins, which defeats the purpose of a histogram. We can manually specify the bins using the breaks option.

ggplot(data = efficiency, aes(x = mpg)) + 
    geom_histogram(breaks = seq(10, 35, 5)) +
    ggtitle("Histogram of miles per gallon")
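
As an alternative sketch, the same 5-mpg-wide bins can be requested by width rather than by listing every break. This uses the built-in mtcars table in place of efficiency (an assumption made so the example runs on its own):

```r
library(ggplot2)

# binwidth sets the width of each bin; boundary pins one bin edge at 10,
# so the bins run 10-15, 15-20, ..., 30-35
p_hist <- ggplot(data = mtcars, aes(x = mpg)) +
    geom_histogram(binwidth = 5, boundary = 10) +
    ggtitle("Histogram of miles per gallon")

p_hist
```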

Scatterplots: continuous variable vs. continuous variable

What is the relationship between mpg and weight?

ggplot(data = efficiency, aes(y = mpg, x = weight)) + 
    geom_point(size = 2) + 
    ggtitle("Miles per gallon vs. weight")

Lineplots: continuous variable vs. time variable

What is the relationship between mpg and time?

We will plot the yearly mean mpg against the year. To create the corresponding table, we use the following code, which we will explain in later lectures.

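library(dplyr)        # provides %>%, group_by() and summarize()
library(fueleconomy)  # provides the vehicles table
data(vehicles)
# one row per year, holding the mean highway mpg of that year's vehicles
mpg <- vehicles %>%
    group_by(year) %>%
    summarize(`mean highway mpg` = mean(hwy))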

head(mpg)
## # A tibble: 6 x 2
##    year `mean highway mpg`
##   <dbl>              <dbl>
## 1  1984               19.1
## 2  1985               23.0
## 3  1986               22.7
## 4  1987               22.4
## 5  1988               22.7
## 6  1989               22.5

Now, we make our usual scatterplot:

ggplot(data = mpg, aes(y = `mean highway mpg`, x = year)) +
    geom_point() +
    ggtitle("Mean highway mpg by year")

Hmmm, not so good…

Let’s replace geom_point with geom_line:

ggplot(data = mpg, aes(y = `mean highway mpg`, x = year)) +
    geom_line() +
    ggtitle("Mean highway mpg by year")

Boxplots & violin plots: continuous variable vs. categorical variable

For each number of cylinders, what does the distribution of mpg look like?

p <- ggplot(data = efficiency, aes(x = cylinders, y = mpg)) +
    ggtitle("Distribution of mpg by cylinders")

We can store parts of a plot as a variable and re-use it with different layers:
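
A minimal sketch of this pattern, using the built-in mtcars table as a stand-in for efficiency (with cyl converted to a factor); the names p_box and p_violin are chosen here for illustration:

```r
library(ggplot2)

# The shared part of the plot: data, aesthetics and labels, but no geom yet
p <- ggplot(data = mtcars, aes(x = factor(cyl), y = mpg)) +
    ggtitle("Distribution of mpg by cylinders") +
    xlab("No. of cylinders")

# Re-use the stored plot with two different layers
p_box <- p + geom_boxplot()     # one boxplot per cylinder group
p_violin <- p + geom_violin()   # one violin per cylinder group

p_box
p_violin
```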

Position: Arranging bar plots

p <- ggplot(data = efficiency, aes(x = cylinders, fill = engine)) +
    ggtitle("Count by cylinders") +
    xlab("No. of cylinders")

In a bar plot, we have different ways of arranging the bars:
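
The sketch below shows three of ggplot's built-in position adjustments for bars. Whether these are exactly the arrangements listed on the original slide is an assumption, and mtcars (with vs as the engine type) stands in for efficiency:

```r
library(ggplot2)

p <- ggplot(data = mtcars, aes(x = factor(cyl), fill = factor(vs))) +
    ggtitle("Count by cylinders") +
    xlab("No. of cylinders")

p_stack <- p + geom_bar(position = "stack")   # bars on top of each other (the default)
p_dodge <- p + geom_bar(position = "dodge")   # bars side by side
p_fill <- p + geom_bar(position = "fill")     # stacked and rescaled to proportions

p_stack
p_dodge
p_fill
```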

Position: Seeing obscured data

p <- ggplot(data = efficiency, aes(x = cylinders, y = mpg)) +
    ggtitle("mpg by cylinders")

Often, points will obscure one another and we need to move them out of the way to see what’s going on.
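
A sketch of one way to do this, using jittering. Again mtcars stands in for efficiency, and the width and seed values are arbitrary choices made for illustration:

```r
library(ggplot2)

p <- ggplot(data = mtcars, aes(x = factor(cyl), y = mpg)) +
    ggtitle("mpg by cylinders")

# Without adjustment, points in the same cylinder group sit on one vertical line
p_overlap <- p + geom_point()

# Jitter horizontally only (height = 0), so the mpg values are left untouched;
# the seed makes the random offsets reproducible
p_jitter <- p +
    geom_point(position = position_jitter(width = 0.2, height = 0, seed = 1))

p_overlap
p_jitter
```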

Common graphical specifications

Aesthetics

These aesthetics are shared by many different geoms and so are good to know off the top of your head.

Some geoms have special aesthetics - these are usually documented in the help file for the corresponding geom.
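
As a small illustration with a few widely used aesthetics (color, shape, size and alpha), the sketch below contrasts mapping an aesthetic to a variable inside aes() with setting it to a fixed value outside aes(). The choice of mtcars and of these particular aesthetics is made here for illustration:

```r
library(ggplot2)

p_aes <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
    geom_point(
        # mapped: color and shape vary with the data
        aes(color = factor(cyl), shape = factor(am)),
        # set: every point gets the same size and transparency
        size = 3, alpha = 0.7)

p_aes
```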

Geoms

We’ve gone over many of these in the previous slides, but they’re assembled in this list for reference

Shapes in R

Colors in R

Combining multiple plots into one graphic: layers

Each layer contains (essentially):

ggplot() +
    geom_boxplot(data = efficiency, aes(x = cylinders, y = mpg)) +
    geom_point(data = efficiency, aes(x = cylinders, y = mpg), position = "jitter")

Often, these are shared between layers, and can be inherited from the ggplot() function call to save time and minimize errors:

ggplot(data = efficiency, aes(x = cylinders, y = mpg)) +
    geom_boxplot() +
    geom_point(position = "jitter")

The convention is to always pass arguments that are common to all elements of the graphic to ggplot(), and arguments that are specific to certain plots to their corresponding geom_ function.

This is efficient, and makes your intentions clearer for anyone reading your code (remember, in 99% of cases, this will be you!)

A more complicated multi-layer example

vehicles
## # A tibble: 33,442 x 12
##       id make  model   year class   trans   drive    cyl displ fuel    hwy   cty
##    <dbl> <chr> <chr>  <dbl> <chr>   <chr>   <chr>  <dbl> <dbl> <chr> <dbl> <dbl>
##  1 13309 Acura 2.2CL…  1997 Subcom… Automa… Front…     4   2.2 Regu…    26    20
##  2 13310 Acura 2.2CL…  1997 Subcom… Manual… Front…     4   2.2 Regu…    28    22
##  3 13311 Acura 2.2CL…  1997 Subcom… Automa… Front…     6   3   Regu…    26    18
##  4 14038 Acura 2.3CL…  1998 Subcom… Automa… Front…     4   2.3 Regu…    27    19
##  5 14039 Acura 2.3CL…  1998 Subcom… Manual… Front…     4   2.3 Regu…    29    21
##  6 14040 Acura 2.3CL…  1998 Subcom… Automa… Front…     6   3   Regu…    26    17
##  7 14834 Acura 2.3CL…  1999 Subcom… Automa… Front…     4   2.3 Regu…    27    20
##  8 14835 Acura 2.3CL…  1999 Subcom… Manual… Front…     4   2.3 Regu…    29    21
##  9 14836 Acura 2.3CL…  1999 Subcom… Automa… Front…     6   3   Regu…    26    17
## 10 11789 Acura 2.5TL   1995 Compac… Automa… Front…     5   2.5 Prem…    23    18
## # … with 33,432 more rows
mpg
## # A tibble: 32 x 2
##     year `mean highway mpg`
##    <dbl>              <dbl>
##  1  1984               19.1
##  2  1985               23.0
##  3  1986               22.7
##  4  1987               22.4
##  5  1988               22.7
##  6  1989               22.5
##  7  1990               22.3
##  8  1991               22.3
##  9  1992               22.4
## 10  1993               22.8
## # … with 22 more rows

Let’s make a plot of mpg vs year. We’ll include a series of boxplots from the vehicles table, and overlay a line showing the yearly average.

ggplot(mapping = aes(x = year)) +
    geom_boxplot(
        data = vehicles,
        aes(group = factor(year), y = hwy)) +
    geom_line(
        data = mpg,
        aes(y = `mean highway mpg`),
        color = "red", size = 2) +
    labs(
        title = "Change in highway mpg over time",
        y = "Highway mpg"
    )

Things to note in this plot:

Combining multiple plots into one graphic: facets

Instead of seeing plots overlaid on top of each other, we might want to see them side-by-side.

The facet_wrap() and facet_grid() functions allow us to split our data into several side-by-side plots according to some variable:

ggplot(efficiency, aes(x = weight, y = mpg, color = factor(cylinders))) +
    geom_point() +
    facet_wrap(vars(factor(gears)))

We can go crazy with these options, making our graphic display a great deal of information:

ggplot(
    data = efficiency,
    mapping = aes(
        x = weight,
        y = mpg,
        fill = factor(cylinders),
        size = horsepower)) +
    geom_point(color = "black", alpha = 0.5, shape = 21) + # outlined circle
    facet_grid(
        cols = vars(transmission),
        rows = vars(engine),
        margins = TRUE, # include unfaceted scatter plots on the edges
        labeller = label_both) # include the name of the faceting variables

It’s easy to go overboard with this, though, so be careful not to overdo it - the above graphic is fun and useful for general data exploration, but is getting so busy that it no longer makes a single coherent point.

Graphical customization

Until now, we’ve only been specifying the logic of our plots and letting R do the actual laying out.

This is a good thing: the people that designed ggplot are graphic designers and data visualization experts. You (probably) are not.

However, if you do want something specific, ggplot has extensive customization options available.

Themes

These determine the look of everything that isn’t data. That is, the axes, labels, gridlines, background, …

A full list of themes can be found on this page: https://ggplot2.tidyverse.org/reference/ggtheme.html

p <- ggplot(data = efficiency, aes(x = weight, y = mpg, color = cylinders)) +
    ggtitle("mpg by weight") +
    geom_point()
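
A sketch of applying a few of the complete themes from that page to a stored plot. The plot is rebuilt from mtcars so that the example runs on its own, and the particular themes shown are an arbitrary selection:

```r
library(ggplot2)

p <- ggplot(data = mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
    ggtitle("mpg by weight") +
    geom_point()

p_bw <- p + theme_bw()             # white background with a border
p_minimal <- p + theme_minimal()   # no background or border

# theme() tweaks individual elements on top of a complete theme
p_classic <- p + theme_classic() + theme(legend.position = "bottom")

p_bw
p_minimal
p_classic
```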

Axis scales

Axis scales are adjusted with the scale_x_continuous() and scale_y_continuous() functions. Many examples are given here: https://ggplot2.tidyverse.org/reference/scale_continuous.html
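
A minimal sketch of customizing both axes. The axis names, breaks and limits are illustrative choices, and mtcars (where wt is recorded in thousands of pounds) stands in for efficiency:

```r
library(ggplot2)

p_scaled <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
    geom_point() +
    # name, tick positions and range of the x axis
    scale_x_continuous(
        name = "Weight (1000 lbs)",
        breaks = seq(1, 6, 1),
        limits = c(1, 6)) +
    # name and tick positions of the y axis
    scale_y_continuous(
        name = "Miles per gallon",
        breaks = seq(10, 35, 5))

p_scaled
```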

Other customizations

Generally, each aesthetic has a corresponding scale_ function for customization. I’ve provided links to several of them here, as they work similarly to the functions that we’ve seen already:
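
For instance, the color aesthetic of a discrete variable can be customized with scale_color_manual(). The sketch below is one such case; the color names and legend title are arbitrary choices, and mtcars stands in for efficiency:

```r
library(ggplot2)

p_colors <- ggplot(data = mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
    geom_point(size = 3) +
    # one color per factor level, matched by level name
    scale_color_manual(
        name = "Cylinders",
        values = c("4" = "steelblue", "6" = "darkorange", "8" = "firebrick"))

p_colors
```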

You can easily find more with some light googling.

Today’s dataset: World Bank data

(Source: flickr and World Bank)

DataBank homepage

Interface for World Development Indicators