ggplot
Damian Pavlyshyn
Goal: Demonstrate that you know how to do data analysis in R
Can be done individually or in a pair.
Minimum requirements:
A standard-form data table is a matrix of values where
In this lecture we will see how to use a data table to gain insight about the variables (corresponding to columns) and how they relate to each other.
This is essentially a definition of data presentation.
We start with a big and unwieldy table of numbers. How do we extract useful information from it?
Try this out on some vectors and data frames:
str, summary
head, tail
names, dim, nrow, ncol
mean, median, sd, var
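As a minimal sketch of how these functions behave, the block below applies them to a small made-up vector x (an assumption, not data from the lecture) and to R's built-in mtcars data frame.

```r
# A small numeric vector (hypothetical values) and a built-in data frame.
x <- c(2, 4, 4, 5, 7, 9)

# Numeric summaries of a vector
mean(x)
median(x)
sd(x)
var(x)        # the variance is the square of the standard deviation
summary(x)    # minimum, quartiles, mean and maximum in one call

# Structure and shape of a data frame
str(mtcars)       # type and first few values of every column
head(mtcars, 3)   # first three rows
tail(mtcars, 3)   # last three rows
names(mtcars)     # column names
dim(mtcars)       # number of rows and columns
nrow(mtcars)
ncol(mtcars)
```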
## mpg cylinders weight horsepower engine
## Min. :10.40 4:11 Min. :1513 Min. : 52.0 V-shaped:18
## 1st Qu.:15.43 6: 7 1st Qu.:2581 1st Qu.: 96.5 straight:14
## Median :19.20 8:14 Median :3325 Median :123.0
## Mean :20.09 Mean :3217 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:3610 3rd Qu.:180.0
## Max. :33.90 Max. :5424 Max. :335.0
## transmission gears
## automatic:19 3:15
## manual :13 4:12
## 5: 5
##
##
##
## # A tibble: 6 x 7
## mpg cylinders weight horsepower engine transmission gears
## <dbl> <fct> <dbl> <dbl> <fct> <fct> <fct>
## 1 21 6 2620 110 V-shaped manual 4
## 2 21 6 2875 110 V-shaped manual 4
## 3 22.8 4 2320 93 straight manual 4
## 4 21.4 6 3215 110 straight automatic 3
## 5 18.7 8 3440 175 V-shaped automatic 3
## 6 18.1 6 3460 105 straight automatic 3
## [1] 978.4574
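The outputs above look like summary(), head() and sd() applied to a table called efficiency. The lecture does not show how that table was built, so the following is only a sketch under the assumption that it is derived from the built-in mtcars data (weight in pounds, 0/1 codes turned into labelled factors). It uses a plain data frame rather than a tibble.

```r
# Assumed reconstruction of `efficiency` from mtcars (not the lecture's own code).
efficiency <- with(mtcars, data.frame(
  mpg          = mpg,
  cylinders    = factor(cyl),
  weight       = wt * 1000,   # mtcars stores weight in thousands of pounds
  horsepower   = hp,
  engine       = factor(vs, levels = c(0, 1), labels = c("V-shaped", "straight")),
  transmission = factor(am, levels = c(0, 1), labels = c("automatic", "manual")),
  gears        = factor(gear)
))

summary(efficiency)     # per-column summaries; factors are shown as counts
head(efficiency)        # first six rows
sd(efficiency$weight)   # standard deviation of a single column
```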
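Each (variable) column specifies a graphical element of a plot.
Ingredients of a plot:
Data and aesthetic mapping: supplied to ggplot with the aes (aesthetic) function
Geometric objects: drawn by functions with the geom_ prefix
Coordinate system: set by functions with the coord_ prefix
## 'data.frame': 32 obs. of 11 variables: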
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Notice that the number of cylinders is a number, not a factor, so it is treated as a continuous variable.
But the only cylinder numbers are 4, 6 and 8, so we probably want to treat them as discrete. After all, the above graphic has a color designation for 3.56 cylinders, which isn’t at all useful!
By converting the number of cylinders to the factor type, R now knows to treat it as a discrete variable and the resulting plot makes much more sense!
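The two plots described here were lost from the slides. A sketch of what they might have looked like, assuming a scatter plot of mpg against wt from mtcars colored by cyl (the choice of axes is a guess):

```r
library(ggplot2)

# cyl is numeric, so ggplot uses a continuous color gradient
p_continuous <- ggplot(mtcars, aes(x = wt, y = mpg, color = cyl)) +
  geom_point()

# Converting to a factor gives one distinct color per cylinder count
p_discrete <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point()

# Type p_continuous or p_discrete at the console to draw each plot.
```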
R has an additional type, factor, that it uses to record a discrete variable with finitely many possible values, called “levels”.
## [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
## [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
## Levels: 4 6 8
We can add additional levels that are not represented in the data vector, to allow for values that are possible but were not observed:
## [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
## Levels: 4 6 8 12
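The code behind these outputs is not shown. One way to produce them, assuming the vector is the cyl column of mtcars, is the levels argument of factor():

```r
# Plain numeric vector of cylinder counts
mtcars$cyl

# Levels are inferred from the values that occur in the data
factor(mtcars$cyl)

# Levels given explicitly, including one (12) that never occurs
cyl <- factor(mtcars$cyl, levels = c(4, 6, 8, 12))
levels(cyl)
table(cyl)   # the unused level is still reported, with a count of zero
```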
The levels of the factors are the important parts, so we can supply arbitrary labels:
weekdays <- factor(
c(1, 1, 2, 3, 6, 7, 1, 2, 4, 5, 5, 2, 1),
levels = c(1, 2, 3, 4, 5, 6, 7),
labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
ordered = TRUE)
weekdays
## [1] Mon Mon Tue Wed Sat Sun Mon Tue Thu Fri Fri Tue Mon
## Levels: Mon < Tue < Wed < Thu < Fri < Sat < Sun
We’ve also supplied the ordered argument to factor(). This lets R know about the order of the days of the week so that they can (for example) be plotted in the right order automatically.
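Besides plotting, an ordered factor also supports comparisons between levels. A short sketch that rebuilds the same weekdays factor and compares it against a level:

```r
weekdays <- factor(
  c(1, 1, 2, 3, 6, 7, 1, 2, 4, 5, 5, 2, 1),
  levels = c(1, 2, 3, 4, 5, 6, 7),
  labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
  ordered = TRUE)

is.ordered(weekdays)    # TRUE for factors created with ordered = TRUE
weekdays < "Thu"        # element-wise comparison using the level order
sum(weekdays < "Thu")   # how many observations fall before Thursday
max(weekdays)           # the latest day that occurs in the data
```

These comparisons are only defined for ordered factors; for an unordered factor R returns NA with a warning.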
We can also easily relabel factors:
levels(weekdays) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
weekdays
## [1] Monday Monday Tuesday Wednesday Saturday Sunday Monday
## [8] Tuesday Thursday Friday Friday Tuesday Monday
## 7 Levels: Monday < Tuesday < Wednesday < Thursday < Friday < ... < Sunday
What is the distribution of cylinders in my dataset?
What is the distribution of miles per gallon in my dataset?
ggplot(data = efficiency, aes(x = mpg)) +
geom_histogram() +
ggtitle("Histogram of miles per gallon")
Not ideal: too many bins, which defeats the purpose of a histogram. We can manually specify the bins using the breaks option.
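A sketch of the breaks option follows. The bin edges (every 5 mpg from 10 to 35) are an illustrative choice, and mtcars stands in for the efficiency table, whose construction is not shown in the lecture.

```r
library(ggplot2)

efficiency <- mtcars   # stand-in: only the mpg column is needed here

# breaks gives the bin edges explicitly: 10, 15, ..., 35
p_hist <- ggplot(data = efficiency, aes(x = mpg)) +
  geom_histogram(breaks = seq(10, 35, by = 5)) +
  ggtitle("Histogram of miles per gallon")

# Type p_hist at the console to draw the plot.
```

The edges should cover the whole range of the data, since values outside the outermost breaks are dropped from the histogram.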
What is the relationship between mpg and weight?
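The plot for this slide is missing. A scatter plot is the natural choice for two continuous variables; the sketch below assumes that the weight column is the wt column of mtcars converted to pounds.

```r
library(ggplot2)

# Assumed stand-in for the lecture's efficiency table
efficiency <- transform(mtcars, weight = wt * 1000)

p_scatter <- ggplot(data = efficiency, aes(x = weight, y = mpg)) +
  geom_point() +
  ggtitle("mpg by weight")

# Type p_scatter at the console to draw the plot.
```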
What is the relationship between mpg and time?
We will plot the yearly mean mpg against the year. To create the corresponding table, we use the following code, which we will explain in later lectures.
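library(dplyr)        # provides %>%, group_by() and summarize()
library(fueleconomy)  # provides the vehicles data set
data(vehicles)
# Average highway mpg for each model year
mpg <- vehicles %>%
  group_by(year) %>%
  summarize(`mean highway mpg` = mean(hwy))
head(mpg)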
## # A tibble: 6 x 2
## year `mean highway mpg`
## <dbl> <dbl>
## 1 1984 19.1
## 2 1985 23.0
## 3 1986 22.7
## 4 1987 22.4
## 5 1988 22.7
## 6 1989 22.5
Now, we make our usual scatterplot:
ggplot(data = mpg, aes(y = `mean highway mpg`, x = year)) +
geom_point() +
ggtitle("Mean highway mpg by year")
Hmmm, not so good…
Let’s replace geom_point with geom_line:
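The replacement code was lost from the slides. A sketch of the change is below. To stay self-contained it rebuilds only the first six rows of the table by hand, using the rounded values printed by head(mpg) above, while the real table has 32 rows.

```r
library(ggplot2)

# First six rows only, copied from the printed (rounded) output
mpg <- data.frame(
  year = 1984:1989,
  `mean highway mpg` = c(19.1, 23.0, 22.7, 22.4, 22.7, 22.5),
  check.names = FALSE   # keep the column name with spaces
)

# Same plot as before, with geom_line in place of geom_point
p_line <- ggplot(data = mpg, aes(y = `mean highway mpg`, x = year)) +
  geom_line() +
  ggtitle("Mean highway mpg by year")

# Type p_line at the console to draw the plot.
```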
For each value of cylinder, what is the distribution of mpg like?
p <- ggplot(data = efficiency, aes(x = cylinders, y = mpg)) +
ggtitle("Distribution of mpg by cylinders")
We can store parts of a plot as a variable and re-use it with different layers:
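A sketch of this reuse, under the assumption that cylinders is the cyl column of mtcars stored as a factor. The particular geoms added here are illustrative choices.

```r
library(ggplot2)

# Assumed stand-in for the lecture's efficiency table
efficiency <- transform(mtcars, cylinders = factor(cyl))

# The shared part: data, aesthetics and title, but no layers yet
p <- ggplot(data = efficiency, aes(x = cylinders, y = mpg)) +
  ggtitle("Distribution of mpg by cylinders")

# The same base combined with different layers
p_box    <- p + geom_boxplot()
p_violin <- p + geom_violin()
p_both   <- p + geom_boxplot() + geom_point(position = "jitter")

# Type p_box, p_violin or p_both at the console to draw each plot.
```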
p <- ggplot(data = efficiency, aes(x = cylinders, fill = engine)) +
ggtitle("Count by cylinders") +
xlab("No. of cylinders")
In a bar plot, we have different ways of arranging the bars:
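The arrangements themselves are missing from the slides. In ggplot2 they are chosen with the position argument of geom_bar(). The sketch below shows three common values, with efficiency rebuilt from mtcars as an assumption.

```r
library(ggplot2)

# Assumed stand-in for the lecture's efficiency table
efficiency <- transform(
  mtcars,
  cylinders = factor(cyl),
  engine    = factor(vs, levels = c(0, 1), labels = c("V-shaped", "straight"))
)

p <- ggplot(data = efficiency, aes(x = cylinders, fill = engine)) +
  ggtitle("Count by cylinders") +
  xlab("No. of cylinders")

p_stack <- p + geom_bar(position = "stack")   # bars stacked on top of each other (the default)
p_dodge <- p + geom_bar(position = "dodge")   # bars placed side by side
p_fill  <- p + geom_bar(position = "fill")    # stacked and rescaled to proportions

# Type p_stack, p_dodge or p_fill at the console to draw each plot.
```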
These aesthetics are shared by many different geoms and so are good to know off the top of your head:
x, y: coordinates
color: (out)line color
fill: fill color
size: point size or (out)line width
shape: shape of points (circle, x, square, etc.)
linetype: line specification (solid, dashed, dotted, etc.)
alpha: transparency
group: which points to link together with lines
Some geoms have special aesthetics; these are usually documented in the help file for the corresponding geom.
We’ve gone over many of these in the previous slides, but they’re assembled in this list for reference:
geom_point(): Points on a scatter plot. Requires x and y aesthetics.
geom_line(): Points connected by a line in order of increasing x coordinate. Requires x and y aesthetics. geom_path() is similar, but connects the points in the order that they appear in the data frame, which is useful for drawing lines that are not functions of the x-axis.
geom_histogram(): Histogram of values in the column specified by x. geom_freqpoly() is a similar geom that is essentially just the outline of a histogram and is useful when you want to overlay several histograms.
geom_bar(): Bar chart indicating the number of observations in each of the categories specified by x. If you supply a y aesthetic and pass the argument stat = "identity", the y aesthetic will specify the height of each bar.
geom_polygon(): Shape with vertices specified by x-y coordinates. Make sure to include a group aesthetic to specify which polygon each observation is part of. This is useful for drawing maps.
rgb(0,0,1), rgb(1,0,0), rgb(0,0,0), rgb(1,1,1)
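The rgb() calls above build colors from red, green and blue intensities between 0 and 1. A short sketch of what they return. The variable names and the magenta example are additions for illustration.

```r
# Each call returns a hexadecimal color string
blue  <- rgb(0, 0, 1)   # no red, no green, full blue
red   <- rgb(1, 0, 0)
black <- rgb(0, 0, 0)   # all channels off
white <- rgb(1, 1, 1)   # all channels fully on

c(blue, red, black, white)

# Channels combine additively: full red plus full blue gives magenta
magenta <- rgb(1, 0, 1)
magenta
```

Strings like these can be passed wherever ggplot expects a fixed color, for example geom_point(color = rgb(0, 0, 1)).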
Each layer contains (essentially):
ggplot() +
geom_boxplot(data = efficiency, aes(x = cylinders, y = mpg)) +
geom_point(data = efficiency, aes(x = cylinders, y = mpg), position = "jitter")
Often, the data and aesthetics are shared between layers, and can be inherited from the ggplot() function call to save time and minimize errors:
ggplot(data = efficiency, aes(x = cylinders, y = mpg)) +
geom_boxplot() +
geom_point(position = "jitter")
The convention is to always pass arguments that are common to all elements of the graphic to ggplot(), and arguments that are specific to certain layers to their corresponding geom_ function.
This is efficient, and makes your intentions clearer for anyone reading your code (remember, in 99% of cases, this will be you!).
## # A tibble: 33,442 x 12
## id make model year class trans drive cyl displ fuel hwy cty
## <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 13309 Acura 2.2CL… 1997 Subcom… Automa… Front… 4 2.2 Regu… 26 20
## 2 13310 Acura 2.2CL… 1997 Subcom… Manual… Front… 4 2.2 Regu… 28 22
## 3 13311 Acura 2.2CL… 1997 Subcom… Automa… Front… 6 3 Regu… 26 18
## 4 14038 Acura 2.3CL… 1998 Subcom… Automa… Front… 4 2.3 Regu… 27 19
## 5 14039 Acura 2.3CL… 1998 Subcom… Manual… Front… 4 2.3 Regu… 29 21
## 6 14040 Acura 2.3CL… 1998 Subcom… Automa… Front… 6 3 Regu… 26 17
## 7 14834 Acura 2.3CL… 1999 Subcom… Automa… Front… 4 2.3 Regu… 27 20
## 8 14835 Acura 2.3CL… 1999 Subcom… Manual… Front… 4 2.3 Regu… 29 21
## 9 14836 Acura 2.3CL… 1999 Subcom… Automa… Front… 6 3 Regu… 26 17
## 10 11789 Acura 2.5TL 1995 Compac… Automa… Front… 5 2.5 Prem… 23 18
## # … with 33,432 more rows
## # A tibble: 32 x 2
## year `mean highway mpg`
## <dbl> <dbl>
## 1 1984 19.1
## 2 1985 23.0
## 3 1986 22.7
## 4 1987 22.4
## 5 1988 22.7
## 6 1989 22.5
## 7 1990 22.3
## 8 1991 22.3
## 9 1992 22.4
## 10 1993 22.8
## # … with 22 more rows
Let’s make a plot of mpg vs year. We’ll include a series of boxplots from the vehicles table, and overlay a line showing the yearly average.
ggplot(mapping = aes(x = year)) +
geom_boxplot(
data = vehicles,
aes(group = factor(year), y = hwy)) +
geom_line(
data = mpg,
aes(y = `mean highway mpg`),
color = "red", size = 2) +
labs(
title = "Change in highway mpg over time",
y = "Highway mpg"
)
Things to note in this plot:
geom_boxplot and geom_line both inherit the aesthetic x = year from ggplot, but other aesthetics are provided by the layers.
geom_boxplot has the group aesthetic, which makes sure that the boxplots are grouped by year even though the x-axis is a continuous variable.
The color = "red", size = 2 arguments of geom_line are passed outside the aes() specification. This means that these specifications will be applied to all data in that layer, and ggplot will not produce a legend for this aesthetic.
labs allows us to label aesthetics. In our case, we’ve labelled the y aesthetic and kept the x aesthetic as the default value.
The first argument of ggplot is usually a data table. Since we are not specifying one, we need to use ggplot(mapping = aes(x = year)) instead of ggplot(aes(x = year)) to make this explicit to R.
Instead of seeing plots overlaid on top of each other, we might want to see them side-by-side.
The facet_wrap() and facet_grid() functions allow us to split our data into several side-by-side plots according to some variable:
ggplot(efficiency, aes(x = weight, y = mpg, color = factor(cylinders))) +
geom_point() +
facet_wrap(vars(factor(gears)))
facet_wrap is for splitting along a single variable, and will arrange the plots in whatever grid it decides is most efficient.
facet_grid allows us to choose two faceting variables, and split our data further.
We can go crazy with these options, so that a single graphic displays a great deal of information:
ggplot(
data = efficiency,
mapping = aes(
x = weight,
y = mpg,
fill = factor(cylinders),
size = horsepower)) +
geom_point(color = "black", alpha = 0.5, shape = 21) + # outlined circle
facet_grid(
cols = vars(transmission),
rows = vars(engine),
margins = TRUE, # include unfaceted scatter plots on the edges
labeller = label_both) # include the name of the faceting variables
It’s easy to go overboard with this, though, so be careful not to overdo it - the above graphic is fun and useful for general data exploration, but is getting so busy that it no longer makes a single coherent point.
Until now, we’ve only been specifying the logic of our plots and letting R do the actual laying out.
This is a good thing: the people that designed ggplot are graphic designers and data visualization experts. You (probably) are not.
However, if you do want something specific, ggplot has extensive customization options available.
Themes determine the look of everything that isn’t data: the axes, labels, gridlines, background, …
A full list of themes can be found on this page: https://ggplot2.tidyverse.org/reference/ggtheme.html
p <- ggplot(data = efficiency, aes(x = weight, y = mpg, color = cylinders)) +
ggtitle("mpg by weight") +
geom_point()
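A sketch of how themes might be applied to this stored plot. The themes picked here are illustrative choices from the list linked above, and efficiency is rebuilt from mtcars as an assumption.

```r
library(ggplot2)

# Assumed stand-in for the lecture's efficiency table
efficiency <- transform(mtcars, weight = wt * 1000, cylinders = factor(cyl))

p <- ggplot(data = efficiency, aes(x = weight, y = mpg, color = cylinders)) +
  ggtitle("mpg by weight") +
  geom_point()

# Complete themes restyle every non-data element at once
p_bw      <- p + theme_bw()
p_minimal <- p + theme_minimal()

# theme() adjusts individual elements on top of a complete theme
p_custom <- p + theme_classic() + theme(legend.position = "bottom")

# Type p_bw, p_minimal or p_custom at the console to draw each plot.
```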
Axes are customized using the scale_(x/y)_continuous functions. Many examples are given here: https://ggplot2.tidyverse.org/reference/scale_continuous.html
Set the x- and y-axis limits
Specify where you want axis ticks
Apply transformations to the axes
Specify where you want the axes to appear
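The four customizations above can be sketched as follows. The limits, tick positions and the choice of a log scale are illustrative values, and efficiency is rebuilt from mtcars as an assumption.

```r
library(ggplot2)

# Assumed stand-in for the lecture's efficiency table
efficiency <- transform(mtcars, weight = wt * 1000, cylinders = factor(cyl))

p <- ggplot(data = efficiency, aes(x = weight, y = mpg, color = cylinders)) +
  ggtitle("mpg by weight") +
  geom_point()

p_axes <- p +
  scale_x_continuous(
    limits   = c(1000, 6000),                 # axis limits
    breaks   = seq(1000, 6000, by = 1000),    # where the ticks go
    position = "top") +                       # draw the x-axis above the panel
  scale_y_log10()                             # log-transformed y-axis

# Type p_axes at the console to draw the plot.
```

Observations outside the limits are removed with a warning, so it is worth checking that the limits cover the data.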
Generally, each aesthetic has a corresponding scale_ function for customization. I’ve provided links to several of them here, as they work similarly to the functions that we’ve seen already:
You can easily find more with some light googling.
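As one example of a non-axis scale, the sketch below customizes the color and size aesthetics. The specific colors, legend titles and size range are arbitrary choices, and efficiency is rebuilt from mtcars as an assumption.

```r
library(ggplot2)

# Assumed stand-in for the lecture's efficiency table
efficiency <- transform(
  mtcars,
  weight     = wt * 1000,
  cylinders  = factor(cyl),
  horsepower = hp
)

p_scales <- ggplot(
    data = efficiency,
    aes(x = weight, y = mpg, color = cylinders, size = horsepower)) +
  geom_point() +
  scale_color_manual(
    name   = "Cylinders",                                       # legend title
    values = c("4" = "steelblue", "6" = "orange", "8" = "firebrick")) +
  scale_size_continuous(name = "Horsepower", range = c(1, 6))   # smallest and largest point size

# Type p_scales at the console to draw the plot.
```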