Download airbnb_nyc_2019.csv and Airbnb_analysis.R from the course website. Make sure that you save them in the folder that you have been using for this class!
Open up Airbnb_analysis.R and run the lines of code that you see there. Do you understand what the code is doing? Some of the comments have been left blank: fill them in with what you believe the code is doing.
The rest of this document will go through how we can convert the R script Airbnb_analysis.R into an R markdown document.
Let’s create an R markdown file where we will write up our data analysis. Click on the icon in the top-left corner of the window and select “R Markdown…” In the window that pops up, select “Document” in the sidebar on the left. Type in “Analysis of Airbnb Data in NYC 2019” for Title, and your name for Author. For “Default Output Format”, select HTML. Click “OK”.
Upon clicking “OK”, a new sub-window appears in the top-left of our RStudio window with some default text. Notice how the filename is “Untitled1”? Save the document in our class folder with the name “Airbnb analysis”. The filename in the window will become “Airbnb analysis.Rmd”.
The top section (boxed in red) is called the YAML header (YAML stands for “YAML Ain’t Markup Language”, originally “Yet Another Markup Language”). It is separated from the rest of the document by lines containing only ---. R markdown uses it to control many details of the whole document. We won’t talk much about this header in this class. Just notice that the “title” and “author” fields were automatically populated with what we filled in in the earlier window, and that the date is the date when the document was created. You can change these fields by manually editing them here.
To create the HTML document from this R Markdown file, click on the “Knit” button (or use the shortcut Cmd/Ctrl + Shift + K). A couple of things happen when you do this:
(It’s possible that your preview shows up in the “Viewer” window in the bottom-right corner as well. To expand it to a new window, click the “Show in new window” button (on the right of the broom icon).)
If you open up the “.html” file in your web browser, you will see that it is the same as the preview.
Compare the contents of the .Rmd file with the preview that you see. Can you see how the markdown syntax (such as the ## before “R Markdown” and the asterisks surrounding “Knit”) gets styled in the final document?
Next, notice how code chunks are represented in the .Rmd file. They start with ```{r} and end with ```. The next word after r (e.g. cars, pressure) is the name of the code chunk. If you scroll through the “R Markdown” tab in the bottom-left window, you’ll see these names pop up. Code chunks don’t need to have a name. After the name of the code chunk, you may see things like echo=FALSE or include=FALSE. We’ll talk about these as we go along.
Finally, notice that our environment is empty (see the “Environment” tab in the top-right window). When we knit a document, R essentially starts a new session/environment and runs all the code there.
To illustrate how to use R markdown for presenting data analyses, we will work through a case study on Airbnb listings in New York City (NYC) in 2019. The data analysis is meant to be illustrative, not comprehensive.
As a start, delete everything in the .Rmd file except the YAML header and the first code chunk (the one that has {r setup, include=FALSE} at the top).
The code in the chunk labeled “setup” sets global options for all code chunks to follow. Setting echo=TRUE means that the code in every chunk that follows is printed, along with its result. (If we set it to echo=FALSE, we will not see the code chunks in the published document. However, the code is still run and the results of the code will be shown.)
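As a minimal sketch of what the setup chunk does, the lines below set a global chunk option and then read it back. Both opts_chunk$set() and opts_chunk$get() belong to the knitr package; running them in the console only illustrates the mechanism, nothing is knitted.

```r
library(knitr)

# Set the global default: show the code of every chunk in the output
opts_chunk$set(echo = TRUE)

# Read the stored value back to confirm the option was set
echo_setting <- opts_chunk$get("echo")
echo_setting
```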
It’s always a good idea to have an introduction section to your data analysis. Type the following below the setup code chunk:
## Introduction
This is an analysis of Airbnb listings in New York City (NYC) in 2019. The data was taken from https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data/.
Next, let’s create a code chunk to import the libraries that we will use. It’s hard to know exactly which packages we are going to use in advance, but we can always go back to this chunk and amend it later. There are a number of ways to create a code chunk, including:
- typing ```{r}, followed by your code, then closing the chunk with ```,
- using the Cmd/Ctrl + Alt + I shortcut.
After creating the code chunk, type the following line in the chunk:
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.1.0 v dplyr 1.0.5
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Next, create another code chunk to read in data:
df <- read_csv("airbnb_nyc_2019.csv",
               col_types = cols(host_id = col_character(),
                                id = col_character(),
                                last_review = col_date(format = "%Y-%m-%d")))
Note: make sure to set your working directory in the setup chunk or supply the absolute file path.
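If the import fails when knitting, a quick check along the following lines may help. It is only a sketch: it prints the folder R is currently looking in and reports whether the data file is visible from there. When knitting, knitr by default evaluates chunks in the folder that contains the .Rmd file, which is why saving both files in the same class folder usually works.

```r
# File name taken from this lab; adjust the path if your folder layout differs
data_file <- "airbnb_nyc_2019.csv"

# Show the folder that relative paths are resolved against
getwd()

# Report whether the file can be found from that folder
found <- file.exists(data_file)
if (found) {
  message("Found ", data_file)
} else {
  message("Cannot find ", data_file, " in ", getwd())
}
```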
Knit the document to see what it looks like at this point. See how there are a whole bunch of messages after the library(tidyverse) line? While informative when doing our data analysis, they are probably something we don’t want to present. To remove these messages (and all future messages and warnings), go to the setup code chunk and amend knitr::opts_chunk$set(echo = TRUE) to knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE). If we knit the document now, we’ll see that the tidyverse messages are no longer there.
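The message and warning options act on two different kinds of output that R code can produce. The sketch below uses a made-up helper, describe_size(), to show both kinds, and uses base R's suppressMessages() as a console-level analogue of message = FALSE.

```r
# Hypothetical helper that can emit both kinds of output
describe_size <- function(x) {
  message("Counting ", length(x), " value(s)")       # a message
  if (length(x) == 0) warning("The input is empty")  # a warning
  length(x)
}

# Prints the message and returns 3
n <- describe_size(c(10, 20, 30))

# Same result, but the message is silenced
quiet_n <- suppressMessages(describe_size(c(10, 20, 30)))
```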
Next, create another code chunk and put in the lines of code which give us a feel for the data:
dim(df)
## [1] 48895 16
head(df)
## # A tibble: 6 x 16
## id name host_id host_name neighbourhood_g~ neighbourhood latitude
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 2539 Clean & quiet~ 2787 John Brooklyn Kensington 40.6
## 2 2595 Skylit Midtow~ 2845 Jennifer Manhattan Midtown 40.8
## 3 3647 THE VILLAGE O~ 4632 Elisabeth Manhattan Harlem 40.8
## 4 3831 Cozy Entire F~ 4869 LisaRoxa~ Brooklyn Clinton Hill 40.7
## 5 5022 Entire Apt: S~ 7192 Laura Manhattan East Harlem 40.8
## 6 5099 Large Cozy 1 ~ 7322 Chris Manhattan Murray Hill 40.7
## # ... with 9 more variables: longitude <dbl>, room_type <chr>, price <dbl>,
## # minimum_nights <dbl>, number_of_reviews <dbl>, last_review <date>,
## # reviews_per_month <dbl>, calculated_host_listings_count <dbl>,
## # availability_365 <dbl>
names(df)
## [1] "id" "name"
## [3] "host_id" "host_name"
## [5] "neighbourhood_group" "neighbourhood"
## [7] "latitude" "longitude"
## [9] "room_type" "price"
## [11] "minimum_nights" "number_of_reviews"
## [13] "last_review" "reviews_per_month"
## [15] "calculated_host_listings_count" "availability_365"
The knitr package provides us with a function, kable(), that helps print datasets more nicely in R markdown files. Add library(knitr) to the library imports chunk, and change head(df) to kable(head(df)). Knit the document again to see the difference.
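To get a feel for kable() without knitting, here is a small sketch that can be run in the console. The data frame is made up and merely stands in for head(df).

```r
library(knitr)

# Made-up data frame standing in for head(df)
listings <- data.frame(
  name = c("Listing A", "Listing B"),
  price = c(149, 225)
)

# Default printing, as the console would show a data frame
print(listings)

# kable() builds a table object that R Markdown renders as a formatted table
listings_table <- kable(listings)
listings_table
```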
To orient our reader, we may want to add some text before that code chunk along the lines of “the dataset contains the following columns:”.
So far, all the R code has been in chunks. It is possible to have R code within the text itself too! For example, instead of dedicating an R chunk to nrow(df) and ncol(df), we could have the following line outside an R code chunk:
The dataset has `r nrow(df)` rows and `r ncol(df)` columns.
When you knit the document again, notice how the command nrow(df) is run and the output is printed (instead of the code itself).
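Inline code can hold any R expression, not just a bare function call. As a small sketch, the lines below format the row count with a thousands separator. The number is the row count reported by dim(df) earlier, hard-coded here so that the snippet runs without the data.

```r
# Row count reported by dim(df) earlier, hard-coded so this runs on its own
n_rows <- 48895

# Insert a comma as the thousands separator
pretty_rows <- format(n_rows, big.mark = ",")
pretty_rows
```

In the document itself, the same expression would sit inside the inline code, with nrow(df) in place of the hard-coded number.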
We can repeat the process above for the rest of the code in Airbnb_analysis.R:
After the data analysis, you should end with a conclusion section. This can just be a summary of the results presented, or it could also include takeaway lessons, limitations of the analysis and/or future directions.
You can find a complete version of this Airbnb analysis (both .Rmd and .html file) on the course website.
We can specify “options” for each R chunk to change what the output looks like. For example, the chunk below makes a histogram of log10(price).
ggplot(df) +
  geom_histogram(aes(x = log10(price)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 11 rows containing non-finite values (stat_bin).
We may want to change the size of the figure (e.g. for a different aspect ratio, or to save space). The way to do that is to replace the ```{r} at the top of the chunk with ```{r fig.width=6, fig.height=3}:
ggplot(df) +
  geom_histogram(aes(x = log10(price)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 11 rows containing non-finite values (stat_bin).
(The default values are fig.width=7 and fig.height=5.)
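The message and the warning printed under the histogram can also be addressed in the code rather than hidden. The sketch below assumes that the dropped rows are listings with a price of 0, since log10(0) is -Inf; this has not been checked against the data here. The prices are made up and the bin width of 0.25 is an arbitrary choice.

```r
library(ggplot2)

# Made-up prices standing in for df$price; the 0 mimics a zero-price listing
prices <- data.frame(price = c(0, 50, 80, 120, 150, 200, 350, 1000))

# log10(0) is -Inf, a non-finite value that geom_histogram() would drop
log10(0)

# Keep only positive prices and choose the bin width explicitly
positive <- subset(prices, price > 0)
p <- ggplot(positive) +
  geom_histogram(aes(x = log10(price)), binwidth = 0.25)
p
```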
Remember the setup chunk right at the top of the R markdown document? If a particular code chunk does not have any options specified, it will follow whatever is in the setup chunk.
Here are some commonly used options:
- include = FALSE: prevents code and results from appearing in the finished file. R Markdown still runs the code in the chunk, and the results can be used by other chunks.
- echo = FALSE: prevents code, but not the results, from appearing in the finished file.
- eval = FALSE: code appears in the output but is not run.
- message = FALSE: prevents messages that are generated by code from appearing in the finished file.
- warning = FALSE: prevents warnings that are generated by code from appearing in the finished file.
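To see what knitr does when none of these options is specified anywhere, the built-in defaults can be looked up from the console. This is a small sketch: the default = TRUE argument asks opts_chunk$get() for the package defaults rather than any values changed in the current session.

```r
library(knitr)

# Look up knitr's built-in defaults for the options listed above
defaults <- opts_chunk$get(
  c("include", "echo", "eval", "message", "warning"),
  default = TRUE
)
defaults
```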
In this lab, we started with a working R script, then converted that R script into an R markdown document. While tedious, this is a great way to create R markdown documents as it ensures that the code itself is working.
When you are more familiar with R, you can also start writing R markdown documents from scratch, typing in the code as you go. The only trouble there is that, to check that your code is working, you have to knit the document after writing each chunk to see if you got the result you wanted.
One way to speed up the process of writing an .Rmd file is to run the code in the Console instead. Ways to do this include:
- copying code from the .Rmd file into the console and pressing Enter, or
- using the Cmd/Ctrl + Enter shortcut.
Running the code in the console mimics the knitting process, so we can check that the code chunks evaluate to the result we want without knitting over and over again.
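A further option, not among the ways listed above, is knitr's purl() function, which copies the code chunks of an .Rmd file into a plain R script that can then be run in one go. The sketch below writes a tiny throwaway .Rmd file first so that it is self-contained; the file contents are made up.

```r
library(knitr)

# Build the chunk fence from single backticks so it can sit inside a string
fence <- strrep("`", 3)

# Write a tiny throwaway .Rmd file with one code chunk
rmd_path <- tempfile(fileext = ".Rmd")
writeLines(c(
  "## Demo",
  "",
  paste0(fence, "{r}"),
  "x <- 1 + 1",
  fence
), rmd_path)

# purl() extracts only the code chunks into an R script and returns its path
r_path <- purl(rmd_path, output = tempfile(fileext = ".R"), quiet = TRUE)

# Running the script evaluates every chunk in the current session
source(r_path)
x
```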
sessionInfo()
## R version 4.0.4 (2021-02-15)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19041)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.5 purrr_0.3.4
## [5] readr_1.4.0 tidyr_1.1.3 tibble_3.1.0 ggplot2_3.3.3
## [9] tidyverse_1.3.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.1.0 xfun_0.22 haven_2.3.1 colorspace_2.0-0
## [5] vctrs_0.3.6 generics_0.1.0 htmltools_0.5.1.1 yaml_2.2.1
## [9] utf8_1.2.1 rlang_0.4.10 pillar_1.5.1 glue_1.4.2
## [13] withr_2.4.1 DBI_1.1.1 dbplyr_2.1.0 modelr_0.1.8
## [17] readxl_1.3.1 lifecycle_1.0.0 munsell_0.5.0 gtable_0.3.0
## [21] cellranger_1.1.0 rvest_1.0.0 evaluate_0.14 labeling_0.4.2
## [25] knitr_1.31 fansi_0.4.2 highr_0.8 broom_0.7.5
## [29] Rcpp_1.0.6 scales_1.1.1 backports_1.2.1 jsonlite_1.7.2
## [33] farver_2.1.0 fs_1.5.0 hms_1.0.0 digest_0.6.27
## [37] stringi_1.5.3 grid_4.0.4 cli_2.3.1 tools_4.0.4
## [41] magrittr_2.0.1 crayon_1.4.1 pkgconfig_2.0.3 ellipsis_0.3.1
## [45] xml2_1.3.2 reprex_1.0.0 lubridate_1.7.10 assertthat_0.2.1
## [49] rmarkdown_2.7 httr_1.4.2 rstudioapi_0.13 R6_2.5.0
## [53] compiler_4.0.4