For this project, I was interested in looking at various data for movies: budget, revenue, rating, genre, director, runtime, etc. I first took datasets from the IMdb website, found here: https://www.imdb.com/interfaces/, which include data on every film and person ever entered into the IMdb database. But this data did not include budget and revenue information, which IMdb collects from external sources. So, I also used a dataset from The Movie Database (TMdb) that includes budget and revenue data on 5000 movies, a sufficient slice of the IMdb’s more comprehensive but unwieldy datasets. The TMdb dataset can be found here: https://www.kaggle.com/tmdb/tmdb-movie-metadata.
In this document I will ask a few questions about our data. 1. What genres are most profitable over different budget ranges? 2. Which directors are most successful within different genres? 3. How has the runtime of films of changed over time?
First, let’s load the necessary libraries.
library(readr)
library(tidyverse)
library(vroom)
library(lubridate)
library(knitr)
library(scales)
And load in our datasets.
akas <- vroom("data/title.akas.tsv.gz")
basics <- vroom("data/title.basics.tsv.gz")
crew <- vroom("data/title.crew.tsv.gz")
principals <- vroom("data/title.principals.tsv.gz")
ratings <- vroom("data/title.ratings.tsv.gz")
names <- vroom("data/name.basics.tsv.gz")
t <- vroom("data/tmdb_5000_movies.csv")
I use vroom here because several of my datasets are quite large, and I wanted to find a way to make the process go faster.
The IMdb datasets are huge, and the relevant information is spread across many of the different datasets. For example:
nrow(akas)
## [1] 13345185
Let’s look at what info each dataset contains.
names(basics)
## [1] "tconst" "titleType" "primaryTitle" "originalTitle"
## [5] "isAdult" "startYear" "endYear" "runtimeMinutes"
## [9] "genres"
names(akas)
## [1] "titleId" "ordering" "title" "region"
## [5] "language" "types" "attributes" "isOriginalTitle"
names(crew)
## [1] "tconst" "directors" "writers"
names(principals)
## [1] "tconst" "ordering" "nconst" "category" "job"
## [6] "characters"
names(ratings)
## [1] "tconst" "averageRating" "numVotes"
names(names)
## [1] "nconst" "primaryName" "birthYear"
## [4] "deathYear" "primaryProfession" "knownForTitles"
We’re going to need to join these into one dataset, so we can compare information originally stored in separate datasets. Let’s start with basics, and work up from there.
y <- basics %>% filter(titleType == "movie")
Our new object y is filtered to only contain movies—not TV shows, shorts, or anything else. This significantly narrows down our data right off the bat. Look:
nrow(basics)
## [1] 6790810
nrow(y)
## [1] 550198
Our smaller dataset means our joins will take less time, too.
ya <- left_join(y, akas, by = c("tconst" = "titleId"))
yab <- ya %>%
group_by(tconst) %>%
filter(n() == 1 | isOriginalTitle == 1)
In order to avoid having duplicate entries for different languages or versions of a film carried over from the akas dataset, we filter for entries where either the tconst (the unique code assigned by IMdb to each movie) appears only once in the dataset, or where the value for isOriginalTitle is 1, since this signifies that this is the original version of the movie.
Now, let’s continue with our joins.
yabc <- left_join(yab, crew)
yabcp <- left_join(yabc, principals)
yfull <- left_join(yabcp, ratings)
Now let’s check in on our final product, yfull:
names(yfull)
## [1] "tconst" "titleType" "primaryTitle" "originalTitle"
## [5] "isAdult" "startYear" "endYear" "runtimeMinutes"
## [9] "genres" "ordering" "title" "region"
## [13] "language" "types" "attributes" "isOriginalTitle"
## [17] "directors" "writers" "nconst" "category"
## [21] "job" "characters" "averageRating" "numVotes"
nrow(yfull)
## [1] 550195
While we won’t use every variable in this new, combined dataset in our analysis, it is handy to have it all in one place like this.
Now we need to add the relevant pieces of this information to our TMdb dataset (t) of 5000 films, which is the one we will be working with for information about revenue and budget. While the TMdb dataset has genre, release date, and other information in it, the format of those pieces of information is much worse and harder to parse.
t %>% transmute(genres, original_title, release_date) %>% head()
## # A tibble: 6 x 3
## genres original_title release_date
## <chr> <chr> <chr>
## 1 "[{\"id\": 28, \"name\": \"Action\"}, … Avatar 12/10/09
## 2 "[{\"id\": 12, \"name\": \"Adventure\"… Pirates of the Caribbean… 5/19/07
## 3 "[{\"id\": 28, \"name\": \"Action\"}, … Spectre 10/26/15
## 4 "[{\"id\": 28, \"name\": \"Action\"}, … The Dark Knight Rises 7/16/12
## 5 "[{\"id\": 28, \"name\": \"Action\"}, … John Carter 3/7/12
## 6 "[{\"id\": 14, \"name\": \"Fantasy\"},… Spider-Man 3 5/1/07
Rather than bother with trying to fix the genre formats, we can just join the relevant columns from the IMdb dataset to the TMdb one. We will also parse the release_date column of the TMdb dataset for its years, so that we can make sure there aren’t extra rows for movies that share a title with a film released in a different year.
#We have to create a yfix function because for some reason, year() will return years like 2044 or 2068 if given dates where the year is 44 or 68.
yfix <- function(x, year=1921){
m <- year(x) %% 100
year(x) <- ifelse(m > year %% 100, 1900+m, 2000+m)
year(x)
}
ty <- mutate(t, date = mdy(release_date))
ty2 <- mutate(ty, year = yfix(date))
yfull$startYear <- as.numeric(yfull$startYear)
td <- left_join(ty2, yfull %>% select(primaryTitle, tconst, genres, startYear), by = c("original_title" = "primaryTitle", "year" = "startYear"))
names(td)
## [1] "budget" "genres.x" "homepage"
## [4] "id" "keywords" "original_language"
## [7] "original_title" "overview" "popularity"
## [10] "production_companies" "production_countries" "release_date"
## [13] "revenue" "runtime" "spoken_languages"
## [16] "status" "tagline" "title"
## [19] "vote_average" "vote_count" "date"
## [22] "year" "tconst" "genres.y"
Now we have our smaller dataset with revenue and budget, plus the genre and year variables in easier to use condition.
One final fix: some of the movies in the dataset have unrealistically low values for revenue and budget. So, let’s filter out any data that has a budget or revenue below $10000.
tdc <- td %>% filter(budget > 10000, revenue > 10000)
Okay, now we should be good to visualize our data.
Let’s do our first data visualization on our first research quesiton:
What genres are most profitable over different budget ranges?
First, let’s split tdc so that films with multiple genres appear multiple times with only one genre per row, and get rid of rows with missing genre values:
tdcgenres <- tdc %>% mutate(genres2 = strsplit(genres.y, ",")) %>% unnest(c(genres2))
bygenre <- tdcgenres %>% filter(!is.na(genres2), genres2 != "\\N")
We should create boxplots of the revenue minus budget and revenue divided by budget of different generes, facet wrapped by budget range.
bygenre %>%
mutate(b = cut(budget, breaks = c(0, 1e6, 1e7, 5e7, 1e8, Inf),
labels = c("Budget: $0-$1m", "Budget: $1m-$10m", "Budget: $10m-$50m","Budget: $50m-$100m","Budget: $100m+"))) %>%
group_by(genres2, b) %>%
filter(n() > 30) %>%
mutate(m = median(revenue-budget)) %>%
arrange(m) %>%
ungroup() %>%
mutate(genres2 = factor(genres2, levels = unique(genres2))) %>%
ggplot(mapping = aes(y = genres2, x = revenue-budget)) +
labs(title = "Gross profit by genre",
x = "Profit (Revenue minus budget)",
y = "Genre") +
scale_x_continuous(labels = comma) +
geom_boxplot(outlier.size = .3, size = .3) +
#coord_cartesian(xlim = c(0, 1000)) +
facet_wrap(vars(b), scales = "free_x", ncol = 5) +
theme(axis.text.x = element_text(angle = -90))
The graph is blank where there were fewer than 30 movies in that genre and budget range. We can see that the genres are pretty tightly grouped, and obviously more expensive movies have higher profit.
Let’s look at the ratio, rather than the difference, of revenue and budget now.
bygenre %>%
mutate(b = cut(budget, breaks = c(0, 1e6, 1e7, 5e7, 1e8, Inf),
labels = c("Budget: $0-$1m", "Budget: $1m-$10m", "Budget: $10m-$50m","Budget: $50m-$100m","Budget: $100m+"))) %>%
group_by(genres2, b) %>%
filter(n() > 30) %>%
mutate(m = median(revenue/budget)) %>%
arrange(m) %>%
ungroup() %>%
mutate(genres2 = factor(genres2, levels = unique(genres2))) %>%
ggplot(mapping = aes(y = genres2, x = revenue/budget)) +
labs(title = "Ratio of revenue and budget by genre",
x = "Ratio of revenue and budget (log scale)",
y = "Genre") +
scale_x_log10(labels = comma) +
geom_boxplot(outlier.size = .3, size = .3) +
#coord_cartesian(xlim = c(0, 1000)) +
facet_wrap(vars(b), scales = "free_x", ncol = 5)+
theme(axis.text.x = element_text(angle = -90))
Here, we see that movies with budgets greater than 1 million dollars tend to have revenue/budget ratios of about three, and there don’t seem to be any genres with especially high ratios. This is expected, because if some genres were more profitable than others, then more entries to the market would saturate it, and drive down people’s willingness to pay for yet another film in that genre.
Next, let’s look at which directors are most successful by rating within different genres. Since we don’t want a bajillion directors to display on the y-axis, let’s stick with the smaller tdc dataset as our base, which includes only truly major films.
The first thing we need to do is give each director per film their own row in use the director nconsts from yfull to retrieve the corresponding names from names.
jnames <- transmute(names, nconst = nconst, primaryName = primaryName)
tdd <- left_join(tdc, yfull %>% select(primaryTitle, directors, startYear), by = c("original_title" = "primaryTitle", "year" = "startYear"))
tdds <- tdd %>% mutate(directors2 = strsplit(directors, ",")) %>% unnest(c(directors2))
bydirector <- left_join(tdds, jnames, by = c("directors2" = "nconst"))
We again have to filter out missing values.
bydirectorc <- bydirector %>% filter(!is.na(genres.y), genres.y != "\\N", !is.na(primaryName))
And now let’s filter only for directors with more than 5 films in the dataset.
byd2 <- bydirectorc %>% group_by(primaryName) %>% filter(n() > 5) %>% ungroup()
Now we can add in the ratings and number of votes columns from the IMdb dataset. Though these exist in the TMdb dataset as well, they have smaller numbers of votes overall, and are not as up-to-date.
byd3 <- left_join(byd2, yfull %>%
select(primaryTitle, startYear, averageRating, numVotes), by = c("original_title" = "primaryTitle", "year" = "startYear"))
Let’s get rid of rating values that are missing or that are based on 1000 or fewer votes, then split our data up based on genre again, then make sure that each genre-director combo has at least 5 films represented.
byd4 <- byd3 %>% filter(!is.na(averageRating), numVotes > 1000)
byd5 <- byd4 %>% mutate(genres2 = strsplit(genres.y, ",")) %>% unnest(c(genres2))
byd6 <- byd5 %>% group_by(primaryName, genres2) %>% filter(n() > 5) %>% ungroup()
Now we’re ready to plot. Let’s look at the film ratings of several directors in each of the four most represented genres in the dataset: Action, Adventure, Comedy, and Drama.
byd6 %>%
filter(genres2 %in% c("Action", "Comedy", "Drama", "Adventure")) %>%
group_by(primaryName) %>%
mutate(m = mean(averageRating)) %>%
arrange(m) %>%
ungroup %>%
mutate(primaryName = factor(primaryName, levels = unique(primaryName))) %>%
ggplot(aes(x = averageRating, y = primaryName)) +
labs(title = "Film ratings by director across four genres", y = "Director Name", x = "Average Rating") +
geom_point(size = .5, color = "grey") +
stat_summary(aes(group = 1), fun = mean, geom = "line", color = "red", alpha = .5, size = .3) +
stat_summary(shape = 3, geom = "point", fun = mean, size = 1, color = "red") +
facet_wrap(vars(genres2), scales = "free_y", ncol = 2) +
coord_cartesian(xlim = c(0,10)) +
scale_x_continuous(expand = c(0,0))
I’ve sorted the y-axis so that the directors are listed in order of overall average rating, which doesn’t perfectly correspond with average rating within each genre, but is pretty close. We can see that Quentin Tarantino has the highest average rating compared with other dramatic directors by a significant margin. Wes Anderson is the highest rated for comedy by a similarly high margin. Renny Harlin, M. Night Shyamalan, and David Zucker round out the bottom of their respective categories.
Finally, let’s look at the distribution of major films’ runtimes over time. Since we don’t need budget data for this plot, we can just use our yfull dataset instead of the smaller tdc. But since this dataset contains all films, major or not, we’ll filter for a number of votes (for the film’s rating) over 1000. We’ll also need to convert the values in the runtime column to numeric variables, from character.
yfull$runtimeMinutes <- as.numeric(yfull$runtimeMinutes)
Now let’s plot it.
yfull %>%
filter(numVotes > 1000) %>%
ggplot(aes(x = startYear, y = runtimeMinutes)) +
labs(title = "Average runtime over time", x = "Year", y = "Runtime") +
geom_point(size = .1, alpha = .2, position = position_jitter(width = .3)) +
geom_smooth (size = .5, aes(group = 1), color = "black") +
coord_cartesian(ylim = c(0,250), xlim = c(1920,2020), expand = c(0,0))
We can see that average run time rises for a while, possibly as it becomes more and more convenient to shoot longer and longer movies, then wavers back and forth a bit from about 1960 to the present, though with a minor positive trend overall.
Though none of these results challenged any direct expectations, they did serve to confirm suspicions that people might have about the industry. For example, that different genres buy-and-large have similar profitability, since otherwise the market would likely adjust to account for a discrepancy. It was interesting to see conclusively that Tarantino and Wes Anderson are actually the highest-rated directors, which also conforms to expectations. And the marked, if inconsistent, increase in runtime over the course of the past ninety-ish years, from 90 or so minutes to about 110, is also nice to confirm.
The most difficult part of this project was undoubtedly managing the many datasets, and trying to join and mutate them in the right way to get the intended result.