We will look at the relationship between budget and revenue for movies made in the United States from 1986 to 2016. The data is from the Internet Movie Database (IMDb).
library(readr)
library(tidyverse)
library(DT)
The movies data set includes basic information about each movie, including budget, genre, movie studio, and director. A full list of the variables may be found here.
movies <- read_csv("https://raw.githubusercontent.com/danielgrijalva/movie-stats/master/movies.csv")
movies <- movies %>%
  filter(country == "USA",
         !(genre %in% c("Musical", "War", "Western"))) # remove genres with < 10 movies
movies
## # A tibble: 4,868 x 15
## budget company country director genre gross name rating released runtime
## <dbl> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 8.00e6 Columb… USA Rob Rei… Adve… 5.23e7 Stan… R 1986-08… 89
## 2 6.00e6 Paramo… USA John Hu… Come… 7.01e7 Ferr… PG-13 1986-06… 103
## 3 1.50e7 Paramo… USA Tony Sc… Acti… 1.80e8 Top … PG 1986-05… 110
## 4 1.85e7 Twenti… USA James C… Acti… 8.52e7 Alie… R 1986-07… 137
## 5 9.00e6 Walt D… USA Randal … Adve… 1.86e7 Flig… PG 1986-08… 90
## 6 6.00e6 De Lau… USA David L… Drama 8.55e6 Blue… R 1986-10… 120
## 7 9.00e6 Paramo… USA Howard … Come… 4.05e7 Pret… PG-13 1986-02… 96
## 8 1.50e7 SLM Pr… USA David C… Drama 4.05e7 The … R 1986-08… 96
## 9 6.00e6 Twenti… USA David S… Come… 8.20e6 Lucas PG-13 1986-03… 100
## 10 2.50e7 Twenti… USA John Ca… Acti… 1.11e7 Big … PG-13 1986-07… 99
## # … with 4,858 more rows, and 5 more variables: score <dbl>, star <chr>,
## # votes <dbl>, writer <chr>, year <dbl>
We begin by looking at how the average gross revenue (gross) has changed over time. Since we want to visualize the results, we will choose a few genres of interest for the analysis.
genre_list <- c("Comedy", "Action", "Animation", "Horror")
movies %>%
  filter(genre %in% genre_list) %>%
  group_by(genre, year) %>%
  summarise(avg_gross = mean(gross)) %>%
  ggplot(mapping = aes(x = year, y = avg_gross, color = genre)) +
  geom_point() +
  geom_line() +
  ylab("Average Gross Revenue (in US Dollars)") +
  ggtitle("Gross Revenue Over Time")
## `summarise()` regrouping output by 'genre' (override with `.groups` argument)
Next, let’s see the relationship between a movie’s budget and its gross revenue.
movies %>%
  # budget > 0 keeps log(budget) finite, since log(0) is -Inf
  filter(genre %in% genre_list, budget > 0) %>%
  ggplot(mapping = aes(x = log(budget), y = log(gross), color = genre)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  xlab("Log-transformed Budget") +
  ylab("Log-transformed Gross Revenue") +
  facet_wrap(~ genre)
## `geom_smooth()` using formula 'y ~ x'
Consider the plot from Part 1. Suppose we fit a regression equation for each genre that uses year to predict gross revenue. Which genre do you expect to have the largest slope? (Submit your response in the online form.)
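To make "a regression equation for each genre" concrete, the sketch below fits gross on year separately within each genre and collects the slopes. The data frame `toy` and all of its values are made up purely to show the mechanics; it is not the IMDb data and says nothing about which real genre has the largest slope.

```r
# Made-up toy data (hypothetical values, NOT the IMDb data set)
toy <- data.frame(
  genre = rep(c("Action", "Comedy"), each = 4),
  year  = rep(c(1990, 2000, 2010, 2016), times = 2),
  gross = c(20e6, 45e6, 70e6, 85e6,   # Action: grows quickly over time
            15e6, 20e6, 25e6, 28e6)   # Comedy: grows slowly over time
)

# Fit gross ~ year within each genre and keep only the slope on year
slopes <- sapply(split(toy, toy$genre), function(d) {
  coef(lm(gross ~ year, data = d))[["year"]]
})

# One slope per genre: estimated change in gross per additional year
slopes
```

Each slope is the estimated change in gross revenue, in dollars, for each additional year, so the genre whose line rises most steeply has the largest slope.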
Consider the plot from Part 2. Suppose we fit a regression equation for each genre that uses budget to predict gross revenue (gross). What are the signs of the correlation between budget and gross and of the slope in each regression equation?
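In simple linear regression, the slope equals the correlation multiplied by the ratio of the standard deviations, so the two always share a sign. The sketch below checks this on simulated values; `log_budget` and `log_gross` are hypothetical stand-ins generated at random, not the movie data.

```r
# Simulated stand-ins for the two variables (hypothetical, not the IMDb data)
set.seed(1)
log_budget <- runif(50, min = 13, max = 19)
log_gross  <- 2 + 0.9 * log_budget + rnorm(50, sd = 1)

# Correlation between the two variables
r <- cor(log_budget, log_gross)

# Slope from the least-squares fit
b1 <- coef(lm(log_gross ~ log_budget))[["log_budget"]]

# Identity for simple linear regression: slope = r * sd(y) / sd(x)
slope_from_r <- r * sd(log_gross) / sd(log_budget)

all.equal(b1, slope_from_r)   # the two calculations agree
sign(r) == sign(b1)           # so the signs must match
```

Because standard deviations are never negative, the sign of the slope is determined entirely by the sign of the correlation.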
Consider the plot from Part 2. Which genre has the smallest residuals, on average? Note: residual = observed revenue - predicted revenue.
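A least-squares line with an intercept always has residuals that average exactly zero, so "smallest residuals, on average" is best read as the smallest typical distance from the line. The sketch below uses the mean absolute residual as one possible measure of that distance; the groups "A" and "B" and all values are simulated and hypothetical.

```r
# Simulated data: group A is tightly clustered around its line, group B is not
# (hypothetical groups and values, not the IMDb data)
set.seed(2)
toy <- data.frame(
  genre = rep(c("A", "B"), each = 40),
  x     = runif(80, min = 13, max = 19)
)
toy$y <- 1 + 0.9 * toy$x +
  rnorm(80, sd = ifelse(toy$genre == "A", 0.3, 1.5))

# For each group: fit y ~ x, then average the absolute residuals
resid_size <- sapply(split(toy, toy$genre), function(d) {
  fit <- lm(y ~ x, data = d)
  mean(abs(resid(fit)))   # residual = observed - predicted
})

resid_size
```

In a scatterplot, the group with the smaller value is the one whose points sit closest to its fitted line.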
Consider the plot from Part 2. In the remaining time, discuss the following: Notice in the graph above that budget and gross are log-transformed. Why are the log-transformed values of the variables displayed rather than the original values (in U.S. dollars)?
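One way to explore this question is to compare a strongly right-skewed variable before and after taking logs. The sketch below simulates dollar-like amounts from a log-normal distribution; the vector `budget_sim` and its parameters are assumptions chosen for illustration, not estimates from the movie data.

```r
# Simulated right-skewed dollar amounts (hypothetical, not the IMDb budgets)
set.seed(3)
budget_sim <- rlnorm(1000, meanlog = 16, sdlog = 1.2)

# On the original scale a few very large values pull the mean far above the median
raw_gap <- mean(budget_sim) / median(budget_sim)

# On the log scale the mean and median nearly coincide (roughly symmetric)
log_gap <- mean(log(budget_sim)) - median(log(budget_sim))

raw_gap
log_gap

# Side-by-side histograms make the difference visible
par(mfrow = c(1, 2))
hist(budget_sim, main = "Original scale", xlab = "Simulated budget")
hist(log(budget_sim), main = "Log scale", xlab = "log(simulated budget)")
```

On the original scale most points crowd near zero while a few large values stretch the axis; the log scale spreads the points out, which makes a linear trend easier to see and to fit.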
Below is a list of genres in the data set:
movies %>%
  arrange(genre) %>%
  select(genre) %>%
  distinct() %>%
  datatable()