We will look at the relationship between budget and revenue for movies made in the United States from 1986 to 2016. The data is from the Internet Movie Database (IMDb).

library(readr)
library(tidyverse)
library(DT)

Data

The movies data set includes basic information about each movie including budget, genre, movie studio, director, etc. A full list of the variables may be found here.

movies <- read_csv("https://raw.githubusercontent.com/danielgrijalva/movie-stats/master/movies.csv")
movies <- movies %>%
  filter(country == "USA",
         !(genre %in% c("Musical", "War", "Western"))) # remove genres with fewer than 10 movies
movies
## # A tibble: 4,868 x 15
##    budget company country director genre  gross name  rating released runtime
##     <dbl> <chr>   <chr>   <chr>    <chr>  <dbl> <chr> <chr>  <chr>      <dbl>
##  1 8.00e6 Columb… USA     Rob Rei… Adve… 5.23e7 Stan… R      1986-08…      89
##  2 6.00e6 Paramo… USA     John Hu… Come… 7.01e7 Ferr… PG-13  1986-06…     103
##  3 1.50e7 Paramo… USA     Tony Sc… Acti… 1.80e8 Top … PG     1986-05…     110
##  4 1.85e7 Twenti… USA     James C… Acti… 8.52e7 Alie… R      1986-07…     137
##  5 9.00e6 Walt D… USA     Randal … Adve… 1.86e7 Flig… PG     1986-08…      90
##  6 6.00e6 De Lau… USA     David L… Drama 8.55e6 Blue… R      1986-10…     120
##  7 9.00e6 Paramo… USA     Howard … Come… 4.05e7 Pret… PG-13  1986-02…      96
##  8 1.50e7 SLM Pr… USA     David C… Drama 4.05e7 The … R      1986-08…      96
##  9 6.00e6 Twenti… USA     David S… Come… 8.20e6 Lucas PG-13  1986-03…     100
## 10 2.50e7 Twenti… USA     John Ca… Acti… 1.11e7 Big … PG-13  1986-07…      99
## # … with 4,858 more rows, and 5 more variables: score <dbl>, star <chr>,
## #   votes <dbl>, writer <chr>, year <dbl>
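
The genres removed above were chosen because they contain few movies. As a minimal sketch of how such genres could be identified with a count rather than listed by hand, the example below uses a small made-up data frame. The names `toy_movies`, `title`, `min_movies`, `rare_genres`, and `toy_filtered` are hypothetical, and the threshold of 3 is only for illustration (the filter above uses 10).

```r
library(dplyr)

# Toy stand-in for the movies data (made-up values, not from IMDb)
toy_movies <- tibble(
  title = paste("Movie", 1:8),
  genre = c("Comedy", "Comedy", "Comedy",
            "Action", "Action", "Action",
            "War", "Western")
)

min_movies <- 3  # illustrative threshold; the text above uses 10

# Count movies per genre and keep the names of genres below the threshold
rare_genres <- toy_movies %>%
  count(genre, name = "n_movies") %>%
  filter(n_movies < min_movies) %>%
  pull(genre)

# Drop every movie whose genre is rare
toy_filtered <- toy_movies %>%
  filter(!(genre %in% rare_genres))

toy_filtered
```

Computing the list from the data means the filter does not need to be edited by hand if the data set is updated.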

Analysis

Part 1

We begin by looking at how the average gross revenue (gross) has changed over time. To keep the plot readable, we restrict the analysis to four genres of interest.

genre_list <- c("Comedy", "Action", "Animation", "Horror")
movies %>%
  filter(genre %in% genre_list) %>% 
  group_by(genre,year) %>%
  summarise(avg_gross = mean(gross)) %>%
  ggplot(mapping = aes(x = year, y = avg_gross, color=genre)) +
    geom_point() + 
    geom_line() +
    ylab("Average Gross Revenue (in US Dollars)") +
    ggtitle("Gross Revenue Over Time")
## `summarise()` regrouping output by 'genre' (override with `.groups` argument)
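
The message above is informational: after `summarise()`, the result is still grouped by genre. A minimal sketch of how the grouping can be set explicitly with the `.groups` argument (available in dplyr 1.0.0 and later) is shown below on made-up data. The data frame `toy_movies` and its values are hypothetical. The sketch also passes `na.rm = TRUE`, since `mean()` returns `NA` whenever a group contains a missing value; whether the real data has missing values of gross is not checked here.

```r
library(dplyr)

# Toy stand-in for the movies data (made-up values, not from IMDb)
toy_movies <- tibble(
  genre = c("Comedy", "Comedy", "Comedy", "Action", "Action"),
  year  = c(1986, 1986, 1987, 1986, 1986),
  gross = c(10, 20, 30, 40, NA)
)

avg_by_year <- toy_movies %>%
  group_by(genre, year) %>%
  # .groups = "drop" returns an ungrouped result and silences the message;
  # na.rm = TRUE ignores missing gross values within each group
  summarise(avg_gross = mean(gross, na.rm = TRUE), .groups = "drop")

avg_by_year
```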

Part 2

Next, let’s examine the relationship between a movie’s budget and its gross revenue.

movies %>%
  filter(genre %in% genre_list, budget > 0) %>% 
  ggplot(mapping = aes(x=log(budget), y = log(gross), color=genre)) +
  geom_point() +
  geom_smooth(method="lm",se=FALSE) + 
  xlab("Log-transformed Budget")+
  ylab("Log-transformed Gross Revenue") +
  facet_wrap(~ genre)
## `geom_smooth()` using formula 'y ~ x'

Discussion

  1. Consider the plot from Part 1. Suppose we fit a regression equation for each genre that uses year to predict gross revenue. Which genre do you expect to have the largest slope? (Submit your response in the online form.)

  2. Consider the plot from Part 2. Suppose we fit a regression equation for each genre that uses budget to predict gross revenue (gross). For each genre, what is the sign of the correlation between budget and gross, and what is the sign of the slope in the regression equation?

  3. Consider the plot from Part 2. Which genre has the smallest residuals, on average? Note: residual = observed revenue - predicted revenue.

  4. Consider the plot from Part 2. In the remaining time, discuss the following: budget and gross are log-transformed in this plot. Why are the log-transformed values of the variables displayed rather than the original values (in U.S. dollars)?
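
The per-genre regressions described in the questions above can be sketched in a few lines of base R. This is only an illustration on made-up data, not the analysis of the IMDb data: `toy_movies`, `fit_one_genre`, and `genre_fits` are hypothetical names, and the budgets and revenues are invented. The model is fit on the log scale, matching the plot in Part 2, so the residuals here are observed log revenue minus predicted log revenue.

```r
# Toy stand-in for the movies data (made-up values, not from IMDb)
toy_movies <- data.frame(
  genre  = rep(c("Comedy", "Horror"), each = 4),
  budget = c(1e6, 5e6, 2e7, 8e7, 5e5, 2e6, 1e7, 4e7),
  gross  = c(3e6, 9e6, 6e7, 1.5e8, 4e6, 5e6, 6e7, 9e7)
)

# Fit log(gross) ~ log(budget) for one genre and summarise the fit
fit_one_genre <- function(d) {
  fit <- lm(log(gross) ~ log(budget), data = d)
  c(
    slope             = unname(coef(fit)[2]),
    correlation       = cor(log(d$budget), log(d$gross)),
    # residuals average to zero, so compare their absolute size instead
    mean_abs_residual = mean(abs(resid(fit)))
  )
}

# One row per genre: slope, correlation, and average residual size
genre_fits <- t(sapply(split(toy_movies, toy_movies$genre), fit_one_genre))
genre_fits
```

Comparing the `slope` and `correlation` columns for each row is one way to check an answer to question 2, and the `mean_abs_residual` column gives a numeric counterpart to the visual comparison in question 3.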

Appendix

Below is a list of genres in the data set:

movies %>%
  distinct(genre) %>%  # one row per genre
  arrange(genre) %>%   # alphabetical order
  datatable()