In this analysis, we will work with the Advertising data.

Data and packages

We start by loading the packages we’ll use and reading in the data.

library(readr)
library(tidyverse)
library(skimr)
library(broom)
advertising <- read_csv("data/advertising.csv")

We will analyze advertising and sales data for 200 markets. The variables we’ll use are tv, radio, and newspaper (advertising spending in each medium) and sales.

Analysis

We’ll begin the analysis by getting a quick view of the data:

glimpse(advertising)
## Rows: 200
## Columns: 4
## $ tv        <dbl> 230.1, 44.5, 17.2, 151.5, 180.8, 8.7, 57.5, 120.2, 8.6, 199…
## $ radio     <dbl> 37.8, 39.3, 45.9, 41.3, 10.8, 48.9, 32.8, 19.6, 2.1, 2.6, 5…
## $ newspaper <dbl> 69.2, 45.1, 69.3, 58.5, 58.4, 75.0, 23.5, 11.6, 1.0, 21.2, …
## $ sales     <dbl> 22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8, 10.6, 8.…

Next, we can calculate summary statistics for each of the variables in the data set.

# skim() is from the skimr package
advertising %>% 
  skim()
Data summary
Name Piped data
Number of rows 200
Number of columns 4
_______________________
Column type frequency:
numeric 4
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
tv 0 1 147.04 85.85 0.7 74.38 149.75 218.82 296.4 ▇▆▆▇▆
radio 0 1 23.26 14.85 0.0 9.97 22.90 36.52 49.6 ▇▆▆▆▆
newspaper 0 1 30.55 21.78 0.3 12.75 25.75 45.10 114.0 ▇▆▃▁▁
sales 0 1 14.02 5.22 1.6 10.38 12.90 17.40 27.0 ▁▇▇▅▂
  1. What type of advertising typically has the smallest spending?
  2. What type of advertising has the largest variation in spending?
  3. Describe the shape of the distribution of sales.
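
To answer questions like these without reading values off the skim() table, we can compute the statistics ourselves. The sketch below is a minimal example that assumes only the first nine markets shown in the glimpse() output (the name ad_sample is ours, not part of the original analysis), so its numbers will differ from the full-data summary above.

```r
library(dplyr)

# First nine markets, copied from the glimpse() output (NOT the full 200 rows)
ad_sample <- tibble(
  tv        = c(230.1, 44.5, 17.2, 151.5, 180.8, 8.7, 57.5, 120.2, 8.6),
  radio     = c(37.8, 39.3, 45.9, 41.3, 10.8, 48.9, 32.8, 19.6, 2.1),
  newspaper = c(69.2, 45.1, 69.3, 58.5, 58.4, 75.0, 23.5, 11.6, 1.0),
  sales     = c(22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8)
)

# Median (a typical value) and standard deviation (spread) for every column;
# the result has one row and columns named tv_median, tv_sd, radio_median, ...
ad_summary <- ad_sample %>%
  summarise(across(everything(), list(median = median, sd = sd)))

ad_summary
```

With the full data, replacing ad_sample with advertising in the summarise() step should reproduce the p50 and sd columns reported by skim().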

We are most interested in understanding how advertising spending affects sales. One way to quantify the linear relationship between each pair of variables is by calculating the correlation matrix.

advertising %>% 
  cor()
##                   tv      radio  newspaper     sales
## tv        1.00000000 0.05480866 0.05664787 0.7822244
## radio     0.05480866 1.00000000 0.35410375 0.5762226
## newspaper 0.05664787 0.35410375 1.00000000 0.2282990
## sales     0.78222442 0.57622257 0.22829903 1.0000000
  1. What is the correlation between radio and sales? Interpret this value.
  2. What type of advertising has the strongest linear relationship with sales?
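
Each entry of the correlation matrix is just the correlation between one pair of variables, so a single value can also be computed on its own. The sketch below illustrates this in base R, again assuming only the first nine markets from the glimpse() output (ad_sample is our own name), so the value it prints will not match the full-data matrix above.

```r
# First nine markets, copied from the glimpse() output (NOT the full 200 rows)
ad_sample <- data.frame(
  tv        = c(230.1, 44.5, 17.2, 151.5, 180.8, 8.7, 57.5, 120.2, 8.6),
  radio     = c(37.8, 39.3, 45.9, 41.3, 10.8, 48.9, 32.8, 19.6, 2.1),
  newspaper = c(69.2, 45.1, 69.3, 58.5, 58.4, 75.0, 23.5, 11.6, 1.0),
  sales     = c(22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8)
)

# Correlation for one pair of variables
r_pair <- cor(ad_sample$radio, ad_sample$sales)

# The same value, pulled from the full matrix by row and column name
r_matrix <- cor(ad_sample)["radio", "sales"]

r_pair
r_matrix
```

The two results agree because cor() on a data frame computes the pairwise correlation for every combination of columns; the matrix is symmetric, so the ["sales", "radio"] entry holds the same value.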

Below are visualizations of sales versus each explanatory variable.

ggplot(data = advertising, mapping = aes(x = tv, y = sales)) + 
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") + 
  labs(title = "Sales vs. TV Advertising", 
       x = "TV Advertising (in $thousands)", 
       y = "Sales (in $millions)")

ggplot(data = advertising, mapping = aes(x = radio, y = sales)) + 
  geom_point(alpha = 0.7) + 
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Sales vs. Radio Advertising", 
       x = "Radio Advertising (in $thousands)", 
       y = "Sales (in $millions)")

## Fill in the code to create a scatterplot of sales vs. newspaper ads.

Since tv appears to have the strongest linear relationship with sales, let’s fit a simple linear regression model using these two variables.

ad_model <- lm(sales ~ tv, data=advertising)
ad_model
## 
## Call:
## lm(formula = sales ~ tv, data = advertising)
## 
## Coefficients:
## (Intercept)           tv  
##     7.03259      0.04754
  1. Write the model equation.
  2. Interpret the intercept in the context of the problem.
  3. Interpret the slope in the context of the problem.
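
Printing the model shows the coefficients, but it is often handier to have them in a data frame and to let R compute predictions. The sketch below shows one way to do this with tidy() from broom and predict(). It assumes only the first nine markets from the glimpse() output (ad_sample, sample_model, and the tv value of 100 are our own illustrative choices), so its estimates will differ from the full-data fit above.

```r
library(broom)

# First nine markets, copied from the glimpse() output (NOT the full 200 rows)
ad_sample <- data.frame(
  tv    = c(230.1, 44.5, 17.2, 151.5, 180.8, 8.7, 57.5, 120.2, 8.6),
  sales = c(22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8)
)

# Same model formula as ad_model, fit to the small sample
sample_model <- lm(sales ~ tv, data = ad_sample)

# One row per coefficient: term, estimate, std.error, statistic, p.value
tidy(sample_model)

# Predicted sales for a hypothetical market with tv = 100
new_market <- data.frame(tv = 100)
pred <- predict(sample_model, newdata = new_market)

# The same prediction by hand: intercept + slope * tv
by_hand <- coef(sample_model)[["(Intercept)"]] + coef(sample_model)[["tv"]] * 100

pred
by_hand
```

The hand calculation matches predict() because a simple linear regression prediction is the intercept plus the slope times the value of the explanatory variable.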

We’ll talk about slope and intercept next week!

Acknowledgements

The advertising data is from An Introduction to Statistical Learning.