In this analysis, we will work with the Advertising data.
We start by loading the packages we’ll use.
library(readr)
library(tidyverse)
library(skimr)
library(broom)
advertising <- read_csv("data/advertising.csv")
We will analyze the advertising and sales data for 200 markets. The variables we’ll use are
- tv: total spending on TV advertising (in $thousands)
- radio: total spending on radio advertising (in $thousands)
- newspaper: total spending on newspaper advertising (in $thousands)
- sales: total sales (in $millions)
We’ll begin the analysis by getting a quick view of the data:
glimpse(advertising)
## Rows: 200
## Columns: 4
## $ tv <dbl> 230.1, 44.5, 17.2, 151.5, 180.8, 8.7, 57.5, 120.2, 8.6, 199…
## $ radio <dbl> 37.8, 39.3, 45.9, 41.3, 10.8, 48.9, 32.8, 19.6, 2.1, 2.6, 5…
## $ newspaper <dbl> 69.2, 45.1, 69.3, 58.5, 58.4, 75.0, 23.5, 11.6, 1.0, 21.2, …
## $ sales <dbl> 22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8, 10.6, 8.…
Next, we can calculate summary statistics for each of the variables in the data set.
# skim() is from the skimr package
advertising %>%
  skim()
Name | Piped data |
Number of rows | 200 |
Number of columns | 4 |
_______________________ | |
Column type frequency: | |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
tv | 0 | 1 | 147.04 | 85.85 | 0.7 | 74.38 | 149.75 | 218.82 | 296.4 | ▇▆▆▇▆ |
radio | 0 | 1 | 23.26 | 14.85 | 0.0 | 9.97 | 22.90 | 36.52 | 49.6 | ▇▆▆▆▆ |
newspaper | 0 | 1 | 30.55 | 21.78 | 0.3 | 12.75 | 25.75 | 45.10 | 114.0 | ▇▆▃▁▁ |
sales | 0 | 1 | 14.02 | 5.22 | 1.6 | 10.38 | 12.90 | 17.40 | 27.0 | ▁▇▇▅▂ |
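If skimr is not available, similar numeric summaries can be computed with base R. The sketch below is only an illustration: `ad_sample` is a hypothetical data frame holding the first nine rows printed by glimpse() above, so its numbers will not match the full 200-market table.

```r
# Hypothetical sample: the first nine rows shown in the glimpse() output above.
# In the actual analysis, use the full `advertising` data frame instead.
ad_sample <- data.frame(
  tv        = c(230.1, 44.5, 17.2, 151.5, 180.8, 8.7, 57.5, 120.2, 8.6),
  radio     = c(37.8, 39.3, 45.9, 41.3, 10.8, 48.9, 32.8, 19.6, 2.1),
  newspaper = c(69.2, 45.1, 69.3, 58.5, 58.4, 75.0, 23.5, 11.6, 1.0),
  sales     = c(22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8)
)

# Build a matrix with one column per variable and one row per statistic
sample_stats <- sapply(ad_sample, function(x) {
  c(mean = mean(x), sd = sd(x), median = median(x), min = min(x), max = max(x))
})

# Round for display only; sample_stats keeps full precision
round(sample_stats, 2)
```

Replacing `ad_sample` with `advertising` should give the mean, sd, p0, p50 and p100 columns of the skim() table.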
We are most interested in understanding how advertising spending affects sales. One way to quantify the relationship between the variables is by calculating the correlation matrix.
advertising %>%
  cor()
## tv radio newspaper sales
## tv 1.00000000 0.05480866 0.05664787 0.7822244
## radio 0.05480866 1.00000000 0.35410375 0.5762226
## newspaper 0.05664787 0.35410375 1.00000000 0.2282990
## sales 0.78222442 0.57622257 0.22829903 1.0000000
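To look at one pair of variables rather than the whole matrix, cor() also accepts two vectors. The sketch below assumes a hypothetical data frame `ad_sample` built from the first ten radio and sales values printed by glimpse(), so its result will differ from the full-data entry in the matrix above.

```r
# Hypothetical sample: first ten radio and sales values from the glimpse() output.
# In the actual analysis, use advertising$radio and advertising$sales.
ad_sample <- data.frame(
  radio = c(37.8, 39.3, 45.9, 41.3, 10.8, 48.9, 32.8, 19.6, 2.1, 2.6),
  sales = c(22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8, 10.6)
)

# Pearson correlation between two numeric vectors
r_radio_sales <- cor(ad_sample$radio, ad_sample$sales)

# The same number can be read off the correlation matrix by row and column name
cor_matrix <- cor(ad_sample)
cor_matrix["radio", "sales"]
```

Run on the full `advertising` data, the pairwise call should reproduce the radio/sales entry of the matrix (about 0.576).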
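What is the correlation between radio and sales? Interpret this value.
Which explanatory variable has the strongest linear relationship with sales?
Below are visualizations of sales versus each explanatory variable.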
ggplot(data = advertising, mapping = aes(x = tv, y = sales)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Sales vs. TV Advertising",
       x = "TV Advertising (in $thousands)",
       y = "Sales (in $millions)")
ggplot(data = advertising, mapping = aes(x = radio, y = sales)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Sales vs. Radio Advertising",
       x = "Radio Advertising (in $thousands)",
       y = "Sales (in $millions)")
## Fill in the code to create a scatterplot of sales vs. newspaper ads.
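Following the pattern of the TV and radio plots, a scatterplot of sales versus newspaper spending could be sketched as below. This is one possible version, not necessarily the intended solution: `ad_sample` is a hypothetical data frame built from the first ten rows printed by glimpse(), and the line color is an arbitrary choice.

```r
library(ggplot2)

# Hypothetical sample: first ten newspaper and sales values from the glimpse() output.
# In the actual analysis, pass the full `advertising` data frame instead.
ad_sample <- data.frame(
  newspaper = c(69.2, 45.1, 69.3, 58.5, 58.4, 75.0, 23.5, 11.6, 1.0, 21.2),
  sales     = c(22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8, 10.6)
)

# Same layers as the TV and radio plots: points plus a least-squares line
newspaper_plot <- ggplot(data = ad_sample, mapping = aes(x = newspaper, y = sales)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE, color = "darkgreen") +
  labs(title = "Sales vs. Newspaper Advertising",
       x = "Newspaper Advertising (in $thousands)",
       y = "Sales (in $millions)")

newspaper_plot
```

Given the correlation of about 0.23 between newspaper and sales in the matrix above, the full-data version of this plot should show a much weaker linear pattern than the TV plot.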
Since tv appears to have the strongest linear relationship with sales, let’s fit a simple linear regression model using these two variables.
ad_model <- lm(sales ~ tv, data=advertising)
ad_model
##
## Call:
## lm(formula = sales ~ tv, data = advertising)
##
## Coefficients:
## (Intercept) tv
## 7.03259 0.04754
We’ll talk about slope and intercept next week!
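In the meantime, the two printed coefficients already define a line that turns a TV budget into a predicted sales value. The sketch below copies the coefficients from the lm() output above; the three TV budgets are hypothetical values chosen only for illustration.

```r
# Coefficients copied from the printed lm() output above
intercept <- 7.03259
slope <- 0.04754

# Hypothetical TV budgets (in $thousands), chosen only for illustration
tv_budget <- c(50, 100, 200)

# Fitted line: predicted sales (in $millions) = intercept + slope * tv
predicted_sales <- intercept + slope * tv_budget
predicted_sales

# With the fitted model object, the equivalent call would be
# predict(ad_model, newdata = data.frame(tv = tv_budget)),
# which should agree up to the rounding of the printed coefficients.
```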
The advertising data is from An Introduction to Statistical Learning.