library(tidyverse)
library(broom)
library(patchwork)
library(knitr)
Today’s data set contains the price and characteristics for 271diamonds randomly selected from AwesomeGems.com in July 2005.1 The variables in the data set are
Carat
: Size of the diamond (in carats)Color
: Coded as D (most white/bright) through JClarity
: Coded as IF (internally flawless), VVS1, VVS2, VS1, VS2, SI1, SI2, or SI3 (slightly clouded)Depth
: Depth (as a percentage of diameter)PricePerCt
: Price per caratTotalPrice
: Price for the diamond (in dollars)We will use the characteristics to understand variability in the price of diamonds.
diamonds <- read_csv("data/diamonds.csv")
diamonds <- diamonds %>%
mutate(carat_cent = Carat - mean(Carat),
log_total_price = log(TotalPrice))
log_model <- lm(log_total_price ~ Clarity + carat_cent + Clarity*carat_cent, data = diamonds)
tidy(log_model, conf.int = TRUE) %>%
kable(digits = 3)
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 8.377 | 0.066 | 126.707 | 0.000 | 8.247 | 8.507 |
ClaritySI1 | -0.299 | 0.075 | -4.007 | 0.000 | -0.446 | -0.152 |
ClaritySI2 | -0.303 | 0.111 | -2.730 | 0.007 | -0.522 | -0.084 |
ClaritySI3 | -1.005 | 0.175 | -5.759 | 0.000 | -1.349 | -0.662 |
ClarityVS1 | -0.133 | 0.069 | -1.918 | 0.056 | -0.270 | 0.004 |
ClarityVS2 | -0.201 | 0.070 | -2.857 | 0.005 | -0.340 | -0.063 |
ClarityVVS1 | -0.157 | 0.083 | -1.884 | 0.061 | -0.320 | 0.007 |
ClarityVVS2 | -0.054 | 0.081 | -0.669 | 0.504 | -0.214 | 0.105 |
carat_cent | 1.932 | 0.189 | 10.244 | 0.000 | 1.561 | 2.304 |
ClaritySI1:carat_cent | 0.554 | 0.237 | 2.331 | 0.021 | 0.086 | 1.021 |
ClaritySI2:carat_cent | -0.646 | 0.273 | -2.367 | 0.019 | -1.183 | -0.109 |
ClaritySI3:carat_cent | -0.002 | 0.368 | -0.005 | 0.996 | -0.726 | 0.722 |
ClarityVS1:carat_cent | 0.424 | 0.205 | 2.068 | 0.040 | 0.020 | 0.827 |
ClarityVS2:carat_cent | 0.237 | 0.213 | 1.116 | 0.266 | -0.181 | 0.656 |
ClarityVVS1:carat_cent | 0.462 | 0.241 | 1.918 | 0.056 | -0.012 | 0.937 |
ClarityVVS2:carat_cent | 0.473 | 0.255 | 1.856 | 0.065 | -0.029 | 0.974 |
log_model_aug <- augment(log_model)
ggplot(data = log_model_aug, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, color = "Red") +
labs(x = "Predicted",
y = "Residual",
title = "Residuals vs. Predcited")
It looks like we fixed the constant variance issue, but linearity is still violated!
Let’s take a look at a plot of log(TotalPrice) vs. carat and the residuals vs. carat.
ggplot(data = diamonds, aes(x = carat_cent, y =log_total_price )) +
geom_point() +
labs(x= "Mean-Centered Carat",
y = "Log-transformed Total Price)",
title = "Log-transformed total price vs. Mean-Centered Carat")
ggplot(data = log_model_aug, aes(x = carat_cent, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, color = "red") +
labs(x = "Mean-Centered Carat",
y = "Residual",
title = "Residuals vs. Mean-Centered Carat")
log_model_v2 <- lm(log_total_price ~ Clarity + carat_cent + Clarity*carat_cent + I(carat_cent^2),
data = diamonds)
tidy(log_model_v2, conf.int = TRUE) %>%
kable(digits = 3)
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 8.598 | 0.054 | 160.100 | 0.000 | 8.492 | 8.704 |
ClaritySI1 | -0.428 | 0.058 | -7.331 | 0.000 | -0.544 | -0.313 |
ClaritySI2 | -0.583 | 0.088 | -6.604 | 0.000 | -0.757 | -0.409 |
ClaritySI3 | -0.961 | 0.135 | -7.128 | 0.000 | -1.226 | -0.695 |
ClarityVS1 | -0.227 | 0.054 | -4.206 | 0.000 | -0.334 | -0.121 |
ClarityVS2 | -0.324 | 0.055 | -5.868 | 0.000 | -0.432 | -0.215 |
ClarityVVS1 | -0.215 | 0.064 | -3.347 | 0.001 | -0.342 | -0.089 |
ClarityVVS2 | -0.171 | 0.063 | -2.714 | 0.007 | -0.296 | -0.047 |
carat_cent | 1.892 | 0.146 | 12.997 | 0.000 | 1.606 | 2.179 |
I(carat_cent^2) | -1.860 | 0.141 | -13.195 | 0.000 | -2.138 | -1.583 |
ClaritySI1:carat_cent | 0.917 | 0.185 | 4.948 | 0.000 | 0.552 | 1.282 |
ClaritySI2:carat_cent | 0.779 | 0.237 | 3.295 | 0.001 | 0.314 | 1.245 |
ClaritySI3:carat_cent | 0.947 | 0.293 | 3.234 | 0.001 | 0.370 | 1.523 |
ClarityVS1:carat_cent | 0.572 | 0.159 | 3.610 | 0.000 | 0.260 | 0.885 |
ClarityVS2:carat_cent | 0.656 | 0.167 | 3.923 | 0.000 | 0.326 | 0.985 |
ClarityVVS1:carat_cent | 0.213 | 0.187 | 1.141 | 0.255 | -0.155 | 0.581 |
ClarityVVS2:carat_cent | 0.297 | 0.197 | 1.508 | 0.133 | -0.091 | 0.685 |
Clarity = IF
.\[\hat{\log(TotalPrice)} = 8.598 + 1.892\times \text{carat_cent} - 1.860\times \text{carat_cent}^2\]
Based on the model, the effect on \(\log(TotalPrice)\) is \[1.892*(0.2 - (-0.2)) - 1.860*(0.2^2 - (-0.2)^2) = 0.7568\]
Interpretation in terms of \(\log(TotalPrice)\): For diamonds with clarity IF, if we go from a diamond that is 0.2 carats below the mean carat weight to one that is 0.2 carats above the mean carat weight, we expect the log price to increase by 0.7568 on average.
Interpretation in terms of price: For diamonds with clarity IF, diamonds that weigh 0.2 carats above the mean carat weight are expected to have a median price that is 2.131, exp(0.7568) times the median price of diamonds that weigh 0.2 carats below the mean carat weight.
Alternate interpretation: For diamonds with clarity IF, if we go from a diamond that weighs 0.2 carats below the mean carat weight to one that weighs 0.2 carats more the mean carat weight, we expect the price to multiply by a factor of 2.131, exp(0.7568).
Suppose we have the following model:
orig_model <- lm(log_total_price ~ Carat + Depth, data = diamonds)
tidy(orig_model, conf.int = TRUE) %>%
kable(digits = 3)
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 6.940 | 0.194 | 35.767 | 0.000 | 6.558 | 7.322 |
Carat | 2.151 | 0.057 | 37.412 | 0.000 | 2.037 | 2.264 |
Depth | -0.007 | 0.003 | -2.320 | 0.021 | -0.014 | -0.001 |
Conduct a nested F test to determine if we should add Carat
to the model shown above.
\(H_0\): All of the coefficients for Color are 0
\(H_a\): At least one coefficient for Color is not 0
proposed_model <- lm(log_total_price ~ Carat + Depth + Color, data = diamonds)
anova(orig_model, proposed_model) %>%
tidy() %>%
kable(digits = 3)
res.df | rss | df | sumsq | statistic | p.value |
---|---|---|---|---|---|
268 | 16.050 | NA | NA | NA | NA |
262 | 12.962 | 6 | 3.088 | 10.404 | 0 |
The p-value is very small, so we reject \(H_0\). There is at least one coefficient associated with Color that is not 0, so we should include Color in the model.
glance(orig_model) %>%
select(adj.r.squared, AIC, BIC)
## # A tibble: 1 x 3
## adj.r.squared AIC BIC
## <dbl> <dbl> <dbl>
## 1 0.854 11.1 25.5
glance(proposed_model) %>%
select(adj.r.squared, AIC, BIC)
## # A tibble: 1 x 3
## adj.r.squared AIC BIC
## <dbl> <dbl> <dbl>
## 1 0.879 -34.8 1.22
Based on \(Adj. R^2\), the proposed model that includes Color is a better fit since the \(Adj. R^2\) is higher.
Based on \(AIC\), the proposed model that includes Color is a better fit, since \(AIC\) is lower.
Based on \(BIC\), the proposed model that includes Color is a better fit, since \(BIC\) is lower.
Data set adapted from the Diamonds data set in the Stat2Data R Package.↩︎