library(tidyverse)
library(broom)
library(patchwork)
library(knitr)
Today’s data set contains the price and characteristics for 271 diamonds randomly selected from AwesomeGems.com in July 2005.[1] The variables in the data set are
Carat: Size of the diamond (in carats)
Color: Coded as D (most white/bright) through J
Clarity: Coded as IF (internally flawless), VVS1, VVS2, VS1, VS2, SI1, SI2, or SI3 (slightly clouded)
Depth: Depth (as a percentage of diameter)
PricePerCt: Price per carat
TotalPrice: Price for the diamond (in dollars)
We will use the characteristics to understand variability in the price of diamonds.
diamonds <- read_csv("data/diamonds.csv")
orig_model <- lm(TotalPrice ~ Clarity + Carat + Clarity * Carat, data = diamonds)
tidy(orig_model, conf.int = TRUE) %>%
  kable(digits = 3)
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | -1086.229 | 511.376 | -2.124 | 0.035 | -2093.286 | -79.171 |
ClaritySI1 | -2574.745 | 719.867 | -3.577 | 0.000 | -3992.386 | -1157.104 |
ClaritySI2 | -1196.544 | 1046.294 | -1.144 | 0.254 | -3257.022 | 863.934 |
ClaritySI3 | -1181.771 | 1517.647 | -0.779 | 0.437 | -4170.490 | 1806.948 |
ClarityVS1 | -2085.958 | 581.567 | -3.587 | 0.000 | -3231.244 | -940.672 |
ClarityVS2 | -2608.348 | 615.997 | -4.234 | 0.000 | -3821.438 | -1395.259 |
ClarityVVS1 | -682.555 | 660.317 | -1.034 | 0.302 | -1982.924 | 617.814 |
ClarityVVS2 | -1295.121 | 686.745 | -1.886 | 0.060 | -2647.536 | 57.294 |
Carat | 7770.757 | 747.682 | 10.393 | 0.000 | 6298.339 | 9243.176 |
ClaritySI1:Carat | 1534.970 | 941.373 | 1.631 | 0.104 | -318.886 | 3388.826 |
ClaritySI2:Carat | -1009.767 | 1080.924 | -0.934 | 0.351 | -3138.443 | 1118.908 |
ClaritySI3:Carat | -2208.757 | 1457.253 | -1.516 | 0.131 | -5078.542 | 661.027 |
ClarityVS1:Carat | 1882.893 | 812.346 | 2.318 | 0.021 | 283.132 | 3482.653 |
ClarityVS2:Carat | 1944.583 | 842.556 | 2.308 | 0.022 | 285.328 | 3603.838 |
ClarityVVS1:Carat | 84.449 | 955.234 | 0.088 | 0.930 | -1796.703 | 1965.601 |
ClarityVVS2:Carat | 1018.237 | 1009.286 | 1.009 | 0.314 | -969.361 | 3005.835 |
Mean-center Carat and refit the model from Part 2 using the mean-centered variable for carat.
## mean-center carat
diamonds <- diamonds %>%
  mutate(carat_cent = Carat - mean(Carat))
## code to refit the model
mean_cent_model <- lm(TotalPrice ~ Clarity + carat_cent + Clarity * carat_cent, data = diamonds)
tidy(mean_cent_model, conf.int = TRUE) %>%
  kable(digits = 3)
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 5175.396 | 262.076 | 19.748 | 0.000 | 4659.287 | 5691.504 |
ClaritySI1 | -1337.877 | 295.885 | -4.522 | 0.000 | -1920.567 | -755.187 |
ClaritySI2 | -2010.208 | 440.171 | -4.567 | 0.000 | -2877.041 | -1143.375 |
ClaritySI3 | -2961.573 | 691.962 | -4.280 | 0.000 | -4324.262 | -1598.884 |
ClarityVS1 | -568.736 | 275.161 | -2.067 | 0.040 | -1110.613 | -26.858 |
ClarityVS2 | -1041.416 | 279.329 | -3.728 | 0.000 | -1591.502 | -491.330 |
ClarityVVS1 | -614.507 | 329.673 | -1.864 | 0.063 | -1263.736 | 34.722 |
ClarityVVS2 | -474.632 | 321.197 | -1.478 | 0.141 | -1107.169 | 157.905 |
carat_cent | 7770.757 | 747.682 | 10.393 | 0.000 | 6298.339 | 9243.176 |
ClaritySI1:carat_cent | 1534.970 | 941.373 | 1.631 | 0.104 | -318.886 | 3388.826 |
ClaritySI2:carat_cent | -1009.767 | 1080.924 | -0.934 | 0.351 | -3138.443 | 1118.908 |
ClaritySI3:carat_cent | -2208.757 | 1457.253 | -1.516 | 0.131 | -5078.542 | 661.027 |
ClarityVS1:carat_cent | 1882.893 | 812.346 | 2.318 | 0.021 | 283.132 | 3482.653 |
ClarityVS2:carat_cent | 1944.583 | 842.556 | 2.308 | 0.022 | 285.328 | 3603.838 |
ClarityVVS1:carat_cent | 84.449 | 955.234 | 0.088 | 0.930 | -1796.703 | 1965.601 |
ClarityVVS2:carat_cent | 1018.237 | 1009.286 | 1.009 | 0.314 | -969.361 | 3005.835 |
Interpret the intercept in the context of the data.
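A small simulation may help when reading a mean-centered intercept. The sketch below uses made-up data (the names x, y, fit_raw, and fit_cent are hypothetical and unrelated to the diamonds data) to show that centering a predictor leaves its slope unchanged and turns the intercept into the prediction at the average value of that predictor.

```r
# Simulated data for illustration only (not the diamonds data)
set.seed(1)
x <- runif(100, min = 0.3, max = 1.5)
y <- 500 + 7000 * x + rnorm(100, sd = 300)

# Fit the same line with the raw and the mean-centered predictor
fit_raw  <- lm(y ~ x)
x_cent   <- x - mean(x)
fit_cent <- lm(y ~ x_cent)

# The slope is the same in both fits
slope_raw  <- unname(coef(fit_raw)["x"])
slope_cent <- unname(coef(fit_cent)["x_cent"])

# The centered intercept equals the raw model's prediction at mean(x)
pred_at_mean <- unname(predict(fit_raw, newdata = data.frame(x = mean(x))))
int_cent     <- unname(coef(fit_cent)["(Intercept)"])

all.equal(slope_raw, slope_cent)
all.equal(pred_at_mean, int_cent)
```

In the diamonds model the same logic applies within the baseline Clarity level, since the intercept describes the group whose indicator variables are all 0.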
Let’s compare the original and mean-centered models for ClarityIF.
Let’s compare the original and mean-centered models for ClaritySI1.
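One way to relate the two tables is to substitute Carat = carat_cent + mean(Carat) into the original model. Every slope stays the same, while the intercept and each Clarity coefficient shift by the matching slope times the mean. The mean carat is not reported here; the value 0.8058 used below is an assumption backed out from the two coefficient tables.

```latex
\begin{align*}
\hat{\beta}_{0}^{\text{cent}}
  &= \hat{\beta}_{0} + \hat{\beta}_{\text{Carat}}\,\overline{\text{Carat}}
   \approx -1086.229 + 7770.757\,(0.8058) \approx 5175 \\
\hat{\beta}_{\text{SI1}}^{\text{cent}}
  &= \hat{\beta}_{\text{SI1}} + \hat{\beta}_{\text{SI1:Carat}}\,\overline{\text{Carat}}
   \approx -2574.745 + 1534.970\,(0.8058) \approx -1338
\end{align*}
```

Both results agree with the mean-centered table up to rounding, and the Carat and interaction rows are identical in the two tables.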
Let’s use the mean-centered version of the model to answer the following questions:
… Clarity = "SI3" differs significantly from diamonds with Clarity = "IF". We should use the inferential statistics for which term to conduct this test?
… Clarity = "SI3" differs significantly from diamonds with Clarity = "IF". We should use the inferential statistics for which term to conduct this test?
… Clarity = VS1 differs from the slope of carat for diamonds with Clarity = IF?
Let’s take a look at the plot of residuals vs. fitted.
model_aug <- augment(mean_cent_model)
ggplot(data = model_aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  labs(x = "Predicted",
       y = "Residual",
       title = "Residuals vs. Predicted")
diamonds <- diamonds %>%
  mutate(log_total_price = log(TotalPrice))
log_model <- lm(log_total_price ~ Clarity + carat_cent + Clarity * carat_cent, data = diamonds)
tidy(log_model, conf.int = TRUE) %>%
  kable(digits = 3)
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 8.377 | 0.066 | 126.707 | 0.000 | 8.247 | 8.507 |
ClaritySI1 | -0.299 | 0.075 | -4.007 | 0.000 | -0.446 | -0.152 |
ClaritySI2 | -0.303 | 0.111 | -2.730 | 0.007 | -0.522 | -0.084 |
ClaritySI3 | -1.005 | 0.175 | -5.759 | 0.000 | -1.349 | -0.662 |
ClarityVS1 | -0.133 | 0.069 | -1.918 | 0.056 | -0.270 | 0.004 |
ClarityVS2 | -0.201 | 0.070 | -2.857 | 0.005 | -0.340 | -0.063 |
ClarityVVS1 | -0.157 | 0.083 | -1.884 | 0.061 | -0.320 | 0.007 |
ClarityVVS2 | -0.054 | 0.081 | -0.669 | 0.504 | -0.214 | 0.105 |
carat_cent | 1.932 | 0.189 | 10.244 | 0.000 | 1.561 | 2.304 |
ClaritySI1:carat_cent | 0.554 | 0.237 | 2.331 | 0.021 | 0.086 | 1.021 |
ClaritySI2:carat_cent | -0.646 | 0.273 | -2.367 | 0.019 | -1.183 | -0.109 |
ClaritySI3:carat_cent | -0.002 | 0.368 | -0.005 | 0.996 | -0.726 | 0.722 |
ClarityVS1:carat_cent | 0.424 | 0.205 | 2.068 | 0.040 | 0.020 | 0.827 |
ClarityVS2:carat_cent | 0.237 | 0.213 | 1.116 | 0.266 | -0.181 | 0.656 |
ClarityVVS1:carat_cent | 0.462 | 0.241 | 1.918 | 0.056 | -0.012 | 0.937 |
ClarityVVS2:carat_cent | 0.473 | 0.255 | 1.856 | 0.065 | -0.029 | 0.974 |
Interpret the intercept in terms of log(TotalPrice).
Interpret the intercept in terms of TotalPrice.
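Moving from the log scale to the dollar scale comes down to exponentiating the coefficient. The sketch below copies the intercept from the table above; the variable names are hypothetical. Under the usual reading of a log-transformed response, the back-transformed value is interpreted as a typical (median) price rather than a mean, which is an assumption about how the course frames it.

```r
# Intercept of the log model, copied from the coefficient table above
b0_log <- 8.377

# Back-transform from log(dollars) to dollars
price_at_baseline <- exp(b0_log)
round(price_at_baseline)
```

This gives about $4,346 for a diamond with Clarity = IF and average carat, the group the intercept describes.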
Let’s focus on diamonds with Clarity = IF.
… log(TotalPrice).
… carat_cent in terms of log(TotalPrice)
… carat_cent in terms of the TotalPrice.
log_model_aug <- augment(log_model)
ggplot(data = log_model_aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  labs(x = "Predicted",
       y = "Residual",
       title = "Residuals vs. Predicted")
It looks like we fixed the constant variance issue, but linearity is still violated!
Let’s take a look at a plot of log(TotalPrice) vs. carat and the residuals vs. carat.
ggplot(data = diamonds, aes(x = carat_cent, y = log_total_price)) +
  geom_point() +
  labs(x = "Mean-Centered Carat",
       y = "Log-transformed Total Price",
       title = "Log-transformed Total Price vs. Mean-Centered Carat")
ggplot(data = log_model_aug, aes(x = carat_cent, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  labs(x = "Mean-Centered Carat",
       y = "Residual",
       title = "Residuals vs. Mean-Centered Carat")
log_model_v2 <- lm(log_total_price ~ Clarity + carat_cent + Clarity * carat_cent + I(carat_cent^2),
                   data = diamonds)
tidy(log_model_v2, conf.int = TRUE) %>%
  kable(digits = 3)
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 8.598 | 0.054 | 160.100 | 0.000 | 8.492 | 8.704 |
ClaritySI1 | -0.428 | 0.058 | -7.331 | 0.000 | -0.544 | -0.313 |
ClaritySI2 | -0.583 | 0.088 | -6.604 | 0.000 | -0.757 | -0.409 |
ClaritySI3 | -0.961 | 0.135 | -7.128 | 0.000 | -1.226 | -0.695 |
ClarityVS1 | -0.227 | 0.054 | -4.206 | 0.000 | -0.334 | -0.121 |
ClarityVS2 | -0.324 | 0.055 | -5.868 | 0.000 | -0.432 | -0.215 |
ClarityVVS1 | -0.215 | 0.064 | -3.347 | 0.001 | -0.342 | -0.089 |
ClarityVVS2 | -0.171 | 0.063 | -2.714 | 0.007 | -0.296 | -0.047 |
carat_cent | 1.892 | 0.146 | 12.997 | 0.000 | 1.606 | 2.179 |
I(carat_cent^2) | -1.860 | 0.141 | -13.195 | 0.000 | -2.138 | -1.583 |
ClaritySI1:carat_cent | 0.917 | 0.185 | 4.948 | 0.000 | 0.552 | 1.282 |
ClaritySI2:carat_cent | 0.779 | 0.237 | 3.295 | 0.001 | 0.314 | 1.245 |
ClaritySI3:carat_cent | 0.947 | 0.293 | 3.234 | 0.001 | 0.370 | 1.523 |
ClarityVS1:carat_cent | 0.572 | 0.159 | 3.610 | 0.000 | 0.260 | 0.885 |
ClarityVS2:carat_cent | 0.656 | 0.167 | 3.923 | 0.000 | 0.326 | 0.985 |
ClarityVVS1:carat_cent | 0.213 | 0.187 | 1.141 | 0.255 | -0.155 | 0.581 |
ClarityVVS2:carat_cent | 0.297 | 0.197 | 1.508 | 0.133 | -0.091 | 0.685 |
Write the model equation for Clarity = IF.
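As a sketch of what that equation might look like: IF is the baseline level of Clarity (it has no row of its own in the table), so every Clarity indicator and interaction term drops out and only the intercept, carat_cent, and I(carat_cent^2) rows remain. The coefficients below are copied from the table above.

```latex
\widehat{\log(\text{TotalPrice})}
  = 8.598 + 1.892\,\text{carat\_cent} - 1.860\,\text{carat\_cent}^{2}
```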
Let’s interpret the effect of carat on the total price for diamonds with Clarity = IF that have a carat size within 0.2 carats of the mean (~Q1 and Q3).
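Because the model includes a squared term, the effect of carat is not a single number: it depends on where the change starts. The sketch below uses the Clarity = IF coefficients copied from the table above and compares two 0.2-carat steps, one just below the mean and one just above it. The helper pred_log_price() and the choice of steps are assumptions made for illustration, not part of the original exercise.

```r
# Coefficients for the Clarity = IF baseline, copied from the log_model_v2 table
b0 <- 8.598   # (Intercept)
b1 <- 1.892   # carat_cent
b2 <- -1.860  # I(carat_cent^2)

# Predicted log(TotalPrice) for an IF diamond at a given mean-centered carat
pred_log_price <- function(carat_cent) {
  b0 + b1 * carat_cent + b2 * carat_cent^2
}

# Change in predicted log price over a 0.2-carat step below and above the mean
delta_below <- pred_log_price(0) - pred_log_price(-0.2)
delta_above <- pred_log_price(0.2) - pred_log_price(0)

# Exponentiate to get the multiplicative change in predicted price
ratio_below <- exp(delta_below)
ratio_above <- exp(delta_above)
round(c(below = ratio_below, above = ratio_above), 2)
```

The two ratios come out near 1.57 and 1.36, so the same 0.2-carat increase is associated with a larger relative price change below the mean than above it, which reflects the negative coefficient on the squared term.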
[1] Data set adapted from the Diamonds data set in the Stat2Data R package.