Announcements

Questions from video?

Price of Diamonds

library(tidyverse)
library(broom)
library(patchwork)
library(knitr)

Today’s data set contains the price and characteristics for 271diamonds randomly selected from AwesomeGems.com in July 2005.1 The variables in the data set are

We will use the characteristics to understand variability in the price of diamonds.

diamonds <- read_csv("data/diamonds.csv")
diamonds <- diamonds %>%
  mutate(carat_cent = Carat - mean(Carat), 
         log_total_price = log(TotalPrice))
log_model <- lm(log_total_price ~ Clarity + carat_cent + Clarity*carat_cent, data = diamonds)
tidy(log_model, conf.int = TRUE) %>%
  kable(digits = 3)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 8.377 0.066 126.707 0.000 8.247 8.507
ClaritySI1 -0.299 0.075 -4.007 0.000 -0.446 -0.152
ClaritySI2 -0.303 0.111 -2.730 0.007 -0.522 -0.084
ClaritySI3 -1.005 0.175 -5.759 0.000 -1.349 -0.662
ClarityVS1 -0.133 0.069 -1.918 0.056 -0.270 0.004
ClarityVS2 -0.201 0.070 -2.857 0.005 -0.340 -0.063
ClarityVVS1 -0.157 0.083 -1.884 0.061 -0.320 0.007
ClarityVVS2 -0.054 0.081 -0.669 0.504 -0.214 0.105
carat_cent 1.932 0.189 10.244 0.000 1.561 2.304
ClaritySI1:carat_cent 0.554 0.237 2.331 0.021 0.086 1.021
ClaritySI2:carat_cent -0.646 0.273 -2.367 0.019 -1.183 -0.109
ClaritySI3:carat_cent -0.002 0.368 -0.005 0.996 -0.726 0.722
ClarityVS1:carat_cent 0.424 0.205 2.068 0.040 0.020 0.827
ClarityVS2:carat_cent 0.237 0.213 1.116 0.266 -0.181 0.656
ClarityVVS1:carat_cent 0.462 0.241 1.918 0.056 -0.012 0.937
ClarityVVS2:carat_cent 0.473 0.255 1.856 0.065 -0.029 0.974
log_model_aug <- augment(log_model)
ggplot(data = log_model_aug, aes(x = .fitted, y  = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "Red") +
  labs(x = "Predicted", 
      y = "Residual", 
      title = "Residuals vs. Predcited")

It looks like we fixed the constant variance issue, but linearity is still violated!

Higher-order terms

Let’s take a look at a plot of log(TotalPrice) vs. carat and the residuals vs. carat.

ggplot(data = diamonds, aes(x = carat_cent, y =log_total_price )) +
  geom_point() +
  labs(x= "Mean-Centered Carat", 
       y  = "Log-transformed Total Price)", 
       title = "Log-transformed total price vs. Mean-Centered Carat")

ggplot(data = log_model_aug, aes(x = carat_cent, y = .resid)) +
  geom_point() + 
  geom_hline(yintercept = 0, color = "red") +
  labs(x = "Mean-Centered Carat", 
       y = "Residual", 
       title = "Residuals vs. Mean-Centered Carat")

  1. Let’s add a quadratic term to the model.
log_model_v2 <- lm(log_total_price ~ Clarity + carat_cent + Clarity*carat_cent + I(carat_cent^2), 
                   data = diamonds)
tidy(log_model_v2, conf.int = TRUE) %>%
  kable(digits = 3)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 8.598 0.054 160.100 0.000 8.492 8.704
ClaritySI1 -0.428 0.058 -7.331 0.000 -0.544 -0.313
ClaritySI2 -0.583 0.088 -6.604 0.000 -0.757 -0.409
ClaritySI3 -0.961 0.135 -7.128 0.000 -1.226 -0.695
ClarityVS1 -0.227 0.054 -4.206 0.000 -0.334 -0.121
ClarityVS2 -0.324 0.055 -5.868 0.000 -0.432 -0.215
ClarityVVS1 -0.215 0.064 -3.347 0.001 -0.342 -0.089
ClarityVVS2 -0.171 0.063 -2.714 0.007 -0.296 -0.047
carat_cent 1.892 0.146 12.997 0.000 1.606 2.179
I(carat_cent^2) -1.860 0.141 -13.195 0.000 -2.138 -1.583
ClaritySI1:carat_cent 0.917 0.185 4.948 0.000 0.552 1.282
ClaritySI2:carat_cent 0.779 0.237 3.295 0.001 0.314 1.245
ClaritySI3:carat_cent 0.947 0.293 3.234 0.001 0.370 1.523
ClarityVS1:carat_cent 0.572 0.159 3.610 0.000 0.260 0.885
ClarityVS2:carat_cent 0.656 0.167 3.923 0.000 0.326 0.985
ClarityVVS1:carat_cent 0.213 0.187 1.141 0.255 -0.155 0.581
ClarityVVS2:carat_cent 0.297 0.197 1.508 0.133 -0.091 0.685
  1. Write the model equation for Clarity = IF.

\[\hat{\log(TotalPrice)} = 8.598 + 1.892\times \text{carat_cent} - 1.860\times \text{carat_cent}^2\]

  1. Let’s interpret the effect of carat on the total price for diamonds with Clarity = IF that have a carat size within 0.2 carats of the mean (approx. Q1 and Q3).

Based on the model, the effect on \(\log(TotalPrice)\) is \[1.892*(0.2 - (-0.2)) - 1.860*(0.2^2 - (-0.2)^2) = 0.7568\]

Interpretation in terms of \(\log(TotalPrice)\): For diamonds with clarity IF, if we go from a diamond that is 0.2 carats below the mean carat weight to one that is 0.2 carats above the mean carat weight, we expect the log price to increase by 0.7568 on average.

Interpretation in terms of price: For diamonds with clarity IF, diamonds that weigh 0.2 carats above the mean carat weight are expected to have a median price that is 2.131, exp(0.7568) times the median price of diamonds that weigh 0.2 carats below the mean carat weight.

Alternate interpretation: For diamonds with clarity IF, if we go from a diamond that weighs 0.2 carats below the mean carat weight to one that weighs 0.2 carats more the mean carat weight, we expect the price to multiply by a factor of 2.131, exp(0.7568).

Nested F test

Suppose we have the following model:

orig_model <- lm(log_total_price ~ Carat + Depth, data = diamonds)
tidy(orig_model, conf.int = TRUE) %>%
  kable(digits = 3)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 6.940 0.194 35.767 0.000 6.558 7.322
Carat 2.151 0.057 37.412 0.000 2.037 2.264
Depth -0.007 0.003 -2.320 0.021 -0.014 -0.001
  1. Conduct a nested F test to determine if we should add Carat to the model shown above.

    • State the null and alternative hypotheses in words.

    \(H_0\): All of the coefficients for Color are 0

    \(H_a\): At least one coefficient for Color is not 0

    • Show the code and output for the nested F test.
proposed_model <- lm(log_total_price ~ Carat + Depth + Color, data = diamonds)
anova(orig_model, proposed_model) %>%
  tidy() %>%
  kable(digits = 3)
res.df rss df sumsq statistic p.value
268 16.050 NA NA NA NA
262 12.962 6 3.088 10.404 0
  • State your conclusion in the context of the data.

The p-value is very small, so we reject \(H_0\). There is at least one coefficient associated with Color that is not 0, so we should include Color in the model.

Model comparison

  1. Compare the original model and the model that includes Color using \(Adj. R^2\), \(AIC\) and \(BIC\). Which model would you select based on each criteria?
glance(orig_model) %>%
  select(adj.r.squared, AIC, BIC)
## # A tibble: 1 x 3
##   adj.r.squared   AIC   BIC
##           <dbl> <dbl> <dbl>
## 1         0.854  11.1  25.5
glance(proposed_model) %>%
  select(adj.r.squared, AIC, BIC)
## # A tibble: 1 x 3
##   adj.r.squared   AIC   BIC
##           <dbl> <dbl> <dbl>
## 1         0.879 -34.8  1.22

Based on \(Adj. R^2\), the proposed model that includes Color is a better fit since the \(Adj. R^2\) is higher.

Based on \(AIC\), the proposed model that includes Color is a better fit, since \(AIC\) is lower.

Based on \(BIC\), the proposed model that includes Color is a better fit, since \(BIC\) is lower.


  1. Data set adapted from the Diamonds data set in the Stat2Data R Package.↩︎