Announcements

HW 03

Questions from video?

Price of Diamonds

library(tidyverse)
library(broom)
library(patchwork)
library(knitr)

Today’s data set contains the price and characteristics for 271 diamonds randomly selected from AwesomeGems.com in July 2005 [1]. The variables in the data set are

We will use the characteristics to understand variability in the price of diamonds.

diamonds <- read_csv("data/diamonds.csv")

Part 1: Mean-centered variables + Interactions

Original model

orig_model <- lm(TotalPrice ~ Clarity + Carat + Clarity * Carat, data = diamonds)
tidy(orig_model, conf.int = TRUE) %>%
    kable(digits = 3)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) -1086.229 511.376 -2.124 0.035 -2093.286 -79.171
ClaritySI1 -2574.745 719.867 -3.577 0.000 -3992.386 -1157.104
ClaritySI2 -1196.544 1046.294 -1.144 0.254 -3257.022 863.934
ClaritySI3 -1181.771 1517.647 -0.779 0.437 -4170.490 1806.948
ClarityVS1 -2085.958 581.567 -3.587 0.000 -3231.244 -940.672
ClarityVS2 -2608.348 615.997 -4.234 0.000 -3821.438 -1395.259
ClarityVVS1 -682.555 660.317 -1.034 0.302 -1982.924 617.814
ClarityVVS2 -1295.121 686.745 -1.886 0.060 -2647.536 57.294
Carat 7770.757 747.682 10.393 0.000 6298.339 9243.176
ClaritySI1:Carat 1534.970 941.373 1.631 0.104 -318.886 3388.826
ClaritySI2:Carat -1009.767 1080.924 -0.934 0.351 -3138.443 1118.908
ClaritySI3:Carat -2208.757 1457.253 -1.516 0.131 -5078.542 661.027
ClarityVS1:Carat 1882.893 812.346 2.318 0.021 283.132 3482.653
ClarityVS2:Carat 1944.583 842.556 2.308 0.022 285.328 3603.838
ClarityVVS1:Carat 84.449 955.234 0.088 0.930 -1796.703 1965.601
ClarityVVS2:Carat 1018.237 1009.286 1.009 0.314 -969.361 3005.835

Model using mean-centered carat

Mean-center Carat and refit the original model using the mean-centered variable for carat.

diamonds <- diamonds %>%
  mutate(carat_cent = Carat - mean(Carat))
## code to refit model
mean_cent_model <- lm(TotalPrice ~ Clarity + carat_cent + Clarity*carat_cent , data = diamonds)
tidy(mean_cent_model, conf.int = TRUE) %>%
    kable(digits = 3)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 5175.396 262.076 19.748 0.000 4659.287 5691.504
ClaritySI1 -1337.877 295.885 -4.522 0.000 -1920.567 -755.187
ClaritySI2 -2010.208 440.171 -4.567 0.000 -2877.041 -1143.375
ClaritySI3 -2961.573 691.962 -4.280 0.000 -4324.262 -1598.884
ClarityVS1 -568.736 275.161 -2.067 0.040 -1110.613 -26.858
ClarityVS2 -1041.416 279.329 -3.728 0.000 -1591.502 -491.330
ClarityVVS1 -614.507 329.673 -1.864 0.063 -1263.736 34.722
ClarityVVS2 -474.632 321.197 -1.478 0.141 -1107.169 157.905
carat_cent 7770.757 747.682 10.393 0.000 6298.339 9243.176
ClaritySI1:carat_cent 1534.970 941.373 1.631 0.104 -318.886 3388.826
ClaritySI2:carat_cent -1009.767 1080.924 -0.934 0.351 -3138.443 1118.908
ClaritySI3:carat_cent -2208.757 1457.253 -1.516 0.131 -5078.542 661.027
ClarityVS1:carat_cent 1882.893 812.346 2.318 0.021 283.132 3482.653
ClarityVS2:carat_cent 1944.583 842.556 2.308 0.022 285.328 3603.838
ClarityVVS1:carat_cent 84.449 955.234 0.088 0.930 -1796.703 1965.601
ClarityVVS2:carat_cent 1018.237 1009.286 1.009 0.314 -969.361 3005.835
  1. Interpret the intercept in the context of the data.

  2. Let’s compare the original and mean-centered model for ClarityIF.

  3. Let’s compare the original and mean-centered models for ClaritySI1.
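The two tables can be linked with a little arithmetic. The slopes are identical in both models, and centering only shifts each intercept-type term by its slope times the mean of Carat. The sketch below uses only rounded coefficients copied from the output above, so the mean carat it recovers is an implied, approximate value rather than one computed from the data, and the variable names are illustrative.

```r
# Coefficients copied from the two tables above (baseline clarity level is IF)
b0_orig <- -1086.229   # (Intercept), original model
b0_cent <-  5175.396   # (Intercept), mean-centered model
slope   <-  7770.757   # Carat / carat_cent (same estimate in both models)

# Centering shifts the intercept by slope * mean(Carat), so the mean carat
# implied by the rounded output is:
implied_mean_carat <- (b0_cent - b0_orig) / slope
implied_mean_carat

# The same shift applies to each clarity offset, e.g. SI1:
si1_orig  <- -2574.745   # ClaritySI1, original model
si1_slope <-  1534.970   # ClaritySI1:Carat
si1_cent_check <- si1_orig + si1_slope * implied_mean_carat
si1_cent_check           # compare with ClaritySI1 = -1337.877 in the centered table
```

The check agrees with the centered table up to rounding, which is one way to see that the two fits describe the same lines and differ only in where the intercepts are evaluated.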

Testing coefficients (Zoom poll)

Let’s use the mean-centered version of the model to answer the following questions:

  1. Suppose we wish to test whether the mean price for diamonds with Clarity = "SI3" differs significantly from that of diamonds with Clarity = "IF". Which term's inferential statistics should we use to conduct this test?
       1. Intercept
       2. ClaritySI3
       3. carat_cent
       4. ClaritySI3:carat_cent
  2. Suppose we wish to test whether the effect of carat for diamonds with Clarity = "SI3" differs significantly from that of diamonds with Clarity = "IF". Which term's inferential statistics should we use to conduct this test?
       1. Intercept
       2. ClaritySI3
       3. carat_cent
       4. ClaritySI3:carat_cent
  3. What are the degrees of freedom associated with the regression standard error, \(\hat{\sigma}\), and therefore the degrees of freedom associated with the two tests mentioned above? Note: There are 271 observations in the diamonds data set.
       1. 269
       2. 268
       3. 256
       4. 255
  4. What is the 95% confidence interval for the amount by which the slope of carat for diamonds with Clarity = "VS1" differs from the slope of carat for diamonds with Clarity = "IF"?
       1. (4659.287, 5691.504)
       2. (-1110.613, -26.858)
       3. (6298.339, 9243.176)
       4. (283.132, 3482.653)
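For the degrees-of-freedom question, it can help to see how R counts them on a small example. The sketch below uses simulated toy data (the names `toy`, `group`, `x`, and `y` are made up for illustration and are not part of the diamonds data) with the same model structure: a categorical predictor, a quantitative predictor, and their interaction.

```r
set.seed(1)
n <- 60

# Toy data: a 3-level factor and one quantitative predictor
toy <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = n / 3)),
  x     = runif(n)
)
toy$y <- 1 + 2 * toy$x + rnorm(n)

toy_fit <- lm(y ~ group + x + group * x, data = toy)

# Estimated coefficients: intercept + 2 group offsets + x + 2 interactions = 6
n_coef   <- length(coef(toy_fit))
df_resid <- df.residual(toy_fit)

n_coef
df_resid
df_resid == n - n_coef   # TRUE: residual df = observations - estimated coefficients
```

The same counting rule applies to the diamonds model: count the rows in the coefficient table, including the intercept, and subtract from the number of observations.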

Transformations

Let’s take a look at the plot of residuals vs. fitted.

model_aug <- augment(mean_cent_model)
ggplot(data = model_aug, aes(x = .fitted, y  = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "Red") +
  labs(x = "Predicted", 
      y = "Residual", 
      title = "Residuals vs. Predicted")

  1. Based on what you learned about conditions for SLR, which condition(s) appear to be violated? Select all that apply. (Zoom poll)
       1. Linearity
       2. Constant variance
       3. Normality
       4. Independence
  2. Let’s use a log-transformation on the total price and refit the model.
diamonds <- diamonds %>%
  mutate(log_total_price = log(TotalPrice))
log_model <- lm(log_total_price ~ Clarity + carat_cent + Clarity*carat_cent, data = diamonds)
tidy(log_model, conf.int = TRUE) %>%
  kable(digits = 3)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 8.377 0.066 126.707 0.000 8.247 8.507
ClaritySI1 -0.299 0.075 -4.007 0.000 -0.446 -0.152
ClaritySI2 -0.303 0.111 -2.730 0.007 -0.522 -0.084
ClaritySI3 -1.005 0.175 -5.759 0.000 -1.349 -0.662
ClarityVS1 -0.133 0.069 -1.918 0.056 -0.270 0.004
ClarityVS2 -0.201 0.070 -2.857 0.005 -0.340 -0.063
ClarityVVS1 -0.157 0.083 -1.884 0.061 -0.320 0.007
ClarityVVS2 -0.054 0.081 -0.669 0.504 -0.214 0.105
carat_cent 1.932 0.189 10.244 0.000 1.561 2.304
ClaritySI1:carat_cent 0.554 0.237 2.331 0.021 0.086 1.021
ClaritySI2:carat_cent -0.646 0.273 -2.367 0.019 -1.183 -0.109
ClaritySI3:carat_cent -0.002 0.368 -0.005 0.996 -0.726 0.722
ClarityVS1:carat_cent 0.424 0.205 2.068 0.040 0.020 0.827
ClarityVS2:carat_cent 0.237 0.213 1.116 0.266 -0.181 0.656
ClarityVVS1:carat_cent 0.462 0.241 1.918 0.056 -0.012 0.937
ClarityVVS2:carat_cent 0.473 0.255 1.856 0.065 -0.029 0.974
       1. Interpret the intercept in terms of log(TotalPrice).

       2. Interpret the intercept in terms of TotalPrice.

       3. Let’s focus on diamonds with Clarity = IF.

  3. Let’s check the plot of residuals vs. fitted for this model.
log_model_aug <- augment(log_model)
ggplot(data = log_model_aug, aes(x = .fitted, y  = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "Red") +
  labs(x = "Predicted", 
      y = "Residual", 
      title = "Residuals vs. Predicted")

It looks like we fixed the constant variance issue, but linearity is still violated!
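When interpreting the log model, coefficients move back to the price scale by exponentiating. The sketch below uses two rounded estimates copied from the log_model table. Reading the exponentiated intercept as a predicted median (rather than mean) price is a common convention for log-transformed responses and is stated here as an assumption.

```r
# Estimates copied from the log_model table above
b0_log <- 8.377    # (Intercept): baseline clarity (IF) at the mean carat weight
b_si3  <- -1.005   # ClaritySI3: offset relative to IF at the mean carat weight

# Back-transform to the price scale
price_if_at_mean <- exp(b0_log)   # predicted median price, roughly 4346
ratio_si3_vs_if  <- exp(b_si3)    # multiplicative factor for SI3 relative to IF

price_if_at_mean
ratio_si3_vs_if
```

A factor below 1 means the predicted median price for that clarity level is lower than for IF at the mean carat weight. Differences on the log scale become ratios on the price scale.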

Higher-order terms

Let’s take a look at a plot of log(TotalPrice) vs. carat and the residuals vs. carat.

ggplot(data = diamonds, aes(x = carat_cent, y = log_total_price)) +
  geom_point() +
  labs(x = "Mean-Centered Carat",
       y = "Log-transformed Total Price",
       title = "Log-transformed total price vs. Mean-Centered Carat")

ggplot(data = log_model_aug, aes(x = carat_cent, y = .resid)) +
  geom_point() + 
  geom_hline(yintercept = 0, color = "red") +
  labs(x = "Mean-Centered Carat", 
       y = "Residual", 
       title = "Residuals vs. Mean-Centered Carat")

  1. Let’s add a quadratic term to the model.
log_model_v2 <- lm(log_total_price ~ Clarity + carat_cent + Clarity*carat_cent + I(carat_cent^2), 
                   data = diamonds)
tidy(log_model_v2, conf.int = TRUE) %>%
  kable(digits = 3)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 8.598 0.054 160.100 0.000 8.492 8.704
ClaritySI1 -0.428 0.058 -7.331 0.000 -0.544 -0.313
ClaritySI2 -0.583 0.088 -6.604 0.000 -0.757 -0.409
ClaritySI3 -0.961 0.135 -7.128 0.000 -1.226 -0.695
ClarityVS1 -0.227 0.054 -4.206 0.000 -0.334 -0.121
ClarityVS2 -0.324 0.055 -5.868 0.000 -0.432 -0.215
ClarityVVS1 -0.215 0.064 -3.347 0.001 -0.342 -0.089
ClarityVVS2 -0.171 0.063 -2.714 0.007 -0.296 -0.047
carat_cent 1.892 0.146 12.997 0.000 1.606 2.179
I(carat_cent^2) -1.860 0.141 -13.195 0.000 -2.138 -1.583
ClaritySI1:carat_cent 0.917 0.185 4.948 0.000 0.552 1.282
ClaritySI2:carat_cent 0.779 0.237 3.295 0.001 0.314 1.245
ClaritySI3:carat_cent 0.947 0.293 3.234 0.001 0.370 1.523
ClarityVS1:carat_cent 0.572 0.159 3.610 0.000 0.260 0.885
ClarityVS2:carat_cent 0.656 0.167 3.923 0.000 0.326 0.985
ClarityVVS1:carat_cent 0.213 0.187 1.141 0.255 -0.155 0.581
ClarityVVS2:carat_cent 0.297 0.197 1.508 0.133 -0.091 0.685
       1. Write the model equation for Clarity = IF.

       2. Let’s interpret the effect of carat on the total price for diamonds with Clarity = IF that have a carat size within 0.2 carats of the mean (~Q1 and Q3).
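With a quadratic term, the effect of carat is no longer a single number: it depends on where along the carat range the change happens. As a rough sketch, the code below plugs the rounded IF (baseline) estimates from the log_model_v2 table into a helper function, `fitted_log_price_if()`, which is a hypothetical name introduced here for illustration. It then compares a 0.2-carat step just below the mean with a 0.2-carat step just above it.

```r
# Coefficients for the baseline clarity (IF), copied from the log_model_v2 table
b0 <- 8.598    # (Intercept)
b1 <- 1.892    # carat_cent
b2 <- -1.860   # I(carat_cent^2)

# Fitted log(TotalPrice) for an IF diamond at a given mean-centered carat
fitted_log_price_if <- function(carat_cent) {
  b0 + b1 * carat_cent + b2 * carat_cent^2
}

# Change in fitted log price over each 0.2-carat step around the mean
step_below <- fitted_log_price_if(0) - fitted_log_price_if(-0.2)
step_above <- fitted_log_price_if(0.2) - fitted_log_price_if(0)

step_below        # 0.4528 on the log scale
step_above        # 0.3040 on the log scale
exp(step_below)   # multiplicative price change from 0.2 below the mean to the mean
exp(step_above)   # multiplicative price change from the mean to 0.2 above the mean
```

The step below the mean is larger than the step above it because the negative quadratic coefficient flattens the curve as carat increases.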


  1. Data set adapted from the Diamonds data set in the Stat2Data R package.