AE 11: Price of Diamonds, Part 2

Announcements

HW 03 due Wednesday at 11:59p (available after class)
Introducing statistics experiences
Upcoming events from the StatSci Majors Union

HW 03

Questions from video?

Price of Diamonds

library(tidyverse)
library(broom)
library(patchwork)
library(knitr)

Today’s data set contains the price and characteristics for 271 diamonds randomly selected from AwesomeGems.com in July 2005.¹ The variables in the data set are:

Carat: Size of the diamond (in carats)
Color: Coded as D (most white/bright) through J
Clarity: Coded as IF (internally flawless), VVS1, VVS2, VS1, VS2, SI1, SI2, or SI3 (slightly clouded)
Depth: Depth (as a percentage of diameter)
PricePerCt: Price per carat
TotalPrice: Price for the diamond (in dollars)

We will use the characteristics to understand variability in the price of diamonds.

diamonds <- read_csv("data/diamonds.csv")

Part 1: Mean-centered variables + Interactions

Original model

orig_model <- lm(TotalPrice ~ Clarity + Carat + Clarity * Carat, data = diamonds)
tidy(orig_model, conf.int = TRUE) %>%
    kable(digits = 3)

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	-1086.229	511.376	-2.124	0.035	-2093.286	-79.171
ClaritySI1	-2574.745	719.867	-3.577	0.000	-3992.386	-1157.104
ClaritySI2	-1196.544	1046.294	-1.144	0.254	-3257.022	863.934
ClaritySI3	-1181.771	1517.647	-0.779	0.437	-4170.490	1806.948
ClarityVS1	-2085.958	581.567	-3.587	0.000	-3231.244	-940.672
ClarityVS2	-2608.348	615.997	-4.234	0.000	-3821.438	-1395.259
ClarityVVS1	-682.555	660.317	-1.034	0.302	-1982.924	617.814
ClarityVVS2	-1295.121	686.745	-1.886	0.060	-2647.536	57.294
Carat	7770.757	747.682	10.393	0.000	6298.339	9243.176
ClaritySI1:Carat	1534.970	941.373	1.631	0.104	-318.886	3388.826
ClaritySI2:Carat	-1009.767	1080.924	-0.934	0.351	-3138.443	1118.908
ClaritySI3:Carat	-2208.757	1457.253	-1.516	0.131	-5078.542	661.027
ClarityVS1:Carat	1882.893	812.346	2.318	0.021	283.132	3482.653
ClarityVS2:Carat	1944.583	842.556	2.308	0.022	285.328	3603.838
ClarityVVS1:Carat	84.449	955.234	0.088	0.930	-1796.703	1965.601
ClarityVVS2:Carat	1018.237	1009.286	1.009	0.314	-969.361	3005.835

Model using mean-centered `carat`

Mean-center Carat and refit the original model using the mean-centered variable for carat.

diamonds <- diamonds %>%
  mutate(carat_cent = Carat - mean(Carat))

## code to refit the model
mean_cent_model <- lm(TotalPrice ~ Clarity + carat_cent + Clarity * carat_cent, data = diamonds)
tidy(mean_cent_model, conf.int = TRUE) %>%
    kable(digits = 3)

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	5175.396	262.076	19.748	0.000	4659.287	5691.504
ClaritySI1	-1337.877	295.885	-4.522	0.000	-1920.567	-755.187
ClaritySI2	-2010.208	440.171	-4.567	0.000	-2877.041	-1143.375
ClaritySI3	-2961.573	691.962	-4.280	0.000	-4324.262	-1598.884
ClarityVS1	-568.736	275.161	-2.067	0.040	-1110.613	-26.858
ClarityVS2	-1041.416	279.329	-3.728	0.000	-1591.502	-491.330
ClarityVVS1	-614.507	329.673	-1.864	0.063	-1263.736	34.722
ClarityVVS2	-474.632	321.197	-1.478	0.141	-1107.169	157.905
carat_cent	7770.757	747.682	10.393	0.000	6298.339	9243.176
ClaritySI1:carat_cent	1534.970	941.373	1.631	0.104	-318.886	3388.826
ClaritySI2:carat_cent	-1009.767	1080.924	-0.934	0.351	-3138.443	1118.908
ClaritySI3:carat_cent	-2208.757	1457.253	-1.516	0.131	-5078.542	661.027
ClarityVS1:carat_cent	1882.893	812.346	2.318	0.021	283.132	3482.653
ClarityVS2:carat_cent	1944.583	842.556	2.308	0.022	285.328	3603.838
ClarityVVS1:carat_cent	84.449	955.234	0.088	0.930	-1796.703	1965.601
ClarityVVS2:carat_cent	1018.237	1009.286	1.009	0.314	-969.361	3005.835

Interpret the intercept in the context of the data.
Let’s compare the original and mean-centered models for ClarityIF.
Let’s compare the original and mean-centered models for ClaritySI1.
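As a rough check on the comparison above: the slope and interaction rows are identical in the two tables, and only the intercept and the Clarity main effects move. Centering shifts each of those terms by its slope times mean(Carat). The mean carat is not stated in this handout, so the sketch below backs it out from the rounded table values; treat the result as an implied approximation, not a reported figure.

```r
# Coefficients copied from the two tables above (rounded to 3 decimals).
orig <- c(intercept = -1086.229, si1 = -2574.745,
          carat = 7770.757, si1_carat = 1534.970)
cent <- c(intercept = 5175.396, si1 = -1337.877)

# Centering moves each term by (its slope) * mean(Carat), so the mean
# can be recovered from either pair of rows.
mean_from_intercept <- unname((cent["intercept"] - orig["intercept"]) / orig["carat"])
mean_from_si1 <- unname((cent["si1"] - orig["si1"]) / orig["si1_carat"])

# The two estimates should agree up to rounding in the tables.
round(c(mean_from_intercept, mean_from_si1), 3)
```

Both values come out near 0.806 carats. Under that reading, the mean-centered intercept (5175.396) is the expected price of an IF diamond of average carat, whereas the original intercept (-1086.229) describes a diamond of 0 carats, which is not meaningful.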

Testing coefficients (Zoom poll)

Let’s use the mean-centered version of the model to answer the following questions:

Suppose we wish to test if the mean price for diamonds with Clarity = "SI3" differs significantly from diamonds with Clarity = "IF". We should use the inferential statistics for which term to conduct this test?

Intercept
ClaritySI3
carat_cent
ClaritySI3:carat_cent

Suppose we wish to test if the effect of carat for diamonds with Clarity = "SI3" differs significantly from diamonds with Clarity = "IF". We should use the inferential statistics for which term to conduct this test?

Intercept
ClaritySI3
carat_cent
ClaritySI3:carat_cent

What are the degrees of freedom associated with the regression standard error, \(\hat{\sigma}\), and therefore the degrees of freedom associated with the two tests mentioned above? Note: There are 271 observations in the diamonds data set.
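One way to reason about the degrees of freedom is to subtract the number of estimated coefficients (the rows in the table, intercept included) from the number of observations. The sketch below does that and then uses the result to reproduce, approximately, the confidence interval printed in the carat_cent row. The variable names are illustrative, and small differences from the table are expected because the inputs are rounded.

```r
n <- 271          # observations, as stated in the question
n_terms <- 16     # rows in the coefficient table, intercept included
df_resid <- n - n_terms

# Check against the carat_cent row: estimate +/- t* x SE.
est <- 7770.757
se <- 747.682
t_star <- qt(0.975, df = df_resid)
ci <- est + c(-1, 1) * t_star * se

df_resid
round(ci, 1)
```

The interval should land within a fraction of a dollar of the tabled (6298.339, 9243.176), which suggests the t critical value with these degrees of freedom is the one used by tidy().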

What is the 95% confidence interval for the amount by which the slope of carat for diamonds with Clarity = VS1 differs from the slope of carat for diamonds with Clarity = IF?

(4659.287, 5691.504)
(-1110.613, -26.858)
(6298.339, 9243.176)
(283.132, 3482.653)

Transformations

Let’s take a look at the plot of residuals vs. fitted.

model_aug <- augment(mean_cent_model)

ggplot(data = model_aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  labs(x = "Predicted", 
       y = "Residual", 
       title = "Residuals vs. Predicted")

Based on what you learned about conditions for SLR, which condition(s) appear to be violated? Select all that apply. (Zoom poll)

Linearity
Constant variance
Normality
Independence

Let’s use a log-transformation on the total price and refit the model.

diamonds <- diamonds %>%
  mutate(log_total_price = log(TotalPrice))

log_model <- lm(log_total_price ~ Clarity + carat_cent + Clarity*carat_cent, data = diamonds)
tidy(log_model, conf.int = TRUE) %>%
  kable(digits = 3)

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	8.377	0.066	126.707	0.000	8.247	8.507
ClaritySI1	-0.299	0.075	-4.007	0.000	-0.446	-0.152
ClaritySI2	-0.303	0.111	-2.730	0.007	-0.522	-0.084
ClaritySI3	-1.005	0.175	-5.759	0.000	-1.349	-0.662
ClarityVS1	-0.133	0.069	-1.918	0.056	-0.270	0.004
ClarityVS2	-0.201	0.070	-2.857	0.005	-0.340	-0.063
ClarityVVS1	-0.157	0.083	-1.884	0.061	-0.320	0.007
ClarityVVS2	-0.054	0.081	-0.669	0.504	-0.214	0.105
carat_cent	1.932	0.189	10.244	0.000	1.561	2.304
ClaritySI1:carat_cent	0.554	0.237	2.331	0.021	0.086	1.021
ClaritySI2:carat_cent	-0.646	0.273	-2.367	0.019	-1.183	-0.109
ClaritySI3:carat_cent	-0.002	0.368	-0.005	0.996	-0.726	0.722
ClarityVS1:carat_cent	0.424	0.205	2.068	0.040	0.020	0.827
ClarityVS2:carat_cent	0.237	0.213	1.116	0.266	-0.181	0.656
ClarityVVS1:carat_cent	0.462	0.241	1.918	0.056	-0.012	0.937
ClarityVVS2:carat_cent	0.473	0.255	1.856	0.065	-0.029	0.974

Interpret the intercept in terms of log(TotalPrice).
Interpret the intercept in terms of TotalPrice.
Let’s focus on diamonds with Clarity = IF.

Write the model in terms of the log(TotalPrice).
Interpret the coefficient of carat_cent in terms of log(TotalPrice).
Interpret the coefficient of carat_cent in terms of TotalPrice.
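For IF diamonds, the fitted log model reduces to log(TotalPrice) = 8.377 + 1.932 × carat_cent, since IF is the baseline level. A minimal sketch of the back-transformation, using the rounded coefficients from the table above (the variable names are illustrative):

```r
b0 <- 8.377   # intercept: IF baseline at mean carat
b1 <- 1.932   # carat_cent slope for IF diamonds

# Back-transform the intercept to the dollar scale.
price_if_mean_carat <- exp(b0)

# Multiplicative change in price for a 0.1-carat increase (IF diamonds).
mult_per_tenth_carat <- exp(b1 * 0.1)

round(price_if_mean_carat)
round(mult_per_tenth_carat, 3)
```

Exponentiating a prediction on the log scale is usually read as a median (or geometric mean) price, not an arithmetic mean. On that reading, an IF diamond of average carat has an expected median price of roughly $4,346, and each additional 0.1 carat multiplies the price by about 1.21.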

Let’s check the plot of residuals vs. fitted for this model.

log_model_aug <- augment(log_model)

ggplot(data = log_model_aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  labs(x = "Predicted", 
       y = "Residual", 
       title = "Residuals vs. Predicted")

It looks like we fixed the constant variance issue, but linearity is still violated!

Higher-order terms

Let’s take a look at a plot of log(TotalPrice) vs. carat and the residuals vs. carat.

ggplot(data = diamonds, aes(x = carat_cent, y = log_total_price)) +
  geom_point() +
  labs(x = "Mean-Centered Carat", 
       y = "Log-transformed Total Price", 
       title = "Log-transformed total price vs. Mean-Centered Carat")

ggplot(data = log_model_aug, aes(x = carat_cent, y = .resid)) +
  geom_point() + 
  geom_hline(yintercept = 0, color = "red") +
  labs(x = "Mean-Centered Carat", 
       y = "Residual", 
       title = "Residuals vs. Mean-Centered Carat")

Let’s add a quadratic term to the model.

log_model_v2 <- lm(log_total_price ~ Clarity + carat_cent + Clarity*carat_cent + I(carat_cent^2), 
                   data = diamonds)
tidy(log_model_v2, conf.int = TRUE) %>%
  kable(digits = 3)

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	8.598	0.054	160.100	0.000	8.492	8.704
ClaritySI1	-0.428	0.058	-7.331	0.000	-0.544	-0.313
ClaritySI2	-0.583	0.088	-6.604	0.000	-0.757	-0.409
ClaritySI3	-0.961	0.135	-7.128	0.000	-1.226	-0.695
ClarityVS1	-0.227	0.054	-4.206	0.000	-0.334	-0.121
ClarityVS2	-0.324	0.055	-5.868	0.000	-0.432	-0.215
ClarityVVS1	-0.215	0.064	-3.347	0.001	-0.342	-0.089
ClarityVVS2	-0.171	0.063	-2.714	0.007	-0.296	-0.047
carat_cent	1.892	0.146	12.997	0.000	1.606	2.179
I(carat_cent^2)	-1.860	0.141	-13.195	0.000	-2.138	-1.583
ClaritySI1:carat_cent	0.917	0.185	4.948	0.000	0.552	1.282
ClaritySI2:carat_cent	0.779	0.237	3.295	0.001	0.314	1.245
ClaritySI3:carat_cent	0.947	0.293	3.234	0.001	0.370	1.523
ClarityVS1:carat_cent	0.572	0.159	3.610	0.000	0.260	0.885
ClarityVS2:carat_cent	0.656	0.167	3.923	0.000	0.326	0.985
ClarityVVS1:carat_cent	0.213	0.187	1.141	0.255	-0.155	0.581
ClarityVVS2:carat_cent	0.297	0.197	1.508	0.133	-0.091	0.685

Write the model equation for Clarity = IF.
Let’s interpret the effect of carat on the total price for diamonds with Clarity = IF that have a carat size within 0.2 carats of the mean (roughly between Q1 and Q3).
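With a quadratic term, the effect of carat is no longer a single number: for IF diamonds the fitted curve is log(TotalPrice) = 8.598 + 1.892 × carat_cent − 1.860 × carat_cent², so the slope depends on where along carat_cent it is evaluated. The sketch below uses the rounded coefficients from the table; the helper functions log_price_if() and slope_if() are hypothetical names introduced for illustration only.

```r
b0 <- 8.598    # intercept (IF baseline at mean carat)
b1 <- 1.892    # carat_cent
b2 <- -1.860   # I(carat_cent^2)

# Fitted log price and its derivative with respect to carat_cent (IF only).
log_price_if <- function(carat_cent) b0 + b1 * carat_cent + b2 * carat_cent^2
slope_if <- function(carat_cent) b1 + 2 * b2 * carat_cent

# Slopes at 0.2 carats below the mean, at the mean, and 0.2 above.
slopes <- slope_if(c(-0.2, 0, 0.2))

# Ratio of fitted prices at the two ends of the range.
price_ratio <- exp(log_price_if(0.2) - log_price_if(-0.2))

round(slopes, 3)
round(price_ratio, 2)
```

The slope on the log scale falls from 2.636 to 1.148 across this range, so an extra fraction of a carat adds proportionally less for larger diamonds. Over the whole range, the fitted price at 0.2 carats above the mean is about 2.13 times the fitted price at 0.2 carats below it.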

¹ Data set adapted from the Diamonds data set in the Stat2Data R package.