AE 12: Model comparison

Announcements

HW 03 due today at 11:59p
HW 04 available after class due Sep 30 at 11:59p

Questions from video?

Price of Diamonds

library(tidyverse)
library(broom)
library(patchwork)
library(knitr)

Today’s data set contains the price and characteristics for 271diamonds randomly selected from AwesomeGems.com in July 2005.¹ The variables in the data set are

Carat: Size of the diamond (in carats)
Color: Coded as D (most white/bright) through J
Clarity: Coded as IF (internally flawless), VVS1, VVS2, VS1, VS2, SI1, SI2, or SI3 (slightly clouded)
Depth: Depth (as a percentage of diameter)
PricePerCt: Price per carat
TotalPrice: Price for the diamond (in dollars)

We will use the characteristics to understand variability in the price of diamonds.

diamonds <- read_csv("data/diamonds.csv")

diamonds <- diamonds %>%
  mutate(carat_cent = Carat - mean(Carat), 
         log_total_price = log(TotalPrice))

log_model <- lm(log_total_price ~ Clarity + carat_cent + Clarity*carat_cent, data = diamonds)
tidy(log_model, conf.int = TRUE) %>%
  kable(digits = 3)

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	8.377	0.066	126.707	0.000	8.247	8.507
ClaritySI1	-0.299	0.075	-4.007	0.000	-0.446	-0.152
ClaritySI2	-0.303	0.111	-2.730	0.007	-0.522	-0.084
ClaritySI3	-1.005	0.175	-5.759	0.000	-1.349	-0.662
ClarityVS1	-0.133	0.069	-1.918	0.056	-0.270	0.004
ClarityVS2	-0.201	0.070	-2.857	0.005	-0.340	-0.063
ClarityVVS1	-0.157	0.083	-1.884	0.061	-0.320	0.007
ClarityVVS2	-0.054	0.081	-0.669	0.504	-0.214	0.105
carat_cent	1.932	0.189	10.244	0.000	1.561	2.304
ClaritySI1:carat_cent	0.554	0.237	2.331	0.021	0.086	1.021
ClaritySI2:carat_cent	-0.646	0.273	-2.367	0.019	-1.183	-0.109
ClaritySI3:carat_cent	-0.002	0.368	-0.005	0.996	-0.726	0.722
ClarityVS1:carat_cent	0.424	0.205	2.068	0.040	0.020	0.827
ClarityVS2:carat_cent	0.237	0.213	1.116	0.266	-0.181	0.656
ClarityVVS1:carat_cent	0.462	0.241	1.918	0.056	-0.012	0.937
ClarityVVS2:carat_cent	0.473	0.255	1.856	0.065	-0.029	0.974

log_model_aug <- augment(log_model)

ggplot(data = log_model_aug, aes(x = .fitted, y  = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "Red") +
  labs(x = "Predicted", 
      y = "Residual", 
      title = "Residuals vs. Predcited")

It looks like we fixed the constant variance issue, but linearity is still violated!

Higher-order terms

Let’s take a look at a plot of log(TotalPrice) vs. carat and the residuals vs. carat.

ggplot(data = diamonds, aes(x = carat_cent, y =log_total_price )) +
  geom_point() +
  labs(x= "Mean-Centered Carat", 
       y  = "Log-transformed Total Price)", 
       title = "Log-transformed total price vs. Mean-Centered Carat")

ggplot(data = log_model_aug, aes(x = carat_cent, y = .resid)) +
  geom_point() + 
  geom_hline(yintercept = 0, color = "red") +
  labs(x = "Mean-Centered Carat", 
       y = "Residual", 
       title = "Residuals vs. Mean-Centered Carat")

Let’s add a quadratic term to the model.

log_model_v2 <- lm(log_total_price ~ Clarity + carat_cent + Clarity*carat_cent + I(carat_cent^2), 
                   data = diamonds)
tidy(log_model_v2, conf.int = TRUE) %>%
  kable(digits = 3)

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	8.598	0.054	160.100	0.000	8.492	8.704
ClaritySI1	-0.428	0.058	-7.331	0.000	-0.544	-0.313
ClaritySI2	-0.583	0.088	-6.604	0.000	-0.757	-0.409
ClaritySI3	-0.961	0.135	-7.128	0.000	-1.226	-0.695
ClarityVS1	-0.227	0.054	-4.206	0.000	-0.334	-0.121
ClarityVS2	-0.324	0.055	-5.868	0.000	-0.432	-0.215
ClarityVVS1	-0.215	0.064	-3.347	0.001	-0.342	-0.089
ClarityVVS2	-0.171	0.063	-2.714	0.007	-0.296	-0.047
carat_cent	1.892	0.146	12.997	0.000	1.606	2.179
I(carat_cent^2)	-1.860	0.141	-13.195	0.000	-2.138	-1.583
ClaritySI1:carat_cent	0.917	0.185	4.948	0.000	0.552	1.282
ClaritySI2:carat_cent	0.779	0.237	3.295	0.001	0.314	1.245
ClaritySI3:carat_cent	0.947	0.293	3.234	0.001	0.370	1.523
ClarityVS1:carat_cent	0.572	0.159	3.610	0.000	0.260	0.885
ClarityVS2:carat_cent	0.656	0.167	3.923	0.000	0.326	0.985
ClarityVVS1:carat_cent	0.213	0.187	1.141	0.255	-0.155	0.581
ClarityVVS2:carat_cent	0.297	0.197	1.508	0.133	-0.091	0.685

Write the model equation for Clarity = IF.

\[\hat{\log(TotalPrice)} = 8.598 + 1.892\times \text{carat_cent} - 1.860\times \text{carat_cent}^2\]

Let’s interpret the effect of carat on the total price for diamonds with Clarity = IF that have a carat size within 0.2 carats of the mean (approx. Q1 and Q3).

Based on the model, the effect on \(\log(TotalPrice)\) is \[1.892*(0.2 - (-0.2)) - 1.860*(0.2^2 - (-0.2)^2) = 0.7568\]

Interpretation in terms of \(\log(TotalPrice)\): For diamonds with clarity IF, if we go from a diamond that is 0.2 carats below the mean carat weight to one that is 0.2 carats above the mean carat weight, we expect the log price to increase by 0.7568 on average.

Interpretation in terms of price: For diamonds with clarity IF, diamonds that weigh 0.2 carats above the mean carat weight are expected to have a median price that is 2.131, exp(0.7568) times the median price of diamonds that weigh 0.2 carats below the mean carat weight.

Alternate interpretation: For diamonds with clarity IF, if we go from a diamond that weighs 0.2 carats below the mean carat weight to one that weighs 0.2 carats more the mean carat weight, we expect the price to multiply by a factor of 2.131, exp(0.7568).

Nested F test

Suppose we have the following model:

orig_model <- lm(log_total_price ~ Carat + Depth, data = diamonds)
tidy(orig_model, conf.int = TRUE) %>%
  kable(digits = 3)

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	6.940	0.194	35.767	0.000	6.558	7.322
Carat	2.151	0.057	37.412	0.000	2.037	2.264
Depth	-0.007	0.003	-2.320	0.021	-0.014	-0.001

Conduct a nested F test to determine if we should add Carat to the model shown above.
- State the null and alternative hypotheses in words.
\(H_0\): All of the coefficients for Color are 0

\(H_a\): At least one coefficient for Color is not 0
- Show the code and output for the nested F test.

proposed_model <- lm(log_total_price ~ Carat + Depth + Color, data = diamonds)

anova(orig_model, proposed_model) %>%
  tidy() %>%
  kable(digits = 3)

res.df	rss	df	sumsq	statistic	p.value
268	16.050	NA	NA	NA	NA
262	12.962	6	3.088	10.404	0

State your conclusion in the context of the data.

The p-value is very small, so we reject \(H_0\). There is at least one coefficient associated with Color that is not 0, so we should include Color in the model.

Model comparison

Compare the original model and the model that includes Color using \(Adj. R^2\), \(AIC\) and \(BIC\). Which model would you select based on each criteria?

glance(orig_model) %>%
  select(adj.r.squared, AIC, BIC)

## # A tibble: 1 x 3
##   adj.r.squared   AIC   BIC
##           <dbl> <dbl> <dbl>
## 1         0.854  11.1  25.5

glance(proposed_model) %>%
  select(adj.r.squared, AIC, BIC)

## # A tibble: 1 x 3
##   adj.r.squared   AIC   BIC
##           <dbl> <dbl> <dbl>
## 1         0.879 -34.8  1.22

Based on \(Adj. R^2\), the proposed model that includes Color is a better fit since the \(Adj. R^2\) is higher.

Based on \(AIC\), the proposed model that includes Color is a better fit, since \(AIC\) is lower.

Based on \(BIC\), the proposed model that includes Color is a better fit, since \(BIC\) is lower.

Data set adapted from the Diamonds data set in the Stat2Data R Package.↩︎