library(tidyverse)
library(broom)
library(patchwork)
library(knitr)
Today’s data set contains the price and characteristics for 271 diamonds randomly selected from AwesomeGems.com in July 2005.¹ The variables in the data set are:
- Carat: Size of the diamond (in carats)
- Color: Coded as D (most white/bright) through J
- Clarity: Coded as IF (internally flawless), VVS1, VVS2, VS1, VS2, SI1, SI2, or SI3 (slightly clouded)
- Depth: Depth (as a percentage of diameter)
- PricePerCt: Price per carat
- TotalPrice: Price for the diamond (in dollars)

We will use the characteristics to understand variability in the price of diamonds.
diamonds <- read_csv("data/diamonds.csv")
Let’s fit a model using Clarity to predict the price.
model1 <- lm(TotalPrice ~ Clarity, data = diamonds)
tidy(model1) %>%
  select(term, estimate) %>%
  kable(digits = 3)
| term | estimate |
|---|---|
| (Intercept) | 3707.731 |
| ClaritySI1 | 578.386 |
| ClaritySI2 | 1608.849 |
| ClaritySI3 | -135.631 |
| ClarityVS1 | 1147.464 |
| ClarityVS2 | 851.568 |
| ClarityVVS1 | -456.664 |
| ClarityVVS2 | -464.124 |
1. What is the baseline level?
2. What is the interpretation of ClaritySI1?
3. What is the expected price of a diamond with ClarityVVS2?
4. What is the difference in the expected price between a diamond with ClaritySI3 and a diamond with ClarityVVS1?
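To see how R codes a categorical predictor, here is a minimal sketch on made-up data. The `toy` data frame and its `grade` and `price` columns are hypothetical and are not part of the diamonds data. It assumes the default behavior of `lm()`, where a character predictor is converted to a factor with sorted levels and the first level becomes the baseline.

```r
# Hypothetical toy data (not the diamonds data): three groups, two rows each.
toy <- data.frame(
  grade = c("A", "A", "B", "B", "C", "C"),
  price = c(10, 12, 20, 22, 5, 7)
)

# lm() converts the character column to a factor; levels are sorted, so "A"
# becomes the baseline and gets no coefficient of its own.
fit <- lm(price ~ grade, data = toy)

# Group means computed directly from the data.
group_means <- tapply(toy$price, toy$grade, mean)

# The intercept matches the baseline mean; each remaining coefficient is that
# group's mean minus the baseline mean.
coefs <- coef(fit)
print(coefs)
print(group_means - group_means["A"])
```

The same reading applies to the table above: the level with no row of its own is the baseline, and each estimate is a difference from it.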
We can change the baseline category using the fct_relevel function in the forcats R package. We will make SI3 the baseline category.
diamonds <- diamonds %>%
  mutate(
    Clarity = fct_relevel(Clarity, c("SI3", "IF", "VVS1", "VVS2", "VS1", "VS2", "SI1", "SI2"))
  )
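As a small illustration of what `fct_relevel()` does to the level order, the sketch below uses a short made-up vector of clarity codes (the `clarity` vector is hypothetical, not the real column). It assumes the forcats package is installed.

```r
library(forcats)

# A few clarity codes in a made-up order, for illustration only.
clarity <- factor(c("IF", "SI1", "SI3", "VS1", "SI3", "IF"))

# Default factor levels are sorted, so "IF" comes first.
default_levels <- levels(clarity)

# Naming a single level moves it to the front; the rest keep their order.
releveled <- fct_relevel(clarity, "SI3")
new_levels <- levels(releveled)

print(default_levels)
print(new_levels)
```

Naming only `"SI3"` would be enough to change the baseline. Listing every level, as in the code above, also fixes the order of the remaining levels, which determines the row order of the coefficient table.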
Let’s refit the model:
model1_relevel <- lm(TotalPrice ~ Clarity, data = diamonds)
tidy(model1_relevel) %>%
  select(term, estimate) %>%
  kable(digits = 3)
| term | estimate |
|---|---|
| (Intercept) | 3572.100 |
| ClarityIF | 135.631 |
| ClarityVVS1 | -321.033 |
| ClarityVVS2 | -328.493 |
| ClarityVS1 | 1283.095 |
| ClarityVS2 | 987.199 |
| ClaritySI1 | 714.017 |
| ClaritySI2 | 1744.480 |
5. Interpret the coefficient for ClarityVVS1.
6. How does the coefficient for ClarityVVS1 compare to your response to Exercise 4 above? Is this what you expected?
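The two coefficient tables describe the same fitted model under different baselines. As a consistency check, the sketch below re-expresses the estimates from the first table relative to SI3 using plain arithmetic. The numbers are copied from the tables above; the vector names `old` and `new` are only for illustration.

```r
# Estimates copied from the first table (baseline IF). IF is the baseline,
# so its implied coefficient is 0.
old <- c(IF = 0, SI1 = 578.386, SI2 = 1608.849, SI3 = -135.631,
         VS1 = 1147.464, VS2 = 851.568, VVS1 = -456.664, VVS2 = -464.124)
old_intercept <- 3707.731

# Switching the baseline to SI3: the new intercept is the old intercept plus
# the old SI3 coefficient, and every other coefficient is measured from SI3.
new_intercept <- old_intercept + old[["SI3"]]
new <- old[names(old) != "SI3"] - old[["SI3"]]

print(round(new_intercept, 3))
print(round(new, 3))
```

The results agree with the releveled table to three decimal places, which is what we would expect if only the reference category changed.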
Now let’s fit a model using Clarity, Carat, and the interaction between the two variables.
model2 <- lm(TotalPrice ~ Clarity + Carat + Clarity * Carat, data = diamonds)
tidy(model2) %>%
  kable(digits = 3)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -2268.000 | 1428.898 | -1.587 | 0.114 |
| ClarityIF | 1181.771 | 1517.647 | 0.779 | 0.437 |
| ClarityVVS1 | 499.216 | 1488.711 | 0.335 | 0.738 |
| ClarityVVS2 | -113.350 | 1500.621 | -0.076 | 0.940 |
| ClarityVS1 | -904.187 | 1455.494 | -0.621 | 0.535 |
| ClarityVS2 | -1426.577 | 1469.590 | -0.971 | 0.333 |
| ClaritySI1 | -1392.974 | 1516.065 | -0.919 | 0.359 |
| ClaritySI2 | -14.773 | 1695.575 | -0.009 | 0.993 |
| Carat | 5562.000 | 1250.823 | 4.447 | 0.000 |
| ClarityIF:Carat | 2208.757 | 1457.253 | 1.516 | 0.131 |
| ClarityVVS1:Carat | 2293.206 | 1384.919 | 1.656 | 0.099 |
| ClarityVVS2:Carat | 3226.995 | 1422.740 | 2.268 | 0.024 |
| ClarityVS1:Carat | 4091.650 | 1290.517 | 3.171 | 0.002 |
| ClarityVS2:Carat | 4153.341 | 1309.744 | 3.171 | 0.002 |
| ClaritySI1:Carat | 3743.727 | 1375.395 | 2.722 | 0.007 |
| ClaritySI2:Carat | 1198.990 | 1474.424 | 0.813 | 0.417 |
Write the model equation for a diamond with ClaritySI3.
The coefficient of Carat describes the relationship between carat and price for diamonds in which category of Clarity? (This is called a “main effect”.)
Interpret the coefficient of ClarityVVS1:Carat.
Write the model equation for a diamond with ClarityVVS1.
Describe the effect of carat on the price of a diamond with ClarityVVS1.
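To illustrate how coefficients combine in a model with an interaction, here is a minimal sketch on made-up, noise-free data. The `toy` data frame, its columns, and the two lines used to generate it are all hypothetical. It also checks that writing the main effects alongside `group * x`, as in the model formula above, gives the same set of terms as `group * x` alone.

```r
# Hypothetical noise-free toy data: two groups that follow different lines.
#   group A: y = 1 + 2x      group B: y = 3 + 5x
toy <- data.frame(
  group = rep(c("A", "B"), each = 4),
  x     = rep(1:4, times = 2)
)
toy$y <- ifelse(toy$group == "A", 1 + 2 * toy$x, 3 + 5 * toy$x)

# `group * x` expands to both main effects plus the interaction, so adding
# the main effects explicitly specifies the same model.
short <- attr(terms(y ~ group * x), "term.labels")
long  <- attr(terms(y ~ group + x + group * x), "term.labels")

fit <- lm(y ~ group * x, data = toy)
b <- coef(fit)

# Baseline group (A): intercept and slope are read off directly.
# Other group (B): add the group coefficient and the interaction coefficient.
line_A <- c(intercept = b[["(Intercept)"]], slope = b[["x"]])
line_B <- c(intercept = b[["(Intercept)"]] + b[["groupB"]],
            slope     = b[["x"]] + b[["groupB:x"]])

print(round(line_A, 3))
print(round(line_B, 3))
```

The recovered lines match the ones used to build the data, so the interaction coefficient can be read as the difference in slope between a group and the baseline group.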
Mean-center Carat and refit the model from Part 2 using the mean-centered variable for carat.
# write code to mean-center Carat
# write code to refit the model
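As a hint at the mechanics, here is a minimal sketch of mean-centering on made-up data with a single predictor. The `toy` data frame and the name `x_cent` are hypothetical; the exercise itself should use the diamonds data and the model with Clarity.

```r
# Hypothetical toy data; `x_cent` is an illustrative name, not from the text.
toy <- data.frame(
  x = c(0.5, 1.0, 1.5, 2.0),
  y = c(3, 8, 9, 16)
)

# Mean-center: subtract the sample mean so the new variable averages to zero.
toy$x_cent <- toy$x - mean(toy$x)

fit_raw  <- lm(y ~ x, data = toy)
fit_cent <- lm(y ~ x_cent, data = toy)

# Centering shifts the intercept but leaves the slope unchanged.
slope_raw  <- coef(fit_raw)[["x"]]
slope_cent <- coef(fit_cent)[["x_cent"]]

# The centered intercept is the predicted response at the mean of x, which
# in simple linear regression equals the mean of y.
intercept_cent <- coef(fit_cent)[["(Intercept)"]]

print(c(slope_raw = slope_raw, slope_cent = slope_cent))
print(c(intercept_cent = intercept_cent, mean_y = mean(toy$y)))
```

In a model that also includes a categorical predictor and its interaction with the centered variable, the intercept is instead the expected response at the average value of the centered variable for the baseline category.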
Describe the effect of carat on the total price for a diamond with ClarityIF.
Interpret the intercept in the context of the data.
¹ Data set adapted from the Diamonds data set in the Stat2Data R package.