library(tidyverse)
library(broom)
library(patchwork)
library(knitr)
Today’s data set contains the price and characteristics of 271 diamonds randomly selected from AwesomeGems.com in July 2005.¹ The variables in the data set are

`Carat`
: Size of the diamond (in carats)

`Color`
: Coded as D (most white/bright) through J

`Clarity`
: Coded as IF (internally flawless), VVS1, VVS2, VS1, VS2, SI1, SI2, or SI3 (slightly clouded)

`Depth`
: Depth (as a percentage of diameter)

`PricePerCt`
: Price per carat

`TotalPrice`
: Price for the diamond (in dollars)

We will use the characteristics to understand variability in the price of diamonds.
diamonds <- read_csv("data/diamonds.csv")
Let’s fit a model using `Clarity` to predict the price.
model1 <- lm(TotalPrice ~ Clarity, data = diamonds)
tidy(model1) %>%
select(term, estimate) %>%
kable(digits = 3)
term | estimate |
---|---|
(Intercept) | 3707.731 |
ClaritySI1 | 578.386 |
ClaritySI2 | 1608.849 |
ClaritySI3 | -135.631 |
ClarityVS1 | 1147.464 |
ClarityVS2 | 851.568 |
ClarityVVS1 | -456.664 |
ClarityVVS2 | -464.124 |
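To read a table like this, it may help to see how `lm()` codes a single categorical predictor under R's default treatment contrasts: the intercept is the mean response in the first factor level, and every other coefficient is a difference from that level. The sketch below uses a small made-up data frame; the names `toy`, `grade`, and `price` and all values are hypothetical and are not taken from the diamonds data.

```r
# Toy data (hypothetical values, not the diamonds data)
toy <- data.frame(
  grade = factor(c("A", "A", "B", "B", "C", "C")),
  price = c(10, 12, 20, 22, 5, 7)
)

# One categorical predictor: R creates an indicator for every level
# except the first one ("A" here, because levels sort alphabetically)
fit <- lm(price ~ grade, data = toy)
coef(fit)

# Group means, for comparison with the coefficients
group_means <- tapply(toy$price, toy$grade, mean)
group_means

# Each non-baseline coefficient is that group's mean minus the baseline mean
group_means[["B"]] - group_means[["A"]]
group_means[["C"]] - group_means[["A"]]
```

Comparing `coef(fit)` with `group_means` shows the pattern: the intercept matches the mean for `"A"`, and the `gradeB` and `gradeC` coefficients match the two differences printed at the end.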
What is the baseline level?
What is the interpretation of `ClaritySI1`?
What is the expected price of a diamond with `ClarityVVS2`?
What is the difference in the expected price between a diamond with `ClaritySI3` and a diamond with `ClarityVVS1`?
We can change the baseline category using the `fct_relevel` function in the forcats R package. We will make `SI3` the baseline category.
diamonds <- diamonds %>%
mutate(Clarity = fct_relevel(Clarity, c("SI3", "IF", "VVS1", "VVS2", "VS1", "VS2", "SI1", "SI2"))
)
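As a quick check of what releveling does, here is a minimal sketch on a made-up factor (the vector `clarity` and its values are hypothetical, not read from `data/diamonds.csv`). It assumes the forcats package is installed. Naming only the new baseline moves that level to the front and leaves the remaining levels in their existing order; listing every level, as in the code above, fixes the full order explicitly.

```r
library(forcats)

# Toy clarity values (hypothetical, not the diamonds data)
clarity <- factor(c("VS1", "IF", "SI3", "VVS1", "SI3"))

# Default level order is alphabetical, so "IF" comes first
levels(clarity)

# Move "SI3" to the front; the other levels keep their relative order
clarity_relevel <- fct_relevel(clarity, "SI3")
levels(clarity_relevel)

# The data values themselves are unchanged, only the level order differs
identical(as.character(clarity), as.character(clarity_relevel))
```

Because `lm()` treats the first level as the baseline, changing the level order changes which category the intercept describes without changing the fitted values.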
Let’s refit the model:
model1_relevel <- lm(TotalPrice ~ Clarity, data = diamonds)
tidy(model1_relevel) %>%
select(term, estimate) %>%
kable(digits = 3)
term | estimate |
---|---|
(Intercept) | 3572.100 |
ClarityIF | 135.631 |
ClarityVVS1 | -321.033 |
ClarityVVS2 | -328.493 |
ClarityVS1 | 1283.095 |
ClarityVS2 | 987.199 |
ClaritySI1 | 714.017 |
ClaritySI2 | 1744.480 |
Interpret the coefficient for `ClarityVVS1`.
How does the coefficient for `ClarityVVS1` compare to your response to Exercise 4 above? Is this what you expected?
Now let’s fit a model using `Clarity`, `Carat`, and the interaction between the two variables.
model2 <- lm(TotalPrice ~ Clarity + Carat + Clarity * Carat, data = diamonds)
tidy(model2) %>%
kable(digits = 3)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -2268.000 | 1428.898 | -1.587 | 0.114 |
ClarityIF | 1181.771 | 1517.647 | 0.779 | 0.437 |
ClarityVVS1 | 499.216 | 1488.711 | 0.335 | 0.738 |
ClarityVVS2 | -113.350 | 1500.621 | -0.076 | 0.940 |
ClarityVS1 | -904.187 | 1455.494 | -0.621 | 0.535 |
ClarityVS2 | -1426.577 | 1469.590 | -0.971 | 0.333 |
ClaritySI1 | -1392.974 | 1516.065 | -0.919 | 0.359 |
ClaritySI2 | -14.773 | 1695.575 | -0.009 | 0.993 |
Carat | 5562.000 | 1250.823 | 4.447 | 0.000 |
ClarityIF:Carat | 2208.757 | 1457.253 | 1.516 | 0.131 |
ClarityVVS1:Carat | 2293.206 | 1384.919 | 1.656 | 0.099 |
ClarityVVS2:Carat | 3226.995 | 1422.740 | 2.268 | 0.024 |
ClarityVS1:Carat | 4091.650 | 1290.517 | 3.171 | 0.002 |
ClarityVS2:Carat | 4153.341 | 1309.744 | 3.171 | 0.002 |
ClaritySI1:Carat | 3743.727 | 1375.395 | 2.722 | 0.007 |
ClaritySI2:Carat | 1198.990 | 1474.424 | 0.813 | 0.417 |
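A note on the formula used for this model: in R's formula syntax, `Clarity * Carat` already expands to both main effects plus the `Clarity:Carat` interaction, so writing the main effects separately is harmless but redundant. The sketch below checks this by inspecting the terms of the two formulas with base R only; the object names `f_full` and `f_short` are illustrative.

```r
# Formulas only; no data are needed to inspect the terms
f_full  <- TotalPrice ~ Clarity + Carat + Clarity * Carat
f_short <- TotalPrice ~ Clarity * Carat

# Term labels after R expands the formula and drops duplicate terms
labels_full  <- attr(terms(f_full), "term.labels")
labels_short <- attr(terms(f_short), "term.labels")

labels_full
labels_short

# TRUE when both formulas describe the same set of terms
identical(labels_full, labels_short)
```

Both formulas should therefore produce the same coefficient table when fit to the same data.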
Write the model equation for a diamond with `ClaritySI3`.
The coefficient of `Carat` is the relationship between carat and price for diamonds in what category of `Clarity`? (This is called a “main effect”.)
Interpret the coefficient of `ClarityVVS1:Carat`.
Write the model equation for a diamond with `ClarityVVS1`.
Describe the effect of carat on the price of a diamond with `ClarityVVS1`.
Mean-center `Carat` and refit the model from Part 2 using the mean-centered variable for carat.
# write code to mean-center Carat
# write code to refit the model
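For reference, mean-centering subtracts the sample mean from a variable, so that a value of 0 corresponds to an average observation. The sketch below illustrates the idea on a made-up data frame with one quantitative predictor; the names `toy`, `carat`, `carat_cent`, and `price` and all values are hypothetical, and base R is used so the example runs without extra packages (in the tidyverse style used above, `mutate()` would do the same job).

```r
# Toy data (hypothetical values, not the diamonds data)
toy <- data.frame(
  carat = c(0.5, 1.0, 1.5, 2.0),
  price = c(1500, 5000, 9000, 16000)
)

# Mean-center: subtract the sample mean from every value
toy$carat_cent <- toy$carat - mean(toy$carat)

# Fit the same simple regression with the raw and the centered predictor
fit_raw  <- lm(price ~ carat, data = toy)
fit_cent <- lm(price ~ carat_cent, data = toy)

# The slope is the same in both fits; only the intercept changes
coef(fit_raw)
coef(fit_cent)

# In this one-predictor fit, the centered intercept equals the mean price
mean(toy$price)
```

Centering shifts where 0 sits on the predictor's scale, which changes what the intercept refers to but not how the response changes per unit of the predictor.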
Describe the effect of carat on the total price for a diamond with `ClarityIF`.
Interpret the intercept in the context of the data.
¹ Data set adapted from the Diamonds data set in the Stat2Data R Package.