AE 14: Model diagnostics

Announcements

Tea with a TA!

Hang out with the TAs from STA 210! This is a casual conversation and a fun opportunity to meet the members of the STA 210 teaching team. The only rule is these can’t turn into office hours!

Tea with a TA counts as a statistics experience.

Betsy Bersson, Wed, Oct 7, 9:30 - 10:30 am
- Click here to sign up. Zoom details will be emailed before the event.

Quiz 02

Covers material weeks 04 - 06
Active Oct 1 at 12a - Oct 2 at 11:59p; 2 hours to complete it
Format similar to Quiz 01
Questions about quiz material?

Checking model conditions

Notes

Questions about model diagnostics?

Model diagnostics for Tips data

What affects the amount customers tip at a restaurant? To answer this question, we will use data collected in 2011 by a student at St. Olaf who worked at a local restaurant.¹

The variables in the data set are

Tip: amount of the tip
Party: number of people in the party
Meal: time of day (Lunch, Dinner, Late Night)
Age: age category of person paying the bill (Yadult, Middle, SenCit)

tips <- read_csv("data/tip-data.csv") %>%
  filter(!is.na(Party))

We fit a model using the Party, Meal and Age to predict the Tips at this restaurant.

tips_model <- lm(Tip ~ Party + Meal + Age, data = tips)
tidy(tips_model) %>%
  kable(digits = 3)

term	estimate	std.error	statistic	p.value
(Intercept)	1.254	0.394	3.182	0.002
Party	1.808	0.121	14.909	0.000
MealLate Night	-1.632	0.407	-4.013	0.000
MealLunch	-0.612	0.402	-1.523	0.130
AgeSenCit	0.390	0.394	0.990	0.324
AgeYadult	-0.505	0.412	-1.227	0.222

We can use the augment function to get the model diagnostics:

tips_aug <- augment(tips_model) %>%
   mutate(obs_num = row_number()) #add row number to help with graphing

Leverage

What is the threshold used to identify observations with high leverage? Calculate the threshold and save as leverage_threshold.

#calculate threshold

Remove eval = F from the code chunk header to create a plot of leverage vs. the observation number.

Look at the values of the predictors for the observations that have high leverage. Why do you think these observations have high leverage?

ggplot(data = tips_aug, aes(x = obs_num, y = .hat)) + 
  geom_point(alpha = 0.7) + 
  geom_hline(yintercept = leverage_threshold, color = "red") +
  labs(x = "Observation Number", y = "Leverage") +
  geom_text(aes(label=ifelse(.hat > leverage_threshold,
                             as.character(obs_num), "")), nudge_x = 4)

Identifying outliers

Create a plot of the standardized residuals (.std.resid) vs. the predicted values. Include markers on the plot to identify points with standardized residuals beyond and points with standardized residuals beyond .

# scatterplot of std. residuals vs. predicted

Using as the threshold, what observations are considered outliers (if any)? If there are outliers, why do you think these observations are outliers, i.e. why do they have standardized residuals with large magnitude? What information would you like to know to better understand why these observations are outliers?

Cook’s distance

Create a scatterplot of Cook’s D .cooksd. Include markers on the plot to identify points with Cook’s D > 0.5 and Cook’s D > 1.

## scatterplot of Cook's D vs. observation number

Based on Cook’s D, are there any observations that have strong influence on the model coefficients? If so, would you keep these observations in the data used to build the model? Briefly explain why or why not.

Multicollinearity

Use the vif function from the rms package to calculate the variance inflation factors for the variables in the model. Use the code

install.packages("rms")

if you need to install the rms package.

# load rms package and calculate VIF

Is there multicollinearity in this model? Briefly explain why or why not.

Dahlquist, Samantha, and Jin Dong. 2011. “The Effects of Credit Cards on Tipping.” Project for Statistics 212-Statistics for the Sciences, St. Olaf College.↩︎