Announcements

Tea with a TA!

Hang out with the TAs from STA 210! This is a casual conversation and a fun opportunity to meet the members of the STA 210 teaching team. The only rule is these can’t turn into office hours!

Tea with a TA counts as a statistics experience.

  • Betsy Bersson, Wed, Oct 7, 9:30 - 10:30 am
    • Click here to sign up. Zoom details will be emailed before the event.

Quiz 02

  • Covers material weeks 04 - 06
  • Active Oct 1 at 12a - Oct 2 at 11:59p; 2 hours to complete it
  • Format similar to Quiz 01
  • Questions about quiz material?

Checking model conditions

Notes

Questions about model diagnostics?

Model diagnostics for Tips data

What affects the amount customers tip at a restaurant? To answer this question, we will use data collected in 2011 by a student at St. Olaf who worked at a local restaurant.1

The variables in the data set are

tips <- read_csv("data/tip-data.csv") %>%
  filter(!is.na(Party)) 

We fit a model using the Party, Meal and Age to predict the Tips at this restaurant.

tips_model <- lm(Tip ~ Party + Meal + Age, data = tips)
tidy(tips_model) %>%
  kable(digits = 3)
term estimate std.error statistic p.value
(Intercept) 1.254 0.394 3.182 0.002
Party 1.808 0.121 14.909 0.000
MealLate Night -1.632 0.407 -4.013 0.000
MealLunch -0.612 0.402 -1.523 0.130
AgeSenCit 0.390 0.394 0.990 0.324
AgeYadult -0.505 0.412 -1.227 0.222

We can use the augment function to get the model diagnostics:

tips_aug <- augment(tips_model) %>%
   mutate(obs_num = row_number()) #add row number to help with graphing

Leverage

  1. What is the threshold used to identify observations with high leverage? Calculate the threshold and save as leverage_threshold.
#calculate threshold
  1. Remove eval = F from the code chunk header to create a plot of leverage vs. the observation number.

Look at the values of the predictors for the observations that have high leverage. Why do you think these observations have high leverage?

ggplot(data = tips_aug, aes(x = obs_num, y = .hat)) + 
  geom_point(alpha = 0.7) + 
  geom_hline(yintercept = leverage_threshold, color = "red") +
  labs(x = "Observation Number", y = "Leverage") +
  geom_text(aes(label=ifelse(.hat > leverage_threshold,
                             as.character(obs_num), "")), nudge_x = 4)

Identifying outliers

  1. Create a plot of the standardized residuals (.std.resid) vs. the predicted values. Include markers on the plot to identify points with standardized residuals beyond \(\pm 2\) and points with standardized residuals beyond \(\pm 3\).
# scatterplot of std. residuals vs. predicted
  1. Using \(\pm 3\) as the threshold, what observations are considered outliers (if any)? If there are outliers, why do you think these observations are outliers, i.e. why do they have standardized residuals with large magnitude? What information would you like to know to better understand why these observations are outliers?

Cook’s distance

  1. Create a scatterplot of Cook’s D .cooksd. Include markers on the plot to identify points with Cook’s D > 0.5 and Cook’s D > 1.
## scatterplot of Cook's D vs. observation number
  1. Based on Cook’s D, are there any observations that have strong influence on the model coefficients? If so, would you keep these observations in the data used to build the model? Briefly explain why or why not.

Multicollinearity

  1. Use the vif function from the rms package to calculate the variance inflation factors for the variables in the model. Use the code
install.packages("rms")

if you need to install the rms package.

# load rms package and calculate VIF
  1. Is there multicollinearity in this model? Briefly explain why or why not.


  1. Dahlquist, Samantha, and Jin Dong. 2011. “The Effects of Credit Cards on Tipping.” Project for Statistics 212-Statistics for the Sciences, St. Olaf College.↩︎