library(tidyverse)
library(knitr)
library(broom)
library(leaps)

Announcements

Tea with a TA!

Hang out with the TAs from STA 210! This is a casual conversation and a fun opportunity to meet the members of the STA 210 teaching team. The only rule is these can’t turn into office hours!

Tea with a TA counts as a statistics experience.

  • Betsy Bersson, Wed, Oct 7, 9:30 - 10:30 am
    • Click here to sign up. Zoom details will be emailed before the event.

Project

Link: https://sta210-fa20.netlify.app/project

Proposal due Oct 09

Quiz 02

  • Covers material weeks 04 - 06
  • Active Oct 1 at 12a - Oct 2 at 11:59p; 2 hours to complete it
  • Format similar to Quiz 01

Questions from video?

Understanding state-wide performance on the SAT

This data set contains the average SAT score (out of 1600) and other variables that may be associated with SAT performance for each of the 50 U.S. states. The data is based on test takers for the 1982 exam.

sat_scores <- read_csv("data/sat-scores.csv") %>%
  select(-State)
lm(SAT ~ 1, data = sat_scores)
## 
## Call:
## lm(formula = SAT ~ 1, data = sat_scores)
## 
## Coefficients:
## (Intercept)  
##       947.9

Part 1: Manual backward selection

Use the drop1 function to manually perform backward selection using AIC as the selection criterion. To help you get started, the code for the full model and the first are below.

full_model <- lm(SAT ~ ., data = sat_scores)
int_only_model <- lm(SAT ~ 1, data = sat_scores)
drop1(full_model)
## Single term deletions
## 
## Model:
## SAT ~ Takers + Income + Years + Public + Expend + Rank
##        Df Sum of Sq   RSS    AIC
## <none>              29842 333.58
## Takers  1     332.4 30175 332.14
## Income  1       2.0 29844 331.59
## Years   1    8897.8 38740 344.63
## Public  1     445.8 30288 332.32
## Expend  1    4744.9 34587 338.96
## Rank    1   11223.0 41065 347.54
#remove income
current_model <- lm(SAT ~ Takers + Years + Public + Expend + Rank, data = sat_scores)
drop1(current_model)
## Single term deletions
## 
## Model:
## SAT ~ Takers + Years + Public + Expend + Rank
##        Df Sum of Sq   RSS    AIC
## <none>              29844 331.59
## Takers  1     401.3 30246 330.25
## Years   1    9219.7 39064 343.05
## Public  1     495.5 30340 330.41
## Expend  1    6904.4 36749 339.99
## Rank    1   11645.9 41490 346.06
#remove takers
current_model <- lm(SAT ~ Years + Public + Expend + Rank, data = sat_scores)
drop1(current_model)
## Single term deletions
## 
## Model:
## SAT ~ Years + Public + Expend + Rank
##        Df Sum of Sq    RSS    AIC
## <none>               30246 330.25
## Years   1      8837  39083 341.07
## Public  1      1462  31708 330.62
## Expend  1      7343  37589 339.12
## Rank    1    184786 215032 426.33

Final Model

tidy(current_model)
## # A tibble: 5 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept) -205.      118.        -1.74 8.90e- 2
## 2 Years         21.9       6.04       3.63 7.31e- 4
## 3 Public        -0.664     0.450     -1.48 1.47e- 1
## 4 Expend         2.24      0.678      3.31 1.87e- 3
## 5 Rank          10.0       0.603     16.6  8.67e-21

Part 2: Model selection functions in R

step function

If AIC is the selection criterion, we can use the step function in R.

Let’s perform forward selection using the step function.

int_only_model <- lm(SAT ~ 1, data = sat_scores)
selected_model <- step(int_only_model, scope = formula(full_model), direction = "forward")
## Start:  AIC=427.06
## SAT ~ 1
## 
##          Df Sum of Sq    RSS    AIC
## + Rank    1    190471  55539 354.64
## + Takers  1    181024  64987 362.50
## + Income  1     84038 161973 408.16
## + Years   1     26948 219063 423.25
## <none>                246011 427.06
## + Public  1      1589 244422 428.73
## + Expend  1       973 245038 428.86
## 
## Step:  AIC=354.64
## SAT ~ Rank
## 
##          Df Sum of Sq   RSS    AIC
## + Years   1   17913.6 37626 337.17
## + Expend  1    7671.0 47868 349.21
## + Income  1    4601.1 50938 352.32
## + Public  1    3847.7 51692 353.05
## <none>                55539 354.64
## + Takers  1    1761.8 53778 355.03
## 
## Step:  AIC=337.17
## SAT ~ Rank + Years
## 
##          Df Sum of Sq   RSS    AIC
## + Expend  1    5917.6 31708 330.62
## + Income  1    2782.4 34843 335.33
## <none>                37626 337.17
## + Takers  1     778.7 36847 338.13
## + Public  1      37.0 37589 339.12
## 
## Step:  AIC=330.62
## SAT ~ Rank + Years + Expend
## 
##          Df Sum of Sq   RSS    AIC
## + Public  1   1462.46 30246 330.25
## + Takers  1   1368.28 30340 330.41
## <none>                31708 330.62
## + Income  1    848.47 30860 331.26
## 
## Step:  AIC=330.25
## SAT ~ Rank + Years + Expend + Public
## 
##          Df Sum of Sq   RSS    AIC
## <none>                30246 330.25
## + Takers  1    401.32 29844 331.59
## + Income  1     70.95 30175 332.14
tidy(selected_model)
## # A tibble: 5 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept) -205.      118.        -1.74 8.90e- 2
## 2 Rank          10.0       0.603     16.6  8.67e-21
## 3 Years         21.9       6.04       3.63 7.31e- 4
## 4 Expend         2.24      0.678      3.31 1.87e- 3
## 5 Public        -0.664     0.450     -1.48 1.47e- 1

Now let’s perform backward selection using the step function

step(full_model, scope = formula(int_only_model), direction = "backward")
## Start:  AIC=333.58
## SAT ~ Takers + Income + Years + Public + Expend + Rank
## 
##          Df Sum of Sq   RSS    AIC
## - Income  1       2.0 29844 331.59
## - Takers  1     332.4 30175 332.14
## - Public  1     445.8 30288 332.32
## <none>                29842 333.58
## - Expend  1    4744.9 34587 338.96
## - Years   1    8897.8 38740 344.63
## - Rank    1   11223.0 41065 347.54
## 
## Step:  AIC=331.59
## SAT ~ Takers + Years + Public + Expend + Rank
## 
##          Df Sum of Sq   RSS    AIC
## - Takers  1     401.3 30246 330.25
## - Public  1     495.5 30340 330.41
## <none>                29844 331.59
## - Expend  1    6904.4 36749 339.99
## - Years   1    9219.7 39064 343.05
## - Rank    1   11645.9 41490 346.06
## 
## Step:  AIC=330.25
## SAT ~ Years + Public + Expend + Rank
## 
##          Df Sum of Sq    RSS    AIC
## <none>                 30246 330.25
## - Public  1      1462  31708 330.62
## - Expend  1      7343  37589 339.12
## - Years   1      8837  39083 341.07
## - Rank    1    184786 215032 426.33
## 
## Call:
## lm(formula = SAT ~ Years + Public + Expend + Rank, data = sat_scores)
## 
## Coefficients:
## (Intercept)        Years       Public       Expend         Rank  
##   -204.5982      21.8905      -0.6638       2.2416      10.0032

BAckward selection with the step function and BIC

The output will still say “AIC” even though BIC is being used as the selection criterion!

n <- nrow(sat_scores)

step(full_model, scope = formula(int_only_model), direction = "backward", 
     k = log(n))
## Start:  AIC=346.97
## SAT ~ Takers + Income + Years + Public + Expend + Rank
## 
##          Df Sum of Sq   RSS    AIC
## - Income  1       2.0 29844 343.06
## - Takers  1     332.4 30175 343.61
## - Public  1     445.8 30288 343.80
## <none>                29842 346.97
## - Expend  1    4744.9 34587 350.43
## - Years   1    8897.8 38740 356.10
## - Rank    1   11223.0 41065 359.02
## 
## Step:  AIC=343.06
## SAT ~ Takers + Years + Public + Expend + Rank
## 
##          Df Sum of Sq   RSS    AIC
## - Takers  1     401.3 30246 339.81
## - Public  1     495.5 30340 339.97
## <none>                29844 343.06
## - Expend  1    6904.4 36749 349.55
## - Years   1    9219.7 39064 352.61
## - Rank    1   11645.9 41490 355.62
## 
## Step:  AIC=339.81
## SAT ~ Years + Public + Expend + Rank
## 
##          Df Sum of Sq    RSS    AIC
## - Public  1      1462  31708 338.26
## <none>                 30246 339.81
## - Expend  1      7343  37589 346.77
## - Years   1      8837  39083 348.72
## - Rank    1    184786 215032 433.97
## 
## Step:  AIC=338.26
## SAT ~ Years + Expend + Rank
## 
##          Df Sum of Sq    RSS    AIC
## <none>                 31708 338.26
## - Expend  1      5918  37626 342.91
## - Years   1     16160  47868 354.95
## - Rank    1    185667 217375 430.60
## 
## Call:
## lm(formula = SAT ~ Years + Expend + Rank, data = sat_scores)
## 
## Coefficients:
## (Intercept)        Years       Expend         Rank  
##    -303.724       26.095        1.861        9.826

Backward selection using regsubsets

Use the regsubsets function in the leap package to perform backward selection with BIC as the selection criterion.

First may need to install the leaps package. To do so, type the code below in the console:

You only need to install the package one time.

install.packges(leaps)

Uncomment the code below to load the leaps package and perform backward selection with BIC as the selection criterion.

library(leaps)
reg_backward <- regsubsets(SAT ~ ., data = sat_scores, method="backward")

Get a summary of the selection results

sel_summary <- summary(reg_backward)

Choose the model with the minimum value of BIC:

coef(reg_backward, which.min(sel_summary$bic))
## (Intercept)       Years      Expend        Rank 
## -303.724295   26.095227    1.860866    9.825794
selected_model <- lm(SAT ~ Years + Expend + Rank, data = sat_scores)

Now select the coefficients corresponding to the model selected using \(Adj R^2\) (adjr2) as the criterion.

coef(reg_backward, which.max(sel_summary$adjr2))
## (Intercept)       Years      Public      Expend        Rank 
## -204.598232   21.890482   -0.663798    2.241640   10.003169