library(tidyverse)
library(knitr)
library(broom)
library(leaps)
Hang out with the TAs from STA 210! This is a casual conversation and a fun opportunity to meet the members of the STA 210 teaching team. The only rule is these can’t turn into office hours!
Tea with a TA counts as a statistics experience.
This data set contains the average SAT score (out of 1600) and other variables that may be associated with SAT performance for each of the 50 U.S. states. The data is based on test takers for the 1982 exam.
SAT
, average total SAT scoreTakers
percentage of high school seniors who took examIncome
: median income of families of test-takers ($ hundreds)Years
: average number of years test-takers had formal sciences, natural sciences, and humanitiesPublic
: percentage of test-takers who attended public high schoolExpend
: total state expenditure on high schools ($ hundreds per student)Rank
: median percentile rank of test-takers within their high school classessat_scores <- read_csv("data/sat-scores.csv") %>%
select(-State)
lm(SAT ~ 1, data = sat_scores)
##
## Call:
## lm(formula = SAT ~ 1, data = sat_scores)
##
## Coefficients:
## (Intercept)
## 947.9
Use the drop1
function to manually perform backward selection using AIC as the selection criterion. To help you get started, the code for the full model and the first are below.
full_model <- lm(SAT ~ ., data = sat_scores)
int_only_model <- lm(SAT ~ 1, data = sat_scores)
drop1(full_model)
## Single term deletions
##
## Model:
## SAT ~ Takers + Income + Years + Public + Expend + Rank
## Df Sum of Sq RSS AIC
## <none> 29842 333.58
## Takers 1 332.4 30175 332.14
## Income 1 2.0 29844 331.59
## Years 1 8897.8 38740 344.63
## Public 1 445.8 30288 332.32
## Expend 1 4744.9 34587 338.96
## Rank 1 11223.0 41065 347.54
#remove income
current_model <- lm(SAT ~ Takers + Years + Public + Expend + Rank, data = sat_scores)
drop1(current_model)
## Single term deletions
##
## Model:
## SAT ~ Takers + Years + Public + Expend + Rank
## Df Sum of Sq RSS AIC
## <none> 29844 331.59
## Takers 1 401.3 30246 330.25
## Years 1 9219.7 39064 343.05
## Public 1 495.5 30340 330.41
## Expend 1 6904.4 36749 339.99
## Rank 1 11645.9 41490 346.06
#remove takers
current_model <- lm(SAT ~ Years + Public + Expend + Rank, data = sat_scores)
drop1(current_model)
## Single term deletions
##
## Model:
## SAT ~ Years + Public + Expend + Rank
## Df Sum of Sq RSS AIC
## <none> 30246 330.25
## Years 1 8837 39083 341.07
## Public 1 1462 31708 330.62
## Expend 1 7343 37589 339.12
## Rank 1 184786 215032 426.33
Final Model
tidy(current_model)
## # A tibble: 5 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -205. 118. -1.74 8.90e- 2
## 2 Years 21.9 6.04 3.63 7.31e- 4
## 3 Public -0.664 0.450 -1.48 1.47e- 1
## 4 Expend 2.24 0.678 3.31 1.87e- 3
## 5 Rank 10.0 0.603 16.6 8.67e-21
step
functionIf AIC is the selection criterion, we can use the step
function in R.
Let’s perform forward selection using the step
function.
int_only_model <- lm(SAT ~ 1, data = sat_scores)
selected_model <- step(int_only_model, scope = formula(full_model), direction = "forward")
## Start: AIC=427.06
## SAT ~ 1
##
## Df Sum of Sq RSS AIC
## + Rank 1 190471 55539 354.64
## + Takers 1 181024 64987 362.50
## + Income 1 84038 161973 408.16
## + Years 1 26948 219063 423.25
## <none> 246011 427.06
## + Public 1 1589 244422 428.73
## + Expend 1 973 245038 428.86
##
## Step: AIC=354.64
## SAT ~ Rank
##
## Df Sum of Sq RSS AIC
## + Years 1 17913.6 37626 337.17
## + Expend 1 7671.0 47868 349.21
## + Income 1 4601.1 50938 352.32
## + Public 1 3847.7 51692 353.05
## <none> 55539 354.64
## + Takers 1 1761.8 53778 355.03
##
## Step: AIC=337.17
## SAT ~ Rank + Years
##
## Df Sum of Sq RSS AIC
## + Expend 1 5917.6 31708 330.62
## + Income 1 2782.4 34843 335.33
## <none> 37626 337.17
## + Takers 1 778.7 36847 338.13
## + Public 1 37.0 37589 339.12
##
## Step: AIC=330.62
## SAT ~ Rank + Years + Expend
##
## Df Sum of Sq RSS AIC
## + Public 1 1462.46 30246 330.25
## + Takers 1 1368.28 30340 330.41
## <none> 31708 330.62
## + Income 1 848.47 30860 331.26
##
## Step: AIC=330.25
## SAT ~ Rank + Years + Expend + Public
##
## Df Sum of Sq RSS AIC
## <none> 30246 330.25
## + Takers 1 401.32 29844 331.59
## + Income 1 70.95 30175 332.14
tidy(selected_model)
## # A tibble: 5 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -205. 118. -1.74 8.90e- 2
## 2 Rank 10.0 0.603 16.6 8.67e-21
## 3 Years 21.9 6.04 3.63 7.31e- 4
## 4 Expend 2.24 0.678 3.31 1.87e- 3
## 5 Public -0.664 0.450 -1.48 1.47e- 1
Now let’s perform backward selection using the step
function
step(full_model, scope = formula(int_only_model), direction = "backward")
## Start: AIC=333.58
## SAT ~ Takers + Income + Years + Public + Expend + Rank
##
## Df Sum of Sq RSS AIC
## - Income 1 2.0 29844 331.59
## - Takers 1 332.4 30175 332.14
## - Public 1 445.8 30288 332.32
## <none> 29842 333.58
## - Expend 1 4744.9 34587 338.96
## - Years 1 8897.8 38740 344.63
## - Rank 1 11223.0 41065 347.54
##
## Step: AIC=331.59
## SAT ~ Takers + Years + Public + Expend + Rank
##
## Df Sum of Sq RSS AIC
## - Takers 1 401.3 30246 330.25
## - Public 1 495.5 30340 330.41
## <none> 29844 331.59
## - Expend 1 6904.4 36749 339.99
## - Years 1 9219.7 39064 343.05
## - Rank 1 11645.9 41490 346.06
##
## Step: AIC=330.25
## SAT ~ Years + Public + Expend + Rank
##
## Df Sum of Sq RSS AIC
## <none> 30246 330.25
## - Public 1 1462 31708 330.62
## - Expend 1 7343 37589 339.12
## - Years 1 8837 39083 341.07
## - Rank 1 184786 215032 426.33
##
## Call:
## lm(formula = SAT ~ Years + Public + Expend + Rank, data = sat_scores)
##
## Coefficients:
## (Intercept) Years Public Expend Rank
## -204.5982 21.8905 -0.6638 2.2416 10.0032
step
function and BICThe output will still say “AIC” even though BIC is being used as the selection criterion!
n <- nrow(sat_scores)
step(full_model, scope = formula(int_only_model), direction = "backward",
k = log(n))
## Start: AIC=346.97
## SAT ~ Takers + Income + Years + Public + Expend + Rank
##
## Df Sum of Sq RSS AIC
## - Income 1 2.0 29844 343.06
## - Takers 1 332.4 30175 343.61
## - Public 1 445.8 30288 343.80
## <none> 29842 346.97
## - Expend 1 4744.9 34587 350.43
## - Years 1 8897.8 38740 356.10
## - Rank 1 11223.0 41065 359.02
##
## Step: AIC=343.06
## SAT ~ Takers + Years + Public + Expend + Rank
##
## Df Sum of Sq RSS AIC
## - Takers 1 401.3 30246 339.81
## - Public 1 495.5 30340 339.97
## <none> 29844 343.06
## - Expend 1 6904.4 36749 349.55
## - Years 1 9219.7 39064 352.61
## - Rank 1 11645.9 41490 355.62
##
## Step: AIC=339.81
## SAT ~ Years + Public + Expend + Rank
##
## Df Sum of Sq RSS AIC
## - Public 1 1462 31708 338.26
## <none> 30246 339.81
## - Expend 1 7343 37589 346.77
## - Years 1 8837 39083 348.72
## - Rank 1 184786 215032 433.97
##
## Step: AIC=338.26
## SAT ~ Years + Expend + Rank
##
## Df Sum of Sq RSS AIC
## <none> 31708 338.26
## - Expend 1 5918 37626 342.91
## - Years 1 16160 47868 354.95
## - Rank 1 185667 217375 430.60
##
## Call:
## lm(formula = SAT ~ Years + Expend + Rank, data = sat_scores)
##
## Coefficients:
## (Intercept) Years Expend Rank
## -303.724 26.095 1.861 9.826
regsubsets
Use the regsubsets
function in the leap package to perform backward selection with BIC as the selection criterion.
First may need to install the leaps package. To do so, type the code below in the console:
You only need to install the package one time.
install.packges(leaps)
Uncomment the code below to load the leaps package and perform backward selection with BIC as the selection criterion.
library(leaps)
reg_backward <- regsubsets(SAT ~ ., data = sat_scores, method="backward")
Get a summary of the selection results
sel_summary <- summary(reg_backward)
Choose the model with the minimum value of BIC:
coef(reg_backward, which.min(sel_summary$bic))
## (Intercept) Years Expend Rank
## -303.724295 26.095227 1.860866 9.825794
selected_model <- lm(SAT ~ Years + Expend + Rank, data = sat_scores)
Now select the coefficients corresponding to the model selected using \(Adj R^2\) (adjr2
) as the criterion.
coef(reg_backward, which.max(sel_summary$adjr2))
## (Intercept) Years Public Expend Rank
## -204.598232 21.890482 -0.663798 2.241640 10.003169