Model building is an iterative process that often requires fitting, assessing, and tweaking multiple models. The goal of today’s lab is to use an iterative model building process along with various criteria to compare models to select a model that “best” fits the data.

Getting Started

Go to the sta210-fa20 organization on GitHub (https://github.com/sta210-fa20). Click on the repo with the prefix lab-05-. It contains the starter documents you need to complete the lab.
Clone the repo and create a new project in your RStudio Docker Container.
Configure git by typing the following in the console.

library(usethis)
use_git_config(user.name="github username", user.email="your email")

Password caching

If you would like your git password cached for a week for this project, type the following in the Terminal in RStudio:

git config --global credential.helper 'cache --timeout 604800'

You will need to enter your GitHub username and password one more time after caching the password. After that you won’t need to enter your credentials for 604800 seconds = 7 days.

Packages

We will use the following packages in today’s lab. Feel free to add any other packages as needed.

library(tidyverse)
library(broom)
library(knitr)

Data

The data from today’s lab come from the paper Influence of perceived threat of Covid-19 and HEXACO personality traits on toilet paper stockpiling by Garbe, Rau and Toppe. The objective their analysis was to examine the relationship between personality traits, perceived threat of Covid-19 and stockpiling toilet paper.

They conducted an online survey March 23 - 29, 2020 and used the results to fit multiple linear regression models to draw conclusions about their research questions. From their survey, they collected data on adults across 35 countries. Given the small number of responses from people outside of the United States, Canada and Europe, only responses from people in these three locations were included in the regression analysis.

They used the standardized version of each numerical variable in their model. Given a variable \(x\), the standardized \(x\) is calculated as \(\frac{x - \bar{x}}{s_x}\). In other words, we standardize a variable by shifting it by its mean and rescaling it by its standard deviation.

The dataset is in the file covid-tp.csv.

The variables we’ll use in this analysis are

Top_Inhouse: Number of toilet paper rolls in the household
politics: Standardized politics score (larger values indicate more conservative/ to the right on politics)
gender: 1: Male, 2: Female, 3: Other gender identified
age: Standardized age
household: Standardized number of people in the household
duration: Standardized number of days in self-quarantine at the time of survey
mobility: 1: Local restrictions on leaving the house, 0: No restrictions on leaving the house
public_transport: Local restrictions on public transportation
europe: 1: respondent lives in EU, 0: respondent lives in US/Canada
date_diff_z: Standardized number of days between first reported case in locality and taking the survey
consc: Standardized measure of conscientiousness (higher score means stronger traits such as organization, diligence, perfectionism, prudence)
open: Standardized measure of openness to experience (higher score means stronger traits such as aesthetic appreciation, inquisitiveness, creativity, unconventionality)
perceived_threat: Standardized measure of perceived threat of Covid-19 (higher score means greater perceived threat)

Exercises

In this lab, we will use an iterative model building process to fit a model that can be used to predict Top_Inhouse, the number of toilet paper rolls a respondent reported having in their home at the time they took the survey.

The started by fitting a baseline model with the variables politics, gender, age, household, duration, mobility, public_transport, europe, and date_diff_z. Briefly explain why the authors started by fitting this baseline model before including the variables measuring perceived threat from Covid-19 or personality traits.
The authors used the standardized version of the numeric predictor variables in the model. Briefly explain why the authors used the standardized values for the numeric predictors rather than the raw variables. Hint: Consider how using the standardized variables can make it easier to draw conclusions from the models.
Fit the baseline model that includes the variables politics, gender, age, household, duration, mobility, public_transport, europe, and date_diff_z.
- Describe the subset of people included in the intercept.
- Interpret the coefficient household in the context of the data. Note: The standard deviation for household is 1.57 \(\approx 2\).
Does the perceived threat of Covid-19 help explain the variation in the toilet paper consumption after adjusting for the baseline variables? Show the code and output used to derive your answer.
Does the effect of Covid-19 differ based on gender? Let’s a Nested F test to answer this question.
- State the null and alternative hypotheses in words in the context of the data.
- Show the code and output from the test.
- State the conclusion in the context of the data.

Use the best fit model based on the results of the Nested F test for the next exercise.

Calculate the model fit statistics - \(Adj. R^2\), \(AIC\), and \(BIC\) - for the model selected in the previous exercise.
Now let’s consider the personality traits conscientiousness and openness to experiences, the two traits identified by the authors. Fit a model that includes all of the variables in the model chosen in Exercise 5 along with conscientiousness and openness to experiences.

In addition to the model output, display the model fit statistics - \(Adj. R^2\), \(AIC\), and \(BIC\).

Now let’s compare the model chosen in Exercise 5 to the model fit in Exercise 7.
- Based on Adjusted \(R^2\), which model better fits the data? Briefly explain.
- Based on \(AIC\), which model better fits the data? Briefly explain.
- Based on \(BIC\), which model better fits the data? Briefly explain.
Briefly explain why we didn’t use \(R^2\) to choose the model of best fit in the previous question.
Select a final model and briefly explain why you selected that model. Use the model to describe the factors that are associated with an increase in the number of toilet paper rolls.

Lab 05: Stockpiling toilet paper

due Wed, Sep 30 at 11:59p

Getting Started

Password caching

Packages

Data

Exercises

Submission