Model building is an iterative process that often requires fitting, assessing, and tweaking multiple models. The goal of today’s lab is to use an iterative model building process along with various criteria to compare models to select a model that “best” fits the data.
Go to the sta210-fa20 organization on GitHub (https://github.com/sta210-fa20). Click on the repo with the prefix lab-05-. It contains the starter documents you need to complete the lab.
Clone the repo and create a new project in your RStudio Docker Container.
Configure git by typing the following in the console.
If you would like your git password cached for a week for this project, type the following in the Terminal in RStudio:
You will need to enter your GitHub username and password one more time after caching the password. After that you won’t need to enter your credentials for 604800 seconds = 7 days.
We will use the following packages in today’s lab. Feel free to add any other packages as needed.
The data from today’s lab come from the paper Influence of perceived threat of Covid-19 and HEXACO personality traits on toilet paper stockpiling by Garbe, Rau and Toppe. The objective their analysis was to examine the relationship between personality traits, perceived threat of Covid-19 and stockpiling toilet paper.
They conducted an online survey March 23 - 29, 2020 and used the results to fit multiple linear regression models to draw conclusions about their research questions. From their survey, they collected data on adults across 35 countries. Given the small number of responses from people outside of the United States, Canada and Europe, only responses from people in these three locations were included in the regression analysis.
They used the standardized version of each numerical variable in their model. Given a variable \(x\), the standardized \(x\) is calculated as \(\frac{x - \bar{x}}{s_x}\). In other words, we standardize a variable by shifting it by its mean and rescaling it by its standard deviation.
The dataset is in the file covid-tp.csv
.
The variables we’ll use in this analysis are
Top_Inhouse
: Number of toilet paper rolls in the householdpolitics
: Standardized politics score (larger values indicate more conservative/ to the right on politics)gender
: 1: Male, 2: Female, 3: Other gender identifiedage
: Standardized agehousehold
: Standardized number of people in the householdduration
: Standardized number of days in self-quarantine at the time of surveymobility
: 1: Local restrictions on leaving the house, 0: No restrictions on leaving the housepublic_transport
: Local restrictions on public transportationeurope
: 1: respondent lives in EU, 0: respondent lives in US/Canadadate_diff_z
: Standardized number of days between first reported case in locality and taking the surveyconsc
: Standardized measure of conscientiousness (higher score means stronger traits such as organization, diligence, perfectionism, prudence)open
: Standardized measure of openness to experience (higher score means stronger traits such as aesthetic appreciation, inquisitiveness, creativity, unconventionality)perceived_threat
: Standardized measure of perceived threat of Covid-19 (higher score means greater perceived threat)In this lab, we will use an iterative model building process to fit a model that can be used to predict Top_Inhouse
, the number of toilet paper rolls a respondent reported having in their home at the time they took the survey.
The started by fitting a baseline model with the variables politics
, gender
, age
, household
, duration
, mobility
, public_transport
, europe
, and date_diff_z
. Briefly explain why the authors started by fitting this baseline model before including the variables measuring perceived threat from Covid-19 or personality traits.
The authors used the standardized version of the numeric predictor variables in the model. Briefly explain why the authors used the standardized values for the numeric predictors rather than the raw variables. Hint: Consider how using the standardized variables can make it easier to draw conclusions from the models.
Fit the baseline model that includes the variables politics
, gender
, age
, household
, duration
, mobility
, public_transport
, europe
, and date_diff_z
.
household
in the context of the data. Note: The standard deviation for household is 1.57 \(\approx 2\).Does the perceived threat of Covid-19 help explain the variation in the toilet paper consumption after adjusting for the baseline variables? Show the code and output used to derive your answer.
Does the effect of Covid-19 differ based on gender? Let’s a Nested F test to answer this question.
Use the best fit model based on the results of the Nested F test for the next exercise.
Calculate the model fit statistics - \(Adj. R^2\), \(AIC\), and \(BIC\) - for the model selected in the previous exercise.
Now let’s consider the personality traits conscientiousness and openness to experiences, the two traits identified by the authors. Fit a model that includes all of the variables in the model chosen in Exercise 5 along with conscientiousness and openness to experiences.
In addition to the model output, display the model fit statistics - \(Adj. R^2\), \(AIC\), and \(BIC\).
Now let’s compare the model chosen in Exercise 5 to the model fit in Exercise 7.
Briefly explain why we didn’t use \(R^2\) to choose the model of best fit in the previous question.
Select a final model and briefly explain why you selected that model. Use the model to describe the factors that are associated with an increase in the number of toilet paper rolls.
Upload the team’s PDF to Gradescope. Be sure to include every team member’s name in the Gradescope submission Associate the “Overall” graded section with the first page of your PDF, and mark where each answer is to the exercises. If any answer spans multiple pages, then mark all pages.
There should only be one submission per team on Gradescope. Be sure to include every team member’s name in the Gradescope submission.
Click here to access the paper, copy of the survey, original data, and other supplemental materials.