The General Social Survey (GSS) has been used to measure trends in attitudes and behaviors in American society since 1972. In addition to collecting demographic information, the survey includes questions used to gauge attitudes about government spending priorities, confidence in institutions, lifestyle, and many other topics. A full description of the survey may be found here.
In today’s lab, we will use logistic regression to understand the relationship between a person’s political views and their attitudes towards government spending on mass transportation projects. To do so, we will use data from the 2016 GSS.
Go to the sta210-fa20 organization on GitHub (https://github.com/sta210-fa20). Click on the repo with the prefix lab-06-. It contains the starter documents you need to complete the lab.
Clone the repo and create a new project in your RStudio Docker Container.
Configure git by typing the following in the console.
If you would like your git password cached for a week for this project, type the following in the Terminal in RStudio:
You will need to enter your GitHub username and password one more time after caching the password. After that you won’t need to enter your credentials for 604800 seconds = 7 days.
We will use the following packages in today’s lab. Feel free to add any other packages as needed.
The data for this lab are from the 2016 General Social Survey. The original data set contains 2867 observations and 935 variables. Given the size of the dataset, we’ll load it into the R Markdown file from the internet rather than keeping it in the data folder of our RStudio project.
We will use the following variables in the lab:
natmass: Respondent’s answer to the following prompt: “We are faced with many problems in this country, none of which can be solved easily or inexpensively. I’m going to name some of these problems, and for each one I’d like you to tell me whether you think we’re spending too much money on it, too little money, or about the right amount…are we spending too much, too little, or about the right amount on mass transportation?”
age: Age in years.
sex: Sex recorded as male or female.
sei10: Socioeconomic index from 0 to 100.
region: Region where interview took place.
polviews: Respondent’s answer to the following prompt: “We hear a lot of talk these days about liberals and conservatives. I’m going to show you a seven-point scale on which the political views that people might hold are arranged from extremely liberal - point 1 - to extremely conservative - point 7. Where would you place yourself on this scale?”
Use the code below to read in the data.
gss <- read_csv("https://sta210-fa20.netlify.app/data/gss2016.csv",
                na = c("", "Don't know", "No answer",
                       "Not applicable"),
                guess_max = 2867) %>%
  select(natmass, age, sex, sei10, region, polviews) %>%
  drop_na()
The argument guess_max = 2867 tells the read_csv function to use all of the observations in a column to determine its data type. Without this argument, only the first 1,000 observations would be used to make this determination. This becomes important for a variable like age; though age is coded as numeric data for most of the observations, there are some in which age is coded as "89 or older". Without the guess_max argument, you will get warnings when loading the data.
Note also that only the variables of interest are kept once the data are read in, not the entire dataset. This will make for faster computation and knitting as you work on the lab.
Show all relevant code and output to support your responses even if you use inline code to write your narrative.
The goal of the analysis is to understand the factors that are associated with a person being satisfied with the current spending on mass transportation. Create a new variable that is equal to “1” if a person said spending on mass transportation is about right and “0” otherwise.
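As a rough illustration of this kind of recoding, the sketch below builds a 0/1 indicator on a small toy data frame. The column name satisfied is hypothetical, and the label "About right" is an assumed spelling, so check the categories in the real data (for example with count()) before reusing the idea.

```r
library(dplyr)

# Toy responses; the category labels are assumptions for illustration only
toy <- tibble(natmass = c("About right", "Too little", "Too much", "About right"))

# satisfied (hypothetical name) is 1 for "About right" and 0 otherwise
toy <- toy %>%
  mutate(satisfied = if_else(natmass == "About right", 1, 0))

# Cross-check the new variable against the original categories
toy %>% count(natmass, satisfied)
```

Counting the original and new variables together is a quick way to confirm that every category was mapped as intended.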
Recode polviews so it is a factor variable type with levels that are in an order that is consistent with the question on the survey. Note how the categories are spelled in the data.
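One way to impose an order on a categorical variable is to pass the levels explicitly to factor(). The sketch below uses made-up labels in a hypothetical views column, deliberately not the GSS spellings; the strings in levels must match the data exactly.

```r
library(dplyr)

# Toy labels, not the GSS categories; substitute the exact strings
# that appear in the real column
toy <- tibble(views = c("Moderate", "Left", "Right", "Left"))

# Levels are listed in the order they should appear in plots and models
toy <- toy %>%
  mutate(views = factor(views, levels = c("Left", "Moderate", "Right")))

levels(toy$views)
```

Any value that is not listed in levels becomes NA, so a sudden increase in missing values usually points to a spelling mismatch.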
Make a plot of the distribution of polviews. Which political view occurs most frequently in this data set?
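A bar chart is a common choice for the distribution of a categorical variable. The following is a minimal sketch on toy data with a hypothetical views column, assuming ggplot2 is available.

```r
library(ggplot2)

# Toy ordered categories (hypothetical labels)
toy <- data.frame(
  views = factor(c("Left", "Moderate", "Moderate", "Right"),
                 levels = c("Left", "Moderate", "Right"))
)

# One bar per category; flipping the axes keeps long labels readable
p <- ggplot(toy, aes(x = views)) +
  geom_bar() +
  labs(x = "Views (toy categories)", y = "Count",
       title = "Distribution of a categorical variable") +
  coord_flip()
p

# The most frequent category can also be read from a count table
most_common <- names(which.max(table(toy$views)))
most_common
```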
Make a plot displaying the relationship between satisfaction with mass transportation spending and political views. Use the plot to describe the relationship between a person’s political views and whether they are satisfied with spending on mass transportation.
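For a binary response against a categorical predictor, bars filled to the same height make the proportions easy to compare across groups. This is only a sketch on toy data; the column names views and satisfied are hypothetical.

```r
library(ggplot2)

toy <- data.frame(
  views = factor(c("Left", "Left", "Moderate", "Moderate", "Right", "Right"),
                 levels = c("Left", "Moderate", "Right")),
  satisfied = factor(c(1, 0, 1, 1, 0, 0))
)

# position = "fill" rescales each bar to 1, so the fill shows proportions
p <- ggplot(toy, aes(x = views, fill = satisfied)) +
  geom_bar(position = "fill") +
  labs(x = "Views (toy categories)", y = "Proportion", fill = "Satisfied")
p

# Row proportions behind the plot
props <- prop.table(table(toy$views, toy$satisfied), margin = 1)
props
```

Because every bar has the same height, differences in group size are hidden; the count table is a useful companion when some groups are small.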
We’d like to use age as a quantitative variable in the model; however, it is currently a character data type because some observations are coded as "89 or older". Recode age so that it is a numeric variable. Note: Before making the variable numeric, you will need to replace the values "89 or older" with a single value.
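The two-step pattern described above (replace the text value, then convert) might look like the sketch below on toy data. Using 89 as the replacement is an assumption made for illustration; choose and justify your own value.

```r
library(dplyr)

toy <- tibble(age = c("34", "89 or older", "61"))

toy <- toy %>%
  mutate(
    # Step 1: replace the text category with a single value (89 is assumed here)
    age = if_else(age == "89 or older", "89", age),
    # Step 2: convert the character column to numeric
    age = as.numeric(age)
  )

toy$age
```

Doing the replacement first matters: calling as.numeric() directly on "89 or older" would produce NA with a coercion warning.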
Briefly explain why we should use a logistic regression model to predict the odds a randomly selected person is satisfied with spending on mass transportation.
Let’s start by fitting a model using the demographic factors - age, sex, sei10, and region - to predict the odds a person is satisfied with spending on mass transportation. Make any necessary adjustments to the variables so the intercept will have a meaningful interpretation.
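One common adjustment is to mean-center quantitative predictors, so the intercept describes a person with average values rather than a value of 0. The sketch below shows this on simulated data with base R only; the variable names and the simulated effect sizes are hypothetical and unrelated to the GSS.

```r
set.seed(210)
n <- 500

# Simulated predictors (not GSS data)
toy <- data.frame(
  age = round(runif(n, 18, 89)),
  sex = factor(sample(c("Female", "Male"), n, replace = TRUE))
)

# Simulated binary response whose log-odds rise with age
toy$satisfied <- rbinom(n, 1, plogis(-0.5 + 0.02 * (toy$age - 50)))

# Mean-center age so the intercept refers to a person of average age
toy$age_cent <- toy$age - mean(toy$age)

# Logistic regression: binomial family with the default logit link
fit <- glm(satisfied ~ age_cent + sex, data = toy, family = binomial)
coef(fit)
```

In this toy fit the intercept is the log-odds for the baseline level of sex (the first factor level, "Female") at the average age.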
Interpret the intercept in the context of the data. Include any relevant values in your response.
Consider the relationship between age and one’s opinion about spending on mass transportation. Interpret the coefficient of age in terms of the odds of being satisfied with spending on mass transportation.
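Coefficients from a logistic regression are on the log-odds scale, so exponentiating them gives multiplicative changes in the odds. The sketch below illustrates this on simulated data; the names and numbers are hypothetical.

```r
set.seed(210)
n <- 500

# Simulated, mean-centered predictor and binary response (not GSS data)
age_cent <- runif(n, 18, 89)
age_cent <- age_cent - mean(age_cent)
y <- rbinom(n, 1, plogis(-0.5 + 0.02 * age_cent))

fit <- glm(y ~ age_cent, family = binomial)
b_age <- coef(fit)["age_cent"]

# Factor by which the odds are multiplied for each additional year
or_1yr <- exp(b_age)

# Factor by which the odds are multiplied for a 10-year difference
or_10yr <- exp(10 * b_age)

c(one_year = unname(or_1yr), ten_years = unname(or_10yr))
```

The 10-year factor equals the one-year factor raised to the 10th power, which is why effects that look small per year can be sizable over a decade.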
Now let’s see whether a person’s political views have a significant impact on their odds of being satisfied with spending on mass transportation, after accounting for the demographic factors.
Conduct a drop-in-deviance test to determine if polviews is a significant predictor of attitude towards spending on mass transportation. State the null and alternative hypotheses in words, display all relevant code and output, and state your conclusion in the context of the problem.
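A drop-in-deviance test compares two nested models: a reduced model without the term of interest and a full model with it. The sketch below runs the test on simulated data with anova() and then reproduces the statistic by hand; the variables x, g, and y are hypothetical.

```r
set.seed(210)
n <- 600

# Simulated data: one quantitative and one 3-level categorical predictor
toy <- data.frame(
  x = rnorm(n),
  g = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
toy$y <- rbinom(n, 1, plogis(0.3 * toy$x + 0.8 * (toy$g == "c")))

reduced <- glm(y ~ x, data = toy, family = binomial)
full <- glm(y ~ x + g, data = toy, family = binomial)

# anova() on nested glm fits reports the drop in deviance;
# test = "Chisq" adds the chi-square p-value
dd_test <- anova(reduced, full, test = "Chisq")
dd_test

# The same statistic computed by hand
G <- deviance(reduced) - deviance(full)
df_diff <- df.residual(reduced) - df.residual(full)
p_value <- pchisq(G, df = df_diff, lower.tail = FALSE)
```

The degrees of freedom equal the number of extra coefficients in the full model, here 2 for a three-level factor.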
Use the model to describe the relationship between one’s political views and their attitude towards spending on mass transportation. Be sure your answer includes the interpretation of the model coefficients and associated hypothesis tests or confidence intervals used to support your response.
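Confidence intervals for odds ratios can be obtained by computing intervals on the log-odds scale and exponentiating the endpoints. The sketch below uses Wald intervals from confint.default() on simulated data; this is a simple choice for illustration (confint() on a glm gives profile-likelihood intervals instead), and the variable names are hypothetical.

```r
set.seed(210)
n <- 600

# Simulated data (not GSS data)
toy <- data.frame(
  x = rnorm(n),
  g = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
toy$y <- rbinom(n, 1, plogis(0.3 * toy$x + 0.8 * (toy$g == "c")))

fit <- glm(y ~ x + g, data = toy, family = binomial)

# Wald 95% intervals on the log-odds scale
ci_log_odds <- confint.default(fit, level = 0.95)

# Exponentiate estimates and endpoints to get odds ratios
or_table <- exp(cbind(odds_ratio = coef(fit), ci_log_odds))
or_table
```

An interval for an odds ratio that excludes 1 corresponds to a coefficient that differs significantly from 0 at the matching level.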
Upload the team’s PDF to Gradescope. Associate the “Overall” graded section with the first page of your PDF, and mark where the answer to each exercise is located. If any answer spans multiple pages, mark all of those pages.
There should only be one submission per team on Gradescope. Be sure to include every team member’s name in the Gradescope submission.