Lab 06: Logistic regression

due Wed, Oct 21 at 11:59p

The General Social Survey (GSS) has been used to measure trends in attitudes and behaviors in American society since 1972. In addition to collecting demographic information, the survey includes questions used to gauge attitudes about government spending priorities, confidence in institutions, lifestyle, and many other topics. A full description of the survey may be found here.

In today’s lab, we will use logistic regression to understand the relationship between a person’s political views and their attitudes towards government spending on mass transportation projects. To do so, we will use data from the 2016 GSS.

Getting Started

library(usethis)
use_git_config(user.name="github username", user.email="your email")

Password caching

If you would like your git password cached for a week for this project, type the following in the Terminal in RStudio:

git config --global credential.helper 'cache --timeout 604800'

You will need to enter your GitHub username and password one more time after caching the password. After that you won’t need to enter your credentials for 604800 seconds = 7 days.

Packages

We will use the following packages in today’s lab. Feel free to add any other packages as needed.

library(tidyverse)
library(broom)
library(knitr)
# fill in other packages as needed

Data

The data for this lab are from the 2016 General Social Survey. The original data set contains 2867 observations and 935 variables. Given the size of the dataset, we’ll load it into the R Markdown file from the internet rather than keeping it in the data folder of our RStudio project.

We will use the following variables in the lab: natmass, age, sex, sei10, region, and polviews.

Use the code below to read in the data.

gss <- read_csv(
  "https://sta210-fa20.netlify.app/data/gss2016.csv",
  na = c("", "Don't know", "No answer", "Not applicable"),
  guess_max = 2867
) %>%
  select(natmass, age, sex, sei10, region, polviews) %>%
  drop_na()

The argument guess_max = 2867 tells the read_csv function to use all of the observations in a column to determine its data type. Without this argument, only the first 1,000 observations would be used to make this determination. This becomes important for a variable like age; though age is coded as numeric data for most of the observations, there are some in which age is coded as "89 or older". Without the guess_max argument, you will get warnings when loading the data.
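
To see why the number of rows inspected matters, here is a minimal sketch on a made-up six-row file (the file, the id column, and the values are hypothetical, not the GSS data). When read_csv is allowed to inspect every row, it finds the text value and stores the whole column as character.

```r
library(readr)

# Write a tiny made-up CSV to a temporary file: five numeric-looking ages
# followed by one text value (hypothetical data, not the GSS file)
path <- tempfile(fileext = ".csv")
writeLines(
  c("id,age", "1,23", "2,41", "3,35", "4,58", "5,67", "6,89 or older"),
  path
)

# Let read_csv() inspect all six rows before choosing each column type
toy <- read_csv(path, guess_max = 6)

# The age column is character because one of its values is not a number
class(toy$age)
```

The same logic applies to the lab data: setting guess_max to the number of rows means the "89 or older" values are seen before the type of age is chosen.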

Note also that the code keeps only the variables of interest rather than the entire dataset. This will make for faster computation and knitting as you work on the lab.

Exercises

Show all relevant code and output to support your responses even if you use inline code to write your narrative.

Part I: Exploratory data analysis

  1. The goal of the analysis is to understand the factors that are associated with a person being satisfied with the current spending on mass transportation. Create a new variable that is equal to “1” if a person said spending on mass transportation is about right and “0” otherwise.

  2. Recode polviews so it is a factor variable with levels in an order that is consistent with the question on the survey. Note how the categories are spelled in the data.

    Make a plot of the distribution of polviews. Which political view occurs most frequently in this data set?

  3. Make a plot displaying the relationship between satisfaction with mass transportation spending and political views. Use the plot to describe the relationship between a person’s political views and whether they are satisfied with spending on mass transportation.

  4. We’d like to use age as a quantitative variable in the model; however, it is currently a character data type because some observations are coded as "89 or older". Recode age so that it is a numeric variable. Note: Before making the variable numeric, you will need to replace the values "89 or older" with a single value.
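
The recoding steps in exercises 1, 2, and 4 follow a common pattern. Below is a minimal sketch on a made-up data frame. The column names opinion and views, the category spellings, and the choice of 89 as the replacement value are all assumptions for illustration; check the actual spellings in gss before adapting it.

```r
library(dplyr)

# Toy data; the column names and category spellings are made up
toy <- tibble(
  opinion = c("About right", "Too little", "Too much", "About right"),
  views   = c("Moderate", "Liberal", "Conservative", "Liberal"),
  age     = c("34", "89 or older", "52", "61")
)

toy <- toy %>%
  mutate(
    # Binary indicator: 1 for one response category, 0 for all others
    satisfied = if_else(opinion == "About right", 1, 0),
    # Factor with levels listed explicitly in the desired order
    views = factor(views, levels = c("Liberal", "Moderate", "Conservative")),
    # Replace the text value with a single number, then convert to numeric
    age = as.numeric(if_else(age == "89 or older", "89", age))
  )

# Inspect the resulting column types
glimpse(toy)
```

Listing the levels explicitly in factor() controls both the order in which categories appear on a plot axis and which category R treats as the baseline in a model.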

Part II: Logistic regression model

  1. Briefly explain why we should use a logistic regression model to predict the odds a randomly selected person is satisfied with spending on mass transportation.

  2. Let’s start by fitting a model using the demographic factors (age, sex, sei10, and region) to predict the odds a person is satisfied with spending on mass transportation. Make any necessary adjustments to the variables so the intercept will have a meaningful interpretation.

  3. Interpret the intercept in the context of the data. Include any relevant values in your response.

  4. Consider the relationship between age and one’s opinion about spending on mass transportation. Interpret the coefficient of age in terms of the odds of being satisfied with spending on mass transportation.

  5. Now let’s see whether a person’s political views have a significant impact on their odds of being satisfied with spending on mass transportation, after accounting for the demographic factors.

    Conduct a drop-in-deviance test to determine if polviews is a significant predictor of attitude towards spending on mass transportation. State the null and alternative hypotheses in words, display all relevant code and output, and state your conclusion in the context of the problem.

  6. Use the model to describe the relationship between one’s political views and their attitude towards spending on mass transportation. Be sure your answer includes the interpretation of the model coefficients and associated hypothesis tests or confidence intervals used to support your response.
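
The workflow in Part II (center a predictor, fit nested logistic models, compare them with a drop-in-deviance test) can be sketched on simulated data. This is an illustration under stated assumptions, not a solution: x, grp, and y are hypothetical stand-ins rather than GSS variables, and the coefficients used to simulate y are arbitrary.

```r
library(dplyr)
library(broom)

# Simulated data; x, grp, and y are hypothetical, not GSS variables
set.seed(210)
n <- 500
sim <- tibble(
  x   = rnorm(n, mean = 50, sd = 10),
  grp = factor(sample(c("a", "b", "c"), n, replace = TRUE))
) %>%
  mutate(
    # Binary response drawn with arbitrary true coefficients
    y = rbinom(n, size = 1,
               prob = plogis(-0.5 + 0.03 * (x - 50) + 0.8 * (grp == "b"))),
    # Mean-center x so the intercept describes an observation with average x
    x_cent = x - mean(x)
  )

# Nested models: the full model adds grp to the reduced model
reduced <- glm(y ~ x_cent, data = sim, family = binomial)
full    <- glm(y ~ x_cent + grp, data = sim, family = binomial)

# Exponentiated estimates: the intercept is the odds for the baseline group
# at average x, and each other term is an odds ratio
tidy(full, conf.int = TRUE, exponentiate = TRUE)

# Drop-in-deviance test: does adding grp reduce the deviance significantly?
anova(reduced, full, test = "Chisq")
```

In the anova() output, the Deviance column is the drop in residual deviance between the two models and Df is the number of added coefficients, here two because grp has three levels.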

Submission

Upload the team’s PDF to Gradescope. Associate the “Overall” graded section with the first page of your PDF, and mark where the answer to each exercise is located. If any answer spans multiple pages, mark all of those pages.

There should only be one submission per team on Gradescope. Be sure to include every team member’s name in the Gradescope submission.