Questions from Multiple linear regression videos?

Histogram vs. density plot

library(tidyverse)
library(broom)
library(patchwork)
x <- tibble(val = rnorm(1000, 0, 1))
hist <- ggplot(data = x, aes(x = val)) +
  geom_histogram()
density <- ggplot(data = x, aes(x = val)) +
  geom_density()
hist + density

A density plot is derived using kernel density estimation, i.e., a nonparametric approach to estimating the probability density function of a random variable. You can read more about the density function in R here.
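To make the idea of kernel density estimation concrete, here is a minimal sketch in base R. It assumes a Gaussian kernel and the default bandwidth chosen by density(); the names val, x0, manual, and from_density are made up for this illustration. The estimate at a point x0 is the average of kernel "bumps" centered at each observation, and the resulting curve should have an area close to 1.

```r
# Simulate data similar to the example above (seed added for reproducibility)
set.seed(1)
val <- rnorm(1000, mean = 0, sd = 1)

# Kernel density estimate with a Gaussian kernel and the default bandwidth
kde <- density(val, kernel = "gaussian")
h <- kde$bw  # bandwidth selected by density()

# Manual estimate at a single point x0:
# f_hat(x0) = (1 / (n * h)) * sum of K((x0 - x_i) / h), with K the standard normal density
x0 <- 0
manual <- mean(dnorm((x0 - val) / h)) / h

# Value of the density() curve at x0, interpolated from its evaluation grid
from_density <- approx(kde$x, kde$y, xout = x0)$y

# Approximate area under the estimated curve using the trapezoid rule
area <- sum(diff(kde$x) * (head(kde$y, -1) + tail(kde$y, -1)) / 2)

c(manual = manual, from_density = from_density, area = area)
```

The two estimates at x0 should agree closely (density() uses a fast binned approximation, so small differences are expected). Unlike a histogram, the shape of the curve depends on the bandwidth rather than on the number of bins.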

Starting salaries at Harris Trust and Savings Bank

Example: Starting Wages

In the 1970s, Harris Trust and Savings Bank was sued for discrimination on the basis of sex. The report from the Department of Labor states, “Prior to filing this case, Treasury retained two statistical experts, Drs. Shafie and Cabral, ‘To explore the feasibility of using to determine the existence of an affected class of employees in the workforce of Treasury contractors.’” (Dept of Labor vs. Harris Trust and Savings).

Each side presented a statistical analysis to examine whether there was sufficient evidence that female employees received lower starting salaries, on average, than male peers with similar qualifications.

We will take a look at some of the data used for the analyses. The data set contains information on 93 employees from a single job category (skilled, entry-level, clerical) who were hired between 1965 and 1975.

wages <- read_csv("data/wages.csv")

The variables in the data are

  • Educ: years of education
  • Exper: months of experience prior to working at the bank
  • Sex: sex of employee
  • Senior: months worked at the bank
  • Age: age in months
  • Sal77: salary as of March 1977
  • Bsal: annual salary at time of hire

Today we will focus on the relationship between the following variables:

  • Response: Bsal
  • Predictors: Educ, Exper, Senior, Age

Why multiple linear regression?

  • We would like to use the employees’ age and previous experience to explain variation in their starting salaries. Why do we want to do this by fitting a multiple linear regression model instead of a simple linear regression model for each predictor?

  • Why might we want to start with fitting a multiple linear regression model that doesn’t include Sex?
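One way to explore the first question is with simulated data. The sketch below does not use the bank data: the variables age, exper, and salary and all of the numbers are invented for illustration. Salary is generated so that it depends on experience only, while experience increases with age.

```r
# Simulated (hypothetical) data, NOT the Harris Trust data
set.seed(42)
n <- 500
age    <- runif(n, 280, 770)                     # age in months
exper  <- 0.5 * (age - 280) + rnorm(n, 0, 30)    # experience rises with age
salary <- 5000 + 2 * exper + rnorm(n, 0, 100)    # salary depends on exper only

# Simple linear regression: age stands in for the omitted experience variable
slr_age <- coef(lm(salary ~ age))["age"]

# Multiple linear regression: slope of age after adjusting for experience
mlr_age <- coef(lm(salary ~ age + exper))["age"]

c(slr = unname(slr_age), mlr = unname(mlr_age))
```

In this simulation the slope of age is clearly positive in the simple model but close to zero once exper is in the model. Separate simple regressions cannot distinguish the effect of one predictor from the effects of predictors that are correlated with it.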

Fit the model

Fit the linear regression model and output the results. Include conf.int = TRUE in the tidy function to display the confidence interval for each coefficient.
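As a reminder of the syntax, here is a minimal sketch on simulated data. The data frame sim_wages and the coefficients used to generate it are hypothetical and are not the bank data; only the column names mirror the variables listed above. Use the same pattern with the wages data to complete the exercise.

```r
library(broom)

# Simulated (hypothetical) data with the same column names as the wages data
set.seed(123)
n <- 93
sim_wages <- data.frame(
  Educ   = sample(8:16, n, replace = TRUE),  # years of education
  Exper  = runif(n, 0, 380),                 # months of prior experience
  Senior = runif(n, 65, 98),                 # months worked at the bank
  Age    = runif(n, 280, 770)                # age in months
)
# Arbitrary coefficients chosen only to generate a response
sim_wages$Bsal <- 5000 + 90 * sim_wages$Educ + 1 * sim_wages$Exper -
  20 * sim_wages$Senior + rnorm(n, 0, 500)

# Fit the multiple linear regression model
sim_fit <- lm(Bsal ~ Educ + Exper + Senior + Age, data = sim_wages)

# One row per coefficient; conf.int = TRUE adds conf.low and conf.high (95% by default)
sim_results <- tidy(sim_fit, conf.int = TRUE)
sim_results
```

The conf.level argument of tidy() changes the confidence level, e.g. tidy(sim_fit, conf.int = TRUE, conf.level = 0.90).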

## fit linear model 
  1. Interpret the coefficient of Age in the context of the data.

  2. Interpret the 95% confidence interval for Exper in the context of the data.

  3. Does Exper help explain some of the variability in the starting salary? Briefly explain why or why not.

  4. Refit the model including Sex as a predictor variable.

## fit linear model including Sex as a predictor
  5. The court ruled that there was evidence that female employees received lower salaries, on average, than their male peers with similar qualifications. Does your model support this ruling? Why or why not?

The data used in this exercise is originally from the Sleuth3 R package.