library(tidyverse)
library(broom)
library(patchwork)
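# simulate 1000 draws from a standard normal distribution
x <- tibble(val = rnorm(1000, mean = 0, sd = 1))
# histogram of the simulated values
p_hist <- ggplot(data = x, aes(x = val)) +
  geom_histogram(bins = 30)
# kernel density estimate of the same values
p_density <- ggplot(data = x, aes(x = val)) +
  geom_density()
# patchwork places the two plots side by side
p_hist + p_density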
The density plot is derived using kernel density estimation, a nonparametric approach to estimating the probability density function of a random variable. You can read more about the density function in R in its help page (?density).
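To make the idea of kernel density estimation concrete, here is a minimal sketch that evaluates the estimator directly and compares it with the output of density(). It assumes the defaults of density() (a Gaussian kernel and a rule-of-thumb bandwidth). The names val, kde, h, and manual are made up for this illustration.

```r
# Simulated standard normal draws, like those plotted above
set.seed(1)
val <- rnorm(1000, mean = 0, sd = 1)

# density() uses a Gaussian kernel and a rule-of-thumb bandwidth by default
kde <- density(val)

# Direct evaluation of the kernel density estimator on the same grid:
# f_hat(g) = (1 / (n * h)) * sum_i K((g - x_i) / h), with K the standard normal pdf
h <- kde$bw
manual <- sapply(kde$x, function(g) mean(dnorm((g - val) / h)) / h)

# Largest gap between the direct calculation and density()'s binned approximation
max(abs(manual - kde$y))
```

The two curves should agree closely. density() uses a fast binned approximation, so small numerical differences are expected.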
In the 1970s, Harris Trust and Savings Bank was sued for discrimination on the basis of sex. The report from the Department of Labor states, “Prior to filing this case, Treasury retained two statistical experts, Drs. Shafie and Cabral, ‘To explore the feasibility of using to determine the existence of an affected class of employees in the workforce of Treasury contractors.’” (Dept of Labor vs. Harris Trust and Savings).
Each side presented a statistical analysis to examine whether there was sufficient evidence that female employees received lower starting salaries on average than male peers with similar qualifications.
We will take a look at some of the data used for the analyses. The data set contains information on 93 employees from a single job category (skilled, entry-level, clerical) who were hired between 1965 and 1975.
wages <- read_csv("data/wages.csv")
The variables in the data are
Educ: years of education
Exper: months of experience prior to working at the bank
Sex: sex of employee
Senior: months work
Age: age in months
Sal77: salary as of March 1977
Bsal: annual salary at time of hire
Today we will focus on the relationship between the following variables: Bsal, Educ, Exper, Senior, Age
We would like to use the employees’ age and previous experience to explain variation in their starting salaries. Why do we want to do this by fitting a multiple linear regression model instead of a simple linear regression model for each predictor?
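One way to think about this question is through a small simulation. The sketch below uses simulated data, not the bank data: the variables age, exper, and bsal are hypothetical, and every number was chosen arbitrarily. Age and experience are built to be correlated, and only experience affects the simulated salary.

```r
# Simulated data for illustration only; none of these values come from the wages data
set.seed(210)
n <- 500
age   <- rnorm(n, mean = 480, sd = 120)            # age in months
exper <- 0.5 * age - 140 + rnorm(n, sd = 40)       # experience is correlated with age
bsal  <- 5000 + 2 * exper + rnorm(n, sd = 200)     # salary depends on experience only
sim <- data.frame(age, exper, bsal)

# Simple regression: the age slope also absorbs the effect of experience
simple_fit <- lm(bsal ~ age, data = sim)

# Multiple regression: the age slope is adjusted for experience
multiple_fit <- lm(bsal ~ age + exper, data = sim)

coef(simple_fit)["age"]
coef(multiple_fit)["age"]
```

Comparing the two age coefficients shows how a simple regression can attribute to one predictor an effect that belongs to a correlated predictor.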
Why might we want to start with fitting a multiple linear regression model that doesn’t include Sex?
Fit the linear regression model and output the results. Include conf.int = TRUE in the tidy function to display the confidence interval for each coefficient.
## fit linear model
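As a reminder of the general pattern, here is a minimal sketch of fitting a model and tidying it with confidence intervals. It uses the built-in mtcars data purely as a stand-in, so the variables mpg, wt, and hp have nothing to do with the wages data.

```r
library(broom)

# mtcars is a built-in data set used here only as a stand-in for the wages data
fit <- lm(mpg ~ wt + hp, data = mtcars)

# one row per coefficient; conf.int = TRUE adds conf.low and conf.high (95% by default)
results <- tidy(fit, conf.int = TRUE)
results
```

The same two steps, lm followed by tidy, apply to the wages data once you choose the response and predictors.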
Interpret the coefficient of Age in the context of the data.
Interpret the 95% confidence interval for Exper in the context of the data.
Does Exper help explain some of the variability in the starting salary? Briefly explain why or why not.
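One common way to approach a question like this is to check whether the 95% confidence interval for a coefficient lies entirely on one side of zero, which corresponds to a p-value below 0.05. The sketch below again uses the built-in mtcars data as a stand-in. The column name excludes_zero is made up for this illustration.

```r
library(broom)

# stand-in model on built-in data; not the wages model
fit <- lm(mpg ~ wt + drat, data = mtcars)
ci_check <- tidy(fit, conf.int = TRUE)

# TRUE when the whole 95% interval lies on one side of zero
ci_check$excludes_zero <- ci_check$conf.low > 0 | ci_check$conf.high < 0

ci_check[, c("term", "estimate", "conf.low", "conf.high", "p.value", "excludes_zero")]
```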
Refit the model including Sex as a predictor variable.
## fit linear model including Sex as a predictor
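When a categorical variable such as Sex is added, lm represents it with an indicator variable relative to a baseline level. The sketch below shows this on simulated data. The data frame sim, its columns, and every number in it are arbitrary choices for illustration and are not estimates from the bank data.

```r
library(broom)

# Simulated data for illustration only
set.seed(1202)
n <- 200
sim <- data.frame(
  exper = runif(n, min = 0, max = 300),
  sex   = sample(c("Female", "Male"), n, replace = TRUE)
)
sim$bsal <- 5000 + 1.5 * sim$exper + 700 * (sim$sex == "Male") + rnorm(n, sd = 400)

# A character predictor is converted to a factor; levels are sorted alphabetically,
# so "Female" is the baseline and the coefficient is reported as sexMale
fit_sex <- lm(bsal ~ exper + sex, data = sim)
sex_results <- tidy(fit_sex, conf.int = TRUE)
sex_results
```

The sexMale row estimates the difference in mean simulated salary between the two groups, holding exper constant.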
The data used in this exercise is originally from the Sleuth3 R package.