The following are examples of scenarios with Poisson response variables:

- Is the number of motorcycle deaths in a given year related to a state's helmet laws?
- Does the number of employers conducting on-campus interviews during a year differ for public and private colleges?
- Does the daily number of asthma-related visits to an Emergency Room differ depending on air pollution indices?
- Has the number of deformed fish in randomly selected Minnesota lakes been affected by changes in trace minerals in the water over the last decade?
If $Y$ follows a Poisson distribution, then

$$P(Y = y) = \frac{e^{-\lambda}\lambda^y}{y!}, \quad y = 0, 1, 2, \ldots$$
Features of the Poisson distribution
| | Mean | Variance |
|---|---|---|
| lambda=2 | 2.00740 | 2.015245 |
| lambda=5 | 4.99130 | 4.968734 |
| lambda=20 | 19.99546 | 19.836958 |
| lambda=100 | 100.02276 | 100.527647 |
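A quick simulation illustrates the mean–variance match. This is a sketch using base R's `rpois`, not the code that generated the table above:

```r
# Simulate Poisson draws and compare the sample mean to the sample variance.
# For a Poisson(lambda) distribution, both should be close to lambda.
set.seed(2015)  # arbitrary seed, chosen only for reproducibility
draws <- rpois(100000, lambda = 5)
mean(draws)  # close to 5
var(draws)   # also close to 5
```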
Why is a multiple linear regression model not appropriate?
If the observed values $Y_i$ are Poisson, then we can model them using a Poisson regression model of the form

$$\log(\lambda_i) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi}$$
Slope, $\beta_j$:

- Quantitative predictor: when $x_j$ increases by one unit, the mean of $y$ is expected to multiply by a factor of $\exp\{\beta_j\}$ (holding all else constant).
- Categorical predictor: the mean of $y$ for category $k$ is expected to be $\exp\{\beta_j\}$ times the mean of $y$ for the baseline category (holding all else constant).
The data come from the 2015 Family Income and Expenditure Survey conducted by the Philippine Statistics Authority.
Goal: We want to use the data to understand the relationship between the age of the head of the household and the number of people in their household.
Variables

- `age`: the age of the head of household
- `total`: the number of people in the household other than the head

Let's examine a plot of the log-transformed mean number of people in the household by age.
```r
model1 <- glm(total ~ ageCent, data = hh, family = "poisson")
tidy(model1, conf.int = TRUE) %>%
  kable(digits = 3)
```
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 1.303 | 0.013 | 96.647 | 0 | 1.276 | 1.329 |
ageCent | -0.005 | 0.001 | -4.832 | 0 | -0.006 | -0.003 |
$$\log(\overline{total}) = 1.303 - 0.0047 \times ageCent$$
For each additional year older the head of the household is, we expect the mean number in the house to multiply by a factor of 0.995 (exp(-0.0047)).
For households whose head is 52.657 years old (the mean age, i.e., ageCent = 0), we expect the mean number of people in the household to be 3.68 (exp(1.303)).
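Both numbers come from exponentiating the fitted coefficients reported above; a quick check in R:

```r
# Back-transform the coefficients from the log scale
# (coefficient values taken from the model output above)
exp(1.303)    # expected mean household size at the mean age: about 3.68
exp(-0.0047)  # multiplicative change in the mean per additional year: about 0.995
```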
We can use a drop-in-deviance test to compare nested models (similar to logistic regression).
Let's try adding `ageCent^2` to the model.

$$H_0: \beta_{ageCent^2} = 0 \quad \text{vs.} \quad H_a: \beta_{ageCent^2} \neq 0$$
```r
model1 <- glm(total ~ ageCent, data = hh, family = "poisson")
model2 <- glm(total ~ ageCent + I(ageCent^2), data = hh, family = "poisson")
anova(model1, model2, test = "Chisq")
```
Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
---|---|---|---|---|
1496 | 2330.729 | NA | NA | NA |
1495 | 2198.533 | 1 | 132.197 | 0 |
The p-value is small, so we reject $H_0$. We will include `ageCent^2` in the model.
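The test statistic in the ANOVA table is just the drop in deviance between the two models, compared to a chi-square distribution with 1 degree of freedom (one parameter added). Verifying by hand, using the residual deviances from the table above:

```r
# Drop-in-deviance statistic: difference of the two residual deviances
dev_drop <- 2330.729 - 2198.533  # 132.196, matching the table up to rounding

# p-value from a chi-square distribution with 1 df
p_value <- pchisq(dev_drop, df = 1, lower.tail = FALSE)
p_value  # essentially 0, as reported
```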
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 1.436 | 0.017 | 82.339 | 0 | 1.402 | 1.470 |
ageCent | -0.004 | 0.001 | -3.584 | 0 | -0.006 | -0.002 |
I(ageCent^2) | -0.001 | 0.000 | -10.938 | 0 | -0.001 | -0.001 |
Poisson Response: The response follows a Poisson distribution for each level of the predictor.
Independence: The observations are independent of one another.
Mean = variance: The mean value of the response equals the variance of the response for each level of the predictor.
Linearity: log(λ) is a linear function of the predictors.
Let's check the first assumption by looking at the distribution of the response for groups of the predictor.
```r
hh <- hh %>%
  mutate(age_group = cut(age, breaks = seq(15, 100, 5)))
```
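As a small illustration of how `cut()` bins the ages into 5-year groups (toy ages, not the survey data):

```r
# cut() with breaks every 5 years produces right-closed intervals
# like (15,20], (20,25], ...
ages <- c(18, 33, 47, 62)
cut(ages, breaks = seq(15, 100, 5))
```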
This condition is satisfied based on the overall distribution of the response (from the EDA) and the distribution of the response by age group.
We don't have much information about how the households were selected for the survey.
If the households were not selected randomly but rather groups of households were selected from different areas with different customs about living arrangements, then the independence assumption would be violated.
Let's look at the mean and variance for each age group.
```
## # A tibble: 10 x 4
##    age_group mean_total var_total     n
##    <fct>          <dbl>     <dbl> <int>
##  1 (15,20]         1.67     0.667     6
##  2 (20,25]         2.17     1.56     18
##  3 (25,30]         2.92     1.41     49
##  4 (30,35]         3.44     2.19    108
##  5 (35,40]         3.84     3.57    158
##  6 (40,45]         4.23     4.44    175
##  7 (45,50]         4.49     6.40    194
##  8 (50,55]         4.01     5.25    188
##  9 (55,60]         3.81     6.53    145
## 10 (60,65]         3.71     6.20    153
```

```
## # A tibble: 6 x 4
##   age_group mean_total var_total     n
##   <fct>          <dbl>     <dbl> <int>
## 1 (65,70]         3.34      8.00   115
## 2 (70,75]         2.74      6.75    91
## 3 (75,80]         2.53      4.97    57
## 4 (80,85]         2.23      3.15    30
## 5 (85,90]         2.56      7.03     9
## 6 (90,95]         1         2        2
```
It appears the assumption is violated in some age groups; however, the violations are small enough that we can proceed.
The raw residual for the $i^{th}$ observation, $y_i - \hat{\lambda}_i$, is difficult to interpret, since the variance is equal to the mean in the Poisson distribution.

Instead, we can analyze a standardized residual called the Pearson residual:

$$r_i = \frac{y_i - \hat{\lambda}_i}{\sqrt{\hat{\lambda}_i}}$$
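A minimal sketch of the Pearson residual computation, using toy values rather than the survey data:

```r
# Pearson residual: raw residual scaled by the estimated standard deviation,
# which for a Poisson response is sqrt(lambda_hat)
y <- c(3, 5, 2)                 # toy observed counts
lambda_hat <- c(3.5, 4.2, 2.8)  # toy fitted means
pearson_res <- (y - lambda_hat) / sqrt(lambda_hat)
pearson_res
```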
We will examine a plot of the Pearson residuals versus the predicted values to check the linearity assumption.
We can obtain the predicted values and Pearson residuals using the `augment` function:

```r
hh_aug <- augment(model2, type.predict = "response", type.residuals = "pearson")
```
There is no distinguishable pattern in the residuals, so the linearity assumption is satisfied.