class: center, middle, inverse, title-slide

# Log-linear models
## Poisson regression
### Prof. Maria Tackett

---

class: middle, center

## [Click for PDF of slides](23-poisson.pdf)

---

## Poisson response variables

The following are examples of scenarios with Poisson response variables:

- Are the .vocab[number of motorcycle deaths] in a given year related to a state's helmet laws?
- Does the .vocab[number of employers] conducting on-campus interviews during a year differ for public and private colleges?
- Does the .vocab[daily number of asthma-related visits] to an Emergency Room differ depending on air pollution indices?
- Has the .vocab[number of deformed fish] in randomly selected Minnesota lakes been affected by changes in trace minerals in the water over the last decade?

---

## Poisson Distribution

If `\(Y\)` follows a Poisson distribution, then

`$$P(Y=y) = \frac{\exp\{-\lambda\}\lambda^y}{y!} \hspace{10mm} y=0,1,2,\ldots$$`

--

**Features of the Poisson distribution**

- The mean and variance are equal (both are `\(\lambda\)`)
- The distribution tends to be skewed right, especially when the mean is small
- If the mean is large, the distribution can be approximated by a Normal distribution

---

## Simulated Poisson distributions

<img src="23-poisson_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---

## Simulated Poisson distributions

|            |      Mean |   Variance |
|:-----------|----------:|-----------:|
| lambda=2   |   2.00740 |   2.015245 |
| lambda=5   |   4.99130 |   4.968734 |
| lambda=20  |  19.99546 |  19.836958 |
| lambda=100 | 100.02276 | 100.527647 |

---

## Poisson Regression

- We want `\(\lambda\)` to be a function of predictor variables `\(x_1, \ldots, x_p\)`

--

.question[
Why is a multiple linear regression model not appropriate?
]

--

- `\(\lambda\)` must be greater than or equal to 0 for any combination of predictor variables
- The constant variance assumption will be violated!

---

## Multiple linear regression vs. Poisson

<img src="img/23/poisson_ols.png" width="65%" style="display: block; margin: auto;" />

.footnote[Image from: [*Broadening Your Statistical Horizons*](https://bookdown.org/roback/bookdown-bysh/ch-poissonreg.html)]

---

## Poisson Regression

If the observed values `\(Y_i\)` are Poisson, then we can model the mean `\(\lambda_i\)` using a <font class="vocab">Poisson regression model</font> of the form

.alert[
`$$\log(\lambda_i) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_p x_{pi}$$`
]

---

## Interpreting Model Coefficients

- <font class="vocab">Slope, `\(\beta_j\)`:</font>
    - **Quantitative Predictor**: When `\(x_j\)` increases by one unit, the mean of `\(y\)` is expected to multiply by a factor of `\(\exp\{\beta_j\}\)` (*holding all else constant*).
    - **Categorical Predictor**: The mean of `\(y\)` for category `\(k\)` is expected to be `\(\exp\{\beta_j\}\)` times the mean of `\(y\)` for the baseline category (*holding all else constant*).

--

- <font class="vocab">Intercept, `\(\beta_0\)`:</font> When all of the predictors equal 0, the mean of `\(y\)` is expected to be `\(\exp\{\beta_0\}\)`.
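---

## Checking the interpretation by simulation

As a quick sanity check of the multiplicative interpretation, we can simulate counts from a Poisson model and confirm that increasing a predictor by one unit multiplies the mean count by `\(\exp\{\beta_1\}\)`. This is a minimal sketch; the coefficients below are made up for illustration.

```r
set.seed(1)
beta_0 <- 1.5
beta_1 <- 0.3

# mean counts when x = 2 and when x = 3 (one unit apart)
lambda_x2 <- exp(beta_0 + beta_1 * 2)
lambda_x3 <- exp(beta_0 + beta_1 * 3)

# simulate counts at each x and compare the sample means
mean(rpois(100000, lambda_x3)) / mean(rpois(100000, lambda_x2))

# the ratio above should be close to exp(beta_1)
exp(beta_1)
```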
---

## Example: Household size in the Philippines

The data come from the 2015 Family Income and Expenditure Survey conducted by the Philippine Statistics Authority.

.vocab[Goal:] We want to use the data to understand the relationship between the age of the head of the household and the number of people in their household.

.vocab[Variables]

- `age`: the age of the head of household
- `total`: the number of people in the household other than the head

---

## Exploratory data analysis

<img src="23-poisson_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

## Exploratory data analysis

Let's examine a plot of the log-transformed mean number of people in the household by age.

<img src="23-poisson_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---

## Exploratory data analysis

Let's examine a plot of the log-transformed mean number of people in the household by age.

<img src="23-poisson_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

---

## Number in household vs. age

.midi[
```r
model1 <- glm(total ~ ageCent, data = hh, family = "poisson")

tidy(model1, conf.int = TRUE) %>%
  kable(digits = 3)
```

| term        | estimate | std.error | statistic | p.value | conf.low | conf.high |
|:------------|---------:|----------:|----------:|--------:|---------:|----------:|
| (Intercept) |    1.303 |     0.013 |    96.647 |       0 |    1.276 |     1.329 |
| ageCent     |   -0.005 |     0.001 |    -4.832 |       0 |   -0.006 |    -0.003 |
]

`$$\color{#87037B}{\log(\overline{\text{total}}) = 1.303 - 0.0047 \times \text{ageCent}}$$`

---

## Interpretations

| term        | estimate | std.error | statistic | p.value | conf.low | conf.high |
|:------------|---------:|----------:|----------:|--------:|---------:|----------:|
| (Intercept) |    1.303 |     0.013 |    96.647 |       0 |    1.276 |     1.329 |
| ageCent     |   -0.005 |     0.001 |    -4.832 |       0 |   -0.006 |    -0.003 |

--

For each additional year older the head of the household is, the mean number of people in the household is expected to multiply by a factor of 0.995 (exp(-0.0047)).

--

For households whose head is 52.657 years old (the mean age, so `ageCent` = 0), the expected mean number of people in the household is 3.68 (exp(1.303)).
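---

## Coefficients on the multiplicative scale

The multiplicative factors in these interpretations come from exponentiating the model coefficients. A minimal sketch, assuming `model1` from the previous slides is in the environment:

```r
# point estimates as multiplicative factors on the mean count
exp(coef(model1))

# estimates and confidence intervals on the exp scale
tidy(model1, conf.int = TRUE, exponentiate = TRUE) %>%
  kable(digits = 3)
```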
---

## Drop-In-Deviance Test

We can use a .vocab[drop-in-deviance test] to compare nested models (similar to logistic regression).

Let's try adding `ageCent^2` to the model.

`$$H_0: \beta_{ageCent^2} = 0 \text{ vs. } H_a: \beta_{ageCent^2} \neq 0$$`

.midi[
```r
model1 <- glm(total ~ ageCent, data = hh, family = "poisson")
model2 <- glm(total ~ ageCent + I(ageCent^2), data = hh, 
              family = "poisson")
```

```r
anova(model1, model2, test = "Chisq")
```
]

---

## Drop-In-Deviance Test

| Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
|----------:|-----------:|---:|---------:|---------:|
|      1496 |   2330.729 | NA |       NA |       NA |
|      1495 |   2198.533 |  1 |  132.197 |        0 |

--

The p-value is small, so we reject `\(H_0\)`. We will include `ageCent^2` in the model.

---

## Final model

| term         | estimate | std.error | statistic | p.value | conf.low | conf.high |
|:-------------|---------:|----------:|----------:|--------:|---------:|----------:|
| (Intercept)  |    1.436 |     0.017 |    82.339 |       0 |    1.402 |     1.470 |
| ageCent      |   -0.004 |     0.001 |    -3.584 |       0 |   -0.006 |    -0.002 |
| I(ageCent^2) |   -0.001 |     0.000 |   -10.938 |       0 |   -0.001 |    -0.001 |

---

## Model Assumptions

1. .vocab[Poisson Response]: The response follows a Poisson distribution for each level of the predictor.
2. .vocab[Independence]: The observations are independent of one another.
3. .vocab[Mean = variance]: The mean value of the response equals the variance of the response for each level of the predictor.
4. .vocab[Linearity]: `\(\log(\lambda)\)` is a linear function of the predictors.

---

## Poisson response

Let's check the first assumption by looking at the distribution of the response for groups of the predictor.

```r
hh <- hh %>%
  mutate(age_group = cut(age, breaks = seq(15, 100, 5)))
```
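---

## Poisson response

One way to examine these distributions is with a histogram of the response within each age group. A minimal sketch of code that could produce a plot like the one on the next slide, assuming `ggplot2` is loaded:

```r
ggplot(hh, aes(x = total)) +
  geom_histogram(binwidth = 1) +
  facet_wrap(~ age_group) +
  labs(x = "Number in household (other than the head)",
       y = "Count")
```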
---

## Poisson response

.small[
<img src="23-poisson_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />
]

This condition is satisfied based on the overall distribution of the response (from the EDA) and the distribution of the response by age group.

---

## Independence

We don't have much information about how the households were selected for the survey. If the households were not selected randomly, but rather groups of households were selected from different areas with different customs about living arrangements, then the independence assumption would be violated.

---

## Mean = variance

Let's look at the mean and variance for each age group (a sketch of code that could produce this summary is in the appendix at the end).

```
## # A tibble: 10 x 4
##    age_group mean_total var_total     n
##    <fct>          <dbl>     <dbl> <int>
##  1 (15,20]         1.67     0.667     6
##  2 (20,25]         2.17     1.56     18
##  3 (25,30]         2.92     1.41     49
##  4 (30,35]         3.44     2.19    108
##  5 (35,40]         3.84     3.57    158
##  6 (40,45]         4.23     4.44    175
##  7 (45,50]         4.49     6.40    194
##  8 (50,55]         4.01     5.25    188
##  9 (55,60]         3.81     6.53    145
## 10 (60,65]         3.71     6.20    153
```

---

## Mean = variance

```
## # A tibble: 6 x 4
##   age_group mean_total var_total     n
##   <fct>          <dbl>     <dbl> <int>
## 1 (65,70]         3.34      8.00   115
## 2 (70,75]         2.74      6.75    91
## 3 (75,80]         2.53      4.97    57
## 4 (80,85]         2.23      3.15    30
## 5 (85,90]         2.56      7.03     9
## 6 (90,95]         1         2        2
```

--

It appears the assumption is violated in some age groups; however, the violations are small enough that we can proceed.

---

## Linearity

The raw residual for the `\(i^{th}\)` observation, `\(y_i - \hat{\lambda}_i\)`, is difficult to interpret since the variance equals the mean in the Poisson distribution, so the spread of the residuals changes with the predicted value.

--

Instead, we can analyze a standardized residual called the .vocab[Pearson residual].

`$$r_i = \frac{y_i - \hat{\lambda}_i}{\sqrt{\hat{\lambda}_i}}$$`

--

We will examine a plot of the Pearson residuals versus the predicted values to check the linearity assumption.

---

## `augment` function

```r
hh_aug <- augment(model2, type.predict = "response", 
                  type.residuals = "pearson")
```

---

## Linearity condition

<img src="23-poisson_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" />

--

There is no distinguishable pattern in the residuals, so the linearity assumption is satisfied (a sketch of the plotting code is in the appendix).

---

## References

These slides draw material from Chapter 4 of [*Beyond Multiple Linear Regression*](https://bookdown.org/roback/bookdown-bysh/ch-poissonreg.html).
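---

## Appendix: code sketches

The grouped summary and the residual plot on the earlier slides are shown as output only. Below is a minimal sketch of code that could produce similar output, assuming `dplyr`, `ggplot2`, and the `hh` and `hh_aug` data frames created earlier.

```r
# mean and variance of the response within each age group
hh %>%
  group_by(age_group) %>%
  summarise(mean_total = mean(total),
            var_total  = var(total),
            n          = n())

# Pearson residuals vs. predicted values (linearity check)
ggplot(hh_aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  labs(x = "Predicted count", y = "Pearson residual")
```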