class: center, middle, inverse, title-slide # Multinomial Logistic Regression ## Prediction + model selection + conditions ### Prof. Maria Tackett --- class: middle, center ## [Click for PDF of slides](22-multinom-logistic-pt2.pdf) --- ## Topics - Predictions - Model selection - Checking conditions --- ## NHANES Data - [National Health and Nutrition Examination Survey](https://www.cdc.gov/nchs/nhanes/index.htm) is conducted by the National Center for Health Statistics (NCHS) - The goal is to *"assess the health and nutritional status of adults and children in the United States"* - This survey includes an interview and a physical examination --- ## Health Rating vs. Age & Physical Activity - **Question**: Can we use a person's age and whether they do regular physical activity to predict their self-reported health rating? - We will analyze the following variables: + <font class="vocab">`HealthGen`: </font>Self-reported rating of participant's health in general. Excellent, Vgood, Good, Fair, or Poor. + <font class="vocab">`Age`: </font>Age at time of screening (in years). Participants 80 or older were recorded as 80. 
+ <font class="vocab">`PhysActive`: </font>Participant does moderate to vigorous-intensity sports, fitness or recreational activities --- ## Model in R <table> <thead> <tr> <th style="text-align:left;"> y.level </th> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Vgood </td> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 1.205 </td> <td style="text-align:right;"> 0.145 </td> <td style="text-align:right;"> 8.325 </td> <td style="text-align:right;"> 0.000 </td> </tr> <tr> <td style="text-align:left;"> Vgood </td> <td style="text-align:left;"> Age </td> <td style="text-align:right;"> 0.001 </td> <td style="text-align:right;"> 0.002 </td> <td style="text-align:right;"> 0.369 </td> <td style="text-align:right;"> 0.712 </td> </tr> <tr> <td style="text-align:left;"> Vgood </td> <td style="text-align:left;"> PhysActiveYes </td> <td style="text-align:right;"> -0.321 </td> <td style="text-align:right;"> 0.093 </td> <td style="text-align:right;"> -3.454 </td> <td style="text-align:right;"> 0.001 </td> </tr> <tr> <td style="text-align:left;"> Good </td> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 1.948 </td> <td style="text-align:right;"> 0.141 </td> <td style="text-align:right;"> 13.844 </td> <td style="text-align:right;"> 0.000 </td> </tr> <tr> <td style="text-align:left;"> Good </td> <td style="text-align:left;"> Age </td> <td style="text-align:right;"> -0.002 </td> <td style="text-align:right;"> 0.002 </td> <td style="text-align:right;"> -0.977 </td> <td style="text-align:right;"> 0.329 </td> </tr> <tr> <td style="text-align:left;"> Good </td> <td style="text-align:left;"> PhysActiveYes </td> <td style="text-align:right;"> -1.001 </td> <td style="text-align:right;"> 0.090 
</td> <td style="text-align:right;"> -11.120 </td> <td style="text-align:right;"> 0.000 </td> </tr> <tr> <td style="text-align:left;"> Fair </td> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 0.915 </td> <td style="text-align:right;"> 0.164 </td> <td style="text-align:right;"> 5.566 </td> <td style="text-align:right;"> 0.000 </td> </tr> <tr> <td style="text-align:left;"> Fair </td> <td style="text-align:left;"> Age </td> <td style="text-align:right;"> 0.003 </td> <td style="text-align:right;"> 0.003 </td> <td style="text-align:right;"> 1.058 </td> <td style="text-align:right;"> 0.290 </td> </tr> <tr> <td style="text-align:left;"> Fair </td> <td style="text-align:left;"> PhysActiveYes </td> <td style="text-align:right;"> -1.645 </td> <td style="text-align:right;"> 0.107 </td> <td style="text-align:right;"> -15.319 </td> <td style="text-align:right;"> 0.000 </td> </tr> <tr> <td style="text-align:left;"> Poor </td> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> -1.521 </td> <td style="text-align:right;"> 0.290 </td> <td style="text-align:right;"> -5.238 </td> <td style="text-align:right;"> 0.000 </td> </tr> <tr> <td style="text-align:left;"> Poor </td> <td style="text-align:left;"> Age </td> <td style="text-align:right;"> 0.022 </td> <td style="text-align:right;"> 0.005 </td> <td style="text-align:right;"> 4.522 </td> <td style="text-align:right;"> 0.000 </td> </tr> <tr> <td style="text-align:left;"> Poor </td> <td style="text-align:left;"> PhysActiveYes </td> <td style="text-align:right;"> -2.656 </td> <td style="text-align:right;"> 0.236 </td> <td style="text-align:right;"> -11.275 </td> <td style="text-align:right;"> 0.000 </td> </tr> </tbody> </table> --- class: middle, center ## Predictions --- ## Calculating probabilities For categories `\(2,\ldots,K\)`, the probability that the `\(i^{th}\)` observation is in the `\(j^{th}\)` category is `$$\hat{\pi}_{ij} = \frac{\exp\{\hat{\beta}_{0j} + 
\hat{\beta}_{1j}x_{i1} + \dots + \hat{\beta}_{pj}x_{ip}\}}{1 + \sum\limits_{k=2}^K \exp\{\hat{\beta}_{0k} + \hat{\beta}_{1k}x_{i1} + \dots + \hat{\beta}_{pk}x_{ip}\}}$$` <br> For the baseline category, `\(k=1\)`, we calculate the probability `\(\hat{\pi}_{i1}\)` as `$$\hat{\pi}_{i1} = 1 - \sum\limits_{k=2}^K \hat{\pi}_{ik}$$` --- ## NHANES: Predicted probabilities .midi[ ```r # calculate predicted probabilities pred_probs <- as_tibble(predict(health_m, type = "probs")) %>% mutate(obs_num = 1:n()) ``` ] ``` ## # A tibble: 5 x 6 ## Excellent Vgood Good Fair Poor obs_num ## <dbl> <dbl> <dbl> <dbl> <dbl> <int> ## 1 0.0705 0.244 0.451 0.198 0.0366 101 ## 2 0.0702 0.244 0.441 0.202 0.0426 102 ## 3 0.0696 0.244 0.427 0.206 0.0527 103 ## 4 0.0696 0.244 0.427 0.206 0.0527 104 ## 5 0.155 0.393 0.359 0.0861 0.00662 105 ``` --- ## Add predictions to original data .midi[ ```r health_m_aug <- inner_join(nhanes_adult, pred_probs, by = "obs_num") %>% select(obs_num, everything()) ``` ] ``` ## Rows: 6,710 ## Columns: 10 ## $ obs_num <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,… ## $ HealthGen <fct> Good, Good, Good, Good, Vgood, Vgood, Vgood, Vgood, Vgood,… ## $ Age <int> 34, 34, 34, 49, 45, 45, 45, 66, 58, 54, 50, 33, 60, 56, 56… ## $ PhysActive <fct> No, No, No, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, No,… ## $ Education <fct> High School, High School, High School, Some College, Colle… ## $ Excellent <dbl> 0.07069715, 0.07069715, 0.07069715, 0.07003173, 0.15547075… ## $ Vgood <dbl> 0.2433979, 0.2433979, 0.2433979, 0.2444214, 0.3922335, 0.3… ## $ Good <dbl> 0.4573727, 0.4573727, 0.4573727, 0.4372533, 0.3599639, 0.3… ## $ Fair <dbl> 0.19568909, 0.19568909, 0.19568909, 0.20291032, 0.08585489… ## $ Poor <dbl> 0.032843150, 0.032843150, 0.032843150, 0.045383332, 0.0064… ``` --- ## Actual vs. 
Predicted Health Rating - We can use our model to predict a person's perceived health rating given their age and whether they exercise - For each observation, the predicted perceived health rating is the category with the highest predicted probability ```r health_m_aug <- health_m_aug %>% mutate(pred_health = predict(health_m, type = "class")) ``` --- ## Actual vs. Predicted Health Rating .midi[ ```r health_m_aug %>% count(HealthGen, pred_health, .drop = FALSE) %>% pivot_wider(names_from = pred_health, values_from = n) ``` ``` ## # A tibble: 5 x 6 ## HealthGen Excellent Vgood Good Fair Poor ## <fct> <int> <int> <int> <int> <int> ## 1 Excellent 0 550 223 0 0 ## 2 Vgood 0 1376 785 0 0 ## 3 Good 0 1255 1399 0 0 ## 4 Fair 0 300 642 0 0 ## 5 Poor 0 24 156 0 0 ``` ] --- .question[ Why do you think no observations were predicted to have a rating of "Excellent", "Fair", or "Poor"? ] -- <img src="22-multinom-logistic-pt2_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" /> --- class: middle, center ## Model selection --- ## Comparing Nested Models - Suppose there are two models: - Reduced Model includes predictors `\(x_1, \ldots, x_q\)` - Full Model includes predictors `\(x_1, \ldots, x_q, x_{q+1}, \ldots, x_p\)` - We want to test the hypotheses `$$\begin{aligned}&H_0: \beta_{q+1} = \dots = \beta_p = 0 \\ & H_a: \text{ at least one } \beta_j \text{ is not } 0\end{aligned}$$` - To do so, we will use the .vocab[Drop-in-Deviance test] (very similar to logistic regression) --- ## Add `Education` to the model? - We consider adding the participants' `Education` level to the model. 
- Education takes values `8thGrade`, `9-11thGrade`, `HighSchool`, `SomeCollege`, and `CollegeGrad` - Models we're testing: - Reduced Model: `Age`, `PhysActive` - Full Model: `Age`, `PhysActive`, `Education` .alert[ `$$\begin{align}&H_0: \beta_{9-11thGrade} = \beta_{HighSchool} = \beta_{SomeCollege} = \beta_{CollegeGrad} = 0\\ &H_a: \text{ at least one }\beta_j \text{ is not equal to }0\end{align}$$` ] --- ## Add `Education` to the model? .alert[ `$$\begin{align}&H_0: \beta_{9-11thGrade} = \beta_{HighSchool} = \beta_{SomeCollege} = \beta_{CollegeGrad} = 0\\ &H_a: \text{ at least one }\beta_j \text{ is not equal to }0\end{align}$$` ] ```r model_red <- multinom(HealthGen ~ Age + PhysActive, data = nhanes_adult) model_full <- multinom(HealthGen ~ Age + PhysActive + Education, data = nhanes_adult) ``` --- ## Add `Education` to the model? ```r anova(model_red, model_full, test = "Chisq") %>% kable(format = "markdown") ``` |Model | Resid. df| Resid. Dev|Test | Df| LR stat.| Pr(Chi)| |:----------------------------|---------:|----------:|:------|-----:|--------:|-------:| |Age + PhysActive | 25848| 16994.23| | NA| NA| NA| |Age + PhysActive + Education | 25832| 16505.10|1 vs 2 | 16| 489.1319| 0| -- The p-value is approximately 0, so we reject `\(H_0\)`: at least one coefficient associated with `Education` is non-zero. Therefore, we will include `Education` in the model. 
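As a quick arithmetic check, the likelihood ratio statistic and its degrees of freedom follow directly from the residual deviances and residual df reported in the table above:

`$$G = 16994.23 - 16505.10 = 489.13, \quad \text{df} = 25848 - 25832 = 16$$`

Comparing `\(G = 489.13\)` to a `\(\chi^2_{16}\)` distribution gives the p-value of approximately 0 shown in the `Pr(Chi)` column. 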
--- ## Model with `Education` .small[ |y.level |term | estimate| std.error| statistic| p.value| conf.low| conf.high| |:-------|:-----------------------|--------:|---------:|---------:|-------:|--------:|---------:| |Vgood |(Intercept) | 0.582| 0.301| 1.930| 0.054| -0.009| 1.173| |Vgood |Age | 0.001| 0.003| 0.419| 0.675| -0.004| 0.006| |Vgood |PhysActiveYes | -0.264| 0.099| -2.681| 0.007| -0.457| -0.071| |Vgood |Education9 - 11th Grade | 0.768| 0.308| 2.493| 0.013| 0.164| 1.372| |Vgood |EducationHigh School | 0.701| 0.280| 2.509| 0.012| 0.153| 1.249| |Vgood |EducationSome College | 0.788| 0.271| 2.901| 0.004| 0.256| 1.320| |Vgood |EducationCollege Grad | 0.408| 0.268| 1.522| 0.128| -0.117| 0.933| |Good |(Intercept) | 2.041| 0.272| 7.513| 0.000| 1.508| 2.573| |Good |Age | -0.002| 0.003| -0.651| 0.515| -0.007| 0.003| |Good |PhysActiveYes | -0.758| 0.096| -7.884| 0.000| -0.946| -0.569| |Good |Education9 - 11th Grade | 0.360| 0.275| 1.310| 0.190| -0.179| 0.899| |Good |EducationHigh School | 0.085| 0.247| 0.345| 0.730| -0.399| 0.569| |Good |EducationSome College | -0.011| 0.239| -0.047| 0.962| -0.480| 0.457| |Good |EducationCollege Grad | -0.891| 0.236| -3.767| 0.000| -1.354| -0.427| |Fair |(Intercept) | 2.116| 0.288| 7.355| 0.000| 1.552| 2.680| |Fair |Age | 0.000| 0.003| 0.107| 0.914| -0.006| 0.006| |Fair |PhysActiveYes | -1.191| 0.115| -10.367| 0.000| -1.416| -0.966| |Fair |Education9 - 11th Grade | -0.224| 0.279| -0.802| 0.422| -0.771| 0.323| |Fair |EducationHigh School | -0.832| 0.252| -3.307| 0.001| -1.326| -0.339| |Fair |EducationSome College | -1.343| 0.246| -5.462| 0.000| -1.825| -0.861| |Fair |EducationCollege Grad | -2.509| 0.253| -9.913| 0.000| -3.005| -2.013| |Poor |(Intercept) | -0.200| 0.411| -0.488| 0.626| -1.005| 0.605| |Poor |Age | 0.018| 0.005| 3.527| 0.000| 0.008| 0.028| |Poor |PhysActiveYes | -2.267| 0.242| -9.377| 0.000| -2.741| -1.793| |Poor |Education9 - 11th Grade | -0.360| 0.353| -1.020| 0.308| -1.053| 0.332| |Poor |EducationHigh School | -1.150| 
0.334| -3.438| 0.001| -1.805| -0.494| |Poor |EducationSome College | -1.073| 0.316| -3.399| 0.001| -1.692| -0.454| |Poor |EducationCollege Grad | -2.322| 0.366| -6.342| 0.000| -3.039| -1.604| ] --- ## Compare NHANES models using AIC .midi[ ```r glance(model_red)$AIC ``` ``` ## [1] 17018.23 ``` ```r glance(model_full)$AIC ``` ``` ## [1] 16561.1 ``` ] -- Use the `step()` function to perform model selection with AIC as the selection criterion --- class: middle, center ## Checking conditions --- ## Assumptions for multinomial logistic regression We want to check the following assumptions for the multinomial logistic regression model: 1. .vocab[Linearity]: Is there a linear relationship between the log-odds and the predictor variables? 2. .vocab[Randomness]: Was the sample randomly selected? Or can we reasonably treat it as random? 3. .vocab[Independence]: Is there any obvious relationship between the observations? --- ## Checking linearity Similar to logistic regression, we will check linearity by examining empirical logit plots between each level of the response and the quantitative predictor variables. .small[ ```r nhanes_adult <- nhanes_adult %>% mutate(Excellent = factor(if_else(HealthGen == "Excellent", "1", "0")), Vgood = factor(if_else(HealthGen == "Vgood", "1", "0")), Good = factor(if_else(HealthGen == "Good", "1", "0")), Fair = factor(if_else(HealthGen == "Fair", "1", "0")), Poor = factor(if_else(HealthGen == "Poor", "1", "0")) ) ``` ] --- ## Checking linearity .small[ ```r library(Stat2Data) ``` ```r par(mfrow = c(2,1)) emplogitplot1(Excellent ~ Age, data = nhanes_adult, ngroups = 5, main = "Excellent vs. Age") emplogitplot1(Vgood ~ Age, data = nhanes_adult, ngroups = 5, main = "Vgood vs. Age") ``` <img src="22-multinom-logistic-pt2_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" /> ] --- ## Checking linearity .small[ ```r par(mfrow = c(2,1)) emplogitplot1(Good ~ Age, data = nhanes_adult, ngroups = 5, main = "Good vs. 
Age") emplogitplot1(Fair ~ Age, data = nhanes_adult, ngroups = 5, main = "Fair vs. Age") ``` <img src="22-multinom-logistic-pt2_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" /> ] --- ## Checking linearity .small[ ```r emplogitplot1(Poor ~ Age, data = nhanes_adult, ngroups = 5, main = "Poor vs. Age") ``` <img src="22-multinom-logistic-pt2_files/figure-html/unnamed-chunk-22-1.png" style="display: block; margin: auto;" /> ] -- ✅ The linearity condition is satisfied. There is a linear relationship between the empirical logit and the quantitative predictor variable, Age. --- ## Checking randomness We can check the randomness condition based on the context of the data and how the observations were collected. - Was the sample randomly selected? - If the sample was not randomly selected, ask whether there is reason to believe the observations in the sample differ systematically from the population of interest. -- ✅ The randomness condition is satisfied. We do not have reason to believe that the participants in this study differ systematically from adults in the U.S. --- ## Checking independence We can check the independence condition based on the context of the data and how the observations were collected. Independence is most often violated if the data were collected over time or there is a strong spatial relationship between the observations. -- ✅ The independence condition is satisfied. It is reasonable to conclude that the participants' health and behavior characteristics are independent of one another. --- ## Recap - Predictions - Model selection - Checking conditions
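--- ## Aside: predicted probabilities by hand

To make the prediction formula concrete, here is a minimal sketch (in Python rather than the deck's R, purely for illustration) that applies the softmax formula using the rounded coefficients from the `Age + PhysActive` model table. The inputs — a 45-year-old who is physically active — are chosen so the result can be compared against the fifth row of `pred_probs`.

```python
import math

# Rounded coefficients from the Age + PhysActive model table
# (baseline category: Excellent).
# Each tuple is (intercept, Age, PhysActiveYes).
coefs = {
    "Vgood": (1.205, 0.001, -0.321),
    "Good": (1.948, -0.002, -1.001),
    "Fair": (0.915, 0.003, -1.645),
    "Poor": (-1.521, 0.022, -2.656),
}

def predict_probs(age, phys_active):
    """Predicted probability of each health rating for one person."""
    # exp of the linear predictor for each non-baseline category
    exps = {level: math.exp(b0 + b1 * age + b2 * phys_active)
            for level, (b0, b1, b2) in coefs.items()}
    denom = 1 + sum(exps.values())  # the "1" is the baseline term
    probs = {level: e / denom for level, e in exps.items()}
    probs["Excellent"] = 1 / denom  # baseline gets the remaining mass
    return probs

probs = predict_probs(age=45, phys_active=1)
# Probabilities sum to 1; "Vgood" has the largest predicted probability
```

Within rounding error, the result matches the probabilities R reports for a physically active 45-year-old (about 0.155 Excellent, 0.393 Vgood, 0.359 Good, 0.086 Fair, 0.007 Poor).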