class: center, middle, inverse, title-slide # Simple Linear Regression ## Partitioning variability ### Prof. Maria Tackett --- class: middle, center ## [Click here for PDF of slides](06-slr-partition-var.pdf) --- ## Topics -- - Use analysis of variance to partition variability in the response variable -- - Define and calculate `\(R^2\)` -- - Use ANOVA to test the hypothesis `$$H_0: \beta_1 = 0 \text{ vs }H_a: \beta_1 \neq 0$$` -- --- ## Cats data The data set contains the **heart weight** (.term[Hwt]) and **body weight** (.term[Bwt]) for 144 domestic cats. <img src="06-slr-partition-var_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" /> --- ## Distribution of response <img src="06-slr-partition-var_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" /> <table> <thead> <tr> <th style="text-align:right;"> Mean </th> <th style="text-align:right;"> Std. Dev. </th> <th style="text-align:right;"> IQR </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 10.631 </td> <td style="text-align:right;"> 2.435 </td> <td style="text-align:right;"> 3.175 </td> </tr> </tbody> </table> --- ## The model .eq[ `$$\hat{\text{Hwt}} = -0.357 + 4.034 \times \text{Bwt}$$` ] <img src="06-slr-partition-var_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" /> --- class: middle, center .alert[ How much of the variation in cats' heart weights can be explained by knowing their body weights? ] --- ## ANOVA We will use .vocab[Analysis of Variance (ANOVA)] to partition the variation in the response variable `\(Y\)`. 
<br> <img src="img/06/model-anova.png" width="300%" style="display: block; margin: auto;" /> --- ## Response variable, `\(Y\)` <img src="06-slr-partition-var_files/figure-html/unnamed-chunk-8-1.png" width="90%" style="display: block; margin: auto;" /> --- ## Total variation <img src="06-slr-partition-var_files/figure-html/unnamed-chunk-9-1.png" width="90%" style="display: block; margin: auto;" /> `$$\large{SS_{Total} = \sum_{i=1}^n(y_i - \bar{y})^2 = (n-1)s_y^2}$$` --- ## Explained variation (Model) <img src="06-slr-partition-var_files/figure-html/unnamed-chunk-10-1.png" width="90%" style="display: block; margin: auto;" /> `$$\large{SS_{Model} = \sum_{i = 1}^{n}(\hat{y}_i - \bar{y})^2}$$` --- ## Unexplained variation (Residuals) <img src="06-slr-partition-var_files/figure-html/unnamed-chunk-11-1.png" width="90%" style="display: block; margin: auto;" /> `$$\large{SS_{Error} = \sum_{i = 1}^{n}(y_i - \hat{y}_i)^2}$$` --- class: middle `$$\sum_{i=1}^n(y_i - \bar{y})^2 = \sum_{i = 1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i = 1}^{n}(y_i - \hat{y}_i)^2$$` --- class: middle `$$\mathbf{\color{blue}{\sum_{i=1}^n(y_i - \bar{y})^2}} = \sum_{i = 1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i = 1}^{n}(y_i - \hat{y}_i)^2$$` --- class: middle `$$\sum_{i=1}^n(y_i - \bar{y})^2 = \mathbf{\color{blue}{\sum_{i = 1}^{n}(\hat{y}_i - \bar{y})^2}} + \sum_{i = 1}^{n}(y_i - \hat{y}_i)^2$$` --- class: middle `$$\sum_{i=1}^n(y_i - \bar{y})^2 = \sum_{i = 1}^{n}(\hat{y}_i - \bar{y})^2 + \mathbf{\color{blue}{\sum_{i = 1}^{n}(y_i - \hat{y}_i)^2}}$$` --- ## `\(R^2\)` The .vocab[coefficient of determination], <font class = "vocab">R<sup>2</sup></font>, is the proportion of variation in the response, `\(Y\)`, that is explained by the regression model <br> -- .eq[ `$$\large{R^2 = \frac{SS_{Model}}{SS_{Total}} = 1 - \frac{SS_{Error}}{SS_{Total}}}$$` ] --- ## `\(R^2\)` for our model .pull-left[ .small-box-work[ `$$SS_{Model} = 548.092$$` `$$SS_{Error} = 299.533$$` `$$SS_{Total} = 847.625$$` ] ] -- .pull-right[ 
.small-box-work[ `$$\begin{aligned}R^2 &= \frac{548.092}{847.625} \\[10pt] &= \mathbf{0.647}\end{aligned}$$` ] ] -- <br> .vocab[About 64.7% of the variation in the heart weight of cats can be explained by variation in body weight.] --- ## ANOVA table <table> <thead> <tr> <th style="text-align:left;"> Source </th> <th style="text-align:right;"> Df </th> <th style="text-align:right;"> Sum Sq </th> <th style="text-align:right;"> Mean Sq </th> <th style="text-align:right;"> F Stat </th> <th style="text-align:right;"> Pr(> F) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Model </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 548.092 </td> <td style="text-align:right;"> 548.092 </td> <td style="text-align:right;"> 259.835 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> Residuals </td> <td style="text-align:right;"> 142 </td> <td style="text-align:right;"> 299.533 </td> <td style="text-align:right;"> 2.109 </td> <td style="text-align:right;"> </td> <td style="text-align:right;"> </td> </tr> <tr> <td style="text-align:left;"> Total </td> <td style="text-align:right;"> 143 </td> <td style="text-align:right;"> 847.625 </td> <td style="text-align:right;"> </td> <td style="text-align:right;"> </td> <td style="text-align:right;"> </td> </tr> </tbody> </table> --- ## ANOVA table <table> <thead> <tr> <th style="text-align:left;"> Source </th> <th style="text-align:right;"> Df </th> <th style="text-align:right;background-color: #dce5b2 !important;"> Sum Sq </th> <th style="text-align:right;"> Mean Sq </th> <th style="text-align:right;"> F Stat </th> <th style="text-align:right;"> Pr(> F) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Model </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;background-color: #dce5b2 !important;"> 548.092 </td> <td style="text-align:right;"> 548.092 </td> <td style="text-align:right;"> 259.835 </td> <td style="text-align:right;"> 
0 </td> </tr> <tr> <td style="text-align:left;"> Residuals </td> <td style="text-align:right;"> 142 </td> <td style="text-align:right;background-color: #dce5b2 !important;"> 299.533 </td> <td style="text-align:right;"> 2.109 </td> <td style="text-align:right;"> </td> <td style="text-align:right;"> </td> </tr> <tr> <td style="text-align:left;"> Total </td> <td style="text-align:right;"> 143 </td> <td style="text-align:right;background-color: #dce5b2 !important;"> 847.625 </td> <td style="text-align:right;"> </td> <td style="text-align:right;"> </td> <td style="text-align:right;"> </td> </tr> </tbody> </table> -- .vocab[Sum of squares] `\(SS_{Total} = 847.625 = 548.092 + 299.533\)` `\(SS_{Model} = 548.092\)` `\(SS_{Error} = 299.533\)` --- ## ANOVA Test <table> <thead> <tr> <th style="text-align:left;"> Source </th> <th style="text-align:right;"> Df </th> <th style="text-align:right;"> Sum Sq </th> <th style="text-align:right;"> Mean Sq </th> <th style="text-align:right;"> F Stat </th> <th style="text-align:right;"> Pr(> F) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Model </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 548.092 </td> <td style="text-align:right;"> 548.092 </td> <td style="text-align:right;"> 259.835 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> Residuals </td> <td style="text-align:right;"> 142 </td> <td style="text-align:right;"> 299.533 </td> <td style="text-align:right;"> 2.109 </td> <td style="text-align:right;"> </td> <td style="text-align:right;"> </td> </tr> <tr> <td style="text-align:left;"> Total </td> <td style="text-align:right;"> 143 </td> <td style="text-align:right;"> 847.625 </td> <td style="text-align:right;"> </td> <td style="text-align:right;"> </td> <td style="text-align:right;"> </td> </tr> </tbody> </table> -- <br> .eq[ `$$\large{\begin{align}&H_0: \beta_1 = 0 \\ &H_a: \beta_1 \neq 0\\ \end{align}}$$` ] --- ## ANOVA Test <table> <thead> <tr> 
<th style="text-align:left;"> Source </th> <th style="text-align:right;background-color: #dce5b2 !important;"> Df </th> <th style="text-align:right;"> Sum Sq </th> <th style="text-align:right;"> Mean Sq </th> <th style="text-align:right;"> F Stat </th> <th style="text-align:right;"> Pr(> F) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Model </td> <td style="text-align:right;background-color: #dce5b2 !important;"> 1 </td> <td style="text-align:right;"> 548.092 </td> <td style="text-align:right;"> 548.092 </td> <td style="text-align:right;"> 259.835 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> Residuals </td> <td style="text-align:right;background-color: #dce5b2 !important;"> 142 </td> <td style="text-align:right;"> 299.533 </td> <td style="text-align:right;"> 2.109 </td> <td style="text-align:right;"> </td> <td style="text-align:right;"> </td> </tr> <tr> <td style="text-align:left;"> Total </td> <td style="text-align:right;background-color: #dce5b2 !important;"> 143 </td> <td style="text-align:right;"> 847.625 </td> <td style="text-align:right;"> </td> <td style="text-align:right;"> </td> <td style="text-align:right;"> </td> </tr> </tbody> </table> -- .vocab[Degrees of freedom] `\(df_{Total} = 144 - 1 = 143\)` `\(df_{Model} = 1\)` `\(df_{Error} = 143 - 1 = 142\)` --- ## ANOVA Test <table> <thead> <tr> <th style="text-align:left;"> Source </th> <th style="text-align:right;"> Df </th> <th style="text-align:right;"> Sum Sq </th> <th style="text-align:right;background-color: #dce5b2 !important;"> Mean Sq </th> <th style="text-align:right;"> F Stat </th> <th style="text-align:right;"> Pr(> F) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Model </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 548.092 </td> <td style="text-align:right;background-color: #dce5b2 !important;"> 548.092 </td> <td style="text-align:right;"> 259.835 </td> <td style="text-align:right;"> 0 </td> 
</tr> <tr> <td style="text-align:left;"> Residuals </td> <td style="text-align:right;"> 142 </td> <td style="text-align:right;"> 299.533 </td> <td style="text-align:right;background-color: #dce5b2 !important;"> 2.109 </td> <td style="text-align:right;"> </td> <td style="text-align:right;"> </td> </tr> <tr> <td style="text-align:left;"> Total </td> <td style="text-align:right;"> 143 </td> <td style="text-align:right;"> 847.625 </td> <td style="text-align:right;background-color: #dce5b2 !important;"> </td> <td style="text-align:right;"> </td> <td style="text-align:right;"> </td> </tr> </tbody> </table> -- .vocab[Mean squares] `\(MS_{Model} = \frac{548.092}{1} = 548.092\)` `\(MS_{Error} = \frac{299.533}{142} = 2.109\)` --- ## ANOVA Test <table> <thead> <tr> <th style="text-align:left;"> Source </th> <th style="text-align:right;"> Df </th> <th style="text-align:right;"> Sum Sq </th> <th style="text-align:right;"> Mean Sq </th> <th style="text-align:right;background-color: #dce5b2 !important;"> F Stat </th> <th style="text-align:right;"> Pr(> F) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Model </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 548.092 </td> <td style="text-align:right;"> 548.092 </td> <td style="text-align:right;background-color: #dce5b2 !important;"> 259.835 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> Residuals </td> <td style="text-align:right;"> 142 </td> <td style="text-align:right;"> 299.533 </td> <td style="text-align:right;"> 2.109 </td> <td style="text-align:right;background-color: #dce5b2 !important;"> </td> <td style="text-align:right;"> </td> </tr> <tr> <td style="text-align:left;"> Total </td> <td style="text-align:right;"> 143 </td> <td style="text-align:right;"> 847.625 </td> <td style="text-align:right;"> </td> <td style="text-align:right;background-color: #dce5b2 !important;"> </td> <td style="text-align:right;"> </td> </tr> </tbody> </table> -- 
.vocab[F test statistic]: ratio of explained to unexplained variability `\(F = \frac{MS_{Model}}{MS_{Error}}= \frac{548.092}{299.533/142} = 259.835\)` --- ## F distribution <img src="06-slr-partition-var_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" /> --- ## ANOVA Test <table> <thead> <tr> <th style="text-align:left;"> Source </th> <th style="text-align:right;"> Df </th> <th style="text-align:right;"> Sum Sq </th> <th style="text-align:right;"> Mean Sq </th> <th style="text-align:right;"> F Stat </th> <th style="text-align:right;background-color: #dce5b2 !important;"> Pr(> F) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Model </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 548.092 </td> <td style="text-align:right;"> 548.092 </td> <td style="text-align:right;"> 259.835 </td> <td style="text-align:right;background-color: #dce5b2 !important;"> 0 </td> </tr> <tr> <td style="text-align:left;"> Residuals </td> <td style="text-align:right;"> 142 </td> <td style="text-align:right;"> 299.533 </td> <td style="text-align:right;"> 2.109 </td> <td style="text-align:right;"> </td> <td style="text-align:right;background-color: #dce5b2 !important;"> </td> </tr> <tr> <td style="text-align:left;"> Total </td> <td style="text-align:right;"> 143 </td> <td style="text-align:right;"> 847.625 </td> <td style="text-align:right;"> </td> <td style="text-align:right;"> </td> <td style="text-align:right;background-color: #dce5b2 !important;"> </td> </tr> </tbody> </table> -- .vocab[P-value]: Probability of observing a test statistic at least as extreme as *F Stat*, given that the population slope `\(\beta_1\)` is 0 -- The p-value is calculated using an `\(F\)` distribution with 1 and `\(n-2 = 142\)` degrees of freedom --- ## Calculating p-value <img src="06-slr-partition-var_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" /> --- ## ANOVA <table> <thead> <tr> <th style="text-align:left;"> Source 
</th> <th style="text-align:right;"> Df </th> <th style="text-align:right;"> Sum Sq </th> <th style="text-align:right;"> Mean Sq </th> <th style="text-align:right;"> F Stat </th> <th style="text-align:right;background-color: #dce5b2 !important;"> Pr(> F) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Model </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 548.092 </td> <td style="text-align:right;"> 548.092 </td> <td style="text-align:right;"> 259.835 </td> <td style="text-align:right;background-color: #dce5b2 !important;"> 0 </td> </tr> <tr> <td style="text-align:left;"> Residuals </td> <td style="text-align:right;"> 142 </td> <td style="text-align:right;"> 299.533 </td> <td style="text-align:right;"> 2.109 </td> <td style="text-align:right;"> </td> <td style="text-align:right;background-color: #dce5b2 !important;"> </td> </tr> <tr> <td style="text-align:left;"> Total </td> <td style="text-align:right;"> 143 </td> <td style="text-align:right;"> 847.625 </td> <td style="text-align:right;"> </td> <td style="text-align:right;"> </td> <td style="text-align:right;background-color: #dce5b2 !important;"> </td> </tr> </tbody> </table> The p-value is very small `\((\approx 0)\)`, so we reject `\(H_0\)`. -- The data provide strong evidence that the population slope, `\(\beta_1\)`, is different from 0. -- .vocab[The data provide sufficient evidence that there is a linear relationship between a cat's heart weight and body weight.] --- ## Recap -- - Used analysis of variance to partition variability in the response variable -- - Defined and calculated `\(R^2\)` -- - Used ANOVA to test the hypothesis `$$H_0: \beta_1 = 0 \text{ vs }H_a: \beta_1 \neq 0$$`
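---

## Checking the ANOVA arithmetic

The ANOVA table entries can be reproduced from the two sums of squares and the sample size alone. The sketch below is a minimal illustration, assuming only the values reported in these slides (`\(SS_{Model}\)`, `\(SS_{Error}\)`, `\(n = 144\)`). The choice of Python and all variable names are assumptions for illustration; the slides do not show the code used to fit the model.

```python
# Values reported in the slides for the cats model (heart weight vs. body weight).
n = 144              # number of cats
ss_model = 548.092   # explained variation, SS_Model
ss_error = 299.533   # unexplained variation, SS_Error

# Partition of variability: SS_Total = SS_Model + SS_Error
ss_total = ss_model + ss_error

# Coefficient of determination: share of total variation explained by the model
r_squared = ss_model / ss_total

# Degrees of freedom for simple linear regression (one predictor)
df_total = n - 1
df_model = 1
df_error = df_total - df_model

# Mean squares: each sum of squares divided by its degrees of freedom
ms_model = ss_model / df_model
ms_error = ss_error / df_error

# F statistic: ratio of explained to unexplained variability
f_stat = ms_model / ms_error

print(f"SS_Total = {ss_total:.3f}")
print(f"R^2      = {r_squared:.3f}")
print(f"df       = {df_model}, {df_error}, {df_total}")
print(f"MS_Error = {ms_error:.3f}")
print(f"F        = {f_stat:.3f}")
```

The rounded results should agree with the table: `\(R^2 \approx 0.647\)`, `\(MS_{Error} \approx 2.109\)`, and `\(F \approx 259.835\)`. The F statistic uses the unrounded mean square, so dividing by the rounded 2.109 gives a slightly different value. The p-value is not computed here because it requires the `\(F\)` distribution with 1 and 142 degrees of freedom, which is not in the Python standard library.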