Multiple linear regression

Inference

Prof. Maria Tackett

Topics

Conduct a hypothesis test for $\beta_j$
Calculate a confidence interval for $\beta_j$
Quick overview of math details for MLR

House prices in Levittown

The data set contains the sales price and characteristics of 85 homes in Levittown, NY that sold between June 2010 and May 2011.

We would like to use the characteristics of a house to understand variability in the sales price.

Variables

Predictors

bedrooms: Number of bedrooms
bathrooms: Number of bathrooms
living_area: Total living area of the house (in square feet)
lot_size: Total area of the lot (in square feet)
year_built: Year the house was built
property_tax: Annual property taxes (in U.S. dollars)

Response

sale_price: Sales price (in U.S. dollars)

EDA: Response variable

EDA: Response vs. Predictors

Home price model

term	estimate	std.error	statistic	p.value
(Intercept)	-7148818.957	3820093.694	-1.871	0.065
bedrooms	-12291.011	9346.727	-1.315	0.192
bathrooms	51699.236	13094.170	3.948	0.000
living_area	65.903	15.979	4.124	0.000
lot_size	-0.897	4.194	-0.214	0.831
year_built	3760.898	1962.504	1.916	0.059
property_tax	1.476	2.832	0.521	0.604

Hypothesis test for $\beta_j$

Outline of a hypothesis test

1️⃣ State the hypotheses.

2️⃣ Calculate the test statistic.

3️⃣ Calculate the p-value.

4️⃣ State the conclusion.

1️⃣ State the hypotheses

term	estimate	std.error	statistic	p.value
(Intercept)	-7148818.957	3820093.694	-1.871	0.065
bedrooms	-12291.011	9346.727	-1.315	0.192
bathrooms	51699.236	13094.170	3.948	0.000
living_area	65.903	15.979	4.124	0.000
lot_size	-0.897	4.194	-0.214	0.831
year_built	3760.898	1962.504	1.916	0.059
property_tax	1.476	2.832	0.521	0.604
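
For the living_area coefficient examined in the following steps, the hypotheses can be sketched as below. The null value of 0 matches the hypothesized value used in the test statistic; the two-sided alternative is an assumption, chosen because it is what standard regression output reports.

```latex
H_0: \beta_{living\_area} = 0
\qquad \text{vs.} \qquad
H_a: \beta_{living\_area} \neq 0
```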

2️⃣ Calculate the test statistic

$$t = \frac{\hat{\beta}_j - 0}{SE(\hat{\beta}_j)} = \frac{65.903 - 0}{15.979} = 4.124$$

The estimated slope for living_area, 65.903, is 4.124 standard errors above the hypothesized mean, 0.
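
As a quick check, the statistic can be recomputed from the estimate and std.error columns. This is a minimal sketch; the variable names are illustrative and the inputs are the rounded values from the living_area row.

```python
# Values copied from the living_area row of the model output.
estimate = 65.903      # estimated slope
std_error = 15.979     # standard error of the slope
hypothesized = 0.0     # value of the slope under the null hypothesis

# Test statistic: number of standard errors between the estimate
# and the hypothesized value.
t_stat = (estimate - hypothesized) / std_error
print(round(t_stat, 3))
```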

3️⃣ Calculate the p-value

The p-value is calculated using a $t$ distribution with $n - p - 1$ degrees of freedom, where $p$ is the number of predictor coefficients $\beta_1, \ldots, \beta_p$ in the model (the intercept is not counted).

In this example, the p-value is calculated using a $t$ distribution with $85 - 6 - 1 = 78$ degrees of freedom.

Given that $H_0$ is true ($\beta_{living\_area} = 0$), the probability of observing a coefficient at least as extreme as the one we've observed, 65.903, is 0.00009.
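
The p-value can be approximated with the Python standard library alone. The sketch below assumes a two-sided test and $n - p - 1 = 85 - 6 - 1 = 78$ degrees of freedom (6 predictors plus an intercept). It integrates the $t$ density numerically with Simpson's rule; the helper names `t_pdf` and `two_sided_p_value` are made up for this illustration. In practice a statistics library would normally be used instead.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p_value(t_stat, df, steps=2000):
    """P(|T| >= |t_stat|); steps must be even for Simpson's rule."""
    upper = abs(t_stat)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    central_half = total * h / 3      # P(0 <= T <= |t_stat|)
    return 2 * (0.5 - central_half)

n, p = 85, 6          # homes in the data set, predictors in the model
df = n - p - 1        # degrees of freedom (assumed n - p - 1)
p_value = two_sided_p_value(4.124, df)
print(df, p_value)
```

The result should be on the order of 0.0001, in line with the 0.000 shown in the p.value column after rounding to three decimals.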

4️⃣ State the conclusion

The p-value is very small, so we reject $H_0$. The data provide sufficient evidence that living area is a helpful predictor in the model, explaining some of the variability in sale price.

Confidence interval for $\beta_j$

The $C\%$ confidence interval for $\beta_j$ is

$$\hat{\beta}_j \pm t^* \, SE(\hat{\beta}_j)$$

where $t^*$ follows a $t$ distribution with $n - p - 1$ degrees of freedom.

General Interpretation: We are $C\%$ confident that the interval LB to UB contains the population coefficient of $x_j$. Therefore, for every one unit increase in $x_j$, we expect $y$ to change by LB to UB units, holding all else constant.

Confidence interval for `living_area`

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	-7148818.957	3820093.694	-1.871	0.065	-14754041.291	456403.376
bedrooms	-12291.011	9346.727	-1.315	0.192	-30898.915	6316.893
bathrooms	51699.236	13094.170	3.948	0.000	25630.746	77767.726
living_area	65.903	15.979	4.124	0.000	34.091	97.715
lot_size	-0.897	4.194	-0.214	0.831	-9.247	7.453
year_built	3760.898	1962.504	1.916	0.059	-146.148	7667.944
property_tax	1.476	2.832	0.521	0.604	-4.163	7.115

We are 95% confident that for every one additional square foot in living area, we expect the price to increase by $34.09 to $97.71, holding all other characteristics constant.
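
The interval for living_area can be reproduced approximately from the estimate and standard error. The sketch below assumes 78 degrees of freedom (85 homes, 6 predictors, 1 intercept) and finds the critical value $t^*$ by bisection on a numerically integrated $t$ CDF, using only the standard library. The helper names `t_cdf` and `t_critical` are illustrative, not part of any package.

```python
import math

def t_cdf(x, df, steps=2000):
    """CDF of Student's t for x >= 0, via Simpson's rule on the density."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))

    def pdf(u):
        return math.exp(log_const - (df + 1) / 2 * math.log1p(u * u / df))

    h = x / steps
    total = pdf(0.0) + pdf(x)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * pdf(k * h)
    return 0.5 + total * h / 3

def t_critical(level, df):
    """Find t* with P(-t* <= T <= t*) = level by bisection."""
    target = 0.5 + level / 2
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

estimate, std_error = 65.903, 15.979   # living_area row of the table
df = 85 - 6 - 1                        # assumed n - p - 1
t_star = t_critical(0.95, df)
lower = estimate - t_star * std_error
upper = estimate + t_star * std_error
print(round(t_star, 3), round(lower, 2), round(upper, 2))
```

The bounds should agree with the conf.low and conf.high values in the table (34.091 and 97.715) up to rounding of the inputs.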

🛑 Caution: Large sample sizes

If the sample size is large enough, the test will likely result in rejecting $H_0: \beta_j = 0$ even if $x_j$ has a very small effect on $y$.

Consider the practical significance of the result, not just the statistical significance.
Use the confidence interval to draw conclusions instead of relying only on p-values.
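
To see why, consider a deliberately simplified, hypothetical setting that is not from the slides: a single predictor with standard deviation `s_x`, residual standard deviation `sigma`, and a tiny true slope. In that case the standard error of the slope is roughly `sigma / (s_x * sqrt(n - 1))`, so the $t$ statistic grows with $\sqrt{n}$ even though the effect itself does not change.

```python
import math

# Hypothetical values chosen only for illustration.
slope, sigma, s_x = 0.01, 1.0, 1.0

def approx_t(n):
    """Approximate t statistic if the estimate equals the true slope."""
    std_error = sigma / (s_x * math.sqrt(n - 1))
    return slope / std_error

for n in (100, 10_000, 1_000_000):
    print(n, round(approx_t(n), 2))
```

With these made-up numbers the statistic is far below 2 at n = 100 but well above it at n = 1,000,000, although the slope of 0.01 may be of no practical importance.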

🛑 Caution: Small sample sizes

If the sample size is small, there may not be enough evidence to reject $H_0: \beta_j = 0$.

When you fail to reject the null hypothesis, DON'T immediately conclude that the variable has no association with the response.
There may be a linear association that is just not strong enough to detect given your data, or there may be a non-linear association.

Math details

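Regression Model

The multiple linear regression model assumes

$$Y \mid X_1, X_2, \ldots, X_p \sim N(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p, \; \sigma^2_\epsilon)$$

For a given observation $(x_{i1}, x_{i2}, \ldots, x_{ip}, y_i)$, we can rewrite the previous statement as

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2_\epsilon)$$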

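Estimating $\sigma^2_\epsilon$

For a given observation $(x_{i1}, x_{i2}, \ldots, x_{ip}, y_i)$ the residual is

$$e_i = y_i - \hat{y}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \dots + \hat{\beta}_p x_{ip})$$

The estimated value of the regression variance, $\sigma^2_\epsilon$, is

$$\hat{\sigma}^2_\epsilon = \frac{\sum_{i=1}^{n} e_i^2}{n - p - 1}$$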
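
A minimal sketch of the variance estimate, assuming the usual divisor of $n - p - 1$ (observations minus predictors minus the intercept). The residuals below are hypothetical numbers chosen for easy arithmetic, not values from the Levittown model, and `regression_variance` is an illustrative name.

```python
def regression_variance(residuals, p):
    """Sum of squared residuals divided by n - p - 1."""
    n = len(residuals)
    if n - p - 1 <= 0:
        raise ValueError("need more observations than coefficients")
    return sum(e * e for e in residuals) / (n - p - 1)

# Hypothetical residuals from a model with p = 2 predictors.
residuals = [1.0, -1.0, 2.0, -2.0, 0.0]
sigma2_hat = regression_variance(residuals, p=2)
print(sigma2_hat)   # 10 / (5 - 2 - 1) = 5.0
```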

Estimating Coefficients

One way to estimate the coefficients is by taking partial derivatives of the formula

$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left[ y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \dots + \hat{\beta}_p x_{ip}) \right]^2$$

This produces messy formulas, so instead we can use matrix notation for multiple linear regression and estimate the coefficients using rules from linear algebra. For more details, see A Matrix Formulation of the Multiple Regression Model.
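
In matrix form, setting those partial derivatives to zero gives the normal equations $X^\top X \hat{\beta} = X^\top y$, where $X$ is the design matrix with a leading column of ones. The sketch below solves them with plain Gaussian elimination on hypothetical toy data generated exactly from $y = 1 + 2x_1 + 3x_2$, so the fitted coefficients should come back as approximately 1, 2, and 3. The function names are illustrative, and the code assumes $X^\top X$ is invertible; statistical software typically uses more numerically stable methods.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):                     # back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x

def least_squares(rows, y):
    """Coefficients (intercept first) minimizing the sum of squared residuals."""
    X = [[1.0] + list(r) for r in rows]              # prepend intercept column
    k = len(X[0])
    xtx = [[sum(xi[i] * xi[j] for xi in X) for j in range(k)] for i in range(k)]
    xty = [sum(xi[i] * yi for xi, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical toy data: y is an exact linear function of x1 and x2.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 3)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
beta_hat = least_squares(rows, y)
print([round(b, 6) for b in beta_hat])
```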

Recap

Conduct a hypothesis test for $\beta_j$
Calculate a confidence interval for $\beta_j$
Quick overview of math details for MLR
