library(tidyverse)
library(broom)
library(patchwork)
Click on the link provided in the slides to create your own private repo for this exercise.
Go to the ae-03-[GITHUB USERNAME] repo on GitHub that you created.
Click on the green Code button, select Use HTTPS, and click on the clipboard icon to copy the repo URL.
Go to https://vm-manage.oit.duke.edu/containers and log in with your Duke NetID and password. Click to log into the Docker container RStudio - statistics application with Rmarkdown and knitr support. You should now see the RStudio environment.
Go to File ➡️ New Project ➡️ Version Control ➡️ Git.
Copy and paste the URL of your assignment repo into the dialog box Repository URL. You can leave Project Directory Name empty. It will default to the name of the GitHub repo.
Click Create Project, and the files from your GitHub repo will be displayed in the Files pane in RStudio.
Before we start the exercise, we need to configure your git so that RStudio can communicate with GitHub. This requires two pieces of information: your email address and your name.
Type the following lines of code in the Console in RStudio, filling in your GitHub username and the email address associated with your GitHub account.
library(usethis)
use_git_config(user.name = "github username", user.email = "your email")
RStudio and GitHub can now communicate with each other and you are ready to do the exercise!
porsche <- read_csv("data/PorschePrice.csv")
In this AE, we will analyze the relationship between mileage and price for 30 Porsches for sale. More specifically, we want to use the mileage to understand variation in the price. The data set includes the following variables:
Price: Asking price for the car (in $1,000s)
Age: Age of the car (in years)
Mileage: Previous miles driven (in 1,000s)
Let’s start by getting a quick view of the data.
glimpse(porsche)
## Rows: 30
## Columns: 3
## $ Price <dbl> 69.4, 56.9, 49.9, 47.4, 42.9, 36.9, 83.0, 72.9, 69.9, 67.9, 6…
## $ Age <dbl> 3, 3, 2, 4, 4, 6, 0, 0, 2, 0, 2, 2, 4, 3, 10, 11, 4, 4, 10, 3…
## $ Mileage <dbl> 21.50, 43.00, 19.90, 36.00, 44.00, 49.80, 1.30, 0.67, 13.40, …
Let’s look at the distribution of each variable.
p1 <- ggplot(data = porsche, aes(x = Mileage)) +
  geom_histogram() +
  labs(title = "Mileage of Porsches",
       x = "Mileage (in 1,000s)")
p2 <- ggplot(data = porsche, aes(x = Price)) +
  geom_histogram() +
  labs(title = "Price of Porsches",
       x = "Price (in $1,000s)")
p1 + p2 # using the patchwork package
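With no arguments, geom_histogram() falls back to 30 bins and prints a message suggesting a better value, which is a lot of bins for 30 cars. The sketch below shows one way to set the bin width explicitly. It is only an illustration: porsche_head is a hypothetical name holding the first nine rows shown in the glimpse() output above (not the full data set), and the bin width of 10 (i.e., 10,000 miles) is an assumed choice, not one prescribed by the exercise.

```r
library(ggplot2)

# Hypothetical subset: the first nine rows visible in the glimpse() output.
porsche_head <- data.frame(
  Price   = c(69.4, 56.9, 49.9, 47.4, 42.9, 36.9, 83.0, 72.9, 69.9),
  Age     = c(3, 3, 2, 4, 4, 6, 0, 0, 2),
  Mileage = c(21.50, 43.00, 19.90, 36.00, 44.00, 49.80, 1.30, 0.67, 13.40)
)

# binwidth = 10 makes each bar cover 10 (thousand) miles;
# boundary = 0 lines the bin edges up at 0, 10, 20, ...
p_mileage <- ggplot(data = porsche_head, aes(x = Mileage)) +
  geom_histogram(binwidth = 10, boundary = 0) +
  labs(title = "Mileage of Porsches (first nine rows only)",
       x = "Mileage (in 1,000s)")

p_mileage
```

The same arguments could be added to the geom_histogram() calls for p1 and p2 when working with the full porsche data.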
porsche %>%
  summarise(mean_mileage = mean(Mileage),
            sd_mileage = sd(Mileage),
            median_mileage = median(Mileage),
            IQR_mileage = IQR(Mileage))
## # A tibble: 1 x 4
##   mean_mileage sd_mileage median_mileage IQR_mileage
##          <dbl>      <dbl>          <dbl>       <dbl>
## 1         34.9       23.5           33.2        29.1
porsche %>%
  summarise(mean_price = mean(Price),
            sd_price = sd(Price),
            median_price = median(Price),
            IQR_price = IQR(Price))
## # A tibble: 1 x 4
##   mean_price sd_price median_price IQR_price
##        <dbl>    <dbl>        <dbl>     <dbl>
## 1       50.5     15.5         51.9        18
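The two summaries above repeat the same four statistics for each variable. As a minimal sketch, assuming dplyr 1.0 or later (for across()), both could be computed in one call. porsche_head is a hypothetical name for the first nine rows shown in the glimpse() output, so the values it produces will differ from the full-data results printed above.

```r
library(dplyr)

# Hypothetical subset: the first nine rows visible in the glimpse() output.
porsche_head <- data.frame(
  Price   = c(69.4, 56.9, 49.9, 47.4, 42.9, 36.9, 83.0, 72.9, 69.9),
  Mileage = c(21.50, 43.00, 19.90, 36.00, 44.00, 49.80, 1.30, 0.67, 13.40)
)

# Apply the same four functions to both columns; the result has one row and
# columns named <variable>_<statistic>, e.g. Mileage_mean and Price_IQR.
summary_wide <- porsche_head %>%
  summarise(across(c(Mileage, Price),
                   list(mean = mean, sd = sd, median = median, IQR = IQR)))

summary_wide
```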
Describe the distribution of Mileage. Include the shape, center, spread, and any outliers.
Describe the distribution of Price. Include the shape, center, spread, and any outliers.
Before fitting a linear model, let’s look at the relationship between Mileage and Price.
ggplot(data = porsche, aes(x = Mileage, y = Price)) +
  geom_point() +
  labs(x = "Mileage (in 1,000s)",
       y = "Price (in $1,000s)",
       title = "Price vs. Mileage for Porsches")
What sign do you expect the slope to have?
Around what value do you expect the intercept to take?
We will use the lm function to fit the linear model. The syntax for the lm function is lm(Y ~ X, data = dataset).
Then, we will use the tidy function from the broom package to print the results in a “tidy” format, i.e., each row contains the statistics for a model coefficient.
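To illustrate the lm(Y ~ X, data = dataset) syntax and the shape of the tidy output without giving away the exercise, here is a minimal sketch on R's built-in mtcars data, which is unrelated to the Porsche data. The names fit and coef_table are hypothetical.

```r
library(broom)

# Response (mpg) goes on the left of ~, predictor (wt) on the right.
fit <- lm(mpg ~ wt, data = mtcars)

# tidy() returns one row per coefficient (intercept and slope), with columns
# term, estimate, std.error, statistic, and p.value.
coef_table <- tidy(fit)
coef_table
```

The same pattern applies to the Porsche data once you decide which variable is the response and which is the predictor.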
Fill in the lm function to fit a regression model and assign it to price_model. Remove eval = FALSE so the code chunk runs when you knit.
Knit the file to see the model output.
price_model <- lm(___________)
tidy(price_model)
Commit the updated files with a short informative commit message, and push the updates to GitHub.
Write the model equation.
Interpret the slope in the context of the data.
The intercept is the mean price for what group of Porsches?
We can use the augment function to get the predicted values (.fitted), residuals (.resid), and other statistics we’ll use later to assess the model fit.
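As a rough sketch of how the .fitted and .resid columns relate, again using the built-in mtcars data rather than the Porsche data: the residual is the observed response minus the fitted value, so a negative residual means the model predicted too high. The names fit, fit_aug, first_obs, manual_resid, and over_predicted are hypothetical.

```r
library(broom)
library(dplyr)

fit <- lm(mpg ~ wt, data = mtcars)

# augment() returns the model data plus .fitted, .resid, and other diagnostics.
fit_aug <- augment(fit)

first_obs <- fit_aug %>%
  slice(1) %>%
  mutate(
    manual_resid = mpg - .fitted,  # observed minus predicted
    over_predicted = .resid < 0    # TRUE when the prediction exceeds the observed value
  )

first_obs %>%
  select(mpg, .fitted, .resid, manual_resid, over_predicted)
```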
Fill in the model name in the augment function. Then, remove eval = FALSE so the code chunk runs when you knit.
price_aug <- augment(price_model)
Consider the first observation.
price_aug %>%
  slice(1)
What is the residual? How was it calculated?
Did the model over- or under-predict the price of this Porsche?
Knit your Rmd file to view the updated output. Commit your changes with an informative commit message, and push the updated files to GitHub.
The data used in this exercise is from Stat2: Building Models for a World of Data.