library(tidyverse)
library(broom)
library(patchwork)

Data analysis life cycle

Data science life cycle from [*R for Data Science*](https://r4ds.had.co.nz/) with modifications from *The Art of Statistics: How to Learn from Data*

Clone a repo + start a new project

Configure git

Before we start the exercise, we need to configure git so that RStudio can communicate with GitHub. This requires two pieces of information: your email address and your name.

Type the following lines of code in the RStudio Console, filling in your GitHub username and the email address associated with your GitHub account.

library(usethis)
use_git_config(user.name = "github username", user.email = "your email")

RStudio and GitHub can now communicate with each other and you are ready to do the exercise!
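If you want to confirm that the settings took effect, one option is the situation report from the usethis package. This is only a sketch of a check, not a required step of the exercise; it assumes usethis is installed and prints, among other things, the git user name and email that are currently configured.

```r
library(usethis)

# Print a report of the current git/GitHub setup.
# Look for the name and email you entered in use_git_config().
git_sitrep()
```

If the name or email shown is not what you expect, rerun use_git_config() with the corrected values.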

Price vs. Mileage

porsche <- read_csv("data/PorschePrice.csv")

In this AE, we will analyze the relationship between mileage and price for 30 Porsches for sale. More specifically, we want to use the mileage to understand variation in the price. The data set includes the following variables:

  - Price: asking price (in $1,000s)
  - Age: age of the car (in years)
  - Mileage: miles driven (in 1,000s)

Let’s start by getting a quick view of the data.

glimpse(porsche)
## Rows: 30
## Columns: 3
## $ Price   <dbl> 69.4, 56.9, 49.9, 47.4, 42.9, 36.9, 83.0, 72.9, 69.9, 67.9, 6…
## $ Age     <dbl> 3, 3, 2, 4, 4, 6, 0, 0, 2, 0, 2, 2, 4, 3, 10, 11, 4, 4, 10, 3…
## $ Mileage <dbl> 21.50, 43.00, 19.90, 36.00, 44.00, 49.80, 1.30, 0.67, 13.40, …

Exploratory data analysis

  1. Which variable is the response? Which is the predictor?

Let’s look at the distribution of each variable.

p1 <- ggplot(data = porsche, aes(x = Mileage)) + 
  geom_histogram() + 
  labs(title = "Mileage of Porsches", 
       x = "Mileage (in 1,000s)")

p2 <- ggplot(data = porsche, aes(x = Price)) + 
  geom_histogram() + 
  labs(title = "Price of Porsches", 
       x = "Price (in $1,000s)")

p1 + p2  # show the two histograms side by side using the patchwork package

porsche %>%
  summarise(mean_mileage = mean(Mileage), 
            sd_mileage = sd(Mileage), 
            median_mileage = median(Mileage), 
            IQR_mileage = IQR(Mileage))
## # A tibble: 1 x 4
##   mean_mileage sd_mileage median_mileage IQR_mileage
##          <dbl>      <dbl>          <dbl>       <dbl>
## 1         34.9       23.5           33.2        29.1
porsche %>%
  summarise(mean_price = mean(Price), 
            sd_price = sd(Price), 
            median_price = median(Price), 
            IQR_price = IQR(Price))
## # A tibble: 1 x 4
##   mean_price sd_price median_price IQR_price
##        <dbl>    <dbl>        <dbl>     <dbl>
## 1       50.5     15.5         51.9        18
  1. Describe the distribution of Mileage. Include the shape, center, spread, and any outliers.
  2. Describe the distribution of Price. Include the shape, center, spread, and any outliers.

Before fitting a linear model, let’s look at the relationship between Mileage and Price.

ggplot(data = porsche, aes(x = Mileage, y = Price)) +
  geom_point() + 
  labs(x = "Mileage (in 1,000s)", 
       y = "Price (in $1,000s)", 
       title = "Price vs. Mileage for Porsches")

  1. What sign do you expect the slope to have?

  2. Approximately what value do you expect the intercept to be?

Fitting a simple linear regression model

We will use the lm function to fit the linear model. The syntax for the lm function is lm(Y ~ X, data = dataset), where Y is the response variable and X is the predictor variable.

Then, we will use the tidy function from the broom package to print the results in a “tidy” format, i.e., each row contains the statistics for one model coefficient.
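To see the lm and tidy pattern in action before filling in your own model, here is a minimal sketch on R's built-in mtcars data rather than the Porsche data. The names example_model and example_coefs are placeholders chosen for this illustration.

```r
library(broom)

# Illustration on the built-in mtcars data (not the Porsche data):
# model fuel efficiency (mpg) as a linear function of weight (wt).
example_model <- lm(mpg ~ wt, data = mtcars)

# tidy() returns one row per coefficient: the intercept and the slope.
example_coefs <- tidy(example_model)
example_coefs
```

The term column identifies each coefficient and the estimate column holds its fitted value; the same structure applies to the model you fit below.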

Fill in the lm function to fit a regression model and assign it to price_model. Remove eval = FALSE, so the code chunk runs when you knit.

Knit the file to see the model output.

price_model <- lm(___________)
tidy(price_model)

Commit the updated files with a short informative commit message, and push the updates to GitHub.

  1. Write the model equation.

  2. Interpret the slope in the context of the data.

  3. The intercept is the mean price for what group of Porsches?

Residuals

We can use the augment function to get the predicted values (.fitted), residuals (.resid) and other statistics we’ll use later to assess the model fit.
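As a rough illustration of what augment returns, the sketch below again uses the built-in mtcars data instead of the Porsche data. The names example_model, example_aug, and example_preview are placeholders for this illustration only.

```r
library(broom)

# Fit an example model on the built-in mtcars data (not the Porsche data).
example_model <- lm(mpg ~ wt, data = mtcars)

# augment() returns the data used in the model plus
# observation-level columns such as .fitted and .resid.
example_aug <- augment(example_model)

# Preview the response, the predictor, and the two new columns
# for the first three observations.
example_preview <- head(example_aug[, c("mpg", "wt", ".fitted", ".resid")], 3)
example_preview
```

The output has one row per observation in the original data, so each car gets its own predicted value and residual.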

Fill in the model name in the augment function. Then, remove eval = FALSE, so the code chunk runs when you knit.

price_aug <- augment(price_model)

Consider the first observation.

price_aug %>%
  slice(1)
  1. What is the residual? How was it calculated?

  2. Did the model overpredict or underpredict the price of this Porsche?

Knit your Rmd file to view the updated output. Commit your changes with an informative commit message, and push the updated files to GitHub.

The data used in this exercise is from *Stat2: Building Models for a World of Data*.