The primary goal of today’s lab is to practice with some of the primary data wrangling, data visualization, and modeling functions we will use for regression analysis. It is also another opportunity to continue practicing using RStudio and GitHub before you start collaborating with others.

Topics covered in this lab

Fitting a simple linear regression model using the lm function
Interpreting the slope and intercept
Using statistical inference to draw conclusions about the slope

Getting Started

Each of your assignments will begin with the steps outlined in this section. For more detailed instructions about getting started, see the Lab 01 instructions.

Clone the repo & start new RStudio project

Go to the sta210-fa20 organization on GitHub . Click on the repo with the prefix lab-02-slr-. It contains the starter documents you need to complete the lab.
Click on the green Code button, select Use HTTPS. Click on the clipboard icon to copy the repo URL.
Go to https://vm-manage.oit.duke.edu/containers and login with your Duke NetId and Password.
Click to log into the Docker container STA 210 - Regression Analysis. You should now see the RStudio environment.
Go to File ➡️ New Project ➡️ Version Control ➡️ Git.
Copy and paste the URL of your assignment repo into the dialog box Repository URL.
Click Create Project, and the files from your GitHub repo will be displayed the Files pane in RStudio.

Configure git

One more thing before you get started. We need to configure your git so that RStudio can communicate with GitHub. This requires two pieces of information: your GitHub username and the email address associated with your GitHub account.

Type the following lines of code in the console in RStudio filling in your name and email address.

library(usethis)
use_git_config(user.name = "GitHub username", 
               user.email="your email")

Now you’re ready to start! Open the R Markdown file (the one with extension .Rmd) to begin the assignment.

YAML

⊕The top portion of your R Markdown file (between the three dashed lines) is called YAML. It stands for “YAML Ain’t Markup Language”. It is a human friendly data serialization standard for all programming languages. All you need to know is that this area is called the YAML (we will refer to it as such) and that it contains meta information about your document.

Before we introduce the data, let’s update the name and date in the YAML.

Open the R Markdown (Rmd) file in your project, input your name for the author name and today’s date for the date, and knit the document.

Committing and pushing changes:

Go to the Git pane in your RStudio.
Select all of the files that appear in the Git pane.
Click to commit. Add a short and informative commit message. - Click to Push. This will prompt a dialogue box where you first need to enter your user name, and then your password.

Make sure you push all the files from the Git pane to your assignment repo on GitHub. The Git pane should be empty after you push. If it’s not, click the box next to the remaining files, write an informative commit message, and push.

Packages

We will use the following packages in today’s lab.

library(tidyverse)
library(knitr)
library(broom)
library(MASS) #package containing dataset

Data: Body and heart weight of cats

In today’s lab, we will analyze the cats dataset in the MASS package. This dataset contains the sex, body weight, and heart weight for 144 domestic cats.

When a veterinarian (vet for short) prescribes heart medicine for a cat, the dosage is partially determined by the weight of the cat’s heart. It would be almost impossible for an vet to weigh a cat’s heart in the moment, but it is possible (though still difficult!) to get a cat’s body weight.

We want to fit a regression model to understand the relationship between a cat’s heart weight and body weight. We could use the model to get an estimate of the cat’s heart weight.

Once you’ve loaded the MASS package, you can load the data using the code below:

data(cats)

The cats dataset contains the following variables:

`Sex`	levels M and F
`Bwt`	Body weight in kg
`Hwt`	Heart weight in g

Exercises

Exploratory Data Analysis

What is the response variable? What is the predictor variable?
Plot the distribution of Hwt and calculate the appropriate summary statistics. Describe the distribution including the shape, center, spread, and any outliers.
Plot the distribution of Bwt and calculate the appropriate summary statistics. Describe the distribution including the shape, center, spread, and any outliers.
Create a scatterplot to display the relationship between Hwt and Bwt. Use the scatterplot to describe the relationship between the two variables. Be sure the scatterplot includes informative axis labels and title.

✅ ⬆️ Now is a good time to knit, commit and push your changes to GitHub with a short, informative commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.

Simple Linear Regression

Use the lm function to fit a simple linear regression model for Bwt and Hwt. Complete the code below to assign your model a name, and use the tidy and kable functions to neatly display the model output.

_____ <- lm(_________)
tidy(_____) %>% # output model in a tidy format
  kable(digits = 3) # neatly format the output

Interpret the slope in the context of the problem.
Does it make sense to interpret the intercept? If so, interpret the intercept. Otherwise, briefly explain why not.

✅ ⬆️ Now is another good time to knit, commit and push your changes to GitHub with a short, informative commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.

Inference

We would like to test the following hypotheses:

State the hypotheses in words.
What is the test statistic? Write the equation used to calculate the test statistic.
What distribution was used to calculate the p-value?
State your conclusion for the test in the context of the data.

Now let’s compare the relationship between heart weight and body weight for male and female cats. Use the filter function to create one data frame for female cats and a separate data frame for male cats.
Fit a model of Hwt vs. Bwtfor female cats. Display the 95% confidence interval for the slope using by including conf.int = TRUE in the tidy function.

Interpret the interval in the context of the data.
Fit a model of Hwt vs. Bwt for male cats. Display the 95% confidence interval for the slope using by including conf.int = TRUE in the tidy function.
Does the data provide sufficient evidence that the slope is significantly different for male and female cats? Briefly explain. Hint: Compare the confidence intervals.

✅ ⬆️ Knit, commit and push your changes to GitHub with an appropriate commit message again. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.

Submission

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. we will be checking these to make sure you have been practicing how to commit and push changes.

Once your work is finalized in your GitHub repo, you will submit it to Gradescope. To submit your assignment:

Go to http://www.gradescope.com and click Log in in the top right corner.
Click School Credentials ➡️ Duke NetID and log in using your NetID credentials.
Click on your STA 210 course.
Click on the assignment, and you’ll be prompted to submit it.
Mark the pages associated with each exercise. All of the pages of your lab should be associated with at least one question (i.e., should be “checked”).
Select the first page of your .pdf submission to be associated with the “Overall” section.

Lab 02: Simple linear regression

due Wed, Sep 2 at 11:59p