General instructions

Be sure to read all instructions and quiz questions carefully!

  • Quiz 04 will be released by November 5 12a EST and must be completed by November 8 at 11:59p EST. The quiz is not timed, so you may submit it any time before the deadline.

  • During the time the quiz is available, you will see a Quiz 04 assignment on Gradescope to submit the quiz.

  • The quiz is an individual assignment. You may not discuss the quiz with anyone else (either in the class or not). See the Academic Honesty guidelines in the course syllabus for more information about academic conduct.

  • The quiz is open-book, open-note, so you may use any materials from class (except other people!) as you take the quiz.

  • Write your responses using a Word processor (e.g. Word, Pages, Google Docs, etc.). You do not need to use RStudio/R Markdown or conduct analysis for any part of this quiz. Your quiz submission should be no more than 1 page single-spaced. Any content over 1 page will not be considered for grading.

Scenario

Suppose you are a statistician for an email platform (e.g. Gmail or Hotmail), and you have been tasked with building a model that will be used to determine whether or not an email is spam. If it is determined that an email is spam, it will automatically be put into a junk folder instead of the user’s regular inbox. The company wants to use you model so they’re able to effectively identify spam emails without putting too many non-spam emails in the junk folder.

The Data

The data set contains incoming emails for one user’s email account from January 2012 through March 2012. All personally identifiable information has been removed.

You can find a full list of the variables in the data set at the bottom of the instructions.

Questions

Answer the questions below describing how you would approach the regression analysis. For each question, in addition to stating the choice(s) you would make, include a brief explanation about why you’ve made the choice(s). Your responses should demonstrate your knowledge of the statistical concepts and how they applies in a real world scenario.

Type your responses to the questions below using any word processor (e.g. Word, Pages, Google Docs, etc.). Include your name at the top of the document. You do not need to use R/R Markdown or conduct analysis for any part of this quiz. Your quiz submission should be no more than 1 page single-spaced. Any content over 1 page will not be considered for grading.

Question 1 (5pt) What type of model would you use for this analysis? Explain your choice.

Question 2 (15pt) Describe the process you would use to select variables for the final model. In your description, include the model selection strategy and procedure, any potential interactions you might consider, and an explanation about why you made these choices.

Question 3 (10pt) The goal is to determine whether a particular email is spam and should be sent directly to the junk folder. Describe how you would use the output from your model to classify emails as spam or not spam. Be specific in your description and include explanation about your choices.

Question 4 (10pt) What are some potential limitations to your model or analysis? Limit your response to no more than 3 limitations.

Question 5 (5pt) Based on the limitations from Question 4, what are some next steps you might explore to improve your model?

Grading rubric

Each question will be graded using the following rubric:

  • Full credit: Answer clearly demonstrates a clear understanding of the statistical concepts and how to apply them to this specific scenario.
  • 80% credit: Answer demonstrates an general understanding of the statistical concepts but lacks clarity about how to apply them to this specific scenario.
  • 60%: Answer displays a lack of understanding of the general concepts and/or makes to attempt to apply them to this specific scenario.
  • No credit: No answer provided.

There will also be 5pt for the overall narrative awarded as follows:

  • 5pt: The narrative describes a cohesive and thoughtful analysis.
  • 3pt: The narrative has some inconsistencies or inaccuracies.
  • 1pt: The narrative lacks overall cohesion and/or has major inaccuracies.

Submission

Submit a PDF of your narrative on Gradescope by Sunday, Nov 8 at 11:59p. Be sure your name is at the top of your document.

No late work will be accepted for this quiz.

Full variable list

  • spam - Indicator for whether the email was spam.
  • to_multiple - Indicator for whether the email was addressed to more than the recipient.
  • from - Whether the message was listed as from anyone (this is usually set default for regular outgoing email).
  • cc - Indicator for whether anyone was CCed.
  • sent_email - Indicator for whether the sender had been sent an email in the last 30 days.
  • time - Time at which email was sent.
  • image - The number of images attached.
  • attach - The number of attached files.
  • dollar - The number of times a dollar sign or the word “dollar” appeared in the email.
  • winner - Indicates whether “winner” appeared in the email.
  • inherit - The number of times “inherit” (or an extension, such as “inheritance”) appeared in the email.
  • viagra - The number of times “viagra” appeared in the email.
  • password - The number of times “password” appeared in the email.
  • num_char - The number of characters in the email, in thousands.
  • line_breaks - The number of line breaks in the email (does not count text wrapping).
  • format - Indicates whether the email was written using HTML (e.g. may have included bolding or active links).
  • re_subj - Whether the subject started with “Re:”, “RE:”, “re:”, or “rE:”
  • exclaim_subj - Whether there was an exclamation point in the subject.
  • urgent_subj - Whether the word “urgent” was in the email subject.
  • exclaim_mess - The number of exclamation points in the email message.
  • period_mess - The number of periods in the message.
  • signoff - Whether a sign-off of “Cheers”, “Regards”, or “Best” (also, “Best Regards”) was used.
  • number - Factor variable saying whether there was no number, a small number (under 1 million), or a big number.