## Description

Submit two files through Blackboard: (a) the .Rmd R Markdown file with your answers and code, and (b) a Word document of the knitted R Markdown file. Name both files “HW[X]-[Full Name]-[Class Time]” and include those details in the body of each file.

Complete your work individually and comment your code for full credit. For an example of how to format your homework, see the files posted with Lecture 1 on Blackboard. Show all of your code in the knitted Word document.

This assignment has two parts. In the first part, you will build a logistic regression model using the data set “SpeedDating.csv”. In the second part, you will construct a one-way ANOVA model using the data set “kudzu.xls”.


### Part 1: Logistic Regression

In speed dating, participants meet many people, each for a few minutes, and then decide who they would like to see again. The data set you will be working with contains information on speed dating experiments conducted on graduate and professional students. Each person in the experiment met with 10-20 randomly selected people of the opposite sex (only heterosexual pairings) for four minutes. After each speed date, each participant filled out a questionnaire about the other person.

Your goal is to build a model to predict which pairs of daters want to meet each other again (i.e., have a second date). The list of variables is:

We will be using a reduced version of this experimental data with 276 unique male-female date pairs. In the file “SpeedDating.csv”, the variable names carry either “M” for male or “F” for female. For example, “LikeM” refers to the “Like” variable as answered by the male participant (about the female participant). Treat the rating scale variables (such as “PartnerYes”, “Attractive”, etc.) as numerical variables rather than categorical ones in your analysis.
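
The naming convention and the numeric treatment of rating variables can be sketched as follows. This is only an illustration on a made-up data frame: the values are invented, and the column names “AttractiveM” and “AttractiveF” are assumed from the M/F convention rather than taken from the actual file.

```r
# Toy stand-in for SpeedDating.csv; all values are made up for illustration.
dating <- data.frame(
  LikeM       = c("7", "8", "5", NA),
  LikeF       = c("6", "9", "4", "7"),
  AttractiveM = c("6", "12", "5", "8"),  # 12 is deliberately off the 1-10 scale
  AttractiveF = c("7", "8", NA, "6"),
  stringsAsFactors = FALSE
)

# Pick out columns by their M/F suffix. In the real file this pattern may
# also match non-rating columns ending in M or F, so trim the list as needed.
rating_cols <- grep("(M|F)$", names(dating), value = TRUE)

# Treat the ratings as numeric rather than categorical.
dating[rating_cols] <- lapply(dating[rating_cols], as.numeric)

# Count missing values and off-scale responses for each variable.
n_missing   <- colSums(is.na(dating[rating_cols]))
n_off_scale <- sapply(dating[rating_cols],
                      function(x) sum(x < 1 | x > 10, na.rm = TRUE))
print(rbind(n_missing, n_off_scale))
```

With the real data, `dating` would instead come from `read.csv("SpeedDating.csv")`, and how to handle any off-scale or missing responses remains your decision to justify.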


1. Based on the variable “Decision”, fill out the contingency table below. What percentage of dates ended with both people wanting a second date?

   | | Decision made by female: No | Decision made by female: Yes |
   |---|---|---|
   | Decision made by male: No | | |
   | Decision made by male: Yes | | |

2. A second date is planned only if both people within the matched pair want to see each other again. Make a new column in your data set and call it “second.date”. Values in this column should be 0 if there will be no second date, 1 if there will be a second date. Construct a scatterplot for each numerical variable where the male values are on the x-axis and the female values are on the y-axis. Observations in your scatterplot should have a different color (or pch value) based on whether or not there will be a second date. Describe what you see. (Note: Jitter your points just for making these plots.)

3. Many of the numerical variables are on rating scales from 1 to 10. Are the responses within these ranges? If not, what should we do with these responses? Is there any missing data? If so, how many observations and for which variables?

4. What are the possible race categories in your data set? Is there any missing data? If so, how many observations and what should you do with them? Make a mosaic plot with female and male race. Describe what you see.

5. Use logistic regression to construct a model for “second.date” (i.e., “second.date” should be your response variable). Incorporate the discoveries and decisions you made in questions 2, 3, and 4. Explain the steps you used to determine the best model, include the summary output for your final model only, check your model assumptions, and evaluate your model by running the relevant hypothesis tests. Do not use “Decision” as an explanatory variable.

6. Redo question (1) using only the observations used to fit your final logistic regression model. What is your sample size? Does the number of explanatory variables in your model follow our rule of thumb? Justify your answer.

7. Interpret the slopes in your model. Which explanatory variables increase the probability of a second date? Which ones decrease it? Is this what you expected to find? Justify.

8. Construct an ROC curve and compute the AUC. Determine the best threshold for classifying observations (i.e., second date or no second date) based on the ROC curve. Justify your choice of threshold. For your chosen threshold, compute (a) accuracy, (b) sensitivity, and (c) specificity.
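
The mechanics behind questions 1, 2, 5, and 8 can be sketched in base R on simulated data. Everything here is an assumption made for illustration: the data are randomly generated, the column names “DecisionM” and “DecisionF” are hypothetical, the two-predictor model is not a recommended final model, and Youden's J is just one of several defensible ways to pick a threshold.

```r
set.seed(1)  # simulated data only; not the real speed dating results
n <- 200
toy <- data.frame(LikeM = sample(1:10, n, replace = TRUE),
                  LikeF = sample(1:10, n, replace = TRUE))
# Hypothetical decision columns: a "yes" is more likely when Like is high.
toy$DecisionM <- rbinom(n, 1, plogis(-3 + 0.5 * toy$LikeM))
toy$DecisionF <- rbinom(n, 1, plogis(-3 + 0.5 * toy$LikeF))

# Question 1: 2x2 contingency table and percentage of mutual "yes" decisions.
tab <- table(Male = toy$DecisionM, Female = toy$DecisionF)
pct_both_yes <- 100 * tab["1", "1"] / sum(tab)

# Question 2: second.date is 1 only when both partners said yes.
toy$second.date <- as.integer(toy$DecisionM == 1 & toy$DecisionF == 1)
plot(jitter(toy$LikeM), jitter(toy$LikeF),
     col = ifelse(toy$second.date == 1, "red", "grey40"),
     xlab = "LikeM", ylab = "LikeF")

# Question 5: logistic regression with second.date as the response.
fit <- glm(second.date ~ LikeM + LikeF, data = toy, family = binomial)
print(summary(fit))

# Question 8: ROC curve built by sweeping the classification threshold.
p_hat <- fitted(fit)
thresholds <- sort(unique(c(0, p_hat, 1)), decreasing = TRUE)
sens <- sapply(thresholds, function(t) mean(p_hat[toy$second.date == 1] >= t))
spec <- sapply(thresholds, function(t) mean(p_hat[toy$second.date == 0] < t))
fpr <- 1 - spec
plot(fpr, sens, type = "l", xlab = "1 - specificity", ylab = "Sensitivity")
abline(0, 1, lty = 2)

# AUC by the trapezoidal rule over the ROC points.
auc <- sum(diff(fpr) * (head(sens, -1) + tail(sens, -1)) / 2)

# One possible threshold rule (an assumption): maximise Youden's J.
best <- which.max(sens + spec - 1)
cutoff <- thresholds[best]
pred <- as.integer(p_hat >= cutoff)
accuracy    <- mean(pred == toy$second.date)
sensitivity <- sens[best]
specificity <- spec[best]
```

Packages such as pROC offer the same ROC and AUC calculations; the manual version is shown only to make each step visible.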


### Part 2: One-Way ANOVA

Kudzu is a plant that was imported to the United States from Japan and now covers over seven million acres in the South. The plant contains chemicals called isoflavones that have been shown to have beneficial effects on bones. One study used three groups of rats to compare a control group with rats that were fed either a low dose or a high dose of isoflavones from kudzu. One of the outcomes examined was bone mineral density in the femur (in grams per square centimeter). Rats were randomly assigned to one of the three groups. The data can be found in “kudzu.jmp”.

9. Identify the response variable.

10. Identify the factors (and levels) in the experiment.

11. How many treatments are included in the experiment?

12. What type of experimental design is employed?

13. Compute the mean, standard deviation, and sample size for each treatment group and put the results into a table. Remember to include the units of measurement.

14. Construct side-by-side box plots with connected means. Describe what you see.

15. Are the one-way ANOVA model assumptions satisfied? Justify your answer.

16. Run a one-way ANOVA model and discuss your results. (Let α = 0.01; remember to include your hypotheses, and identify the test statistic, degrees of freedom, and p-value.)

17. Use Tukey’s multiple-comparisons method to compare the three groups (include the visual results for the Tukey method). Which groups (if any) have significantly different means?
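
The mechanics of questions 13 through 17 can be sketched in base R as follows. The data are simulated, not the kudzu measurements, and the column names `group` and `bmd`, the group labels, and the group size of 10 are all hypothetical. Setting `conf.level = 0.99` for the Tukey intervals mirrors α = 0.01 from question 16, which is an assumption since question 17 does not state a level.

```r
set.seed(2)  # simulated values; not the real kudzu measurements
rats <- data.frame(
  group = factor(rep(c("Control", "LowDose", "HighDose"), each = 10),
                 levels = c("Control", "LowDose", "HighDose")),
  bmd   = c(rnorm(10, 0.220, 0.01),   # hypothetical densities in g/cm^2
            rnorm(10, 0.220, 0.01),
            rnorm(10, 0.235, 0.01))
)

# Question 13: mean, standard deviation, and sample size per group.
summary_tab <- do.call(rbind, lapply(split(rats$bmd, rats$group), function(x)
  data.frame(mean = mean(x), sd = sd(x), n = length(x))))
print(summary_tab)

# Question 14: side-by-side box plots with the group means connected.
boxplot(bmd ~ group, data = rats, ylab = "Bone mineral density (g/cm^2)")
lines(seq_len(nlevels(rats$group)), summary_tab$mean, type = "b", pch = 19)

# Question 16: one-way ANOVA (F statistic, degrees of freedom, p-value).
fit <- aov(bmd ~ group, data = rats)
print(summary(fit))

# Question 15: residual checks for constant variance and normality.
plot(fitted(fit), residuals(fit), xlab = "Fitted", ylab = "Residuals")
qqnorm(residuals(fit)); qqline(residuals(fit))

# Question 17: Tukey's pairwise comparisons and their interval plot.
tukey <- TukeyHSD(fit, conf.level = 0.99)
print(tukey)
plot(tukey)
```

Intervals in the Tukey plot that do not cross zero correspond to pairs of groups whose means differ significantly at the chosen level.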
