Description
Submit two files through Blackboard: (a) the .Rmd R Markdown file with your answers and code
and (b) a Word document of the knitted R Markdown file. Name each file
“HW[X]-[Full Name]-[Class Time]” and include those details in the body of the file.
Complete your work individually and comment your code for full credit. For an example of
how to format your homework see the files posted with Lecture 1 on Blackboard. Show all
of your code in the knitted Word document.
Read the New England Journal of Medicine article, “Chocolate Consumption, Cognitive
Function, and Nobel Laureates” (Messerli, F.H., Vol. 367(16), 1562-1564; 2012) which is
posted with this assignment. We will be using a reconstruction of Messerli’s data. The
variables in the data set you will use (file: “nobel chocolate.txt” on Blackboard) are
“country”, “nobel rate”, and “chocolate”.
The information in the data set comes from several different sources.
The number of Nobel prize winners is from Wikipedia and includes winners through November 2012,
population information (used to compute the “nobel rate” variable) is from the
World Bank, and chocolate market size is from Euromonitor International’s Passport
Database.
Goal: In this assignment, you will be replicating Messerli’s analysis.
1. According to Messerli, what is the variable “number of Nobel laureates per capita”
supposed to measure? Do you think it is a reasonable measure? Justify your answer.
2. Are countries without Nobel prize recipients included in Messerli’s study? If not, what
types of bias(es) would that introduce?
3. Are the number of Nobel laureates per capita and chocolate consumption per capita
measured on the same temporal scale? If not, how could this affect the analysis?
4. Create a table of summary statistics for the following variables: Nobel laureates per
capita, GDP per capita, and chocolate consumption. Include the statistics: minimum,
maximum, median, mean, and standard deviation. Remember to include the units of
measurement in your table.
5. Create histograms for the following variables: Nobel laureates per capita, GDP per
capita, and chocolate consumption. Describe the shape of the distributions.
6. Construct a scatterplot of Nobel laureates per capita vs. chocolate consumption. Label
Sweden on your plot (on the computer, not by hand). Compute the correlation between these two variables and add it to the scatterplot. How would you describe this
relationship? Is correlation an appropriate measure? Why or why not?
7. What is Messerli’s correlation value? (Use the correlation value that includes Sweden.)
Why is your correlation different?
8. Why does Messerli consider Sweden an outlier? How does he explain it?
9. Regress Nobel laureates per capita against chocolate consumption (include Sweden):
(a) What is the regression equation? (Include units of measurement.)
(b) Interpret the slope.
(c) Conduct a residual analysis to check the regression assumptions. Make all plots
within one figure. Can we conduct hypothesis tests for this regression model?
Justify your answer.
(d) Is the slope significant (conduct a hypothesis test and include your regression output
in your answer)? Test at the α = 0.05 level and remember to specify the hypotheses
you are testing.
(e) Add the regression line to your scatterplot.
10. Using your model, what is the expected number of Nobel laureates for Sweden?
What is the residual? (Remember to include units of measurement.)
11. Now we will see if the variable GDP per capita (i.e., “GDP cap”) is a better way to
predict Nobel laureates.
(a) In one figure, construct a scatterplot of (i) Nobel laureates vs. GDP per capita and
(ii) log(Nobel laureates) vs. GDP per capita. Which plot is more linear? Label
Sweden on both plots. On the second plot, label the two countries which appear
on the bottom left corner.
(b) Is Sweden still an outlier? Justify your answer.
(c) Regress log(Nobel laureates) against GDP per capita. Provide the output and add
the regression line to your scatterplot. (In practice, we would do a residual analysis
here, but we will skip it to reduce the length of this assignment.)
(d) The log-y model is a multiplicative model: $\log(y) = \beta_0 + \beta_1 x$ is equivalent to
$y = e^{\beta_0 + \beta_1 x}$. For such a model, the slope is interpreted as follows: a unit
increase in x changes y by approximately $(e^{\beta_1} - 1) \times 100\%$. For your regression
model, interpret the slope (remember to include units of measurement).
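The percent-change interpretation in 11(d) follows from comparing the model's prediction at x and at x + 1. The sketch below assumes log denotes the natural logarithm. The value 0.05 for the slope is purely hypothetical, chosen to illustrate the arithmetic; it is not an estimate from the assignment data.

```latex
\begin{align*}
\frac{y(x+1)}{y(x)}
  &= \frac{e^{\beta_0 + \beta_1 (x+1)}}{e^{\beta_0 + \beta_1 x}}
   = e^{\beta_1}, \\
\text{percent change in } y
  &= \left( \frac{y(x+1)}{y(x)} - 1 \right) \times 100\%
   = \left( e^{\beta_1} - 1 \right) \times 100\%. \\
\intertext{Hypothetical example with $\beta_1 = 0.05$:}
\left( e^{0.05} - 1 \right) \times 100\% &\approx 5.13\%.
\end{align*}
```

When the slope is small, $e^{\beta_1} - 1 \approx \beta_1$, so the percent change is close to $100 \times \beta_1$; for larger slopes the two differ noticeably and the exponentiated form should be used.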
12. Does increasing chocolate consumption cause an increase in the number of Nobel laureates? Justify your answer.