Problem 1 [25%]
In this exercise you will create some simulated data and will fit simple linear regression models to it. Make
sure to use set.seed(1) [P: np.random.seed(1)] prior to starting part (1) to ensure consistent results.
1. Using the rnorm() [P: np.random.normal] function, create a vector, x, containing 100 observations drawn from a N(0, 3) distribution (a normal distribution with mean 0 and standard deviation 3). This represents a feature, X.
2. Using the rnorm() function, create a vector, eps, containing 100 observations drawn from a N(0, 0.5) distribution, i.e. a normal distribution with mean zero and standard deviation 0.5.
3. Using x and eps, generate a vector y according to the model
   Y = −2 + 0.6X + ε.
   What is the length (number of elements) of y? What are the values of β0, β1 in the equation above (intercept and slope)?
4. Create a scatterplot displaying the relationship between x and y. Comment on what you observe.
5. Fit a least squares linear model to predict y using x. Comment on the model obtained. How do β̂0, β̂1 compare to β0, β1?
6. Display the least squares line on the scatterplot obtained in 4.
7. Now fit a polynomial regression model that predicts y using x and x². Is there evidence that the quadratic term improves the model fit? Explain your answer.
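The simulation and fitting steps above can be sketched in Python with NumPy (note that np.random.seed(1) will not reproduce the same draws as R's set.seed(1), so the exact numbers below differ from an R session; only the workflow is illustrated):

```python
import numpy as np

np.random.seed(1)

x = np.random.normal(0.0, 3.0, 100)    # part 1: X ~ N(0, sd = 3)
eps = np.random.normal(0.0, 0.5, 100)  # part 2: noise term
y = -2 + 0.6 * x + eps                 # part 3: true beta0 = -2, beta1 = 0.6

# Part 5: least squares fit; np.polyfit returns coefficients
# highest-degree first, so degree 1 gives (slope, intercept).
b1, b0 = np.polyfit(x, y, 1)

# Part 7: quadratic fit for comparison (coefficients of x^2, x, 1).
c2, c1, c0 = np.polyfit(x, y, 2)

print(len(y), b0, b1, c2)
```

With this much data and so little noise, the estimated intercept and slope should land close to the true −2 and 0.6, and the quadratic coefficient close to zero.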
Optional Problem O1 [30%]
This problem can be substituted for Problem 1 above, for up to 5 points extra credit. At most one of the
problems 1 and O1 will be considered.
Read Chapter 1 and solve Exercises 1.6 and 1.10 in [Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer].
Problem 2 [25%]
Read through Section 2.3 in ISL. Load the Auto data set and make sure to remove missing values from the
data. Then answer the following questions:
1. Which predictors are quantitative and which ones are qualitative?
2. What is the range, mean, and standard deviation of each predictor? Use the range() function [P: pandas.DataFrame.min and pandas.DataFrame.max].
3. Investigate the predictors graphically using plots. Create plots highlighting relationships between predictors. See the R GGPlot cheat sheet in the references below.
4. Compute the matrix of correlations between variables using the function cor() [P: pandas.DataFrame.corr].
Exclude the name variable.
5. Use the lm() function to perform a multiple linear regression with mpg as the response. [P: using the rpy package is acceptable] Exclude name as a predictor, since it is qualitative. Briefly comment on the output: Is there a relationship between the predictors and the response? What does the coefficient for the year variable suggest?
6. Use the symbols * and : to fit linear regression models with interaction effects. What do you observe?
7. Try a few different transformations of the variables, such as log(X), √X, X². What do you observe?
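The pandas workflow for parts 1–4 can be sketched as follows. Since the grading environment's path to the Auto data varies, a tiny stand-in frame is used here; in practice the real set is loaded with something like pd.read_csv("Auto.csv", na_values="?") (the file location and na_values marker are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Auto data set; column names follow ISL.
auto = pd.DataFrame({
    "mpg":        [18.0, 15.0, np.nan, 16.0, 17.0],
    "horsepower": [130, 165, 150, 150, 140],
    "year":       [70, 70, 71, 71, 72],
    "name":       ["a", "b", "c", "d", "e"],
})

auto = auto.dropna()                    # remove rows with missing values

quant = auto.drop(columns="name")       # exclude the qualitative name column
summary = quant.agg(["min", "max", "mean", "std"])  # part 2 summaries
corr = quant.corr()                     # part 4: correlation matrix
```

The same drop/agg/corr calls apply unchanged once the real file is read in.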
Problem 3 [25%]
Using equation (3.4) in ISL, argue that in the case of simple linear regression, the least squares line always passes through the point (x̄, ȳ).
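For reference, equation (3.4) in ISL gives the least squares coefficient estimates:

```latex
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
                    {\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.
```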
Problem 4 [25%]
It is claimed in the ISL book that in the case of simple linear regression of Y onto X, the R² statistic is equal to the square of the correlation between X and Y (3.18). Prove that this is the case. For simplicity, you may assume that x̄ = ȳ = 0.
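As a starting point, recall the two quantities being compared, written here under the simplifying assumption x̄ = ȳ = 0 (these are the standard definitions from ISL, restated, not the proof itself):

```latex
R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}
    = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i y_i^2},
\qquad
\mathrm{Cor}(X, Y) = \frac{\sum_i x_i y_i}
                          {\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}.
```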
Each reference is a link. Please open the PDF in a viewer if it is not working on the website.
1. R GGPlot cheat sheet
2. Python Pandas data visualization
3. R For Data Science