## Description

## Problem 1 [25%]

In this exercise you will create some simulated data and fit simple linear regression models to it. Make sure to use set.seed(1) [P: np.random.seed(1)] prior to starting part (1) to ensure consistent results.

1. Using the rnorm() [P: np.random.normal] function, create a vector, x, containing 100 observations drawn from a N(0, 3) distribution (a normal distribution with mean 0 and standard deviation √3). This represents a feature, X.

2. Using the rnorm() function, create a vector, eps, containing 100 observations drawn from a N(0, 0.5) distribution, i.e. a normal distribution with mean zero and standard deviation √0.5.

3. Using x and eps, generate a vector y according to the model: Y = −2 + 0.6X + ε. What is the length (number of elements) of y? What are the values of β0, β1 in the equation above (intercept and slope)?

4. Create a scatterplot displaying the relationship between x and y. Comment on what you observe. [P: see [2]]

5. Fit a least squares linear model to predict y using x. Comment on the model obtained. How do β̂0, β̂1 compare to β0, β1?

6. Display the least squares line on the scatterplot obtained in 4.

7. Now fit a polynomial regression model that predicts y using x and x². Is there evidence that the quadratic term improves the model fit? Explain your answer.
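As a rough guide, the simulation and fits above can be sketched with NumPy alone (plotting omitted); np.polyfit stands in here for the lm() call the R workflow would use:

```python
import numpy as np

np.random.seed(1)  # reproducibility, as the problem requires

# Part 1: 100 draws from N(0, 3), i.e. standard deviation sqrt(3)
x = np.random.normal(loc=0.0, scale=np.sqrt(3), size=100)

# Part 2: noise eps ~ N(0, 0.5), standard deviation sqrt(0.5)
eps = np.random.normal(loc=0.0, scale=np.sqrt(0.5), size=100)

# Part 3: y = -2 + 0.6 x + eps, so beta0 = -2 and beta1 = 0.6
y = -2 + 0.6 * x + eps
print(len(y))  # 100

# Part 5: least squares fit; polyfit returns coefficients highest degree first
b1_hat, b0_hat = np.polyfit(x, y, deg=1)
print(b0_hat, b1_hat)  # should land close to -2 and 0.6

# Part 7: quadratic fit; compare residual sums of squares.  The quadratic
# RSS can never be higher (the models are nested); the question is whether
# the drop is large enough to matter.
lin = np.polyval(np.polyfit(x, y, deg=1), x)
quad = np.polyval(np.polyfit(x, y, deg=2), x)
rss_lin = np.sum((y - lin) ** 2)
rss_quad = np.sum((y - quad) ** 2)
print(rss_lin, rss_quad)
```

A formal answer to part 7 would compare the two fits with an F-test or the p-value on the quadratic coefficient; the RSS comparison above is only the raw ingredient.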

## Optional Problem O1 [30%]

This problem can be substituted for Problem 1 above, for up to 5 points extra credit. At most one of the problems 1 and O1 will be considered.

Read Chapter 1 and solve Exercises 1.6 and 1.10 in [Bishop, C. M. (2006). Pattern Recognition and Machine Learning].

## Problem 2 [25%]

Read through Section 2.3 in ISL. Load the Auto data set and make sure to remove missing values from the data. Then answer the following questions:

1. Which predictors are quantitative and which ones are qualitative?


2. What is the range, mean, and standard deviation of each predictor? Use the range() function [P: pandas.DataFrame.min and pandas.DataFrame.max].

3. Investigate the predictors graphically using plots. Create plots highlighting relationships between predictors. See [1] for a ggplot cheatsheet.

4. Compute the matrix of correlations between variables using the function cor() [P: pandas.DataFrame.corr]. Exclude the name variable.

5. Use the lm() function to perform a multiple linear regression with mpg as the response. [P: using the rpy package is acceptable] Exclude name as a predictor, since it is qualitative. Briefly comment on the output: What is the relationship between the predictors? What does the coefficient for the year variable suggest?

6. Use the symbols * and : to fit linear regression models with interaction effects. What do you observe?

7. Try a few different transformations of variables, such as log(X), √X, X². What do you observe?
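The data-handling steps above can be sketched with pandas and NumPy. The miniature DataFrame below is a made-up stand-in for the Auto data (in practice you would load the real set, e.g. with pd.read_csv("Auto.csv", na_values="?")); np.linalg.lstsq stands in for lm():

```python
import numpy as np
import pandas as pd

# Made-up miniature stand-in for the ISL Auto data; column names follow Auto.
df = pd.DataFrame({
    "mpg":        [18.0, 15.0, 26.0, 30.5, np.nan, 24.0],
    "horsepower": [130.0, 165.0, 52.0, 63.0, 90.0, 95.0],
    "weight":     [3504, 3693, 1835, 2051, 2430, 2372],
    "year":       [70, 70, 78, 77, 75, 76],
    "name":       ["chevrolet", "buick", "vw", "honda", "amc", "datsun"],
})

df = df.dropna()               # remove rows with missing values

num = df.drop(columns="name")  # exclude the qualitative name variable

# Part 2: range, mean, and standard deviation of each predictor
print(num.min(), num.max())
print(num.mean(), num.std())

# Part 4: correlation matrix, cf. cor() in R
print(num.corr())

# Part 5: multiple linear regression of mpg on the remaining predictors,
# solved directly by least squares (lm() / statsmodels would give the same fit)
X = np.column_stack([np.ones(len(num)),
                     num[["horsepower", "weight", "year"]].to_numpy()])
beta, *_ = np.linalg.lstsq(X, num["mpg"].to_numpy(), rcond=None)
print(beta)                    # intercept, then one coefficient per predictor
```

For parts 5–7 on the real data, a regression package that reports standard errors and p-values (lm() in R, or statsmodels in Python) is the more convenient tool; lstsq only produces the coefficients.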

## Problem 3 [25%]

Using equation (3.4) in ISL, argue that in the case of simple linear regression, the least squares line always passes through the point (x̄, ȳ).
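Before writing the algebraic argument, a quick numerical sanity check on synthetic data (it does not replace the argument) can confirm the claim: evaluating the fitted line at x̄ returns ȳ.

```python
import numpy as np

# Any data set would do for this check; this one is synthetic
np.random.seed(0)
x = np.random.normal(size=40)
y = 2 + 0.5 * x + np.random.normal(size=40)

# Least squares fit: polyfit returns (slope, intercept)
b1, b0 = np.polyfit(x, y, deg=1)

# The fitted value at x-bar coincides with y-bar (up to floating point)
print(b0 + b1 * x.mean(), y.mean())
```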

## Problem 4 [25%]

It is claimed in the ISL book that in the case of simple linear regression of Y onto X, the R² statistic (3.17) is equal to the square of the correlation between X and Y (3.18). Prove that this is the case. For simplicity, you may assume that x̄ = ȳ = 0.
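A numerical spot check of the claim on synthetic data (again, not a substitute for the proof): compute R² as 1 − RSS/TSS per (3.17) and compare it with the squared sample correlation per (3.18).

```python
import numpy as np

np.random.seed(0)
x = np.random.normal(size=50)
y = 1.5 * x + np.random.normal(size=50)  # any linear-plus-noise data works

# Simple linear regression of y onto x
b1, b0 = np.polyfit(x, y, deg=1)
yhat = b0 + b1 * x

# R^2 as in ISL (3.17): 1 - RSS/TSS
rss = np.sum((y - yhat) ** 2)
tss = np.sum((y - y.mean()) ** 2)
r2 = 1 - rss / tss

# Squared sample correlation, cf. (3.18)
r = np.corrcoef(x, y)[0, 1]
print(r2, r ** 2)  # agree up to floating-point error
```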

## References

Each reference is a link. Please open the PDF in a viewer if it is not working on the website.

1. R GGPlot cheat sheet

2. Python Pandas data visualization

3. R For Data Science

4. Cheatsheets
