## Description

Problem 1 (30 points)

In this problem, you will practice using linear regression on a simple data set.

Download the wine data set wine.mat from the course website. The first feature (given in data) is a constant

feature, always equal to one; I have already applied “affine expansion” to the data. The next d = 11 features

are fixed acidity, volatile acidity, . . . , alcohol as described in http://archive.ics.uci.edu/ml/machinelearning-databases/wine-quality/winequality.names. I have applied a “standardization” transformation to

these features so that, on the training data, the empirical mean of each feature is zero, and the empirical

standard deviation of each feature is one; the same transformation was applied to the test data. The output

(given in labels) is quality, the quality grade of the wine (an integer between 0 and 10).

Ordinary least squares

Compute the ordinary least squares estimator based on the training data (i.e., the squared loss ERM). You

can use any software package you like to do this provided that you include proper citations.2 Make sure

the solution satisfies the appropriate normal equations! This is a way to check if you’ve done

things correctly. Compute the test squared loss risk on the test data (testdata and testlabels).

Sparse linear predictor

Write a program to find a coefficient vector βˆ = (βˆ

0, βˆ

1, . . . , βˆ

d) where at most three of the βˆ

i for 1 ≤ i ≤ d are

non-zero, with smallest empirical risk on the training data. (The coefficient βˆ

0 corresponds to the always-one

feature and is also allowed to be non-zero.) This can be done by enumerating over all

d

3

triplets of the d

features. For this chosen coefficient vector, compute the test squared loss risk on the test data.

For each of the (at most) three variables v ∈ {fixed acidity, volatile acidity, . . . , alcohol} with nonzero coefficients in βˆ, determine the two variables (different from v) with highest and second-highest (in

absolute value) sample correlation with variable v, as based on the test data. Record both the variable names

and the corresponding correlation values.

What to submit in your write-up:

(a) Test risks of the ordinary least squares estimator and of the sparse linear predictor.

(b) Names (e.g., fixed acidity, volatile acidity) of the variables with non-zero coefficients in the

sparse linear predictor, along with the coefficient values. Also give the value of the coefficient βˆ

0.

(c) For each variable v with non-zero coefficient, the names of the two other variables most correlated (in

absolute value) with variable v, along with the corresponding correlation values.

Please submit your source code on Courseworks.

2See https://libguides.mit.edu/c.php?g=551454&p=3900280.

Problem 2 (35 points)

In this problem, you will study an “approximation error-vs-variance” trade-off in the fixed design setting

(which was covered in the reading assignment).

Suppose you are given n distinct real numbers z1, . . . , zn ∈ [−π, π], and you observe the realizations of n

uncorrelated real-valued random variables Y1, . . . , Yn, where E(Yi) = h(zi) for some unknown continuous

function h: R → R, and var(Yi) = 1 for all i = 1, . . . , n. Taking inspiration from the Weierstrass approximation

theorem, you construct a feature vector for each zi using a degree-k polynomial expansion (for some k ∈ N),

xi

:= (1, zi

, z2

i

, . . . , zk

i

) for each i = 1, . . . , n,

and use ordinary least squares to construct βˆ ∈ R

k+1 to approximate each h(zi) with x

>

i βˆ.

(a) Suppose it turns out that

h(zi) = cos(zi) for all i = 1, . . . , n.

Derive an upper-bound on the expected (fixed design) risk

E

”

1

n

Xn

i=1

(x

>

i βˆ − E(Yi))2

#

.

The bound should be expressed as a simple function of k and n. Give detailed justifications for each

step in your derivation. (It is fine to use asymptotic notations, like O(1/n).)

Hint: Use Taylor’s (remainder) theorem.

(b) If k is chosen (as a function of n) to make the expected risk from part (a) as small as possible, what is

this best possible expected risk? Your answer should be expressed only in terms of n. Briefly justify

your answer. (It is fine to use asymptotic notations, like O(1/n).)

(c) Let n = 1001, and let z1, . . . , zn be uniformly spaced in the interval [−π, π] (with z1 = −π and zn = π).

Simulate the random variables Y1, . . . , Yn, where Yi = sin(3zi/2) + εi for each i, and ε1, . . . , εn are iid

N(0, 1)-random variables. For each k = 1, . . . , 20, compute the predictions of E(Yi) obtained using

ordinary least squares on the degree-k polynomial expansion, along with the fixed design risk

1

n

Xn

i=1

(x

>

i βˆ − E(Yi))2

.

Repeat this process 1000 times, and average the 1000 resulting fixed design risks (to estimate the

expected fixed design risk) for each k. Report these average fixed design risks for each k, both in a

table and in a plot. (Make sure the table and plot are clearly labeled.)

Please submit your source code on Courseworks.

3

Problem 3 (35 points)

In this problem, you will carry out a simulation that illustrates a phenomenon called Freedman’s paradox.

Download the data set freedman.mat from the course website. The n × d matrix in data contains n feature

vectors x

(1), . . . , x(n) ∈ R

d

(as rows), and the n × 1 vector in labels contains the corresponding labels

y

(1), . . . , y(n) ∈ R. Note that n < d, so we don’t expect ordinary least squares to work well.
Note that the features and labels are approximately standardized (i.e., mean zero and variance one).
Consider the following three-step procedure:
1. Compute the following estimates of the correlations between the features and the label:
ρˆj :=
1
n
Xn
i=1
x
(i)
j
y
(i)
for j = 1, . . . , d.
2. Let Jb := {j ∈ {1, . . . , d} : |ρˆj | > 1.75/

√

n} be the features for which the estimated correlation is larger

than 1.75/

√

n, and discard the features not in Jb.

3. Construct the ordinary least squares estimate βˆ(Jb) ∈ R

d using only features in Jb. In other words,

βˆ(Jb) ∈ arg min

β∈R

d

:

βj = 0 for all j /∈ Jb

1

n

Xn

i=1

(β

>x

(i) − y

(i)

)

2

.

Steps 1 & 2 comprise a screening procedure whose purpose is to remove irrelevant features. If the number of

features k := |Jb| remaining after the screening is much smaller than n, then one might think that ordinary

least squares (using just the features in Jb) will work well.

(a) How many features remain (in Jb) after the screening? (It should be much smaller than n.)

(b) What is the empirical risk of βˆ(Jb), i.e.,

1

n

Xn

i=1

(βˆ(Jb)

>x

(i) − y

(i)

)

2

?

(It should be much smaller than the empirical variance of the labels.)

(c) Construct βˆ(I) in the same manner as βˆ(Jb) in Step 3 above, except using I := {1, . . . , 75} in place of

Jb. What is the empirical risk of βˆ(I)? (It should be higher than what you got in part (b).)

(d) It turns out that the features and labels are drawn independently from N(0, 1)—i.e., there is no real

correlation between the features and labels. Yet based on the relatively small empirical risk of βˆ(Jb),

together with the small number of features used in βˆ(Jb) relative to the sample size n, you might have

thought that there is a relationship between the features and labels! If this seems unbelievable to you,

write some code to generate the data yourself, and vary the numerical parameters of the simulation

(e.g., n and d) to see how robust the phenomenon is to changes to these parameters. Try to explain

why this happens, in your own words.

(e) Suppose you were able to carry out Step 3 using a new, independent but identically distributed set of

data. What would you expect βˆ(Jb) to be? And what would you expect its empirical risk (on this new

data set) to be? Approximate answers are fine, but briefly justify your answers.

Please submit your source code on Courseworks.

4