STAT4001 Data Mining and Statistical Learning Homework 2 solved


1. (15 marks) Ridge regression vs. least squares

Given data $(y_i, x_i)_{i=1,\dots,n}$ with $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, where $\epsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$ and $(x_i)_{i=1,\dots,n}$ is known and fixed.

Least squares estimate: $(\hat\beta_0^{LS}, \hat\beta_1^{LS}) = \operatorname{argmin}_{\beta_0, \beta_1} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2$.

Ridge regression estimate: $(\hat\beta_0^{Ridge}, \hat\beta_1^{Ridge}) = \operatorname{argmin}_{\beta_0, \beta_1} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 + \lambda \beta_1^2$.

(a) Show that the least squares estimate is unbiased by showing $E(\hat\beta_0^{LS}) = \beta_0$ and $E(\hat\beta_1^{LS}) = \beta_1$.

(b) Show that the ridge regression estimate is biased by calculating $E(\hat\beta_0^{Ridge})$ and $E(\hat\beta_1^{Ridge})$.

Hint: You may directly use some derivations from the lecture.
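A quick empirical check of (a) and (b) (a sketch, not a proof): it assumes the closed-form slope estimates given in Problem 4, with a fixed design and true values chosen here purely for illustration, and $\lambda = 1$:

    # Simulate many data sets over a fixed design; compare the average
    # LS and ridge slope estimates with the true beta1 = 2.
    set.seed(1)
    n <- 20; beta1 <- 2; sigma <- 0.5; lambda <- 1
    x  <- seq(0, 1, length.out = n)       # known, fixed design
    xc <- x - mean(x)
    est <- replicate(5000, {
      y <- 1 + beta1 * x + rnorm(n, 0, sigma)
      c(ls    = sum(xc * y) / sum(xc^2),
        ridge = sum(xc * y) / (sum(xc^2) + lambda))
    })
    rowMeans(est)  # LS mean is close to 2; ridge mean is shrunk
                   # toward 0 by the factor Sxx / (Sxx + lambda)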
2. (20 marks) Invariance of linear regression to scaling, but not of ridge regression without standardization

(a) Consider the following data set: $y = (2.2, 3.3, 3.8)$, $x = (1, 2, 3)$. Fit $y = \beta_0 + \beta_1 x$.

i. Calculate the least squares parameter estimates $\hat\beta_0^{LS}, \hat\beta_1^{LS}$ and the ridge regression parameter estimates $\hat\beta_0^{Ridge}, \hat\beta_1^{Ridge}$ with $\lambda = 1$.

ii. Calculate $\hat y^{LS}$ and $\hat y^{Ridge}$ for $x$.

(b) Consider the data set in (a) with $x' = 10x$, i.e. $x' = (10, 20, 30)$.

i. Calculate the least squares parameter estimates $\hat\beta_0^{L}, \hat\beta_1^{L}$ and the ridge regression parameter estimates $\hat\beta_0^{R}, \hat\beta_1^{R}$ with $\lambda = 1$.

ii. Compare $\hat\beta_0^{R}$ and $\hat\beta_1^{R}$ from 2b(i) with $\hat\beta_0^{Ridge}$ and $\hat\beta_1^{Ridge}/10$ from 2a(i), which are without scaling; likewise, compare $\hat\beta_0^{L}$ and $\hat\beta_1^{L}$ from 2b(i) with $\hat\beta_0^{LS}$ and $\hat\beta_1^{LS}/10$ from 2a(i) for least squares.

iii. Calculate $\hat y^{L}$ and $\hat y^{R}$ for $x'$, and compare with 2a(ii).

(Note: You will see that scaling $x$ affects $\hat y$ for ridge regression, but not for least squares. A numerical sketch follows the hint below.)

Hint: You may directly use some derivations from the lecture.
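A numerical sketch for (a) and (b), assuming (as in Problem 4) that the intercept is left unpenalized so that only $\beta_1$ is shrunk; least squares is the $\lambda = 0$ special case:

    y <- c(2.2, 3.3, 3.8)
    fit <- function(x, lambda = 0) {
      xc <- x - mean(x)
      b1 <- sum(xc * y) / (sum(xc^2) + lambda)  # shrunk when lambda > 0
      b0 <- mean(y) - b1 * mean(x)
      c(b0 = b0, b1 = b1, yhat = b0 + b1 * x)
    }
    fit(c(1, 2, 3))                  # least squares on x
    fit(c(1, 2, 3), lambda = 1)      # ridge on x
    fit(c(10, 20, 30))               # least squares on x' = 10x: same yhat
    fit(c(10, 20, 30), lambda = 1)   # ridge on x': different yhat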
3. (15 marks) Cyclic coordinate descent for the LASSO

Given $f(\beta_j) = a\beta_j^2 - 2b\beta_j + \lambda|\beta_j|$, where $a > 0$ and $\lambda > 0$, show that when $b < -\frac{\lambda}{2} < 0$, $\hat\beta_j = \frac{2b + \lambda}{2a}$ minimizes $f(\beta_j)$.
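One way to begin (a sketch of the standard case analysis, not the full solution): remove the absolute value by splitting on the sign of $\beta_j$, so that $f$ becomes a pair of quadratics,

\[
f(\beta_j) =
\begin{cases}
a\beta_j^2 - (2b - \lambda)\beta_j, & \beta_j \ge 0, \\
a\beta_j^2 - (2b + \lambda)\beta_j, & \beta_j < 0.
\end{cases}
\]

When $b < -\frac{\lambda}{2}$, the derivative of the first piece, $2a\beta_j - (2b - \lambda)$, is positive on $[0, \infty)$, so that piece is minimized at $\beta_j = 0$; the second piece is minimized at $\frac{2b + \lambda}{2a} < 0$, which lies inside its domain. Comparing the two candidate values completes the argument.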
4. (35 marks) Variance and bias for linear regression vs. ridge regression

Fit the data with the model $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, where $\epsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$.

Least squares parameter estimates: $\hat\beta_0^{LS} = \bar y - \hat\beta_1^{LS} \bar x$ and $\hat\beta_1^{LS} = \frac{\sum_{i=1}^n (x_i - \bar x) y_i}{\sum_{i=1}^n (x_i - \bar x)^2}$.

Ridge regression parameter estimates: $\hat\beta_0^{Ridge} = \bar y - \hat\beta_1^{Ridge} \bar x$ and $\hat\beta_1^{Ridge} = \frac{\sum_{i=1}^n (x_i - \bar x) y_i}{\sum_{i=1}^n (x_i - \bar x)^2 + \lambda}$.

For a new point $x_0$, calculate the bias and variance for

(a) linear regression,

(b) ridge regression,

where $\text{bias}^2 = [\beta_0 + \beta_1 x_0 - E(\hat y_0)]^2$ and $\text{variance} = E[\hat y_0 - E(\hat y_0)]^2$.

(Note: You will see that, compared with linear regression, the $\text{bias}^2$ for ridge is larger but the variance is smaller. In high-dimensional settings, ridge regression (and the lasso) perform better because of the smaller variance.)

Hint: You may directly use some derivations from the lecture, together with
$Var(A + B) = Var(A) + Var(B) + 2\,Cov(A, B)$ and
$Var(\alpha A) = \alpha^2 Var(A)$, where $\alpha$ is a scalar.
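A possible first step (a sketch using only the definitions above; $S_{xx} = \sum_{i=1}^n (x_i - \bar x)^2$ is shorthand introduced here): write $\hat y_0 = \bar y + \hat\beta_1 (x_0 - \bar x)$ and check that $Cov(\bar y, \hat\beta_1) = 0$ for both estimators, since $\sum_{i=1}^n (x_i - \bar x) = 0$. Then

\[
E(\hat\beta_1^{Ridge}) = \frac{S_{xx}}{S_{xx} + \lambda}\,\beta_1,
\qquad
Var(\hat\beta_1^{Ridge}) = \frac{\sigma^2 S_{xx}}{(S_{xx} + \lambda)^2},
\]

with $\lambda = 0$ recovering the least squares case; combined with $Var(\bar y) = \sigma^2/n$, these yield $E(\hat y_0)$ and $Var(\hat y_0)$.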
5. (15 marks) R code exercise

(a) Use the rnorm() function to generate a predictor $X$ of length $n = 100$ with $\mu_X = 0$, $\sigma_X = 1$, as well as a noise vector $\epsilon$ of length $n = 100$ with $\mu_\epsilon = 0$, $\sigma_\epsilon = 0.1$.

(b) Generate a response vector $Y$ of length $n = 100$ according to the model $Y = 1 + X + X^2 + X^3 + \epsilon$.

(c) Fit a lasso model to the simulated data, using $X, X^2, \dots, X^{10}$ as predictors. Use cross-validation to select the optimal value of $\lambda$. Plot the cross-validation error (i.e. mean-squared error vs. $\log(\lambda)$) as a function of $\lambda$. Report the resulting coefficient estimates.

(d) Now regenerate a response vector $Y$ according to the new model $Y = 1 + X^7 + \epsilon$. Again, refit a lasso model using $X, X^2, \dots, X^{10}$ as predictors. Use cross-validation to select the optimal value of $\lambda$, plot the cross-validation error (i.e. mean-squared error vs. $\log(\lambda)$) as a function of $\lambda$, and report the resulting coefficient estimates.

(Note: You will see that when the true data-generating model is sparser, cross-validation tends to select a sparser model.)

Hint: You may refer to the tutorial notes ‘Tutorial05’. A minimal sketch follows.
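A minimal R sketch for (a)-(d), assuming the glmnet package for the lasso (whether the tutorial uses glmnet is an assumption here); the seed, and hence the selected $\lambda$, are arbitrary:

    # (a)-(b): simulate the predictor, the noise, and the response
    library(glmnet)
    set.seed(1)
    n   <- 100
    X   <- rnorm(n, mean = 0, sd = 1)
    eps <- rnorm(n, mean = 0, sd = 0.1)
    Y   <- 1 + X + X^2 + X^3 + eps

    # (c): lasso on X, X^2, ..., X^10 with cross-validated lambda
    Xmat <- poly(X, 10, raw = TRUE)        # column j holds X^j
    cv   <- cv.glmnet(Xmat, Y, alpha = 1)  # 10-fold CV by default
    plot(cv)                               # CV MSE vs. log(lambda)
    coef(cv, s = "lambda.min")             # coefficients at the selected lambda

    # (d): repeat with the sparser model Y = 1 + X^7 + eps
    Y2  <- 1 + X^7 + eps
    cv2 <- cv.glmnet(Xmat, Y2, alpha = 1)
    plot(cv2)
    coef(cv2, s = "lambda.min")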
– End –