## Description

1. Regularized linear regression. For this problem, we will use the linear regression model from lecture:

$$y = \sum_{j=1}^{D} w_j x_j + b$$

In lecture, we saw that regression models with too much capacity can overfit the training data and fail to generalize. One way to improve generalization, which we’ll cover properly later in this course, is regularization: adding a term to the cost function which favors some explanations over others. For instance, we might prefer that weights not grow too large in magnitude. We can encourage them to stay small by adding a penalty

$$R(\mathbf{w}) = \frac{\lambda}{2} \mathbf{w}^\top \mathbf{w} = \frac{\lambda}{2} \sum_{j=1}^{D} w_j^2$$

to the cost function, for some λ > 0. In other words,

$$E_{\text{reg}} = \underbrace{\frac{1}{2N} \sum_{i=1}^{N} \left( y^{(i)} - t^{(i)} \right)^2}_{=E} + \underbrace{\frac{\lambda}{2} \sum_{j=1}^{D} w_j^2}_{=R},$$

where $i$ indexes the data points and $E$ is the same squared error cost function from lecture. Note that in this formulation, there is no regularization penalty on the bias parameter.
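As a concrete reading of the definition above, the following minimal sketch evaluates the regularized cost in plain Python. The function names `predict` and `regularized_cost`, the argument name `lam`, and the toy data are illustrative assumptions and are not part of the assignment.

```python
def predict(w, b, x):
    """Linear model y = sum_j w_j * x_j + b for a single input vector x."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def regularized_cost(w, b, X, t, lam):
    """Return E + R, where the penalty R covers the weights but not the bias."""
    n = len(X)
    # E: squared error summed over the data points, scaled by 1 / (2N).
    E = sum((predict(w, b, x) - ti) ** 2 for x, ti in zip(X, t)) / (2 * n)
    # R: (lambda / 2) times the sum of squared weights.
    R = 0.5 * lam * sum(wj ** 2 for wj in w)
    return E + R


# Hypothetical toy data (not from the assignment): N = 2 examples, D = 2 features.
X = [[1.0, 2.0], [3.0, 4.0]]
t = [1.0, 2.0]
w = [0.5, -0.5]
b = 0.1

print(regularized_cost(w, b, X, t, lam=0.0))  # lam = 0 gives the plain cost E
print(regularized_cost(w, b, X, t, lam=1.0))  # lam > 0 adds the penalty R
```

Setting `lam=0.0` recovers the unregularized cost, so the difference between the two printed values is exactly the penalty term for these weights.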

(a) [3 pts] Determine the gradient descent update rules for the regularized cost function $E_{\text{reg}}$. Your answer should have the form:

$$w_j \leftarrow \cdots$$

$$b \leftarrow \cdots$$

This form of regularization is sometimes called “weight decay”. Based on this update rule, why do you suppose that is?
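One way to sanity-check a hand-derived update rule is to compare it against a finite-difference estimate of the gradient. The sketch below is only a checking aid under stated assumptions: the helper names `cost` and `numerical_gradient`, the step size `alpha`, and the toy data are all hypothetical, and it deliberately does not state the analytic gradient.

```python
def cost(w, b, X, t, lam):
    """Regularized cost: squared error term plus (lam / 2) * sum_j w_j^2."""
    n = len(X)
    E = sum(
        (sum(wj * xj for wj, xj in zip(w, x)) + b - ti) ** 2
        for x, ti in zip(X, t)
    ) / (2 * n)
    return E + 0.5 * lam * sum(wj ** 2 for wj in w)


def numerical_gradient(w, b, X, t, lam, eps=1e-6):
    """Central-difference estimates of the partial derivatives w.r.t. each w_j and b."""
    grad_w = []
    for j in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[j] += eps
        w_minus[j] -= eps
        diff = cost(w_plus, b, X, t, lam) - cost(w_minus, b, X, t, lam)
        grad_w.append(diff / (2 * eps))
    grad_b = (cost(w, b + eps, X, t, lam) - cost(w, b - eps, X, t, lam)) / (2 * eps)
    return grad_w, grad_b


# Hypothetical toy data and hyperparameters (not from the assignment).
X = [[1.0, 2.0], [3.0, 4.0]]
t = [1.0, 2.0]
w, b, lam, alpha = [0.5, -0.5], 0.1, 1.0, 0.01

# One generic gradient descent step: parameter <- parameter - alpha * gradient.
grad_w, grad_b = numerical_gradient(w, b, X, t, lam)
w_new = [wj - alpha * gj for wj, gj in zip(w, grad_w)]
b_new = b - alpha * grad_b

print(cost(w, b, X, t, lam), cost(w_new, b_new, X, t, lam))
```

If the derived rule is correct, applying it once from the same starting point should give values of `w_new` and `b_new` that agree with these to several decimal places.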


CSC321 Homework 1

(b) [3 pts] It’s also possible to solve the regularized regression problem directly by setting the partial derivatives equal to zero. In this part, for simplicity, we will drop the bias term from the model, so our model is:

$$y = \sum_{j=1}^{D} w_j x_j.$$

In tutorial, and in Section 3.1 of the Lecture 2 notes, we derived a system of linear equations of the form

$$\frac{\partial E}{\partial w_j} = \sum_{j'=1}^{D} A_{jj'} w_{j'} - c_j = 0.$$

It is possible to derive constraints of the same form for $E_{\text{reg}}$. Determine formulas for $A_{jj'}$ and $c_j$.

2. Visualizing the cost function. In lecture, we visualized the linear regression cost function in weight space and saw that the contours were ellipses. Let’s work through a simple example of that. In particular, suppose we have a linear regression model with two weights and no bias term:

$$y = w_1 x_1 + w_2 x_2,$$

and the usual loss function $L(y, t) = \frac{1}{2}(y - t)^2$ and cost $E(w_1, w_2) = \frac{1}{N} \sum_i L(y^{(i)}, t^{(i)})$. Suppose we have a training set consisting of $N = 3$ examples:

• $\mathbf{x}^{(1)} = (2, 0)$, $t^{(1)} = 1$

• $\mathbf{x}^{(2)} = (0, 1)$, $t^{(2)} = 2$

• $\mathbf{x}^{(3)} = (0, 1)$, $t^{(3)} = 0$.

Let’s sketch one of the contours.

(a) [2pts] Write the cost in the form

$$E = c_1 (w_1 - d_1)^2 + c_2 (w_2 - d_2)^2 + E_0.$$
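A candidate answer for this part can be checked numerically by evaluating the cost straight from the training set and comparing it with the proposed closed form at a few points. This is a minimal sketch: the names `cost` and `matches` are hypothetical helpers, and the constants in the example candidate are placeholders, not the solution.

```python
# Training set from the problem statement: inputs (x1, x2) and targets t.
X = [(2.0, 0.0), (0.0, 1.0), (0.0, 1.0)]
T = [1.0, 2.0, 0.0]


def cost(w1, w2):
    """E(w1, w2) = (1/N) * sum_i 0.5 * (y_i - t_i)^2 with y = w1*x1 + w2*x2."""
    total = 0.0
    for (x1, x2), t in zip(X, T):
        y = w1 * x1 + w2 * x2
        total += 0.5 * (y - t) ** 2
    return total / len(X)


def matches(candidate, points, tol=1e-9):
    """True if the candidate formula agrees with cost() at every test point."""
    return all(abs(candidate(w1, w2) - cost(w1, w2)) < tol for w1, w2 in points)


def candidate(w1, w2):
    # Placeholder constants for illustration only (c1 = c2 = 1, d1 = d2 = 0, E0 = 0);
    # replace them with your own derived values.
    return 1.0 * (w1 - 0.0) ** 2 + 1.0 * (w2 - 0.0) ** 2 + 0.0


test_points = [(0.0, 0.0), (1.0, -1.0), (0.25, 3.0)]
print(matches(candidate, test_points))
```

Because both expressions are quadratic in the weights, agreement at a handful of well-spread points is strong evidence that the completed-square form is right.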

(b) [2pts] Since $c_1, c_2 > 0$, this corresponds to an axis-aligned ellipse. Sketch the ellipse by hand for $E = 1$. Label the center and radii of the ellipse. If you’ve forgotten how to plot axis-aligned ellipses, see Khan Academy (https://www.khanacademy.org/math/algebra-home/alg-conic-sections/alg-center-and-radii-of-an-ellipse/v/conic-sections-intro-to-ellipses).