1. Gradient descent. [5pts] We can get quite a bit of insight into the behavior of gradient
descent by looking at how it behaves on quadratic functions. Suppose we are trying to
optimize a quadratic function
C(θ) = a1
(θ1 − r1)
2 + · · · +
(θN − rN )
with each ai > 0. We can exactly solve for the dynamics of gradient descent. In other words,
we can find an exact formula for θ
, where t is the number of gradient descent updates.
(a) [1pt] Derive the gradient descent update rule for each θi with learning rate α. It should
have the form
i = · · · ,
where the right-hand side is some function of the previous value θ
, as well as ri
and α. (It’s an interesting and useful fact that the different θi
’s evolve independently,
so we can analyze a single coordinate at a time.)
(b) [2pts] Now let’s introduce the error e
i = θ
i − ri
. Take your update rule from the
previous part, and write a recurrence for the errors. It should have the form
i = · · · ,
where the right-hand side is some function of e
, and α.
(c) [1pt] Solve this recurrence to obtain an explicit formula for e
in terms of the initial
. For what values of α is the procedure stable (the errors decay over time), and
for what values is it unstable (the errors grow over time)?
Aside: your answer will indicate that large learning rates are more unstable than small
ones, and that high curvature dimensions are more unstable than low curvature ones.
(d) [1pt] Using your answer for the previous part, write an explicit formula for the cost
) as a function of the initial values θ
(0). (You can write it as a summation over
indices, i.e., you don’t need to vectorize it.) As t → ∞, which term comes to dominate?
CSC321 Homework 4
Aside: you’ll find that if you use the optimal α, the asymptotic behavior roughly depends
on the condition number
This supports the claim that narrow ravines are problematic for gradient descent.
(e) [Optional (not marked), and advanced] This part is optional, but you may find
it interesting. We’ll make use of eigendecompositions of symmetric matrices; see MIT
OpenCourseware for a refresher:
It turns out we’ve actually just analyzed the fully general quadratic case. I.e., suppose
we try to minimize a cost function of the form
C(θ) = 1
(θ − r)
T A(θ − r),
where A is a symmetric positive definite matrix, i.e. a symmetric matrix with all positive
eigenvalues. (This is the general form for a quadratic function which curves upwards.)
Determine the gradient descent update for θ in vectorized form. Then write a recurrence
for the error vector e = θ − r, similarly to Part (1b). It will have the form
(t+1) = Be(t)
where B is a symmetric matrix. Determine the eigenvectors and eigenvalues of B in
terms of the eigenvectors and eigenvalues of A, and use this to find an explicit form for
(t) and for C(θ
) in terms of θ
(0). The result will be closely related to your answer
from Part (1d).
2. Dropout. [5pts] For this question, you may wish to review the properties of expectation
and variance: https://metacademy.org/graphs/concepts/expectation_and_variance
Dropout has an interesting interpretation in the case of linear regression. Recall that the
predictions are made stochastically as:
where the mj ’s are all i.i.d. (independnet and identically distributed) Bernoulli random variables with expectation 1/2. (I.e., they are indepdendent for every input dimension and every
data point.) We would like to minimize the cost
(i) − t
where the expectation is with respect to the m
Now we show that this is equivalent to a regularized linear regression problem:
(a) [2pts] Find expressions for E[y] and Var[y] for a given x and w.
CSC321 Homework 4
(b) [1pt] Determine ˜wj as a function of wj such that
E[y] = ˜y =
Here, ˜y can be thought of as (deterministic) predictions made by a different model.
(c) [2pts] Using the model from the previous section, show that the cost E (Eqn. 1) can be
(i) − t
2 + R( ˜w1, . . . , w˜D),
where R is a function of the ˜wD’s which does not involve an expectation. I.e., give an
expression for R. (Note that R will depend on the data, so we call it a “data-dependent
Hint: write the cost in terms of the mean and variance formulas from part (a). For
inspiration, you may wish to refer to the derivation of the bias/variance decomposition
from Lecture 9.