Description
Problem 1 (30 points)
In this problem, you will study a case where maximum likelihood fails but empirical risk minimization
succeeds.
Consider the following probability distribution P on X × Y, for X = {0, 1}
2 and Y = {0, 1}.
P(X = (x1, x2)) :
x1 = 0 x1 = 1
x2 = 0 1−
2
1−
2
x2 = 1
2
2
P(Y = 1 | X = (x1, x2)) :
x1 = 0 x1 = 1
x2 = 0 0 1
x2 = 1 1 0
Above, regard as a small positive number between 0 and 1. Let f
? be the Bayes optimal classifier for P.
(In this problem, we are concerned with zero-one loss.)
Now also consider the statistical model P for Y | X: P = {PA, PB}, where
PA(Y = 1 | X = (x1, x2)) = (
4
5
if x1 = 0,
1
5
if x1 = 1;
PB(Y = 1 | X = (x1, x2)) = (
0 if x1 = 0,
1 if x1 = 1.
Note that P /∈ P, and that distributions in P do not specify the distribution of X (like logistic regression).
Let fA and fB be the Bayes optimal classifiers, respectively, for PA and PB.
The maximum likelihood approach to learning a classifier selects the distribution in P of highest likelihood
given training data (which are regarded as an iid sample), then returns the optimal classifier for the chosen
distribution (i.e., fA if PA is the maximum likelihood distribution, otherwise fB).
(a) Give simple expressions for the Bayes optimal classifiers for P, PA, and PB. E.g.,
f
?
(x) = (
1 if x1 + x2 = 2,
0 otherwise.
(b) What is the risk of each classifier from Part (a) under distribution P?
(c) Suppose in the training data, the number of training examples of the form ((x1, x2), y) is equal to
Nx1,x2,y, for (x1, x2, y) ∈ {0, 1}
3
. Give a simple rule for determining the maximum likelihood distribution
in P in terms of Nx1,x2,y for (x1, x2, y) ∈ {0, 1}
3
. Briefly justify your answer.
(d) If the training data is a typical iid sample from P (with large sample size n), which classifier from Part
(a) is returned by the maximum likelihood approach? Briefly justify your answer.
(e) If the training data is an iid sample from P of size n, then how large should n be so that, with probability
at least 0.99, Empirical Risk Minimization returns the classifer in {fA, fB} with smallest risk? Briefly
justify your answer.
2
Problem 2 (20 points)
In this problem, you will practice verifying the convexity of certain functions.
(a) Let D be a finite non-empty subset of R
d
, and consider the statistical model {Pθ : θ ∈ R
d} for iid
random variables X1, . . . , Xn, where the probability mass function for X1 is given by
pθ(x) = exp(θ
>x)
P
y∈D exp(θ
>y)
, x ∈ D,
and pθ(x) = 0 for x 6∈ D. Prove that for any x1, . . . , xn ∈ D, the log-likelihood function
LL(θ) = ln Yn
i=1
pθ(xi)
!
for all θ ∈ R
d
is concave (i.e., − LL is convex).
(b) A twice-differentiable function f : R
d → R is strictly convex if its second-derivative matrix at any x ∈ R
d
is positive definite. Is the function f : R → R given by
f(x) = exp(−x) for all x ∈ R
strictly convex? Explain (with a short proof) why or why not.
(c) Let θ ∈ R
d be a non-zero vector. Is the function f : R
d → R given by
f(x) = kx − θk
2
2 + 2θ
>x + 1 for all x ∈ R
d
strictly convex? Explain (with a short proof) why or why not.
(d) Let θ ∈ R
d be a non-zero vector. Is the function f : R
d → R given by
f(x) = exp(θ
>x) for all x ∈ R
d
strictly convex? Explain (with a short proof) why or why not.
3
Problem 3 (20 points)
In this problem, you will formulate optimization problems as convex optimization problems using only “simple”
objective and constraint functions.
Let (x1, y1), . . . ,(xn, yn) ∈ R
d × R be given data. Write each of the following optimization problems as a
convex optimization problem in standard form in which the objective function and constraint functions are
linear or affine functions. You may introduce additional variables and constraints, but the number of variables
and constraints should be no more than a polynomial function of n and d (with constant degree). Briefly
explain why the optimization problem you formulate has the same optimal solutions as the given one.
(a)
min
w∈Rd
max
i=1,…,n
|x
>
i w − yi
|.
(b)
min
w∈Rd
Xn
i=1
φ(x
>
i w − yi),
where φ: R → R is the function defined by
φ(t) := (
0 if |t| ≤ 1,
|t| − 1 if |t| > 1.
(c)
min
w∈Rd
X
d
j=1
|wj |
s.t. |x
>
i w − yi
| ≤ 1 for i = 1, . . . , n.
(d)
min
w∈Rd,r∈Rn
X
k
j=1
|r|(j)
s.t. x
>
i w − yi = ri for i = 1, . . . , n.
Above, k is a given positive integer between 1 and d. Also, for a vector v = (v1, . . . , vn),
|v|(j) = |vπ(j)
|
for the permutation π on {1, . . . , n} that orders the entries of v in non-increasing absolute value:
|v|(1) ≥ |v|(2) ≥ · · · ≥ |v|(n)
.
4
Problem 4 (30 points)
In this problem, you will experimentally study convergence behavior of gradient descent for logistic regression.
Let (x1, y1), . . . ,(xn, yn) be training examples from R
d × {0, 1} for a binary classification problem. The
following optimization problem specifies the logistic regression MLE parameters (with explicit affine expansion):
min
β0∈R,β∈Rd
1
n
Xn
i=1
n
ln (1 + exp(β0 + x
>
i β)) − yi(β0 + x
>
i β)
o
.
(a) Give concise and unambiguous pseudocode for a gradient descent algorithm that approximately solves
this optimization problem. Be explicit about how the gradients are computed. Assume the initial
solution, step sizes, and number of iterations are provided as inputs.
(b) Implement the gradient descent algorithm from part (a), except use β0 = 0 and β = 0 as the initial
solution, choose the step sizes using a backtracking line search (with initial step size η = 1)
2
, and use as
many iterations as are required to achieve a prescribed objective value. You can use library functions
that implement standard linear algebraic operations and simple functions such as exp and log; of course,
you should not use (or look at the source code for) existing implementations of gradient descent or other
optimization algorithms. Run your gradient descent code on the data set logreg.mat from the course
website (which has training features vectors and labels stored as data and labels, respectively). How
many iterations of gradient descent are needed to achieve an objective value that is at most 0.65064?
3
(c) The feature vectors in the data set from logreg.mat are three-dimensional, so they are (relatively) easy
to inspect. Investigate the data by plotting it and/or computing some statistics about the features. Do
you notice anything peculiar about the features? Use what you discover to design an invertible linear
transformation of the feature vectors xi
7→ Axi such that running gradient descent on this transformed
data (Ax1, y1), . . . ,(Axn, yn) reaches an objective value of 0.65064 in (many) fewer iterations. Describe
the steps and reasoning in this investigation, as well as your chosen linear transformation (as a 3×3
matrix). How many iterations of gradient descent were required to achieve this stated objective value?
(d) Create a new version of your gradient descent code with the following changes.
1. Use only the first b0.8nc examples to define the objective function; keep the remaining n − b0.8nc
examples as a validation set.4
2. Use the following stopping condition. After every power-of-two (2
0
, 2
1
, 2
2
, etc.) iterations of
gradient descent, record the validation error rate (i.e., zero-one loss validation risk) for the linear
classifier based on the current (β0, β). If this validation error rate is more than 0.99 times that of
the best validation error rate previously computed, and the number of iterations executed is at
least 32 (which is somewhat of an arbitrary number), then stop.
Run this modified gradient descent code on the original logreg.mat data, as well as the linearly
transformed data (from part (c)). In each case, report: (1) the number of iterations executed, (2) the
final objective value, and (3) the final validation error rate.
Please submit your source code on Courseworks.
2Note that you should start with η = 1 in the backtracking line search for each step of gradient descent.
3The actual minimum value is less than 0.65064.
4Normally you would not simply select the first b0.8nc examples, but rather pick a random subset of b0.8nc examples. But I
have already randomized the order of the examples in logreg.mat.
5