## Description

Problem 1 Multiclass Logistic Regression Classifier (5 pts, 5 pts, 5 pts, 2 pts)

In this problem, you will derive and implement a multiclass linear classifier based on logistic regression (LR,

see Section 9.3 of UML). For binary classification, LR has two interpretations. The first, which is given in

UML, is that it performs ERM with the logistic loss, i.e., using

$$\ell(h_w, (x, y)) = \log\left(1 + e^{-y(w^T x)}\right),$$

where $x, w \in \mathbb{R}^d$, $y \in \{-1, 1\}$, and

$$h_w(x) = \mathrm{sign}(w^T x).$$

This results in the ERM objective function

$$J_1(w) = \frac{1}{m} \sum_{i=1}^m \log\left(1 + e^{-y_i(w^T x_i)}\right). \tag{1}$$

The second interpretation of (binary) LR is that it models/learns the conditional distribution P[Y = y | x]

and then uses this model to estimate the Bayes classifier. In particular, LR assumes that

$$P[Y = 1 \mid x, w] \approx \sigma(w^T x),$$

where

$$\sigma(t) = \frac{1}{1 + e^{-t}}$$

is called the sigmoid or logistic function. You can check that $\sigma(w^T x) \in (0, 1)$, and that it goes to 1 when $w^T x$ is large and to 0 when $w^T x$ is small. The vector $w$ can be learned using maximum likelihood estimation, which results in minimizing the negative log-likelihood

$$J_2(w) = -\frac{1}{m} \sum_{i=1}^m \left[ y_i \log\left(\sigma(w^T x_i)\right) + (1 - y_i) \log\left(1 - \sigma(w^T x_i)\right) \right], \tag{2}$$

where now we set $y_i \in \{0, 1\}$. The learned $w$ is then used in $\sigma(t)$ above to estimate the Bayes classifier.
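Part (a) below asks you to verify algebraically that (1) and (2) are equivalent; as a quick numerical sanity check of that claim (not a substitute for the derivation), the two per-example losses can be compared on random data. All names here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def loss1(w, x, y):
    # Logistic loss from (1), with label y in {-1, +1}.
    return np.log(1.0 + np.exp(-y * (w @ x)))

def loss2(w, x, y01):
    # Negative log-likelihood term from (2), with label y01 in {0, 1}.
    s = sigmoid(w @ x)
    return -(y01 * np.log(s) + (1.0 - y01) * np.log(1.0 - s))

w = rng.standard_normal(5)
for _ in range(100):
    x = rng.standard_normal(5)
    y = rng.choice([-1, 1])    # label convention for (1)
    y01 = (y + 1) // 2         # the same label relabeled to {0, 1} for (2)
    assert np.isclose(loss1(w, x, y), loss2(w, x, y01))
```

The relabeling $y' = (y+1)/2$ maps $-1 \mapsto 0$ and $1 \mapsto 1$, which is the label change the two cost functions differ by.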

The other benefit of the probabilistic interpretation of LR is that it easily extends to the multiclass case.

In this case, the conditional probability is modeled as

$$P[Y = k \mid x, w_1, \ldots, w_K] \approx \mathrm{softmax}(k, w_1^T x, \ldots, w_K^T x),$$

where

$$\mathrm{softmax}(k, t_1, \ldots, t_K) = \frac{e^{t_k}}{\sum_{j=1}^K e^{t_j}}$$

is the softmax function, which is the generalization of the sigmoid to multiple classes. We learn $K$ weight

vectors (one for each class) by minimizing the cost function

$$J(w_1, \ldots, w_K) = -\sum_{i=1}^m \sum_{k=1}^K \mathbb{1}\{y_i = k\} \log\left(\mathrm{softmax}(k, w_1^T x_i, \ldots, w_K^T x_i)\right). \tag{3}$$


The above is minimized using (stochastic) gradient descent and is known as multiclass LR/multinomial

LR/softmax regression.
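When implementing the softmax, note that $e^{t_k}$ overflows for large scores; subtracting the maximum score before exponentiating leaves the value unchanged, because the common factor cancels between numerator and denominator. A minimal sketch (numpy; names are illustrative):

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis.

    Subtracting the max score does not change the result: the factor
    e^{-max} appears in both the numerator and the denominator.
    """
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

# A naive implementation would overflow on these scores; this one does not.
p = softmax(np.array([1000.0, 1001.0, 1002.0]))
```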

(a) Verify that (1) and (2) are equivalent cost functions. Hint: Start with (2) and use algebraic manipulations

to show that it is equivalent to (1).

(b) Verify that the gradient of (3) with respect to one weight vector is

$$\nabla_{w_k} J(w_1, \ldots, w_K) = \sum_{i=1}^m x_i \left( \mathrm{softmax}(k, w_1^T x_i, \ldots, w_K^T x_i) - \mathbb{1}\{y_i = k\} \right).$$
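A gradient formula like the one in (b) can be sanity-checked against central finite differences on small random data. A numpy sketch, with illustrative names ($W$ stacks $w_1, \ldots, w_K$ as rows):

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max()
    e = np.exp(shifted)
    return e / e.sum()

def cost(W, X, y):
    # J from (3): W is (K, d), X is (m, d), y holds labels in {0, ..., K-1}.
    return -sum(np.log(softmax(W @ x)[yi]) for x, yi in zip(X, y))

def grad_k(W, X, y, k):
    # The claimed gradient of (3) with respect to w_k.
    return sum(x * (softmax(W @ x)[k] - (yi == k)) for x, yi in zip(X, y))

rng = np.random.default_rng(1)
K, d, m = 3, 4, 10
W = rng.standard_normal((K, d))
X = rng.standard_normal((m, d))
y = rng.integers(0, K, size=m)

# Central finite differences on each coordinate of w_0.
eps = 1e-6
num = np.zeros(d)
for j in range(d):
    Wp, Wm = W.copy(), W.copy()
    Wp[0, j] += eps
    Wm[0, j] -= eps
    num[j] = (cost(Wp, X, y) - cost(Wm, X, y)) / (2 * eps)

assert np.allclose(num, grad_k(W, X, y, 0), atol=1e-4)
```

If the analytic and numerical gradients disagree, the derivation (or the code) has a bug; this check is cheap insurance before running SGD on the full dataset.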

(c) Complete the script prob1.py by minimizing (3) using stochastic gradient descent and then using the

learned weight vectors to predict on the MNIST dataset (use the data from Homework 1). Use a learning

rate of $\mu = 10^{-2}$ and run ten full passes through the data (you can play with these if you want). Turn

in your code, as well as the classification error on the training and test sets. Hint: I’ve given you code to

create a smaller dataset for debugging that selects only three digits. Your final training and test errors

should both be below 10%.
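The shape of the SGD loop can be sketched on synthetic Gaussian blobs rather than MNIST (numpy only; all names and the toy data are illustrative, and prob1.py will differ in its data loading and structure):

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max()
    e = np.exp(shifted)
    return e / e.sum()

rng = np.random.default_rng(0)
K, d, m = 3, 2, 300
# Three well-separated Gaussian blobs as stand-in data.
centers = np.array([[0.0, 4.0], [4.0, 0.0], [-4.0, -4.0]])
y = rng.integers(0, K, size=m)
X = centers[y] + rng.standard_normal((m, d))

W = np.zeros((K, d))           # one weight vector per class, stacked as rows
mu = 1e-2                      # learning rate from the problem statement
for _ in range(10):            # ten full passes through the data
    for i in rng.permutation(m):
        p = softmax(W @ X[i])
        p[y[i]] -= 1.0         # softmax(k, ...) - 1{y_i = k}, all k at once
        W -= mu * np.outer(p, X[i])   # stochastic gradient step on every w_k

pred = np.argmax(X @ W.T, axis=1)
train_error = np.mean(pred != y)
```

Note that the single-example gradient for all $K$ weight vectors at once is the outer product of the vector $(\mathrm{softmax}(k, \cdot) - \mathbb{1}\{y_i = k\})_{k=1}^K$ with $x_i$, which is what the update above uses.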

(d) What do you notice about the differences between LR and the multiclass ridge regression classifier from

Homework 1?

Problem 2 DSS: Visualizing Data (3 pts, 5 pts, 5 pts)

(DSS rules apply.) Another important tool for any data scientist is knowing how to visualize results. Two

popular approaches to visualizing high-dimensional datasets are t-distributed stochastic neighbor embedding

(t-SNE) and Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP). In this

problem, you will use UMAP to embed the MNIST test dataset and create an interactive plot showing

which digits were misclassified. To complete this problem, you will hack the tutorial here, which requires

the following packages: pandas, seaborn, bokeh, umap.

Hint: Start by grabbing a subset of the digits to reduce computation time while debugging.

(a) Use UMAP to create a scatter plot of the MNIST test dataset embedded into two dimensions, with the

color of each point in your plot corresponding to the true class of the image. Note that the ‘Digits’

dataset is not the same as the MNIST dataset. Turn in your plot.

(b) Hack the code in the linked tutorial to create an interactive plot that allows you to view the image and

true class of each point in the MNIST test dataset when you hover over it. Turn in a screenshot of your

plot with your mouse hovering over a point to show the tooltip.

(c) An important method for troubleshooting and displaying results is understanding what types of examples

are misclassified by your predictor. First, classify the MNIST test dataset using LogisticRegression

from sklearn. Next, hack the code from part (b) so that the points are colored based on whether they

are classified correctly. Turn in your code, a screenshot of your plot with your mouse hovering over a

point to show the tooltip, and some observations of what you learned about which types of points are

incorrectly classified.

Problem 3 SLT (10 pts)

(SLT rules apply.) UML, Ch. 3, Exercise 7. State how long you worked on the problem before looking at

the solution.

Problem 4 SLT (10 pts)

(SLT rules apply.) UML, Ch. 4, Exercise 2. State how long you worked on the problem before looking at

the solution.