## Description

1. Perceptron Algorithm. Suppose we have the following training examples in a 2-dimensional input space, and no bias term. Following our discussion of the perceptron algorithm, we use $t = -1$ rather than $t = 0$ to denote the negative class.

| $x_1$ | $x_2$ | $t$ |
|------:|------:|----:|
| 1 | -2 | 1 |
| 0 | -1 | -1 |

Recall the perceptron algorithm from lecture. It repeatedly applies the following procedure, stopping when the weights are strictly within the feasible region (i.e. not on the boundary):

For each training case $(\mathbf{x}^{(i)}, t^{(i)})$:

$$z^{(i)} \leftarrow \mathbf{w}^\top \mathbf{x}^{(i)}$$

If $z^{(i)} t^{(i)} \le 0$:

$$\mathbf{w} \leftarrow \mathbf{w} + t^{(i)} \mathbf{x}^{(i)}$$

Here we work through the algorithm by hand.

(a) [2pts] Sketch out the problem in weight space. In particular: draw and label the axes, draw the half-space corresponding to each of the two training examples, and shade the feasible region. (Remember that there is no bias term.)

(b) [2pts] Simulate the perceptron algorithm by hand, starting from the initial weights $w_1 = 0$, $w_2 = -2$. On the plot from part (a), mark an X on every point in weight space visited by the algorithm until it stops. You do not need to provide anything for this part other than the X's on the plot. Hint: the weights are updated 12 times. Computing all the updates by arithmetic is tedious; instead, plot the first few updates as you go along, and you should start to notice a visual pattern. (A small simulation sketch follows below as an optional cross-check.)
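If you want to cross-check your by-hand trace, here is a minimal simulation sketch in Python/NumPy. One assumption not stated explicitly in the problem: the training cases are swept cyclically in the order listed in the table. Each printed weight vector is a point you would mark with an X.

```python
import numpy as np

# Training set from the table above; rows of X are inputs, no bias term.
X = np.array([[1.0, -2.0],
              [0.0, -1.0]])
t = np.array([1.0, -1.0])

w = np.array([0.0, -2.0])   # initial weights from part (b)

# Assumption: sweep through the training cases cyclically in the order
# listed; stop once every case satisfies z * t > 0 strictly, i.e. once
# w is strictly inside the feasible region.
n_updates = 0
while not np.all((X @ w) * t > 0):
    for x_i, t_i in zip(X, t):
        z = w @ x_i
        if z * t_i <= 0:          # misclassified or on the boundary
            w = w + t_i * x_i     # perceptron update
            n_updates += 1
            print(f"update {n_updates:2d}: w = {w}")

print("final weights:", w)          # each printed w is an X on the plot
print("total updates:", n_updates)  # the hint says 12
```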


2. Feature Maps. Suppose we have the following 1-D dataset for binary classification:

| $x$ | $t$ |
|----:|----:|
| -1 | 1 |
| 1 | 0 |
| 3 | 1 |

(a) [1pt] Argue briefly (at most a few sentences) that this dataset is not linearly separable. Your answer should mention convexity.

Now suppose we apply the feature map

$$\psi(x) = \begin{pmatrix} \psi_1(x) \\ \psi_2(x) \end{pmatrix} = \begin{pmatrix} x \\ x^2 \end{pmatrix}.$$

Assume we have no bias term, so that the parameters are $w_1$ and $w_2$.

(b) [2pts] Write down the constraint on $w_1$ and $w_2$ corresponding to each training example, and then find a pair of values $(w_1, w_2)$ that correctly classifies all the examples. Remember that there is no bias term. (A small constraint checker follows below.)
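To sanity-check a candidate $(w_1, w_2)$ numerically, a short script helps. The sketch below assumes the usual convention that an example is classified as positive ($t = 1$) when $\mathbf{w}^\top \psi(x) > 0$ and negative ($t = 0$) when $\mathbf{w}^\top \psi(x) < 0$; the candidate shown is an arbitrary placeholder, not a solution.

```python
import numpy as np

# Dataset from the table above (t = 0 denotes the negative class here).
x = np.array([-1.0, 1.0, 3.0])
t = np.array([1, 0, 1])

# Feature map psi(x) = (x, x^2); one row per training example.
Psi = np.stack([x, x**2], axis=1)

def separates(w1, w2):
    """Return True iff (w1, w2) puts every example strictly on its
    correct side: w . psi(x) > 0 for t = 1 and < 0 for t = 0."""
    z = Psi @ np.array([w1, w2])
    return bool(np.all(np.where(t == 1, z > 0, z < 0)))

# Placeholder candidate -- substitute your own (w1, w2) here.
print(separates(0.0, 1.0))   # False: psi_2 alone misclassifies x = 1
```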

3. Loss Functions. [3pts] Suppose we have a prediction problem where the target $t$ corresponds to an angle, measured in radians. A reasonable loss function we might use is

$$L(y, t) = 1 - \cos(y - t).$$

Suppose we make predictions using a linear model,

$$y = \mathbf{w}^\top \mathbf{x} + b.$$

As usual, the cost is the average loss over the training set:

$$E = \frac{1}{N} \sum_{i=1}^{N} L(y^{(i)}, t^{(i)}).$$

Derive a sequence of vectorized mathematical expressions for the gradients of the cost with respect to $\mathbf{w}$ and $b$. As usual, the inputs are organized into a design matrix $\mathbf{X}$ with one row per training example. The expressions should be something you can translate into a Python program without requiring a for-loop. Your answer should look like:

$$\begin{aligned}
\mathbf{y} &= \cdots \\
\frac{\partial E}{\partial \mathbf{y}} &= \cdots \\
\frac{\partial E}{\partial \mathbf{w}} &= \cdots \\
\frac{\partial E}{\partial b} &= \cdots
\end{aligned}$$

You can use $\sin(\mathbf{A})$ to denote the sine function applied elementwise to $\mathbf{A}$. Remember that $\partial E / \partial \mathbf{w}$ denotes the gradient vector,

$$\frac{\partial E}{\partial \mathbf{w}} = \begin{pmatrix} \frac{\partial E}{\partial w_1} \\ \vdots \\ \frac{\partial E}{\partial w_D} \end{pmatrix}.$$

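A finite-difference check is a handy way to verify whatever vectorized expressions you derive, without spoiling the derivation itself. The sketch below implements only the cost $E$ and its numerical gradients on random data; the helper name `numeric_grad` and the data shapes are illustrative choices, not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 5, 3
X = rng.standard_normal((N, D))    # design matrix, one row per example
t = rng.uniform(-np.pi, np.pi, N)  # target angles in radians

def cost(w, b):
    """E = (1/N) * sum_i (1 - cos(y_i - t_i)), with y = X w + b."""
    y = X @ w + b
    return np.mean(1.0 - np.cos(y - t))

def numeric_grad(w, b, eps=1e-6):
    """Central finite differences for dE/dw and dE/db."""
    gw = np.zeros_like(w)
    for j in range(len(w)):
        e = np.zeros_like(w)
        e[j] = eps
        gw[j] = (cost(w + e, b) - cost(w - e, b)) / (2 * eps)
    gb = (cost(w, b + eps) - cost(w, b - eps)) / (2 * eps)
    return gw, gb

w0, b0 = rng.standard_normal(D), 0.1
gw, gb = numeric_grad(w0, b0)
print(gw, gb)  # compare these numbers against your vectorized formulas
```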