## Description

1. Code up a 2-class perceptron classifier, for the case of 2 features (3 dimensions in augmented

space). As in the previous homework problems, you are expected to code this up yourself

rather than use a library or toolbox, so the same guidelines apply.

Use basic sequential gradient descent.

Hint: One way to randomly shuffle the data points, is to generate random permutations of

numbers from 1 to N, where N = number of training samples. The result can be used as indices

for the data points and their labels.

It is recommended (but not required) that you use augmented space, and reflect the training

data. Do not reflect the test data.

Because the data you will use might not be linearly separable, you will need two halting

conditions:

(1) The usual perceptron halting condition: if there have been no updates to in 1 epoch,

then halt;

(2) An ad hoc one in case condition (1) is not satisfied. A reasonable one is to set a

maximum number of iterations or epochs (1000 epochs should be high enough), and

during the last full epoch, at each iteration i, evaluate over all training data

points, and store each . Then, after the last iteration, choose the that had

the smallest criterion value .

For your initial weight vector, use (vector of 0.1 in each component).

For your learning rate parameter, use

Run your classifier on 3 datasets: Synthetic1 (get from HW1 folder), Synthetic2 (also from HW1

folder), and Synthetic3 (from the current HW folder).

(a) For each dataset, report on the final weight vector, the classification error rate on the training

set, and the classification error rate on the test set.

(b) Also for each dataset, give a 2D plot showing decision boundary and regions, as well as

training data points (similar to plots of HW1). You may find it convenient to use the

PlotDecBoundaries function from HW1 folder.

(c) For Synthetic1 and Synthetic2 datasets, compare your error rates of part (a) with the error rates

obtained in HW1(a); you can compare with the posted solutions if your HW1 answer wasn’t

correct. Explain any significant differences between perceptron result and nearest means

result.

Problem 2 on next page…

w

J (w(i))

J (w(i)) w(i)

J (w(i))

w(0) = 0.1(1)

η(i) = 1.

p. 2 of 2

2. In a gradient descent procedure, let the “weight update” be defined as

.

Consider a given weight vector which is the weight vector at (given) iteration i , so that

.

Let the criterion function take the form , in which is the criterion

function for one data point. In this problem, do not plug in for the perceptron criterion

function; please solve the problem for the general case of . Assume

.

(a) For stochastic gradient descent, variant 2, find , in which denotes

expected value, and for each probabilistic experiment, n is randomly chosen from

with uniform probability. Try to express your answer in terms of

.

(b) Compare your result of (a) with for batch gradient descent. Describe in words

how the batch GD update compares with the expected value of the stochastic GD

variant 2 update.

Explain why the comparison between these two expressions makes sense, or does not

make sense.

Δw(i) = w(i +1) − w(i)

w0

w(i) = w0

J (w) = Jn (w) n=1

N

∑ Jn (w)

J and Jn

η(i) = η = constant

E{Δw(i)} E{ }

{1,2,!,N}

∇wJ w0 ( )

Δw(i)