## Description

### Problem 1 (40 points)

In this problem, you will implement the Online Perceptron algorithm with an online-to-batch conversion, and evaluate it on a classification task with a few different data representations.

#### Restaurant review data set

Download the review data set `reviews_tr.csv` (training data) and `reviews_te.csv` (test data) from Courseworks. This data set consists of reviews of restaurants in Pittsburgh; the label indicates whether or not the reviewer-assigned rating is at least four (on a five-point scale). The data are in CSV format (where the first line is the header); the first column is the label (`label`; 0 or 1), and the second column is the review text (`text`). The text has been processed to remove non-alphanumeric symbols and capitalization.

#### Data representations

In this problem, you will experiment with the following four different data representations.

1. Unigram representation (i.e., term frequency).

   In this representation, there is a feature for every word $t$, and the feature value associated with a word $t$ in a document $d$ is
   $$\mathrm{tf}(t; d) := \text{number of times word } t \text{ appears in document } d.$$

2. Term frequency-inverse document frequency (tf-idf).

   This is like the unigram representation, except the feature associated with a word $t$ in a document $d$ from a collection of documents $D$ (e.g., training data) is
   $$\mathrm{tf}(t; d) \times \log_{10}(\mathrm{idf}(t; D)),$$
   where $\mathrm{tf}(t; d)$ is as defined above, and
   $$\mathrm{idf}(t; D) := \frac{|D|}{\text{number of documents in } D \text{ that contain word } t}.$$
   This representation puts more emphasis on rare words and less emphasis on common words. (There are many variants of tf-idf that are unfortunately all referred to by the same name.)

   Note: When you apply this representation to a new document (e.g., a document in the test set), you should still use the idf defined with respect to $D$. This, however, becomes problematic if a word $t$ appears in a new document but did not appear in any document in $D$: in this case, $\mathrm{idf}(t; D) = |D|/0 = \infty$. It is not obvious what should be done in these cases. For this homework assignment, simply ignore words $t$ that do not appear in any document in $D$.

3. Bigram representation.

   In addition to the unigram features, there is a feature for every pair of words $(t_1, t_2)$ (called a bigram), and the feature value associated with a bigram $(t_1, t_2)$ in a given document $d$ is
   $$\mathrm{tf}((t_1, t_2); d) := \text{number of times bigram } (t_1, t_2) \text{ appears consecutively in document } d.$$
   In the sequence of words "a rose is a rose", the bigrams that appear are: (a, rose), which appears twice; (rose, is); and (is, a).

4. One more data representation of your own choosing (e.g., trigrams, skip-grams). It should be non-trivially different from the representations above (e.g., not just bigrams plus a few extra features).

In all four of these representations, include an "intercept" feature whose value is always equal to one (as in affine expansion).
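As a concrete (and entirely optional) illustration of the first three definitions, here is a minimal pure-Python sketch of unigram counts, bigram counts, and the raw idf value. The function names and the `"<intercept>"` feature key are our own choices for the sketch, not part of the assignment:

```python
from collections import Counter

def unigram_tf(doc):
    """tf(t; d): count of each word, plus an always-one intercept feature."""
    feats = Counter(doc.split())
    feats["<intercept>"] = 1  # the "intercept" feature described above
    return feats

def bigram_tf(doc):
    """Unigram counts plus counts of consecutive word pairs (bigrams)."""
    words = doc.split()
    feats = unigram_tf(doc)
    feats.update(Counter(zip(words, words[1:])))  # adds bigram counts
    return feats

def idf(word, docs):
    """idf(t; D) = |D| / #{d in D : t appears in d}; None if t never appears."""
    df = sum(1 for d in docs if word in d.split())
    return len(docs) / df if df else None  # assignment says: ignore df == 0 words

feats = bigram_tf("a rose is a rose")
print(feats[("a", "rose")])  # the bigram (a, rose) appears twice -> 2
```

A real implementation would map these feature keys to indices of a sparse vector; the counting logic is unchanged.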


#### Online Perceptron with online-to-batch conversion

Implement the Online Perceptron algorithm with the following online-to-batch conversion process (similar to one suggested in lecture):

1. Run Online Perceptron to make two passes through the training data. Before each pass, randomly shuffle the order of the training examples. Note that a total of $2n + 1$ linear classifiers $\hat w_1, \dots, \hat w_{2n+1}$ are created during the run of the algorithm, where $n$ is the number of training examples.

2. Return the linear classifier $\hat w_{\mathrm{final}}$ given by the simple average of the final $n + 1$ linear classifiers:
   $$\hat w_{\mathrm{final}} := \frac{1}{n+1} \sum_{i=n+1}^{2n+1} \hat w_i.$$

Of course, you should not use (or even look at the source code for) any existing implementation of Perceptron or Online Perceptron; that would defeat the purpose of this assignment. You can, however, use existing library functions for basic operations needed in Online Perceptron (e.g., vector inner products), and also for loading the CSV files and creating the desired data representations. (You are responsible for determining whether or not an existing library function correctly implements the desired data representation.) As always, provide proper citations for any software packages you use.

Recall that as Online Perceptron is making its pass through the training examples, it only updates its weight vector in rounds in which it makes a prediction error. So, for example, if $\hat w_1$ does not make a mistake in classifying the first training example, then $\hat w_2 = \hat w_1$. Use this fact to keep the memory usage of your code relatively modest.
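To make the two-pass scheme concrete, here is a small dense-vector sketch of the conversion (our own illustration, not a reference solution): it stores only the current weight vector and a running sum, never all $2n + 1$ classifiers. Real review data would need a sparse encoding instead of Python lists.

```python
import random

def train_avg_perceptron(examples, dim, seed=0):
    """Online Perceptron, two shuffled passes over (x, y) pairs with
    y in {-1, +1}; returns the average of the last n + 1 weight
    vectors w_{n+1}, ..., w_{2n+1} without storing them all."""
    rng = random.Random(seed)
    n = len(examples)
    w = [0.0] * dim                      # current weight vector
    w_sum = [0.0] * dim
    for p in range(2):
        rng.shuffle(examples)            # reshuffle before each pass
        if p == 1:
            w_sum = list(w)              # w_{n+1}: state after the first pass
        for x, y in examples:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]    # mistake: update
            if p == 1:
                w_sum = [s + wi for s, wi in zip(w_sum, w)]  # accumulate w_{n+1+t}
    return [s / (n + 1) for s in w_sum]

def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
```

Note that adding the full `w` into `w_sum` every round is what a sparse implementation avoids: since `w` changes only on mistakes, one can track, per feature, the last round it changed and add `w[i]` times the number of unchanged rounds.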

You will be using this learning algorithm with the restaurant review data represented in the four ways described above. The total number of features with each representation can be very large. However, because each review has only a relatively small number of words, it should still be possible to compactly represent the data. Use this fact to keep the running time of your code close to linear in the size of the training data.

Do the following with each of the four data representations above. Run your code on the training data to learn a linear classifier $\hat w_{\mathrm{final}}$. Compute the training error rate of $\hat w_{\mathrm{final}}$ on this training data, and compute the test error rate of $\hat w_{\mathrm{final}}$ on the test data.

#### Post-hoc analysis

Now consider the classifier based on the unigram representation. To get a sense of its behavior, determine the 10 words that have the highest (i.e., most positive) weights; also determine the 10 words that have the lowest (i.e., most negative) weights.

Pick two training examples that this classifier (based on the unigram representation) incorrectly classifies. For each of these examples $x = (x_1, \dots, x_d)$, identify the 10 words with the highest (i.e., most positive) $w_i x_i$ value, and also the 10 words with the lowest (i.e., most negative) $w_i x_i$ value. Attempt to explain why the classifier incorrectly classifies these examples.
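Extracting and alphabetizing the extreme-weight words might look like the following sketch, assuming (hypothetically) that the learned weights are kept in a word-to-weight dictionary:

```python
def extreme_words(weights, k=10):
    """weights: dict mapping word -> learned weight. Returns (top, bottom):
    the k most positive and k most negative words, each list sorted
    alphabetically as the write-up requests."""
    ranked = sorted(weights, key=weights.get)  # ascending by weight
    return sorted(ranked[-k:]), sorted(ranked[:k])

w = {"great": 2.0, "awful": -3.0, "food": 0.5, "bland": -1.0}
top, bottom = extreme_words(w, k=2)
print(top, bottom)  # ['food', 'great'] ['awful', 'bland']
```

The same ranking idea applies per misclassified example: replace the weight dictionary with one mapping each word in the example to its $w_i x_i$ value.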

What to submit in your write-up:

(a) A precise specification of your fourth chosen data representation. Briefly explain why you chose this representation. Provide proper citations of any sources you used to come up with this representation.

(b) Training error rates of the linear classifiers with all four data representations.

(c) Test error rates of the linear classifiers with all four data representations.

(d) The lists of words with highest/lowest weights, described above. (Sort each list alphabetically.)

(e) The text of two misclassified examples, with associated lists of words, and explanations for why the examples were misclassified.

Please submit your source code on Courseworks.


### Problem 2 (30 points)

In this problem, you'll analyze an algorithm that finds a large-margin linear separator whenever one exists. Recall that the Perceptron algorithm quickly finds a linear separator for a data set $S$ whenever there exists a large-margin linear separator for $S$. But the linear separator returned by Perceptron is not necessarily one that achieves a large margin on $S$.

An alternative is the following algorithm, which we call Margin Perceptron. This algorithm, like Perceptron, takes as input a collection $S$ of labeled examples from $\mathbb{R}^d \times \{-1, +1\}$.

- Begin with $\hat w_1 := 0 \in \mathbb{R}^d$.
- For $t = 1, 2, \dots$:
  - If there is a labeled example in $S$ (call it $(x_t, y_t)$) such that $y_t \hat w_t^\top x_t < 1$, then set $\hat w_{t+1} := \hat w_t + \eta y_t x_t$.
  - Else, return $\hat w_t$.
Above, there are two differences relative to the original Perceptron algorithm. The first is the criterion under which an example in $S$ is used to perform an update: $y_t \hat w_t^\top x_t < 1$ (rather than $\leq 0$). The second is the update itself: $\hat w_{t+1} := \hat w_t + \eta y_t x_t$. Here, $\eta > 0$ is a parameter of the algorithm, which we'll assume is set as
$$\eta := \frac{1}{L^2}, \quad \text{where } L := \max_{(x,y) \in S} \|x\|_2.$$
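A direct transcription of the loop above (our own sketch, written for small dense examples; it terminates only when $S$ is linearly separable, which is the assumption analyzed below):

```python
def margin_perceptron(S):
    """Margin Perceptron: S is a list of (x, y) pairs, x a list of
    floats, y in {-1, +1}. Updates while some example has margin
    y * <w, x> < 1, with step size eta = 1 / L^2."""
    L_sq = max(sum(xi * xi for xi in x) for x, _ in S)  # L^2 = max ||x||_2^2
    eta = 1.0 / L_sq
    w = [0.0] * len(S[0][0])
    while True:
        for x, y in S:
            if y * sum(wi * xi for wi, xi in zip(w, x)) < 1:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                break  # restart the scan after an update
        else:
            return w  # every example now has margin at least 1
```

The `for`/`else` construct returns `w` exactly when a full scan of $S$ finds no example with margin below 1, matching the "Else, return $\hat w_t$" step.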

If Margin Perceptron terminates, it returns $\hat w_t$ satisfying
$$y \hat w_t^\top x \geq 1 \quad \text{for all } (x, y) \in S.$$

But this fact alone does not mean that $\hat w_t$ achieves a large margin on $S$. After all, if $w$ is any linear separator satisfying $y w^\top x > 0$ for all $(x, y) \in S$, we can construct another $\tilde w$ satisfying $y \tilde w^\top x \geq 1$ for all $(x, y) \in S$, simply by letting $\tilde w$ be a particular scaling of $w$.

(a) Assume that there exists a vector $w_\star \in \mathbb{R}^d$ such that
$$\min_{(x,y) \in S} y w_\star^\top x = 1.$$
Mimic the proof of the Perceptron convergence theorem to prove that Margin Perceptron halts after at most $3 \|w_\star\|_2^2 L^2$ loop iterations. Clearly explain every step of the proof, especially where it differs from that of the original Perceptron convergence theorem.

(b) Under the same assumption as in Part (a), prove that Margin Perceptron returns $\hat w_t$ satisfying
$$\|\hat w_t\|_2 \leq 3 \|w_\star\|_2.$$
Hint: Use the claim you proved in Part (a) (and, possibly, part of its proof).

(c) Very briefly explain why this means that the margin achieved by $\hat w_t$ is (almost) as large as the margin achieved by $w_\star$.

(d) (Optional.) For any given $\epsilon > 0$, explain how to modify Margin Perceptron so that, under the same conditions as in the previous parts, it returns $\hat w_t$ satisfying
$$\|\hat w_t\|_2 \leq (2 + \epsilon) \|w_\star\|_2.$$


### Problem 3 (30 points)

In this problem, you'll use Lagrange duality to derive the "dual" forms of ridge regression and a variant of the soft-margin SVM problem.

The ridge regression optimization problem can be written in a form that is reminiscent of the soft-margin SVM problem (for $C > 0$):

$$\min_{w \in \mathbb{R}^d, \, \xi \in \mathbb{R}^n} \; \frac{1}{2} \|w\|_2^2 + \frac{C}{2} \sum_{i=1}^n \xi_i^2 \quad \text{s.t. } \xi_i = y_i - x_i^\top w \text{ for all } i = 1, \dots, n.$$

Above, $\xi = (\xi_1, \dots, \xi_n) \in \mathbb{R}^n$ has a similar role as the slack variables in soft-margin SVM, but note that they are not required to be non-negative here.
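As an aside (not required for the problem), a dual derivation can be sanity-checked numerically: substituting the equality constraint $\xi_i = y_i - x_i^\top w$ makes the primal an unconstrained ridge objective with closed-form minimizer $(X^\top X + I/C)^{-1} X^\top y$, which the $w$ recovered from a candidate dual solution should match. A sketch assuming NumPy is available:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, C = 5, 3, 2.0
X = rng.standard_normal((n, d))  # rows are the x_i
y = rng.standard_normal(n)
K = X @ X.T                      # Gram matrix: K[i, j] = x_i . x_j

# With xi_i = y_i - x_i^T w substituted in, the objective becomes
# (1/2)||w||^2 + (C/2)||y - Xw||^2, minimized where the gradient
# w - C X^T (y - Xw) vanishes:
w_primal = np.linalg.solve(X.T @ X + np.eye(d) / C, X.T @ y)

grad = w_primal - C * X.T @ (y - X @ w_primal)
print(np.allclose(grad, 0))  # True: w_primal is the exact minimizer
```

Comparing `w_primal` against the $w$ implied by a derived dual $\alpha$ on small random data is a quick check on sign or indexing errors in the algebra.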

The Lagrangian function associated with this optimization problem is
$$L(w, \xi, \alpha) = \frac{1}{2} \|w\|_2^2 + \frac{C}{2} \sum_{i=1}^n \xi_i^2 + \sum_{i=1}^n \alpha_i (y_i - x_i^\top w - \xi_i).$$

Here, $\alpha = (\alpha_1, \dots, \alpha_n) \in \mathbb{R}^n$ are the Lagrange multipliers, which are unconstrained.

(a) Fix $\alpha \in \mathbb{R}^n$. Derive expressions for the $w$ and $\xi$ that, together, minimize $L(w, \xi, \alpha)$. The expressions should be given in terms of $\alpha$ (as well as the data and $C$). Briefly justify each step of your derivations.

(b) The dual objective function $g \colon \mathbb{R}^n \to \mathbb{R}$ is equal to the Lagrangian function evaluated at $w_\alpha, \xi_\alpha, \alpha$, where $w_\alpha$ and $\xi_\alpha$ are the values of $w$ and $\xi$ from Part (a) that minimize $L$:
$$g(\alpha) = L(w_\alpha, \xi_\alpha, \alpha) = \min_{w \in \mathbb{R}^d, \, \xi \in \mathbb{R}^n} L(w, \xi, \alpha).$$
Derive an expression for the $\alpha$ that maximizes $g(\alpha)$. The expression should be given in terms of the data and $C$. Moreover, the expression should only involve the $x_i$'s through the matrix $K \in \mathbb{R}^{n \times n}$ whose $(i, j)$-th entry is $x_i^\top x_j$. Briefly justify each step of your derivation.

(c) (Optional.) A variant of the soft-margin SVM problem that is similar to ridge regression is the following optimization problem (for $C > 0$):
$$\min_{w \in \mathbb{R}^d, \, \xi \in \mathbb{R}^n} \; \frac{1}{2} \|w\|_2^2 + \frac{C}{2} \sum_{i=1}^n \xi_i^2 \quad \text{s.t. } y_i x_i^\top w \geq 1 - \xi_i \text{ for all } i = 1, \dots, n.$$

This is almost the same as soft-margin SVM, except the slack variables are squared in the objective. The Lagrangian is
$$L(w, \xi, \alpha) = \frac{1}{2} \|w\|_2^2 + \frac{C}{2} \sum_{i=1}^n \xi_i^2 + \sum_{i=1}^n \alpha_i (1 - y_i x_i^\top w - \xi_i).$$

However, since the constraints are inequality constraints, the associated Lagrange multipliers $\alpha \in \mathbb{R}^n$ are constrained to be non-negative.

Derive expressions for the $w$ and $\xi$ that, together, minimize $L(w, \xi, \alpha)$, and then derive an expression for the dual objective $g(\alpha) = \min_{w \in \mathbb{R}^d, \, \xi \in \mathbb{R}^n} L(w, \xi, \alpha)$. Explain why the dual objective only depends on the $x_i$'s through the matrix $K \in \mathbb{R}^{n \times n}$ whose $(i, j)$-th entry is $x_i^\top x_j$.
