## Description

Problem 1 (Features; 10 points). It is common to pre-process the feature vectors in $\mathbb{R}^d$ before passing them to a learning algorithm. Two simple and generic ways to pre-process are as follows.

• Centering: Subtract the mean $\hat\mu := \frac{1}{|S|} \sum_{(x,y) \in S} x$ (of the training data) from every feature vector:
$$x \mapsto x - \hat\mu.$$

• Standardization: Perform centering, and then divide every feature by the per-feature standard deviation $\hat\sigma_i = \sqrt{\frac{1}{|S|} \sum_{(x,y) \in S} (x_i - \hat\mu_i)^2}$:
$$(x_1, x_2, \dots, x_d) \mapsto \left( \frac{x_1 - \hat\mu_1}{\hat\sigma_1}, \frac{x_2 - \hat\mu_2}{\hat\sigma_2}, \dots, \frac{x_d - \hat\mu_d}{\hat\sigma_d} \right).$$
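As a concrete illustration, both transformations can be computed in a few lines of NumPy (the matrix `X` below is a hypothetical training set; note that a new point is transformed with the statistics of the training data):

```python
import numpy as np

# Hypothetical training matrix: each row is a feature vector in R^d.
X = np.array([[1.0, 10.0], [3.0, 20.0], [5.0, 60.0]])

mu = X.mean(axis=0)     # per-feature mean over the training set
sigma = X.std(axis=0)   # per-feature standard deviation (1/|S| convention)

X_centered = X - mu             # centering: x -> x - mu
X_standard = (X - mu) / sigma   # standardization: also divide by sigma_i

# A new point is transformed using the *training* mu and sigma.
x_new = np.array([2.0, 30.0])
x_new_std = (x_new - mu) / sigma
```

Note that `np.std` uses the $\frac{1}{|S|}$ normalization by default, matching the definition above.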

For each of the following learning algorithms, and each of the above pre-processing transformations,

does the transformation affect the learning algorithm?

(a) The classifier based on the generative model where class conditional distributions are multivariate Gaussian distributions with a fixed covariance equal to the identity matrix I. Assume

MLE is used for parameter estimation.

(b) The 1-NN classifier using Euclidean distance.

(c) The greedy decision tree learning algorithm with axis-aligned splits. (For concreteness, assume

Gini index is used as the uncertainty measure, and the algorithm stops after 20 leaf nodes.)

(d) Empirical Risk Minimization: the (intractable) algorithm that finds the linear classifier (both

the weight vector and threshold) that has the smallest training error rate.

To make this more precise, consider any training data set $S$ from $\mathcal{X} \times \mathcal{Y}$, and let $\hat f_S : \mathcal{X} \to \mathcal{Y}$ be the classifier obtained from the learning algorithm with training set $S$. Let $\varphi : \mathcal{X} \to \mathcal{X}'$ be the pre-processing transformation, and let $\tilde f_{S'} : \mathcal{X}' \to \mathcal{Y}$ be the classifier obtained from the learning algorithm with training set $S'$, where $S'$ is the data set from $\mathcal{X}' \times \mathcal{Y}$ containing $(\varphi(x), y)$ for each $(x, y)$ in $S$. We say $\varphi$ affects the learning algorithm if there is a set of training data $S$ such that the classifier $\hat f_S$ is not the same as $x \mapsto \tilde f_{S'}(\varphi(x))$.

You should assume the following: (i) the per-feature standard deviations are never zero; (ii)

there are never any “ties” whenever you compute an arg max or an arg min; (iii) there are no issues

with numerical precision or computational efficiency.

What to submit: “yes” or “no” for each pre-processing and learning algorithm pair, along with

a brief statement of justification.


Problem 2 (More features; 30 points). Download the review data set `reviews_tr.csv` (training data) and `reviews_te.csv` (test data) from Courseworks. This data set consists of reviews of restaurants in Pittsburgh; the label indicates whether or not the reviewer-assigned rating is at least four (on a five-point scale). The data are in CSV format (where the first line is the header);

the first column is the label (“label”), and the second column is the review text (“text”). The text

has been processed to remove non-alphanumeric symbols and to make all letters lowercase.

Write a script that, using only the training data, tries different methods of data representation

and different methods for learning a linear classifier, and ultimately selects—via five-fold cross

validation—and uses one combination of these methods to train a final classifier. (You can think

of the choice of these methods as “hyperparameters”.)
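The selection step can be sketched as follows. The `represent` and `learn` callables are hypothetical interfaces standing in for the representations and learning methods listed below, and `KFold` is scikit-learn's fold splitter:

```python
import numpy as np
from sklearn.model_selection import KFold

def cv_error(texts, labels, represent, learn, n_splits=5):
    """Average error rate of one (representation, learner) combination
    under five-fold cross validation. `represent(train_texts)` is assumed
    to return a featurizer fit on the training fold only; `learn(X, y)`
    is assumed to return a classifier with a .predict() method."""
    texts, labels = np.asarray(texts), np.asarray(labels)
    fold_errors = []
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, val_idx in kf.split(texts):
        featurize = represent(texts[train_idx])   # fit on training fold only
        clf = learn(featurize(texts[train_idx]), labels[train_idx])
        predictions = clf.predict(featurize(texts[val_idx]))
        fold_errors.append(np.mean(predictions != labels[val_idx]))
    return float(np.mean(fold_errors))
```

The final classifier is then trained on the full training set using the combination with the smallest cross-validation error.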

The data representations to try are the following.

1. Unigram representation.

In this representation, there is a feature for every word w, and the feature value associated

with a word w in a document d is

tf(w; d) := number of times word w appears in document d .

(tf is short for term frequency.)
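For illustration, the unigram term frequencies of a document can be computed with a `Counter`; splitting on whitespace suffices here because the review text has already been stripped of non-alphanumeric symbols:

```python
from collections import Counter

def unigram_tf(text):
    """Term-frequency vector of a document as a word -> count mapping."""
    return Counter(text.split())

tf = unigram_tf("a rose is a rose")
# tf["a"] == 2, tf["rose"] == 2, tf["is"] == 1
```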

2. Term frequency-inverse document frequency (tf-idf ) weighting.

This is like the unigram representation, except the feature associated with a word w in a

document d from a collection of documents D (e.g., training data) is

$$\mathrm{tf}(w; d) \times \log_{10}(\mathrm{idf}(w; D)),$$

where $\mathrm{tf}(w; d)$ is as defined above, and

$$\mathrm{idf}(w; D) := \frac{|D|}{\text{number of documents in } D \text{ that contain word } w}.$$

This representation puts more emphasis on rare words and less emphasis on common words.

(There are many variants of tf-idf that are unfortunately all referred to by the same name.)

Important: When you apply this representation to a new document (e.g., a document in the

test set), you should still use the idf defined with respect to D. This, however, becomes

problematic if a word w appears in a new document but did not appear in any document in

D: in this case, idf(w; D) = |D|/0 = ∞. It is ambiguous what should be done in these cases,

but a safe recourse, which you should use in this assignment, is to simply ignore words w that

do not appear in any document in D.
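A minimal sketch of this weighting, including the rule of ignoring words unseen in $D$; the helper names `idf` and `tfidf_vector` are illustrative, not required:

```python
import math
from collections import Counter

def idf(word, docs):
    """idf(w; D) = |D| / (number of documents in D that contain w)."""
    df = sum(1 for d in docs if word in d.split())
    return len(docs) / df  # caller guarantees df > 0

def tfidf_vector(doc, docs):
    """tf-idf weights of a document with respect to the collection docs.
    Words appearing in no document of docs are ignored, as specified."""
    vocab = {w for d in docs for w in d.split()}
    counts = Counter(doc.split())
    return {w: c * math.log10(idf(w, docs))
            for w, c in counts.items() if w in vocab}

train = ["good tasty food", "bad food", "good service"]
weights = tfidf_vector("good good pizza", train)
# "pizza" never occurs in train, so it is dropped;
# "good" appears in 2 of 3 documents, so its weight is 2 * log10(3/2).
```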

3. Bigram representation.

In addition to the unigram features, there is a feature for every pair of words (w1, w2) (called

a bigram), and the feature value associated with a bigram (w1, w2) in a given document d is

tf((w1, w2); d) := number of times bigram (w1, w2) appears consecutively in document d .

In the sequence of words “a rose is a rose”, the bigrams that appear are: (a,rose), which

appears twice; (rose, is); and (is, a).
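For illustration, consecutive word pairs can be counted by zipping the word list with itself shifted by one position (a minimal sketch):

```python
from collections import Counter

def bigram_tf(text):
    """Counts of consecutive word pairs (bigrams) in a document."""
    words = text.split()
    return Counter(zip(words, words[1:]))

bg = bigram_tf("a rose is a rose")
# bg[("a", "rose")] == 2, bg[("rose", "is")] == 1, bg[("is", "a")] == 1
```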

4. Another data representation of your own choosing or design.

Some examples:


• Extend the bigram representation to n-gram representations (for any positive integer n)

to consider n-tuples of words appearing consecutively in documents.

• Extend to k-skip-n-grams: instead of just counting the number of times the n-tuple $(w_1, w_2, \dots, w_n)$ appears consecutively in the document, also count occurrences where $w_i$ and $w_{i+1}$ are at most $k$ words apart.

• Use variants of tf-idf weighting, such as one that replaces $\mathrm{tf}(w; d)$ with $1 + \log_{10}(\mathrm{tf}(w; d))$. This is better for documents where words often appear in bursts.

• Select specific words or n-grams you think are informative for the classification task.

The learning methods to try are the following.

1. Averaged-Perceptron (with some modifications described below).

Averaged-Perceptron is like Voted-Perceptron (which uses Online Perceptron), except instead of forming the final classifier as a weighted-majority vote over the various linear classifiers, you simply form a single linear classifier by taking a weighted average of the weight vectors and thresholds of all the linear classifiers used by Online Perceptron.

Use the following two modifications:

• Run Online Perceptron to make two passes through the data. Before each pass, randomly

shuffle the order of the training examples. Note that a total of 2n + 1 linear classifiers

are considered during the run of the algorithm (where n is the number of training data).

• Instead of averaging the weights and thresholds for all 2n + 1 linear classifiers, just

average these things for the final n + 1 linear classifiers.

You should use Averaged-Perceptron with all four data representations.
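The structure of the modified algorithm can be sketched as follows. This is an illustrative outline under stated assumptions (labels in $\{-1, +1\}$, a dense data matrix, a fixed shuffling seed), not a reference solution:

```python
import numpy as np

def averaged_perceptron(X, y, seed=0):
    """Sketch of Averaged-Perceptron with both modifications:
    two shuffled passes over the data, and averaging only the final
    n + 1 of the 2n + 1 linear classifiers. Labels assumed in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0                    # current Perceptron state
    w_sum, b_sum, kept = np.zeros(d), 0.0, 0   # running-average accumulators
    step = 0
    for _ in range(2):                         # two passes through the data
        for i in rng.permutation(n):           # reshuffle before each pass
            if y[i] * (X[i] @ w + b) <= 0:     # mistake: standard update
                w = w + y[i] * X[i]
                b = b + y[i]
            step += 1
            if step >= n:                      # keep classifiers w_n, ..., w_2n
                w_sum += w
                b_sum += b
                kept += 1
    return w_sum / kept, b_sum / kept          # kept == n + 1
```

The returned pair $(w, b)$ defines the single linear classifier $x \mapsto \mathrm{sign}(\langle w, x \rangle + b)$.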

2. Naïve Bayes classifier with parameter estimation as in the previous homework assignment.

Note that with this learning method, a word that appears more than once in a document is

treated the same as appearing exactly once.

You only need to use Naïve Bayes with the unigram representation.

3. (Optional.) Any other learning method you like.

You must write your own code to implement Averaged-Perceptron. The code should be easy

to understand (e.g., by using sensible variable names and comments). For other things

(e.g., Naïve Bayes, cross validation, feature processing), you can use your own or existing library

implementations in MATLAB (+Statistics/ML library) and Python (+numpy, scipy, sklearn), but

you are responsible for the correctness of these implementations as per the specifications from the

course lectures. Provide references for any such third-party implementations you use (e.g., specific

scikit-learn or MATLAB functions that perform cross validation).

What to submit:

1. A concise and unambiguous description of the fourth data representation you try (and also

of any additional learning methods you try).

2. Cross-validation error rates for all data representations / learning methods combinations.

3. Name of the method ultimately selected using the cross validation procedure.

4. Training and test error rates of the classifier learned by the selected method.

5. Source code and scripts that you write yourself (in separate files).


Problem 3 (MLE; 10 points).

(a) Consider the statistical model $\mathcal{P} = \{P_{\mu,\sigma^2} : \mu \in \mathbb{R}^d, \sigma^2 > 0\}$, where $P_{\mu,\sigma^2}$ is the multivariate Gaussian distribution with mean $\mu$ and covariance matrix $\sigma^2 I$. Give a formula for the MLE of $\sigma^2$ given the data $\{x_i\}_{i=1}^n$ from $\mathbb{R}^d$, which is regarded as an iid sample. Also give a clear derivation of the formula in which you briefly justify each step.

(b) Consider the statistical model $\mathcal{P} = \{P_\theta : \theta \in \mathbb{N}\}$, where $P_\theta$ is the distribution of the random pair $(X, Y)$ in $\mathbb{N} \times \{0, 1\}$ such that:

• $X$ is uniformly distributed on $\{1, 2, \dots, \theta\}$;

• for any $x \in \{1, 2, \dots, \theta\}$, the conditional distribution of $Y$ given $X = x$ is specified by $P_\theta(Y = 1 \mid X = x) = x/\theta$.

Give a formula for the MLE of θ given a single observation (x, y) ∈ N × {0, 1}. Also give a

clear derivation of the formula in which you briefly justify each step.
