## Description

Problem 1 (Non-linear classifiers; 20 points). For this problem, you will need to learn to use

software libraries for at least two of the following non-linear classifier types:

• neural networks;

• boosted decision trees (i.e., boosting, with the “weak learner” being a decision tree learner);

• random forests.

All of these are available in scikit-learn and the MATLAB Statistics and Machine Learning toolbox,

although you may also use other external libraries (e.g., Tensorflow1

for neural networks, XGBoost2

for boosted decision trees). You are welcome to implement learning algorithms for these classifiers

yourself, but this is neither required nor recommended.

Learn two different types of non-linear classifiers from above for the HW3 restaurant review

data, with any feature representation you like. You can use the software libraries mentioned above,

and as well as libraries & tools (or your old code from HW3) to perform the feature processing.

Your should try to match (and hopefully surpass) the results obtained by the “modified” AveragedPerceptron with bigram features from HW3, which had five-fold cross-validation error rate around

8.8%. You may sub-sample the training data if you like, but note that using more training data

usually improves performance (up to a point).

For many of these learning algorithms, you will need to set various hyperparameters (including

classifier “architecture”, like the number of layers and width of each layer in a neural network,

and the number of trees in a random forest). Often there are defaults that make a good starting

point, but you may need to adjust at least some of them to get good performance. Use hold-out

validation or K-fold cross-validation to do this. Do not make any hyperparameter choices

(or any other similar choices) based on the test set! You should only compute the test error

rates after you have settled on hyperparameter settings and trained your two final classifiers.

What to submit:

1. Names of the two types of classifiers you opt to learn.

2. Proper citations for any external code you use. See https://integrity.mit.edu/handbook/

writing-code for guidelines.

3. Description of your training methodology, with enough details so that another machine learning enthusiast can reproduce the your results.

4. The final hyperparameter settings you use.

5. Training error rates, hold-out or cross-validation error rates, and test error rates for your two

final classifiers.

No need to submit any code.

1

https://www.tensorflow.org

2

https://github.com/dmlc/xgboost/tree/master/python-package

1

Problem 2 (Conditional probability estimation; 15 points).

(a) Suppose (X, Y ) is a random pair taking values in X × {−1, 1}. Let η : X → [0, 1] be the

conditional probability function, given by η(x) := P(Y = 1 | X = x). What function

f : X → R minimizes E[`(Y f(X))] where `(z) := e

−z

? Give your answer in terms of η (e.g.,

f(x) = η(x)).

(b) Download the data set hw5data.mat from Courseworks (which has training feature vectors

and labels, stored as data and labels, respectively). Compute the maximum likelihood

estimates of the logistic regression model parameters (β0, β) ∈ R × R

d

. You may use your

code from the previous homework assignment, or code from an existing software library for

logistic regression. It should be possible to obtain estimates with objective value (from the

previous homework) not larger than 0.50317.

The data set hw5data.mat also contains test feature vectors and labels, stored as testdata

and testlabels, respectively. With your estimated parameters (βˆ

0, βˆ) ∈ R × R

d

in hand,

you can compute the conditional probability

P(βˆ

0,βˆ)

(Y = 1 | X = x) = 1

1 + exp(−βˆ

0 − hβˆ, xi)

for each row x

> in testdata.

It turns out each row of testdata actually appears 128 times in testdata: for each i =

1, 2, . . . , 1024, the testdata rows {x

>

1024k+i

: k = 0, 1, . . . , 127} are identical. However,

the corresponding entries in testlabels are not repeated in the same way: for each i =

1, 2, . . . , 1024, the actual fraction of labels equal to one is pi

:=

1

128

P127

k=0 y1024k+i ∈ { / 0, 1}.

Compare P(βˆ

0,βˆ)

(Y = 1 | X = xi) to pi by computing the mean absolute error (MAE)

MAE :=

1

1024

1024

X

i=1

P(βˆ

0,βˆ)

(Y = 1 | X = xi) − pi

.

(This is one way to evaluate conditional probability predictions; there are many others.)

What to submit in your write-up:

1. MAE of conditional probability predictions.

2. Proper citations for any external code you use.

No need to submit any code.

(c) Optional. The data from part (b) was generated by a distribution for which the Bayes classifier

is a linear classifier, but is not one from the logistic regression model. Try a different approach

to get better conditional probability predictions on the data from part (b).

What to submit in your write-up:

1. MAE of conditional probability predictions.

2. Proper citations for any external code you use.

3. A description of the approach you tried.

No need to submit any code.

2

Problem 3 (Linear regression; 15 points).

(a) Let S := {(xi

, yi)}

n

i=1 be a data set from R

d × R, and let P = {P(w,σ2)

: w ∈ R

d

, σ2 > 0} be

the “classical statistical model for linear regression” discussed in lecture.

Let A be the n×d matrix whose i-th row is x

>

i

, let y be the vector whose i-th entry is yi

.

Suppose S really is an iid sample from a distribution P(w?,σ2

?

) ∈ P. Assume that A has rank

d, so A>A is invertible and wˆ ols exists; define yˆ := Awˆ ols.

For each of the following assertions, state whether it is true or false, and briefly justify your

answer.

1. var(yi

| xi) = σ

2

?

.

2. cov(w? | A) = σ

2

?

(A>A)

−1

.

3. cov(wˆ ols | A) = σ

2

?

(A>A)

−1

.

4. E(yˆ | A) = Aw?.

5. r := y − Awˆ ols is the orthogonal projection of y onto the null space of A>

.

6. If every entry in the first column of A is 1, then Pn

i=1 ri = 0 (where ri

is the i-th entry

of r, defined above).

(b) Download the Boston housing data set housing.mat from Courseworks. The first feature

is a constant feature, always equal to one: this is just here to simplify estimation of the

“intercept” term. The next 13 features are CRIM, ZN, . . . , LSTAT, as described in https:

//archive.ics.uci.edu/ml/datasets/Housing; note that standardization (based on the

training data) has been applied (to both the training data and test data). The output (label)

is MEDV, the median value of owner-occupied homes (in units of $1000).

Compute the ordinary least squares (OLS) estimator based on the training data (data and

labels). You can use whatever software libraries you like to do this. Compute the average

squared loss of the OLS estimator on the test data (testdata and testlabels).

Use some existing software package to compute a sparse weight vector with at most three nonzero entries (not including the “intercept”) based on the training data. Compute the average

squared loss of this sparse linear predictor on the test data (testdata and testlabels).

Some suggested methods to use: Lasso, LARS (which is actually an algorithm for solving the

Lasso optimization problem with some additional convenient properties), stepwise regression,

orthogonal matching pursuit.

What to submit in your write-up:

1. Test average squared loss of OLS estimator.

2. Test average squared loss of the sparse linear predictor.

3. Names of the variables with non-zero entries in the sparse linear prediction. Report the

actual variable names3

(e.g., CRIM, ZN, INDUS).

4. Proper citations for any external code you use.

No need to submit any code.

3The names of the variables are given in https://archive.ics.uci.edu/ml/machine-learning-databases/

housing/housing.names.

3