COMS 4721: Machine Learning for Data Science, Homework 2

Problem 1 – 20 points
In this problem you will derive a naive Bayes classifier. For a labeled set of data (y1, x1), . . . , (yn, xn), where for this problem y ∈ {0, 1} and x is a D-dimensional vector of counts, the Bayes classifier observes a new x0 and predicts y0 as
$$y_0 = \arg\max_y \, p(y_0 = y \mid \pi) \prod_{d=1}^{D} p(x_{0,d} \mid \lambda_{y,d}).$$
The distribution p(y0 = y | π) = Bernoulli(y | π). What is "naive" about this classifier is the assumption that all D dimensions of x are independent. Assume that each dimension of x is Poisson distributed with a Gamma prior. The full generative process is
$$\text{Data: } y_i \overset{iid}{\sim} \text{Bern}(\pi), \quad x_{i,d} \mid y_i \sim \text{Pois}(\lambda_{y_i,d}), \; d = 1, \dots, D, \qquad \text{Prior: } \lambda_{y,d} \overset{iid}{\sim} \text{Gamma}(2, 1).$$
Derive the solution for π and each λy,d by maximizing
$$\hat{\pi}, \hat{\lambda}_{0,1:D}, \hat{\lambda}_{1,1:D} = \arg\max_{\pi,\,\lambda_{0,1:D},\,\lambda_{1,1:D}} \; \sum_{i=1}^{n} \ln p(y_i \mid \pi) + \sum_{d=1}^{D} \left( \ln p(\lambda_{0,d}) + \ln p(\lambda_{1,d}) + \sum_{i=1}^{n} \ln p(x_{i,d} \mid \lambda_{y_i,d}) \right).$$
Please separate your derivations as follows:
(a) Derive π̂ using the objective above.
(b) Derive λ̂y,d using the objective above, leaving y and d arbitrary in your notation.
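For reference, setting the gradient of this objective to zero gives closed-form solutions. A minimal NumPy sketch under my own working of the derivation (with the Gamma(2, 1) prior above, π̂ is the fraction of class-1 labels and λ̂y,d = (1 + Σ over class-y of xi,d) / (1 + ny); verify these against your own derivation before using them):

```python
import numpy as np

def naive_bayes_fit(X, y):
    """MAP estimates for the Poisson naive Bayes model with a Gamma(2, 1) prior.

    X is an (n, D) matrix of counts; y is an (n,) vector in {0, 1}.
    Setting the gradient of the log-joint objective to zero gives
    pi_hat = mean(y) and lam[c, d] = (1 + sum of X[i, d] over class c)
    divided by (1 + number of class-c examples).
    """
    pi_hat = y.mean()
    lam = np.zeros((2, X.shape[1]))
    for c in (0, 1):
        Xc = X[y == c]
        lam[c] = (1.0 + Xc.sum(axis=0)) / (1.0 + len(Xc))
    return pi_hat, lam
```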
Problem 2 – 30 points
In this problem you will implement the naive Bayes classifier derived in Problem 1 and the logistic regression algorithm. The data consists of examples of spam and non-spam emails, of which there are
4600 labeled examples. The feature vector x is a 54-dimensional vector extracted from the email and
y = 1 indicates a spam email.¹
In every experiment below, randomly partition the data into 10 groups and run the algorithm 10 different
times so that each group is held out as a test set one time. The final result you show should be the
cumulative result across these 10 groups.
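One way to set up the random 10-group partition (a sketch; any random, roughly equal split in which each group is held out exactly once works):

```python
import numpy as np

def ten_fold_indices(n, seed=0):
    """Randomly split the indices 0..n-1 into 10 roughly equal held-out groups."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    return np.array_split(perm, 10)
```

Each returned group serves once as the test set while the other nine form the training set, and the results are accumulated across all ten runs.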
(a) Implement the naive Bayes classifier described above. In a 2 × 2 table, write the number of times that you predicted a class y data point (ground truth) as a class y′ data point (model prediction) in the (y, y′)-th cell of the table, where y and y′ can each be either 0 or 1. There should be four values written in the table in your PDF. Next to your table, write the prediction accuracy: the sum of the diagonal divided by 4600. (The sum of all entries in the table should be 4600.)
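The 2 × 2 table and accuracy can be accumulated across the 10 held-out groups with a small helper (a sketch with hypothetical names):

```python
import numpy as np

def confusion_2x2(y_true, y_pred):
    """table[y, y'] counts how often ground-truth class y was predicted as y'.

    The accuracy is the sum of the diagonal divided by the total count.
    """
    table = np.zeros((2, 2), dtype=int)
    for t, p in zip(y_true, y_pred):
        table[t, p] += 1
    accuracy = np.trace(table) / table.sum()
    return table, accuracy
```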
(b) In one figure, show a stem plot (stem() in Matlab) of the 54 Poisson parameters for each class
averaged across the 10 runs. (This average is only used for plotting purposes on this homework.
In practice you would relearn these parameters using the entire data set to find their final values.)
Use the README file to make an observation about dimensions 16 and 52.
Next run logistic regression on the same data set. Set every yi = 0 to yi = −1 for this part. Also, be
sure to add a dimension equal to +1 to each data point.
(c) Implement the steepest ascent algorithm discussed in class. Use a step size η = 0.01/4600. Run your algorithm for 1,000 iterations and plot the logistic regression training objective function L per iteration for each of the 10 training runs. Plot these in the same figure.
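With labels yi ∈ {−1, +1}, the objective usually taken for this setup is L(w) = Σi ln σ(yi xiᵀw), whose gradient is Σi (1 − σ(yi xiᵀw)) yi xi. A sketch of the ascent loop under that assumption (check the exact objective against the class slides):

```python
import numpy as np

def steepest_ascent(X, y, eta, iters):
    """Gradient ascent on L(w) = sum_i ln sigma(y_i x_i^T w), y_i in {-1, +1}.

    Assumes the +1 intercept dimension was already appended to X.  Returns
    the final weights and the objective after each iteration (for plotting).
    """
    w = np.zeros(X.shape[1])
    objective = []
    for _ in range(iters):
        s = 1.0 / (1.0 + np.exp(-y * (X @ w)))   # sigma(y_i x_i^T w)
        w = w + eta * ((1.0 - s) * y) @ X         # gradient ascent step
        objective.append(np.log(1.0 / (1.0 + np.exp(-y * (X @ w)))).sum())
    return w, objective
```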
(d) Finally, implement an algorithm called "Newton's method" for logistic regression as follows: At iteration t, approximate the function
$$L(w) \approx L'(w) \equiv L(w_t) + (w - w_t)^T \nabla L(w_t) + \tfrac{1}{2} (w - w_t)^T \nabla^2 L(w_t) (w - w_t).$$
Then set w_{t+1} = arg max_w L′(w). Derive the update for w_{t+1} for the logistic regression problem and implement and run this algorithm. Plot the objective function L on the training data as a function of t = 1, . . . , 100 for each of the 10 training runs. Plot these in the same figure.
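Maximizing the quadratic approximation gives w_{t+1} = w_t − [∇²L(w_t)]⁻¹ ∇L(w_t). For L(w) = Σi ln σ(yi xiᵀw) this works out, in my derivation (so verify it against your own), to the sketch below:

```python
import numpy as np

def newton_logistic(X, y, iters=100):
    """Newton's method for L(w) = sum_i ln sigma(y_i x_i^T w), y_i in {-1, +1}.

    Gradient:  g = X^T ((1 - s) * y)
    Hessian:   H = -X^T diag(s * (1 - s)) X
    where s_i = sigma(y_i x_i^T w_t).  Update: w <- w - H^{-1} g.
    """
    w = np.zeros(X.shape[1])
    objective = []
    for _ in range(iters):
        s = 1.0 / (1.0 + np.exp(-y * (X @ w)))
        g = X.T @ ((1.0 - s) * y)
        H = -(X.T * (s * (1.0 - s))) @ X
        w = w - np.linalg.solve(H, g)
        objective.append(np.log(1.0 / (1.0 + np.exp(-y * (X @ w)))).sum())
    return w, objective
```

Note that on perfectly separable data the Hessian becomes ill-conditioned as the weights grow; the spam data is not separable, so this is unlikely to be an issue in practice here.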
(e) In a 2 × 2 table, show the testing results using Newton’s method in the same way as shown in
Problem 2(a).
¹I've preprocessed the data already. Please use the data as-is for all of Problem 2. More information about the meanings of the 54 dimensions of the data is provided in two accompanying files.
Problem 3 – 25 points
In this problem you will implement the Gaussian process model for regression. You will use the same
data used for homework 1 to do this, which is again provided in the data zip file for this homework.
Recall that the Gaussian process treats a set of N observations (x1, y1), . . . , (xN, yN), with xi ∈ Rᵈ and yi ∈ R, as being generated from a multivariate Gaussian distribution as follows,
$$y \sim \text{Normal}(0, \sigma^2 I + K), \qquad K_{ij} = K(x_i, x_j) = \exp\left\{ -\frac{1}{b} \| x_i - x_j \|^2 \right\}.$$
Here, y is an N-dimensional vector of outputs and K is an N × N kernel matrix. For this problem use the Gaussian kernel indicated above. In the lecture slides, we discuss making predictions for a new y′ given x′, which was Gaussian with mean µ(x′) and variance Σ(x′). The equations are shown in the slides. There are two parameters that need to be set for this model as given above, σ² and b.
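The predictive equations themselves are in the slides; under the common form µ(x′) = k′ᵀ(σ²I + K)⁻¹y, with k′ᵢ = K(x′, xᵢ) (an assumption here; check it against the slides), the predictive mean can be sketched as:

```python
import numpy as np

def gauss_kernel(A, B, b):
    """Gaussian kernel exp(-||a - b||^2 / b) between all rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / b)

def gp_predict(X_train, y_train, X_test, b, sigma2):
    """Predictive mean of the GP: mu(x') = k'^T (sigma^2 I + K)^{-1} y."""
    K = gauss_kernel(X_train, X_train, b)
    k0 = gauss_kernel(X_test, X_train, b)
    alpha = np.linalg.solve(sigma2 * np.eye(len(X_train)) + K, y_train)
    return k0 @ alpha
```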
a) Write code to implement the Gaussian process and to make predictions on test data. For b ∈ {5, 7, 9, 11, 13, 15} and σ² ∈ {.1, .2, .3, .4, .5, .6, .7, .8, .9, 1} (so 60 total pairs (b, σ²)), calculate the RMSE on the 42 test points as you did in the first homework. Use the mean of the Gaussian process at the test point as your prediction. Show your results in a table.
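The 60-pair sweep is a double loop over b and σ². A self-contained sketch (kernel and predictive mean inlined; variable names are hypothetical) might look like:

```python
import numpy as np

def grid_search(X_tr, y_tr, X_te, y_te):
    """RMSE on the test set for every (b, sigma^2) pair in the grid of part a)."""
    sq_tr = ((X_tr[:, None, :] - X_tr[None, :, :]) ** 2).sum(-1)
    sq_te = ((X_te[:, None, :] - X_tr[None, :, :]) ** 2).sum(-1)
    results = {}
    for b in (5, 7, 9, 11, 13, 15):
        K = np.exp(-sq_tr / b)      # N x N training kernel matrix
        k0 = np.exp(-sq_te / b)     # test-versus-train kernel
        for s2 in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]:
            mu = k0 @ np.linalg.solve(s2 * np.eye(len(X_tr)) + K, y_tr)
            results[(b, s2)] = np.sqrt(np.mean((mu - y_te) ** 2))
    return results
```

The returned dictionary maps each (b, σ²) pair to its RMSE and can be laid out directly as the 6 × 10 table requested.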
b) Which value was the best and how does this compare with the first homework? What might be a
drawback of using the approach in this homework compared with homework 1?
c) To better understand what the Gaussian process is doing through visualization, re-run the algorithm
by using only the 4th dimension of xi (car weight). Set b = 5 and σ² = 2. Show a scatter plot
of the data (x[4] versus y for each point). Also, plot as a solid line the predictive mean of the
Gaussian process at each point in the training set. You can think of this problem as asking you to
create a test set by duplicating xi[4] for each i in the training set and then to predict that test set.