Description
1 Kernel Perceptron
In class, we have seen the idea of kernels applied to support vector machines. In this
question, you will derive the kernel version of the Perceptron learner. For the purpose of this
question, we will assume that instances are classified using a feature representation φ(x).
That is, Perceptron classifies an example as the sign of the dot product of the weight vector
and the example, sgn(w^T φ(x)). The goal is to, instead, represent this computation using a kernel function K.
Note that, with this feature representation, the Perceptron update is w ← w + r y φ(x)
whenever there is a mistake (i.e. y w^T φ(x) < 0). For this problem you may assume the
learning rate r = 1.
1. Assume that the initial weight vector is the zero vector. Then, show that
w = Σ_{(x_i, y_i) ∈ M} y_i φ(x_i).
Here M = {(x_i, y_i)} is the set of examples on which the learning algorithm made mistakes.
2. Let K be a kernel function defined as K(u, v) = φ(u)^T φ(v).
Using the fact that the weight vector can be written as in the previous question, write
the classification step y = sgn(w^T φ(x)) using the kernel function K(u, v) instead of
the explicit feature representation φ(u).
3. In pseudo code, write the full kernel perceptron algorithm.
(Hint: instead of storing a weight vector, you will store a list of examples the algorithm
made a mistake on; the sketch below illustrates this bookkeeping.)
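To make the hint concrete, here is a minimal Python sketch of this bookkeeping, assuming some kernel function K(u, v) is available. The names (kernel_perceptron_train, mistakes) and the convention of treating a zero score as a mistake are illustrative assumptions.

    # Kernel perceptron sketch: instead of a weight vector, keep the list M of
    # (x_i, y_i) pairs the learner got wrong; prediction sums kernel values
    # against that list, since w^T phi(x) = sum over M of y_i * K(x_i, x).
    def kernel_perceptron_train(data, K, epochs=10):
        """data: list of (x, y) pairs with y in {-1, +1}; K: kernel function."""
        mistakes = []                                # the mistake set M
        for _ in range(epochs):
            for x, y in data:
                s = sum(yi * K(xi, x) for xi, yi in mistakes)
                if y * s <= 0:                       # mistake (zero counts, so the
                    mistakes.append((x, y))          # first example triggers an update)
        return mistakes

    def kernel_perceptron_predict(mistakes, K, x):
        s = sum(yi * K(xi, x) for xi, yi in mistakes)
        return 1 if s >= 0 else -1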
2 Naïve Bayes
Consider the Boolean function f_TH(3,7). This is a threshold function defined on the 7-dimensional Boolean cube as follows: given an instance x, f_TH(3,7)(x) = 1 if and only if 3 or more
of x's components are 1.
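For concreteness, here is a small Python sketch that enumerates the Boolean cube and evaluates f_TH(3,7); the counts it produces may be useful when estimating the probabilities asked about below.

    from itertools import product

    # f_TH(3,7)(x) = 1 iff at least 3 of the 7 Boolean components are 1
    def f_th37(x):
        return 1 if sum(x) >= 3 else 0

    # Under the uniform distribution on {0,1}^7, count the positive labels.
    cube = list(product([0, 1], repeat=7))
    pos = sum(f_th37(x) for x in cube)
    print(pos, len(cube))   # 99 positives out of the 128 vertices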
1. Show that f_TH(3,7) has a linear decision surface over the 7-dimensional Boolean cube.
2. Assume that you are given data sampled according to the uniform distribution over
the Boolean cube {0, 1}^7 and labeled according to f_TH(3,7). Use naïve Bayes to learn a
hypothesis that predicts these labels. What is the hypothesis generated by the naïve
Bayes algorithm? (You do not have to implement the algorithm here. You may assume
that you have seen all the data required to get accurate estimates of the probabilities.)
3. Show that the hypothesis produced in (2) does not represent this function.
4. Are the naïve Bayes assumptions satisfied by f_TH(3,7)? Justify your answer.
3 Ensembles of decision trees
Support Vector Machines
Recall from class that an SVM is a technique for learning a linear classifier by minimizing
the following loss function:
E(w) = (1/2) w^T w + C Σ_i max(0, 1 − y_i w^T x_i)
where C is a hyper-parameter that controls the relative importance of the regularization
term (1/2) w^T w with respect to the error term. As always, the inputs x_i are real-valued vectors
and y_i ∈ {−1, +1}. This formalism is commonly referred to as L2 regularization and L1 loss.
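To pin down the notation, here is a minimal numpy sketch of this objective; the array names X and y (an n × d matrix of inputs and a vector of ±1 labels) are assumptions.

    import numpy as np

    # E(w) = (1/2) w^T w + C * sum_i max(0, 1 - y_i w^T x_i)
    def svm_objective(w, X, y, C):
        margins = y * (X @ w)                    # y_i * w^T x_i for every example
        hinge = np.maximum(0.0, 1.0 - margins)   # L1 (hinge) loss per example
        return 0.5 * (w @ w) + C * hinge.sum()   # L2 regularizer + weighted error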
Stochastic Gradient Descent
The idea behind SGD is to perform gradient descent, but to compute the gradient using only a
single example at a time. In practice, it can be helpful to shuffle the order of the data for each epoch.
1. Initialize w = 0 (the zero vector), t = 0
2. for epoch 1 . . . T:
(a) for each example (x_i, y_i), in random order:
i. r_t = learning rate at t
ii. w = w − r_t ∇E(w, x_i, y_i)
iii. t = t + 1
Here, the gradient is defined as follows:
∇E(w, x, y) =
w − C y x   if y w^T x ≤ 1
w           otherwise
(Refer to the lecture slides for the full description of the algorithm.)
The learning rate is often stated as just r, but in practice it is better to scale it down as
the algorithm converges. In your implementation, r will be a function of the initial learning
rate, ρ0, and the example number, t. In the case of the SVM loss function, the best function
to choose is
r(t, ρ0) = ρ0 / (1 + ρ0 t / C).
Here, t should be initialized to zero at the start and incremented for each example. Note
that t should not be reset at the start of each epoch.
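Putting these pieces together, here is a minimal numpy sketch of the training loop; the array layout (X as an n × d matrix, y as ±1 labels) and the fixed shuffling seed are assumptions.

    import numpy as np

    # SGD for the SVM objective, with the learning-rate schedule r(t, rho0) above.
    def sgd_svm(X, y, C, rho0, epochs):
        n, d = X.shape
        w = np.zeros(d)
        t = 0                                    # never reset between epochs
        rng = np.random.default_rng(0)
        for _ in range(epochs):
            for i in rng.permutation(n):         # shuffle the order each epoch
                r = rho0 / (1.0 + rho0 * t / C)  # r(t, rho0)
                if y[i] * (w @ X[i]) <= 1:
                    w = w - r * (w - C * y[i] * X[i])
                else:
                    w = w - r * w
                t += 1
        return w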
Cross Validation
In class, we have seen that cross validation is used to select hyper-parameters for learning algorithms. Some of the training data is put aside, and when training is finished, the resulting
classifier is tested on the held-out data. This allows you to get an idea of how well the
particular choice of hyper-parameters does. Since you did not train on your whole dataset,
you may have introduced a bias. To correct for this, you will need to train many classifiers
with different subsets of the data removed.
For problems with small data sets, a popular method is the leave-one-out approach. For
each example, a classifier is trained on the rest of the data and then evaluated on the chosen
example. The performance of the classifier is the average accuracy over all the examples.
The downside to this method is that for a data set with n examples you must train n different
classifiers. Of course, this is not practical for the data set you will use in this problem, so
you will hold out subsets of the data many times instead.
Specifically, for this problem, you should implement k-fold cross validation to identify
the hyper-parameters C and the initial learning rate ρ0. The general approach for k-fold
cross validation is the following: suppose you want to evaluate how good a particular hyper-parameter is. You split the training data D into k parts. Now, you train your model
on k − 1 parts with the chosen hyper-parameter and evaluate the trained model on the
remaining part. You repeat this k times, choosing a different part for evaluation
each time. This gives you k values of accuracy. Their average, the cross-validation accuracy,
gives you an idea of how good this choice of the hyper-parameter is. To find the best value
of the hyper-parameter, you will need to repeat this procedure for different choices of the
hyper-parameter.
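As a sketch of this procedure, here is a minimal k-fold loop over a grid of (C, ρ0) choices; it reuses the sgd_svm function sketched above, and the grid-search structure and names are illustrative.

    import numpy as np

    # k-fold cross validation over a grid of (C, rho0) hyper-parameter choices.
    def cross_validate(X, y, k, C_values, rho0_values, epochs=10):
        folds = np.array_split(np.arange(X.shape[0]), k)   # k (nearly) equal parts
        best = None
        for C in C_values:
            for rho0 in rho0_values:
                accs = []
                for j in range(k):
                    test = folds[j]
                    train = np.concatenate([folds[m] for m in range(k) if m != j])
                    w = sgd_svm(X[train], y[train], C, rho0, epochs)
                    preds = np.where(X[test] @ w >= 0, 1, -1)
                    accs.append(np.mean(preds == y[test]))
                avg = np.mean(accs)                        # cross-validation accuracy
                if best is None or avg > best[0]:
                    best = (avg, C, rho0)
        return best   # (accuracy, C, rho0); leave-one-out is the special case k = n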
Data
The data is the badges data (again). This dataset is similar to the badges data that
we have seen in class and in a previous homework. The decision function, however, is
a rather complex one this time. You can download the dataset from the course website
in a file called badges.tar.gz. This archive consists of four files: badges-train.txt and
badges-test.txt are the new training and test files. In addition, we have also extracted indicator features from the first five characters of the first and last names. The feature-extracted
train and test data is in badges-train-features.txt and badges-test-features.txt,
respectively.
This feature-extracted data is in the libSVM data format, which we used in homework
2. Recall from the description in homework 2 that in this format, each line is a single
training example. The format of each line in the data is
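In the standard libSVM convention, that is the label followed by sparse index:value pairs with increasing indices, for example:

    <label> <index1>:<value1> <index2>:<value2> ...
    +1 3:1 12:1 45:1

(The particular feature indices shown here are made up for illustration; see the files themselves for the real data.)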