CS 5350/6350: Machine Learning Homework 5


1 Kernel Perceptron
In class, we have seen the idea of kernels applied to support vector machines. In this question, you will derive the kernel version of the Perceptron learner. For the purpose of this question, we will assume that instances are classified using a feature representation φ(x). That is, the Perceptron classifies an example as the sign of the dot product of the weight vector and the feature vector, w^T φ(x). The goal is, instead, to represent this using a kernel K(w, x).

Note that, with this feature representation, the perceptron update is w ← w + r y φ(x) whenever there is a mistake (i.e., y w^T φ(x) < 0). For this problem you may assume the learning rate r = 1.

1. Assume that the initial weight vector is the zero vector. Then, show that

   w = Σ_{(x_i, y_i) ∈ M} y_i φ(x_i),

   where M = {(x_i, y_i)} is the set of examples on which the learning algorithm made mistakes.

2. Let K be a kernel function defined as K(u, v) = φ(u)^T φ(v). Using the fact that the weight vector can be written as in the previous question, write the classification step y = sgn(w^T φ(x)) using the kernel function K(u, v) instead of the explicit feature representation φ(u).

3. In pseudo code, write the full kernel perceptron algorithm. (Hint: instead of storing a weight vector, you will store a list of the examples the algorithm made a mistake on. One possible sketch appears at the end of Section 3 below.)

2 Naïve Bayes

Consider the Boolean function f_TH(3,7). This is a threshold function defined on the 7-dimensional Boolean cube as follows: given an instance x, f_TH(3,7)(x) = 1 if and only if 3 or more of x's components are 1.

1. Show that f_TH(3,7) has a linear decision surface over the 7-dimensional Boolean cube.

2. Assume that you are given data sampled according to the uniform distribution over the Boolean cube {0, 1}^7 and labeled according to f_TH(3,7). Use naïve Bayes to learn a hypothesis that predicts these labels. What is the hypothesis generated by the naïve Bayes algorithm? (You do not have to implement the algorithm here. You may assume that you have seen all the data required to get accurate estimates of the probabilities.)

3. Show that the hypothesis produced in (2) does not represent this function. (An enumeration sketch that illustrates this appears at the end of Section 3 below.)

4. Are the naïve Bayes assumptions satisfied by f_TH(3,7)? Justify your answer.

3 Ensembles of decision trees

Support Vector Machines

Recall from class that an SVM is a technique for learning a linear classifier by minimizing the following loss function:

   E(w) = (1/2) w^T w + C Σ_i max(0, 1 − y_i w^T x_i),

where C is a hyper-parameter that controls the relative importance of the regularization term (1/2) w^T w with respect to the error term. As always, the inputs x_i are real-valued vectors and y_i ∈ {−1, +1}. This formalism is commonly referred to as L2 regularization and L1 loss.

Stochastic Gradient Descent

The idea behind SGD is to do gradient descent, but to compute the gradient using only a single example at a time. In practice, it can be helpful to shuffle the order of the data for each epoch.

1. Initialize w = 0, t = 0
2. For epoch = 1 ... T:
   (a) For each example (x_i, y_i), in random order:
       i.   r_t = learning rate at step t
       ii.  w = w − r_t ∇E(w, x_i, y_i)
       iii. t = t + 1

Here, the gradient is defined as follows:

   ∇E(w, x, y) = w − C y x   if y w^T x ≤ 1
                 w           otherwise

(Refer to the lecture slides for the full description of the algorithm. A sketch of this procedure appears at the end of this section.)

The learning rate is often stated as just r, but in practice it is better to scale it down as the algorithm converges. In your implementation, r will be a function of the initial learning rate ρ_0 and the example number t. In the case of the SVM loss function, the best function to choose is

   r(t, ρ_0) = ρ_0 / (1 + ρ_0 t / C).

Here, t should be initialized to zero at the start and incremented for each example. Note that t should not be reset at the start of each epoch.

Cross Validation

In class we have seen how cross validation is used to select hyper-parameters for learning algorithms. Some of the training data is put aside, and when training is finished, the resulting classifier is tested on the held-out data. This allows you to get an idea of how well the particular choice of hyper-parameters does.

Since you did not train on your whole dataset, you may have introduced a bias. To correct for this, you will need to train many classifiers with different subsets of the data removed. For problems with small data sets, a popular method is the leave-one-out approach: for each example, a classifier is trained on the rest of the data and then evaluated on the chosen example. The performance of the classifier is its average accuracy over all the examples. The downside of this method is that for a data set with n examples you must train n different classifiers. Of course, this is not practical for the data set you will use in this problem, so you will hold out subsets of the data many times instead.

Specifically, for this problem, you should implement k-fold cross validation to identify the hyper-parameters C and the initial learning rate ρ_0. The general approach for k-fold cross validation is the following. Suppose you want to evaluate how good a particular hyper-parameter is. You split the training data D into k parts. Now, train your model on k − 1 parts with the chosen hyper-parameter and evaluate the trained model on the remaining part. Repeat this k times, choosing a different part for evaluation each time. This will give you k values of accuracy; their average, the cross-validation accuracy, gives you an idea of how good this choice of the hyper-parameter is. To find the best value of the hyper-parameter, repeat this procedure for different choices of the hyper-parameter, as sketched below.
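To make the pieces above concrete, the sketches below are minimal, hedged illustrations in Python; none of them is the required implementation, and all function names are our own. First, one possible answer to Problem 1.3: the kernel perceptron stores the mistake set M instead of a weight vector and classifies with the kernel expansion from part 2. The RBF kernel here is only an illustrative choice; any valid kernel K works.

```python
import math

def rbf_kernel(u, v, gamma=1.0):
    # Example kernel K(u, v) = exp(-gamma * ||u - v||^2); illustrative choice only.
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def score(mistakes, x, K):
    # w^T phi(x) = sum over (x_i, y_i) in M of y_i * K(x_i, x)   (part 2)
    return sum(y_i * K(x_i, x) for x_i, y_i in mistakes)

def kernel_perceptron(data, K, epochs=10):
    # data: list of (x, y) pairs with y in {-1, +1}. Returns the mistake set M,
    # which implicitly represents w = sum of y_i * phi(x_i)      (part 1)
    mistakes = []
    for _ in range(epochs):
        for x, y in data:
            if y * score(mistakes, x, K) <= 0:   # mistake; <= 0 so w = 0 also updates
                mistakes.append((x, y))          # the update w <- w + y phi(x), with r = 1
    return mistakes

def predict(mistakes, x, K):
    return 1 if score(mistakes, x, K) >= 0 else -1
```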
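Next, a small sketch for Problem 2, parts 2 and 3. Because the distribution is uniform over {0, 1}^7, enumerating all 128 points gives exactly the probability estimates the problem lets you assume, and comparing the resulting naïve Bayes predictions against f_TH(3,7) exhibits the disagreement that part 3 asks you to establish. The helper names are ours.

```python
from itertools import product
from math import log

cube = list(product([0, 1], repeat=7))   # all 2^7 = 128 instances

def f(x):
    # f_TH(3,7): 1 iff three or more components are 1
    return 1 if sum(x) >= 3 else 0

pos = [x for x in cube if f(x) == 1]
neg = [x for x in cube if f(x) == 0]
prior = {1: len(pos) / len(cube), 0: len(neg) / len(cube)}
# P(x_j = 1 | y) is the same for every coordinate j by symmetry, so use j = 0.
p1 = {1: sum(x[0] for x in pos) / len(pos),
      0: sum(x[0] for x in neg) / len(neg)}

def nb_predict(x):
    # argmax over y of log P(y) + sum over j of log P(x_j | y)
    best, best_score = None, float("-inf")
    for y in (0, 1):
        s = log(prior[y]) + sum(log(p1[y]) if xj else log(1 - p1[y]) for xj in x)
        if s > best_score:
            best, best_score = y, s
    return best

disagree = [x for x in cube if nb_predict(x) != f(x)]
print(len(disagree), "of", len(cube), "points disagree with f_TH(3,7)")
```

Running this reports a nonzero number of disagreements: the naïve Bayes hypothesis is again a threshold on the number of ones, but with a different threshold than 3, which is the substance of part 3.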
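A sketch of the SGD procedure for the SVM objective, using the subgradient and the schedule r(t, ρ_0) given above. It assumes numpy, an (n × d) array X, and a vector y of −1/+1 labels; the name svm_sgd is our own.

```python
import numpy as np

def svm_sgd(X, y, C=1.0, rho0=0.01, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0                                       # NOT reset at the start of an epoch
    for _ in range(epochs):
        for i in rng.permutation(n):            # shuffle the order each epoch
            r = rho0 / (1 + rho0 * t / C)       # r(t, rho_0) from above
            if y[i] * w.dot(X[i]) <= 1:
                w = w - r * (w - C * y[i] * X[i])   # gradient: w - C y x
            else:
                w = w - r * w                       # gradient: w
            t += 1
    return w
```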
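Finally, a sketch of k-fold cross validation for choosing C and ρ_0, reusing the hypothetical svm_sgd from the previous block. After picking the pair with the best average held-out accuracy, you would retrain on all of the training data with those values.

```python
import numpy as np

def cv_accuracy(X, y, C, rho0, k=5):
    # Split the indices into k folds; train on k-1 folds, test on the held-out one.
    folds = np.array_split(np.arange(len(y)), k)
    accs = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        w = svm_sgd(X[train_idx], y[train_idx], C=C, rho0=rho0)
        preds = np.where(X[test_idx].dot(w) >= 0, 1, -1)
        accs.append(np.mean(preds == y[test_idx]))
    return np.mean(accs)

# Hypothetical grid search over the two hyper-parameters:
# best = max(((C, r0) for C in (0.1, 1, 10) for r0 in (0.001, 0.01, 0.1)),
#            key=lambda p: cv_accuracy(X, y, *p))
```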
Data

The data is the badges data (again). This dataset is similar to the badges data that we have seen in class and in a previous homework. The decision function, however, is a rather complex one this time. You can download the dataset from the course website in a file called badges.tar.gz. This archive consists of four files: badges-train.txt and badges-test.txt are the new training and test files. In addition, we have also extracted indicator features from the first five characters of the first and last names. The feature-extracted train and test data are in badges-train-features.txt and badges-test-features.txt, respectively. This feature-extracted data is in the libSVM data format which we used in homework 2. Recall from the description in homework 2 that in this format, each line is a single training example. The format of each line in the data is