CS 5350/6350: Machine Learining Homework 4 solved

$35.00

Category:

Description

5/5 - (1 vote)

1 Margins
The margin of a set of points {x1, x2, · · · , xm} with respect to a hyperplane is defined as the
distance of the closest point to the hyperplane w · x = θ. The margin γ is thus:
γ = min
i

w · xi − θ
||w||

Suppose our examples are points in {0, 1}
20 (that is, 20 dimensional binary vectors). For
all the questions below, we wish to learn the following concept in this space using examples:
f(x) = x2 ∨ x4 ∨ x6 ∨ x10 ∨ x12 ∨ x14 ∨ x16 ∨ x18.
The variables that are in the disjunction are referred to as relevant variables and the
others as irrelevant.
1. [3 points] Represent f as a linear threshold function. That is, find w and θ such that
sgn(wT x − θ) is equivalent to f for all x ∈ {0, 1}
20
.
2. [6 points] Consider a dataset D1 that is generated as follows:
1
• Positive examples: All possible points that have one relevant variable and six
irrelevant variables set to one and all others zero.
• Negative examples: All possible points that have no relevant variables and six
irrelevant variables set to one and all others zero.
Compute the margin of D1 with respect to your w.
3. [6 points] Consider a dataset D2 that is similar to D1, except that for positive examples
six relevant variables are set to one in addition to the six irrelevant variables.
Compute the margin of D2 with respect to your w.
4. [6 points] Now, consider a dataset D3 that is similar to D1. The only difference is that
the number of irrelevant variables that are seen in both positive and negative examples
is increased to ten. Write the Perceptron mistake bound for D1, D2 and D3.
5. [4 points] Rank the datasets in terms of “ease of learning”. Justify your answer.
2 Boosting and Perceptron
Consider the following set of training examples (i is the index of the example):
i x y label
1 1 10 –
2 4 4 –
3 8 7 +
4 5 6 –
5 3 16 –
6 7 7 +
7 10 14 +
8 4 2 –
9 4 10 +
10 8 8 –
In this problem you will use two learning algorithms, Boosting and Perceptron to learn
a hidden Boolean function from this set of examples.
1. [10 points] Use two rounds of AdaBoost to learn a hypothesis for this Boolean data set.
As weak learners use a hypothesis of the form [x > θx] or [y > θy], for some integer θx
or θy. Each round, choose the weak learner that minimizes the error . There should
be no need to try many θs, appropriate values should be clear from the data.
Start the first round with a uniform distribution D0. Place the value for D0 for each
example in the appropriate column of Table 1. Find the hypothesis given by the weak
learner that minimizes the error  for that distribution. Place this hypothesis as the
heading to the fourth column of Table 1, and give its prediction for each example in
that column.
2
Now compute D1 for each example, and select hypothesis that minimizes error on
this distribution, placing these values and predictions in the appropriate columns of
Table 1.
Boosting Perceptron updates Hypothesis 1 Hypothesis 2
i Label D0 D1 Start with (0, 0)
1
2
3
4
5
6
7
8
9
10
Table 1: Table for Boosting and Perceptron
2. [5 points] Write the final hypothesis produced by AdaBoost.
3. [6 points] Use the Perceptron learning algorithm to train a hypothesis for this Boolean
data set, using as features the weak learners chosen by the boosting algorithm in (a).
Train the Perceptron one cycle through the data, using 0 as initial weights, threshold
of 3, and learning rate of 1. Fill in the weight vector column of Table 1.
4. [4 points] Did the two algorithms converge to the same hypothesis? If both hypotheses
were used to predict labels for the training set, would the set of predictions be the
same? Explain.
3 Kernels
(a) [10 points] If K1(x, z) and K2(x, z) are both valid kernel functions, with positive α and
β, prove that
K(x, z) = αK1(x, z) + βK2(x, z) (1)
is also a kernel function.
(b) [10 points] Given two examples x ∈ <2 and z ∈ <2 , let K(x, z) = (x T z) 3 + 9(x T z) 2 + 81x T z. (2) Prove that this is a valid kernel function. 3 4 Learning Decision Lists (For CS 6350 students) In this problem, we are going to learn the class of k-decision lists. A decision list is an ordered sequence of if-then-else statements. The sequence of if-then-else conditions are tested in order, and the answer associated to the first satisfied condition is output (see Figure 1). Figure 1: A 2-decision list. A k-decision list over the variables x1, . . . , xn is an ordered sequence L = (c1, b1), . . . ,(cl , bl) and a bit b, in which each ci is a conjunction of at most k literals over x1, . . . , xn. The bit b is called the default value, and bi is referred to as the bit associated with condition ci . For any input x ∈ {0, 1} n , L(x) is defined to be the bit bj , where j is the smallest index satisfying cj (x) = 1; if no such index exists, then L(x) = b. We denote by k-DL the class of concepts that can be represented by a k-decision list. 1. [8 points] Show that if a concept c can be represented as a k-decision list so can its complement, ¬c. You can show this by providing a k-decision list that represents ¬c, given c = {(c1, b1), . . . ,(cl , bl), b). 2. [9 points] Use Occam’s Razor to show: For any constant k ≥ 1, the class of k-decision lists is PAC-learnable. 3. [8 points] Show that 1-decision lists are a linearly separable functions. (Hint: Find a weight vector that will make the same predictions a given 1-decision list.) The experiments question is removed 4