## Description

## 1 Margins

The margin of a set of points {x1, x2, · · · , xm} with respect to the hyperplane w · x = θ is defined as the distance of the closest point to the hyperplane. The margin γ is thus:

$$\gamma = \min_i \frac{w \cdot x_i - \theta}{\lVert w \rVert}$$
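As a quick illustration of this definition (not part of the assignment solution), a short Python sketch that computes the margin of a point set with respect to the hyperplane w · x = θ:

```python
import math

# Margin of a point set w.r.t. the hyperplane w . x = theta,
# following the definition above: the minimum signed distance.
def margin(points, w, theta):
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return min(
        (sum(wi * xi for wi, xi in zip(w, x)) - theta) / norm_w
        for x in points
    )

# Toy 2-D example (the problem itself uses 20-dimensional binary vectors):
print(margin([(1, 1), (2, 3)], w=(1, 1), theta=1))  # 1/sqrt(2), about 0.7071
```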

Suppose our examples are points in {0, 1}²⁰ (that is, 20-dimensional binary vectors). For all the questions below, we wish to learn the following concept in this space using examples:

$$f(x) = x_2 \vee x_4 \vee x_6 \vee x_{10} \vee x_{12} \vee x_{14} \vee x_{16} \vee x_{18}.$$

The variables that are in the disjunction are referred to as relevant variables and the

others as irrelevant.

1. [3 points] Represent f as a linear threshold function. That is, find w and θ such that sgn(wᵀx − θ) is equivalent to f for all x ∈ {0, 1}²⁰.

2. [6 points] Consider a dataset D1 that is generated as follows:

   • Positive examples: all possible points that have one relevant variable and six irrelevant variables set to one, and all others zero.

   • Negative examples: all possible points that have no relevant variables and six irrelevant variables set to one, and all others zero.

Compute the margin of D1 with respect to your w.

3. [6 points] Consider a dataset D2 that is similar to D1, except that for positive examples

six relevant variables are set to one in addition to the six irrelevant variables.

Compute the margin of D2 with respect to your w.

4. [6 points] Now, consider a dataset D3 that is similar to D1. The only difference is that

the number of irrelevant variables that are seen in both positive and negative examples

is increased to ten. Write the Perceptron mistake bound for D1, D2 and D3.

5. [4 points] Rank the datasets in terms of “ease of learning”. Justify your answer.
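For reference when answering part 4, the classic Perceptron mistake bound (Novikoff's theorem) states that if all examples lie in a ball of radius R and the data is linearly separable with margin γ, then the Perceptron makes at most

```latex
\text{mistakes} \;\le\; \left(\frac{R}{\gamma}\right)^{2},
\qquad R = \max_i \lVert x_i \rVert
```

mistakes over any number of passes through the data.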

## 2 Boosting and Perceptron

Consider the following set of training examples (i is the index of the example):

| i  | x  | y  | label |
|----|----|----|-------|
| 1  | 1  | 10 | –     |
| 2  | 4  | 4  | –     |
| 3  | 8  | 7  | +     |
| 4  | 5  | 6  | –     |
| 5  | 3  | 16 | –     |
| 6  | 7  | 7  | +     |
| 7  | 10 | 14 | +     |
| 8  | 4  | 2  | –     |
| 9  | 4  | 10 | +     |
| 10 | 8  | 8  | –     |

In this problem you will use two learning algorithms, Boosting and Perceptron, to learn a hidden Boolean function from this set of examples.

1. [10 points] Use two rounds of AdaBoost to learn a hypothesis for this Boolean data set. As weak learners, use hypotheses of the form [x > θx] or [y > θy], for some integer θx or θy. In each round, choose the weak learner that minimizes the error. There should be no need to try many values of θ; appropriate values should be clear from the data.

Start the first round with a uniform distribution D0. Place the value of D0 for each example in the appropriate column of Table 1. Find the hypothesis given by the weak learner that minimizes the error for that distribution. Place this hypothesis as the heading of the fourth column of Table 1, and give its prediction for each example in that column.

Now compute D1 for each example, and select the hypothesis that minimizes the error on this distribution, placing these values and predictions in the appropriate columns of Table 1.

| i  | Label | D0 | Hypothesis 1 | D1 | Hypothesis 2 | Perceptron updates (start with (0, 0)) |
|----|-------|----|--------------|----|--------------|----------------------------------------|
| 1  | –     |    |              |    |              |                                        |
| 2  | –     |    |              |    |              |                                        |
| 3  | +     |    |              |    |              |                                        |
| 4  | –     |    |              |    |              |                                        |
| 5  | –     |    |              |    |              |                                        |
| 6  | +     |    |              |    |              |                                        |
| 7  | +     |    |              |    |              |                                        |
| 8  | –     |    |              |    |              |                                        |
| 9  | +     |    |              |    |              |                                        |
| 10 | –     |    |              |    |              |                                        |

Table 1: Table for Boosting and Perceptron

2. [5 points] Write the final hypothesis produced by AdaBoost.

3. [6 points] Use the Perceptron learning algorithm to train a hypothesis for this Boolean data set, using as features the weak learners chosen by the boosting algorithm in part 1. Train the Perceptron for one cycle through the data, using 0 as the initial weights, a threshold of 3, and a learning rate of 1. Fill in the weight-vector column of Table 1.

4. [4 points] Did the two algorithms converge to the same hypothesis? If both hypotheses

were used to predict labels for the training set, would the set of predictions be the

same? Explain.
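For reference, the AdaBoost reweighting step used between rounds can be sketched as follows. The data and stump below are a toy 1-D example, not the table above:

```python
import math

# One AdaBoost round with a decision-stump weak learner of the form [x > theta].
# Returns the hypothesis weight alpha and the updated distribution.
def adaboost_round(examples, labels, weights, stump):
    preds = [stump(x) for x in examples]
    # weighted error of the stump under the current distribution
    eps = sum(w for w, p, y in zip(weights, preds, labels) if p != y)
    alpha = 0.5 * math.log((1 - eps) / eps)
    # reweight: misclassified examples go up, correct ones go down, then normalize
    new_w = [w * math.exp(-alpha * y * p)
             for w, p, y in zip(weights, preds, labels)]
    z = sum(new_w)
    return alpha, [w / z for w in new_w]

# Toy data (not the table above): stump [x > 5] misclassifies only x = 8.
xs = [2, 4, 6, 8]
ys = [-1, -1, 1, -1]
d0 = [0.25] * 4
alpha, d1 = adaboost_round(xs, ys, d0, lambda x: 1 if x > 5 else -1)
print(alpha, d1)  # alpha = 0.5*ln(3); the misclassified example gets weight 0.5
```

Note how the single misclassified example absorbs half the total weight after one round, which is exactly what forces the next weak learner to attend to it.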

## 3 Kernels

(a) [10 points] If K1(x, z) and K2(x, z) are both valid kernel functions, and α and β are positive, prove that

$$K(x, z) = \alpha K_1(x, z) + \beta K_2(x, z) \tag{1}$$

is also a kernel function.

(b) [10 points] Given two examples x ∈ ℝ² and z ∈ ℝ², let

$$K(x, z) = (x^T z)^3 + 9\,(x^T z)^2 + 81\, x^T z. \tag{2}$$

Prove that this is a valid kernel function.
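A numerical sanity check of claims like (1), which is not a substitute for the proof: a valid kernel's Gram matrix on any finite point set is symmetric positive semidefinite. The kernels and coefficients below are illustrative choices (linear and quadratic kernels, α = 2, β = 3):

```python
import numpy as np

# Gram matrix of a kernel on a finite point set.
def gram(kernel, points):
    return np.array([[kernel(x, z) for z in points] for x in points])

k1 = lambda x, z: x @ z             # linear kernel
k2 = lambda x, z: (x @ z) ** 2      # quadratic (polynomial) kernel
k = lambda x, z: 2.0 * k1(x, z) + 3.0 * k2(x, z)   # alpha*K1 + beta*K2

rng = np.random.default_rng(0)
pts = rng.normal(size=(5, 2))
eigs = np.linalg.eigvalsh(gram(k, pts))
print(eigs.min() >= -1e-8)   # no significantly negative eigenvalue: PSD
```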
## 4 Learning Decision Lists (For CS 6350 students)

In this problem, we are going to learn the class of k-decision lists. A decision list is an ordered sequence of if-then-else statements. The if-then-else conditions are tested in order, and the answer associated with the first satisfied condition is output (see Figure 1).

Figure 1: A 2-decision list.

A k-decision list over the variables x1, . . . , xn is an ordered sequence L = (c1, b1), . . . , (cl, bl) and a bit b, in which each ci is a conjunction of at most k literals over x1, . . . , xn. The bit b is called the default value, and bi is referred to as the bit associated with condition ci. For any input x ∈ {0, 1}ⁿ, L(x) is defined to be the bit bj, where j is the smallest index satisfying cj(x) = 1; if no such index exists, then L(x) = b.
We denote by k-DL the class of concepts that can be represented by a k-decision list.
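The evaluation rule above can be sketched in a few lines of Python. The 2-decision list here is a made-up example, not one from the assignment:

```python
# Evaluate a decision list L = (c1, b1), ..., (cl, bl) with default bit b:
# return the bit of the first satisfied condition, else the default.
def evaluate(decision_list, default_bit, x):
    for condition, bit in decision_list:
        if condition(x):          # first satisfied condition wins
            return bit
    return default_bit            # no condition fired

# Hypothetical 2-decision list over x1, x2, x3 (x is 0-indexed in code):
L = [
    (lambda x: x[0] == 1 and x[1] == 0, 1),   # c1 = x1 AND NOT x2
    (lambda x: x[2] == 1,               0),   # c2 = x3
]
print(evaluate(L, default_bit=1, x=(1, 0, 1)))  # c1 fires first -> 1
print(evaluate(L, default_bit=1, x=(0, 0, 0)))  # nothing fires  -> 1 (default)
```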
1. [8 points] Show that if a concept c can be represented as a k-decision list, then so can its complement, ¬c. You can show this by providing a k-decision list that represents ¬c, given c = ((c1, b1), . . . , (cl, bl), b).

2. [9 points] Use Occam's Razor to show: for any constant k ≥ 1, the class of k-decision lists is PAC-learnable.

3. [8 points] Show that 1-decision lists are linearly separable functions. (Hint: find a weight vector that makes the same predictions as a given 1-decision list.)
The experiments question has been removed.