## Description

1. Logistic Regression. We consider the following models of logistic regression for binary classification, with the sigmoid function σ(z) = 1/(1 + e^(−z)):

Model 1: P(Y = 1|X, w1, w2) = σ(w1X1 + w2X2)

Model 2: P(Y = 1|X, w0, w1, w2) = σ(w0 + w1X1 + w2X2)

We have three training examples:

(x1, y1) = ([1, 1]ᵀ, 1), (x2, y2) = ([1, 0]ᵀ, −1), (x3, y3) = ([0, 0]ᵀ, 1)
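For concreteness, the setup can be written out in a few lines of code; the weights below are arbitrary placeholders used only to show how each model is evaluated, not learned values.

```python
import numpy as np

# Training examples from the problem: rows are x = (x1, x2); labels are in {+1, -1}.
X = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 0.0]])
y = np.array([1, -1, 1])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Placeholder weights, chosen only to illustrate how each model is evaluated.
w = np.array([0.5, -0.5])   # (w1, w2), used by both models
w0 = 0.1                    # bias term used only by Model 2

p_model1 = sigmoid(X @ w)        # P(Y = 1 | X) under Model 1
p_model2 = sigmoid(w0 + X @ w)   # P(Y = 1 | X) under Model 2
print(p_model1, p_model2)
```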

A. How does the learned value of w = (w1, w2) change if we change the label of the third example to −1? How about in Model 2? Explain. (Hint: think of the decision boundary in the 2D plane.)

B. Now, suppose we train the logistic regression model (Model 2) on the N training examples x1, . . . , xN and labels y1, . . . , yN by maximizing the penalized log-likelihood of the labels:

Σ_i log P(yi | xi, w) − (λ/2) ‖w‖₂²

For large λ (strong regularization), the log-likelihood terms will behave as linear functions of w.

log σ(yi wᵀxi) ≈ (1/2) yi wᵀxi

Express the penalized log-likelihood using this approximation (with Model 1), and derive the expression for the MLE w in terms of λ and the training data {(xi, yi)}_{i=1}^N. Based on this, explain how w behaves as λ increases.
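The approximation above is the first-order Taylor expansion of log σ(t) around t = 0, namely log σ(t) ≈ −log 2 + t/2, with the constant −log 2 dropped since it does not depend on w and therefore does not affect the maximizer. A quick numerical check of that expansion (illustration only):

```python
import numpy as np

def log_sigmoid(t):
    # Numerically stable log sigma(t) = -log(1 + exp(-t)).
    return -np.logaddexp(0.0, -t)

# For small t (i.e., small w, the regime enforced by large lambda),
# log sigma(t) is close to its first-order expansion -log 2 + t/2.
for t in [0.5, 0.1, 0.01]:
    exact = log_sigmoid(t)
    approx = -np.log(2.0) + 0.5 * t
    print(f"t={t:5.2f}  exact={exact:.5f}  linear approx={approx:.5f}")
```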

2. Support Vector Machine. Consider a binary classification problem in one-dimensional space where the sample contains four data points S = {(1, −1), (−1, −1), (2, 1), (−2, 1)}, as shown in Fig. 1.

Figure 1: Red points represent instances from class +1 and blue points represent instances from class -1.

A. Define Ht = [t, ∞). Consider the class of linear separators H = {Ht : t ∈ R}, i.e., for every Ht ∈ H, Ht(x) = 1 if x ≥ t and −1 otherwise. Is there any linear separator Ht ∈ H that achieves 0 classification error on this sample? If yes, show one such linear separator. If not, briefly explain why no such linear separator can exist.

B. Now consider a feature map φ : R → R² where φ(x) = (x, x²). Apply the feature map to all the instances in sample S to generate a transformed sample S′ = {(φ(x), y) : (x, y) ∈ S}. Let H′ = {ax1 + bx2 + c ≥ 0 : a² + b² ≠ 0} be a collection of half-spaces in R². More specifically, Ha,b,c((x1, x2)) = 1 if ax1 + bx2 + c ≥ 0 and −1 otherwise. Is there any half-space H′ ∈ H′ that achieves 0 classification error on the transformed sample S′? If yes, give the equation of the max-margin linear separator and compute the corresponding margin.
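As a quick illustration of what the transformed sample looks like (whether a separating half-space exists is left for you to answer), the feature map can be applied to S directly; a minimal sketch:

```python
# Illustration only: apply phi(x) = (x, x^2) to the 1-D sample S from Fig. 1.
S = [(1, -1), (-1, -1), (2, 1), (-2, 1)]

def phi(x):
    return (x, x ** 2)

S_prime = [(phi(x), y) for (x, y) in S]
print(S_prime)
# [((1, 1), -1), ((-1, 1), -1), ((2, 4), 1), ((-2, 4), 1)]
```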

C. What is the kernel corresponding to the feature map φ(·) in the last question, i.e., give the kernel function K(x, z) : R × R → R.

3. Constructing Kernels. In this question you will be asked to construct new kernels from existing kernels. Suppose K1(x, z) : R^d × R^d → R and K2(x, z) : R^d × R^d → R are both kernels. Show that the following functions are also kernels (a numerical sanity check is sketched after the list):

A. K(x, z) = c1K1(x, z) + c2K2(x, z) with c1, c2 ≥ 0.

B. K(x, z) = K1(x, z) · K2(x, z).

C. Let q(t) = Σ_{i=0}^p ci t^i be a polynomial function with nonnegative coefficients, i.e., ci ≥ 0, ∀i. Show that K(x, z) = q(K1(x, z)) is a kernel.

D. K(x, z) = exp(K1(x, z)). (Hint: you can use the previous results to prove this.)

E. Let A be a positive semidefinite matrix and define K(x, z) = xᵀAz.

F. K(x, z) = exp(−‖x − z‖₂²).
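One way to build intuition for the parts above, though it is not a substitute for the requested proofs, is to verify numerically that the constructed functions yield positive semidefinite Gram matrices on arbitrary finite point sets. A minimal sketch, assuming NumPy and using a linear kernel for K1 and the kernel from part F for K2:

```python
import numpy as np

# Sanity check (not a proof): a valid kernel must yield a positive semidefinite
# Gram matrix on any finite set of points. Here K1 is a linear kernel and K2 the
# kernel from part F, both known to be valid; we check a few combinations above.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))                       # 6 random points in R^3

K1 = X @ X.T                                      # linear kernel Gram matrix
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K2 = np.exp(-sq_dists)                            # Gram matrix for part F

for name, K in [("A: 2*K1 + 3*K2", 2 * K1 + 3 * K2),
                ("B: K1 * K2 (elementwise)", K1 * K2),
                ("D: exp(K1)", np.exp(K1))]:
    min_eig = np.linalg.eigvalsh(K).min()
    print(f"{name}: smallest eigenvalue = {min_eig:.2e}")
```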

4. Support Vectors. In question 2, we explicitly constructed the feature map and found the corresponding kernel to help classify the instances using a linear separator in the feature space. However, in most cases it is hard to manually construct the desired feature map, and the dimensionality of the feature space can be very high, even infinite, which makes explicit computation in the feature space infeasible in practice. In this question we will develop the dual of the primal optimization problem to avoid working in the feature space explicitly. Suppose we have a sample set S = (x1, y1), . . . , (xn, yn) of labeled examples in R^d with label set {+1, −1}. Let φ : R^d → R^D be a feature map that transforms each input example to a feature vector in R^D. Recall from the lecture notes that the primal optimization of SVM is given by

minimize over w, ξi:   (1/2) ‖w‖₂² + C Σ_{i=1}^n ξi
subject to   yi (wᵀφ(xi)) ≥ 1 − ξi,   ∀i = 1, . . . , n
             ξi ≥ 0,   ∀i = 1, . . . , n


which is equivalent to the following dual optimization

maximize over αi:   Σ_{i=1}^n αi − (1/2) Σ_{i=1}^n Σ_{j=1}^n αi αj yi yj φ(xi)ᵀφ(xj)
subject to   0 ≤ αi ≤ C,   ∀i = 1, . . . , n
             Σ_{i=1}^n αi yi = 0

Recall from the lecture notes that ξ1, . . . , ξn are called slack variables. The optimal slack variables have an intuitive geometric interpretation, as shown in Fig. 3. Basically, when ξi = 0, the corresponding feature vector φ(xi) is correctly classified and will lie either on the margin of the separator or on the correct side of the margin. A feature vector with 0 < ξi ≤ 1 lies within the margin but is still correctly classified. When ξi > 1, the corresponding feature vector is misclassified. Support vectors correspond to the instances with ξi > 0 or instances that lie on the margin. The optimal vector w can be represented in terms of αi, i = 1, . . . , n, as w = Σ_{i=1}^n αi yi φ(xi).
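To make the role of the inner products concrete: the dual objective touches the data only through the Gram matrix K[i, j] = φ(xi)ᵀφ(xj). A minimal sketch (the kernel and data below are placeholders, not the provided datasets):

```python
import numpy as np

def dual_objective(alpha, y, K):
    """Dual SVM objective: sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j K[i, j].

    Note that phi(x_i) never appears explicitly; only the Gram matrix
    K[i, j] = phi(x_i)^T phi(x_j) is needed.
    """
    return alpha.sum() - 0.5 * (alpha * y) @ K @ (alpha * y)

# Placeholder data purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
y = np.array([1, -1, 1, -1, 1])
K = (X @ X.T + 1.0) ** 2          # e.g. a polynomial kernel (x^T z + 1)^2
alpha = np.full(5, 0.1)
print(dual_objective(alpha, y, K))
```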

A. Suppose the optimal ξ1, . . . , ξn have been computed. Use the ξi to obtain an upper bound on the number of misclassified instances.

B. In the primal optimization of SVM, what is the role of the coefficient C? Briefly explain your answer by considering two extreme cases, i.e., C → 0 and C → ∞.

C. Explain how to use the kernel trick to avoid the explicit computation of the feature vectors φ(xi). Also, given a new instance x, how can you make a prediction on it without explicitly computing the feature vector φ(x)?

5. Generalized Lagrangian Function. Consider the optimization problem

min_w f(w)   s.t.   gj(w) ≤ 0, ∀j = 1, . . . , m,   hj(w) = 0, ∀j = 1, . . . , p      (1)

Show that for the generalized Lagrangian function, defined by

L(w, α, β) ≜ f(w) + Σ_{j=1}^m αj gj(w) + Σ_{j=1}^p βj hj(w),


the following always holds:

max_{α≥0, β} min_w L(w, α, β) ≤ min_w max_{α≥0, β} L(w, α, β)
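As a sanity check on the direction of this inequality (not the requested proof), one can discretize any function L over finite grids of w values and (α, β) values and observe that the max-min never exceeds the min-max; a toy numerical illustration:

```python
import numpy as np

# Toy illustration of the minimax inequality on finite grids:
# max over columns of the row-wise minima <= min over rows of the column-wise maxima.
rng = np.random.default_rng(0)
L = rng.normal(size=(50, 40))   # L[i, j] plays the role of L(w_i, (alpha, beta)_j)

max_min = L.min(axis=0).max()   # max over j of (min over i)
min_max = L.max(axis=1).min()   # min over i of (max over j)
print(max_min, "<=", min_max, ":", max_min <= min_max)
```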

6. Dual Optimization. Consider the optimization program

min_x   (1/2) xᵀ P0 x + q0ᵀ x + r0
s.t.    (1/2) xᵀ Pi x + qiᵀ x + ri ≤ 0,   i = 1, . . . , m

where P0 and all Pi are assumed to be positive semi-definite matrices. A) Form the generalized Lagrangian function. B) Compute the Lagrange dual function. C) Derive the dual maximization problem.

7. Logistic Regression Implementation.

A) Write code in Python whose input is a training dataset {(x1, y1), . . . , (xN, yN)} and whose output is the weight vector w in the logistic regression model y = σ(wᵀx).
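A minimal sketch of one possible approach for part A, using batch gradient ascent on the log-likelihood and assuming labels encoded as ±1 (append a column of ones to the inputs if a bias term is wanted); your implementation may of course differ:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Fit w for P(y = 1 | x) = sigmoid(w^T x) by batch gradient ascent.

    X: (N, d) array of inputs; y: (N,) array of labels in {-1, +1}.
    Append a column of ones to X beforehand if a bias term is wanted.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        # Gradient of sum_i log sigmoid(y_i w^T x_i) with respect to w.
        grad = X.T @ (y * sigmoid(-y * (X @ w)))
        w += lr * grad / n
    return w
```

The same loop with an extra −λw term added to the gradient would handle the penalized objective from question 1.B.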

B) Download the dataset from the course webpage. Use ‘dataset1’. Run the code on the training dataset to compute w and evaluate on the test dataset. Report w, the classification error on the training set, and the classification error on the test set. Plot the data (use different colors for data in different classes) and plot the decision boundary found by the logistic regression.

C) Repeat part B using ‘dataset2’. Explain the differences in results between parts B and C and justify your observations/results.

8. SVM Implementation.

Implement SVM with the SMO algorithm and train it on the provided dataset. For your implementation, you only have to use the linear kernel. In addition, run SVM using the LIBSVM package and compare the results. You can implement the simplified SMO, as described in http://cs229.stanford.edu/materials/smo.pdf
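For the LIBSVM comparison, one option is to call LIBSVM through scikit-learn's SVC, which wraps LIBSVM and exposes the same C and kernel choices; the sketch below assumes that interface, and the dataset file names and column layout are placeholders since the actual format of ‘dataset1’ is not specified here:

```python
import numpy as np
from sklearn.svm import SVC   # scikit-learn's SVC wraps LIBSVM internally

# Placeholder data loading: the actual file names/format of 'dataset1' may differ.
train = np.loadtxt("dataset1_train.txt")
test = np.loadtxt("dataset1_test.txt")
X_tr, y_tr = train[:, :-1], train[:, -1]
X_te, y_te = test[:, :-1], test[:, -1]

for C in [0.01, 0.1, 1, 10, 100]:
    clf = SVC(C=C, kernel="linear").fit(X_tr, y_tr)
    err_tr = np.mean(clf.predict(X_tr) != y_tr)
    err_te = np.mean(clf.predict(X_te) != y_te)
    print(f"C={C:g}  train error={err_tr:.3f}  test error={err_te:.3f}")
```

Calling the standalone LIBSVM package directly (e.g., through its Python svmutil interface) should give matching results, since SVC delegates to the same solver.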

A) Apply the SVM to ‘dataset1’ and report the classification error (on both the training and test sets) as a function of the regularization parameter C.

B) Repeat part A using ‘dataset2’. Explain the differences in results between parts A and B and justify your observations/results.

Homework Submission Instructions:

– Submission of Written Part: You must submit your written report in class, BEFORE CLASS STARTS. The written part must contain the plots and results of running your code on the provided datasets.

– Submission of Code: You must submit your Python code (.py file) via email, BEFORE CLASS STARTS. To submit your code, please send an email to me and CC both TAs.

– The title of your email must be "CS6140: Code: HW2: Your First and Last Name".

– You must attach a single zip file to your email that contains all Python code files and a readme file explaining how to run your files.

– The name of the zip file must be "HW2-Code: Your First and Last Name".
