CSE176 Introduction to Machine Learning Homework set #1 solved

$35.00 $17.50



Exercise 1: Bayes’ rule (6 points). Suppose that 5% of competitive athletes use performanceenhancing drugs and that a particular drug test has a 2% false positive rate and a 1.5% false negative
1. (3 points) Athlete A tests positive for drug use. What is the probability that Athlete A is using
2. (3 points) Athlete B tests negative for drug use. What is the probability that Athlete B is not
using drugs?
Exercise 2: Bayesian decision theory: losses and risks (11 points). Consider a classification
problem with K classes, using a loss λik ≥ 0 if we choose class i when the input actually belongs to
class k, for i, k ∈ {1, . . . , K}.
1. (2 points) Write the expression for the expected risk Ri(x) for choosing class i as the class for a
pattern x, and the rule for choosing the class for x.
Consider a two-class problem with losses given by the matrix λik =

0 1
λ21 0

2. (3 points) Give the optimal decision rule in the form “p(C1|x) > . . . ” as a function of λ21.
3. (3 points) Imagine we consider both misclassification errors as equally costly. When is class 1
chosen (for what values of p(C1|x))?
4. (3 points) Imagine we want to be very conservative when choosing class 2 and we seek a rule of
the form “p(C2|x) > 0.99” (i.e., choose class 2 when its posterior probability exceeds 99%). What
should λ21 be?
Exercise 3: association rules (6 points). Given the following data of transactions at a supermarket, calculate the support and confidence values of the following association rules: meat → avocado,
avocado → meat, yogurt → avocado, avocado → yogurt, meat → yogurt, yogurt → meat. What is the
best rule to use in practice?
transaction # items in basket
1 meat, avocado
2 yogurt, avocado
3 meat
4 yogurt, meat
5 avocado, meat, yogurt
6 meat, avocad
Exercise 4: true- and false-positive rates (10 points). Consider the following table, where xn is
a pattern, yn its ground-truth label (1 = positive class, 2 = negative class) and p(C1|xn) the posterior
probability produced by some probabilistic classification algorithm:
n 1 2 3 4 5
yn 1 2 2 1 2
p(C1|xn) 0.6 0.7 0.5 0.9 0.2
We use a classification rule of the form “p(C1|x) > θ” where θ ∈ [0, 1] is a threshold.
1. (8 points) Give, for all possible values of θ ∈ [0, 1], the predicted labels and the corresponding
confusion matrix and classification error.
2. (2 points) Plot the corresponding pairs (fp,tp) as an ROC curve.
Exercise 5: ROC curves (8 points). Imagine we have a classifier A that has false-positive and
true-positive rates fpA,tpA ∈ [0, 1] such that fpA > tpA (that is, this classifier is below the diagonal
on the ROC space). Now consider a classifier B that negates the decision of A, that is, whenever A
predicts the positive class then B predicts the negative class and vice versa. Compute the false-positive
and true-positive rates fpB,tpB for classifier B. Where is this point in the ROC space?
Exercise 6: least-squares regression (14 points). Consider the following model, with parameters
Θ = {θ1, θ2, θ3} ⊂ R and an input x ∈ R:
h(x; Θ) = θ1 + θ2 sin 2x + θ3 sin 4x ∈ R.
1. (2 points) Write the general expression of the least-squares error function of a model h(x; Θ) with
parameters Θ given a sample {(xn, yn)}
2. (2 points) Apply it to the above model, simplifying it as much as possible.
3. (6 points) Find the least-squares estimate for the parameters.
4. (4 points) Assume the values {xn}
n=1 are uniformly distributed in the interval [0, 2π]. Can you find
a simpler, approximate way to find the least-squares estimate ? Hint: approximate 1
n=1 f(xn)
by an integral.
Exercise 7: maximum likelihood estimate (15 points). A discrete random variable x ∈ {0, 1, 2 . . . }
follows a Poisson distribution if it has the following probability mass function:
p(x; θ) = e
where the parameter is θ > 0.
1. (2 points) Verify that P∞
x=0 p(x) = 1.
2. (2 points) Write the general expression of the log-likelihood of a probability mass function p(x; Θ)
with parameters Θ for an iid sample x1, . . . , xN .
3. (5 points) Apply it to the above distribution, simplifying it as much as possible.
4. (6 points) Find the maximum likelihood estimate for the parameter θ.
Exercise 8: multivariate Bernoulli distribution (20 points). Consider a multivariate Bernoulli
distribution where θ ∈ [0, 1]D are the parameters and x ∈ {0, 1}
D the binary random vector:
p(x; θ) = Y
(1 − θd)
1. (5 points) Compute the maximum likelihood estimate for θ given a sample X = {x1, . . . , xN }.
Let us do document classification using a D-word dictionary (element d in xn is 1 if word d is in document
n and 0 otherwise) using a multivariate Bernoulli model for each class. Assume we have K document
classes for which we have already obtained the values of the optimal parameters θk = (θk1, . . . , θkD)
and prior distribution p(Ck) = πk, for k = 1, . . . , K, by maximum likelihood.
2. (2 points) Write the discriminant function gk(x) for a probabilistic classifier in general (not necessarily Bernoulli), and the rule to make a decision.
3. (5 points) Apply it to the multivariate Bernoulli case with K classes. Show that gk(x) is linear
on x, i.e., it can be written as gk(x) = wT
k x + wk0 and give the expression for wk and wk0.
4. (3 points) Consider K = 2 classes. Show the decision rule can be written as “if wT x + w0 > 0
then choose class 1”, and give the expression for w and w0.
5. (5 points) Compute the numerical values of w and w0 for a two-word dictionary where π1 = 0.7,
θ1 = ( 0.2
) and θ2 = ( 0.3
). Plot in 2D all the possible values of x ∈ {0, 1}
D and the boundary
corresponding to this classifier.
Exercise 9: Gaussian classifiers (10 points). Consider a binary classification problem for x ∈ R
where we use Gaussian class-conditional probabilities p(x|C1) ∼ N (µ, σ2
I) and p(x|C2) ∼ N (µ, σ2
That is, they have the same mean and the covariance matrices are isotropic but different. Compute the
expression for the class boundary. What shape is it?