Description
Problem 1 (Values). Name one or two of your own personal, academic, or career values, and
explain how you hope machine learning can be of service to those values.
Problem 2 (Stuff you must know). The course website http://www.cs.columbia.edu/~djhsu/coms4771-f16/
has information about the course prerequisites, course requirements, academic rules
of conduct, and other information. You are required to understand this information and abide by
the rules of conduct, regardless of whether or not you can solve the following problems.
(a) True or false: I may share my homework write-up or code with another student as long as
(1) the write-up only contains solutions for at most half of the problems, (2) the code is at
most five lines, and (3) we list each other as discussion partners on the submitted write-up.
(b) True or false: I may use any outside reference material to help me solve the homework
problems as long as I appropriately acknowledge these materials in the submitted write-up.
Problem 3 (More stuff you should know). We’ll use the notation $f \colon X \to Y$ to declare a function
$f$ whose domain is the set $X$, and whose range is the set $Y$. For example, $f \colon \mathbb{R} \to \mathbb{R}$ declares
a real-valued function over the real line. For a positive integer $d$, the $d$-dimensional vector space
called Euclidean space is denoted by $\mathbb{R}^d$. For positive integers $m$ and $n$, the space of $m \times n$ matrices
over the real field $\mathbb{R}$ is denoted by $\mathbb{R}^{m \times n}$. Every matrix in $\mathbb{R}^{m \times n}$ can be regarded as a linear map
from $\mathbb{R}^n$ to $\mathbb{R}^m$.
Let $A, B \in \mathbb{R}^{2 \times 2}$ be given by
$$A := \begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix}, \qquad B := \begin{bmatrix} 4 & 0 \\ 0 & 4 \end{bmatrix}.$$
(“:=” is the notation used for “equals by definition”.) Also let $u := (2, 1)$ and $v := (1, 2)$, which
are vectors in $\mathbb{R}^2$. Note that when we refer to vectors from Euclidean spaces in the context of
matrix-vector products, we always regard vectors (like $u$) as column vectors, and their transposes
(like $u^\top$) as row vectors:
$$u = \begin{bmatrix} 2 \\ 1 \end{bmatrix}, \qquad u^\top = \begin{bmatrix} 2 & 1 \end{bmatrix}.$$
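The column-vector and transpose conventions above map directly onto NumPy operations. The following is a minimal sketch, not part of the assignment: the matrix `M` and the vectors `w` and `z` are hypothetical values chosen for illustration and are deliberately different from $A$, $B$, $u$, and $v$.

```python
import numpy as np

# Hypothetical matrix and vectors (NOT the A, B, u, v of this problem).
M = np.array([[1.0, 0.0],
              [3.0, 2.0]])
w = np.array([1.0, -1.0])
z = np.array([2.0, 5.0])

# Matrix-vector product M w: w is treated as a column vector.
Mw = M @ w

# z^T M w: row vector times matrix times column vector gives a scalar.
zMw = z @ M @ w

# The same vector written explicitly as a column and as a row.
w_col = w.reshape(-1, 1)   # shape (2, 1), a column vector
w_row = w_col.T            # shape (1, 2), its transpose

# Rank of a matrix and Euclidean norm of a vector.
rank_M = np.linalg.matrix_rank(M)
norm_w = np.linalg.norm(w)

print(Mw, zMw, w_col.shape, w_row.shape, rank_M, norm_w)
```

One-dimensional NumPy arrays carry no row or column orientation, so `@` picks the interpretation that makes the product well defined; the explicit `reshape` is only needed when the 2-D shape matters.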
(a) What is the rank of $A$?
(b) What is $Au + Bv$?
(c) What is $u^\top A v$?
(d) The (Euclidean) norm (or length) of a vector $x = (x_1, x_2, \dots, x_d) \in \mathbb{R}^d$ is denoted by $\|x\|_2$,
and is equal to $\sqrt{x_1^2 + x_2^2 + \cdots + x_d^2}$. What is $\|u\|_2$?
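(e) Let $f \colon \mathbb{R}^2 \to \mathbb{R}$ be the function defined by
$$f(x) := x^\top (A + B) x.$$
The gradient of a real-valued function $g \colon \mathbb{R}^d \to \mathbb{R}$ at a point $z \in \mathbb{R}^d$, denoted by $\nabla g(z)$, is
the vector $\lambda = (\lambda_1, \lambda_2, \dots, \lambda_d)$ where
$$\lambda_i := \frac{\partial}{\partial x_i} g(x) \bigg|_{x = z} \quad \text{for all } i = 1, 2, \dots, d.$$
What is $\nabla f(v)$?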
(f) The unit circle in $\mathbb{R}^2$ is the set of vectors in $\mathbb{R}^2$ with unit length, i.e., $\{x \in \mathbb{R}^2 : \|x\|_2 = 1\}$.
Which vector in the unit circle minimizes f (defined above), and what is the value of f
evaluated at this vector? (Hint: think about eigenvectors.)
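A gradient computed by hand from the definition in part (e) can be sanity-checked numerically with central finite differences. The sketch below is illustrative only: the helper `numerical_gradient` and the test function `g` are hypothetical names and choices, not part of the assignment, and `g` is intentionally not the function $f$ from the problem.

```python
import numpy as np

def numerical_gradient(g, z, h=1e-6):
    """Approximate the gradient of g at z, one coordinate at a time."""
    z = np.asarray(z, dtype=float)
    grad = np.zeros_like(z)
    for i in range(z.size):
        e = np.zeros_like(z)
        e[i] = h                                   # perturb coordinate i only
        grad[i] = (g(z + e) - g(z - e)) / (2 * h)  # central difference
    return grad

# Hypothetical test function g(x) = x_1^2 + 3 x_1 x_2 (not the f of this problem).
def g(x):
    return x[0] ** 2 + 3.0 * x[0] * x[1]

z = np.array([1.0, 2.0])
approx = numerical_gradient(g, z)

# Partial derivatives worked out by hand: (2 x_1 + 3 x_2, 3 x_1).
exact = np.array([2 * z[0] + 3 * z[1], 3 * z[0]])

print(approx, exact)
```

Agreement between `approx` and `exact` to several decimal places suggests the hand-derived partial derivatives are right; a large gap usually points to an algebra slip.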
Problem 4 (Random stuff you should know). A (discrete) probability space is a pair $(\Omega, P)$, where
$\Omega$ is a (discrete) set called the sample space, and $P \colon \Omega \to \mathbb{R}$ is a real-valued function on $\Omega$ called
the probability distribution, which must satisfy $P(\omega) \ge 0$ for all $\omega \in \Omega$, and $\sum_{\omega \in \Omega} P(\omega) = 1$. An
event $A$ is a subset of $\Omega$, and the probability of $A$, denoted by $P(A)$ (somewhat abusing notation),
is equal to $\sum_{\omega \in A} P(\omega)$.
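For a small sample space, these definitions can be written out literally in code: the distribution is a table of probabilities, an event is a set of outcomes, and independence of two events means the probability of their intersection equals the product of their probabilities. The sketch below assumes a hypothetical experiment of two fair coin tosses (not the three-toss experiment in part (a)); the helper names `prob` and `independent` are made up for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space for two tosses of a fair coin (hypothetical example).
omega = list(product("HT", repeat=2))
P = {w: Fraction(1, len(omega)) for w in omega}   # uniform distribution

def prob(event):
    """P(event) = sum of P(w) over the outcomes w in the event."""
    return sum((P[w] for w in event), Fraction(0))

def independent(e1, e2):
    """Two events are independent iff P(e1 and e2) = P(e1) * P(e2)."""
    return prob(e1 & e2) == prob(e1) * prob(e2)

first_heads = {w for w in omega if w[0] == "H"}
second_heads = {w for w in omega if w[1] == "H"}
at_least_one_head = {w for w in omega if "H" in w}

print(independent(first_heads, second_heads))       # True
print(independent(first_heads, at_least_one_head))  # False
```

Using `Fraction` keeps every probability exact, so the equality test for independence is not affected by floating-point rounding.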
(a) A fair coin is tossed three times. Consider the four events:
• A: the outcome of the first toss is heads.
• B: the outcome of the second toss is tails.
• C: the outcomes of all three tosses are the same.
• D: exactly one of the outcomes is heads.
Which of the following pairs of events are independent?
• A and B.
• A and C.
• A and D.
• C and D.
(b) A student applies to two schools: Trump University and Columbia University. The student
has a probability of 0.5 of being accepted to Trump, and a probability of 0.3 of being accepted
to Columbia. The probability of being accepted by both is 0.2. What is the probability that
the student is accepted to Columbia, given that the student is accepted at Trump?
A random variable (r.v.) on $(\Omega, P)$ is a real-valued function $X \colon \Omega \to \mathbb{R}$. The notation $X \sim P$
declares the r.v. $X$ and associates it with the probability distribution $P$. (We’ll often leave the
probability space implicit.) The expected value (a.k.a. expectation or mean) of $X$, written $E(X)$, is
the average value of $X$ under the distribution $P$:
$$E(X) := \sum_{\omega \in \Omega} X(\omega) \cdot P(\omega).$$
An equivalent definition of $E(X)$ is $E(X) := \sum_x x \cdot P(X = x)$, where the summation is taken over
all $x$ in the range of $X$, and $P(X = x)$ is shorthand for $P(\{\omega \in \Omega : X(\omega) = x\})$.
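The two definitions of the expected value can be compared directly on a small example. The sketch below assumes a hypothetical single fair die with $X(\omega) = (\omega - 3)^2$; this is not the random variable from part (c), and the variable names are illustrative.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical probability space: one roll of a fair six-sided die.
omega = range(1, 7)
P = {w: Fraction(1, 6) for w in omega}

def X(w):
    """Hypothetical random variable X(w) = (w - 3)^2."""
    return (w - 3) ** 2

# Definition 1: sum over outcomes w of X(w) * P(w).
e_outcomes = sum(X(w) * P[w] for w in omega)

# Definition 2: first build P(X = x) for each value x, then sum x * P(X = x).
pmf = defaultdict(Fraction)
for w in omega:
    pmf[X(w)] += P[w]
e_values = sum(x * p for x, p in pmf.items())

print(e_outcomes, e_values)   # both print 19/6
```

The second form groups outcomes that share the same value of $X$, which is often less work when many outcomes collapse onto a few values.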
(c) Consider the sample space Ω = {1, 2, . . . , 6} × {1, 2, . . . , 6}, and let P be the uniform distribution over Ω, i.e., P(a, b) = 1/36 for each (a, b) ∈ Ω. Let X be the random variable defined
by X(a, b) = min{a, b} for each (a, b) ∈ Ω.
For each x ∈ {1, 2, . . . , 6}, what is P(X = x)?
(d) Continuing from (c), what is the expected value of X?
(e) A biased coin with P(heads) = 1/5 is tossed repeatedly until heads comes up. What is the
expected number of tosses?
(f) You create a random sentence of length n by repeatedly picking words at random from the
vocabulary {a, is, not, rose}, with each word being equally likely to be picked. What is the
expected number of times that the phrase “a rose is a rose” will appear in the sentence?
Problem 5 (More random stuff you should know). We often encounter probability spaces $(\Omega, P)$
where $\Omega$ is not a discrete set. In this class, the only random variables we’ll consider on such spaces
will either have a discrete image (i.e., $\{X(\omega) : \omega \in \Omega\}$ is a discrete set) or have a probability density
function $p \colon \mathbb{R} \to \mathbb{R}$, which is a non-negative real-valued function on $\mathbb{R}$ such that, for any open
interval $(a, b) = \{x \in \mathbb{R} : a < x < b\} \subseteq \mathbb{R}$,
$$P(X \in (a, b)) = P(\{\omega \in \Omega : X(\omega) \in (a, b)\}) = \int_{(a,b)} p(x) \, dx.$$
Random variables with probability density functions will be called continuous random variables.
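The defining property of a density, that interval probabilities are integrals of $p$, can be explored numerically. The sketch below assumes a hypothetical density $p(x) = 2x$ on $[0, 1]$ (zero elsewhere), which is not one of the densities used in the parts that follow, and approximates the integral with a simple midpoint rule; `prob_interval` is a made-up helper name.

```python
import numpy as np

def p(x):
    """Hypothetical density: 2x on [0, 1], zero elsewhere."""
    x = np.asarray(x, dtype=float)
    return np.where((x >= 0) & (x <= 1), 2.0 * x, 0.0)

def prob_interval(density, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of density over (a, b)."""
    edges = np.linspace(a, b, n + 1)
    mids = 0.5 * (edges[:-1] + edges[1:])
    return float(np.sum(density(mids)) * (b - a) / n)

total = prob_interval(p, -1.0, 2.0)   # should be close to 1 for a valid density
prob = prob_interval(p, 0.25, 0.5)    # exact value is 0.5^2 - 0.25^2 = 0.1875

print(total, prob)
```

Checking that the density integrates to approximately 1 over a range covering its support is a quick way to catch a wrong normalizing constant.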
(a) Let $X$ be a continuous random variable with probability density function $p$ given by
$$p(x) := \begin{cases} 0 & \text{if } x < 0, \\ \lambda e^{-\lambda x} & \text{if } x \ge 0. \end{cases}$$
Here, $\lambda$ is a positive number (typically called the rate parameter). If $P(X \le 1000000) = 0.5$,
then what is the value of $\lambda$?
(b) Let $X$ be a standard normal random variable, i.e., a continuous random variable whose density
is the standard normal density $p(x) := e^{-x^2/2} / \sqrt{2\pi}$ for all $x \in \mathbb{R}$. Define the random variable
$Y$ on the same probability space as $X$ by $Y := X^2$, i.e., $Y(\omega) := X(\omega)^2$ for all $\omega \in \Omega$. What
are $E(X)$ and $E(Y)$?
A collection of continuous random variables $X_1, X_2, \dots, X_d$, all defined on the same probability
space, has a (joint) probability density function $p \colon \mathbb{R}^d \to \mathbb{R}$ if, for any $A \subseteq \mathbb{R}^d$,
$$P((X_1, X_2, \dots, X_d) \in A) = \int_A p(x_1, x_2, \dots, x_d) \, dx_1 \, dx_2 \cdots dx_d.$$
We’ll often collect several random variables, such as $X_1, X_2, \dots, X_d$, into a random vector $X =
(X_1, X_2, \dots, X_d)$. So the equation above can be written as $P(X \in A) = \int_A p(x) \, dx$.
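The same idea extends to joint densities: the probability of a region $A$ is the integral of $p$ over $A$, which can be approximated on a grid. The sketch below assumes a hypothetical joint density $p(x_1, x_2) = x_1 + x_2$ on the unit square (zero elsewhere) and the region $A = \{x_1 \le 0.5\}$; neither is taken from the parts that follow, and `prob_region` is a made-up helper name.

```python
import numpy as np

def p(x1, x2):
    """Hypothetical joint density: x1 + x2 on the unit square, zero elsewhere."""
    inside = (x1 >= 0) & (x1 <= 1) & (x2 >= 0) & (x2 <= 1)
    return np.where(inside, x1 + x2, 0.0)

def prob_region(density, indicator, lo, hi, n=1000):
    """Midpoint-rule integral of density over {indicator true} within [lo, hi]^2."""
    edges = np.linspace(lo, hi, n + 1)
    mids = 0.5 * (edges[:-1] + edges[1:])
    g1, g2 = np.meshgrid(mids, mids, indexing="ij")
    cell_area = ((hi - lo) / n) ** 2
    return float(np.sum(density(g1, g2) * indicator(g1, g2)) * cell_area)

# Whole square: should be close to 1 for a valid density.
total = prob_region(p, lambda a, b: np.ones_like(a), 0.0, 1.0)

# Region A = {x1 <= 0.5}; the exact value for this density is 0.375.
prob_A = prob_region(p, lambda a, b: a <= 0.5, 0.0, 1.0)

print(total, prob_A)
```

Any event that can be written as a condition on $(x_1, x_2)$ can be passed as the `indicator`, so the same helper covers regions that are awkward to integrate by hand.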
(c) Suppose the pair of random variables $(X_1, X_2)$ has probability density function $p$ given by
$$p(x_1, x_2) := \begin{cases} c & \text{if } 0 \le x_1 \le 0.5 \text{ and } 0 \le x_2 \le 1, \\ 0 & \text{otherwise}. \end{cases}$$
Here, $c$ is a constant (that does not depend on $x_1$ or $x_2$). What should be the value of $c$ so
that $p$ is a valid probability density function?
(d) Continuing from (c), what is the probability that $X_2 \ge X_1$?
(e) Continuing from (c), define another random variable $Y$ on the same probability space as $X_1$
and $X_2$ by
$$Y := \begin{cases} 1 & \text{if } X_1 > 2X_2, \\ -1 & \text{otherwise}. \end{cases}$$
Are $X_1$ and $Y$ independent? What is the expected value of $Y$?
(f) Continuing from (c), define yet another random variable $Z$ on the same probability space as
$X_1$ and $X_2$ by
$$Z := \begin{cases} 1 & \text{if } X_2 > 1/2, \\ -1 & \text{otherwise}. \end{cases}$$
Are $X_1$ and $Z$ independent? What is the expected value of $X_1 Z$?
Problem 6 (Google Cloud; optional but recommended). Set up a virtual machine on Google
Cloud. Figure out how to install some useful Python packages like numpy, scipy, scikit-learn,
etc. Download the OCR image data set ocr.mat from Courseworks, and load it into memory:
    from scipy.io import loadmat
    ocr = loadmat('ocr.mat')
This file contains four different matrices called data, labels, testdata, and testlabels. For
example, data represents a 60000×784 matrix, which you can verify using the following command:
    ocr['data'].shape
Using the numpy and scipy libraries, write some code to compute the average squared Euclidean
norm of the rows of data. The following functions may be useful:
• numpy.apply_along_axis
• numpy.linalg.norm
• numpy.mean
The result should be around 127.642. You don’t need to submit anything for this problem.
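One possible way to structure the computation is sketched below. Because `ocr.mat` is only available from Courseworks, the sketch uses a small synthetic matrix as a stand-in (an assumption made purely so the code runs on its own); with the real file, `data` would instead come from `ocr['data']`.

```python
import numpy as np

# Synthetic stand-in for ocr['data'] (assumption: the real matrix is 60000 x 784).
rng = np.random.default_rng(0)
data = rng.random((1000, 784))

# Work in floating point; if the loaded array has an integer dtype,
# squaring small integer types can overflow.
data = data.astype(float)

# Using the suggested functions: Euclidean norm of each row, then the mean of the squares.
row_norms = np.apply_along_axis(np.linalg.norm, 1, data)
avg_sq_norm = np.mean(row_norms ** 2)

# Equivalent vectorized form: sum of squared entries per row, averaged over rows.
avg_sq_norm_fast = np.mean(np.sum(data ** 2, axis=1))

print(avg_sq_norm, avg_sq_norm_fast)
```

The two forms compute the same quantity; the vectorized one avoids a Python-level call per row, which tends to matter on a matrix with 60000 rows.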