## Description

Question 1 (2 points)

While simrank is suitable for search engines, how would you adapt it for display ads where

are there are no queries?

Question 2 (2 points)

Consider the following methods to compute user and item biases in neighborhood models

for recommendation.

1

Figure 1: Question 4: figure illustrating the integration of the binary view of the rating

matrix, into matrix factorization

bi =

P

u∈R(i)

(rui − µ)

λ2 + |R(i)|

bu =

P

i∈R(u)

(rui − µ − bi)

λ3 + |R(u)|

where R(i) denotes the set of users who have rated on item i, and R(u) denotes the set

of items rated by user u. Explain the purpose of constants λ2 and λ3.

Question 3 (2 points)

In the matrix factorization model for recommendation, how do you determine the number

of latent factors?

Question 4 (3 points)

Consider figure 1 which illustrates factoring in the binary view of the rating matrix, into

matrix factorization. Why does the information from the (dense) binary view (which is

derived from the rating matrix) produce a 7% improvement in the results over direct matrix

factorization on the rating matrix?

2

Question 5 (7 + 14 = 21)

The objective of this assignment is to design a movie recommender system, i.e., recommend

movies to users using item-item neighborhood models. The dataset includes ratings provided

by users to some movies, and movie metadata that includes title, overview, etc.

Let the set of users be represented by U = {u ∈ U} and set of movies by represented by

I = {i ∈ I}. First, estimate movie biases {bi

: i ∈ I} by averaging over users that rated the

movie, and then estimate user biases by averaging residuals over movies rated by the user.

bi =

P

u∈R(i)

(rui − µ)

|R(i)|

bu =

P

i∈R(u)

(rui − µ − bi)

|R(u)|

where µ is the global mean calculated across all the ratings available in the dataset. The

rating for user u and movie i is predicted as:

rˆui = bu + bi + µ +

P

j∈R(u)

sij × (ruj − buj )

P

j∈R(u)

sij

where sij denotes the similarity between movies i and j. Note that the neighborhood R(u)

is computed over all items rated by user u. In this assignment, we explore two approaches

to compute movie-movie similarities.

1. Pearson correlation: Estimate movie-movie similarity using empirical Pearson correlation coefficient on shared support of items i and j, defined as:

sij =

P

u∈U(i,j)

(rui − bui)(ruj − buj )

r P

u∈U(i,j)

(rui − bui)

2

· (ruj − buj )

2

where U(i, j) is the set of users who have rated movies i and j, and bui = bu + bi + µ

2. Content similarity:

Movie metadata is a collection of words, sij is the cosine similarity between the two

movie documents i and j. In order to calculate cosine similarity, we first convert the

documents to a vectors using TF-IDF.

3

Term Frequency also known as TF, measures the number of times a term (word)

occurs in a document, normalized by document length.

T F(t, i) = Number of times term t appears in document of movie i

Total number of terms in document of movie i

Inverse Document Frequency also known as IDF, measures how important a term

is to a particular movie.

IDF(t) = loge

Total number of movies in dataset

Number of movies containing term t

Now let the set of all unique words across all movie documents be V = {w1, . . . , w|V |}.

We define the vector for movie i as d

i = (d

i

1

, . . . , di

|V |

) where d

i

k = T F(wk, i)×IDF(wk)

The similarity between movies i and j, is now computed as:

sij = cosine-sim(d

i

, dj

)

In this MP, given an input (u, i) , you need to output the estimated rating value ˆrui.

Round ˆrui to 1 decimal point.

Input format: The input contains the (user,movie) ratings, movie metadata and the

(user,movie) pairs for which you need to estimate the ratings.

The first line of the input contains two space separated integers R M. R is the number

of lines of user-movie ratings and M is the number of movies. Next R lines contain the

ratings. Each line contains three space-separated values (user-id, movie-id, rating). Next M

lines contain the metadata. The first entry in each line is the movie id, followed by words

describing movie metadata. The last five lines contain two space-separated integers (target

user-id, target movie-id) for which you need to estimate the ratings. You should submit your

code on compass and the code outputs in the pdf submitted to gradescope.

4