## Description

Overview

In this assignment you will explore the use of k-grams, Jaccard distance, min hashing, and LSH in the

context of document similarity.

You will use four text documents for this assignment:

• http://www.cs.utah.edu/˜jeffp/teaching/cs5140/A2/D1.txt

• http://www.cs.utah.edu/˜jeffp/teaching/cs5140/A2/D2.txt

• http://www.cs.utah.edu/˜jeffp/teaching/cs5140/A2/D3.txt

• http://www.cs.utah.edu/˜jeffp/teaching/cs5140/A2/D4.txt

As usual, it is highly recommended that you use LaTeX for this assignment. If you do not, you may

lose points if your assignment is difficult to read or hard to follow. Find a sample form in this directory:

http://www.cs.utah.edu/˜jeffp/teaching/latex/

1 Creating k-Grams (40 points)

You will construct several types of k-grams for all documents. All documents only have at most 27 characters: all lower case letters and space.

[G1] Construct 2-grams based on characters, for all documents.

[G2] Construct 3-grams based on characters, for all documents.

[G3] Construct 3-grams based on words, for all documents.

Remember, that you should only store each k-gram once, duplicates are ignored.

A: (20 points) How many distinct k-grams are there for each document with each type of k-gram? You

should report 4 × 3 = 12 different numbers.

B: (20 points) Compute the Jaccard similarity between all pairs of documents for each type of k-gram.

You should report 3 × 6 = 18 different numbers.

2 Min Hashing (30 points)

We will consider a hash family H so that any hash function h ∈ H maps from h : {k-grams} → [m] for m

large enough (I suggest over m ≥ 10,000).

A: (25 points) Using grams G2, build a min-hash signature for document D1 and D2 using t = {10, 50, 100, 250, 500}

hash functions. For each value of t report the approximate Jaccard similarity between the pair of documents

D1 and D2, estimating the Jaccard similarity:

JSˆ

t(a, b) = 1

t

X

t

i=1 (

1 if ai = bi

0 if ai 6= bi

.

You should report 5 numbers.

CS 6140 Data Mining; Spring 2015 Instructor: Jeff M. Phillips, University of Utah

B: (5 point) What seems to be a good value for t? You may run more experiments. Justify your answer in

terms of both accuracy and time.

3 LSH (30 points)

Consider computing an LSH using t = 100 hash functions. We want to find all documents which have

Jaccard similarity above τ = .4.

A: (8 points) Use the trick mentioned in class and the notes to estimate the best values of hash functions

b within each of r bands to provide the S-curve f(s) = 1 − (1 − s

r

)

b

f(s) = 1 − (1 − s

b

)

r

with good separation at τ . Report these values.

The values of r and b we mixed up before. You can report either, but please be clear which means the #

of bands (in blue = r) and the # number of hashes per band (in blue = b). This is now consistent with the

notes, but reverse of the book.

B: (24 points) Using your choice of r and b and f(·), what is the probability of each pair of the four

documents (using [G2]) for being estimated to having similarity greater that τ ? Report 6 numbers. (Show

your work.)

4 Bonus (3 points)

Describe a scheme like Min-Hashing for the Andberg Similarity, defined Andb(A, B) = |A∩B|

|A∪B|+|A4B|

. So

given two sets A and B and family of hash functions, then Prh∈H[h(A) = h(B)] = Andb(A, B). Note the

only randomness is in the choice of hash function h from the set H, and h ∈ H represents the process of

choosing a hash function (randomly) from H. The point of this question is to design this process, and show

that it has the required property.

Or show that such a process cannot be done