CIND-110 Lab Manual Module 9: Information Retrieval


CIND-110
Data Organization for Data Analysts
Objectives
To demonstrate the Vector Space Model of Information Retrieval in R by building a
search engine.
1. Installation
This section is required only if the latest version of R is not already installed.
From your Ubuntu Terminal, run the commands below.
Installing R
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
sudo add-apt-repository 'deb [arch=amd64,i386] https://cran.rstudio.com/bin/linux/ubuntu xenial/'
sudo apt-get update
sudo apt-get install r-base
sudo -i R
(A detailed walkthrough is available at https://www.digitalocean.com/community/tutorials/how-to-install-r-on-ubuntu-16-04-2.)
Then install RStudio through the following link: https://www.rstudio.com/products/rstudio/download/. Scroll down and download RStudio 1.0.153 for Ubuntu 16.04+/Debian 9+ (64-bit).
Now you are good to go. Follow the manual and execute the commands in RStudio.
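To confirm that the installation succeeded (an optional check, not part of the
original manual), you can print the version string from the R console:

R.version.string  # should report the R version you just installed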
2. Creating the Documents
Information retrieval is the process of retrieving documents from a collection in
response to a query (or a search request) by a user.
In this section, we explore our documents and the search query.
Note that all queries below must be run in R.
Collection of documents
doc1 <- "Stray cats are running all over the place. I see 10 a day!"
doc2 <- "Cats are killers. They kill billions of animals a year."
doc3 <- "The best food in Columbus, OH is the North Market."
doc4 <- "Brand A is the best tasting cat food around. Your cat will love it."
doc5 <- "Buy Brand C cat food for your cat. Brand C makes healthy and happy cats."
doc6 <- "The Arnold Classic came to town this weekend. It reminds us to be healthy."
doc7 <- "I have nothing to say. In summary, I have told you nothing."
doc.list <- list(doc1, doc2, doc3, doc4, doc5, doc6, doc7)
N.docs <- length(doc.list)
names(doc.list) <- paste0("doc", c(1:N.docs))
We also have an information need that is expressed via the following text query:
query <- "Healthy cat food"
In order to meet our information need amidst all this unstructured text, we are
going to implement the vector space model of information retrieval in R. In the
process, we will learn something about the TM package and about the analysis of
unstructured data before it was Big.
3. Text Mining
Install the tm, SnowballC, and slam packages in RStudio.
Installing Packages
install.packages("slam")
install.packages("SnowballC")
install.packages("tm")
Loading the packages into memory
library(tm)
library(SnowballC)
library(slam)
In text mining and related fields, a corpus is a collection of texts, often with extensive manual annotation. Not surprisingly, the CORPUS class is a fundamental
data structure in TM.
Corpus
my.docs <- VectorSource(c(doc.list, query))
my.docs$Names <- c(names(doc.list), "query")
my.corpus <- Corpus(my.docs)
my.corpus
Above, we treated the query like any other document. It is, after all, just another
string of text. Queries are not typically known a priori, but in the processing steps
that follow, we will pretend like we knew ours in advance to avoid repeating steps.
Standardizing. One of the nice things about the Corpus class is the TM_MAP
function, which cleans and standardizes documents within a Corpus object. Below
are some of the transformations.
getTransformations()
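In recent versions of tm, this call lists the built-in transformations; the exact
output may vary by version, but it should look something like:

# [1] "removeNumbers"     "removePunctuation" "removeWords"
# [4] "stemDocument"      "stripWhitespace"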
First, let’s remove the punctuation.
Removing Punctuation
my.corpus <- tm_map(my.corpus, removePunctuation)
content(my.corpus[[1]])
Suppose we don't want to count "cats" and "cat" as two separate words. For this we
implement the famous Porter stemming algorithm. To use this particular transformation,
first load the SnowballC package.
Stemming
library(SnowballC)
my.corpus <- tm_map(my.corpus, stemDocument)
content(my.corpus[[1]])
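As an aside (not part of the lab script), SnowballC's wordStem function applies
the Porter stemmer to individual words, which makes its effect easy to see:

wordStem(c("cats", "cat", "running"), language = "english")
# "cat" "cat" "run"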
Finally, let's remove numbers, convert everything to lower case, and strip any extra white space.
Remove Numbers, Lower Case, and Strip White Space
my.corpus <- tm_map(my.corpus, removeNumbers)
my.corpus <- tm_map(my.corpus, content_transformer(tolower))
my.corpus <- tm_map(my.corpus, stripWhitespace)
content(my.corpus[[1]])
4. The Vector Space Model
Document Similarity. Here's a trick that's been around for a while: represent each
document as a vector in R^N, with N the number of words, and use the angle
θ between the vectors as a similarity measure. Rank by the similarity of each
document to the query and we get a search engine.
One of the simplest things we can do is to count words within documents. This
naturally forms a two dimensional structure, the term document matrix, with rows
corresponding to the words and the columns corresponding to the documents. As
with any matrix, we may think of a term document matrix as a collection of column
vectors existing in a space defined by the rows. The query lives in this space as
well, though in practice we wouldn’t know it beforehand.
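As a toy illustration (made-up counts, not the lab data), a term document matrix
for three stemmed terms and two documents could be built by hand like this:

toy.tdm <- matrix(c(2, 0,
                    1, 1,
                    0, 3),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("cat", "food", "healthi"), c("docA", "docB")))
toy.tdm  # rows are terms, columns are documents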
R Commands
term.doc.matrix.stm <- TermDocumentMatrix(my.corpus)
colnames(term.doc.matrix.stm) <- c(names(doc.list), "query")
inspect(term.doc.matrix.stm[0:14, ])
Sparsity and storage of the term document matrix
The matrices in tm are of type Simple Triplet Matrix, where only the triples
(i, j, value) are stored for non-zero values. To work directly with these objects,
you may install the slam package. We bear some extra cost by making the
matrix "dense" (i.e., storing all the zeros) below.
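As a minimal sketch of this storage scheme (a hypothetical 3 x 3 example, not the
lab matrix), slam's simple_triplet_matrix stores only the non-zero cells:

stm.example <- slam::simple_triplet_matrix(i = c(1, 2, 3), j = c(1, 3, 2),
                                           v = c(5, 2, 7))
as.matrix(stm.example)  # densifying fills in all the stored zeros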
R Commands
term.doc.matrix <- as.matrix(term.doc.matrix.stm)
cat("Dense matrix representation costs", object.size(term.doc.matrix), "bytes.\n",
    "Simple triplet matrix representation costs", object.size(term.doc.matrix.stm), "bytes.")
Implementation
For both the documents and the query, we choose tfidf weights of
(1 + log2(tf)) * log2(N/df), where tf is the number of times a term occurs in a
document, df is the number of documents containing the term, and N is the total
number of documents. Note that whenever a term does not occur in a specific
document, or when it appears in every document, its weight is zero.
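For example, a term occurring twice (tf = 2) in one of N = 6 documents and
appearing in df = 4 of them gets weight (1 + log2 2) * log2(6/4) = 2 x 0.585 ≈ 1.17;
this matches the second entry returned by the example call below.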
Function computing tfidf weights from term frequency vector
get.tf.idf.weights <- function(tf.vec) {
  n.docs <- length(tf.vec)
  doc.frequency <- length(tf.vec[tf.vec > 0])
  weights <- rep(0, length(tf.vec))
  weights[tf.vec > 0] <- (1 + log2(tf.vec[tf.vec > 0])) * log2(n.docs/doc.frequency)
  return(weights)
}
# For a word appearing in 4 of 6 documents, occurring 1, 2, 3, and 6 times:
get.tf.idf.weights(c(1, 2, 3, 0, 0, 6))
Using apply, we run the tfidf weighting function on every row of the term document
matrix. The document frequency is easily derived from each row by counting the
non-zero entries (not including the query).
Applying the tfidf weighting function to the term document matrix
tfidf.matrix <- t(apply(term.doc.matrix, 1, FUN = function(row) {get.tf.idf.weights(row)}))
colnames(tfidf.matrix) <- colnames(term.doc.matrix)
tfidf.matrix[0:3, ]
Dot Product Geometry
A benefit of being in the vector space is the use of its dot product. For vectors a and
b, the geometric definition of the dot product is a · b = ‖a‖ ‖b‖ cos θ,
where ‖·‖ is the Euclidean norm (the root sum of squares) and θ is the
angle between a and b.
If a and b are orthogonal, then a · b = 0.
If they are co-directional, then a · b = ‖a‖ ‖b‖.
The dot product of a vector with itself is a · a = ‖a‖², so ‖a‖ = √(a · a), which is
the Euclidean length of the vector.
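A quick numeric check in R (illustrative only, not part of the lab script)
confirms the identity for two small vectors separated by an angle of pi/4:

a <- c(1, 0); b <- c(1, 1)
sum(a * b)                                   # dot product: 1
sqrt(sum(a^2)) * sqrt(sum(b^2)) * cos(pi/4)  # ||a|| ||b|| cos(theta): also 1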
In fact, we can work directly with the cosine of θ. For θ in the interval
[−π/2, π/2], the endpoints are orthogonality (totally unrelated documents) and the
center, zero, is complete collinearity (maximally similar documents). We can see
that the cosine decreases from its maximum value of 1 as the angle departs from
zero in either direction.
Plot
angle <- seq(-pi, pi, by = pi/16)
plot(cos(angle) ~ angle, type = "b", xlab = "angle in radians",
     main = "Cosine similarity by angle")
We may furthermore normalize each column vector in our tfidf matrix so that its
norm is one. Then the dot product is cos θ.
R commands
tfidf.matrix <- scale(tfidf.matrix, center = FALSE,
                      scale = sqrt(colSums(tfidf.matrix^2)))
tfidf.matrix[0:3, ]
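As an optional sanity check (not in the original script), every column should now
have unit norm, i.e. its squared entries should sum to 1:

colSums(tfidf.matrix^2)  # each entry should be (approximately) 1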
Dot Product Machine
Keeping the query alongside the other documents lets us avoid repeating the
same steps. But now it's time to pretend it was never there.
R commands
query.vector <- tfidf.matrix[, (N.docs + 1)]
tfidf.matrix <- tfidf.matrix[, 1:N.docs]
With the query vector and the set of document vectors in hand, it is time to go
after the cosine similarities. These are simple dot products as our vectors have
been normalized to unit length.
Recall that matrix multiplication is really just a sequence of vector dot products.
The matrix operation below returns values of cosθ for each document vector and
the query vector.
R commands
doc.scores <- t(query.vector) %*% tfidf.matrix
With scores in hand, rank the documents by their cosine similarities with the
query vector.
R commands
results.df <- data.frame(doc = names(doc.list), score = t(doc.scores),
                         text = unlist(doc.list))
results.df <- results.df[order(results.df$score, decreasing = TRUE), ]
The results – Evaluating the model
R commands
options(width = 2000)
print(results.df, row.names = FALSE, right = FALSE, digits = 2)
The “best” document, at least in an intuitive sense, comes out ahead with a score
nearly twice as high as its nearest competitor.
Notice however that this next competitor has nothing to do with cats. This is due
to the relative rareness of the word “healthy” in the documents and our choice to
incorporate the inverse document frequency weighting for both documents and
query.
We can calculate the precision, recall, and F-score to evaluate the accuracy of the retrieval, as sketched below.
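As a minimal sketch of how this could be done (the relevance labels below are
hypothetical, chosen only to illustrate the calculation):

# Hypothetical ground truth: suppose doc4 and doc5 are the truly relevant
# documents, and treat the top 3 ranked documents as the retrieved set.
relevant <- c("doc4", "doc5")
retrieved <- head(as.character(results.df$doc), 3)
true.pos <- length(intersect(retrieved, relevant))
precision <- true.pos / length(retrieved)  # fraction of retrieved docs that are relevant
recall <- true.pos / length(relevant)      # fraction of relevant docs retrieved
f.score <- 2 * precision * recall / (precision + recall)
c(precision = precision, recall = recall, f.score = f.score)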
The entire R Script for the above exercise is uploaded on the D2L shell.