
Information Retrieval Models:

Vector Space Models


ChengXiang Zhai

Department of Computer Science

University of Illinois, Urbana-Champaign


Empirical IR vs. Model-based IR
• Empirical IR:
– heuristic approaches
– solely rely on empirical evaluation
– assumptions not always clearly stated
– findings: empirical observations; may or may not generalize well

• Model-based IR:
– theoretical approaches
– rely more on mathematics
– assumptions are explicitly stated
– findings: principles and models that may or may not work well, but generalize better

• Boundary may not be clear and a combination is generally necessary
History of Research on IR Models
• 1960: First probabilistic model [Maron & Kuhns 60]

• 1970s: Active research on retrieval models started
– Vector-space model [Salton et al. 75]
– Classic probabilistic model [Robertson & Sparck Jones 76]
– Probability Ranking Principle [Robertson 77]

• 1980s: Further development of different models
– Non-classic logic model [Rijsbergen 86]
– Extended Boolean [Salton et al. 83]
– Early work on learning to rank [Fuhr 89]

History of Research on IR Models (cont.)
• 1990s: retrieval model research driven by TREC
– Inference network [Turtle & Croft 91]
– BM25/Okapi [Robertson et al. 94]
– Pivoted length normalization [Singhal et al. 96]
– Language model [Ponte & Croft 98]

• 2000s-present: retrieval model research influenced by machine learning and Web search
– Further development of language models [Zhai & Lafferty 01, Lavrenko & Croft 01]
– Divergence from randomness [Amati et al. 02]
– Axiomatic model [Fang et al. 04]
– Markov Random Field [Metzler & Croft 05]
– Further development of learning to rank [Joachims 02, Burges et al. 05]
Modeling Relevance:
Roadmap for Retrieval Models
Three ways of modeling relevance, each with representative models:

• Similarity: score d by sim(Rep(q), Rep(d))
– Vector space model [Salton et al. 75]
– Prob. distr. model [Wong & Yao 89]
• Probability of relevance: P(r=1|q,d), r ∈ {0,1}
– Regression model [Fuhr 89]
– Learning to rank [Joachims 02, Burges et al. 05]
– Generative models, via doc generation (classical prob. model [Robertson & Sparck Jones 76]) or query generation (LM approach [Ponte & Croft 98, Lafferty & Zhai 01a])
• Probabilistic inference: P(d→q) or P(q→d)
– Prob. concept space model [Wong & Yao 95]
– Inference network model [Turtle & Croft 91]

Connected to relevance directly: relevance constraints [Fang et al. 04]; divergence from randomness [Amati & Rijsbergen 02]
1. Vector Space Models
The Basic Question
Given a query, how do we know if
document A is more relevant than B?

One Possible Answer

If document A uses more query words than document B
(word usage in document A is more similar to that in the query)
Relevance = Similarity
• Assumptions
– Query and document are represented similarly
– A query can be regarded as a “document”
– Relevance(d,q) ∝ similarity(d,q)
• R(q) = {d ∈ C | f(d,q) > θ}, where f(q,d) = sim(Rep(q), Rep(d))
• Key issues
– How to represent the query/document?
– How to define the similarity measure sim(·,·)?
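To make the ranking rule concrete, here is a minimal sketch of thresholded retrieval; the overlap() similarity and the threshold value are toy stand-ins, not part of the slides:

```python
def overlap(d, q):
    # Toy similarity: fraction of query words that appear in the doc.
    dw, qw = set(d.split()), set(q.split())
    return len(dw & qw) / len(qw)

def retrieve(query, collection, theta=0.5):
    # R(q) = {d in C | f(d,q) > theta}
    return [d for d in collection if overlap(d, query) > theta]

docs = ["java coffee shop", "microsoft office suite"]
print(retrieve("java coffee", docs))  # ['java coffee shop']
```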
Vector Space Model
• Represent a doc/query by a term vector
– Term: basic concept, e.g., word or phrase
– Each term defines one dimension
– N terms define a high-dimensional space
– Element of vector corresponds to term weight
– E.g., d=(x1,…,xN), xi is “importance” of term i
• Measure relevance by the distance between the
query vector and document vector in the vector
space
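As a minimal sketch (the vocabulary and texts are invented for illustration), documents and queries become vectors of per-term weights; raw counts serve as the simplest possible weight:

```python
from collections import Counter

# Toy vocabulary; in practice the dimensions are all N terms in the collection.
vocab = ["java", "starbucks", "microsoft", "coffee"]

def to_vector(text):
    # Map a text to a term vector d = (x1, ..., xN); xi = raw count of term i.
    counts = Counter(text.lower().split())
    return [counts[t] for t in vocab]

print(to_vector("Java coffee at Starbucks"))  # [1, 1, 0, 1]
print(to_vector("java microsoft"))            # [1, 0, 1, 0]
```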

VS Model: illustration

[Figure: documents D1–D11 and a query plotted in a three-dimensional term space with axes “Java”, “Starbucks”, and “Microsoft”; documents closest to the query vector are presumed most relevant.]
What the VS model doesn’t say
• How to define/select the “basic concept”
– Concepts are assumed to be orthogonal
• How to assign weights
– Weight in query indicates importance of term
– Weight in doc indicates how well the term
characterizes the doc
• How to define the similarity/distance measure

What’s a good “basic concept”?
• Orthogonal
– Linearly independent basis vectors
– “Non-overlapping” in meaning
• No ambiguity
• Weights can be assigned automatically and
hopefully accurately
• Many possibilities: Words, stemmed words,
phrases, “latent concept”, …
• “Bag of words” representation works
“surprisingly” well!
How to Assign Weights?
• Very, very important!
• Why weighting
– Query side: Not all terms are equally important
– Doc side: Some terms carry more information about contents

• How?
– Two basic heuristics
• TF (Term Frequency) = Within-doc-frequency
• IDF (Inverse Document Frequency)

– Document length normalization

TF Weighting

• Idea: A term is more important if it occurs more frequently in a document
• Formulas: Let f(t,d) be the frequency count of term t in doc d
– Raw TF: TF(t,d) = f(t,d)
– Log TF: TF(t,d) = log(f(t,d) + 1)
– Maximum frequency normalization:
TF(t,d) = 0.5 + 0.5 * f(t,d)/MaxFreq(d)
– “Okapi/BM25 TF”:
TF(t,d) = k * f(t,d) / (f(t,d) + k * (1 - b + b * doclen/avgdoclen))
• Normalization of TF is very important!
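A sketch of these TF variants in code; k = 1.2 and b = 0.75 are common default values assumed here for illustration, not values from the slide:

```python
import math

def tf_log(f):
    # Log TF: dampens repeated occurrences of the same term.
    return math.log(f + 1)

def tf_bm25(f, doclen, avgdoclen, k=1.2, b=0.75):
    # "Okapi/BM25 TF": bounded above by k; b controls length normalization.
    return k * f / (f + k * (1 - b + b * doclen / avgdoclen))

# A term occurring 10 times in a doc twice the average length:
print(tf_log(10))             # ~2.40, vs. raw TF of 10
print(tf_bm25(10, 200, 100))  # ~0.99, close to the upper bound k = 1.2
```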
TF Normalization
• Why?
– Document length variation
– “Repeated occurrences” are less informative than
the “first occurrence”
• Two views of document length
– A doc is long because it uses more words
– A doc is long because it has more contents
• Generally penalize long doc, but avoid over-
penalizing (e.g., pivoted normalization)

TF Normalization (cont.)

[Figure: normalized TF plotted against raw TF for different values of b.]

“Pivoted normalization”: using avg. doc length to regularize normalization:

normalizer = 1 - b + b * doclen/avgdoclen, where b varies from 0 to 1

Normalization interacts with the similarity measure.
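A small sketch of how the pivoted normalizer behaves (b = 0.75 is an assumed illustrative value):

```python
def pivot_norm(doclen, avgdoclen, b=0.75):
    # 1 - b + b * doclen/avgdoclen: >1 penalizes long docs, <1 boosts short ones.
    return 1 - b + b * doclen / avgdoclen

for doclen in (50, 100, 200):
    print(doclen, pivot_norm(doclen, avgdoclen=100))
# 50 -> 0.625, 100 -> 1.0, 200 -> 1.75
```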
IDF Weighting
• Idea: A term is more discriminative/important if it occurs in fewer documents
• Formula: IDF(t) = 1 + log(n/k)
n = total number of docs
k = # docs containing term t (doc freq)
• Other variants:
– IDF(t) = log((n+1)/k)
– IDF(t) = log((n+1)/(k+0.5))
• What are the maximum and minimum values of
IDF?
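A quick numeric check of that question (natural log assumed, since the base is not specified on the slide):

```python
import math

def idf(n, k):
    # IDF(t) = 1 + log(n/k); n = total docs, k = doc freq of t.
    return 1 + math.log(n / k)

n = 1_000_000
print(idf(n, 1))  # ~14.8: maximum, reached when the term occurs in one doc
print(idf(n, n))  # 1.0:   minimum, reached when the term occurs in every doc
```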
Non-Linear Transformation in IDF

[Figure: IDF(t) = 1 + log(n/k) plotted against doc freq k, compared with a linear penalization; the curve starts at 1 + log(n) for k = 1 and drops to 1 at k = n, the total number of docs in the collection.]

Is this transformation optimal?
TF-IDF Weighting
• TF-IDF weighting: weight(t,d) = TF(t,d) * IDF(t)
– Common in doc → high TF → high weight
– Rare in collection → high IDF → high weight
• Imagine a word count profile, what kind of terms
would have high weights?
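A sketch combining the two heuristics over a hypothetical toy collection (the docs and the choice of log TF are illustrative assumptions):

```python
import math
from collections import Counter

docs = ["java coffee shop", "coffee shop menu", "coffee prices menu"]
n = len(docs)
doc_freq = Counter(t for d in docs for t in set(d.split()))

def tfidf(term, doc):
    f = doc.split().count(term)
    if f == 0:
        return 0.0
    tf = math.log(f + 1)                    # log TF
    idf = 1 + math.log(n / doc_freq[term])  # IDF(t) = 1 + log(n/k)
    return tf * idf

print(tfidf("java", docs[0]))    # ~1.45: rare in collection -> high weight
print(tfidf("coffee", docs[0]))  # ~0.69: occurs in every doc -> low weight
```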

Empirical distribution of words

• There are stable, language-independent patterns in how people use natural languages
• A few words occur very frequently; most occur
rarely. E.g., in news articles,
– Top 4 words: 10~15% word occurrences
– Top 50 words: 35~40% word occurrences
• The most frequent word in one corpus may be
rare in another

Zipf’s Law

• rank * frequency ≈ constant:

F(w) = C / r(w)^α, with α ≈ 1, C ≈ 0.1

where F(w) is the relative frequency of word w and r(w) is its rank by frequency.

[Figure: word frequency vs. word rank; the mid-rank region holds the high-entropy words.]

• Generalized Zipf’s law: F(w) = C / (r(w) + B)^α, applicable in many domains
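A quick empirical check of the law (the file name is a placeholder for any large plain-text corpus):

```python
from collections import Counter

words = open("corpus.txt").read().lower().split()
freqs = sorted(Counter(words).values(), reverse=True)
total = len(words)
for rank in (1, 2, 5, 10, 50):
    # With alpha = 1, rank * relative frequency should be roughly constant (~0.1).
    print(rank, rank * freqs[rank - 1] / total)
```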
How to Measure Similarity?

Let D_i = (w_i1, ..., w_iN) and Q = (w_q1, ..., w_qN), with w = 0 if a term is absent.

Dot product similarity:

sim(Q, D_i) = Σ_{j=1..N} w_qj * w_ij

Cosine (= length-normalized dot product):

sim(Q, D_i) = ( Σ_{j=1..N} w_qj * w_ij ) / ( sqrt(Σ_j w_qj^2) * sqrt(Σ_j w_ij^2) )

How about Euclidean distance?

dist(Q, D_i) = sqrt( Σ_{j=1..N} (w_qj - w_ij)^2 )
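A minimal sketch of the dot-product and cosine measures over term vectors (the vectors are toy values for illustration):

```python
import math

def dot(q, d):
    return sum(wq * wd for wq, wd in zip(q, d))

def cosine(q, d):
    # Length-normalized dot product: insensitive to vector magnitude.
    return dot(q, d) / (math.sqrt(dot(q, q)) * math.sqrt(dot(d, d)))

q  = [1, 0, 1, 0]
d1 = [2, 1, 0, 1]
d2 = [1, 0, 1, 1]
print(cosine(q, d1))  # ~0.58
print(cosine(q, d2))  # ~0.82: closer in direction to q, so ranked higher
```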
What Works the Best?

[Figure: bar chart comparing retrieval error across the choices below.]

• Use single words
• Use stat. phrases
• Remove stop words
• Stemming
• Others(?)

(Singhal 2001; Singhal et al. 1996)
Relevance Feedback in VS
• Basic setting: Learn from examples
– Positive examples: docs known to be relevant
– Negative examples: docs known to be non-relevant
– How do you learn from this to improve performance?
• General method: Query modification
– Adding new (weighted) terms
– Adjusting weights of old terms
– Doing both
• The most well-known and effective approach is Rocchio
[Rocchio 1971]

Rocchio Feedback: Illustration

[Figure: relevant (+) and non-relevant (−) documents scattered in the vector space; the original query q is moved toward the centroid of the relevant documents and away from the centroid of the non-relevant documents, yielding the modified query q_m.]
Rocchio Feedback: Formula

The new query is the original query moved toward the centroid of the relevant docs and away from the centroid of the non-relevant docs:

q_m = α * q + (β/|D_r|) * Σ_{d ∈ D_r} d - (γ/|D_n|) * Σ_{d ∈ D_n} d

where q is the original query vector, D_r and D_n are the sets of relevant and non-relevant docs, and α, β, γ are parameters weighting each component.
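A sketch of the update on raw term vectors; the α, β, γ defaults below are common illustrative choices, not values from the slides:

```python
def rocchio(q, rel, nonrel, alpha=1.0, beta=0.75, gamma=0.15):
    # Move q toward the centroid of relevant docs, away from non-relevant ones.
    n = len(q)
    cr = [sum(d[i] for d in rel) / len(rel) for i in range(n)]
    cn = [sum(d[i] for d in nonrel) / len(nonrel) for i in range(n)]
    return [alpha * q[i] + beta * cr[i] - gamma * cn[i] for i in range(n)]

q_m = rocchio([1, 0, 1], rel=[[2, 1, 0], [0, 1, 2]], nonrel=[[0, 3, 0]])
print(q_m)  # [1.75, 0.3, 1.75]; negative weights are usually clipped to 0
```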
Rocchio in Practice
• How can we optimize the parameters?
• Can it be used for both relevance feedback and pseudo
feedback?
• How does Rocchio feedback affect the efficiency of
scoring documents? How can we improve the
efficiency?

Advantages of VS Model

• Empirically effective! (Top TREC performance)
• Intuitive
• Easy to implement
• Well-studied/Most evaluated
• The Smart system
– Developed at Cornell: 1960-1999
– Still widely used
• Warning: Many variants of TF-IDF!
Disadvantages of VS Model
• Assumes term independence
• Assumes query and document are represented in the same way
• Lack of “predictive adequacy”
– Arbitrary term weighting
– Arbitrary similarity measure
• Lots of parameter tuning!

What You Should Know
• Basic idea of the vector space model
• TF-IDF weighting
• Pivoted length normalization (read [Singhal et
al. 1996] to know more)
• BM25/Okapi retrieval function (particularly TF
weighting)
• How Rocchio feedback works

