
Information Retrieval Models:

Vector Space Models


ChengXiang Zhai

Department of Computer Science

University of Illinois, Urbana-Champaign


Empirical IR vs. Model-based IR
• Empirical IR:
– heuristic approaches
– solely rely on empirical evaluation
– assumptions not always clearly stated
– findings: empirical observations; may or may not generalize well

• Model-based IR:
– theoretical approaches
– rely more on mathematics
– assumptions are explicitly stated
– findings: principles and models that may or may not work well, but generalize better

• Boundary may not be clear and a combination is generally necessary
History of Research on IR Models
• 1960: First probabilistic model [Maron & Kuhns 60]

• 1970s: Active research on retrieval models started
– Vector-space model [Salton et al. 75]
– Classic probabilistic model [Robertson & Sparck Jones 76]
– Probability Ranking Principle [Robertson 77]

• 1980s: Further development of different models
– Non-classic logic model [Rijsbergen 86]
– Extended Boolean [Salton et al. 83]
– Early work on learning to rank [Fuhr 89]

History of Research on IR Models (cont.)
• 1990s: retrieval model research driven by TREC
– Inference network [Turtle & Croft 91]
– BM25/Okapi [Robertson et al. 94]
– Pivoted length normalization [Singhal et al. 96]
– Language model [Ponte & Croft 98]

• 2000s-present: retrieval model research influenced by machine learning and Web search
– Further development of language models [Zhai & Lafferty 01, Lavrenko & Croft 01]
– Divergence from randomness [Amati et al. 02]
– Axiomatic model [Fang et al. 04]
– Markov Random Field [Metzler & Croft 05]
– Further development of learning to rank [Joachims 02, Burges et al. 05]
Modeling Relevance:
Roadmap for Retrieval Models
Three ways of modeling relevance, each with representative models:

• Similarity: score d by sim(Rep(q), Rep(d))
– Vector space model [Salton et al. 75]
– Prob. distr. model [Wong & Yao 89]
• Probability of relevance: P(r=1|q,d), r ∈ {0,1}
– Regression model [Fuhr 89]
– Learning to rank [Joachims 02, Burges et al. 05]
– Generative models, via doc generation (classical prob. model [Robertson & Sparck Jones 76]) or query generation (LM approach [Ponte & Croft 98, Lafferty & Zhai 01a])
• Probabilistic inference: P(d→q) or P(q→d)
– Prob. concept space model [Wong & Yao 95]
– Inference network model [Turtle & Croft 91]

Connected to relevance directly: relevance constraints [Fang et al. 04]; divergence from randomness [Amati & Rijsbergen 02]
1. Vector Space Models
The Basic Question
Given a query, how do we know if
document A is more relevant than B?

One Possible Answer

If document A uses more query words than document B
(word usage in document A is more similar to that in the query)
Relevance = Similarity
• Assumptions
– Query and document are represented similarly
– A query can be regarded as a “document”
– Relevance(d,q) ∝ similarity(d,q)
• R(q) = {d ∈ C | f(d,q) > θ}, where f(q,d) = sim(Rep(q), Rep(d))
• Key issues
– How to represent the query/document?
– How to define the similarity measure sim(·,·)?
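To make the ranking rule concrete, here is a minimal sketch of thresholded retrieval; the overlap() similarity and the threshold value are toy stand-ins, not part of the slides:

```python
def overlap(d, q):
    # Toy similarity: fraction of query words that appear in the doc.
    dw, qw = set(d.split()), set(q.split())
    return len(dw & qw) / len(qw)

def retrieve(query, collection, theta=0.5):
    # R(q) = {d in C | f(d,q) > theta}
    return [d for d in collection if overlap(d, query) > theta]

docs = ["java coffee shop", "microsoft office suite"]
print(retrieve("java coffee", docs))  # ['java coffee shop']
```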
Vector Space Model
• Represent a doc/query by a term vector
– Term: basic concept, e.g., word or phrase
– Each term defines one dimension
– N terms define a high-dimensional space
– Element of vector corresponds to term weight
– E.g., d=(x1,…,xN), xi is “importance” of term i
• Measure relevance by the distance between the
query vector and document vector in the vector
space
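As a minimal sketch (the vocabulary and texts are invented for illustration), documents and queries become vectors of per-term weights; raw counts serve as the simplest possible weight:

```python
from collections import Counter

# Toy vocabulary; in practice the dimensions are all N terms in the collection.
vocab = ["java", "starbucks", "microsoft", "coffee"]

def to_vector(text):
    # Map a text to a term vector d = (x1, ..., xN); xi = raw count of term i.
    counts = Counter(text.lower().split())
    return [counts[t] for t in vocab]

print(to_vector("Java coffee at Starbucks"))  # [1, 1, 0, 1]
print(to_vector("java microsoft"))            # [1, 0, 1, 0]
```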

VS Model: illustration

[Figure: documents D1–D11 and a query plotted in a three-dimensional term space with axes “Java”, “Starbucks”, and “Microsoft”; documents closest to the query vector are presumed most relevant.]
What the VS model doesn’t say
• How to define/select the “basic concept”
– Concepts are assumed to be orthogonal
• How to assign weights
– Weight in query indicates importance of term
– Weight in doc indicates how well the term
characterizes the doc
• How to define the similarity/distance measure

What’s a good “basic concept”?
• Orthogonal
– Linearly independent basis vectors
– “Non-overlapping” in meaning
• No ambiguity
• Weights can be assigned automatically and
hopefully accurately
• Many possibilities: Words, stemmed words,
phrases, “latent concept”, …
• “Bag of words” representation works
“surprisingly” well!
How to Assign Weights?
• Very, very important!
• Why weighting
– Query side: Not all terms are equally important
– Doc side: Some terms carry more information about contents

• How?
– Two basic heuristics
• TF (Term Frequency) = Within-doc-frequency
• IDF (Inverse Document Frequency)

– Document length normalization

TF Weighting

• Idea: A term is more important if it occurs more frequently in a document
• Formulas: Let f(t,d) be the frequency count of term t in doc d
– Raw TF: TF(t,d) = f(t,d)
– Log TF: TF(t,d) = log(f(t,d) + 1)
– Maximum frequency normalization:
TF(t,d) = 0.5 + 0.5 * f(t,d)/MaxFreq(d)
– “Okapi/BM25 TF”:
TF(t,d) = k * f(t,d) / (f(t,d) + k * (1 - b + b * doclen/avgdoclen))
• Normalization of TF is very important!
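A sketch of these TF variants in code; k = 1.2 and b = 0.75 are common default values assumed here for illustration, not values from the slide:

```python
import math

def tf_log(f):
    # Log TF: dampens repeated occurrences of the same term.
    return math.log(f + 1)

def tf_bm25(f, doclen, avgdoclen, k=1.2, b=0.75):
    # "Okapi/BM25 TF": bounded above by k; b controls length normalization.
    return k * f / (f + k * (1 - b + b * doclen / avgdoclen))

# A term occurring 10 times in a doc twice the average length:
print(tf_log(10))             # ~2.40, vs. raw TF of 10
print(tf_bm25(10, 200, 100))  # ~0.99, close to the upper bound k = 1.2
```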
TF Normalization
• Why?
– Document length variation
– “Repeated occurrences” are less informative than
the “first occurrence”
• Two views of document length
– A doc is long because it uses more words
– A doc is long because it has more contents
• Generally penalize long doc, but avoid over-
penalizing (e.g., pivoted normalization)

TF Normalization (cont.)

[Figure: normalized TF plotted against raw TF for different values of b.]

“Pivoted normalization”: using avg. doc length to regularize normalization:

normalizer = 1 - b + b * doclen/avgdoclen, where b varies from 0 to 1

Normalization interacts with the similarity measure.
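A small sketch of how the pivoted normalizer behaves (b = 0.75 is an assumed illustrative value):

```python
def pivot_norm(doclen, avgdoclen, b=0.75):
    # 1 - b + b * doclen/avgdoclen: >1 penalizes long docs, <1 boosts short ones.
    return 1 - b + b * doclen / avgdoclen

for doclen in (50, 100, 200):
    print(doclen, pivot_norm(doclen, avgdoclen=100))
# 50 -> 0.625, 100 -> 1.0, 200 -> 1.75
```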
IDF Weighting
• Idea: A term is more discriminative/important if it occurs in fewer documents
• Formula: IDF(t) = 1 + log(n/k)
n = total number of docs
k = # docs containing term t (doc freq)
• Other variants:
– IDF(t) = log((n+1)/k)
– IDF(t) = log((n+1)/(k+0.5))
• What are the maximum and minimum values of
IDF?
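A quick numeric check of that question (natural log assumed, since the base is not specified on the slide):

```python
import math

def idf(n, k):
    # IDF(t) = 1 + log(n/k); n = total docs, k = doc freq of t.
    return 1 + math.log(n / k)

n = 1_000_000
print(idf(n, 1))  # ~14.8: maximum, reached when the term occurs in one doc
print(idf(n, n))  # 1.0:   minimum, reached when the term occurs in every doc
```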
Non-Linear Transformation in IDF

[Figure: IDF(t) = 1 + log(n/k) plotted against doc freq k, compared with a linear penalization; the curve starts at 1 + log(n) for k = 1 and drops to 1 at k = n, the total number of docs in the collection.]

Is this transformation optimal?
TF-IDF Weighting
• TF-IDF weighting: weight(t,d) = TF(t,d) * IDF(t)
– Common in doc → high TF → high weight
– Rare in collection → high IDF → high weight
• Imagine a word count profile, what kind of terms
would have high weights?
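A sketch combining the two heuristics over a hypothetical toy collection (the docs and the choice of log TF are illustrative assumptions):

```python
import math
from collections import Counter

docs = ["java coffee shop", "coffee shop menu", "coffee prices menu"]
n = len(docs)
doc_freq = Counter(t for d in docs for t in set(d.split()))

def tfidf(term, doc):
    f = doc.split().count(term)
    if f == 0:
        return 0.0
    tf = math.log(f + 1)                    # log TF
    idf = 1 + math.log(n / doc_freq[term])  # IDF(t) = 1 + log(n/k)
    return tf * idf

print(tfidf("java", docs[0]))    # ~1.45: rare in collection -> high weight
print(tfidf("coffee", docs[0]))  # ~0.69: occurs in every doc -> low weight
```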

Empirical distribution of words

• There are stable, language-independent patterns in how people use natural languages
• A few words occur very frequently; most occur
rarely. E.g., in news articles,
– Top 4 words: 10~15% word occurrences
– Top 50 words: 35~40% word occurrences
• The most frequent word in one corpus may be
rare in another

Zipf’s Law

• rank * frequency ≈ constant:

F(w) = C / r(w)^α, with α ≈ 1, C ≈ 0.1

where F(w) is the relative frequency of word w and r(w) is its rank by frequency.

[Figure: word frequency vs. word rank; the mid-rank region holds the high-entropy words.]

• Generalized Zipf’s law: F(w) = C / (r(w) + B)^α, applicable in many domains
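A quick empirical check of the law (the file name is a placeholder for any large plain-text corpus):

```python
from collections import Counter

words = open("corpus.txt").read().lower().split()
freqs = sorted(Counter(words).values(), reverse=True)
total = len(words)
for rank in (1, 2, 5, 10, 50):
    # With alpha = 1, rank * relative frequency should be roughly constant (~0.1).
    print(rank, rank * freqs[rank - 1] / total)
```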
How to Measure Similarity?

Let D_i = (w_i1, ..., w_iN) and Q = (w_q1, ..., w_qN), with w = 0 if a term is absent.

Dot product similarity:

sim(Q, D_i) = Σ_{j=1..N} w_qj * w_ij

Cosine (= length-normalized dot product):

sim(Q, D_i) = ( Σ_{j=1..N} w_qj * w_ij ) / ( sqrt(Σ_j w_qj^2) * sqrt(Σ_j w_ij^2) )

How about Euclidean distance?

dist(Q, D_i) = sqrt( Σ_{j=1..N} (w_qj - w_ij)^2 )
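A minimal sketch of the dot-product and cosine measures over term vectors (the vectors are toy values for illustration):

```python
import math

def dot(q, d):
    return sum(wq * wd for wq, wd in zip(q, d))

def cosine(q, d):
    # Length-normalized dot product: insensitive to vector magnitude.
    return dot(q, d) / (math.sqrt(dot(q, q)) * math.sqrt(dot(d, d)))

q  = [1, 0, 1, 0]
d1 = [2, 1, 0, 1]
d2 = [1, 0, 1, 1]
print(cosine(q, d1))  # ~0.58
print(cosine(q, d2))  # ~0.82: closer in direction to q, so ranked higher
```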
What Works the Best?

[Figure: bar chart comparing retrieval error across the choices below.]

• Use single words
• Use stat. phrases
• Remove stop words
• Stemming
• Others(?)

(Singhal 2001; Singhal et al. 1996)
Relevance Feedback in VS
• Basic setting: Learn from examples
– Positive examples: docs known to be relevant
– Negative examples: docs known to be non-relevant
– How do you learn from this to improve performance?
• General method: Query modification
– Adding new (weighted) terms
– Adjusting weights of old terms
– Doing both
• The most well-known and effective approach is Rocchio
[Rocchio 1971]

Rocchio Feedback: Illustration

[Figure: relevant (+) and non-relevant (−) documents scattered in the vector space; the original query q is moved toward the centroid of the relevant documents and away from the centroid of the non-relevant documents, yielding the modified query q_m.]
Rocchio Feedback: Formula

The new query is the original query moved toward the centroid of the relevant docs and away from the centroid of the non-relevant docs:

q_m = α * q + (β/|D_r|) * Σ_{d ∈ D_r} d - (γ/|D_n|) * Σ_{d ∈ D_n} d

where q is the original query vector, D_r and D_n are the sets of relevant and non-relevant docs, and α, β, γ are parameters weighting each component.
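A sketch of the update on raw term vectors; the α, β, γ defaults below are common illustrative choices, not values from the slides:

```python
def rocchio(q, rel, nonrel, alpha=1.0, beta=0.75, gamma=0.15):
    # Move q toward the centroid of relevant docs, away from non-relevant ones.
    n = len(q)
    cr = [sum(d[i] for d in rel) / len(rel) for i in range(n)]
    cn = [sum(d[i] for d in nonrel) / len(nonrel) for i in range(n)]
    return [alpha * q[i] + beta * cr[i] - gamma * cn[i] for i in range(n)]

q_m = rocchio([1, 0, 1], rel=[[2, 1, 0], [0, 1, 2]], nonrel=[[0, 3, 0]])
print(q_m)  # [1.75, 0.3, 1.75]; negative weights are usually clipped to 0
```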
Rocchio in Practice
• How can we optimize the parameters?
• Can it be used for both relevance feedback and pseudo
feedback?
• How does Rocchio feedback affect the efficiency of
scoring documents? How can we improve the
efficiency?

Advantages of VS Model

• Empirically effective! (Top TREC performance)
• Intuitive
• Easy to implement
• Well-studied/Most evaluated
• The Smart system
– Developed at Cornell: 1960-1999
– Still widely used
• Warning: Many variants of TF-IDF!
Disadvantages of VS Model
• Assumes term independence
• Assumes query and document are represented in the same way
• Lack of “predictive adequacy”
– Arbitrary term weighting
– Arbitrary similarity measure
• Lots of parameter tuning!

What You Should Know
• Basic idea of the vector space model
• TF-IDF weighting
• Pivoted length normalization (read [Singhal et
al. 1996] to know more)
• BM25/Okapi retrieval function (particularly TF
weighting)
• How Rocchio feedback works

