• Model-based IR:
– theoretical approaches
– rely more on mathematics
– assumptions are explicitly stated
– findings: principles and models that may or may not work well in practice,
but that generalize better
History of Research on IR Models (cont.)
• 1990s: retrieval model research driven by TREC
– Inference network [Turtle & Croft 91]
– BM25/Okapi [Robertson et al. 94]
– Pivoted length normalization [Singhal et al. 96]
– Language model [Ponte & Croft 98]
[Figure: a map of major retrieval models, grouped by how they model relevance]
• Different rep & similarity: Vector space model (Salton et al., 75); Prob. distr. model (Wong & Yao, 89); Regression model (Fuhr, 89); Learning to Rank (Joachims, 02; Burges et al., 05)
• Generative models — doc generation: Classical prob. model (Robertson & Sparck Jones, 76); query generation: LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
• Different inference system: Prob. concept space model (Wong & Yao, 95); Inference network model (Turtle & Croft, 91)
1. Vector Space Models
The Basic Question
Given a query, how do we know if
document A is more relevant than B?
Vector Space Model
• Represent a doc/query by a term vector
– Term: basic concept, e.g., word or phrase
– Each term defines one dimension
– N terms define a high-dimensional space
– Element of vector corresponds to term weight
– E.g., d=(x1,…,xN), xi is “importance” of term i
• Measure relevance by the distance between the
query vector and document vector in the vector
space
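A minimal sketch of this representation, assuming a toy three-term vocabulary and raw term counts as the weights (both illustrative choices — the model itself does not prescribe either):

```python
import re
from collections import Counter

def term_vector(text, vocabulary):
    """Represent text as a vector of raw term counts,
    one dimension per term in the fixed vocabulary."""
    counts = Counter(re.findall(r"\w+", text.lower()))
    return [counts[term] for term in vocabulary]

vocab = ["java", "starbucks", "microsoft"]  # toy "basic concepts"
d = term_vector("Java code at Starbucks, more Java", vocab)  # [2, 1, 0]
q = term_vector("java starbucks", vocab)                     # [1, 1, 0]
```

With doc and query in the same space, relevance can then be measured by how close the two vectors are.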
VS Model: illustration
[Figure: documents D1–D11 and a query plotted in a 3-D term space with axes "Starbucks", "Java", and "Microsoft"; a document's relevance is judged by how close its vector lies to the query vector. Some documents (marked "??") are hard to rank by eye.]
What the VS model doesn’t say
• How to define/select the “basic concept”
– Concepts are assumed to be orthogonal
• How to assign weights
– Weight in query indicates importance of term
– Weight in doc indicates how well the term
characterizes the doc
• How to define the similarity/distance measure
What’s a good “basic concept”?
• Orthogonal
– Linearly independent basis vectors
– “Non-overlapping” in meaning
• No ambiguity
• Weights can be assigned automatically and
hopefully accurately
• Many possibilities: Words, stemmed words,
phrases, “latent concept”, …
• “Bag of words” representation works
“surprisingly” well!
How to Assign Weights?
• Very very important!
• Why weighting
– Query side: Not all terms are equally important
– Doc side: Some terms carry more information about contents
• How?
– Two basic heuristics
• TF (Term Frequency) = Within-doc-frequency
• IDF (Inverse Document Frequency)
TF Weighting
TF Normalization (cont.)
[Figure: normalized TF plotted as a function of raw TF]
"Pivoted length normalization": use the average doc length to regularize normalization:
  normalizer = 1 − b + b * doclen/avgdoclen, where b varies from 0 to 1

IDF Weighting
  IDF(t) = 1 + log(n/k)
  – n = total number of docs in the collection
  – k = doc frequency of t (number of docs containing t)
  – maximum value 1 + log(n) is reached at k = 1; the penalization is linear in log(k)
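A sketch of the two heuristics, assuming raw TF divided by the pivoted-length normalizer and the 1 + log(n/k) form of IDF (one common combination; the default b = 0.75 is illustrative, not prescribed):

```python
import math

def pivoted_tf(raw_tf, doclen, avgdoclen, b=0.75):
    """Raw TF regularized by pivoted length normalization.
    The normalizer 1 - b + b * doclen/avgdoclen penalizes long docs
    and rewards short ones; b in [0, 1] controls the strength."""
    return raw_tf / (1 - b + b * doclen / avgdoclen)

def idf(n_docs, doc_freq):
    """IDF(t) = 1 + log(n/k): rare terms (small k) approach 1 + log(n),
    while a term occurring in every doc gets weight 1."""
    return 1 + math.log(n_docs / doc_freq)

# e.g. weight of a term occurring 3 times in a slightly long doc:
w = pivoted_tf(3, 120, 100) * idf(1000, 10)
```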
Empirical distribution of words
Zipf’s Law
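Zipf's law says the frequency of the r-th most frequent word is roughly proportional to 1/r, so rank × frequency stays roughly constant across the vocabulary. A tiny sketch of that rank-frequency computation (on real corpora the product is only approximately flat):

```python
from collections import Counter

def rank_frequency(tokens):
    """Word frequencies by rank; under Zipf's law, freq ~ C/rank,
    so the product rank * freq should be roughly constant."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return [(rank, f, rank * f) for rank, f in enumerate(freqs, start=1)]
```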
How to Measure Similarity?
Let Di = (wi1, ..., wiN) and Q = (wq1, ..., wqN), with w = 0 if a term is absent.

Dot product: sim(Q, Di) = sum_{j=1..N} wqj * wij

Cosine: sim(Q, Di) = [sum_{j=1..N} wqj * wij] / [sqrt(sum_{j=1..N} wqj^2) * sqrt(sum_{j=1..N} wij^2)]
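The two measures can be sketched directly from the definitions above (cosine is the dot product of the length-normalized vectors):

```python
import math

def dot(q, d):
    """Dot product similarity: sum of query-weight * doc-weight per term."""
    return sum(wq * wd for wq, wd in zip(q, d))

def cosine(q, d):
    """Dot product normalized by both vector lengths (0 for a zero vector)."""
    norm = math.sqrt(dot(q, q)) * math.sqrt(dot(d, d))
    return dot(q, d) / norm if norm else 0.0
```

Cosine ranks documents by angle rather than magnitude, so long documents are not favored merely for having larger weights.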
What Works the Best?
Relevance Feedback in VS
• Basic setting: Learn from examples
– Positive examples: docs known to be relevant
– Negative examples: docs known to be non-relevant
– How do you learn from this to improve performance?
• General method: Query modification
– Adding new (weighted) terms
– Adjusting weights of old terms
– Doing both
• The most well-known and effective approach is Rocchio
[Rocchio 1971]
Rocchio Feedback: Illustration
[Figure: relevant (+) and non-relevant (−) documents in the vector space; the original query q is moved toward the centroid of the relevant documents and away from the centroid of the non-relevant documents, yielding the modified query qm.]
Rocchio Feedback: Formula
  qm = α * q + (β/|Dr|) * sum_{d in Dr} d − (γ/|Dn|) * sum_{d in Dn} d
  – qm: new (modified) query; q: original query
  – Dr / Dn: sets of relevant / non-relevant feedback docs
  – α, β, γ: parameters weighting the original query and the two centroids
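The standard Rocchio update (new query = α·original query + β·centroid of relevant docs − γ·centroid of non-relevant docs) can be sketched as follows; the default α, β, γ values are illustrative, not prescribed:

```python
def rocchio(q, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """qm = alpha*q + beta*centroid(rel) - gamma*centroid(nonrel),
    all vectors over the same term dimensions."""
    n = len(q)
    new_q = [alpha * w for w in q]
    if rel_docs:  # add the centroid of relevant docs
        for j in range(n):
            new_q[j] += beta * sum(d[j] for d in rel_docs) / len(rel_docs)
    if nonrel_docs:  # subtract the centroid of non-relevant docs
        for j in range(n):
            new_q[j] -= gamma * sum(d[j] for d in nonrel_docs) / len(nonrel_docs)
    # common practice: clip negative term weights to zero
    return [max(w, 0.0) for w in new_q]
```

Terms frequent in relevant docs gain weight (query expansion), while terms frequent in non-relevant docs are suppressed.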
Rocchio in Practice
• How can we optimize the parameters?
• Can it be used for both relevance feedback and pseudo
feedback?
• How does Rocchio feedback affect the efficiency of
scoring documents? How can we improve the
efficiency?
Advantages of VS Model
What You Should Know
• Basic idea of the vector space model
• TF-IDF weighting
• Pivoted length normalization (read [Singhal et
al. 1996] to know more)
• BM25/Okapi retrieval function (particularly TF
weighting)
• How Rocchio feedback works