Multimedia Retrieval
Table of Contents

Evaluation
    Boolean Retrieval
    Retrieval with Ordering of Documents
    Utility Measure
Text Retrieval
    Feature Extraction
        Elimination of structure
        Elimination of frequent/infrequent terms
        Mapping text to terms
        Reduction of terms to their stems
        Mapping to index terms
    Models for Text Retrieval
        Boolean Retrieval
        Fuzzy Retrieval
        Vector Space Retrieval
        Probabilistic Retrieval
        Latent Semantic Indexing
        Summary
    Indexing Structures
        Inverted Lists
        Signatures
Web Retrieval
    Ordering of Documents
    Content-based Retrieval
        Architecture of a Search Engine
Image, Audio and Video Retrieval
    How similarity search works
    Primary information for multimedia documents
        Raw Data
        Meta Data (MPEG-7)
    Simple Features for Images
Evaluation
Since we cannot generalize that one retrieval method is always better than another, we need to run benchmarks to find out which method is best for a specific task. A benchmark evaluation consists of the following steps:
- Selection of a collection and definition of queries
- Each competitor evaluates the queries against the collection and returns a list of answers to the coordinator
- For each query, the answers of all competitors define the collection to be assessed for this query
- Each competitor assesses the relevance for some of the queries and their associated collections
- Given all relevance assessments, the quality of the search engines is determined
Boolean Retrieval
In Boolean Retrieval the retrieved documents are not ordered; we only distinguish between relevant documents (the ones that are retrieved) and non-relevant ones. Our most important evaluation measures are:
- Precision
- Recall
- Fallout
- Total Recall
- F-Measure: combines Precision and Recall via a parameter β (β = 0: only Precision; β → ∞: only Recall). A typical value is β = 1.
Another measure is the R-Precision: the precision after having retrieved |R_Q| documents (R_Q being the set of relevant documents for query Q).
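For reference, the standard definitions (writing R_Q for the relevant and F_Q for the retrieved documents of query Q; this notation is assumed here):

$$\text{Precision} = \frac{|F_Q \cap R_Q|}{|F_Q|}, \qquad \text{Recall} = \frac{|F_Q \cap R_Q|}{|R_Q|}, \qquad \text{Fallout} = \frac{|F_Q \setminus R_Q|}{|\overline{R_Q}|}$$

$$F_\beta = \frac{(1+\beta^2)\cdot\text{Precision}\cdot\text{Recall}}{\beta^2\cdot\text{Precision}+\text{Recall}}$$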
Utility Measure
This method is an alternative to precision/recall that can be used if we have information about user preferences on pairs of documents. We define the user preferences as a set of ordered pairs:

$$PF = \{(d_i, d_j) \mid \text{the user prefers } d_i \text{ over } d_j\}$$

Similarly, we can define the orderings of documents produced by the retrieval systems A and B as such sets; comparing them with the user preferences yields, for each preference pair i, a value x_i (for system A) and y_i (for system B). With these values we can now compute the utility measure u_{A,B}:
1. Compute the differences y_i - x_i and remove values of 0 (k_0 values remain)
2. Order the values increasingly by their absolute value. If several values have the same absolute value, assign them an average rank (e.g. 2nd and 3rd → 2.5)
3. Compute w^+ as the sum of the ranks of the positive values
4. Finally, determine
$$u_{A,B} = w^+ - \frac{k_0(k_0+1)}{4}$$
with $\frac{k_0(k_0+1)}{4}$ being the expected value of $w^+$ under a random ordering.
We can also compute how probable it is that u_{A,B} is the result of a random experiment. We assume u_{A,B} is normally distributed; thus we can say

$$z = \frac{u_{A,B}}{\sigma} \qquad \text{with} \qquad \sigma^2 = \frac{k_0(k_0+1)(2k_0+1)}{24}$$
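A minimal Python sketch of this procedure (the per-pair scores x_i, y_i of systems A and B are assumed as input):

```python
import math

def utility(xs, ys):
    """Signed-rank utility u_{A,B} and its z-score under the null hypothesis."""
    diffs = [y - x for x, y in zip(xs, ys) if y != x]   # step 1: drop zero differences
    k0 = len(diffs)
    # step 2: rank by absolute value, averaging the ranks of ties
    order = sorted(range(k0), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * k0
    i = 0
    while i < k0:
        j = i
        while j + 1 < k0 and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1              # average of ranks i+1 .. j+1
        for p in range(i, j + 1):
            ranks[order[p]] = avg
        i = j + 1
    # step 3: sum of ranks of the positive differences
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # step 4: center by the expected value under a random ordering
    u = w_plus - k0 * (k0 + 1) / 4
    sigma = math.sqrt(k0 * (k0 + 1) * (2 * k0 + 1) / 24)
    return u, (u / sigma if sigma > 0 else 0.0)

print(utility([0.2, 0.5, 0.4, 0.9], [0.3, 0.4, 0.8, 0.9]))
```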
Text Retrieval
Feature Extraction
Feature extraction transforms a document into a set of terms (e.g. words). We distinguish 5 steps (a toy sketch of the whole pipeline follows the list):
1. Elimination of structure
2. Elimination of frequent/infrequent terms (stop words)
3. Mapping text to terms (without punctuation)
4. Reduction of terms to their stems (stemming, syllable division)
5. Mapping to index terms
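As a rough illustration, here is a toy Python sketch of the five steps; the stop word list, the regex-based tag stripping, and the naive suffix-stripping "stemmer" are stand-ins, not the methods from the lecture:

```python
import re

STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in", "were"}  # toy list

def extract_terms(html: str) -> list[str]:
    text = re.sub(r"<[^>]+>", " ", html)                     # 1. eliminate structure
    tokens = re.findall(r"[a-z]+", text.lower())             # 3. map text to terms
    tokens = [t for t in tokens if t not in STOP_WORDS]      # 2. drop frequent terms
    stems = [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]  # 4. naive stemming
    return sorted(set(stems))                                # 5. map to index terms

print(extract_terms("<p>The dogs were barking loudly</p>"))  # ['bark', 'dog', 'loudly']
```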
Elimination of structure
HTML, for example, provides a lot of meta information. Most search engines today therefore distinguish different areas in the document:
- Title
- Remaining header (meta keywords)
- Main body
  o text between markups
  o special attributes of selected tags (like the alt attribute of img)
- URL/links
- (Comments)
For XML we could consider the XML structure in our query, e.g. when searching for names, only look at the elements/attributes that have to do with names. But this requires knowing the data structure of the XML.
Elimination of frequent/infrequent terms
We don't want to index terms that occur very often (a, the) or very infrequently. The theoretical solution would be to index only terms that have proven useful in past searches, but this would require a feedback mechanism. Thus we use a more pragmatic solution: term frequencies follow the Zipfian distribution.
N: number of term occurrences in the collection
M: number of distinct terms in the collection
n_t: number of occurrences of term t
r_t: rank of term t when ordered by number of occurrences
p_r: probability that a random term in a document has rank r (= n_t/N)
The central theorem (Zipf's law) is now

$$r \cdot p_r = c \quad (\text{const})$$

where the constant c depends on the number of distinct terms:

$$c \approx \frac{1}{\ln(1.78\,M)}$$
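For example, with M = 100,000 distinct terms we get c ≈ 1/ln(1.78·10^5) ≈ 0.083, i.e. the most frequent term (rank 1) alone accounts for roughly 8% of all term occurrences. This is exactly the kind of term we do not want to index.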
Discrimination Method
We introduce a similarity measure sim(d_i, d_j) with values between 0 (= dissimilar) and 1 (= identical). Documents are described as vectors where component k denotes the occurrences of the k-th term. The density of a collection is defined as the sum of the similarities between all documents and a centroid C (which contains all M terms with their average frequency, C_i = n_{t_i}/N):

$$Q = \sum_j sim(C, d_j)$$

We define the discrimination value DW_t of a term t as the difference between the density Q_t of the collection without the term and the density Q with the term, DW_t = Q_t - Q:
DW_t > 0 → t is a good descriptor (removing it makes the documents more similar)
DW_t ≈ 0 → t carries little semantics to distinguish documents
DW_t < 0 → t is a bad descriptor
We assume that terms occur independently of each other in documents; however, this is not true in general.
Probabilistic Retrieval
We estimate the probability that, given a query Q, a document D is considered relevant, assuming this probability depends only on the query and the document collection. Documents are ranked by their odds of relevance:

$$sim(D, Q) = \frac{P(R \mid D)}{P(NR \mid D)}$$

These probabilities can be rewritten with the Bayes theorem:

$$P(R \mid D) = \frac{P(D \mid R)\,P(R)}{P(D)}, \qquad P(NR \mid D) = \frac{P(D \mid NR)\,P(NR)}{P(D)}$$

The Binary Independence Retrieval (BIR) model simplifies the computation by considering only binary representations of the document (and query). More precisely: document D is characterized by a binary vector x with x_i = 1 if term T_i occurs in D, so P(D | R) = P(x | R). Assuming terms occur independently of each other, the probability P(x | R) is given by

$$P(x \mid R) = \prod_i r_i^{x_i} (1 - r_i)^{1 - x_i}$$

with $r_i = P(x_i = 1 \mid R)$ being the probability that term T_i occurs in a randomly chosen relevant document. P(x | NR) can be described similarly; we shorten $P(x_i = 1 \mid R)$ to $r_i$ and $P(x_i = 1 \mid NR)$ to $n_i$. If we now assume that $r_i = n_i$ if T_i does not occur in the query, all non-query terms cancel and we get

$$sim(D, Q) = \frac{P(R)}{P(NR)} \cdot \prod_{i:\, x_i = q_i = 1} \frac{r_i (1 - n_i)}{n_i (1 - r_i)} \cdot \prod_{i:\, q_i = 1} \frac{1 - r_i}{1 - n_i}$$

Since we are not interested in absolute values, we can eliminate all constant factors that don't depend on the current document and take the logarithm:

$$rsv(D, Q) = \sum_{i:\, x_i = q_i = 1} c_i \qquad \text{with} \qquad c_i = \log \frac{r_i (1 - n_i)}{n_i (1 - r_i)}$$

If the c_i-values are given, search engines can efficiently compute the RSV with inverted files.
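A minimal sketch of BIR scoring in Python; the initialization r_i = 0.5 and the smoothed n_i ≈ df_i/N are the usual first-round assumptions without relevance feedback, not prescribed by the notes:

```python
import math

def bir_weights(df: dict[str, int], n_docs: int) -> dict[str, float]:
    """c_i = log( r_i (1 - n_i) / (n_i (1 - r_i)) ), with r_i = 0.5 initially."""
    c = {}
    for term, d in df.items():
        r_i = 0.5                        # no relevance information yet
        n_i = (d + 0.5) / (n_docs + 1)   # smoothed document frequency
        c[term] = math.log(r_i * (1 - n_i) / (n_i * (1 - r_i)))
    return c

def rsv(query_terms: set[str], doc_terms: set[str], c: dict[str, float]) -> float:
    # sum c_i over the terms present in both query and document
    return sum(c[t] for t in query_terms & doc_terms if t in c)

weights = bir_weights({"zipf": 3, "retrieval": 40}, n_docs=100)
print(rsv({"zipf", "retrieval"}, {"zipf", "text"}, weights))
```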
Latent Semantic Indexing
Latent Semantic Indexing (LSI) decomposes the term-document matrix A via singular value decomposition, $A = U S V^T$, and keeps only the k largest singular values, yielding $A_k = U_k S_k V_k^T$:

$V_k^T$
- columns: reduced representations of the documents
- rows: eigenvectors of the similarity matrix between documents, $A^T A$

$U_k$
- rows: reduced representations of the terms
- columns: eigenvectors of the similarity matrix between terms, $A A^T$
Summary
(Advantages/disadvantages table of the retrieval models; only fragments are recoverable: "efficient implementation feasible with inverted files" as an advantage, and "ranking of retrieved documents" as an advantage of Fuzzy Retrieval.)
Indexing Structures
We will look at two main aspects:
1. Inverted lists for efficient identification of relevant documents in Boolean and Vector Space
Retrieval
2. Support for fuzzy search with the help of n-grams and signatures
Inverted Lists
Implementation alternatives:

Files
- Advantages: simple implementation; good performance; open source implementations available
- Disadvantages: no transactional guarantees

Relational Database
- 3 tables: documents (meta data), terms (idf), inverted lists (document-term pairs with frequencies)
- Advantages: good performance
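A minimal in-memory sketch in Python (a dict of postings rather than one of the two storage layouts above):

```python
from collections import defaultdict

def build_inverted_lists(docs: dict[int, list[str]]) -> dict[str, dict[int, int]]:
    """term -> {doc_id: term frequency}"""
    index: dict[str, dict[int, int]] = defaultdict(dict)
    for doc_id, terms in docs.items():
        for t in terms:
            index[t][doc_id] = index[t].get(doc_id, 0) + 1
    return index

def boolean_and(index, terms):
    """Intersect the postings of all query terms (Boolean retrieval)."""
    postings = [set(index.get(t, {})) for t in terms]
    return set.intersection(*postings) if postings else set()

idx = build_inverted_lists({1: ["text", "retrieval"], 2: ["web", "retrieval"]})
print(boolean_and(idx, ["text", "retrieval"]))   # {1}
```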
Signatures
The idea is that if the encoded bit strings of two documents are similar, it is likely that they stem from similar feature sets, which means the two documents are similar. But since we compress the features, two totally different text documents may lead to the same signature → false hit.
Thus, signatures often serve as a pre-step to filter out candidates. These candidates must then be compared to the query with an rsv-approach to eliminate the false hits. We can also just order the candidates by rsv-value without eliminating false hits, or omit the rsv-step entirely (requires a small number of false hits).
Computation of Signatures
l: length of the signature in bits
N: number of documents
S_Q: signature of the query
F: error rate
g: weight of the signature for a term (how many bits are set)
The general idea is that we map each feature to a signature (if possible, all with the same weight g). Then we combine those signatures. The query is also mapped to a signature, and all documents whose signature is similar to the query signature are returned.
Mapping features to signatures
We can have different kinds of features:
words/phrases
o Approach 1: associate each word/phrase with sig of weight g (in even distribution)
o Approach 2: extract n-grams and superimpose their signatures (leads to different g).
If we want same g, only use the g most frequent n-grams
n-grams
o define mapping of text to n-grams
o define hash function to obtain a bit-position for each n-gram, set that bit
o signature of the text is superimposed coding of all signatures of its n-grams
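A small Python sketch of the n-gram approach; signature length l, the n-gram size, and the MD5-based bit position are arbitrary choices:

```python
import hashlib

def ngrams(text: str, n: int = 3) -> set[str]:
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def signature(text: str, l: int = 64) -> int:
    """Superimpose (OR) one bit per n-gram; the bit position comes from a hash."""
    sig = 0
    for g in ngrams(text):
        pos = int(hashlib.md5(g.encode()).hexdigest(), 16) % l
        sig |= 1 << pos
    return sig

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

s1, s2 = signature("multimedia retrieval"), signature("multimedia retrival")
print(hamming(s1, s2))  # small distance -> candidate; must still be checked (false hits!)
```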
In general you get the best signature potential if g = l/2: the number of distinct signatures of length l with g bits set is $SP = \binom{l}{g}$, which is maximal for g = l/2. If we have more terms than signatures (M > SP), we will have collisions (on average M/SP terms share the same signature). To limit this error rate, one can determine the required SP beforehand and thus the length of the signatures.
We have different methods of combining the signatures of text parts.

Similarity Search
1. Hamming Distance: the number of bit positions in which two signatures differ
2. Cover: the document signature must cover the query signature, i.e. all bits set in S_Q are also set in the document's signature
Indexing Structures
- Sequential Organization
- Bit-sliced Organization (Vertical Partitioning): for each of the l bits, an individual file stores the contents of the collection → for the cover distance we only need to read the bit files for which the query bit is set
- Horizontal Partitioning: similar signatures are grouped in separate files (buckets)
  1. Identify the groups which may contain candidates
  2. Read the buckets and identify candidates
  3. Read in the candidates and distinguish between hits and false hits
  For this we need a hash function on signatures (simple example: the first k bits of the signature)
- Signature Tree (S-Tree): groups are represented by a superimposed block signature of all their members, which are again treated as signatures (recursion) → signature tree
Web Retrieval
The main problems with the web are its sheer size and the quality of the data. Even a typical small query returns thousands to millions of pages containing the keywords; however, not all of them are relevant. Some naive approaches might be to retrieve only pages which:
- contain the terms with the same frequency as in the query
- contain the query terms most often
- contain all query terms
But these methods lack a mechanism against spamming!
Ordering of Documents
Ranking already starts with extracting the right information from the documents (e.g. Google):
- positions of terms in a document
- the relative font size
- the visual attributes (bold, italic)
- the context of the page (term in URL, title, meta tags, text of references)
The ranking then consists of the following factors:
- Proximity of terms: for each position pair of query terms, a proximity value is assigned; the frequencies of these proximity values form the proximity vector. Multiplying this vector with a weighting vector w leads to the overall proximity value.
- PageRank: the importance of a page A follows recursively from the importance of the pages linking to it:
$$PR(A) = (1 - d) + d \cdot \sum_{B \in L(A)} \frac{PR(B)}{N(B)}$$
L(A): set of pages that link to A
N(A): number of outgoing links of page A
(d: damping factor)
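A minimal power-iteration sketch of this recursion (the damping factor d = 0.85 and the iteration count are conventional choices):

```python
def pagerank(links: dict[str, list[str]], d: float = 0.85, iters: int = 50) -> dict[str, float]:
    pages = list(links)
    pr = {p: 1.0 for p in pages}              # initial ranks
    for _ in range(iters):
        new = {p: 1.0 - d for p in pages}     # the (1 - d) base rank
        for p, outs in links.items():
            for q in outs:                    # p distributes its rank over its out-links
                new[q] = new.get(q, 1.0 - d) + d * pr[p] / len(outs)
        pr = new
    return pr

print(pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))
```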
- Hub: the page contains many links to pages that are relevant to the query
- Authority: the page is relevant for Q (mostly this also means it is linked by many hubs)
What's Related
(Google: similar pages)
Alexa used crawlers and data mining tools. Additionally, it spied on the surf patterns of its users.
Google relies on the analysis of the link structure of web pages. There are two approaches published:
1. Companion Algorithm
based on the extended HITS algorithm
1. Build a directed graph in the neighborhood of u (the given URL). The numbers of parent (b), child (f) and sibling (bf, fb) pages are limited.
2. Merge duplicates (or near-duplicates) if
   o they contain more than 10 links, and
   o 95% of the contained links appear in both documents
3. If k edges from documents in the same domain point to the same external page, weight them by 1/k. If a document contains l edges to pages within the same domain, weight these edges by 1/l.
4. Determine hubs and authorities according to the extended HITS algorithm (without similarity weighting).
5. Determine the result: the pages with the highest authority weights are the related pages.
2. Co-citation Algorithm
counts how often pages are referenced together with page u
1. Determine at most b parent pages of u.
2. For each parent page, determine at most bf child pages whose links are close to the link to u → siblings of u.
3. Determine the pages q_i that are referenced most frequently together with u.
4. If steps 1-3 result in fewer than 15 co-citations with frequency 2 or higher, repeat the search with prefixes of the URL of u.
Architecture of a Search Engine
A search engine consists of the following main components:
Crawler/Repository
Feature Extractor
Indexer
Sorter
Feedback Component
User Interface
Image, Audio and Video Retrieval
Lp-Norm
$$L_p(x, y) = \left( \sum_{i=1}^{d} |x_i - y_i|^p \right)^{1/p}$$
Text features: recall Chapter 2 (term vectors, LSI vectors, ...)
Linking features: recall Chapter 3 (authority/hubs, PageRank)
Image descriptors
based on signal information
o local features (of a dynamic or static region) vs. global features
o features can be constructed invariant to rotations and translations, but this would not allow us to distinguish between top and bottom
o feature selection determines the precision of retrieval (more features → better results) but also the performance (more features → worse performance)
Color Histograms
A color histogram is defined by:
- selection of a color space (RGB, L*a*b*, LCH)
- selection of reference colors
  o as many colors as possible to obtain good descriptors
  o as few colors as possible to benefit from fast retrieval
- definition of a distance measure, i.e. a similarity measure
  o preferably one that takes visual similarity into account
In the end, a histogram should be normalized → divide by the number of pixels.
A similar histogram doesn't imply similar images per se!
(Plus: a dissimilar histogram doesn't imply dissimilar images.)
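A sketch in Python/numpy; the joint RGB binning with 8 levels per channel and the L1 distance are illustrative choices (L1 does not take visual similarity into account):

```python
import numpy as np

def color_histogram(img: np.ndarray, bins: int = 8) -> np.ndarray:
    """img: HxWx3 RGB array. Returns a normalized joint histogram with bins^3 cells."""
    quant = (img // (256 // bins)).reshape(-1, 3)            # quantize each channel
    cells = quant[:, 0] * bins * bins + quant[:, 1] * bins + quant[:, 2]
    hist = np.bincount(cells, minlength=bins ** 3).astype(float)
    return hist / hist.sum()                                  # normalize by #pixels

def l1_distance(h1: np.ndarray, h2: np.ndarray) -> float:
    return float(np.abs(h1 - h2).sum())

img = np.random.randint(0, 256, (32, 32, 3))
print(l1_distance(color_histogram(img), color_histogram(img)))  # 0.0
```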
Color Moments
Color moments construct an uncorrelated feature vector: statistical moments (e.g. mean, variance, skewness) are computed along the different color dimensions, for example for the color space L*a*b*.
Shot Detection
Histogram Comparison: the difference between two subsequent frames is measured as the distance between their histograms, e.g.
$$d(f_t, f_{t+1}) = \sum_i \lvert h_t(i) - h_{t+1}(i) \rvert$$
We need to choose the right threshold to distinguish between cut and no cut. The optimal threshold minimizes false hits and false drops.
Problems with this approach are soft cuts, quickly moving objects, and fast camera movements.
Soft cuts (fade-out/fade-in, dissolve, ...)
The basic idea here is a twin threshold:
- the first threshold (tc) detects hard cuts
- the second threshold (ts) detects the beginning of a potential soft cut
The frame where ts is exceeded serves as reference frame for the following frames. If the distance to this reference frame then exceeds tc, it is a soft cut; if it remains long enough under tc, we assume there is no cut.
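A simplified sketch of this twin-threshold idea on a sequence of per-frame histogram distances; the thresholds and the reset condition are assumptions:

```python
def detect_cuts(frames, dist, tc=0.5, ts=0.1):
    """frames: list of histograms; dist: distance function; returns (index, kind) cuts."""
    cuts, ref = [], None               # ref = reference frame of a potential soft cut
    for i in range(1, len(frames)):
        d = dist(frames[i - 1], frames[i])
        if d > tc:                     # hard cut: consecutive frames differ strongly
            cuts.append((i, "hard"))
            ref = None
        elif d > ts and ref is None:   # potential soft cut starts: remember reference
            ref = i - 1
        elif ref is not None:
            if dist(frames[ref], frames[i]) > tc:   # accumulated change exceeds tc
                cuts.append((i, "soft"))
                ref = None
            elif d < ts:               # change died down: no cut after all
                ref = None
    return cuts

# toy "histograms" as plain numbers, distance = absolute difference
print(detect_cuts([0, 0, .2, .4, .9, .9], dist=lambda a, b: abs(a - b)))  # [(4, 'soft')]
```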
Another approach is to model soft cuts explicitly: e.g. fade-out/fade-in has a very characteristic profile of changes in the intensity histogram.
Key Frame Extraction
We could just extract an arbitrary image:
+ very simple
+ often describes the shot very well (especially if the shot contains no movements and the camera is steady)
- if the camera moves, only a certain perspective is depicted
- movements of objects lead to arbitrary representations
A better approach is to stitch the images together into a panoramic image. The main issue is that the camera plane changes as well → non-linear transformations are required.
Speaker Recognition
allows querying for something somebody has said (e.g. George W. Bush on global warming)
Analysis of the Video Stream
- e.g. the noise level in sports games rises when something important happens
- Movements of objects
  o feasible by comparing the shapes of objects between subsequent frames
  o already available in MPEG-4 (distinguishes between (steady) background and (moving) foreground objects)
- Camera movements
- Time-space relationships
  o can answer queries like "has person A met with person B?"
- Speaker recognition
  o like in audio retrieval, but can be enhanced with face recognition methods and subtitle analysis
Abstract features
This area is not very well understood. The main problem is how to model an event, or emotions. Emotions are even harder since they are very subjective (although some well-known stereotypes exist).
E.g. from a movie soundtrack one could recognize emotions or even locations (certain movies feature themes which are always played at the same locations).
Similarity Search
We already learned how to evaluate simple queries in the previous chapter (remember the distance functions and how to map them to similarity scores). Now we take a deeper look into more complex queries.
Complex Queries
1. Several reference objects
   Determine the k distances to all k reference objects and combine them with a combining function (see the examples below).
2. Concerning features
   a. Different features (including meta data)
      Determine the J distances of an object for the J features and combine them.
      Important: the distances need to be normalized for this (Gaussian normalization). (If the query consists of several reference objects of the same feature, this is not needed, since all distances come from the same feature space.)
   b. Key words
      - key word query returns distances → same as different features
      - key word query returns RSVs → dist = -rsv + const (such that all distances are > 0)
      - key word query filters out objects → same as predicates
3. Predicates (on meta data, or on classifiers)
   If the predicate is fulfilled, pass the similarity score on as it is; otherwise set it to 0.
Examples for distance combining functions:
- maximum (fuzzy logic "and"): $\delta = \max_j \delta_j$
- minimum (fuzzy logic "or"): $\delta = \min_j \delta_j$
- average: $\delta = \frac{1}{J} \sum_j \delta_j$
- weighted average: $\delta = \sum_j w_j \delta_j$ (w: weighting vector)
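A small sketch of these combining functions (distances assumed to be normalized already):

```python
def combine(dists: list[float], mode: str = "avg", w: list[float] | None = None) -> float:
    if mode == "max":    # fuzzy logic 'and'
        return max(dists)
    if mode == "min":    # fuzzy logic 'or'
        return min(dists)
    if mode == "wavg":   # weighted average with weighting vector w
        return sum(wi * di for wi, di in zip(w, dists)) / sum(w)
    return sum(dists) / len(dists)   # plain average

print(combine([0.2, 0.6], "max"), combine([0.2, 0.6], "wavg", [2, 1]))
```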
2a) Different features: Weighting
Weights can be applied to the distance and RSV functions. The weights are directly applied to the components of the vectors, e.g. for a weighted Euclidean distance:
$$\delta(x, y) = \sqrt{\sum_i w_i (x_i - y_i)^2}$$
Pyramid Tree
- an index optimized for the L∞ norm
- The data space is divided into pyramids (base equal to a facet of the data space, apex at the center of the data space). Each pyramid is further divided along the line from apex to base. The shells obtained this way are enumerated → data points can be mapped into a one-dimensional space.

P-Sphere Tree
- a good alternative for vector spaces with 10 to 100 dimensions
- Two-level structure: the root contains a set of spherical MBRs; each MBR subsumes a set of points which are stored in the leaf nodes.
- A point can lie in several leaf nodes (redundancy) → NN queries can be answered with 1 lookup.
Overview of multidimensional index structures (advantages/disadvantages table; only partially recoverable):
- Quadtree: simple implementation
- k-d-tree: simple implementation; good performance in 2d (if the tree is balanced)
- Gridfile
- Space-Filling Curves: simple implementation
- Voronoi Diagrams
- R-Tree
Assumptions for the following analysis:
- closed data space in the shape of a hypercube
- data points uniformly distributed in this data space
- probability that a data point lies inside a subspace = volume of that subspace
Peculiarities:
1. Bad intuition for high-dimensional spaces
A circle that fills a 2-dimensional space quite well does not necessarily do so in higher dimensions (given the same center and radius).
2. Partitioning becomes meaningless in higher dimensions
If we split each axis of a d-dimensional space into two parts, we get 2^d cells with on average N/2^d points per cell → most cells will be empty.
3. Where are all the data points?
If we take a hypercube with side length 0.95, in the 2-dimensional space it fills a big percentage of the space; in a high-dimensional space, however, its volume is almost zero (0.95^d ≈ 0.0059 for d = 100). We may also state that all points are close to the boundary of the data space, not in the center. The probability that a point lies close to the boundary (within distance ε) is
$$P = 1 - (1 - 2\varepsilon)^d$$
With e.g. ε = 0.05 and d = 100, this probability is (almost) 1!
4. The nearest neighbor is always far away
If we put a circle with r = 0.5 in the center of a 2-dimensional space, we can safely assume that the nearest neighbor lies within the circle. This does not hold in higher dimensions: for d = 10 this hypersphere has a volume of only 0.002; with d = 100 the volume is only 1.9·10⁻⁷⁰.
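These numbers can be checked directly (sphere volume via the standard Γ-formula):

```python
import math

d = 100
print(0.95 ** d)                # ~0.0059: the 0.95-cube is almost empty
print(1 - (1 - 2 * 0.05) ** d)  # ~1: with eps = 0.05, nearly every point is near the boundary

# volume of a hypersphere with radius 0.5 in 10 dimensions
r, d10 = 0.5, 10
print(math.pi ** (d10 / 2) / math.gamma(d10 / 2 + 1) * r ** d10)  # ~0.0025
```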
Cost Model for NN-Search (using the HS algorithm)
Expected NN-distance
For a given query point and radius r, determine the probability that the NN lies within the sphere around the point with radius r. This defines a distribution function over the random variable r; the expected value of r corresponds to the expected NN-distance for the query point. The expected NN-distance for an arbitrary point in the data space is then the mean of the expected NN-distances over all points in the data space.
Cost model
Compare the sequential scan of the whole collection against a tree-based search that reads a fraction v of the leaf nodes with random accesses (≈ 7 ms average random disk access time), while the scan reads its blocks sequentially at a much lower cost per block. The tree search is faster only if v is small enough to offset the more expensive random accesses. If we assume realistic leaf node sizes and fill levels, v must be considerably smaller than 20% (e.g. v < 5%).
Assume that during the splits, d' of the d axes have been split in the middle. If we define $l_{max} = \sqrt{d'}/2$ as the maximum distance between a point in the space and a leaf node, we see that e.g. for d = 100, d' = 15 and N = 10^6, l_max is smaller than the NN-distance → each query point is closer to every leaf node than to its NN → for each query, all leaves are visited!
If we consider arbitrarily shaped MBRs: estimate the NN-distance and extend the MBRs by it → the probability that a leaf node is visited is the percentage of points that lie inside the extended MBR.
VA-File
Computation of approximations
A (not necessarily regular) grid (defined by mark positions) divides the data space into rectangular cells. Along each axis i, 2^{b_i} slices exist (b_i = number of bits assigned to that axis). The individual slices are represented with b_i bits and are enumerated in increasing order → for each cell we get a unique bit string.
For a query, the approximation file is read sequentially, and for all bit strings that have to be inspected more closely, the whole vector needs to be read (random access in the vector file).
Determining the mark positions:
They should be chosen such that each slice contains about the same number of points. We compute the distribution function along each axis and determine the marks such that the integral over each slice yields about the same number of points.
Determining the number of bits:
usually 4 or 8 bits per dimension to obtain a good compromise between space and efficiency
We can also compute a lower and an upper bound of the distance between a point in the VA-file and a query. With this information we can filter out vectors, e.g. for NN-search: if $uBnd(p', q) < lBnd(p, q)$ for some other point p', then p cannot be the NN.
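A sketch of the approximation and the distance bounds for a regular grid (b bits per dimension; irregular mark positions would work the same way, just with per-slice boundaries):

```python
import numpy as np

def approximate(points: np.ndarray, bits: int = 4) -> np.ndarray:
    """Map each coordinate in [0,1) to its slice number (the bit string per cell)."""
    return np.minimum((points * (1 << bits)).astype(int), (1 << bits) - 1)

def bounds(cell: np.ndarray, q: np.ndarray, bits: int = 4) -> tuple[float, float]:
    """Lower/upper bound of the L2 distance between q and any point in the cell."""
    w = 1.0 / (1 << bits)
    lo, hi = cell * w, (cell + 1) * w                       # cell boundaries per axis
    near = np.clip(q, lo, hi)                               # closest possible point
    far = np.where(np.abs(q - lo) > np.abs(q - hi), lo, hi) # farthest corner
    return float(np.linalg.norm(q - near)), float(np.linalg.norm(q - far))

pts = np.random.rand(5, 2)
q = np.array([0.5, 0.5])
for cell in approximate(pts):
    print(cell, bounds(cell, q))   # (lBnd, uBnd) per approximated point
```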
GeVAS treats predicates, like features, as an N-dimensional array (given N objects), where arr_i = 1 means the i-th object fulfills the predicate. First the predicate is checked; if it is true, the bounds are computed, otherwise the object is discarded. To further reduce the search costs, predicate evaluation and VA-file similarity computation can be parallelized (requires synchronization).
- Textual sub-queries: Boolean queries are predicates, signatures are approximations, and VSR/PR require an array filled with RSV values (much like the array for predicate values)
- Several reference objects: evaluated as proposed by Ciaccia
- Retrieval over several features: as described above
Complex queries with the VA-File
- Advantages: simple implementation
- Disadvantages: the search is no longer as selective as with just one reference object