
CS342 Multimedia Retrieval

HS2012 University of Basel

Multimedia Retrieval
Table of Contents
Evaluation
   Boolean Retrieval
   Retrieval with Ordering of Documents
   Utility Measure
Text Retrieval
   Feature Extraction
      Elimination of structure
      Elimination of frequent/infrequent terms
      Mapping text to terms
      Reduction of terms to their stems
      Mapping to index terms
   Models for Text retrieval
      Boolean Retrieval
      Fuzzy Retrieval
      Vector Space Retrieval
      Probabilistic Retrieval
      Latent Semantic Indexing
      Summary
   Indexing Structures
      Inverted Lists
      Signatures
Web Retrieval
   Ordering of Documents
   Content based Retrieval
      Architecture of a Search Engine
Image, Audio and Video Retrieval
   How similarity search works
   Primary information for multimedia documents
      Raw Data
      Meta Data (MPEG-7)
   Simple Features for Images
      Color Histograms
      Color Moments
      Texture Moments
      Shape Features
      Regions and Blobs
      Invariances of Features
   Simple Features for Audio Data
      Search with Acoustical Features
      Search for Pitches and Tunes
      Search in spoken text
   Simple Features for Video Data
      Shot Detection
      Key Frame Extraction
   Logical and Abstract Features
Similarity Search
   Complex Queries
   Index Structure and Algorithms
      NN-Search in Hierarchical Structures
   The Curse of high Dimensionality
      Cost Model for NN-Search (using HS-algorithm)
   Vector Approximation File
   Evaluation of Complex Queries
      Several reference objects (Ciaccia 1998)
      Middleware Approach (Fagin 1996)
      Predicates and Similarity Search (Chaudhuri 1996)
      Complex queries with the VA-File (Weber 2001)
      Comparison

Evaluation
Since we cannot generalize that one retrieval method is always better than another one, we need to
run benchmarks in order to find out which method is best for a specific task. Such a benchmark
involves the following steps:
- Selection of a collection and definition of queries
- Each competitor evaluates the queries against the collection and returns a list of answers to the coordinator
- For each query, the answers of all competitors define the collection to be assessed for this query
- Each competitor has to assess the relevance for some of the queries and their associated collection
- Given all relevance assessments, the quality of the search engines is determined

Boolean Retrieval
In Boolean retrieval the retrieved documents are not ordered; we only have the distinction between
retrieved (considered relevant by the system) and non-retrieved documents. The most important
evaluation measures are:

- Precision: $p = \frac{|\text{retrieved} \cap \text{relevant}|}{|\text{retrieved}|}$ (retrieved and relevant documents divided by all retrieved documents)
- Recall: $r = \frac{|\text{retrieved} \cap \text{relevant}|}{|\text{relevant}|}$ (retrieved and relevant documents divided by all relevant documents)
- Fallout: $f = \frac{|\text{retrieved} \setminus \text{relevant}|}{|\text{non-relevant}|}$ (retrieved but not relevant documents divided by all non-relevant documents)
- Total Recall: how many relevant documents are there in the collection?
- F-Measure: $F_\beta = \frac{(1+\beta^2)\,p\,r}{\beta^2\,p + r}$ (the larger the value, the better). Combines Precision and Recall ($\beta = 0$: only Precision, $\beta \to \infty$: only Recall). A typical value is $\beta = 1$.

Averaging the Results
- Macro evaluation: average the individual p_i and r_i over all queries
- Micro evaluation: sum up the numerators and denominators over all queries first, then divide
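
For illustration, a minimal Python sketch of these set-based measures (all function and variable names are ours, not from the lecture):

```python
def evaluate(retrieved, relevant, collection_size, beta=1.0):
    """Set-based evaluation measures for Boolean retrieval (no ranking)."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant                                    # retrieved and relevant
    p = len(hits) / len(retrieved) if retrieved else 0.0           # precision
    r = len(hits) / len(relevant) if relevant else 0.0             # recall
    non_relevant = collection_size - len(relevant)
    fallout = len(retrieved - relevant) / non_relevant if non_relevant else 0.0
    # F-measure: beta = 0 -> precision only, beta -> infinity -> recall only
    f = (1 + beta**2) * p * r / (beta**2 * p + r) if (p + r) > 0 else 0.0
    return p, r, fallout, f

# Example: 10 documents retrieved, relevant documents are ids 8..27
print(evaluate(range(10), range(8, 28), collection_size=1000))
```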

Retrieval with Ordering of Documents


For each prefix of the result list (e.g., the first 5 documents) we compute a precision-recall pair.
These pairs are plotted in a precision-recall diagram.
Interpretation:
- Close to (r=0, p=1) (good for a student): the documents we find are relevant, but we don't find all relevant ones
- Close to (r=1, p=0) (good for a patent worker): we find all relevant documents, but also a lot of non-relevant ones
In general we would prefer a system that is as close as possible to the point (1,1). Thus we can say
that system A is better than system B if its precision-recall curve lies closer to (1,1), e.g., if the
area under A's curve is bigger (system efficiency).

Another measure is the R-Precision: the precision after having retrieved |R_Q| documents, where R_Q is the set of relevant documents for query Q.

Averaging the Results


The goal here is to get an average P-R plot over several queries. There are 2 approaches:
1. For all queries determine pi and ri for a specific i (e.g. 10) and make a P-R plot for those
values
2. For all queries determine precision values for a fixed recall value. Average this precision value
and repeat for several fixed recall values.

Utility Measure
This method is an alternative to precision/recall that can be used if we have information about user
preferences on pairs of documents. The user preferences are given as a set of pairs (d, d') where the
user prefers d over d'. Similarly, the orderings produced by the retrieval systems A and B define such
sets of document pairs. For each query Q_i we then determine two values: x_i and y_i, the number of
user preference pairs that are respected by the ordering of system A and of system B, respectively.

With these values we can now compute the utility measure $u_{A,B}$:
1. Compute the differences $y_i - x_i$ and remove values of 0 ($k_0$ values remain)
2. Order the values increasingly by their absolute value. If several values have the same absolute value, assign them an average rank (e.g., 2nd and 3rd → 2.5)
3. Compute $w^+$ as the sum of the ranks of the positive values
4. Finally, determine $u_{A,B} = w^+ - \frac{k_0(k_0+1)}{4}$
We can also compute how probable it is that $u_{A,B}$ is the result of a random experiment: we assume
$u_{A,B}$ is normally distributed with mean 0 and standard deviation $\sigma = \sqrt{\frac{k_0(k_0+1)(2k_0+1)}{24}}$.
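
A small Python sketch of steps 1-4 under our reading of the definitions above (the tie handling follows the usual Wilcoxon signed-rank convention):

```python
import math

def utility_measure(x, y):
    """Utility measure u_{A,B} from per-query preference counts x_i (system A)
    and y_i (system B); returns (u, sigma)."""
    diffs = [yi - xi for xi, yi in zip(x, y) if yi != xi]     # step 1: drop zeros
    k0 = len(diffs)
    # step 2: rank by absolute value, ties receive the average rank
    order = sorted(range(k0), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * k0
    i = 0
    while i < k0:
        j = i
        while j + 1 < k0 and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1                            # ranks are 1-based
        for idx in order[i:j + 1]:
            ranks[idx] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)    # step 3
    u = w_plus - k0 * (k0 + 1) / 4                            # step 4
    sigma = math.sqrt(k0 * (k0 + 1) * (2 * k0 + 1) / 24)
    return u, sigma

print(utility_measure([3, 5, 2, 4, 6], [5, 5, 4, 3, 9]))
```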

Text Retrieval
Feature Extraction
Feature extraction transforms a document into a set of terms (e.g. words). We distinguish 5 steps:
1. Elimination of structure
2. Elimination of frequent/infrequent terms (stop words)
3. Mapping text to terms (without punctuation)
4. Reduction of terms to their stems (stemming, syllable division)
5. Mapping to index terms

Elimination of structure
For example, HTML provides a lot of meta information. Most search engines today therefore
distinguish different areas in the document:
- Title
- Remaining header (meta keywords)
- Main body
  o text between markups
  o special attributes of selected tags (e.g., the alt attribute of img)
- URL/Links
- (Comments)
For XML we could consider the XML structure in the query, e.g., if we search for names, we only look
at the attributes that have to do with names. But this requires knowing the data structure of the XML.
Elimination of frequent/infrequent terms
We don't want to index terms that are very frequent (a, the) or very infrequent. The theoretical
solution would be to index only terms that have proven to be useful in past queries, but this would
need a feedback mechanism. The pragmatic solution exploits the fact that term frequencies follow a
Zipfian distribution:
- N: number of term occurrences in the collection
- M: number of distinct terms in the collection
- n_t: number of occurrences of term t
- r_t: rank of term t, ordered by the number of occurrences
- p_r: probability that a random term in a document has rank r (= n_t/N)
The central theorem (Zipf's law) is $p_r \cdot r = c$ with $c$ constant; $c$ depends on the number of
distinct terms, approximately $c \approx \frac{1}{\ln(1.78\,M)}$. Terms whose rank is too low (too
frequent) or too high (too infrequent) are then cut off with an upper and a lower threshold.

Discrimination Method
We introduce a similarity measure sim(d_i, d_j) with values between 0 (dissimilar) and 1 (identical).
Documents are described as vectors whose component k denotes the number of occurrences of the k-th term.
The density of a collection is defined as the sum of the similarities between all documents and a
centroid C (which contains all M terms with their average frequency, $C_i = n_{t_i}/N$). The
discrimination value $DW_t$ of a term t is the difference between the density of the collection
without the term and the density with the term:
- DW_t > 0 → t is a good descriptor
- DW_t ≈ 0 → t contains little semantics to distinguish documents
- DW_t < 0 → t is a bad descriptor
Mapping text to terms
Most search engines use words or phrases as features, some use stemming, some differentiate
between upper and lower cases, some support error correction. An interesting option is the usage of
n-grams, since they support fuzzy retrieval.
There are 2 main features of terms:
1. Term frequency: used to rank documents
2. Term location: influences the ranking (if two search terms appear near each other in a document,
   it is ranked higher) and whether a document appears at all (e.g., for "white house" you want the
   terms close together)
Reduction of terms to their stems
For English there exists a good algorithm which stems words without the need of a dictionary: the
Porter algorithm. It doesn't reduce words to the correct linguistic stem, but it maps all words that
belong together to the same stem.
In other languages (like German), words are much more heavily inflected, so we need a dictionary for
stemming. This also implies additional lookup costs and maintenance costs for the dictionary.
Efficient dictionary-based stemming can be done in a two-step approach:
1. a rule-based approach for regular flexions (much like Porter, but simpler); the result of each
   applicable rule is looked up in a dictionary, and if the word exists it is a possible stem.
   This may lead to several stems (e.g., axes -> axis, axe)
2. an exception list with strong or irregular flexions of terms
Mapping to index terms
Here we also deal with relationships between words:
- Homonyms: equal terms but different semantics
- Synonyms: different terms but equal semantics
- Hypernyms (umbrella terms) <--> Hyponyms (species)
- Holonyms ("has parts") <--> Meronyms ("is part of")
We might now also consider terms that are related to the query term in our search, e.g., with a
smaller weight.

Models for Text retrieval


Boolean Retrieval
Very simple: if the query term appears in the document, you retrieve it. Possible combinations:
- rsv(q1 ∧ q2, d) = min( rsv(q1,d), rsv(q2,d) )
- rsv(q1 ∨ q2, d) = max( rsv(q1,d), rsv(q2,d) )
- rsv(¬q1, d) = 1 - rsv(q1,d)
Fuzzy Retrieval
A document is represented by a vector d with d_i ∈ [0,1], the normalized term frequency of term t_i.
We apply fuzzy logic for more complex queries:
1. fuzzy algebraic
   - rsv(q1 ∧ q2, d) = rsv(q1,d) · rsv(q2,d)
   - rsv(q1 ∨ q2, d) = rsv(q1,d) + rsv(q2,d) - rsv(q1,d) · rsv(q2,d)
   - rsv(¬q1, d) = 1 - rsv(q1,d)

2. extended Boolean model
   - rsv(q1 ∧ q2, d) = c_a1 · max( rsv(q1,d), rsv(q2,d) ) + c_a2 · min( rsv(q1,d), rsv(q2,d) )
   - rsv(q1 ∨ q2, d) = c_o1 · max( rsv(q1,d), rsv(q2,d) ) + c_o2 · min( rsv(q1,d), rsv(q2,d) )
   - rsv(¬q1, d) = 1 - rsv(q1,d)
   Usually it holds c_o1 > c_o2 and c_a1 < c_a2. We can also use the soft Boolean operator ("andor")
   with c_a1 = c_o1 and c_a2 = c_o2 = 1 - c_o1.
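
For illustration, the operator families above as small Python functions; the coefficient values 0.3/0.7 are only example choices, not values from the lecture:

```python
def boolean_and(a, b):  return min(a, b)
def boolean_or(a, b):   return max(a, b)
def fuzzy_and(a, b):    return a * b
def fuzzy_or(a, b):     return a + b - a * b
def neg(a):             return 1 - a

def ext_and(a, b, ca1=0.3, ca2=0.7):   # ca1 < ca2: AND leans towards min
    return ca1 * max(a, b) + ca2 * min(a, b)

def ext_or(a, b, co1=0.7, co2=0.3):    # co1 > co2: OR leans towards max
    return co1 * max(a, b) + co2 * min(a, b)

# rsv values of two query terms in one document
rsv1, rsv2 = 0.8, 0.4
print(boolean_and(rsv1, rsv2), fuzzy_and(rsv1, rsv2), ext_and(rsv1, rsv2))
```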
Vector Space Retrieval
Important measures:
- tf(T_i, D_j): term frequency, the number of occurrences of feature T_i in document D_j
- df(T_i): document frequency, the number of documents that contain feature T_i
- idf(T_i): inverse document frequency, the discrimination value of term T_i
A typical function for the idf is $idf(T_i) = \log\frac{N}{df(T_i)}$ (N = number of documents).
Now we can represent a document D_j as a vector with components $d_{ij} = tf(T_i, D_j) \cdot idf(T_i)$.
With all documents we can build an M×N matrix A, where M = #terms and N = #documents. The query is
defined as an M×1 vector, built like a document. Now we can calculate the rsv value as
1. inner vector product: $rsv(Q, D_j) = \sum_i q_i\,d_{ij}$
2. cosine measure: $rsv(Q, D_j) = \frac{\sum_i q_i\,d_{ij}}{\sqrt{\sum_i q_i^2}\,\sqrt{\sum_i d_{ij}^2}}$
We assume that terms occur independently of each other in documents; however, this is not true
in general.
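
A minimal tf-idf/cosine sketch on a toy corpus (simple whitespace tokenization; this is an illustrative weighting, not the exact formula of any particular engine):

```python
import math
from collections import Counter

docs = ["the cat sat on the mat", "the dog chased the cat", "dogs and cats"]
N = len(docs)
tokenized = [d.split() for d in docs]
df = Counter(t for doc in tokenized for t in set(doc))
idf = {t: math.log(N / df[t]) for t in df}               # idf(T_i) = log(N / df(T_i))

def to_vector(tokens):
    tf = Counter(tokens)
    return {t: tf[t] * idf.get(t, 0.0) for t in tf}      # component: tf * idf

def norm(v):
    return math.sqrt(sum(x * x for x in v.values()))

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu, nv = norm(u), norm(v)
    return dot / (nu * nv) if nu and nv else 0.0

doc_vectors = [to_vector(t) for t in tokenized]
query = to_vector("cat dog".split())
for i, dv in enumerate(doc_vectors):
    print(i, round(cosine(query, dv), 3))
```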
Probabilistic Retrieval
We basically estimate the probability that, given a query Q, a document is considered relevant. We
assume that this probability depends only on the query and the document collection:
$$rsv(Q, D_j) = \frac{P(R \mid D_j)}{P(NR \mid D_j)}$$
These probabilities can be rewritten through Bayes' theorem:
$$\frac{P(R \mid D_j)}{P(NR \mid D_j)} = \frac{P(D_j \mid R)\,P(R)}{P(D_j \mid NR)\,P(NR)}$$
The Binary Independence Retrieval (BIR) model simplifies the computation by considering only binary
representations of the documents (and the query). More precisely: document D_j is characterized by a
binary vector x with $x_i = 1$ if term $T_i$ occurs in D_j, so $P(D_j \mid R) = P(x \mid R)$.
Assuming independence of the terms, the probability $P(x \mid R)$ is given by $\prod_i P(x_i \mid R)$,
with $r_i = P(x_i = 1 \mid R)$ being the probability that term $T_i$ occurs in a randomly chosen
document from R. $P(x \mid NR)$ can be described similarly; we shorten $P(x_i = 1 \mid NR)$ to $n_i$.
If we now split the product into terms that occur in the document and terms that do not occur, we get
$$rsv(Q, D_j) \sim \prod_{x_i = 1} \frac{r_i}{n_i} \cdot \prod_{x_i = 0} \frac{1 - r_i}{1 - n_i}$$
Since we are not interested in absolute values, we can eliminate all constant factors that do not
depend on the current document:
$$rsv(Q, D_j) \sim \sum_{x_i = 1,\; T_i \in Q} c_i \quad\text{with}\quad c_i = \log\frac{r_i\,(1 - n_i)}{n_i\,(1 - r_i)}$$
If the c-values are given, search engines can efficiently compute the rsv with inverted files.

One question remains: how do we compute $r_i$ and $n_i$?
- Initialization: $r_i = 0.5$ and $n_i = df(T_i)/N$. Query terms appear with a constant probability in
  relevant documents; in non-relevant documents they follow the average distribution of that term in
  the collection.
- Feedback step: $r_i = l_i/l$ and $n_i = (k_i - l_i)/(k - l)$. The user selects l out of the k
  presented documents as relevant; $k_i$ is the number of documents from the result set that contain
  term $T_i$, and $l_i$ is the number of relevant documents (selected by the user) that contain term $T_i$.
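
A sketch of the c-values with the initialization and feedback step above; the small smoothing constants are our addition to avoid divisions by zero and are not part of the lecture formulas:

```python
import math

def c_values(df, N, feedback=None, k=0, l=0):
    """BIR term weights c_i = log( r_i(1-n_i) / (n_i(1-r_i)) ).
    df: document frequency per term; feedback: per-term (k_i, l_i) after the
    user marked l of the k presented documents as relevant."""
    c = {}
    for t, df_t in df.items():
        if feedback is None:
            r_i, n_i = 0.5, df_t / N                     # initialization
        else:
            k_i, l_i = feedback[t]
            r_i = (l_i + 0.5) / (l + 1)                  # smoothed l_i / l
            n_i = (k_i - l_i + 0.5) / (k - l + 1)        # smoothed (k_i - l_i) / (k - l)
        c[t] = math.log(r_i * (1 - n_i) / (n_i * (1 - r_i)))
    return c

df = {"jaguar": 50, "car": 400}
print(c_values(df, N=1000))
print(c_values(df, N=1000, feedback={"jaguar": (8, 6), "car": (9, 5)}, k=10, l=6))
```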
Latent Semantic Indexing
The basic idea of LSI is to transform the document vectors into a low-dimensional space which
approximates the original vectors. The new dimensions should denote concepts encompassing several terms.
A matrix A can be decomposed by a singular value decomposition (SVD): $A = U S V^T$, where U holds the
eigenvectors of $AA^T$ in its columns and V holds the eigenvectors of $A^T A$ in its columns. If the
entries of the diagonal matrix S are ordered by decreasing size, we can reduce the dimension of the
SVD by keeping only the first k singular values and reducing U and V accordingly. The new dimensions
are $U_k$ (M×k), $S_k$ (k×k) and $V_k^T$ (k×N).
New documents d can approximately be inserted (folded in) by $\hat d = S_k^{-1} U_k^T d$; similarly,
new terms t (rows of A) are inserted by $\hat t = t\,V_k S_k^{-1}$.
Now we can evaluate a query by transforming it into the sub-space in the same way,
$\hat q = S_k^{-1} U_k^T q$, and, as in vector space retrieval, calculating a distance (or cosine)
between $\hat q$ and the reduced document representations.
We can read some interesting facts from the different matrices:
- $A^T A$: element (i,j) is the inner product of documents $D_i$ and $D_j$; if (i,j) is large, $D_i$
  and $D_j$ are similar (however, the values do not denote how similar)
- $A A^T$: element (i,j) is the inner product of the term rows $T_i$ and $T_j$; if (i,j) is large,
  $T_i$ and $T_j$ appear in the same context
- $V_k^T$: its columns are the reduced representations of the documents; its rows are eigenvectors of
  the similarity matrix between documents $A^T A$
- $U_k$: its rows are the reduced representations of the terms; its columns are eigenvectors of the
  similarity matrix between terms $A A^T$
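
A minimal LSI sketch with numpy's SVD on a toy term-document matrix (k = 2 and the matrix values are arbitrary illustration choices):

```python
import numpy as np

# toy term-document matrix A (M terms x N documents)
A = np.array([[2., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 2., 0., 1.],
              [0., 0., 1., 2.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)     # A = U S V^T
k = 2                                                # keep the 2 largest singular values
Uk, Sk, Vtk = U[:, :k], np.diag(s[:k]), Vt[:k, :]    # Uk: Mxk, Sk: kxk, Vtk: kxN

# the reduced document representations are the columns of Vtk;
# a query (term vector) is folded into the concept space the same way
q = np.array([1., 1., 0., 0.])
q_hat = np.linalg.inv(Sk) @ Uk.T @ q

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print([round(cos(q_hat, Vtk[:, j]), 3) for j in range(Vtk.shape[1])])
```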

Summary
Boolean Retrieval
- Advantages: efficient implementation feasible with inverted files
- Disadvantages: size of results is huge; no ranking of documents; complexity of the query language; worse than any other approach

Fuzzy Retrieval
- Advantages: ranking of retrieved documents
- Disadvantages: worse than the following approaches and not better than Boolean retrieval; no weighting of terms feasible (frequent terms dominate); complexity of the query language

Vector Space Retrieval
- Advantages: simple model with efficient evaluation algorithms; partial-match queries possible; very good retrieval quality (but not state of the art); relevance feedback may further improve VSR
- Disadvantages: many heuristics and simplifications, no proof for the correctness of the result set; in HTML/the web, term occurrences are not the most important criteria (spamming)

Probabilistic Retrieval
- Advantages: ordering of documents based on the probability of being relevant; efficient evaluation with inverted files possible
- Disadvantages: first step uses a rough heuristic; frequency and position of terms are not considered; the assumption of independence of terms does not hold in practice

Latent Semantic Indexing
- Advantages: synonyms are detected; simplifies feature extraction (no dictionary and ontology required, different languages and cross-language retrieval for free, no stemming necessary); good retrieval quality
- Disadvantages: extremely expensive computation of the SVD, no parallel algorithms available; expensive evaluation of queries (no optimal index structure, inverted lists not feasible due to dense vectors after the transformation); retrieval quality not much better than with other methods

Indexing Structures
We will look at two main aspects:
1. Inverted lists for efficient identification of relevant documents in Boolean and Vector Space
Retrieval
2. Support fuzzy search with the help of n-grams and signatures

Inverted Lists
We store for each term a list of all documents that contain this term (together with the term frequency).
- Boolean Retrieval: Boolean operations can be computed directly on the document sets (intersection, union, difference of the inverted lists)
- Vector Space Retrieval: take the union over the inverted lists of all query terms and compute the rsv only for documents that contain at least one query term
- Probabilistic Retrieval: same as VSR
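
A minimal in-memory sketch of an inverted index and its use for Boolean retrieval and vector-space candidate selection (dictionaries stand in for files or database tables):

```python
from collections import defaultdict, Counter

docs = {1: "white house press", 2: "house of cards", 3: "white cards"}

# inverted lists: term -> {doc_id: term frequency}
index = defaultdict(dict)
for doc_id, text in docs.items():
    for term, tf in Counter(text.split()).items():
        index[term][doc_id] = tf

def boolean_and(*terms):
    """Intersect the posting lists of all query terms."""
    postings = [set(index.get(t, {})) for t in terms]
    return set.intersection(*postings) if postings else set()

def vsr_candidates(*terms):
    """Union of the posting lists: only these documents need an rsv computation."""
    return {d for t in terms for d in index.get(t, {})}

print(boolean_and("white", "house"))     # -> {1}
print(vsr_candidates("white", "house"))  # -> {1, 2, 3}
```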
We have different ways to implement those lists:
- Files (one separate file per term)
  o Advantages: simple implementation; good performance; open-source implementations available
  o Disadvantages: no transactional guarantees
- Relational database (3 tables: documents (meta data), terms (idf), inverted lists (document-term pairs with frequency))
  o Advantages: everything a database offers; good performance; simple implementation (small software costs)
  o Disadvantages: not as fast as a file-based solution (especially if the latter is optimized)

Signatures
The idea is that if the encoded bit strings of two documents are similar, it is likely that they stem from
similar feature sets, which means the two documents are similar. But since we compress features, two
totally different text documents may lead to the same signature → false hit.
Thus, signatures often serve as a pre-step to filter out candidates. Those candidates are then compared
to the query with an rsv approach to eliminate the false hits. We can also just order the candidates by
rsv value without eliminating false hits, or omit the rsv step entirely (requires a small number of
false hits).

Computation of Signatures
- l: length of the signature in bits
- N: number of documents
- S_Q: signature of the query
- F: error rate
- g: weight of the signature of a term (how many bits are set)
The general idea is that we map each feature to a signature (if possible, all with the same weight g)
and then combine those signatures. The query is also mapped to a signature, and all documents whose
signature is similar to the query signature are returned.
Mapping features to signatures. We can have different kinds of features:
- words/phrases
  o Approach 1: associate each word/phrase with a signature of weight g (bits chosen with even distribution)
  o Approach 2: extract n-grams and superimpose their signatures (leads to different g per word).
    If we want the same g, only use the g most frequent n-grams

- n-grams
  o define a mapping of text to n-grams
  o define a hash function to obtain a bit position for each n-gram and set that bit
  o the signature of the text is the superimposed coding of all signatures of its n-grams
The signature potential $SP = \binom{l}{g}$ (the number of distinct signatures) is maximal for g = l/2.
If we have more terms than signatures (M > SP), we get collisions (on average M/SP terms share the same
signature). To limit this error rate, one can determine the required SP beforehand and thus the length
of the signatures.
We have different methods of combining the signatures of text parts:

Similarity Search
1. Hamming distance: number of differing bits in S_Q and S_D, i.e. w(S_Q XOR S_D);
   useful if the full document is known
2. Cover distance: number of bits set in the query but not in the document signature, i.e.
   w(S_Q) - w(S_Q AND S_D) = w(S_Q AND (NOT S_D));
   useful if only parts of the document are known since it implements a partial comparison
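
An illustrative sketch of superimposed n-gram signatures and the two distances; the 64-bit length and the MD5-based bit position are arbitrary choices for the example:

```python
import hashlib

L = 64   # signature length in bits

def ngrams(text, n=3):
    text = text.replace(" ", "_")
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def signature(text):
    """Superimposed coding: hash every n-gram to one bit position and OR the bits."""
    sig = 0
    for g in ngrams(text):
        pos = int(hashlib.md5(g.encode()).hexdigest(), 16) % L
        sig |= 1 << pos
    return sig

def hamming(sq, sd):            # w(SQ XOR SD)
    return bin(sq ^ sd).count("1")

def cover(sq, sd):              # w(SQ AND NOT SD): query bits missing in the document
    return bin(sq & ~sd & ((1 << L) - 1)).count("1")

sq = signature("multimedia retrieval")
sd = signature("multimedia retrieval systems")
print(hamming(sq, sd), cover(sq, sd))
```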

Indexing Structures
- Sequential organization
- Bit-sliced organization (vertical partitioning): for each of the l bits, an individual file stores
  the contents of the collection; for the cover distance we only need to read the bit files for which
  the query bit is set
- Horizontal partitioning: similar signatures are grouped in separate files (buckets)
  1. Identify the groups which may contain candidates
  2. Read the buckets and identify candidates
  3. Read in the candidates and distinguish between hits and false hits
  For this we need a hash function (simple example: the first k bits of the signature)
- Signature tree (S-tree): groups are represented by a superimposed block signature of all their
  members, which are again treated as signatures (recursion) → signature tree

Web Retrieval
The main problem with the web is the sheer size of it and the quality of the data. A typical small
query returns thousands to millions of websites which contain those keywords. However, not all of
them are relevant. Some approaches might be to retrieve only pages which:
contain terms with the same frequency as in the query
contain the query terms most often
contain all query terms
But these methods lack a mechanism against spamming!

Ordering of Documents
Ranking already starts with extracting the right information from documents (e.g., Google):
- positions of terms in a document
- the relative font size
- the visual attributes (bold, italic)
- the context of the page (term in URL, title, meta tags, text of references)
The ranking then consists of the following factors:
- Proximity of terms: for each position pair of query terms, a proximity value is assigned; the
  frequencies of these proximity values form the proximity vector. Multiplying this vector with a
  weighting vector w leads to an overall proximity value.
- Positions in the document, size of font, visual attributes: depending on the context in which the
  query term appears, it is weighted differently.
Google also has mechanisms to cut off spamming:

PageRank (robust against spamming!)
The preliminary idea was to count the number of incoming links: the more incoming links, the more
important the page, because it is more likely that a surfer lands on it. But not every page is equally
important, and thus neither are its links; plus we have the problem of spamming.
Improved idea: a random surfer clicks with probability p on an outgoing link of the current page
(with probability 1-p he selects an arbitrary page, e.g., a bookmark). The PageRank of a page is the
probability that the random surfer lands on that page (after a number of steps):
$$PR(A) = \frac{1-p}{N} + p \sum_{B \in L(A)} \frac{PR(B)}{N(B)}$$
- L(A): set of pages which have a reference to A
- N(B): number of outgoing links of page B
- N: total number of pages

The PR() values can be computed by a fixpoint iteration (not expensive: < 100 iterations):
1. Assign arbitrary initial values PR(A) to all documents A
2. Compute new values PR'(A) according to the formula above for all A
3. If |PR'(A) - PR(A)| becomes sufficiently small, take this as the solution; else go to 2
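
A small fixpoint-iteration sketch of the formula above; p = 0.85 is only a common choice, and dangling pages are not treated specially in this sketch:

```python
def pagerank(links, p=0.85, eps=1e-8):
    """links: page -> list of pages it links to. Returns the PR value per page."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    N = len(pages)
    pr = {page: 1.0 / N for page in pages}            # 1. arbitrary initial values
    while True:
        new = {page: (1 - p) / N for page in pages}   # random-jump part
        for page, targets in links.items():
            if targets:
                share = p * pr[page] / len(targets)   # p * PR(B) / N(B)
                for q in targets:
                    new[q] += share
        if max(abs(new[x] - pr[x]) for x in pages) < eps:   # 3. stop criterion
            return new
        pr = new                                      # otherwise iterate again

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(links))
```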
PageRank cannot be the only ordering criterion!
(A document with a very high PR would otherwise be the best result for all queries that contain any
term of that document.)
Further criteria (advertisements, pushed content, ...):
- bought ranking positions
- length of the URL (the shorter the better → nearer to the root)
- user feedback (if a document in the result list is visited, its relevance value is increased)
Overall we need a clever combination of all approaches, which means we need a lot of well-chosen
weights; how those weights are obtained is a secret of the search engine providers.

Content based Retrieval


- Hub: the page contains many links to pages that are relevant to the query
- Authority: the page is relevant for Q (mostly this also means it is linked by many hubs)

HITS Algorithm (Kleinberg 1997)


1. For a query Q determine the first t documents → root set
2. Extend the root set with documents referenced by a document in the root set and with documents
   pointing to documents in the root set → base set. The base set may have to be limited, and links
   within the same domain should be reduced to one link.
3. Compute the hub values h(p) and authority values a(p) for each document p. We could just sum up
   incoming and outgoing links, but it is better to normalize a(p) and h(p):
   a. Initialization: all pages start with the same values for a(p) and h(p)
   b. Iteration: a(p) = sum of h(q) over all pages q that link to p; h(p) = sum of a(q) over all pages
      q that p links to; then normalize a and h and repeat until convergence is reached
4. Compute the result:
   a. if the user asks for overview pages, return the k documents having the largest h(p)
   b. if the user asks for content pages, return the k documents having the largest a(p)
This algorithm has three fundamental problems:
1. If all pages in a domain reference the same external page, that page becomes too strong an
   authority; if a page links to many different pages in the same domain, it becomes too strong a hub.
   Solution: the same author has only one vote for an external/internal page.
2. Automatically established links (advertisement, provider/host, ...) create wrong authorities.
3. Queries such as "jaguar car" tend to lead to pages about cars in general → more frequent terms dominate.
Solution to 2 and 3: eliminate nodes from the graph that obviously do not match the query or its
relevant documents. Evaluate an artificial (vector space) query against the base set to get a
similarity measure s(p) between the query and each document; for a given threshold t, eliminate all
documents with s(p) < t.
Final formula: the hub and authority updates are additionally weighted with the similarity values s(q)
of the contributing pages.
The main issues with the improved HITS algorithm are its large evaluation costs and long retrieval
times (in contrast to PageRank, HITS depends on the query).
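
A sketch of the basic (unweighted) HITS iteration on a small base set; a fixed number of iterations stands in for a proper convergence test:

```python
import math

def hits(links, query_pages, iterations=50):
    """links: page -> outgoing links, restricted to the base set of the query."""
    pages = set(query_pages)
    a = {page: 1.0 for page in pages}   # authority values
    h = {page: 1.0 for page in pages}   # hub values
    for _ in range(iterations):
        # authority: sum of the hub values of pages linking to it
        a = {p: sum(h[q] for q in pages if p in links.get(q, ())) for p in pages}
        # hub: sum of the authority values of the pages it links to
        h = {p: sum(a[q] for q in links.get(p, ()) if q in pages) for p in pages}
        # normalize so the values do not grow without bound
        na = math.sqrt(sum(v * v for v in a.values())) or 1.0
        nh = math.sqrt(sum(v * v for v in h.values())) or 1.0
        a = {k: v / na for k, v in a.items()}
        h = {k: v / nh for k, v in h.items()}
    return a, h

links = {"p1": ["p3", "p4"], "p2": ["p3", "p4"], "p3": [], "p4": ["p3"]}
a, h = hits(links, links.keys())
print(sorted(a, key=a.get, reverse=True))   # best authorities first
print(sorted(h, key=h.get, reverse=True))   # best hubs first
```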

What's Related (Google: "similar pages")
Alexa used crawlers and data mining tools; additionally it spied on the surfing patterns of its users.
Google relies on the analysis of the link structure of web pages. There are two published approaches:
1. Companion algorithm, based on the extended HITS algorithm
   1. Build a directed graph in the neighborhood of u (the given URL). The numbers of parent (b),
      child (f) and sibling (bf, fb) pages are limited.
   2. Merge duplicates (or near-duplicates) if they contain more than 10 links and 95% of the
      contained links appear in both documents.
   3. If k edges of documents in the same domain point to the same external page, weight them by 1/k.
      If a document contains l edges to pages within the same domain, weight these edges by 1/l.
   4. Determine hubs and authorities according to extended HITS (without similarity weighting).
   5. Determine the result: the pages with the highest authority weights are the related pages.
2. Co-citation algorithm, counts how often a page u is referenced together with a page q
   1. Determine at most b parent pages of u.
   2. Determine for each parent page at most bf child pages, with the links to these child pages
      being close to the link to u → siblings of u.
   3. Determine the pages q_i that are referenced most frequently together with u.
   4. If steps 1-3 result in fewer than 15 co-citations with frequency 2 or higher, repeat the search
      with prefixes of the URL of u.
Architecture of a Search Engine
A search engine consists of the following main components:
- Crawler / Repository
- Feature Extractor
- Indexer
- Sorter
- Feedback Component
- User Interface

Main Problem: Scalability


The data problem:
If we roughly assume the size of Google's inverted lists is 35 TB, how do we search through that if a
single machine only stores 100 GB? How can we organize a cluster given the frequent updates and
enormous query frequencies? Additionally, how do you assign identifiers for 35 billion pages?
The retrieval problem:
With an estimated 3,000 queries per second: a single entry in the inverted files can consume more than
5 GB, but a typical disk reads about 50 MB/s. How long does it take to search through 5 GB of data?
How do you reduce search times given the high I/O load?
The crawling problem:
How do you fetch 35 billion pages in a reasonable time? DNS lookups are expensive, and servers have
different response times. If Google exchanges its index once every month, that means roughly 24,000
pages of 10 KB each per second! Incremental crawling: important pages have to be read daily, so how do
we select the important pages?
Crawling tricks and musts
- DNS lookup problem: keep a local cache on each crawl server
- Follow the rules of the server!
  o only a few requests per minute to the same server
  o do not follow cgi links (expensive; you may activate/interact with/change the page)
  o read and obey robots.txt
  o filter critical URIs
- A single machine / a single connection is not sufficient for crawling
Distributed Data Management
- Google File System (GFS), built specifically to handle large files
  o fault tolerance against crashes of machines, hard disks, ...
  o partitioning of files to allow for massive parallelization
- Two-dimensional data management
  o data is partitioned along groups of documents (so-called shards)
  o each shard is stored on an arbitrary number of machines → fault tolerance, distribution of load
- Execute everything in as parallel a fashion as possible
- Partitioning & replication: the scalability of Google has two dimensions
  o partitioning: supports growth of the number of documents
  o replication: supports growth of the number of concurrent queries
Cluster Management
Google has on average 3000 failures per day
Data has to be refreshed in regular intervals
Google permanently improves its software
SW distribution must be fully automated and cannot result in any downtime


Image, Audio and Video Retrieval


Most search systems for content-based retrieval provide one or more of the following paradigms:
- Search by Browsing: the user clicks through a directory or catalog
  + simple structure, efficient implementations, scales easily
  - high costs to maintain the catalog; queries that are not along the catalog structure are difficult
- Search by Keyword: media is annotated with keywords
  + simple implementation based on existing tools, sufficient retrieval quality in many cases
  - annotation is labor intensive; descriptions are often subjective or require detailed knowledge
- Search by Similarity: the query contains reference objects that match to a certain degree
  + content analysis is cheap as it is automatic; many different aspects can be analyzed
  o Query by Example: reference objects are picked from the collection
    + usually simple to find a first rough match; one can also insert own objects as reference
    - difficult to obtain sufficiently good reference objects (overmatching: we get objects that
      match the reference object in aspects the user did not intend)
  o Query by Sketch: reference objects are drawn/sung by the user
    + very flexible, exact query definition possible
    - not everybody is good at drawing/singing
- Relevance Feedback
We can divide features into the following groups:
1. Primary information: raw data associated with the document (signal information, text data, meta data)
2. Secondary information: derived from primary information, describes focused aspects of the document
   a. Simple features (level 1): basic aspects of the raw data (text, color, texture, shape,
      geographical position, frequencies, loudness, self-similarity, relationships between objects/regions)
   b. Logical features (level 2): identity of contained objects (types such as "animal", names such as
      "White House"); logical features require a model for each type or name to be detected
   c. Abstract features (level 3): meaning, reason, context of the recording (soccer game,
      emotional/religious feelings, ...)

How similarity search works


We describe the collection with suitable features:
- Text: terms, term frequency, position, visual attributes
- Web: hub, authority, PageRank, references/links
- Image: visual features, objects, regions
- Audio: acoustical features, tone, spoken text, self-similarity
- Video: movement, camera movement, subtitles, sequences/shots
- All: meta data, annotations, near-by text
With an appropriate rsv function we induce an ordering for a given query Q by transforming it into the
same feature space and matching it with the collection objects:
- partial match: the query must be contained in the document
- similarity: query object and document have to be similar (distance function)

16

CS342 Multimedia Retrieval


HS2012 University of Basel

Distance functions (metrics)
- L1 norm (Manhattan distance): $\delta(x, y) = \sum_j |x_j - y_j|$
- L2 norm (Euclidean distance): $\delta(x, y) = \sqrt{\sum_j (x_j - y_j)^2}$
- Lp norm: $\delta(x, y) = \left(\sum_j |x_j - y_j|^p\right)^{1/p}$
- L∞ norm (maximum norm): $\delta(x, y) = \max_j |x_j - y_j|$
- Quadratic function: $\delta(x, y) = (x - y)^T A\,(x - y)$ with a positive definite matrix A
Before we can apply a distance function, we need to normalize the components (dimensions), assuming a
Gaussian distribution: let $\mu_j$ be the mean value of component j and $\sigma_j$ its standard
deviation; then we normalize $x_j' = \frac{x_j - \mu_j}{\sigma_j}$. This means, e.g., that the L1 norm
becomes $\delta(x, y) = \sum_j \frac{|x_j - y_j|}{\sigma_j}$ (the mean values cancel out).

Remarks on Quadratic functions
Quadratic functions are computationally rather intensive. Thus we can apply an eigenspace
transformation to the feature vectors: since A has to be positive definite, its eigenvalues are all
real and define the main axes of the hyper-ellipsoid $x^T A x = const$. A simple rotation and scaling
maps the original feature space into a new one in which the ellipsoid becomes a sphere, so the
quadratic distance reduces to a Euclidean distance.
Mapping from distances to similarity values (scores)
A similarity value must be in the range [0,1], where 0 denotes dissimilarity and 1 denotes identity.
For this we need a correspondence function h with $sim(Q, D) = h(\delta(Q, D))$ and the constraints:
1. h(0) = 1
2. h(∞) = 0
3. h is monotonically decreasing ($h'(x) \le 0$)
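
A few of the pieces above as a Python sketch; the correspondence function h(x) = e^(-x) is just one admissible choice satisfying the three constraints, not the one prescribed by the lecture:

```python
import math

def minkowski(x, y, p):                   # Lp norm; p=1 Manhattan, p=2 Euclidean
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def l_inf(x, y):                          # maximum norm
    return max(abs(a - b) for a, b in zip(x, y))

def gaussian_normalize(vectors):
    """Normalize every component to zero mean and unit standard deviation."""
    dims = list(zip(*vectors))
    mu = [sum(d) / len(d) for d in dims]
    sigma = [math.sqrt(sum((v - m) ** 2 for v in d) / len(d)) or 1.0
             for d, m in zip(dims, mu)]
    return [[(v - m) / s for v, m, s in zip(vec, mu, sigma)] for vec in vectors]

def similarity(distance):                 # one possible correspondence function
    return math.exp(-distance)            # h(0)=1, h(inf)=0, monotonically decreasing

feats = [[1.0, 200.0], [2.0, 180.0], [0.5, 400.0]]
normed = gaussian_normalize(feats)
print(similarity(minkowski(normed[0], normed[1], 2)))
```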

Primary information for multimedia documents


Raw Data
- Signal information: may be strongly compressed, but will still be too big to search through at retrieval time
- Textual annotations: keywords, embedding document, descriptions, URL of the document, ALT attribute, ...;
  one can find further meta data through databases (like IMDb)
Problem: the automatic association of textual paragraphs in web pages with multimedia objects
- ALT text for images: pretty obvious
- surrounding text is harder (it might also describe something else)
  o Approach 1: associate the text found just before and after the embedding tag in the source code
  o Approach 2: visual closeness: render the page and measure distances to text blocks while taking
    delimiters (<hr>) into account

Meta Data (MPEG-7)
MPEG-7 is a standard to maintain meta information:
describe any multimedia document (images, audio, video)
describe possible descriptors and their relationships to each other
define descriptors
encode descriptors and prepare them for later indexing
However, it does not include the concrete implementation of feature extraction algorithms nor filter
and search algorithms to search through MPEG-7 data.
There are still some standardization issues, e.g., which languages to support and how fine-grained the
textual descriptions should be. It also lacks support by tools and search engines (e.g., a digital
camera should automatically produce an MPEG-7 document for each shot).

Simple Features for Images

- Text features: recall Chapter 2 (term vectors, LSI vectors, ...)
- Linking features: recall Chapter 3 (authority/hubs, PageRank)
- Image descriptors: based on the signal information
  o local features (per region, dynamic or static) vs. global features
  o features can be constructed invariant to rotations and translations, but this would not allow us
    to distinguish between top/bottom
  o feature selection determines the precision of retrieval (more features → better results) but also
    the performance (more features → worse performance)

Color Histograms
A color histogram is defined by
- selection of a color space (RGB, L*a*b*, LCH)
- selection of reference colors
  o as many colors as possible to obtain good descriptors
  o as few colors as possible to benefit from fast retrieval
- definition of a distance measure, i.e., a similarity measure
  o preferably one that takes visual similarity into account
In the end, a histogram should be normalized → divide by the number of pixels.
A similar histogram doesn't imply similar images per se! And a dissimilar histogram doesn't imply
dissimilar images either: if, e.g., buckets 2 and 3 hold similar colors, two images whose mass falls
into bucket 2 and bucket 3 respectively may still be similar.
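
A minimal color-histogram sketch over a plain list of RGB pixels (4 bins per channel is an arbitrary choice; no image library is assumed):

```python
def color_histogram(pixels, bins_per_channel=4):
    """pixels: list of (r, g, b) tuples with values 0..255.
    Returns a histogram over bins_per_channel^3 reference colors, normalized by #pixels."""
    hist = [0.0] * bins_per_channel ** 3
    step = 256 / bins_per_channel
    for r, g, b in pixels:
        idx = (int(r // step) * bins_per_channel + int(g // step)) * bins_per_channel + int(b // step)
        hist[idx] += 1
    n = len(pixels)
    return [h / n for h in hist]

def l1_distance(h1, h2):
    return sum(abs(a - b) for a, b in zip(h1, h2))

img_a = [(250, 10, 10)] * 90 + [(10, 10, 250)] * 10      # mostly red
img_b = [(10, 250, 10)] * 90 + [(10, 10, 250)] * 10      # mostly green
print(l1_distance(color_histogram(img_a), color_histogram(img_b)))
```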

Color Moments
Color moments construct an uncorrelated feature vector: statistical moments (e.g., mean, variance,
skewness) are computed along the different color dimensions, for example in the L*a*b* color space.
Texture Moments
A simple approach would be to have a set of pre-defined reference textures and determine the
probability of a match or the coverage of each texture in the image. We consider, however, an approach
that is based on the Fourier transformation.
To use the FFT we need to change the dimensions of the image to powers of two (2^x × 2^y). We can use
- stretching
- filling
- tiling
- mirroring (reduces hard edges)
Over the Fourier-transformed image we can now apply a filter. A popular choice is the Gabor filter,
which defines a set of frequencies and orientations spanning the entire Fourier space. The resulting
matrix is transformed back into the image space, where the number and intensity of the remaining
pixels are counted.
Shape Features
There are several possibilities: simple methods describe area, orientation and edge directions; more
complex ones apply a Fourier transformation to detect the main orientations in the image.
Vailaya (1996, Michigan State University)
Basic idea: determine a histogram depicting the main edge orientations in an image
1. apply a Canny edge detector
2. determine the orientation of the edges for each pixel in the image
3. normalize the histogram by the number of pixels
The obtained feature is translation invariant and (given a suitable distance measure) rotation invariant.
Berchtold (1997, Ludwig Maximilian University of Munich)
We measure the portions of the shape within predefined areas.
Chalechale (2003, Angular Radial Partitioning)
Defines features that are suitable for query by sketch; robust against small offsets in scale and
translation (normalization) and against rotation by the angle of a slice (→ Fourier transform, use of
energy levels).
1. Convert the image to gray intensities
2. Normalize the size of the sketch and the DB image
3. Apply a Canny edge detector to find strong edges
4. Partition the resulting edge map into M×N radial/angular partitions
5. Count the number of edge pixels in each partition to obtain a raw feature vector
6. Fourier-transform the raw feature vector and use the absolute values (energy) → final feature vector
Regions and Blobs
Two approaches to define regions:
- Dynamic (Blobworld, Berkeley): try to dynamically find regions that belong together (best to smooth
  the image beforehand)
- Static (Chariot, ETH): images are divided with fixed schemes (overlapping rectangles, 1 foreground
  + 4 background regions, ...)
Invariances of Features
Types of invariances:
- Translation: position in the image
- Absolute scale: absolute size of the image
- Relative scale: relative size of the object within the image
- Rotation: small or big (e.g., rotation by 90 degrees)
- Environment: changes in the background/environment
- Signal: filters such as high contrast, brightness, soften, sharpen, gamma correction, ...
It is not always desired that all invariances apply, e.g., a blue sky (at the top of the image) should
be distinguishable from a blue lake (at the bottom of the image), or portraits should be
distinguishable from pictures showing an entire person.
Since the use of invariances depends on the search context, search engines should provide options to
select features and invariances that define the search scope.
Provide simple to understand options in the search interface
Ask user to rate the images and select/weight features and invariances based on feedback
Vary the features and invariances and show the user the best results for each variation
Faceted search: provide lists of sub-results over features/invariances that frequently appear
in the result set

Simple Features for Audio Data


Audio retrieval covers four different areas:
1. search on meta data (keyword-based search)
2. search with acoustical features such as loudness, pitch, ... (content-based similarity search)
3. search for pitches and tunes (search by humming)
4. search for spoken text (speech recognition)
Search with Acoustical Features
The main difference to image retrieval is that we don't have a single vector describing the whole
piece of music but a temporal series of such vectors.
A simple characterization of audio files is based on the Fourier transformation. The audio signal is
divided into intervals of pre-defined length. For each interval, statistical values describing specific
perceptual aspects are computed:
- Loudness: square root of the mean of the squared amplitude values in the interval
- Pitch: calculated from the frequencies and amplitudes of the peaks inside the interval
- Brightness: occurrence of high frequencies (logarithmic value of the amplitude of high frequencies)
- Bandwidth: variation of the frequencies in the interval (amplitude-weighted mean distance of the
  Fourier coefficients to the mean frequency)
→ a 4-dimensional vector for each interval

Frequently the measures are further compressed to obtain fewer vectors:
- Average: mean value of each measure over the intervals of the signal
- Variance: mean quadratic distance to the mean value over all intervals
- Auto correlation: self-similarity (the signal is compared with itself at different points in time)
→ a 12-dimensional vector describing the acoustical aspects of an audio file
Search for Pitches and Tunes
There are two inherent problems to solve:
- recognition of pitches and tunes in the audio signal (approaches exist for single instruments)
- definition of a tune (one could just look at the contour regardless of pitch intervals)
Search by Humming
To describe a contour, the difference to the preceding note is stored: D (down), U (up) and S (same).
One can now either hum the tune (needs a transformation from humming to a stream of D, U, S) or
directly input the character stream if the tune is known (notes) or the person is a bad singer.
To answer such queries it is good to store, for all files in the collection, all possible sub-strings
of a fixed length, so that we can search with a so-called prefix search tree.
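
An illustrative sketch of the contour encoding and a naive sub-string match; a real system would index the sub-strings in the prefix search tree mentioned above, and the song titles and pitch sequences are made up:

```python
def contour(pitches):
    """Map a sequence of pitches (e.g. MIDI numbers) to a D/U/S contour string."""
    out = []
    for prev, cur in zip(pitches, pitches[1:]):
        out.append("U" if cur > prev else "D" if cur < prev else "S")
    return "".join(out)

def matches(collection, hummed, n=6):
    """Return all songs containing one of the length-n sub-strings of the query contour."""
    grams = {hummed[i:i + n] for i in range(len(hummed) - n + 1)} or {hummed}
    return [title for title, c in collection.items() if any(g in c for g in grams)]

collection = {"song A": contour([60, 62, 64, 64, 62, 60, 67]),
              "song B": contour([60, 59, 57, 55, 57, 59, 60])}
query = contour([50, 52, 54, 54, 52, 50])    # same shape as song A, different key
print(matches(collection, query))            # -> ['song A']
```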
Search in spoken text
First we need to remove unimportant features and noise (pitch level of the speaker, tempo, loudness).
The usual method for this purpose is the MFCC (mel frequency cepstral coefficients) analysis. The
result is a stream of p-dimensional vectors. To continue, a Hidden Markov Model (HMM) is used. Since an
HMM needs a discrete number of states, a maximum number of states (k) is defined and the vector space
is divided with a k-means algorithm into k areas. Each vector is then mapped to a state value between 1 and k.
An HMM describes a stochastic process consisting of several steps. Each step denotes a change of state,
and each state $s_i$ emits an output symbol 1..k with a probability $b_{i,k}$. A change from a starting
state to an end state occurs with a certain probability $a_{ij}$. The output symbols correspond to the
MFCC states.
For each phoneme such a model needs to be defined and learned. During phoneme recognition, the stream
of MFCC states is observed and the phoneme model that could have emitted this series with the highest
probability is determined → transformation into a stream of phonemes. One could also use neural
networks instead of HMMs.
Now we have 2 possibilities:
1. recognition of words based on the stream of phonemes (fault tolerant!)
2. use n-grams to describe the phoneme stream and map the query into the phoneme space
   o if the query is spoken: the query is mapped to a phoneme stream in the same fashion
   o if the query consists of keywords: mapping through a dictionary
   → retrieval directly in the phoneme stream (rather bad, see the points below)
Points to keep in mind:
   o only a limited number of words in current word recognizers
   o more expensive
   o names are difficult, as are composite words (where to split?)
   o fault tolerance is hard to achieve: once a word is wrongly recognized, the word after it is prone
     to failure as well


Simple Features for Video Data


Video retrieval is a combination of text retrieval (subtitles, meta data), image retrieval (images of
the shots) and audio retrieval (music, speech). But there are some additional problems:
- Shot detection: find sequences of images taken with the same perspective
- Feature extraction: combination of features from text, audio and image
- Recognition of the film genre
- Browsing: depict a scene instead of the whole movie
- Object movement: closely related to camera movements
- Advertisements
- Streaming: transferring the whole video when the user clicks on a result is not good
Shot detection can be divided into two sub-tasks:
1. detect a shot: distinguish between hard cuts and soft cuts
2. key frame extraction: identify a key frame for each shot (maybe a stitched panoramic picture)
Shot Detection
- With MPEG-7, shot detection would be superfluous since the meta data already contains the shot boundaries.
- Sometimes different shots should still form one sequence (e.g., two people talking).
Hard cuts
- Pixel-based comparison: sum up the differences of corresponding pixel values of two consecutive frames
- Histogram comparison: compare the (color or intensity) histograms of two consecutive frames
We need to choose the right threshold to distinguish between cut and no cut; the optimal threshold
minimizes false hits and false drops.
Problems with this approach are soft cuts, quickly moving objects and fast camera movements.
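
A sketch of hard-cut detection via histogram comparison on toy gray-value frames; the bin count and threshold are arbitrary and would have to be tuned on real material:

```python
def gray_histogram(frame, bins=16):
    hist = [0] * bins
    for v in frame:                       # frame: flat list of gray values 0..255
        hist[int(v * bins // 256)] += 1
    return hist

def hist_difference(f1, f2):
    h1, h2 = gray_histogram(f1), gray_histogram(f2)
    return sum(abs(a - b) for a, b in zip(h1, h2)) / len(f1)

def detect_hard_cuts(frames, threshold=0.5):
    """Report a cut between consecutive frames whose histogram difference
    exceeds the threshold."""
    return [i for i in range(1, len(frames))
            if hist_difference(frames[i - 1], frames[i]) > threshold]

dark = [20] * 100
bright = [230] * 100
frames = [dark, dark, bright, bright, bright]
print(detect_hard_cuts(frames))           # -> [2]
```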
Soft cuts (fade-out/fade-in, dissolve, ...)
The basic idea here is a twin threshold:
- a first threshold (t_c) detects hard cuts
- a second threshold (t_s) detects the beginning of a potential soft cut
The frame where t_s is exceeded serves as the reference frame for the following frames. If t_c is then
exceeded (with respect to the reference frame), it is a soft cut; if the difference stays below t_c
long enough, we assume there is no cut.
Another approach is to model soft cuts explicitly: e.g., fade-out/fade-in has a very characteristic
change of profile in the intensity histogram.
Key Frame Extraction
We could just extract an arbitrary image:
+ very simple
+ often describes the shot very well (especially if the shot contains no movement and the camera is steady)
- if the camera moves, only a certain perspective is depicted
- movements of objects lead to arbitrary representations
A better approach is to stitch images together into a panoramic image. The main issue is that the
camera plane changes as well → non-linear transformations.


Logical and Abstract Features

There are three promising approaches; usually one finds a combination of them:
1. Model based: model the object and try to find the model within the image
2. Neural networks: the model is learned with a training set, then the network is applied to detect
   instances in the data stream
3. Classification: with the help of clustering/classification algorithms (k-means, support vector
   machines) an existing, annotated DB is separated into partitions according to the annotations

Logical Features for Images

- Modelling with body plans: the object is modeled based on cylindrical areas and special classifiers
  defined for them
- Neural network for normalized face detection: the image needs to be scaled and rotated, and
  normalized with respect to color, intensity and texture; then the network can detect any face in the
  image by sliding a recognition window across the image

Logical Features for Audio Data

- Speaker recognition: allows queries for something somebody has said (e.g., George W. Bush on global warming)
- Analysis of the video's audio stream: e.g., the noise level in sports games when something important happens

Logical Features for Video Data

- Movements of objects: feasible by comparing the shapes of objects between subsequent frames; already
  available in MPEG-4 (which distinguishes between (steady) background and (moving) foreground objects)
  o Camera movements
  o Time-space relationships: can answer queries like "has person A met person B?"
  o Speaker recognition: as in audio retrieval, but can be enhanced with face recognition methods and
    subtitle analysis

Abstract features
This area is not very well understood. The main problem is how to model an event, or emotions; with
emotions it is even harder since they are very subjective (although some well-known stereotypes exist).
E.g., from a movie soundtrack one could recognize emotions or also locations (certain movies feature
themes which are played at the same locations).


Similarity Search
We already learned how to evaluate simple queries in the previous chapter (remember the distance
functions and how to map them to similarity scores). Now we take a deeper look at more complex queries.

Complex Queries
1. Several reference objects: determine the k distances of an object to the k reference objects and
   combine them with a distance combining function.
2. Concerning features:
   a. Different features (including meta data): determine the J distances of an object for the J
      features and combine them with a distance combining function. Important: the distances need to
      be normalized (Gaussian normalization) before they can be combined. (If the query consists of
      several reference objects, they do not need to share the same feature space.)
   b. Keywords:
      o keyword query returns distances → same as different features
      o keyword query returns an RSV → dist = -rsv + const (such that all distances are > 0)
      o keyword query filters out objects → same as predicates
3. Predicates (on meta data, or on classifiers): if the predicate is fulfilled, pass the similarity
   score on as it is, else set it to 0.
Examples of distance combining functions (see the small sketch below):
- maximum (fuzzy-logic "and"): $d = \max_j d_j$
- minimum (fuzzy-logic "or"): $d = \min_j d_j$
- average: $d = \frac{1}{J}\sum_j d_j$
- weighted average: $d = \sum_j w_j\,d_j$ (w: weighting vector)
These are used both for combining the distances to several reference objects (1) and for combining the
distances over different features (2a).
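
The example combining functions as a small Python sketch (the numeric values are arbitrary):

```python
def combine_max(distances):                 # fuzzy-logic "and"
    return max(distances)

def combine_min(distances):                 # fuzzy-logic "or"
    return min(distances)

def combine_avg(distances):
    return sum(distances) / len(distances)

def combine_weighted(distances, weights):   # weighted average
    return sum(w * d for w, d in zip(weights, distances))

# three normalized distances of one object, e.g. to three reference images
dists = [0.2, 0.7, 0.4]
print(combine_max(dists), combine_min(dists),
      combine_avg(dists), combine_weighted(dists, [0.5, 0.3, 0.2]))
```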

Weighting
- Weights on distance and RSV functions: the weights are applied directly to the components of the
  vectors, e.g., $\delta_w(x, y) = \sum_j w_j\,|x_j - y_j|$ for a weighted L1 distance.
- Weights on distances (distance combining functions): the (weighted) average carries over directly,
  but weighted min/max cannot be applied directly. A weighted combining function $D_w$ needs to remain
  monotonic: if $x_1 \ge x_1'$ and $x_2 \ge x_2'$, then $D_w(x_1, x_2) \ge D_w(x_1', x_2')$.

Summary of Normalization
All component distances are normalized (Gaussian normalization) before they are combined; a quadratic
distance function must include the normalization in its matrix A.

Index Structure and Algorithms


We assume a nearest neighbor search with a single reference object in a single feature space without
predicates and keywords and focus on the question how to index objects in order to find a solution
as efficiently as possible in high-dimensional spaces.
1. Quadtree
   A two-dimensional space is divided with two orthogonal lines into four areas NE, NW, SE, SW.
   Splitting is recursively applied → hierarchical structure.
   a. Point quadtree: a new data point splits a leaf into 4 regions (the point is the center of the split)
   b. Region quadtree: leaves are split at the center of the space → simplification of the algorithms
2. k-d-tree
   Each node contains a point which divides a single axis into two parts → two sub-spaces.
   The axis to be split is chosen depending on the depth level in the tree.
   Variants: splitting along an arbitrary axis at an arbitrary point; balanced trees (similar to a B-tree).
3. Gridfile
   The space is divided by a regular grid → rectangular cells. Each cell gets a data page, or several
   cells share the same data page. If a page overflows, a new grid line is added to split that cell
   (other cells are split as well!).
4. Space-Filling Curves
   The space is divided with a regular mesh and the cells are enumerated by a space-filling curve
   (Hilbert, Z-ordering, ...). A data point receives the key of its surrounding cell. All data points
   are then stored in a B-tree ordered by their keys.
5. Voronoi Diagrams
   Split the space into subspaces such that each subspace contains exactly those points lying closest
   to a data point → nearest neighbor for free. Two problems:
   a. the computation is expensive
   b. identifying the subspace of a query point is not trivial (O(log n) steps in 2D, more in higher dimensions)

6. R-Tree
   aims at a hierarchical, perfectly balanced structure
   Leaf nodes:  data points and their MBR (Minimal Bounding Region)
   Inner nodes: own MBR and the MBRs of the child nodes
   Insertion:
   1. search a path to a leaf node (non-deterministic)
   2. insert the point (potentially extend the MBR if the point lies outside the MBR)
      a. In case of overflow, split the leaf node into 2 new leaf nodes. Splitting should
         establish two non-overlapping MBRs → split along a single axis.
         If the parent node overflows, propagate the split upwards (split MBRs often overlap)
      b. If the root node is split, a new root node is created with 2 new inner nodes
   Searching a path from root to leaf: if the new point lies
      outside all MBRs → select the node with the MBR closest to the point
      inside exactly one MBR → follow the corresponding child node
      inside several MBRs → choose a child node according to a well-chosen metric
   To search for an existing point, one needs to follow all nodes whose MBR contains the point.
There exist a couple of extensions for R-Trees:
   Shape of MBRs
      o SS/SS+-tree: circular areas
      o TV-tree: variable number of dimensions for indexing, hyper spheres
   Splitting
      o SS+-tree: clustering algorithms (k-means) to split overflowed nodes
      o R*-tree: like the R-tree, but to reduce overlaps a number of points are reinserted
      o X-tree: optimize splitting to avoid overlaps as much as possible
   Size of nodes
      o X-tree: super nodes; if a split is not meaningful (big overlap), the node isn't split
        but extended to cover several pages
      o DABS-tree: optimize page sizes with the help of a cost model each time a split occurs
        (very large pages)
   Metrics
      Most structures discussed so far use an Lp norm (e.g. L2)
      o M-tree: hierarchical structure for arbitrary metrics. Stored objects don't have
        to be points; it may also contain strings with edit distance as the metric
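
As mentioned for the space-filling curves above, a Z-ordering enumerates the cells of a regular grid so that points can be stored under one-dimensional keys in a B-tree. A minimal Python sketch of the key computation (bit interleaving); quantizing to a fixed number of bits per axis is an assumption for illustration:

def z_order_key(point, bits_per_axis=8):
    # quantize each coordinate in [0, 1) to an integer cell index, then interleave
    # the bits of all axes -> one one-dimensional key per grid cell
    cells = [min(int(c * (1 << bits_per_axis)), (1 << bits_per_axis) - 1) for c in point]
    key = 0
    for bit in range(bits_per_axis - 1, -1, -1):        # most significant bit first
        for cell in cells:
            key = (key << 1) | ((cell >> bit) & 1)
    return key

# points close in space often (but not always) receive close keys:
print(z_order_key([0.12, 0.34]), z_order_key([0.13, 0.35]), z_order_key([0.90, 0.10]))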

Other Hierarchical Indexing Structures

Pyramid Tree
   index optimized for the Linf norm
   The data space is divided into pyramids (base equal to a facet of the data space, apex at the
   center of the data space). Each pyramid is further divided along its height from apex to base.
   The shells obtained this way are enumerated → data points can be mapped into a
   one-dimensional space.
P-Sphere Tree
   good alternative for vector spaces with 10 to 100 dimensions
   Two-level structure: the root contains a set of spherical MBRs; each MBR subsumes a set of
   points which are stored in the leaf nodes.
   A point can lie in several leaf nodes (redundancy) → NN queries can be answered with 1 lookup


Quadtree
   Advantages:    simple implementation
                  good performance in 2d (if the tree is balanced)
   Disadvantages: not useful in high-dimensional spaces (2^d subspaces from splitting)
                  usually not balanced (structure depends on the sequence of insertion of points)
                  structure for main memory (IO performance is poor)

k-d-tree
   Advantages:    simple implementation
                  good performance in 2d (if the tree is balanced)
   Disadvantages: usually not balanced (structure depends on the sequence of insertion of points)

Gridfile
   Advantages:    simple implementation (except delete and update for a balanced structure)
                  good performance: at most 2 disc accesses for a simple point query
   Disadvantages: structure quickly becomes unbalanced
                  poor performance on clustered sets with large empty spaces → empty cells
                  size of the directory grows linearly with the number of points

Space-Filling Curves
   Advantages:    simple implementation
                  existing index structures for 1d data sets can be re-used (DB functionality)
   Disadvantages: in higher-dimensional spaces, the spatial relationship is hardly visible in
                  the keys (already for 2d near the center of the space)

Voronoi Diagrams
   Advantages:    good retrieval performance in low-dimensional spaces
   Disadvantages: computationally expensive, large storage costs
                  retrieval time not considerably better than that of competing methods

R-Tree
   Advantages:    always perfectly balanced (height logarithmic in the number of points)
                  implementations usually optimized for secondary storage (inner nodes kept in
                  main memory, leaves loaded from disc)
                  good performance in low-dimensional spaces
                  bulk-load algorithms exist to efficiently build a tree from scratch
   Disadvantages: overlapping MBRs (the R-tree should be rebuilt with a bulk-load algorithm from
                  time to time)
                  very bad performance in high-dimensional spaces
                  basic operations quite expensive

NN-Search in Hierarchical Structures


HS-Algorithm: priority queue of nodes/points, with priority given by the distance to the query
point, ordered by increasing distance
1. Initialization: the root node is added to the queue
2. As long as the queue is not empty, read out the top element:
   a. If the top element is an inner node, insert all its child nodes into the queue
      (priority given by the minimal distance between query and MBR)
   b. If the top element is a leaf node, insert all its data points into the queue
   c. If the top element is a data point → nearest neighbor found
(This algorithm reads exactly those leaves whose MBRs intersect the NN sphere)
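
A compact Python sketch of the HS algorithm (the rectangular MBR layout, the node attributes and the mindist helper are simplifying assumptions):

import heapq, itertools

def mindist(query, mbr):
    # minimal distance between the query point and a rectangular MBR given as
    # (lower_corner, upper_corner); 0 if the query lies inside the MBR
    lo, hi = mbr
    return sum(max(l - q, 0.0, q - h) ** 2 for q, l, h in zip(query, lo, hi)) ** 0.5

def hs_nn_search(root, query, dist):
    tie = itertools.count()                      # tie-breaker so the heap never compares nodes
    heap = [(0.0, next(tie), "inner", root)]     # 1. initialization: root node in the queue
    while heap:                                  # 2. read out elements by increasing distance
        d, _, kind, item = heapq.heappop(heap)
        if kind == "inner":                      # a. inner node -> insert children, priority = MINDIST
            for child in item.children:
                heapq.heappush(heap, (mindist(query, child.mbr), next(tie), child.kind, child))
        elif kind == "leaf":                     # b. leaf node -> insert its data points
            for p in item.points:
                heapq.heappush(heap, (dist(query, p), next(tie), "point", p))
        else:                                    # c. data point on top -> nearest neighbor found
            return item, d
    return None, float("inf")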


The Curse of high Dimensionality


Assumptions:
   closed data space of the shape of a (unit) hyper cube
   data points uniformly distributed in this data space
Observation:
   probability that a data point lies inside a subspace = volume of that subspace

Peculiarities:
1. Bad intuition for high-dimensional spaces
   A circle that fills a 2-dim space quite well does not necessarily do so in higher dimensions
   (given the same center and radius)
2. Partitioning becomes meaningless in higher dimensions
   If we partition each axis of a high-dimensional space into two parts, we obtain 2^d cells with
   on average N/2^d points per cell → most cells will be empty
3. Where are all the data points?
   A hyper cube with edge length 0.95 fills a large percentage of the 2-dim space, but in
   high-dimensional spaces its volume is almost negligible (0.95^d = 0.0059 for d = 100). We may
   also state that all points are close to the boundary of the data space, not in the center:
   the probability that a point lies within distance epsilon of the boundary is
   P = 1 - (1 - 2*epsilon)^d.
   Already for small epsilon and d = 100, this probability is (almost) 1!
4. The nearest neighbor is always far away
   If we put a circle with r = 0.5 in the center of a 2-dim space, we can safely assume that the
   nearest neighbor lies within it. This does not hold in higher dimensions: for d = 10 this
   hypersphere only has a volume of 0.002, and for d = 100 the volume is only 1.9*10^-70.
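
These numbers are easy to reproduce; a small Python check (the value of epsilon is chosen only for illustration):

import math

d = 100
print(0.95 ** d)                          # volume of a hyper cube with edge 0.95: ~0.0059
eps = 0.05                                # illustrative choice
print(1 - (1 - 2 * eps) ** d)             # P(point within eps of the boundary): ~1.0
for dim in (2, 10, 100):
    ball = math.pi ** (dim / 2) / math.gamma(dim / 2 + 1) * 0.5 ** dim
    print(dim, ball)                      # volume of the r=0.5 hypersphere: 0.785, ~0.002, ~1.9e-70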
Cost Model for NN-Search (using HS-algorithm)

Expected NN-distance
For a given query point and radius r, determine the probability that the NN lies within the sphere
around the query point with radius r. This defines a distribution function over the random variable r
→ the expected value of r corresponds to the expected NN-distance for that query point.
The expected NN-distance for an arbitrary point in the data space is then the mean of the expected
NN-distances over all points in the data space.
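
A naive Monte Carlo sketch of this idea (uniform data, Euclidean distance; the sample sizes are arbitrary assumptions):

import math, random

def expected_nn_distance(d, n_points=1000, n_queries=200):
    # mean nearest-neighbor distance of random query points to a uniform data set in [0,1]^d
    data = [[random.random() for _ in range(d)] for _ in range(n_points)]
    queries = [[random.random() for _ in range(d)] for _ in range(n_queries)]
    return sum(min(math.dist(q, p) for p in data) for q in queries) / n_queries

for d in (2, 10, 50):
    print(d, round(expected_nn_distance(d), 3))    # the NN-distance grows quickly with d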

Cost model
   CPU costs are negligible; IO costs only incur when visiting leaf nodes
   (inner nodes are cached in main memory)
      f        average fill level of the leaf nodes (in %)
      b        size of a leaf node (page)
      v        % of leaf nodes read
      16 MB/s  average IO speed (sequential transfer)
      7 ms     average random disc access time
   The sequential scan reads all leaf pages at the sequential transfer rate; the tree search pays
   one random disc access plus one page transfer for each visited leaf. The tree search is faster
   only if the random accesses for the v% of visited leaves cost less than one sequential pass
   over all pages. If we assume realistic leaf node sizes and fill levels, v must be considerably
   smaller than 20% (e.g. v < 5%).
Assume that during the splits d' of the d axes have been split in the middle. If we define
lmax = sqrt(d')/2 as the maximum distance between a point in the space and a leaf node, we see that
e.g. for d = 100, d' = 15 and N = 10^6, lmax is smaller than the expected NN-distance → each query
point is closer to every leaf node than to its NN → for each query, all leaves are visited!
If we consider arbitrarily shaped MBRs: estimate the NN-distance and extend the MBRs by it →
the probability that a leaf node is visited is the percentage of points that lie inside the
extended MBR.
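
A small Python sketch of this comparison (the page size and the 16 MB/s / 7 ms device parameters from above are used as assumptions):

def scan_vs_tree(n_leaves, leaf_kb=8, v=0.05, transfer_mb_s=16.0, seek_ms=7.0):
    page_ms = leaf_kb / 1024 / transfer_mb_s * 1000       # transfer time of one leaf page
    seq_ms = n_leaves * page_ms                           # sequential scan over all pages
    tree_ms = v * n_leaves * (seek_ms + page_ms)          # random access + transfer per visited leaf
    return seq_ms, tree_ms

for v in (0.20, 0.05, 0.01):                              # break-even: v = page_ms / (seek_ms + page_ms)
    print(v, scan_vs_tree(100_000, v=v))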
Remarks:
the 20% threshold is already exceeded at dimensionalities below 40
for each hierarchical structure there is a dimensionality beyond which the sequential scan is faster
measurements have shown that the sequential scan can already be faster at d = 10

Vector Approximation File


The VA-File quantizes vectors to reduce the amount of data to be read, but in such a way that it is
still able to identify the correct nearest neighbor. With this, search times can be reduced by a
factor of 4-8.
Structure: two files
   vector file
      o OID: 4 bytes, for relating vectors to objects in the DB
      o the d components of the vector (each 4 bytes)
   approximation file: bit string of a certain length
Headers describe the data, such as #dimensions, length of the approximations, #entries, averages.
New points are usually inserted at the end of the files. If we remove a point earlier, the empty slots
need to be filled eventually. To delete a point, we set its entries to special NULL values so that
they do not influence the search.

Computation of approximations
A (not necessarily regular) grid (defined by the marks' positions) divides the data space into
rectangular cells. Along each axis there are 2^b_i slices (b_i = #bits assigned to that axis). The
individual slices are represented with b_i bits and are enumerated in increasing order → for each
cell we get a unique bit-string.
For a query, the approximation file is read sequentially, and for all bit-strings that have to be
considered more closely the whole vector needs to be read (random access in the vector file).
Determining the marks' positions:
   They should be chosen such that each slice contains about the same number of points. We compute
   the distribution function along each axis and determine the marks such that the integral over
   each slice leads to about the same number.
Determining the number of bits:
   usually 4 or 8 bits per dimension to obtain a good compromise between space and efficiency
We can also compute a lower and an upper bound on the distance of a point to a query from the
approximation alone: per dimension, lBnd uses the distance from the query component to the nearest
boundary of the point's cell (0 if the query falls into the same slice), uBnd the distance to the
farthest boundary.
With this information we can filter out vectors, e.g. for NN-search: if for some other point p' we
have uBnd(p', q) < lBnd(p, q), then p cannot be the NN.
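
A small Python sketch of the approximation and of the lBnd/uBnd computation (the regular marks, 2 bits per axis, and components in [0, 1) are assumptions for illustration):

import bisect, math

def approximation(vector, marks):
    # marks[i]: sorted mark positions along axis i; the slice index of a component is the
    # number of marks at or below it, minus one -> one small integer (b_i bits) per dimension
    return [bisect.bisect_right(marks[i], x) - 1 for i, x in enumerate(vector)]

def bounds(query, approx, marks):
    # lower/upper bound of the distance between the query and any vector inside the cell
    lo = hi = 0.0
    for i, (q, cell) in enumerate(zip(query, approx)):
        left, right = marks[i][cell], marks[i][cell + 1]
        l = 0.0 if left <= q <= right else min(abs(q - left), abs(q - right))
        u = max(abs(q - left), abs(q - right))
        lo += l * l
        hi += u * u
    return math.sqrt(lo), math.sqrt(hi)

marks = [[0.0, 0.25, 0.5, 0.75, 1.0]] * 2        # regular grid, 2 bits per axis
a = approximation([0.6, 0.1], marks)
print(a, bounds([0.3, 0.3], a, marks))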

Following are a few search algorithms that use VA-Files:
Range search
   reads all approximations sequentially; whenever the lower bound of an approximation does not
   exceed the query radius, the exact vector is read and checked
NNSearchSSA (CPU-optimized)
   simple search algorithm
   reads all approximations and keeps the k best data points seen so far and their distances to
   the query. If the current approximation has a lower bound < the largest of these distances,
   the vector data is read to compute the exact distance.
NNSearchNOA (IO-optimized)
   near-optimal algorithm
   extension of NNSearchSSA that minimizes the number of data vectors to be read. It completely
   separates the two phases:
   1. a point becomes a candidate if its lower bound, at the time of reading, is smaller than the
      k-th best upper bound seen so far
   2. the exact vectors are read in increasing order of their lower bounds; the phase ends as soon
      as the lower bound exceeds the k-th best exact distance seen so far
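
A sketch of NNSearchSSA in Python (the file-access callbacks and parameter names are assumptions; lbnd and dist correspond to the bounds and distance computation above):

import heapq

def nn_search_ssa(query, approx_file, read_vector, dist, lbnd, k=1):
    # approx_file: iterable of (oid, approximation); read_vector(oid) loads the exact vector
    best = []                                   # heap of (-distance, oid): the k best so far
    for oid, a in approx_file:
        worst = -best[0][0] if len(best) == k else float("inf")
        if lbnd(query, a) < worst:              # could still beat the current k-th best
            d = dist(query, read_vector(oid))   # random access into the vector file
            heapq.heappush(best, (-d, oid))
            if len(best) > k:
                heapq.heappop(best)
    return sorted((-nd, oid) for nd, oid in best)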

Evaluation of Complex Queries


Several reference objects (Ciaccia 1998)
During the NN search, all reference objects are considered at once and the algorithms operate with
overall distances → compute overall distances for data points and for MBRs. The overall distance of
an MBR has to guarantee that no point inside the MBR has a smaller overall distance → combining the
per-reference-object MINDISTs yields such a lower bound (due to the monotonicity assumption on the
distance combining function).
This means we need to adapt the search algorithms as follows:
   the NN-search is called with several reference objects
   computation of the overall distance as described earlier (distance combining functions, page 24)
   if lower (or upper) bounds on distances are computed, we combine the bounds on the distances to
   the individual reference objects with the distance combining function
Middleware Approach (Fagin 1996)
In principle this algorithm can handle arbitrarily complex queries, but it is outperformed except for
the case of queries with a single reference object across different features.
All sub-systems are supposed to deliver scores (making a correspondence function in the middleware
unnecessary) → we only need to combine the scores into an overall score.
The algorithm operates in three phases:
1. sorted access
   each sub-system delivers an ordered stream. The middleware reads a certain number of elements
   from each stream. This phase stops as soon as we have found k distinct elements that were
   returned by all sub-systems.

2. random access
   determine, for each retrieved object, the remaining unknown distances from all sub-systems
3. computation
   compute the overall distances for all objects in A (the set of objects seen during sorted
   access) and return the best k objects as the overall result
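
A compact sketch of these three phases in Python (the stream and random-access interfaces are assumptions for illustration):

def fagin_fa(streams, random_access, combine, k):
    # streams: one iterator per sub-system, yielding (object_id, score) in decreasing score
    # order (assumed long enough).  random_access(j, oid): score of oid in sub-system j.
    seen = {}                                            # oid -> {subsystem index: score}
    # 1. sorted access: read round-robin until k objects have been seen in ALL streams
    while sum(len(s) == len(streams) for s in seen.values()) < k:
        for j, stream in enumerate(streams):
            oid, score = next(stream)
            seen.setdefault(oid, {})[j] = score
    # 2. random access: fetch the missing scores of every object seen so far
    for oid, scores in seen.items():
        for j in range(len(streams)):
            scores.setdefault(j, random_access(j, oid))
    # 3. computation: combine the scores and return the k best objects
    def overall(oid):
        return combine([seen[oid][j] for j in range(len(streams))])
    return sorted(seen, key=overall, reverse=True)[:k]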

Predicates and Similarity Search (Chaudhuri 1996)


A first sub-system evaluates the similarity searches, a second sub-system evaluates the
predicate-based query.
   If the predicate is selective, first determine the set of objects that fulfill the predicate,
   and only evaluate the similarity scores for those objects.
   If the predicate is not selective, first evaluate the similarity search and then eliminate the
   objects that do not fulfill the predicate.
   Potentially, more results have to be retrieved from the sub-system to obtain at least k
   elements for the result list.
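
A minimal Python sketch of the two evaluation orders (the predicate and score callbacks are assumptions):

def predicate_plus_similarity(objects, predicate, score, k, selective):
    if selective:
        # filter first, then rank only the qualifying objects
        candidates = [o for o in objects if predicate(o)]
        return sorted(candidates, key=score, reverse=True)[:k]
    # rank first, then filter; may have to look at more than k ranked objects
    result = []
    for o in sorted(objects, key=score, reverse=True):
        if predicate(o):
            result.append(o)
            if len(result) == k:
                break
    return result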
Complex queries with the VA-File (Weber 2001)
Instead of solving the problem at a higher level, the indexing structures themselves are extended to
support complex queries right at the bottom → GeVAS (generalized VA-File search).
GeVAS reads approximations and vectors in parallel and combines the distances on the fly.
GeVAS treats predicates, like features, as an N-dimensional array (given N objects), where arr_i = 1
means that the i-th object fulfills the predicate. First the predicate is checked; if it is true, the
bounds are computed, else the object is discarded. To further reduce search costs, predicate
evaluation and VA-File similarity computation can be parallelized (requires synchronization).
   Textual sub-queries: Boolean queries act as predicates, signatures as approximations, and VSR/PR
   require an array filled with RSV values (much like the array for predicate values)
   Several reference objects: evaluated as proposed by Ciaccia
   Retrieval over several features: as described above
Comparison

Several reference objects
   Advantages:    simple implementation, minimal adaption
                  index structure queried only once for all objects
   Disadvantages: search no longer as selective as with just one reference object

Middleware Approach
   Advantages:    any complex query can be evaluated
                  if we use trees to index the feature data, this is the only way to determine
                  the result of complex queries over several features
                  rather simple implementation
   Disadvantages: search times can increase dramatically
                  many index structures or sub-systems do not allow for random accesses
                  sub-systems must maintain two different index structures to be prepared for the
                  1st and the 2nd phase

Predicates and Similarity Search
   Advantages:    if the predicate is selective or unselective, it performs very well
                  simple implementation possible
   Disadvantages: if the selectivity is not extreme, both strategies are useless and lead to a
                  large number of random accesses

Complex queries with the VA-File
   Advantages:    up to a factor of 100 faster than Ciaccia and over 1000 times faster than Fagin
                  simple implementation
   Disadvantages: retrieval costs increase linearly with the number of DB objects and the
                  complexity of the query
