
IMAGE MATCHING

- ALOK TALEKAR
- SAIRAM SUNDARESAN

11/23/2010

Presentation structure
1. Brief overview of the talk
2. What does object recognition involve?
3. The recognition problem
4. Mathematical background:
   1. Trees
   2. K-means
   3. Modifications to k-means
5. Results and comparisons
6. Application of the algorithms to mobile phones
7. Conclusion

How many object categories are there?


Motivation for Image Recognition


Image panoramas
Image watermarking
Global robot localization
Face Detection
Optical Character Recognition
Manufacturing Quality Control
Content-Based Image Indexing
Object Counting and Monitoring
Automated vehicle parking systems
Visual Positioning and tracking
Video Stabilization

What Challenges exist?

What other challenges exist?

What does object recognition involve?


Identification
Detection
Object categorization
Scene and context categorization
Tracking
Action Recognition
Events


Problem
Variations in:
Scale
Illumination
Rotation
Noise: images come from different sources.

Large number of objects and object categories.

Large variation in shapes.

Use better features: SURF, GoH.

TODAY'S TALK: scaling + compactness

Image Recognition


Mathematics
Voronoi diagrams and vector quantization
Trees: kd-trees, vocabulary trees
K-means clustering
Modifications of k-means

Voronoi diagram
Named after Georgy Voronoy.
Let P be a set of n distinct points (sites) in the plane.
The Voronoi diagram of P is the subdivision of the plane into n cells, one for each site.
A point q lies in the cell corresponding to a site $p_i \in P$ iff $\|q - p_i\| < \|q - p_j\|$ (Euclidean distance) for each $p_j \in P$, $j \neq i$.
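Operationally, deciding which cell a query point lies in is just a nearest-site test. A minimal NumPy sketch (the site coordinates are arbitrary illustrative values):

    import numpy as np

    def voronoi_cell(q, sites):
        """Index of the Voronoi cell containing q: the nearest site wins."""
        d = np.linalg.norm(sites - q, axis=1)  # Euclidean distance to every site
        return int(np.argmin(d))

    sites = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 1.0]])
    print(voronoi_cell(np.array([0.4, 0.2]), sites))  # prints 0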


Voronoi diagram
Choose a measure of distance.
Partition the space.

Its dual graph is the Delaunay triangulation.

The sites whose cells have semi-infinite edges are the vertices of the convex hull.
Storage is O(n) using a doubly connected edge list; construction requires O(n log n).

kd-trees are used for a similar purpose.
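If SciPy is available, its Qhull-backed wrapper illustrates the O(n log n) construction and the convex hull property above (a sketch, not part of the original slides):

    import numpy as np
    from scipy.spatial import Voronoi  # Qhull-based, O(n log n) construction

    points = np.random.rand(10, 2)           # 10 random sites in the unit square
    vor = Voronoi(points)

    print(vor.vertices)                      # Voronoi vertices
    print(vor.regions[vor.point_region[0]])  # vertex indices of site 0's cell;
                                             # a -1 entry marks a semi-infinite
                                             # edge, i.e. site 0 is on the hull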



Clustering
Iterative.

Agglomerative clustering: add a token to a cluster if the token is similar enough to an element of the cluster. Repeat.

Divisive clustering: split a cluster into subclusters if the tokens within the cluster are dissimilar enough. A boundary separates the subclusters based on similarity. Repeat.


Slide credit: Silvio Savarese
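A toy sketch of the agglomerative variant, assuming Euclidean similarity and a fixed threshold (both are illustrative choices, not from the slides):

    import numpy as np

    def agglomerate(tokens, threshold):
        """Greedy single-pass agglomerative clustering sketch.

        A token joins the first cluster that contains a member within
        `threshold` distance; otherwise it starts a new cluster.
        """
        clusters = []
        for t in tokens:
            for c in clusters:
                if any(np.linalg.norm(t - m) < threshold for m in c):
                    c.append(t)
                    break
            else:
                clusters.append([t])
        return clusters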


Hierarchical structure of clusters


Slide credit: Silvio Savarese


K-Means Clustering
Initialization: given K categories and N points in feature space, pick K points at random; these are the initial cluster centers (means) m1, ..., mK. Repeat the following:
1. Assign each of the N points, xj, to a cluster by the nearest mi.
2. Re-compute the mean mi of each cluster from its member points.
3. If no mean has changed by more than some threshold ε, stop.
Slide credit: Christopher Rasmussen
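A compact NumPy sketch of the loop above (initialization by sampling K data points, a fixed ε, and empty-cluster handling are kept minimal):

    import numpy as np

    def kmeans(X, K, eps=1e-4, rng=np.random.default_rng(0)):
        """Lloyd's algorithm as described above. X: (N, D) points."""
        means = X[rng.choice(len(X), K, replace=False)]  # K random initial means
        while True:
            # 1. assign each point to its nearest mean
            labels = np.argmin(
                np.linalg.norm(X[:, None] - means[None], axis=2), axis=1)
            # 2. recompute each mean from its member points
            new = np.array([X[labels == k].mean(axis=0)
                            if np.any(labels == k) else means[k]
                            for k in range(K)])
            # 3. stop once no mean has moved more than eps
            if np.max(np.linalg.norm(new - means, axis=1)) <= eps:
                return new, labels
            means = new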

Complexity
The problem is NP-hard; using heuristics, the algorithm is quick to converge.
There is no guarantee of convergence to the global optimum, and the result may depend on the initial clusters.
As the algorithm is usually very fast, it is common to run it multiple times with different starting conditions.
Similarity to EM: EM maintains probabilistic assignments to clusters instead of deterministic assignments, and multivariate Gaussian distributions instead of means.
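The multi-start strategy, sketched on top of the kmeans function above (scoring restarts by within-cluster sum of squares is a standard choice, not stated on the slide):

    import numpy as np

    def kmeans_restarts(X, K, trials=10):
        """Keep the best of several random k-means restarts."""
        best, best_cost = None, np.inf
        for seed in range(trials):
            means, labels = kmeans(X, K, rng=np.random.default_rng(seed))
            cost = np.sum((X - means[labels]) ** 2)  # within-cluster SSE
            if cost < best_cost:
                best, best_cost = (means, labels), cost
        return best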

Why Trees?
Quick to understand and implement.
Computational benefits:
Most algorithms (when implemented correctly) give O(n log n) performance.
The speedup comes from dividing the search space at each stage.


Building the Vocabulary Tree
(Figure: hierarchical quantization of the descriptor space with branch factor k = 3 and L = 2 levels.)

Slide credit: Tatiana Tommasi - Scalable Recognition with a Vocabulary Tree
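A recursive sketch of how such a tree could be built, reusing the kmeans routine above (the dictionary node layout and parameter names are illustrative):

    def build_vocab_tree(descriptors, k=3, depth=2):
        """Recursively run k-means to build one vocabulary tree node."""
        node = {"children": []}
        if depth == 0 or len(descriptors) < k:
            return node                      # leaf: a visual word
        means, labels = kmeans(descriptors, k)
        for i in range(k):
            child = build_vocab_tree(descriptors[labels == i], k, depth - 1)
            child["center"] = means[i]
            node["children"].append(child)
        return node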


Demo
http://www.visualthesaurus.com/


Describing an Image

Slide credit: Tatiana Tommasi - Scalable Recognition with a Vocabulary Tree

Definition of Scoring
n_i (query) and m_i (database): the number of descriptor vectors of each image with a path through node i.
N_i: the number of images in the database with at least one descriptor vector path through node i.
(Figure example: one node with N_i = 2, m_Img1 = 1, m_Img2 = 1; another with N_i = 1, m_Img1 = 2, m_Img2 = 0.)
Slide credit: Tatiana Tommasi - Scalable Recognition with a Vocabulary Tree


Definition of Scoring
Weights are assigned to each node (tf-idf).
Query and database vectors are defined according to their assigned weights.
Each database image is given a relevance score based on the normalized difference between the query and the database vectors (formulas below).
Slide credit: Tatiana Tommasi - Scalable Recognition with a Vocabulary Tree
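Written out, the tf-idf scoring from the vocabulary tree paper takes this form, with N the total number of database images (a later results slide notes that the L1 norm works best here):

    w_i = \ln\frac{N}{N_i}
    q_i = n_i\, w_i, \qquad d_i = m_i\, w_i
    s(q, d) = \left\| \frac{q}{\|q\|} - \frac{d}{\|d\|} \right\|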


Implementation of Scoring
Every node is associated with an inverted file.
Inverted files store the id numbers of the images in which a particular node occurs, together with the term frequency for that image (the entries "Img1, 1", "Img1, 2", and "Img2, 1" in the figure).
They decrease the fraction of images in the database that have to be explicitly considered for a query.
Slide credit: Tatiana Tommasi - Scalable Recognition with a Vocabulary Tree
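A minimal sketch of such an inverted index (the dictionary layout and function names are illustrative):

    from collections import defaultdict

    inverted = defaultdict(list)  # node id -> [(image id, term frequency)]

    def index_image(image_id, node_counts):
        """node_counts: {node id: number of descriptor paths through it}."""
        for node, tf in node_counts.items():
            inverted[node].append((image_id, tf))

    def candidate_images(query_nodes):
        """Only images sharing at least one node with the query get scored."""
        return {img for node in query_nodes for img, _ in inverted[node]}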


Database
Ground truth database: 6376 images.
Groups of four images of the same object, but under different conditions.
Each image in turn is used as the query image; the three remaining images from its group should be at the top of the query results.
Slide credit: Tatiana Tommasi - Scalable Recognition with a Vocabulary Tree


Modifications to k-Means
Earlier approaches require an additional O(K²) space for a speedup, which in general is impractical when clustering in the range of 1M.
Two additional methods were proposed:
AKM: Approximate k-Means
HKM: Hierarchical k-Means


AKM
Uses kd-trees.
Gives almost equivalent performance at smaller scales and outperforms at higher scales.
Saves time by using elements of randomness in choosing the nearest neighbor between points and means, rather than the exact nearest neighbor.
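A rough sketch of the AKM assignment step using SciPy's kd-tree as the approximate index (the published method uses a forest of randomized kd-trees; this single-tree version with a slack factor is a simplification):

    import numpy as np
    from scipy.spatial import cKDTree

    def akm_assign(X, means):
        """Approximate nearest-mean assignment via a kd-tree over the means."""
        tree = cKDTree(means)           # build the index over K means once
        # eps > 0 allows approximate answers: the returned mean is within
        # (1 + eps) of the true nearest distance, trading accuracy for speed
        _, labels = tree.query(X, eps=0.5)
        return labels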


Hierarchical K-Means (HKM)
Also called tree-structured vector quantization.
Here, all data points are first clustered into a small number of cluster centers (say K).
On the next level, k-means is applied again within each of the partitions independently, with the same number of cluster centers.
Thus at the nth level there are K^n clusters.
A new data point is assigned by descending the tree.
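Descending the tree is a short loop: at each level, follow the child with the nearest center. A sketch against the build_vocab_tree structure above:

    import numpy as np

    def assign(descriptor, node):
        """Descend the vocabulary tree; returns the path of child indices."""
        path = ()
        while node["children"]:
            i = min(range(len(node["children"])),
                    key=lambda j: np.linalg.norm(
                        descriptor - node["children"][j]["center"]))
            path += (i,)
            node = node["children"][i]
        return path  # identifies the leaf, i.e. the visual word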


Results on 1400 images (Nister and Stewenius)
The curves show the distribution of how far the wanted images drop in the query rankings.
A larger (hierarchical) vocabulary improves retrieval performance.
The L1 norm gives better retrieval performance than the L2 norm.
Entropy weighting is important, at least for smaller vocabularies.


Results on 6376 images
Performance increases significantly with the number of leaf nodes.
Performance increases with the branch factor k.


Results on 6376 images
Performance increases when the amount of training data grows.
Performance increases at the beginning when the number of training cycles grows, then reaches a plateau.


Results on 1 million images
Performance with respect to increasing database size, under three conditions:
- The vocabulary tree is defined with video frames separate from the database.
- Entropy weighting of the vocabulary tree is defined with video independent from the database.
- Entropy weighting is defined using the ground-truth target subset of images.

Applications


Conclusions from Nister & Stewenius's work
The methodology provides the ability to make fast searches on extremely large databases.
If repeatable, discriminative features can be obtained, then recognition can scale to very large databases using the vocabulary tree and indexing approach.


Results on comparing vocabularies (Sivic)
K-means vs. AKM:
For the small 5K dataset, performance is almost identical between AKM and exact k-means.
AKM differs in mAP by less than 1% and even outperforms exact k-means in two of the cases.


Results on Comparing Vocabularies
HKM vs. AKM:
The comparison is done in two ways: first, by comparing performance on the recognition benchmark introduced by Nister and Stewenius, and second, by comparing the performance of the two methods on the 5K dataset.
In the first evaluation, 10,200 images of the same scenes taken from different viewpoints are split into groups of four. A perfect result is to return, given a query, the other three images from that query's group before images from other groups.
For the same number of visual words, AKM outperforms HKM.
The same holds true in the second evaluation, where the 5K dataset is used.


Tabulated Results


Scaling up with AKM
The performance of AKM is evaluated for a number of different vocabulary sizes on the 5K, 5K+100K, and 5K+100K+1M datasets.
In going from the smallest dataset to the largest, there is a 226-fold increase in the number of images, but performance falls by just over 20%.


Outdoors Augmented Reality on Mobile Phone using Loxel-Based Visual Feature Organization

System Overview
Image matching on the cell phone

Data Organization
The server groups images by location

Data Reduction
The server clusters, prunes, and compresses descriptors

Image Matching Results
Qualitative and quantitative

System Timing
Low-latency image matching on the cell phone


Video of Results


System Overview
Since recognizing all the different locations just visually is a very complicated vision task, the authors simplify and use rough location as a cue.
The phone gets its position from GPS satellites (or by other means; it's good enough to know the rough location within a city block or two).
The phone prefetches data relevant to the current position.
Then the user snaps a photo.
The image is then matched against the prefetched data.
And the result is displayed to the user.


System Overview
(Diagram: the phone obtains its position via GPS and prefetches data from the server.)

Image Matching
How does the matching work? And what's in the prefetched data?
For example, in the next slide there are some database images taken from the user's location.
The prefetched data contains features from these images. Features are extracted from the query image and matched against the database of features.
The next step is to run geometric consistency (RANSAC) checks to eliminate false matches.
Finally, the image with the highest number of point matches is chosen.
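A hedged OpenCV sketch of this match-then-verify step. The paper uses SURF-64 descriptors; ORB is substituted here only because SURF lives in the patented opencv-contrib module, and the ratio-test threshold is illustrative:

    import cv2
    import numpy as np

    def match_and_verify(query_img, db_img):
        """Ratio-test matching followed by a RANSAC homography check."""
        orb = cv2.ORB_create()
        kq, dq = orb.detectAndCompute(query_img, None)
        kd, dd = orb.detectAndCompute(db_img, None)

        # ratio test: keep matches clearly better than the runner-up
        matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
        good = [p[0] for p in matcher.knnMatch(dq, dd, k=2)
                if len(p) == 2 and p[0].distance < 0.8 * p[1].distance]
        if len(good) < 4:
            return 0
        src = np.float32([kq[m.queryIdx].pt for m in good])
        dst = np.float32([kd[m.trainIdx].pt for m in good])
        # RANSAC keeps only geometrically consistent matches
        _, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
        return 0 if mask is None else int(mask.sum())  # inlier count = score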


Image Matching
(Pipeline figure: the query image yields SURF-64 descriptors, which are matched with a ratio test against the prefetched database images, then verified by an affine test and RANSAC.)

Data Organization & Loxels
The map is divided into grid cells called loxels (location cells).
Some of the closest loxels are kept in memory; this is called setting a kernel. In the example shown next, a kernel is a 3x3 set of loxels; in some other situations it is possible to have even more (e.g., 5x5).
As the user moves, the kernel is updated by loading in some new loxels and forgetting some of the earlier loxels (see the sketch below).
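A small sketch of how the kernel could be maintained as the user moves (the grid indexing is illustrative, and fetch_features stands in for a hypothetical server request):

    def kernel_cells(loxel, radius=1):
        """All loxels in the (2*radius+1) x (2*radius+1) kernel around loxel."""
        x, y = loxel
        return {(x + dx, y + dy)
                for dx in range(-radius, radius + 1)
                for dy in range(-radius, radius + 1)}

    def update_kernel(cache, new_loxel):
        """Load features of newly entered loxels; drop the ones left behind."""
        wanted = kernel_cells(new_loxel)
        for cell in set(cache) - wanted:
            del cache[cell]                     # forget old loxels
        for cell in wanted - set(cache):
            cache[cell] = fetch_features(cell)  # hypothetical server fetch
        return cache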


Data Organization
(Figure: a kernel of loxels on the map grid.)

The Overall System
The device sends its location to the server; the server returns the image features of the current kernel to the phone, where they are organized into a search structure (a kd-tree for fast approximate nearest neighbor (ANN) search).
The camera takes an image; the phone extracts features, compares them to the features in the cache, runs RANSAC to find out whether individually matched features are consistent with the other matches, and finally displays the information of the top-ranked image.
The database on the server is built by collecting a set of geotagged and POI (point-of-interest) tagged images, which are grouped together based on their location and allocated to loxels; features are extracted, and each image is matched against the other images in its loxel.
A RANSAC consistency check similar to the run-time check on the phone is run; features that tend to match features in other images easily are clustered, and features that cannot be matched are marked as low-priority features that are sent to the handset last, if at all. The features are compressed and stored in the database.

System Block Diagram

Server side: Geo-Tagged Images → Group Images by Loxel → Extract Features → Match Images → Geometric Consistency Check → Cluster Features → Prune Features → Compress Descriptors → Loxel-Based Feature Store → Network

Device side: Device Location → (via Network) Feature Cache + ANN; Camera Image → Extract Features → Compute Feature Matches → Geometric Consistency Check → Display Info for Top Ranked Image

Feature Descriptor Clustering
Match all images in a loxel.
Form a graph on the features.
Cut the graph into clusters.
Create representative meta-features.
(Figure: an example feature cluster and its meta-feature.)

Feature Descriptor Clustering
(Figure: images of the same landmark, showing a meta-feature versus a single feature.)

Database Feature Pruning
Rank images by number of meta-features.
Allocate an equal budget for each landmark.
Fill the budget with meta-features by rank.
Fill any remaining budget with single features.
This yields a 4x reduction.

Feature Descriptor Pruning
(Figure: matching results under per-landmark feature budgets of All, 500, 200, and 100.)

Feature Compression
Quantization: uniform, equal step size, 6 bits per dimension.
Entropy coding: Huffman tables, 12 different tables.
64-dimensional SURF: 256 bytes uncompressed, 37 bytes with compression.
(Figure: the original feature (Σdx, Σdy, Σ|dx|, Σ|dy| components) passes through quantization and entropy coding to give the compressed feature.)
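A sketch of the uniform quantization step. The [-1, 1] value range is an assumption for illustration; the paper's exact ranges and Huffman tables are not reproduced here:

    import numpy as np

    def quantize(descriptor, bits=6, lo=-1.0, hi=1.0):
        """Uniform scalar quantization: one 6-bit code per dimension."""
        levels = (1 << bits) - 1                  # 63 steps
        clipped = np.clip(descriptor, lo, hi)
        return np.round((clipped - lo) / (hi - lo) * levels).astype(np.uint8)

    def dequantize(codes, bits=6, lo=-1.0, hi=1.0):
        levels = (1 << bits) - 1
        return codes.astype(np.float32) / levels * (hi - lo) + lo

    # 64 dims x 6 bits = 48 bytes before entropy coding, versus 256 bytes
    # for raw floats; Huffman coding brings the average down to 37 bytes.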

Matching Results
(Figure: example queries with their rank 1-4 retrieved database images.)

Matching Performance
(Plot: true versus false matches, at roughly 10 and roughly 1000 images per loxel.)

Timing Analysis
Nokia N95: 332 MHz ARM, 64 MB RAM.
Upload: 100 KByte JPEG over a 60 Kbps uplink.
(Chart: latency broken down into feature extraction, feature matching, geometric consistency, and upload/download, for three configurations: all on phone, extract features on phone, all on server.)

Conclusions
Image matching on a mobile phone.
Loxels are used to reduce the search space.
27x reduction in the data sent to the phone, via clustering, pruning, and compression.
~3 seconds for image matching on the N95.

The END

Questions?

