
IMAGE MATCHING

- ALOK TALEKAR
- SAIRAM SUNDARESAN

11/23/2010

Presentation structure
1. Brief overview of the talk
2. What does object recognition involve?
3. The recognition problem
4. Mathematical background:
   1. Trees
   2. K-means
   3. Modifications to k-means
5. Results and comparisons
6. Application of the algorithms to mobile phones
7. Conclusion

How many object categories are there?


Motivation for Image Recognition


Image panoramas
Image watermarking
Global robot localization
Face Detection
Optical Character Recognition
Manufacturing Quality Control
Content-Based Image Indexing
Object Counting and Monitoring
Automated vehicle parking systems
Visual Positioning and tracking
Video Stabilization

What Challenges exist?

What other challenges exist?

What does object recognition involve?


Identification
Detection
Object categorization
Scene and context categorization
Tracking
Action Recognition
Events


Problem
Variations in:
Scale
Illumination
Rotation
Noise: images come from different sources.

Large number of objects and object categories.

Large variation in shapes.

Use better features: SURF, GoH.

TODAY'S TALK: scaling + compactness

Image Recognition


Mathematics
Voronoi diagrams and vector quantization
Trees: kd-trees, vocabulary trees
K-means clustering
Modifications of k-means

Voronoi diagram
Named after Georgy Voronoy.
Let P be a set of n distinct points (sites) in the plane.
The Voronoi diagram of P is the subdivision of the plane into n cells, one for each site.
A point q lies in the cell corresponding to a site $p_i \in P$ iff $\|q - p_i\| < \|q - p_j\|$ (Euclidean distance) for each $p_j \in P$, $j \neq i$.
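Operationally, deciding which cell a query point lies in is just a nearest-site test. A minimal NumPy sketch (the site coordinates are arbitrary illustrative values):

    import numpy as np

    def voronoi_cell(q, sites):
        """Index of the Voronoi cell containing q: the nearest site wins."""
        d = np.linalg.norm(sites - q, axis=1)  # Euclidean distance to every site
        return int(np.argmin(d))

    sites = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 1.0]])
    print(voronoi_cell(np.array([0.4, 0.2]), sites))  # prints 0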


Voronoi diagram
Choose a measure of distance.
Partition the space.

Its dual graph is the Delaunay triangulation.

The sites whose cells have semi-infinite edges are the vertices of the convex hull.
Storage is O(n) using a doubly connected edge list; construction requires O(n log n).

kd-trees are used for a similar purpose.
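If SciPy is available, its Qhull-backed wrapper illustrates the O(n log n) construction and the convex hull property above (a sketch, not part of the original slides):

    import numpy as np
    from scipy.spatial import Voronoi  # Qhull-based, O(n log n) construction

    points = np.random.rand(10, 2)           # 10 random sites in the unit square
    vor = Voronoi(points)

    print(vor.vertices)                      # Voronoi vertices
    print(vor.regions[vor.point_region[0]])  # vertex indices of site 0's cell;
                                             # a -1 entry marks a semi-infinite
                                             # edge, i.e. site 0 is on the hull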



Clustering
Iterative.

Agglomerative clustering: add a token to a cluster if the token is similar enough to an element of the cluster. Repeat.

Divisive clustering: split a cluster into subclusters if the tokens within the cluster are dissimilar enough. A boundary separates the subclusters based on similarity. Repeat.


Slide credit: Silvio Savarese
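A toy sketch of the agglomerative variant, assuming Euclidean similarity and a fixed threshold (both are illustrative choices, not from the slides):

    import numpy as np

    def agglomerate(tokens, threshold):
        """Greedy single-pass agglomerative clustering sketch.

        A token joins the first cluster that contains a member within
        `threshold` distance; otherwise it starts a new cluster.
        """
        clusters = []
        for t in tokens:
            for c in clusters:
                if any(np.linalg.norm(t - m) < threshold for m in c):
                    c.append(t)
                    break
            else:
                clusters.append([t])
        return clusters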


Hierarchical structure of clusters


Slide credit: Silvio Savarese


K-Means Clustering
Initialization: given K categories and N points in feature space, pick K points at random; these are the initial cluster centers (means) m1, ..., mK. Repeat the following:
1. Assign each of the N points, xj, to a cluster by the nearest mi.
2. Re-compute the mean mi of each cluster from its member points.
3. If no mean has changed by more than some threshold ε, stop.
Slide credit: Christopher Rasmussen
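A compact NumPy sketch of the loop above (initialization by sampling K data points, a fixed ε, and empty-cluster handling are kept minimal):

    import numpy as np

    def kmeans(X, K, eps=1e-4, rng=np.random.default_rng(0)):
        """Lloyd's algorithm as described above. X: (N, D) points."""
        means = X[rng.choice(len(X), K, replace=False)]  # K random initial means
        while True:
            # 1. assign each point to its nearest mean
            labels = np.argmin(
                np.linalg.norm(X[:, None] - means[None], axis=2), axis=1)
            # 2. recompute each mean from its member points
            new = np.array([X[labels == k].mean(axis=0)
                            if np.any(labels == k) else means[k]
                            for k in range(K)])
            # 3. stop once no mean has moved more than eps
            if np.max(np.linalg.norm(new - means, axis=1)) <= eps:
                return new, labels
            means = new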

Complexity
The problem is NP-hard; using heuristics, the algorithm is quick to converge.
There is no guarantee of convergence to the global optimum, and the result may depend on the initial clusters.
As the algorithm is usually very fast, it is common to run it multiple times with different starting conditions.
Similarity to EM: EM maintains probabilistic assignments to clusters instead of deterministic assignments, and multivariate Gaussian distributions instead of means.
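The multi-start strategy, sketched on top of the kmeans function above (scoring restarts by within-cluster sum of squares is a standard choice, not stated on the slide):

    import numpy as np

    def kmeans_restarts(X, K, trials=10):
        """Keep the best of several random k-means restarts."""
        best, best_cost = None, np.inf
        for seed in range(trials):
            means, labels = kmeans(X, K, rng=np.random.default_rng(seed))
            cost = np.sum((X - means[labels]) ** 2)  # within-cluster SSE
            if cost < best_cost:
                best, best_cost = (means, labels), cost
        return best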

Why Trees?
Quick to understand and implement.
Computational benefits:
Most algorithms (when implemented correctly) give O(n log n) performance.
The speedup comes from dividing the search space at each stage.


Building the Vocabulary Tree
(Figure: hierarchical quantization of the descriptor space with branch factor k = 3 and L = 2 levels.)

Slide credit: Tatiana Tommasi - Scalable Recognition with a Vocabulary Tree
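A recursive sketch of how such a tree could be built, reusing the kmeans routine above (the dictionary node layout and parameter names are illustrative):

    def build_vocab_tree(descriptors, k=3, depth=2):
        """Recursively run k-means to build one vocabulary tree node."""
        node = {"children": []}
        if depth == 0 or len(descriptors) < k:
            return node                      # leaf: a visual word
        means, labels = kmeans(descriptors, k)
        for i in range(k):
            child = build_vocab_tree(descriptors[labels == i], k, depth - 1)
            child["center"] = means[i]
            node["children"].append(child)
        return node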


Demo
http://www.visualthesaurus.com/


Describing an Image

Slide credit: Tatiana Tommasi - Scalable Recognition with a Vocabulary Tree

Definition of Scoring
n_i (query) and m_i (database): the number of descriptor vectors of each image with a path through node i.
N_i: the number of images in the database with at least one descriptor vector path through node i.
(Figure example: one node with N_i = 2, m_Img1 = 1, m_Img2 = 1; another with N_i = 1, m_Img1 = 2, m_Img2 = 0.)
Slide credit: Tatiana Tommasi - Scalable Recognition with a Vocabulary Tree


Definition of Scoring
Weights are assigned to each node (tf-idf).
Query and database vectors are defined according to their assigned weights.
Each database image is given a relevance score based on the normalized difference between the query and the database vectors (formulas below).
Slide credit: Tatiana Tommasi - Scalable Recognition with a Vocabulary Tree
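Written out, the tf-idf scoring from the vocabulary tree paper takes this form, with N the total number of database images (a later results slide notes that the L1 norm works best here):

    w_i = \ln\frac{N}{N_i}
    q_i = n_i\, w_i, \qquad d_i = m_i\, w_i
    s(q, d) = \left\| \frac{q}{\|q\|} - \frac{d}{\|d\|} \right\|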


Implementation of Scoring
Every node is associated with an inverted file.
Inverted files store the id numbers of the images in which a particular node occurs, together with the term frequency for that image (the entries "Img1, 1", "Img1, 2", and "Img2, 1" in the figure).
They decrease the fraction of images in the database that have to be explicitly considered for a query.
Slide credit: Tatiana Tommasi - Scalable Recognition with a Vocabulary Tree
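A minimal sketch of such an inverted index (the dictionary layout and function names are illustrative):

    from collections import defaultdict

    inverted = defaultdict(list)  # node id -> [(image id, term frequency)]

    def index_image(image_id, node_counts):
        """node_counts: {node id: number of descriptor paths through it}."""
        for node, tf in node_counts.items():
            inverted[node].append((image_id, tf))

    def candidate_images(query_nodes):
        """Only images sharing at least one node with the query get scored."""
        return {img for node in query_nodes for img, _ in inverted[node]}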


Database
Ground truth database: 6376 images.
Groups of four images of the same object, but under different conditions.
Each image in turn is used as the query image; the three remaining images from its group should be at the top of the query results.
Slide credit: Tatiana Tommasi - Scalable Recognition with a Vocabulary Tree


Modifications to k-Means
Earlier approaches require an additional O(K²) space for a speedup, which in general is impractical when clustering in the range of 1M.
Two additional methods were proposed:
AKM: Approximate k-Means
HKM: Hierarchical k-Means


AKM
Uses kd-trees.
Gives almost equivalent performance at smaller scales and outperforms at higher scales.
Saves time by using elements of randomness in choosing the nearest neighbor between points and means, rather than the exact nearest neighbor.
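A rough sketch of the AKM assignment step using SciPy's kd-tree as the approximate index (the published method uses a forest of randomized kd-trees; this single-tree version with a slack factor is a simplification):

    import numpy as np
    from scipy.spatial import cKDTree

    def akm_assign(X, means):
        """Approximate nearest-mean assignment via a kd-tree over the means."""
        tree = cKDTree(means)           # build the index over K means once
        # eps > 0 allows approximate answers: the returned mean is within
        # (1 + eps) of the true nearest distance, trading accuracy for speed
        _, labels = tree.query(X, eps=0.5)
        return labels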


Hierarchical K-Means (HKM)
Also called tree-structured vector quantization.
Here, all data points are first clustered into a small number of cluster centers (say K).
On the next level, k-means is applied again within each of the partitions independently, with the same number of cluster centers.
Thus at the nth level there are K^n clusters.
A new data point is assigned by descending the tree.
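Descending the tree is a short loop: at each level, follow the child with the nearest center. A sketch against the build_vocab_tree structure above:

    import numpy as np

    def assign(descriptor, node):
        """Descend the vocabulary tree; returns the path of child indices."""
        path = ()
        while node["children"]:
            i = min(range(len(node["children"])),
                    key=lambda j: np.linalg.norm(
                        descriptor - node["children"][j]["center"]))
            path += (i,)
            node = node["children"][i]
        return path  # identifies the leaf, i.e. the visual word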


Results on 1400 images (Nister and Stewenius)
The curves show the distribution of how far the wanted images drop in the query rankings.
A larger (hierarchical) vocabulary improves retrieval performance.
The L1 norm gives better retrieval performance than the L2 norm.
Entropy weighting is important, at least for smaller vocabularies.


Results on 6376 images
Performance increases significantly with the number of leaf nodes.
Performance increases with the branch factor k.


Results on 6376 images
Performance increases when the amount of training data grows.
Performance increases at the beginning when the number of training cycles grows, then reaches a plateau.


Results on 1 million images
Performance with respect to increasing database size, under three conditions:
- The vocabulary tree is defined with video frames separate from the database.
- Entropy weighting of the vocabulary tree is defined with video independent from the database.
- Entropy weighting is defined using the ground-truth target subset of images.

Applications


Conclusions from Nister & Stewenius's work
The methodology provides the ability to make fast searches on extremely large databases.
If repeatable, discriminative features can be obtained, then recognition can scale to very large databases using the vocabulary tree and indexing approach.


Results on comparing vocabularies (Sivic)
K-means vs. AKM:
For the small 5K dataset, performance is almost identical between AKM and exact k-means.
AKM differs in mAP by less than 1% and even outperforms exact k-means in two of the cases.


Results on Comparing Vocabularies
HKM vs. AKM:
The comparison is done in two ways: first, by comparing performance on the recognition benchmark introduced by Nister and Stewenius, and second, by comparing the performance of the two methods on the 5K dataset.
In the first evaluation, 10,200 images of the same scenes taken from different viewpoints are split into groups of four. A perfect result is to return, given a query, the other three images from that query's group before images from other groups.
For the same number of visual words, AKM outperforms HKM.
The same holds true in the second evaluation, where the 5K dataset is used.


Tabulated Results


Scaling up with AKM
The performance of AKM is evaluated for a number of different vocabulary sizes on the 5K, 5K+100K, and 5K+100K+1M datasets.
In going from the smallest dataset to the largest, there is a 226-fold increase in the number of images, but performance falls by just over 20%.


Outdoors Augmented Reality on Mobile Phone using Loxel-Based Visual Feature Organization

System Overview
Image matching on the cell phone

Data Organization
The server groups images by location

Data Reduction
The server clusters, prunes, and compresses descriptors

Image Matching Results
Qualitative and quantitative

System Timing
Low-latency image matching on the cell phone


Video of Results


System Overview
Since recognizing all the different locations just visually is a very complicated vision task, the authors simplify and use rough location as a cue.
The phone gets its position from GPS satellites (or by other means; it's good enough to know the rough location within a city block or two).
The phone prefetches data relevant to the current position.
Then the user snaps a photo.
The image is then matched against the prefetched data.
And the result is displayed to the user.


System Overview
(Diagram: the phone obtains its position via GPS and prefetches data from the server.)

Image Matching
How does the matching work? And what's in the prefetched data?
For example, in the next slide there are some database images taken from the user's location.
The prefetched data contains features from these images. Features are extracted from the query image and matched against the database of features.
The next step is to run geometric consistency (RANSAC) checks to eliminate false matches.
Finally, the image with the highest number of point matches is chosen.
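A hedged OpenCV sketch of this match-then-verify step. The paper uses SURF-64 descriptors; ORB is substituted here only because SURF lives in the patented opencv-contrib module, and the ratio-test threshold is illustrative:

    import cv2
    import numpy as np

    def match_and_verify(query_img, db_img):
        """Ratio-test matching followed by a RANSAC homography check."""
        orb = cv2.ORB_create()
        kq, dq = orb.detectAndCompute(query_img, None)
        kd, dd = orb.detectAndCompute(db_img, None)

        # ratio test: keep matches clearly better than the runner-up
        matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
        good = [p[0] for p in matcher.knnMatch(dq, dd, k=2)
                if len(p) == 2 and p[0].distance < 0.8 * p[1].distance]
        if len(good) < 4:
            return 0
        src = np.float32([kq[m.queryIdx].pt for m in good])
        dst = np.float32([kd[m.trainIdx].pt for m in good])
        # RANSAC keeps only geometrically consistent matches
        _, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
        return 0 if mask is None else int(mask.sum())  # inlier count = score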


Image Matching
(Pipeline figure: the query image yields SURF-64 descriptors, which are matched with a ratio test against the prefetched database images, then verified by an affine test and RANSAC.)

Data Organization & Loxels
The map is divided into grid cells called loxels (location cells).
Some of the closest loxels are kept in memory; this is called setting a kernel. In the example shown next, a kernel is a 3x3 set of loxels; in some other situations it is possible to have even more (e.g., 5x5).
As the user moves, the kernel is updated by loading in some new loxels and forgetting some of the earlier loxels (see the sketch below).
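A small sketch of how the kernel could be maintained as the user moves (the grid indexing is illustrative, and fetch_features stands in for a hypothetical server request):

    def kernel_cells(loxel, radius=1):
        """All loxels in the (2*radius+1) x (2*radius+1) kernel around loxel."""
        x, y = loxel
        return {(x + dx, y + dy)
                for dx in range(-radius, radius + 1)
                for dy in range(-radius, radius + 1)}

    def update_kernel(cache, new_loxel):
        """Load features of newly entered loxels; drop the ones left behind."""
        wanted = kernel_cells(new_loxel)
        for cell in set(cache) - wanted:
            del cache[cell]                     # forget old loxels
        for cell in wanted - set(cache):
            cache[cell] = fetch_features(cell)  # hypothetical server fetch
        return cache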


Data Organization
(Figure: a kernel of loxels on the map grid.)

The Overall System
The device sends its location to the server; the server returns the image features of the current kernel to the phone, where they are organized into a search structure (a kd-tree for fast approximate nearest neighbor (ANN) search).
The camera takes an image; the phone extracts features, compares them to the features in the cache, runs RANSAC to find out whether individually matched features are consistent with the other matches, and finally displays the information of the top-ranked image.
The database on the server is built by collecting a set of geotagged and POI (point-of-interest) tagged images, which are grouped together based on their location and allocated to loxels; features are extracted, and each image is matched against the other images in its loxel.
A RANSAC consistency check similar to the run-time check on the phone is run; features that tend to match features in other images easily are clustered, and features that cannot be matched are marked as low-priority features that are sent to the handset last, if at all. The features are compressed and stored in the database.

System Block Diagram

Server side: Geo-Tagged Images → Group Images by Loxel → Extract Features → Match Images → Geometric Consistency Check → Cluster Features → Prune Features → Compress Descriptors → Loxel-Based Feature Store → Network

Device side: Device Location → (via Network) Feature Cache + ANN; Camera Image → Extract Features → Compute Feature Matches → Geometric Consistency Check → Display Info for Top Ranked Image

Feature Descriptor Clustering
Match all images in a loxel.
Form a graph on the features.
Cut the graph into clusters.
Create representative meta-features.
(Figure: an example feature cluster and its meta-feature.)

Feature Descriptor Clustering
(Figure: images of the same landmark, showing a meta-feature versus a single feature.)

Database Feature Pruning
Rank images by number of meta-features.
Allocate an equal budget for each landmark.
Fill the budget with meta-features by rank.
Fill any remaining budget with single features.
This yields a 4x reduction.

Feature Descriptor Pruning
(Figure: matching results under per-landmark feature budgets of All, 500, 200, and 100.)

Feature Compression
Quantization: uniform, equal step size, 6 bits per dimension.
Entropy coding: Huffman tables, 12 different tables.
64-dimensional SURF: 256 bytes uncompressed, 37 bytes with compression.
(Figure: the original feature (Σdx, Σdy, Σ|dx|, Σ|dy| components) passes through quantization and entropy coding to give the compressed feature.)
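A sketch of the uniform quantization step. The [-1, 1] value range is an assumption for illustration; the paper's exact ranges and Huffman tables are not reproduced here:

    import numpy as np

    def quantize(descriptor, bits=6, lo=-1.0, hi=1.0):
        """Uniform scalar quantization: one 6-bit code per dimension."""
        levels = (1 << bits) - 1                  # 63 steps
        clipped = np.clip(descriptor, lo, hi)
        return np.round((clipped - lo) / (hi - lo) * levels).astype(np.uint8)

    def dequantize(codes, bits=6, lo=-1.0, hi=1.0):
        levels = (1 << bits) - 1
        return codes.astype(np.float32) / levels * (hi - lo) + lo

    # 64 dims x 6 bits = 48 bytes before entropy coding, versus 256 bytes
    # for raw floats; Huffman coding brings the average down to 37 bytes.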

Matching Results
(Figure: example queries with their rank 1-4 retrieved database images.)

Matching Performance
(Plot: true versus false matches, at roughly 10 and roughly 1000 images per loxel.)

Timing Analysis
Nokia N95: 332 MHz ARM, 64 MB RAM.
Upload: 100 KByte JPEG over a 60 Kbps uplink.
(Chart: latency broken down into feature extraction, feature matching, geometric consistency, and upload/download, for three configurations: all on phone, extract features on phone, all on server.)

Conclusions
Image matching on a mobile phone.
Loxels are used to reduce the search space.
27x reduction in the data sent to the phone, via clustering, pruning, and compression.
~3 seconds for image matching on the N95.

The END

Questions?

