Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Alfred Kobsa
University of California, Irvine, CA, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
TU Dortmund University, Germany
Madhu Sudan
Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max Planck Institute for Informatics, Saarbruecken, Germany
Andrew Fitzgibbon
Svetlana Lazebnik
Pietro Perona
Yoichi Sato
Cordelia Schmid (Eds.)
Computer Vision
ECCV 2012
12th European Conference on Computer Vision
Florence, Italy, October 7-13, 2012
Proceedings, Part III
Volume Editors
Andrew Fitzgibbon
Microsoft Research Ltd., Cambridge, CB3 0FB, UK
E-mail: awf@microsoft.com
Svetlana Lazebnik
University of North Carolina, Dept. of Computer Science
Chapel Hill, NC 27599, USA
E-mail: lazebnik@cs.unc.edu
Pietro Perona
California Institute of Technology
Pasadena, CA 91125, USA
E-mail: perona@caltech.edu
Yoichi Sato
The University of Tokyo, Institute of Industrial Science
Tokyo 153-8505, Japan
E-mail: ysato@iis.u-tokyo.ac.jp
Cordelia Schmid
INRIA, 38330 Montbonnot, France
E-mail: cordelia.schmid@inria.fr
CR Subject Classification (1998): I.4.6, I.4.8, I.4.1-5, I.4.9, I.5.2-4, I.2.10, I.3.5, F.2.2
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer. Violations are liable
to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply,
even in the absence of a specific statement, that such names are exempt from the relevant protective laws
and regulations and therefore free for general use.
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Foreword
The European Conference on Computer Vision is one of the top conferences for
researchers in this field and is held biennially in alternation with the International
Conference on Computer Vision. It was first held in 1990 in Antibes (France), with
subsequent conferences in Santa Margherita Ligure (Italy) in 1992, Stockholm
(Sweden) in 1994, Cambridge (UK) in 1996, Freiburg (Germany) in 1998, Dublin
(Ireland) in 2000, Copenhagen (Denmark) in 2002, Prague (Czech Republic) in
2004, Graz (Austria) in 2006, Marseille (France) in 2008, and Heraklion (Greece)
in 2010. To our great delight, the 12th conference was held in Florence, Italy.

ECCV has an established tradition of very high scientific quality and an
overall duration of one week. ECCV 2012 began with a keynote lecture from the
honorary chair, Tomaso Poggio. The main conference followed over four days
with 40 orals, 368 posters, 22 demos, and 12 industrial exhibits. There were also
9 tutorials and 21 workshops held before and after the main event. For this event
we introduced some novelties, including innovations in the review policy, the
publication of a conference booklet with all paper abstracts, and full video
recording of the oral presentations.

This conference is the result of a great deal of hard work by many people,
who have been working enthusiastically since our first meetings in 2008. We are
particularly grateful to the Program Chairs, who handled the review of about
1500 submissions and coordinated the efforts of over 50 area chairs and about
1000 reviewers (see details of the process in their preface to the proceedings). We
are also indebted to all the other chairs who, with the support of our research
teams (names listed below), diligently helped us manage all aspects of the main
conference, tutorials, workshops, exhibits, demos, proceedings, and web presence.
Finally, we thank our generous sponsors and Consulta Umbria for handling the
registration of delegates and all financial aspects associated with the conference.

We hope you enjoyed ECCV 2012. Benvenuti a Firenze!
in person), jointly discussed all their papers, and made acceptance/rejection
decisions. Thus, the reviews and consolidation reports for each paper were carefully
examined by three ACs, ensuring a fair and thorough assessment. A program
chair assisted in each AC triplet meeting to maintain consistency in the decision
process and to provide any necessary support. Furthermore, each triplet
recommended a small number of top-ranked papers (typically one to three) for
oral presentation, and the program chairs took these candidates and made the
final oral vs. poster decisions.

Double-blind reviewing policies were strictly maintained throughout the entire
process: neither the area chairs nor the reviewers knew the identity of the
authors, and the authors did not know the identity of the reviewers and ACs.
Based on feedback from authors, reviewers, and area chairs, we believe we
successfully maintained the integrity of the paper selection process, and we are very
excited about the quality of the resulting program.
We wish to thank everyone involved for their time and dedication to making
the ECCV 2012 program possible. The success of ECCV 2012 relied entirely on
the time and effort invested by the authors in producing high-quality research,
on the care taken by the reviewers in writing thorough and professional reviews,
and on the commitment of the area chairs to reconciling the reviews and writing
detailed and precise consolidation reports. We also wish to thank the general
chairs, Roberto Cipolla, Carlo Colombo, and Alberto Del Bimbo, and the other
organizing committee members for their top-notch handling of the event.
Finally, we would like to commemorate Mark Everingham, whose untimely
death has shocked and saddened the entire vision community. Mark was an area
chair for ECCV and also an organizer for one of the workshops; his hard work and
dedication were absolutely essential in enabling us to put together a high-quality
conference program. We salute his record of exemplary service and intellectual
contributions to the discipline of computer vision. Mark, you will be missed!
General Chairs
Roberto Cipolla University of Cambridge, UK
Carlo Colombo University of Florence, Italy
Alberto Del Bimbo University of Florence, Italy
Program Coordinator
Pietro Perona California Institute of Technology, USA
Program Chairs
Andrew Fitzgibbon Microsoft Research, Cambridge, UK
Svetlana Lazebnik University of Illinois at Urbana-Champaign, USA
Yoichi Sato The University of Tokyo, Japan
Cordelia Schmid INRIA, Grenoble, France
Honorary Chair
Tomaso Poggio Massachusetts Institute of Technology, USA
Tutorial Chairs
Emanuele Trucco University of Dundee, UK
Alessandro Verri University of Genoa, Italy
Workshop Chairs
Andrea Fusiello University of Udine, Italy
Vittorio Murino Istituto Italiano di Tecnologia, Genoa, Italy
Demonstration Chair
Rita Cucchiara University of Modena and Reggio Emilia, Italy
Web Chair
Marco Bertini University of Florence, Italy
Publicity Chairs
Terrance E. Boult University of Colorado at Colorado Springs, USA
Tat Jen Cham Nanyang Technological University, Singapore
Marcello Pelillo University Ca Foscari of Venice, Italy
Publication Chair
Massimo Tistarelli University of Sassari, Italy
Local Committee
Lamberto Ballan Giuseppe Lisanti
Laura Benassi Iacopo Masi
Marco Fanfani Fabio Pazzaglia
Andrea Ferracani Federico Pernici
Claudio Guida Lorenzo Seidenari
Lea Landucci Giuseppe Serra
Area Chairs
Simon Baker Microsoft Research, USA
Horst Bischof Graz University of Technology, Austria
Michael Black Max Planck Institute, Germany
Richard Bowden University of Surrey, UK
Michael S. Brown National University of Singapore, Singapore
Joachim Buhmann ETH Zurich, Switzerland
Alyosha Efros Carnegie Mellon University, USA
Mark Everingham University of Leeds, UK
Pedro Felzenszwalb Brown University, USA
Reviewers
Vitaly Ablavsky Alexander Berg Tae Choe
Lourdes Agapito Tamara Berg Ondrej Chum
Sameer Agarwal Hakan Bilen Albert C.S. Chung
Amit Agrawal Matthew Blaschko John Collomosse
Karteek Alahari Michael Bleyer Tim Cootes
Karim Ali Liefeng Bo Florent Couzine-Devy
Saad Ali Daniele Borghesani David Crandall
S. Ali Eslami Terrance Boult Keenan Crane
Daniel Aliaga Lubomir Bourdev Antonio Criminisi
Neil Alldrin Y-Lan Boureau Shengyang Dai
Marina Alterman Kevin Bowyer Dima Damen
Jose M. Alvarez Edmond Boyer Larry Davis
Brian Amberg Steven Branson Andrew Davison
Cosmin Ancuti Mathieu Bredif Fernando De la Torre
Juan Andrade William Brendel Joost de Weijer
Mykhaylo Andriluka Michael Bronstein Teofilo deCampos
Anton Andriyenko Gabriel Brostow Vincent Delaitre
Elli Angelopoulou Matthew Brown Amael Delaunoy
Roland Angst Thomas Brox Andrew Delong
Relja Arandjelovic Marcus Brubaker David Demirdjian
Helder Araujo Darius Burschka Jia Deng
Pablo Arbelaez Tiberio Caetano Joachim Denzler
Antonis Argyros Barbara Caputo Konstantinos Derpanis
Kalle Astrom Stefan Carlsson Chaitanya Desai
Vassilis Athitsos Gustavo Carneiro Thomas Deselaers
Josep Aulinas Joao Carreira Frederic Devernay
Shai Avidan Yaron Caspi Thang Dinh
Tamar Avraham Carlos Castillo Santosh Kumar Divvala
Yannis Avrithis Jan Cech Piotr Dollar
Yusuf Aytar Turgay Celik Justin Domke
Luca Ballan Ayan Chakrabarti Gianfranco Doretto
Lamberto Ballan Tat Jen Cham Matthijs Douze
Atsuhiko Banno Antoni Chan Tom Drummond
Yinzge Bao Manmohan Chandraker Lixin Duan
Adrian Barbu Ming-Ching Chang Olivier Duchenne
Nick Barnes Lin Chen Zoran Duric
Joao Pedro Barreto Xilin Chen Pinar Duygulu
Adrien Bartoli Daozheng Chen Charles Dyer
Arslan Basharat Wen-Huang Cheng Sandra Ebert
Dhruv Batra Yuan Cheng Michael Elad
Sebastiano Battiato Tat-Jun Chin James Elder
Jean-Charles Bazin Han-Pang Chiu Ehsan Elhamifar
Fethallah Benmansour Minsu Cho Ian Endres
Gold Sponsors
Silver Sponsors
Bronze Sponsors
Institutional Sponsors
Table of Contents
Poster Session 3
Polynomial Regression on Riemannian Manifolds . . . . . . . . . . . . . . . . . . . . . 1
Jacob Hinkle, Prasanna Muralidharan, P. Thomas Fletcher, and
Sarang Joshi
A Discrete Chain Graph Model for 3d+t Cell Tracking with High
Misdetection Robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
Bernhard X. Kausler, Martin Schiegg, Bjoern Andres,
Martin Lindner, Ullrich Koethe, Heike Leitte, Jochen Wittbrodt,
Lars Hufnagel, and Fred A. Hamprecht
Poster Session 4
Renormalization Returns: Hyper-renormalization and Its
Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384
Kenichi Kanatani, Ali Al-Sharadqah, Nikolai Chernov, and
Yasuyuki Sugaya
1 Introduction
The study of the relationship between measured data and known descriptive
variables is known as the field of regression analysis. As with most statistical
techniques, regression analyses can be broadly divided into two classes: parametric
and non-parametric. The most widely known parametric regression methods
are linear and polynomial regression in Euclidean space, wherein a linear or
polynomial function is fit in a least-squares fashion to observed data. Such methods
are the staple of modern data analysis. The most common non-parametric
regression approaches are kernel-based methods and spline smoothing, which
provide much more flexibility in the class of regression functions. However,
their non-parametric nature presents a challenge for inference: for example,
performing a hypothesis test to determine whether the trend for one group of data
is significantly different from that of another group.
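For Euclidean data, the parametric fit just described is a few lines of linear algebra; the sketch below (synthetic data and coefficients of our own choosing, purely for illustration) fits a quadratic by least squares via the Vandermonde design matrix:

```python
import numpy as np

# Synthetic data: noisy samples of the quadratic 2 - t + 3t^2 (illustration only).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 50)
y = 2.0 - t + 3.0 * t**2 + rng.normal(scale=0.01, size=t.size)

# Least-squares fit: build the Vandermonde design matrix [1, t, t^2]
# and solve for the coefficients.
X = np.vander(t, N=3, increasing=True)
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coeffs)  # close to [2, -1, 3]
```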
Fundamental to the analysis of anatomical imaging data within the framework
of computational anatomy is the analysis of transformations and shape, which
are best represented as elements of Riemannian manifolds. Classical regression
suffers from the limitation that the data must lie in a Euclidean vector space.
When data are known to lie in a Riemannian manifold, one approach is to map
the manifold into a Euclidean space and perform conventional Euclidean analysis.
Such extrinsic or coordinate-based methods inevitably lead to complications
[1]. For example, given simple angular data, averaging in the Euclidean setting
leads to nonsensical results which depend on the chosen coordinates; in the usual
coordinates, the mean of 359 degrees and 1 degree is 180 degrees. In the case
of shape analysis, regression against landmark data in a Euclidean setting, even
after Procrustes alignment (an intrinsic manifold-based technique), will introduce
scale and rotation in the regression function, and hence does not capture ...
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 1–14, 2012.
© Springer-Verlag Berlin Heidelberg 2012
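The angular-data example above is easy to reproduce numerically; this sketch (not from the paper) contrasts the naive coordinate mean of 359 and 1 degrees with the intrinsic mean computed on the circle itself:

```python
import numpy as np

angles_deg = np.array([359.0, 1.0])

# Naive Euclidean mean: nonsensical and coordinate-dependent.
naive_mean = angles_deg.mean()  # 180.0

# Intrinsic mean: average the corresponding unit vectors on the circle,
# then map the result back to an angle.
theta = np.deg2rad(angles_deg)
mean_vec = np.array([np.cos(theta).mean(), np.sin(theta).mean()])
circ_mean = np.rad2deg(np.arctan2(mean_vec[1], mean_vec[0])) % 360.0

print(naive_mean, circ_mean)  # 180.0 vs approximately 0 (mod 360)
```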
2 Methods
2.1 Preliminaries
Let (M, g) be a Riemannian manifold [13]. For each point p ∈ M, the metric
g determines an inner product on the tangent space T_p M as well as a way to
differentiate vector fields with respect to one another. That derivative is referred
to as the covariant derivative and is denoted by ∇. If v, w are smooth vector fields
on M and γ : [0, T] → M is a smooth curve on M, then the covariant derivative
satisfies the product rule

\frac{d}{dt}\langle v(\gamma(t)), w(\gamma(t))\rangle = \langle \nabla_{\dot\gamma} v(\gamma(t)), w(\gamma(t))\rangle + \langle v(\gamma(t)), \nabla_{\dot\gamma} w(\gamma(t))\rangle. \quad (1)
Geodesics γ : [0, T] → M are characterized by the conservation of kinetic energy
along the curve:

\frac{d}{dt}\langle \dot\gamma, \dot\gamma\rangle = 0 = 2\langle \nabla_{\dot\gamma}\dot\gamma, \dot\gamma\rangle, \quad (2)

which is satisfied by the covariant differential equation (cf. [13])

\nabla_{\dot\gamma}\dot\gamma = 0. \quad (3)

This differential equation, called the geodesic equation, uniquely determines
geodesics parametrized by the initial conditions (γ(0), γ̇(0)) ∈ TM. The mapping
from the tangent space at p into the manifold is called the exponential map,
Exp_p : T_p M → M. The exponential map is injective on a zero-centered ball B
in T_p M of some non-zero radius. Thus for a point q in the image of B under
Exp_p there exists a unique vector v ∈ T_p M corresponding to a minimal length
path under the exponential map from p to q. The mapping of such points q to
their associated tangent vectors v at p is called the log map of q at p, denoted
v = Log_p q.
Given a curve γ : [0, T] → M, we will want to relate tangent vectors at different
points along the curve. These relations are governed infinitesimally by the
covariant derivative ∇. A vector field v is parallel transported along γ if it satisfies
the parallel transport equation, ∇_{γ̇} v(γ(t)) = 0. Notice that the geodesic
equation is a special case of parallel transport, in which we require that the velocity
γ̇ is parallel along the curve itself.
for all times t ∈ [0, T], with initial conditions γ(0) and v_i(0), i = 1, …, k. As
with polynomials in Euclidean space, a k-th order polynomial is parametrized by
the (k + 1) initial conditions at t = 0.
Riemannian polynomials are alternatively described using the rolling maps of
Jupp & Kent [3]. Suppose γ(t) is a smooth curve in M. Then consider embedding
M in m-dimensional Euclidean space and rolling M along a d-dimensional plane
in R^m such that the plane is tangent to M at γ(t) and the plane does not slip or
twist with respect to M while rolling. The property that γ is a k-th order
polynomial is equivalent to the corresponding curve in R^d being a k-th order polynomial
in the conventional sense. For more information regarding this connection to
Jupp & Kent's rolling maps, as well as a comparison to Noakes' cubic splines,
the reader is referred to the literature of Leite & Krakowski [14].
The covariant differential equation governing the evolution of Riemannian
polynomials is linearized in the same way that a Euclidean ordinary differential
equation is linearized. Introducing vector fields v_1(t), …, v_k(t) ∈ T_{γ(t)}M, we
write the system of covariant differential equations as

\dot\gamma(t) = v_1(t), \quad (5)
\nabla_{\dot\gamma} v_i(t) = v_{i+1}(t), \quad i = 1, \ldots, k-1, \quad (6)
\nabla_{\dot\gamma} v_k(t) = 0. \quad (7)

Given data points y_j ∈ M observed at times t_j, the regression polynomial is
found by minimizing the objective

E_0(\gamma(0), v_1(0), \ldots, v_k(0)) = \frac{1}{N}\sum_{j=1}^{N} d(\gamma(t_j), y_j)^2 \quad (8)

subject to Eqs. 5–7. Note that in this expression d represents the geodesic
distance: the minimum length of paths from the curve point γ(t_j) to the data point
y_j. The function E_0 is minimized in order to find the optimal initial conditions
γ(0), v_i(0), i = 1, …, k, which we will refer to as the parameters of our model.
In order to determine the optimal parameters γ(0), v_i(0), i = 1, …, k, we
introduce Lagrange multiplier vector fields λ_i, i = 0, …, k, often called the adjoint
variables, and define the unconstrained objective function

E(\gamma, \{v_i\}, \{\lambda_i\}) = \frac{1}{N}\sum_{j=1}^{N} d(\gamma(t_j), y_j)^2 + \int_0^T \langle \lambda_0(t), \dot\gamma(t) - v_1(t)\rangle\, dt
\qquad\qquad + \sum_{i=1}^{k-1}\int_0^T \langle \lambda_i(t), \nabla_{\dot\gamma} v_i(t) - v_{i+1}(t)\rangle\, dt + \int_0^T \langle \lambda_k(t), \nabla_{\dot\gamma} v_k(t)\rangle\, dt. \quad (9)
As is standard practice, the optimality conditions for this equation are obtained
by taking variations with respect to all arguments of E, integrating by parts
when necessary. The variations with respect to the adjoint variables
yield the original dynamic constraints: the polynomial equations. Variations with
respect to the primal variables give rise to a system of equations, termed the
adjoint equations, in which R is the Riemannian curvature tensor and a Dirac
delta functional indicates that the order-zero adjoint variable takes on jump
discontinuities at time points where data are present. The curvature tensor R
satisfies the standard formula [13] and can be computed in closed form for many
manifolds. Gradients of E with respect to initial and final conditions give rise
to the terminal endpoint conditions for the adjoint variables, as well as
expressions for the gradients with respect to the parameters γ(0) and v_i(0).
In order to determine the value of the adjoint vector fields at t = 0, and thus
the gradients of the functional E_0, the adjoint variables are initialized to zero at
time T, and the adjoint equations are then evolved backward in time to t = 0.

Given the gradients with respect to the parameters, a simple steepest descent
algorithm is used to optimize the functional. At each iteration, γ(0) is updated
using the exponential map and the vectors v_i(0) are updated via parallel
translation, where ParTrans_p(u, v) denotes the parallel translation of a vector u at p
for unit time along a geodesic in the direction of v.
Note that in the special case of a zero-order polynomial (k = 0), the only
gradient, λ_0, is simply the mean of the log map vectors at the current estimate
of the Fréchet mean, so this method generalizes the common method of Fréchet
averaging on manifolds [15]. The curvature term, in the case k = 1, indicates
that λ_1 is a sum of Jacobi fields, so this approach subsumes geodesic regression
as presented by Fletcher [4]. For higher order polynomials, the adjoint equations
represent a generalized Jacobi field.
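As a concrete instance of the k = 0 case, the following sketch computes a Fréchet mean on the unit sphere S² by repeatedly averaging log-map vectors and shooting back with the exponential map (the closed-form sphere maps are standard; the data and helper names are our own):

```python
import numpy as np

def sphere_exp(p, v):
    """Exponential map on the unit sphere S^2."""
    theta = np.linalg.norm(v)
    if theta < 1e-12:
        return p
    return np.cos(theta) * p + np.sin(theta) * v / theta

def sphere_log(p, q):
    """Log map on S^2: the tangent vector at p pointing toward q."""
    w = q - np.dot(p, q) * p                 # component of q orthogonal to p
    nw = np.linalg.norm(w)
    if nw < 1e-12:
        return np.zeros(3)
    return np.arccos(np.clip(np.dot(p, q), -1.0, 1.0)) * w / nw

def frechet_mean(points, iters=100):
    """Fixed-point iteration: shoot along the mean of the log-map vectors."""
    mu = points[0]
    for _ in range(iters):
        grad = np.mean([sphere_log(mu, y) for y in points], axis=0)
        mu = sphere_exp(mu, grad)
    return mu

# Points clustered symmetrically around the north pole (synthetic data).
pts = np.array([[0.1, 0.0, 1.0], [-0.1, 0.0, 1.0],
                [0.0, 0.1, 1.0], [0.0, -0.1, 1.0]])
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
mu = frechet_mean(pts)
print(mu)  # approximately the north pole [0, 0, 1]
```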
In practice, it is often not necessary to explicitly compute the curvature terms.
In the case that the manifold M embeds into a Hilbert space, the extrinsic
adjoint equations can be computed by taking variations in the ambient space,
using standard methods. Such an approach gives rise to the regression algorithm
found in Niethammer et al. [5], for example.
compute the variance of the data. The natural choice of total variance statistic
is the Fréchet variance, defined by

\mathrm{Var}\{y_j\} = \min_{y \in M} \frac{1}{N} \sum_{j=1}^{N} d(y, y_j)^2. \quad (16)

Note that the Fréchet mean \bar{y} itself is the 0-th order polynomial regression
against the data \{y_j\}, and the variance is the value of the cost function E_0 at
that point. We define the sum of squared error for a curve γ as the value E_0(γ):

\mathrm{SSE} = \frac{1}{N} \sum_{j=1}^{N} d(\gamma(t_j), y_j)^2, \quad (17)

and the coefficient of determination as R^2 = 1 - \mathrm{SSE}/\mathrm{Var}\{y_j\}.
Any regression polynomial will have SSE < Var{y_j}, so R^2 will have a value
between 0 and 1, with 1 indicating a perfect fit and 0 indicating that the curve
provides no better fit than does the Fréchet mean.
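On the real line, where geodesic distance reduces to |x − y| and the Fréchet mean to the ordinary mean, the R² statistic built from SSE and Var (using the standard definition R² = 1 − SSE/Var) can be checked directly (synthetic data, purely illustrative):

```python
import numpy as np

# Euclidean stand-in: on the real line the geodesic distance is |x - y| and
# the Frechet mean is the ordinary mean, so R^2 = 1 - SSE/Var reduces to the
# classical coefficient of determination.
rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 30)
y = 1.0 + 2.0 * t + rng.normal(scale=0.05, size=t.size)

var = np.mean((y - y.mean()) ** 2)               # Frechet variance
coeffs = np.polyfit(t, y, 1)                     # the "geodesic" (linear) fit
sse = np.mean((np.polyval(coeffs, t) - y) ** 2)  # cost at the optimum
r2 = 1.0 - sse / var
print(r2)  # close to 1 for this strongly linear trend
```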
3 Examples
3.1 Lie Groups
A special case arises when the manifold M is a Lie group and the metric g is
left-invariant. In this case, left-translation by γ(t) and γ^{-1}(t), denoted L_γ,
L_{γ^{-1}}, allows one to isometrically represent tangent vectors at γ(t) as tangent
vectors at the identity, using the pushforward L_{γ^{-1}*} [6]. The tangent space at
the identity is isomorphically identified with the Lie algebra of the group, g. The
algebraic operation in g is called the infinitesimal adjoint action, ad : g × g → g.
Fixing X, the adjoint of the operator ad_X satisfies ⟨ad†_X Y, Z⟩ = ⟨Y, ad_X Z⟩.
We will refer to the operator ad†_X as the adjoint-transpose action of X.

Suppose X is a vector field along the curve γ and X_c = L_{γ^{-1}*} X ∈ g is its
representative at the identity. Similarly, γ̇_c = L_{γ^{-1}*} γ̇ is the representative
of the curve's velocity in the Lie algebra. Then the covariant derivative of X
along γ evolves in the Lie algebra via

L_{\gamma^{-1}*}\nabla_{\dot\gamma} X = \dot X_c + \frac{1}{2}\left(\mathrm{ad}_{\dot\gamma_c} X_c - \mathrm{ad}^\dagger_{X_c}\dot\gamma_c - \mathrm{ad}^\dagger_{\dot\gamma_c} X_c\right). \quad (19)

In the special case when X = γ̇, so that X_c = γ̇_c, setting this covariant
derivative equal to zero yields the geodesic equation:

\frac{d}{dt}\dot\gamma_c = \mathrm{ad}^\dagger_{\dot\gamma_c}\dot\gamma_c, \quad (20)
which in this form is often referred to as the Euler–Poincaré equation [6,16]. Also
note that if the metric is both left- and right-invariant, then the adjoint-transpose
action is alternating and we can simplify the covariant derivative further. Parallel
transport in this case requires solution of the equation

\frac{dX_c}{dt} + \frac{1}{2}\,\mathrm{ad}_{\dot\gamma_c} X_c = 0,

which can be integrated using Euler integration or similar time-stepping algorithms, so
long as the adjoint action can be easily computed. Because of the simplified
covariant derivative formula for a bi-invariant metric, the curvature also takes a simple
form [13]: R(X, Y)Z = \frac{1}{4}\,\mathrm{ad}_Z\,\mathrm{ad}_X Y.
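The transport equation above can be integrated with forward Euler on so(3), identified with R³ so that the adjoint action is the cross product; the velocity, step size, and tolerance below are illustrative choices, not the paper's:

```python
import numpy as np

# Parallel transport along a geodesic of a bi-invariant metric:
# dX/dt + (1/2) ad_w X = 0, with ad_w X = w x X on so(3) ~ R^3.
w = np.array([0.0, 0.0, 1.0])   # constant body velocity of the geodesic
X = np.array([1.0, 0.0, 0.0])   # tangent vector being transported

dt, T = 1e-4, 1.0
for _ in range(int(T / dt)):
    X = X - dt * 0.5 * np.cross(w, X)   # forward Euler step

# The exact solution rotates X about w by -T/2, preserving its norm;
# Euler integration reproduces this up to discretization error.
print(X, np.linalg.norm(X))
```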
The Lie Group SO(3). In order to illustrate the algorithm in a Lie group, we
consider the Lie group SO(3) of orthogonal 3-by-3 real matrices with determinant
one. We review the basics of SO(3) here, but the reader is encouraged to consult [6]
for a full treatment of SO(3) and a derivation of the Euler–Poincaré equation there.
The Lie algebra for the rotation group is so(3), the set of all skew-symmetric
3-by-3 real matrices. Such matrices can be identified with vectors in R^3 by the
identification

x = \begin{pmatrix} a \\ b \\ c \end{pmatrix} \mapsto \Lambda(x) = \begin{pmatrix} 0 & -c & b \\ c & 0 & -a \\ -b & a & 0 \end{pmatrix}, \quad (21)

under which the infinitesimal adjoint action takes the convenient form of the
cross-product of vectors:

\mathrm{ad}_x y = x \times y = \Lambda(x)y, \qquad \Lambda(x \times y) = [\Lambda(x), \Lambda(y)]. \quad (22)
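The identification in Eqs. (21)–(22) is easy to verify numerically (the helper name `hat` is ours):

```python
import numpy as np

def hat(x):
    """The so(3) hat map: R^3 -> 3x3 skew-symmetric matrices, Eq. (21)."""
    a, b, c = x
    return np.array([[0.0, -c, b],
                     [c, 0.0, -a],
                     [-b, a, 0.0]])

x = np.array([1.0, 2.0, 3.0])
y = np.array([-0.5, 0.25, 4.0])

# ad_x y = x cross y = Lambda(x) y, and Lambda(x cross y) = [Lambda(x), Lambda(y)].
assert np.allclose(hat(x) @ y, np.cross(x, y))
assert np.allclose(hat(np.cross(x, y)), hat(x) @ hat(y) - hat(y) @ hat(x))
print("hat-map identities hold")
```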
With the R^3 representation of vectors in so(3), a natural choice of bi-invariant
metric is the Euclidean inner product. For a unit vector p on the sphere, the
exponential map, log map, and curvature tensor take the closed forms

\mathrm{Exp}_p v = \cos\theta\, p + \sin\theta\, \frac{v}{\theta}, \quad \theta = \|v\|, \quad (25)

\mathrm{Log}_p q = \theta\, \frac{q - (p^T q)p}{\|q - (p^T q)p\|}, \quad \theta = \cos^{-1}(p^T q), \quad (26)

R(u, v)w = (u^T w)v - (v^T w)u. \quad (27)
Using this exponential map and parallel transport, a polynomial integrator was
implemented. Figure 1 shows the result of integrating polynomials of order one,
two, and three, using the equations from this section, for the special case of a
two-dimensional sphere.
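A minimal sketch of such an integrator's building blocks on the sphere: the exponential map of Eq. (25) plus the closed-form parallel transport (the transport formula is standard but not printed in the text; helper names are ours), with checks of the properties a polynomial integrator relies on:

```python
import numpy as np

def exp_map(p, v):
    """Sphere exponential map, Eq. (25)."""
    th = np.linalg.norm(v)
    return p if th < 1e-12 else np.cos(th) * p + np.sin(th) * v / th

def par_trans(p, u, v):
    """Parallel-translate tangent u at p along the geodesic t -> exp_map(p, t*v)."""
    th = np.linalg.norm(v)
    if th < 1e-12:
        return u
    vh = v / th
    c = np.dot(vh, u)                       # component of u along the geodesic
    # The orthogonal part is unchanged; the along-geodesic part rotates with it.
    return u - c * vh + c * (np.cos(th) * vh - np.sin(th) * p)

p = np.array([0.0, 0.0, 1.0])
v = np.array([np.pi / 4, 0.0, 0.0])         # shoot 45 degrees toward +x
u = np.array([0.0, 1.0, 0.0])               # tangent vector orthogonal to v

q = exp_map(p, v)
ut = par_trans(p, u, v)                     # stays unit length, tangent at q
vt = par_trans(p, v, v)                     # the transported velocity at q
print(np.linalg.norm(ut), np.dot(ut, q), np.dot(vt, q))
```

Transport is an isometry and keeps vectors tangent, and transporting the velocity along its own geodesic yields the velocity at the endpoint, exactly the "geodesic = parallel velocity" characterization above.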
Fig. 2. Bookstein rat calvarium data after uniform scaling and Procrustes alignment.
The colors of dots indicate age group, while the colors of lines indicate the order of
polynomial used for the regression (black = geodesic, blue = quadratic, red = cubic).
Zoomed views of individual rectangles are shown at bottom. Note that the axes are
arbitrary, due to the scale-invariance of Kendall shape space.
For a general shape space Σ^d_m, the exponential map and parallel translation
are performed using representatives in the preshape space, S^{d(m-1)-1}. For d > 2, this
must be done in a time-stepping algorithm, in which at each time step an
infinitesimal spherical parallel transport is performed, followed by the horizontal
projection. The resulting algorithm can be used to compute the exponential map
as well. Computation of the log map is less trivial, as it requires an iterative
optimization routine. However, a special case arises when d = 2. In this
case the exponential map, parallel transport, and log map have closed forms. The
reader is encouraged to consult [4] for more details about the two-dimensional
case. With the exponential map, log map, and parallel transport implemented as
described here, one can perform polynomial regression on Kendall shape space
in any dimension. We demonstrate such regression now on examples in both two
and three dimensions.
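For the closed-form planar (d = 2) case, shapes can be represented as centered, unit-norm complex vectors, with the rotation quotient handled by the modulus of the Hermitian inner product; the sketch below (our own helper names and toy shapes) computes the resulting Kendall shape distance:

```python
import numpy as np

def preshape(z):
    """Center and scale planar landmarks (complex m-vector) onto the preshape sphere."""
    z = z - z.mean()
    return z / np.linalg.norm(z)

def shape_dist(z1, z2):
    """Kendall shape distance for planar shapes: the optimal rotation is
    solved in closed form by the modulus of the Hermitian inner product."""
    p, q = preshape(z1), preshape(z2)
    return np.arccos(np.clip(abs(np.vdot(p, q)), -1.0, 1.0))

square = np.array([0, 1, 1 + 1j, 1j], dtype=complex)
# A translated, rotated, and scaled copy of the same square.
copy = 2.5 * np.exp(1j * 0.7) * square + (3 - 2j)
kite = np.array([0, 1, 1 + 2j, -0.5 + 1j], dtype=complex)

print(shape_dist(square, copy))   # ~ 0: similarity transforms are quotiented out
print(shape_dist(square, kite))   # > 0: a genuinely different shape
```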
4 Results
4.1 Rat Calvarium Growth
The first dataset we consider was first analyzed by Bookstein [11]. The data are
available for download at http://life.bio.sunysb.edu/morph/data/datasets.html
and consist of m = 8 landmarks on a midsagittal section of rat calvaria
(upper skulls). Landmark positions are available for 18 rats at 8 ages apiece.
Riemannian polynomials of orders k = 0, 1, 2, 3 were fit to these data. The resulting
curves are shown in Fig. 2. Clearly the quadratic and cubic curves differ from that
of the geodesic regression. The R^2 values agree with this qualitative difference:
the geodesic regression has R^2 = 0.79, while the quadratic and cubic regressions
have R^2 values of 0.85 and 0.87, respectively (see Table 1). While this shows
that there is a clear improvement in the fit due to increasing k from one to two, it
also shows that little is gained by further increasing the order of the polynomial.
Qualitatively, Fig. 2 shows that the slight increase in R^2 obtained by moving from
a quadratic to a cubic model corresponds to a marked difference in the curves,
possibly indicating that the cubic curve is overfitting the data.
[Figure: shape change of the corpus callosum during normal aging, from the OASIS brain database.]

... At first glance the results appear similar for the three different models, since the
motion envelopes show close agreement. However, ...
Table 1. R^2 for regression of three datasets and three orders of polynomial (k)
Fig. 5. Typical cerebellum data used for regression.

Fig. 6. Estimated parameters for cubic regression on cerebellum data. Velocity is
shown in black, with acceleration in blue, and jerk shown in red.
5 Conclusion
Modern regression analysis consists of two classes of techniques: parametric and
nonparametric. Nonparametric approaches to curve-fitting include kernel-based
methods and splines, each of which has been explored in the setting of Riemannian
manifolds [21,22,23,24,25,26]. Such approaches offer increased flexibility
over parametric models and the ability to accurately model observed data,
at the expense of efficiency and precision. The trade-offs between parametric
and nonparametric regression are well established [27] and beyond the scope of
this work. A large sample size is required for cross-validation to estimate the
optimal smoothing parameter in a spline-based approach or to determine the
kernel bandwidth in a kernel smoothing method, which can be computationally
prohibitive. Compounding this problem, since the number of parameters is tied
to sample size, these nonparametric fits will vary greatly due to the curse of
dimensionality. Parametric analyses enable robust fitting with sample sizes that
are more feasible in biomedical applications. This tradeoff between accuracy
and precision must be considered on a per-application basis when selecting a
regression model.
In this work, we presented for the first time a methodology for parametric
regression on Riemannian manifolds, with the parameters being the initial
conditions of the curve. Our results show that the polynomial model allows for
parametric analysis on manifolds while providing more flexibility and accuracy
than geodesics. For the classical Bookstein rat skull dataset, for example, we are
able to provide a rather high R^2 coefficient of 0.84 with only a quadratic model.
Although we have presented results for landmark shape data, the framework
is general and can be applied to different data and modes of variation. For
instance, landmark data without direct correspondences can be modeled in the
currents framework [8], and image or surface data can be fit using the popular
diffeomorphic matching framework [7,9,5].
References
1. Dryden, I.L., Mardia, K.V.: Statistical Shape Analysis. Wiley (1998)
2. Davis, B.C., Fletcher, P.T., Bullitt, E., Joshi, S.C.: Population shape regression
from random design data. Int. J. Comp. Vis. 90, 255–266 (2010)
3. Jupp, P.E., Kent, J.T.: Fitting smooth paths to spherical data. Appl. Statist. 36,
34–46 (1987)
4. Fletcher, P.T.: Geodesic regression on Riemannian manifolds. In: International
Workshop on Mathematical Foundations of Computational Anatomy, MFCA
(2011)
5. Niethammer, M., Huang, Y., Vialard, F.-X.: Geodesic Regression for Image Time-
Series. In: Fichtinger, G., Martel, A., Peters, T. (eds.) MICCAI 2011, Part II.
LNCS, vol. 6892, pp. 655–662. Springer, Heidelberg (2011)
6. Arnold, V.I.: Mathematical Methods of Classical Mechanics, 2nd edn. Springer
(1989)
7. Cootes, T.F., Twining, C.J., Taylor, C.J.: Diffeomorphic statistical shape models.
In: BMVC (2004)
8. Vaillant, M., Glaunès, J.: Surface matching via currents. In: IPMI (2005)
9. Miller, M.I., Trouvé, A., Younes, L.: Geodesic shooting for computational anatomy.
Journal of Mathematical Imaging and Vision 24, 209–228 (2006)
10. Turaga, P., Veeraraghavan, A., Srivastava, A., Chellappa, R.: Statistical computations
on Grassmann and Stiefel manifolds for image and video-based recognition.
IEEE Trans. Pattern Anal. Mach. Intell. 33, 2273–2286 (2011)
11. Bookstein, F.L.: Morphometric Tools for Landmark Data: Geometry and Biology.
Cambridge Univ. Press (1991)
12. Kent, J.T., Mardia, K.V., Morris, R.J., Aykroyd, R.G.: Functional models of
growth for landmark data. In: Proceedings in Functional and Spatial Data Analysis,
pp. 109–115 (2001)
13. do Carmo, M.P.: Riemannian Geometry, 1st edn. Birkhäuser, Boston (1992)
14. Leite, F.S., Krakowski, K.A.: Covariant differentiation under rolling maps. Centro
de Matemática da Universidade de Coimbra (2008) (preprint)
15. Fletcher, P.T., Liu, C., Pizer, S.M., Joshi, S.C.: Principal geodesic analysis for the
study of nonlinear statistics of shape. IEEE Trans. Med. Imag. 23, 995–1005 (2004)
16. Holm, D.D., Marsden, J.E., Ratiu, T.S.: The Euler–Poincaré equations and semidirect
products with applications to continuum theories. Adv. in Math. 137, 1–81
(1998)
17. Kendall, D.G.: A survey of the statistical theory of shape. Statistical Science 4,
87–99 (1989)
18. Le, H., Kendall, D.G.: The Riemannian structure of Euclidean shape spaces: A
novel environment for statistics. Ann. Statist. 21, 1225–1271 (1993)
19. O'Neill, B.: The fundamental equations of a submersion. Michigan Math. J. 13,
459–469 (1966)
20. Cates, J., Fletcher, P.T., Styner, M., Shenton, M., Whitaker, R.: Shape modeling
and analysis with entropy-based particle systems. In: IPMI (2007)
21. Davis, B.C., Fletcher, P.T., Bullitt, E., Joshi, S.: Population shape regression from
random design data. In: ICCV (2007)
22. Noakes, L., Heinzinger, G., Paden, B.: Cubic splines on curved surfaces. IMA J.
Math. Control Inform. 6, 465–473 (1989)
23. Giambò, R., Giannoni, F., Piccione, P.: An analytical theory for Riemannian cubic
polynomials. IMA J. Math. Control Inform. 19, 445–460 (2002)
24. Machado, L., Leite, F.S.: Fitting smooth paths on Riemannian manifolds. Int. J.
App. Math. Stat. 4, 25–53 (2006)
25. Samir, C., Absil, P.A., Srivastava, A., Klassen, E.: A gradient-descent method for
curve fitting on Riemannian manifolds. Found. Comp. Math. 12, 49–73 (2010)
26. Su, J., Dryden, I.L., Klassen, E., Le, H., Srivastava, A.: Fitting smoothing splines
to time-indexed, noisy points on nonlinear manifolds. Image and Vision Computing
30, 428–442 (2012)
27. Moussa, M.A.A., Cheema, M.Y.: Non-parametric regression in curve fitting. The
Statistician 41, 209–225 (1992)
Approximate Gaussian Mixtures
for Large Scale Vocabularies
1 Introduction
The bag-of-words (BoW) model is ubiquitous in a number of problems of computer vision, including classification, detection, recognition, and retrieval. The k-means algorithm is one of the most popular in the construction of visual vocabularies, or codebooks. The investigation of alternative methods has evolved into an active research area for small to medium vocabularies of up to 10^4 visual words. For problems like image retrieval using local features and descriptors, finer vocabularies are needed, e.g. 10^6 visual words or more. Clustering options are more limited at this scale, with the most popular still being variants of k-means like approximate k-means (AKM) [1] and hierarchical k-means (HKM) [2].
The Gaussian mixture model (GMM), along with expectation-maximization (EM) [3] learning, is a generalization of k-means that has been applied to vocabulary construction for class-level recognition [4]. In addition to position, it models cluster population and shape, but assumes pairwise interaction of all points with all clusters and is slower to converge. The complexity per iteration is O(NK), where N and K are the numbers of points and clusters, respectively, so it is not practical for large K. On the other hand, a point is assigned to the nearest cluster via approximate nearest neighbor (ANN) search in [1], bringing complexity down to O(N log K), but keeping only one neighbor per point.
Robust approximate k-means (RAKM) [5] is an extension of AKM where the nearest neighbor in one iteration is re-used in the next, with less effort being spent for new neighbor search. This approach yields further speed-up, since
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 15–28, 2012.
© Springer-Verlag Berlin Heidelberg 2012
Y. Avrithis and Y. Kalantidis
the dimensionality D of the underlying space is high, e.g. 64 or 128, and the cost per iteration is dominated by the number of vector operations spent on distance computation to potential neighbors. The above motivates us to keep a larger, fixed number m of nearest neighbors across iterations. This not only improves ANN search in the sense of [5], but makes available enough information for an approximate Gaussian mixture (AGM) model, whereby each data point interacts only with the m nearest clusters.
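To make this idea concrete, here is a minimal sketch of such a sparse E-step, where each point receives responsibilities only from its m nearest clusters. The brute-force neighbor search stands in for the ANN index used in practice, and all names (`approx_responsibilities` etc.) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def approx_responsibilities(X, mu, sigma2, pi, m):
    """Sparse E-step sketch: each point interacts only with its m nearest
    clusters (found here by brute force; an ANN index would replace this).
    Returns (idx, gamma), where gamma[n] sums to 1 over the m neighbors."""
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # (N, K) squared distances
    idx = np.argsort(d2, axis=1)[:, :m]                    # m nearest clusters per point
    D = X.shape[1]
    d2m = np.take_along_axis(d2, idx, axis=1)
    # log of pi_k * N(x | mu_k, sigma_k^2 I), up to the common (2*pi)^(-D/2) factor
    logp = np.log(pi[idx]) - 0.5 * D * np.log(sigma2[idx]) - 0.5 * d2m / sigma2[idx]
    logp -= logp.max(axis=1, keepdims=True)                # stabilise the softmax
    g = np.exp(logp)
    g /= g.sum(axis=1, keepdims=True)
    return idx, g
```

With m much smaller than K, the per-iteration cost drops from O(NK) towards O(N log K) once the brute-force search is replaced by an ANN index.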
Experimenting with overlapping clusters, we have come up with two modifications that alter the EM behavior to the extent that it appears to be an entirely new clustering algorithm. An example is shown in Figure 1. Upon applying EM with relatively large K, the first modification is to compute the overlap of neighboring clusters and purge the ones that appear redundant, after each EM iteration. Now, clusters neighboring the purged ones should fill in the resulting space, so the second modification is to expand them as much as possible. The algorithm shares properties with other methods like agglomerative clustering or density-based mode seeking. In particular, it can dynamically estimate the number of clusters by starting with large K and purging clusters along the way. On the other hand, expanding towards empty space beyond the underlying data points boosts the convergence rate.
This algorithm, expanding Gaussian mixtures (EGM), is quite generic and can be used in a variety of applications. Focusing on spherical Gaussians, we apply its approximate version, AGM, to large scale visual vocabulary learning for image retrieval. In this setting, contrary to typical GMM usage, descriptors of indexed images are assigned only to their nearest visual word to keep the index sparse enough. This is the first implementation of Gaussian mixtures at this scale, and it appears to enhance retrieval performance at no additional cost compared to approximate k-means.
2 Related Work
One known pitfall of k-means is that visual words cluster around the densest few regions of the descriptor space, while the sparse ones may be more informative [6]. In this direction, radius-based clustering is used in [7], while [8], [9] combine the partitional properties of k-means with agglomerative clustering. Our approximate approach is efficient enough to initialize from all data points, preserving even sparse regions of the descriptor space, and to dynamically purge as appropriate in dense ones.
Precise visual word region shape has been represented by Gaussian mixture models (GMM) for universal [10] and class-adapted codebooks [4], and even by one-class SVM classifiers [11]. GMMs are extended with spatial information in [12] to represent images by hyperfeatures. None of the above can scale up as needed for image retrieval. On the other extreme, training-free approaches like fixed quantization on a regular lattice [13] or hashing with random histograms [14] are remarkably efficient for recognition tasks but often fail when applied to retrieval [15]. The flat vocabulary of AKM [1] is the most popular in this sense, accelerating the process by the use of randomized k-d trees [16] and outperforming the hierarchical vocabulary of HKM [2], but RAKM [5] is even faster. Although constrained to spherical Gaussian components, we achieve superior flexibility due to the GMM, yet converge as fast as [5].
Assigning an input descriptor to a visual word is also achieved by nearest neighbor search, and randomized k-d trees [16] are the most popular, in particular the FLANN implementation [17]. We use FLANN, but we exploit the iterative nature of EM to make the search process incremental, boosting both speed and precision. Any ANN search method would apply equally, e.g. product quantization [18], as long as it returns multiple neighbors and its training is fast enough to repeat within each EM iteration.
One attempt to compensate for the information loss due to quantization is soft assignment [15]. This is seen as kernel density estimation and applied to small vocabularies for recognition in [19], while different pooling strategies are explored in [20]. Soft assignment is expensive in general, particularly for recognition, but may apply to the query only in retrieval [21], which is our choice as well. This functionality is moved inside the codebook in [22] using a blurring operation, while visual word synonyms are learned from an extremely fine vocabulary in [23], using geometrically verified feature tracks. Towards the other extreme, Hamming embedding [24] encodes approximate position in Voronoi cells of a coarser vocabulary, but this also needs more index space. Our method is complementary to learning synonyms or embedding more information.
Estimating the number of components is often handled by multiple EM runs and model selection, or by split and merge operations [25], which is impractical in our problem. A dynamic approach in a single EM run is more appropriate, e.g. component annihilation based on competition [26]; we rather estimate spatial overlap directly and purge components long before competition takes place. Approximations include e.g. space partitioning with k-d trees and group-wise parameter learning [27], but ANN in the E-step yields lower complexity.
Σ_k = (1/N_k) ∑_{n=1}^{N} γ_nk (x_n − μ_k)(x_n − μ_k)^T ,   (5)

where γ_nk = γ_k(x_n) for n = 1, ..., N, and N_k = ∑_{n=1}^{N} γ_nk, to be interpreted as the effective number of points assigned to component k. The expectation-maximization (EM) algorithm involves an iterative learning process: given an initial set of parameters, compute responsibilities γ_nk according to (2) (E-step); then, re-estimate parameters according to (3)-(5), keeping responsibilities fixed (M-step). Initialization and termination are discussed in Section 3.4.
For the remainder of this work, we shall focus on the particular case of spherical (isotropic) Gaussians, with covariance matrix Σ_k = σ_k² I. In this case, update equation (5) reduces to

σ_k² = (1/(D N_k)) ∑_{n=1}^{N} γ_nk ‖x_n − μ_k‖² .   (6)
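The spherical M-step built around update (6) can be sketched as follows. The helper name `m_step_spherical` is hypothetical, and the responsibilities are assumed dense for clarity, whereas AGM keeps them sparse.

```python
import numpy as np

def m_step_spherical(X, gamma):
    """M-step for a spherical GMM: N_k = sum_n gamma_nk, pi_k = N_k / N,
    mu_k = responsibility-weighted mean, and, following (6),
    sigma_k^2 = (1 / (D * N_k)) * sum_n gamma_nk * ||x_n - mu_k||^2."""
    N, D = X.shape
    Nk = gamma.sum(axis=0)                                # effective points per component
    pi = Nk / N
    mu = (gamma.T @ X) / Nk[:, None]
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)  # (N, K) squared distances
    sigma2 = (gamma * d2).sum(axis=0) / (D * Nk)
    return pi, mu, sigma2
```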
3.2 Purging
Determining the number of components is more critical in Gaussian mixtures than in k-means due to the possibility of overlapping components. Although one solution is a variational approach [3], the problem has motivated us to devise a novel method of purging components according to an overlap measure. Purging is dynamic in the sense that we initialize the model with as many components as possible and purge them as necessary during the parameter learning process. In effect, this idea introduces a P-step into EM, to be applied right after the E- and M-steps in every iteration.
Let p_k be the function representing the contribution of component k to the Gaussian mixture distribution of (1). Under the spherical Gaussian model, the inner product ⟨p_i, p_k⟩ can be computed in O(D), requiring only one D-dimensional vector operation, ‖μ_i − μ_k‖², while the squared norm ‖p_i‖² = ⟨p_i, p_i⟩ = π_i² (4πσ_i²)^{−D/2} is O(1). Now, if function q represents any component or cluster, (10) motivates generalizing (2) to define the quantity

γ_k(q) = ⟨q, p_k⟩ / ∑_{j=1}^{K} ⟨q, p_j⟩ .   (11)
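Assuming Gaussian contributions p_k = π_k N(· | μ_k, σ_k² I), their inner product has a closed form, which makes such overlap ratios cheap to evaluate. The sketch below uses that formula together with a simplified purging rule (drop a component whose normalized overlap with itself falls below a threshold tau); it approximates rather than reproduces the paper's P-step, and all names are illustrative.

```python
import numpy as np

def gaussian_inner(pi_i, mu_i, s2_i, pi_k, mu_k, s2_k):
    """<p_i, p_k> for weighted spherical Gaussians p = pi * N(mu, s2 * I).
    For i == k this reduces to pi^2 * (4*pi*s2)^(-D/2), as in the text."""
    D = len(mu_i)
    s2 = s2_i + s2_k
    d2 = ((mu_i - mu_k) ** 2).sum()
    return pi_i * pi_k * (2 * np.pi * s2) ** (-D / 2) * np.exp(-d2 / (2 * s2))

def purge(pi, mu, s2, tau):
    """P-step sketch: visit components smallest-first (a hypothetical order)
    and drop any that barely explain themselves relative to the others."""
    keep = list(range(len(pi)))
    for k in sorted(range(len(pi)), key=lambda c: pi[c]):
        others = [j for j in keep if j != k]
        if not others:
            break
        num = gaussian_inner(pi[k], mu[k], s2[k], pi[k], mu[k], s2[k])
        tot = num + sum(gaussian_inner(pi[k], mu[k], s2[k],
                                       pi[j], mu[j], s2[j]) for j in others)
        if num / tot < tau:          # component k mostly explained by the rest
            keep.remove(k)
    return keep
```

Two fully overlapping components each get a self-overlap ratio of 0.5, so a threshold around 0.5 purges exactly one of them while leaving isolated components untouched.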
3.3 Expanding
When a component, say i, is purged, data points that were better explained by i prior to purging will have to be assigned to neighboring components that remain. These components will then have to expand and cover the space populated by such points. Towards this goal, we modify (6) to overestimate the extent of each component as long as this does not overlap with its neighboring components. Components will then tend to fill in as much empty space as possible, and this determines the convergence rate of the entire algorithm.
More specifically, given component k in the current set of components C, we use the constrained definition of γ_k(x) in (13) to partition the data set P = {1, ..., N} into the set of its inner points

P_k^− = {n ∈ P : γ_nk = max_{j∈C} γ_nj} ,   (14)

and the set of its outer points P_k^+ = P \ P_k^−. The spherical update (6) can then be decomposed as

D σ_k² = (N_k^− / N_k) λ_k^− + (N_k^+ / N_k) λ_k^+ ,   (15)

where N_k^− = ∑_{n∈P_k^−} γ_nk , N_k^+ = ∑_{n∈P_k^+} γ_nk , and

λ_k^− = (1/N_k^−) ∑_{n∈P_k^−} γ_nk ‖x_n − μ_k‖² ,   λ_k^+ = (1/N_k^+) ∑_{n∈P_k^+} γ_nk ‖x_n − μ_k‖² .   (16)

The expanded update is

D σ_k² = w_k λ_k^− + (1 − w_k) λ_k^+ ,   (17)

where w_k = (N_k^− / N_k)(1 − α) and α ∈ [0, 1] is an expansion factor, with α = 0 reducing (17) back to (6) and α = 1 using only the outer sum λ_k^+.
Because of the exponential form of the normal distribution, the outer sum λ_k^+ is dominated by the outer points in P_k^+ that are nearest to component k, hence it provides an estimate for the maximum expansion of k before overlaps with neighboring components begin. Figure 2 illustrates expansion for the example of Figure 1. The outer green circles typically pass through the nearest clusters, while the inner ones tightly fit the data points of each cluster.
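A sketch of this expansion update, with alpha the expansion factor: for alpha = 0 it reproduces the plain spherical update, as stated in the text. Function and variable names are illustrative assumptions.

```python
import numpy as np

def expand_sigma2(X, gamma, mu, alpha):
    """Expansion sketch for (14)-(17): split points into inner sets (where
    component k wins the responsibility) and outer sets, then blend the two
    weighted sums with w_k = (N_k^- / N_k) * (1 - alpha)."""
    N, D = X.shape
    K = mu.shape[0]
    winner = gamma.argmax(axis=1)                           # inner-set membership
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)    # (N, K)
    s2 = np.empty(K)
    for k in range(K):
        inner = winner == k
        g_in, g_out = gamma[inner, k], gamma[~inner, k]
        Nin, Nout = g_in.sum(), g_out.sum()
        lam_in = (g_in * d2[inner, k]).sum() / Nin          # inner sum
        lam_out = (g_out * d2[~inner, k]).sum() / Nout      # outer sum
        w = (Nin / (Nin + Nout)) * (1 - alpha)
        s2[k] = (w * lam_in + (1 - w) * lam_out) / D        # from (17)
    return s2
```

Since the outer points lie farther from the center, larger alpha inflates each variance towards the nearest neighboring clusters.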
It has been argued [6, 7] that sparsely populated regions of the descriptor space are often the most informative ones, so random sampling is usually a bad choice. We therefore initialize with all data points as cluster centers, that is, K = N. Using approximate nearest neighbors, this choice is not as inefficient as it sounds. In fact, only the first iteration is affected, because by the second, the ratio of clusters that survive is typically on the order of 10%. Mixing coefficients are initially uniform. Standard deviations are initialized to the distance of the nearest neighbor, again found approximately.
Convergence in EM is typically detected by monitoring the likelihood function. This makes sense after the number of components has stabilized and no purging takes place. However, experiments on large scale vocabularies take hours or even days of processing, so convergence is never reached in practice. What is important is to measure the performance of the resulting vocabulary in a particular task (retrieval in our case) versus the processing required.
5 Experiments
5.1 Setup
We explore here the behavior of AGM under different settings and parameters, and compare its speed and performance against AKM [1] and RAKM [5] on large scale vocabulary construction for image retrieval.
Datasets. We use two publicly available datasets, namely Oxford buildings¹ [1] and world cities² [29]. The former consists of 5,062 images and annotation (ground truth) for 11 different landmarks in Oxford, with 5 queries for each. The annotated part of the latter is referred to as the Barcelona dataset and consists of 927 images grouped into 17 landmarks in Barcelona, again with 5 queries for each. The distractor set of world cities consists of 2M images from different cities; we use the first one million as distractors in our experiments.
Vocabularies. We extract SURF [30] features and descriptors from each image. There are in total 10M descriptors in Oxford buildings and 550K in Barcelona. The descriptors, of dimensionality D = 64, are the data points we use for training. We construct specific vocabularies with all descriptors of the smaller Barcelona dataset, which we then evaluate on the same dataset for parameter tuning and speed comparisons. We also build generic vocabularies with all 6.5M descriptors of an independent set of 15K images depicting urban scenes, and compare on the larger Oxford buildings in the presence of distractors.
Evaluation protocol. Given each vocabulary built, we use FLANN to assign descriptors, keeping only one NN under Euclidean distance. We apply soft assignment on the query side only, as in [21], keeping the first 1, 3, or 5 NNs. These are found as in [15], with parameter 0.01. We rank by dot product on ℓ2-normalized BoW histograms with tf-idf weighting. We measure training time in D-dimensional vector operations (vop) per data point and retrieval performance in mean average precision (mAP) over all queries. We use our own C++ implementations of AKM and RAKM.
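The ranking side of this protocol can be illustrated as a tiny tf-idf pipeline. The function below is a hypothetical mini-implementation for exposition only, not the authors' C++ code; it builds tf-idf weighted, ℓ2-normalized BoW histograms and ranks database images by dot product with the query.

```python
import numpy as np

def rank_by_bow(assignments_db, assignment_q, K):
    """Rank database images for a query by tf-idf weighted, l2-normalised
    BoW similarity. assignments_db: list of arrays of visual-word ids,
    one per image; assignment_q: array of word ids for the query."""
    def hist(words):
        return np.bincount(words, minlength=K).astype(float)

    H = np.stack([hist(w) for w in assignments_db])       # (n_images, K) term counts
    df = (H > 0).sum(axis=0)                              # document frequency per word
    idf = np.log(len(assignments_db) / np.maximum(df, 1))

    def tfidf(h):
        v = h * idf
        n = np.linalg.norm(v)
        return v / n if n > 0 else v

    T = np.stack([tfidf(h) for h in H])
    scores = T @ tfidf(hist(assignment_q))                # dot product similarity
    return np.argsort(-scores)                            # best match first
```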
5.2 Tuning
We choose Barcelona for parameter tuning because, combined with descriptor assignment, it would be prohibitive to use a larger dataset. It turns out, however, that AGM outperforms the state of the art at much larger scale with the same parameters. The behavior of EGM is controlled by the expansion factor and the overlap threshold; AGM is additionally controlled by the memory m that determines ANN precision and cost, considered in Section 5.3.

¹ http://www.robots.ox.ac.uk/~vgg/data/oxbuildings/
² http://image.ntua.gr/iva/datasets/wc/
[Figure 3: mAP versus number of iterations; the left panel compares expansion factor values 0.00, 0.15, 0.20 and 0.25, and the right panel compares overlap thresholds at iterations 2, 5, 10 and 20.]
Figure 3 (left) shows mAP against the number of iterations for different values of the expansion factor. Compared to a factor of 0, it is apparent that expansion boosts the convergence rate. However, the effect is only temporary above 0.2, and performance eventually drops, apparently due to over-expansion. Now, the overlap threshold controls purging, and we expect it to be near 0.5, as pointed out in Section 3.2. Since the overlap measure is normalized, 0.5 appears to be a good choice, as component k then explains itself at least as much as the remaining components in (12). Figure 3 (right) shows that the higher the threshold the better initially, but eventually 0.55 is clearly optimal, being somewhat stricter than expected. We choose an expansion factor of 0.2 and an overlap threshold of 0.55 for the remaining experiments.
5.3 Comparisons
We use FLANN [17] for all methods, with 15 trees and precision controlled by checks, i.e. the number of leaves checked per ANN query. We use 1,000 checks during assignment, and fewer during learning. Approximation in AGM is controlled by the memory m, and we need m FLANN checks plus at most m more distance computations in Algorithm 2, for a total of 2m vector operations per iteration. We therefore use 2m checks for AKM and RAKM in comparisons.
Figure 4 (left) compares all methods on convergence for m = 50 and m = 100, where AKM/RAKM are trained for 80K vocabularies, found to be optimal.
AGM not only converges as fast as AKM and RAKM, but also outperforms them for the same learning time. AKM and RAKM are only better in the first few iterations, because AGM initializes from all points and needs a few iterations to reach a reasonable vocabulary size. The relative performance of RAKM versus AKM is not quite as expected from [5]; this may be partly due to random initialization, a standing issue for k-means that AGM is free of, and partly because performance in [5] is measured in distortion rather than mAP in retrieval. It is also interesting that, unlike the other methods, AGM appears to improve in mAP for lower m.
Fig. 4. (Left) Barcelona-specific mAP versus learning time, measured in vector operations (vop) per data point, for AGM, AKM and RAKM under varying approximation levels, measured in FLANN checks. There are 5 measurements on each curve, corresponding, from left to right, to iterations 5, 10, 20, 30 and 40. (Right) Oxford buildings generic mAP in the presence of up to 1 million distractor images for AGM and RAKM, using query-side soft assignment with 1, 3 and 5 NNs.
Table 1. mAP comparisons for generic vocabularies of different sizes on Oxford Buildings with a varying number of distractors, using 100/200/200 FLANN checks for AGM/AKM/RAKM respectively, 40 iterations for AKM/RAKM, and 15 for AGM.
6 Discussion
Grouping 10^7 points into 10^6 clusters in a space of 10^2 dimensions is certainly a difficult problem. Assigning one or more visual words to 10^3 descriptors from each of 10^6 images, and repeating for a number of different settings and competing methods on a single machine, has proved even more challenging. Nevertheless, we have managed to tune the algorithm on one dataset and get competitive performance on a different one with a set of parameters that work even on our very first two-dimensional example. In most alternatives one needs to tune at least the vocabulary size. Even with spherical components, the added flexibility of Gaussian mixtures appears to boost discriminative power as measured by retrieval performance. Yet, learning is as fast as approximate k-means, both in terms of EM iterations and underlying vector operations for NN search.
Our solution appears to avoid both the singularities and the overlapping components that are inherent in ML estimation of GMMs. Still, it would be interesting to consider a variational approach [3] and study the behavior of our purging and expanding schemes under different priors. Seeking visual synonyms towards arbitrarily-shaped clusters without expensive sets of parameters is another direction, either with or without training data as in [23]. More challenging would be to investigate similar ideas in a supervised setting, e.g. for classification. More details, including software, can be found at our project home page³.
References
1. Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Object retrieval with large vocabularies and fast spatial matching. In: CVPR (2007)
2. Nister, D., Stewenius, H.: Scalable recognition with a vocabulary tree. In: CVPR (2006)
3. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer (2006)
4. Perronnin, F.: Universal and adapted vocabularies for generic visual categorization. PAMI 30(7), 1243–1256 (2008)
5. Li, D., Yang, L., Hua, X.S., Zhang, H.J.: Large-scale robust visual codebook construction. In: ACM Multimedia (2010)
6. Boiman, O., Shechtman, E., Irani, M.: In defense of nearest-neighbor based image classification. In: CVPR (2008)

³ http://image.ntua.gr/iva/research/agm/
7. Jurie, F., Triggs, B.: Creating efficient codebooks for visual recognition. In: ICCV (2005)
8. Leibe, B., Mikolajczyk, K., Schiele, B.: Efficient clustering and matching for object class recognition. In: BMVC (2006)
9. Fulkerson, B., Vedaldi, A., Soatto, S.: Localizing Objects with Smart Dictionaries. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 179–192. Springer, Heidelberg (2008)
10. Winn, J., Criminisi, A., Minka, T.: Object categorization by learned universal visual dictionary. In: ICCV (2005)
11. Wu, J., Rehg, J.M.: Beyond the Euclidean distance: Creating effective visual codebooks using the histogram intersection kernel. In: ICCV (2009)
12. Agarwal, A., Triggs, B.: Hyperfeatures: Multilevel Local Coding for Visual Recognition. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006, Part I. LNCS, vol. 3951, pp. 30–43. Springer, Heidelberg (2006)
13. Tuytelaars, T., Schmid, C.: Vector quantizing feature space with a regular lattice. In: ICCV (October 2007)
14. Dong, W., Wang, Z., Charikar, M., Li, K.: Efficiently matching sets of features with random histograms. In: ACM Multimedia (2008)
15. Philbin, J., Chum, O., Sivic, J., Isard, M., Zisserman, A.: Lost in quantization: Improving particular object retrieval in large scale image databases. In: CVPR (2008)
16. Silpa-Anan, C., Hartley, R.: Optimised KD-trees for fast image descriptor matching. In: CVPR (2008)
17. Muja, M., Lowe, D.: Fast approximate nearest neighbors with automatic algorithm configuration. In: ICCV (2009)
18. Jegou, H., Douze, M., Schmid, C.: Product quantization for nearest neighbor search. PAMI 33(1), 117–128 (2011)
19. van Gemert, J., Veenman, C., Smeulders, A., Geusebroek, J.: Visual word ambiguity. PAMI 32(7), 1271–1283 (2010)
20. Liu, L., Wang, L., Liu, X.: In defense of soft-assignment coding. In: ICCV (2011)
21. Jegou, H., Douze, M., Schmid, C.: Improving bag-of-features for large scale image search. IJCV 87(3), 316–336 (2010)
22. Lehmann, A., Leibe, B., van Gool, L.: PRISM: Principled implicit shape model. In: BMVC (2009)
23. Mikulik, A., Perdoch, M., Chum, O., Matas, J.: Learning a Fine Vocabulary. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part III. LNCS, vol. 6313, pp. 1–14. Springer, Heidelberg (2010)
24. Jegou, H., Douze, M., Schmid, C.: Hamming Embedding and Weak Geometric Consistency for Large Scale Image Search. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 304–317. Springer, Heidelberg (2008)
25. Ueda, N., Nakano, R., Ghahramani, Z., Hinton, G.: SMEM algorithm for mixture models. Neural Computation 12(9), 2109–2128 (2000)
26. Figueiredo, M., Jain, A.: Unsupervised learning of finite mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(3), 381–396 (2002)
27. Verbeek, J., Nunnink, J., Vlassis, N.: Accelerated EM-based clustering of large data sets. Data Mining and Knowledge Discovery 13(3), 291–307 (2006)
28. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer (2009)
29. Tolias, G., Avrithis, Y.: Speeded-up, relaxed spatial matching. In: ICCV (2011)
30. Bay, H., Tuytelaars, T., Van Gool, L.: SURF: Speeded Up Robust Features. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 404–417. Springer, Heidelberg (2006)
Geodesic Saliency Using Background Priors
1 Introduction
The human vision system can rapidly and accurately identify important regions
in its visual eld. In order to achieve such an ability in computer vision, extensive
research eorts have been conducted on bottom up visual saliency analysis for
years. Early works [1, 2] in this eld are mostly based on biologically inspired
models (where a human looks) and is evaluated on human eye xation data [3, 4].
Many follow up works are along this direction [57].
Recent years have witnessed more interest in object level saliency detection
(where the salient object is) [813] and evaluation is performed on human labeled
objects (bounding boxes [8] or foreground masks [9]). This new trend is motivated
by the increasing popularity of salient object based vision applications, such as
object aware image retargeting [14, 15], image cropping for browsing [16], object
segmentation for image editing [17] and recognition [18]. It has also been shown
that the low level saliency cues are helpful to nd generic objects irrespective
to their categories [19, 20]. In this work we focus on the object level saliency
detection problem in natural images.
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 29–42, 2012.
© Springer-Verlag Berlin Heidelberg 2012
30 Y. Wei et al.
Fig. 1. Saliency maps of previous representative contrast prior based methods, on three example images of increasing complexity in objects and backgrounds. (a) Input images. (b) Ground truth salient object masks. (c)-(f) Results from one local method [1] and three global methods [9, 11, 12].
in global methods [11, 12], but these methods still have difficulty highlighting the entire object uniformly.
In essence, the salient object detection problem on general objects and backgrounds is highly ill-posed. The community still lacks a common definition of what saliency is, and simply using the contrast prior alone is unlikely to succeed. While previous approaches are mostly based on their own understanding of how the contrast prior should be implemented, huge behavioral discrepancies between previous methods can be observed, and such discrepancies are sometimes hard to understand.
This problem is exemplified in Figure 1, which shows results from previous representative methods, one local method [1] and three global methods [9, 11, 12]. The three examples present increasing complexity in objects and backgrounds: a simple object on a simple background; a more complex object on a more complex background; and two salient regions on a complex background with low contrast. The results of the different methods vary significantly from each other, even for the first, extremely simple example, which would not be ambiguous at all for most humans. This phenomenon is repeatedly observed across many images, both simple and complex. It reinforces our belief that using the contrast prior alone is insufficient for this ill-posed problem and that more clues should be exploited.
In other words, most image patches in the background can be easily connected to each other. We call this the connectivity prior. Note that the connectivity holds in a piecewise manner. For example, sky and grass regions are homogeneous by themselves, but patches between them cannot be easily connected. Also, the background appearance homogeneity should be interpreted in terms of human perception. For example, patches in the grass all look similar to humans, although their pixel-wise intensities might be quite different. This prior is different from the connectivity prior used in object segmentation [24, 25], which assumes spatial continuity of the object instead of the background. Sometimes this prior is also supported by the fact that background regions are usually out of focus during photography, and are therefore more blurred and smoother than the object of interest.
2 Geodesic Saliency
Based on the contrast prior and the two background priors, we further observe that most background regions can be easily connected to image boundaries, while this is much harder for object regions. This suggests defining the saliency of an image patch as the length of its shortest path to the image boundaries. However, this assumes that all boundary patches are background, which is not realistic, as the salient object could be partially cropped at the boundary. This can be rectified by adding a virtual background node connected to all boundary patches in the image graph and computing the saliency of boundary patches, as described in Section 2.1. In this way, we propose a new and intuitive geodesic saliency measure: the saliency of an image patch is the length of its shortest path to the virtual background node.
Figure 2 shows several shortest paths of image patches and the geodesic saliency results. These results are more plausible than those in Figure 1.
Geodesic saliency has several advantages. First, it is intuitive and easy to interpret. It simultaneously exploits the three priors in an effective manner. The object attenuation problem of previous methods is significantly alleviated because all patches inside a homogeneous object region usually share similar shortest paths to the image boundary and therefore have similar saliency. As a result, it achieves superior results to previous approaches.
Second, it mostly benefits from the background priors and uses the contrast prior moderately. Therefore, it is complementary to methods that exploit the contrast prior in a more sophisticated manner [22, 11, 12]. A combination of geodesic saliency with such methods can usually improve both.
Last, it is easy to implement, fast, and suitable for subsequent applications. For different practical needs in the accuracy/speed trade-off, we propose two variants of the geodesic saliency algorithm. For applications that require high speed, such as interactive image retargeting [14], image thumbnail generation/cropping for batch image browsing [16], and bounding box based object extraction [26, 19],
we propose to compute geodesic saliency using rectangular patches on an image
grid and an approximate shortest path algorithm [27]. We call this algorithm GS Grid.

Fig. 2. (better viewed in color) Illustration of geodesic saliency (using regular patches). Top: three example images in Figure 1 and shortest paths of a few foreground (green) and background (magenta) image patches. Bottom: geodesic saliency results.

GS Grid runs in about 2 milliseconds for images of moderate size (400 × 400). For applications that require high accuracy, such as object segmentation [17, 18], we propose to use superpixels [28] as image patches and an exact shortest path algorithm. We call this algorithm GS Superpixel. It is slower (a few seconds) but more accurate than GS Grid.
In Section 4, extensive evaluation on the widely used salient object databases of [8, 9] and a recent, more challenging database [29, 30] shows that geodesic saliency outperforms previous methods by a large margin, and that it can be further improved by combination with previous methods. In addition, the GS Grid algorithm is significantly faster than all other methods and is suitable for real time applications.
2.1 Algorithm
For an image, we build an undirected weighted graph G = {V, E}. The vertices are all image patches {P_i} plus a virtual background node B, V = {P_i} ∪ {B}. There are two types of edges: internal edges connect all adjacent patches, and boundary edges connect image boundary patches to the background node, E = {(P_i, P_j) | P_i is adjacent to P_j} ∪ {(P_i, B) | P_i is on image boundary}. The geodesic saliency of a patch P is the accumulated edge weight along the shortest path from P to the background node B on the graph G:

saliency(P) = min_{P_1 = P, P_2, ..., P_n = B} ∑_{i=1}^{n−1} weight(P_i, P_{i+1}),   s.t. (P_i, P_{i+1}) ∈ E.
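Since all edge weights are non-negative, this is a standard single-source shortest path problem from the virtual node B, which Dijkstra's algorithm solves for all patches at once. A minimal sketch follows; the function name and graph encoding (patch ids, weight dictionaries) are illustrative assumptions, not the paper's implementation.

```python
import heapq

def geodesic_saliency(edge_w, boundary_w, n):
    """saliency(P) = cheapest path cost from patch P to the virtual
    background node B. edge_w maps (i, j) patch pairs to internal weights;
    boundary_w maps boundary patch ids to their edge weight to B.
    Runs Dijkstra outward from B over n patches."""
    adj = {i: [] for i in range(n)}
    for (i, j), w in edge_w.items():
        adj[i].append((j, w))
        adj[j].append((i, w))
    dist = {i: float("inf") for i in range(n)}
    heap = [(w, i) for i, w in boundary_w.items()]   # seed with the edges from B
    heapq.heapify(heap)
    while heap:
        d, i = heapq.heappop(heap)
        if d >= dist[i]:                             # stale heap entry, skip
            continue
        dist[i] = d
        for j, w in adj[i]:
            if d + w < dist[j]:
                heapq.heappush(heap, (d + w, j))
    return dist
```

For three patches in a row with costly internal edges and zero-weight boundary edges at both ends, the two boundary patches get saliency 0 and the middle patch inherits the cheapest internal edge cost, matching the definition above.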
Fig. 3. Internal edge weight clipping alleviates the small weight accumulation problem. (a) Input image. (b) Geodesic saliency (using superpixels) without weight clipping. (c) Geodesic saliency (using superpixels) with weight clipping.

We first describe how to compute edge weights, and then introduce the two variants of the geodesic saliency algorithm: GS Grid is faster and GS Superpixel is more accurate.
The internal edge weight is the appearance distance between adjacent patches. This distance measure should be consistent with human perception of how similar two patches look. This problem is not easy in itself. For homogeneous textured background such as road or grass, while patches there look almost identical, simple appearance distances such as color histogram distance are usually small but non-zero. This causes the small-weight-accumulation problem: many small edge weights can accumulate along a long path and form undesirable high saliency values in the center of the background. See Figure 3(b) for an example.
Instead of exploring complex features and sophisticated patch appearance
distance measures, we take a simple and effective weight clipping approach to
address this problem. The patch appearance distance is simply taken as the
difference (normalized to [0, 1]) between the mean colors of two patches (in LAB
color space). For each patch we pick its smallest appearance distance to all
its neighbors, and then we select an insignificance distance threshold as the
average value of all such smallest distances over all patches. If any distance
is smaller than this threshold, it is considered insignificant and clipped to 0.
Such computation of internal edge weights is very efficient, and its effectiveness
is illustrated in Figure 3.
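The threshold selection and clipping step described above can be sketched as follows (the LAB mean colors and the adjacency structure are assumed given; function and variable names are illustrative):

```python
import numpy as np

def clip_internal_weights(mean_colors, neighbors):
    """Clip insignificant internal edge weights to zero.

    mean_colors: (n, 3) array of per-patch LAB mean colors, scaled so that
                 pairwise distances lie in [0, 1].
    neighbors:   dict {i: list of adjacent patch indices}.
    """
    def dist(i, j):
        return float(np.linalg.norm(mean_colors[i] - mean_colors[j]))

    # Smallest appearance distance from each patch to any of its neighbors.
    smallest = [min(dist(i, j) for j in nbrs) for i, nbrs in neighbors.items()]
    # The "insignificance" threshold is the average of those smallest distances.
    tau = float(np.mean(smallest))
    weights = {}
    for i, nbrs in neighbors.items():
        for j in nbrs:
            if j > i:  # each undirected edge once
                d = dist(i, j)
                weights[(i, j)] = 0.0 if d <= tau else d
    return weights, tau
```

With this clipping, near-identical background patches get zero-weight edges, so long paths through homogeneous background no longer accumulate spurious saliency.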
Boundary edge weight characterizes how likely a boundary patch is not back-
ground. When the boundary prior is strictly valid, all boundary patches are
background and such weights should be 0. However, this is too idealistic and
not robust. The whole object could be missed even if it only slightly touches the
image boundary, because all the patches on the object may be easily connected
to the object patch on the image boundary. See Figure 4(b) for examples.
We observe that when the salient object is partially cropped by the image
boundary, the boundary patches on the object are more salient than boundary
patches in the background. Therefore, the boundary edge weights computation
is treated as a one-dimensional saliency detection problem: given only image
Geodesic Saliency Using Background Priors 35
Fig. 4. Boundary edge weight computation makes the boundary prior and geodesic
saliency more robust. (a) Input image. (b) Geodesic saliency with boundary edge
weights set to 0. (c) The boundary edge weights computed using a one-dimensional
version of the saliency algorithm in [11]. (d) Geodesic saliency using the boundary
edge weights in (c).
boundary patches, compute the saliency of each boundary patch Pi as the weight
of boundary edge (Pi , B). This step makes the boundary prior and geodesic
saliency more robust, as illustrated in Figure 4(c)(d).
In principle, any previous image saliency method that can be reduced to a one-
dimensional version can be used for this problem. We use the algorithm in [11]
because it is also based on image patches and adapting it to using only boundary
patches is straightforward. For patch appearance distance in this method, we also
use the mean color dierence.
The GS Grid algorithm uses rectangular image patches of 10 × 10 pixels on a
regular image grid. Shortest paths for all patches are computed using the efficient
geodesic distance transform [27]. Although this solution is only approximate, it
is very close to the exact solution on a simple graph on an image grid. Because
of its linear complexity in the number of graph nodes and its sequential memory
access (and therefore cache friendliness), it is extremely fast and is also used in
interactive image segmentation [31, 32]. Overall, the GS Grid algorithm runs in
2 ms for a 400 × 400 image and produces decent results.
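The raster-scan style of geodesic distance transform referenced here can be sketched as alternating forward and backward passes that relax 4-neighbor distances; this naive Python version is for illustration only, and the per-pixel cost array and seed encoding are assumptions, not the exact formulation of [27]:

```python
import numpy as np

def geodesic_dt(seed_dist, cost, n_iters=2):
    """Approximate geodesic distance transform by sequential raster scans.

    seed_dist: (H, W) array, 0 at seed pixels (e.g. the background node's
               neighbors), inf elsewhere.
    cost:      (H, W) per-pixel step cost (e.g. clipped color differences).
    """
    d = seed_dist.astype(float).copy()
    H, W = d.shape
    for _ in range(n_iters):
        # Forward pass: top-left to bottom-right, relax from left/up neighbors.
        for y in range(H):
            for x in range(W):
                for dy, dx in ((0, -1), (-1, 0)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < H and 0 <= nx < W:
                        d[y, x] = min(d[y, x], d[ny, nx] + cost[y, x])
        # Backward pass: bottom-right to top-left, relax from right/down.
        for y in range(H - 1, -1, -1):
            for x in range(W - 1, -1, -1):
                for dy, dx in ((0, 1), (1, 0)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < H and 0 <= nx < W:
                        d[y, x] = min(d[y, x], d[ny, nx] + cost[y, x])
    return d
```

The sequential memory access pattern of the two scans is what makes the real (C++) implementation cache friendly; a couple of iterations already converge on simple grid graphs.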
The GS Superpixel algorithm uses irregular superpixels as image patches. It
is more accurate because superpixels are better aligned with object and background
region boundaries than the regular patches in GS Grid, and the patch
appearance distance computation is more accurate. We use the superpixel segmentation
algorithm in [28] to produce superpixels of roughly 10 × 10 pixels.
It typically takes a few seconds per image. Note that our approach is quite
insensitive to the superpixel algorithm, so any faster method can be used.
The shortest paths of all image patches are computed by Dijkstra's algorithm
for better accuracy. In spite of the super-linear complexity, the number of superpixels
is usually small (around 1,500) and this step takes no more than a few
milliseconds.
The MSRA database [8] is currently the largest salient object database and
provides object bounding boxes. The database in [9] is a 1,000-image subset of
[8] and provides human-labeled object segmentation masks. Many recent object
saliency detection methods [8-13, 16] are evaluated on these two databases.
Nevertheless, these databases have several limitations. All images contain only
a single salient object. Most objects are large and near the image center. Most
backgrounds are clean and present strong contrast with the objects. A few example
images are shown in Figure 8.
Recently a more challenging salient object database was introduced [30]. It is
based on the well-known 300-image Berkeley segmentation dataset [29]. Those
images usually contain multiple foreground objects of different sizes and positions
in the image. The appearance of objects and backgrounds is also more complex.
See Figure 10 for a few example images. In the work of [30], seven subjects were
asked to label foreground salient object masks, and each subject could label
multiple objects in one image. For each object mask of each subject, a consistency
score is computed from the labeling of the other six subjects. However, [30] does
not provide a single foreground salient object mask as ground truth.
For our evaluation, we obtain a foreground mask for each image by removing
all labeled objects whose consistency scores are smaller than a threshold
and combining the remaining objects' masks. This is reasonable because a small
consistency score means there is divergence between the opinions of the seven
subjects and the labeled object is probably less salient. We tried different thresholds
from 0.7 to 1 (1 is the maximum consistency and means all subjects labeled exactly
the same mask for the object). We found that for most images the user labels are
quite consistent and the resulting foreground masks are insensitive to this threshold.
Finally, we set the threshold to 0.7, trying to retain as many salient objects as
possible. See Figure 10 for several example foreground masks.
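The mask construction described above can be sketched as follows (the mask arrays and score values are illustrative inputs):

```python
import numpy as np

def combine_masks(object_masks, consistency_scores, threshold=0.7):
    """Union of the object masks whose consistency score passes a threshold.

    object_masks:       list of (H, W) boolean arrays, one per labeled object.
    consistency_scores: per-object scores in [0, 1], computed from the
                        labelings of the other subjects.
    """
    kept = [m for m, s in zip(object_masks, consistency_scores) if s >= threshold]
    if not kept:
        return np.zeros_like(object_masks[0], dtype=bool)
    return np.logical_or.reduce(kept)
```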
4 Experiments
As explained above, besides the commonly used MSRA-1000 [9] and MSRA [8]
databases, we also use the Berkeley-300 database [30] in our experiments. The
latter is more challenging and has not been widely used in saliency detection
yet.
Not surprisingly, we found that all our results and findings on the MSRA
database are very similar to those on MSRA-1000. Therefore we omit all
results on the MSRA database in this paper.
In this experiment, we evaluate how likely the image boundary pixels are background.
For each image, we compute the percentage of all background boundary
pixels in a band of 10 pixels in width (to be consistent with the patch size in
geodesic saliency) along the four image sides, using the ground truth foreground
mask. The histograms of such percentages over all the images in the MSRA-1000
and Berkeley-300 databases are shown in Figure 5.

Fig. 5. Histograms of the percentage of background pixels along the image boundary
for the MSRA-1000 (left) and Berkeley-300 (right) databases
As illustrated, the boundary prior is strongly valid in the MSRA-1000
database: 961 out of 1000 images have more than 98% of boundary pixels
as background. The prior is also observed in the Berkeley-300 database, but not
as strongly: 186 out of 300 images have more than 70% of boundary pixels
as background. This database is especially challenging for our approach, as some
images indeed contain large salient objects cropped by the image boundary.
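The per-image statistic behind Figure 5 can be computed as follows (a sketch; the 10-pixel band width matches the patch size mentioned above):

```python
import numpy as np

def boundary_background_percentage(fg_mask, band=10):
    """Fraction of pixels in a `band`-pixel-wide image border that are
    background, given a boolean foreground mask."""
    border = np.zeros_like(fg_mask, dtype=bool)
    border[:band, :] = True
    border[-band:, :] = True
    border[:, :band] = True
    border[:, -band:] = True
    return 1.0 - fg_mask[border].mean()
```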
Fig. 6. Precision-recall curves on the MSRA-1000 database [9]. Top: GS_GD and
GS_SP compared with FT [9], HC [12], LC [33], RC [12], SR [21], IT [1], GB [22],
and CA [11]. Bottom: results before and after combining geodesic saliency with
HC [12] (left) and RC [12] (right).
Results on the MSRA-1000 database [9] are shown in Figure 6, from which we
draw two conclusions: 1) geodesic saliency significantly outperforms previous
methods; in particular, it can highlight the entire object uniformly and achieve
high precision in the high-recall range. Figure 8 shows example images and
results. 2) After combination with geodesic saliency, all eight previous methods
are improved, and four of the combinations (with FT, GB, HC, RC) are better
than both of the methods being combined. In Figure 6 (bottom) we show results
before and after combination for the two best combined methods, GS+HC and
GS+RC.
To further understand how geodesic saliency differs from other methods, Figure
7 shows the saliency value distributions of all foreground and background
pixels (using ground truth masks) for the GS_GD and RC methods. Clearly, the
foreground and background are better separated by GS_GD than by RC.
Results on the Berkeley-300 database [30] are shown in Figure 9, and example
results are shown in Figure 10. The accuracy of all methods is much lower,
indicating that this database is much more difficult. We still reach similar conclusions
about geodesic saliency: 1) it is better than previous methods; 2) after
combination, all eight previous methods are improved, and four of the combinations
(with CA, IT, GB, RC) are better than both of the methods being combined. In
Figure 9 (bottom) we show results before and after combination for the two best
combined methods, GS+GB and GS+RC.
Fig. 7. Saliency value distributions of all foreground and background pixels in the
MSRA-1000 database [9] for the GS_GD and RC methods
5 Discussions
Given the superior results and the intuitiveness of geodesic saliency, this
illustrates that appropriate exploitation of priors is helpful for the ill-posed saliency
detection problem. A similar conclusion has been repeatedly drawn for many
other ill-posed vision problems.
Fig. 9. Precision-recall curves on the Berkeley-300 database [30]. Top: GS_GD and
GS_SP compared with FT [9], HC [12], LC [33], RC [12], SR [21], IT [1], GB [22],
and CA [11]. Bottom: results before and after combining geodesic saliency with
GB [22] (left) and RC [12] (right).
Table 1. Average running time (milliseconds per image) of different methods, measured
on an Intel 2.33 GHz CPU with 4 GB RAM. IT, GB and CA use Matlab
implementations, while the other methods use C++.
Fig. 10. Example results of different methods on the Berkeley-300 database [30]
Fig. 11. Typical failure cases of geodesic saliency: (left) the salient object is significantly
cropped at the image boundary and identified as background; (right) small isolated
background regions are identified as object
References
1. Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 1254-1259 (1998)
2. Parkhurst, D., Law, K., Niebur, E.: Modeling the role of saliency in the allocation of overt visual attention. Vision Research (2002)
3. Judd, T., Ehinger, K., Durand, F., Torralba, A.: Learning to predict where humans look. In: ICCV (2009)
4. Ramanathan, S., Katti, H., Sebe, N., Kankanhalli, M., Chua, T.-S.: An Eye Fixation Database for Saliency Detection in Images. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part IV. LNCS, vol. 6314, pp. 30-43. Springer, Heidelberg (2010)
5. Itti, L., Baldi, P.: Bayesian surprise attracts human attention. In: NIPS (2005)
6. Bruce, N.D.B., Tsotsos, J.K.: Saliency based on information maximization. In: NIPS (2005)
7. Gao, D., Mahadevan, V., Vasconcelos, N.: The discriminant center-surround hypothesis for bottom-up saliency. In: NIPS (2007)
8. Liu, T., Sun, J., Zheng, N., Tang, X., Shum, H.: Learning to detect a salient object. In: CVPR (2007)
9. Achanta, R., Hemami, S., Estrada, F., Süsstrunk, S.: Frequency-tuned salient region detection. In: CVPR (2009)
10. Valenti, R., Sebe, N., Gevers, T.: Image saliency by isocentric curvedness and color. In: ICCV (2009)
11. Goferman, S., Zelnik-Manor, L., Tal, A.: Context-aware saliency detection. In: CVPR (2010)
12. Cheng, M., Zhang, G., Mitra, N., Huang, X., Hu, S.: Global contrast based salient region detection. In: CVPR (2011)
13. Klein, D., Frintrop, S.: Center-surround divergence of feature statistics for salient object detection. In: ICCV (2011)
14. Rubinstein, M., Gutierrez, D., Sorkine, O., Shamir, A.: A comparative study of image retargeting. In: SIGGRAPH Asia (2010)
15. Sun, J., Ling, H.: Scale and object aware image retargeting for thumbnail browsing. In: ICCV (2011)
16. Marchesotti, L., Cifarelli, C., Csurka, G.: A framework for visual saliency detection with applications to image thumbnailing. In: ICCV (2009)
17. Lempitsky, V., Kohli, P., Rother, C., Sharp, T.: Image segmentation with a bounding box prior. In: ICCV (2009)
18. Wang, L., Xue, J., Zheng, N., Hua, G.: Automatic salient object extraction with contextual cue. In: ICCV (2011)
19. Alexe, B., Deselaers, T., Ferrari, V.: What is an object? In: CVPR (2010)
20. Feng, J., Wei, Y., Tao, L., Zhang, C., Sun, J.: Salient object detection by composition. In: ICCV (2011)
21. Hou, X., Zhang, L.: Saliency detection: A spectral residual approach. In: CVPR (2007)
22. Harel, J., Koch, C., Perona, P.: Graph-based visual saliency. In: NIPS (2006)
23. Grady, L., Jolly, M., Seitz, A.: Segmentation from a box. In: ICCV (2011)
24. Vicente, S., Kolmogorov, V., Rother, C.: Graph cut based image segmentation with connectivity priors. In: CVPR (2008)
25. Nowozin, S., Lampert, C.H.: Global connectivity potentials for random field models. In: CVPR (2009)
26. Butko, N.J., Movellan, J.R.: Optimal scanning for faster object detection. In: CVPR (2009)
27. Toivanen, P.J.: New geodesic distance transforms for gray-scale images. Pattern Recognition Letters 17, 437-450 (1996)
28. Veksler, O., Boykov, Y., Mehrani, P.: Superpixels and Supervoxels in an Energy Optimization Framework. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part V. LNCS, vol. 6315, pp. 211-224. Springer, Heidelberg (2010)
29. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: ICCV (2001), http://www.eecs.berkeley.edu/Research/Projects/CS/vision/bsds/
30. Movahedi, V., Elder, J.H.: Design and perceptual validation of performance measures for salient object segmentation. In: IEEE Computer Society Workshop on Perceptual Organization in Computer Vision (2010), http://elderlab.yorku.ca/~vida/SOD/index.html
31. Bai, X., Sapiro, G.: A geodesic framework for fast interactive image and video segmentation and matting. In: ICCV (2007)
32. Criminisi, A., Sharp, T., Blake, A.: GeoS: Geodesic Image Segmentation. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 99-112. Springer, Heidelberg (2008)
33. Zhai, Y., Shah, M.: Visual attention detection in video sequences using spatiotemporal cues. ACM Multimedia (2006)
Joint Face Alignment
with Non-parametric Shape Models
1 Introduction
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 43-56, 2012.
© Springer-Verlag Berlin Heidelberg 2012
44 B.M. Smith and L. Zhang
alignments will drift somewhat from ground truth in order to achieve an internally
consistent solution among input images. However, we show experimentally
that our approach does not sacrifice alignment accuracy to achieve consistency.
2 Background
Non-rigid face alignment algorithms can be divided into two categories: holistic
and local methods. Our approach is most directly inspired by a recent holistic
method [20] that operates on a batch of input images simultaneously, and a
recent local method [1] that combines the output of local detectors with a non-
parametric set of shape models. In this section, we highlight some key differences
between holistic and local methods, and give brief overviews of [20] and [1].
X = X_0 + \sum_{m=1}^{M} p_m X_m, \qquad A = A_0 + \sum_{l=1}^{L} \lambda_l A_l, \quad (1)

where p_m and \lambda_l are the m-th shape and l-th appearance parameters (or loadings),
respectively; X_m and A_l capture the top M shape and L appearance
modes, respectively; and X_0 and A_0 are the average shape and appearance vectors,
respectively. The goal is to minimize the difference between the target image
and the model:

\min_{p, \lambda} \Big\| A_0 + \sum_{l=1}^{L} \lambda_l A_l - W(I; p) \Big\|_2^2, \quad (2)

where W warps the image I in a piecewise linear fashion using the shape parameters p.
One of the major challenges in applying AAMs is that it is difficult to adequately
synthesize the appearance of a new face using a linear model. As demonstrated
experimentally by Gross et al. [6], it is much more difficult to build a
generic appearance model than a person-specific one. This is known as the
generalization problem.
To overcome this problem, Zhao et al. [20] introduced a joint alignment technique
in which a batch of person-specific images is simultaneously aligned in
a modified AAM setting. Intuitively, their approach aims to simultaneously (1)
identify the person-specific appearance space of the input images, which is
assumed to be linear and low-dimensional, and (2) align the person-specific
appearance space with the generic appearance space, the two being assumed to be
proximate rather than distant. This joint approach was shown to produce excellent
results on a wide variety of real-world images from the internet. However,
the approach breaks down under several common conditions, including significant
occlusion or shadow, image degradation, and outliers. Like Zhao et al., we
operate on a set of input images jointly, but our approach is more robust to
these conditions.
where the sum is taken over the J images, e_j is an indicator vector (i.e., the j-th
element is one, all others are zero), a scalar weight parameter balances
the first term with the second, and all other parts are defined as in Eq. (2).
The rank term encourages the appearance of the input images to be linearly
correlated; intuitively, the rank term favors a solution in which the input images
are well aligned with one another (not just with the global model). Unfortunately,
the rank term is non-convex and non-continuous, and minimizing it is an NP-
hard problem. To make the problem more tractable, they borrow a trick from
compressive sensing: they replace the rank term with its tightest convex
relaxation, the nuclear norm ||X||_*, which is defined as the sum of singular
values, ||X||_* = \sum_k \sigma_k(X). See [20] for more details.
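As a quick illustration of this relaxation (using NumPy, not the authors' code), the nuclear norm is simply the sum of the singular values:

```python
import numpy as np

def nuclear_norm(X):
    """||X||_* = sum of singular values: the tightest convex surrogate
    of rank(X) on the unit spectral-norm ball."""
    return float(np.linalg.svd(X, compute_uv=False).sum())
```

For example, a rank-1 matrix ab^T has a single nonzero singular value ||a|| ||b||, so its nuclear norm stays small even when its entries are large, which is what makes the term a useful rank surrogate.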
Like [20], we incorporate a rank term in our optimization. However, because we
couple the local appearance of faces with global shape models, our rank term does
not operate directly on the holistic appearance of faces. Instead, it operates on
the simplied shape representation of faces, and encourages landmark estimates
that are spatially consistent across images. We encourage the local appearance
at each landmark estimate to be consistent across multiple images via a separate
term, which we discuss in Section 3.
shape model to jointly recover the location of landmarks. Like AAMs, CLMs
typically model non-rigid shape variation linearly, as in Eq. (1).
Recently, Belhumeur et al. [1] proposed a different approach that avoids a
parametric linear shape model, and instead combines the outputs of local detectors
with a non-parametric set of global shape models. Belhumeur et al. showed
excellent landmark localization results on single images of faces under challenging
conditions, such as significant occlusions, shadows, and a wide range of pose
and expression variation. According to their experiments on real-world images
from the internet, their algorithm produced slightly more accurate results, on
average, than humans assigned to the same task.
Despite its impressive accuracy, we found that Belhumeur et al.'s approach
produces temporally inconsistent results on video. Our method makes use of the
same non-parametric shape model approach proposed by Belhumeur et al., but
we further constrain the local appearance and face shape in a joint alignment
setting. As our experiments show, we produce more consistent results, especially
on video. At the same time, we do not sacrifice localization accuracy to achieve
this consistency.
where the collection of M global models X_{k,t} has been introduced and then
marginalized out. By conditioning on the global model X_{k,t}, the locations of
the parts x^i are assumed to be conditionally independent of one another, i.e.,
P(X | X_{k,t}, D) = \prod_{i=1}^{N} P(x^i | x^i_{k,t}, d^i). Bayes' rule is then applied to Eq. (5), and
all probability terms that depend only on D and X_{k,t} are reduced to a constant
C. This yields the following optimization problem:

X^* = \arg\max_X \sum_{k=1}^{M} \int_{t \in T} C \prod_{i=1}^{N} P(\Delta x^i_{k,t}) \, P(x^i | d^i) \, dt, \quad (6)
where P(\Delta x^i_{k,t}) is a 2D Gaussian distribution that models how well the landmark
location in the global model, x^i_{k,t}, fits the true location x^i, i.e., \Delta x^i_{k,t} = x^i_{k,t} - x^i.
P(x^i | d^i) takes the form of a local response map for landmark i.
A RANSAC-like approach is used to generate a large number of global models
X_{k,t}, which are evaluated using P(X_{k,t} | D). The set M* of top global models
for which P(X_{k,t} | D) is greatest is then used to approximate Eq. (6) as

X^* = \arg\max_X \sum_{(k,t) \in M*} \prod_{i=1}^{N} P(\Delta x^i_{k,t}) \, P(x^i | d^i). \quad (7)
X^* is found by choosing, for each landmark, the image location where the following
sum is maximized:

x^{i*} = \arg\max_{x^i} \sum_{(k,t) \in M*} P(\Delta x^i_{k,t}) \, P(x^i | d^i), \quad (8)

which is equivalent to solving for x^i in Eq. (6) after setting all P(\Delta x^{i'}_{k,t}) and
P(x^{i'} | d^{i'}) to a constant for all i' \neq i. For more details, please see [1].
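A minimal sketch of this per-landmark maximization, assuming an isotropic 2D Gaussian for P(Δx) and a precomputed detector response map for P(x | d) (array shapes, the σ parameter, and function names are illustrative assumptions, not the implementation of [1]):

```python
import numpy as np

def localize_landmark(exemplar_means, sigma, response_map):
    """Pick the pixel maximizing the sum over the top global models of
    P(dx) * P(x | d): a Gaussian vote around each exemplar-predicted
    location, multiplied by the local detector response map."""
    H, W = response_map.shape
    ys, xs = np.mgrid[0:H, 0:W]
    score = np.zeros((H, W))
    for (my, mx) in exemplar_means:
        # Isotropic Gaussian centered at this model's predicted location.
        g = np.exp(-((ys - my) ** 2 + (xs - mx) ** 2) / (2.0 * sigma ** 2))
        score += g * response_map
    return np.unravel_index(np.argmax(score), score.shape)
```

Detector peaks far from every exemplar prediction are suppressed by the Gaussian votes, which is how the global models regularize the local responses.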
3 Our Approach
This section provides details of how we couple the shape and appearance infor-
mation from multiple related face images.
In a multiple-image setting, we can trivially modify Eq. (10) to incorporate all
images:

\{X_j^*\}_{j=1,\ldots,J} = \arg\max_{X_1, X_2, \ldots, X_J} \sum_{j=1}^{J} \sum_{i=1}^{N} w^i_j(x^i_j), \quad (11)
where w^i_j(x^i_j) corresponds to landmark i in image j. Note that the selection of
each set of global models M_j is still independent for each image j, and finding
the best \{X_j^*\}_{j=1,\ldots,J} in Eq. (11) is equivalent to solving Eq. (10) for each image
separately.
where

a\big(x^i_j, \{x^i_{j'}\}_{j' \neq j}\big) = \frac{1}{J-1} \sum_{j' \in S_j} \beta^i_{j'} \exp\big(-\| s^i_j(x^i_j) - s^i_{j'}(x^i_{j'}) \|^2\big), \quad (13)

S_j is the set of input images associated with, but not including, image j, and
s^i_j(x^i_j) is a local feature descriptor centered at x^i_j. \beta^i_{j'} is a weight reflecting the
confidence that s^i_{j'}(x^i_{j'}) describes the correct landmark location. In practice, we
use \beta^i_j = w^i_j(x^i_j). Intuitively, Eq. (12) is maximized when two conditions are
simultaneously optimized: (1) the face shape coincides with positive measurements
according to the local detectors, and (2) the local descriptors are consistent across
input images.
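The appearance-consistency score of Eq. (13) can be sketched as follows (the descriptor vectors and confidence weights are hypothetical inputs; the 1/(J-1) averaging assumes the list covers exactly the other J-1 images in S_j):

```python
import numpy as np

def appearance_consistency(desc_j, descs_other, confidences):
    """Average, over the other images, of a confidence-weighted Gaussian
    affinity between local descriptors at the candidate landmark locations.

    desc_j:      descriptor at the candidate location in image j.
    descs_other: descriptors at the current estimates in the other images.
    confidences: per-image weights (e.g. the detector responses w^i_j).
    """
    total = 0.0
    for d, c in zip(descs_other, confidences):
        total += c * np.exp(-np.sum((desc_j - d) ** 2))
    return total / len(descs_other)
```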
w^i_{jj'}\big(x^i_j[h^i_j], x^i_{j'}[h^i_{j'}]\big), or more simply w^i_{jj'}(h^i_j, h^i_{j'}). Let \alpha^i_j[h^i_j] \in \{0, 1\} be an
indicator variable that can either be on (1) or off (0) for each possible h^i_j. Eq. (17)
can then be reformulated into the following energy minimization problem:

\min_{\alpha} \; \psi(\alpha) + \phi[X, Z] \quad (18)
s.t. \; \alpha^i_j[h^i_j] \in \{0, 1\} \quad (19)
\sum_{h^i_j} \alpha^i_j[h^i_j] = 1 \quad (20)
A(\alpha) - X = 0 \quad (21)
Z_0 - Z = 0, \quad (22)

where

\psi(\alpha) = \frac{1}{N} \sum_i \sum_j \sum_{j' \in S_j} \sum_{h^i_j} \sum_{h^i_{j'}} \alpha^i_j[h^i_j] \, \alpha^i_{j'}[h^i_{j'}] \, w^i_{jj'}(h^i_j, h^i_{j'}) \quad (23)
Solving for \alpha. To find the \alpha_{k+1} that minimizes Eq. (25), we need to consider
only those terms in Eq. (24) that involve \alpha:
We solve Eq. (29) by breaking it into independent subproblems, one for each
landmark i. Each subproblem is then solved using an iterative greedy approach:
Loop until convergence
  For landmark i = 1, 2, . . . , N
    For image j = 1, 2, . . . , J
      1. Assume all \alpha^i_{j'}[h^i_{j'}] for j' \in S_j are known and assign a
         single hypothesis h^i_{j'} to each j' \in S_j.
Barack Obama Keanu Reeves Lucy Liu Miley Cyrus Oprah Winfrey
Fig. 1. Selected images from two face datasets used for evaluation: Multi-PIE [8] on
top and PubFig [12] on the bottom, with ground truth landmarks shown in green
[Bar charts: mean landmark error normalized by the interocular distance (mean
err./IOD) for each facial feature (eyes, eyebrows, nose, mouth, chin/edge), comparing
CLM, CoE, and our method; top: the PubFig subjects listed above; bottom:
Multi-PIE Subjects 002, 017, 055, 077, and 08.]
Fig. 2. This figure shows a comparison between our jointly aligned results and the
results generated by our implementations of Saragih et al.'s [16] algorithm (CLM) and
Belhumeur et al.'s [1] algorithm (CoE) on five sets of images from the PubFig [12]
dataset (top) and five sets of images from the Multi-PIE dataset [8] (bottom). Please
see Section 4.2 for an explanation of these results.
Fig. 3. Test results using our method on PubFig images. Our alignment results are
nearly identical, even when the input image (shown larger than the rest) is jointly
aligned with entirely dierent sets of input images.
[Line plots: mean landmark acceleration (top) and mean err./IOD (bottom) per
frame (frames 1-75) for five video clips, comparing CLM, CoE, and our method.]
Fig. 4. This figure shows a comparison between our results and the results generated
by our implementations of Saragih et al.'s [16] algorithm (CLM) and Belhumeur et
al.'s [1] algorithm (CoE), where both were applied to each frame of the videos
independently. Our results are generally both more temporally stable (as measured by the
average landmark acceleration in each frame) and slightly more accurate relative to
ground truth. Please see Section 4.3 for an explanation of these results.
We notice that the CLM and CoE methods generate landmark estimates that
appear to swim or jump around their true location. Our approach generates
landmark estimates that are noticeably more stable over time. Meanwhile, we
do not sacrice landmark accuracy relative to ground truth (i.e., we do not
over-smooth the trajectories); in fact, we often achieve slightly more accurate
localizations compared to [1] and [16]. Figure 5 shows several video frames with
our landmark estimates overlaid in green. For a clearer demonstration, please
see our supplemental video.
Fig. 5. Selected frames from a subset of our experimental videos with landmarks esti-
mated by our method shown in green. These video clips, downloaded from YouTube,
were chosen with several key challenges in mind, including dramatic expressions, large
head motion and pose variation, and occluded regions of the face.
5 Conclusions
We have described a new joint alignment approach that takes a batch of images
as input and produces a set of alignments that are more shape- and appearance-
consistent as output. In addition to introducing two constraints that favor shape
and appearance consistency, we have outlined an approach to optimize both of
them together in a joint alignment framework. Our method shows improvement
most dramatically on video sequences. Despite producing more temporally con-
sistent results, we do not sacrice alignment accuracy relative to ground truth.
References
1. Belhumeur, P.N., Jacobs, D.W., Kriegman, D.J., Kumar, N.: Localizing parts of faces using a consensus of exemplars. In: Computer Vision and Pattern Recognition, pp. 545-552 (2011)
2. Bitouk, D., Kumar, N., Dhillon, S., Belhumeur, P., Nayar, S.K.: Face swapping: automatically replacing faces in photographs. In: ACM SIGGRAPH (2008)
3. Blanz, V., Vetter, T.: A morphable model for the synthesis of 3D faces. In: ACM SIGGRAPH (1999)
4. Cai, J.F., Candès, E.J., Shen, Z.: A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization 20(4), 1956-1982 (2008)
5. Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active Appearance Models. In: Bubak, M., Hertzberger, B., Sloot, P.M.A. (eds.) HPCN-Europe 1998. LNCS, vol. 1401, pp. 484-498. Springer, Heidelberg (1998)
6. Gross, R., Matthews, I., Baker, S.: Generic vs. person specific active appearance models. Image and Vision Computing 23(11), 1080-1093 (2005)
7. Gross, R., Matthews, I., Baker, S.: Active appearance models with occlusion. Image and Vision Computing 24(6), 593-604 (2006)
8. Gross, R., Matthews, I., Cohn, J., Kanade, T., Baker, S.: Multi-PIE. In: International Conference on Automatic Face and Gesture Recognition (2008)
9. Gu, L., Kanade, T.: A Generative Shape Regularization Model for Robust Face Alignment. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 413-426. Springer, Heidelberg (2008)
10. Kemelmacher-Shlizerman, I., Sankar, A., Shechtman, E., Seitz, S.M.: Being John Malkovich. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part I. LNCS, vol. 6311, pp. 341-353. Springer, Heidelberg (2010)
11. Kemelmacher-Shlizerman, I., Shechtman, E., Garg, R., Seitz, S.M.: Exploring photobios. In: ACM SIGGRAPH (2011)
12. Kumar, N., Berg, A.C., Belhumeur, P.N., Nayar, S.K.: Attribute and simile classifiers for face verification. In: International Conference on Computer Vision, pp. 365-372 (2009)
13. Lowe, D.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91-110 (2004)
14. Lucey, S., Wang, Y., Saragih, J., Cohn, J.F.: Non-rigid face tracking with enforced convexity and local appearance consistency constraint. Image and Vision Computing 28(5), 781-789 (2010)
15. Matthews, I., Baker, S.: Active appearance models revisited. International Journal of Computer Vision 60(2), 135-164 (2004)
16. Saragih, J.M., Lucey, S., Cohn, J.F.: Face alignment through subspace constrained mean-shifts. In: International Conference on Computer Vision, pp. 1034-1041 (2009)
17. Smith, B.M., Zhu, S., Zhang, L.: Face image retrieval by shape manipulation. In: Computer Vision and Pattern Recognition, pp. 769-776 (2011)
18. Viola, P., Jones, M.J.: Robust real-time face detection. International Journal of Computer Vision 57(2), 137-154 (2004)
19. Wang, Y., Lucey, S., Cohn, J.F.: Enforcing convexity for improved alignment with constrained local models. In: Computer Vision and Pattern Recognition (2008)
20. Zhao, C., Cham, W.K., Wang, X.: Joint face alignment with a generic deformable face model. In: Computer Vision and Pattern Recognition, pp. 561-568 (2011)
21. Zhou, M., Liang, L., Sun, J., Wang, Y.: AAM based face tracking with temporal matching and face segmentation. In: Computer Vision and Pattern Recognition, pp. 701-708 (2010)
Discriminative Bayesian Active Shape Models
1 Introduction
Deformable model fitting aims to find the parameters of a Point Distribution
Model (PDM) that best describe the object of interest in an image. Several fitting
strategies have been proposed, most of which can be categorized as being
either holistic (generative) or patch-based (discriminative). The holistic representations
[1][2] model the appearance of all image pixels describing the object.
By synthesizing the expected appearance template, a high registration accuracy
can be achieved. However, such a representation generalizes poorly when large
amounts of variability are involved, such as the human face under variations of
identity, expression, pose, lighting or non-rigid motion, due to the huge dimensionality
of the appearance representation (learned from limited data).
Recently, discriminative methods, such as the Constrained Local Model
(CLM) [4][5][6][7][8], have been proposed. These approaches can improve the
model's representation power, as it accounts only for local correlations between
pixel values. In this paradigm, shape and appearance are combined by
constraining an ensemble of local feature detectors to lie within the subspace
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 57-70, 2012.
© Springer-Verlag Berlin Heidelberg 2012
58 P. Martins et al.
spanned by the PDM. The CLM implements a two-step fitting strategy: a local search and a global optimization. The first step performs an exhaustive local search using a feature detector, obtaining response maps for each landmark. Then, the global optimization finds the PDM parameters that jointly maximize the detection responses. Each landmark detector generates a likelihood map by applying local detectors to the neighborhood regions around the current estimate.
Some of the most popular optimization strategies propose to replace the true response maps by simple parametric forms (Weighted Peak Responses [4], Gaussian Responses [8], Mixture of Gaussians [9]) and perform the global optimization over these forms instead of the original response maps. The detectors are learned from training images of each of the object's landmarks. However, due to their small local support and large appearance variation, they can suffer from detection ambiguities. In [10] the authors attempt to deal with these ambiguities by nonparametrically approximating the response maps using the mean-shift algorithm, constrained to the PDM subspace (Subspace Constrained Mean-Shift - SCMS). However, in the SCMS global optimization the PDM parameter update is essentially a regularized projection of the mean-shift vector for each landmark onto the subspace of plausible shape variations. Since a least squares projection is used, the optimization is very sensitive to outliers (when the mean-shift output is very far away from the correct landmark location). The patch responses can be embedded into a Bayesian inference problem, where the posterior distribution of the global warp can be inferred in a maximum a posteriori (MAP) sense. The Bayesian paradigm provides an effective fitting strategy, since it combines in the same framework both the shape prior (the PDM) and multiple sets of patch alignment classifiers to further improve the accuracy.
Fig. 2. The DBASM combines a Point Distribution Model (PDM) and a set of discriminant local detectors, one for each landmark. a) Image with the current mesh showing the search region for some landmarks. b) The local detector (the MOSSE filter [11] itself). c) Response maps for the corresponding highlighted landmarks. The DBASM global optimization jointly combines all landmark response maps, in a MAP sense, using 2nd order statistics of the shape and pose parameters.
where y is the observed shape, p(y|b_s) is the likelihood term and p(b_s) is a prior distribution over all possible configurations. Section 3.2 describes some possible strategies to set the observed shape vector y.
The complexity of the problem, in eq. 2, can be reduced by making some simple assumptions. Firstly, conditional independence between landmarks can be assumed simply by sampling each landmark independently. Secondly, it can also be considered that we have an approximate solution to the true parameters (\(\hat{\mathbf{b}} \approx \mathbf{b}_s\)). Combining these approximations, eq. 2 can be rewritten as
\[
p(\mathbf{b}\,|\,\mathbf{y}) \propto \prod_{i=1}^{v} p(\mathbf{y}_i\,|\,\mathbf{b})\, p(\mathbf{b}\,|\,\mathbf{b}_{k-1}) \tag{3}
\]
where \(\mathbf{y}_i\) is the ith landmark's coordinates and \(\mathbf{b}_{k-1}\) is the previous optimal estimate of \(\mathbf{b}\).
The likelihood term, including the PDM model (in eq. 1), becomes the following convex energy function:
\[
p(\mathbf{y}\,|\,\mathbf{b}) \propto \exp\Big(-\tfrac{1}{2}\,(\underbrace{\mathbf{y} - \mathbf{s}_0}_{\Delta\mathbf{y}} - \Phi\mathbf{b})^{T}\,\Sigma_y^{-1}\,(\Delta\mathbf{y} - \Phi\mathbf{b})\Big) \tag{4}
\]
where \(\Delta\mathbf{y}\) is the difference between the observed and the mean shape and \(\Sigma_y\) is the uncertainty of the spatial localization of the landmarks (a \(2v \times 2v\) block diagonal covariance matrix). From the probabilistic point of view, the likelihood term follows a Gaussian distribution given by
\[
p(\mathbf{y}\,|\,\mathbf{b}) \propto \mathcal{N}(\Delta\mathbf{y}\,|\,\Phi\mathbf{b},\ \Sigma_y). \tag{5}
\]
The prior term, according to the approximations taken, can be written as
\[
p(\mathbf{b}_k\,|\,\mathbf{b}_{k-1}) \propto \mathcal{N}(\mathbf{b}_k\,|\,\boldsymbol{\mu}_b,\ \Sigma_b) \tag{6}
\]
where \(\boldsymbol{\mu}_b = \mathbf{b}_{k-1}\) and \(\Sigma_b = \Lambda + \Sigma_q\). \(\Lambda\) is the shape parameters covariance (a diagonal matrix with the PCA eigenvalues) and \(\Sigma_q\) is an additive dynamic noise covariance (that can be estimated offline).
An important property of Bayesian inference is that, when the likelihood and the prior are Gaussian distributions, the posterior is also Gaussian [12]. Following the Bayes theorem for Gaussian variables, and considering p(b_k|b_{k-1}) a Gaussian prior distribution for b_k and p(y|b_k) a Gaussian likelihood distribution, the posterior distribution takes the form ([12], page 90)
\[
p(\mathbf{b}_k\,|\,\mathbf{y}) \propto \mathcal{N}(\mathbf{b}_k\,|\,\boldsymbol{\mu},\ \Sigma) \tag{7}
\]
\[
\Sigma = \big(\Sigma_b^{-1} + \Phi^T \Sigma_y^{-1} \Phi\big)^{-1} \tag{8}
\]
\[
\boldsymbol{\mu} = \Sigma\,\big(\Phi^T \Sigma_y^{-1} \Delta\mathbf{y} + \Sigma_b^{-1} \boldsymbol{\mu}_b\big). \tag{9}
\]
Note that the conditional distribution p(y|b_k) has a mean that is a linear function of b_k and a covariance which is independent of b_k. This could be a possible solution to the global alignment optimization [13]. However, in practice, this is a naive approach because it does not model the covariance of the latent variables, b_k, which is crucial to account for the confidence in the current parameter estimates.
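The Gaussian posterior update of eqs. (8) and (9) can be sketched numerically. The sketch below uses NumPy with illustrative dimensions and randomly generated basis, covariances, and observations; none of these values come from a trained model.

```python
import numpy as np

# Sketch of the Gaussian posterior in eqs. (8)-(9), with illustrative
# dimensions: n shape parameters, v landmarks (2v coordinates).
rng = np.random.default_rng(0)
n, v = 4, 10

Phi = rng.standard_normal((2 * v, n))   # PDM basis (assumed, not learned)
Sigma_y = 0.5 * np.eye(2 * v)           # landmark localization uncertainty
Sigma_b = 2.0 * np.eye(n)               # prior covariance
mu_b = rng.standard_normal(n)           # prior mean (previous estimate)
dy = rng.standard_normal(2 * v)         # observed shape deviation

# Posterior covariance: Sigma = (Sigma_b^{-1} + Phi^T Sigma_y^{-1} Phi)^{-1}
Sigma_b_inv = np.linalg.inv(Sigma_b)
Sigma_y_inv = np.linalg.inv(Sigma_y)
Sigma = np.linalg.inv(Sigma_b_inv + Phi.T @ Sigma_y_inv @ Phi)

# Posterior mean: mu = Sigma (Phi^T Sigma_y^{-1} dy + Sigma_b^{-1} mu_b)
mu = Sigma @ (Phi.T @ Sigma_y_inv @ dy + Sigma_b_inv @ mu_b)
```

Since the posterior precision is the sum of prior and measurement precisions, the posterior covariance can never exceed the prior covariance, which is the sense in which the measurement increases confidence.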
The MAP global alignment solution can be inferred by a Linear Dynamical System (LDS). The LDS is the ideal technique to model the covariance of the latent variables and solve the naive approach's limitations. The LDS is a simple approach that recursively computes the posterior probability using incoming Gaussian measurements and a linear process model, taking into account all the available measurements (the same requirements as our alignment problem). The state and measurement equations of the LDS, according to the PDM alignment problem, can be written as
\[
\mathbf{b}_k = A\,\mathbf{b}_{k-1} + \mathbf{q} \tag{10}
\]
\[
\Delta\mathbf{y} = \Phi\,\mathbf{b}_k + \mathbf{r} \tag{11}
\]
where the current shape parameters b_k are the hidden state vector, \(\mathbf{q} \sim \mathcal{N}(0, \Sigma_b)\) is the additive dynamic noise, \(\Delta\mathbf{y}\) is the observed shape deviation, related to the shape parameters by the linear relation (eq. 1), and \(\mathbf{r}\) is the additive measurement noise following \(\mathbf{r} \sim \mathcal{N}(0, \Sigma_y)\). The previous estimated shape parameters b_{k-1} are connected to the current parameters b_k by an identity relation plus noise (A = I_n).
We highlight that the final step of the LDS derivation consists of a Bayesian inference step [12] (using the Bayes theorem for Gaussian variables), where the likelihood term is given by eq. 5 and the prior follows \(\mathcal{N}(A\boldsymbol{\mu}^F_{k-1}, P_{k-1})\) where
\[
P_{k-1} = (\Lambda + \Sigma_q) + A\,\Sigma^F_{k-1}\,A^T. \tag{12}
\]
From these equations we can see that the LDS keeps the uncertainty on the current estimate of the shape parameters up to date. The LDS recursively computes the mean and covariance of posterior distributions of the form
\[
p(\mathbf{b}_k\,|\,\mathbf{y}_k, \ldots, \mathbf{y}_0) \propto \mathcal{N}(\mathbf{b}_k\,|\,\boldsymbol{\mu}^F_k,\ \Sigma^F_k) \tag{13}
\]
with the posterior mean \(\boldsymbol{\mu}^F_k\) and covariance \(\Sigma^F_k\) given by the LDS formulas:
Weighted Peak Response (WPR): The simplest solution is to take the spatial location where the response map has the highest score [4]. The new landmark position is then weighted by a factor that reflects the peak confidence. Formally, the WPR solution is given by
\[
\mathbf{y}^{\mathrm{WPR}}_i = \arg\max_{\mathbf{z}_i \in \Psi^c_i} \big(p_i(\mathbf{z}_i)\big), \qquad \Sigma^{\mathrm{WPR}}_{y_i} = \mathrm{diag}\big(p_i(\mathbf{y}^{\mathrm{WPR}}_i)^{-1}\big) \tag{18}
\]
Gaussian Response (GR): The response map can also be approximated by a single full Gaussian, with weighted mean and covariance
\[
\mathbf{y}^{\mathrm{GR}}_i = \frac{1}{d}\sum_{\mathbf{z}_i \in \Psi^c_i} p_i(\mathbf{z}_i)\,\mathbf{z}_i, \qquad \Sigma^{\mathrm{GR}}_{y_i} = \frac{1}{d-1}\sum_{\mathbf{z}_i \in \Psi^c_i} p_i(\mathbf{z}_i)\,(\mathbf{z}_i - \mathbf{y}^{\mathrm{GR}}_i)(\mathbf{z}_i - \mathbf{y}^{\mathrm{GR}}_i)^T. \tag{19}
\]
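The GR estimate of eq. (19) is just a response-weighted mean and covariance over the search region. A minimal NumPy sketch, using a synthetic response map peaked at an assumed location (the grid size, peak position, and spread are illustrative):

```python
import numpy as np

# Sketch of the Gaussian Response (GR) estimate, eq. (19): approximate
# the response map p_i by its weighted mean and covariance.
xs, ys = np.meshgrid(np.arange(9), np.arange(9))
Z = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)  # candidates z_i

# Synthetic detector response peaked near (5, 4) (assumed for illustration).
p = np.exp(-0.5 * ((Z[:, 0] - 5) ** 2 + (Z[:, 1] - 4) ** 2) / 1.5)
d = p.sum()                              # normalization constant

y_gr = (p[:, None] * Z).sum(axis=0) / d  # weighted mean (landmark update)
diff = Z - y_gr
Sigma_gr = (p[:, None, None] * (diff[:, :, None] * diff[:, None, :])).sum(axis=0) / (d - 1)
```

The resulting covariance carries the detector's spatial uncertainty into the global optimization, which is what distinguishes GR from a bare peak pick.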
Kernel Density Estimator (KDE): The response maps can also be approximated by a nonparametric representation, namely using a Kernel Density Estimator (KDE) (an isotropic Gaussian kernel with a bandwidth h^2). Maximizing over the KDE is typically performed using the well-known mean-shift algorithm [10]. The kernel bandwidth h^2 is a free parameter that exhibits a strong influence on the resulting estimate. This problem can be addressed by an annealing bandwidth schedule [14]. It can be shown that there exists an h^2 value such that the KDE is unimodal. As h^2 is reduced, the modes divide and the smoothness of the KDE decreases, guiding the optimization towards the true objective. Formally, the ith annealed mean-shift landmark update is given by
\[
\mathbf{y}^{\mathrm{KDE}(\tau+1)}_i = \frac{\sum_{\mathbf{z}_i \in \Psi^c_i} \mathbf{z}_i\, p_i(\mathbf{z}_i)\, \mathcal{N}\big(\mathbf{y}^{\mathrm{KDE}(\tau)}_i \,|\, \mathbf{z}_i,\ h^2_j I_2\big)}{\sum_{\mathbf{z}_i \in \Psi^c_i} p_i(\mathbf{z}_i)\, \mathcal{N}\big(\mathbf{y}^{\mathrm{KDE}(\tau)}_i \,|\, \mathbf{z}_i,\ h^2_j I_2\big)} \tag{20}
\]
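The annealed mean-shift update of eq. (20) can be sketched as follows. The response map, the bimodal peaks, and the bandwidth schedule are illustrative assumptions; with a large initial bandwidth the KDE is near-unimodal, and shrinking it pulls the estimate onto the dominant mode.

```python
import numpy as np

# Sketch of the annealed mean-shift update in eq. (20).
def mean_shift_annealed(Z, p, y0, bandwidths, iters=20):
    """Z: (m,2) candidate positions, p: (m,) detector responses,
    y0: (2,) initial landmark estimate, bandwidths: decreasing h^2 schedule."""
    y = np.asarray(y0, dtype=float)
    for h2 in bandwidths:
        for _ in range(iters):
            # Gaussian kernel N(y | z_i, h^2 I_2), up to a constant factor
            w = p * np.exp(-0.5 * np.sum((Z - y) ** 2, axis=1) / h2)
            y = (w[:, None] * Z).sum(axis=0) / w.sum()
    return y

xs, ys = np.meshgrid(np.arange(15), np.arange(15))
Z = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
# Bimodal response: a strong mode at (4, 4) and a weaker one at (11, 10).
p = (np.exp(-0.5 * np.sum((Z - [4, 4]) ** 2, axis=1))
     + 0.6 * np.exp(-0.5 * np.sum((Z - [11, 10]) ** 2, axis=1)))

y = mean_shift_annealed(Z, p, y0=[10.0, 9.0], bandwidths=[16.0, 4.0, 1.0])
```

Even though the start point sits next to the weaker mode, the coarse-to-fine bandwidth schedule steers the iterate toward the stronger mode before the fine bandwidth locks it in.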
Figure 3 highlights the differences between the three local optimization strategies (WPR, GR and KDE). Notice that DBASM deals with mild occlusions. When a landmark is under occlusion, the response map is typically multi-modal. If a KDE local strategy is used (DBASM-KDE), the landmark update will select the nearest mode (eq. 20) and the covariance of that landmark (eq. 21) will
Fig. 3. Qualitative comparison between the three local optimization strategies. The WPR simply chooses the maximum detector response. GR approximates the response map by a full Gaussian distribution. KDE uses the mean-shift algorithm to move to the nearest mode of the density; its uncertainty is centered at the found mode. The two examples on the right show patches under occlusion (typically multimodal responses).
1 Precompute:
2   The parametric models (s_0, Φ, Ψ) and the MOSSE filters in the Fourier domain H_i
3   Initial estimates of the shape/pose parameters and covariances (b_0, P_0) / (q_0, Q_0)
4 repeat
5   Warp image I to the base mesh using the current pose parameters q [0.5 ms]
6   Generate the current shape s = s_0 + Φb + Ψq
7   for landmark i = 1 to v do
8     Evaluate the detector response (MOSSE correlation F^{-1}{F{W(I)} ⊙ H*_i}) [3 ms]
9     Find y_i and Σ_{y_i} using a local strategy (sec. 3.2), e.g. if using KDE, eqs. 20 and 21, respectively
10  end
11  Update the pose parameters and their covariance [0.1 ms]:
      Q'_{k-1} = (Λ_q + Σ_q + Q_{k-1}),  K_q = Q'_{k-1} Ψ^T (Ψ Q'_{k-1} Ψ^T + Σ_y)^{-1}
12    q_k = q_{k-1} + K_q (Δy − Ψ q_{k-1}),  Q_k = (I_4 − K_q Ψ) Q'_{k-1}
13  Update the shape parameters (with pose correction) and their covariance [0.2 ms]:
      P'_{k-1} = (Λ + Σ_q + P_{k-1}),  K_b = P'_{k-1} Φ^T (Φ P'_{k-1} Φ^T + Σ_y)^{-1}
14    b_k = b_{k-1} + K_b (Δy − Φ b_{k-1} − Ψ q_k),  P_k = (I_n − K_b Φ) P'_{k-1}
15 until ||b_k − b_{k-1}|| ≤ ε or the maximum number of iterations is reached
Algorithm 1: Overview of the DBASM method. The performance of DBASM is comparable to ASM [4], CQF [8] or SCMS [10], depending on the local strategy (DBASM-WPR, DBASM-GR or DBASM-KDE, respectively). It achieves near real-time performance. The bottleneck is always obtaining the response maps (3 ms × number of landmarks), although this can be done in parallel.
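The shape update in Algorithm 1 (lines 13-14) is a Kalman-style correction. A minimal NumPy sketch follows, with illustrative dimensions; the PDM basis Phi and all covariances below are randomly generated assumptions rather than learned values, and the pose-correction term Ψq_k is omitted for brevity.

```python
import numpy as np

# Sketch of one DBASM shape-parameter update (Algorithm 1, lines 13-14).
rng = np.random.default_rng(1)
n, v = 4, 10
Phi = rng.standard_normal((2 * v, n))  # shape basis (assumed)
Lam = np.eye(n)                         # PCA eigenvalue covariance (Lambda)
Sig_q = 0.1 * np.eye(n)                 # additive dynamic noise covariance
P_prev = 0.5 * np.eye(n)                # previous parameter covariance P_{k-1}
Sig_y = 0.5 * np.eye(2 * v)             # landmark localization uncertainty
b_prev = np.zeros(n)                    # previous shape parameters b_{k-1}
dy = rng.standard_normal(2 * v)         # observed shape deviation

P_pred = Lam + Sig_q + P_prev                                      # predicted covariance
K = P_pred @ Phi.T @ np.linalg.inv(Phi @ P_pred @ Phi.T + Sig_y)   # gain K_b
b_k = b_prev + K @ (dy - Phi @ b_prev)                             # updated parameters
P_k = (np.eye(n) - K @ Phi) @ P_pred                               # updated covariance
```

The update shrinks the parameter covariance whenever the landmark measurements are informative, which is precisely the confidence bookkeeping the naive Bayesian solution lacks.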
4 Evaluation Results
The experiments in this paper were designed to evaluate the local detector (MOSSE [11]) and the new Bayesian global optimization (DBASM). All the experiments were conducted on several databases with publicly available ground truth. (1) The IMM [15] database consists of 240 annotated images of 40 different human faces presenting different head poses, illumination, and facial expressions (58 landmarks). (2) The BioID [16] dataset contains 1521 images, each showing a near frontal view of the face of one of 23 different subjects (20 landmarks). (3) The XM2VTS [17] database has 2360 frontal face images of 295 subjects (68 landmarks). (4) The tracking performance is evaluated on the FGNet Talking Face (TF) [18] video sequence, which holds 5000 frames of video of an individual engaged in a conversation (68 landmarks). (5) Finally, a qualitative evaluation was also performed using the Labeled Faces in the Wild (LFW) [3] database, which contains images taken under variability in pose, lighting, focus, facial expression, occlusions, different backgrounds, etc.
The Minimum Output Sum of Squared Error (MOSSE) filter, recently proposed in [11], finds the optimal filter that minimizes the Sum of Squared Differences (SSD) to a desired correlation output. Briefly, correlation can be computed in the frequency domain as the element-wise multiplication of the 2D Fourier transform (F) of an input image I with a filter H, also defined in the Fourier domain, as G = F{I} ⊙ H*, where the symbol ⊙ represents the Hadamard product and * is the complex conjugate. The correlation value is given by F^{-1}{G}, the inverse Fourier transform of G.
MOSSE finds the filter H, in the Fourier domain, that minimizes the SSD between the actual output of the correlation and the desired output of the correlation, across a set of N training images: \(\min_H \sum_{j=1}^{N} \big(F\{I_j\} \odot H^* - G_j\big)^2\). The MOSSE filter maps all aligned training patch examples to an output, G, centered at the feature location, producing notably stable correlation filters.
At the training stage, each patch example is normalized to have zero mean and a unitary norm, and is multiplied by a cosine window (required to address the periodicity of the Fourier transform). This also has the benefit of emphasizing the target center. These filters have a high invariance to illumination changes, due to their null DC component, and proved to be highly suitable to the task of generic face alignment.
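Since the SSD objective decouples per frequency, the minimizer has an element-wise closed form, H* = Σ_j G_j ⊙ conj(F_j) / Σ_j F_j ⊙ conj(F_j), as derived in [11]. The sketch below trains such a filter on synthetic patches and correlates it with a new noisy instance; the patch content, noise level, and regularization constant are illustrative assumptions, and the per-patch normalization and cosine window described above are omitted for brevity.

```python
import numpy as np

# Sketch of MOSSE training and correlation in the Fourier domain.
rng = np.random.default_rng(2)
size, N = 32, 8
target = np.zeros((size, size))
target[12:20, 12:20] = 1.0               # synthetic feature appearance

patches = [target + 0.05 * rng.standard_normal((size, size)) for _ in range(N)]
# Desired output: a Gaussian peak centered on the feature location.
yy, xx = np.mgrid[0:size, 0:size]
g = np.exp(-((xx - 16) ** 2 + (yy - 16) ** 2) / (2 * 2.0 ** 2))
G = np.fft.fft2(g)

num = np.zeros((size, size), dtype=complex)
den = np.zeros((size, size), dtype=complex)
for patch in patches:
    F = np.fft.fft2(patch)
    num += G * np.conj(F)                # accumulate G_j . conj(F_j)
    den += F * np.conj(F)                # accumulate F_j . conj(F_j)
H_conj = num / (den + 1e-5)              # small regularizer for stability

# Correlation with a new noisy instance: the response peaks at the feature.
F_test = np.fft.fft2(target + 0.05 * rng.standard_normal((size, size)))
response = np.real(np.fft.ifft2(F_test * H_conj))
peak = np.unravel_index(np.argmax(response), response.shape)
```

The response of an aligned test patch reproduces the desired Gaussian output, so reading off the argmax recovers the feature location.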
[Fitting performance curves (proportion of images vs. RMS error), each evaluated on (a) the IMM [15] database, (b) the XM2VTS [17] database, and (c) the BioID [16] database; the top row compares detectors using the WPR strategy.]
Reference 7.5 RMS        | IMM (240 images) | XM2VTS (2360 images) | BioID (1521 images)
ASM                      | 50.0             | 30.7                 | 70.0
DBASM-WPR (our method)   | 56.7 (+6.7)      | 45.1 (+14.4)         | 75.4 (+5.4)
CQF                      | 45.4             | 10.9                 | 47.0
GMM3                     | 40.8 (-4.6)      | 10.4 (-0.5)          | 51.7 (+4.7)
BCLM-GR                  | 48.3 (+2.9)      | 15.9 (+5.0)          | 54.2 (+7.2)
DBASM-GR (our method)    | 50.4 (+5.0)      | 18.0 (+7.1)          | 62.2 (+15.2)
SCMS-KDE                 | 54.6             | 35.7                 | 69.0
BCLM-KDE                 | 57.1 (+2.5)      | 43.4 (+7.7)          | 71.9 (+2.9)
DBASM-KDE (our method)   | 64.6 (+10.0)     | 54.5 (+18.8)         | 76.5 (+7.5)
DBASM-KDE-H (our method) | 64.6 (+10.0)     | 53.5 (+17.8)         | 76.5 (+7.5)
Fig. 5. Fitting performance curves. The table shows quantitative values taken by setting a fixed RMS error threshold (7.5 pixels - the vertical line in the graphs). Each table entry shows the percentage of images that converge with an RMS error less than or equal to the reference. The results show that our proposed methods outperform all the others (using all the local strategies WPR, GR and KDE). Top images show DBASM-KDE fitting examples from each database.
[Fig. 6. Tracking performance: RMS error per frame over the 5000-frame sequence, with mean/standard deviation of the RMS error: ASM (10.5/6.4), CQF (10.6/3.9), GMM3 (11.1/4.3), SCMS-KDE (8.2/2.6), BCLM-KDE (9.5/3.6), DBASM-KDE (7.0/2.1).]
current parameter estimates (see figure 5). The results show that the hierarchical annealing version of DBASM-KDE (DBASM-KDE-H) performs slightly better, but at the cost of more iterations. Tracking performance is also tested on the FGNET Talking Face video sequence (figure 6). Each frame is fitted using as
(a) AVG (b) ASM (c) CQF (d) GMM3 (e) BCLM - (f) SCMS - (g) DBASM (h) DBASM
Fig. 7. Qualitative fitting results on the LFW database [3]. AVG is the initial estimate.
initial estimate the previously estimated shape and pose parameters. The relative performance between the global optimization approaches is similar to the previous experiments, where the DBASM technique yields the best performance. Qualitative evaluation is also performed using the Labeled Faces in the Wild (LFW) database [3]; some results can be seen in figure 7.
5 Conclusions
An efficient solution to align facial parts in unseen images is described in this work. We present a novel Bayesian paradigm (DBASM) to solve the global alignment problem, in a MAP sense, showing that the posterior distribution of the global warp can be efficiently inferred using a Linear Dynamical System (LDS). The main advantage w.r.t. previous Bayesian formulations is that DBASM models the covariance of the latent variables, which represents the confidence in the current parameter estimates. Several performance evaluation results are presented, comparing both local detectors and global optimization strategies. Evaluating the local detectors shows that the MOSSE correlation filters offer superior performance in landmark local detection. Global optimization evaluations were performed on several publicly available image datasets, namely the IMM, the XM2VTS, the BioID, and the LFW. Tracking performance is also evaluated on the FGNET Talking Face video sequence. This new Bayesian paradigm is shown to significantly outperform other state-of-the-art fitting solutions.
References
1. Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active appearance models. IEEE TPAMI 23, 681-685 (2001)
2. Matthews, I., Baker, S.: Active appearance models revisited. IJCV 60, 135-164 (2004)
3. Huang, G.B., Ramesh, M., Berg, T., Learned-Miller, E.: Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst (2007)
4. Cootes, T.F., Taylor, C.J., Cooper, D.H., Graham, J.: Active shape models - their training and application. CVIU 61, 38-59 (1995)
5. Cristinacce, D., Cootes, T.F.: Boosted regression active shape models. In: BMVC (2007)
6. Tresadern, P., Bhaskar, H., Adeshina, S., Taylor, C., Cootes, T.F.: Combining local and global shape models for deformable object matching. In: BMVC (2009)
7. Cristinacce, D., Cootes, T.F.: Automatic feature localisation with constrained local models. Pattern Recognition 41, 3054-3067 (2008)
8. Wang, Y., Lucey, S., Cohn, J.: Enforcing convexity for improved alignment with constrained local models. In: IEEE CVPR (2008)
9. Gu, L., Kanade, T.: A Generative Shape Regularization Model for Robust Face Alignment. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 413-426. Springer, Heidelberg (2008)
10. Saragih, J., Lucey, S., Cohn, J.: Face alignment through subspace constrained mean-shifts. In: IEEE ICCV (2009)
11. Bolme, D.S., Beveridge, J.R., Draper, B.A., Lui, Y.M.: Visual object tracking using adaptive correlation filters. In: IEEE CVPR (2010)
12. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer (2006)
13. Paquet, U.: Convexity and Bayesian constrained local models. In: IEEE CVPR (2009)
14. Shen, C., Brooks, M.J., Hengel, A.: Fast global kernel density mode seeking: Applications to localization and tracking. IEEE TIP 16, 1457-1469 (2007)
15. Nordstrom, M., Larsen, M., Sierakowski, J., Stegmann, M.: The IMM face database - an annotated dataset of 240 face images. Technical report, Technical University of Denmark, DTU (2004)
16. Jesorsky, O., Kirchberg, K.J., Frischholz, R.W.: Robust Face Detection Using the Hausdorff Distance. In: Bigun, J., Smeraldi, F. (eds.) AVBPA 2001. LNCS, vol. 2091, pp. 90-95. Springer, Heidelberg (2001)
17. Messer, K., Matas, J., Kittler, J., Luettin, J., Maitre, G.: XM2VTSDB: The extended M2VTS database. In: AVBPA (1999)
18. FGNet: Talking face video (2004)
19. Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., Lin, C.-J.: LIBLINEAR: A library for large linear classification. JMLR, 1871-1874 (2008)
20. Cristinacce, D., Cootes, T.F.: Facial feature detection using adaboost with shape constraints. In: BMVC (2003)
21. Cristinacce, D., Cootes, T.F.: Feature detection and tracking with constrained local models. In: BMVC (2006)
22. Viola, P., Jones, M.: Robust real-time object detection. IJCV 57, 137-154 (2002)
Patch Based Synthesis
for Single Depth Image Super-Resolution
Oisin Mac Aodha, Neill D.F. Campbell, Arun Nair, and Gabriel J. Brostow
1 Introduction
Widespread 3D imaging hardware is advancing the capture of depth images with either better accuracy, e.g. the Faro Focus3D laser scanner, or at lower prices, e.g. Microsoft's Kinect. For every such technology, there is a natural upper limit on the spatial resolution and the precision of each depth sample. It may seem that calculating useful interpolated depth values requires additional data from the scene itself, such as a high resolution intensity image [1], or additional depth images from nearby camera locations [2]. However, the seminal work of Freeman et al. [3] showed that it is possible to explain and super-resolve an intensity image, having previously learned the relationships between blurry and high resolution image patches. To our knowledge, we are the first to explore a patch based paradigm for the super-resolution (SR) of single depth images.
Depth image SR is different from image SR. While less affected by scene lighting and surface texture, noisy depth images have fewer good cues for matching patches to a database. Also, blurry edges are perceptually tolerable and expected in images, but at discontinuities in depth images they create jarring artifacts (Fig. 1). We cope with both these problems by taking the unusual step of matching inputs against a database at the low resolution, in contrast to using an interpolated high resolution. Even creation of the database is harder
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 71-84, 2012.
© Springer-Verlag Berlin Heidelberg 2012
72 O. Mac Aodha et al.
Fig. 1. On the bottom row are three different, high resolution signals that we wish to recover. The top row illustrates typical results after upsampling their low resolution versions by interpolation. Interpolation at intensity discontinuities gives results that are perceptually similar to the ground truth image. However, in depth images this blurring can result in very noticeable jagged artifacts when viewed in 3D.
for depth images, whether from Time-of-Flight (ToF) arrays or laser scanners, because these can contain abundant interpolation-like noise.
The proposed algorithm infers a high resolution depth image from a single low resolution depth image, given a generic database of training patches. The problem itself is novel, and we achieve results that are qualitatively superior to what was possible with previous algorithms because we:
- Perform patch matching at the low resolution, instead of interpolating first.
- Train on a synthetic dataset instead of using available laser range data.
- Perform depth specific normalization of non-overlapping patches.
- Introduce a simple noisy-depth reduction algorithm for postprocessing.
Depth cameras are increasingly used for video-based rendering [4], robot manipulation [5], and gaming [6]. These environments are dynamic, so a general purpose SR algorithm is better if it does not depend on multiple exposures. These scenes contain significant depth variations, so registration of the 3D data to a nearby camera's high resolution intensity image is approximate [7], and use of a beam splitter is not currently practicable. In the interest of creating visually plausible super-resolved outputs under these constraints, we relegate the need for the results to genuinely match the real 3D scene.
2 Related Work
Both the various problem formulations for super-resolving depth images and the successive solutions for super-resolving intensity images relate to our algorithm. Most generally, the simplest upsampling techniques use nearest-neighbor, bilinear, or bicubic interpolation to determine image values at interpolated coordinates of the input domain. Such increases in resolution occur without regard for the input's frequency content. As a result, nearest-neighbor interpolation turns curved surfaces into jagged steps, while bilinear and bicubic interpolation smooth out sharp boundaries. Such artifacts can be hard to measure numerically, but are perceptually quite obvious both in intensity and depth images. While Fattal [8] imposed strong priors based on edge statistics to smooth stair-step edges, this type of approach still struggles in areas of texture. Methods like [9] for producing high quality antialiased edges from jagged input are inappropriate here, for reasons illustrated in Fig. 1.
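The failure mode described above is easy to demonstrate on a 1-D depth step. The toy signal and sampling rate below are illustrative; linear interpolation across the discontinuity invents intermediate depths that correspond to no real surface, which appear as "flying pixels" when viewed in 3D.

```python
import numpy as np

# Sketch of why interpolation misbehaves at depth discontinuities (cf. Fig. 1).
low = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])   # depth step: near vs far surface

x_low = np.arange(low.size)
x_high = np.linspace(0, low.size - 1, 4 * low.size - 3)  # 4x upsampling grid

nearest = low[np.round(x_high).astype(int)]       # nearest-neighbour upsampling
linear = np.interp(x_high, x_low, low)            # linear upsampling

# Nearest-neighbour only ever outputs real surface depths; linear
# interpolation introduces values strictly between the two surfaces.
invented = int(np.sum((linear > 1.0) & (linear < 5.0)))
```

Nearest-neighbour keeps the edge sharp but jagged on curved surfaces, while the linearly interpolated samples between the two depths belong to neither surface, which is exactly the artifact Fig. 1 illustrates.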
Multiple Depth Images: The SR problem traditionally centers on fusing multiple low resolution observations together to reconstruct a higher resolution image, e.g. [10]. Schuon et al. [2] combine multiple (usually 15) low resolution depth images with different camera centers in an optimization framework that is designed to be robust to the random noise characteristics of ToF sensors. To mitigate the noise in each individual depth image, [1] composites together multiple depths from the same viewpoint to make a single depth image for further super-resolving, and Hahne and Alexa [11] combine depth scans in a manner similar to exposure bracketing for High Dynamic Range photography. Rajagopalan et al. [12] use an MRF formulation to fuse together several low resolution depth images to create a final higher resolution image. Using GPU acceleration, Izadi et al. made a system which registers and merges multiple depth images of a scene in real time [13]. Fusing multiple sets of noisy scans has also been demonstrated for effective scanning of individual 3D shapes [14]. Compared to our approach, these all assume that the scene remains static. Though somewhat robust to small movements, large scene motion will cause them to fail.
Intensity Image Approaches: For intensity images, learning based methods exist for SR when multiple frames or static scenes are not available. In the work most closely related to our own, Freeman et al. [15,16] formulated the problem as multi-class labeling on an MRF. The label being optimized at each node represents a high resolution patch. The unary term measures how closely the high resolution patch matches the interpolated low resolution input patch. The pairwise terms encourage regions of the high resolution patches to agree. In their work, the high resolution patches came from an external database of photographs. To deal with depth images, our algorithm differs substantially from Freeman et al. and its image SR descendants, with details in Section 3. Briefly, our depth specific considerations mean that we i) compute matches at low resolution to limit blurring and to reduce the dimensionality of the search, ii) model the output space using non-overlapping patches so depth values are not averaged, iii) normalize height to exploit the redundancy in depth patches, and iv) introduce a noise-removal algorithm for postprocessing, though qualitatively superior results emerge before this step.
Yang et al. [17] were able to reconstruct high resolution test patches as sparse linear combinations of atoms from a learned compact dictionary of paired high/low resolution training patches. Our initial attempts were also based on sparse coding, but ultimately produced blurry results in the reconstruction stage. Various work has been conducted to best take advantage of the statistics of natural image patches. Zontak and Irani [18] argue that finding suitable high
resolution matches for a patch with unique high frequency content could take a prohibitively large external database, and it is more likely to find matches for these patches within the same image. Glasner et al. [19] exploit patch repetition across and within scales of the low resolution input to find candidates. In contrast to depth images, their input contains little to no noise, which cannot be said of external databases which are constrained to contain the same content as the input image. Sun et al. [20] oversegment their input intensity image into regions of assumed similar texture and look up an external database using descriptors computed from the regions. HaCohen et al. [21] attempt to classify each region as a discrete texture type to help upsample the image. A similar approach can be used for adding detail to 3D geometry; object specific knowledge has been shown to help when synthesizing detail on models [22,23]. While some work has been carried out on the statistics of depth images [24], it is not clear if they follow those of regular images. Major differences between the intensity and depth SR problems are that depth images usually have much lower starting resolution and significant non-Gaussian noise. The lack of high quality data also means that techniques used to exploit patch redundancy are less applicable.
Depth+Intensity Hybrids: Several methods exploit the statistical relationship between a high resolution intensity image and a low resolution depth image. They rely on the co-occurrence of depth and intensity discontinuities, on depth smoothness in areas of low texture, and on careful registration for the object of interest. Diebel and Thrun [25] used an MRF to fuse the two data sources after registration. Yang et al. [1] presented a very effective method to upsample depth based on a cross bilateral filter of the intensity. Park et al. [7] improved on these results with better image alignment, outlier detection, and also by allowing for user interaction to refine the depth. Incorrect depth estimates can come about if texture from the intensity image propagates into regions of smooth depth. Chan et al. [26] attempted to overcome this by not copying texture into depth regions which are corrupted by noise and likely to be geometrically smooth. Schuon et al. [27] showed that there are situations where depth and color images cannot be aligned well, and that these cases are better off being super-resolved just from multiple depth images.
In our proposed algorithm, we limit ourselves to super-resolving a single low resolution depth image, without additional frames or intensity images, and therefore without major concerns about baselines, registration, and synchronization.
3 Method
We take, as input, a low resolution depth image X that is generated from some unknown high resolution depth image Y by an unknown downsampling function d such that X = (Y)↓_d. Our goal is to synthesize a plausible Y. We treat X as a collection of N non-overlapping patches X = {x_1, x_2, ..., x_N}, of size M × M, that we scale to fit in the range [0..1], producing normalized input patches x̂_i. We recover a plausible SR depth image Y by finding a minimum of a discrete energy function. Each node in our graphical model (see Fig. 3) is associated with a low resolution image patch, x_i, and the discrete label for the corresponding node in a Markovian grid corresponds to a high resolution patch, y_i. The total energy of this MRF is
\[
E(\mathbf{Y}) = \sum_{i} E_d(\mathbf{x}_i) + \sum_{(i,j)\in\mathcal{N}} E_s(\mathbf{y}_i, \mathbf{y}_j), \tag{1}
\]
Fig. 2. A) Noise removal of the input gives a cleaner input for patching, but can remove important details when compared to our method of matching at low resolution. B) Patching artifacts are obvious (left) unless we apply our patch min/max compensation (right).
i i,jN
Unlike Freeman et al. [3], we do not upsample the low resolution input using a deterministic interpolation method to then compute matches at the upsampled scale. We found that doing so unnecessarily accentuates the large amount of noise that can be present in depth images. Noise removal is only partially successful and runs the risk of removing details, see Fig. 2 A). A larger amount of training data is also needed to explain the high resolution patches, and this increases the size of the MRF. Instead, we prefilter and downsample the high resolution training patches to make them the same size as, and comparable to, the input patches.
The pairwise term, E_s(y_i, y_j), enforces coherence in the abutting region between the neighboring unnormalized high resolution candidates, where O_ij is an overlap operator that extracts the region of overlap between the extended versions of the unnormalized patches y_i and y_j, as illustrated in Fig. 3 A). The overlap region consists of a single pixel border around each patch. We place the non-extended patches down side by side and compute the pairwise
Fig. 3. A) Candidate high resolution patches yi and yj are placed beside each other
but are not overlapping. Each has an additional one pixel border used to evaluate
smoothness (see Eqn. 3), which is not placed in the final high resolution depth im-
age. Here, the overlap is the region of 12 pixels in the rectangle enclosed by yellow
dashed lines. B) When downsampling a signal its absolute min and max values will not
necessarily remain the same. We compensate for this when unnormalizing a patch by
accounting for the difference between the patch and its downsampled version.
term in the overlap region. In standard image SR this overlap region is typically
averaged to produce the final image [3], but with depth images this can create
artifacts.
The high resolution candidate, y_i, is unnormalized based on the min and max
of the input patch, where additional δ_i terms account for the differences accrued
during the downsampling of the training data (see Fig. 3 B)).
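The min/max normalization and compensated unnormalization can be sketched as follows; the `delta_min`/`delta_range` arguments are illustrative stand-ins for the δ compensation terms described above:

```python
import numpy as np

def normalize(patch):
    """Map a depth patch to [0, 1] using its min/max; return the patch
    together with the values needed to invert the mapping."""
    lo, hi = float(patch.min()), float(patch.max())
    rng = hi - lo if hi > lo else 1.0
    return (patch - lo) / rng, lo, rng

def unnormalize(candidate, lo, rng, delta_min=0.0, delta_range=0.0):
    """Rescale a normalized high resolution candidate to the input patch's
    depth range, compensating for min/max drift caused by downsampling."""
    return candidate * (rng + delta_range) + (lo + delta_min)
```

Normalizing by min/max (rather than mean/variance) is what lets one training patch pair serve at many absolute depths, as discussed in the conclusions.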
Depth images exhibit significant noise, e.g. from specular and dark materials [28]. Coupled with low recording resolution
(compared to intensity cameras), this noise poses an additional challenge that
is not present when super-resolving an image. We could attempt to model the
downsampling function, d, which takes a clean noiseless signal and distorts it.
However, this is a non-trivial task and would result in a method very specific
to the sensor type. Instead, we work with the assumption that, for ToF sensors,
most of the high frequency content is noise. To remove this noise, we bilateral
filter [32] the input image X before the patches are normalized. The high reso-
lution training patches are also filtered (where d is a bicubic filter), so that all
matching is done on similar patches.
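A plain (unoptimized) version of the bilateral filter [32] used for prefiltering might look like this; the default parameters mirror the low-noise settings reported later in the implementation details, but the code is only a sketch of the standard filter:

```python
import numpy as np

def bilateral_filter(img, win=5, sigma_s=1.5, sigma_r=0.01):
    """Edge-preserving smoothing: each output pixel is a spatial- and
    range-weighted average of its neighbourhood (Tomasi & Manduchi [32])."""
    half = win // 2
    pad = np.pad(img, half, mode='edge')
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1]
    spatial = np.exp(-(xs**2 + ys**2) / (2.0 * sigma_s**2))
    out = np.empty_like(img, dtype=float)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            region = pad[y:y + win, x:x + win]
            rng = np.exp(-(region - img[y, x])**2 / (2.0 * sigma_r**2))
            w = spatial * rng
            out[y, x] = (w * region).sum() / w.sum()
    return out
```

With a small range deviation, depth discontinuities receive near-zero range weight and are preserved, which is exactly why the filter is preferred over a plain Gaussian here.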
It is still possible to have some noise in the final super-resolved image due
to patches being stretched incorrectly over boundaries. Park et al. [7] identify
outliers based on the contrast of the min and max depth in a local patch in
image space. We too wish to identify these outliers, but also to replace them
with plausible values and refine the depth estimates of the other points. Using
the observation that most of the error is in the depth direction, we propose a new
set of possible depth values Δ_d for each pixel, and attempt to solve for the most
consistent combination across the image. A 3D coordinate p_w, with position
p_im in the image, is labelled as an outlier if the average distance to its T nearest
neighbours is greater than 3σ_d. In the case of non-outlier pixels, the label set
Δ_d contains the depth values p_z^im + n σ_{p_z}^im, where n = [−N/2, ..., N/2], and each
label's unary cost is λ|n σ_{p_z}^im|, with λ = 1%. For the outlier pixels, the label set
contains the N nearest non-outlier depth values with uniform unary cost. The
pairwise term is the truncated distance between the neighboring depth values
i and j: ||p_{i,z}^im − p_{j,z}^im||_2. Applying our outlier removal instead of the bilateral
filter on the input depth image would produce overly blocky aliased edges that
are difficult for the patch lookup to overcome during SR. Used instead as a
postprocess, it acts only to remove errors due to incorrect patch scaling.
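The outlier test can be sketched with a brute-force nearest-neighbour search; `T` and `sigma_d` are free parameters here, and this is an illustrative re-implementation rather than the authors' code:

```python
import numpy as np

def flag_outliers(points, T=4, sigma_d=0.01):
    """points: (n, 3) array of 3D coordinates. A point is flagged when the
    mean distance to its T nearest neighbours exceeds 3 * sigma_d."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)          # exclude self-distance
    nearest = np.sort(dist, axis=1)[:, :T]  # T nearest neighbours per point
    return nearest.mean(axis=1) > 3 * sigma_d
```

For real point clouds a k-d tree would replace the O(n²) distance matrix, but the acceptance rule is the same.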
4 Training Data
For image SR, it is straightforward to acquire image collections online for train-
ing purposes. Methods for capturing real scene geometry, e.g. laser scanning,
are not convenient for collecting large amounts of high quality data in varied
environments. Some range datasets do exist online, such as the Brown Range
Image Database [24], the USF Range Database [33] and Make3D Range Image
Database [34]. The USF dataset is a collection of 400 range images of simple
polyhedral objects with a very limited resolution of only 128 × 128 pixels and
with heavily quantized depth. Similarly, the Make3D dataset contains a large va-
riety of low resolution scenes. The Brown dataset, captured using a laser scanner,
is most relevant for our SR purposes, containing 197 scenes spanning indoors and
outdoors. While the spatial resolution is superior in the Brown dataset, it is still
limited and features noisy data such as flying pixels that ultimately hurt depth
SR; see Fig. 4 B). In our experiments we compared the results of using Brown
Range Image vs. synthetic data, and found superior results using synthetic
Fig. 4. A) Example from our synthetic dataset. The left image displays the depth
image and the right is the 3D projection. Note the clean edges in the depth image and
lack of noise in the 3D projection. B) Example scene from the Brown Range Image
Database [24] which exhibits low spatial resolution and flying pixels (in red box).
data (see Fig. 7). An example of one of our synthesized 3D scenes is shown in
Fig. 4 A). Some synthetic depth datasets also exist, e.g. [35], but they typically
contain single objects.
Due to the large amount of redundancy in depth scenes (e.g. planar surfaces),
we prune the high resolution patches before training. This is achieved by detecting
depth discontinuities using an edge detector. A dilation is then performed on
this edge map (with a disk of radius 0.02 × the image width) and only patches
with centers in this mask are chosen. During testing, the top K closest candidates
to the low resolution input patch are retrieved from the training images.
Matches are computed based on the ||·||_2 distance from the low resolution patch
to a downsampled version of the high resolution patch. In practice, we use a k-d
tree to speed up this lookup. Results are presented using a dataset of 30 scenes
of size 800 × 800 pixels (with each scene also flipped left to right), which creates
a dictionary of 5.3 million patches, compared to 660 thousand patches in [16].
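The pruning and lookup steps can be sketched as follows; the patch sizes, gradient threshold, and dilation radius are illustrative, and a brute-force L2 search stands in for the k-d tree:

```python
import numpy as np

def build_patch_dictionary(depth, hi=6, thresh=0.25, dilate=2):
    """Keep only high resolution patches whose centers lie near depth
    discontinuities; key each patch by its 2x downsampled version."""
    gy, gx = np.gradient(depth)
    mask = np.hypot(gy, gx) > thresh
    for _ in range(dilate):  # dilate edge map with a square element
        grown = mask.copy()
        grown[1:, :] |= mask[:-1, :]; grown[:-1, :] |= mask[1:, :]
        grown[:, 1:] |= mask[:, :-1]; grown[:, :-1] |= mask[:, 1:]
        mask = grown
    keys, patches = [], []
    for y in range(depth.shape[0] - hi + 1):
        for x in range(depth.shape[1] - hi + 1):
            if mask[y + hi // 2, x + hi // 2]:
                p = depth[y:y + hi, x:x + hi]
                # block-average 2x downsample gives the low resolution key
                key = p.reshape(hi // 2, 2, hi // 2, 2).mean(axis=(1, 3))
                keys.append(key.ravel())
                patches.append(p)
    return np.array(keys), patches

def top_k_candidates(keys, patches, low_res_patch, K=5):
    """Brute-force L2 lookup; the paper uses a k-d tree for speed."""
    d = np.linalg.norm(keys - low_res_patch.ravel(), axis=1)
    return [patches[i] for i in np.argsort(d)[:K]]
```

The returned K high resolution patches become the candidate labels for one MRF node.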
5 Experiments
We performed experiments on single depth scans obtained by various means,
including a laser scanner, structured light, and three different ToF cameras. We
favor the newer ToF sensors because they do not suffer from missing regions
at depth discontinuities as much as Kinect, which has a comparatively larger
camera-projector baseline. We run a sliding window filter on the input to fill
in missing data with local depth information. The ToF depth images we tested
come from one of three camera models: PMD CamCube 2.0 with a resolution of
200 × 200, Mesa Imaging SwissRanger SR3000 with 176 × 144, or the Canesta
EP DevKit with 64 × 64. We apply bilateral prefiltering and our postprocessing
denoising algorithm only to ToF images. Unless otherwise indicated, all
experiments were run with the same parameters and with the same training data.
We provide comparisons against the Example based Super-Resolution (EbSR)
method of [16] and the Sparse coding Super-Resolution (ScSR) method of [17].
Patch Based Synthesis for Single Depth Image Super-Resolution 79
Fig. 5. A) RMSE comparison of our method versus several others when upsampling
the downsampled Middlebury stereo dataset [36] by factors of 2× and 4×. *MRF
RS [25] and Cross Bilateral [1] require an additional intensity image at the same high
resolution as the upsampled depth output. B) RMSE comparison of our method versus
two other image based techniques for upsampling three different laser scans by a factor
of 4×. See Fig. 6 for visual results of Scan 42. C) Training and testing times in seconds.
As described in Section 4, noisy laser depth data is not suitable for training. Fig. 7
shows the result of SR when training on the Brown Range Image Database [24]
(only using the indoor scenes) as compared with our synthetic dataset. The noise
in the laser scan data introduces both blurred and jagged artifacts.
Fig. 7. Result of super-resolving a scene using our algorithm with training data from
two different sources: A) Laser scan [24]. B) Our synthetic dataset. Note that our dataset
produces much less noise in the final result and hallucinates detail such as the thin
structures in the right of the image.
Fig. 8. Upsampling input ToF image from a Canesta EP DevKit (64 × 64) by a factor
of 10× [1]. A) Input depth image shown to scale in red. B) Bilateral filtering of input
image (to remove noise) followed by upsampling using nearest neighbor. C) EbSR [16]
produces an overly smooth result at this large upsampling factor. D) ScSR [17] recovers
more high frequency detail but creates a ringing artefact. E) The Cross Bilateral [1]
method produces a very sharp result; however, the method requires a high resolution
intensity image at the same resolution as the desired super-resolved depth image (640
× 640), shown inset in green. F) Our method produces sharper results than C) and D).
Implementation Details
Fig. 5 C) gives an overview of the training and test times of our algorithm
compared to other techniques. All results presented use a low resolution patch
size of 3 × 3. For the MRF, we use 150 labels (high resolution candidates) and
the weighting of the pairwise term, λ, is set to 10. The only parameters we
change are for the bilateral filtering step. For scenes of high noise we set the
window size to 5, the spatial standard deviation to 3 and the range deviation to
0.1; for all other scenes we use 5, 1.5 and 0.01 respectively. We use the default
parameters provided in the implementations for ScSR [17] and EbSR [16]. To
encourage comparison and further work, code and training data are available
from our project webpage.
Fig. 9. CamCube ToF input. A) Noisy input. B) Our result for SR by 4×; note the
reduced noise and sharp discontinuities. C) Cropped region from the input depth image
comparing different SR methods. The red square in A) is the location of the region.
6 Conclusions
We have extended single-image SR to the domain of depth images. In the process,
we have also assessed the suitability for depth images of two leading intensity
image SR techniques [17,16]. Measuring the RMSE of super-resolved depths with
respect to known high resolution laser scans and online Middlebury data, our
algorithm is always first or a close second best. We also show that perceptually,
we reconstruct better depth discontinuities. An additional advantage of our single
frame SR method is our ability to super-resolve moving depth videos. From the
outset, our aim of super-resolving a lone depth frame has been about producing
a qualitatively believable result, rather than a strictly accurate one. Blurring
and halos may only become noticeable as artifacts when viewed in 3D.
Four factors enabled our algorithm to produce attractive depth reconstructions.
Critically, we saw an improvement when we switched to low resolution
searches for our unary potentials, unlike almost all other algorithms. Second, special
depth-rendering of clean computer graphics models depicting generic scenes
outperforms training on noisy laser scan data. It is important that these depths are
filtered and downsampled for training, along with a pre-selection stage that favors
areas near gradients. Third, the normalization based on min/max values in the
low resolution input allows the same training patch pair to be applied at various
depths. We found that the alternative of normalizing for mean and variance is not
very robust with 3 × 3 patches, because severe noise in many depth images shifts
the mean to almost one extreme or the other at depth discontinuities. Finally, we
have the option of refining our high resolution depth output using a targeted noise
removal algorithm. Crucially, our experiments demonstrate both the quantitative
and perceptually significant advantages of our new method.
Currently, for a video sequence we process each frame independently. Our algo-
rithm could be extended to exploit temporal context to both obtain the most
reliable SR reconstruction, and apply it smoothly across the sequence. Similarly,
the context used to query the database could be expanded to include global
scene information, such as structure [37]. This may overcome the fact that we
have a low local signal to noise ratio from current depth sensors. As observed by
Reynolds et al. [28], flying pixels and other forms of severe non-Gaussian noise
necessitate prefiltering. Improvements could be garnered with selective use of
the input depths via a sensor-specific noise model. For example, Kinect and the
different ToF cameras we tested all exhibit very different noise characteristics.
Acknowledgements. Funding for this research was provided by the NUI Trav-
elling Studentship in the Sciences and EPSRC grant EP/I031170/1.
References
1. Yang, Q., Yang, R., Davis, J., Nister, D.: Spatial-depth super resolution for range
images. In: CVPR (2007)
2. Schuon, S., Theobalt, C., Davis, J., Thrun, S.: LidarBoost: Depth superresolution
for ToF 3D shape scanning. In: CVPR (2009)
3. Freeman, W.T., Pasztor, E.C., Carmichael, O.T.: Learning low-level vision. IJCV
(2000)
4. Kuster, C., Popa, T., Zach, C., Gotsman, C., Gross, M.: Freecam: A hybrid camera
system for interactive free-viewpoint video. In: Proceedings of Vision, Modeling,
and Visualization, VMV (2011)
5. Holz, D., Schnabel, R., Droeschel, D., Stuckler, J., Behnke, S.: Towards semantic
scene analysis with time-of-flight cameras. In: RoboCup International Symposium
(2010)
6. Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R.,
Kipman, A., Blake, A.: Real-time human pose recognition in parts from single
depth images. In: CVPR (2011)
7. Park, J., Kim, H., Tai, Y.W., Brown, M., Kweon, I.: High quality depth map
upsampling for 3D-ToF cameras. In: ICCV (2011)
8. Fattal, R.: Upsampling via imposed edge statistics. SIGGRAPH (2007)
9. Yang, L., Sander, P.V., Lawrence, J., Hoppe, H.: Antialiasing recovery. ACM Trans-
actions on Graphics (2011)
10. Irani, M., Peleg, S.: Improving resolution by image registration. In: CVGIP: Graph.
Models Image Process. (1991)
11. Hahne, U., Alexa, M.: Exposure Fusion for Time-Of-Flight Imaging. In: Pacific
Graphics (2011)
12. Rajagopalan, A.N., Bhavsar, A., Wallhoff, F., Rigoll, G.: Resolution Enhancement
of PMD Range Maps. In: Rigoll, G. (ed.) DAGM 2008. LNCS, vol. 5096, pp. 304–313.
Springer, Heidelberg (2008)
13. Izadi, S., Kim, D., Hilliges, O., Molyneaux, D., Newcombe, R.A., Kohli, P., Shotton,
J., Hodges, S., Freeman, D., Davison, A.J., Fitzgibbon, A.: KinectFusion: Real-time
3D reconstruction and interaction using a moving depth camera. In: UIST (2011)
14. Cui, Y., Schuon, S., Chan, D., Thrun, S., Theobalt, C.: 3D shape scanning with a
time-of-flight camera. In: CVPR (2010)
15. Freeman, W.T., Jones, T.R., Pasztor, E.C.: Example-based super-resolution. Com-
puter Graphics and Applications (2002)
16. Freeman, W.T., Liu, C.: Markov random fields for super-resolution and texture
synthesis. In: Advances in Markov Random Fields for Vision and Image Processing.
MIT Press (2011)
17. Yang, J., Wright, J., Huang, T.S., Ma, Y.: Image super-resolution via sparse rep-
resentation. IEEE Transactions on Image Processing (2010)
18. Zontak, M., Irani, M.: Internal statistics of a single natural image. In: CVPR (2011)
19. Glasner, D., Bagon, S., Irani, M.: Super-resolution from a single image. In: ICCV
(2009)
20. Sun, J., Zhu, J., Tappen, M.: Context-constrained hallucination for image super-
resolution. In: CVPR (2010)
21. HaCohen, Y., Fattal, R., Lischinski, D.: Image upsampling via texture hallucina-
tion. In: ICCP (2010)
22. Gal, R., Shamir, A., Hassner, T., Pauly, M., Cohen-Or, D.: Surface reconstruction
using local shape priors. In: Symposium on Geometry Processing (2007)
23. Golovinskiy, A., Matusik, W., Pfister, H., Rusinkiewicz, S., Funkhouser, T.: A
statistical model for synthesis of detailed facial geometry. SIGGRAPH (2006)
24. Huang, J., Lee, A., Mumford, D.: Statistics of range images. In: CVPR (2000)
25. Diebel, J., Thrun, S.: An application of Markov random fields to range sensing. In:
NIPS (2005)
26. Chan, D., Buisman, H., Theobalt, C., Thrun, S.: A Noise-Aware Filter for Real-
Time Depth Upsampling. In: Workshop on Multi-camera and Multi-modal Sensor
Fusion Algorithms and Applications (2008)
27. Schuon, S., Theobalt, C., Davis, J., Thrun, S.: High-quality scanning using time-
of-flight depth superresolution. In: CVPR Workshops (2008)
28. Reynolds, M., Dobos, J., Peel, L., Weyrich, T., Brostow, G.J.: Capturing time-of-
flight data with confidence. In: CVPR (2011)
29. Kolmogorov, V.: Convergent tree-reweighted message passing for energy minimiza-
tion. PAMI (2006)
30. Wainwright, M., Jaakkola, T., Willsky, A.: MAP estimation via agreement on
(hyper)trees: Message-passing and linear programming approaches. IEEE Transactions
on Information Theory (2002)
31. Frank, M., Plaue, M., Rapp, H., Kothe, U., Jahne, B., Hamprecht, F.A.: Theoretical
and experimental error analysis of continuous-wave time-of-flight range
cameras. Optical Engineering (2009)
32. Tomasi, C., Manduchi, R.: Bilateral filtering for gray and color images. In: ICCV
(1998)
33. (USF Range Database), http://marathon.csee.usf.edu/range/DataBase.html
34. Saxena, A., Sun, M., Ng, A.: Make3D: Learning 3D scene structure from a single
still image. PAMI (2009)
35. Saxena, A., Driemeyer, J., Kearns, J., Ng, A.: Robotic grasping of novel objects.
In: NIPS (2006)
36. Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo
correspondence algorithms. IJCV (2002)
37. Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor Segmentation and Support
Inference from RGBD Images. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato,
Y., Schmid, C. (eds.) ECCV 2012, Part V. LNCS, vol. 7576, pp. 746–760. Springer,
Heidelberg (2012)
Annotation Propagation in Large Image Databases
via Dense Image Correspondence
Abstract. Our goal is to automatically annotate many images with a set of word
tags and a pixel-wise map showing where each word tag occurs. Most previous ap-
proaches rely on a corpus of training images where each pixel is labeled. However,
for large image databases, pixel labels are expensive to obtain and are often un-
available. Furthermore, when classifying multiple images, each image is typically
solved for independently, which often results in inconsistent annotations across
similar images. In this work, we incorporate dense image correspondence into
the annotation model, allowing us to make do with significantly less labeled data
and to resolve ambiguities by propagating inferred annotations from images with
strong local visual evidence to images with weaker local evidence. We establish
a large graphical model spanning all labeled and unlabeled images, then solve it
to infer annotations, enforcing consistent annotations over similar visual patterns.
Our model is optimized by efficient belief propagation algorithms embedded in an
expectation-maximization (EM) scheme. Extensive experiments are conducted to
evaluate the performance on several standard large-scale image datasets, showing
that the proposed framework outperforms state-of-the-art methods.
1 Introduction
We seek to annotate a large set of images, starting from a partially annotated set. The
resulting annotation will consist of a set of word tags for each image, as well as per-
pixel labeling indicating where in the image each word tag appears. Such automatic
annotation will be very useful for Internet image search, but will also have applications
in other areas such as robotics, surveillance, and public safety.
In computer vision, scholars have investigated image annotation in two directions.
One methodology uses image similarities to transfer textual annotation from labeled
images to unlabeled samples, under the assumption that similar images should have
similar annotations. Notably, Makadia et al. [1] recently proposed a simple baseline
approach for auto-annotation based on global image features, and a greedy algorithm
for transferring tags from similar images. ARISTA [2] automatically annotated a web
dataset of billions of images by transferring tags via near-duplicates. Although those
methods have clearly demonstrated the merits of using similar images to transfer anno-
tations, they do so only between globally very similar images, and cannot account for
locally similar patterns.
Consequently, the other methodology focused on dense annotation of images, known
as semantic labeling [3–6], where correspondences between text and local image
features are established for annotation propagation: similar local features should have
similar labels. These methods often aim to label each pixel in an image using models learned
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 85–99, 2012.
© Springer-Verlag Berlin Heidelberg 2012
86 M. Rubinstein, C. Liu, and W.T. Freeman
Fig. 1. Input and output of our system. (a) The input image database, with scarce image tags
and scarcer pixel labels. (b) Annotation propagation over the image graph, connecting similar
images and corresponding pixels. (c) The fully annotated database by our system (more results
can be found in Fig. 5 and the supplemental material).
from a training database. In [7] and [8], relationships between text and visual words are
characterized by conventional language translation models. More recently, Shotton et
al. [3] proposed to train a discriminative model based on the texton representation, and
to use conditional random fields (CRF) to combine various cues to generate spatially
smooth labeling. This approach was successively extended in [4], where efficient ran-
domized decision forests are used for significant speedup. Liu et al. [5] proposed a
nonparametric approach to semantic labeling, where text labels are transferred from a
labeled database to parse a query image via dense scene correspondences.
Since predicting annotation from image features is by nature ambiguous (e.g. tex-
tureless regions can be sky, wall, or ceiling), such methods rely on regularities in a large
corpus of training data of pixel-wise densely labeled images. However, for large im-
age databases, high-quality pixel labels are very expensive to obtain. Furthermore, the
annotation is typically computed independently for each image, which often results in
inconsistent annotations due to visual ambiguities in image-to-text.
In this work, we show that we can overcome these drawbacks by incorporating dense
image correspondences into the annotation model. Image correspondences can help al-
leviate local ambiguities by propagating annotations in images with strong local evi-
dence to images with weaker evidence. They also implicitly add consideration of con-
textual information.
We present a principled framework for automatic image annotation based on dense
labeling and image correspondence. Our model is guided by four assumptions: (a) re-
gions corresponding to each tag have distinct visual appearances, (b) neighboring pixels
in one image tend to have the same annotation, (c) similar patterns across different im-
ages should have similar annotation, and (d) tags and tag co-occurrences which are
more frequent in the database are also more probable.
By means of dense image correspondence [9], we establish a large image graph
across all labeled and unlabeled images for consistent annotations over similar image
patterns across different images, where annotation propagation is solved jointly for the
entire database rather than for a single image. Our model is effectively optimized by
efficient belief propagation algorithms embedded within an expectation-maximization
(EM) scheme, alternating between visual appearance modeling (via Gaussian mixture
models) and annotation propagation over the image graph.
Annotation Propagation via Dense Image Correspondence 87
Textual tagging and pixel-level labeling are both plausible types of annotations for
an image. In this paper, we will consistently refer to semantic, pixel-level labeling (or
semantic segmentation) as labeling and to textual image-level annotation as tags.
The set of tags of an image is comprised of words from a vocabulary. These words
can be specified by a user, or obtained from text surrounding an image on a webpage.
The input database D = (I, V) is comprised of RGB images I = {I_1, ..., I_N}
and vocabulary V = {l_1, ..., l_L}. We treat auto-annotation as a labeling problem,
where the goal is to compute the labeling C = {c_1, ..., c_N} for all images in the
database, where for pixel p = (x, y), c_i(p) ∈ {1, ..., L, ∅} (∅ denoting unlabeled)
indexes into vocabulary V. The tags associated with the images, T = {t_1, ..., t_N :
t_i ⊆ {1, ..., L}}, are then defined directly as the set union of the pixel labels,
t_i = ∪_{p∈Λ_i} c_i(p), where Λ_i is image I_i's lattice.
Initially, we may be given annotations for some images in the database. We denote
by It and Tt , and Il and Cl the corresponding subsets of tagged images with their tags,
and labeled images with their labels, respectively. Also let (r_t, r_l) be the ratios of the
database images that are tagged and labeled, respectively.
We formulate this discrete labeling problem in a probabilistic framework, where
our goal is to construct the joint posterior distribution of pixel labels given the input
database and available annotations, P(C | D, T_t, C_l). We assume that (a) corresponding
patterns across images should have similar annotation, and (b) neighboring pixels in one
image tend to have the same annotation. This effectively factorizes the distribution as
follows. Written in energy form (−log P(C | D, T_t, C_l)), the cost of global assignment
C of labels to pixels is given by

E(C) = Σ_{i=1}^{N} Σ_{p∈Λ_i} [ Ψ_p(c_i(p)) + Σ_{q∈N_p} Ψ_int(c_i(p), c_i(q)) + Σ_{j∈N_i} Ψ_ext(c_i(p), c_j(p + w_ij(p))) ],   (1)

where N_p are the spatial neighbors of pixel p within its image, and N_i are other images
similar to image I_i. For each such image j ∈ N_i, w_ij defines the (dense) correspondence
between pixels in I_i and pixels in I_j.
Fig. 2. System overview: dense image features (Section 3.1) and the appearance model
(Section 3.2) are learned, the image graph is constructed (Section 4.1), and inference
propagates annotations, optionally guided by human annotation (Section 4.3).
This objective function defines a (directed) graphical model over the entire image
database. We define the data term, Ψ_p, by means of local visual properties of image
I_i at pixel p. We extract features from all images and learn a model to correspond
vocabulary words with image pixels. Since visual appearance is often ambiguous at the
pixel or region level, we use two types of regularization. The intra-image smoothness
term, Ψ_int, is used to regularize the labeling with respect to the image structures, while
the inter-image smoothness term, Ψ_ext, is used for regularization across corresponding
pixels in similar images.
Fig. 2 shows an overview of our system, and Fig. 4 visualizes this graphical model.
In the upcoming sections we describe in detail how we formulate each term of the
objective function, how we optimize it jointly for all pixels in the database, and show
experimental results of this approach.
3 Text-to-Image Correspondence
To characterize text-to-image correspondences, we first estimate the probability distri-
bution Pa (ci (p)) of words occurring at each pixel, based on local visual properties.
For example, an image might be tagged with car and road, but their locations within
the image are unknown. We leverage the large number of images and the available
tags to correlate the database vocabulary with image pixels. We first extract local im-
age features for every pixel, and then learn a visual appearance model for each word
by utilizing visual commonalities among images with similar tags, and pixel labels (if
available).
3.1 Local Image Descriptors
We selected features used prevalently in object and scene recognition to characterize lo-
cal image structures and color features. Structures are represented using both SIFT [9]
and HOG [13] features. We compute dense SIFT descriptors with 3 and 7 cells around
a pixel to account for scales. We then compute HOG features and stack together neighboring
HOG descriptors within 2 × 2 patches [14]. Color is represented using a 7 × 7
patch in L*a*b color space centered at each pixel. Stacking all the features yields a 527-
dimensional descriptor D_i(p) for every pixel p in image I_i. We use PCA to reduce the
descriptor to d = 50 dimensions, capturing approximately 80% of the features' variance.
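The dimensionality reduction step is standard PCA; a minimal sketch (not tied to the paper's exact feature pipeline) via the SVD of the centered descriptor matrix:

```python
import numpy as np

def pca_reduce(D, d=50):
    """Project descriptor matrix D (n_samples x n_dims) onto its top-d
    principal components; also report the variance fraction retained."""
    Dc = D - D.mean(axis=0)
    # SVD of the centered data: rows of Vt are the principal axes
    U, S, Vt = np.linalg.svd(Dc, full_matrices=False)
    var = S**2
    d = min(d, len(S))
    retained = var[:d].sum() / var.sum()
    return Dc @ Vt[:d].T, retained
```

Choosing d so that `retained` is roughly 0.8 matches the 80% variance figure quoted above.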
We model the appearance of each word with a Gaussian mixture model (GMM) over the descriptors, with an additional outlier model:

P_a(c_i(p); Θ) = Σ_{l=1}^{L} ρ_{i,l}(p) Σ_{k=1}^{M} ω_{l,k} N(D_i(p); μ_{l,k}, Σ_{l,k}) + ρ_{i,∅}(p) N(D_i(p); μ_∅, Σ_∅),   (2)

where ρ_{i,l}(p) = P_a(c_i(p) = l; Θ) is the weight of model l in generating the feature
D_i(p), M is the number of components in each model (M = 5), and
θ_l = {ω_{l,k}, μ_{l,k}, Σ_{l,k}} is the mixture weight, mean and covariance of component k in model
l, respectively. We use a Gaussian outlier model with parameters (μ_∅, Σ_∅) and
weights ρ_{i,∅}(p) for each pixel. The intuition for the outlier model is to add an unlabeled
word to the vocabulary V. Θ = ({ρ_{i,l}}_{i=1:N, l=1:L}, {ρ_{i,∅}}_{i=1:N}, θ_1, ..., θ_L, μ_∅, Σ_∅) is a
vector containing all parameters of the model.
We optimize for Θ in the maximum likelihood sense using a standard EM algorithm.
We initialize the models by partitioning the descriptors into L clusters using k-means
and fitting a GMM to each cluster. The outlier model is initialized from randomly se-
lected pixels throughout the database. We also explicitly restrict each pixel to contribute
its data to models corresponding to its estimated (or given) image tags only. That is, we
clamp ρ_{i,l}(p) = 0 for all l ∉ t_i. For labeled images I_l, we keep the posterior fixed according
to the given labels throughout the process (setting zero probability to all other
labels). To account for partial annotations, we introduce an additional weight α_i for all
descriptors of image I_i, set to α_t, α_l, or 1 (we use α_t = 5, α_l = 10) according to whether
image I_i was tagged, labeled, or inferred automatically by the algorithm, respectively.
More details can be found in the supplementary material.
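The clamping of responsibilities to image tags can be illustrated with a toy one-dimensional appearance model; the `models` dictionary and the `gaussian` helper are hypothetical simplifications of the per-word GMMs of Eq. 2:

```python
import math

def gaussian(x, mu, var):
    """1-D Gaussian density (stand-in for the multivariate components)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def word_posterior(desc, models, image_tags):
    """Posterior over words for one descriptor, with responsibilities
    clamped to zero for words not among the image's tags."""
    likes = {}
    for word, components in models.items():
        if word not in image_tags:
            likes[word] = 0.0  # clamp: pixel cannot take an untagged word
        else:
            likes[word] = sum(w * gaussian(desc, mu, var)
                              for w, mu, var in components)
    Z = sum(likes.values())
    return {w: v / Z for w, v in likes.items()} if Z > 0 else likes
```

The clamp is what prevents a pixel from contributing its descriptor to models for words the image is not tagged with.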
Fig. 3 (b) and (c) show an example result of the algorithm on an image from LabelMe
Outdoors dataset (See Section 5 for details on the experiment). Clearly, our model is
able to capture meaningful correlations between words and image regions.
4 Annotation Propagation
Now that we have obtained reasonable local evidences, we can utilize image informa-
tion and correspondences between images to regularize the estimation and propagate
annotations between images in the database. To this end, we compute dense pixel-wise
correspondences and construct an image graph connecting similar images and corre-
sponding pixels.
Fig. 3. Text-to-image and dense image correspondences. (a) An image from LabelMe Outdoors
dataset. (b) Visualization of the local evidence Pa (Eq. 2) for the four most probable words,
colored from black (low probability) to white (high probability). (c) The MAP classification based
on local evidence alone, computed independently at each pixel. (d) The classification with spatial
regularization (Eq. 8). (e-f) Nearest neighbors of the image in (a) and dense pixel correspondences
with (a). (g) Same as (b) for each of the neighbors, warped towards the image (a). (h) The final
MAP labeling using inter-image regularization (Eq. 7).
The term P_{t_i}(l) is used to bias the vocabulary of image I_i towards words with higher
frequency and co-occurrence among its neighbors. We estimate this probability as

P_{t_i}(l) = (λ/|N_i|) Σ_{j∈N_i} [l ∈ t_j] + (λ_o/Z) Σ_{j∈N_i} Σ_{m∈t_j} h_o(l, m),   (4)

where h_o(l, m) is the normalized co-occurrence count of words l and m, and Z
normalizes the second sum. A spatial term uses h_s^l(p), the normalized spatial histogram
of word l across all images in the database, to bias each word towards image locations
where it commonly occurs. This term will assist in places where the appearance and
pixel correspondence might not be as reliable (see Fig. 2 in the supplemental material).
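A sketch of this neighbourhood tag prior; the λ weights and the co-occurrence table `cooc` are illustrative stand-ins for the quantities in Eq. 4:

```python
def tag_prior(vocab, neighbor_tags, cooc, lam=1.0, lam_o=0.5):
    """Score each word by its frequency among neighbouring images' tag sets
    plus its co-occurrence with their tags, then normalize (cf. Eq. 4)."""
    n = max(len(neighbor_tags), 1)
    scores = {}
    for l in vocab:
        freq = sum(l in tags for tags in neighbor_tags) / n
        co = sum(cooc.get(frozenset((l, m)), 0.0)
                 for tags in neighbor_tags for m in tags)
        scores[l] = lam * freq + lam_o * co
    Z = sum(scores.values()) or 1.0
    return {l: s / Z for l, s in scores.items()}
```

Words absent from every neighbouring image receive near-zero prior, which is how the image graph restricts each image's effective vocabulary.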
The color term will assist in refining the labels internally within the image, and is
computed as

P_{c_i}(c_i(p) = l) = h_c^{i,l}(I_i(p)),   (6)

where h_c^{i,l} is the color histogram of word l in image I_i. We use 3D histograms of 64
bins in each of the color channels to represent h_c^{i,l}.
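The color refinement term can be sketched as follows; a coarse 4-bin-per-channel histogram is used here for brevity, whereas the paper uses 64 bins per channel:

```python
import numpy as np

def color_histograms(img, labels, L, bins=4):
    """Per-word normalized 3-D color histograms h_c^{i,l} for one image.
    img: (H, W, 3) floats in [0, 1); labels: (H, W) ints in [0, L)."""
    hists = np.zeros((L, bins, bins, bins))
    q = np.clip((img * bins).astype(int), 0, bins - 1)  # quantize colors
    for (y, x), l in np.ndenumerate(labels):
        hists[l][tuple(q[y, x])] += 1
    totals = hists.reshape(L, -1).sum(axis=1).reshape(L, 1, 1, 1)
    return hists / np.maximum(totals, 1)

def color_term(hists, label, color, bins=4):
    """Evaluate P(c_i(p) = label) at a pixel's color (cf. Eq. 6)."""
    q = tuple(np.clip((np.asarray(color) * bins).astype(int), 0, bins - 1))
    return hists[label][q]
```

Re-estimating these histograms from the current labeling, then relabeling, is the GrabCut-style alternation mentioned in the inference section.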
The inter-image compatibility term, Ψ_ext, penalizes label disagreement between
corresponding pixels in neighboring images (Eq. 7), where α_i, α_j are the image weights
as defined in Section 3.2, and S_i is the SIFT descriptor for image I_i. Intuitively, better
matching between corresponding pixels will result in a higher penalty when assigning
them different labels, weighted by the relative importance of the neighbor's label.
The intra-image compatibility between neighboring pixels is based on tag co-occurrence and image structures. For image I_i and spatial neighbors p, q \in N_p we define

\psi_{int}(c_i(p) = l_p, c_i(q) = l_q) = -\lambda_o \log h_o(l_p, l_q) + [l_p \neq l_q] \exp\left(-\lambda_{int} \| I_i(p) - I_i(q) \|\right).    (8)
Overall, the free parameters of this graphical model are {\lambda_s, \lambda_c, \lambda_o, \lambda_{int}, \lambda_{ext}}, which we tune manually. We believe these parameters could be learned from a database with ground-truth labeling, but leave this for future work.
Inference. We use a parallel coordinate descent algorithm to optimize Eq. 1, which al-
ternates between estimating the appearance model and propagating annotations between
pixels. The appearance model is initialized from the images and partial annotations in
the database. Then, we partition the message passing scheme into intra- and inter-image
updates, parallelized by distributing the computation of each image to a different core.
Message passing is performed using efficient parallel belief propagation, while intra-
image update iterates between belief propagation and re-estimating the color models
(Eq. 6) in a GrabCut fashion [18]. Once the algorithm converges, we compute the MAP
labeling that determines both labels and tags for all the images.
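The alternation can be illustrated on a toy one-dimensional problem (entirely our own simplification: per-label means instead of GMMs and color histograms, and ICM instead of parallel belief propagation):

```python
import numpy as np

def alternate_labeling(values, n_labels=2, n_iters=5, smooth=0.5):
    """Toy version of the alternating scheme: re-estimate a per-label
    appearance model (here just a mean), then update labels with an
    intra-image smoothness term via ICM. A sketch, not the paper's BP."""
    labels = (values > values.mean()).astype(int)
    for _ in range(n_iters):
        # 1) appearance update: mean value per label
        means = np.array([values[labels == l].mean() if np.any(labels == l)
                          else values.mean() for l in range(n_labels)])
        # 2) label update: data term + penalty for disagreeing with neighbors
        for i in range(len(values)):
            costs = (values[i] - means) ** 2
            for j in (i - 1, i + 1):
                if 0 <= j < len(values):
                    costs += smooth * (np.arange(n_labels) != labels[j])
            labels[i] = int(np.argmin(costs))
    return labels
```

On a signal with two clear levels the loop converges to a clean two-segment labeling.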
5 Experiments
We conducted extensive experiments with the proposed method using several datasets: SUN [14] (9556 256x256 images, 522 words), LabelMe Outdoors (LMO, a subset of SUN) [10] (2688 256x256 images, 33 words), the ESP game image set [11] (21,846 images, 269 words) and the IAPR benchmark [12] (19,805 images, 291 words). Since both LMO and SUN include dense human labeling, we use them to simulate human annotations for both training and evaluation. We use the ESP and IAPR data as used by [1]. They contain user tags but no pixel labeling.
We implemented the system using MATLAB and C++ and ran it on a small cluster of three machines with a total of 36 CPU cores. We tuned the algorithm's parameters on the LMO dataset, and fixed the parameters for the rest of the experiments to the best performing setting: \lambda_s = 1, \lambda_c = 2, \lambda_o = 2, \lambda_{int} = 60, \lambda_{ext} = 5. In practice, 5 iterations are required for the algorithm to converge to a local minimum. The EM (Section 3) and inference (Section 4.2) algorithms typically converge within 15 and 50 iterations, respectively. Using K = 16 neighbors for each image gave the best result (Fig. 6(c)), and we did not notice a significant change in performance for small modifications to the parameter of Section 4.1, d (Section 3.1), or the number of GMM components in the appearance model, M (Eq. 2). For the aforementioned settings, it takes the system 7 hours to preprocess the LMO dataset (compute descriptors and the image graph) and 12 hours to propagate annotations. The run times on SUN were 15 and 26 hours, respectively.
Annotation Propagation via Dense Image Correspondence 93
Fig. 4. The image graph serves as an underlying representation for inference and human
labeling. (a) Visualization of the image graph of the LMO dataset using 400 sampled images
and K = 10. The size of each image corresponds to its visual pagerank score. (b) The graphi-
cal model. Each pixel in the database is connected to spatially adjacent pixels in its image and
corresponding pixels in similar images. (c) Top ranked images selected automatically for human
annotation. Underneath each image is its visual pagerank score.
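Ranking images by such a score over the similarity graph can be sketched as follows (a hypothetical minimal power-iteration implementation in the spirit of VisualRank [19]; the function name and parameters are our assumptions):

```python
import numpy as np

def visual_pagerank(W, damping=0.85, n_iters=100):
    """Rank images in a similarity graph: W[i, j] >= 0 is the
    similarity of image i to image j. Returns a score per image;
    well-connected 'hub' images receive high scores."""
    n = W.shape[0]
    row_sums = W.sum(axis=1, keepdims=True)
    # Row-normalize to a transition matrix; dangling rows get uniform mass
    P = np.divide(W, row_sums, out=np.full_like(W, 1.0 / n),
                  where=row_sums > 0)
    r = np.full(n, 1.0 / n)
    for _ in range(n_iters):
        r = (1 - damping) / n + damping * (r @ P)
    return r
```

In a small star graph the center image receives the highest score, mirroring how hub images are chosen for human annotation.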
Results on SUN and LMO. Fig. 5(a) shows some annotation results on SUN using
(rt = 0.5, rl = 0.05). All images shown are initially untagged in the database. Our
system successfully infers most of the tags in each image, and the labeling corresponds
nicely to the image content. In fact, the tags and labels we obtain automatically are often
remarkably accurate, considering that no information was initially available on those
images. Some of the images contain regions with similar visual signatures, yet are still
classified correctly due to good correspondences with other images. More results are
available in the supplemental material.
To evaluate the results quantitatively, we compared the inferred labels and tags against human labels. We compute the global pixel recognition rate, r, and the class-average recognition rate, r̄, for images in the subset I \ Il, where the latter is used to counter the bias towards more frequent words. We evaluate tagging performance on the image set I \ It in terms of the ratio of correctly inferred tags, P (precision), and the ratio of missing tags that were inferred, R (recall). We also compute the unbiased class-average measures P̄ and R̄.
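These metrics can be sketched directly (our own minimal implementation; `pred`/`gt` are integer label maps and the tags of each image are sets of word indices):

```python
import numpy as np

def recognition_rates(pred, gt, n_classes):
    """Global pixel recognition rate r and class-average rate r-bar
    (classes absent from gt are skipped in the average)."""
    r = np.mean(pred == gt)
    per_class = [np.mean(pred[gt == c] == c)
                 for c in range(n_classes) if np.any(gt == c)]
    return r, float(np.mean(per_class))

def tag_precision_recall(pred_tags, gt_tags):
    """Tag precision P (ratio of inferred tags that are correct) and
    recall R (ratio of true tags that were inferred) over an image set."""
    tp = sum(len(p & g) for p, g in zip(pred_tags, gt_tags))
    n_pred = sum(len(p) for p in pred_tags)
    n_gt = sum(len(g) for g in gt_tags)
    return tp / max(n_pred, 1), tp / max(n_gt, 1)
```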
The global and class-average pixel recognition rates are 63% and 30% on LMO, and
33% and 19% on SUN, respectively. In Fig. 6 we show for LMO the confusion matrix
and breakdown of the scores into the different words. The diagonal pattern in the confusion matrix indicates that the system correctly recognizes most of the pixels of each word, except for less frequent words such as moon and cow. The vertical patterns (e.g. in the columns of building and sky) indicate a tendency to misclassify pixels into those
Table 1. Tagging and labeling performance on LMO and SUN datasets. Numbers are given
in percentages. In (a), AP-RF replaces the GMM model in the annotation propagation algorithm
with Random Forests [16]; AP-NN replaces SIFT-flow with nearest neighbor correspondences.
Table 2. Tagging performance on ESP and IAPR. Numbers are in percentages. The top row is
quoted from [1].
                    ESP                              IAPR
              P      R      P̄      R̄          P      R      P̄      R̄
Makadia [1]   22     25     -      -          28     29     -      -
AP (Ours)     24.17  23.64  20.28  13.78      27.89  25.63  19.89  12.23
words due to their frequent co-occurrence with other words (objects) in that dataset.
From the per-class recognition plot (Fig. 6(b)) it is evident that the system generally
performs better on words which are more frequent in the database.
Components of the Model. Our objective function (Eq. 1) comprises three main components, namely the visual appearance model, and the inter-image and intra-image regularizers. To evaluate the effect of each component on the performance, we repeated the above experiment where we first enabled only the appearance model (classifying each pixel independently according to its visual signature), and gradually added the other terms. The results are shown in Fig. 6(d) for varying values of rl. Each term clearly assists in improving the result. It is also evident that image correspondences play an important role in improving automatic annotation performance. The effect of the inference module of the algorithm on the results is demonstrated visually in Fig. 3 and the supplemental material.
Comparison with State-of-the-Art. We compared our tagging and labeling results with the state of the art in semantic labeling, Semantic Texton Forests (STF) [4] and Label Transfer [5], and in image annotation (Makadia et al. [1]). We used our own implementation of [1] and the publicly available implementations of [4] and [5]. We set the parameters of all methods according to the authors' recommendations, and used the same tagged and labeled images selected automatically by our algorithm as the training set for the other methods (only tags are considered by [1]). The results of this comparison are summarized in Table 1. On both datasets our algorithm shows a clear improvement in both labeling and tagging results. On LMO, r is increased by 5 percent compared to STF, and P is increased by 8 percent over Makadia et al.'s baseline method. STF recall is higher, but comes at the cost of lower precision, as more tags are associated on average with each image (Fig. 7). In [5], a pixel recognition rate of 74.75% is obtained
Fig. 5. Automatic annotation results (best viewed electronically). (a) SUN results were pro-
duced using (rt = 0.5, rl = 0.05) (4778 images tagged and 477 images labeled, out of 9556
images). All images shown are initially un-annotated in the database. (b-c) For ESP and IAPR,
the same training set as in [1] was used, having rt = 0.9, rl = 0. For each example we show
the source image on the left, and the resulting labeling and tags on the right. The word colormap
for SUN is the average pixel color based on the ground truth labels, while in ESP and IAPR each
word is assigned an arbitrary unique color. Example failures are shown in the last rows of (a) and
(c). More results can be found in Fig. 1 and the supplemental material.
Fig. 6. Recognition rates on LMO dataset. (a) The pattern of confusion across the dataset vo-
cabulary. (b) The per-class average recognition rate, with words ordered according to their fre-
quency in the dataset (measured from the ground truth labels), colored as in Fig. 5. (c) Recognition rate as a function of the number of nearest neighbors, K. (d) Recognition rate vs. ratio of labeled images (rl) with different terms of the objective function enabled.
on LMO at a 92% training/test split with all the training images containing dense pixel labels. However, in our more challenging (and realistic) setup, their pixel recognition rate is 53%, and their tag precision and recall are also lower than ours. [5] also reports the recognition rate of TextonBoost [3], 52%, on the same 92% training/test split used in their paper, which is significantly lower than the recognition rate we achieve, 63%, while using only a fraction of the densely labeled images. The performance of all methods drops significantly on the more challenging SUN dataset. Still, our algorithm outperforms STF by 10 percent in both r and P, and achieves a 3% increase in precision over [1] with similar recall.
We show a qualitative comparison with STF in Fig. 7 and the supplemental material, where misclassifications often arise due to ambiguities in local visual appearance.
Results on ESP and IAPR. We also compared our tagging results with [1] on the
ESP image set and IAPR benchmark using the same training set and vocabulary they
used (rt = 0.9, rl = 0). Our precision (Table 2) is 2% better than [1] on ESP, and is
on par with their method on IAPR. IAPR contains many words that do not relate to
particular image regions (e.g. photo, front, range), which does not fit within our text-
to-image correspondence model. Moreover, many images are tagged with colors (e.g.
white), while the image correspondence algorithm we use emphasizes structure. [1] assigns a stronger contribution to color features, which seems more appropriate for this
dataset. Better handling of such abstract keywords, as well as improving the quality of
the image correspondence are both interesting directions for future work.
Limitations. Some failure cases are shown in Fig. 5 (bottom rows of (a) and (c)).
Occasionally the algorithm mixes words with similar visual properties (e.g. door and
window, tree and plant), or semantic overlap (e.g. building and balcony). We noticed
that street and indoor scenes are generally more challenging for the algorithm. Dense
alignment of arbitrary images is a challenging task, and incorrect correspondences can
adversely affect the solution. In Fig. 5(a) bottom row we show some examples of inac-
curate annotation due to incorrect correspondences. More failure cases are available in
the supplemental material.
Training Set Size. As we are able to efficiently propagate annotations between images
in our model, our method requires significantly less labeled data than previous methods
(typically rt = 0.9 in [1], rl = 0.5 in [4]). We further investigate the performance of the system w.r.t. the training set size on the SUN dataset.
We ran our system with varying values of rt = 0.1, 0.2, . . . , 0.9 and rl = 0.1 rt. To characterize the system performance, we use the F1 measure, F(r_t, r_l) = 2\bar{P}\bar{R} / (\bar{P} + \bar{R}).

Each pixel in the database is connected to corresponding pixels in its neighboring images; on average, using K = 16 neighbors, these add O(10^12) edges to the graph. This requires significant computation that may be prohibitive for very large image databases. To address these issues, we have experimented with sparser inter-image connections using a simple sampling scheme. We partition each image I_i into small non-overlapping patches, and for each patch and image I_j, j \in N_i, we keep a single edge for the pixel with the best match in I_j according to the estimated correspondence w_ij. Fig. 9 shows the performance and running time for varying sampling rates of the inter-image edges (e.g. 0.06 corresponds to using 4x4 patches for sampling, thus using 1/16 of the edges).

Fig. 9. Performance and run time using sparse inter-image edges.
This plot clearly shows that the running time decreases much faster than performance.
For example, we achieve more than 30% speedup while sacrificing 2% accuracy by
using only 1/4 of the inter-image edges. Note that intra-image message passing is still performed at full pixel resolution as before, while further speedup can be achieved by running on sparser image grids.
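The per-patch edge sampling can be sketched as follows (a minimal version; `match_cost` and the function name are our assumptions, with lower cost meaning a better correspondence):

```python
import numpy as np

def sample_interimage_edges(match_cost, patch=4):
    """Keep one inter-image edge per non-overlapping patch: the pixel
    with the lowest correspondence cost. `match_cost[y, x]` is the
    matching cost of pixel (y, x) to its correspondence in a neighbor
    image; a `patch` of 4 keeps 1/16 of the edges."""
    H, W = match_cost.shape
    keep = []
    for y0 in range(0, H, patch):
        for x0 in range(0, W, patch):
            block = match_cost[y0:y0 + patch, x0:x0 + patch]
            dy, dx = np.unravel_index(np.argmin(block), block.shape)
            keep.append((y0 + dy, x0 + dx))
    return keep
```

For an 8x8 image and 4x4 patches this keeps exactly four edges, one per patch.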
6 Conclusion
In this paper we explored automatic annotation of large-scale image databases using dense semantic labeling and image correspondences. We treat auto-annotation as a labeling problem, where the goal is to assign every pixel in the database an appropriate tag, using both tagged images and scarce, densely labeled images. Our work differs from prior art in three key aspects. First, we use dense image correspondences to explicitly enforce coherent labeling (and tagging) across images, allowing us to resolve visual ambiguities due to similar visual patterns. Second, we solve the problem jointly over the entire image database. Third, unlike previous approaches, which rely on large training sets of labeled or tagged images, our method makes principled use of both tag and label information while also automatically choosing good images for humans to annotate so as to optimize performance. Our experiments show that the proposed system produces reasonable automatic annotations and outperforms state-of-the-art methods on several large-scale image databases while requiring significantly fewer human annotations.
References
1. Makadia, A., Pavlovic, V., Kumar, S.: Baselines for image annotation. IJCV 90 (2010)
2. Wang, X., Zhang, L., Liu, M., Li, Y., Ma, W.: ARISTA: image search to annotation on billions of web photos. In: CVPR, pp. 2987–2994 (2010)
3. Shotton, J., Winn, J., Rother, C., Criminisi, A.: TextonBoost: Joint Appearance, Shape and Context Modeling for Multi-class Object Recognition and Segmentation. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006, Part I. LNCS, vol. 3951, pp. 1–15. Springer, Heidelberg (2006)
4. Shotton, J., Johnson, M., Cipolla, R.: Semantic texton forests for image categorization and segmentation. In: CVPR (2008)
5. Liu, C., Yuen, J., Torralba, A.: Nonparametric scene parsing via label transfer. TPAMI (2011)
6. Tighe, J., Lazebnik, S.: SuperParsing: Scalable Nonparametric Image Parsing with Superpixels. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part V. LNCS, vol. 6315, pp. 352–365. Springer, Heidelberg (2010)
7. Blei, D.M., Ng, A., Jordan, M.: Latent Dirichlet allocation. Journal of Machine Learning Research 3, 993–1022 (2003)
8. Feng, S., Manmatha, R., Lavrenko, V.: Multiple Bernoulli relevance models for image and video annotation. In: CVPR (2004)
9. Liu, C., Yuen, J., Torralba, A., Sivic, J., Freeman, W.T.: SIFT Flow: Dense Correspondence across Different Scenes. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part III. LNCS, vol. 5304, pp. 28–42. Springer, Heidelberg (2008)
10. Russell, B., Torralba, A., Murphy, K., Freeman, W.: LabelMe: a database and web-based tool for image annotation. IJCV 77, 157–173 (2008)
11. Von Ahn, L., Dabbish, L.: Labeling images with a computer game. In: SIGCHI (2004)
12. Grubinger, M., Clough, P., Müller, H., Deselaers, T.: The IAPR benchmark: A new evaluation resource for visual information systems. In: LREC, pp. 13–23 (2006)
13. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR (2005)
14. Xiao, J., Hays, J., Ehinger, K., Oliva, A., Torralba, A.: SUN database: Large-scale scene recognition from abbey to zoo. In: CVPR, pp. 3485–3492 (2010)
15. Delong, A., Gorelick, L., Schmidt, F.R., Veksler, O., Boykov, Y.: Interactive Segmentation with Super-Labels. In: Boykov, Y., Kahl, F., Lempitsky, V., Schmidt, F.R. (eds.) EMMCVPR 2011. LNCS, vol. 6819, pp. 147–162. Springer, Heidelberg (2011)
16. Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Machine Learning 63, 3–42 (2006)
17. Oliva, A., Torralba, A.: Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV 42, 145–175 (2001)
18. Rother, C., Kolmogorov, V., Blake, A.: GrabCut: Interactive foreground extraction using iterated graph cuts. TOG 23, 309–314 (2004)
19. Jing, Y., Baluja, S.: VisualRank: Applying PageRank to large-scale image search. TPAMI (2008)
20. Vijayanarasimhan, S., Grauman, K.: Cost-sensitive active visual category learning. IJCV, 1–21 (2011)
Numerically Stable Optimization
of Polynomial Solvers for Minimal Problems
1 Introduction
Many problems in computer vision can be rephrased as solving several polynomial equations in several unknowns. This is particularly true for so-called minimal structure and motion problems. Solutions to minimal structure and motion problems are essential for RANSAC algorithms to find inliers in noisy data [13, 24, 25]. For such applications one needs to efficiently solve a large number of minimal structure and motion problems in order to find the best set of inliers. Once enough inliers have been found it is common to use non-linear optimization, e.g. to find the best fit to all data in a least squares sense. Here fast and numerically stable solutions to the polynomial systems of the minimal problems are crucial for generating initial parameters for the non-linear optimization. Another area of recent interest is global optimization, used e.g. for optimal triangulation, resectioning and fundamental matrix estimation. While global optimization is promising, it is a difficult pursuit and various techniques have been tried, e.g. branch and bound [1], L∞-norm methods [16] and methods using linear matrix inequalities (LMIs) [17]. An alternative way to find the global optimum is to calculate stationary points directly, usually by solving some polynomial equation system [14, 23]. So far, this has been an approach of limited applicability, since the calculation of stationary points is numerically difficult for larger problems. A third area comprises methods for finding the optimal set of inliers [11, 12, 21], which are based on solving certain geometrical problems that involve systems of polynomial equations.
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 100–113, 2012.
© Springer-Verlag Berlin Heidelberg 2012
The general field of study of multivariate polynomials is algebraic geometry. See [10] and [9] for an introduction to the field. In the language of algebraic geometry, V is an affine algebraic variety and the polynomials f_i generate an ideal I = \{g \in C[x] : g = \sum_i h_i(x) f_i(x)\}, where h_i \in C[x] are any polynomials and C[x] denotes the set of all polynomials in x over the complex numbers.
The motivation for studying the ideal I is that it is a generalization of the set of
equations (1). A point x is a zero of (1) iff it is a zero of I. Being even more general,
we could ask for the complete set of polynomials vanishing on V . If I is equal to this
set, then I is called a radical ideal.
We say that two polynomials f, g are equivalent modulo I iff f - g \in I, and denote this by f \sim g. With this definition we get the quotient space C[x]/I of all equivalence classes modulo I. Further, we let [\cdot] denote the natural projection C[x] \to C[x]/I, i.e. by [f_i] we mean the set \{g_j : f_i - g_j \in I\} of polynomials equivalent to f_i modulo I.
A related structure is C[V ], the set of equivalence classes of polynomial functions
on V . We say that a function F is polynomial on V if there is a polynomial f such that
F (x) = f (x) for x V and equivalence here means equality on V . If two polynomials
are equivalent modulo I, then they are obviously also equal on V . If I is radical, then
conversely two polynomials which are equal on V must also be equivalent modulo I.
This means that for radical ideals, C[x]/I and C[V ] are isomorphic. Now, if V is a point
set, then any function on V can be identified with a |V |-dimensional vector and since
the unisolvence theorem for polynomials guarantees that any function on a discrete set
of points can be interpolated exactly by a polynomial, we get that C[V ] is isomorphic
to Cr , where r = |V |.
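As a toy illustration of this correspondence (our own example, not from the paper, and assuming SymPy is available), one can check that a zero-dimensional system has as many standard monomials of its Gröbner basis as solutions:

```python
from sympy import symbols, groebner, solve

x, y = symbols('x y')
# Two polynomials cutting out a two-point variety V = {(t, t) : 2t^2 = 1}
polys = [x**2 + y**2 - 1, x - y]
G = groebner(polys, x, y, order='grevlex')
# The leading terms of G are x and y**2, so the standard monomials
# {1, y} form a basis of C[x]/I, i.e. dim C[x]/I = 2 = |V|.
sols = solve(polys, [x, y])
```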
is called the action matrix. In previous methods [7, 18], the so-called single elimination template is applied to enhance numerical stability. It starts by multiplying the equations (1) by a set of monomials, producing an equivalent (but larger) set of equations. There is no general criterion for how to choose such a set of monomials, which is usually selected as the minimal set of monomials that makes the problem solvable or numerically stable. We will discuss later that the choice of multiplication monomials affects the overall numerical accuracy of the solvers. By stacking the coefficients of the new equations in an expanded coefficient matrix C_exp, we have
Given a linear system as in (3), we know that if we choose from P a linear basis B = {x_1, . . . , x_r}, we are able to use numerical linear algebra to construct the corresponding action matrix. Traditionally, Gröbner bases are used to generate B for some ordering on the monomials (e.g. grevlex [18]). In this case, the permissible set is the same as the basis set. The key observation of [6] is that by allowing adaptive selection of B from a large permissible set (|P| ≥ r), one can greatly improve the numerical accuracy in most cases. To achieve this, we first eliminate the E-monomials, since they are not in the basis and do not need to be reduced. By making C_exp triangular through either LU or QR factorization, we have

\begin{pmatrix} U_{E1} & C_{R1} & C_{P1} \\ 0 & U_{R2} & C_{P2} \\ 0 & 0 & C_{P3} \end{pmatrix} \begin{pmatrix} X_E \\ X_R \\ X_P \end{pmatrix} = 0,    (4)
where U_{E1} and U_{R2} are upper triangular. The top rows of the coefficient matrix, involving the excessive monomials, can be removed:

\begin{pmatrix} U_{R2} & C_{P2} \\ 0 & C_{P3} \end{pmatrix} \begin{pmatrix} X_R \\ X_P \end{pmatrix} = 0.    (5)

We can further reduce C_{P3} into an upper triangular matrix. In [6], column-pivoting QR is utilized to try to minimize the condition number of the first |P| - r columns of C_{P3}\Pi, where \Pi is a permutation matrix. This will enhance the overall numerical stability of the extraction of the action matrix. The basis is thus selected as the last r monomials after the reordering of \Pi, i.e. X_P \to (X_{P'}^t, X_B^t)^t. This gives us

\begin{pmatrix} U_{R2} & C_{P4} & C_{B1} \\ 0 & U_{P3} & C_{B2} \end{pmatrix} \begin{pmatrix} X_R \\ X_{P'} \\ X_B \end{pmatrix} = 0.    (6)
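The column-pivoted basis selection can be sketched with SciPy (a sketch of the idea in [6]; the function name and the test matrix are ours, and SciPy's `qr` with `pivoting=True` is assumed available):

```python
import numpy as np
from scipy.linalg import qr

def select_basis(C_P3, r):
    """Choose r basis monomials from the permissible set by QR with
    column pivoting on C_P3: the first |P| - r pivoted (best-conditioned)
    columns are eliminated, and the last r columns in the pivot order
    index the basis monomials."""
    _, _, piv = qr(C_P3, pivoting=True)
    return piv[-r:]  # indices of the selected basis monomials
```

A column with a tiny norm is pivoted last and therefore lands in the basis rather than being eliminated.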
Fig. 1. Numerical accuracy (mean of log10(errors)) of the solver in [8] for uncalibrated camera with radial distortion. Left: with multiplication monomials of degree 5 (393x390), 6 (680x595), 7 (1096x865) and 8 (1675x1212). The numbers in parentheses are the sizes of the corresponding C_exp. Right: varying the number of permissible monomials. Note the minimal size of the permissible set is 24, which is the number of solutions.
Since by construction x_k x_i \in R \cup P \cup B, we can construct the action matrix from the linear mapping in (7) by finding the corresponding indices of x_k x_i (x_i \in B).
The method above generally gives good performance in terms of speed and numerical accuracy for most minimal problems in computer vision [7]. However, there are several parameters that are chosen manually for different problems: (1) the set of multiplication monomials and (2) the choice of the permissible set. For different problems, the general guideline in [6, 7] is to use as few equations as possible (this is usually done by fine-tuning the set of equations and checking whether the solver is still numerically stable). For the permissible set, it was suggested that one should choose it as large as possible, since this in theory should provide a better chance of a stable partial-pivoting QR. One may ask: will we gain more numerical stability by adding more equations or by using a smaller set of permissible monomials?
We present here an example to show the effects of such factors on current state-of-the-art polynomial solvers. The problem we study is the estimation of the fundamental matrix for uncalibrated cameras with radial distortion [8], which will be discussed in more detail in Section 5.1. In [8], the original equations are multiplied with a set of monomials so that the highest degree of the resulting equations is 5. To shed light on the effects of more equations, we allow the highest degree to be 8. We note that adding equations by simply increasing degrees actually hurts the overall numerical accuracy (Figure 1, Left). On the other hand, we also vary the size of the permissible set used in the solver, where we choose the last few monomials in grevlex order, similar to [8]. We can see in Figure 1 (Right) that, by reducing the size of the permissible set in this way, the solver retains numerical stability for permissible sets of large size, while losing accuracy for smaller permissible sets (in this case when the size is smaller than 36). We will show in the next few sections that one can actually gain numerical accuracy by adding equations if we carefully choose the multiplication monomials. Similarly, we can also achieve similar or even improved accuracy with a smaller permissible set.
Numerically Stable Optimization of Polynomial Solvers for Minimal Problems 105
\max_{P} \sum_{i=1}^{N} \mathbf{1}(B_i \subseteq P), \quad \text{s.t. } |P| = K,    (8)
In all previous solvers applying single elimination, the equations in the template are expanded with redundancy. This affects both the speed and the overall numerical accuracy of the solvers, because all previous methods involve steps that generate an upper triangular form of a partial or full template with LU or QR factorization or Gauss-Jordan elimination. For a large template with many equations, this step is slow and might be numerically unstable. For [6-8], the multiplication monomials are manually tuned for different minimal problems. Both [18] and [20] investigated the possibility of automatically removing equations while keeping the solvers numerically stable. They showed that one can remove equations along with the corresponding monomials without losing numerical stability, and even gain better numerical performance with single precision for certain minimal problems.
In addition to numerical stability, we here try to optimize the numerical accuracy by automatically removing equations. To achieve this, we apply a simple greedy search over which equations to remove while keeping the numerical accuracy of the solvers as good as possible. Given a redundant set of equations, we try to find the best equation to remove and then iterate until no equations can be removed, i.e. when removing any of the equations makes the template unsolvable. Specifically, we first remove a candidate equation from the original template. We then measure the mean log10 errors of the resulting template on a random set of problem examples (as training data). After running through all equations, we remove the equation whose removal gives the lowest errors. Generally, we can use any polynomial solver to evaluate the errors at each removal step. We have chosen the solver with basis selection based on column pivoting [8], since it generally gives better numerical accuracy.
By exploiting the fact that the excessive monomials do not contribute to the construction of the action matrix, we can remove certain equations without any local search and maintain the numerical stability of the template. Specifically, when expanding the equations, there can be some excessive monomials that appear in only one of the equations. We call these singular excessive monomials. For the first elimination in (4), if an equation contains singular monomials, removing it will not hurt the overall numerical conditioning of the LU or QR with proper pivoting. Therefore, all equations containing singular excessive monomials can be safely removed. One can always apply this trimming step with each local search step to speed up the removal process.
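This trimming can be sketched as follows (our own minimal version; `C` is the expanded coefficient matrix and `excessive_cols` indexes the columns of excessive monomials):

```python
import numpy as np

def trim_singular_excessive(C, excessive_cols):
    """Remove equations (rows) that contain a singular excessive
    monomial, i.e. an excessive monomial whose column has exactly
    one nonzero entry. Returns the trimmed matrix and dropped rows."""
    rows_to_drop = set()
    for c in excessive_cols:
        nz = np.flatnonzero(C[:, c])
        if len(nz) == 1:                  # monomial appears in one equation
            rows_to_drop.add(int(nz[0]))
    keep = [i for i in range(C.shape[0]) if i not in rows_to_drop]
    return C[keep], sorted(rows_to_drop)
```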
It might also be good to start with more equations, which might introduce better numerical conditioning for the first elimination in (4). We then rely on our equation removal step to select equations to improve the overall numerical accuracy. For monomial removal, we have not exploited the numerical routines to remove columns as in [20]. Here, we simply remove columns that become all zeros after certain equations are removed.
Given the tools introduced in the previous sections, we should be able to optimize a polynomial solver by finding the best combination of permissible set and equations for a specific problem. This is a difficult combinatorial problem. Therefore, as a first attempt, we here first optimize the permissible set while fixing the set of equations. We select the optimal permissible sets of different sizes on the original template (no equation removal). Then we apply greedy search to remove equations with these optimal permissible sets. All this training is done with a fixed set of random problem examples. By investigating the error distributions of all these solvers and the size of the template (number of equations, number of monomials), we can choose among these optimized solvers with a trade-off between accuracy and speed. Generally, we have the following steps for optimizing a polynomial solver for a minimal problem:
Input: Expanded coefficient matrix Cexp, initial permissible set P, a training set of problem
examples T, and a threshold parameter ε:
1. Perform permissible selection to choose PK for r ≤ K ≤ |P| as in Problem 2.
2. For each PK, perform equation removal; initialize Cslim = Cexp:
(a) Remove equation(s) containing singular excessive monomials.
(b) For each remaining row i in Cslim, evaluate the average error of the solver with
Cslim/i on a random subset of T.
(c) Update Cslim by removing the row i that gives the lowest mean error.
(d) Go back to (a) if the lowest average error in (c) is less than ε, until a predefined
number of iterations is reached.
Note that if the solver fails, we set the error to be infinite. Therefore, if we choose ε
to be large enough, this is similar to using solvability as the removal criterion in [18].
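The removal loop in steps (a)-(d) can be sketched as a greedy local search. A sketch under the assumption that `solver_error` is a user-supplied callback (an illustrative name) that builds and evaluates a solver from the surviving rows, returning the mean log10-error, with `inf` on failure:

```python
import numpy as np

def greedy_equation_removal(n_rows, solver_error, eps, max_iter=10):
    """At each step, drop the single row whose removal gives the lowest
    mean error; stop when no removal achieves an error below eps, or
    after max_iter steps. `solver_error(rows)` returns the mean
    log10-error of the solver built from the given rows (np.inf when
    the solver fails)."""
    rows = list(range(n_rows))
    for _ in range(max_iter):
        trial = {i: solver_error([r for r in rows if r != i]) for i in rows}
        best = min(trial, key=trial.get)
        if trial[best] >= eps:   # no removal is good enough; stop
            break
        rows.remove(best)
    return rows
```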
5 Experimental Validation
In this section, we apply our method to different minimal geometry problems in computer
vision. The first two problems are the estimation of the fundamental or essential matrix
with radial distortion for uncalibrated or calibrated cameras [8, 19]. The third problem is
three-point stitching with radial distortion [5, 15, 20]. All these problems have been
studied before and have been shown to be challenging when it comes to obtaining numerically
stable solvers. We compare our method with previous methods with respect to both numerical
accuracy and the size of the template. The state-of-the-art method is the column-pivoting
basis selection [7], which is also the building block of our method. We also compare with
two other methods [18, 20] that include template reduction procedures. In our implementation
of a general solver with column-pivoting basis selection, we have used QR factorization to
perform all the elimination steps. Our method involves a training stage where we perform
equation removal and permissible selection on a fixed set of random problem examples. For
all the comparisons, we test the resulting solvers on an independent set of random problems.
[Figure 2: plots of mean(log10(err)) vs. number of permissible monomials (left) and vs. number of equations (right).]
Fig. 2. Nine-point uncalibrated cameras with radial distortion. Left: Effects of permissible
selection and equation removal on the numerical accuracy of the polynomial solver. BS - basis
selection with column-pivoting [6], AG - automatic generator [18], CO - permissible selection
with combinatorial optimization, ER0 - our equation removal without removing equations with
singular excessives, ER - same as ER0 but with extra removal of singular excessive equations.
Right: Effects of removing equations with a fixed permissible set: (a) ER - degree 5, 393×390
template and (b) ER - degree 6, 680×595 template.
By linear elimination, one can reduce the system to 4 equations in 4 unknowns [19].
The reduced system has 24 solutions. In [18], the equations are first expanded to 497
equations, which are then reduced to 179 equations and 203 monomials. There is no
basis selection in this method and the basis set is chosen from the output of Macaulay2,
which is of size 24. On the other hand, in [8], a fine-tuned template of 393 equations and
390 monomials is used. Here, we use the template in [8] as our starting point.
We call this the original template. For the set of permissible monomials, we also choose
exactly the same set as in [8], which is of size 69. We note that further increasing
the set of permissible monomials makes the template unsolvable.
Selecting Permissibles. Following the steps in Section 4.1, we first run the solver with
393 equations and 390 monomials, and with the permissible set of size 69, on 5000
random problem examples. First of all, we notice that there are monomials that are
never selected as basis monomials for any problem example. Ideally, we could simply remove
these monomials from the permissible set. However, we need to force {x1, x2, x3, x4, 1} to
be in the permissible set even if they are never selected. This is because without these
monomials we cannot extract the solutions of the unknowns {x1, x2, x3, x4} with the
method in [7]¹. Therefore, excluding {x1, x2, x3, x4, 1}, we are able to remove 8 monomials
that are never selected from the permissible set. This leaves a possible
permissible set of size 61. We then continue to solve the problem in (8) with varying
K. We manage to solve the problem for K up to 27 with branch and bound. We can
see the effects of permissible selection in Figure 2 (Right, blue - solid); the solver is
still stable using only the smallest permissible set of size 32 (we force the 5 monomials to
¹ Note that in general, this is not required for the action matrix method, e.g. see [9].
[Figure 3: histogram and quantile plots of log10(err) for BS (#eq 393, #mon 390), AG (#eq 179, #mon 203) and ER (#eq 356, #mon 367; #eq 585, #mon 572).]
Fig. 3. Histogram (left) and quantiles (right) of log10(errors) of different solvers for the
uncalibrated radial distortion problem. BS - [6], AG - [18] and ER - equation removal (ours)
with different initial templates.
be included), which is better than without basis selection (Figure 1). The improvement
gained by permissible selection is more significant for permissible sets of smaller sizes.
We also note that permissible selection alone fails to improve the accuracy of the solver
consistently.
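The pruning of never-selected permissible monomials, while forcing {x1, x2, x3, x4, 1} to remain, can be sketched as follows. All names are illustrative; `selection_counts` stands in for the basis-selection statistics gathered over the training examples:

```python
def prune_permissible(permissible, selection_counts, forced):
    """Keep a permissible monomial if it was ever selected as a basis
    monomial on the training examples, or if it is in the forced set
    (here {x1, x2, x3, x4, 1}, needed to extract the solutions)."""
    return [m for m in permissible
            if selection_counts.get(m, 0) > 0 or m in forced]

perm = ['x1', 'x2', 'x1*x2', 'x3^2', '1']
counts = {'x1*x2': 120, 'x3^2': 0}      # 'x3^2' was never selected
kept = prune_permissible(perm, counts, forced={'x1', 'x2', 'x3', 'x4', '1'})
```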
Removing Equations. We first study the interplay between equation removal and permissible
selection. To do this, we perform equation removal on the template with the optimal
permissible sets of different sizes from branch and bound. We can see that by performing
several steps of equation removal (10 steps in this case), one gets a consistent improvement
of the solvers for the different optimal permissible sets (ER0 in Figure 2). This suggests
that one needs to combine permissible selection and equation removal to get compact and
numerically stable solvers. We also show that removing equations containing singular
excessive monomials has little effect on the numerical stability (ER in Figure 2).
Therefore, we can always apply this step to speed up the removal process.
To further understand the mechanism of removing equations, we fix the permissible set
to be the large set of size 69. We work with both the template T1 (393×390) and the further
expanded template T2 (680×595, up to degree 6). We then proceed to remove equations
using the local search described in Section 4.2. For each removal step, we evaluate the mean
log-errors on 100 random samples and remove the equation whose removal gives the
best numerical accuracy. We can see that the greedy search improves both templates at
the beginning, and that the templates get worse when more equations are removed (Figure
2, right)². Ideally, by optimization, the large template should converge to a better template
of smaller size. That it does not is due to the greedy nature of the optimization as well as
the randomness involved in the training data. Nevertheless, we can see that by removing
equations using the greedy search, we can significantly improve the numerical behavior
of T2. Note that the initial mean log10(errors) of T2 (Figure 1, left, degree 6) is around
-10.10. We can reduce the size of T2 to 585×572 (local minimum in Figure 2, right) while
² The initial errors (see Figure 1, left) are omitted for better visualization.
110 Y. Kuang and K. Astrom
in the meantime improving the accuracy to -11.15. This is even slightly better than the
best template reduced from T1 (356×367). This indicates that one can improve numerical
accuracy with a large template if one carefully optimizes the set of equations. We
also show the distribution of test errors for the solvers producing the lowest errors from
our training, compared to the state of the art, in Figure 3. We can see that both our
optimized solvers improve over [6] in accuracy. From the iterations of the optimization,
one gets a spectrum of solvers from which one can choose with a preference for speed or
accuracy.
[Figure 4: plot of mean(log10(err)) vs. number of permissible monomials, for BS, CO and ER.]
Fig. 4. Six-point uncalibrated cameras with radial distortion. Effects of permissible selection and
equation removal. BS - [6], CO - permissible selection with combinatorial optimization, ER - 5
steps of equation removal with permissibles from CO.
We notice in Figure 4 that the initial template (356×378) is fairly unstable compared
to the one reported in [8]. Here we first perform permissible selection on the
template. We observe that reducing the permissible size hurts the numerical accuracy
even with permissible selection. The equation removal step gains consistent improvements
for all permissible sizes. We include this example to show that the local
method can be applied to a numerically unstable template and still gives improvements.
[Figure 5: left, log10(errors) vs. number of equations (including ER + PS with r = 22, |P| = 22); right, quantiles of the log10(errors).]
Fig. 5. Performance of different polynomial solvers on the three-point stitching problem.
Left: log10(errors) with respect to the number of equations. Original - 90×132 template as
in [5], ICCV11 - optimization with sequential equation and monomial removal [20], ER - our
equation removal and PS - permissible selection. Right: (best viewed in color) quantiles of
the log10(errors).
Removing Equations. For the original solver in [5], the permissible set used was of the
same size as the basis (both are 25). Therefore, we perform equation removal without
permissible selection; we explore possible ways of doing permissible selection in the next
section. We can see that one can actually remove fairly many equations and also gain
numerical accuracy compared to the original template. Note that our optimized template
is also smaller (48×77 for the smallest one) than the one in [20], with slightly better
numerical accuracy (mean log-errors -10.56 vs. -10.27). For a fair comparison, similar
to [20], we only modify the publicly available solver from [5] with the corresponding
equations and monomials removed.
For the solver in [5], it turns out that the choice of the dimension of the basis also
affects the numerical accuracy. We studied this by running solvers for different pairs of
|P| and r. We find that the best combination of |P| and r is to use 25 permissibles with
r = 22. It is shown that by simply reducing r, we gain almost one order of magnitude of
improvement
in accuracy (Figure 5, green - dashed line). Note that this is a local search itself and
there might exist better combinations. We leave this as a future research direction. We can
further reduce the size of the template by removing equations. In this case, the smallest,
and actually also the best, template we get has 53 equations in 93 monomials. The
equation removal step again improves the numerical accuracy of the solver further.
Permissible Selection. Given that we have a working solver with |P| > r (|P| = 25 and
r = 22), we can investigate the potential of permissible selection. By solving
Problem (2) with K = 22, 23, 24, we find that having a large permissible set is always
better for this problem. It is of interest here to see what numerical accuracy we can get
with the smallest set of permissible monomials. Using 22 permissibles selected by branch
and bound and the equation removal step, the resulting solver (Figure 5, red - dotted line)
is not as good as the best solver of size 53 with 25 permissibles and r = 22. It is a
slightly slimmer solver (52×84), with a trade-off in increased errors.
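The search over combinations of |P| and r described above amounts to scoring each admissible pair and keeping the best. A sketch, where `eval_error` is a hypothetical callback that trains and evaluates a solver for a given pair (not the paper's code):

```python
def best_basis_dims(eval_error, P_sizes, r_values):
    """Score every admissible pair (|P|, r) with r <= |P| and return
    the pair with the lowest mean log10-error on a validation set.
    `eval_error(p, r)` is a user-supplied evaluation callback."""
    candidates = [(p, r) for p in P_sizes for r in r_values if r <= p]
    return min(candidates, key=lambda pr: eval_error(*pr))
```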
6 Conclusions
Previous state-of-the-art work has given us methods for making the coefficient matrix as
compact as possible to improve the speed of the algorithm. Other results exploit a larger
set of so-called permissible monomials from which the numerical linear algebra routine can
choose the basis for the quotient ring. In this paper we have made several observations on
the stability of polynomial equation solving using the so-called action matrix method.
First, it is shown that adding more equations can improve numerical accuracy, but it only
does so up to a point; adding too many equations can actually decrease numerical accuracy.
Secondly, it is shown that the choice of the so-called permissible monomials also affects
the numerical precision.
We present optimization algorithms that exploit these observations. Thus we are able to
produce solvers that range from very compact, while still retaining good numerical
accuracy, to solvers that involve larger sets of multiplication monomials and permissible
monomials to optimize for numerical accuracy. Our method is easy to implement and is
general across different problems. Therefore, it can serve as an initial tool for improving
minimal problem solvers involving large templates.
There are several interesting avenues for future research. First, the interplay between the
numerical linear algebra routines used and the choice of multiplication monomials and
permissible monomials should be better understood. Second, an increased understanding of
the mechanisms behind numerical accuracy could open up further improvements.
Acknowledgment. The research leading to these results has received funding from the
strategic research projects ELLIIT and eSSENCE, and the Swedish Foundation for Strategic
Research projects ENGROSS and VINST (grant no. RIT08-0043).
References
1. Agarwal, S., Chandraker, M.K., Kahl, F., Kriegman, D.J., Belongie, S.: Practical Global Optimization for Multiview Geometry. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006, Part I. LNCS, vol. 3951, pp. 592–605. Springer, Heidelberg (2006)
2. Brown, M., Hartley, R., Nister, D.: Minimal solutions for panoramic stitching. In: Proc. Conf. on Computer Vision and Pattern Recognition (CVPR 2007), Minneapolis (June 2007)
3. Bujnak, M., Kukelova, Z., Pajdla, T.: A general solution to the p4p problem for camera with unknown focal length. In: Proc. Conf. Computer Vision and Pattern Recognition, Anchorage, USA (2008)
4. Bujnak, M., Kukelova, Z., Pajdla, T.: Making Minimal Solvers Fast. In: Proc. Conf. Computer Vision and Pattern Recognition (2012)
5. Byröd, M., Brown, M., Åström, K.: Minimal solutions for panoramic stitching with radial distortion. In: Proc. British Machine Vision Conference, London, United Kingdom (2009)
6. Byröd, M., Josephson, K., Åström, K.: A Column-Pivoting Based Strategy for Monomial Ordering in Numerical Gröbner Basis Calculations. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part IV. LNCS, vol. 5305, pp. 130–143. Springer, Heidelberg (2008)
7. Byröd, M., Josephson, K., Åström, K.: Fast and stable polynomial equation solving and its application to computer vision. Int. Journal of Computer Vision 84(3), 237–255 (2009)
8. Byröd, M., Kukelova, Z., Josephson, K., Pajdla, T., Åström, K.: Fast and robust numerical solutions to minimal problems for cameras with radial distortion. In: Proc. Conf. Computer Vision and Pattern Recognition, Anchorage, USA (2008)
9. Cox, D., Little, J., O'Shea, D.: Using Algebraic Geometry. Springer (1998)
10. Cox, D., Little, J., O'Shea, D.: Ideals, Varieties, and Algorithms. Springer (2007)
11. Enqvist, O., Ask, E., Kahl, F., Åström, K.: Robust Fitting for Multiple View Geometry. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012, Part I. LNCS, vol. 7572, pp. 738–751. Springer, Heidelberg (2012)
12. Enqvist, O., Josephson, K., Kahl, F.: Optimal correspondences from pairwise constraints. In: Proc. 12th Int. Conf. on Computer Vision, Kyoto, Japan (2009)
13. Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24(6), 381–395 (1981)
14. Hartley, R., Sturm, P.: Triangulation. Computer Vision and Image Understanding 68, 146–157 (1997)
15. Jin, H.: A three-point minimal solution for panoramic stitching with lens distortion. In: Computer Vision and Pattern Recognition, Anchorage, Alaska, USA, June 24-26 (2008)
16. Kahl, F.: Multiple view geometry and the L∞-norm. In: ICCV, pp. 1002–1009 (2005)
17. Kahl, F., Henrion, D.: Globally optimal estimates for geometric reconstruction problems. In: Proc. 10th Int. Conf. on Computer Vision, Beijing, China, pp. 978–985 (2005)
18. Kukelova, Z., Bujnak, M., Pajdla, T.: Automatic Generator of Minimal Problem Solvers. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part III. LNCS, vol. 5304, pp. 302–315. Springer, Heidelberg (2008)
19. Kukelova, Z., Pajdla, T.: Two minimal problems for cameras with radial distortion. In: OMNIVIS (2007)
20. Naroditsky, O., Daniilidis, K.: Optimizing polynomial solvers for minimal geometry problems. In: ICCV, pp. 975–982 (2011)
21. Olsson, C., Enqvist, O., Kahl, F.: A polynomial-time bound for matching and registration with outliers. In: Proc. Conf. on Computer Vision and Pattern Recognition (2008)
22. Stewénius, H.: Gröbner Basis Methods for Minimal Problems in Computer Vision. PhD thesis, Lund University (April 2005)
23. Stewénius, H., Schaffalitzky, F., Nistér, D.: How hard is three-view triangulation really? In: Proc. Int. Conf. on Computer Vision, Beijing, China, pp. 686–693 (2005)
24. Torr, P., Zisserman, A.: Robust parameterization and computation of the trifocal tensor. Image and Vision Computing 15(8), 591–605 (1997)
25. Torr, P., Zisserman, A.: Robust computation and parametrization of multiple view relations. In: Proc. 6th Int. Conf. on Computer Vision, Mumbai, India, pp. 727–732 (1998)
Has My Algorithm Succeeded?
An Evaluator for Human Pose Estimators
Abstract. Most current vision algorithms deliver their output as is, without
indicating whether it is correct or not. In this paper we propose evaluator algo-
rithms that predict if a vision algorithm has succeeded. We illustrate this idea for
the case of Human Pose Estimation (HPE).
We describe the stages required to learn and test an evaluator, including the use
of an annotated ground truth dataset for training and testing the evaluator (and
we provide a new dataset for the HPE case), and the development of auxiliary
features that have not been used by the (HPE) algorithm, but can be learnt by the
evaluator to predict if the output is correct or not.
Then an evaluator is built for each of four recently developed HPE algorithms
using their publicly available implementations: Eichner and Ferrari [5], Sapp
et al. [16], Andriluka et al. [2] and Yang and Ramanan [22]. We demonstrate
that in each case the evaluator is able to predict if the algorithm has correctly
estimated the pose or not.
1 Introduction
Computer vision algorithms for recognition are getting progressively more complex as
they build on earlier work including feature detectors, feature descriptors and classi-
fiers. A typical object detector, for example, may involve multiple features and multiple
stages of classifiers [7]. However, apart from the desired output (e.g. a detection win-
dow), at the end of this advanced multi-stage system the sole indication of how well the
algorithm has performed is typically just the score of the final classifier [8,18].
The objective of this paper is to try to redress this balance. We argue that algorithms
should and can self-evaluate. They should self-evaluate because this is a necessary
requirement for any practical system to be reliable. That they can self-evaluate is
demonstrated in this paper for the case of human pose estimation (HPE). Such HPE evaluators
can then take their place in the standard armoury of many applications, for example
removing incorrect pose estimates in video surveillance, pose based image retrieval, or
action recognition.
In general, four elements are needed to learn an evaluator: a ground truth annotated
database that can be used to assess the algorithm's output; a quality measure comparing
the output to the ground truth; auxiliary features for measuring the output; and a
classifier that is learnt as the evaluator. After learning, the evaluator can be used to
predict if the algorithm has succeeded on new test data, for which ground truth is not
available.
We discuss each of these elements in turn in the context of HPE.
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 114–128, 2012.
© Springer-Verlag Berlin Heidelberg 2012
For the HPE task considered here, the task is to predict the 2D (image) stickman
layout of the person: head, torso, upper and lower arms. The ground truth annotation in
the dataset must provide this information. The pose quality is measured by the difference
between the predicted layout and the ground truth, for example the difference in their
angles or joint positions. The auxiliary features can be of two types: those that are used
or computed by the HPE algorithm, for example max-marginals of limb positions; and
those that have not been considered by the algorithm, such as proximity to the image
boundary (the boundary is often responsible for erroneous estimations). The estimate
given by the HPE on each instance from the dataset is then classified, using a threshold
on the pose quality measure, to determine positive and negative outputs; for example,
if more than a certain number of limbs are incorrectly estimated, then this would be
deemed a negative. Given these positive and negative training examples and both types
of auxiliary features, the evaluator can be learnt using standard methods (here an SVM).
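The labeling-plus-training pipeline can be sketched as follows. As a stand-in for the SVM used in the paper, a tiny logistic-regression classifier is trained here so the sketch stays dependency-free; all names and thresholds are illustrative:

```python
import numpy as np

def make_labels(n_bad_limbs, max_bad=1):
    """Label a pose estimate positive when at most `max_bad` limbs are
    incorrectly estimated (thresholding the pose quality measure)."""
    return (np.asarray(n_bad_limbs) <= max_bad).astype(float)

def train_evaluator(X, y, lr=0.5, iters=3000):
    """Train a tiny logistic-regression evaluator on auxiliary feature
    vectors X (a linear stand-in for the SVM used in the paper).
    Returns a predictor mapping features to success (True) / failure."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probabilities
        w -= lr * X.T @ (p - y) / len(y)          # gradient step on weights
        b -= lr * np.mean(p - y)                  # gradient step on bias
    return lambda Xq: (np.asarray(Xq, dtype=float) @ w + b) > 0.0
```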
We apply this evaluator learning framework to four recent publicly available methods:
Eichner and Ferrari [5], Sapp et al. [16], Andriluka et al. [2] and Yang and Ramanan [22].
The algorithms are reviewed in section 2. The auxiliary features, pose quality measure, and
learning method are described in section 3. For the datasets, section 4, we use existing
ground truth annotated datasets, such as ETHZ PASCAL Stickmen [5] and Humans in 3D [4], and
supplement these with additional annotation where necessary, and also introduce a new
dataset to provide a larger number of training and test examples. We assess the evaluator
features and method on the four HPE algorithms, and demonstrate experimentally that the
proposed evaluator can indeed predict when the algorithms succeed.
Note that the task of an HPE evaluator is not the same as that of deciding whether the
underlying human detection is correct or not. It might well be that a correct detection
leads to an incorrect pose estimate. Moreover, the evaluator cannot be used directly as a
pose estimator either: a pose estimator predicts a pose out of an enormous space of
possible structured outputs. The evaluator's job of deciding whether a pose is correct is
different from, and easier than, that of producing a correct pose estimate.
Related Work. On the theme of evaluating vision algorithms, the work most related to ours
is Mac Aodha et al. [12], where the goal is to choose which optical flow algorithm to apply
to a given video sequence. They cast the problem as multi-way classification. In visual
biometrics, there has also been extensive work on assessor algorithms which predict an
algorithm's failure [17,19,1]. These assess face recognition algorithms by analyzing the
similarity scores between a test sample and all training images. The method of [19] also
takes advantage of the similarity within template images. But none of these explicitly
considers other factors like the imaging conditions of the test query (as we do); [1], on
the other hand, only takes the imaging conditions into account. Our method is designed for
another task, HPE, and considers several indicators all at the same time, such as the
marginal distribution over the possible pose configurations for an image, imaging
conditions, and the spatial arrangement of multiple detection windows in the same image.
Other methods [20,3] predict performance by statistical analysis of the training and test
data. However, such methods cannot be used to predict the performance on individual
samples, which is the goal of this paper.
116 N. Jammalamadaka et al.
Fig. 1. Features based on the output of HPE. Examples of unimodal, multi-modal and large
spread pose estimates. Each image is overlaid with the best configuration (sticks) and the posterior
marginal distributions (semi-transparent mask). Max and Var are the features measured from
this distribution (defined in the text). As the distribution moves from peaked unimodal to more
multi-modal and diffuse, the maximum value decreases and the variance increases.
Fig. 2. Human pose estimated based on an incorrect upper-body detection: a typical example
of a human pose estimated in a false positive detection window. The pose estimate (MAP) is
an unrealistic pose (shown as a stickman, i.e. a line segment per body part) and the
posterior marginals (PM) are very diffuse (shown as a semi-transparent overlay: torso in
red, upper arms in blue, lower arms and head in green). The pose estimation output is from
Eichner and Ferrari [5].
Fig. 3. Pose normalizing the marginal distribution. The marginal distribution (contour
diagram in green) on the left is pose normalized by rotating it by the angle of the body
part (line segment in grey) to obtain the transformed distribution on the right. In the
inset, the pertinent body part (line segment in grey) is displayed as a constituent of the
upper body.
Pictorial Structures (PS). PS [9] model a person's body parts as a conditional random
field (CRF). In the four HPE methods we review, parts $l_i$ are parameterized by location
$(x, y)$, orientation $\theta$ and scale $s$. The posterior of a configuration of parts is
$P(L|I) \propto \exp\big(\sum_{(i,j)\in E} \psi(l_i, l_j) + \sum_i \phi(I|l_i)\big)$. The
unary potential $\phi(I|l_i)$ evaluates the local image evidence for a part in a particular
position. The pairwise potential $\psi(l_i, l_j)$ is a prior on the relative position of
parts. In most works the model structure $E$ is a tree [5,9,14,15], which enables exact
inference. The most common type of inference is to find the maximum a posteriori (MAP)
configuration $L^* = \arg\max_L P(L|I)$. Some works perform another type of inference, to
determine the posterior marginal distributions over the position of each part [5,14,11].
This is important in this paper, as some of the features we propose are based on the
marginal distributions (section 3.1). The four HPE techniques [5,16,2,22] we consider are
capable of both types of inference. In this paper, we assume that each configuration has
six parts corresponding to the upper body, but the framework supports any number of parts
(e.g. 10 for a full body [2]).
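For a toy discrete state space, exact MAP inference over the posterior above can be sketched by exhaustive scoring (real PS implementations instead run dynamic programming on the tree, e.g. with distance transforms); all names here are illustrative:

```python
import numpy as np
from itertools import product

def map_configuration(unary, pairwise, edges):
    """Brute-force L* = argmax_L [ sum_i phi(I|l_i) + sum_(i,j) psi(l_i, l_j) ]
    for a tiny discrete part model. unary[i][s] is the unary score of
    part i in state s; pairwise[(i, j)][s, t] is the pairwise score."""
    n_parts, n_states = len(unary), len(unary[0])
    best, best_score = None, -np.inf
    for L in product(range(n_states), repeat=n_parts):
        score = sum(unary[i][L[i]] for i in range(n_parts))
        score += sum(pairwise[(i, j)][L[i], L[j]] for (i, j) in edges)
        if score > best_score:
            best, best_score = L, score
    return best
```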
Eichner and Ferrari [5]. Eichner et al. [5] automatically estimate body part appearance
models (colour histograms) specific to the particular person and image being analyzed,
before running PS inference. These improve the accuracy of the unary potential $\phi$,
which in turn results in better pose estimation. Their model also adopts the
person-generic edge templates as part of $\phi$, and the pairwise potential of [14].
Sapp et al. [16]. This method employs complex unary potentials, and non-parametric
data-dependent pairwise potentials, which accurately model human appearance. However,
these potentials are expensive to compute and do not allow for typical inference
speed-ups (e.g. the distance transform [9]). Therefore, [16] runs inference repeatedly,
starting from a coarse state-space for the body parts and moving to finer and finer
resolutions.
Andriluka et al. [2]. The core of this technique lies in discriminatively trained part
detectors for the unary potential $\phi$. These are learned using AdaBoost on shape
context descriptors. The authors model the kinematic relations between body parts using
Gaussian distributions. For the best configuration of parts, they independently infer the
location of each part from its marginal distribution (taking the maximum), which is
suboptimal compared to a MAP estimate.
Yang and Ramanan [22]. This method learns unary and pairwise potentials using a very
effective discriminative model based on the latent-SVM algorithm proposed in [8]. Unlike
the other HPE methods, which model limbs as parts, in this method parts correspond to the
mid and end points of each limb. As a consequence, the model has 18 parts. Each part is
modeled as a mixture to capture the orientations of the limbs.
This algorithm searches over multiple scales, locations and all the part mixtures (thus
implicitly over rotations). However, it returns many false positives. The upper-body
detector (UBD) is used as a post-processing step to filter these out, as its precision is
higher. In detail, all human detections which overlap less than 50% with any UBD detection
are discarded. Overlap is measured using the standard intersection-over-union
criterion [7].
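The filtering step just described can be sketched directly from the intersection-over-union definition; function names are illustrative:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def filter_by_ubd(detections, ubd_detections, min_overlap=0.5):
    """Keep a detection only if it overlaps at least min_overlap (IoU)
    with some upper-body-detector (UBD) window."""
    return [d for d in detections
            if any(iou(d, u) >= min_overlap for u in ubd_detections)]
```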
3.1 Features
We propose a set of features to capture the conditions under which an HPE algorithm is
likely to make mistakes. We identify two types of features: (i) those based on the output
of the HPE algorithm: the score of the HPE algorithm, the marginal distribution, and the
best configuration of parts $L^*$; and (ii) those based on the extended detection window:
its position, overlap with other detections, etc. These are features which have not been
used directly by the HPE algorithm. We describe both types of features next.
1. Features from the Output of the HPE. The outputs of the HPE algorithm consist of a
marginal probability distribution $P_i$ over $(x, y, \theta)$ for each body part $i$, and
the best configuration of parts $L^*$. The features computed from these outputs measure the
spread and multi-modality of the marginal distribution of each part. As can be seen from
figures 1 and 2, the multi-modality and spread correlate well with the error in the pose
estimate. Features are computed for each body part in two stages: first, the marginal
distribution is pose-normalized; then the spread of the distribution is measured by
comparing it to an ideal signal.
The best configuration $L^*$ predicts an orientation for each part. This orientation is
used to pose-normalize the marginal distribution (which is originally axis-aligned) by
rotating the distribution so that the predicted orientation corresponds to the x-axis
(illustrated in figure 3). The pose-normalized marginal distribution is then factored into
three separate $x$, $y$ and $\theta$ spaces by projecting the distribution onto the
respective axes. Empirically we have found that this step improves the discriminability.
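The pose-normalization step can be sketched with weighted sample points standing in for the dense marginal; the rotation maps the predicted part orientation onto the x-axis before projecting. All names are illustrative:

```python
import numpy as np

def pose_normalize_and_project(points, weights, theta):
    """Rotate weighted (x, y) samples of a part's marginal by -theta so
    the predicted part orientation maps to the x-axis, then project the
    distribution onto the axes by returning the rotated x and y
    coordinates, each paired with its weights, as 1-D distributions."""
    c, s = np.cos(-theta), np.sin(-theta)
    R = np.array([[c, -s],
                  [s,  c]])            # rotation by -theta
    rotated = points @ R.T
    return (rotated[:, 0], weights), (rotated[:, 1], weights)

# A part lying along 45 degrees becomes axis-aligned after normalization.
pts = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
w = np.array([0.2, 0.6, 0.2])
(xs, _), (ys, _) = pose_normalize_and_project(pts, w, np.pi / 4)
```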
A descriptor measuring its spread is then computed for each of the separate distributions.
For this we appeal to the idea of a matched filter, and compare the distribution to an
ideal unimodal one, $P^*$, which models the marginal distribution of a perfect pose
estimate. The distribution $P^*$ is assumed to be Gaussian and its variance is estimated
from training samples with near-perfect pose estimates. The unimodal distribution shown in
figure 1 is an example corresponding to a near-perfect pose estimate.
The actual feature is obtained by convolving the ideal distribution $P^*$ with the
measured distribution (after the normalization step above), and recording the maximum
value and variance of the convolution. Thus for each part we have six features, two for
each dimension, resulting in a total of 36 features for an upper body. The entropy and
variance of the distribution $P_i$, and the score of the HPE algorithm, are also used as
features, taking the final total to 36 + 13 features.
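The matched-filter descriptor for one projected 1-D marginal can be sketched as follows, assuming a Gaussian ideal response (the variance would be estimated from training samples; here it is simply fixed for illustration):

```python
import numpy as np

def spread_features(dist, ideal):
    """Matched-filter style descriptor: convolve a 1-D projected
    marginal with an ideal unimodal (Gaussian) response and record the
    maximum value and the variance of the result. A peaked, unimodal
    marginal gives a high maximum; a diffuse one gives a low maximum."""
    response = np.convolve(dist, ideal, mode='same')
    return response.max(), response.var()

x = np.linspace(-3, 3, 61)
ideal = np.exp(-x**2 / 2)
ideal /= ideal.sum()                       # normalize the ideal response

peaked = np.zeros(61); peaked[30] = 1.0    # near-perfect pose estimate
diffuse = np.ones(61) / 61                 # very uncertain estimate
max_peaked, _ = spread_features(peaked, ideal)
max_diffuse, _ = spread_features(diffuse, ideal)
```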
Algorithm Specific Details: While the procedure to compute the feature vector is
essentially the same for all four HPE algorithms, the exact details vary slightly. For
Andriluka et al. [2] and Eichner and Ferrari [5], we use the posterior marginal
distributions to compute the features, while for Sapp et al. [16] and Yang and
Ramanan [22], we use the max-marginals. Further, for [22] the pose-normalization is
omitted, as the max-marginal distributions are over the mid and end points of the limb
rather than over the limb itself. For [22], we compute the max-marginals ourselves as they
are not available directly from the implementation.
2. Features from the Detection Window. We now detail the 10 features computed over
the extended detection window. These consist of the scale of the extended detection
window, and the confidence score of the detection window as returned by the upper
body detector. The remainder are:
Two Image Features: the mean image intensity and mean gradient strength over the
extended detection window. These aim to capture the lighting conditions and
the amount of background clutter. Typically, HPE algorithms fail when either of them
has a very large or a very small value.
Four Location Features: these are the fraction of the area outside each of the four image
borders. Algorithms also tend to fail when people are very small in the image, which
is captured by the scale of the extended detection window. The location features are
based on the idea that the larger the portion of the extended detection window which
lies outside the image, the less likely it is that HPE algorithms will succeed (as this
indicates that some body parts are not visible in the image).
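The four border fractions can be sketched as follows; names and the top-left-corner box convention are illustrative assumptions, not the paper's representation:

```python
def location_features(win, img_w, img_h):
    """Fraction of the extended detection window's area lying outside each
    image border (left, top, right, bottom). `win` = (x, y, w, h) with
    (x, y) the top-left corner."""
    x, y, w, h = win
    area = float(w * h)
    left   = min(max(0, -x), w) * h / area          # strip cut off at the left edge
    top    = min(max(0, -y), h) * w / area          # strip cut off at the top edge
    right  = min(max(0, (x + w) - img_w), w) * h / area
    bottom = min(max(0, (y + h) - img_h), h) * w / area
    return left, top, right, bottom
```

A window fully inside the image yields all zeros; a window hanging 10 px off the left edge of a 200x200 image with width 100 yields a left fraction of 0.1.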
Two Overlap Features: (i) the maximum overlap, and (ii) the sum of overlaps, over
all the neighbouring detection windows normalized by the area of the current detection
window. As illustrated in Figure 4, HPE algorithms can be affected by neighbouring people.
120 N. Jammalamadaka et al.
Fig. 4. Overlap features. Left: three overlapping detection windows. Right: the HPE output is
incorrect due to interference from the neighbouring people. This problem can be flagged by the
detection window overlap (output from Eichner and Ferrari [5]).
The overlap features capture the extent of occlusion by the neighbouring detection
windows. Overlap with other people indicates how close they are. While large
overlaps occlude many parts of the upper body, small overlaps also affect HPE
performance, as the algorithm could pick up parts (especially arms) from neighbouring people.
While measuring the overlap, we consider only those neighbouring detections which
have a similar scale (between 0.75 and 1.25 times the current one). Other neighbouring
detections at a different scale typically do not affect the pose estimation algorithm.
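The two overlap features can be sketched as follows; a minimal illustration assuming (x, y, w, h) windows and window height as the scale proxy (the paper's exact scale definition is an assumption here):

```python
def overlap_features(current, neighbours, scale_range=(0.75, 1.25)):
    """(i) max overlap and (ii) sum of overlaps with neighbouring detection
    windows, each normalized by the current window's area. Only neighbours
    whose scale lies within `scale_range` times the current scale count."""
    cx, cy, cw, ch = current
    area = float(cw * ch)
    overlaps = []
    for nx, ny, nw, nh in neighbours:
        scale = nh / float(ch)                      # height as the scale proxy
        if not (scale_range[0] <= scale <= scale_range[1]):
            continue                                # different scale: ignored
        iw = max(0, min(cx + cw, nx + nw) - max(cx, nx))
        ih = max(0, min(cy + ch, ny + nh) - max(cy, ny))
        overlaps.append(iw * ih / area)
    return (max(overlaps), sum(overlaps)) if overlaps else (0.0, 0.0)
```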
For evaluating the quality of a pose estimate, we devise a measure which we term the
Continuous Pose Cumulative error (CPC) for measuring the dissimilarity between two
poses. It ranges in [0, 1], with 0 indicating a perfect pose match. In brief, CPC depends
on the sum of normalized distances between the corresponding end points of the parts.
Figure 5 gives examples of actual CPC values of poses as they move away from a ref-
erence pose. The CPC measure is similar to the percentage of correctly estimated body
parts (PCP) measure of [5]. However, CPC adds all distances between parts, whereas
PCP counts the number of parts whose distance is below a threshold. Thus PCP takes in-
teger values in {0, . . . , 6}. In contrast, for proper learning in our application we require
a continuous measure, hence the need for the new CPC measure.
In detail, the CPC measure computes the dissimilarity between two poses as the sum
of the differences in the position and orientation over all parts. Each pose is described
by N parts, and each part p in pose A is represented by a line segment going from
point s_p^a to point e_p^a, and similarly for pose B. All the coordinates are normalized with
respect to the detection window in which the pose is estimated. The angle of part p in
pose A is θ_p^a. With these definitions, the CPC(A, B) between two poses A and B is:

    CPC(A, B) = Σ_{p=1}^{N} w_p · (‖s_p^a − s_p^b‖ + ‖e_p^a − e_p^b‖) / (2 ‖s_p^a − e_p^a‖),
    with w_p = 1 + σ( sin( |θ_p^a − θ_p^b| / 2 ) )                                    (1)
Has My Algorithm Succeeded? An Evaluator for Human Pose Estimators 121
where σ is the sigmoid function and θ_p^a − θ_p^b is the relative angle between part p in poses
A and B, and lies in [−π, π]. The weight w_p is a penalty for two corresponding parts
of A and B not being in the same direction. Figure 6 depicts the notation in the above
equation.
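A minimal sketch of the CPC computation, assuming parts are matched by index and the weight follows w_p = 1 + σ(sin(|θ_a − θ_b|/2)); the normalization that keeps CPC in [0, 1] and the occlusion rule are omitted here:

```python
import math

def cpc(pose_a, pose_b):
    """Continuous Pose Cumulative error between two poses.
    A pose is a list of parts; a part is ((sx, sy), (ex, ey)) with
    coordinates already normalized by the detection window."""
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    dist = lambda u, v: math.hypot(u[0] - v[0], u[1] - v[1])
    total = 0.0
    for (sa, ea), (sb, eb) in zip(pose_a, pose_b):
        theta_a = math.atan2(ea[1] - sa[1], ea[0] - sa[0])
        theta_b = math.atan2(eb[1] - sb[1], eb[0] - sb[0])
        w = 1.0 + sigmoid(math.sin(abs(theta_a - theta_b) / 2.0))  # direction penalty
        # sum of endpoint distances, normalized by twice the part length
        total += w * (dist(sa, sb) + dist(ea, eb)) / (2.0 * dist(sa, ea))
    return total
```

Identical poses score exactly 0; the score grows smoothly as endpoints and part angles diverge, which is what makes it usable as a regression-style learning target.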
Fig. 5. Poses with increasing CPC. An example per CPC is shown for each of the two reference
poses, for CPC measures of 0.1, 0.2, 0.3 and 0.5. As can be seen, the example poses move smoothly
away from the reference pose with increasing CPC, with both the number of differing parts and the
extent of the differences increasing. For 0.1 there is almost no difference between the examples and
the reference pose. At 0.2, the examples and reference can differ slightly in the angle of one or two
limbs, but from 0.3 on there can be substantial differences, with poses differing entirely by 0.5.
Fig. 6. Pose terminology: corresponding parts p of poses A and B are illustrated. Each part
is described by its starting point s_p^a, end point e_p^a and part angle θ_p^a.
Fig. 7. Annotated ground-truth samples (stickman overlay) for two frames from the
Movie Stickmen dataset.
To handle partial occlusion of limbs, CPC is set to 0.5 if the ground-truth stickman has occluded
parts. In effect, people who are only partially visible in the image get a high CPC. This
provides training data telling the evaluator that the HPE algorithms will fail on
these cases.
A CPC threshold of 0.3 is used to separate all poses into a positive set (deemed
correct) and a negative set (deemed incorrect). This threshold is chosen because pose
estimates with CPC below 0.3 are nearly perfect and roughly correspond to PCP = 6
with a threshold of 0.5 (figure 5). A linear SVM is then trained on the positive and
negative sets using the auxiliary features described in section 3.1. The feature vector
has 59 dimensions, and is a concatenation of the two types of features (49 components
based on the output of the HPE, and 10 based on the extended detection window). This
classifier is the evaluator, and will be used to predict the quality of pose estimates on
novel test images.
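The training step can be sketched with a toy Pegasos-style linear SVM on synthetic 59-dimensional features; everything here (solver, data, labels) is illustrative, not the paper's actual training code:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=50):
    """Minimal Pegasos-style linear SVM (hinge loss, L2 regularization).
    y must be in {-1, +1}; returns the weight vector (last entry = bias)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # absorb the bias term
    w = np.zeros(Xb.shape[1])
    t = 0
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            t += 1
            eta = 1.0 / (lam * t)
            w = (1 - eta * lam) * w             # regularization shrinkage
            if yi * w.dot(xi) < 1:              # margin violation: hinge step
                w += eta * yi * xi
    return w

# toy stand-in for the 59-dim auxiliary features and CPC-derived labels
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 59))
y = np.where(X[:, 0] > 0, 1, -1)                # separable on the first feature
w = train_linear_svm(X, y)
scores = np.hstack([X, np.ones((100, 1))]).dot(w)   # evaluator confidences
```

The signed score is what gets thresholded for success/failure and plotted as the ROC confidence at test time.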
The performance of the evaluator algorithm is discussed in the experiments of sec-
tion 4.2. Here we comment on the learnt weight vector from the linear SVM. The magni-
tude of the components of the weight vector suggests that the features based on marginal
probability, and location of the detection window, are very important in distinguishing
the positive samples from the negatives. On the other hand, the weights of the upper
body detector, image intensity and image gradient are low, so these do not have much
impact.
4 Experimental Evaluation
After describing the datasets we experiment on (section 4.1), we present a quantitative
analysis of the pose quality evaluator (section 4.2).
Table 1. (a) Dataset statistics: the number of images and annotated stickmen in the training and
test data. (b) Pose estimation evaluation: the PCP (see footnote 1) of the four HPE algorithms
(Andriluka et al. [2], Eichner and Ferrari [5], Sapp et al. [16], Yang and Ramanan [22]) averaged
over each dataset (at PCP threshold 0.5, section 3.2). The numbers in brackets are the results
reported in the original publications; the differences in PCP are due to the different versions of
the UBD software used. The performances across the datasets indicate their relative difficulty.
(a) Dataset statistics:

Dataset        images  gt-train  gt-test
ETHZ Pascal    549     0         549
Humans in 3D   429     1002      0
Movie          5984    5804      5835
Buffy2         499     775       0
Total          7270    7581      6834

(b) Pose estimation evaluation (PCP per dataset):

Dataset         [2]    [5]          [16]         [22]
Buffy stickmen  78.3   81.6 (83.3)  84.2         86.7
ETHZ Pascal     65.9   68.5         71.4 (78.2)  72.4
Humans in 3D    70.3   71.3         75.3         77.8
Movie           64.6   70.4         70.6         76.0
Buffy2          65.0   67.5         74.2         81.7
4.1 Datasets
We experiment on four datasets. Two are the existing ETHZ PASCAL Stickmen [5] and
Humans in 3D [4], and we introduce two new datasets, called Buffy2 Stickmen and
Movie Stickmen. All four datasets are challenging as they show people at a range of
scales, wearing a variety of clothes, in diverse poses, lighting conditions and scene
backgrounds. Buffy2 Stickmen is composed of frames sampled from episodes 2 and 3
of season 5 of the TV show Buffy the Vampire Slayer (note this dataset is distinct from
the existing Buffy Stickmen dataset [11]). Movie Stickmen contains frames sampled
from ten Hollywood movies (About a Boy, Apollo 13, Four Weddings and a Funeral,
Forrest Gump, Notting Hill, Witness, Gandhi, Love Actually, The Graduate, Groundhog
Day).
The data is annotated with upper body stickmen (6 parts: head, torso, upper and
lower arms) under the following criteria: all humans who have more than three parts
visible are annotated, provided their size is at least 40 pixels in width (so that there is
sufficient resolution to see the parts clearly). For near frontal (or rear) poses, six parts
may be visible; for side views, only three. See the examples in figure 7. This fairly
complete annotation includes self-occlusions and occlusions by other objects and other
people.
Table 1a gives the number of images and annotated stickmen in each dataset. As a
training set for our pose quality evaluator, we take Buffy2 Stickmen, Humans in 3D
and the first five movies of Movie Stickmen. ETHZ PASCAL Stickmen and the last five
movies of Movie Stickmen form the test set, on which we report the performance of the
pose quality evaluator in the next subsection.
Performance of the HPE Algorithms. For reference, table 1b gives the PCP performance
of the four HPE algorithms. Table 2 gives the percentage of samples where the pose is
estimated accurately, i.e. where the CPC between the estimated and ground-truth stickmen
is < 0.3.

Table 2. Percentage of accurately estimated poses (CPC < 0.3):

Dataset   [2]    [5]    [16]   [22]
Train     8.5    10.7   11.6   18.5
Test      9.9    11.1   12.0   16.5
Total     9.1    10.9   11.8   17.6

Table 3. AUC of the pose evaluator (PA) versus the baseline (BL) for different CPC thresholds:

               CPC 0.2      CPC 0.3      CPC 0.4
HPE            BL    PA     BL    PA     BL    PA
Andriluka [2]  56.7  90.0   56.2  90.0   55.8  89.2
Eichner [5]    84.3  92.6   81.6  91.6   80.5  90.9
Sapp [16]      76.5  82.5   76.5  83.0   76.9  83.5
Yang [22]      79.5  83.7   78.4  81.5   78.4  81.2

Footnote 1: We are using the PCP measure as defined by the source code of the evaluation
protocol in [5], as opposed to the more strict interpretation given in [13].

In all cases, we use the implementations provided by the authors [5,16,2,22] and
all methods are given the same detection windows [6] as preprocessing. Both measures
agree on the relative ranking of the methods: Yang and Ramanan [22] performs best,
followed by Sapp et al. [16], Eichner and Ferrari [5] and then by Andriluka et al. [2].
This confirms experiments reported independently in previous papers [5,16,22]. Note
that we report these results only as a reference, as the absolute performance of the HPE
algorithms is not important in this paper. What matters is how well our newly proposed
evaluator can predict whether an HPE algorithm has succeeded.
Here we evaluate the performance of the pose quality evaluator for the four HPE al-
gorithms. To assess the evaluator, we use the following definitions. A pose estimate is
defined as positive if it is within CPC 0.3 of the ground truth, and as negative otherwise.
The evaluator's output (positive or negative pose estimate) is defined as successful if
it correctly predicts a positive (true positive) or negative (true negative) pose estimate,
and as a failure when it incorrectly predicts a positive (false positive) or negative
(false negative) pose estimate. Using these definitions, we assess the performance
of the evaluator by plotting an ROC curve.
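Given evaluator scores and CPC-derived success labels, the ROC summary statistic (AUC) can be computed via the rank-sum formulation; a minimal sketch:

```python
def roc_auc(scores, labels):
    """AUC via the rank-sum (Mann-Whitney) statistic: the probability that
    a randomly drawn true positive outscores a randomly drawn true negative.
    `labels` are 1 (success, CPC < 0.3) or 0 (failure); ties count half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```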
The performance is evaluated under two regimes: (A) only where the predicted HPE
corresponds to one of the annotations. Since the images are fairly completely anno-
tated, any upper body detection window [6] which does not correspond to an annotation is
considered a false positive. In this regime such false positives are ignored; (B) all pre-
dictions are evaluated, including false positives. The first regime corresponds to: given
there is a human in the image at this point, how well can the proposed method evalu-
ate the pose-estimate? The second regime corresponds to: given there are wrong upper
body detections, how well can the proposed method evaluate the pose-estimate? The
protocol for assigning a HPE prediction to a ground truth annotation was described in
[Figure 8: two ROC curves (percentage of true positives vs. percentage of false positives,
both axes from 0 to 1). AUC values, left plot (Regime A): Eichner [5] evaluator 88.8 vs.
baseline 78.2; Andriluka [2] evaluator 85.6 vs. baseline 57.4; Sapp [16] evaluator 77.0 vs.
baseline 69.7; Yang [22] evaluator 77.0 vs. baseline 70.0. Right plot (Regime B): Eichner [5]
evaluator 91.6 vs. baseline 81.6; Andriluka [2] evaluator 90.0 vs. baseline 56.2; Sapp [16]
evaluator 83.0 vs. baseline 76.5; Yang [22] evaluator 81.5 vs. baseline 78.4.]
Fig. 8. (a) Performance of HPE evaluator in regime A: (no false positives used in training or
testing). The ROC curve shows that the evaluator can successfully predict whether an estimated
pose has CPC < 0.3. (b) Performance of HPE evaluator in Regime B: (false positives included
in training and testing).
section 3.3. For training regime B, any pose on a false-positive detection is assigned a
CPC of 1.0.
Figure 8a shows performance for regime A, and figure 8b for B. The ROC curves are
plotted using the score of the evaluator, and the summary measure is the Area Under the
Curve (AUC). The evaluator is compared to a relevant baseline that uses the HPE score
(i.e. the energy of the most probable (MAP) configuration L∗) as a confidence measure.
For Andriluka et al. [2], the baseline is the sum over all parts of the maximum value of
the marginal distribution for a part. The plots demonstrate that the evaluator works well,
and outperforms the baseline for all the HPE methods. Since all the HPE algorithms use
the same upper body detections, their performance can be compared fairly.
To test the sensitivity to the 0.3 CPC threshold, we also learn pose evaluators using
CPC thresholds 0.2 and 0.4 under regime B. Table 3 shows the performance
of the pose evaluator for the different CPC thresholds over all the HPE algorithms. Again,
our pose evaluator shows significant improvements over the baseline in all cases. The
improvement in AUC for Andriluka et al. [2] is over 33 points. We believe that this massive
increase is due to the suboptimal inference method used for computing the best config-
uration. For Eichner and Ferrari [5], Sapp et al. [16], and Yang and Ramanan [22], our pose
evaluator brings an average increase of 9.6, 6.4 and 3.4 respectively across the CPC
thresholds. Interestingly, our pose evaluator has similar performance across the different
CPC thresholds.
Our pose evaluator successfully detects cases where the HPE algorithms succeed
or fail, as shown in figure 9. In general, an HPE algorithm fails where there is self-
occlusion or occlusion from other objects, when the person is very close to the image
borders or at an extreme scale, or when they appear in very cluttered surroundings.
These cases are caught by the HPE evaluator.
Fig. 9. Example evaluations. The pose estimates in the first two rows are correctly classified as
successes by our pose evaluator. The last two rows are correctly classified as failures. The pose
evaluator is learnt under regime B with a CPC threshold of 0.3. Poses in rows 1 and 3 are esti-
mated by Eichner and Ferrari [5], and poses in rows 2 and 4 by Yang and Ramanan [22].
5 Conclusions
Human pose estimation is a base technology that can form the starting point for other
applications, e.g. pose search [10] and action recognition [21]. We have shown that an
evaluator algorithm can be developed for human pose estimation methods where no
confidence score (only a MAP score) is provided, and that it accurately predicts whether the
algorithm has succeeded. A clear application of this work is to use the evaluator to
choose the best result amongst multiple HPE algorithms.
More generally, we have cast self-evaluation as a binary classification problem, using
a threshold on the quality evaluator output to determine successes and failures of the
HPE algorithm. An alternative approach would be to learn an evaluator by regressing
to the quality measure (CPC) of the pose estimate. We could also improve the learning
framework using a non-linear SVM.
We believe that our method has wide applicability. It works for any part-based model
with minimal adaptation, no matter what the parts and their state space are. We have
shown this in the paper, by applying our method to various pose estimators [5,16,2,22]
with different parts and state spaces. A similar methodology to the one given here could
be used to engineer evaluator algorithms for other human pose estimation methods e.g.
using poselets [4], and also for other visual tasks such as object detection (where success
can be measured by an overlap score).
Acknowledgments. We are grateful for financial support from the UKIERI, the ERC
grant VisRec no. 228180 and the Swiss National Science Foundation.
References
1. Aggarwal, G., Biswas, S., Flynn, P.J., Bowyer, K.W.: Predicting performance of face recog-
nition systems: An image characterization approach. In: IEEE Conf. on Comp. Vis. and Pat.
Rec. Workshops, pp. 52–59. IEEE Press (2011)
2. Andriluka, M., Roth, S., Schiele, B.: Pictorial structures revisited: People detection and ar-
ticulated pose estimation. In: IEEE Conf. on Comp. Vis. and Pat. Rec. IEEE Press (2009)
3. Boshra, M., Bhanu, B.: Predicting performance of object recognition. IEEE Transactions on
Pattern Analysis and Machine Intelligence 22(9), 956–969 (2000)
4. Bourdev, L., Malik, J.: Poselets: Body part detectors trained using 3D human pose annota-
tions. In: IEEE Int. Conf. on Comp. Vis., pp. 1365–1372. IEEE Press (2009)
5. Eichner, M., Ferrari, V.: Better appearance models for pictorial structures. In: Cavallaro,
A., Prince, S., Alexander, D. (eds.) Proceedings of the British Machine Vision Conference,
pp. 3:1–3:3. BMVA Press (2009)
6. Eichner, M., Marin, M., Zisserman, A., Ferrari, V.: Articulated human pose estimation and
search in (almost) unconstrained still images. ETH Technical report (2010)
7. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL visual
object classes (VOC) challenge. International Journal of Computer Vision 88(2) (2010)
8. Felzenszwalb, P., McAllester, D., Ramanan, D.: A discriminatively trained, multiscale, de-
formable part model. In: IEEE Conf. on Comp. Vis. and Pat. Rec. IEEE Press (2008)
9. Felzenszwalb, P.F., Huttenlocher, D.P.: Pictorial structures for object recognition. Interna-
tional Journal of Computer Vision 61(1), 55–79 (2005)
10. Ferrari, V., Marin-Jimenez, M., Zisserman, A.: Pose search: Retrieving people using their
pose. In: IEEE Conf. on Comp. Vis. and Pat. Rec. IEEE Press (2009)
11. Ferrari, V., Marin-Jimenez, M., Zisserman, A.: Progressive search space reduction for human
pose estimation. In: IEEE Conf. on Comp. Vis. and Pat. Rec. IEEE Press (2008)
12. Mac Aodha, O., Brostow, G.J., Pollefeys, M.: Segmenting video into classes of algorithm-
suitability. In: IEEE Conf. on Comp. Vis. and Pat. Rec., pp. 1054–1061. IEEE Press (2010)
13. Pishchulin, L., Jain, A., Andriluka, M., Thormählen, T., Schiele, B.: Articulated people de-
tection and pose estimation: Reshaping the future. In: IEEE Conf. on Comp. Vis. and Pat.
Rec., pp. 3178–3185. IEEE Press (2012)
14. Ramanan, D.: Learning to parse images of articulated bodies. In: Schölkopf, B., Platt, J.,
Hoffman, T. (eds.) Advances in Neural Information Processing Systems 19, pp. 1129–1136.
MIT Press, Cambridge (2007)
15. Ronfard, R., Schmid, C., Triggs, B.: Learning to Parse Pictures of People. In: Heyden,
A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002, Part IV. LNCS, vol. 2353,
pp. 700–714. Springer, Heidelberg (2002)
16. Sapp, B., Toshev, A., Taskar, B.: Cascaded Models for Articulated Pose Estimation. In:
Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part II. LNCS, vol. 6312,
pp. 406–420. Springer, Heidelberg (2010)
17. Scheirer, W.J., Bendale, A., Boult, T.E.: Predicting biometric facial recognition failure with
similarity surfaces and support vector machines. In: IEEE Conf. on Comp. Vis. and Pat. Rec.
Workshops, pp. 1–8. IEEE Press (2008)
18. Viola, P., Jones, M.J.: Robust real-time face detection. International Journal of Computer
Vision 57(2), 137–154 (2004)
19. Wang, P., Ji, Q., Wayman, J.L.: Modeling and predicting face recognition system perfor-
mance based on analysis of similarity scores. IEEE Transactions on Pattern Analysis and
Machine Intelligence 29(4), 665–670 (2007)
20. Wang, R., Bhanu, B.: Learning models for predicting recognition performance. In: IEEE Int.
Conf. on Comp. Vis., vol. 2, pp. 1613–1618. IEEE Press (2005)
21. Willems, G., Becker, J.H., Tuytelaars, T., Van Gool, L.: Exemplar-based action recognition in
video. In: Cavallaro, A., Prince, S., Alexander, D. (eds.) Proceedings of the British Machine
Vision Conference, pp. 90.1–90.11. BMVA Press (2009)
22. Yang, Y., Ramanan, D.: Articulated pose estimation with flexible mixtures-of-parts. In: IEEE
Conf. on Comp. Vis. and Pat. Rec. IEEE Press (2011)
Group Tracking: Exploring Mutual Relations
for Multiple Object Tracking
1 Introduction
Visual object tracking is a major component of computer vision, widely applied
in many domains such as video surveillance, driver assistance systems and
human-computer interaction. As the object to be tracked is unknown in
many applications, many researchers adopt online algorithms [1–3] to extract
the object from the background; these mainly face great challenges from
occlusions, changing appearance, varying illumination and abrupt motion. Re-
cently, some authors have utilized context (e.g. feature points [4] and regions with
appearance similar to the target [5]) for robust single object tracking.
Problem Statement: In this paper, we propose to track multiple previously
unseen objects. A direct solution for this problem is to consider objects indi-
vidually and utilize some approaches designed for single object tracking. Such
a solution relies heavily on individual object models (IOMs). Once IOMs be-
come inaccurate, the targets tend to be lost easily. In fact, the mutual relations
among objects are another important kind of information and worth taking into
account. If we have obtained reliable mutual relation models (MRMs) at a given
time, we can use them to predict more accurate locations of objects as shown in
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 129–143, 2012.
© Springer-Verlag Berlin Heidelberg 2012
130 G. Duan et al.
(a) A simple scene. (b) Typical errors of IOMs. (c) Accurate results of IOMs+MRMs.
(d) More challenging scenes. (e) How to learn robust mutual relations?
Fig. 1. Examples of mutual relations. (a,b,c) demonstrate a toy example, where mu-
tual relation models (MRMs) can benefit robust tracking when some individual object
models (IOMs) are inaccurate (for the scene in (a), the reason is occlusions). The
switching error in (b) is due to the similar appearances of the "right eye" and "left eye". As all
objects have rigid-like relations and come from the same object, robust MRMs can
easily be learned for this toy. However, such MRMs are difficult to learn in many other
scenes, such as (d)(top), illustrating objects with abrupt motions, and (d)(bottom),
showing objects which are originally different. We propose to learn MRMs online for
multi-object tracking.
Fig. 1. This visual tracking scenario tracks multiple objects simultaneously and
understands their mutual relations, termed group tracking for simplicity.
Proposal: Our main idea is to model objects in mutual context with each other
and combine both IOMs and MRMs for robust and accurate tracking. The usage
of MRMs needs to answer two crucial questions in the following.
When do MRMs work? The ideal situation is that all IOMs are accurate
enough and MRMs are not necessary. However, when IOMs are inaccurate,
MRMs will play an important role in robust tracking. In fact, it is not easy to
know whether IOMs are accurate enough to ignore MRMs. Moreover, if all IOMs
lose accuracy simultaneously, MRMs become useless, because any estimates
of objects are unreliable at those moments. Therefore, we assume that MRMs
exist all the time and that there are some accurate IOMs during tracking.
How do MRMs work? MRMs provide mainly three kinds of helpful information
for tracking. The first indicates mutually related objects (i.e. objects that impact
each other), termed the relational graph. The second quantifies how much
impact one object has on another, named the mutual relation vectors. The
last balances all interactions of objects and the responses of IOMs, called
the relational weights. When these three kinds of information are properly
determined, MRMs will play an important role in robust tracking.
Contribution: Our main contribution is twofold. (1) We model objects in
mutual context, and propose a unified framework that integrates individual
object models and mutual relation models for tracking multiple previously
unseen objects (Sec. 3). (2) We extend latent SVM to an online version for
learning the relational weights (Sec. 4.1). To the best of our knowledge, our
online algorithm is the first to take advantage of LSVM for tracking problems.
2 Related Work
Object Tracking. Adam et al. [6] represented the target by histograms
of local patches and combined votes of matched local patches using templates
to handle partial occlusions. Since the templates are not updated, it may fail
to handle appearance changes. In order to cope with appearance variations,
Babenko et al. [1] proposed an online multiple instance learning algorithm to
learn a discriminative detector for the target. To avoid drifting over time, Kalal
et al. [7] combined an adaptive Kanade-Lucas-Tomasi feature tracker and several
restrictive learning constraints to establish an incremental classifier. To further
deal with non-rigid motion or large pose variations, Kwon et al. [3] extended
the conventional particle filter framework with multiple motion and observation
models to account for appearance variations caused by changes of pose, illumi-
nation and scale, as well as partial occlusions. Considering that bounding-box
based representations provide a less accurate foreground/background separation,
Godec et al. [2] extended Hough Forests to the online domain and coupled the vot-
ing-based detection and back-projection with a rough segmentation based on
GrabCut to accurately track non-rigid and articulated objects.
Besides these single-object tracking approaches, there are also many ap-
proaches for multi-object tracking, most of which assume that the object
category is known and depend on offline-trained specific object detectors, such
as [8–10]. Breitenstein et al. [8] used detection as their observation model and
integrated pedestrian detectors into a particle filtering framework to track mul-
tiple persons. Huang et al. [9] proposed to associate detected results globally,
assuming that the whole sequence is available in advance. Different from [8, 9],
which consider objects individually, Pellegrini et al. [10] modeled human in-
teractions to some degree and introduced dynamic social behaviors to facilitate
multi-person tracking. Similar to [10], group tracking considers mutual relations
among objects.
Human Pose Estimation. Felzenszwalb et al. [11] utilized Pictorial Structures
(PS) to estimate human poses in static images. Sapp et al. [12] proposed Stretch-
able Models to estimate articulated poses in videos with rich features like color,
shape, contour and motion cues. Since the object is a human, relations between
human parts can be predefined and their distributions can be learned from annotated
human pose datasets in [11, 12]. But these priors do not exist in group tracking,
because the objects are previously unseen.
Structural Learning and Latent SVM. We propose an online latent struc-
tural learning algorithm to update relational weights. Tsochantaridis et al. [13]
optimized a structural SVM objective function, which is a convex optimization
problem widely applied to a variety of problems in computer vision [14, 15].

Fig. 2. An illustration of our problem formulation. We combine IOMs and MRMs for
group tracking, where MRMs consist of relational weights, the relational graph and
mutual relation vectors.

Branson et al. [14] utilized an online structural learning algorithm to train deformable
part-based models interactively. By introducing latent variables, Felzenszwalb et
al. [16] proposed Latent SVM to handle object detection challenges. Ramanan
et al. [15, 17] made some improvements on Latent SVM and demonstrated state-
of-the-art results in object classification and human pose estimation.
3 Problem Formulation
We denote the sequences of observations and object states from frame 1 to
frame T as O1:T = {O1 , . . . , OT } and S1:T = {S1 , . . . , ST }. Observations consist
of image features which can include image patches, corner points or the outputs
of specific generative/discriminative models. We encode the states of M objects
at time t as S_t = {s_{t,m}}_{m=1}^{M}. Each object state is represented by its center
x = [x, y]^T, width w and height h. Thus, we formalize the state of the m-th object
at time t as s_{t,m} = (x_{t,m}, w_{t,m}, h_{t,m}). We can now write the full score associated
with a configuration of object states as

    f(O_t, S_t) = Σ_{p∈V_t} ω_{t,p} · φ_{t,p}(O_t, s_{t,p})
                + Σ_{(p,q)∈E_t} ω_{t,p,q} · φ_{t,p,q}(s_{t,p}, s_{t,q})        (1)

where φ_{t,p} is the response of the individual model for object p, and φ_{t,p,q} is
the mutual relation vector between two objects p and q. ω_{t,p} and ω_{t,p,q} are
weight parameters. G_t = (V_t, E_t) represents the relational graph, whose nodes
are objects and whose edges indicate mutually related objects. For concise descriptions,
we write ω_t = ({ω_{t,p}}; {ω_{t,p,q}}) and Φ(O_t, S_t) = ({φ_{t,p}}; {φ_{t,p,q}}), and thus the
scoring function in Eq. (1) can be rewritten as f(O_t, S_t) = ω_t · Φ(O_t, S_t). This
formulation is illustrated in Fig. 2.
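The scoring function of Eq. (1) amounts to a plain sum over graph nodes and edges; a minimal sketch, where the dictionary keys and vector layout are illustrative assumptions:

```python
def pose_score(omega_unary, phi_unary, omega_pair, phi_pair, edges):
    """Full score of a configuration: unary IOM responses plus pairwise
    mutual-relation terms over the relational graph's edges. omega_* and
    phi_* map node / edge keys to weight and feature vectors."""
    score = sum(
        sum(w * f for w, f in zip(omega_unary[p], phi_unary[p]))
        for p in omega_unary                 # IOM term over nodes V_t
    )
    score += sum(
        sum(w * f for w, f in zip(omega_pair[e], phi_pair[e]))
        for e in edges                       # MRM term over edges E_t
    )
    return score
```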
Individual object models (IOMs) compute the local score of placing one object
at a specific frame location, corresponding to φ_{t,p}. A larger score indicates a
higher probability of the object being at that location. As reviewed in Sec. 2, there
are many algorithms for single object tracking and most of them can be easily
integrated into our framework. Taking MIL [1] as an example: MIL represents
each object by a bag of samples and learns a discriminative classifier online. The
learned MIL classifier can return the confidence that an image patch corresponds
to an object.
Mutual relation models (MRMs) consist of the relational weights, the mutual
relation vectors and the relational graph, corresponding to ω_t, φ_{t,p,q} and G_t
respectively. Firstly, the relational weights balance the two terms in Eq. (1); if the
relational weights {ω_{t,p,q}} are set to zero, Eq. (1) reduces to the case that ignores
mutual relations. Secondly, the mutual relation vectors are calculated between
related objects to capture their interactions. Thirdly, the inference of Eq. (1) is very
different for cyclic and acyclic graphs. In the next section, we will present details
of modeling mutual relations.
Given a set of frames and object configurations {(O_1, S_1), . . . , (O_T, S_T)}, the task
is to learn proper relational weights (w_t) that balance the two terms in Eq.(1). We
cast this task as a maximum margin structured learning problem (Structured
SVM [13]), where we consider mutual relation vectors as latent variables. In
addition, there is no need to model objects from long ago, since their appearances
may differ greatly from the current ones; therefore we only
consider samples in a short time period τ. Similar to [14], we formulate this
problem as a search for the optimal weight w that minimizes the error function F_T(w):

    F_T(w) = (λ/2) ||w||² + Σ_{t = T−τ+1}^{T} l_t(w)                                  (2)

    l_t(w) = max_{S̃_t} [ w · Φ(O_t, S̃_t) − w · Φ(O_t, S_t) + Δ(S_t, S̃_t) ]          (3)
During tracking the frames arrive one by one, an incremental setting similar
to [14]. Thus, a fast online learning algorithm is required.
Optimization. Our learning problem takes the form of a Structured SVM. There
exist many well-tuned solvers, such as the cutting plane solver of SVMstruct [18]
and the Online Stochastic Gradient Descent (OSGD) solver [19]. While the SVMstruct
solver is most commonly used, the OSGD solver has been observed to be faster in practice
[17]. Since speed is important for tracking, we choose the OSGD solver for
our learning problem. When an example (O_t, S_t) arrives, [19] proposed to take
an update step

    w_t = w_{t−1} − (1/(λt)) (λ w_{t−1} + ∇l_t)                                       (4)

where ∇l_t(w) is the sub-gradient of the hinge loss l_t(w), which can be computed
by solving a problem similar to the inference problem:

    ∇l_t(w) = Φ(O_t, S̃_t) − Φ(O_t, S_t)                                              (5)

    S̃_t = arg max_{S̃_t} [ w · Φ(O_t, S̃_t) + Δ(S_t, S̃_t) ]                           (6)
The update in Eq.(4) needs at least one whole iteration for each new sample and
is designed to keep the learning ability on all samples. Because of the characteristics
of our task (objects within a short time period and video consistency), we modify
Eq.(4) to

    w_t = w_{t−1} − min( 1/(t−1), l_{t−1}(w_{t−1}) / ||∇l_{t−1}||² ) ∇l_{t−1}.        (7)

Our update directly combines the weights with the sub-gradient of the hinge loss
at the previous time step. It is non-iterative because w_{t−1}, ∇l_{t−1} and l_{t−1}(w_{t−1}) are
all calculated before time t−1. Note that the coefficient of ∇l_{t−1} in Eq.(7) is
adaptively tuned, which differs from 1/(λt) in Eq.(4). Comparatively, our update is more
suitable for tracking. In addition, there is no need to determine λ explicitly
in this online update.
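A non-iterative update of this kind, a passive-aggressive-style step whose coefficient is clipped by the loss-to-gradient ratio, can be sketched as follows (an illustrative sketch, not the authors' implementation; it assumes the loss value and sub-gradient from time t−1 are cached):

```python
def online_update(w_prev, grad_prev, loss_prev, t):
    """One online weight update: w_t = w_{t-1} - step * grad_{t-1}, where the
    step size is the smaller of 1/(t-1) and loss_{t-1} / ||grad_{t-1}||^2."""
    gn = sum(g * g for g in grad_prev)
    if gn == 0.0:                      # zero sub-gradient: no violation, keep weights
        return list(w_prev)
    step = min(1.0 / (t - 1), loss_prev / gn)
    return [w - step * g for w, g in zip(w_prev, grad_prev)]
```

The clipping keeps the step from overshooting when the sub-gradient is small relative to the loss.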
Computation. The computationally taxing portion of Eq.(7) is calculating the
sub-gradient of l_t(w) via Eq.(6). Sampling locations around the ground truths of
objects as in [14, 17] is too computationally expensive. Thus, we select S̃_t from
Ω_t, where Ω_t contains all configurations inferred by Eq.(1) except the tracked
result; details are described in Sec. 5. Then, one loops over
Ω_t to find the most violated configuration S̃_t. For each candidate S̃_t, one needs
to compute Φ(O_t, S̃_t), containing the scores of IOMs and the mutual relation vectors,
and Δ(S_t, S̃_t), comparing a predicted configuration with the ground truth; both
take O(M) time. Therefore, the sub-gradient can be
calculated in O(M |Ω_t|). Moreover, if some objects are missing (i.e., very small
IOM scores), there is no need to update the weights related to these objects.
two objects. Ideally, more accurate locations of related objects yield higher
scores. Unfortunately, only the first frame is labeled, as stated before. Motivated
by the deformable part based models of [16, 17], we calculate mutual relation vectors
from the changes of relative locations between two objects.

Following the notation in Sec. 3, the relative location of p with respect to q at
time t is z_{t,p,q} = x_{t,p} − x_{t,q}, and the estimated relative location is ẑ_{t,p,q} = x̂_{t,p} − x̂_{t,q},
where x̂_{t,p} and x̂_{t,q} are estimated locations for objects p and q respectively. The
Kalman Filter [20] is adopted to estimate object locations,
assuming linear Gaussian observation and motion models for
each object. Then, we define the mutual relation vectors as φ_{t,p,q}(s_{t,p}, s_{t,q}) =
[dx dy dx² dy²]^T, where [dx dy]^T = z_{t,p,q} − ẑ_{t,p,q}. Although ẑ_{t,p,q} is always
biased from the ground truth, this has no direct impact on the whole
tracking system because mutual relation vectors are taken as latent variables
and relational weights are updated online.
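A sketch of this feature computation (the predicted locations x̂ would come from the per-object Kalman filters; here they are simply passed in):

```python
import numpy as np

def relation_vector(x_p, x_q, xhat_p, xhat_q):
    """phi_{t,p,q} = [dx, dy, dx^2, dy^2]^T, where [dx, dy] is the deviation of the
    observed relative location z = x_p - x_q from the predicted zhat = xhat_p - xhat_q."""
    z = np.asarray(x_p, float) - np.asarray(x_q, float)
    zhat = np.asarray(xhat_p, float) - np.asarray(xhat_q, float)
    dx, dy = z - zhat
    return np.array([dx, dy, dx * dx, dy * dy])
```

When the pair keeps its predicted relative layout, the deviation is zero and the feature vanishes; large deviations grow quadratically through the squared terms.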
5 Inference

The inference of our formulation maximizes Eq.(1) over all involved objects,

    S*_t = arg max_{S_t} f(O_t, S_t).                                                 (9)

Since we restrict the relational graph to be a tree, the inference can be efficiently
solved by dynamic programming. Letting child(p) be the set of children of object
p in G_t, we compute the message that p passes to its parent q as follows:

    score(p) = w_{t,p} · φ_{t,p}(O_t, s_{t,p}) + Σ_{b ∈ child(p)} m_b(p)              (10)
tracked result, the other inferred configurations are stored in Ω_t for learning
relational weights as in Sec. 4.1. Thus, we have |Ω_t| = L − 1, and the complexity of
learning relational weights is O(M |Ω_t|) = O(M L).

So far, we have elaborated our framework for group tracking. For easy
reference, we summarize the whole tracking algorithm in Tab. 1.
6 Experiment

In this section, we carry out experiments to demonstrate the advantages of
MRMs for multi-object tracking, where the tracked objects are not constrained
to belong to a specified class. Besides MRMs, IOMs are also very important for
the whole tracking algorithm. As stated in Sec. 3, many approaches for single
object tracking can be used as IOMs, but in this paper we focus on showing
the performance of MRMs combined with one kind of IOM. In the experiments,
we select MIL [1] as the IOM, which has achieved promising results in [1]. We use
the default parameters of MIL as in [1] and will demonstrate in the following
that the combination of MIL and MRMs further improves tracking performance.
Combining other kinds of IOMs with MRMs is one direction of future
work. All experiments below are conducted on an Intel Core(TM)2 2.33 GHz PC
with 2 GB memory.
Metrics. Center location errors over frames are widely used for
evaluating single object tracking algorithms, with the mean error over all frames
as a summary. However, if a tracker follows a nearby object rather than
the intended one for a long time, location errors fail to capture
tracker performance correctly. Another metric is the overlap criterion, where a result is
counted as positive only if the intersection is larger than 50% of the union when compared to
the object with the same label. Using this metric, we can calculate the detection
rate as TruePos/(FalseNeg + TruePos).
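The overlap criterion and detection rate described above can be sketched as follows (a generic sketch of the metric; boxes are given as (x0, y0, x1, y1)):

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def detection_rate(results, truths, thresh=0.5):
    """TruePos / (TruePos + FalseNeg): a result counts as a true positive only
    if its overlap with the same-label ground truth exceeds the threshold."""
    tp = sum(1 for r, g in zip(results, truths) if iou(r, g) > thresh)
    return tp / len(truths)
```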
Compared Algorithms. To our knowledge, no implementations for group tracking
are publicly available, so we compare our approach with algorithms that
apply IOMs individually. The applied state-of-the-art algorithms
for single object tracking are Online-AdaBoost (OAB) [23], FragTrack [6],
MIL [1], TLD [7] and VTD [3]. We utilize the code provided by the authors
and carefully adjust the parameters of the trackers for fair comparison. We use
the best five results from multiple trials and average the location errors and
detection rates, or directly take results from prior work.
Fig. 3. Qualitative results on the david, face, face2, coke, cliffbar, shaking, animal and
skating1 sequences, from left to right and top to bottom
[Figure: position error (pixel) vs. frame number on the Cliffbar, Shaking, Animal and Skating1 sequences for the compared trackers (FRAG, OAB, etc.).]
MIL is not designed to handle both abrupt motions and severe illumination
changes. Moreover, our detection rate on skating1 is higher than that of VTD.
The reason is that the tracked objects in skating1 move violently from the very
beginning and undergo large variations in pose and scale, which causes many
difficulties for building holistic appearance models.
(a) Average location errors (pixels):

              OAB  Frag  TLD  VTD  MIL  OUR  Δ1
    david      32    46   12   26   23   15    8
    face       39     6   10   10   27   10   17
    face2      21    15   15    9   20   14    6
    coke       25    63    9   45   21   11   10
    cliffbar   14    34    6   12   12   11    1
    shaking    16    62    6    8   34   13   21
    animal     73    91   19   10   35   21   14
    skating1   66    86   10   13   60   48   12

(b) Detection rates:

              OAB   Frag  TLD   VTD   MIL   OUR   Δ2
    david     0.34  0.06  0.59  0.72  0.65  0.96  0.31
    face      0.44  1     1     0.93  0.77  1     0.23
    face2     0.80  0.84  0.85  0.96  0.87  0.96  0.09
    coke      0.18  0.06  0.43  0.06  0.26  0.64  0.38
    cliffbar  0.54  0.22  0.79  0.68  0.71  0.8   0.09
    shaking   0.63  0.13  0.16  0.95  0.49  0.73  0.24
    animal    0.16  0.02  0.72  0.93  0.44  0.48  0.04
    skating1  0.29  0.12  0.1   0.37  0.41  0.56  0.15

Fig. 5. Quantitative results. Comparing our approach with MIL, Δ1 in (a) shows the
reduced mean location errors, and Δ2 in (b) shows the improved detection rates.
[Figure: bar charts comparing MIL and our approaches with relational graphs RG2, RG3 and RGada.]
Fig. 6. Tracking results on Seq.#3 (a,b) and Sign (c,d). (a,c) show the tracked objects in
rectangles and enumerated trees of relational graphs. (b) compares detection rates
on Seq.#3 and (d) compares average location errors on Sign, where RGada denotes
the approach using adaptive relational graphs.
The tracked objects in Seq.#3 are a man, a woman and their entirety, walking
together under frequent illumination changes. As Seq.#3 is fully annotated
with rectangles, we use detection rates to compare our approaches with MIL
in Fig. 6(b). Due to the frequent illumination changes, MIL drifts easily on the man and
the woman. Our approaches, with fixed relational graphs and the adaptive relational
graph, all show significant improvements over MIL on the man; in particular,
the adaptive relational graph improves the detection rate by about 25%.
However, the improvement of the adaptive structure on the woman is slight
(about 3.7%) and RG3 drops the tracking performance slightly, mainly because
the illumination changes on the woman are more severe than those on the man.
The tracked objects in Sign are the head, left hand and right hand, whose motions
are abrupt. Because Sign is sparsely annotated with masks while tracked results
are in fixed ratios, we use average location errors to compare our approaches with
MIL in Fig. 6(d). MIL tracks the left hand well, and our approaches with mutual
relations remain accurate on it. However, due to violent motions and
appearance changes of the right hand, MIL and the fixed relational graphs drift
easily. In contrast, the adaptive relational graph performs more accurately and
always tracks the real objects, as shown in Fig. 7.
[Figure: qualitative frames (objects switched / drifting) and bar charts of detection rates and average location errors for the object groups G1–G6.]
Fig. 7. Tracking results of MIL (top) and OUR (bottom) on frames #1, #11, #24,
#32 and #163 from Sign.
Fig. 8. Examples of the impact of selected objects on tracking. Please see Sec. 6.2.3
for more details.
6.3 Discussion

Speed. The inference is highly efficient in our experiments (less than 10 ms),
while the bottleneck of our implementation is the online learning of IOMs. We
observe that the overall speed, excluding inference, is slightly slower than tracking
objects individually. A partial reason might be that inferred locations are not
discriminative, which causes some difficulties for the online learning of object models.
Comparison with Other Works. Although [8, 9] and our approach all target
multi-object tracking, their focuses differ. Since the tracked
objects are known in [8, 9], those methods are provided with relatively good offline-learned
object models (detectors) and are able to cope with many objects simultaneously.
Thus, [8, 9] focus on handling ID switches, because the offline models cannot
distinguish objects in the same category, and on coping with fragments, because
the offline models cannot cover all kinds of objects in a category. However,
as objects are previously unseen in our problem, we must build models
for the objects online, where great challenges come from changes of the objects and their
surroundings. Therefore, our main concern is to build proper IOMs and MRMs
to avoid drifting and switching errors, particularly when the tracked objects have
similar appearances.
References
1. Babenko, B., Yang, M.H., Belongie, S.: Visual tracking with online multiple instance
learning. In: CVPR (2009)
2. Godec, M., Roth, P.M., Bischof, H.: Hough-based tracking of non-rigid objects. In:
ICCV (2011)
3. Kwon, J., Lee, K.M.: Visual tracking decomposition. In: CVPR (2010)
4. Grabner, H., Matas, J., Gool, L.V., Cattin, P.: Tracking the invisible: Learning
where the object might be. In: CVPR (2010)
5. Dinh, T.B., Vo, N., Medioni, G.: Context tracker: Exploring supporters and
distracters in unconstrained environments. In: CVPR (2011)
6. Adam, A., Rivlin, E., Shimshoni, I.: Robust fragments-based tracking using the
integral histogram. In: CVPR (2006)
7. Kalal, Z., Matas, J., Mikolajczyk, K.: P-N learning: Bootstrapping binary classifiers
by structural constraints. In: CVPR (2010)
8. Breitenstein, M., Reichlin, F., Leibe, B., Koller-Meier, E., Gool, L.V.: Online
multi-person tracking-by-detection from a single, uncalibrated camera. PAMI 33,
1820–1833 (2011)
9. Huang, C., Wu, B., Nevatia, R.: Robust Object Tracking by Hierarchical Association
of Detection Responses. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008,
Part II. LNCS, vol. 5303, pp. 788–801. Springer, Heidelberg (2008)
10. Pellegrini, S., Ess, A., Schindler, K., van Gool, L.: You'll never walk alone: Modeling
social behavior for multi-target tracking. In: ICCV (2009)
11. Felzenszwalb, P., Huttenlocher, D.: Pictorial structures for object recognition.
IJCV 61, 55–79 (2005)
12. Sapp, B., Weiss, D., Taskar, B.: Parsing human motion with stretchable models.
In: CVPR (2011)
13. Tsochantaridis, I., Joachims, T., Hofmann, T., Altun, Y.: Large margin methods
for structured and interdependent output variables. JMLR 6, 1453–1484 (2005)
14. Branson, S., Perona, P., Belongie, S.: Strong supervision from weak annotation:
Interactive training of deformable part models. In: ICCV (2011)
15. Desai, C., Ramanan, D., Fowlkes, C.: Discriminative models for multi-class object
layout. In: ICCV (2009)
16. Felzenszwalb, P., McAllester, D., Ramanan, D.: A discriminatively trained, multiscale,
deformable part model. In: CVPR (2008)
17. Yang, Y., Ramanan, D.: Articulated pose estimation using flexible mixtures of
parts. In: CVPR (2011)
18. Finley, T., Joachims, T.: Training structural SVMs when exact inference is
intractable. In: ICML (2008)
19. Hazan, E., Agarwal, A., Kale, S.: Logarithmic regret algorithms for online convex
optimization. Machine Learning 69, 169–192 (2007)
20. Kalman, R.E.: A new approach to linear filtering and prediction problems. Transactions
of the ASME-Journal of Basic Engineering 82, 35–45 (1960)
21. Ess, A., Leibe, B., Gool, L.V.: Depth and appearance for mobile scene analysis.
In: ICCV (2007)
22. Buehler, P., Everingham, M., Huttenlocher, D., Zisserman, A.: Long term arm and
hand tracking for continuous sign language tv broadcasts. In: BMVC (2008)
23. Grabner, H., Grabner, M., Bischof, H.: Real-time tracking via online boosting. In:
BMVC (2006)
A Discrete Chain Graph Model
for 3d+t Cell Tracking
with High Misdetection Robustness
1 Introduction
One grand challenge of developmental biology is to find the lineage tree of all
cells in a growing organism, i.e. the complete ancestry of each cell [17]. The challenges
encountered include (i) simultaneously tracking an unknown and variable
number of cells; (ii) allowing for cell division; and (iii) achieving high accuracy, because
each tracking error affects a complete subtree of the lineage.
We propose a tracking-by-assignment solution contributing the following novelties:
1. In recent years, graphical models have proved exceptionally successful in
computer vision [4]. We adopt this approach and present the first probabilistic
graphical model for cell tracking that can track an unknown number
of divisible objects which may appear or disappear, and that can cope with
false detections.
2. The model achieves robustness against noisy detections by solving the tracking
problem for all time steps at once, taking the expected time between
divisions into account. The most likely cell lineage is obtained as the model
configuration with maximum a-posteriori (MAP) probability.
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 144–157, 2012.
© Springer-Verlag Berlin Heidelberg 2012
A Discrete Chain Graph Model for 3d+t Cell Tracking 145
3. While all literature on similar data focuses on single organisms, we present
results on zebrafish and, for the first time, on Drosophila, together with gold
standard lineages and benchmarking tools (as supplementary material).

We obtain the exact MAP solution by integer linear programming (ILP) and,
when comparing to the gold standard lineages, obtain results that are superior
to existing methods.
2 Related Work

time step and fed into a tracking routine. Two different kinds of models are
typically used: state space models and assignment models. The Kalman filter, a
linear Gaussian state space model, is the archetype of the former class, which
interprets detections as caused by a hidden target. It has been generalized in
a number of ways, allowing for nonlinear motion models, discrete state spaces,
non-Gaussian distributions [2,7], multiple target hypotheses, or even an unknown
(but fixed) number of targets [8]. While state space models easily accommodate
target properties such as velocity, size, and appearance, and are robust against
noisy detections by design, they are unfortunately hard to coax into dealing with
a variable number of dividing objects.
Tracking by assignment, on the other hand, treats every detection as a potential
target itself. It easily accommodates multiple or dividing objects; however,
this increased flexibility must be reined in by enforcing the consistency (see Section 3)
of a tracking. This difficulty has so far been addressed in three ways.
Firstly, approximate methods can be applied that forego a consistency guarantee [10].
Secondly, tracks can be generated hierarchically from tracklets [3,15,6],
akin to the use of superpixels in image processing. Thirdly, tracking can be performed
across pairs of frames only [11,12,19,16].

Approach [3] is most similar to ours since it is, to our best knowledge, the
only other global model of detections and assignments over many time steps.
Therefore we chose this method for comparison.
t ∈ [1, . . . , T − 1], and Y^(t) is the set of all assignment variables between
steps t and t + 1. A value of 1 expresses the belief that cell candidate j at time
step t + 1 is identical with, or a child of, cell candidate i at time step t, and a
value of 0 says that i and j are unrelated.

In addition, a number of natural consistency constraints are imposed on these
variables by our application in developmental biology. First, each cell must have
at most one predecessor, i.e. Σ_i Y_ij^(t) ≤ 1. Second, each cell must have a unique
fate: if it does not disappear (by leaving the data frame or for other reasons), it
can either move to (be assigned to) a unique cell candidate in the next time
frame, or it can divide and be associated with exactly two cell candidates in
the next time frame. These children, in turn, may not have any other nonzero
incoming assignment variables. Third, there is a biological upper bound on the
distance traveled between time frames, excluding all assignments that are too far
apart. This leads to a significant reduction in the number of required assignment
variables.
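These constraints can be sketched as a feasibility check on a candidate set of active assignments (an illustrative sketch, not the ILP formulation itself; `assign` holds the (i, j) pairs with Y_ij = 1):

```python
import math

def consistent(assign, n_prev, n_next, pos_prev, pos_next, max_dist):
    """Check the consistency constraints on active assignments: each next-frame
    candidate has at most one predecessor, each previous-frame candidate has at
    most two successors (a division), and no assignment may span more than the
    biological distance bound max_dist."""
    for i, j in assign:
        if math.dist(pos_prev[i], pos_next[j]) > max_dist:
            return False                              # distance bound violated
    for j in range(n_next):
        if sum(1 for _, jj in assign if jj == j) > 1:
            return False                              # more than one predecessor
    for i in range(n_prev):
        if sum(1 for ii, _ in assign if ii == i) > 2:
            return False                              # more than two children
    return True
```

In the actual model these conditions become hard (infinite-energy) cases of the assignment factors rather than an explicit post-hoc check.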
These consistency constraints define which configurations are impossible, i.e.
have zero probability. In the next two subsections, we describe how the probabilities
of feasible configurations are defined by connecting the variables into a chain
graph whose factors encode our knowledge about the plausibility of different
trackings as a function of data-dependent properties such as distance traveled,
probability of being a misdetection, etc.
where

    P^(t)(Y^(t) | X^(t), X^(t+1)) = (1/Z^(t)) ∏_{X_i^(t) ∈ X^(t)} φ_i^(t)(X_i^(t), Y_{i·}^(t)) ∏_{X_j^(t+1) ∈ X^(t+1)} ψ_j^(t)(Y_{·j}^(t), X_j^(t+1)),     (2)

and φ_i^(t) and ψ_j^(t) denote the factors for outgoing and incoming transitions
respectively. Note that the CRF partition function Z^(t)(X^(t), X^(t+1)) is only calculated
over the assignment variables, since the detection variables are considered
as given. A factor graph [14] representation of eq. (2) is shown in Fig. 3a. Each of
the maximal cliques of the underlying undirected graph corresponds to a factor
in the CRF. These CRFs will form the supernodes of the chain graph, while the
individual assignment variables are the subnodes.
Analyzing the graph with d-separation, the implicit independence assumptions
of the CRF can be identified: an assignment Y_ij^(t) is conditionally independent of
148 B.X. Kausler et al.
(a) Undirected part of the chain graph. A single assignment CRF is shown as a
factor graph [14]. The dark nodes represent fixed detection variables X (at t and
t+1), and the smaller, light nodes assignment variables Y.
(b) Directed part of the chain graph. The boxes symbolize supernodes containing
assignment CRFs according to Fig. 3a.
all assignments Y_kl^(t) involving different detection variables k ≠ i, l ≠ j, provided
that all incoming assignments Y_{·j}^(t) of detection j and all outgoing assignments
Y_{i·}^(t) of detection i are given:

    Y_ij^(t) ⊥ Y_kl^(t) | Y_{·j}^(t), Y_{i·}^(t),     ∀ l ≠ j, i ≠ k.     (3)

In other words, assignment decisions only influence a small neighborhood of
other assignment variables directly. By fixing enough variables, the CRF can even
decouple into two or more completely independent tracking problems, promoting
the tractability of the chain graph model.
where P_det(X_i^(t)) is the probability that a detection hypothesis should be accepted
into the tracking. The corresponding chain graphical model is sketched
in Fig. 3b. By treating each CRF as a single random variable (supernode)
with as many states as configurations of its assignment variables (subnodes),
d-separation can easily be applied to the directed part of the model. The (unobserved)
supernodes have only incoming edges and block the path between
detection variables, revealing the following conditional independence relations:
Application: Cell Tracking. Applying (5) to the chain graph model equations
(2) and (4) gives the energy function

    E(X, Y) = Σ_{t=1}^{T} Σ_{X_i^(t) ∈ X^(t)} E_det(X_i^(t)) + Σ_{t=1}^{T−1} [ Σ_i E_out(X_i^(t), Y_{i·}^(t)) + Σ_j E_in(Y_{·j}^(t), X_j^(t+1)) ]     (6)

where E_det, E_out, and E_in are the energies corresponding to the factors P_det,
φ_i^(t), and ψ_j^(t) respectively.

The first term allows us to incorporate evidence from the raw data regarding
the presence or absence of a cell at a location where a potential target was detected.
In our experiments, we use a Random Forest [5] to estimate the probability
P_det(X_i^(t)) of a detection being truly a cell, based on features such as the intensity
distribution and volume of each cell candidate found by the detection module.
In keeping with the definition of the Gibbs distribution, we get

    E_det(X_i^(t)) = { −ln P_det(X_i^(t)),        X_i^(t) = 1
                     { −ln (1 − P_det(X_i^(t))),  X_i^(t) = 0     (7)
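In code, this detection energy is simply the negative log-likelihood of the chosen state (a sketch; `p_det` would be the Random Forest's probability output):

```python
import math

def e_det(p_det, x):
    """Gibbs detection energy: -ln P_det for an accepted candidate (x = 1),
    -ln (1 - P_det) for a rejected one (x = 0)."""
    return -math.log(p_det) if x == 1 else -math.log(1.0 - p_det)
```

Confident detections (p_det near 1) make rejection expensive, while dubious ones (p_det near 0) make acceptance expensive.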
The assignment energies E_in and E_out need to embody two kinds of prior knowledge:
firstly, which trackings are consistent; and secondly, amongst all consistent
trackings, which ones are plausible. To enforce consistency, zero probability /
infinite energy is assigned to cells dividing into more than two descendants, cells
arising from the merging of two or more cells from the previous frame, and cell
candidates that are regarded as misdetections.

To penalize implausible trackings, we need to parameterize our model appropriately.
The model equations expose exactly five free parameters (C_init,
C_term, C_opp, w, d̄). C_init imposes a penalty for track initiation (i.e. when a
cell (re-)appears in the data frame) and C_term for track termination (i.e. when
a cell disappears from the data frame or is lost due to cell death). To discourage
trivial solutions where too many objects are explained as misdetections, we
exact an opportunity cost C_opp for the lost opportunity to explain more of the
data. Furthermore, each move is associated with a cost depending on the squared
length of that move. For cell divisions, finally, it is known that the descendants
tend to be localized at an average distance d̄ from the parent cell, and we punish
deviations from that expected distance.
All of the above can be summarized in the following assignment energies:

    E_out(X_i^(t), Y_{i·}^(t)) =
        ∞,                                X_i^(t) = 1, Σ_j Y_ij^(t) > 2    (> 2 children)
        w ((d1 − d̄)² + (d2 − d̄)²) / 2,   X_i^(t) = 1, Σ_j Y_ij^(t) = 2    (division)
        w d²,                             X_i^(t) = 1, Σ_j Y_ij^(t) = 1    (move)
        C_term,                           X_i^(t) = 1, Σ_j Y_ij^(t) = 0    (disappearance)
        C_opp,                            X_i^(t) = 0, Σ_j Y_ij^(t) = 0    (opportunity)
        ∞,                                X_i^(t) = 0, Σ_j Y_ij^(t) > 0    (tracked misdetection)
                                                                           (8)
    E_in(Y_{·j}^(t), X_j^(t+1)) =
        ∞,        X_j^(t+1) = 1, Σ_i Y_ij^(t) > 1    (> 1 parent)
        0,        X_j^(t+1) = 1, Σ_i Y_ij^(t) = 1    (move)
        C_init,   X_j^(t+1) = 1, Σ_i Y_ij^(t) = 0    (appearance)
        0,        X_j^(t+1) = 0, Σ_i Y_ij^(t) = 0    (misdetection, no parent)
        ∞,        X_j^(t+1) = 0, Σ_i Y_ij^(t) > 0    (misdetection with parent)
                                                     (9)
Note that more informative features could be extracted from the raw data: for
instance, besides the length of a move, one could consider the similarity of the
associated cell candidates to obtain an improved estimate of their compatibility.
However, it would then no longer be possible to select appropriate parameter
values with a simple grid search. Instead, a proper parameter learning strategy
will be needed, which will be a subject of our future research.
4 Experiments

We applied the chain graph model to track cells during early embryogenesis of
zebrafish (Danio rerio) and fruit fly (Drosophila): the GFP-stained nuclei were
imaged with different light sheet fluorescence microscopes.

and divisions between two time slices against a manually obtained tracking on the
first 25 time slices. We use the segmentation result of [16] for a fair comparison
of the two tracking methods.
4.3 Implementation

We implemented the chain graph model in C++ using the OpenGM [1] library
and wrote a MAP inference backend based on CPLEX.³

Parameter values are set according to an optimal f-measure of successfully
reconstructed move, division, appearance, and disappearance events relative to
a manually obtained tracking. We initialize the parameters with a reasonable
setting and fine-tune them with a search on a grid in parameter space.

³ A commercial (integer) linear programming solver. Free for academic use.
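The grid search itself is straightforward (a generic sketch; `evaluate` would run the tracker with the given parameters and return the event f-measure):

```python
from itertools import product

def grid_search(evaluate, grids):
    """Exhaustive search over a parameter grid: grids maps a parameter name to
    its candidate values; returns the setting with the highest f-measure."""
    names = list(grids)
    best, best_f = None, float("-inf")
    for combo in product(*(grids[n] for n in names)):
        params = dict(zip(names, combo))
        f = evaluate(params)
        if f > best_f:
            best, best_f = params, f
    return best, best_f
```

With only five free parameters, even a coarse grid already covers the space well, which is why a learning-free search suffices here.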
4.4 Results

We perform experiments on two datasets. First, we compare our tracker against
a previously published method [16] on the zebrafish dataset. The results are
summarized in Table 1. Second, we investigate the performance of the model on
the Drosophila dataset, comparing a chain graph having fixed detection variables
with an unconditioned chain graph and a chain graph with 4-state detection
variables enforcing a minimal duration of 3 time steps between division events
of a particular track. Additionally, we put the chain graph model in perspective
by comparing with a state-of-the-art cell tracking method by Bise et al. [3]. The
tracking results are presented in Table 2.

Finally, Table 3 shows the Drosophila results in terms of correctly and incorrectly
identified cells for both variants of the chain graph, in contrast to human
annotations and the Random Forest classifier alone (which was used to fix the
variables).
5 Discussion

The MAP solution of the chain graph is obtained from all time steps at once,
considering properties of complete tracks. In particular, a track needs a minimal
number of active transitions to overcome the track initialization and termination
costs together with the cost for misdetections. Cells that are only briefly
present, due to low signal-to-noise ratio or due to entering/leaving the data frame,
will consequently be labeled as misdetections. On the other hand, cells that were
badly segmented and were classified as misdetections by the Random Forest may
be set active as long as they reasonably continue a track (see Fig. 6 for an illustration
of this behavior). The chain graph model can therefore be seen as a
regularizer on the pure classifier output that trades some true positives (decreasing
recall) against a significantly reduced number of false positives (increasing
precision).

The above interpretation is supported by the results. Compared to the performance
of method [16] on the zebrafish dataset, our method loses 1.1 percentage
points (pp) in recall but gains 9 pp in precision, leading to an overall better performance.
More insight into this precision-recall trade-off can be gained by looking
at the cell vs. misdetection identification performance. Compared to the classifier
alone, the (unconditioned) chain graph retrieves 61 fewer cells (fewer true positives),
but labels 323 more misdetections correctly (more true negatives). (The chain
graph controlling for minimal cell cycle length exhibits the same behavior with
nonsignificant differences in performance.)
Since correct cell identifications are a precondition for successful recovery of
move, division, appearance, and disappearance events, the higher object identification
rate should transfer directly to the tracking performance. The results on
the zebrafish support this assumption, but could also be caused by the
fact that our method employs a Random Forest classifier whereas the method of
[16] simply treats every segmented object as a cell. To exclude the influence of the
Random Forest, we examined another variant of the chain graph model, where we
fix the detection variables in a preprocessing step using the very same Random
Forest predictions. Compared to the chain graph with previously fixed detection
variables, the full (unconditioned) model gains 2.4 pp and 4 pp in terms of recall
and precision, respectively. This is clear evidence that the performance gain over
the previously published step-by-step method is not only caused by the Random
Forest classifier, but also by the holistic chain graph model that is optimized
over many time steps at once.

Further improvements can be obtained by requiring a minimal time between
two divisions. Table 2 shows an increase in f-measure by 0.3 pp for a minimal
temporal distance of three time steps compared to the unconditioned model. At
first glance the gain seems nonsignificant, but the f-measure puts the same
importance on every type of tracking event. Since the number of move events
exceeds the number of divisions by a factor of 20, the overall f-measure is dominated
by the former. Yet, the f-measure for divisions (exclusively) improves from 0.883 (precision
0.941, recall 0.832) to 0.917 (precision 0.923, recall 0.911). This is a significant
gain and very important for an accurate assessment of cell ancestry, since a single
mistracked division can spoil the whole subsequent lineage. In practice, setting
the minimal temporal distance to a low value already gives a boost in performance
with an acceptable overhead. The effect of varying this parameter on the overall and
the division performance is depicted in Fig. 5b. The method of Bise et al. [3] shows convincing
results in regions with high data quality (cf. Fig. 2). However, its f-measure
is 33 pp worse than that of the chain graph. This is most likely caused by the
high misdetection rate of 13% and can be seen from the low precision of 0.550
and Fig. 2.
Still, the performance of our approach could be further improved. A manual
inspection of incorrectly tracked divisions reveals that the tracker can be confused
by divisions happening in close local proximity. A better model for divisions,
one that also incorporates the change in shape and the geometric motion pattern
(descendants move away from the ancestor in roughly opposite directions), could
prevent such mistakes. Future datasets may contain cells that move in a more
coordinated fashion. The presented move energies are optimal in the case of Brownian
motion, but will most likely underperform for directed motion. In that case the
model can be extended with discrete momentum variables associated
with the detected objects, analogous to our extension that controls the minimal
cell cycle length. Finally, the model can easily be extended to consider merging
objects (due to undersegmentation or in 2d applications) by allowing more than
one active incoming variable.
156 B.X. Kausler et al.
Fig. 6. Sample tracking sequence obtained with the chain graph model. The eight consecutive time steps show a detail of the Drosophila dataset projected to 2d. The colored spots are segmented objects, where objects with the same color are assigned to each other. Dark gray objects are misdetections, as indicated by inactive detection variables. Ill-shaped objects will be treated as active if they bridge two tracks in a sensible manner. Conversely, cell-shaped objects may be labeled inactive if they do not constitute a track of a certain length.
6 Conclusion
We presented a chain graph model for tracking by assignment over many time slices simultaneously. This model is highly robust against misdetections and can incorporate temporal domain knowledge. It has been applied in the context of 3d+t cell tracking and achieved better performance than two recently published methods [16,3]. Furthermore, it showed that globally optimal tracking by assignment over many time slices is superior to the popular time-step-wise approach. The performance could be further improved by introducing better motion models for cell movements and divisions.
Acknowledgments. This work has been supported by the German Research Foundation (DFG) within the programme "Spatio-/Temporal Graphical Models and Applications in Image Analysis", grant GRK 1653.
References
1. Andres, B., Kappes, J.H., Köthe, U., Schnörr, C., Hamprecht, F.A.: An Empirical Comparison of Inference Algorithms for Graphical Models with Higher Order Factors Using OpenGM. In: Goesele, M., Roth, S., Kuijper, A., Schiele, B., Schindler, K. (eds.) DAGM 2010. LNCS, vol. 6376, pp. 353–362. Springer, Heidelberg (2010)
2. Arulampalam, M., Maskell, S., Gordon, N., Clapp, T.: A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing 50(2), 174–188 (2002)
3. Bise, R., Yin, Z., Kanade, T.: Reliable cell tracking by global data association. In:
IEEE International Symposium on Biomedical Imaging, ISBI 2011 (2011)
4. Blake, A., Kohli, P., Rother, C. (eds.): Markov Random Fields for Vision and Image
Processing. MIT Press (2011)
1 Introduction
Visual tracking is an important topic in computer vision, and has a wide range
of practical applications including video surveillance, perceptual user interfaces,
and video indexing. Tracking generic objects in a dynamic environment poses additional challenges, including appearance changes due to illumination, rotation, and scaling, and problems of partial vs. full visibility. In all cases the fundamental problem is the variable nature of appearance, and the difficulty of using such a changeable feature to distinguish the target from the background.
A wide variety of approaches have been developed for modelling appearance variations of targets within robust tracking. Many tracking algorithms attempt to build robust object representations in a particular feature space, and search for the image region which best corresponds to that representation. Such approaches include the superpixel tracker [1], the integral histogram [2], subspace learning
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 158–172, 2012.
© Springer-Verlag Berlin Heidelberg 2012
Robust Tracking with Weighted Online Structured Learning 159
[3], visual tracking decomposition [4], interest point tracking [5], and sparse representation tracking [6,7], amongst others.
An alternative approach has been to focus on learning, and particularly on identifying robust classifiers capable of distinguishing the target from the background, which has been termed tracking-by-detection. This approach encompasses boosting [8,9], ensemble learning [10], multiple instance learning [11], discriminative metric learning [12], graph mode-based tracking [13], structured output tracking [14], and many more. One of the features of these approaches is the ability to update the appearance model online and thus adapt to appearance changes.
Despite their successes, existing online tracking-by-detection methods have two main issues. First, these trackers typically rely solely on an online classifier for adapting the object appearance model, and thus completely discard past training data. This makes it almost impossible to recover once the appearance model begins to drift. Second, as tracking is clearly a time-varying process, more recent frames should have a greater influence on the current tracking process than older ones. Existing approaches to online learning of appearance models do not take this into account, but rather either include images in the model fully, or discard them totally.
from the background, but it requires that all appearance variations be represented within the training data. Since video frames are often obtained on the fly, features are extracted in an online fashion [9]. For the same reason, online learning has been brought into tracking. For example, Grabner et al. [9] proposed to update selected features incrementally using online AdaBoost. Babenko et al. [11] used online multiple instance boosting to overcome the problem of ambiguous labeling of training examples. Grabner et al. [8] combined the decisions of a given prior and an online classifier, and formulated them in a semi-supervised fashion.
Structured learning has shown superior performance on a number of computer vision problems, including tracking. For example, in contrast to binary classification, Blaschko et al. [16] considered object localisation as a problem of predicting structured outputs: they modeled the problem as the prediction of the bounding box of objects located in images. The degree of bounding box overlap with the ground truth is maximised using a structured SVM [17]. Hare et al. [14] further applied structured learning to visual object tracking. Their approach, termed Struck, directly predicts the change in object location between frames using structured learning. Since Struck is closely related to our work, we provide more details here.
Struck: Structured output tracking with kernels. In [14] Hare et al. used a structured SVM [17] to learn a function $f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ such that, given the $t$-th frame $x_t \in \mathcal{X}$ and the bounding box $B_{t-1}$ of the object in the $(t-1)$-th frame, the bounding box $B_t$ in the $t$-th frame can be obtained via
$$B_t = B_{t-1} + \operatorname*{argmax}_{y \in \mathcal{Y}} f(x_t, y). \quad (1)$$
Here $B = (c, l, w, h)$, where $c, l$ are the column and row coordinates of the upper-left corner, and $w, h$ are the width and height. The offset is $y = (\Delta c, \Delta l, \Delta w, \Delta h)$. Hare et al. set $\Delta w$ and $\Delta h$ to always be 0, and $f(x, y) = \langle w, \Phi(x, y)\rangle$ for the feature map $\Phi(x, y)$.
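The prediction step (1) amounts to scoring candidate translation offsets with $f$ and moving the box to the best one. A hedged sketch of that step, with the scoring function `f` and the sampling scheme as illustrative stand-ins, not the implementation of [14] (width/height offsets are fixed to 0, as described above):

```python
import random

def predict_offset(f, frame, box, n_samples=64, radius=30, rng=random):
    """Move `box` = (c, l, w, h) to the translation offset maximising f.

    f(frame, box, (dc, dl)) -> score; the zero offset is always among the
    candidates, so the returned box never scores worse than staying put.
    Illustrative sketch only, not the authors' implementation.
    """
    candidates = [(0, 0)] + [
        (rng.randint(-radius, radius), rng.randint(-radius, radius))
        for _ in range(n_samples)
    ]
    dc, dl = max(candidates, key=lambda dy: f(frame, box, dy))
    c, l, w, h = box
    return (c + dc, l + dl, w, h)
```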
The discriminative function $f$ is learned via the structured SVM [17] by minimising
$$\min_{w} \; \frac{1}{2}\|w\|^2 + C \sum_{t=1}^{T} \xi_t \quad (2)$$
$$\text{s.t.} \;\; \forall t : \xi_t \ge 0; \quad \forall t,\, \forall y \ne y_t : \langle w, \Phi(x_t, y_t)\rangle - \langle w, \Phi(x_t, y)\rangle \ge \Delta(y_t, y) - \xi_t,$$
where the label loss $\Delta(y_t, y)$, introduced in [16], measures the VOC bounding box overlap ratio.
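The constraints of (2) imply a per-frame slack value: the worst margin violation over the candidate labels, clipped at zero. A small sketch, with `score(y)` standing in for $\langle w, \Phi(x_t, y)\rangle$ and `delta` for the label loss $\Delta$ (the function names are ours):

```python
def structured_slack(score, delta, y_true, candidates):
    """Slack xi_t implied by the constraints of problem (2):

    xi_t = max over y != y_true of
           delta(y_true, y) - (score(y_true) - score(y)),
    clipped at 0. A constraint is satisfied exactly when this is <= 0.
    """
    worst = max(
        delta(y_true, y) - (score(y_true) - score(y))
        for y in candidates if y != y_true
    )
    return max(0.0, worst)
```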
In practice, the offsets $y$ are uniformly sampled in the vicinity of $y_t$ in [14]. To enable kernels, Hare et al. updated the dual variables online following the work of [16,18,19] and achieved excellent results. Note that prior to the work of [14,16],
where $\Omega(w)$ is the regulariser (e.g. $\Omega(w) = \|w\|^2/2$) that penalises overly complicated models, and $C$ is a scalar. Here $\ell(\cdot, \cdot)$ is a loss function such as the hinge loss, and $w$ is the primal parameter.
Online learning algorithms, by definition, use one datum or a very small number of recent data to learn at each iteration. Thus at iteration $i$, the online algorithm minimises the regularised risk $J(i, w)$. The regret with respect to a fixed comparator $w$ is
$$Q(m, w) = \frac{1}{m}\sum_{i=1}^{m} \ell(w_i, z_i) - \frac{1}{m}\sum_{i=1}^{m} \ell(w, z_i).$$
The regret reflects the amount of additional loss (compared with a batch algorithm) suffered by the online algorithm from not being able to see the entire data set in advance. It has been shown in [20,19] that minimising the risk $J(i, w)$ at each iteration reduces the regret $Q(m, w)$, and that the regret goes to zero as $m$ goes to infinity for a bounded $\Omega(w)$.
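The regret $Q(m, w)$ above is simply a difference of average losses between the online iterates and the fixed comparator; a minimal sketch (the function name is ours):

```python
def regret(online_losses, comparator_losses):
    """Average regret Q(m, w): mean per-iteration loss of the online
    iterates w_i minus the mean loss of a fixed comparator w on the
    same data z_1..z_m."""
    m = len(online_losses)
    assert m == len(comparator_losses)
    return sum(online_losses) / m - sum(comparator_losses) / m
```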
One of the properties of weighted reservoir sampling is that data (i.e. $\{z_1, \ldots, z_i\}$) with high weights have a higher probability of being maintained in the reservoir. The advantage of using (6) is that the weighted risk in (4) is now replaced with an unweighted (i.e. uniformly weighted) risk, to which traditional online algorithms are applicable. The (sub)gradient of the primal form may be used to update the parameter $w_i$. Learning in the dual form can be achieved by performing OLaRank [19] on the elements of the reservoir. Details are provided in Section 3.1.
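Weighted reservoir sampling itself can be sketched with the key trick of [21]: draw $u \sim U(0,1)$ per item and keep the $N$ items with the largest keys $u^{1/\text{weight}}$. This is an illustrative implementation, not the authors' code:

```python
import heapq
import random

def weighted_reservoir(stream, capacity, rng=random):
    """Weighted reservoir sampling (Efraimidis & Spirakis [21]).

    Each (value, weight) item gets key u**(1/weight), u ~ U(0,1); the
    `capacity` items with the largest keys form the reservoir, so heavier
    items survive with higher probability.
    """
    heap = []  # min-heap of (key, value); the root is the weakest survivor
    for value, weight in stream:
        key = rng.random() ** (1.0 / weight)
        if len(heap) < capacity:
            heapq.heappush(heap, (key, value))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, value))
    return [value for _, value in heap]
```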
$$\frac{1}{m}\sum_{i=1}^{m} \mathbb{E}\, R_i(w_{i_j}, b_{i_j}) \;\le\; \frac{1}{m}\sum_{i=1}^{m} \frac{1}{N}\sum_{j=1}^{N} \ell_j(w, z_j) + \frac{\lambda_i\, \Omega(w)}{Z_i} + \frac{C}{mN^2},$$
164 R. Yao et al.
Fig. 1. An illustration of greedy inference. LEFT: a contour plot of $f$ in (8) for varying $y_t$. The central blue solid square indicates the position of the object in the previous frame. The centre point and some random points (purple) are used as the initial positions. The $y_t$ with the highest $f$ is obtained (green solid point) via Algorithm 2. RIGHT: The contour plot and true bounding box shown over the relevant frame of the video.
where $Z_i = \sum_{j=1}^{i} \lambda_j$. The proof is provided in the supplementary material.
3.1 Kernelisation
Updating the dual variables enables us to use kernels, which often produce superior results in many computer vision tasks. Using a similar derivation to that introduced by [18], we rewrite the optimisation problem of (2) by introducing coefficients $\beta_i^y$ for each $(x_i, y_i, \lambda_i) \in R_t$, where $\beta_i^y = -\alpha_i^y$ if $y \ne y_i$ and $\beta_i^y = \sum_{\bar{y} \ne y_i} \alpha_i^{\bar{y}}$ otherwise, and denote by $\beta$ the vector of the dual variables:
$$\max_{\beta} \; -\sum_{i,y} \Delta(y, y_i)\,\beta_i^y - \frac{1}{2}\sum_{i,y,j,\bar{y}} \beta_i^y \beta_j^{\bar{y}}\, k\bigl((x_i, y), (x_j, \bar{y})\bigr) \quad (7)$$
$$\text{s.t.} \;\; \forall i, y : \beta_i^y \le \delta(y, y_i)\,C; \quad \forall i : \sum_{y} \beta_i^y = 0,$$
where $\delta(y, y_i)$ is 1 when $y = y_i$ and 0 otherwise, and $k(\cdot, \cdot)$ is the kernel function. The discriminant scoring function becomes
$$f(x, y) = \sum_{i, \bar{y}} \beta_i^{\bar{y}}\, k\bigl((x_i, \bar{y}), (x, y)\bigr). \quad (8)$$
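Equation (8) is a plain kernel expansion over the support patterns; a sketch with a Gaussian kernel and a toy feature map, both of which are illustrative choices rather than the features used in [14]:

```python
import math

def gaussian_kernel(phi_a, phi_b, sigma=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) on feature vectors."""
    d2 = sum((a - b) ** 2 for a, b in zip(phi_a, phi_b))
    return math.exp(-d2 / (2 * sigma ** 2))

def score(beta, support, phi, x, y):
    """Discriminant function of Eq. (8): a beta-weighted sum of kernel
    evaluations between the candidate (x, y) and the support patterns.

    `support` holds (x_i, y_bar) pairs with dual coefficients `beta`;
    `phi` maps a (frame, box) pair to a feature vector. Illustrative only.
    """
    q = phi(x, y)
    return sum(b * gaussian_kernel(phi(xi, yi), q)
               for b, (xi, yi) in zip(beta, support))
```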
As in [14], at the $t$-th frame, $n_t$ candidate offsets $y_t^i$ are generated. We update the reservoir and perform OLaRank on the elements of the reservoir. Pseudo-code for the online update of the dual variables and the reservoir is provided in Algorithm 1. By Theorem 1 we know the bound below holds (for the proof see the supplementary material).
4 Experiments
Fig. 2. Performance of our tracker with various reservoir sizes on four sequences (coke, tiger1, iceball, bird). LEFT and MIDDLE: the average VOR and CLE. RIGHT: the run time. The square, diamond, upward-pointing and downward-pointing triangles in the three plots correspond to the performance of Struck on the four sequences, respectively.
Fig. 3. Comparison of structured SVM scores and tracking results under various scenes. TOP ROW: the tracking results of our tracker with reservoir size N = 200 and of Struck, where the solid rectangle shows the result of our tracker and the dashed rectangle marks the result of Struck. BOTTOM ROW: structured SVM score over time for our tracker and Struck. Ours always has a higher score.
To show the effect of the reservoir size we use ten different reservoir sizes N, from 100 to 1000. We compute the average CLE, VOR and CPU time for each sequence with each reservoir size, and plot the results in Fig. 2. We see that for 100 ≤ N ≤ 200, increasing the reservoir size N improves the CLE and VOR significantly. However, for N > 200, CLE and VOR do not change much. Meanwhile, the CPU run time also increases with the reservoir size, as expected. The result of Struck is also shown in Fig. 2.
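The two metrics plotted above can be computed as follows for boxes given as (col, row, width, height); this is the standard formulation of both measures, sketched by us rather than taken from the paper's evaluation code:

```python
import math

def voc_overlap(box_a, box_b):
    """VOC overlap ratio (VOR): area(A ∩ B) / area(A ∪ B)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (aw * ah + bw * bh - inter)

def centre_location_error(box_a, box_b):
    """CLE: Euclidean distance between the box centres (in pixels)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    return math.hypot((ax + aw / 2) - (bx + bw / 2),
                      (ay + ah / 2) - (by + bh / 2))
```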
Sequence Ours Struck MIL OAB1 OAB5 Frag IVT VTD L1T
Coke 0.69±0.15 0.54±0.17 0.35±0.22 0.18±0.18 0.17±0.16 0.13±0.16 0.08±0.22 0.09±0.22 0.05±0.20
Tiger1 0.75±0.13 0.66±0.22 0.61±0.18 0.44±0.23 0.24±0.26 0.21±0.29 0.02±0.12 0.10±0.24 0.45±0.37
Tiger2 0.68±0.15 0.57±0.18 0.63±0.14 0.37±0.25 0.18±0.20 0.15±0.23 0.02±0.12 0.20±0.23 0.27±0.32
Sylv 0.67±0.11 0.68±0.14 0.67±0.18 0.47±0.38 0.12±0.12 0.60±0.23 0.06±0.16 0.58±0.31 0.45±0.35
Iceball 0.69±0.18 0.50±0.33 0.37±0.28 0.08±0.22 0.40±0.30 0.52±0.31 0.03±0.12 0.56±0.28 0.03±0.10
Basketball 0.72±0.14 0.45±0.34 0.15±0.27 0.12±0.23 0.12±0.24 0.46±0.35 0.02±0.07 0.64±0.10 0.03±0.10
Shaking 0.72±0.14 0.10±0.18 0.60±0.26 0.56±0.28 0.50±0.21 0.33±0.27 0.03±0.11 0.69±0.14 0.05±0.16
Singer1 0.71±0.08 0.33±0.36 0.18±0.32 0.17±0.32 0.07±0.19 0.13±0.26 0.20±0.28 0.49±0.20 0.18±0.32
Bird 0.75±0.10 0.55±0.25 0.58±0.32 0.56±0.28 0.59±0.30 0.34±0.32 0.08±0.19 0.10±0.26 0.44±0.37
Box 0.74±0.17 0.29±0.38 0.10±0.18 0.16±0.27 0.08±0.18 0.06±0.17 0.46±0.29 0.41±0.34 0.13±0.26
Board 0.69±0.25 0.39±0.29 0.16±0.27 0.12±0.24 0.17±0.26 0.49±0.25 0.10±0.21 0.31±0.27 0.11±0.27
Walk 0.63±0.18 0.46±0.32 0.50±0.34 0.52±0.36 0.49±0.34 0.08±0.25 0.05±0.14 0.08±0.23 0.09±0.25
Average 0.71 0.46 0.41 0.31 0.26 0.29 0.10 0.35 0.19
Fig. 5. Visual tracking result. Yellow, blue, green, magenta and red rectangles show
results by our tracker, Struck, MIL, Frag and VTD methods, respectively.
Pose and Scale Changes. As can be seen in the first row of Fig. 5, our algorithm exhibits the best tracking performance amongst the group. In the box sequence, the other four trackers cannot keep track of the object after rotation, as they are not effective in accounting for appearance change due to large pose change. In the basketball sequence, the Struck, MIL and Frag methods lose the object completely after frame 476 due to the rapid appearance change of the object. The VTD tracker achieves an average error similar to ours on the basketball sequence, but its VOC overlap ratio is worse (see Tab. 2 and Tab. 1).
Illumination Variation and Occlusion. The second row of Fig. 5 illustrates performance under heavy lighting and occlusion scenes. The MIL and Frag methods drift to a background area in early frames of the singer1 sequence, as they are not designed to handle full occlusions. Although Struck maintains the track under strong illumination (frame 111 of the singer1 sequence), it loses the object completely after the illumination changes. As can be seen in the right image of Fig. 3, frequent occlusion also affects the performance of Struck. In contrast, our tracker adapts to the appearance changes, and achieves robust tracking results under illumination variation and partial or full occlusions.
Sequence Ours Struck MIL OAB1 OAB5 Frag IVT VTD L1T
Coke 5.0±4.5 8.3±5.6 17.5±9.5 25.2±15.4 56.7±30.0 61.5±32.4 10.5±21.9 47.3±21.9 55.3±22.4
Tiger1 5.1±3.6 8.6±12.1 8.3±6.6 17.6±16.9 39.2±32.8 40.1±25.6 56.3±20.6 68.7±36.2 23.2±30.6
Tiger2 6.1±3.7 8.5±6.0 7.0±3.7 19.7±15.2 37.3±27.0 38.7±25.1 92.7±34.7 37.0±29.7 33.1±28.1
Sylv 8.8±3.5 8.4±5.3 8.7±6.4 33.0±36.6 75.3±35.3 12.6±11.5 73.9±37.6 21.6±35.9 21.7±32.6
Iceball 4.9±3.6 15.6±22.1 61.6±85.6 97.7±53.4 58.1±84.0 40.3±72.9 90.2±67.5 13.5±26.0 77.1±50.0
Basketball 9.6±6.2 89.6±163.7 163.9±104.0 167.8±100.6 199.5±99.9 53.9±59.8 31.5±36.7 9.8±5.0 120.1±66.2
Shaking 9.2±5.7 123.1±54.3 37.8±75.6 26.9±49.2 29.1±48.7 47.1±40.5 94.3±36.8 12.5±6.7 132.8±56.9
Singer1 5.1±1.8 29.2±23.6 187.9±120.1 189.4±114.3 158.0±48.6 172.5±94.5 17.3±13.0 10.5±7.5 108.5±78.5
Bird 8.6±4.1 20.7±14.7 49.0±85.3 47.9±87.7 48.6±86.2 50.0±43.3 107.7±57.6 143.9±79.3 46.3±49.3
Box 13.4±15.7 162.2±132.8 140.9±77.0 153.5±95.8 165.4±84.1 147.0±67.8 34.7±43.8 63.5±65.7 150.0±82.7
Board 32.9±32.5 85.3±56.5 211.7±124.8 216.5±111.6 213.7±96.1 65.9±48.6 223.8±97.5 112.3±66.9 240.5±96.2
Walk 11.3±7.7 36.8±46.4 33.8±47.3 35.5±49.0 36.9±48.5 102.6±46.1 90.2±56.5 100.9±46.9 79.6±41.3
Average 9.9 49.6 77.3 85.8 93.1 69.3 76.9 53.4 90.6
Sequence Ours Struck MIL OAB1 OAB5 Frag IVT VTD L1T
Coke 0.95 0.71 0.27 0.05 0.03 0.06 0.08 0.08 0.05
Tiger1 0.96 0.82 0.83 0.41 0.18 0.19 0.01 0.13 0.47
Tiger2 0.85 0.66 0.81 0.34 0.12 0.09 0.01 0.15 0.30
Sylv 0.93 0.90 0.76 0.52 0.01 0.69 0.03 0.67 0.53
Iceball 0.86 0.61 0.30 0.09 0.42 0.67 0.02 0.76 0.02
Basketball 0.90 0.61 0.16 0.13 0.13 0.56 0.01 0.90 0.02
Shaking 0.96 0.07 0.78 0.55 0.46 0.29 0.01 0.86 0.02
Singer1 0.99 0.39 0.21 0.21 0.07 0.17 0.18 0.63 0.26
Bird 0.97 0.53 0.73 0.74 0.75 0.34 0.06 0.14 0.48
Box 0.92 0.36 0.06 0.19 0.07 0.04 0.55 0.48 0.15
Board 0.70 0.29 0.13 0.08 0.09 0.45 0.09 0.23 0.12
Walk 0.76 0.58 0.67 0.68 0.62 0.10 0.03 0.08 0.10
Average 0.90 0.54 0.48 0.33 0.25 0.30 0.09 0.43 0.21
5 Conclusion
We have developed a novel approach to online tracking which allows continuous
development of the target appearance model, without neglecting information
gained from previous observations. The method is extremely exible, in that
the weights between observations are freely adjustable, and it is applicable to
binary, multiclass, and even structured output labels in both linear and non-
linear kernels. In formulating the method we also developed a novel principled
1
The demonstration videos are available at http://www.youtube.com/woltracker12.
References
1. Wang, S., Lu, H., Yang, F., Yang, M.H.: Superpixel tracking. In: Proc. ICCV, pp. 1323–1330 (2011)
2. Adam, A., Rivlin, E., Shimshoni, I.: Robust fragments-based tracking using the integral histogram. In: Proc. CVPR, vol. 1, pp. 798–805 (2006)
3. Ross, D.A., Lim, J., Lin, R.S., Yang, M.H.: Incremental learning for robust visual tracking. Int. J. Comput. Vision 77, 125–141 (2008)
4. Kwon, J., Lee, K.M.: Visual tracking decomposition. In: Proc. CVPR (2010)
5. García Cifuentes, C., Sturzel, M., Jurie, F., Brostow, G.J.: Motion models that only work sometimes. In: Proc. BMVC (2012)
6. Mei, X., Ling, H.: Robust visual tracking and vehicle classification via sparse representation. IEEE Trans. on PAMI 33, 2259–2272 (2011)
7. Li, H., Shen, C., Shi, Q.: Real-time visual tracking using compressive sensing. In: Proc. CVPR, pp. 1305–1312 (2011)
8. Grabner, H., Leistner, C., Bischof, H.: Semi-supervised On-Line Boosting for Robust Tracking. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 234–247. Springer, Heidelberg (2008)
9. Grabner, H., Grabner, M., Bischof, H.: Real-time tracking via on-line boosting. In: Proc. BMVC, vol. 1, pp. 47–56 (2006)
10. Avidan, S.: Ensemble tracking. IEEE Trans. on PAMI 29, 261–271 (2007)
11. Babenko, B., Yang, M.H., Belongie, S.: Visual Tracking with Online Multiple Instance Learning. In: Proc. CVPR, pp. 983–990 (2009)
12. Li, X., Shen, C., Shi, Q., Dick, A.R., van den Hengel, A.: Non-sparse linear representations for visual tracking with online reservoir metric learning. In: Proc. CVPR (2012)
13. Li, X., Dick, A.R., Wang, H., Shen, C., van den Hengel, A.: Graph mode-based contextual kernels for robust SVM tracking. In: Proc. ICCV, pp. 1156–1163 (2011)
14. Hare, S., Saffari, A., Torr, P.: Struck: Structured output tracking with kernels. In: Proc. ICCV, pp. 263–270 (2011)
15. Avidan, S.: Support vector tracking. IEEE Trans. on PAMI 26(8), 1064–1072 (2004)
16. Blaschko, M.B., Lampert, C.H.: Learning to Localize Objects with Structured Output Regression. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 2–15. Springer, Heidelberg (2008)
17. Tsochantaridis, I., Joachims, T., Hofmann, T., Altun, Y.: Large margin methods for structured and interdependent output variables. JMLR 6, 1453–1484 (2005)
18. Bordes, A., Bottou, L., Gallinari, P., Weston, J.: Solving multiclass support vector machines with LaRank. In: Proc. ICML, pp. 89–96 (2007)
19. Bordes, A., Usunier, N., Bottou, L.: Sequence Labelling SVMs Trained in One Pass. In: Daelemans, W., Goethals, B., Morik, K. (eds.) ECML PKDD 2008, Part I. LNCS (LNAI), vol. 5211, pp. 146–161. Springer, Heidelberg (2008)
20. Shalev-Shwartz, S., Singer, Y.: A primal-dual perspective of online learning algorithms. Machine Learning 69, 115–142 (2007)
21. Efraimidis, P.S., Spirakis, P.G.: Weighted random sampling with a reservoir. Inf. Process. Lett. 97, 181–185 (2006)
Fast Regularization of Matrix-Valued Images
1 Introduction
Matrix Lie-group data, and specifically matrix-valued images, have become an integral part of computer vision and image processing. Such representations have been found useful for tracking [35,45], robotics, motion analysis, image processing and computer vision [10,32,34,36,48], as well as medical imaging [6,31]. Specifically, developing efficient regularization schemes for matrix-valued images is of prime importance for image analysis and computer vision. This includes applications such as direction diffusion [25,42,47] and scene motion analysis [27] in computer vision, as well as diffusion tensor MRI (DT-MRI) regularization [7,14,21,40,43] in medical imaging.
This research was supported by Israel Science Foundation grant no. 1551/09 and by the European Community's FP7 ERC program, grant agreement no. 267414.
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 173–186, 2012.
© Springer-Verlag Berlin Heidelberg 2012
174 G. Rosman et al.
The symmetric positive definite group SPD(n). This is the group of symmetric positive definite matrices. It has been studied extensively in control theory (see [17] for example), as well as in the context of diffusion tensor images [31], where the matrices are used to describe the diffusion coefficients along each direction. By definition, this group is given in matrix form as
$$SPD(n) = \{A \in \mathbb{R}^{n \times n} \mid A = A^T,\; A \succ 0\}.$$
The rationale behind the different regularization term $\|\nabla u\|$ stems from the fact that SO(n) and SE(n) are isometries of Euclidean space; such a regularization is, however, possible whenever the data consist of nonsingular matrices, and it has also been used for SPD matrices [46]. We refer the reader to our technical report [37] for a more in-depth discussion of this important point. Next, instead of restricting u to G, we add an auxiliary variable v at each point, such that u = v, and restrict v to G, where the equality constraint is enforced via augmented Lagrangian terms [23,33]. The suggested augmented Lagrangian optimization now reads
with $\bar{u} = \dfrac{2u_0 + rv + \mu}{2 + r}$.
This problem can be solved via fast minimization techniques for TV regularization of vectorial images, such as [9,16,19]. We chose to use the augmented-Lagrangian TV algorithm [41], as we now describe. In order to obtain fast optimization of the problem with respect to u, we add an auxiliary variable p, along with a constraint that p = ∇u. Again, the constraint is enforced in an augmented Lagrangian manner. The optimal u now becomes a saddle point of the optimization problem
$$\min_{u} \max_{p} \int \Bigl\{ |p| + \mathcal{L}(u; u_0, v, \mu, r) + \lambda_2^T (p - \nabla u) + \frac{r_2}{2}\|p - \nabla u\|^2 \Bigr\}\, dx,$$
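The alternating structure just described (a TV step for u, a projection step for v, and a multiplier update) can be sketched in scalar form; `solve_u` and `project_to_G` are placeholder callables standing in for the paper's TV solver and the projection onto G, and the names are ours:

```python
def augmented_lagrangian(u0, project_to_G, solve_u, r=1.0, iterations=15):
    """Alternating augmented Lagrangian scheme: split u = v, keep v on the
    constraint set G by projection, and enforce u = v through the
    multiplier mu.

    solve_u(u0, v, mu, r) performs the (TV-regularised) update of u given
    the current v and mu; project_to_G maps a point back onto G.
    Illustrative sketch only.
    """
    u = u0
    mu = 0.0
    v = project_to_G(u0)
    for _ in range(iterations):
        u = solve_u(u0, v, mu, r)       # data + coupling step for u
        v = project_to_G(u + mu / r)    # projection step onto G
        mu = mu + r * (u - v)           # multiplier update
    return u, v
```

For a quadratic data term (u - u0)^2, the u-step has the closed form u = (2 u0 + r v - mu) / (2 + r), mirroring the update formula quoted above (up to the sign convention for mu).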
where the matrix U is a unitary matrix representing the eigenvectors of the matrix, and the eigenvalues of the result are the positive projections $(\lambda_i)_+$ of the eigenvalues $\lambda_i$. Optimization w.r.t. u is done as in the previous cases, as described in Algorithm 1. Furthermore, the optimization w.r.t. u, v is now over the domain $\mathbb{R}^m \times SPD(n)$, and the cost function is convex, resulting in a convex optimization problem. The convex domain of optimization allows us to formulate a convergence proof for the algorithm similar to the proof by Tseng [44]. We refer the interested reader to our technical report [3]. An example of using the proposed method for DT-MRI denoising is shown in Section 4.
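The eigenvalue projection just described, U max(Λ, ε) Uᵀ, can be sketched for 2×2 symmetric matrices without a linear-algebra library (an illustrative helper written by us, not the paper's code):

```python
import math

def project_to_spd2(m, eps=1e-8):
    """Project a symmetric 2x2 matrix [[a, b], [b, c]] onto SPD(2) by
    clamping its eigenvalues at eps, i.e. U max(Lambda, eps) U^T."""
    (a, b), (_, c) = m
    # Eigenvalues of a symmetric 2x2 matrix: tr/2 +- sqrt((tr/2)^2 - det).
    tr, det = a + c, a * c - b * b
    gap = math.sqrt(max((tr / 2) ** 2 - det, 0.0))
    l1, l2 = tr / 2 + gap, tr / 2 - gap
    # Unit eigenvector for l1 (diagonal case handled separately).
    if abs(b) < 1e-15:
        ux, uy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    else:
        n = math.hypot(l1 - c, b)
        ux, uy = (l1 - c) / n, b / n
    l1, l2 = max(l1, eps), max(l2, eps)
    # Reassemble U diag(l1, l2) U^T; the second eigenvector is (-uy, ux).
    return [[l1 * ux * ux + l2 * uy * uy, (l1 - l2) * ux * uy],
            [(l1 - l2) * ux * uy, l1 * uy * uy + l2 * ux * ux]]
```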
We note that the scheme we describe is susceptible to the staircasing effect, since it minimizes the total variation of the map u. Several higher-order priors that do not suffer from staircasing effects can be incorporated into our scheme. One such possible higher-order term generalizes the scheme presented by Wu and Tai [49] by replacing the per-element gradient operator with a Hessian operator. The resulting saddle-point problem becomes
resulting saddle-point problem becomes
% 2 &
p + u u (u0 , v, , r)
min max dx, (17)
u Rm 2 +T2 (p Hu) + r22 p Hu2
p R4m ,
vG
4 Numerical Results
As discussed above, the proposed algorithmic framework is quite general and suitable for various applications. In this section, several examples from different applications are used to substantiate the effectiveness and efficiency of our algorithm.
Fig. 2. Regularization of principal motion directions. The red arrows show measurements of motion cues based on a normalized cross-correlation tracker. Blue arrows show the regularized direction fields.
Table 1. Processing times (ms) for various sizes of images, with various iteration counts
SE(3) image, as shown in Figure 4. It can be seen that for a careful choice of the regularization parameter, total variation in the group elements significantly reduces rigid motion estimation errors. Furthermore, it allows us to discern the main rigidly moving parts in the sequence by producing a scale-space of rigid motions. Visualization is accomplished by projecting the embedded matrix onto 3 different representative vectors in $\mathbb{R}^{12}$. The regularization is implemented using the CUDA framework, with computation times shown in Table 1 for various image sizes and iteration counts. In the GPU implementation the polar decomposition was chosen for its simplicity and efficiency. In practice, one Gauss-Seidel iteration sufficed to update u. Using 15 outer iterations, practical convergence is achieved in 49 milliseconds on an NVIDIA GTX-580 card for QVGA-sized images, demonstrating the efficiency of our algorithm and its potential for real-time applications. This is especially important for applications such as gesture recognition, where fast computation is crucial.
Fig. 4. Regularization of SE(3) images obtained from local ICP matching of the surface patch between consecutive Kinect depth frames. Left to right: diffusion scale-space obtained for different values of the regularization parameter (1.5, 1.2, 0.7, 0.2, 0.1, 0.05), the foreground segmentation based on the depth, and an intensity image of the scene. Top to bottom: different frames from the depth motion sequence.
5 Conclusions
In this paper, a general framework for regularization of matrix-valued maps is proposed. Based on augmented Lagrangian techniques, we separate the optimization problem into a TV-regularization step and a projection step, both of which can be solved in an easy-to-implement and parallel way. Specifically, we show the efficiency and effectiveness of the resulting scheme through several examples whose data are taken from SO(2), SE(3), and SPD(3), respectively. Notably, for matrix-valued images, our algorithms allow real-time regularization for tasks in image analysis and computer vision.
In future work we intend to explore other applications for matrix-valued image regularization, as well as to generalize our method to other types of maps and to other data and noise models.
References
21. Gur, Y., Sochen, N.: Fast invariant Riemannian DT-MRI regularization. In: ICCV, pp. 1–7 (2007)
22. Hall, B.C.: Lie Groups, Lie Algebras, and Representations: An Elementary Introduction. Springer (2004)
23. Hestenes, M.R.: Multipliers and gradient methods. J. of Optimization Theory and Applications 4, 303–320 (1969)
24. Higham, N.J.: Matrix nearness problems and applications. In: Applications of Matrix Theory, pp. 1–27. Oxford University Press, Oxford (1989)
25. Kimmel, R., Sochen, N.: Orientation diffusion or how to comb a porcupine. Special Issue on PDEs in Image Processing, Comp. Vision, and Comp. Graphics, J. of Vis. Comm. and Image Representation 13, 238–248 (2002)
26. Larochelle, P.M., Murray, A.P., Angeles, J.: SVD and PD based projection metrics on SE(N). In: On Advances in Robot Kinematics, pp. 13–22. Kluwer (2004)
27. Lin, D., Grimson, W., Fisher, J.: Learning visual flows: A Lie algebraic approach. In: CVPR, pp. 747–754 (2009)
28. Lundervold, A.: On consciousness, resting state fMRI, and neurodynamics. Nonlinear Biomed. Phys. 4(Suppl. 1) (2010)
29. Mehran, R., Moore, B.E., Shah, M.: A Streakline Representation of Flow in Crowded Scenes. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part III. LNCS, vol. 6313, pp. 439–452. Springer, Heidelberg (2010)
30. Nicolescu, M., Medioni, G.: A voting-based computational framework for visual motion analysis and interpretation. IEEE Trans. on PAMI 27, 739–752 (2005)
31. Pennec, X., Fillard, P., Ayache, N.: A Riemannian framework for tensor computing. IJCV 66(1), 41–66 (2006)
32. Perona, P.: Orientation diffusions. IEEE Trans. Image Process. 7(3), 457–467 (1998)
33. Powell, M.J.: A method for nonlinear constraints in minimization problems. In: Optimization, pp. 283–298. Academic Press (1969)
34. Rahman, I.U., Drori, I., Stodden, V.C., Donoho, D.L., Schroeder, P.: Multiscale representations of manifold-valued data. Technical report, Stanford (2005)
35. Raptis, M., Soatto, S.: Tracklet Descriptors for Action Modeling and Video Analysis. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part I. LNCS, vol. 6311, pp. 577–590. Springer, Heidelberg (2010)
36. Rosman, G., Bronstein, M.M., Bronstein, A.M., Wolf, A., Kimmel, R.: Group-Valued Regularization Framework for Motion Segmentation of Dynamic Non-rigid Shapes. In: Bruckstein, A.M., ter Haar Romeny, B.M., Bronstein, A.M., Bronstein, M.M. (eds.) SSVM 2011. LNCS, vol. 6667, pp. 725–736. Springer, Heidelberg (2012)
37. Rosman, G., Wang, Y., Tai, X.-C., Kimmel, R., Bruckstein, A.M.: Fast regularization of matrix-valued images. Technical Report CIS-11-03, Technion (2011)
38. Rudin, L.I., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Physica D 60, 259–268 (1992)
39. Sochen, N.A., Sagiv, C., Kimmel, R.: Stereographic combing a porcupine or studies on direction diffusion in image processing. SIAM J. Appl. Math. 64(5), 1477–1508 (2004)
40. Steidl, G., Setzer, S., Popilka, B., Burgeth, B.: Restoration of matrix fields by second-order cone programming. Computing 81(2-3), 161–178 (2007)
41. Tai, X.-C., Wu, C.: Augmented Lagrangian method, dual methods and split Bregman iteration for the ROF model. In: SSVM, pp. 502–513 (2009)
42. Tang, B., Sapiro, G., Caselles, V.: Diffusion of general data on non-flat manifolds via harmonic maps theory: The direction diffusion case. IJCV 36, 149–161 (2000)
43. Tschumperlé, D., Deriche, R.: Vector-valued image regularization with PDEs: A common framework for different applications. IEEE-TPAMI 27, 506–517 (2005)
44. Tseng, P.: Coordinate ascent for maximizing nondifferentiable concave functions. LIDS-P 1940, MIT (1988)
45. Tuzel, O., Porikli, F., Meer, P.: Learning on Lie groups for invariant detection and tracking. In: CVPR, pp. 1–8 (2008)
46. Vemuri, B.C., Chen, Y., Rao, M., McGraw, T., Wang, Z., Mareci, T.: Fiber tract mapping from diffusion tensor MRI. In: VLSM, pp. 81–88. IEEE Computer Society (2001)
47. Weickert, J., Brox, T.: Diffusion and regularization of vector- and matrix-valued images. Inverse Problems, Image Analysis, and Medical Imaging, vol. 313 (2002)
48. Wen, Z., Goldfarb, D., Yin, W.: Alternating direction augmented Lagrangian methods for semidefinite programming. CAAM TR09-42, Rice University (2009)
49. Wu, C., Tai, X.-C.: Augmented Lagrangian method, dual methods, and split Bregman iteration for ROF, vectorial TV, and high order models. SIAM J. Imaging Sciences 3(3), 300–339 (2010)
Blind Correction of Optical Aberrations
1 Introduction
Optical lenses image scenes by refracting light onto photosensitive surfaces. The lens of the vertebrate eye creates images on the retina; the lens of a photographic camera creates images on digital sensors. This transformation should ideally satisfy a number of constraints formalizing our notion of a veridical imaging process. The design of any lens is a trade-off between these constraints, leaving us with residual errors that are called optical aberrations. Some errors are due to the fact that light coming through different parts of the lens cannot be focused onto a single point (spherical aberration, astigmatism and coma); some errors appear because refraction depends on the wavelength of the light (chromatic aberrations). A third type of error, not treated in the present work, leads to a deviation from a rectilinear projection (image distortion). Camera lenses are carefully designed to minimize optical aberrations by combining elements of multiple shapes and glass types.
However, it is impossible to make a perfect lens, and it is very expensive to make a close-to-perfect lens. A much cheaper solution is in line with the new field of computational photography: correct the optical aberration in software. To this end, we use non-uniform (non-stationary) blind deconvolution. Deconvolution is a hard inverse problem, which implies that in practice even non-blind uniform deconvolution requires assumptions to work robustly. Blind deconvolution is harder still, since we additionally have to estimate the blur kernel, and
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 187–200, 2012.
© Springer-Verlag Berlin Heidelberg 2012
(a) The image contains certain elements typical of natural images, in particular,
there are sharp edges.
(b) Even though the blur due to optical aberrations is non-uniform (spatially
varying across the image), there are circular symmetries that we can exploit.
Inverting a forward model has the benefit that if the assumptions are correct, it
will lead to a plausible explanation of the image, making it more credible than
an image obtained by sharpening the blurry image using, say, an algorithm that
filters the image to increase high frequencies.
Furthermore, we emphasize that our approach is blind, i.e., it requires as
an input only the blurry image, and not a point spread function that we may
have obtained using other means such as a calibration step. This is a substantial
advantage, since the actual blur depends not only on the particular photographic
lens but also on settings such as focus, aperture and zoom. Moreover, there are
cases where the camera settings are lost and the camera may even no longer be
available, e.g., for historic photographs.
(a) we design a class of PSF families containing realistic optical aberrations, via
a set of suitable symmetry properties,
(b) we represent the PSFs using an orthonormal basis to improve condi-
tioning and allow for direct PSF estimation,
(c) we avoid calibration to specific camera–lens combinations by proposing a
blind approach for inferring the PSFs, widening the applicability to any
photograph (e.g., with missing lens information, such as historical images)
and avoiding cumbersome calibration steps,
(d) we extend blur estimation to multiple color channels to remove chromatic
aberrations as well, and finally
(e) we present experimental results showing that our approach is competitive
with non-blind approaches.
Optical aberrations cause image blur that is spatially varying across the image.
As such they can be modeled as a non-uniform point spread function (PSF), for
which Hirsch et al. [11] introduced the Ecient Filter Flow (EFF) framework,
y = Σ_{r=1}^{R} a^{(r)} ∗ (w^{(r)} ⊙ x),    (1)

in which the local kernels a^{(r)} are the blur parameters, and
where x denotes the ideal image and y is the image degraded by optical aber-
ration. In this paper, we assume that x and y are discretely sampled images,
i.e., x and y are finite-sized matrices whose entries correspond to pixel inten-
sities. w^{(r)} is a weighting matrix that masks out all of the image x except for
a local patch by Hadamard multiplication (symbol ⊙, pixel-wise product). The
r-th patch is convolved (symbol ∗) with a local blur kernel a^{(r)}, also represented
as a matrix. All blurred patches are summed up to form the degraded image.
The more patches are considered (R is the total number of patches), the better
the approximation to the true non-uniform PSF. Note that the patches defined
by the weighting matrices w^{(r)} usually overlap to yield smoothly varying blurs.
The weights are chosen such that they sum up to one for each pixel. In [11] it
is shown that this forward model can be computed efficiently by making use of
the short-time Fourier transform.
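The EFF forward model of Eq. (1) can be sketched in a few lines of NumPy. This is a hedged illustration, not the authors' implementation (which uses short-time Fourier transforms for efficiency); the function name and the direct convolution loop are ours:

```python
import numpy as np

def eff_forward(x, kernels, weights):
    """Efficient Filter Flow forward model (a sketch of Eq. (1)):
    y = sum_r a^(r) * (w^(r) . x), '.' = pixel-wise product, '*' = convolution.
    x       : 2-D image array
    kernels : list of R local blur kernels a^(r) (2-D arrays, odd sizes)
    weights : list of R weighting masks w^(r), same shape as x, summing to 1 per pixel
    """
    y = np.zeros_like(x, dtype=float)
    for a, w in zip(kernels, weights):
        patch = w * x                      # mask out one local patch (Hadamard product)
        kh, kw = a.shape
        ph, pw = kh // 2, kw // 2
        padded = np.pad(patch, ((ph, ph), (pw, pw)))
        # direct 2-D convolution of the masked patch with its local kernel
        for i in range(kh):
            for j in range(kw):
                y += a[i, j] * padded[kh - 1 - i : kh - 1 - i + x.shape[0],
                                      kw - 1 - j : kw - 1 - j + x.shape[1]]
    return y
```

Because the weights sum to one per pixel, the model reduces to a single uniform convolution when all local kernels are identical, which is a useful sanity check.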
a^{(r)} = Σ_{k=1}^{K} μ_k b_k^{(r)}.    (2)
To define the basis elements we group the patches into overlapping groups, such
that each group contains all patches inside a certain ring around the image center,
i.e., the center distance d_r determines whether a patch belongs to a particular
group. Basis elements for three example groups are shown in Figure 2. All patches
inside a group will be assigned similar kernels. The width and the overlap of
the rings determine the amount of smoothness between groups (see property (c)
above).
For a single group we define a series of basis elements as follows. For each
patch in the group we generate matching blur kernels by placing a single delta
peak inside the blur kernel and then mirroring the kernel with respect to the line l_r
(see Figure 3). For patches not in the current group (i.e., outside the current ring),
the corresponding local blur kernels are zero. This generation process creates
basis elements that fulfill the symmetry properties listed above. To increase
smoothness of the basis and avoid effects due to pixelization, we place small
Gaussian blurs (standard deviation 0.5 pixels) instead of delta peaks.
Fig. 3. Shifts to generate basis elements for the middle group of Figure 2
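The basis-element construction described above can be illustrated as follows. This is a simplified sketch under our own assumptions: the mirror line l_r is replaced by the kernel's vertical centre line, and the helper names are ours, not the paper's:

```python
import numpy as np

def gaussian_bump(size, center, sigma=0.5):
    """Kernel with a small Gaussian (std 0.5 px) at `center`, used instead of a delta peak."""
    yy, xx = np.mgrid[0:size, 0:size]
    g = np.exp(-((yy - center[0]) ** 2 + (xx - center[1]) ** 2) / (2 * sigma ** 2))
    return g / g.sum()

def mirrored_basis_element(size, offset):
    """Basis element: a Gaussian bump at `offset` (rows, cols) from the kernel
    centre, plus its mirror image about the kernel's vertical centre line --
    a stand-in for the paper's mirror line l_r."""
    c = size // 2
    b = gaussian_bump(size, (c + offset[0], c + offset[1]))
    b += gaussian_bump(size, (c + offset[0], c - offset[1]))  # mirrored copy
    return b / b.sum()
```

By construction each element is symmetric about the mirror line, so any linear combination of such elements inherits the symmetry, which is the property the PSF class is meant to enforce.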
B = U S V^T.    (4)
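The orthonormalisation via the SVD of Eq. (4) can be sketched as follows — a hedged illustration using NumPy, where B stacks the basis elements as columns; the function name and the rank-truncation rule are ours, not the paper's:

```python
import numpy as np

def orthonormalize_basis(B, k=None):
    """Orthonormalise an overcomplete basis via the thin SVD, B = U S V^T (Eq. (4)).
    B : (n_pixels, n_elements) matrix whose columns are the stacked basis elements.
    Returns the first k left singular vectors: an orthonormal basis spanning
    (approximately) the same space, which improves the conditioning of PSF
    estimation and allows a direct projection step."""
    U, S, Vt = np.linalg.svd(B, full_matrices=False)
    if k is None:
        # keep only numerically significant directions (illustrative threshold)
        k = int((S > 1e-10 * S[0]).sum())
    return U[:, :k]
```

An unconstrained blur estimate `a` can then be projected onto the symmetric PSF class with `Q @ (Q.T @ a)`, where `Q` is the returned orthonormal basis.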
Fig. 5. Chromatic shock filter removes color fringing (adapted from [12])
x_{t+1}^R = x_t^R − sign(z_t) |∇x_t^R|
x_{t+1}^G = x_t^G − sign(z_t) |∇x_t^G|      with z_t = (∂²_{ηη} x_t^R + ∂²_{ηη} x_t^G + ∂²_{ηη} x_t^B) / 3
x_{t+1}^B = x_t^B − sign(z_t) |∇x_t^B|
(6)
where z denotes the second derivative in the direction of the gradient. We call
this extension chromatic shock filtering since it takes all three color channels
simultaneously into account. Figure 5 shows the reduction of color fringing on
the example of Osher and Rudin [12] adapted to three color channels.
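One update step of this chromatic shock filter can be sketched as below. This is a hedged simplification: the shared sign field z is approximated here by the mean Laplacian of the three channels (a common stand-in for the second derivative in the gradient direction), and the step size `dt` is illustrative:

```python
import numpy as np

def laplacian(ch):
    """5-point discrete Laplacian with edge replication."""
    p = np.pad(ch, 1, mode='edge')
    return p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:] - 4 * ch

def chromatic_shock_step(img, dt=0.1):
    """One step of chromatic shock filtering (a sketch of Eq. (6)): every colour
    channel is sharpened with the SAME sign field z_t, so edges move in a
    consistent direction across channels and colour fringing is suppressed.
    img : (H, W, 3) float array."""
    z = sum(laplacian(img[..., c]) for c in range(3)) / 3.0
    out = np.empty_like(img)
    for c in range(3):
        gy, gx = np.gradient(img[..., c])           # per-channel gradient
        out[..., c] = img[..., c] - dt * np.sign(z) * np.hypot(gx, gy)
    return out
```

Using a single z for all channels is the key difference from running the Osher–Rudin filter per channel, where each channel could shift its edges independently.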
Combining the forward model defined above in Eq. (1) with the chromatic shock
filtering, the PSF parameters and the image x (initialized by y) are estimated
by iterating over three steps:
(a) Prediction Step: the current estimate x is first denoised with a bilateral fil-
ter, then edges are emphasized with chromatic shock filtering and by zeroing
flat gradient regions in the image (see [6] for further details). The gradient
selection is modified such that for every radius ring the strongest gradients
are selected.
(b) PSF Estimation: if we work with the overcomplete basis B, we would like
to find coefficients μ that minimize the regularized fit of the gradient images
∂y and ∂x,
‖∂y − Σ_{r=1}^{R} (B^{(r)} μ) ∗ (w^{(r)} ⊙ ∂x)‖² + α Σ_{r=1}^{R} ‖l ∗ (B^{(r)} μ)‖² + β Σ_{r=1}^{R} ‖B^{(r)} μ‖²    (7)
where B^{(r)} is the matrix containing the basis elements for the r-th patch.
Note that μ is the same for all patches. This optimization can be performed
iteratively. The regularization parameters α and β are set to 0.1 and 0.01,
respectively.
However, the iterations are costly, and we can speed things up by using the
orthonormal basis. The blur is initially estimated unconstrained and then
projected onto the orthonormal basis. In particular, we first minimize the
fit of the general EFF forward model (without the basis) with an additional
regularization term on the local blur kernels, i.e., we minimize
‖∂y − Σ_{r=1}^{R} a^{(r)} ∗ (w^{(r)} ⊙ ∂x)‖² + α Σ_{r=1}^{R} ‖l ∗ a^{(r)}‖² + β Σ_{r=1}^{R} ‖a^{(r)}‖²    (8)
where l = [1, −2, 1]^T denotes the discrete Laplace operator, F the discrete
Fourier transform, and Z_b, Z_y, Z_l, C_r and E_r appropriate zero-padding and
cropping matrices. |u| denotes the entry-wise absolute value of a complex
vector u, ū its entry-wise complex conjugate. The fraction has to be im-
plemented pixel-wise. The normalization factor N accounts for artifacts at
patch boundaries which originate from windowing (see [8]).
Similar to [6] and [8], the algorithm follows a coarse-to-fine approach. Having
estimated the blur parameters, we use a non-uniform version of Krishnan and
Fergus' approach [14,8] for the non-blind deconvolution to recover a high-quality
estimate of the true image. For the x-subproblem we use the direct deconvolution
formula (11).
NVIDIA Tesla C2070 GPU with 6 GB of memory. The basis elements gener-
ated as detailed in Section 4 are orthogonalized using the SVDLIBC library⁴.
Calculating the SVD for the occurring large sparse matrices can require a few
minutes of running time. However, the basis is independent of the image con-
tent, so we can compute the orthonormal basis once and reuse it. Table 1 reports
the running times of our experiments for both PSF estimation and the final non-blind
deconvolution, along with the EFF parameters and image dimensions. In particular, it
shows that using the orthonormal basis instead of the overcomplete one improves
the running times by a factor of about six to eight.
Table 1. (a) Image sizes, (b) size of the local blur kernels, (c) number of patches
horizontally and vertically, (d) runtime of PSF estimation using the overcomplete basis
B (see Eq. (7)), (e) runtime of PSF estimation using the orthonormal basis (see
Eq. (8)) as used in our approach, (f) runtime of the final non-blind deconvolution.
8 Results
In the following, we show results on real photos and do a comprehensive com-
parison with other approaches for removing optical aberrations. Image sizes and
blur parameters are shown in Table 1.
Schuler et al. show deblurring results on images taken with a lens that consists
only of a single element, thus exhibiting strong optical aberrations, in particular
coma. Since their approach is non-blind, they measure the non-uniform PSF with
a point source and apply non-blind deconvolution. In contrast, our approach is
blind and is directly applied to the blurry image.
To better approximate the large blur of that lens, we additionally assume that
the local blurs scale linearly with radial position, which can easily be incorpo-
rated into our basis generation scheme. For comparison, we apply Photoshop's
Smart Sharpen function for removing lens blur. It depends on the blur size
and the amount of blur, which are manually controlled by the user. Thus we
call this method semi-blind since it assumes a parametric form. Even though we
choose its parameters carefully, we are not able to obtain comparable results.
⁴ http://tedlab.mit.edu/~dr/SVDLIBC/
Comparing our blind method against the non-blind approach of [3], we observe
that our estimated PSF matches their measured PSFs rather well (see Figure 7).
However, surprisingly, we obtain an image that may be considered sharper.
The reason could be over-sharpening or a less conservative regularization in the
final deconvolution; it is also conceivable that the calibration procedure used by
[3] is not sufficiently accurate. Note that neither DXO nor Kee et al.'s approach
can be applied, lacking calibration data for this lens.
Fig. 6. Schuler et al.'s lens. Full image and lower left corner
(a) blindly estimated by our approach (b) measured by Schuler et al. [3]
Fig. 8. Canon 24 mm f/1.4 lens. Shown is the upper left corner of the image. The PSF
inset is three times the original size.
Fig. 9. Comparison between our blind approach and the two non-blind approaches of Kee
et al. [2] and DXO
and obtained a sharper image. However, the blur appeared to be small, so al-
gorithms like Adobe's Smart Sharpen also give reasonable results. Note that
neither DXO nor Kee et al.'s approach can be applied here since lens data is not
available.
9 Conclusion
We have proposed a method to blindly remove spatially varying blur caused by
imperfections in lens designs, including chromatic aberrations. Without relying
on elaborate calibration procedures, results comparable to non-blind methods
can be achieved. By creating a suitable orthonormal basis, the PSF is constrained
to a class that exhibits the generic symmetry properties of lens blurs, while still
permitting fast PSF estimation.
9.1 Limitations
Our assumptions about the lens blur are only an approximation for lenses with
poor rotation symmetry.
The image prior used in this work is only suitable for natural images, and is
hence content-specific. For images containing only text or patterns, it would
not be ideal.
The fact that the blur depends on the lens and camera settings rather than on
the image content implies that we can transfer the PSFs estimated for these settings,
for instance, to images where our image prior assumptions are violated. On the other hand, it
suggests the possibility of improving the quality of the PSF estimates by utilizing
a substantial database of images.
Finally, while optical aberrations are a major source of image degradation,
a picture may also suffer from motion blur. By choosing a suitable basis, these
two effects could be combined. It would also be interesting to see if non-uniform
motion deblurring could profit from a direct PSF estimation step as introduced
in the present work.
References
1. Joshi, N., Szeliski, R., Kriegman, D.: PSF estimation using sharp edge prediction.
In: Proc. IEEE Conf. Comput. Vision and Pattern Recognition (June 2008)
2. Kee, E., Paris, S., Chen, S., Wang, J.: Modeling and removing spatially-varying
optical blur. In: Proc. IEEE Int. Conf. Computational Photography (2011)
3. Schuler, C., Hirsch, M., Harmeling, S., Schölkopf, B.: Non-stationary correction of
optical aberrations. In: Proc. IEEE Intern. Conf. on Comput. Vision (2011)
4. Fergus, R., Singh, B., Hertzmann, A., Roweis, S., Freeman, W.: Removing camera
shake from a single photograph. ACM Trans. Graph. 25 (2006)
5. Miskin, J., MacKay, D.: Ensemble learning for blind image separation and decon-
volution. Advances in Independent Component Analysis (2000)
6. Cho, S., Lee, S.: Fast Motion Deblurring. ACM Trans. Graph. 28(5) (2009)
7. Whyte, O., Sivic, J., Zisserman, A., Ponce, J.: Non-uniform deblurring for shaken
images. In: Proc. IEEE Conf. Comput. Vision and Pattern Recognition (2010)
8. Hirsch, M., Schuler, C., Harmeling, S., Schölkopf, B.: Fast removal of non-uniform
camera shake. In: Proc. IEEE Intern. Conf. on Comput. Vision (2011)
9. Krishnan, D., Tay, T., Fergus, R.: Blind deconvolution using a normalized sparsity
measure. In: Proc. IEEE Conf. Comput. Vision and Pattern Recognition (2011)
10. Levin, A., Weiss, Y., Durand, F., Freeman, W.: Efficient marginal likelihood opti-
mization in blind deconvolution. In: Proc. IEEE Conf. Comput. Vision and Pattern
Recognition. IEEE (2011)
11. Hirsch, M., Sra, S., Schölkopf, B., Harmeling, S.: Efficient filter flow for space-
variant multiframe blind deconvolution. In: Proc. IEEE Conf. Comput. Vision and
Pattern Recognition (2010)
12. Osher, S., Rudin, L.: Feature-oriented image enhancement using shock filters. SIAM
J. Numerical Analysis 27(4) (1990)
13. Harmeling, S., Hirsch, M., Schölkopf, B.: Space-variant single-image blind decon-
volution for removing camera shake. In: Advances in Neural Inform. Processing
Syst. (2010)
14. Krishnan, D., Fergus, R.: Fast image deconvolution using hyper-Laplacian priors.
In: Advances in Neural Inform. Process. Syst. (2009)
Inverse Rendering of Faces on a Cloudy Day
1 Introduction
The appearance of a face in an image is determined by a combination of intrinsic
and extrinsic factors. The intrinsic properties of a face include its shape and
reflectance properties (which vary spatially, giving rise to parameter maps or,
in the case of diffuse albedo, texture maps). The extrinsic properties of the
image include illumination conditions, camera properties and viewing conditions.
Inverse rendering seeks to recover intrinsic properties from an image of an object.
These can subsequently be used for recognition or re-rendering under novel pose
or illumination.
The forward rendering process is very well understood and physically-based
rendering tools allow for photorealistic rendering of human faces. The inverse
process on the other hand is much more challenging. Perhaps the best known
results apply to convex Lambertian objects. In this case, reectance is a function
solely of the local surface normal direction and irradiance (even under complex
environment illumination) can be accurately described using a low-dimensional
spherical harmonic approximation [1]. This observation underpins the successful
appearance-based approaches to face recognition.
However, faces are not globally convex and it has been shown that occlusions
of the illumination environment play an important role in human perception of
3D shape [2,3]. Prados et al. [4] have shown how shading caused by occlusion
under perfectly ambient illumination can be used to estimate 3D shape. In this
paper we take a step towards incorporating global illumination effects into the
inverse rendering process. This is done in the context of fitting a 3D morphable
face model, so the texture is subject to a global statistical constraint.
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 201–214, 2012.
© Springer-Verlag Berlin Heidelberg 2012
202 O. Aldrian and W.A.P. Smith
We use a model which incorporates ambient occlusion [5] and bent normals
[6] into the image formation process. This is an approximation to the rendering
equation that is popular in graphics, because it can be precomputed and subse-
quently used in real-time rendering applications. Both properties are a function
of the 3D shape of an object. Figure 1 illustrates this concept graphically. For
non-convex objects, using surface normals to model illumination with spherical
harmonics leads to an approximation error; a problem well understood within
the vision and graphics communities. We focus on a certain
object class, human faces, which possess non-convex regions. If the object shown
in Figure 1 is illuminated by a uniform distant light source, a further approxi-
mation error occurs: the entire object will appear unshaded. In reality, however,
at points on the surface where the upper hemisphere relative to the surface normal
intersects other parts of the scene (including the object itself), shading occurs
that is proportional to the degree of occlusion; a phenomenon known as ambient
shading.
[Figure 1 annotations: a) point of interest, b) upper hemisphere, c) un-occluded directions, d) surface normal, e) bent normal]
Fig. 1. Spherical lighting modelled via surface normals results in a systematic error for
non-convex parts of a shape. Bent normals (the direction of least occlusion) provide a
more accurate approximation to the true underlying process.
Accurately calculating ambient occlusion and bent normals for a given shape is
computationally expensive. In this paper, we build a statistical model of ambient
occlusion and bent normals and learn the relationship between shape parameters
and ambient occlusion/bent normal parameters. This means an initial estimate of
the parameters can be made by statistical inference from the shape parameters.
During the inverse rendering process, the parameters are refined to best describe
the observed image. The result is that the texture map is not corrupted by dark
pixels in occluded regions and the accuracy of the estimated texture map and
illumination environment is increased. We present results on synthetic images
with corresponding ground truth.
where L(ω) is the illumination function (i.e. the incident radiance from direction
ω). V_{p,ω} is the visibility function, defined to be zero if p is occluded in the
direction ω and one otherwise. ρ_p is the spatially varying diffuse albedo and we
assume specular reflectance properties are constant over the surface.
A common assumption in computer vision is that the object under study is
convex, i.e.: ∀p, ∀ω ∈ Ω(n_p): V_{p,ω} = 1. The advantage of this assumption is
that the image irradiance reduces to a function of local normal direction which
can be efficiently characterised by a low order spherical harmonic.
However, under point source illumination this corresponds to an assumption
of no cast shadows and under environment illumination it neglects occlusion of
regions of the illumination environment. In both cases, this can lead to a large
discrepancy between the modelled and observed intensity and, in the context of
inverse rendering, distortion of the estimated texture. Heavily occluded regions
are interpreted as regions with dark texture.
Many approximations to Equation 1 have been proposed in the graphics lit-
erature and several could potentially be incorporated into an inverse rendering
formulation. However, in this paper our aim is to demonstrate that even the
simplest approximation yields an improvement in inverse rendering accuracy.
Specifically, we use the ambient occlusion and bent normal model proposed by
Zhukov et al. [5] and Landis [6]. Ambient occlusion is based on the simplification
that the visibility term can be moved outside the integral:
i_p = a_p ∫_{Ω(n_p)} L(ω) ( ρ_p (n_p · ω) + s(n_p, ω, v) ) dω,    (2)
Under this model, light from all directions is equally attenuated, i.e. the di-
rectional dependence of illumination and visibility are treated separately. For a
perfectly ambient environment (i.e. ∀ω ∈ S², L(ω) = k), the approximation is
exact. Otherwise, the quality of the approximation depends on the nature of the
illumination environment and reflectance properties of the surface. An extension
to ambient occlusion is the so-called bent normal. This is the average unoccluded
direction. It attempts to capture the dominant direction from which light arrives
at a point and is used in place of the surface normal for rendering.
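Both quantities admit a simple Monte-Carlo estimate. The sketch below is a hedged illustration of the concepts (not the paper's precomputation, which uses Meshlab on meshes); the `occluded` callable is a user-supplied stand-in for ray casting against the scene geometry:

```python
import numpy as np

def ambient_occlusion_bent_normal(n, occluded, samples=10000, seed=0):
    """Monte-Carlo estimate of ambient occlusion and the bent normal at a point.
    n        : unit surface normal, shape (3,)
    occluded : callable(direction) -> True if a ray in that direction hits geometry
    Returns (a, bent): a in [0, 1] is the unoccluded fraction of the upper
    hemisphere; bent is the average unoccluded direction (unit vector)."""
    rng = np.random.default_rng(seed)
    dirs = rng.standard_normal((samples, 3))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # uniform on the sphere
    dirs = dirs[dirs @ n > 0]                             # keep the upper hemisphere
    visible = np.array([not occluded(d) for d in dirs])
    a = visible.mean() if len(dirs) else 0.0
    if visible.any():
        bent = dirs[visible].sum(axis=0)                  # average unoccluded direction
        bent /= np.linalg.norm(bent)
    else:
        bent = n                                          # fully occluded: fall back to n
    return a, bent
```

For a fully unoccluded point the estimate returns a ≈ 1 and bent ≈ n; as occluders block part of the hemisphere, a drops and the bent normal tilts towards the open directions, which is exactly the behaviour Figure 1 illustrates.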
3 Statistical Modelling
where b = [b_1 … b_{m−1}]^T are parameter vectors. Variance-normalised they are
defined as: b̃ = [b_1/σ_{b,1} … b_{m−1}/σ_{b,m−1}]^T.

o = ō + Σ_{i=1}^{m−1} c_i O_i,    (8)

Note that ō here refers to the intrinsic mean of the bent normals. As in the pre-
viously defined models, d = [d_1 … d_{m−1}]^T is a vector of parameters. Variance-
normalised they are defined as: d̃ = [d_1/σ_{d,1} … d_{m−1}/σ_{d,m−1}]^T.
μ_ml = (1/m) Σ_{i=1}^{m} f_i,    σ²_ml = 1/(m−1−u) Σ_{i=u+1}^{m−1} Σ²_{i,i},    W_ml = U_u (Σ²_u − σ²_ml I)^{1/2}.    (12)
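These maximum-likelihood estimates follow the probabilistic PCA of Tipping and Bishop [13]. The sketch below illustrates them via an eigendecomposition of the sample covariance (equivalent to the SVD-based form up to indexing); the function name and data layout are our own assumptions:

```python
import numpy as np

def ppca_ml(F, u):
    """Maximum-likelihood PPCA estimates, in the spirit of Eq. (12) and
    Tipping & Bishop. F: (m, d) matrix of m training vectors f_i; u: number
    of retained modes. Returns (mu, sigma2, W)."""
    m, d = F.shape
    mu = F.mean(axis=0)
    C = (F - mu).T @ (F - mu) / m           # sample covariance
    lam, U = np.linalg.eigh(C)              # eigenvalues, ascending
    lam, U = lam[::-1], U[:, ::-1]          # reorder descending
    sigma2 = lam[u:].mean()                 # average of the discarded eigenvalues
    W = U[:, :u] @ np.diag(np.sqrt(np.maximum(lam[:u] - sigma2, 0.0)))
    return mu, sigma2, W
```

The retained modes u trade model capacity against noise: σ²_ml absorbs the variance of all discarded directions, so W W^T + σ²_ml I approximates the full covariance.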
The small letter u corresponds to the number of modes used. Our aim is to
estimate f_c given f_k. For data which is jointly Gaussian distributed, the following
marginalisation property holds:
p(f) = p(f_k, f_c) ∼ N( [μ_k; μ_c], [W_k W_k^T  W_k W_c^T; W_c W_k^T  W_c W_c^T] ).    (13)
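Under a joint Gaussian of this form, the conditional mean of f_c given f_k has a closed form. The sketch below illustrates the prediction step; the explicit mean vectors and the pseudo-inverse guard are our own assumptions (the paper's variables may already be mean-centred):

```python
import numpy as np

def predict_fc(fk, mu_k, mu_c, Wk, Wc):
    """Predict f_c from f_k under the jointly-Gaussian model of Eq. (13):
    E[f_c | f_k] = mu_c + Wc Wk^T (Wk Wk^T)^{-1} (f_k - mu_k).
    A pseudo-inverse guards against a rank-deficient Wk Wk^T."""
    K = Wk @ Wk.T                                    # covariance of the known block
    return mu_c + Wc @ Wk.T @ np.linalg.pinv(K) @ (fk - mu_k)
```

This is how an initial estimate of the ambient occlusion/bent normal parameters can be inferred from the shape parameters before refinement against the observed image.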
4 Experiments
We use a 3D morphable model [15] to represent shape and diffuse albedo. Since
we do not have access to the initial training data, we sample from the model
to construct a representative set. This is only required for the shape model, as
neither ambient occlusion nor bent normals exhibit a dependency on texture.
In order to capture the span of the model, we sample ±3 standard deviations
for each of the k = 199 principal components plus the mean shape. Because of
the non-linear relationship between shape and ambient occlusion/bent normals,
we additionally sample 200 random faces from the model. This accounts for a
total of m = 599 training examples. For each sample, we calculate ground-truth
ambient occlusion and bent normals using Meshlab [12]. We retain the m_{a,b,c,d} = 199
most significant modes for each model.
Using a physically based rendering toolkit (PBRT v2 [16]), we render 3D faces
of eight subjects in three pose angles and three illumination conditions. The
subjects are not part of the training set. To cover a wide range, we chose the pose
angles 60°, 0° and 45°. The faces are rendered in the following illumination
environments: "White", "Glacier" and "Pisa", where the latter two are obtained
from [17]. Skin reflectance is composed of additive Lambertian and specular
terms with a ratio of 10/1. The test set consists of 8 × 3 × 3 = 72 images in total.
For each of the samples, we first recover 3D shape and pose from a sparse set of
feature points using the algorithm of [7]. We project the shape into the images and obtain
RGB values for a subset of the model vertices, n = 3p. Using the proposed image
formation process and objective function, we decompose the observations into
its contributions: texture, ambient shading¹, diffuse shading and specular reflec-
tion. We compare three settings: in a reference method, we calculate spherical
harmonic basis functions using surface normals and do not account for ambient
occlusion (Fit A). In a second setting, we use the same basis functions and fit
ambient occlusion (Fit B). And finally, we derive the basis functions from bent
normals and fit ambient occlusion (Fit C). Each of the settings is evaluated
according to three quantitative measures:
1. Texture reconstruction error
2. Fully synthesised model error
3. Illumination estimation accuracy
We also investigate how accurately bent normals are predicted from the joint
model by comparing against ground truth. The last part of this section shows
qualitative reconstructions for texture and full model synthesis and an appli-
cation of illumination transfer. Out-of-sample faces are labeled with three-digit
numbers (001–323).
Table 1. Bent normal approximation error for eight subjects. Errors are measured as
mean angular difference.
Table 2. Texture reconstruction errors averaged over subjects and pose angles. Indi-
vidual entries are ×10⁻³.
Table 3. Full model approximation errors averaged over subjects and pose angles.
Individual entries are ×10⁻³.
Table 4. Light source approximation error for the three methods. Entries represent
angular error averaged over all subjects and pose angles.
In this section, we compare the lighting approximation error for the three methods.
We obtain ground-truth lighting coefficients by rendering a sphere in the same
illumination conditions as the faces. The material properties are also set to be
equal. As the normals and the texture of the sphere are known, we deconvolve
the image formation and extract lighting coefficients. This also makes sense for
white illumination, as with this procedure we obtain the overall magnitude of
the light source intensity. We divide the reflectance vectors by the corresponding BRDF
parameters and use them as ground truth: l_g. For each of the images, we compute
the angular distance between the recovered lighting coefficients l_r and the ground
truth:
E_l = arccos( (l_g · l_r) / (‖l_g‖ ‖l_r‖) ).    (18)
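This angular error is a one-liner in practice; the sketch below clips the normalised dot product against floating-point round-off (the clipping is our addition, not from the paper):

```python
import numpy as np

def angular_error(lg, lr):
    """Angular distance of Eq. (18) between ground-truth and recovered lighting
    coefficient vectors, in radians: arccos of the normalised dot product."""
    c = np.dot(lg, lr) / (np.linalg.norm(lg) * np.linalg.norm(lr))
    return np.arccos(np.clip(c, -1.0, 1.0))   # clip guards against round-off
```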
Results for the experiments are shown in Table 4, where we have averaged over
all subjects and pose angles.
For visual comparison, we show qualitative results for the three methods under
investigation. Figure 2 displays fitting results for various subjects, pose angles
and illumination conditions. The figure shows full model synthesis and texture
reconstructions for methods Fit A and Fit C.
As can be seen in Tables 2 and 3, quantitative differences between methods Fit
B and Fit C are less pronounced. This also applies to perceptual differences.
Method Fit C notably obtains more accurate reconstructions for regions which
are severely occluded. Figure 3 shows magnifications of the full model synthesis of
the nose region for two subjects.
Fig. 2. Comparison of methods Fit A and Fit C for five subjects in different pose angles
and illumination conditions. Top row shows input images. Second row shows full model
synthesis for method Fit A. The third row shows full model synthesis for method Fit
C. The fourth and fifth rows show ground-truth texture (and shape) and texture
reconstruction using method Fit A. The last row shows texture reconstructions using
method Fit C. Labels on top of images are: Face ID, Pose, Illumination, where W, G
and P correspond to White, Glacier and Pisa.
Fig. 3. Comparison of the three methods Fit A, Fit B and Fit C for two subjects.
Close-up of the nose region, face ID: 001 (top) and ID: 014 (bottom).
Fig. 4. Fitting results for one subject (ID: 002) in all pose angles and illumination
conditions. Top row shows input images. Second row shows full model synthesis using
method Fit C. Bottom row shows texture reconstructions. Note that the face shown
does not possess a lower texture reconstruction error than method Fit B.
Fig. 5. Illumination transfer example. Top row and first column show input images
of six subjects. Estimated parameters for shape, pose, diffuse albedo and ambient
occlusion from the columns are combined with lighting estimates obtained from the rows.
5 Conclusions
We have presented a generative method to estimate global illumination in an in-
verse rendering pipeline. To do so, we learn and incorporate a statistical model
of ambient occlusion and bent normals into the image formation process. The
resulting objective function is convex in each parameter set and can be solved ac-
curately and efficiently using alternating least squares. In addition to qualitative
improvements, empirical results show that reconstruction accuracy for texture,
lighting and full model synthesis increases by around 10–18%. In future work,
we would like to explore the performance of the proposed fitting algorithm in a
recognition experiment and consider more complex approximations to full global
illumination.
References
1. Basri, R., Jacobs, D.W.: Lambertian reflectance and linear subspaces. IEEE Trans.
Pattern Anal. Mach. Intell. 25, 218–233 (2003)
2. Langer, M.S., Zucker, S.W.: Shape from shading on a cloudy day. JOSA-A 11,
467–478 (1994)
3. Langer, M.S., Bülthoff, H.H.: Depth discrimination from shading under diffuse light-
ing (2000)
4. Prados, E., Jindal, N., Soatto, S.: A non-local approach to shape from ambient
shading. In: Proc. IEEE Intl. Conf. on Scale Space and Variational Methods in
Computer Vision, pp. 696–708. Springer (2009)
5. Zhukov, S., Iones, A., Kronin, F.: An ambient light illumination model. In: Ren-
dering Techniques, Proceedings of the Eurographics Workshop. Springer (1998)
6. Landis, H.: Production-ready global illumination. Siggraph Course Notes 16 (2002)
7. Aldrian, O., Smith, W.A.P.: A linear approach of 3D face shape and texture recov-
ery using a 3D morphable model. In: Proceedings of the British Machine Vision
Conference, pp. 75.1–75.10. BMVA Press (2010)
8. Aldrian, O., Smith, W.A.P.: Inverse rendering with a morphable model: A mul-
tilinear approach. In: Proceedings of the British Machine Vision Conference,
pp. 88.1–88.10. BMVA Press (2011)
9. Blanz, V., Vetter, T.: A morphable model for the synthesis of 3D faces. In: Proc.
SIGGRAPH, pp. 187–194 (1999)
10. Fletcher, P.T., Joshi, S., Lu, C., Pizer, S.M.: Principal geodesic analysis for the
study of nonlinear statistics of shape. IEEE Trans. Med. Imaging 23, 995–1005
(2004)
11. Smith, W.A.P., Hancock, E.R.: Recovering facial shape using a statistical model of
surface normal direction. IEEE Trans. Pattern Anal. Mach. Intell. 28, 1914–1930
(2006)
12. Visual Computing Laboratory, Institute of the National Research Council of Italy
(Meshlab), http://meshlab.sourceforge.net/
13. Tipping, M.E., Bishop, C.M.: Probabilistic principal component analysis. Journal
of the Royal Statistical Society, Series B 61, 611–622 (1999)
14. Rasmussen, C.E., Williams, C.K.I.: Gaussian processes for machine learning. In:
Adaptive Computation and Machine Learning. MIT Press (2006)
15. Paysan, P., Knothe, R., Amberg, B., Romdhani, S., Vetter, T.: A 3D face model
for pose and illumination invariant face recognition. In: Proc. IEEE Intl. Conf. on
Advanced Video and Signal based Surveillance (2009)
16. Pharr, M., Humphreys, G.: Physically Based Rendering: From Theory to Imple-
mentation. Morgan Kaufmann/Elsevier Science (2010)
17. University of Southern California: High-resolution light probe image gallery (2011),
http://gl.ict.usc.edu/Data/HighResProbes
On Tensor-Based PDEs and Their
Corresponding Variational Formulations
with Application to Color Image Denoising
1 Introduction
In their seminal work, Perona and Malik proposed a non-linear diffusion PDE to
filter an image while retaining lines and structure [1]. This triggered researchers
to investigate well-posedness of the underlying PDE, but also to address the
drawbacks of the Perona–Malik formulation, namely that noise is preserved
along lines and edges in the image structure. A tensor-based image diffusion
formulation, which gained wide acceptance, was proposed by Weickert [2], where
the outer product of the gradient was used to describe the local orientation of
the image structure. As the main novelty of this work, we investigate the case
when a given image diffusion PDE can be considered as a corresponding Euler–
Lagrange (E-L) equation of a functional of the form
J(u) = (1/p) ‖u − u₀‖_p^p + R(u),    (1)
where R(u) is a tensor-based smoothness term, and the Lp-norm, 1 < p < ∞, is
a data term. Furthermore, the derived E-L equation of (1) is applied to a color
image denoising problem.
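Minimizing a functional of this shape is typically done by descending its E-L gradient. The sketch below is a hedged illustration for p = 2 with a caller-supplied gradient of the smoothness term; the paper's R(u) is tensor-based, whereas the parameter names `lam`, `tau` and the Tikhonov example below are our own:

```python
import numpy as np

def denoise_gradient_descent(u0, R_grad, lam=0.1, tau=0.1, iters=100):
    """Explicit gradient descent on a functional of the form of Eq. (1) for
    p = 2: J(u) = (1/2)||u - u0||^2 + lam * R(u). `R_grad(u)` must return the
    gradient of the smoothness term (e.g. a negative Laplacian for Tikhonov
    regularisation); `lam` weights it and `tau` is the step size."""
    u = u0.copy()
    for _ in range(iters):
        u -= tau * ((u - u0) + lam * R_grad(u))   # data-term gradient + smoothness gradient
    return u
```

With the trivial choice R_grad(u) = u, the iteration converges to the closed-form minimizer u0/(1 + lam), which makes the scheme easy to sanity-check before plugging in a tensor-based smoothness term.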
Related Work. State-of-the-art gray-value image denoising achieves impres-
sive results. However, the extension from gray-value image processing to color
images still poses major problems due to the unknown mechanisms which govern
the process describing color perception. One of the most common color spaces
used to represent images is the RGB color space, despite it being known that
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 215-228, 2012.
© Springer-Verlag Berlin Heidelberg 2012
216 F. Astrom, G. Baravdish, and M. Felsberg
the colors (red, green and blue) are correlated due to physical properties such
as the design of color-sensitive filters in camera equipment. In [3] anisotropic
diffusion in the RGB color space is proposed, where the color components are
used to derive a weighted diffusion tensor. One drawback of this formulation
is that color artifacts can appear on the boundary of regions with sharp color
changes. Some alternative color spaces investigated for image diffusion include
manifolds [4], the CMY, HSV, Lab and Luv color spaces [5], luminance and chro-
maticity [6], and a maximal decorrelation approach [7]. Due to the simplicity of
its application we follow the color model used in [7], which was developed in [8].
With regard to variational approaches to image diffusion processes, it has
been shown that non-linear diffusion and wavelet shrinkage are both related to
variational approaches [9]. Also, anisotropic diffusion has been described using
robust statistics of natural images [10]. However, in [10] an update
scheme for the iterative anisotropic process is constructed which contains
an extra convolution of the diffusion tensor. In the paper [11] a tensor-based
functional is given, but the tensor it describes is static, thus the update scheme
becomes linear. Generalizations of the variational formulation exist, e.g., in [12]
the smoothness term is considered as an L^p-norm where p is a one-dimensional
function.
The main contributions of this paper are: (1) we establish a new tensor-based
variational formulation for image diffusion in Theorem 1; (2) in Theorem 2 we
derive necessary conditions such that it is possible to find an energy functional
given a tensor-based PDE; (3) the derived E-L equation of the established image
diffusion functional is applied to a color image denoising problem. It is shown
that the resulting denoising compares favorably to state-of-the-art techniques
such as anisotropic diffusion [3], trace-based diffusion [13,7] and BM3D [14].
In section 2 we briefly review both scalar and tensor-based image diffusion
filtering as a motivation for the established functional. Furthermore, we state
the necessary conditions for a PDE to be considered as an E-L equation to the
same functional. The derived functional is applied to a color image denoising
problem in section 3. The paper is concluded with some final remarks.
Convolving an image with a Gaussian filter reduces the noise variance, but does
not take the underlying image structure, such as edges and lines, into account.
To avoid blurring of lines and edges, Perona and Malik [1] introduced an edge-
stopping function g in the solution of the heat equation,

\partial_t u = \mathrm{div}\big(g(|\nabla u|)\,\nabla u\big) .   (2)

This formulation only depends on the absolute image gradient and hence noise is pre-
served at edges and lines. Weickert [2] replaced the diffusivity function with a
tensor. This allows the filter to reduce noise along edges and lines and simulta-
neously preserve image structure. The tensor-based image diffusion formulation
is

\partial_t u = \mathrm{div}(D\,\nabla u) ,   (3)

where D = D(T_s) is a positive-semidefinite diffusion tensor constructed by scal-
ing the eigenvalues of the structure tensor T_s,

T_s = w * (\nabla u\,\nabla u^t) ,   (4)

where * denotes the convolution operation and w is typically a Gaussian weight
function [15].
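The structure tensor of (4), the Gaussian-smoothed outer product of the image gradient, can be sketched numerically as follows (a minimal numpy sketch; the gradient operator, Gaussian width `sigma`, and boundary handling are assumed choices, not taken from the paper):

```python
import numpy as np

def gaussian_kernel1d(sigma, radius=None):
    """Sampled 1D Gaussian weight function w, normalized to sum 1."""
    r = radius or int(3 * sigma)
    x = np.arange(-r, r + 1)
    g = np.exp(-0.5 * (x / sigma) ** 2)
    return g / g.sum()

def smooth(a, g):
    """Separable 2D 'same'-mode convolution with the 1D kernel g."""
    a = np.apply_along_axis(lambda v: np.convolve(v, g, mode="same"), 0, a)
    return np.apply_along_axis(lambda v: np.convolve(v, g, mode="same"), 1, a)

def structure_tensor(u, sigma=2.0):
    """T_s = w * (grad u grad u^T), eq. (4): componentwise Gaussian
    smoothing of the outer product of the finite-difference gradient."""
    uy, ux = np.gradient(u)
    g = gaussian_kernel1d(sigma)
    return smooth(ux * ux, g), smooth(ux * uy, g), smooth(uy * uy, g)
```

Since the smoothing weights are non-negative, each per-pixel 2x2 tensor remains positive semidefinite, which is what allows the eigenvalue scaling of (3) to be well defined.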
To achieve better denoising results close to lines and edges than the scalar dif-
fusion formulation enables, it is common to define the tensor T which describes
the local orientation of the image structure. Hence the PDE in (7) reads

\partial_t u - \mathrm{div}(T\,\nabla u) = 0 \quad in \Omega, t > 0
\partial_n u = 0 \quad on \partial\Omega, t > 0   (9)
u(x, y, 0) = u_0(x, y) \quad in \Omega

where T(\nabla u) depends on the image gradient. Note that the common formulation
of the tensor T is to let it depend on (x, y) \in \Omega. The ansatz to derive the new
tensor-based image diffusion functional is based on equating (7) and (9) such
that

\mathrm{div}(T\,\nabla u) = \mathrm{div}\Big(\frac{\Phi'(|\nabla u|)}{|\nabla u|}\,\nabla u\Big) .   (10)

The equality in (10) holds if

T\,\nabla u = \frac{\Phi'(|\nabla u|)}{|\nabla u|}\,\nabla u + C ,   (11)
Using this result, we motivate the subsequent theorem by performing the in-
tegration with a tensor defined by the outer product of the gradient: then
\Phi(|\nabla u|) = \int |\nabla u|^3\, d|\nabla u| = \frac{1}{4}|\nabla u|^4 + c, where c is a constant. Setting the constant
to null we obtain the tensor-driven functional

J(u) = \frac{1}{2}\int_\Omega (u - u_0)^2\, dxdy + \frac{1}{4}\int_\Omega \nabla u^t\, T\, \nabla u\; dxdy .   (13)
(u - u_0)^{p-1} - \mathrm{div}(S\,\nabla u) = 0 \quad in \Omega
\partial_n u = 0 \quad on \partial\Omega   (15)

where

S = \begin{pmatrix} \nabla u^t\, T_{u_x} \\ \nabla u^t\, T_{u_y} \end{pmatrix} + T + T^t ,   (16)

and T_{u_x}, T_{u_y} are the derivatives of T(u_x, u_y) with respect to u_x and u_y.
Proof. To find the E-L equation of J(u) we define the smoothness term

R(u) = \int_\Omega \nabla u^t\, T(\nabla u)\, \nabla u\; dxdy .   (17)

Thus

S = \begin{pmatrix} \nabla u^t\, T_{u_x} \\ \nabla u^t\, T_{u_y} \end{pmatrix} + T + T^t ,   (26)

in the E-L equation (15). Now, by taking the derivative of the data term in J(u)
with respect to u, the proof is completed.
Assume that the necessary condition (NC)

s^{(3)} + u_x\, s^{(3)}_{u_x} - u_x\, s^{(1)}_{u_y} = s^{(2)} + u_y\, s^{(2)}_{u_y} - u_y\, s^{(4)}_{u_x}   (27)

holds. Then there exists a symmetric tensor T(u_x, u_y) such that the tensor-based diffusion
PDE

(u - u_0)^{p-1} - \mathrm{div}(S\,\nabla u) = 0 \quad in \Omega
\partial_n u = 0 \quad on \partial\Omega   (28)

where 1 < p < \infty, is the E-L equation to the functional J(u) in (14). Moreover,
the entities in

T(u_x, u_y) = \begin{pmatrix} f(u_x, u_y) & g(u_x, u_y) \\ g(u_x, u_y) & h(u_x, u_y) \end{pmatrix} ,   (29)

are given below. First we consider g(u_x, u_y) = g(\alpha, \beta), where

g(\alpha, \beta) = \frac{1}{\alpha^2}\int \alpha\, R(\alpha, \beta)\, d\alpha + \frac{1}{\alpha^2}\,\psi(\beta) ,   (30)

and \alpha = u_x, \beta = u_x/u_y, and R(\alpha, \beta) = R(u_x, u_y) is the right-hand side of NC
(27). Second, f and h are obtained from
f(u_x, u_y) = \frac{1}{u_x^2}\int_0^{u_x} \xi\,\big(s^{(1)}(\xi, u_y) - u_y\, g_\xi(\xi, u_y)\big)\, d\xi + \frac{1}{u_x^2}\,\varphi(u_y) ,   (31)

h(u_x, u_y) = \frac{1}{u_y^2}\int_0^{u_y} \eta\,\big(s^{(4)}(u_x, \eta) - u_x\, g_\eta(u_x, \eta)\big)\, d\eta + \frac{1}{u_y^2}\,\chi(u_x) ,   (32)
represent the expansion of (26) with the given notation. The function f in (31) is
obtained from (33) by solving

\partial_{u_x}(u_x^2\, f) = u_x\, s^{(1)} - u_x u_y\, g_{u_x} .   (37)

To construct g, differentiate f in (31) with respect to u_y. This gives

f_{u_y}(u_x, u_y) = \frac{1}{u_x^2}\int_0^{u_x} \xi\,\big(s^{(1)}_{u_y}(\xi, u_y) - g_\xi(\xi, u_y) - u_y\, g_{\xi u_y}(\xi, u_y)\big)\, d\xi + \frac{1}{u_x^2}\,\varphi'(u_y) .   (38)

Insert now f_{u_y} into (35) and differentiate with respect to u_x to obtain

u_x\, g_{u_x} + u_y\, g_{u_y} + 2g = s^{(3)} + u_x\, s^{(3)}_{u_x} - u_x\, s^{(1)}_{u_y} .   (39)

In the same way, solve for h in (34) to obtain (32). Using now (34) in (36) yields

u_x\, g_{u_x} + u_y\, g_{u_y} + 2g = s^{(2)} + u_y\, s^{(2)}_{u_y} - u_y\, s^{(4)}_{u_x} .   (40)
Comparing (39) and (40), we deduce the necessary condition (NC) that establishes
the existence of g, f and h. From NC we are now able to compute g satisfying (39),
where R is the right-hand side (or the left-hand side) in NC. Changing variables
\alpha = u_x and \beta = u_x/u_y in (41), we get
Proof. It is straightforward to verify that the entities of the tensor S satisfy the
NC condition with the right-hand side of (27).
Let \tilde{T} be the tensor defined by (47)-(49); then put this tensor in (14) and
obtain

\nabla u^t\, \tilde{T}\, \nabla u = \nabla u^t \begin{pmatrix} u_x^2 - u_y^2 & 2u_x u_y \\ 2u_x u_y & u_y^2 - u_x^2 \end{pmatrix} \nabla u = \nabla u^t \begin{pmatrix} u_x^2 & u_x u_y \\ u_x u_y & u_y^2 \end{pmatrix} \nabla u = \nabla u^t\, T\, \nabla u ,   (50)

neglecting the factor 1/4. This concludes the proof, and (44) follows from (50).
where * is the convolution operator and w is a smooth kernel, such that the
functional

J(u) = \frac{1}{p}\,\|u - u_0\|_p^p + \int_\Omega \nabla u^t\, T(\nabla u)\, \nabla u\; dxdy ,   (52)

has the E-L equation

(u - u_0)^{p-1} - \mathrm{div}(S\,\nabla u) = 0 \quad in \Omega
\partial_n u = 0 \quad on \partial\Omega   (53)

and when applied to the RGB color space, a primary component describing the
average gray-value I and two color-opponent components C = (C_1, C_2) are
obtained.
where V are the eigenvectors and \Lambda the eigenvalues of T [13]. Then \Lambda has the
eigenvalues \lambda_1 and \lambda_2 on its main diagonal such that g(\Lambda) = diag(g(\lambda_1), g(\lambda_2)).
Then we denote by D = D(T) the diffusion tensor defined by (60). With the same
notation, define E = E(N), where N = \begin{pmatrix} \nabla u^t\, T_{u_x} \\ \nabla u^t\, T_{u_y} \end{pmatrix} in (26). Since N is a non-
symmetric tensor, E may be a complex-valued function; however, in practice
Im(E) \approx 0, hence this factor is neglected.
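Constructing a diffusion tensor D = V g(\Lambda) V^t from the per-pixel 2x2 structure tensor can be sketched with a closed-form eigendecomposition (the exponential diffusivity g(\lambda) = exp(-\lambda/k) and the parameter k are assumed illustrative choices, not the paper's exact g):

```python
import numpy as np

def diffusion_tensor(Jxx, Jxy, Jyy, k=1e-2):
    """Per-pixel D = g(l1) v1 v1^T + g(l2) v2 v2^T from 2x2 structure tensor
    fields. Closed-form eigendecomposition of a symmetric 2x2 matrix."""
    tr = Jxx + Jyy
    diff = Jxx - Jyy
    disc = np.sqrt(diff**2 + 4.0 * Jxy**2)
    lam1 = 0.5 * (tr + disc)           # larger eigenvalue
    lam2 = 0.5 * (tr - disc)           # smaller eigenvalue
    # Eigenvector for lam1 is (Jxy, lam1 - Jxx); degenerate when Jxy = 0
    # and lam1 = Jxx, in which case (1, 0) is the correct eigenvector.
    vx, vy = Jxy, lam1 - Jxx
    norm = np.sqrt(vx**2 + vy**2)
    deg = norm < 1e-12
    safe = np.where(deg, 1.0, norm)
    vx = np.where(deg, 1.0, vx / safe)
    vy = np.where(deg, 0.0, vy / safe)
    g1, g2 = np.exp(-lam1 / k), np.exp(-lam2 / k)
    # v2 = (-vy, vx) is orthogonal to v1
    Dxx = g1 * vx**2 + g2 * vy**2
    Dxy = (g1 - g2) * vx * vy
    Dyy = g1 * vy**2 + g2 * vx**2
    return Dxx, Dxy, Dyy
```

A decreasing g suppresses diffusion across the dominant gradient direction (large \lambda_1) while allowing it along edges (small \lambda_2), which is the structure-preserving behavior the paper relies on.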
4.2 Discretization

To find the solution u of (15) with tensors E(N) and D(T), we define the
parabolic equation
4.3 Experiments

The image noise variance, \sigma^2_{est}, is estimated using the technique in [17], which
is the foundation for the estimate of the diffusion parameter in [13] (cf. (47) p.
6); the resulting diffusion parameter for the diffusion filters is

k = \sqrt{\frac{e-1}{e}}\;\sigma^2_{est} ,   (63)

where e is the Euler number. A naive approach is taken to approximate the noise
of the color image: the noise estimate is the average sum of the estimated
noise of each RGB color component. In most cases this approach works reason-
ably well for noise levels below 70 when compared to the true underlying noise
distribution, but note that the image itself contains noise, hence the estimate
will be biased. We use color images from the Berkeley segmentation dataset
[18], more specifically images {87065, 175043, 208001, 8143}.jpg, labeled
as lizard, snake, mushroom and owl (see figure 1). All images predominantly
contain scenes from nature and hence a large number of branches, leaves, grass and
other commonly occurring structures. The estimated noise levels for the different
images can be seen in table 1.
In this work we found that modifying the estimate of the diffusion parameter
to scale with the color transform, k_I = 10^{-1} k and k_c = 10k, yields an improved
result compared to using k without any scaling. The motivation for using differ-
ent diffusion parameters in the different channels is that filtering in the average
                    SSIM                          PSNR
Noise  Est    TR  Proposed  RGB  BM3D      TR  Proposed  RGB  BM3D
Lizard
5 17.0 0.99 0.99 0.98 0.97 38.0 38.0 35.2 34.1
10 19.2 0.96 0.96 0.94 0.95 32.7 32.8 30.6 32.3
20 24.2 0.90 0.91 0.87 0.92 27.3 27.8 26.7 29.3
50 57.1 0.75 0.77 0.68 0.73 22.6 23.5 22.0 23.2
70 67.1 0.65 0.70 0.59 0.66 20.8 21.8 20.5 21.7
Snake
5 15.5 0.98 0.98 0.97 0.96 38.3 38.3 35.3 34.6
10 18.1 0.95 0.95 0.92 0.94 33.0 33.1 30.9 32.6
20 23.5 0.87 0.88 0.83 0.90 27.7 28.4 27.3 29.7
50 53.9 0.69 0.73 0.63 0.69 23.2 24.3 23.0 24.4
70 65.4 0.59 0.64 0.54 0.61 21.5 22.7 21.6 22.9
Mushroom
5 9.9 0.97 0.97 0.95 0.96 38.7 38.9 36.9 38.2
10 13.9 0.92 0.93 0.89 0.93 33.8 34.3 32.8 35.1
20 21.1 0.83 0.86 0.79 0.88 29.3 30.4 29.2 31.8
50 49.0 0.66 0.70 0.58 0.70 25.1 26.0 24.7 26.4
70 60.4 0.57 0.62 0.52 0.63 23.0 23.8 22.8 24.1
Owl
5 61.4 0.99 0.99 0.98 0.78 36.9 37.0 34.6 23.2
10 19.8 0.97 0.97 0.95 0.96 33.0 33.0 30.1 32.0
20 25.7 0.93 0.93 0.88 0.93 27.8 28.0 25.8 28.7
50 65.1 0.80 0.81 0.68 0.72 22.6 23.0 20.5 21.8
70 71.2 0.71 0.73 0.59 0.66 20.4 21.0 18.8 20.5
gray-value channel I primarily affects the image noise and hence it is preferable
to reduce the k-parameter in this channel. However, one must be careful, since
noise may be considered as structure and may be preserved throughout the filter-
ing process. Filtering in the color-opponent components will provide intra-region
smoothing in the color space, and hence color artifacts will be reduced with a
less structure-preserving diffusion parameter.
To illustrate the capabilities of the proposed technique, it is compared to three
state-of-the-art denoising techniques, namely trace-based diffusion (TR) [7], dif-
fusion in the RGB color space [3] and the color version of BM3D [14]. Code
was obtained from the authors of TR-diffusion and BM3D, whereas the RGB-
color space diffusion was reimplemented based on [3] and [2] (cf. p. 95); the
k-parameter in this case is set to the value obtained from (63), since it uses
the RGB color space. To promote research on diffusion techniques we intend to
release the code for the proposed technique.
Fig. 1. Results. Even rows display subregions of the image from the respective
previous row. The first four rows are illustrations using additive Gaussian noise of 50
std, the next two rows of 20 std, and the last two rows of 70 std noise. Best viewed in
color.
For a quantitative evaluation we use the SSIM [19] and the peak signal-to-
noise ratio (PSNR) error measurements. To the best of our knowledge there is no
widely accepted method to evaluate color image perception, hence the error
measurements are computed for each component of the RGB color space and
averaged. Gaussian white noise with {5, 10, 20, 50, 70} standard deviations (std)
is added to the images. Before filtering is commenced, the noise is clipped to
fit the 8-bit image value range. The step length for the diffusion techniques was
set to 0.2 and manually scaled so that the images in figure 1 are obtained after
approximately 1/2 of the maximum number of iterations (100).
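One explicit diffusion step of the form u \leftarrow u + \tau\,\mathrm{div}(D\,\nabla u) can be sketched as follows (a minimal central-difference discretization with the step length \tau = 0.2 used above; this is a generic scheme, not the authors' exact discretization):

```python
import numpy as np

def diffusion_step(u, Dxx, Dxy, Dyy, tau=0.2):
    """One explicit step of du/dt = div(D grad u) with central differences
    and replicated ('edge') boundaries."""
    def dx(a):  # central difference in x
        p = np.pad(a, ((0, 0), (1, 1)), mode="edge")
        return 0.5 * (p[:, 2:] - p[:, :-2])

    def dy(a):  # central difference in y
        p = np.pad(a, ((1, 1), (0, 0)), mode="edge")
        return 0.5 * (p[2:, :] - p[:-2, :])

    ux, uy = dx(u), dy(u)
    # Flux components j = D grad u
    jx = Dxx * ux + Dxy * uy
    jy = Dxy * ux + Dyy * uy
    return u + tau * (dx(jx) + dy(jy))
```

Iterating this update for the stated number of iterations (100) reproduces the overall experimental protocol; a constant image is an exact fixed point, since its gradient and hence its flux vanish.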
The purpose here is not to claim the superiority of any technique; rather, it is
to illustrate the advantages and disadvantages of the different filter methodolo-
gies in various situations. The SSIM and PSNR error measurements are
given to illustrate the strengths of the compared denoising techniques; results
are shown in table 1.
It is clear from table 1 that the established E-L equation performs well in
terms of SSIM and PSNR values compared to other state-of-the-art denoising
techniques used for color images. However, for images (such as the mushroom
image) with large, approximately homogeneous surfaces, BM3D is the favored
denoising technique. This is due to the many similar patches which the algorithm
averages over to reduce the noise variance. On the other hand, diffusion tech-
niques perform better in high-frequency regions due to the local description of
the tensor. This is also depicted in figure 1, where the diffusion techniques pre-
serve the image structure in the lizard, snake and owl images, whereas BM3D
achieves perceptually good results for the mushroom image.
5 Conclusion

In this work we have given necessary conditions for a tensor-based image diffusion
PDE to be the E-L equation of an energy functional. Furthermore, we applied the
derived E-L equation in Theorem 1 to a color image denoising application with
results comparable to state-of-the-art techniques. Generalization of Proposition
1 and its complete proof will be subject to further study.

Acknowledgment. This research has received funding from the Swedish Re-
search Council through a grant for the project Non-linear adaptive color image
processing, from the EC's 7th Framework Programme (FP7/2007-2013), grant
agreement 247947 (GARNICS), and by ELLIIT, the Strategic Area for ICT
research, funded by the Swedish Government.
References

1. Perona, P., Malik, J.: Scale-space and edge detection using anisotropic diffusion.
IEEE Trans. Pattern Analysis and Machine Intelligence 12, 629-639 (1990)
2. Weickert, J.: Anisotropic Diffusion in Image Processing. ECMI Series. Teubner-
Verlag, Stuttgart (1998)
Kernelized Temporal Cut for Online Temporal Segmentation and Recognition

1 Introduction
Temporal segmentation of human motion sequences (motion capture data, 2.5D depth
sensor or 2D videos), i.e., temporally cutting sequences into segments with different se-
mantic meanings, is an important step in building an intelligent framework to analyze
human motion. Temporal segmentation can be applied to motion animations [1], ac-
tion recognition [2-4], video understanding and activity analysis [5, 6]. In particular,
temporal segmentation is crucial for human action recognition. Most recent works in
human activity recognition focus on simple primitive actions such as walking, running
and jumping, in contrast to the fact that daily activity involves complex temporal pat-
terns (walking, then sitting down and standing up). Thus, recognizing such complex activities
relies on accurate temporal structure decomposition [7].
Previous work on temporal segmentation can mainly be divided into two categories.
On one side, many works in statistics, i.e., either offline or online (quickest) change-
point detection [8], are restricted to univariate (1D) series whose distribution
is assumed to be known in advance [9]. Because of the complex structure of motion
dynamics, these works are not suitable for temporal segmentation of human actions.
On the other side, temporal clustering has been proposed for unsupervised learning of
human motions [10]. However, temporal clustering is usually performed offline, and is thus
not suitable for applications such as realtime action segmentation and recognition.
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 229-243, 2012.
© Springer-Verlag Berlin Heidelberg 2012
230 D. Gong et al.
Fig. 1. Online Hierarchical Temporal Segmentation. A 22-sec input sequence is temporally cut
into two segments: a walking segment (S1), which is further cut into 6 action units, and a jumping
segment (S2), which is further cut into 4 action units.
Fig. 2. System Flowchart. Input can be either Mocap, 2.5D depth sensor or 2D videos. Outputs
are temporal cuts for both action transitions and cyclic action units.
2 Related Work

Temporal segmentation is a multifaceted area, and several related topics in machine learn-
ing, statistics, computer vision and graphics are discussed in this section.
Change-Point Detection. Most of the work in statistics, i.e., offline or quickest (on-
line) change-point detection (CD) [8], is restricted to univariate (1D) series and
a parametric distribution assumption, which does not hold for human motions with com-
plex structure. [16] uses undirected sparse Gaussian graphical models and jointly performs
structure estimation and segmentation. Recently, as a nonparametric extension
of Bayesian online change-point detection (BOCD) [9], [17] was proposed to combine
BOCD and Gaussian processes (GPs) to relax the i.i.d. assumption in a regime. Although
GPs improve the ability to model complex data, they also bring in high computational
cost. More relevant to us, kernel methods have been applied to non-parametric change-
point detection on multivariate time series [18, 19]. In particular, [18] (KCD) utilizes a
one-class SVM as the online training method and [19] (KCpA) performs sequential seg-
mentation based on the Kernel Fisher Discriminant Ratio. Unlike all the above works,
KTC can not only detect action transitions but also cyclic motions.
Temporal Clustering. Clustering is a long-standing topic in machine learning [20,
21]. Recently, as an extension of clustering, some works focus on how to correctly
segment time series into different temporal clusters. As an elegant combination of Ker-
nel K-means and spectral clustering, Aligned Cluster Analysis (ACA) was developed for
temporal clustering of facial behavior, with a multi-subject correspondence algorithm
for matching facial expressions [6]. To estimate the unknown number of clusters, [22]
uses the hierarchical Dirichlet process as a prior to improve the switching linear dynamical
system (SLDS). Most of these works segment time series offline and provide cluster
labels as in clustering. As a complementary approach, KTC performs online temporal
segmentation, which is suitable for realtime applications.
Motion Analysis. In computer vision and graphics, some works focus on grouping
human motions. Unusual human activity detection is addressed in [5] using (bipar-
tite) graph spectral clustering. [23] extracts spatio-temporal features to address event
clustering on video sequences. [10] proposes a geometric-invariant temporal clustering
algorithm to cluster facial expressions. More relevantly, [1] proposes an online algo-
rithm to decompose motion sequences into distinct action segments. Their method is an
elegant temporal extension of Probabilistic Principal Component Analysis for change-
point detection (PPCA-CD), which is computationally efficient but restricted to (ap-
proximately) Gaussian assumptions.
Action Recognition. Although significant progress has been made in human activ-
ity recognition [3, 4, 7, 24], the problem remains inherently challenging due to view-
point changes, partial occlusion and spatio-temporal variations. By combining KTC
with alignment approaches such as [25, 15], we can perform online action recognition
for input from a 2.5D depth sensor. Unlike other works on supervised joint segmentation
and recognition [26], two significant features of our approach are viewpoint indepen-
dence and the ability to handle an arbitrary person with only a few labeled Mocap
sequences, via the transfer learning module.
over time), the goal of temporal segmentation is to predict temporal cut points c_i. For
instance, if a person walks and then boxes, a temporal cut point must be detected. For
depth sensor data, x_t is the vector representation of tracked joints. More details of x_t
are given in sec. 6. From a machine learning perspective, the estimated {c_i}_{i=1}^{N_c} can be
obtained by minimizing

L_X(\{c_i\}_{i=1}^{N_c}, N_c) = \sum_{i=1}^{N_c} I(X_{c_{i-1}:c_i-1}, X_{c_i:c_{i+1}-1})   (1)

where X_{c_i:c_{i+1}-1} \in R^{D\times(c_{i+1}-c_i)} indicates the segment between two cut points c_i and
c_{i+1} (c_1 = 1, c_{N_c+1} = L_x + 1). Here I(\cdot) is the homogeneity function that measures the
spatio-temporal consistency between two consecutive segments. It is worth noting that
both \{c_i\}_{i=1}^{N_c} and N_c need to be estimated from eq. 1. Next, the main task is to design
I(\cdot) and to optimize eq. 1 online. As the counterpart, eq. 1 could be optimized offline
by dynamic programming when N_c is given, which is out of the scope of this paper.
3.2 KTC-S

Instead of jointly optimizing eq. 1, the proposed Kernelized Temporal Cut (KTC) se-
quentially optimizes c_{i+1} based on c_i by minimizing the following loss function. A way
to improve this process is described in sec. 3.3. Essentially, eq. 2 is a two-class temporal
clustering problem for X_{c_i:c_i+T-1} \in R^{D\times T}. The crucial factor is constructing I(\cdot),
which is related to temporal versions of (dis)similarity functions in spectral clustering [20,
21, 10] and information-theoretic clustering [27].
To handle the complex structure of human motion, unlike previous work, KTC uti-
lizes Hilbert space embedding of distributions (HED) to map the distribution of X_{t_1:t_2}
into a Reproducing Kernel Hilbert Space (RKHS). [11, 12] are seminal works on com-
bining kernel methods and probability distribution analysis. Without going into details,
the idea of using HED for temporal segmentation is straightforward: a change-point
is detected by using a well-behaved (smooth) kernel function whose values are large
on samples belonging to the same spatio-temporal pattern and small on samples
from different patterns. By doing this, KTC not only handles nonparametric and
high-dimensionality problems but also rests on a solid theoretical foundation [11].
HED. Inspired by [12], probability distributions can be embedded in an RKHS. At the
center of the Hilbert space embedding of distributions are the mean mapping functions,

\mu(P_x) = E_x[k(x, \cdot)], \quad \mu(X) = \frac{1}{T}\sum_{t=1}^{T} k(x_t, \cdot)   (3)

where \{x_t\}_{t=1}^{T} are assumed to be sampled i.i.d. from the distribution P_x. Under mild
conditions, \mu(P_x) (same for \mu(X)) is an element of the Hilbert space, as follows:

\langle \mu(P_x), f \rangle = E_x[f(x)], \quad \langle \mu(X), f \rangle = \frac{1}{T}\sum_{t=1}^{T} f(x_t)

I(X_{1:T_1}, Y_{1:T_2}) = \frac{2}{T_1 T_2}\sum_{i,j} k(x_i, y_j) - \frac{1}{T_1^2}\sum_{i,j} k(x_i, x_j) - \frac{1}{T_2^2}\sum_{i,j} k(y_i, y_j)   (4)
Combining eq. 2 and eq. 4, c_{i+1} is estimated by minimizing the following function in
matrix formulation:

L_{X_{c_i:c_i+T-1}}(c_{i+1}) = (E_T^{1:\Delta c_i})^H\, K^{KTC}_{c_i:c_i+T-1}\, E_T^{\Delta c_i+1:T}, \quad E_T^{1:\Delta c_i} = \frac{1}{\Delta c_i}\, e_T^{1:\Delta c_i}, \; E_T^{\Delta c_i+1:T} = \frac{1}{d_i}\, e_T^{\Delta c_i+1:T}   (5)

where \Delta c_i and d_i are short notations for c_{i+1} - c_i and c_i + T - c_{i+1}. Here e_T^{t_1:t_2} \in R^{T\times 1} is a
binary vector with 1 for positions from t_1 to t_2 and 0 for the others, and K^{KTC}_{c_i:c_i+T-1} \in R^{T\times T}
is the kernel matrix with entries

k^{KTC}(x_i, x_j) = k_S(x_i, x_j)\, k_T(x_i, x_j) = k_S(x_i, x_j)\, k_T(\theta(x_i), \theta(x_j))   (6)

where k_S(\cdot) is the spatial kernel and k_T(\cdot) is the temporal kernel. \theta(x) is the esti-
mated local tangent space at point x. k_S(\cdot) and k_T(\cdot) can be chosen according to
domain knowledge or as universal kernels such as the Gaussian. For instance, the canonical
correlation analysis (CCA) kernel [28] is used for joint-position features as

k_S(x_i, x_j) = \exp(-\sigma_S\, \|x_i - x_j\|^2)
k_T(\theta(x_i), \theta(x_j)) = \exp(-\sigma_T\, \angle(\theta(x_i), \theta(x_j))^2)   (8)

where \sigma_S is the kernel parameter for k_S(\cdot) and \sigma_T is the kernel parameter for k_T(\cdot).
\angle(\cdot) denotes the principal angle between two subspaces (ranging from 0 to \pi/2).
In short, the spatio-temporal kernel k^{KTC} captures both the spatial and temporal dis-
tributions of the data (see the visual example in Fig. 3), which makes it suitable for modeling
structured sequential data. As special cases, k^{KTC} degenerates to the spatial kernel if
\sigma_T \to 0 and to the temporal kernel if \sigma_S \to 0.

Optimization. Unlike the NP-hard optimization in spectral clustering [21], eq. 5 can
be efficiently solved because the feasible region of c_{i+1} is [c_i + 1, c_i + T - 1], allowing
us to search the entire space to minimize L(c_{i+1}). For each step, minimizing eq. 5 requires
at most O(T^2) accesses of k^{KTC}(\cdot).
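The exhaustive search over the feasible region can be sketched as follows. This is one reading of the eq. 5 statistic, simplified to the mean cross-segment similarity under the spatial Gaussian kernel k_S only (the temporal tangent-space kernel k_T and the kernel parameter `sigma_s` are omitted/assumed here):

```python
import numpy as np

def cross_loss(K, s):
    """Mean kernel similarity between window parts [0:s) and [s:T);
    small values indicate that the two parts follow different patterns."""
    return float(K[:s, s:].mean())

def find_cut(X, sigma_s=0.5):
    """Search every feasible split of a T-frame window X (rows = frames)
    for the one minimizing the cross-segment similarity."""
    T = X.shape[0]
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sigma_s * sq)            # spatial kernel k_S, cf. eq. 8
    return min(range(1, T), key=lambda s: cross_loss(K, s))
```

Since the feasible region contains at most T - 1 candidate splits and each evaluation reads a block of the precomputed kernel matrix, the cost stays polynomial in T, matching the O(T^2) kernel-access bound stated above.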
3.3 KTC-R

Sequential optimization of eq. 1 was given in sec. 3.2. However, this process may not be
suitable for realtime applications. A key feature of human motion is temporal variation,

1. M is the number of 3D joints from Mocap or depth sensor.
Kernelized Temporal Cut for Online Temporal Segmentation and Recognition 235
Fig. 3. An illustration of KTC-R. Left: tracked joints from the depth sensor; right: K^{KTC}_{1:190} \in
R^{190\times 190} for window X_{1:190}. The decision to make no cut between frames 1 to 110 is made
before the current window, with a maximum delay of T_0 = 80 frames.
i.e., one action can last for a long time or only a few seconds. Thus, it is difficult to use
a fixed-length-T sliding window to capture transitions. Small values of T cause over-
segmentation and large values of T cause large delays (T = 300 for the depth sensor results
in a 10-sec delay). To overcome this problem, we combine the incremental sliding-window
strategy [1] and the two-sample test [14, 13] to design a realtime algorithm for eq. 5, i.e.,
KTC-R (Fig. 3).
Given X_{1:L_x} = [x_1, ..., x_{L_x}] \in R^{D\times L_x}, KTC-R sequentially processes the varying-
length window X_t = [x_{n_t}, ..., x_{n_t+T_t}] at step t. This process starts from n_1 = 1 and
T_1 = 2T_0, where T_0 is the pre-defined shortest possible action length. At step t (assume
the last cut is c_i), if there is no captured action transition point, the following updating
process is performed,

n_{t+1} = n_t, \quad T_{t+1} = T_t + \Delta T   (9)

else if there is a transition point,

where \Delta T is the step length for increasing the window. This process ends when n_t \geq L_x -
T_0. As shown in eq. 9 and eq. 10, X_{1:L_x} is sequentially processed and all cuts c_i are
estimated when the algorithm receives the (c_i + T_0 - 1)-th frame (same for non-cut
frames). This fact indicates KTC-R has a fixed-length time delay T_0, as shown in Fig. 3.
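The incremental sliding-window loop can be sketched as follows. This is a simplified illustration: the cut decision uses a mean cross-similarity statistic with a fixed threshold `tau`, standing in for the paper's eq. 5 statistic with its adaptive Rademacher-bound threshold; `T0`, `dT`, and `sigma` are assumed toy values:

```python
import numpy as np

def cross_sim(X, split, sigma=0.5):
    """Mean Gaussian-kernel similarity between the two parts of a window;
    a stand-in for the eq. 5 statistic."""
    A, B = X[:split], X[split:]
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return float(np.mean(np.exp(-sigma * sq)))

def ktc_r(X, T0=10, dT=5, tau=0.05):
    """Incremental sliding-window segmentation: grow the window by dT while
    no cut is detected (eq. 9) and restart it after a detected cut. Each
    decision is made at frame n_t + T_t - T0, i.e., with a fixed delay T0."""
    Lx, cuts = len(X), []
    n, T = 0, 2 * T0                     # 0-based n_1 = 1, T_1 = 2*T0
    while n + T <= Lx:
        cand = n + T - T0                # candidate cut position
        if cross_sim(X[n:n + T], cand - n) < tau:
            cuts.append(cand)
            n, T = cand, 2 * T0          # re-initialize after the cut
        else:
            T += dT                      # eq. 9: enlarge the window
    return cuts
```

Note that no optimization over split positions is performed inside the loop; each window contributes only one test at the fixed lagged position, which is what makes the realtime variant cheap.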
At each step, deciding on a cut (at frame n_t + T_t - T_0) is equivalent to the following
hypothesis test,

H_0: \{x_i\}_{i=n_t}^{\bar{n}_t-1} and \{x_i\}_{i=\bar{n}_t}^{n_t+T_t-1} are the same; \quad H_A: not H_0   (11)

where \bar{n}_t is the short notation for n_t + T_t - T_0. Eq. 11 is re-written by combining eq. 5
as follows,

L_t = (E_{T_t})^H\, K^{KTC}_{n_t:n_t+T_t-1}\, \tilde{E}_{T_t}
H_0: L_t \geq \tau_t : \bar{n}_t is not a cut; \quad H_A: L_t < \tau_t : \bar{n}_t is a cut   (12)

where E_{T_t} and \tilde{E}_{T_t} are the normalized segment-indicator vectors of eq. 5 for the split at
\bar{n}_t, and \tau_t is the adaptive threshold for the hypothesis test (12). In fact, eq. 12 is directly
inspired by [14], which proposes a kernelized two-sample test method. L_t is analogous
to the maximum mean discrepancy,

MMD[F, X_{1:T_1}, Y_{1:T_2}] = \Big(\frac{1}{T_1^2}\sum_{i,j=1}^{T_1} k(x_i, x_j) - \frac{2}{T_1 T_2}\sum_{i,j=1}^{T_1,T_2} k(x_i, y_j) + \frac{1}{T_2^2}\sum_{i,j=1}^{T_2} k(y_i, y_j)\Big)^{\frac{1}{2}}   (13)

if the same kernel as in MMD is used as the spatial kernel in k^{KTC}(\cdot) (eq. 6) and k_T(\cdot)
degenerates to 1 as \sigma_T \to 0. Based on eq. 14, \tau_t is set as B_R(t) plus a fixed global
offset, where B_R(t) is an adaptive threshold calculated from the Rademacher bound [14];
the global offset is the only non-trivial parameter in KTC-R (used to control the
coarse-to-fine level of segmentation).
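The biased empirical MMD of eq. 13 can be computed directly from the three kernel blocks (a minimal numpy sketch; the Gaussian kernel and its width `sigma` are assumed choices):

```python
import numpy as np

def mmd(X, Y, sigma=0.5):
    """Biased empirical MMD (eq. 13) between two samples X and Y (rows are
    observations) under a Gaussian kernel k(a, b) = exp(-sigma * ||a - b||^2)."""
    def k(A, B):
        sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
        return np.exp(-sigma * sq)
    stat = k(X, X).mean() - 2.0 * k(X, Y).mean() + k(Y, Y).mean()
    return float(np.sqrt(max(stat, 0.0)))   # clamp fp negatives before sqrt
```

The statistic is zero for identical samples and grows toward sqrt(2) for a bounded kernel as the two samples become completely dissimilar, which is what makes a distribution-free threshold such as the Rademacher bound applicable.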
Analysis. In summary, both KTC-S and KTC-R are based on eq. 5. The main dif-
ferences are: KTC-S performs segmentation by sequential optimization in a two-class
temporal clustering manner, while KTC-R performs segmentation using an incremental
sliding window in a two-sample test manner. KTC-R requires more sliding windows than
KTC-S, but for each one there is no optimization, and accessing k^{KTC}(\cdot) O(T_t \Delta T)
times is enough (linear in T_t). Only when a new cut is detected is O(T_t^2) accessing
required. Thus, KTC-R is extremely efficient and suitable for realtime applications.
It is notable that, even if the fixed-length sliding-window method (sec. 3.2) is im-
proved to make the decision whether a cut happens or not in X_{c_i:c_i+T-1}, a small T
is still not reliable for realtime applications. The reason is that a clear temporal cut for
human motion requires a large number of observations before and after the cut. Indeed,
the required number of frames varies from action to action, even for manual annotation.
motions are not identical because of intra-person variations, which brings challenges
for KAC.
KAC. As an online algorithm, KAC utilizes the sliding-window strategy. Each window
X_{a_j+n_t-T_m:a_j+n_t-1} is sequentially processed, starting from n_1 = 2T_m, a_1 = c_i,
where a_j is the j-th action unit cut. T_m is a parameter giving the minimal length of an
action unit. We empirically find that results are insensitive to T_m.
For each window X_{a_j+n_t-T_m:a_j+n_t-1}, this process has two branches. The last
action unit continues: n_{t+1} = n_t + \Delta T_m; or there is a new action unit: a_{j+1} =
a_j + n_t - T_m, n_{t+1} = 2T_m. Here \Delta T_m is the step length. This process ends when
a new cut point c_{i+1} is received. Deciding whether X_{a_j+n_t-T_m:a_j+n_t-1} is the start of a
new unit or not can be formulated as

S_t = S_{Align}(X_{a_j:a_j+T_m-1}, X_{a_j+n_t-T_m:a_j+n_t-1})
H_0: S_t \leq \tau_t : a_j + n_t - T_m is a new unit; \quad H_A: S_t > \tau_t : not H_0   (15)

where S_{Align}(\cdot) is the metric measuring the structural similarity between X_{a_j:a_j+T_m-1}
and X_{a_j+n_t-T_m:a_j+n_t-1}, to handle intra-person variations. \tau_t is an adaptive threshold
(empirically set by cross-validation) and ideally should be close to zero if alignment
can perfectly leverage variations. Similar to KTC-R, KAC has delay T_m. In particular,
KAC uses dynamic time warping (DTW) [15, 6] to design S_{KAC}(\cdot) by minimizing the
following loss function based on the kernel from eq. 6,
S_{KAC}\big(K^{a_j+n_t-T_m:a_j+n_t-1}_{a_j:a_j+T_m-1};\; W_1, W_2\big)   (16)

where K is the cross-kernel matrix for the two segments, and W_1 and W_2 are binary
temporal warping matrices encoding the temporal alignment path, as shown in [15].
Interested readers are referred to [15, 6] for more details about S(\cdot). Eq. 16 can be opti-
mized using dynamic programming with complexity O(T_m^2), and S_{KAC}(\cdot) measures
the similarity between the current action unit (a part) and the current window. Impor-
tantly, alignment methods such as DTW are not suitable for eq. 12. This is because
alignment requires two segments to have roughly the same starting and ending points,
which does not hold in eq. 12.
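The dynamic-programming alignment underlying this kind of similarity can be sketched with classic DTW (a generic Euclidean-cost sketch, not the paper's kernel-based S_KAC with warping matrices W_1, W_2):

```python
import numpy as np

def dtw_cost(A, B):
    """Classic DTW alignment cost between two sequences of feature vectors
    (rows are frames), computed by dynamic programming in O(n*m)."""
    n, m = len(A), len(B)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(A[i - 1] - B[j - 1])
            # Extend the cheapest of the three admissible alignment moves
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])
```

A zero cost means one sequence is a temporal warping of the other, which is exactly the intra-person variation (a cycle executed slower or faster) that S_Align is meant to absorb.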
KTC-H. By combining KTC-R and KAC, we can sequentially and simultaneously capture
action transitions (cuts) and action units in the integrated algorithm KTC-H. Formally,
KTC-H uses a two-layer sliding-window strategy, i.e., the outer loop (sec. 3.3) to
estimate c_i and the inner loop to estimate a_j between c_i and the current frame of the
outer loop. Since KTC-R (eq. 12) and KAC (eq. 15) both have fixed delays (T_0 and T_m),
KTC-H is suitable for realtime transition and action unit segmentation.
Discussion. We compare with several related algorithms: (1) Spectral clustering [21]
can be extended to temporal clustering by only allowing temporal cuts (TSC) [10]. Sim-
ilarly, minimizing eq. 5 can be viewed as an instance of TSC motivated by embedding
distributions into an RKHS. (2) PPCA-CD was proposed in [1] to model motion segments
by Gaussian models, where CD stands for change-point detection. Compared to [1],
KTC has higher computational cost but gains the ability to handle nonparametric dis-
tributions. (3) KTC is similar to KCpA [19], which uses the novel kernel Fisher dis-
criminant ratio. Compared to [19], KTC performs change-point detection by using the
where K can be M or M̃ when indicating noisy and possibly occluded tracking tra-
jectories from depth sensor data obtained using the OpenNI tracker. The spatial and temporal
variations between X and X_i^{mocap} are handled by DMW.
6 Experimental Validation
We evaluate the performance of our system from two aspects: (1) temporal segmenta-
tion on depth sensor, Mocap and video, (2) online action recognition on depth sensor.
Fig. 4. Examples of Online Temporal Segmentation. Top (Depth Sensor): a sequence includes
3 segments: walking, boxing and jumping. Noisy joint trajectories are tracked by OpenNI.
Middle (Mocap): a sequence has 4579 frames and 7 action segments. Bottom (Video): a clip
includes walking and running. In all cases, KTC-R achieves the highest accuracy.
KTC-S achieves similar accuracy to KTC-R but with a longer delay; details are omitted
due to lack of space. Results are very robust to T0 and T. For instance, we obtained almost
identical results when T0 ranges from 60 to 120 on the OpenNI data.
Depth Sensor. To validate online temporal segmentation on depth sensor, 10 human
motion sequences are captured by the PrimeSense sensor. Each sequence is a com-
bination of 3 to 5 actions (e.g., walking to boxing) with length around 700 frames
(30Hz). For human pose tracking, we use the available OpenNI tracker to automatically
track joints on the human skeleton. K ∈ [12, 15] key points are tracked, resulting in
joint 3D positions x_t in R^36 to R^45. Although human pose tracking results are often noisy
(Fig. 4 and Fig. 5), we can correctly estimate action transitions from these noisy tracking
Table 1. Temporal Segmentation Results Comparison. Precision (P), recall (R) and rand index
(RI) are reported
results. In particular, KTC-R (T = 30) significantly improves the accuracy over other
methods (Table 1). The main reason is that the positions of the noisily tracked joints have
complex nonparametric structures, which are handled by the kernel embedding of distribu-
tions [14, 13, 12] in KTC.
KTC-H. Besides action transitions, results on detecting both cyclic motions and tran-
sitions are reported by performing KTC-H (Tm = 50, Tm = 1). Since other methods
don't have this module, we report a quantitative comparison on online hierarchical seg-
mentation by using KTC-H or other methods plus our KAC algorithm in sec. 4. Results
show (Table 2) that KTC-H achieves higher accuracy than the other combinations. It is
notable that, because of the nature of RI, the RI metric increases when the number of
cuts increases, even for low P/R, which is the case for hierarchical segmentation (including
two types of cuts).
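The behavior of RI noted above can be checked with a direct implementation of the Rand index over frame pairs. This is a generic sketch; `rand_index` is a hypothetical helper, not the evaluation code used for Table 1:

```python
from itertools import combinations

def rand_index(seg_a, seg_b):
    """Rand index between two label sequences: the fraction of frame pairs
    on which the two segmentations agree about being in the same segment
    vs. in different segments."""
    pairs = list(combinations(range(len(seg_a)), 2))
    agree = sum((seg_a[i] == seg_a[j]) == (seg_b[i] == seg_b[j])
                for i, j in pairs)
    return agree / len(pairs)

print(rand_index([0, 0, 1, 1], [0, 0, 1, 1]))  # → 1.0
print(rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # low agreement
```

Because RI counts pairwise agreements, over-segmenting (many cuts) tends to keep the "different-segment" pairs correct, inflating RI even when precision and recall of the cuts themselves are low.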
Mocap. Similar to [1, 6], M = 14 joints are used to represent the human skele-
ton, resulting in joint-quaternions of joint angles in 42D. Online temporal segmentation
methods are tested on 14 selected Mocap sequences from subject 86 in CMU Mocap
database. Each sequence is a combination of roughly 10 action segments and in total
there are around 10^5 frames (120Hz). Since the implementation of PPCA-CD differs
from [1] (e.g., only a forward pass is allowed in our experiments), results are not the
same as in [1]. Table 1 shows that the gain of KTC-R (T = 50) over other methods
on Mocap is reduced compared with the depth sensor data. This is because the Gaussian
property is more likely to hold for the quaternion representation of noiseless Mocap data,
which is not the case for real data in general.
Video. Furthermore, KTC-R is performed on a number of sequences from
HumanEva-2, which is a benchmark for human motion analysis [30]. Silhouettes are
extracted by background subtraction, resulting in a sequence of binary masks (60Hz).
x_t ∈ R^{D_t} is set as the vector representation of the mask at frame t. It is notable that
D_t (the size of the mask) in different frames may not be identical, so PPCA-CD cannot be
applied. This fact supports the advantage of KTC, which is applicable for complex se-
quential data as long as a (pseudo) kernel can be defined. In particular, we follow [6]
to compute the matching distance of silhouettes to set the kernel. Results are shown in
Fig. 4 and Table 1. As a reference, the state-of-the-art offline temporal clustering method
ACA achieves higher accuracy than KTC-R on Mocap (96% precision). However, offline
methods (1) are not suitable for real-time applications, and (2) require the number
of clusters (segments) to be set in advance, which is not applicable in many cases.
Fig. 5. Online action segmentation and recognition on 2.5D depth sensor. Top to bottom:
depth image sequences, KTC-H results and action recognition results. For segmentation, the blue
line indicates a cut and different rectangles indicate different action units. The blue circle indicates
noisy tracking results. For recognition: distance to labeled Mocap sequences, and inferred 3D
motion sequences.
sequences, based on proper features. Tracked trajectories from OpenNI in an action unit
(segmented by KTC-H) are associated with labeled Mocap sequences from 10 action
categories.
Although OpenNI tracking results are often noisy (highlighted by blue circles in
Fig. 5), we achieve 85% recognition accuracy (Acc) from these noisy tracking results
(Table 2), without any additional training on depth sensor data. This result benefits
not only from DMW [25], but also from KTC-H. DMW requires that the input
contain only one action unit, while KTC-H performs a critical missing step, i.e., accurate
online temporal segmentation, needed before recognition can be performed. As illustrated in Table
2, the accuracy on OpenNI is enhanced from 0.71 to 0.85, which strongly supports the
effectiveness of KTC-H. Furthermore, complete and accurate 3D human motion
sequences can be inferred by association with the learned manifolds from Mocap.
7 Conclusion
In this paper, KTC and a hierarchical extension are designed, which can detect both
action transitions and action units. Based on KTC, we achieve realtime temporal
segmentation on Mocap data, motion sequences from a 2.5D depth sensor, and 2D videos.
Furthermore, temporal segmentation is combined with alignment, resulting in realtime
action recognition on depth sensor input, without the need for training data from the
depth sensor.
Acknowledgments. This work was supported in part by NIH Grant EY016093 and
Grant DE-FG52-08NA28775 from the U.S. Department of Energy.
References
1. Barbic, J., Safonova, A., Pan, J.Y., Faloutsos, C., Hodgins, J.K., Pollard, N.S.: Segmenting
motion capture data into distinct behaviors. In: Proc. Graphics Interface, pp. 185–194 (2004)
2. Lv, F., Nevatia, R.: Recognition and Segmentation of 3-D Human Action Using HMM and
Multi-class AdaBoost. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006, Part IV.
LNCS, vol. 3954, pp. 359–372. Springer, Heidelberg (2006)
3. Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from
movies. In: Proc. CVPR (2008)
4. Weinland, D., Ozuysal, M., Fua, P.: Making Action Recognition Robust to Occlusions and
Viewpoint Changes. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part III.
LNCS, vol. 6313, pp. 635–648. Springer, Heidelberg (2010)
5. Zhong, H., Shi, J., Visontai, M.: Detecting unusual activity in video. In: Proc. CVPR,
pp. 816–823 (2004)
6. Zhou, F., De la Torre, F., Hodgins, J.K.: Hierarchical aligned cluster analysis for temporal
clustering of human motion. Under review at IEEE PAMI (2011)
7. Niebles, J.C., Chen, C.-W., Fei-Fei, L.: Modeling Temporal Structure of Decomposable Mo-
tion Segments for Activity Classification. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.)
ECCV 2010, Part II. LNCS, vol. 6312, pp. 392–405. Springer, Heidelberg (2010)
8. Chen, J., Gupta, A.: Parametric Statistical Change-point Analysis. Birkhäuser (2000)
9. Adams, R.P., MacKay, D.J.: Bayesian online changepoint detection. University of Cam-
bridge Technical Report (2007)
10. De la Torre, F., Campoy, J., Ambadar, Z., Conn, J.F.: Temporal segmentation of facial be-
havior. In: Proc. ICCV (2007)
11. Hofmann, T., Schölkopf, B., Smola, A.: Kernel methods in machine learning. Annals of
Statistics 36, 1171–1220 (2008)
12. Smola, A.J., Gretton, A., Song, L., Schölkopf, B.: A Hilbert Space Embedding for Dis-
tributions. In: Hutter, M., Servedio, R.A., Takimoto, E. (eds.) ALT 2007. LNCS (LNAI),
vol. 4754, pp. 13–31. Springer, Heidelberg (2007)
13. Gretton, A., Fukumizu, K., Harchaoui, Z., Sriperumbudur, B.: A fast, consistent kernel two-
sample test. In: NIPS, vol. 19, pp. 673–681 (2009)
14. Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the
two-sample-problem. In: NIPS, vol. 19, pp. 513–520 (2007)
15. Zhou, F., De la Torre, F.: Canonical time warping for alignment of human behavior. In: NIPS,
vol. 22, pp. 2286–2294 (2009)
16. Xuan, X., Murphy, K.: Modeling changing dependency structure in multivariate time series.
In: Proc. ICML (2007)
17. Saatci, Y., Turner, R., Rasmussen, C.: Gaussian process change point models. In: Proc. ICML
(2010)
18. Desobry, F., Davy, M., Doncarli, C.: An online kernel change detection algorithm. IEEE
Trans. on Signal Processing 53, 2961–2974 (2005)
Kernelized Temporal Cut for Online Temporal Segmentation and Recognition 243
19. Harchaoui, Z., Bach, F., Moulines, E.: Kernel change-point analysis. In: NIPS, vol. 21,
pp. 609–616 (2009)
20. Ng, A.Y., Jordan, M.I., Weiss, Y.: On spectral clustering: Analysis and an algorithm. In:
NIPS, vol. 14, pp. 849–856 (2002)
21. von Luxburg, U.: A tutorial on spectral clustering. Statistics and Computing 17, 395–416
(2007)
22. Fox, E., Sudderth, E., Jordan, M., Willsky, A.: Nonparametric bayesian learning of switching
linear dynamical systems. In: NIPS, vol. 21, pp. 457–464 (2009)
23. Zelnik-Manor, L., Irani, M.: Statistical analysis of dynamic actions. IEEE PAMI 28,
1530–1535 (2006)
24. Satkin, S., Hebert, M.: Modeling the Temporal Extent of Actions. In: Daniilidis, K., Mara-
gos, P., Paragios, N. (eds.) ECCV 2010, Part I. LNCS, vol. 6311, pp. 536–548. Springer,
Heidelberg (2010)
25. Gong, D., Medioni, G.: Dynamic manifold warping for view invariant action recognition. In:
Proc. ICCV, pp. 571–578 (2011)
26. Hoai, M., Lan, Z., De la Torre, F.: Joint segmentation and classification of human actions in
video. In: Proc. CVPR (2011)
27. Faivishevsky, L., Goldberger, J.: A nonparametric information theoretic clustering algorithm.
In: Proc. ICML, pp. 351–358 (2010)
28. Bach, F.R., Jordan, M.I.: Kernel independent component analysis. JMLR 3, 1–48 (2003)
29. Laptev, I., Belongie, S., Perez, P., Wills, J.: Periodic motion detection and segmentation via
approximate sequence alignment. In: Proc. ICCV, pp. 816–823 (2005)
30. Sigal, L., Balan, A.O., Black, M.J.: Humaneva: Synchronized video and motion capture
dataset and baseline algorithm for evaluation of articulated human motion. IJCV 87, 4–27
(2010)
Grain Segmentation of 3D Superalloy Images
Using Multichannel EWCVT
under Human Annotation Constraints
1 Introduction
For different industrial applications, engineers need to choose proper kinds of su-
peralloys with desired mechanical or physical properties, such as lightness, hard-
ness, stiffness, electrical conductivity and fluid permeability [2]. Such properties
are related to the micro-structures of the superalloy material, which is usually
made up of a set of grains [3]. Thus, in the research and development of
superalloy materials, the major task is to identify the superalloy micro-structures
so that their mechanical and physical properties can be evaluated. Currently, ma-
terial scientists mainly conduct manual annotations/segmentations on 2D super-
alloy image slices and then combine the 2D results to reconstruct the 3D grains.
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 244–257, 2012.
© Springer-Verlag Berlin Heidelberg 2012
Grain Segmentation of 3D Superalloy Images 245
Given the large number of grains in a superalloy sample and the large number
of slices in its high-resolution 3D microscopic images, this manual-segmentation
process is very tedious, time consuming and prone to error. Manual segmentation
becomes even more difficult when the microscopic superalloy images are multi-
channel images, where different channels reflect different parameter settings of
the electron microscope. For example, Figure 1 shows four channels of a single
2D slice, from which we can see that adjacent grains may be distinguishable
by their intensity in certain channels but not in others. As a result, we
may need to manually segment the image in each channel and then combine the
segmentation results from the different channels.
In principle, many existing segmentation methods may be useful to relieve
material scientists from this manual segmentation. 2D segmentation methods,
such as mean shift [4], watershed [5], statistical region merging [6], normalized
cuts [7], graph cuts [8], level set [9] and watershed cuts [10, 11], can be ap-
plied to segment 2D slices. In particular, in [12], a region-merging segmentation
method called the stabilized inverse diffusion equation and a stochastic seg-
mentation method called the expectation-maximization/maximization of the
posterior marginals (EM/MPM) algorithm are combined for 2D Ni-based su-
peralloy image segmentation. However, all the above 2D segmentation methods
still cannot effectively and accurately match the segments across adjacent
slices to achieve the desired 3D grain segmentation.
Fig. 1. Microscopic images of a slice of the superalloy sample taken by using 4 different
microscope settings. Small black points in the images are carbides. (a) 4000 Series. (b)
5000 Series. (c) 6000 Series. (d) 7000 Series.
Let N denote the number of image channels of the superalloy volume, i.e., we
have N images, u_1, u_2, . . . , u_N, of the same superalloy sample taken under N
different microscope parameter settings. The set of intensity values of the 3D
superalloy images U and the generators W can be written in vector form as

U = {u(i, j, k) = (u_1(i, j, k), . . . , u_N(i, j, k))^T ∈ R^N}_{(i,j,k)∈D}

and

W = {w_l = (w_{l1}, w_{l2}, . . . , w_{lN})^T ∈ R^N}_{l=1}^L,

respectively. Note here L denotes the number of generators and D = {(i, j, k) | i =
1, . . . , I, j = 1, . . . , J, k = 1, . . . , K} is an index set for the superalloy volume
domain. The multichannel clustering energy can be defined as

E_C(W; D) = Σ_{l=1}^L Σ_{(i,j,k)∈D_l} ρ(i, j, k) ||u(i, j, k) − w_l||^2   (1)
where ℓ(i, j, k) : D → {1, . . . , L} tells which cluster u(i, j, k) belongs to. Then,
similar to [17], an edge energy can be defined as

E_L(D) = Σ_{(i,j,k)∈D} Σ_{(i',j',k')∈N(i,j,k)} 1(ℓ(i, j, k) ≠ ℓ(i', j', k')).   (2)

When the density function is defined as ρ = 1 + |∇u|, the total energy function,
i.e., the multichannel edge-weighted clustering energy, is defined as

E(W; D) = E_C(W; D) + λ E_L(D),

where λ > 0 balances the two terms. It was shown in [1] that (W; D) is a minimizer
of E(W; D) only if (W; D) forms a multichannel edge-weighted centroidal Voronoi
tessellation (MCEWCVT), i.e., W are simultaneously the corresponding centroids
of the associated multichannel edge-weighted Voronoi regions {D_l}_{l=1}^L.
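The interplay of the data term and the edge term can be illustrated with a toy 1-D version of the edge-weighted clustering energy. This is an illustrative sketch only, assuming ρ = 1 + |∇u| and a Potts-style edge count; `clustering_energy` is a hypothetical helper, not the paper's implementation:

```python
def clustering_energy(u, labels, centers, lam=1.0):
    """Toy 1-D edge-weighted clustering energy:
    data term: sum_p rho(p) * (u[p] - w_{l(p)})^2 with rho = 1 + |grad u|,
    edge term: lam * #{neighboring pairs with different labels}."""
    n = len(u)
    grad = [abs(u[min(p + 1, n - 1)] - u[p]) for p in range(n)]
    data = sum((1 + grad[p]) * (u[p] - centers[labels[p]]) ** 2
               for p in range(n))
    edge = lam * sum(labels[p] != labels[p + 1] for p in range(n - 1))
    return data + edge

u = [0.0, 0.1, 0.0, 0.9, 1.0, 1.0]       # two intensity "grains"
centers = [0.05, 0.95]                   # cluster generators
good = clustering_energy(u, [0, 0, 0, 1, 1, 1], centers)
bad = clustering_energy(u, [0, 1, 0, 1, 0, 1], centers)
print(good < bad)  # → True
```

The coherent labeling wins on both terms: its voxels sit near their generators, and it pays the edge penalty only once, at the true grain boundary.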
[Fig. 2: annotated 2D grain regions g_n^m on the constraint slices S1–S4, panels (a)–(d).]
where g_n^m denotes the n-th 2D grain region on the m-th constraint superalloy
image slice, and N_m is the number of grains on the m-th constraint superalloy image
according to the annotated segmentation. It satisfies

g_n^m ∩ g_{n'}^m = ∅,  if n ≠ n'.   (8)
Thus the constraints over the tessellation (or, say, the grouped clusters) D can be
defined mathematically as:

C1(D) = {ℓ(p) = ℓ(q), ∀p, q ∈ g_n^m}
C2(D) = {ℓ(p) ≠ ℓ(q), ∀p ∈ g_n^m, q ∈ g_{n'}^m, where g_n^m and g_{n'}^m are neighbor grains}   (9)

where ℓ(·) denotes the generic form of the clustering function, which provides a
cluster label for a given voxel. Thus we can define the 3D multichannel super-
alloy image segmentation problem under such human annotation constraints to
be a constrained energy minimization problem of the form

(W; D) = arg min_{(W;D)} E(W; D)  subject to  C = {C1(D) & C2(D)}.   (10)
We would like to point out that the above-defined constraints only take
effect on the voxels within the same constraint 2D superalloy slice, but it is ex-
pected that those constraints will propagate their effects to other non-constraint
slices during the solution process to give improved grain segmentations.
which can group U^c with no adjacent grains on the same constraint image slice
belonging to the same cluster. We then define the centers of these clusters as
the initial generators for the CMEWCVT. See Algorithm 1 for a description
of the whole procedure.
250 Y. Cao, L. Ju, and S. Wang
Algorithm 1. (W^c = {w_l^c}_{l=1}^L, D^S = {D_l^S}_{l=1}^L, L) = ConstrainedGenerators(U^c, G, S, L_0)
1: INPUT: Human annotated segmentation on certain superalloy image slices S. The
   average intensities U^c of grain regions in S. Grain neighboring matrix G. Initial
   cluster number guess L_0.
2: START:
3: L = L_0 and randomly initialize L cluster generators W = {w_l}_{l=1}^L.
4: Run the classic K-means on U^c with W to obtain a clustering of U^c.
5: D^S ← {D_l^S}_{l=1}^L: if u_n^m belongs to the cluster with generator w_l, then all voxels in
   g_n^m are in D_l^S.
6: if there exist neighbor grains g_n^m and g_{n'}^m which belong to the same cluster then
7:    L = L + 1
8:    w_L = u_n^m
9:    W ← {W, w_L}
10:   Go to 4.
11: else
12:   W^c = W
13:   Return (W^c, D^S, L).
14: end if
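The split-and-recluster idea of Algorithm 1 can be sketched in a simplified 1-D form. This is a hedged stand-in, not the paper's implementation: `kmeans_1d` and `constrained_generators` are hypothetical helpers operating on scalar grain mean intensities:

```python
def kmeans_1d(values, gens, iters=10):
    """Tiny 1-D k-means on scalar intensities; returns (generators, labels)."""
    for _ in range(iters):
        assign = [min(range(len(gens)), key=lambda l: abs(v - gens[l]))
                  for v in values]
        gens = [(sum(v for v, a in zip(values, assign) if a == l)
                 / assign.count(l)) if l in assign else gens[l]
                for l in range(len(gens))]
    return gens, assign

def constrained_generators(means, neighbors, gens0):
    """While two *neighboring* annotated grains share a cluster, add one of
    their mean intensities as a fresh generator (L = L + 1) and re-cluster."""
    ids = sorted(means)
    gens = list(gens0)
    while True:
        gens, labels = kmeans_1d([means[g] for g in ids], gens)
        assign = dict(zip(ids, labels))
        clash = next(((p, q) for p, q in neighbors
                      if assign[p] == assign[q]), None)
        if clash is None:
            return gens, assign          # no neighbor grains share a cluster
        gens.append(means[clash[0]])     # w_L = clashing grain's mean intensity

means = {"g1": 0.10, "g2": 0.12, "g3": 0.80}   # mean intensity per grain
gens, assign = constrained_generators(means, {("g1", "g2")}, [0.0, 1.0])
print(len(gens), assign["g1"] != assign["g2"])  # → 3 True
```

Here g1 and g2 are nearly identical in intensity but annotated as touching grains, so the loop keeps adding generators until they end up in different clusters — exactly the role the C2 constraint plays in the initialization.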
where (i, j, k) refers to the voxels on the non-constraint image slices, and D^S is given
by Algorithm 1. Thus, we can define the constrained multichannel edge-weighted
clustering energy as

E(W^c; D; S) = E_C(W^c; D; S) + λ E_L(D)
             = Σ_{l=1}^L Σ_{(i,j,k)∈D_l^c} (1 + |∇u(i, j, k)|) ||u(i, j, k) − w_l^c||^2
               + λ Σ_{(i,j,k)∈D} Σ_{(i',j',k')∈N(i,j,k)} 1(ℓ(i, j, k) ≠ ℓ(i', j', k')).   (13)
From Eqn. (5), it is also easy to find that when W^c and S are fixed, the con-
strained multichannel edge-weighted Voronoi tessellation D = {D_l^c}_{l=1}^L associated
with W^c and S corresponds to the minimizer of the constrained multichan-
nel edge-weighted clustering energy E(W^c; D; S), i.e.,
Algorithm 2. {D_l^c}_{l=1}^L = CMEWVT(u, {w_l^c}_{l=1}^L, {D_l^c}_{l=1}^L, S)
go to 3.
such that

w̃_l^c = arg min_{w_l^c} Σ_{(i,j,k)∈D_l^c} ρ(i, j, k) ||u(i, j, k) − w_l^c||^2   (15)
Definition (CMEWCVT). For a given constrained multichannel edge-weighted
Voronoi tessellation ({w_l^c}_{l=1}^L; {D_l^c}_{l=1}^L; S) of D, we call it a constrained mul-
tichannel edge-weighted centroidal Voronoi tessellation (CMEWCVT) of D if
the generators {w_l^c}_{l=1}^L are also the corresponding centroids of the associated
constrained multichannel edge-weighted Voronoi regions {D_l^c}_{l=1}^L, i.e.,

w_l^c = w̃_l^c,  l = 1, 2, . . . , L.
Algorithm 3. ({w_l^c}_{l=1}^L, {D_l^c}_{l=1}^L) = CMEWCVT(u, L_0, S, G)
9: If {D̃_l^c}_{l=1}^L and {D_l^c}_{l=1}^L are the same, set {w_l^c}_{l=1}^L = {w̃_l^c}_{l=1}^L, return
({w_l^c}_{l=1}^L; {D_l^c}_{l=1}^L) and exit; otherwise, set D_l^c = D̃_l^c for l = 1, . . . , L and go
to step 7.
The resolution within a 2D superalloy image slice is 5 times higher than the
resolution between adjacent image slices. In order to control the smoothness and
compactness of the segmentation of the 3D superalloy image, we followed [1] by
linearly interpolating the 3D superalloy image with 4 more slices between each two
original slices. The interpolated superalloy volume image has 846 slices. In order
to run the CMEWCVT with more iterations in reasonable time, we downsized
the 3D volume image to half of its size along each direction and applied the
proposed CMEWCVT algorithm on this downsized superalloy volume image.
Finally, we scaled the segmentation results back to the original resolution for per-
formance evaluation. There are six parameters/factors that can be controlled
in CMEWCVT: L, the number used as input for constructing the initial cluster gen-
erators; λ, the weighting parameter which balances the clustering energy term
and the edge energy term; ω, the neighbor size which defines the local search
region N(i, j, k) for each voxel (i, j, k); ε, the predefined threshold of the stop
condition of CMEWCVT; and S, the set of constraint slices with human-annotated
segmentation.
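The interpolation step above is consistent with the stated slice count: 170 original slices with 4 linearly interpolated slices between each adjacent pair gives 170 + 169·4 = 846 slices. It can be sketched as follows; `interpolate_slices` is a hypothetical helper, not the code used in the paper:

```python
def interpolate_slices(slices, k=4):
    """Insert k linearly interpolated slices between each pair of original
    slices (each slice is a flat list of intensities)."""
    out = []
    for a, b in zip(slices, slices[1:]):
        out.append(a)
        for i in range(1, k + 1):
            t = i / (k + 1)
            out.append([(1 - t) * x + t * y for x, y in zip(a, b)])
    out.append(slices[-1])
    return out

vol = [[0.0, 0.0], [1.0, 1.0]]           # two 1-row "slices"
print(len(interpolate_slices(vol)))       # → 6: 2 originals + 4 interpolated
```

With 170 input slices the same routine yields 846, matching the text.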
In the experiments, we chose L = 5 and λ = 500. Considering the size of carbides
and the energy balance (see discussions in [17]), we set ω = 6 and N(i, j, k) to
be a sphere centered at voxel (i, j, k). Theoretically, Algorithm 3 (CMEWCVT)
stops when the energy function Eqn. (13) is completely minimized. However, in
practical applications, we can set the stop condition as

|E_{i+1} − E_i| / E_i < ε   (16)

where E_i denotes the CMEWVT energy at the i-th iteration and ε is a prede-
fined threshold. In the experiments, we selected ε = 10^{-4}. In addition, we chose
different numbers of constraint slices and analyzed their effect on the segmentation
accuracy. In our experiments, we tried 3, 6, 9, 12, 15, 18 and 21 constraint slices.
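The stop rule of Eqn. (16) is a standard relative-change convergence test, which can be sketched generically. `run_until_converged` and the toy energy update are hypothetical, not the CMEWCVT solver:

```python
def run_until_converged(step, E0, eps=1e-4, max_iter=1000):
    """Iterate until |E_{i+1} - E_i| / E_i < eps, as in Eqn. (16).
    `step` maps the current energy to the next one."""
    E, it = E0, 0
    while it < max_iter:
        E_next = step(E)
        if abs(E_next - E) / E < eps:
            return E_next, it + 1
        E, it = E_next, it + 1
    return E, it

# toy energy that decays geometrically toward a floor of 100
E_final, iters = run_until_converged(lambda E: 100 + 0.5 * (E - 100), 1000.0)
print(round(E_final, 1))  # → 100.0
```

Note the division by E_i assumes a positive energy, which holds here since both terms of Eqn. (13) are non-negative.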
To quantitatively evaluate the accuracy and performance of the CMEWCVT
algorithm, we calculated its boundary accuracy against the manually annotated
ground-truth segmentation and compared the boundary accuracy with other
segmentation/boundary-detection methods. Besides the MCEWCVT [1], we also
compared the segmentation accuracy with mean shift [4], SRM [6], the pbCanny edge
detector [20], the pbCGTG edge detector [20], gPb [21] and watershed [5].
For the MCEWCVT algorithm [1] and the proposed CMEWCVT algorithm,
we examine the segmentation boundaries on the 170 non-interpolated image
slices. For the other 2D comparison methods, we tested them directly on the 170 non-
interpolated slices. For the segmentation methods that are developed only for
single-channel images, we applied them to each of the four channels indepen-
dently and then combined the segmentation results from all four channels to get
a unified segmentation for each slice. Specifically, for the Pb edge detectors, at each
pixel we took the maximum Pb value over all four channels. For the other com-
parison methods, we combined the segmentations from all four channels using a
logical OR operation.
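The two per-channel combination rules described above (pixelwise maximum for Pb values, logical OR for binary boundary maps) can be sketched directly; both helper names are hypothetical:

```python
def combine_pb(channels):
    """Per-pixel maximum of soft Pb values over the channels."""
    return [max(vals) for vals in zip(*channels)]

def combine_binary(channels):
    """Logical OR of binary boundary maps over the channels."""
    return [any(vals) for vals in zip(*channels)]

pb = [[0.1, 0.9, 0.2], [0.3, 0.2, 0.8]]   # two channels, three pixels
print(combine_pb(pb))                      # → [0.3, 0.9, 0.8]
masks = [[0, 1, 0], [1, 0, 0]]
print(combine_binary(masks))               # → [True, True, False]
```

Both rules keep a boundary as soon as any single channel supports it, which matches the observation that adjacent grains may be separable in some channels but not others.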
For the quantitative comparison, we used the segmentation evaluation code pro-
vided in the Berkeley segmentation benchmark [20] to calculate the precision-recall
values and the F-measure (the harmonic mean of precision and recall), which is of
the form

F = 2 · (precision · recall) / (precision + recall).
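As a quick illustration of the formula (the helper name is hypothetical):

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(0.8, 0.6), 3))  # → 0.686
```

Being a harmonic mean, F is pulled toward the smaller of the two values, so a method cannot score well by trading recall for precision or vice versa.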
For the CMEWCVT algorithm, we evaluated the average best F-measure for
Case I: over all non-interpolated slices except the constraint slices; and Case II:
over all non-interpolated slices including the constraint slices. For the other
comparison algorithms, including MCEWCVT and the 2D image segmentation
algorithms, we evaluated the average best F-measure over all 170 superalloy
image slices.
For the Pb-based edge detectors, varying the threshold on a soft boundary
map produces a series of binarized boundary images, from which a precision-
recall curve can be calculated and a best F-measure can be obtained to measure
the segmentation performance. For the other comparison methods, the binary seg-
mentation boundaries lead to a single precision-recall value, from which we
can compute an F-measure. Generally, the higher the F-measure, the better the
segmentation. From Table 1, it is easy to see that the proposed CMEWCVT al-
gorithm achieves the best F-measures of 93.6% and 94.3% in Case I and Case II,
respectively, which significantly outperforms the MCEWCVT (F-measure: 89%)
and the other six 2D segmentation/edge-detection methods. Here, we set ω = 6 and
λ = 500, and use 21 constraint slices.
Fig. 3. Visual comparisons (two slices, (a) and (b)) between the segmentation results
of the proposed CMEWCVT algorithm and the MCEWCVT algorithm
Table 2. Comparison of the segmentation accuracy when using different numbers
of constraint image slices. The numbers in this table are calculated without taking the
constraint image slices into account (Case I).
Table 3. Comparison of the segmentation accuracy when using different numbers
of constraint image slices, taking the constraint image slices into account (Case II)
From Fig. 4, another observation from the experiments is that the improve-
ment of the average F-measure from CMEWCVT is contributed by almost all
the superalloy image slices. In other words, by introducing the constraint slices,
the segmentation on almost all the other slices is improved, instead of just the
few slices near the constraint slices. This indicates that the prior knowledge pro-
vided by the constraint slices can be propagated to the whole 3D volume image
by the proposed CMEWCVT. In addition, almost all the slices have an
F-measure around 0.9 or even higher, except for the 40-th slice, which has an
F-measure around 0.7 due to its poor imaging quality.
Fig. 4. F-measure computed from each superalloy image slice when using different
numbers of constraint slices
6 Conclusions
In this paper, we developed a constrained multichannel edge-weighted centroidal
Voronoi tessellation (CMEWCVT) algorithm for 3D superalloy
grain image segmentation. Compared to the previous MCEWCVT algorithm,
the proposed algorithm can incorporate the human-annotated segmentation on
a small set of superalloy slices as constraints for the MCEWCVT energy mini-
mization. As a result, the professional prior knowledge of material scientists
can be incorporated to produce much better segmentation results, as
demonstrated by our experiments.
References
1. Cao, Y., Ju, L., Zou, Q., Qu, C., Wang, S.: A multichannel edge-weighted centroidal
Voronoi tessellation algorithm for 3D superalloy image segmentation. In: CVPR,
pp. 17–24 (2011)
2. Reed, R.C.: The Superalloys: Fundamentals and Applications. Cambridge Press
(2001)
3. Voort, G.F.V., Warmuth, F.J., Purdy, S.M.: Metallography: Past, Present, and
Future (75th Anniversary Vol.). ASTM Special Technical Publication (1993)
4. Comaniciu, D., Meer, P.: Mean shift: A robust approach toward feature space
analysis. TPAMI 24, 603–619 (2002)
5. Meyer, F.: Topographic distance and watershed lines. Signal Processing 38,
113–125 (1994)
6. Nock, R., Nielsen, F.: Statistical region merging. TPAMI 26, 1452–1458 (2004)
7. Shi, J., Malik, J.: Normalized cuts and image segmentation. TPAMI 22, 888–905
(1997)
8. Felzenszwalb, P.F.: Efficient graph-based image segmentation. IJCV 59, 167–181
(2004)
9. Li, C., Xu, C., Gui, C., Fox, M.D.: Level set evolution without re-initialization: A
new variational formulation. In: CVPR, pp. 430–436 (2005)
10. Couprie, C., Najman, L., Grady, L., Talbot, H.: Power watersheds: A new im-
age segmentation framework extending graph cuts, random walker and optimal
spanning forest. In: ICCV, pp. 731–738 (2009)
11. Cousty, J., Bertrand, G., Najman, L., Couprie, M.: Watershed cuts: thinnings,
shortest-path forests and topological watersheds. TPAMI 32, 925–939 (2010)
12. Chuang, H.-C., Huffman, L.M., Comer, M.L., Simmons, J.P., Pollak, I.: An auto-
mated segmentation for Nickel-based superalloy. In: ICPR, pp. 2280–2283 (2008)
13. Boykov, Y., Funka-Lea, G.: Graph cuts and efficient N-D image segmentation.
IJCV 70, 109–131 (2006)
14. Grady, L.: Random walks for image segmentation. TPAMI 28, 1768–1783 (2006)
15. Whitaker, R.T.: A level-set approach to 3D reconstruction from range data.
IJCV 29, 203–231 (1998)
16. Magee, D., Bulpitt, A., Berry, E.: Level set methods for the 3D segmentation of
CT images of abdominal aortic aneurysms. In: Medical Image Understanding and
Analysis, pp. 141–144 (2001)
17. Wang, J., Ju, L., Wang, X.: An edge-weighted centroidal Voronoi tessellation model
for image segmentation. IEEE TIP 18, 1844–1858 (2009)
18. Powell, M.J.D.: An efficient method for finding the minimum of a function of several
variables without calculating derivatives. Computer Journal 7, 152–162 (1964)
19. Du, Q., Faber, V., Gunzburger, M.: Centroidal Voronoi tessellations: applications
and algorithms. SIAM Review 41, 637–676 (1999)
20. Martin, D.R., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natu-
ral images and its application to evaluating segmentation algorithms and measuring
ecological statistics. In: ICCV, pp. 416–423 (2001)
21. Arbelaez, P., Maire, M., Fowlkes, C., Malik, J.: Contour detection and hierarchical
image segmentation. TPAMI 33, 898–916 (2011)
Hough Regions for Joining
Instance Localization and Segmentation
1 Introduction
Detecting instances of object categories in cluttered scenes is one of the main
challenges in computer vision. Standard recognition systems define this task as
localizing category instances in test images up to a bounding box representation.
In contrast, in semantic segmentation each pixel is uniquely assigned to one of a
set of pre-defined categories, where overlapping instances of the same category
are indistinguishable. Obviously, these tasks are quite interrelated, since the
segmentation of an instance directly delivers its localization, and a bounding box
detection significantly eases the segmentation of the instance.
This work was supported by the Austrian Research Promotion Agency (FFG) under
the projects CityFit (815971/14472-GLE/ROD) and MobiTrick (8258408) in the
FIT-IT program and SHARE (831717) in the IV2Splus program and the Austrian
Science Fund (FWF) under the projects MASA (P22299) and Advanced Learning
for Tracking and Detection in Medical Workflow Analysis (I535-N23).
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 258–271, 2012.
© Springer-Verlag Berlin Heidelberg 2012
Hough Regions for Joining Instance Localization and Segmentation 259
Fig. 1. Hough regions (yellow) find reasonable instance hypotheses across scales
and classes (left) despite strongly smeared vote maps (middle). By jointly optimiz-
ing for both segmentation and detection, we identify even strongly overlapping in-
stances (right) compared to standard bounding box non-maximum suppression.
Both tasks have recently enjoyed vast improvements and we are now able
to accurately recognize, localize and segment objects in separate stages using
complex processing pipelines. Problems still arise where recognition is confused
by background clutter or occlusions, localization is confused by proximate fea-
tures of other classes, and segmentation is confused by ambiguous assignments
to foreground or background.
In this work we propose an object detection method which combines instance
localization with its segmentation in an implicit manner for jointly finding op-
timal solutions. Related work in this field, as will be discussed in detail in the
next section, either tries to combine detection and segmentation in separate
subsequent stages [1] or aims at full scene understanding by jointly estimating
a complete scene segmentation of an image for all object categories [2].
We place our method in between these two approaches, since we combine in-
stance localization and segmentation in a joint framework. We localize object
instances by extracting so-called Hough regions, which exploit the available informa-
tion from a generalized Hough voting method, as illustrated in Figure 1.
Hough regions are an alternative to the de-facto standard of bounding box non-
maximum suppression or vote clustering, as they are directly optimized in the
Hough vote space. Since we consider maxima regions instead of single-
pixel maxima, articulation and scale become far less important. Our method is
thus more tolerant against diverging scales, and the number of overall scales needed
in testing is reduced. Further, we implicitly provide instance segmentations and
increase the recall in scenes containing heavily overlapping objects.
2 Related Work
layout when assigning the labels and connects each Hough transform with a part
to extract multiple objects. Yang et al. [15] propose a layered object model for
image segmentation, which defines shape masks and explains the appearance,
depth ordering, and labels of all pixels in an image. Ladicky et al. [2] combine
semantic segmentation and object detection into a framework for full scene un-
derstanding. Their method trains various classifiers for "stuff" and "things" and
incorporates them into a coherent cooperating scene explanation. So far, only
Ladicky et al. managed to incorporate information about object instances, their
location and spatial extent as important cues for a complete scene understand-
ing, which makes it possible to answer questions like what, where and how many object
instances can be found in an image. However, their system is designed for dif-
ficult full scene understanding and not for an object detection task, as they only
integrate detector responses and infer from the corresponding bounding boxes.
This limits the ability to increase the detection recall or to improve the accuracy
of instance localization.
Our method improves the accuracy of object detectors by using the objects'
supporting features for joint segmentation and reasoning between object instances. We achieve an increase in both recall and precision through the
ability to better handle articulations and diverging scales. Additionally, we do
not require a parameter-sensitive non-maximum suppression, since our method
delivers a unique segmentation per object instance. Note that, in contrast to
related methods [5,16], these segmentation masks are provided without learning
from ground-truth segmentation masks for each category.
image and corresponding object center votes Hi into the Hough space. We further assume that we are given p(C|Xi, Di) per voting element, which measures
the likelihood that a voting element Xi belongs to category C by analyzing a
local descriptor Di. This could be, for example, feature-channel differences between randomly
drawn pixel pairs as in [17]. All this information can be obtained directly from
generalized Hough methods; see the experimental section for implementation
details.
The goal of our method is to use the data provided by the generalized Hough
voting method to explain a test image by classifying each pixel as background
or as belonging to a specific instance of an object category. We formulate this
problem as Bayesian labeling of a first-order Conditional Random Field (CRF),
minimizing a global energy term to derive an optimal labeling.
Such random field formulations are widespread, especially in the related field
of semantic segmentation, e.g. [20]. In a standard random field X, each pixel is
represented as a random variable Xi ∈ X, which takes a label from the set L =
{l1, l2, . . . , lK} indicating one of K pre-defined object classes, e.g. grass, people,
cars, etc. Additionally, an arbitrary neighborhood relationship N is provided,
which defines cliques c. A clique c is a subset of the random variables Xc, where
i, j ∈ c holds if and only if i ∈ Nj and j ∈ Ni, i.e. they are mutually neighboring with respect to
the defined neighborhood system N.
In our case, we not only want to assign each pixel to an object category,
we additionally aim at providing a unique assignment to a specific instance of
a category in the image, which is a difficult problem if category instances are
highly overlapping. For the sake of simplicity, we define our method for the single-class case, but the method is easily extendable to a multi-class setting.
We again represent each pixel as a random variable Xi with i = 1 . . . N, where
N is the number of pixels in the test image, and aim at assigning each pixel a
category-specific instance label from the set L = {l0, l1, . . . , lL}, where L is the
number of instances. We use the label l0 to assign a pixel to the background. We
seek the optimal labeling x of the image, taken from the set L = L^N.
The optimal labeling minimizes the general energy term E(x), defined as

E(x) = −log P(x|D) − log Z = Σ_{c∈C} ψ_c(x_c),   (1)
i.e. the unary potentials describe the likelihood of a pixel being assigned to one of the
identified seed regions or to the background. Since we do not have a background
likelihood, the corresponding values for l0 are set to a constant value pback,
which defines the minimum class likelihood we require. The term drives
our label assignments to correctly distinguish background from actual object
hypotheses.
The unary potential analyzes the likelihood of assigning a pixel to one of the
defined seed regions, e.g. by analyzing local appearance information in comparison
to the seed region. In general, any kind of modeling scheme is applicable in this
step. We model each subset Sl ⊆ X by a Gaussian Mixture Model (GMM) Gl and
define color potentials for assigning a pixel to each instance by
The corresponding likelihoods for the background class l0 are set to log(1 −
max_{l∈L} p(Xi|Gl)), since the background is mostly too complex to model. The
term ensures that pixels are assigned to the correct instances by considering the
appearance of each instance.
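The color term can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: each seed region is modeled by a small diagonal-covariance GMM (two components, as in the experiments), the per-instance cost is the negative log of the clipped likelihood p(Xi|Gl), and the background receives one minus the best instance likelihood. The EM routine, the clipping constants, and all function names are our own assumptions.

```python
import numpy as np

def fit_gmm(colors, k=2, iters=20, seed=0):
    """Fit a k-component diagonal-covariance GMM to Nx3 color samples via EM."""
    rng = np.random.default_rng(seed)
    n, d = colors.shape
    mu = colors[rng.choice(n, k, replace=False)].astype(float)
    var = np.full((k, d), colors.var(axis=0) + 1e-3)
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibilities under each diagonal Gaussian component
        log_p = -0.5 * (((colors[:, None, :] - mu) ** 2) / var
                        + np.log(2 * np.pi * var)).sum(axis=2) + np.log(w)
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: reestimate weights, means and per-dimension variances
        nk = r.sum(axis=0) + 1e-9
        w = nk / n
        mu = (r.T @ colors) / nk[:, None]
        var = (r.T @ (colors ** 2)) / nk[:, None] - mu ** 2 + 1e-3
    return w, mu, var

def gmm_likelihood(colors, params):
    """Mixture density p(x | G_l) for each color sample."""
    w, mu, var = params
    log_p = -0.5 * (((colors[:, None, :] - mu) ** 2) / var
                    + np.log(2 * np.pi * var)).sum(axis=2) + np.log(w)
    m = log_p.max(axis=1, keepdims=True)
    return np.exp(m[:, 0]) * np.exp(log_p - m).sum(axis=1)

def color_unaries(pixel_colors, instance_params, p_back=0.125):
    """Unary color costs: row 0 is background, rows 1..L the instances."""
    probs = np.stack([np.clip(gmm_likelihood(pixel_colors, p), 1e-6, 1 - 1e-6)
                      for p in instance_params])          # L x N
    bg = np.clip(1.0 - probs.max(axis=0), 1e-6, None)     # background likelihood
    return -np.log(np.vstack([bg, probs]))                # (L+1) x N costs
```

A pixel drawn from one instance's color model then receives a much lower cost for that instance than for any other label.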
The third unary potential defines a spatial distance function analyzing how
close each pixel is to each seed region Sl by

ψ_dist(xi = l) = δ(Xi, Sl),   (6)
Fig. 2. Two-layer graph design (top: Hough, bottom: image) to solve the proposed
image labeling problem. We identify Hough regions as subgraphs of the Hough graph
H. The backprojection into the image space, in combination with color information,
distance transform values, and contrast-sensitive Potts potentials, defines instance segmentations in the image graph I.
various sources of error such as changes in global and local scale, aspect ratio,
articulation, etc. Perfect Hough maxima are unlikely, and there will always be
inconsistent centroid votes. The idea and benefit of our Hough regions paradigm
is to capture exactly this uncertainty by considering regions rather than single
Hough maxima.
Thus, our goal is to identify Hough regions Hl in the Hough space H, which
then allows us to directly define the corresponding seed region Sl by the backprojection into the image space, i.e. the random variables Xi ∈ Sl are the
nodes in the image graph I that project into the Hough region Hl. For this
reason, we define a binary image labeling problem in the Hough graph by
E(x) = (H(Xi )) + (H(Xi ), H(Xj )), (8)
Xi X N
where each random variable Xi ∈ Ml within the segmentation mask Ml of the
category instance is used in the update. Using our obtained segmentations, we
focus the update solely on the areas of the graph where the current object
instance plays a role. This leads to a much finer dissection of the image and
Hough graphs, as shown in Figure 2.
After updating the Hough graph H with the same update step, we repeat finding the next optimal seed region and afterwards segmenting the corresponding
category instance in the image graph I. This guarantees a monotonic decrease
of the Hough-space maximum, and our iteration stops when this maximum falls below pback, i.e. when we have
identified all Hough regions (object instances L) above the threshold.
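A toy sketch of this outer loop follows. It makes several simplifying assumptions of our own: votes are hard Hough-cell indices, a Hough region is approximated by a fixed window around the current peak rather than by the binary labeling of Eq. (8), and the instance segmentation is reduced to the set of supporting voters. The function and variable names are illustrative, not from the paper.

```python
import numpy as np

def hough_region_detect(vote_targets, weights, shape, p_back=0.125, radius=2):
    """Iteratively extract Hough regions and their supporting (seed) pixels.

    vote_targets: (N, 2) int array, Hough cell each pixel votes for.
    weights:      (N,)   per-pixel vote weight (class likelihood).
    Returns a list of (peak_cell, seed_pixel_indices).
    """
    active = np.ones(len(weights), bool)
    detections = []
    while True:
        # Rebuild the Hough map from the votes of still-active pixels
        hough = np.zeros(shape)
        idx = np.flatnonzero(active)
        np.add.at(hough, (vote_targets[idx, 0], vote_targets[idx, 1]), weights[idx])
        peak = np.unravel_index(hough.argmax(), shape)
        if hough[peak] < p_back:      # stop: no instance left above the threshold
            break
        # Hough region: cells within `radius` of the peak; seed pixels are
        # exactly the active pixels voting into this region
        d = np.abs(vote_targets - np.array(peak)).max(axis=1)
        seeds = np.flatnonzero(active & (d <= radius))
        detections.append((peak, seeds))
        active[seeds] = False         # update: remove only this instance's votes
    return detections
```

Because only the votes of the segmented instance are removed, overlapping instances keep their full support in later iterations, which is the behavior the coarse bounding-box deletion of [17] lacks.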
Our implicit step of updating the Hough vote maps by considering optimal segmentations in the image space is related to other approaches in the field of non-maximum suppression. In general, one can distinguish two different approaches
in this field.
other are removed (NMS). Integrating such a setup into our method would mean
that each bounding box represents one seed region Sl and additionally defines
a binary distance function δ(Xi, Sl), where δ = 1 for all pixels within the
bounding box and δ = 0 for all other pixels.
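Under this reading, the bounding-box variant of the distance function reduces to a binary mask; a hypothetical helper (not from the paper) makes the contrast to the soft distance-transform term of Eq. (6) explicit:

```python
import numpy as np

def box_delta(h, w, box):
    """Binary distance function for a bounding-box seed region:
    delta = 1 for pixels inside the box (y0:y1, x0:x1), 0 elsewhere."""
    y0, x0, y1, x1 = box
    delta = np.zeros((h, w))
    delta[y0:y1, x0:x1] = 1.0
    return delta
```

Every pixel inside the box is then treated identically, regardless of whether it actually supports the hypothesis, which is exactly the coarseness the Hough-region formulation avoids.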
Non-maximum suppression comes in many different forms, such as: a) finding all
local maxima and performing non-maximum suppression in terms of mutual
bounding-box overlap in image space to discard low-scoring detections; b) extracting global maxima iteratively and eliminating them by deleting the interior
of the bounding-box hypothesis in Hough space [17]; c) iteratively finding global
maxima for the seed pixels [3], estimating bounding boxes, and subtracting
the corresponding Hough vectors. In the latter, all pixels inside the bounding box are used
to reduce the Hough vote map. This leads to a coarse dissection of the Hough
map, which causes problems since it includes much background information and
overlapping object instances.
Clustering methods allow a more elegant formulation, since they attempt to group
Hough vote vectors to identify instance hypotheses. The Implicit Shape Model
(ISM) [5], for example, employs mean-shift clustering on sparse Hough center
locations to find coherent votes. The work by Ommer and Malik [23] extends
this to clustering Hough voting lines, which are infinite lines as opposed to
finite Hough vectors. The benefit lies in extrapolating the scale from coherent
Hough lines, as the line intersection in the 3D x-y-scale space determines the
optimal scale. Such an approach relates to ours, since the clustering step can
be interpreted as elegantly identifying the seed pixels we analyze. However, such
approaches assume well-distinguishable cluster distributions and are not well
suited for scenarios with many overlaps and occlusions. Furthermore, similar
to the bounding-box methods, which require a Parzen-window estimate to bind
connected Hough votes, clustering methods require an intra- to inter-cluster
balance. In a mean-shift approach this is defined by the chosen bandwidth
parameter. An additional drawback of clustering methods is their
limited scalability in terms of the number of input Hough vectors that can be
efficiently handled. An interesting alternative is the grouping of dependent object
parts [24], which yields significantly less uncertain group votes compared to the
individual votes.
Our proposed approach has several advantages compared to the discussed related methods. A key benefit is that, due to our integration of segmentation cues, we do not have to fix a range for local
neighborhood suppression, as is required, e.g., in non-maximum suppression
approaches. As also demonstrated in the experimental section, our method is robust to an increased range
of variations in object articulation, aspect ratio, and scale, which allows us to
reduce the number of analyzed scales during testing while maintaining the detection performance. We also implicitly provide segmentations of all detected
category instances, without requiring segmentation masks during training.
Overall, the benefits of our Hough regions approach lead to higher precision and
recall compared to related approaches.
268 H. Riemenschneider et al.
[Fig. 3 plot area: recall (x-axis, 0-1) vs. precision (y-axis, 0-1) for tud_campus (left) and tud_crossing (right). Left legend: Hough Regions (1, 3, 5 scales), Barinova [3] (1, 3, 5 scales), ISM [11] (5 scales). Right legend: Hough Regions (1 scale), Barinova [3] (1 scale), ISM [11] (1 scale).]
Fig. 3. Recall/Precision curves for the TUD Campus and TUD Crossing sequences, showing the analysis at multiple scales for highly overlapping instances. See text for details.
4 Experimental Evaluation
We use the publicly available Hough Forest code [17] to provide the required class
likelihoods and centroid votes. For non-maximum suppression (NMS), we apply the method of [3] (also based on [17]), which outperforms standard NMS in
scenes with overlapping instances. It is important to note that for datasets without overlapping instances, such as PASCAL VOC, the performance is similar to [17].
For this reason, we mainly evaluate on datasets including severely overlapping
detections, namely TUD Crossing and TUD Campus. Additionally, we evaluate
on a novel window detection dataset, GT240, designed for testing sensitivity to
aspect ratio and scale variations. We use two GMM components, equal weighting
between the energy terms, and fix pback = 0.125.
To demonstrate the ability of our system to deal with occluded and overlapping
object instances (which results in an increased recall using the PASCAL 50%
criterion), we evaluated on the TUD Campus sequence, which contains 71 images
and 303 highly overlapping pedestrians with large scale changes. We evaluated
the methods on multiple scale ranges (one, three, and five scales); Figure 3
shows the Recall-Precision curve (RPC) for the TUD Campus sequence. Aside
from the fact that multiple scales benefit the detection performance of all methods,
one can also see that our Hough regions method surpasses the competing performance at
each scale range (+3%, +8%, +8% over [3]) and over fewer scales. For example,
using Hough regions we require only three scales to achieve the recall and precision that are otherwise only achieved using five scales. This demonstrates that
our method handles scale variations better, which allows us to
reduce the number of scales, and thus the runtime, during testing. The ISM approach
of Leibe et al. [5] reaches good precision, but at the cost of decreased recall, since
the MDL criterion limits its ability to find all partially occluded objects.
Hough Regions for Joining Instance Localization and Segmentation 269
[Fig. 4 plot area: recall (x-axis, 0-1) vs. precision (y-axis, 0-1) on gt240. Legend: Baseline [9], Hough Regions.]
4.3 GrazTheresa240
We also evaluated our method on a novel street-level window detection dataset
(denoted GrazTheresa240), which consists of 240 buildings with 5400 redundant images containing a total of 5542 window instances. Window detection itself
is difficult due to immense intra-class appearance variations. Additionally, the
dataset includes a large range of diverging aspect ratios and strong perspective
distortions, which usually limit object detectors. In our experiment we tested
on three different scales and a single aspect ratio. As shown in Figure 4, we can
substantially improve the detection performance in both recall and precision
(12% at EER) compared to the baseline [17]. Our Hough regions based detector
4.4 Segmentation
As a final experiment, we analyze the achievable segmentation accuracy on the TUD
Crossing sequence [25], for which we created binary segmentation masks for each object instance over the entire sequence. Segmentation performance is measured as the
ratio between the number of correctly assigned pixels and the total number of
pixels (segmentation accuracy). We compared our method to the Implicit Shape
Model (ISM) [5], which also provides segmentations for each detected instance.
Note that the ISM requires segmentation masks for all training samples to
provide reasonable segmentations, whereas our method learns from training images only. Nevertheless, we achieve a competitive segmentation accuracy of 98.59%,
compared to 97.99% for the ISM. Illustrative results are shown in Figure 5.
5 Conclusion
In this work we proposed Hough regions for object detection, jointly solving
instance localization and segmentation. As a starting point, we use any generalized Hough voting
method that provides pixel-wise object centroid votes for a test image. Our novel Hough regions then determine the locations and segmentations
of all category instances and implicitly handle the large location uncertainty due to
articulation, aspect ratio, and scale. As shown in the experiments, our method
jointly and accurately delineates the location and outline of object instances,
leading to increased recall and precision for overlapping and occluded instances.
These results confirm related research showing that combining the tasks of detection
and segmentation improves performance, owing to the benefit of joint optimization
over separate individual solutions. In future work, we plan to also exploit our
method during training to provide more accurate foreground/background
estimates without increasing manual supervision.
References
1. Ramanan, D.: Using segmentation to verify object hypotheses. In: CVPR (2007)
2. Ladicky, L., Sturgess, P., Alahari, K., Russell, C., Torr, P.H.S.: What, Where and
How Many? Combining Object Detectors and CRFs. In: Daniilidis, K., Maragos, P.,
Paragios, N. (eds.) ECCV 2010, Part IV. LNCS, vol. 6314, pp. 424–437. Springer,
Heidelberg (2010)
3. Barinova, O., Lempitsky, V., Kohli, P.: On the detection of multiple object instances using Hough transforms. PAMI 34, 1773–1784 (2012)
4. Desai, C., Ramanan, D., Fowlkes, C.: Discriminative models for multi-class object
layout. IJCV 95, 1–12 (2011)
5. Leibe, B., Leonardis, A., Schiele, B.: Robust object detection with interleaved
categorization and segmentation. IJCV 77, 259–289 (2008)
6. Borenstein, E., Ullman, S.: Class-Specific, Top-Down Segmentation. In: Heyden, A.,
Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002, Part II. LNCS, vol. 2351,
pp. 109–122. Springer, Heidelberg (2002)
7. Yu, S., Shi, J.: Object-specific figure-ground segregation. In: CVPR (2003)
8. Amit, Y., Geman, D., Fan, X.: A coarse-to-fine strategy for multiclass shape detection. PAMI 26, 1606–1621 (2004)
9. Larlus, D., Jurie, F.: Combining appearance models and Markov random fields for
category level object segmentation. In: CVPR (2008)
10. Gu, C., Lim, J., Arbelaez, P., Malik, J.: Recognition using regions. In: CVPR (2009)
11. Tu, Z., Chen, X., Yuille, A., Zhu, S.: Image parsing: Unifying segmentation, detection, and recognition. IJCV 62, 113–140 (2005)
12. Gould, S., Gao, T., Koller, D.: Region-based segmentation and object detection.
In: NIPS (2009)
13. Wojek, C., Schiele, B.: A Dynamic Conditional Random Field Model for Joint Labeling of Object and Scene Classes. In: Forsyth, D., Torr, P., Zisserman, A. (eds.)
ECCV 2008, Part IV. LNCS, vol. 5305, pp. 733–747. Springer, Heidelberg (2008)
14. Winn, J., Shotton, J.: The layout consistent random field for recognizing and segmenting partially occluded objects. In: CVPR (2006)
15. Yang, Y., Hallman, S., Ramanan, D., Fowlkes, C.: Layered object models for image
segmentation. PAMI 34, 1731–1743 (2011)
16. Floros, G., Rematas, K., Leibe, B.: Multi-Class Image Labeling with Top-Down
Segmentation and Generalized Robust PN Potentials. In: BMVC (2011)
17. Gall, J., Lempitsky, V.: Class-specific Hough forests for object detection. In: CVPR
(2009)
18. Maji, S., Malik, J.: Object detection using a max-margin Hough transform. In:
CVPR (2009)
19. Okada, R.: Discriminative generalized Hough transform for object detection. In:
ICCV (2009)
20. Shotton, J., Johnson, M., Cipolla, R.: Semantic texton forests for image catego-
rization and segmentation. In: CVPR (2008)
21. Kohli, P., Ladicky, L., Torr, P.: Robust higher order potentials for enforcing label
consistency. IJCV 82, 302–324 (2009)
22. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via
graph cuts. PAMI 23, 1222–1239 (2001)
23. Ommer, B., Malik, J.: Multi-scale object detection by clustering lines. In: ICCV
(2009)
24. Yarlagadda, P., Monroy, A., Ommer, B.: Voting by Grouping Dependent Parts.
In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part V. LNCS,
vol. 6315, pp. 197–210. Springer, Heidelberg (2010)
25. Andriluka, M., Roth, S., Schiele, B.: People-tracking-by-detection and people-
detection-by-tracking. In: CVPR (2008)
Learning to Segment a Video to Clips
Based on Scene and Camera Motion
1 Introduction
Video analysis tasks such as video summarization, activity recognition, and depth from video
are active research areas. Each of these communities focuses on specific modalities of
video; for example, estimating depth from video has been well explored for videos of a
static scene captured by a dynamic camera. In addition to computational applications,
studies in cognitive science [8] and film studies [3, 11, 25] require analyzing the visual
activity in movies, which currently requires exhaustive manual effort [1]. In this work, we
automatically segment an input video into clips based on the scene and camera motion,
allowing for more focused algorithms for video editing and analysis.
The task of video segmentation has been studied for various applications. Shot boundary detection [22, 32-34] performs a coarse segmentation of frames into the basic elements of a video, shots¹. Several works build on top of this, grouping shots into scenes
to aid video summarization [15, 24, 31, 35] or performing spatio-temporal segmentation [7, 9, 14, 18, 29]. Other video segmentation works include camera motion segmentation [10, 21, 23, 28, 30, 35], classifying camera motion such as tilt and zoom, albeit
without focusing on the motion of the objects within the scene. In this work, we define
¹ A shot is everything between turning the camera on and turning it off, consisting of many clips.
A scene consists of all the shots at a location.
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 272–286, 2012.
Springer-Verlag Berlin Heidelberg 2012
Learning to Segment a Video to Clips Based on Scene and Camera Motion 273
Fig. 1. The proposed algorithm takes a video as input (black box) and first segments the video into
shots, where each shot is the group of frames shown in an orange box. The algorithm then segments
the frames of each shot into clips based on scene and camera motion, expressed as combinations
of static camera (CS) vs. dynamic camera (CD) and static scene (SS) vs. dynamic scene (SD),
shown with red, green, blue, and yellow borders around the frames (color code on the first row). The
above are some results of our algorithm on the movie Sound of Music.
a novel task of labeling frames into one of four categories based on combinations of
both camera and scene motion, i.e. static vs. dynamic camera and static vs. dynamic
scene. While prior work in temporal segmentation used rule-based approaches, we propose a novel learning-based approach. Some results are shown in Fig 1.
An overview of our algorithm is shown in Fig 2. We formulate the task of video
segmentation as a discrete labeling problem. We first perform shot boundary detection
to segment the video into shots, and assume that the motion characteristics of different
shots are independent of each other. In order to capture the heavy temporal dependence
between frames within a shot, we construct an MRF over the frames of the shot with
edges between consecutive frames. We extract intuitive features from the optical flow
between consecutive frames and, using labeled video data, learn classifiers to distinguish
between the different classes. Using the classifier scores as evidence for the discrete categories, we set up an energy function over the graph of frames enforcing
smoothness between adjacent frames, and infer the minimum-energy labeling over the
frames. We then use the resulting labeling to extract contiguous sequences of frames that
are assigned the same label, thus segmenting the video into clips.
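The final step, turning the inferred per-frame labels into clips, can be sketched as follows (an illustrative helper, not the authors' code):

```python
def labels_to_clips(labels):
    """Group consecutive frames with the same label into clips,
    returned as [(start_frame, end_frame_inclusive, label), ...]."""
    clips = []
    start = 0
    for i in range(1, len(labels) + 1):
        # Close the current clip at the end of the sequence or on a label change
        if i == len(labels) or labels[i] != labels[start]:
            clips.append((start, i - 1, labels[start]))
            start = i
    return clips
```

For example, the label sequence [0, 0, 1, 1, 1, 2] yields three clips, one per contiguous run.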
Contributions. The main contributions of this paper are:
– We propose a novel learning-based approach to segmenting a video into clips based
on both camera and scene motion.
– We show that joint inference over all the frames of a shot, as opposed to
independent decisions on each frame, leads to significantly better performance.
274 A. Kowdle and T. Chen
Fig. 2. Algorithm overview: (a) We start with the input video and (b) extract the frames of the
video. (c) We represent each frame as a node in our graph and (d) perform shot boundary detection
to identify shot boundary frames, shown in orange. (e) We then construct an MRF over the frames
within each shot and formulate a discrete labeling problem where each frame has to be assigned
one of four discrete labels, the combinations of static camera (CS) vs. dynamic camera (CD)
and static scene (SS) vs. dynamic scene (SD), in red, green, blue, and yellow. (f) The minimum-energy labeling obtained via graph cuts allows us to group consecutive frames with the same label
into clips. Note that in addition to segmenting the video into shots, shots 1 and 3 have been further
segmented into clips based on camera and scene motion.
2 Related Work
With the number of video capture devices and the movie industry growing rapidly,
video analysis is gaining popularity. A number of researchers are studying different aspects of video analytics, from shot detection to video summarization. We give
an overview of related works with a focus on video segmentation and place our work
in perspective amongst them.
Shot Boundary Detection. One of the key steps in most video analysis applications
is, given a video, extracting its basic components, shots, which allows for further processing. Shot boundary detection is also a form of video segmentation, albeit a coarse
one. Automatic shot boundary detection has been very well studied in the past, with
approaches varying from using changes in image color statistics across adjacent
frames to using the edge change ratio [22, 33, 34]. There are a number of detailed surveys
that compare the various techniques [4, 13, 19]. We refer the reader to a recent formal
study of shot boundary detection by Yuan et al. [32]. In our work, we perform shot
boundary detection as our first stage of coarse segmentation, but our goal is to go beyond it and obtain a finer segmentation of the shots into clips.
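A minimal sketch of the classic color-statistics cue for shot boundaries follows; the bin count and threshold are our own illustrative choices, not taken from the surveyed methods:

```python
import numpy as np

def detect_shot_boundaries(frames, bins=8, thresh=0.5):
    """Flag a boundary between frames t and t+1 when the L1 distance of
    their normalized gray-level histograms exceeds `thresh`.
    frames: list of 2D arrays with values in [0, 1]."""
    hists = []
    for f in frames:
        h, _ = np.histogram(f, bins=bins, range=(0.0, 1.0))
        hists.append(h / max(h.sum(), 1))
    boundaries = []
    for t in range(len(hists) - 1):
        # A large histogram change signals an abrupt cut at t -> t+1
        if np.abs(hists[t] - hists[t + 1]).sum() > thresh:
            boundaries.append(t)
    return boundaries
```

Real detectors combine such cues with edge-change ratios and adaptive thresholds to handle gradual transitions; this sketch only captures hard cuts.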
Spatio-temporal Segmentation. With the shots of a video extracted using shot
boundary detection algorithms, an active line of research is spatio-temporal segmentation of the video shot frames. The goal of these works is to obtain a spatial grouping
of pixels enforcing temporal consistency [7, 9, 14, 18, 29]. In our work, we focus on
purely temporal segmentation, where we segment the video shot into clips by giving the
entire frame a discrete label. However, we note that our proposed approach can allow
for better, more focused spatio-temporal segmentation.
Scene Segmentation. A line of active research, motivated by applications in video
summarization, indexing, and retrieval, is segmenting a video into scenes [35]. A scene
can consist of multiple shots, each with different camera and scene dynamics. Scene
segmentation leverages semantics to group shots into scenes and extract key
frames to summarize the video. The Scene Transition Graph by Yeung et al. [31] is one such
approach, where a graph is constructed with each node representing a shot and edges
representing the transitions between shots. Hierarchically clustering the graph splits it
into several subgraphs corresponding to the scenes. A similar clustering approach was
proposed by Hanjalic et al. [15] to find logical story units in the MPEG compressed
domain. More recently, Rasheed et al. proposed a similar approach for scene detection
in Hollywood movies and TV shows [24]. In our work, as opposed to clustering shots
into scenes, we break each shot into clips based on scene and camera motion.
3 Algorithm
In this section, we describe the proposed algorithm in detail, starting with the problem formulation and then the individual components of the approach.
Fig. 3. Illustration of the features used in the unary term. (a) A sample frame with the
corresponding color-coded flow; flow vectors are illustrated with arrows. (b) Row 1 shows the
log-binned histogram used to classify static vs. dynamic camera, and Row 2 shows the orientation
histogram used to classify static vs. dynamic scene. Refer to Section 3.2 for details.
Segmenting Video to Clips. The shot boundaries from the previous stage are treated
as the last frames of shots, thus segmenting the video into shots. Our goal is to segment
each shot into clips based on camera and scene motion. We cast this segmentation task
as a frame-level discrete labeling problem. We formulate the multilabel problem as
energy minimization over a graph of the frames in the shot. This is analogous
to constructing a graph over the frames of the whole video and breaking the links at
shot boundaries, assuming independent motion between shots.
Notation. Let the frames in the i-th shot of the video be denoted Fi. The objective
is to obtain a labeling Li over the shot frames such that each frame f is given a label
lf ∈ {(CS, SS), (CD, SS), (CS, SD), (CD, SD)}. We build a graph over the frames
f ∈ Fi, with edges between adjacent frames denoted by N. We define the energy
function to minimize over the entire shot as:
E(Li) = Σ_{f∈Fi} Ef(lf) + λ Σ_{(f,g)∈N} Efg(lf, lg)   (1)
The unary term Ef(lf) measures the cost of assigning frame f the label lf.
The pairwise term Efg(lf, lg) measures the penalty of assigning frames f and g the labels
lf and lg respectively, and λ is a regularization parameter.
Unary Term (Ef). We extract intuitive features from the optical flow and, using labeled data, learn classifiers to distinguish static vs. dynamic camera and
static vs. dynamic scene. We now describe the unary term in detail.
Static Camera vs. Dynamic Camera. Our goal is to learn a classifier that, given the
features for the current frame, returns a score indicating the likelihood that the current
frame was captured with a static camera vs. a dynamic camera. In the most trivial scenario, where a static camera observes a static scene, the optical flow between the
frames should have zero magnitude. In this case, a histogram of the optical flow magnitude would peak at the zero bin, indicating a static camera. However, in a
typical video the scene can contain dynamic objects as well.
We handle this by computing a histogram of the optical flow magnitude, using log-binning with larger bins as we move away from zero magnitude. We obtain a 15-dimensional
vector of the normalized histogram, denoted φ(f), to describe each frame f, as
illustrated in Fig 3. Using manually labeled data, we use these vectors as the feature
vectors and learn a logistic classifier to distinguish static from dynamic camera
frames. Let θS represent the logit coefficients learned during training. During inference, given a frame x, we
compute the likelihood of the frame being captured by a static camera, L(CS), and by a dynamic camera, L(CD), given by:
L(CS) = P(CS | φ(x); θS)
L(CD) = 1 − P(CS | φ(x); θS)   (2)
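The magnitude feature and the logistic likelihood might be sketched as follows. The geometric bin edges (0.1 to 32 pixels) and the helper names are our own assumptions, since the paper does not specify them:

```python
import numpy as np

def flow_magnitude_feature(flow, n_bins=15, max_mag=32.0):
    """15-dim log-binned histogram of optical-flow magnitudes, phi(f).
    Bin edges grow geometrically away from zero, so small motions are
    finely resolved and large motions coarsely."""
    mag = np.sqrt((flow ** 2).sum(axis=-1)).ravel()
    edges = np.concatenate([[0.0], np.geomspace(0.1, max_mag, n_bins)])
    hist, _ = np.histogram(mag, bins=edges)
    return hist / max(mag.size, 1)

def static_camera_likelihood(phi, theta):
    """Logistic model: L(C_S) = sigmoid(theta . [1, phi]); L(C_D) = 1 - L(C_S)."""
    z = theta[0] + phi @ theta[1:]
    l_cs = 1.0 / (1.0 + np.exp(-z))
    return l_cs, 1.0 - l_cs
```

For a zero flow field the entire mass falls into the first (zero) bin, which is exactly the static-camera signature described above.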
Static Scene vs. Dynamic Scene. Consider a scene captured by a dynamic camera;
we use a simple cue from the optical flow to separate static from dynamic scenes. Given
the flow for each frame of the shot, we compute the orientation of the flow field at
each pixel and bin them into an orientation histogram with eight bins evenly spaced
between 0° and 360°, as illustrated in Fig 3. In addition, we use a no-flow bin that helps
characterize a static camera.
Intuitively, the peak of this histogram indicates the dominant direction of camera
motion. We remove the component of the histogram corresponding to this dominant
motion and normalize the residual histogram. We use the entropy of this residual distribution, H(f), as a feature to distinguish static from dynamic scenes. We assume a
Gaussian distribution over the entropy values and, using the labeled data, fit a parametric
model to the residual entropies of frames labeled as static and frames labeled
as dynamic, which results in two Gaussians N(μS, σS) and N(μD, σD), respectively. At inference, given a new frame x, we compute the residual entropy H(x) and
the likelihoods for a static scene, L(SS), and a dynamic scene, L(SD), given by:
L(SS) = P(SS | H(x); μS, σS)
L(SD) = P(SD | H(x); μD, σD)   (3)
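The residual-entropy feature can be sketched as follows; the no-flow magnitude threshold and the function names are our own assumptions:

```python
import numpy as np

def residual_entropy(flow, n_bins=8, mag_eps=0.05):
    """Entropy of the flow-orientation histogram after removing the dominant
    direction (the camera motion). Low-magnitude vectors go to a no-flow bin."""
    u, v = flow[..., 0].ravel(), flow[..., 1].ravel()
    mag = np.sqrt(u ** 2 + v ** 2)
    ang = (np.arctan2(v, u) + 2 * np.pi) % (2 * np.pi)
    hist = np.zeros(n_bins + 1)
    hist[-1] = (mag < mag_eps).sum()               # no-flow bin
    moving = mag >= mag_eps
    idx = np.minimum((ang[moving] / (2 * np.pi) * n_bins).astype(int), n_bins - 1)
    np.add.at(hist, idx, 1)
    hist[hist.argmax()] = 0.0                      # drop the dominant direction
    if hist.sum() == 0:
        return 0.0
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())
```

A uniform camera pan leaves an empty residual (entropy 0), whereas independently moving scene content spreads the residual over many bins (entropy > 0), which is the separation the two Gaussians in Eq. (3) model.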
Since the dominant camera motion has been factored out in the computation of the
scene motion, we assume that the scene and camera motion are independent. We show
in Section 4.2 that this assumption performs respectably on both movies and user
videos; however, it causes errors in the ambiguous case of a dynamic camera looking
at a static object vs. a static camera looking at a dynamic object up close, where the
moving object has a large spatial extent in the frame. This scenario is ambiguous even
to humans without contextual or semantic reasoning about the actual spatial extent of
the dynamic object and the surrounding scene. While we do not model this in our work,
such a model could be incorporated into Equation 3.
We compute the likelihoods for the four discrete frame labels to define the unary
term in our MRF as follows:
L(CS, SS) = L(CS) · L(SS)
L(CD, SS) = L(CD) · L(SS)
L(CS, SD) = L(CS) · L(SD)
L(CD, SD) = L(CD) · L(SD)   (4)
We use the negative log-likelihood as the unary term (Ef ) for the four discrete labels.
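Combining the two classifiers into the four-label unary of Eq. (4) is then a simple product of likelihoods followed by a negative log; a minimal sketch (helper name is ours):

```python
import numpy as np

def frame_unaries(l_cs, l_ss, l_sd, eps=1e-9):
    """Negative log-likelihood unary costs for the four labels, in the order
    (C_S,S_S), (C_D,S_S), (C_S,S_D), (C_D,S_D), assuming camera and scene
    motion are independent (Eq. 4). l_cs, l_ss, l_sd are the per-frame
    likelihoods from Eqs. (2) and (3); L(C_D) = 1 - L(C_S)."""
    l_cd = 1.0 - l_cs
    joint = np.array([l_cs * l_ss, l_cd * l_ss, l_cs * l_sd, l_cd * l_sd])
    return -np.log(joint + eps)
```

For a frame that looks strongly like a static camera over a static scene, the first label gets the lowest cost, as expected.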
Pairwise Term (E_fg). We model the pairwise term using a contrast-sensitive Potts model:

E_fg(l_f, l_g) = λ I(l_f ≠ l_g) exp(−d_fg)    (5)

where I(·) is an indicator function that is 1 (0) if the input argument is true (false), d_fg is the contrast between frames f and g, and λ is a fixed parameter. The pairwise term penalizes label discontinuities among neighboring frames, modulated by a contrast-sensitive term. Given the optical flow between adjacent frames f and g, we warp the image from frame g to frame f. The mean error between the original frame f and the flow-induced warped image is used as the contrast d_fg between the two frames in the pairwise term. In practice, this enforces temporal continuity, since the contrast is low if the flow describes the previous frame very well.
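The contrast d_fg and the resulting pairwise cost can be sketched as follows; the nearest-neighbour warping, the flow array layout, and the default weight λ = 1 are our illustrative assumptions:

```python
import numpy as np

def warp_with_flow(img_g, flow):
    """Backward-warp frame g toward frame f using dense optical flow
    (flow[y, x] = (dy, dx); nearest-neighbour sampling for brevity)."""
    h, w = img_g.shape
    ys, xs = np.mgrid[0:h, 0:w]
    yy = np.clip(np.round(ys + flow[..., 0]).astype(int), 0, h - 1)
    xx = np.clip(np.round(xs + flow[..., 1]).astype(int), 0, w - 1)
    return img_g[yy, xx]

def frame_contrast(img_f, img_g, flow):
    """d_fg: mean error between frame f and the flow-warped frame g."""
    warped = warp_with_flow(img_g, flow)
    return float(np.abs(img_f.astype(float) - warped.astype(float)).mean())

def pairwise_cost(label_f, label_g, d_fg, lam=1.0):
    """Contrast-sensitive Potts penalty: zero when labels agree,
    large when labels differ but the flow explains frame g well."""
    return 0.0 if label_f == label_g else lam * np.exp(-d_fg)
```

When the warped frame reproduces frame f well, d_fg is small and a label change across the pair is penalized heavily, which is exactly the temporal-continuity behavior described above.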
With the energy function set up, we use graph cuts with α-expansion to compute the MAP labels for each shot, using the implementation by Bagon [2] and Boykov et al. [5, 6, 17]. We show some of our results on the full-length movies and user videos in Figs. 1 and 5. We note that we can extend the proposed algorithm using prior work (Section 2) to obtain a true camera motion characterization of frames into zoom, pan, tilt, etc., without scene motion clutter, by either post-processing the frames labeled by our algorithm as dynamic-camera frames or adding further discrete labels.
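Since the frames within a shot form a one-dimensional chain, the MAP labeling of such a chain MRF can also be computed exactly by dynamic programming, which is a convenient sanity check against the graph-cut result. This sketch is our own (not the paper's implementation) and assumes dense unary and pairwise cost matrices:

```python
import numpy as np

def chain_map(unary, pairwise):
    """Exact MAP labeling of a chain MRF by dynamic programming.
    unary: (T, L) per-frame label costs; pairwise: (L, L) transition costs."""
    T, L = unary.shape
    cost = unary[0].copy()
    back = np.zeros((T, L), dtype=int)
    for t in range(1, T):
        # total[i, j]: best cost of reaching label j at t from label i at t-1
        total = cost[:, None] + pairwise + unary[t][None, :]
        back[t] = np.argmin(total, axis=0)
        cost = np.min(total, axis=0)
    labels = [int(np.argmin(cost))]
    for t in range(T - 1, 0, -1):   # backtrack the optimal path
        labels.append(int(back[t, labels[-1]]))
    return labels[::-1]
```

For the four-label problem here, L = 4 and the pairwise matrix would hold the contrast-sensitive Potts costs of Eq. (5).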
4.1 Dataset
Our dataset consists of 60 full-length movies and five user videos. The movies are a subset of the dataset used in [8] that spans 1960–2010, with 12 full-length movies in each decade. We divide the dataset into two parts.
The first subset of the dataset (Set-A) consists of five user videos and one full-length
movie that we use for the quantitative analysis. We sample the videos at a frame rate of 10 fps. We manually label the frames with one of the four discrete labels defined in this work. We use these 110,000 labeled frames (made publicly available on our website3)
to perform our quantitative analysis. We also label shot boundaries on a subset of the
movie to evaluate shot boundary detection.
The second subset of the dataset (Set-B) consists of the 60 full-length movies. We
sample the movies at 10 fps, resulting in an average of 50,000 frames per movie.
We use the results of the proposed algorithm on this exhaustive dataset to infer trends
in scene and camera motion in movies over the years. We later use the dataset to show
a novel application of time-stamping archive movies.
We extract the optical flow between every two consecutive frames for each of the
videos in the dataset using the implementation by Liu et al. [20].
3 http://chenlab.ece.cornell.edu/projects/Video2Clips
280 A. Kowdle and T. Chen
[Fig. 4 (b, c). Confusion matrices over the four frame labels (C_S,S_S), (C_D,S_S), (C_S,S_D), (C_D,S_D). Diagonal accuracies: independent per-frame decisions .74, .67, .84, .69; proposed MRF approach .85, .93, .97, .86.]
We use Set-A to perform the quantitative analysis. Given the optical flow for each of the videos, we first perform shot boundary detection as described in Section 3.2. We first evaluate the performance of the shot boundary detection against the edge change ratio (ECR) approach [33], which is known to perform better than using color histograms [32]. We use the manually labeled movie data to compute the F-score of detecting shot boundaries. The ECR performed respectably, with an F-score of 0.92 (using cross-validation to set the threshold), well above random. The proposed flow-based approach performed better, with an F-score of 0.96, without requiring a threshold to be set. While both approaches had errors due to missed detections in the case of slow fades and dissolves, we observed that the lower performance of ECR is due to false positives in the case of long shots, or when text is instantaneously overlaid on the movie frame. Other, more sophisticated approaches (Sec. 2) can easily be plugged in here, since shot boundary detection serves as a pre-processing step in our approach.
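For reference, a shot-boundary F-score of this kind can be computed with a sketch like the following; the greedy matching and the ±1-frame tolerance window are our assumptions, since the paper does not specify its matching rule:

```python
def boundary_fscore(detected, ground_truth, tol=1):
    """F-score for shot-boundary detection: a detection matches a
    ground-truth boundary if it lies within +/- tol frames of it
    (greedy one-to-one matching; the tolerance is an assumption)."""
    gt = sorted(ground_truth)
    matched = set()
    tp = 0
    for d in sorted(detected):
        for g in gt:
            if g not in matched and abs(d - g) <= tol:
                matched.add(g)
                tp += 1
                break
    precision = tp / len(detected) if detected else 0.0
    recall = tp / len(gt) if gt else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```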
In order to evaluate the performance of the algorithm in segmenting the video into the four classes, we treat the problem as a multi-class labeling problem and use frame-level labeling accuracy as our metric. We perform leave-one-out cross-validation: using all the videos in the dataset except the test video, we learn the model parameters described in Section 3.2, and given these parameters we evaluate the performance on the test video. We note that while the task we tackle has not been addressed before, prior work on camera motion characterization uses a rule-based approach with features extracted from the motion vectors, albeit without reasoning about scene dynamics and while making decisions at a per-frame level. We compare the performance of the proposed approach of using a multi-label MRF over the frames against making an independent decision on each frame using the learned models. We report the performance in Fig. 4a. While using the learned models on each frame independently performs much better than a random decision, the temporal consistency enforced by the proposed approach gives a significant additional boost of more than 20%. We compare the confusion matrices in Fig. 4. We observe via the diagonal elements that the proposed algorithm performs better across all the categories.
Observations. The key observation from the confusion matrices is that the model is able to learn to distinguish between the different classes, as evident from the dominant diagonal even in the independent case (Fig. 4b). We investigate the errors by considering the non-zero off-diagonal entries in Fig. 4c.
Consider the case of static camera, static scene (C_S, S_S), i.e., row 1 of the confusion matrix. We first note that the ratio of frames in a movie that belong to this category is fairly low, as we see in Sec. 4.4. The confusion here is due to the instability of the camera, which makes dynamic camera, static scene (C_D, S_S) a more appropriate label. We see even in Fig. 4b that there is large confusion between the classes of dynamic camera, static scene (C_D, S_S) and dynamic camera, dynamic scene (C_D, S_D), i.e., rows 2 and 4 of the confusion matrix. The proposed approach (Fig. 4c) performs significantly better than the independent case; the remaining confusion is due to the dynamic content in the scene having varied spatial extents in the frames, as noted in Sec. 3.2.
Fig. 5. Results from the proposed approach (color code for the frame borders on the last row): (a) A shot from the movie Lady Killers where the labeling accurately switches from a static camera to a dynamic camera; (b) The famous 3-minute-long Copacabana shot from the movie Goodfellas is not only identified accurately as a single shot but is also accurately labeled as dynamic camera, dynamic scene; (c) A shot from the more recent movie Charlie's Angels that accurately switches from static scene to dynamic scene; (d) A shot from a user video captured with a dynamic camera accurately switches from static scene to dynamic scene and back; (e) Two clips show negative results from our algorithm. While the clips belong to the (C_S, S_D) category, they have been incorrectly labeled (C_D, S_S) and (C_D, S_D). The error arises because the dynamic scene has a large spatial extent and occludes the static background, providing no evidence of the static camera.
etc., for moving cameras, resulting in an increase in dynamic camera shots. Our algorithm has interestingly picked up this trend, which researchers in film studies had observed manually [3,11,25]. We then group the static and dynamic camera categories to compare the static scene vs. dynamic scene categories in Fig. 6c. We see a decreasing trend in the dynamic scene category and an increasing trend in the static scene category. One explanation is that, with more sophisticated and stable dynamic cameras, more static scenes are being captured now. Another explanation, from recent research in cognitive science, is that recent movies could have reduced dynamics due to rapid shot changes or shortened shots [8], which could explain the trends we see.
[Fig. 6. Trends across decades (1960s–2000s), plotted as the average percentage of frames in a movie (%): (a) the four joint labels (C_S,S_S), (C_D,S_S), (C_S,S_D), (C_D,S_D); (b) static vs. dynamic camera; (c) static vs. dynamic scene.]
Fig. 7. Time-stamping movies: Learning a regression to estimate the year (decade) using the output of the proposed algorithm significantly outperforms random guessing and human performance (12 subjects). The first set of bars indicates the performance on exact decade prediction, while the second set indicates the performance with an error tolerance of at most one decade.
Fig. 8. 2D to 3D video: Row 1 shows a clip from the movie The Sound of Music labeled as (C_D, S_S); Row 2 is the depth map recovered using plane-sweep stereo; Row 3 shows anaglyph pairs generated using the depth map (to be viewed with red-cyan glasses)
Acknowledgements. The authors thank Jordan DeLong [8] for collecting and sharing
the dataset of movies across decades.
References
1. Cinemetrics, http://www.cinemetrics.lv
2. Bagon, S.: Matlab wrapper for graph cut (December 2006)
3. Bordwell, D.: The Way Hollywood Tells It: Story and Style in Modern Movies. A Hodder
Arnold Publication (2006)
4. Boreczky, J.S., Rowe, L.A.: Comparison of video shot boundary detection techniques. In:
SPIE (1996)
5. Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. PAMI 26(9), 1124–1137 (2004)
6. Boykov, Y., Veksler, O., Zabih, R.: Efficient approximate energy minimization via graph cuts. PAMI 20(12), 1222–1239 (2001)
7. Chan, A.B., Vasconcelos, N.: Modeling, clustering, and segmenting video with mixtures of dynamic textures. PAMI 30(5), 909–926 (2008)
8. Cutting, J.E., DeLong, J.E., Brunick, K.L.: Visual activity in Hollywood film: 1935 to 2005 and beyond. PACA 5(2) (2010)
9. Dementhon, D.: Spatio-temporal segmentation of video by hierarchical mean shift analysis.
In: SMVP (2002)
10. Dorai, C., Kobla, V.: Extracting motion annotations from mpeg-2 compressed video for hdtv
content management applications. In: ICMCS (1999)
11. Elsaesser, T., Buckland, W.: Studying Contemporary American Film: A Guide to Movie
Analysis. University of California Press (2002)
12. García Cifuentes, C., Sturzel, M., Jurie, F., Brostow, G.J.: Motion models that only work
sometimes. In: BMVC (2012)
13. Gargi, U., Kasturi, R., Antani, S.: Performance characterization and comparison of video
indexing algorithms. In: CVPR (1998)
14. Grundmann, M., Kwatra, V., Han, M., Essa, I.: Efficient hierarchical graph-based video seg-
mentation. In: CVPR (2010)
15. Hanjalic, A., Lagendijk, R.L., Member, S., Biemond, J.: Automated high-level movie seg-
mentation for advanced video-retrieval systems. CSVT (1999)
16. Jain, R.: Direct computation of the focus of expansion. PAMI 5(1), 58–64 (1983)
17. Kolmogorov, V., Zabih, R.: What energy functions can be minimized via graph cuts? PAMI 26(2), 147–159 (2004)
18. Li, Y., Sun, J., Shum, H.-Y.: Video object cut and paste. In: ACM SIGGRAPH (2005)
19. Lienhart, R.: Reliable transition detection in videos: A survey and practitioner's guide. International Journal of Image and Graphics 1, 469–486 (2001)
20. Liu, C.: Beyond Pixels: Exploring New Representations and Applications for Motion Anal-
ysis. PhD thesis, MIT (2009)
21. Ngo, C.-W., Pong, T.-C., Zhang, H.-J., Chin, R.T.: Motion characterization by temporal slices
analysis. In: CVPR (2000)
22. Otsuji, K., Tonomura, Y.: Projection detecting filter for video cut detection. ACM Multimedia
(1993)
23. Tan, Y.P., Saur, D.D., Kulkarni, S.R., Ramadge, P.J.: Rapid estimation of camera motion from compressed video with application to video annotation. CSVT 10, 133–146 (2000)
24. Rasheed, Z., Shah, M.: Scene detection in Hollywood movies and TV shows. In: CVPR (2003)
25. Salt, B.: Statistical style analysis of motion pictures. Film Quarterly, 28(1) (1974)
26. Schindler, G., Dellaert, F.: Probabilistic temporal inference on reconstructed 3d scenes. In:
CVPR (2010)
27. Snavely, N., Seitz, S., Szeliski, R.: Photo tourism: Exploring photo collections in 3d. In:
SIGGRAPH (2006)
28. Srinivasan, M.V., Venkatesh, S., Hosie, R.: Qualitative estimation of camera motion parameters from video sequences. Pattern Recognition 30(4), 593–606 (1997)
29. Wang, J., Thiesson, B., Xu, Y.-Q., Cohen, M.: Image and Video Segmentation by Anisotropic
Kernel Mean Shift. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004, Part II. LNCS, vol. 3022,
pp. 238–249. Springer, Heidelberg (2004)
30. Xiong, W., Lee, J.C.-M.: Efficient scene change detection and camera motion annotation for video classification. CVIU 71(2), 166–181 (1998)
31. Yeung, M., Yeo, B.-L., Liu, B.: Segmentation of video by clustering and graph analysis.
CVIU 71, 94–109 (1998)
32. Yuan, J., Wang, H., Xiao, L., Zheng, W., Li, J., Lin, F., Zhang, B.: A formal study of shot
boundary detection. TCSVT (2007)
33. Zabih, R., Miller, J., Mai, K.: A feature-based algorithm for detecting and classifying scene
breaks. ACM Multimedia (1995)
34. Zhang, H., Kankanhalli, A., Smoliar, S.W.: Automatic partitioning of full-motion video. Multimedia Syst. 1(1), 10–28 (1993)
35. Zhu, X., Elmagarmid, A.K., Xue, X., Wu, L., Catlin, A.C.: Insightvideo: toward hierarchical
video content organization for efficient browsing, summarization and retrieval. IEEE Trans-
actions on Multimedia 7(4), 648–666 (2005)
Evaluation of Image Segmentation Quality
by Adaptive Ground Truth Composition
1 Introduction
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 287–300, 2012.
© Springer-Verlag Berlin Heidelberg 2012
288 B. Peng and L. Zhang
Human beings play an essential role in evaluating the quality of image segmentation. Subjective evaluation has long been regarded as the most reliable assessment of segmentation quality. However, it is expensive, time consuming and often impractical in real-world applications. In addition, even for segmentations which are visually close, different human observers may give inconsistent evaluations. As an alternative, objective evaluation methods, which aim to predict the segmentation quality accurately and automatically, are much more desirable.
The existing objective evaluation methods can be classified into ground-truth based ones and non-ground-truth based ones. In non-ground-truth based methods, empirical goodness measures are proposed to capture heuristic criteria of desirable segmentations. A score is then calculated based on these criteria to predict the quality of a segmentation. There have been examples of empirical measures based on the uniformity of colors [1, 2] or luminance [3], and even on the shape of object regions [4]. Since the criteria are mainly summarized from the common characteristics or semantic information of the objects (e.g., homogeneous regions, smooth boundaries, etc.), they are not accurate enough to describe the complex objects in images.
The ground-truth based methods measure the difference between the segmentation result and the human-labeled ground truths. They are more intuitive than the empirical measures, since the ground truths can well represent the human-level interpretation of an image. Some measures in this category count the degree of overlap between regions, with strategies that are either intolerant [5] or tolerant [6] to region refinement. In contrast to working on regions, there are also measures [7, 8] that match the boundaries between segmentations. By considering only the region boundaries, these measures are more sensitive to the dissimilarity between the segmentation and the ground truths than the region-based measures. Some other measures use non-parametric tests to count the pairs of pixels that belong to the same region in different segmentations. The well-known Rand index [9] and its variants [10, 11] are of this kind. So far, there is no standard procedure for segmentation quality evaluation due to the ill-defined nature of image segmentation, i.e., there may be multiple acceptable segmentations which are consistent with the human interpretation of an image. In addition, there exists a large diversity in the perceptually meaningful segmentations of different images. These factors make the evaluation task very complex.
In this work, we focus on evaluating segmentation results with multiple ground truths. The existing methods [5–11] of this kind match the whole given segmentation against the ground truths for evaluation. However, the available human-labeled ground truths are only a small fraction of all the possible interpretations of an image. The available dataset of ground truths might not contain the desired ground truth that matches the input segmentation. Hence such a comparison often introduces a certain bias into the result and falls short of the goal of objective evaluation.
Evaluation of Image Segmentation Quality 289
Fig. 1. A composite interpretation between the segmentation and the ground truths - the basic idea. (a) is a sample image and (b) is a segmentation of it. Different parts (shown in different colors in (c)) of the segmentation can be found to be very similar to parts of the different human-labeled ground truths, as illustrated in (d).
We propose a new framework to solve this problem. The basic idea is illustrated in Fig. 1. Fig. 1(b) shows a possible segmentation of the image in Fig. 1(a), and it is not directly identical to any of the ground truths listed in Fig. 1(d). However, one would agree that Fig. 1(b) is a good segmentation, and it is similar to these ground truths in the sense that it is composed of local structures similar to theirs. Inspired by this observation, we propose to design a segmentation quality measure which can generalize the configurations of the segmentation and preserve the structural consistency across the ground truths. We assume that if a segmentation is good, it can be constructed from pieces of the ground truths. To measure the quality of the segmentation, a composite ground truth can be adaptively produced to locally match the segmentation as much as possible. Note that in Fig. 1(c), the shared regions between the segmentation and the ground truths are typically irregularly shaped; therefore the composition is data driven and cannot be predefined. Also, the confidence of selected pieces from the ground truths is examined during the process of comparison. Less reliance should be placed on ambiguous structures, even if they are very similar to the segmentation. The proposed measure integrates all these factors for a robust evaluation. Fig. 2 illustrates the flowchart of the proposed framework. First, a new composite ground truth is adaptively constructed from the ground truths in the database, and then the quantitative evaluation score is produced by comparing the input segmentation with the new ground truth.
Researchers have found that the human visual system (HVS) is highly adapted to extract structural information from natural scenes [14]. As a consequence, a perceptually meaningful measure should be sensitive to errors in the structures of the segmentations. It is also known that human observers may pay different attention to different parts of an image [6, 8]. The different ground-truth segmentations of an image therefore present different levels of detail of the objects in the image. This fact makes them rarely identical in the global view, while more consistent in the local structures. For this reason, the evaluation of image segmentation should be based on the local structures, rather than only the global views. Standard similarity measures, such as Mutual Information [13], Mean Square Error (MSE) [14], the probabilistic Rand index [9] and Precision-Recall curves [8], compare the segmentation with the whole ground truths, producing a
matching result based on the best or the average score. Since the HVS tends to capture the local structures of images, these measures cannot provide an HVS-like evaluation of the segmentation. The idea of matching images with composite pieces has been explored in [15-17], where no globally similar image exists in the given resource to exactly match the query image. These methods pursue a composition of the given images that is perceptually meaningful. In contrast, in this paper we propose to create a new composition that is not only visually similar to the given segmentation, but also keeps the consistency among the ground truths. The construction is performed automatically on the boundary maps instead of the original images. To the best of our knowledge, the proposed work is the first to generalize and infer the ground truths for segmentation evaluation.
The rest of the paper is organized as follows. Section 2 provides the theoretical framework for constructing the ground truth that is adaptive to the given segmentation. A segmentation evaluation measure is consequently presented in Section 3. Section 4 presents experimental results on 500 benchmark images. Finally, the conclusion is drawn in Section 5.
Fig. 3. An example of adaptive ground truth composition for the given segmentation S. G1 and G2 are two ground truths by human observers. The optimal labeling {l_G1, l_G2} of G1 and G2 produces a composite ground truth G*, which matches S as much as possible.
then seek the labeling l that minimizes the energy. In this work, we propose an energy function that follows the Potts model [18]:

E(l) = Σ_j D(l_gj) + Σ_{ {g_j, g_j'} ∈ M } u_{g_j, g_j'} T(l_gj ≠ l_gj')    (1)

There are two terms in the energy function. The first term D(l_gj) is called the data term. It penalizes the decision of assigning l_gj to the element g_j, and thus can be taken as a measure of difference. Given the normalized distance d(s_j, g_j) between the segmentation S and the ground truths, we can define:

D(l_gj) = d(s_j, g_j)    (2)

The second term u_{g_j, g_j'} T(l_gj ≠ l_gj') indicates the cost of assigning different labels to the pair of elements {g_j, g_j'} in G*. M is a neighborhood system, and T is an indicator function:

T(l_gj ≠ l_gj') = 1 if l_gj ≠ l_gj', and 0 otherwise    (3)

We call u_{g_j, g_j'} T(l_gj ≠ l_gj') the smoothness term, in that it encourages elements in the same region to take the same label. In this way, the consistency of neighboring structures can be preserved. The separation of regions should pay a higher cost on elements whose label is agreed upon by few ground truths, and a lower cost in the reverse case. Thus we define u_{g_j, g_j'} as:

u_{g_j, g_j'} = min{d_j, d_j'}    (4)
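A minimal sketch of evaluating this energy for a candidate labeling (our own illustration; the element indexing and data-cost layout are assumptions, and the minimization itself would be done with graph cuts as in the paper):

```python
def labeling_energy(labels, data_cost, neighbors, u):
    """Energy of Eq. (1): sum of data terms D(l_gj) plus a smoothness
    penalty u * T(l_gj != l_gj') over neighboring element pairs.
    labels[j]: chosen label (ground-truth index) for element j.
    data_cost[j][k]: cost of assigning label k to element j (Eq. 2).
    neighbors: list of index pairs (j, j2); u[(j, j2)]: pair weight (Eq. 4)."""
    energy = sum(data_cost[j][labels[j]] for j in range(len(labels)))
    for (j, j2) in neighbors:
        if labels[j] != labels[j2]:   # indicator T of Eq. (3)
            energy += u[(j, j2)]
    return energy
```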
where c_s and c_s' are the complex Gabor coefficients of two segmentations s and s', respectively, |c_{s,i}| is the magnitude of a complex Gabor coefficient, and c* is the conjugate of c. Two small positive constants are included to make the calculation stable.
It is easy to see that the maximum value of G-SSIM is 1 if c_s and c_s' are identical. Human labeling variations cause technical problems in handling multiple ground truths; the wavelet measure we use can better address this problem. It preserves the local structural information of the image without requiring precise correspondences between pixels, and is robust to small geometric deformations during the matching process. We now define the distance d as:

d(c_s, c_s') = 1 − H(c_s, c_s')    (6)

where H(c_s, c_s') is the average value of G-SSIM over the 24 Gabor kernels. Notice that although the value of d is defined at each pixel, it is inherently determined by the textural and structural properties of localized regions of the image pairs. With the distance d defined in Eq. (6), we can optimize Eq. (1) so that the composite ground truth G* for the given segmentation S can be obtained. Fig. 6 shows three examples of the adaptive ground truth composition. It should be noted that the composite ground truth does not have to contain a closed form of boundary, since it is designed to work in conjunction with the benchmark criterion (Sec. 3).
Fig. 6. Examples of the composite ground truth G*. (a) Original images. (b)-(e) Human-labeled ground truths. (f) Input segmentations of the images by the mean-shift algorithm [22]. (g) The composite ground truth G*.
R_sj = 1 − d̄_j    (7)

where d̄_j is the average distance between g*_j and {g_j^1, g_j^2, . . . , g_j^K}. In Eq. (7), R_sj achieves its highest value 1 when the distance between g*_j and {g_j^1, g_j^2, . . . , g_j^K} is zero, and its lowest value 0 in the reverse situation. Since R_sj is a positive factor describing the confidence, the similarity between s_j and g*_j should be normalized to [-1, 1] so that high confidence works reasonably for both good and bad segmentations. If there are K instances in G and all of them contribute to the construction of G*, we can decompose S into K disjoint sets {S_1, . . . , S_K}. Based on the above considerations, the measure of segmentation quality is defined as:

M(S, G) = (1/N) Σ_{i=1}^{K} Σ_{s_j ∈ S_i} (1 − 2 d(s_j, g*_j)) R_sj    (8)
We can see that the proposed quality measure ranges from -1 to 1, and is an accumulated sum of the similarity contributed by the individual elements of S. The minimum value of d(s_j, g*_j) is zero when s_j is completely identical to the ground truth g*_j; meanwhile, if all the ground truths in G are identical, R_sj will be 1 and M(S, G) will achieve its maximum value 1. Note that the measure is decided by both the distances d(s_j, g*_j) and the confidences R_sj. If S is similar/dissimilar to G* without a high consistency among the ground truth data {g_j^1, g_j^2, . . . , g_j^K}, M(S, G) will be close to zero. This may be a common case for images with complex contents, where the perceptual interpretation of the image contents is diverse. In other words, R_sj strengthens confident decisions on the similarity/dissimilarity and therefore preserves the structural consistency in the ground truths.
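Given per-element distances d(s_j, g*_j) and confidences R_sj, Eq. (8) reduces to a confidence-weighted average, which can be sketched as follows (function and parameter names are ours):

```python
import numpy as np

def quality_measure(distances, confidences, n_elements):
    """Segmentation quality M(S, G) of Eq. (8): each element s_j of S
    contributes (1 - 2 d(s_j, g*_j)) * R_sj, averaged over N elements.
    distances, confidences: per-element d(s_j, g*_j) and R_sj."""
    d = np.asarray(distances, dtype=float)
    r = np.asarray(confidences, dtype=float)
    return float(np.sum((1.0 - 2.0 * d) * r) / n_elements)
```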
4 Experiments
PR(S, G) = (2 / (N(N−1))) Σ_{j<j'} [ I(l_j^S = l_j'^S) p_jj' + I(l_j^S ≠ l_j'^S)(1 − p_jj') ]    (10)

where p_jj' is the ground-truth probability that l_j^S = l_j'^S. In practice, p_jj' is defined as the mean pixel-pair relationship among the ground truths:

p_jj' = (1/K) Σ_i I(l_j^{G_i} = l_j'^{G_i})    (11)

According to the above definitions, the PR index ranges from 0 to 1, where a score of 0 indicates that the segmentation and the ground truths have opposite pairwise relationships, while a score of 1 indicates that the two have the same pairwise relationships.
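The PR index of Eqs. (10)-(11) can be sketched directly from its definition (an O(N^2) illustration over all pixel pairs, not an efficient implementation):

```python
import numpy as np
from itertools import combinations

def probabilistic_rand(seg_labels, gt_label_sets):
    """Probabilistic Rand index over all pixel pairs.
    seg_labels: length-N label array of the segmentation S.
    gt_label_sets: K label arrays, one per ground truth."""
    n = len(seg_labels)
    total = 0.0
    for j, j2 in combinations(range(n), 2):
        same_s = seg_labels[j] == seg_labels[j2]
        # p_jj': fraction of ground truths where pixels j, j2 share a label
        p = np.mean([g[j] == g[j2] for g in gt_label_sets])
        total += p if same_s else (1.0 - p)
    return total / (n * (n - 1) / 2)
```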
Fig. 7. Example of measure scores for different segmentations. For each original image, 5 segmentations are obtained by the mean-shift algorithm [22]. The rightmost column shows the plots of scores achieved by the F-measure (in blue), the PR index (in green) and our method (in red).
Table 1. The number of results which are consistent with the human judgment, as well as the numbers of winning cases and failure cases for the competing methods
5 Conclusions
References
1. Liu, J., Yang, Y.: Multiresolution color image segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence 16, 689–700 (1994)
2. Borsotti, M., Campadelli, P., Schettini, R.: Quantitative evaluation of color image segmentation results. Pattern Recognition Letters 19, 741–747 (1998)
3. Zhang, H., Fritts, J., Goldman, S.: An entropy-based objective segmentation evaluation method for image segmentation. In: SPIE Electronic Imaging Storage and Retrieval Methods and Applications for Multimedia, pp. 38–49 (2004)
4. Ren, X., Malik, J.: Learning a classification model for segmentation. In: IEEE International Conference on Computer Vision, pp. 10–17 (2003)
5. Christensen, H., Phillips, P.: Empirical evaluation methods in computer vision. World Scientific Publishing Company (2002)
6. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: International Conference on Computer Vision, pp. 416–423 (2001)
7. Freixenet, J., Munoz, X., Raba, D., Martí, J., Cufí, X.: Yet Another Survey on Image Segmentation: Region and Boundary Information Integration. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002, Part III. LNCS, vol. 2352, pp. 408–422. Springer, Heidelberg (2002)
8. Martin, D.: An empirical approach to grouping and segmentation. Ph.D. dissertation, U.C. Berkeley (2002)
9. Rand, W.: Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66, 846–850 (1971)
10. Fowlkes, E., Mallows, C.: A method for comparing two hierarchical clusterings. Journal of the American Statistical Association 78, 553–569 (1983)
11. Unnikrishnan, R., Hebert, M.: Measures of similarity. In: IEEE Workshop on Applications of Computer Vision, pp. 394–400 (2005)
12. Felzenszwalb, P., Huttenlocher, D.: Efficient graph-based image segmentation. International Journal of Computer Vision 59, 167–181 (2004)
13. Viola, P., Wells, W.: Alignment by maximization of mutual information, vol. 3, pp. 16–23 (1995)
14. Wang, Z., Bovik, A.: Modern image quality assessment. Morgan and Claypool Publishing Company, New York (2006)
15. Wang, Z., Bovik, A., Sheikh, H., Simoncelli, E.: Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing 13, 600–612 (2004)
16. Agarwala, A., Dontcheva, M., Agrawala, M., Drucker, S., Colburn, A., Curless, B., Salesin, D., Cohen, M.: Interactive digital photomontage, vol. 23, pp. 294–302 (2004)
17. Russell, B., Efros, A., Sivic, J., Freeman, W., Zisserman, A.: Segmenting scenes by matching image composites. In: Advances in Neural Information Processing Systems, pp. 1580–1588 (2009)
18. Potts, R.: Some generalized order-disorder transformation. Proceedings of the Cambridge Philosophical Society 48, 106–109 (1952)
19. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence 23, 1222–1239 (2001)
20. Martin, D., Fowlkes, C., Malik, J.: Learning to detect natural image boundaries using local brightness, color and texture cues. IEEE Trans. on Pattern Analysis and Machine Intelligence 26, 530–549 (2004)
21. Sampat, M., Wang, Z., Gupta, S., Bovik, A., Markey, M.: Complex wavelet structural similarity: A new image similarity index. IEEE Transactions on Image Processing 18, 2385–2401 (2009)
22. Comanicu, D., Meer, P.: Mean shift: A robust approach toward feature space analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence 24, 603–619 (2002)
23. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 888–905 (1997)
Exact Acceleration of Linear Object Detectors
1 Introduction
A common technique for object detection is to apply a binary classifier at every possible position and scale of an image in a sliding-window fashion. However, searching the entire search space, even with a simple detector, can be slow, especially if a large number of image features is used.
To that end, linear classiers have gained a huge popularity in the last few
years. Their simplicity allows for very large scale training and relatively fast
testing, as they can be implemented in terms of convolutions. They can also
reach state-of-the-art performance provided one use discriminant enough fea-
tures. Indeed, such systems have constantly ranked atop of the popular Pascal
VOC detection challenge [1,2]. Part-based deformable models [3,4] are the latest
incarnations of such systems, and current winners of the challenge.
The algorithm we propose leverages the classical use of the Fourier transform to accelerate the multiple evaluations of a linear predictor in a multi-scale sliding-window detection scheme. Despite relying on a classic result of signal processing, the practical implementation of this strategy is not straightforward and requires a careful organization of the computation. It can be summarized in three main ideas: (a) we exploit the linearity of the Fourier transform to avoid having one such transform per image feature (see Sect. 2.2), (b) we control the memory usage required to store the transforms of the filters by building patchworks combining the multiple scales of an image (see Sect. 3.1), and finally (c) we optimize the use of the processor cache by computing the Fourier-domain point-wise multiplications in small fragments (see Sect. 3.2).
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 301–311, 2012.
© Springer-Verlag Berlin Heidelberg 2012
302 C. Dubout and F. Fleuret
Table 1. Notations
Typical linear object detectors, including the individual part detectors of part-based
models, extract low-level features densely from every scale of an image
pyramid. Those features are arranged in a coarse grid with several features
extracted from each grid cell. For example, the Histogram of Oriented Gradients
(HOG) features [15] correspond to the bins of a histogram of the gradient orientations
of the pixels within the cell. Typically cells of size 8×8 pixels are used [15,4],
while the number of features per cell varies from around ten to a hundred (32 in
[5], which we use as a baseline). An alternative description of the arrangement of
the features is to view them as organized in planes, as depicted in Fig. 1. Those
planes are analogous to the RGB channels of standard color images, but instead
of colors they contain distinct features from each cell of the grid. The filters
trained by the detector are similar in composition.
Fig. 1. The top figure shows the standard process, convolving and summing all image
and filter planes. The bottom figure depicts how such a process can be sped up by
taking advantage of the fact that the inverse Fourier transform that produces the final
detection score needs to be done only once per image/filter pair, and not once per
feature, since the sum across planes can be done in the Fourier domain.
Let $K$ stand for the number of features, $x^k \in \mathbb{R}^{M \times N}$ for the $k$-th feature plane
of a particular image, and $y^k \in \mathbb{R}^{P \times Q}$ for the $k$-th feature plane of a particular
filter. The scores $z \in \mathbb{R}^{(M+P-1) \times (N+Q-1)}$ of a filter evaluated on an image are
given by the following formula:
z_{ij} = \sum_{k=0}^{K-1} \sum_{p=0}^{P-1} \sum_{q=0}^{Q-1} x^k_{i+p,\,j+q} \, y^k_{pq}    (1)
that is, the sum across planes of the Frobenius inner products of the image's
sub-window of size $P \times Q$ starting at position $(i, j)$ and the filter. Computing
the responses of a filter at all possible locations can thus be done by summing
the results of the (full) convolutions of the image and the (reversed) filter, i.e.
z = \sum_{k=0}^{K-1} x^k * y^k    (2)
C_{\mathrm{std}} = 2MNPQ    (3)
corresponding to one multiplication and one addition for each image and each
filter coefficient. Ultimately one needs to convolve $L$ filters and sum them across
$K$ feature planes (see Fig. 1), bringing the total number of operations per image
to
C_{\mathrm{std/image}} = K L \, C_{\mathrm{std}}.    (4)
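As a sanity check of Eqs. (1)–(2), the following self-contained sketch (a NumPy illustration, not the paper's C++ implementation; array shapes and function names are assumptions) computes the scores both directly and through the FFT. Thanks to the linearity of the transform, the sum across the $K$ planes is done in the Fourier domain, so a single inverse transform per image/filter pair suffices:

```python
import numpy as np

def scores_direct(x, y):
    """Sum of full 2D convolutions across the K feature planes (Eq. 2)."""
    K, M, N = x.shape
    _, P, Q = y.shape
    z = np.zeros((M + P - 1, N + Q - 1))
    for k in range(K):
        for p in range(P):
            for q in range(Q):
                # Spread filter coefficient y[k, p, q] over the image plane.
                z[p:p + M, q:q + N] += y[k, p, q] * x[k]
    return z

def scores_fourier(x, y):
    """Same scores via the FFT: one forward transform per plane, the sum
    across planes done in the Fourier domain, one inverse transform total."""
    K, M, N = x.shape
    _, P, Q = y.shape
    shape = (M + P - 1, N + Q - 1)          # zero-pad to the full-conv size
    acc = np.zeros(shape, dtype=complex)
    for k in range(K):
        acc += np.fft.fft2(x[k], shape) * np.fft.fft2(y[k], shape)
    return np.fft.ifft2(acc).real           # single inverse transform

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 40, 30))   # K=32 feature planes of an image
y = rng.standard_normal((32, 6, 6))     # matching planes of a 6x6 filter
assert np.allclose(scores_direct(x, y), scores_fourier(x, y))
```

The direct version costs $O(KMNPQ)$ per filter, while the Fourier version replaces the inner sums with point-wise products of precomputed transforms.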
the costs of one FT (the approximation of $C_{\mathrm{FFT}}$ comes from [16]) and of the
point-wise multiplications respectively, the total cost is approximately
per product for the three (two forward and one inverse) transforms using a Fast
Fourier Transform algorithm [16]. Note that the filters' forward FTs can be done
off-line, and thus should not be counted in the overall detection time, and that
an image's forward FT has to be done only once, independently of the number
3 Implementation Strategies
Implementing the convolutions with the help of the Fourier transform is straightforward,
but involves two difficulties: memory over-consumption and lack of
memory bandwidth. Those two problems can be remedied using the methods presented
in the following subsections.
The computational cost analysis of Sect. 2.2 was done under the assumption
that the Fourier transforms of the filters were already precomputed. But the
computation of the point-wise multiplications between the FT of an image and
that of a filter requires first padding them to the same size. Images can be of
various sizes and aspect ratios, especially since images are parsed at multiple
scales, and precomputing filters at all possible sizes as in Fig. 2(a) is unrealistic
in terms of memory consumption. Another approach could be to precompute the
FTs of the images and the filters padded only to the biggest image size, as shown
in Fig. 2(b). This would require as little memory as possible for the filters, but
would result in an additional computational burden to compute the FTs of the
images, and more importantly to perform the point-wise multiplications.
However, a simpler and more efficient approach exists, combining the advantages
of both alternatives. By grouping images together in patchworks of the
Fig. 2. The computation of the point-wise multiplications between the Fourier transform
of an image and that of a filter requires padding them to the same size before
computing the FTs. Given that images have to be parsed at multiple scales, the naive
strategy is either to store for each filter the FTs corresponding to a variety of sizes
(a), or to store only one version of the filter's FT and to pad the multiple scales of
each image (b). Both of these strategies are unsatisfactory, either in terms of memory
consumption (a) or computational cost (b). We propose instead a patchwork approach
(c), which consists of creating patchwork images composed of the multiple scales of
the image, and has the advantages of both alternatives. See Sect. 3.1 and Table 2 for
details.
size of the largest image, one needs to compute the FTs of the filters only at
that size, while the amount of padding needed is much less than required by the
second approach. We observed it experimentally to be less than 20%, vs. 87%
for the second approach. The performance thus stays competitive with the first
approach while retaining the memory footprint of the second (see Table 2 for
an asymptotic analysis). The grouping of the images does not need to be optimal,
and very fast heuristics yielding good results exist, such as the bottom-left
bin-packing heuristic [17].
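A minimal sketch of the grouping step, using a simple shelf placement as a crude stand-in for the bottom-left bin-packing heuristic of [17] (the function name, scale factor, and stopping size are illustrative assumptions, not values from the paper):

```python
def pack_scales(M, N, scale=0.9, min_side=6):
    """Greedily pack the pyramid levels of an M x N image into patchworks
    of size M x N, filling each patchwork shelf by shelf."""
    # Sizes of the pyramid levels: M x N, then repeatedly scaled down.
    sizes = []
    m, n = M, N
    while m >= min_side and n >= min_side:
        sizes.append((m, n))
        m, n = int(m * scale), int(n * scale)

    patchworks = [[]]            # each patchwork: list of (top, left, h, w)
    shelf_top, shelf_h, cur_x = 0, 0, 0
    for h, w in sizes:
        if cur_x + w > N:        # level does not fit: start a new shelf
            shelf_top, cur_x = shelf_top + shelf_h, 0
            shelf_h = 0
        if shelf_top + h > M:    # shelf does not fit: start a new patchwork
            patchworks.append([])
            shelf_top, shelf_h, cur_x = 0, 0, 0
        patchworks[-1].append((shelf_top, cur_x, h, w))
        cur_x += w
        shelf_h = max(shelf_h, h)
    return patchworks
```

Each returned patchwork fits in an $M \times N$ canvas, so the filters' FTs need to be precomputed at that single size only; a true bottom-left heuristic would pack somewhat more tightly.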
Table 2. Asymptotic memory footprint and computational cost for the three approaches
described in Sect. 3.1, to process one image of size $M \times N$ with $L$ filters,
at scales $1, \lambda, \lambda^2, \ldots$ The factor $\frac{1}{1-\lambda^2} = \sum_{k=0}^{+\infty} \lambda^{2k}$ accounts for the multiple scales of
the image pyramid, while $-\frac{1}{2}\log_\lambda MN$ is the number of scales to
visit. Taking the same typical values as in Sect. 2.2 for $M, N = 64$, and $\lambda = 0.9$ gives
$\frac{1}{1-\lambda^2} \approx 5.3$ and $-\frac{1}{2}\log_\lambda MN \approx 44$. Our patchwork method (c) combines the advantages of
both methods (a) and (b).
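The geometric-series factor in the caption can be checked in a couple of lines (using $\lambda = 0.9$ as in the caption):

```python
# Total area of the pyramid levels relative to the original image,
# assuming scales 1, lambda, lambda^2, ... as in Sect. 3.1: each level
# has lambda^(2k) times the original area, and the series sums in
# closed form to 1 / (1 - lambda^2).
lam = 0.9
area_factor = 1.0 / (1.0 - lam**2)    # sum over k of lambda^(2k)
print(round(area_factor, 1))          # -> 5.3
```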
Fig. 3. To compute the point-wise products between each of the Fourier transforms
of the R patchworks and each of the FTs of the L filters, the naive procedure loops
through every pair. This strategy unfortunately causes many CPU cache misses,
since the transforms are likely to be too large to all fit in cache, resulting in a slow
computation of each one of the LR products. We propose to decompose the transforms
into fragments (here shown as red rectangles), and to have an outer loop through them.
With such a strategy, by loading a total of L + R fragments in the CPU cache, we end
up computing LR point-wise products between fragments. See Sect. 3.2 for details.
time it takes to read (resp. write) $F$ coefficients from (resp. into) the memory
to (resp. from) the CPU cache.
A naive strategy going through every patchwork/filter pair results in a total
processing time of
This is mainly due to the poor use of the cache, which is constantly reloaded with
new data from the main memory.
We can improve this strategy by decomposing the transforms into fragments of
size $F$, and by adding an outer loop through these $MN/F$ fragments (see Fig. 3).
The cache usage will be $K(R + 1)F$, and the time to process all patchwork/filter
pairs will become
T_{\mathrm{fast}} = \frac{MN}{F}\Big(\underbrace{K(L+R)\,v(F)}_{\text{reading}} + \underbrace{KLR\,u(F)}_{\text{multiplications}} + \underbrace{LR\,v(F)}_{\text{writing}}\Big)    (10)
where $MN/F$ is the number of fragments.
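The fragment loop of Eq. (10) can be sketched as follows. Python cannot exhibit the cache behavior itself, so this only illustrates the loop order (load $L + R$ fragments once, then form all $LR$ products on them); it also simplifies to a single feature plane ($K = 1$), and all names are assumptions:

```python
import numpy as np

def pointwise_products_fragmented(patchworks, filters, F):
    """Compute all L*R Fourier-domain point-wise products in fragments of
    F coefficients (cf. Fig. 3): for each fragment, the R patchwork
    fragments and L filter fragments are touched once, and all L*R
    products are formed while this small working set would sit in cache.
    Transforms are modeled as flat complex arrays of equal length."""
    R, L = len(patchworks), len(filters)
    n = patchworks[0].size
    out = [[np.empty(n, dtype=complex) for _ in range(R)] for _ in range(L)]
    for start in range(0, n, F):                  # outer loop over fragments
        stop = min(start + F, n)
        p_frags = [p[start:stop] for p in patchworks]   # R fragment reads
        f_frags = [f[start:stop] for f in filters]      # L fragment reads
        for l in range(L):
            for r in range(R):                    # L*R products per fragment
                out[l][r][start:stop] = f_frags[l] * p_frags[r]
    return out
```

Exchanging the loops this way is what turns $2LR$ transform loads per image into roughly $L + R$ fragment loads per fragment, the effect quantified by Eq. (12).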
By making $F$ small, we could reduce the cache usage arbitrarily. However, CPUs
load from the main memory in bursts, which makes values smaller
than the burst size sub-optimal (see Fig. 4). The speed ratio between the naive
and the fast methods is
\frac{T_{\mathrm{naive}}}{T_{\mathrm{fast}}} = \frac{\big(2 + \frac{1}{K}\big) + \frac{u(MN)}{v(MN)}}{\big(\frac{L+R}{LR} + \frac{1}{K}\big) + \frac{u(MN)}{v(MN)}}    (12)
\approx 2\,\frac{v(MN)}{u(MN)} + 1    (13)
In practice, the cache can hold at least one patchwork of size $MN$, and the actual
speedup we observe is around 5.7. Decomposing the transforms into fragments
also scales better across multiple CPU cores, as they can focus on distinct parts
of the transforms, instead of all loading the same patchwork or filter.
4 Experiments
Fig. 4. Average time taken by the point-wise multiplications (in seconds) as a function
of the fragment size (number of coefficients, log scale), for one image of the Pascal
VOC 2007 challenge
The evaluation was done over all 20 classes of the challenge by looking at the
detection time speedup with respect to the fastest baseline convolution implementation
on the same machine. The baseline is written in assembly and makes
use of both CPU SIMD instructions and multi-threading. As our method is exact,
the average precision should stay the same up to numerical precision issues.
The results are given in Table 4 for verification purposes.
We used the FFTW (version 3.3) library [16] to compute the FFTs, and the
Eigen (version 3.0) library [18] for the remaining linear algebra. Both libraries are
very fast as they make use of the CPU SIMD instruction sets. Our experiments
show that our approach achieves a significant speedup, being more than seven
times faster (see Table 3). We compare only the time taken by the convolutions
in order to be fair to the baseline, some of its other components being written in
Matlab, while our implementation is written fully in C++. The average time taken
by the baseline implementation to convolve a feature pyramid (10 scales per octave)
with all the filters of a particular class (54 filters, most of them of size 6×6)
was 413 ms. The average time taken by our implementation was 56 ms, including
the forward FFTs of the images. For comparison, the time taken in our implementation
to compute the HOG features (including loading and resizing the image)
was on average 64 ms, while the time taken by the distance transforms was 42 ms,
the time taken by the remaining components of the system being negligible.
We also tested the numerical precision of both approaches. The maximum absolute
difference that we observed between the baseline and a more precise implementation
(using double precision) was $9.5 \times 10^{-7}$, while for our approach it
was $4.8 \times 10^{-7}$. The mean absolute differences were respectively $2.4 \times 10^{-8}$ and
$1.8 \times 10^{-8}$.
While the speed and numerical accuracy of the baseline degrade proportionally
with the filter sizes, they remain constant with our approach, enabling one to
use bigger filters for free. For example, if one were to use filters of size 8×8
aero bike bird boat bottle bus car cat chair cow table
V4 (ms) 409 437 403 414 366 439 352 432 417 429 450
Ours (ms) 55 56 53 56 57 56 54 56 56 57 57
Speedup (x) 7.4 7.8 7.6 7.4 6.4 7.9 6.5 7.7 7.5 7.5 8.0
aero bike bird boat bottle bus car cat chair cow table
V4 (%) 28.9 59.5 10.0 15.2 25.5 49.6 57.9 19.3 22.4 25.2 23.3
Ours (%) 29.4 58.9 10.0 13.4 25.3 50.6 57.6 18.9 22.6 24.9 24.4
5 Conclusion
The idea motivating our work is that the Fourier transform is linear, enabling
one to do the addition of the convolutions across feature planes in Fourier space,
and be left in the end with only one inverse Fourier transform to do. To take advantage
of this, we proposed two additional implementation strategies, ensuring
maximum efficiency without requiring huge memory space and/or bandwidth,
and thus making the whole approach practical.
The method increases the speed of many state-of-the-art object detectors severalfold
with no loss in accuracy when using small filters, and becomes even faster
and more accurate with larger ones. That such an approach is possible is not
entirely trivial (the reference implementation of [5] contains five different ways
to do the convolutions, all at least an order of magnitude slower); nevertheless,
the analysis we developed is readily applicable to many other systems.
References
1. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes Challenge 2007 (VOC 2007) Results (2007), http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html
2. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes Challenge 2011 (VOC 2011) Results (2011), http://www.pascal-network.org/challenges/VOC/voc2011/workshop/index.html
3. Felzenszwalb, P., Huttenlocher, D.: Pictorial Structures for Object Recognition. International Journal of Computer Vision 61, 55–79 (2005)
4. Felzenszwalb, P., Girshick, R., McAllester, D., Ramanan, D.: Object Detection with Discriminatively Trained Part-Based Models. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(9) (2010)
5. Felzenszwalb, P., Girshick, R., McAllester, D.: Discriminatively Trained Deformable Part Models, Release 4, http://people.cs.uchicago.edu/~pff/latent-release4/
6. Viola, P., Jones, M.: Robust Real-time Object Detection. International Journal of Computer Vision (2001)
7. Perko, R., Leonardis, A.: Context Driven Focus of Attention for Object Detection. In: Paletta, L., Rome, E. (eds.) WAPCV 2007. LNCS (LNAI), vol. 4840, pp. 216–233. Springer, Heidelberg (2007)
8. Maji, S., Malik, J.: Object detection using a max-margin Hough transform. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1038–1045 (2009)
9. Lampert, C., Blaschko, M., Hofmann, T.: Beyond sliding windows: Object localization by efficient subwindow search. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2008)
10. Felzenszwalb, P., Girshick, R., McAllester, D.: Cascade object detection with deformable part models. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2241–2248 (2010)
11. Zhang, C., Viola, P.: Multiple-Instance Pruning For Learning Efficient Cascade Detectors. In: Advances in Neural Information Processing Systems (2007)
12. Cecotti, H., Graeser, A.: Convolutional Neural Network with embedded Fourier Transform for EEG classification. In: International Conference on Pattern Recognition, pp. 1–4 (2008)
13. Dollar, P., Belongie, S., Perona, P.: The Fastest Pedestrian Detector in the West. In: British Machine Vision Conference (2010)
14. Felzenszwalb, P., Huttenlocher, D.: Distance transforms of sampled functions. Technical report, Cornell Computing and Information Science (2004)
15. Dalal, N., Triggs, B.: Histograms of Oriented Gradients for Human Detection. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 886–893 (2005)
16. Frigo, M., Johnson, S.: The design and implementation of FFTW3. Proceedings of the IEEE 93(2), 216–231 (2005)
17. Chazelle, B.: The Bottom-Left Bin-Packing Heuristic: An Efficient Implementation. IEEE Transactions on Computers, 697–707 (1983)
18. Guennebaud, G., Jacob, B.: Eigen v3, http://eigen.tuxfamily.org
Latent Hough Transform for Object Detection
Nima Razavi¹, Juergen Gall², Pushmeet Kohli³, and Luc van Gool¹,⁴
¹ Computer Vision Laboratory, ETH Zurich
² Perceiving Systems Department, MPI for Intelligent Systems
³ Microsoft Research Cambridge
⁴ IBBT/ESAT-PSI, K.U. Leuven
1 Introduction
Object category detection from 2D images is an extremely challenging and complex
problem. The fact that individual instances of a category can look very
different in the image due to pose, viewpoint, scale or imaging nuisances is one
of the reasons for the difficulty of the problem. A number of techniques have been
proposed to deal with this problem by introducing invariance to such factors.
While some approaches [1–3] aim at achieving partial invariance through specific
feature representations, others [4–9] divide the object into parts, assuming less
variation within each part and thus a better invariant representation.
Codeword- or voting-based methods for object detection belong to the latter
category. These methods work by representing the image by a set of voting elements
such as interest points, pixels, patches or segments that vote for the values
of parameters that describe the object instance. The Implicit Shape Model [4],
an important instance of the Generalized Hough Transform (GHT), represents
an object by the relative locations of specific image features with respect to a
reference point. For object detection, the image features vote for the possible
locations of this reference point. Although this representation allows for parts
from different training examples to support a single object hypothesis, it also
produces false positives by accumulating votes that are consistent in location
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 312–325, 2012.
© Springer-Verlag Berlin Heidelberg 2012
but inconsistent in other properties like pose, color, shape or type. For example,
features taken from a frontal-view image of one training example and a side-view
image of another training example might agree in location, but an object cannot
be seen from frontal and side views at the same time. It is our understanding
that this accumulation of inconsistent votes is the main reason behind the poor
performance of voting-based approaches.
To improve the detection performance, researchers have proposed enforcing
consistency of the votes by estimating additional parameters like aspect ratio
[5] or pose [10–14]. While the use of more parameters obviously improves consistency,
it also increases the dimensionality of the Hough space. However, Hough
transform-based methods are known to perform poorly in high-dimensional
spaces [15]. Consistency can also be enforced by grouping the training data
and voting for each group separately. Such a grouping can be defined based
on manual annotations of the objects, if available, or obtained by clustering the
training data. While this does not increase the dimensionality of the voting space,
the votes for each group can become sparse due to a limited number of training
examples, which impairs the detection performance. Even if annotations are
available for the training examples, it is not clear which properties to annotate
for optimal detection performance, since the importance of properties differs
from case to case. For instance, viewpoint is important for detecting airplanes
but far less so for detecting balls.
In this work, we propose to augment the Hough transform with latent variables
to enforce consistency of the votes. This is done by only allowing votes that
agree on the values of the latent variables to support a single hypothesis. We
discriminatively learn the optimal assignments of the training data to an arbitrary
latent space to improve object detection performance. To this end, starting
from a random assignment, the training examples are reassigned by optimizing
an objective function that approximates the average precision on the training
set. In order to make the optimization feasible, the objective function exploits
the linearity of voting approaches. Further, we extend the concept that training
instances can only be assigned to a single latent value. In particular, we let
training examples assume multiple values and further allow these associations
to be weighted, i.e., modeling the uncertainty involved in assigning a training
example to them. This generalization makes the latent Hough transform robust
with respect to the number of quantized latent values, and we believe that the
same is applicable when learning latent variable models in other domains.
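The core scoring idea can be sketched in a few lines. The scoring rule below (taking, for each hypothesis, the best latent value's weighted vote sum) is a plausible illustrative reading of voting in the augmented space, not the paper's exact formulation; the array layout and function name are assumptions:

```python
import numpy as np

def latent_scores(V, W):
    """V[t, h]: precomputed vote mass V(h|t) of training example t for
    hypothesis h (precomputable thanks to the linearity of the voting).
    W[z, t]: possibly soft assignment of training example t to latent
    value z.  Score each hypothesis by its best latent value:
        S(h, W) = max_z sum_t W[z, t] * V(h|t)
    so only votes agreeing on z support a hypothesis together."""
    return (W @ V).max(axis=0)

rng = np.random.default_rng(0)
V = rng.random((6, 4))                # 6 training examples, 4 hypotheses
W_marginal = np.ones((1, 6))          # |Z| = 1 recovers the plain vote sum
assert np.allclose(latent_scores(V, W_marginal), V.sum(axis=0))
```

The precomputation of $V(h|t)$ is what makes repeatedly re-scoring under different $W$ cheap during learning.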
Experiments on the Leuven cars dataset [10] and the PASCAL VOC 2007 benchmark
[16] show that our latent Hough transform approach significantly outperforms
the standard Hough transform. We also compare our method to other
baselines with unsupervised clustering of the training data or with manual annotations.
In our experiments, we empirically demonstrate the superior performance
of our latent approach over these baselines. The proposed method performs better
than the best Hough transform-based methods, and it even outperforms
state-of-the-art detectors on some categories of the PASCAL VOC 2007.
2 Related Work
A number of previous works have investigated the idea of using object properties
to group training data. For instance, the training data is clustered according to
the aspect ratios of the bounding boxes in [8]. Other grouping criteria like user-annotated
silhouettes [14], viewpoints [13, 17–19], height [21], and pose [11, 20]
have been considered as well. To enforce consistency, the features vote for objects'
depth in addition to locations in [22], and close-by features are grouped in [23]
to vote together. In [24], two subtypes are trained for each part by mirroring the
training data. The location of each part with respect to its parent joint is also
used in [25] to train sub-types. Instead of grouping the training examples, [26]
trains a model for every single positive instance in the training data. While all
of these works divide the training data into disjoint groups, [27] proposes a
generative clustering approach that allows overlapping groups.
Latent variable models have been successfully used in numerous areas of computer
vision and machine learning to deal with unobserved variables in training
data, e.g. [28–31]. Most related to our work, [8, 32] learn a latent mixture
model for object detection by discriminatively grouping the training examples
and training a part model consisting of a root filter and a set of (hierarchical)
parts for each group. In contrast to these works, our approach is a non-parametric
voting-based method where we assume the parts are given in the form of a shared
vocabulary. As has been shown in previous works [17, 33], the shared vocabulary
allows better generalization when learning a model with few examples, as
it makes use of the training data much more effectively. Given this vocabulary,
we aim at learning latent groupings of the votes, i.e., training patches, which
lead to consistent models and improve detection accuracy. The advantage of this
approach is that we need to train the parts only once and not re-train them from
scratch as in [8, 32].
Several approaches have been proposed for training the parts vocabulary.
Generative clustering of the interest points is used in [4, 12, 34] as the codebook,
whereas [5, 13] discriminatively train a codebook by optimizing the classification
and localization performance of image patches. Discriminative learning of
weights for the codebook has been addressed before as well. In [34], a weight is
learned for each entry in a Max-Margin framework. [35] introduced a kernel to
measure similarity between two images and used it to train weights
with an SVM classifier. Although increasing the detection performance, those
works differ from our approach in that they train weights within a group.
Detecting consistent peaks in the Hough space for localization has been the
subject of many investigations as well. While Leibe et al. [4] used a mean-shift
mode estimation for accurate peak localization, [36] utilize an iterative procedure
to "demist" the voting space by removing the improbable votes from
the voting space. Barinova et al. [37] pose the detection as a labeling problem
where each feature can belong to only a single hypothesis and propose an iterative
greedy optimization, detecting the maximum scoring hypothesis and
removing its votes from the voting space.
where $V(h|t, I_i)$ denotes the votes of element $I_i$ from training image $t$. Although
the votes originating from a single training example are always consistent, this
formulation accumulates the votes over all training examples even if they are
inconsistent, e.g., in pose or shape. In the next section, we show how one can
use latent variables to enforce consistency among votes.
where $V(h, z|t, I_i)$ are the votes of an image element $I_i$ from training image $t$ to
the augmented latent space $H \times Z$. For instance, $Z$ can be quantized viewpoints
of an object. Voting in this augmented space allows only votes that are consistent
in viewpoint to support a single hypothesis.
Similar to other latent variable models [8, 29, 30, 32], one can associate each
training image $t$ with a single latent assignment $z$. This association groups the
training data into $|Z|$ disjoint groups. Note that the number of these groups is
limited by the size of the training data $|T|$.
The term $V(h|t)$ is the sum of the votes originating from the training example
$t$. This term will be very important when learning the optimal $W$, as we will
discuss in Sec. 5.
The association of the training data to a single latent value $z$ does not make use
of the training data effectively. We therefore generalize the latent assignments
of the training data by letting a training image assume multiple latent values.
To this end, we relax the constraints on $W$ and allow $w_{z,t}$ to be real-valued in
$[0, 1]$ and non-zero for more than a single assignment $z$. This can be motivated
by the uncertainties of the assignments for the training examples. In particular,
the latent space can be continuous and, with an increasingly fine quantization of $Z$,
two elements of $Z$ can become very similar and thus need to share training
examples. As we show in our experiments, this generalization makes the latent
Hough transform less sensitive to the number of quantizations, i.e., $|Z|$.
The most basic special case of the latent matrix is when $|Z| = 1$ and $w_{z,t} = 1$
for all $t \in T$, which is equivalent to the original Hough transform formulation in
Eq. (1). Splitting the training data by manual annotations or unsupervised
clustering is also a special case of the latent matrix, where each row
of the matrix represents one cluster. For splitting the training data, we have
considered manual annotations of the viewpoints and two popular methods for
clustering, namely agglomerative clustering of the training instances based
on their similarity, and k-means clustering of the aspect ratios of ground-truth
bounding boxes. Similar to [35, 13], we define the similarity of two training hypotheses
as the $\chi^2$ distance of their occurrence histograms. Groupings of the
training data based on the similarity measure and manual view annotations are
visualized in Figs. 1(a) and 3(b) respectively. As shown in these figures, the
groupings based on annotations or clustering might be visually very meaningful.
However, as we illustrate in our experiments, visual similarity alone may not
be optimal for detection. In addition, it is not clear what similarity measure
to choose for grouping and how to quantize it. These problems underline the
importance of learning the optimal latent matrix for detection.
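The special cases above amount to different ways of filling the latent matrix $W$ (rows: latent values, columns: training examples). A small illustrative sketch, with made-up sizes and cluster labels:

```python
import numpy as np

T = 8                                   # number of training examples

# |Z| = 1 with w_{z,t} = 1 for all t: the original Hough transform (Eq. 1).
W_marginal = np.ones((1, T))

# Disjoint grouping (e.g. k-means on aspect ratios, or view annotations):
# binary rows, each training example assigned to exactly one group.
labels = np.array([0, 0, 1, 1, 1, 2, 2, 2])   # assumed cluster labels
W_clusters = np.zeros((3, T))
W_clusters[labels, np.arange(T)] = 1.0

# Generalized latent matrix: real-valued weights in [0, 1], letting one
# example support several (e.g. neighboring) latent values at once.
rng = np.random.default_rng(0)
W_soft = rng.random((3, T))
```

Learning then amounts to choosing the entries of such a matrix to maximize the detection objective, rather than fixing them by annotation or clustering.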
Fig. 1. (a) Visualization of the $\chi^2$ similarity metric using Isomap. The two ellipses show
the clustering of the training instances of the Aeroplane category of PASCAL VOC
2007. As can be seen, the clustering splits the data into very meaningful groups. (b)
The visualization is accurate since the first two dimensions cover most of the variation.
where $R = \{(h, y)\}$ denotes the set of hypotheses $h$ and their labels $y \in \{0, 1\}$,
and $O(W, R)$ is the objective function. A hypothesis is assigned the label $y = 1$
if it is a true positive and $y = 0$ otherwise. For each hypothesis $h$, we pre-compute
the contribution of every training instance $t \in T$ to that hypothesis, i.e.,
$V(h|t)$ in Eq. (3). It is actually the linearity of Hough transform-based approaches
in Eq. (1) that allows for this pre-computation, which is essential for learning the
latent matrix $W$ in reasonable time.
As our objective function, we use the average precision measure on the validation
set, which is calculated as
O(W, R) = \frac{1}{\sum_j y_j} \sum_{k:\,y_k=1} \frac{\sum_{j:\,y_j=1} I(h_k, h_j)}{\sum_j I(h_k, h_j)}    (5)
where, for a given latent matrix $W$, $I(h_k, h_j)$ indicates whether the score of
hypothesis $h_k$ is smaller than (or equal to) that of hypothesis $h_j$ or not:
I(h_k, h_j) = \begin{cases} 1 & \text{if } S(h_j, W) \ge S(h_k, W) \\ 0 & \text{otherwise.} \end{cases}    (6)
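Eqs. (5) and (6) can be implemented directly; a small sketch (the function name is an assumption):

```python
import numpy as np

def objective(scores, labels):
    """Average-precision-style objective of Eq. (5): for every positive
    hypothesis h_k, take the fraction of positives among all hypotheses
    scoring at least S(h_k) (the indicator of Eq. (6)), then average
    over the positives."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    total = 0.0
    for k in np.flatnonzero(labels == 1):
        reached = scores >= scores[k]        # I(h_k, h_j) for all j
        total += labels[reached].sum() / reached.sum()
    return total / labels.sum()

# A ranking that puts all positives above all negatives scores 1:
print(objective([3, 2, 1, 0], [1, 1, 0, 0]))  # -> 1.0
```

Since the scores depend on $W$ only through the precomputed $V(h|t)$ terms, evaluating this objective for a candidate $W$ does not require re-running the detector.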
Learning the latent matrix to optimize the detection performance is very challenging.
First, the number of parameters to be learned is proportional to the
number of training instances in the codebook, which is usually very large. Another
problem is that, to be faithful to the greedy optimization in [37], with
every update of the weights one needs to run the detector on the whole validation
dataset in order to measure the performance.
Fig. 2. This figure illustrates the result of using view annotations and unsupervised
clustering for grouping training data of the aeroplane, bicycle and sheep categories
of PASCAL VOC 2007. Groupings based on aspect ratio are shown in the first row,
similarity clustering in the second, and the manual view annotations in the third row.
Although clustering increases the performance for aeroplane, it reduces it for
sheep. Also, AR clustering performs better than similarity clustering for aeroplane
and bicycle, yet clustering similarities leads to better results for sheep. With
four clusters, the results deteriorate in all three categories, which is due to
the insufficient number of training examples per cluster.
6 Experiments
We have evaluated our latent Hough transform on two popular datasets, namely
the Leuven cars dataset [10] and PASCAL VOC 2007 [16]. As baselines for
our experiments, we compare our approach with the marginalization over latent
variables by voting only for locations (Marginal), unsupervised clusterings of
aspect ratios (AR clustering) and image similarities (Similarity clustering),
and the manually annotated viewpoints (View Annotations) provided for both
Leuven cars and PASCAL VOC 2007.
In all our experiments, the codebook of the ISM is trained using Hough
forests [5] with 15 trees, and the bounding boxes for a detection are estimated
using backprojection. The trees are trained up to a maximum depth of 20, such
that at least 10 occurrences remain in every leaf. For performing detection
on a test image, we use the greedy optimization of [37]. The multi-scale
detection is performed by running detection on a dense scale pyramid with a
$2^{1/4}$ resizing factor. In addition, instead of penalizing the larger hypotheses by
each tree, 200 training images from the positive set and 200 from the negative
set, from each of which 250 patches are sampled at random. The trainval
set of a category was used as the validation set for learning the latent groupings.
The performance of the method is evaluated on the test set. The multi-scale
detection is done with 18 scales starting from 1.8.
To evaluate the benefits of discriminative learning against unsupervised
clustering and manual view annotations, we compared the results of learning on
the Leuven cars and PASCAL VOC 2007 datasets. For a fair comparison, we
train the Hough forests [13] for a category only once, without considering
the view annotations or learned groupings. For the optimization with ISA, we
used 500 particles, and the numbers of epochs and iterations were both set to 40.
Figure 3 compares the performance of the learning with our baselines on the
Leuven cars [10]. Disjoint grouping of the training data based on view annotations
improves the result by 4 AP percentage points. By learning the latent
groups, the performance improves by 6.8 and 2.7 points w.r.t. the marginalization
and manual view annotations respectively. Learning the generalized latent matrix
improves the result further by about 2.5 points. By allowing more latent
VOC 2007 (columns, left to right: aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, dining table, dog, horse, motorbike, person, potted plant, sheep, sofa, train, TV/monitor, mean AP):

Our Approach
HT Marginal   16.1  35.6  2.9   3.3   20.4  15.8  25.5  7.5   10.9  37.2  10.3  3.2   28.6  34.2  4.5   19.2  21.9  10.3  9.8   43.4  18.03
HT + AR       21.8  42.7  11.4  10.2  19.6  19.1  25.4  6.0   6.3   38.2  7.6   6.1   30.5  39.0  4.9   20.5  18.4  10.8  16.9  41.3  19.83
HT + View     18.85 40.0  5.3   3.0   -     18.3  28.7  6.6   4.7   34.9  -     2.9   29.8  38.6  6.1   -     23.3  10.5  21.2  38.7  19.50
LHT Ours      24.3  43.2  11.1  10.5  20.7  20.3  24.0  8.0   11.9  38.8  10.5  4.9   33.2  39.4  8.2   21.3  22.5  10.5  17.3  44.1  21.23
Competing Approaches
HT ISK [35]   24.6  32.1  5.0   9.7   9.2   23.3  29.1  11.3  9.1   10.9  8.1   13.0  31.8  29.5  16.6  6.1   7.3   11.8  22.6  21.9  16.65
MKL [41]      37.6  47.8  15.3  15.2  21.9  50.7  50.6  30.0  17.3  33.0  22.5  21.5  51.2  45.5  23.3  12.4  23.9  28.5  45.3  48.5  31.10
Context [42]  53.1  52.7  18.1  13.5  30.7  53.9  43.5  40.3  17.7  31.9  28.0  29.5  52.9  56.6  44.2  12.6  36.2  28.7  50.5  40.7  36.77
LSVM [8]      29.0  54.6  0.6   13.4  26.2  39.4  46.4  16.1  16.3  16.5  24.5  5.0   43.6  37.8  35.0  8.8   17.3  21.6  34.0  39.0  26.26
VOC best      26.2  40.9  9.8   9.4   21.4  39.3  43.2  24.0  12.8  14.0  9.8   16.2  33.5  37.5  22.1  12.0  17.5  14.7  33.4  28.9  23.33
Table 1. Detection results on the PASCAL VOC 2007 dataset [16]. The first block
compares the performance of the Hough transform without grouping (HT Marginal),
aspect ratio clustering (HT + AR), view annotations (HT + View), and our proposed
latent Hough transform (LHT Ours). As can be seen, clustering improves the results
for 14 categories over the marginalization but reduces them for the other 6. Yet, by learning
latent groups we outperform all three baselines on most categories and perform similarly
or slightly worse (red) on the others. The comparison to state-of-the-art approaches is
shown in the second block. We outperform the best previously published voting-based
approach (ISK [35]) in mAP. Our performance is competitive on many categories with
the latent part model of [8] and is state-of-the-art on two categories (green).
assignments, one can learn finer groupings of the data and increase the performance
by 10 AP points compared to the marginalization baseline.
The detection performance with the two clusterings and view annotations on
three distinct VOC07 categories (aeroplane, bicycle, and sheep) is summarized
in Fig. 2. As can be seen, although grouping training examples may
improve performance, the improvement depends strongly on the grouping
criterion, the category, and the amount of training data per group. For example,
in detecting aeroplanes, using two clusters improves performance, but using
more clusters impairs the results. As another example, in detecting sheep, the
marginalization clearly outperforms clustering with two clusters. Neither the clusterings
nor the view annotations lead to optimal groupings, and even finding
well-performing ones requires plenty of trial and error.
In contrast to clustering, one can discriminatively learn optimal groupings
by treating them as latent variables. Figure 4 compares the performance of the
learning with the clustering and marginalization baselines. By learning the latent groups,
we outperform the clustering baselines. In addition, the learning is not sensitive to
selecting the right number of latent groups since, unlike clustering, it shares training
examples between groups. Table 1 gives the full comparison of our latent Hough
transform with our baselines and other competing approaches. Some
qualitative results on the VOC07 dataset are shown in Fig. 5.
7 Conclusions
In this paper, we have introduced the Latent Hough Transform (LHT) to enforce
consistency among votes that support an object hypothesis. To this end, we have
Latent Hough Transform for Object Detection 323
augmented the Hough space with latent variables and discriminatively learned
the optimal latent assignments of the training data for object detection. Further,
to add robustness to the number of quantizations, we have generalized the latent
variable model by allowing the training instances to have multiple weighted
assignments, and have shown how previous grouping approaches can be cast as
special cases of our model. In the future, it would be interesting to use the latent
formulation in a more general context, e.g., learning a multi-class LHT model or
learning latent transformations of the votes for better detection accuracy.
References
1. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In:
CVPR (2005)
2. Ferrari, V., Fevrier, L., Jurie, F., Schmid, C.: Groups of adjacent contour segments
for object detection. TPAMI 30, 36–51 (2008)
3. Ojala, T., Pietikäinen, M., Mäenpää, T.: Multiresolution gray-scale and rotation invariant
texture classification with local binary patterns. TPAMI 24, 971–987 (2002)
4. Leibe, B., Leonardis, A., Schiele, B.: Robust object detection with interleaved
categorization and segmentation. IJCV 77, 259–289 (2008)
5. Gall, J., Yao, A., Razavi, N., Van Gool, L., Lempitsky, V.: Hough forests for object
detection, tracking, and action recognition. TPAMI 33, 2188–2202 (2011)
6. Fergus, R., Perona, P., Zisserman, A.: Object class recognition by unsupervised
scale-invariant learning. In: CVPR (2003)
7. Hoiem, D., Rother, C., Winn, J.: 3D LayoutCRF for multi-view object class recognition
and segmentation. In: CVPR (2007)
8. Felzenszwalb, P., Girshick, R., McAllester, D., Ramanan, D.: Object detection with
discriminatively trained part based models. TPAMI 32, 1627–1645 (2009)
9. Bergtholdt, M., Kappes, J., Schmidt, S., Schnörr, C.: A study of parts-based object
class detection using complete graphs. IJCV 87, 93–117 (2010)
10. Leibe, B., Cornelis, N., Cornelis, K., Van Gool, L.: Dynamic 3d scene analysis from
a moving vehicle. In: CVPR (2007)
11. Seemann, E., Leibe, B., Schiele, B.: Multi-aspect detection of articulated objects.
In: CVPR (2006)
12. Seemann, E., Fritz, M., Schiele, B.: Towards robust pedestrian detection in crowded
image sequences. In: CVPR (2007)
13. Razavi, N., Gall, J., Van Gool, L.: Backprojection revisited: Scalable multi-view
object detection and similarity metrics for detections. In: ECCV (2010)
14. Marszalek, M., Schmid, C.: Accurate object localization with shape masks. In:
CVPR (2007)
15. Stephens, R.: Probabilistic approach to the Hough transform. Image and Vision
Computing 9, 66–71 (1991)
16. Everingham, M., Van Gool, L., Williams, C., Winn, J., Zisserman, A.: The PASCAL
visual object classes (VOC) challenge. IJCV 88, 303–338 (2010)
17. Torralba, A., Murphy, K.P., Freeman, W.T.: Sharing features: efficient boosting
procedures for multiclass object detection. In: CVPR (2004)
18. Thomas, A., Ferrari, V., Leibe, B., Tuytelaars, T., Schiele, B., Van Gool, L.: To-
wards multi-view object class detection. In: CVPR (2006)
19. Ozuysal, M., Lepetit, V., Fua, P.: Pose estimation for category specific multiview
object localization. In: CVPR (2009)
20. Dantone, M., Gall, J., Fanelli, G., Van Gool, L.: Real-time facial feature detection
using conditional regression forests. In: CVPR (2012)
21. Sun, M., Kohli, P., Shotton, J.: Conditional regression forests for human pose
estimation. In: CVPR (2012)
22. Sun, M., Bradski, G., Xu, B.X., Savarese, S.: Depth-encoded hough voting for
coherent object detection, pose estimation, and shape recovery. In: ECCV (2010)
23. Yarlagadda, P., Monroy, A., Ommer, B.: Voting by grouping dependent parts. In:
ECCV (2010)
24. Girshick, R.B., Felzenszwalb, P.F., McAllester, D.: Object detection with grammar
models. In: NIPS (2011)
25. Yang, Y., Ramanan, D.: Articulated pose estimation with flexible mixtures-of-parts.
In: CVPR (2011)
26. Malisiewicz, T., Gupta, A., Efros, A.A.: Ensemble of exemplar-SVMs for object
detection and beyond. In: ICCV (2011)
27. Torsello, A., Bulo, S., Pelillo, M.: Beyond partitions: Allowing overlapping groups
in pairwise clustering. In: ICPR (2008)
28. Hofmann, T.: Unsupervised learning by probabilistic latent semantic analysis.
Machine Learning 42, 177–196 (2001)
29. Farhadi, A., Tabrizi, M., Endres, I., Forsyth, D.: A latent model of discriminative
aspect. In: ICCV (2009)
30. Wang, Y., Mori, G.: A discriminative latent model of object classes and attributes.
In: ECCV (2010)
31. Bilen, H., Namboodiri, V., Van Gool, L.: Object and action classification with
latent variables. In: BMVC (2011)
32. Zhu, L., Chen, Y., Yuille, A., Freeman, W.: Latent hierarchical structural learning
for object detection. In: CVPR (2010)
33. Razavi, N., Gall, J., Van Gool, L.: Scalable multiclass object detection. In: CVPR
(2011)
34. Maji, S., Malik, J.: Object detection using a max-margin Hough transform. In:
CVPR (2009)
35. Zhang, Y., Chen, T.: Implicit shape kernel for discriminative learning of the Hough
transform detector. In: BMVC (2010)
36. Woodford, O., Pham, M., Maki, A., Perbet, F., Stenger, B.: Demisting the Hough
transform. In: BMVC (2011)
37. Barinova, O., Lempitsky, V., Kohli, P.: On detection of multiple object instances
using Hough transforms. In: CVPR (2010)
38. Gall, J., Potthoff, J., Schnörr, C., Rosenhahn, B., Seidel, H.: Interacting and annealing
particle filters: Mathematics and a recipe for applications. Journal of Mathematical
Imaging and Vision 28, 1–18 (2007)
39. Griffin, G., Holub, A., Perona, P.: Caltech-256 object category dataset. Technical
Report 7694, California Institute of Technology (2007)
40. Opelt, A., Pinz, A., Fussenegger, M., Auer, P.: Generic object recognition with
boosting. TPAMI 28, 416–431 (2006)
41. Vedaldi, A., Gulshan, V., Varma, M., Zisserman, A.: Multiple kernels for object
detection. In: ICCV (2009)
42. Song, Z., Chen, Q., Huang, Z., Hua, Y., Yan, S.: Contextualizing object detection
and classification. In: CVPR (2011)
Fig. 5. Some qualitative results on the test set of the PASCAL VOC 2007 dataset
(rows: aeroplane, sheep, bicycle, bottle, potted plant, bird). Ground-truth bounding
boxes are in blue, correctly detected boxes in green, and false positives in red.
Using Linking Features in Learning
Non-parametric Part Models
1 Introduction
In this paper, we present a method for detecting parts of highly deformable
objects. In these objects the parts conguration is exible, and therefore the
detection of the correct conguration is a dicult problem. As a central applica-
tion we consider the detection of human body parts, namely: head, torso, lower
and upper arms (gure 1). Recently, several approaches have been proposed to
tackle this problem [110]. The central pipeline in all of these approaches can be
roughly summarized as: input image extraction of image features estimat-
ing part posteriors using dedicated part detectors extracting the most likely
parts conguration by utilizing the human body kinematic constraints. For the
last step of the pipeline prior parametric models of the human body are used,
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 326–339, 2012.
© Springer-Verlag Berlin Heidelberg 2012
Fig. 1. Examples of detected body parts on the Buffy (left) and PASCAL stickmen
(right) datasets. The detected parts are annotated using color-coded sticks.
which are based on the pictorial-structure (PS) model [1], with either the standard
tree-connectivity of the body parts [1–4, 8–10] (tree edges representing joints
with a constrained range of motion), or with extra non-tree edges for additional
spatial constraints between physically unconnected parts [5–7, 11]. Notably, [10]
use more parts in their tree PS model, allowing better treatment of non-rigid
part deformations such as foreshortening of limbs, leading to better performance. In
addition, [12] handle limb foreshortening in human pose estimation from video
by employing stretchable models, which model joint locations instead of limbs.
In the standard PS model, the pairwise constraints between related parts
are purely kinematic (e.g. a distribution of allowed relative angles) and do not
depend on the query image. Recently, several approaches [8, 9] introduced some
image dependence into the pairwise kinematic constraints of the PS model. In
[8] the pairwise kinematic constraints are adaptive, and are derived from a subset
of training images that are similar (in appearance) to the query image. In
[9], connected parts are required to follow common segmentation boundaries,
but only in the lower levels of the cascade, while in the higher cascade levels
kinematic constraints are used. In addition, [13] uses the detections of poselet
features to constrain the relative configuration of the parts. This state-of-the-art
person detector [13] focuses on the detection of entire persons and their
segmentation boundaries, and does not detect the arms and their sub-parts. Finally,
a related idea of using chains of features to reach the part being detected
has appeared in [14]; however, [14] detects only a single location on a part of
interest (e.g. a hand), and its probabilistic model does not represent parts explicitly,
leading to reduced detection performance compared with the proposed
scheme (section 3).
In this paper, we propose a new method for parts detection that takes a different
modeling approach. The approach follows the above pipeline of choosing
the best configuration of detected part candidates. However, instead of estimating
the quality of the relative configuration of part candidates using kinematic
328 L. Karlinsky and S. Ullman
Fig. 2. (a) The learned associations between parts. An arrow from part A to part B
shows how much part A affected (via the linking features) the detection of part B,
relative to all parts that affected part B. The incoming arrows sum to 100% (only
arrows with at least 15% are shown); (b) Since no a-priori skeleton model is assumed
(i.e. the object skeleton is not constrained by the model), the same approach can be
directly applied to other objects (e.g. ostrich); (c) A graphical illustration of the FLPM
model. The dotted orange and the solid blue rectangles are the features plate and the
query images plate, respectively. Only the $F^n_i$ variables are observed.
Fig. 3. Main stages of the proposed approach. (a) SIFT descriptor features are extracted
on a regular grid (section 2.2); (b) Each part is voted for by the features (top
shows a voting map for the bottom left arm), and features are then clustered (section 2.4)
to get candidates for the part (illustrated by lines on the bottom image); (c) Features
vote for connections between parts, and linking features are identified to form the
FLPM model for the test image; the example shows two automatically discovered sets
of linking features (orange dots), for the lower-upper right arm and the lower-upper
left arm connections respectively (section 2.1); (d) The final result is obtained through
approximate inference on the obtained FLPM model (section 2.1). Best viewed in color.
The experimental results show that without known kinematic constraints, and
using only simple SIFT descriptors computed on grayscale images as features
(without more complex features such as color and segmentation contours used
in [8, 9]), the approach achieves results comparable, and in some cases superior, to
the state-of-the-art on two challenging upper body datasets, Buffy [3] and PASCAL
stickmen [3]. In addition, since no a-priori known connectivity between the parts
is assumed, the same approach can be successfully applied (without modification)
to other domains. This is illustrated by applying the approach to facial fiducial
point detection on the AR dataset [15, 16] (figure 5), and to ostrich
parts detection (figure 2b).
The rest of the paper is organized as follows. Section 2 provides the details
of the proposed approach, section 3 describes the experimental evaluation, and a
summary and discussion are provided in section 4.
2 Method
This section gives a brief overview of our approach; further details are provided
in the subsections below.
The method starts with two standard pre-processing stages (figure 3a): (i)
Extracting the region of interest (bounding box) around the object (e.g. human)
whose parts are to be detected. It is standard practice in pose estimation
approaches, e.g. [3, 4, 8, 9], to assume that the object is already detected
and roughly localized; and (ii) Computing image features (patch descriptors)
centered on dense regular grid locations inside the region of interest. For the
feature extraction stage, the step size of the grid and the size of the patches for
which the descriptors are computed are determined by the scale of the region of
interest, assuming the scale is roughly determined by the object detector. The
details of the pre-processing are described in section 2.2.
After the pre-processing, the method proceeds in three main stages. Roughly,
the first focuses on individual parts (figure 3b), the second on pairwise connectivity
between part candidates (figure 3c), and the third on extracting the most
likely global parts configuration (figure 3d). First, a set of candidate locations,
orientations, and sizes is detected for each part. The detection of candidates
is explained in section 2.4. Second, parameters of pairwise connectivity between
part candidates are estimated using non-parametric probability computations.
At this stage, each configuration formed by a pair of part candidates (e.g. the relative
position of two sticks) is scored using the linking features (observed in the
test image) that support this particular configuration. These linking features are
patch descriptors that were associated with this configuration during training.
The combined set of estimated connectivity parameters forms the FLPM model.
Third, the most consistent subset of candidate parts is computed by a greedy
approximate MAP inference over the estimated FLPM. The FLPM model, the
computation of its parameters, and the approximate inference procedure for it
are detailed in section 2.1.
The training phase of the method receives a set of images with annotated
parts (e.g. stick annotations in the upper body experiments). As the method is
non-parametric, the training only involves building efficient data structures for
similar-descriptor search and memorizing a set of parameters for each feature,
e.g. its relative offset from different parts. These data structures are later used in
order to compute the model probabilities for test images using Kernel Density
Estimation (KDE) (section 2.3).
variables is explained below the definition of the joint distribution. The FLPM model
joint distribution is defined as:
$$P\left(\{L^n_j\}_{j=1}^{m},\; \{F^n_i, R^n_i, A^n_i\}_{i=1}^{k_n}\right) = P\left(\{L^n_j\}_{j=1}^{m}\right) \prod_{i=1}^{k_n} P(A^n_i)\, P\left(R^n_i \mid \{L^n_j\}_{j=1}^{m}\right) P\left(F^n_i \mid A^n_i, R^n_i, \{L^n_j\}_{j=1}^{m}\right) \quad (1)$$
Here $P(F^n_i)$ is the probability of spontaneously observing the feature $F^n_i$ anywhere
inside the region of interest, regardless of the parts configuration. During
the inference, the $A^n_i = 0$ option provides a more robust way of ignoring some
of the features that are either outside the object or are not potential linking
features. The $R^n_i$ are pair-of-parts hidden variables taking values from the set of
part pairs: $R^n_i = (p, q) \in \{(p, q) \mid 1 \le p \ne q \le m\}$. The $R^n_i$ provide the means to
associate the feature $F^n_i$ with a pair of parts from $\{L^n_j\}_{j=1}^{m}$. Roughly, $R^n_i = (p, q)$
means that the feature $F^n_i$ connects parts $p$ and $q$. The corresponding conditional
is defined as follows:

$$P\left(F^n_i \mid R^n_i = (p, q), \{L^n_j\}_{j=1}^{m}\right) = P\left(F^n_i \mid L^n_p, L^n_q\right) \quad (3)$$
the features are not potential linking features (or belong to the background), we
can approximate the posterior as:

$$\{L^{\mathrm{MAP}}_j\}_{j=1}^{m} \approx \arg\max_{\{L^n_j\}_{j=1}^{m}} \sum_{p,q=1}^{m} \pi_{pq}\left(L^n_p, L^n_q\right) \sum_{i=1}^{k_n} \frac{P\left(F^n_i \mid L^n_p, L^n_q\right)}{P(F^n_i)} \quad (6)$$

Note that although the posterior (6) does not depend on the $\{A^n_i\}$, these auxiliary
variables are necessary in order to derive it and make it more robust.
For a given test image $I_n$, the pre-processing (sec. 2.2) produces a set of $k_n$
features $\{F^n_i\}_{i=1}^{k_n}$, and the part candidates stage (sec. 2.4) produces a set of part
candidates $S_j$ for each of the $m$ parts (such that for every $j$, $L^n_j \in S_j$). Assume
for brevity that for every $j$: $|S_j| = r$, i.e. we have the same number of candidates
for each part. In order to perform MAP inference, we compute two tables
$W_R$ and $W_F$, both of size $m \times m \times r \times r$, each entry of which corresponds to a
choice of $1 \le p, q \le m$ ($m \times m$ options) and a choice of assignment to $L^n_p, L^n_q$
($r \times r$ options). The entry $W_R(p, q, r_1, r_2) = \pi_{pq}\left(L^n_p = S_p\{r_1\}, L^n_q = S_q\{r_2\}\right)$ corresponds
to the pairs-of-parts prior; here $S_p\{r_1\}$ means the $r_1$-th element of the set $S_p$
of part-$p$ candidates. The entry $W_F(p, q, r_1, r_2) = \sum_{i=1}^{k_n} \frac{P\left(F^n_i \mid L^n_p = S_p\{r_1\},\, L^n_q = S_q\{r_2\}\right)}{P(F^n_i)}$
corresponds to the feature-based relative configuration prior. Note that the likelihood
ratio inside the sum is higher for features that are more strongly associated
with the considered relative part configuration. Finally, we multiply these two
tables entry-wise (denoted $.*$) to form $W = W_R\, .*\, W_F$, which is used for MAP
inference over the FLPM posterior. According to equation (6), we need to find an
optimal choice of indices $\{r_1, r_2, \ldots, r_m\}$ such that $\sum_{p=1}^{m} \sum_{q=1}^{m} W\left(p, q, r_p, r_q\right)$ is
maximal. In our experiments, we have tested two greedy approximate inference
maximal. In our experiments, we have tested two greedy approximate inference
approaches to the choice of these indices. In the rst approach, denoted FLPM-
sum, we start from the most likely candidate of the auxiliary full object part
and then greedily select parts and their optimal candidates having maximal sum
of weights (entries in W ) to the already selected parts:
q, rq arg max W p, q, rp , rq (7)
q,rq
pselected
In the second approach, denoted FLPM-max, the next part and its optimal
candidate are greedily chosen using only the maximal weight to one of the already
selected candidates:
% &
q, rq arg max max W p, q, rp , rq (8)
q,rq pselected
An alternative inference (maximizing $\sum_{p=1}^{m} \sum_{q=1}^{m} W\left(p, q, r_p, r_q\right)$) based on a Quadratic Integer Programming
(QIP) approximation produced results slightly inferior to the greedy scheme
that we have eventually used.
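The two greedy selection variants over the table $W$ can be sketched as follows (a minimal pure-Python sketch; the nested-list layout of $W$, the choice of the starting part and its candidate, and the tie-breaking are illustrative assumptions not fixed by the text):

```python
def greedy_inference(W, m, r, start=(0, 0), mode="sum"):
    """Greedily choose one candidate index per part.

    W[p][q][rp][rq]: link weight between candidate rp of part p and
    candidate rq of part q. `start` is an assumed (part, candidate)
    seed; the paper starts from an auxiliary full-object part.
    mode="sum" implements FLPM-sum, mode="max" implements FLPM-max.
    """
    selected = dict([start])  # part index -> chosen candidate index
    while len(selected) < m:
        best = None
        for q in range(m):
            if q in selected:
                continue
            for rq in range(r):
                # Weights linking this (q, rq) to all selected parts.
                links = [W[p][q][rp][rq] for p, rp in selected.items()]
                score = sum(links) if mode == "sum" else max(links)
                if best is None or score > best[0]:
                    best = (score, q, rq)
        selected[best[1]] = best[2]
    return selected
```

The sketch scans all unselected (part, candidate) pairs at each step, so it runs in $O(m^2 r)$ per selected part, matching the greedy schemes of equations (7) and (8).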
where $d_r$, $o_j$, $o_k$, $o_{jr}$, and $o_{kr}$ are as above; $\theta_j$ and $\theta_k$ are the orientations of
$L^n_j$ and $L^n_k$, and $\theta_{rj}$ and $\theta_{rk}$ are the orientations of the $j$-th and $k$-th parts
respectively on the $r$-th neighbor source image; the angular tolerance is set to 20 degrees. In
addition, we restrict the above $P_j\left(F^n_i, L^n_j, L^n_k\right)$ computation only to neighbors
which are identified as candidate linking features during training. Those are
the features, collected from the training images, that lie in the intersections of the
$j$-th and $k$-th part training rectangles enlarged by 10 pixels. This increases the
importance of more relevant mutual features for candidate-parts consistency
weighting (e.g. elbows for lower and upper arms).
Finally, we define $P\left(R^n_i = (p, q) \mid \{L^n_j\}_{j=1}^{m}\right) = \pi_{pq}\left(L^n_p, L^n_q\right)$ to be uniform,
thus providing no prior on the relative configuration of the parts. The $\pi_{pq}\left(L^n_p, L^n_q\right)$
can also incorporate the part candidate scores produced by the star model used
to generate the candidates (section 2.4). However, in our experiments, using part
candidate scores did not result in a performance increase, due to the obvious
difficulty of detecting parts independently of the other parts. In addition, any
kinematic prior on the pairwise relative configuration of the parts that is used
by the existing approaches may be incorporated into $P\left(R^n_i = (p, q) \mid \{L^n_j\}_{j=1}^{m}\right)$
by substituting it with the prior value.
of the baselines in [14] and is described in more detail there. Briefly, we accumulate
votes from all the features in a matrix $V_j$ with size equal to
the size of $I_n$, while each feature votes for 25 candidate locations of the part,
obtained from 25 approximate nearest neighbors, with voting weight equal to
$P_j\left(F^n_i \mid L^n_j\right) / P_j(F^n_i)$ (section 2.3). Each vote is smoothed by a Gaussian with an
STD of 15 pixels.
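The voting-map accumulation can be sketched as follows (a minimal sketch: the vote triples, the grid size, and the small truncation radius of the Gaussian are illustrative assumptions; the voting weights themselves come from the KDE ratio described above):

```python
import math

# Sketch: accumulate weighted part votes into a voting map V_j.
# Each vote is an (x, y, weight) triple; every vote is spread by a
# Gaussian (the paper uses STD 15 px), truncated here to a small
# radius purely for brevity.
def voting_map(votes, width, height, std=15.0, radius=2):
    V = [[0.0] * width for _ in range(height)]
    for x, y, w in votes:
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                px, py = x + dx, y + dy
                if 0 <= px < width and 0 <= py < height:
                    g = math.exp(-(dx * dx + dy * dy) / (2 * std * std))
                    V[py][px] += w * g
    return V
```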
Second, up to $t \le 100$ strongest local maxima $\{m_s\}_{s=1}^{t}$ are collected from $V_j$
(with the weakest one $m_t$ having at least 5% of the score of the global maximum
$m_1$ of $V_j$: $m_t \ge 0.05\, m_1$). Then for each feature $F^n_i$ we compute a $t$-sized vector $v_i$, each
entry of which contains the amount of weight it contributed to each of the chosen
maxima $\{m_s\}_{s=1}^{t}$. The features $\{F^n_i\}_{i=1}^{k_n}$ are then clustered into 20 clusters using
spectral clustering [21], where the association between two features $F^n_i$ and $F^n_l$
is computed as the inner product between $v_i$ and $v_l$. The reason for this clustering
is that for the more ambiguous parts, such as lower or upper arms, small local
features along the boundary of the part have a wider range of possible part-center
locations relative to them (and they essentially vote for many of those).
Third, for each feature cluster, the voting (as in the first step) is repeated using
only the features belonging to the cluster. The location of the maximal accumulated
vote for each cluster is taken as the center location for the corresponding part
candidate. Then the features of each cluster vote for the top three candidate
orientations and the top length estimate of the part. The weights for orientation
and length voting are the same as in the first step. This way, out of 20 feature
clusters, 60 candidates are produced for each of the $m$ parts.
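The feature-association computation feeding the spectral clustering step can be sketched as follows (the vote-profile vectors $v_i$ are assumed given; running the actual spectral clustering into 20 clusters on this affinity matrix is not shown):

```python
# Sketch: pairwise feature affinity as the inner product of the
# t-sized vote-profile vectors v_i (weight contributed by feature i
# to each retained local maximum of the voting map).
def affinity_matrix(vote_profiles):
    n = len(vote_profiles)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for l in range(n):
            A[i][l] = sum(a * b for a, b in zip(vote_profiles[i],
                                                vote_profiles[l]))
    return A
```

Features voting for the same maxima get high affinity and end up in the same cluster, which is what lets each cluster propose a coherent part candidate.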
After the computation of part candidates, the computation proceeds as explained
in the description of the inference procedure in section 2.1. The next
section describes the experimental evaluation of the proposed approach.
3 Results
To test the proposed approach, we applied it to the two challenging human body
parts datasets proposed by [3], namely Buffy (version 2.1) and PASCAL stickmen
(version 1.1). Moreover, in order to test the generality of the method, we
also applied it (without modification) to the AR facial fiducial points detection
dataset [15], and to an ostrich parts detection dataset (see below). Table 4a summarizes
the numerical results and state-of-the-art comparisons. The detections
are assumed correct if they meet the standard $PCP_{0.5}$ criterion [3], where both
endpoints of the detected part should be within 0.5 ground-truth part lengths
from the ground-truth part endpoints (the full $PCP$ curves are given in figure
5a). The results show that the FLPM produces performance comparable to the
state-of-the-art approaches, improving the best result on the PASCAL stickmen
v1.1 dataset. Note that FLPM is trained without knowing the kinematics of the
object (connections and allowed relative angles between the parts) that is used
by all the other methods [3, 8, 9], and using only grayscale images and simple
SIFT descriptor features (as opposed to the complex features involving color and
segmentation boundaries used by [8, 9]). Figure 2a shows the average connectivity
pattern discovered by the FLPM. This pattern is obtained by averaging all
[Fig. 4a table: per-part $PCP_{0.5}$ results (Torso, Upper arms, Lower arms, Head, Total) on Buffy v2.1 and PASCAL stickmen v1.1 for FLPM, its candidate-recall upper bound, an independent-max-score baseline, a candidates-without-linking-features baseline, and the methods of Eichner & Ferrari, Andriluka et al., Karlinsky et al., Sapp et al., and Yang & Ramanan; footnotes note that some competing results were obtained by running their publicly available code on the v1.1 dataset, since the originally reported results used the currently unavailable v1.0 dataset with fewer test images.]
Fig. 4. (a) Summary of Buffy and PASCAL stickmen results. Our results are in blue.
The candidate-recall test computes the maximum recall rates (disregarding the score)
of the individual part detectors (providing an upper bound on the potential performance).
To the best of our knowledge, [8] tested on the Buffy dataset version 1.0 (we test on
v2.1); (b) A baseline comparison that tests the contribution of the core component of the
model, the linking features. The comparison shows the prominent contribution of the
linking features to the FLPM performance; (c) More examples of successfully detected
parts on the tested datasets, illustrating a range of poses and appearances.
Fig. 5. (a) PCP curves for FLPM on the PASCAL stickmen and Buffy datasets; (b) Example
visual results of applying the FLPM approach (without modification) to the detection
of 130 facial fiducial points (light blue) on the AR dataset.
between parts. The FLPM has two sources of information for selecting the most
likely parts configuration, namely the candidate part detection scores and the
part connection scores via the linking features. The baseline illustrates that
the most significant source of information in the model comes from the linking
features. Figure 4c shows more examples of successfully detected parts.
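The $PCP_{0.5}$ matching criterion used in this evaluation can be sketched as follows (a minimal reading of the criterion stated above; handling of endpoint ordering is an assumption):

```python
import math

# Sketch of the PCP_0.5 criterion [3]: a detected stick is correct if
# both of its endpoints lie within alpha * (ground-truth part length)
# of the corresponding ground-truth endpoints.
def pcp_correct(det, gt, alpha=0.5):
    """det, gt: ((x1, y1), (x2, y2)) endpoint pairs for one part."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    part_len = dist(gt[0], gt[1])
    return (dist(det[0], gt[0]) <= alpha * part_len and
            dist(det[1], gt[1]) <= alpha * part_len)
```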
Buffy v2.1. This dataset consists of 748 frames from 5 episodes (2-6) of the Buffy the
Vampire Slayer TV series. On each image one person is annotated. The results
are reported for the usual 235 images from episodes 2, 5 & 6 detected by the
upper body detector used in [3]. In our Buffy experiments, the training and testing
is done in a leave-one-episode-out manner, in which all images of one episode are
used for testing, while the images of all the other episodes are used for training.
PASCAL Stickmen v1.1. This dataset consists of 549 unrelated images with
one annotated person per image. In our experiments on this dataset, the training
and testing was done in a 5-fold cross-validation manner, in which we divided the
dataset images into 5 equal folds, and for testing on the images of each fold the
model was trained using the images of the other folds. As in [3, 8, 9], the results are
reported for the subset of 412 dataset images (denoted test images) on which
the person bounding box was successfully detected by the upper body detector
of [3]. Note, however, that the two previous schemes [8, 9] originally reported
their results on the previous (currently unavailable) version 1.0 of this dataset
(with only 360 test images). Therefore, in table 4a, we compare the results of
the proposed approach (FLPM) with the results of the publicly available code of [9]
applied to the 1.1 version of this dataset (412 test images).
AR Dataset. This is a facial fiducial points detection dataset proposed by [15,
16]. It consists of 895 images (with ground truth annotations) of 112
different people, with 130 facial landmarks to be detected (figure 5b). The
images exhibit different facial expressions, different lighting conditions, and some
occluders (glasses, facial hair, etc.). We treat each facial landmark as a part
whose location is to be detected in the test images. The linking features in
the FLPM model score the relative locations of the facial landmarks, and the
most coherent (according to the FLPM) configuration of the facial landmarks
is selected during inference. The experiments were conducted in a leave-person-out
manner, each time leaving all the images belonging to a test person out of
the training. The FLPM model attains an average detection error of 4.0 ± 1.3
pixels, comparing favorably to the average error of 8.4 ± 1.2 pixels reported by
the state-of-the-art method of [16].
Ostrich. The dataset consists of 558 frames of a running-ostrich movie sequence
downloaded from YouTube. The training and testing was again done by 5-fold
cross-validation, each time leaving a sequence of about 110 consecutive frames
for testing and training on the other frames. The average $PCP_{0.5}$ was 90.7%,
with perfect detection of the neck and torso, 78.2% detection of the upper legs, and
93.8% detection of the lower legs. A movie with the visual results is provided in
the supplementary material.
References
1. Felzenszwalb, P., Huttenlocher, D.: Pictorial structures for object recognition.
IJCV (2005)
2. Ramanan, D.: Learning to parse images of articulated objects. In: NIPS (2006)
3. Eichner, M., Ferrari, V.: Better appearance models for pictorial structures. In:
BMVC (2009)
4. Andriluka, M., Roth, S., Schiele, B.: Pictorial structures revisited: People detection
and articulated pose estimation. In: CVPR (2009)
5. Wang, Y., Mori, G.: Multiple Tree Models for Occlusion and Spatial Constraints in
Human Pose Estimation. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008,
Part III. LNCS, vol. 5304, pp. 710–724. Springer, Heidelberg (2008)
6. Jiang, H., Martin, D.R.: Global pose estimation using non-tree models. In: CVPR
(2008)
7. Yao, B., Fei-Fei, L.: Modeling mutual context of object and human pose in human-
object interaction activities. In: CVPR (2010)
8. Sapp, B., Jordan, C., Taskar, B.: Adaptive pose priors for pictorial structures. In:
CVPR (2010)
9. Sapp, B., Toshev, A., Taskar, B.: Cascaded Models for Articulated Pose Estimation.
In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part II. LNCS,
vol. 6312, pp. 406–420. Springer, Heidelberg (2010)
10. Yang, Y., Ramanan, D.: Articulated pose estimation using flexible mixtures of
parts. In: CVPR (2011)
11. Ioffe, S., Forsyth, D.: Human tracking with mixtures of trees. In: ICCV (2001)
12. Sapp, B., Weiss, D., Taskar, B.: Parsing human motion with stretchable models.
In: CVPR (2011)
13. Bourdev, L., Malik, J.: Poselets: Body part detectors trained using 3d human pose
annotations. In: ICCV (2009)
14. Karlinsky, L., Dinerstein, M., Harari, D., Ullman, S.: The chains model for detect-
ing parts by their context. In: CVPR (2010)
15. Martinez, A., Benavente, R.: The AR face database. CVC Technical Report num.
24 (1998)
16. Ding, L., Martinez, A.M.: Features versus context: An approach for precise and
detailed detection and delineation of faces and facial features. PAMI (2010)
17. Lowe, D.: Distinctive image features from scale-invariant keypoints. IJCV (2004)
18. Liu, C., Yuen, J., Torralba, A.: SIFT flow: Dense correspondence across different
scenes and its applications. PAMI (2010)
19. Mount, D., Arya, S.: ANN: A library for approximate nearest neighbor searching.
CGC (1997)
20. Duda, R., Hart, P.: Pattern classification and scene analysis. Wiley (1973)
21. Ng, A., Jordan, M., Weiss, Y.: On spectral clustering: Analysis and an algorithm.
In: NIPS (2001)
Diagnosing Error in Object Detectors
Abstract. This paper shows how to analyze the influences of object characteris-
tics on detection performance and the frequency and impact of different types of
false positives. In particular, we examine effects of occlusion, size, aspect ratio,
visibility of parts, viewpoint, localization error, and confusion with semantically
similar objects, other labeled objects, and background. We analyze two classes
of detectors: the Vedaldi et al. multiple kernel learning detector and different ver-
sions of the Felzenszwalb et al. detector. Our study shows that sensitivity to size,
localization error, and confusion with similar objects are the most impactful forms
of error. Our analysis also reveals that many different kinds of improvement are
necessary to achieve large gains, making more detailed analysis essential for the
progress of recognition research. By making our software and annotations avail-
able, we make it effortless for future researchers to perform similar analysis.
1 Introduction
Large datasets are a boon to object recognition, yielding powerful detectors trained on
hundreds of examples. Dozens of papers are published every year citing recognition
accuracy or average precision results, computed over thousands of images from Cal-
Tech or PASCAL VOC datasets [1, 2]. Yet such performance summaries do not tell
us why one method outperforms another or help understand how it could be improved.
For authors, it is difficult to find the most illustrative qualitative results, and the read-
ers may be suspicious of a few hand-picked successes or intuitive failures. To make
matters worse, a dramatic improvement in one aspect of recognition may produce only
a tiny change in overall performance. For example, our study shows that complete ro-
bustness to occlusion would lead to an improvement of only a few percent average
precision. There are many potential causes of failure: occlusion, intra-class variations,
pose, camera position, localization error, and confusion with similar objects or textured
backgrounds. To make progress, we need to better understand what most needs im-
provement and whether a new idea produces the desired effect. We need to measure the
modes of failure of our algorithms. Fortunately, we already have excellent large, diverse
datasets; now, we propose annotations and analysis tools to take full advantage.
This paper analyzes the influences of object characteristics on detection performance
and the frequency and impact of different types of false positives. In particular, we ex-
amine effects of occlusion, size, aspect ratio, visibility of parts, viewpoint, localization
error, confusion with semantically similar objects, confusion with other labeled objects,
and confusion with background. We analyze two types of detectors on the PASCAL
This work was supported by NSF awards IIS-1053768 and IIS-0904209, ONR MURI Grant
N000141010934, and a research award from Google.
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 340–353, 2012.
© Springer-Verlag Berlin Heidelberg 2012
VOC 2007 dataset [2]: the Vedaldi et al. [3] multiple kernel learning detector (called
VGVZ for the authors' initials) and the Felzenszwalb et al. [4, 5] detector (called FGMR).
By making our analysis software and annotations available, we will make it effortless
for future researchers to perform similar analysis.
Relation to Existing Studies: Many methods have been proposed to address a par-
ticular recognition challenge, such as occlusion (e.g., [6–8]), variation in aspect ratio
(e.g., [5, 3, 9]), or changes of viewpoint (e.g., [10–12]). Such methods are usually eval-
uated based on overall performance, home-brewed datasets, or artificial manipulations,
making it difficult to determine whether the motivating challenge is addressed for nat-
urally varying examples. Researchers are aware of the value of additional analysis, but
there are no standards for performing or compactly summarizing detailed evaluations.
Some studies focus on particular aspects of recognition, such as interest point detec-
tion (e.g., [13, 14]), contextual methods [15, 16], dataset design [17], and cross-dataset
generalization [18]. The work by Divvala et al. [15] is particularly relevant: through
analysis of false positives, they show that context reduces confusion with textured back-
ground patches and increases confusion with semantically similar objects. The report on
PASCAL VOC 2007 by Everingham et al. [19] is also related in its sensitivity analysis
of size and qualitative analysis of failures. For size analysis, Everingham et al. measure
the average precision (AP) for increasing size thresholds (smaller-than-threshold ob-
jects are treated as "don't cares"). Because the measure is cumulative, some effects of
size are obscured, causing the authors to conclude that detectors have a limited prefer-
ence for large objects. In contrast, our experiments indicate that object size is the best
single predictor of performance.
Studies in specialized domains, such as pedestrian detection [20] and face recog-
nition [21, 22], have provided useful insights. For example, Dollár et al. [20] analyze
effects of scale, occlusion, and aspect ratio for a large dataset of pedestrian videos,
summarizing performance for different subsets with a single point on the ROC curve.
Contributions: Our main contribution is to provide analysis tools (annotations, soft-
ware, and techniques) that facilitate detailed and meaningful investigation of object
detector performance. Although benchmark metrics, such as AP, are well-established,
we had much difficulty quantitatively exploring causes and correlates of error; we wish
to share some of the methodology that we found most useful with the community.
Our second contribution is the analysis of two state-of-the-art detectors. We intend
our analysis to serve as an example of how other researchers can evaluate the frequency
and correlates of error for their own detectors. We sometimes include detailed analysis
for the purpose of exemplification, beyond any insights that can be gathered from the
results.
Third, our analysis also helps identify significant weaknesses of current approaches
and suggests directions for improvement. Equally important, our paper highlights the
danger of relying on overall benchmarks to measure short-term progress. Currently,
detectors perform well for the most common cases of objects and avoid egregious errors.
Large improvements in any one aspect of object recognition may yield small overall
gains and go under-appreciated without further analysis.
342 D. Hoiem, Y. Chodpathumwan, and Q. Dai
Details of Dataset and Detectors: Our experiments are based on the PASCAL VOC
2007 dataset [2], which is widely used to evaluate performance in object category de-
tection. The detection task is to find instances of a specific object category within each
input image, localizing each object with a tight bounding box. For the 2007 dataset,
roughly 10,000 images were collected from Flickr.com and split evenly into train-
ing/validation and testing sets. The dataset is favored for its representative sample of
consumer photographs and rigorous annotation and collection procedures. The dataset
contains bounding box annotations for 20 object categories, including some auxiliary
labels for truncation and five categories of viewpoint. We extend these annotations with
more detailed labels for occlusion, part visibility, and viewpoint. Object detections con-
sist of a bounding box and confidence. The highest-confidence bounding box with at least
50% overlap is considered correct; all others are incorrect. Detection performance is mea-
sured with precision-recall curves and summarized with average precision. We use the
2007 version of VOC because the test set annotations are available (unlike later ver-
sions), enabling analysis on test performance of various detectors. Experiments on later
versions are very likely to yield similar conclusions.
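The 50% overlap criterion above is the standard VOC intersection-over-union test. A minimal sketch (function names are ours, not from the VOC evaluation kit):

```python
def overlap(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_correct(detection, ground_truth, threshold=0.5):
    """VOC-style correctness: overlap of at least `threshold` with the object."""
    return overlap(detection, ground_truth) >= threshold
```

Only the highest-confidence detection passing this test counts as correct; lower-ranked detections of the same object are duplicates and count as false positives.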
The FGMR detector [23] consists of a mixture of deformable part models for each
object category, where each mixture component has a global template and a set of de-
formable parts. Both the global template and the deformable parts are represented by
HOG features captured at coarser and finer scales, respectively. A hypothesis score con-
tains both a data term (filter responses) and a spatial prior (deformable cost), and the
overall score for each root location is computed based on the best placement of parts.
Search is performed by scoring the root template at each position and scale. The training
of the object model is posed as a latent SVM problem where the latent variables specify
the object configuration. The main differences between v4 and v2 are that v4 includes
more components (3 vs. 2) and more latent parts (8 vs. 6) and has a latent left/right flip
term. We do not use the optional context rescoring.
The VGVZ detector [3] adopts a cascade approach by training a three-stage classi-
fier, using as features Bag of Visual Words, Dense Words, Histogram of Oriented Edges
and Self-Similarity Features. For each feature channel, a three-level pyramid of spatial
histograms is computed. In the first stage, a linear SVM is used to output a set of can-
didate regions, which are passed to the more powerful second and third stages, which
use quasi-linear and non-linear kernel SVMs, respectively. Search is performed using
jumping windows proposed based on a learned set of detected keypoints.
Fig. 1. Examples of top false positives: We show the top six false positives (FPs) for the FGMR
(v4) airplane, cat, and cow detectors. The text indicates the type of error (loc=localization;
bg=background; sim=confusion with similar object), the amount of overlap (ov) with a
true object, and the fraction of correct examples that are ranked lower than the given false positive
(1-r, for 1-recall). Localization errors could be insufficient overlap (less than 0.5) or duplicate
detections. Qualitative examples, such as these, indicate that confusion with similar objects and
localization error are much more frequent causes of false positives than mislabeled background
patches (which provide many more opportunities for error).
counted as confusion with similar objects. For example, a dog detector may assign
a high score to a cat region. We consider two categories to be semantically similar
if they are both within one of these sets: {all vehicles}, {all animals including per-
son}, {chair, diningtable, sofa}, {aeroplane, bird}. Confusion with dissimilar objects
describes remaining false positives that have at least 0.1 overlap with another labeled
VOC object. For example, the FGMR bottle detector very frequently detects people
because the exterior contours are similar. All other false positives are categorized as
confusion with background. These could be detections within highly textured areas
or confusions with unlabeled objects that are not within the VOC categories. See Fig. 1
for examples of confident false positives.
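The taxonomy above (localization, similar objects, other objects, background) can be sketched as a small classifier. This is our illustration, not the authors' released software; the similarity sets follow the groupings in the text, and the 0.1/0.5 overlap thresholds follow the definitions above, while the function decomposition is an assumption.

```python
# Semantic-similarity sets from the text: {all vehicles}, {all animals
# including person}, {chair, diningtable, sofa}, {aeroplane, bird}.
SIMILAR_SETS = [
    {"aeroplane", "bicycle", "boat", "bus", "car", "motorbike", "train"},
    {"bird", "cat", "cow", "dog", "horse", "sheep", "person"},
    {"chair", "diningtable", "sofa"},
    {"aeroplane", "bird"},
]

def semantically_similar(a, b):
    """Two categories are similar if both belong to one of the sets above."""
    return any(a in s and b in s for s in SIMILAR_SETS)

def fp_type(detector_class, ov_target, best_other_class, ov_other):
    """Categorize a false positive.
    ov_target: max overlap with a same-class object (duplicates included);
    best_other_class, ov_other: best-overlapping object of another class."""
    if 0.1 <= ov_target < 0.5:          # localization error
        return "loc"
    if ov_other >= 0.1 and semantically_similar(detector_class, best_other_class):
        return "sim"                    # confusion with a similar category
    if ov_other >= 0.1:
        return "oth"                    # confusion with a dissimilar object
    return "bg"                         # background or unlabeled object
```

Under these sets, a dog detection overlapping a cat is "sim", while a bottle detection on a person is "oth", matching the examples in the text.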
In Figure 2, we show the frequency and impact on performance of each type of
false positive. We count top-ranked false positives that are within the most confi-
dent Nj detections. We choose parameter Nj to be the number of positive examples
for the category, so that if all objects are correctly detected, no false positives would
remain. A surprisingly small fraction of confident false positives are due to confusion
with background (e.g., only 9% for VGVZ animal detectors on average). For animals,
most false positives are due to confusion with other animals; for vehicles, localization
and confusion with similar categories are both common. In looking at trends of false
positives with increasing rank, localization errors and confusion with similar objects
tend to be more common among the top-ranked than the lower-ranked false positives.
Confusion with other objects and confusion with background may be similar types
of errors. Some categories were often confused with semantically dissimilar categories.
Fig. 2. Analysis of Top-Ranked False Positives. Pie charts: fraction of top-ranked false positives
that are due to poor localization (Loc), confusion with similar objects (Sim), confusion with other
VOC objects (Oth), or confusion with background or unlabeled objects (BG). Each category
named within Sim is the source of at least 10% of the top false positives. Bar graphs display
absolute AP improvement by removing all false positives of one type ("B" removes confusion
with background and non-similar objects). "L": the first bar segment displays improvement if
duplicate or poor localizations are removed; the second displays improvement if the localization
errors were corrected, turning false detections into true positives.
For example, bottles were often confused with people, due to the similarity of exterior
contours. Removing only one type of false positives may have a small effect, due to
the TP/(TP + FP) form of precision. In particular, the potential improvement by removing
all background false detections is surprisingly small (e.g., 0.02 AP for animals, 0.04
AP for vehicles). Improvements in localization or differentiating between similar cat-
egories would lead to the largest gains. If poor localizations were corrected, e.g. with
an effective category-based segmentation method, performance would improve greatly
from additional high-confidence true positives, as well as fewer false positives.
categories (aeroplane, bicycle, bird, boat, cat, chair, diningtable) that span
the major groups of vehicles, animals, and furniture. Annotations were created by one
author to ensure consistency and are publicly available.
Object size is measured as the pixel area of the bounding box. We also considered
bounding box height as a size measure, which led to similar conclusions. We assign
each object to a size category, depending on the object's percentile size within its object
category: extra-small (XS: bottom 10%); small (S: next 20%); medium (M: next 40%);
large (L: next 20%); extra-large (XL: next 10%). Aspect ratio is defined as object width
divided by object height, computed from the VOC bounding box annotation. Similarly
to object size, objects are categorized into extra-tall (XT), tall (T), medium (M), wide
(W), and extra-wide (XW), using the same percentiles. Occlusion (part of the object
is obscured by another surface) and truncation (part of the object is outside the im-
age) have binary annotations in the standard VOC annotation. We replace the occlusion
labels with degree of occlusion (see Fig. 3): None, Low (slight occlusion), Moder-
ate (significant part is occluded), and High (many parts missing or 75% occluded).
Visibility of parts influences detector performance, so we add annotations for whether
each part of an object is visible. We annotate viewpoint as whether each side (bottom,
front, top, side, rear) is visible. For example, an object seen from the front-right
may be labeled as bottom=0, front=1, top=0, side=1, rear=0.
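The percentile-based binning described above can be sketched directly; the percentile boundaries (10/30/70/90) come from the text, while the function name and the strict-inequality rank convention are our assumptions.

```python
def size_category(area, category_areas):
    """Assign XS/S/M/L/XL from the object's percentile rank within its
    category: XS bottom 10%, S next 20%, M next 40%, L next 20%, XL top 10%.
    The same boundaries apply to aspect ratio (XT/T/M/W/XW)."""
    pct = 100.0 * sum(a < area for a in category_areas) / len(category_areas)
    if pct < 10:
        return "XS"
    if pct < 30:
        return "S"
    if pct < 70:
        return "M"
    if pct < 90:
        return "L"
    return "XL"
```

Because percentiles are computed per category, "extra-small" for buses and for bottles refer to very different absolute pixel areas.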
Table 1. Detection Results on PASCAL VOC2007 Dataset. For each object category, we show the
total number of objects and the average precision (AP) and average normalized precision (APN )
for the VGVZ and FGMR detectors. For APN , the precision is normalized to be comparable
across categories. Categories with many positive examples (e.g., person) will have lower APN
than AP; the reverse is true for categories with few examples.
aero bike boat bus car mbike train bird cat cow dog horse sheep bottle chair table plant sofa tv pers
Num Objs 285 337 263 213 1201 325 282 459 358 244 489 348 242 469 756 206 480 239 308 4528
VGVZ AP .364 .468 .113 .513 .508 .450 .447 .106 .277 .312 .174 .512 .208 .193 .131 .191 .073 .266 .477 .200
APN .443 .531 .181 .612 .480 .511 .527 .136 .372 .418 .222 .565 .294 .222 .130 .318 .094 .294 .567 .073
FGMR AP .277 .581 .134 .481 .555 .475 .436 .028 .149 .214 .034 .582 .154 .225 .191 .204 .068 .306 .405 .410
APN .343 .631 .187 .574 .517 .537 .528 .039 .207 .311 .048 .631 .215 .252 .190 .346 .086 .409 .477 .214
would be higher than if Nj were small for the same detection rate. This sensitivity to
Nj invalidates AP comparisons for different sets of objects. For example, we cannot use
AP to determine whether people are easier to detect than cows, or whether big objects
are easier to detect than small ones.
We propose to replace Nj with a constant N to create a normalized precision:

PN(c) = R(c) · N / (R(c) · N + F(c)). (2)
In our experiments, we set N = 0.15Nimages = 742.8, which is roughly equal to the
average Nj over the PASCAL VOC categories. Detectors with similar detection rates
and false positive rates will have similar normalized precision-recall curves. We can
summarize normalized precision-recall by averaging the normalized precision values of
the positive examples to create average normalized precision (APN ). When computing
APN , undetected objects are assigned a precision value of 0. We compare AP and APN
in Table 1.
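Eq. 2 and the APN summary translate directly into code. A minimal sketch, assuming R(c) is the recall at confidence threshold c and F(c) the number of false positives at that threshold (the function names are ours):

```python
def normalized_precision(recall, false_positives, n=742.8):
    """Eq. 2: P_N(c) = R(c)*N / (R(c)*N + F(c)), with the constant
    N = 0.15 * N_images = 742.8 replacing the per-category N_j."""
    denom = recall * n + false_positives
    return (recall * n) / denom if denom > 0 else 0.0

def average_normalized_precision(pn_of_detected, num_undetected):
    """APN: average P_N over all positive examples; undetected objects
    contribute a precision of 0."""
    total = len(pn_of_detected) + num_undetected
    return sum(pn_of_detected) / total if total else 0.0
```

Because N is shared across categories, detectors with equal detection and false positive rates obtain equal normalized precision regardless of how many positives each category contains, which is exactly what makes APN comparable across categories.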
3.3 Analysis
To understand performance for a category, it helps to inspect the performance variations
for each characteristic. We include this detailed analysis primarily as an example of how
researchers can use our tool to understand the strengths and weaknesses of their own
detectors. Upon careful inspection of Figure 4, for example, we can learn the following
about airplane detectors: both detectors perform similarly, preferring similar subsets
of non-occluded, non-truncated, medium to extra-wide, side views in which all major
parts are visible. Performance for very small and heavily occluded airplanes is poor,
though performance on non-occluded airplanes is near average (because few airplanes
are heavily occluded). FGMR shows a stronger preference for exact side-views than
VGVZ, which may also account for differences in performance on small and extra-
large objects, which are both less likely to be side views. We can learn similar things
about the other categories. For example, both detectors work best for large cats, and
FGMR performs best with truncated cats, where only the face and ears are
visible. Note that both detectors seem to vary in similar ways, indicating that their
sensitivities may be due to some objects being intrinsically more difficult to recognize.
The VGVZ cat detector is less sensitive to viewpoint and part visibility, which may be
due to its textural bag of words features.
Fig. 4. Per-Category Analysis of Characteristics: APN (+) with standard error bars (red).
Black dashed lines indicate overall APN . Key: Occlusion: N=none; L=low; M=medium; H=high.
Truncation: N=not truncated; T=truncated. Bounding Box Area: XS=extra-small; S=small;
M=medium; L=large; XL=extra-large. Aspect Ratio: XT=extra-tall/narrow; T=tall; M=medium;
W=wide; XW=extra-wide. Part Visibility / Viewpoint: 1=part/side is visible; 0=part/side is
not visible. Standard error is used for the average precision statistic as a measure of significance,
rather than confidence bounds, due to the difficulty of modeling the precision distribution.
Since size and occlusion are such important characteristics, we think it worthwhile
to examine their effects across several categories (Fig. 5). Typically, detectors work best
for non-occluded objects, but when objects are frequently occluded (bicycle, chair, din-
ingtable) there is sometimes a small gain in performance for lightly occluded objects.
Although detectors are bad at detecting objects with medium to heavy occlusion, the impact on
overall performance is small. Researchers working on the important problem of occlu-
sion robustness should be careful to examine effects within the subset of occluded ob-
jects. Both detectors tend to prefer medium to large objects (the 30th to 90th percentile
in area). The difficulty with small objects is intuitive. The difficulty with extra-large ob-
jects may initially surprise, but qualitative analysis (e.g., Fig. 7) shows that large objects
are often highly truncated or have unusual viewpoints.
Fig. 6 provides a compact summary of the sensitivity to each characteristic and the
potential impact of improving robustness. The worst-performing and best-performing
subsets for each characteristic are averaged over 7 categories: the difference between
best and worst indicates sensitivity; the difference between best and overall indicates
potential impact. For example, detectors are very sensitive to occlusion and truncation,
but the impact is small (about 0.05 APN ) because the most difficult cases are not com-
mon. Object area and aspect have a much larger impact (roughly 0.18 for area, 0.13
for aspect). Overall, VGVZ is more robust than FGMR to viewpoint and part visibility,
likely because it is better at encoding texture (due to bags of words features), while
FGMR can accommodate only limited deformations. The performance of VGVZ varies
more than FGMR with object size. Efforts to improve occlusion or viewpoint robust-
ness should be validated with specialized analysis so that improvements are not diluted
by the more common easier cases.
One of the aims of our study is to understand why current detectors fail to detect
30–40% of objects, even at small confidence thresholds. We consider an object to be
Fig. 5. Sensitivity and Impact of Object Characteristics: APN (+) with standard error bars
(red). Black dashed lines indicate overall APN . Key: Occlusion: N=none; L=low; M=medium;
H=high. Bounding Box Area: XS=extra-small; S=small; M=medium; L=large; XL=extra-large.
Figure panels (left to right): VGVZ 2009, FGMR (v2), and FGMR (v4) Sensitivity and Impact.
Fig. 6. Summary of Sensitivity and Impact of Object Characteristics: We show the average
(over categories) APN performance of the highest performing and lowest performing subsets
within each characteristic (occlusion, truncation, bounding box area, aspect ratio, viewpoint, part
visibility). Overall APN is indicated by the black dashed line. The difference between max and
min indicates sensitivity; the difference between max and overall indicates the impact.
undetected if there is no detection above 0.05 PN (roughly 1.5 FP/image). Our analysis
indicates that size is the best single explanation, as nearly all extra-small objects go
undetected and roughly half of undetected objects are in the smallest 30%. But size
accounts for only 20% of the entropy of whether an object is detected (measured by
information gain divided by entropy). Contrary to intuition, occlusion does not increase
the chance of going undetected (though it decreases the expected confidence); there is
also only a weak correlation with aspect ratio. As can be seen in Figure 7, misses are
due to a variety of factors, such as unusual appearance, unusual viewpoints, in-plane
rotation, and particularly disguising occlusions. Some missed detections are actually
detected with high confidence but poor localization. With a 10% overlap criterion, the
number of missed detections is reduced by 40–85% (depending on category and detector).
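The "information gain divided by entropy" statistic used above to say that size explains only 20% of detection uncertainty can be sketched as follows. This is our illustration of the standard definition, not the authors' code; inputs are a discrete characteristic (e.g., the size category) and a detected/missed label per object.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def explained_entropy_fraction(characteristic, detected):
    """Information gain of `characteristic` about `detected`, divided by
    the entropy of `detected` (0 = no predictive value, 1 = fully predictive)."""
    h = entropy(detected)
    if h == 0:
        return 0.0
    groups = {}
    for x, y in zip(characteristic, detected):
        groups.setdefault(x, []).append(y)
    h_cond = sum(len(g) / len(detected) * entropy(g) for g in groups.values())
    return (h - h_cond) / h
```

A value of 0.2 for the size category would mean size removes a fifth of the uncertainty about whether an object is detected, leaving most of the variation to other factors.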
Conclusions and Discussion: As expected, objects that are small, heavily occluded,
seen from unusual views (e.g., the bottom), or that have important occluded parts are
hard to detect. But the deviations are interesting: slightly occluded bicycles and ta-
bles are easier; detectors often have trouble with very large objects; cats are easiest for
Fig. 7. Unexpectedly Difficult Detections: We fit a linear regressor to predict confidence (PN )
based on size, aspect ratio, occlusion and truncation for the FGMR (v4) detector. We show 5
of the 15 objects with the greatest difference between predicted confidence and actual detection
confidence. The ground truth object is in red, with predicted confidence in upper-right corner
in italics. The other box shows the highest scoring correct detection (green), if any, or highest
scoring overlapping detection (blue dashed) with the detection confidence in the upper-left corner.
FGMR when only the head is visible. Despite high sensitivity, the overall impact of
occlusion and many other characteristics is surprisingly small. Even if the gap between
no occlusions and heavy occlusions were completely closed, the overall AP gain would
be only a few percent. Also, note that smaller objects, heavily occluded objects, and
those seen from unusual viewpoints will be intrinsically more difficult, so we cannot
expect to close the gap completely.
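The regressor-based screening behind Fig. 7 can be sketched as follows: fit ordinary least squares from object characteristics (size, aspect ratio, occlusion, truncation) to detection confidence, then surface the objects whose actual confidence falls farthest below the prediction. A pure-Python sketch with assumed inputs; the solver and function names are ours.

```python
def fit_linear(X, y):
    """Ordinary least squares via the normal equations (adds a bias term),
    solved with Gaussian elimination and partial pivoting."""
    rows = [[1.0] + list(x) for x in X]
    d = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    b = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(d)]
    for col in range(d):
        piv = max(range(col, d), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, d):
            f = A[r][col] / A[col][col]
            for c in range(col, d):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    w = [0.0] * d
    for i in reversed(range(d)):
        w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, d))) / A[i][i]
    return w

def predict(w, x):
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

def most_unexpected(X, y, k=5):
    """Indices of the k objects whose actual confidence most undershoots
    the regressor's prediction (candidates for qualitative inspection)."""
    w = fit_linear(X, y)
    gaps = [predict(w, x) - yi for x, yi in zip(X, y)]
    return sorted(range(len(y)), key=lambda i: -gaps[i])[:k]
```

Inspecting the top gaps, rather than the lowest raw confidences, isolates objects that are hard for reasons the annotated characteristics do not explain.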
We can also see the impact of design differences for two detectors. The VGVZ de-
tector incorporates more textural features and also has a more flexible window proposal
method. These detector properties are advantageous for detecting highly deformable
objects, such as cats, and they lead to reduced sensitivity to part visibility and view-
point. However, VGVZ is more sensitive to size (performing better for large objects
and worse for small ones) because the bag-of-words models are size-dependent, while
the FGMR templates are not. Comparing the FGMR (v2) and (v4) detectors in Figure 6,
we see that the additional components and left/right latent flip term lead to improved
robustness (improvement in both worst and best cases) to aspect ratio, viewpoint, and
truncation.
4 Conclusion
4.1 Diagnosis
We have analyzed the patterns of false and missed detections made by currently top-
performing detectors. The good news is that egregious errors are rare. Most false posi-
tives are due to misaligned detection windows or confusion with similar objects (Fig. 2).
Most missed detections are atypical in some way: small, occluded, viewed from an
unusual angle, or simply odd-looking. Detectors seem to excel at latching onto the
common modes of object appearance and avoiding confusion with miscellaneous back-
ground patches. For example, detectors more easily detect lightly occluded objects for
categories that are typically occluded, such as bicycles and tables. Airplanes in the
stereotypical direct side-view are detected by FGMR with almost double the accuracy
of the average airplane.
Further progress, however, is likely to be slow, incremental, and to require careful
validation. Localization error is not easily handled by detectors because determining
object extent often requires looking well outside the window. A box around a cat face
could be a perfect detection in one image but only a small part of the visible cat in
another (Figs. 1, 8). Gradient histogram-based models, designed to accommodate mod-
erate positional slop, may be too coarse for differentiation of similar objects, such as
cows and horses or cats and dogs. Although detectors fare well for common views of
objects, there are many types of deviations that may require different solutions. Solving
only one problem will lead to small gains. For example, even if occluded objects could
be detected as easily as unoccluded ones, the overall AP would increase by 0.05 or less
(Fig. 6). Better detection of small objects could yield large gains (roughly 0.17 AP), but
the smallest objects are intrinsically difficult to detect. Our biggest concern is that suc-
cessful attempts to address problems of occlusion, size, localization, or other problems
could look unpromising if viewed only through the lens of overall performance. There
may be a gridlock in recognition research: new approaches that do not immediately ex-
cel at detecting objects typical of the standard datasets may be discarded before they
are fully developed. One way to escape that gridlock is through more targeted efforts
that specifically evaluate gains along the various dimensions of error.
4.2 Recommendations
Detecting Small/Large Objects: An object's size is a good predictor of whether it will
be detected. Small objects have fewer visual features; large objects often have unusual
viewpoints or are truncated. Current detectors make a weak perspective assumption that
an object's appearance does not depend on its distance to the camera. This assumption
is violated when objects are close, making the largest objects difficult to detect. When
objects are far away, only low resolution template models can be applied, and contextual
models may be necessary for disambiguation. There are interesting challenges in how
to benefit from the high resolution of nearby objects while being robust to near-field
perspective effects. Park et al. [24] offer one simple approach of combining detectors
trained at different resolutions. Other possibilities include using scene layout to predict
viewpoint of nearby objects, including features that take advantage of texture and small
parts that are visible for close objects, and occlusion reasoning for distant objects.
Improving Localization: Difficulty with localization and identification of duplicate
detections has a large impact on recognition performance for many categories. In part,
the problem is that template-based detectors cannot accommodate flexible objects.
For example, the FGMR (v4) detector works very well for localizing cat heads, but not
cat bodies [25]. Additionally, an object part, such as a head, could be considered correct
if the body is occluded, or incorrect otherwise; it is impossible to tell from within the
window alone (Fig. 8). Template detectors should play a major role in finding likely
object positions, but specialized processes are required for segmenting out the objects.
Diagnosing Error in Object Detectors 351
[Fig. 8 panel headings: Dog Model; Challenges of Confusion with Similar Categories]
Fig. 8. Challenging False Positives: In the top row, we show examples of dog detections that
are considered correct or incorrect, according to the VOC localization criteria. On the right, we
cropped out the rest of the image; it is not possible to determine correctness from within the win-
dow alone. On the bottom, we show several confident dog detections, some of which correspond
to objects from other similar categories. The robust HOG template detector (right), though good
for sorting through many background patches quickly, may be too robust for finer differentiations.
In some applications, precise localization is unimportant; even so, the ability to identify
occluding and interior object contours would also be useful for category verification,
attribute recognition, or pose estimation. Current approaches to segment objects from
known categories (e.g., [8, 25]) tend to combine simple shape and/or color priors with
more generic segmentation techniques, such as graph cuts. There is an excellent oppor-
tunity for a more careful exploration of the interaction of material, object shape, pose,
and object category, through contour and texture-based inference. We encourage such
work to evaluate specifically in terms of improved localization of objects and to avoid
conflating detection and localization performance.
Reducing Confusion with Similar Categories: By far, the biggest task of detectors
is to quickly discard random background patches, and they do well with features that
are made robust to illumination, small shifts, and other variations. But a detector that
sorts through millions of windows per second may not be suited for differentiating be-
tween dogs and cats or horses and cows. Such differences require detailed comparison
of particular features, such as the shape of the head or eyes. Some recent work has
addressed fine-grained differentiation of birds and plants [26–28], and the ideas of find-
ing important regions for comparison may apply to category recognition as well. Also,
while HOG features [29] are well-suited to whole-object detection, some of the feature
representations originally developed for detection of small faces (e.g., [30]) may be
better for differentiating similar categories based on their localized parts. We encourage
such work to evaluate specifically on differentiating between similar objects. It may be
worthwhile to pose the problem simply as categorizing a set of well-localized objects
(even categorizing objects given PASCAL VOC bounding boxes is not easy).
352 D. Hoiem, Y. Chodpathumwan, and Q. Dai
Robustness to Object Variation: One interpretation of our results is that existing de-
tectors do well on the most common modes of appearance (e.g., side views of airplanes)
but fail when a characteristic, such as viewpoint, is unusual. Greatly improved recogni-
tion will require appearance models that better encode shape with robustness to moder-
ate changes in viewpoint, pose, lighting, and texture. One approach is to learn feature
representations from paired 3D and RGB images; a second is to learn the natural vari-
ations of existing features within categories for which examples are plentiful and to
extend that variational knowledge to other categories.
More Detailed Analysis: Most important, we hope that our work will inspire research
that targets and evaluates reduction in specific modes of error. In the supplemental ma-
terial, we include automatically generated reports for four detectors. Authors of future
papers can use our tools to perform similar analysis, and the results can be compactly
summarized in 1/4 page, as shown in Figs. 2 and 6. We also encourage analysis of other
aspects of recognition, such as the effects of training sample size, cross-dataset gener-
alization, cross-category generalization, and recognition of pose and other properties.
References
1. Griffin, G., Holub, A., Perona, P.: Caltech-256 object category dataset. Technical Report
7694, California Institute of Technology (2007)
2. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL
Visual Object Classes Challenge 2007 (VOC 2007) (2007) Results,
http://www.pascal-network.org/challenges/VOC/voc2007/
workshop/index.html
3. Vedaldi, A., Gulshan, V., Varma, M., Zisserman, A.: Multiple kernels for object detection.
In: ICCV (2009)
4. Felzenszwalb, P., McAllester, D., Ramanan, D.: A discriminatively trained, multiscale, de-
formable part model. In: CVPR (2008)
5. Felzenszwalb, P., Girshick, R., McAllester, D., Ramanan, D.: Object detection with discrim-
inatively trained part based models. PAMI (2009)
6. Wu, B., Nevatia, R.: Detection of multiple, partially occluded humans in a single image by
Bayesian combination of edgelet part detectors. In: ICCV (2005)
7. Wang, X., Han, T.X., Yan, S.: A HOG-LBP human detector with partial occlusion handling.
In: ICCV (2009)
8. Yang, Y., Hallman, S., Ramanan, D., Fowlkes, C.: Layered object detection for multi-class
segmentation. In: CVPR (2010)
9. Vedaldi, A., Zisserman, A.: Structured output regression for detection with partial occlusion.
In: NIPS (2009)
10. Kushal, A., Schmid, C., Ponce, J.: Flexible object models for category-level 3D object recog-
nition. In: CVPR (2007)
11. Hoiem, D., Rother, C., Winn, J.: 3D LayoutCRF for multi-view object class recognition and
segmentation. In: CVPR (2007)
12. Sun, M., Su, H., Savarese, S., Fei-Fei, L.: A multi-view probabilistic model for 3D object
classes. In: CVPR (2009)
13. Schmid, C., Mohr, R., Bauckhage, C.: Evaluation of interest point detectors. IJCV 37,
151–172 (2000)
14. Gil, A., Mozos, O.M., Ballesta, M., Reinoso, O.: A comparative evaluation of interest point
detectors and local descriptors for visual SLAM. Machine Vision and Applications 21(6),
905–920 (2009)
15. Divvala, S., Hoiem, D., Hays, J., Efros, A., Hebert, M.: An empirical study of context in
object detection. In: CVPR (2009)
16. Rabinovich, A., Belongie, S.: Scenes vs. objects: a comparative study of two approaches to
context based recognition. In: Intl. Wkshp. on Visual Scene Understanding, ViSU (2009)
17. Pinto, N., Cox, D.D., DiCarlo, J.J.: Why is real-world visual object recognition hard? PLoS
Computational Biology 4, e27 (2008)
18. Torralba, A., Efros, A.: Unbiased look at dataset bias. In: CVPR (2011)
19. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL
visual object classes (VOC) challenge. IJCV 88, 303–338 (2010)
20. Dollar, P., Wojek, C., Schiele, B., Perona, P.: Pedestrian detection: A benchmark. In: CVPR
(2009)
21. Sim, T., Baker, S., Bsat, M.: The CMU Pose, Illumination, and Expression (PIE) database
of human faces. Technical Report CMU-RI-TR-01-02, Carnegie Mellon, Robotics Institute
(2001)
22. Phillips, P.J., Flynn, P.J., Scruggs, T., Bowyer, K.W., Chang, J., Hoffman, K., Marques, J.,
Min, J., Worek, W.: Overview of the face recognition grand challenge, pp. 947–954 (2005)
23. Felzenszwalb, P.F., Girshick, R.B., McAllester, D.: Discriminatively trained deformable part
models, release 4,
http://people.cs.uchicago.edu/pff/latent-release4/
24. Park, D., Ramanan, D., Fowlkes, C.: Multiresolution Models for Object Detection. In: Dani-
ilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part IV. LNCS, vol. 6314, pp. 241–254.
Springer, Heidelberg (2010)
25. Parkhi, O., Vedaldi, A., Jawahar, C.V., Zisserman, A.: The truth about cats and dogs. In:
ICCV (2011)
26. Belhumeur, P.N., Chen, D., Feiner, S.K., Jacobs, D.W., Kress, W.J., Ling, H., Lopez, I.,
Ramamoorthi, R., Sheorey, S., White, S., Zhang, L.: Searching the World's Herbaria: A
System for Visual Identification of Plant Species. In: Forsyth, D., Torr, P., Zisserman, A.
(eds.) ECCV 2008, Part IV. LNCS, vol. 5305, pp. 116–129. Springer, Heidelberg (2008)
27. Welinder, P., Branson, S., Mita, T., Wah, C., Schroff, F., Belongie, S., Perona, P.: Caltech-
UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology
(2010)
28. Khosla, A., Yao, B., Fei-Fei, L.: Combining randomization and discrimination for fine-
grained image categorization. In: CVPR (2011)
29. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR (2005)
30. Schneiderman, H., Kanade, T.: A statistical model for 3-d object detection applied to faces
and cars. In: CVPR (2000)
Attributes for Classifier Feedback
1 Introduction
Consider the scenario where a teacher is trying to teach a child how to recognize gi-
raffes. The teacher can show example photographs of giraffes to the child, as well as
many pictures of other animals that are not giraffes. The teacher can then only hope
that the child effectively utilizes all these examples to decipher what makes a giraffe a
giraffe. In computer vision (and machine learning in general), this is the traditional way
in which we train a classifier to learn novel concepts.
It is clear that throwing many labeled examples at a passively learning child is very
time consuming. Instead, it might be much more efficient if the child actively engages
in the learning process. After seeing a few initial examples of giraffes and non-giraffes,
the child can identify animals that it finds most confusing (say a camel), and request a
label from the teacher for these examples, where the labels are likely to be most informative.
This corresponds to the now well studied active learning paradigm.
While this paradigm is more natural, there are several aspects of this set-up that seem
artificially restrictive. For instance, the learner simply poses a query to the teacher, but
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 354–368, 2012.
© Springer-Verlag Berlin Heidelberg 2012
Fig. 1. We propose a novel use of attributes as a mode for a human supervisor to provide feedback
to a machine active learner, allowing it to better learn from its mistakes
does not communicate to the teacher what its current model of the world (or giraffes!) is.
It simply asks "What is this?". Instead, it could just as easily say "I think this is a giraffe,
what do you think?". The teacher can then encouragingly confirm or gently reject this
belief. More importantly, if the teacher rejects the belief, he can now proactively provide
a focussed explanation for why the learner's current belief is inaccurate. If the learner
calls a tiger a giraffe, the teacher can say "Unfortunately, you're wrong. This is not
a giraffe, because its neck is not long enough." Not only does this give the learner
information about this particular example, it also gives the learner information that can
be transferred to many other examples: all animals with necks shorter than this query
animal (tiger) must not be giraffes. With this mode of supervision, the learner can learn
what giraffes are (not) with significantly fewer labeled examples of (not) giraffes.
To allow for such a rich mode of communication between the learner and the teacher,
we need a language that they both can understand. Attributes provide precisely this
mode of communication. They are mid-level concepts such as "furry" and "chubby"
that are shareable across categories. They are visual and hence machine detectable,
as well as human interpretable. Attributes have been used for a variety of tasks such
as multimedia retrieval [1–6], zero-shot learning of novel concepts [7, 8], describing
unusual aspects of objects [9], face verification [10] and others.
In this paper, we advocate the novel use of attributes as a mode of communication
for a human supervisor to provide feedback to a machine classifier. This allows for a
natural learning environment where the learner can learn much more from each mistake
it makes. The supervisor can convey his domain knowledge about categories, which can
be propagated to unlabeled images in the dataset. Leveraging these images allows for
effective discriminative learning even with a few labeled examples. We use a straight-
forward approach to incorporate this feedback in order to update the learnt classifiers.
In the above example, all images that have shorter necks than the tiger are labeled as
not-giraffe, and the giraffe classifier is updated with these new negative examples. We
demonstrate the power of this feedback in two different domains (scenes and faces) on
four visual recognition scenarios including image classification and image annotation.
2 Related Work
We discuss our work relative to existing work on exploiting attributes, collecting richer
annotation, active learning, providing classifier feedback, and discriminative guidance.
356 A. Parkash and D. Parikh
Attributes: Human-nameable visual concepts or attributes are often used in the mul-
timedia community to build intermediate representations for images [1–6]. Attributes
have also been gaining a lot of attention in the computer vision community over the
past few years [7–24]. The most relevant to our work would be the use of attributes for
zero-shot learning. Once a machine has learnt to detect attributes, a supervisor can teach
it a novel concept simply by describing which attributes the concept does or does not
have [7] or how the concept relates to other concepts the machine already knows [8].
The novel concept can thus be learnt without any training examples. However, zero-shot
learning is restricted to being generative in nature and the category models have to be
built in the attribute space. Given a test image, its attribute values need to be predicted,
and then the image can be classified. Hence, while very light on supervision, zero-shot
learning often lags in classification performance. Our work allows for the marriage of
using attributes to alleviate supervision effort, while still allowing for discriminative
learning. Moreover, the feature space is not restricted to being the attributes space.
This is because we use attributes during training to transfer category labels to other
unlabeled image instances, as opposed to directly carving out portions of the attributes
space where a category can live. At test time, attribute detectors need not be used.
Detailed Annotation: With the advent of crowd-sourcing services, efforts are being
made at getting many more images labeled, and more importantly, gathering more
detailed annotations such as segmentation masks [25], parts [26], attributes [10, 20],
pose [27], etc. In contrast, the goal of our work is not to collect deeper annotations
(e.g. attribute annotations) for the images. Instead, we propose the use of attributes as
an efficient means for the supervisor to broadly transfer category-level information to
unlabeled images in the dataset that the learner can exploit.
Active Learning: There is a large body of work, in computer vision and otherwise, that
utilizes active learning to efficiently collect data annotations. Several works consider
the problem of actively interleaving requests for different forms of annotation: object
and attribute labels [20], image-level annotation, bounding boxes or segmentations [28],
part annotations and attribute presence labels [21], etc. To re-iterate, our work focuses
only on gathering category-label annotations, but we use attributes as a mode for the
supervisor to convey more information about the categories, leading to propagation of
category labels to unlabeled images. The system proposed in [22] recognizes a bird
species with the help of a human expert answering actively selected questions pertain-
ing to visual attributes of the bird. Note the involvement of the user at test time. Our
approach uses a human in the loop only during training. In [29], the learner actively asks
the supervisor linguistic questions involving prepositions and attributes. In contrast, in
our work, an informative attribute is selected by the supervisor that allows the classi-
fier to better learn from its mistakes. Our work can be naturally extended to leveraging
the provided feedback to simultaneously update the attribute models as in [20], or for
discovering the attribute vocabulary in the first place as in [23].
Rationales: In natural language processing, works have explored human feature selec-
tion to highlight words relevant for document classification [30–32]. A recent work
in vision [33] similarly solicits rationales from the annotator in the form of spatial
information or attributes for assigning a certain label to the image. This helps in iden-
tifying relevant image features or attributes for that image. But similar to zero-shot
learning, this can be exploited only if the feature space is the attributes space and thus
directly maps to the mode of feedback used by the annotator. Our work also uses at-
tributes to allow the supervisor to interactively communicate an explanation, but this
explanation is propagated as category labels to many unlabeled images in the dataset.
The feature space where category models are built is thus not constrained by the mode
of communication (i.e., attributes) used by the annotator, and can hence be non-semantic
and quite complex (e.g., bag-of-words, gist, etc.).
Focussed Discrimination: The role of the supervisor in our work can be viewed as
that of providing a discriminative direction that helps eliminate current confusions of
the learner. Such a discriminative direction is often enforced by mining hard negative
examples that violate the margin in large-margin classifiers [34], or is determined after-
the-fact to better understand the learnt classifier [35]. In our work, a human supervisor
provides this direction by simply verbalizing his semantic knowledge about the domain
to steer the learner away from its inaccurate beliefs.1
3 Proposed Approach
We consider a scenario where a supervisor is trying to teach a machine visual concepts.
The machine has already learnt a vocabulary of attributes relevant to the domain. As a
learning aid, there is a pool of unlabeled images that the supervisor will label over the
course of the learning process for the classifier to learn from. The learning process starts
with the learner querying the supervisor for a label for a random example in the pool
of unlabeled images. At each subsequent iteration, the learner picks an image from the
unlabeled pool that it finds most confusing (e.g. a camel image in the giraffe learning
example). It communicates its own belief about this image to the supervisor in the form
of a predicted label for this image. The supervisor either confirms or rejects this label.
If rejected, the supervisor provides a correct label for the query. He also communicates
an explanation using attributes for why the learner's belief was wrong. Note that the
feedback can be relative to the category depicted in the query image ("necks of tigers
are not long enough to be giraffes") or relative to the specific image instance ("this
image is too open to be a forest image", see Fig. 1). The learner incorporates both the
label-feedback (whether accepted or rejected with correction) and the attributes-based
explanation (when applicable) to update its models. And the process continues.
We train a binary classifier h_k(x), k ∈ {1, . . . , K}, for each of the K categories to be
learnt. At any point during the learning process, there is an unlabeled pool of images and
a training set T_k = {(x^k, y^k)} of labeled images for each classifier. The labeled images
x^k can lie in any feature space, including the space of predicted attribute values R^M.
The class labels are binary: y^k ∈ {+1, −1}. In our implementation we use RBF SVMs
as the binary classifier with probability estimates as output. If p_k(x) is the probability
output of the k-th classifier, the normalized probability is
1 In similar spirit, contemporary work at this conference [36] uses attributes to prevent a semi-supervised approach from augmenting its training data with irrelevant candidates from an unlabeled pool of images, thus avoiding semantic drift.
$\hat{p}_k(x) = \dfrac{p_k(x)}{\sum_{k'=1}^{K} p_{k'}(x)}. \qquad (1)$
At each iteration in the learning process, the learner actively selects the unlabeled image
x* with maximum entropy.
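The normalization of Eq. (1) and the maximum-entropy query selection can be sketched as follows; the function names and the dict-based layout of per-classifier probabilities are our own illustration, not the authors' code:

```python
import math

def normalize(probs):
    # Eq. (1): turn raw per-classifier probabilities p_k(x)
    # into a distribution over the K classes.
    total = sum(probs)
    return [p / total for p in probs]

def entropy(dist):
    # Shannon entropy of a class distribution; highest when
    # the learner is most confused about the image.
    return -sum(p * math.log(p) for p in dist if p > 0)

def select_query(unlabeled):
    # Pick the unlabeled image whose normalized class
    # distribution has maximum entropy.
    return max(unlabeled, key=lambda x: entropy(normalize(unlabeled[x])))
```

Here `unlabeled` maps image identifiers to the K raw probability estimates; the image with the flattest normalized distribution is queried next.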
Note that there are two parts to the supervisor's response to the query image. The first
is label-based feedback, where the supervisor confirms or rejects and corrects the learner's
predicted label for the actively chosen image instance x*. The second is an
attributes-based explanation that is provided if the learner's predicted label was incor-
rect and thus rejected. We now discuss how the attributes-based feedback is incorpo-
rated. The label-based feedback can be interpreted differently based on the application
at hand, and we discuss that in Section 4.
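As the giraffe and forest examples suggest, an attributes-based explanation propagates negative labels: every unlabeled image on the wrong side of the query's predicted attribute strength is marked as a negative for the rejected class. A minimal sketch, where the function name, the `direction` encoding, and the dict of attribute scores are our own assumptions:

```python
def propagate_feedback(scores, x_star, unlabeled, direction):
    # scores: predicted attribute strength r_m(x) = w_m^T x per image.
    # If x_star is "too weak" in attribute m to belong to class k
    # (e.g. a neck not long enough to be a giraffe), any image scoring
    # no higher is also labeled as a negative for k; "too strong"
    # (e.g. too open to be a forest) is the symmetric case.
    if direction == "too_weak":
        negatives = [x for x in unlabeled if scores[x] <= scores[x_star]]
    else:  # direction == "too_strong"
        negatives = [x for x in unlabeled if scores[x] >= scores[x_star]]
    return [(x, -1) for x in negatives]
```

The returned signed pairs are simply appended to the rejected class's training set before retraining.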
Influence of Imperfect Attribute Predictors: The updated T_l may contradict the un-
derlying ground truth labels (not available to the system). For instance, when the su-
pervisor says "x* is too open to be a forest", there may be images in the unlabeled
pool U that are more open than x* but are in fact forest images. Or, due to faulty at-
tribute predictors, a forest image that is less open than the query image is predicted to
be more open, and is hence labeled as not forest. Hence, when in conflict, the label-
based feedback from the supervisor (which is the ground truth by definition) overrides
the label of an image that was or will be deduced from any past or future attributes-
based feedback. In this way, if the entire pool of unlabeled images U were sequentially
presented to the supervisor, even with inaccurate attributes-based feedback or faulty
attribute predictors, it would be correctly labeled. Of course, the scenario we are propos-
ing is one where the supervisor need not label all images in the dataset, employing
attributes-based feedback instead to propagate labels to unlabeled images. Hence, in
any realistic setting, some of these labels are bound to be incorrectly propagated. Our
approach relies on the assumption that the benefit obtained by propagating the labels to
many more images outweighs the harm done by inaccuracies in the propagation caused
by faulty attribute predictors. This is reasonable, especially since we use SVM classi-
fiers to build our category models, which involve maximizing a soft margin making the
classifier robust to outliers in the labeled training data. Of course, if the attribute pre-
dictors are severely flawed (in the extreme case: predicting the opposite attribute than
what they were meant to predict), this would not be the case. But as we see in our re-
sults in Section 6, in practice the attribute predictors are reliable enough allowing for
improved performance. Since the attribute predictors are pre-trained, their effectiveness
is easy to evaluate (perhaps on held-out set from the data used to train the attributes
in the first place). Presumably one would not utilize severely faulty attribute predictors
in real systems. Recall, attributes have been used in literature for zero-shot learning
tasks [7, 8] which also hinge on reasonable attribute predictors. We now describe our
choice of attribute predictors.
For each attribute m we learn a linear ranking function $r_m(x_i) = w_m^T x_i$, for $m = 1, \ldots, M$, such that the maximum number of the following constraints is satisfied:

$\forall (i, j) \in O_m : w_m^T x_i > w_m^T x_j, \qquad \forall (i, j) \in S_m : w_m^T x_i = w_m^T x_j. \qquad (3)$
This being an NP-hard problem, a relaxed version is solved using a large-margin
learning-to-rank formulation similar to that of Joachims [37], as adapted in [8]. With this, given
any image x in our pool of unlabeled images U, one can compute the relative strength
of each of the M attributes as $r_m(x) = w_m^T x$.
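As an illustration only, the relaxed ranking objective can be attacked with a crude subgradient descent on the pairwise hinge loss; the authors instead use a Joachims-style large-margin learning-to-rank solver, and the names and hyperparameters below are our own (the similarity constraints S_m are omitted for brevity):

```python
def learn_ranker(X, ordered_pairs, lr=0.1, C=1.0, epochs=100):
    # Learn a linear ranking function r_m(x) = w^T x from ordered
    # pairs (i, j), each meaning image i shows the attribute more
    # strongly than image j.
    d = len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        for i, j in ordered_pairs:
            margin = sum(w[t] * (X[i][t] - X[j][t]) for t in range(d))
            if margin < 1.0:  # constraint w^T x_i - w^T x_j >= 1 violated
                for t in range(d):
                    w[t] += lr * C * (X[i][t] - X[j][t])
        # mild weight decay standing in for the large-margin regularizer
        w = [wt * (1.0 - 0.001) for wt in w]
    return w

def rank(w, x):
    # Relative strength of the attribute for image x.
    return sum(wt * xt for wt, xt in zip(w, x))
```

On one-dimensional toy data the learned weight orders images by attribute strength, mimicking the role of the pre-trained relative attribute predictors used in the paper.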
4 Applications
We consider four different applications to demonstrate our approach. For each of these
scenarios, the label-based feedback is interpreted differently, but the attributes-based
feedback has the same interpretation as discussed in Section 3.1. The label-based feed-
back carries different amounts of information, and thus influences the classifiers to vary-
ing degrees. Recall that via the label-based feedback, the supervisor confirms or rejects
and corrects the classifier's predicted label for the actively chosen image x*.
Classification: This is the classical scenario where an image is to be classified into only
one of a pre-determined number of classes. In this application, the label-based
feedback is very informative. It not only specifies what the image is, it also indicates
what the image is not (all other categories in the vocabulary). Let's say the label pre-
dicted by the classifier for the actively chosen image x* is l. If the classifier is correct,
the supervisor can confirm the prediction. This would result in T_l = T_l ∪ {(x*, 1)}
and T_n = T_n ∪ {(x*, −1)} ∀ n ≠ l. On the other hand, if the prediction was in-
correct, the supervisor would indicate so and provide the correct label q. In this case,
T_q = T_q ∪ {(x*, 1)} and T_n = T_n ∪ {(x*, −1)} ∀ n ≠ q (note that this includes
T_l = T_l ∪ {(x*, −1)}). In this way, every label-based feedback impacts all K classi-
fiers, and is thus very informative. Recall that the attributes-based explanation can only
affect (the negative side) of one classifier at a time. We would thus expect that attributes-
based feedback in a classification scenario would not improve the performance of the
learner by much, since the label-based feedback is already very informative.
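The bookkeeping for label-based feedback in the classification setting reduces to appending signed examples to the K per-class training sets; a sketch with our own names and data layout:

```python
def label_feedback_update(T, x_star, q):
    # T maps each of the K classes to its training set, a list of
    # (example, +/-1) pairs. q is the label the supervisor settles on:
    # the classifier's prediction l when confirmed, or the correction
    # otherwise. x_star becomes a positive for q and a negative for
    # every other class (on a rejection this includes the predicted l).
    for k in T:
        T[k].append((x_star, 1 if k == q else -1))
    return T
```

In the large-vocabulary variant described next, a rejection comes without a correction, so only the predicted class's set would receive a negative.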
Large Vocabulary Classification: The next scenario we consider is one that is receiving a lot of
attention in the community today: a classification scenario where the number of classes
is very large (and perhaps evolving over time); for example (celebrity) face recogni-
tion [10] or specialized fine-grained classification of animal species [16] or classifying
thousands of object categories [38]. In this scenario, the supervisor can verify if the
classifier's prediction for x* is correct or not. But when it is incorrect, it would be very time-
consuming for, or beyond the expertise of, the supervisor to seek out the correct label
from the large vocabulary of categories to provide as feedback. Hence the supervisor
only confirms or rejects the prediction, and does not provide a correction when rejecting
it. In this case, if the classifier's prediction l is confirmed, then similar to classification,
T_l = T_l ∪ {(x*, 1)} and T_n = T_n ∪ {(x*, −1)} ∀ n ≠ l. However, if the classi-
fier's prediction is rejected, we only have T_l = T_l ∪ {(x*, −1)}. In this scenario, the
label feedback has less information when the classifier is wrong, and hence the classifier
would have to rely more on the attributes-based explanation to learn from its mistakes.
Annotation: In this application, each image can be tagged with more than one class.
Hence, specifying (one of the things) the image is does not imply anything about what
else it may or may not be. When the classifier predicts that x* has tag l (as
one of its tags), if the supervisor confirms it we have T_l = T_l ∪ {(x*, 1)}, and if the
classifier's prediction is rejected, we have T_l = T_l ∪ {(x*, −1)}. Note that only the
l-th classifier is affected by this feedback; all other classifiers remain unaffected. To
be consistent across the different applications, we assume that at each iteration in the
learning process, the classifier makes a single prediction, and the supervisor provides
a single response to this prediction. In the classification scenario, this single response
can affect all classifiers as we saw above. But in annotation, to affect all classifiers, the
supervisor would have to comment on each tag individually (as being relevant to image
x* or not). This would be time-consuming and does not constitute a single response.
Since the label-based feedback is relatively uninformative in this scenario, we expect
the attributes-based feedback to have a large impact on the classifier's ability to learn.
Biased Binary Classification: The last scenario we consider is that of a binary clas-
sification problem, where the negative set of images is much larger than the positive
set. This is frequently encountered when training an object detector for instance, where
there are millions of negative image windows not containing the object, but only a few
hundreds or thousands of positive windows containing the object of interest. In this case,
an example can belong to either the positive or negative class (similar to classification),
and the label-based feedback is informative. However, being able to propagate the neg-
ative labels to many more instances via our attributes-based explanation can potentially
accelerate the learning process. Approaches that mine hard negative examples [34] are
motivated by related observations. However, what we describe above is concerned more
with the bias in the class distribution, and not so much with the volume of data itself.
For all three classification-based applications, at test time, an image is assigned to
the class that receives the highest probability. For annotation, all classes that have a
probability greater than 0.5 are predicted as tags for the image.
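The two test-time decision rules can be sketched as follows (function names are ours):

```python
def predict_class(probs):
    # Classification-style applications: assign the image to the
    # class with the highest probability.
    return max(probs, key=probs.get)

def predict_tags(probs, threshold=0.5):
    # Annotation: every class whose probability exceeds 0.5 is
    # predicted as a tag for the image.
    return sorted(k for k, p in probs.items() if p > threshold)
```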
5 Experimental Set-Up
Datasets: We experimented with the two datasets used in [8]. The first is the out-
door scene recognition dataset [39] (Scenes) containing 2688 images from 8 categories:
coast, forest, highway, inside-city, mountain, open-country, street and tall-building de-
scribed via gist features [39]. The second dataset is a subset of the PubFig: Public
Figures Face Database [10] (Faces) containing 772 images of 8 celebrities: Alex Ro-
driguez, Clive Owen, Hugh Laurie, Jared Leto, Miley Cyrus, Scarlett Johansson, Viggo
Mortensen and Zac Efron described via gist and color features [8]. For both our datasets,
we used the pre-trained relative attributes made publicly available by Parikh et al. [8].
For Scenes they include natural, open, perspective, large-objects, diagonal-plane and
close-depth. For Faces: Masculine-looking, White, Young, Smiling, Chubby, Visible-
Forehead, Bushy-Eyebrows, Narrow-Eyes, Pointy-Nose, Big-Lips and Round-Face.
Applications: For Scenes, we show results on all four applications described above.
The binary classification problems are set up as learning only one of the 8 categories
[Fig. 2 panels: (a) "This image is not perspective enough to be a street scene." (b) "Zac Efron is too young to be Hugh Laurie" (bottom right). (c) Tags: open-country, mountain, forest. (d) Tags: street, inside-city, tall-building.]
Fig. 2. (a), (b) Example feedback collected from subjects; (c), (d) example images from the Scenes
dataset annotated with multiple tags by subjects
through the entire learning process, which is selected at random in each trial. Images
from the remaining 7 categories are considered to be negative examples. This results in
more negative than positive examples. Many of the images in this dataset can be tagged
with more than one category. For instance, many of the street images can just as easily
be tagged as inside-city, many of the coast images have mountains in the background,
etc. Hence, this dataset is quite suitable for studying the annotation scenario. We collect
annotation labels for the entire dataset as we will describe next. For Faces, we show
results for classification, binary classification and large vocabulary classification (annotation
is not applicable since each image displays only one person's face). For evaluation
purposes, we had to collect exhaustive real human data as attributes-based explanations
(more details below). This restricted the vocabulary of our dataset to be small. However
the domain of celebrity classification lends itself well to large vocabulary classification.
[Figure 3 plots accuracy vs. number of iterations for each application (classification, large-vocabulary classification, biased binary classification, annotation), in the relative-attributes space (att, top row) and the raw feature space (feat, bottom row); curves compare zeroshot, baseline, proposed (oracle) and proposed (hum).]
Fig. 3. Results for four different applications on the Scenes Dataset. See text for details.
selected a different statement. These cases were left unannotated. In our experiments,
no attributes-based explanation was provided for those particular image-category com-
binations. Hence the performance reported in our results is an underestimate.
For the Faces dataset, since the categories correspond to individuals, the attributes we
use can be expected to be the same across all images in the category. Hence we collect
feedback at the category-level. For each pair of categories, our 11 attributes allow for
22 statements such as "Miley Cyrus is too young to be Scarlett Johansson" or "Miley
Cyrus is too chubby to be Scarlett Johansson", and so on. We showed subjects example
images for the celebrities. Again, 5 subjects selected one of the statements that was
most true. We found that all category pairs had at least two subjects agreeing on the
feedback. Note that during the learning process, unlike Scenes, every time an image
from class q in the Faces dataset is classified as class l, the same feedback will be used.
Example responses collected from subjects can be seen in Fig. 2 (a), (b).
We annotate all 2688 images in the Scenes dataset with multiple tags. For each im-
age, 5 subjects on Amazon Mechanical Turk were asked to select any of the 8 categories
that they think can be aptly used to describe the image. A tag was retained if at least
2 subjects selected it. For the sake of completeness, for each image, we append its newly
collected tags with its ground truth tag from the original dataset (although in most cases
the ground truth label was already provided by subjects). On average, we have 1.6 tags
for every image in the dataset. Example annotations can be seen in Fig. 2 (c), (d).
6 Results
We now present our experimental results on the Scenes and Faces datasets for the ap-
plications described above. We will compare the performance of our approach that uses
both the label-based feedback (Section 4) as well as attributes-based explanations (Sec-
tion 3.1) for each mistake the classifier makes, to the baseline approach of only using
[Figure 4 plots accuracy vs. number of iterations for classification, large-vocabulary classification and biased binary classification, in the relative-attributes (att) and raw feature (feat) spaces; curves compare zeroshot, baseline, proposed (oracle) and proposed (hum).]
Fig. 4. Results for three different applications on the Faces Dataset. See text for details.
the label-based feedback (Section 4). Hence the difference in information in the label-
based feedback for the different applications affects both our approach and the baseline.
Any improvement in performance is due to the use of attributes for providing feedback.
Our results are shown in Figs. 3 and 4.² We show results in the raw feature space
(feat) i.e. gist for Scenes and gist+color for Faces, as well as in the relative attributes
space (att). The results shown are average accuracies of 20 random trials. For each trial,
we randomly select 50 images as the initial unlabeled training pool of images. Note
that for Scenes, these 50 images come from the 240 images we had subjects annotate
with feedback. The remaining images (aside from the 240 images) in the datasets are
used as test images. On the x-axis are the number of iterations in the active learning
process. Each iteration corresponds to one response from the supervisor as described
earlier. For classification-based applications, accuracy is computed as the proportion of
correctly classified images. For annotation, the accuracy of each of the tags (i.e. class)
in the vocabulary is computed separately and then averaged.
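The two accuracy measures can be illustrated with a short sketch (the labels, tags and vocabulary below are hypothetical, not the paper's data):

```python
def classification_accuracy(predicted, truth):
    """Proportion of correctly classified images."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)

def annotation_accuracy(predicted_tags, true_tags, vocabulary):
    """Per-tag accuracy (each tag treated as a binary presence problem),
    computed separately per tag and then averaged, as described above."""
    per_tag = []
    for tag in vocabulary:
        correct = sum((tag in p) == (tag in t)
                      for p, t in zip(predicted_tags, true_tags))
        per_tag.append(correct / len(true_tags))
    return sum(per_tag) / len(per_tag)

vocab = ["street", "inside-city", "coast"]        # hypothetical 3-tag vocabulary
truth = [{"street", "inside-city"}, {"coast"}]
pred = [{"street"}, {"coast"}]
print(classification_accuracy(["street", "coast"], ["street", "coast"]))  # 1.0
print(round(annotation_accuracy(pred, truth, vocab), 3))  # 0.833
```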
As expected, for classification where the label-based feedback is already informa-
tive, attributes-based feedback provides little improvement. On the other hand, for an-
notation where label-based feedback is quite uninformative, attributes-based feedback
provides significant improvements. This demonstrates the power of attributes-based
feedback in propagating information (i.e. class labels), even if weak, to a large set of
unlabeled images. This results in faster learning of concepts. Particularly with large vo-
cabulary classification, we see that the attributes-based feedback boosts performance
in the raw feature space more than the attributes space. This is because learning in
high-dimensional space with few examples is challenging. The additional feedback of
our approach can better tame the process. Attributes-based feedback also boosts perfor-
mance in the biased binary classification scenario more than in classical classification.
In datasets with a larger bias towards negative examples, the improvement may be even
² The number of iterations for binary classification was truncated because performance leveled off.
Attributes for Classifier Feedback 365
larger. Note that in many cases, the same classification accuracy can be reached via our
approach using only a fifth of the user's time required by the baseline approach. Clearly,
attributes-based feedback allows classifiers to learn more effectively from their mis-
takes, leading to accelerated learning even with fewer labeled examples.
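The propagation behind these gains can be sketched as follows; this is a simplified reading of the attributes-based feedback rule described above, and the "openness" scores are hypothetical relative-attribute predictions:

```python
def propagate_feedback(query_score, unlabeled_scores, too_much=True):
    """Simplified sketch: given feedback like 'this image is too open to be
    a forest', every unlabeled image whose relative-attribute score is at
    least as high is also labeled negative for that class (and symmetrically
    for 'not ... enough' feedback)."""
    if too_much:   # the attribute is too strong for the class
        return [i for i, s in enumerate(unlabeled_scores) if s >= query_score]
    return [i for i, s in enumerate(unlabeled_scores) if s <= query_score]

# Hypothetical "openness" scores for five unlabeled images:
openness = [0.2, 0.9, 0.55, 0.7, 0.1]
# Feedback: an image with openness 0.6 is "too open to be a forest"
print(propagate_feedback(0.6, openness))  # indices 1 and 3 become not-forest
```

A single response from the supervisor thus labels many unlabeled images at once, which is where the accelerated learning comes from.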
In Figs. 3 and 4 we also show an oracle-based accuracy for our approach. This is
computed by selecting the feedback at each iteration that maximizes the performance on
a validation set of images. This demonstrates the true potential our approach holds. The
gap between the oracle and the performance using human feedback is primarily due to
the lack of motivation of Mechanical Turk workers to identify the best feedback to provide.
We expect real users of the system to perform between the two solid (red and black)
curves. In spite of the noisy responses from workers, the red curve performs favorably
in most scenarios. This speaks to the robustness of the proposed straightforward idea of
using attributes for classifier feedback.
In Figs. 3 and 4 we also show the accuracy of zero-shot learning using the direct at-
tribute prediction model of Lampert et al. [7]. Briefly, each category is simply described
in terms of which attributes are present and which ones are not. These binary attributes
are predicted for a test image, and the image is assigned to the class with the most
similar signature. We use the binary attributes and binary attributes-based descriptions
of categories from [8]. Note that in this case, the binary attributes were trained using
images from all categories, and hence the categories are not truly "unseen". More impor-
tantly, the zero-shot learning descriptions of the categories were exactly what was used
to train the binary attribute classifiers in the first place. We expect performance to be
significantly worse if real human subjects provided the descriptions of these categories.
We see that even this optimistic zero-shot learning performance compares poorly to our
approach. While our approach also alleviates supervision burden (but not down to zero,
of course) by using attributes-based feedback, it still allows access to discriminative
learning, making the proposed form of supervision quite powerful.
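A minimal sketch of the zero-shot matching step described above (the attribute order and class signatures below are invented for illustration):

```python
def dap_zero_shot(predicted_attrs, class_signatures):
    """Each class is a binary attribute signature; a test image is assigned
    to the class whose signature agrees with the predicted binary attributes
    on the most entries."""
    def agreement(name):
        return sum(p == s for p, s in zip(predicted_attrs, class_signatures[name]))
    return max(class_signatures, key=agreement)

# Hypothetical signatures over (natural, open, perspective):
signatures = {"coast": [1, 1, 0], "street": [0, 0, 1], "forest": [1, 0, 0]}
print(dap_zero_shot([1, 1, 0], signatures))  # -> coast
```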
7 Discussion
The proposed learning paradigm raises many intriguing questions. It would be interesting
to consider the different strategies a user may use to identify what feedback should
be provided to the classifier for its mistake. One strategy may be to provide the most
accurate feedback. That is, when saying "this image is too open to be a forest", ensuring
that there is very little chance any image that is more open than this one could be
a forest. This ensures that incorrect labels are not propagated across the unlabeled im-
ages, but there may be very few images to transfer this information to in the first place.
Another strategy may be to provide very precise feedback. That is, when saying "the
person in this image (Miley Cyrus) is too young to be Scarlett Johansson", ensuring
that Scarlett Johansson is very close in age to Miley Cyrus. This ensures that this feed-
back can be propagated to many celebrities (all that are older than Scarlett Johansson).
Yet another strategy may be to simply pick the most obvious feedback. Of the many
attributes celebrities can have, some are likely to stand out to users more than others.
For example, if a picture of Miley Cyrus is classified as Zac Efron, most of us would
probably react to the fact that the genders are not consistent. Analyzing the behavior
of users, and perceptual sensitivity of users along the different attributes is part of fu-
ture work. Developing active learning strategies for the proposed learning paradigm is
also part of future work. An image should be actively selected with considerations that
the supervisor will provide an explanation that would propagate to many other images
relative to the selected image. One can also envision accounting for distance between
images in the attributes space when propagating labels. For instance, if a particular im-
age is too open to be a forest, an image that is significantly more open is extremely
unlikely to be a forest. This leads to interesting calibration questions: when a human
views an image as being significantly more open, what is the corresponding differ-
ence in the machine prediction of the relative attribute "open"? One could also explore
alternative interfaces for the supervisor to provide feedback e.g. providing an automati-
cally selected short-list of relevant attributes to select from. It would be worth exploring
the connections between using attributes-based feedback to transfer information to un-
labeled images and semi-supervised learning. Finally, exploring the potential of this
novel learning paradigm for other tasks such as object detection is part of future work.
Conclusion: In this work we advocate the novel use of attributes as a mode of com-
munication for the human supervisor to provide an actively learning machine classifier
feedback when it predicts an incorrect label for an image. This feedback allows the
classifier to propagate category labels to many more images in the unlabeled dataset
besides just the individual query images that are actively selected. This in turn allows
it to learn visual concepts faster, saving significant user time and effort. We employ a
straightforward approach to incorporate this attributes-based feedback into discriminatively
trained classifiers. We demonstrate the power of this feedback on a variety of
visual recognition applications including image classification and annotation on scenes
and faces. What is most exciting about attributes is their ability to allow for communi-
cation between humans and machines. This work takes a step towards exploiting this
channel to build smarter machines more efficiently.
References
1. Smith, J., Naphade, M., Natsev, A.: Multimedia semantic indexing using model vectors. In:
ICME (2003)
2. Rasiwasia, N., Moreno, P., Vasconcelos, N.: Bridging the gap: Query by semantic example.
IEEE Transactions on Multimedia (2007)
3. Naphade, M., Smith, J., Tesic, J., Chang, S., Hsu, W., Kennedy, L., Hauptmann, A., Curtis,
J.: Large-scale concept ontology for multimedia. IEEE Multimedia (2006)
4. Zavesky, E., Chang, S.F.: Cuzero: Embracing the frontier of interactive visual search for
informed users. In: Proceedings of ACM Multimedia Information Retrieval (2008)
5. Douze, M., Ramisa, A., Schmid, C.: Combining attributes and fisher vectors for efficient
image retrieval. In: CVPR (2011)
6. Wang, X., Liu, K., Tang, X.: Query-specific visual semantic spaces for web image re-ranking.
In: CVPR (2011)
7. Lampert, C., Nickisch, H., Harmeling, S.: Learning to detect unseen object classes by
between-class attribute transfer. In: CVPR (2009)
8. Parikh, D., Grauman, K.: Relative attributes. In: ICCV (2011)
9. Farhadi, A., Endres, I., Hoiem, D., Forsyth, D.: Describing objects by their attributes. In:
CVPR (2009)
10. Kumar, N., Berg, A., Belhumeur, P., Nayar, S.: Attribute and simile classifiers for face veri-
fication. In: ICCV (2009)
11. Berg, T.L., Berg, A.C., Shih, J.: Automatic Attribute Discovery and Characterization from
Noisy Web Data. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part I.
LNCS, vol. 6311, pp. 663–676. Springer, Heidelberg (2010)
12. Wang, J., Markert, K., Everingham, M.: Learning models for object recognition from natural
language descriptions. In: BMVC (2009)
13. Wang, G., Forsyth, D.: Joint learning of visual attributes, object classes and visual saliency.
In: ICCV (2009)
14. Wang, Y., Mori, G.: A Discriminative Latent Model of Object Classes and Attributes. In:
Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part V. LNCS, vol. 6315,
pp. 155–168. Springer, Heidelberg (2010)
15. Ferrari, V., Zisserman, A.: Learning visual attributes. In: NIPS (2007)
16. Branson, S., Wah, C., Schroff, F., Babenko, B., Welinder, P., Perona, P., Belongie, S.: Visual
Recognition with Humans in the Loop. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.)
ECCV 2010, Part IV. LNCS, vol. 6314, pp. 438–451. Springer, Heidelberg (2010)
17. Fergus, R., Bernal, H., Weiss, Y., Torralba, A.: Semantic Label Sharing for Learning with
Many Categories. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part I.
LNCS, vol. 6311, pp. 762–775. Springer, Heidelberg (2010)
18. Wang, G., Forsyth, D., Hoiem, D.: Comparative object similarity for improved recognition
with few or no examples. In: CVPR (2010)
19. Mahajan, D., Sellamanickam, S., Nair, V.: A joint learning framework for attribute models
and object descriptions. In: ICCV (2011)
20. Kovashka, A., Vijayanarasimhan, S., Grauman, K.: Actively selecting annotations among
objects and attributes. In: ICCV (2011)
21. Wah, C., Branson, S., Perona, P., Belongie, S.: Multiclass recognition and part localization
with humans in the loop. In: ICCV (2011)
22. Branson, S., Wah, C., Schroff, F., Babenko, B., Welinder, P., Perona, P., Belongie, S.: Visual
Recognition with Humans in the Loop. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.)
ECCV 2010, Part IV. LNCS, vol. 6314, pp. 438–451. Springer, Heidelberg (2010)
23. Parikh, D., Grauman, K.: Interactively building a discriminative vocabulary of nameable
attributes. In: CVPR (2011)
24. Bourdev, L., Maji, S., Malik, J.: Describing people: A poselet-based approach to attribute
classification. In: ICCV (2011)
25. Russell, B., Torralba, A., Murphy, K., Freeman, W.: LabelMe: a database and web-based tool
for image annotation. IJCV (2008)
26. Farhadi, A., Endres, I., Hoiem, D.: Attribute-centric recognition for cross-category general-
ization. In: CVPR (2010)
27. Bourdev, L., Malik, J.: Poselets: Body part detectors trained using 3d human pose annota-
tions. In: ICCV (2009)
28. Vijayanarasimhan, S., Grauman, K.: Multi-level active prediction of useful image annota-
tions for recognition. In: NIPS (2008)
29. Siddiquie, B., Gupta, A.: Beyond active noun tagging: Modeling contextual interactions for
multi-class active learning. In: CVPR (2010)
30. Raghavan, H., Madani, O., Jones, R.: Interactive feature selection. IJCAI (2005)
31. Druck, G., Settles, B., McCallum, A.: Active learning by labeling features. In: EMNLP
(2009)
32. Zaidan, O., Eisner, J., Piatko, C.: Using annotator rationales to improve machine learning for
text categorization. In: NAACL - HLT (2007)
33. Donahue, J., Grauman, K.: Annotator rationales for visual recognition. In: ICCV (2011)
34. Felzenszwalb, P., Girshick, R., McAllester, D., Ramanan, D.: Object detection with discrim-
inatively trained part based models. PAMI (2010)
35. Golland, P.: Discriminative direction for kernel classifiers. In: NIPS (2001)
36. Shrivastava, A., Singh, S., Gupta, A.: Constrained Semi-Supervised Learning Using At-
tributes and Comparative Attributes. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y.,
Schmid, C. (eds.) ECCV 2012, Part III. LNCS, vol. 7574, pp. 370–384. Springer, Heidelberg
(2012)
37. Joachims, T.: Optimizing search engines using clickthrough data. In: KDD (2002)
38. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A Large-Scale
Hierarchical Image Database. In: CVPR (2009)
39. Oliva, A., Torralba, A.: Modeling the shape of the scene: A holistic representation of the
spatial envelope. IJCV (2001)
Constrained Semi-Supervised Learning
Using Attributes and Comparative Attributes
A. Shrivastava, S. Singh, and A. Gupta
1 Introduction
How do we exploit the sea of visual data available online? Most supervised computer
vision approaches are still impeded by their dependence on manual labeling, which, for
rapidly growing datasets, requires an incredible amount of manpower. The popularity
of Amazon Mechanical Turk and other online collaborative annotation efforts [1, 2] has
eased the process of gathering more labeled data, but it is still unclear whether such
an approach can scale up with the available data. This is exacerbated by the heavy-
tailed distribution of objects in the natural world [3]: a large number of objects occur so
sparsely that it would require a significant amount of labeling to build reliable models.
In addition, human labeling has a practical limitation in that it suffers from semantic and
functional bias. For example, humans might label an image of Christ/Cross as "Church"
due to high-level semantic connections between the two concepts.
An alternative way to exploit a large amount of unlabeled data is semi-supervised
learning (SSL). A classic example is the bootstrapping method: start with a small
number of labeled examples, train initial models using those examples, then use the
initial models to label the unlabeled data. The model is retrained using the confident
self-labeled examples in addition to original examples. However, most semi-supervised
approaches, including bootstrapping, have often exhibited low and unacceptable accu-
racy because the limited number of initially labeled examples are insufficient to con-
strain the learning process. This often leads to the well known problem of semantic
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 369–383, 2012.
© Springer-Verlag Berlin Heidelberg 2012
370 A. Shrivastava, S. Singh, and A. Gupta
[Figure 1: selected candidates under standard bootstrapping vs. constrained bootstrapping for the amphitheatre and auditorium classifiers; the evaluated constraints include binary attributes (Indoor, Has Seat Rows) and the comparative attribute Has More Circular-Structure.]
Fig. 1. Standard Bootstrapping vs. Constrained Bootstrapping: We propose to learn multiple clas-
sifiers (auditorium and amphitheater) jointly by exploiting similarities (e.g., both are indoor and
have seating) and dissimilarities (e.g., amphitheater has more circular structures than auditorium)
between the two. We show that joint learning constrains the SSL, thus avoiding semantic drift.
drift [4], where newly added examples tend to stray away from the original meaning
of the concept. The problem of semantic drift is more evident in the field of visual
categorization because intra-class variation is often greater than inter-class variation.
For example, electric trains resemble buses more than steam engines and many
auditoriums appear very similar to amphitheaters.
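The classic bootstrapping loop described above can be sketched generically; the classifier interface, the toy 1-D model and the confidence threshold are assumptions made only so the sketch runs:

```python
class ToyModel:
    """1-D threshold classifier used only to make the sketch runnable."""
    def __init__(self, pairs):
        pos = [x for x, y in pairs if y == 1]
        neg = [x for x, y in pairs if y == 0]
        self.mid = (min(pos) + max(neg)) / 2.0
    def predict(self, x):
        label = 1 if x > self.mid else 0
        conf = min(1.0, abs(x - self.mid))   # distance as a crude confidence
        return label, conf

def bootstrap(train_fn, labeled, unlabeled, rounds=3, threshold=0.9):
    """Generic self-training loop: train on the labeled pool, label the
    unlabeled pool, promote confident self-labeled examples, retrain."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    model = train_fn(labeled)
    for _ in range(rounds):
        confident, uncertain = [], []
        for x in unlabeled:
            label, conf = model.predict(x)
            (confident if conf >= threshold else uncertain).append((x, label))
        labeled += confident                  # promote self-labeled examples
        unlabeled = [x for x, _ in uncertain]
        model = train_fn(labeled)             # retrain on the enlarged pool
    return model, labeled

model, pool = bootstrap(ToyModel, [(0.0, 0), (4.0, 1)], [0.5, 3.5, 2.1])
print(len(pool), model.mid)  # the two confident points were promoted
```

Because each classifier promotes its own confident mistakes along with correct examples, errors compound over rounds, which is exactly the semantic drift the constraints below are designed to prevent.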
This paper shows that we can avoid semantic drift and significantly improve the performance
of the bootstrapping approach by imposing additional constraints. We build upon
recent work in information extraction [5], and propose a novel semi-supervised image
classification framework where instead of each classifier selecting its own set of images,
we jointly select images for each classifier by enforcing different types of constraints.
We show that coupling scene categories via attributes and comparative attributes¹ provides
us with the constraints necessary for a robust bootstrapping framework. For example,
consider the case shown in Figure 1. In the case of the naïve bootstrapping
approach, the initial amphitheater and auditorium classifiers select self-labeled im-
ages independently which leads to incorrect instances being selected (outlined in red).
However, if we couple the scene categories and jointly label all the images in the dataset,
then we can use the auditorium images to clean the amphitheater images, since the lat-
ter should have more circular structures compared to the former. We demonstrate that
the joint labeling indeed makes the data selection robust and improves the performance
¹ Comparative attributes are special forms of attributes that are used to compare and express relationships between two nouns. For example, banquet halls are bigger than bedrooms; amphitheaters are more circular than auditoriums.
Constrained SSL Using Attributes and Comparative Attributes 371
significantly. While we only explore the application of these new constraints to boot-
strapping approaches, we believe they are generic and can be applied to other semi-
supervised approaches as well.
Contributions: We present a framework for coupled bootstrap learning and explore its
application to the field of image classification. The input to our system is an ontology
which defines the set of target categories to be learned, the relationships between those
categories and a handful of initial labeled examples. We show that given these initial la-
beled examples and millions of unlabeled images, our approach can obtain much higher
accuracy by coupling the labeling of all the images using multiple classifiers. The key
contributions of our paper are: (a) a semi-supervised image classification framework
which jointly learns multiple classifiers, (b) demonstrating that sharing information
across categories via attributes [6, 7] is crucial to constrain the semi-supervised learning
problem, (c) extending the notion of sharing across categories and showing that infor-
mation sharing can also be achieved by capturing dissimilarities between categories.
Here we build upon the recent work on relative attributes [8] and comparative adjec-
tives [9] to capture differences across categories. In this paper, the relationships be-
tween attributes and scene categories are defined by a human annotator. As opposed to
image-level labeling, attribute and comparative attribute relationships are significantly
cheaper to annotate as they scale with the number of categories (not with the number of
images). Note that the attributes need not be semantic at all [10]. However, the general
benefit of using semantic attributes is that they are human-communicable [11] and we
can obtain them automatically using other sources of data such as text [12].
2 Prior Work
During the past decade, computer vision has seen some major successes due to the in-
creasing amount of data on the web. While using big data is a promising direction, it
is still unclear how we should exploit such a large amount of data. There is a spectrum
of approaches based on the amount of human labeling required to use this data. On
one end of the spectrum are supervised approaches that use as much hand-labeled data
as possible. These approaches have focused on using the power of crowds to generate
hand-labeled training data [1, 2]. Recent works have also focused on active learning
[13–17, 11], to minimize human effort by selecting label requests that are most
informative. On the other end of the spectrum are completely unsupervised approaches,
which use no human supervision and rely on clustering techniques to discover image
categories [18, 19].
In this work, we explore the intermediate range of the spectrum; the domain of semi-
supervised approaches. Semi-supervised learning (SSL) techniques use a small amount
of labeled data in conjunction with a large amount of unlabeled data to learn reliable
and robust visual models. There is a large literature on semi-supervised techniques. For
brevity, we only discuss closely related works and refer the reader to recent survey on
the subject [20]. The most commonly used semi-supervised approach is the bootstrap-
ping method, also known as self-training. However, bootstrapping typically suffers
from semantic drift [4]; that is, after many iterations, errors in labeling tend to accumulate.
To avoid semantic drift, researchers have focused on several approaches such as
Our goal is to use the initial set of labeled seed examples (L) and a large unlabeled
dataset (U) to learn robust image classifiers. Our method iteratively trains classifiers
in a self-supervised manner. It starts by training classifiers using a small amount of
labeled data and then uses these classifiers to label unlabeled data. The most confident
new labels are promoted and added to the pool of data used to train the models, and
the process repeats. The key difference from the standard bootstrapping approach is the
set of constraints that restrict which data points are promoted to the pool of labeled data.
In this work, we focus on learning scene classifiers for image classification. We rep-
resent these classifiers as functions (f : X → Y) which, given input image features x,
predict some label y. Instead of learning these classifiers separately, we propose an ap-
proach which learns these classifiers jointly. Our central contribution is the formulation
of constraints in the domain of image classification. Specifically, we exploit the recently
proposed attribute-based approaches [6, 7] to provide another view of the same data and
enforce multi-view agreement constraints. We also build upon the recent framework of
comparative adjectives [9] to formulate pair-wise labeling constraints. Finally, we use
introspection to perform an additional step of self-refinement to weed out false positives
included in the training set. We describe all the constraints below.
learned functions to satisfy these. One such output constraint is the mutual exclusion
constraint (ME). In mutual exclusion, positive classification by one classifier imme-
diately implies negative classification for the other classifiers. For example, an image
classified as "restaurant" can be used as a negative example for "barn", "bridge", etc.
Current semi-supervised approaches enforce mutual exclusion by learning a multi-
class classifier where a positive example of one class is automatically treated as a nega-
tive example for all other classes. However, the multi-class classifier formulation is too
rigid for a learning algorithm. Consider, for example, "banquet hall" and "restaurant",
which are very similar and likely to be confused by the classifier. For such classes, the
initial classifier learned from a few seed examples is not reliable enough; hence, adding
the mutual exclusion constraint causes the classifier to overfit.
We propose an adaptive mutual exclusion constraint. The basic idea is that during
initial iterations, we do not want to enforce mutual exclusion between similar classes
(y1 and y2 ), since this is likely to confuse the classifier. Therefore, we relax the ME
constraint for manually annotated similar classes: a candidate added to the pool of one
is not used as a negative example for the other. However, after a few iterations, we adapt
our mutual exclusion constraints and enforce these constraints across similar classes as
well.
E(Y) = Σ_i φ(x_i, y_i) + Σ_{(i,j)} ψ(x_i, x_j, y_i, y_j)    (1)

where φ(x_i, y_i) is the unary node potential for image i with features x_i and its candidate label y_i, and ψ(x_i, x_j, y_i, y_j) is the edge potential for the labels y_i and y_j of the pair of images i and j. Note that y_i denotes the label assigned to image i, which can take assignments in {l_1, ..., l_n}.
"L #
"U #
The unary term φ(x_i, y_i) is the confidence in assigning label y_i to image i given by the combination of scene and attribute classifier scores. Specifically, if f_j(x_i) is the raw score of the j-th scene classifier on x_i and f^{a_k}(x_i) is the raw score of the k-th attribute classifier on x_i, then the unary potential is given by:

φ(x_i, y_i = l_j) = σ(f_j(x_i)) + λ_1 Σ_{a_k ∈ A} (−1)^{1_{x_i,y_i}(a_k)} γ(f^{a_k}(x_i))    (2)

The first term takes the raw scene classifier scores and converts them into potentials using the sigmoid function (σ(t) = exp(t)/(1 + exp(t))) [26]. The second term uses a weighted voting scheme for agreement between scene and attribute classifiers (λ_1 is the normalization factor). Here, A is the set of binary attributes, and 1_{x_i,y_i}(a_k) is an indicator function which is 1 if the attribute a_k is detected in image i but a_k is not a property of scene class y_i (and vice versa). γ(·) denotes the confidence in prediction for the attribute classifier². Intuitively, this second term votes positive if both the attribute classifier and scene classifier agree in terms of class-attribute relationships. Otherwise, it votes negative, where the vote is weighted by the confidence of prediction.
² The confidence in prediction is defined as: γ(f^{a_k}(x_i)) = max(σ(f^{a_k}(x_i)), 1 − σ(f^{a_k}(x_i))).
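A sketch of how this unary term could be computed, following our reading of Eq. (2); the attribute names, raw scores and the value of λ_1 are assumptions:

```python
import math

def sigmoid(t):
    """sigma(t) = exp(t) / (1 + exp(t)), as in the text."""
    return math.exp(t) / (1.0 + math.exp(t))

def confidence(score):
    """gamma: confidence of an attribute prediction (footnote 2)."""
    s = sigmoid(score)
    return max(s, 1.0 - s)

def unary_potential(scene_score, attribute_scores, class_has_attribute, lam=0.5):
    """Sigmoid of the scene-classifier score plus a weighted attribute vote:
    an attribute votes +confidence when its detection agrees with the class
    description and -confidence otherwise. lam is an assumed value."""
    vote = 0.0
    for a, score in attribute_scores.items():
        detected = score > 0                             # binary attribute decision
        disagrees = detected != class_has_attribute[a]   # the indicator function
        vote += (-1.0 if disagrees else 1.0) * confidence(score)
    return sigmoid(scene_score) + lam * vote

attrs = {"open": 2.0, "natural": 1.5}                    # hypothetical raw scores
u_agree = unary_potential(1.0, attrs, {"open": True, "natural": True})
u_disagree = unary_potential(1.0, attrs, {"open": False, "natural": False})
print(u_agree > u_disagree)  # agreement with the class description scores higher
```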
The binary term ψ(x_i, x_j, y_i, y_j) between images i and j with labels y_i and y_j represents comparative-attribute relations between labeled classes:

ψ(x_i, x_j, y_i, y_j) = Σ_{c_k ∈ C} 1_{c_k}(y_i, y_j) log(σ(f^{c_k}(x_i, x_j)))    (3)

where C denotes the set of comparative attributes, 1_{c_k}(y_i, y_j) indicates whether a given comparative attribute c_k exists between the pair of classes y_i and y_j, and f^{c_k}(x_i, x_j) is the score of the comparative-attribute classifier for the pair of images (x_i, x_j). Intuitively, this term boosts the labels y_i and y_j if a comparative attribute c_k scores high on pairwise features. For example, if instances i, j are labeled "conference-room" and "bedroom", their scores get boosted if the comparative attribute "has more space" scores high on pairwise features (since conference rooms have more space than bedrooms).
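A sketch of this pairwise term, following our reading of Eq. (3); the class names, comparative attribute and scores are illustrative:

```python
import math

def log_sigmoid(t):
    return math.log(math.exp(t) / (1.0 + math.exp(t)))

def pairwise_potential(yi, yj, pair_scores, relations):
    """Sum, over the comparative attributes the ontology asserts between
    classes yi and yj, of the log-sigmoid of the comparative-attribute
    classifier score on the image pair."""
    return sum(log_sigmoid(pair_scores[c]) for c in relations.get((yi, yj), []))

relations = {("conference-room", "bedroom"): ["has-more-space"]}
strong = pairwise_potential("conference-room", "bedroom",
                            {"has-more-space": 3.0}, relations)   # supports labels
weak = pairwise_potential("conference-room", "bedroom",
                          {"has-more-space": -3.0}, relations)    # contradicts labels
print(strong > weak)  # the supported labeling receives the higher potential
```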
Promoting Instances: Typically in the semi-supervised problem, |U| ranges from tens
of thousands to millions of images. Estimating the most likely label for each image in U
necessitates minimizing Eq. (1), which is computationally intractable in general. Since
our goal is to find a few very confident images to add to the labeled set L, we do not
need to minimize Eq.(1) over the entire U. Instead, we follow the standard practice
of pruning the image nodes which have low probability of being classified as one of
the n scene classes. Specifically, we evaluate the unary term (), that represents the
confidence of assigning label to an image, for the entire U and use it to prune out and
keep only the top-N candidate images for each class. In our experiments, we set N to
3 times the number of instances to be promoted.
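The pruning step above can be sketched as follows; a minimal NumPy sketch, with function and argument names of our own choosing (the paper does not fix an interface):

```python
import numpy as np

def prune_candidates(unary_scores, num_promote, factor=3):
    """Keep only the top-N unlabeled images per class, ranked by the
    unary (scene + attribute agreement) confidence score.

    unary_scores: (num_unlabeled, num_classes) array, higher = more confident.
    Returns a dict: class index -> indices of its top-N candidate images,
    with N = factor * num_promote (the paper uses factor = 3).
    """
    top_n = factor * num_promote
    candidates = {}
    for c in range(unary_scores.shape[1]):
        order = np.argsort(-unary_scores[:, c])  # descending by confidence
        candidates[c] = order[:top_n]
    return candidates
```

Only these reduced candidate sets enter the (otherwise intractable) inference over Eq. (1).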
While pruning image nodes reduces the search space, exact inference still remains
intractable. However, approximate inference techniques like loopy belief propagation
or Gibbs sampling can be used to find the most likely assignments. In this work, we
compute the marginals at each node by running one iteration of loopy belief propagation
on the reduced graph. This approximate inference gives us the confidence of candidate
class labeling for each image incorporating scene, attribute and comparative-attribute
constraints. Now we select the C most confidently labeled images for each class (U^l), add
them to the labeled set ((L ∪ U^l) → L), remove them from the unlabeled set ((U \ U^l) → U), and re-train our classifiers.
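Assuming per-image marginals from the loopy-BP step, the promotion bookkeeping might look like this (names and data layout are illustrative, not from the paper):

```python
def promote_instances(marginals, candidates, labeled, unlabeled, C=5):
    """One promotion step: move the C most confidently labeled candidate
    images per class from the unlabeled set U to the labeled set L.

    marginals: dict image_idx -> (class_idx, confidence), from inference.
    candidates: dict class_idx -> candidate image indices (after pruning).
    labeled:   dict image_idx -> class_idx (the set L, updated in place).
    unlabeled: set of image indices (the set U, updated in place).
    """
    for c, cand in candidates.items():
        # rank this class's candidates by marginal confidence
        scored = [(i, marginals[i][1]) for i in cand
                  if i in unlabeled and marginals[i][0] == c]
        scored.sort(key=lambda t: -t[1])
        for i, _ in scored[:C]:          # U^l: top-C per class
            labeled[i] = c               # L <- L ∪ U^l
            unlabeled.discard(i)         # U <- U \ U^l
    return labeled, unlabeled
```

After each promotion step the scene classifiers are re-trained on the enlarged L.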
To train comparative attributes, we follow the approach proposed in [9] and train classifiers over
differential features (x_i − x_j). We train the comparative-attribute classifiers using ground-truth
pairs of images that follow such relationships; for the negative data we use random
pairs of images and inverse relationships. We used a boosted decision tree classifier
with 20 trees and 4 internal nodes.
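A minimal sketch of assembling the training set over differential features; the exact negative-sampling details beyond "random pairs and inverse relationships" are our assumption:

```python
import numpy as np

def make_comparative_training_set(feats, pos_pairs, rng):
    """Build training data for one comparative-attribute classifier over
    differential features (x_i - x_j), following [9].

    feats: (num_images, d) feature matrix.
    pos_pairs: list of (i, j) where image i exhibits MORE of the attribute
               than image j (ground-truth pairs).
    Negatives: the inverse pairs (j, i) plus an equal number of random pairs.
    """
    X_pos = np.array([feats[i] - feats[j] for i, j in pos_pairs])
    X_inv = -X_pos  # inverse relationship: x_j - x_i
    rand_pairs = [(int(rng.integers(len(feats))), int(rng.integers(len(feats))))
                  for _ in pos_pairs]
    X_rand = np.array([feats[i] - feats[j] for i, j in rand_pairs])
    X = np.vstack([X_pos, X_inv, X_rand])
    y = np.hstack([np.ones(len(X_pos)), np.zeros(2 * len(X_pos))])
    return X, y  # feed to a boosted decision tree (20 trees in the paper)
```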
5 Experiments
We now present experimental results to demonstrate the effectiveness of constraints in
a bootstrapping-based approach. We first present a detailed experimental analysis of our
approach using the fully labeled SUN dataset [30]. Using a completely labeled dataset
allows us to evaluate the quality of unlabeled images being added to our classifiers
and how it affects the performance of our system. Finally, we evaluate our complete
system on a large scale dataset which uses approximately 1 million unlabeled images
to improve the performance of scene classifiers. For all the experiments we use a fixed
vocabulary of scene classes, attributes and comparative attributes as described below.
Vocabulary: We evaluate the performance of our coupled bootstrapping approach in
learning 15 scene categories³ randomly chosen from the SUN dataset. These classes
are: auditorium, amphitheater, banquet hall, barn, bedroom, bowling alley, bridge, casino
indoor, cemetery, church outdoor, coast, conference room, desert sand, field cultivated
and restaurant. We used 19 attribute classes: horizon visible, indoor, has water, has
building, has seat rows, has people, has grass, has clutter, has chairs, is man-made,
eating place, fun place, made of stone, meeting place, livable, part of house, relaxing,
animal-related, crowd-related. We used 10 comparative attributes: is more open, has
more space (indoor), has more space (outdoor), has more seating space, has larger struc-
tures, has horizontally longer structures, has taller structures, has more water, has more
sand and has more greenery. The relationships between scene categories and attributes
were defined by a human annotator (see the website for the list of relationships).
Baselines: The goal of this paper is to show the importance of additional constraints in
semi-supervised domain. We believe these constraints should improve the performance
irrespective of the choice of a particular approach. Therefore, as a baseline, we com-
pare the performance of our constrained bootstrapping approach with two versions of
standard bootstrapping approach: one uses independent multiple binary classifiers and
the other uses multi-class scene classifiers.
For our first set of experiments, we also compare our constrained bootstrapping
framework to the state-of-the-art SSL technique for image classification based on eigen-
functions [24]. Following the experimental settings from [24], we map the GIST descrip-
tor for each image down to a 32D space using PCA and use k = 64 eigenfunctions,
with the two remaining Laplacian parameters set to 50 and 0.2 (see [24] for details).
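The PCA mapping used for this baseline can be sketched as follows (the function name and SVD route are ours; [24] fixes only the 32D target dimension):

```python
import numpy as np

def pca_reduce(X, d=32):
    """Project descriptors (one per row) onto their top-d principal
    components, as in the eigenfunction baseline setup with d = 32."""
    Xc = X - X.mean(axis=0)
    # right singular vectors of the centered data = principal directions
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T
```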
Evaluation Metric: We evaluate the performance of our approach using two metrics.
First, we evaluate the performance of our trained classifier in terms of Average
Precision (AP) at each iteration on a held-out test dataset. Second, for the small-scale
experiments (Sections 5.1 and 5.2), we also evaluate the purity of the promoted instances in
terms of the fraction of correctly labeled images.

³ We expect our approach to scale and perform better with more categories: increasing the num-
ber of categories will add more constraints and enforce extensive sharing, which is exactly
what our approach exploits.

Fig. 3. Qualitative Results (Amphitheatre and Bridge): We demonstrate how Binary Attribute
constraints (middle row, BA Constraints) and Comparative Attribute constraints (bottom row,
Our Approach: BA+CA+ME) help us promote better instances as compared to naive
Bootstrapping (top row)
We first evaluate the performance of our approach on the SUN dataset. We train the 15
scene classifiers using 2 images each (the labeled set). We want to observe the effect of
these constraints when the attribute and comparative-attribute classifiers are not re-
trained at each iteration. Therefore, in this case, we used fixed pre-trained attribute
classifiers and relative-attribute classifiers. These classifiers were trained on 25 exam-
ples each (from a held-out dataset). Our unlabeled dataset consists of 18,000 images
from the SUN dataset. Out of these 18K images, 8.5K images are from these 15 categories
and the remainder are randomly sampled from the rest of the dataset. At each iteration,
we add 5 images per category from the unlabeled dataset to the classifier.
Figure 3 shows examples of how each constraint helps to select better instances that
should be added to the classifier. The bootstrapping approach clearly faces semantic
drift, as it adds bridge, coastal and park images to the amphitheater classi-
fier. It is the presence of binary attributes such as has water and has greenery that
help us to reject these bad candidates.

Fig. 4. Qualitative results showing selected candidates for our approach (Banquet Hall, Conference
Room and Church Outdoor) at iterations 1, 10, 40 and 90. Notice that during the final iterations,
there are errors such as bedroom images being added to conference rooms.

While binary attributes do help to prune a lot of bad instances, they sometimes promote bad instances like the cemetery image
(3rd in 2nd row). However, comparative attributes help us clean such instances. For ex-
ample, the cemetery image is rejected since it has less circular structure. Similarly,
the church image is rejected since it does not have the long horizontal structures
compared to other bridge images. Interestingly, our approach does not overfit to the
seed examples and can indeed cover a greater diversity, thus increasing recall. For ex-
ample, in Figure 4, the seed examples for banquet hall include close-up views of tables, but
as iterations proceed we incorporate distant views of banquet halls (e.g., 3rd image in
iterations 1 and 40).
Next, we quantitatively evaluate the importance of each constraint in terms of per-
formance on held-out test data. Figure 5(a) shows the performance of our approach
with different combinations of constraints. Our system shows significant improvement
in performance by adding attribute-based constraints. In fact, using a randomly chosen
set of ten attributes (ME+10BA) seems to provide enough constraints to avoid seman-
tic drift. Adding another nine attributes to the system does provide some improvement
during the initial iterations (ME+19BA). However, at later stages, the effect of these at-
tributes saturates. Finally, we evaluate the importance of comparative attributes by com-
paring the performance of our system with and without the CA constraint. Adding the CA
constraint does provide a significant boost (6–7%) in performance. Figure 5(b) shows
the comparison of our full system with other baseline approaches. Our approach shows
significant improvement over all the baselines which include self-learning approaches
based on independent binary classifiers, self-learning based on multi-class classifier
and Eigen-functions [24]. We also show the upper-bound on the performance of our
approach, which is achieved if all the unlabeled data is manually labeled and used to
train the scene classifiers. Since the binary attribute classifiers are pre-trained on some
other data, it would be interesting to analyze the performance of these classifiers alone
(without scene classifiers). Our approach performs significantly better than using
attributes alone. This indicates that coupling does provide constraints and helps in
better labeling of unlabeled data. We also compare the performance of our full system
in terms of purity of added instances (see Figure 5(c)).

Fig. 5. Quantitative Evaluations (mean Average Precision vs. number of iterations): (a) We first
evaluate the importance of each constraint in our system using control experiments (ME, ME+10BA,
ME+19BA, ME+19BA+CA (No Introspection), Our Approach, and BA only (no scene)). (b) and (c)
show the comparison of our approach against standard baselines in terms of performance on
held-out test data and purity of added instances, respectively.

Fig. 6. Selected candidates for the baseline and our approach at iterations 1 and 60
Fig. 7. Quantitative Evaluations: We evaluate our approach against standard baselines (self-learning
with binary classifiers and self-learning with a multi-class classifier) in terms of (a) mean AP over
all scene classes and (b) purity of added instances, both plotted against the number of iterations.
baselines. Notice that our approach outperforms all the baselines significantly even
though in this case we used the same seed examples to train the attribute classifier. This
shows that attribute classifiers can pool information from multiple classes to help avoid
semantic drift.
Acknowledgments. The authors would like to thank Tom Mitchell, Martial Hebert,
Varun Ramakrishna and Debadeepta Dey for many helpful discussions. This work was
supported by Google, ONR-MURI N000141010934 and ONR Grant N000141010766.
References
1. von Ahn, L., Dabbish, L.: Labeling images with a computer game. In: SIGCHI Conference
on Human Factors in Computing Systems (2004)
2. Russell, B.C., Torralba, A., Murphy, K.P., Freeman, W.T.: LabelMe: A database and web-
based tool for image annotation. IJCV (2009)
3. Torralba, A., Fergus, R., Freeman, W.T.: 80 million tiny images: a large database for non-
parametric object and scene recognition. PAMI (2008)
4. Curran, J.R., Murphy, T., Scholz, B.: Minimising semantic drift with mutual exclusion boot-
strapping. In: Conference of the Pacific Association for Computational Linguistics (2007)
5. Carlson, A., Betteridge, J., Hruschka Jr., E.R., Mitchell, T.M.: Coupling semi-supervised
learning of categories and relations. In: NAACL HLT Workshop on SSL for NLP (2009)
6. Farhadi, A., Endres, I., Hoiem, D., Forsyth, D.: Describing objects by their attributes. In:
CVPR (2009)
7. Lampert, C.H., Nickisch, H., Harmeling, S.: Learning to detect unseen object classes by
between-class attribute transfer. In: CVPR (2009)
8. Parikh, D., Grauman, K.: Relative attributes. In: ICCV (November 2011)
9. Gupta, A., Davis, L.S.: Beyond Nouns: Exploiting Prepositions and Comparative Adjectives
for Learning Visual Classifiers. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008,
Part I. LNCS, vol. 5302, pp. 16–29. Springer, Heidelberg (2008)
10. Rastegari, M., Farhadi, A., Forsyth, D.: Attribute Discovery via Predictable Discrimina-
tive Binary Codes. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.)
ECCV 2012, Part VI. LNCS, vol. 7577, pp. 876–889. Springer, Heidelberg (2012)
11. Parkash, A., Parikh, D.: Attributes for Classifier Feedback. In: Fitzgibbon, A., Lazebnik, S.,
Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012, Part III. LNCS, vol. 7574, pp. 355–369.
Springer, Heidelberg (2012)
12. Carlson, A., Betteridge, J., Kisiel, B., Settles, B., Hruschka Jr., E.R., Mitchell, T.M.: Toward
an architecture for never-ending language learning. In: AAAI (2010)
13. Vijayanarasimhan, S., Grauman, K.: Large-scale live active learning: Training object detec-
tors with crawled data and crowds. In: CVPR (2011)
14. Kapoor, A., Grauman, K., Urtasun, R., Darrell, T.: Active learning with gaussian processes
for object categorization. In: ICCV (2007)
15. Qi, G.J., Hua, X.S., Rui, Y., Tang, J., Zhang, H.J.: Two-dimensional active learning for image
classification. In: CVPR (2008)
16. Joshi, A., Porikli, F., Papanikolopoulos, N.: Multi-class active learning for image classifica-
tion. In: CVPR (2009)
17. Siddiquie, B., Gupta, A.: Beyond active noun tagging: Modeling contextual interactions for
multi-class active learning. In: CVPR (2010)
18. Russell, B., Freeman, W., Efros, A., Sivic, J., Zisserman, A.: Using multiple segmentations
to discover objects and their extent in image collections. In: CVPR (2006)
19. Kang, H., Hebert, M., Kanade, T.: Discovering object instances from scenes of daily living.
In: ICCV (2011)
20. Zhu, X.: Semi-supervised learning literature survey. Technical report, CS, UW-Madison
(2005)
21. Riloff, E., Jones, R.: Learning dictionaries for information extraction by multi-level boot-
strapping. In: AAAI (1999)
22. Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: COLT
(1998)
23. Ebert, S., Larlus, D., Schiele, B.: Extracting Structures in Image Collections for Object
Recognition. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part I. LNCS,
vol. 6311, pp. 720–733. Springer, Heidelberg (2010)
24. Fergus, R., Weiss, Y., Torralba, A.: Semi-supervised learning in gigantic image collections.
In: NIPS (2009)
25. Guillaumin, M., Verbeek, J., Schmid, C.: Multimodal semi-supervised learning for image
classification. In: CVPR (2010)
26. Tighe, J., Lazebnik, S.: Understanding scenes on many levels. In: ICCV (2011)
27. Hoiem, D., Efros, A.A., Hebert, M.: Recovering surface layout from an image. IJCV (2007)
28. Hays, J., Efros, A.A.: Scene completion using millions of photographs. ACM Transactions
on Graphics, SIGGRAPH (2007)
29. Oliva, A., Torralba, A.: Building the gist of a scene: the role of global image features in
recognition. Progress in Brain Research (2006)
30. Xiao, J., Hays, J., Ehinger, K., Oliva, A., Torralba, A.: Sun database: Large scale scene
recognition from abbey to zoo. In: CVPR (2010)
31. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierar-
chical image database. In: CVPR (2009)
Renormalization Returns:
Hyper-renormalization and Its Applications
1 Introduction
One of the most fundamental tasks of computer vision is to compute 2-D and
3-D shapes of objects from noisy observations by using geometric constraints.
Many problems are formulated as follows. We observe N vector data x_1, ..., x_N,
whose true values x̄_1, ..., x̄_N are supposed to satisfy a geometric constraint in
the form

F(x; θ) = 0,    (1)

where θ is an unknown parameter vector which we want to estimate. We call
this type of problem simply geometric estimation. In traditional domains of
statistics such as agriculture, pharmaceutics, and economics, observations are
regarded as repeated samples from a parameterized probability density model
p_θ(x); the task is to estimate the parameter θ. We call this type of problem
simply statistical estimation, for which the minimization principle has been
a major tool: one chooses the value that minimizes a specified cost. The best
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 384–397, 2012.
© Springer-Verlag Berlin Heidelberg 2012
known approach is maximum likelihood (ML), which minimizes the negative log-
likelihood l = −Σ_{α=1}^{N} log p_θ(x_α). Recently, an alternative approach is more and
more in use: one directly solves specified equations, called estimating equations
[5], of the form g(x_1, ..., x_N, θ) = 0. This approach can be viewed as an ex-
tension of the minimization principle; ML corresponds to g(x_1, ..., x_N, θ) = ∇_θ l,
known as the score. However, the estimating equations need not be the gradient
of any function, and one can modify g(x_1, ..., x_N, θ) as one likes so that the re-
sulting solution has desirable properties (unbiasedness, consistency, efficiency,
etc.). In this sense, the estimating equation approach is more general and flexi-
ble, having the possibility of providing a better solution than the minimization
principle.
In the domain of computer vision, the minimization principle, in particular
reprojection error minimization, is the norm and is called the Gold Standard [6].
A notable exception is the renormalization of Kanatani [7,8]: instead of minimizing
some cost, it iteratively removes the bias of weighted least squares (LS). Renor-
malization attracted much attention because it exhibited higher accuracy than
any other then-known method. However, questions were repeatedly raised as to
what it minimizes, perhaps out of the deep-rooted preconception that optimal
estimation should minimize something. One answer was given by Chojnacki et
al., who proposed in [4] an iterative scheme similar to renormalization, which
they called FNS (Fundamental Numerical Scheme), for minimizing what is now
referred to as the Sampson error [6]. They argued in [3] that renormalization
can be rationalized if viewed as approximately minimizing the Sampson error.
Leedan and Meer [14] and Matei and Meer [15] also proposed a different Sampson
error minimization scheme, which they called HEIV (Heteroscedastic Errors-in-
Variables). Kanatani and Sugaya [13] pointed out that the reprojection error
can be minimized by repeated applications of Sampson error minimization: the
Sampson error is iteratively modified so that it agrees with the reprojection error
in the end. However, reprojection error minimization, which is ML statistically,
still has some bias. Kanatani [9] analytically evaluated the bias of the FNS solu-
tion and subtracted it; he called this scheme hyperaccurate correction. Okatani
and Deguchi [16,17] removed the bias of ML of particular types by analyzing
the relationship between the bias and the hypersurface defined by the constraint
[16] and by using the method of projected scores [17].
We note that renormalization is similar to the estimating equation approach
for statistical estimation in the sense that it directly specifies the equations to solve,
which for geometric estimation take the form of a generalized eigenvalue problem.
In this paper, we show that by doing high-order error analysis using
the perturbation technique of Kanatani [10] and Al-Sharadqah and Chernov
[1], renormalization can achieve higher accuracy than reprojection error mini-
mization and is comparable to bias-corrected ML. We call the resulting scheme
hyper-renormalization.
In Sec. 2, we summarize the fundamentals of geometric estimation. In
Sec. 3, we describe the iterative reweight, the most primitive form of the non-
minimization approach. In Sec. 4, we reformulate Kanatani's renormalization as
386 K. Kanatani et al.
2 Geometric Estimation

Equation (1) is a general nonlinear equation in x. In many practical problems,
we can reparameterize the problem to make F(x; θ) linear in θ (but nonlinear
in x), allowing us to write Eq. (1) as

(ξ(x), θ) = 0,    (2)

where and hereafter (a, b) denotes the inner product of vectors a and b. The
vector ξ(x) is some nonlinear mapping of x from R^m to R^n, where m and n
are the dimensions of the data x and the parameter θ, respectively. Since the
vector θ in Eq. (2) has scale indeterminacy, we normalize it to unit norm:
‖θ‖ = 1.
Example 1 (Ellipse fitting). Given a point sequence (x_α, y_α), α = 1, ..., N,
we wish to fit an ellipse of the form
If we let
where F is a matrix of rank 2 called the fundamental matrix, from which we can
compute the camera positions and the 3-D structure of the scene [6,8]. If we let
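The displayed ellipse equation did not survive extraction here; assuming the standard representation from Kanatani's ellipse-fitting work, Ax² + 2Bxy + Cy² + 2f₀(Dx + Ey) + f₀²F = 0 with θ = (A, B, C, D, E, F), the mapping ξ(x) can be sketched as:

```python
import numpy as np

def xi(x, y, f0=1.0):
    """Map a point (x, y) to the 6-vector ξ(x) so that (ξ(x), θ) = 0 encodes
    the ellipse equation Ax² + 2Bxy + Cy² + 2f0(Dx + Ey) + f0²F = 0 with
    θ = (A, B, C, D, E, F), ‖θ‖ = 1. The constant f0, chosen at roughly the
    magnitude of the data, keeps the components comparable in scale."""
    return np.array([x * x, 2.0 * x * y, y * y,
                     2.0 * f0 * x, 2.0 * f0 * y, f0 * f0])
```

For example, with f0 = 1 and θ ∝ (1, 0, 1, 0, 0, −1), the constraint (ξ(x), θ) = 0 is exactly x² + y² − 1 = 0, the unit circle.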
3 Iterative Reweight

The oldest method that is not based on minimization is the following iterative
reweight:
1. Let W_α = 1, α = 1, ..., N, and θ_0 = 0.
2. Compute the following matrix M:

M = (1/N) Σ_{α=1}^{N} W_α ξ_α ξ_α^T.    (11)
The motivation of this method is the weighted least squares that minimizes

(1/N) Σ_{α=1}^{N} W_α (ξ_α, θ)² = (θ, ((1/N) Σ_{α=1}^{N} W_α ξ_α ξ_α^T) θ) = (θ, Mθ).    (13)

This is minimized by the unit eigenvector of the matrix M for the smallest
eigenvalue. As is well known in statistics, the optimal choice of the weight W_α
is the inverse of the variance of that term. Since (ξ_α, θ) = (Δ₁ξ_α, θ) + ⋯, the
leading term of the variance is

E[(Δ₁ξ_α, θ)²] = (θ, E[Δ₁ξ_α Δ₁ξ_α^T] θ) = σ²(θ, V₀[ξ_α] θ).    (14)

Hence, we should choose W_α = 1/(θ, V₀[ξ_α]θ), but θ is not known. So, we do
iterations, determining the weight W_α from the value of θ in the preceding
step. Let us call the first value of θ computed with W_α = 1 simply the initial
solution. It minimizes Σ_{α=1}^{N} (ξ_α, θ)², corresponding to what is known as least
squares (LS), algebraic distance minimization, and many other names [6]. Thus,
iterative reweight is an iterative improvement of the LS solution.
It appears at first sight that the above procedure minimizes

J = (1/N) Σ_{α=1}^{N} (ξ_α, θ)² / (θ, V₀[ξ_α]θ),    (15)

which is known today as the Sampson error [6]. However, iterative reweight
does not minimize it, because at each step we are computing the value of θ that
minimizes the numerator part with the denominator part fixed. Hence, at the
time of the convergence, the resulting solution θ is such that

(1/N) Σ_{α=1}^{N} (ξ_α, θ)² / (θ, V₀[ξ_α]θ)  ≤  (1/N) Σ_{α=1}^{N} (ξ_α, θ′)² / (θ, V₀[ξ_α]θ)    for any θ′.    (16)
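A minimal NumPy sketch of iterative reweight as described above, assuming the usual loop structure (re-solve the eigenproblem for M, update W_α from the new θ, and repeat until convergence):

```python
import numpy as np

def iterative_reweight(xis, V0s, iters=100, tol=1e-6):
    """Iterative reweight: start from LS (W_a = 1), then repeatedly set
    W_a = 1/(θ, V0[ξ_a]θ) and re-solve the eigenproblem of Eq. (11).

    xis: (N, n) array of rows ξ_a;  V0s: (N, n, n) covariances V0[ξ_a].
    Returns the unit-norm θ (eigenvector of M for the smallest eigenvalue)."""
    N = len(xis)
    W = np.ones(N)
    theta0 = np.zeros(xis.shape[1])
    for _ in range(iters):
        M = np.einsum('a,ai,aj->ij', W, xis, xis) / N
        theta = np.linalg.eigh(M)[1][:, 0]   # smallest-eigenvalue eigenvector
        if theta @ theta0 < 0:               # resolve the sign ambiguity
            theta = -theta
        if np.linalg.norm(theta - theta0) < tol:
            break
        theta0 = theta
        W = 1.0 / np.einsum('i,aij,j->a', theta, V0s, theta)
    return theta
```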
4 Renormalization

Kanatani's renormalization [7,8] can be described as follows:

M = (1/N) Σ_{α=1}^{N} W_α ξ_α ξ_α^T,    N = (1/N) Σ_{α=1}^{N} W_α V₀[ξ_α].    (18)

This has a different appearance from the procedure described in [7], in which the
generalized eigenvalue problem is reduced to the standard eigenvalue problem,
but the resulting solution is the same [8]. The motivation of renormalization is
as follows. Let M̄ be the true value of the matrix M in Eq. (18) defined by the
true values ξ̄_α. Since (ξ̄_α, θ) = 0, we have M̄θ = 0. Hence, θ is the eigenvector
of M̄ for eigenvalue 0. But M̄ is unknown, so we estimate it. Since E[Δ₁ξ_α] =
0, the expectation of M is to a first approximation

E[M] = E[(1/N) Σ_{α=1}^{N} W_α (ξ̄_α + Δ₁ξ_α)(ξ̄_α + Δ₁ξ_α)^T] = M̄ + (1/N) Σ_{α=1}^{N} W_α E[Δ₁ξ_α Δ₁ξ_α^T]
     = M̄ + (σ²/N) Σ_{α=1}^{N} W_α V₀[ξ_α] = M̄ + σ²N.    (20)
However small it may be, the bias is not exactly 0. The error analysis in [10]
shows that the bias expression involves the matrix N. Our strategy is to optimize
the matrix N in Eq. (18) to N = (1/N) Σ_{α=1}^{N} W_α V₀[ξ_α] + ⋯ so that the bias
is zero up to high-order error terms.
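A sketch of one renormalization iteration, under the simplifying assumption that N is invertible (in practice V₀[ξ_α] is often rank deficient, and implementations solve an equivalent reformulated eigenproblem instead):

```python
import numpy as np

def renormalization_step(xis, V0s, theta):
    """One renormalization iteration: with weights from the current θ, build
    M and N of Eq. (18) and solve Mθ = λNθ, keeping the eigenvector of the
    smallest |λ| — this cancels the σ²N bias of E[M] identified in Eq. (20).
    Assumes N is nonsingular so that Mθ = λNθ reduces to N⁻¹Mθ = λθ."""
    n = len(xis)
    W = 1.0 / np.einsum('i,aij,j->a', theta, V0s, theta)
    M = np.einsum('a,ai,aj->ij', W, xis, xis) / n
    Nmat = np.einsum('a,aij->ij', W, V0s) / n
    lam, vecs = np.linalg.eig(np.linalg.solve(Nmat, M))
    k = int(np.argmin(np.abs(lam)))
    t = np.real(vecs[:, k])
    return t / np.linalg.norm(t)
```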
5 Error Analysis

Substituting Eq. (7) into the definition of the matrix M in Eq. (18), we can
expand it in the form

M = M̄ + Δ₁M + Δ₂M + ⋯,    (21)

Δ₁M = (1/N) Σ_{α=1}^{N} W_α (ξ̄_α Δ₁ξ_α^T + Δ₁ξ_α ξ̄_α^T) + (1/N) Σ_{α=1}^{N} Δ₁W_α ξ̄_α ξ̄_α^T,    (22)

Δ₂M = (1/N) Σ_{α=1}^{N} W_α (Δ₁ξ_α Δ₁ξ_α^T + ξ̄_α Δ₂ξ_α^T + Δ₂ξ_α ξ̄_α^T)
     + (1/N) Σ_{α=1}^{N} Δ₁W_α (ξ̄_α Δ₁ξ_α^T + Δ₁ξ_α ξ̄_α^T) + (1/N) Σ_{α=1}^{N} Δ₂W_α ξ̄_α ξ̄_α^T.    (23)

(M̄ + Δ₁M + Δ₂M + ⋯)(θ + Δ₁θ + Δ₂θ + ⋯)
     = (Δ₁λ + Δ₂λ + ⋯)(N̄ + Δ₁N + Δ₂N + ⋯)(θ + Δ₁θ + Δ₂θ + ⋯).    (26)

M̄Δ₁θ + Δ₁Mθ = Δ₁λ N̄θ,    (27)
M̄Δ₂θ + Δ₁MΔ₁θ + Δ₂Mθ = Δ₂λ N̄θ.    (28)
Computing the inner product of Eq. (27) with θ on both sides, we have

(θ, M̄Δ₁θ) + (θ, Δ₁Mθ) = Δ₁λ(θ, N̄θ),    (29)

Δ₂λ = (θ, Tθ) / (θ, N̄θ).    (35)

Hence, Eq. (34) is rewritten as follows:

Δ₂⊥θ = M̄⁻( ((θ, Tθ)/(θ, N̄θ)) N̄ − T )θ.    (36)
6 Hyper-renormalization

From Eq. (30), we see that E[Δ₁θ] = 0: the first-order bias is 0. Thus, the bias
is evaluated by the second-order term E[Δ₂⊥θ]. From Eq. (36), we obtain

E[Δ₂⊥θ] = M̄⁻( ((θ, E[T]θ)/(θ, N̄θ)) N̄ − E[T] )θ,    (37)
which implies that if we can choose such an N that its noiseless value N̄ satisfies
E[T] = cN̄ for some constant c, we will have E[Δ₂⊥θ] = 0. Then, the bias
will be O(σ⁴), because the expectation of odd-order error terms is zero. After a
lengthy analysis (we omit the details), we find that E[T] = σ²N̄ holds if we
define

N̄ = (1/N) Σ_{α=1}^{N} W_α (V₀[ξ_α] + 2S[ξ̄_α e_α^T])
     − (1/N²) Σ_{α=1}^{N} W_α² ((ξ̄_α, M̄⁻ξ̄_α) V₀[ξ_α] + 2S[V₀[ξ_α] M̄⁻ ξ̄_α ξ̄_α^T]),    (38)
M = (1/N) Σ_{α=1}^{N} W_α ξ_α ξ_α^T,    (40)

N = (1/N) Σ_{α=1}^{N} W_α (V₀[ξ_α] + 2S[ξ_α e_α^T])
     − (1/N²) Σ_{α=1}^{N} W_α² ((ξ_α, M⁻_{n−1}ξ_α) V₀[ξ_α] + 2S[V₀[ξ_α] M⁻_{n−1} ξ_α ξ_α^T]),    (41)

where M⁻_{n−1} denotes the pseudoinverse of M of truncated rank n − 1.
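Eq. (41) involves the pseudoinverse of M truncated to rank n − 1 (written M⁻_{n−1}); assuming the standard rank-truncated pseudoinverse of the HyperLS literature, a minimal sketch:

```python
import numpy as np

def truncated_pinv(M, rank):
    """Pseudoinverse of a symmetric matrix truncated to the given rank:
    invert only the `rank` largest eigenvalues and zero out the rest.
    Eq. (41) uses rank = n - 1, since M approaches a rank-(n-1) matrix
    (its null direction is the sought θ) as the noise goes to zero."""
    vals, vecs = np.linalg.eigh(M)            # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:rank]       # indices of the largest ones
    Minv = np.zeros_like(M)
    for k in idx:
        Minv += np.outer(vecs[:, k], vecs[:, k]) / vals[k]
    return Minv
```

Truncation keeps the near-singular direction of M from amplifying noise when M⁻_{n−1} appears inside N.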
It turns out that the initial solution with W_α = 1 coincides with what is called
HyperLS [11,12,18], which is derived to remove the bias up to second-order error
terms within the framework of algebraic methods without iterations. The expres-
sion of Eq. (41) with W_α = 1 lacks one term as compared with the corresponding
expression of HyperLS, but the same solution is produced. We omit the details,
but all the intermediate solutions in the hyper-renormalization iterations are
Fig. 1. (a) Thirty points on an ellipse. (b), (c) Fitted ellipses (σ = 0.5). 1. LS, 2.
iterative reweight, 3. the Taubin method, 4. renormalization, 5. HyperLS, 6. hyper-
renormalization, 7. ML, 8. ML with hyperaccurate correction. The dotted lines indicate
the true shape.
10000 independent noise instances for each σ and evaluated the bias B (Fig. 2(b))
and the RMS (root-mean-square) error D (Fig. 2(c)) defined by

B = ‖ (1/10000) Σ_{a=1}^{10000} Δ⊥θ^(a) ‖,    D = √( (1/10000) Σ_{a=1}^{10000} ‖Δ⊥θ^(a)‖² ),    (44)

where θ^(a) is the solution in the a-th trial and Δ⊥θ^(a) is its component
orthogonal to the true value θ̄ (see Fig. 2(a)). The dotted line in Fig. 2(c) indicates
the KCR lower bound [2,8,10]

D_KCR = σ √( tr(M̄⁻) / N ),    (45)
N
where M is the pseudoinverse of the true value M (of rank 5) of the M in
Eqs. (11), (18), and (40), and tr stands for the trace. The interrupted plots
in Fig. 2(b) for iterative reweight, ML, and ML with hyperaccurate correction
indicate that the iterations did not converge beyond that noise level. Our con-
vergence criterion is 0 < 106 for the current value and the value 0
in the preceding iteration; their signs are adjusted before subtraction. If this cri-
terion is not satised after 100 iterations, we stopped. For each , we regarded
the iterations as nonconvergent if any among the 10000 trials did not converge.
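The error metrics of Eq. (44) and the sign-adjusted convergence test described above can be sketched as follows (the orthogonal-component convention follows Fig. 2(a)):

```python
import numpy as np

def bias_and_rms(thetas, theta_true):
    """Bias B and RMS error D of Eq. (44): both are computed from the
    component of each trial's error orthogonal to the true unit vector,
    after resolving the θ ↔ −θ sign ambiguity."""
    P = np.eye(len(theta_true)) - np.outer(theta_true, theta_true)
    aligned = np.array([t if t @ theta_true >= 0 else -t for t in thetas])
    d_perp = aligned @ P          # Δ⊥θ^(a), one row per trial a
    B = np.linalg.norm(d_perp.mean(axis=0))
    D = np.sqrt((np.sum(d_perp ** 2, axis=1)).mean())
    return B, D

def converged(theta, theta0, tol=1e-6):
    """Sign-adjusted convergence test: θ and −θ represent the same solution,
    so align the signs before comparing ‖θ − θ0‖ against the tolerance."""
    if np.dot(theta, theta0) < 0:
        theta = -np.asarray(theta)
    return np.linalg.norm(np.asarray(theta) - np.asarray(theta0)) < tol
```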
Figure 3 shows the enlargements of Figs. 2(b), (c) for the small-σ part.
We can see from Fig. 2(b) that LS and iterative reweight have a very large bias,
in contrast to which the bias of the Taubin method and renormalization is very
small. The bias of HyperLS and hyper-renormalization is smaller still, even
smaller than that of ML. Since the leading covariance is common to iterative reweight,
renormalization, and hyper-renormalization, the RMS error reflects the magnitude of
the bias, as shown in Fig. 2(c). Because the hyper-renormalization solution does
not have bias up to high-order error terms, it has nearly the same accuracy as
ML. A close examination of the small-σ part (Fig. 3(b)) reveals that hyper-
renormalization outperforms ML. The highest accuracy is achieved, although
the difference is very small, by Kanatani's hyperaccurate correction of ML [9].
Fig. 2. (a) The true value θ̄, the computed value θ, and the orthogonal component Δ⊥θ
of θ. (b), (c) The bias (b) and the RMS error (c) of the fitted ellipse for the standard
deviation σ of the added noise over 10000 independent trials. 1. LS, 2. iterative reweight,
3. the Taubin method, 4. renormalization, 5. HyperLS, 6. hyper-renormalization, 7. ML,
8. ML with hyperaccurate correction. The dotted line in (c) indicates the KCR lower
bound.

Fig. 3. (a), (b) Enlargements of Figs. 2(b), (c) for the small-σ part.
However, it first requires the ML solution, and the FNS iterations for its com-
putation may not converge above a certain noise level, as shown in Figs. 2(b),
(c). In contrast, hyper-renormalization is very robust to noise, because the initial
solution is HyperLS, which is itself highly accurate, as shown in Figs. 2 and 3. For
this reason, we conclude that it is the best method for practical computations.
Figure 4(a) is an edge image of a scene with a circular object. We fitted an
ellipse to the 160 edge points indicated in red, using various methods. Figure 4(b)
shows the fitted ellipses superimposed on the original image, where the occluded
part is also artificially composed for visual ease. In this case, iterative reweight
converged after four iterations, and renormalization and hyper-renormalization
converged after three iterations, while FNS for ML computation required six
iterations. We can see that LS and iterative reweight produce much smaller
ellipses than the true shape, as in Fig. 1(b), (c). All other fits are very close to
the true ellipse, and ML gives the best fit in this particular instance.
Fig. 4. (a) An edge image of a scene with a circular object. An ellipse is fitted to the
160 edge points indicated in red. (b) Fitted ellipses superimposed on the original image.
The occluded part is artificially composed for visual ease. 1. LS, 2. iterative reweight,
3. the Taubin method, 4. renormalization, 5. HyperLS, 6. hyper-renormalization, 7.
ML, 8. ML with hyperaccurate correction.
8 Conclusions

We have reformulated iterative reweight and renormalization as geometric esti-
mation techniques not based on the minimization principle and optimized the
matrix N that appears in the renormalization computation so that the resulting
solution has no bias up to the second-order noise terms. We called the resulting
scheme hyper-renormalization and applied it to ellipse fitting. We observed:
1. Iterative reweight is an iterative improvement of LS. The leading covariance
of the solution agrees with the KCR lower bound, but the bias is very large,
hence the accuracy is low.
2. Renormalization [7,8] is an iterative improvement of the Taubin method [19].
The leading covariance of the solution agrees with the KCR lower bound,
and the bias is very small, hence the accuracy is high.
3. Hyper-renormalization is an iterative improvement of HyperLS [11,12,18].
The leading covariance of the solution agrees with the KCR lower bound,
with no bias up to high-order error terms. Its accuracy outperforms ML.
4. Although the difference is very small, ML with hyperaccurate correction
exhibits the highest accuracy, but the iterations for its computation may
not converge in the presence of large noise, while hyper-renormalization is
robust to noise.
We conclude that hyper-renormalization is the best method for practical com-
putations.
References
1. Al-Sharadqah, A., Chernov, N.: A doubly optimal ellipse fit. Comp. Stat. Data
Anal. 56(9), 2771–2781 (2012)
2. Chernov, N., Lesort, C.: Statistical efficiency of curve fitting algorithms. Comp.
Stat. Data Anal. 47(4), 713–728 (2004)
3. Chojnacki, W., Brooks, M.J., van den Hengel, A.: Rationalising the renormalization
method of Kanatani. J. Math. Imaging Vis. 21(11), 21–38 (2001)
4. Chojnacki, W., Brooks, M.J., van den Hengel, A., Gawley, D.: On the fitting of
surfaces to data with covariances. IEEE Trans. Patt. Anal. Mach. Intell. 22(11),
1294–1303 (2000)
5. Godambe, V.P. (ed.): Estimating Functions. Oxford University Press, New York
(1991)
6. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision.
Cambridge University Press, Cambridge (2000)
7. Kanatani, K.: Renormalization for unbiased estimation. In: Proc. 4th Int. Conf.
Comput. Vis., Berlin, Germany, pp. 599–606 (May 1993)
8. Kanatani, K.: Statistical Optimization for Geometric Computation: Theory and
Practice. Elsevier, Amsterdam (1996); reprinted, Dover, New York, U.S.A. (2005)
9. Kanatani, K.: Ellipse fitting with hyperaccuracy. IEICE Trans. Inf. & Syst.
E89-D(10), 2653–2660 (2006)
10. Kanatani, K.: Statistical optimization for geometric fitting: Theoretical accuracy
bound and high order error analysis. Int. J. Comput. Vis. 80(2), 167–188 (2008)
11. Kanatani, K., Rangarajan, P.: Hyper least squares fitting of circles and ellipses.
Comput. Stat. Data Anal. 55(6), 2197–2208 (2011)
12. Kanatani, K., Rangarajan, P., Sugaya, Y., Niitsuma, H.: HyperLS for parameter
estimation in geometric fitting. IPSJ Trans. Comput. Vis. Appl. 3, 80–94 (2011)
13. Kanatani, K., Sugaya, Y.: Unified computation of strict maximum likelihood for
geometric fitting. J. Math. Imaging Vis. 38(1), 1–13 (2010)
14. Leedan, Y., Meer, P.: Heteroscedastic regression in computer vision: Problems with
bilinear constraint. Int. J. Comput. Vis. 37(2), 127–150 (2000)
15. Matei, J., Meer, P.: Estimation of nonlinear errors-in-variables models for computer
vision applications. IEEE Trans. Patt. Anal. Mach. Intell. 28(10), 1537–1552 (2006)
16. Okatani, T., Deguchi, K.: On bias correction for geometric parameter estimation
in computer vision. In: Proc. IEEE Conf. Computer Vision Pattern Recognition,
Miami Beach, FL, U.S.A., pp. 959–966 (June 2009)
17. Okatani, T., Deguchi, K.: Improving accuracy of geometric parameter estimation
using projected score method. In: Proc. 12th Int. Conf. Computer Vision, Kyoto,
Japan, pp. 1733–1740 (September/October 2009)
18. Rangarajan, P., Kanatani, K.: Improved algebraic methods for circle fitting. Elec-
tronic J. Stat. 3, 1075–1082 (2009)
19. Taubin, G.: Estimation of planar curves, surfaces, and non-planar space curves
defined by implicit equations with applications to edge and range image segmen-
tation. IEEE Trans. Patt. Anal. Mach. Intell. 13(11), 1115–1138 (1991)
Scale Robust Multi View Stereo
1 Introduction
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 398–411, 2012.
© Springer-Verlag Berlin Heidelberg 2012
Fig. 1. Reconstructed 3D model of the Ulm Minster (German: Ulmer Münster) dataset
with very large resolution differences, created by our method. Note the huge scale
difference from the leftmost image (about 160 m height) to the rightmost image (about
3 cm).
the surface distance (Figure 3(a)), and not the most similar images are the best
choice. Furthermore, an algorithm should also consider whether there are scale
differences in the dataset, i.e., whether objects are seen in different images from
different distances at different scales; otherwise this usually leads to a loss of
reconstruction detail. Our approach provides solutions for both problems. For
image selection we extend the image selection approach of Goesele et al. [14].
Scale robustness is achieved by considering scale in all stages of this work:
besides image selection, these are multi depth map filtering, point selection,
point optimization, and meshing. Multi depth map filtering is an efficient,
GPU-parallelizable way to remove outliers by comparing depth maps against each
other, and point selection and optimization picks the presumed best points in
space and refines their positions using the worse, unselected points. This is only
possible if we carefully consider their assumed accuracy.
2 Related Work
One of the most constrained settings for reconstructing huge datasets is reconstruction from street-level video. Pollefeys et al. [15] demonstrated that this can
even be done in real time. Frahm et al. [16] developed an algorithm which is
able to deal with millions of images. However, their method appears primarily
suited to image preselection and calibration rather than to massive stereo
reconstruction, as they demonstrated the latter only locally for single objects.

Jancosek et al. [17] proposed a scalable patch-based method which can deal
with huge unstructured image datasets, but it is not scale robust and has only
a simple image selection heuristic. Goesele et al. [14] were probably the first
to propose a robust image selection method for huge unstructured datasets
with large scale differences. This allowed them to create good depth maps for
such datasets. However, they didn't consider point accuracy for meshing; thus,
the overall method is not scale robust. They addressed this problem in recent publications presenting scale robust meshing methods for depth
maps [18,19]. Furukawa et al. [20] clustered huge unstructured datasets into
clusters of similar visibility and scale. These clusters can then be independently
processed by simple Multi View Stereo algorithms without image selection, like
their previous work [12]. Afterwards, they refine the point clouds of different
clusters by comparing them against each other. Thereby, they also consider point
accuracies and remove points from clusters if a surface part is represented by
another cluster at clearly higher resolution. However, they don't demonstrate
meshing for their point clouds.
3 Overview
This section gives an overview of the proposed pipeline. Before our approach
can work with raw image datasets, they must first be calibrated in space.
We use the publicly available Bundler software [1] for this task. Bundler delivers
rotation matrices, translation vectors, distortion coefficients, and focal lengths
for all images that could successfully be calibrated, as well as a sparse set of
surface points. This data is the input to our pipeline, which is outlined in Figure 2.
We create depth maps from the perspective of each image by matching it to other
images that are selected as described in Section 4. The selection process prefers
images with similar scale, large view angles, and a big overlap in field of view with
the reference image. Additionally, we demand that overlaps and view angles are
not too similar across different selected images. Our depth map algorithm, which is
described in Section 5, is GPU based and can perform normal optimization similar
to region-growing-based approaches. It further filters out outliers. After creating
and independently filtering all depth maps, they are compared against each other,
and points in depth maps that have not enough support from other depth maps or
have too many visibility conflicts are filtered out by the approach in Section 6.
Subsequently, points are transformed into three-dimensional space and organized
into a kd-tree. Because depth maps are often redundant with one another, the
point cloud is very dense in some regions. Thus, the best points are selected, and
the rest of the points are only used to optimize the selected points in position, as
described in Section 7. Thereby, the redundancy can be used to remove surface
noise. Selected local outliers are also removed in the optimization process by
pulling them onto the surface. Finally, the mesh is created by a Delaunay graph
cut described in Section 8.
Fig. 3. a) Good image selection depends on the surface distance. Green indicates a
good image distance. b) Our angular weighting scheme reduces the influence of close-by views, here demonstrated on the Ulm Minster dataset.
4 Image Selection
In this section we present an image selection heuristic that considers scale and
robust stereo reconstruction. To find suitable images for stereo matching with
diversity, Goesele et al. [14] demand, if possible, view angles above 10 degrees
between all images, including the reference image to which the other images are
matched for depth map creation. However, works like [21] showed that 10
degrees is by far not enough for accurate stereo matching. Thus, we extend their
approach by demanding especially large view angles between the reference and the
other images. Furthermore, our approach prioritizes images in which regions of
the reference image are visible that are hardly or not at all visible in already
selected images.

To find suitable images I_s for stereo matching with a reference image R we use
sparse Structure from Motion feature points f, delivered by Bundler [1]. These
points already carry visibility information and are necessary because knowledge
of the surface distance is essential for a good selection (Figure 3(a)). To rate
an image, we use, similar to Goesele et al., a weighted sum of all feature points
visible in R and the tested image:¹

g_R(I) = Σ_{f ∈ F_R ∩ F_I} w_a(f) w_s(f) w_c(f)    (1)

where α_{X,Y}(f) is the view angle difference at point f for rays from the centers of
projection of images X and Y, and I_s is the set of already selected images. As I_s

¹ Please consult [14] for a detailed motivation for using a weighted sum.
increases over time, all images must be re-rated after each selection of the best
image. We mostly used maximum angles of 35° and 14°, x = 1.5, and y = 1. Our
coverage weighting, which favors feature points that are only sparsely covered
by already selected images, is given by:

w_c(f) = r_I(f) / ( r_I(f) + Σ_{J ∈ I_s, f ∈ P(J)} r_J(f) ) .    (3)

A simple choice for r_X that ignores scale would be r_X = 1. However, we don't
want images with good scales to be covered by images with bad scales, so we set it
to r_X(f) = min( s_R(f)² / s_X(f)², 1 ). This can be especially important at depth
discontinuities, where it might not be possible to select images with good scale
on both sides of the discontinuity. We also changed the scale weighting of Goesele
et al. to be stricter, as we don't want large angles with bad scale to be rated
better than small angles with good scale. If s_X(f) is the scale at f in an image
and r = s_R(f)/s_I(f), our scale weighting is calculated as:

           | 0           r > 1.8
w_s(f) =   | 1/r²        1 < r ≤ 1.8          (4)
           | 1           1/1.6 < r ≤ 1
           | (1.6 r)²    otherwise

As Figure 3(b) shows, the provided angles are often smaller than 35°. Still,
for x = 1.5 we get more than twice as many large angles as Goesele
et al. Furthermore, our approach also performs better for the accuracy weighting
max(1, 1/r). Overall, we obtain better view angles without sacrificing scale
quality.
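As an illustration, the selection weighting of Eqs. (1), (3), and (4) can be sketched as follows. The feature bookkeeping (which features are visible in which image, their scales, and the angle weight w_a) is assumed to be supplied by the SfM stage; the argument names are hypothetical:

```python
def w_s(r):
    """Scale weighting of Eq. (4), with r = s_R(f) / s_I(f)."""
    if r > 1.8:
        return 0.0
    if r > 1.0:
        return 1.0 / r ** 2
    if r > 1.0 / 1.6:
        return 1.0
    return (1.6 * r) ** 2

def w_c(r_ref, r_selected):
    """Coverage weighting of Eq. (3): r_ref is r_I(f) for the candidate,
    r_selected the list of r_J(f) over already selected images J seeing f."""
    return r_ref / (r_ref + sum(r_selected))

def rate_image(shared_features, wa, ws, wc):
    """Eq. (1): rating g_R(I) as a weighted sum over all features f
    visible in both the reference image R and the candidate image I."""
    return sum(wa[f] * ws[f] * wc[f] for f in shared_features)
```

After each greedy selection the best-rated image joins I_s, the coverage terms change, and all remaining candidates are re-rated, matching the loop described above.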
Thus, we use horizontal processing lines and the three propagation directions of
Figure 5(d). We call this downward propagation. Additionally, we perform upward
propagation at the same time. Up- and downward propagation is performed in
all odd propagation steps, left- and rightward propagation in all even steps.

Fig. 6. Difference to ground truth. Left: no normal tuning. Right: tuning with random
normals. Artifacts disappear nearly completely after the second propagation step.

After propagation, depth values are randomly shifted in depth, and the new
depth is kept if the matching error of the shifted depth is smaller. We perform
three shifts with uniform distributions with 16·d/f, 8·d/f, and 4·d/f maximum shift
distance, where d is the depth value and f the focal length of the image. Depth
values initially have no normals; hence, we assign normals with uniformly random
direction and surface angle to them if the matching error decreases:

where n_x and n_y are the gradients of the tangent plane in x and y direction. As
Figure 6 shows, random normals are sufficient, as the good ones get distributed
over the depth map in the next propagation step. Overall, we perform three propagation steps, as this usually almost leads to convergence; even with two steps we
mostly retrieved good results. Important for fast convergence on sloped surfaces
is the use of normal information to propagate in the tangent direction in propagation steps 2 and 3. Finally, we optimize the depth values and normals with a
The error function was occlusion robust in our tests, as it strongly prefers small
errors. It performed better than a common weighting even at non-occluded surface
parts.² For details about matching with normals, please consult [12].
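The random depth-shift refinement described above can be sketched as follows; `error_at` stands in for the (1 − NCC) matching error of the actual implementation and is a hypothetical callback:

```python
import random

def refine_depth(d, f, error_at, rng=None):
    """Try three uniform random depth shifts with shrinking maximum
    distance k*d/f (k = 16, 8, 4); a shifted depth is kept only if its
    matching error is smaller than that of the current depth."""
    rng = rng or random.Random(0)
    best_err = error_at(d)
    for k in (16, 8, 4):
        max_shift = k * d / f
        candidate = d + rng.uniform(-max_shift, max_shift)
        err = error_at(candidate)
        if err < best_err:
            d, best_err = candidate, err
    return d
```

By construction the error can only decrease, so the refined depth is never worse than the input depth.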
Depth Map Filtering. To filter erroneous values in the depth map, we discard
all depth values that have a (1 − NCC) matching error above a threshold in at
least n images:

n = max(c, min(3, I_v / 2)) ,    (7)

where I_v is the number of images that can see the point if there is no occlusion,
i.e., the point does not lie outside of the image area when projected into the image.
We use c = 2 in most of our tests and c = 1 for sparse datasets.
Fig. 7. Left: points that don't resemble a surface. Right: segmentation of our filter.

Additionally, we filter outliers if they don't resemble a surface (see Figure 7).
This is done by segmenting the images by depth discontinuities: if there is a
path from one point in the depth map to another point without crossing a depth
discontinuity above a threshold, both are in the same region. Thereafter, we
delete all regions with fewer than 15 points. The depth threshold is set to
2·d/f, where d is the current depth and f the focal length of the image.
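The segmentation filter just described can be sketched as a flood fill over the depth map; this is a simplified stand-in for the actual implementation:

```python
import numpy as np
from collections import deque

def segmentation_filter(depth, f, min_region=15):
    """Region-based outlier filter: a 4-connected flood fill groups
    pixels whose depth difference stays below the discontinuity
    threshold 2*d/f; regions with fewer than `min_region` pixels are
    invalidated (depth set to 0)."""
    h, w = depth.shape
    seen = np.zeros((h, w), dtype=bool)
    out = depth.copy()
    for sy in range(h):
        for sx in range(w):
            if seen[sy, sx] or depth[sy, sx] <= 0:
                continue
            region, queue = [], deque([(sy, sx)])
            seen[sy, sx] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w and not seen[ny, nx]
                            and depth[ny, nx] > 0
                            and abs(depth[ny, nx] - depth[y, x])
                                < 2.0 * depth[y, x] / f):
                        seen[ny, nx] = True
                        queue.append((ny, nx))
            if len(region) < min_region:
                for y, x in region:
                    out[y, x] = 0.0
    return out
```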
Fig. 8. a) Depth maps can be consistent (B), inconsistent (A, C), or show different
non-conflicting surfaces (not shown). b) Difficult scene for MVS, because of textureless
and transparent objects and bad camera calibrations. Left: even the Delaunay graph cut
cannot correct the unfiltered depth maps. Right: with our additional multi depth map
filtering we can recover a lot of surface parts.

Especially image selection makes the approach powerful, because the depth maps
it finds show not only the same surface parts as the reference depth map, but
are also very unlikely to have the same outliers, as they differ in view angle and
thus in the set of images they were created from.
Figure 8(a) shows the different cases that can occur when corresponding points
in two depth maps are compared to each other. In order to compare them, we
assign a support value to each depth value. A depth value gains support for
similarity (case B in Figure 8(a)) and loses support for visibility conflicts (cases
A and C). Hereby, we take only the difference of case A and C conflicts as
negative support into account, as in [25], because a point cannot be too near
and too far at the same time. The support for a depth value is defined as:

S_R(x, y) = Σ_{D∈D_s} B_{R,D}(x, y) − r_b · | Σ_{D∈D_s} A_{R,D}(x, y) − Σ_{D∈D_s} C_{R,D}(x, y) | ,    (8)

where A_{R,D}, B_{R,D}, and C_{R,D} are booleans that determine the case (A, B, or C)
between the reference depth map R and depth map D. r_b is a constant, which is
set to 2 in this work, because we think cases A and C are, despite taking the
difference, still more strongly affected by outliers in other depth maps than
case B. After all depth maps are processed, points which meet the condition
S_R(x, y) < 1 are filtered out.
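The support test of Eq. (8) then reduces to a few lines; the per-depth-map case indicators A, B, C are assumed to have been computed by the conflict tests:

```python
def support(B, A, C, rb=2.0):
    """Support S_R(x, y) of Eq. (8): B, A, C are iterables of 0/1 case
    indicators, one entry per compared depth map D."""
    return sum(B) - rb * abs(sum(A) - sum(C))

def keep_depth(B, A, C, rb=2.0):
    """A depth value survives multi depth map filtering iff S_R >= 1."""
    return support(B, A, C, rb) >= 1.0
```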
We test for conflicts similarly to [25], but when we render a depth map for case
C into the view of the reference depth map, we render each depth value into
the 4 closest pixels³ instead of only one, to avoid gaps (see Figure 9) that
otherwise occur strongly for large view angles and scale differences. There is a
conflict if a depth value of the rendered depth map is significantly smaller than
the value of the reference depth map.

³ Depth values can only be overwritten by smaller ones.
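The gap-free rendering can be sketched as a tiny splatting routine, a simplified stand-in for the actual GPU rendering:

```python
import math
import numpy as np

def splat(target, x, y, d):
    """Write depth d into the up to four pixels closest to the projected
    position (x, y); an existing depth is only overwritten by a smaller
    (closer) one, as required for the case-C conflict test."""
    h, w = target.shape
    for py in {math.floor(y), math.ceil(y)}:
        for px in {math.floor(x), math.ceil(x)}:
            if 0 <= py < h and 0 <= px < w and d < target[py, px]:
                target[py, px] = d
```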
Case B could also be decided like case C. However, we handle it like case
A because we think this is more accurate. This is done by transforming points
of the reference depth map into the view of another depth map and comparing
depth values there on the rays of the other view. Thereby, direct comparisons are
possible without rendering. For the acceptance threshold we consider, for case B,
only the accuracy of the reference depth map; for conflicts we take the worse
of both accuracies:

T_A = T_C = c · max(d_R(X), d_{D_i}(X)),    T_B = (c/2) · d_R(X) ,    (9)

where c is a constant we set to 1.6, X is the 3D representation of the tested
point in the reference depth map, R the reference depth map, D_i another depth
map, and d the sampling distance of a depth map at a point, i.e., the depth the
point would have in that depth map divided by the focal length of the depth map.
We do not test for conflicts if 1.6 · d_{D_i}(X) > d_R(X), to avoid border effects
at depth discontinuities caused by inaccurate borders in low-resolution depth maps.
After multi depth map filtering, we again filter depth maps with the segmentation
filter of Figure 7 to remove single surviving outliers.

The approach can also be used to filter depth values in small-scale depth
maps. For this, the selection algorithm can be modified to prefer depth maps
with 1.6 times or higher sampling rates than the reference depth map, and depth
values are filtered out if such a depth map is found for a pixel position. This
is not necessary for scale robustness, but speeds up the rest of our approach.
In practice, we use the filter but don't select extra high-resolution depth maps.
Fig. 9. When rendering a depth map into another view some pixels are covered multiple
times, others never. This can be avoided by rendering points into 4 pixels at once.
own influence radius. This is done by testing the points with the smallest I first.
Neighbors of a point in the same depth map may lie slightly inside the influence
radius and are ignored to keep the regular depth map structure. Primary points
then represent the surface approximately at the resolution of the best depth
map and are most likely the most accurate points. Nevertheless, they can still
be improved in position by using other points of similar scale to remove surface
noise. We use a modified moving least squares algorithm [26] for this:

p̂_{i+1} = ( Σ_{q∈P_i} w(||p_i − q||) (S_p/S_q)² q ) / ( Σ_{q∈P_i} w(||p_i − q||) (S_p/S_q)² )    (10)

n_{i+1} = ( Σ_{q∈P_i} w(||p_i − q||) (S_p/S_q)² n_q ) / || Σ_{q∈P_i} w(||p_i − q||) (S_p/S_q)² n_q ||    (11)

p_{i+1} = p_i + ( (p̂_{i+1} − p_i) · n_i ) n_i    (12)
p_0 = p is an unoptimized primary point; P_i contains all unmodified points that

Fig. 10. (a) Local outlier avoidance with iteration. (b) Left: unoptimized primary
points. Right: optimized primary points.

are within the radius 2·I_{p_i} and have a scale smaller than 1.6 times the scale
of p. According to our simulations, (S_p/S_q)² is the optimal weighting function
for all medium-free noise functions. Of course, this is not fulfilled between depth
maps of different scales, because coarse depth maps cannot contain information
about fine details; that is why we don't include inaccurate points in P_i. n_i is
the local normal. p_i is only allowed to move along the normal, not along
the tangent direction, for p_{i+1}. p_i is optimized to p_{i+1} until the process fails or
converges. If it fails, because it does not terminate after 20 iterations, |P_i| < 3,
or the point moves out of its original influence radius, the point is discarded. The
process converges if:
We use this function to improve the effect in Figure 10(a), pulling local outlier
points onto the surface. The often-used Gaussian function falls off too fast
outwards to take much advantage of this effect. To speed up the optimization
process we start two iterations for a point, one with c = 5 and one with c = 30.
Local outliers can slide between other points on the surface, so we discard primary
points if there are better primary points within 0.8× their influence radius after
optimization. Finally, we use Formula 10 with the color c_q instead of q to improve
the color of an optimized point. In Figure 10(b) the difference between unoptimized
and optimized primary points can be seen: details are kept while noise is removed.
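One iteration of the optimization in Eqs. (10)-(12) can be sketched as follows; the kd-tree neighbor search and the failure tests are omitted, and the argument layout is an assumption:

```python
import numpy as np

def mls_step(p, n, neighbors, w, Sp):
    """Scale-weighted moving-least-squares update of Eqs. (10)-(12).
    `neighbors` is a list of (q, n_q, S_q) tuples for the points in P_i,
    `w` the radial weight function, `Sp` the scale of p.  The weighted
    mean is projected onto the normal n, so p moves only along n."""
    p_acc = np.zeros(3)
    n_acc = np.zeros(3)
    w_sum = 0.0
    for q, nq, Sq in neighbors:
        wi = w(np.linalg.norm(p - q)) * (Sp / Sq) ** 2
        w_sum += wi
        p_acc += wi * q
        n_acc += wi * nq
    p_mean = p_acc / w_sum                      # Eq. (10): weighted mean position
    n_new = n_acc / np.linalg.norm(n_acc)       # Eq. (11): normalized mean normal
    p_new = p + np.dot(p_mean - p, n) * n       # Eq. (12): move along normal only
    return p_new, n_new
```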
However, noise removal is only possible if there are enough samples with
similar scale in the influence radius. Otherwise, we must increase I before point
selection and decouple it from S to avoid noisy surfaces. We estimate the noise
N as

1/N = Σ_{q∈P_i} w(||p_i − q||) (I_p/I_q)² .    (15)
8 Meshing
For meshing of our optimized primary points we use the Delaunay graph cut of
Labatut et al. [27], because it is robust to changes in point density and considers
visibility, which helps to reconstruct difficult surface parts. In their work they
also showed that their approach is very robust against outliers. We obtained
similar results with random noise and dense surface point clouds. However, as can
be seen in Figure 8(b), outlier avoidance in real datasets can be more difficult.

For their visibility constraint we consider not only the reference image from
which a point was created, but also add the reference image of each secondary
point to the nearest optimized primary point. Additionally, we bind their
constant to the scale of the primary point, so that ray accuracy depends
on the point accuracy and not on the ray length:

α_p = const · S_p .    (16)
9 Results
We tested our approach on several datasets; results can be seen in Figures 1, 11,
and 12. Our peak memory usage on the Ulm Minster dataset, caused mainly by the
Delaunay graph cut, was 12.8 GB for about 18 million triangles and 55 million
points for point selection and optimization. Point processing took about 20 minutes
on a single core, and meshing 1 hour. Compared to [18,19] our approach is clearly
faster and needs clearly less memory, as far as it is comparable across different
datasets. Furthermore, it is probably also more accurate, as we perform careful
continuous optimizations instead of just using voxels of different scales. Depth
map creation
Fig. 11. a) The upper row shows a reconstruction from a Ladybug video dataset.
Through short Ladybug videos it is possible to reconstruct huge scenes with our
approach. The lower row shows our sofa and garden datasets. Thanks to point
optimization, the sofa dataset is very smooth but nevertheless accurate at locations
where many depth maps overlap. b) Ulm Minster surface.
Fig. 12. a) Some differences with (bottom) and without (top) covering weighting on
the garden dataset. b) Columns. The rightmost reconstruction was strongly reduced in
size to demonstrate the high resolution. c) Some parts of the chapel were photographed
in especially high resolution compared to the other parts. Thus, there is a strong
resolution discontinuity at the border of the close reconstruction that is clearly visible
in the left image. The left image was, like the column in b), reduced in size. The full
reconstruction can be seen on the right. d) and e) Further results for the Ulm Minster
dataset. The left column in d) has only pixel size in Figure 11 b).
took on average 35 seconds, using 9 images each, for the 333 depth maps of
1863×1236 pixels of the Ulm Minster dataset. This is also very fast compared to
the run times of the region-growing-based approaches in [3]. Multi depth map
filtering took 192 seconds altogether; most of the time was needed for loading
depth maps from the hard disk. We got 6.8% more 3D points and 5.3% more mesh
points with our covering weighting on the garden dataset than without. On the
Ulm Minster dataset, which has fewer occlusions, it was 3.4%/1.8%. Figure 12(a)
shows some differences due to the weighting.
10 Conclusion
We presented a scale robust Multi View Stereo approach that can process arbitrary image datasets. Our core contribution for scale robustness is the
point selection and optimization method, which delivers refined point clouds that
can easily be processed by Delaunay-based meshing methods. Nevertheless, our
image selection method is also important for high quality depth maps, and, thanks
to image selection, the depth map based outlier filter is a powerful and fast
method to remove outliers.

Acknowledgments. This work has been partially funded by the DFG Emmy
Noether fellowship (Le 1341/1-1) and by an NVIDIA Professor Partnership
Award.
References
1. Snavely, N., Seitz, S.M., Szeliski, R.: Photo tourism: Exploring photo collections in
3D. In: SIGGRAPH Conference Proceedings, pp. 835–846. ACM Press, New York
(2006), http://phototour.cs.washington.edu/bundler/
2. Agarwal, S., Snavely, N., Simon, I., Seitz, S., Szeliski, R.: Building Rome in a day.
In: 2009 IEEE 12th International Conference on Computer Vision, pp. 72–79. IEEE
(2009)
3. Seitz, S., Curless, B., Diebel, J., Scharstein, D., Szeliski, R.: A comparison and
evaluation of multi-view stereo reconstruction algorithms. In: 2006 IEEE Computer
Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 519–528.
IEEE (2006), http://vision.middlebury.edu/mview/
4. Pons, J., Keriven, R., Faugeras, O.: Multi-view stereo reconstruction and scene
flow estimation with a global image-based matching score. International Journal
of Computer Vision 72, 179–193 (2007)
5. Vogiatzis, G., Hernandez Esteban, C., Torr, P.H.S., Cipolla, R.: Multiview stereo
via volumetric graph-cuts and occlusion robust photo-consistency. IEEE Trans.
Pattern Anal. Mach. Intell. 29, 2241–2246 (2007)
6. Kolev, K., Pock, T., Cremers, D.: Anisotropic Minimal Surfaces Integrating
Photo-consistency and Normal Information for Multiview Stereo. In: Daniilidis, K. (ed.)
ECCV 2010, Part III. LNCS, vol. 6313, pp. 538–551. Springer, Heidelberg (2010)
7. Sinha, S., Pollefeys, M.: Multi-view reconstruction using photo-consistency and
exact silhouette constraints: A maximum-flow formulation (2005)
8. Furukawa, Y., Ponce, J.: Carved Visual Hulls for Image-Based Modeling. In:
Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006, Part I. LNCS, vol. 3951,
pp. 564–577. Springer, Heidelberg (2006)
9. Song, P., Wu, X., Wang, M.: Volumetric stereo and silhouette fusion for image-based
modeling. The Visual Computer 26, 1435–1450 (2010)
10. Gallup, D., Frahm, J., Mordohai, P., Yang, Q., Pollefeys, M.: Real-time plane-sweeping
stereo with multiple sweeping directions. In: 2007 IEEE Conference on
Computer Vision and Pattern Recognition, pp. 1–8. IEEE (2007)
11. Campbell, N.D.F., Vogiatzis, G., Hernandez, C., Cipolla, R.: Using Multiple
Hypotheses to Improve Depth-Maps for Multi-View Stereo. In: Forsyth, D., Torr, P.,
Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 766–779. Springer,
Heidelberg (2008)
12. Furukawa, Y., Ponce, J.: Accurate, dense, and robust multiview stereopsis. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 1362–1376 (2009)
13. Hiep, V., Keriven, R., Labatut, P., Pons, J.: Towards high-resolution large-scale
multi-view stereo. In: IEEE Conference on Computer Vision and Pattern Recognition,
CVPR 2009, pp. 1430–1437. IEEE (2009)
14. Goesele, M., Snavely, N., Curless, B., Hoppe, H., Seitz, S.: Multi-view stereo for
community photo collections. In: Proc. ICCV, pp. 1–8. Citeseer (2007)
15. Pollefeys, M., Nister, D., Frahm, J., Akbarzadeh, A., Mordohai, P., Clipp, B.,
Engels, C., Gallup, D., Kim, S., Merrell, P., et al.: Detailed real-time urban 3D
reconstruction from video. International Journal of Computer Vision 78, 143–167
(2008)
16. Frahm, J.-M., Fite-Georgel, P., Gallup, D., Johnson, T., Raguram, R., Wu, C.,
Jen, Y.-H., Dunn, E., Clipp, B., Lazebnik, S., Pollefeys, M.: Building Rome on
a Cloudless Day. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010,
Part IV. LNCS, vol. 6314, pp. 368–381. Springer, Heidelberg (2010)
17. Jancosek, M., Shekhovtsov, A., Pajdla, T.: Scalable multi-view stereo. In: 2009
IEEE 12th International Conference on Computer Vision Workshops (ICCV
Workshops), pp. 1526–1533. IEEE (2009)
18. Mücke, P., Klowsky, R., Goesele, M.: Surface reconstruction from multi-resolution
sample points. In: Proceedings of Vision, Modeling, and Visualization, pp. 105–112
(2011)
19. Fuhrmann, S., Goesele, M.: Fusion of depth maps with multiple scales. ACM
Transactions on Graphics (TOG) 30, 148 (2011)
20. Furukawa, Y., Curless, B., Seitz, S., Szeliski, R.: Towards internet-scale multi-view
stereo. In: Proceedings of IEEE CVPR. IEEE (2010)
21. Rumpler, M., Irschara, A., Bischof, H.: Multi-view stereo: Redundancy benefits for
3D reconstruction. In: Proceedings of the 35th Workshop of the Austrian Association
for Pattern Recognition, AAPR/OAGM (2011)
22. Barnes, C., Shechtman, E., Finkelstein, A., Goldman, D.: PatchMatch: A randomized
correspondence algorithm for structural image editing. ACM Transactions on
Graphics (TOG) 28, 24 (2009)
23. Strecha, C., Von Hansen, W., Van Gool, L., Fua, P., Thoennessen, U.: On
benchmarking camera calibration and multi-view stereo for high resolution imagery
(2008), http://cvlab.epfl.ch/~strecha/multiview/denseMVS.html
24. Tola, E., Lepetit, V., Fua, P.: Daisy: An efficient dense descriptor applied to
wide-baseline stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence,
815–830 (2009)
25. Merrell, P., Akbarzadeh, A., Wang, L., Mordohai, P., Frahm, J., Yang, R., Nister,
D., Pollefeys, M.: Real-time visibility-based fusion of depth maps. In: IEEE 11th
Int. Conf. on Computer Vision, ICCV 2007, pp. 1–8. IEEE (2007)
26. Cuccuru, G., Gobbetti, E., Marton, F., Pajarola, R., Pintus, R.: Fast low-memory
streaming MLS reconstruction of point-sampled surfaces. In: Proceedings of
Graphics Interface 2009, pp. 15–22. Canadian Information Processing Society (2009)
27. Labatut, P., Pons, J., Keriven, R.: Robust and efficient surface reconstruction from
range data. Computer Graphics Forum 28, 2275–2290 (2009)
Laplacian Meshes
for Monocular 3D Shape Recovery
Jonas Ostlund, Aydin Varol, Dat Tien Ngo, and Pascal Fua
EPFL CVLab.
1015 Lausanne, Switzerland
1 Introduction
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 412–425, 2012.
© Springer-Verlag Berlin Heidelberg 2012
control vertices and inject this expression into the large system to produce the
small one we are looking for. Well-posedness stems from an implicit and novel
regularization term, which penalizes deviations away from the reference shape,
which does not have to be planar.

In other words, instead of performing a minimization involving many degrees
of freedom or solving a large system of equations, we end up simply solving
a very compact linear system. This yields an initial 3D shape estimate which
is not necessarily accurate, but whose 2D projections are. In practice, this is
extremely useful to quickly and reliably eliminate erroneous correspondences. We
can then achieve better accuracy than the method of [3], which has been shown to
outperform other recent ones, by enforcing the same inextensibility constraints.
This is done by optimizing with respect to far fewer variables, thereby lowering
the computational complexity, and without requiring training data as [3] does.
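As a hedged illustration (not the authors' exact formulation), recovering control-vertex coordinates from such a compact system amounts to a small regularized least-squares solve. Here M encodes the linear reprojection constraints, A the regularization operator, and c_ref the reference shape, all assumed given:

```python
import numpy as np

def recover_control_vertices(M, A, c_ref, lam=1e-2):
    """Minimize ||M c||^2 + lam * ||A (c - c_ref)||^2 over the control
    vertex vector c, via the normal equations of this small system."""
    AtA = A.T @ A
    H = M.T @ M + lam * AtA
    return np.linalg.solve(H, lam * AtA @ c_ref)
```

With no image constraints (M = 0) the regularizer dominates and the solution reduces to the reference shape, which matches the role of the regularization term described above.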
Furthermore, unlike other recent approaches that also rely on control
points [9, 17], our implicit regularization term is orientation invariant. As a
result, we do not have to pre-align the reference shape with the one to be
recovered or to precompute a rotation matrix. To explore this aspect, we replaced
our Laplacian approach to dimensionality reduction by a simpler one based on
the same linear deformation modes as those of [14, 11] and again enforced the
inextensibility constraints of [3]. We will show that, without pre-alignment, it
performs much worse. With manual pre-alignment it can yield the same accuracy,
but an automated version would necessitate an additional non-linear optimization
step. Furthermore, computing the deformation modes requires either
training data or a stiffness matrix, neither of which may be forthcoming.

In short, our contribution is a novel approach to reducing the dimensionality
of the surface reconstruction problem. It removes the need both for an estimate
of the rigid transformation with respect to the reference shape and for access
to training data or material properties, which may be unavailable or unknown.
Furthermore, this is achieved without sacrificing accuracy.

In the remainder of this paper, we first review existing approaches and remind
the reader how the problem can be formulated as one of solving a linear
but degenerate system of equations, as was done in earlier approaches [1]. We
then introduce our extended Laplacian formalism, which lets us transform the
degenerate linear system into a smaller well-posed one. Finally, we present our
results and compare them against state-of-the-art methods.
2 Related Work
Reconstructing the 3D shape of a non-rigid surface from a single input image
is a severely under-constrained problem, even when a reference image of the
surface in a dierent but known conguration is available, which is the problem
we address here.
When point correspondences can be established between the reference
and input images, shape-recovery can be formulated in terms of solving an
ill-conditioned linear-system [1] and requires the introduction of additional
constraints to make it well-posed. The most popular ones involve preserving
414 J. Ostlund et al.
Euclidean or Geodesic distances as the surface deforms and are enforced either
by solving a convex optimization problem [9, 7, 8, 12, 13, 3] or by solving in
closed form sets of quadratic equations [14, 11]. The latter is typically done by
linearization, which results in very large systems and is no faster than mini-
mizing a convex objective function, as is done in [3] which is one of the best
representatives of this class of techniques.
The complexity of the problem can be reduced using a dimensionality reduc-
tion technique such as Principal Component Analysis (PCA) to create morphable
models [2, 18], modal analysis [11, 12], or Free Form Deformations (FFDs) [9].
As shown in the result section, PCA and modal analysis can be coupled with
constraints such as those discussed above to achieve good accuracy. However,
this implies using training data or being able to build a stiffness matrix, which
is not always possible. The FFD approach [9] is particularly relevant since, like
ours, it relies on parameterizing the surface in terms of control points. However,
it requires placing the control points on a regular grid, whereas ours can be
placed arbitrarily for maximal accuracy. This makes our approach closer to earlier
approaches to fitting 3D surfaces to point clouds that rely on Dirichlet Free Form
Deformations and also allow arbitrary placement of the control points [19].
Furthermore, none of these dimensionality reduction methods allow
orientation-invariant regularization because deformations are minimized with
respect to a reference shape or a set of control points that must be properly ori-
ented. Thus, global rotations must be explicitly handled to achieve satisfactory
results. To overcome this limitation, we took our inspiration from the Laplacian
formalism presented in [15] and the rotation-invariant formulation of [16], which,
like ours, involves introducing virtual vertices. In both these papers, the mesh
Laplacian is used to define a regularization term that tends to preserve the shape
of a non-planar surface. Furthermore, vertex coordinates can be parameterized
as a linear combination of those of control vertices to solve specific minimization
problems [16]. In Section 4, we will go one step further by providing a generic
linear expression of the vertex coordinates as a function of those of the control
vertices, independently of the objective function to be minimized.
where vi contains the 3D coordinates of the ith vertex of the Nv -vertex trian-
gulated mesh representing the surface, and M is a matrix that depends on the
coordinates of correspondences in the input image and on the camera internal pa-
rameters. A solution of this system defines a surface such that 3D feature points
Laplacian Meshes for Monocular 3D Shape Recovery 415
    x = x0 + Σ_{i=1}^{Ns} wi bi = x0 + B w ,                                  (2)

where x is the coordinate vector of Eq. 1, B is the matrix whose columns are the
bi basis vectors, typically taken to be the eigenvectors of a stiffness matrix, and
w is the associated vector of weights wi. Injecting this expression into Eq. 1
and adding a regularization term yields a new system
    [ M B    M x0 ] [ w ]
    [ λr L     0  ] [ 1 ]  = 0 ,                                              (3)
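As an illustration, the system above reduces to an ordinary regularized least-squares solve for the mode weights w. The numpy sketch below uses random stand-ins for M, B, x0 and L; all dimensions and the regularization weight are illustrative, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the quantities of Eqs. 1-3 (dimensions are illustrative only).
n_x, n_s = 30, 5                     # coordinate-vector size, number of modes
M = rng.standard_normal((20, n_x))   # correspondence matrix of Eq. 1
B = rng.standard_normal((n_x, n_s))  # columns are the deformation modes b_i
x0 = rng.standard_normal(n_x)        # rest-shape coordinate vector
L = np.eye(n_s)                      # regularizer acting on the weights
w_r = 0.1                            # regularization weight

# Eq. 3 asks for [M B, M x0; w_r L, 0][w; 1] = 0, i.e. the least-squares
# solution of (M B) w = -M x0 under the soft penalty w_r ||L w|| -> 0,
# obtained by stacking the penalty rows below the data rows.
A_sys = np.vstack([M @ B, w_r * L])
b_sys = np.concatenate([-M @ x0, np.zeros(n_s)])
w, *_ = np.linalg.lstsq(A_sys, b_sys, rcond=None)

x = x0 + B @ w                       # recovered coordinate vector (Eq. 2)
```

The stacked form makes the regularizer part of the same least-squares objective, so no constrained solver is needed.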
4 Laplacian Formulation
In the previous section, we introduced the linear system of Eq. 1, which is so
badly conditioned that we cannot minimize the reprojection error by simply
solving it. In this section, we show how to turn it into a much smaller and
well-conditioned system.
Fig. 1. Linear parameterization of mesh vertices using control points. (a) Vertices and
control points in their rest state. (b) Control points and vertices in a deformed state.
Every vertex is a linear combination of the control points.
To this end, let us assume we are given a reference shape, which may or may not
be planar, and let x_ref be the coordinate vector of its vertices, such as the ones
highlighted in Fig. 1. We first show that we can define a regularization matrix A
such that ||A x||_2^2 = 0, with x being the coordinate vector of Eq. 1, when
x_ref = x up to a rigid transformation. In other words, ||A x||^2 penalizes non-rigid
deformations away from the reference shape but not rigid ones. We then show that,
given a subset of Nc mesh vertices whose coordinates are

    c = [ v_{i1}^T, ..., v_{iNc}^T ]^T ,                                      (4)
where wr is a scalar coefficient, and which can still be done in closed form. We
will show in the results section that this yields accurate reconstructions, even
with relatively small numbers of control points.
Fig. 2. Singular values (a) of MP and M, drawn with red and blue markers respectively,
for a planar model and a given set of correspondences. The smallest singular value of
MP is about four, whereas it is practically zero for the first Nv singular values of M.
Normalized singular values (b) of the regularization matrix A computed on a non-planar
model for various values of the scale parameter. Note that the curves are almost
superposed for values greater than one.
We now turn to building the matrix A such that A x_ref = 0 and such that
||A x||_2^2 is invariant to rigid transformations of x.
Planar Rest Shape. Given a planar mesh that represents a reference surface
in its rest shape, consider every pair of facets that share an edge, such as those
depicted by Fig. 3(a). They jointly have four vertices v1^ref to v4^ref. Since they
lie on a plane, whether or not the mesh is regular, one can always find a unique
set of weights w1, w2, w3 and w4 such that

    0 = w1 v1^ref + w2 v2^ref + w3 v3^ref + w4 v4^ref ,
    0 = w1 + w2 + w3 + w4 ,                                                   (8)
    1 = w1^2 + w2^2 + w3^2 + w4^2 .
Fig. 3. Building the A regularization matrix of Section 4.1. (a) Two facets that share
an edge. (b) Non-planar reference mesh. (c) Non-planar reference mesh with virtual
vertices and edges added. One of the pairs of tetrahedra used to construct the matrix
is shown in red.
    0 = w1 v1^{A,ref} + w2 v2^{A,ref} + w3 v3^{A,ref} + w4 v4^{A,ref} + w5 v5^{A,ref} ,
    0 = w1 + w2 + w3 + w4 + w5 ,                                              (10)
    1 = w1^2 + w2^2 + w3^2 + w4^2 + w5^2 .
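Numerically, the weights of Eq. 10 are a unit-norm null vector of the small system that stacks the five reference coordinates and a row of ones. A numpy sketch, with five generic 3D points standing in for the vertices of a tetrahedron pair:

```python
import numpy as np

def tetra_pair_weights(V):
    """Unit-norm weights w_1..w_5 of Eq. 10 for five 3D points (the columns of
    the 3x5 array V): sum_i w_i v_i = 0 and sum_i w_i = 0 with ||w|| = 1."""
    C = np.vstack([V, np.ones((1, V.shape[1]))])  # coordinates plus the sum row
    _, _, Vt = np.linalg.svd(C)
    return Vt[-1]   # null vector: right-singular vector of the zero singular value

# Five generic points in 3D are always affinely dependent, so a null vector exists.
V = np.random.default_rng(1).standard_normal((3, 5))
w = tetra_pair_weights(V)
```

The last right-singular vector has unit norm by construction, so the third equality of Eq. 10 holds automatically.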
The three equalities of Eq. 10 serve the same purpose as those of Eq. 8. We form
a matrix A^A by considering all pairs of tetrahedra that share a facet, computing
the w1 to w5 weights that encode local linear dependencies between real and
virtual vertices, and using them to add successive rows to the matrix, as we did
to build the A matrix in the planar case. It is again easy to verify that
A^A x^A_ref = 0 and that ||A^A x^A|| is invariant to affine deformations of x^A.
In our scheme, the
regularization term can be computed as

    C = ||A^A x^A||^2 = ||A1 x + A2 xV||^2 ,                                  (11)

if we write A^A as [A1 A2], where A1 has three times as many columns as there
are real vertices and A2 three times as many as there are virtual ones. Given the
real vertices x, the virtual vertex coordinates that minimize the C term of Eq. 11
are

    xV = -(A2^T A2)^{-1} A2^T A1 x .                                          (12)

Substituting this expression back into Eq. 11 yields

    C = ||A1 x - A2 (A2^T A2)^{-1} A2^T A1 x||^2 = ||A x||^2 ,

where A = A1 - A2 (A2^T A2)^{-1} A2^T A1. In other words, this matrix A is the
regularization matrix we are looking for, and its elements depend only on the
coordinates of the reference vertices and on the distance chosen to build the
virtual vertices.
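The elimination of the virtual vertices is easy to verify numerically. The sketch below draws random stand-ins for the blocks A1 and A2 of the augmented matrix and checks that substituting the minimizing xV of Eq. 12 reproduces the cost ||Ax||^2 of the reduced matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
n_real, n_virt = 12, 6                  # 3x the number of real / virtual vertices
A1 = rng.standard_normal((20, n_real))  # block acting on the real vertices x
A2 = rng.standard_normal((20, n_virt))  # block acting on the virtual vertices xV
x = rng.standard_normal(n_real)

# Eq. 12: virtual coordinates minimizing C = ||A1 x + A2 xV||^2.
xV = -np.linalg.solve(A2.T @ A2, A2.T @ (A1 @ x))

# Reduced regularization matrix A = A1 - A2 (A2^T A2)^{-1} A2^T A1.
A = A1 - A2 @ np.linalg.solve(A2.T @ A2, A2.T @ A1)

C_direct = np.linalg.norm(A1 @ x + A2 @ xV) ** 2   # cost with explicit xV
C_reduced = np.linalg.norm(A @ x) ** 2             # cost with the reduced matrix
```

The two costs agree, and the residual A1 x + A2 xV is orthogonal to the range of A2, which is the optimality condition behind Eq. 12.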
To study the influence of the scale parameter of Eq. 9, which controls the
distance of the virtual vertices from the mesh, we computed the A matrix and
its singular values for a sail-shaped mesh and many values of this parameter.
As can be seen in Fig. 2(b), there is a wide range for which the distribution of
singular values remains practically identical. This suggests that the parameter
has limited influence on the numerical character of the regularization matrix.
In all our experiments, we set it to 1 and were nevertheless able to obtain
accurate reconstructions of many different surfaces with very different physical
properties.
4.3 Reconstruction
Given that we can write the coordinate vector x as Pc and given potentially
noisy correspondences between the reference and input images, recall that we
can find c as the solution of the linear least-squares problem of Eq. 7.
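A hypothetical sketch of this solve, assuming Eq. 7 (not reproduced in this excerpt) has the form min_c ||M P c||^2 + wr^2 ||A P c||^2 subject to ||c|| = 1; under that assumption the minimizer is the right-singular vector associated with the smallest singular value of the stacked matrix:

```python
import numpy as np

def solve_control_points(M, A, P, w_r):
    """Hypothetical sketch: recover the control-point vector c assuming an
    objective min_c ||M P c||^2 + w_r^2 ||A P c||^2 with ||c|| = 1 (the exact
    statement of Eq. 7 is an assumption here).  The minimizer is the
    right-singular vector of the stacked matrix with smallest singular value."""
    S = np.vstack([M @ P, w_r * (A @ P)])
    return np.linalg.svd(S)[2][-1]

# Toy check: if both M and A annihilate the first control coordinate, the
# recovered c must be, up to sign, the first canonical vector.
rng = np.random.default_rng(3)
M = rng.standard_normal((10, 4))
A = rng.standard_normal((8, 4))
M[:, 0] = 0.0
A[:, 0] = 0.0
c = solve_control_points(M, A, np.eye(4), w_r=0.5)
```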
As shown in the result section, this yields a mesh whose reprojection is very
accurate but whose 3D shape may not be, because our regularization does not
penalize affine deformations away from the rest shape. In practice, we use this
initial mesh to eliminate erroneous correspondences. We then refine it by solving
5 Results
We compare our approach to surface reconstruction from a single input image
given correspondences with a reference image against the recent method of [3],
which has been shown to outperform closed-form solutions based on a global
deformation model and inextensibility constraints [14], and solutions that rely
on nonlinear local deformation models [20]. In all these experiments, we set
the initial regularization weight wr of Eq. 7 to 2e4, the scale parameter of
Eq. 9 to 1, and the initial radius for outlier rejection to 150 pixels.
Paper Sequence. Our qualitative results are depicted by Fig. 4. The rest shape is
flat, modeled by a 119 rectangular mesh, and parameterized by 20 regularly
spaced control points such as those of Fig. 1. For reconstruction purposes, we
ignored the depth maps and used only the color images, along with SIFT corre-
spondences with the reference image. For evaluation purposes, we measured the
distance of the recovered vertices to the depth map.
We report the 3D reconstruction errors for dierent methods in Fig. 6(a). Our
approach yields the highest average accuracy, followed closely by the subspace
method, provided that the model is properly pre-aligned by computing a global
rotation and translation.
Sail Images. We applied our reconstruction algorithm on sail images such as the
one of Fig. 5. Since images taken by a second camera are available, we used a
simple triangulation method to generate 3D points corresponding to the circular
markers on the sail, which serve as ground-truth measurements. We compare
the reconstructions against the ground-truth points, scaled such that the mean
distance between the predicted marker positions and their ground-truth positions
is minimized. In this case, as shown in Fig. 6(b), our method achieves the highest
accuracy among the competing methods.
Our C++ implementation relies on the Ferns keypoint classifier [24] to match
feature points and average computation times per frame are given in Fig. 6(c)
as a function of the numbers of control points. As stated above, the accuracy
results of Figs. 6(a,b) were obtained using 20 control points.
Fig. 4. Paper sequence with ground-truth. Top row: Reprojection of the constrained
reconstructions on the input images. Middle row: Unconstrained solutions seen from
a different view-point. Bottom row: Constrained solutions seen from the same
view-point. We supply the corresponding video sequence as supplementary material.
Fig. 6. Mean reconstruction errors of the unconstrained and constrained solutions for
the paper sequence (a) and the sail images (b), and average computation time in seconds
as a function of the number of control points (c).
Fig. 7. Paper sequence. Top row: Reprojections of our reconstructions on the input
images. Bottom row: Reconstructions seen from a different view-point. We supply
the corresponding video sequences as supplementary material.
reference shape of Fig. 9. Our method can then be used to recover its shape from
input images, whether the surface is concave or convex. This demonstrates our
method's ability to recover truly large non-planar deformations.
Fig. 9. Large deformations. Top row: The reference image with a concave reference
shape, and two input images. In the first, the surface is concave; in the other, convex.
Bottom row: The depth image used to build the reference mesh, and reconstructions of
the concave and convex versions of the surface.
6 Conclusion
References
1. Salzmann, M., Fua, P.: Deformable Surface 3D Reconstruction from Monocular
Images. Morgan-Claypool Publishers (2010)
2. Blanz, V., Vetter, T.: A Morphable Model for the Synthesis of 3D Faces. In:
SIGGRAPH, pp. 187–194 (1999)
3. Salzmann, M., Fua, P.: Linear Local Models for Monocular Reconstruction of
Deformable Surfaces. PAMI 33, 931–944 (2011)
4. Gumerov, N., Zandifar, A., Duraiswami, R., Davis, L.S.: Structure of Applicable
Surfaces from Single Views. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004, Part III.
LNCS, vol. 3023, pp. 482–496. Springer, Heidelberg (2004)
5. Liang, J., Dementhon, D., Doermann, D.: Flattening Curved Documents in Images.
In: CVPR, pp. 338–345 (2005)
6. Perriollat, M., Bartoli, A.: A Quasi-Minimal Model for Paper-Like Surfaces. In:
BenCos: Workshop Towards Benchmarking Automated Calibration, Orientation
and Surface Reconstruction from Images (2007)
1 Introduction
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 426–440, 2012.
© Springer-Verlag Berlin Heidelberg 2012
Soft Inextensibility Constraints for Template-Free Non-rigid Reconstruction 427
On the other hand, since each patch is reconstructed independently, these meth-
ods require a post-processing stitching step. Compared to prior work, ours is the
first to include all of the following advantages:
The inextensibility prior has been most commonly used in template based meth-
ods. These focus on reconstructing a single image, assuming that correspon-
dences between the image and a 3D template shape are given. Template based
methods have been mostly applied to the reconstruction of developable surfaces.
Since developable surfaces are isometric with a plane, the template shape is usu-
ally a planar surface. A closed form solution for this problem was introduced in
[1], where the template was in the form of a 2D mesh and the inextensibility
constraint was enforced between mesh nodes. A different version of the
inextensibility constraint, which allows shrinking to account for the differences
between Euclidean and geodesic distances, was presented in [9] and shown to be more
suitable for reconstructing sharply folding surfaces. In [2] the known distances
between interest points were used to obtain upper bounds on the depth of each
point. This approach was later extended in [3] to account for noise in the corre-
spondences and in the template.
An interesting exception to this coupling between inextensibility priors and
template based methods was presented in [10]. This method assumed the surface
was developable and recovered a planar template given correspondences across
multiple images. In contrast to these methods, our approach is not only template
free, but also applicable to non-developable surfaces.
Table 1. Notation. When referring to the case of a single input image or a generic
image, we sometimes drop the index s.
Notation                 Description
S                        Number of input images.
N                        Number of interest points. We use i and j to index the set of points.
p_i^s = (x_i^s, y_i^s)   2D point i in image frame s.
Q_i^s                    3D coordinates of point i on the shape corresponding to frame s.
λ_i^s                    Depth of the 3D point i on the shape corresponding to frame s.
N = {(i, j)}             Neighbourhood system: the set of all pairs (i, j) of neighbouring
                         points. Points i and j are neighbours if their image distance is
                         small in all frames.
Also related to our approach are the recently introduced piecewise models
[7, 8, 11, 6]. In general, these methods can cope with sequences with stronger
deformations than previous global approaches. They differ in the type of local
model considered (piecewise planarity [7], local rigidity [8] and quadratic
deformations [11, 6]) and in the way the surface is divided into patches. For example,
in [8] each patch is a triangle formed by neighbouring points, while in [7] the
surface is divided into regular rectangular patches. A more principled approach
was proposed in [6], where the format of the patches is jointly optimised with
the model for each patch.
Inextensibility is imposed in the piecewise planar patches used in [7] and
the rigid triangles used in [8]. Neither of these approaches requires a template.
However, they both require a stitching step to recover a coherent final surface.
Furthermore, [7] requires a mesh fitting step with manual initialisation to
recover a smooth surface, while the method in [8] discards non-rigid triangles,
producing only partial reconstructions. Our method does not suffer from any
of these disadvantages. It produces a full reconstruction, even for parts which
violate the inextensibility constraint, and does not require any post-processing
step of stitching or mesh fitting.
Most related to our approach is the incremental rigidity scheme proposed
in [12] for temporal sequences. The scheme is applicable to both rigid and non-
rigid objects and it maintains a 3D model of the object which is incrementally
updated based on every new frame, by maximizing rigidity. Since the initialization
of the model consists of a flat shape (at constant depth), the method is
not expected to recover a plausible reconstruction for the initial frames. This
scheme was later extended in [13] to reconstruct deforming developable surfaces.
Our method differs from these approaches in that we aim at recovering the 3D
shape corresponding to all frames and we do not require a temporal sequence.
1.2 Notation
We start by defining the notation used throughout the paper and discussing some
of the assumptions made. Our notation is summarised in Table 1.
430 S. Vicente and L. Agapito
    E(λ, d) = Σ_{s=1}^{S} Σ_{(i,j)∈N} | ||Q_i^s(λ_i^s) − Q_j^s(λ_j^s)|| − d_ij |     (3)
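For concreteness, the sketch below evaluates this energy under an orthographic camera, for which a 2D point at depth λ back-projects to Q_i^s = (x_i^s, y_i^s, λ_i^s); the data are toy values:

```python
import numpy as np

def energy(lam, d, p, pairs):
    """Energy of Eq. 3 for an orthographic camera, where the back-projection is
    Q_i^s(lambda_i^s) = (x_i^s, y_i^s, lambda_i^s).  p has shape (S, N, 2),
    lam has shape (S, N) and d is an N x N matrix of pairwise distances."""
    E = 0.0
    for s in range(p.shape[0]):
        Q = np.column_stack([p[s], lam[s]])     # (N, 3) back-projected points
        for i, j in pairs:
            E += abs(np.linalg.norm(Q[i] - Q[j]) - d[i, j])
    return E

# A 3-4-5 triangle: the 2D separation is 3, the depth difference 4, so the
# 3D distance is 5 and the energy vanishes when d_ij = 5.
p = np.array([[[0.0, 0.0], [3.0, 0.0]]])
lam = np.array([[0.0, 4.0]])
d = np.array([[0.0, 5.0], [5.0, 0.0]])
E = energy(lam, d, p, [(0, 1)])
```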
4 Discrete Optimisation
We rely on discrete optimisation methods and in particular fusion moves to
minimise both the template-based (1) and the template-free (3) energy functions.
While (1) is a pairwise energy function, the template-free problem involves a
higher-order energy function (3), with cliques of size 3. We describe in detail
how energy (3) is minimised and omit details on the minimisation of (1) since
it is analogous, but simpler.
    λ_i^{s'} = λ_i^s − α ∂E(λ, d)/∂λ_i^s ,    d'_ij = d_ij − α ∂E(λ, d)/∂d_ij ,    (4)

where α is a weighting term. Note that this weighting is less crucial than in
normal gradient descent, since using this new proposal in a fusion move guarantees
that the energy does not increase.
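A sketch of this proposal generator, using a subgradient of the absolute value and, for concreteness, the orthographic back-projection Q = (x, y, λ); the step size alpha is illustrative:

```python
import numpy as np

def gradient_proposal(lam, d, p, pairs, alpha=0.1):
    """One gradient-descent proposal (Eq. 4) on the orthographic version of the
    energy, using a subgradient of the absolute value.  The result is only a
    proposal to be fused with the current solution, so alpha need not be tuned
    carefully."""
    g_lam, g_d = np.zeros_like(lam), np.zeros_like(d)
    for s in range(p.shape[0]):
        Q = np.column_stack([p[s], lam[s]])
        for i, j in pairs:
            r = np.linalg.norm(Q[i] - Q[j])
            sgn = np.sign(r - d[i, j])          # subgradient of |r - d_ij|
            if r > 0:
                dz = (lam[s, i] - lam[s, j]) / r  # d r / d lambda_i^s
                g_lam[s, i] += sgn * dz
                g_lam[s, j] -= sgn * dz
            g_d[i, j] -= sgn                      # dE / d d_ij = -sign(r - d_ij)
    return lam - alpha * g_lam, d - alpha * g_d

# One step shrinks an over-estimated distance towards the observed separation.
p = np.array([[[0.0, 0.0], [1.0, 0.0]]])
lam = np.zeros((1, 2))
d = np.array([[0.0, 2.0], [2.0, 0.0]])
lam1, d1 = gradient_proposal(lam, d, p, [(0, 1)])
```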
4.3 Initialisation
We have described how to obtain new proposals from a current solution. We now
discuss how to define the initial proposal (λ^0, d^0). We start by approximating the
unknown distances d_ij from the known 2D projections by taking the maximum
distance in 2D between each pair of points in the image, i.e. we initialise the
distances as d^0_ij = max_s ||p_i^s − p_j^s||. For an orthographic camera, these
2D distances are guaranteed to be a lower bound on the real 3D distances.
All the depths are initialised to the same value, i.e. we set λ_i^s = const, ∀i, s,
where const = 0 for orthographic and const = k for perspective cameras.
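The initialisation is a few lines of numpy; here p holds the 2D tracks with shape (S, N, 2):

```python
import numpy as np

def initial_proposal(p, const=0.0):
    """Initial proposal (lambda^0, d^0) from the 2D tracks p of shape (S, N, 2).
    Distances start at the furthest observed 2D separation, a lower bound on the
    true 3D distance for an orthographic camera; all depths start at `const`."""
    S, N, _ = p.shape
    diff = p[:, :, None, :] - p[:, None, :, :]      # (S, N, N, 2) pairwise offsets
    d0 = np.linalg.norm(diff, axis=-1).max(axis=0)  # (N, N) furthest 2D distances
    lam0 = np.full((S, N), const)
    return lam0, d0

# Three tracked points over two frames.
p = np.array([[[0.0, 0], [1, 0], [0, 2]],
              [[0.0, 0], [3, 0], [0, 1]]])
lam0, d0 = initial_proposal(p)
```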
Our overall algorithm functions in the following way: we initialise the algorithm
8 times as described in section 4.3. We then run 50 fusion moves for each ini-
tialisation, where the proposals are generated using the greedy region growing
procedure summarised in Algorithm 1. As choice of seed region we alternate
between using a pair of neighbouring points or a half-plane region defined by a
constraint of the form {i : x_i^s < m}, {i : x_i^s > m}, {i : y_i^s < m} or
{i : y_i^s > m} for some frame s and constant m.
The 8 initialisations are optimised independently; we therefore obtain 8 different
solutions and choose the one with minimum energy. Finally, we run an additional
100 fusion moves on this solution with proposals obtained from gradient descent.
5 Extensions
5.1 Additional Priors
Two commonly used priors for the task of non-rigid reconstruction are temporal
and surface smoothness. Both these priors can be included in our formulation
by adding extra terms to the energy function and using a similar optimization
strategy. Note that the addition of these extra terms requires a choice of sensible
weights for the different terms.
6 Experiments
We evaluate our approach in the two different scenarios it allows: template-based
reconstruction for a single image and template-free reconstruction from multiple
images. We use several datasets that have been previously introduced in [2, 8, 20,
11]. We start by providing details about the construction of the neighbourhood
system and the computation of the error measures.
Fig. 3. Template based reconstruction for synthetic data. The table records
median pwre for 1000 randomly generated paper sheets. We show results for two
typical paper sheets: black grid is ground truth and blue points are our reconstruction.
Definition of the Neighbourhood System N. For all pairs of points (i, j),
i ≠ j, we compute their furthest distance f_ij over all the frames:
f_ij = max_s ||p_i^s − p_j^s||. For each point i, let K_i be its set of k-nearest
neighbours according to the distances f_ij. The neighbourhood system is then
defined as N = ∪_i {(i, j) : j ∈ K_i, f_ij ≤ t}, where t corresponds to a threshold
on the distance. For most of the experiments we set k = 10 and t = 200. An
example of a neighbourhood system can be seen in Fig. 1(b), where each edge
corresponds to a pair of points in the neighbourhood.
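A brute-force numpy sketch of this construction (adequate for moderate numbers of points; the parameter values below are toy choices):

```python
import numpy as np

def neighbourhood(p, k, t):
    """N = union_i {(i, j) : j in K_i, f_ij <= t}, with f_ij = max_s ||p_i^s - p_j^s||
    the furthest 2D distance over all frames; p has shape (S, N, 2)."""
    f = np.linalg.norm(p[:, :, None, :] - p[:, None, :, :], axis=-1).max(axis=0)
    np.fill_diagonal(f, np.inf)                 # exclude i == j
    pairs = set()
    for i in range(f.shape[0]):
        for j in np.argsort(f[i])[:k]:          # the k nearest neighbours K_i
            if f[i, j] <= t:
                pairs.add((min(i, int(j)), max(i, int(j))))
    return pairs

# Collinear points at x = 0, 1, 2, 10 in a single frame, with k = 1 and t = 5:
# the isolated point at x = 10 gets no neighbour.
p = np.array([[[0.0, 0], [1, 0], [2, 0], [10, 0]]])
pairs = neighbourhood(p, k=1, t=5.0)
```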
Error Measures. Some of the datasets have associated ground truth that can
be used to quantitatively evaluate our method. We denote the ground truth
reconstruction of point i in image s by Q̄_i^s. We found that there is no standard
error measure uniformly used in previous related work, and in order to enable an
easier comparison we use two different error metrics reported by previous authors:
the per-frame pointwise reconstruction error (pwre) [2, 3], given by

    pwre = (1/N) Σ_{i=1}^{N} ||Q_i^s − Q̄_i^s|| ;

and the root mean squared error (rmse) over a sequence [8], defined as

    rmse = sqrt( (1/(3NS)) Σ_{s=1}^{S} Σ_{i=1}^{N} ||Q_i^s − Q̄_i^s||^2 ). When considering an or-
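Both metrics are straightforward to compute; the sketch below uses Q_gt for the ground-truth reconstructions:

```python
import numpy as np

def pwre(Q, Q_gt, s):
    """Per-frame pointwise reconstruction error; Q and Q_gt have shape (S, N, 3)."""
    return np.linalg.norm(Q[s] - Q_gt[s], axis=-1).mean()

def rmse(Q, Q_gt):
    """Root mean squared error over the whole sequence of S frames."""
    S, N, _ = Q.shape
    return np.sqrt((np.linalg.norm(Q - Q_gt, axis=-1) ** 2).sum() / (3 * N * S))

# Two points in one frame, displaced by 3 and 4 units respectively.
Q_gt = np.zeros((1, 2, 3))
Q = np.array([[[3.0, 0, 0], [0, 4.0, 0]]])
```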
error for perspective projection with noise compares favourably with the re-
sults reported in [3]. We outperform all other point-wise reconstruction methods
[9, 2, 3], all of which have a median pwre greater than 2, while ours is 1.8. The
only method that outperforms ours is the surface reconstruction method also
introduced in [3], which has a median pwre around 1.5 (see figure 3 in [3]).
However, surface reconstruction methods use stronger priors than pointwise
approaches.
Paper Sequence [20]: Fig. 4 shows the results of our method applied to re-
construct just 3 frames of the sequence. We use the tracks provided by [8].
In these reconstructions we make use of a perspective calibrated camera and of
the surface smoothness prior discussed in section 5.1, since the interest points
are positioned in a grid. For both datasets, we obtain reconstructions which are
visually comparable to previous methods that use more information, either a
template [3] or temporal consistency [8, 6]. Furthermore, our method correctly
recovers the relative depths of the reconstructed shapes. This can be seen in the
last column of Fig. 4.
1 [8, 6] are currently considered the state-of-the-art NRSfM methods.
Fig. 5. Flag sequence. Reconstruction (red dots) overlaid with ground truth (black
circles) for a few frames from different viewpoints.
7 Conclusion
References
1. Salzmann, M., Moreno-Noguer, F., Lepetit, V., Fua, P.: Closed-Form Solution to
Non-rigid 3D Surface Registration. In: Forsyth, D., Torr, P., Zisserman, A. (eds.)
ECCV 2008, Part IV. LNCS, vol. 5305, pp. 581–594. Springer, Heidelberg (2008)
2. Perriollat, M., Hartley, R., Bartoli, A.: Monocular template-based reconstruction
of inextensible surfaces. In: BMVC (2008)
3. Brunet, F., Hartley, R., Bartoli, A., Navab, N., Malgouyres, R.: Monocu-
lar Template-Based Reconstruction of Smooth and Inextensible Surfaces. In:
Kimmel, R., Klette, R., Sugimoto, A. (eds.) ACCV 2010, Part III. LNCS, vol. 6494,
pp. 52–66. Springer, Heidelberg (2011)
4. Bregler, C., Hertzmann, A., Biermann, H.: Recovering non-rigid 3D shape from
image streams. In: CVPR (June 2000)
5. Torresani, L., Hertzmann, A., Bregler, C.: Non-rigid structure-from-motion: Esti-
mating shape and motion with hierarchical priors. PAMI (2008)
2 Note that the apparent strip-like structure of the flag is present in the ground truth
3D and is due to the regular way in which the MoCap markers were placed.
6. Russell, C., Fayad, J., Agapito, L.: Energy based multiple model fitting for
non-rigid structure from motion. In: CVPR (2011)
7. Varol, A., Salzmann, M., Tola, E., Fua, P.: Template-free monocular reconstruction
of deformable surfaces. In: ICCV (2009)
8. Taylor, J., Jepson, A.D., Kutulakos, K.N.: Non-rigid structure from locally-rigid
motion. In: CVPR (2010)
9. Salzmann, M., Fua, P.: Reconstructing sharply folding surfaces: A convex formu-
lation. In: CVPR (2009)
10. Ferreira, R., Xavier, J., Costeira, J.P.: Shape from motion of nonrigid objects: the
case of isometrically deformable flat surfaces. In: BMVC (2009)
11. Fayad, J., Agapito, L., Del Bue, A.: Piecewise Quadratic Reconstruction of Non-
Rigid Surfaces from Monocular Sequences. In: Daniilidis, K., Maragos, P., Paragios,
N. (eds.) ECCV 2010, Part IV. LNCS, vol. 6314, pp. 297–310. Springer, Heidelberg
(2010)
12. Ullman, S.: Maximizing rigidity: the incremental recovery of 3-d structure from
rigid and rubbery motion. Perception (1984)
13. Jasinschi, R., Yuille, A.: Nonrigid motion and regge calculus. Journal of the Optical
Society of America 6(7), 1088–1095 (1989)
14. Woodford, O.J., Torr, P.H.S., Reid, I.D., Fitzgibbon, A.W.: Global stereo recon-
struction under second order smoothness priors. In: CVPR (2008)
15. Lempitsky, V., Rother, C., Roth, S., Blake, A.: Fusion moves for Markov random
field optimization. PAMI 8, 1392–1405 (2010)
16. Ishikawa, H.: Higher-order clique reduction in binary graph cut. In: CVPR (2009)
17. Rother, C., Kolmogorov, V., Lempitsky, V., Szummer, M.: Optimizing binary
MRFs via extended roof duality. In: CVPR (2007)
18. Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow
algorithms for energy minimization in vision. PAMI (September 2004)
19. Ishikawa, H.: Higher-order gradient descent by fusion-move graph cut. In: ICCV
(2009)
20. Salzmann, M., Hartley, R., Fua, P.: Convex optimization for deformable surface
3-d tracking. In: ICCV (2007)
21. Perriollat, M., Bartoli, A.: A quasi-minimal model for paper-like surfaces. In:
CVPR (2007)
22. Yan, J., Pollefeys, M.: A General Framework for Motion Segmentation: Independent,
Articulated, Rigid, Non-rigid, Degenerate and Non-degenerate. In: Leonardis, A.,
Bischof, H., Pinz, A. (eds.) ECCV 2006, Part IV. LNCS, vol. 3954, pp. 94–106.
Springer, Heidelberg (2006)
Spatiotemporal Descriptor for Wide-Baseline Stereo
Reconstruction of Non-rigid and Ambiguous Scenes
Abstract. This paper studies the use of temporal consistency to match appear-
ance descriptors and handle complex ambiguities when computing dynamic depth
maps from stereo. Previous attempts have designed 3D descriptors over the space-
time volume and have been mostly used for monocular action recognition, as they
cannot deal with perspective changes. Our approach is based on a state-of-the-art
2D dense appearance descriptor which we extend in time by means of optical
flow priors, and can be applied to wide-baseline stereo setups. The basic idea be-
hind our approach is to capture the changes around a feature point in time instead
of trying to describe the spatiotemporal volume. We demonstrate its effectiveness
on very ambiguous synthetic video sequences with ground truth data, as well as
real sequences.
1 Introduction
Appearance descriptors have been used to successfully address both low- and high-
level computer vision problems by describing in a succinct and informative manner the
neighborhood around an image point. They have proved robust in many applications
such as 3D reconstruction, object classification or gesture recognition, and remain a
major topic of research in computer vision.
Stereo reconstruction is often performed by matching descriptors from two or more
different images to determine the quality of each correspondence, and applying a global
optimization scheme to enforce spatial consistency. This approach fails on scenes with
poor texture or repetitive patterns, regardless of the descriptor or the underlying opti-
mization scheme. In these situations we can attempt to solve the correspondence prob-
lem by incorporating dynamic information. Many descriptors are based on histograms
of gradient orientations [1–3], which can be extended to the spacetime volume. Since
the local temporal structure depends strongly on the camera view, most efforts in this
direction have focused on monocular action recognition [4–7]. For stereo
reconstruction, spatiotemporal descriptors computed in this manner should be oriented according
Work supported by the Spanish Ministry of Science and Innovation under projects
RobTaskCoop (DPI2010-17112), PAU+ (DPI2011-27510) and MIPRCV (Consolider-Ingenio
2010) (CSD2007-00018), and the EU ARCAS Project FP7-ICT-2011-287617. E. Trulls is
supported by a scholarship from Universitat Politecnica de Catalunya.
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 441–454, 2012.
© Springer-Verlag Berlin Heidelberg 2012
442 E. Trulls, A. Sanfeliu, and F. Moreno-Noguer
to the geometry of the setup. This approach was applied in [8] to disparity estimation
for narrow-baseline stereo, but its application to wide-baseline scenarios remains unex-
plored.
We face the problem from a different angle. Our main contribution is a spatiotempo-
ral approach to 3D stereo reconstruction applicable to wide-baseline stereo, augment-
ing the descriptor with optical flow priors instead of computing 3D gradients or similar
primitives. We compute, for each camera, dense Daisy descriptors [3] for a frame over
the spatial domain, and then extend them over time with optical flow priors while
discarding incorrect matches, effectively obtaining a concatenation of spatial
descriptors for every pixel. After matching, a global optimization algorithm is applied
over the spatiotemporal domain to enforce spatial and temporal consistency. The core idea is to
develop a primitive that captures the evolution of the spatial structure around a fea-
ture point in time, instead of trying to describe the spatiotemporal volume, which for
matching requires a knowledge of the geometry of the scene. We apply this approach to
dynamic sequences of non-rigid objects with a high number of ambiguous correspon-
dences and evaluate it on both synthetic and real sequences. We show that its reconstruc-
tions are more accurate and stable than those obtained with state-of-the-art descriptors.
In addition, we demonstrate that our approach can be applied to wide-baseline setups
with occlusions, and that it is robust to image noise.
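The following sketch distills the core idea. The names descs (a dense per-frame descriptor map, Daisy in our setting, but any dense 2D descriptor serves the illustration) and flows (forward flow fields) are placeholders, and the nearest-pixel tracking is a deliberate simplification:

```python
import numpy as np

def temporal_descriptor(descs, flows, x, y, n_frames):
    """Follow pixel (x, y) through the flow fields and stack the per-frame 2D
    descriptors along the track, rather than describing the spatio-temporal
    volume.  descs[t] is an (H, W, D) dense descriptor map for frame t and
    flows[t] an (H, W, 2) forward flow field; both are illustrative stand-ins."""
    track = []
    for t in range(n_frames):
        xi, yi = int(round(x)), int(round(y))
        track.append(descs[t][yi, xi])
        if t < n_frames - 1:
            dx, dy = flows[t][yi, xi]          # advect the point with the flow prior
            x, y = x + dx, y + dy
    return np.concatenate(track)               # one spatial descriptor per frame

# Toy maps: the descriptor at (x, y) in frame t is [t, x + y]; flow moves +1 in x.
descs = np.zeros((3, 4, 4, 2))
for t in range(3):
    for yy in range(4):
        for xx in range(4):
            descs[t, yy, xx] = [t, xx + yy]
flows = np.zeros((3, 4, 4, 2))
flows[..., 0] = 1.0
desc = temporal_descriptor(descs, flows, x=0, y=0, n_frames=3)
```

The resulting vector is simply a concatenation of per-frame spatial descriptors, so it can be matched across wide baselines without knowing the geometry of the scene.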
2 Related Work
The main reference among keypoint descriptors is still the SIFT descriptor [1], which
has shown great resilience against affine deformations on both the spatial and intensity
domains. Given its computational cost, subsequent work has often focused on devel-
oping more efficient descriptors, such as PCA-SIFT [9], GLOH [2] or SURF [10], and
recent efforts such as Daisy veer towards dense computation [3], i.e. computing a
descriptor for every pixel. Open problems include the treatment of non-rigid deformations,
scale, and occlusions. Current techniques accommodate scale changes by limiting their
application to singular points [1, 11, 10], where scale can be reliably estimated. A recent
approach [12] exploits a log-polar transformation to achieve scale invariance without
detection. Non-rigid deformations have been seldom addressed with region descriptors,
but recent advances have shown that kernels based on heat diffusion geometry can ef-
fectively describe local features of deforming surfaces [13]. Regarding occlusions, [3]
demonstrated performance improvements in multi-view stereo from the treatment of
occlusion as a latent variable and enforcing spatial consistency with graph cuts [14].
Although regularization schemes such as graph cuts may improve the spatial consis-
tency when matching pairs of images, in scenes with little texture or highly repetitive
patterns the problem can be too challenging. Dynamic information can then be used
to further discriminate amongst possible matches. In controlled settings, the correspon-
dence problem can be further relaxed via structured-light patterns [15, 16]. Otherwise,
more sophisticated descriptors need to be designed. For this purpose, SIFT and its many
variants have been extended to 3D data, although mostly for monocular cameras, as the
local temporal structure varies strongly with large viewpoint changes. For instance, this
has been applied to volumetric images on clinical data, and to video sequences for ac-
tion recognition [6, 5, 7, 17]. There are exceptions in which spatiotemporal descriptors
Spatiotemporal Descriptor for Wide-Baseline Stereo 443
have also been applied to disparity estimation for narrow-baseline stereo [8], designing
primitives (Stequels) that can be reoriented and matched from slightly different view-
points. The application of this approach to wide-baseline scenarios remains unexplored.
Our approach is also related to scene flow, i.e., the simultaneous recovery of 3D
flow and geometry. We have not attempted to compute scene flow, as these approaches
usually require strong assumptions such as known reflectance models or relatively small
deformations to simplify the problem, and often use simple pixel-wise matching strate-
gies that cannot be applied to ambiguous scenes or wide baselines. A reflectance model
is exploited in [18] to estimate the scene flow under known illumination conditions.
Zhang and Kambhamettu [19] estimate an initial disparity map and compute the scene
flow iteratively, exploiting image segmentation to maintain discontinuities. A more recent
approach is that of [20], which couples dense stereo matching with optical flow
estimation; this involves a set of partial differential equations which are solved numerically.
Of particular interest to us is the work of [21], which decouples stereo from
motion while enforcing a scene flow that is consistent across different views; this approach
is agnostic to the algorithm used for stereo.
3 Spatiotemporal Descriptor
Our approach could be applied, in principle, to any of the spatial appearance descriptors
described in Section 2. We pick Daisy [3] for the following reasons: (1) it is designed
for dense computation, which has been proven useful for wide-baseline stereo, and
will particularly help in ambiguous scenes without differentiated salient points; (2) it is
efficient to compute, and more so for our purposes, as we can store the convolved ori-
entation maps for the frames that make up a descriptor without recomputing them; and
(3) unlike similar descriptors like SIFT [1] or GLOH [2], Daisy is computed over a dis-
crete grid, which can be warped in time to follow the evolution of the structure around
a point, allowing us to validate the optical flow priors and determine the stability of the
descriptor in time. The last point will help us particularly when dealing with occlusions.
This section describes the computation of the descriptor set for a single camera, and the
following section explains how to match descriptor sets for stereo reconstruction.
We extend the Daisy descriptor in the following way. For every new (gray-scale)
frame I_k, we compute the optical flow priors for each consecutive pair of frames from
the same camera, in both the forward and backward directions: F+_{k-1} (from I_{k-1} to I_k)
and F-_k (vice versa). To compute a full set of spatiotemporal descriptors for frame k
using T = 2B + 1 frames we require frames I_{k-B} to I_{k+B} and their respective flow
priors. To compute the flow priors we use [22]. Since the size of the descriptor grows
linearly with T, we use small values, of about 3 to 11.
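The frame and flow bookkeeping above can be sketched as a sliding buffer. This is a minimal Python illustration (the paper's implementation is C++/Matlab); `optical_flow` is a hypothetical stand-in for the flow estimator of [22] and here just returns zero flow:

```python
from collections import deque

T = 5                 # frames per descriptor, T = 2B + 1
B = T // 2

def optical_flow(src, dst):
    """Hypothetical stand-in for the flow estimator of [22]: returns a
    per-pixel (dx, dy) field from src to dst. Zero flow here, to keep
    the sketch self-contained."""
    h, w = len(src), len(src[0])
    return [[(0.0, 0.0)] * w for _ in range(h)]

class FlowBuffer:
    """Keeps the last T frames plus forward/backward flow priors, as
    needed to build the spatiotemporal descriptor for frame k."""
    def __init__(self):
        self.frames = deque(maxlen=T)
        self.fwd = deque(maxlen=T - 1)   # F+_{k-1}: I_{k-1} -> I_k
        self.bwd = deque(maxlen=T - 1)   # F-_k    : I_k -> I_{k-1}

    def push(self, frame):
        if self.frames:
            prev = self.frames[-1]
            self.fwd.append(optical_flow(prev, frame))
            self.bwd.append(optical_flow(frame, prev))
        self.frames.append(frame)

    def ready(self):
        # A full descriptor needs I_{k-B} .. I_{k+B} and their flows.
        return len(self.frames) == T
```

Only the bookkeeping is shown; in practice both flow fields per frame pair must be stored, since the grid is warped forwards and backwards from the central frame.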
We then compute, for each frame, H gradient maps, defined as the gradient norms
at each location when they are greater than zero, and zero otherwise, to preserve polarity.
Each orientation map is convolved with Gaussian kernels of several standard deviations
to obtain the convolved orientation maps, so that the size of the kernel determines
the size of the region being described. The Gaussian filters are separable, and the
convolution for larger kernels is obtained from consecutive convolutions with smaller
kernels, for increased performance. The result is
a series of convolved orientation maps, which contain a weighted sum of gradient norms
444 E. Trulls, A. Sanfeliu, and F. Moreno-Noguer
around each pixel. Each map describes regions of a different size for a given orientation.
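The map computation above can be sketched as follows. This is a minimal pure-Python illustration (no claim to match the paper's C++ implementation); the number of orientations H and the set of standard deviations are illustrative choices:

```python
import math

H = 8  # number of orientation maps (illustrative)

def orientation_maps(img):
    """Half-wave rectified gradient projections: for each of H directions o,
    keep max(0, grad . o) so that opposite gradient polarities stay apart."""
    h, w = len(img), len(img[0])
    dirs = [(math.cos(2 * math.pi * o / H), math.sin(2 * math.pi * o / H))
            for o in range(H)]
    maps = [[[0.0] * w for _ in range(h)] for _ in range(H)]
    for y in range(h):
        for x in range(w):
            gx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
            gy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
            for o, (cx, cy) in enumerate(dirs):
                maps[o][y][x] = max(0.0, gx * cx + gy * cy)
    return maps

def gauss_blur_1d(row, sigma):
    """Separable Gaussian: blur one row; applied to rows then columns in 2D."""
    r = max(1, int(3 * sigma))
    k = [math.exp(-i * i / (2 * sigma * sigma)) for i in range(-r, r + 1)]
    s = sum(k)
    k = [v / s for v in k]
    n = len(row)
    return [sum(k[j + r] * row[min(max(i + j, 0), n - 1)]
                for j in range(-r, r + 1)) for i in range(n)]

def convolved_maps(img, sigmas=(1.0, 2.0)):
    """One stack of convolved orientation maps per region size. In the real
    pipeline the larger-sigma stack is obtained by re-blurring the smaller
    one (Gaussian convolutions compose); here each is computed directly."""
    maps = orientation_maps(img)
    out = []
    for sigma in sigmas:
        blurred = []
        for m in maps:
            rows = [gauss_blur_1d(r, sigma) for r in m]
            cols = [gauss_blur_1d(list(c), sigma) for c in zip(*rows)]
            blurred.append([list(r) for r in zip(*cols)])
        out.append(blurred)
    return out
```

The rectified maps stay non-negative under blurring, which is what makes the later per-grid-point histograms well defined.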
A Daisy descriptor for a given feature point would then be computed as a concatenation
of these values over certain pixel coordinates defined by a grid. The grid is defined as
equally spaced points over equally spaced concentric circles around the feature point
coordinates (see Fig. 1). The number of circles Q and the number of grid points on
each circle P , along with the number of gradient maps H, determine the size of the
descriptor, S = (1 + P·Q)·H. The number of circles also determines the number of
kernels to compute, as outer rings use convolution maps computed over larger regions,
for some invariance against rotation. The histograms are normalized separately for each
grid point, for robustness against partial occlusions. We can compute a score for the
match between a pair of Daisy descriptors as:

\[ \mathcal{D}(D_1(x), D_2(x)) = \frac{1}{G} \sum_{g=1}^{G} \left\| D_1^{[g]}(x) - D_2^{[g]}(x) \right\|, \tag{1} \]

where D_i^{[g]} denotes the histogram at grid point g = 1 . . . G. To enforce spatial coherence
in dealing with occlusions we apply binary masks over each grid point, obscuring
parts of its neighborhood and thus removing them from the matching process, following
[3]. The choice of masks is explained in the following section. The matching
score (cost) can then be computed as:

\[ \mathcal{D} = \frac{1}{\sum_{g=1}^{G} M^{[g]}} \sum_{g=1}^{G} M^{[g]} \left\| D_1^{[g]}(x) - D_2^{[g]}(x) \right\|, \tag{2} \]

where M^{[g]} is the binary mask for grid point g. Note that we do not use masks to build the
spatiotemporal descriptor, only to match two spatiotemporal descriptors for stereo reconstruction.
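The plain and masked scores of Eqs. (1) and (2) can be sketched directly; a minimal Python illustration, where a descriptor is represented as a list of per-grid-point histograms:

```python
import math

def daisy_distance(d1, d2):
    """Eq. (1): mean L2 distance between per-grid-point histograms."""
    G = len(d1)
    return sum(math.dist(h1, h2) for h1, h2 in zip(d1, d2)) / G

def masked_distance(d1, d2, mask):
    """Eq. (2): grid points with mask 0 are excluded from the match,
    and the score is renormalized by the number of enabled points."""
    num = sum(m * math.dist(h1, h2) for m, h1, h2 in zip(mask, d1, d2))
    return num / sum(mask)
```

Masking only renormalizes over the surviving grid points, so a descriptor half-covered by an occluder can still score well on its visible half.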
For our descriptor, we use the grid defined above to compute the sub-descriptor over
the feature point on the central frame. We then warp the grid through time by means of
the optical flow priors, translating each grid point independently. In addition to warping
the grid, we average the angular displacement of each grid point over the center of the
grid to estimate the change in rotation over the patch between the frames, and compute
a new sub-descriptor using the warped grid and the new orientation.
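The grid warping and rotation update can be sketched as follows; a minimal Python illustration (a simplification: the grid center is taken as fixed, and `flow` is any callable returning a per-point displacement):

```python
import math

def warp_grid(grid, flow):
    """Translate each grid point independently by its flow vector."""
    return [(x + flow(x, y)[0], y + flow(x, y)[1]) for x, y in grid]

def rotation_update(grid, warped, center):
    """Average angular displacement of the grid points about the grid
    center: an estimate of the patch rotation between two frames."""
    cx, cy = center
    deltas = []
    for (x0, y0), (x1, y1) in zip(grid, warped):
        a0 = math.atan2(y0 - cy, x0 - cx)
        a1 = math.atan2(y1 - cy, x1 - cx)
        d = a1 - a0
        # wrap the difference to (-pi, pi]
        deltas.append(math.atan2(math.sin(d), math.cos(d)))
    return sum(deltas) / len(deltas)
```

A flow field that rotates the patch by some angle should yield that angle back, which is the property the sub-descriptor reorientation relies on.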
An example of this procedure is depicted in Fig. 1. We then match each sub-
descriptor computed with the warped grid against the sub-descriptor for the central
frame, and discard it if the matching score falls above a certain threshold. The matching
score is typically smaller than the occlusion cost in the regularization process (see Section
4). With this procedure we can validate the optical flow and discard a match if the
patch suffers significant transformations such as large distortions, lighting changes or,
in particular, partial occlusions. Note that we do not need to recompute the convolved
orientation maps for each frame, as we can store them in memory. Computing the de-
scriptors is then reduced to sampling the convolved orientation maps at the appropriate
grid points on each frame. To compute descriptors over different orientations we simply
rotate the grid and shift the histograms circularly. We also apply interpolation over both
the spatial domain and the gradient orientations. The resulting descriptor is assembled
Fig. 1. Computation of the spatiotemporal descriptor with flow priors. Sub-descriptors outside
the central frame are computed with a warped grid and matched against the sub-descriptor for the
central frame. Valid sub-descriptors are then concatenated to create the spatiotemporal descriptor.
by concatenating the sub-descriptors in time, D(x) = {D_{k-B}, . . . , D_{k+B}}, and its size
is at most S' = T · S, where S is the size of a single Daisy descriptor.
To compute the distance between two spatiotemporal descriptors we average the
distance between valid pairs of sub-descriptors:

\[ \bar{\mathcal{D}} = \frac{1}{V} \sum_{l=k-B}^{k+B} v_l \, \mathcal{D}\!\left( D_1^{\{l\}}(x), D_2^{\{l\}}(x) \right), \tag{3} \]

where D_i^{\{l\}} is the sub-descriptor for frame l, v_l are binary flags that mark valid
matching sub-descriptor pairs (where both sub-descriptors pass the validation process),
and V = \sum_{l=k-B}^{k+B} v_l. Note that in a worst-case scenario we will always match
the sub-descriptors for frame k.
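Eq. (3) can be sketched as follows; a minimal Python illustration where, as a simplification, each sub-descriptor is a flat vector compared with an L2 distance rather than the per-grid-point score of Eq. (1):

```python
import math

def spatiotemporal_distance(D1, D2, v1, v2):
    """Eq. (3): average sub-descriptor distance over frames where BOTH
    streams passed the warped-descriptor validation. The central frame
    is always valid, so the average is never empty."""
    terms = [math.dist(a, b)
             for a, b, f1, f2 in zip(D1, D2, v1, v2) if f1 and f2]
    return sum(terms) / len(terms)
```

Frames where either side failed validation (e.g., because of a partial occlusion) simply drop out of the average instead of corrupting it.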
4 Depth Estimation
For stereo reconstruction we use a pair of calibrated monocular cameras. Since Daisy is
not rotation-invariant we require the calibration data to compute the descriptors along
the epipolar lines, rotating the grid. We do this operation for the central frame, and
for frames forwards and backwards we use the flow priors to warp the grid and to
estimate the patch rotation between the frames. We discretize the 3D space from a
given perspective, then compute the depth for every possible match of descriptors and
store it if the match is the best for a given depth bin, building a cube of matching scores
of size W × H × L, where W and H are the width and height of the image and L is
the number of layers we use to discretize 3D space. We can think of these values as
costs, and then apply global optimization techniques such as graph cuts [14] to enforce
piecewise smoothness, computing a good estimate that balances the matching costs with
a smoothness cost that penalizes the use of different labels on neighboring pixels. To
deal with occlusions we incorporate an occlusion node in the graph structure with a
constant cost. We use the same value, 20% of the maximum cost, for all experiments.
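The winner-take-all construction of the cost cube can be sketched as follows; a minimal Python illustration, where `matches` stands for the candidate correspondences generated along the epipolar lines (an assumed input format, not the paper's data structure):

```python
def build_cost_cube(width, height, layers, matches):
    """Winner-take-all cost cube: matches is an iterable of
    (x, y, depth_bin, cost) candidates; we keep the lowest cost seen
    for each (x, y, depth_bin) cell of the W x H x L cube."""
    INF = float("inf")
    cube = [[[INF] * layers for _ in range(width)] for _ in range(height)]
    for x, y, d, cost in matches:
        if cost < cube[y][x][d]:
            cube[y][x][d] = cost
    return cube
```

Cells never touched by a candidate keep an infinite cost, so the subsequent regularization naturally prefers the occlusion label there.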
We can use two strategies for global optimization: enforcing spatial consistency or
enforcing both spatial and temporal consistency. To enforce spatial consistency, we can
match two sets of spatiotemporal descriptors and run the optimization algorithm for a
single frame. The smoothness function for the optimization process will consider the 4
pixels adjacent to the feature point, and its cost is binary: a constant penalty for differing
labels, and 0 for matching labels.
To enforce spatial and temporal consistency, we match two sets of multiple spatiotemporal
descriptors. To do this we use a hypercube of matching costs of size W ×
H × L × M, where M is the number of frames used in the optimization process. Note
that M does not need to match T , the number of frames used for the computation of
the descriptors. We then set 6 neighboring pixels for the smoothing function, so that
each pixel (x, y, t) is linked to its 4 adjacent neighbors over the spatial domain and the
two pixels that share its spatial coordinates on the two adjacent frames, (x, y, t − 1) and
(x, y, t + 1). We then run the optimization algorithm over the spatiotemporal volume
and obtain M separate depth maps. The value for M is constrained by the amount of
memory available, and we use M = 5 for all the experiments in this paper.
This strategy produces better, more stable results on dynamic scenes, but introduces
one issue. The smoothness function operates over both the spatial and the temporal
domain in the same manner, but we have a much shorter buffer in the temporal domain
(M , versus the image size). This results in over-penalizing label discontinuities over
the time domain, and in practice smooth gradients on the temporal dimension are often
lost in favor of two frames with the same depth values. Additionally, the frames at
either end of the spatiotemporal volume are linked to a single frame, as opposed to two,
and the problem becomes more noticeable. For 2D stereo we apply the Potts model,
F (, ) = S T (
= ), where and are discretized depth values for adjacent
pixels, S is the smoothness cost, and T () is 1 if the argument is true and 0 otherwise.
For spatiotemporal stereo we use instead a truncated linear model:
/
S||
L if | | L
F (, ) = (4)
0 otherwise.
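The two smoothness penalties can be sketched side by side; a minimal Python illustration (the paper's implementation is C++), with S the smoothness cost and L the truncation threshold of Eq. (4):

```python
def potts(a, b, S):
    """Potts model used for GC-2D: constant penalty for any label change."""
    return S if a != b else 0.0

def truncated_linear(a, b, S, L):
    """Truncated linear model of Eq. (4), used for GC-3D: the penalty
    grows with the label difference, so gradual depth changes between
    neighboring pixels/frames cost less than with the Potts model.
    Differences beyond L are not penalized, following Eq. (4) as printed."""
    d = abs(a - b)
    return S * d / L if d <= L else 0.0
```

Under the Potts model a one-layer step and a ten-layer jump cost the same, which is what over-penalizes smooth temporal gradients; the linear term removes that bias.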
Fig. 2. To perform spatiotemporal regularization, we apply a graph cuts algorithm over the spa-
tiotemporal hypercube of costs (here displayed as a series of cost cubes). The smoothness func-
tion connects pixels on a spatial and temporal neighborhood. To avoid singularities on the frames
at either end of the buffer, we discard their resulting depth maps and obtain them from the next
iteration, as pictured. We do not need to recompute the cost cubes.
optimization strategy for most of the experiments in this paper. We refer to each strategy
as GC-2D and GC-3D.
Fig. 3. Application of masks to scenes with occlusions. Left to right: reference image, a depth
map after one iteration, and a posterior refinement using masks. Enabled grid points are shown
in blue, and disabled grid points are drawn in black. For clarity, we show a simplified
grid of 9 points.
For reference, a round of spatiotemporal stereo for two 480×360 images takes about 24
minutes on a 2 GHz dual-core computer (without parallelization), up from about 11 minutes
for Daisy. This does not include optical flow estimation (approximately 1 minute
per image), for which we use an off-the-shelf Matlab implementation; the rest of the
code is C++. The bottleneck is descriptor matching; note that for these experiments
we allow matches along most of the epipolar lines.
size 136. For Daisy-2D and SIFT, we apply a graph-cuts spatial regularization scheme.
To provide a fair comparison with SIFT we do not use feature points, but instead com-
pute the descriptors densely at a constant scale, over a patch rotated along the epipolar
lines, as we do for Daisy-2D and -3D. For the Stequel algorithm we rectify the images
with [23]. We neither compute nor match descriptors for background pixels. We choose
a wide scene range, as the actual range of our synthetic scenes is very narrow and the
correspondence problem would be too simple. To evaluate and plot the results, we prune
them to the actual scene range (i.e. white or black pixels in the depth map plots are out-
side the actual scene range). We follow this approach for every experiment except for
those on occlusions (5.4), which already feature very richly textured scenes.
The Stequel descriptor proves very vulnerable to noise, and SIFT performs slightly bet-
ter than Daisy.
[Plots: accuracy vs. baseline for the gravel and flowers sequences, and accuracy vs. image noise, at error thresholds of 5% and 10% of the scene range; and accuracy vs. number of frames used (1 to 11) at thresholds of 5%, 10%, and 25% of the range.]
Fig. 8. Benchmarking the spatiotemporal descriptor against occlusions. The image displays the
depth error after three iterations with masks, using the Daisy-3D-3D approach with 11 frames. The
error is expressed in terms of layers: white indicates a correct estimation, light yellow indicates
an error of one layer, which is often due to quantization errors, and further deviations are drawn in
yellow to red. Red also indicates occlusions. Blue indicates a foreground pixel incorrectly labeled
as an occlusion. The plot shows accuracy values at different error thresholds for an increasing
number of frames used to compute the spatiotemporal descriptor.
Fig. 9. Depth reconstruction for five consecutive frames of two very challenging stereo sequences
(rows: reference image, Daisy-3D-3D, Daisy-2D, SIFT, Stequel)
at a higher frame-rate, approximately 100 Hz. We then interpolate the flow data and
run the stereo algorithm at a third of that rate. We do so because even though we have
demonstrated that our spatiotemporal descriptor is very robust against perturbations of
the flow data (see Section 5.3), video sequences of this nature may suffer from the aperture
problem; we believe that in practice this constraint can be relaxed. Even without
ground truth data, simple visual inspection shows that the depth estimate is generally
much more stable in time and follows the expected motion (see Fig. 9).
We have proposed a solution to stereo reconstruction that can handle very challeng-
ing situations, including highly repetitive patterns, noisy images, occlusions, non-rigid
deformations, and large baselines. We have shown that these artifacts can only be ad-
dressed by appropriately exploiting temporal information. Even so, we have not at-
tempted to describe cubic-shaped volumes of data in space and time, as is typically
done in current spatiotemporal stereo methods. In contrast, we have used a state-of-the-
art appearance descriptor to represent all pixels and their neighborhoods at a specific
moment in time, and have propagated this representation through time using optical
flow information. The resulting descriptor is able to capture the non-linear dynamics
that individual pixels undergo over time. Our approach is very robust against optical
flow noise and other artifacts such as mislabeling due to the aperture problem; the spatiotemporal
regularization can cope with these errors. Additionally, objects that suddenly pop
on-screen or show parts that were previously occluded will be considered for matching across
frames where they may not exist; we address this problem by matching warped descriptors
across time.
One of the main limitations of the current descriptor is its high computational cost
and dimensionality. Thus, as future work, we will investigate techniques such as vector
quantization or dimensionality reduction to compress our representation. In addition, we
believe that monocular approaches to non-rigid 3D reconstruction and action recognition
[24, 25] may benefit from our methodology, especially when the scene or the person
is observed from very different points of view. We will also do research along these lines.
We would also like to investigate the application of our approach to scene flow estimation.
We currently do not check the flow estimates for consistency; note that we
perform alternative tests which are also effective: we do check the warped descriptors
for temporal consistency and discard them when they match poorly, and we apply a
global regularization to enforce spatiotemporal consistency. Given our computational
concerns, we could decouple stereo from motion as in [21], while enforcing joint con-
sistency.
References
1. Lowe, D.: Distinctive image features from scale-invariant keypoints. IJCV (2004)
2. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. T. PAMI 27 (2005)
3. Tola, E., Lepetit, V., Fua, P.: Daisy: An efficient dense descriptor applied to wide-baseline stereo. T. PAMI 32 (2010)
4. Scovanner, P., Ali, S., Shah, M.: A 3-dimensional SIFT descriptor and its application to action recognition. In: Int. Conf. on Multimedia (2007)
5. Derpanis, K., Sizintsev, M., Cannons, K., Wildes, R.: Efficient action spotting based on a spacetime oriented structure representation. In: CVPR (2010)
6. Kläser, A., Marszałek, M., Schmid, C.: A spatio-temporal descriptor based on 3D-gradients. In: BMVC (2008)
7. Laptev, I., Lindeberg, T.: Space-time interest points. In: ICCV (2003)
8. Sizintsev, M., Wildes, R.: Spatiotemporal stereo via spatiotemporal quadric element (stequel) matching. In: CVPR (2009)
9. Ke, Y., Sukthankar, R.: PCA-SIFT: A more distinctive representation for local image descriptors. In: CVPR, Washington, USA, pp. 511–517 (2004)
10. Bay, H., Ess, A., Tuytelaars, T., Van Gool, L.: SURF: Speeded up robust features. CVIU (2008)
11. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape contexts. T. PAMI (2002)
12. Kokkinos, I., Yuille, A.: Scale invariance without scale selection. In: CVPR (2008)
13. Moreno-Noguer, F.: Deformation and illumination invariant feature point descriptor. In: CVPR (2011)
14. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. T. PAMI 23, 1222–1239 (2001)
15. Zhang, L., Curless, B., Seitz, S.M.: Spacetime stereo: Shape recovery for dynamic scenes. In: CVPR (2003)
16. Davis, J., Ramamoorthi, R., Rusinkiewicz, S.: Spacetime stereo: A unifying framework for depth from triangulation. In: CVPR (2003)
17. Rodriguez, M., Ahmed, J., Shah, M.: Action MACH: A spatio-temporal maximum average correlation height filter for action recognition. In: CVPR (2008)
18. Carceroni, R., Kutulakos, K.: Multi-view scene capture by surfel sampling: From video streams to non-rigid 3D motion, shape and reflectance. In: ICCV, pp. 60–67 (2001)
19. Zhang, Y., Kambhamettu, C.: On 3D scene flow and structure estimation. In: CVPR, pp. 778–785 (2001)
20. Huguet, F., Devernay, F.: A variational method for scene flow estimation from stereo sequences. In: ICCV, Rio de Janeiro, Brazil (2007)
21. Wedel, A., Rabe, C., Vaudrey, T., Brox, T., Franke, U., Cremers, D.: Efficient Dense Scene Flow from Sparse or Dense Stereo Data. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 739–751. Springer, Heidelberg (2008)
22. Liu, C.: Beyond pixels: Exploring new representations and applications for motion analysis. PhD thesis, MIT (2009)
23. Fusiello, A., Trucco, E., Verri, A.: A compact algorithm for rectification of stereo pairs. Machine Vision and Applications 12, 16–22 (2000)
24. Moreno-Noguer, F., Porta, J.M., Fua, P.: Exploring Ambiguities for Monocular Non-Rigid Shape Estimation. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part III. LNCS, vol. 6313, pp. 370–383. Springer, Heidelberg (2010)
25. Simo-Serra, E., Ramisa, A., Alenyà, G., Torras, C., Moreno-Noguer, F.: Single image 3D human pose estimation from noisy observations. In: CVPR (2012)
Elevation Angle from Reflectance Monotonicity:
Photometric Stereo
for General Isotropic Reflectances
1 Introduction
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 455–468, 2012.
© Springer-Verlag Berlin Heidelberg 2012
456 B. Shi et al.
pixel. We prove that the reflectance monotonicity is maintained only when the
correct elevation angle is found, if the scene is observed under dense directional
lights uniformly distributed over the whole hemisphere.
Based on evaluations using the MERL BRDF database [5] and the conclusion
in [4], we show that many materials such as acrylic, phenolic, metallic paint, and
some shiny plastics can be well approximated by the 1-lobe half-vector BRDF
model. While some other materials, such as fabric, require two or more lobes for
precise representation, in practice our algorithm still shows robustness against
the deviations and performs accurate estimation when a dominant lobe projected
on the half-vector exists. We assess the applicability of our method to 2-lobe BRDFs
as well and verify the effectiveness of the proposed method on various real data.
2 Related Works
The conventional photometric stereo algorithm [1] assumes Lambertian reflectance
and can recover surface normal directions from as few as three images. With additional
images, non-Lambertian phenomena such as shadows can be handled [6].
With more images, various robust techniques can be applied to statistically
handle non-Lambertian outliers. There are approaches based on RANSAC [7],
median filtering [8] and rank minimization [9]. Spatial information has also
been used to robustly solve the problem via expectation maximization
[10]. Some photometric stereo methods explicitly model surface reflectance
with parametric BRDFs. Georghiades [11] uses the Torrance-Sparrow model [12]
for specular fitting, and Goldman et al. [13] build their method on the Ward
model [2].
For surfaces with general reflectance, various properties have been exploited
to estimate surface normals, such as radiance similarity [14] and attached shadow
codes [15]. Higo et al.'s method [16] uses the monotonicity and isotropy of general
diffuse reflectances. Alldrin and Kriegman [3] show that the azimuth angle of the
normal can be reliably estimated for isotropic materials using reflectance symmetry,
if the lights lie on a view-centered ring. Later, their method was extended
to solve for both shape and reflectance by further restricting the BRDF to be
bivariate [17]. Based on the symmetry property, a comprehensive theory is developed
in [18], and a surface reconstruction method using a special lighting rig is
introduced in [19]. By focusing on low-frequency reflectance, the biquadratic
model [20] can also be applied to solve the problem addressed here.
We exploit reflectance monotonicity to compute the per-pixel elevation angle
of the surface normal, given azimuth angles estimated using [3]. Unlike the approach
of [17], which involves complex optimization for iteratively estimating shape and
reflectance, our method avoids such optimization by taking advantage
of the monotonicity of 1-lobe BRDFs.
where A and B are functions of the light-direction components (with denominators
of the form l_{iz}^2 + 2l_{iz} + 1 for i = 1, 2 and numerators involving l_{1x} and l_{2x}),
and \psi = \arctan(A/B) \in [-\pi/2, \pi/2]. We obtain a similar equation for n' by
replacing \theta in Eq. (4) with \theta'.
1 We discuss only the monotonically increasing case in this paper. A similar analysis
applies to the monotonically decreasing case.
the two lighting directions l1 and l2, and independent of \theta and \theta'. Hence, a larger
difference between \theta and \theta' gives a higher probability for an x-swap to happen.
Here, l_{12y} is the y-component of the cross product of l1 and l2. The derivation can
be found in the Appendix. Eq. (6) indicates that the ordering of y depends on
two factors: (1) the sign of (\theta' - \theta), i.e., whether the hypothesized \theta' is larger or smaller
than the true value, and (2) the relative positions of the two lights l1 and l2.
Fig. 2. Lights, half-vectors and normal on the same longitude. (a) \theta \le \pi/4; (b) \theta > \pi/4.
Fig. 3. (a) \psi values for \phi_l = 0 and \theta_{l1}, \theta_{l2} \in [0, \pi/2]; (b) max(\psi) values for \phi_l \in (0, \pi/2]
both farther from v than n, we should have \theta_{l2} > \theta_{l1} (l2 is closer to v than
l1). Hence, l_{12y} < 0; (2) If h1 and h2 are both closer to v than n, we should
have \theta_{l1} > \theta_{l2} (l1 is closer to v than l2, which is a similar case as Fig. 2(a)).
Hence, l_{12y} > 0. When lights are densely distributed along the longitude,
we can always find a pair of lights satisfying these two cases. As a result,
Eq. (6) can hold to cause a y-swap no matter what \theta' is.
Now we know a y-swap might not happen when \theta \le \pi/4 and \theta' > \theta. Next,
we analyze when an x-swap happens. If neither swap happens, an incorrect
estimate \theta' of the elevation angle will not break the monotonicity of the BRDF, and
we cannot tell whether that \theta' is correct or not. Due to the complexity of the
analytic form of \psi, we simulate and plot all \psi values for all combinations of l1
and l2 on the same longitude as n in Fig. 3(a). It is interesting to note that \psi
changes continuously in [-\pi/2, max(\psi)], where max(\psi) \ge -\pi/4. In other words,
-\psi covers the whole range of [-max(\psi), \pi/2]. Recall that the necessary and
sufficient condition for an x-swap to happen is that \theta < -\psi < \theta' (or \theta' < -\psi < \theta).
Thus, if both \theta and \theta' are smaller than -max(\psi), an x-swap will never happen.
Combining the conditions under which neither x-swap nor y-swap happens, we
conclude that [\theta, -max(\psi)] is the degenerate interval for normals with \phi_l = 0,
where the monotonicity of the BRDF is not broken by an incorrect \theta'.
Next, we consider lights along other longitudes, i.e., \phi_l \in (0, \pi/2]. The same
analysis can be applied, and similarly, the monotonicity of the BRDF is not broken by
incorrect \theta' within a corresponding degenerate interval.
4 Solution Method
We perform a 1-D search for \theta' within [0, \pi/2] at each pixel. For each hypothesized
value \theta', we assess the reflectance monotonicity. Specifically, we compute
the hypothesized BRDF values y and evaluate their monotonicity w.r.t. x.
To measure the reflectance monotonicity, we calculate the derivatives (by discrete
differentiation) of y and sum up the absolute values of the negative derivatives,
denoted \Gamma(\theta'):

\[ \Gamma(\theta') = \sum_{x} \max\left( -\frac{dy}{dx},\, 0 \right), \tag{7} \]
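The per-pixel search and the monotonicity measure of Eq. (7) can be sketched as follows. This is a minimal Python illustration; `samples_for` is a hypothetical callback (not from the paper) that, for a hypothesized elevation angle, returns the (x, y) pairs of projected normal-half-vector products and observed intensities:

```python
def gamma(pairs):
    """Eq. (7): sum of the magnitudes of the negative discrete derivatives
    of y w.r.t. x; zero exactly when the (x, y) samples are monotonically
    non-decreasing in x."""
    pts = sorted(pairs)                       # order samples by x
    total = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x1 > x0:
            total += max(-(y1 - y0) / (x1 - x0), 0.0)
    return total

def search_elevation(candidates, samples_for):
    """1-D search over hypothesized elevation angles in [0, pi/2]: keep
    the hypothesis whose BRDF samples are 'most monotonic'."""
    return min(candidates, key=lambda t: gamma(samples_for(t)))
```

Only the correct elevation hypothesis preserves monotonicity under dense hemispherical lighting, so minimizing the penalty over the candidates recovers it.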
[Plots: \Gamma(\theta') curves over \theta' \in [0°, 90°] for example BRDFs of the form f(x) = 1 - e^{-\alpha x} with \alpha \in {0.5, 1, 2}.]
5 Experiments
We evaluate the accuracy of elevation angle estimation using all 100 materials
in the MERL BRDF database [5], with a varying number of lights. Synthetic
experiments using 2-lobe BRDFs are performed to assess the robustness of our
method against deviations from the 1-lobe BRDF assumption. Finally, we
show our normal estimates on real data.
Fig. 5. Elevation angle errors (degrees) on 100 materials (ranked by error in descending
order). Rendered spheres of some representative materials and their BRDF values
in the \rho–n^T h space are shown near the curve. The spheres in the \rho–n^T h plots show
the 1-D projection RMS errors using all directions on the hemisphere (dark blue
means small errors, and red means large errors).
the hemisphere. For the sampling, we use the vertices defined by an icosahedron
with a tessellation order of three. The average elevation angle errors of the 1620
samples for all materials in the MERL BRDF database are summarized in Fig. 5.
The average error over all 100 materials is 0.77° from the ground truth.
We observe that the error is relatively large for materials that cannot be well
represented by a monotonic 1-lobe BRDF with the half-vector as the projection direction,
for example materials A and B in Fig. 5 (see the \rho–n^T h plot above
the rendered spheres). For those materials, the optimal 1-lobe BRDF representations
have a projection direction very different from h (see the spheres
of RMS error distribution in the \rho–n^T h plots). Therefore, if we force them to
project on h and apply our method, the results have relatively larger errors.
[Plot: average elevation angle error vs. number of lights — 3.98° (25 lights), 2.14° (50), 1.37° (100), 0.88° (200), 0.77° (337).]
Fig. 6. Average elevation angle errors (degree) on 100 materials varying with the num-
ber of lighting directions. Next to the curve, the blue dots on spheres show the light
distribution in each case.
Fig. 7. Elevation angle errors (degrees) on 2-lobe BRDFs. (a) The Cook-Torrance model, where k1 is the Lambertian and k2 the specular strength; (b) k1 ρ1(nᵀv) + k2 ρ2(nᵀ(v + 2l)); (c) k1 ρ1(nᵀh) + k2 ρ2(nᵀk), where k is a random direction.
our result on the Cook-Torrance model [21] (roughness m = 0.5) with a Lambertian diffuse lobe and a specular lobe. Note that the specular lobe of the Cook-Torrance model is not centered at h. Nevertheless, our method gives accurate results for different relative strengths of the two lobes. The error in the estimated elevation angles is largest (about 1°) when the BRDF is completely dominated by the specular lobe.
As discussed in [4], some fabric materials contain lobes with projection direction v or (v + 2l). Hence, we deliberately create such a BRDF as k1 ρ1(nᵀv) + k2 ρ2(nᵀ(v + 2l)) and evaluate our method on it. Note that neither lobe is centered at h. Here, ρ1(x) = ρ2(x) = x. In Fig. 7(b), we plot the elevation angle errors for different values of k1, k2. The errors are smaller than 3° for most of the combinations.
Finally, we create a 2-lobe BRDF k1 ρ1(nᵀh) + k2 ρ2(nᵀk) by combining a lobe centered at h with another centered at a random direction k, where k is randomly sampled on the sphere for each pixel under each lighting direction. The elevation angle errors are shown in Fig. 7(c). For more than half of all cases, our method generates errors smaller than 5°. It works reasonably well, with errors smaller than 3°, when the relative strength of the random lobe is below 0.3.
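This robustness test is easy to reproduce in simulation. The sketch below renders per-light intensities under the case-(c) BRDF k1 ρ1(nᵀh) + k2 ρ2(nᵀk), here with ρ1 = ρ2 = max(·, 0), and measures how monotonic the intensities remain in nᵀh, which is the quantity our assumption relies on. The light sampling and the monotonicity score are illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np

def random_sphere(m, rng):
    # uniformly distributed unit directions (normalized Gaussians)
    u = rng.normal(size=(m, 3))
    return u / np.linalg.norm(u, axis=1, keepdims=True)

def monotone_fraction(n, lights, view, k1, k2, rng):
    """Fraction of non-decreasing adjacent steps when intensities from the
    2-lobe BRDF k1*max(n.h, 0) + k2*max(n.k, 0) are sorted by n^T h."""
    h = lights + view
    h = h / np.linalg.norm(h, axis=1, keepdims=True)
    k = random_sphere(len(lights), rng)   # fresh random lobe direction per observation
    I = k1 * np.maximum(h @ n, 0.0) + k2 * np.maximum(k @ n, 0.0)
    order = np.argsort(h @ n)
    return float(np.mean(np.diff(I[order]) >= -1e-12))
```

With k2 = 0 the intensities are perfectly monotonic in nᵀh (fraction 1.0); as the relative strength k2 grows, the fraction drops, mirroring the error growth in Fig. 7(c).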
Elevation Angle from Reflectance Monotonicity 465
Fig. 8. Results on Gourd1 (102) [17] (the number in brackets indicates the number of images in the dataset). (a) One input image; (b), (c) our normal map and Lambertian shading; (d), (e) normal map and shading from Lambertian photometric stereo. Note that (b) and (d) have an average angular difference of about 12°.
Fig. 9. Results on Apple (112) [17]. (a) One input image; (b), (c) our normal map and reconstructed surface; (d), (e) normal map and surface shown in paper [17]. Note that here we use the color mapping ((nx + 1)/2, (ny + 1)/2, nz) → (R, G, B) for comparison purposes. The shapes look different partly because we do not know the exact reconstruction method and rendering parameters used in [17].
We show the normals estimated by our method on real data. We visualize the estimated normals by linearly encoding the x, y and z components of the normals in the RGB channels of an image (except for the Apple data in Fig. 9, for comparison purposes). First, in Fig. 8, our method is compared with Lambertian photometric stereo [1] by showing the estimated normals and the Lambertian shadings calculated from the estimated normals under the same lighting direction as the input image. For such non-Lambertian materials, our method gives much more reasonable normal estimates. Next, we compare with the method in [17] by showing the surface reconstructions (obtained with the method in [22]) from the estimated normals in Fig. 9(b). Due to the lack of ground truth, we cannot make a quantitative comparison, but qualitatively our method shows results similar to those of [17]. In terms of complexity, our method is much simpler and computationally inexpensive for deriving the elevation angle.
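The reconstructions above integrate the normal map into a surface using the shapelet-based method of [22]. As a self-contained illustration of normal integration in general, here is a minimal Fourier-domain sketch in the style of Frankot and Chellappa; it is a substitute for illustration only, not the method of [22], and assumes an orthographic camera and periodic boundaries.

```python
import numpy as np

def integrate_normals(nx, ny, nz):
    """Least-squares surface from a normal map via the Fourier domain.
    Gradients are p = -nx/nz, q = -ny/nz (orthographic camera assumed)."""
    p, q = -nx / nz, -ny / nz
    h, w = p.shape
    wx = np.fft.fftfreq(w) * 2.0 * np.pi
    wy = np.fft.fftfreq(h) * 2.0 * np.pi
    u, v = np.meshgrid(wx, wy)
    P, Q = np.fft.fft2(p), np.fft.fft2(q)
    denom = u**2 + v**2
    denom[0, 0] = 1.0              # avoid division by zero at DC
    Z = (-1j * u * P - 1j * v * Q) / denom
    Z[0, 0] = 0.0                  # absolute height is unrecoverable; fix DC to 0
    return np.real(np.fft.ifft2(Z))
```

For a band-limited periodic surface the recovery is exact up to an additive constant, which is why reconstructions from normals are usually compared after removing the mean height.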
Finally, we show our estimated normals on other materials with various reflectances, such as plastic, metal and paint, in Fig. 10. The datasets on the left side are from [17] and [23]; those on the right were captured by ourselves with a Sony XCD-X710CR camera. Comparing the input images with the Lambertian shadings indicates that our estimated normals are of high accuracy. The noisy points observed in the results are mainly caused by pixels with very dark intensities and hence low signal-to-noise ratios.
466 B. Shi et al.
Fig. 10. Real data results. Left part: Gourd2 (98) [17], Helmet side right (119) (we use only 119 of the 253 images in the original dataset) [23], Kneeling Knight (119) [23]. Right part: Pear (77), God (57), Dinosaur (118). For each dataset we show one input image, the estimated normals and the Lambertian shadings.
6 Conclusions
We have shown a method for estimating the elevation angles of surface normals by exploiting reflectance monotonicity, given the azimuth angles. We assume the BRDF contains a dominant monotonic lobe projected on the half-vector, and prove that the elevation angle can be uniquely determined under dense lights uniformly distributed on the hemisphere. In synthetic experiments, we first demonstrate the accuracy of our method on a broad category of materials. We further evaluate its robustness to deviations from our assumption about BRDFs. Various real-data experiments also show the effectiveness of the proposed method.
Acknowledgement. The authors would like to thank Neil Alldrin for providing the code of [3] and Manmohan Chandraker for helpful discussions. Boxin Shi and Katsushi Ikeuchi were partially supported by the Digital Museum Project, Ministry of Education, Culture, Sports, Science & Technology in Japan. Ping Tan was supported by the Singapore ASTAR grant R-263-000-592-305 and MOE grant R-263-000-555-112.
References
1. Woodham, R.: Photometric method for determining surface orientation from multiple images. Optical Engineering 19, 139–144 (1980)
2. Ward, G.: Measuring and modeling anisotropic reflection. Computer Graphics 26, 265–272 (1992)
3. Alldrin, N., Kriegman, D.: Toward reconstructing surfaces with arbitrary isotropic reflectance: A stratified photometric stereo approach. In: Proc. of International Conference on Computer Vision, ICCV (2007)
4. Chandraker, M., Ramamoorthi, R.: What an image reveals about material reflectance. In: Proc. of International Conference on Computer Vision, ICCV (2011)
5. Matusik, W., Pfister, H., Brand, M., McMillan, L.: A data-driven reflectance model. In: ACM Trans. on Graph. (Proc. of SIGGRAPH) (2003)
6. Barsky, S., Petrou, M.: The 4-source photometric stereo technique for three-dimensional surfaces in the presence of highlights and shadows. IEEE Trans. Pattern Anal. Mach. Intell. 25, 1239–1252 (2003)
7. Mukaigawa, Y., Ishii, Y., Shakunaga, T.: Analysis of photometric factors based on photometric linearization. Journal of the Optical Society of America 24, 3326–3334 (2007)
8. Miyazaki, D., Hara, K., Ikeuchi, K.: Median photometric stereo as applied to the Segonko tumulus and museum objects. International Journal of Computer Vision 86, 229–242 (2010)
9. Wu, L., Ganesh, A., Shi, B., Matsushita, Y., Wang, Y., Ma, Y.: Robust Photometric Stereo via Low-Rank Matrix Completion and Recovery. In: Kimmel, R., Klette, R., Sugimoto, A. (eds.) ACCV 2010, Part III. LNCS, vol. 6494, pp. 703–717. Springer, Heidelberg (2011)
10. Wu, T., Tang, C.: Photometric stereo via expectation maximization. IEEE Trans. Pattern Anal. Mach. Intell. 32, 546–560 (2010)
11. Georghiades, A.: Incorporating the Torrance and Sparrow model of reflectance in uncalibrated photometric stereo. In: Proc. of International Conference on Computer Vision, ICCV (2003)
12. Torrance, K., Sparrow, E.: Theory for off-specular reflection from roughened surfaces. Journal of the Optical Society of America 57, 1105–1114 (1967)
13. Goldman, D., Curless, B., Hertzmann, A., Seitz, S.: Shape and spatially-varying BRDFs from photometric stereo. IEEE Trans. Pattern Anal. Mach. Intell. 32, 1060–1071 (2010)
14. Sato, I., Okabe, T., Yu, Q., Sato, Y.: Shape reconstruction based on similarity in radiance changes under varying illumination. In: Proc. of International Conference on Computer Vision, ICCV (2007)
15. Okabe, T., Sato, I., Sato, Y.: Attached shadow coding: estimating surface normals from shadows under unknown reflectance and lighting conditions. In: Proc. of International Conference on Computer Vision, ICCV (2009)
16. Higo, T., Matsushita, Y., Ikeuchi, K.: Consensus photometric stereo. In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition, CVPR (2010)
17. Alldrin, N., Zickler, T., Kriegman, D.: Photometric stereo with non-parametric and spatially-varying reflectance. In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition, CVPR (2008)
18. Tan, P., Quan, L., Zickler, T.: The geometry of reflectance symmetries. IEEE Trans. Pattern Anal. Mach. Intell. 33, 2506–2520 (2011)
19. Chandraker, M., Bai, J., Ramamoorthi, R.: A theory of photometric reconstruction for unknown isotropic reflectances. In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition, CVPR (2011)
20. Shi, B., Tan, P., Matsushita, Y., Ikeuchi, K.: A biquadratic reflectance model for radiometric image analysis. In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition, CVPR (2012)
21. Cook, R., Torrance, K.: A reflectance model for computer graphics. ACM Trans. on Graph. 1, 7–24 (1982)
22. Kovesi, P.: Shapelets correlated with surface normals produce surfaces. In: Proc. of International Conference on Computer Vision, ICCV (2005)
23. Einarsson, P., et al.: Relighting human locomotion with flowed reflectance fields. In: Proc. of Eurographics Symposium on Rendering (2006)
Therefore, we have

    sin(θ + θ1) / sin(θ' + θ1) > sin(θ + θ2) / sin(θ' + θ2) > 0.

This indicates that the sign of sin(θ1 − θ2) is the same as that of l1x l2z − l1z l2x = (l2 × l1)y = l12y. Therefore, the inequality of Eq. (6) holds.
Local Log-Euclidean Covariance Matrix (L2 ECM)
for Image Representation and Its Applications
1 Introduction
Characterizing local image properties has attracted great research interest in recent years [1]. Local descriptors can be used sparsely, to describe regions of interest (ROIs) extracted by region detectors [2], or densely at grid points for image representation [3,4]. They play a fundamental role in the success of many mid-level and high-level vision tasks, e.g., image segmentation, scene or texture classification, image or video retrieval, and person recognition. It is challenging to design feature descriptors that are highly distinctive yet robust to photometric or geometric transformations.
The motivation of this paper is to present a general kind of image descriptor for representing local image properties by fusing multiple cues. We are inspired by the structure tensor method [5,6] and Diffusion Tensor Imaging (DTI) [7], both of which concern tensor-valued (matrix-valued) images and enjoy important applications in image and medical image processing. The former computes the second-order moment of image gradients at every pixel, while the latter associates with
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 469–482, 2012.
© Springer-Verlag Berlin Heidelberg 2012
470 P. Li and Q. Wang
each voxel a 3-D symmetric matrix describing the molecular mobility along three directions and the correlations between those directions. Our methodology computes a pixel-wise covariance matrix representing the local correlation of multiple image cues. This leads to the model of the Local Log-Euclidean Covariance Matrix, L2 ECM. It can fuse multiple image cues, e.g., intensity, lower- or higher-order gradients, spatial coordinates, texture, etc. In addition, it is insensitive to scale and rotation changes as well as illumination variation. A comparison of the structure tensor, DTI and L2 ECM is presented in Table 1. Our model can be seen as a generalization of the structure tensor method; it can also be interpreted as a kind of imaging method in which each pixel is associated with the logarithm of a covariance matrix, describing local image properties through the correlation of multiple image cues.
Fig. 1(a) outlines the modeling methodology of L2 ECM. Given an image I(x, y), we extract raw features for all pixels, obtaining a feature map f(x, y) which may contain intensity, lower- or higher-order derivatives, spatial coordinates, texture, etc. Then, for every pixel, one covariance matrix is computed from the raw features in the neighboring region around this pixel. We thus obtain a tensor-valued (matrix-valued) image C(x, y). By computing the matrix logarithm of each covariance matrix followed by half-vectorization, we obtain the vector-valued L2 ECM image, which can be handled with Euclidean operations. For illustration purposes, 3-dimensional (3-D) raw features are used, comprising the intensity and the absolute values of the partial derivatives of image I w.r.t. x and y; the resulting L2 ECM features are 6-D and the corresponding slices are shown at the bottom-right of Fig. 1(a). Refer to Section 3 for details on L2 ECM features.
Covariance matrices are symmetric and positive-definite (SPD); the space of SPD matrices is not a Euclidean space but a smooth Riemannian manifold. In the Log-Euclidean framework [8], the SPD matrices form a commutative Lie group which is equipped with a Euclidean structure. This framework enables us to compute the logarithms of SPD matrices, which can then be flexibly and efficiently handled with common Euclidean operations. Our technique avoids complicated and computationally expensive operations such as geodesic distances, intrinsic mean computation or statistical modeling directly on the Riemannian manifold. Instead, we can conveniently operate in Euclidean space.
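For concreteness, here is what those Euclidean operations look like once SPD matrices are mapped into the log domain. This sketch, with function names of our own choosing, computes the matrix logarithm by eigendecomposition and a Log-Euclidean mean (the ordinary arithmetic mean in the log domain, mapped back by the matrix exponential), which is exactly the kind of statistic that is expensive under the affine-Riemannian metric.

```python
import numpy as np

def logm_spd(S):
    # matrix logarithm of a symmetric positive-definite matrix via eigendecomposition
    w, V = np.linalg.eigh(S)
    return (V * np.log(w)) @ V.T

def expm_sym(L):
    # matrix exponential of a symmetric matrix (inverse of logm_spd on SPD inputs)
    w, V = np.linalg.eigh(L)
    return (V * np.exp(w)) @ V.T

def log_euclidean_mean(mats):
    # Euclidean mean in the log domain, mapped back to an SPD matrix
    return expm_sym(np.mean([logm_spd(S) for S in mats], axis=0))
```

The mean of I and e²·I under this metric is e·I, i.e., geometric rather than arithmetic averaging of the eigenvalues, which is the behavior that makes the Log-Euclidean framework attractive for covariance matrices.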
Fig. 1. Overview of L2 ECM (3-D raw features are used for illustration). (a) shows the modeling methodology of L2 ECM. Given an image I(x, y), the raw feature image f(x, y) is first extracted; then the tensor-valued image C(x, y) is obtained by computing the covariance matrix for every pixel; after taking the logarithm of C(x, y), the symmetric matrix log C(x, y) is vectorized to get the 6-D vector-valued image denoted by vlogC(x, y), slices of which are shown at the bottom-right. Refer to Section 3 for details on L2 ECM. (b) shows the modeling methodology of Tuzel et al. [9], where only one global covariance matrix is computed for the overall image of interest.
The covariance matrix as a region descriptor was first proposed by Tuzel et al. [9], whose modeling methodology is illustrated in Fig. 1(b). The main difference is their global modeling versus our local one: the former computes one covariance matrix for the overall image region of interest, while the latter computes a covariance matrix for every pixel. Moreover, Tuzel et al. employed the affine-Riemannian metric [10], which is known to have high computational cost for geodesic distances and statistics, e.g., the intrinsic mean, the second-order moment, mixture models or Principal Component Analysis (PCA), in the Riemannian space [11]. This may make the modeling of pixel-wise covariance matrices prohibitively inefficient. By contrast, we employ the Log-Euclidean metric [8], which transforms the operations from the Riemannian space into a Euclidean one, so that the above-mentioned limitations are overcome.
The remainder of this paper is organized as follows. Section 2 reviews the
papers related to our work. Section 3 describes in detail the L2 ECM features
2 Related Work
where || · || is the Euclidean norm in the vector space S(n). This bi-invariant metric is called the Log-Euclidean metric, and it is invariant under similarity transformations. For a real number λ, define the logarithmic scalar multiplication between λ and an SPD matrix S
where | · | denotes the absolute value, and Ix (resp. Ixx) and Iy (resp. Iyy) denote the first (resp. second)-order partial derivatives with respect to x and y, respectively. Other image cues, e.g., spatial coordinates, gradient orientation or texture features, can also be included in the raw features. For a color image, the gray-levels of the three channels may be combined as well.
Provided with the raw feature vectors, we can obtain a tensor-valued image by computing the covariance matrix function C(x, y) at every pixel

    C(x, y) = (1 / (N_{S_r} − 1)) Σ_{(x′,y′) ∈ S_r(x,y)} (f(x′, y′) − f̄(x, y)) (f(x′, y′) − f̄(x, y))^T    (7)

where

    f̄(x, y) = (1 / N_{S_r}) Σ_{(x′,y′) ∈ S_r(x,y)} f(x′, y′)
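Eq. (7) can be implemented naively as follows; this is a brute-force sketch for clarity (the integral-image computation discussed later is the fast path), and the clipping of windows at image borders is an assumption of the sketch.

```python
import numpy as np

def covariance_map(F, r):
    """F: (h, w, n) raw-feature image. Returns the (h, w, n, n) tensor-valued
    image C(x, y) of Eq. (7), using square windows S_r of half-size r
    (clipped at the image border)."""
    h, w, n = F.shape
    C = np.zeros((h, w, n, n))
    for y in range(h):
        for x in range(w):
            ys = slice(max(0, y - r), min(h, y + r + 1))
            xs = slice(max(0, x - r), min(w, x + r + 1))
            patch = F[ys, xs].reshape(-1, n)
            # np.cov uses the 1/(N-1) normalization, matching Eq. (7)
            C[y, x] = np.cov(patch, rowvar=False)
    return C
```

A constant feature image yields zero covariance everywhere, and each C(y, x) agrees with computing Eq. (7) by hand on the corresponding window.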
this violation is a rare event. As r increases, local image structure at larger scales is captured. A small r may help capture fine local structure, but may result in a singular covariance matrix due to an insufficient number of samples for estimation. In our experiments r = 16 is appropriate for most applications. The covariance matrix is robust to noise and insensitive to changes of scale, rotation and illumination [9].
We wish to exploit these covariance matrices as fundamental features for vision applications. It is known that the affine-Riemannian framework involves intensive computations of matrix square roots, matrix inverses, and matrix exponentials and logarithms. Hence, we utilize the Log-Euclidean framework: C(x, y) in the commutative Lie group SPD(n) is mapped by the matrix logarithm to log C(x, y) in the vector space S(n). log C(x, y) can then be handled with Euclidean operations, and the intensive computations involved in the affine-Riemannian framework are avoided. This also greatly facilitates further analysis and modeling of the SPD matrices.
From the tensor-valued image C(x, y) ∈ SPD(n), we compute the logarithm of the covariance matrix C(x, y) according to Eq. (2). log C(x, y) is a symmetric matrix in a Euclidean space, i.e., log C(x, y) ∈ S(n). Because of its symmetry, we perform half-vectorization of log C(x, y), denoted by vlogC(x, y), i.e., we pack into a vector, in column order, the upper triangular part of log C(x, y). The final L2 ECM feature descriptor can thus be represented as

    vlogC(x, y) = ( vlogC_1(x, y)  vlogC_2(x, y)  ...  vlogC_{n(n+1)/2}(x, y) )    (8)
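A direct rendition of this log-then-half-vectorize step follows. The ordering of the packed upper triangle and the absence of any √2 weighting on off-diagonal entries (a convention some covariance-descriptor works use to preserve the Frobenius norm) are choices made for this sketch.

```python
import numpy as np

def l2ecm_feature(C):
    """Matrix logarithm of an SPD covariance C, then half-vectorization of the
    upper triangle, giving a vector of length n(n+1)/2 as in Eq. (8)."""
    w, V = np.linalg.eigh(C)
    L = (V * np.log(w)) @ V.T          # log C, a symmetric matrix
    iu = np.triu_indices(L.shape[0])   # upper-triangle indices
    return L[iu]
```

For n = 2 the feature is 3-D and for n = 3 it is 6-D, matching the dimensions quoted for the fast closed-form case below.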
The covariance matrices can be computed efficiently via integral images [9]. The computational complexity of constructing the integral images is O(|Ω| n²), where |Ω| is the image size. The complexity of the eigen-decomposition of all covariance matrices is O(|Ω| n³). In practice, n = 2–5 may suffice for most problems. In particular, when n = 2 or n = 3, the eigen-decomposition of the covariance matrices can be obtained analytically; hence, the L2 ECM features, which are 3-D (n = 2) or 6-D (n = 3), can be computed quickly through integral images and the closed-form eigen-decomposition.
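The integral-image trick of [9] maintains running sums of f and of f fᵀ, so that the covariance of any rectangular region comes from a constant number of lookups. A minimal sketch (for clarity the integral images are rebuilt per call; in practice they are built once per image, which is the O(|Ω| n²) cost quoted above):

```python
import numpy as np

def region_covariance(F, y0, x0, y1, x1):
    """Covariance of the features in F[y0:y1, x0:x1] via two integral images,
    one of f and one of f f^T."""
    h, w, n = F.shape
    P = np.zeros((h + 1, w + 1, n))
    P[1:, 1:] = np.cumsum(np.cumsum(F, axis=0), axis=1)
    Q = np.zeros((h + 1, w + 1, n, n))
    Q[1:, 1:] = np.cumsum(np.cumsum(F[..., :, None] * F[..., None, :], axis=0), axis=1)
    N = (y1 - y0) * (x1 - x0)
    s = P[y1, x1] - P[y0, x1] - P[y1, x0] + P[y0, x0]    # sum of f over the region
    ss = Q[y1, x1] - Q[y0, x1] - Q[y1, x0] + Q[y0, x0]   # sum of f f^T over the region
    return (ss - np.outer(s, s) / N) / (N - 1)
```

The four-corner lookup is the standard integral-image identity; the final line is the textbook rewriting of Eq. (7) in terms of raw sums.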
The L2 ECM may be used in a number of ways:
[Figure: miss rate versus false positives per window (FPPW), comparing HOG with a linear SVM, HOG with a kernel SVM, Tuzel et al., and ours.]
the dataset does not include scale, viewpoint or illumination changes, it includes non-homogeneous textures which pose difficulty for classification.
For each texture, the corresponding 640×640 image is divided into 49 overlapping blocks, each of 160×160 pixels with a block stride of 80. Twenty-four images of each class are randomly selected for training and the remaining ones for testing. For each image, we first compute the L2 ECM feature image; it is then divided into four 80×80 patches, for which covariance matrices are computed; the resulting four covariance matrices are subjected to the logarithm operation and half-vectorization. When testing, each covariance matrix of the testing image is compared to all covariance matrices of the training set and its label is determined by the KNN algorithm (k = 5). The majority vote of the four matrices associated with the testing image determines its classification. To avoid bias, the experiments are repeated twenty times and the result is reported as the average value and standard deviation over the twenty runs.
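The classification protocol just described reduces to nearest-neighbour search plus two rounds of voting. The sketch below uses hypothetical names; the feature vectors stand for half-vectorized log-covariances compared under the Euclidean norm.

```python
import numpy as np
from collections import Counter

def knn_label(query, train_feats, train_labels, k=5):
    # Euclidean distance on vlog features; label of the k nearest training vectors
    d = np.linalg.norm(train_feats - query, axis=1)
    nn = np.argsort(d)[:k]
    return Counter(train_labels[nn]).most_common(1)[0][0]

def classify_image(block_feats, train_feats, train_labels, k=5):
    # each test image yields four block features; a majority vote decides the class
    votes = [knn_label(f, train_feats, train_labels, k) for f in block_feats]
    return Counter(votes).most_common(1)[0][0]
```

Operating on log-covariance vectors is what licenses the plain Euclidean distance here; on raw SPD matrices a geodesic distance would be needed instead.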
Fig. 3(a) shows the classification accuracy of our method and the following six texture classification approaches: Harris detector + Laplacian detector + SIFT descriptor + SPIN descriptor (HLSS) [29], Lazebnik's method [30], VZ-joint [31], Jalba's method [32], Hayman's method [28], and Tuzel et al.'s method [9]. The results of the other methods are reproduced from their respective papers. Our method obtains the best classification accuracy; the random covariances method of Tuzel et al. obtains the second-best accuracy but does not report a standard deviation. These two are far better than the remaining ones.
KTH-TIPS Database [28]. This dataset is challenging in the sense that it contains varying scale, illumination and pose. There are 10 texture classes, each represented by 81 image samples. The size of the samples is 200×200 pixels.
The classification method is similar to that used for the Brodatz database. For each sample, the L2 ECM feature image is computed (r = 32). Every feature image is uniformly divided into four blocks and a global covariance matrix is computed
[Figure: classification accuracy versus number of training images, comparing GG (mean+std), Lazebnik, VZ-joint, Hayman, (HS+LS)(SIFT+SPIN), and ours.]
Fig. 3. Texture classification in the Brodatz (a) and KTH-TIPS (b) databases
on each block. The logarithms of these covariance matrices are subjected to half-vectorization.
Fig. 3(b) shows the classification accuracy against the number of training images. For comparison, the classification results of the following methods are also shown [29]: Lazebnik's method [30], VZ-joint [31], Hayman's method [28], global Gabor filters (GG) [33], and Harris detector + Laplacian detector + SIFT descriptor + SPIN descriptor ((HS+LS)(SIFT+SPIN)) [29]. The classification accuracy of our method is much better than all the others.
Fig. 4. Covariance matrix computation in the L2 ECM (left) and Tuzel (right) trackers
In the sequence, the object undergoes pose variation, and nearby cars in the background have appearances similar to the object's. Both trackers succeed in following the object throughout the sequence. As seen in Table 2, the L2 ECM tracker has smaller tracking errors than the Tuzel tracker. Some sample tracking results are shown in Fig. 5 (top panel).
Face Sequence. In the face sequence (1480 frames, size: 320×240) the object undergoes large illumination changes [34]. We track the object every four frames (370 frames in total) because the object motion between consecutive frames is very small. Though both trackers can follow the object across the sequence, the L2 ECM tracker has a much smaller tracking error than the Tuzel tracker. The average error is given in Table 2. Fig. 5 (middle panel) shows some sample tracking results.
Table 2. Comparison of average tracking errors (mean±std) and number of successful frames vs. total frames

Image seq.  Method   Dist. err. (pixels)  Succ. frames
Car seq.    Tuzel    10.76±5.72           190/190
            L2 ECM   7.55±3.38            190/190
Face seq.   Tuzel    20.44±13.87          370/370
            L2 ECM   4.45±3.87            370/370
Mall seq.   Tuzel    30.92±16.58          116/190
            L2 ECM   17.67±10.45          190/190
Fig. 5. Tracking results in the car sequence (top panel), face sequence (middle panel) and mall sequence (bottom panel). In each panel, the results of the Tuzel tracker and the L2 ECM tracker are shown in the first and second rows, respectively.
5 Conclusion
The L2 ECM is analogous to the structure tensor and the diffusion tensor; it can be seen as an imaging technique that produces novel multi-channel images capturing the local correlation of multiple cues in the original image. We compute from the original image a tensor-valued image in which each point is associated with a locally computed covariance matrix. Through the matrix logarithm and half-vectorization we obtain the vector-valued L2 ECM feature image. The theoretical foundation of L2 ECM is the Log-Euclidean framework, which endows the commutative Lie group formed by the SPD matrices with a linear space structure. This enables common Euclidean operations on covariance matrices in the logarithmic domain while preserving their geometric structure.
References
1. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. IEEE Trans. Pattern Anal. Mach. Intell. 27(10), 1615–1630 (2005)
2. Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffalitzky, F., Kadir, T., Van Gool, L.: A comparison of affine region detectors. Int. J. Comput. Vision 65(1/2), 43–72 (2005)
3. Liu, C., Yuen, J., Torralba, A.: SIFT flow: Dense correspondence across scenes and its applications. IEEE Trans. Pattern Anal. Mach. Intell. 33(5), 978–994 (2011)
4. Bosch, A., Zisserman, A., Muñoz, X.: Scene Classification via pLSA. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954, pp. 517–530. Springer, Heidelberg (2006)
5. Knutsson, H.: Representing local structure using tensors. In: Proc. 6th Scandinavian Conf. on Image Analysis, pp. 244–251 (1989)
6. Weickert, J., Hagen, H. (eds.): Visualization and Image Processing of Tensor Fields. Springer, Heidelberg (2006)
7. Le Bihan, D., Mangin, J., Poupon, C., Clark, C., Pappata, S., Molko, N.: Diffusion tensor imaging: Concepts and applications. J. Magn. Reson. Imaging 66, 534–546 (2001)
8. Arsigny, V., Fillard, P., Pennec, X., Ayache, N.: Geometric means in a novel vector space structure on symmetric positive-definite matrices. SIAM J. Matrix Anal. Appl. (2006)
9. Tuzel, O., Porikli, F., Meer, P.: Region Covariance: A Fast Descriptor for Detection and Classification. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 589–600. Springer, Heidelberg (2006)
10. Pennec, X., Fillard, P., Ayache, N.: A Riemannian framework for tensor computing. Int. J. Comput. Vision, pp. 41–66 (2006)
11. Arsigny, V., Fillard, P., Pennec, X., Ayache, N.: Fast and Simple Calculus on Tensors in the Log-Euclidean Framework. In: Duncan, J.S., Gerig, G. (eds.) MICCAI 2005. LNCS, vol. 3749, pp. 115–122. Springer, Heidelberg (2005)
12. Ke, Y., Sukthankar, R.: PCA-SIFT: a more distinctive representation for local image descriptors. In: Proc. Int. Conf. Comp. Vis. Patt. Recog., pp. 506–513 (2004)
13. Lowe, D.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision 60, 91–110 (2004)
14. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: Proc. Int. Conf. Comp. Vis. Patt. Recog., pp. 886–893 (2005)
15. Bay, H., Ess, A., Tuytelaars, T., Van Gool, L.: Speeded-up robust features (SURF). Comput. Vis. Image Underst. 110, 346–359 (2008)
16. Tola, E., Lepetit, V., Fua, P.: DAISY: An efficient dense descriptor applied to wide-baseline stereo. IEEE Trans. Pattern Anal. Mach. Intell. 32(5), 815–830 (2010)
17. Tuzel, O., Porikli, F., Meer, P.: Human detection via classification on Riemannian manifolds. In: Proc. Int. Conf. Comp. Vis. Patt. Recog., pp. 1–8 (2007)
18. Porikli, F., Tuzel, O., Meer, P.: Covariance tracking using model update based on Lie algebra. In: Proc. Int. Conf. Comp. Vis. Patt. Recog., pp. 728–735 (2006)
19. Li, X., Hu, W., Zhang, Z., Zhang, X., Zhu, M., Cheng, J.: Visual tracking via incremental Log-Euclidean Riemannian subspace learning. In: Proc. Int. Conf. Comp. Vis. Patt. Recog., pp. 1–8 (2008)
20. Gong, L., Wang, T., Liu, F.: Shape of Gaussians as feature descriptors. In: Proc. Int. Conf. Comp. Vis. Patt. Recog., pp. 2366–2371 (2009)
21. Nakayama, H., Harada, T., Kuniyoshi, Y.: Global Gaussian approach for scene categorization using information geometry. In: Proc. Int. Conf. Comp. Vis. Patt. Recog., pp. 2336–2343 (2010)
22. Sivalingam, R., Boley, D., Morellas, V., Papanikolopoulos, N.: Tensor Sparse Coding for Region Covariances. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part IV. LNCS, vol. 6314, pp. 722–735. Springer, Heidelberg (2010)
23. Fletcher, P.T., Joshi, S.: Principal Geodesic Analysis on Symmetric Spaces: Statistics of Diffusion Tensors. In: Sonka, M., Kakadiaris, I.A., Kybic, J. (eds.) CVAMIA/MMBIA 2004. LNCS, vol. 3117, pp. 87–98. Springer, Heidelberg (2004)
24. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based object tracking. IEEE Trans. Pattern Anal. Mach. Intell. 25(5), 564–575 (2003)
25. Wang, S., Lu, H., Yang, F., Yang, M.H.: Superpixel tracking. In: Int. Conf. on Comp. Vis., pp. 1323–1330 (2011)
26. de Luis García, R., Westin, C.F., Alberola-López, C.: Gaussian mixtures on tensor fields for segmentation: Applications to medical imaging. Comp. Med. Imag. and Graph. 35(1), 16–30 (2011)
27. Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 27:1–27:27 (2011)
28. Hayman, E., Caputo, B., Fritz, M., Eklundh, J.-O.: On the Significance of Real-World Conditions for Material Classification. In: Pajdla, T., Matas, J. (eds.) ECCV 2004, Part IV. LNCS, vol. 3024, pp. 253–266. Springer, Heidelberg (2004)
29. Zhang, J., Marszalek, M., Lazebnik, S., Schmid, C.: Local features and kernels for classification of texture and object categories: A comprehensive study. Int. J. Comput. Vision 73, 213–238 (2007)
30. Lazebnik, S., Schmid, C., Ponce, J.: A sparse texture representation using local affine regions. IEEE Trans. Pattern Anal. Mach. Intell. 27, 1265–1278 (2005)
31. Varma, M., Zisserman, A.: Texture classification: Are filter banks necessary? In: Proc. Int. Conf. Comp. Vis. Patt. Recog. (2003)
32. Jalba, A., Roerdink, J., Wilkinson, M.: Morphological hat-transform scale spaces and their use in pattern classification. Pattern Recognition 37, 901–915 (2004)
33. Manjunath, B., Ma, W.: Texture features for browsing and retrieval of image data. IEEE Trans. Pattern Anal. Mach. Intell., 837–842 (1996)
34. Ross, D., Lim, J., Yang, M.-H.: Adaptive Probabilistic Visual Tracking with Incremental Subspace Update. In: Pajdla, T., Matas, J. (eds.) ECCV 2004, Part II. LNCS, vol. 3022, pp. 470–482. Springer, Heidelberg (2004)
35. Leichter, I., Lindenbaum, M., Rivlin, E.: Tracking by affine kernel transformations using color and boundary cues. IEEE Trans. Pattern Anal. Mach. Intell. 31(1), 164–171 (2009)
Ensemble Partitioning
for Unsupervised Image Categorization
Dengxin Dai, Mukta Prasad, Christian Leistner, and Luc Van Gool
1 Introduction
Image categorization is an intensely studied eld with many applications. In the
last decade, advances in representations and supervised learning methods have
led to great progress [1,2,3,4]. However, the explosion of available visual data,
the cost of annotation, and the user-specic bias of annotations have resulted
in an increased focus on learning with less supervision. While powerful methods
for semi-supervised [5] or weakly-supervised learning [6,7] have been proposed,
unsupervised methods have been studied less.
In this paper, we are interested in automatically discovering image categories from an unordered image collection. This task is hard, as objects and scenes can appear with clutter, large variations in pose and lighting, high intra-class variance and low inter-class variance. The recent success of discriminative classifiers in supervised image categorization [1,2,4] suggests that one possible way to make progress in unsupervised image categorization is to construct a similar scenario in which the power of discriminative classifiers can be leveraged. Since obtaining a pure training set automatically is infeasible, in this paper, we
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 483–496, 2012.
© Springer-Verlag Berlin Heidelberg 2012
484 D. Dai et al.
Fig. 1. The pipeline of our method for an image collection comprising four categories: face, car, panda and duck. The pseudo-labels in the weak training (WT) sets are assigned by our random walk sampling scheme. The image collection is partitioned by the classifiers learned from the WT sets and the results are stored in binary proximity matrices. All of the binary matrices are then averaged into the ensemble proximity matrix (EnProx), which serves as input for spectral methods. As expected, the partitionings induced by the WT sets contain errors, but their errors differ, which allows the underlying truth to be discovered by ensemble learning techniques.
object categories from an unordered image collection. Aspect models were originally
developed for content analysis in textual document collections, but they
perform very well for images as well [10]. Later on, [11] used Hierarchical Latent
Dirichlet Allocation to automatically discover object class hierarchies. For scene
category discovery, Dai et al. [12] proposed to combine information projection
and cluster sampling. All these methods share assumptions about the sample
distribution. Image categories, nevertheless, are arranged in complex and widely
diverging shapes, making the design of explicit models difficult. An alternative
strand, which is more versatile in handling structured data, builds on similarity-based
methods. Frey and Dueck [13] applied the affinity propagation algorithm
[14] for unsupervised image categorization. Grauman and Darrell [15] developed
partially matching image features to compute image similarity and used spectral
methods for image clustering. The main difficulty of this strand is how to measure
image similarity as the semantic level goes up. For a more detailed survey,
we refer readers to [16].
The method closest to ours is that of Gomes et al. [17]. They also propose a
framework for learning discriminative classifiers while clustering the data, called
regularized information maximization. Yet, the method still differs significantly
from ours. In particular, [17] finds a partitioning of images that results in a
good discriminative classifier, which is evaluated by three criteria: class separation,
class balance and classifier complexity. Our method, however, more closely
follows the training-test paradigm; that is, sample a series of weak training sets,
learn knowledge from these sets and classify the whole dataset with this learned
knowledge. Our method also shares similarities with the method of Lee and
Grauman [18]. They proposed to use curriculum learning (CL) [19] for unsupervised
object discovery. CL imitates the learning process of people; that is, using
a pre-defined learning schedule, it starts by learning the easiest concepts first
and gradually increases the complexity of concepts [19]. Besides technical
differences, the main benefit of our work compared to [19] is that we do not rely
on pre-selected learning schedules.
As we already stated, our method builds on the ensemble learning principle [8].
Ensemble methods build a committee of (usually weak) base learners that overall
performs better than its individual parts. Thereby, they benefit from the
strength and diversity of their base learners. Popular ensemble methods that
have been successfully applied in computer vision are boosting [20], bagging [21]
and random forests [22]. There has been previous work using ensemble methods
for clustering, e.g., [23,24]. The main difference of our work from [23,24] is that
we focus on high-level categories. [23,24] make basic assumptions about the
underlying distributions, which are hard to satisfy for our task and hence make
these methods applicable only to rather low-level clustering tasks. In contrast,
we propose to learn category distinctions from sampled training data. We
also differ from previous approaches in the way we combine our base learners. In
particular, [24] used a boosting framework and [23] used hierarchical clustering,
while our method is based on spectral techniques.
3 Observations
In this section, we share some insights with the reader that help to motivate
our approach and understand why it works. We raise two questions:
First, how much does the impurity of the training data in the WT sets influence the
convergence of the learning? Second, given a standard distance metric, how does
one approach the problem of getting good WT sets in an unsupervised manner?
In order to guarantee the diversity of the training sets (for ensemble learning),
each WT set D^t_train is formed by randomly taking 30% of the images from D_train,
and randomly re-assigning the labels of a fixed percentage R of these. Hence, R = 0
corresponds to the upper performance bound as every sample is assigned its
true label. A classifier is trained on each of these WT sets. At test time, each of
these classifiers returns a category label for each image in D_test. The winning
label is the mode of the results returned by all the classifiers. Fig. 2 evaluates
this for the 15-Scene dataset [1]. Linear SVMs are used as the classifiers with a
concatenation of GIST [25], PHOG [2] and LBP [26] as input. When the label
noise percentage R is low, the classification precision starts out high and levels off
quickly with T, as one would expect. But interestingly, for R even as high as
80%, the classification precision, which starts low, converges to a similarly high
precision given more WT sets T (≈ 500). This shows that it is possible to learn
the essence of image categories even from very weak (extremely mislabelled)
training sets, given a sufficient number with diverse deficiencies. This is crucial
for our task, because WT sets are much easier to obtain than pure training data in
unsupervised settings.
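The effect described above can be mimicked with a small simulation (a hedged sketch only: Gaussian blobs stand in for image features and a nearest-centroid rule stands in for the linear SVMs; all names and parameter values here are illustrative, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n_per_class=100, n_classes=3, dim=5):
    """Toy Gaussian blobs standing in for image feature vectors."""
    X, y = [], []
    for c in range(n_classes):
        center = rng.normal(scale=4.0, size=dim)
        X.append(center + rng.normal(size=(n_per_class, dim)))
        y.append(np.full(n_per_class, c))
    return np.vstack(X), np.concatenate(y)

def noisy_wt_set(X, y, frac=0.3, R=0.8, n_classes=3):
    """A weak training set: a random 30% of the data, with the labels
    of a fraction R of it randomly re-assigned."""
    idx = rng.choice(len(X), int(frac * len(X)), replace=False)
    yw = y[idx].copy()
    flip = rng.random(len(yw)) < R
    yw[flip] = rng.integers(0, n_classes, flip.sum())
    return X[idx], yw

def centroid_classifier(Xw, yw, n_classes=3):
    """Weak learner: assign each sample to the nearest class centroid."""
    cents = np.stack([Xw[yw == c].mean(0) for c in range(n_classes)])
    return lambda X: np.argmin(((X[:, None] - cents) ** 2).sum(-1), axis=1)

X, y = make_data()
T = 500
votes = np.zeros((len(X), 3), dtype=int)
for _ in range(T):
    clf = centroid_classifier(*noisy_wt_set(X, y))
    votes[np.arange(len(X)), clf(X)] += 1
majority = votes.argmax(1)          # winning label = mode over all classifiers
accuracy = (majority == y).mean()   # stays high despite R = 0.8 label noise
```

Each individual weak learner is trained on mostly wrong labels, yet because the errors of the T learners are independent, the vote recovers the underlying structure.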
Observation 2: Given a standard distance metric, how does one approach the
problem of getting good (precise and diverse) WT sets without supervision?
An ideal image representation along with a powerful metric should ensure that
all images of the same category are more similar to each other than to those
of other categories. To examine this, we tabulate how often an image is
from the same category as its k-th nearest neighbor. We refer to this frequency as
the label co-occurrence probability p(k); p(k) is averaged across images and labels in
the dataset. Various features and distance metrics were tested: GIST [25] with
the Euclidean distance, and PHOG [2] and LBP [26] with the χ² distance.
Fig. 3 shows the results on the 15-Scene dataset [1]. The results reveal that
using the distance metric in conventional ways (e.g. clustering by k-means and
spectral methods) will result in very noisy training sets, because the label co-occurrence
probability p(k) drops very quickly with k. However, sampling in the
very close neighborhood of a given image is likely to generate more instances of
the same category, whereas sampling in far-away regions of the feature space can
gather samples of new categories. This suggests that samples along with a few very close neighbors,
namely compact image clusters, can form a training set for a single class, and
a set of such image clusters which are scattered far away from each other in feature
space can serve as a good WT set for our task. Furthermore, sampling in this
way provides the chance of creating a large number of diverse WT sets, due to
the small size of the sampled WT sets. This will be further examined in Sect. 4.1.
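The label co-occurrence probability p(k) can be computed with a few lines of NumPy (an illustrative sketch on toy data, not the paper's features or datasets):

```python
import numpy as np

def label_cooccurrence(X, y, max_k=10):
    """p(k): how often an image and its k-th nearest neighbour
    (Euclidean distance) share the same category label, averaged
    over the whole collection."""
    D = np.linalg.norm(X[:, None] - X[None], axis=-1)
    np.fill_diagonal(D, np.inf)            # exclude self-matches
    order = np.argsort(D, axis=1)          # neighbours, nearest first
    return np.array([(y[order[:, k]] == y).mean() for k in range(max_k)])

# Toy demo: two well-separated Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(6, 1, (50, 4))])
y = np.repeat([0, 1], 50)
p = label_cooccurrence(X, y)   # near 1 for small k on this easy toy data
```

On real image features, the same computation produces the quickly decaying curves of Fig. 3, which is exactly why sampling only very close neighbours is preferable.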
4 Our Approach
In this section, we introduce our approach for unsupervised image category
discovery. It has two interleaved parts: 1) the creation of the weak training (WT)
sets by random walk (RW) sampling, and 2) learning the ensemble proximity
matrix A for categorization.
t in each trial as they usually work well for most feature types. Additionally,
they have also been shown to generalize well even when trained on only a few
training samples [28], which is an important characteristic as we keep the size
of the individual WT sets relatively small. Given a collection of WT sets, the
ensemble partitioning is summarized in Algo. 2. The final categorization can be
completed by feeding A to any similarity-based clustering algorithm. In this
work, we choose spectral clustering due to its simplicity and popularity. The
whole pipeline of Fig. 1 has now been explained.
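The ensemble step can be sketched as follows: each base partitioning votes on whether two images belong together, the votes are averaged into the ensemble proximity matrix A, and A is fed to spectral clustering (a self-contained NumPy sketch with a minimal spectral-clustering routine; it illustrates the idea and is not the authors' implementation):

```python
import numpy as np

def ensemble_proximity(partitions):
    """A[i, j] = fraction of base partitionings that put samples
    i and j into the same cluster (co-association matrix)."""
    parts = np.asarray(partitions)                    # (T, n) label matrix
    return (parts[:, :, None] == parts[:, None, :]).mean(0)

def spectral_clustering(A, C, n_iter=50):
    """Minimal normalized spectral clustering on a proximity matrix."""
    d = A.sum(1)
    L = np.eye(len(A)) - A / np.sqrt(np.outer(d, d))  # normalized Laplacian
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, :C]                                   # C smallest eigenvectors
    U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)
    cents = [U[0]]                                    # farthest-first k-means init
    for _ in range(C - 1):
        d2 = np.min([((U - c) ** 2).sum(1) for c in cents], axis=0)
        cents.append(U[np.argmax(d2)])
    cents = np.stack(cents)
    for _ in range(n_iter):                           # plain k-means on rows of U
        lab = np.argmin(((U[:, None] - cents) ** 2).sum(-1), axis=1)
        cents = np.stack([U[lab == c].mean(0) if (lab == c).any() else cents[c]
                          for c in range(C)])
    return lab

# Toy demo: three base partitionings of six samples; the first two agree
# up to a label permutation, the third makes one mistake.
parts = [[0, 0, 0, 1, 1, 1],
         [1, 1, 1, 0, 0, 0],
         [0, 0, 1, 1, 1, 1]]
A = ensemble_proximity(parts)
labels = spectral_clustering(A, C=2)
```

Note that averaging co-assignments makes the matrix invariant to how each base learner happens to number its clusters, which is why the permuted second partitioning reinforces rather than contradicts the first.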
5 Experiments
In this section, we elaborate upon the datasets, evaluation criteria and
experimental settings for the evaluation of the proposed approach.
Evaluation Datasets: We tested our method on three datasets: 15-Scene [1],
Caltech-101 [29], and a 35-Compound dataset. The 15-Scene dataset contains 15
scene categories in both indoor and outdoor environments. Each category has 200
to 400 images of size 300 × 250 pixels on average. The Caltech-101
dataset [29] contains 101 object categories and is one of the most popular datasets
for supervised object recognition. The large number of categories, combined with
the uneven category distribution (from 31 to 800 images per category), poses a
real challenge for unsupervised categorization. In order to evaluate our method
on a more general image dataset, we composed a 35-Compound dataset by
mixing the 15 scene categories [1] with 20 object categories selected from
Caltech 256. The 20 object categories are: American flag, fern, French horn,
leopards 101, PCI card, tombstone, airplanes 101, diamond ring, fire extinguisher,
ketch 101, mandolin, rotary phone, Pisa tower, face easy 101, dice, fireworks,
killer whale, motorbikes 101, roulette wheel, and zebra.
Features: The following features are considered in our experiments: f1 = SIFT
[30], f2 = GIST [25], f3 = PHOG [2], f4 = LBP [26], f5 = GIST + PHOG,
f6 = GIST + LBP, f7 = PHOG + LBP, and f8 = GIST + PHOG + LBP,
where + means simple concatenation. GIST features are computed on images
resized to 256 × 256 pixels and the other features are computed on the original
images. For PHOG, we computed the derivatives in 8 directions and used a
two-layer pyramid. For LBP, the uniform LBP was used. For SIFT, we used Hessian
detectors and k-means to obtain 300 centroids for the bag-of-features representation.
Our Method: We evaluated our approach against competing methods on the
datasets and features described above. The number of categories C is assumed
to be known. Although the transition matrix P of Sect. 4.1, Algo. 1, and the
classifiers of Sect. 4.2, Algo. 2, use the same notation for their features, speed-accuracy
tradeoffs may cause different features to be optimal at each stage. Since GIST
features have proven very effective for holistic image categorization with the
Euclidean distance [25], we adopted them for building P of Algo. 1, where dist(·, ·)
in Eq. 1 is the Euclidean distance between features. For training the classifiers,
f2–f8 were considered.1 For the evaluation and comparison across all the
datasets and features, we used the same set of parameters: T = 1000 and K = 9.
The influence of these two parameters on the performance of our method was
also tested on the three datasets, where values of T in the range [1, 1000]
and values of K in the range [3, 30] were evaluated. The σ in Eq. 1 is set
automatically using the self-tuning method proposed in [31], which has already
shown promising performance for similarity scaling.
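The self-tuning idea of [31] replaces a single global kernel width with a local scale per point. A minimal sketch (the toy data and the choice k = 7 are illustrative assumptions):

```python
import numpy as np

def self_tuning_affinity(X, k=7):
    """Local-scaling affinity in the spirit of [31]:
    sigma_i = distance from x_i to its k-th nearest neighbour,
    A_ij = exp(-d(i, j)^2 / (sigma_i * sigma_j))."""
    D = np.linalg.norm(X[:, None] - X[None], axis=-1)
    sigma = np.sort(D, axis=1)[:, k]     # column 0 is the self-distance 0
    A = np.exp(-D ** 2 / (np.outer(sigma, sigma) + 1e-12))
    np.fill_diagonal(A, 0.0)
    return A

# Toy demo: two tight clusters; within-cluster affinities dominate.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 3)), rng.normal(5, 0.5, (20, 3))])
A = self_tuning_affinity(X)
```

Because each point carries its own scale, dense and sparse regions of feature space are handled gracefully without tuning a global σ by hand.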
Competing Methods: We compared our method to well-known clustering methods:
k-means and spectral clustering. For k-means, four features, f2–f4 and
f8, were tried. For spectral clustering, the same features were tried: f2 with the
Euclidean distance measure, f3 and f4 with the χ² distance measure, and f8
1 We did not consider f1 in order to avoid severe over-fitting, given the small
size of the WT sets and the high dimensionality of f1.
with the Mahalanobis distance measure with diagonal covariances (the
off-diagonal elements were set to 0). We selected these combinations because
the χ² distance measure is superior for histogram-based features, and the
Mahalanobis distance measure with diagonal covariances is capable of learning feature
weights. We also compared our method to the PLSA-based method of [10], the
affinity propagation method (AP) of [14], and the regularized information
maximization method (RIM) of [17]. f1 is used as the input for the PLSA-based
method. Spatial pyramid matching (SPM) [1] is used as the input for the RIM,
as it was used in [17]. The AP used the same input as the spectral clustering. For
the implementation, we used the authors' code directly.2,3,4 Since the number
of categories cannot be set directly in the RIM and the AP (the input parameters
are searched to yield the right number of categories), we only tested these
two methods on the 15-Scene dataset.
Baselines for Sampling: To investigate the performance of our random walk
sampling scheme, we create two baselines for obtaining the WT sets. Everything
else in Algo. 2 remains as before. GIST features are also used for the baselines.
1. Baseline 1: In the t-th trial, run k-means on the whole dataset to cluster
the images into C groups. The WT set D^t is created by adding all images and
their labels to s^t and c^t, resp.
2. Baseline 2: In the t-th trial, run k-means on a bootstrap subset of the
whole dataset. For a fair comparison to the WT sampling, CK images are
randomly selected for each subset, with K = 9. Again, the WT set D^t is
created by adding all images and their labels to s^t and c^t, resp.
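Baseline 2 can be sketched as follows (a toy NumPy version with a plain k-means helper; function names and the demo data are illustrative, not the paper's code):

```python
import numpy as np

def kmeans(X, C, n_iter=30):
    """Plain k-means with a deterministic farthest-first initialization
    (a stand-in for whatever k-means variant is actually used)."""
    cents = [X[0]]
    for _ in range(C - 1):
        d2 = np.min([((X - c) ** 2).sum(1) for c in cents], axis=0)
        cents.append(X[np.argmax(d2)])
    cents = np.stack(cents)
    for _ in range(n_iter):
        lab = np.argmin(((X[:, None] - cents) ** 2).sum(-1), axis=1)
        cents = np.stack([X[lab == c].mean(0) if (lab == c).any() else cents[c]
                          for c in range(C)])
    return lab

def baseline2_wt_set(X, C, K=9, seed=0):
    """Baseline 2: k-means on a random subset of C*K images; the
    cluster indices become the pseudo-labels of the WT set."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), C * K, replace=False)
    return idx, kmeans(X[idx], C)

# Toy demo: two well-separated blobs, C = 2 categories.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(8, 1, (50, 4))])
idx, pseudo = baseline2_wt_set(X, C=2)
```

Different random subsets yield different WT sets, which is the source of the extra diversity that makes this baseline stronger than Baseline 1.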
We follow [12,16] in using purity (the bigger the better) for performance evaluation.
First, let us compare our method to existing methods. Table 1 lists the average
and variance of the purity over 10 runs of all the methods on the 15-Scene,
Caltech-101 and 35-Compound datasets. The table shows that our method
outperforms all the competing methods by a substantial margin on all three
datasets while having comparable variance. The superiority of our method can
be attributed to its discriminative learning ability, which gives more weight to
relevant variables and rejects the irrelevant ones. This can be verified by simply
examining how the performance of our method improves when providing more
features to the classifier. This learning ability is crucial for image and object
categorization, as so far no single feature can describe all image categories very
well. The classifier trained on each individual WT set may over-fit,
but this can be offset by the classifiers trained on other WT sets as the sets are
mutually diverse.
2 http://www.robots.ox.ac.uk/~vgg/software
3 http://vision.caltech.edu/~gomes/software.html
4 http://www.psi.toronto.edu/index.php?q=affinity%20propagation
Secondly, let us compare the three ways of creating the WT sets. From
Table 1, we can see that the RW sampling outperforms the two baselines
substantially. This superiority can be attributed to the good diversity and accuracy of
the WT sets generated by the RW sampling, which fulfil the requirements
of successful ensemble learning [8]. The accuracy comes from the underlying
principle of the RW sampling: it chooses, with high probability, the most separable
images and leaves out the ambiguous ones, which are handled after knowledge
has been abstracted by max-margin learning. The great diversity comes from
the random jumping of the RW sampling: the probability that two WT sets are
the same or very similar is very low. The superiority of baseline 2 over baseline
1 results from the higher diversity of the WT sets that baseline 2 generates.
Using the bootstrap technique to increase the diversity of training sets, and in
turn boost overall performance, is widely accepted for supervised learning.
Our experimental results confirm that this diversity is also very important
for unsupervised scenarios like ours.
One may argue that the WT sets generated by the RW sampling are not
precise, and that the precision of some individual sets may be affected by outliers
in the dataset. This is true, but our method demands far less from a WT set
than a supervised categorization method does from its training set: the label
assignment of the WT sets only needs to be better than random guessing. The
imprecision of one WT set can be offset by other WT sets since their deficiencies
differ. This is analogous to the weak learners of ensemble learning methods
such as bagging and boosting: each weak learner may be imprecise, but the
ensemble is precise and generalizes well. Also, our categorization framework
is very flexible and not tied to the way the WT sets are created. Any method that
can generate good WT sets can be used in this framework; the RW sampling
is a case in point. Interestingly, even with the WT sets created
by the two naive baselines, the ensemble partitioning framework generates
more precise results than the competing methods.
Furthermore, let us examine the influence of the parameters T and K on the
performance of our method. Fig. 5 shows the evaluation results over a variety
of values for T and K. From the curves for T in Fig. 5, it is evident that the
purity increases quickly with T and then stabilizes at a good
performance. That is, our method benefits from adding more WT sets and can
provide accurate results when a fairly large number of WT sets are provided.
This is quite similar to how bagging [21] and random forests [22] behave with
different numbers of training sets. For the parameter K = Q/C, we find that very
small and very large values (e.g. 3 and 30 in our case) degrade the performance.
A very small K leads to insufficient training samples per category and a very
large K results in very noisy training sets. But as Fig. 5 shows, our method is not
overly sensitive to the choice of K: there are large ranges from which to select
appropriate values. Its choice can be based on the experience of choosing training
data for supervised categorization.
Finally, though additional time is needed for training (most unsupervised
categorization methods need no training), our method is still quite efficient. The
efficiency comes from two factors: 1) training linear SVMs is very efficient
using the optimized package5; and 2) the performance of our method stabilizes
5 http://www.csie.ntu.edu.tw/~cjlin/liblinear
quickly with respect to T, as Fig. 5 shows. The training times on a Core i5 2.80
GHz desktop PC are: 3.67 minutes for 15-Scene (4485 images in total), 66.18
minutes for Caltech-101 (8251 images in total), and 12.33 minutes for the 35-Compound
dataset (8671 images in total). Furthermore, our method is inherently
parallelizable and can take advantage of multi-core processors.
6 Conclusion
This paper has tackled the hard task of unsupervised image categorization. To
this end, we leveraged the power of discriminative learning and ensemble methods.
We presented the concept of weak training sets and proposed a new yet
simple sampling scheme, denoted RW sampling, to generate them. For each
weak training set, a discriminative classifier is trained to obtain a base partitioning of
the unordered image collection, and then all the base partitionings are combined
into a strong ensemble proximity matrix that can be fed into spectral
clustering. In experiments on several challenging datasets, we showed that
our method consistently outperforms state-of-the-art methods. The
study poses multiple interesting questions for future research: Is it possible to
design more sophisticated methods for the creation of the WT sets, rather than
relying on a common distance metric as we do? How can the method be extended
to handle noisy web data? How could we apply the method to
weakly-supervised image categorization?
Acknowledgements. The authors gratefully acknowledge support from SNF
project Aerial Crowds and the ERC Advanced Grant VarCity.
References
1. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid
matching for recognizing natural scene categories. In: CVPR (2006)
2. Bosch, A., Zisserman, A., Muñoz, X.: Image classification using random forests and
ferns. In: ICCV (2007)
3. Boiman, O., Shechtman, E., Irani, M.: In defense of nearest-neighbor based image
classification. In: CVPR (2008)
4. Xiao, J., Hays, J., Ehinger, K.A., Oliva, A., Torralba, A.: SUN database: Large-scale
scene recognition from abbey to zoo. In: CVPR (2010)
5. Fergus, R., Weiss, Y., Torralba, A.: Semi-supervised learning in gigantic image
collections. In: NIPS (2009)
6. Deselaers, T., Alexe, B., Ferrari, V.: Localizing Objects While Learning Their
Appearance. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part
IV. LNCS, vol. 6314, pp. 452–466. Springer, Heidelberg (2010)
7. Blaschko, M.B., Vedaldi, A., Zisserman, A.: Simultaneous object detection and
ranking with weak supervision. In: NIPS (2010)
8. Dietterich, T.G.: Ensemble Methods in Machine Learning. In: Kittler, J., Roli, F.
(eds.) MCS 2000. LNCS, vol. 1857, pp. 1–15. Springer, Heidelberg (2000)
9. Fergus, R., Perona, P., Zisserman, A.: Object class recognition by unsupervised
scale-invariant learning. In: CVPR (2003)
10. Sivic, J., Russell, B.C., Efros, A.A., Zisserman, A., Freeman, W.T.: Discovering
objects and their location in images. In: ICCV (2005)
11. Sivic, J., Russell, B.C., Zisserman, A., Freeman, W.T., Efros, A.A.: Unsupervised
discovery of visual object class hierarchies. In: CVPR (2008)
12. Dai, D., Wu, T., Zhu, S.C.: Discovering scene categories by information projection
and cluster sampling. In: CVPR (2010)
13. Dueck, D., Frey, B.J.: Non-metric affinity propagation for unsupervised image
categorization. In: ICCV (2007)
14. Frey, B.J., Dueck, D.: Clustering by passing messages between data points.
Science 315, 972–976 (2007)
15. Grauman, K., Darrell, T.: Unsupervised learning of categories from sets of partially
matching image features. In: CVPR (2006)
16. Tuytelaars, T., Lampert, C.H., Blaschko, M.B., Buntine, W.: Unsupervised object
discovery: A comparison. IJCV 88, 284–302 (2009)
17. Gomes, R., Krause, A., Perona, P.: Discriminative clustering by regularized
information maximization. In: NIPS (2010)
18. Lee, Y.J., Grauman, K.: Learning the easy things first: Self-paced visual category
discovery. In: CVPR (2011)
19. Bengio, Y., Louradour, J., Collobert, R., Weston, J.: Curriculum learning. In:
ICML (2009)
20. Friedman, J., Hastie, T., Tibshirani, R.: Additive Logistic Regression: A Statistical
View of Boosting. The Annals of Statistics 28, 337–407 (2000)
21. Breiman, L.: Bagging predictors. ML 24, 123–140 (1996)
22. Breiman, L.: Random forests. ML 45, 5–32 (2001)
23. Leisch, F.: Bagged clustering. Working Paper Series (1999)
24. Saffari, A., Bischof, H.: Clustering in a boosting framework. In: CVWW (2007)
25. Oliva, A., Torralba, A.: Modeling the shape of the scene: A holistic representation
of the spatial envelope. IJCV 41, 145–175 (2001)
26. Ojala, T., Pietikäinen, M., Mäenpää, T.: Multiresolution gray-scale and rotation
invariant texture classification with local binary patterns. PAMI 24, 971–987 (2002)
27. Harel, D., Koren, Y.: On Clustering Using Random Walks. In: Hariharan, R.,
Mukund, M., Vinay, V. (eds.) FSTTCS 2001. LNCS, vol. 2245, pp. 18–41. Springer,
Heidelberg (2001)
28. Malisiewicz, T., Gupta, A., Efros, A.A.: Ensemble of exemplar-SVMs for object
detection and beyond. In: ICCV (2011)
29. Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few
training examples: An incremental Bayesian approach tested on 101 object
categories. In: WGMBV (2004)
30. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV 60,
91–110 (2004)
31. Zelnik-Manor, L., Perona, P.: Self-tuning spectral clustering. In: NIPS (2004)
Set Based Discriminative Ranking
for Recognition
1 Introduction
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 497–510, 2012.
© Springer-Verlag Berlin Heidelberg 2012
498 Y. Wu et al.
Fig. 1. Overview: (a) original query and gallery sets; (b) between-set geometric
distance finding; (c) feature space projection (metric) learning; (d) learned feature
space and between-set distances.
more candidates generally mean more information and should result in better
recognition performance. Such a belief comes from the experience of human
vision, and it has also been supported by recent research efforts [14].
Though set-based recognition is believed to be more promising than recognition
based on single images, it brings new challenges: the images in a single
set may be taken by different cameras under different conditions (illumination,
viewpoint, background, etc.), and the object itself may also undergo large
appearance changes (pose, occlusion, etc.). Such wide coverage increases the chance
of finding the correct matchings between sets, but could also increase the probability
of making mistakes. How to make the best use of multiple instances while
at the same time being discriminative in distinguishing image sets of different
classes becomes the key issue. Existing supervised classification methods usually
assume that the data to be recognized is a single instance which can be represented
by a feature vector in the same space as the training samples, so they are
not directly applicable to set-based recognition tasks.
In this paper, we propose a novel approach that optimizes a global set-based
recognition objective function by iterating between between-set distance
finding and feature space projection (i.e., metric learning). As shown in
Figure 1, given a query set Q and two arbitrary gallery sets Xi and Xj, each
of which contains multiple images of an object, our approach finds the distance
between each pair of query-gallery sets by computing the geometric distance
of their approximated convex hulls (distance of closest approach) [1]. This can be
viewed as finding the closest points between the two sets, though with the convex
approximation these points may be virtual, as they can be linear combinations of
actual sample points. Then a maximum-margin-based ranking algorithm is adopted to
learn a good metric, making the closest approach between correct query-gallery
pairs as much smaller than that between incorrect ones as possible. Such metric
learning can be viewed as feature space projection. After that, the closest
approach between each query-gallery pair is updated in the new feature space,
which activates another metric learning step. The iteration between these two
steps continues until convergence is achieved. Experimental results on both face
recognition and person re-identification demonstrate that our approach consistently
outperforms state-of-the-art methods without parameter tuning.
2 Related Work
Existing methods on set-based object recognition depend much on the object
categories they work on. Generally speaking, research on face recognition
focuses on exploring the distribution [5] or structure of the data sets [6–8]
and designing the between-set distance [9, 2], while in the literature of person
re-identification, more attention has been paid to feature representation [3, 10, 4].
This phenomenon is to some extent due to the differences between these two
categories: human bodies usually exhibit greater appearance variations and
occlusions than faces, causing difficulties for feature representation. Nevertheless,
these are valuable attempts at solving the generic set-based recognition
problem. Here we briefly categorize them into three groups: a) set-based signature
generation, b) direct set-to-set matching, and c) between-set distance finding.
The first two groups count on exploring new features: group (a) aims
at a single global and informative representation for each query/gallery set while
group (b) looks for good features for individual images. More concretely, in 2006,
a spatiotemporal segmentation algorithm was employed to generate signatures
that are invariant to variations in illumination, pose, and the dynamic appearance
of clothing, followed by clustering and matching methods [11]. Three years
later, a new signature called HPE (Histogram Plus Epitome) [10] was proposed
to integrate a global HSV color histogram with local epitome descriptors (generic
and local epitomes). The most recent work in group (a) is called MRCG (Mean
Riemannian Covariance Grid) [4], which divides each image into a dense grid
structure with overlapping cells and then describes each cell by region covariance
features [12]. Specially designed mean and variance functions are applied
to the cells to form a signature, and a similarity function was proposed to
compare two signatures. Though these signatures are compact and directly
compatible with image-based matching/classification algorithms, it is hard to make
them both representative and discriminative. Unlike them, the feature representation
in group (b) does not have to meet such high demands, as it relies on
cooperation with set-to-set matching methods. A typical approach is the one
named SDALF (Symmetry-driven Accumulation of Local Features) [3], which
explores the symmetry and asymmetry of the human body to partition it into
three parts and then uses three types of features to describe each part (except
the smallest, head part). A weighted combination of distances on these features
is defined as the image-to-image distance, and the minimum distance among
all possible image pairs is adopted for set-to-set matching. Though this group
takes advantage of having multiple candidates in each set for matching, its
effectiveness still largely depends on the quality of the features, which requires
careful, case-dependent design.
Instead of focusing on features, the third group works on set structure extraction
and distance computation. Between-set distance finding approaches were
mainly proposed for face recognition. Among these efforts, two recently
introduced methods are remarkable. One is AHISD/CHISD (Affine/Convex Hull
based Image Set Distance) [1], which characterizes each image set (query/gallery)
in terms of an affine/convex hull of the feature vectors of its images, and then
finds the geometric distance (distance of closest approach) between two hulls to
serve as the set-to-set distance. Such a convex approximation is less prone to
overfitting than models based on sample points because it can generate new samples
on the hull, and the approach can also be robust to outliers to some extent. The
other work is named SANP (Sparse Approximated Nearest Points) [2], which
enforces sparsity of the samples used for point generation via linear combination.
Compared with direct set-to-set matching, this group of methods makes
better use of the data sets; however, as they are unsupervised, their effectiveness
is directly determined by the feature space in which the data sets lie, since all
dimensions are equally weighted for distance computation.
We decompose the feature map by the so-called partial order features [15]:

\Psi_{po}(Q, \mathcal{X}, y_Q) = \sum_{i \in I_Q^+} \sum_{j \in I_Q^-} y_{ij} \left[ \frac{\Phi(Q, X_i)}{|I_Q^+|\,|I_Q^-|} - \frac{\Phi(Q, X_j)}{|I_Q^+|\,|I_Q^-|} \right],   (3)

where

y_{ij} = \begin{cases} +1 & \text{if } X_i \succ_{y_Q} X_j \\ -1 & \text{otherwise,} \end{cases}

and \Phi(Q, X_i) is a feature map characterizing the relationship between query set
Q and gallery set Xi. As in single-instance-based ranking [16], it can be proved
that such a definition has a very attractive property: given the model w, the
ranking y_Q that maximizes ⟨w, Ψ(Q, X, y_Q)⟩ is the one that sorts X by
descending ⟨w, Φ(Q, X_i)⟩, where X_i is an arbitrary set in X. This property
transfers the ranking problem to a weighted set-to-set similarity scoring problem.
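The sorting property can be checked numerically by brute force on a tiny toy problem (a sketch only; the random Φ stands in for an actual query-gallery feature map):

```python
import numpy as np
from itertools import permutations

def psi_po(Phi, pos, ranking):
    """Partial-order feature map in the spirit of Eq. (3).  Phi[i] plays
    the role of Phi(Q, X_i); `pos` marks the relevant gallery sets I+;
    `ranking[i]` is the position of X_i (lower = ranked higher);
    y_ij = +1 when a relevant X_i is ranked above an irrelevant X_j."""
    Ipos, Ineg = np.where(pos)[0], np.where(~pos)[0]
    out = np.zeros(Phi.shape[1])
    for i in Ipos:
        for j in Ineg:
            y_ij = 1.0 if ranking[i] < ranking[j] else -1.0
            out += y_ij * (Phi[i] - Phi[j]) / (len(Ipos) * len(Ineg))
    return out

rng = np.random.default_rng(0)
Phi = rng.normal(size=(5, 3))                 # 5 gallery sets, 3-dim feature map
pos = np.array([True, True, False, False, False])
w = rng.normal(size=3)

# Ranking that sorts gallery sets by descending <w, Phi_i>:
scores = Phi @ w
sorted_rank = np.argsort(np.argsort(-scores))

# Brute force over all rankings: the score-sorted one attains the maximum.
best_val = max(w @ psi_po(Phi, pos, np.array(perm))
               for perm in permutations(range(5)))
```

The objective decomposes over relevant/irrelevant pairs, and each pairwise term is maximized exactly when the pair is ordered consistently with the scores, which is why sorting by ⟨w, Φ(Q, X_i)⟩ is optimal.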
Now the problem becomes how to define a proper set-to-set joint feature map
Φ(Q, X_i). It looks like an extension of the traditional relative feature map, or
relative distance as it is called in [17]; however, for two sets which may contain
different numbers of points, it is nontrivial to design such a relative feature map,
as simple feature subtraction cannot be directly applied to point sets.
Therefore, if we can define our set-to-set relative feature map in a similar way, then
by replacing the vector-style model parameter w with the metric matrix W and
changing the normal inner product ⟨w, Φ(Q, X_i)⟩ to the Frobenius inner product
⟨W, Φ(Q, X_i)⟩_F, the large-margin-based objective function for ranking will
directly optimize the set-to-set distance metric. As for the embedded function
f(W) in the objective function, there are different options such as tr(W),
½ tr(W^T W), etc.; tr(W) is a good choice for us as it prefers sparse solutions.
Inspired by the closest-points-based set-to-set distance and the nearest-
neighbor-based set-to-set matching, we define our set-to-set distance metric as
follows:
    d^SS_W(X_i, X_j) = min_{x_ik, x_jl} d_W(x_ik, x_jl)                         (8)

        s.t. Σ_{k=1}^{n_i} α_ik = 1 = Σ_{k'=1}^{n_j} β_jk',

    d^SS_W(X_i, X_j) = ‖P^T X_i α_i − P^T X_j β_j‖_2 = ‖X_i^P α_i − X_j^P β_j‖_2,   (12)
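Restricting the candidate points to the observed samples gives the simplest instance of such a metric: with W = P P^T we have d_W(x, y) = ‖P^T x − P^T y‖_2, so both sets can be projected once and the closest projected pair taken. A sketch under that simplification (function and variable names are ours):

```python
import numpy as np

def set_distance_nearest(Xi, Xj, P):
    """Nearest-point set-to-set distance under the metric W = P P^T.

    Xi: (d, n_i) and Xj: (d, n_j) arrays whose columns are points.
    Since d_W(x, y) = ||P^T x - P^T y||_2, both sets are projected once
    and the minimum pairwise Euclidean distance is taken in the
    projected space.
    """
    Pi, Pj = P.T @ Xi, P.T @ Xj                 # project both sets once
    diff = Pi[:, :, None] - Pj[:, None, :]      # all pairwise differences
    d2 = np.einsum("dij,dij->ij", diff, diff)   # pairwise squared distances
    return float(np.sqrt(d2.min()))
```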
expression changes. The raw pixels of the images were used as features. The CMU
MoBo dataset was originally built for pose recognition, but recently the human
faces have been detected and resized to 40 × 40 for face recognition. There are
24 individuals appearing in 96 sequences, which were captured from multiple
cameras with four different walking styles: slow, fast, inclined, and carrying
a ball. Each individual appears in 4 sequences covering the different walking
styles. We used exactly the LBP features adopted by CHISD and SANP for the CMU
MoBo dataset, and followed their original experimental setting: select one
sequence for each individual as the gallery set and use the other 3 as query
sets. The gallery-query split on the Honda/UCSD data exactly follows [2], i.e.,
20 sequences for query and the remaining 39 for gallery. Since performance on
the whole sequences of the Honda/UCSD dataset has become saturated (SANP
claimed 100% accuracy), we sampled 50 or 100 frames per sequence instead. This
strategy was also applied to the CMU MoBo dataset for in-depth comparison.
The frame sampling strategy was originally proposed by [2]; however, it simply
treats the first 50/100 frames at the beginning of each sequence as testing
samples, which may be improper for long sequences with probably more variation
in the remaining frames. Therefore, we experiment with two different sampling
strategies to better reveal the properties of the data and the recognition
algorithms: Type I refers to sampling from the beginning of each sequence for
testing and randomly sampling images from the rest for training, while Type II
does random sampling for both training and testing without overlap. For short
sequences without enough frames, we sampled as many as possible while keeping
the training and test sets balanced. For each setting, the results are averaged
over 10 trials to eliminate the uncertainty of random sampling.
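The two strategies can be sketched as follows; the function name, the balanced-split fallback, and the seeded RNG are our own illustrative choices, not the authors' code:

```python
import random

def sample_sets(frames, set_size, strategy, rng):
    """Split one sequence into a test set and a training set.

    frames: list of frame indices in temporal order.
    Type I : test = first `set_size` frames, train = random sample of the rest.
    Type II: both test and train are random, non-overlapping samples.
    Short sequences fall back to an even, balanced split.
    """
    n = min(set_size, len(frames) // 2)  # keep train/test balanced
    if strategy == "I":
        test = frames[:n]                    # sequential, from the beginning
        train = rng.sample(frames[n:], n)    # random, from the remainder
    elif strategy == "II":
        shuffled = rng.sample(frames, 2 * n)  # draw without replacement
        test, train = shuffled[:n], shuffled[n:]
    else:
        raise ValueError(strategy)
    return test, train
```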
As the recently proposed CHISD and SANP models present the best performance so
far on these two datasets, and our model can directly utilize them, in this
paper we compare our SBDR model only with these two approaches. We used Prec@k
with k = 1 as our loss function since the evaluation is on the recognition
rate, and we set C = 10 without further tuning.
Table 1. Average recognition rate (%) on faces. The stars indicate that the
results differ from the ones presented in [2] due to a small change in the
experimental settings. Detailed explanations are given in the text.
Experimental results are shown in Table 1. They not only clearly demonstrate
the superiority of the proposed SBDR approach but also provide important
insights on how to use it properly (data sampling and set-distance finder
selection). We explain the results from the following perspectives.
SBDR significantly outperforms both CHISD and SANP, and SANP fits
SBDR better than CHISD. All the results consistently show that embedding
discriminative metric learning using the proposed SBDR model can significantly
improve the performance of unsupervised set-based distance-finding methods
(CHISD and SANP), and the improvement is much greater over SANP. In almost all
cases (with only one exception), SBDR-SANP performs better than SBDR-CHISD even
when SANP itself performs worse than CHISD. A possible reason for this
interesting phenomenon is that SANP provides SBDR with between-set distances
determined by sparse but representative samples, which could be more
informative for discriminative learning than distances computed over all the
samples as in the CHISD model.
Wider within-class variation coverage results in better recognition
performance. For the same set size, random sampling is more likely to result
in larger within-set variations than sequential sampling from the beginning of
the video sequences, as witnessed by the difference between the training and
testing sets for sampling strategy Type I presented in Fig. 2. As expected,
the experimental results clearly show that for all methods the performance on
randomly sampled testing sets is much better than that on sequentially sampled
ones. This is more significant on the Honda/UCSD dataset, which may be because
the facial appearance changes more gradually and sequentially in the
Honda/UCSD dataset than in the CMU MoBo dataset.
CHISD vs. SANP: CHISD performs better on sparse data, but SANP
exceeds it as the within-set data density increases. For sampling strategy
Type II, CHISD performs considerably better than SANP on both datasets when
the set size equals 50, but its superiority is weakened or even eliminated
when the set size goes up to 100. Such a change with growing set size can be
observed for sampling strategy Type I as well. Results on the Honda/UCSD
dataset demonstrate this more significantly, which coincides with the
experimental results in [2], though the exact numbers there differ from the
ones shown here for two reasons: we have much smaller testing sets for persons
with fewer than 50/100 images (in such cases only half of the images are used
for testing while the other half are reserved for training), and our results
are averaged over 10 trials instead of only one trial. With the same set size,
compared with CHISD, SANP performs better on Type I than on Type II; this is
because Type I has smaller within-set variations, so the density of the data
is relatively higher than that of Type II.
Note that even with only 100 frames per set for testing, the recognition rate
of SBDR-SANP already surpasses the best rates reported by other state-of-the-
art methods (excluding CHISD and SANP, as they have already been compared
Set Based Discriminative Ranking for Recognition 507
Fig. 2. Different sampling styles (random and sequential) result in different
within-class variation coverage for each set. Without loss of generality, the
training and testing sets for the first 5 persons (P1 to P5) in the Honda/UCSD
dataset are presented, using sampling strategy Type I with a set size of 50.
Here, the test sets for both query and gallery are sequentially sampled from
the beginning, while the training sets are randomly sampled from the remaining
frames of each sequence. It can be clearly seen that the training sets have
larger within-set variations.
with) using full-length sequences. More concretely, the best rate ever reported
on the Honda/UCSD dataset is 97.44% with an average of 267 frames per set, while
that on the CMU MoBo dataset is 95.97% with an average of 496 frames per set [2].
Datasets and Settings: There are two publicly available and commonly used
datasets for set-based (or multiple-shot) person re-identification: the ETHZ
dataset [21] and the i-LIDS dataset [22]. However, due to the recently
saturated performance on the ETHZ dataset [4], which suggests its unsuitability
for further comparison, we only experiment on the i-LIDS dataset in this paper.
The i-LIDS dataset for person re-identification was adapted from the 2008
i-LIDS Multiple-Camera Tracking Scenario (MCTS), released by the UK Home
Office for research on cross-camera human tracking. Since its introduction, it
has been tested by almost all of the approaches proposed for person
re-identification. It contains 476 images of 119 unique individuals. As the
images are automatically extracted, the width-height ratio varies and
misalignment occurs. Though it has an average of 4 images per individual, the
actual number varies from 2 to 8. This dataset is very challenging because it
was collected by two non-overlapping cameras from two different view angles at
a busy airport, and there are many occlusions and truncations as well as large
illumination changes, as shown in Fig. 3. Since there are too few images for
each person, we follow the setting in [17] and use only one image per person
for the gallery sets (actually not sets any more), and the other images for
query sets.
508 Y. Wu et al.
Fig. 3. The i-LIDS dataset for person re-identification. (a) shows the two
non-overlapping views selected for the dataset. (b) lists the cropped images
for several persons, of which the first two rows show the great intra-class
variations, including viewpoint, pose, illumination, occlusion and background
changes, while the last two rows present the rather small inter-class
variations between some pairs of persons. Red/black bounding boxes are used to
separate different persons.
The same color and texture features as used in [17] and [23] were adopted in our
experiments.
Comparison with Existing Methods: We compare our approach with the
state-of-the-art methods from each of the three groups mentioned in Section 2.
Concretely, they are: MRCG [4] for set-based signature generation, SDALF [3]
for direct set-to-set matching, and CHISD (linear) for set-based distance
finding. Note that the sparsity-based SANP model is not suitable here due to
the extremely small set sizes (N = 1 for gallery sets and N = 3 on average for
query sets). These methods represent the latest and so far the most powerful
ones in each specific group. Unfortunately, these three groups of methods are
all unsupervised, and as far as we are aware there is no existing work on using
supervised learning to directly optimize multiple-shot (i.e. set-based)
re-identification. Therefore, to show the advantage of set-based recognition we
also present the best results from two recently proposed single-image-based
classification and ranking algorithms: RankSVM [24] and PRDC [17], with exactly
the same feature representation and experimental setting as ours. For the loss
function in our model, we used the Mean Reciprocal Rank (MRR) as in [14]. The
trade-off parameter C was set to 10 without cross-validation. Since we randomly
sampled the data for training and testing, all experiments were repeated 10
times for averaging.
Results and Analysis: As shown in Table 2, using simple features for repre-
sentation and training with about the same amount of data as that for testing
Table 2. Top ranked matching rate (%) on the i-LIDS dataset, where r means rank

             Unsupervised                      p = 50
Methods    SDALF     MRCG    RankSVM    PRDC     CHISD   SBDR-CHISD
r = 1      34.96    45.80      37.41   37.83     41.80        46.60
r = 5      60.92    66.81      63.02   63.70     68.80        71.60
r = 10     73.36    75.21      73.50   75.09     83.60        80.40
r = 20     83.78    83.61      88.30   88.35     94.40        90.40
Table 3. Top ranked matching rate (%) on the i-LIDS dataset, where r means rank

                        p = 30                                  p = 80
Methods    RankSVM    PRDC    CHISD   SBDR-CHISD   RankSVM    PRDC    CHISD   SBDR-CHISD
r = 1        42.96   44.05    50.67        53.67     31.73   32.60    37.87        37.75
r = 5        71.30   72.74    76.00        79.00     55.69   54.55    61.75        64.13
r = 10       85.15   84.69    90.00        88.67     67.02   65.89    75.37        76.38
r = 20       96.99   96.29    98.67        97.67     77.78   78.30    84.13        84.63
References
1. Cevikalp, H., Triggs, B.: Face recognition based on image sets. In: CVPR,
pp. 2567–2573 (2010)
2. Hu, Y., Mian, A.S., Owens, R.: Sparse approximated nearest points for image
set classification. In: CVPR, pp. 121–128 (2011)
3. Farenzena, M., Bazzani, L., Perina, A., Murino, V., Cristani, M.: Person
re-identification by symmetry-driven accumulation of local features. In: CVPR
(2010)
4. Bak, S., Corvee, E., Bremond, F., Thonnat, M.: Multiple-shot human
re-identification by mean Riemannian covariance grid. In: Advanced Video and
Signal-Based Surveillance (2011)
5. Arandjelovic, O., Shakhnarovich, G., Fisher, J., Cipolla, R., Darrell, T.:
Face recognition with image sets using manifold density divergence. In: CVPR,
pp. 581–588 (2005)
6. Yamaguchi, O., Fukui, K., Maeda, K.: Face recognition using temporal image
sequence. In: IEEE International Conference on Automatic Face and Gesture
Recognition, p. 318 (1998)
7. Li, X., Fukui, K., Zheng, N.: Image-set based face recognition using boosted
global and local principal angles. In: Zha, H., Taniguchi, R.-i., Maybank, S.
(eds.) ACCV 2009, Part I. LNCS, vol. 5994, pp. 323–332. Springer, Heidelberg
(2010)
8. Kim, T.K., Kittler, J., Cipolla, R.: Discriminative learning and recognition
of image set classes using canonical correlations. IEEE Trans. Pattern Anal.
Mach. Intell. 29, 1005–1018 (2007)
9. Wang, R., Shan, S., Chen, X., Gao, W.: Manifold-manifold distance with
application to face recognition based on image set. In: CVPR (2008)
10. Bazzani, L., Cristani, M., Perina, A., Farenzena, M., Murino, V.:
Multiple-shot person re-identification by HPE signature. In: ICPR,
pp. 1413–1416 (2010)
11. Gheissari, N., Sebastian, T.B., Hartley, R.: Person reidentification using
spatiotemporal appearance. In: CVPR, pp. 1528–1535 (2006)
12. Tuzel, O., Porikli, F., Meer, P.: Region covariance: a fast descriptor for
detection and classification. In: Leonardis, A., Bischof, H., Pinz, A. (eds.)
ECCV 2006, Part II. LNCS, vol. 3952, pp. 589–600. Springer, Heidelberg (2006)
13. Prosser, B., Zheng, W.S., Gong, S., Xiang, T.: Person re-identification by
support vector ranking. In: BMVC (2010)
14. Wu, Y., Mukunoki, M., Funatomi, T., Minoh, M., Lao, S.: Optimizing mean
reciprocal rank for person re-identification. In: 2011 8th IEEE International
Conference on Advanced Video and Signal-Based Surveillance (AVSS), p. 6 (2011)
15. Joachims, T.: A support vector method for multivariate performance
measures. In: ICML, pp. 377–384 (2005)
16. McFee, B., Lanckriet, G.: Metric learning to rank. In: ICML, pp. 775–782
(2010)
17. Zheng, W.-S., Gong, S., Xiang, T.: Person re-identification by
probabilistic relative distance comparison. In: CVPR, pp. 649–656 (2011)
18. Joachims, T., Finley, T., Yu, C.N.: Cutting-plane training of structural
SVMs. Machine Learning 77, 27–59 (2009)
19. Lee, K., Ho, J., Yang, M.H., Kriegman, D.: Video-based face recognition
using probabilistic appearance manifolds. In: CVPR, pp. 313–320 (2003)
20. Gross, R., Shi, J.: The CMU Motion of Body (MoBo) database. Technical
Report CMU-RI-TR-01-18, Robotics Institute, Carnegie Mellon University,
Pittsburgh, PA (2001)
21. Schwartz, W.R., Davis, L.S.: Learning discriminative appearance-based
models using partial least squares. In: SIBGRAPI, pp. 322–329 (2009)
22. Zheng, W.S., Gong, S., Xiang, T.: Associating groups of people. In: BMVC
(2009)
23. Gray, D., Tao, H.: Viewpoint invariant pedestrian recognition with an
ensemble of localized features. In: Forsyth, D., Torr, P., Zisserman, A. (eds.)
ECCV 2008, Part I. LNCS, vol. 5302, pp. 262–275. Springer, Heidelberg (2008)
24. Joachims, T.: Optimizing search engines using clickthrough data. In:
Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pp. 133–142 (2002)
A Global Hypotheses Verification Method
for 3D Object Recognition
1 Introduction
Object recognition has been extensively pursued during the last decade within
application scenarios such as image retrieval, robot grasping and manipulation,
scene understanding and place recognition, human-robot interaction, and
localization and mapping. A popular approach to tackling object recognition,
especially in robotic and retrieval scenarios, is to deploy 3D data, motivated
by its inherently higher effectiveness compared to 2D images in dealing with
occlusions and clutter, as well as by the possibility of achieving
6-degree-of-freedom (6DOF) pose estimation of arbitrarily shaped objects.
Moreover, the recent advent of low-cost, real-time 3D cameras, such as the
Microsoft Kinect and the ASUS Xtion, has turned 3D sensing into an easily
affordable technology. Nevertheless, object recognition remains an unsolved
task, particularly in challenging real-world settings involving texture-less
and/or smooth objects, significant occlusions and clutter, and different
sensing modalities and/or resolutions (e.g. see Fig. 1).
Algorithms for 3D object recognition can be divided into local and global.
Local approaches extract repeatable and distinctive keypoints from the 3D
surfaces of the models in the library and of the scene, each keypoint then
being associated with a 3D descriptor of its local neighborhood [1–5]. Scene
and model keypoints are subsequently matched via their associated descriptors
to attain point-to-point correspondences. Once correspondences are established,
they are usually
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 511524, 2012.
© Springer-Verlag Berlin Heidelberg 2012
512 A. Aldoma et al.
Fig. 1. With the proposed approach, heavily occluded and cluttered scenes (left) are
handled by evaluating a high number of hypotheses (center), then retaining only those
providing a coherent interpretation of the scene according to a global optimization
framework based on geometric cues (right)
2 Related Work
So far only a few methods have outlined a specific proposal for the HV stage.
In [1, 11], using the correspondences supporting a hypothesis as seeds, a set
of scene points is grown by iteratively including neighboring points which lie
closer than a pre-defined distance to the transformed model points. If the
final set of points is larger than a pre-defined fraction of the number of
model points (from one fourth to one third of the number of model points), the
hypothesis is selected and ICP is selectively run on the attained set of points
in order to refine the object's pose. Obviously, one disadvantage of such an
approach is that it cannot handle levels of occlusion higher than 75%.
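A sketch of this support-fraction acceptance test, simplified to plain nearest-neighbor counting instead of seeded region growing (the function name and the scipy KD-tree are our assumptions):

```python
import numpy as np
from scipy.spatial import cKDTree

def support_fraction_accept(model_pts, scene_pts, dist_thresh,
                            min_fraction=0.25):
    """[1, 11]-style check: count scene points lying within `dist_thresh`
    of the transformed model, and accept the hypothesis if the supporting
    set exceeds `min_fraction` of the model size.

    model_pts: (m, 3) transformed model points; scene_pts: (s, 3).
    """
    tree = cKDTree(model_pts)
    d, _ = tree.query(scene_pts)             # distance to nearest model point
    support = int((d <= dist_thresh).sum())  # scene points explained
    return support >= min_fraction * len(model_pts), support
```

With min_fraction = 0.25 the 75% occlusion ceiling follows immediately: a model occluded by more than 75% can never gather enough supporting scene points.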
The HV method proposed in [3] ranks hypotheses based on the quality of their
supporting correspondences, and they are then verified sequentially starting
from the highest ranked. To verify each hypothesis, after ICP, two terms are
evaluated: the former, similarly to [1], is the ratio between the number of
model points having a correspondent in the scene and the total number of model
points; the latter is the product of this ratio and the quality score of the
supporting correspondences. This step requires setting three different
thresholds. Two additional checks are then enforced, so as to prune hypotheses
based on the number of outliers (model points without a correspondent in the
scene) as well as on the amount of occlusion generated by the current
hypothesis with respect to the remaining scene points. Again, these two
additional checks require three thresholds. If a hypothesis gets through each
of these steps, it is accepted and its associated scene points are eliminated
from the scene, so that they will not be taken into account when the next
hypothesis is verified.
In [6], for each model yielding correspondences, the set of hypotheses
associated with the model is first pruned by thresholding the number of
supporting correspondences. Then, the best hypothesis is chosen based on the
overlap area A(H_best) between the model associated with that hypothesis and
the scene, and the initial pose is refined by means of ICP. Finally, the
accuracy of the selected hypothesis is given by the ratio
A(H_best) / M_a(H_best), where M_a(H_best) is the total visible surface of the
model within the bounding box of the scene. The model is said to be present in
the scene if its accuracy is above a certain threshold and, upon acceptance,
the scene points associated with A(H_best) are removed.
Papazov and Burschka [5] evaluate how well a model hypothesis fits into the
scene by means of an acceptance function which takes into account, as a bonus,
the number of transformed model points falling within a certain distance of a
scene point (support) and, as a penalty, the number of model points that would
occlude other scene points (i.e. their distance from the closest scene point is
above a threshold, but they lie on the line of sight of a scene point). A
hypothesis is accepted by thresholding its support and occlusion sizes. Given
the hypotheses fulfilling the requirements set forth by the acceptance
function, a conflict graph is built, wherein forks are created every time two
hypotheses share a percentage of scene points above a third threshold.
Surviving hypotheses are then selected by means of a non-maxima suppression
step carried out over the graph and based on the acceptance function. This
approach is the most similar to ours, as, thanks to the conflict graph,
interaction between hypotheses is taken into account. Nevertheless, their
method is only partially global, since the first stage of the verification
still relies on pruning hypotheses one at a time, and a winner-take-all
strategy is used for conflicting hypotheses.
Relevant to our work, though aimed at piecewise surface segmentation on range
images, Leonardis et al. proposed in [12] a model selection strategy based on
the minimization of a cost function to produce a globally consistent solution.
Even though the minimization is formalized as a Quadratic Boolean Problem, the
final solution is still attained by taking hypotheses into account sequentially
via a winner-take-all strategy.
3 Proposed Method
This Section illustrates the proposed HV approach for 3D object recognition.
After introducing notation, we analyze the geometrical cues that ought to be
taken into account for global hypotheses verication. Then, we illustrate how
to formulate the cost function and tackle the optimization problem. Finally, we
describe the overall 3D object recognition pipeline.
Occlusions. Given a hypothesis h_i, model parts not visible in the scene due to
self-occlusions or occlusions generated by other scene parts should be removed,
since they cannot provide consensus for h_i. Thus, given an instance of X, for
each x_i = 1 we compute a modified version of M_{h_i} by transforming the model
according to T_{h_i} and removing all occluded points. Hereinafter this new
point cloud will be denoted as M^v_{h_i}.
Establishing whether a model point is visible or occluded can be done
efficiently based on the range image associated with the scene point cloud,
possibly generating the range image from the point cloud whenever the former is
not available directly. Thus, similarly to [5, 3], a point p ∈ M_{h_i} is
considered occluded if its back-projection into the rendered range map of the
scene falls onto a valid depth point and its depth is larger than the
corresponding scene depth. The same reasoning applies to self-occlusions.
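The back-projection test can be sketched as follows, assuming a pinhole camera with known intrinsics (fx, fy, cx, cy); the function name and the depth tolerance eps are illustrative, and points projecting outside the image are conservatively treated as not visible:

```python
import numpy as np

def visible_model_points(model_pts, depth_map, fx, fy, cx, cy, eps=0.005):
    """Keep model points that are not occluded by the scene.

    model_pts: (n, 3) points already transformed into the camera frame
    (z forward).  depth_map: (h, w) range image of the scene (metres,
    np.nan where invalid).  A point is occluded when it back-projects
    onto a valid pixel whose scene depth is smaller than the point's
    own depth (up to a tolerance eps); such points cannot vote for h_i.
    """
    h, w = depth_map.shape
    x, y, z = model_pts.T
    u = np.round(fx * x / z + cx).astype(int)   # pixel column
    v = np.round(fy * y / z + cy).astype(int)   # pixel row
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h) & (z > 0)
    visible = np.zeros(len(model_pts), dtype=bool)
    idx = np.where(inside)[0]
    scene_z = depth_map[v[idx], u[idx]]
    # visible if the pixel is invalid or the scene surface is not in front
    visible[idx] = np.isnan(scene_z) | (z[idx] <= scene_z + eps)
    return model_pts[visible]
```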
Cues i, ii) Scene Fitting and Model Outliers. Once the set of visible points of
a model, M^v_{h_i}, has been calculated, we want to determine whether these
points have a correspondent in the scene, i.e. how well they explain scene
points. This cue was exploited by thresholding scene points based on a fixed
distance value in the HV stages proposed in [1, 3, 6, 5]. Here, we want to
refine this approach by measuring how well each visible model point locally
fits the scene. Hence, for each x_i = 1 we associate with each scene point p a
weight which estimates how well the point is explained by h_i, by measuring the
local fitting with respect to its nearest neighbor in M^v_{h_i}, denoted as
N(p, M^v_{h_i}):

    ω_{h_i}(p) = δ(p, N(p, M^v_{h_i}))                                       (1)

Local fitting between two points p and q is measured by the function δ(p, q),
which accounts for their distance as well as the local alignment of the
surfaces, as typically done, e.g., to assess the quality of registration
between two surfaces. Referring to the normals at p and q respectively as n_p
and n_q, δ(p, q) is defined as

    δ(p, q) = { (−‖p − q‖²₂ / ρ²_e + 1)(n_p · n_q),  ‖p − q‖₂ ≤ ρ_e
              { 0,                                    elsewhere              (2)

For each solution X, we can associate with each scene point p the sum of all
weights related to active hypotheses:

    Ω_X(p) = Σ_{i=1}^{n} ω_{h_i}(p) · x_i                                    (3)
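One plausible implementation of these weights (the quadratic falloff and the clipping of negative normal agreement reflect our reading of Eq. (2) and need not match the paper exactly; all names are ours):

```python
import numpy as np
from scipy.spatial import cKDTree

def fitting_weights(scene_pts, scene_normals, model_pts, model_normals, rho_e):
    """Per-scene-point weight for one active hypothesis: quadratic
    distance falloff times normal agreement with respect to the nearest
    visible model point (one plausible reading of Eq. (2))."""
    tree = cKDTree(model_pts)
    d, nn = tree.query(scene_pts)                     # nearest visible model pt
    align = np.einsum("ij,ij->i", scene_normals, model_normals[nn])
    w = (1.0 - (d / rho_e) ** 2) * align              # falloff * alignment
    w[d > rho_e] = 0.0                                # outside support radius
    return np.clip(w, 0.0, None)                      # assume no negative weight

def omega(weights_per_hyp, x):
    """Sum of per-hypothesis weights over active hypotheses (Eq. (3))."""
    return sum(w * xi for w, xi in zip(weights_per_hyp, x))
```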
Cue iv) Clutter. In many application scenarios not all sensed shapes can be
fitted with some known object model. Exceptions might occur, for instance, in
some controlled industrial environments where all the objects making up the
scene are known a priori. More generally, though, several visible scene parts
which do not correspond to any model in the library might locally, and
erroneously, fit some model shapes, potentially leading to false detections.
Maximizing the number of explained scene points (i.e. cue i)), although useful
to increase the number of correct recognitions, nevertheless favors this
circumstance. On the other hand, counting the outliers associated with these
false positives (cue ii)) might not help, since the parts of the model which do
not fit the scene might
Fig. 2. Proposed cues for global optimization. In both a) and b), top left: a
solution consisting of a set of active model hypotheses superimposed onto the
scene. Top right: scene labeling via smooth surface segmentation. Bottom left:
classification of visible model points into inliers (orange) and outliers
(green). Bottom right: classification of scene points into explained by a
single hypothesis (blue), by multiple hypotheses (black), unexplained (red),
cluttered due to region growing (yellow) and due to proximity (purple).
turn out to be occluded or outside the field of view of the 3D sensor. This is
the case, e.g., of the chicken in Fig. 2-b) being wrongly fitted to the rhino
(in particular, the bottom left image shows that the potential outliers turn
out to be occluded, hence the wrong hypothesis is not significantly penalized).
To counteract the effect of clutter, we devised an approach, inspired by
surface-based segmentation [13], aimed at penalizing a hypothesis that locally
explains some part of the scene but not nearby points belonging to the same
smooth surface patch. This is also useful to penalize hypotheses featuring
correct recognition but wrong alignment of the model in the scene.
Surface-based segmentation methods are based on the assumption that object
surfaces are continuous and smooth. Continuity is usually assessed by the
density of points in space, and smoothness through surface normals. Following
this idea, scene segmentation is performed by identifying smooth clusters of
points. Each new cluster is initialized with a random point, then grown by
iteratively including all points p_j lying in its neighborhood which show a
similar normal:
    Λ_X(p) = Σ_{i=1}^{n} x_i · ψ(p, N(p, E_{h_i}))                           (6)

Λ_X(p) consists of contributions, ψ(p, N(p, E_{h_i})), associated with each
active hypothesis h_i, where E_{h_i} is the set of scene points explained by
h_i. Analogously to δ(p, q), we want ψ(p, q) to weight clutter based on the
proximity of p to its nearest neighbor, q ∈ E_{h_i}, as well as according to
the alignment of their surface patches:

    ψ(p, q) = { κ,                                   ‖p − q‖₂ ≤ ρ_c ∧ l(p) = l(q)
              { (−‖p − q‖²₂ / ρ²_c + 1)(n_p · n_q),  ‖p − q‖₂ ≤ ρ_c ∧ l(p) ≠ l(q)
              { 0,                                   elsewhere               (7)
The radius ρ_c defines the spatial support related to ψ(p, q), while κ is a
constant parameter used to penalize unexplained points that have been assigned
to the same cluster by the smooth segmentation step. Thanks to the above
formulation, wrong active hypotheses, such as the milk bottle in Fig. 2-a) and
the chicken model in Fig. 2-b), cause a significantly valued clutter term Λ_X
(e.g. the purple and yellow regions in the bottom right part of the Figure),
which penalizes their validation within the global cost function. Therefore, we
have derived the last cue: iv) the amount of unexplained scene points close to
an active hypothesis according to (7) should be minimized.
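The smooth-cluster labels l(p) consumed by the clutter term can be produced with a simple region-growing pass; the radius, angle threshold, and scipy KD-tree below are our assumptions, not values from the paper:

```python
import numpy as np
from scipy.spatial import cKDTree
from collections import deque

def smooth_segmentation(pts, normals, radius, angle_thresh_deg=15.0):
    """Label the scene into smooth clusters by region growing: seed an
    unlabeled point (the paper seeds randomly; we scan in order for
    determinism), then repeatedly add neighbours within `radius` whose
    normals deviate by less than `angle_thresh_deg`."""
    tree = cKDTree(pts)
    cos_t = np.cos(np.deg2rad(angle_thresh_deg))
    labels = np.full(len(pts), -1, dtype=int)
    cur = 0
    for seed in range(len(pts)):
        if labels[seed] != -1:
            continue
        labels[seed] = cur
        queue = deque([seed])
        while queue:
            i = queue.popleft()
            for j in tree.query_ball_point(pts[i], radius):
                if labels[j] == -1 and normals[i] @ normals[j] >= cos_t:
                    labels[j] = cur          # same smooth surface patch
                    queue.append(j)
        cur += 1
    return labels
```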
We have so far outlined four cues i)-iv). While i) is aimed at increasing as
much as possible the number of recognized model instances (thus TPs and FPs),
ii), iii) and iv) try to penalize unlikely hypotheses through geometrical
constraints, so as to minimize false detections (FPs). The cost function F we
are looking for is obtained as the sum of the terms related to the cues that
need to be enforced within our optimization framework:

    F(X) = f_S(X) + f_M(X)                                                   (8)

The global cost formulation in (8) easily allows plugging in additional cost
terms derived from geometric constraints or from specific application
characteristics. For instance, the relatively common assumption, in indoor
robotic scenarios, that objects are placed on a planar surface [9, 10] would
allow penalizing hypotheses having object parts lying below, or intersecting
with, this surface.
3.3 Optimization
Having defined the cost function, we need to devise a solver for the proposed
optimization problem. As previously pointed out, we are looking for a solution
X* which minimizes the function F(X) over the solution space 𝔹^n:

    X* = argmin_{X ∈ 𝔹^n} { F(X) = f_S(X) + f_M(X) }                         (11)

As the cardinality of the solution space is 2^n, even with a relatively small
number of recognition hypotheses (e.g. in the order of tens) exhaustive
enumeration becomes prohibitive. As the defined problem belongs to the class of
non-linear pseudo-Boolean optimization, we adopt a classical approach from
operations research based on simulated annealing [14] (SA). SA is a
meta-heuristic algorithm that optimizes a function without any guarantee of
finding the global optimum. It randomly explores the solution space by applying
moves from a solution X^i to another valid solution X^j. In order to deal with
local optima, the algorithm allows moves which increase the cost of the target
solution. Such bad moves are usually performed during the initial iterations
(when the temperature of the system is high), whilst they become less and less
probable as the optimization goes on (the system cools down). The algorithm
stops when the temperature has reached a minimum or no improvement has been
achieved in the last N moves, which occurs either when the algorithm reaches
the global optimum or when it is trapped in a local minimum. We initialize SA
by assuming all hypotheses to be active, i.e. X^0 = {1, ..., 1}, each move then
consisting in switching a single hypothesis on/off at a time.
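A minimal SA loop of this kind (all-active start, one bit flip per move, Metropolis acceptance, linear cooling) can be sketched as follows; it stands in for the MetsLib implementation and makes no claim about its exact schedule or stopping rule:

```python
import math
import random

def anneal(cost, n, iters=6000, t0=1.0, rng=None):
    """Minimise a pseudo-Boolean cost F(X) over X in {0,1}^n by simulated
    annealing: start with all hypotheses active, flip one bit per move,
    accept uphill moves with probability exp(-dF/T) under linear cooling.
    """
    rng = rng or random.Random(0)
    x = [1] * n                                  # X^0 = {1, ..., 1}
    cur_c = cost(x)
    best, best_c = x[:], cur_c
    for k in range(iters):
        t = t0 * (1 - k / iters) + 1e-9          # linear cooling schedule
        i = rng.randrange(n)
        x[i] ^= 1                                # switch one hypothesis on/off
        new_c = cost(x)
        if new_c <= cur_c or rng.random() < math.exp((cur_c - new_c) / t):
            cur_c = new_c
            if new_c < best_c:
                best, best_c = x[:], new_c       # track best solution seen
        else:
            x[i] ^= 1                            # reject: undo the flip
    return best, best_c
```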
In our experiments we relied on the SA implementation available in MetsLib¹,
based on linear cooling and on the default parameter values, except for the
number of iterations, for which we used a high enough value (6000) that
different runs yielded the same results. We wish to point out that the proposed
formulation allows the terms included in the cost function to be pre-computed
for each single hypothesis (those related to f_M) and scene point (those
related to f_S), so that at each new move the cost function can be computed
efficiently, and independently of the total number of scene points and
hypotheses, by updating only the hypothesis (and related scene points) being
switched on/off. In Sec. 4 we will show how, despite being an approximate
optimization algorithm, in the addressed scenarios SA can yield accurate
results while requiring reasonably low computation times (typically below 2
seconds for hypothesis sets of up to 200 elements; see Fig. 3-a) and Fig. 4-b)).
¹ www.coin-or.org/metslib
Parameters (grouped by pipeline stage: detector/descriptor, correspondence
grouping (CG), and HV)

Dataset          Det./descr.                  CG               HV
Laser Scanner    0.5 cm   4 cm   5   0.5 cm   0.5 cm   3 cm    5   3
Kinect           1 cm     5 cm   5   1.5 cm   1 cm     5 cm    5   3
and minimum cardinality for RANSAC are set, respectively, to and . Finally,
ICP is deployed on each subset to refine the 6DOF pose given by AO. At this
point, the hypothesis set H is ready to be fed to the proposed HV stage.
4 Experimental Results
Fig. 3. Kinect dataset: a) results reported by the proposed HV algorithm and that
in [5]; the chart also reports the upper bound on the recognition rate related to the
deployed 3D pipeline. b-c): qualitative results yielded by the proposed algorithm on a
challenging dataset scene (b) and other simpler scenes (c).
given the different nature of the 3D data between scenes (2.5D, acquired with a
Kinect) and models (fully 3D, mainly CAD, the remaining acquired with a laser
scanner). The relatively high number of models present in the library allows
testing how well the proposed approach scales with the library size. Another
relevant challenge of this dataset lies in the traits of the model shapes,
some of which are highly symmetrical and hardly descriptive (e.g. bowls, cups,
glasses), many of them being characterized by a high similarity (e.g. different
kinds of glasses, different kinds of mugs, etc.), as also attested by Fig. 3. Fig. 3-a)
shows the Recognition vs Occlusion curve and number of FPs of the proposed HV
algorithm and that in [5], both plugged into the 3D pipeline presented in Sec.
3.4. The Figure also shows the upper bound curve, i.e. the maximum recognition
rate achievable with the proposed recognition pipeline. It can be seen that our
approach gets really close to the upper bound while dramatically reducing the
number of false positives. Our approach also outperforms that in [5], in terms
of Recognition rates as well as in terms of FPs. The overall recognition rate on
this dataset is 84.1% for the upper bound, 79.5% for the proposed approach and
73.8% for [5]. Qualitative results obtained by the proposed algorithm are also
provided in Fig. 3-b) and 3-c).
proposed 3D pipeline together with the results reported in [5–7]: our method
clearly outperforms the state of the art. Moreover, to the best of our knowledge,
we are the first to achieve a recognition rate of 100% without any false posi-
tive. Fig. 4-b) shows the Recognition vs Occlusion curve and number of FPs of
the proposed HV stage compared to that proposed in Papazov et al. [5], with
the recognition pipeline proposed in Sec. 3.4 for different ICP iterations. The
average time required by each combination is also reported. It is worth noting
that although the results obtained by both HV algorithms are equivalent at 0
and 10 ICP iterations in terms of recognition rate, the proposed algorithm is
able to yield fewer FPs (0 against 2). Then, at 30 ICP iterations, our method
accurately recognizes all models without yielding any FP, while the number of
FPs reported by [5] significantly increases, due to the fact that a high number of
ICP iterations tends to locally align incorrect hypotheses to the scene points, so
that their respective support is large enough to be accepted by that method.
during the SA stage. Another direction regards the use of additional cues, par-
ticularly with the aim of penalizing physical intersections between active visible
models. We plan to publicly release the implementation of the proposed HV
algorithm and recognition pipeline in the open source Point Cloud Library.
References
1. Johnson, A.E., Hebert, M.: Using spin images for efficient object recognition in
cluttered 3D scenes. IEEE Trans. PAMI (5) (1999)
2. Tombari, F., Salti, S., Di Stefano, L.: Unique Signatures of Histograms for Local
Surface Description. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010,
Part III. LNCS, vol. 6313, pp. 356–369. Springer, Heidelberg (2010)
3. Mian, A., Bennamoun, M., Owens, R.: 3D model-based object recognition and
segmentation in cluttered scenes. IEEE Trans. PAMI (10) (2006)
4. Mian, A., Bennamoun, M., Owens, R.: On the repeatability and quality of key-
points for local feature-based 3D object retrieval from cluttered scenes. IJCV (2010)
5. Papazov, C., Burschka, D.: An Efficient RANSAC for 3D Object Recognition
in Noisy and Occluded Scenes. In: Kimmel, R., Klette, R., Sugimoto, A. (eds.)
ACCV 2010, Part I. LNCS, vol. 6492, pp. 135–148. Springer, Heidelberg (2011)
6. Bariya, P., Nishino, K.: Scale-hierarchical 3D object recognition in cluttered scenes.
In: Proc. CVPR (2010)
7. Drost, B., Ulrich, M., Navab, N., Ilic, S.: Model globally, match locally: Efficient
and robust 3D object recognition. In: Proc. CVPR (2010)
8. Tombari, F., Di Stefano, L.: Object recognition in 3D scenes with occlusions and
clutter by Hough voting. In: Proc. PSIVT (2010)
9. Aldoma, A., Blodow, N., Gossow, D., Gedikli, S., Rusu, R.B., Vincze, M., Bradski,
G.: CAD-Model Recognition and 6DOF Pose Estimation Using 3D Cues. In: 3DRR
Workshop, ICCV (2011)
10. Rusu, R.B., Bradski, G., Thibaux, R., Hsu, J.: Fast 3D recognition and pose using
the viewpoint feature histogram. In: Proc. 23rd IROS (2010)
11. Johnson, A.E., Hebert, M.: Surface matching for object recognition in complex
three-dimensional scenes. IVC (9) (1998)
12. Leonardis, A., Gupta, A., Bajcsy, R.: Segmentation of range images as the search
for geometric parametric models. IJCV (1995)
13. Rabbani, T., van den Heuvel, F., Vosselmann, G.: Segmentation of point clouds
using smoothness constraint. In: IEVM 2006 (2006)
14. Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P.: Optimization by simulated annealing.
Science (4598) (1983)
15. Muja, M., Lowe, D.G.: Fast approximate nearest neighbors with automatic algo-
rithm configuration. In: VISAPP. INSTICC Press (2009)
16. Chen, H., Bhanu, B.: 3D free-form object recognition in range images using local
surface patches. Pattern Recognition Letters (10) (2007)
17. Arun, K., Huang, T., Blostein, S.: Least-squares fitting of two 3-D point sets. Trans.
PAMI (1987)
Are You Really Smiling at Me? Spontaneous
versus Posed Enjoyment Smiles
1 Introduction
Human facial expressions are indispensable elements of non-verbal communication.
Since faces can reveal the mood or the emotional feeling of a person, automatic under-
standing and interpretation of facial expressions provide a natural way to interact with
computers. In recent studies, the analysis of spontaneous facial expressions has gained
more interest. For social interaction analysis, it is necessary to distinguish spontaneous
(felt) expressions from the posed (deliberate) ones. Spontaneous expressions can re-
veal states of attention, agreement and interest, as well as deceit. The foremost facial
expression for spontaneity analysis is the smile, as it is the most frequently performed
expression, and used for signaling enjoyment, embarrassment, politeness, etc. [1]. It is
also used to mask other emotional expressions, since it is the easiest emotional facial
expression to pose voluntarily [2].
Several characteristics of spontaneous and posed smiles, such as symmetry, speed,
and timing are analyzed in the literature [3], [4], [5]. Their findings suggest that dif-
ferent facial regions contribute differently to the classification of smiles. In this paper,
we assess the facial dynamics for different face regions, and demonstrate that the eye
region contains the most useful information for this problem. Our method combines
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 525–538, 2012.
© Springer-Verlag Berlin Heidelberg 2012
526 H. Dibeklioglu, A.A. Salah, and T. Gevers
region-specific smile dynamics (duration, amplitude, speed, acceleration, etc.) with eye-
lid movements to detect the genuineness of enjoyment smiles.
Our contributions are: 1) A region-specific analysis of facial feature movements un-
der various conditions; 2) An accurate smile classification method which outperforms
the state-of-the-art methods; 3) New empirical findings on age related differences in
smile expression dynamics; 4) The largest spontaneous/posed enjoyment smile database
in the literature for detailed and precise analysis of enjoyment smiles. The database, its
evaluation protocols and annotations are made available to the research community.
2 Related Work
The smile is the easiest emotional facial expression to pose voluntarily [2]. Broadly,
a smile can be identified as the upward movement of the mouth corners, which corre-
sponds to Action Unit 12 (AU12) in the facial action coding system (FACS) [6]. In terms
of anatomy, the zygomatic major muscle contracts and raises the corners of the lips dur-
ing a smile [4]. In terms of dynamics, smiles are composed of three non-overlapping
phases; the onset (neutral to expressive), apex, and offset (expressive to neutral), re-
spectively. Ekman individually identified 18 different smiles (such as enjoyment, fear,
miserable, embarrassment, listener response smiles) by describing the specific visual
differences on the face and indicating the accompanying action units [2].
Smiles of joy are called Duchenne smiles (D-smiles) in honor of Guillaume Du-
chenne, who did early experimental work on smiles. A good indicator for the D-smile
is the contraction of the orbicularis oculi, pars lateralis muscle that raises the cheek,
narrows the eye aperture, and forms wrinkles on the external side of the eyes. This ac-
tivation corresponds to Action Unit 6 and is called the Duchenne marker (D-marker)
in the literature. However, new empirical findings questioned the reliability of the D-
marker [7]. Recently, it has been shown that orbicularis oculi, pars lateralis can be
active or inactive under both spontaneous and posed conditions with similar frequen-
cies [8]. On the other hand, untrained people consistently use the D-marker to recognize
spontaneous and posed enjoyment smiles [9].
Symmetry is also potentially informative to distinguish spontaneous and posed en-
joyment smiles [4]. In [3], it is claimed that spontaneous enjoyment smiles are more
symmetrical than posed ones. Later studies reported no significant difference [7]. This
study has also failed to find significant effects of symmetry.
In the last decade, dynamical properties of smiles (such as duration, speed, and am-
plitude of smiles; movements of head and eyes) received attention as opposed to mor-
phological cues to discriminate between spontaneous and posed smiles. In [10], Cohn et
al. analyze correlations between lip-corner displacements, head rotations, and eye mo-
tion during spontaneous smiles. In another study, Cohn and Schmidt report that spon-
taneous smiles have smaller onset amplitude of lip corner movement, but a more stable
relation between amplitude and duration [5]. Furthermore, the maximum speed of the
smile onset is higher in posed samples and posed eyebrow raises have higher maximum
speed and larger amplitude, but shorter duration than spontaneous ones [7].
In [5], Cohn and Schmidt propose a system which distinguishes spontaneous and de-
liberate enjoyment smiles by a linear discriminant classifier using duration and
amplitude measures of smile onsets. They analyze the significance of the proposed
features and show that the amplitude of the lip corner movement is a strong linear func-
tion of duration in spontaneous smiles, but not in deliberate ones. In [11], Valstar et al.
propose a multimodal system to classify posed and spontaneous smiles. A GentleSVM-
Sigmoid classifier is used with the fusion of shoulder, head and inner facial movements.
In [12], Pfister et al. propose a spatiotemporal method using both natural and infrared
face videos to discriminate between spontaneous and posed facial expressions. By en-
abling the temporal space and using the image sequence as a volume, they extend the
Completed Local Binary Patterns (CLBP) texture descriptor into the spatio-temporal
CLBP-TOP features for this task.
Recently, Dibeklioglu et al. have proposed a system which uses eyelid movements
to classify spontaneous and posed enjoyment smiles [13], where distance-based and
angular features are defined in terms of changes in eye aperture. Several classifiers are
compared, and the reliability of eyelid movements is shown to be superior to that of
the eyebrows, cheek, and lip movements for smile classification.
In conclusion, the most relevant facial cues for smile classification in the literature
are 1) the D-marker, 2) symmetry, and 3) the dynamics of smiles. Instead of analyzing
these facial cues separately, in this paper, the aim is to use a more generic descriptor
set which can be applied to different facial regions to enhance the indicated facial cues
with detailed dynamic features. Additionally, we focus on the dynamical characteristics
of eyelid movements (such as duration, amplitude, speed, and acceleration), instead of
simple displacement analysis, motivated by the findings of [5] and [13].
3 Method
To analyze the facial dynamics, 11 facial feature points (eye corners, center of upper
eyelids, cheek centers, nose tip, lip corners) are tracked in the videos (see Fig. 1(a)).
Each point is initialized in the first frame of the videos for precise tracking and analysis.
In our system, we use the piecewise Bezier volume deformation (PBVD) tracker, which
is proposed by Tao and Huang [14] (see Fig. 1(b)). While this method is relatively old,
we have introduced improved methods for its initialization, and it is fast and robust with
accurate initialization. The generic face model consists of 16 surface patches. To form
a continuous and smooth model, these patches are embedded in Bezier volumes.
Fig. 1. (a) Used facial feature points with their indices and (b) the 3D mesh model
Three different face regions (eyes, cheeks, and mouth) are used to extract descriptive
features. First of all, the tracked 3D coordinates of the facial feature points p_i (see Fig. 1(a))
are used to align the faces in each frame. We estimate the 3D pose of the face, and
normalize the face with respect to roll, yaw, and pitch rotations. Since three non-collinear
points are enough to construct a plane, we use three stable landmarks (eye centers and
nose tip) to define a plane P. Eye centers are defined as the middle points between the
inner and outer eye corners, as c_1 = (p_1 + p_3)/2 and c_2 = (p_4 + p_6)/2. Angles between
the positive axes (denoted U) and the normal N_P of the plane P are computed as

\theta = \arccos\left( \frac{U \cdot N_P}{\|U\| \, \|N_P\|} \right), \quad \text{where} \quad N = \overrightarrow{9c_2} \times \overrightarrow{9c_1}.   (1)

In Equation 1, \overrightarrow{9c_2} and \overrightarrow{9c_1} denote the vectors from point 9 to points c_2 and c_1,
respectively. \|U\| and \|N_P\| are the magnitudes of the U and N_P vectors. According to
the face geometry, Equation 1 can estimate the exact roll (\theta_z) and yaw (\theta_y) angles of
the face with respect to the camera. If we assume that the face is approximately frontal
in the first frame, then the actual pitch angles (\theta_x) can be calculated by subtracting the
initial value. Once the pose of the head is estimated, the tracked points are normalized
with respect to rotation, scale, and translation as follows:

\bar{p}_i = \left[ p_i - \frac{c_1 + c_2}{2} \right] R_x(\theta_x) R_y(\theta_y) R_z(\theta_z) \, \frac{100}{\psi(c_1, c_2)},   (2)

where \bar{p}_i is the aligned point and R_x, R_y, and R_z denote the 3D rotation matrices for the
given angles. \psi(\cdot) denotes the Euclidean distance. On the normalized face, the middle
point between the eye centers is located at the origin and the interocular distance (distance
between eye centers) is set to 100 pixels. Since the normalized face is approximately
frontal with respect to the camera, we ignore the depth (Z) values of the normalized
feature points \bar{p}_i, and denote their 2D projections as l_i.
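A simplified sketch of the normalization in Equation 2, restricted to 2D (in-plane roll only; the yaw/pitch correction from Equation 1 is omitted). The 0-based eye-corner indexing below is our own convention, not the paper's:

```python
import numpy as np

# 2D face alignment sketch: remove the roll of the eye line, center the
# face at the midpoint between the two eye centers, and rescale so the
# interocular distance becomes 100 pixels, mirroring the normalization
# described above. `i_eye1`/`i_eye2` index the inner/outer corners of
# each eye (assumed landmark layout).

def normalize_2d(points, i_eye1=(0, 2), i_eye2=(3, 5)):
    pts = np.asarray(points, dtype=float)
    c1 = (pts[i_eye1[0]] + pts[i_eye1[1]]) / 2.0   # eye centers
    c2 = (pts[i_eye2[0]] + pts[i_eye2[1]]) / 2.0
    roll = np.arctan2(c2[1] - c1[1], c2[0] - c1[0])  # angle of the eye line
    c, s = np.cos(-roll), np.sin(-roll)
    R = np.array([[c, -s], [s, c]])
    centered = (pts - (c1 + c2) / 2.0) @ R.T         # translate, then rotate
    return centered * (100.0 / np.linalg.norm(c2 - c1))  # interocular -> 100
```

After this transform the midpoint of the eye centers sits at the origin and the eye centers are 100 pixels apart, matching the coordinate frame assumed by the amplitude signals that follow.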
After the normalization, the onset, apex, and offset phases of the smile can be detected
using the approach proposed by Schmidt et al. [15], by calculating the amplitude of the
smile as the distance of the right lip corner to the lip center during the smile. Differently
from [15], we estimate the smile amplitude as the mean amplitude of the right and left
lip corners, normalized by the length of the lip. Let D_lip(t) be the value of the mean
amplitude signal of the lip corners in frame t. It is estimated as:

D_{lip}(t) = \frac{\psi\left(\frac{l_{10}^1 + l_{11}^1}{2},\, l_{10}^t\right) + \psi\left(\frac{l_{10}^1 + l_{11}^1}{2},\, l_{11}^t\right)}{2\,\psi(l_{10}^1,\, l_{11}^1)},   (3)

where l_i^t denotes the 2D location of the ith point in frame t. The longest continuous
increase in D_lip is defined as the onset phase. Similarly, the offset phase is detected as
the longest continuous decrease in D_lip. The phase between the last frame of the onset
and the first frame of the offset defines the apex.
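The temporal segmentation just described (onset = longest continuous increase, offset = longest continuous decrease, apex in between) can be sketched as follows; the frame-index conventions are our own:

```python
# Segment a 1-D amplitude signal (e.g., D_lip over frames) into onset,
# apex, and offset phases, following the rule described above.

def longest_run(d, increasing=True):
    """Return (start, end) frame indices of the longest monotone run."""
    best = (0, 0)
    start = 0
    for t in range(1, len(d)):
        ok = d[t] > d[t - 1] if increasing else d[t] < d[t - 1]
        if not ok:
            start = t          # run broken: restart at current frame
        if t - start > best[1] - best[0]:
            best = (start, t)
    return best

def segment_phases(d):
    onset = longest_run(d, increasing=True)
    offset = longest_run(d, increasing=False)
    apex = (onset[1], offset[0])   # between end of onset and start of offset
    return onset, apex, offset
```

On a synthetic ramp-plateau-ramp signal this recovers the three expected phases; on real amplitude signals some smoothing before segmentation would likely be needed, which the paper does not detail.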
To extract features from the eyelids and the cheeks, additional amplitude signals are
computed. We estimate the (normalized) eyelid aperture D_eyelid and cheek displacement
D_cheek as follows:

D_{eyelid}(t) = \frac{\psi\left(\frac{l_1^t + l_3^t}{2},\, l_2^t\right)\varphi\left(\frac{l_1^t + l_3^t}{2},\, l_2^t\right) + \psi\left(\frac{l_4^t + l_6^t}{2},\, l_5^t\right)\varphi\left(\frac{l_4^t + l_6^t}{2},\, l_5^t\right)}{2\,\psi(l_1^t,\, l_3^t)},   (4)

D_{cheek}(t) = \frac{\psi\left(\frac{l_7^1 + l_8^1}{2},\, l_7^t\right) + \psi\left(\frac{l_7^1 + l_8^1}{2},\, l_8^t\right)}{2\,\psi(l_7^1,\, l_8^1)},   (5)

where \varphi(l_i, l_j) denotes the relative vertical location function, which equals 1 if l_j
is located (vertically) below l_i on the face, and −1 otherwise. D_lip, D_eyelid, and D_cheek are
hereafter referred to as amplitude signals. In addition to the amplitudes, speed (V) and
acceleration (A) signals are extracted by computing the first and the second derivatives
of the amplitudes, respectively.
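The speed and acceleration signals can be obtained with simple finite differences; the conversion to per-second units via a frame rate is our assumption, not stated in the text:

```python
import numpy as np

# Speed and acceleration as first and second derivatives of an amplitude
# signal, as described above. `fps` scales per-frame differences to
# per-second units (assumed; the database is recorded at 50 fps).

def dynamics(amplitude, fps=50.0):
    d = np.asarray(amplitude, dtype=float)
    v = np.gradient(d) * fps        # speed: first derivative
    a = np.gradient(v) * fps        # acceleration: second derivative
    return v, a
```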
In summary, descriptions of the used features and the facial cues related to them
are given in Table 1. Note that the defined features are extracted separately from each
phase of the smile. As a result, we obtain three feature sets for each of the eye, mouth
and cheek regions. Each phase is further divided into increasing (+) and decreasing
(−) segments, for each feature set. This allows a more detailed analysis of the feature
dynamics.
In Table 1, signals marked with superscript (+) and (−) denote the segments
of the related signal with continuous increase and continuous decrease, respectively.
For example, D⁺ pools the increasing segments in D. The length (in number of
frames) of a given signal and the frame rate of the video are also used in the
feature definitions. D_L and D_R define the amplitudes for the left and right sides
of the face, respectively. For each face region, three 25-dimensional feature vectors
are generated by concatenating these features.
In some cases, features cannot be calculated. For example, if we extract features
from the amplitude signal of the lip corners D_lip using the onset phase, then the
decreasing segments form an empty set (the length of D⁻ is 0). For such exceptions,
all the features describing the related segments are set to zero.
Table 1. Definitions of the extracted features and the facial cues related to them. The relation
with the D-marker is only valid for eyelid features.
feature vectors for each face region. To deal with feature redundancy, we use the Min-
Redundancy Max-Relevance (mRMR) algorithm to select discriminative features [16].
mRMR is an incremental method for minimizing the redundancy while selecting the
most relevant information as follows:

\max_{f_j \in F - S_{m-1}} \left[ I(f_j, c) - \frac{1}{m-1} \sum_{f_i \in S_{m-1}} I(f_j, f_i) \right],   (6)

where I denotes the mutual information function and c indicates the target class. F and
S_{m-1} denote the full feature set and the set of m − 1 already selected features, respectively.
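The incremental selection rule of Equation 6 can be sketched as a greedy loop; `mutual_info` is a placeholder supplied by the caller (e.g., a histogram-based estimator), not part of the paper:

```python
# Greedy mRMR selection: at each step, pick the feature that maximizes
# relevance I(f, c) minus the mean redundancy with the already-selected
# set, matching the incremental criterion of Eq. (6).

def mrmr_select(features, target, mutual_info, k):
    selected = []
    remaining = list(range(len(features)))
    while remaining and len(selected) < k:
        def score(j):
            rel = mutual_info(features[j], target)
            if not selected:
                return rel                     # first pick: pure relevance
            red = sum(mutual_info(features[j], features[i]) for i in selected)
            return rel - red / len(selected)   # relevance minus mean redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Note how a feature that is highly relevant but redundant with an already-selected one can lose to a less relevant, more complementary feature, which is exactly the behavior the criterion is designed to produce.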
During the training of our system, both individual feature vectors for onset, apex,
offset phases, and a vector with their fusion are generated. The most discriminative
features on each of the generated feature sets are selected using mRMR. Minimum
classification error on a separate validation set is used to determine the most informa-
tive facial region, and the most discriminative features on the selected region. Sim-
ilarly, to optimize the SVM configuration, different kernels (linear, polynomial, and
radial basis function (RBF)) with different parameters (size of RBF kernel, degree of
polynomial kernel) are tested on the validation set and the configuration with the mini-
mum validation error is selected. The test partition of the dataset is not used for param-
eter optimization.
4 Database
4.1 UvA-NEMO Smile Database
We have recently collected the UvA-NEMO Smile Database1 to analyze the dynamics
of spontaneous/posed enjoyment smiles. This database is composed of videos (in RGB
color) recorded with a Panasonic HDC-HS700 3MOS camcorder, placed on a monitor,
at approximately 1.5 meters away from the recorded subjects. Videos were recorded
with a resolution of 1920×1080 pixels at a rate of 50 frames per second under controlled
illumination conditions. Additionally, a color chart is present on the background of the
videos for further illumination and color normalization (See Fig. 2).
The database has 1240 smile videos (597 spontaneous, 643 posed) from 400 subjects
(185 female, 215 male), making it the largest smile database in the literature so far.
Ages of subjects vary from 8 to 76 years. 149 subjects are younger than 18 years (235
spontaneous, 240 posed). 43 subjects do not have spontaneous smiles and 32 subjects
have no posed smile samples. (See Fig. 3 for age and gender distributions).
Fig. 2. Spontaneous (top) and posed (bottom) enjoyment smiles from the UvA-NEMO Smile
Database
For posed smiles, each subject was asked to pose an enjoyment smile as realistically
as possible, sometimes after being shown the proper way in a sample video. Short, funny
video segments were used to elicit spontaneous enjoyment smiles. Approximately five
minutes of recordings were made per subject, and genuine smiles were segmented.
¹ This research was part of Science Live, the innovative research programme of science center
NEMO that enables scientists to carry out real, publishable, peer-reviewed research using
NEMO visitors as volunteers. See http://www.uva-nemo.org on how to obtain the UvA-NEMO
Smile Database.
Fig. 3. Age and gender distributions for the subjects (top), and for the smiles (bottom) in the
UvA-NEMO Smile Database
For each subject, a balanced number of spontaneous and posed smiles were selected
and annotated by seeking consensus of two trained annotators. Segments start/end with
neutral or near-neutral expressions. The mean duration of the smile samples is 3.9 sec-
onds (σ = 1.8). The average interocular distance in the database is approximately 200
pixels (estimated by using the tracked landmarks). 50 subjects wear eyeglasses.
Table 2. Overview of databases with spontaneous/posed smile content. Note that the posed part
of USTC-NVIE includes only the most expressive (apex) images.
part consists of only the most expressive (apex) images. The database has 302 spon-
taneous smiles (image sequences of onset phases) from 133 subjects, and 624 posed
smiles (single apex images) from 104 subjects.
5 Experimental Results
We use the UvA-NEMO smile database to evaluate our system, but report further results
on BBC and SPOS corpora in Section 5.4. We use a two level 10-fold cross-validation
scheme: Each time a test fold is separated, a 9-fold cross-validation is used to train
the system, and parameters are optimized without using the test partition. There is no
subject overlap between folds. For classification, linear SVM is found to perform better
than polynomial and RBF alternatives. For detailed analysis of the features and the
system, tracking is initialized by manually annotated facial landmarks. Additionally,
results with automatic initialization are also given in Section 5.4.
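The subject-disjoint fold construction described above can be sketched as follows; this is a minimal illustration under our own assumption (round-robin assignment of subjects to folds), not the authors' exact protocol:

```python
# Build cross-validation folds with no subject overlap between folds:
# all samples of a given subject land in the same fold, matching the
# evaluation protocol described above.

def subject_folds(subject_ids, n_folds=10):
    subjects = sorted(set(subject_ids))
    fold_of = {s: i % n_folds for i, s in enumerate(subjects)}  # round-robin
    folds = [[] for _ in range(n_folds)]
    for idx, s in enumerate(subject_ids):
        folds[fold_of[s]].append(idx)
    return folds
```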
Fig. 4. Effect of feature selection on correct classification rates for different facial regions
the features of all phases are concatenated. Lip corners (83.63%) and cheeks (83.15%)
follow the eyelids. Combined eyelid features have the minimum validation error in this
experiment.
Three different fusion strategies (early, mid-level, and late) are defined and evaluated.
Each fusion strategy enables feature selection before classification. In early fusion, fea-
tures of onset, apex, and offset of all regions are fused into one low-abstraction vector
and classified by a single classifier. Mid-level fusion concatenates features of all phases
for each region, separately. Constructed feature vectors are individually classified by
SVMs and the classifier outputs are fused (either by the SUM or PRODUCT rule, or by
voting³). In the late fusion scheme, feature sets of onset, apex, and offset for all facial
regions are individually classified by SVMs and fused.
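The classifier-output fusion rules mentioned above (voting, SUM) can be sketched as follows, assuming each base classifier exposes a posterior-like score in [0, 1] for the "spontaneous" class; the interface is our simplification of the SVM-posterior setup:

```python
# Combine per-region (or per-phase) classifier outputs into one decision.
# Scores are assumed posterior-like values in [0, 1]; label 1 stands for
# "spontaneous" in this sketch.

def fuse_by_voting(scores):
    votes = [1 if s >= 0.5 else 0 for s in scores]
    return 1 if 2 * sum(votes) > len(votes) else 0   # majority vote

def fuse_by_sum(scores):
    # SUM rule: average the posteriors, then threshold the mean
    return 1 if sum(scores) / len(scores) >= 0.5 else 0
```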
As shown in Fig. 5 (a), mid-level fusion provides the best performance (88.87%
with voting), followed by early and late fusion, respectively. The high performance
of mid-level fusion can be explained by eliminating redundant information through
feature selection on each region separately, after low-level feature abstraction, and then
classifying the different facial regions at a higher level of abstraction.
The features on which we base our analysis may depend on the age of the subjects.
In this section, we split the UvA-NEMO smile database into two partitions: young
people (age < 18 years) and adults (age ≥ 18 years), and all training and evalua-
tion is repeated separately for the two partitions of the database. Fig. 5 (b) shows the
classification accuracies. Regional performances are given using the fused (onset, apex,
offset) features of the related region. Mid-level fusion (voting) accuracies are also given
in Fig. 5 (b).
³ The SUM and PRODUCT rules fuse the computed posterior probabilities for the target classes
of different classifiers. To estimate these posterior probabilities, sigmoids of SVM output dis-
tances are used.
Fig. 5. (a) Effect of different fusion strategies and (b) age on correct classification rates
Our results show that eyelid features perform better on adults (as well as on the
whole set) than cheeks and lip corners. For adults, eyelid features reach an accuracy of
89.41%, whereas features of cheeks and lip corners have an accuracy of 83.79% and
82.22%, respectively. However, the correct classification rate for young people with
eyelid features is approximately 5% and 3% less than for the adults and whole set, re-
spectively. Lip corner features provide the most reliable classification for young people
with an accuracy of 86.53%, and also have the minimum validation error for this parti-
tion. Additionally, results show that fusion does not increase the performance for adults
and young people, individually, and the highest correct classification rate is achieved on
the whole set.
We compare our method with the state-of-the-art smile classification systems proposed
in the literature [5], [13], [12] by evaluating them on the UvA-NEMO database with the
same experimental protocols, as well as on BBC and SPOS. Results for [12] are given
by using only natural texture images. For a fair comparison, all methods are tested by
using the piecewise Bezier volume tracker [14] and the tracker is initialized automati-
cally by the method proposed in [19]. Correct classification rates are given in Table 3.
Results show that the proposed method outperforms the state-of-the-art methods. Mid-
level fusion with voting provides an accuracy of 87.02%, which is 9.76% (absolute)
higher than the performance of the method proposed in [5]. Using only eyelid features
decreases the correct classification rate by only 1.29% (absolute) in comparison to mid-
level fusion. This confirms the reliability of eyelid movements and the discriminative
power of the proposed dynamical eyelid features to distinguish between types of smiles.
Our system with eyelid features has an 85.73% accuracy, significantly higher than
that of [13] (71.05%), which uses only eyelid movement features without any temporal
segmentation. This shows the importance of temporal segmentation. The result of [12],
a 73.06% classification rate, is less than the accuracy of our method with only onset
features, and shows that spatiotemporal features are not as reliable as facial dynamics.
Since [5] relies on solely the onset features of lip corners, we also tested our method
with onset features of lip corners, and obtained an accuracy of 80.73% (compared
to [5]'s 77.26%). We conclude that using automatically selected features from a large
Table 3. Correct classification rates on the BBC, SPOS, and UvA-NEMO databases
pool of informative features serves better than relying on a few carefully selected mea-
sures for this problem. Manually selected features may also show less generalization
power across different (database-specific) recording conditions.
It is important to note that the proposed method uses solely onset features on the SPOS
corpus, since it has only onset phases of smiles. We have observed that spontaneous
smiles are generally classified better than posed ones for all methods. One possible
explanation is that dynamical facial features have more variance in posed smiles. Sub-
sequently, class boundaries of spontaneous smiles are better defined, and this leads to a
higher accuracy.
6 Discussion
In our experiments, onset features of lip corners perform best for individual phases. This
result is consistent with the findings of Cohn et al. [5]. However, when onset, apex, and
offset phases are fused, the eyelid movements are more descriptive than those of cheeks
and lip corners for enjoyment smile classification.
On UvA-NEMO, the best fusion scheme increases the correct classification rate by
only 1.29% (absolute) with respect to the accuracy of eyelid features. This finding sup-
ports our motivation and confirms the discriminative power and the reliability of eyelid
movements to classify enjoyment smiles. However, it is important to note that temporal
segmentation of the smiles is performed by using lip corner movements, which means
that additional information from the movements of lip corners is leveraged.
Highly significant (p < 0.001) feature differences (of selected features) between
adults and young people are obtained. For both spontaneous and posed smiles, max-
imum and mean apertures of eyes are larger for adults. During onset, both amplitude
of eye closure and closure speed are higher for young people. During offset, amplitude
and speed of eye opening are higher for young people. When we analyze the signifi-
cance levels of the most selected features for smile classification, we see that the size
of the eye aperture is smaller during spontaneous smiles. However, many subjects in
UvA-NEMO lower their eyelids also during posed smiles. This result is consistent with
the findings of Krumhuber and Manstead [8], which indicate that D-marker can exist
during both spontaneous and posed enjoyment smiles.
Another important finding is that the speed and acceleration of eyelid movements
are higher for posed smiles. As a result, since faster eyelid movements of young people
cause confusion with posed smiles, the classification accuracy with eyelid features is
higher for adults. Similarly, features extracted from the cheek region perform better for
adults, since cheek movements of adults are slower and more stationary. The duration
of spontaneous smiles is longer than that of posed ones, but the lip corner movement for
posed smiles is faster (also in terms of acceleration). This improves the accuracy of
classification with lip corner features in favor of young people, since the lip corner
movements of young people are significantly faster than adults during posed smiles.
Since eyelid and cheek features are reliable in adults as opposed to lip corners in
young people, fusion on a specific age group decreases the accuracy compared to the
performance on the whole set. Lastly, there is no significant symmetry difference (in
terms of amplitude) between spontaneous and posed smiles (as in [20], [7]) or between
young people and adults. More detailed analysis of age related smile dynamics is given
in [21], which uses the proposed features (facial dynamics) for age estimation.
7 Conclusions
In this paper, we have presented a smile classifier that can distinguish posed and spon-
taneous enjoyment smiles with high accuracy. The method is based on the hypothesis
that dynamics of eyelid movements are reliable cues for identifying posed and spon-
taneous smiles. Since the movements of eyelids are complex to analyze (because of
blinks and continuous change in eye aperture), novel features have been proposed to
describe dynamics of eyelid movements in detail. The proposed features also incorpo-
rate facial cues previously used in the literature (D-marker, symmetry, and dynamics)
for classification of smiles, and can be generalized to any facial region.
We have introduced the largest spontaneous/posed enjoyment smile database in
the literature (1,240 samples from 400 subjects, with ages ranging from 8 to 76 years) for
detailed and precise analysis of enjoyment smiles. On this database, we have verified
the discriminative power of eyelid movement dynamics and showed its superiority over
lip corner and cheek movements. We have evaluated fusion of features, and obtained
minor improvements over using only eyelid features. We provided comparative results
on three smile databases with the proposed method, as well as three other methods.
We report new and significant empirical findings on smile dynamics that can be
leveraged to implement better facial expression analysis systems: 1) The maximum and
mean eye apertures are smaller, and the eyelid (as well as cheek and lip corner)
movements are faster, for young people than for adults, during both spontaneous and
posed smiles. 2) Mean eye aperture is smaller during spontaneous smiles in comparison
to posed ones. 3) The speed and acceleration of eyelid movements are higher in posed
smiles. 4) There is no significant difference in (movement) symmetry between sponta-
neous and posed smiles, even when tested separately for young people and adults.
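Findings 1)-3) above are statements about first- and second-order dynamics of an eye-aperture signal. Such statistics can be sketched as follows; the feature names, frame rate, and synthetic signals below are illustrative assumptions, not the paper's exact descriptors:

```python
import numpy as np

def movement_dynamics(aperture, fps=30.0):
    """Speed and acceleration statistics of an eye-aperture signal.

    `aperture` is a 1-D array of per-frame eye-aperture measurements;
    the feature set and fps value are illustrative.
    """
    dt = 1.0 / fps
    speed = np.abs(np.diff(aperture)) / dt          # first-derivative magnitude
    accel = np.abs(np.diff(aperture, n=2)) / dt**2  # second-derivative magnitude
    return {
        "max_aperture": float(np.max(aperture)),
        "mean_aperture": float(np.mean(aperture)),
        "mean_speed": float(np.mean(speed)),
        "mean_accel": float(np.mean(accel)),
    }

# A posed-like smile traverses the same aperture range in fewer frames.
t_slow = np.linspace(0, np.pi, 60)   # 2 s at 30 fps
t_fast = np.linspace(0, np.pi, 20)   # about 0.67 s
spontaneous = 1.0 - 0.4 * np.sin(t_slow)
posed = 1.0 - 0.4 * np.sin(t_fast)

d_spont = movement_dynamics(spontaneous)
d_posed = movement_dynamics(posed)
assert d_posed["mean_speed"] > d_spont["mean_speed"]
```

On these synthetic signals the faster traversal yields larger speed and acceleration statistics at equal amplitude, which is the shape of finding 3).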
References
1. Ambadar, Z., Cohn, J.F., Reed, L.I.: All smiles are not created equal: Morphology and timing
of smiles perceived as amused, polite, and embarrassed/nervous. J. Nonverbal Behav. 33,
17–34 (2009)
2. Ekman, P.: Telling lies: Cues to deceit in the marketplace, politics, and marriage. W.W. Norton
& Company, New York (1992)
3. Ekman, P., Hager, J.C., Friesen, W.V.: The symmetry of emotional and deliberate facial ac-
tions. Psychophysiology 18, 101–106 (1981)
4. Ekman, P., Friesen, W.V.: Felt, false, and miserable smiles. J. Nonverbal Behav. 6, 238–252
(1982)
5. Cohn, J.F., Schmidt, K.L.: The timing of facial motion in posed and spontaneous smiles. Int.
Journal of Wavelets, Multiresolution and Information Processing 2, 121–132 (2004)
6. Ekman, P., Friesen, W.V.: The Facial Action Coding System: A technique for the measure-
ment of facial movement. Consulting Psychologists Press Inc., San Francisco (1978)
7. Schmidt, K.L., Bhattacharya, S., Denlinger, R.: Comparison of deliberate and spontaneous
facial movement in smiles and eyebrow raises. J. Nonverbal Behav. 33, 35–45 (2009)
8. Krumhuber, E.G., Manstead, A.S.R.: Can Duchenne smiles be feigned? New evidence on
felt and false smiles. Emotion 9, 807–820 (2009)
9. Manera, V., Giudice, M.D., Grandi, E., Colle, L.: Individual differences in the recognition of
enjoyment smiles: No role for perceptual-attentional factors and autistic-like traits. Frontiers
in Psychology 2 (2011)
10. Cohn, J.F., Reed, L.I., Moriyama, T., Xiao, J., Schmidt, K.L., Ambadar, Z.: Multimodal
coordination of facial action, head rotation, and eye motion during spontaneous smiles. In:
IEEE AFGR, pp. 129–135 (2004)
11. Valstar, M.F., Pantic, M.: How to distinguish posed from spontaneous smiles using geometric
features. In: ACM ICMI, pp. 38–45 (2007)
12. Pfister, T., Li, X., Zhao, G., Pietikainen, M.: Differentiating spontaneous from posed facial
expressions within a generic facial expression recognition framework. In: ICCV Workshops,
pp. 868–875 (2011)
13. Dibeklioglu, H., Valenti, R., Salah, A.A., Gevers, T.: Eyes do not lie: Spontaneous versus
posed smiles. In: ACM Multimedia, pp. 703–706 (2010)
14. Tao, H., Huang, T.: Explanation-based facial motion tracking using a piecewise Bezier vol-
ume deformation model. In: CVPR, pp. 611–617 (1999)
15. Schmidt, K.L., Cohn, J.F., Tian, Y.: Signal characteristics of spontaneous facial expressions:
Automatic movement in solitary and social smiles. Biological Psychology 65, 49–66 (2003)
16. Peng, H., Long, F., Ding, C.: Feature selection based on mutual information criteria of max-
dependency, max-relevance, and min-redundancy. IEEE Trans. on PAMI 27, 1226–1238
(2005)
17. Valstar, M.F., Pantic, M.: Induced disgust, happiness and surprise: An addition to the MMI
facial expression database. In: LREC, Workshop on EMOTION, pp. 65–70 (2010)
18. Wang, S., Liu, Z., Lv, S., Lv, Y., Wu, G., Peng, P., Chen, F., Wang, X.: A natural visible and
infrared facial expression database for expression recognition and emotion inference. IEEE
Trans. on Multimedia 12, 682–691 (2010)
19. Dibeklioglu, H., Salah, A.A., Gevers, T.: A statistical method for 2-D facial landmarking.
IEEE Trans. on Image Processing 21, 844–858 (2012)
20. Schmidt, K.L., Ambadar, Z., Cohn, J.F., Reed, L.I.: Movement differences between deliber-
ate and spontaneous facial expressions: Zygomaticus major action in smiling. J. Nonverbal
Behav. 30, 37–52 (2006)
21. Dibeklioglu, H., Gevers, T., Salah, A.A., Valenti, R.: A smile can reveal your age: Enabling
facial dynamics in age estimation. In: ACM Multimedia (2012)
Efficient Monte Carlo Sampler for Detecting
Parametric Objects in Large Scenes
1 Introduction
Many recent works exploiting point processes have been proposed to address a
large variety of computer vision problems [2–9]. The growing interest in these
probabilistic models is motivated by the need to manipulate parametric objects
interacting in scenes. Parametric objects can be defined in discrete and/or
continuous domains. They usually correspond to geometric entities, e.g. segments,
rectangles, circles or planes, but can more generally be any type of multi-
dimensional function. A point process requires the formulation of a probability
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 539–552, 2012.
© Springer-Verlag Berlin Heidelberg 2012
540 Y. Verdie and F. Lafarge
1.2 Motivations
The results obtained by these point processes are particularly convincing and
competitive, but their performance remains limited in terms of computation time
and convergence stability, especially on large scenes. These drawbacks explain
why industry has been reluctant until now to integrate these mathematical models
into its products. Indeed, the point processes presented in Section 1.1 emphasize
complex model formulations by proposing parametrically sophisticated objects [3, 4],
advanced techniques to fit objects to the data [9], and non-trivial spatial
interactions between objects [6, 8]. However, these works have only slightly
addressed the optimization issues arising from such complex models.
Jump-Diffusion algorithms [4, 12, 13] have been designed to speed up
MCMC sampling by inserting diffusion dynamics for the exploration of the
continuous subspaces. These samplers are unfortunately restricted to specific density
forms. Data considerations have also been used to drive the sampling more
efficiently [14] for image segmentation problems. Some works have also proposed
parallelization procedures using multiple chains simultaneously [15] or decomposition
schemes in configuration spaces of fixed dimension [16, 17]. However,
they are limited by border effects and are not designed to perform on large
scenes. In addition, the existing decomposition schemes cannot be used for
configuration spaces of variable dimension. A mechanism based on multiple creation
and destruction of objects has also been developed for addressing population
counting problems [2, 9]. Nevertheless, object creation requires the discretization
of the point coordinates, which induces a significant loss of accuracy.
These alternative versions of the conventional MCMC sampler globally allow
the improvement of optimization performance in specific contexts. That said,
the gains in terms of computation time remain weak and are usually realized
at the expense of solution stability. Finding a fast, efficient sampler for general
Markov point processes clearly represents a challenging problem.
1.3 Contributions
We present solutions to address this problem and to drastically reduce computation
times while guaranteeing the quality and stability of the solution. Our algorithm
makes several important contributions to the field.
Sampling in parallel - Contrary to the conventional MCMC sampler, which
makes the solution evolve by successive perturbations, our algorithm can perform
a large number of perturbations simultaneously using a single chain. The
Markovian property of point processes is exploited to make the global sampling
problem spatially independent within a neighborhood.
Non-uniform point distributions - Point processes mainly use uniform point
distributions, which are computationally easy to simulate but make the sampling
extremely slow. We propose an efficient mechanism allowing the modification,
creation or removal of objects by taking into account information on the observed
scenes. Contrary to the data-driven solutions proposed by [3] and [14],
our non-uniform distribution is not built directly from the image likelihood, but is
created via space-partitioning trees to ensure the sampling parallelization.
Efficient GPU implementation - We propose a GPU implementation
which significantly reduces computing times with respect to existing algorithms,
while increasing stability and improving the quality of the obtained solution.
3D object detection model from point clouds - To evaluate the algorithm
in 3D, we propose an original 3D point process to detect parametric objects from
point clouds. This model is applied to tree recognition from laser scans of large
urban and natural environments. To our knowledge, it is the first point process
sampler to date to perform in such highly complex state spaces.
where n(p) is the number of points associated with the event p. We denote by
P the space of configurations of points in K. The most natural point process
is the homogeneous Poisson process for which the number of points follows a
discrete Poisson distribution whereas the position of the points is uniformly and
independently distributed in K. Point processes can also provide more complex
realizations of points by being specified by a density h(·) defined in P and a
reference measure μ(·), under the condition that the normalization constant of
h(·) is finite:

    ∫_{p∈P} h(p) dμ(p) < ∞    (2)

The measure having the density h(·) is usually defined via the intensity
measure of a homogeneous Poisson process. Specifying a density h(·) allows
the insertion of data consistency, and also the creation of spatial interactions
between the points. In particular, the Markovian property can be used in point
processes, similarly to random fields, to create a spatial independence of the
points within a neighborhood. Note also that h(·) can be expressed by a Gibbs
energy U(·) such that

    h(·) ∝ exp(−U(·))    (3)
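The homogeneous Poisson process described above is straightforward to simulate: draw the number of points from a Poisson law with mean intensity × area, then place the points i.i.d. uniformly in K. A minimal sketch, where the `intensity` and `window` parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_poisson_process(intensity, window):
    """One realization of a homogeneous Poisson point process.

    `window` is ((xmin, xmax), (ymin, ymax)); the number of points follows
    Poisson(intensity * area) and positions are i.i.d. uniform in the window.
    """
    (x0, x1), (y0, y1) = window
    area = (x1 - x0) * (y1 - y0)
    n = rng.poisson(intensity * area)           # discrete Poisson count
    xs = rng.uniform(x0, x1, n)                 # uniform, independent positions
    ys = rng.uniform(y0, y1, n)
    return np.column_stack([xs, ys])

pts = sample_poisson_process(intensity=50.0, window=((0, 1), (0, 1)))
print(len(pts))  # around 50 on average
```

A density h(·) as in Eq. (2) then reweights such uniform realizations, which is what makes non-uniform, data-aware sampling possible.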
Fig. 1. From left to right: realizations of a point process in 2D, of a Markov point
process, and of a Markov point process of line-segments. The grey dashed lines represent
the pairs of points interacting with respect to the neighboring relationship, which is
specified here by a limit distance between two points.
Fig. 2. Independence of cells. (a) The width of the cell c2 is not large enough to ensure
the independence of the cells c1 and c3: the two grey points in c1 and c3 cannot be
perturbed at the same time. (b) The cells c1 and c3 are independent as Eq. 8 is satisfied.
(c) In dimension two, the cells are squares regrouped into 4 mic-sets (yellow, blue, red
and green). Each cell is adjacent to cells belonging to different mic-sets. (d) In dimension
three, the cells are cubes regrouped into 8 mic-sets.
The partitioning tree allows the creation of a non-uniform point distribution
naturally and efficiently. Indeed, the density progressively decreases when moving
away from the class of interest, as shown in Fig. 3, while being guaranteed to be
non-null. The point distribution is thus not directly affected when the class of interest
is inaccurately extracted.
Fig. 3. Space partitioning tree in dimension two. (b) A class of interest (blue area)
is estimated from (a) an input image. (c) A quadtree is created so that the levels
are recursively partitioned according to the class of interest. Each level is composed
of four mic-sets (yellow, blue, red and green sets of cells) to guarantee the sampling
parallelization. (d) The cell accumulation at the different levels defines a non-uniform
distribution. Note how the distribution naturally describes the class of interest by
progressively decreasing the density when moving away from it.
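Sampling from such a tree-induced distribution amounts to a two-stage draw: pick a cell with probability proportional to its accumulated weight, then draw a point uniformly inside it. The cell layout and weights below are made-up illustrations, not the paper's tree:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_point(cells, weights):
    """Draw one point from a non-uniform distribution defined by tree cells.

    `cells` is a list of axis-aligned boxes ((x0, x1), (y0, y1)) coming from a
    space-partitioning tree; `weights` is the (non-null) mass of each cell.
    """
    p = np.asarray(weights, dtype=float)
    p /= p.sum()                                   # normalize cell masses
    (x0, x1), (y0, y1) = cells[rng.choice(len(cells), p=p)]
    return rng.uniform(x0, x1), rng.uniform(y0, y1)

# Two cells: the class of interest gets most of the mass, but the other
# cell keeps a small, non-null density, as described above.
cells = [((0.0, 0.5), (0.0, 1.0)), ((0.5, 1.0), (0.0, 1.0))]
xs = [sample_point(cells, weights=[9.0, 1.0])[0] for _ in range(2000)]
print(sum(x < 0.5 for x in xs))  # roughly 1800 of 2000
```

Keeping every cell's weight strictly positive is what makes the sampler robust to an inaccurately extracted class of interest.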
- Choose a mic-set S_mic ⊂ K and a kernel type t ∈ T according to the probability
  Σ_{c∈S_mic} q_{c,t}
- For each cell c ∈ S_mic,
  - Perturb x in the cell c to a configuration y according to Q_{c,t}(x → ·)
  - Calculate the Green ratio

        R = (Q_{c,t}(y → x) / Q_{c,t}(x → y)) · exp((U(x) − U(y)) / T_t)    (11)
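The acceptance test behind the Green ratio is a standard Metropolis-Hastings step; since the cells of a mic-set are independent, one such test can run per cell in parallel. The function below is a hypothetical sketch of that per-cell test, not the paper's GPU kernel:

```python
import math
import random

def accept_perturbation(u_x, u_y, q_forward, q_reverse, temperature, rng=random):
    """Metropolis-Hastings test with a Green-style ratio.

    R = (Q(y -> x) / Q(x -> y)) * exp((U(x) - U(y)) / T); the move from x to
    y is accepted with probability min(1, R). All names are illustrative.
    """
    r = (q_reverse / q_forward) * math.exp((u_x - u_y) / temperature)
    return rng.random() < min(1.0, r)

# A downhill move (u_y < u_x) with symmetric proposals is always accepted.
assert accept_perturbation(u_x=5.0, u_y=3.0, q_forward=1.0, q_reverse=1.0,
                           temperature=1.0)
```

Lowering the temperature over iterations, as in simulated annealing, makes uphill moves increasingly unlikely while downhill moves remain free.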
4 Experiments
The proposed algorithm has been implemented on GPU for dim K = 2 and
dim K = 3, and tested on various problems including population counting, line-
network extraction from images, and object recognition from 3D point clouds.
Note that additional results and comparisons can be found in [18], as well as
details on energy formulations.
4.1 Implementation
The algorithm has been implemented on GPU using CUDA. A thread is ded-
icated to each simultaneous perturbation so that operations are performed in
parallel for each cell of a mic-set. The sampler is thus all the more efficient as the
mic-set contains many cells and, generally speaking, as the scene supported by K
is large. Moreover, the code has been programmed to avoid time-consuming op-
erations. In particular, the threads do not communicate between each other, and
memory coalescing permits fast memory access. The memory transfer between
CPU and GPU has also been minimized by indexing the parametric objects. The
experiments presented in this section have been performed on a 2.5 GHz Xeon
computer with an Nvidia graphics card (Quadro 4800, architecture 1.3).
The algorithm has been evaluated on population counting problems from large-
scale images using a point process in 2D, i.e. with dim K = 2. The problem
presented in Fig. 4 consists in detecting migrating birds to extract information
on their number, their size and their spatial organization. The point process
is marked by ellipses, which are simple geometric objects defined by a point
(center of mass) and 3 additional parameters, and are well adapted to capture
the bird contours. The energy is specified by a unitary data term based on the
Bhattacharyya distance between the radiometry inside and outside of the object,
and a pairwise interaction penalizing the strong overlapping of objects [18].
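The Bhattacharyya distance between the radiometry distributions inside and outside an object can be sketched as follows; the histogram values are made up for illustration:

```python
import numpy as np

def bhattacharyya_distance(h1, h2):
    """Bhattacharyya distance between two histograms.

    D = -ln( sum_i sqrt(h1_i * h2_i) ) after normalization; identical
    distributions give D = 0, and larger D means better inside/outside contrast.
    """
    h1 = np.asarray(h1, dtype=float); h1 /= h1.sum()
    h2 = np.asarray(h2, dtype=float); h2 /= h2.sum()
    bc = np.sum(np.sqrt(h1 * h2))          # Bhattacharyya coefficient in [0, 1]
    return -np.log(max(bc, 1e-12))         # clamp to avoid log(0)

inside = [10, 40, 50]    # e.g. radiometry histogram inside a candidate ellipse
outside = [45, 40, 15]   # histogram just outside it
assert bhattacharyya_distance(inside, inside) < 1e-9
assert bhattacharyya_distance(inside, outside) > 0.01
```

A data term built on this distance rewards ellipses whose interior radiometry differs clearly from their surroundings.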
Fig. 4. Bird counting by a point process of ellipses. (right) More than ten thousand
birds are extracted by our algorithm in a few minutes from (left) a large scale aerial
image. (middle) A quad-tree structure is used to create a non-uniform point distribu-
tion. Note, on the cropped parts, how the birds are accurately captured by ellipses in
spite of the low quality of the image and the partial overlapping of birds.
Computation time, quality of the reached energy, and stability are the three
important criteria used to evaluate and compare the performance of samplers. As
shown in Fig. 5, our algorithm obtains the best results for each of the criteria
compared to the existing samplers. In particular, we reach a better energy (−8.76
vs. −5.78 for [2] and −2.01 for [10]) while significantly reducing computation
times (269 sec vs. 1078 sec for [2] and 2.8 × 10^6 sec for [10]). Note that, for
the reasons mentioned in Section 4.1, this gap in performance increases when
the input scene becomes larger. Fig. 5 also underlines an important limitation
of the reference point process sampler for population counting [2] compared to
our algorithm. Indeed, the discretization of the object parameters required in
[2] causes approximate detection and localization of objects which explains the
average quality of the reached energy. The stability is analyzed by the coefficient
of variation, defined as the standard deviation over the mean and known to be
a relevant statistical measure for comparing methods having different means.
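The coefficient of variation is simply the standard deviation divided by the mean. A small sketch, where the energy values are hypothetical stand-ins for repeated sampler runs, not the paper's measurements:

```python
import statistics

def coefficient_of_variation(samples):
    """Standard deviation over mean, used here to compare sampler stability."""
    return statistics.stdev(samples) / statistics.fmean(samples)

# Two hypothetical sets of final energies over repeated runs: the second
# sampler is more stable (smaller |CV|) despite having a different mean.
energies_a = [-8.1, -8.9, -8.4, -9.0, -8.6]
energies_b = [-5.6, -5.8, -5.7, -5.9, -5.8]
assert abs(coefficient_of_variation(energies_b)) < abs(coefficient_of_variation(energies_a))
```

Because the CV is scale-free, it lets samplers that converge to different mean energies be compared on equal footing.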
Coefficient of variation

                                        energy   time   #objects
RJMCMC [10]                              7.3%    4.2%    1.7%
multiple birth and death [2]             5.0%    2.1%    1.3%
our sampler without partitioning tree    7.4%    6.2%    1.6%
our sampler with partitioning tree       4.4%    1.8%    1.1%
Fig. 5. Performance of the various samplers. (left) The graph describes the energy
decrease over time for the bird image presented in Fig. 4. Time is represented using
a logarithmic scale. Note that the RJMCMC algorithm [10] is so slow that the convergence
is not displayed on the graph (2.9 × 10^7 iterations are required, versus 1.8 × 10^4
for our algorithm). (right) The table presents the coefficients of variation of the energy,
time and number of objects reached at convergence over 50 simulations.
Our sampler provides a better stability than the existing algorithms. The impact
of the data-driven space partitioning tree is also measured by performing tests
with uniform point distributions. The performance decreases but remains better
than that of the existing algorithms. In particular, the sampler loses stability, and the
objects are detected and located less accurately than with the partitioning tree.
The algorithm has also been tested on line-network extraction from images.
The parametric objects here are line-segments defined by a point (center of mass)
and two additional parameters (length and orientation). Contrary to the population
counting model, the pairwise potential includes a connection interaction for
linking the line-segments. Figure 6 shows a road network extraction result obtained
from a satellite image, whereas Table 1 provides elements of comparison
Table 1. Comparisons with existing line-network extraction methods from the image
presented in Fig. 6
Fig. 7. Tree recognition from point clouds by a 3D-point process specified by 3D-parametric
models of trees. Our algorithm detects trees and recognizes their shapes in
large-scale (left, input scan: 13.8M points) natural and (top right, input scan: 2.3M
points) urban environments, in spite of other types of urban entities, e.g. buildings, cars
and fences, contained in the input point clouds (red dot scans). An aerial image is joined to
(bottom right) the cropped part to provide a more intuitive representation of the scene
and the tree locations. Note, on the cropped part, how the parametric models fit well
to the input points corresponding to trees, and how the tree-competition interaction
allows the regularization of the tree type in a neighborhood.
5 Conclusion
We propose a new algorithm to sample point processes whose strengths lie in
the exploitation of Markovian properties to enable the sampling to be performed
in parallel, and the integration of a data-driven mechanism allowing efficient
distributions of the points in the scene. Our algorithm improves upon the performance
of the existing samplers in terms of computing times and stability, especially on
large scenes where the gain is very significant. It can be used without particular
restrictions, contrary to most samplers, and even appears as an interesting alternative
to the standard optimization techniques for MRF labeling problems. In
particular, one can envisage using the model proposed in Section 4.3 to extract
any type of parametric objects from large 3D point clouds.
In future works, it would be interesting to implement the algorithm on other
GPU architectures more adapted to the manipulation of 3D data structures,
e.g. architecture 2.0, so that the performance for point processes in 3D could
be improved.
References
1. Baddeley, A.J., Lieshout, M.V.: Stochastic geometry models in high-level vision.
Journal of Applied Statistics 20 (1993)
2. Descombes, X., Minlos, R., Zhizhina, E.: Object extraction using a stochastic birth-
and-death dynamics in continuum. JMIV 33 (2009)
3. Ge, W., Collins, R.: Marked point processes for crowd counting. In: CVPR, Miami,
U.S. (2009)
4. Lafarge, F., Gimelfarb, G., Descombes, X.: Geometric feature extraction by a
multi-marked point process. PAMI 32 (2010)
5. Lieshout, M.V.: Depth map calculation for a variable number of moving objects
using Markov sequential object processes. PAMI 30 (2008)
6. Mallet, C., Lafarge, F., Roux, M., Soergel, U., Bretar, F., Heipke, C.: A marked
point process for modeling lidar waveforms. IP 19 (2010)
7. Lacoste, C., Descombes, X., Zerubia, J.: Point processes for unsupervised line net-
work extraction in remote sensing. PAMI 27 (2005)
8. Sun, K., Sang, N., Zhang, T.: Marked Point Process for Vascular Tree Extrac-
tion on Angiogram. In: Yuille, A.L., Zhu, S.-C., Cremers, D., Wang, Y. (eds.)
EMMCVPR 2007. LNCS, vol. 4679, pp. 467–478. Springer, Heidelberg (2007)
9. Utasi, A., Benedek, C.: A 3-D marked point process model for multi-view people
detection. In: CVPR, Colorado Springs, U.S. (2011)
10. Green, P.: Reversible jump Markov chain Monte Carlo computation and Bayesian
model determination. Biometrika 82 (1995)
11. Hastings, W.: Monte Carlo sampling methods using Markov chains and their
applications. Biometrika 57 (1970)
12. Han, F., Tu, Z.W., Zhu, S.: Range image segmentation by an effective jump-
diffusion method. PAMI 26 (2004)
13. Srivastava, A., Grenander, U., Jensen, G., Miller, M.: Jump-Diffusion Markov pro-
cesses on orthogonal groups for object pose estimation. Journal of Statistical Plan-
ning and Inference 103 (2002)
14. Tu, Z., Zhu, S.: Image Segmentation by Data-Driven Markov Chain Monte Carlo.
PAMI 24 (2002)
15. Harkness, M., Green, P.: Parallel chains, delayed rejection and reversible jump
MCMC for object recognition. In: BMVC, Bristol, U.K. (2000)
16. Byrd, J., Jarvis, S., Bhalerao, A.: On the parallelisation of MCMC-based image pro-
cessing. In: IEEE International Symposium on Parallel and Distributed Processing,
Atlanta, U.S. (2010)
17. Gonzalez, J., Low, Y., Gretton, A., Guestrin, C.: Parallel Gibbs sampling: From
colored fields to thin junction trees. Journal of Machine Learning Research (2011)
18. Verdie, Y., Lafarge, F.: Towards the parallelization of Reversible Jump Markov Chain
Monte Carlo algorithms for vision problems. Research report 8016, INRIA (2012)
19. Rochery, M., Jermyn, I., Zerubia, J.: Higher order active contours. IJCV 69 (2006)
Supervised Geodesic Propagation
for Semantic Label Transfer
Xiaowu Chen, Qing Li, Yafei Song, Xin Jin, and Qinping Zhao
1 Introduction
Semantic labeling, or multi-class segmentation, is a basic and important topic in
computer vision and image understanding. In the last decades, there has been
great advance in this research area [1–5]. However, it is still a challenging problem
for current computer vision techniques to recognize and segment objects in an
image as human beings do. Recently, some methods have been proposed to
solve this problem with trained generative or discriminative models [1–3], which
usually need a fixed dataset containing certain category classes. Moreover, some
approaches aim to integrate low-level features and context priors into a
bottom-up and top-down model [4]. These methods require a training procedure
on the fixed dataset for the parametric model, and they are not scalable
with the number of object categories. For example, to include more object categories
in a learning-based model, we need to train the model with additional
label categories to adapt the model parameters.
Corresponding authors.
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 553–565, 2012.
© Springer-Verlag Berlin Heidelberg 2012
Fig. 1. The objective. Our objective is to get the semantic segmentation of the input
image. This input image is taken from the MSRC dataset [6].
Our main contributions include: (1) a label transfer method with geodesic
propagation; (2) a supervised seed selection and label propagation scheme within
geodesic propagation for label transfer. Figure 1 shows our label transfer result.
2 Related Works
Liu et al. [9] first addressed the semantic labeling task within a nonparametric
scene parsing framework denoted as label transfer. Following the idea of label
transfer, Tighe and Lazebnik [10] use scene-level matching with global features,
superpixel-level matching with local features, and Markov random field (MRF)
optimization for incorporating neighborhood context. However, these two nonparametric
methods both require an existing large training database. In addition,
finding highly similar training images for a given query image is difficult. To ensure
that the retrieved images are proper, multiple image sets covering all semantic
categories in the input image are first obtained in [11]. Then a KNN-MRF
matching scheme is proposed to establish dense correspondences between the input
image and each retrieved image set. A Markov random field optimization is
then applied based on those matching correspondences. In [12], an ANN bilateral
matching scheme is proposed. Both the retrieval of partially similar images and the
parsing scheme are built upon the ANN bilateral matching. The test image is
parsed by integrating multiple cues in a Markov random field. Inspired by these
works, our method uses the similarity of retrieved images for semantic labeling
without precise pixel-level or superpixel-level matching.
Geodesic distance, which is the length of the shortest path between points in the space, is
used as a metric to classify pixels in [14]. The method proposed in [15]
is also based on the idea of seed-growing geodesic segmentation. To avoid the
bias to seed placement and the lack of edge modeling in geodesic segmentation, Price et al.
[15] propose to combine geodesic distance information with edge information in
a graph cut optimization framework. Gulshan et al. [16] introduce Geodesic
Forests, which exploit the structure of shortest paths in implementing the star-
convexity constraints. In addition to the methods [14–16] that mainly focus on
foreground/background segmentation with geodesic distance, Chen et al. [17]
apply geodesic distance to multi-class segmentation by integrating color and
edge constraints into the edge weight. However, they [17] utilize no contextual
similarity in the propagation. Besides, our approach clearly differs from [17] in
the training set and the seed selection scheme. Our method is similar to
[19, 20] in using the GIST feature and a boosting classifier; however, we use geodesic
propagation to get a deterministic solution, while [19, 20] use a CRF or MRF to
get the optimal solution. In addition, we utilize the contextual information of
both the similar images and the test image itself, while [19, 20] utilize only the
contextual information of the test image or test sequence.
Fig. 2. Overview of our supervised geodesic propagation. (a) The input image. (b) The
dataset with manual annotations. Each color uniquely maps to a category label. The
semantic label maps are best viewed in color. (c) We first get the similar image set
from the dataset with Gist matching and re-ranking (Section 3.1). (d) Then we infer
the proposal map of the input image using the boosted model which is trained on this
similar image set (Section 3.2). (e) The propagation indicator is learned on the similar
image set (Section 3.3). (f) and (g) are the texture and boundary features of the input
image (Section 3.4). (h) The propagation indicator, the proposal map, and the texture and
boundary features of the input image are integrated into the geodesic propagation
framework (Section 3.4) to get the semantic label result of the input image.
through Gist matching [18]. We infer the proposal map of the input image for
seed selection. Then the proposal map, the texture and boundary features of the
input image, and the contextual similarity of the similar images are integrated
into the geodesic propagation framework to get the segmentation result. There is
no explicit energy to minimize in the framework; instead, the multi-class semantic
labeling is solved through the geodesic propagation process.
The notations in this paper are presented here for clarity. Given an input
image I, the semantic labeling problem is to assign each pixel x ∈ I
a certain label l ∈ L, L = {1, 2, ..., N}. To solve this problem, we build up
a neighbor-connected graph G = ⟨V, E⟩. In this paper, each superpixel i
is a vertex v in the graph G and is assigned a specific label l contained in the
dataset through the geodesic propagation procedure. The edge set E consists of edges
In order to transfer labels of annotated images to the input image, the first thing
we need to do is to select proper images similar to the input image in semantic
categories and contextual information. Since similar image retrieval is
not the main focus of this paper, we use Gist matching as recent label transfer
methods [9–12] did. The gist descriptor [18] is employed to retrieve the K-nearest
neighbors from the dataset, and the similar image set is formed with these neighbors.
After gist matching, the K-nearest neighbors in the similar image set are
re-ranked in the following way. We over-segment the input image and each of its
similar images R ∈ {R} using the algorithm described in [21]. Then each superpixel
i ∈ I is matched to the superpixel r(i) ∈ R which has the smallest matching
distance to i. The following distance metric is used to compute the matching
distance of a correspondence in the re-ranking procedure. Given two images I and
R, the matching distance Dr(I, R) is scored as:
    Dr(I, R) = Σ_{i∈I, r(i)∈R} (fv_i − fv_{r(i)})²    (1)
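The re-ranking score of Eq. (1) can be sketched as a nearest-neighbor sum over superpixel feature vectors; the array layout and variable names below are illustrative assumptions:

```python
import numpy as np

def matching_distance(features_I, features_R):
    """Re-ranking score D_r(I, R) of Eq. (1).

    Each superpixel of I is matched to its nearest superpixel of R in
    feature space; `features_I` and `features_R` are (n, d) arrays of
    superpixel feature vectors (names are illustrative).
    """
    total = 0.0
    for fv in features_I:
        # r(i): the superpixel of R with the smallest distance to i
        d2 = np.sum((features_R - fv) ** 2, axis=1)
        total += d2.min()
    return total

I_feats = np.array([[0.0, 1.0], [2.0, 2.0]])
R_feats = np.array([[0.0, 1.0], [2.0, 3.0]])
print(matching_distance(I_feats, R_feats))  # 0.0 + 1.0 = 1.0
```

Lower scores rank a retrieved image R higher, since every superpixel of I then finds a close counterpart in R.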
Previous works [14–16] utilize manual scribbles as the initial seeds for each
category or object, which brings human priors into the segmentation task.
Instead, we adopt a dynamic seed selection for geodesic propagation. The similar
images in {R_K} imply the possible categories in the input image. We assume
that the categories l ∈ {R_K} cover all the categories in I. To exploit the inherent
possibilities, the similar image set R_K is used as the training set for the input
image I to learn the recognition model. The 17-dimensional raw texton features
and the JointBoost algorithm are used in the learning procedure, as in [22].
When we get the recognition proposal map of I, the initial geodesic distances
of all classes for each superpixel are defined upon this proposal map. Each superpixel
is temporarily assigned the initial label that has the maximum probability
p_l(i). According to equation 2, we get the initial geodesic distance of the temporary
class for each superpixel. Then a distance map of the input image can
be obtained: superpixels which have higher probabilities will have smaller initial
geodesic distances.
Our supervised geodesic propagation algorithm starts with the initial geodesic
distances and initial labels for all the vertices. In each step of propagation, ver-
tices whose labels are still undetermined are kept in an unlabeled set Q, which is
sorted to find the current seed with minimum geodesic distance. The initial seed for propa-
gation is obtained from the recognition proposal map. Once a vertex is selected as
seed in a step, it is removed from the set Q and its semantic label is fixed
to the current label. The edge weight W(i, j) and the propagation indicator are inte-
grated in the propagation iteration to decide how to update the geodesic distance.
In each step of geodesic propagation, the geodesic distances between a labeled
seed and its neighboring undetermined superpixels have to be updated according
to the corresponding indicator. Suppose i has been labeled li and j has an undetermined
label; we then employ the propagation indicator of category li and get the indicator
confidence value Tli(i, j). If Tli(i, j) is true and W(i, j) is less than the threshold ε,
then the geodesic distance Dis(j) of j is updated to Dis(i) + W(i, j) and the
label of j is assigned li; otherwise Dis(j) and the label of j keep their current
values. Figure 2(h) illustrates the propagation process step by step. The se-
mantic labels are propagated from the initial seed to the entire image after 616
iterations. Comparing iterations 233, 369, 496 and 561, we can see that our
method works well around object edges and propagates seman-
tic labels among multiple classes. The algorithm converges in time linear
in the image size; we implement it with the O(n) queue introduced in [26].
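The propagation loop described above can be sketched as a Dijkstra-style traversal over the superpixel graph. This is a minimal sketch, not the authors' implementation; the function and argument names are our own, and `weight` and `indicator` are stand-ins for the learned appearance distance W(i, j) and the learned propagation indicator:

```python
import heapq

def geodesic_propagation(neighbors, weight, init_dist, init_label, indicator, eps):
    """Propagate labels over a superpixel graph, Dijkstra-style.

    neighbors[i]       -> iterable of superpixels adjacent to i
    weight(i, j)       -> edge weight W(i, j) (appearance distance)
    init_dist[i]       -> initial geodesic distance from the proposal map
    init_label[i]      -> temporary label with maximal proposal probability
    indicator(l, i, j) -> learned propagation indicator T_l(i, j) (True/False)
    eps                -> edge-weight threshold
    """
    dist = dict(init_dist)
    label = dict(init_label)
    determined = set()
    heap = [(d, i) for i, d in dist.items()]
    heapq.heapify(heap)
    while heap:
        d, i = heapq.heappop(heap)          # current seed: minimum distance
        if i in determined or d > dist[i]:  # skip stale heap entries
            continue
        determined.add(i)                   # label of i is now fixed
        for j in neighbors[i]:
            if j in determined:
                continue
            w = weight(i, j)
            # update only if the indicator fires and the edge is cheap enough
            if indicator(label[i], i, j) and w < eps and dist[i] + w < dist[j]:
                dist[j] = dist[i] + w
                label[j] = label[i]
                heapq.heappush(heap, (dist[j], j))
    return label, dist
```

On a toy chain of three superpixels where the first is a confident "grass" seed, the seed's label sweeps across the chain as long as the indicator fires and the edge weights stay under the threshold.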
Fig. 3. Illustration of one iteration. The three statuses of iteration 313 are shown in
detail. Dark green means superpixels determinately labeled as the class grass, and
dark blue determinately as cow. Light green means superpixels temporarily labeled as
grass, and light blue temporarily as cow. (a) In the first status, the current
seed is selected (red curve) and its label is determined as grass. (b) In the
second status, we get the undetermined neighboring superpixels of the current seed and
update both their labels and geodesic distances. We show only some of its neighbors
as examples. The intra features of the input image and our indicator jointly
decide whether to update the current label and distance of these neighbors. (c)
In the third status, one neighbor is updated while the other is not. Then it is ready
for the next iteration. (d) In the final iteration (616), we get the final result with all
superpixels assigned their determined labels.
4 Experiments
In this section we investigate the performance of our method on several challeng-
ing datasets, and compare our results with several state-of-the-art works. The
datasets are the Cambridge-driving Labeled Video dataset (CamVid) [27], the
Microsoft Research Cambridge (MSRC) dataset [6] and the CBCL StreetScenes
(CBCL) dataset [28]. Table 1 shows our semantic segmentation accuracy com-
pared with other methods. "Our Method without GP" indicates the accuracy of
the proposal map (without the geodesic propagation); "Our Method" indicates
the accuracy after our geodesic propagation procedure. Our method is the best on the
CamVid and MSRC datasets. Although it is 1.1% lower
than [11] on the CBCL dataset, it is 3.46% higher than [11] on the CamVid
dataset. Moreover, our method performs significantly better than [3] on the
CBCL dataset.
Supervised Geodesic Propagation for Semantic Label Transfer 561
Input Image Our Result Ground Truth Input Image Our Result Ground Truth
Building Tree Sky Car Signsymbol Road
Pedestrian Fence Columnpole Sidewalk Bicyclist
Fig. 4. Our results on CamVid dataset. Left column shows the input image and middle
column shows our label transfer result. Ground truth is shown on the right. Legend is
shown on the bottom.
In our experiments, the training procedure of the Joint Boost model takes about
40 seconds per image, and the training of the label propagation scheme for each
image also takes about 40 seconds. Geodesic propagation takes about 5 seconds. On
each dataset, we run 500 rounds to train the Joint Boost model for each test
image. All experiments are run on PCs. Some results on
the three datasets are shown in Figures 4, 5 and 6.
Input Image Our Result Ground Truth Input Image Our Result Ground Truth
Building Grass Tree Cow Sheep Sky Aeroplane Water Face Car Bike
Flower Sign Bird Book Chair Road Cat Dog Body Boat
Fig. 5. Our results on MSRC dataset. Left column shows the input image and middle
column shows our label transfer result. Ground truth is shown on the right. Legend is
shown on the bottom.
Input Image Our Result Ground Truth Input Image Our Result Ground Truth
Car Pedestrian Bicycle Building Tree Sky Road Sidewalk Store
Fig. 6. Our results on CBCL dataset. Left column shows the input image and middle
column shows our label transfer result. Ground truth is shown on the right. Legend is
shown on the bottom.
to train the propagation indicator. The threshold parameters are set to 0.8 and
0.75, respectively. The segmentation accuracy of our method is 87.76%, as shown
in Table 1.
Figure 5 shows some results of the proposed method on this dataset. The com-
parisons with previous methods are shown in Table 1. Our method performs best
among those methods on this dataset.
The CBCL StreetScenes dataset contains 3547 still images of street scenes, which
include nine categories: car, pedestrian, bicycle, building, tree, sky, road, side-
walk, and store. In our test, the pedestrian, bicycle, and store categories are not included,
as in [11]. To compare with [11], we resize the original images to 320×240 pixels.
We use 5 similar images for each test image to train the propagation indicator.
The threshold parameters are set to 0.3 and 0.6, respectively. The segmentation
accuracy of our method is 71.7%, as shown in Table 1. Although our method is
not the best on this dataset (about 1 percent lower than [11]), we perform better
than [3], which has an accuracy of 61.9%. Figure 6 shows the results of our method.
References
1. Tu, Z.: Auto-context and its application to high-level vision tasks. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. IEEE Press, Piscataway (2008)
2. Gould, S., Fulton, R., Koller, D.: Decomposing a Scene into Geometric and Semantically Consistent Regions. In: IEEE International Conference on Computer Vision, pp. 1–8. IEEE Press, Piscataway (2009)
3. Shotton, J., Johnson, M., Cipolla, R.: Semantic texton forests for image categorization and segmentation. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1–8. IEEE Press, Piscataway (2008)
4. Han, F., Zhu, S.-C.: Bottom-up/Top-Down Image Parsing by Attribute Graph Grammar. In: IEEE International Conference on Computer Vision, pp. 1778–1785. IEEE Press, Piscataway (2005)
5. Zhao, P., Fang, T., Xiao, J., Zhang, H., Zhao, Q., Quan, L.: Rectilinear parsing of architecture in urban environment. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 342–349. IEEE Press, Piscataway (2010)
6. Shotton, J., Winn, J., Rother, C., Criminisi, A.: TextonBoost for Image Understanding: Multi-Class Object Recognition and Segmentation by Jointly Modeling Texture, Layout, and Context. Int. J. Comput. Vis. 81, 2–23 (2009)
7. Russell, B., Torralba, A., Murphy, K., Freeman, W.: LabelMe: A database and web-based tool for image annotation. MIT AI Lab Memo (2005)
8. Torralba, A., Fergus, R., Freeman, W.: 80 Million Tiny Images: A Large Data Set for Nonparametric Object and Scene Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 30, 1958–1970 (2008)
9. Liu, C., Yuen, J., Torralba, A.: Nonparametric scene parsing: Label transfer via dense scene alignment. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1972–1979. IEEE Press, Piscataway (2009)
10. Tighe, J., Lazebnik, S.: SuperParsing: Scalable Nonparametric Image Parsing with Superpixels. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part V. LNCS, vol. 6315, pp. 352–365. Springer, Heidelberg (2010)
11. Zhang, H., Xiao, J., Quan, L.: Supervised Label Transfer for Semantic Segmentation of Street Scenes. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part V. LNCS, vol. 6315, pp. 561–574. Springer, Heidelberg (2010)
12. Zhang, H., Fang, T., Chen, X., Zhao, Q., Quan, L.: Partial similarity based nonparametric scene parsing in certain environment. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2241–2248. IEEE Press, Piscataway (2011)
13. Oliva, A., Torralba, A.: Building the gist of a scene: the role of global image features in recognition. Prog. Brain Res. 155, 23–36 (2006)
14. Bai, X., Sapiro, G.: A geodesic framework for fast interactive image and video segmentation and matting. In: IEEE International Conference on Computer Vision, pp. 1–8 (2007)
15. Price, B., Morse, B., Cohen, S.: Geodesic Graph Cut for Interactive Image Segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 3161–3168. IEEE Press, Piscataway (2010)
16. Gulshan, V., Rother, C., Criminisi, A., Blake, A., Zisserman, A.: Geodesic Star Convexity for Interactive Image Segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 3129–3136. IEEE Press, Piscataway (2010)
17. Chen, X., Zhao, D., Zhao, Y., Lin, L.: Accurate semantic image labeling by fast geodesic propagation. In: IEEE International Conference on Image Processing, pp. 4021–4024. IEEE Press, Piscataway (2009)
18. Oliva, A., Torralba, A.: Modeling the shape of the scene: a holistic representation of the spatial envelope. Int. J. Comput. Vis. 42, 145–175 (2001)
19. Xiao, J., Quan, L.: Multiple view semantic segmentation for street view images. In: Proceedings of 12th IEEE International Conference on Computer Vision, pp. 686–693. IEEE Press, Piscataway (2009)
20. Xiao, J., Fang, T., Zhao, P., Lhuillier, M., Quan, L.: Image-based street-side city modeling. ACM Trans. Graph. 28, 1–12 (2009)
21. Arbelaez, P., Maire, M., Fowlkes, C., Malik, J.: Contour Detection and Hierarchical Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 33, 898–916 (2011)
22. Shotton, J., Winn, J.M., Rother, C., Criminisi, A.: TextonBoost: Joint Appearance, Shape and Context Modeling for Multi-class Object Recognition and Segmentation. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006, Part I. LNCS, vol. 3951, pp. 1–15. Springer, Heidelberg (2006)
23. Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001)
24. Liaw, A., Wiener, M.: Classification and Regression by randomForest. R News 2, 18–22 (2002)
25. Jaiantilal, A.: Classification and regression by randomforest-matlab (2009), http://code.google.com/p/randomforest-matlab
26. Yatziv, L., Bartesaghi, A., Sapiro, G.: O(n) implementation of the fast marching algorithm. J. Comput. Phys. 212, 393–399 (2006)
27. Brostow, G., Fauqueur, J., Cipolla, R.: Semantic Object Classes in Video: A High-Definition Ground Truth Database. Pattern Recognit. Lett. 30, 88–97 (2009)
28. Bileschi, S.: CBCL StreetScenes challenge framework (2007), http://cbcl.mit.edu/software-datasets/streetscenes/
Bayesian Face Revisited: A Joint Formulation
Dong Chen1 , Xudong Cao3 , Liwei Wang2 , Fang Wen3 , and Jian Sun3
1
University of Science and Technology of China
chendong@mail.ustc.edu.cn
2
The Chinese University of Hong Kong
lwwang@cse.cuhk.edu.hk
3
Microsoft Research Asia, Beijing, China
{xudongca,fangwen,jiansun}@microsoft.com
1 Introduction
Face verification and face identification are two sub-problems in face recognition.
The former verifies whether two given faces belong to the same person, while
the latter answers the "who is who" question for a probe face set given a gallery face
set. In this paper, we focus on the verification problem, which is more widely
applicable and lays the foundation of the identification problem.
Bayesian face recognition [1] by Baback Moghaddam et al. is one of the repre-
sentative and successful face verification methods. It formulates the verification
task as a binary Bayesian decision problem. Let HI represent the intra-personal
(same) hypothesis that two faces x1 and x2 belong to the same subject, and HE
the extra-personal (not same) hypothesis that the two faces are from different
subjects. Then, the face verification problem amounts to classifying the differ-
ence Δ = x1 − x2 as intra-personal variation or extra-personal variation. Based
on the MAP (Maximum a Posteriori) rule, the decision is made by testing a log
likelihood ratio r(x1, x2):

r(x1, x2) = log [ P(Δ | HI) / P(Δ | HE) ].  (1)
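As a quick illustration of the decision rule in Eq. (1) (our own sketch, not the authors' code), once the intra-personal and extra-personal covariances of Δ have been estimated, the ratio can be evaluated directly. The sketch below restricts itself to zero-mean Gaussians with diagonal covariances so it stays in pure Python:

```python
import math

def gaussian_logpdf(delta, var):
    """Log-density of a zero-mean Gaussian with diagonal covariance `var`."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + d * d / v)
               for d, v in zip(delta, var))

def log_likelihood_ratio(x1, x2, var_intra, var_extra):
    """r(x1, x2) = log P(delta | H_I) - log P(delta | H_E), as in Eq. (1)."""
    delta = [a - b for a, b in zip(x1, x2)]
    return gaussian_logpdf(delta, var_intra) - gaussian_logpdf(delta, var_extra)
```

With a tight intra-personal covariance and a broad extra-personal one, a small difference Δ yields a positive ratio (accept "same person"), while a large Δ yields a negative one.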
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 566–579, 2012.
© Springer-Verlag Berlin Heidelberg 2012
Fig. 1. The 2-D data is projected to 1-D by x − y. The two classes, which are separable in
the joint representation, are inseparable after the projection. Class 1 and Class 2 can
be considered as an intra-personal and an extra-personal hypothesis in face recognition.
and derive a closed-form expression of the log likelihood ratio, which makes the
computation efficient in the test phase.
We also find interesting connections between our joint Bayesian formulation
and two other types of face verification methods: metric learning [7–10] and
reference-based methods [11–14]. On one hand, the similarity metric derived from
our joint formulation goes beyond the standard form of the Mahalanobis distance.
The new similarity metric preserves the separability in the original feature space
and leads to better performance. On the other hand, the joint Bayesian method
can be viewed as a kind of reference model with a parametric form.
Many supervised approaches, including ours, need good training data which
contains sufficient intra-personal and extra-personal variations. Good training data
should be both wide and deep: having a large number of different subjects and
enough images of each subject. However, the current large face datasets
in the wild condition suffer from either small width (PubFig [11]) or small depth
(Labeled Faces in the Wild (LFW) [15]). To address this issue, we introduce a new
dataset, the Wide and Deep Reference dataset (WDRef), which is both wide (around
3,000 subjects) and deep (2,000+ subjects with over 15 images, 1,000+ subjects
with more than 40 images). To facilitate further research and evaluation of
supervised methods on the same test bed, we also share two kinds of extracted
low-level features of this dataset. The whole dataset can be downloaded from
our project website http://home.ustc.edu.cn/~chendong/JointBayesian/.
Our main contributions can be summarized as follows:
– A joint formulation of the Bayesian face model with an appropriate prior on the face
representation. The joint model can be effectively learned from large-scale,
high-dimensional training data, and the verification can be efficiently per-
formed using the derived closed-form solution.
– We demonstrate that our joint Bayesian face model outperforms state-of-the-art super-
vised methods, through comprehensive comparisons on LFW and WDRef.
Our simple system achieves better average accuracy than the current best
commercial system (face.com) [16].
– A large dataset (with annotations and extracted low-level features) which is
both wide and deep is released.
Fig. 2. Prior on face representation: both the identity distribution (left) and the
within-person variation (right) are modeled by Gaussians. Each face instance is repre-
sented by the sum of the identity and its variation.
same identity. Here, the latent variables μ and ε follow two Gaussian distributions
N(0, Sμ) and N(0, Sε), where Sμ and Sε are two unknown covariance matrices.
For brevity, we call the above representation and the associated assumptions a
face prior.
Joint Formulation with Prior. Given the above prior, under either hypothesis
the joint distribution of {x1, x2} is also Gaussian with zero mean.
Based on the linear form of Eqn. (2) and the independence assumption between
μ and ε, the covariance of the two faces is

cov(HI) = [Sμ+Sε, Sμ; Sμ, Sμ+Sε],   cov(HE) = [Sμ+Sε, 0; 0, Sμ+Sε],  (3)

and the log likelihood ratio has the closed form

r(x1, x2) = log [ P(x1, x2 | HI) / P(x1, x2 | HE) ] = x1^T A x1 + x2^T A x2 − 2 x1^T G x2,  (4)

where

A = (Sμ + Sε)^(−1) − (F + G),  (5)

and F and G are defined through the block inverse

[F+G, G; G, F+G] = [Sμ+Sε, Sμ; Sμ, Sμ+Sε]^(−1).  (6)

Note that in Eqn. (4) the constant term is omitted for simplicity.
There are three interesting properties of this log likelihood ratio metric (readers
can refer to the supplementary materials for the proofs):
– Both matrices A and G are negative semi-definite.
– The negative log likelihood ratio degrades to a Mahalanobis distance if A = G.
– The log likelihood ratio metric is invariant to any full-rank linear transform
of the feature.
We will discuss the new metric obtained by the log likelihood ratio, compared with
the Mahalanobis distance, further in Section 3.
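To make Eqns. (4)–(6) concrete, consider the scalar case where Sμ and Sε are single numbers; the block inverse in Eqn. (6) then has a simple closed form that can be written out by hand. The sketch below is our own illustration of that special case, not the paper's code:

```python
def joint_bayes_scalar(sm, se):
    """A and G of Eqns. (5)-(6) for 1-D features (S_mu = sm, S_eps = se scalars)."""
    det = (sm + se) ** 2 - sm ** 2          # determinant of the H_I covariance
    f_plus_g = (sm + se) / det              # diagonal entry of its inverse
    g = -sm / det                           # off-diagonal entry
    a = 1.0 / (sm + se) - f_plus_g          # Eqn. (5)
    return a, g

def r(x1, x2, a, g):
    """Log likelihood ratio metric of Eqn. (4), constant term omitted."""
    return a * x1 * x1 + a * x2 * x2 - 2.0 * g * x1 * x2
```

In this 1-D setting both A and G come out non-positive (the scalar analogue of negative semi-definiteness), and identical inputs score higher than opposite-signed ones, as a same-person pair should.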
2.4 Discussion
In this section, we discuss three aspects of our joint formulation.
Robustness of EM. Generally speaking, the EM algorithm only guarantees
convergence. The quality of the solution depends on the objective function (the
likelihood of the observations) and the initialization. Here we empiri-
cally study the robustness of the EM algorithm with the following five experiments:
1. Sμ and Sε are set to the between-class and within-class matrices (no EM).
2. Sμ and Sε are initialized with the between-class and within-class matrices.
3. Sμ is initialized with the between-class matrix and Sε is initialized randomly.
4. Sμ is initialized randomly and Sε is initialized with the within-class matrix.
5. Both Sμ and Sε are initialized randomly.
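The between-class and within-class quantities used in experiments 1–4 can be sketched in one dimension as follows (our own illustration; the paper estimates full covariance matrices, not scalar variances):

```python
def scatter_init(samples_by_subject):
    """Within-class and between-class variances used to initialize
    S_eps and S_mu (1-D sketch; the paper uses full covariance matrices).

    samples_by_subject: list of lists, one inner list per subject.
    """
    all_x = [x for xs in samples_by_subject for x in xs]
    grand = sum(all_x) / len(all_x)         # grand mean over all samples
    within, between, n = 0.0, 0.0, 0
    for xs in samples_by_subject:
        mean = sum(xs) / len(xs)            # per-subject (identity) mean
        between += len(xs) * (mean - grand) ** 2
        within += sum((x - mean) ** 2 for x in xs)
        n += len(xs)
    return between / n, within / n          # (S_mu init, S_eps init)
```

The between-class term captures how far subject means spread around the grand mean (identity variation μ), while the within-class term captures how samples scatter around their own subject mean (intra-personal variation ε).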
The above experiments are performed using the training dataset (WDRef)
and the testing dataset (LFW, under the unrestricted protocol). We use PCA to reduce
the feature dimension to 2,000. The rest of the experimental settings are the
same as those in Section 4.2.
The experimental results in Table 1 show that: 1) EM optimization improves
the performance (87.8% to 89.44%); 2) EM optimization is robust to various
initializations. The estimated parameters converge to similar solutions from
quite different starting points. This indicates that our objective function may
have very good convexity properties.
Table 2. Roles of the matrices A and G in the log likelihood ratio metric. Both training
and testing are conducted on LFW following the unrestricted protocol. The SIFT feature
is used and its dimension is reduced to 200 by PCA. Readers can refer to Section 4.4
for more information.

Experiment | A and G | A → 0  | G → 0  | A → G  | G → A
Accuracy   | 87.5%   | 84.93% | 55.63% | 83.73% | 84.85%
If we consider the references to be infinite and independently sampled from a distri-
bution P(μ), the above equation can be rewritten as:

log [ ∫ P(x1 | μ) P(x2 | μ) P(μ) dμ / ( ∫ P(x1 | μ) P(μ) dμ · ∫ P(x2 | μ) P(μ) dμ ) ].  (12)
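For scalar features with Gaussian identity and noise distributions, Eq. (12) can be evaluated numerically; the sketch below (our own illustration, not from the paper) approximates the three integrals over μ with trapezoid quadrature:

```python
import math

def npdf(x, mean, var):
    """Density of a 1-D Gaussian N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def reference_ratio(x1, x2, s_mu, s_eps, lo=-30.0, hi=30.0, n=6000):
    """Numerically evaluate Eq. (12) for 1-D features:
    references mu ~ N(0, s_mu), observations x | mu ~ N(mu, s_eps).
    Trapezoid quadrature over mu (a sketch, not a production integrator)."""
    h = (hi - lo) / n
    num = p1 = p2 = 0.0
    for k in range(n + 1):
        mu = lo + k * h
        w = h * (0.5 if k in (0, n) else 1.0)   # trapezoid weights
        prior = npdf(mu, 0.0, s_mu)
        a = npdf(x1, mu, s_eps)
        b = npdf(x2, mu, s_eps)
        num += w * a * b * prior                # joint under shared identity
        p1 += w * a * prior                     # marginal of x1
        p2 += w * b * prior                     # marginal of x2
    return math.log(num / (p1 * p2))
```

For s_mu = s_eps = 1 and x1 = x2 = 0, the closed form of this ratio is log(2/√3), which the quadrature reproduces; for distant pairs the ratio goes negative, matching the intended same/different decision behavior.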
4 Experimental Results
4.1 Dataset
Labeled Faces in the Wild (LFW) [15] contains face images of celebrities collected
from the Internet with large variations in pose, expression, and lighting. There
are in total 5,749 subjects, but only 95 persons have more than 15 images; for
4,069 identities, just one image is available.
In this work, we introduce a new dataset, WDRef, to relieve this depth issue.
We collect and annotate face images from image search engines by querying a
set of people names. We emphasize that there is no overlap between our
queries and the names in LFW. The faces are then detected by a face detector
and rectified by an affine transform estimated from five landmarks, i.e., the eyes, nose and
mouth corners. The landmarks are detected by [23]. We extract two kinds of
low-level features, LBP [24] and LE [25], on the rectified holistic face.
Our final dataset contains 99,773 images of 2,995 people, and 2,065 of them
have more than 15 images. The large width and depth of this dataset will en-
able researchers to better develop and evaluate supervised algorithms which
need sufficient intra-personal and extra-personal variations. We will make the
extracted features publicly available.
all identities in WDRef are used for training. We vary the number of identities
in the training data from 600 to 3,000 to study the performance w.r.t. training
data size. When testing on WDRef, we split it into two mutually exclusive parts:
300 different subjects are used for testing, and the others for training. Similar to the
protocol in LFW, the test images are divided into 10 cross-validation sets, and
each set contains 300 intra-personal and extra-personal pairs. We use the LBP fea-
ture and reduce the feature dimension by PCA to the best dimension for each
method (2,000 for joint Bayesian and unified subspace; 400 for the Bayesian and
naive formulation methods).
[Figure 4 plots: verification accuracy vs. number of training identities (600–3,000) on LFW (left) and WDRef (right), for classic Bayesian [1], unified subspace [3], the naive formulation, and our joint Bayesian formulation.]
Fig. 4. Comparison with other Bayesian face related works. The joint Bayesian method
is consistently better across different training data sizes and on two databases:
LFW (left) and WDRef (right).
implementation [19]. In this experiment, the original 5,900-dimensional LBP feature is
reduced to the best dimension by PCA for each method (2,000 for all methods).
For LDA and PLDA, we further search for the best subspace dimension
(100 in our experiment). As shown in Figure 5, our method significantly out-
performs the other two methods for any size of training data. Both PLDA and
LDA only use the discriminative information in such a low-dimensional subspace and
ignore the other dimensions, even though they contain useful information.
On the contrary, our method treats each dimension equally and leverages the high-
dimensional feature.
[Figure 5 plots: accuracy vs. training data size on LFW (left) and WDRef (right), comparing our method with LDA and PLDA.]
In order to compare fairly with other approaches published on the LFW website,
in this experiment we follow the LFW unrestricted protocol, using only LFW
for training. We combine the scores of 4 descriptors (SIFT, LBP, TPLBP and
FPLBP) with a linear SVM classifier. The same feature combination can also
be found in [12]. As shown in Table 3, our joint Bayesian method achieves the
highest accuracy.
Table 3. Comparison with state-of-the-art methods following the LFW unrestricted
protocol. The results of the other methods are taken from their original papers.
Finally, we present our best performance on LFW along with existing state-of-the-
art methods and systems. Our purposes are threefold: 1) to verify the generalization
ability from one dataset to another; 2) to see what we can achieve using
limited outside training data; 3) to compare with other methods which also rely
on outside training data.
We follow the standard unrestricted protocol in LFW and simply combine
the four similarity scores computed from the LBP feature and three types of LE
[25] features. As shown by the ROC curves in Figure 6, our approach with the simple
joint Bayesian formulation is ranked No. 1 and achieves 92.4% accuracy. The
error rate is reduced by over 10% compared with the current best (commercial)
system, which takes the additional advantages of an accurate 3D normalization
and billions of training samples.
Fig. 6. The ROC curve of the joint Bayesian method compared with state-of-the-art
methods which also rely on outside training data on LFW
5 Conclusions
In this paper, we have revisited classic Bayesian face recognition and pro-
posed a joint formulation in the same probabilistic framework. The superior
performance in comprehensive evaluations shows that classic Bayesian face
recognition is still highly competitive, given modern low-level fea-
tures and training data of moderate size.
References
1. Moghaddam, B., Jebara, T., Pentland, A.: Bayesian face recognition. Pattern Recognition 33, 1771–1782 (2000)
2. Phillips, P.J., Moon, H., Rizvi, S.A., Rauss, P.J.: The FERET evaluation methodology for face-recognition algorithms. PAMI 22, 1090–1104 (2000)
3. Wang, X., Tang, X.: A unified framework for subspace face recognition. PAMI 26, 1222–1228 (2004)
4. Wang, X., Tang, X.: Subspace analysis using random mixture models. In: CVPR (2005)
5. Wang, X., Tang, X.: Bayesian face recognition using gabor features, pp. 70–73 (2003)
6. Li, Z., Tang, X.: Bayesian face recognition using support vector machine and face clustering. In: CVPR (2004)
7. Weinberger, K.Q., Blitzer, J., Saul, L.K.: Distance metric learning for large margin nearest neighbor classification, vol. 10, pp. 207–244 (2005)
8. Davis, J.V., Kulis, B., Jain, P., Sra, S., Dhillon, I.S.: Information-theoretic metric learning. In: ICML (2007)
9. Guillaumin, M., Verbeek, J.J., Schmid, C.: Is that you? Metric learning approaches for face identification. In: ICCV (2009)
10. Ying, Y., Li, P.: Distance metric learning with eigenvalue optimization. Journal of Machine Learning Research 13, 1–26 (2012)
11. Kumar, N., Berg, A.C., Belhumeur, P.N., Nayar, S.K.: Attribute and simile classifiers for face verification. In: ICCV (2009)
12. Taigman, Y., Wolf, L., Hassner, T.: Multiple one-shots for utilizing class label information. In: BMVC (2009)
13. Yin, Q., Tang, X., Sun, J.: An associate-predict model for face recognition. In: CVPR (2011)
14. Zhu, C., Wen, F., Sun, J.: A rank-order distance based clustering algorithm for face tagging. In: CVPR (2011)
15. Huang, G.B., Ramesh, M., Berg, T., Learned-Miller, E., Hanson, A.: Labeled faces in the wild: A database for studying face recognition in unconstrained environments. In: ECCV (2008)
16. Taigman, Y., Wolf, L.: Leveraging billions of faces to overcome performance barriers in unconstrained face recognition. arXiv preprint arXiv:1108.1122 (2011)
17. Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J.: Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. PAMI 19, 711–720 (1997)
18. Ioffe, S.: Probabilistic Linear Discriminant Analysis. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006, Part IV. LNCS, vol. 3954, pp. 531–542. Springer, Heidelberg (2006)
19. Prince, S., Li, P., Fu, Y., Mohammed, U., Elder, J.: Probabilistic models for inference about identity. PAMI 34, 144–157 (2012)
20. Susskind, J., Memisevic, R., Hinton, G., Pollefeys, M.: Modeling the joint density of two images under a variety of transformations. In: CVPR (2011)
21. Ramanan, D., Baker, S.: Local distance functions: A taxonomy, new algorithms, and an evaluation. In: ICCV (2009)
22. Nguyen, H.V., Bai, L.: Cosine Similarity Metric Learning for Face Verification. In: Kimmel, R., Klette, R., Sugimoto, A. (eds.) ACCV 2010, Part II. LNCS, vol. 6493, pp. 709–720. Springer, Heidelberg (2011)
23. Liang, L., Xiao, R., Wen, F., Sun, J.: Face Alignment Via Component-Based Discriminative Search. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part II. LNCS, vol. 5303, pp. 72–85. Springer, Heidelberg (2008)
24. Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. PAMI 24, 971–987 (2002)
25. Cao, Z., Yin, Q., Tang, X., Sun, J.: Face recognition with learning-based descriptor. In: CVPR (2010)
Beyond Bounding-Boxes: Learning Object Shape
by Model-Driven Grouping
1 Introduction
Recognizing and localizing all instances of object categories in a novel query
image is a core task on the way to automatic scene understanding. Object detec-
tion typically proceeds by localizing object bounding-boxes (e.g. [1]), which are
parameterized by their location, scale and aspect ratio. A classifier is then eval-
uated for each detection window, thereby providing hypotheses that are ranked
by their score. Such approaches have proven very successful in bench-
marks, but two issues remain unresolved. First, objects are not
box-shaped, so the detection window contains a significant amount of back-
ground clutter that tends to deteriorate the classification result for the whole window.
Indeed, even complex models like [1] are eventually based on a holistic repre-
sentation of the whole bounding-box, including the clutter. The second problem
is that object shape becomes available for detection only once the object has
been segregated from the background. To overcome both problems, not only
is background suppression required, but reasoning about the object shape
is also essential. Recent work (e.g. [2]) in the field of segmentation has shown that
relying only on low-level cues is not enough. Furthermore, it appears reasonable
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 580–593, 2012.
© Springer-Verlag Berlin Heidelberg 2012
Fig. 1. Object representation (best viewed in color). We divide the bounding-box into
cells and calculate features on each cell. Inferring a foreground segmentation cell-mask
from unsegmented training data, we suppress the background features by setting the
corresponding cells to zero.
authors discard the shape information contained in the super-pixels. They sam-
ple at each pixel 5 different color features and utilize them within a standard
bag-of-words model to classify the object.
The paper is organized as follows. In Sec. 2 we explain how to suppress the
background and represent the shape of an object; in Sec. 3 we then describe
how to learn our model (without using pixel-wise segmentation masks in the
training stage) and infer the shape of an object using a MIL framework.
2 Model
2.1 Suppressing the Background
Detection window approaches like [16, 1] have demonstrated good performance
on difficult benchmarks. Consequently, such a framework offers a good basis
for implementing our idea. The detection window is commonly divided into a grid
of cells, and we learn object shape to suppress cells in the clutter and concen-
trate on the actual object. In this section we describe how to model a fore-
ground/background segregation.
Suppose an object Oj within image I is given, and assume for a moment that
a pixel-wise segmentation of the object's foreground is also given. In the next section
we will describe how to automatically learn a cell-accurate shape estimation for
the object's foreground.
First, we divide the bounding-box j into an array of size l0 × h0. For each
cell we calculate a d-dimensional feature. This l0 × h0 × d matrix is called
Φ0(p0j), where p0j = (x, y) is the top-left position of the bounding-box in image
I. Specifically, in this paper we use histograms of oriented gradients (HoG) as
features. These widely used and fast-to-calculate descriptors capture the edge or
gradient structure that is very characteristic of local shape. Additionally, they
exhibit invariance to local geometric and photometric transformations ([1, 16]).
However, our framework is independent of this specific choice of features. A
combination of different descriptors (e.g. as in [12]) can be integrated into our
model and should enable further performance improvements.
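The cell-grid representation and the background suppression it enables can be sketched in a few lines (our own illustration; the actual system computes HoG descriptors per cell, which we abstract away here as precomputed feature lists):

```python
def masked_representation(cells, mask):
    """Zero out background cells of a l0 x h0 grid of d-dim features.

    cells: flat list (length l0 * h0) of d-dim feature lists (e.g. HoG histograms)
    mask:  binary root-cell mask m_0^j, 1 = object foreground, 0 = clutter
    """
    return [[f for f in cell] if m else [0.0] * len(cell)
            for cell, m in zip(cells, mask)]
```

Cells marked as clutter contribute all-zero features, so the downstream classifier scores only the object's own shape and appearance.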
The foreground of an object is modeled by defining a binary vector m0j ∈
B^(1×l0·h0). This vector contains ones if the corresponding cell is covered by the
Fig. 2. Left: the first column shows a detection and the last two columns the two most
similar prototypical segments. Right: subset of prototypical segments for the category
cow.
object, otherwise it is zero. We call this vector m0j the root-cell mask for object
Oj; part-cell masks are introduced in Sec. 4.1. Using m0j we set to zero the
cells of Φ0(p0j) corresponding to the zero entries in m0j. Formally, the foreground
representation of object Oj is defined as
where

a(m0j, m̄0) := exp( −γ ‖m0j − m̄0‖² / |m̄0| )  (3)

represents the dissimilarity score between the root-cell mask m0j and the proto-
typical root-cell mask m̄0. The parameter γ is obtained by cross-validation. In
our experiments, we obtained an optimal value in the range of 1.1 ± 0.1 for the
584 A. Monroy and B. Ommer
different object classes. Here |m̄0| represents the total number of active cells in
the prototypical root-cell mask m̄0.
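The score of Eq. (3) is straightforward to compute once the masks are flattened to binary vectors. A minimal sketch (ours, with γ defaulting to the cross-validated range reported above):

```python
import math

def mask_similarity(m, m_proto, gamma=1.1):
    """a(m_0^j, m_0) = exp(-gamma * ||m - m_proto||^2 / |m_proto|), Eq. (3).

    m, m_proto: flat binary root-cell masks of equal length.
    gamma ~ 1.1 follows the cross-validated range reported in the text.
    """
    diff = sum((a - b) ** 2 for a, b in zip(m, m_proto))
    active = sum(m_proto)                   # |m_proto|: number of active cells
    return math.exp(-gamma * diff / active)
```

Identical masks score exactly 1, and the score decays as the candidate mask diverges from the prototype, normalized by the prototype's size so large and small prototypes are comparable.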
Equation (2) induces a Mercer kernel, since a sum of Mercer kernels is again
a Mercer kernel. By the kernel trick we know that there exists a
(possibly unknown) transformation into a space in which the kernel (2) is
a scalar product. To keep the notation simple we identify Oj := Φ0(p0j, m0j)
and refer to this scalar product as
In practice we do not need to evaluate this transformation to learn our model, but
use the kernel values instead. By defining the kernel (2) we have integrated
both of our goals into the detection window approach: we suppress the features
corresponding to the background and robustly represent the shape of an object
through a prototypical set of shapes.
3 Learning
Let us assume for the moment that for all objects O^j in the training data their root-cell masks m_0^j containing the whole object foreground are given. The training set is denoted by {(O^j, y^j)}. Here y^j ∈ {−1, 1} denotes the label of object O^j = Φ(φ_0(p_0^j, m_0^j)). In this special case, we could easily learn a discriminative function

f(φ_0(p_0^q, m_0^q)) = Σ_{i∈SV} y^i α_i d_0(φ_0(p_0^q, m_0^q), φ_0(p_0^i, m_0^i)) + b    (5)

to classify the query object φ_0(p_0^q, m_0^q) (SV is the set of support vectors). However, in contrast to [3], we are not provided with the foreground root-cell masks m_0^j during training, but rather we automatically learn them from unsegmented training data. This is described in the next section. Similar to [3], [10] assumes that the best-ranked foreground segmentation mask of [15] covers the whole object. In practice this assumption is, however, not valid: the second row of Figure 3 shows the best-ranked CMPC segments that lie within the object bounding-box. None of them covers exactly the whole object.
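The dual form of Eq. (5) only ever touches kernel values, never the implicit feature map Φ. The toy sketch below illustrates this with a kernel perceptron in place of the SVM solver (a deliberate simplification: the paper trains an SVM, and `kernel` here is a generic RBF stand-in for the mask-matching kernel d_0):

```python
import numpy as np

def kernel(a, b, sigma=10.0):
    """Generic RBF similarity standing in for the mask-matching kernel d_0."""
    return np.exp(-np.sum((a - b) ** 2) / sigma)

def decision(query, X, y, alpha, b=0.0):
    """Dual-form classifier as in Eq. (5): f(q) = sum_i y^i alpha_i k(q, x_i) + b.
    Only kernel evaluations appear; the feature map Phi stays implicit."""
    return sum(yi * ai * kernel(query, xi) for xi, yi, ai in zip(X, y, alpha)) + b

def train_kernel_perceptron(X, y, epochs=10):
    """Toy dual trainer: bump alpha_i whenever example i is misclassified."""
    alpha = np.zeros(len(X))
    for _ in range(epochs):
        for i in range(len(X)):
            if y[i] * decision(X[i], X, y, alpha) <= 0:
                alpha[i] += 1.0
    return alpha

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1, 0.4, (15, 6)), rng.normal(-1, 0.4, (15, 6))])
y = np.array([1] * 15 + [-1] * 15)
alpha = train_kernel_perceptron(X, y)
print(decision(rng.normal(+1, 0.4, 6), X, y, alpha) > 0)  # True
```

Only instances with non-zero α contribute to the decision, mirroring the support-vector sum in Eq. (5).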
Fig. 3. First row: We simultaneously detect and infer the object foreground. Second row: We show our data-driven grouping from which we infer the foreground of our object. For complex categories we cannot assume that the first CMPC segment covers the whole object.
We simultaneously learn the function f and solve the grouping problem by formulating our problem in the Multiple Instance Learning (MIL) framework. Here, a bag contains features corresponding to different root-cell masks. For positive instances, at least one of these features corresponds to the foreground of an object. In the ideal case, a bag B_0^j would contain all possible combinations of cells within the bounding-box. Since this is not tractable, we describe in the next section how to create a shortlist of meaningful groups in a bottom-up manner. Suppose we obtain l different groups for a bounding-box j. The i-th group is represented by a root-cell mask m_{0i}^j and we build the set U^j := {m_{0i}^j}_{i=1}^{l}. A bag is then defined as

B_0^j := ⟨ φ_0(p_0^j, m_{0i}^j) | m_{0i}^j ∈ P(U^j) ⟩_{i=1}^{|P(U^j)|} ,    (7)
here O_i^I = Φ(φ_0(p_0^I, m_{0i}^I)) are object hypotheses and denote the elements within the bag I. Once the function f is learned, the inference problem (6) for a query image is transformed into

In other words, in (10) we look for the most positive instance within B_0^j and, by doing this, we indirectly infer the corresponding root-cell segmentation mask
m_0^j (see first row of Fig. 3). In practice the optimization problem (8) is solved by alternating the calculation of the hyperplane w_0 and bias b with the calculation of the margin for the positive bags: Y_I · max_{i∈I}(⟨w_0, O_i^I⟩_CB + b) (we used the MIL solver of [17], but other approaches like [18] could also be used). This means that for every positive bag we fix the most positive instance and then we use all other instances of the negative bags to learn an SVM using our Mercer kernel (2). In our experiments we used this MIL formulation since it is effective, fast (convergence is reached after a few iterations) and the performance was robust for varying initializations. Specifically, we randomly chose an element of every positive bag to initialize the algorithm.
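The alternating scheme just described — fix the most positive instance per positive bag, retrain, re-select — can be sketched as follows. The classifier step here is a ridge least-squares stand-in for the kernel SVM, and the bag data are synthetic; only the alternation structure mirrors the text:

```python
import numpy as np

def fit_linear(X, y, lam=1e-3):
    """Ridge least-squares stand-in for the SVM step (bias absorbed)."""
    Xa = np.hstack([X, np.ones((len(X), 1))])
    w = np.linalg.solve(Xa.T @ Xa + lam * np.eye(Xa.shape[1]), Xa.T @ y)
    return w[:-1], w[-1]

def mil_alternate(pos_bags, neg_instances, iters=5):
    """Alternate: fix the most positive instance in each positive bag,
    refit the classifier, re-select. Random initialisation, as in the text."""
    rng = np.random.default_rng(0)
    sel = [bag[rng.integers(len(bag))] for bag in pos_bags]
    for _ in range(iters):
        X = np.vstack([np.array(sel), neg_instances])
        y = np.array([1] * len(sel) + [-1] * len(neg_instances))
        w, b = fit_linear(X, y)
        # re-select the most positive instance in every positive bag
        sel = [max(bag, key=lambda x: x @ w + b) for bag in pos_bags]
    return w, b, sel

# Toy bags: one "foreground" instance near +1 amid background clutter near -1.
rng = np.random.default_rng(1)
pos_bags = [[rng.normal(+1, 0.3, 4)] + [rng.normal(-1, 0.3, 4) for _ in range(2)]
            for _ in range(10)]
neg = rng.normal(-1, 0.3, (30, 4))
w, b, sel = mil_alternate(pos_bags, neg)
print(all(s @ w + b > 0 for s in sel))  # True
```

After a few rounds the selected instances converge to the foreground-like cluster, which is the observed fast-convergence behaviour the text reports for the full model.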
In the first row of Fig. 3 we visualize the inference of the final foreground root-cell mask m_0^j, given a data-driven grouping of cells for the bounding-box.
H_{kl}^j := (1/N_s) Σ_{t=1}^{N_s} r_t^j [p_{kl} ∈ S_t^I] [p_{kl} ∈ BB_j] .    (12)

The values in this map indicate which regions within the bounding-box were consistently covered by CMPC segments S_t^I. We then apply a mean-shift clustering algorithm on this 2D density map H^j and enforce the connectedness of each of the resulting groups. The cells covering each of these groups define the root-cell masks m_{0i}^j used to construct the bags in equation (7).
In practice, for bounding-boxes containing an object, we typically obtain between 6 and 8 groups. For boxes in the background, our grouping algorithm typically does not generate any segment, since these regions are not covered by a CMPC segment (the weights in Eq. 11 are zero). This situation renders feasible an exhaustive search using inference (6). We favor mean-shift over other clustering methods because it allows an adaptive bandwidth for different clusters.
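The construction of the coverage density map and its mean-shift grouping might look as follows in a simplified form (uniform segment weights r_t = 1, a fixed rather than adaptive bandwidth, and no explicit connectedness enforcement):

```python
import numpy as np

def coverage_density(segments, grid_shape):
    """H[k,l]: fraction of figure-ground segments covering cell (k,l),
    i.e. Eq. (12) with uniform segment weights r_t = 1."""
    H = np.zeros(grid_shape)
    for seg in segments:
        H += seg
    return H / len(segments)

def mean_shift_cells(H, bandwidth=1.5, thresh=0.2, iters=20):
    """Group consistently covered cells by density-weighted mean shift
    with a fixed (not adaptive) bandwidth."""
    pts = np.argwhere(H > thresh).astype(float)
    dens = H[tuple(pts.T.astype(int))]          # density at each active cell
    modes = pts.copy()
    for _ in range(iters):
        for i, p in enumerate(modes):
            d = np.linalg.norm(pts - p, axis=1)
            w = (d < bandwidth) * dens          # flat kernel, density-weighted
            modes[i] = (pts * w[:, None]).sum(0) / w.sum()
    labels, out = {}, []
    for m in np.round(modes).astype(int):       # cells sharing a mode -> one group
        out.append(labels.setdefault(tuple(m), len(labels)))
    return pts.astype(int), np.array(out)

# Two blobs of consistently covered cells on an 8x8 grid, two identical segments.
seg = np.zeros((8, 8)); seg[1:3, 1:3] = 1; seg[5:7, 5:7] = 1
cells, labels = mean_shift_cells(coverage_density([seg, seg], (8, 8)))
print(len(set(labels.tolist())))  # 2
```

Cells whose shifted points converge to the same mode form one group, i.e. one candidate root-cell mask m_{0i}.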
4 Results
4.1 Implementation Details
We use a sliding-window detection model similar to [1] to implement our idea. The model in [1] describes an object O^j by means of a bounding-box covering the entire object (root window) as well as eight smaller windows (about half the size) that cover parts of the root window. Every part window i is divided into a grid of cells of size l_i × h_i, i = 1 ... 8, and a HoG feature is calculated for every cell. During training, weights (used as linear filters) are learnt for the root window and an additional 8 linear filters are trained for the parts. In our case, if we ignore the parts for a moment, we first would need to learn the prototypical set of segments using the positive training samples as described in Sec. 3.3 and then learn the classifier (Eq. 5) as described in Sec. 3.1. To include the concept of parts from [1], we will first introduce the notion of a bag for each of the part windows and then extend our matching kernel (2) also to these parts. Thereafter, the corresponding classifier can be trained analogously to [1] and thus we refer to that work for further details.
Modeling Parts: Running the bottom-up grouping described in Section 3.3 exclusively on the root window results in the bags B_0^j for each training sample j (each bag containing the bottom-up groups as instances). We then define a bag B_i^j for each of the part windows as follows:

Here p_i^j denotes the position of the i-th part window for sample j. The binary vector m_{ik}^j ∈ 𝔹^{l_i h_i} denotes the k-th part-cell segmentation mask of part i. It is obtained by taking the overlap of part i with the root-cell mask m_{0k}^j ∈ 𝔹^{l_0 h_0}.
In doing so, we obtain the feature representation φ_i(p_i^j, m_{ik}^j) for the i-th part (similar to Eq. (1)). Following this notation, the matching score of Eq. (2) between two objects O^j, O^u can be extended to include parts,

d(O^j, O^u) := Σ_{i=0}^{8} d_i(φ_i(p_i^j, m_i^j), φ_i(p_i^u, m_i^u)) + ⟨p_i^j − p_0^j, p_i^u − p_0^u⟩ .    (14)

Here the last term compares the displacement of the i-th part w.r.t. the object center. d_i(·,·) denotes the matching score for part i defined as in Eq. (2). To obtain d_i(·,·) we also use a set of prototypical segments to represent each one of the parts. This set is obtained in a similar way as for the root window, by hierarchically clustering the elements of all positive bags B_i^j. In practice, 7 prototypical segments are used to represent each part. Using the kernel (14) we are then capable of learning a discriminative function along the lines of (5).
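Deriving a part-cell mask by overlapping a part window with the root-cell mask can be sketched as a crop, under the simplifying assumption that the part grid has the same resolution as the root grid (the model of [1] actually uses finer part grids):

```python
import numpy as np

def part_cell_mask(root_mask, part_box):
    """Crop the root-cell mask m_0k to the i-th part window to obtain the
    part-cell mask m_ik. part_box = (row, col, rows, cols) on the root grid."""
    r, c, h, w = part_box
    return root_mask[r:r + h, c:c + w].copy()

root = np.zeros((6, 6), dtype=int)
root[2:5, 1:4] = 1                         # foreground cells of the root mask
print(part_cell_mask(root, (2, 1, 2, 2)))  # a 2x2 block of ones
```

A part window over foreground cells inherits a dense mask, while one over background inherits an all-zero mask, which is exactly what the per-part kernels d_i then compare against the part prototypes.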
For the sake of completeness, we also evaluate our model on the level of bounding-boxes for the detected objects (standard setting). We used the INRIA horses dataset, which contains 340 images. Half of the images contain one or more horses and the rest are negative images. 50 horse images and 50 negative images are used for training. The remaining 120 horse images plus 120 negative images are used for testing. Results are listed in Figure 5. Compared to [1], we improve the state-of-the-art detection rate at 0.1 fppi by 3.5% and achieve a gain of 29% over the recent segmentation-based approach of [20].
Table 2. Detection rate at 0.02, 0.3 and 0.4 fppi on ETHZ-Shape. We reach comparable pixel-wise detection rates to [10].
Next, we evaluate the impact of our bottom-up grouping (see Section 3.2) during training. For this experiment, the union of the first n best-ranked CMPC segmentation masks of [15] lying within the bounding-box was taken to define the bags (13). This setting would be equivalent to [10], which assumes that the best bottom-up generated segment covers the whole object. We varied the number of segments n and measured the detection performance in terms of average precision (AP). The experiment was evaluated on the horse category of PASCAL VOC 2007. The result is plotted in the left side of Figure 6. For large n the performance reaches that of [1], since eventually all cells of m_0^j are active. Conversely, performance significantly drops as we approach n = 1, which is the setting of [10]. Our full model is plotted as a constant line, since it is independent of the number of segments generated by [15].

Method | AP | Det. rate at 0.1 FPPI
Our Method | 0.883 | 0.902
[21] | - | 0.730
BoSS [20] | - | 0.630
[22] | - | 0.652
[23] | - | 0.674
[1] | 0.871 | 0.867

Fig. 5. Detection results for the INRIA horses dataset. We improve [1] by 3.5% and the segmentation-based approach [20] by 29.9% detection rate at 0.1 FPPI.
In a second experiment, we tested the impact of our bottom-up grouping during testing. Instead of obtaining a bottom-up grouping for each sliding window, we tested our model exclusively on all the CMPC segments. We considered the tight bounding-box around each figure-ground segment S_t^I for an image I and used this segment to construct the bags B_i^j (in this case we have as many bags as segments S_t^I, see Eq. (13)). The experiment was carried out using the car category of VOC 2007 (see right plot in Figure 6). We observed a 7.3% performance drop in AP. Hence, it is advisable to combine the different segments S_t^I (as we do) to obtain a better detection performance.
We also tested the impact of using a prototypical set of segments (see Section 3.3) to represent object shape. Since the matching score (14) uses a prototypical set of segments to evaluate each d_i(·,·), we trained in this experiment a linear SVM using the Euclidean dot product (instead of d_i(·,·)) between the feature representations φ_i(p_i^j, m_i^j) for all parts. In this case the matching score (14) is transformed into

d̂(O^j, O^u) := Σ_{i=0}^{8} ⟨φ_i(p_i^j, m_i^j), φ_i(p_i^u, m_i^u)⟩ + ⟨p_i^j − p_0^j, p_i^u − p_0^u⟩ .    (15)

In doing so, we obtained a very poor performance of 0.45 AP for the horse category compared to the 0.578 AP of our model.
Fig. 7. Detection examples for certain PASCAL VOC 2007 categories. The cells corresponding to the object foreground are grouped and used for detection.
horse cow cat train plane car mbike bus tv bicycle sofa person
Our approach 57.8 25.3 23.9 47.8 31.9 59.8 49.8 51.6 41.9 59.8 33.7 41.9
Felz. et al. [1] 56.8 25.2 19.3 45.1 28.9 57.9 48.7 49.6 41.6 59.5 33.6 41.9
best2007 [26] 37.5 14.0 24.0 33.4 26.2 43.2 37.5 39.3 28.9 40.9 14.7 22.1
UCI [27] 45.0 17.7 12.4 34.2 28.8 48.7 39.4 38.7 35.4 56.2 20.1 35.5
LHS [13] 50.4 19.3 21.3 36.8 29.4 51.3 38.4 44.0 39.3 55.8 25.1 36.6
C2F [28] 52.0 22.0 14.6 35.3 27.7 47.3 42.0 44.2 31.1 54.0 18.8 26.8
SMC [29] 51.0 23.0 16.0 41.0 26.0 50.0 45.0 47.0 38.0 56.0 29.0 37.0
HStruct [30] 48.5 18.3 15.2 34.1 31.7 48.0 38.9 41.3 39.8 56.3 18.8 35.8
LatentCRF [31] 49.1 18.5 14.5 34.3 31.9 49.3 41.9 49.8 41.3 57.0 23.3 35.7
MKL [24] 51.2 33.0 30.0 45.3 37.6 50.6 45.5 50.7 48.5 47.8 28.5 23.3
Since our method builds on a single, standard feature type, multi-feature approaches (e.g. [12, 24, 25]) are complementary and should enable further performance improvements.
5 Conclusion
We have presented a model that explicitly represents object shape and segregates it from the background, yet without requiring segmented training samples. The basis of this method is to capture the overall object form by grouping foreground regions in a model-driven manner and representing it through a class-specific prototypical set of segments automatically learnt from unsegmented training data. By using exclusively bounding-box annotated training data, our model improves pixel-wise detection results and at the same time provides a richer object parametrization for detecting object instances.
References
1. Felzenszwalb, P., Girshick, R., McAllester, D., Ramanan, D.: Object detection with discriminatively trained part-based models. PAMI (2010)
2. Levin, A., Weiss, Y.: Learning to combine bottom-up and top-down segmentation. IJCV 81(1), 105–118 (2009)
3. Gao, T., Packer, B., Koller, D.: A segmentation-aware object detection model with occlusion handling. In: CVPR, pp. 1361–1368 (2011)
4. Marszalek, M., Schmid, C.: Accurate object recognition with shape masks. IJCV 97, 191–209 (2011)
5. Vijayanarasimhan, S., Grauman, K.: Efficient region search for object detection. In: CVPR (2011)
6. Malisiewicz, T., Efros, A.: Improving spatial support for objects via multiple segmentations. In: BMVC (2007)
7. Todorovic, S., Ahuja, N.: Learning subcategory relevances for category recognition. In: CVPR (2008)
8. Wang, X., Han, T., Yan, S.: An HOG-LBP human detector with partial occlusion handling. In: ICCV (2009)
9. Chen, Y., Zhu, L., Yuille, A.: Active Mask Hierarchies for Object Detection. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part V. LNCS, vol. 6315, pp. 43–56. Springer, Heidelberg (2010)
10. Carreira, J., Li, F., Sminchisescu, C.: Object Recognition by Sequential Figure-Ground Ranking. IJCV (November 2011)
11. Gu, C., Lim, J., Arbelaez, P., Malik, J.: Recognition using regions. In: CVPR (2009)
12. Van de Sande, K., Uijlings, J., Gevers, T., Smeulders, A.: Segmentation as selective search for object recognition. In: ICCV (2011)
13. Zhu, L., Chen, Y., Yuille, A.L., Freeman, W.: Latent hierarchical structural learning for object detection. In: CVPR, pp. 1062–1069 (2010)
14. Ommer, B., Malik, J.: Multi-scale object detection by clustering lines. In: ICCV (2009)
15. Carreira, J., Sminchisescu, C.: Constrained parametric min-cuts for automatic object segmentation. In: CVPR (2010)
16. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR (2005)
17. Andrews, S., Tsochantaridis, I., Hofmann, T.: Support vector machines for multiple-instance learning. In: NIPS, vol. 15 (2003)
18. Deselaers, T., Ferrari, V.: A conditional random field for multiple-instance learning. In: ICML (2010)
19. Ferrari, V., Jurie, F., Schmid, C.: Accurate object detection with deformable shape models learnt from images. In: CVPR (2007)
20. Toshev, A., Taskar, B., Daniilidis, K.: Object detection via boundary structure segmentation. In: CVPR (2010)
21. Yarlagadda, P., Monroy, A., Ommer, B.: Voting by Grouping Dependent Parts. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part V. LNCS, vol. 6315, pp. 197–210. Springer, Heidelberg (2010)
22. Maji, S., Malik, J.: Object detection using a max-margin Hough transform. In: CVPR (2009)
23. Ferrari, V., Fevrier, L., Jurie, F., Schmid, C.: Groups of adjacent contour segments for object detection. PAMI 30(1), 36–51 (2008)
24. Vedaldi, A., Gulshan, V., Varma, M., Zisserman, A.: Multiple kernels for object detection. In: ICCV (2009)
25. Harzallah, H., Jurie, F., Schmid, C.: Combining efficient object localization and image classification. In: ICCV (2009)
26. Everingham, M., Van Gool, L., Williams, C., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes Challenge 2007 (VOC 2007) Results (2007)
27. Desai, C., Ramanan, D., Fowlkes, C.: Discriminative models for multi-class object layout. In: ICCV, pp. 229–236 (2009)
28. Pedersoli, M., Vedaldi, A., Gonzalez, J.: A coarse-to-fine approach for fast deformable object detection. In: CVPR (2011)
29. Razavi, N., Gall, J., Van Gool, L.: Scalable multi-class object detection. In: CVPR (2011)
30. Schnitzspan, P., Fritz, M., Roth, S., Schiele, B.: Discriminative structure learning of hierarchical representations for object detection. In: CVPR, pp. 2238–2245 (2009)
31. Schnitzspan, P., Roth, S., Schiele, B.: Automatic discovery of meaningful object parts with latent CRFs. In: CVPR, pp. 121–128 (2010)
In Defence of Negative Mining
for Annotating Weakly Labelled Data
1 Introduction
Detecting objects in images [1, 2] and actions in videos [3, 4] are among the most widely studied computer vision problems, with applications in consumer photography, surveillance and automatic media tagging. Typically, these standard detectors are fully supervised, that is, they require a large body of training data where the locations of the objects/actions in images/videos have been manually annotated, as shown in Fig. 1. With the emergence of digital media and the rise of high-speed internet, raw images and video are available for little to no cost. However, the manual annotation of object and action locations remains tedious, slow, and expensive, and as a result there has been great interest in training detectors with weak supervision [5–8].
For weakly supervised training of an object detector, each image in the training set is annotated with a weak label indicating whether the image contains the object of interest or not, but not the location of the object. As shown in Fig. 2, a weakly labelled data-set consists of two types of images: a set of weakly labelled positive images where the exact location of the object is unknown, and a set of strongly labelled negative images for which we know that every location in the image does not contain the object of interest.
This author was funded by the European Research Council under the ERC Starting Grant agreement 204871-HUMANIS.

A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 594–608, 2012.
© Springer-Verlag Berlin Heidelberg 2012
In Defence of Negative Mining for Annotating Weakly Labelled Data 595
Fig. 1. Manual annotation of object and action data-sets, which are required for fully
supervised detector training
Fig. 2. In the task of annotating weakly labelled data we have a set of images or videos where the object or action of interest is present and a set of images or videos where the object or action is not present. Absence of the object or action is strong information, as we know every part of the image or video is negative, whereas presence of the object or action is weak information, as we do not know where the object or action is located.
¹ Typically learned from meta-data.
² On the voc2007 data-set, in 30% of the images the saliency measure of [14] fails to propose a valid location.
fused with several existing approaches to mil-based image annotation [8, 16, 17],
where it invariably leads to an improvement over the method it was fused with.
2 Prior Works
set. Furthermore, unlike [18], we treat and evaluate negative mining as a classifier in its own right rather than as a pre-processing heuristic.
3 Methodology
A formal definition of negative mining is given in Section 3.1. Section 3.2 contains a discussion of feature normalisation, which is essential for getting negative mining to work as a mil method. Finally, we describe the saliency measure used to define instances in images and videos in Section 3.3.
Consider a data-set consisting of a set of positive images I_i^+ that contain the object of interest and a set of negative images I_i^− which do not. Following [6] and [8] we consider a set of 100 salient locations or instances x_{i,j=1...100} in each image i. Each instance x_{i,j} is represented by a bag-of-words (bow) histogram. The goal is to select a single instance x_i^+ from each positive image I_i^+ corresponding to the correct location of an object of interest. The negative mining algorithm accomplishes this by selecting the instance that maximises the distance to the nearest neighbour in any image containing only negative instances x_{i,j}^−,

x_i^+ = arg max_{x_{i,j}^+} ‖x_{i,j}^+ − N(x_{i,j}^+)‖₁ ,    (1)

where ‖·‖₁ is the L1 norm and N(x_{i,j}^+) refers to the negative nearest neighbour of x_{i,j}^+. As mentioned earlier, for a typical voc class, we have 470,000 negative instances (x_{i=1...4700, j=1...100}^−) in a 2,000-dimensional space. As such, the key to an efficient algorithm is fast nearest neighbour look-up, and to handle the large volume of data efficiently we make use of a KD-tree based approximate nearest neighbour algorithm [19], with 16 trees.
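Eq. (1) amounts to a max-min search. A brute-force L1 version (the paper uses approximate KD-trees for the roughly 470,000 negatives; brute force is only viable at toy scale) might look like:

```python
import numpy as np

def negative_mine(pos_bags, neg_instances):
    """For each positive bag, pick the instance with the largest L1
    distance to its nearest negative instance, as in Eq. (1)."""
    neg = np.asarray(neg_instances, dtype=float)
    picks = []
    for bag in pos_bags:
        bag = np.asarray(bag, dtype=float)
        # L1 distances from every bag instance to every negative instance
        d = np.abs(bag[:, None, :] - neg[None, :, :]).sum(-1)
        nnn = d.min(axis=1)              # nearest-negative-neighbour distance
        picks.append(int(nnn.argmax()))  # the most "un-negative" instance
    return picks

# Instance 0 resembles the negatives; instance 1 clearly does not.
neg = [[0.0, 0.1], [0.1, 0.0], [0.0, 0.0]]
bag = [[0.1, 0.1], [3.0, 3.0]]
print(negative_mine([bag], neg))  # [1]
```

Replacing the inner `min` with a KD-tree query is what turns this into the efficient algorithm described in the text.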
This approach of mining the nearest negative instance relies on the abundance of known negative instances and, unlike [6, 8], requires no optimisation of an intra-class cost function, resulting in a computationally efficient algorithm. Furthermore, unlike [8], this inter-class measure does not assume that the average instance in each positive bag represents a positive instance, nor does it assume that the entire image is a positive instance like [5, 7].
An important variation on (1) incorporates a saliency measure. If we have a measure θ(·), which serves as a prior of how likely an instance is to be an object of interest, regardless of the choice of class, we can simply add θ(x_{i,j}^+) to our negative mining cost:

x_i^+ = arg max_{x_{i,j}^+} ‖x_{i,j}^+ − N(x_{i,j}^+)‖₁ + θ(x_{i,j}^+) .    (2)
Finding the instance with the maximum negative nearest neighbour (nnn) distance in a positive bag is complicated by the drastic change in the size of proposed instances. Consider Fig. 3(a), where we plot the distance from all the instances in the positive bags of a single class to their nearest negative neighbour (Distance = ‖x_{i,j}^+ − N(x_{i,j}^+)‖₁) vs. the instance's size in pixels (sizeof(x_{i,j})). The plot uses standard normalised bow histograms, which are typically used for both object and action detection:

ĥ(i) = h(i) / Σ_{i=1}^{B} h(i) ,    (3)

where h is the bow histogram composed of B bins, and the L1 distance between
them³. We observe that the nnn distance for instances small in size is almost always much greater than the nnn distance associated with instances large in size. This behaviour can be attributed to sampling artefacts: small boxes naturally contain fewer densely sampled words. Compared to a large box, the distribution of words associated with a small box is much more likely to have a few sharply peaked modes, and many empty bins.
As a consequence, when selecting the instances that maximise the normalised distance to the nearest negative instance, we select very small instances in each positive bag. The same behaviour can be observed for other types of distances based upon normalised bow histograms (Fig. 3(d)). In the voc data-set, this bias is extremely inappropriate, as the majority of images contain large object instances, and on the voc2007 data-set this causes negative mining to perform ten times worse than the random selection of positive instances. A similar degradation in performance can be observed on the msr2 data-set [20].
On the contrary, if we use unnormalised histograms (Fig. 3(c)) together with the L1 distance, we observe the opposite effect. Large instances contain many dense words and, owing to the sheer number of words, typically have a large distance from their nearest negative neighbour, while small instances lie very close to their nnn. Although biased towards large instances, this measure is at least biased in the correct direction, and performs substantially better than random.

To minimise the bias towards either large or small boxes, we consider the novel measure of root-normalised histograms

ĥ(i) = h(i) / √( Σ_{i=1}^{B} h(i) ) .    (4)
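The three normalisation strategies can be compared directly. In the `root` branch below, dividing by the square root of the total count is our reading of the extracted Eq. (4):

```python
import numpy as np

def normalise(h, mode):
    """BoW histogram normalisation strategies discussed in the text."""
    h = np.asarray(h, dtype=float)
    total = h.sum()
    if mode == "l1":
        return h / total            # Eq. (3): biases selection to small boxes
    if mode == "none":
        return h                    # unnormalised: biases to large boxes
    if mode == "root":
        return h / np.sqrt(total)   # root-normalised: a compromise
    raise ValueError(mode)

small = np.array([40.0, 0.0, 0.0, 0.0])         # few words, sharply peaked
large = np.array([250.0, 250.0, 250.0, 250.0])  # many words, flat
for mode in ("l1", "none", "root"):
    d = np.abs(normalise(small, mode) - normalise(large, mode)).sum()
    print(mode, round(d, 2))
```

Under L1 normalisation the sparse small-box histogram sits far from everything, while unnormalised counts make raw magnitude dominate; the root variant interpolates between the two scalings.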
Fig. 3. The L1 distances from all instances in the positive bags of the cat class in voc2007 to the nearest negative instance, plotted against the size in pixels of each instance in the positive bag. The L1 distance is computed on a 2,000-word bow histogram formed from regular-grid sift features. (a–c) illustrate the effect of different normalisation strategies on bow histograms. We see that selecting the instance with the largest distance to the nearest negative is correlated with the size of the instance, and this correlation is less pronounced for root-normalisation. (d) shows that this phenomenon is not limited to the L1 distance.
3.3 Saliency
Saliency is used in two places in our framework. We require it to propose a small set of viable instances or potential locations of objects and actions in each image or video. We also require a saliency measure of how likely a location is to be an object or action of any class, which we use in (2).
4 Experiments
The data-sets and features used in our experiments are outlined in this section. Our main analysis is the comparison of our negative mining result with state-of-the-art results in Section 4.1. Improvements in other mil formulations when fused with negative mining are given in Section 4.2. Different histogram normalisation strategies are evaluated in Section 4.3. Finally, in Section 4.4 we evaluate a leave-one-out classifier, assuming we have complete manual annotation of the training set, to determine the theoretical upper bound on the intra-class information in comparison to the inter-class information.
Data-Sets. For object detection we use the challenging Pascal voc2007 data-set [15] and for action we use the msr2 data-set [21]. For comparison with [8] we use vocAll, which consists of all 20 classes with no pose annotation, unlike [6] and [7] who use manual pose annotation. For comparison with [6] and [7] we also include voc6x2, which consists of six classes (aeroplane, bicycle, boat, bus, horse, and motorbike) where the left and right poses are considered as separate classes, for a total of 12 classes. For both vocAll and voc6x2 we only exclude images annotated as difficult in the voc2007 data-set. As in the other works [6], all results are presented as the percentage of correct localisation, where correct localisation refers to 50% overlap of the selected instance with ground truth as defined in [15].
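The 50%-overlap criterion is the standard PASCAL intersection-over-union test, e.g.:

```python
def correct_localisation(box_a, box_b, thresh=0.5):
    """PASCAL-style criterion: intersection-over-union of the selected box
    and the ground-truth box must exceed thresh. Boxes are (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union > thresh

print(correct_localisation((0, 0, 10, 10), (1, 1, 10, 10)))  # True
print(correct_localisation((0, 0, 10, 10), (8, 8, 20, 20)))  # False
```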
Features. For features we use bow histograms. For the voc data-set, bow histograms are formed from regular-grid SIFT features constructed with vlfeat [22]. For the msr2 data-set, bow histograms are formed using the histogram of flow
602 P. Siva, C. Russell, and T. Xiang
Fig. 4. Results using negative mining (N), saliency (θ), and combined negative mining and saliency (N + θ) methods

Table 1. Results using negative mining (N), saliency (θ), and combined negative mining and saliency (N + θ) methods
Table 2. Augmentation of existing methods by fusing with our negative mining (N) and saliency (θ)

Data-set | Intra [8] | Intra [8] + N+θ | Nguyen et al. [5] (MI-SVM Init. Img.) | Inter [8] (MI-SVM Init. Avg., One Iteration) | Inter [8] + N+θ | Inter [8] + Intra [8] | Inter [8] + Intra [8] + N+θ | Intra [16] | Intra [16] + N+θ
vocAll Avg. | 23.9 | 28.8 | 21.3 | 26.6 | 29.6 | 28.9 | 30.2 | 21.4 | 24.3
voc6x2 Avg. | 34.8 | 36.5 | 24.6 | 39.0 | 40.8 | 39.6 | 40.2 | 31.0 | 36.2
the instance that maximises the sum of all measure scores is selected as the true positive instance. We weight all measures equally. Average results are shown in Table 2 and more detailed per-class results are in Table 4. Note that the inter-class measure of [8] is mi-svm [17] run for a single iteration and initialised with the average instance in each positive bag. Initialising with the average instance performs better than initialising with the entire image, the method used by [5]. In all cases, we show that a score-level fusion with our method (combined negative mining and saliency, N + θ) consistently boosts the performance of all other methods by between 1 and 5%.
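Score-level fusion as described reduces to summing per-instance scores across measures and taking the arg max, e.g.:

```python
import numpy as np

def fuse_and_select(score_lists):
    """Score-level fusion: sum equally weighted per-instance scores from
    several measures and return the index of the winning instance."""
    total = np.sum(np.asarray(score_lists, dtype=float), axis=0)
    return int(total.argmax())

mil_scores = [0.2, 0.9, 0.4]   # e.g. an intra-class MIL measure
nnn_scores = [0.8, 0.3, 0.1]   # the negative-mining measure N
sal_scores = [0.1, 0.2, 0.3]   # the saliency prior
print(fuse_and_select([mil_scores, nnn_scores, sal_scores]))  # 1
```

The equal weighting matches the text; in practice each measure would first be brought onto a comparable scale.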
As mentioned in Section 3.2, the normalisation of the bow histogram is vital for nearest neighbour to work. Fig. 5 shows annotation accuracy with normalised,
Data-set N N + P P + N P + P + N +
vocAll Avg. 25.1 27.1 29.0 26.3 27.7 28.6 29.6
voc6x2 Avg. 34.8 35.0 37.1 38.1 37.3 38.4 39.2
msr2 Avg. 60.8 73.9 81.5 79.5 77.3 73.4 82.6
⁴ The exception being the msr2 data-set, which was mapped onto [0, 0.6]; see Section 3.3.
Table 4. Augmentation of existing methods by fusing with our negative mining (N) and saliency (θ)

Data-set | Intra [8] | Intra [8] + N+θ | Nguyen et al. [5] (MI-SVM Init. Img.) | Inter [8] (MI-SVM Init. Avg., One Iteration) | Inter [8] + N+θ | Inter [8] + Intra [8] | Inter [8] + Intra [8] + N+θ | Intra [16] | Intra [16] + N+θ
aeroplane 31.1 37.8 27.7 41.2 46.6 45.4 45.8 31.9 36.6
bicycle 18.5 21.8 19.3 17.7 21.0 20.6 21.8 14.8 19.8
bird 25.2 27.0 20.3 28.2 29.7 29.7 30.9 24.8 26.1
boat 13.8 17.1 14.4 13.3 18.8 12.2 20.4 11.6 17.7
bottle 03.3 04.1 05.7 05.3 05.3 04.1 05.3 04.5 04.1
bus 31.2 31.2 31.7 35.5 37.6 37.1 37.6 25.8 23.7
car 26.7 39.8 29.0 33.0 40.7 41.0 40.8 21.5 22.9
cat 42.1 48.7 32.3 50.1 50.1 53.4 51.6 44.5 45.4
chair 06.7 08.5 07.4 05.4 06.7 06.5 07.0 06.1 06.5
cow 28.4 31.9 24.1 30.5 29.8 31.9 29.8 27.7 28.4
diningtable 24.0 32.0 15.0 16.0 26.5 20.5 27.5 18.0 24.5
dog 33.3 39.7 33.0 36.1 39.4 40.9 41.3 30.6 35.6
horse 30.7 38.7 17.1 38.7 41.8 37.3 41.8 30.3 33.1
motorbike 34.7 44.9 30.2 44.1 45.7 46.5 47.3 33.1 38.4
person 14.3 23.3 14.9 20.6 23.8 22.3 24.1 13.9 14.1
pottedplant 09.4 11.4 12.7 11.4 12.2 10.2 12.2 10.6 8.6
sheep 26.0 27.1 24.0 25.0 28.1 27.1 28.1 17.7 22.9
sofa 25.3 34.9 18.8 23.6 29.7 32.3 32.8 18.3 24.5
train 42.9 45.2 39.5 47.9 48.3 49.0 48.7 36.0 42.5
tvmonitor 10.6 10.5 08.2 08.6 09.8 09.8 09.4 05.5 11.3
Average 23.9 28.8 26.6 21.3 29.6 28.9 30.2 21.4 24.3
Aeroplane left 42.2 42.2 31.3 43.8 37.5 37.5 45.3 32.8 42.2
Aeroplane right 38.5 46.2 26.9 59.6 61.5 55.8 53.8 28.8 42.3
Bicycle left 23.9 28.4 25.4 26.9 28.4 34.3 31.3 25.4 23.9
Bicycle right 33.9 27.4 27.4 30.7 37.1 30.7 30.6 27.4 30.6
Boat left 09.4 17.0 07.5 01.9 15.1 09.4 13.2 05.7 15.1
Boat right 13.8 15.5 15.5 13.8 12.1 12.1 15.5 05.2 15.5
Bus left 44.8 34.5 24.1 31.0 34.5 37.9 37.9 34.5 37.9
Bus right 29.7 37.8 24.3 43.2 51.4 43.2 45.9 18.9 37.8
Horse left 43.9 50.0 24.2 40.9 51.5 48.5 50.0 47.0 48.5
Horse right 33.9 41.9 03.2 54.8 53.2 51.6 51.6 38.7 45.2
Motorbike left 46.3 46.3 31.5 55.6 50.0 50.0 50.0 46.3 42.6
Motorbike right 57.5 51.1 53.2 66.0 57.4 63.8 57.4 61.7 53.2
Average 34.8 36.5 24.6 39.0 40.8 39.6 40.2 31.0 36.2
5 Conclusion
This work has presented a simple, robust technique based upon negative mining. By itself, this technique outperforms all previously existing techniques on the msr2 and voc2007 (all views) data-sets. We have shown how this technique can be readily combined with other approaches to mil, by fusing it as an additional potential, and that doing so has always led to a significant improvement in performance over the original method. Our comprehensive experiments have validated our approach. We believe that the conceptual and implementational simplicity of our approach, alongside its state-of-the-art performance, will make it a valuable tool for increasing the performance of many more sophisticated mil approaches in the future.
References
1. Viola, P.A., Jones, M.J.: Rapid object detection using a boosted cascade of simple features. In: CVPR, pp. 511–518 (2001)
2. Felzenszwalb, P., Girshick, R., McAllester, D., Ramanan, D.: Object detection with discriminatively trained part-based models. TPAMI 32(9) (2010)
3. Poppe, R.: A survey on vision-based human action recognition. Image and Vision Computing 28(6), 976–990 (2010)
4. Weinland, D., Ronfard, R., Boyer, E.: A survey of vision-based methods for action representation, segmentation and recognition. Computer Vision and Image Understanding 115(2), 224–241 (2011)
5. Nguyen, M.H., Torresani, L., de la Torre, F., Rother, C.: Weakly supervised discriminative localization and classification: a joint learning process. In: ICCV, pp. 1925–1932 (2009)
6. Deselaers, T., Alexe, B., Ferrari, V.: Localizing Objects While Learning Their Appearance. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part IV. LNCS, vol. 6314, pp. 452–466. Springer, Heidelberg (2010)
7. Pandey, M., Lazebnik, S.: Scene recognition and weakly supervised object localization with deformable part-based models. In: ICCV (2011)
8. Siva, P., Xiang, T.: Weakly supervised object detector learning with model drift detection. In: ICCV (2011)
9. Vapnik, V.N.: Statistical Learning Theory. Wiley-Interscience (1998)
10. Deselaers, T., Ferrari, V.: A conditional random field for multiple-instance learning. In: ICML (2010)
11. Maron, O., Lozano-Perez, T.: A framework for multiple-instance learning. In: NIPS (1998)
12. Chen, Y., Bi, J., Wang, J.: MILES: Multiple-instance learning via embedded instance selection. PAMI 28(12), 1931–1947 (2006)
13. Adelson, E.H.: On seeing stuff: the perception of materials by humans and machines. In: SPIE, vol. 4299, pp. 1–12 (2001)
14. Alexe, B., Deselaers, T., Ferrari, V.: What is an object? In: CVPR, pp. 73–80 (2010)
15. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The pascal visual object classes (voc) challenge. IJCV 88(2), 303–338 (2010)
16. Siva, P., Xiang, T.: Weakly supervised action detection. In: BMVC (2011)
17. Andrews, S., Tsochantaridis, I., Hofmann, T.: Support vector machines for multiple-instance learning. In: NIPS, pp. 577–584 (2003)
18. Fu, Z., Robles-Kelly, A., Zhou, J.: MILIS: Multiple Instance Learning with Instance Selection. TPAMI (2010)
19. Muja, M., Lowe, D.G.: Fast approximate nearest neighbors with automatic algorithm configuration. In: International Conference on Computer Vision Theory and Applications VISAPP 2009, pp. 331–340. INSTICC Press (2009)
20. Yuan, J., Liu, Z., Wu, Y.: Discriminative subvolume search for efficient action detection. In: CVPR, pp. 2442–2449 (2009)
21. Cao, L., Liu, Z., Huang, T.S.: Cross-dataset action detection. In: CVPR (2010)
22. Vedaldi, A., Fulkerson, B.: VLFeat: An open and portable library of computer vision algorithms (2008), http://www.vlfeat.org/
23. Laptev, I., Lindeberg, T.: Space-time interest points. In: ICCV, Nice, France, pp. 432–439 (2003)
24. Siva, P., Xiang, T.: Action detection in crowds. In: BMVC (2010)
Describing Clothing by Semantic Attributes
1 Introduction
Over recent years, computer vision algorithms that describe objects on the semantic
level have attracted research interest. Compared to conventional vision tasks such as ob-
ject matching and categorization, learning meaningful attributes offers a more detailed
description about the objects. One example is the FaceTracer search engine [1], which
allows the user to perform face queries with a variety of descriptive facial attributes.
In this paper, we are interested in learning the visual attributes for clothing items.
As shown in Fig.1, a set of attributes is generated to describe the visual appearance
of clothing on the human body. This technique can have a great impact on many emerging
applications, such as customer profile analysis for shopping recommendations. With a
collection of personal or event photos, it is possible to infer the dressing style of the per-
son or the event by analyzing the attributes of clothes, and subsequently make shopping
recommendations. The application of dressing style analysis is shown in Fig.9.
Another important application is context-aware person identification, where many
researchers have demonstrated superior performance by incorporating clothing infor-
mation as a contextual cue that complements facial features [2-4]. Indeed, within a
certain time frame (i.e., at a given event), people are unlikely to change their clothing.
By accurately describing the clothing, person identification accuracy can be improved
over conventional techniques that only rely on faces. In our study, we also found that
clothing carries significant information to infer the gender of the wearer. This observa-
tion is consistent with the prior work of [5, 6], which exploits human body information
to predict gender. Consequently, a better gender classification system can be developed
by combining clothing information with traditional face-based gender recognition al-
gorithms.
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 609-623, 2012.
© Springer-Verlag Berlin Heidelberg 2012
610 H. Chen, A. Gallagher, and B. Girod
We present a fully automatic system that learns semantic attributes for clothing on
the human upper body. We take advantage of the recent advances in human pose estima-
tion [7], by adaptively extracting image features from different human body parts. Due
to the diversity of the clothing attributes that we wish to learn, a single type of feature is
unlikely to perform well on all attributes. Consequently, for each attribute, the predic-
tion is obtained by aggregating the classification results from several complementary
features. Last but not least, since the clothing attributes are naturally correlated,
we also explore the mutual dependencies between the attributes. Essentially, these mu-
tual dependencies between various clothing attributes capture the Rules of Style. For
example, neckties are rarely worn with T-shirts. To model the style rules, a conditional
random field (CRF) is employed on top of the classification predictions from individ-
ual attribute classifiers, and the final list of attributes is produced as the inference
result of the CRF.
To evaluate the descriptive ability of our system, we have created a dataset with an-
notated images that contain clothed people in unconstrained settings. Learning clothing
attributes in unconstrained settings can be extremely difficult due to the wide variation
in clothing appearance caused by geometric and photometric distortions, partial occlusions,
and different lighting conditions. Even under such challenging conditions, our system
demonstrates very good performance. Our major contributions are as follows: 1) We
introduce a novel clothing feature extraction method that is adaptive to human pose;
2) We exploit the natural rules of fashion by learning the mutual relationship between
different clothing attributes, and show improved performance with this modeling; 3)
We propose a new application that predicts the dressing style of a person or an event by
analyzing a group of photos, and demonstrate gender classification from clothing that
corroborates similar findings from other researchers [5, 6].
2 Related Work
The study of clothing appearance has become a popular topic because many tech-
niques in context-aware person identification use clothing as an important contextual
cue. Anguelov et al. [2] constructed a Markov Random Field to incorporate clothing
and other contexts with face recognition. They extract clothing features by collecting
color and texture information from a rectangular bounding box under the detected face,
but this approach suffers from losing clothing information by neglecting the shape and
variable pose of the human body. To overcome this problem, Gallagher and Chen [4]
proposed clothing cosegmentation from a set of images and demonstrated improved
performance on person identification. However, clothing is described only by low-level
features for the purpose of object matching, which differs from our work that learns
mid-level semantic attributes for clothes.
Very recently, attribute learning has been widely applied to many computer vision
tasks. In [8], Ferrari and Zisserman present a probabilistic generative model to learn
visual attributes such as "red", "stripes", and "dots". With the challenge of training
models on noisy images returned from Google search, their algorithm has shown very good
performance. Farhadi et al. [9] learn discriminative attributes and effectively categorize
objects using the compact attribute representation. Similar work has been performed by
Russakovsky and Fei-Fei [10], by utilizing attributes and exploring transfer learning
for object classification on large-scale datasets. Also, for the task of object classifica-
tion, Parikh and Grauman [11] build a set of attributes that is both discriminative and
nameable by displaying categorized object images to a human in the loop. In [12],
Kumar et al. propose attribute and simile classifiers that describe face appearances,
and have demonstrated very competitive results for the application of face verification.
Siddiquie et al. [13] explore the co-occurrence of attributes for image ranking and
retrieval with multi-attribute queries. One of the major challenges of attribute learning is
the lack of training data, since the acquisition of labels can be both labor-intensive and
time-consuming. To overcome this limitation, Berg et al. [14] proposed mining both
the catalog images and their descriptive text from the Internet and performing language
processing to discover attributes. Another interesting application is to incorporate the
discovered attributes with language models to generate a sentence for an image, in a
similar manner that a human might describe an image [15, 16]. Most work in the
attribute learning literature either assumes a pre-detected bounding box that contains the
object of interest, or uses the image as a whole for feature extraction. Our work is differ-
ent from previous work not simply because we perform attribute learning on clothing,
but also due to the fact that we extract features that are adaptive to unconstrained human
poses. We show that with the prior knowledge of human poses, features can be collected
in a more sensible way to make better attribute predictions.
Recently, Bourdev et al. [6] proposed a system that describes the appearance of people,
using 9 binary attributes such as "is male", "has T-shirt", and "long hair". They
used a set of parts from Poselets [17] for extracting low-level features and performed
subsequent attribute learning. In comparison, we propose a system that comprehensively
describes the clothing appearance, with 23 binary attributes and 3 multi-class attributes.
We not only consider high-level attributes such as clothing categories, but also deal
with some very detailed attributes like "collar presence", "neckline shape", "striped",
"spotted", and "graphics". In addition, we explicitly exploit human poses by selectively
changing sampling positions of our model and experimentally demonstrate the benefits
of modeling pose. Further, [6] applies an additional layer of SVMs to explore the
attribute correlations, whereas in our system the attribute correlations are modeled with a
CRF, which benefits from computational efficiency. Besides the system of [6] that learns
descriptive attributes for people, some limited work has been done to understand the
clothing appearance on the semantic level. Song et al. [18] proposed an algorithm that
predicts job occupation via human clothes and contexts, showing emerging applica-
tions by investigating clothing visual appearance. Yang and Yu [19] proposed a cloth-
ing recognition system that identifies clothing categories such as suit and T-shirt. In
contrast, our work learns a much broader range of attributes like collar presence and
sleeve length. The work that explores similar attributes to ours is [20]. However, their
system is built for images taken in a well controlled fitting room environment with con-
strained human poses in frontal view, whereas our algorithm works for unconstrained
images with varying poses. In addition, their attribute predictions use hand-designed
features, while we follow a more disciplined learning approach that more easily allows
new attributes to be added to the model.
Table 1. Statistics of the clothing attribute dataset. There are 26 attributes in total, including 23
binary-class attributes (6 for pattern, 11 for color and 6 miscellaneous attributes) and 3 multi-class
attributes (sleeve length, neckline shape and clothing category).
Clothing pattern (Positive / Negative): Solid (1052 / 441), Floral (69 / 1649), Spotted (101 / 1619), Plaid (105 / 1635), Striped (140 / 1534), Graphics (110 / 1668)
Major color (Positive / Negative): Red (93 / 1651), Yellow (67 / 1677), Green (83 / 1661), Cyan (90 / 1654), Blue (150 / 1594), Purple (77 / 1667), Brown (168 / 1576), White (466 / 1278), Gray (345 / 1399), Black (620 / 1124), >2 Colors (203 / 1541)
Wearing necktie: Yes 211, No 1528
Collar presence: Yes 895, No 567
Gender: Male 762, Female 1032
Wearing scarf: Yes 234, No 1432
Skin exposure: High 193, Low 1497
Placket presence: Yes 1159, No 624
Sleeve length: No sleeve (188), Short sleeve (323), Long sleeve (1270)
Neckline shape: V-shape (626), Round (465), Others (223)
Clothing category: Shirt (134), Sweater (88), T-shirt (108), Outerwear (220), Suit (232), Tank Top (62), Dress (260)
Thanks to the recent progress in human pose estimation [7, 22, 23], the analysis of
complex human poses has been made possible. With the knowledge of the physical
pose of the person, clothing attributes can be learned more effectively, e.g., features
from the arm regions offer a strong clue for sleeve length.
Estimating the full body pose from 2-D images still remains a challenging problem,
partly because the lower body is occasionally occluded or otherwise not visible in some
images. Consequently, we only consider the clothing items on the upper body. We apply
the method of [7] for human pose estimation and briefly review the technique here for
completeness. Given an input image, the upper body of the person is first
located by using complementary results of an upper body detector [21] and Viola-Jones
face detector [24]. The bounding box of the detected upper body is then enlarged, and
the GrabCut algorithm [25] is used to segment the person from the background. Person-
specific appearance models for different body parts are estimated within the detected
upper body window. Finally, an articulated pose is formed within the segmented person
area, by using the person-specific appearance models and generic appearance models.
The outputs of the pose estimation algorithm are illustrated in Fig.1, which include a
stick-man model and the posterior probability map of the six upper body regions (head,
torso, upper arms and lower arms). We threshold the posterior probability map to obtain
binary masks for the torso and arms, while ignoring the head region since it is not related
to clothing attribute learning.
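This masking step can be sketched as follows; the 0.5 threshold and the region names are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def masks_from_posteriors(posterior_maps, threshold=0.5):
    """Threshold per-region posterior maps into binary masks, dropping the head.

    posterior_maps: dict of region name -> HxW array of probabilities.
    The 0.5 threshold and the region names are illustrative assumptions.
    """
    return {region: probs >= threshold
            for region, probs in posterior_maps.items()
            if region != "head"}

# Toy posterior maps for two of the six upper-body regions
maps = {
    "head": np.full((4, 4), 0.9),
    "torso": np.array([[0.1, 0.8],
                       [0.9, 0.2]]),
}
masks = masks_from_posteriors(maps)
```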
Due to the large number of attributes that we wish to learn, the system is unlikely to
achieve optimal performance with a single type of feature. For example, while texture
descriptors are useful for analyzing clothing patterns such as "striped" and "dotted",
they are not suitable for describing clothing colors. Therefore, in our implementation
we use 4 types of base features, including SIFT [26], texture descriptors from the Max-
Fig. 2. Flowchart of our system. Several types of features are extracted based on the human pose.
These features are combined based on their predictive power to train a classifier for each attribute.
A Conditional Random Field that captures the mutual dependencies between the attributes is
employed to make attribute predictions. The final output of the system is a list of nameable
attributes that describe the clothing appearance.
Fig. 3. Extraction of SIFT over the torso region. For better visualization, we display fewer de-
scriptors than the actual number. The left figure shows the location, scale and orientation of the
SIFT descriptors in green circles. The right figure depicts the relative positions between the sam-
pling points (red dots), the torso region (white mask) and the torso stick (green bar).
imum Response Filters [27], color in the LAB space, and skin probabilities from our
skin detector.
As mentioned in Section 1, the features are extracted in a pose-adaptive way. The
sampling location, scale and orientation of the SIFT descriptors depend on the esti-
mated human body parts and the stick-man model. Fig.3 illustrates the extraction of the
SIFT descriptors over the person's torso region. The sampling points are arranged
according to the torso stick and the boundary of the torso. The configuration of the torso
sampling points is a 2-D array, whose size is given by the number of samples along the
stick direction times the number of samples normal to the stick direction. The scale of
the SIFT descriptor is determined by the size of the torso mask, while the orientation is
simply chosen as the direction of the torso stick. The extraction of SIFT features over
the 4 arm regions is done in a similar way, except that the descriptors are only sampled
along the arm sticks. In our implementation, we sample SIFT descriptors at 64 × 32
locations for the torso, and 32 locations along the arm stick for each of the 4 arm
segments. The remaining three of our base features, namely texture descriptors, color,
and skin probabilities, are computed for each pixel in the five body regions.
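The pose-adaptive arrangement of torso sampling points can be sketched as follows; the grid sizes and geometric details here are illustrative assumptions (much smaller than the grid used in the paper):

```python
import numpy as np

def torso_sampling_grid(p_top, p_bottom, half_width, n_along=8, n_normal=4):
    """Arrange sampling points in a 2-D array aligned with the torso stick.

    p_top, p_bottom: (x, y) endpoints of the torso stick.
    half_width: half the torso width, measured normal to the stick.
    Returns an (n_along * n_normal, 2) array of sample coordinates.
    """
    p_top = np.asarray(p_top, dtype=float)
    p_bottom = np.asarray(p_bottom, dtype=float)
    axis = p_bottom - p_top
    axis /= np.linalg.norm(axis)               # unit vector along the stick
    normal = np.array([-axis[1], axis[0]])     # unit vector normal to the stick
    t = np.linspace(0.0, 1.0, n_along)         # positions along the stick
    s = np.linspace(-1.0, 1.0, n_normal)       # positions across the torso
    points = [p_top + ti * (p_bottom - p_top) + si * half_width * normal
              for ti in t for si in s]
    return np.stack(points)

# Vertical torso stick from (0, 0) to (0, 10), torso half-width of 3 pixels
grid = torso_sampling_grid((0, 0), (0, 10), half_width=3)
```

Rotating the grid with the stick is what makes the sampling pose-adaptive: the same relative sample positions are reused whatever the torso orientation.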
Once the base features are computed, they are quantized by soft K-means with 5
nearest neighbors. As a general design guideline, features with larger dimensionality
should have more quantization centroids. Therefore we allocate 1000 visual words for
SIFT, 256 centroids for texture descriptor, and 128 centroids for color and skin proba-
bility. We employed a Kd-tree for efficient feature quantization. The quantized features
are then aggregated by performing average pooling or max pooling over the torso and
arm regions. Note that the feature aggregation for the torso is done by constructing a
4 × 4 spatial pyramid [28] over the torso region and then applying average or max pooling.
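The quantization and pooling steps can be sketched as below. The inverse-distance soft weights are an assumption; the paper only specifies soft K-means with 5 nearest neighbors:

```python
import numpy as np

def soft_quantize(descriptors, centroids, k=5):
    """Soft-assign each descriptor to its k nearest centroids.

    Returns an (n_descriptors, n_centroids) matrix whose rows sum to 1.
    The inverse-distance weighting is an assumption; the paper only
    specifies soft K-means with 5 nearest neighbors.
    """
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    assign = np.zeros_like(d2)
    for i, row in enumerate(d2):
        nn = np.argsort(row)[:k]              # k nearest centroids
        w = 1.0 / (np.sqrt(row[nn]) + 1e-8)   # closer centroid -> larger weight
        assign[i, nn] = w / w.sum()
    return assign

def pool(assign, mode="average"):
    """Aggregate soft assignments over a body region."""
    return assign.mean(axis=0) if mode == "average" else assign.max(axis=0)

rng = np.random.default_rng(0)
descs = rng.normal(size=(100, 8))    # toy descriptors from one region
cents = rng.normal(size=(16, 8))     # toy codebook
hist_avg = pool(soft_quantize(descs, cents), "average")
hist_max = pool(soft_quantize(descs, cents), "max")
```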
Table 2 summarizes the features that we extract. In total, we have 40 different fea-
tures from 4 feature types, computed over 5 body regions with 2 aggregation methods.
Last but not least, we extract one more feature that is specifically designed for
learning clothing color attributes, which we call skin-excluded color feature. Obviously,
the exposed skin color should not be used to describe clothing colors. Therefore we
mask out the skin area (generated by our skin detector) to extract the skin-excluded
color feature. As the arms may contain purely skin pixels when the person is wearing
no-sleeve clothes, it may not be feasible to aggregate the skin-excluded color feature for the
arm regions. Hence we perform aggregation over the whole upper body area (torso +
4 arm segments), excluding the skin region. Also, since the maximum color responses
are subject to image noise, only average pooling is adopted. So compared to the set of
40 features described above, we only use 1 feature for learning clothing color.
We utilize Support Vector Machines (SVMs) [29] to learn attributes since they demon-
strate state-of-the-art performance for classification. For clothing color attributes like
"red" and "blue", we simply use the skin-excluded color feature to train one SVM per
color attribute. For each of the other clothing attributes, such as "wear necktie" and
"collar presence", we have 40 different features and corresponding weak classifiers, but
it is uncertain which feature offers the best classification performance for the attribute
that we are currently learning. A naive solution would be to concatenate all features
into a single feature vector and feed it to an SVM for attribute classification. However,
this approach suffers from three major drawbacks: 1) the model is likely to overfit
due to the extremely high dimensional feature vector; 2) due to the high feature di-
mension, classification can be slow; and most importantly, 3) within the concatenated
feature vector, high dimensional features will dominate over low dimensional ones so
Fig. 4. A CRF model with M attribute nodes pairwise connected. Fi denotes the feature for
inferring attribute Ai.
the classifier performance is similar to the case when only high dimensional features are
used. Indeed, in our experiments we found that the concatenated feature vector offers
negligible improvements over the SVM that uses only the SIFT features.
Another attribute classification approach is to train a set of 40 SVMs, one for each
feature type, and pick the best performing SVM as the attribute classifier. This method
certainly works, but we are interested in achieving an even better performance with all
the available features. Using the fact that SVM is a kernel method, we form a com-
bined kernel from the 40 feature kernels by weighted summation, where the weights
correspond to the classification performance of the features. Intuitively, better feature
kernels are assigned heavier weights than weaker feature kernels. The combined kernel
is then used to train an SVM classifier for the attribute. This method is inspired by the
work in [30], where they demonstrated significant scene classification improvement by
combining features. Our experimental results in Section 5.1 also show the advantage
offered by feature combination.
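A minimal sketch of the kernel combination, assuming per-feature Gram matrices and using normalized validation accuracies as the weights; the paper does not spell out the exact weighting scheme:

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gram matrix of an RBF kernel over the rows of X."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def combined_kernel(kernels, val_accuracies):
    """Weighted sum of per-feature kernels; better features get heavier weights.

    The weights are validation accuracies normalized to sum to 1 -- an
    assumption, since the paper does not give the exact weighting rule.
    """
    w = np.asarray(val_accuracies, dtype=float)
    w = w / w.sum()
    return sum(wi * K for wi, K in zip(w, kernels))

rng = np.random.default_rng(1)
K1 = rbf_kernel(rng.normal(size=(10, 5)))   # kernel from feature type 1
K2 = rbf_kernel(rng.normal(size=(10, 3)))   # kernel from feature type 2
K = combined_kernel([K1, K2], val_accuracies=[0.8, 0.6])
```

A convex combination of valid kernels is itself a valid kernel, so the resulting Gram matrix can be passed directly to any SVM implementation that accepts precomputed kernels.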
Fig. 5. Comparison of attribute classification under 3 scenarios
Equation 3 is consistent with our CRF model in Fig.4, assuming that the observed
feature Fi is independent of all other features once the attribute Ai is known. From
the ground truth of the training data, we can estimate the joint probability P(A1, A2)
as well as the priors P(A1) and P(A2). The conditional probabilities P(A1|F1) and
P(A2|F2) are given by the SVM probability outputs for A1 and A2, respectively. We
define the unary potential Φ(Ai) = −log(P(Ai|Fi) / P(Ai)) and the edge potential
Ψ(Ai, Aj) = −log P(Ai, Aj). Following Equation 1, the optimal inference for (A1, A2)
is achieved by minimizing Φ(A1) + Φ(A2) + Ψ(A1, A2), where the first two terms are
the unary potentials associated with the nodes, and the last term is the edge potential
that describes the relation between the attributes.
For a fully connected CRF with a set of nodes S and a set of edges E, the optimal
graph configuration is obtained by minimizing the graph potential, given by:

    Σ_{Ai∈S} Φ(Ai) + λ Σ_{(Ai,Aj)∈E} Ψ(Ai, Aj)    (5)

where λ assigns relative weights between the unary potentials and the edge potentials.
It is typically less than 1 because a fully connected CRF normally contains more edges
than nodes. The actual value of λ can be optimized by cross validation. In our imple-
mentation, we use belief propagation [31] to minimize the attribute label cost.
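For a graph this small, the energy in Equation 5 can be minimized by exhaustive enumeration instead of belief propagation. The sketch below assumes binary attributes and made-up probabilities; `lam` plays the role of λ:

```python
import itertools
import numpy as np

def unary(p_post, p_prior):
    """Unary potential: -log(P(A|F) / P(A))."""
    return -np.log(p_post / p_prior)

def best_labels(p_post, p_prior, p_joint, lam=1.0):
    """Minimize the graph potential over all binary labelings.

    p_post[i][a]    : SVM output P(A_i = a | F_i)
    p_prior[i][a]   : prior P(A_i = a)
    p_joint[(i, j)] : 2x2 array of joint probabilities P(A_i, A_j)
    lam             : unary/edge weight (lambda in Equation 5)
    """
    n = len(p_post)
    best, best_e = None, float("inf")
    for labels in itertools.product((0, 1), repeat=n):
        e = sum(unary(p_post[i][a], p_prior[i][a]) for i, a in enumerate(labels))
        e += lam * sum(-np.log(p_joint[i, j][labels[i], labels[j]])
                       for i, j in p_joint)
        if e < best_e:
            best, best_e = labels, e
    return best

# Toy example: attribute 0 = "T-shirt", attribute 1 = "necktie".
# Independent SVMs weakly predict both present, but the joint statistics
# say neckties are rarely worn with T-shirts.
p_post = [(0.2, 0.8), (0.45, 0.55)]
p_prior = [(0.5, 0.5), (0.5, 0.5)]
p_joint = {(0, 1): np.array([[0.30, 0.20],
                             [0.45, 0.05]])}
labels = best_labels(p_post, p_prior, p_joint)
```

Here the CRF keeps the T-shirt label but overrides the weak necktie prediction, mirroring the style rule that neckties rarely co-occur with T-shirts.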
5 Experiments
We have performed extensive evaluations of our system on the clothing attribute dataset.
In Section 5.1, we show that our pose-adaptive features offer better classification
accuracies compared to the features extracted without a human pose model. The results in
Section 5.2 demonstrate that the CRF improves attribute predictions since it explores
relations between the attributes. Finally, we show some potential applications that di-
rectly utilize the output of our system.
Fig. 6. Multi-class confusion matrices. Predictions are made using the combined feature extracted
with the pose model.
Fig. 7. Comparison of G-means before and after the CRF
reflects the proportion of each attribute class in the dataset. Therefore, instead of testing
the classifier performance on the balanced data, we evaluate the classification perfor-
mance on the whole clothing attribute dataset. Since this is an unbalanced classification
problem, classification accuracy is no longer a proper evaluation metric. We use the
Geometric Mean (G-mean), which is a popular evaluation metric for unbalanced data
classification.
    G-mean = (Π_{i=1}^{N} R_i)^{1/N}    (6)

where Ri is the retrieval rate for class i, and N is the number of classes.
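Equation 6 translates directly into code:

```python
import numpy as np

def g_mean(recalls):
    """Geometric mean of the per-class retrieval rates R_i (Equation 6)."""
    r = np.asarray(recalls, dtype=float)
    return float(r.prod() ** (1.0 / len(r)))

# A classifier that always predicts the majority class scores 0,
# however high its raw accuracy on unbalanced data.
balanced = g_mean([0.9, 0.4])      # sqrt(0.9 * 0.4) = 0.6
degenerate = g_mean([1.0, 0.0])    # 0.0
```

Because the product vanishes whenever any class has zero recall, the G-mean penalizes classifiers that ignore minority classes, which is exactly the failure mode plain accuracy hides.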
The G-means before and after applying the CRF are shown in Fig.7. It can be seen
that the CRF offers better predictions for 19 out of 26 attributes, sometimes providing
large margins of improvement. For the attributes where the CRF fails to improve
performance, only very minor degradations are observed. Overall, better classification
performance is achieved by applying the CRF on top of our attribute classifiers. Fig.8
shows the attribute predictions of 4 images sampled from our dataset. As can be seen
in Fig.8, the CRF corrects some conflicting attributes that are generated by independent
classifiers. More illustrated examples of clothing attribute predictions can be found online.2
5.3 Applications
Dressing Style Analysis With a collection of customer photos, our system can be used
to analyze the customer's dressing style and subsequently make shopping recommendations.
For each attribute, the system can tell the percentage of its occurrences in the
group of photos. Intuitively, in order for an attribute class to be regarded as a personal
style, this class has to be observed much more frequently compared to the prior of the
general public. As an example, if a customer wears plaid-patterned clothing three days
a week, then wearing plaid is probably his personal style, because the frequency that he
2 https://sites.google.com/site/eccv2012clothingattribute/
Fig. 8. For each image, the attribute predictions from independent classifiers are listed. Incorrect
attributes are highlighted in red. We describe the abnormal attributes generated by independent
classifiers for the above images, from left to right: man in dress, suit with high skin exposure,
no sleeves but wearing scarf, T-shirt with placket. By exploring attribute dependencies, the CRF
corrects some wrong attribute predictions. Attributes that are changed by the CRF are shown in
parentheses, and override the independent classifier result to the left.
is observed in plaid is much higher than the general public's prior of wearing plaid (the
prior of plaid is 6.4% according to our clothing attribute dataset). In our analysis, we
regard an attribute class as a dressing style if it is 20% higher than its prior. Of course,
this threshold can be flexibly adjusted based on specific application requirements.
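A sketch of this rule; the 20% margin is read here as 20 percentage points above the prior, which is our assumption about the exact formulation, and all frequencies below are hypothetical:

```python
def dressing_styles(observed_freq, priors, margin=0.20):
    """Return the attribute classes regarded as a dressing style.

    A class qualifies when its observed frequency in the photo collection
    exceeds the general-public prior by `margin` (interpreted here as 20
    percentage points; the paper's exact rule may be relative instead).
    """
    return [attr for attr, freq in observed_freq.items()
            if freq >= priors.get(attr, 0.0) + margin]

# Hypothetical frequencies for one person's photos vs. dataset priors
freq = {"plaid": 0.43, "black": 0.40, "necktie": 0.10}
prior = {"plaid": 0.064, "black": 0.36, "necktie": 0.12}
styles = dressing_styles(freq, prior)   # ["plaid"]
```

Note that "black" is not flagged despite being frequent, because it is also frequent in the general public; the margin test rewards deviation from the prior, not raw frequency.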
We perform dressing style analysis on both personal and event photos, as shown in
Fig.9. Firstly, we analyze the dressing style of Steve Jobs, who is well known for wear-
ing his black turtlenecks. Using 35 photos of Jobs from Flickr, our system summarizes
his dressing style as "solid pattern", "men's clothing", "black color", "long sleeves",
"round neckline", "outerwear", "wear scarf". Most of the styles predicted by our system are very
sensible. The wrong inferences of "outerwear" and "wearing scarf" are not particularly
surprising, since Steve Jobs' high-collar turtlenecks share visual similarities with
outerwear and scarves. Similarly, our system predicts the dressing style of
Mark Zuckerberg as "gray", "brown", "T-shirt", "outerwear", which is in agreement with
his actual dressing style.
Apart from personal clothing style analysis, we can also analyze the dressing style for
events. We downloaded 54 western-style wedding photos from Flickr, and consider the
dressing styles for men and women using the "gender" attribute predicted by our system.
The dressing style for men at weddings is predicted as "solid pattern", "suit",
"long sleeves", "V-shape neckline", "wearing necktie", "wear scarf", "has collar", "has placket",
while the dressing style for women is "high skin exposure", "no sleeves", "dress", "other
neckline shapes" (i.e., neither V-shape nor round), "white", ">2 colors", "floral pattern".
Most of these predictions agree well with the dressing style of weddings, except
"wearing scarf" for men, and ">2 colors" and "floral pattern" for women. These abnormal
predictions can be understood by considering the similarity between men's scarves and
neckties, as well as the visual confusion caused by including the wedding flowers in
women's hands when describing the color and pattern of their dresses. Similar analysis was done
Fig. 9. With a set of person or event photos (sample images shown above), our system infers the
dressing style of a person or an event. Most of our predicted dressing styles are quite reasonable.
The wrong predictions are highlighted in red.
to predict the clothing style of basketball events. Using 61 NBA photos, the basketball
players' dressing style is predicted as "no sleeves", "high skin exposure", "other
neckline shapes", "tank top".
Gender Classification. Although some clothes are unisex in style, the visual appearance
of clothing often carries significant information about gender. For example,
men are less likely to wear floral-patterned clothes, and women rarely wear a tie.
Motivated by this observation, we are interested in combining the gender prediction from
clothing information, with traditional gender recognition algorithms that use facial fea-
tures. We adopt the publicly available gender-from-face implementation of [32], which
outputs gender probabilities by projecting aligned faces in the Fisherface space. For the
clothing-based gender classification, we use the graph potential in Equation 5 as the
penalty score for assigning gender to a testing instance. The male and female predic-
tion penalties are simply given by the CRF potentials, by assigning male or female to
the gender node, while keeping all other nodes unchanged. An RBF-kernel SVM is
trained to give gender predictions that combine both the gender probabilities from faces
and penalties from clothing.
We evaluate the gender classification algorithms on our clothing attribute dataset. As
shown in Table 3, the combined gender classification algorithm offers better performance
than that of using each feature alone. Interestingly, clothing-only gender recognition
outperforms face-only gender recognition.
6 Conclusions
We propose a fully automated system that describes the clothing appearance with se-
mantic attributes. Our system demonstrates superior performance on unconstrained im-
ages, by incorporating a human pose model during the feature extraction stage and by
modeling the rules of clothing style through observed co-occurrences of the attributes.
We also show novel applications where our clothing attributes can be directly utilized.
In the future, we expect to see further improvements to our system by employing the
(almost ground-truth) human pose estimated by Kinect sensors [23].
References
1. Kumar, N., Belhumeur, P.N., Nayar, S.K.: FaceTracer: A Search Engine for Large Collec-
tions of Images with Faces. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part
IV. LNCS, vol. 5305, pp. 340-353. Springer, Heidelberg (2008)
2. Anguelov, D., Lee, K., Gokturk, S.B., Sumengen, B.: Contextual identity recognition in per-
sonal photo albums. In: CVPR (2007)
3. Lin, D., Kapoor, A., Hua, G., Baker, S.: Joint People, Event, and Location Recognition in
Personal Photo Collections Using Cross-Domain Context. In: Daniilidis, K., Maragos, P.,
Paragios, N. (eds.) ECCV 2010, Part I. LNCS, vol. 6311, pp. 243-256. Springer, Heidelberg
(2010)
4. Gallagher, A.C., Chen, T.: Clothing cosegmentation for recognizing people. In: CVPR (2008)
5. Cao, L., Dikmen, M., Fu, Y., Huang, T.S.: Gender recognition from body. ACM Multimedia
(2008)
6. Bourdev, L., Maji, S., Malik, J.: Describing people: Poselet-based attribute classification. In:
ICCV (2011)
7. Eichner, M., Marin-Jimenez, M., Zisserman, A., Ferrari, V.: Articulated human pose estima-
tion and search in (almost) unconstrained still images. Technical Report 272, ETH Zurich,
D-ITET, BIWI (2010)
8. Ferrari, V., Zisserman, A.: Learning visual attributes. In: NIPS (2007)
9. Farhadi, A., Endres, I., Hoiem, D., Forsyth, D.: Describing objects by their attributes. In:
CVPR (2009)
10. Russakovsky, O., Fei-Fei, L.: Attribute learning in large-scale datasets. In: ECCV (2010)
11. Parikh, D., Grauman, K.: Interactively building a discriminative vocabulary of nameable
attributes. In: CVPR (2011)
12. Kumar, N., Berg, A.C., Belhumeur, P.N., Nayar, S.K.: Attribute and simile classifiers for face
verification. In: ICCV (2009)
13. Siddiquie, B., Feris, R.S., Davis, L.S.: Image ranking and retrieval based on multi-attribute
queries. In: CVPR (2011)
14. Berg, T.L., Berg, A.C., Shih, J.: Automatic Attribute Discovery and Characterization from
Noisy Web Data. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part I.
LNCS, vol. 6311, pp. 663–676. Springer, Heidelberg (2010)
15. Kulkarni, G., Premraj, V., Dhar, S., Li, S., Berg, A., Choi, Y., Berg, T.: Baby talk: Under-
standing and generating image descriptions. In: CVPR (2011)
Describing Clothing by Semantic Attributes 623
16. Farhadi, A., Hejrati, M., Sadeghi, M.A., Young, P., Rashtchian, C., Hockenmaier, J.,
Forsyth, D.: Every Picture Tells a Story: Generating Sentences from Images. In: Daniilidis,
K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part IV. LNCS, vol. 6314, pp. 15–29.
Springer, Heidelberg (2010)
17. Bourdev, L., Malik, J.: Poselets: Body part detectors trained using 3d human pose annota-
tions. In: ICCV (2009)
18. Song, Z., Wang, M., Hua, X., Yan, S.: Predicting occupation via human clothing and con-
texts. In: ICCV (2011)
19. Yang, M., Yu, K.: Real-time clothing recognition in surveillance videos. In: ICIP (2011)
20. Zhang, W., Begole, B., Chu, M., Liu, J., Yee, N.: Real-time clothes comparison based on
multi-view vision. In: ICDSC (2008)
21. Felzenszwalb, P., Girshick, R., McAllester, D., Ramanan, D.: Object detection with discrim-
inatively trained part based models. PAMI (2009)
22. Yang, Y., Ramanan, D.: Articulated pose estimation with flexible mixtures-of-parts. In:
CVPR (2011)
23. Shotton, J., Fitzgibbon, A., Cook, M., Blake, A.: Real-time human pose recognition in parts
from single depth images. In: CVPR (2011)
24. Viola, P., Jones, M.: Robust real-time object detection. IJCV (2001)
25. Rother, C., Kolmogorov, V., Blake, A.: Grabcut - interactive foreground extraction using
iterated graph cuts. In: SIGGRAPH (2004)
26. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV (2004)
27. Varma, M., Zisserman, A.: A statistical approach to texture classification from single images.
IJCV (2005)
28. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for
recognizing natural scene categories. In: CVPR (2006)
29. Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines. ACM Trans. on
Intel. Sys. and Tech. (2011)
30. Xiao, J., Hays, J., Ehinger, K.A., Torralba, A., Oliva, A.: Sun database: Large scale scene
recognition from abbey to zoo. In: CVPR (2010)
31. Tappen, M.F., Freeman, W.T.: Comparison of graph cuts with belief propagation for stereo,
using identical mrf parameters. In: ICCV (2003)
32. Gallagher, A.C., Chen, T.: Understanding images of groups of people. In: CVPR (2009)
Graph Matching via Sequential Monte Carlo
1 Introduction
Graphs provide an excellent means to represent features and the structural rela-
tionships between them, and the problem of feature correspondence is effectively
solved by graph matching. Therefore, graph matching has gained high popular-
ity in various applications in computer vision and machine learning (e.g., object
recognition [1], shape matching [2], and feature tracking [3]). Due to the NP-hard
nature of graph matching in general, its optimal solution is virtually unachiev-
able, so a myriad of algorithms have been proposed to approximate it. In recent
computer vision literature, the graph matching problem is widely formulated
as an Integer Quadratic Program (IQP) [4,5,6,7] that explicitly considers both
local appearances and generic pairwise relations. This line of graph matching
research has been further extended to hyper-graph matching [8,9], learning tech-
niques [10,11], and a progressive framework [12]. The existing methods, how-
ever, still suffer from severe distortion of the graphs (e.g., addition of outliers in
nodes or deformation of attributes). Robustness to such distortion is of particu-
lar importance in practical vision problems because real-world image matching
problems are widely exposed to background clutter and image variation.
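The IQP formulation mentioned above can be made concrete with a small sketch: the objective is x^T W x, where x is a binary indicator vector over candidate matches and W holds pairwise affinities between matches. The toy matrix, the candidate-match indexing, and the helper `iqp_score` below are our own illustrative assumptions, not code from the paper.

```python
import numpy as np

def iqp_score(W, matches, n_cand):
    """Graph matching objective x^T W x for a chosen set of
    candidate-match indices (rows/columns of W)."""
    x = np.zeros(n_cand)
    x[list(matches)] = 1.0
    return x @ W @ x

# Toy affinity over 4 candidate matches; a large off-diagonal entry
# means the two matches are mutually (geometrically) consistent.
W = np.array([[0.0, 0.9, 0.1, 0.0],
              [0.9, 0.0, 0.0, 0.2],
              [0.1, 0.0, 0.0, 0.8],
              [0.0, 0.2, 0.8, 0.0]])
print(iqp_score(W, {0, 1}, 4))  # consistent pair scores 2 * 0.9
```

Maximizing this score over all x that satisfy the one-to-one constraints is the NP-hard problem the approximate algorithms address.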
In this paper, a robust graph matching algorithm based on the Sequential
Monte Carlo (SMC) framework [13] is proposed. Unlike previous methods, the
algorithm sequentially samples potential matches via proposal distributions to
maximize the graph matching objective in a robust and effective manner. To
treat the graph matching problem in the SMC framework, a sequence of arti-
ficial target distributions is designed, which provides smooth transitions from an
initial distribution to the final target distribution given by the graph matching objec-
tive. The proper design of an importance distribution and a resampling scheme
allows the algorithm to explore the solution space efficiently under the matching
constraints. Consequently, the proposed algorithm achieves higher robustness to
outliers and deformation than other graph matching algorithms.
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 624–637, 2012.
© Springer-Verlag Berlin Heidelberg 2012
Among numerous graph matching methods, the algorithms most closely related to
ours are as follows. Maciel and Costeira [14] cast graph matching
as an integer constrained minimization, which is widely adopted for its gen-
erality in recent studies including ours. Many of the subsequent approaches to
the IQP formulation handle the original problem by solving a relaxed problem
and then enforcing the matching constraints on its solution. For example, Gold
and Rangarajan [15] relaxed the IQP by dropping the integer constraint and
iteratively estimated convex approximations around the previous solution.
Leordeanu et al. [4] and Cour et al. [16] proposed spectral types of relaxation
followed by a discretization process. Torresani et al. [17] used a dual decomposition
technique for graph matching optimization. Cho et al. [6] introduced a random
walk algorithm that explores the relaxed solution space while taking the
matching constraints into consideration. More recently, to overcome the limitation of pairwise re-
lations and embed higher-order information, hyper-graph matching techniques
were proposed based on generalizations of the IQP [8,9].
There have been many sampling-based algorithms for feature correspondence
and shape matching in the literature. Dellaert et al. [18] used the Metropolis-Hastings
(MH) scheme to solve a correspondence problem in a Bayesian framework. Cao
et al. [2] used the MH scheme for partial shape matching. Tamminen et al. [19]
used the SMC framework to solve Bayesian matching of objects with occlusion.
Lu et al. [20] and Yang et al. [21] proposed shape detection algorithms based on
the particle filter with static observations. However, all these methods for shape
matching problems cannot be applied to graph matching in general. Lee et al. [7]
introduced the data-driven Markov chain Monte Carlo (DDMCMC) algorithm
for graph matching, which adopts spectral relaxation for data-driven proposals.
To the best of our knowledge, our work is the first attempt to use the SMC framework
for the matching of general graphs.
2 Problem Definition
image P . In this paper, one-to-one matching constraints are adopted: every
node in GP is mapped to at most one node in GQ and vice versa. Then, graph
matching is to find node correspondences between GP and GQ that best pre-
serve the attribute relations under the matching constraints. A graph matching
solution M is represented by a set of assignments or matches (viP , vaQ ) between
626 Y. Suh, M. Cho, and K.M. Lee
Fig. 1. The main idea of the proposed algorithm. The algorithm sequentially samples
potential matches via proposal distributions to maximize the graph matching objective
under the matching constraints. At every time step, new matches are sampled based
on their affinity with the previously sampled matches.
Unlike typical SMC methods for dynamic problems [22], we do not assume any
prior knowledge about the graph for generality, except for the affinity matrix.
In this case, it is intractable to calculate the conditional distribution even up
to a normalizing constant, and thus direct sequential sampling from the conditional
distributions is also intractable.
To formulate graph matching in the SMC framework, we introduce a se-
quence of intermediate distributions [24], which starts from a simple initial dis-
tribution and gradually moves toward the final target distribution equivalent to
the objective function of (1). This SMC procedure performs graph matching by
sequentially re-distributing particles over the evolving sample space according
to the intermediate target distribution at each step. The procedure is illustrated
in Fig. 2, and the details are described in the following subsections.
The feasible set of the graph matching problem, i.e., the set of all possible x satisfying
the constraints of Eq. (1), is denoted by X. Each intermediate target distribution
πt in Eq. (6) is then defined on the measurable space (Xt, Σt), where Xt = {x ∈
X | Σ_{i,a} x(i, a) = t}. The more true matches a particle xt^(i) includes, the higher
Fig. 2. Overview of the proposed method. (a) A sequence of intermediate target distri-
butions from a simple initial distribution to the target distribution is constructed. (b)
The transition from time step t−1 to t is performed using the sampling and impor-
tance resampling (SIR) technique, exploiting the particles from the previous distribu-
tion. (c) An example of the neighborhood relation between two candidate matchings:
they share two common matches and differ in one match.
value the probability πt(xt^(i)) takes. Thus, the sequence of distributions {πt}_{t=1}^{L}
effectively drives the particles to the final target distribution πL.

πt(xt) = Σ_{xt−1} πt−1(xt−1) K(xt−1, xt) ≈ Σ_{i=1}^{N} πt−1(xt−1^(i)) K(xt−1^(i), xt).   (8)
For further discussion, let us denote the particles after transition by {x̃t^(i)}_{i=1}^{N}.
The particles with equal matches are grouped together to form a set of n unique
assignment vectors {gt^(j)}_{j=1}^{n}, and the number of particles corresponding to the
The selected match m_w is replaced with a new match sampled by the tran-
sition kernel K(xt^(i) − e_{m_w}, xt^(i)) of Eq. (7). After repeating the grouping and
resampling process of Sec. 3.2, the updated particles {xt^(i)}_{i=1}^{N} are obtained. In
this updating procedure, while following the intermediate target distribution,
the particles are redistributed into the support with high probability under the
target distribution.
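The sequential grow-and-resample procedure described above can be illustrated with a toy sampler. This is a simplified sketch under our own assumptions, not the authors' SMCM implementation: one-to-one conflicts between matches are not pruned, the importance weight is simply the objective score x^T W x, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def smc_matching(W, n_steps, n_particles=200):
    """Toy SMC sampler: each particle is a set of candidate-match
    indices, grown by one match per step. New matches are proposed in
    proportion to their affinity with the particle's current matches,
    and particles are resampled according to the objective x^T W x."""
    n_cand = W.shape[0]
    particles = [set() for _ in range(n_particles)]
    for _ in range(n_steps):
        weights = np.empty(n_particles)
        for i, p in enumerate(particles):
            free = [c for c in range(n_cand) if c not in p]
            # proposal: affinity of each free match with the particle
            aff = np.array([sum(W[c, q] for q in p) + 1e-6 for c in free])
            p.add(rng.choice(free, p=aff / aff.sum()))
            weights[i] = sum(W[a, b] for a in p for b in p) + 1e-9
        # resample particles in proportion to their objective score
        idx = rng.choice(n_particles, n_particles, p=weights / weights.sum())
        particles = [set(particles[j]) for j in idx]
    return max(particles, key=lambda p: sum(W[a, b] for a in p for b in p))

# Toy affinity: matches {0,1} and {2,3} form mutually consistent pairs.
W = np.array([[0.0, 0.9, 0.1, 0.0],
              [0.9, 0.0, 0.0, 0.2],
              [0.1, 0.0, 0.0, 0.8],
              [0.0, 0.2, 0.8, 0.0]])
best = smc_matching(W, 2)
```

With enough particles, most of them lock onto one of the two consistent pairs, and the best particle attains a high objective score.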
The total complexity of the one-step procedure is O(N n cm), and the entire algorithm
takes O(n cm L N), where L denotes the number of sampling steps.
Initial Distribution. The initial distribution is calculated as the marginal
of the intermediate target distribution π2. Utilizing the symmetric property of
the affinity matrix, the initial distribution π1 is obtained as
4 Experiments
In this section, two sets of experiments were performed for quantitative evalua-
tion: (1) synthetically generated random graph matching and (2) feature match-
ing using real images. The proposed algorithm (SMCM) was compared with
The random graph matching experiments were performed following the exper-
imental protocol of [16,15,6]. In this experiment, the Hungarian algorithm [26]
was used for the post-processing discretization, and = 2 was used for SMCM.
Two graphs, GP and GQ, were constructed at each trial with n = nin + nout
nodes, where nin and nout denote the numbers of inlier and outlier nodes, respec-
tively. The reference graph GP was generated with random edges, where each
edge (i, j) ∈ EP was assigned a random attribute aP_ij distributed uniformly
in [0, 1]. A perturbed graph GQ was then created by adding noise on the edge
attributes between inlier nodes: aQ_p(i)p(j) = aP_ij + ε, where p(·) is a random permu-
tation function. The deformation noise ε was drawn from the Gaussian distribution
N(0, σ²). All the other edges, connecting at least one of the outlier nodes, were
generated with random attributes in the same way as in GP. Thus, GP and
GQ share a common but perturbed subgraph of size nin. The affinity matrix
W was computed by W_ia;jb = exp(−|aP_ij − aQ_ab|²/s²) for e_ij ∈ EP and e_ab ∈ EQ,
and the diagonal terms were set to zero. The scaling factor s² was chosen to
be 0.1, which performed best across all algorithms. The accuracy was computed as
the ratio of the number of true positive matches to the number of ground-truth matches. The
objective score was computed by the IQP objective xT Wx.
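The data-generation protocol above can be sketched as follows. The helper name `make_problem`, its defaults, and the candidate-match indexing i·n + a are our own assumptions; the noise here is also not symmetrized across edge pairs, so this is a simplified sketch rather than the exact benchmark code.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_problem(n_in=20, n_out=5, sigma=0.1, s2=0.1):
    """Random graph matching problem following the text's protocol:
    G^P has uniform random edge attributes; G^Q copies the inlier
    subgraph (under a random permutation) with N(0, sigma^2) noise
    and adds randomly attributed outlier edges."""
    n = n_in + n_out
    aP = rng.uniform(0, 1, (n, n)); aP = (aP + aP.T) / 2
    aQ = rng.uniform(0, 1, (n, n)); aQ = (aQ + aQ.T) / 2
    perm = rng.permutation(n)
    inl = perm[:n_in]  # ground truth: inlier i in G^P maps to perm[i]
    aQ[np.ix_(inl, inl)] = aP[:n_in, :n_in] + rng.normal(0, sigma, (n_in, n_in))
    # affinity between candidate matches (i -> a) and (j -> b),
    # flattened so that match (i, a) has index i * n + a
    W = np.exp(-(aP[:, None, :, None] - aQ[None, :, None, :]) ** 2 / s2)
    W = W.reshape(n * n, n * n)
    np.fill_diagonal(W, 0)  # unary (diagonal) terms are set to zero
    return W, perm
```

Accuracy can then be scored by comparing a recovered assignment against `perm` restricted to the n_in inlier nodes.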
In the experimental setting, there are two independent variables: the number
of outliers nout and the amount of deformation noise σ. Using these, three ex-
periments with different conditions were conducted: (1) varying the amount of
deformation noise with no outliers, (2) varying the number of outliers with no
deformation, and (3) varying the number of outliers with some deformation. Each ex-
periment was repeated 100 times with different matching problems generated
by the above procedure, and the average accuracy and objective score were
recorded. In all experiments, the number of inliers nin is fixed to 20. Figure 3(a)
shows the results of varying the deformation noise σ from 0 to 0.3 in increments
of 0.05 while fixing the number of outliers nout = 0 and the number of particles N = 10000.
Figure 3(b) shows the results of varying the number of outliers nout from 0 to
20 in increments of two while fixing N = 2000 and deformation noise σ = 0.
1 http://www.seas.upenn.edu/~timothee/
2 http://www.cs.huji.ac.il/~zass/
3 http://86.34.14.245/code.php
4 http://cv.snu.ac.kr/research/~RRWM/
Fig. 3. Synthetic graph matching experiments. The accuracies (top), objective scores
(middle) and computational times (bottom) are plotted for three different conditions:
(a) varying deformation noise with no outliers, (b) varying the number of outliers with
no deformation noise, and (c) varying the number of outliers with some deformation
noise.
Figure 3(c) shows the results of varying the number of outliers nout from 0 to
20 in increments of two while fixing N = 5000 and deformation noise σ = 0.1.
As observed in the score plots (middle) in Fig. 3, SMCM consistently out-
performs all the other algorithms in the objective score, regardless of the degree
of deformation and the number of outliers; SMCM performs best as an optimizer of the graph
matching score. In particular, the accuracy of SMCM significantly exceeds those of
the other algorithms under increasing outliers, as shown in Figs. 3(b) and 3(c). This
implies that, in the SMC framework, SMCM effectively avoids the adverse effects of outliers
through sequential collection of inliers under the matching constraints.
In the deformation experiment of Fig. 3(a), the accuracy of SMCM is comparable or
slightly inferior to that of RRWM. However, SMCM outperforms all other
methods in realistic conditions where both outliers and deformation exist, as in
Fig. 3(c). The bottom row of Fig. 3 shows the computational time.
Fig. 4. Some representative results of the real image matching experiments. True
positive matches are represented by blue lines and false positive matches are represented
by black lines.
Fig. 6. The average precision of particles and the number of particle groups at each
time step, measured on the synthetic random graph matching of Sec. 4.1. (a) The
average precision of particles increases as the number of matches in particles increases.
The plot is depicted while the number of outliers is varied. An abrupt increase occurs
when enough true matches (3–8) are sampled. (b) The number of particle groups is
plotted versus the time step while the number of outliers is varied. The number of
groups decreases as the match size increases.
Fig. 7. The accuracies for different settings of are shown. The experimental settings
are the same as for the synthetic random graph matching in Sec. 4.1.
When the precision saturates, the number of groups gradually decreases after a
small increase, resulting in an effective reduction of computational cost.
In the third experiment, we evaluated the accuracy for different settings of
under the same experimental settings as in Sec. 4.1. The results are shown in Fig.
7. As the value of increases, SMCM becomes more robust to the deformation
noise. In the presence of outliers, however, the highest accuracy is achieved at
= 0.1, where the distracting outlier matches with relatively small affinity values
are effectively ignored.
5 Conclusion
In this paper, we proposed a novel graph matching algorithm based on the Se-
quential Monte Carlo (SMC) framework. During the sequential sampling
procedure, our algorithm effectively collects potential matches under the matching
constraints and avoids the adverse effects of outliers and deformation. The exper-
imental results show that it outperforms the state-of-the-art algorithms and is
highly robust to outliers. Since SMC can preserve multi-modal samples among its
particles, the method is extendable to multi-modal graph matching by exploiting
this property. We leave this as future work.
References
1. Duchenne, O., Joulin, A., Ponce, J.: A graph-matching kernel for object catego-
rization. In: ICCV (2011)
2. Cao, Y., Zhang, Z., Czogiel, I., Dryden, I., Wang, S.: 2d nonrigid partial shape
matching using mcmc and contour subdivision. In: CVPR (2011)
3. Birchfield, S.: KLT: An implementation of the Kanade-Lucas-Tomasi feature tracker
(1998)
4. Leordeanu, M., Hebert, M.: A spectral technique for correspondence problems using
pairwise constraints. In: ICCV (2005)
5. Leordeanu, M., Hebert, M.: An integer projected fixed point method for graph
matching and map inference. In: NIPS (2009)
6. Cho, M., Lee, J., Lee, K.M.: Reweighted Random Walks for Graph Matching.
In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part V. LNCS,
vol. 6315, pp. 492–505. Springer, Heidelberg (2010)
7. Lee, J., Cho, M., Lee, K.M.: A graph matching algorithm using data-driven markov
chain monte carlo sampling. In: ICPR (2010)
8. Duchenne, O., Bach, F., Kweon, I.S., Ponce, J.: A tensor-based algorithm for high-
order graph matching. PAMI (2010)
9. Lee, J., Cho, M., Lee, K.M.: Hyper-graph matching via reweighted random walks.
In: CVPR (2011)
10. Caetano, T., McAuley, J., Cheng, L., Le, Q., Smola, A.: Learning graph matching.
PAMI (2009)
11. Leordeanu, M., Hebert, M.: Unsupervised learning for graph matching. In: CVPR
(2009)
12. Cho, M., Lee, K.M.: Progressive graph matching: Making a move of graphs via
probabilistic voting. In: CVPR (2012)
13. Cappe, O., Godsill, S., Moulines, E.: An overview of existing methods and recent
advances in sequential monte carlo. Proc. IEEE (2007)
14. Maciel, J., Costeira, J.P.: A global solution to sparse correspondence problems.
PAMI 25, 187–199 (2003)
15. Gold, S., Rangarajan, A.: A graduated assignment algorithm for graph matching.
PAMI, 411–436 (1996)
16. Cour, T., Srinivasan, P., Shi, J.: Balanced graph matching. In: NIPS (2006)
17. Torresani, L., Kolmogorov, V., Rother, C.: Feature Correspondence Via Graph
Matching: Models and Global Optimization. In: Forsyth, D., Torr, P., Zisserman,
A. (eds.) ECCV 2008, Part II. LNCS, vol. 5303, pp. 596–609. Springer, Heidelberg
(2008)
18. Dellaert, F., Seitz, S.M., Thrun, S., Thorpe, C.: Feature correspondence: A markov
chain monte carlo approach. In: NIPS (2001)
19. Tamminen, T., Lampinen, J.: Sequential monte carlo for bayesian matching of
objects with occlusions. PAMI (2006)
20. Lu, C., Latecki, L.J., Adluru, N., Yang, X., Ling, H.: Shape guided contour grouping
with particle filters. In: ICCV (2009)
21. Yang, X., Latecki, L.J.: Weakly Supervised Shape Based Object Detection with
Particle Filter. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010,
Part V. LNCS, vol. 6315, pp. 757–770. Springer, Heidelberg (2010)
22. Isard, M., Blake, A.: Condensation - conditional density propagation for visual
tracking. IJCV (1998)
23. Del Moral, P., Doucet, A.: Sequential monte carlo for bayesian computation. Journal
of the Royal Statistical Society (2007)
24. Del Moral, P., Doucet, A., Jasra, A.: Sequential monte carlo samplers. Journal of
the Royal Statistical Society, 411–436 (2006)
25. Zass, R., Shashua, A.: Probabilistic graph and hypergraph matching. In: CVPR
(2008)
26. Munkres, J.: Algorithms for the assignment and transportation problems. SIAM
(1957)
27. Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust wide baseline stereo from
maximally stable extremal regions. In: BMVC (2002)
Jet-Based Local Image Descriptors
1 Introduction
Image characterizations based on local interest points and descriptors like SIFT
[1], GLOH [2], PCA-SIFT [3], DAISY [4,5], and many others have had great
impact on computer vision research. Such descriptors have been used in a range
of applications for efficiently determining image similarities. Common to these
approaches is their aim at a description of a local image patch that is invariant
with respect to photogrammetric variations like viewpoint, scale, and lighting
change. Additionally, they all have low dimensionality compared to the original
pixel representation, but the difference in dimensionality among the descriptors
varies significantly.
These characterizations all describe the local geometry of the image, like the
local distributions of first order derivatives in the SIFT descriptor [1], or the
learned basis of PCA-SIFT [3] that encodes the parameters of this basis. The
introduction of PCA-SIFT inevitably leads to the question of whether the com-
plex structure and relatively high dimensionality of most descriptors are in fact
over-representations, and whether much simpler, theoretically sound, and com-
pact formulations exist. This observation has inspired us to investigate the local
k-jet [6], a higher order differential descriptor, as a natural basis for encoding
the geometry of an image patch. Thus, the descriptor we propose can, similar to
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 638–650, 2012.
© Springer-Verlag Berlin Heidelberg 2012
2 Jet Descriptor
L_{x^n y^m}(r; σ) = σ^{n+m} ∂^{n+m}/(∂x^n ∂y^m) (G ∗ I)(r; σ),   r = (x, y),   (1)
640 A.B.L. Larsen et al.
Here δ_0 denotes the Dirac delta function centered at zero. The local k-jet is
then given by the vector J_k(r; σ) ∈ R^K, K = (2+k)!/(2·k!) − 1.
By excluding the zeroth order term, the k-jet becomes invariant to additive
changes of the intensities.
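Eq. (1) can be sketched with SciPy's Gaussian derivative filters. The helper `local_jet` and the sample patch size are our own assumptions (the authors' boundary handling and sampling details are not given in the text); note that for k = 4 the jet has (2+4)!/(2·4!) − 1 = 14 components, matching the 14 convolutions mentioned later for J4-grid2.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def local_jet(patch, sigma, k):
    """Scale-normalized Gaussian derivative responses (the local k-jet
    of eq. (1)) at the centre pixel, zeroth order excluded."""
    img = patch.astype(float)
    cy, cx = patch.shape[0] // 2, patch.shape[1] // 2
    jet = []
    for order in range(1, k + 1):
        for nx in range(order + 1):
            ny = order - nx
            # derivative of order nx along x (axis 1) and ny along y (axis 0)
            d = gaussian_filter(img, sigma, order=(ny, nx))
            jet.append(sigma ** order * d[cy, cx])  # sigma^(n+m) normalization
    return np.array(jet)
```

On a pure horizontal ramp I(x, y) = x, the first-order x response approximates σ·1 while the y response vanishes, which is a quick sanity check of the axis conventions.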
By construction, the scale space derivatives are correlated, and under basic
assumptions on the image statistics this correlation can be described analyti-
cally [14,15]. In order to create a scale normalized and decorrelated descriptor,
we therefore perform a whitening of the local k-jet coefficients according to the
covariance structure derived in [14,15]. This yields a vector whose elements
are uncorrelated and of the same order of magnitude, allowing the use of the
Euclidean distance as descriptor similarity measure. According to [14], the co-
variance between the jet coefficients Lxi yj and Lxk yl, where n = i + k and
m = j + l are both even, is given by
cov(L_{x^i y^j}, L_{x^k y^l}) = (−1)^{(n+m)/2 + k + l} · β² n! m! / (2π σ^{n+m} 2^{n+m} (n+m) (n/2)! (m/2)!).   (4)
Here σ is the scale parameter of the local k-jet, and β is a model parameter that is
irrelevant in our context and will be set to β = 1. If either n or m is odd, the
covariance is 0. Finally, we remark that this normalization method is related,
but not identical, to the descriptor similarity measure proposed in [8]. After the
whitening, the descriptor is L2 normalized to achieve invariance to affine contrast
changes.
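The whitening step can be sketched as follows. The covariance entries implement our reading of the garbled eq. (4), so the exact constants should be treated as assumptions; `jet_covariance` and `whiten` are hypothetical names, not the authors' code.

```python
import numpy as np
from math import factorial, pi

def jet_covariance(k, sigma, beta=1.0):
    """Covariance matrix of the jet coefficients under the model of
    eq. (4); entries are zero when i+k or j+l is odd."""
    idx = [(i, j) for order in range(1, k + 1)
           for i in range(order + 1) for j in [order - i]]
    C = np.zeros((len(idx), len(idx)))
    for a, (i, j) in enumerate(idx):
        for b, (kk, ll) in enumerate(idx):
            n, m = i + kk, j + ll
            if n % 2 or m % 2:
                continue  # odd total derivative order: zero covariance
            C[a, b] = ((-1) ** ((n + m) // 2 + kk + ll) * beta ** 2
                       * factorial(n) * factorial(m)
                       / (2 * pi * sigma ** (n + m) * 2 ** (n + m)
                          * (n + m) * factorial(n // 2) * factorial(m // 2)))
    return C, idx

def whiten(jet, C):
    # decorrelate via the inverse Cholesky factor, then L2 normalize
    L = np.linalg.cholesky(C)
    w = np.linalg.solve(L, jet)
    return w / np.linalg.norm(w)
```

Since C is a covariance matrix, its Cholesky factor exists, and the whitened coefficients can be compared with the plain Euclidean distance as described above.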
Using the local k-jet above, we wish to investigate which descriptor configu-
rations yield good discriminability. More specifically, we investigate the effect
of sampling jets in a multi-scale or multi-local fashion. We also explore to
what extent we can increase the differential order k while still getting performance
improvements. This leads to the descriptor proposals listed in Table 1. We use
the following naming convention. The suffix -scale2 indicates that the jet is
sampled at two different scales in a single point. The suffix -gridn indicates a
multi-local jet sampling in a regular square grid of size n × n (similar to the
spatial sampling grid used in SIFT [1]).
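The -gridn construction can be sketched by computing one set of derivative filter responses and sampling them at each grid offset. Names, the patch size, and the offset values are illustrative assumptions, not the paper's configuration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def jet_grid_descriptor(patch, sigma, k, offsets):
    """-gridn style descriptor: concatenate the k-jet sampled at a
    square grid of pixel offsets around the patch centre."""
    img = patch.astype(float)
    cy, cx = patch.shape[0] // 2, patch.shape[1] // 2
    resp = []
    for order in range(1, k + 1):
        for nx in range(order + 1):
            ny = order - nx
            # each filter response is computed once and reused per offset
            resp.append(sigma ** order * gaussian_filter(img, sigma, order=(ny, nx)))
    desc = [r[cy + dy, cx + dx] for dy in offsets for dx in offsets for r in resp]
    return np.array(desc)
```

For k = 4 and a 2 × 2 grid this yields a 4 · 14 = 56-dimensional descriptor from only 14 convolutions, illustrating the dimensionality trade-off discussed later.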
Following [2,16], image patches of size 64 × 64 pixels are extracted at three
times the scale of the interest points generated by the interest point detector.
Moreover, affine interest point regions are warped to isotropic regions. The jet
descriptor parameters shown in Table 1 have been optimized manually using the
dataset by the Oxford Visual Geometry Group¹. The parameters denote the

1 Data and code available at http://www.robots.ox.ac.uk/~vgg/research/affine.
jet scales in pixel units. The -gridn sampling location parameters p denote the
x and y pixel offsets of the grid columns and rows, respectively (the row and
column offsets are the same since the grid is square, hence the shared p values).
The proposed jet descriptors are, by construction, invariant to changes in
scale, translation, and contrast. The scale and translation invariance arise mainly
from the interest point detectors, which provide a localization of the interest point
in scale-space, and from the resampling of the interest region around the point
to 64 × 64 pixels. However, this localization includes some noise depending on
the quality of the detector algorithm and its implementation (see e.g. [10]). We
compensate for this by using fairly large scales relative to the 64 × 64 pixel patch,
making the jet descriptor robust towards small perturbations in the detected
position and scale. We compensate for the reduction in derivative response level
caused by the Gaussian blurring by using scale normalized derivatives (see (1)).
The jet descriptors are not invariant to rotations around the optical axis, as we do
not orient the jets according to e.g. a dominant orientation. This could, however,
easily be achieved by rotating the coordinate frame according to some fiducial
orientation of the image patch and expressing the jets in this coordinate system.
Table 1. The jet descriptor proposals, their parameters and dimensionalities. Bot-
tom rows: State-of-the-art descriptors that we compare the jet descriptors to in our
experiments.
Fig. 1. The DTU Robot dataset. (a): Examples of different scenes. (b): A top-down
view of all camera positions (circles) and light source positions (pluses). The key frame
in the front center is used as the reference image when evaluating descriptor performance.
Arc 1 is at a distance of 0.5 m from the scene and spans 40°, Arc 2 has distance
0.65 m and spans 25°, and Arc 3 has distance 0.8 m and spans 20°. The linear path
spans the range [0.5 m; 0.8 m].
[Fig. 2 (plot data not recoverable): mean AUC of the jet descriptor configurations
J4, J5, J6, J7, J4-scale2, J5-scale2, J3-grid2, J3-grid4, J4-grid2, and J5-grid2 over
Arc 1, Arc 2, Arc 3, the linear path, and the two light paths.]
4 Experiments
Using the DTU Robot dataset, we investigate the performance of our different
jet descriptor configurations from Table 1. For this experiment, we use interest
points generated by the MSER detector [18], as it has been shown to perform well
[16]. The experimental results are shown in Fig. 2. We see that for the single
jets, the performance improvements stagnate around k = 6. The multi-scale
approach does not seem to increase the discriminability significantly, as J4-scale2
and J5-scale2 only have a slight edge over J4 and J5, respectively. The multi-
scale descriptors are slightly more discriminative than the single-scale descriptors
for changes in view point. Under illumination variations, the performance of the
multi-scale descriptors is no better than that of the single-scale descriptors. Conversely,
the multi-local approach yields a visible improvement, as it consistently achieves
better scores than the single- and multi-scale jets. We see, to some extent, that
multi-locality can be replaced with higher-order image structure, as J4-grid2
performs slightly better than J3-grid4. Among the multi-local approaches, J4-
grid2 offers a good trade-off between descriptor dimensionality and performance.
We compare our jet descriptors with the state-of-the-art descriptors listed in
Table 1. We choose these descriptors because they are among the top performers
in the comparative study by Dahl et al. [16]. These descriptors are generated
using code made available by the Oxford Visual Geometry Group and from the
SURF website³. Since our dataset does not support rotations around the optical
axis, we have disabled rotational invariance for all descriptors to better reveal
their discriminative ability. As representative for our multi-local jet descriptors,

3 http://www.vision.ee.ethz.ch/~surf
[Comparison plots (data not recoverable): mean AUC of GLOH, SURF, PCA-SIFT,
J5, and J7 over Arc 1, Arc 2, Arc 3, the linear path, and the two light paths,
shown separately per interest point detector.]
in Arc 1. One may speculate whether this difference arises from the different
geometries detected by DoG and Harris-Affine (blobs versus corners) or from
the affine correction performed by the Harris-Affine detector.
5 Discussion
From the experiments, we have seen that jet-based descriptors offer a viable al-
ternative to state-of-the-art local image descriptors. More specifically, the multi-
local jet descriptor J4-grid2 has shown superior results for illumination changes,
competitive performance for changes in viewing angle, and performance slightly
under par for scale variations, except for DoG interest points, on which we obtain
superior performance for all perturbations. Thus, we have shown that the popu-
lar histogram approaches, as applied in e.g. SIFT [1] and its extensions [2,4,5], are
not crucial for achieving good performance. Additionally, we have shown that the
use of higher-order differential image structure allows us to reduce the complex-
ity of the multi-locality (recall that e.g. SIFT is constructed from 16 histogram
sampling points). In fact, a single jet can be competitive with both SIFT and
GLOH in some situations. The reduction of multi-local sampling points leads to
a much lower descriptor dimensionality, which has otherwise typically been obtained by
adding a PCA step to the description algorithm.
The jet descriptors have, similar to PCA-SIFT, an interpretation as a basis
representation of the image patch. However, instead of learning a basis for the
patch, our descriptor relies on the monomial basis of the Taylor expansion of the
patch. This interpretation gives us a low-dimensional descriptor which allows us
to reconstruct the patch directly from the descriptor itself [20].
Descriptor performance has been evaluated using interest points generated
by different popular detectors. We consider this an important part of our de-
scriptor evaluation, as we have seen that the relative descriptor performance is
to a large extent dependent on the detector. Thus, one should be very careful
when drawing conclusions from a single interest point detector, as it will clearly
favor some descriptors over others. From our results, we have seen that the jet
descriptors offer superior performance on DoG interest points in all perturbation
scenarios. For MSER, HarAff, and HarLap interest points, the jet approach offers
competitive performance.
The promising results of our jet descriptors are at first glance peculiar con-
sidering previous non-favorable evaluations of descriptors based on image differ-
entials [2,8]. Our approach differs in the following ways: we use the local k-jet
directly (instead of differential invariants), we extract differentials to a higher
order, and we employ multi-local sampling of the jets. Including spatial sampling
of jets and increasing the differential order is essential to the success of our descrip-
tor. We speculate that part of the explanation for the good performance of our
jets is that the DTU Robot dataset contains non-planar surfaces (as we have
seen, jet descriptors are more robust in these situations). Previous evaluations of
differential descriptors have been on datasets that to a large extent contain planar
or near-planar surfaces.
648 A.B.L. Larsen et al.
[Fig. 7 plots: mean AUC over Arc 1, Arc 2, Arc 3 and the linear path, shown for DoG interest points on planar and non-planar surface structures.]
Fig. 7. Descriptor performance for planar and non-planar image surface structures
Jet-Based Local Image Descriptors 649
Our jet descriptors are very simple to implement and rely on very few parameters (compared to histogram-based descriptors), making them easy to configure. We remark that J4-grid2 requires 14 convolutions of the input patch, which allows for a fast implementation in hardware. Single-scale jets can be implemented even more efficiently from a series of point-wise products.
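The 14 convolutions for J4 correspond to the 2 + 3 + 4 + 5 Gaussian derivatives of orders 1 through 4; sampling each derivative image at grid² points then yields the multi-local descriptor. The sketch below assumes SciPy, and the grid placement and scale normalization are illustrative assumptions, not the paper's exact layout.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def jet_images(img, sigma=2.0, k=4):
    """All Gaussian derivative images up to total order k.
    For k = 4 this is 2 + 3 + 4 + 5 = 14 convolutions of the input."""
    imgs = []
    for order in range(1, k + 1):
        for ny in range(order + 1):
            imgs.append((sigma ** order) *
                        gaussian_filter(img.astype(float), sigma,
                                        order=(ny, order - ny)))
    return imgs

def multi_local_jet(img, grid=2, sigma=2.0, k=4):
    """Sample each derivative image on an interior grid x grid lattice
    (J4-grid2 style): descriptor length is 14 * grid**2 for k = 4."""
    h, w = img.shape
    ys = np.linspace(0, h - 1, grid + 2)[1:-1].round().astype(int)
    xs = np.linspace(0, w - 1, grid + 2)[1:-1].round().astype(int)
    imgs = jet_images(img, sigma, k)
    return np.array([d[y, x] for d in imgs for y in ys for x in xs])
```

Because the derivative images are computed once and merely sampled, adding grid points raises descriptor dimensionality without extra convolutions.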
Taking the positive performance results and the simplicity of the descriptor into account, we believe that the multi-local jet descriptor can prove to be valuable for many computer vision applications.
References
1. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 91–110 (2004)
2. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 1615–1630 (2005)
3. Ke, Y., Sukthankar, R.: PCA-SIFT: A more distinctive representation for local image descriptors. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), vol. 2, pp. 506–513 (2004)
4. Winder, S., Hua, G., Brown, M.: Picking the best daisy. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009), pp. 178–185 (2009)
5. Tola, E., Lepetit, V., Fua, P.: Daisy: An efficient dense descriptor applied to wide-baseline stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence 32, 815–830 (2010)
6. Florack, L., ter Haar Romeny, B.M., Viergever, M., Koenderink, J.: The Gaussian scale-space paradigm and the multiscale local jet. International Journal of Computer Vision 18, 61–75 (1996)
7. Schmid, C., Mohr, R.: Local grayvalue invariants for image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 19, 530–535 (1997)
8. Balmashnova, E., Florack, L.: Novel similarity measures for differential invariant descriptors for generic object retrieval. Journal of Mathematical Imaging and Vision 31, 121–132 (2008)
9. Laptev, I., Lindeberg, T.: Local Descriptors for Spatio-temporal Recognition. In: MacLean, W.J. (ed.) SCVMA 2004. LNCS, vol. 3667, pp. 91–103. Springer, Heidelberg (2006)
10. Aanæs, H., Dahl, A., Steenstrup Pedersen, K.: Interesting interest points: A comparative study of interest point performance on a unique data set. International Journal of Computer Vision 97, 18–35 (2012)
11. Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffalitzky, F., Kadir, T., Gool, L.V.: A comparison of affine region detectors. International Journal of Computer Vision 65, 43–72 (2005)
12. Kaneva, B., Torralba, A., Freeman, W.T.: Evaluating image features using a photorealistic virtual world. In: IEEE International Conference on Computer Vision (2011)
13. Vedaldi, A., Ling, H., Soatto, S.: Knowing a Good Feature When You See It: Ground Truth and Methodology to Evaluate Local Features for Recognition. In: Cipolla, R., Battiato, S., Farinella, G. (eds.) Computer Vision. SCI, vol. 285, pp. 27–49. Springer, Heidelberg (2010)
14. Pedersen, K.S.: Statistics of natural image geometry. PhD thesis, University of Copenhagen, Department of Computer Science, Denmark (2003)
15. Markussen, B., Pedersen, K.S., Loog, M.: Second order structure of scale-space measurements. Journal of Mathematical Imaging and Vision 31, 207–220 (2008)
16. Dahl, A.L., Aanæs, H., Pedersen, K.S.: Finding the best feature detector-descriptor combination. In: The First Joint Conference on 3D Imaging, Modeling, Processing, Visualization and Transmission (2011)
17. Bay, H., Ess, A., Tuytelaars, T., Gool, L.V.: Speeded-up robust features (SURF). Computer Vision and Image Understanding 110, 346–359 (2008)
18. Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust wide-baseline stereo from maximally stable extremal regions. Image and Vision Computing 22, 761–767 (2004)
19. Mikolajczyk, K., Schmid, C.: Scale & affine invariant interest point detectors. International Journal of Computer Vision 60, 63–86 (2004)
20. Lillholm, M., Nielsen, M., Griffin, L.D.: Feature-based image analysis. International Journal of Computer Vision 52, 73–95 (2003)
Abnormal Object Detection
by Canonical Scene-Based Contextual Model
1 Introduction
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 651–664, 2012.
© Springer-Verlag Berlin Heidelberg 2012
652 S. Park, W. Kim, and K.M. Lee
Fig. 1. Examples of abnormal objects detected using the proposed method. Each image
contains the topmost abnormal object that violates co-occurrence, relative position and
relative scale among objects.
This paper focuses on abnormal object detection, that is, finding abnormal objects in given scenes. We define abnormal objects as those that do not match expectations set by the surrounding context or by common knowledge. Specifically, abnormal objects (1) do not co-occur with surrounding objects (co-occurrence-violating objects), (2) violate positional relationships with other objects (position-violating objects) or (3) have unusually large or small sizes (scale-violating objects) (Fig. 1). Exploiting the distributions of normal objects is necessary for abnormal object detection. This is because abnormality can be defined based on the extent to which an object is not normal.
Four necessary properties of abnormal object detection methods are defined as follows: (1) affluent object relations, (2) quantitative object relations, (3) affluent context types, and (4) prior-free object search. First, affluent relations among objects, such as a fully connected relation among objects, are critical because abnormal objects rarely occur. If an abnormal object, such as a floating car, is related with only one object, road, then identifying whether the target object is abnormal or not becomes difficult, because models can only identify the floating car's abnormality once the road has been detected correctly.

Second, object relations, particularly relative position/scale relations, should be defined quantitatively because qualitative representation is improper for determining contextual violations. For example, if the relative position is qualitatively defined, such as "above", then "person" and "road" can be related by the "above" relation. However, identifying the abnormality of a person hanging in the air becomes difficult because the person is still "above" the road. Third, contextual models become more informative when more context types, such as co-occurrence and relative position/scale among objects, are used [7]. Finally, the models should not restrict the interpretation of scenes, so that abnormal objects can be found properly. If an abnormal object-detecting model has prior knowledge of where to search for objects, then finding abnormal objects is nearly impossible because abnormal objects do not exist in locations where they are expected.
We propose a novel generative model that generates both normal and abnormal objects, following a learned configuration among the objects. This learned configuration is determined by transforming a standard configuration, also called
2 Related Work
Although a variety of contextual models have been proposed over the past decade [2–4, 1, 5, 9, 7, 6, 10, 11], contextual models that meet all four criteria for finding abnormal objects are difficult to find.
One group of such models incorporates object-object interaction [12], which is used to directly enforce contextual relations among objects. In [9], a conditional random field (CRF)-based contextual model provides only co-occurrence statistics among objects to refine mislabeled image segments. This work is further expanded in [7] so that relative position context, such as "above/below" and "inside/around", among objects is qualitatively encoded in the CRF model. Furthermore, [4] proposed a "far/near" relation between objects to reduce the ambiguity of qualitative context representation. In [1], Choi represents the relation among objects in a tree model for computational efficiency, taking only a few significant relationships among the objects into consideration. This tree model is further extended in [5] to also encode "support" context, a geometric context for prohibiting floating objects. The strategy of [3] is to first find easy objects, and then search for difficult objects based on the locations of the easy
Fig. 2. (a) Possible locations of objects in learned outdoor Canonical Scene. (b) The
graphical model of the Canonical Scene model.
objects. This model combines boosting and CRF to efficiently search for unknown objects based on known ones, making it unnecessary to search the entire image to locate objects. [6] leverages the qualitative spatial contexts of "stuff", such as sky or road, to improve object detection in satellite images.
Another group of contextual models indirectly embeds the contextual relationships among objects via a latent variable, such as a scene. [2] correlates objects through a common cause, "scene", thus including co-occurrence statistics in a hierarchical Bayesian model. The same contextual information is exploited by [10] to achieve scene classification, annotation, and segmentation altogether in a single framework. [13] solves the object detection problem by assuming that holistically similar images share the same object configuration. This method retrieves normal, annotated images with similar Gist features [14] and then constructs a graphical contextual model to constrain which objects appear and where they can be found.
This paper adopts object-level image representation (such as [15, 1]), which differs from conventional low- or mid-level representations of images, such as colors, SIFT or bags of visual words. Representing an image at object level is necessary because our goal includes modeling the objects' position/scale in the real world based on the objects in the image.
An image is described as a set of candidate objects, {(x_{o,n}, y_{o,n}) | 1 ≤ o ≤ O, 1 ≤ n ≤ N_o}. This set can be a set of bounding boxes generated by applying conventional object detectors to a given image, but only the top-N_o-scored candidate objects are used among the outputs of the object-o detector. x_{o,n} = (b_{o,n}, h_{o,n}) is the location and height of the corresponding candidate object, where b_{o,n} is the y-coordinate of the candidate object's center and h_{o,n} is the height of the candidate object. We do not consider the x-coordinate and the width of a candidate object because these values are not informative [14]. Thus, the words "scale" and "height" are used interchangeably. y_{o,n} is a score that represents the similarity between the appearance of the corresponding candidate object and that of the object category o. y_{o,n} can be calculated by applying an object detector (such as [16]) to the corresponding candidate object.
We transform the location of the candidate object from the image coordinate frame into the camera coordinate frame to exploit the position relationships and the scale information between the objects. For this purpose, we apply Hoiem's "undo projectivity" technique [17], assuming that all instances of the same object category o have the same physical height. The object's location in the image coordinate frame is transformed into the camera coordinate frame by the following triangulation [1]:

G_{o,n} = (G^y_{o,n}, G^z_{o,n}) = ((b_{o,n} / h_{o,n}) H_o, (f / h_{o,n}) H_o).

The coordinates are calculated based on the center location b_{o,n} of the candidate object, the candidate object's height h_{o,n} ∈ [h_1, h_2], the hand-crafted physical height of the objects H_o ∈ [H_1, H_2], and the focal length f. This paper assumes that the height of the images is normalized to one, the principal axis passes through the center of the image, f = 1, h_1 = 0.1, h_2 = 1, H_1 = 0.1, and H_2 = 30. Furthermore, with a normalization of the location space to S = [-0.5, 0.5] × [0, 1], we restate the location of the corresponding candidate object as x_{o,n} = f_norm(G_{o,n}), which represents both position and scale information.
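The triangulation above can be transcribed directly; the helper name is ours, and the default focal length follows the paper's assumptions (f = 1, image height normalized to one).

```python
def undo_projectivity(b, h, H_o, f=1.0):
    """Back-project a candidate detection into the camera frame,
    assuming all instances of category o share physical height H_o.
    b: y-coordinate of the box centre (image height normalised to 1),
    h: box height in the same units, f: focal length.
    Returns (G_y, G_z) = ((b / h) * H_o, (f / h) * H_o)."""
    return (b / h) * H_o, (f / h) * H_o
```

For instance, halving the detected height h of a same-category candidate doubles its inferred depth G_z, which is how scale violations become visible in the second component of x_{o,n}.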
This paper assumes that all instances of an object category o have the same physical height. Thus, a relatively small object instance in the image also has the same height as the normal objects in the camera coordinate frame. Even if the small object is difficult to identify based on its image height, the value of the second component of the small object's x_{o,n} is larger than that of the normal ones. Thus, the abnormal object can be identified using the second component of x_{o,n}.
This paper employs the exact sense of "objects" used by [18]. Therefore, we use "car" object, "person" object, or "sky" stuff. However, "object" and "stuff" are both called "object" if no confusion exists.
where the first term is the location model, the second is the appearance model, and the last term is the prior distribution over the latent variables.

The marginal location model ∫ p(K_{o,n}, x_{o,n} | s, l_{o,n}, T) dK_{o,n} can be analytically represented. Because p(x_{o,n} | l_{o,n}, K_{o,n}) is defined as N(x_{o,n} | K_{o,n}, Σ_o) when the candidate object is correctly detected as a normal object, l_{o,n} = 1, the marginal location model is also a Gaussian, with mean R μ_{s,o,1} + τ and covariance R Σ_{s,o,1} Rᵀ + Σ_o, where R and τ denote the rotation and translation induced by T. Moreover, if l_{o,n} = 0 and x_{o,n} is independent of K_{o,n} and follows a uniform distribution, then the marginal location model also follows a uniform distribution. The appearance model p(y_{o,n} | c_{o,n}) is adopted from the conventional models [2, 1]. The p(y_{o,n} | c_{o,n}) is indirectly defined by Bayes' theorem, p(y_{o,n} | c_{o,n}) = p(c_{o,n} | y_{o,n}) p(y_{o,n}) / p(c_{o,n}), where p(c_{o,n} | y_{o,n}) is defined as a logistic regression model σ(θ_{o,c_{o,n}} · [1, y_{o,n}]). The scene distribution is defined as p(s) ∼ Mult(π), and the transformation distribution p(T) is defined as a bivariate Gaussian distribution with no correlation between its two components. The distribution over correctness of appearance p(c_{o,n} | l_{o,n}) is modeled by a Bernoulli distribution with parameter φ_{o,l_{o,n}}, and the location distribution is p(l_{o,n}) ∼ Bern(ρ_o).
4 Parameter Learning
This paper explicitly estimates T to convert all random variables into observed ones for learning Canonical Scenes. Given the scene/object-level annotated datasets [20, 21], we separate the data into D = {D_s} and use D_s to build the Canonical Scene for scene category s. The distributions of L_{o,n,1} are estimated by the maximum a posteriori criterion by considering the relation T(L_{o,n,1}) = K_{o,n,1}. Assuming that all objects in the images are located on the ground plane, Algorithm 1 estimates T, which transforms objects on the ground plane to the slanted plane in the camera coordinate frame. In this estimate, only the locations of objects, not those of stuff, are used. Moreover, the distribution of object o's normal location L_{o,n,1}, when o does not exist in D_s, is set to N(0, 1).

Learning the logistic regression model p(c_{o,n} | y_{o,n}) = σ(θ_{o,c_{o,n}} · [1, y_{o,n}]) in the appearance model may seem trivial. However, when data classes are imbalanced,
Because the integral in eq. (3) cannot be analytically solved and is computationally expensive, we approximate its value by assuming that most of the mass of the distribution over T is concentrated at a single point. This assumption seems valid because pictures are commonly taken in the conventional way. Therefore, ∫ p(s, l, c, T | x, y) dT is approximated as p(s, l, c, T̂ | x, y), where T̂ is estimated by solving a non-linear optimization problem using the gradient ascent method: T̂ = arg max_T p(T) ∫ p(K, x | s, l, c, T) dK. Because T follows a bivariate Gaussian distribution, the mean of T is a proper initial point for the gradient method.
This paper adopts a sampling method called Pop-MCMC [8] to solve the optimization problem (3). Pop-MCMC generates multiple samples, also called chromosomes, from multiple Markov chains in parallel. This method can make global moves because it exchanges information between samples, thus leading to a higher mixing rate compared with conventional single-chain MCMC samplers. When applied to an optimization problem, it is possible to escape from local optima via the chromosomes of Pop-MCMC. Therefore, optimization by means of Pop-MCMC is efficient when the objective function is multimodal and/or high-dimensional, as is the case for the proposed model as well as for MRF models for stereo problems [23].
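As a sketch of the idea, here is a generic population MCMC in the spirit of [8]: parallel chains at different temperatures with occasional exchange (swap) moves. The temperatures, proposal step, and the use of the sampler for maximization are illustrative assumptions, not the authors' exact sampler.

```python
import math
import random

def pop_mcmc(log_p, init, n_iter=2000, temps=(1.0, 0.5, 0.25), step=0.5):
    """Population MCMC sketch: several chains run in parallel at
    different temperatures; exchange moves swap states between
    adjacent chains, helping the cold chain escape local optima.
    Returns the best state seen by the cold (temperature 1) chain."""
    chains = [list(init) for _ in temps]
    best, best_lp = list(init), log_p(init)
    for _ in range(n_iter):
        for c, t in enumerate(temps):            # mutation (MH) moves
            prop = [x + random.gauss(0.0, step) for x in chains[c]]
            if math.log(random.random()) < t * (log_p(prop) - log_p(chains[c])):
                chains[c] = prop
        i = random.randrange(len(temps) - 1)     # exchange move
        j = i + 1
        a = (temps[i] - temps[j]) * (log_p(chains[j]) - log_p(chains[i]))
        if math.log(random.random()) < a:
            chains[i], chains[j] = chains[j], chains[i]
        lp = log_p(chains[0])
        if lp > best_lp:
            best, best_lp = list(chains[0]), lp
    return best
```

The hot chains explore broadly; the swap acceptance (temps[i] - temps[j]) * Δlog p is the standard parallel-tempering criterion that lets a good state migrate to the cold chain.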
Note that detections should be scored for ranking and for drawing precision-recall curves to measure object detection results. The natural scores for the detections are the posterior marginals [1] or the log-odds ratio [4]. The log-odds ratio
can quantify the oddness of realizations of random variables. In this paper, the log-odds ratio represents how abnormal a candidate object is. The log-odds ratio of object category o's n-th candidate object is calculated as follows:

m((l_{o,n}, c_{o,n}) = (0, 1)) = log [ p((l_{o,n}, c_{o,n}) = (0, 1) | x, y) / p((l_{o,n}, c_{o,n}) ≠ (0, 1) | x, y) ].

This log-odds ratio can be approximated as in [4] and calculated by modifying the MAP inference of eq. (3).
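Given the (approximate) posterior probability of the abnormal assignment, the score itself is a one-liner; the function name is ours.

```python
import math

def abnormality_score(p_abnormal):
    """Log-odds of the abnormal assignment (l, c) = (0, 1):
    m = log(p / (1 - p)), where p = p((l, c) = (0, 1) | x, y).
    Larger scores indicate stronger abnormality; p = 0.5 maps to 0."""
    return math.log(p_abnormal / (1.0 - p_abnormal))
```

Because the log-odds is monotone in p, ranking detections by it is equivalent to ranking by the posterior itself, but the scores spread out near 0 and 1, which is convenient for thresholding.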
6 Evaluation
6.1 Abnormal Dataset and Experimental Setup
The dataset should be separated into each type of abnormal object to allow a thorough evaluation of the accuracy of abnormal object detection models. The new dataset¹ consists of three different types of images (50 images each, for a total of 150 images containing annotated objects that violate co-occurrence, relative position, and relative scale) and additional extra images that contain two or more different types of abnormal objects. The abnormal objects are annotated relative to normal objects contained in the normal dataset. For example, a conventional normal dataset, such as LabelMe, consists of cars on the road, and so a flying car is annotated as an abnormal object. Likewise, a person who is taller than common adults is also annotated. Even though an abnormal dataset has already been established [1, 5], that set is unsuitable for a thorough evaluation of all three types of abnormal objects because it has not been separated into types.
Two datasets, one normal and one abnormal, are used for training and testing. The SUN dataset [20, 1] is used as the first dataset; it has scene-level annotations as well as many annotated objects in a single image. Therefore, SUN is a proper dataset for training and testing the relationships among objects. The second dataset is the proposed abnormal dataset. The object detector models [16] and the parameters of the proposed model are trained on randomly separated images from the SUN dataset. Evaluation is conducted on the test set of the SUN dataset and the entire abnormal dataset. Two scene categories and ten object categories² are used to train and test the proposed model. The parameters for the proposed model are set to t_1 = 0.02 and t_2 = 0.004, with fixed values for the remaining hyperparameters. Note that optimizing eq. (3) takes about two minutes on an Intel Quad Core 3.3 GHz PC platform with 8 GB of memory.

We choose a hybrid of the co-occurrence contextual (CO) and support contextual (SUP) models [5] as the baseline method for abnormal object detection, because our proposed model simultaneously detects abnormal objects that violate co-occurrence context and position/scale context. The number of objects used in the baseline is 10+2, including support objects such as floor and road. These support objects, unnecessary for the proposed model, are positively necessary for the baseline method because of its dependency on supporting objects. This baseline method is quantitatively (Fig. 3) and qualitatively (Fig. 4) compared with the proposed method.
¹ L. Wei [24] holds the copyright on several abnormal images.
² indoor, outdoor / bed, bottle, building, car, monitor, person, sky, sofa, toilet, tree.
[Fig. 3: precision-recall curves for the proposed Canonical Scene model (CS) and the CO+SUP baseline on the three abnormality types and the mixed set, plus average precision (%) at varying abnormal image ratios:

Abnormal image ratio (%) | CS | CO+SUP
100 | 15.21 ± 0.91 | 9.71 ± 0.05
10 | 5.64 ± 0.91 | 2.53 ± 0.19
5 | 7.32 ± 2.50 | 3.38 ± 0.11]
6.2 Results
Precision-recall curves and average precision measures are used to evaluate abnormal object detection accuracy. These are conventional measures [25] for object detection tasks.
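As a concrete reference, a minimal non-interpolated average-precision computation looks as follows; note that the PASCAL VOC protocol [25] uses a slightly different interpolated variant.

```python
def average_precision(scores, labels):
    """Average precision from detection scores and 0/1 ground-truth
    labels: rank detections by score, then average the precision
    observed at each correctly retrieved (positive) detection."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            tp += 1
            precisions.append(tp / rank)
    return sum(precisions) / max(1, sum(labels))
```

For example, scores [0.9, 0.8, 0.7, 0.6] with labels [1, 0, 1, 0] give precisions 1 and 2/3 at the two positives, so AP = (1 + 2/3) / 2 ≈ 0.833.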
The average precision in Fig. 3(e) shows that the proposed method outperforms the baseline method on both the abnormal and mixed datasets. Our method is robust on all three types of dataset, particularly on the position dataset (Fig. 3(b)), as shown by the analysis of the precision-recall curves (Figs. 3(a) to 3(c)). The reason is that the proposed Canonical Scene method is more robust at separating floating objects than the baseline method. Fig. 3(d) shows that our method also outperforms the baseline on the combination of each type of dataset and the extra abnormal images. Fig. 3(e) shows average precision scores on 150 images with varying ratios of abnormal and normal images. If we assume that a general image set is composed of 5% abnormal images, then this experiment verifies the robustness of the proposed method on a general image set. The proposed method may also be superior on the general image set based on the experiment
Fig. 4. Qualitative comparisons. The results in the top and middle rows are generated using the baseline method and the proposed method, respectively. In each image, the top-three abnormal objects are represented by bounding boxes. In the third row, the distribution of object locations in the Canonical Scene is represented with estimated abnormalities.
results. Note that Fig. 3(c) shows that the proposed method is weak at classifying abnormal objects with abnormally large or small sizes. Improperly learned Canonical Scenes may cause this weakness.
The qualitative comparisons and results are shown in Figs. 4 and 5. Based on the last row of Fig. 4, we can conclude that the proposed method exploits the rich relations among objects and instances of objects. For example, the second figure in the last row represents the distribution of a normal "person" in an outdoor scene. The learned normal distribution of "person" is transformed to cover the person on the road. Therefore, the transformed distribution, or Canonical Scene, cannot cover the floating person, thus making it possible to classify the floating person as an abnormal object. Fig. 5 illustrates the positive and negative results of the proposed method. The negative results are due to the improperly learned Canonical Scene, the experimentally set parameters, and the weakness of the base object detectors.
Fig. 5. Qualitative results. The images in the first, second and third columns consist of co-occurrence-violating, relative position-violating, and relative scale-violating images, respectively. In each image, the top-five abnormal objects are represented by bounding boxes.
7 Conclusion
We proposed a generative model for abnormal object detection. The model mainly exploits the rich and quantitative relations among objects and object instances via latent variables. Because of these considerations, our model outperformed the state-of-the-art method for abnormal object detection. In addition, we were able to thoroughly analyze the accuracy of the proposed method for different types of abnormality by classifying the evaluation dataset into three types. The analysis revealed that our model is strong at detecting abnormal co-occurrence and position, but not as effective at detecting scale-violating objects. We expect that accuracy can be increased by fully learning the proposed model.
References
1. Choi, M., Lim, J., Torralba, A., Willsky, A.: Exploiting hierarchical context on a large database of object categories. In: Proc. of IEEE Conf. on CVPR (2010)
2. Murphy, K., Torralba, A., Freeman, W.: Using the forest to see the trees: a graphical model relating features, objects and scenes. In: NIPS (2003)
3. Torralba, A., Murphy, K., Freeman, W.: Contextual models for object detection using boosted random fields. In: NIPS (2005)
4. Desai, C., Ramanan, D., Fowlkes, C.: Discriminative models for multi-class object layout. In: Proc. of ICCV (2009)
5. Choi, M., Torralba, A., Willsky, A.: Context models and out-of-context objects. Pattern Recognition Letters 33, 853–862 (2012)
6. Heitz, G., Koller, D.: Learning Spatial Context: Using Stuff to Find Things. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 30–43. Springer, Heidelberg (2008)
7. Galleguillos, C., Rabinovich, A., Belongie, S.: Object categorization using co-occurrence, location and appearance. In: Proc. of IEEE Conf. on CVPR (2008)
8. Jasra, A., Stephens, D., Holmes, C.: On population-based simulation for static inference. Statistics and Computing 17, 263–279 (2007)
9. Rabinovich, A., Vedaldi, A., Galleguillos, C., Wiewiora, E., Belongie, S.: Objects in context. In: Proc. of ICCV (2007)
10. Li, L., Socher, R., Fei-Fei, L.: Towards total scene understanding: Classification, annotation and segmentation in an automatic framework. In: Proc. of IEEE Conf. on CVPR (2009)
11. Ladicky, L., Russell, C., Kohli, P., Torr, P.H.S.: Graph Cut Based Inference with Co-occurrence Statistics. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part V. LNCS, vol. 6315, pp. 239–253. Springer, Heidelberg (2010)
12. Galleguillos, C., Belongie, S.: Context based object categorization: A critical survey. Computer Vision and Image Understanding 114, 712–722 (2010)
13. Russell, B., Torralba, A., Liu, C., Fergus, R., Freeman, W.: Object recognition by scene alignment. In: NIPS (2007)
14. Torralba, A.: Contextual priming for object detection. IJCV 53, 169–191 (2003)
15. Li, L., Su, H., Xing, E., Fei-Fei, L.: Object bank: A high-level image representation for scene classification and semantic feature sparsification. In: NIPS (2010)
16. Felzenszwalb, P., Girshick, R., McAllester, D., Ramanan, D.: Object detection with discriminatively trained part-based models. IEEE Trans. on PAMI 32, 1627–1645 (2010)
17. Hoiem, D., Efros, A., Hebert, M.: Putting objects in perspective. In: Proc. of IEEE Conf. on CVPR (2006)
18. Alexe, B., Deselaers, T., Ferrari, V.: What is an object? In: Proc. of IEEE Conf. on CVPR (2010)
19. Olson, C.: Maximum-likelihood image matching. IEEE Trans. on PAMI 24, 853–857 (2002)
20. Xiao, J., Hays, J., Ehinger, K., Oliva, A., Torralba, A.: SUN database: Large-scale scene recognition from abbey to zoo. In: Proc. of IEEE Conf. on CVPR (2010)
21. Russell, B., Torralba, A., Murphy, K., Freeman, W.: LabelMe: a database and web-based tool for image annotation. IJCV 77, 157–173 (2008)
22. Van Hulse, J., Khoshgoftaar, T., Napolitano, A.: Experimental perspectives on learning from imbalanced data. In: ICML (2007)
23. Kim, W., Park, J., Lee, K.: Stereo matching using population-based MCMC. IJCV 83, 195–209 (2009)
24. Wei, L. (2012), http://www.liweiart.com
25. Everingham, M., Gool, L., Williams, C., Winn, J., Zisserman, A.: The PASCAL visual object classes (VOC) challenge. IJCV 88, 303–338 (2010)
Shapecollage: Occlusion-Aware, Example-Based
Shape Interpretation
1 Introduction
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 665–678, 2012.
© Springer-Verlag Berlin Heidelberg 2012
666 F. Cole et al.
[Fig. 1 panel labels: Original shape; Input; Inferred normals.]
Fig. 1. A line drawing and diffuse rendering of a 3D shape (left) are interpreted as normal maps by our system (right). Each rendering produces a different plausible interpretation. Our system uses compatibility between local depth layers to interpret self-occlusion (blue box) and near-occlusion (red box).
shape collage, coupled with a new patch representation and compatibility measurement between patches that addresses the combinatorial explosion at occlusion boundaries and generalizes patches across scale, rotation, and translation. Given compatibility between patches, we use belief propagation to produce the most likely collage and then fit a smooth surface to produce the final shape interpretation (Figure 1).
We demonstrate our approach on synthetic renderings of blobby shapes in a range of rendering styles, including line drawings and glossy, textured images.

Interpreting line drawings of curved shapes has been a long-standing unsolved problem in computer vision [6, 7]. Much progress has been made for drawings and wire-frame representations of 3D shapes with flat facets (e.g., [8]), but progress on the interpretation of generic line drawings has stalled. Interpretation of shaded renderings has been addressed more frequently in the computer vision community, but is still not a solved problem [2].
Our contributions are in the design of the shape patches and the measurement
of compatibility between them, the irregular, multi-scale relationships between
patches, and the selection of a diverse set of shape candidates for each patch.
Besides shape, our representation can also estimate factors such as the visual
style of the input.
over time been divided into subproblems such as shape-from-shading (see [2] for a recent survey), shape-from-line-drawing (e.g., [6–8]), and shape-from-texture (e.g., [3]). These parametric methods have difficulty capturing many of the cases that occur in the visual world, while learning-based methods can successfully capture such visual complexity. Saxena et al. [5] use a learning-based approach to find a parametric model of a 3D scene and also explicitly treat occlusion boundaries. While their system can process complex input due to its learned model, their parametric scene description is tailored for boxy shapes such as buildings and does not contain features other than shape. By contrast, our rich, example-based model extends to smooth shapes and can predict visual style as well as shape.
For line drawing recognition, Saund [9] applied belief propagation in a Markov
Random Field (MRF) to label contours and edges of pre-parsed sketch images,
although without any explicit representation of shape. Ouyang and Davis [10]
combined learned local evidence for symbol sketches with MRF inference for sketch recognition, focusing on chemical sketches.
Our work is a spiritual successor to the work of Freeman et al. [11] on learning low-level vision. That system uses a network of multi-scale patches and a set
of candidates at each patch to infer the most likely explanation for the stimulus
image. Hassner and Basri [4] also use a patch-based approach to reconstruct
depth, while adding the ability to synthesize new exemplar images on-the-y.
Unlike both systems, we use an irregular network of patches centered on interest
points in the image, and directly tackle the problem of occlusion boundaries.
Because our training data is synthetic we could also create new exemplars on-
the-y, but have not explored that option in this work.
Example-based image priors were also used by Fitzgibbon et al. [12] in the
context of image interpolation between measurements from multiple cameras.
The benefits of learning from large, synthetically generated datasets were shown
emphatically in the recent work of Shotton et al. [13].
2 Overview
Our approach finds interest points in the image, selects candidate interpretations
for each point, then defines a Markov random field tailored to those interest points
and performs inference to determine the most likely shape configuration (Figure 2).
Shape is represented by patches that contain the rendered appearance of the
patch, a normal map, a depth map, a map of occluding contours, and an owner-
ship mask that defines for which pixels the patch provides a shape explanation
(Figure 3). Patches are placed and oriented using image keypoints computed
at multiple scales. As described in Section 3, we found that current approaches
for detecting keypoints do not provide a good set of points for stimuli such as
line drawings, and designed our own method based on disk sampling of an in-
terest map. We propose a simple set of multi-scale graph relationships for these
keypoints.
The local appearance at each keypoint determines the selection of candidate
patches (Section 4). There may be thousands of possible matches for the more
668 F. Cole et al.
[Figure 2: pipeline diagram. Panel titles: "1. Find keypoints across scale", "3. Select candidates by local appearance", "5. Infer most likely patches"; labels: Appearance, Normals, Fit Surface, Original Shape.]
Fig. 2. The major steps of our approach. First, keypoints are extracted at multiple
scales (1), then neighboring keypoints in both image space and scale are connected
(2). Separately, a set of candidate shapes is selected for each keypoint based on local
appearance (3), and the compatibility of each candidate is evaluated against the can-
didates at each connected keypoint (4). Loopy belief propagation is used to find the
most likely candidates, which form a shape collage (5). Finally, a smooth surface is fit
to the collage (6).
Fig. 3. Our shape patch representation. Given a square, oriented training patch (left,
red box) the normal map is extracted and rotated to match the keypoint orientation.
The depth map and occluding contours are also stored along with an ownership mask.
common patches, far more than our inference algorithm can support. It is critical
to choose a diverse-enough subset of candidates, for which we use similarity of
local shape descriptors.
The patch candidates are scored based on similarity of local appearance, nor-
mal and depth layer compatibility with adjacent patches, and prior probability
(Section 5). The most likely shape collage given these scores is found using loopy
belief propagation and a thin-plate spline surface is fit to this shape collage.
Fig. 4. Keypoint detection at three scales (blur at 16, 32, 64 pixels, image: 300² pixels).
Left of pair: level of interest in the image, computed by the trace of the structure
tensor. Right of pair: original image with keypoints overlaid. Keypoints are placed to
cut out the most interesting pixel until a threshold is reached.
Our criteria for a keypoint detector are: coverage, all the interesting parts of
the image should be covered by at least one keypoint; sparsity, keypoints should
not be redundant; and repeatability, similar image patches should have a similar
distribution of keypoints.
Standard methods such as Harris corner detection [14] and SIFT [15] are not
directly suitable for our needs. Corner detection, for example, fails the coverage
criterion, since we want keypoints along all linear features in addition to corners.
The SIFT keypoint detector also fails the coverage criterion: some interesting
image features do not correspond to maxima of the blob detector in scale space,
and so do not produce a SIFT keypoint.
By contrast, a naive method such as a regular or randomized grid fails the
repeatability criterion. Repeatability is important for two reasons: it reduces the
number of required training examples, and increases the effectiveness of local
appearance matching (Section 5) by aligning image and patch features.
3.1 Detection
Our approach to keypoint detection is to compute an interest map over the
image, then iteratively place keypoints at the highest point of this map that is
not already near another keypoint. Intuitively, this approach uses a cookie-cutter
to greedily stamp out keypoints at the most interesting points in the image.
The interest map I(p) is defined as the sum of the eigenvalues (i.e., the trace)
of the 2x2 structure tensor of the test image. Prior to computing the structure
tensor, the image is blurred to the desired level of scale. For a keypoint of radius
r, the blur is defined as a Gaussian of σ = r/3.
The stamp shape for a keypoint at p is a radially symmetric function of the
distance s from p, with a smooth rolloff:

    stamp_p(s) = { I(p)                : s < 0.25r
                 { I(p) G(s − 0.25r)   : s > 0.25r        (1)
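The detection scheme above (structure-tensor interest map at a blur of σ = r/3, then greedy "cookie-cutter" suppression with the stamp of Eq. 1) can be sketched in NumPy. This is a minimal interpretation, not the authors' MATLAB code: the function names are ours, and we assume the rolloff G is a Gaussian with the same σ = r/3, which the text does not specify.

```python
import numpy as np

def gaussian_blur(img, sigma):
    # Separable 1-D Gaussian convolution (same-size output).
    r = max(1, int(3 * sigma))
    x = np.arange(-r, r + 1)
    k = np.exp(-x**2 / (2 * sigma**2)); k /= k.sum()
    img = np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 0, img)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 1, img)

def interest_map(img, radius):
    # Blur to the keypoint's scale (sigma = r/3), then take the trace of
    # the 2x2 structure tensor, Ix^2 + Iy^2. Smoothing the squared
    # gradients to aggregate the tensor locally is our assumption.
    sigma = radius / 3.0
    blurred = gaussian_blur(img.astype(float), sigma)
    Iy, Ix = np.gradient(blurred)
    return gaussian_blur(Ix * Ix + Iy * Iy, sigma)

def greedy_keypoints(interest, radius, threshold):
    # Repeatedly take the most interesting pixel, then "stamp out" a disk
    # around it with the smooth rolloff of Eq. 1, until the map falls
    # below a threshold.
    I = interest.copy()
    sigma = radius / 3.0
    ys, xs = np.mgrid[0:I.shape[0], 0:I.shape[1]]
    pts = []
    while I.max() > threshold:
        p = np.unravel_index(np.argmax(I), I.shape)
        pts.append(p)
        s = np.hypot(ys - p[0], xs - p[1])
        stamp = np.where(s < 0.25 * radius, I[p],
                         I[p] * np.exp(-(s - 0.25 * radius)**2 / (2 * sigma**2)))
        I = np.maximum(I - stamp, 0.0)
    return pts
```

Because the stamp equals I(p) at the peak itself, each iteration zeroes at least the current maximum, so the loop always terminates.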
Fig. 5. Depth layer labeling. Left: test image with challenging areas outlined. Middle:
contours of the test image (black) define regions in the overlap area and a graph of
connections between them. Right: the depth map of each candidate shape is used to
assign an ordinal layer to each overlap region. The compatibility of the candidates is
the compatibility of the graph labelings weighted by the size of the overlap regions.
In (a), the two shapes are compatible because the overlap graph allows for a gap; (b),
the two shapes are incompatible because they compete to label both sides of the single
contour; (c), the large scale patch has a single depth layer in the entire patch, but two
layers inside the overlap region; (d), at the T-junction the two labelings are slightly
incompatible, but the incompatible labeling has small weight.
to explain two nearby contours. Simultaneously, two separate patches should not
claim opposite sides of the same contour (Figure 5b). The imperfect alignment
of shape patches to image features makes exact matching of pixels unreliable.
Instead, we propose a scoring mechanism based on the regions of the test image
inside the overlap area (Figure 5, middle and right).
and divided into regions Ri by the patch's occluding contours. The depth values
inside each Ri are averaged, then the average values are sorted. The index j of
Rj in the sorted list is the local depth layer. Each Oi takes the mean of the local
depth layer pixels inside its region, rounded to the nearest integer.
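The ordinal-layer computation just described (average the depth inside each region, sort the averages, and use the rank as the layer) can be sketched directly. Function and variable names are ours; we assume a depth map and an integer region label per pixel.

```python
import numpy as np

def ordinal_layers(depth, regions):
    # depth: per-pixel depth map; regions: integer region label per pixel
    # (regions cut by the patch's occluding contours).
    # The depth values inside each region are averaged, the averages are
    # sorted, and each region's index in the sorted order is its ordinal
    # depth layer.
    ids = np.unique(regions)
    means = {i: depth[regions == i].mean() for i in ids}
    order = sorted(ids, key=lambda i: means[i])
    rank = {i: j for j, i in enumerate(order)}
    return {int(i): rank[i] for i in ids}
```

The ordinal rank deliberately discards absolute depth, which is what makes candidates from different exemplars comparable.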
for the two-element parameter vector x in the least-squares sense. The best fit
labels a are found by multiplying through with x. The score of the fitting is then

    S_ab = Σ_{i=1..n} |a_i − b_i| w_i        (6)

The final compatibility score is the max of the fitting scores in both directions.
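This fitting-and-scoring step can be sketched as follows, under our reading that the two-element parameter vector is an affine map (scale and offset) taking one patch's ordinal layers onto the other's, with the weighted residual of Eq. 6 and the max over both fit directions. Function names are ours.

```python
import numpy as np

def fit_score(a, b, w):
    # Fit an affine map (scale, offset) taking layer labels b onto layer
    # labels a in the least-squares sense, then sum the weighted absolute
    # residuals (Eq. 6).
    A = np.stack([b, np.ones_like(b)], axis=1).astype(float)
    x, *_ = np.linalg.lstsq(A, np.asarray(a, dtype=float), rcond=None)
    return float(np.sum(np.abs(a - A @ x) * w))

def compatibility(a, b, w):
    # The final score is the worse (max) of fitting in both directions.
    return max(fit_score(a, b, w), fit_score(b, a, w))
```

Fitting an affine map rather than requiring equal labels lets two candidates agree up to a relabeling of their ordinal layers.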
This prior estimate is simple but considerably improves the quality of results,
especially for ambiguous stimuli such as line drawings.
Fig. 6. Examples of normal map estimation for varying rendering styles. Top: ground
truth normal map. (a,b,c): line drawing, diffuse and glossy shading with solid material,
with interpreted normal map. (d): same as (c) but with texture only, diffuse and glossy
shading with texture. Bottom: normal interpolation from shape boundaries.
Given a full set of candidate patches and likelihood scores between them, finding
the most probable shape collage is a straightforward MRF inference task. We
use min-sum loopy belief propagation for this purpose.
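The min-sum loopy BP step can be illustrated with a generic implementation over candidate labels. This is a sketch, not the authors' MATLAB code; the message schedule, iteration count, and damping are our choices, left unspecified in the text.

```python
import numpy as np

def min_sum_bp(unary, pairwise, iters=30):
    # unary: dict node -> cost vector over that node's candidates.
    # pairwise: dict (i, j) -> cost matrix C[a, b] for label a at i and
    #           label b at j (store each undirected edge once).
    # Returns the minimum-cost candidate index per node.
    edges = list(pairwise)
    msgs = {}
    for i, j in edges:
        msgs[(i, j)] = np.zeros(len(unary[j]))
        msgs[(j, i)] = np.zeros(len(unary[i]))
    for _ in range(iters):
        for i, j in edges + [(j, i) for i, j in edges]:
            # Orient the pairwise cost matrix so rows index node i.
            C = np.asarray(pairwise[(i, j)]) if (i, j) in pairwise \
                else np.asarray(pairwise[(j, i)]).T
            # Sum all messages into i except the one coming back from j.
            inc = unary[i] + sum(m for (k, t), m in msgs.items()
                                 if t == i and k != j)
            msgs[(i, j)] = np.min(inc[:, None] + C, axis=0)
    labels = {}
    for i in unary:
        belief = unary[i] + sum(m for (k, t), m in msgs.items() if t == i)
        labels[i] = int(np.argmin(belief))
    return labels
```

On the irregular keypoint graph the result is approximate, but min-sum BP is exact on trees and works well in practice on sparse loopy graphs.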
To produce a complete surface from the most likely shape collage, we fit a
thin-plate spline to the most likely patches, taking care to allow the spline to split
at occluding contours. An example collage and fit surface is shown in Figure 2.
To measure and visualize our results, we mask out background pixels using
ground truth. Additional details of our inference and fitting steps may be found
in [16].
The principal goal of our method is to estimate a normal map for the stimulus
image. Figure 6 shows example normal maps computed by our system. Table 1
shows the average errors for our test set against ground truth.
For a baseline comparison we propose the following simple method, similar to
that proposed by Tappen [19]: clamp the normals to ground truth values at the
occluding contour, and smoothly interpolate the normals in the interior of the
shape. This method produces smooth blobs (Figure 6, bottom). Note that while
the interpolation method is very simple, it is still given ground truth normals at
shape boundaries, whereas our method is not.
Overall our method produces accurate interpretations of the stimuli, with average
angular error between 20° and 26° for the shaded and textured stimuli, and
32° for line drawings (Table 1). In some cases, particularly the line drawings, the
input stimulus is ambiguous, and our system proposes a plausible interpretation
that is different from the original shape (e.g., Figure 6c). We also tested with
training sets containing only patches of the same style as the input and found a
small improvement in accuracy.
The running time for a single stimulus is approximately 20 minutes on a
modern, multicore PC. The implementation is written entirely in MATLAB.
The inferred patches can also be used trivially to approximate the stimulus image
and thus make an estimate of the rendering style (Figure 7). We simply take the
original appearance of each training patch and average it with the appearance
of its neighbors. To make an estimate of the stimulus style, we simply find the
style used by the majority of the selected shape patches. In our experiments the
styles are sufficiently distinct that this estimate is very accurate.
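The majority-vote style estimate reduces to a few lines. We assume each selected patch records the rendering style of its training exemplar; the `"style"` field name is hypothetical.

```python
from collections import Counter

def estimate_style(selected_patches):
    # Each inferred patch carries the rendering style of its training
    # exemplar; the stimulus style is the majority vote over patches.
    styles = [p["style"] for p in selected_patches]
    return Counter(styles).most_common(1)[0][0]
```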
Table 1. Average per-image RMS error in degrees for normal estimates for each style,
trained with all styles and only with the target style. Boundary interpolation does not
vary between styles. Front-facing error is deflection from a front-facing normal.

                                 line  diffuse  glossy  only tex.  tex.+diff.  tex.+gloss  all
Training set w/ all styles         32       20      26         23          21          23   24
Train. set w/ only target style    30       20      25         23          21          23   24
Boundary interpolation              -        -       -          -           -           -   39
Front-facing                        -        -       -          -           -           -   46
Shapecollage: Occlusion-Aware, Example-Based Shape Interpretation 677
Fig. 7. Reconstructing the stimulus from patch appearance. Reconstructed patches are
usually faithful to the original rendering, though they are sometimes confused (boxed). The
set of styles used for reconstruction gives an estimate of the stimulus style.
6.5 Discussion
So far we have experimented with synthetic datasets. The renderings were chosen
to span a range of styles that challenge parametric interpretation systems.
Suggestive contours, for example, have been shown to mimic lines drawn by
artists [20], but are often disconnected and noisy. Disconnected lines violate the
major assumption of shape-from-line systems (e.g., [6]) that the line drawing
be a complete, connected graph. The glossy, textured stimuli violate the Lambertian
assumptions of even recent work on shape-from-shading (e.g., [21]). Our
system can interpret all six styles with no modification and a single training set.
There are several limitations to the current method that must be overcome
in order to process real-world stimuli. Most importantly, we need a rich train-
ing set of photographs and associated 3D shape. Such data may become more
commonplace as 3D acquisition hardware becomes more robust and easy to use.
Also, realistic computer graphics renderings may provide an accurate enough
approximation of real photographs to provide training data (similar to [22]).
Some aspects of the algorithm itself would also require extension. The appear-
ance matching step (Section 4.1), for example, matches based on the appearance
of the entire patch. This approach works well when the background is empty,
such as our stimuli, but can become confused by noise or distracting elements.
An extended method could match appearance based on the ownership masks
of training set patches. The appearance scoring metric (Section 5.4) would also
need to be adapted to more general stimuli.
7 Conclusion
We have presented an approach to infer a normal map from a single image of a 3D
shape. The treatment of occlusion between patches and the irregular, interest-
driven patch placement that we introduce dramatically reduce the complexity
of example-based shape interpretation, allowing our method to interpret images
with multiple layers of depth and self-occlusion using a moderately sized training
set. Because it is data-driven, our method can interpret ambiguous stimuli such
as sparse line drawings and complex stimuli such as glossy, textured surfaces,
and it uses the same machinery in all cases. To our knowledge, this property is
unique among current vision systems.
References
1. Roberts, L.: Machine perception of three-dimensional solids, dspace.mit.edu
(1963)
2. Durou, J., Falcone, M., Sagona, M.: Numerical methods for shape-from-shading: A new
survey with benchmarks. Computer Vision and Image Understanding 109, 22–43 (2008)
3. Malik, J., Rosenholtz, R.: Computing local surface orientation and shape from
texture for curved surfaces. IJCV 23, 149–168 (1997)
4. Hassner, T., Basri, R.: Example based 3d reconstruction from single 2d images. In:
CVPR Workshops: Beyond Patches, pp. 1–8 (2006)
5. Saxena, A., Sun, M., Ng, A.: Make3d: Learning 3d scene structure from a single
still image. PAMI 31, 824–840 (2009)
6. Malik, J.: Interpreting line drawings of curved objects. IJCV (1987)
7. Wang, Y., Chen, Y., Liu, J., Tang, X.: 3d reconstruction of curved objects from
single 2d line drawings. In: CVPR, pp. 1834–1841 (2009)
8. Xue, T., Liu, J., Tang, X.: Symmetric piecewise planar object reconstruction from
a single image. In: CVPR, pp. 2577–2584 (2011)
9. Saund, E.: Logic and mrf circuitry for labeling occluding and thin-line visual con-
tours. In: Neural Information Processing Systems, NIPS (2005)
10. Ouyang, T., Davis, R.: Learning from neighboring strokes: combining appearance
and context for multi-domain sketch recognition. In: NIPS (2009)
11. Freeman, W., Pasztor, E., Carmichael, O.: Learning low-level vision. International
Journal of Computer Vision 40, 25–47 (2000)
12. Fitzgibbon, A.W., Wexler, Y., Zisserman, A.: Image-based rendering using image-
based priors. In: Intl. Conf. Computer Vision, ICCV (2003)
13. Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R.,
Kipman, A., Blake, A.: Real-time human pose recognition in parts from a sin-
gle depth image. In: CVPR (2011)
14. Harris, C., Stephens, M.: A combined corner and edge detector. In: Alvey Vision
Conference, vol. 15, p. 50 (1988)
15. Lowe, D.: Object recognition from local scale-invariant features. In: ICCV (1999)
16. Supplemental Material (2012),
http://people.csail.mit.edu/fcole/shapecollage
17. DeCarlo, D., Finkelstein, A., Rusinkiewicz, S., Santella, A.: Suggestive contours
for conveying shape. ACM Trans. Graph. 22, 848–855 (2003)
18. Hays, J., Efros, A.: Im2gps: estimating geographic information from a single image.
In: CVPR, pp. 1–8 (2008)
19. Tappen, M.: Recovering shape from a single image of a mirrored surface from
curvature constraints. In: CVPR, pp. 2545–2552 (2011)
20. Cole, F., Golovinskiy, A., Limpaecher, A., Barros, H., Finkelstein, A., Funkhouser,
T., Rusinkiewicz, S.: Where do people draw lines? In: SIGGRAPH (2008)
21. Barron, J.T., Malik, J.: High-frequency shape and albedo from shading using nat-
ural image statistics. In: CVPR (2011)
22. Kaneva, B., Torralba, A., Freeman, W.: Evaluation of image features using a
photorealistic virtual world. In: ICCV (2011)
Interactive Facial Feature Localization
1 Introduction
Accurate facial component localization is useful for intelligent editing of pictures
of faces, such as opening the eyes, red eye removal, highlighting the lips, or
making a smile. In addition, it is an essential component for face recognition,
tracking and expression analysis. The problem remains challenging due to pose
and viewpoint variation, expression or occlusion (e.g. with sunglasses, hair or a
beard). While the techniques have improved a lot over the past five years [1–5],
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 679–692, 2012.
© Springer-Verlag Berlin Heidelberg 2012
680 V. Le et al.
we are still far from a reliable and fully automated facial component localization
system. For all practical purposes, an interactive, user-assisted face localization
system is necessary.
Our goal is a facial component localization system that achieves high-fidelity
results with minimal user interaction. Our contributions are as follows:
We show that our system in a fully automated mode outperforms other state-of-the-art
systems on this challenging dataset. Furthermore, we show that the error
decreases significantly after only a minimal number of user corrections.
2 Related Work
Active shape model (ASM) [6] and Active appearance model (AAM) [7] form
classic families of methods for facial feature point detection. Comparing these
two models, ASM has the advantages of being more accurate in point (or contour)
localization, less sensitive to lighting variations, and more efficient, hence is more
suitable for applications requiring accurate contour fitting.
Since the first introduction of the traditional ASM, there have been many extensions
for improving its robustness, performance and efficiency. For example,
[8] employs mixtures of Gaussians for representing the shape, [9] uses Kernel
PCA and SVM and models nonlinear shape changes by 3D rotation, [10] applies
robust least-squares algorithms to match the shapes to observations, [11] uses
more robust texture descriptors to replace the 1D profile model and k-nearest-neighbor
search for profile searching, and [12] relies on Bayesian inference.
Among those extensions to the classical ASM, the recent work of Milborrow and
Nicolls [1], with the introduction of the 2D profile model and a denser point set,
obtained promising results through quantitative evaluation. The recent work [4]
combines global and local models based on an MRF and an iterative fitting scheme,
but this approach is mainly focused on localizing a very sparse set of landmark
points.
Other notable recent works exploring alternatives to ASM include [2]. In [2],
a robust approach to facial feature localization is proposed via a discriminative
search approach combining component detectors with learned directional
classifiers. In [3], a generative model with a shape regularization prior is learned and
used for face alignment to robustly deal with challenging cases such as expression,
occlusion and noise. As an alternative to the parametric model approaches,
a principled optimization strategy with nonparametric representations for
deformable shapes was recently proposed in [13].
A component-based ASM is introduced in [14], which implements a set of
ASMs for facial features and stacks up the PCA parameters into one long vector,
then fits those vectors in a single global model using a Gaussian Process
Latent Variable Model to handle nonlinear distributions. In comparison with our
approach, they do not model the global spatial configuration of facial features and
therefore may generate invalid configurations. In [2] the face components are
found by an AAM-style fitting method and a discriminative classification is used
for predicting the movement of components. In another effort to explore the
configuration of face parts, Bayesian objective functions are combined with local
detectors in [15] for localization of facial parts.
Motivated by the pictorial structure model in [16], we aim to decompose a
face shape into components, and model variations of individual components as
well as the spatial configuration between components. In [16], the face/human
pose inference problem is modeled as a part-based problem formulated as energy
minimization, where the energy (or cost) is expressed as the linear combination
of a unary fitting term and a pair-wise potential term. Following this work,
we represent the global face shape model as a set of inter-related components,
and model their spatial relationships. Unlike [16], where the unknown parameter
space is low dimensional, in our approach we need to estimate component
locations as well as shapes, which is a higher-dimensional space. Therefore, we
adopt PCA to model both the component shape variation and the relative spatial
configuration. Due to the component decomposition, our approach is more
flexible in modeling face shapes for handling larger pose changes and has better
generalization capabilities than the standard ASM. Our approach uses PCA to
model the relative positions between components, hence it is simpler and more
efficient than [4], which relies on expensive MRF inference.
Fig. 2. (a) A failure example of the classic ASM on a face image by STASM [1]. The
bigger left eye constrains the right eye, and the tilted right brow pulled the left brow
off the correct location. (b) A better fitting result using our CompASM.
Profile Model. In the classic ASM [6, 1], during the profile fitting step, at each
of the N landmarks of a given component we consider M candidate locations
along a line orthogonal to the shape boundary. We evaluate the profile model
at each location and pick the one with the highest score. The problem with this
greedy algorithm is that each landmark location is chosen independently of its
neighbors, and neighboring landmarks could end up far from each other.
Fig. 4. An example of profile search by the greedy method of the classic ASM (a) and by our
Viterbi global optimization (b)
We model the distance between neighboring landmark locations with a Poisson
distribution:

    p(x) = λ^x e^{−λ} / Γ(x + 1)        (6)

where x is the Euclidean distance between the two locations and Γ is the gamma
function. We chose the Poisson distribution because it has a suitable PDF and
a single parameter λ, which we fit separately for each component.
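Evaluating this distribution at real-valued distances via the gamma function, together with the maximum-likelihood fit of the single parameter λ, can be sketched as follows (function names are ours):

```python
import math

def log_poisson(x, lam):
    # log p(x) = x log(lam) - lam - log Gamma(x + 1).  Using Gamma(x+1)
    # instead of a factorial lets the Poisson PDF be evaluated at
    # non-integer Euclidean distances.
    return x * math.log(lam) - lam - math.lgamma(x + 1)

def fit_lambda(distances):
    # MLE of the single Poisson parameter: the mean observed spacing,
    # fit separately for each face component.
    return sum(distances) / len(distances)
```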
To find the optimal set of candidate locations we run Viterbi, which maximizes
the sum of the log probabilities. In our experiments section we show that our
joint optimization of the landmarks outperforms the greedy approach in the
traditional ASM algorithm. In addition, our model easily accommodates user
interaction, as discussed in Section 5. An example of the profile search results of
the two methods is depicted in Fig. 4.
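The joint optimization over per-landmark candidate locations is a standard chain Viterbi recursion; a sketch in log-probability form, where the exact layout of the unary (profile) and pairwise (spacing prior) potentials is our assumption:

```python
import numpy as np

def viterbi_landmarks(unary, pairwise):
    # unary[i][m]: log-prob of candidate m at landmark i (profile score).
    # pairwise[i][m][m']: log-prob linking candidate m at landmark i to
    # candidate m' at landmark i+1 (e.g. the Poisson spacing prior).
    n = len(unary)
    score = np.array(unary[0], dtype=float)
    back = []
    for i in range(1, n):
        total = (score[:, None] + np.asarray(pairwise[i - 1], dtype=float)
                 + np.asarray(unary[i], dtype=float)[None, :])
        back.append(np.argmax(total, axis=0))   # best predecessor per candidate
        score = np.max(total, axis=0)
    # Trace the best path back from the final landmark.
    path = [int(np.argmax(score))]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return list(reversed(path))
```

Unlike the greedy per-landmark maximum, this returns the jointly most likely sequence of candidates along the contour.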
Here we describe how we combine our new shape model and profile model into
the facial feature localization algorithm. Given an input image, after initialization
by face detection, the shape model and profile model are applied in an
interleaved, iterative way in a coarse-to-fine manner in the pyramid.
In each particular iteration step, profile search is applied first, which gives
the suggested candidate locations of the landmarks. The landmark candidates
are then regularized by the local shape model. In each pyramid level, the fitting
process is performed by repeating configuration model fitting and local model
fitting until convergence.
The details of our facial feature localization method using CompASM are illustrated
in Algorithm 1.
In this section we describe our algorithm for user-assisted facial feature localization.
Once the automatic fitting is performed, the user is instructed to pick
the landmark with the largest error and move it to the correct location, after
which we adjust the locations of the remaining landmarks to take into account
the user input. This interaction step is called an interaction round. This process
is repeated until the user is satisfied with the results. Our goal is to adjust the
locations of the remaining landmarks so as to minimize the number of interaction
rounds.
Our update procedure consists of two main steps: Linearly scaled movement
followed by a Model-fitting update.
In the Linearly scaled movement step, when the user moves a landmark p_i,
we move the neighboring landmarks on the same contour as p_i along the same
direction. The amount of movement for neighboring landmarks is proportional to
their proximity to p_i. Specifically, we identify landmarks p_a and p_b on both sides
of p_i which define the span of landmarks that will be affected. Each landmark
with index j ∈ (a, i] moves by ((j − a)/(i − a)) · d, where d is the user-specified displacement
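Our reading of this step can be sketched as follows. The weight (j − a)/(i − a) on the a-side of p_i, mirrored symmetrically on the b-side, is our reconstruction from the text; function names are ours.

```python
def scaled_move(points, i, a, b, d):
    # Move landmark i by displacement d; landmarks strictly between the
    # span endpoints a and b move in the same direction, scaled linearly
    # by proximity to i: weight (j - a)/(i - a) on the a-side, and
    # (assumed symmetrically) (b - j)/(b - i) on the b-side.
    # points: list of (x, y); d: (dx, dy).
    out = [list(p) for p in points]
    for j in range(a + 1, b):
        w = (j - a) / (i - a) if j <= i else (b - j) / (b - i)
        out[j][0] += w * d[0]
        out[j][1] += w * d[1]
    return [tuple(p) for p in out]
```

The span endpoints p_a and p_b themselves stay fixed, so the edit blends smoothly into the unedited part of the contour.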
6 Experiments
Our algorithm is designed to allow for high independence and freedom in component
shape so that it can deal with natural photos with higher variations than
studio data. Nevertheless, we need to ensure that it also works adequately
well on standard studio data. Therefore, in our first experiment, we compare
our algorithm with STASM [1, 19] on the standard datasets chosen in
their work. In this experiment, following [19], we train our algorithm
on the MUCT dataset [19] and test on the BioID dataset [20]. MUCT contains
3755 images of 276 subjects taken in a studio environment. BioID consists of 1521
frontal images of 23 different test persons taken under the same lighting.
On the BioID test set, the fitting performance of STASM, which features
the stacked model and 2D profile trained on MUCT, is reported to be the best
among current algorithms [19]. The MUCT and BioID point sets include five
standalone points; therefore their shapes cannot be separated into single
contours. Consequently, we are not able to use Viterbi-optimized profile searching
in this experiment. Following STASM, we evaluate the performance using the
me-17 measure. This measure is the mean distance between the fitting result of 17
points in the face and the manually marked ground truth, divided by the distance
between the two eye pupils. The comparison of the two algorithms' performance is
shown in Table 1. The result shows that both algorithms' results on BioID
are very close to perfect, and STASM made a slightly lower error.
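The me-17 measure (and me-194, which differs only in the point count and in normalizing by the eye-centroid distance) reduces to a few lines; function and parameter names are ours.

```python
import math

def me_error(pred, truth, eye_left, eye_right):
    # Mean predicted-to-ground-truth distance over the landmark set,
    # normalized by the interocular distance (me-17 / me-194 style).
    dists = [math.dist(p, t) for p, t in zip(pred, truth)]
    iod = math.dist(eye_left, eye_right)
    return sum(dists) / (len(dists) * iod)
```

The interocular normalization makes the measure invariant to image resolution and face size.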
² Note that neighboring landmarks in our dataset are equidistant.
Fig. 5. CDF of me-194 error for CompASM and STASM on the Helen dataset
Table 1. Comparison of STASM and CompASM on the two test sets. For
MUCT/BioID we use the me-17 measure. For Helen, we use the me-194 measure.
While BioID is an easy dataset for fitting algorithms to deal with, the Helen
dataset is a more challenging one. The Helen data consists of natural images
which not only vary in pose and lighting but are also more diverse in subject
identities and expression. In this dataset, the face features are highly uncorrelated,
with a non-Gaussian distribution which our model can exploit better. To measure
fitting performance on Helen, we use me-194, similar to the me-17 measure,
which calculates the mean deviation of 194 points from ground truth normalized
by the distance between the two eye centroids. We divided the Helen images into a training
set of 2000 images and a testing set of 330 images. The comparison of STASM
and CompASM is shown in Table 1 and Figure 5. Table 1 shows that CompASM
outperforms STASM by 16%. The result shows that CompASM is more
robust over diverse face images. Some examples of fitting results of STASM and
CompASM are shown in Fig. 6.
Fig. 6. Some examples of fitting results on the Helen test set of CompASM (first row)
and STASM (second row)
[Figure 7 plot: mean me-194 error (y-axis, 0 to 0.12) vs. interaction round (x-axis, 2 to 14), with four curves: linearly scaled movement and refit; linearly scaled movement, no refit; move isolated point and refit; move isolated point, no refit.]
Fig. 7. The mean me-194 error measure across the Helen test set as a function of
interaction round. One interaction round constitutes a single user action (see text for
details). Each curve is the mean value across the test set under varying experimental
conditions. Using both the linearly scaled movement and constrained refitting steps
results in the most effective reduction in error.
satisfactory fitting result. More concretely, we wish to observe how the fitting
error decays as a function of the number of interaction steps.
To this end, we simulate the user's typical behavior when interacting with the
system, namely to choose the landmark point located farthest from its true position
and move it to its true position. After being moved, it becomes a constraint
for subsequent fitting rounds as described in Sec. 5.
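The simulated interaction loop can be sketched as follows, with `refit` standing in for the constrained model-fitting update of Sec. 5 (a hypothetical callback signature; names are ours):

```python
import math

def simulate_rounds(pred, truth, refit, rounds):
    # At each round, move the landmark farthest from ground truth onto
    # its true position, pin it, then refit the remaining landmarks.
    # refit(points, pinned) is the constrained model-fitting update.
    pred = list(pred)
    pinned = set()
    for _ in range(rounds):
        free = [j for j in range(len(pred)) if j not in pinned]
        if not free:
            break
        worst = max(free, key=lambda j: math.dist(pred[j], truth[j]))
        pred[worst] = truth[worst]
        pinned.add(worst)
        pred = refit(pred, pinned)
    return pred
```

Recording the me-194 error after each round yields curves of the kind plotted in Fig. 7.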
We repeat this procedure for a number of rounds and record the me-194 error
measure at each round. Fig. 7 depicts the distribution of the error across our test
set over 15 interactivity rounds. To evaluate the effectiveness of the two steps
of interactivity described in Sec. 5, we compare the performance of the case
Fig. 8. Examples of a sequence of editing steps where the most erroneous point is
selected at each step. The red filled point denotes the point selected for editing. The
open yellow circles denote points that have already been edited and fixed.
where both of the update steps are done with that of the case where only one
of the steps is done, and also against a baseline where only the most erroneous
point is moved. The comparative effectiveness of these strategies is apparent in
Fig. 7, from which we can conclude that both steps used in conjunction are most
effective at reducing the overall error most quickly. Fig. 8 depicts some example
runs of the interaction process. More examples with animation are available at
the website http://www.ifp.uiuc.edu/~vuongle2/helen/interactive
7 Conclusion
In this paper, we proposed a new component-based ASM model (CompASM)
and an interactive refinement algorithm for practical facial feature localization
on real-world face images. CompASM extends the classical, global ASM by
decomposing global shape fitting into two modules: component shape fitting and
configuration model fitting. The model provides more flexibility for individual
components' shape variation and the relative configuration between components,
hence is more suitable for handling images with large variation. We improve
profile matching by introducing a new joint landmark optimization scheme using the
Viterbi algorithm, which outperforms the standard greedy-based approach. We
also propose an interactive refinement algorithm for facial feature localization
to minimize fitting errors for bad initial fittings. Furthermore, a new challenging
facial feature localization dataset is introduced which contains more than
2000 high-resolution, fully annotated real-world face images. Experimental
results show that our approach outperforms the state of the art on the new, real-world
facial feature localization dataset. Our interactive refinement algorithm is
capable of reducing errors quickly, but there is space to improve it further by
learning from users' behavior or by guiding the interaction through statistics of
landmark uncertainties, which we leave as future work.
References
1. Milborrow, S., Nicolls, F.: Locating Facial Features with an Extended Active Shape
Model. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part IV. LNCS,
vol. 5305, pp. 504–513. Springer, Heidelberg (2008)
2. Liang, L., Xiao, R., Wen, F., Sun, J.: Face Alignment Via Component-Based Dis-
criminative Search. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008,
Part II. LNCS, vol. 5303, pp. 72–85. Springer, Heidelberg (2008)
3. Gu, L., Kanade, T.: A Generative Shape Regularization Model for Robust Face
Alignment. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I.
LNCS, vol. 5302, pp. 413–426. Springer, Heidelberg (2008)
4. Tresadern, P., Bhaskar, H., Adeshina, S., Taylor, C., Cootes, T.: Combining local
and global shape models for deformable object matching. In: BMVC (2009)
5. Zhu, X., Ramanan, D.: Face Detection, Pose Estimation, and Landmark Localiza-
tion in the Wild. In: CVPR (2012)
6. Cootes, T., Taylor, C.J., Cooper, D., Graham, J.: Active shape models - their
training and application. Computer Vision and Image Understanding 61, 38–59
(1995)
7. Cootes, T., Edwards, G.J., Taylor, C.J.: Active appearance models. IEEE Trans.
PAMI 23, 681–685 (2001)
8. Cootes, T., Taylor, C.: A mixture model for representing shape variation. Image
and Vision Computing 17, 567–574 (1999)
9. Romdhani, S., Gong, S., Psarrou, A.: A multi-view non-linear active shape model
using kernel pca. In: BMVC (1999)
10. Rogers, M., Graham, J.: Robust Active Shape Model Search. In: Heyden, A.,
Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002, Part IV. LNCS, vol. 2353,
pp. 517530. Springer, Heidelberg (2002)
11. Van Ginneken, B., Frangi, A., Staal, J., Ter Haar Romeny, B., Viergever, M.: Active
shape model segmentation with optimal features. IEEE Trans. Medical Imaging 21,
924933 (2002)
12. Zhou, Y., Gu, L., Zhang, H.: Bayesian tangent shape model: Estimating shape and
pose parameters via bayesian inference. In: CVPR (2003)
13. Saragih, J.M., Lucey, S., Cohn, J.F.: Deformable model tting by regularized land-
mark mean-shift. Int. J. Comput. Vision 91, 200215 (2011)
14. Huang, Y., Liu, Q., Metaxas, D.: A component based deformable model for gen-
eralized face alignment. In: ICCV (2007)
15. Belhumeur, P.N., Jacobs, D.W., Kriegman, D.J., Kumar, N.: Localizing parts of
faces using a consensus of exemplars. In: CVPR 2011, pp. 545552 (2011)
16. Felzenszwalb, P.F., Huttenlocher, D.P.: Pictorial structures for object recognition.
International Journal of Computer Vision 61, 5579 (2005)
17. Kasinski, A., Florek, A., Schmidt, A.: The PUT face database. Image Processing
and Communications 13, 5964 (2008)
18. Forney Jr., G.D.: The Viterbi algorithm. Proceedings of IEEE 61, 268278 (1973)
19. Milborrow, S., Morkel, J., Nicolls, F.: The MUCT Landmarked Face Database.
Pattern Recognition Association of South Africa (2010)
20. Jesorsky, O., Kirchberg, K.J., Frischholz, R.: Robust Face Detection Using the
Hausdor Distance. In: Bigun, J., Smeraldi, F. (eds.) AVBPA 2001. LNCS,
vol. 2091, pp. 9095. Springer, Heidelberg (2001)
Propagative Hough Voting
for Human Activity Recognition
1 Introduction
Hough-transform based local feature voting has shown promising results in both
object and activity detection. It leverages the ensemble of local features, where
each local feature votes individually for the hypothesis, and thus can provide robust
detection results even when the target object is partially occluded. Meanwhile,
it takes the spatial or spatio-temporal configurations of the local features into
consideration, and thus can provide reliable detection in cluttered scenes and
can handle rotation or scale variation of the target object well.
Despite previous successes, most current Hough-voting based detection ap-
proaches require sufficient training data to enable discriminative voting of local
patches. For example, [7] requires sufficient labeled local features to train the
random forest. When limited training examples are provided, e.g., given one or
a few query examples to search for similar activity instances, the performance of
previous methods is likely to suffer. The root of this problem lies in the challenges
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 693–706, 2012.
© Springer-Verlag Berlin Heidelberg 2012
694 G. Yu, J. Yuan, and Z. Liu
of matching local features to the training model. Due to the possibly large varia-
tions of activity patterns, if limited training examples are provided, it is difficult
to tell whether a given local feature belongs to the target activity or not. The
resulting inaccurate voting scores will degrade the detection performance.
In this paper, we propose propagative Hough voting for activity analysis.
To improve local feature point matching, we introduce random projection
trees [18], which are able to capture the intrinsic low-dimensional manifold struc-
ture to improve matching in high-dimensional space. With the improved match-
ing, the voting weight for each matched feature point pair can be computed more
reliably. Besides, as the number of trees grows, our propagative Hough voting
algorithm is theoretically guaranteed to converge to the optimal detection.
Another nice property of the random projection tree is that its construction is
unsupervised, thus making it perfectly suitable for leveraging the unlabeled test
data. When the amount of training data is small, such as in activity search with
one or a few queries, one can use the test data to construct the random projection
trees. After the random projection trees are constructed, the label information
in the training data can then be propagated to the testing data through the trees.
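The unsupervised construction described above can be sketched in a few lines. The following is a minimal, hypothetical illustration (function and parameter names are ours, not the paper's): each node projects the points onto a random unit direction and splits at the median, using no labels, so the tree can be grown on unlabeled test data just as well as on training data.

```python
import numpy as np

def build_rpt(points, indices=None, max_leaf_size=8, rng=None):
    """Recursively build a random projection tree over row-vector points.

    Each internal node stores a random unit direction and a split
    threshold; leaves store the indices of the points they contain.
    No labels are used anywhere in the construction.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    if indices is None:
        indices = np.arange(len(points))
    if len(indices) <= max_leaf_size:
        return {"leaf": indices}
    # Project onto a random unit direction and split at the median.
    direction = rng.standard_normal(points.shape[1])
    direction /= np.linalg.norm(direction)
    proj = points[indices] @ direction
    threshold = np.median(proj)
    left, right = indices[proj <= threshold], indices[proj > threshold]
    if len(left) == 0 or len(right) == 0:  # degenerate split: stop here
        return {"leaf": indices}
    return {"dir": direction, "thr": threshold,
            "left": build_rpt(points, left, max_leaf_size, rng),
            "right": build_rpt(points, right, max_leaf_size, rng)}

def find_leaf(tree, x):
    """Return the index set of the leaf that point x falls into."""
    while "leaf" not in tree:
        tree = tree["left"] if x @ tree["dir"] <= tree["thr"] else tree["right"]
    return tree["leaf"]
```

In the paper's setting the points would be the n = 162 dimensional STIP descriptors; a forest of such trees is grown with independent random directions.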
Our method is explained in Fig. 1. For each local patch (or feature) from
the training example, it searches for the best matches through the RPT. Once the
matches in the testing video are found, the label and spatial-temporal config-
uration information are propagated from the training data to the testing data.
The accumulated Hough voting score can be used for recognition and detection.
By applying the random projection trees, our proposed method is as efficient as
existing Hough-voting based activity recognition approaches, e.g., the random
forest used in [7]. However, our method does not rely on human detection and
tracking, and can handle the intra-class variations of the activity patterns well.
With an iterative scale refinement procedure, our method can handle small scale
variations of activities as well.
We evaluate our method on two benchmark datasets, UT-Interaction [10]
and TV Human Interaction [19]. To compare fairly with existing methods, we test
our propagative Hough voting with (1) insufficient training data, e.g., in activity
search with few query examples, and (2) sufficient training data, e.g., in activity
recognition with many training examples. The superior performance over the
state of the art validates that our method outperforms in both conditions.
2 Related Work
Based on the successful development of video features, e.g., STIP [1], cuboids [13],
and 3D HoG [22], many human activity recognition methods have been devel-
oped. Previous approaches [15][7][16][11] rely on human detection or even human
pose estimation for activity analysis. But human detection, tracking, and pose
estimation in uncontrolled environments are challenging problems.
Without relying on auxiliary algorithms such as human detection, [12][5][6] per-
form activity recognition by formulating the problem as a template matching pro-
cess. In [12], a spatial-temporal graph model is learned for each activity, which classifies
Fig. 1. Propagative Hough Voting: The left figure illustrates our implicit spatial-
temporal shape model on a training video. Three sample STIPs from the testing videos
are illustrated with blue triangles in the right figure. Several matches will be found for
the three STIPs given the RPT. We choose three yellow dots to describe the matched
STIPs from the training data in the middle figure. For each training STIP (yellow dot),
the spatial-temporal information will be transferred to the matched testing STIPs (blue
triangles) in the testing videos. By accumulating the votes from all the matching pairs, a
sub-volume is located in the right figure. The regions marked with magenta color refer
to the low-dimensional manifold learned with the RPT, which can be built on either
training data or testing data. (Best viewed in color)
the testing video as the one with the smallest matching cost. In [5], temporal infor-
mation is utilized to build a string of feature graphs. Videos are segmented at
a fixed interval and each segment is modeled by a feature graph. By combining the
matching costs from all segments, they determine the category of the testing
video as the one with the smallest matching cost. In [6], a similar idea of partition-
ing videos into small segments is used, but the video partition problem is solved with
dynamic programming. There are several limitations in these matching based al-
gorithms. First, the template matching algorithms, e.g., graph matching [12][5],
are computationally intensive. For example, in order to use these template based
methods for activity localization, we would need sliding windows to scan all the
possible sub-volumes, which is an extremely large search space. Second, despite the
fact that [6] achieves fast speed, the proposed dynamic BoW based matching is not
discriminative since it drops all the spatial information from the interest points.
Similarly, in [12][5], the temporal models are not flexible enough to handle speed
variations of the activity pattern. Third, a large training dataset is needed to learn the
activity model in [12][6]. However, in some applications such as activity search, the
amount of training data is extremely limited.
To avoid enumerating all the possible sub-volumes and to save computational
cost, Hough voting has been used to locate the potential candidates. In [7], Hough
voting is employed to vote for the temporal center of the activity while
the spatial locations are pre-determined by human tracking. However, tracking
humans in unconstrained environments is a challenging problem. In contrast, our
algorithm does not rely on tracking. In our algorithm, both spatial and temporal
centers can be determined by Hough voting, and the scale can be further refined
with back-projection. Besides, the trees in [7] are constructed in a supervised
manner for classification purposes, while our trees model the underlying data
distribution in an unsupervised way. Furthermore, our trees can be built on the
test data, which allows our propagative Hough voting to handle the limited
training data problem well.
On the other hand, p([x, t, σ] | l_s, l_r) determines the voting position. Suppose d_r =
[f_r, l_r] ∈ R matches d_s = [f_s, l_s] ∈ S; we cast the spatial-temporal information
from the training data to the testing data with voting position l_v = [x_v, t_v]:

x_v = x_s − σ_x (x_r − c_{x_r}),
t_v = t_s − σ_t (t_r − c_{t_r}),    (4)
data only, e.g., activity search, or (3) both training and testing data. The trees are
constructed in an unsupervised way, and the labels from the training data are only
used in the voting step. Assume we have a set of STIPs, denoted by D = {d_i; i =
1, 2, …, N_D}, where d_i = [f_i, l_i] as defined in Section 3 and N_D is the total number
of interest points. The feature dimension is set to n = 162, so f_i ∈ R^n.
p(f_s | f_r) = (1/N_T) Σ_{i=1..N_T} I_i(f_s, f_r),    (6)

where d_s ∈ S with I_i(f_s, f_r) = 1 refers to the interest points from S that fall into
the same leaf as d_r in the i-th tree. Based on Eq. 5, we can compute the voting
score as:

S(V(x, t, σ), R) ∝ Σ_{d_r ∈ R} Σ_{i=1..N_T} Σ_{d_s ∈ S, I_i(f_s, f_r)=1} exp(−‖[x_v − x, t_v − t]‖² / σ²).    (9)
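To make the matching-and-voting step concrete, here is a hedged, brute-force sketch (names are illustrative, not the paper's): each "tree" is abstracted as a function mapping a feature to a leaf id, p(f_s | f_r) of Eq. (6) becomes the fraction of trees in which the two features share a leaf, and the Gaussian of Eq. (9) is approximated by binning each vote at the rounded voting position of Eq. (4), with unit scale for simplicity. In practice the matches would be retrieved directly from the tree leaves rather than by the triple loop used here.

```python
from collections import defaultdict

def vote_score(train_stips, test_stips, trees, center):
    """Accumulate Hough votes for activity-center hypotheses.

    train_stips: list of (feature, (x, t)) from the training video R
    test_stips:  list of (feature, (x, t)) from the testing video S
    trees:       list of functions mapping a feature to a leaf id;
                 p(f_s|f_r) is the fraction of trees in which f_s and
                 f_r land in the same leaf (Eq. 6)
    center:      spatial-temporal center (c_x, c_t) of the training activity
    """
    n_trees = len(trees)
    # Leaf ids of every test STIP in every tree, computed once.
    test_leaves = [[tree(f) for tree in trees] for f, _ in test_stips]
    hough = defaultdict(float)  # sparse accumulator over (x, t) hypotheses
    for f_r, (x_r, t_r) in train_stips:
        for i, tree in enumerate(trees):
            leaf_r = tree(f_r)
            for s, (f_s, (x_s, t_s)) in enumerate(test_stips):
                if test_leaves[s][i] != leaf_r:
                    continue  # I_i(f_s, f_r) = 0: no vote from this tree
                # Voting position (Eq. 4), unit scale for simplicity.
                x_v = x_s - (x_r - center[0])
                t_v = t_s - (t_r - center[1])
                hough[(round(x_v), round(t_v))] += 1.0 / n_trees
    return hough
```

The sub-volume hypothesis with the highest accumulated score is then taken as the detection.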
5 Scale Determination
refinement: 1) fix the scale and search for the activity center with Hough voting; 2)
fix the activity center and determine the scale based on back-projection. We
iterate the two steps until convergence.
The initial scale is set to the average scale of the training
videos. Based on the Hough voting step discussed in Section 3, we can obtain
the rough position of the activity center. Then back-projection, which has been
used in [17][14] for 2D object segmentation or localization, is used to determine
the scale parameters.
After the Hough voting step, we obtain a back-projection score for each interest
point d_s from the testing video based on Eq. 2:

s_{d_s} = Σ_{d_r ∈ R} p(l* | l_s, l_r) p(f_s | f_r)
       = (1/Z) Σ_{d_r ∈ R} exp(−‖l* − l_s‖² / σ²) p(f_s | f_r),    (11)

where l* is the activity center computed in the last round; Z and σ² are, respec-
tively, the normalization constant and kernel bandwidth, which are the same as in
Eq. 5. p(f_s | f_r) is computed by Eq. 6. The back-projection score s_{d_s} represents
how much the interest point d_s contributes to the voting center, i.e., l*. For each
sub-volume detected in the previous Hough voting step, we first enlarge the origi-
nal sub-volume in both the spatial and temporal domains by 10%. We refer to the
extended volume as V^{l*}_{W,H,T}, meaning a volume centered at l* with width W,
height H and duration T. We need to find a sub-volume V^{l*}_{w',h',t'} that maximizes
the following function:

max_{w',h',t'} Σ_{d_s ∈ V^{l*}_{w',h',t'}} s_{d_s} + λ w' h' t',    (12)
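The scale search of Eq. (12) can be sketched by brute force over a small grid of candidate shrink factors. This is an illustrative simplification with hypothetical names; the paper's actual search procedure and the value of the trade-off weight are not given in this excerpt, so we assume a negative `lam` that penalizes large volumes.

```python
import itertools
import math

def refine_scale(points, center, W, H, T, lam=-0.001):
    """Sketch of the back-projection scale search (Eqs. 11-12).

    points: list of ((x, y, t), s) pairs: test interest points inside the
            10%-enlarged volume and their back-projection scores s_{d_s}
    center: (x, y, t) activity center from the Hough-voting step
    Searches, over a few candidate fractions of (W, H, T), for the
    sub-volume (w, h, t) centered at `center` that maximizes
    (sum of enclosed scores) + lam * w * h * t.
    """
    fractions = [0.6, 0.7, 0.8, 0.9, 1.0]
    best, best_box = -math.inf, (W, H, T)
    for fw, fh, ft in itertools.product(fractions, repeat=3):
        w, h, t = fw * W, fh * H, ft * T
        inside = sum(
            s for (x, y, tt), s in points
            if abs(x - center[0]) <= w / 2
            and abs(y - center[1]) <= h / 2
            and abs(tt - center[2]) <= t / 2)
        obj = inside + lam * w * h * t
        if obj > best:
            best, best_box = obj, (w, h, t)
    return best_box
```

The returned box would then be fed back to the Hough-voting step, and the two steps iterated until convergence as described above.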
6 Experiments
Two datasets are used to validate the performance of our algorithms:
UT-Interaction [10] and the TV Human Interaction dataset [19]. We perform two
types of tests: 1) activity recognition with few training examples but a
large testing set (building the RPT on the testing data), and 2) activity recognition
when the training data is sufficient (building the RPT on the training data).
[PR-curve plots: activity retrieval on the UT-Interaction Dataset. Legends: punch — HV without scale refinement AP 0.2262, HV with scale refinement AP 0.2500, [20] AP 0.1016, NN + Hough voting AP 0.1774; push — HV without scale refinement AP 0.3283, HV with scale refinement AP 0.3283, [20] AP 0.1504, NN + Hough voting AP 0.1992; shakehands — HV without scale refinement AP 0.1836, HV with scale refinement AP 0.2436, [20] AP 0.0668, NN + Hough voting AP 0.1369.]
Fig. 3. Activity search results on the UT-Interaction Dataset. We show two categories
of results in each row. For each category, the first image is from the query video and
the following three images are sample detection results. The red regions refer to our
detected sub-volumes.
[Plots: left — average precision of our proposed method vs. NN + Hough voting as the number of query videos grows from 1 to 13; right — PR curves per activity: shakehands AP 0.6609, highfive AP 0.5800, hug AP 0.7107, kiss AP 0.6947.]
Fig. 4. Left: Average precision versus different numbers of training videos provided
on the TV Human Dataset (testing on a database with 100 videos). Right: PR curves
for activity recognition on the TV Human Dataset (25 videos for each activity used for
training).
Method        Shake  Hug   Kick  Point  Punch  Push  Total
[1] + kNN     0.18   0.49  0.57  0.88   0.73   0.57  0.57
[1] + Bayes   0.38   0.72  0.47  0.9    0.5    0.52  0.582
[1] + SVM     0.5    0.8   0.7   0.8    0.6    0.7   0.683
[13] + kNN    0.56   0.85  0.33  0.93   0.39   0.72  0.63
[13] + Bayes  0.49   0.86  0.72  0.96   0.44   0.53  0.667
[13] + SVM    0.8    0.9   0.9   1      0.7    0.8   0.85
[7]           0.7    1     1     1      0.7    0.9   0.88
[6] BoW       -      -     -     -      -      -     0.85
Our proposed  1      1     1     1      0.6    1     0.933

Method        Shake  Hug   Kick  Point  Punch  Push  Total
[1] + kNN     0.3    0.38  0.76  0.98   0.34   0.22  0.497
[1] + Bayes   0.36   0.67  0.62  0.9    0.32   0.4   0.545
[1] + SVM     0.5    0.7   0.8   0.9    0.5    0.5   0.65
[13] + kNN    0.65   0.75  0.57  0.9    0.58   0.25  0.617
[13] + Bayes  0.26   0.68  0.72  0.94   0.28   0.33  0.535
[13] + SVM    0.8    0.8   0.6   0.9    0.7    0.4   0.7
[7]           0.5    0.9   1     1      0.8    0.4   0.77
Our Proposed  0.7    0.9   1     0.9    1      1     0.917
As we can see from Table 2, results from cuboid features [13] are better than
those from STIP features [1]. Even though we use STIP features, we still achieve
better results than the state-of-the-art techniques that use cuboid features.
7 Conclusion
Local feature voting plays an essential role in Hough voting-based detection. To
enable discriminative Hough voting with limited training examples, we propose
propagative Hough voting for human activity analysis. Instead of matching the
local features with the training model directly, our technique employs random
projection trees to leverage the low-dimensional manifold structure in the
high-dimensional feature space. This provides significantly better matching
accuracy and better activity detection results without increasing the compu-
tational cost too much. As the number of trees grows, our propagative Hough
voting algorithm converges to the optimal detection. The superior perfor-
mance on two benchmark datasets validates that our method outperforms
not only with sufficient training data, e.g., in activity recognition, but also with
limited training data, e.g., in activity search with one query example.
References
1. Laptev, I.: On space-time interest points. International Journal of Computer Vision 64(2-3), 107–123 (2005)
2. Yuan, J., Liu, Z., Wu, Y.: Discriminative Video Pattern Search for Efficient Action Detection. IEEE Trans. on PAMI (2011)
3. Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: Proc. CVPR (2008)
4. Yuan, F., Prinet, V., Yuan, J.: Middle-Level Representation for Human Activities Recognition: the Role of Spatio-temporal Relationships. In: ECCV Workshop on Human Motion (2010)
5. Gaur, U., Zhu, Y., Song, B., Roy-Chowdhury, A.: A String of Feature Graphs Model for Recognition of Complex Activities in Natural Videos. In: ICCV (2011)
6. Ryoo, M.S.: Human Activity Prediction: Early Recognition of Ongoing Activities from Streaming Videos. In: ICCV (2011)
7. Gall, J., Yao, A., Razavi, N., Van Gool, L., Lempitsky, V.: Hough forests for object detection, tracking, and action recognition. PAMI, 2188–2202 (2011)
8. Ryoo, M.S., Chen, C., Aggarwal, J.: An overview of contest on semantic description of human activities, SDHA (2010)
9. Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hashing. In: International Conference on Very Large Data Bases (VLDB), pp. 518–529 (1999)
10. Ryoo, M.S., Aggarwal, J.K.: Spatio-Temporal Relationship Match: Video Structure Comparison for Recognition of Complex Human Activities. In: ICCV (2009)
11. Amer, M.R., Todorovic, S.: A Chains Model for Localizing Participants of Group Activities in Videos. In: ICCV (2011)
12. Brendel, W., Todorovic, S.: Learning Spatiotemporal Graphs of Human Activities. In: ICCV (2011)
13. Dollar, P., Rabaud, V., Cottrell, G., Belongie, S.: Behavior Recognition via Sparse Spatio-Temporal Features. In: Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance (2005)
14. Razavi, N., Gall, J., Van Gool, L.: Backprojection Revisited: Scalable Multi-view Object Detection and Similarity Metrics for Detections. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part I. LNCS, vol. 6311, pp. 620–633. Springer, Heidelberg (2010)
15. Choi, W., Savarese, S.: Learning Context for Collective Activity Recognition. In: CVPR (2011)
16. Yao, B., Fei-Fei, L.: Modeling mutual context of object and human pose in human-object interaction activities. In: CVPR (2010)
17. Leibe, B., Leonardis, A., Schiele, B.: Robust Object Detection with Interleaved Categorization and Segmentation. IJCV 77(1-3), 259–289 (2007)
18. Dasgupta, S., Freund, Y.: Random projection trees and low dimensional manifolds. In: ACM Symposium on Theory of Computing (STOC), pp. 537–546 (2008)
19. Patron-perez, A., Marszalek, M., Zisserman, A., Reid, I.: High Five: Recognising human interactions in TV shows. In: BMVC (2010)
20. Yu, G., Yuan, J., Liu, Z.: Unsupervised Random Forest Indexing for Fast Action Search. In: CVPR (2011)
21. Moosmann, F., Nowak, E., Jurie, F.: Randomized clustering forests for image classification. PAMI 30, 1632–1646 (2008)
22. Klaser, A., Marszalek, M.: A spatio-temporal descriptor based on 3D-gradients. In: BMVC (2008)
Spatio-Temporal Phrases
for Activity Recognition
Abstract. Local feature based approaches have become popular for
activity recognition. A local feature captures the local movement and
appearance of a local region in a video, and thus can be ambiguous;
e.g., it cannot tell whether a movement is from a person's hand or foot
when the camera is far away from the person. To better distinguish
different types of activities, people have proposed using combinations
of local features to encode the relationships of local movements. Due to
computation limits, previous work only creates a combination from
neighboring features in space and/or time. In this paper, we propose
an approach that efficiently identifies both local and long-range motion
interactions; taking the push activity as an example, our approach can
capture the combination of the hand movement of one person and the
foot response of another person, the local features of which are both
spatially and temporally far away from each other. Our computational
complexity is linear in the number of local features in a video. Extensive
experiments show that our approach is generically effective for
recognizing a wide variety of activities and activities spanning a long
term, compared to a number of state-of-the-art methods.
1 Introduction
Activity recognition in videos has attracted increasing interest recently. An ac-
tivity can be defined as a certain spatial and temporal pattern involving the
movements of a single actor or multiple actors. The recognition task requires cap-
turing enough spatial and temporal information to distinguish different activity
categories, while handling the large intra-category variations. Also, most video
analysis applications, such as surveillance, require high computational efficiency.
This work was done while the first author was visiting GE Global Research as an in-
tern. This work was supported by the National Institute of Justice, US Department
of Justice, under the award #2009-SQ-B9-K013. The opinions, findings, and conclu-
sions or recommendations expressed in this publication are those of the authors and
do not necessarily reflect the views of the Department of Justice.
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 707–721, 2012.
© Springer-Verlag Berlin Heidelberg 2012
708 Y. Zhang et al.
[Fig. 1: frames from two videos at T = 15, 20, 55 and T = 136, 141, 176, illustrating a co-occurring ST phrase at different locations and time stamps.]
The bag-of-words (BoW) model based on local features is very popular for
activity recognition due to its robustness in real-world environments [1–5]. De-
spite its computational efficiency and intra-category invariance, the BoW model
discards all structural spatial and temporal information among the local fea-
tures. However, a local word only captures the local movement of a particular
region, and can sometimes be ambiguous. For example, using the local features
alone, the activities of push and kick can be quite similar when it is hard to
tell whether a movement is from a person's hand or foot due to the video resolution.
In order to better distinguish different types of activities, much work has
been done to incorporate spatio-temporal relationships among the local features
[6–12]. In particular, approaches that model the mutual relationships among the
words are spatio-temporally shift invariant [6, 8, 13–15], and thus do not require
knowledge of where and when the activity occurs in a video. However, due
to computational concerns, these works have one or more of the following
limitations: 1) they only create a combination (phrase) with words in a local space or
time neighborhood [8, 14, 15], and thus cannot capture the long-term interactions
of different body parts, which can be extracted from different individuals; 2) they only
consider a pair of local words [6, 13], and therefore cannot encode interactions
among more than two time stamps or local regions; 3) they only encode the distances
between the words, while discarding the temporal ordering and spatial layout
among them [6, 13], and thus are incapable of modeling causality relationships.
Along the direction of modeling the mutual relationships among words, this
paper presents a generic activity recognition approach that addresses all of
the aforementioned limitations. Our approach is formed around a novel con-
cept named the spatio-temporal phrase (ST phrase). An ST phrase of length k
is defined as a combination of k words in a certain spatial and temporal struc-
ture, including their order and relative positions. Hence, it encodes rich tempo-
ral ordering and spatial geometry information of local words. Fig. 1 illustrates
example ST phrases in two videos, both of which include the push activity
at different time stamps. The ST phrases capture patterns such as the hand
movement of one person in one frame and the foot movement of another person
several frames later. The illustrated phrase consists of local movements that do
not occur within the same space-time neighborhood region. Moreover, the same
phrase, which characterizes the push activity, occurs at different locations and
time stamps in the two videos.
The algorithm for identifying ST phrases extends a recent work [16] proposed
for image retrieval, which efficiently detects co-occurring 2D phrases in two static
images. We extend the work to the space-time domain, and show that we can
identify all ST phrases of any length in a video in time linear in the num-
ber of local features. We also provide an algorithm that makes the ST phrases
speed invariant, to deal with different speeds or durations of the activities. The
co-occurring ST phrases are further used to compute a kernel for discrimina-
tive learning with SVM, which implicitly determines the importance of different
phrases. In addition, we propose an algorithm that can update the kernel values
incrementally when a new frame arrives, thereby enabling us to efficiently detect
activities from video streams in an online fashion.
Fig. 2. 3D correspondence transform. Circles represent local features, and colors
represent word assignments. A co-occurring ST phrase of length 4 is shown.
Due to the high computational cost, previous works do not capture higher-order
and long-range dependencies.
2 Approach
Activity Speed Variations: The same activity category can occur with dif-
ferent speeds and time durations in different videos. For example, people may
run at different speeds, or take different amounts of time to form a group in order to
fight. To explicitly deal with this problem, we would like to detect k words
from two videos as the same ST phrase if their only structural difference is the
temporal rate. To this end, we add another dimension to the offset space, which
indicates the temporal scaling d, and allow each pair of words to vote for multiple
d. The offset location of a pair of features with the same word assignment is:

(Δx_i, Δy_i, Δt_i, d) = (x_i − x_i', y_i − y_i', t_i − t_i'·d, d).

In this 4D space, if we have k votes at the same location, we have a co-
occurring ST phrase with a particular temporal scale difference d. In the ex-
periments, we set the scale d to 1, 1.5, 1.5², …, 5, which tolerates up to a 5×
speed difference between the activities of two videos.
Scale Variations: People in an activity can appear at different scales in different
videos. Scale invariance can be obtained by adding another dimension to the
offset space for the scale difference, similar to the solution for speed invariance. We
can also utilize the detected scales of the local features and vote only for the scale
difference that matches that of the local features. Therefore, we still create
one vote for each pair of identical local words when speed is not considered.
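The speed-invariant offset-space voting above can be sketched as follows (illustrative names; the discrete quantization of the offset space and the explicit pairwise voting loop are our simplifications of the paper's linear-time counting): each pair of identically assigned words votes at (Δx, Δy, t − t'·d, d) for every candidate scale d, and a location holding n votes contributes C(n, k) co-occurring length-k phrases.

```python
from collections import Counter
from math import comb

def cooccurring_phrases(words_v, words_w, k, scales=(1.0, 1.5, 2.25)):
    """Count co-occurring ST phrases of length k between two videos.

    words_v / words_w: lists of (word_id, x, y, t) local features.
    Each pair of equally assigned words votes in the 4D offset space
    (dx, dy, t - t'*d, d) for every candidate temporal scale d.
    """
    by_word = {}
    for w_id, x, y, t in words_w:
        by_word.setdefault(w_id, []).append((x, y, t))
    votes = Counter()  # sparse offset space: only non-zero bins stored
    for w_id, x, y, t in words_v:
        for (x2, y2, t2) in by_word.get(w_id, []):
            for d in scales:
                votes[(x - x2, y - y2, round(t - t2 * d, 6), d)] += 1
    # A location with n votes yields one phrase per k-subset of the
    # word pairs sharing that offset: C(n, k) length-k phrases.
    return sum(comb(n, k) for n in votes.values())
```

Storing only the non-zero bins, as here, keeps memory linear in the number of votes, matching the sparsity argument made later in the text.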
[Fig. 3 diagram: offset spaces between the current segment V_t and support vectors; red marks votes added, blue marks votes deleted; coefficients obtained from SVM training.]
Fig. 3. Illustration of the incremental kernel computing algorithm. The support vectors
(selected training videos) S and their coefficients (second column) are obtained during
training, while the offset spaces for each support vector (third column) need to be
computed online during testing. The rightmost image shows the enlarged offset space
between the video segment V_t and support vector S_1.
where a factor (greater than 1) is chosen to ensure that the normalized kernel
values for different k are on similar scales, i.e., the values are not too small.
The SVM with the above kernel projects the word space to the higher-order ST
phrase space for classification. During training, we use both positive and negative
examples to compute the kernel matrix, which is used to train an SVM. The
SVM implicitly assigns different weights to different ST phrases by learning
the coefficients for different training videos. Let the training videos that have
non-zero weights (support vectors) be S_j and their coefficients be α_j; the decision
score for a test video V for a particular activity category is computed as:

Score(V) = Σ_j α_j K(V, S_j).    (3)
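In code, Eq. (3) is a plain weighted sum over support vectors (a hypothetical sketch; `kernel_fn` stands in for the ST-phrase kernel K and the signed coefficients come from SVM training):

```python
def decision_score(kernel_fn, video, support_vectors, alphas):
    """Eq. (3): decision score of a test video for one activity class.

    kernel_fn(video, s) returns the ST-phrase kernel value K(V, S_j);
    alphas are the signed SVM coefficients learned during training.
    """
    return sum(a * kernel_fn(video, s)
               for a, s in zip(alphas, support_vectors))
```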
a video segment V_t and a support vector S_j is to compute the offset space, which
is created from the votes of the local features. As discussed in Section 2.2,
this computational complexity is linear in the number of local features in the
segment.
However, note that the video segment of the current frame and that of the
previous frame have a significant overlap; we can further accelerate the kernel
computation with an incremental updating algorithm. The difference between
the video segments of the current and the previous frame only involves two
frames, i.e., the current frame t and the frame at t − T (Fig. 3). If we have
stored the offset spaces for the previous frame segment V_{t−1}, then to compute the
new offset space for the current V_t and a support vector S_j, we only need to add
the votes made by the local features of the current frame, and delete the votes
that were contributed by the features of frame t − T.
When we add a vote at a location l in the offset space that originally had
n_l votes, the number of co-occurring length-k ST phrases (Eqn. (1)) changes
as follows:

K_k^new(V, V') = K_k^old(V, V') + C(n_l + 1, k) − C(n_l, k) = K_k^old(V, V') + C(n_l, k − 1),    (4)

where C(·, ·) denotes the binomial coefficient. The change in the number of
co-occurring length-k ST phrases when we delete a vote can be calculated similarly.
The algorithm for updating the kernel values is listed in Algorithm 1. Conse-
quently, to compute the kernel values for the current frame and a support vector,
we only need to perform the operations of Eqn. (4) 2N_t times, with N_t being the
number of local features at frame t. This acceleration enables us to consider the
long-term context (a large number of previous frames) without increasing the
recognition time for each frame. This property is especially useful when detecting
complex activities that usually last a long period of time.
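The update rule of Eqn. (4), together with its delete counterpart, can be sketched as follows (a minimal, hypothetical version of Algorithm 1; the class and method names are ours):

```python
from collections import Counter
from math import comb

class IncrementalPhraseKernel:
    """Maintain K_k(V_t, S_j) as the video segment slides (Eqn. (4)).

    Stores the sparse offset space between the current segment and one
    support vector; adding or removing one frame's votes updates the
    count of co-occurring length-k phrases in O(votes changed) time.
    """

    def __init__(self, k):
        self.k = k
        self.votes = Counter()  # sparse offset space
        self.kernel = 0         # number of co-occurring length-k phrases

    def add_vote(self, loc):
        n = self.votes[loc]
        # C(n+1, k) - C(n, k) = C(n, k-1), the binomial identity
        # behind Eqn. (4).
        self.kernel += comb(n, self.k - 1)
        self.votes[loc] = n + 1

    def delete_vote(self, loc):
        n = self.votes[loc]
        # Deleting reverses the gain the n-th vote contributed.
        self.kernel -= comb(n - 1, self.k - 1)
        self.votes[loc] = n - 1
```

Per frame, one such object per support vector is updated with the votes of the incoming frame and those of the expiring frame t − T, which is exactly the 2N_t operations mentioned above.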
very sparse, with most locations having zero votes. Since we only care about the
number of votes that fall at the same location, we can simply store the non-zero
locations, which is again linear in the number of local features per video. Thus, we
have the same O(N) computational complexity as BoW for classifying a video
clip, when BoW uses a non-linear kernel such as the χ² or Gaussian kernel.
With the incremental kernel computation, computing the kernel value between
a support vector and the video segment defined for the current frame depends only
on the number of words in the current and previous frames. Therefore, making
a prediction for one frame requires O(S·N_t) computations in total, where S is the
number of support vectors and N_t is the number of words per frame.
3 Experiments
We first perform experiments on single-person activities with the KTH dataset [21],
the YouTube Action dataset [5], and a hospital surveillance dataset. The KTH
dataset is created in a controlled environment, while the hospital dataset is
collected from real surveillance cameras in a more complex environment, and
the YouTube Action dataset includes more pose and scale variations taken with
shaky cameras.
Then we evaluate our performance on the multiple-person activity recognition
problem, an upcoming topic that recent work is addressing [8, 22, 23],
with the UT interaction dataset [24] and the MPR dataset [22]. On the MPR
dataset, we evaluate our incremental algorithm (Section 2.4) for online activity detection.
Baseline: We aim to verify that the proposed bag-of-ST-phrases (BoP) repre-
sentation outperforms BoW with exactly the same local features, the same visual
words, and the same classifier (SVM). For BoW, we use the χ²-kernel, which already
captures the co-occurrence statistics of different words in a video. Therefore, the
comparison of BoW and BoP verifies whether the spatial-temporal layout of the
words helps activity recognition. Moreover, we compare favorably with other
state-of-the-art approaches [2, 13, 19, 25, 26] that incorporate spatial and/or tem-
poral information into the BoW representation.
Table 1. (a) Classification accuracies (%) with different vocabulary sizes on the KTH
dataset. (b) Accuracies (%) under each train and test setting on the KTH dataset.
We compare BoP with other methods that incorporate spatio-temporal relationships
into BoW.

(a)
Method  500   1000  2000  4000
BoW     90.8  91.2  91.3  91.5
BoP     94.0  93.8  94.1  94.0

(b)
Setting            BoW   BoP   [2]   [19]  [13]  [26]
16 train / 9 test  91.5  94.0  91.8  n/a   n/a   91.1
5-fold             92.9  94.6  n/a   92.0  n/a   n/a
leave-one-out      91.9  95.5  n/a   n/a   94.2  93.8
This dataset includes realistic surveillance videos of 6 patients in the private
sickrooms of a hospital². The goal is to distinguish abnormal behaviors of the patients
(getting up from the bed) from other, normal ones, which is also a real application
in use at the hospital. The environment of this dataset is much more complex
than the KTH dataset, since the patients perform many variations of normal
activities and many noisy motion features are detected on non-person objects
or on other people; e.g., a nurse cleaning the bed may lead to a false
positive. To collect the ground truth, we segmented the videos into 10-second
clips and manually labeled each clip. In total, the dataset has 124 positive
(abnormal) examples and 1067 negative examples. We perform a leave-one-
patient-out cross-validation experiment. We extract features similar to those for the KTH
dataset and create a vocabulary of size 500. Figure 4 shows that our approach
outperforms the BoW method.
The YouTube action dataset [5] consists of 11 categories with 1168 videos. The
videos are separated into 25 groups, which were taken in different environments
or by different photographers. Following [5], we perform a leave-one-group-out
cross-validation experiment. We extract HOG/HOF features with the code of [2],
and create a vocabulary of size 2000 with K-means.
² For patient privacy, we cannot show example images from the dataset.
716 Y. Zhang et al.
Fig. 4. Hospital Surveillance Dataset. (a) ROC curve (AUC vs. false positive rate) for
all patients. (b) The AUC score for each patient (Patient-1 through Patient-6), BoW
vs. BoP.
Fig. 5. Per-category classification accuracy (BoW vs. BoP) on the YouTube action
dataset.
Figure 5 shows the classification performance. BoW achieves 63.7% ac-
curacy averaged over the 11 categories. Our implementation of BoW achieves
performance similar to that in [5] (65.4%) when similar (motion) features are
used. With BoP, we improve the performance by 9.2%. Our BoP result (72.9%)
is also better than the final result of [5] (71.2%), which additionally adopts static
features, feature-pruning techniques, and semantic vocabulary learning. These
techniques, which improve the local features, are complementary to our work; there-
fore, further improvement can be expected when they are applied.
We notice that the main improvement is in discriminating the different swing activ-
ities, and the different categories that involve jump actions, such as basketball
shooting, trampoline jumping, and horse riding. The main reason is that BoP
can capture the ST relationships among the local movements.
and push. There are two sets in the dataset. Set 1 was recorded with a relatively
stable camera, while Set 2 was taken on a lawn on a windy day: the background
moves slightly (e.g., tree leaves move), and the set contains more camera
jitter. For each set, every type of activity has 10 video clips. These high-level
activities are more complex and involve a combination of atomic movements of
two people over a period of time. Therefore, the BoW model, which only captures
the atomic movements, may work poorly. This dataset was used in the activity
recognition contest at ICPR 2010 [24]. We use the cuboid detector [1] for local features
and a vocabulary of size 500.
Comparison: We compare BoP with BoW and the best results reported in
the contest [25]. We adopt the classification settings described for the contest,
where a 10-fold leave-one-set-out cross-validation is performed. Figure 6 shows
the confusion matrices for the different methods on Set 1. Our implementation of
BoW with a χ² kernel for the SVM achieves 76.7% accuracy, which is similar
to the rates of the BoW methods reported in the contest. Table 2(b) compares the
performance on both Set 1 and Set 2. By modeling the atomic moves relative to the
activity center, the Hough-voting-based method [25] improves on BoW by around
7%. Using BoP, we further outperform the Hough voting method by around 10%.
Analysis: Since the activities in this dataset involve more interactions of atomic
movements of different body parts and of different persons, the rich spatio-
temporal interactions modeled with BoP are beneficial. As shown in Fig. 6, BoW
confuses push vs. punch, and shake hands vs. hug, since the local movements of
each body part are similar for these activities. With the ST phrases, we can
capture the spatio-temporal combinations, thereby better discriminating these
activities.
Analysis of Spatial and Temporal Information: We show the benefit of
simultaneous spatial and temporal modeling with ST-phrases in Table 2(a). To
incorporate space alone, we still use the BoP algorithms proposed in this
paper, but discard the temporal domain when computing the co-occurring
ST phrases; in other words, the temporal domain is modeled with BoW. To
incorporate time alone, we ignore the spatial domain. According to the table,
Table 2. (a) Accuracies of BoW and of ST-phrase modeling with time alone, space
alone, and joint space-time modeling (bar chart; accuracy axis 0.7–1.0). (b) Per-activity
accuracies on the UT interaction dataset.

(b)
Dataset  Method        Avg.  Shake  Hug   Kick  Point  Push  Punch
Set 1    BoW           0.77  0.70   0.80  0.90  1.00   0.50  0.70
Set 1    Hough Voting  0.83  0.50   1.00  1.00  1.00   0.70  0.80
Set 1    BoP           0.95  1.00   1.00  1.00  0.90   0.90  0.90
Set 2    BoW           0.73  0.70   0.70  0.80  0.80   0.70  0.70
Set 2    Hough Voting  0.80  0.70   0.90  1.00  1.00   0.80  0.40
Set 2    BoP           0.90  0.80   1.00  1.00  0.80   0.90  0.90
Fig. 7. Example scenarios we aim to recognize from videos of the MPR dataset:
(a) group dispersion, (b) group formation, (c) approaching, (d) group flanking,
(e) group fighting.
both spatial and temporal information improve BoW, and incorporating spatial
information helps more than incorporating temporal information. Simultaneous
spatial and temporal modeling obtains the best result.
Fig. 8. (a) The predicted probabilities (vertical axis) of various events at each frame index
(horizontal axis) for a test video. (b) Comparison of area-under-curve (AUC) scores for
each event category (forming, disperse, flank, chasing, follow, fight) on the whole test
dataset. We compare the rule-based method [22], BoW [27], and BoP. Note that the
rules for fighting are not defined in [22].
the test videos. Although BoW can detect the events of interest well, it gener-
ates many more false positives than BoP because it lacks discrimination between
the events of interest and normal behaviors. Figure 8(b) compares BoP with
previous work using the AUC (area under curve) scores of ROC curves for the differ-
ent categories. BoP improves on the previous work, especially for the group forming,
group following, and group fighting events, where the local group changes are
easily confused with those of random behaviors.
Running Time: We implement our approach in C++ and Python on an In-
tel 2.4 GHz dual-core computer. Excluding person tracking and feature extraction
(around 0.02 s per frame), classifying a 640×480 frame using a 4-second context
takes less than 5 milliseconds with the incremental algorithm proposed in Sec-
tion 2.4. Therefore, we can perform event detection in real time on continuous
video streams. In comparison, directly classifying the 4-second context for each
frame takes more than 200 milliseconds.
4 Conclusion
We proposed spatio-temporal (ST) phrases to model rich spatial and tem-
poral relationships among local features, and presented an algorithm which
can identify all ST phrases in time linear in the number of local features in a
video. Experimental results demonstrated that our approach improves on state-of-
the-art approaches to activity recognition. Our approach is independent of the local
feature representation and is widely applicable to a large variety of activities, as
illustrated by the diversity of experimental datasets.
References
1. Dollar, P., Rabaud, V., Cottrell, G., Belongie, S.J.: Behavior recognition via sparse
spatio-temporal features. In: PETS Workshop (2005)
2. Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human
actions from movies. In: CVPR (2008)
3. Willems, G., Tuytelaars, T., Van Gool, L.: An Efficient Dense and Scale-Invariant
Spatio-Temporal Interest Point Detector. In: Forsyth, D., Torr, P., Zisserman, A.
(eds.) ECCV 2008, Part II. LNCS, vol. 5303, pp. 650–663. Springer, Heidelberg (2008)
4. Wang, X.G., Ma, X.X., Grimson, W.E.L.: Unsupervised activity perception in
crowded and complicated scenes using hierarchical Bayesian models. PAMI 31,
539–555 (2009)
5. Liu, J.G., Luo, J.B., Shah, M.: Recognizing realistic actions from videos in the
wild. In: CVPR (2009)
6. Liu, J.G., Shah, M.: Learning human actions via information maximization. In:
CVPR (2008)
7. Wong, S.F., Kim, T.K., Cipolla, R.: Learning motion categories using both seman-
tic and structural information. In: CVPR (2007)
8. Gaur, U., Zhu, Y., Song, B., Roy-Chowdhury, A.: A "string of feature graphs"
model for recognition of complex activities in natural videos. In: ICCV (2011)
9. Wang, P., Abowd, G.D., Rehg, J.M.: Quasi-periodic event analysis for social game
retrieval. In: ICCV (2009)
10. Duan, L., Xu, D., Tsang, I.W.H., Luo, J.: Visual event recognition in videos by
learning from web data. In: CVPR (2010)
11. Nowozin, S., Bakir, G., Tsuda, K.: Discriminative subsequence mining for action
classification. In: ICCV (2007)
12. Sun, J., Wu, X., Yan, S.C., Cheong, L.F., Chua, T.S., Li, J.T.: Hierarchical spatio-
temporal context modeling for action recognition. In: CVPR (2009)
13. Savarese, S., Pozo, A.D., Niebles, J.C., Li, F.F.: Spatial-temporal correlatons for
unsupervised action classification. In: WMVC (2008)
14. Gilbert, A., Illingworth, J., Bowden, R.: Scale Invariant Action Recognition Using
Compound Features Mined from Dense Spatio-temporal Corners. In: Forsyth, D.,
Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 222–233.
Springer, Heidelberg (2008)
15. Kovashka, A., Grauman, K.: Learning a hierarchy of discriminative space-time
neighborhood features for human action recognition. In: CVPR (2010)
16. Zhang, Y., Jia, Z., Chen, T.: Image retrieval with geometry-preserving visual
phrases. In: CVPR (2011)
17. Wu, S., Moore, B.E., Shah, M.: Chaotic invariants of lagrangian particle trajectories
for anomaly detection in crowded scenes. In: CVPR (2010)
18. Messing, R., Pal, C., Kautz, H.A.: Activity recognition using the velocity histories
of tracked keypoints. In: ICCV (2009)
19. Yao, A., Gall, J., Van Gool, L.: A Hough transform-based voting framework for
action recognition. In: CVPR (2010)
20. Zhang, Y., Chen, T.: Efficient kernels for identifying unbounded-order spatial fea-
tures. In: CVPR (2009)
21. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: a local SVM ap-
proach. In: ICPR (2004)
22. Chang, M.C., Krahnstoever, N., Ge, W.: Probabilistic group-level motion analysis
and scenario recognition. In: ICCV (2011)
Spatio-Temporal Phrases for Activity Recognition 721
23. Brendel, W., Todorovic, S.: Learning spatiotemporal graphs of human activities.
In: ICCV (2011)
24. Ryoo, M.S., Chen, C.-C., Aggarwal, J.K., Roy-Chowdhury, A.: An Overview of
Contest on Semantic Description of Human Activities (SDHA) 2010. In: Unay, D.,
Cataltepe, Z., Aksoy, S. (eds.) ICPR 2010. LNCS, vol. 6388, pp. 270–285. Springer,
Heidelberg (2010)
25. Waltisberg, D., Yao, A., Gall, J., Van Gool, L.: Variations of a Hough-Voting Action
Recognition System. In: Unay, D., Cataltepe, Z., Aksoy, S. (eds.) ICPR 2010. LNCS,
vol. 6388, pp. 306–312. Springer, Heidelberg (2010)
26. Ryoo, M.S., Aggarwal, J.K.: Spatio-temporal relationship match: Video structure
comparison for recognition of complex human activities. In: ICCV (2009)
27. Zhang, Y., Ge, W., Chang, M.C., Liu, X.: Group context learning for event recog-
nition. In: WACV (2012)
Complex Events Detection
Using Data-Driven Concepts
1 Introduction
User uploaded videos on the internet have been growing explosively in recent
years. Automatic event detection in videos is an interesting and important task
with great potential for many applications, such as on-line video search and
indexing, consumer content management, etc. However, it is a very challeng-
ing task to deal with large corpora of unconstrained videos with huge content
variations and uncontrolled capturing conditions.(as illustrated in Fig.1).
Common approaches to event recognition rely on hand-crafted low-level fea-
tures such as SIFT [1], STIP [2], and MFCC [3], and on human-defined high-level con-
cepts [4]. The use of high-level semantic concepts has proven effective in
representing complex events [5]. However, how to discover a powerful set of se-
mantic concepts is still unclear and has not been investigated in previous work.
The drawbacks of human-defined concepts include: (1) it is hard to extend these
concepts to a larger scale, (2) they cannot handle multiple modalities, and (3)
the concepts do not generalize well to new datasets.
In this paper, we propose a novel unsupervised approach to discover event
concepts directly from training data in three modalities (audio, image frames
and video).
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 722–735, 2012.
© Springer-Verlag Berlin Heidelberg 2012
Complex Events Detection Using Data-Driven Concepts 723
Fig. 1. Randomly selected example videos from our dataset. Each row shows frames
of four videos from two categories.
We first learn low-level features for each modality using Topographic Indepen-
dent Component Analysis (TICA), which has shown superior performance over
popular hand-designed features in [6]. Then we map our low-level features to
more compact representations using deep belief networks (DBN) [7]. After that,
our data-driven concepts are learned by clustering the training data in a low-
dimensional space using vector quantization (VQ). This dimension-reduction
step is crucial for producing reasonable clustering results. Finally, we merge the
concepts from the three different modalities by learning compact sparse representa-
tions. The whole framework is shown in Fig. 2.
We argue that unsupervised learning of concepts is appropriate for two
reasons. First, the disconnect between a limited linguistic vocabulary and the complexity
of real-world events makes human definition of visual concepts very hard, if not
impossible. We will later show that a large number of learned concepts helps to im-
prove recognition accuracy significantly. Second, most of the time, insufficient
annotated data prevents us from learning concepts in a supervised manner. We
present an extensive evaluation of our method. The results show that our proposed
approach significantly outperforms popular baselines.
The rest of the paper is organized as follows. We first review the related
literature in Section 2. The proposed method is presented in Sections 3–5 in the
following order: low-level feature learning, data-driven concept discovery, and
event representation learning. Extensive experimental results, comparisons, and
analysis are reported in Section 6. Finally, we conclude in Section 7.
2 Related Work
Most existing works [8,5,9] investigate different hand-crafted features such
as SIFT [1], STIP [2], Dollar [10], and MFCC [3]. Recently, there has been
growing interest in learning visual features using biologically inspired networks,
such as Independent Component Analysis (ICA) [11] and Independent Subspace
Analysis (ISA) [12]. In [13], Le et al. show that with 3D (spatio-temporal)
filters learned by ISA, action recognition performance is comparable to other hand-
designed features. In [6], TICA, another extension of ICA, was proposed for
static images and achieves state-of-the-art performance on object recognition.
724 Y. Yang and M. Shah
Fig. 2. Framework of the proposed method. Each video is divided into short clips.
We first learn low-level features for each modality using Topographic Independent
Component Analysis (TICA). Then we map the low-level features to more compact
representations using deep belief networks (DBN). After that, the data-driven concepts
are learned by clustering the training data in a low-dimensional space using vector
quantization (VQ). Finally, we merge the concepts from the three different modalities by
learning compact sparse representations.
where V ∈ R^{m×m} is a fixed matrix that encodes the topography of the hidden
units H, and m is the number of hidden units in the first layer.
In the filter learning process, the optimal S is learned by minimizing:

    S* = arg min_S  Σ_{p=1}^{T} Σ_{i=1}^{m} r_i(x^{(p)}; S, V),    (3)
         s.t.  S Sᵀ = I
where T is the total number of training samples. The orthogonality constraint
S Sᵀ = I provides competitiveness and ensures that the learned features are
diverse. In the feature extraction process, given S and a new whitened local
raw signal x, the activation R of the second layer serves as the feature
of x.
Considering that our data are quite diverse and very large, we argue
that learning good features directly from the data is very effective. We choose
TICA as our low-level building block because of two advantages: feature
robustness and lower computational complexity. The pooling architecture of TICA
ensures that the learned features are invariant to slight location and orientation shifts,
and selective to frequency, rotation, and motion velocity. The filter learning is
much faster than for other methods such as the GRBM [21], since the gradient of the
objective function, Eqn. (3), is tractable. Feature extraction is also fast compared
with sparse coding, as the feature is simply computed through matrix–vector
products.
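The two-layer activation described above can be sketched as follows (an illustrative NumPy sketch; the actual filter dimensions and the topography matrix V follow [6]). The first layer computes linear filter responses, and each second-layer unit pools the squared responses of its topographic neighborhood:

```python
import numpy as np

def tica_features(x, S, V):
    """Two-layer TICA activation for a whitened input signal x.

    First layer: linear filter responses s = S @ x (S has orthonormal rows,
    enforced by the constraint S S^T = I during learning).
    Second layer: each pooled unit r_i = sqrt(sum_j V[i, j] * s_j^2), where
    the fixed matrix V encodes which first-layer units are pooled together
    (the topography of the hidden units)."""
    s = S @ x
    return np.sqrt(V @ (s ** 2))
```

Because the pooling sums squared responses over a neighborhood, small shifts of the input that move energy between neighboring first-layer units leave the pooled feature nearly unchanged, which is the invariance property mentioned above.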
Here, σ is the standard deviation of the Gaussian density, l_j are the hidden unit
variables, q_i are the visible unit variables, w_ij is the weight connecting q_i and
l_j, and c_i and b_j are the bias terms of the visible and hidden units, respectively. The
learning process estimates w_ij, c_i and b_j by minimizing the energy of
states drawn from the data distribution, and raising the energy of states that are
improbable given the data. We follow [24] in using contrastive divergence learning,
which gives an efficient approximation to the gradient of the energy function.
Further, in each iteration we apply the contrastive divergence update rule, followed
by one step of gradient descent using the gradient of the regularization term.
Once we finish training a layer of the network, we feed its output values
as the inputs of the next higher layer. Finally, after training all the
layers, we obtain the clip representation as the output of the last layer, denoted
Y ∈ R^d. By doing so, we map the original features to a much lower-dimensional
space, since D ≫ d.
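One contrastive-divergence (CD-1) update can be sketched as follows for a binary RBM (illustrative only: the paper's visible units are Gaussian, and a sparsity-regularization gradient step following [24] is applied after each update; both are omitted here for brevity):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(W, b, c, v0, lr=0.1, rng=None):
    """One CD-1 parameter update for a binary restricted Boltzmann machine.

    v0: batch of visible vectors (n x d); W: weights (d x h);
    b: visible biases (d,); c: hidden biases (h,)."""
    rng = rng or np.random.default_rng(0)
    h0 = sigmoid(v0 @ W + c)                           # hidden probabilities
    h0_s = (rng.random(h0.shape) < h0).astype(float)   # sampled hidden states
    v1 = sigmoid(h0_s @ W.T + b)                       # one-step reconstruction
    h1 = sigmoid(v1 @ W + c)
    n = v0.shape[0]
    W += lr * (v0.T @ h0 - v1.T @ h1) / n              # approximate gradient
    b += lr * (v0 - v1).mean(axis=0)
    c += lr * (h0 - h1).mean(axis=0)
    return W, b, c
```

Greedy layer-wise training then feeds the hidden activations of a trained layer as the visible data of the next layer, as described above.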
Building Concepts: The low-dimensional representations Y of similar
clips are then grouped into concepts with a semantic meaning. In our framework,
concepts are obtained from the three modalities separately, and each event video is
represented by the occurrence frequency of each concept from the three modalities,
denoted Z.
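The concept-building step can be sketched as plain k-means vector quantization over the clip representations, followed by a concept-occurrence histogram per video (an illustrative sketch; the actual concept counts per modality are as discussed in Section 6):

```python
import numpy as np

def build_concepts(Y, k, iters=20, rng=None):
    """Cluster clip representations Y (n x d) into k 'concepts'
    with plain k-means (vector quantization)."""
    rng = rng or np.random.default_rng(0)
    centers = Y[rng.choice(len(Y), k, replace=False)]
    for _ in range(iters):
        dist = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dist.argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Y[labels == j].mean(0)
    return centers

def video_representation(clip_feats, centers):
    """Z: normalized occurrence frequency of each concept over a video's clips."""
    dist = ((clip_feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    hist = np.bincount(dist.argmin(1), minlength=len(centers)).astype(float)
    return hist / hist.sum()
```

Running this per modality and concatenating is the simple alternative; the paper instead merges the per-modality concepts by learning compact sparse representations, as described above.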
An SVM with a χ² kernel is used for classification [25].
6 Experiments
In this section, we describe the datasets and discuss several interesting
observations.
Fig. 3. 24 of the 600 audio filters learned from the TRECVID event collection
Fig. 4. 48 of the 600 image (2D) filters learned from the TRECVID event collection. Since
the training patches have 3 channels (RGB), the learned filters also have 3 channels.
The color information of the filters mainly captures scene concepts, such as indoor
and outdoor.
Fig. 5. 12 of the 600 spatiotemporal (3D) filters learned from the TRECVID event collec-
tion. The filter size is 20×20×10. Each row shows two 3D filters.
data dimension from 4,000 to 100. Fig. 6 shows that when the dimension of the
clip representation is reduced from 4,000 to 1,000, the event detection MAP in-
creases from 39% to 66%, reaching its highest point at 1,000 dimensions.
This supports our assumption that the initial representation of the clip lies in
a high-dimensional space where Euclidean distance cannot measure the true
similarity, and that the DBN learns the regularity between the clips correctly. If we keep
decreasing the dimension of the clip representation, the accuracy goes down:
the high-dimensional data is compressed into too concise a space, and
some useful information may be lost. Further, we repeat the same exper-
iments using other manifold-learning methods, namely PCA, Eigenmaps, and
LLE. Figure 6 shows the detection results; it is clear that DBN performs signif-
icantly better.
Fig. 7 shows the classification results based on different numbers of concepts
using the three modalities and their combination. The results show that motion
concepts play an important role in the event detection problem, and that a large
number of concepts helps recognition, mainly because a larger pool of
concepts captures finer-level variations of actions, e.g., a running action from dif-
ferent viewpoints. However, when the number of concepts increases further, the
accuracy drops, presumably because the training video samples become insufficient
and the SVM overfits.
We observe that the curve for the audio signal (TICA 1D) peaks at 500 con-
cepts, while those for scene (TICA 2D) and motion (TICA 3D) reach their highest
performance much later. This implies that the underlying variation of the audio
signal is less than that of the motion and 2D scene signals, which is consistent
with common sense. Combined concepts (TICA 1D+2D+3D) achieve the best
results, since audio, scene, and motion concepts capture complementary infor-
mation in the videos. We also use the STIP feature to run the same experiments.
Fig. 8. (a) A comparison of event detection using different numbers of concepts dis-
covered using the three modalities jointly and separately. The former first fuses the low-level
features from the three modalities as the initial clip representation and then learns con-
cepts jointly; the latter discovers the audio, scene, and motion concepts separately. The
results show that concepts based on the modalities separately outperform the
early-fusion ones. (b) A comparison of event detection using the sparse representation and
the bag-of-concepts representation. The sparse representation works consistently better than
the bag-of-concepts representation.
The performance is significantly worse than with the TICA features. This shows that
low-level features are important for discovering meaningful data-driven concepts.
Fig. 9. Comparison of our proposed method with the combination of MFCC, SIFT,
and STIP features in terms of detection accuracy on each event category. Our mean
average precision is 68.2%; the MAP of the combination method is 51.1%.
Interestingly, the model using 62 human-defined concepts trained in a super-
vised manner outperforms its counterpart using the same number of concepts
discovered in an unsupervised way. This suggests that more supervision helps when
only a small number of concepts is used. However, as the number of concepts
increases, data-driven concepts perform better. This shows that a large number
of concepts improves event detection.
We also compare the performance of concepts discovered using early feature
fusion of the three modalities to concepts learned from the three modalities sepa-
rately. Fig. 8(a) shows that discovering concepts separately is always better.
References
1. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Com-
put. Vision 60, 91–110 (2004)
2. Laptev, I., Lindeberg, T.: Space-time interest points. In: ICCV, pp. 432–439 (2003)
3. Rabiner, L., Juang, B.: Fundamentals of Speech Recognition, Englewood Clis,
New Jersey. Prentice-Hall Signal Processing Series (1993)
4. Loui, A.C., Luo, J., Chang, S.F., Ellis, D., Jiang, W., Kennedy, L.S., Lee, K.,
Yanagawa, A.: Kodak's consumer video benchmark data set: concept definition
and annotation. In: Multimedia Information Retrieval, pp. 245–254 (2007)
5. Wei, X.Y., Jiang, Y.G., Ngo, C.W.: Concept-driven multi-modality fusion for video
search. IEEE Trans. Circuits Syst. Video Techn. 21, 62–73 (2011)
6. Le, Q.V., Ngiam, J., Chen, Z., Chia, D., Koh, P.W., Ng, A.Y.: Tiled convolutional
neural networks. In: NIPS 2010 (2010)
7. Hinton, G.E., Osindero, S., Whye Teh, Y.: A fast learning algorithm for deep belief
nets. Neural Computation (2006)
8. Schindler, G., Zitnick, L., Brown, M.: Internet video category recognition. In:
CVPRW 2008, pp. 1–7 (2008)
9. Wang, Z., Zhao, M., Song, Y., Kumar, S., Li, B.: Youtubecat: Learning to categorize
wild web videos. In: CVPR 2010, pp. 879–886 (2010)
10. Dollar, P., Rabaud, V., Cottrell, G., Belongie, S.: Behavior recognition via sparse
spatio-temporal features. In: VS-PETS (2005)
11. van Hateren, J.H., Ruderman, D.L.: Independent component analysis of natural
image sequences yields spatiotemporal filters similar to simple cells in primary
visual cortex (1998)
12. Hyvarinen, A., Hoyer, P.: Emergence of phase- and shift-invariant features by de-
composition of natural images into independent feature subspaces. Neural Compu-
tation (2000)
13. Le, Q., Zou, W., Yeung, S., Ng, A.: Learning hierarchical invariant spatio-temporal
features for action recognition with independent subspace analysis. In: CVPR 2011,
pp. 3361–3368 (2011)
14. Liu, X., Huet, B.: Automatic concept detector refinement for large-scale video
semantic annotation. In: 2010 IEEE Fourth International Conference on Semantic
Computing (ICSC), pp. 97–100 (2010)
15. Liu, J., Luo, J., Shah, M.: Recognizing realistic actions from videos in the wild.
In: CVPR (2009)
16. Rodriguez, M.D., Ahmed, J., Shah, M.: Action MACH: a spatio-temporal maximum
average correlation height filter for action recognition. In: CVPR. IEEE Computer
Society (2008)
17. Hinton, G., Salakhutdinov, R.: Reducing the dimensionality of data with neural
networks. Science 313, 504–507 (2006)
18. Snoek, C.G.M.: Early versus late fusion in semantic video analysis. In: ACM Multi-
media, pp. 399–402 (2005)
19. Olshausen, B.A.: Sparse coding of time-varying natural images. In: Proc. of the
Int. Conf. on Independent Component Analysis and Blind Source Separation,
pp. 603–608 (2000)
20. Hyvarinen, A., Hoyer, P., Inki, M.: Topographic ICA as a model of V1 receptive
fields. In: IJCNN 2000, vol. 4, pp. 83–88 (2000)
21. Taylor, G.W., Fergus, R., LeCun, Y., Bregler, C.: Convolutional Learning of Spatio-
temporal Features. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010,
Part VI. LNCS, vol. 6316, pp. 140–153. Springer, Heidelberg (2010)
22. Trecvid (2011), http://www-nlpir.nist.gov/projects/tv2011/tv2011.html
23. van Gemert, J.C., Veenman, C.J., Smeulders, A.W.M., Geusebroek, J.M.: Vi-
sual word ambiguity. IEEE Transactions on Pattern Analysis and Machine In-
telligence 32, 1271–1283 (2010)
24. Lee, H., Ekanadham, C., Ng, A.Y.: Sparse deep belief net model for visual area V2.
In: Advances in Neural Information Processing Systems 20, pp. 873–880 (2008)
25. Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines. ACM
Transactions on Intelligent Systems and Technology 2, 27:1–27:27 (2011). Software
available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
26. Wang, H., Ullah, M.M., Klaser, A., Laptev, I., Schmid, C.: Evaluation of local
spatio-temporal features for action recognition. In: British Machine Vision Confer-
ence, p. 127 (2009)
27. Wang, H., Klaser, A., Schmid, C., Liu, C.L.: Action recognition by dense trajecto-
ries. In: CVPR 2011, pp. 3169–3176 (2011)
Directional Space-Time Oriented Gradients
for 3D Visual Pattern Analysis
Abstract. Various visual tasks such as the recognition of human actions, ges-
tures, facial expressions, and classification of dynamic textures require modeling
and the representation of spatio-temporal information. In this paper, we propose
representing space-time patterns using directional spatio-temporal oriented gra-
dients. In the proposed approach, a 3D video patch is represented by a histogram
of oriented gradients over nine symmetric spatio-temporal planes. Video com-
parison is achieved through a positive definite similarity kernel that is learnt by
multiple kernel learning. A rich spatio-temporal descriptor with a simple trade-off
between discriminatory power and invariance properties is thereby obtained. To
evaluate the proposed approach, we consider three challenging visual recognition
tasks, namely the classification of dynamic textures, human gestures and human
actions. Our evaluations indicate that the proposed approach attains significant
classification improvements in recognition accuracy in comparison to state-of-
the-art methods such as LBP-TOP, 3D-SIFT, HOG3D, tensor canonical correla-
tion analysis, and dynamical fractal analysis.
1 Introduction
The goal of visual pattern recognition is to detect the presence of a particular object or
pattern in a given image or video. This usually involves representing patterns in a suit-
able feature space to achieve robustness against a broad range of environmental changes
like photometric variations, occlusions, background clutter, geometric transformations,
and variations in view angle.
The study of space-time patterns such as human actions, gestures, facial expressions,
and dynamic textures has attracted growing attention, mainly due to the wide range of
applications in the real world [1]. One of the major themes of research in space-time pattern
classification is shaped around devising robust spatio-temporal local descriptors [1,2].
Broadly speaking, spatio-temporal local descriptors can be categorized into three main
classes.
The first and the largest category includes the direct extension of 2D local descriptors
to 3D. The underlying idea is to replace the rectangular regions in 2D descriptors with
3D patches and recast the functions/procedures from the spatial domain (i.e., (x, y))
into the spatio-temporal space (i.e., (x, y, t)). Examples include 3D-SIFT by Scovanner
et al. [3], HOG3D by Klaser et al. [4], Volume Local Binary Patterns (VLBP) by Zhao
et al. [5] and Extended-SURF (ESURF) by Willems et al. [6].
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 736–749, 2012.
© Springer-Verlag Berlin Heidelberg 2012
Directional Space-Time Oriented Gradients for 3D Visual Pattern Analysis 737
In the second category, the static (i.e., spatial) and dynamic (i.e., temporal) properties
of a given 3D patch are independently captured and then merged to form the descriptor.
The HOG/HOF descriptor proposed by Laptev et al. [7] is an example of this school
of thought. Other examples of this category include combining appearance and motion
descriptors for recognizing human actions as proposed by Schindler et al. [8] and Huang
et al. [9].
In the third category, no direct extension from 2D to 3D is considered. Instead, a
set of 2D local descriptors are extracted from three orthogonal spatio-temporal planes
within a 3D patch. Local Binary Patterns on Three Orthogonal Planes (LBP-TOP) pro-
posed by Zhao et al. [5] was one of the very first attempts in this direction and has been
successfully applied in classification of various space-time patterns such as dynamic
textures, facial expressions [5] and human actions [10]. In LBP-TOP, a video patch
is encoded by computing LBP features over the three orthogonal XY , XT , and Y T
planes. Other examples of encoding video patches by three orthogonal planes include
extensions of Weber's local descriptor (WLD) [11] and Local Phase Quantization
(LQP) [12] to WLD-TOP [10] and LQP-TOP [13], respectively.
Despite considerable success, two major shortcomings are associated with the notion of three orthogonal planes (TOP). First, TOP cannot optimally capture the dynamical properties of a given 3D patch, mainly because dynamical information is encoded by only two planes (XT and YT). Second, to create the video descriptor, the descriptors associated with the XY, XT and YT planes are simply concatenated. This is restrictive and may lead to inappropriate modeling in more complex recognition scenarios.
With the aim of addressing the aforementioned shortcomings, in this paper we extend
the TOP approach to Nine Symmetry Planes (NSP). Unlike the TOP approach, NSP
benefits from six more diagonal planes to model the dynamical information of a video
patch. In our proposal, a video patch is described by a Histogram of Oriented Gradients
(HOG) using gradient information in all nine symmetry planes. Moreover, to overcome the second shortcoming, we propose to learn the optimal fusion of sub-descriptors via a positive definite similarity kernel. This can be done through the notion of Multiple Kernel Learning (MKL) and is subtly different from previous studies on feature fusion¹. As such, our proposed method, HOG-NSP, is free from concatenation at any level.
1.1 Contributions
The three main novelties of this work are: (i) we propose to encode a video patch through nine spatio-temporal symmetry planes; (ii) we apply a positive definite similarity kernel, learnt by multiple kernel learning, to compare two videos; and (iii) we analyze the proposed descriptor and contrast it against state-of-the-art methods in three visual recognition tasks, namely the classification of dynamic textures, hand gestures and human actions.
The remainder of the paper is organized as follows: We elaborate on the NSP ap-
proach in Section 2. In Section 3 we compare the performance of the proposed method
¹ While previous studies employ MKL to fuse features of different kinds, such as color, texture, and motion, to the authors' knowledge this is the first paper that adapts MKL to fuse sub-descriptors of the same kind.
738 E. Norouznezhad et al.
Fig. 1. The HOG-NSP feature extraction procedure: (a) input space-time pattern as a 3D volume;
(b) extract cuboid patches on dense spatio-temporal grid; (c) nine symmetry planes within the
patch are considered for feature extraction with each plane corresponding to a specific space-
time direction; (d) extract oriented gradients on each plane and represent them as a locally normalized histogram.
with previous approaches on the aforementioned visual classification tasks. Finally, the
main findings and possible future directions are summarised in Section 4.
2 Proposed Approach
The proposed method consists of three main components:
1. Space-Time Plane Descriptor. A video is decomposed into a set of 3D cuboid patches. Each cuboid patch is then represented by histograms of oriented gradients over nine spatio-temporal symmetry planes. The final descriptor of a 3D cuboid patch is obtained via the bag-of-features (BoF) framework. As such, a dictionary is trained for each spatio-temporal plane and used to encode 3D cuboid patches.
2. Video Sub-Descriptors. A video is described by a set of sub-descriptors using a spatio-temporal pyramid. More specifically, an individual representation of a given video is created in each channel of the spatio-temporal pyramid. The total number of sub-descriptors is equal to the number of plane-descriptors multiplied by the number of spatio-temporal grids.
3. Similarity-Based Classification. Video comparison is achieved through a positive
definite similarity kernel that is learnt by multiple kernel learning. The similarity
kernel takes into account all video sub-descriptors and can be seen as a form of feature fusion.
Each of the components is explained in detail in the following sections.
Fig. 2. The nine symmetry planes: (a) the top row illustrates the XY plane and its rotations around the X axis; (b) the middle row shows the YT plane and its rotations around the Y axis; (c) the bottom row shows the XT plane and its rotations around the T axis. Note that we are left with nine planes since some planes are repeated twice (P(X,0) = P(Y,0), P(T,90) = P(Y,90) and P(T,0) = P(X,90)).
Space-time interest point detectors [14] can be used to represent a video sparsely. However, in this study we confine
ourselves to dense grid-based representation. This is mainly driven by the recent success
of dense representation in various recognition scenarios [15,1].
Within each cuboid patch, nine space-time directions are considered for feature extraction. Each space-time direction corresponds to one symmetry plane inside the cuboid patch, as shown in Fig. 1. The planes can be obtained systematically by rotating the XY, XT and YT planes. For example, by rotating the XY plane around the X axis by 45°, 90° and 135°, three planes orthogonal to the YT plane are attained. The rotated planes are shown in the top row of Fig. 2 and labeled P(X,45), P(X,90) and P(X,135), respectively. Similarly, three planes orthogonal to the XT plane and three planes orthogonal to the XY plane are constructed by rotating the YT and XT planes, as shown in the middle and bottom rows of Fig. 2. Since some planes are repeated twice (P(X,0) = P(Y,0), P(T,90) = P(Y,90) and P(T,0) = P(X,90)), we are left with nine distinct planes.
Having nine symmetry planes at our disposal, a cuboid patch is represented by normalized gradient histograms over the spatio-temporal planes. This is accomplished by dividing each plane into smaller blocks of size M × N and computing a histogram of oriented gradients within each block. For normalization, we pursue a strategy similar to Lowe's [16]: the descriptor in each plane is normalized using the l2-norm with a predefined cut value c. In our experiments, the cut value is set to c = 0.25. Furthermore, the sub-block and histogram parameters are set to M = 4, N = 4 and B = 8, which results in a 128-dimensional plane descriptor.
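As an illustration, the per-plane descriptor computation might be sketched as follows. This is a minimal sketch, assuming a grayscale plane and unsigned gradient orientations; the block and bin conventions are illustrative assumptions rather than the paper's exact implementation:

```python
import numpy as np

def plane_hog(plane, M=4, N=4, B=8, c=0.25):
    """HOG over one spatio-temporal plane: M x N sub-blocks, B orientation
    bins per block, then l2-normalize, clip at the cut value c, and
    renormalize (Lowe-style). Conventions here are illustrative."""
    gy, gx = np.gradient(plane.astype(np.float64))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)       # unsigned orientation in [0, pi)
    H, W = plane.shape
    desc = np.zeros((M, N, B))
    for i in range(M):
        for j in range(N):
            ys = slice(i * H // M, (i + 1) * H // M)
            xs = slice(j * W // N, (j + 1) * W // N)
            bins = np.minimum((ang[ys, xs] / np.pi * B).astype(int), B - 1)
            for b in range(B):
                desc[i, j, b] = mag[ys, xs][bins == b].sum()
    v = desc.ravel()
    v /= np.linalg.norm(v) + 1e-12                # l2-normalize
    v = np.minimum(v, c)                          # clip at cut value c = 0.25
    v /= np.linalg.norm(v) + 1e-12                # renormalize
    return v                                      # M*N*B = 128 dims by default
```

For a 3D cuboid this would be run once per symmetry plane, giving nine 128-dimensional plane descriptors.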
HOG-NSP should not be confused with other 3D video descriptors such as HOG3D or 3D-SIFT, which represent 3D video patches by measuring and quantizing 3D gradient orientations using a regular n-sided polyhedron. In contrast,
HOG-NSP uses directional spatio-temporal oriented gradients to model static and dy-
namic properties of 3D patches. In other words, local oriented gradients in nine discrete
space-time directions (planes) within a 3D patch are used for representation.
Fig. 3. The classification procedure: (a) input space-time pattern; (b) separate visual dictionaries (codebooks) are learned by quantizing the feature space for each plane-descriptor using the training dataset; each codebook has a number of visual words, of which only three are shown here for illustration purposes; (c) HOG-NSP features are extracted for a given video; (d) spatio-temporal pyramids are created for representing videos; for illustrative purposes, only three levels (L = 0, 1, 2) are shown; (e) an individual representation of a given video is created in each channel; the number of channels is equal to the number of plane-descriptors (i.e., 9) multiplied by the number of spatio-temporal grids (i.e., 1, 8 and 64 for L = 0, 1, 2 respectively); (f) the representation for each channel is assigned a kernel and multiple kernel learning (MKL) is used to find the optimal weights for the kernels; an SVM is used for classification.
2.2 Bag-of-Features
Given a set of spatio-temporal local descriptors, the next step is to represent the video based on the extracted descriptors. We use the bag-of-features (BoF) framework for this purpose. In the BoF framework, the first step is to construct a visual dictionary, called a codebook. This is achieved by clustering the descriptors extracted from the videos in the training set using k-means, which results in nine separate visual vocabularies for the nine plane-descriptors. In our experiments, the number of visual words in each dictionary is set to 800 by default, which we found empirically to give good results on the datasets under test. Given a video, the extracted plane-descriptors are assigned to their closest visual words using the Euclidean distance. The occurrences of the visual words of each plane-descriptor can then be used to represent a given video.
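To make the encoding concrete, here is a minimal numpy-only sketch. The plain k-means loop and the normalized-histogram output are illustrative assumptions; only the Euclidean nearest-word assignment and the 800-word default follow the text:

```python
import numpy as np

def build_codebook(train_descs, k=800, iters=20, seed=0):
    """Plain k-means (Euclidean) to learn k visual words; one codebook
    is trained per plane-descriptor, i.e. nine codebooks in total."""
    rng = np.random.default_rng(seed)
    centers = train_descs[rng.choice(len(train_descs), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(train_descs[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            pts = train_descs[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers

def bof_histogram(descs, codebook):
    """Assign each plane-descriptor to its closest visual word and
    count occurrences -> the video's bag-of-features representation."""
    d = np.linalg.norm(descs[:, None] - codebook[None], axis=2)
    words = d.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)
```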
The SPM approach was later extended to the spatio-temporal domain and utilized for representing space-time patterns in videos [7,18]. In our proposed approach, Spatio-Temporal Pyramid Matching (STPM) is employed to encode global structural information into the representation. A sequence of increasingly finer binary partitions is constructed at resolutions 0, . . . , L, such that the partition at level l has 2^l cells along each dimension. Fig. 3 illustrates the STPM approach for L = 2. Combining the spatio-temporal grids at L = 0, 1, 2 results in 73 spatio-temporal grids.
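The grid counts can be checked directly; a small sketch assuming 2^l cells along each of the three dimensions at level l:

```python
def stpm_cells(L):
    """Cells per pyramid level: level l has 2**l cells along each of the
    three (x, y, t) dimensions, i.e. (2**l)**3 cells in total."""
    return [(2 ** l) ** 3 for l in range(L + 1)]

cells = stpm_cells(2)        # [1, 8, 64] for levels 0, 1, 2
total_grids = sum(cells)     # 73 spatio-temporal grids
channels = 9 * total_grids   # one channel per plane-descriptor per grid
```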
\chi^2(x, y) = \sum_{j=1}^{n_{dic}} \frac{(x(j) - y(j))^2}{x(j) + y(j)} \qquad (2)
The kernel combination weights di can be learnt through a convex optimization problem, as discussed for example in [20]. Having a positive definite similarity kernel at our disposal, Support Vector Machines can be utilized to classify video sequences.
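A sketch of the per-channel χ² kernel and its convex combination might look as follows. The exponential kernel form and the unit-sum weight normalization are common conventions assumed here; the paper learns the weights di with MKL (e.g. SimpleMKL [20]) rather than fixing them:

```python
import numpy as np

def chi2_distance(x, y, eps=1e-12):
    """Chi-square distance between two BoF histograms (Eq. 2)."""
    return np.sum((x - y) ** 2 / (x + y + eps))

def chi2_kernel(X, Y, gamma=1.0):
    """K(x, y) = exp(-gamma * chi2(x, y)) for all pairs of rows."""
    K = np.empty((len(X), len(Y)))
    for i, x in enumerate(X):
        for j, y in enumerate(Y):
            K[i, j] = np.exp(-gamma * chi2_distance(x, y))
    return K

def combined_kernel(kernels, d):
    """MKL-style similarity: convex combination sum_i d_i K_i with
    d_i >= 0 and sum d_i = 1; the weights d would be learnt."""
    d = np.asarray(d, float)
    d = np.clip(d, 0, None)
    d /= d.sum()
    return sum(w * K for w, K in zip(d, kernels))
```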
Fig. 4. Representative examples from the UCLA dynamic texture database [24] used for evaluation
is achieved through locally measuring the static and dynamic properties of a given pattern. Furthermore, our experiments demonstrate that utilizing a multi-plane approach for feature extraction and reducing the angle between spatio-temporal planes (to 45°) further increases the robustness of the proposed descriptor.
3 Experiments
To evaluate and contrast the performance of the proposed HOG-NSP approach, three sets of experiments are considered in this section. First, we appraise HOG-NSP against state-of-the-art methods such as tensor canonical correlation analysis [21], dynamical fractal analysis [22] and kinematic features with multiple instance learning [23] in classifying dynamic textures, hand gestures and human actions. Second, we compare the performance of the HOG-NSP descriptor against spatio-temporal local descriptors such as 3D-SIFT and LBP-TOP; for this case we only consider the recognition of dynamic textures. Finally, the computational complexity of the HOG-NSP descriptor is compared against state-of-the-art spatio-temporal local descriptors.
protocol [26], the number of classes is reduced to 8 by removing the class of plants. Finally, the fourth test protocol, shift-invariant recognition (SIR), was proposed by [27] to evaluate the shift-invariance properties of descriptors. Representative examples of the UCLA dataset are shown in Fig. 4.
The second dataset is the Dyntex++ dataset, compiled from the DynTex dataset [28] and proposed in [25]. It contains 3600 dynamic textures, grouped into 36 classes. Each sequence has a size of 50 × 50 × 50 pixels. We follow the test protocol proposed in [25]: in all our experiments the data was randomly split into two equal-size training and test sets. The random split was repeated 20 times and the average classification accuracy is reported here.
For representing dynamic textures, a spatio-temporal pyramid with L = 0 was used, which is the standard bag-of-features (BoF) representation; in our experiments we did not observe significant improvement from using higher levels of the pyramid. Furthermore, the codebook size and cuboid patch size were empirically set to 800 and 16 × 16 × 16 pixels (with 50% overlap), respectively.
The proposed approach is compared against the following methods: bag-of-dynamical-systems (BDS) [26], distance learning using DL-PEGASOS [25], space-time oriented structures [27], and dynamic fractal spectrum (DFS) descriptors [22]. BDS models dynamic textures using linear dynamical systems (LDS) in the bag-of-features framework. Space-time oriented structures and DFS utilize 3D oriented filters for representing dynamic textures; DFS additionally exploits dynamic fractal analysis for modelling dynamic textures. DL-PEGASOS uses a maximum margin distance learning (MMDL) approach based on the Pegasos algorithm for dynamic texture recognition.
The classification accuracies are presented in Tables 1 and 2 for the UCLA and Dyntex++ datasets, respectively. On the UCLA dataset, the proposed approach achieves the highest performance, with 98.1% and 78.2% in the 9-class and SIR test protocols respectively. In the 8-class test protocol, the proposed approach achieves 98.7%, which is very close to the state-of-the-art (99%). Moreover, in the 50-class test protocol the proposed approach achieves the second highest performance with 97.2%, compared to the DFS method [22], which achieves perfect recognition accuracy. On the Dyntex++ dataset, the proposed approach achieves the highest classification rate with 90.1%, in comparison to DL-PEGASOS [25] and DFS [22].
Table 1. Classification rate (%) for dynamic texture recognition on the UCLA dataset using BDS [26], DL-PEGASOS [25], space-time oriented structures (STOS) [27], DFS descriptors [22], and the proposed approach (HOG-NSP descriptor in the STPM framework using MKL for sub-descriptor fusion)
Table 2. Classification rate (%) for dynamic texture recognition on the Dyntex++ dynamic texture database using DL-PEGASOS [25], DFS [22], and the proposed approach
For gesture recognition, we used the Cambridge hand gesture dataset [21]. It consists of 900 image sequences of 9 gesture classes, defined by three primitive hand shapes and three primitive motions. Each class has 100 image sequences, performed by 2 subjects and captured under 5 illuminations and 10 arbitrary motions. Each sequence was recorded at 30 fps with a resolution of 320 × 240, in front of a fixed camera, with gestures roughly isolated in space and time. Representative examples of this dataset are shown in Fig. 6.
Following the test protocol in [21], the sequences are divided into five sets, where the fifth set, with normal illumination, is used for training and the remaining sets are used for testing. As per [21], we report the classification rate for all four illumination sets.
The proposed method is compared against the original Canonical Correlation Analysis (CCA) [29], Tensor Canonical Correlation Analysis (TCCA) [21], and Product Manifolds (PM) [30]. Canonical correlation analysis is a standard method for measuring the similarity between subspaces. TCCA, as the name implies, is the extension of canonical correlation analysis to multiway data arrays or tensors. In the PM
method, a tensor is characterised as a point on a product manifold and classification is
performed on this space.
For the hand gesture test, HOG-NSP with three levels of spatio-temporal pyramid attains the maximum performance. As such, each video sequence is described by (1 + 8 + 64) × 9 spatio-temporal sub-descriptors. The codebook and cuboid patch sizes are
Table 3. Average correct classification rate (%) for gesture recognition on the Cambridge hand-
gesture database using CCA [29], TCCA [21], Product Manifold [30] and the proposed approach
the same as in the previous experiment. The results, presented in Table 3, demonstrate that the proposed approach achieves the highest performance, with an overall classification rate of 90%. We note that the performance gap between CCA, TCCA and HOG-NSP is quite significant. The proposed approach also achieves the lowest variation, with a standard deviation of 1.3%, which indicates robustness in the presence of photometric variations.
Table 4. Comparison of mean classification accuracy on the KTH and Weizmann datasets
Method 3D-SIFT [3] HOG3D [4] HOG/HOF [7] LBP-TOP [10] KF-MIL [23] UFL-ISA [33] HOG-NSP
Table 5. Classification rate (%) for dynamic texture recognition on the UCLA and Dyntex++ datasets

Method        UCLA 50-class  UCLA 8-class  UCLA SIR  Dyntex++
3D-SIFT [3]       77.3           86.2         44.7      63.7
VLBP [5]          79.8           91.4         40.6      61.1
LBP-TOP [5]       86.1           96.8         60.3      71.2
HOG-TOP           83.7           95.4         62.7      72.8
HOG-NSP           88.5           97.5         66.3      78.7
Table 6. Average run-time (sec) for computing HOG-NSP, LBP-TOP [5], 3D-SIFT [3] and
HOG3D [4] descriptors on KTH dataset
References
1. Wang, H., Ullah, M., Klaser, A., Laptev, I., Schmid, C.: Evaluation of local spatio-temporal features for action recognition. In: BMVC (2009)
2. de Campos, T., Barnard, M., Mikolajczyk, K., Kittler, J., Yan, F., Christmas, W., Windridge, D.: An evaluation of bags-of-words and spatio-temporal shapes for action recognition. In: WACV, pp. 344–351 (2011)
3. Scovanner, P., Ali, S., Shah, M.: A 3-dimensional SIFT descriptor and its application to action recognition. In: International Conference on Multimedia, pp. 357–360 (2007)
4. Klaser, A., Marszałek, M., Schmid, C.: A spatio-temporal descriptor based on 3D-gradients. In: BMVC, pp. 995–1004 (2008)
5. Zhao, G., Pietikäinen, M.: Dynamic texture recognition using local binary patterns with an application to facial expressions. PAMI 29(6), 915–928 (2007)
6. Willems, G., Tuytelaars, T., Van Gool, L.: An Efficient Dense and Scale-Invariant Spatio-Temporal Interest Point Detector. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part II. LNCS, vol. 5303, pp. 650–663. Springer, Heidelberg (2008)
7. Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: CVPR, pp. 1–8 (2008)
8. Schindler, K., Van Gool, L.: Action snippets: How many frames does human action recognition require? In: CVPR, pp. 1–8 (2008)
9. Jhuang, H., Serre, T., Wolf, L., Poggio, T.: A biologically inspired system for action recognition. In: ICCV, pp. 1–8 (2007)
10. Mattivi, R., Shao, L.: Human Action Recognition Using LBP-TOP as Sparse Spatio-Temporal Feature Descriptor. In: Jiang, X., Petkov, N. (eds.) CAIP 2009. LNCS, vol. 5702, pp. 740–747. Springer, Heidelberg (2009)
11. Chen, J., Shan, S., He, C., Zhao, G., Pietikäinen, M., Chen, X., Gao, W.: WLD: A robust local image descriptor. PAMI 32(9), 1705–1720 (2010)
12. Ojansivu, V., Heikkilä, J.: Blur Insensitive Texture Classification Using Local Phase Quantization. In: Elmoataz, A., Lezoray, O., Nouboud, F., Mammass, D. (eds.) ICISP 2008. LNCS, vol. 5099, pp. 236–243. Springer, Heidelberg (2008)
13. Päivärinta, J., Rahtu, E., Heikkilä, J.: Volume Local Phase Quantization for Blur-Insensitive Dynamic Texture Classification. In: Heyden, A., Kahl, F. (eds.) SCIA 2011. LNCS, vol. 6688, pp. 360–369. Springer, Heidelberg (2011)
14. Laptev, I.: On space-time interest points. IJCV 64(2), 107–123 (2005)
15. Fei-Fei, L., Perona, P.: A Bayesian hierarchical model for learning natural scene categories. In: CVPR, vol. 2, pp. 524–531 (2005)
16. Lowe, D.: Distinctive image features from scale-invariant keypoints. IJCV 60(2), 91–110 (2004)
17. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: CVPR, vol. 2, pp. 2169–2178 (2006)
18. Choi, J., Jeon, W., Lee, S.: Spatio-temporal pyramid matching for sports videos. In: ACM Int. Conf. on Multimedia Information Retrieval, pp. 291–297 (2008)
19. Chen, Y., Garcia, E.K., Gupta, M.R., Rahimi, A., Cazzanti, L.: Similarity-based classification: Concepts and algorithms. JMLR 10, 747–776 (2009)
20. Rakotomamonjy, A., Bach, F., Canu, S., Grandvalet, Y.: SimpleMKL. JMLR 9, 2491–2521 (2008)
21. Kim, T., Cipolla, R.: Canonical correlation analysis of video volume tensors for action categorization and detection. PAMI 31(8), 1415–1428 (2009)
22. Xu, Y., Quan, Y., Ling, H., Ji, H.: Dynamic texture classification using dynamic fractal analysis. In: ICCV (2011)
23. Ali, S., Shah, M.: Human action recognition in videos using kinematic features and multiple instance learning. PAMI 32(2), 288–303 (2010)
24. Doretto, G., Chiuso, A., Wu, Y., Soatto, S.: Dynamic textures. IJCV 51(2), 91–109 (2003)
25. Ghanem, B., Ahuja, N.: Maximum Margin Distance Learning for Dynamic Texture Recognition. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part II. LNCS, vol. 6312, pp. 223–236. Springer, Heidelberg (2010)
26. Ravichandran, A., Chaudhry, R., Vidal, R.: View-invariant dynamic texture recognition using a bag of dynamical systems. In: CVPR, pp. 1651–1657 (2009)
27. Derpanis, K., Wildes, R.: Dynamic texture recognition based on distributions of space-time oriented structure. In: CVPR, pp. 191–198 (2010)
28. Péteri, R., Fazekas, S., Huiskes, M.: DynTex: A comprehensive database of dynamic textures. Pattern Recognition Letters 31(12), 1627–1632 (2010)
29. Kim, T., Kittler, J., Cipolla, R.: Discriminative learning and recognition of image set classes using canonical correlations. PAMI 29(6), 1005–1018 (2007)
30. Lui, Y., Beveridge, J., Kirby, M.: Action classification on product manifolds. In: CVPR, pp. 833–839 (2010)
31. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: A local SVM approach. In: ICPR, vol. 3, pp. 32–36 (2004)
32. Gorelick, L., Blank, M., Shechtman, E., Irani, M., Basri, R.: Actions as space-time shapes. PAMI 29(12), 2247–2253 (2007)
33. Le, Q., Zou, W., Yeung, S., Ng, A.: Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In: CVPR, pp. 3361–3368 (2011)
Learning to Recognize Unsuccessful Activities
Using a Two-Layer Latent Structural Model
1 Introduction
In recent years, researchers have made astonishing progress in developing vision systems to recognize human activities. For a target activity class, recognition is usually cast as a binary classification problem: is someone performing the activity or not? However, in many applications, particularly in health care, we would also like to know whether an activity is successfully completed. For example, a senior person in a nursing home may try to dress himself but fail at some stage (e.g., he cannot raise his hands). If a sensor is installed in the room and this unsuccessful activity is recognized, then a nurse can be automatically alerted to help him. Recognizing unsuccessful activities is extremely useful for helping patients with dementia, which is one of the most costly diseases. Patients with dementia tend to forget things because of deteriorated cognitive ability, and as a result have to be fully taken care of by caregivers. If a computer system is able to recognize their unsuccessful activities, then an instructional video can be automatically played to remind (or guide) them. This would help them live independently and significantly reduce economic cost.
To tackle this interesting and useful research problem, we propose to recognize
unsuccessful activities using vision algorithms. As shown in Fig. 1, we now have
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 750–763, 2012.
© Springer-Verlag Berlin Heidelberg 2012
Fig. 1. The unsuccessful activity recognition task. There are three possible class labels: successful, unsuccessful and background. It differs from traditional binary activity recognition, which only deals with whether someone is performing the target activity or not.
three possible class labels for a target activity class: performing the activity and succeeding (abbreviated as "successful"); performing the activity but failing at some stage (abbreviated as "unsuccessful"); and not performing the target activity (abbreviated as "background"). Then, given a test video, we aim to automatically categorize it as one of these three classes.
One of our target applications is health care, where patients' privacy must be preserved. Hence we choose the Kinect sensor to record activity videos: in a real application we can turn off the RGB camera and capture only depth information, which is privacy friendly. However, our proposed approach can also be applied to RGB videos, as shown in the experimental section.
Recognizing unsuccessful activities is hard, partially because we don't know at which stage the activity performer fails. For example, a person may fail to dress himself at the beginning or at the end. This means that we cannot take a holistic approach [1, 2] to represent unsuccessful activities. Instead, we decompose an activity into a number of key temporal stages (e.g., "hand up", "hand down"). In an unsuccessful activity video clip, the activity performer completes some stages but does not complete all of them; the changing point is called the "point of failure" in this paper. It is prohibitive to manually annotate these key stages and the point of failure. Instead, we treat them as latent variables and let the model automatically discover them. We develop a two-layer latent structural model
752 Q. Zhou and G. Wang
based on general latent structural SVMs [3]. Our temporal structure has two layers: the first layer defines the temporal position of the point of failure; the second layer defines the temporal positions of a number of key temporal stages. The stages before the point of failure are successful stages, while the stages after the point of failure are background stages. One advantage of this method is that it can handle the three-class classification problem with a single model, rather than training one classifier for each class independently, as conventional methods do.
In the recognition procedure, the learned model selects a discriminative decomposition for matching. It can not only categorize a test video, but also discover the two-layer temporal structure that supports the categorization decision (where is the point of failure? where do the different key stages take place?).
To the best of our knowledge, there are no previous vision algorithms tackling this research problem, and no existing datasets are available either. We collect a new dataset to study it. Our experimental results demonstrate that it is promising to solve this problem using computer vision techniques, and our proposed two-layer latent structural SVM model shows superior performance compared to several previous approaches.
1.2 Contributions
We summarize our contributions as follows.
(1) Our major contribution is to show the promise of recognizing unsuccessful human activities, which is expected to be useful in health care. To the best of our knowledge, this is the first vision algorithm for unsuccessful activity recognition.
(2) We develop a two-layer latent structural SVM model to recognize unsuccessful activities and discover their temporal structures.
(3) We create a new dataset that enables further investigation and study of unsuccessful activity recognition.
2 Approach
Given a test video, we aim to categorize it with one of the three possible classes: successful, unsuccessful or background (unrelated random activities), denoted as +1, 0 and −1 respectively. We write the label set as Y.
Fig. 2. Illustration of our two-layer latent structural model for unsuccessful activity recognition. The model has two layers: the first layer defines the point of failure (where the activity performer fails to perform the activity); the second layer defines the temporal positions of a number of key stages. θ denotes the model parameters. fp is the point of failure. (h1, . . . , hK) are the temporal positions of the K key stages. φA(x, hk) is the appearance feature vector: when k ≤ fp, φA(x, hk) should be compatible with the class label +1, since this stage should be similar to a temporal stage of a successful activity; when k > fp, φA(x, hk) should be compatible with the class label −1, since this stage should be dissimilar to any key temporal stage of a successful activity. φT(hk) denotes the temporal displacement feature. In recognition, we infer the latent structure and calculate a score for each temporal stage; summing over all the temporal stages produces the final classification score.
We develop our model based on the general latent structural SVM formulation [3], but with deeper temporal structures. Given a video x, let z = (h1, ..., hK, fp) specify a particular two-layer structure. (h1, ..., hK) are the second-layer latent variables, where hk specifies the temporal position of the k-th temporal stage. In our model, we normalize hk by the total number of video frames so that it falls in the range [0, 1]. The first-layer latent variable fp denotes the point of failure in unsuccessful activities, and is specified as a number in {0, 1, . . . , K}. This structure means people successfully perform the first fp stages and fail to perform the remaining (K − fp) stages. For successful activities, fp should be equal to K, since all K stages have been successfully performed. For unsuccessful activities, fp must be in {1, . . . , K − 1}. Similarly, fp must be 0 for background activities, because no key stages have been successfully performed.
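The admissible values of fp per class label can be enumerated directly; a small helper reflecting the constraints just stated (the function name is ours, not the paper's):

```python
def admissible_fp(y, K):
    """Admissible points of failure for class y with K key stages:
    successful (+1): all K stages done -> fp = K;
    unsuccessful (0): fails partway    -> fp in 1..K-1;
    background (-1): no stage done     -> fp = 0."""
    if y == +1:
        return [K]
    if y == 0:
        return list(range(1, K))
    if y == -1:
        return [0]
    raise ValueError("y must be +1, 0 or -1")
```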
Given learned model parameters θ, the overall score of a hypothesis z can be expressed as the dot product between the model parameters and the joint feature vector Φ(x, y, z):

f_θ(x, y, z) = θ · Φ(x, y, z)    (1)

Eq. (1) measures the compatibility among the input x and output y when the latent variable z is specified [3, 10]. Φ(x, y, z) is a concatenation of features extracted at the different key temporal stages:

Φ(x, y, z) = [φ(x, y, h1), . . . , φ(x, y, hK)]    (2)
φ(x, y, hk) is the joint feature vector for the k-th temporal stage, which includes the appearance feature φA(x, hk) and the temporal displacement feature φT(hk). In this paper, we experiment with two types of appearance features: one is bag-of-words features (HOG and HOF are used as local features), and the other is human pose descriptors [14]. The details of extracting appearance features are described in Sec. 3. Each temporal stage has an anchor point tk. If a temporal stage hypothesis has a temporal displacement relative to its anchor point, a deformation cost is enforced. Similar to [7], φT(hk) is defined as:

φT(hk) = (dtk, dtk²),  dtk = hk − tk    (3)

where dtk is the temporal displacement of the k-th key temporal stage relative to its anchor point tk.
When y = ±1, the feature vector is defined as:

φ(x, y, hk) = [y · φA(x, hk), φT(hk)]    (4)

For an unsuccessful activity (y = 0), when k ≤ fp, φA(x, hk) should be compatible with the class label +1, since this stage should be similar to a temporal stage of a successful activity; when k > fp, φA(x, hk) should be compatible with the class label −1, because this stage should be dissimilar to any key temporal stage of a successful activity. Hence the feature vector is defined as:

φ(x, y, hk) = (+1 · φA(x, hk), φT(hk))  if k ≤ fp
φ(x, y, hk) = (−1 · φA(x, hk), φT(hk))  if k > fp    (5)
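The per-stage feature construction of Eqs. (4)–(5) can be transcribed literally (with φA and φT supplied as plain arrays; the concatenation layout is an assumption):

```python
import numpy as np

def stage_feature(y, k, fp, phi_A, phi_T):
    """Joint feature for the k-th stage. For y = +1/-1 the appearance
    part is scaled by y (Eq. 4); for an unsuccessful activity (y = 0)
    stages up to the point of failure fp act like +1 and the rest
    like -1 (Eq. 5)."""
    if y != 0:
        sign = y
    else:
        sign = +1 if k <= fp else -1
    return np.concatenate([sign * phi_A, phi_T])
```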
For the k-th stage, we define a stage appearance filter Fk to measure the appearance response, and a two-dimensional vector dk as the deformation cost parameter. Our model parameters are:

θ = (F1, d1, . . . , FK, dK)    (6)
where ½‖θ‖² is a regularization term and ξi is the slack variable for the i-th training example. The formulation attempts to learn model parameters θ such that the score for the ground-truth label yi, given the best latent structure, is greater than the score for any other class label, given its best latent structure, by a margin of at least 1.
As mentioned in previous works [9, 3, 7, 10], this two-layer latent structural SVM model leads to a non-convex optimization problem. However, after we fix the value of z for each (xi, yi), the optimization problem becomes convex. Similar to previous works [3, 10, 5], we can reach a local optimum of Eq. (8) using an iterative procedure as follows:
The task of Step 1 is the same as the inference problem described in Sec. 2.2. In Step 2, following [9], we use a stochastic gradient descent method to train the model. Another choice is the cutting plane algorithm of [17, 3, 10].
In stochastic gradient descent, we randomly select a training example xi; the hinge loss is:

Li = max(0, 1 − θ · Φ(xi, yi, zi) + max_{(y,z): y ≠ yi} θ · Φ(xi, y, z))    (10)

g = θ + C ∂Li/∂θ    (11)

∂Li(θ, xi, yi)/∂θ = 0 if Li = 0;  Φ(xi, y*, z*) − Φ(xi, yi, zi) if Li > 0    (12)

where

(y*, z*) = argmax_{(y,z): y ≠ yi} θ · Φ(xi, y, z)    (13)
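A minimal sketch of one stochastic update, assuming an oracle `best_latent` that solves the inner maximizations of Eqs. (10) and (13) (both the oracle and the fixed label set are illustrative assumptions):

```python
import numpy as np

def sgd_step(theta, x_i, y_i, feat, best_latent, C=1.0, lr=1e-3):
    """One stochastic gradient step for the latent structural SVM.
    feat(x, y, z) -> joint feature vector Phi; best_latent(theta, x, y)
    -> argmax_z of theta . Phi(x, y, z) over structures for class y."""
    labels = (+1, 0, -1)
    # best structure for the true label
    z_i = best_latent(theta, x_i, y_i)
    s_true = theta @ feat(x_i, y_i, z_i)
    # most violating competing (label, structure) pair (Eq. 13)
    y_s, z_s, s_best = None, None, -np.inf
    for y in labels:
        if y == y_i:
            continue
        z = best_latent(theta, x_i, y)
        s = theta @ feat(x_i, y, z)
        if s > s_best:
            y_s, z_s, s_best = y, z, s
    loss = max(0.0, 1.0 - s_true + s_best)        # hinge loss (Eq. 10)
    grad = theta.copy()                           # from the (1/2)||theta||^2 term
    if loss > 0:
        grad += C * (feat(x_i, y_s, z_s) - feat(x_i, y_i, z_i))  # Eq. 12
    return theta - lr * grad
```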
3 Implementation Details
We tried two types of features in our experiments. The first is the bag-of-words representation (HOG and HOF are used as local features); we extract these low-level features from depth videos. We also compare with baselines that extract bag-of-words features from RGB videos. The other feature is the geometric pose descriptor (GPD) [14], which represents human poses.
\[
\mathrm{norm}(J^i, J^j) = \frac{\sqrt{(J_x^j - J_x^i)^2 + (J_y^j - J_y^i)^2 + (J_z^j - J_z^i)^2}}{C} \tag{16}
\]
We represent each pose as a \(16 \times 15 \times 3 / 2 = 360\)-dimensional GPD feature. Note that when we use GPD features, each temporal stage has only one frame.
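A minimal sketch of the per-pair distance of Eq. (16). The normalization constant `C` and the helper name are assumptions, and only the scalar distance per joint pair is computed here; the full 360-dimensional GPD descriptor stacks three components per joint pair, which this sketch does not reproduce.

```python
import numpy as np
from itertools import combinations

def gpd_distances(joints, C=1.0):
    """Pairwise joint distances per Eq. (16).
    `joints` is a (16, 3) array of 3D joint positions; `C` is an assumed
    normalization constant. Returns 16*15/2 = 120 pairwise distances."""
    dists = [np.linalg.norm(joints[i] - joints[j]) / C
             for i, j in combinations(range(len(joints)), 2)]
    return np.array(dists)

joints = np.random.rand(16, 3)
print(gpd_distances(joints).shape)  # (120,)
```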
4.1 Dataset
4.2 Baselines
Table 1. Recognition results of our two-layer latent structural SVM model and three baseline approaches. Baseline 1 is the standard bag-of-words method and baseline 2 is spatial-temporal pyramid matching [1]. We also compare our approach with a one-layer latent structural model. RGB and Depth denote RGB videos and depth videos, respectively. We also try two types of features (HOG/HOF features and GPD features) with our model. The column K = 4 shows the results of our model with 4 temporal stages, and the column Best K reports the results of our model with the best K for each activity. When people lie on a bed, we cannot get joints using the SDK, so the result of "get up" with the GPD feature is not available. A01: Calling, A02: Dressing, A03: Drinking, A04: Stand Up, A05: Get Up, A06: Mopping, A07: Undressing.
In this section, we compare our two-layer latent structural SVM with the baselines on unsuccessful activity recognition. We use the leave-one-out method in our experiments: each time, we use the data of one performer for testing and all the other data for training. All of the experiments were done on a 2.8 GHz Intel Xeon PC. It takes about 2 hours to train an activity model (K = 4) on training data with 84 videos, and inference takes around 15 seconds on a video with 400 frames. We report the average accuracy over all the performers. The results of different methods are summarized in Table 1. When people lie on a bed, we cannot get joints using the SDK, so the result of "get up" with the GPD feature is not available. We can see that our method significantly outperforms the baseline methods on almost all the activities, which suggests the effectiveness of our two-layer latent structural model for unsuccessful human activity recognition. With our model, HOG/HOF features work better than GPD features. One possible reason is that with HOG/HOF features, multiple frames are used to represent each temporal stage, whereas with GPD features only one frame can be used. Fig. 3 shows the confusion matrix obtained by our model with HOG/HOF features.
In Fig. 5, we show the two-layer structure inferred by our approach on two unsuccessful activity videos. Our method can find the point of failure and the positions of the different temporal stages.
Fig. 3. Confusion matrices of our model using HOG/HOF features (Best K). B, U and S are short for Background, Unsuccessful and Successful, respectively.
Fig. 4. The effect of different temporal stage sizes. We show the recognition accuracy of each class with different stage sizes. The top image shows the results using GPD features; the bottom image shows the results using HOG/HOF features.
Fig. 5. Two examples of inferred latent structure using our learned model (K = 3) with HOG/HOF features. The top image shows an unsuccessful "get up" video and the bottom image shows an unsuccessful "calling" video. We use red boxes to represent unsuccessful temporal stages and green boxes to represent successful temporal stages. (RGB images are for illustration only; our model is built on depth videos.)
References
1. Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: Proc. CVPR (2008)
2. Liu, J., Luo, J., Shah, M.: Recognizing realistic actions from videos "in the wild". In: Proc. CVPR (2009)
3. Yu, C.N.J., Joachims, T.: Learning structural SVMs with latent variables. In: Proc. ICML (2009)
4. Aggarwal, J.K., Ryoo, M.S.: Human activity analysis: A review. ACM Computing Surveys 43(3), 16 (2011)
5. Wang, Y., Mori, G.: Max-margin hidden conditional random fields for human action recognition. In: Proc. CVPR (2009)
6. Ni, B., Yan, S., Kassim, A.A.: Recognizing human group activities with localized causalities. In: Proc. CVPR (2009)
7. Niebles, J.C., Chen, C.-W., Fei-Fei, L.: Modeling temporal structure of decomposable motion segments for activity classification. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part II. LNCS, vol. 6312, pp. 392–405. Springer, Heidelberg (2010)
8. Ryoo, M.S.: Human activity prediction: Early recognition of ongoing activities from streaming videos. In: Proc. ICCV (2011)
9. Felzenszwalb, P.F., Girshick, R.B., McAllester, D.A., Ramanan, D.: Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 32(9), 1627–1645 (2010)
10. Zhu, L., Chen, Y., Yuille, A.L., Freeman, W.T.: Latent hierarchical structural learning for object detection. In: Proc. CVPR (2010)
11. Li, W., Zhang, Z., Liu, Z.: Action recognition based on a bag of 3D points. In: CVPR Workshops (2010)
12. Sung, J., Ponce, C., Selman, B., Saxena, A.: Human activity detection from RGBD images. In: AAAI Workshop (2011)
13. Ni, B., Wang, G., Moulin, P.: RGBD-HuDaAct: A color-depth video database for human daily activity recognition. In: 1st IEEE Workshop on Consumer Depth Cameras for Computer Vision, in Conjunction with ICCV (2011)
14. Chen, C., Zhuang, Y., Nie, F., Yang, Y., Wu, F., Xiao, J.: Learning a 3D human pose distance metric from geometric pose descriptor. IEEE Trans. Vis. Comput. Graph. 17(11), 1676–1689 (2011)
15. Felzenszwalb, P.F., Huttenlocher, D.P.: Distance transforms of sampled functions. Cornell Computing and Information Science Technical Report TR2004-1963 (September 2004)
16. Felzenszwalb, P.F., Huttenlocher, D.P.: Pictorial structures for object recognition. IJCV 61(1), 55–79 (2005)
17. Tsochantaridis, I., Hofmann, T., Joachims, T., Altun, Y.: Support vector machine learning for interdependent and structured output spaces. In: Proc. ICML (2004)
18. Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2, 27:1–27:27 (2011). Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
19. Laptev, I.: On space-time interest points. International Journal of Computer Vision 64(2-3), 107–123 (2005)
20. Yao, B., Li, F.F.: Modeling mutual context of object and human pose in human-object interaction activities. In: Proc. CVPR (2010)
21. Yang, W., Wang, Y., Mori, G.: Recognizing human actions from still images with latent poses. In: Proc. CVPR (2010)
22. Shotton, J., Fitzgibbon, A.W., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., Blake, A.: Real-time human pose recognition in parts from single depth images. In: Proc. CVPR (2011)
Action Recognition Using Subtensor Constraint
1 Introduction
As one of the most active recognition tasks in computer vision, human action recognition from videos draws tremendous interest [1–3]. In this work, we focus on the task of action recognition, which can be stated as: given a video sequence containing one person performing an action, design a system that automatically recognizes which action is performed in this video.
Approaches targeting different challenges in action recognition have been intensively studied. An action can be decomposed into spatial and temporal parts. Image features that are discriminative with respect to posture and motion of the human body are first extracted. Using these features, spatial representations that discriminate actions can be constructed. Many approaches represent the spatial structure of actions with reference to the human body [4–7]. Holistic representations capture the spatial structure by densely computing features on a regular grid bounded by a detected region of interest (ROI) centered on the person [8–10]. Other approaches recognize actions based on the statistics of local features from all regions, relying neither on explicit body part labeling nor on explicit human detection and localization [11–13].
Different representations can be used to learn the temporal structure of actions from these features as well. Some approaches represent an action as a sequence of moments, each with its own appearance and dynamics [14, 15]. Some methods attempt to learn the complete temporal appearance directly from example sequences [16]. Temporal statistics attempt to model the appearance of actions without an explicit model of their dynamics [17–19].
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 764–777, 2012.
© Springer-Verlag Berlin Heidelberg 2012
2 Related Work
The authors in [27] use a similarity measure that is not affected by camera viewpoint changes to match two actions. Based on affine epipolar geometry, they claim that two trajectories of the same action match if and only if the observation matrix has rank at most 3. But their approach assumes an affine camera model and treats individual body points separately, ignoring the spatial relation between them.
The work in [28] demonstrates the plausibility of the assumption that the proportions between human body parts can be captured by a projective transformation. The fundamental matrix from epipolar geometry is then applied as a constraint for matching poses between the testing video and a template video. In [31], the two-body fundamental matrix is used to capture the geometry of dynamic action scenes and helps derive an action matching score across different views without any prior segmentation of actors. Since the proposed rank constraint treats individual poses separately, they use dynamic time warping in a separate step to enforce temporal consistency.
To address the temporal information, the work in [29] develops a constraint called the fundamental ratio to capture the pose transition between successive frames. But for estimating the two epipoles of two virtual cameras, they assume the second view undergoes some non-general motion relative to the first view. The authors in [30] use the property that a planar homography has two equal eigenvalues to match the pose transition. Though these methods treat sets of body joints separately instead of imposing an overall constraint on the whole body, they rely on globally estimated epipoles, which again are constrained by the epipolar geometry.
766 Q. Liu and X. Cao
Fig. 1. Two poses of a basketball player playing basketball. An action can be viewed as a collection of static poses represented by the body joints.
All the above approaches under perspective geometry work in the framework of two-view geometry: one view is used for the template video of an action, and the other for the testing video. Our work adds more information and certainty for recognizing an action by introducing another template view. The reasons we choose a tri-view setting are: first, by adding only one template view, the information for matching an action is doubled without the burden of maintaining a large template dataset; also, tri-view geometry is well studied in the literature, giving a clear and understandable way to find constraints in this framework. One may think the problem becomes trivial, since one can reconstruct the template poses in 3D given two template views. However, knowing only the point correspondences, the pose can be reconstructed only up to a projective ambiguity, which distorts the actual 3D model in an unexpected way. Though finding the right projection may match the 3D pose with the 2D test view, the projection and matching errors will depend on the degree of distortion introduced by the projective transformation determining the reconstruction. So we believe that the indirect method of reconstructing the 3D template pose, instead of extracting a direct view-invariant geometric constraint, is not an effective approach.
The contributions of this paper include: 1) we find that the trifocal tensor lies in a twelve-dimensional subspace of the original 27-D space if the first two views are already matched and the fundamental matrix between them is known, which we refer to as the subtensor constraint; 2) we use this subtensor constraint for recognizing actions, which requires fewer matching points but achieves better performance; 3) we experimentally show that treating the two template views separately, or not considering the correspondence relation already known between the first two views, discards a lot of useful information.
3 Our Method
As shown in Fig. 1, an action can be viewed as a sequence of frames, each of which contains a static pose. Each pose can be represented by the joints or anatomical landmarks (spheres in Fig. 1) of the body, which contain sufficient information for recognizing an action [28]. Our approach is designed to find constraints on the joints under three views. Suppose we have a testing video in which the depicted action is unknown, and two template videos recording a known action under two different views. We can consider the unknown video as a third view of the known action if these three views satisfy certain tri-view geometry properties. Hence, we want to find a constraint on the body joint correspondences whose degree of satisfaction implies the similarity between the testing action and the template actions. We assume that the 2D image coordinates of the body joints are already extracted and recognized from the video sequence, or synthesized from motion capture data.
Our goal is to find an overall constraint between the joints of the pose in this frame and the joints of two corresponding poses in the two template action videos. Clearly, using the two template views separately is not a good solution. Since our approach works under three views, another naive proposal would be to use the trifocal tensor [32]. But directly using the trifocal tensor discards a lot of useful information, given our assumption that the two views in the two template videos are already correctly matched. Though the trifocal tensor contains 27 parameters, which form a 27-D space, we will first show that it actually resides in a 12-dimensional subspace of the original 27-D space if we already know the correct correspondence relationship between the first and second views.
If we have two views, views 1 and 2, of the same pose and the joint coordinates in each view, the correspondence relation between the joints in the two views is partially encoded by the fundamental matrix F_{21}, so that a pair of points x, x' in the first two views satisfies \(x'^{T} F_{21}\, x = 0\). We can first estimate F_{21} from the known information of two corresponding frames in the two template videos. In the third view (the testing video), a line l'' induces a homography H_{12} mapping points from the first view to the second view through the trifocal tensor T = [T_1, T_2, T_3]:
\[
H_{12} = [\mathbf{T}_1\ \mathbf{T}_2\ \mathbf{T}_3]\, l''. \tag{1}
\]
The homography H_{12} is compatible with the fundamental matrix F_{21} if and only if the matrix \(H_{12}^{T} F_{21}\) is skew-symmetric, so
\[
\big(F_{21}^{T}\,\mathbf{T}_2 + ([\mathbf{T}_1^{T}\ \mathbf{T}_2^{T}\ \mathbf{T}_3^{T}]\,\mathbf{f}_2)^{T}\big)\, l'' = 0_{3\times 1}. \tag{5}
\]
\[
\big(F_{21}^{T}\,\mathbf{T}_3 + ([\mathbf{T}_1^{T}\ \mathbf{T}_2^{T}\ \mathbf{T}_3^{T}]\,\mathbf{f}_3)^{T}\big)\, l'' = 0_{3\times 1}. \tag{6}
\]
Since equations (4), (5) and (6) hold for an arbitrary line l'' in the third view, the following equations must hold:
\[
t = B\,\tilde{t}, \tag{10}
\]
where B is a 27 × 12 matrix whose columns are the basis vectors spanning the subspace estimated from the 15 independent constraints, and \(\tilde{t}\) is a 12-dimensional vector which we refer to as the subtensor.
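The basis B can be obtained as a null-space basis via SVD. A minimal sketch, assuming the 15 constraint rows of Eqs. (4)-(6) have already been stacked into a hypothetical 15 × 27 matrix `C`:

```python
import numpy as np

def subspace_basis(C, dim=12):
    """Given a 15x27 matrix C stacking the linear constraints on the
    trifocal tensor, return a 27x12 matrix B whose columns span the
    null space of C, so that t = B @ t_tilde as in Eq. (10)."""
    _, _, Vt = np.linalg.svd(C)
    # Right singular vectors of the smallest singular values span the null space
    return Vt[-dim:].T

# Hypothetical example: a generic (rank-15) constraint matrix
C = np.random.rand(15, 27)
B = subspace_basis(C)
```

Any tensor satisfying the 15 constraints can then be written as a linear combination of the columns of B.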
A point correspondence across three views {x, x', x''} must satisfy
\[
[x']_{\times} \Big(\sum_i x_i\, \mathbf{T}_i\Big) [x'']_{\times} = 0_{3\times 3}, \tag{11}
\]
\[
\tilde{A}\, t = 0_{4\times 1}, \tag{12}
\]
where \(\tilde{A}\) is a 4 × 27 matrix obtained from the coordinates of {x, x', x''}. Since every point correspondence across three views forms one \(\tilde{A}\) and provides 4 constraints, n correspondences give 4n constraints and form a larger matrix A of size 4n × 27 by concatenating the \(\tilde{A}\) of every correspondence, so that
\[
A\, t = 0_{4n\times 1}. \tag{13}
\]
Putting (10) and (13) together, we obtain
\[
G\,\tilde{t} = 0_{4n\times 1}, \tag{14}
\]
where G = AB. Since F_{21} already provides some information about the correspondence relation between the first and second views, instead of four independent
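The pipeline of Eqs. (11)-(14) can be sketched as follows. This is an illustrative sketch, not the authors' implementation: it builds all 9 rows of the linearization of Eq. (11) per correspondence (of which 4 are independent), and using the smallest singular value of G as a matching error is an assumption here, since the score E_g is not fully reproduced in this excerpt.

```python
import numpy as np

def skew(v):
    """Cross-product (skew-symmetric) matrix of a 3-vector."""
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])

def trilinear_rows(x, xp, xpp):
    """Linearize Eq. (11) in the 27 tensor entries: each entry of
    [x']_x (sum_i x_i T_i) [x'']_x is linear in t, giving 9 rows."""
    rows = np.zeros((9, 27))
    for k in range(27):
        t = np.zeros(27)
        t[k] = 1.0
        T = t.reshape(3, 3, 3)
        M = skew(xp) @ sum(x[i] * T[i] for i in range(3)) @ skew(xpp)
        rows[:, k] = M.ravel()
    return rows

def matching_error(corrs, B):
    """Stack the rows of all correspondences into A (Eq. 13), form
    G = A B (Eq. 14), and return the smallest singular value of G as
    a matching error (an assumption for the unspecified score E_g)."""
    A = np.vstack([trilinear_rows(*c) for c in corrs])
    G = A @ B
    return np.linalg.svd(G, compute_uv=False)[-1]
```

For poses that genuinely correspond across the three views, G has a nontrivial null vector, so the smallest singular value stays near zero.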
[Fig. 2: frame-wise similarity matrices (template frame # vs. testing frame #).]
We also estimate the frame-wise similarity between the template action and two other actions, shoot basketball (Fig. 2 (c)) and pirouette in dance (Fig. 2 (d)). No dark path can be found in these two figures, unlike in (a) and (b), implying actions different from the template action.
where P is a set containing matched frame pairs after the DTW alignment.
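A minimal dynamic-time-warping sketch for recovering the matched frame pairs P, assuming a precomputed framewise cost matrix `D` (template × testing); the helper name is hypothetical.

```python
import numpy as np

def dtw_pairs(D):
    """Dynamic time warping over a framewise cost matrix D.
    Returns the set P of matched (template, testing) frame pairs
    on the optimal warping path."""
    n, m = D.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = D[i-1, j-1] + min(acc[i-1, j-1], acc[i-1, j], acc[i, j-1])
    # Backtrack from (n, m) to recover the path
    P, i, j = [], n, m
    while i > 0 and j > 0:
        P.append((i - 1, j - 1))
        step = np.argmin([acc[i-1, j-1], acc[i-1, j], acc[i, j-1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return P[::-1]
```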
4.1 Datasets
In our approach, we need two template views in which the correspondences are matched. But for evaluating performance, there is no publicly available dataset that provides matched correspondences in two views to serve as the templates, not even the multiview IXMAS dataset [24] widely used for view-independent recognition. So we evaluate our approach on two challenging datasets generated from motion capture data under different views. We use 11 body joints: root, left femur, left tibia, right femur, right tibia, lower neck, head, left humerus, left wrist, right humerus and right wrist. Though these are simulated clean data, the actions are performed by different actors over different trials, varying largely in speed, style, anthropometric proportions, viewpoint, etc. Therefore, these data are representative of the types of testing poses in real videos.
Fig. 3. Dataset generation. (a) Views used to generate the two datasets; the small red spheres represent the camera locations. (b) and (c) are example images from the CMU mocap dataset. (d) and (e) are example images from the HDM05 mocap dataset.
In this subsection, we first check the view-invariance property of the proposed approach. Fig. 4 (a) and (b) show a pose in a playing-soccer action and a pose in a playing-basketball action, respectively. Both are observed by two cameras at the locations marked by the two red spheres in Fig. 4 (c) as the template poses. Another pose, identical to the pose in Fig. 4 (a) but performed by a different actor, is observed at the locations marked by the cyan spheres in Fig. 4 (c) as the testing pose. The testing pose in different views is then compared
Fig. 4. View invariance. (a) A pose in a playing-soccer action. (b) A pose in a playing-basketball action. (c) The same pose as in (a) performed by another actor; the pose in (c) is observed from the views marked by the cyan spheres, while the poses in (a) and (b) are observed from the views on the two red spheres as the templates. (d) The lower surface shows the matching errors between template (a) and the test pose (c) over different views; the higher surface shows the matching errors between template (b) and the test pose (c).
with the two template poses in (a) and (b) using E_g, and the matching errors are plotted in Fig. 4 (d). The lower and higher surfaces represent the matching errors with the template poses generated from Fig. 4 (a) and (b), respectively. As we can see, since the testing pose is the same as template pose (a), the error is much smaller across all the views.
Due to tracking inaccuracy, the positions of the body joints may not be precisely identified. Developing an effective method for pose estimation and tracking of the body joints is beyond the scope of this work, so we instead examine how our approach behaves when the positions of the body joints suffer from tracking noise. As shown in Fig. 5, the configuration is the same as in Fig. 4 (d), except that the positions of the body joints in the testing poses are corrupted by Gaussian noise with standard deviations of 1, 5, 10, 15 and 20 pixels. We plot the matching errors in Fig. 5 (a)-(e), respectively. Fig. 5 (f) shows how the mean difference of the matching error between different poses and the same pose changes with the noise level. As can be seen, even with a noise standard deviation up to 20, the matching error of the same pose is still lower than that of the different pose in most of the views, which means that poses can be matched even if the tracked body joints deviate severely from their true positions. With a robust method for estimating and tracking the body joints, our approach will work effectively.
We compare our approach with the two-view approach in [28], as well as with two other naive approaches to recognizing an action in three views: treating the two template views separately, or directly using T without considering the correspondence relation already known between the first two views. First, we give a result to intuitively show why our approach works better. As shown in Fig. 6, framewise similarity is estimated using F as in [28], T, and the subtensor constraint for the same action performed by two different actors. One pose of the action is shown in the first panel of Fig. 6; 7, 9, 11 and 15 joints are used to estimate the matching error, respectively. The three rows are estimated using F, T and the subtensor constraint, respectively. Since F requires at least 9
Fig. 5. Robustness to noise. (a) Matching error under a noise level of 1 pixel. (b) Noise level of 5 pixels. (c) Noise level of 10 pixels. (d) Noise level of 15 pixels. (e) Noise level of 20 pixels. (f) The mean difference of the matching error between different poses and the same poses with respect to the noise level.
Fig. 6. Framewise similarity estimated for an action using F, T and subtensor with 7,
9, 11 and 15 body joints available
Table 1. Results on CMU Mocap Dataset. Each row shows the number of correctly
recognized testing videos for each action class and the total precision for an approach.
There are 50 testing videos for each class.
correspondences, the 7-joint situation is not used for F. There should be a dark path around the diagonal area of each figure, and the path blurs as the number of correspondences decreases because of the loss of information. As can be seen, the path is blurred for F even with a sufficient 15 correspondences. As for T, the path is very blurred with 9 correspondences and can hardly be recognized when the number of correspondences is reduced to 7, whereas for the subtensor the path can still be recognized even with 7 joints.
The approaches are tested on the two datasets. The examined approaches are: 1) using the first view as the template action for F in [28]; 2) using the second view as the template action for F; 3) using both the first and second templates, but separately, for F, with the smaller matching error used for recognizing an action; 4) directly using the trifocal tensor; 5) using the proposed subtensor constraint. All 11 point correspondences are used for matching. The results are given in Tab. 1 and Tab. 2, showing the number of correctly recognized testing videos for each action class and the total precision for every approach.
From Tab. 1 and 2, we can see that recognizing actions using the tri-view constraint is more effective than the two-view constraint, and that simply combining the two-view constraint using two templates separately does not obtain much gain either. The best performance is obtained using the subtensor. Since all 11 joints are available, using T can also achieve fair performance. But in real-life situations, some of the body joints in a testing video may be missing due to the failure of a tracking algorithm or to self-occlusion. To evaluate the performance under such situations, we test our approach on the two datasets with some of the joints missing.
Table 3. Results on the CMU Mocap Dataset with occlusion. Each row shows the number of correctly recognized testing videos for each action class and the total precision for an approach. There are 50 testing videos per class.
The datasets are generated from the above CMU Mocap and HDM05 Mocap Datasets. In each frame of an action sequence, we randomly choose 1 to 4 joints with probabilities of 0.4, 0.3, 0.2 and 0.1, respectively, and assume that they are missing. We then use these sequences as the testing videos.
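The occlusion sampling described above can be sketched as follows; the 2D pose array and the helper name are assumptions for illustration.

```python
import numpy as np

def occlude_joints(pose, rng=np.random.default_rng()):
    """Randomly drop 1-4 of the 11 joints in one frame with
    probabilities 0.4, 0.3, 0.2 and 0.1, as in the occlusion
    experiment. Returns the visible joints and the missing indices."""
    n_missing = rng.choice([1, 2, 3, 4], p=[0.4, 0.3, 0.2, 0.1])
    missing = rng.choice(len(pose), size=n_missing, replace=False)
    visible = np.delete(pose, missing, axis=0)
    return visible, missing

pose = np.random.rand(11, 2)     # 2D joint coordinates in one frame
visible, missing = occlude_joints(pose)
```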
As shown in Tab. 3 and Tab. 4, there is a significant degradation in performance for the approach using T, while our approach still works well. We believe this is because directly using T ignores the known correspondence relation between the first two views and thus discards a lot of information.
To compare how fast performance degrades with the degree of occlusion, we again generate data from the CMU Mocap and HDM05 Mocap Datasets. For every frame in the entire dataset, m points are randomly assumed to be occluded and missing. Fig. 7 shows the precision of the two approaches with 0 to 5 points missing. The case of 5 missing points is not considered for T, since T requires at least 7 points. We can see from the figure that the subtensor is more robust to occlusion.
[Fig. 7: precision vs. number of missing points (0-5) for the subtensor and T approaches on the two datasets.]
5 Conclusion
In this work, we first proved that the trifocal tensor lies in a twelve-dimensional subspace of the original 27-D space if the fundamental matrix between the first two views is known, which we refer to as the subtensor constraint. We then used this constraint for recognizing actions under three views, which requires fewer matching points but achieves better performance. We found that: 1) recognizing actions under three views is more effective than the two-view constraint; 2) simply adding another template view under the two-view constraint does not obtain much gain; 3) the naive solution of using the trifocal tensor instead of the subtensor loses useful information and leads to a deficiency of the constraint. One weakness of our approach is that the precondition that the 2D image coordinates of the body joints are extracted may be considered a strong assumption. However, as shown in the experiments, our approach performs well even when the tracked body joints suffer from noise. With a robust tracking method, our approach will work effectively.
References
1. Moeslund, T.B., Hilton, A., Kruger, V.: A survey of advances in vision-based human motion capture and analysis. CVIU 104, 90–126 (2006)
2. Wang, L., Hu, W., Tan, T.: Recent developments in human motion analysis. PR 36, 585–601 (2003)
3. Weinland, D., Ronfard, R., Boyer, E.: A survey of vision-based methods for action representation. CVIU 115, 224–241 (2011)
4. Ikizler, N., Forsyth, D.A.: Searching video for complex activities with finite state models. In: CVPR, pp. 18–23 (2007)
5. Agarwal, A., Triggs, B.: Recovering 3D human pose from monocular images. PAMI 28, 44–58 (2006)
6. Peursum, P., Venkatesh, S., West, G.A.W.: Tracking-as-recognition for articulated full-body human motion analysis. In: CVPR, pp. 1–8 (2007)
7. Yilmaz, A., Shah, M.: Recognizing human actions in videos acquired by uncalibrated moving cameras. In: CVPR, pp. 150–157 (2005)
8. Yilmaz, A., Shah, M.: Actions sketch: A novel action representation. In: CVPR, pp. 984–989 (2005)
9. Fathi, A., Mori, G.: Action recognition by learning mid-level motion features. In: CVPR, pp. 24–32 (2008)
10. Schindler, K., Gool, L.J.V.: Action snippets: How many frames does human action recognition require? In: CVPR, pp. 1–8 (2008)
11. Laptev, I.: On space-time interest points. IJCV 64, 107–123 (2005)
12. Gilbert, A., Illingworth, J., Bowden, R.: Scale invariant action recognition using compound features mined from dense spatio-temporal corners. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 222–233. Springer, Heidelberg (2008)
13. Niebles, J.C., Li, F.F.: A hierarchical model of shape and appearance for human action classification. In: CVPR, pp. 1–8 (2007)
14. Morency, L.P., Quattoni, A., Darrell, T.: Latent-dynamic discriminative models for continuous gesture recognition. In: CVPR, pp. 1–8 (2007)
15. Thurau, C., Hlavac, V.: Pose primitive based human action recognition in videos or still images. In: CVPR, pp. 1–8 (2008)
16. Gorelick, L., Blank, M., Shechtman, E., Irani, M., Basri, R.: Actions as space-time shapes. PAMI 29, 2247–2253 (2007)
17. Niebles, J.C., Wang, H., Fei-Fei, L.: Unsupervised learning of human action categories using spatial-temporal words. IJCV 79, 299–318 (2008)
18. Nowozin, S., Bakir, G.H., Tsuda, K.: Discriminative subsequence mining for action classification. In: ICCV, pp. 1–8 (2007)
19. Scovanner, P., Ali, S., Shah, M.: A 3-dimensional SIFT descriptor and its application to action recognition. In: ACMMM, pp. 357–360 (2007)
20. Parameswaran, V., Chellappa, R.: View invariants for human action recognition. In: CVPR, pp. 613–619 (2003)
21. Campbell, L.W., Becker, D.A., Azarbayejani, A., Bobick, A.F., Pentland, A.: Invariant features for 3-D gesture recognition. In: FG, pp. 157–163 (1996)
22. Holte, M.B., Moeslund, T.B., Fihl, P.: View invariant gesture recognition using the CSEM SwissRanger SR-2 camera. IJISTA 5, 295–303 (2008)
23. Zelnik-Manor, L., Irani, M.: Event-based analysis of video. In: CVPR, pp. 123–130 (2001)
24. Weinland, D., Ronfard, R., Boyer, E.: Free viewpoint action recognition using motion history volumes. CVIU 104, 249–257 (2006)
25. Junejo, I.N., Dexter, E., Laptev, I., Perez, P.: View-independent action recognition from temporal self-similarities. PAMI 33, 172–185 (2011)
26. Farhadi, A., Tabrizi, M.K.: Learning to recognize activities from the wrong view point. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 154–166. Springer, Heidelberg (2008)
27. Rao, C., Yilmaz, A., Shah, M.: View-invariant representation and recognition of actions. IJCV 50, 203–226 (2002)
28. Gritai, A., Sheikh, Y., Rao, C., Shah, M.: Matching trajectories of anatomical landmarks under viewpoint, anthropometric and temporal transforms. IJCV 84, 325–343 (2009)
29. Shen, Y., Foroosh, H.: View-invariant action recognition using fundamental ratios. In: CVPR, pp. 1–8 (2008)
30. Shen, Y., Foroosh, H.: View-invariant action recognition from point triplets. PAMI 31, 1898–1905 (2009)
31. ul Haq, A., Gondal, I., Murshed, M.: On dynamic scene geometry for view-invariant action matching. In: CVPR, pp. 3305–3312 (2011)
32. Faugeras, O.D., Papadopoulo, T.: A nonlinear method for estimating the projective geometry of three views. In: ICCV, pp. 477–484 (1998)
33. Muller, M., Roder, T., Clausen, M., Eberhardt, B., Kruger, B., Weber, A.: Documentation mocap database HDM05. Technical report CG-2007-2, Universitat Bonn (2007)
Globally Optimal Closed-Surface Segmentation
for Connectomics
1 Introduction
This paper studies the problem of partitioning a volume image into a previously un-
known number of segments, based on a likelihood of merging adjacent supervoxels.
We choose a graphical model approach in which binary variables are associated with
the joint faces of supervoxels, indicating for each face whether the two adjacent super-
voxels should belong to the same segment (0) or not (1). Models of low order can lead
to inconsistencies where a face is labeled as 1 even though there exists a path from one
of the adjacent segments to the other along which all faces are labeled as 0. As a result,
the union of all faces labeled as 1 need not form closed surfaces. Such inconsistencies
can be excluded by a higher-order conditional random field (CRF) [1] that constrains
the binary labelings to the multicut polytope [2], thus ensuring closed surfaces. While
the number of multicut constraints can be exponential [3], constraints that are violated
by a given labeling can be found in quadratic time [4]. The MAP inference problem
can therefore be addressed by the cutting-plane method, i.e. by solving a sequence of
relaxed problems to global optimality until no more constraints are violated [4].
Here, we show that the optimization scheme described in [1] is unsuitable for large
3D segmentations where the supervoxel adjacency graph is denser and non-planar. We
therefore extend the cutting-plane approach by adding only constraints which are facet-
defining by a property of the multicut polytope (Section 4.3), a double-ended, paral-
lel search to find violated constraints (Section 4.2) and a problem-specific warm-start
Contributed equally.
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 778–791, 2012.
© Springer-Verlag Berlin Heidelberg 2012
heuristic (Section 4.4). This approach scales gracefully to volume images that consist
of 10⁹ voxels and are initially segmented into 10⁶ supervoxels (Section 5).
The fact that exact MAP inference remains tractable at this scale is important for
the reconstruction of neural circuits from electron microscopic volume images (Fig. 1),
where multicut constraints substantially improve the quality of segmentations across
different imaging techniques (Section 5).
For this quantitative analysis, we manually segmented 10⁹ voxels of two different
datasets acquired at different laboratories using different imaging techniques, and
assessed the performance of three models: the simplest, local model uses learned unary
conditional probabilities that state, independently, whether each face should be on (establish
part of an object boundary) or off (merge adjacent supervoxels). Optimization of this
model almost always results in inconsistent labelings (Figs. 3b, 3e). The intermediate,
finite-order model guarantees consistency across a small local horizon, leading to better
results. The best results are obtained with the fully constrained model, which admits only
labelings that are globally consistent. With the cutting-plane approach proposed here,
the full model is often faster to optimize than the finite-order approximation. Figs. 3, 4,
and 5 summarize why the fully constrained method is the one we recommend.
2 Related Work
The problem we address is known as the multicut problem [5] in combinatorial op-
timization and as correlation clustering [6,7] in statistics, and it is a special case of
the partition problem [2]. Both problems are NP-hard [6,8]. Instances of the multicut
problem have been solved by tightening an outer approximation of the multicut poly-
tope [9] via cutting planes [10,4]. In computer vision, this technique has been applied
in [11,12,13] where the relaxed LP is solved first, as well as in [1] where the inte-
grality constraints are kept throughout the cutting-plane loop. Cutting planes have also
been used to enforce connectivity in foreground vs. background image segmentation
[14,15,16]. Here, we build on the probabilistic formulation in [1] but without the likeli-
hood terms w.r.t. geometry that were shown to have a negligible effect on segmentations
of photographs.
780 B. Andres et al.
3 Probabilistic Model
In order to make the duality between supervoxels and their joint faces explicit, we
build a cell-complex (C, ≺, dim) representation [18] of the supervoxel segmentation
by connected component labeling of the finest possible grid-cell topology. This yields
disjoint sets of supervoxels C_3, joint faces between supervoxels C_2, lines C_1 and points
C_0. The function dim : C → ℕ maps each cell to its dimension. For any cells c, c′ ∈
C = ⋃_{j=0}^{3} C_j, c ≺ c′ indicates that c bounds c′, which implies dim(c) < dim(c′).
This representation has the advantage that multiple disconnected joint faces separating
the same pair of supervoxels can be treated independently for feature extraction and
classification (see below). We model the posterior probability of a joint labeling y ∈
{0,1}^{|C_2|} of all faces, given
- m ∈ ℕ features f_c ∈ ℝ^m of each face c ∈ C_2, summarized in a vector F ∈ ℝ^{m|C_2|},
- the bounding relation between faces and supervoxels, encoded in a topology matrix
T ∈ {0,1}^{|C_2|×|C_3|} in which T_{cc′} = 1 if and only if c ≺ c′.
We assume that features are independent of the topology, (a) F ⊥ T, (b) F ⊥ T | y;
that features of any faces c ≠ c′ are independent, (c) f_c ⊥ f_{c′}, (d) f_c ⊥ f_{c′} | y; and that
the label of any face c is independent of the label of any face c′ ≠ c given the features
f_c, (e) y_c ⊥ y_{c′} | f_c. (Note, however, that y_c and y_{c′} are not independent given T.)
From these conditional independence assumptions it follows that

$$
\begin{aligned}
p(y\,|\,F,T) &= \frac{p(F,T\,|\,y)\,p(y)}{p(F,T)}
\overset{(a,b)}{=} \frac{p(F\,|\,y)\,p(T\,|\,y)\,p(y)}{p(F)\,p(T)} \\
&\overset{(c,d)}{=} \frac{p(T\,|\,y)}{p(T)} \prod_{c\in C_2} \frac{p(f_c\,|\,y)\,p(y)}{p(f_c)}
= \frac{p(T\,|\,y)}{p(T)} \prod_{c\in C_2} p(y\,|\,f_c) \\
&\overset{(e)}{=} \frac{p(T\,|\,y)}{p(T)} \prod_{c\in C_2} p(y_c\,|\,f_c)
= \frac{p(T\,|\,y)}{p(T)} \prod_{c\in C_2} \frac{p(f_c\,|\,y_c)\,p(y_c)}{p(f_c)} \\
&\propto p(T\,|\,y) \prod_{c\in C_2} p(f_c\,|\,y_c)\,p(y_c) \;. \qquad (1)
\end{aligned}
$$

The prior p(y_c) is assumed to be identical for all faces and is specified with a single
design parameter β ∈ (0,1) as p(y_c = 1) = β and p(y_c = 0) = 1 − β.
The likelihood p(T|y) is the instrument that is used to enforce consistency: while
uninformative for all consistent labelings, it assigns zero probability to all inconsistent
labelings. Consistent labelings y ∈ MC, i.e. labelings inside the multicut polytope¹, are
defined in (2) w.r.t. the set SC(n) of all simple cycles (s_1, t_1, ..., s_n, t_n, s_1) over n ∈ ℕ
pairwise distinct supervoxels (s_1, ..., s_n) via pairwise distinct faces (t_1, ..., t_n). The
inequalities in (2) ensure that along each cycle either none or more than one face is
labeled as 1:

$$
\mathrm{SC}(n) = \left\{ (c_j)_{j\in\mathbb{N}_{2n+1}} \text{ in } C \;\middle|\;
\begin{array}{l}
c_1 = c_{2n+1} \,\wedge\, \forall j\in\mathbb{N}_n:\; c_{2j-1}\in C_3 \wedge c_{2j}\in C_2 \\
\forall j\in\mathbb{N}_n:\; c_{2j}\prec c_{2j-1} \wedge c_{2j}\prec c_{2j+1} \\
\forall j,k\in\mathbb{N}_n:\; j\neq k \Rightarrow (c_{2j}\neq c_{2k} \wedge c_{2j-1}\neq c_{2k-1})
\end{array}\right\}
$$
$$
\mathrm{MC}(n) = \Big\{ y\in\{0,1\}^{|C_2|} \;\Big|\; \forall (s_1,t_1,\ldots,s_n,t_n,s_1)\in\mathrm{SC}(n):\; y_{t_1} \le \textstyle\sum_{j=2}^{n} y_{t_j} \Big\}
$$
$$
\mathrm{MC} = \bigcap_{j=1}^{|C_2|} \mathrm{MC}(j) \qquad (2)
$$
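The inconsistencies that (2) excludes can be detected constructively: a face labeled 1 whose two supervoxels remain connected through 0-labeled faces closes a cycle with exactly one 1-label, violating a cycle inequality. A minimal Python sketch on a toy adjacency graph follows; encoding each face as a pair of supervoxel indices is a simplification of the paper's cell-complex data structures:

```python
from collections import deque

def violated_cycles(n_nodes, faces, labels):
    """Find cycle inequalities violated by a face labeling.

    faces  : list of (u, v) supervoxel pairs joined by each face
    labels : 0 = merge, 1 = boundary
    Returns, for each 1-labeled face whose endpoints stay connected
    through 0-labeled faces, the path of 0-faces closing the cycle.
    """
    # adjacency over 0-labeled ("merge") faces only
    adj = {u: [] for u in range(n_nodes)}
    for idx, (u, v) in enumerate(faces):
        if labels[idx] == 0:
            adj[u].append((v, idx))
            adj[v].append((u, idx))
    violations = []
    for idx, (u, v) in enumerate(faces):
        if labels[idx] != 1:
            continue
        # BFS from u towards v through 0-labeled faces only
        parent = {u: None}
        queue = deque([u])
        while queue and v not in parent:
            x = queue.popleft()
            for y, fidx in adj[x]:
                if y not in parent:
                    parent[y] = (x, fidx)
                    queue.append(y)
        if v in parent:  # inconsistent: cycle with exactly one 1-face
            path = []
            node = v
            while parent[node] is not None:
                node, fidx = parent[node]
                path.append(fidx)
            violations.append((idx, path))
    return violations
```

Running BFS once per 1-labeled face is in the spirit of the quadratic-time separation cited in Section 1, though the paper's double-ended, parallel variant (Section 4.2) is not reproduced here.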
The likelihood p(f_c|y_c) is learned by means of a random forest. More precisely, we
learn p̂(y_c|f_c) from class-balanced training data, i.e. with p̂(y_c) = 0.5, and assume
p̂(f_c|y_c) = p(f_c|y_c). Therefore, p(f_c|y_c) ∝ p̂(y_c|f_c) p̂(f_c) and thus,

$$ p(y\,|\,F,T) \propto p(T\,|\,y) \prod_{c\in C_2} \hat{p}(y_c\,|\,f_c)\,p(y_c) \;. \qquad (3) $$

For comparison, we consider two simpler models: a local model in which p(T|y) is
uniform and thus faces are labeled independently,

$$ p'(y\,|\,F,T) \propto \prod_{c\in C_2} \hat{p}(y_c\,|\,f_c)\,p(y_c) \;, \qquad (4) $$

and a finite-order approximation of (3) in which not all multicut constraints need to be
fulfilled but only those that correspond to cycles up to length 4. With p″(T|y) = const.
if y ∈ ⋂_{j=1}^{4} MC(j) and p″(T|y) = 0 otherwise,

$$ p''(y\,|\,F,T) \propto p''(T\,|\,y) \prod_{c\in C_2} \hat{p}(y_c\,|\,f_c)\,p(y_c) \;. \qquad (5) $$
4 MAP Inference
4.1 Integer Linear Programming Problem (ILP)
Instead of maximizing (3), we minimize its negative log-likelihood by solving the ILP

$$ \min_{y\in\{0,1\}^{|C_2|}} w^{\mathsf{T}} y \quad \text{subject to} \quad y \in \mathrm{MC} \;. \qquad (6) $$
¹ The multicut polytope is the convex hull of MC, by Lemma 2.2 in [2].
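The cutting-plane scheme of Section 1 applied to the ILP (6) can be sketched in a few lines: solve a relaxed problem with the constraints collected so far, search for a violated cycle inequality, add it, and repeat until none is found. In this illustrative toy, a brute-force enumeration stands in for the ILP solver (the paper uses CPLEX/Gurobi), so it only works on tiny instances:

```python
from itertools import product as iproduct
from collections import deque

def multicut_cutting_plane(n_nodes, faces, w):
    """Toy cutting-plane loop for the multicut ILP: brute force replaces
    a real ILP solver; faces are (u, v) supervoxel pairs, w their costs."""
    constraints = []  # each (t, rest) encodes the inequality y_t <= sum_{j in rest} y_j

    def solve(constraints):
        best, best_y = None, None
        for y in iproduct((0, 1), repeat=len(faces)):
            if any(y[t] > sum(y[j] for j in rest) for t, rest in constraints):
                continue
            cost = sum(wi * yi for wi, yi in zip(w, y))
            if best is None or cost < best:
                best, best_y = cost, y
        return best_y

    def separate(y):  # find a cycle carrying exactly one 1-labeled face
        adj = {u: [] for u in range(n_nodes)}
        for i, (u, v) in enumerate(faces):
            if y[i] == 0:
                adj[u].append((v, i)); adj[v].append((u, i))
        for i, (u, v) in enumerate(faces):
            if y[i] != 1:
                continue
            parent = {u: None}; q = deque([u])
            while q:
                x = q.popleft()
                for z, j in adj[x]:
                    if z not in parent:
                        parent[z] = (x, j); q.append(z)
            if v in parent:
                path = []; node = v
                while parent[node] is not None:
                    node, j = parent[node]; path.append(j)
                return (i, path)
        return None

    while True:
        y = solve(constraints)
        cut = separate(y)
        if cut is None:  # y is in MC: globally optimal closed surfaces
            return y
        constraints.append(cut)
```

The paper additionally restricts the added inequalities to facet-defining ones and warm-starts the solver (Sections 4.3 and 4.4); neither refinement is modeled here.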
5 Applications
5.1 SBEM Volume Image
A volume image referred to as E1088 in [19] was acquired using serial block-face
electron microscopy (SBEM) [20] and shows a section of rabbit retina at the almost
isotropic resolution of 22 × 22 × 30 nm³. A small subset is shown in Fig. 3a. The bright
intra-cellular space that makes up more than 90% of the volume contrasts the stained
extra-cellular space that forms thin membranous faces. This staining [21] simplifies the
automated segmentation because no intra-cellular structures such as mitochondria or
vesicles are visible (Fig. 3d shows a different staining). A supervoxel segmentation is
obtained as described in Appendix C.
Features fc of each face c C2 described in Appendix B include statistics of voxel
features over c as well as characteristics of the two supervoxels that are bounded by c.
In order to learn p̂(y_c|f_c) by means of a random forest, 437 faces per class were
labeled interactively in a subset of 250³ voxels in about one hour, starting with obvious
cases and continuing where the predictions needed improvement. A custom extension
of ILASTIK [22] implements this online learning workflow. The user can mark faces as
"on", "off" or "unlabeled" via a single mouse click and inspect intermediate predictions
by viewing faces colored according to p̂(y_c|f_c).
Qualitative results on independent test data are shown in Fig. 3a-c. The global maxi-
mum of the local model (4) is inconsistent, i.e. not all surfaces are closed. A consistent
labeling with closed surfaces and thus a segmentation is obtained by merging super-
voxels transitively, i.e. whenever there exists a path from one supervoxel to the other
along which all faces are labeled as 0 (cf. Section 4.4), regardless of how many faces
between these supervoxels are labeled as 1. This mapping to the multicut polytope is
biased towards under-segmentation (Fig. 3b). In contrast, the global maximum of the
fully constrained model (3) is consistent and directly yields a segmentation with closed
surfaces (Fig. 3c). Note also the quality of the initial supervoxel segmentation, of
which only 19.3% of all faces are removed in the global optimum found.
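The transitive merging used above to map an inconsistent labeling onto the multicut polytope can be sketched with a union-find structure: every 0-labeled face merges its two supervoxels, transitively, regardless of 1-labeled faces elsewhere. Encoding faces as supervoxel index pairs is again a simplification:

```python
def transitive_merge(n_supervoxels, faces, labels):
    """Map a (possibly inconsistent) face labeling to a segmentation:
    merge the two supervoxels of every 0-labeled face, transitively,
    ignoring any 1-labeled faces between merged supervoxels."""
    parent = list(range(n_supervoxels))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (u, v), y in zip(faces, labels):
        if y == 0:
            parent[find(u)] = find(v)
    # relabel components 0..k-1 in order of first appearance
    seen, seg = {}, []
    for x in range(n_supervoxels):
        r = find(x)
        seg.append(seen.setdefault(r, len(seen)))
    return seg
```

As the text notes, this mapping is biased towards under-segmentation: a single incorrectly 0-labeled face suffices to merge two objects.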
Quantitatively, the effect of introducing multicut constraints is shown in Fig. 4a
where maxima of (3), the local model (4) and the finite-order approximation (5) are
compared to a man-made segmentation of 400 × 200 × 200 voxels (Appendix A) in
(a) SBEM raw data (scale bar 1 µm) (b) w/o constraints, β = 0.5 (c) with constraints, β = 0.5
(d) FIBSEM raw data (scale bar 1 µm) (e) w/o constraints, β = 0.5 (f) with constraints, β = 0.5
Fig. 3. Segmentations of the SBEM dataset (a-c, 242² voxels) and FIBSEM dataset (d-f, 512²
voxels). Faces are colored to show whether the algorithm decides that they are part of a segment
boundary (blue) or not (magenta). Importantly, yellow faces are decided to be part of a segment
boundary but are ignored because the bounded supervoxels are merged elsewhere.
(a) SBEM (400 × 200 × 200 voxels) (b) FIBSEM (900 × 900 × 900 voxels)
Fig. 4. Variation of information (VI) and Rand index (RI) from a comparison of man-made seg-
mentations (Appendix A) with segmentations obtained as maxima of the fully constrained model
(Eq. 3), the finite-order approximation (Eq. 5), and the local model (Eq. 4), for various priors
β. Exact optimization of the finite-order approximation becomes intractable for most β on the
FIBSEM data. In contrast, optima of the full model are found in less than 13 minutes (Fig. 5b).
Table 1. Globally optimal closed-surface segmentations of the SBEM and FIBSEM volume
images of 800³ voxels are found in less than 13 minutes using the proposed cutting-plane method,
which is about 22 times as fast as the optimization scheme in [1]
terms of the Variation of Information (VI) [23] and Rand Index (RI) [24]. The overall
best segmentation is obtained from (3), i.e. with all multicut constraints, and without an
artificial bias, i.e. β = 0.5; note that the bias term in (7) vanishes for β = 0.5. Without any
multicut constraints, the best segmentation, obtained for β = 0.8, is worse in terms of
both VI and RI. Segmentations of intermediate quality are obtained from the finite-order
approximation (5).
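Both comparison measures can be sketched directly on per-voxel label vectors. These are the standard definitions of VI [23] and RI [24], not the paper's implementation, and the naive O(n²) Rand index is only suitable for small examples:

```python
from math import comb, log
from collections import Counter

def rand_index(a, b):
    """Rand index between two segmentations given as per-voxel labels:
    the fraction of voxel pairs on which the segmentations agree about
    being in the same segment or in different segments."""
    n = len(a)
    agree = sum(
        (a[i] == a[j]) == (b[i] == b[j])
        for i in range(n) for j in range(i + 1, n)
    )
    return agree / comb(n, 2)

def variation_of_information(a, b):
    """VI(a, b) = H(a) + H(b) - 2 I(a; b), in nats; 0 iff the two
    segmentations are identical up to relabeling."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    h = lambda c: -sum(v / n * log(v / n) for v in c.values())
    mi = sum(v / n * log((v / n) / (pa[x] / n * pb[y] / n))
             for (x, y), v in pab.items())
    return h(pa) + h(pb) - 2 * mi
```

Note that both measures are invariant to label permutations, so segment IDs need not match between the two segmentations.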
Runtimes for optimizing (3) are shown in Fig. 5a and Tab. 1 for our C++ implementation.
The ILP (6) is solved with IBM ILOG CPLEX and Gurobi, alternatively, with the
duality gap set to 0 in order to obtain globally optimal solutions. Global optima of (3),
for blocks of 800³ voxels with 1.6 × 10⁶ and 6 × 10⁵ faces (and variables), respectively, are
found in less than 13 minutes (Tab. 1), about 22 times as fast as with the optimization
scheme in [1].
Maximizing (3) can be faster than maximizing the finite-order approximation (5),
which does not guarantee closed surfaces and empirically yields worse segmentations
(runtimes not shown). We therefore recommend using all multicut constraints.
5.2 FIBSEM Volume Image

A volume image acquired with a focused ion beam serial-section electron microscope
(FIBSEM) [25] shows a section of adult mouse somatosensory cortex at the almost
isotropic resolution of 5 × 5 × 6 nm³. A small subset is shown in Fig. 3d. Intra- and extra-
cellular space are indistinguishable by brightness and texture. Not only cell membranes
but also intra-cellular structures such as mitochondria and vesicles are visible due to
a different staining. The resolution is four times as high as that of the SBEM image.
However, intra-cellular structures have membranous surfaces themselves and thus make
the problem of segmenting entire cells more difficult. A supervoxel segmentation is
obtained as described in Appendix C.
We use the same features as for the SBEM dataset, but adjusted in scale. Our labeling
strategy was to annotate membranes of both cells and mitochondria³ as y_c = 1.
Qualitative results are shown in Fig. 3d-f for a slice of 512³ voxels. The MAP labeling
of the local model (4) is inconsistent, and almost all faces are removed by transitivity
³ It is still possible to obtain the geometry of cells because mitochondria can be detected reliably
[26], even via their mean gray-value once a segmentation is available. Segments which are
classified as mitochondria are disregarded when computing VI [23] and RI [24].
[Fig. 5 plots omitted: four panels of wall-clock runtime in minutes, plotted against the number of joint faces (variables) and against the number of voxels, with curves labeled all, noC, noD, noP, noCDPW.]
Fig. 5. Wall-clock runtimes for optimizing (3) with β = 0.5 (on an 8-core Intel i7 at 2.8 GHz),
with and without improvements to the optimization scheme in [1], for increasing problem size
(number of joint faces of supervoxels, left, and number of voxels, right), for datasets ranging
in size from 150³ through 800³ voxels. all: proposed optimization scheme; noC: no chordality
check; noD: no double-ended search; noP: no parallel search; noCDPW: no improvements.
Analogous plots for Gurobi instead of CPLEX are provided as supplementary material.
(a) frames 1285, 1950 (b) without multicut constraints, β = 0.5 (c) with multicut constraints, β = 0.5
Fig. 6. Color video segmentation. Supervoxel faces are colored blue if y_c = 1, magenta if y_c = 0,
and yellow if the decision is inconsistent, i.e. y_c = 1 but the adjacent supervoxels are merged.
Segments are filled with their mean color.
(note the lack of blue faces in Fig. 3e). In contrast, the global maximum of the fully
constrained model (3) is consistent (Fig. 3f). Whereas in the SBEM dataset only 19.3%
of the faces are removed, 80.7% of all faces are removed here, due to the inferior
supervoxel segmentation.
Quantitative results are shown in Fig. 4b. VI and RI are computed w.r.t. a complete
segmentation of 900³ voxels carried out by a neurobiologist (Appendix A). As for the
SBEM volume, the best segmentations are obtained from the full model (3). Exact MAP
inference for the finite-order approximation (5) becomes intractable for most β.
Runtimes are shown in Fig. 5b for problem instances from blocks between 150³ and
800³ voxels. Unlike for the SBEM dataset (Fig. 5a), the overall speedup is dominated
by the chordality check.
6 Discussion
In a fully connected graph with n ∈ ℕ nodes, the $\binom{n}{3}$ cycle constraints of order 3
imply all $\sum_{j=0}^{n-4} \frac{n!}{j!}$ higher-order cycle constraints. However, in the sparse graphs that
we consider, the higher-order constraints need to be dealt with explicitly. An example
we consider, the higher-order constraints need to be dealt with explicitly. An example
is depicted in Fig. 2 where the constraint that corresponds to the blue cycle on the
left has order 4, excluding from the feasible set a locally closed loop that is globally
inconsistent. Our experiments have shown that including these higher-order constraints
is essential to achieve the best performance w.r.t. ground truth.
When inconsistent labelings are permitted, unlike in the fully constrained model
(3), and mapped to the multicut polytope as described in Section 4.4, the risk of false
mergers is higher in 3D than in 2D because there are on average more and shorter paths
from one supervoxel to another along which faces can be incorrectly labeled as 0. We
therefore expect multicut constraints to be more important in 3D than in 2D.
Solving (6) as proposed here requires the solution of problems of an NP-hard class.
Whether or not this is tractable in practice depends on the quality of the predictions
p(yc |fc ). The learning of this function is therefore important.
The warm-start heuristic described in Section 4.4 is biased maximally towards under-
segmentation. Smarter heuristics that explore local neighborhoods in the cell complex
are a subject of future work. The model (3) can be grafted onto any method, 2D or 3D, that
finds superpixels or supervoxels and probabilities that the joint boundaries of these should
be preserved or removed. Whether it can improve segmentations of images acquired by
transmission electron microscopy is a subject of future research.
The reconstruction of neural circuits such as a neocortical column or the central nervous
system of Drosophila melanogaster will eventually require the segmentation of
volume images of 10¹² voxels. The result that 10⁹ voxels can be segmented by optimizing
a non-submodular higher-order multicut objective exactly, on a single computer, in
13 minutes is encouraging.
7 Conclusion
We have addressed the problem of segmenting volume images based on a learned like-
lihood of merging adjacent supervoxels. To solve this problem, we have adapted a
probabilistic model that enforces consistent decisions via multicut constraints to 3D
cell topologies and suggested a fast scheme for exact MAP inference. The resulting
22-fold speedup has allowed us to systematically study the positive effect of multicut
constraints in large-scale 3D segmentation problems for neural circuit reconstruction.
The best segmentations have been obtained for an unbiased parameter-free model with
multicut constraints.
(a) SBEM, 1400 × 1400 × 600 voxels (b) FIBSEM, 900 × 900 × 900 voxels
Fig. 7. The largest volume images we have segmented with multicut constraints. Shown are 600
segments in the SBEM dataset and 50 objects in the FIBSEM dataset.
A Man-Made Segmentations

We manually segmented subsets of the SBEM and the FIBSEM dataset (Fig. 8) using
the interactive method [27]. Each object was segmented independently. Some independently
segmented objects needed correction because there was overlap. This ensures a
consistent ground truth and shows that the segmentation problem is non-trivial, even for
a human. For the SBEM dataset, 528 objects (90.8% of 400 × 200 × 200 voxels) were
segmented by one expert in two weeks. For the FIBSEM dataset, 514 objects (96.8% of
a cubic block of 900³ voxels) were segmented by a neurobiologist in three weeks.
Fig. 8. Man-made segmentations (ground truth). Shown are orthogonal slices in which a random
color is assigned to each segment.
Table 2. Features of supervoxel faces. The statistics (*) include the min, max, mean, median,
standard deviation, and the 0.25- and 0.75-quantiles over all voxels adjacent to the face.
B Features
From every joint face of adjacent supervoxels, 31 features are extracted (Tab. 2a). One
of these features is the response of a random forest that discriminates between membranes
on the one hand and intra-/extra-cellular tissue on the other, based on 28
rotation-invariant non-linear features of local neighborhoods of 11³ voxels (Tab. 2b).
C Supervoxel Segmentation
We computed supervoxels by marker-based watersheds. To obtain an elevation map and
markers for the SBEM dataset, we trained a random forest classifier to distinguish between
two classes of voxels, extra-cellular space and intra-cellular space, based on rotation-invariant
features of local neighborhoods (Tab. 2b and Appendix B). As training data,
1600 voxels per class were labeled interactively in two subsets of 150³ voxels, which
took about three hours using ILASTIK [22]. Predicted probabilities are used as elevation
levels. Connected components of at least three voxels classified as intra-cellular space
are used as markers. For the FIBSEM dataset, the elevation level is defined as the largest
eigenvalue of the Hessian matrix at scale 1.6. Markers are taken to be maximal plateaus
of the raw data that consist of at least two voxels.
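The marker-based watershed described above can be sketched as a priority flood: markers grow over the elevation map, lowest elevations first. This is an illustrative simplification (2-D, 4-neighborhood, pure Python), not the pipeline's implementation:

```python
import heapq

def marker_watershed(elevation, markers):
    """Grow labeled markers over a 2-D elevation map, flooding lowest
    elevations first (simplified marker-based watershed; the paper uses
    membrane probabilities or Hessian eigenvalues as elevation)."""
    h, w = len(elevation), len(elevation[0])
    labels = [row[:] for row in markers]  # 0 = unlabeled
    heap = []
    for i in range(h):
        for j in range(w):
            if labels[i][j]:
                heapq.heappush(heap, (elevation[i][j], i, j))
    while heap:
        _, i, j = heapq.heappop(heap)
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < h and 0 <= nj < w and not labels[ni][nj]:
                labels[ni][nj] = labels[i][j]
                heapq.heappush(heap, (elevation[ni][nj], ni, nj))
    return labels
```

Each resulting catchment basin becomes one supervoxel; the faces between basins are the variables of the model in Section 3.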
References
1. Andres, B., Kappes, J.H., Beier, T., Köthe, U., Hamprecht, F.A.: Probabilistic image segmentation
with closedness constraints. In: ICCV (2011)
2. Chopra, S., Rao, M.R.: The partition problem. Math. Program. 59, 87–115 (1993)
3. Alt, H., Fuchs, U., Kriegel, K.: On the Number of Simple Cycles in Planar Graphs. In:
Möhring, R.H. (ed.) WG 1997. LNCS, vol. 1335, pp. 15–24. Springer, Heidelberg (1997)
4. Grötschel, M., Wakabayashi, Y.: A cutting plane algorithm for a clustering problem. Math.
Program. 45, 59–96 (1989)
5. Costa, M.-C., Létocart, L., Roupin, F.: Minimal multicut and maximal integer multiflow: A
survey. European J. of Oper. Res. 162, 55–69 (2005)
6. Bansal, N., Blum, A., Chawla, S.: Correlation clustering. Machine Learning 56, 89–113 (2004)
7. Demaine, E.D., Emanuel, D., Fiat, A., Immorlica, N.: Correlation clustering in general
weighted graphs. Theoretical Computer Science 361, 172–187 (2006)
8. Garey, M., Johnson, D.: Computers and Intractability: A Guide to the Theory of NP-Completeness.
Freeman, New York (1979)
9. Sontag, D., Jaakkola, T.: New outer bounds on the marginal polytope. In: NIPS (2008)
10. Barahona, F., Grötschel, M., Jünger, M., Reinelt, G.: An application of combinatorial optimization
to statistical physics and circuit layout design. Oper. Res. 36, 493–513 (1988)
11. Kappes, J.H., Speth, M., Andres, B., Reinelt, G., Schnörr, C.: Globally Optimal Image
Partitioning by Multicuts. In: Boykov, Y., Kahl, F., Lempitsky, V., Schmidt, F.R. (eds.)
EMMCVPR 2011. LNCS, vol. 6819, pp. 31–44. Springer, Heidelberg (2011)
12. Kim, S., Nowozin, S., Kohli, P., Yoo, C.D.: Higher-order correlation clustering for image
segmentation. In: NIPS (2011)
13. Nowozin, S., Jegelka, S.: Solution stability in linear programming relaxations: graph partitioning
and unsupervised learning. In: ICML (2009)
14. Vicente, S., Kolmogorov, V., Rother, C.: Graph cut based image segmentation with connectivity
priors. In: CVPR (2008)
15. Nowozin, S., Lampert, C.H.: Global connectivity potentials for random field models. In:
CVPR (2009)
16. Lempitsky, V., Kohli, P., Rother, C., Sharp, T.: Image segmentation with a bounding box
prior. In: ICCV (2009)
17. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images
and its application to evaluating segmentation algorithms and measuring ecological statistics.
In: ICCV (2001)
18. Hatcher, A.: Algebraic Topology. Cambridge Univ. Press (2002)
19. Helmstaedter, M., Briggman, K.L., Denk, W.: High-accuracy neurite reconstruction for high-throughput
neuroanatomy. Nature Neuroscience 14, 1081–1088 (2011)
20. Denk, W., Horstmann, H.: Serial block-face scanning electron microscopy to reconstruct
three-dimensional tissue nanostructure. PLoS Biology 2, e329 (2004)
21. Briggman, K.L., Helmstaedter, M., Denk, W.: Wiring specificity in the direction-selectivity
circuit of the retina. Nature 471, 183–188 (2011)
22. Sommer, C., Straehle, C., Köthe, U., Hamprecht, F.A.: Ilastik: Interactive Learning and
Segmentation Toolkit. In: ISBI (2011)
23. Meilă, M.: Comparing clusterings: an information based distance. J. of Multivariate
Anal. 98, 873–895 (2007)
24. Rand, W.M.: Objective criteria for the evaluation of clustering methods. Journal of the American
Statistical Association 66, 846–850 (1971)
25. Knott, G., Marchman, H., Wall, D., Lich, B.: Serial section scanning electron microscopy of
adult brain tissue using focused ion beam milling. J. Neurosci. 28, 2959–2964 (2008)
26. Lucchi, A., Smith, K., Achanta, R., Knott, G., Fua, P.: Supervoxel-based segmentation of
mitochondria in EM image stacks with learned shape features. IEEE Transactions on Medical
Imaging 31, 474–486 (2012)
27. Straehle, C.N., Köthe, U., Knott, G., Hamprecht, F.A.: Carving: Scalable Interactive Segmentation
of Neural Volume Electron Microscopy Images. In: Fichtinger, G., Martel, A.,
Peters, T. (eds.) MICCAI 2011, Part I. LNCS, vol. 6891, pp. 653–660. Springer, Heidelberg (2011)
Reduced Analytical Dependency Modeling
for Classifier Fusion
1 Introduction
Many computer vision and pattern recognition applications face the challenges
of complex scenes with cluttered backgrounds, small inter-class variations and
large intra-class variations. To address this problem, many algorithms have been
developed to extract local or global discriminative features, such as Local Binary
Patterns [1], Laplacianfaces [2], etc. Instead of extracting a highly discriminative
feature, classifier fusion has been proposed, and the results are encouraging
[3] [4] [5] [6] [7] [8] [9] [10]. While many classifier combination techniques [3]
have been studied and developed in the last decade, it is a general assumption
that classification scores are conditionally independently distributed. With this
assumption, the joint probability of all the scores can be expressed as the product
of marginal probabilities. The conditional independence assumption simplifies
the problem, but may not be valid in many practical applications.
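Under that conditional-independence assumption, the fused score of a class is simply the (renormalized) product of the marginal posteriors reported by the individual classifiers. A minimal sketch, not tied to any particular classifier library:

```python
def product_rule(posteriors):
    """Fuse per-classifier posteriors under the conditional-independence
    assumption: the joint score of each class is the product of the
    marginal posteriors Pr(class | x_m), renormalized over classes.

    posteriors: one posterior vector (over classes) per classifier."""
    n_classes = len(posteriors[0])
    scores = [1.0] * n_classes
    for p in posteriors:
        for l in range(n_classes):
            scores[l] *= p[l]
    z = sum(scores)
    return [s / z for s in scores] if z else scores
```

Note how a single confident classifier dominates the product; when the scores are in fact dependent, this can amplify correlated errors, which is the motivation for the dependency models discussed next.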
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 792–805, 2012.
© Springer-Verlag Berlin Heidelberg 2012
$$ \Pr(\omega_l \mid x_1, \ldots, x_M) = \frac{P_0}{\Pr(\omega_l)^{M-1}} \prod_{m=1}^{M} \Pr(\omega_l \mid x_m) \qquad (1) $$
$$ \Pr(\omega_l \mid x_1, \ldots, x_M) \approx P_0 \Big[ (1 - M)\Pr(\omega_l) + \sum_{m=1}^{M} a_{lm} \Pr(\omega_l \mid x_m) \Big] \qquad (2) $$
where a_{l1}, ..., a_{lM} are the dependency weights. Then, the optimal LCDM model
was learned by solving a standard linear programming problem, which maximized
margins between genuine and imposter posterior probabilities in [10].
Besides the explicit dependency modeling methods [5] [7] [10], the optimal
weighting method (OWM) [4], LPBoost approaches [8] [12] and the reduced multivariate
polynomial model (RM) [6] can be used to combine classifiers with different
kinds of features to improve the recognition performance as well. OWM
and LPBoost methods aim at determining the correct weighting for a linear
combination by minimizing the classification error and the 1-norm soft margin error,
respectively. In order to describe nonlinear input-output relationships,
the reduced multivariate polynomial (RM) model was introduced in [6]. Since
the number of terms increases exponentially with the order of the multivariate
polynomial, Toh et al. [6] proposed to approximate the full polynomial by a
modified lumped multinomial. The optimal RM model was then learned via a
weight-decay regularization problem in [6].
$$ \text{Product:} \quad \Pr(\omega_l \mid x_1, \ldots, x_M) = P_0 \Big( L^{M-1} \prod_{m=1}^{M} s_{lm} \Big) = P_0\, h_{\mathrm{Product}}(s_l) $$
$$ \text{LCDM:} \quad \Pr(\omega_l \mid x_1, \ldots, x_M) \approx P_0 \Big( \sum_{m=1}^{M} a_{lm} s_{lm} + \frac{1 - M}{L} \Big) = P_0\, h_{\mathrm{LCDM}}(s_l) \qquad (3) $$
796 A.J. Ma and P.C. Yuen
As indicated in (3), the Product rule and the LCDM model can be formulated as two
different functions h_Product and h_LCDM on the posterior probabilities s_{l1}, ..., s_{lM}.
This implies that if we choose a function different from h_Product with its independence
assumption, e.g. h_LCDM, the dependency can be modeled. On the other hand,
LCDM was proposed under the assumption that the posterior probabilities of each
classifier do not deviate dramatically from the priors, as mentioned in [10].
Without these assumptions, however, the function h_l for class ω_l on s_{l1}, ..., s_{lM}
should be different from h_Product and h_LCDM. Generally speaking, h_l can be
any function which models the dependency between feature measurements through the
posterior probabilities s_{l1}, ..., s_{lM}, i.e.
$$ \Pr(\omega_l \mid x_1, \ldots, x_M) = P_0\, h_l(s_l; \theta_l) \;. \qquad (4) $$
Analytical functions are very popular and have been employed in the Product rule
and LCDM. We follow this direction and consider h_l as an analytical function.
According to the definition of analytical functions [15], h_l can be expressed
explicitly by the converged power series
$$ h_l(s_l; \theta_l) = \sum_{|\alpha|=0}^{\infty} \theta_{l\alpha}\, s_l^{\alpha} \qquad (5) $$
Grouping the terms of (5) by the powers of s_{lm}, h_l can be rewritten as
$$ h_l(s_l; \theta_l) = \sum_{r=0}^{\infty} g_{lmr}(\bar{s}_{lm}; \theta_{lmr})\, s_{lm}^{r} \qquad (6) $$
where s̄_{lm} = (s_{l1}, ..., s_{l(m−1)}, s_{l(m+1)}, ..., s_{lM})ᵀ and g_{lmr} is an analytical
function of s̄_{lm} with coefficient vector θ_{lmr}. On the other hand, the posterior
probabilities can be given by the Bayes rule [13] as follows,
$$ \Pr(\omega_l \mid x_m) = \frac{\Pr(x_m \mid \omega_l)\,\Pr(\omega_l)}{\Pr(x_m)} \;, \qquad
\Pr(\omega_l \mid x_1, \ldots, x_M) = \frac{\Pr(x_1, \ldots, x_M \mid \omega_l)\,\Pr(\omega_l)}{\Pr(x_1, \ldots, x_M)} \qquad (7) $$
With the notation s_{lm} = Pr(ω_l | x_m), substituting the probabilities Pr(x_m | ω_l) in (7)
and Pr(x_1, ..., x_M | ω_l) in (9) into (8), we get
$$ s_{lm} = \int \prod_{i \neq m} \Pr(x_i) \Big[ \sum_{r=0}^{\infty} g_{lmr}(\bar{s}_{lm}; \theta_{lmr})\, s_{lm}^{r} \Big]\, dx_1 \cdots dx_{m-1}\, dx_{m+1} \cdots dx_M \qquad (10) $$
According to Abel's lemma [15], the series in (10) converges uniformly. Thus,
it can be integrated term by term, and equation (10) becomes
$$ s_{lm} = \sum_{r=0}^{\infty} G_{lmr}(\theta_{lmr})\, s_{lm}^{r} \qquad (11) $$
where $G_{lmr}(\theta_{lmr}) = \int \prod_{i \neq m} \Pr(x_i)\, g_{lmr}(\bar{s}_{lm}; \theta_{lmr})\, dx_1 \cdots dx_{m-1}\, dx_{m+1} \cdots dx_M$.
Comparing the left and the right hand sides of (11), i.e. matching the coefficients of
each power of s_{lm}, the following equations can be obtained:
$$ G_{lm1}(\theta_{lm1}) = 1 \qquad (12) $$
$$ G_{lmr}(\theta_{lmr}) = 0 \;, \quad \forall r \neq 1 \qquad (13) $$
According to the definition in (6), g_{lmr}(s̄_{lm}; θ_{lmr}) is an analytical function similar
to h_l(s_l; θ_l) in (5), and the vector s̄_{lm} can be considered as the mapping from the
feature measurements x_1, ..., x_{m−1}, x_{m+1}, ..., x_M to their posterior probabilities.
Therefore, the integration of $\prod_{i \neq m} \Pr(x_i)\, g_{lmr}(\bar{s}_{lm}; \theta_{lmr})$ over all feature
measurements except x_m, which is denoted by G_{lmr}(θ_{lmr}), is a linear function of the
coefficient vector θ_{lmr}. Without calculating the integration, we can observe that
θ_{lm0} = 0, θ_{lm2} = 0, ..., θ_{lmr} = 0, ... is a trivial solution to (13). Substituting
this trivial solution into (6), and under the assumption that ADM is
symmetric in each s_{lm}, we have the equations h_l(s_l; θ_l) = g_{lm1}(s̄_{lm}; θ_{lm1}) · s_{lm}
for 1 ≤ m ≤ M. This implies that each term in the power series h_l contains
all variables s_{l1}, ..., s_{lM} and the order of each s_{lm} cannot be larger than one.
In this case, there is only one non-zero term $\prod_{m=1}^{M} s_{lm}$ in the analytical
function h_l. In addition, according to (12), the ADM model becomes h_l(s_l; θ_l) =
$L^{M-1} \prod_{m=1}^{M} s_{lm}$, which is the Product rule under the conditional independence
assumption. This means the independence condition is equivalent to the situation that
the solution to (13) is trivial. In other words, if the solution to (13) is non-trivial,
the dependency can be modeled. For general analytical functions, the weight vectors
θ_{lm0}, θ_{lm2}, ..., θ_{lmr}, ... need not be zero. Consequently, ADM
can model dependency by choosing a non-trivial solution to (13).
$$ h_l(s_l; \theta_l) \approx h_l(s_l; \theta_l; K) = \sum_{|\alpha|=0}^{K} \theta_{l\alpha}\, s_l^{\alpha} \qquad (14) $$
$$ h_l(s_l; \theta_l) \approx h_l(s_l; \theta_l; R) = \sum_{r=0}^{R} g_{lmr}(\bar{s}_{lm}; \theta_{lmr})\, s_{lm}^{r} \qquad (15) $$
where R is a positive integer. According to the analysis in Section 3.1, the approximation in (15) can model dependency by setting the solutions to the first R equations in (13) as non-trivial. Therefore, the reduced form of ADM can model dependency. Equations (14) and (15) indicate that the highest orders of each term and of each variable are K and R, respectively. Combining these two equations,
the reduced analytical dependency model (RADM) is given by
h_l(s_l; \theta_l; K, R) = \sum_{|\nu|=0}^{K} \theta_{l\nu} \, s_l^{\nu}, \quad \text{s.t. } \nu = (n_1, \ldots, n_M)^T \text{ and } 0 \leq n_m \leq R \qquad (16)
Since the model order should be larger than or equal to the variable order, K \geq R. On the other hand, if K > MR, then h_l(s_l; \theta_l; K, R) = h_l(s_l; \theta_l; MR, R). This means the RADM model degenerates to h_l(s_l; \theta_l; MR, R) when K > MR. Therefore, the relationship between the model order K and the variable order R in RADM is restricted to R \leq K \leq MR.
Denote the score vector z_l = (s_l^0, \ldots, s_l^{\nu}, \ldots)^T, where s_l^0, \ldots, s_l^{\nu}, \ldots are the monomial terms in (16). With these notations, the RADM model (16) can be written as h_l(s_l; \theta_l; K, R) = \theta_l^T z_l. The algorithmic procedure to obtain z_l in RADM for class \omega_l is given in Algorithm 1.
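Algorithm 1 itself is not reproduced here, but the enumeration it performs — collecting every monomial s_l^{\nu} with total order at most K and per-variable order at most R — can be sketched as follows (a minimal illustration; the function name and the ordering of the terms are our own):

```python
from itertools import product

import numpy as np

def radm_score_vector(s, K, R):
    """Enumerate the monomial terms s_l^nu of (16): multi-indices
    nu = (n_1, ..., n_M) with 0 <= n_m <= R and |nu| <= K."""
    M = len(s)
    terms = []
    for nu in product(range(R + 1), repeat=M):  # per-variable order <= R
        if sum(nu) <= K:                        # total model order <= K
            terms.append(np.prod([s[m] ** nu[m] for m in range(M)]))
    return np.array(terms)
```

For M = 2 posteriors and K = 2, R = 1 this yields the four terms 1, s_{l2}, s_{l1}, s_{l1}s_{l2}; the RADM score h_l(s_l; \theta_l; K, R) is then the inner product \theta_l^T z_l.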
On the other hand, with equation (4), the posterior probability is computed by P_0 h_l(s_l; \theta_l). Since P_0 is a parameter depending on O_j, denote p_j as the parameter with respect to O_j. With these notations, we get
h_l(s_{jl}; \theta_l) \approx q_{jl} = \delta_{jl}/p_j, \quad \delta_{jl} = \begin{cases} 1, & \omega_l = y_j \\ 0, & \omega_l \neq y_j \end{cases} \qquad (17)
where s_{jl} = (s_{jl1}, \ldots, s_{jlM})^T, s_{jlm} = \Pr(\omega_l | x_{jm}) and x_{jm} = f_m(O_j). Denote z_{jl} = (s_{jl}^0, \ldots, s_{jl}^{\nu}, \ldots)^T, where s_{jl}^0, \ldots, s_{jl}^{\nu}, \ldots are the terms in (16). The objective function for learning the optimal RADM is given by approximating the distribution (17) under the regularized least squares criterion [17] as follows
\min_{\theta, q} \sum_{l=1}^{L} \sum_{j=1}^{J} (\theta_l^T z_{jl} - q_{jl})^2 + b \sum_{l=1}^{L} \|\theta_l\|^2 \qquad (18)
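Ignoring the additional constraints imposed later in (22), this objective decouples over the classes and each subproblem has the standard ridge-regression closed form; a minimal sketch (variable names are ours, not from the paper):

```python
import numpy as np

def fit_radm_weights(Z, q, b=1e-3):
    """Solve min_theta ||Z theta - q||^2 + b ||theta||^2 for one class.
    Z: (J, T) matrix whose rows are the score vectors z_jl;
    q: (J,) vector of target values q_jl from (17)."""
    T = Z.shape[1]
    # Normal equations of the regularized least squares problem (18)
    return np.linalg.solve(Z.T @ Z + b * np.eye(T), Z.T @ q)
```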
Let q_l = (q_{j_1 l}, \ldots, q_{j_{N_l} l})^T, such that y_{j_n} = \omega_l, where N_l is the number of samples for class \omega_l. On the other hand, denote A_l and B_l as matrices made up of the vectors
\min_{\theta, q} \sum_{l=1}^{L} \big[ (\theta_l^T A_l - q_l^T)(A_l^T \theta_l - q_l) + \theta_l^T B_l B_l^T \theta_l + b \, \theta_l^T \theta_l + c \, q_l^T q_l \big] \qquad (20)
\min_{\theta_l} \ \theta_l^T H_l \theta_l \qquad (22)
\text{s.t. i) } q_{ln} \geq \varepsilon, \ 1 \leq n \leq N_l; \quad \text{ii) } \theta_l^T z_{jl} \geq \varepsilon_0, \ 1 \leq j \leq J
4 Experiments
and Flower [19] databases, respectively. After that, the results for face recognition using CMU PIE [20] and FERET [21], and human action recognition using Weizmann [22] and KTH [23] databases, are reported in Sections 4.3 and 4.4, respectively. It is important to point out that the main objective of these experiments is to evaluate the performance of different classifier fusion methods, not state-of-the-art digit, flower, face and human action recognition algorithms.
The Multiple Feature digit database [18] contains ten digits from 0 to 9, and 200 examples for each digit. Six features, namely Fourier coefficients, profile correlations, Karhunen–Loève coefficients, pixel averages, Zernike moments and morphological features, are extracted [18]. In this experiment, we randomly select 20 samples of each digit for training and the rest for testing. Since the probabilities are hard to determine accurately due to the limited training samples, we use SVM classifiers [24] and normalize the SVM outputs by the double sigmoid method [25] to approximate the probabilities. To select the best parameters, five-fold cross validation (CV) is performed. The parameter C introduced in the soft-margin SVM is selected from {0.01, 0.1, 1, 10, 100, 1000}. The CV outputs of the SVMs are used to train the weights for classifier combination. The RADM parameters \varepsilon and \varepsilon_0 in (22) are set as follows. If the features are independent, the point-wise dependencies \Pr(x_{j1}, \ldots, x_{jM}) / \prod_{m=1}^{M} \Pr(x_{jm}) for 1 \leq j \leq J are equal to 1. Since parameter \varepsilon represents the lower bound of the point-wise dependencies, we set \varepsilon = 0.5 < 1, so that RADM includes the independent case. On the other hand, parameter \varepsilon_0 in (22) must be a small number, so \varepsilon_0 is set to 0.01. Other parameters are selected from suitable sets as follows. The regularization parameters b, c are selected from {10^{-6}, \ldots, 10^{-1}, 1}, the variable order R is selected from {1, 2} and the model order K is selected from one to the number of features. For LPBoost, LP-B and LCDM, the best parameter is selected from {0.05, 0.1, \ldots, 0.95}. This experiment has been repeated ten times.
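For reference, the double sigmoid normalization [25] used above maps a raw SVM output to (0, 1) with two slopes around an operating point t; a minimal sketch, in which the default parameter values are illustrative only and not taken from the paper:

```python
import math

def double_sigmoid(s, t=0.0, r1=1.0, r2=1.0):
    """Double sigmoid score normalization: steepness r1 is used to
    the left of the reference point t, r2 to its right
    (illustrative defaults)."""
    r = r1 if s < t else r2
    return 1.0 / (1.0 + math.exp(-2.0 * (s - t) / r))
```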
The mean accuracies of the best single feature (BestFea) and different classifier combination methods are reported in the second column of Table 1. From Table 1, we can see that the proposed RADM obtains the highest recognition rate of 96.84% on this database. Moreover, the recognition accuracies of the classifier fusion methods are higher than that of the best single feature. This confirms that the performance can be improved by combining classifiers, and that RADM outperforms the other fusion methods by better modeling dependency.
The Oxford flowers database [19] contains 17 categories of flowers with 80 images per category. Seven features, including shape, color, texture, HSV, HoG, SIFT internal, and SIFT boundary, are extracted using the methods reported in [19] [26]. We evaluate the proposed method using 17 × 40 images for training, 17 × 20 for validation and 17 × 20 for testing. The best parameters are selected by the
validation set, similar to the procedure for the Digit database. Following the settings in [8], kernel SVMs are used in this experiment, and the kernel matrices are defined as \exp(-d(x, x')/\gamma), where d is the distance and \gamma is the mean of the pairwise distances. This experiment was repeated three times using the predefined splits of this database [19].
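This kernel construction is straightforward to reproduce; a sketch, assuming a precomputed symmetric distance matrix:

```python
import numpy as np

def kernel_from_distances(D):
    """Kernel matrix exp(-d(x, x')/gamma) with gamma set to the mean
    of the pairwise distances, as in the Flower experiment.
    D: symmetric (n, n) matrix of pairwise distances."""
    gamma = D[np.triu_indices_from(D, k=1)].mean()
    return np.exp(-D / gamma)
```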
The third column in Table 1 shows the mean accuracies of the different methods. From Table 1, we can see that RADM obtains an improvement of 2.55% and 1.77% over the classifier fusion methods with the independence assumption and those without it, respectively. This indicates that modeling dependency without specific assumptions like those in DN and LCDM helps to improve the recognition performance.
Fig. 1. (a), (b) CMC curves on the CMU PIE and FERET face databases. (c) ROC curve on the Weizmann database. (d) Recognition rates of RADM and RM with varying model order on the KTH database.
The mean accuracies on these two databases are reported in the fourth and fifth columns of Table 1. The same conclusion can be drawn that RADM outperforms the other methods on both face databases. While the results in Table 1 only show the Rank-1 accuracies, CMC curves of the top four methods on the CMU PIE and FERET databases are plotted in Fig. 1(a) and Fig. 1(b), respectively, for detailed comparison. It can be seen that RADM outperforms IN and DN on CMU PIE, LPBoost and LCDM on FERET, and is slightly better than RM at different numbers of ranks. This indicates that RADM with dependency modeling gives the best performance for face recognition as well.
intensity, intensity difference, HoF, HoG, HoF2D, HoG2D, HoF3D and HoG3D, are extracted from the videos as reported in [10]. In these two databases, eight-fold CV is performed on the training data, and the CV outputs are used to train the weights for classifier fusion. The best parameters are selected by the CV outputs on Weizmann and by the validation set on KTH, respectively, similar to the procedure mentioned in Section 4.1.
The last two columns in Table 1 show the recognition rates of the different methods. We can see that RADM, LCDM and IN achieve the highest recognition rate of 85.56% on Weizmann, while RADM outperforms the other methods on the KTH database. We further compare the best three algorithms on the Weizmann database by the ROC measurement. The ROC curves in Fig. 1(c) show that RADM gives better performance when the false positive rate is larger than 10%, and the areas under the curves (AUC) are 0.8524 for RADM, 0.8472 for LCDM and 0.8424 for IN. This also confirms that the proposed RADM is better than the other classifier fusion methods for human action recognition. Finally, we compare RADM and RM with varying model order K on the KTH database. From Fig. 1(d), we can see that RADM outperforms RM at different model orders and is less sensitive to changes in the model order. This is another advantage of the proposed method.
5 Conclusion
In this paper, we have designed and proposed a new framework for dependency modeling by analytical functions on the posterior probabilities of each feature. It is shown that the Product rule [3] (with the independence assumption) and LCDM [10] (without the independence assumption) can be unified by the proposed analytical dependency model (ADM). With the ADM, we give a condition equivalent to the independence assumption based on the properties of marginal distributions, and show that ADM can model dependency. Since ADM may contain an infinite number of undetermined coefficients, a reduced form is proposed based on the convergence properties of analytical functions. Finally, the optimal Reduced Analytical Dependency Model (RADM) is learned by a modified least squares regularization problem, which aims at approximating the posterior probabilities such that all training samples are correctly classified. Experimental results show that RADM outperforms existing classifier fusion methods on the Digit, Flower, Face and Human Action databases. This indicates that RADM can better model dependency and help to improve the performance in many recognition problems.
Acknowledgments. This project was partially supported by the Science
Faculty Research Grant of Hong Kong Baptist University, National Natural Sci-
ence Foundation of China Research Grants 61128009 and 61172136, and NSFC-
GuangDong Research Grant U0835005.
References
1. Ahonen, T., Hadid, A., Pietikäinen, M.: Face Recognition with Local Binary Patterns. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3021, pp. 469–481. Springer, Heidelberg (2004)
2. He, X., Yan, S., Hu, Y., Niyogi, P., Zhang, H.J.: Face recognition using Laplacianfaces. TPAMI 27, 328–340 (2005)
Reduced Analytical Dependency Modeling for Classifier Fusion 805
3. Kittler, J., Hatef, M., Duin, R.P.W., Matas, J.: On combining classifiers. TPAMI 20, 226–239 (1998)
4. Ueda, N.: Optimal linear combination of neural networks for improving classification performance. TPAMI 22, 207–215 (2000)
5. Prabhakar, S., Jain, A.K.: Decision-level fusion in fingerprint verification. Pattern Recognition 35, 861–874 (2002)
6. Toh, K.A., Yau, W.Y., Jiang, X.: A reduced multivariate polynomial model for multimodal biometrics and classifiers fusion. IEEE Transactions on Circuits and Systems for Video Technology 14, 224–233 (2004)
7. Terrades, O.R., Valveny, E., Tabbone, S.: Optimal classifier fusion in a non-bayesian probabilistic framework. TPAMI 31, 1630–1644 (2009)
8. Gehler, P., Nowozin, S.: On feature combination for multiclass object classification. In: ICCV, pp. 221–228 (2009)
9. Scheirer, W., Rocha, A., Micheals, R., Boult, T.: Robust Fusion: Extreme Value Theory for Recognition Score Normalization. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part III. LNCS, vol. 6313, pp. 481–495. Springer, Heidelberg (2010)
10. Ma, A.J., Yuen, P.C.: Linear dependency modeling for feature fusion. In: ICCV, pp. 2041–2048 (2011)
11. Silverman, B.W.: Density Estimation for Statistics and Data Analysis. Chapman and Hall (1986)
12. Demiriz, A., Bennett, K.P., Shawe-Taylor, J.: Linear programming boosting via column generation. JMLR 46, 225–254 (2002)
13. Feller, W.: An Introduction to Probability Theory and Its Applications, vol. 1. Wiley (1968)
14. Luenberger, D.G., Ye, Y.: Linear and Nonlinear Programming. Springer (2008)
15. Krantz, S.G., Parks, H.R.: A Primer of Real Analytic Functions. Birkhäuser (2002)
16. Rudin, W.: Principles of Mathematical Analysis. McGraw-Hill (1976)
17. Rifkin, R., Yeo, G., Poggio, T.: Regularized least squares classification. NATO Science Series III: Computer and Systems Sciences 190, 131–153 (2003)
18. Breukelen, M.V., Duin, R.P.W., Tax, D.M.J., Hartog, J.E.D.: Handwritten digit recognition by combined classifiers. Kybernetika 34, 381–386 (1998)
19. Nilsback, M.E., Zisserman, A.: A visual vocabulary for flower classification. In: CVPR (2006)
20. Sim, T., Baker, S., Bsat, M.: The CMU pose, illumination, and expression database. TPAMI 25, 1615–1618 (2003)
21. Phillips, P.J., Moon, H., Rizvi, S.A., Rauss, P.J.: The FERET evaluation methodology for face-recognition algorithms. TPAMI 22, 1090–1104 (2000)
22. Gorelick, L., Blank, M., Shechtman, E., Irani, M., Basri, R.: Actions as space-time shapes. TPAMI 29, 2247–2253 (2007)
23. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: A local SVM approach. In: ICPR (2004)
24. Canu, S., Grandvalet, Y., Guigue, V., Rakotomamonjy, A.: SVM and kernel methods Matlab toolbox. Perception Systèmes et Information, INSA de Rouen, Rouen, France (2005)
25. Jain, A., Nandakumar, K., Ross, A.: Score normalization in multimodal biometric systems. Pattern Recognition 38, 2270–2285 (2005)
26. Nilsback, M.E., Zisserman, A.: Automated flower classification over a large number of classes. In: ICVGIP (2008)
27. Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J.: Eigenfaces vs. Fisherfaces: recognition using class specific linear projection. TPAMI 19, 711–720 (1997)
Learning to Match Appearances
by Correlations in a Covariance Metric Space
1 Introduction
Appearance matching of the same object registered in disjoint camera views is one of the most challenging issues in any video surveillance system. The present work addresses the problem of appearance matching, in which non-rigid objects change their pose and orientation. The changes in object appearance, together with inter-camera variations in lighting conditions, different color responses, different camera viewpoints and different camera parameters, make the appearance matching task extremely difficult.
Recent studies tackle this topic in the context of pedestrian recognition. Human recognition is of primary importance not only for people behavior analysis in large area networks but also in security applications for people tracking. Determining whether a given person of interest has already been observed over a network of cameras is referred to as person re-identification. Our approach is focused on, but not limited to, this application.
Appearance-based recognition is of particular interest in video surveillance, where biometrics such as iris, face or gait might not be available, e.g. due to the sensors' scarce resolution or low frame rate. Thus, appearance-based techniques rely
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 806–820, 2012.
© Springer-Verlag Berlin Heidelberg 2012
Learning to Match Appearances 807
on clothing, assuming that individuals wear the same clothes between different sightings.
In this paper, we propose a novel method which learns how to match the appearance of a specific object class (e.g. the class of humans). Our main idea is that different regions of the object appearance ought to be matched using different strategies to obtain a distinctive representation. Extracting region-dependent features allows us to characterize the appearance of a given object class in a more efficient and informative way. Using different kinds of features to characterize the various regions of an object is fundamental to our appearance matching method.
We propose to model the object appearance using the covariance descriptor [1], yielding rotation and illumination invariance. The covariance descriptor has already been used successfully in the literature for appearance matching [2,3]. In contrast to these approaches, we do not define an a priori feature vector for extracting covariance, but we learn which features are the most descriptive and distinctive depending on their localization in the object appearance. Characterizing a specific class of objects (e.g. humans), we select only the essential features for this class, removing irrelevant redundancy from the covariance feature vectors and ensuring low computational cost. The output model is used for extracting the appearance of an object from a set of images, while producing a descriptive and distinctive representation.
In this paper, we make the following contributions:
– We formulate the appearance matching problem as the task of learning a model that selects the most descriptive features for a specific class of objects (Section 3).
– By using a combination of small covariance matrices (4 × 4) between a few relevant features, we offer an efficient and distinctive representation of the object appearance (Section 3.1).
– We propose to learn a general model for appearance matching in a covariance metric space using the correlation-based feature selection (CFS) technique (Section 3.2).
We evaluate our approach in Section 4 before discussing future work and con-
cluding.
2 Related Work
Recent studies have focused on the appearance matching problem in the con-
text of pedestrian recognition. Person re-identication approaches concentrate
either on distance learning regardless of the representation choice [4,5], or on
feature modeling, while producing a distinctive and invariant representation for
appearance matching [6,7]. Learning approaches use training data to search for
strategies that combine given features maximizing inter-class variation whilst
minimizing intra-class variation. Instead, feature-oriented approaches concen-
trate on an invariant representation, which should handle view point and camera
changes. Further classication of appearance-based techniques distinguishes the
808 S. Bąk et al.
single-shot and the multiple-shot approaches. The former class extracts appear-
ance using a single image [8,9], while the latter employs multiple images of the
same object to obtain a robust representation [6,7,10].
Single-Shot Approaches: In [9], a model for the shape and appearance of the object is presented. A pedestrian image is segmented into regions, and their color spatial information is registered in a co-occurrence matrix. This method works well if the system considers only a frontal viewpoint. For more challenging cases, where viewpoint invariance is necessary, an ensemble of localized features (ELF) [11] is selected by a boosting scheme. Instead of designing a specific feature for characterizing people's appearance, a machine learning algorithm constructs a model that provides maximum discriminability by filtering a set of simple features. In [12], pairwise dissimilarity profiles between individuals are learned and adapted for nearest-neighbor classification. Similarly, in [13], a rich set of feature descriptors based on color, textures and edges is used to reduce the amount of ambiguity within the human class. The high-dimensional signature is transformed into a low-dimensional discriminant latent space using a statistical tool called Partial Least Squares (PLS) in a one-against-all scheme. However, both methods demand an extensive learning phase based on the pedestrians to re-identify, extracting discriminative profiles, which makes the approaches non-scalable. The person re-identification problem is reformulated as a ranking problem in [14]. The authors present an extensive evaluation of learning approaches and show that a ranking-based model can improve reliability and accuracy. Distance learning is also the main topic of [5]. A probabilistic model maximizes the probability of a true match pair having a smaller distance than that of a wrong match pair. This approach focuses on maximizing matching accuracy regardless of the representation choice.
Multiple-Shot Approaches: In [15], every individual is represented by two models: descriptive and discriminative. The discriminative model is learned using the descriptive model as an assistance. In [10], a spatiotemporal graph is generated for ten consecutive frames to group spatiotemporally similar regions. Then, a clustering method is applied to capture the local descriptions over time and to improve matching accuracy. In [7], the authors propose a feature-oriented approach which combines three features: (1) chromatic content (HSV histogram); (2) maximally stable color regions (MSCR); and (3) recurrent highly structured patches (RHSP). The extracted features are weighted using the idea that features closer to the body's axes of symmetry are more robust against scene clutter. Recurrent patches are presented in [6]. Using epitome analysis, highly informative patches are extracted from a set of images. In [16], the authors show that the features are not as important as precise body-part detection, looking for part-to-part correspondences.
Learning approaches concentrate on distance metrics regardless of the representation choice. In the end, those approaches use very simple features, such as color histograms, to perform recognition. Instead of learning, feature-oriented approaches concentrate on the representation without taking discriminative analysis into account. In fact, learning using a sophisticated feature representation
3 The Approach
Our appearance matching requires two operations. The first stage overcomes color dissimilarities caused by variations in lighting conditions. We apply the histogram equalization technique [17] to the color channels (RGB) to maximize the entropy in each of these channels and to obtain a camera-independent color representation. The second step is responsible for appearance extraction (Section 3.3) using a general model learned for a specific class of objects (e.g. humans). The following sections describe our feature space and the learning by which the appearance model for matching is generated.
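The first stage is ordinary per-channel histogram equalization [17]; a minimal sketch for one 8-bit channel, applied independently to R, G and B:

```python
import numpy as np

def equalize_channel(ch):
    """Histogram-equalize one 8-bit channel: map each intensity
    through the normalized cumulative histogram, which spreads the
    intensities to maximize entropy."""
    hist = np.bincount(ch.ravel(), minlength=256)
    cdf = hist.cumsum().astype(np.float64)
    lut = np.round(255.0 * cdf / cdf[-1]).astype(np.uint8)
    return lut[ch]
```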
Our object appearance is characterized using the covariance descriptor [1]. This descriptor encodes information on the feature variances inside an image region, their correlations with each other and their spatial layout. The performance of covariance features has been found to be superior to other methods, as rotation and illumination changes are absorbed by the covariance matrix.
In contrast to [1,2,15], we do not limit our covariance descriptor to a single feature vector. Instead of defining an a priori feature vector, we use a machine learning technique to select the features that provide the most descriptive representation of the appearance of an object.
Let L = {R, G, B, I, ∇I, θ, …} be a set of feature layers, in which each layer is a mapping such as color, intensity, gradients and filter responses (texture filters, i.e. Gabor, Laplacian or Gaussian). Instead of using the covariance between all of these layers, which would be computationally expensive, we compute covariance matrices of a few relevant feature layers. These relevant layers are selected depending on the region of an object (see Section 3.2). In addition, let layer D be the distance between the center of an object and the current location. The covariance of the distance layer D and three other layers l (l ∈ L) forms our descriptor, which is represented by a 4 × 4 covariance matrix. By using distance D in every covariance, we keep a spatial layout of the feature variances which is rotation invariant.
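A 4 × 4 descriptor of this kind can be computed directly from the stacked layers; a sketch (the layer stack passed in stands for one selected combination [D, l1, l2, l3], and the names are ours):

```python
import numpy as np

def region_covariance(layers, mask):
    """Covariance descriptor of one region: `layers` is a (4, H, W)
    stack (distance layer D plus three feature layers), `mask` a
    boolean (H, W) indicator of the rectangular region P."""
    F = np.stack([layer[mask] for layer in layers])  # (4, n_pixels)
    return np.cov(F)                                 # 4 x 4 matrix
```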
State-of-the-art techniques very often use the pixel location (x, y) instead of the distance D, yielding a better description of an image region. However, in our detailed experiments, using D rather than (x, y) did not decrease the recognition accuracy in the general case, while decreasing the number of features in the covariance matrix. This discrepancy may be due to the fact that we hold spatial
Fig. 1. The meta covariance feature space: examples of three different covariance features. Every covariance is extracted from a region (P), the distance layer (D) and three channel functions (e.g. the bottom covariance feature is extracted from region P3 using the layers D, I (intensity), ∇I (gradient magnitude) and θ (gradient orientation)).
where ρ is the geodesic distance between covariance matrices [18], and ac_{i,j}[z], ac_{k,l}[z] are the corresponding covariance matrices (the same region P and the same combination of layers). The index z is an iterator over C.
We cast the appearance matching problem into the following distance learning problem. Let Δ+ be the distance vectors computed using pairs of relevant samples (of the same person captured in different cameras, i = k, c ≠ c′) and let Δ− be the distance vectors computed between pairs of related irrelevant samples (i ≠ k, c ≠ c′). The pairwise elements Δ+ and Δ− are distance vectors which stand for positive and negative samples, respectively. These distance vectors define a covariance metric space. Given Δ+ and Δ− as training data, our task is to find a general model of appearance that maximizes matching accuracy by selecting the relevant covariances and thus defining a distance.
Fig. 3. Extraction of the appearance using the model (the set of features selected by CFS). Different colors and shapes in the model refer to different kinds of covariance features.
Let A and B be two object signatures. A signature consists of the mean covariance matrices extracted using the set Φ. The similarity between two signatures A and B is defined as
S(A, B) = \frac{1}{|\Phi|} \sum_{i} \frac{1}{\max(\rho(\mu_{A,i}, \mu_{B,i}), \varepsilon)}, \qquad (4)
where ρ is the geodesic distance [18], μ_{A,i} and μ_{B,i} are the mean covariance matrices extracted using covariance feature i, and ε = 0.1 is introduced to prevent the denominator from approaching zero. By using the average of the similarities computed on feature set Φ, the appearance matching becomes robust to noise.
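Equation (4) averages floored inverse distances over the selected covariance features; a sketch, with the distance function passed in (e.g. the geodesic distance of [18]) and names of our own choosing:

```python
import numpy as np

def signature_similarity(A, B, dist, eps=0.1):
    """S(A, B) of Eq. (4): mean of 1 / max(rho, eps) over the
    corresponding mean covariance matrices of two signatures
    (equal-length lists); eps bounds each term."""
    return float(np.mean([1.0 / max(dist(a, b), eps) for a, b in zip(A, B)]))
```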
4 Experimental Results
Feature Space: We scale every human image to a fixed-size window of 64 × 192 pixels. The set of rectangular sub-regions P is produced by shifting 32 × 8 and 16 × 16 pixel regions with an 8-pixel step (up and down). This gives |P| = 281 overlapping
1 The Image Library for Intelligent Detection Systems (i-LIDS) is the UK government's benchmark for Video Analytics (VA) systems.
4.2 Results
i-LIDS-MA [2]: This dataset consists of 40 individuals extracted from two non-overlapping camera views. For each individual a set of 46 images is given. The dataset contains in total 40 × 2 × 46 = 3680 images. For each individual we randomly select m = 10 images. Then, we randomly select q = 10 individuals to learn a model. The evaluation is performed on the remaining p = 30 individuals. Every signature is used as a query against the gallery set of signatures from the other camera. This procedure has been repeated 10 times to obtain averaged CMC curves.
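The CMC evaluation used throughout this section can be sketched as follows (a minimal single-gallery implementation; function and variable names are ours):

```python
import numpy as np

def cmc_curve(scores, query_ids, gallery_ids):
    """Cumulative Matching Characteristic: fraction of queries whose
    true identity appears within the top-k ranked gallery entries.
    scores[q, g] is the similarity of query q to gallery item g."""
    order = np.argsort(-scores, axis=1)  # most similar first
    ranks = np.array([
        int(np.where(gallery_ids[order[q]] == query_ids[q])[0][0])
        for q in range(len(query_ids))
    ])
    n_g = scores.shape[1]
    return np.array([(ranks < k).mean() for k in range(1, n_g + 1)])
```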
COSMATI vs. Ci: We first evaluate the improvement obtained by using different combinations of features for appearance matching. We compare models based on a single combination of features with the model that employs several combinations of features. From Fig. 5(a) it is apparent that using different kinds of covariance features improves matching accuracy.
COSMATI w.r.t. the Number of Shots: We carry out experiments to show the evolution of the performance with the number of given frames per individual
Fig. 5. CMC curves on i-LIDS-MA: (a) COSMATI vs. the single feature combinations C1–C10; (b) COSMATI with different numbers of frames N (N = 1, 3, 5, 10, 20, 46).
Fig. 6. CMC curves of COSMATI vs. LCP on (a) i-LIDS-MA and (b) i-LIDS-AA.
(Fig. 5(b)). The results indicate that the larger the number of frames, the better the performance. It clearly shows that averaging covariance matrices on a Riemannian manifold using multiple shots leads to much better recognition accuracy. It is worth noting that N = 50 is usually affordable in practice, as it corresponds to only 2 seconds of a standard 25 fps camera.
COSMATI vs. LCP: We compare our results with the LCP [2] method. This method employs an a priori 11 × 11 covariance descriptor. Discriminative characteristics of an appearance are enhanced using a one-against-all boosting scheme. Both methods are evaluated on the same subsets (p = 30) of the original data. Our method achieves slightly better results than LCP, as shown in Fig. 6(a). This shows that, by using specific descriptors for different regions of the object, we are able to obtain a representation as distinctive as that of LCP, which uses an exhaustive one-against-all learning scheme.
i-LIDS-AA [2]: This dataset contains 100 individuals automatically detected and tracked in two cameras. The cropped images are noisy, which makes the dataset more challenging (e.g. the detected bounding boxes are not accurately centered around the people, or only part of a person is detected due to occlusion).
COSMATI vs. LCP: Using the models learned on i-LIDS-MA, we evaluate our approach on the 100 individuals. The results are illustrated in Fig. 6(b). Our method again performs better than LCP. It is relevant to mention that once our model is
Fig. 7. (a), (b) CMC curves of COSMATI vs. PRDC. (c) CMC curves of COSMATI (N = 2) vs. LCP, CPS (N = 2), SDALF (N = 2), HPE (N = 2) and Group Context.
learned offline, it does not need any additional discriminative analysis. By specifying informative regions and their characteristic features, we obtain a distinctive representation of the object appearance. In contrast to [2], our offline learning is scalable and does not require any reference dataset.
i-LIDS-119 [24]: i-LIDS-119 does not fit the multiple-shot case very well, because the number of images per individual is very low (on average 4). However, this dataset has been used extensively in the literature for evaluating person re-identification approaches. Thus, we also use this dataset to compare with state-of-the-art results. The dataset consists of 119 individuals with 476 images. This dataset is very challenging since there are many occlusions and often only the top part of the person is visible. As only a few images are given, we extract our signatures using at most N = 2 images.
COSMATI vs. PRDC [5]: We compare our approach with the PRDC method. This method also requires offline learning. PRDC focuses on distance learning that can maximize matching accuracy regardless of the representation choice. We reproduce the same experimental settings as [5]. All images of p randomly selected individuals are used to set up the test set. The remaining images form the training data. Each test set is composed of a gallery set and a probe set. In contrast to [5], we use multiple images (at most N = 2) to create the gallery and the probe set. The procedure is repeated 10 times to obtain reliable statistics. Our results (Fig. 7(a,b)) show clearly that COSMATI outperforms PRDC. The results indicate that by using strong descriptors, we can significantly increase matching accuracy.
COSMATI vs. LCP [2], CPS [16], SDALF [7], HPE [6] and Group Context [13]: We have used the models learned on i-LIDS-MA to evaluate our approach on the full dataset of 119 individuals. Our CMC curve is generated by averaging the CMC over 10 trials. The results are presented in Fig. 7(c). Our approach performs the best among all considered methods. We believe that this is due to the informative appearance representation obtained by the CFS technique. It clearly shows that a combination of the strong covariance descriptor with the efficient selection method produces distinctive models for the appearance matching problem.
C1      C2      C3      C4      C5      C6       C7       C8       C9       C10
9.61%   5.30%   4.62%   4.37%   5.47%   12.70%   14.29%   17.35%   12.12%   14.16%
5 Conclusion
We proposed to formulate the appearance matching problem as the task of learning a model that selects the most descriptive features for a specific class of objects. Our strategy is based on the idea that different regions of the object appearance ought to be matched using different strategies. Our experiments demonstrate that: (1) by using different kinds of covariance features w.r.t. the region of an object, we obtain a clear improvement in appearance matching performance; (2) our method outperforms state-of-the-art methods in the context of pedestrian recognition. In the future, we plan to integrate the notion of motion into our recognition framework. This would allow us to distinguish individuals using their behavioral characteristics and to extract only the features which surely belong to the foreground region.
References
1. Tuzel, O., Porikli, F., Meer, P.: Region Covariance: A Fast Descriptor for Detection and Classification. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 589–600. Springer, Heidelberg (2006)
2. Bak, S., Corvee, E., Bremond, F., Thonnat, M.: Boosted human re-identification using Riemannian manifolds. Image and Vision Computing (2011)
3. Porikli, F., Tuzel, O., Meer, P.: Covariance tracking using model update based on Lie algebra. In: CVPR (2006)
4. Dikmen, M., Akbas, E., Huang, T.S., Ahuja, N.: Pedestrian Recognition with a Learned Metric. In: Kimmel, R., Klette, R., Sugimoto, A. (eds.) ACCV 2010, Part IV. LNCS, vol. 6495, pp. 501–512. Springer, Heidelberg (2011)
5. Zheng, W.-S., Gong, S., Xiang, T.: Person re-identification by probabilistic relative distance comparison. In: CVPR (2011)
6. Bazzani, L., Cristani, M., Perina, A., Farenzena, M., Murino, V.: Multiple-shot person re-identification by HPE signature. In: ICPR, pp. 1413–1416 (2010)
7. Farenzena, M., Bazzani, L., Perina, A., Murino, V., Cristani, M.: Person re-identification by symmetry-driven accumulation of local features. In: CVPR (2010)
8. Park, U., Jain, A., Kitahara, I., Kogure, K., Hagita, N.: ViSE: Visual search engine using multiple networked cameras. In: ICPR, pp. 1204–1207 (2006)
9. Wang, X., Doretto, G., Sebastian, T., Rittscher, J., Tu, P.: Shape and appearance context modeling. In: ICCV, pp. 1–8 (2007)
10. Gheissari, N., Sebastian, T.B., Hartley, R.: Person reidentification using spatiotemporal appearance. In: CVPR, pp. 1528–1535 (2006)
11. Gray, D., Tao, H.: Viewpoint Invariant Pedestrian Recognition with an Ensemble of Localized Features. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 262–275. Springer, Heidelberg (2008)
12. Lin, Z., Davis, L.S.: Learning Pairwise Dissimilarity Profiles for Appearance Recognition in Visual Surveillance. In: Bebis, G., Boyle, R., Parvin, B., Koracin, D., Remagnino, P., Porikli, F., Peters, J., Klosowski, J., Arns, L., Chun, Y.K., Rhyne, T.-M., Monroe, L. (eds.) ISVC 2008, Part I. LNCS, vol. 5358, pp. 23–34. Springer, Heidelberg (2008)
13. Schwartz, W.R., Davis, L.S.: Learning discriminative appearance-based models using partial least squares. In: SIBGRAPI, pp. 322–329 (2009)
14. Prosser, B., Zheng, W.-S., Gong, S., Xiang, T.: Person re-identification by support vector ranking. In: BMVC, pp. 21.1–21.11 (2010)
15. Hirzer, M., Beleznai, C., Roth, P.M., Bischof, H.: Person Re-identification by Descriptive and Discriminative Classification. In: Heyden, A., Kahl, F. (eds.) SCIA 2011. LNCS, vol. 6688, pp. 91–102. Springer, Heidelberg (2011)
16. Cheng, D.S., Cristani, M., Stoppa, M., Bazzani, L., Murino, V.: Custom pictorial structures for re-identification. In: BMVC, pp. 68.1–68.11 (2011)
17. Hordley, S.D., Finlayson, G.D., Schaefer, G., Tian, G.Y.: Illuminant and device
invariant colour using histogram equalisation. Pattern Recognition 38 (2005)
18. Förstner, W., Moonen, B.: A metric for covariance matrices. In: Quo vadis geodesia...?, Festschrift for Erik W. Grafarend on the occasion of his 60th birthday, TR Dept. of Geodesy and Geoinformatics, Stuttgart University (1999)
19. Pennec, X., Fillard, P., Ayache, N.: A Riemannian framework for tensor computing. Int. J. Comput. Vision 66, 41–66 (2006)
20. Tuzel, O., Porikli, F., Meer, P.: Pedestrian detection via classification on Riemannian manifolds. IEEE Trans. Pattern Anal. Mach. Intell. 30, 1713–1727 (2008)
21. Hall, M.A.: Correlation-based Feature Subset Selection for Machine Learning. PhD
thesis, Department of Computer Science, University of Waikato (1999)
22. Fayyad, U.M., Irani, K.B.: Multi-interval discretization of continuous-valued attributes for classification learning. In: IJCAI, pp. 1022–1027 (1993)
23. Rich, E., Knight, K.: Artificial Intelligence. McGraw-Hill Higher Education (1991)
24. Zheng, W.-S., Gong, S., Xiang, T.: Associating groups of people. In: BMVC (2009)
25. Gray, D., Brennan, S., Tao, H.: Evaluating Appearance Models for Recognition,
Reacquisition, and Tracking. In: PETS (2007)
26. Rother, C., Kolmogorov, V., Blake, A.: "GrabCut": interactive foreground extraction using iterated graph cuts. In: SIGGRAPH, pp. 309–314 (2004)
On the Convergence of Graph Matching:
Graduated Assignment Revisited
In computer vision and machine learning, many tasks that require finding correspondences between two node sets can be formulated as graph matching, such that the relations between nodes are preserved as much as possible. Although extensive research [1] has been done for decades, graph matching is still challenging, mainly for two reasons: 1) in general, the objective function is non-convex and prone to local optima; 2) the solution needs to satisfy combinatorial constraints. The graph matching problem is widely formulated as integer quadratic programming (IQP), considering both unary and second-order terms reflecting the local similarities together with the pairwise relations. Concretely, given two graphs G^L(V^L, E^L, A^L) and G^R(V^R, E^R, A^R), where V denotes nodes, E edges, and A attributes, an affinity matrix M is defined whose off-diagonal term M_{ia;jb} measures the affinity between the candidate edge pair (v_i^L, v_j^L) and (v_a^R, v_b^R).
The work is supported by NSF IIS-1116886, NSF IIS-1049694, NSFC 61129001/F010403
and the 111 Project (B07022).
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 821–835, 2012.
© Springer-Verlag Berlin Heidelberg 2012
822 Y. Tian et al.
The diagonal term M_{ia;ia} describes the unary affinity of a node match (v_i^L, v_a^R). By introducing a permutation matrix x ∈ {0,1}^{n_L × n_R}, whereby x_{ia} = 1 if node v_i^L matches node v_a^R (and x_{ia} = 0 otherwise), this leads to the following formulation:

x* = arg max_x (x^T M x)   s.t. Ax = 1, x ∈ {0,1}^{n_L n_R}   (1)

Here x is the vectorized version of the permutation matrix; the constraints Ax = 1 encode the one-to-one matching from G^L to G^R, such that one feature from one image can be matched to at most one feature from the other image. The difficulty depends on the structure of the matrix M; in general the problem is NP-hard and no polynomial-time algorithm is known.
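As a toy illustration of formulation (1) (a sketch of ours, not the paper's code), one can score a vectorized assignment against M and expose the combinatorial nature of the IQP by enumerating all one-to-one matchings for tiny n:

```python
import numpy as np
from itertools import permutations

def matching_score(M, X):
    """x^T M x for an assignment matrix X (n x n, entries in {0,1});
    M holds pairwise affinities off-diagonal, unary affinities on-diagonal."""
    x = X.reshape(-1)        # row-major vectorization of the assignment
    return float(x @ M @ x)

def brute_force_match(M, n):
    """Exhaustive search over all n! one-to-one matchings (tiny n only);
    this factorial search space is what makes the IQP NP-hard in general."""
    best_score, best_X = -np.inf, None
    for perm in permutations(range(n)):
        X = np.zeros((n, n))
        X[range(n), perm] = 1.0
        s = matching_score(M, X)
        if s > best_score:
            best_score, best_X = s, X
    return best_X, best_score
```

Practical methods replace this enumeration with relaxations such as those discussed below.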
We drop the quadratic terms for three reasons: a) the new quadratic terms increase the difficulty of optimization; b) x^T x = n holds for the one-to-one matching considered in this paper; c) the resultant quadratic-free formulation does not change the optimal solution with respect to the original objective, since the deviated score associated with the diagonal I plays an equal role for all matching candidates:

max_{x_{k+1}} x_{k+1}^T (M + I) x_k   (4)
2 ⟨M_dec^T p_{i+1}, M_dec^T p_i⟩ ≤ ⟨M_dec^T p_{i+1}, M_dec^T p_{i+1}⟩ + ⟨M_dec^T p_i, M_dec^T p_i⟩   (6)

Equality holds iff p_{i+1}^T M p_{i+1} = p_{i+1}^T M p_i = p_i^T M p_i; using Assumption 1: p_i = p_{i+1}.
CGA is sensitive to the initial point and can quickly become trapped at an unsatisfactory point. To address this issue, on one hand, observing that the affinity matrix may be distorted after CGA finishes, we can extend the solution chain by framing an outer loop whereby the current iteration's (CGA) output is the input to the next iteration, where CGA is performed again. The outer loop stops when the solution with respect to the original objective cannot be improved by a new iteration - we call this
outer-loop CGA algorithm LCGA. On the other hand, the Hungarian method, confined to the discrete domain, misses many optima that could otherwise be explored by a soft counterpart in the continuous domain. We therefore leverage the softmax tool [6] (instead of the Hungarian method) that has been applied in the classical Graduated Assignment algorithm (GAGM) [2], and extend CGA to its soft version: Soft Constrained Graduated Assignment (SCGA), described in Alg. 2. The algorithmic procedure of Alg. 2 is empirically established based on GAGM and CGA, yet we are more interested in the underlying question: do the continuous methods (SCGA/GAGM) have convergence properties similar to their discrete counterparts (CGA/GA)? An affirmative answer is given in Theorem 3, Corollary 1, and Corollary 2, provided that Assumption 1 and Assumption 2 hold, as described below:
¹ Δmin is defined in Assumption 1; N = n², where n is the number of nodes in one graph; Δ is defined in Lemma 4; and s_j and p_j are defined in Equ. 10 and Equ. 11 in the appendix, respectively.
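The soft-assign update shared by GAGM and SCGA - exponentiate the first-order scores M s at temperature T, then alternate row/column (Sinkhorn) normalization toward a doubly stochastic matrix - can be sketched as follows (an illustrative sketch; parameter names are ours, not the authors' implementation):

```python
import numpy as np

def softassign_step(M, s, T, sinkhorn_iters=100):
    """One continuous update s_{j+1} = SoftMax(M s_j): exponentiate the
    scores at temperature T, then Sinkhorn-normalize rows and columns so
    the matrix form is (approximately) doubly stochastic (cf. Lemma 2)."""
    n = int(round(np.sqrt(s.size)))
    S = np.exp((M @ s).reshape(n, n) / T)
    for _ in range(sinkhorn_iters):
        S /= S.sum(axis=1, keepdims=True)   # normalize rows
        S /= S.sum(axis=0, keepdims=True)   # normalize columns
    return S.reshape(-1)
```

As T → 0 the result approaches a vectorized permutation matrix, which is precisely the regime the convergence analysis below studies.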
Theorem 3. If Assumption 1 (Δmin) and Assumption 2 (s_j, T_j, Δ) hold, then, using the continuous solution chain {s_j} as defined in Equ. 10, the induced discrete solution chain {p_j} defined by Equ. 11 asymptotically (once j is big enough) satisfies s_{j+1} ≈ p_{j+1}, and {p_j} satisfies the definition of the Hungarian method: p_{j+k+1} = Hd(M p_{j+k}).
Theorem 3 directly leads to two important corollaries:
Corollary 1. During the iteration, if there exists s_j s.t. Assumption 1 (Δmin) and Assumption 2 (s_j, T_j, Δ) hold, then GAGM will converge to a double-circular solution chain.
Corollary 2. During the iteration, if there exists s_j s.t. Assumption 1 (Δmin) and Assumption 2 (s_j, T_j, Δ) hold, then SCGA will converge to a fixed discrete point.
Proof details can be found in the appendix.
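The induced discrete chain p_{j+k+1} = Hd(M p_{j+k}) can be simulated directly; in the sketch below (ours, for toy sizes only) permutation enumeration stands in for the O(n³) Hungarian method [8]:

```python
import numpy as np
from itertools import permutations

def hungarian_step(M, p):
    """One discrete iteration p_{j+1} = Hd(M p_j): solve the first-order
    assignment problem on the score matrix (M p_j) reshaped to n x n.
    Brute force replaces the Hungarian method at toy scale."""
    n = int(round(np.sqrt(p.size)))
    C = (M @ p).reshape(n, n)
    idx = np.arange(n)
    best_perm = max(permutations(range(n)),
                    key=lambda perm: C[idx, list(perm)].sum())
    P = np.zeros((n, n))
    P[idx, list(best_perm)] = 1.0
    return P.reshape(-1)
```

Iterating this map from some p_j traces exactly the discrete solution chain whose fixed points and cycles the theorems above characterize.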
3 Related Work
Early work by Christmas, Kittler and Petrou [9] and Wilson and Hancock [10] showed how relaxation labeling could be applied to the graph matching problem by modeling the probabilistic distribution of matching errors. Drawing on ideas from this connectionist literature, Gold and Rangarajan [2] developed a relaxation scheme based on softassign, which gradually updates the derivative of the relaxed IQP, i.e., Graduated Assignment (GAGM) [2]. Leordeanu and Hebert [3] present a spectral matching (SM) method which computes the leading eigenvector of the symmetric nonnegative affinity matrix. The constraints are entirely dropped, and the integer constraints are considered only in the final discretization step. Later, spectral graph matching with affine constraints (SMAC) was developed [11]. In addition, [11] suggests preprocessing the input matrix by normalizing it into a doubly stochastic matrix. More recently, Cho et al. [4] apply random walks and introduce a reweighted jump scheme to incorporate matching constraints. None of these approaches is concerned with the original integer constraints during optimization; they assume the final continuous solution can lead to an optimal feasible solution after discretization. Contrary to this idea, an iterative matching method (IPFP) was proposed by Leordeanu et al. [5]. In their method, the optimized solution x_k in each iteration is continuous, yet they keep the discrete one b_k induced by x_k if it satisfies the optimality condition. When the iteration ends, the solution is selected from the stored discrete set instead of from the binarization of the continuous solution in the final iteration. This paper addresses the general learning-free
graph matching problem. In terms of the convergence study most emphasized in this paper, Rangarajan et al. [7] show that GAGM will converge to a discrete point provided the affinity matrix is positive definite, with no clear answer for general cases. This paper clarifies that, for the general graph matching problem, GAGM converges to a circular solution chain, and that progressively adding to the diagonal of the affinity matrix results in an algorithm that converges to a fixed discrete point after a finite number of iterations, under mild conditions. Motivated by this theoretical finding, we further carefully design the mechanism for when and by how much the
addition is performed during iterations. In general, the proposed methods prolong the search path in both the discrete and continuous domains, avoiding early trapping at unwanted points. Moreover, the proposed mechanism keeps the relaxed solver from deviating from the original objective function, thus improving the solution quality.
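For concreteness, the spectral matching idea of [3] summarized above can be sketched in a few lines - power iteration for the leading eigenvector of the nonnegative affinity matrix, followed by a greedy discretization restoring the one-to-one constraints. This is an illustrative toy of ours, not the authors' implementation:

```python
import numpy as np

def spectral_match(M, n, iters=300):
    """Leading eigenvector of the nonnegative affinity matrix via power
    iteration, then greedy one-to-one discretization: repeatedly take the
    largest remaining entry and delete its row and column."""
    x = np.full(M.shape[0], 1.0 / M.shape[0])
    for _ in range(iters):
        x = M @ x
        x /= np.linalg.norm(x)
    X = x.reshape(n, n).copy()
    P = np.zeros_like(X)
    for _ in range(n):
        i, j = np.unravel_index(np.argmax(X), X.shape)
        P[i, j] = 1.0
        X[i, :] = -np.inf            # row i is now taken
        X[:, j] = -np.inf            # column j is now taken
    return P
```

Note that the matching constraints enter only in the final greedy step, which is exactly the limitation the methods in this paper address.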
0.15, the same value as in [4]. The performance of each method is measured in both accuracy and score, as defined in [4]. Based on the aforementioned settings, we conduct two sub-experiments to verify robustness against deformation and outliers. In the deformation test, we vary the deformation noise parameter from 0 to 0.4 with interval 0.05; to compare with RRWM on a larger baseline, we set the inlier number n_in to 50 and 60. In the outlier noise experiment, the number of outliers n_out varies from 0 to 20 in increments of 2, while n_in is fixed to 50. The experimental results are plotted in Fig. 1. As observed from the deformation test, SCGA outperforms in accuracy, while LCGA and RRWM are comparable. LCGA achieves a higher score given increased noise. These phenomena suggest that a continuous technique like SoftMax is a key factor against large noise, whereas a looping scheme like LCGA pays more attention to the objective score. In the case of small noise, the two proposed methods are both very fast, yet they become relatively slow in the presence of larger noise. Considering efficiency, all algorithms are stopped when the iteration count exceeds a fixed threshold. In the outlier test, both
SCGA and LCGA outperform the others in accuracy and score, and in the score test LCGA is better again. This suggests that LCGA is suitable when noise is small, since in this case a better score always means a better match. We also tested our LCGA against its corresponding simple version CGA and Gradient Assignment (GA). The results in Fig. 1 verify the efficacy of gradually adding constraints and of outer looping. Note that CGA shows good performance and relatively low time cost.
[Figure 1 plots omitted: accuracy, objective score, and running time versus deformation noise (n_in = 60, n_out = 0, edge density = 1) and versus number of outliers (n_in = 50, deformation noise = 0, edge density = 1), comparing GAGM, LCGA, IPFP, SCGA, RRWM, SM, and GA, CGA, LCGA.]
Fig. 1. Evaluation on synthetic random graphs. Top two rows: deformation and outlier evaluation among GAGM, LCGA, IPFP, SCGA, RRWM, SM; bottom two rows: deformation and outlier evaluation among GA, CGA, LCGA
Performance as a Post Step: As shown in Fig. 2, in this test we use IPFP and LCGA as post-processors, using the results from RRWM, SM, SCGA and GAGM as their initial input (called RRWM-IPFP, and so on). Solid lines denote LCGA post-processing, while dashed lines denote IPFP. We find that LCGA outperforms IPFP for all methods except SM, which is worth future study. Note that SCGA-IPFP and SCGA-LCGA are comparable and achieve the best results, similarly to the outlier test.
[Figure 2 plots omitted: accuracy, objective score, and running time versus deformation noise and versus number of outliers for RRWM-IPFP, RRWM-LCGA, SM-IPFP, SM-LCGA, SCGA-IPFP, SCGA-LCGA, GAGM-IPFP, GAGM-LCGA (n_in = 50, edge density = 1).]
Fig. 2. Performance evaluation using LCGA and IPFP as post-processors, optimizing the results from other algorithms as their initial input
Real Images: In the second image experiment, following [4, 12], 30 image pairs from the Caltech-101 and MSRC datasets are selected, from which feature points are extracted by the MSER detector and SIFT descriptor. We followed the setup of [4, 12] exactly. The difficulty of this experiment stems from the large intra-category variations in shape and the large number of outliers, posing challenges regarding noise and outliers. Some comparative examples are shown in Fig. 3, and the average performance on the dataset is summarized in Table 1. The results show that LCGA outperforms the others, especially in score, while SCGA is comparable with the state-of-the-art algorithms. We also compare our algorithm with IPFP as a post step in this test; the result is similar to the synthetic experiment: both methods can elevate performance, especially in score. LCGA is slightly better than IPFP in most cases except SM. Both SCGA-IPFP and SCGA-LCGA achieve competitive results in accuracy and score.
Table 1. Evaluation of various matching algorithms. (+IPFP) and (+LCGA) denote performing IPFP and LCGA post-processing using the other methods' output as input.
Fig. 3. Evaluation on real images; in brackets: method name, accuracy, and score
Conclusion: We have proposed a novel framework and induced algorithms for graph matching, whose convergence is also proved. The experiments show that it outperforms the state-of-the-art methods in the presence of outliers and deformation, especially when the number of graph nodes is large. The comparison reveals that matching accuracy and convergence rate in challenging situations largely depend on the effective exploitation of the matching constraints in the search space.
References
1. Conte, D., Foggia, P., Sansone, C., Vento, M.: Thirty years of graph matching in pattern
recognition. IJPRAI (2004)
2. Gold, S., Rangarajan, A.: A graduated assignment algorithm for graph matching. PAMI
(1996)
3. Leordeanu, M., Hebert, M.: A Spectral Technique for Correspondence Problems using Pair-
wise Constraints. In: ICCV (2005)
4. Cho, M., Lee, J., Lee, K.M.: Reweighted Random Walks for Graph Matching. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part V. LNCS, vol. 6315, pp. 492–505. Springer, Heidelberg (2010)
5. Leordeanu, M., Hebert, M.: An integer projected fixed point method for graph matching and MAP inference. In: NIPS (2009)
6. Gold, S., Rangarajan, A.: Softmax to softassign: neural network algorithms for combinatorial optimization. J. Artif. Neural Netw., 381–399 (1995)
7. Rangarajan, A., Yuille, A.: Convergence properties of the softassign quadratic assignment
algorithm. Neural Computation (1999)
8. Kuhn, H.W.: The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2(1-2), 83–97 (1955)
9. Christmas, W., Kittler, J., Petrou, M.: Structural matching in computer vision using probabilistic relaxation. PAMI 19(6), 634–648 (1997)
10. Wilson, R., Hancock, E.: Structural matching by discrete relaxation. IEEE Trans. Pattern Anal. Mach. Intell. 17(8), 749–764 (1995)
11. Cour, T., Srinivasan, P., Shi, J.: Balanced Graph Matching. In: NIPS (2006)
12. Lee, J., Cho, M., Lee, K.M.: Hyper-graph Matching via Reweighted Random Walks. In: CVPR (2011)
13. Kosowsky, J.J., Yuille, A.L.: The Invisible Hand Algorithm: Solving the Assignment Prob-
lem With Statistical Physics. Neural Networks 7(3) (1994)
Proof: Assume the precision of M's elements is 10^(−k) (Equ. 12), and that the number of elements in M is N². One can add specks to M: 10^(−k−1)·2^(−1), 10^(−k−1)·2^(−2), ..., 10^(−k−1)·2^(−N²), to obtain the new matrix M_new. Thus, for all i, j, k with j ≠ k, each element of p_j − p_k is from the discrete set {−1, 0, 1} and each element of p_i is in {0, 1}. As a result, (p_j − p_k)^T M p_i is a linear combination of the elements of the original M plus the additional specks. When (p_j − p_k)^T M p_i = 0 holds for the original M, the corresponding value under M_new is nonzero, because (p_j − p_k)^T (M_new − M) p_i (a linear combination of the specks) cannot be zero, in that each basis is 2^(−L), L ≤ N², while the coefficients are in {−1, 0, 1}. Furthermore, when k is big enough, the magnitude of (p_j − p_k)^T (M_new − M) p_i is below the precision of the original function, so any original result that is nonzero remains nonzero after this modification.
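Under our reading of this construction (the exponent scheme below is a reconstruction of ours, since the original is garbled), the speck perturbation can be sketched numerically:

```python
import numpy as np

def add_specks(M, k):
    """Add a distinct tiny 'speck' 10^-(k+1) * 2^-L (L = 1..N^2) to each
    entry of M. Each speck is far below the 10^-k precision of M, so
    nonzero score differences stay nonzero, while the specks' distinct
    binary scales prevent any {-1, 0, 1}-combination from cancelling."""
    n_elems = M.size
    specks = 10.0 ** (-(k + 1)) * 2.0 ** (-np.arange(1, n_elems + 1, dtype=float))
    return M + specks.reshape(M.shape)
```

The perturbation changes every entry, yet by less than the stated precision of M.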
Using Lemma 1, Theorem 1 and Theorem 2 ensure the convergence of GA and CGA, respectively. In what follows, we present the detailed proof flow for the convergence of SCGA and GAGM as stated in the main text.
Lemma 2. (s_j, T) Given a fixed β (or T = 1/β), SoftMax(M s_j) converges to the solution s_{j+1}, whose matrix form is doubly stochastic.
Lemma 3. (T) Given a fixed T, let p_{j+1} = Hd(M s_j); we have: p_{j+1}^T M s_j − T N log N ≤ s_{j+1}^T M s_j ≤ p_{j+1}^T M s_j, where N = n² and n is the number of graph nodes.
Lemma 4. (s_j, T, Δ) Given M s_j, if there exists a unique best solution p_{j+1} = Hd(M s_j) returned by the Hungarian method, let p_i be the second best (discrete) one, and let Δ = |(p_{j+1} − p_i)^T M s_j|; then we have:

max_i |p^i_{j+1} − s^i_{j+1}| ≤ T N log N / Δ   (13)

where p^i_{j+1} denotes the i-th element of the vectorized permutation matrix p_{j+1}.
Lemma 2, Lemma 3 and Lemma 4 have been proven by Kosowsky and Yuille in [13]. One can also see that Assumption 2 (s_j, T_j, Δ) implies that Lemma 4 (s_j, T_j, Δ) is established.
p_{j+2}^T M s_{j+1} ≥ p_j^T M s_{j+1} + Δmin − 2 (T_j/Δ) n N² log N   (17)

p_{j+2}^T M s_{j+1} ≥ p_{j+2}^T M p_{j+1} − (T_j/Δ) n N² log N   (19)

By Assumption 1 (Δmin) and Lemma 6 (s_j, T_j, Δ) we have:

p_{j+2}^T M p_{j+1} − (T_j/Δ) n N² log N ≥ p_i^T M p_{j+1} + Δmin − (T_j/Δ) n N² log N   (20)

By Lemma 5 (s_j, T_j, Δ) we have:

p_i^T M p_{j+1} + Δmin − (T_j/Δ) n N² log N ≥ p_i^T M s_{j+1} + Δmin − 2 (T_j/Δ) n N² log N

Combining the three inequalities we reach:

p_{j+2}^T M s_{j+1} ≥ p_i^T M s_{j+1} + Δmin − 2 (T_j/Δ) n N² log N   (21)

By Assumption 2 (s_j, T_j, Δ), Δmin − 2 (T_j/Δ) n N² log N ≥ Δ, so the lemma holds.
Proof: Since T_{j+1} < T_j, the relations below hold, and thus Assumption 2 (s_{j+1}, T_{j+1}, Δ) holds:

(T_{j+1}/Δ) n N² log N < (T_j/Δ) n N² log N < Δmin/2   (22)

0 < Δ ≤ Δmin − 2 (T_j/Δ) n N² log N < Δmin − 2 (T_{j+1}/Δ) n N² log N   (23)
This leads to our main results: we now prove our main theoretical conclusion, stated as Theorem 3 in the main text.
Proof: For k = 0, 1, ..., Lemma 4 (s_{j+k+1}, T_{j+k+1}, Δ) and Lemma 6 (s_{j+k+1}, T_{j+k+1}, Δ) obviously hold. By Lemma 6 (s_{j+k}, T_{j+k}, Δ), we know p_{j+k+2} = Hd(M p_{j+k+1}). According to Assumption 1 (Δmin) and Lemma 4 (s_{j+k+1}, T_{j+k+1}, Δ), when k is big enough, the sequence s_{j+k} converges to the discrete sequence p_{j+k}.
We now prove the different convergence properties of the classical GAGM and the proposed SCGA, as stated in Corollary 1 and Corollary 2 in the main text.
Proof: By combining Theorem 1 and Theorem 3, we know GAGM will converge to a circular sequence. By combining Theorem 2 and Theorem 3, we know that when M becomes positive definite, SCGA will converge to a fixed discrete point.
Remark 1. (The mildness of Assumption 2) According to Equ. 8 and Equ. 9, Δ lies in:

Δ ∈ [ (Δmin − √(Δmin² − 8 T_j n N² log N)) / 2 , (Δmin + √(Δmin² − 8 T_j n N² log N)) / 2 ]   (24)
As T_j → 0, this formula implies Δ ∈ (0, Δmin); also, as T_j goes to zero, the SoftMax result SoftMax(s_j) approaches a discrete result (a vectorized permutation matrix). From Lemma 1 we know that as s_j approaches a discrete result, the score difference between the best solution p_{j+1} and the suboptimal solution p_i approaches Δmin; thus, after several iterations, we can choose Δ = Δmin/2, which lies in the range (0, Δmin) and is smaller than the difference between the best and suboptimal scores, so the existence of Assumption 2 is assured. In addition, based on our previous derivation, for a given s_j, once there exists a Δ satisfying the condition of Assumption 2, all later s_{j+1} satisfy the constraint. These facts consolidate our judgement that Assumption 2 is weak. Furthermore, we have a more concrete observation, described in Lemma 9.
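For completeness, the interval in Equ. 24 is consistent with reading Assumption 2 as Δmin − 2(T_j/Δ) nN² log N ≥ Δ; this reading is a reconstruction of ours, under which the algebra runs (using Δ > 0):

```latex
\Delta_{\min} - \frac{2T_j}{\Delta}\, nN^2 \log N \;\ge\; \Delta
\;\Longleftrightarrow\;
\Delta^2 - \Delta_{\min}\Delta + 2T_j\, nN^2 \log N \;\le\; 0
\;\Longleftrightarrow\;
\Delta \in \left[\frac{\Delta_{\min} - \sqrt{\Delta_{\min}^2 - 8T_j\, nN^2\log N}}{2},\;
                \frac{\Delta_{\min} + \sqrt{\Delta_{\min}^2 - 8T_j\, nN^2\log N}}{2}\right].
```

As $T_j \to 0$ the interval opens to $(0, \Delta_{\min})$, matching the remark above.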
Proof: One can select s_0 = (1/N², ..., 1/N²)^T; then Δ obviously lies within [Δmin/N², ((N² − 1)/N²) Δmin], and one can select an appropriate T_1 such that Δ lies in the scope of Equ. 24.
At the end of this appendix, we give conceptual illustrations of the idea of our proof, supposing T → 0 without loss of generality. First we introduce the initial status. Dotted arrows denote the generating process: red dotted arrows denote the iterative step of GAGM [2], where s_{j+1} = SoftMax(M s_j); blue dotted arrows denote the discrete optimum, where p_{j+1} = Hd(M s_j). Yellow solid lines denote the corresponding function scores. Bidirectional arrows denote the quantitative relationship between targets: a dashed bidirectional arrow denotes a gap relationship, while a solid bidirectional arrow denotes an asymptotic relationship.
As illustrated in Fig. 4, according to Lemma 3 we know that as T approaches 0, for all k, s_{j+k+1}^T M s_{j+k} approaches p_{j+k+1}^T M s_{j+k}. Based on Assumption 2 (s_j, T_j, Δ) we know there is a gap between the optimal score p_{j+1}^T M s_j and any other score. Thus from
Lemma 4 we get the asymptotic relationship between s_{j+1} and p_{j+1}. The green curved arrow illustrates the role of Assumption 2 - here we use T_j to represent Assumption 2 (s_j, T_j, Δ). Kosowsky and Yuille focus on the connection between the Hungarian solution and the SoftMax one, which is the first-order version of the pairwise graph matching task we address in this paper - a quadratic assignment problem solved by transforming it into a series of iterative first-order assignment subproblems. By repeatedly applying their conclusion to the separate steps in the quadratic case, we know that, starting from s_j, each score s_{j+k+1}^T M s_{j+k} approximates p_{j+k+1}^T M s_{j+k}. (Roughly speaking, the approximation of the two scores does not by itself ensure s_{j+k+1} ≈ p_{j+k+1} for all k, but we show in Fig. 5 that the answer is affirmative provided Assumption 2 holds for the first iteration.) In addition, under the precondition of Assumption 2 (s_j, T_j, Δ), according to the conclusion from [13], one can obtain a single s_{j+1} that approaches p_{j+1}. Note that the conclusion of [13] does not address the chain property in the quadratic assignment problem, where a first-order assignment is performed in each iteration, thus involving multiple (s_{j+k+1}, p_{j+k+1}) pairs. In other words, if one directly applies their conclusion to the chain case, one has to check the existence of a specific and separate Δ that builds the asymptotic relation between the respective s_{j+k+1} and p_{j+k+1} in each iteration - and, further, the convergence property of the iterative algorithm. As an extension, we strictly analyze and give an affirmative answer in terms of convergence for this iteration chain, under the rather weak precondition: if Assumption 2 (s_j, T_j, Δ) holds once in one beginning iteration, say j, then in all subsequent iterations the updated Assumption 2 (s_{j+k+1}, T_{j+k+1}, Δ) holds, and s_{j+k+1} is bound to approximate p_{j+k+1}.
We also prove that p_{j+k+2} is the Hungarian solution of M p_{j+k+1}; thus, for any other p_i, the score gap exists - see the dotted ellipses for attention (thus the discrete sequence p_{j+k+1} induced by s_{j+k} composes an iteration chain by itself). We show the proof flow in Fig. 5 and summarize the key steps as follows:
Fig. 5. Conceptualization of the proof flow for the iteration chain in the pairwise graph matching problem. The main idea is to explore the connection between the continuous solution and the discrete one: s_j ≈ p_j, as well as p_{j+1} = Hd(M p_j), where p_{j+1} is in fact calculated from Hd(M s_j). In this spirit, SCGA's convergence proof is similar to that of CGA.
1 Introduction
Automatic image annotation is a labelling problem with potential applications in image retrieval [5,11,16], image description [23], etc. Given an unseen image, the goal is to predict multiple textual labels describing that image. The recent outburst of multimedia content has raised the demand for auto-annotation methods, making this an active area of research [5,11,16,19,20]. Several methods have been proposed in the past for image auto-annotation which try to model image-to-image [5,11,16], image-to-label [10,20] and label-to-label [11,20] similarities. Our work falls under the category of supervised annotation models such as [4,5,10,11,13,16,19,22] that work with large annotation vocabularies.
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 836–849, 2012.
© Springer-Verlag Berlin Heidelberg 2012
Image Annotation Using Metric Learning in Semantic Neighbourhoods 837
Among these, K-nearest neighbour (or KNN) based methods such as [5,11,16] have been found to give some of the best results despite their simplicity. The intuition is that similar images share common labels [11]. In most of these approaches, this similarity is determined using image features only. Though this can handle label-to-label dependencies to some extent, it fails to address the class-imbalance (large variations in the frequency of different labels) and weak-labelling (many available images are not annotated with all the relevant labels) problems that are prevalent in the popular datasets (Sec. 4.1) as well as in real-world databases. E.g., in an experiment on the Corel 5K dataset, we found that for the 20% least frequent labels, JEC [11] achieves an F-score of 19.7%, whereas it gives reasonably good performance for the 20% most frequent labels, with an F-score of 50.6%.
As per our knowledge, no attempt has been made in the past that directly
addresses these issues. An indirect attempt was made in TagProp [16] to address
class-imbalance, which we discuss in Sec. 2. To address these issues in a nearest-
neighbour set-up, we need to make sure that (i) for a given image, the (subset
of) training images that are considered for label prediction should not have large
dierences in the frequency of dierent labels; and (ii) the comparison criteria
between two images should make use of both image-to-label and image-to-image
similarities (image-to-image similarities also capture label-to-label similarities
in a nearest-neighbour scenario). With this motivation, we present a two-step
KNN-based method. We call this 2-Pass K-Nearest Neighbour (or 2PKNN)
algorithm. For an image, we say that its few nearest neighbours from a given class
constitute its semantic neighbourhood, and these neighbours are its semantic
neighbours. For a particular class, these are the samples that are most related
with a new image. Given an unseen image, in the first step of 2PKNN we identify
its semantic neighbours corresponding to all the labels. Then, in the second step,
only these samples are used for label prediction. This relates to the idea of
bottom-up pruning common in day-to-day scenarios such as buying a car
or selecting clothes to wear, where first the potential candidates are short-listed
based on a quick analysis, and then another set of criteria is used for the final selection.
It is well-known that the performance of KNN based methods largely depends
on how two images are compared [11]. Usually, this comparison is done using a
set of features extracted from images and some specialized distance metric for
each feature (such as L1 for colour histograms, L2 for Gist) [11,16]. In such a
scenario, (i) since each base distance contributes differently, we can learn ap-
propriate weights to combine them in the distance space [11,16]; and (ii) since
every feature (such as SIFT or colour histogram) itself is represented as a multi-
dimensional vector, its individual elements can also be weighted in the feature
space [12]. As the 2PKNN algorithm works in the nearest-neighbour setting, we
would like to learn weights that maximize the annotation performance. With this
goal, we perform metric learning over 2PKNN by generalizing the LMNN [12]
algorithm for multi-label prediction. Our metric learning framework extends
LMNN in two major ways: (i) LMNN is meant for single-label classification (or
simply classification) problems, while we adapt it for image annotation, which
838 Y. Verma and C.V. Jawahar
2 Related Work
The image annotation problem was initially addressed using generative models;
e.g. translation models [2,3] and nearest-neighbour based relevance models [4,5].
Recently, a Markov Random Field [13] based approach was proposed that can
flexibly accommodate most of the previous generative models. Though these
methods are directly extendable to large datasets, they might not be ideal for
the annotation task as their underlying joint distribution formulations assume
independence of image features and labels, whereas recent developments [17]
emphasize using conditional dependence to achieve Bayes-optimal prediction.
Among discriminative models, SML [10] treats each label as a class of a multi-
class multi-labelling problem, and learns class-specific distributions. However, it
requires large (class-)balanced training data to estimate these distributions. Also,
label interdependencies might result in corrupted distribution models. Another
nearest-neighbour based method [19] tries to benefit from feature sparsity and
clustering properties using a regularization based algorithm for feature selection.
JEC [11] treats image annotation as retrieval. Using multiple global features, a
greedy algorithm is used for label transfer from neighbours. They also performed
metric learning in the distance space but it could not do any better than using
equal weights. This is because they used a classification-based metric learning
approach for the annotation task, which is multi-label classification by nature.
Though JEC looks simple at the modelling level, it reported the best results on
benchmark annotation datasets when it was proposed. TagProp [16] is a weighted
KNN based method that transfers labels by taking a weighted average of
keyword presence among the neighbours. To address the class-imbalance problem,
logistic discriminant models are wrapped over the weighted KNN method with
metric learning. This boosts the importance given to infrequent labels and sup-
presses it for frequent labels appearing among the neighbours.
Nearly all image annotation models can be considered as multi-label classi-
fication algorithms, as they associate multiple labels with an image. Recent
Let T_i ⊆ T, ∀i ∈ {1, . . . , l}, be the subset of the training data that contains all the
images annotated with the label y_i. Since each set T_i contains images with one
semantic concept (or label) common among them, we consider it as a semantic
group (similar to [20]). It should be noted that the sets T_i are not disjoint, as an
image usually has multiple labels and hence belongs to multiple semantic groups.
Given an unannotated image J, from each semantic group we pick the K1 images
that are most similar to J and form the corresponding sets T_{J,i} ⊆ T_i. Thus, each T_{J,i}
contains those images that are most informative in predicting the probability
of the label yi for J. The samples in each set TJ,i are the semantic neighbours
of J corresponding to yi . These semantic neighbours incorporate image-to-label
840 Y. Verma and C.V. Jawahar
' Once TJ,i s are determined, we merge them all to form a set TJ = {TJ,1
similarity.
'
. . . TJ,l }. This way, we obtain a subset of the training data TJ T specific
to J that contains its semantic neighbours corresponding to all the labels in the
vocabulary Y. This is the first pass of 2PKNN. It can be easily noted that in
TJ , each label appears (at least) K1 times, thus addressing the class-imbalance
issue. To understand how this step also handles the issue of weak-labelling, we
analyze the cause of this. Weak-labelling occurs because obvious labels are
often missed by human annotators while building a dataset, and hence many
images depicting any such concept are actually not annotated with it. Under
this situation, given an unseen image, if we use only its few nearest neighbours
from the entire training data (as in [11,16]), then such labels may not appear
among these neighbours and hence not get apt scores. On the contrary, the first pass
of 2PKNN finds a neighbourhood where all the labels are present explicitly.
Therefore, now they (i.e., the obvious labels) have better chances of getting
assigned to a new image, thus addressing the weak-labelling issue.
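The first pass can be sketched in a few lines (our own minimal illustration, not code from the paper; the list-of-pairs data layout and the `dist` callback are assumptions, and a real system would precompute distances or use an index structure):

```python
def first_pass(train, test_feat, dist, K1):
    """2PKNN, pass 1: from every semantic group (images sharing a label),
    keep the K1 images nearest to the test image; their union is the
    semantic neighbourhood T_J.  `train` is a list of (feature, label-set)."""
    vocab = {y for _, labels in train for y in labels}
    T_J = set()
    for y in vocab:
        # semantic group T_y: indices of training images carrying label y
        group = [i for i, (_, labels) in enumerate(train) if y in labels]
        group.sort(key=lambda i: dist(test_feat, train[i][0]))
        T_J.update(group[:K1])   # K1 semantic neighbours for label y
    return T_J
```

Because every label contributes K1 images to T_J (when available), the candidate pool is balanced across labels by construction.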
[Figure 1: a test image with the labels of its nearest training neighbours under 2PKNN and under JEC; see the caption below.]
Fig. 1. For a test image from the IAPR TC-12 dataset (first column), the first row
on the right section shows its 4 nearest images (and their labels) from the training
data after the first pass of 2PKNN (K1 = 1), and the second row shows the 4 nearest
images using JEC [11]. The labels in bold are the ones that match the ground-
truth labels of the test image. Note the frequency (9 vs. 6) and diversity ({sky, house,
landscape, bay, road, meadow} vs. {sky, house}) of matching labels for 2PKNN vs.
JEC.
Figure 1 shows an example from the IAPR TC-12 dataset (Sec. 4.1) illus-
trating how the first pass of 2PKNN addresses both class-imbalance and weak-
labelling. For a given test image (first column) along with its ground-truth labels,
we can notice the presence of the rare labels {landscape, bay, road, meadow}
among its four nearest images found after the first pass of 2PKNN (first row
on the right), without compromising on the frequent labels {sky, house}.
On the contrary, the neighbours obtained using JEC [11] (second row) contain only
the frequent labels. It can also be observed that though the labels {landscape,
meadow} look obvious for the neighbours found using JEC, these are actually
absent in their ground-truth annotations (weak-labelling), whereas the first pass
of 2PKNN makes their presence among the neighbours explicit.
The second pass of 2PKNN is a weighted sum over the samples in TJ to
assign importance to labels based on image similarity. This gives the posterior
probability of J given a label y_k ∈ Y as

    P(J | y_k) = Σ_{(I_i, Y_i) ∈ T_J} θ_{J,I_i} · P(y_k | I_i) = Σ_{(I_i, Y_i) ∈ T_J} exp(−D(J, I_i)) · δ(y_k ∈ Y_i),   (3)

where θ_{J,I_i} = exp(−D(J, I_i)) (see Eq. (4) for the definition of D(J, I_i)) denotes
the contribution of image I_i in predicting the label y_k for J depending on their
visual similarity; and P(y_k | I_i) = δ(y_k ∈ Y_i) denotes the presence/absence of
label y_k in the label set Y_i of I_i, with δ(·) being 1 only when the argument
holds true and 0 otherwise. Assuming that the first pass of 2PKNN will give
a subset of the training data where each label has comparable frequency, we
set the prior probability in Eq. (1) as a uniform distribution; i.e., P(y_i) = 1/l,
∀i ∈ {1, . . . , l}. Substituting Eq. (3) into Eq. (1) generates a ranking of all the labels
based on their probability of getting assigned to the unseen image J. Note that
along with image-to-image similarities, the second pass of 2PKNN implicitly
takes care of label-to-label dependencies, as the labels appearing together in the
same neighbouring image will get equal importance.
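Concretely, the second pass, Eq. (3), reduces to an exponentially weighted vote over the semantic neighbourhood. The sketch below is our own illustration under the same hypothetical data layout as before (`train` as (feature, label-set) pairs, `dist` a distance callback):

```python
import math

def second_pass(T_J, train, test_feat, dist):
    """2PKNN, pass 2: score each label y by summing exp(-D(J, I_i)) over
    semantic neighbours I_i whose label set contains y (Eq. (3))."""
    scores = {}
    for i in T_J:
        feat, labels = train[i]
        theta = math.exp(-dist(test_feat, feat))  # contribution of image I_i
        for y in labels:
            scores[y] = scores.get(y, 0.0) + theta
    return sorted(scores, key=scores.get, reverse=True)  # ranked label list
```

Labels carried by the same nearby image receive the same increment, which is the implicit label-to-label coupling mentioned above.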
Both conceptually and practically, 2PKNN is entirely different from the
previous two-step variants of KNN such as [1,7]. They use a few (global) nearest
neighbours of a sample to apply some other, more sophisticated technique such as
linear discriminant analysis [1] or SVM [7]. In contrast, the first pass of 2PKNN
considers all the samples, but in localized semantic groups. Also, the previous vari-
ants were designed for the classification task, while 2PKNN addresses the more
challenging problem of image annotation (which is multi-label classification by
nature), where the datasets suffer from high class-imbalance and weak-labelling
(note that weak-labelling is not a concern in classification problems).
purpose, we extend the classical LMNN [12] algorithm for multi-label prediction.
Let there be two images A and B, each represented by n features {f_A^1, . . . , f_A^n}
and {f_B^1, . . . , f_B^n} respectively. Each feature is a multi-dimensional vector, with
unit vector¹. Note that v_i can be replaced by any non-negative real-valued normal-
ized vector that assigns appropriate weights to the individual dimensions of a feature
vector in the feature space. Moreover, we can also learn weights w ∈ R^n_+ in the
distance space to optimally combine multiple feature distances. Based on this,
we write the distance between A and B as

    D(A, B) = Σ_{i=1}^{n} w(i) · ( Σ_{j=1}^{N_i} v_i(j) · d^i_{AB}(j) ) .   (4)

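Read literally, Eq. (4) is a doubly weighted sum: per-dimension weights v_i inside each feature, and per-feature weights w across features. A minimal sketch (our own illustration; an elementwise L1 stands in for the per-feature base distance, whereas the paper lets each feature use its own measure):

```python
def combined_distance(A, B, w, v):
    """Eq. (4): D(A,B) = sum_i w[i] * sum_j v[i][j] * d_i(A,B)[j].
    A and B are lists of feature vectors; w weights features in the
    distance space, v[i] weights the dimensions of feature i."""
    total = 0.0
    for i, (fa, fb) in enumerate(zip(A, B)):
        d = [abs(a - b) for a, b in zip(fa, fb)]  # elementwise base distance
        total += w[i] * sum(vj * dj for vj, dj in zip(v[i], d))
    return total
```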
Now we describe how to learn the weights appearing in Eq. (4). For a given
labelled sample (I_p, Y_p) ∈ T, we define its (i) target neighbours as its K1 nearest
images from each semantic group T_q, ∀q s.t. y_q ∈ Y_p, and (ii) impostors as its K1
nearest images from each T_r, ∀r s.t. y_r ∈ Y \ Y_p. Our objective is to learn the weights
such that the distance of a sample from its target neighbours is minimized, and
is also less than its distance from any of the impostors (i.e., pull the target
neighbours and push the impostors). In other words, given an image Ip along
with its labels Y_p, we want to learn weights such that its nearest (K1) semantic
neighbours from the semantic groups T_q (i.e., the groups corresponding to its
ground-truth labels) are pulled closer, and those from the remaining semantic
groups are pushed away. With this goal, for a sample image I_p, its target neighbour
Iq and its impostor Ir , the loss function will be given by

    E_loss = Σ_{pq} η_{pq} · D(I_p, I_q) + μ Σ_{pqr} η_{pq} (1 − μ_{pr}) [1 + D(I_p, I_q) − D(I_p, I_r)]_+ .   (5)

Here, μ > 0 handles the trade-off between the two error terms. The variable η_{pq}
is 1 if I_q is a target neighbour of I_p and 0 otherwise. μ_{pr} = |Y_p ∩ Y_r| / |Y_r| ∈ [0, 1], with
Y_r being the label set of an impostor I_r of I_p. And [z]_+ = max(0, z) is the hinge
loss, which will be positive only when D(I_p, I_r) < D(I_p, I_q) + 1 (i.e., when for a
sample I_p, its impostor I_r is nearer than its target neighbour I_q). To make sure
¹ d^i_{AB} can similarly be computed for other measures such as squared L2 and χ².
that a target neighbour Iq is much closer than an impostor Ir , a margin (of size 1)
is used in the error function. Note that μ_{pr} lies in the continuous range [0, 1], thus
scaling the hinge loss depending on the overlap between the label sets of a given
image I_p and its impostor I_r. This means that for a given sample, the amount of
push applied on its impostor varies depending on its conceptual similarity with
that sample. An impostor with large similarity will be pushed less, whereas an
impostor with small similarity will be pushed more. This makes it suitable for
multi-label classification tasks such as image annotation. The above loss function
is minimized by the following constrained optimization problem:

    min_{w,v}  Σ_{pq} η_{pq} · D(I_p, I_q) + μ Σ_{pqr} η_{pq} (1 − μ_{pr}) ξ_{pqr}
    s.t.       D(I_p, I_r) − D(I_p, I_q) ≥ 1 − ξ_{pqr}   ∀ p, q, r
               ξ_{pqr} ≥ 0   ∀ p, q, r
               w(i) ≥ 0  ∀ i;   Σ_{i=1}^{n} w(i) = n
               v_i(j) ≥ 0  ∀ i, j;   Σ_{j=1}^{N_i} v_i(j) = 1   ∀ i ∈ {1, . . . , n}   (6)
Here, the slack variables ξ_{pqr} represent the hinge loss in Eq. (5), and v is a vec-
tor obtained by concatenating all the v_i for i = 1, . . . , n. We solve the above
optimization problem in the primal form itself. Since image features are usually
very high-dimensional, the number of variables is large (= n + Σ_{i=1}^{n} N_i). This
makes the above optimization problem difficult to scale using conven-
tional gradient descent. To overcome this issue, we solve it by alternately using
stochastic sub-gradient descent and projection steps (similar to Pegasos [9]). This
gives an approximate optimal solution using a small number of comparisons, and
thus helps in achieving high scalability. To determine the optimal weights, we
alternate between the weights in the distance space and the feature space.
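The projection steps must return w to the set {w ≥ 0, Σ_i w(i) = n} and each v_i to the probability simplex after a sub-gradient update. The text does not spell out the projection, so the standard Euclidean simplex projection shown below is our assumption, not necessarily the authors' exact choice:

```python
def project_simplex(x, s=1.0):
    """Euclidean projection of x onto {y : y >= 0, sum(y) = s}.
    After each stochastic sub-gradient step one would project each v_i
    with s = 1, and w with s = n."""
    u = sorted(x, reverse=True)
    css, theta = 0.0, 0.0
    for k, uk in enumerate(u, start=1):
        css += uk
        t = (css - s) / k
        if uk - t > 0:        # u_k is still active under threshold t
            theta = t
    return [max(0.0, xi - theta) for xi in x]
```

Each projection is O(d log d) in the dimension d, which keeps the alternating Pegasos-style updates cheap.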
Our extension of LMNN conceptually differs from its previous extensions such
as [18] in at least two significant ways: (i) we adapt LMNN in its choice of tar-
gets/impostors to learn metrics for multi-label prediction problems, whereas [18]
uses the same definition of targets/impostors as in LMNN to address the classifica-
tion problem in a multi-task setting; and (ii) in our formulation, the amount of
push applied on an impostor varies depending on its conceptual similarity w.r.t.
a given sample, which makes it suitable for multi-label prediction tasks.
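To make the impostor weighting of Eq. (5) concrete, the following sketch (our own illustration; the precomputed distance table `D` and the target/impostor index lists are assumed inputs) evaluates the loss for one configuration:

```python
def multilabel_lmnn_loss(D, targets, impostors, label_sets, mu=1.0):
    """Eq. (5): a pull term over target neighbours plus a hinge push term,
    where each impostor r of sample p is down-weighted by the label overlap
    mu_pr = |Y_p & Y_r| / |Y_r|."""
    pull = sum(D[p][q] for p in targets for q in targets[p])
    push = 0.0
    for p in targets:
        for q in targets[p]:
            for r in impostors[p]:
                mu_pr = len(label_sets[p] & label_sets[r]) / len(label_sets[r])
                push += (1 - mu_pr) * max(0.0, 1 + D[p][q] - D[p][r])
    return pull + mu * push
```

An impostor sharing most of its labels with the sample (mu_pr close to 1) contributes almost nothing to the push term, which is the behaviour described above.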
4 Experiments
We have used three popular image annotation datasets (Corel 5K, ESP Game
and IAPR TC-12) to test and compare the performance of our method with
previous approaches. Corel 5K was first used in [3], and since then it has become
a benchmark for comparing annotation performance. ESP Game contains images
annotated using an on-line game, where two (mutually unknown) players are
randomly given an image for which they have to predict the same keyword(s) to
score points [6]. This way, many people participate in the manual annotation
task, thus making this dataset very challenging and diverse. IAPR TC-12 was
introduced in [8] for cross-lingual retrieval. In this, each image has a detailed
description from which only nouns are extracted and treated as annotations.
In Table 1, columns 2–5 show the general statistics of the three datasets;
and in columns 6–8, we highlight some interesting statistics that provide better
insight into the structure of the three datasets. It can be noticed that, for
each dataset, around 75% of the labels have frequency less than the mean label
frequency (column 8), and also the median label frequency is far less than the
corresponding mean frequency (column 7). This verifies the claim we previously
made in Sec. 1 (i.e., the datasets suffer badly from the class-imbalance problem).
Table 1. General (columns 2–5) and some insightful (columns 6–8) statistics for the
three datasets. In columns 6 and 7, the entries are in the format mean, median, maxi-
mum. Column 8 (Labels#) shows the number of labels whose frequency is less than
the mean label frequency.

    Dataset     | Images | Labels | Training | Testing | Labels per image | Images per label (label frequency) | Labels#
    Corel 5K    |  5,000 |    260 |    4,500 |     500 | 3.4, 4, 5        | 58.6, 22, 1004                     | 195 (75.0%)
    ESP Game    | 20,770 |    268 |   18,689 |   2,081 | 4.7, 5, 15       | 326.7, 172, 4553                   | 201 (75.0%)
    IAPR TC-12  | 19,627 |    291 |   17,665 |   1,962 | 5.7, 5, 23       | 347.7, 153, 4999                   | 217 (74.6%)
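The statistics in columns 6–8 of Table 1 can be recomputed from raw annotation lists with a few lines. This is our own sketch (the `annotations` input, a list of per-image label sets, is a hypothetical layout):

```python
import statistics
from collections import Counter

def label_stats(annotations):
    """Mean/median/max of labels-per-image and images-per-label, plus the
    count of labels rarer than the mean label frequency (Table 1, cols 6-8)."""
    per_image = [len(labels) for labels in annotations]
    counts = list(Counter(y for labels in annotations for y in labels).values())
    mean_freq = statistics.mean(counts)
    return {
        "labels_per_image": (statistics.mean(per_image),
                             statistics.median(per_image), max(per_image)),
        "label_frequency": (mean_freq, statistics.median(counts), max(counts)),
        "labels_below_mean": sum(c < mean_freq for c in counts),
    }
```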
4.4 Comparisons
Now we present the quantitative analysis on the three datasets. First, for the sake
of completeness, we compare with recent multi-label ranking methods [14,21],
and then we show detailed comparisons with the previous annotation methods.
in Table 2 using the same conventions as theirs. To be specific, we use the Corel
5K dataset and consider only the 20 most frequent labels, which results in 3,947
training and 444 test images with an average of 1.8 labels per image. As we can see,
our method performs significantly better than [14] on all three measures.
Notable is the considerable reduction in the coverage score, which indicates that
in most of the cases, all the ground-truth annotations are included among the
top 4 labels predicted by our method.
Table 2. Comparison between MIML [14] and 2PKNN combined with metric learning
on a subset of the Corel 5K dataset using only the 20 most frequent labels, as in [14]
To show that our method addresses the weak-labelling issue, we compare with [21].
Though [21] addresses the problem of incomplete labelling (see Sec. 2), conceptu-
ally it overlaps with the weak-labelling issue, as both work under the scenario of
unavailability of all relevant labels. Following their protocol, we select those im-
ages from the entire ESP Game dataset that are annotated with at least 5 labels,
and then test on four cases. First we use all the labels in the ground-truth, and
then randomly remove 20%, 40% and 60% of the labels respectively from the ground-
truth annotation of each training image. On average, they achieved 83.7475%
AUC for these cases, while we get 85.4739%, which is better by 1.7264%.
These empirical evaluations show that our method consistently performs
better than the previous models under multiple evaluation criteria,
thus establishing its overall effectiveness.
4.5 Discussion
Fig. 2. Annotation performance in terms of mean recall (vertical axis) for the three
datasets obtained using weighted KNN as in TagProp [16] (blue), using 2PKNN (red),
and 2PKNN combined with metric learning (green). The labels are grouped based on
their frequency in a dataset (horizontal axis). The first bin corresponds to the 50%
least frequent labels and the second bin corresponds to the 50% most frequent labels.
[Figure 3: example images with their ground-truth and predicted label lists; see the caption below.]
Fig. 3. Annotations for example images from the three datasets. The second row shows
the ground-truth annotations and the third row shows the labels predicted using our
method, 2PKNN+ML. The labels in blue (bold) are those that match the ground-
truth. The labels in red (italics) are those that, though depicted in the corresponding
images, are missing from their ground-truth annotations and are predicted by our method.
5 Conclusion
We showed that our variant of the KNN algorithm, i.e. 2PKNN, combined with
metric learning performs better than the previous methods on three challenging
image annotation datasets. It also addresses the class-imbalance and weak-labelling
issues that are prevalent in real-world scenarios. This can be useful for natu-
ral image databases where tags obey Zipf's law. We also demonstrated how
a classification metric learning algorithm can be effectively adapted for the more
complicated multi-label classification problems such as annotation. We hope that
this work sheds some light on the possibilities of extending the popular discrim-
inative margin-based classification methods for multi-label prediction tasks.
References
1. Hastie, T., Tibshirani, R.: Discriminant adaptive nearest neighbor classification.
PAMI 18(6), 607–616 (1996)
2. Mori, Y., Takahashi, H., Oka, R.: Image-to-word transformation based on dividing
and vector quantizing images with words. In: First International Workshop on
Multimedia Intelligent Storage and Retrieval Management (1999)
3. Duygulu, P., Barnard, K., de Freitas, J.F.G., Forsyth, D.: Object Recognition
as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary. In:
Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002, Part IV.
LNCS, vol. 2353, pp. 97–112. Springer, Heidelberg (2002)
4. Lavrenko, V., Manmatha, R., Jeon, J.: A model for learning the semantics of
pictures. In: NIPS (2003)
5. Feng, S.L., Manmatha, R., Lavrenko, V.: Multiple Bernoulli relevance models for
image and video annotation. In: CVPR, pp. 1002–1009 (2004)
6. von Ahn, L., Dabbish, L.: Labeling images with a computer game. In: ACM SIGCHI
(2004)
7. Zhang, H., Berg, A.C., Maire, M., Malik, J.: SVM-KNN: Discriminative nearest
neighbor classification for visual category recognition. In: CVPR (2006)
8. Grubinger, M.: Analysis and Evaluation of Visual Information Systems Perfor-
mance. PhD thesis, Victoria University, Melbourne, Australia (2007)
9. Shwartz, S.S., Singer, Y., Srebro, N.: Pegasos: Primal estimated sub-gradient solver
for SVM. In: ICML (2007)
10. Carneiro, G., Chan, A.B., Moreno, P.J., Vasconcelos, N.: Supervised learning of
semantic classes for image annotation and retrieval. PAMI 29(3), 394–410 (2007)
11. Makadia, A., Pavlovic, V., Kumar, S.: A New Baseline for Image Annotation. In:
Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part III. LNCS, vol. 5304,
pp. 316–329. Springer, Heidelberg (2008)
12. Weinberger, K.Q., Saul, L.K.: Distance metric learning for large margin nearest
neighbor classification. JMLR 10(2), 207–244 (2009)
13. Xiang, Y., Zhou, X., Chua, T.-S., Ngo, C.W.: A revisit of generative models for
automatic image annotation using Markov random fields. In: CVPR (2009)
14. Jin, R., Wang, S., Zhou, Z.H.: Learning a distance metric from multi-instance
multi-label data. In: CVPR, pp. 896–902 (2009)
15. Wang, H., Huang, H., Ding, C.: Image annotation using multi-label correlated
Green's function. In: ICCV (2009)
16. Guillaumin, M., Mensink, T., Verbeek, J., Schmid, C.: TagProp: Discriminative met-
ric learning in nearest neighbour models for image auto-annotation. In: ICCV (2009)
17. Dembczynski, K., Cheng, W., Hüllermeier, E.: Bayes optimal multilabel classifica-
tion via probabilistic classifier chains. In: ICML (2010)
18. Parameswaran, S., Weinberger, K.Q.: Large margin multi-task metric learning. In:
NIPS (2010)
19. Zhang, S., Huang, J., Huang, Y., Yu, Y., Li, H., Metaxas, D.N.: Automatic image
annotation using group sparsity. In: CVPR (2010)
20. Wang, H., Huang, H., Ding, C.: Image annotation using bi-relational graph of
images and semantic labels. In: CVPR (2011)
21. Bucak, S.S., Jin, R., Jain, A.K.: Multi-label learning with incomplete class assign-
ments. In: CVPR, pp. 2801–2808 (2011)
22. Nakayama, H.: Linear distance metric learning for large-scale generic image recog-
nition. PhD thesis, The University of Tokyo, Japan (2011)
23. Gupta, A., Verma, Y., Jawahar, C.V.: Choosing linguistics over vision to describe
images. In: AAAI (2012)
Dynamic Programming
for Approximate Expansion Algorithm
Olga Veksler
1 Introduction
The expansion algorithm [1] has been used for a variety of pixel labeling problems
arising in vision and graphics [2–6]. In a pixel labeling problem the goal is to
assign a label to each image pixel. In general, pixel labeling problems are NP-
hard [7], but can be solved efficiently in some special cases [8–12].
The expansion algorithm is an approximate optimization method with certain
optimality guarantees [1]. It works by iteratively visiting labels and perform-
ing an α-expansion move on each label. An expansion move finds an optimal
subset of pixels to switch to label α. Each expansion step is performed with a
min-cut/max-flow algorithm [13]. There is a max-flow algorithm [14] that was
specifically designed for vision problems, and it performs very efficiently in prac-
tice for many instances. Nevertheless, the best known worst-case time complexity
of max-flow is worse than linear in the number of image pixels.
Recently, a tiered labeling [12] algorithm, based on dynamic programming
(DP), was proposed for the labeling of 2D scenes. This algorithm finds the exact
⁎ This research was supported, in part, by NSERC, CFI, and ERA grants.
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 850–863, 2012.
© Springer-Verlag Berlin Heidelberg 2012
global minimum, but only for scenes with a tiered label layout. In particular,
the scene is assumed to be divided by two simple horizontal curves into three
main tiers: top, middle, and bottom. The top and bottom tiers receive a fixed
label, T and B, respectively. The middle tier can be further subdivided into
tiers by vertical straight lines, and each sub-tier has a choice of one out of k
labels. This tiered scene layout is rather limited, but an optimal labeling can
be found in O(kn^1.5) time for a square image of n pixels and the Potts energy
model. Most importantly, [12] shows that dynamic programming can be useful
for labeling 2D scenes, whereas traditionally, in computer vision, DP was applied
mostly to one-dimensional or low-treewidth structures [15].
Inspired by tiered labeling, we propose an approximate expansion algorithm
based on fast DP. Just as in tiered labeling, the main idea of our DP-expansion
moves is to allow only a certain structural layout inside the image to switch to
label α. This is in contrast to the classical α-expansion, which allows an arbitrary
subset of pixels to switch to α. By limiting the structure of the subsets that
can switch to α, we are modifying the problem to make it suitable for dynamic
programming. In fact, our moves are even more restrictive than the tiered
labeling structure, but their complexity is linear in the number of pixels, making
them extremely efficient in practice. Our moves are based on the observation that,
instead of expanding on the optimal subset of pixels in a single step, it may be
sufficient in practice, although clearly suboptimal, to find an expansion region
where only the left (or right, or top, or bottom) boundary is accurate. This
allows linear efficiency in the number of pixels. To recover accurate boundaries
in all directions, we apply our moves iteratively.
Like tiered labeling, due to DP, we are limited to pairwise energies with a 4-
or 8-connected neighborhood structure. This is a clear limitation, but longer-
range interactions are seldom used. In addition to speed, a clear advantage of
our DP-expansion moves over the max-flow expansion moves is that, like tiered
labeling, we can handle arbitrary pairwise terms, again due to DP. In fact, due
to the simplicity of our moves, we can handle arbitrary terms with the same
complexity as Potts pairwise terms, i.e. O(n), where n is the number of pixels.
We illustrate the performance of our DP-expansion on the Potts energy, but
our algorithm can be used for any pairwise energies. We achieve better efficiency
with almost the same energy compared to the max-flow expansion moves.
This paper is organized as follows. We review the most related work in Sec-
tion 2. We motivate and explain our DP-expansion moves in Section 3. We
describe our efficient optimization algorithm in Section 4, present experimental
results in Section 5, and conclude with a short discussion in Section 6.
2 Related Work
There has been a lot of interest in the vision community in developing efficient opti-
mization methods for pixel labeling problems; see the survey in [16]. The popular
methods evaluated in [16] include the expansion and swap algorithms [7], loopy
belief propagation (LBP) [17], and sequential tree-reweighted message passing
(TRW-S) [18]. One of the conclusions of [16] is that the expansion algorithm
performs best in terms of speed and accuracy for Potts energies.
Since our optimization is based on making approximate expansion moves, our
work is related to the general direction of move-making algorithms. In addition to
the expansion and swap algorithms [7], a number of new move making algorithms
have been developed recently. In [19], a fusion move is introduced that, instead of
expanding on a fixed label α, can expand on a different but single fixed label for
each pixel in a move, thereby fusing the current solution and a new proposal
solution together. In [20,21], range moves designed specifically for
energies tuned to truncated convex pairwise terms are proposed. These moves,
unlike the previous ones, are multi-label: they allow each pixel a choice of more
than one label to switch to. In [22], multi-label moves are developed for more
general energy functions. The approach in [23] also uses multi-label moves for
fast approximate optimization of tractable (submodular) multi-label energies
that can, in fact, be exactly optimized by such methods as [10], but the time
and space complexities of the exact method are heavy.
Another line of related methods comprises optimization techniques that limit
the structure of the allowed labellings. Tiered labeling [12], which inspired our
work, uses a limited scene layout and relies on DP for optimization. Another
closely related approach is [24], which uses a move-making algorithm based on
tiered labeling for approximate optimization of more general energies. Its time
complexity is rather large, since using tiered labeling for a single move has cost
higher than linear in the number of pixels. We construct a simpler move than
that in [24], but our efficiency is much better. Another related method is [25],
which proposes a min-cut based approach for a five-label energy function
with a certain label structure layout. This layout is motivated by the application of
indoor geometric scene labeling. The approach in [25] is approximate, unlike [12],
even though for a four-connected neighborhood system the tiered labeling layout is
more general than that in [25]. Recently, [26] showed how to optimize the energy
in [25] globally and optimally with DP. Another DP-based optimization method
on 2D domains is [27], although its objective function is suitable only for
object/background segmentation, not a more general scene labeling problem.
Finally, since our goal is to speed up the expansion algorithm, there are re-
lated methods in this direction as well, which significantly improve the efficiency of
expansion. In [28], speed-ups are attained by considering the dual formulation.
In [29], several techniques are exploited, such as recycling information from pre-
vious instances, clever initialization, etc. In [30], the order of iterations over labels
likely to lead to the fastest energy decrease is found. Our approach explores
a previously unexplored direction, namely limiting the allowed move structure.
3 DP-Expansion Moves
In this section, we explain and motivate our DP-expansion moves.
where D_p(l_p) and V_pq(l_p, l_q) are the unary and pairwise energy terms, and N is a neighborhood system on P. We assume N is 4-connected, with ordered pairs (p, q) such that p is directly to the left of or above q. Our algorithm can be easily extended to the 8-connected neighborhood that includes diagonal edges, with almost no increase in computational time. This is unlike max-flow expansion, where including diagonal links noticeably increases computational time.

The unary term D_p(α) is the cost of assigning label α to pixel p. It usually depends on the value of pixel p in the observed image. The pairwise term V_pq(α, β) specifies the cost of assigning labels α and β to pixels p and q. This term is usually used to enforce smoothness constraints on a low-cost labeling. Unlike most work on optimizing Eq. (1), we impose no constraints on the unary and pairwise terms. In particular, V_pq(·, ·) does not have to be submodular [31].
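To make the setup concrete, here is a minimal sketch (our own code, not the paper's implementation) of evaluating an energy of this form on a 4-connected grid; the Potts pairwise term used later in the experiments stands in for the general V_pq, and all names are ours.

```python
import numpy as np

def grid_energy(labels, D, w=1.0):
    """Evaluate an energy of the form in Eq. (1),
        E(l) = sum_p D_p(l_p) + sum_{(p,q) in N} V_pq(l_p, l_q),
    on a 4-connected grid, with a Potts term V_pq = w * [l_p != l_q]
    standing in for the general pairwise case.

    labels : (n, m) int array, the labeling l
    D      : (n, m, L) array, D[i, j, a] = unary cost of label a at pixel (i, j)
    w      : Potts weight (a single scalar here for simplicity)
    """
    n, m = labels.shape
    # Unary part: sum of D_p(l_p) over all pixels.
    unary = D[np.arange(n)[:, None], np.arange(m)[None, :], labels].sum()
    # Pairwise part over ordered pairs: p directly left of or above q.
    vert = (labels[:-1, :] != labels[1:, :]).sum()  # vertical edges
    horz = (labels[:, :-1] != labels[:, 1:]).sum()  # horizontal edges
    return unary + w * (vert + horz)
```
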
Fig. 1. Simplified version of tiered labeling: the middle tier is allowed only one label
854 O. Veksler
3.3 DP-Expansion
To speed up, we need to further reduce the number of states in each column. Observe that in the approach outlined in Section 3.2 a single move is capable of carving out relatively complex upper and lower boundaries of a region to switch to α, leaving the left and right boundaries relatively simpler. Being able to carve out complex boundaries is clearly important for faithfully recovering object boundaries. However, it may be sufficient to work on one boundary at a time. Our main idea is to develop an even simpler move that allows a complex boundary in only one direction. The trade-off is that the state space is reduced and we can achieve a linear-time DP.

This motivates the following DP-based moves. We have four different move types: top-anchored, left-anchored, bottom-anchored, and right-anchored. For the top-anchored move, if any pixels in column j decide to switch their labels to α, they must form a contiguous set. Furthermore, this set must contain the topmost (row 0) pixel. That is, for each column j, only pixel subsets of the form {(i, j) | 0 ≤ i < b} for some b ≤ n are allowed to switch to label α. Here n is the
¹ Actually, [24] introduced more powerful moves than the ones we just described, but we believe our moves will work almost as well as those in [24].
number of rows in the image. This move is illustrated in Figure 2(a). It allows recovery of a region with a complex boundary on the bottom.

The other moves, illustrated in Figure 2(b,c,d), are similar. They require any switching to start at the left, bottom, or right of the image, respectively, and they are capable of recovering a region with complex boundaries on the right, top, and left, respectively.
Fig. 2. Four DP-expansion moves. The old label is white and the new label is green. Top-anchored move in (a): if any pixels in a column switch to label α, they must form a contiguous region and the top pixel must switch to α. Left-, bottom-, and right-anchored moves require switching to start at the left, bottom, and right of the image, respectively.
We show how to implement these moves in time linear in the number of pixels in Section 4. While very fast, these moves would not be powerful if they were applied only to the whole image, since any switch to a new label must start at the image top (or bottom, left, right). We found it effective to apply these moves to image blocks of different sizes in random order.

When applying a DP-expansion move to an image block, only the unary and pairwise terms that depend on pixels entirely within the block are involved. However, the pairwise terms V_pq that involve one pixel in the block and one pixel outside the block are also affected by the changes within the block, and must be addressed to ensure the energy strictly decreases. As in previous work [1], the fix is simple. Let l' be the old labeling, and let V_pq have p inside the block and q outside the block. We add V_pq(l'_p, l'_q) to D_p(l'_p) and V_pq(α, l'_q) to D_p(α).
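This border fix can be sketched as follows; the container types and names below are our own illustrative choices, not the paper's implementation.

```python
def absorb_border_terms(D_block, old_labels, alpha, border_edges, V):
    """Fold pairwise terms that straddle the block border into the block's
    unary terms: for an edge (p, q) with p inside the block and q outside,
    add V(p, q, l'_p, l'_q) to D_p(l'_p) and V(p, q, alpha, l'_q) to D_p(alpha).

    D_block      : dict mapping pixel p -> list of unary costs (one per label)
    old_labels   : dict mapping pixel -> its current (old) label l'
    alpha        : the expansion label
    border_edges : iterable of (p, q) pairs, p inside the block, q outside
    V            : pairwise cost function V(p, q, label_p, label_q)
    """
    for p, q in border_edges:
        lp, lq = old_labels[p], old_labels[q]
        D_block[p][lp] += V(p, q, lp, lq)        # cost if p keeps its old label
        D_block[p][alpha] += V(p, q, alpha, lq)  # cost if p switches to alpha
    return D_block
```

Since q's label cannot change during the move, both possible pairwise costs for the edge depend only on p's choice, which is exactly what a unary term expresses.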
4 Fast Optimization
We now explain how to perform fast optimization for the DP-expansion moves defined in Sec. 3.3. Even though most of our moves are performed on image blocks, it is easier to explain how the optimization works assuming it is done on the whole image, avoiding extra sub-indexing notation. The fix needed when performing the moves on image blocks was explained at the end of Section 3.3.

Our fast DP derivation closely follows the tiered labeling speed-ups in [12]. However, our speed-up of DP is simpler since we have a less complex state space. We explain the optimization for the case of top-anchored moves. The other cases are handled by appropriately rotating the image.
For most derivations in this paper, for pixel p = (i, j) it is convenient to denote D_p as D_ij and l_p as l_{i,j}. Also, if p = (i, j) and q = (i + 1, j), that is, a vertical term, then we denote V_pq as V^v_{i,j}. If p = (i, j) and q = (i, j + 1), that is, a horizontal term, we denote V_pq as V^h_{i,j}. Let the image have n rows, indexed from 0 to n − 1, and m columns, indexed from 0 to m − 1; see Fig. 3.

Fig. 3. Site j in state d. Pixels above row d (in green) are labeled α in column j. Pixels below and including row d (in white) in column j retain the old label.

Let the old labeling be l', and suppose we want to find the best top-anchored expansion move. That is, in each column we want to find the best contiguous subset of pixels to switch to α, provided that any switch in column j must begin at row 0, i.e. at pixel (0, j).
Representing a move by its column states s = (s_0, . . . , s_{m−1}), the move energy decomposes as

E(s) = \sum_{j=0}^{m-1} U_j(s_j) + \sum_{j=0}^{m-2} B_j(s_j, s_{j+1}),   (2)
where U_j(s_j) adds up all the unary and pairwise terms of the energy in Equation (1) that are contained entirely in column j, and B_j(s_j, s_{j+1}) adds up all the pairwise terms of the energy in Equation (1) that rely on exactly one pixel from column j and one pixel from column j + 1. Let us define a labeling l^{j,d} as follows. Pixels in columns other than j can have any labels. Pixels in column j and rows 0 to d − 1 have label α. Pixels in column j and rows d to n − 1 have the same label as in the old labeling l'. Then
U_j(s_j) = \sum_{i=0}^{n-1} D_{ij}(l^{j,s_j}_{i,j}) + \sum_{i=0}^{n-2} V^v_{i,j}(l^{j,s_j}_{i,j}, l^{j,s_j}_{i+1,j}),   (3)

and

B_j(s_j, s_{j+1}) = \sum_{i=0}^{n-1} V^h_{i,j}(l^{j,s_j}_{i,j}, l^{j+1,s_{j+1}}_{i,j+1}).   (4)
We also compute the optimum previous state in a look-up table P_j(t). It is used to trace back the optimal sequence of states, starting from the last state and going backwards. That is, the optimal state for the last site is computed as s_{m−1} = argmin_t E_{m−1}(t). After that, each previous optimal state is computed as s_{j−1} = P_j(s_j), where s_j is the optimal state for site j.

For a fixed j, we have to compute E_j(t) for n + 1 possible values of t. For each fixed j and t, we need to search over all previous states t', and there are n + 1 of them. Therefore, computing E_j(t) is O(n) for a fixed j and t, computing E_j(t) for all n + 1 values of t is O(n²), and, finally, computing E_j(t) for all m values of j is O(n²m), which is too expensive. We explain how to speed this up to O(nm), i.e. linear in the number of pixels, in Section 4.2.
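As a reference point, the straightforward chain DP with the look-up tables P_j can be sketched as below (our own code; U and B are assumed precomputed as in Eqs. (2)–(4), and S = n + 1 is the number of column states):

```python
import numpy as np

def chain_dp(U, B):
    """Naive O(S^2 m) dynamic program over column states, with traceback.

    U : (m, S) array, U[j, t] = within-column cost of column j in state t
    B : (m-1, S, S) array, B[j, t, t2] = between-column cost for states
        t of column j and t2 of column j + 1
    Returns the optimal state sequence s[0..m-1].
    """
    m, S = U.shape
    E = np.empty((m, S))             # E[j, t]: best energy of columns 0..j ending in state t
    P = np.empty((m, S), dtype=int)  # P[j, t]: optimum previous state (look-up table)
    E[0] = U[0]
    for j in range(1, m):
        cand = E[j - 1][:, None] + B[j - 1]   # cand[t', t] = E[j-1, t'] + B[j-1, t', t]
        P[j] = np.argmin(cand, axis=0)
        E[j] = U[j] + cand[P[j], np.arange(S)]
    # Traceback from the best final state.
    s = np.empty(m, dtype=int)
    s[-1] = np.argmin(E[-1])
    for j in range(m - 1, 0, -1):
        s[j - 1] = P[j, s[j]]
    return s
```

The inner minimization over t' is the part the next section accelerates to O(1) per (j, t) query.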
We now explain how, for a fixed j and t, to compute E_j(t) in O(1) time instead of the straightforward O(n) time. This will result in a total time complexity of O(nm), i.e. linear in the number of pixels.
The expensive part of the E_j(t) computation is the search over the previous states t'. Let us give this part its own name, F_j(t):

F_j(t) = \min_{t'} [E_{j-1}(t') + B_{j-1}(t', t)].   (7)

We split the minimization over t' into the cases t' ≤ t and t' > t:

F_j(t) = \min\{T_j(t), L_j(t)\},   (8)

where

T_j(t) = \min_{t' \le t} [E_{j-1}(t') + B_{j-1}(t', t)]   (9)

and

L_j(t) = \min_{t' > t} [E_{j-1}(t') + B_{j-1}(t', t)].   (10)
(a) (b)

Fig. 4. Case 1: minimization over t' ≤ t. (a) The state of column j is fixed to t, which means that all pixels above row t in column j take the new label α. All possible choices for t' in column j − 1 are illustrated. We have to find the smallest-cost choice. Below the red line, all four choices have exactly the same labels; therefore the costs differ only in the energy terms that include pixels above the red line. (b) is the same as (a), but now column j is in state n. Still, the costs of the four possible states differ only in the energy terms above the red line, since the labels below the red line are the same.
We are going to minimize these two cases separately, each in O(1) time. Consider first the computation of T_j(t), i.e. minimization over t' ≤ t. In this case, we are assigning α to pixels (0, j), (1, j), . . . , (t − 1, j), and the rest of the pixels in column j retain their old labels. We have to find the best possible state t' ≤ t in the previous column j − 1. That is, in the previous column we can have α in rows 0 through t' − 1, where t' can range from 0 to t. This is illustrated in Figure 4(a).

When considering which t' ∈ {0, . . . , t} is the optimal state to go through in column j − 1, observe that no matter which particular value we choose for t', in both column j − 1 and column j, all the pixels in rows t and below (below the red line in Figure 4(a)) have the same labels (the old labels they had in l'). Therefore all pairwise terms V and unary terms D of the original 2D labeling that depend only on pixels in rows t and below are the same for all choices of t' ∈ {0, . . . , t}. This means that we can find the optimal t' considering only the unary and pairwise terms that involve pixels above row t.

Furthermore, consider Figure 4(b). It illustrates the case when we are computing T_j(n), that is, the jth column has all pixels labeled α. Notice that over t' ≤ t, the best t' is the same for both (a) and (b). This is because, again, the best t' is independent of the labels of pixels in rows below the red line, and above the red line all cases for (a) and (b) are identical. Thus the t' minimizing (a) and (b) is the same, but the values at the minimum t' differ.
Therefore,

T_j(t) = \min_{t' \le t} [E_{j-1}(t') + B_{j-1}(t', t)] = k(t) + \min_{t' \le t} [E_{j-1}(t') + B_{j-1}(t', n)],   (11)

where

k(t) = \sum_{i=t}^{n-1} \left( V^h_{i,j-1}(l'_{i,j-1}, l'_{i,j}) - V^h_{i,j-1}(l'_{i,j-1}, \alpha) \right).

We can compute k(t) in O(1) time using cumulative sums on the columns that store the pairwise costs V^h, in the same way as the B_j terms in Equation (4).
Let us now turn to the efficient computation of the second term in Equation (11). First define a table C of size n, indexed starting at 0, as C(t') = E_{j-1}(t') + B_{j-1}(t', n), and let C' be its running minimum:

C'(i) = \min_{i' \le i} C(i').   (12)

We can compute C' in O(n) time. Initialize C'(0) = C(0). Then update, in order of increasing i, C'(i) = \min\{C(i), C'(i-1)\}. After the table C' is computed, the minimum in Equation (11) can simply be looked up as C'(t). Since we reuse C' for all t in column j, computing T_j(t) via Equation (11) takes O(n) time for all t ∈ {0, 1, . . . , n − 1}.
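The running-minimum table is one line per entry; a minimal sketch (our own code) of Eq. (12) and its O(1) queries:

```python
def prefix_min(C):
    """Running minimum as in Eq. (12): Cp[i] = min_{i' <= i} C[i'].
    One O(n) pass; afterwards every query min_{t' <= t} C[t'] is the
    single lookup Cp[t]."""
    Cp = list(C)
    for i in range(1, len(Cp)):
        Cp[i] = min(Cp[i], Cp[i - 1])
    return Cp
```

The reverse pass needed for case 2 (Eq. (14) below, a suffix minimum) is symmetric: iterate from the end of the array instead.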
(a) (b)

Fig. 5. Case 2: minimization over t' > t. (a) When column j is fixed to state t, all possible choices for t' in column j − 1 are illustrated. Above the red line, all four choices have exactly the same labels; therefore the costs differ only in the energy terms that include pixels below the red line. (b) is the same as (a), but now column j is in state 0. Still, the costs of the four possible states differ only in the energy terms below the red line, since the labels above the red line are the same.
Fig. 6. Results on Middlebury stereo pairs. The left column shows our results, the right one max-flow expansion results. We report the energy and the running time in seconds for each case.
Furthermore, consider Figure 5(b). It illustrates the case when we are computing L_j(0), that is, the jth column retains the old labels for all pixels. Notice that the minimum state t' is the same for (a) and (b) when only t' > t is considered. The minimum values of (a) and (b) differ by a term k'(t) that depends only on t, not t'. Therefore we get

L_j(t) = \min_{t' > t} [E_{j-1}(t') + B_{j-1}(t', t)] = k'(t) + \min_{t' > t} [E_{j-1}(t') + B_{j-1}(t', 0)],   (13)
where

k'(t) = \sum_{i=0}^{t-1} \left( V^h_{i,j-1}(\alpha, \alpha) - V^h_{i,j-1}(\alpha, l'_{i,j}) \right).

We compute k'(t) in O(1) time using cumulative sums, similarly to k(t). Let us define a table H of size n as H(t') = E_{j-1}(t') + B_{j-1}(t', 0). We need to compute a running minimum for the array H, but now in the other direction from what we used for the array C. Define the running minimum H' as:

H'(i) = \min_{i' \ge i} H(i').   (14)
5 Experimental Results
In this section, we evaluate the performance of our DP-expansion algorithm and compare it to max-flow expansion. We do not compare our method to previous work that improves the efficiency of expansion, such as [28–30], since our goal is to introduce a previously unexplored direction for speed-ups, rather than to compete with previous work. In addition, since our approach is orthogonal to [28–30], we can reuse some of their ideas.

Even though we can handle arbitrary pairwise potentials, we evaluate here only the Potts model, namely V_pq(l_p, l_q) = w_pq · min{|l_p − l_q|, 1}. As in previous work [1], w_pq depends on the strength of the gradient between pixels p and q in the observed image. The stronger the gradient, the smaller w_pq is.
We evaluate our performance on the application of stereo correspondence, on the set of ten Middlebury stereo pairs [33, 34]. For the data term, we use a simple truncated absolute value difference between the pixel in the left image and the pixel in the right image shifted according to the disparity label. That is, if L is the left image and R is the right image, D_p(l_p) = min{T, |L(p) − R(p − l_p)|}. We used T = 20. Since our goal is to evaluate our DP-expansion moves as an optimization method, we did not tune parameters for the best stereo correspondence performance. We ran the expansion algorithm for two iterations, since it has been observed that the energy decreases most during the first two iterations. We ran our algorithm for box sizes from 1/10 of the image to the largest image dimension, increasing them by a factor of 2 and making them overlap by half of their side length.
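For illustration, the data and smoothness terms just described can be built as follows (our own sketch: the gradient-dependent weighting w_pq is not specified in the text, so it is left as an input, and the border handling for out-of-range disparities is our assumption):

```python
import numpy as np

def stereo_terms(L, R, num_disp, T=20.0):
    """Truncated absolute-difference data term from the text:
    D_p(d) = min(T, |L(p) - R(p - d)|).

    L, R     : (n, m) grayscale left/right images (float)
    num_disp : number of disparity labels
    Returns D of shape (n, m, num_disp).
    """
    n, m = L.shape
    D = np.empty((n, m, num_disp))
    for d in range(num_disp):
        # Shift the right image by disparity d; pad out-of-range columns
        # with the border column (an arbitrary but common choice).
        shifted = np.empty_like(R)
        shifted[:, d:] = R[:, :m - d]
        shifted[:, :d] = R[:, :1]
        D[:, :, d] = np.minimum(T, np.abs(L - shifted))
    return D

def potts(lp, lq, w_pq):
    """Potts smoothness term V_pq = w_pq * min(|lp - lq|, 1)."""
    return w_pq * min(abs(lp - lq), 1)
```
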
It has been shown [1] that the first two iterations decrease the energy the most for max-flow expansion. Thus, to make the comparison fairer to max-flow expansion, we run it for two iterations. On average, max-flow expansion finds an energy 1.3% lower, where we measured the energy difference between DP-expansion and max-flow expansion relative to the energy computed by max-flow expansion. The largest relative energy difference was 3%, and in some cases DP-expansion returned a slightly lower energy; this is only because max-flow expansion was not run until convergence. Our approach was, on average, 1.8 times faster. Fig. 6 illustrates some of the results.
6 Discussion
We have presented DP-expansion moves that work by finding a subset of pixels with a restricted structure to switch to α. The restriction allows implementing them very efficiently, in linear time. There are many other aspects for future research. First, there are many variants of such moves to explore. For example, it is easy to extend our moves to the case where the anchoring does not necessarily happen at the same row for each column. Right now, the anchoring is done in row 0 (or at the left, right, or bottom of the picture). It is easy to see that as long as one of the end points is fixed, not necessarily to the same value for each column, we can compute the move with linear-time efficiency. These moves can be used for taking a low-quality but cheap solution, exploring the connected sets of pixels assigned to the same label, and improving their boundaries one side at a time, leaving the other sides fixed. It is also easy to parallelize our algorithm, expanding on a non-overlapping set of boxes at the same time. Most of the boxes are small to medium in size, and it is easy to come up with non-overlapping subsets of them. Finally, just as there are several speed-ups to the basic max-flow expansion suggested in the literature [29, 30], based on clever heuristics, multiple speed-ups are possible for our approach too.
References
1. Boykov, Y., Veksler, O., Zabih, R.: Efficient approximate energy minimization via graph cuts. PAMI 26, 1222–1239 (2001)
2. Wills, J., Agarwal, S., Belongie, S.: What went where. In: CVPR, vol. I, pp. 37–44 (2003)
3. Agarwala, A., Dontcheva, M., Agrawala, M., Drucker, S., Colburn, A., Curless, B., Salesin, D., Cohen, M.: Interactive digital photomontage. ACM Transactions on Graphics 23, 294–302 (2004)
4. Kwatra, V., Schödl, A., Essa, I., Turk, G., Bobick, A.: Graphcut textures: Image and video synthesis using graph cuts. SIGGRAPH 2003 22, 277–286 (2003)
5. Hoiem, D., Rother, C., Winn, J.: 3D layout CRF for multi-view object class recognition and segmentation. In: CVPR, pp. 1–8 (2007)
6. Delong, A., Osokin, A., Isack, H.N., Boykov, Y.: Fast approximate energy minimization with label costs. IJCV 96, 1–27 (2012)
7. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. PAMI 23, 1222–1239 (2001)
8. Greig, D., Porteous, B., Seheult, A.: Exact maximum a posteriori estimation for binary images. Journal of the Royal Statistical Society, Series B 51, 271–279 (1989)
9. Ishikawa, H.: Exact optimization for Markov random fields with convex priors. PAMI 25, 1333–1336 (2003)
10. Schlesinger, D., Flach, B.: Transforming an arbitrary minsum problem into a binary one. Technical Report TUD-FI06-01, Dresden University of Technology (2006)
11. Kolmogorov, V., Rother, C.: Minimizing nonsubmodular functions with graph cuts – a review. PAMI 29, 1274–1279 (2007)
12. Felzenszwalb, P.F., Veksler, O.: Tiered scene labeling with dynamic programming. In: CVPR (2010)
13. Ford, L., Fulkerson, D.: Flows in Networks. Princeton University Press (1962)
14. Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. PAMI 26, 1124–1137 (2004)
15. Amini, A., Weymouth, T., Jain, R.: Using dynamic programming for solving variational problems in vision. PAMI 12, 855–867 (1990)
16. Szeliski, R., Zabih, R., Scharstein, D., Veksler, O., Kolmogorov, V., Agarwala, A., Tappen, M.F., Rother, C.: A comparative study of energy minimization methods for Markov random fields with smoothness-based priors. PAMI 30, 1068–1080 (2008)
17. Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann (1988)
18. Kolmogorov, V.: Convergent tree-reweighted message passing for energy minimization. PAMI 28, 1568–1583 (2006)
19. Lempitsky, V.S., Rother, C., Roth, S., Blake, A.: Fusion moves for Markov random field optimization. PAMI 32, 1392–1405 (2010)
20. Veksler, O.: Graph cut based optimization for MRFs with truncated convex priors. In: CVPR, pp. 1–8 (2007)
21. Kumar, M.P., Torr, P.H.S.: Improved moves for truncated convex models. In: Koller, D., Schuurmans, D., Bengio, Y., Bottou, L. (eds.) NIPS, pp. 889–896 (2009)
22. Gould, S., Amat, F., Koller, D.: Alphabet soup: A framework for approximate energy minimization. In: CVPR, pp. 903–910 (2009)
23. Carr, P., Hartley, R.: Solving multilabel graph cut problems with multilabel swap. In: DICTA (2009)
24. Vineet, V., Warrell, J., Torr, P.: A tiered move-making algorithm for general pairwise MRFs. In: CVPR (2012)
25. Liu, X., Veksler, O., Samarabandu, J.: Order preserving moves for graph cut based optimization. PAMI (2010)
26. Bai, J., Song, Q., Veksler, O., Wu, X.: Fast dynamic programming for labeling problems with ordering constraints. In: CVPR (2012)
27. Asano, T., Chen, D., Katoh, N., Tokuyama, T.: Efficient algorithms for optimization-based image segmentation. IJCGA 11, 145–166 (2001)
28. Komodakis, N., Tziritas, G., Paragios, N.: Fast, approximately optimal solutions for single and dynamic MRFs. In: CVPR (2007)
29. Alahari, K., Kohli, P., Torr, P.H.S.: Reduce, reuse & recycle: Efficiently solving multi-label MRFs. In: CVPR (2008)
30. Batra, D., Kohli, P.: Making the right moves: Guiding alpha-expansion using local primal-dual gaps. In: CVPR, pp. 1865–1872 (2011)
31. Kolmogorov, V., Zabih, R.: What energy functions can be minimized via graph cuts? PAMI 26, 147–159 (2004)
32. Viola, P.A., Jones, M.J.: Rapid object detection using a boosted cascade of simple features. In: CVPR (1), pp. 511–518 (2001)
33. Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. IJCV 47, 7–42 (2001)
34. Scharstein, D., Szeliski, R.: High-accuracy stereo depth maps using structured light. In: CVPR, pp. 195–202 (2003)
Real-Time Compressive Tracking
1 Introduction
Although numerous algorithms have been proposed in the literature, object tracking remains a challenging problem due to appearance change caused by pose, illumination, occlusion, and motion, among other factors. An effective appearance model is of prime importance for the success of a tracking algorithm, and this topic has attracted much attention in recent years [1–10]. Tracking algorithms can generally be categorized as either generative [1, 2, 6, 9, 10] or discriminative [3–5, 7, 8] based on their appearance models.

Generative tracking algorithms typically learn a model to represent the target object and then use it to search for the image region with minimal reconstruction error. Black et al. [1] learn an off-line subspace model to represent the object of interest for tracking. The IVT method [6] utilizes an incremental subspace model
A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 864–877, 2012.
© Springer-Verlag Berlin Heidelberg 2012
Fig. 1. Main components of our compressive tracking algorithm. [Diagram: samples from Frame(t) are convolved with a multiscale filter bank to produce multiscale image features, which a sparse measurement matrix compresses into low-dimensional feature vectors fed to the classifier.]
allow almost perfect reconstruction of the signal if the signal is compressible, such as natural images or audio [12–14]. We use a very sparse measurement matrix that satisfies the restricted isometry property (RIP) [15], thereby facilitating efficient projection from the image feature space to a low-dimensional compressed subspace. For tracking, the positive and negative samples are projected (i.e., compressed) with the same sparse measurement matrix and discriminated by a simple naive Bayes classifier learned online. The proposed compressive tracking algorithm runs in real time and performs favorably against state-of-the-art trackers on challenging sequences in terms of efficiency, accuracy and robustness.
2 Preliminaries
We present some preliminaries of compressive sensing which are used in the
proposed tracking algorithm.
v = Rx, (1)
holds true for the restricted isometry property in compressive sensing. Therefore, if the random matrix R in (1) satisfies the Johnson-Lindenstrauss lemma, we can reconstruct x with minimum error from v with high probability if x is compressible, such as audio or images; that is, v preserves almost all the information in x. This strong theoretical support motivates us to analyze high-dimensional signals via their low-dimensional random projections. In the proposed algorithm, we use a very sparse matrix that not only satisfies the Johnson-Lindenstrauss lemma, but can also be efficiently computed for real-time tracking.
Achlioptas [16] proved that this type of matrix with s = 2 or 3 satisfies the Johnson-Lindenstrauss lemma. Such a matrix is very easy to compute, requiring only a uniform random generator. More importantly, when s = 3 it is very sparse, and two thirds of the computation can be avoided. In addition, Li et al. [19] showed that for s = O(m) (x ∈ R^m), this matrix is asymptotically normal. Even when s = m/log(m), the random projections are almost as accurate as the conventional random projections where r_ij ~ N(0, 1). In this work, we set s = m/4, which makes a very sparse random matrix. For each row of R, only about c, c ≤ 4, entries need to be computed. Therefore, the computational complexity is only O(cn), which is very low. Furthermore, we only need to store the nonzero entries of R, which makes the memory requirement very light as well.
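A sketch of generating and applying such a matrix (our own code, following the Achlioptas/Li construction the text cites; the scaling and the per-row storage scheme are our assumptions):

```python
import numpy as np

def sparse_measurement_matrix(n, m, s=None, rng=None):
    """Very sparse random matrix: each entry is +sqrt(s) or -sqrt(s) with
    probability 1/(2s) each, and 0 otherwise.  With s = m/4, each row has
    about 4 nonzero entries on average, so we store only those.

    Returns a list of n rows, each a (cols, values) pair of arrays.
    """
    rng = rng or np.random.default_rng(0)
    s = s or max(2, m // 4)
    scale = np.sqrt(s)
    rows = []
    for _ in range(n):
        u = rng.random(m)
        cols = np.flatnonzero(u < 1.0 / s)  # nonzero positions in this row
        values = np.where(rng.random(cols.size) < 0.5, scale, -scale)
        rows.append((cols, values))
    return rows

def project(rows, x):
    """Compute v = Rx in O(cn) using only the stored nonzero entries."""
    return np.array([values @ x[cols] for cols, values in rows])
```

Only the index/value pairs are ever touched, which is what keeps both the projection cost and the memory footprint low.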
3 Proposed Algorithm
In this section, we present our tracking algorithm in detail. The tracking problem is formulated as a detection task, and our algorithm is illustrated in Figure 1. We assume that the tracking window in the first frame has been determined. At each frame, we sample some positive samples near the current target location and negative samples far away from the object center to update the classifier. To predict the object location in the next frame, we draw some samples around the current target location and select the one with the maximal classification score.
868 K. Zhang, L. Zhang, and M.-H. Yang
[Figure: the measurement matrix R ∈ R^{n×m} compresses a high-dimensional vector x ∈ R^m to a low-dimensional vector v ∈ R^n, with v_i = Σ_j r_ij x_j.]
Haar-like features, which makes the computational load very heavy. This problem is alleviated by boosting algorithms that select important features [20, 21]. Recently, Babenko et al. [8] adopted generalized Haar-like features, where each one is a linear combination of randomly generated rectangle features, and used online boosting to select a small set of them for object tracking. In our work, the large set of Haar-like features is compressively sensed with a very sparse measurement matrix. Compressive sensing theory ensures that the extracted features of our algorithm preserve almost all the information of the original image. Therefore, we can classify the projected features in the compressed domain efficiently without the curse of dimensionality.
where we assume a uniform prior, p(y = 1) = p(y = 0), and y ∈ {0, 1} is a binary variable representing the sample label.
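The classifier response can be sketched as follows (our own code: a naive Bayes log-likelihood ratio with Gaussian class-conditionals and equal priors, per the assumptions above):

```python
import numpy as np

def classify(v, mu1, sig1, mu0, sig0):
    """H(v) = sum_i log( p(v_i|y=1) / p(v_i|y=0) ), assuming each
    p(v_i|y) is Gaussian and the priors are equal, so they cancel.

    v, mu1, sig1, mu0, sig0 : arrays of length n (one entry per feature).
    """
    def log_gauss(v, mu, sig):
        sig = np.maximum(sig, 1e-6)  # guard against degenerate variance
        return -0.5 * np.log(2 * np.pi * sig**2) - (v - mu)**2 / (2 * sig**2)
    return float(np.sum(log_gauss(v, mu1, sig1) - log_gauss(v, mu0, sig0)))
```

At tracking time, the candidate patch with the maximal H(v) is selected as the new target location.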
Diaconis and Freedman [23] showed that random projections of high-dimensional random vectors are almost always Gaussian. Thus, the conditional distributions p(v_i|y = 1) and p(v_i|y = 0) in the classifier H(v) are assumed to
be Gaussian with four parameters (μ_i^1, σ_i^1, μ_i^0, σ_i^0), where

μ_i^1 ← λ μ_i^1 + (1 − λ) μ^1
σ_i^1 ← \sqrt{λ (σ_i^1)^2 + (1 − λ)(σ^1)^2 + λ(1 − λ)(μ_i^1 − μ^1)^2},   (6)

where λ > 0 is a learning parameter, σ^1 = \sqrt{(1/n) \sum_{k=0|y=1}^{n-1} (v_i(k) − μ^1)^2} and μ^1 = (1/n) \sum_{k=0|y=1}^{n-1} v_i(k). The above equations can be easily derived by maximum likelihood estimation. Figure 3 shows the probability distributions of three different features of the positive and negative samples, cropped from a few frames of a sequence for clarity of presentation. It shows that a Gaussian distribution with the online update in (6) is a good approximation of the features in the projected space. The main steps of our algorithm are summarized in Algorithm 1.
1. Sample a set of image patches D^γ = {z : ||l(z) − l_{t−1}|| < γ}, where l_{t−1} is the tracking location at the (t−1)-th frame, and extract the low-dimensional features.
2. Apply the classifier H in (4) to each feature vector v(z) and find the tracking location l_t with the maximal classifier response.
3. Sample two sets of image patches D^α = {z : ||l(z) − l_t|| < α} and D^{ζ,β} = {z : ζ < ||l(z) − l_t|| < β}, with α < ζ < β.
4. Extract the features from these two sets of samples and update the classifier parameters according to (6).
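Step 4's parameter update (Eq. (6)) can be sketched as follows (our own code; λ is the learning parameter, the sample mean and standard deviation are the maximum likelihood estimates mentioned in the text, and the default λ = 0.85 is our assumption):

```python
import numpy as np

def update_gaussian(mu_i, sigma_i, samples, lam=0.85):
    """Online update of one feature's class-conditional Gaussian:
       mu_i    <- lam*mu_i + (1-lam)*mu
       sigma_i <- sqrt( lam*sigma_i^2 + (1-lam)*sigma^2
                        + lam*(1-lam)*(mu_i - mu)^2 )
    where mu, sigma are estimated from the newly extracted samples."""
    mu = samples.mean()
    sigma = samples.std()  # ML estimate (divides by n)
    new_mu = lam * mu_i + (1.0 - lam) * mu
    new_sigma = np.sqrt(lam * sigma_i**2 + (1.0 - lam) * sigma**2
                        + lam * (1.0 - lam) * (mu_i - mu)**2)
    return new_mu, new_sigma
```

The same update is applied per feature to both the positive (y = 1) and negative (y = 0) parameter pairs.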
3.4 Discussion
We note that simplicity is the prime characteristic of our algorithm: the proposed sparse measurement matrix R is independent of any training samples, resulting in a very efficient method. In addition, our algorithm achieves robust performance, as discussed below.
Difference from Related Work. It should be noted that our algorithm is different from the recently proposed ℓ1-tracker [10] and the compressive sensing tracker [9]. First, both algorithms are generative models that encode an object sample by sparse representation of templates using ℓ1-minimization. Thus the training samples cropped from the previous frames are stored and updated, but this is not required in our algorithm due to the use of a data-independent measurement matrix. Second, our algorithm extracts linear combinations of generalized Haar-like features, whereas these trackers [9, 10] use holistic templates for sparse representation, which are less robust, as demonstrated in our experiments. In [9], an orthogonal matching pursuit algorithm is applied to solve the ℓ1-minimization problems. Third, both of these tracking algorithms [9, 10] need to solve numerous time-consuming ℓ1-minimization problems, but our algorithm is efficient as only matrix multiplications are required.
Random Projection vs. Principal Component Analysis. For visual track-
ing, dimensionality reduction algorithms such as principal component analysis
and its variations have been widely used in generative tracking methods [1, 6]. These methods need to update the appearance model frequently for robust tracking. However, they are sensitive to occlusion due to their holistic representation schemes. Furthermore, it is not clear whether the appearance models can be updated correctly with new observations (e.g., without alignment errors, to avoid tracking drift). In contrast, our algorithm does not suffer from the problems of online self-taught learning approaches [24], as the proposed model with the measurement matrix is data-independent. It has been shown that for image and text applications, favorable results can be achieved by methods using random projections rather than principal component analysis [25].
Robustness to Ambiguity in Detection. Tracking-by-detection methods often encounter inherent ambiguity problems, as shown in Figure 4. Recently, Babenko et al. [8] introduced a multiple instance learning scheme to alleviate the tracking ambiguity problem. Our algorithm is also robust to the ambiguity problem, as illustrated in Figure 4. While the target appearance changes over time, the most correct positive samples (e.g., the sample in the red rectangle in Figure 4) are similar in most frames. However, the less correct positive samples (e.g., samples in the yellow rectangles of Figure 4) are much more different, as they contain some background information, which varies much more than the content within the target object. Thus, the distributions of the features extracted from the most correct positive samples are more concentrated than those from the less correct positive samples. This in turn makes the features from the most correct positive samples much more stable than those from the less correct positive samples (e.g., in the bottom row of Figure 4, the features denoted by red markers are more stable than those denoted by yellow markers). Thus, our algorithm is able to select the most correct positive sample because its probability is larger than
those of the less correct positive samples (see the markers in Figure 4). In addition, our measurement matrix is data-independent, and no noise is introduced by mis-aligned samples.
Robustness to Occlusion. Each feature in our algorithm is spatially localized (Figure 2), which makes it less sensitive to occlusion than holistic representations. Similar representations, e.g., local binary patterns [26] and generalized Haar-like features [8], have been shown to be more effective in handling occlusion. Moreover, features are randomly sampled at multiple scales by our algorithm, in a way similar to [27, 8], which have demonstrated robust results for dealing with occlusion.
Dimensionality of Projected Space. Assume there exist d input points in R^m. Given 0 < ε < 1 and β > 0, and letting R ∈ R^{n×m} be a random matrix projecting data from R^m to R^n, the theoretical bound for the dimension n that satisfies the Johnson-Lindenstrauss lemma is n ≥ (4 + 2β)/(ε²/2 − ε³/3) · ln(d) [16]. In practice, Bingham and Mannila [25] pointed out that this bound is much higher than what suffices to achieve good results on image and text data. In their applications, the lower bound for n when ε = 0.2 is 1600, but n = 50 is sufficient to generate good results. In our experiments, with 100 samples (i.e., d = 100), ε = 0.2 and β = 1, the lower bound for n is approximately 1600. Another bound, derived from the restricted isometry property in compressive sensing [15], is much tighter than that from the Johnson-Lindenstrauss lemma, where n ≥ μ log(m/δ) and μ and δ are constants. For m = 10^6, δ = 1, and μ = 10, it is expected that n ≈ 50. We find that good results can be obtained with n = 50 in our experiments.
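The quoted figure of roughly 1600 can be checked directly from the Johnson-Lindenstrauss bound of [16] (a quick sanity check in our own code):

```python
import math

def jl_lower_bound(d, eps, beta):
    """Lower bound on the projected dimension n from [16]:
    n >= (4 + 2*beta) / (eps**2 / 2 - eps**3 / 3) * ln(d)."""
    return (4 + 2 * beta) / (eps**2 / 2 - eps**3 / 3) * math.log(d)
```

With d = 100, eps = 0.2, and beta = 1 this evaluates to roughly 1.6 × 10³, matching the "approximately 1600" above, while the much smaller n = 50 suffices empirically.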
4 Experiments
We evaluate our tracking algorithm against 7 state-of-the-art methods on 20 chal-
lenging sequences, among which 16 are publicly available and 4 are our own.
The Animal, Shaking, and Soccer sequences are provided in [28], and the Box
and Jumping sequences are from [29]. The 7 trackers we compare with are the
fragment tracker (Frag) [30], the online AdaBoost method (OAB) [5], the semi-
supervised tracker (SemiB) [7], the MILTrack algorithm [8], the ℓ1-tracker [10],
the TLD tracker [11], and the Struck method [31]. We note that the source code
of [9] is not available for evaluation, and its implementation requires technical
details and parameters not discussed therein. It is worth noting that we use
the most challenging sequences from the existing works. For fair comparison, we
use the source or binary codes provided by the authors and either adopt the
tuned parameters from the source codes or empirically set them for best
performance. Since all of the trackers except Frag involve randomness, we run
them 10 times and report the average result for each video clip. Our tracker is implemented in
MATLAB, which runs at 35 frames per second (FPS) on a Pentium Dual-Core
2.80 GHz CPU with 4 GB RAM. The source codes and datasets are available at
http://www4.comp.polyu.edu.hk/~cslzhang/CT/CT.htm.
Real-Time Compressive Tracking 873
Table 1. Success rate (SR) (%). Bold fonts indicate the best performance, while
italic fonts indicate the second best. The total number of evaluated frames is 8270.
Table 2. Center location error (CLE) (in pixels) and average frames per second (FPS).
Bold fonts indicate the best performance, while italic fonts indicate the second best.
The total number of evaluated frames is 8270.
Scale, Pose and Illumination Change. For the David indoor sequence shown
in Figure 5(a), the illumination and pose of the object both change gradually.
The MILTrack, TLD and Struck methods perform well on this sequence. For
the Shaking sequence shown in Figure 5(b), when the stage light changes dras-
tically and the pose of the subject changes rapidly as he performs, all the other
trackers fail to track the object reliably. The proposed tracker is robust to pose
and illumination changes because object appearance can be modeled well by random
projections (based on the Johnson-Lindenstrauss lemma) and a classifier with
online update is used to separate foreground and background samples. Generative
subspace trackers (e.g., IVT [6]) have been shown to be effective in dealing with
large illumination changes, while discriminative tracking methods with local
features (e.g., MILTrack [8]) have been demonstrated to handle pose variation
adequately; the proposed tracker combines both merits. Furthermore, the features
we use are similar to generalized Haar-like features, which have been shown to be
robust to scale and orientation changes [8], as illustrated in the David indoor
sequence. In addition, our tracker performs well on the Sylvester and Panda
sequences, in which the target objects undergo significant pose changes (see the
supplementary material for details).
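The classifier with online update mentioned above can be illustrated with a Gaussian naive Bayes model over the compressed features. This is a minimal sketch under assumed conventions: the running-average update with learning rate `lam`, and all names, are illustrative, not this paper's exact formulation.

```python
import numpy as np

class OnlineGaussianNB:
    """Online naive Bayes over compressed features: each feature is
    modeled as a Gaussian per class (0 = background, 1 = target),
    and the class-conditional parameters are updated with a running
    average controlled by a learning rate lam (an assumption modeled
    on common online trackers)."""

    def __init__(self, n_features, lam=0.85):
        self.lam = lam
        self.mu = np.zeros((2, n_features))
        self.sigma = np.ones((2, n_features))

    def update(self, v, y):
        """v: (k, n) array of compressed samples, y: class label 0 or 1."""
        mu_new, sig_new = v.mean(axis=0), v.std(axis=0) + 1e-6
        lam = self.lam
        # running update of the per-class Gaussian parameters
        self.sigma[y] = np.sqrt(lam * self.sigma[y]**2
                                + (1 - lam) * sig_new**2
                                + lam * (1 - lam) * (self.mu[y] - mu_new)**2)
        self.mu[y] = lam * self.mu[y] + (1 - lam) * mu_new

    def score(self, v):
        """Sum of log likelihood ratios; positive favors the target."""
        def logpdf(c):
            return (-0.5 * ((v - self.mu[c]) / self.sigma[c])**2
                    - np.log(self.sigma[c])).sum(axis=-1)
        return logpdf(1) - logpdf(0)
```

During tracking, `score` would be evaluated on candidate windows around the previous location and the maximum taken as the new target position, with `update` then applied to freshly drawn positive and negative samples.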
Occlusion and Pose Variation. The target object in the Occluded face 2 sequence
in Figure 5(c) undergoes large pose variation and heavy occlusion. Only the MIL-
Track and Struck methods as well as the proposed algorithm perform well on
this sequence. In addition, our tracker achieves the best performance in terms
of success rate, center location error, and frame rate. The target player in the
Soccer sequence is heavily occluded by other players many times as he holds up
the trophy as shown in Figure 5(d). In some frames, the object is almost fully
occluded (e.g., #120). Moreover, the object undergoes drastic motion blur and
illumination change (#70 and #300). All the other trackers lose track of the
target in numerous frames. Due to the drastic scene changes, online appearance
models are unlikely to adapt quickly and correctly. Our tracker can handle
occlusion and pose variations well because its appearance model is discriminatively
learned from the target and background with a data-independent measurement
matrix, thereby alleviating the influence of the background (see also Figure 4).
Furthermore, our tracker performs well for objects with non-rigid pose variation
and camera view change in the Bolt sequence (Figure 5(j)) because the appear-
ance model of our tracker is based on local features which are insensitive to
non-rigid shape deformation.
Out-of-Plane Rotation and Abrupt Motion. The object in the Kitesurf
sequence (Figure 5(e)) undergoes acrobatic movements with 360-degree out-of-
plane rotation. Only the MILTrack, SemiB, and the proposed trackers perform
well on this sequence. Both our tracker and the MILTrack method are designed
to handle object location ambiguity in tracking with classifiers and discrimina-
tive features. The object in the Animal sequence (Figure 5(f)) exhibits abrupt
motion. Both the Struck and the proposed methods perform well on this
sequence. However, when out-of-plane rotation and abrupt motion occur together,
as in the Biker and Tiger 2 sequences (Figure 5(g),(h)), all the other algorithms
fail to track the target objects well. Our tracker outperforms the other methods
in all the metrics (accuracy, success rate, and speed). The Kitesurf, Skiing,
Twinings, Girl, and Tiger 1 sequences all contain out-of-plane rotation, while the
Jumping and Box sequences include abrupt motion. Similarly, our tracker performs
well in terms of all metrics.
Background Clutters. The object in the Cliff bar sequence changes in scale
and orientation, and the surrounding background has similar texture. As the
ℓ1-tracker uses a generative appearance model that does not take background
information into account, it has difficulty keeping track of the object correctly.
The object in the Coupon book sequence undergoes significant appearance change
at the 60th frame, after which another coupon book appears. Both the Frag and
SemiB methods are distracted and track the other coupon book (#230 in Fig-
ure 5(i)), while our tracker successfully tracks the correct one. Because the TLD
tracker relies heavily on the visual information in the first frame to re-detect the
object, it also suffers from the same problem. Our algorithm is able to track the
correct objects accurately in these two sequences because it extracts discriminative
features for the most correct positive sample (i.e., the target object) online
(see Figure 4) with classifier update for foreground/background separation.
5 Concluding Remarks
In this paper, we proposed a simple yet robust tracking algorithm with an ap-
pearance model based on non-adaptive random projections that preserve the
structure of the original image space. A very sparse measurement matrix was
adopted to efficiently compress features from the foreground target and the
background. The tracking task was formulated as a binary classification problem
with online update in the compressed domain. Our algorithm combines the merits
of generative and discriminative appearance models to account for scene changes.
Numerous experiments comparing with state-of-the-art algorithms on challenging
sequences demonstrate that the proposed algorithm performs well in terms of
accuracy, robustness, and speed.
References
1. Black, M., Jepson, A.: Eigentracking: Robust matching and tracking of articulated
objects using a view-based representation. IJCV 38, 63–84 (1998)
2. Jepson, A., Fleet, D., El-Maraghi, T.: Robust online appearance models for visual
tracking. PAMI 25, 1296–1311 (2003)
3. Avidan, S.: Support vector tracking. PAMI 26, 1064–1072 (2004)
4. Collins, R., Liu, Y., Leordeanu, M.: Online selection of discriminative tracking
features. PAMI 27, 1631–1643 (2005)
5. Grabner, H., Grabner, M., Bischof, H.: Real-time tracking via online boosting. In:
BMVC, pp. 47–56 (2006)
6. Ross, D., Lim, J., Lin, R., Yang, M.-H.: Incremental learning for robust visual
tracking. IJCV 77, 125–141 (2008)
7. Grabner, H., Leistner, C., Bischof, H.: Semi-supervised On-Line Boosting for Ro-
bust Tracking. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I.
LNCS, vol. 5302, pp. 234–247. Springer, Heidelberg (2008)
8. Babenko, B., Yang, M.-H., Belongie, S.: Robust object tracking with online multi-
ple instance learning. PAMI 33, 1619–1632 (2011)
9. Li, H., Shen, C., Shi, Q.: Real-time visual tracking using compressive sensing. In:
CVPR, pp. 1305–1312 (2011)
10. Mei, X., Ling, H.: Robust visual tracking and vehicle classification via sparse rep-
resentation. PAMI 33, 2259–2272 (2011)
11. Kalal, Z., Matas, J., Mikolajczyk, K.: P-N learning: Bootstrapping binary classifiers
by structural constraints. In: CVPR, pp. 49–56 (2010)
12. Donoho, D.: Compressed sensing. IEEE Trans. Inform. Theory 52, 1289–1306
(2006)
13. Candes, E., Tao, T.: Near-optimal signal recovery from random projections:
Universal encoding strategies? IEEE Trans. Inform. Theory 52, 5406–5425 (2006)
14. Wright, J., Yang, A., Ganesh, A., Sastry, S., Ma, Y.: Robust face recognition via
sparse representation. PAMI 31, 210–227 (2009)
15. Candes, E., Tao, T.: Decoding by linear programming. IEEE Trans. Inform. The-
ory 51, 4203–4215 (2005)
16. Achlioptas, D.: Database-friendly random projections: Johnson-Lindenstrauss with
binary coins. J. Comput. Syst. Sci. 66, 671–687 (2003)
17. Baraniuk, R., Davenport, M., DeVore, R., Wakin, M.: A simple proof of the re-
stricted isometry property for random matrices. Constr. Approx. 28, 253–263 (2008)
18. Liu, L., Fieguth, P.: Texture classification from random features. PAMI 34, 574–586
(2012)
19. Li, P., Hastie, T., Church, K.: Very sparse random projections. In: KDD, pp. 287–296
(2006)
20. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple
features. In: CVPR, pp. 511–518 (2001)
21. Li, S., Zhang, Z.: FloatBoost learning and statistical face detection. PAMI 26, 112
(2004)
22. Ng, A., Jordan, M.: On discriminative vs. generative classifiers: A comparison of
logistic regression and naive Bayes. In: NIPS, pp. 841–848 (2002)
23. Diaconis, P., Freedman, D.: Asymptotics of graphical projection pursuit. Ann.
Stat. 12, 228–235 (1984)
24. Raina, R., Battle, A., Lee, H., Packer, B., Ng, A.Y.: Self-taught learning: Transfer
learning from unlabeled data. In: ICML (2007)
25. Bingham, E., Mannila, H.: Random projection in dimensionality reduction: Appli-
cations to image and text data. In: KDD, pp. 245–250 (2001)
26. Ahonen, T., Hadid, A., Pietikainen, M.: Face description with local binary patterns:
Application to face recognition. PAMI 28, 2037–2041 (2006)
27. Leonardis, A., Bischof, H.: Robust recognition using eigenimages. CVIU 78, 99–118
(2000)
28. Kwon, J., Lee, K.: Visual tracking decomposition. In: CVPR, pp. 1269–1276 (2010)
29. Santner, J., Leistner, C., Saffari, A., Pock, T., Bischof, H.: PROST: Parallel Robust
Online Simple Tracking. In: CVPR (2010)
30. Adam, A., Rivlin, E., Shimshoni, I.: Robust fragments-based tracking using the
integral histogram. In: CVPR, pp. 798–805 (2006)
31. Hare, S., Saffari, A., Torr, P.: Struck: Structured output tracking with kernels. In:
ICCV, pp. 263–270 (2011)