
Lecture Notes in Computer Science 7574

Commenced Publication in 1973


Founding and Former Series Editors:
Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Alfred Kobsa
University of California, Irvine, CA, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
TU Dortmund University, Germany
Madhu Sudan
Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max Planck Institute for Informatics, Saarbruecken, Germany
Andrew Fitzgibbon Svetlana Lazebnik
Pietro Perona Yoichi Sato
Cordelia Schmid (Eds.)

Computer Vision
ECCV 2012
12th European Conference on Computer Vision
Florence, Italy, October 7-13, 2012
Proceedings, Part III

Volume Editors

Andrew Fitzgibbon
Microsoft Research Ltd., Cambridge, CB3 0FB, UK
E-mail: awf@microsoft.com

Svetlana Lazebnik
University of North Carolina, Dept. of Computer Science
Chapel Hill, NC 27599, USA
E-mail: lazebnik@cs.unc.edu

Pietro Perona
California Institute of Technology
Pasadena, CA 91125, USA
E-mail: perona@caltech.edu

Yoichi Sato
The University of Tokyo, Institute of Industrial Science
Tokyo 153-8505, Japan
E-mail: ysato@iis.u-tokyo.ac.jp

Cordelia Schmid
INRIA, 38330 Montbonnot, France
E-mail: cordelia.schmid@inria.fr

ISSN 0302-9743 e-ISSN 1611-3349


ISBN 978-3-642-33711-6 e-ISBN 978-3-642-33712-3
DOI 10.1007/978-3-642-33712-3
Springer Heidelberg Dordrecht London New York

Library of Congress Control Number: 2012947663

CR Subject Classification (1998): I.4.6, I.4.8, I.4.1-5, I.4.9, I.5.2-4, I.2.10, I.3.5, F.2.2

LNCS Sublibrary: SL 6 - Image Processing, Computer Vision, Pattern Recognition, and Graphics

© Springer-Verlag Berlin Heidelberg 2012

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer. Violations are liable
to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply,
even in the absence of a specific statement, that such names are exempt from the relevant protective laws
and regulations and therefore free for general use.
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Foreword

The European Conference on Computer Vision is one of the top conferences for
researchers in this field and is held biennially in alternation with the International
Conference on Computer Vision. It was first held in 1990 in Antibes (France) with
subsequent conferences in Santa Margherita Ligure (Italy) in 1992, Stockholm
(Sweden) in 1994, Cambridge (UK) in 1996, Freiburg (Germany) in 1998, Dublin
(Ireland) in 2000, Copenhagen (Denmark) in 2002, Prague (Czech Republic) in
2004, Graz (Austria) in 2006, Marseille (France) in 2008, and Heraklion (Greece)
in 2010. To our great delight, the 12th conference was held in Florence, Italy.
ECCV has an established tradition of very high scientific quality and an
overall duration of one week. ECCV 2012 began with a keynote lecture from the
honorary chair, Tomaso Poggio. The main conference followed over four days
with 40 orals, 368 posters, 22 demos, and 12 industrial exhibits. There were also
9 tutorials and 21 workshops held before and after the main event. For this event
we introduced some novelties. These included innovations in the review policy,
the publication of a conference booklet with all paper abstracts and the full video
recording of oral presentations.
This conference is the result of a great deal of hard work by many people,
who have been working enthusiastically since our first meetings in 2008. We are
particularly grateful to the Program Chairs, who handled the review of about
1500 submissions and co-ordinated the efforts of over 50 area chairs and about
1000 reviewers (see details of the process in their preface to the proceedings). We
are also indebted to all the other chairs who, with the support of our research
teams (names listed below), diligently helped us manage all aspects of the main
conference, tutorials, workshops, exhibits, demos, proceedings, and web presence.
Finally we thank our generous sponsors and Consulta Umbria for handling the
registration of delegates and all financial aspects associated with the conference.
We hope you enjoyed ECCV 2012. Benvenuti a Firenze!

October 2012 Roberto Cipolla


Carlo Colombo
Alberto Del Bimbo
Preface

Welcome to the proceedings of the 2012 European Conference on Computer


Vision in Florence, Italy! We received 1437 complete submissions, the largest
number of submissions in the history of ECCV. Forty papers were selected for
oral presentation and 368 papers for poster presentation, resulting in acceptance
rates of 2.8% for oral, 25.6% for poster, and 28.4% in total.
The following is a brief description of the review process. After the submis-
sion deadline, each paper was assigned to one of 54 area chairs (28 from Europe,
21 from the USA and Canada, and 4 from Asia) with the help of the Toronto Pa-
per Matching System (TMS). TMS, developed by Laurent Charlin and Richard
Zemel, is beginning to be used by an increasing number of conferences, including
NIPS, ICML, and CVPR. To ensure the best possible assignment of papers to
area chairs, the program chairs manually selected several area chair candidates
for each paper based on the suggestions generated by TMS. After automatic load
balancing and conflict resolution, each AC was finally assigned approximately
30 papers closely matching their expertise.
Area chairs then made reviewer suggestions (an average of seven per paper),
which were load-balanced and conflict-resolved, giving 3 reviewers for each pa-
per and a maximum of 11 papers per reviewer. The ACs were assisted in this
process by TMS, which was also used for automatically selecting potential re-
viewers, matching each submitted paper based on the reviewers' representative
publications. These suggestions came from a pool of potential reviewers com-
posed from names of people who have reviewed for recent vision conferences,
self-nominations (any member of the community could fill out a form on the
ECCV website asking to be a reviewer), and nominations by ACs. From an ini-
tial pool of 863 reviewers, 638 ended up reviewing at least one paper. This was
the first time that TMS had been used this extensively in the review process
for a vision conference (CVPR 2012 used a restricted version of the system for
assigning papers to area chairs), and in the end, we were very pleased with its
performance. An important improvement over previous conferences was that ini-
tial reviewer suggestions were generated entirely in parallel by the ACs, without
the race for good reviewers that the previous methods have implicitly encour-
aged. Area chairs were then given the opportunity to correct infelicities in the
load balancing before the final list was generated. We extend our heartfelt thanks
to the area chairs, who participated vigorously in this process, to maximize the
quality of the review assignments.
For the decision process, we introduced one major innovation. We replaced
the physical area chair meeting and the conventional AC buddy system with vir-
tual meetings of AC triplets (this system was first tried out for BMVC 2011 and
found to work very well). After the conclusion of the review, rebuttal, and discus-
sion periods, the AC triplets met on the phone or on Skype (and, in just one case,

in person), jointly discussed all their papers, and made acceptance/rejection de-
cisions. Thus, the reviews and consolidation reports for each paper were carefully
examined by three ACs, ensuring a fair and thorough assessment. A program
chair assisted in each AC triplet meeting to maintain the consistency in the de-
cision process and to provide any necessary support. Furthermore, each triplet
recommended a small number of top-ranked papers (typically one to three) for
oral presentation, and the program chairs took these candidates and made the
final oral vs. poster decisions.
Double-blind reviewing policies were strictly maintained throughout the en-
tire process: neither the area chairs nor the reviewers knew the identity of the
authors, and the authors did not know the identity of the reviewers and ACs.
Based on feedback from authors, reviewers, and area chairs, we believe we suc-
cessfully maintained the integrity of the paper selection process, and we are very
excited about the quality of the resulting program.
We wish to thank everyone involved for their time and dedication to making
the ECCV 2012 program possible. The success of ECCV 2012 entirely relied on
the time and effort invested by the authors into producing high-quality research,
on the care taken by the reviewers in writing thorough and professional reviews,
and on the commitment by the area chairs to reconciling the reviews and writing
detailed and precise consolidation reports. We also wish to thank the general
chairs, Roberto Cipolla, Carlo Colombo, and Alberto Del Bimbo, and the other
organizing committee members for their top-notch handling of the event.
Finally, we would like to commemorate Mark Everingham, whose untimely
death has shocked and saddened the entire vision community. Mark was an area
chair for ECCV and also an organizer for one of the workshops; his hard work and
dedication were absolutely essential in enabling us to put together a high-quality
conference program. We salute his record of exemplary service and intellectual
contributions to the discipline of computer vision. Mark, you will be missed!

October 2012 Andrew Fitzgibbon


Svetlana Lazebnik
Pietro Perona
Yoichi Sato
Cordelia Schmid
Organization

General Chairs
Roberto Cipolla University of Cambridge, UK
Carlo Colombo University of Florence, Italy
Alberto Del Bimbo University of Florence, Italy

Program Coordinator
Pietro Perona California Institute of Technology, USA

Program Chairs
Andrew Fitzgibbon Microsoft Research, Cambridge, UK
Svetlana Lazebnik University of Illinois at Urbana-Champaign, USA
Yoichi Sato The University of Tokyo, Japan
Cordelia Schmid INRIA, Grenoble, France

Honorary Chair
Tomaso Poggio Massachusetts Institute of Technology, USA

Tutorial Chairs
Emanuele Trucco University of Dundee, UK
Alessandro Verri University of Genoa, Italy

Workshop Chairs
Andrea Fusiello University of Udine, Italy
Vittorio Murino Istituto Italiano di Tecnologia, Genoa, Italy

Demonstration Chair
Rita Cucchiara University of Modena and Reggio Emilia, Italy

Industrial Liaison Chair


Bjorn Stenger Toshiba Research Europe, Cambridge, UK

Web Chair
Marco Bertini University of Florence, Italy

Publicity Chairs
Terrance E. Boult University of Colorado at Colorado Springs, USA
Tat Jen Cham Nanyang Technological University, Singapore
Marcello Pelillo University Ca' Foscari of Venice, Italy

Publication Chair
Massimo Tistarelli University of Sassari, Italy

Video Processing Chairs


Sebastiano Battiato University of Catania, Italy
Giovanni M. Farinella University of Catania, Italy

Travel Grants Chair


Luigi Di Stefano University of Bologna, Italy

Travel Visa Chair


Stefano Berretti University of Florence, Italy

Local Committee Chair


Andrew Bagdanov MICC, Florence, Italy

Local Committee
Lamberto Ballan Giuseppe Lisanti
Laura Benassi Iacopo Masi
Marco Fanfani Fabio Pazzaglia
Andrea Ferracani Federico Pernici
Claudio Guida Lorenzo Seidenari
Lea Landucci Giuseppe Serra

Area Chairs
Simon Baker Microsoft Research, USA
Horst Bischof Graz University of Technology, Austria
Michael Black Max Planck Institute, Germany
Richard Bowden University of Surrey, UK
Michael S. Brown National University of Singapore, Singapore
Joachim Buhmann ETH Zurich, Switzerland
Alyosha Efros Carnegie Mellon University, USA
Mark Everingham University of Leeds, UK
Pedro Felzenszwalb Brown University, USA

Rob Fergus New York University, USA


Vittorio Ferrari ETH Zurich, Switzerland
David Fleet University of Toronto, Canada
David Forsyth University of Illinois at Urbana-Champaign, USA
Kristen Grauman University of Texas at Austin, USA
Martial Hebert Carnegie Mellon University, USA
Aaron Hertzmann University of Toronto, Canada
Derek Hoiem University of Illinois at Urbana-Champaign, USA
Katsushi Ikeuchi The University of Tokyo, Japan
Michal Irani The Weizmann Institute of Science, Israel
David Jacobs University of Maryland, USA
Sing Bing Kang Microsoft Research, USA
David Kriegman University of California, San Diego, USA
Kyros Kutulakos University of Toronto, Canada
Christof Lampert Institute of Science and Technology, Austria
Ivan Laptev INRIA, France
Victor Lempitsky Yandex, Russia
Steve Lin Microsoft Research, China
Jitendra Malik University of California, Berkeley, USA
Jiri Matas Czech Technical University, Czech Republic
Yasuyuki Matsushita Microsoft Research, China
Tomas Pajdla Czech Technical University, Czech Republic
Patrick Perez Thomson-Technicolor, France
Marc Pollefeys ETH Zurich, Switzerland
Jean Ponce Ecole Normale Superieure, France
Long Quan Hong Kong Univ. of Science and Technology, China
Deva Ramanan University of California, Irvine, USA
Stefan Roth TU Darmstadt, Germany
Carsten Rother Microsoft Research, UK
Yoav Schechner Technion, Israel
Bernt Schiele Max Planck Institute, Germany
Christoph Schnorr University of Heidelberg, Germany
Stan Sclaroff Boston University, USA
Josef Sivic Ecole Normale Superieure, France
Peter Sturm INRIA, France
Carlo Tomasi Duke University, USA
Antonio Torralba Massachusetts Institute of Technology, USA
Tinne Tuytelaars University of Leuven, Belgium
Jakob Verbeek INRIA, France
Yair Weiss The Hebrew University of Jerusalem, Israel
Christopher Williams University of Edinburgh, UK
Ramin Zabih Cornell University, USA
Lihi Zelnik Technion, Israel
Andrew Zisserman University of Oxford, UK
Larry Zitnick Microsoft Research, USA

Reviewers
Vitaly Ablavsky Alexander Berg Tae Choe
Lourdes Agapito Tamara Berg Ondrej Chum
Sameer Agarwal Hakan Bilen Albert C.S. Chung
Amit Agrawal Matthew Blaschko John Collomosse
Karteek Alahari Michael Bleyer Tim Cootes
Karim Ali Liefeng Bo Florent Couzine-Devy
Saad Ali Daniele Borghesani David Crandall
S. Ali Eslami Terrance Boult Keenan Crane
Daniel Aliaga Lubomir Bourdev Antonio Criminisi
Neil Alldrin Y-Lan Boureau Shengyang Dai
Marina Alterman Kevin Bowyer Dima Damen
Jose M. Alvarez Edmond Boyer Larry Davis
Brian Amberg Steven Branson Andrew Davison
Cosmin Ancuti Mathieu Bredif Fernando De la Torre
Juan Andrade William Brendel Joost de Weijer
Mykhaylo Andriluka Michael Bronstein Teofilo deCampos
Anton Andriyenko Gabriel Brostow Vincent Delaitre
Elli Angelopoulou Matthew Brown Amael Delaunoy
Roland Angst Thomas Brox Andrew Delong
Relja Arandjelovic Marcus Brubaker David Demirdjian
Helder Araujo Darius Burschka Jia Deng
Pablo Arbelaez Tiberio Caetano Joachim Denzler
Antonis Argyros Barbara Caputo Konstantinos Derpanis
Kalle Astrom Stefan Carlsson Chaitanya Desai
Vassilis Athitsos Gustavo Carneiro Thomas Deselaers
Josep Aulinas Joao Carreira Frederic Devernay
Shai Avidan Yaron Caspi Thang Dinh
Tamar Avraham Carlos Castillo Santosh Kumar Divvala
Yannis Avrithis Jan Cech Piotr Dollar
Yusuf Aytar Turgay Celik Justin Domke
Luca Ballan Ayan Chakrabarti Gianfranco Doretto
Lamberto Ballan Tat Jen Cham Matthijs Douze
Atsuhiko Banno Antoni Chan Tom Drummond
Yinzge Bao Manmohan Chandraker Lixin Duan
Adrian Barbu Ming-Ching Chang Olivier Duchenne
Nick Barnes Lin Chen Zoran Duric
Joao Pedro Barreto Xilin Chen Pinar Duygulu
Adrien Bartoli Daozheng Chen Charles Dyer
Arslan Basharat Wen-Huang Cheng Sandra Ebert
Dhruv Batra Yuan Cheng Michael Elad
Sebastiano Battiato Tat-Jun Chin James Elder
Jean-Charles Bazin Han-Pang Chiu Ehsan Elhamifar
Fethallah Benmansour Minsu Cho Ian Endres

Olof Enqvist German Gonzalez Wenze Hu


Sergio Escalera Raghuraman Gopalan Changbo Hu
Jialue Fan Albert Gordo Gang Hua
Bin Fan Lena Gorelick Xinyu Huang
Gabriele Fanelli Paulo Gotardo Rui Huang
Yi Fang Stephen Gould Wonjun Hwang
Ali Farhadi Helmut Grabner Ichiro Ide
Ryan Farrell Etienne Grossmann Juan Iglesias
Raanan Fattal Matthias Grundmann Ivo Ihrke
Paolo Favaro Jinwei Gu Nazli Ikizler-Cinbis
Rogerio Feris Steve Gu Slobodan Ilic
Sanja Fidler Li Guan Ignazio Infantino
Robert Fisher Peng Guan Michael Isard
Pierre Fite-Georgel Matthieu Guillaumin Herve Jegou
Boris Flach Jean-Yves Guillemaut C.V. Jawahar
Francois Fleuret Ruiqi Guo Rodolphe Jenatton
Wolfgang Forstner Guodong Guo Hueihan Jhuang
Andrea Fossati Abhinav Gupta Qiang Ji
Charless Fowlkes Mohit Gupta Jiaya Jia
Jan-Michael Frahm Tony Han Hongjun Jia
Jean-Sebastien Franco Bohyung Han Yong-Dian Jian
Friedrich Fraundorfer Mei Han Hao Jiang
William Freeman Edwin Hancock Zhuolin Jiang
Oren Freifeld Jari Hannuksela Shuqiang Jiang
Mario Fritz Kenji Hara Sam Johnson
Yasutaka Furukawa Tatsuya Harada Anne Jorstad
Andrea Fusiello Daniel Harari Neel Joshi
Adrien Gaidon Zaid Harchaoui Armand Joulin
Juergen Gall Stefan Harmeling Frederic Jurie
Andrew Gallagher Søren Hauberg Ioannis Kakadiaris
Simone Gasparini Michal Havlena Zdenek Kalal
Peter Gehler James Hays Joni-K. Kamarainen
Yakup Genc Xuming He Kenichi Kanatani
Leifman George Kaiming He Atul Kanaujia
Guido Gerig Varsha Hedau Ashish Kapoor
Christopher Geyer Nicolas Heess Jorg Kappes
Abhijeet Ghosh Yong Heo Leonid Karlinsky
Andrew Gilbert Adrian Hilton Kevin Karsch
Ross Girshick Stefan Hinterstoisser koray kavukcuoglu
Martin Godec Minh Hoai Rei Kawakami
Roland Goecke Jesse Hoey Hiroshi Kawasaki
Michael Goesele Anthony Hoogs Verena Kaynig
Siome Goldenstein Joachim Hornegger Qifa Ke
Bastian Goldluecke Alexander Hornung Ira Kemelmacher-Shlizerman
Shaogang Gong Edward Hsiao

Aditya Khosla Ruonan Li Jason Meltzer


Tae-Kyun Kim Hongdong Li Talya Meltzer
Jaechul Kim Feng Li Heydi Mendez-Vazquez
Seon Joo Kim Yunpeng Li Thomas Mensink
Kris Kitani Fuxin Li Fabrice Michel
Jyri Kivinen Li-Jia Li Branislav Micusik
Hedvig Kjellstrom Zicheng Liao Krystian Mikolajczyk
Jan Knopp Shengcai Liao Niloy Mitra
Kevin Koeser Jongwoo Lim Anurag Mittal
Pushmeet Kohli Joseph Lim Philippos Mordohai
Nikos Komodakis Yen-Yu Lin Francesc Moreno-Noguer
Kurt Konolige Dahua Lin Greg Mori
Filip Korc Daniel Lin Bryan Morse
Andreas Koschan Haibin Ling Yadong Mu
Adriana Kovashka James Little Yasuhiro Mukaigawa
Josip Krapac Ce Liu Lopamudra Mukherjee
Dilip Krishnan Xiaobai Liu Andreas Muller
Zuzana Kukelova Ming-Yu Liu Jane Mulligan
Neeraj Kumar Xiaoming Liu Daniel Munoz
M. Pawan Kumar Tyng-Luh Liu A. Murillo
Junghyun Kwon Yunlong Liu Carlo Mutto
Dongjin Kwon Wei Liu Hajime Nagahara
Junseok Kwon Jingen Liu Vinay Namboodiri
Florent Lafarge Marcus Liwicki Srinivasa Narasimhan
Shang-Hong Lai Liliana Lo Presti Fabian Nater
Jean-Francois Lalonde Roberto Lopez-Sastre Shawn Newsam
Michael Langer Jiwen Lu Kai Ni
Douglas Lanman Zheng Lu Feiping Nie
Diane Larlus Le Lu Juan Carlos Niebles
Longin Jan Latecki Simon Lucey Claudia Nieuwenhuis
Erik Learned-Miller Julien Mairal Ko Nishino
Seungkyu Lee Michael Maire Sebastian Nowozin
Kyong Joon Lee Subhransu Maji Jean-Marc Odobez
Honglak Lee Yasushi Makihara Peter O'Donovan
Yong Jae Lee Dimitrios Makris Sangmin Oh
Bastian Leibe Tomasz Malisiewicz Takeshi Oishi
Ido Leichter Jiri Matas Takahiro Okabe
Frank Lenzen Iain Matthews Takayuki Okatani
Matt Leotta Stefano Mattoccia Aude Oliva
Vincent Lepetit Thomas Mauthner Carl Olsson
Anat Levin Steven Maybank Bjorn Ommer
Maxime Lhuillier Walterio Mayol-Cuevas Eng-Jon Ong
Rui Li Scott McCloskey Anton Osokin
Stan Li Stephen McKenna Matthew OToole
Hongsheng Li Gerard Medioni Mustafa Ozuysal

Maja Pantic Erik Reinhard Shiguang Shan


Caroline Pantofaru Xiaofeng Ren Ling Shao
George Papandreou Christoph Rhemann Abhishek Sharma
Toufiq Parag Antonio Robles-Kelly Eli Shechtman
Vasu Parameswaran Emanuele Rodola Yaser Sheikh
Devi Parikh Mikel Rodriguez Alexander Shekhovtsov
Sylvain Paris Antonio Rodriguez-Sanchez Ilan Shimshoni
Minwoo Park Takaaki Shiratori
Dennis Park Marcus Rohrbach Jamie Shotton
Ioannis Patras Javier Romero Nitesh Shroff
Ioannis Pavlidis Charles Rosenberg Zhangzhang Si
Nadia Payet Bodo Rosenhahn Leonid Sigal
Kim Pedersen Samuel Rota Bulo Nathan Silberman
Or Pele Peter Roth Karen Simonyan
Shmuel Peleg Amit Roy-Chowdhury Vivek Singh
Yigang Peng Dmitry Rudoy Vikas Singh
Amitha Perera Olga Russakovsky Maneesh Singh
Florent Perronnin Bryan Russell Sudipta Sinha
Adrian Peter Chris Russell Greg Slabaugh
Maria Petrou Radu Rusu Arnold Smeulders
Patrick Peursum Michael Ryoo Cristian Sminchisescu
Tomas Pfister Mohammad Sadeghi William A. P. Smith
James Philbin Kate Saenko Kevin Smith
Justus Piater Amir Saffari Noah Snavely
Hamed Pirsiavash Albert Salah Cees Snoek
Robert Pless Mathieu Salzmann Michal Sofka
Thomas Pock Dimitris Samaras Qi Song
Gerard Pons-Moll Aswin Sankaranarayanan Xuan Song
Ronald Poppe Anuj Srivastava
Fatih Porikli Benjamin Sapp Michael Stark
Mukta Prasad Radim Sara Bjorn Stenger
Andrea Prati Scott Satkin Yu Su
Jerry Prince Imari Sato Yusuke Sugano
Nicolas Pugeault Eric Saund Ju Sun
Novi Quadrianto Daniel Scharstein Min Sun
Vincent Rabaud Walter Scheirer Deqing Sun
Rahul Raguram Kevin Schelten Jian Sun
Srikumar Ramalingam Raimondo Schettini David Suter
Narayanan Ramanathan Konrad Schindler Yohay Swirski
Marc'Aurelio Ranzato Joseph Schlecht Rick Szeliski
Konstantinos Rapantzikos Frank Schmidt Yuichi Taguchi
Uwe Schmidt Yu-Wing Tai
Nikhil Rasiwasia Florian Schroff Jun Takamatsu
Mohammad Rastegari Rodolphe Sepulchre Hugues Talbot
James Rehg Uri Shalit Robby Tan

Xiaoou Tang Song Wang Qingxiong Yang


Marshall Tappen Gang Wang Jinfeng Yang
Jonathan Taylor Hongcheng Wang Weilong Yang
Christian Theobalt Jingdong Wang Ruigang Yang
Tai-Peng Tian Lu Wang Jianchao Yang
Joseph Tighe Yueming Wang Yi Yang
Radu Timofte Ruiping Wang Bangpeng Yao
Sinisa Todorovic Kai Wang Angela Yao
Federico Tombari Alexander Weiss Mohammad Yaqub
Akihiko Torii Andreas Wendel Lijun Yin
Duan Tran Manuel Werlberger Kuk-Jin Yoon
Tali Treibitz Tomas Werner Tianli Yu
Bill Triggs Gordon Wetzstein Qian Yu
Nhon Trinh Yonatan Wexler Lu Yuan
Ivor Tsang Oliver Whyte Xiaotong Yuan
Yanghai Tsin Richard Wildes Christopher Zach
Aggeliki Tsoli Oliver Williams Stefanos Zafeiriou
Zhuowen Tu Thomas Windheuser Andrei Zaharescu
Pavan Turaga David Wipf Matthew Zeiler
Ambrish Tyagi Kwan-Yee K. Wong Yun Zeng
Martin Urschler John Wright Guofeng Zhang
Raquel Urtasun Shandong Wu Li Zhang
Jan van Gemert Yi Wu Lei Zhang
Daniel Vaquero Changchang Wu Xinhua Zhang
Andrea Vedaldi Jianxin Wu Shaoting Zhang
Ashok Veeraraghavan Ying Wu Jianguo Zhang
Olga Veksler Jonas Wulff Ying Zheng
Alexander Vezhnevets Jing Xiao S. Kevin Zhou
Sara Vicente Jianxiong Xiao Changyin Zhou
Sudheendra Vijayanarasimhan Wei Xu Shaojie Zhuo
Li Xu Todd Zickler
Pascal Vincent Yong Xu Darko Zikic
Carl Vondrick Yi Xu Henning Zimmer
Chaohui Wang Yasushi Yagi Daniel Zoran
Yang Wang Takayoshi Yamashita Silvia Zuffi
Jue Wang Ming Yang
Hanzi Wang Ming-Hsuan Yang
Sponsoring Companies and Institutions

Gold Sponsors

Silver Sponsors

Bronze Sponsors

Institutional Sponsors
Table of Contents

Poster Session 3
Polynomial Regression on Riemannian Manifolds . . . . . . . . . . . . . . . . . . . . . 1
Jacob Hinkle, Prasanna Muralidharan, P. Thomas Fletcher, and
Sarang Joshi

Approximate Gaussian Mixtures for Large Scale Vocabularies . . . . . . . . . . 15


Yannis Avrithis and Yannis Kalantidis

Geodesic Saliency Using Background Priors . . . . . . . . . . . . . . . . . . . . . . . . . 29


Yichen Wei, Fang Wen, Wangjiang Zhu, and Jian Sun

Joint Face Alignment with Non-parametric Shape Models . . . . . . . . . . . . . 43


Brandon M. Smith and Li Zhang

Discriminative Bayesian Active Shape Models . . . . . . . . . . . . . . . . . . . . . . . 57


Pedro Martins, Rui Caseiro, Joao F. Henriques, and Jorge Batista

Patch Based Synthesis for Single Depth Image Super-Resolution . . . . . . . 71


Oisin Mac Aodha, Neill D.F. Campbell, Arun Nair, and
Gabriel J. Brostow

Annotation Propagation in Large Image Databases via Dense Image


Correspondence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Michael Rubinstein, Ce Liu, and William T. Freeman

Numerically Stable Optimization of Polynomial Solvers for Minimal


Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Yubin Kuang and Kalle Astrom

Has My Algorithm Succeeded? An Evaluator for Human Pose


Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
Nataraj Jammalamadaka, Andrew Zisserman, Marcin Eichner,
Vittorio Ferrari, and C.V. Jawahar

Group Tracking: Exploring Mutual Relations for Multiple Object


Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
Genquan Duan, Haizhou Ai, Song Cao, and Shihong Lao

A Discrete Chain Graph Model for 3d+t Cell Tracking with High
Misdetection Robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
Bernhard X. Kausler, Martin Schiegg, Bjoern Andres,
Martin Lindner, Ullrich Koethe, Heike Leitte, Jochen Wittbrodt,
Lars Hufnagel, and Fred A. Hamprecht

Robust Tracking with Weighted Online Structured Learning . . . . . . . . . . . 158


Rui Yao, Qinfeng Shi, Chunhua Shen, Yanning Zhang, and
Anton van den Hengel

Fast Regularization of Matrix-Valued Images . . . . . . . . . . . . . . . . . . . . . . . . 173


Guy Rosman, Yu Wang, Xue-Cheng Tai, Ron Kimmel, and
Alfred M. Bruckstein

Blind Correction of Optical Aberrations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187


Christian J. Schuler, Michael Hirsch, Stefan Harmeling, and
Bernhard Scholkopf

Inverse Rendering of Faces on a Cloudy Day . . . . . . . . . . . . . . . . . . . . . . . . . 201


Oswald Aldrian and William A.P. Smith

On Tensor-Based PDEs and Their Corresponding Variational


Formulations with Application to Color Image Denoising . . . . . . . . . . . . . . 215
Freddie Astrom, George Baravdish, and Michael Felsberg

Kernelized Temporal Cut for Online Temporal Segmentation and


Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
Dian Gong, Gerard Medioni, Sikai Zhu, and Xuemei Zhao

Grain Segmentation of 3D Superalloy Images Using Multichannel


EWCVT under Human Annotation Constraints . . . . . . . . . . . . . . . . . . . . . . 244
Yu Cao, Lili Ju, and Song Wang

Hough Regions for Joining Instance Localization and Segmentation . . . . . 258


Hayko Riemenschneider, Sabine Sternig, Michael Donoser,
Peter M. Roth, and Horst Bischof

Learning to Segment a Video to Clips Based on Scene and Camera


Motion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
Adarsh Kowdle and Tsuhan Chen

Evaluation of Image Segmentation Quality by Adaptive Ground Truth


Composition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
Bo Peng and Lei Zhang

Oral Session 3: Detection and Attributes


Exact Acceleration of Linear Object Detectors . . . . . . . . . . . . . . . . . . . . . . . 301
Charles Dubout and Francois Fleuret

Latent Hough Transform for Object Detection . . . . . . . . . . . . . . . . . . . . . . . 312


Nima Razavi, Juergen Gall, Pushmeet Kohli, and Luc van Gool

Using Linking Features in Learning Non-parametric Part Models . . . . . . . 326


Leonid Karlinsky and Shimon Ullman

Diagnosing Error in Object Detectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340


Derek Hoiem, Yodsawalai Chodpathumwan, and Qieyun Dai

Attributes for Classifier Feedback . . . . . . . . . . . . . . . . . . . . . . . . . . 354


Amar Parkash and Devi Parikh

Constrained Semi-Supervised Learning Using Attributes and


Comparative Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta

Poster Session 4
Renormalization Returns: Hyper-renormalization and Its
Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384
Kenichi Kanatani, Ali Al-Sharadqah, Nikolai Chernov, and
Yasuyuki Sugaya

Scale Robust Multi View Stereo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398


Christian Bailer, Manuel Finckh, and Hendrik P.A. Lensch

Laplacian Meshes for Monocular 3D Shape Recovery . . . . . . . . . . . . . . . . . 412


Jonas Ostlund, Aydin Varol, Dat Tien Ngo, and Pascal Fua

Soft Inextensibility Constraints for Template-Free Non-rigid


Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426
Sara Vicente and Lourdes Agapito

Spatiotemporal Descriptor for Wide-Baseline Stereo Reconstruction


of Non-rigid and Ambiguous Scenes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441
Eduard Trulls, Alberto Sanfeliu, and Francesc Moreno-Noguer

Elevation Angle from Reflectance Monotonicity: Photometric Stereo


for General Isotropic Reflectances . . . . . . . . . . . . . . . . . . . . . . . . . . 455
Boxin Shi, Ping Tan, Yasuyuki Matsushita, and Katsushi Ikeuchi

Local Log-Euclidean Covariance Matrix (L2 ECM) for Image


Representation and Its Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 469
Peihua Li and Qilong Wang

Ensemble Partitioning for Unsupervised Image Categorization . . . . . . . . . 483


Dengxin Dai, Mukta Prasad, Christian Leistner, and Luc Van Gool

Set Based Discriminative Ranking for Recognition . . . . . . . . . . . . . . . . . . . . 497


Yang Wu, Michihiko Minoh, Masayuki Mukunoki, and Shihong Lao

A Global Hypotheses Verification Method for 3D Object Recognition . . . 511


Aitor Aldoma, Federico Tombari, Luigi Di Stefano, and
Markus Vincze

Are You Really Smiling at Me? Spontaneous versus Posed Enjoyment


Smiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 525
Hamdi Dibeklioglu, Albert Ali Salah, and Theo Gevers

Efficient Monte Carlo Sampler for Detecting Parametric Objects


in Large Scenes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 539
Yannick Verdie and Florent Lafarge

Supervised Geodesic Propagation for Semantic Label Transfer . . . . . . . . . 553


Xiaowu Chen, Qing Li, Yafei Song, Xin Jin, and Qinping Zhao

Bayesian Face Revisited: A Joint Formulation . . . . . . . . . . . . . . . . . . . . . . . 566


Dong Chen, Xudong Cao, Liwei Wang, Fang Wen, and Jian Sun

Beyond Bounding-Boxes: Learning Object Shape by Model-Driven


Grouping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 580
Antonio Monroy and Bjorn Ommer

In Defence of Negative Mining for Annotating Weakly Labelled Data . . . 594


Parthipan Siva, Chris Russell, and Tao Xiang

Describing Clothing by Semantic Attributes . . . . . . . . . . . . . . . . . . . . . . . . . 609


Huizhong Chen, Andrew Gallagher, and Bernd Girod

Graph Matching via Sequential Monte Carlo . . . . . . . . . . . . . . . . . . . . . . . . 624


Yumin Suh, Minsu Cho, and Kyoung Mu Lee

Jet-Based Local Image Descriptors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 638


Anders Boesen Lindbo Larsen, Sune Darkner,
Anders Lindbjerg Dahl, and Kim Steenstrup Pedersen

Abnormal Object Detection by Canonical Scene-Based Contextual


Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 651
Sangdon Park, Wonsik Kim, and Kyoung Mu Lee

Shapecollage: Occlusion-Aware, Example-Based Shape Interpretation . . . 665


Forrester Cole, Phillip Isola, William T. Freeman,
Fredo Durand, and Edward H. Adelson

Interactive Facial Feature Localization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 679


Vuong Le, Jonathan Brandt, Zhe Lin, Lubomir Bourdev, and
Thomas S. Huang

Propagative Hough Voting for Human Activity Recognition . . . . . . . . . . . 693


Gang Yu, Junsong Yuan, and Zicheng Liu

Spatio-Temporal Phrases for Activity Recognition . . . . . . . . . . . . . . . . . . . . 707


Yimeng Zhang, Xiaoming Liu, Ming-Ching Chang, Weina Ge, and
Tsuhan Chen

Complex Events Detection Using Data-Driven Concepts . . . . . . . . . . . . . . 722


Yang Yang and Mubarak Shah

Directional Space-Time Oriented Gradients for 3D Visual Pattern


Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 736
Ehsan Norouznezhad, Mehrtash T. Harandi, Abbas Bigdeli,
Mahsa Baktash, Adam Postula, and Brian C. Lovell

Learning to Recognize Unsuccessful Activities Using a Two-Layer


Latent Structural Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 750
Qiang Zhou and Gang Wang

Action Recognition Using Subtensor Constraint . . . . . . . . . . . . . . . . . . . . . . 764


Qiguang Liu and Xiaochun Cao

Globally Optimal Closed-Surface Segmentation for Connectomics . . . . . . 778


Bjoern Andres, Thorben Kroeger, Kevin L. Briggman,
Winfried Denk, Natalya Korogod, Graham Knott,
Ullrich Koethe, and Fred A. Hamprecht

Reduced Analytical Dependency Modeling for Classifier Fusion . . . . . . . . 792


Andy Jinhua Ma and Pong Chi Yuen

Learning to Match Appearances by Correlations in a Covariance Metric


Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 806
Slawomir Bak, Guillaume Charpiat, Etienne Corvee,
Francois Bremond, and Monique Thonnat

On the Convergence of Graph Matching: Graduated Assignment


Revisited . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 821
Yu Tian, Junchi Yan, Hequan Zhang, Ya Zhang,
Xiaokang Yang, and Hongyuan Zha

Image Annotation Using Metric Learning in Semantic


Neighbourhoods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 836
Yashaswi Verma and C.V. Jawahar

Dynamic Programming for Approximate Expansion Algorithm . . . . . . . . . 850


Olga Veksler

Real-Time Compressive Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 864


Kaihua Zhang, Lei Zhang, and Ming-Hsuan Yang

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 879


Polynomial Regression on Riemannian Manifolds

Jacob Hinkle, Prasanna Muralidharan, P. Thomas Fletcher, and Sarang Joshi

SCI Institute, University of Utah


72 Central Campus Drive, Salt Lake City, UT 84112
{jacob,prasanna,fletcher,sjoshi}@sci.utah.edu

Abstract. In this paper we develop the theory of parametric polyno-


mial regression in Riemannian manifolds. The theory enables parametric
analysis in a wide range of applications, including rigid and non-rigid
kinematics as well as shape change of organs due to growth and aging.
We show application of Riemannian polynomial regression to shape anal-
ysis in Kendall shape space. Results are presented, showing the power of
polynomial regression on the classic rat skull growth data of Bookstein
and the analysis of the shape changes associated with aging of the corpus
callosum from the OASIS Alzheimer's study.

1 Introduction
The study of the relationship between measured data and known descriptive
variables is known as the field of regression analysis. As with most statistical
techniques, regression analyses can be broadly divided into two classes: paramet-
ric and non-parametric. The most widely known parametric regression methods
are linear and polynomial regression in Euclidean space, wherein a linear or poly-
nomial function is fit in a least-squares fashion to observed data. Such methods
are the staple of modern data analysis. The most common non-parametric re-
gression approaches are kernel-based methods and spline smoothing approaches
which provide much more flexibility in the class of regression functions. How-
ever, their non-parametric nature presents a challenge to inference problems; if,
for example, one wishes to perform a hypothesis test to determine whether the
trend for one group of data is significantly different from that of another group.
Fundamental to the analysis of anatomical imaging data within the framework
of computational anatomy is the analysis of transformations and shape which
are best represented as elements of Riemannian manifolds. Classical regression
suffers from the limitation that the data must lie in a Euclidean vector space.
When data are known to lie in a Riemannian manifold, one approach is to map
the manifold into a Euclidean space and perform conventional Euclidean analy-
sis. Such extrinsic or coordinate-based methods inevitably lead to complications
[1]. For example, given simple angular data, averaging in the Euclidean setting
leads to nonsensical results which depend on the chosen coordinates; in the usual
coordinates, the mean of 359 degrees and 1 degree is 180 degrees. In the case
of shape analysis, regression against landmark data in a Euclidean setting, even
after Procrustes alignment (an intrinsic manifold-based technique), will intro-
duce scale and rotation in the regression function, and hence does not capture

A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 1-14, 2012.
© Springer-Verlag Berlin Heidelberg 2012

invariant properties of shape. By performing an intrinsic regression analysis on


the manifold of shapes, such factors are guaranteed not to affect the analysis.
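
As a minimal numerical illustration of this coordinate dependence (a sketch, not part of the original text), the following Python fragment contrasts the naive Euclidean average of two angles with an average computed intrinsically on the unit circle:

import numpy as np

# Naive Euclidean averaging of angular data, in degrees.
angles = np.array([359.0, 1.0])
naive_mean = angles.mean()                        # 180.0, a nonsensical result

# Intrinsic alternative: average the corresponding points on the unit circle
# and map the result back to an angle.  (For two nearby points this chordal
# mean agrees with the Frechet mean on the circle.)
radians = np.deg2rad(angles)
mean_vec = np.stack([np.cos(radians), np.sin(radians)]).mean(axis=1)
circular_mean = np.rad2deg(np.arctan2(mean_vec[1], mean_vec[0])) % 360.0  # ~0.0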
In previous work, non-parametric kernel-based and spline-based methods have
been extended to observations that lie on a Riemannian manifold with some
success [2,3], but intrinsic parametric regression on Riemannian manifolds has
received limited attention. Most recently, Fletcher [4] and Niethammer et al.
[5] have each independently developed geodesic regression which generalizes the
notion of linear regression to Riemannian manifolds.
The goal of the current work is to extend the geodesic model in order to
accommodate more flexibility while remaining in the parametric setting. The
increased flexibility introduced by the methods in this manuscript allows a better
description of the variability in the data, and ultimately will allow more powerful
statistical inference. Our work builds off that of Jupp & Kent [3], whose method
for fitting parametric curves to the sphere involved intricate unwrapping and
rolling processes. The work presented in this paper allows one to fit polynomial
regression curves on a general Riemannian manifold, using intrinsic methods and
avoiding the need for unwrapping and rolling. Since our model includes time-
reparametrized geodesics as a special case, information about time dependence
may be obtained from the regression without explicit modeling by examining
the collinearity of the estimated parameters.
The generality of the Riemannian framework developed in this work allows
this parametric regression technique to be applied in a wide variety of contexts
besides shape analysis. Rigid and non-rigid motion can be modeled with polyno-
mials in the Lie groups SO(3), SE(3), or Diff(Ω) [6,7]. Alternatively, the action
of a Riemannian Lie group on a homogeneous space is often a convenient set-
ting, such as in the currents framework [8], or the action of diffeomorphisms
on images [9]. Additionally, there are applications in which the data are best
modeled directly in a Riemannian manifold, such as modeling directional data
on the sphere, or modeling motion via Grassmannian manifolds [10].
We demonstrate our algorithm in three studies of shape. By applying our algo-
rithm to Bookstein's classical rat skull growth dataset [11], we show that we are
able to obtain a parametric regression curve of similar quality to that produced
by non-parametric methods [12]. We also demonstrate in a 2D corpus callosum
aging study that, in addition to providing more flexibility in the traced path,
our polynomial model provides information about the optimal parametrization
of the time variable. Finally, the usefulness of shape regression for 3D shapes is
demonstrated in an infant cerebellum growth study.

2 Methods
2.1 Preliminaries
Let (M, g) be a Riemannian manifold [13]. For each point p ∈ M, the metric g determines an inner product ⟨·, ·⟩ on the tangent space T_pM as well as a way to differentiate vector fields with respect to one another. That derivative is referred to as the covariant derivative and is denoted by ∇. If v, w are smooth vector fields on M and γ : [0, T] → M is a smooth curve on M, then the covariant derivative satisfies the product rule

\frac{d}{dt}\langle v(\gamma(t)), w(\gamma(t))\rangle = \langle \nabla_{\dot\gamma} v(\gamma(t)), w(\gamma(t))\rangle + \langle v(\gamma(t)), \nabla_{\dot\gamma} w(\gamma(t))\rangle.   (1)

Geodesics γ : [0, T] → M are characterized by the conservation of kinetic energy along the curve:

\frac{d}{dt}\langle \dot\gamma, \dot\gamma\rangle = 0 = 2\langle \nabla_{\dot\gamma}\dot\gamma, \dot\gamma\rangle,   (2)

which is satisfied by the covariant differential equation (cf. [13])

\nabla_{\dot\gamma}\dot\gamma = 0.   (3)

This differential equation, called the geodesic equation, uniquely determines geodesics parametrized by the initial conditions (γ(0), γ̇(0)) ∈ TM. The mapping from the tangent space at p into the manifold is called the exponential map, Exp_p : T_pM → M. The exponential map is injective on a zero-centered ball B in T_pM of some non-zero radius. Thus for a point q in the image of B under Exp_p there exists a unique vector v ∈ T_pM corresponding to a minimal length path under the exponential map from p to q. The mapping of such points q to their associated tangent vectors v at p is called the log map of q at p, denoted v = Log_p q.
Given a curve γ : [0, T] → M we will want to relate tangent vectors at different points along the curve. These relations are governed infinitesimally by the covariant derivative ∇. A vector field v is parallel transported along γ if it satisfies the parallel transport equation, ∇_γ̇ v(γ(t)) = 0. Notice that the geodesic equation is a special case of parallel transport, in which we require that the velocity γ̇ is parallel along the curve itself.

2.2 Riemannian Polynomials


Given a vector field w along a curve γ, the covariant derivative of w gives us a way to define vector fields which are constant along γ, as parallel transported vectors. Geodesics are generalizations to the Riemannian manifold setting of curves in R^d with constant first derivative.
We will apply the covariant derivative repeatedly to examine curves with parallel higher derivatives. For instance, we refer to the vector field ∇_γ̇(t) γ̇(t) as the acceleration of the curve γ. Curves with parallel acceleration are generalizations of quadratic curves in R^d and satisfy the second order polynomial equation, (∇_γ̇)² γ̇(t) = 0. Extending this idea, a cubic polynomial is defined as a curve having parallel jerk (time derivative of acceleration), and so on. Generally, a kth order polynomial in M is defined as a curve γ : [0, T] → M satisfying

(\nabla_{\dot\gamma})^k \dot\gamma(t) = 0   (4)

for all times t ∈ [0, T], with initial conditions γ(0) and (∇_γ̇)^{i-1} γ̇(0), i = 1, . . . , k. As with polynomials in Euclidean space, a kth order polynomial is parametrized by the (k + 1) initial conditions at t = 0.
Riemannian polynomials are alternatively described using the rolling maps of Jupp & Kent [3]. Suppose γ(t) is a smooth curve in M. Then consider embedding M in m-dimensional Euclidean space and rolling M along a d-dimensional plane in R^m such that the plane is tangent to M at γ(t) and the plane does not slip or twist with respect to M while rolling. The property that γ is a kth order polynomial is equivalent to the corresponding curve in R^d being a kth order polynomial in the conventional sense. For more information regarding this connection to Jupp & Kent's rolling maps, as well as a comparison to Noakes' cubic splines, the reader is referred to the literature of Leite & Krakowski [14].
The covariant differential equation governing the evolution of Riemannian polynomials is linearized in the same way that a Euclidean ordinary differential equation is linearized. Introducing vector fields v_1(t), . . . , v_k(t) ∈ T_γ(t)M, we write the system of covariant differential equations as

\dot\gamma(t) = v_1(t)   (5)
\nabla_{\dot\gamma} v_i(t) = v_{i+1}(t), \quad i = 1, \ldots, k-1   (6)
\nabla_{\dot\gamma} v_k(t) = 0.   (7)

The Riemannian polynomial equations cannot, in general, be solved in closed form, and must be integrated numerically. In order to discretize this system of covariant differential equations, we implement a covariant Euler integrator. At each step of the integrator, each vector is incremented within the tangent space at γ(t) and the results are parallel transported infinitesimally along a geodesic from γ(t) to γ(t + Δt). The only ingredients necessary to integrate a polynomial forward in time are the exponential map and parallel transport on the manifold.

Fig. 1. Sample polynomial curves emanating from a common basepoint (green) on the sphere (black=geodesic, blue=quadratic, red=cubic)

Figure 1 shows the result of integrating polynomials of order one, two, and three. The parameters, the initial velocity, acceleration, and jerk, were chosen a priori and a cubic polynomial was integrated to obtain the red curve. Then the initial jerk was set to zero and the blue quadratic curve was integrated, followed by the black geodesic whose acceleration was also set to zero.
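
A minimal sketch of such a covariant Euler integrator is given below, assuming tangent vectors are NumPy arrays and that exp_map(p, v) and par_trans(p, w, v) are supplied for the manifold at hand (both names are placeholders, not the authors' API):

def integrate_polynomial(p0, vs0, exp_map, par_trans, T, n_steps):
    """Integrate a k-th order Riemannian polynomial (Eqs. 5-7) forward in time.

    p0        -- initial point gamma(0)
    vs0       -- list [v_1(0), ..., v_k(0)] of tangent vectors at p0
    exp_map   -- exp_map(p, v): point reached from p along tangent vector v
    par_trans -- par_trans(p, w, v): w parallel transported from p along v
    """
    dt = T / n_steps
    p, vs = p0, [v.copy() for v in vs0]
    trajectory = [p]
    for _ in range(n_steps):
        step = dt * vs[0]                       # geodesic segment in T_p M
        # Increment each v_i by its successor (v_k is parallel, no increment) ...
        new_vs = [vs[i] + dt * vs[i + 1] for i in range(len(vs) - 1)] + [vs[-1]]
        # ... then transport everything along Exp_p(step) to the new point.
        p_next = exp_map(p, step)
        vs = [par_trans(p, w, step) for w in new_vs]
        p = p_next
        trajectory.append(p)
    return trajectory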

2.3 Estimation of Parameters for Regression


In order to regress polynomials against observed data y_j ∈ M, j = 1, . . . , N, at known times t_j ∈ R, j = 1, . . . , N, we define the following constrained objective function

E_0(\gamma(0), v_1(0), \ldots, v_k(0)) = \frac{1}{N} \sum_{j=1}^{N} d(\gamma(t_j), y_j)^2   (8)

subject to Eqs. 5-7. Note that in this expression d represents the geodesic distance: the minimum length of paths from the curve point γ(t_j) to the data point y_j. The function E_0 is minimized in order to find the optimal initial conditions γ(0), v_i(0), i = 1, . . . , k, which we will refer to as the parameters of our model.
In order to determine the optimal parameters γ(0), v_i(0), i = 1, . . . , k, we introduce Lagrange multiplier vector fields λ_i, i = 0, . . . , k, often called the adjoint variables, and define the unconstrained objective function

E(\gamma, \{v_i\}, \{\lambda_i\}) = \frac{1}{N}\sum_{j=1}^{N} d(\gamma(t_j), y_j)^2 + \int_0^T \langle \lambda_0(t), \dot\gamma(t) - v_1(t)\rangle\, dt
   + \sum_{i=1}^{k-1} \int_0^T \langle \lambda_i(t), \nabla_{\dot\gamma} v_i(t) - v_{i+1}(t)\rangle\, dt + \int_0^T \langle \lambda_k(t), \nabla_{\dot\gamma} v_k(t)\rangle\, dt.   (9)

As is standard practice, the optimality conditions for this equation are obtained by taking variations with respect to all arguments of E, integrating by parts when necessary. The resulting variations with respect to the adjoint variables yield the original dynamic constraints: the polynomial equations. Variations with respect to the primal variables give rise to the following system of equations, termed the adjoint equations. The adjoint equations take the following form:

\nabla_{\dot\gamma} \lambda_i(t) = -\lambda_{i-1}(t), \quad i = 1, \ldots, k   (10)
\nabla_{\dot\gamma} \lambda_0(t) = -\frac{2}{N}\sum_{j=1}^{N} \delta(t - t_j)\, \mathrm{Log}_{\gamma(t)} y_j - \sum_{i=1}^{k} R(v_i(t), \lambda_i(t))\, v_1(t),   (11)

where R is the Riemannian curvature tensor and the Dirac δ functional indicates that the order zero adjoint variable takes on jump discontinuities at time points where data is present. The curvature tensor R satisfies the standard formula [13]

R(u, v)w = \nabla_u \nabla_v w - \nabla_v \nabla_u w - \nabla_{[u,v]} w,   (12)

and can be computed in closed form for many manifolds. Gradients of E with respect to initial and final conditions give rise to the terminal endpoint conditions for the adjoint variables, as well as expressions for the gradients with respect to the parameters γ(0), v_i(0):

\lambda_i(T) = 0, \qquad \nabla_{\gamma(0)} E = -\lambda_0(0), \qquad \nabla_{v_i(0)} E = -\lambda_i(0), \quad i = 0, \ldots, k.   (13)



In order to determine the value of the adjoint vector fields at t = 0, and thus the gradients of the functional E_0, the adjoint variables are initialized to zero at time T, then Eq. 11 is evolved backward in time to t = 0.
Given the gradients with respect to the parameters, a simple steepest descent algorithm is used to optimize the functional. At each iteration, γ(0) is updated using the exponential map and the vectors v_i(0) are updated via parallel translation:

\gamma(0) \leftarrow \mathrm{Exp}_{\gamma(0)}(-\varepsilon\, \nabla_{\gamma(0)} E) = \mathrm{Exp}_{\gamma(0)}(\varepsilon\, \lambda_0(0))   (14)
v_i(0) \leftarrow \mathrm{ParTrans}_{\gamma(0)}(v_i(0) + \varepsilon\, \lambda_i(0),\ \varepsilon\, \lambda_0(0)),   (15)

where ε is a step size and ParTrans_p(u, v) denotes the parallel translation of a vector u at p for unit time along a geodesic in the direction of v.
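
The overall estimation procedure can be sketched in Python as follows, assuming a forward integrator for Eqs. 5-7 and a backward adjoint solver for Eqs. 10-11 are available; all function names and the step size are illustrative, not the authors' implementation:

def fit_polynomial(p0, vs0, data, times, forward, backward,
                   exp_map, par_trans, epsilon=0.1, n_iters=100):
    """Steepest descent on E_0 using the adjoint gradients (sketch).

    forward(p, vs, times)            -- integrates Eqs. 5-7, returns the trajectory
    backward(traj, vs, data, times)  -- integrates Eqs. 10-11 from T back to 0,
                                        returns [lambda_0(0), ..., lambda_k(0)]
    """
    p, vs = p0, list(vs0)
    for _ in range(n_iters):
        traj = forward(p, vs, times)
        lams = backward(traj, vs, data, times)
        # Update the base point along lambda_0(0) (Eq. 14) ...
        p_next = exp_map(p, epsilon * lams[0])
        # ... and the remaining parameters, parallel transported to the new
        # base point along the same geodesic (Eq. 15).
        vs = [par_trans(p, vs[i] + epsilon * lams[i + 1], epsilon * lams[0])
              for i in range(len(vs))]
        p = p_next
    return p, vs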
Note that in the special case of a zero-order polynomial (k = 0), the only gradient λ_0 is simply the mean of the log map vectors at the current estimate of the Fréchet mean. So this method generalizes the common method of Fréchet averaging on manifolds [15]. The curvature term, in the case k = 1, indicates that λ_1 is a sum of Jacobi fields. So this approach subsumes geodesic regression as presented by Fletcher [4]. For higher order polynomials, the adjoint equations represent a generalized Jacobi field.
In practice, it is often not necessary to explicitly compute the curvature terms. In the case that the manifold M embeds into a Hilbert space, the extrinsic adjoint equations can be computed by taking variations in the ambient space, using standard methods. Such an approach gives rise to the regression algorithm found in Niethammer et al. [5], for example.

2.4 Time Reparametrization


Geodesic curves propagate at a constant speed, a result of their extremal action
property. However, using higher-order polynomials it is possible to generate a
polynomial curve whose path (the image of the mapping γ) matches that of a geodesic, but whose time-dependence has been reparametrized. If the parameters consist of collinear v_i, then this is the case. In this way, polynomials provide flexibility not only in the class of paths that are possible, but in the time de-
pendence of the curves traversing those paths. Regression models could even
be implemented in which the operator wishes to estimate geodesic paths, but
is unsure of parametrization, and so enforces the estimated parameters to be
collinear.
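
A simple post hoc check of this collinearity condition, assuming the estimated v_i(0) are returned as arrays in the common tangent space at γ(0), is sketched below; the tolerance is arbitrary and the helper name is illustrative:

import numpy as np

def is_reparametrized_geodesic(vs, tol=1e-2):
    """Heuristic check: are v_1(0), ..., v_k(0) (nearly) collinear?"""
    v1 = vs[0] / np.linalg.norm(vs[0])
    cosines = [abs(float(np.dot(v1, v / np.linalg.norm(v)))) for v in vs[1:]]
    return all(c > 1.0 - tol for c in cosines)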

2.5 Coefficient of Determination for Regression in Metric Spaces

It will be useful to define a statistic which will indicate how well our model fits a set of data. As in [4], we compute the coefficient of determination of our regression polynomial γ(t), denoted R². The first step to computing R² is to

compute the variance of the data. The natural choice of total variance statistic is the Fréchet variance, defined by

\mathrm{Var}\{y_j\} = \min_{y \in M} \frac{1}{N} \sum_{j=1}^{N} d(y, y_j)^2.   (16)

Note that the Fréchet mean ȳ itself is the 0-th order polynomial regression against the data {y_j} and the variance is the value of the cost function E_0 at that point. We define the sum of squared error for a curve γ as the value E_0(γ):

\mathrm{SSE} = \frac{1}{N} \sum_{j=1}^{N} d(\gamma(t_j), y_j)^2.   (17)

Then the coefficient of determination is defined as

R^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{Var}\{y_j\}}.   (18)

Any regression polynomial will have SSE ≤ Var{y_j}, and R² will have a value between 0 and 1, with 1 indicating a perfect fit and 0 indicating that the curve provides no better fit than does the Fréchet mean.
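
Given a geodesic distance function and a precomputed Fréchet mean (both assumed available), Eqs. 16-18 amount to only a few lines; the Python sketch below uses illustrative names:

def r_squared(curve_at_data_times, data, frechet_mean, dist):
    """Coefficient of determination of Eq. 18 for a fitted curve gamma."""
    n = len(data)
    sse = sum(dist(g, y) ** 2 for g, y in zip(curve_at_data_times, data)) / n  # Eq. 17
    var = sum(dist(frechet_mean, y) ** 2 for y in data) / n                    # Eq. 16
    return 1.0 - sse / var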

3 Examples
3.1 Lie Groups

A special case arises when the manifold M is a Lie group and the metric g is left-invariant. In this case, left-translation by γ(t) and γ(t)^{-1}, denoted L_γ, L_{γ^{-1}}, allow one to isometrically represent tangent vectors at γ(t) as tangent vectors at the identity, using the pushforward L_{γ^{-1}*} [6]. The tangent space at the identity is isomorphically identified with the Lie algebra of the group, g. The algebraic operation in g is called the infinitesimal adjoint action ad : g × g → g. Fixing X, the adjoint of the operator ad_X satisfies ⟨ad_X^† Y, Z⟩ = ⟨Y, ad_X Z⟩. We will refer to the operator ad_X^† as the adjoint-transpose action of X.
Suppose X is a vector field along the curve γ and X_c = L_{γ^{-1}*} X ∈ g is its representative at the identity. Similarly, γ̇_c = L_{γ^{-1}*} γ̇ is the representative of the curve's velocity in the Lie algebra. Then the covariant derivative of X along γ evolves in the Lie algebra via

L_{\gamma^{-1}*}\, \nabla_{\dot\gamma} X = \dot X_c + \frac{1}{2}\left( \mathrm{ad}_{\dot\gamma_c} X_c - \mathrm{ad}^{\dagger}_{X_c} \dot\gamma_c - \mathrm{ad}^{\dagger}_{\dot\gamma_c} X_c \right).   (19)

In the special case when X = γ̇ so that X_c = γ̇_c, if we set this covariant derivative equal to zero we have the geodesic equation:

\frac{d}{dt}\dot\gamma_c = \mathrm{ad}^{\dagger}_{\dot\gamma_c} \dot\gamma_c,   (20)

which in this form is often referred to as the Euler-Poincaré equation [6,16]. Also note that if the metric is both left- and right-invariant, then the adjoint-transpose action is alternating and we can simplify the covariant derivative further. Parallel transport in such a case requires solution of the equation \frac{d}{dt} X_c + \frac{1}{2}\mathrm{ad}_{\dot\gamma_c} X_c = 0, which can be integrated using Euler integration or similar time-stepping algorithms, so long as the adjoint action can be easily computed. Because of the simplified covariant derivative formula for a bi-invariant metric, the curvature also takes a simple form [13]: R(X, Y)Z = \frac{1}{4}\, \mathrm{ad}_Z\, \mathrm{ad}_X Y.

The Lie Group SO(3). In order to illustrate the algorithm in a Lie group, we consider the Lie group SO(3) of orthogonal 3-by-3 real matrices with determinant one. We review the basics of SO(3) here, but the reader is encouraged to consult [6] for a full treatment of SO(3) and derivation of the Euler-Poincaré equation there.
The Lie algebra for the rotation group is so(3), the set of all skew-symmetric 3-by-3 real matrices. Such matrices can be identified with vectors in R^3 by the identification

x = \begin{pmatrix} a \\ b \\ c \end{pmatrix} \;\mapsto\; \hat{x} = \begin{pmatrix} 0 & -c & b \\ c & 0 & -a \\ -b & a & 0 \end{pmatrix},   (21)
under which the infinitesimal adjoint action takes the convenient form of the cross-product of vectors:

\mathrm{ad}_x y = x \times y = \hat{x}\, y, \qquad \widehat{x \times y} = [\hat{x}, \hat{y}].   (22)

With the R^3 representation of vectors in so(3), a natural choice of bi-invariant metric is available: ⟨x, y⟩ = x^T y. With this metric the adjoint-transpose action is ad_x^† y = -x × y. If we plug this into the general Lie group formula we obtain the geodesic equation in the Lie algebra:

\frac{d}{dt}\dot\gamma_c = -(\dot\gamma_c \times \dot\gamma_c) = 0,   (23)
reflecting the fact that angular momentum is conserved under geodesics in SO(3). The exponential map is simply the matrix exponential of \hat{\dot\gamma}_c, which can be computed in closed form using Rodrigues' rotation formula [6]. To compute a polynomial curve, the matrix v is converted to a skew-symmetric matrix by multiplication by γ(t)^{-1}, which translates v back into the Lie algebra. The resulting skew-symmetric matrix is converted to a vector γ̇_c(0). The polynomial equation is then integrated to find γ̇_c at future times. The point γ(t) then evolves along the exponential map by time-stepping, where at each step γ̇_c is converted back into a skew-symmetric matrix W, multiplied by Δt and exponentiated to obtain a rotation matrix. The Euler integration step is then given by left-multiplication by the resulting matrix.
Plugging into the above expressions, the curvature of SO(3) is

R(u, v)w = \frac{1}{4}\, w \times (u \times v),   (24)

and the parallel transport equation is given by \dot{u}_c = -\frac{1}{2}\, \dot\gamma_c \times u_c.
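
The SO(3) ingredients above can be sketched in a few lines of Python: the hat map follows Eq. 21, the exponential uses Rodrigues' formula, and geodesic_step is one consistent choice of update convention (the angular velocity is held fixed in a single frame), not necessarily the authors' implementation:

import numpy as np

def hat(x):
    """Hat map of Eq. 21: hat(x) @ y equals np.cross(x, y)."""
    a, b, c = x
    return np.array([[0.0,  -c,   b],
                     [  c, 0.0,  -a],
                     [ -b,   a, 0.0]])

def expm_so3(w):
    """Rodrigues' rotation formula: matrix exponential of hat(w), w in R^3."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    K = hat(w / theta)
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def geodesic_step(R, w, dt):
    """One Euler step of a geodesic in SO(3) for the bi-invariant metric:
    the angular velocity w (here kept in a fixed spatial frame) is constant,
    and R is advanced by left-multiplying with exp(dt * hat(w))."""
    return expm_so3(dt * w) @ R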

3.2 The n-Dimensional Sphere

In the case that M is an n-dimensional sphere, S^n = \{p \in R^{n+1} : \|p\| = 1\}, a natural metric exists corresponding to that of R^{n+1} and to the bi-invariant metric on SO(n + 1), under which S^n is a homogeneous space. Geodesics are great circles and the exponential map, log map, and Riemannian curvature are given by

\mathrm{Exp}_p v = \cos\theta\, p + \sin\theta\, \frac{v}{\|v\|}, \qquad \theta = \|v\|   (25)
\mathrm{Log}_p q = \theta\, \frac{q - (p^T q)p}{\|q - (p^T q)p\|}, \qquad \theta = \cos^{-1}(p^T q)   (26)
R(u, v)w = (u^T w)v - (v^T w)u.   (27)

Parallel translation of a vector w along the exponential map of a vector v is performed as follows. The vector w is decomposed into a vector parallel to v, which we denote w_v, and a vector orthogonal to v, w_⊥. The component w_⊥ is left unchanged by parallel transport along Exp_p v, while w_v transforms in a similar way as v:

w_v \;\mapsto\; \left( -p\, \sin\|v\| + \frac{v}{\|v\|} \cos\|v\| \right) \frac{v^T}{\|v\|}\, w_v.   (28)

Using this exponential map and parallel transport, a polynomial integrator was
implemented. Figure 1 shows the result of integrating polynomials of order one,
two, and three, using the equations from this section, for the special case of a
two-dimensional sphere.
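A compact NumPy sketch of equations (25), (26) and (28) is given below; the function names are ours and small-angle cases are handled crudely. It is meant as an illustration, not as the authors' code.

```python
import numpy as np

def sphere_exp(p, v):
    """Exponential map on S^n (eq. 25): p is a unit vector, v a tangent at p."""
    theta = np.linalg.norm(v)
    if theta < 1e-12:
        return p
    return np.cos(theta) * p + np.sin(theta) * (v / theta)

def sphere_log(p, q):
    """Log map on S^n (eq. 26): tangent vector at p pointing towards q."""
    w = q - np.dot(p, q) * p
    nw = np.linalg.norm(w)
    theta = np.arccos(np.clip(np.dot(p, q), -1.0, 1.0))
    return np.zeros_like(p) if nw < 1e-12 else theta * (w / nw)

def sphere_parallel_transport(p, v, w):
    """Parallel-transport w along Exp_p(v) (eq. 28): the component of w along v
    rotates together with v, the orthogonal component is unchanged."""
    theta = np.linalg.norm(v)
    if theta < 1e-12:
        return w
    u = v / theta
    wv = np.dot(u, w) * u          # component of w parallel to v
    wperp = w - wv                 # component orthogonal to v
    return wperp + np.dot(u, w) * (-p * np.sin(theta) + u * np.cos(theta))
```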

3.3 Kendall Shape Space

We will briefly describe Kendall's shape space of m-landmark point sets in $\mathbb{R}^d$,
denoted $\Sigma_m^d$. For a more complete overview of Kendall's shape space, including
computation of the curvature tensor, the reader is encouraged to consult the
work of Kendall and Le [17,18].
Let $x = (x_i)_{i=1,\ldots,m}$, $x_i \in \mathbb{R}^d$, be a labelled point set in $\mathbb{R}^d$. Translate the point
set so that the centroid resides at 0, removing any effects due to translations.
Another degree of freedom due to uniform scaling is removed by requiring that
$\sum_{i=1}^m \|x_i\|^2 = 1$. After this standardization, x is a point on the sphere $S^{md-1}$,
commonly referred to as preshape space in this context. The Kendall shape space
$\Sigma_m^d$ is obtained by taking the quotient of the preshape space by the action of the
rotation group SO(d) on point sets. In practice, points in the quotient (referred
to as shapes) are represented by a member of their equivalence class in preshape
space. The work of O'Neill [19] characterizes the link in geometry between the
shape and preshape spaces.
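For reference, here is a minimal NumPy sketch of this preshape standardization, together with an SVD-based orthogonal Procrustes alignment that picks a rotation representative from an equivalence class; these are standard constructions and illustrative names, not code from the paper.

```python
import numpy as np

def to_preshape(x):
    """Map an (m, d) landmark configuration to preshape space: remove translation
    by centering, remove scale by normalizing to unit Frobenius norm, giving a
    point on the sphere S^{md-1}."""
    x = x - x.mean(axis=0, keepdims=True)   # centroid at the origin
    norm = np.linalg.norm(x)
    if norm < 1e-12:
        raise ValueError("degenerate (all-coincident) landmark configuration")
    return x / norm

def procrustes_align(x, y):
    """Optimally rotate preshape y onto preshape x (quotient by SO(d)), so both
    represent their equivalence classes by comparable members."""
    u, _, vt = np.linalg.svd(x.T @ y)
    r = u @ vt
    if np.linalg.det(r) < 0:                # restrict to rotations (det = +1)
        u[:, -1] *= -1
        r = u @ vt
    return y @ r.T
```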
Fig. 2. Bookstein rat calvarium data after uniform scaling and Procrustes alignment.
The colors of dots indicate age group, while the colors of lines indicate the order of
polynomial used for the regression (black = geodesic, blue = quadratic, red = cubic).
Zoomed views of individual rectangles are shown at bottom. Note that the axes are
arbitrary, due to the scale-invariance of Kendall shape space.
For a general shape space $\Sigma_m^d$, the exponential map and parallel translation
are performed using representatives in preshape space, $S^{md-1}$. For d > 2, this
must be done with a time-stepping algorithm, in which at each time step an
infinitesimal spherical parallel transport is performed, followed by the horizontal
projection. The resulting algorithm can be used to compute the exponential map
as well. Computation of the log map is less trivial, as it requires an iterative
optimization routine. However, a special case arises when d = 2: in this case the
exponential map, parallel transport and log map have closed form. The reader
is encouraged to consult [4] for more details about the two-dimensional case.
With the exponential map, log map, and parallel transport implemented as
described here, one can perform polynomial regression on Kendall shape space
in any dimension. We demonstrate such regression now on examples in both two
and three dimensions.
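Since only the horizontal-projection step is described in prose, the following NumPy sketch shows one way it can be realized; the function name and the Gram–Schmidt construction of the vertical space are our own choices, and the snippet assumes v is already tangent to the preshape sphere at x.

```python
import numpy as np

def horizontal_project(x, v):
    """Project a preshape tangent vector v at x (both (m, d) arrays) onto the
    horizontal subspace of the submersion S^{md-1} -> shape space, i.e. remove
    the components generated by infinitesimal rotations A x, A skew-symmetric."""
    m, d = x.shape
    basis = []                               # orthonormal basis of the vertical space
    for i in range(d):
        for j in range(i + 1, d):
            A = np.zeros((d, d))
            A[i, j], A[j, i] = 1.0, -1.0
            w = x @ A.T                      # generator of rotation in the (i, j) plane
            for b in basis:                  # Gram-Schmidt against previous generators
                w = w - np.sum(w * b) * b
            n = np.linalg.norm(w)
            if n > 1e-12:
                basis.append(w / n)
    for b in basis:                          # subtract vertical components from v
        v = v - np.sum(v * b) * b
    return v
```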

4 Results
4.1 Rat Calvarium Growth
The first dataset we consider was first analyzed by Bookstein [11]. The data is
available for download at http://life.bio.sunysb.edu/morph/data/datasets.html
and consists of m = 8 landmarks on a midsagittal section of rat calvaria
(upper skulls). Landmark positions are available for 18 rats at 8 ages apiece.
Riemannian polynomials of orders k = 0, 1, 2, 3 were fit to this data. The resulting
curves are shown in Fig. 2. Clearly the quadratic and cubic curves differ from that
of the geodesic regression. The R² values agree with this qualitative difference:
the geodesic regression has R² = 0.79, while the quadratic and cubic regressions
have R² values of 0.85 and 0.87, respectively (see Table 1). While this shows
that there is a clear improvement in the fit due to increasing k from one to two, it
also shows that little is gained by further increasing the order of the polynomial.
Qualitatively, Fig. 2 shows that the slight increase in R² obtained by moving from
a quadratic to a cubic model corresponds to a marked difference in the curves,
possibly indicating that the cubic curve is overfitting the data.
Corpus Callosum Aging. The corpus callosum is the major white matter bundle
connecting the two hemispheres of the brain. In order to investigate shape change
of the corpus callosum during normal aging, polynomial regression was performed
on a collection of data from the OASIS brain database (www.oasis-brains.org).
Magnetic resonance imaging (MRI) scans from 32 normal subjects with ages
between 19 and 90 years were obtained from the database. A midsagittal slice was
extracted from each volumetric image and segmented using the ITK-SNAP
program (www.itksnap.org). Sets of 64 landmarks were optimized on the contours
of these segmentations using the ShapeWorks program [20]
(www.sci.utah.edu/software.html). The algorithm generates samplings of
each shape boundary with optimal correspondences among the population.

Fig. 3. Geodesic (top, R² = 0.12), quadratic (middle, R² = 0.13) and cubic (bottom,
R² = 0.21) regression for the corpus callosum dataset. Color represents age, with blue
indicating youth (age 19) and red indicating old age (age 90).
Regression results for geodesic, quadratic, and cubic regression are shown in
Fig. 3. At first glance the results appear similar for the three different models,
since the motion envelopes show close agreement. However, as shown in Table 1,
the R² values show an improvement from geodesic to quadratic (from 0.12 to
0.13) and from quadratic to cubic (from 0.13 to 0.21). Inspection of the estimated
parameters, shown in Fig. 4, reveals that the tangent vectors appear to be rather
collinear. For the reasons stated in Sec. 2.4, this suggests that the differences can
essentially be described as a reparametrization of the time variable, which is only
accommodated in the higher order polynomial models.

Fig. 4. Parameters for cubic regression of corpus callosa. The velocity (black),
acceleration (blue) and jerk (red) are nearly collinear.
Infant Cerebellum Growth. To test the performance of the 3D shape space
algorithm, polynomials were fit to a set of ten cerebellum shapes extracted from
structural MRI data. The data were obtained in clinical studies of subjects with
ages ranging from neonatal to 4 years old. A three-dimensional 1.5T, T1-weighted
MRI was obtained for each patient, which was then segmented to obtain a volume
segmentation of the cerebellum. ShapeWorks was then used to find correspondences
among the ten shapes, using 256 landmarks for each shape. A sample
point-set is shown in Fig. 5.
Table 1. R² for regression of three datasets and three orders of polynomial (k)

k   Rat Calvarium   Corpus Callosum   Cerebellum
1   0.79            0.12              0.082
2   0.85            0.13              0.17
3   0.87            0.21              0.19
Polynomial regression was performed on the cerebellum using geodesic, quadratic,
and cubic curves. The R² value for the geodesic regression was 0.082. The
quadratic regression fit the data much better, improving the R² value to 0.170.
The cubic polynomial showed marginal improvement over quadratic, resulting in
an R² value of 0.192 (see Table 1). Figure 6 shows the estimated parameters for
the cubic polynomial regression. Notice that the hemispheres of the cerebellum
show smaller tangent vectors than those on the vermis (the ridge dividing the
two hemispheres). This indicates that the major modes of shape change due to
age are likely related to changes in the vermis.
Fig. 5. Typical cerebellum data used for regression.

Fig. 6. Estimated parameters for cubic regression on cerebellum data. Velocity is
shown in black, with acceleration in blue, and jerk shown in red.
5 Conclusion
Modern regression analysis consists of two classes of techniques: parametric and
nonparametric. Nonparametric approaches to curve-fitting include kernel-based
methods and splines, each of which has been explored in the setting of Riemannian
manifolds [21,22,23,24,25,26]. Such approaches offer increased flexibility
over parametric models and the ability to accurately model observed data,
at the expense of efficiency and precision. The trade-offs between parametric
and nonparametric regression are well-established [27] and beyond the scope of
this work. A large sample size is required for cross-validation to estimate the
optimal smoothing parameter in a spline-based approach, or to determine the
kernel bandwidth in a kernel smoothing method, and this can be computationally
prohibitive. Compounding this problem, since the number of parameters is tied
to sample size, these nonparametric fits will vary greatly due to the curse of
dimensionality. Parametric analyses enable robust fitting with sample sizes that
are more feasible in biomedical applications. This tradeoff between accuracy
and precision must be considered on a per-application basis when selecting a
regression model.
In this work, we presented for the first time a methodology for parametric
regression on Riemannian manifolds, with the parameters being the initial
conditions of the curve. Our results show that the polynomial model allows for
parametric analysis on manifolds while providing more flexibility and accuracy
than geodesics. For the classical Bookstein rat skull dataset, for example, we are
able to provide a rather high R² coefficient of 0.84 with only a quadratic model.
Although we have presented results for landmark shape data, the framework
is general and can be applied to different data and modes of variation. For
instance, landmark data without direct correspondences can be modeled in the
currents framework [8], and image or surface data can be fit using the popular
diffeomorphic matching framework [7,9,5].
Acknowledgements. This work was made possible in part by NIH/NIBIB
Grant 5R01EB007688, NIH/NCRR Grant 2P41 RR0112553-12, and by NSF
CAREER Grant 1054057. We also acknowledge support from the NIMH Silvio
Conte Center for Neuroscience of Mental Disorders MH064065 and the BRP
grant R01 NS055754-01-02.
References
1. Dryden, I.L., Mardia, K.V.: Statistical Shape Analysis. Wiley (1998)
2. Davis, B.C., Fletcher, P.T., Bullitt, E., Joshi, S.C.: Population shape regression
from random design data. Int. J. Comp. Vis. 90, 255–266 (2010)
3. Jupp, P.E., Kent, J.T.: Fitting smooth paths to spherical data. Appl. Statist. 36,
34–46 (1987)
4. Fletcher, P.T.: Geodesic regression on Riemannian manifolds. In: International
Workshop on Mathematical Foundations of Computational Anatomy, MFCA
(2011)
5. Niethammer, M., Huang, Y., Vialard, F.-X.: Geodesic Regression for Image Time-
Series. In: Fichtinger, G., Martel, A., Peters, T. (eds.) MICCAI 2011, Part II.
LNCS, vol. 6892, pp. 655–662. Springer, Heidelberg (2011)
6. Arnold, V.I.: Mathematical Methods of Classical Mechanics, 2nd edn. Springer
(1989)
7. Cootes, T.F., Twining, C.J., Taylor, C.J.: Diffeomorphic statistical shape models.
In: BMVC (2004)
8. Vaillant, M., Glaunès, J.: Surface matching via currents. In: IPMI (2005)
9. Miller, M.I., Trouvé, A., Younes, L.: Geodesic shooting for computational anatomy.
Journal of Mathematical Imaging and Vision 24, 209–228 (2006)
10. Turaga, P., Veeraraghavan, A., Srivastava, A., Chellappa, R.: Statistical computations
on Grassmann and Stiefel manifolds for image and video-based recognition
33, 2273–2286 (2011)
11. Bookstein, F.L.: Morphometric Tools for Landmark Data: Geometry and Biology.
Cambridge Univ. Press (1991)
12. Kent, J.T., Mardia, K.V., Morris, R.J., Aykroyd, R.G.: Functional models of
growth for landmark data. In: Proceedings in Functional and Spatial Data Analysis,
pp. 109–115 (2001)
13. do Carmo, M.P.: Riemannian Geometry, 1st edn. Birkhäuser, Boston (1992)
14. Leite, F.S., Krakowski, K.A.: Covariant differentiation under rolling maps. Centro
de Matemática da Universidade de Coimbra (2008) (preprint)
15. Fletcher, P.T., Liu, C., Pizer, S.M., Joshi, S.C.: Principal geodesic analysis for the
study of nonlinear statistics of shape. IEEE Trans. Med. Imag. 23, 995–1005 (2004)
16. Holm, D.D., Marsden, J.E., Ratiu, T.S.: The Euler-Poincaré equations and semidirect
products with applications to continuum theories. Adv. in Math. 137, 1–81
(1998)
17. Kendall, D.G.: A survey of the statistical theory of shape. Statistical Science 4,
87–99 (1989)
18. Le, H., Kendall, D.G.: The Riemannian structure of Euclidean shape spaces: A
novel environment for statistics. Ann. Statist. 21, 1225–1271 (1993)
19. O'Neill, B.: The fundamental equations of a submersion. Michigan Math J. 13,
459–469 (1966)
20. Cates, J., Fletcher, P.T., Styner, M., Shenton, M., Whitaker, R.: Shape modeling
and analysis with entropy-based particle systems. In: IPMI (2007)
21. Davis, B.C., Fletcher, P.T., Bullitt, E., Joshi, S.: Population shape regression from
random design data. In: ICCV (2007)
22. Noakes, L., Heinzinger, G., Paden, B.: Cubic splines on curved surfaces. IMA J.
Math. Control Inform. 6, 465–473 (1989)
23. Giambò, R., Giannoni, F., Piccione, P.: An analytical theory for Riemannian cubic
polynomials. IMA J. Math. Control Inform. 19, 445–460 (2002)
24. Machado, L., Leite, F.S.: Fitting smooth paths on Riemannian manifolds. Int. J.
App. Math. Stat. 4, 25–53 (2006)
25. Samir, C., Absil, P.A., Srivastava, A., Klassen, E.: A gradient-descent method for
curve fitting on Riemannian manifolds. Found. Comp. Math. 12, 49–73 (2010)
26. Su, J., Dryden, I.L., Klassen, E., Le, H., Srivastava, A.: Fitting smoothing splines
to time-indexed, noisy points on nonlinear manifolds. Image and Vision Computing
30, 428–442 (2012)
27. Moussa, M.A.A., Cheema, M.Y.: Non-parametric regression in curve fitting. The
Statistician 41, 209–225 (1992)
Approximate Gaussian Mixtures for Large Scale Vocabularies

Yannis Avrithis and Yannis Kalantidis

National Technical University of Athens
{iavr,ykalant}@image.ntua.gr
Abstract. We introduce a clustering method that combines the flexibility
of Gaussian mixtures with the scaling properties needed to construct
visual vocabularies for image retrieval. It is a variant of expectation-
maximization that can converge rapidly while dynamically estimating
the number of components. We employ approximate nearest neighbor
search to speed up the E-step and exploit its iterative nature to make
search incremental, boosting both speed and precision. We achieve superior
performance in large scale retrieval, being as fast as the best known
approximate k-means.

Keywords: Gaussian mixtures, expectation-maximization, visual vocabularies,
large scale clustering, approximate nearest neighbor search.
1 Introduction
The bag-of-words (BoW) model is ubiquitous in a number of problems of computer
vision, including classification, detection, recognition, and retrieval. The
k-means algorithm is one of the most popular in the construction of visual vocabularies,
or codebooks. The investigation of alternative methods has evolved into
an active research area for small to medium vocabularies of up to 10⁴ visual words.
For problems like image retrieval using local features and descriptors, finer vocabularies
are needed, e.g. 10⁶ visual words or more. Clustering options are more
limited at this scale, with the most popular still being variants of k-means like
approximate k-means (AKM) [1] and hierarchical k-means (HKM) [2].
The Gaussian mixture model (GMM), along with expectation-maximization
(EM) [3] learning, is a generalization of k-means that has been applied to vocabulary
construction for class-level recognition [4]. In addition to position, it
models cluster population and shape, but assumes pairwise interaction of all
points with all clusters and is slower to converge. The complexity per iteration
is O(NK), where N and K are the number of points and clusters, respectively,
so it is not practical for large K. On the other hand, a point is assigned to the
nearest cluster via approximate nearest neighbor (ANN) search in [1], bringing
complexity down to O(N log K), but keeping only one neighbor per point.
Robust approximate k-means (RAKM) [5] is an extension of AKM where the
nearest neighbor in one iteration is re-used in the next, with less effort being
spent for new neighbor search. This approach yields further speed-up, since
[Figure 1; panels: iteration=0, clusters=50; iteration=1, clusters=15; iteration=3, clusters=8]

Fig. 1. Estimating the number, population, position and extent of clusters on an 8-
mode 2d Gaussian mixture sampled at 800 points, in just 3 iterations (iteration 2
not shown). Red circles: cluster centers; blue: two standard deviations. Clusters are
initialized on 50 data points sampled at random, with initial σ = 0.02. Observe the
space-filling behavior of the two clusters on the left.
the dimensionality D of the underlying space is high, e.g. 64 or 128, and the
cost per iteration is dominated by the number of vector operations spent for
distance computation to potential neighbors. The above motivates us to keep
a larger, fixed number m of nearest neighbors across iterations. This not only
improves ANN search in the sense of [5], but makes available enough information
for an approximate Gaussian mixture (AGM) model, whereby each data point
interacts only with the m nearest clusters.
Experimenting with overlapping clusters, we have come up with two modifications
that alter the EM behavior to the extent that it appears to be an entirely
new clustering algorithm. An example is shown in Figure 1. Upon applying
EM with relatively large K, the first modification is to compute the overlap
of neighboring clusters and purge the ones that appear redundant, after each
EM iteration. Now, clusters neighboring the ones being purged should fill in
the resulting space, so the second modification is to expand them as much as
possible. The algorithm shares properties with other methods like agglomerative
clustering or density-based mode seeking. In particular, it can dynamically estimate the
number of clusters by starting with large K and purging clusters along the way.
On the other hand, expanding towards empty space apart from underlying data
points boosts the convergence rate.
This algorithm, expanding Gaussian mixtures (EGM), is quite generic and
can be used in a variety of applications. Focusing on spherical Gaussians, we
apply its approximate version, AGM, to large scale visual vocabulary learning
for image retrieval. In this setting, contrary to typical GMM usage, descriptors
of indexed images are assigned only to their nearest visual word to keep the
index sparse enough. This is the first implementation of Gaussian mixtures at
this scale, and it appears to enhance retrieval performance at no additional cost
compared to approximate k-means.
2 Related Work
One known pitfall of k-means is that visual words cluster around the densest
few regions of the descriptor space, while the sparse ones may be more informative
[6]. In this direction, radius-based clustering is used in [7], while [8], [9]
combine the partitional properties of k-means with agglomerative clustering.
Our approximate approach is efficient enough to initialize from all data points,
preserving even sparse regions of the descriptor space, and to dynamically purge
as appropriate in dense ones.
Precise visual word region shape has been represented by Gaussian mixture
models (GMM) for universal [10] and class-adapted codebooks [4], and even by
one-class SVM classifiers [11]. GMMs are extended with spatial information
in [12] to represent images by hyperfeatures. None of the above can scale up as
needed for image retrieval. On the other extreme, training-free approaches like
fixed quantization on a regular lattice [13] or hashing with random histograms
[14] are remarkably efficient for recognition tasks but often fail when applied to
retrieval [15]. The flat vocabulary of AKM [1] is the most popular in this sense,
accelerating the process by the use of randomized k-d trees [16] and outperforming
the hierarchical vocabulary of HKM [2], but RAKM [5] is even faster.
Although constrained to spherical Gaussian components, we achieve superior
flexibility due to the GMM, yet converge as fast as [5].
Assigning an input descriptor to a visual word is also achieved by nearest
neighbor search, and randomized k-d trees [16] are the most popular, in particular
the FLANN implementation [17]. We use FLANN but we exploit the
iterative nature of EM to make the search process incremental, boosting both
speed and precision. Any ANN search method would apply equally, e.g. product
quantization [18], as long as it returns multiple neighbors and its training is fast
enough to repeat within each EM iteration.
One attempt to compensate for the information loss due to quantization is
soft assignment [15]. This is seen as kernel density estimation and applied to
small vocabularies for recognition in [19], while different pooling strategies are
explored in [20]. Soft assignment is expensive in general, particularly for recognition,
but may apply to the query only in retrieval [21], which is our choice as
well. This functionality is moved inside the codebook in [22] using a blurring
operation, while visual word synonyms are learned from an extremely fine vocabulary
in [23], using geometrically verified feature tracks. Towards the other
extreme, Hamming embedding [24] encodes approximate position in Voronoi cells
of a coarser vocabulary, but this also needs more index space. Our method is
complementary to learning synonyms or embedding more information.
Estimating the number of components is often handled by multiple EM runs
and model selection, or by split and merge operations [25], which is impractical
in our problem. A dynamic approach in a single EM run is more appropriate,
e.g. component annihilation based on competition [26]; we rather estimate spatial
overlap directly and purge components long before competition takes place.
Approximations include e.g. space partitioning with k-d trees and group-wise
parameter learning [27], but ANN in the E-step yields lower complexity.
3 Expanding Gaussian Mixtures


We begin with an overview of Gaussian mixture models and parameter learning
via EM, mainly following [3], in section 3.1. We then develop our purging
and expanding schemes in sections 3.2 and 3.3, respectively. Finally, we discuss
initializing and terminating in the specific context of visual vocabularies.
3.1 Parameter Learning


The density p(x) of a Gaussian mixture distribution is a convex combination of
K D-dimensional normal densities or components,
$$p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x\,|\,\mu_k, \Sigma_k) \qquad (1)$$
for $x \in \mathbb{R}^D$, where $\pi_k$, $\mu_k$, $\Sigma_k$ are the mixing coefficient, mean and covariance
matrix respectively of the k-th component. Interpreting $\pi_k$ as the prior probability
p(k) of component k, and given observation x, quantity
$$\gamma_k(x) = \frac{\pi_k\, \mathcal{N}(x\,|\,\mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x\,|\,\mu_j, \Sigma_j)} \qquad (2)$$
for $x \in \mathbb{R}^D$, k = 1, ..., K, expresses the posterior probability p(k|x); we say that
$\gamma_k(x)$ is the responsibility that component k takes for explaining observation
x, or just the responsibility of k for x. Now, given a set of i.i.d. observations or
data points $X = \{x_1, \ldots, x_N\}$, the maximum likelihood (ML) estimate for the
parameters of each component k = 1, ..., K is [3]
$$\pi_k = \frac{N_k}{N} \qquad (3)$$
$$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk}\, x_n \qquad (4)$$
$$\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk}\, (x_n - \mu_k)(x_n - \mu_k)^T, \qquad (5)$$
where $\gamma_{nk} = \gamma_k(x_n)$ for n = 1, ..., N, and $N_k = \sum_{n=1}^{N} \gamma_{nk}$, to be interpreted
as the effective number of points assigned to component k. The expectation-
maximization (EM) algorithm involves an iterative learning process: given an
initial set of parameters, compute responsibilities $\gamma_{nk}$ according to (2) (E-step);
then, re-estimate parameters according to (3)-(5), keeping responsibilities fixed
(M-step). Initialization and termination are discussed in section 3.4.
For the remainder of this work, we shall be focusing on the particular case of
spherical (isotropic) Gaussians, with covariance matrix $\Sigma_k = \sigma_k^2 I$. In this case,
update equation (5) reduces to
$$\sigma_k^2 = \frac{1}{D N_k} \sum_{n=1}^{N} \gamma_{nk}\, \|x_n - \mu_k\|^2. \qquad (6)$$
Comparing to standard k-means, the model is still more flexible in terms of
mixing coefficients $\pi_k$ and standard deviations $\sigma_k$, expressing the population
and extent of clusters. Yet, the model is efficient because $\sigma_k$ is a scalar, whereas
the representation of full covariance matrices $\Sigma_k$ would be quadratic in D.
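As a concrete baseline for the modifications that follow, here is a minimal NumPy sketch of one exact EM iteration for this spherical model, implementing equations (2), (3), (4) and (6); it is an illustrative re-implementation of the standard E- and M-steps, not the authors' approximate algorithm, and the function name is ours.

```python
import numpy as np

def em_step_spherical(X, pi, mu, sigma2):
    """One exact EM iteration for a spherical Gaussian mixture.
    X: (N, D) data; pi: (K,) mixing coefficients; mu: (K, D) means;
    sigma2: (K,) variances. Returns updated (pi, mu, sigma2)."""
    N, D = X.shape
    # E-step: responsibilities gamma[n, k] of component k for point x_n (eq. 2)
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)      # (N, K) squared distances
    log_p = np.log(pi) - 0.5 * D * np.log(2 * np.pi * sigma2) - 0.5 * d2 / sigma2
    log_p -= log_p.max(axis=1, keepdims=True)                  # numerical stability
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters (eqs. 3, 4, 6)
    Nk = gamma.sum(axis=0)                                     # effective counts
    pi = Nk / N
    mu = (gamma.T @ X) / Nk[:, None]
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
    sigma2 = (gamma * d2).sum(axis=0) / (D * Nk)
    return pi, mu, sigma2
```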

3.2 Purging
Determining the number of components is more critical in Gaussian mixtures
than in k-means due to the possibility of overlapping components. Although one
solution is a variational approach [3], the problem has motivated us to devise a
novel method of purging components according to an overlap measure. Purging
is dynamic in the sense that we initialize the model with as many components
as possible and purge them as necessary during the parameter learning process.
In effect, this idea introduces a P-step in EM, that is to be applied right after
the E- and M-steps in every iteration.
Let $p_k$ be the function representing the contribution of component k to the
Gaussian mixture distribution of (1), with
$$p_k(x) = \pi_k\, \mathcal{N}(x\,|\,\mu_k, \Sigma_k) \qquad (7)$$
for $x \in \mathbb{R}^D$. Then $p_k$ can be treated as a representation of component k itself.
Note that $p_k$ is not a normalized distribution unless $\pi_k = 1$. Also, let
$$\langle p, q\rangle = \int p(x)\,q(x)\,dx \qquad (8)$$
be the $L^2$ inner product of real-valued, square-integrable functions p, q (again
not necessarily normalized), where the integral is assumed to be over $\mathbb{R}^D$. The
corresponding $L^2$ norm of function p is given by $\|p\| = \sqrt{\langle p, p\rangle}$. When p, q are
normal distributions, the integral in (8) can be evaluated in closed form.

Theorem 1. Let $p(x) = \mathcal{N}(x\,|\,a, A)$ and $q(x) = \mathcal{N}(x\,|\,b, B)$ for $x \in \mathbb{R}^D$. Then
$$\langle p, q\rangle = \mathcal{N}(a\,|\,b, A + B). \qquad (9)$$
Hence, given components represented by $p_i$, $p_k$, their overlap in space, as measured
by inner product $\langle p_i, p_k\rangle$, is
$$\langle p_i, p_k\rangle = \pi_i \pi_k\, \mathcal{N}(\mu_i\,|\,\mu_k, (\sigma_i^2 + \sigma_k^2) I) \qquad (10)$$
under the spherical Gaussian model. It can be computed in O(D), requiring only
one D-dimensional vector operation $\|\mu_i - \mu_k\|^2$, while squared norm $\|p_i\|^2 =
\langle p_i, p_i\rangle = \pi_i^2 (4\pi\sigma_i^2)^{-D/2}$ is O(1). Now, if function q represents any component
or cluster, (10) motivates generalizing (2) to define quantity
$$\gamma_k(q) = \frac{\langle q, p_k\rangle}{\sum_{j=1}^{K} \langle q, p_j\rangle}, \qquad (11)$$
Algorithm 1: Component purging (P-step)
input : set of components C ⊆ {1, . . . , K} at current iteration
output: updated set of components C′ ⊆ C, after purging
1  K ← ∅                                      // set of components to keep
2  Sort C such that i < k ⇒ π_i ≥ π_k for i, k ∈ C   // re-order components i . . .
3  foreach i ∈ C do                           // . . . in descending order of π_i
4    if ρ_{i,K} ≥ τ then                      // compute ρ_{i,K} by (12)
5      K ← K ∪ {i}                            // keep i if it does not overlap with K
6  C′ ← K                                     // updated components
so that $\gamma_{ik} = \gamma_k(p_i) \in [0, 1]$ is the generalized responsibility of component k for
component i. Function $p_i$ is treated here as a generalized data point, centered
at $\mu_i$, weighted by coefficient $\pi_i$ and having spatial extent $\sigma_i$ to represent the
underlying actual data points. Observe that (11) reduces to (2) when q collapses
to a Dirac delta function, effectively sampling component functions $p_k$.
According to our definitions, $\gamma_{ii}$ is the responsibility of component i for itself.
More generally, given a set K of components and one component $i \notin K$, let
$$\rho_{i,K} = \frac{\gamma_{ii}}{\gamma_{ii} + \sum_{j\in K}\gamma_{ij}} = \frac{\|p_i\|^2}{\|p_i\|^2 + \sum_{j\in K}\langle p_i, p_j\rangle}. \qquad (12)$$
Quantity $\rho_{i,K} \in [0, 1]$ is the responsibility of component i for itself relative to K.
If $\rho_{i,K}$ is large, component i can explain itself better than set K as a whole;
otherwise i appears to be redundant. So, if K represents the components we have
decided to keep so far, it makes sense to purge component i if $\rho_{i,K}$ drops below
overlap threshold $\tau \in [0, 1]$, in which case we say that i overlaps with K.
We choose to process components in descending order of mixing coefficients,
starting from the most populated cluster, which we always keep. We then keep a
component unless it overlaps with the ones we have already kept; otherwise we
purge it. Note that $\tau \geq 1/2$ ensures we cannot keep two identical components.
This P-step is outlined in Algorithm 1, which assumes that $C \subseteq \{1, \ldots, K\}$ holds
the components of the current iteration. Now K refers to the initial number of
components; the effective number, $C = |C| \leq K$, decreases in each iteration.
This behavior resembles agglomerative clustering; in fact, inner product (8) is
analogous to the average linkage or group average criterion [28], which however
measures dissimilarity and is discrete.
We perform the P-step right after the E- and M-steps in each EM iteration. We
also modify the M-step to re-estimate $\pi_k, \mu_k, \sigma_k$ according to (3), (4), (6) only
for $k \in C$; similarly, in the E-step, we compute $\gamma_{nk} = \gamma_k(x_n)$ for all n = 1, ..., N
but only $k \in C$, with
$$\gamma_k(x) = \frac{\pi_k\, \mathcal{N}(x\,|\,\mu_k, \Sigma_k)}{\sum_{j\in C} \pi_j\, \mathcal{N}(x\,|\,\mu_j, \Sigma_j)}, \qquad (13)$$
where the sum in the denominator of (2) has been constrained to C.
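A compact NumPy sketch of this purging rule, under the spherical model and the notation reconstructed above (ρ, τ), is given below; it is an illustration of Algorithm 1 with the Gaussian inner products of (10) computed densely rather than through ANN search, and the function name is ours.

```python
import numpy as np

def purge_components(pi, mu, sigma2, tau=0.55):
    """P-step sketch: visit components in descending order of mixing coefficient,
    purging any component whose relative self-responsibility rho (eq. 12) falls
    below the overlap threshold tau. Returns indices of surviving components."""
    D = mu.shape[1]
    def inner(i, k):
        # <p_i, p_k> = pi_i pi_k N(mu_i | mu_k, (sigma_i^2 + sigma_k^2) I)   (eq. 10)
        s = sigma2[i] + sigma2[k]
        d2 = np.sum((mu[i] - mu[k]) ** 2)
        # for large D this should be done in the log domain to avoid underflow
        return pi[i] * pi[k] * (2 * np.pi * s) ** (-D / 2) * np.exp(-d2 / (2 * s))
    order = np.argsort(pi)[::-1]           # most populated component first
    keep = [order[0]]                      # always keep the largest component
    for i in order[1:]:
        self_term = inner(i, i)            # ||p_i||^2
        overlap = sum(inner(i, k) for k in keep)
        rho = self_term / (self_term + overlap)
        if rho >= tau:                     # i explains itself well enough: keep it
            keep.append(i)
    return keep
```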


3.3 Expanding

When a component, say i, is purged, data points that were better explained by i
prior to purging will have to be assigned to neighboring components that remain.
These components will then have to expand and cover the space populated by
such points. Towards this goal, we modify (6) to overestimate the extent of each
component as much as this does not overlap with its neighboring components.
Components will then tend to fill in as much empty space as possible, and this
determines the convergence rate of the entire algorithm.
More specifically, given component k in the current set of components C, we
use the constrained definition of $\gamma_k(x)$ in (13) to partition the data set $P =
\{1, \ldots, N\}$ into the set of its inner points
$$\underline{P}_k = \{n \in P : \gamma_{nk} = \max_{j\in C} \gamma_{nj}\}, \qquad (14)$$
contained in its Voronoi cell, and outer points, $\overline{P}_k = P \setminus \underline{P}_k$. We now observe
that re-estimation equation (6) can be decomposed into
$$D\sigma_k^2 = \frac{\underline{N}_k}{N_k}\,\underline{s}_k + \frac{\overline{N}_k}{N_k}\,\overline{s}_k, \qquad (15)$$
where $\underline{N}_k = \sum_{n\in \underline{P}_k} \gamma_{nk}$, $\overline{N}_k = \sum_{n\in \overline{P}_k} \gamma_{nk}$, and
$$\underline{s}_k = \frac{1}{\underline{N}_k}\sum_{n\in \underline{P}_k} \gamma_{nk}\, \|x_n - \mu_k\|^2, \qquad \overline{s}_k = \frac{1}{\overline{N}_k}\sum_{n\in \overline{P}_k} \gamma_{nk}\, \|x_n - \mu_k\|^2. \qquad (16)$$
Since $\underline{N}_k + \overline{N}_k = N_k$, the weights in the linear combination of (15) sum to one.
The inner sum $\underline{s}_k$ expresses a weighted average distance from $\mu_k$ of data points
that are better explained by component k, hence fits the underlying data of
the corresponding cluster. On the other hand, the outer sum $\overline{s}_k$ plays a similar
role for points explained by the remaining components. Thus typically $\underline{s}_k < \overline{s}_k$,
though this is not always true, especially in cases of excessive component overlap.
Now, the trick is to bias the weighted sum towards $\overline{s}_k$ in (15) as follows,
$$D\sigma_k^2 = w_k\, \underline{s}_k + (1 - w_k)\,\overline{s}_k, \qquad (17)$$
where $w_k = \frac{\underline{N}_k}{N_k}(1 - \lambda)$ and $\lambda \in [0, 1]$ is an expansion factor, with $\lambda = 0$ reducing
(17) back to (6) and $\lambda = 1$ using only the outer sum $\overline{s}_k$.
Because of the exponential form of the normal distribution, the outer sum $\overline{s}_k$
is dominated by the outer points in $\overline{P}_k$ that are nearest to component k, hence
it provides an estimate for the maximum expansion of k before overlaps with
neighboring components begin. Figure 2 illustrates expansion for the example
of Figure 1. The outer green circles typically pass through the nearest clusters,
while the inner ones tightly fit the data points of each cluster.
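To make the update concrete, here is a minimal NumPy sketch of the expanded variance re-estimation of equations (14)-(17) for spherical components; the function name, default value of the expansion factor, and the guards against empty inner/outer sets are our own additions, not part of the paper.

```python
import numpy as np

def expanded_variance(X, gamma, mu, lam=0.2):
    """Expanded variance update of eqs. (14)-(17) for spherical components.
    X: (N, D) data; gamma: (N, C) responsibilities over current components;
    mu: (C, D) means; lam: expansion factor (lambda). Returns sigma2: (C,)."""
    N, D = X.shape
    C = mu.shape[0]
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)               # (N, C) squared distances
    inner = (gamma.argmax(axis=1)[:, None] == np.arange(C)[None, :])    # inner-point mask, eq. (14)
    Nk = gamma.sum(axis=0)                                              # total soft counts
    N_in = (gamma * inner).sum(axis=0)                                  # inner soft counts
    N_out = Nk - N_in                                                   # outer soft counts
    s_in = (gamma * inner * d2).sum(axis=0) / np.maximum(N_in, 1e-12)   # inner sum, eq. (16)
    s_out = (gamma * ~inner * d2).sum(axis=0) / np.maximum(N_out, 1e-12)  # outer sum, eq. (16)
    w = (N_in / Nk) * (1.0 - lam)                                       # biased weight, eq. (17)
    return (w * s_in + (1.0 - w) * s_out) / D                           # D*sigma^2 = w*s_in + (1-w)*s_out
```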
[Figure 2; panels: iteration=1, clusters=15; iteration=2, clusters=10; iteration=3, clusters=8]

Fig. 2. Component expansion for iterations 1, 2 and 3 of the example of Figure 1.
Blue circles: two standard deviations with expansion (17) and λ = 0.25, as in Figure 1;
magenta: without expansion (6); dashed green: inner and outer sum contributions.
3.4 Initializing and Terminating

It has been argued [7][6] that sparsely populated regions of the descriptor space
are often the most informative ones, so random sampling is usually a bad choice.
We therefore initialize with all data points as cluster centers, that is, K = N.
Using approximate nearest neighbors, this choice is not as inefficient as it sounds.
In fact, only the first iteration is affected because by the second, the ratio of
clusters that survive is typically in the order of 10%. Mixing coefficients are
uniform initially. Standard deviations are initialized to the distance of the nearest
neighbor, again found approximately.
Convergence in EM is typically detected by monitoring the likelihood function.
This makes sense after the number of components has stabilized and no purging
takes place. However, experiments on large scale vocabularies take hours or even
days of processing, so convergence is never reached in practice. What is important
is to measure the performance of the resulting vocabulary in a particular task
(retrieval in our case) versus the processing required.
4 Approximate Gaussian Mixtures

Counting D-dimensional vector operations and ignoring iterations, the complexity
of the algorithm presented so far is O(NK). In particular, the complexity of
the E-step (13) and M-step (3), (4), (16)-(17) of each iteration is O(NC), where
$C = |C| \leq K \leq N$ is the current number of components, and the complexity
of the P-step (Algorithm 1) is $O(C^2)$. This is clearly not practical for large C,
especially when K is in the order of N.
Similarly to [1], the approximate version of our Gaussian mixtures clustering
algorithm involves indexing the entire set of clusters C according to their centers
$\mu_k$ and performing an approximate nearest neighbor query for each data point
$x_n$, prior to the E-step of each iteration. The former step is $O(C\,\varphi(C))$ and the
Algorithm 2: Incremental m-nearest neighbors (N-step)
input : m best neighbors B(x_n) found so far for n = 1, . . . , N
output: updated m best neighbors B′(x_n) for n = 1, . . . , N
1  for n = 1, . . . , N do                      // for all data points
2    B(x_n) ← C ∩ B(x_n)                        // ignore purged neighbors
3    (R, d) ← NN_m(x_n)                         // R: m-NN of x_n; d: distances to x_n
                                                // (such that d_k² = ||x_n − μ_k||² for k ∈ R)
4    for k ∈ B(x_n) \ R do                      // for all previous neighbors. . .
5      d_k² ← ||x_n − μ_k||²                    // . . . find distance after μ_k update (M-step)
6    A ← B(x_n) ∪ R                             // for all previous and new neighbors. . .
7    for k ∈ A do                               // . . . compute unnormalized. . .
8      g_k ← (π_k / σ_k^D) exp{−d_k² / (2σ_k²)}   // . . . responsibility of k for x_n
9    Sort A such that i < k ⇒ g_i ≥ g_k for i, k ∈ A   // keep the top-ranking. . .
10   B′(x_n) ← A[1, . . . , m]                  // . . . m neighbors

latter $O(N\,\varphi(C))$, where $\varphi$ expresses the complexity of a query as a function of
the indexed set size; e.g. $\varphi(C) = \log C$ for typical tree-based methods. Responsibilities
$\gamma_{nk}$ are thus obtained according to (13) for n = 1, ..., N, $k \in C$, but
with distances to cluster centers effectively replaced by metric
$$d_m^2(x, \mu_k) = \begin{cases} \|x - \mu_k\|^2, & \text{if } k \in \operatorname{NN}_m(x) \\ \infty, & \text{otherwise,} \end{cases} \qquad (18)$$
where $\operatorname{NN}_m(x) \subseteq C$ denotes the approximate m-nearest neighborhood of query
point $x \in \mathbb{R}^D$. Each component k found as a nearest neighbor of data point $x_n$ is
subsequently updated by computing the contributions $\gamma_{nk}$, $\gamma_{nk} x_n$, $\gamma_{nk}\|x_n - \mu_k\|^2$
to $N_k$ (hence $\pi_k$ in (3)), $\mu_k$ in (4), $\sigma_k^2$ in (16)-(17), respectively, that are due
to $x_n$. The M-step thus brings no additional complexity. Similarly, the P-step
involves a query for each cluster center $\mu_k$, with complexity $O(C\,\varphi(C))$. It follows
that the overall complexity per iteration is $O(N\,\varphi(C))$. In the case of FLANN,
$\varphi(C)$ is constant and equal to the number of leaf checks per query, since we are
only counting vector operations and each split node operation is scalar.
Now, similarly to [5], we not only use approximate search to speed up clus-
tering, but we also exploit the iterative nature of the clustering algorithm to
enhance the search process itself. To this end, we maintain a list of the m best
neighbors B(xn ) found so far for each data point xn , and re-use it across it-
erations. The distance of each new nearest neighbor is readily available as a
by-product of the query at each iteration, while distances of previous neighbors
have to be re-computed after update of cluster centers (4) at the M-step. The
list of best neighbors is updated as outlined in Algorithm 2.
This incremental m-nearest neighbors algorithm is a generalization of [5],
which restricts to m = 1 and k-means only. It may be considered an N-step
in the overall approximate clustering algorithm, to be performed prior to the
E-step, in fact providing responsibilities as a by-product. The additional cost
is m vector operations for distance computation at each iteration, but this is
compensated for by reducing the requested precision for each query. As in [5],
the rationale is that by keeping track of the best neighbors found so far, we may
significantly reduce the effort spent in searching for new ones.
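For illustration, the sketch below mirrors the bookkeeping of Algorithm 2 in NumPy, with an exact (and therefore O(NC)) neighbor search standing in for the FLANN index; the data structures and names are ours, and this is a sketch of the N-step rather than the authors' implementation.

```python
import numpy as np

def n_step(X, best, mu, pi, sigma2, live, m=8):
    """Sketch of the incremental m-NN update (Algorithm 2). `best` maps each
    point index to a list of component indices found so far; `live` is the set
    of surviving components. An exact search stands in for the ANN index."""
    d_all = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)    # (N, C) squared distances
    D = X.shape[1]
    new_best = []
    for n in range(X.shape[0]):
        prev = [k for k in best[n] if k in live]                # drop purged neighbors
        order = np.argsort(d_all[n])                            # stand-in for an ANN query
        R = [k for k in order if k in live][:m]                 # m nearest live components
        A = list(dict.fromkeys(prev + R))                       # previous + new neighbors
        # unnormalized responsibilities g_k = (pi_k / sigma_k^D) exp(-d^2 / (2 sigma_k^2))
        g = [pi[k] * sigma2[k] ** (-D / 2) * np.exp(-d_all[n, k] / (2 * sigma2[k]))
             for k in A]
        keep = [A[i] for i in np.argsort(g)[::-1][:m]]          # top-m by responsibility
        new_best.append(keep)
    return new_best
```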

5 Experiments
5.1 Setup
We explore here the behavior of AGM under different settings and parameters,
and compare its speed and performance against AKM [1] and RAKM [5] on large
scale vocabulary construction for image retrieval.
Datasets. We use two publicly available datasets, namely Oxford buildings¹ [1]
and world cities² [29]. The former consists of 5,062 images and annotation
(ground truth) for 11 different landmarks in Oxford, with 5 queries for each.
The annotated part of the latter is referred to as the Barcelona dataset and
consists of 927 images grouped into 17 landmarks in Barcelona, again with 5
queries for each. The distractor set of world cities consists of 2M images from
different cities; we use the first one million as distractors in our experiments.
Vocabularies. We extract SURF [30] features and descriptors from each
image. There are in total 10M descriptors in Oxford buildings and 550K in
Barcelona. The descriptors, of dimensionality D = 64, are the data points we
use for training. We construct specific vocabularies with all descriptors of the
smaller Barcelona dataset, which we then evaluate on the same dataset for parameter
tuning and speed comparisons. We also build generic vocabularies with
all 6.5M descriptors of an independent set of 15K images depicting urban scenes,
and compare on the larger Oxford buildings in the presence of distractors.
Evaluation protocol. Given each vocabulary built, we use FLANN to assign
descriptors, keeping only one NN under the Euclidean distance. We apply soft
assignment on the query side only, as in [21], keeping the first 1, 3, or 5 NNs. These
are found as in [15] with σ = 0.01. We rank by dot product on ℓ2-normalized BoW
histograms with tf-idf weighting. We measure training time in D-dimensional
vector operations (vop) per data point and retrieval performance in mean average
precision (mAP) over all queries. We use our own C++ implementations of
AKM and RAKM.
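As a point of reference, a minimal sketch of this scoring scheme (idf-weighted, ℓ2-normalized BoW histograms ranked by dot product) is shown below; the helper names and the standard idf definition are ours, not from the paper.

```python
import numpy as np

def idf_weights(df, n_images):
    """Standard inverse document frequency per visual word, from document
    frequencies df (number of images containing each word)."""
    return np.log(n_images / (1.0 + df))

def bow_histogram(assignments, K, idf):
    """idf-weighted, L2-normalized bag-of-words histogram for one image,
    given the visual-word index assigned to each of its descriptors."""
    h = np.bincount(assignments, minlength=K).astype(float) * idf
    n = np.linalg.norm(h)
    return h / n if n > 0 else h

def rank_images(query_hist, db_hists):
    """Rank database images by dot product with the query histogram,
    best match first."""
    return np.argsort(db_hists @ query_hist)[::-1]
```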

5.2 Tuning
We choose Barcelona for parameter tuning because, combined with descriptor
assignment, it would be prohibitive to use a larger dataset. It turns out, however,
that AGM outperforms the state of the art at much larger scale with the
same parameters. The behavior of EGM is controlled by expansion factor λ and
overlap threshold τ; AGM is also controlled by memory m that determines ANN
precision and cost, considered in section 5.3.
¹ http://www.robots.ox.ac.uk/~vgg/data/oxbuildings/
² http://image.ntua.gr/iva/datasets/wc/
[Figure 3; left legend: λ = 0.00, 0.15, 0.20, 0.25; right legend: iterations 2, 5, 10, 20]

Fig. 3. Barcelona-specific parameter tuning. (left) mAP performance versus iteration
during learning, for varying expansion factor λ and fixed τ = 0.5. (right) mAP versus
overlap threshold τ for different iterations, with fixed λ = 0.2.
Figure 3 (left) shows mAP against number of iterations for different values
of λ. Compared to λ = 0, it is apparent that expansion boosts convergence rate.
However, the effect is only temporary above λ = 0.2, and performance eventually
drops, apparently due to over-expansion. Now, τ controls purging, and we expect
τ ≥ 0.5 as pointed out in section 3.2. Since $\rho_{i,K}$ is normalized, τ = 0.5 appears to
be a good choice, as component i explains itself at least as much as K in (12).
Figure 3 (right) shows that the higher τ the better initially, but eventually
τ = 0.55 is clearly optimal, being somewhat stricter than expected. We choose
λ = 0.2 and τ = 0.55 for the remaining experiments.

5.3 Comparisons
We use FLANN [17] for all methods, with 15 trees and precision controlled
by checks, i.e. the number of leaves checked per ANN query. We use 1,000
checks during assignment, and less during learning. Approximation in AGM is
controlled by memory m, and we need m FLANN checks plus at most m more
distance computations in Algorithm 2, for a total of 2m vector operations per
iteration. We therefore use 2m checks for AKM and RAKM in comparisons.
Figure 4 (left) compares all methods on convergence for m = 50 and m = 100,
where AKM/RAKM are trained for 80K vocabularies, found to be optimal.
AGM not only converges as fast as AKM and RAKM, but also outperforms
them for the same learning time. AKM and RAKM are only better in the first
few iterations because, recall, AGM initializes from all points and needs a
few iterations to reach a reasonable vocabulary size. The relative performance
of RAKM versus AKM is not quite as expected by [5]; this may be partly due
to random initialization, a standing issue for k-means that AGM is free of, and
partly because performance in [5] is measured in distortion rather than mAP in
retrieval. It is also interesting that, unlike the other methods, AGM appears to
improve in mAP for lower m.
[Figure 4; left legend: AGM-50, AKM-100, RAKM-100, AGM-100, AKM-200, RAKM-200; right legend: AGM-1, AGM-3, AGM-5, RAKM-1, RAKM-3, RAKM-5]

Fig. 4. (Left) Barcelona-specific mAP versus learning time, measured in vector operations
(vop) per data point, for AGM, AKM and RAKM under varying approximation
levels, measured in FLANN checks. There are 5 measurements on each curve, corresponding,
from left to right, to iterations 5, 10, 20, 30 and 40. (Right) Oxford buildings
generic mAP in the presence of up to 1 million distractor images for AGM and RAKM,
using query-side soft assignment with 1, 3 and 5 NNs.
Table 1. mAP comparisons for generic vocabularies of different sizes on Oxford Buildings
with a varying number of distractors, using 100/200/200 FLANN checks for
AGM/AKM/RAKM respectively, 40 iterations for AKM/RAKM, and 15 for AGM.

Method           RAKM                                                AKM     AGM
Vocabulary       100K   200K   350K   500K   550K   600K   700K     550K    857K
No distractors   0.430  0.464  0.471  0.479  0.486  0.485  0.476    0.485   0.492
20K distractors  0.412  0.427  0.439  0.440  0.448  0.441  0.437    0.447   0.459
For large scale image retrieval, we train generic vocabularies on an independent
dataset of 15K images and 6.5M descriptors, and evaluate on Oxford buildings
in the presence of up to one million distractors from world cities. Because
experiments are very expensive, we first choose the best competing method for
up to 20K distractors as shown in Table 1. We use m = 100 in this experiment,
with 40 iterations for AKM/RAKM, and only 15 for AGM. This is because
we are now focusing on mAP performance rather than learning speed, and our
choices clearly favor AKM/RAKM. It appears that 550K is the best size for
RAKM, and AKM is more or less equivalent as expected, since their difference
is in speed. AGM is slightly better with its vocabulary size C = 857K being
automatically inferred, keeping λ = 0.2, τ = 0.55 as explained in section 5.2.
Keeping exactly the same settings, we extend the experiment up to 1M distractors
for the RAKM 550K and AGM 857K vocabularies under query-side
soft-assignment, as depicted in Figure 4 (right). Without any further tuning
compared to that carried out on Barcelona, AGM maintains a clear margin of
up to 3.5% in mAP, e.g. 0.303 (0.338) for RAKM (AGM) at 1M distractors.
The query time of RAKM (AGM) is 354ms (342ms) on average for the full 1M
distractor index, so there appears to be no issue with the balance of cluster population.
Spatial verification and re-ranking further increases mAP performance
to 0.387 (0.411) for RAKM (AGM), at an additional cost of 323ms (328ms) using
Hough pyramid matching [29] on a shortlist of 200 images at 1M distractors.
Variants of Hamming embedding [24], advanced scoring or query expansion are
complementary, hence expected to further boost performance.
6 Discussion
Grouping 10⁷ points into 10⁶ clusters in a space of 10² dimensions is certainly
a difficult problem. Assigning one or more visual words to 10³ descriptors from
each of 10⁶ images and repeating for a number of different settings and competing
methods on a single machine has proved even more challenging. Nevertheless,
we have managed to tune the algorithm on one dataset and get competitive
performance on a different one with a set of parameters that work even on our
very first two-dimensional example. In most alternatives one needs to tune at
least the vocabulary size. Even with spherical components, the added flexibility
of Gaussian mixtures appears to boost discriminative power as measured by
retrieval performance. Yet, learning is as fast as approximate k-means, both in
terms of EM iterations and underlying vector operations for NN search.
Our solution appears to avoid both the singularities and the overlapping components
that are inherent in ML estimation of GMMs. Still, it would be interesting
to consider a variational approach [3] and study the behavior of our purging
and expanding schemes with different priors. Seeking visual synonyms towards
arbitrarily-shaped clusters without expensive sets of parameters is another direction,
either with or without training data as in [23]. More challenging would be
to investigate similar ideas in a supervised setting, e.g. for classification. More
details, including software, can be found at our project home page³.
References
1. Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Object retrieval with
large vocabularies and fast spatial matching. In: CVPR (2007)
2. Nistér, D., Stewénius, H.: Scalable recognition with a vocabulary tree. In: CVPR
(2006)
3. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer (2006)
4. Perronnin, F.: Universal and adapted vocabularies for generic visual categorization.
PAMI 30(7), 1243–1256 (2008)
5. Li, D., Yang, L., Hua, X.S., Zhang, H.J.: Large-scale robust visual codebook construction.
ACM Multimedia (2010)
6. Boiman, O., Shechtman, E., Irani, M.: In defense of nearest-neighbor based image
classification. In: CVPR (2008)
³ http://image.ntua.gr/iva/research/agm/
7. Jurie, F., Triggs, B.: Creating efficient codebooks for visual recognition. In: ICCV
(2005)
8. Leibe, B., Mikolajczyk, K., Schiele, B.: Efficient clustering and matching for object
class recognition. In: BMVC (2006)
9. Fulkerson, B., Vedaldi, A., Soatto, S.: Localizing Objects with Smart Dictionaries.
In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302,
pp. 179–192. Springer, Heidelberg (2008)
10. Winn, J., Criminisi, A., Minka, T.: Object categorization by learned universal
visual dictionary. In: ICCV (2005)
11. Wu, J., Rehg, J.M.: Beyond the Euclidean distance: Creating effective visual codebooks
using the histogram intersection kernel. In: ICCV (2009)
12. Agarwal, A., Triggs, B.: Hyperfeatures – Multilevel Local Coding for Visual Recognition.
In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006, Part I. LNCS,
vol. 3951, pp. 30–43. Springer, Heidelberg (2006)
13. Tuytelaars, T., Schmid, C.: Vector quantizing feature space with a regular lattice.
In: ICCV (October 2007)
14. Dong, W., Wang, Z., Charikar, M., Li, K.: Efficiently matching sets of features
with random histograms. ACM Multimedia (2008)
15. Philbin, J., Chum, O., Sivic, J., Isard, M., Zisserman, A.: Lost in quantization: Improving
particular object retrieval in large scale image databases. In: CVPR (2008)
16. Silpa-Anan, C., Hartley, R.: Optimised KD-trees for fast image descriptor matching.
In: CVPR (2008)
17. Muja, M., Lowe, D.: Fast approximate nearest neighbors with automatic algorithm
configuration. In: ICCV (2009)
18. Jégou, H., Douze, M., Schmid, C.: Product quantization for nearest neighbor
search. PAMI 33(1), 117–128 (2011)
19. van Gemert, J., Veenman, C., Smeulders, A., Geusebroek, J.: Visual word ambiguity.
PAMI 32(7), 1271–1283 (2010)
20. Liu, L., Wang, L., Liu, X.: In defense of soft-assignment coding. In: ICCV (2011)
21. Jégou, H., Douze, M., Schmid, C.: Improving bag-of-features for large scale image
search. IJCV 87(3), 316–336 (2010)
22. Lehmann, A., Leibe, B., van Gool, L.: PRISM: Principled implicit shape model.
In: BMVC (2009)
23. Mikulík, A., Perdoch, M., Chum, O., Matas, J.: Learning a Fine Vocabulary.
In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part III. LNCS,
vol. 6313, pp. 1–14. Springer, Heidelberg (2010)
24. Jégou, H., Douze, M., Schmid, C.: Hamming Embedding and Weak Geometric Consistency
for Large Scale Image Search. In: Forsyth, D., Torr, P., Zisserman, A. (eds.)
ECCV 2008, Part I. LNCS, vol. 5302, pp. 304–317. Springer, Heidelberg (2008)
25. Ueda, N., Nakano, R., Ghahramani, Z., Hinton, G.: SMEM algorithm for mixture
models. Neural Computation 12(9), 2109–2128 (2000)
26. Figueiredo, M., Jain, A.: Unsupervised learning of finite mixture models. IEEE
Transactions on Pattern Analysis and Machine Intelligence 24(3), 381–396 (2002)
27. Verbeek, J., Nunnink, J., Vlassis, N.: Accelerated EM-based clustering of large
data sets. Data Mining and Knowledge Discovery 13(3), 291–307 (2006)
28. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. Springer (2009)
29. Tolias, G., Avrithis, Y.: Speeded-up, relaxed spatial matching. In: ICCV (2011)
30. Bay, H., Tuytelaars, T., Van Gool, L.: SURF: Speeded Up Robust Features. In:
Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 404–417.
Springer, Heidelberg (2006)
Geodesic Saliency Using Background Priors

Yichen Wei, Fang Wen, Wangjiang Zhu, and Jian Sun

Microsoft Research Asia
{yichenw,fangwen,v-wazhu,jiansun}@microsoft.com
Abstract. Generic object level saliency detection is important for many
vision tasks. Previous approaches are mostly built on the prior that appearance
contrast between objects and backgrounds is high. Although
various computational models have been developed, the problem remains
challenging and huge behavioral discrepancies between previous
approaches can be observed. This suggests that the problem may still be
highly ill-posed by using this prior only.
In this work, we tackle the problem from a different viewpoint: we
focus more on the background instead of the object. We exploit two
common priors about backgrounds in natural images, namely boundary
and connectivity priors, to provide more clues for the problem. Accordingly,
we propose a novel saliency measure called geodesic saliency. It
is intuitive, easy to interpret and allows fast implementation. Furthermore,
it is complementary to previous approaches, because it benefits
more from background priors while previous approaches do not.
Evaluation on two databases validates that geodesic saliency achieves
superior results and outperforms previous approaches by a large margin,
in both accuracy and speed (2 ms per image). This illustrates that appropriate
prior exploitation is helpful for the ill-posed saliency detection
problem.
1 Introduction
The human vision system can rapidly and accurately identify important regions
in its visual field. In order to achieve such an ability in computer vision, extensive
research efforts have been conducted on bottom-up visual saliency analysis for
years. Early works [1, 2] in this field are mostly based on biologically inspired
models (where a human looks) and are evaluated on human eye fixation data [3, 4].
Many follow-up works are along this direction [5–7].
Recent years have witnessed more interest in object level saliency detection
(where the salient object is) [8–13] and evaluation is performed on human labeled
objects (bounding boxes [8] or foreground masks [9]). This new trend is motivated
by the increasing popularity of salient object based vision applications, such as
object aware image retargeting [14, 15], image cropping for browsing [16], object
segmentation for image editing [17] and recognition [18]. It has also been shown
that low level saliency cues are helpful to find generic objects irrespective
of their categories [19, 20]. In this work we focus on the object level saliency
detection problem in natural images.

Fig. 1. Saliency maps of previous representative contrast prior based methods, on three
example images of increasing complexity in objects and backgrounds. (a) Input images.
(b) Ground truth salient object masks. (c)-(f) Results from one local method [1] and
three global methods [9, 11, 12].

1.1 Contrast Prior in Previous Work


Due to the absence of high level knowledge, all bottom up saliency methods
rely on assumptions or priors1 on the properties of objects and backgrounds.
Arguably, the most fundamental assumption is that appearance contrast be-
tween object and background is high. This assumption states that a salient
image pixel/patch presents high contrast within a certain context. It is intuitive
and used in all saliency methods, explicitly or implicitly. In this paper we call it
contrast prior for conciseness.
Depending on the extent of the context where the contrast is computed, pre-
vious methods can be categorized as local methods [1, 6, 8, 7, 10] or global meth-
ods [21, 9, 22, 11, 12]. Local methods compute various contrast measures in a
local neighborhood of the pixel/patch, such as edge contrast [8], center-surround
discriminative power [7], center-surround dierences [1, 8, 13], curvature [10] and
self information [6].
Global methods use the entire image to compute the saliency of individual
pixels/patches. Some methods assume globally less frequent features are more
salient and use frequency analysis in the spectral domain [21, 9]. Other methods
compare each pixel/patch to all the others in the image and use the averaged
appearance dissimilarity as the saliency measure [22, 11, 12], and the averaging is
usually weighted by spatial distances between pixels/patches to take into account
the fact that salient pixels/patches are usually grouped together to form a salient
object.
Contrast prior based methods have achieved success in their own aspects, but
still have certain limitations. Typically, the boundaries of the salient object can
be found well, but the object interior is attenuated. This object attenuation
problem is observed in all local methods and some global methods. It is alleviated
1
We use assumption and prior interchangeably in this paper.
in global methods [11, 12], but these methods still have difficulties in highlighting
the entire object uniformly.
In essence, the salient object detection problem on general objects and backgrounds
is highly ill-posed. There still lacks a common definition of what saliency
is in the community, and simply using the contrast prior alone is unlikely to succeed.
While previous approaches are mostly based on their own understanding
of how the contrast prior should be implemented, huge behavioral discrepancies
between previous methods can be observed, and such discrepancies are
sometimes hard to understand.
This problem is exemplified in Figure 1, which shows results from previous
representative methods, one local method [1] and three global methods [9, 11, 12].
The three examples present increasing complexities in objects and backgrounds:
a simple object on a simple background; a more complex object on a more
complex background; two salient regions on a complex background with low
contrast. It is observed that the results from different methods vary significantly
from each other, even for the first extremely simple example, which would not be
ambiguous at all for most humans. This phenomenon is repeatedly observed in
many images, either simple or complex. This consolidates our belief that using
the contrast prior alone is insufficient for the ill-posed problem and more clues
should be exploited.
1.2 Background Priors in Our Approach


We tackle this problem from a different direction. In addition to asking what the salient object should look like, we ask the opposite question: what should the background look like? Intuitively, answering this question helps remove background clutter and in turn leads to better foreground detection. Let us look at the first image in Figure 1 as an example. The entire background
is a large and smoothly connected region, and it should not be considered as
foreground in any way, that is, its saliency should always be zero.
We propose to use two priors about common backgrounds in natural images,
namely boundary and connectivity priors. They are not fully exploited in previous
methods and are helpful to resolve previous problems that are otherwise difficult
to address.
The first prior comes from a basic rule of photographic composition, that is, most photographers will not crop salient objects along the view frame. In other words, the image boundary is mostly background. We call it the boundary prior. It is more general than the previously used center prior [8, 3], which assumes that the image center is more important, because salient objects can be placed off the center
(one-third rule in professional photography), but they seldom touch the image
boundary. This prior shares a similar spirit with the bounding box prior used
in interactive image segmentation [17, 23], where the user is asked to provide
a bounding box that encloses the object of interest. This prior is validated on
several salient object databases, as discussed in Section 4.1.
The second prior comes from the appearance characteristics of real-world back-
grounds in images, that is, background regions are usually large and homogeneous.

In other words, most image patches in the background can be easily connected
to each other. We call it the connectivity prior. Note that the connectivity is
in a piecewise manner. For example, sky and grass regions are homogeneous by
themselves, but patches between them cannot be easily connected. Also the back-
ground appearance homogeneity should be interpreted in terms of human per-
ception. For example, patches in the grass all look similar to humans, although
their pixel-wise intensities might be quite different. This prior is different from
the connectivity prior used in object segmentation [24, 25], which is assumed on
the spatial continuity of the object instead of the background. Sometimes this
prior is also supported by the fact that background regions are usually out of
focus during photography, and therefore are more blurred and smoother than
the object of interest.

2 Geodesic Saliency

Based on the contrast prior and the two background priors, we further observe that most background regions can be easily connected to the image boundary, while this is much harder for object regions. This suggests that we can define the saliency of an image patch as the length of its shortest path to the image boundary. However, this assumes that all boundary patches are background, which is not always realistic, as the salient object could be partially cropped at the boundary. This can be rectified by adding a virtual background node connected to all boundary
patches in the image graph and computing the saliency of boundary patches, as
described in Section 2.1. In this way, we propose a new and intuitive geodesic
saliency measure, that is, the saliency of an image patch is the length of its
shortest path to the virtual background node.
Figure 2 shows several shortest paths on image patches and the geodesic
saliency results. These results are more plausible than those in Figure 1.
Geodesic saliency has several advantages. Firstly, it is intuitive and easy to
interpret. It simultaneously exploits the three priors in an effective manner. The object attenuation problem of previous methods is significantly alleviated because all patches inside a homogeneous object region usually share similar shortest paths to the image boundary and therefore have similar saliency. As a result, it achieves superior results compared to previous approaches.
Secondly, it mostly benefits from background priors and uses the contrast
prior moderately. Therefore, it is complementary to methods that exploit the
contrast prior in a more sophisticated manner [22, 11, 12]. A combination of
geodesic saliency with such methods can usually improve both.
Lastly, it is easy to implement, fast, and suitable for subsequent applications. For different practical needs in the accuracy/speed trade-off, we propose two variants
of the geodesic saliency algorithm. For applications that require high speed, such
as interactive image retargeting [14], image thumbnail generation/cropping for
batch image browsing [16], and bounding box based object extraction [26, 19],
we propose to compute geodesic saliency using rectangular patches on an image
grid and an approximate shortest path algorithm [27]. We call this algorithm GS Grid. It runs in about 2 milliseconds for images of moderate size (400 × 400).

Fig. 2. (better viewed in color) Illustration of geodesic saliency (using regular patches). Top: three example images in Figure 1 and shortest paths of a few foreground (green) and background (magenta) image patches. Bottom: geodesic saliency results.
For applications that require high accuracy such as object segmentation [17, 18],
we propose to use superpixels [28] as image patches and an exact shortest path
algorithm. We call this algorithm GS Superpixel. It is slower (a few seconds) but
more accurate than GS Grid.
In Section 4, extensive evaluation on the widely used salient object databases
in [8, 9] and a recent more challenging database in [29, 30] shows that geodesic
saliency outperforms previous methods by a large margin, and it can be further
improved by combination with previous methods. In addition, the GS Grid al-
gorithm is significantly faster than all other methods and suitable for real-time
applications.

2.1 Algorithm
For an image, we build an undirected weighted graph G = \{V, E\}. The vertices are all image patches \{P_i\} plus a virtual background node B, i.e., V = \{P_i\} \cup \{B\}. There are two types of edges: internal edges connect all adjacent patches and boundary edges connect image boundary patches to the background node, i.e., E = \{(P_i, P_j) \mid P_i \text{ is adjacent to } P_j\} \cup \{(P_i, B) \mid P_i \text{ is on the image boundary}\}. The geodesic saliency of a patch P is the accumulated edge weight along the shortest path from P to the background node B on the graph G,

saliency(P) = \min_{P_1 = P, P_2, \ldots, P_n = B} \sum_{i=1}^{n-1} \mathrm{weight}(P_i, P_{i+1}), \quad \text{s.t. } (P_i, P_{i+1}) \in E.
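To make the definition concrete, the following sketch (ours, not the authors' code) builds the patch graph described above and computes the geodesic saliency of every patch with Dijkstra's algorithm; the internal and boundary edge weights are assumed to be computed as described in the rest of this section.

```python
import heapq

def geodesic_saliency(n_patches, adjacency, internal_w, boundary, boundary_w):
    """Shortest accumulated edge weight from each patch to a virtual
    background node B that is connected to all boundary patches.

    adjacency  : list of (i, j) index pairs of adjacent patches
    internal_w : dict {(i, j): weight} for internal edges (symmetric)
    boundary   : list of boundary patch indices
    boundary_w : dict {i: weight} of boundary edges (P_i, B)
    """
    # Build an adjacency list; node `n_patches` plays the role of B.
    graph = {v: [] for v in range(n_patches + 1)}
    for (i, j) in adjacency:
        w = internal_w[(i, j)]
        graph[i].append((j, w))
        graph[j].append((i, w))
    B = n_patches
    for i in boundary:
        graph[i].append((B, boundary_w[i]))
        graph[B].append((i, boundary_w[i]))

    # Dijkstra from the virtual background node; the resulting distances
    # are the geodesic saliency values of the patches.
    dist = [float('inf')] * (n_patches + 1)
    dist[B] = 0.0
    heap = [(0.0, B)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v, w in graph[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist[:n_patches]  # saliency(P_i) for each patch
```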


Fig. 3. Internal edge weight clipping alleviates the small weight accumulation problem.
(a) Input image. (b) Geodesic saliency (using superpixels) without weight clipping. (c)
Geodesic saliency (using superpixels) with weight clipping.

We first describe how to compute the edge weights, and then introduce two variants of the geodesic saliency algorithm: GS Grid is faster and GS Superpixel is more accurate.
Internal edge weight is the appearance distance between adjacent patches.
This distance measure should be consistent with human perception of how similar two patches look. This problem is not easy in itself. For a homogeneously textured background such as a road or grass, patches look almost identical, yet simple appearance distances such as color histogram distance are usually small but non-zero. This causes the small-weight-accumulation problem,
that is, many small weight edges can accumulate along a long path and form
undesirable high saliency values in the center of the background. See Figure 3(b)
for an example.
Instead of exploring complex features and sophisticated patch appearance distance measures, we take a simple and effective weight clipping approach to address this problem. The patch appearance distance is simply taken as the difference (normalized to [0, 1]) between the mean colors of two patches (in LAB color space). For each patch we pick its smallest appearance distance to all its neighbors, and then we select an insignificance distance threshold as the average value of all such smallest distances over all patches. If any distance is smaller than this threshold, it is considered insignificant and clipped to 0. Such computation of internal edge weights is very efficient, and its effectiveness is illustrated in Figure 3.
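A minimal sketch of this clipping step, assuming each patch is represented by its mean LAB color and `neighbors[i]` lists the patches adjacent to patch i (the function and variable names are ours, not the paper's):

```python
import numpy as np

def clipped_internal_weights(mean_lab, neighbors):
    """mean_lab : (n, 3) array of per-patch mean LAB colors
    neighbors : list of neighbor index lists, one per patch
    Returns a dict {(i, j): weight} of clipped internal edge weights."""
    n = len(mean_lab)
    dist = {}
    for i in range(n):
        for j in neighbors[i]:
            if i < j:
                dist[(i, j)] = np.linalg.norm(mean_lab[i] - mean_lab[j])
    # Normalize distances to [0, 1] (here: by the maximum over all edges,
    # one plausible choice of normalization).
    scale = max(dist.values(), default=1.0) or 1.0
    dist = {e: d / scale for e, d in dist.items()}

    # Smallest distance from each patch to any of its neighbors.
    smallest = np.full(n, np.inf)
    for (i, j), d in dist.items():
        smallest[i] = min(smallest[i], d)
        smallest[j] = min(smallest[j], d)
    threshold = smallest[np.isfinite(smallest)].mean()  # insignificance threshold

    # Clip insignificant distances to zero.
    return {e: (0.0 if d < threshold else d) for e, d in dist.items()}
```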
Boundary edge weight characterizes how likely a boundary patch is not back-
ground. When the boundary prior is strictly valid, all boundary patches are
background and such weights should be 0. However, this is too idealistic and
not robust. The whole object could be missed even if it only slightly touches the
image boundary, because all the patches on the object may be easily connected
to the object patch on the image boundary. See Figure 4(b) for examples.
We observe that when the salient object is partially cropped by the image
boundary, the boundary patches on the object are more salient than boundary
patches in the background. Therefore, the boundary edge weights computation
is treated as a one-dimensional saliency detection problem: given only the image boundary patches, compute the saliency of each boundary patch P_i as the weight of the boundary edge (P_i, B). This step makes the boundary prior and geodesic saliency more robust, as illustrated in Figure 4(c)(d).

Fig. 4. Boundary edge weight computation makes the boundary prior and geodesic saliency more robust. (a) Input image. (b) Geodesic saliency with boundary edge weights set to 0. (c) The boundary edge weights computed using a one-dimensional version of the saliency algorithm in [11]. (d) Geodesic saliency using the boundary edge weights in (c).
In principle, any previous image saliency method that can be reduced to a one-
dimensional version can be used for this problem. We use the algorithm in [11]
because it is also based on image patches and adapting it to using only boundary
patches is straightforward. For patch appearance distance in this method, we also
use the mean color difference.
The GS Grid algorithm uses rectangular image patches of 10 × 10 pixels on a regular image grid. Shortest paths for all patches are computed using the efficient geodesic distance transform [27]. Although this solution is only approximate, it is very close to the exact solution on a simple graph over an image grid. Because of its linear complexity in the number of graph nodes and its sequential memory access (therefore cache friendly), it is extremely fast and is also used in interactive image segmentation [31, 32]. Overall, the GS Grid algorithm runs in 2 ms for a 400 × 400 image and produces decent results.
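A rough sketch of the idea behind this raster-scan approximation (ours, not the authors' implementation of [27]): distances are seeded on the border patches from the boundary edge weights and relaxed by a few forward/backward sweeps over the 4-connected patch grid.

```python
import numpy as np

def gs_grid(edge_right, edge_down, boundary_cost, n_sweeps=3):
    """Approximate geodesic saliency on an H x W patch grid.

    edge_right[y, x]    : weight between patch (y, x) and (y, x+1)
    edge_down[y, x]     : weight between patch (y, x) and (y+1, x)
    boundary_cost[y, x] : boundary edge weight for border patches, np.inf elsewhere
    """
    H, W = boundary_cost.shape
    d = boundary_cost.astype(float)      # start from the virtual background node
    for _ in range(n_sweeps):
        # Forward sweep: relax from the left and top neighbors.
        for y in range(H):
            for x in range(W):
                if x > 0:
                    d[y, x] = min(d[y, x], d[y, x - 1] + edge_right[y, x - 1])
                if y > 0:
                    d[y, x] = min(d[y, x], d[y - 1, x] + edge_down[y - 1, x])
        # Backward sweep: relax from the right and bottom neighbors.
        for y in range(H - 1, -1, -1):
            for x in range(W - 1, -1, -1):
                if x < W - 1:
                    d[y, x] = min(d[y, x], d[y, x + 1] + edge_right[y, x])
                if y < H - 1:
                    d[y, x] = min(d[y, x], d[y + 1, x] + edge_down[y, x])
    return d
```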
The GS Superpixel algorithm uses irregular superpixels as image patches. It
is more accurate because superpixels are better aligned with object and back-
ground region boundaries than the regular patches in GS Grid, and the patch
appearance distance computation is more accurate. We use the superpixel seg-
mentation algorithm in [28] to produce superpixels of roughly 10 × 10 pixels.
It typically takes a few seconds for an image. Note that our approach is quite
insensitive to the superpixel algorithm so any faster method can be used.
The shortest paths of all image patches are computed by Dijkstra's algorithm for better accuracy. In spite of the super-linear complexity, the number of superpixels is usually small (around 1,500) and this step takes no more than a few
milliseconds.

3 Salient Object Databases

The MSRA database [8] is currently the largest salient object database and
provides object bounding boxes. The database in [9] is a 1,000-image subset of [8] and provides human-labeled object segmentation masks. Many recent object saliency detection methods [8–13, 16] are evaluated on these two databases.
Nevertheless, these databases have several limitations. All images contain only
a single salient object. Most objects are large and near the image center. Most
backgrounds are clean and present strong contrast with the objects. A few ex-
ample images are shown in Figure 8.
Recently a more challenging salient object database was introduced [30]. It is
based on the well-known 300-image Berkeley segmentation dataset [29]. Those images usually contain multiple foreground objects of different sizes and positions in the image. The appearance of objects and backgrounds is also more complex. See Figure 10 for a few example images. In the work of [30], seven subjects are
asked to label the foreground salient object masks and each subject can label
multiple objects in one image. For each object mask of each subject, a consistency
score is computed from the labeling of the other six subjects. However, [30] does
not provide a single foreground salient object mask as ground truth.
For our evaluation, we obtain a foreground mask in each image by remov-
ing all labeled objects whose consistency scores are smaller than a threshold
and combining the remaining objects' masks. This is reasonable because a small consistency score means there is divergence among the opinions of the seven subjects and the labeled object is probably less salient. We tried different thresh-
olds from 0.7 to 1 (1 is maximum consistency and means all subjects label exactly
the same mask for the object). We found that for most images the user labels are
quite consistent and resulting foreground masks are insensitive to this threshold.
Finally, we set the threshold as 0.7, trying to retain as many salient objects as
possible. See Figure 10 for several example foreground masks.
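As an illustration only, this thresholding-and-union step might look like the sketch below (the names `masks` and `scores` are hypothetical, not from the paper):

```python
import numpy as np

def ground_truth_mask(masks, scores, threshold=0.7):
    """masks : list of binary object masks (H x W arrays) labeled by the subjects
    scores: per-mask consistency scores in [0, 1]"""
    kept = [m for m, s in zip(masks, scores) if s >= threshold]
    if not kept:
        return np.zeros_like(masks[0], dtype=bool)
    return np.any(np.stack(kept), axis=0)   # union of the retained object masks
```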

4 Experiments

As explained above, besides the commonly used MSRA-1000 [9] and MSRA [8]
databases, we also use the Berkeley-300 database [30] in our experiments. The
latter is more challenging and has not been widely used in saliency detection
yet.
Not surprisingly, we found that all our results and discoveries on the MSRA
database are very similar to those found in MSRA-1000. Therefore we omit all
the results on the MSRA database in this paper.

4.1 Validation of Boundary Prior

In this experiment, we evaluate how likely the image boundary pixels are back-
ground. For each image, we compute the percentage of all background boundary
pixels in a band of 10 pixels in width (to be consistent with the patch size in
geodesic saliency) along the four image sides, using the ground truth foreground mask. The histograms of such percentages for all the images in the MSRA-1000 and Berkeley-300 databases are shown in Figure 5.

Fig. 5. Distribution of the percentage of background pixels along the image boundary on the MSRA-1000 database [9] (left) and the Berkeley-300 database [30] (right)
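A small sketch of this per-image statistic (ours; `gt` is assumed to be a binary foreground mask with 1 on the object):

```python
import numpy as np

def background_boundary_ratio(gt, band=10):
    """Fraction of pixels in a `band`-pixel-wide frame around the image
    that are background according to the ground-truth mask `gt`."""
    border = np.zeros(gt.shape, dtype=bool)
    border[:band, :] = True
    border[-band:, :] = True
    border[:, :band] = True
    border[:, -band:] = True
    return 1.0 - gt[border].astype(float).mean()   # 1 - foreground fraction
```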
As illustrated, the boundary prior is strongly valid in the MSRA-1000
database, which has 961 out of 1000 images with more than 98% boundary pixels
as background. The prior is also observed in the Berkeley-300 database, but not
as strongly. There are 186 out of 300 images with more than 70% boundary pixels
as background. This database is especially challenging for our approach as some
images indeed contain large salient objects cropped on the image boundary.

4.2 Evaluation of Geodesic Saliency


We compare the two geodesic saliency methods (GS_GD, short for GS Grid, and GS_SP, short for GS Superpixel) with eight previous approaches: Itti's method (IT) [1], the spectral residual approach (SR) [21], graph-based visual saliency (GB) [22], the frequency-tuned approach (FT) [9], context-aware saliency (CA) [11], Zhai's method (LC) [33], histogram-based contrast (HC), and region-based contrast (RC) [12]. Each method outputs a full resolution saliency
map that is normalized to range [0, 255]. Such methods are selected due to their
large diversity in computational models. For SR, FT, LC, HC and RC, we use
the implementation from [12]. For IT, GB and CA, we use the public imple-
mentation from the original authors.
To test whether geodesic saliency is complementary to previous methods, we
combine two methods by averaging their results. For example, the combined
method RC+GS GD produces a saliency map that is the average of the maps
from the RC and GS GD methods.
As in previous work, precision-recall curves are used to evaluate all meth-
ods, including the combined methods. Given a saliency map, a binary foreground
object mask is created using a threshold. Its precision and recall values are com-
puted with respect to the ground truth object mask. A precision-recall curve is
then obtained by varying the threshold from 0 to 255.
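A sketch of this evaluation protocol (ours, assuming the saliency map has already been scaled to [0, 255]):

```python
import numpy as np

def precision_recall_curve(saliency, gt):
    """saliency: map normalized to [0, 255]; gt: binary ground-truth mask.
    Returns precision and recall arrays for thresholds 0..255."""
    gt = gt.astype(bool)
    precisions, recalls = [], []
    for t in range(256):
        pred = saliency >= t                       # binarize the saliency map
        tp = np.logical_and(pred, gt).sum()        # true positive pixels
        precisions.append(tp / max(pred.sum(), 1))
        recalls.append(tp / max(gt.sum(), 1))
    return np.array(precisions), np.array(recalls)
```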


Fig. 6. (better viewed in color) Average precision-recall curves on the MSRA-1000 database [9]. Top: geodesic saliency is better than all other methods. Bottom: the combination of geodesic saliency with HC (RC) is better than both.

Results on the MSRA-1000 database [9] are shown in Figure 6 and we have
two conclusions: 1) geodesic saliency significantly outperforms previous methods; in particular, it can highlight the entire object uniformly and achieves high precision in the high-recall range. Figure 8 shows example images and results. 2) After combination with geodesic saliency, all eight previous methods are improved, and four of the combinations (FT, GB, HC, RC) are better than both of their constituent methods. In Figure 6 (bottom) we show results before and after combination for the two best combined methods, GS+HC and GS+RC.
To further understand how geodesic saliency differs from other methods, Figure 7 shows the saliency value distributions of all foreground and background pixels (using ground truth masks) for the GS_GD and RC methods. Clearly, the foreground and background are better separated by GS_GD than by RC.
Results on the Berkeley-300 database [30] are shown in Figure 9 and example results are shown in Figure 10. The accuracy of all methods is much lower, indicating that this database is much more difficult. We still draw similar conclusions about geodesic saliency: 1) it is better than previous methods; 2) after combination, all eight previous methods are improved, and four of the combinations (CA, IT, GB, RC) are better than both of their constituent methods. In Figure 9 (bottom) we show results before and after combination for the two best combined methods, GS+GB and GS+RC.


Fig. 7. Saliency value distributions of all foreground and background pixels in the MSRA-1000 database [9] for the GS_GD and RC methods

Image True Mask GS_GD GS_SP FT [9] CA [11] GB [22] RC [12]

Fig. 8. Example results of different methods on the MSRA-1000 database [9]

Running times of all the methods are summarized in Table 1. GS_GD is significantly faster than all the other methods and is useful for real-time applications.
The only important free parameter is the patch size. In experiments, we found that the best results are obtained when it is within [1/80, 1/20] of the image dimension. As most images are 400 × 400, we use 10 × 10 patches (roughly 1/40).
We also evaluate the effect of the weight clipping and boundary edge weight
computation by comparing the results with and without them. It turns out that
geodesic saliency without these components is still better than previous methods,
but by a very small margin. This shows that the background priors are indeed
useful, and these algorithm components can further improve the performance.

5 Discussions
Given the superior results and the fact that geodesic saliency is so intuitive, we conclude that appropriate exploitation of priors is helpful for the ill-posed saliency detection problem. Similar conclusions have been repeatedly drawn for many other ill-posed vision problems.


Fig. 9. (better viewed in color) Average precision-recall curves on the Berkeley-300 database [30]. Top: geodesic saliency is better than all the other methods. Bottom: the combination of geodesic saliency with GB (RC) is better than both.

Table 1. Average running time (milliseconds per image) of different methods, measured on an Intel 2.33 GHz CPU with 4 GB RAM. IT, GB, and CA use Matlab implementations, while the other methods use C++.

Method     GS_GD  GS_SP  IT [1]  SR [21]  GB [22]  FT [9]  CA [11]  LC [33]  HC [12]  RC [12]
Time (ms)    2.0   7438     483      34     1557     8.5    59327      9.6     10.1    134.5

As our approach is built on statistical priors, it inevitably fails when such priors are invalid. Figure 11 shows typical failure cases: objects significantly touching the image boundary and complex backgrounds. Nevertheless, the priors hold for most images and geodesic saliency performs well in general.
There are several directions to improve geodesic saliency: 1) alleviate its de-
pendency on the background priors; 2) compute saliency on the image boundary
in a better way, instead of solving a one-dimensional saliency detection problem
using only the image boundary; 3) combine with other methods in a way better than
our straightforward combination by averaging.
Finally, we would encourage future work to use more challenging databases
(e.g., Berkeley or even PASCAL) to advance the research in this field.

Image True Mask GS_GD GS_SP FT [9] CA [11] GB [22] RC [12]

Fig. 10. Example results of different methods on the Berkeley-300 database [30]

Fig. 11. Typical failure cases of geodesic saliency: (left) the salient object is significantly cropped at the image boundary and identified as background; (right) small isolated background regions are identified as object

References
1. Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for
rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine In-
telligence 20, 1254–1259 (1998)
2. Parkhurst, D., Law, K., Niebur, E.: Modeling the role of saliency in the allocation
of overt visual attention. Vision Research (2002)
3. Judd, T., Ehinger, K., Durand, F., Torralba, A.: Learning to predict where humans
look. In: ICCV (2009)
4. Ramanathan, S., Katti, H., Sebe, N., Kankanhalli, M., Chua, T.-S.: An Eye Fix-
ation Database for Saliency Detection in Images. In: Daniilidis, K., Maragos, P.,
Paragios, N. (eds.) ECCV 2010, Part IV. LNCS, vol. 6314, pp. 30–43. Springer,
Heidelberg (2010)
5. Itti, L., Baldi, P.: Bayesian surprise attracts human attention. In: NIPS (2005)
6. Bruce, N.D.B., Tsotsos, K.: Saliency based on information maximization. In: NIPS
(2005)
7. Gao, D., Mahadevan, V., Vasconcelos, N.: The discriminant center-surround hy-
pothesis for bottom-up saliency. In: NIPS (2007)
8. Liu, T., Sun, J., Zheng, N., Tang, X., Shum, H.: Learning to detect a salient object.
In: CVPR (2007)

9. Achanta, R., Hemami, S., Estrada, F., Susstrunk, S.: Frequency-tuned salient re-
gion detection. In: CVPR (2009)
10. Valenti, R., Sebe, N., Gevers, T.: Image saliency by isocentric curvedness and color.
In: ICCV (2009)
11. Goferman, S., Manor, L., Tal, A.: Context-aware saliency detection. In: CVPR (2010)
12. Cheng, M., Zhang, G., Mitra, N., Huang, X., Hu, S.: Global contrast based salient
region detection. In: CVPR (2011)
13. Klein, D., Frintrop, S.: Center-surround divergence of feature statistics for salient
object detection. In: ICCV (2011)
14. Rubinstein, M., Guterrez, D., Sorkine, O., Shamir, A.: A comparative study of
image retargeting. In: SIGGRAPH Asia (2010)
15. Sun, J., Ling, H.: Scale and object aware image retargeting for thumbnail browsing.
In: ICCV (2011)
16. Marchesotti, L., Cifarelli, C., Csurka, G.: A framework for visual saliency detection
with applications to image thumbnailing. In: ICCV (2009)
17. Lempitsky, V., Kohli, P., Rother, C., Sharp, T.: Image segmentation with a bound-
ing box prior. In: ICCV (2009)
18. Wang, L., Xue, J., Zheng, N., Hua, G.: Automatic salient object extraction with
contextual cue. In: ICCV (2011)
19. Alexe, B., Deselaers, T., Ferrari, V.: What is an object. In: CVPR (2010)
20. Feng, J., Wei, Y., Tao, L., Zhang, C., Sun, J.: Salient object detection by compo-
sition. In: ICCV (2011)
21. Hou, X., Zhang, L.: Saliency detection: A spectral residual approach. In: CVPR (2007)
22. Harel, J., Koch, C., Perona, P.: Graph-based visual saliency. In: NIPS (2006)
23. Grady, L., Jolly, M., Seitz, A.: Segmentation from a box. In: ICCV (2011)
24. Vicente, S., Kolmogorov, V., Rother, C.: Graph cut based image segmentation with
connectivity priors. In: CVPR (2008)
25. Nowozin, S., Lampert, C.H.: Global connectivity potentials for random eld mod-
els. In: CVPR (2009)
26. Butko, N.J., Movellan, J.R.: Optimal scanning for faster object detection. In:
CVPR (2009)
27. Toivanen, P.J.: New geodesic distance transforms for gray-scale images. Pattern
Recognition Letters 17, 437–450 (1996)
28. Veksler, O., Boykov, Y., Mehrani, P.: Superpixels and Supervoxels in an En-
ergy Optimization Framework. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.)
ECCV 2010, Part V. LNCS, vol. 6315, pp. 211–224. Springer, Heidelberg (2010)
29. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural
images and its application to evaluating segmentation algorithms and measuring
ecological statistics. In: ICCV (2001),
http://www.eecs.berkeley.edu/Research/Projects/CS/vision/bsds/
30. Movahedi, V., Elder, J.H.: Design and perceptual validation of performance mea-
sures for salient object segmentation. In: IEEE Computer Society Workshop on
Perceptual Organization in Computer Vision (2010),
http://elderlab.yorku.ca/~vida/SOD/index.html
31. Bai, X., Sapiro, G.: A geodesic framework for fast interactive image and video
segmentation and matting. In: ICCV (2007)
32. Criminisi, A., Sharp, T., Blake, A.: GeoS: Geodesic Image Segmentation. In:
Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302,
pp. 99–112. Springer, Heidelberg (2008)
33. Zhai, Y., Shah, M.: Visual attention detection in video sequences using spatiotem-
poral cues. ACM Multimedia (2006)
Joint Face Alignment
with Non-parametric Shape Models

Brandon M. Smith and Li Zhang

University of Wisconsin, Madison


http://www.cs.wisc.edu/~lizhang/projects/joint-align/

Abstract. We present a joint face alignment technique that takes a set of images as input and produces a set of shape- and appearance-consistent face alignments as output. Our method is an extension of the recent localization method of Belhumeur et al. [1], which combines the output of local detectors with a non-parametric set of face shape models. We are inspired by the recent joint alignment method of Zhao et al. [20], which employs a modified Active Appearance Model (AAM) approach
to align a batch of images. We introduce an approach for simultaneously
optimizing both a local appearance constraint, which couples the local
estimates between multiple images, and a global shape constraint, which
couples landmarks and images across the image set. In video sequences,
our method greatly improves the temporal stability of landmark esti-
mates without compromising accuracy relative to ground truth.

1 Introduction

Face alignment is an important problem in computer vision [1,5,9,15,16,20] with many compelling applications, including performance-driven animation [10], automatic face replacement [2], and as a pre-process for face recognition and verification. As digital cameras become cheaper and more ubiquitous, and visual
media-sharing websites like Flickr, Picasa, Facebook, and YouTube become more
popular, it is increasingly convenient to perform face alignment on batches of
images for other applications such as face image retrieval [17] and digital media
management and exploration [11].
Intuitively, by using the additional information provided by multiple images
of the same face, we can better handle a range of challenging conditions, such as
partial occlusion, pose and illumination changes, image blur, and image noise.
This intuition has been used in the face tracking literature, for example, to
impose local appearance constraints across multiple video frames [14,21].
Inspired by this intuition, we propose a joint face alignment technique that
takes a set of images as input and produces a set of shape- and appearance-
consistent face alignments as output. By appearance-consistent we mean that
the local appearance at each landmark estimate is more similar across input im-
ages, and by shape-consistent we mean the spatial arrangements of landmarks
on the input faces are more consistent. We might expect that the final set of


alignments will drift somewhat from ground truth in order to achieve an inter-
nally consistent solution among input images. However, we show experimentally
that our approach does not sacrifice alignment accuracy to achieve consistency.

2 Background
Non-rigid face alignment algorithms can be divided into two categories: holistic
and local methods. Our approach is most directly inspired by a recent holistic
method [20] that operates on a batch of input images simultaneously, and a
recent local method [1] that combines the output of local detectors with a non-
parametric set of shape models. In this section, we highlight some key differences
between holistic and local methods, and give a brief overview of [20] and [1].

2.1 Holistic Methods


In the domain of non-rigid alignment, Active Appearance Models (AAMs) [5]
are among the most popular, with many recent works [3,6,7,15,20] relying on
the AAM framework. Broadly, AAMs model both the overall appearance \Lambda and shape X of the face linearly:

X = X_0 + \sum_{m=1}^{M} p_m X_m, \qquad \Lambda = \Lambda_0 + \sum_{l=1}^{L} \lambda_l \Lambda_l, \tag{1}

where p_m and \lambda_l are the m-th shape and l-th appearance parameters (or loadings), respectively; X_m and \Lambda_l capture the top M shape and L appearance modes, respectively; and X_0 and \Lambda_0 are the average shape and appearance vectors, respectively. The goal is to minimize the difference between the target image and the model:

\min_{p, \lambda} \; \Big\| \Lambda_0 + \sum_{l=1}^{L} \lambda_l \Lambda_l - W(I; p) \Big\|_2^2, \tag{2}
where W warps the image I in a piecewise linear fashion using the shape pa-
rameters p.
One of the major challenges in applying AAMs is that it is difficult to adequately synthesize the appearance of a new face using a linear model. As demonstrated experimentally by Gross et al. [6], it is much more difficult to build a generic appearance model than a person-specific one. This is known as the generalization problem.
To overcome this problem, Zhao et al. [20] introduced a joint alignment technique in which a batch of person-specific images are simultaneously aligned in a modified AAM setting. Intuitively, their approach aims to simultaneously (1) identify the person-specific appearance space of the input images, which is assumed to be linear and low-dimensional, and (2) align the person-specific appearance space with the generic appearance space, which are assumed to be proximate rather than distant. This joint approach was shown to produce excellent results on a wide variety of real-world images from the internet. However,

the approach breaks down under several common conditions, including significant occlusion or shadow, image degradation, and outliers. Like Zhao et al., we
operate on a set of input images jointly, but our approach is more robust to
these conditions.

2.2 Overview of Zhao et al. [20]


Zhao et al. [20] jointly align a batch of images by extending the AAM framework:

\min_{p_j, \lambda_j} \; \sum_{j=1}^{J} \Big\| \Lambda_0 + \sum_{l=1}^{L} \lambda_{jl} \Lambda_l - W(I_j; p_j) \Big\|_2^2 + \beta \, \mathrm{rank}\Big( \sum_{j=1}^{J} W(I_j; p_j)\, e_j^T \Big), \tag{3}

where the sum is taken over J images, e_j is an indicator vector (i.e., the j-th element is one, all others are zero), \beta is a weight parameter that balances the first term against the second, and all other parts are defined as in Eq. (2). The rank term encourages the appearance of the input images to be linearly correlated; intuitively, the rank term favors a solution in which the input images are well-aligned with one another (not just the global model). Unfortunately, the rank term is non-convex and non-continuous, and minimizing it is an NP-hard problem. To make the problem more tractable, they borrow a trick from compressive sensing: they replace the rank term with its tightest convex relaxation, the nuclear norm \|X\|_*, which is defined as the sum of singular values, \|X\|_* = \sum_k \sigma_k(X). See [20] for more details.
Like [20], we incorporate a rank term in our optimization. However, because we
couple the local appearance of faces with global shape models, our rank term does
not operate directly on the holistic appearance of faces. Instead, it operates on
the simplied shape representation of faces, and encourages landmark estimates
that are spatially consistent across images. We encourage the local appearance
at each landmark estimate to be consistent across multiple images via a separate
term, which we discuss in Section 3.

2.3 Local Methods


Constrained Local Models (CLMs) [9,14,16,19] overcome many of the problems inherent in holistic methods. For example, CLMs have inherent computational advantages (e.g., opportunities for parallelization) [14], and reduced modeling complexity and sensitivity to illumination changes [16,19]. CLMs also generalize well to new face images, and can be made robust against other confounding effects such as reflectance, image blur, and occlusion [9].
CLMs model the appearance of the face locally via an ensemble of region
experts, or local detectors. A variety of local detectors have been proposed for
this purpose, including those based on discriminative classiers that operate
on local image patches [19], or feature descriptors such as SIFT [13]. These
local detectors generate a likelihood map around the current estimate of each
landmark location. The likelihood maps are then combined with an overall face

shape model to jointly recover the location of landmarks. Like AAMs, CLMs
typically model non-rigid shape variation linearly, as in Eq. (1).
Recently, Belhumeur et al. [1] proposed a different approach that avoids a parametric linear shape model, and instead combines the outputs of local detectors with a non-parametric set of global shape models. Belhumeur et al. showed excellent landmark localization results on single images of faces under challenging conditions, such as significant occlusions, shadows, and a wide range of pose
and expression variation. According to their experiments on real-world images
from the internet, their algorithm produced slightly more accurate results, on
average, compared to humans assigned to the same task.
Despite its impressive accuracy, we found that Belhumeur et al.'s approach
produces temporally inconsistent results in video. Our method makes use of the
same non-parametric shape model approach proposed by Belhumeur et al., but
we further constrain the local appearance and face shape in a joint alignment
setting. As our experiments show, we produce more consistent results, especially
in video. At the same time, we do not sacrice localization accuracy to achieve
this consistency.

2.4 Overview of Belhumeur et al. [1]


Formally, [1] aims to solve the following problem:
X^* = \arg\max_{X} P(X \mid D), \tag{4}

where X = \{x^1, x^2, \ldots, x^N\} gives the locations of N face landmarks (i.e., the eye corners, nose tip, etc.), and D = \{d^1, d^2, \ldots, d^N\} are local detector responses. In words, the goal is to find the landmark locations X that maximize the probability of X given the measurements from the set of local detectors.
In general, Eq. (4) is non-convex. Therefore, Belhumeur et al. employ a set of
approximations to make the problem more tractable. The first approximation is
to assume each X is generated by one of M shape exemplars, Xk , transformed
by some similarity transformation t; they call Xk,t a global model. This allows
P (X|D) to be expanded as
P(X \mid D) = \sum_{k=1}^{M} \int_{t \in T} P(X \mid X_{k,t}, D)\, P(X_{k,t} \mid D)\, dt, \tag{5}

where the collection of M global models X_{k,t} has been introduced and then marginalized out. By conditioning on the global model X_{k,t}, the locations of the parts x^i are assumed to be conditionally independent of one another, i.e., P(X \mid X_{k,t}, D) = \prod_{i=1}^{N} P(x^i \mid x^i_{k,t}, d^i). Bayes' rule is then applied to Eq. (5), and all probability terms that only depend on D and X_{k,t} are reduced to a constant C. This yields the following optimization problem:

X^* = \arg\max_{X} \; C \sum_{k=1}^{M} \int_{t \in T} \prod_{i=1}^{N} P(\Delta x^i_{k,t})\, P(x^i \mid d^i)\, dt, \tag{6}

where P(\Delta x^i_{k,t}) is a 2D Gaussian distribution that models how well the landmark location in the global model x^i_{k,t} fits the true location x^i, i.e., \Delta x^i_{k,t} = x^i_{k,t} - x^i. P(x^i \mid d^i) takes the form of a local response map for landmark i.
A RANSAC-like approach is used to generate a large number of global models X_{k,t}, which are evaluated using P(X_{k,t} \mid D). The set \mathcal{M} of the top M global models for which P(X_{k,t} \mid D) is greatest is used to approximate Eq. (6) as

X^* = \arg\max_{X} \sum_{k,t \in \mathcal{M}} \prod_{i=1}^{N} P(\Delta x^i_{k,t})\, P(x^i \mid d^i). \tag{7}

X^* is found by choosing the image location where the following sum is maximized:

x^{i*} = \arg\max_{x^i} \sum_{k,t \in \mathcal{M}} P(\Delta x^i_{k,t})\, P(x^i \mid d^i), \tag{8}

which is equivalent to solving for x^i by setting all P(\Delta x^{i'}_{k,t}) and P(x^{i'} \mid d^{i'}) to a constant in Eq. (6) for all i' \neq i. For more details, please see [1].

3 Our Approach
This section provides details of how we couple the shape and appearance infor-
mation from multiple related face images.

3.1 Single Image Alignment


Define w^i(x^i) as the accumulated response map for landmark i in Eq. (8),

w^i(x^i) = \sum_{k,t \in \mathcal{M}} P(\Delta x^i_{k,t})\, P(x^i \mid d^i). \tag{9}

As in [1], the value of w^i(x^i) is computed by multiplying the local detector output at x^i by a 2D Gaussian function centered at each x^i_{k,t}, and summing the resulting products. The Gaussian function is parameterized by a 2 × 2 covariance matrix that captures the spatial uncertainty of the global shape model fitting the true landmark locations; the covariances are computed as described in [1]. Note that Eq. (9) couples the local appearance information with global shape information.
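A possible way to assemble the accumulated response map of Eq. (9) is sketched below (our illustration, not the authors' code; `detector_map` stands for the local detector responses P(x^i | d^i) over a search window, `model_points` for the predictions x^i_{k,t} of the top global models, and `cov` for the 2 × 2 spatial covariance):

```python
import numpy as np

def accumulated_response(detector_map, grid_xy, model_points, cov):
    """detector_map : (H*W,) detector responses at the window locations
    grid_xy      : (H*W, 2) pixel coordinates of those locations
    model_points : (K, 2) landmark predictions of the top-K global models
    cov          : (2, 2) covariance of the Gaussian spatial prior
    """
    cov_inv = np.linalg.inv(cov)
    w = np.zeros_like(detector_map, dtype=float)
    for mu in model_points:
        diff = grid_xy - mu                                    # (H*W, 2)
        mahal = np.einsum('nd,dk,nk->n', diff, cov_inv, diff)  # squared Mahalanobis distance
        gauss = np.exp(-0.5 * mahal)                           # unnormalized Gaussian prior
        w += gauss * detector_map                              # Eq. (9): sum of products
    return w
```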
To estimate the location of each landmark, Belhumeur et al. [1] choose the location x^i that maximizes w^i(x^i). Because the global shape information is already incorporated via P(\Delta x^i_{k,t}), the final selection of the best x^i can be done independently of all other landmark locations x^{i'} for i' \neq i. Therefore, assuming the global shape models X_{k,t} are fixed after the generate-and-test procedure, maximizing the product of terms in Eq. (7) is equivalent to maximizing the sum of terms:

X^* = \arg\max_{X} \sum_{i=1}^{N} w^i(x^i). \tag{10}

3.2 Multiple Image Alignment

In a multiple image setting, we can trivially modify Eq. (10) to incorporate all images:

\{X_j^*\}_{j=1,\ldots,J} = \arg\max_{X_1, X_2, \ldots, X_J} \sum_{j=1}^{J} \sum_{i=1}^{N} w_j^i(x_j^i), \tag{11}

where w_j^i(x_j^i) corresponds to landmark i in image j. Note that the selection of each set of global models \mathcal{M}_j is still independent for each image j, and finding the best \{X_j^*\}_{j=1,\ldots,J} in Eq. (11) is equivalent to solving Eq. (10) for each image separately.

Enforcing Joint Appearance Consistency. Under this multiple image setting, we couple the appearance information of the input images by modifying Eq. (11) as follows:

\{X_j^*\}_{j=1,\ldots,J} = \arg\max_{X_1, X_2, \ldots, X_J} \frac{1}{J} \sum_{j=1}^{J} \frac{1}{N} \sum_{i=1}^{N} w_j^i(x_j^i)\, \psi\big(x_j^i, \{x_{j'}^i\}_{j' \in S_j}\big), \tag{12}

where

\psi\big(x_j^i, \{x_{j'}^i\}_{j' \in S_j}\big) = \frac{1}{J-1} \sum_{j' \in S_j} \omega_{j'}^i \exp\!\big( -\| s_j^i(x_j^i) - s_{j'}^i(x_{j'}^i) \|^2 \big), \tag{13}

S_j is the set of input images associated with, but not including, image j, and s_j^i(x_j^i) is a local feature descriptor centered at x_j^i. \omega_{j'}^i is a weight reflecting the confidence that s_{j'}^i(x_{j'}^i) describes the correct landmark location. In practice, we use \omega_{j'}^i = w_{j'}^i(x_{j'}^i). Intuitively, Eq. (12) is maximized when two conditions are simultaneously optimized: (1) the face shape coincides with positive measurements according to the local detectors, and (2) the local descriptors are consistent across input images.
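Under the notation above, the appearance-consistency term of Eq. (13) for one candidate location can be sketched as follows (ours; `descriptors` and `confidences` are hypothetical containers holding s^i_{j'}(x^i_{j'}) and ω^i_{j'} for the images in S_j):

```python
import numpy as np

def appearance_consistency(desc_j, descriptors, confidences):
    """Eq. (13)-style score: confidence-weighted similarity between the
    descriptor at a candidate location in image j and the descriptors at
    the current landmark estimates in the associated images S_j."""
    total = 0.0
    for desc, conf in zip(descriptors, confidences):
        total += conf * np.exp(-np.sum((desc_j - desc) ** 2))
    return total / max(len(descriptors), 1)    # the 1/(J-1) normalization
```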

Enforcing Joint Appearance and Shape Consistency. The optimization of Eq. (12) will yield landmark estimates that are locally more consistent. However,
the estimated face shapes may not be consistent across the input images. Shape
inconsistency can be due to several factors, including local appearance ambi-
guities and noise, and the randomness inherent in the search for global shape
models. This inconsistency is especially noticeable in video sequences, where the
landmark estimates appear to swim around their true location, or jitter along
image contours such as the chin line. We could trivially remove these artifacts by
forcing all of the face shape estimates to be the same, or by applying a low-pass
filter to the landmark trajectories in a video. Unfortunately, this yields a poor
solution that cannot adequately handle non-rigid deformations or fast non-rigid
motion. Instead, we aim to find a set of face shapes that (1) provide locally consistent landmark estimates, and (2) are linearly correlated, but still flexible
enough to handle a broad range of non-rigid deformations.

In order to incorporate a shape consistency constraint, we first change Eq. (12) to a minimization problem:

\{X_j^*\}_{j=1,\ldots,J} = \arg\min_{X_1, X_2, \ldots, X_J} \frac{1}{N\alpha} \sum_i \sum_j \sum_{j' \in S_j} \tilde{w}_{jj'}^i(x_j^i, x_{j'}^i), \tag{14}

where

\tilde{w}_{jj'}^i(x_j^i, x_{j'}^i) = -\, w_j^i(x_j^i)\, w_{j'}^i(x_{j'}^i)\, \exp\!\big( -\| s_j^i(x_j^i) - s_{j'}^i(x_{j'}^i) \|^2 \big) \tag{15}

and

\frac{1}{N\alpha} = \frac{1}{N} \cdot \frac{1}{J} \cdot \frac{1}{J-1}. \tag{16}
We then add a rank term that encourages the recovered shapes to be linearly correlated:

\{X_j^*\}_{j=1,\ldots,J} = \arg\min_{X_1, X_2, \ldots, X_J} \frac{1}{N\alpha} \sum_i \sum_j \sum_{j' \in S_j} \tilde{w}_{jj'}^i(x_j^i, x_{j'}^i) + \lambda\, \mathrm{rank}(\mathbf{X}), \tag{17}

where \lambda > 0 is a scalar weight that balances the two terms, and \mathbf{X} is formed by concatenating the face shape vectors into a 2N \times J matrix: \mathbf{X} = [X_1(:), X_2(:), \ldots, X_J(:)]. Eq. (17) is similar in form to Eq. (3). However, instead of encouraging the holistic face appearance to be linearly correlated across the input images, our rank term encourages the estimated face shapes to be linearly correlated.

Reformulation. As Zhao et al. [20] point out, the difficulty in incorporating a rank term is that it is non-convex and non-continuous, which makes minimizing Eq. (17) NP-hard. Instead, we follow the same approach as [20] and replace the rank(\mathbf{X}) term with its tightest convex relaxation, the nuclear norm \|\mathbf{X}\|_*.
At this point, it is convenient to introduce some additional notation. Let h_{ij} = 1, 2, \ldots, H_{ij} be an index that selects one of H_{ij} hypotheses for landmark i in image j (i.e., x_j^i[h_{ij}=1] selects the first location hypothesis, x_j^i[h_{ij}=2] selects the second, and so on). \tilde{w}_{jj'}^i(x_j^i, x_{j'}^i) in Eq. (17) can then be written \tilde{w}_{jj'}^i(x_j^i[h_{ij}], x_{j'}^i[h_{ij'}]), or more simply \tilde{w}_{jj'}^i(h_{ij}, h_{ij'}). Let \phi_j^i[h_{ij}] \in \{0, 1\} be an indicator variable that can either be on (1) or off (0) for each possible h_{ij}. Eq. (17) can then be reformulated into the following energy minimization problem:

\min_{\Phi, \mathbf{X}, \mathbf{Z}} \; E(\Phi) + \lambda \|[\mathbf{X}, \mathbf{Z}]\|_* \tag{18}
\text{s.t.} \quad \phi_j^i[h_{ij}] \in \{0, 1\} \tag{19}
\sum_{h_{ij}} \phi_j^i[h_{ij}] = 1 \tag{20}
A(\Phi) - \mathbf{X} = 0 \tag{21}
\mathbf{Z}_0 - \mathbf{Z} = 0, \tag{22}

where

E(\Phi) = \frac{1}{N\alpha} \sum_i \sum_j \sum_{j' \in S_j} \sum_{h_{ij}} \sum_{h_{ij'}} \phi_j^i[h_{ij}]\, \phi_{j'}^i[h_{ij'}]\, \tilde{w}_{jj'}^i(h_{ij}, h_{ij'}) \tag{23}

and \Phi contains all indicator variables \phi_j^i[h_{ij}] for j = 1, \ldots, J, i = 1, \ldots, N, and h_{ij} = 1, \ldots, H_{ij}; \mathbf{Z}_0 is formed by concatenating the global shape model vectors for all images into a 2N \times M matrix: \mathbf{Z}_0 = [Z_1, Z_2, \ldots, Z_M]; and A(\Phi) is a linear operator on the indicator variables \Phi, which produces a 2N \times J matrix of landmark locations. The intuition behind including \mathbf{Z}, which is comprised of true shape vectors, is that we want to ensure \mathbf{X} does not deviate significantly from the space of plausible face shapes.
To solve the optimization problem in Eq. (18), we use a strategy similar to [20]. That is, we adapt the augmented Lagrangian method to solve Eq. (18):

L(\Phi, \mathbf{X}, \mathbf{Z}, Y_X, Y_Z) = \tau\big( E(\Phi) + \lambda \|[\mathbf{X}, \mathbf{Z}]\|_* \big) + \tfrac{1}{2}\|[\mathbf{X}, \mathbf{Z}]\|_F^2 + \langle Y_X, A(\Phi) - \mathbf{X} \rangle + \langle Y_Z, \mathbf{Z}_0 - \mathbf{Z} \rangle, \tag{24}

where \|U\|_F = \sqrt{\langle U, U \rangle} is the Frobenius norm, \langle U, V \rangle = \mathrm{trace}(U^T V), and Y_X and Y_Z are Lagrangian multipliers. Here we have also made use of a result from [4], namely that solving \min \tau\|U\|_* + \tfrac{1}{2}\|U\|_F^2 subject to linear constraints converges to the same minimum as solving \min \|U\|_* for large enough \tau. We iterate among four update steps to solve Eq. (24):

Step 1: \Phi^{k+1} = \arg\min_{\Phi} L(\Phi, \mathbf{X}, \mathbf{Z}, Y_X, Y_Z) \tag{25}

Step 2: [\mathbf{X}^{k+1}, \mathbf{Z}^{k+1}] = \arg\min_{\mathbf{X}, \mathbf{Z}} \; \tau\lambda \|[\mathbf{X}, \mathbf{Z}]\|_* + \tfrac{1}{2}\big\|[\mathbf{X}, \mathbf{Z}] - [Y_X^k, Y_Z^k]\big\|_F^2 = \mathrm{shrink}\big([Y_X^k, Y_Z^k], \tau\lambda\big) \tag{26}

Step 3: \begin{bmatrix} Y_X^{k+1} \\ Y_Z^{k+1} \end{bmatrix} = \begin{bmatrix} Y_X^{k} \\ Y_Z^{k} \end{bmatrix} + \delta_k \begin{bmatrix} A(\Phi^k) - \mathbf{X}^k \\ \mathbf{Z}_0 - \mathbf{Z}^k \end{bmatrix} \tag{27}

Step 4: \delta_{k+1} = \rho\, \delta_k. \tag{28}

The shrink operator is a nonlinear function which applies a soft-thresholding rule at the given level to the singular values of the input matrix. See [4] for more details. Steps 2–4 can be easily implemented in MATLAB using a few lines of code. Step 1 is more complicated, so we give it some additional attention.
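For concreteness, a minimal sketch of the shrink (singular value soft-thresholding) operator, written here in Python rather than MATLAB and following the description in [4]:

```python
import numpy as np

def shrink(M, tau):
    """Soft-threshold the singular values of M at level tau."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_thr = np.maximum(s - tau, 0.0)           # soft-thresholding rule
    return U @ np.diag(s_thr) @ Vt             # rebuild the thresholded matrix
```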

Solving for \Phi. To find the \Phi^{k+1} that minimizes Eq. (25), we need consider only those terms in Eq. (24) that involve \Phi:

\Phi^{k+1} = \arg\min_{\Phi} \big\{ \tau E(\Phi) + \langle Y_X, A(\Phi) \rangle \big\}. \tag{29}
We solve Eq. (29) by breaking it into independent subproblems, one for each landmark i. Each subproblem is then solved using an iterative greedy approach:

Loop until convergence
  For landmark i = 1, 2, . . . , N
    For image j = 1, 2, . . . , J
      1. Assume all \phi_{j'}^i[h_{ij'}] for j' \in S_j are known and assign a single hypothesis h_{ij'} to each j' \in S_j.
      2. Extract the portion of \langle Y_X, A(\Phi) \rangle involving landmark i and image j; denote this portion \langle [Y_X]_{ij}, [A(\Phi)]_{ij} \rangle.
      3. Evaluate \langle [Y_X]_{ij}, [A(\Phi)]_{ij} \rangle + \frac{1}{N\alpha} \sum_{j' \in S_j} \tilde{w}_{jj'}^i(h_{ij}, h_{ij'}) (see Eq. (29)) for each h_{ij} = 1, . . . , H_{ij}.
      4. Choose the hypothesis h_{ij} that minimizes the sum above, and set the corresponding indicator variable \phi_j^i[h_{ij}] to one and all others to zero.

We remark that Step 1 above greatly simplifies Eq. (29), because we need only consider the small subset of terms in Eq. (23) for which \phi_{j'}^i[h_{ij'}] = 1. Furthermore, evaluating \langle [Y_X]_{ij}, [A(\Phi)]_{ij} \rangle involves relatively few operations.
To initialize \Phi, we solve Eq. (6) for each landmark and image independently, and then set the corresponding indicator variables in \Phi to one (all others are zero). In our experiments, this initialization is close to the final solution for the majority of landmarks. As a result, the algorithm above typically converges quickly, in 2–3 iterations.
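The greedy update above can be sketched as follows (our illustration with hypothetical data structures: `pairwise[(j, j2)]` holds the pairwise costs between the hypotheses of image j and the selected hypothesis of an associated image j2, `lagrange[j]` the per-hypothesis Lagrangian contribution, `S[j]` the associated image set, and `norm` the normalization constant):

```python
import numpy as np

def greedy_update(hypotheses, pairwise, lagrange, S, norm, n_iters=3):
    """Greedy coordinate update of the selected hypothesis of one landmark
    in every image.  hypotheses[j] is the currently selected hypothesis index
    in image j; pairwise[(j, j2)] is an (H_j x H_j2) cost matrix; lagrange[j]
    is a length-H_j vector."""
    J = len(hypotheses)
    for _ in range(n_iters):
        for j in range(J):
            costs = np.array(lagrange[j], dtype=float)        # Lagrangian part
            for j2 in S[j]:
                h2 = hypotheses[j2]                           # neighbors held fixed
                costs += pairwise[(j, j2)][:, h2] / norm      # pairwise cost terms
            hypotheses[j] = int(np.argmin(costs))             # keep the best hypothesis
    return hypotheses
```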

3.3 Implementation Details


To obtain w_j^i(x_j^i) in Eq. (9), we use the consensus of exemplars approach described in [1], with all parameters set according to [1] unless noted. Like [1], our landmark detectors are simple support vector machine (SVM) classifiers with grayscale SIFT [13] descriptors. We note that, for a typical 20-image input set, the detectors account for approximately 86% of our total computation. SIFT descriptors are also used to enforce joint appearance consistency in Eq. (15).
For efficiency reasons, we do not consider all possible hypotheses \phi_j^i[h_{ij}] when solving for \Phi. Instead, we consider only the hypotheses that give the largest H_{ij} values according to w_j^i(h_{ij}); we typically choose H_{ij} \approx 200 in our experiments. This is acceptable because most hypotheses within each local search window correspond to very low values of w_j^i(h_{ij}).
Parameter \tau is set such that the first 7–12 singular values are preserved by the shrink operator in Eq. (26). The step size \delta_0 is set to 1 in our experiments, with \rho = 0.75. \lambda should depend on two things: (1) confidence in the joint appearance constraint (i.e., \lambda should increase with better image quality, fewer occlusions, etc.), and (2) confidence in the shape consistency constraint (i.e., \lambda should decrease as the local appearance becomes more corrupted). Roughly speaking, \lambda = C\,\|[\mathbf{X}_0, \mathbf{Z}_0]\| / (\tau \delta_0), for an empirically determined constant C and initial states of \mathbf{X}, \mathbf{Z}, and \delta. We set C \approx 10 in our experiments.
The sum over j' \in S_j in Eq. (23) is straightforward for static image sets: for a given image j, sum over all other images in the input set. For videos, it changes slightly: S_j becomes the temporal neighborhood around frame j; S_j = \{j-5, \ldots, j-1, j+1, \ldots, j+5\} in our experiments.
Our algorithm requires an initial estimate of the landmark locations in order
to place the local search windows. For this purpose, we fit a canonical face shape
to the bounding box generated by the OpenCV Viola-Jones face detector [18].

Subject 002 Subject 017 Subject 055 Subject 070 Subject 83

Barack Obama Keanu Reeves Lucy Liu Miley Cyrus Oprah Winfrey

Fig. 1. Selected images from two face datasets used for evaluation: Multi-PIE [8] on
top and PubFig [12] on the bottom, with ground truth landmarks shown in green

4 Results and Discussion


We now discuss our results on a variety of challenging image sets and video se-
quences, and compare them with state-of-the-art alignment methods, namely [1]
and [16]. In general, we find that our method produces landmark localizations
that are more consistent at both the local level (i.e., the local appearance at
each landmark estimate is more similar across input images) and the global level
(i.e., the arrangement of landmarks is more consistent across input faces). At
the same time, we do not sacrice alignment accuracy relative to ground truth.

4.1 Experimental Datasets


We evaluate our method on two static image datasets and one video dataset.
As training data, we use the LFPW dataset [1], which includes many real-
world images from the internet along with hand-labeled landmarks. For testing
purposes, we use randomly selected images from the PubFig dataset [12]. In
PubFig, unlike LFPW, each subject is depicted across many images. In order
to perform a quantitative analysis, we manually labeled the subset of PubFig
images according to the LFPW arrangement. We also perform an evaluation on
the Multi-PIE dataset [8], some images of which are shown in Figure 1. Our video
dataset is composed of several video sequences downloaded from YouTube.

4.2 Experiments on Static Images


Ten sets of 20 images were used to evaluate our algorithm on static image sets: five from Multi-PIE and five from PubFig. Each set includes images of the
same subject under multiple expressions, illuminations, and poses; for reference,
the subjects are shown in Figure 1. Figure 2 shows a comparison between our
jointly aligned results, and the results generated by our implementations of

(Figure 2: per-landmark-group bars of mean error normalized by IOD for CLM, CoE, and our method on the five PubFig subjects (top) and the five Multi-PIE subjects (bottom).)

Fig. 2. This figure shows a comparison between our jointly aligned results, and the results generated by our implementations of Saragih et al.'s [16] algorithm (CLM), and Belhumeur et al.'s [1] algorithm (CoE) on five sets of images from the PubFig [12] dataset (top) and five sets of images from the Multi-PIE dataset [8] (bottom). Please see Section 4.2 for an explanation of these results.

Saragih et al.'s algorithm [16] and Belhumeur et al.'s Consensus of Exemplars (CoE) algorithm [1]. For a fair comparison, we used the same landmark detec-
tors (described in Section 3.3) across all algorithms. The errors shown are the
point-to-point distance from ground truth, normalized by the inter-ocular dis-
tance (IOD). As a reference, 0.05 is 2.75 pixels for a face with an IOD of 55
pixels.
Our method favors landmark localizations that are internally consistent among
images in each set with respect to both the local appearance and the global ar-
rangement of landmarks. To best satisfy these internal constraints, we might
expect to lose some accuracy relative to ground truth. However, we maintain
good localization accuracy similar to other state-of-the-art methods [2, 8] on
challenging static images. Furthermore, as Figure 3 illustrates, our results are
nearly identical even when an image is jointly aligned with entirely different sets
of input images.

4.3 Experiments on Video Sequences


The improvements demonstrated by our joint alignment approach are most dra-
matically seen in video sequences. Figure 4 shows a comparison between our
results, and the results generated by our implementations of Saragih et al.'s [16] algorithm (CLM), and Belhumeur et al.'s [1] algorithm (CoE), both applied
to each frame of the videos independently. In all cases, the same detectors were

Fig. 3. Test results using our method on PubFig images. Our alignment results are
nearly identical, even when the input image (shown larger than the rest) is jointly
aligned with entirely dierent sets of input images.

(Figure 4: for the Blanchett, Carrey, Bamford, Glass, and McAdams videos, the mean landmark acceleration in pixels/frame^2 (top row) and the mean error/IOD (bottom row) of CLM, CoE, and our method are plotted against the frame number.)

Fig. 4. This figure shows a comparison between our results, and the results generated by our implementations of Saragih et al.'s [16] algorithm (CLM), and Belhumeur et al.'s [1] algorithm (CoE), where both were applied to each frame of the videos inde-
pendently. Our results are generally both more temporally stable (as measured by the
average landmark acceleration in each frame), and slightly more accurate relative to
ground truth. Please see Section 4.3 for an explanation of these results.

used. The top row shows the mean acceleration magnitude at each frame, computed as \frac{1}{N} \sum_i \| x_j^i - \tfrac{1}{2}(x_{j-1}^i + x_{j+1}^i) \|_2, where x_j^i is the 2D location of landmark i in frame j. The bottom row shows the mean point-to-point error as a fraction of the inter-ocular distance (IOD). The errors were computed relative to ground truth locations, which were labeled manually at every 10th frame. Five challenging internet videos were used for this experiment.
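The stability metric can be computed as in the following sketch (ours; `landmarks` is assumed to be a T × N × 2 array of landmark locations over T frames):

```python
import numpy as np

def mean_acceleration(landmarks):
    """landmarks: (T, N, 2) array of N 2D landmarks over T frames.
    Returns the per-frame mean acceleration magnitude for frames 1..T-2."""
    prev, cur, nxt = landmarks[:-2], landmarks[1:-1], landmarks[2:]
    accel = cur - 0.5 * (prev + nxt)               # discrete second difference
    return np.linalg.norm(accel, axis=2).mean(axis=1)
```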

We notice that the CLM and CoE methods generate landmark estimates that
appear to swim or jump around their true location. Our approach generates
landmark estimates that are noticeably more stable over time. Meanwhile, we
do not sacrice landmark accuracy relative to ground truth (i.e., we do not
over-smooth the trajectories); in fact, we often achieve slightly more accurate
localizations compared to [1] and [16]. Figure 5 shows several video frames with
our landmark estimates overlaid in green. For a clearer demonstration, please
see our supplemental video.

Jim Carrey movie clip: exaggerated facial expressions

Maria Bamford video blog: occlusion from hair

Philip Glass lecture: occlusion from hand gesture and glasses

Fig. 5. Selected frames from a subset of our experimental videos with landmarks esti-
mated by our method shown in green. These video clips, downloaded from YouTube,
were chosen with several key challenges in mind, including dramatic expressions, large
head motion and pose variation, and occluded regions of the face.

5 Conclusions
We have described a new joint alignment approach that takes a batch of images
as input and produces a set of alignments that are more shape- and appearance-
consistent as output. In addition to introducing two constraints that favor shape
and appearance consistency, we have outlined an approach to optimize both of
them together in a joint alignment framework. Our method shows improvement
most dramatically on video sequences. Despite producing more temporally con-
sistent results, we do not sacrifice alignment accuracy relative to ground truth.

Acknowledgements. This work is supported in part by NSF IIS-0845916, NSF IIS-0916441, a Sloan Research Fellowship, and a Packard Fellowship for
Science and Engineering. Brandon Smith is also supported by an NSF graduate
fellowship.

References
1. Belhumeur, P.N., Jacobs, D.W., Kriegman, D.J., Kumar, N.: Localizing parts of faces using a consensus of exemplars. In: Computer Vision and Pattern Recognition, pp. 545-552 (2011)
2. Bitouk, D., Kumar, N., Dhillon, S., Belhumeur, P., Nayar, S.K.: Face swapping: automatically replacing faces in photographs. In: ACM SIGGRAPH (2008)
3. Blanz, V., Vetter, T.: A morphable model for the synthesis of 3D faces. In: ACM SIGGRAPH (1999)
4. Cai, J.F., Candes, E.J., Shen, Z.: A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization 20(4), 1956-1982 (2008)
5. Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active Appearance Models. In: Bubak, M., Hertzberger, B., Sloot, P.M.A. (eds.) HPCN-Europe 1998. LNCS, vol. 1401, pp. 484-498. Springer, Heidelberg (1998)
6. Gross, R., Matthews, I., Baker, S.: Generic vs. person specific active appearance models. Image and Vision Computing 23(11), 1080-1093 (2005)
7. Gross, R., Matthews, I., Baker, S.: Active appearance models with occlusion. Image and Vision Computing 24(6), 593-604 (2006)
8. Gross, R., Matthews, I., Cohn, J., Kanade, T., Baker, S.: Multi-PIE. In: International Conference on Automatic Face and Gesture Recognition (2008)
9. Gu, L., Kanade, T.: A Generative Shape Regularization Model for Robust Face Alignment. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 413-426. Springer, Heidelberg (2008)
10. Kemelmacher-Shlizerman, I., Sankar, A., Shechtman, E., Seitz, S.M.: Being John Malkovich. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part I. LNCS, vol. 6311, pp. 341-353. Springer, Heidelberg (2010)
11. Kemelmacher-Shlizerman, I., Shechtman, E., Garg, R., Seitz, S.M.: Exploring photobios. In: ACM SIGGRAPH (2011)
12. Kumar, N., Berg, A.C., Belhumeur, P.N., Nayar, S.K.: Attribute and simile classifiers for face verification. In: International Conference on Computer Vision, pp. 365-372 (2009)
13. Lowe, D.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91-110 (2004)
14. Lucey, S., Wang, Y., Saragih, J., Cohn, J.F.: Non-rigid face tracking with enforced convexity and local appearance consistency constraint. Image and Vision Computing 28(5), 781-789 (2010)
15. Matthews, I., Baker, S.: Active appearance models revisited. International Journal of Computer Vision 60(2), 135-164 (2004)
16. Saragih, J.M., Lucey, S., Cohn, J.F.: Face alignment through subspace constrained mean-shifts. In: International Conference on Computer Vision, pp. 1034-1041 (2009)
17. Smith, B.M., Zhu, S., Zhang, L.: Face image retrieval by shape manipulation. In: Computer Vision and Pattern Recognition, pp. 769-776 (2011)
18. Viola, P., Jones, M.J.: Robust real-time face detection. International Journal of Computer Vision 57(2), 137-154 (2004)
19. Wang, Y., Lucey, S., Cohn, J.F.: Enforcing convexity for improved alignment with constrained local models. In: Computer Vision and Pattern Recognition (2008)
20. Zhao, C., Cham, W.K., Wang, X.: Joint face alignment with a generic deformable face model. In: Computer Vision and Pattern Recognition, pp. 561-568 (2011)
21. Zhou, M., Liang, L., Sun, J., Wang, Y.: AAM based face tracking with temporal matching and face segmentation. In: Computer Vision and Pattern Recognition, pp. 701-708 (2010)
Discriminative Bayesian Active Shape Models

Pedro Martins, Rui Caseiro, Joao F. Henriques, and Jorge Batista

Institute of Systems and Robotics, University of Coimbra, Portugal


{pedromartins,ruicaseiro,henriques,batista}@isr.uc.pt

Abstract. This work presents a simple and very efficient solution to
align facial parts in unseen images. Our solution relies on a Point Distribution
Model (PDM) face model and a set of discriminant local detectors,
one for each facial landmark. The patch responses can be embedded into
a Bayesian inference problem, where the posterior distribution of the
global warp is inferred in a maximum a posteriori (MAP) sense. However,
previous formulations do not explicitly model the covariance of the
latent variables, which represents the confidence in the current solution.
In our Discriminative Bayesian Active Shape Model (DBASM) formulation,
the MAP global alignment is inferred by a Linear Dynamical
System (LDS) that takes this information into account. The Bayesian
paradigm provides an effective fitting strategy, since it combines in the
same framework both the shape prior and multiple sets of patch alignment
classifiers to further improve the accuracy. Extensive evaluations
were performed on several datasets, including the challenging Labeled
Faces in the Wild (LFW). Face parts descriptors were also evaluated,
including the recently proposed Minimum Output Sum of Squared Error
(MOSSE) filter. The proposed Bayesian optimization strategy improves
on the state of the art while using the same local detectors. We also show
that MOSSE filters further improve on these results.

1 Introduction
Deformable model fitting aims to find the parameters of a Point Distribution
Model (PDM) that best describe the object of interest in an image. Several fitting
strategies have been proposed, most of which can be categorized as being
either holistic (generative) or patch-based (discriminative). The holistic representations
[1][2] model the appearance of all image pixels describing the object.
By synthesizing the expected appearance template, a high registration accuracy
can be achieved. However, such a representation generalizes poorly when large
amounts of variability are involved, such as the human face under variations of
identity, expression, pose, lighting or non-rigid motion, due to the huge dimensionality
of the appearance representation (learned from limited data).
Recently, discriminative methods, such as the Constrained Local Model
(CLM) [4][5][6][7][8], have been proposed. These approaches can improve the
model's representation power, as it accounts only for local correlations between
pixel values. In this paradigm, both shape and appearance are combined by
constraining an ensemble of local feature detectors to lie within the subspace


Fig. 1. Examples of the DBASM global alignment on the LFW dataset [3]

spanned by the PDM. The CLM implements a two-step fitting strategy: a local
search and a global optimization. The first step performs an exhaustive local search
using a feature detector, obtaining response maps for each landmark. Then, the
global optimization finds the PDM parameters that jointly maximize the detection
responses. Each landmark detector generates a likelihood map by applying
local detectors to the neighborhood regions around the current estimate.
Some of the most popular optimization strategies propose to replace the true
response maps by simple parametric forms (Weighted Peak Responses [4], Gaussian
Responses [8], Mixture of Gaussians [9]) and perform the global optimization
over these forms instead of the original response maps. The detectors are
learned from training images of each of the object's landmarks. However, due to
their small local support and large appearance variation, they can suffer from
detection ambiguities. In [10] the authors attempt to deal with these ambiguities
by nonparametrically approximating the response maps using the mean-shift algorithm,
constrained to the PDM subspace (Subspace Constrained Mean-Shift -
SCMS). However, in the SCMS global optimization the PDM parameter update
is essentially a regularized projection of the mean-shift vector for each landmark
onto the subspace of plausible shape variations. Since a least squares projection
is used, the optimization is very sensitive to outliers (when the mean-shift output
is very far away from the correct landmark location). The patch responses can
be embedded into a Bayesian inference problem, where the posterior distribution
of the global warp can be inferred in a maximum a posteriori (MAP) sense. The
Bayesian paradigm provides an effective fitting strategy, since it combines in the
same framework both the shape prior (the PDM) and multiple sets of patch
alignment classifiers to further improve the accuracy.

1.1 Main Contributions

1. We present a novel and efficient Bayesian formulation to solve the MAP
global alignment problem (Discriminative Bayesian Active Shape Model -
DBASM). The main advantage of the proposed DBASM with respect to
previous Bayesian formulations is that we model the covariance of the latent
variables, which represents the confidence in the current parameter estimates,
i.e. DBASM explicitly maintains 2nd order statistics of the shape and pose

parameters, instead of assuming them to be constant. We show that the
posterior distribution of the global warp can be efficiently inferred using a
Linear Dynamical System (LDS) taking this information into account.
2. We aim to prove that solving the PDM using a Bayesian paradigm offers
superior performance versus the traditional first order forwards additive update
[4][8][9][10]. We confirm experimentally that the MAP parameter update
outperforms the standard optimization strategies, based on maximum
likelihood solutions (least squares). See Figures 5 and 6.
3. We present a comparison between several face parts descriptors, including
the recently proposed Minimum Output Sum of Squared Error (MOSSE)
filters [11]. The MOSSE maps aligned training patch examples to a desired
output, producing correlation filters that are notably stable. These filters
exhibit a high invariance to illumination, due to their null DC component.
Results show that the MOSSE outperforms the other detectors, being particularly
well-suited to the task of generic face alignment (Figures 2 and 4).
The remainder of the paper is organized as follows: Section 2 briefly explains the shape
model (PDM). Section 3 presents our Bayesian global optimization approach. Experimental
results comparing the fitting performances of several local detectors
(including the MOSSE filters) and several global optimization strategies are
shown in Section 4. Finally, Section 5 provides some conclusions.

2 The Shape Model - PDM

The shape s of a Point Distribution Model (PDM) is represented by the 2D vertex
locations of a mesh, with a 2v-dimensional vector s = (x_1, y_1, ..., x_v, y_v)^T. The
traditional way of building a PDM requires a set of shape-annotated images that
are previously aligned in scale, rotation and translation by Procrustes Analysis.
Applying a PCA to a set of aligned training examples, the shape can be expressed
by the linear parametric model

s = s_0 + Φ b_s + Ψ q    (1)

where s_0 is the mean shape (also referred to as the base mesh), Φ is the shape
subspace matrix holding n eigenvectors (retaining a user-defined variance, e.g.
95%), b_s is a vector of shape parameters, q contains the similarity pose parameters
and Ψ is a matrix holding four special eigenvectors [2] that linearly
model the 2D pose. From the probabilistic point of view, b_s follows a multivariate
Gaussian distribution b_s ∼ N(b_s | 0, Λ), with Λ = diag(λ_1, ..., λ_n), where λ_i
denotes the PCA eigenvalue of the i-th mode of deformation.
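As a concrete illustration of eq. 1, the sketch below instantiates a PDM shape from given parameters. This is a minimal numpy sketch under our own naming; Φ, Ψ and the eigenvalues are assumed to be precomputed as described above, and the sampling example at the end is only illustrative.

```python
import numpy as np

def pdm_shape(s0, Phi, Psi, b_s, q):
    """Instantiate a PDM shape (eq. 1): s = s0 + Phi b_s + Psi q.

    s0  : (2v,) mean shape (x1, y1, ..., xv, yv)
    Phi : (2v, n) non-rigid shape subspace (PCA eigenvectors)
    Psi : (2v, 4) similarity pose basis
    b_s : (n,) shape parameters;  q : (4,) pose parameters
    """
    return s0 + Phi @ b_s + Psi @ q

# Example: draw plausible non-rigid parameters from N(0, Lambda), where
# Lambda = diag(lambda_1, ..., lambda_n) holds the PCA eigenvalues.
# eigvals = np.array([...])   # PCA eigenvalues (assumed given)
# b_s = np.random.randn(eigvals.size) * np.sqrt(eigvals)
```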

3 Global PDM Optimization - DBASM

This section describes the proposed global optimization method (Discriminative
Bayesian Active Shape Models - DBASM). The deformable model fitting goal
(which follows the parametric form of eq. 1) is formulated as a global shape alignment
problem in a maximum a posteriori (MAP) sense.

[Figure 2 panels: (a) Example image and search regions; (b) F^{-1}{H*_i}, the local detector (MOSSE filter); (c) p_i(z_i), the response maps for the highlighted landmarks.]

Fig. 2. The DBASM combines a Point Distribution Model (PDM) and a set of discriminant
local detectors, one for each landmark. a) Image with the current mesh showing
the search region for some landmarks. b) The local detector (the MOSSE filter [11]
itself). c) Response maps for the corresponding highlighted landmarks. The DBASM
global optimization jointly combines all landmark response maps, in a MAP sense,
using 2nd order statistics of the shape and pose parameters.

3.1 MAP Formulation

Given a 2v vector of observed positions y, the goal is to find the optimal set of
parameters b_s that maximizes the posterior probability of y being its true position.
Using a Bayesian approach, the optimal shape parameters are defined as

b_s* = arg max_{b_s} p(b_s | y) ∝ p(y | b_s) p(b_s)    (2)

where y is the observed shape, p(y | b_s) is the likelihood term and p(b_s) is a
prior distribution over all possible configurations. Section 3.2 describes some
possible strategies to set the observed shape vector y.
The complexity of the problem in eq. 2 can be reduced by making some
simple assumptions. Firstly, conditional independence between landmarks can
be assumed simply by sampling each landmark independently. Secondly, it can
also be assumed that we have an approximate solution b to the true parameters
b_s. Combining these approximations, eq. 2 can be rewritten as

p(b | y) ∝ ( ∏_{i=1}^{v} p(y_i | b) ) p(b | b_{k-1})    (3)

where y_i is the i-th landmark's coordinates and b_{k-1} is the previous optimal
estimate of b.

The likelihood term, including the PDM model (eq. 1), becomes the
following convex energy function:

p(y | b) ∝ exp( -½ (y - (s_0 + Φb))^T Σ_y^{-1} (y - (s_0 + Φb)) )    (4)

where ∆y = y - s_0 is the difference between the observed and the mean shape and Σ_y
is the uncertainty of the spatial localization of the landmarks (a 2v × 2v block
diagonal covariance matrix). From the probabilistic point of view, the likelihood
term follows a Gaussian distribution given by

p(∆y | b) ∼ N(∆y | Φb, Σ_y).    (5)

The prior term, according to the approximations taken, can be written as

p(b_k | b_{k-1}) ∼ N(b_k | μ_b, Σ_b)    (6)

where μ_b = b_{k-1} and Σ_b = Λ + Σ_q. Λ is the shape parameter covariance
(a diagonal matrix with the PCA eigenvalues) and Σ_q is an additive dynamic noise
covariance (that can be estimated offline).
An important property of Bayesian inference is that, when the likelihood and
the prior are Gaussian distributions, the posterior is also Gaussian [12]. Following
the Bayes theorem for Gaussian variables, and considering p(b_k | b_{k-1}) a prior
Gaussian distribution for b_k and p(∆y | b_k) a likelihood Gaussian distribution, the
posterior distribution takes the form ([12], p. 90)

p(b_k | ∆y) ∼ N(b_k | μ, Σ)    (7)

Σ = (Σ_b^{-1} + Φ^T Σ_y^{-1} Φ)^{-1}    (8)

μ = Σ (Φ^T Σ_y^{-1} ∆y + Σ_b^{-1} μ_b).    (9)

Note that the conditional distribution p(∆y | b_k) has a mean that is a linear function
of b_k and a covariance which is independent of b_k. This could be a possible solution
to the global alignment optimization [13]. However, in practice, this is a naive approach
because it does not model the covariance of the latent variables, b_k, which
is crucial to account for the confidence in the current parameter estimates.
The MAP global alignment solution can be inferred by a Linear Dynamical
System (LDS). The LDS is the ideal technique to model the covariance of the
latent variables and to overcome the limitations of the naive approach. The LDS is a simple
approach that recursively computes the posterior probability using incoming
Gaussian measurements and a linear process model, taking into account all the
available measurements (the same requirements as our alignment problem). The state and
measurement equations of the LDS, according to the PDM alignment problem,
can be written as

b_k = A b_{k-1} + q    (10)

∆y = Φ b_k + r    (11)

where the current shape parameters b_k are the hidden state vector, q ∼ N(0, Σ_q)
is the additive dynamic noise, ∆y is the observed shape deviation, which is related
to the shape parameters by the linear relation Φ (eq. 1), and r is the additive
measurement noise, following r ∼ N(0, Σ_y). The previously estimated shape parameters
b_{k-1} are connected to the current parameters b_k by an identity relation
plus noise (A = I_n).
We highlight that the final step of the LDS derivation consists of a Bayesian
inference step [12] (using the Bayes theorem for Gaussian variables), where the
likelihood term is given by eq. 5 and the prior follows N(A μ^F_{k-1}, P_{k-1}) where

P_{k-1} = (Λ + Σ_q) + A Σ^F_{k-1} A^T.    (12)

From these equations we can see that the LDS keeps the uncertainty of the current
estimate of the shape parameters up to date. The LDS recursively computes
the mean and covariance of posterior distributions of the form

p(b_k | ∆y_k, ..., ∆y_0) ∼ N(b_k | μ^F_k, Σ^F_k)    (13)

with the posterior mean μ^F_k and covariance Σ^F_k given by the LDS formulas:

K = P_{k-1} Φ^T (Φ P_{k-1} Φ^T + Σ_y)^{-1}    (14)

μ^F_k = A μ^F_{k-1} + K (∆y - Φ A μ^F_{k-1}),    Σ^F_k = (I_n - KΦ) P_{k-1}.    (15)

Finally, the optimal shape parameters that maximize eq. 2 are given by μ^F_k. In
order to estimate the pose parameters, we also apply the LDS paradigm. The
difference is that, in this case, the state vector is given by q and the observation
matrix is Ψ. Algorithm 1 summarizes the proposed DBASM global
optimization.
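The sketch below walks through one LDS update of the shape parameters, following eqs. 12, 14 and 15 with identity dynamics (A = I_n), as also used in Algorithm 1. It is a minimal numpy sketch under our own naming; the dynamic noise covariance Sigma_q and the response-map covariance Sigma_y are assumed to be supplied by the local strategies of Section 3.2, and this is not a definitive implementation of the authors' code.

```python
import numpy as np

def lds_shape_update(mu_prev, Sigma_prev, dy, Phi, Lambda, Sigma_q, Sigma_y):
    """One Kalman-style LDS update of the shape parameters b (A = I).

    mu_prev, Sigma_prev : previous posterior mean (n,) and covariance (n, n)
    dy      : (2v,) observed shape deviation (y - s0)
    Phi     : (2v, n) shape subspace;  Lambda : (n, n) diag of PCA eigenvalues
    Sigma_q : (n, n) additive dynamic noise covariance
    Sigma_y : (2v, 2v) landmark localization covariance (block diagonal)
    """
    # Prediction covariance, cf. eq. (12) with A = I
    P = Lambda + Sigma_q + Sigma_prev
    # Kalman gain, cf. eq. (14)
    S = Phi @ P @ Phi.T + Sigma_y
    K = P @ Phi.T @ np.linalg.inv(S)
    # Posterior mean and covariance, cf. eq. (15)
    mu = mu_prev + K @ (dy - Phi @ mu_prev)
    Sigma = (np.eye(P.shape[0]) - K @ Phi) @ P
    return mu, Sigma
```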

3.2 Local Optimization Strategies

This section briefly describes several local strategies to represent the true response
maps by a probabilistic model (parametric and nonparametric). We also
describe how to extract from each probabilistic model the likelihood term of the
MAP formulation (the observed shape y and the uncertainty covariance Σ_y).
Let z_i = (x_i, y_i) be a candidate for the i-th landmark, y^c_i the current
landmark estimate, Ω_{y^c_i} an L × L patch centered at y^c_i, a_i a binary variable that
denotes correct landmark alignment, D_i the score of a generic local detector, and
I the target image up to a similarity transformation (typically the detector is
designed to operate at a given scale). The probability of pixel z_i being aligned
is given by

p_i(z_i) = p(a_i = 1 | I(z_i), D_i) = 1 / (1 + e^{-a_i D_i(I(z_i))})    (16)

where the detector score is converted to a probability using the logistic function.
The parameters y_i and Σ_{y_i} can be found by minimizing the expression [13]

arg min_{y_i, Σ_{y_i}} Σ_{z_i ∈ Ω_{y^c_i}} p_i(z_i) N(z_i | y_i, Σ_{y_i})    (17)

where several strategies can be used to perform this optimization.



Weighted Peak Response (WPR): The simplest solution is to take the spatial
location where the response map has the highest score [4]. The new landmark
position is then weighted by a factor that reflects the peak confidence. Formally,
the WPR solution is given by

y^WPR_i = arg max_{z_i ∈ Ω_{y^c_i}} p_i(z_i),    Σ^WPR_{y_i} = diag(p_i(y^WPR_i)^{-1})    (18)

which is equivalent to approximating each response map by an isotropic Gaussian
N(z_i | y^WPR_i, Σ^WPR_{y_i}).
Gaussian Response (GR): The previous approach was extended in [8] to
approximate the response maps by a full Gaussian distribution N(z_i | y^GR_i, Σ^GR_{y_i}).
This is equivalent to fitting a Gaussian density to weighted data.
Let d = Σ_{z_i ∈ Ω_{y^c_i}} p_i(z_i); the solution is given by

y^GR_i = (1/d) Σ_{z_i ∈ Ω_{y^c_i}} p_i(z_i) z_i,    Σ^GR_{y_i} = (1/(d-1)) Σ_{z_i ∈ Ω_{y^c_i}} p_i(z_i) (z_i - y^GR_i)(z_i - y^GR_i)^T.    (19)
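A minimal numpy sketch of the GR approximation of eq. 19 is given below; the flattened response-map layout and the normalization by the total probability mass (instead of the d-1 correction above) are our own simplifications.

```python
import numpy as np

def gaussian_response(probs, coords):
    """Approximate a response map by N(z | mu, Sigma) via weighted moments.

    probs  : (M,) alignment probabilities p_i(z) over candidate positions
    coords : (M, 2) the corresponding 2D candidate coordinates z
    """
    w = probs / probs.sum()
    mu = (w[:, None] * coords).sum(axis=0)
    d = coords - mu
    Sigma = (w[:, None, None] * d[:, :, None] * d[:, None, :]).sum(axis=0)
    return mu, Sigma
```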
Kernel Density Estimator (KDE): The response maps can also be approximated
by a nonparametric representation, namely using a Kernel Density Estimator
(KDE) (an isotropic Gaussian kernel with bandwidth h^2). Maximizing over
the KDE is typically performed using the well-known mean-shift algorithm
[10]. The kernel bandwidth h^2 is a free parameter that exhibits a strong influence
on the resulting estimate. This problem can be addressed by an annealed
bandwidth schedule [14]. It can be shown that there exists an h^2 value such that
the KDE is unimodal. As h^2 is reduced, the modes divide and the smoothness of
the KDE decreases, guiding the optimization towards the true objective. Formally,
the i-th annealed mean-shift landmark update is given by

y^{KDE(τ+1)}_i = ( Σ_{z_i ∈ Ω_{y^c_i}} z_i p_i(z_i) N(y^{KDE(τ)}_i | z_i, h^2_j I_2) ) / ( Σ_{z_i ∈ Ω_{y^c_i}} p_i(z_i) N(y^{KDE(τ)}_i | z_i, h^2_j I_2) )    (20)

where I_2 is the two-dimensional identity matrix and h^2_j represents the decreasing
annealed bandwidth. The KDE uncertainty estimate consists of computing the
weighted covariance using the mean-shift result as the mean:

Σ^KDE_{y_i} = (1/(d-1)) Σ_{z_i ∈ Ω_{y^c_i}} p_i(z_i) (z_i - y^KDE_i)(z_i - y^KDE_i)^T.    (21)
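The annealed mean-shift update of eq. 20 can be sketched as follows. This is a minimal numpy version; the bandwidth schedule mirrors the one used in Section 4.3, while the fixed inner iteration count is our own choice rather than the authors' stopping rule.

```python
import numpy as np

def mean_shift_step(y, coords, probs, h2):
    """One mean-shift step (eq. 20) on a KDE with an isotropic Gaussian kernel.

    y : (2,) current landmark estimate; coords : (M, 2) candidates z_i;
    probs : (M,) alignment probabilities p_i(z_i); h2 : squared bandwidth.
    """
    sq_dist = ((coords - y) ** 2).sum(axis=1)
    w = probs * np.exp(-0.5 * sq_dist / h2)   # Gaussian kernel weights
    return (w[:, None] * coords).sum(axis=0) / w.sum()

def annealed_mean_shift(y, coords, probs, schedule=(15.0, 10.0, 5.0, 2.0), iters=5):
    # Anneal the bandwidth between levels, as in the experiments (h^2 = 15, 10, 5, 2).
    for h2 in schedule:
        for _ in range(iters):
            y = mean_shift_step(y, coords, probs, h2)
    return y
```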

Figure 3 highlights the differences between the three local optimization strategies
(WPR, GR and KDE). Notice that DBASM deals with mild occlusions.
When a landmark is under occlusion, the response map is typically multi-modal.
If a KDE local strategy is used (DBASM-KDE), the landmark update will select
the nearest mode (eq. 20) and the covariance of that landmark (eq. 21) will

Fig. 3. Qualitative comparison between the three local optimization strategies. The
WPR simply chooses the maximum detector response. GR approximates the response
map by a full Gaussian distribution. KDE uses the mean-shift algorithm to move to
the nearest mode of the density. Its uncertainty is centered at the found mode. The two
examples on the right show patches under occlusion (typically multimodal responses).

be inherently large, modeling a high localization uncertainty. Then, the global
optimization stage jointly combines all uncertainties (in the MAP sense), handling
occlusions. Similarly, to deal with large occlusions, a minor tweak is required. One
can simply set a large covariance for the occluded landmarks.

1 Precompute:
2   The parametric models (s_0, Φ, Ψ) and the MOSSE filters in the Fourier domain H*_i
3   Initial estimates of the shape/pose parameters and covariances (b_0, P_0) / (q_0, Q_0).
4 repeat
5   Warp image I to the base mesh using the current pose parameters q [0.5 ms]
6   Generate the current shape s = s_0 + Φb + Ψq
7   for landmark i = 1 to v do
8     Evaluate the detector response (MOSSE correlation F^{-1}{F{I} ⊙ H*_i}) [3 ms]
9     Find y_i and Σ_{y_i} using a local strategy (Sec. 3.2), e.g. if using KDE, eqs. 20 and 21, respectively.
10  end
11  Update the pose parameters and their covariance [0.1 ms]:
      Q_{k-1} = (Λ_q + Σ_q + Q_{k-1}),  K_q = Q_{k-1} Ψ^T (Ψ Q_{k-1} Ψ^T + Σ_y)^{-1}
12    q_k = q_{k-1} + K_q (∆y - Ψ q_{k-1}),  Q_k = (I_4 - K_q Ψ) Q_{k-1}
13  Update the shape parameters (with pose correction) and their covariance [0.2 ms]:
      P_{k-1} = (Λ + Σ_q + P_{k-1}),  K_b = P_{k-1} Φ^T (Φ P_{k-1} Φ^T + Σ_y)^{-1}
14    b_k = b_{k-1} + K_b (∆y - Φ b_{k-1} - Ψ q_k),  P_k = (I_n - K_b Φ) P_{k-1}
15 until ||b_k - b_{k-1}|| ≤ ε or the maximum number of iterations is reached;
Algorithm 1: Overview of the DBASM method. The performance of DBASM is
comparable to ASM [4], CQF [8] or SCMS [10] depending on the local strategy
(DBASM-WPR, DBASM-GR or DBASM-KDE, respectively). It achieves near real-time
performance. The bottleneck is always obtaining the response maps (3 ms × number of
landmarks), although this can be done in parallel.

3.3 Hierarchical Search (DBASM-KDE-H)

A slightly different annealing approach is proposed in this section. Instead of
using mean-shift with an iterative kernel bandwidth relaxation and then
optimizing using the LDS MAP formulation, a hierarchical search can be used
instead. This solution is composed of multiple levels of fixed-kernel-bandwidth
mean-shifts, each followed by LDS optimization steps. The annealing is performed
between hierarchical levels.

4 Evaluation Results

The experiments in this paper were designed to evaluate the local detector
(MOSSE [11]) and the new Bayesian global optimization (DBASM). All the
experiments were conducted on several databases with publicly available ground
truth. (1) The IMM [15] database consists of 240 annotated images of
40 different human faces presenting different head poses, illumination, and facial
expressions (58 landmarks). (2) The BioID [16] dataset contains 1521 images,
each showing a near-frontal view of the face of one of 23 different subjects (20
landmarks). (3) The XM2VTS [17] database has 2360 frontal face images of
295 subjects (68 landmarks). (4) The tracking performance is evaluated on the
FGNet Talking Face (TF) [18] video sequence, which holds 5000 frames of video of
an individual engaged in a conversation (68 landmarks). (5) Finally, a qualitative
evaluation was also performed using the Labeled Faces in the Wild (LFW)
[3] database, which contains images taken under variability in pose, lighting, focus,
facial expression, occlusions, different backgrounds, etc.

4.1 Local Detector - MOSSE Filter

The Minimum Output Sum of Squared Error (MOSSE) filter, recently proposed
in [11], finds the optimal filter that minimizes the Sum of Squared Differences
(SSD) to a desired correlation output. Briefly, correlation can be computed in the
frequency domain as the element-wise multiplication of the 2D Fourier transform
(F) of an input image I with a filter H, also defined in the Fourier domain, as
G = F{I} ⊙ H*, where the ⊙ symbol represents the Hadamard product and *
is the complex conjugate. The correlation value is given by F^{-1}{G}, the inverse
Fourier transform of G.
MOSSE finds the filter H, in the Fourier domain, that minimizes the SSD
between the actual output of the correlation and the desired output of the correlation,
across a set of N training images, min_{H*} Σ_{j=1}^{N} (F{I_j} ⊙ H* - G_j)^2,
where G is obtained by sampling a 2D Gaussian uniformly. Solving for the filter
H* yields the closed-form solution

H* = ( Σ_{j=1}^{N} G_j ⊙ F{I_j}* ) / ( Σ_{j=1}^{N} F{I_j} ⊙ F{I_j}* ).    (22)

The MOSSE filter maps all aligned training patch examples to an output, G,
centered at the feature location, producing notably stable correlation filters.
At the training stage, each patch example is normalized to have zero mean
and unit norm, and is multiplied by a cosine window (required to solve the
Fourier transform periodicity problem). This also has the benefit of emphasizing
the target center. These filters have a high invariance to illumination changes,
due to their null DC component, and proved to be highly suitable to the task
of generic face alignment.
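The closed-form filter of eq. 22 and the correlation score of eq. 25 (Section 4.2) can be sketched as below. This is a minimal numpy version under our naming; the small regularizer eps and the Hanning window used as the cosine window are our own practical choices, not settings taken from the paper.

```python
import numpy as np

def train_mosse(patches, sigma=3.0, eps=1e-5):
    """Closed-form MOSSE filter H* (eq. 22) from aligned training patches.

    patches : (N, L, L) aligned training patches for one landmark.
    Returns H*, the filter in the Fourier domain.
    """
    N, L, _ = patches.shape
    # Desired correlation output: a 2D Gaussian centred on the patch.
    yy, xx = np.mgrid[0:L, 0:L] - (L - 1) / 2.0
    G = np.fft.fft2(np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2)))
    win = np.outer(np.hanning(L), np.hanning(L))   # cosine window

    num = np.zeros((L, L), dtype=complex)
    den = np.zeros((L, L), dtype=complex)
    for p in patches:
        p = p - p.mean()
        p = p / (np.linalg.norm(p) + eps)          # zero mean, unit norm
        F = np.fft.fft2(p * win)
        num += G * np.conj(F)                      # sum_j  G_j . F{I_j}*
        den += F * np.conj(F)                      # sum_j  F{I_j} . F{I_j}*
    return num / (den + eps)                       # eps is our regularizer

def mosse_score(H_star, patch):
    # Detector response of eq. 25: F^{-1}{ F{I(y_i)} . H*_i }.
    return np.real(np.fft.ifft2(np.fft.fft2(patch) * H_star))
```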

4.2 Evaluating Local Detectors

Three landmark expert detectors were evaluated. The most used detector [8][10]
is based on a linear classifier built from aligned (positive) and misaligned (negative)
grey-level patch examples. The score of the i-th linear detector is given by

D_i^linear(I(y_i)) = w_i^T I(y_i) + b_i,    (23)

with w_i being the linear weights, b_i the bias constant and I(y_i) a vectorized patch
of pixel values sampled at y_i. Similarly, a quadratic classifier can be used,

D_i^quadratic(I(y_i)) = I(y_i)^T Q_i I(y_i) + L_i^T I(y_i) + b_i    (24)

with Q_i and L_i being the quadratic and linear terms, respectively. Finally, the
MOSSE filter correlation gives

D_i^MOSSE(I(y_i)) = F^{-1}{F{I(y_i)} ⊙ H*_i}    (25)

where H*_i is the MOSSE filter from eq. 22. Both the linear and quadratic classifiers
(linear SVM [19] and Quadratic Discriminant Analysis) were trained using
images from the IMM [15] dataset with 144 negative patch examples (for each
landmark and each image) being misaligned by up to 12 pixels in x and y translation.
The MOSSE filters were built using aligned patch samples of size 128 × 128.
A power-of-two patch size is used to speed up the FFT computation; however,
only a 40 × 40 subwindow of the output is considered. During MOSSE filter
building, each training patch requires a normalization step. Each example is
normalized to have zero mean and unit norm and is multiplied by a
cosine window. The desired output G (eq. 22) is set to be a 2D Gaussian function
centered at the landmark with a standard deviation of 3 pixels.
The global optimization method that best evaluates detector performance
is the approach that relies the most on the output of the detector, i.e., the
Active Shape Model (ASM) [4]. The results are presented in the form of fitting
performance curves, which were also adopted by [20][5][21][8][10]. These curves
show the percentage of faces that achieved a given Root Mean Square (RMS)
error amount. Figure 4 shows fitting performance curves that compare the
three kinds of detectors using the ASM [4] optimization¹ and the proposed global
DBASM technique using a Weighted Peak Response strategy (DBASM-WPR).
From the results we can highlight some conclusions: (1) the MOSSE filter always
outperforms the others, especially when using simpler optimization methods; (2)
the DBASM optimization improves the results even with simple detectors; (3)
maximum performance can be achieved by using the MOSSE detector and the
DBASM optimization.
The use of MOSSE filters is an interesting solution that works well in practice
and is particularly suited to the detection of facial parts. However, we highlight that
it is not crucial for the performance of our Bayesian formulation. DBASM still
improves performance when using standard detectors.

¹ The ASM [4], CQF [8] and SCMS [10] use as local optimizations the WPR, GR and
KDE strategies, respectively.

[Figure 4 plots: detector fitting performance (WPR) on (a) the IMM [15], (b) the XM2VTS [17] and (c) the BioID [16] databases; proportion of images vs. RMS error, with curves for AVG, ASM-MOSSE, ASM-SVM, ASM-QDA, DBASM-WPR-MOSSE, DBASM-WPR-SVM and DBASM-WPR-QDA.]

Fig. 4. Fitting performance curves comparing different detectors (Linear, Quadratic
and MOSSE) on the IMM, XM2VTS and BioID databases, respectively. AVG denotes
the average location provided by the initial estimate (Adaboost [22] face detector).

4.3 Evaluating Global Optimization Strategies

In this section the DBASM optimization strategy is evaluated w.r.t. state-of-the-art
global alignment solutions. The proposed DBASM and DBASM-H methods
are compared with (1) ASM [4], (2) CQF [8], (3) BCLM [13], (4) GMM [9] using 3
Gaussians (GMM3) and (5) SCMS [10]. Note that DBASM can be used with
different local strategies to approximate the response maps (e.g. WPR, GR or
KDE, as described in Section 3.2). In these experiments we fixed the local strategy
to KDE (BCLM-KDE, SCMS-KDE, DBASM-KDE) in order to compare the
global optimization approaches. The results from ASM, CQF and GMM3 are
provided as a baseline. The same bandwidth schedule of h^2 = (15, 10, 5, 2) is
always used for the KDE. All the experiments in this section use MOSSE filters
as local detectors (with the same settings as in Section 4.2), built with
training images from the IMM [15] set only and tested on the remaining datasets². In
all cases, the non-rigid parameters start from zero, the similarity parameters were
initialized by a face detector [22], and the model was fitted until convergence
(limited to a maximum of 20 iterations).
Figure 5 shows the fitting performance curves for the IMM, XM2VTS and
BioID datasets, respectively. The CQF performs better than GMM3, mainly because
GMM is very prone to local optima due to its multimodal nature (it
is worth mentioning that, given a good initial estimate, GMM offers a superior
fitting quality). The main drawback of CQF is its limited accuracy due to the
over-smoothness of the response map (see Figure 3). The BCLM is slightly better
than SCMS due to its improved parameter update (MAP update vs. first
order forwards additive). The SCMS improves the results when compared to
CQF due to the high accuracy provided by the mean-shift. In some cases, the
ASM achieves a performance comparable to the SCMS; the reason for this lies
in the excellent performance of the MOSSE detector. The proposed Bayesian
global optimization (DBASM) outperforms all previous methods, by modeling
the covariance of the latent variables, which represents the confidence in the

² The results on the IMM dataset use training images collected at our institution.

[Figure 5 plots: fitting performance curves on (a) the IMM [15], (b) the XM2VTS [17] and (c) the BioID [16] databases; proportion of images vs. RMS error, with curves for AVG, ASM, CQF, GMM3, SCMS, BCLM-KDE, DBASM-KDE and DBASM-KDE-H.]
Reference: 7.5 RMS            IMM (240 images)   XM2VTS (2360 images)   BioID (1521 images)
ASM                           50.0               30.7                   70.0
DBASM-WPR (our method)        56.7 (+6.7)        45.1 (+14.4)           75.4 (+5.4)
CQF                           45.4               10.9                   47.0
GMM3                          40.8 (-4.6)        10.4 (-0.5)            51.7 (+4.7)
BCLM-GR                       48.3 (+2.9)        15.9 (+5.0)            54.2 (+7.2)
DBASM-GR (our method)         50.4 (+5.0)        18.0 (+7.1)            62.2 (+15.2)
SCMS-KDE                      54.6               35.7                   69.0
BCLM-KDE                      57.1 (+2.5)        43.4 (+7.7)            71.9 (+2.9)
DBASM-KDE (our method)        64.6 (+10.0)       54.5 (+18.8)           76.5 (+7.5)
DBASM-KDE-H (our method)      64.6 (+10.0)       53.5 (+17.8)           76.5 (+7.5)

Fig. 5. Fitting performance curves. The table shows quantitative values obtained by
setting a fixed RMS error amount (7.5 pixels - the vertical line in the graphs). Each table
entry shows the percentage of images that converge with an RMS error less than or equal to
the reference. The results show that our proposed methods outperform all the others
(using all the local strategies WPR, GR and KDE). Top images show DBASM-KDE
fitting examples from each database.

[Figure 6 plot: tracking performance, RMS error vs. frame number on the Talking Face sequence; legend gives mean / standard deviation RMS errors: ASM (10.5 / 6.4), CQF (10.6 / 3.9), GMM3 (11.1 / 4.3), SCMS-KDE (8.2 / 2.6), BCLM-KDE (9.5 / 3.6), DBASM-KDE (7.0 / 2.1), DBASM-KDE-H (7.2 / 2.2).]

Fig. 6. Tracking performance evaluation of several fitting algorithms on the FGNET
Talking Face [18] sequence. The values in the legend box are the mean and standard
deviation of the RMS errors, respectively. Top images show DBASM-KDE fitting examples.

current parameter estimates (see Figure 5). The results show that the hierarchical
annealing version of DBASM-KDE (DBASM-KDE-H) performs slightly
better, but at the cost of more iterations. Tracking performance is also tested on
the FGNET Talking Face video sequence (Figure 6). Each frame is fitted using as

(a) AVG (b) ASM (c) CQF (d) GMM3 (e) BCLM-KDE (f) SCMS-KDE (g) DBASM-KDE (h) DBASM-KDE-H

Fig. 7. Qualitative fitting results on the LFW database [3]. AVG is the initial estimate.

initial estimate the previously estimated shape and pose parameters. The relative
performance of the global optimization approaches is similar to the previous
experiments, with the DBASM technique yielding the best performance.
Qualitative evaluation is also performed using the Labeled Faces in the Wild
(LFW) database [3]; some results can be seen in Figure 7.

5 Conclusions
An efficient solution to align facial parts in unseen images is described in this
work. We present a novel Bayesian paradigm (DBASM) to solve the global alignment
problem, in a MAP sense, showing that the posterior distribution of the
global warp can be efficiently inferred using a Linear Dynamical System (LDS).
The main advantage w.r.t. previous Bayesian formulations is that DBASM models
the covariance of the latent variables, which represents the confidence in the current
parameter estimates. Several performance evaluation results are presented,
comparing both local detectors and global optimization strategies. Evaluation of
the local detectors shows that the MOSSE correlation filters offer superior
performance in local landmark detection. Global optimization evaluations were
performed on several publicly available image datasets, namely the IMM, the
XM2VTS, the BioID, and the LFW. Tracking performance is also evaluated on
the FGNET Talking Face video sequence. This new Bayesian paradigm is shown
to significantly outperform other state-of-the-art fitting solutions.

Acknowledgments. This work was supported by the Portuguese Science Foundation
(FCT) through the project "Dinamica Facial 4D para Reconhecimento de
Identidade" (grant PTDC/EIA-CCO/108791/2008). Pedro Martins, Rui Caseiro
and Joao Henriques acknowledge the FCT through the grants SFRH/BD/45178/
2008, SFRH/BD74152/2010 and SFRH/BD/75459/2010, respectively.

References
1. Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active appearance models. IEEE TPAMI 23, 681-685 (2001)
2. Matthews, I., Baker, S.: Active appearance models revisited. IJCV 60, 135-164 (2004)
3. Huang, G.B., Ramesh, M., Berg, T., Learned-Miller, E.: Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst (2007)
4. Cootes, T.F., Taylor, C.J., Cooper, D.H., Graham, J.: Active shape models - their training and application. CVIU 61, 38-59 (1995)
5. Cristinacce, D., Cootes, T.F.: Boosted regression active shape models. In: BMVC (2007)
6. Tresadern, P., Bhaskar, H., Adeshina, S., Taylor, C., Cootes, T.F.: Combining local and global shape models for deformable object matching. In: BMVC (2009)
7. Cristinacce, D., Cootes, T.F.: Automatic feature localisation with constrained local models. Pattern Recognition 41, 3054-3067 (2008)
8. Wang, Y., Lucey, S., Cohn, J.: Enforcing convexity for improved alignment with constrained local models. In: IEEE CVPR (2008)
9. Gu, L., Kanade, T.: A Generative Shape Regularization Model for Robust Face Alignment. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 413-426. Springer, Heidelberg (2008)
10. Saragih, J., Lucey, S., Cohn, J.: Face alignment through subspace constrained mean-shifts. In: IEEE ICCV (2009)
11. Bolme, D.S., Beveridge, J.R., Draper, B.A., Lui, Y.M.: Visual object tracking using adaptive correlation filters. In: IEEE CVPR (2010)
12. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer (2006)
13. Paquet, U.: Convexity and Bayesian constrained local models. In: IEEE CVPR (2009)
14. Shen, C., Brooks, M.J., Hengel, A.: Fast global kernel density mode seeking: Applications to localization and tracking. IEEE TIP 16, 1457-1469 (2007)
15. Nordstrom, M., Larsen, M., Sierakowski, J., Stegmann, M.: The IMM face database - an annotated dataset of 240 face images. Technical report, Technical University of Denmark, DTU (2004)
16. Jesorsky, O., Kirchberg, K.J., Frischholz, R.W.: Robust Face Detection Using the Hausdorff Distance. In: Bigun, J., Smeraldi, F. (eds.) AVBPA 2001. LNCS, vol. 2091, pp. 90-95. Springer, Heidelberg (2001)
17. Messer, K., Matas, J., Kittler, J., Luettin, J., Maitre, G.: XM2VTSDB: The extended M2VTS database. In: AVBPA (1999)
18. FGNet: Talking face video (2004)
19. Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., Lin, C.-J.: LIBLINEAR: A library for large linear classification. JMLR, 1871-1874 (2008)
20. Cristinacce, D., Cootes, T.F.: Facial feature detection using adaboost with shape constraints. In: BMVC (2003)
21. Cristinacce, D., Cootes, T.F.: Feature detection and tracking with constrained local models. In: BMVC (2006)
22. Viola, P., Jones, M.: Robust real-time object detection. IJCV 57, 137-154 (2002)
Patch Based Synthesis
for Single Depth Image Super-Resolution

Oisin Mac Aodha, Neill D.F. Campbell, Arun Nair, and Gabriel J. Brostow

University College London


http://visual.cs.ucl.ac.uk/pubs/depthSuperRes/

Abstract. We present an algorithm to synthetically increase the resolution
of a solitary depth image using only a generic database of local
patches. Modern range sensors measure depths with non-Gaussian noise
and at lower starting resolutions than typical visible-light cameras. While
patch based approaches for upsampling intensity images continue to improve,
this is the first exploration of patching for depth images.
We match against the height field of each low resolution input depth
patch, and search our database for a list of appropriate high resolution
candidate patches. Selecting the right candidate at each location in the
depth image is then posed as a Markov random field labeling problem.
Our experiments also show how further depth-specific processing,
such as noise removal and correct patch normalization, dramatically
improves our results. Perhaps surprisingly, even better results are
achieved on a variety of real test scenes by providing our algorithm with
only synthetic training depth data.

1 Introduction
Widespread 3D imaging hardware is advancing the capture of depth images
with either better accuracy, e.g. the Faro Focus3D laser scanner, or at lower prices,
e.g. Microsoft's Kinect. For every such technology, there is a natural upper
limit on the spatial resolution and the precision of each depth sample. It may
seem that calculating useful interpolated depth values requires additional data
from the scene itself, such as a high resolution intensity image [1], or additional
depth images from nearby camera locations [2]. However, the seminal work of
Freeman et al. [3] showed that it is possible to explain and super-resolve an
intensity image, having previously learned the relationships between blurry and
high resolution image patches. To our knowledge, we are the first to explore a
patch based paradigm for the super-resolution (SR) of single depth images.
Depth image SR is different from image SR. While less affected by scene lighting
and surface texture, noisy depth images have fewer good cues for matching
patches to a database. Also, blurry edges are perceptually tolerable and expected
in images, but at discontinuities in depth images they create jarring
artifacts (Fig. 1). We cope with both these problems by taking the unusual step
of matching inputs against a database at the low resolution, in contrast to using
interpolated high resolution. Even creation of the database is harder



      
 
  


 

Fig. 1. On the bottom row are three different high resolution signals that we wish
to recover. The top row illustrates typical results after upsampling its low resolution
version by interpolation. Interpolation at intensity discontinuities gives results that are
perceptually similar to the ground truth image. However, in depth images this blurring
can result in very noticeable jagged artifacts when viewed in 3D.

for depth images, whether from Time-of-Flight (ToF) arrays or laser scanners,
because these can contain abundant interpolation-like noise.
The proposed algorithm infers a high resolution depth image from a single
low resolution depth image, given a generic database of training patches. The
problem itself is novel, and we achieve results that are qualitatively superior to
what was possible with previous algorithms because we:
- Perform patch matching at the low resolution, instead of interpolating first.
- Train on a synthetic dataset instead of using available laser range data.
- Perform depth specific normalization of non-overlapping patches.
- Introduce a simple noisy-depth reduction algorithm for postprocessing.

Depth cameras are increasingly used for video-based rendering [4], robot ma-
nipulation [5], and gaming [6]. These environments are dynamic, so a general
purpose SR algorithm is better if it does not depend on multiple exposures.
These scenes contain significant depth variations, so registration of the 3D data
to a nearby camera's high resolution intensity image is approximate [7], and use
of a beam splitter is not currently practicable. In the interest of creating visually
plausible super-resolved outputs under these constraints, we relegate the need
for the results to genuinely match the real 3D scene.

2 Related Work
Both the various problem formulations for super-resolving depth images and
the successive solutions for super-resolving intensity images relate to our algo-
rithm. Most generally, the simplest upsampling techniques use nearest-neighbor,
bilinear, or bicubic interpolation to determine image values at interpolated co-
ordinates of the input domain. Such increases in resolution occur without regard
for the input's frequency content. As a result, nearest-neighbor interpolation

turns curved surfaces into jagged steps, while bilinear and bicubic interpolation
smooth out sharp boundaries. Such artifacts can be hard to measure numerically,
but are perceptually quite obvious both in intensity and depth images. While
Fattal [8] imposed strong priors based on edge statistics to smooth stair step
edges, this type of approach still struggles in areas of texture. Methods like [9]
for producing high quality antialiased edges from jagged input are inappropriate
here, for reasons illustrated in Fig. 1.
Multiple Depth Images: The SR problem traditionally centers on fusing mul-
tiple low resolution observations together, to reconstruct a higher resolution im-
age, e.g. [10]. Schuon et al . [2] combine multiple (usually 15) low resolution
depth images with different camera centers in an optimization framework that
is designed to be robust to the random noise characteristics of ToF sensors.
To mitigate the noise in each individual depth image, [1] composites together
multiple depths from the same viewpoint to make a single depth image for
further super-resolving, and Hahne and Alexa [11] combine depth scans in a
manner similar to exposure-bracketing for High Dynamic Range photography.
Rajagopalan et al . [12] use an MRF formulation to fuse together several low
resolution depth images to create a final higher resolution image. Using GPU
acceleration, Izadi et al . made a system which registers and merges multiple
depth images of a scene in real time [13]. Fusing multiple sets of noisy scans has
also been demonstrated for eective scanning of individual 3D shapes [14]. Com-
pared to our approach, these all assume that the scene remains static. Though
somewhat robust to small movements, large scene motion will cause them to fail.
Intensity Image Approaches: For intensity images, learning based methods
exist for SR when multiple frames or static scenes are not available. In the most
closely related work to our own, Freeman et al . [15,16] formulated the prob-
lem as multi-class labeling on an MRF. The label being optimized at each node
represents a high resolution patch. The unary term measures how closely the
high resolution patch matches the interpolated low resolution input patch. The
pairwise terms encourage regions of the high resolution patches to agree. In
their work, the high resolution patches came from an external database of pho-
tographs. To deal with depth images, our algorithm differs substantially from
Freeman et al. and its image SR descendants, with details in Section 3. Briefly,
our depth specific considerations mean that we i) compute matches at low reso-
lution to limit blurring and to reduce the dimensionality of the search, ii) model
the output space using non-overlapping patches so depth values are not aver-
aged, iii) normalize height to exploit the redundancy in depth patches, and iv)
we introduce a noise-removal algorithm for postprocessing, though qualitatively
superior results emerge before this step.
Yang et al . [17] were able to reconstruct high resolution test patches as
sparse linear combinations of atoms from a learned compact dictionary of paired
high/low resolution training patches. Our initial attempts were also based on
sparse coding, but ultimately produced blurry results in the reconstruction stage.
Various work has been conducted to best take advantage of the statistics of
natural image patches. Zontak and Irani [18] argue that nding suitable high

resolution matches for a patch with unique high frequency content could take a
prohibitively large external database, and it is more likely to find matches for
these patches within the same image. Glasner et al. [19] exploit patch repetition
across and within scales of the low resolution input to find candidates. In
contrast to depth images, their input contains little to no noise, which can not
be said of external databases which are constrained to contain the same con-
tent as the input image. Sun et al . [20] oversegment their input intensity image
into regions of assumed similar texture and lookup an external database using
descriptors computed from the regions. HaCohen et al . [21] attempt to classify
each region as a discrete texture type to help upsample the image. A similar ap-
proach can be used for adding detail to 3D geometry; object specic knowledge
has been shown to help when synthesizing detail on models [22,23]. While some
work has been carried out on the statistics of depth images [24], it is not clear
if they follow those of regular images. Major differences between the intensity
and depth SR problems are that depth images usually have much lower starting
resolution and significant non-Gaussian noise. The lack of high quality data also
means that techniques used to exploit patch redundancy are less applicable.
Depth+Intensity Hybrids: Several methods exploit the statistical relation-
ship between a high resolution intensity image and a low resolution depth image.
They rely on the co-occurrence of depth and intensity discontinuities, on depth
smoothness in areas of low texture, and careful registration for the object of
interest. Diebel and Thrun [25] used an MRF to fuse the two data sources after
registration. Yang et al. [1] presented a very effective method to upsample depth
based on a cross bilateral filter of the intensity. Park et al. [7] improved on these
results with better image alignment, outlier detection, and also by allowing for
user interaction to refine the depth. Incorrect depth estimates can come about
if texture from the intensity image propagates into regions of smooth depth.
Chan et al. [26] attempted to overcome this by not copying texture into depth
regions which are corrupted by noise and likely to be geometrically smooth.
Schuon et al. [27] showed that there are situations where depth and color images
cannot be aligned well, and that these cases are better off being super-resolved
just from multiple depth images.
In our proposed algorithm, we limit ourselves to super-resolving a single low
resolution depth image, without additional frames or intensity images, and therefore
with no major concerns about baselines, registration, and synchronization.

3 Method
We take as an input a low resolution depth image X that is generated from
some unknown high resolution depth image Y by an unknown downsampling
function ↓d, such that X = (Y)↓d. Our goal is to synthesize a plausible Y.
We treat X as a collection of N non-overlapping patches X = {x_1, x_2, ..., x_N} of
size M × M, which we scale to fit in the range [0..1] to produce normalized input
patches x̃_i. We recover a plausible SR depth image Y by finding a minimum
of a discrete energy function. Each node in our graphical model (see Fig. 3) is

 

   


          
 

Fig. 2. A) Noise removal of the input gives a cleaner input for patching but can remove
important details when compared to our method of matching at low resolution. B)
Patching artifacts are obvious (left) unless we apply our patch min/max compensation
(right).

associated with a low resolution image patch, x_i, and the discrete label for the
corresponding node in a Markovian grid corresponds to a high resolution patch,
y_i. The total energy of this MRF is

E(Y) = Σ_i E_d(x_i) + Σ_{i,j ∈ N} E_s(y_i, y_j),    (1)

where N denotes the set of neighboring patches.
The data likelihood term, E_d(x_i), measures the difference between the normalized
input patch and the normalized, downsampled high resolution candidate:

E_d(x_i) = || x̃_i - (ỹ_i)↓d ||².    (2)

Unlike Freeman et al. [3], we do not upsample the low resolution input using a
deterministic interpolation method to then compute matches at the upsampled
scale. We found that doing so unnecessarily accentuates the large amount of noise
that can be present in depth images. Noise removal is only partially successful
and runs the risk of removing details, see Fig. 2 A). A larger amount of training
data is also needed to explain the high resolution patches, and it increases the size
of the MRF. Instead, we prefilter and downsample the high resolution training
patches to make them the same size as, and comparable to, the input patches.
The pairwise term, E_s(y_i, y_j), enforces coherence in the abutting region between
neighboring unnormalized high resolution candidates, so

E_s(y_i, y_j) = || O_ij(y_i) - O_ji(y_j) ||²,    (3)

where O_ij is an overlap operator that extracts the region of overlap between
the extended versions of the unnormalized patches y_i and y_j, as illustrated in
Fig. 3 A). The overlap region consists of a single pixel border around each patch.
We place the non-extended patches down side by side and compute the pairwise

Fig. 3. A) Candidate high resolution patches y_i and y_j are placed beside each other
but are not overlapping. Each has an additional one pixel border used to evaluate
smoothness (see Eqn. 3), which is not placed in the final high resolution depth image.
Here, the overlap is the region of 12 pixels in the rectangle enclosed by yellow
dashed lines. B) When downsampling a signal its absolute min and max values will not
necessarily remain the same. We compensate for this when unnormalizing a patch by
accounting for the difference between the patch and its downsampled version.

term in the overlap region. In standard image SR this overlap region is typically
averaged to produce the final image [3], but with depth images this can create
artifacts.
The high resolution candidate, ỹ_i, is unnormalized based on the min and max
of the input patch:

y_i = ỹ_i (max(x_i) γ_i^max - min(x_i) γ_i^min) + min(x_i) γ_i^min,    (4)

where the γ_i terms account for the differences accrued during the downsampling
of the training data (see Fig. 3 B)):

γ_i^min = min(y_i) / min((y_i)↓d),    γ_i^max = max(y_i) / max((y_i)↓d).

Excluding the γ_i terms results in noticeable patching artifacts, such as
stair stepping and misalignment in the output depth image, due to the high
resolution patch being placed into the output with incorrect scaling (see
Fig. 2 B)). We experimented with different normalization techniques, such as
matching the mean and variance, but, due to the non-Gaussian distribution of
depth errors [28], noticeable artifacts were produced in the upsampled image.
To super-resolve our input, we solve the discrete energy minimization objective
function of (1) using the TRW-S algorithm [29,30].
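A minimal sketch of the patch costs and the unnormalization step is given below. The function and parameter names (and the gamma naming for the correction factors) are ours; `downsample` stands in for the unknown operator ↓d, and the exact order of normalization and downsampling in the data term follows the description above rather than a definitive implementation.

```python
import numpy as np

def normalize_patch(p, eps=1e-8):
    """Scale a depth patch to [0, 1]; also return its original min and max."""
    lo, hi = p.min(), p.max()
    return (p - lo) / (hi - lo + eps), lo, hi

def data_term(x_patch, y_candidate, downsample):
    """E_d (eq. 2): distance between the normalized low-res input patch and
    the normalized, downsampled high-res candidate."""
    x_n, _, _ = normalize_patch(x_patch)
    y_n, _, _ = normalize_patch(downsample(y_candidate))
    return float(((x_n - y_n) ** 2).sum())

def smoothness_term(overlap_i, overlap_j):
    """E_s (eq. 3): agreement of two neighbouring candidates in their
    one-pixel overlap border (both already unnormalized)."""
    return float(((overlap_i - overlap_j) ** 2).sum())

def unnormalize(y_n, x_min, x_max, g_min, g_max):
    """Eq. (4): rescale a normalized candidate with the input patch's min/max
    and the correction factors for min/max drift under downsampling."""
    return y_n * (x_max * g_max - x_min * g_min) + x_min * g_min
```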

3.1 Depth Image Noise


Depending on the sensor used, depth images can contain a considerable amount
of noise. Work has been undertaken to try to characterize the noise of ToF sen-
sors [31]. They exhibit phenomena such as flying pixels at depth discontinuities
due to the averaging of different surfaces, and return incorrect depth readings

from specular and dark materials [28]. Coupled with low recording resolution
(compared to intensity cameras), this noise poses an additional challenge that
is not present when super-resolving an image. We could attempt to model the
downsampling function, ↓d, which takes a clean noiseless signal and distorts it.
However, this is a non-trivial task and would result in a method very specific
to the sensor type. Instead, we work with the assumption that, for ToF sensors,
most of the high frequency content is noise. To remove this noise, we bilateral
filter [32] the input image X before the patches are normalized. The high resolution
training patches are also filtered (where ↓d is a bicubic filter), so that all
matching is done on similar patches.
It is still possible to have some noise in the final super-resolved image due to patches being stretched incorrectly over boundaries. Park et al. [7] identify outliers based on the contrast of the min and max depth in a local patch in image space. We too wish to identify these outliers, but also to replace them with plausible values and refine the depth estimates of the other points. Using the observation that most of the error is in the depth direction, we propose a new set of possible depth values for each pixel, and attempt to solve for the most consistent combination across the image. A 3D coordinate p^w, with position p^im in the image, is labelled as an outlier if the average distance to its T nearest neighbours is greater than 3σ_d. In the case of non-outlier pixels, the label set contains the depth values p^im_z + n ε p^im_z, where n = [−N/2, ..., N/2], and each label's unary cost is |n ε p^im_z|, with ε = 1%. For the outlier pixels, the label set contains the N nearest non-outlier depth values with uniform unary cost. The pairwise term is the truncated distance between the neighboring depth values i and j: ||p^im_{i,z} − p^im_{j,z}||_2. Applying our outlier removal instead of the bilateral filter on the input depth image would produce overly blocky aliased edges that are difficult for the patch lookup to overcome during SR. Used instead as a postprocess, it will only act to remove errors due to incorrect patch scaling.
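A sketch of the outlier test described above, assuming the depth map has already been back-projected to 3D points; the choice of T, the 3σ_d threshold estimate, and the use of scikit-learn's neighbor search are our reading of the text, not the authors' implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def flag_outliers(points_3d, T=8, sigma_d=None):
    """Mark a 3D point as an outlier if the mean distance to its T nearest
    neighbours exceeds 3 * sigma_d."""
    nbrs = NearestNeighbors(n_neighbors=T + 1).fit(points_3d)
    dists, _ = nbrs.kneighbors(points_3d)   # column 0 is the point itself
    mean_dist = dists[:, 1:].mean(axis=1)
    if sigma_d is None:
        sigma_d = mean_dist.std()           # crude estimate; an assumption
    return mean_dist > 3 * sigma_d
```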

4 Training Data

For image SR, it is straightforward to acquire image collections online for train-
ing purposes. Methods for capturing real scene geometry, e.g. laser scanning,
are not convenient for collecting large amounts of high quality data in varied
environments. Some range datasets do exist online, such as the Brown Range
Image Database [24], the USF Range Database [33] and Make3D Range Image
Database [34]. The USF dataset is a collection of 400 range images of simple
polyhedral objects with a very limited resolution of only 128 × 128 pixels and with heavily quantized depth. Similarly, the Make3D dataset contains a large variety of low resolution scenes. The Brown dataset, captured using a laser scanner, is most relevant for our SR purposes, containing 197 scenes spanning indoors and outdoors. While the spatial resolution is superior in the Brown dataset, it is still limited and features noisy data such as flying pixels that ultimately hurt depth SR; see Fig. 4 B). In our experiments we compared the results of using the Brown Range Image Database vs. synthetic data, and found superior results using synthetic
Fig. 4. A) Example from our synthetic dataset. The left image displays the depth image and the right is the 3D projection. Note the clean edges in the depth image and lack of noise in the 3D projection. B) Example scene from the Brown Range Image Database [24] which exhibits low spatial resolution and flying pixels (in red box).

data (see Fig. 7). An example of one of our synthesized 3D scenes is shown in
Fig. 4 A). Some synthetic depth datasets also exist, e.g. [35], but they typically
contain single objects.
Due to the large amount of redundancy in depth scenes (e.g. planar surfaces),
we prune the high resolution patches before training. This is achieved by detect-
ing depth discontinuities using an edge detector. A dilation is then performed on
this edge map (with a disk of radius 0.02 × the image width) and only patches with centers in this mask are chosen. During testing, the top K closest candidates to the low resolution input patch are retrieved from the training images. Matches are computed based on the ||·||_2 distance from the low resolution patch to a downsampled version of the high resolution patch. In practice, we use a k-d tree to speed up this lookup. Results are presented using a dataset of 30 scenes of size 800 × 800 pixels (with each scene also flipped left to right), which creates a dictionary of 5.3 million patches, compared to 660 thousand patches in [16].
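One way the pruning and candidate lookup might be implemented; the Canny edge detector, the disk structuring element, and SciPy's cKDTree are plausible stand-ins rather than the exact components used in the paper.

```python
import numpy as np
import cv2
from scipy.spatial import cKDTree

def training_patch_centers(depth, radius_frac=0.02):
    """Keep only patch centers near depth discontinuities: detect edges, then
    dilate with a disk of radius 0.02 * image width; return the boolean mask."""
    d8 = cv2.normalize(depth, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    edges = cv2.Canny(d8, 50, 150)
    radius = max(1, int(radius_frac * depth.shape[1]))
    disk = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (2 * radius + 1,) * 2)
    return cv2.dilate(edges, disk) > 0

def build_lookup(lowres_patches):
    """k-d tree over flattened, downsampled versions of the training patches."""
    return cKDTree(np.asarray(lowres_patches, dtype=np.float32))

def top_candidates(tree, query_patch, K=150):
    """Indices of the K nearest training patches under the L2 distance."""
    _, idx = tree.query(query_patch.ravel(), k=K)
    return idx
```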

5 Experiments
We performed experiments on single depth scans obtained by various means,
including a laser scanner, structured light, and three different ToF cameras. We favor the newer ToF sensors because they do not suffer from missing regions at depth-discontinuities as much as Kinect, which has a comparatively larger camera-projector baseline. We run a sliding window filter on the input to fill in missing data with local depth information. The ToF depth images we tested come from one of three camera models: PMD CamCube 2.0 with resolution of 200 × 200, Mesa Imaging SwissRanger SR3000 with 176 × 144, or the Canesta EP DevKit with 64 × 64. We apply bilateral prefiltering and our postprocessing
denoising algorithm only to ToF images. Unless otherwise indicated, all exper-
iments were run with the same parameters and with the same training data.
We provide comparisons against the Example based Super-Resolution (EbSR)
method of [16] and the Sparse coding Super-Resolution (ScSR) method of [17].
Fig. 5. A) RMSE comparison of our method versus several others when upsampling the downsampled Middlebury stereo dataset [36] by a factor of 2× and 4×. *MRF RS [25] and Cross Bilateral [1] require an additional intensity image at the same high resolution as the upsampled depth output. B) RMSE comparison of our method versus two other image based techniques for upsampling three different laser scans by a factor of 4×. See Fig. 6 for visual results of Scan 42. C) Training and testing times in seconds.

5.1 Quantitative Evaluation against Image Based Techniques


In the single image SR community, quantitative results have been criticized for
not being representative of perceptual quality [17,18], and some authors choose
to ignore them completely [19,21,20]. We first evaluated our technique on the Middlebury stereo dataset [36]. We downsampled the ground truth (using nearest neighbor interpolation) by factors of 2× and 4×, and then compared our performance at reconstructing the original image. The error for several different algorithms is reported as the Root Mean Squared Error (RMSE) in Fig. 5 A). Excluding the algorithms that use additional scene information, we consistently come first or second and achieve the best average score across the two experiments. It should be noted that, due to the heavy quantization of the disparity, trivial methods such as nearest neighbor interpolation perform numerically well but perceptually they can exhibit strong artifacts such as jagged edges. As the scale factor increases, nearest neighbor's performance decreases. This is consistent with the same observation for bilinear interpolation reported in [7].
We also report results for three laser scans upsampled by a factor of 4×; see Fig. 5 B). Again our method performs best overall. Fig. 6 shows depth images along with 3D views of the results of each algorithm. It also highlights artifacts for both of the intensity image based techniques at depth discontinuities. EbSR [16] smooths over the discontinuities while ScSR [17] introduces high frequency errors that manifest as a ringing effect in the depth image and as spikes in the 3D view. Both competing methods also fail to reconstruct detail on the object's surface, such as the ridges on the back of the statue. Our method produces sharp edges like the ones present in the ground truth and also detail on the object's surface.
Fig. 6. 3D synthesized views and corresponding depth images (cropped versions of the original) of Scan 42 from Fig. 5 B) upsampled 4×. A) ScSR [17] introduces high frequency artifacts at depth discontinuities. B) EbSR [16] over-smooths the depth due to its initial interpolated upsampling. This effect is apparent in both the 3D view and depth image (pink arrow). C) Our method inserts sharp discontinuities and detail on the object surface (yellow arrow). D) Ground truth laser scan.

5.2 Qualitative Results

As described in Section 4, noisy laser depth data is not suitable for training. Fig. 7
shows the result of SR when training on the Brown Range Image Database [24]
(only using the indoor scenes) as compared with our synthetic dataset. The noise
in the laser scan data introduces both blurred and jagged artifacts.

Fig. 7. Result of super-resolving a scene using our algorithm with training data from two different sources: A) Laser scan [24]. B) Our synthetic data. Note that our dataset produces much less noise in the final result and hallucinates detail such as the thin structures in the right of the image.
Fig. 8. Upsampling input ToF image from a Canesta EP DevKit (64 × 64) by a factor of 10× [1]. A) Input depth image shown to scale in red. B) Bilateral filtering of input image (to remove noise) followed by upsampling using nearest neighbor. C) EbSR [16] produces an overly smooth result at this large upsampling factor. D) ScSR [17] recovers more high frequency detail but creates a ringing artifact. E) The Cross Bilateral [1] method produces a very sharp result, however, the method requires a high resolution intensity image at the same resolution as the desired super-resolved depth image (640 × 640), shown inset in green. F) Our method produces sharper results than C) and D).

We also compare ourselves against the cross bilateral method of Yang et al. [1]. Their technique uses an additional high resolution image of the same size as the desired output depth image. Fig. 8 shows results when upsampling a Canesta EP DevKit 64 × 64 ToF image by a factor of 10×. It is important to note that to reduce noise, [1] use an average of many successive ToF frames as input. The other methods, including ours, use only a single depth frame.
Fig. 9 shows one sample result of our algorithm for a noisy ToF image captured using a PMD CamCube 2.0. The image has a starting resolution of 200 × 200 and is upsampled by a factor of 4×. The zoomed regions in Fig. 9 C) demonstrate that
we synthesize sharp discontinuities. For more results, including moving scenes,
please see our project webpage.

Implementation Details

Fig. 5 C) gives an overview of the training and test times of our algorithm compared to other techniques. All results presented use a low resolution patch size of 3 × 3. For the MRF, we use 150 labels (high resolution candidates) and the weighting of the pairwise term is set to 10. The only parameters we change are for the bilateral filtering step. For scenes of high noise we set the window size to 5, the spatial standard deviation to 3 and the range deviation to 0.1; for all other scenes we use 5, 1.5 and 0.01, respectively. We use the default
parameters provided in the implementations for ScSR [17] and EbSR [16]. To
encourage comparison and further work, code and training data are available
from our project webpage.
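For convenience, the settings above can be summarized as a configuration sketch; the values are copied from this section, but the dictionary layout and key names are ours.

```python
# Parameter summary for the depth SR experiments (values from the text above).
SR_PARAMS = {
    "low_res_patch_size": (3, 3),
    "mrf_num_labels": 150,          # high resolution candidates per node
    "pairwise_weight": 10,
    "bilateral_high_noise": {"window": 5, "sigma_spatial": 3.0, "sigma_range": 0.1},
    "bilateral_default":    {"window": 5, "sigma_spatial": 1.5, "sigma_range": 0.01},
}
```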
Fig. 9. CamCube ToF input. A) Noisy input. B) Our result for SR by 4×; note the reduced noise and sharp discontinuities. C) Cropped region from the input depth image comparing different SR methods. The red square in A) is the location of the region.

6 Conclusions
We have extended single-image SR to the domain of depth images. In the process,
we have also assessed the suitability for depth images of two leading intensity
image SR techniques [17,16]. Measuring the RMSE of super-resolved depths with
respect to known high resolution laser scans and online Middlebury data, our
algorithm is always first or a close second best. We also show that perceptually,
we reconstruct better depth discontinuities. An additional advantage of our single
frame SR method is our ability to super-resolve moving depth videos. From the
outset, our aim of super-resolving a lone depth frame has been about producing
a qualitatively believable result, rather than a strictly accurate one. Blurring
and halos may only become noticeable as artifacts when viewed in 3D.
Four factors enabled our algorithm to produce attractive depth reconstructions.
Critically, we saw an improvement when we switched to low resolution
searches for our unary potentials, unlike almost all other algorithms. Second, spe-
cial depth-rendering of clean computer graphics models depicting generic scenes
outperforms training on noisy laser scan data. It is important that these depths are
filtered and downsampled for training, along with a pre-selection stage that favors
areas near gradients. Third, the normalization based on min/max values in the
low resolution input allows the same training patch pair to be applied at various
depths. We found that the alternative of normalizing for mean and variance is not
very robust with 3 × 3 patches, because severe noise in many depth images shifts the mean to almost one extreme or the other at depth discontinuities. Finally, we have the option of refining our high resolution depth output using a targeted noise removal algorithm. Crucially, our experiments demonstrate both the quantitative and perceptually significant advantages of our new method.
Limitations and Future Work

Currently, for a video sequence we process each frame independently. Our algo-
rithm could be extended to exploit temporal context to both obtain the most
reliable SR reconstruction, and apply it smoothly across the sequence. Similarly,
the context used to query the database could be expanded to include global
scene information, such as structure [37]. This may overcome the fact that we
have a low local signal to noise ratio from current depth sensors. As observed by
Reynolds et al. [28], flying pixels and other forms of severe non-Gaussian noise necessitate prefiltering. Improvements could be garnered with selective use of the input depths via a sensor-specific noise model. For example, Kinect and the different ToF cameras we tested all exhibit very different noise characteristics.

Acknowledgements. Funding for this research was provided by the NUI Trav-
elling Studentship in the Sciences and EPSRC grant EP/I031170/1.

References
1. Yang, Q., Yang, R., Davis, J., Nister, D.: Spatial-depth super resolution for range
images. In: CVPR (2007)
2. Schuon, S., Theobalt, C., Davis, J., Thrun, S.: Lidarboost: Depth superresolution
for tof 3D shape scanning. In: CVPR (2009)
3. Freeman, W.T., Pasztor, E.C., Carmichael, O.T.: Learning low-level vision. IJCV
(2000)
4. Kuster, C., Popa, T., Zach, C., Gotsman, C., Gross, M.: Freecam: A hybrid camera
system for interactive free-viewpoint video. In: Proceedings of Vision, Modeling,
and Visualization, VMV (2011)
5. Holz, D., Schnabel, R., Droeschel, D., Stuckler, J., Behnke, S.: Towards semantic
scene analysis with time-of-flight cameras. In: RoboCup International Symposium
(2010)
6. Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R.,
Kipman, A., Blake, A.: Real-time human pose recognition in parts from single
depth images. In: CVPR (2011)
7. Park, J., Kim, H., Tai, Y.W., Brown, M., Kweon, I.: High quality depth map
upsampling for 3D-ToF cameras. In: ICCV (2011)
8. Fattal, R.: Upsampling via imposed edges statistics. SIGGRAPH (2007)
9. Yang, L., Sander, P.V., Lawrence, J., Hoppe, H.: Antialiasing recovery. ACM Trans-
actions on Graphics (2011)
10. Irani, M., Peleg, S.: Improving resolution by image registration. In: CVGIP: Graph.
Models Image Process. (1991)
11. Hahne, U., Alexa, M.: Exposure Fusion for Time-Of-Flight Imaging. In: Pacific
Graphics (2011)
12. Rajagopalan, A.N., Bhavsar, A., Wallhoff, F., Rigoll, G.: Resolution Enhancement
of PMD Range Maps. In: Rigoll, G. (ed.) DAGM 2008. LNCS, vol. 5096, pp. 304–313.
Springer, Heidelberg (2008)
13. Izadi, S., Kim, D., Hilliges, O., Molyneaux, D., Newcombe, R.A., Kohli, P., Shotton,
J., Hodges, S., Freeman, D., Davison, A.J., Fitzgibbon, A.: Kinectfusion: Real-time
3D reconstruction and interaction using a moving depth camera. In: UIST (2011)
14. Cui, Y., Schuon, S., Derek, C., Thrun, S., Theobalt, C.: 3D shape scanning with a
time-of-flight camera. In: CVPR (2010)
15. Freeman, W.T., Jones, T.R., Pasztor, E.C.: Example-based super-resolution. Com-
puter Graphics and Applications (2002)
16. Freeman, W.T., Liu, C.: Markov random fields for super-resolution and texture
synthesis. In: Advances in Markov Random Fields for Vision and Image Processing.
MIT Press (2011)
17. Yang, J., Wright, J., Huang, T.S., Ma, Y.: Image super-resolution via sparse rep-
resentation. IEEE Transactions on Image Processing (2010)
18. Zontak, M., Irani, M.: Internal statistics of a single natural image. In: CVPR (2011)
19. Glasner, D., Bagon, S., Irani, M.: Super-resolution from a single image. In: ICCV
(2009)
20. Sun, J., Zhu, J., Tappen, M.: Context-constrained hallucination for image super-
resolution. In: CVPR (2010)
21. HaCohen, Y., Fattal, R., Lischinski, D.: Image upsampling via texture hallucina-
tion. In: ICCP (2010)
22. Gal, R., Shamir, A., Hassner, T., Pauly, M., Cohen-Or, D.: Surface reconstruction
using local shape priors. In: Symposium on Geometry Processing (2007)
23. Golovinskiy, A., Matusik, W., Pfister, H., Rusinkiewicz, S., Funkhouser, T.: A
statistical model for synthesis of detailed facial geometry. SIGGRAPH (2006)
24. Huang, J., Lee, A., Mumford, D.: Statistics of range images. In: CVPR (2000)
25. Diebel, J., Thrun, S.: An application of Markov random fields to range sensing. In:
NIPS (2005)
26. Chan, D., Buisman, H., Theobalt, C., Thrun, S.: A Noise-Aware Filter for Real-
Time Depth Upsampling. In: Workshop on Multi-camera and Multi-modal Sensor
Fusion Algorithms and Applications (2008)
27. Schuon, S., Theobalt, C., Davis, J., Thrun, S.: High-quality scanning using time-
of-flight depth superresolution. In: CVPR Workshops (2008)
28. Reynolds, M., Dobos, J., Peel, L., Weyrich, T., Brostow, G.J.: Capturing time-of-
flight data with confidence. In: CVPR (2011)
29. Kolmogorov, V.: Convergent tree-reweighted message passing for energy minimiza-
tion. PAMI (2006)
30. Wainwright, M., Jaakkola, T., Willsky, A.: Map estimation via agreement on (hy-
per)trees: Message-passing and linear programming approaches. IEEE Transactions
on Information Theory (2002)
31. Frank, M., Plaue, M., Rapp, H., Köthe, U., Jähne, B., Hamprecht, F.A.: The-
oretical and experimental error analysis of continuous-wave time-of-flight range
cameras. Optical Engineering (2009)
32. Tomasi, C., Manduchi, R.: Bilateral filtering for gray and color images. In: ICCV
(1998)
33. (USF Range Database), http://marathon.csee.usf.edu/range/DataBase.html
34. Saxena, A., Sun, M., Ng, A.: Make3d: Learning 3d scene structure from a single
still image. PAMI (2009)
35. Saxena, A., Driemeyer, J., Kearns, J., Ng, A.: Robotic grasping of novel objects.
In: NIPS (2006)
36. Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo
correspondence algorithms. IJCV (2002)
37. Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor Segmentation and Support
Inference from RGBD Images. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato,
Y., Schmid, C. (eds.) ECCV 2012, Part V. LNCS, vol. 7576, pp. 746–760. Springer,
Heidelberg (2012)
Annotation Propagation in Large Image Databases
via Dense Image Correspondence

Michael Rubinstein^{1,2}, Ce Liu^2, and William T. Freeman^1

1 MIT CSAIL
2 Microsoft Research New England

Abstract. Our goal is to automatically annotate many images with a set of word
tags and a pixel-wise map showing where each word tag occurs. Most previous ap-
proaches rely on a corpus of training images where each pixel is labeled. However,
for large image databases, pixel labels are expensive to obtain and are often un-
available. Furthermore, when classifying multiple images, each image is typically
solved for independently, which often results in inconsistent annotations across
similar images. In this work, we incorporate dense image correspondence into
the annotation model, allowing us to make do with significantly less labeled data
and to resolve ambiguities by propagating inferred annotations from images with
strong local visual evidence to images with weaker local evidence. We establish
a large graphical model spanning all labeled and unlabeled images, then solve it
to infer annotations, enforcing consistent annotations over similar visual patterns.
Our model is optimized by efficient belief propagation algorithms embedded in an
expectation-maximization (EM) scheme. Extensive experiments are conducted to
evaluate the performance on several standard large-scale image datasets, showing
that the proposed framework outperforms state-of-the-art methods.

1 Introduction
We seek to annotate a large set of images, starting from a partially annotated set. The
resulting annotation will consist of a set of word tags for each image, as well as per-
pixel labeling indicating where in the image each word tag appears. Such automatic
annotation will be very useful for Internet image search, but will also have applications
in other areas such as robotics, surveillance, and public safety.
In computer vision, scholars have investigated image annotation in two directions.
One methodology uses image similarities to transfer textual annotation from labeled
images to unlabeled samples, under the assumption that similar images should have
similar annotations. Notably, Makadia et al. [1] recently proposed a simple baseline
approach for auto-annotation based on global image features, and a greedy algorithm
for transferring tags from similar images. ARISTA [2] automatically annotated a web
dataset of billions of images by transferring tags via near-duplicates. Although those
methods have clearly demonstrated the merits of using similar images to transfer anno-
tations, they do so only between globally very similar images, and cannot account for
locally similar patterns.
Consequently, the other methodology focused on dense annotation of images, known
as semantic labeling [36], where correspondences between text and local image fea-
tures are established for annotation propagation: similar local features should have simi-
lar labels. These methods often aim to label each pixel in an image using models learned

[Figure 1 panels: (a) Input database + sparse annotations; (b) Annotation propagation (arrows indicate dense pixel correspondence); (c) Annotated database.]
Fig. 1. Input and output of our system. (a) The input image database, with scarce image tags
and scarcer pixel labels. (b) Annotation propagation over the image graph, connecting similar
images and corresponding pixels. (c) The fully annotated database by our system (more results
can be found in Fig. 5 and the supplemental material).

from a training database. In [7] and [8], relationships between text and visual words are
characterized by conventional language translation models. More recently, Shotton et
al. [3] proposed to train a discriminative model based on the texton representation, and
to use conditional random fields (CRF) to combine various cues to generate spatially
smooth labeling. This approach was successively extended in [4], where efficient ran-
domized decision forests are used for significant speedup. Liu et al. [5] proposed a
nonparametric approach to semantic labeling, where text labels are transferred from a
labeled database to parse a query image via dense scene correspondences.
Since predicting annotation from image features is by nature ambiguous (e.g. tex-
tureless regions can be sky, wall, or ceiling), such methods rely on regularities in a large
corpus of training data of pixel-wise densely labeled images. However, for large im-
age databases, high-quality pixel labels are very expensive to obtain. Furthermore, the
annotation is typically computed independently for each image, which often results in
inconsistent annotations due to visual ambiguities in image-to-text.
In this work, we show that we can overcome these drawbacks by incorporating dense
image correspondences into the annotation model. Image correspondences can help al-
leviate local ambiguities by propagating annotations in images with strong local evi-
dence to images with weaker evidence. They also implicitly add consideration of con-
textual information.
We present a principled framework for automatic image annotation based on dense
labeling and image correspondence. Our model is guided by four assumptions: (a) re-
gions corresponding to each tag have distinct visual appearances, (b) neighboring pixels
in one image tend to have the same annotation, (c) similar patterns across different im-
ages should have similar annotation, and (d) tags and tag co-occurrences which are
more frequent in the database are also more probable.
By means of dense image correspondence [9], we establish a large image graph
across all labeled and unlabeled images for consistent annotations over similar image
patterns across different images, where annotation propagation is solved jointly for the
entire database rather than for a single image. Our model is effectively optimized by
efficient belief propagation algorithms embedded within an expectation-maximization
(EM) scheme, alternating between visual appearance modeling (via Gaussian mixture
models) and pixel-wise tag inference. We conducted extensive experiments on standard
large-scale datasets, namely LabelMe [10], ESP [11] and IAPR [12], showing that our
system consistently outperforms the state-of-the-art in automatic annotation and seman-
tic labeling, while requiring significantly less labeled data.
In summary, this paper makes the following contributions. (a) We show that dense
correspondences between similar images can assist in resolving text-to-image visual
ambiguities, and can dramatically decrease the amount of training data for automatic
annotation. (b) We propose a novel graphical model for solving image annotation jointly
by building an dense image graph that spans the entire database for propagating anno-
tations. (c) Our novel formulation assumes scarce data, and combines different levels of
supervision - tagged images (cheaper to obtain) and labeled images (expensive to ob-
tain). We thoroughly studied how the ratio of tags and labels affects the performance.

2 Terminology and Formulation

Textual tagging and pixel-level labeling are both plausible types of annotations for
an image. In this paper, we will consistently refer to semantic, pixel-level labeling (or
semantic segmentation) as labeling and to textual image-level annotation as tags.
The set of tags of an image is comprised of words from a vocabulary. These words
can be specified by a user, or obtained from text surrounding an image on a webpage.
The input database D = (I, V) is comprised of RGB images I = {I_1, ..., I_N} and vocabulary V = {l_1, ..., l_L}. We treat auto-annotation as a labeling problem, where the goal is to compute the labeling C = {c_1, ..., c_N} for all images in the database, where for pixel p = (x, y), c_i(p) ∈ {1, ..., L, ∅} (∅ denoting unlabeled) indexes into vocabulary V. The tags associated with the images, T = {t_1, ..., t_N : t_i ⊆ {1, ..., L}}, are then defined directly as the set union of the pixel labels, t_i = ∪_{p∈Λ_i} c_i(p), where Λ_i is image I_i's lattice.
Initially, we may be given annotations for some images in the database. We denote by I_t and T_t, and I_l and C_l, the corresponding subsets of tagged images with their tags, and labeled images with their labels, respectively. Also let (r_t, r_l) be the ratio of the database images that are tagged and labeled, respectively.
We formulate this discrete labeling problem in a probabilistic framework, where our goal is to construct the joint posterior distribution of pixel labels given the input database and available annotations, P(C | D, T_t, C_l). We assume that (a) corresponding patterns across images should have similar annotation, and (b) neighboring pixels in one image tend to have the same annotation. This effectively factorizes the distribution as follows. Written in energy form (−log P(C | D, T_t, C_l)), the cost of global assignment C of labels to pixels is given by

    E(C) = Σ_{i=1}^{N} Σ_{p∈Λ_i} [ ψ_p(c_i(p)) + Σ_{q∈N_p} ψ_int(c_i(p), c_i(q)) + Σ_{j∈N_i} ψ_ext(c_i(p), c_j(p + w_ij(p))) ],        (1)

where N_p are the spatial neighbors of pixel p within its image, and N_i are other images similar to image I_i. For each such image j ∈ N_i, w_ij defines the (dense) correspondence between pixels in I_i and pixels in I_j.
[Figure 2 diagram: database representation (dense image features, Section 3.1; appearance model, Section 3.2; image graph, Section 4.1) and annotation propagation (learning, inference and propagation, Section 4.2; human annotation, Section 4.3), from the input image database to the fully annotated output database.]
Fig. 2. System overview. Each rectangle corresponds to a module in the system.

This objective function defines a (directed) graphical model over the entire image database. We define the data term, ψ_p, by means of local visual properties of image I_i at pixel p. We extract features from all images and learn a model to correspond vocabulary words with image pixels. Since visual appearance is often ambiguous at the pixel or region level, we use two types of regularization. The intra-image smoothness term, ψ_int, is used to regularize the labeling with respect to the image structures, while the inter-image smoothness term, ψ_ext, is used for regularization across corresponding pixels in similar images.
Fig. 2 shows an overview of our system, and Fig. 4 visualizes this graphical model. In the upcoming sections we describe in detail how we formulate each term of the objective function, how we optimize it jointly for all pixels in the database, and show experimental results of this approach.

3 Text-to-Image Correspondence
To characterize text-to-image correspondences, we first estimate the probability distri-
bution Pa (ci (p)) of words occurring at each pixel, based on local visual properties.
For example, an image might be tagged with car and road, but their locations within
the image are unknown. We leverage the large number of images and the available
tags to correlate the database vocabulary with image pixels. We first extract local im-
age features for every pixel, and then learn a visual appearance model for each word
by utilizing visual commonalities among images with similar tags, and pixel labels (if
available).
3.1 Local Image Descriptors
We selected features used prevalently in object and scene recognition to characterize local image structures and color features. Structures are represented using both SIFT [9] and HOG [13] features. We compute dense SIFT descriptors with 3 and 7 cells around a pixel to account for scales. We then compute HOG features and stack together neighboring HOG descriptors within 2 × 2 patches [14]. Color is represented using a 7 × 7 patch in L*a*b color space centered at each pixel. Stacking all the features yields a 527-dimensional descriptor D_i(p) for every pixel p in image I_i. We use PCA to reduce the descriptor to d = 50 dimensions, capturing approximately 80% of the features' variance.
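A rough sketch of the per-pixel descriptor construction; the dense feature maps are assumed to come from external extractors (placeholders here), and the use of scikit-learn's PCA is our substitution.

```python
import numpy as np
from sklearn.decomposition import PCA

def stack_descriptors(sift_s3, sift_s7, hog_2x2, lab_patches, d=50):
    """Concatenate per-pixel features (two dense SIFT scales, stacked HOG,
    7x7 L*a*b* patch) into one descriptor per pixel, then reduce with PCA.
    Each input is an (H*W, d_k) array produced by a dense feature extractor."""
    D = np.hstack([sift_s3, sift_s7, hog_2x2, lab_patches])  # ~527-D per pixel
    pca = PCA(n_components=d).fit(D)
    return pca.transform(D), pca
```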
3.2 Learning Appearance Models

We use a generative model based on Gaussian mixtures to represent the distribution of


the above continuous features. More specifically, we model each word in the database
vocabulary using a full-covariance Gaussian Mixture Model (GMM) in the 50D de-
scriptor space. Such models have been successfully applied in the past to model object
appearance for image segmentation [15]. Note that our system is not limited to work
with this particular model. In fact, we also experimented with a discriminative approach,
Randomized Forests [16], previously used for semantic segmentation [4] as an alterna-
tive to GMM. We found that GMM produces better results than random forests in our
system (see Section 5).
For pixel p in image I_i, we define

    P_a(c_i(p); Θ) = Σ_{l=1}^{L} ρ_{i,l}(p) Σ_{k=1}^{M} π_{l,k} N(D_i(p); μ_{l,k}, Σ_{l,k}) + ρ_{i,∅}(p) N(D_i(p); μ_∅, Σ_∅),        (2)

where ρ_{i,l}(p) = P_a(c_i(p) = l; Θ) is the weight of model l in generating the feature D_i(p), M is the number of components in each model (M = 5), and θ_l = {π_{l,k}, μ_{l,k}, Σ_{l,k}} is the mixture weight, mean and covariance of component k in model l, respectively. We use a Gaussian outlier model with parameters θ_∅ = (μ_∅, Σ_∅) and weights ρ_{i,∅}(p) for each pixel. The intuition for the outlier model is to add an unlabeled word to the vocabulary V. Θ = ({ρ_{i,l}}_{i=1:N, l=1:L}, {ρ_{i,∅}}_{i=1:N}, θ_1, ..., θ_L, θ_∅) is a vector containing all parameters of the model.
We optimize for Θ in the maximum likelihood sense using a standard EM algorithm. We initialize the models by partitioning the descriptors into L clusters using k-means and fitting a GMM to each cluster. The outlier model is initialized from randomly selected pixels throughout the database. We also explicitly restrict each pixel to contribute its data to models corresponding to its estimated (or given) image tags only. That is, we clamp ρ_{i,l}(p) = 0 for all l ∉ t_i. For labeled images I_l, we keep the posterior fixed according to the given labels throughout the process (setting zero probability to all other labels). To account for partial annotations, we introduce an additional weight ω_i for all descriptors of image I_i, set to ω_t, ω_l or 1 (we use ω_t = 5, ω_l = 10) according to whether image I_i was tagged, labeled, or inferred automatically by the algorithm, respectively. More details can be found in the supplementary material.
Fig. 3 (b) and (c) show an example result of the algorithm on an image from LabelMe
Outdoors dataset (See Section 5 for details on the experiment). Clearly, our model is
able to capture meaningful correlations between words and image regions.
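A minimal per-word appearance model in this spirit, using scikit-learn's full-covariance GMMs; the clamped responsibilities, the outlier component, and the EM re-estimation described above are omitted, so this is only an initialization-stage sketch under the assumption of a single word estimate per pixel.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def init_word_models(descriptors, pixel_words, vocab_size, M=5):
    """Fit one full-covariance GMM (M components) per vocabulary word, using
    only descriptors of pixels currently assigned that word."""
    models = {}
    for l in range(vocab_size):
        X = descriptors[pixel_words == l]
        if len(X) < M:
            continue
        models[l] = GaussianMixture(n_components=M,
                                    covariance_type="full").fit(X)
    return models

def word_likelihoods(models, D):
    """Per-pixel likelihood of each word's appearance model (cf. Eq. 2)."""
    return {l: np.exp(g.score_samples(D)) for l, g in models.items()}
```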

4 Annotation Propagation

Now that we have obtained reasonable local evidences, we can utilize image informa-
tion and correspondences between images to regularize the estimation and propagate
annotations between images in the database. To this end, we compute dense pixel-wise
correspondences and construct an image graph connecting similar images and corre-
sponding pixels.
Fig. 3. Text-to-image and dense image correspondences. (a) An image from LabelMe Outdoors
dataset. (b) Visualization of the local evidence Pa (Eq. 2) for the four most probable words,
colored from black (low probability) to white (high probability). (c) The MAP classification based
on local evidence alone, computed independently at each pixel. (d) The classification with spatial
regularization (Eq. 8). (e-f) Nearest neighbors of the image in (a) and dense pixel correspondences
with (a). (g) Same as (b) for each of the neighbors, warped towards the image (a). (h) The final
MAP labeling using inter-image regularization (Eq. 7).

4.1 Dense Image Correspondence

We first fetch the top (K, ε) nearest neighbors, N_i, for every image I_i using the GIST descriptor [17], where K is the maximum number of neighbors and ε is a threshold on the distance between the images (above which neighbors are discarded) [5]. We use SIFT-flow [9] to align the image with each of its neighbors, which results in an integer warp field w_ij mapping each pixel in I_i to a pixel in I_j. Notice that neither the nearest neighbors nor w_ij need necessarily be symmetric (as indicated by the arrows in Fig. 1(b)).
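A sketch of the image-graph construction; the GIST descriptors are assumed to be precomputed, and scikit-learn's nearest-neighbor search stands in for whatever retrieval structure the authors used.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_image_graph(gist, K=16, eps=np.inf):
    """For each image, keep up to K nearest neighbors whose GIST distance is
    below eps. Returns a list of neighbor-index arrays; the graph is directed."""
    nbrs = NearestNeighbors(n_neighbors=K + 1).fit(gist)
    dists, idx = nbrs.kneighbors(gist)
    graph = []
    for i in range(len(gist)):
        keep = (idx[i] != i) & (dists[i] < eps)
        graph.append(idx[i][keep][:K])
    return graph
```

SIFT-flow would then be run between each image and its retained neighbors to obtain the warp fields w_ij described above.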

4.2 Large-Scale Inference

Data Term. The data term is comprised of four components:

    ψ_p(c_i(p) = l) = −log P_a(c_i(p); Θ) − log P_t^i(l) − λ_s log P_s(c_i(p)) − λ_c log P_c^i(c_i(p)),        (3)

where P_a(c_i(p); Θ) was defined in the previous section, P_t^i(l) is the estimated probability of image I_i having the tag l, and P_s(c_i(p)) and P_c^i(c_i(p)) capture the probability of the pixel p having the word l based on its relative spatial position and color, respectively. We use superscript i in P_t^i and P_c^i to emphasize that they are estimated separately for each image. λ_s, λ_c balance the contribution of P_s and P_c^i, respectively.
The term P_t^i(l) is used to bias the vocabulary of image I_i towards words with higher frequency and co-occurrence among its neighbors. We estimate this probability as

    P_t^i(l) = (β / |N_i|) Σ_{j∈N_i} [l ∈ t_j] + ((1 − β) / Z) Σ_{j∈N_i} Σ_{m∈t_j} h^o(l, m),        (4)

where [·] evaluates to 1 when the inside condition is true, h^o is the L × L row-normalized tag co-occurrence matrix, computed from the current tag estimates and initialized from the known tags, and Z = Σ_{j∈N_i} |t_j|. The first term in this prior measures the frequency of word l among image I_i's neighbors, and the second term is the mean co-occurrence rate of word l within its neighbors' tags. We typically set β = 0.5, assigning equal contribution to the two terms. This term is inspired by [1], but we do not set an explicit threshold on the number of tags to infer as they do.
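The tag prior of Eq. 4 might be computed as below; the neighbor tag sets and the row-normalized co-occurrence matrix h_o are assumed given, and the (β, 1−β) mixing follows our reading of the equation.

```python
import numpy as np

def tag_prior(neighbor_tags, h_o, L, beta=0.5):
    """P_t^i(l): frequency of word l among the neighbors' tags plus its mean
    co-occurrence with those tags (Eq. 4). neighbor_tags is a list of tag-index
    lists; h_o is the row-normalized L x L co-occurrence matrix."""
    freq = np.zeros(L)
    cooc = np.zeros(L)
    Z = sum(len(t) for t in neighbor_tags) or 1
    for t in neighbor_tags:
        freq[list(set(t))] += 1.0
        for m in t:
            cooc += h_o[:, m]
    return beta * freq / max(len(neighbor_tags), 1) + (1 - beta) * cooc / Z
```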
Both the spatial and color terms are computed from the current pixel label estimates. The spatial location term is computed as

    P_s(c_i(p) = l) = h_l^s(p),        (5)

where h_l^s(p) is the normalized spatial histogram of word l across all images in the database. This term will assist in places where the appearance and pixel correspondence might not be as reliable (see Fig. 2 in the supplemental material).
The color term will assist in refining the labels internally within the image, and is computed as

    P_c^i(c_i(p) = l) = h_c^{i,l}(I_i(p)),        (6)

where h_c^{i,l} is the color histogram of word l in image I_i. We use 3D histograms of 64 bins in each of the color channels to represent h_c^{i,l}.

Smoothness Terms. To encourage corresponding pixels in similar images to attain the same word, we define the inter-image compatibility between corresponding pixels p and r = p + w_ij(p) in images I_i and I_j, respectively, as

    ψ_ext(c_i(p) = l_p, c_j(r) = l_r) = (ω_j / ω_i) [l_p ≠ l_r] exp(−β_ext |S_i(p) − S_j(r)|),        (7)

where ω_i, ω_j are the image weights as defined in Section 3.2, and S_i is the SIFT descriptor for image I_i. Intuitively, better matching between corresponding pixels will result in higher penalty when assigning them different labels, weighted by the relative importance of the neighbor's label.
The intra-image compatibility between neighboring pixels is based on tag co-occurrence and image structures. For image I_i and spatial neighbors p, q ∈ N_p we define

    ψ_int(c_i(p) = l_p, c_i(q) = l_q) = −λ_o log h^o(l_p, l_q) + [l_p ≠ l_q] exp(−β_int |I_i(p) − I_i(q)|).        (8)

Overall, the free parameters of this graphical model are {λ_s, λ_c, λ_o, β_int, β_ext}, which we tune manually. We believe these parameters could be learned from a database with ground-truth labeling, but leave this for future work.
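Illustrative implementations of the two smoothness potentials (Eqs. 7 and 8), using the symbols as written above; these are sketches under our reconstruction of the equations, not the authors' code.

```python
import numpy as np

def psi_ext(lp, lr, s_i_p, s_j_r, w_i, w_j, beta_ext=5.0):
    """Inter-image compatibility (Eq. 7): penalize different labels on
    well-matched corresponding pixels, weighted by the neighbor's importance."""
    if lp == lr:
        return 0.0
    return (w_j / w_i) * np.exp(-beta_ext * np.abs(s_i_p - s_j_r).sum())

def psi_int(lp, lq, col_p, col_q, h_o, lambda_o=2.0, beta_int=60.0):
    """Intra-image compatibility (Eq. 8): tag co-occurrence plus a
    contrast-sensitive Potts term between neighboring pixels."""
    cost = -lambda_o * np.log(h_o[lp, lq] + 1e-12)
    if lp != lq:
        cost += np.exp(-beta_int * np.abs(col_p - col_q).sum())
    return cost
```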
Inference. We use a parallel coordinate descent algorithm to optimize Eq. 1, which al-
ternates between estimating the appearance model and propagating annotations between
pixels. The appearance model is initialized from the images and partial annotations in
the database. Then, we partition the message passing scheme into intra- and inter-image
updates, parallelized by distributing the computation of each image to a different core.
Message passing is performed using efficient parallel belief propagation, while intra-
image update iterates between belief propagation and re-estimating the color models
(Eq. 6) in a GrabCut fashion [18]. Once the algorithm converges, we compute the MAP
labeling that determines both labels and tags for all the images.

4.3 Choosing Images to Annotate


As there is freedom to choose images to be labeled by the user, intuitively, we would
want to strategically choose image hubs that have many similar images, as they will
have many direct neighbors in the image graph to which they can propagate labels
efficiently. We use visual pagerank [19] to find good images to label, using the GIST
descriptor as the image similarity measure. To make sure that images throughout the
database are considered, we initially cluster the images and use a non-uniform damping
factor in the visual rank computation (Eq. 2 in [19]), assigning higher weight to the
images closest to the cluster centers. Given an annotation budget (rt , rl ), we then set Il
and It as the rl and rt top ranked images, respectively. Fig. 4(c) shows the top image
hubs selected automatically with this approach, which nicely span the variety of scenes
in the database.
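A sketch of the hub-selection step: PageRank over the image graph with a non-uniform damping (personalization) vector that up-weights images near cluster centers. The particular weighting shown is an assumption in the spirit of [19], not its exact formulation.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_hubs(gist, graph, num_hubs, damping=0.85, iters=50, n_clusters=20):
    """Rank images by a personalized PageRank on the (directed) image graph and
    return the indices of the top-ranked hub images for human annotation."""
    N = len(graph)
    A = np.zeros((N, N))
    for i, nbrs in enumerate(graph):
        for j in nbrs:
            A[j, i] = 1.0                      # edge i -> j
    A /= np.maximum(A.sum(axis=0, keepdims=True), 1e-12)  # column-stochastic

    km = KMeans(n_clusters=n_clusters, n_init=10).fit(gist)
    d2 = ((gist - km.cluster_centers_[km.labels_]) ** 2).sum(axis=1)
    personalization = np.exp(-d2 / (d2.mean() + 1e-12))   # favor cluster centers
    personalization /= personalization.sum()

    r = np.full(N, 1.0 / N)
    for _ in range(iters):
        r = damping * A.dot(r) + (1 - damping) * personalization
    return np.argsort(-r)[:num_hubs]
```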

5 Experiments
We conducted extensive experiments with the proposed method using several datasets: SUN [14] (9556 256 × 256 images, 522 words), LabelMe Outdoors (LMO, a subset of SUN) [10] (2688 256 × 256 images, 33 words), the ESP game image set [11] (21,846 images, 269 words) and the IAPR benchmark [12] (19,805 images, 291 words). Since both LMO and SUN include dense human labeling, we use them to simulate human annotations for both training and evaluation. We use ESP and IAPR data as used by [1]. They contain user tags but no pixel labeling.
We implemented the system using MATLAB and C++ and ran it on a small cluster of three machines with a total of 36 CPU cores. We tuned the algorithm's parameters on the LMO dataset, and fixed the parameters for the rest of the experiments to the best performing setting: λ_s = 1, λ_c = 2, λ_o = 2, β_int = 60, β_ext = 5. In practice, 5 iterations are required for the algorithm to converge to a local minimum. The EM (Section 3) and inference (Section 4.2) algorithms typically converge within 15 and 50 iterations, respectively. Using K = 16 neighbors for each image gave the best result (Fig. 6(c)), and we did not notice significant change in performance for small modifications to ε (Section 4.1), d (Section 3.1), or the number of GMM components in the appearance model, M (Eq. 2). For the aforementioned settings, it takes the system 7 hours to preprocess the LMO dataset (compute descriptors and the image graph) and 12 hours to propagate annotations. The run times on SUN were 15 hours and 26 hours, respectively.
Fig. 4. The image graph serves as an underlying representation for inference and human
labeling. (a) Visualization of the image graph of the LMO dataset using 400 sampled images
and K = 10. The size of each image corresponds to its visual pagerank score. (b) The graphi-
cal model. Each pixel in the database is connected to spatially adjacent pixels in its image and
corresponding pixels in similar images. (c) Top ranked images selected automatically for human
annotation. Underneath each image is its visual pagerank score.

Results on SUN and LMO. Fig. 5(a) shows some annotation results on SUN using
(rt = 0.5, rl = 0.05). All images shown are initially untagged in the database. Our
system successfully infers most of the tags in each image, and the labeling corresponds
nicely to the image content. In fact, the tags and labels we obtain automatically are often
remarkably accurate, considering that no information was initially available on those
images. Some of the images contain regions with similar visual signatures, yet are still
classified correctly due to good correspondences with other images. More results are
available in the supplemental material.
To evaluate the results quantitatively, we compared the inferred labels and tags against
human labels. We compute the global pixel recognition rate, r, and the class-average recognition rate r̄ for images in the subset I \ I_l, where the latter is used to counter the bias towards more frequent words. We evaluate tagging performance on the image set I \ I_t in terms of the ratio of correctly inferred tags, P (precision), and the ratio of missing tags that were inferred, R (recall). We also compute the unbiased class-average measures P̄ and R̄.
The global and class-average pixel recognition rates are 63% and 30% on LMO, and
33% and 19% on SUN, respectively. In Fig. 6 we show for LMO the confusion matrix
and breakdown of the scores into the different words. The diagonal pattern in the con-
fusion matrix indicates that the system recognizes correctly most of the pixels of each
word, except for less frequent words such as moon and cow. The vertical patterns (e.g.
in the column of building and sky) indicate a tendency to misclassify pixels into those
Table 1. Tagging and labeling performance on LMO and SUN datasets. Numbers are given in percentages. In (a), AP-RF replaces the GMM model in the annotation propagation algorithm with Random Forests [16]; AP-NN replaces SIFT-flow with nearest neighbor correspondences.

(a) Results on LMO:
                 Labeling           Tagging
                 r       r̄         P       P̄      R       R̄
  Makadia [1]    -       -          53.87   30.23   60.82   25.51
  STF [4]        52.83   24.9       34.21   30.56   73.83   58.67
  LT [5]         53.1    24.6       41.07   35.5    44.3    19.33
  AP (ours)      63.29   29.52      55.26   38.8    59.09   22.62
  AP-RF          56.17   26.1       48.9    36.43   60.22   24.34
  AP-NN          57.62   26.45      47.5    35.34   59.83   24.01

(b) Results on SUN:
                 Labeling           Tagging
                 r       r̄         P       P̄      R       R̄
  Makadia [1]    -       -          26.67   11.33   39.5    14.32
  STF [4]        20.52   9.18       11.2    5.81    62.04   16.13
  AP (ours)      33.29   19.21      32.25   14.1    47      13.74

Table 2. Tagging performance on ESP and IAPR. Numbers are in percentages. The top row is quoted from [1].

                 ESP                             IAPR
                 P       R       P̄      R̄      P       R       P̄      R̄
  Makadia [1]    22      25      -       -       28      29      -       -
  AP (Ours)      24.17   23.64   20.28   13.78   27.89   25.63   19.89   12.23

words due to their frequent co-occurrence with other words (objects) in that dataset.
From the per-class recognition plot (Fig. 6(b)) it is evident that the system generally
performs better on words which are more frequent in the database.
Components of the Model. Our objective function (Eq. 1) is comprised of three main
components, namely the visual appearance model, and the inter-image and intra-image
regularizers. To evaluate the effect of each component on the performance, we repeated
the above experiment where we first enabled only the appearance model (classifying
each pixel independently according to its visual signature), and gradually added the
other terms. The results are shown in Fig. 6(d) for varying values of rl . Each term
clearly assists in improving the result. It is also evident that image correspondences play an important role in improving automatic annotation performance. The effect of the
inference module of the algorithm on the results is demonstrated visually in Fig. 3 and
the supplemental material.
Comparison with State-of-the-Art. We compared our tagging and labeling results with the state-of-the-art in semantic labeling, Semantic Texton Forests (STF) [4] and Label Transfer (LT) [5], and in image annotation (Makadia et al. [1]). We used our own implementation of [1] and the publicly available implementations of [4] and [5]. We set the parameters of all methods according to the authors' recommendations, and used the same tagged and labeled images selected automatically by our algorithm as the training set for the other methods (only tags are considered by [1]). The results of this comparison are summarized in Table 1. On both datasets our algorithm shows clear improvement in both labeling and tagging results. On LMO, r is increased by 5 percent compared to STF, and P is increased by 8 percent over Makadia et al.'s baseline method. STF recall is higher, but comes at the cost of lower precision as more tags are associated on average with each image (Fig. 7).
[Figure 5 example annotations: (a) SUN (9556 images, 522 words); (b) ESP (21,846 images, 269 words); (c) IAPR-TC12 (19,805 images, 291 words).]
Fig. 5. Automatic annotation results (best viewed electronically). (a) SUN results were pro-
duced using (rt = 0.5, rl = 0.05) (4778 images tagged and 477 images labeled, out of 9556
images). All images shown are initially un-annotated in the database. (b-c) For ESP and IAPR,
the same training set as in [1] was used, having rt = 0.9, rl = 0. For each example we show
the source image on the left, and the resulting labeling and tags on the right. The word colormap
for SUN is the average pixel color based on the ground truth labels, while in ESP and IAPR each
word is assigned an arbitrary unique color. Example failures are shown in the last rows of (a) and
(c). More results can be found in Fig. 1 and the supplemental material.
[Figure 6 plots; y-axes: pixel recognition rate; legend for (d): Learning, Learning + intra, Learning + intra + inter.]
Fig. 6. Recognition rates on LMO dataset. (a) The pattern of confusion across the dataset vocabulary. (b) The per-class average recognition rate, with words ordered according to their frequency in the dataset (measured from the ground truth labels), colored as in Fig. 5. (c) Recognition rate as a function of the number of nearest neighbors, K. (d) Recognition rate vs. ratio of labeled images (r_l) with different terms of the objective function enabled.

In [5], a pixel recognition rate of 74.75% is obtained on LMO at a 92% training/test split with all the training images containing dense pixel labels. However, in our more challenging (and realistic) setup, their pixel recognition
rate is 53% and their tag precision and recall are also lower than ours. [5] also report
the recognition rate of TextonBoost [3], 52%, on the same 92% training/test split they
used in their paper, which is significantly lower than the recognition rate we achieve,
63%, while using only a fraction of the densely labeled images. The performance of all
methods drops significantly on the more challenging SUN dataset. Still, our algorithm
outperforms STF by 10 percent in both r and P , and achieves 3% increase in precision
over [1] with similar recall.
We show qualitative comparison with STF in Fig. 7 and the supplemental material. Our algorithm generally assigns fewer tags per image in comparison to STF, for two reasons. First, dense correspondences are used to rule out improbable labels due to ambiguities in local visual appearance. Second, we explicitly bias towards more frequent words and word co-occurrences in Eq. 4, as those were shown to be key factors for transferring annotations [1].

Fig. 7. Comparison with [4] on SUN dataset. [Figure rows: source image, STF [4] result, our result.]

Results on ESP and IAPR. We also compared our tagging results with [1] on the
ESP image set and IAPR benchmark using the same training set and vocabulary they
used (rt = 0.9, rl = 0). Our precision (Table 2) is 2% better than [1] on ESP, and is
on par with their method on IAPR. IAPR contains many words that do not relate to
particular image regions (e.g. photo, front, range), which does not fit within our text-
to-image correspondence model. Moreover, many images are tagged with colors (e.g.
white), while the image correspondence algorithm we use emphasizes structure. [1]
assigns stronger contribution to color features, which seems more appropriate for this
dataset. Better handling of such abstract keywords, as well as improving the quality of
the image correspondence are both interesting directions for future work.
Limitations. Some failure cases are shown in Fig. 5 (bottom rows of (a) and (c)).
Occasionally the algorithm mixes words with similar visual properties (e.g. door and
window, tree and plant), or semantic overlap (e.g. building and balcony). We noticed
that street and indoor scenes are generally more challenging for the algorithm. Dense
alignment of arbitrary images is a challenging task, and incorrect correspondences can
adversely affect the solution. In Fig. 5(a) bottom row we show some examples of inac-
curate annotation due to incorrect correspondences. More failure cases are available in
the supplemental material.

Training Set Size. As we are able to efficiently propagate annotations between images in our model, our method requires significantly less labeled data than previous methods (typically r_t = 0.9 in [1], r_l = 0.5 in [4]). We further investigate the performance of the system w.r.t. the training set size on the SUN dataset.
We ran our system with varying values of r_t = 0.1, 0.2, ..., 0.9 and r_l = 0.1 r_t. To characterize the system performance, we use the F1 measure, F(r_t, r_l) = 2PR / (P + R), with P and R the precision and recall as defined above. Following the user study by Vijayanarasimhan and Grauman [20], fully labeling an image takes 50 seconds on average, and we further assume that supplying tags for an image takes 20 seconds. Thus, for example, tagging 20% and labeling 2% of the images in SUN requires 13.2 human hours. The result is shown in Fig. 8. The best performance of [1], which requires roughly 42 hours of human effort, can be achieved with our system with less than 20 human hours. Notice that our performance increases much more rapidly, and that both systems converge, indicating that beyond a certain point adding more annotations does not introduce new useful data to the algorithms.

Fig. 8. Performance (F) as a function of human annotation time of our method and [1] on the SUN dataset. [Plot legend: Makadia [1], AP (Ours); x-axis: human annotation time (hours); y-axis: performance (F).]

Running Time. While it was clearly shown that dense correspondences facilitate annotation propagation, they are also one of the main sources of complexity in our model. For example, for 10^5 images, each 1 mega-pixel on average, and using K = 16, these add O(10^12) edges to the graph. This requires significant computation that may be prohibitive for very large image databases. To address these issues, we have experimented with using sparser inter-image connections using a simple sampling scheme. We partition each image I_i into small non-overlapping patches, and for each patch and image j ∈ N_i we keep a single edge for the pixel with best match in I_j according to the estimated correspondence w_ij. Fig. 9 shows the performance and running time for varying sampling rates of the inter-image edges (e.g. 0.06 corresponds to using 4 × 4 patches for sampling, thus using 1/16 of the edges). This plot clearly shows that the running time decreases much faster than performance. For example, we achieve more than 30% speedup while sacrificing 2% accuracy by

using only 1/4 of the inter-image edges. Note that intra-image message passing is still
performed in full pixel resolution as before, while further speedup can be achieved by
running on sparser image grids.
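The patch-based sampling scheme can be sketched as follows in Python; the match_quality array standing in for the per-pixel quality of the estimated correspondence wij is an assumed input, not part of the original implementation.

import numpy as np

def sample_interimage_edges(match_quality, patch=4):
    """Keep one inter-image edge per non-overlapping patch x patch block.

    match_quality: HxW array scoring how well each pixel of image I_i matches
    image I_j under the estimated correspondence (higher is better).
    Returns a boolean HxW mask selecting the retained edges."""
    H, W = match_quality.shape
    keep = np.zeros((H, W), dtype=bool)
    for y0 in range(0, H, patch):
        for x0 in range(0, W, patch):
            block = match_quality[y0:y0 + patch, x0:x0 + patch]
            dy, dx = np.unravel_index(np.argmax(block), block.shape)
            keep[y0 + dy, x0 + dx] = True
    return keep

# With 4x4 patches only 1/16 of the inter-image edges are kept:
mask = sample_interimage_edges(np.random.rand(96, 128), patch=4)
print(mask.sum(), 96 * 128 // 16)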

6 Conclusion
In this paper we explored automatic annotation of large-scale image databases using
dense semantic labeling and image correspondences. We treat auto-annotation as a la-
beling problem, where the goal is to assign every pixel in the database with an appropri-
ate tag, using both tagged images and scarce, densely-labeled images. Our work differs
from prior art in three key aspects. First, we use dense image correspondences to explic-
itly enforce coherent labeling (and tagging) across images, allowing us to resolve visual
ambiguities due to similar visual patterns. Second, we solve the problem jointly over the
entire image database. Third, unlike previous approaches which rely on large training
sets of labeled or tagged images, our method makes principled use of both tag and label
information while also automatically choosing good images for humans to annotate so
as to optimize performance. Our experiments show that the proposed system produces
reasonable automatic annotations and outperforms state-of-the-art methods on several
large-scale image databases while requiring significantly fewer human annotations.

References
1. Makadia, A., Pavlovic, V., Kumar, S.: Baselines for image annotation. IJCV 90 (2010)
2. Wang, X., Zhang, L., Liu, M., Li, Y., Ma, W.: Arista - image search to annotation on billions
of web photos. In: CVPR, pp. 2987-2994 (2010)
3. Shotton, J., Winn, J., Rother, C., Criminisi, A.: TextonBoost: Joint Appearance, Shape and
Context Modeling for Multi-class Object Recognition and Segmentation. In: Leonardis,
A., Bischof, H., Pinz, A. (eds.) ECCV 2006, Part I. LNCS, vol. 3951, pp. 1-15. Springer,
Heidelberg (2006)
4. Shotton, J., Johnson, M., Cipolla, R.: Semantic texton forests for image categorization and
segmentation. In: CVPR (2008)
5. Liu, C., Yuen, J., Torralba, A.: Nonparametric scene parsing via label transfer. TPAMI (2011)
6. Tighe, J., Lazebnik, S.: SuperParsing: Scalable Nonparametric Image Parsing with Superpixels.
In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part V. LNCS, vol. 6315,
pp. 352-365. Springer, Heidelberg (2010)
7. Blei, D.M., Ng, A., Jordan, M.: Latent Dirichlet allocation. Journal of Machine Learning
Research 3, 993-1022 (2003)
8. Feng, S., Manmatha, R., Lavrenko, V.: Multiple Bernoulli relevance models for image and
video annotation. In: CVPR (2004)
9. Liu, C., Yuen, J., Torralba, A., Sivic, J., Freeman, W.T.: SIFT Flow: Dense Correspondence
across Different Scenes. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part III.
LNCS, vol. 5304, pp. 28-42. Springer, Heidelberg (2008)
10. Russell, B., Torralba, A., Murphy, K., Freeman, W.: LabelMe: a database and web-based tool
for image annotation. IJCV 77, 157-173 (2008)
11. Von Ahn, L., Dabbish, L.: Labeling images with a computer game. In: SIGCHI (2004)
12. Grubinger, M., Clough, P., Muller, H., Deselaers, T.: The IAPR benchmark: A new evaluation
resource for visual information systems. In: LREC, pp. 13-23 (2006)

13. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR (2005)
14. Xiao, J., Hays, J., Ehinger, K., Oliva, A., Torralba, A.: SUN database: Large-scale scene
recognition from abbey to zoo. In: CVPR, pp. 3485-3492 (2010)
15. Delong, A., Gorelick, L., Schmidt, F.R., Veksler, O., Boykov, Y.: Interactive Segmentation
with Super-Labels. In: Boykov, Y., Kahl, F., Lempitsky, V., Schmidt, F.R. (eds.) EMMCVPR
2011. LNCS, vol. 6819, pp. 147-162. Springer, Heidelberg (2011)
16. Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Machine Learning 63, 3-42
(2006)
17. Oliva, A., Torralba, A.: Modeling the shape of the scene: A holistic representation of the
spatial envelope. IJCV 42, 145-175 (2001)
18. Rother, C., Kolmogorov, V., Blake, A.: GrabCut: Interactive foreground extraction using
iterated graph cuts. TOG 23, 309-314 (2004)
19. Jing, Y., Baluja, S.: VisualRank: Applying PageRank to large-scale image search. TPAMI
(2008)
20. Vijayanarasimhan, S., Grauman, K.: Cost-sensitive active visual category learning. IJCV,
1-21 (2011)
Numerically Stable Optimization
of Polynomial Solvers for Minimal Problems

Yubin Kuang and Kalle Astrom

Centre for Mathematical Sciences


Lund University, Sweden
{yubin,kalle}@maths.lth.se

Abstract. Numerous geometric problems in computer vision involve the solu-


tion of systems of polynomial equations. This is particularly true for so called
minimal problems, but also for finding stationary points for overdetermined prob-
lems. The state-of-the-art is based on the use of numerical linear algebra on the
large but sparse coefficient matrix that represents the original equations multi-
plied with a set of monomials. The key observation in this paper is that the speed
and numerical stability of the solver depends heavily on (i) what multiplication
monomials are used and (ii) the set of so called permissible monomials from
which numerical linear algebra routines choose the basis of a certain quotient
ring. In the paper we show that optimizing with respect to these two factors can
give both significant improvements in numerical stability compared to the state
of the art and highly compact solvers that still retain numerical stability. The
methods are validated on several minimal problems that have previously been
shown to be challenging, with improvements over the current state of the art.

1 Introduction
Many problems in computer vision can be rephrased as systems of polynomial equations in
several unknowns. This is particularly true for so called minimal structure and motion prob-
lems. Solutions to minimal structure and motion problems are essential for RANSAC
algorithms to find inliers in noisy data [13, 24, 25]. For such applications one needs to
efficiently solve a large number of minimal structure and motion problems in order to
find the best set of inliers. Once enough inliers have been found it is common to use
non-linear optimization e.g. to find the best fit to all data in a least squares sense. Here
fast and numerically stable solutions to the polynomial systems of the minimal prob-
lems are crucial for generating initial parameters for the non-linear optimization. Another
area of recent interest is global optimization used e.g. for optimal triangulation, resec-
tioning and fundamental matrix estimation. While global optimization is promising, it
is a difficult pursuit and various techniques have been tried, e.g. branch and bound [1],
L-infinity-norm methods [16] and methods using linear matrix inequalities (LMIs) [17]. An
alternative way to find the global optimum is to calculate stationary points directly,
usually by solving some polynomial equation system [14, 23]. So far, this has been an
approach of limited applicability since calculation of stationary points is numerically
difficult for larger problems. A third area comprises methods for finding the optimal set of
inliers [11, 12, 21], which are based on solving certain geometric problems that in turn
involve systems of polynomial equations.


Recent applications of polynomial techniques in computer vision include solving for


fundamental and essential matrices with radial distortion [19], panoramic stitching [2]
and pose with unknown focal length [3]. To solve such geometric problems by solving
systems of polynomials in several unknowns is often not trivial. First of all one has
model the problem and choose parameters in order to get as simple equations as pos-
sible. Then efficient and numerical stable techniques are applied to solve the resulting
system of equations.

Related Works. Traditionally, researchers have hand-coded elimination schemes in


order to solve systems of polynomial equations. Recently, however, new techniques
based on algebraic geometry and numerical linear algebra have been used to find all
solutions, cf. [22]. In [7], several techniques are presented for improving the numerical
stability by allowing a larger set of so called permissible monomials from which to choose
the basis of C[x]/I. By clever use of e.g. QR factorization with column-pivoting the numerical
stability can be improved by several orders of magnitude. In [18] a system for auto-
matic generation of minimal solvers is developed. The script first calls upon algebraic
geometry packages to determine the number of solutions and to determine a basis for
C[x]/I. Then repeated numerical trials with an increasing number of equations are performed
until a large enough set is found. Then the equations are trimmed down automatically to
give a compact set of equations and monomials. Finally these are hardcoded in a solver.
Similar techniques for automatic trimming of equations have been developed in [20],
where equations (rows) and monomials (columns) are removed if they have no numer-
ical effects on the construction of the action matrix. Another recent work [4] focuses
on speed improvement for polynomial solvers with techniques that avoid direct action
matrix eigenvalue computation.
In this paper, we focus on optimizing polynomial solvers with respect to their size
as well as numerical accuracy. The goal is to generate a solver that has better numeri-
cal accuracy with as compact a set of equations as possible. This is unlike [18] and [20],
where the main goals are to generate compact solvers. The common criterion is to re-
move equations and monomials that do not affect the overall solvability of the solvers.
While such compact solvers are generally faster, they tend to have inferior numerical
accuracy. We propose to use the numerical accuracy of the solvers as the criterion, and
optimize the set of equations in the template, as well as the permissible set. We apply
local search on this combinatorial problem and generate solvers for minimal problems
with better numerical accuracy and smaller templates.

2 Action Matrix Method


In this section we review some of the classical theory of multivariate polynomials. We
consider the following problem:
Problem 1. Given a set of m polynomials fi (x) in s variables x = (x1 , . . . , xs ),
determine the complete set of solutions to
f1 (x) = 0, . . . , fm (x) = 0. (1)
We denote by V the zero set of (1). In general V need not be finite, but in this paper we
will only consider zero dimensional V , i.e. V is a point set.
102 Y. Kuang and K. Astrom

The general field of study of multivariate polynomials is algebraic geometry. See [10]
and [9] for an introduction to the field. In the language of algebraic geometry, V is an
affine algebraic variety and the polynomials fi generate an ideal I = {g ∈ C[x] : g =
Σi hi(x)fi(x)}, where the hi ∈ C[x] are any polynomials and C[x] denotes the set of all
polynomials in x over the complex numbers.
The motivation for studying the ideal I is that it is a generalization of the set of
equations (1). A point x is a zero of (1) iff it is a zero of I. Being even more general,
we could ask for the complete set of polynomials vanishing on V . If I is equal to this
set, then I is called a radical ideal.
We say that two polynomials f, g are equivalent modulo I iff f - g ∈ I and denote
this by f ~ g. With this definition we get the quotient space C[x]/I of all equivalence
classes modulo I. Further, we let [.] denote the natural projection C[x] -> C[x]/I, i.e.
by [fi] we mean the set {gj : fi - gj ∈ I} of polynomials equivalent to fi modulo I.
A related structure is C[V ], the set of equivalence classes of polynomial functions
on V . We say that a function F is polynomial on V if there is a polynomial f such that
F (x) = f (x) for x V and equivalence here means equality on V . If two polynomials
are equivalent modulo I, then they are obviously also equal on V . If I is radical, then
conversely two polynomials which are equal on V must also be equivalent modulo I.
This means that for radical ideals, C[x]/I and C[V ] are isomorphic. Now, if V is a point
set, then any function on V can be identified with a |V |-dimensional vector and since
the unisolvence theorem for polynomials guarantees that any function on a discrete set
of points can be interpolated exactly by a polynomial, we get that C[V ] is isomorphic
to Cr , where r = |V |.
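The following toy example (not taken from the paper) may help make this concrete: for the system x^2 - y = 0, y^2 - 2 = 0 the quotient space is 4-dimensional (there are four solutions), and multiplication by x in a monomial basis yields a 4x4 action matrix whose eigenvalues are the x-coordinates of the solutions.

import numpy as np

# Toy system: f1 = x^2 - y = 0, f2 = y^2 - 2 = 0 (four solutions).
# A linear basis for C[x]/I is B = {1, x, y, xy}.  Multiplying B by the
# action variable x and reducing modulo the equations gives
#   x*1 = x,  x*x = y,  x*y = xy,  x*(xy) = x^2*y = y^2 = 2,
# so the action matrix (columns = images of the basis monomials) is:
A = np.array([[0., 0., 0., 2.],
              [1., 0., 0., 0.],
              [0., 1., 0., 0.],
              [0., 0., 1., 0.]])

# Eigenvalues of the action matrix are the x-coordinates of the solutions,
# and y follows from y = x^2.
for x in np.linalg.eigvals(A):
    print(x, x**2)          # each (x, y) pair satisfies both equations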

3 Revisit Basis Selection Technique


In this section, we review in detail the basis selection method introduced in [6]. We
further discuss the key components of this method and motivate why potential improve-
ment can be achieved.
The goal is to construct the action matrix for the linear mapping Txk : p(x) ->
xk p(x) in the r-dimensional space C[x]/I in a numerically stable manner. Specifically,
given the equation system, for a linear basis of monomials B = {x1 , . . . , xr } for
C[x]/I and the action variable xk , we want to express the image xk xi (for all xi ∈ B)
in terms of the basis monomials, i.e. xk xi ~ Σj Aij xj. The matrix A = {Aij}^T
is called the action matrix. In previous methods [7, 18], the so-called single elimination
template is applied to enhance numerical stability. It starts by multiplying the equations
(1) by a set of monomials producing an equivalent (but larger) set of equations. There
is no general criterion for how to choose such a set of monomials; it is usually selected
as the minimal set of monomials that makes the problem solvable or numerically stable.
We will discuss later that the choice of multiplication monomials affects the overall
numerical accuracy of the solvers. By stacking the coefficients of the new equations in an
expanded coefficient matrix Cexp , we have

Cexp Xexp = 0. (2)

This expanded coefficient matrix is usually called elimination template.
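A minimal Python sketch of building such an elimination template is given below; the dictionary-based polynomial representation and the simple degree-based column ordering are our own simplifications and do not reproduce the grevlex ordering used by the actual solvers.

import itertools
import numpy as np

# Polynomials are dicts mapping an exponent tuple, e.g. (2, 1) for x^2*y,
# to a coefficient (a simplified stand-in for the real data structures).

def monomials_up_to(degree, nvars):
    """All exponent tuples of total degree <= degree."""
    return [e for e in itertools.product(range(degree + 1), repeat=nvars)
            if sum(e) <= degree]

def expand(equations, max_degree, nvars):
    """Multiply each equation by all monomials so that the products stay
    within max_degree, and stack the coefficients into C_exp (eq. (2))."""
    rows = []
    for f in equations:
        deg_f = max(sum(e) for e in f)
        for m in monomials_up_to(max_degree - deg_f, nvars):
            rows.append({tuple(a + b for a, b in zip(e, m)): c
                         for e, c in f.items()})
    cols = sorted({e for row in rows for e in row}, key=sum, reverse=True)
    C = np.zeros((len(rows), len(cols)))
    for i, row in enumerate(rows):
        for e, c in row.items():
            C[i, cols.index(e)] = c
    return C, cols

# Example: the toy system x^2 - y = 0, y^2 - 2 = 0 expanded up to degree 3.
eqs = [{(2, 0): 1.0, (0, 1): -1.0}, {(0, 2): 1.0, (0, 0): -2.0}]
C_exp, monoms = expand(eqs, max_degree=3, nvars=2)
print(C_exp.shape, len(monoms))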



Using the notation in [7], we partition the set of all monomials M occurring in the
expanded set of equations as M = E ∪ R ∪ P. We use P (P for permissible) to denote
the set (or a subset) of monomials that stay in M after multiplication with the action
variable. And we denote by R the reducible set of monomials xk xi ∉ P for xi ∈ P.
The monomials E (E for excessive) are simply the monomials which are neither in R nor
in P. We then order them so that E > R > P holds for all monomials in their respective
sets. After applying the corresponding reordering of the columns of Cexp, we obtain

                   [ XE ]
  [ CE  CR  CP ]   [ XR ]  = 0.        (3)
                   [ XP ]

Given a linear system as in (3), we know that if we choose from P a linear basis B =
{x1 , . . . , xr }, we are able to use numerical linear algebra to construct the corresponding
action matrix. Traditionally, Grobner bases are used to generate B for some ordering
on the monomials (e.g. grevlex [18]); in this case, the permissible set is the same as the
basis set. The key observation of [6] is that by allowing adaptive selection of B from a
large permissible set (|P| >= r), one can improve the numerical accuracy greatly in most
cases. To achieve this, we first eliminate the E-monomials, since they are not in the
basis and do not need to be reduced. By making Cexp triangular through either LU or
QR factorization, we have

  [ UE1  CR1  CP1 ] [ XE ]
  [  0   UR2  CP2 ] [ XR ]  = 0,       (4)
  [  0    0   CP3 ] [ XP ]

where UE1 and UR2 are upper triangular. The top rows of the coefficient matrix involv-
ing the excessives can be removed,
  [ UR2  CP2 ] [ XR ]
  [  0   CP3 ] [ XP ]  = 0.            (5)

We can further reduce CP3 into an upper triangular matrix. In [6], column-pivoting QR
is utilized to try to minimize the condition number of the first |P| - r columns of CP3 Π,
where Π is a permutation matrix. This will enhance the overall numerical stability of
the extraction of the action matrix. The basis is thus selected as the last r monomials
after the reordering by Π, i.e. Π^T XP -> (XP', XB). This gives us

                      [ XR  ]
  [ UR2  CP4  CB1 ]   [ XP' ]  = 0.    (6)
  [  0   UP3  CB2 ]   [ XB  ]

From this, we can express the monomials in R and P' as linear combinations of B:

  [ XR  ]      [ UR2  CP4 ]^-1 [ CB1 ]
  [ XP' ]  = - [  0   UP3 ]    [ CB2 ]  XB.    (7)

[Figure 1: plots of mean log10 error vs. max degree of multiplication monomials (left) and vs. number of permissible monomials (right).]

Fig. 1. Numerical accuracy (mean of log10(errors)) of the solver in [8] for uncalibrated cameras
with radial distortion. Left: with multiplication monomials of degree 5 (393x390), 6 (680x595),
7 (1096x865) and 8 (1675x1212). The numbers in parentheses are the sizes of the corresponding
Cexp. Right: varying the number of permissible monomials. Note that the minimal size of the
permissible set is 24, which is the number of solutions.
Since by construction xk xi ∈ R ∪ P' ∪ B, we can construct the action matrix from
the linear mapping in (7) by finding the corresponding indices of xk xi (xi ∈ B).
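The central numerical step, selecting the basis as the last r columns after a column-pivoted QR of CP3, can be sketched as follows using SciPy; this only illustrates the column selection of [6], not the full construction of the action matrix, and the random matrix stands in for a real template block.

import numpy as np
from scipy.linalg import qr

def select_basis(C_P3, permissible, r):
    """Return the r permissible monomials chosen as basis.

    QR with column pivoting orders the columns of C_P3 so that the leading
    |P| - r columns are as well conditioned as possible; the trailing r
    pivoted columns are then taken as the basis B."""
    _, _, piv = qr(C_P3, mode='economic', pivoting=True)
    return [permissible[j] for j in piv[-r:]]

# Toy usage: 30 permissible monomials, a rank-20 block, r = 10 basis monomials.
rng = np.random.default_rng(0)
C_P3 = rng.standard_normal((20, 30))
permissible = ['m%d' % j for j in range(30)]
print(select_basis(C_P3, permissible, r=10))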
The method above generally gives good performance concerning speed and numer-
ical accuracy for most minimal problems in computer vision [7]. However, several
parameters are chosen manually for different problems: (1) the set of multi-
plication monomials and (2) the choice of the permissible set. For different problems,
in [6, 7], the general guideline is to use as few equations as possible (this is usually
done by fine-tuning the set of equations and checking whether the solver is still numerically stable).
For the permissible set, it was suggested that one should choose it as large as possible,
since this should in theory provide a better chance of obtaining a stable pivoted QR
factorization. One may ask: will we gain more numerical stability by adding more
equations or by using a smaller set of permissible monomials?
We present here an example to show the effects of such factors on current state-of-the-art
polynomial solvers. The problem we study is the estimation of the fundamental matrix
for uncalibrated cameras with radial distortion [8], which will be discussed in more
detail in Section 5.1. In [8], the original equations are multiplied with a set of monomials
so that the highest degree of the resulting equations is 5. To shed light on the effects of more
equations, we allow the highest degree to be 8. We note that adding equations by simply
increasing the degree actually hurts the overall numerical accuracy (Figure 1, Left). On
the other hand, we also vary the size of the permissible set used in the solver, where we
choose the last few monomials in grevlex order similar to [8]. We can see in Figure 1
(Right) that, by reducing the size of the permissible set in this way, the solver retains numerical
stability for permissible sets of large size, while losing accuracy for smaller permissible
sets (in this case when the size is smaller than 36). We will show in the next few sections
that one can actually gain numerical accuracy by adding equations if the multiplication
monomials are carefully chosen. Similarly, we can also achieve similar or even improved
accuracy with a smaller permissible set.

4 Optimizing Size and Accuracy for Solvers


In this section, we describe our method for optimizing the numerical accuracy of poly-
nomial solvers. We focus on improving the numerical accuracy of the basis selection
method in [6] with respect to the permissible set and the equations of the template.

4.1 Permissible Selection


As discussed in Section 3, the permissible set is the set of monomials from which the optimal basis
can be selected. In [6], it is suggested to choose the permissible set as large as
possible, and it is usually chosen as the last few monomials (in grevlex order). In general,
a larger permissible set gives better accuracy since it allows more freedom in choosing
the optimal basis. However, for some problems it is possible to obtain slightly better
numerical accuracy with a smaller permissible set, as indicated by the local minima in
Figure 1 (Right). It is not clear, though, how the size of the permissible set will affect
the overall numerical accuracy of the final solvers. On the other hand, it should also
be noted that the larger the permissible set is, the more equations we need in theory for the
template to be solvable. This is related to the fact that we need at least m = |P| + |R| - r
equations to yield the system in (6). Therefore, to generate slimmer solvers, one step
is to select a compact set of permissibles while keeping the solvers numerically
stable.
Choosing the optimal permissible set of size K is a combinatorial problem, and it is
difficult to fully assess the optimality of such sets with respect to numerical accuracy.
Our approach here is to start with a large permissible set. We then run the solver as in
[6] with basis selection on random problem examples. For each run we collect the basis
selected for the specific problem in B = {B1 , . . . , Bi , . . . , BN }, where N is the number of
random runs. In this way, we have information on which monomials are selected as basis monomials.
Typically, in our experiments, we find no unique optimal set of monomials among the
permissibles that is selected by all the random problems. One criterion for selecting
a smaller permissible set of size K is to choose the K most frequently selected monomials. The
drawback of this is that such monomials as a set might never have been selected as a basis
in any random run. To avoid such cases, we formulate this as an optimization problem.
Essentially, a subset P* of P would retain numerical accuracy if it is a superset of
as many of the basis sets as possible. Therefore, for the optimal P* of size K, we have the
following:
Problem 2. Given a set P and B = {B1 , . . . , Bi , . . . , BN } where Bi ⊆ P for i =
{1, . . . , N }, find P* ⊆ P,

  max_{P*}  Σ_{i=1}^{N} 1(Bi ⊆ P*),   s.t. |P*| = K,        (8)

where 1(.) is 1 for a true boolean expression and 0 otherwise.


This is a hard problem itself. However, since |P| is generally small for most minimal
problems, we can solve it with branch and bound in reasonable time.
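A sketch of Problem 2 is given below; for clarity it uses plain enumeration over all size-K subsets rather than the branch and bound used in the paper, and the toy monomial names are made up.

from itertools import combinations

def select_permissibles(P, bases, K, forced=frozenset()):
    """Choose P* of size K (containing the forced monomials) covering as many
    of the observed basis sets B_i as possible (exhaustive search)."""
    best, best_score = None, -1
    free = [m for m in P if m not in forced]
    for extra in combinations(free, K - len(forced)):
        cand = set(extra) | set(forced)
        score = sum(1 for B in bases if set(B) <= cand)
        if score > best_score:
            best, best_score = cand, score
    return best, best_score

# Toy usage: 6 permissible monomials, 4 recorded basis sets, K = 3.
P = ['1', 'x', 'y', 'xy', 'x2', 'y2']
bases = [['1', 'x', 'y'], ['1', 'x', 'xy'], ['1', 'x', 'y'], ['1', 'y', 'x2']]
print(select_permissibles(P, bases, K=3, forced={'1'}))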

4.2 Equation Removal

In all previous solvers applying single elimination, the equations in the template are
expanded with redundancy. This affects both the speed and the overall numerical
accuracy of the solvers, because all previous methods involve steps of gener-
ating an upper triangular form of the partial or full template, with LU or QR factorization or
Gauss-Jordan elimination. For large templates with many equations, this step is slow and
might be numerically unstable. For [6-8], the multiplication monomials are manually
tuned for different minimal problems. Both [18] and [20] investigated the possibility of
automatically removing equations while keeping the solvers numerically stable. They showed
that one can remove equations along with the corresponding monomials without losing nu-
merical stability, and even gain better numerical performance with single precision for
certain minimal problems.
In addition to numerical stability, we here try to optimize the numerical accuracy
by automatically removing equations. To achieve this, we apply a simple greedy search
over which equations to remove while keeping the numerical accuracy of the solvers as good
as possible. Given a redundant set of equations, we try to find the best equation to
remove and then iterate until no equation can be removed, i.e. when removing any of
the equations makes the template unsolvable. Specifically, we first remove a candidate
equation from the original template. We then measure the mean log10-errors of the
resulting template on a random set of problem examples (as training data). After running
through all equations, we remove the equation whose removal gives the lowest errors.
Generally, we can use any polynomial solver to evaluate the errors at each removal step.
We have chosen the solver with basis selection based on column pivoting [8] since it
generally gives better numerical accuracy.
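The greedy loop can be sketched as follows; mean_log10_error is a placeholder for building and evaluating a solver from a candidate template on training examples, which is not shown here.

import numpy as np

def greedy_equation_removal(C_exp, mean_log10_error, max_steps=50):
    """At each step drop the row whose removal yields the lowest mean log10
    error; stop when removing any equation breaks the solver (error = inf)
    or after max_steps iterations."""
    C = C_exp.copy()
    for _ in range(max_steps):
        errors = [mean_log10_error(np.delete(C, i, axis=0))
                  for i in range(C.shape[0])]
        best = int(np.argmin(errors))
        if not np.isfinite(errors[best]):
            break                          # no equation can be removed any more
        C = np.delete(C, best, axis=0)
        C = C[:, np.any(C != 0, axis=0)]   # drop columns that became all zero
    return C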
By exploiting the fact that the excessive monomials do not contribute to the construc-
tion of the action matrix, we can remove certain equations without any local search and
maintain the numerical stability of the template. Specifically, when expanding the equa-
tions, there can be some excessive monomials that appear only in one of the equations.
We call these singular excessive monomials. For the first elim-
ination in (4), if an equation contains singular monomials, removing it will not hurt the
overall numerical conditioning of the LU or QR with proper pivoting. Therefore, all equa-
tions containing singular excessive monomials can be safely removed. One can always apply this
trimming step with each local search step to speed up the removal process.
It might also be good to start with more equations, which might introduce better
numerical conditioning for the first elimination in (4). We then rely on our equation re-
moval step to select equations to improve the overall numerical accuracy. For monomial
removal, we have not exploited the numerical routines to remove columns as in [20].
Here, we simply remove columns that become all zeros after certain equations are re-
moved.

4.3 Optimization Scheme

Given the tools introduced in the previous sections, we should be able to optimize the
polynomial solvers by finding the best combination of permissible set and equations
for a specific problem. This is a difficult combinatorial problem. Therefore, as a first

attempt, we here first optimize the permissible set by fixing the set of equations. We se-
lect the optimal permissible sets of different sizes on the original template (no equation
removal). Then we apply greedy search to remove equations with these optimal per-
missible sets. All these training are done with a fixed set of random problem examples.
By investigating the error distributions of all these solvers, the size of template (num-
ber of equations, number of monomials), we will be able to choose these optimized
solvers with trade-off in accuracy and speed. Generally, we have the following steps for
optimizing polynomial solver for a minimal problem:

Input: Expanded coefficient matrix Cexp, initial permissible set P, a training set of problem
examples T and threshold parameter ε:

1. Perform permissible selection to choose PK for r <= K <= |P| as in Problem 2.

2. For each PK, perform equation removal; initialize Cslim = Cexp:
(a) Remove equation(s) containing singular excessive monomials.
(b) For each remaining row i in Cslim, evaluate the average error of the solver with
Cslim/i on a random subset of T.
(c) Update Cslim by removing the row i that gives the lowest mean error.
(d) Go back to (a) if the lowest average error in (c) is less than ε, until a predefined number of
iterations is reached.

Note that if the solver fails, we set the error to be infinite. Therefore, if we choose ε
to be large enough, this is similar to using solvability as the removal criterion, as in [18].

5 Experimental Validation
In this section, we apply our method on different minimal geometry problems in com-
puter vision. The first two problems are the estimation of fundamental or essential
matrix with radial distortion for uncalibrated or calibrated cameras [8, 19]. The third
problem is three-point stitching with radial distortion [5, 15, 20]. All these problems
have been studied before and have been shown to be challenging for obtaining numerically stable
solvers. We compare our method with previous methods with respect to both numerical
accuracy as well as the size of template. The state-of-the-art method is the column-
pivoting basis selection [7] which is also the building block of our method. We also
compare with two other methods [18, 20] that use template reduction procedures. In our
implementation of a general solver with column-pivoting basis selection, we have used
QR factorization to perform all the elimination steps. Our method involves a training
stage where we perform equation removal and permissible selection on a fixed set of
random problem examples. For all the comparisons, we test the resulting solvers on an
independent set of random problems.

5.1 Nine-Point Uncalibrated Radial Distortion


The problem of estimating fundamental matrix with radial distortion is studied by [19].
Numerically stable solutions were provided by [8, 18]. The minimal case for this prob-
lem is 9-point correspondences. In general, there are 10 equations and 10 unknowns.

[Figure 2: plots of mean log10 error vs. number of permissible monomials (left) and vs. number of equations (right).]

Fig. 2. Nine-point uncalibrated cameras with radial distortion. Left: Effects of permissible selec-
tion and equation removal on the numerical accuracy of the polynomial solver. BS - basis selection
with column-pivoting [6], AG - automatic generator [18], CO - permissible selection with com-
binatorial optimization, ER0 - our equation removal without removing equations with singular
excessives, ER - same as ER0 but with extra removal of singular excessive equations. Right:
Effects of removing equations with fixed permissibles for (a) ER - degree 5, 393x390 template and
(b) ER - degree 6, 680x595 template.

By linear elimination, one can reduce the system to 4 equations and 4 unknowns [19].
The reduced system has 24 solutions. In [18], the equations are first expanded to 497
equations, which are then reduced to 179 equations and 203 monomials. There is no
basis selection in this method and the basis set is chosen from the output of Macaulay2,
which is of size 24. On the other hand, in [8], a fine-tuned template of 393 equations and
390 monomials is used. In this case, we use the template in [8] as our starting point.
We call this the original template. For the set of permissible monomials, we choose
exactly the same set as in [8], which is of size 69. We note that further increasing
the permissible set makes the template unsolvable.

Selecting Permissibles. Following the steps in Section 4.1, we first run the solver with
393 equations and 390 monomials, and with the permissible set of size 69, on 5000
random problem examples. First of all, we notice that there are monomials that are
never selected as basis monomials by any problem example. Ideally, we could simply remove these
monomials from the permissible set. However, we need to force {x1 , x2 , x3 , x4 , 1} to
be in the permissible set even if they are never selected, because without these
monomials we cannot extract the solutions of the unknowns {x1 , x2 , x3 , x4 } with the
method in [7]¹. Therefore, excluding {x1 , x2 , x3 , x4 , 1}, we are able to remove 8 mono-
mials that are never selected from the permissible set. This leaves a possible
permissible set of size 61. We then continue to solve the problem in (8) with varying
K. We manage to solve the problem for K up to 27 with branch and bound. We can
see the effects of permissible selection in Figure 2 (Right, blue - solid): the solver is
still stable using only the smallest permissible set of size 32 (we force the 5 monomials to
be included), which is better than without basis selection (Figure 1). The improvement
gained by permissible selection is more significant for permissible sets of smaller size.
We also note that permissible selection alone fails to improve the accuracy of the solver
consistently.

¹ Note that in general, this is not required for the action matrix method, e.g. see [9].

[Figure 3: histogram (left) and quantile plot (right) of log10 errors; legend: BS #eq 393 #mon 390, AG #eq 179 #mon 203, ER #eq 356 #mon 367, ER #eq 585 #mon 572.]

Fig. 3. Histogram (left) and quantiles (right) of log10(errors) of different solvers for the uncal-
ibrated radial distortion problem: BS - [6], AG - [18] and ER - equation removal (ours) with
different initial templates.

Removing Equations. We first study the interplay between equation removal and per-
missible selection. To do this, we perform equation removal on the template with the
optimal permissible sets of different sizes from branch and bound. We can see that by
performing several steps of equation removal (10 steps in this case), one gets consistent
improvements of the solvers for the different optimal permissible sets (ER0 in Figure 2). This
suggests that one needs to combine permissible selection and equation removal to get
compact and numerically stable solvers. We also show that removing equations contain-
ing singular excessive monomials has little effect on the numerical stability (ER in
Figure 2). Therefore, we can always apply this to speed up the removal process.
To further understand the mechanism of removing equations, we fix the permissible set
to be the large set of size 69. We work with both the template T1 (393x390) and the further
expanded template T2 (680x595, up to degree 6). We then proceed to remove equations
using the local search described in Section 4.2. For each removal step, we evaluate the mean
log-errors on 100 random samples and we remove the equation whose removal gives the
best numerical accuracy. We can see that the greedy search improves both templates at
the beginning, and the templates get worse when more equations are removed (Figure
2, right)². Ideally, by optimization, the large template should converge to a better tem-
plate of smaller size; that this does not quite happen is due to the nature of greedy
optimization as well as the randomness involved in the training data. Nevertheless, we
can see that by removing equations using the greedy search, we can improve significantly
the numerical behavior of T2. Note that the initial mean log10(errors) of T2 (Figure 1,
left, degree 6) is around -10.10. We can reduce the size of T2 to 585x572 (the local
minimum in Figure 2, right) while in the meantime improving the accuracy to -11.15.
This is even slightly better than the best template reduced from T1 (356x367). This
indicates that one could improve numerical accuracy with a large template if one carefully
optimizes the set of equations. We also show the distribution of test errors for the solvers
producing the lowest errors from our training, compared to the state of the art, in Figure 3.
We can see that both our optimized solvers improve over [6] in accuracy. From the
iterations of the optimization, one gets a spectrum of solvers which can be chosen with
a preference for speed or accuracy.

² The initial errors (see Figure 1, Left) are omitted for better visualization.

5.2 Six-Point Calibrated Radial Distortion


For the estimation of the essential matrix for calibrated cameras with radial distortion, the min-
imal case is 6 points [19]. This minimal problem consists of 16 equations in 9 unknowns.
It can be reduced to 11 equations in 4 unknowns with linear elimination, which has 52
solutions. In [18], a solver with a template of 238 equations and 290 monomials is auto-
matically generated, while in [8] a template with 320 equations and 363 monomials is
used after fine-tuning the degrees of the multiplication monomials. For this problem, we
could not find the tuning parameters for the smaller template (320 by 363). Therefore,
we work with a larger template reported in the paper (356 by 378, with monomials up
to degree eight).

[Figure 4: mean log10 error vs. number of permissible monomials for BS, CO and ER.]

Fig. 4. Six-point calibrated cameras with radial distortion. Effects of permissible selection and
equation removal. BS - [6], CO - permissible selection with combinatorial optimization, ER - 5
steps of equation removal with permissibles from CO.

We notice in Figure 4 that the initial template (356 by 378) is fairly unstable com-
pared to the one reported in [8]. Here we first perform permissible selection on the
template. We observe that reducing the permissible size hurts the numerical accuracy
even with permissible selection. The equation removal step gains consistent improve-
ments for all permissible sizes. We illustrate this example here to show that the local
method can be applied to a numerically unstable template and still give improvements.

[Figure 5: mean log10 error vs. number of equations (left) and quantile plot of log10 errors (right); legend: Original #eq 90 #mon 132 (r=25, #P 25), ICCV11 #eq 54 #mon 77 (r=25, #P 25), ER #eq 48 #mon 77 (r=25, #P 25), ER #eq 53 #mon 93 (r=22, #P 25), ER + PS (r=22, #P 22).]

Fig. 5. Performance of different polynomial solvers on the three-point stitching problem. Left: log10
(errors) with respect to the number of equations. Original - 90x132 template as in [5], ICCV11 -
optimization with sequential equation and monomial removal [20], ER - our equation removal
and PS - permissible selection. Right: (best viewed in color) quantiles of the log10(errors).

5.3 Three-Point Stitching


For cameras with a common focal point, we can solve for the radial distortion and focal
length (assumed to be the same for both views). The minimal case for this problem is 3 point corre-
spondences [15]. A numerically stable solver based on the action matrix method was pro-
posed in [5]. The polynomial system consists of 2 equations in 2 unknowns with 18 solutions.
By multiplying the equations with monomials up to degree eight, the expanded template
consists of 90 equations in 132 monomials. For this problem, a truncation technique [7]
is also applied to obtain a numerical solver. It is essentially a way to gain numerical stability
by allowing false solutions. Specifically, instead of 18 (the number of solutions), the size of
the basis r is chosen to be 25. In this case, the size of the permissible set is also 25. To further
improve numerical stability and speed, a reduction procedure is performed in [20] on
the same template, which is trimmed down to 54 equations in 77 monomials. The smaller
template tends to give more stable solvers that are also faster.

Removing Equations. For the original solver in [5], the permissible set used was of the
same size as the basis (both are 25). Therefore, we do equation removal without per-
missible selection; we explore possible ways of permissible selection below. We can
see that one can actually remove fairly many equations and also gain
numerical accuracy compared to the original template. Note that our optimized template
is also smaller (48x77 for the smallest one) than the one in [20], with slightly better
numerical accuracy (mean log-errors -10.56 vs. -10.27). For a fair comparison, similar
to [20], we only modify the publicly available solver from [5] with the corresponding equa-
tions and monomials removed.
For the solver in [5], it turns out that the choice of the dimension of the basis also affects the
numerical accuracy. We studied this by running solvers for different pairs of |P| = 25
and r. We find that the best combination of |P| and r is 25 permissibles with
r = 22. It is shown that by simply reducing r, we gain almost one order of magnitude improvement

in accuracy (Figure 5, green dashed line). Note that this is itself a local search and
there might exist better combinations. We leave this as a future research direction. We can
further reduce the size of the template by removing equations. In this case, the smallest
and actually also the best template we get has 53 equations in 93 monomials. The
equation removal step again improves the numerical accuracy of the solver.

Permissible Selection. Given that we have a working solver with |P| > r (|P| =
25 and r = 22), we can investigate the potential of permissible selection. By solving
Problem 2 with K = 22, 23, 24, we find that having a large permissible set is always
better for this problem. It is of interest here to see what numerical accuracy we can get
with the smallest set of permissible monomials. Using the 22 permissibles selected by branch
and bound together with the equation removal step, the resulting solver (Figure 5, red dotted line) is
not as good as the best solver of size 53 with 25 permissibles and r = 22. It is a slightly
slimmer solver (52x84), with a trade-off of increased errors.

6 Conclusions
Previous state-of-the-art has given us methods for making the coefficient matrix as com-
pact as possible for improving the speed of the algorithm. Other results exploit a larger
set of so called permissible monomials from which the numerical linear algebra rou-
tine can choose the basis for the quotient ring. In this paper we have made several
observations on the stability of polynomial equation solving using the so called action
matrix method. First it is shown that adding more equations can improve numerical
accuracy, but it only does so up to a point. Adding too many equations can actually de-
crease numerical accuracy. Secondly it is shown that the choice of so called permissible
monomials also affects the numerical precision.
We present optimization algorithms that exploit these observations. Thus we are able
to produce solvers that range from very compact, while still retaining good numerical
accuracy, to solvers that involve larger sets of multiplication monomials and permissible
monomials to optimize for numerical accuracy. Our method is easy to
implement and is general across different problems. Therefore, it can serve as an initial
tool for improving minimal problem solvers involving large templates.
There are several interesting avenues of future research. First, the interplay between
numerical linear algebra routines used and the choice of multiplication monomials and
permissible monomials should be better understood. Second, the increased understand-
ing of the mechanisms for numerical accuracy could open up for further improvements.
Acknowledgment. The research leading to these results has received funding from the
strategic research projects ELLIIT and eSSENCE, and Swedish Foundation for Strate-
gic Research projects ENGROSS and VINST (grant no. RIT08-0043).

References
1. Agarwal, S., Chandraker, M.K., Kahl, F., Kriegman, D.J., Belongie, S.: Practical Global Opti-
mization for Multiview Geometry. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006,
Part I. LNCS, vol. 3951, pp. 592-605. Springer, Heidelberg (2006)

2. Brown, M., Hartley, R., Nister, D.: Minimal solutions for panoramic stitching. In: Proc. Conf.
on Computer Vision and Pattern Recognition (CVPR 2007), Minneapolis (June 2007)
3. Bujnak, M., Kukelova, Z., Pajdla, T.: A general solution to the p4p problem for camera with
unknown focal length. In: Proc. Conf. Computer Vision and Pattern Recognition, Anchorage,
USA (2008)
4. Bujnak, M., Kukelova, Z., Pajdla, T.: Making Minimal Solvers Fast. In: Proc. Conf. Com-
puter Vision and Pattern Recognition (2012)
5. Byrod, M., Brown, M., Astrom, K.: Minimal solutions for panoramic stitching with radial
distortion. In: Proc. British Machine Vision Conference, London, United Kingdom (2009)
6. Byrod, M., Josephson, K., Astrom, K.: A Column-Pivoting Based Strategy for Monomial
Ordering in Numerical Grobner Basis Calculations. In: Forsyth, D., Torr, P., Zisserman, A.
(eds.) ECCV 2008, Part IV. LNCS, vol. 5305, pp. 130143. Springer, Heidelberg (2008)
7. Byrod, M., Josephson, K., Astrom, K.: Fast and stable polynomial equation solving and its
application to computer vision. Int. Journal of Computer Vision 84(3), 237255 (2009)
8. Byrod, M., Kukelova, Z., Josephson, K., Pajdla, T., Astrom, K.: Fast and robust numerical
solutions to minimal problems for cameras with radial distortion. In: Proc. Conf. Computer
Vision and Pattern Recognition, Anchorage, USA (2008)
9. Cox, D., Little, J., OShea, D.: Using Algebraic Geometry. Springer (1998)
10. Cox, D., Little, J., OShea, D.: Ideals, Varieties, and Algorithms. Springer (2007)
11. Enqvist, O., Ask, E., Kahl, F., Astrom, K.: Robust Fitting for Multiple View Geometry. In:
Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012, Part I.
LNCS, vol. 7572, pp. 738751. Springer, Heidelberg (2012)
12. Enqvist, O., Josephson, K., Kahl, F.: Optimal correspondences from pairwise constraints. In:
Proc. 12th Int. Conf. on Computer Vision, Kyoto, Japan (2009)
13. Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting
with applications to image analysis and automated cartography. Communications of the
ACM 24(6), 381395 (1981)
14. Hartley, R., Sturm, P.: Triangulation. Computer Vision and Image Understanding 68,
146157 (1997)
15. Jin, H.: A three-point minimal solution for panoramic stitching with lens distortion. In: Com-
puter Vision and Pattern Recognition, Anchorage, Alaska, USA, June 24-26 (2008)
16. Kahl, F.: Multiple view geometry and the l -norm. In: ICCV, pp. 10021009 (2005)
17. Kahl, F., Henrion, D.: Globally optimal estimates for geometric reconstruction problems. In:
Proc. 10th Int. Conf. on Computer Vision, Bejing, China, pp. 978985 (2005)
18. Kukelova, Z., Bujnak, M., Pajdla, T.: Automatic Generator of Minimal Problem Solvers.
In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part III. LNCS, vol. 5304,
pp. 302315. Springer, Heidelberg (2008)
19. Kukelova, Z., Pajdla, T.: Two minimal problems for cameras with radial distortion. In: OM-
NIVIS (2007)
20. Naroditsky, O., Daniilidis, K.: Optimizing polynomial solvers for minimal geometry prob-
lems. In: ICCV, pp. 975982 (2011)
21. Olsson, C., Enqvist, O., Kahl, F.: A polynomial-time bound for matching and registration
with outliers. In: Proc. Conf. on Computer Vision and Pattern Recognition (2008)
22. Stewenius, H.: Grobner Basis Methods for Minimal Problems in Computer Vision. PhD the-
sis, Lund University (April 2005)
23. Stewenius, H., Schaffalitzky, F., Nister, D.: How hard is three-view triangulation really? In:
Proc. Int. Conf. on Computer Vision, Beijing, China, pp. 686693 (2005)
24. Torr, P., Zisserman, A.: Robust parameterization and computation of the trifocal tensor. Im-
age and Vision Computing 15(8), 591605 (1997)
25. Torr, P., Zisserman, A.: Robust computation and parametrization of multiple view relations.
In: Proc. 6th Int. Conf. on Computer Vision, Mumbai, India, pp. 727732 (1998)
Has My Algorithm Succeeded?
An Evaluator for Human Pose Estimators

Nataraj Jammalamadaka1, Andrew Zisserman2 , Marcin Eichner3,


Vittorio Ferrari4 , and C.V. Jawahar1
1
IIIT-Hyderabad
2
University of Oxford
3
ETH Zurich
4
University of Edinburgh

Abstract. Most current vision algorithms deliver their output as is, without
indicating whether it is correct or not. In this paper we propose evaluator algo-
rithms that predict if a vision algorithm has succeeded. We illustrate this idea for
the case of Human Pose Estimation (HPE).
We describe the stages required to learn and test an evaluator, including the use
of an annotated ground truth dataset for training and testing the evaluator (and
we provide a new dataset for the HPE case), and the development of auxiliary
features that have not been used by the (HPE) algorithm, but can be learnt by the
evaluator to predict if the output is correct or not.
Then an evaluator is built for each of four recently developed HPE algorithms
using their publicly available implementations: Eichner and Ferrari [5], Sapp
et al. [16], Andriluka et al. [2] and Yang and Ramanan [22]. We demonstrate
that in each case the evaluator is able to predict if the algorithm has correctly
estimated the pose or not.

1 Introduction
Computer vision algorithms for recognition are getting progressively more complex as
they build on earlier work including feature detectors, feature descriptors and classi-
fiers. A typical object detector, for example, may involve multiple features and multiple
stages of classifiers [7]. However, apart from the desired output (e.g. a detection win-
dow), at the end of this advanced multi-stage system the sole indication of how well the
algorithm has performed is typically just the score of the final classifier [8,18].
The objective of this paper is to try to redress this balance. We argue that algorithms
should and can self-evaluate. They should self-evaluate because this is a necessary re-
quirement for any practical system to be reliable. That they can self-evaluate is demon-
strated in this paper for the case of human pose estimation (HPE). Such HPE evaluators
can then take their place in the standard armoury of many applications, for example
removing incorrect pose estimates in video surveillance, pose based image retrieval, or
action recognition.
In general four elements are needed to learn an evaluator: a ground truth annotated
database that can be used to assess the algorithm's output; a quality measure comparing
the output to the ground truth; auxiliary features for measuring the output; and a classi-
fier that is learnt as the evaluator. After learning, the evaluator can be used to predict if
the algorithm has succeeded on new test data, for which ground-truth is not available.
We discuss each of these elements in turn in the context of HPE.
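As a hedged sketch of this generic recipe (the paper only states that an SVM is used as the classifier; the RBF kernel, the use of scikit-learn and the variable names here are our own assumptions):

import numpy as np
from sklearn.svm import SVC

def train_evaluator(aux_features, pose_quality, quality_threshold):
    """aux_features: (N, D) array of auxiliary features; pose_quality: (N,)
    array of quality scores of the HPE output w.r.t. ground truth.  Estimates
    with quality above the threshold are positives (HPE succeeded)."""
    labels = (pose_quality >= quality_threshold).astype(int)
    clf = SVC(kernel='rbf', probability=True)
    clf.fit(aux_features, labels)
    return clf

# At test time (no ground truth) the evaluator predicts success from the
# auxiliary features alone, e.g.:
#   success = train_evaluator(F_train, q_train, 0.5).predict(F_test)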


In the HPE task considered here, the goal is to predict the 2D (image) stickman
layout of the person: head, torso, upper and lower arms. The ground truth annotation in
the dataset must provide this information. The pose quality is measured by the differ-
ence between the predicted layout and the ground truth, for example the difference in their
angles or joint positions. The auxiliary features can be of two types: those that are used
or computed by the HPE algorithm, for example max-marginals of limb positions; and
those that have not been considered by the algorithm, such as proximity to the image
boundary (the boundary is often responsible for erroneous estimations). The estimate
given by HPE on each instance from the dataset is then classified, using a threshold
on the pose quality measure, to determine positive and negative outputs; for example,
if more than a certain number of limbs are incorrectly estimated, then this would be
deemed a negative. Given these positive and negative training examples and both types
of auxiliary features, the evaluator can be learnt using standard methods (here an SVM).
We apply this evaluator learning framework to four recent publicly available meth-
ods: Eichner and Ferrari [5], Sapp et al. [16], Andriluka et al. [2] and Yang and
Ramanan [22]. The algorithms are reviewed in section 2. The auxiliary features, pose
quality measure, and learning method are described in section 3. For the datasets, sec-
tion 4, we use existing ground truth annotated datasets, such as ETHZ PASCAL Stick-
men [5] and Humans in 3D [4], and supplement these with additional annotation where
necessary, and also introduce a new dataset to provide a larger number of training and
test examples. We assess the evaluator features and method on the four HPE algorithms,
and demonstrate experimentally that the proposed evaluator can indeed predict when the
algorithms succeed.
Note how the task of an HPE evaluator is not the same as that of deciding whether the
underlying human detection is correct or not. It might well be that a correct detection
then leads to an incorrect pose estimate. Moreover, the evaluator cannot be used directly
as a pose estimator either a pose estimator predicts a pose out of an enormous space
of possible structured outputs. The evaluator's job of deciding whether a pose is correct
is different, and easier, than that of producing a correct pose estimate.
Related Work. On the theme of evaluating vision algorithms, the most related work to
ours is Mac Aodha et al. [12] where the goal is to choose which optical flow algorithm
to apply to a given video sequence. They cast the problem as a multi-way classification.
In visual biometrics, there has also been extensive work on assessor algorithms which
predict an algorithm's failure [17,19,1]. These assess face recognition algorithms by an-
alyzing the similarity scores between a test sample and all training images. The method
by [19] also takes advantage of the similarity within template images. But none of these
explicitly considers other factors like imaging conditions of the test query (as we do).
[1] on the other hand, only takes the imaging conditions into account. Our method is
designed for another task, HPE, and considers several indicators all at the same time,
such as the marginal distribution over the possible pose configurations for an image,
imaging conditions, and the spatial arrangement of multiple detection windows in the
same image. Other methods [20,3] predict the performance by statistical analysis of the
training and test data. However, such methods cannot be used to predict the performance
on individual samples, which is the goal of this paper.

Fig. 1. Features based on the output of HPE. Examples of unimodal, multi-modal and large
spread pose estimates. Each image is overlaid with the best configuration (sticks) and the posterior
marginal distributions (semi-transparent mask). Max and Var are the features measured from
this distribution (defined in the text). As the distribution moves from peaked unimodal to more
multi-modal and diffuse, the maximum value decreases and the variance increases.

2 Human Pose Estimation


In this section, we review four human pose estimation algorithms (HPE). We also re-
view the pictorial structure framework, on which all four HPE techniques we consider
are based.

2.1 Upper Body Detection


Upper body detectors (UBD) are often used as a preprocessing stage in HPE pipelines
[5,16,2]. Here, we use the publicly available detector of [6]. It combines an upper-body
detector based on the part-based model of [8] and a face detector [18]. Both components
are first run independently in a sliding window fashion followed by non-maximum sup-
pression, and output two sets of detection windows (x, y, w, h). To avoid detecting the
same person twice, each face detection is then regressed into the coordinate frame of
the upper-body detector and suppressed if it overlaps substantially with any upper-body
detection. As shown in [6] this combination yields a higher detection rate at the same
false-positive rate, compared to using either component alone.

2.2 Human Pose Estimation


All methods we review are based on the pictorial structure framework. A UBD is
used to determine the location and scale (x, y, w, h) of the person for three of the HPE
methods we consider [5,16,2]. Pose estimation is then run within an extended detec-
tion window of width and height 2w and 2.5h respectively. The method of Yang and
Ramanan [22] is the exception. This runs directly on the image, without requiring a
UBD. However, it generates many false positive human detections, and these are re-
moved (in our work) by filtering using a UBD.







Fig. 2. Human pose estimated based on an incorrect upper-body detection: a typical example
of a human pose estimated in a false positive detection window. The pose estimate (MAP) is
an unrealistic pose (shown as a stickman, i.e. a line segment per body part) and the posterior
marginals (PM) are very diffuse (shown as a semi-transparent overlay: torso in red, upper arms
in blue, lower arms and head in green). The pose estimation output is from Eichner and
Ferrari [5].

Fig. 3. Pose normalizing the marginal distribution. The marginal distribution (contour diagram
in green) on the left is pose normalized by rotating it by the angle of the body part (line segment
in grey) to obtain the transformed distribution on the right. In the inset, the pertinent body part
(line segment in grey) is displayed as a constituent of the upper body.

Pictorial Structures (PS). PS [9] model a person's body parts as a conditional random
field (CRF). In the four HPE methods we review, parts li are parameterized by lo-
cation (x, y), orientation θ and scale s. The posterior of a configuration of parts is
P(L|I) ∝ exp( Σ(i,j)∈E Ψ(li , lj) + Σi Φ(I|li) ). The unary potential Φ(I|li) evalu-
ates the local image evidence for a part in a particular position. The pairwise potential
Ψ(li , lj) is a prior on the relative position of parts. In most works the model structure
E is a tree [5,9,14,15], which enables exact inference. The most common type of infer-
ence is to find the maximum a posteriori (MAP) configuration L* = arg maxL P(L|I).
Some works perform another type of inference, to determine the posterior marginal dis-
tributions over the position of each part [5,14,11]. This is important in this paper, as
some of the features we propose are based on the marginal distributions (section 3.1).
The four HPE techniques [5,16,2,22] we consider are capable of both types of infer-
ence. In this paper, we assume that each configuration has six parts corresponding to
the upper-body, but the framework supports any number of parts (e.g. 10 for a full
body [2]).

Eichner and Ferrari [5]. Eichner et al. [5] automatically estimate body part appearance
models (colour histograms) specific to the particular person and image being analyzed,
before running PS inference. These improve the accuracy of the unary potential \Phi,
which in turn results in better pose estimation. Their model also adopts the person-
generic edge templates as part of \Phi and the pairwise potential \Psi of [14].

Sapp et al. [16]. This method employs complex unary potentials, and non-parametric
data-dependent pairwise potentials, which accurately model human appearance. How-
ever, these potentials are expensive to compute and do not allow for typical inference

speed-ups (e.g. distance transform [9]). Therefore, [16] runs inference repeatedly, start-
ing from a coarse state-space for body parts and moving to finer and finer resolutions.
Andriluka et al. [2]. The core of this technique lies in proposing discriminatively
trained part detectors for the unary potential \Phi. These are learned using AdaBoost on
shape context descriptors. The authors model the kinematic relation between body parts
using Gaussian distributions. For the best configuration of parts, they independently in-
fer the location of each part from the marginal distribution (corresponding to the maxi-
mum), which is suboptimal compared to a MAP estimate.
Yang and Ramanan [22]. This method learns unary and pairwise potentials using a
very effective discriminative model based on the latent-svm algorithm proposed in [8].
Unlike the other HPE methods which model limbs as parts, in this method parts cor-
respond to the mid and end points of each limb. As a consequence, the model has 18
parts. Each part is modeled as a mixture to capture the orientations of the limbs.
This algorithm searches over multiple scales, locations and all the part mixtures (thus
implicitly over rotations). However, it returns many false positives. The UBD is used as
a post-processing step to filter these out, as its precision is higher. In detail, all human
detections which overlap less than 50% with any UBD detection are discarded. Overlap
is measured using the standard intersection-over-union criterion [7].

3 Algorithm Evaluation Method


We formulate the problem of evaluating the human pose estimates as classification into
success and failure classes. First, we describe the novel features we use (section 3.1).
We then explain how human pose estimates are evaluated. For this, we introduce a
measure of quality for HPE by comparing it to a ground-truth stickman (section 3.2).
The features and quality measures are then used to train the evaluator as a classifier
(section 3.3).

3.1 Features
We propose a set of features to capture the conditions under which an HPE algorithm
is likely to make mistakes. We identify two types of features: (i) features based on the output
of the HPE algorithm (the score of the HPE algorithm, the marginal distribution, and the
best configuration of parts L^*); and (ii) features based on the extended detection window
(its position, overlap with other detections, etc.); the latter are features which have not been
used directly by the HPE algorithm. We describe both types of features next.

1. Features from the Output of the HPE. The outputs of the HPE algorithm consist
of a marginal probability distribution P_i over (x, y, θ) for each body part i, and the
best configuration of parts L^*. The features computed from these outputs measure the
spread and multi-modality of the marginal distribution of each part. As can be seen
from figures 1 and 2, the multi-modality and spread correlate well with the error in
the pose estimate. Features are computed for each body part in two stages: first, the
marginal distribution is pose-normalized, then the spread of the distribution is measured
by comparing it to an ideal signal.

The best configuration L^* predicts an orientation for each part. This orientation
is used to pose-normalize the marginal distribution (which is originally axis-aligned)
by rotating the distribution so that the predicted orientation corresponds to the x-axis
(illustrated in figure 3). The pose-normalized marginal distribution is then factored into
three separate x, y and θ spaces by projecting the distribution onto the respective axes.
Empirically we have found that this step improves the discriminability.
A descriptor is then computed for each of the separate distributions which measures
its spread. For this we appeal to the idea of a matched filter, and compare the distribution
to an ideal unimodal one, P^*, which models the marginal distribution of a perfect pose
estimate. The distribution P^* is assumed to be Gaussian and its variance is estimated
from training samples with near perfect pose estimates. The unimodal distribution shown
in figure 1 is an example corresponding to a near perfect pose estimate.
The actual feature is obtained by convolving the ideal distribution, P^*, with the
measured distribution (after the normalization step above), and recording the maximum
value and variance of the convolution. Thus for each part we have six features, two for
each dimension, resulting in a total of 36 features for an upper-body. The entropy and
variance of the distribution P_i, and the score of the HPE algorithm, are also used as
features, taking the final total to 36 + 13 features.
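To make this feature computation concrete, the sketch below (our own illustration, not the authors' code; numpy and scipy assumed, helper name hypothetical, and the per-axis standard deviations of the ideal distribution P^* are passed in as if estimated from training data) computes the six spread features of one part from its marginal and predicted orientation.

import numpy as np
from scipy.ndimage import rotate
from scipy.signal import fftconvolve

def part_spread_features(marginal_xyt, theta_star, sigmas):
    # marginal_xyt: (H, W, B) marginal of one part over (y, x, theta).
    # theta_star: part orientation (degrees) in the best configuration L*.
    # sigmas: per-axis std. dev. of the ideal unimodal distribution P*
    #         (assumed estimated from near-perfect pose estimates).
    normed = rotate(marginal_xyt, angle=-theta_star, axes=(0, 1),
                    reshape=False, order=1)          # pose-normalize
    normed = np.clip(normed, 0, None)
    normed /= normed.sum() + 1e-12
    feats = []
    for axis, sigma in zip(range(3), sigmas):
        # project the marginal onto the x, y or theta axis
        proj = normed.sum(axis=tuple(a for a in range(3) if a != axis))
        t = np.arange(len(proj)) - len(proj) // 2
        ideal = np.exp(-0.5 * (t / sigma) ** 2)       # matched filter P*
        ideal /= ideal.sum()
        resp = fftconvolve(proj, ideal, mode='same')
        feats += [resp.max(), resp.var()]             # Max and Var features
    return np.array(feats)                            # 6 features per part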
Algorithm Specific Details: While the procedure to compute the feature vector is
essentially the same for all four HPE algorithms, the exact details vary slightly. For An-
driluka et al. [2] and Eichner and Ferrari [5], we use the posterior marginal distribution
to compute the features. While for Sapp et al. [16] and Yang and Ramanan [22], we use
the max-marginals. Further, in [22] the pose-normalization is omitted as max-marginal
distributions are over the mid and end points of the limb rather than over the limb itself.
For [22], we compute the max-marginals ourselves as they are not available directly
from the implementation.

2. Features from the Detection Window. We now detail the 10 features computed over
the extended detection window. These consist of the scale of the extended detection
window, and the confidence score of the detection window as returned by the upper
body detector. The remainder are:

Two Image Features: the mean image intensity and the mean gradient strength over the
extended detection window. These aim to capture the lighting conditions and
the amount of background clutter. Typically HPE algorithms fail when either of them
has a very large or a very small value.

Four Location Features: these are the fraction of the area outside each of the four image
borders. Algorithms also tend to fail when people are very small in the image, which
is captured by the scale of the extended detection window. The location features are
based on the idea that the larger the portion of the extended detection window which
lies outside the image, the less likely it is that HPE algorithms will succeed (as this
indicates that some body parts are not visible in the image).

Two Overlap Features: (i) the maximum overlap, and (ii) the sum of overlaps, over
all the neighbouring detection windows normalized by the area of the current detection
window. As illustrated in Figure 4, HPE algorithms can be affected by the neighbouring

Fig. 4. Overlap features. Left: Three overlapping detection windows. Right: the HPE output is
incorrect due to interference from the neighbouring people. This problem can be flagged by the
detection window overlap (output from Eichner and Ferrari [5]).

people. The overlap features capture the extent of occlusion by the neighbouring detec-
tion windows. Overlap with other people indicates how close they are. While large
overlaps occlude many parts in the upper body, small overlaps also affect HPE perfor-
mance as the algorithm could pick the parts (especially arms) from their neighbours.
While measuring the overlap, we consider only those neighbouring detections which
have similar scale (between 0.75 and 1.25 times). Other neighbouring detections which
are at a different scale typically do not affect the pose estimation algorithm.
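A rough sketch of these 10 detection-window features follows (our own helper, not the authors' implementation; the window height is used as a scale proxy, which is an assumption, boxes are (x, y, w, h), and numpy is assumed).

import numpy as np

def window_features(img_gray, win, det_score, neighbours):
    # win, neighbours: extended detection windows (x, y, w, h);
    # det_score: confidence of the upper body detector.
    x, y, w, h = [float(v) for v in win]
    H, W = img_gray.shape
    x0, y0 = max(0, int(x)), max(0, int(y))
    x1, y1 = min(W, int(x + w)), min(H, int(y + h))
    crop = img_gray[y0:y1, x0:x1].astype(float)
    gy, gx = np.gradient(crop)
    mean_int, mean_grad = crop.mean(), np.hypot(gx, gy).mean()
    # fraction of the window area lying outside each image border
    area = w * h
    out = [max(0.0, -x) * h / area, max(0.0, -y) * w / area,
           max(0.0, x + w - W) * h / area, max(0.0, y + h - H) * w / area]
    # overlap features over neighbouring detections of similar scale
    overlaps = []
    for nx, ny, nw, nh in neighbours:
        if not 0.75 <= nh / h <= 1.25:
            continue
        iw = max(0.0, min(x + w, nx + nw) - max(x, nx))
        ih = max(0.0, min(y + h, ny + nh) - max(y, ny))
        overlaps.append(iw * ih / area)
    return np.array([h, det_score, mean_int, mean_grad] + out +
                    [max(overlaps) if overlaps else 0.0, sum(overlaps)])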

3.2 Pose Quality Measure

For evaluating the quality of a pose estimate, we devise a measure which we term the
Continuous Pose Cumulative error (CPC) for measuring the dissimilarity between two
poses. It ranges in [0, 1], with 0 indicating a perfect pose match. In brief, CPC depends
on the sum of normalized distances between the corresponding end points of the parts.
Figure 5 gives examples of actual CPC values of poses as they move away from a ref-
erence pose. The CPC measure is similar to the percentage of correctly estimated body
parts (PCP) measure of [5]. However, CPC adds all distances between parts, whereas
PCP counts the number of parts whose distance is below a threshold. Thus PCP takes in-
teger values in {0, . . . , 6}. In contrast, for proper learning in our application we require
a continuous measure, hence the need for the new CPC measure.
In detail, the CPC measure computes the dissimilarity between two poses as the sum
of the differences in the position and orientation over all parts. Each pose is described
by N parts and each part p in a pose A is represented by a line segment going from
point s_p^a to point e_p^a, and similarly for pose B. All the coordinates are normalized with
respect to the detection window in which the pose is estimated. The angle subtended by
p is θ_p^a. With these definitions, the CPC(A, B) between two poses A and B is:

    CPC(A, B) = \sum_{p=1}^{N} w_p \, \frac{\|s_p^a - s_p^b\| + \|e_p^a - e_p^b\|}{2\,\|s_p^a - e_p^a\|}    (1)

    with  w_p = 1 + \sigma\Big( \sin\frac{|\theta_p^a - \theta_p^b|}{2} \Big)

where σ is the sigmoid function and θ_p^a - θ_p^b is the relative angle between part p in poses
A and B, which lies in [-π, π]. The weight w_p is a penalty for two corresponding parts
of A and B not being in the same direction. Figure 6 depicts the notation in the above
equation.
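A direct transcription of the reconstructed Eq. (1) into code (a sketch, not the authors' implementation: poses are given as per-part endpoints already normalized by the detection window, and clipping the sum at 1 to match the stated [0, 1] range is our assumption, since the exact normalization is not recoverable from the text).

import numpy as np

def cpc(pose_a, pose_b):
    # pose_a, pose_b: lists of N parts, each ((sx, sy), (ex, ey)) in
    # detection-window-normalized coordinates.
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    total = 0.0
    for (sa, ea), (sb, eb) in zip(pose_a, pose_b):
        sa, ea, sb, eb = (np.asarray(v, float) for v in (sa, ea, sb, eb))
        da, db = ea - sa, eb - sb
        dtheta = np.arctan2(da[1], da[0]) - np.arctan2(db[1], db[0])
        dtheta = np.arctan2(np.sin(dtheta), np.cos(dtheta))   # wrap to [-pi, pi]
        w_p = 1.0 + sigmoid(np.sin(abs(dtheta) / 2.0))        # weight w_p
        total += w_p * (np.linalg.norm(sa - sb) + np.linalg.norm(ea - eb)) \
                 / (2.0 * np.linalg.norm(da) + 1e-12)
    return min(1.0, total)   # clipping to [0, 1] is our assumption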


Fig. 5. Poses with increasing CPC. An example per CPC is shown for each of the two reference
poses for CPC measures of 0.1, 0.2, 0.3 and 0.5. As can be seen example poses move smoothly
away from the reference pose with increasing CPC, with the number of parts which differ and the
extent increasing. For 0.1 there is almost no difference between the examples and the reference
pose. At 0.2, the examples and reference can differ slightly in the angle of one or two limbs, but
from 0.3 on there can be substantial differences with poses differing entirely by 0.5.

3.3 Learning the Pose Quality Evaluator


We require positive and negative pose examples in order to train the evaluator for an
HPE algorithm. Here a positive is where the HPE has succeeded and a negative where
it has not. Given a training dataset of images annotated with stickmen indicating the
ground truth of each pose (section 4.1), the positive and negative examples are obtained
by comparing the output of the HPE algorithms to the ground truth using CPC and
thresholding its value. Estimates with low CPC (i.e. estimates close to the true pose)
are the positives, and those above threshold are the negatives.
In detail, the UBD algorithm is applied to all training images, and the four HPE
algorithms [5,16,2,22] are applied to each detection window to obtain a pose estimate.
The quality of the pose estimate is then measured by comparing it to ground truth using
CPC. In a similar manner to the use of PCP [5], CPC is only computed for correctly
localized detections (those with IoU > 0.5). Detections not associated with any ground-
truth are discarded. Since all of the HPE algorithms considered here cannot estimate

Fig. 6. Pose terminology: for poses A and B the corresponding parts p are illustrated. Each part
is described by its starting point s_p^a, end point e_p^a and part angle θ_p^a.

Fig. 7. Annotated ground-truth samples (stickman overlay) for two frames from the Movie
Stickmen dataset.

partial occlusion of limbs, CPC is set to 0.5 if the ground-truth stickman has occluded
parts. In effect, people who are only partially visible in the image get a high CPC. This
provides training data telling the evaluator that the HPE algorithms will fail on
these cases.
A CPC threshold of 0.3 is used to separate all poses into a positive set (deemed
correct) and a negative set (deemed incorrect). This threshold is chosen because pose
estimates with CPC below 0.3 are nearly perfect and roughly correspond to PCP = 6
with a threshold of 0.5 (figure 5). A linear SVM is then trained on the positive and
negative sets using the auxiliary features described in section 3.1. The feature vector
has 59 dimensions, and is a concatenation of the two types of features (49 components
based on the output of the HPE, and 10 based on the extended detection window). This
classifier is the evaluator, and will be used to predict the quality of pose estimates on
novel test images.
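A minimal sketch of this training step, assuming scikit-learn and pre-computed 59-dimensional feature vectors with their CPC values (in regime B, a CPC of 1.0 is assigned to poses on false-positive detections); the helper name and the C value are ours, not the authors'.

import numpy as np
from sklearn.svm import LinearSVC

def train_evaluator(features, cpc_values, cpc_threshold=0.3):
    # features: (n_samples, 59) array of the section 3.1 features.
    # cpc_values: CPC of each pose estimate w.r.t. ground truth.
    X = np.asarray(features, float)
    y = (np.asarray(cpc_values) < cpc_threshold).astype(int)  # 1 = success
    clf = LinearSVC(C=1.0)   # C is not specified in the paper; default used
    clf.fit(X, y)
    return clf

# At test time, clf.decision_function(x) gives the evaluator score used for
# the ROC curves, and clf.coef_ exposes the learnt weight vector.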
The performance of the evaluator algorithm is discussed in the experiments of sec-
tion 4.2. Here we comment on the learnt weight vector from the linear SVM. The magni-
tude of the components of the weight vector suggests that the features based on marginal
probability, and location of the detection window, are very important in distinguishing
the positive samples from the negatives. On the other hand, the weights of the upper
body detector, image intensity and image gradient are low, so these do not have much
impact.

4 Experimental Evaluation
After describing the datasets we experiment on (section 4.1), we present a quantitative
analysis of the pose quality evaluator (section 4.2).

Table 1. (a) Dataset statistics. The number of images and stickmen annotated in the training and
test data. (b) Pose estimation evaluation. The PCP¹ of the four HPE algorithms (Andriluka et
al. [2], Eichner and Ferrari [5], Sapp et al. [16], Yang and Ramanan [22]) averaged over each
dataset (at PCP threshold 0.5, section 3.2). The numbers in brackets are the results reported in the
original publications. The differences in the PCPs are due to different versions of UBD software
used. The performances across the datasets indicate their relative difficulty.

(a)
Dataset         images  gt-train  gt-test
ETHZ Pascal       549       0       549
Humans in 3D      429    1002         0
Movie            5984    5804      5835
Buffy2            499     775         0
Total            7270    7581      6834

(b)
Datasets        [2]    [5]          [16]          [22]
Buffy stickmen  78.3   81.6 (83.3)  84.2          86.7
ETHZ Pascal     65.9   68.5         71.4 (78.2)   72.4
Humans in 3D    70.3   71.3         75.3          77.8
Movie           64.6   70.4         70.6          76.0
Buffy2          65.0   67.5         74.2          81.7

4.1 Datasets

We experiment on four datasets. Two are the existing ETHZ PASCAL Stickmen [5] and
Humans in 3D [4], and we introduce two new datasets, called Buffy2 Stickmen and
Movie Stickmen. All four datasets are challenging as they show people at a range of
scales, wearing a variety of clothes, in diverse poses, lighting conditions and scene
backgrounds. Buffy2 Stickmen is composed of frames sampled from episodes 2 and 3
of season 5 of the TV show Buffy the vampire slayer (note this dataset is distinct from
the existing Buffy Stickmen dataset [11]). Movie Stickmen contains frames sampled
from ten Hollywood movies (About a Boy, Apollo 13, Four Weddings and a Funeral,
Forrest Gump, Notting Hill, Witness, Gandhi, Love Actually, The Graduate, Groundhog
Day).
The data is annotated with upper body stickmen (6 parts: head, torso, upper and
lower arms) under the following criteria: all humans who have more than three parts
visible are annotated, provided their size is at least 40 pixels in width (so that there is
sufficient resolution to see the parts clearly). For near frontal (or rear) poses, six parts
may be visible; and for side views only three. See the examples in figure 7. This fairly
complete annotation includes self-occlusions and occlusions by other objects and other
people.
Table 1a gives the number of images and annotated stickmen in each dataset. As a
training set for our pose quality evaluator, we take Buffy2 Stickmen, Humans in 3D
and the first five movies of Movie Stickmen. ETHZ PASCAL Stickmen and the last five
movies of Movie Stickmen form the test set, on which we report the performance of the
pose quality evaluator in the next subsection.
Performance of the HPE Algorithms. For reference, table 1b gives the PCP performance
of the four HPE algorithms. Table 2 gives the percentage of samples where pose is
estimated accurately, i.e. the CPC between the estimated and ground-truth stickmen is
¹ We are using the PCP measure as defined by the source code of the evaluation protocol in [5],
as opposed to the more strict interpretation given in [13].

Table 2. Percentage of accurately estimated poses. The table shows the percentage of samples
which were estimated accurately (CPC < 0.3) on the training and test sets, as well as overall,
for the four HPE algorithms (Andriluka et al. [2], Eichner and Ferrari [5], Sapp et al. [16],
Yang and Ramanan [22]). These accurate pose estimates form the positive samples for training
and testing the evaluator.

Dataset   [2]    [5]    [16]   [22]
Train     8.5   10.7   11.6   18.5
Test      9.9   11.1   12.0   16.5
Total     9.1   10.9   11.8   17.6

Table 3. Performance of the Pose Evaluator. The pose evaluator is used to assess the outputs
of four HPE algorithms (Andriluka et al. [2], Eichner and Ferrari [5], Sapp et al. [16], Yang
and Ramanan [22]) at three different CPC thresholds. The evaluation criterion is the area under
the ROC curve (AUC). BL is the AUC of the baseline and PA is the AUC of our pose evaluator.

                CPC 0.2      CPC 0.3      CPC 0.4
HPE             BL    PA     BL    PA     BL    PA
Andriluka [2]   56.7  90.0   56.2  90.0   55.8  89.2
Eichner [5]     84.3  92.6   81.6  91.6   80.5  90.9
Sapp [16]       76.5  82.5   76.5  83.0   76.9  83.5
Yang [22]       79.5  83.7   78.4  81.5   78.4  81.2

< 0.3. In all cases, we use the implementations provided by the authors [5,16,2,22] and
all methods are given the same detection windows [6] as preprocessing. Both measures
agree on the relative ranking of the methods: Yang and Ramanan [22] performs best,
followed by Sapp et al. [16], Eichner and Ferrari [5] and then by Andriluka et al. [2].
This confirms experiments reported independently in previous papers [5,16,22]. Note
that we report these results only as a reference, as the absolute performance of the HPE
algorithms is not important in this paper. What matters is how well our newly proposed
evaluator can predict whether an HPE algorithm has succeeded.

4.2 Assessment of the Pose Quality Evaluator

Here we evaluate the performance of the pose quality evaluator for the four HPE al-
gorithms. To assess the evaluator, we use the following definitions. A pose estimate is
defined as positive if it is within CPC 0.3 of the ground truth and as negative otherwise.
The evaluator's output (positive or negative pose estimate) is defined as successful if
it correctly predicts a positive (true positive) or negative (true negative) pose estimate,
and defined as a failure when it incorrectly predicts a positive (false positive) or nega-
tive (false negative) pose estimate. Using these definitions, we assess the performance
of the evaluator by plotting an ROC curve.
The performance is evaluated under two regimes: (A) only where the predicted HPE
corresponds to one of the annotations. Since the images are fairly completely anno-
tated, any upper body detection window [6] which does not correspond to an annotation is
considered a false positive. In this regime such false positives are ignored; (B) all pre-
dictions are evaluated, including false-positives. The first regime corresponds to: given
there is a human in the image at this point, how well can the proposed method evalu-
ate the pose-estimate? The second regime corresponds to: given there are wrong upper
body detections, how well can the proposed method evaluate the pose-estimate? The
protocol for assigning a HPE prediction to a ground truth annotation was described in

[Fig. 8 shows ROC curves plotting the percentage of true positives against the percentage of
false positives. (a) Regime A, evaluator/baseline AUC: 88.8/78.2 for Eichner and Ferrari [5],
85.6/57.4 for Andriluka et al. [2], 77.0/69.7 for Sapp et al. [16], 77.0/70.0 for Yang and
Ramanan [22]. (b) Regime B, evaluator/baseline AUC: 91.6/81.6 for Eichner and Ferrari [5],
90.0/56.2 for Andriluka et al. [2], 83.0/76.5 for Sapp et al. [16], 81.5/78.4 for Yang and
Ramanan [22].]

Fig. 8. (a) Performance of HPE evaluator in regime A: (no false positives used in training or
testing). The ROC curve shows that the evaluator can successfully predict whether an estimated
pose has CPC < 0.3. (b) Performance of HPE evaluator in Regime B: (false positives included
in training and testing).

section 3.3. For training regime B, any pose on a false-positive detection is assigned a
CPC of 1.0.
Figure 8a shows performance for regime A, and figure 8b for B. The ROC curves are
plotted using the score of the evaluator, and the summary measure is the Area Under the
Curve (AUC). The evaluator is compared to a relevant baseline that uses the HPE score
(i.e. the energy of the most probable (MAP) configuration L^*) as a confidence measure.
For Andriluka et al. [2], the baseline is the sum over all parts of the maximum value of
the marginal distribution for a part. The plots demonstrate that the evaluator works well,
and outperforms the baseline for all the HPE methods. Since all the HPE algorithms use
the same upper body detections, their performance can be compared fairly.
To test the sensitivity to the 0.3 CPC threshold, we learn a pose evaluator also us-
ing CPC thresholds 0.2 and 0.4 under the regime B. Table 3 shows the performance
of the pose evaluator for different CPC thresholds over all the HPE algorithms. Again,
our pose evaluator shows significant improvements over the baseline in all cases. The
improvement of AUC for Andriluka et al. is over 33.5. We believe that this massive
increase is due to the suboptimal inference method used for computing the best config-
uration. For Eichner and Ferrari [5], Sapp [16], and Yang and Ramanan [22] our pose
evaluator brings an average increase of 9.6, 6.4 and 3.4 respectively across the CPC
thresholds. Interestingly, our pose evaluator has a similar performance across different
CPC thresholds.
Our pose evaluator successfully detects cases where the HPE algorithms succeed
or fail, as shown in figure 9. In general, an HPE algorithm fails where there is self-
occlusion or occlusion from other objects, when the person is very close to the image
borders or at an extreme scale, or when they appear in very cluttered surroundings.
These cases are caught by the HPE evaluator.

Fig. 9. Example evaluations. The pose estimates in the first two rows are correctly classified as
successes by our pose evaluator. The last two rows are correctly classified as failures. The pose
evaluator is learnt using the regime B and with a CPC threshold of 0.3. Poses in rows 1,3 are esti-
mated by Eichner and Ferrari [5], and poses in rows 2,4 are estimated by Yang and Ramanan [22].

5 Conclusions

Human pose estimation is a base technology that can form the starting point for other
applications, e.g. pose search [10] and action recognition [21]. We have shown that an
evaluator algorithm can be developed for human pose estimation methods where no
confidence score (only a MAP score) is provided, and that it accurately predicts if the
algorithm has succeeded or not. A clear application of this work is to use the evaluator to
choose the best result amongst multiple HPE algorithms.
More generally, we have cast self-evaluation as a binary classification problem, using
a threshold on the quality evaluator output to determine successes and failures of the
HPE algorithm. An alternative approach would be to learn an evaluator by regressing
to the quality measure (CPC) of the pose estimate. We could also improve the learning
framework using a non-linear SVM.
We believe that our method has wide applicability. It works for any part-based model
with minimal adaptation, no matter what the parts and their state space are. We have
shown this in the paper, by applying our method to various pose estimators [5,16,2,22]
with different parts and state spaces. A similar methodology to the one given here could
be used to engineer evaluator algorithms for other human pose estimation methods e.g.
using poselets [4], and also for other visual tasks such as object detection (where success
can be measured by an overlap score).

Acknowledgments. We are grateful for financial support from the UKIERI, the ERC
grant VisRec no. 228180 and the Swiss National Science Foundation.

References
1. Aggarwal, G., Biswas, S., Flynn, P.J., Bowyer, K.W.: Predicting performance of face recog-
nition systems: An image characterization approach. In: IEEE Conf. on Comp. Vis. and Pat.
Rec. Workshops, pp. 52–59. IEEE Press (2011)
2. Andriluka, M., Roth, S., Schiele, B.: Pictorial structures revisited: People detection and ar-
ticulated pose estimation. In: IEEE Conf. on Comp. Vis. and Pat. Rec. IEEE Press (2009)
3. Boshra, M., Bhanu, B.: Predicting performance of object recognition. IEEE Transactions on
Pattern Analysis and Machine Intelligence 22(9), 956–969 (2000)
4. Bourdev, L., Malik, J.: Poselets: Body part detectors trained using 3d human pose annota-
tions. In: IEEE Int. Conf. on Comp. Vis., pp. 1365–1372. IEEE Press (2009)
5. Eichner, M., Ferrari, V.: Better appearance models for pictorial structures. In: Cavallaro,
A., Prince, S., Alexander, D. (eds.) Proceedings of the British Machine Vision Conference,
pp. 3:1–3:3. BMVA Press (2009)
6. Eichner, M., Marin, M., Zisserman, A., Ferrari, V.: Articulated human pose estimation and
search in (almost) unconstrained still images. ETH Technical report (2010)
7. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The pascal visual
object classes (voc) challenge. International Journal of Computer Vision 88(2) (2010)
8. Felzenszwalb, P., McAllester, D., Ramanan, D.: A discriminatively trained, multiscale, de-
formable part model. In: IEEE Conf. on Comp. Vis. and Pat. Rec. IEEE Press (2008)
9. Felzenszwalb, P.F., Huttenlocher, D.P.: Pictorial structures for object recognition. Interna-
tional Journal of Computer Vision 61(1), 55–79 (2005)
10. Ferrari, V., Marin-Jimenez, M., Zisserman, A.: Pose search: Retrieving people using their
pose. In: IEEE Conf. on Comp. Vis. and Pat. Rec. IEEE Press (2009)
11. Ferrari, V., Marin-Jimenez, M., Zisserman, A.: Progressive search space reduction for human
pose estimation. In: IEEE Conf. on Comp. Vis. and Pat. Rec. IEEE Press (2008)
12. Mac Aodha, O., Brostow, G.J., Pollefeys, M.: Segmenting video into classes of algorithm-
suitability. In: IEEE Conf. on Comp. Vis. and Pat. Rec., pp. 1054–1061. IEEE Press (2010)
13. Pishchulin, L., Jain, A., Andriluka, M., Thormählen, T., Schiele, B.: Articulated people de-
tection and pose estimation: Reshaping the future. In: IEEE Conf. on Comp. Vis. and Pat.
Rec., pp. 3178–3185. IEEE Press (2012)
14. Ramanan, D.: Learning to parse images of articulated bodies. In: Schölkopf, B., Platt, J.,
Hoffman, T. (eds.) Advances in Neural Information Processing Systems 19, pp. 1129–1136.
MIT Press, Cambridge (2007)
15. Ronfard, R., Schmid, C., Triggs, B.: Learning to Parse Pictures of People. In: Heyden,
A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002, Part IV. LNCS, vol. 2353,
pp. 700–714. Springer, Heidelberg (2002)
16. Sapp, B., Toshev, A., Taskar, B.: Cascaded Models for Articulated Pose Estimation. In:
Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part II. LNCS, vol. 6312,
pp. 406–420. Springer, Heidelberg (2010)
17. Scheirer, W.J., Bendale, A., Boult, T.E.: Predicting biometric facial recognition failure with
similarity surfaces and support vector machines. In: IEEE Conf. on Comp. Vis. and Pat. Rec.
Workshops, pp. 1–8. IEEE Press (2008)
18. Viola, P., Jones, M.J.: Robust real-time face detection. International Journal of Computer
Vision 57(2), 137–154 (2004)

19. Wang, P., Ji, Q., Wayman, J.L.: Modeling and predicting face recognition system perfor-
mance based on analysis of similarity scores. IEEE Transactions on Pattern Analysis and
Machine Intelligence 29(4), 665–670 (2007)
20. Wang, R., Bhanu, B.: Learning models for predicting recognition performance. In: IEEE Int.
Conf. on Comp. Vis., vol. 2, pp. 1613–1618. IEEE Press (2005)
21. Willems, G., Becker, J.H., Tuytelaars, T., Van Gool, L.: Exemplar-based action recognition in
video. In: Cavallaro, A., Prince, S., Alexander, D. (eds.) Proceedings of the British Machine
Vision Conference, pp. 90.1–90.11. BMVA Press (2009)
22. Yang, Y., Ramanan, D.: Articulated pose estimation with flexible mixtures-of-parts. In: IEEE
Conf. on Comp. Vis. and Pat. Rec. IEEE Press (2011)
Group Tracking: Exploring Mutual Relations
for Multiple Object Tracking

Genquan Duan¹, Haizhou Ai¹, Song Cao¹, and Shihong Lao²

¹ Computer Science & Technology Department, Tsinghua University, Beijing, China
dgq08@mails.tsinghua.edu.cn, ahz@mail.tsinghua.edu.cn
² Development Center, OMRON Social Solutions Co., Ltd., Kyoto, Japan
lao@ari.ncl.omron.co.jp

Abstract. In this paper, we propose to track multiple previously unseen


objects in unconstrained scenes. Instead of considering objects individ-
ually, we model objects in mutual context with each other to benefit
robust and accurate tracking. We introduce a unified framework to com-
bine both Individual Object Models (IOMs) and Mutual Relation Models
(MRMs). The MRMs consist of three components, the relational graph
to indicate related objects, the mutual relation vectors calculated within
related objects to show the interactions, and the relational weights to bal-
ance all interactions and IOMs. As MRMs are varying along temporal
sequences, we propose online algorithms to make MRMs adapt to current
situations. We update relational graphs through analyzing object trajec-
tories and cast the relational weight learning task as an online latent
SVM problem. Extensive experiments on challenging real world video
sequences demonstrate the efficiency and effectiveness of our framework.

1 Introduction

Visual object tracking is a major component in computer vision and widely ap-
plied in many domains, such as video surveillance, driving assistant systems,
human computer interactions, etc. As an object to be tracked is unknown in
many applications, many researchers adopt online algorithms [13] to extract
the object from background, which mainly encounters the great challenges from
occlusions, changing appearances, varying illumination and abrupt motions. Re-
cently, some authors utilized the context (e.g. feature points [4] and regions in
similar appearances to the target [5]) for robust single object tracking.
Problem Statement: In this paper, we propose to track multiple previously
unseen objects. A direct solution for this problem is to consider objects indi-
vidually and utilize some approaches designed for single object tracking. Such
a solution relies heavily on individual object models (IOMs). Once IOMs be-
come inaccurate, the targets tend to be lost easily. In fact, the mutual relations
among objects are another important kind of information and worth taking into
account. If we have obtained reliable mutual relation models (MRMs) at a given
time, we can use them to predict more accurate locations of objects as shown in

A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 129–143, 2012.

© Springer-Verlag Berlin Heidelberg 2012

[Fig. 1 panels: (a) A simple scene; (b) Typical errors of IOMs (drifting; switching from left
eye to right eye); (c) Accurate results of IOMs+MRMs (learned mutual relations); (d) More
challenging scenes; (e) How to learn robust mutual relations?]

Fig. 1. Examples of mutual relations. (a,b,c) demonstrate a toy example, where mutual relation
models (MRMs) can benefit robust tracking when some individual object models (IOMs) are
inaccurate (for the scene in (a), the reason is occlusions). The switching error in (b) is due to
the similar appearances of the "right eye" and the "left eye". As all objects have rigid-like
relations and they come from the same object, robust MRMs can be easily learned for this toy.
However, they are difficult to learn in many other scenes, such as in (d)(top), illustrating objects
with abrupt motions, and (d)(bottom), showing objects which are originally different. We propose
to learn MRMs online for multi-object tracking.

Fig. 1. This visual tracking scenario tracks multiple objects simultaneously and
understands their mutual relations, termed group tracking for simplicity.
Proposal: Our main idea is to model objects in mutual context with each other
and combine both IOMs and MRMs for robust and accurate tracking. The usage
of MRMs needs to answer two crucial questions in the following.
When do MRMs work? The ideal situation is that all IOMs are accurate
enough, and MRMs are not necessary. However, when IOMs are inaccurate,
MRMs will play an important role in robust tracking. In fact, it is not easy to
know whether IOMs are accurate enough to ignore MRMs. Moreover, if all IOMs
lose accuracy simultaneously, MRMs will become useless because any estimates
of objects are unreliable at these moments. Therefore, we assume that MRMs
exist all the time and there are some accurate IOMs during tracking.
How do MRMs work? MRMs provide mainly three kinds of helpful information
for tracking. The first is to indicate mutually related objects (i.e. objects impacting
on each other), termed the relational graph. The second is to know how much
impact one object has on another, named the mutual relation vectors. The
last is to balance all interactions of objects and the responses of IOMs, called
the relational weights. When these three kinds of information are properly
determined, MRMs will play an important role in robust tracking.
Contribution: Our main contribution is twofold. (1) We model objects in mutual
context, and propose a unified framework to integrate individual object models and
mutual relation models for tracking multiple previously unseen objects (Sec.3). (2) We
extend latent SVM into an online version for learning the relational weights (Sec.4.1).
To the best of our knowledge, our online algorithm is the first one to take advantage
of LSVM for tracking problems.

Organization: Related work is presented in the next section. Then, problem


formulation, the mutual relation modeling and the inference are described con-
secutively in Sec.3, Sec.4 and Sec.5. Experimental results and discussions are
provided in Sec.6. Finally, the conclusion and future work are given in Sec.7.

2 Related Work
Object Tracking. Adam et al. [6] represented the target based on histograms
of local patches and combined votes of matched local patches using templates
to handle partial occlusions. Since the templates are not updated, it may fail
in handling appearance changes. In order to cope with appearance variations,
Babenko et al. [1] proposed an online multiple instance learning algorithm to
learn a discriminative detector for the target. To avoid drifting over time, Kalal
et al. [7] combined an adaptive Kanade-Lucas-Tomasi feature tracker and several
restrictive learning constraints to establish an incremental classier. To further
deal with non-rigid motion or large pose variations, Kwon et al. [3] extended
the conventional particle lter framework with multiple motion and observation
models to account for appearance variations caused by changes of pose, illumi-
nation and scale as well as partial occlusions. Considering that bounding box
based representations provide a less accurate foreground/background separation,
Godec et al. [2] extended Hough Forests to online domain and coupled the vot-
ing based detection and back-projection with a rough segmentation based on
GrabCut for accurately tracking non-rigid and articulated objects.
Besides these single object tracking approaches, there are also lots of ap-
proaches for multi-object tracking, where most of them assume that the object
category is known and depend on offline trained specific object detectors, such
as [8–10]. Breitenstein et al. [8] used detection as their observation model and
integrated pedestrian detectors into a particle filtering framework to track mul-
tiple persons. Huang et al. [9] proposed to associate detected results globally,
assuming that the whole sequence is available in advance. Different from [8, 9],
which consider objects individually, Pellegrini et al. [10] modeled human in-
teractions to some degree and introduced dynamic social behaviors to facilitate
multi-people tracking. Similar to [10], group tracking considers mutual relations
among objects.
Human Pose Estimation. Felzenszwalb et al. [11] utilized Pictorial Structure
(PS) to estimate human poses in static images. Sapp et al. [12] proposed Stretch-
able Models to estimate articulated poses in videos with rich features like color,
shape, contour and motion cues. Since the object is human, relations of human
parts can be predefined and their distributions can be learned on annotated hu-
man pose datasets in [11, 12]. But these priors do not exist in group tracking,
because objects are previously unseen.
Structural Learning and Latent SVM. We propose an online latent struc-
tural learning algorithm to update relational weights. Tsochantaridis et al. [13]
optimized a structural SVM objective function, which is a convex optimization


Fig. 2. An illustration of our problem formulation. We combine IOMs and MRMs for
group tracking, where MRMs consist of relational weights, the relational graph and
mutual relation vectors.

problem widely applied to a variety of problems in computer vision [14, 15]. Bran-
son et al. [14] utilized an online structural learning algorithm to train deformable
part based model interactively. By introducing latent variables, Felzenszwalb et
al. [16] proposed Latent SVM to handle object detection challenges. Ramanan
et al. [15, 17] made some improvements on Latent SVM and demonstrated state-
of-the-art results in object classication and human pose estimation.

3 Problem Formulation
We denote the sequences of observations and object states from frame 1 to
frame T as O1:T = {O1 , . . . , OT } and S1:T = {S1 , . . . , ST }. Observations consist
of image features which can include image patches, corner points or the outputs
of specific generative/discriminative models. We encode the states of M objects
at time t as S_t = {s_{t,m}}_{m=1}^{M}. Each object state is represented by its center
x = [x y]^T, width w and height h. Thus, we formalize the state of the mth object
at time t as s_{t,m} = (x_{t,m}, w_{t,m}, h_{t,m}). We can now write the full score associated
with a configuration of object states as

    f(O_t, S_t) = \sum_{p \in V_t} w_{t,p} \, \phi_{t,p}(O_t, s_{t,p}) + \sum_{(p,q) \in E_t} w_{t,p,q} \cdot \psi_{t,p,q}(s_{t,p}, s_{t,q})    (1)

where \phi_{t,p} is the response of the individual model for object p, and \psi_{t,p,q} is
the mutual relation vector between two objects p and q. w_{t,p} and w_{t,p,q} are
weight parameters. G_t = (V_t, E_t) represents the relational graph whose nodes
are objects and whose edges indicate mutually related objects. For concise descriptions,
we write \omega_t = ({w_{t,p}}; {w_{t,p,q}}) and \Phi(O_t, S_t) = ({\phi_{t,p}}; {\psi_{t,p,q}}), and thus the
scoring function in Eq.(1) can be rewritten as f(O_t, S_t) = \omega_t \cdot \Phi(O_t, S_t). This
formulation is illustrated in Fig. 2.
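For concreteness, a small sketch of how the score of a candidate configuration in Eq. (1) can be evaluated; the symbols φ, ψ and ω above are our reconstruction of notation lost in typesetting, and the dictionary-based layout below is an implementation choice of ours, not the authors'.

def config_score(unary, pairwise, w_unary, w_pairwise, edges):
    # unary[p]: response phi_{t,p} of the individual model for object p.
    # pairwise[(p, q)]: mutual relation vector psi_{t,p,q} (list of floats).
    # w_unary[p], w_pairwise[(p, q)]: relational weights.
    # edges: list of related pairs (p, q), i.e. the relational graph E_t.
    score = sum(w_unary[p] * unary[p] for p in unary)
    for p, q in edges:
        score += sum(wi * vi for wi, vi in zip(w_pairwise[(p, q)],
                                               pairwise[(p, q)]))
    return score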
Individual object models (IOMs) compute the local score of placing one object
at a specific frame location, corresponding to \phi_{t,p}. The larger score indicates the
higher probability of the object at this location. As reviewed in Sec.2, there
are many algorithms for single object tracking and most of them can be easily

integrated into our framework. Taking MIL [1] as an example, MIL represents
each object by a bag of samples and online learns a discriminative classifier. The
learned MIL classifier can return the confidence of an image patch for an object.
Mutual relation models (MRMs) consist of the relational weights, the mutual
relation vectors and the relational graph, corresponding to \omega_t, \psi_{t,p,q} and G_t
respectively. Firstly, relational weights balance the two terms in Eq.(1). If the rela-
tional weights {w_{t,p,q}} are set to zero, Eq.(1) reduces to the case that ignores
mutual relations. Secondly, mutual relation vectors are calculated between re-
lated objects to show the interactions. Thirdly, the inferences of Eq.(1) for cyclic
and acyclic graphs are very different. In the next section, we will present details
of modeling mutual relations.

4 Modeling Mutual Relations


In this section, we describe particularly how to learn relational weights, calculate
mutual relation vectors and update relational graphs.

4.1 Learning Relational Weights through Online LSVM

Given a set of frames and object configurations {(O_1, S_1), . . . , (O_T, S_T)}, the task
is to learn proper relational weights (\omega_t) to balance the two terms in Eq.(1). We
cast this task as a maximum margin structured learning problem (Structured
SVM [13]), where we consider mutual relation vectors as latent variables. In
addition, there is no need to model objects from long ago, which may differ a lot
from the current ones in appearance, and therefore we only consider samples in a
short time period \tau. Similar to [14], we formulate this problem as searching for the
optimal weight \omega that minimizes the error function F_T(\omega)

    F_T(\omega) = \frac{1}{2}\|\omega\|^2 + \sum_{t=T-\tau+1}^{T} l_t(\omega)    (2)

    l_t(\omega) = \max_{\hat{S}_t} \big[ \omega \cdot \Phi(O_t, \hat{S}_t) - \omega \cdot \Phi(O_t, S_t) + \Delta(S_t, \hat{S}_t) \big]    (3)

where \hat{S}_t is a particular object configuration and \Delta(S_t, \hat{S}_t) is the loss function,
equal to the number of missing objects when the ground truth is S_t. This
criterion attempts to learn a set of weight parameters \omega, so that the score of any
other choice of object configuration \omega \cdot \Phi(O_t, \hat{S}_t) is less than that of the ground
truth configuration \omega \cdot \Phi(O_t, S_t) by at least \Delta(S_t, \hat{S}_t).
Comparison with [14]. Before designing efficient algorithms to learn relational
weights, we should consider three main issues. (1) Labeled samples. In our
tracking problem, we have only one labeled result S_1 at t = 1, while {S_t}_{t=2}^{T} are
tracked results, which are assumed to be ground truths and will not change after
T. In contrast, [14] has many well-labeled ground truths for training but there
are no consistencies between images. To some extent, video consistencies are
the most important information to make our problem tractable. (2) Working
samples. We exploit recent frames to depict object relations, while [14] utilizes
all possible training samples to learn a powerful model. (3) Learning speeds.

The frames come one by one during tracking, which is an incremental way similar
to [14]. Thus, it requires a fast online learning algorithm.
Optimization. The form of our learning problem is a Structured SVM. There
exist many well-tuned solvers such as the cutting plane solver of SVMstruct [18]
and the Online Stochastic Gradient Descent (OSGD) solver [19]. While SVMstruct
solver is most commonly used, OSGD solver has been observed faster in practice
[17]. Since the speed is important for tracking, we choose the OSGD solver for
our learning problem. When an example (Ot , St ) comes, [19] proposed to take
an update as¹

    \omega_t = \omega_{t-1} - \frac{1}{t}\,(\omega_{t-1} + \nabla l_t)    (4)

where \nabla l_t(\omega) is the sub-gradient of the hinge loss l_t(\omega), which can be computed
by solving a problem similar to an inference problem

    \nabla l_t(\omega) = \Phi(O_t, \hat{S}_t) - \Phi(O_t, S_t)    (5)

    \hat{S}_t = \arg\max_{\hat{S}_t} \big[ \omega \cdot \Phi(O_t, \hat{S}_t) + \Delta(S_t, \hat{S}_t) \big]    (6)

The update in Eq.(4) needs at least one whole iteration for each new sample and
is designed to keep the learning ability on all samples. Because of the characteristics
of our task (objects in a short time period and video consistencies), we modify
Eq.(4) as

    \omega_t = \omega_{t-1} - \min\Big( \frac{1}{\eta_{t-1}}, \; \frac{l_{t-1}(\omega_{t-1})}{\|\nabla l_{t-1}\|^2} \Big)\, \nabla l_{t-1}.    (7)

Our update directly combines the weights and the sub-gradient of the hinge loss
at the previous time. It is non-iterative because \omega_{t-1}, \nabla l_{t-1} and l_{t-1}(\omega_{t-1}) are
all calculated before time t. Note that the coefficient of \nabla l_{t-1} in Eq.(7) might
be tuned, which is different from 1/t in Eq.(4). Relatively, our update is more
reasonable for tracking. In addition, there is no need to determine \tau explicitly
in this online update.
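A sketch of this modified update, under our reconstruction of the dropped symbols (ω for the weights, ∇l for the sub-gradient, η for the step-size parameter set to 1000 in the experiments); numpy assumed, helper name ours.

import numpy as np

def online_lsvm_update(w_prev, loss_prev, grad_prev, eta=1000.0):
    # w_prev: omega_{t-1}; loss_prev: hinge loss l_{t-1}(omega_{t-1});
    # grad_prev: sub-gradient of l_{t-1}, i.e. Phi(O, S_hat) - Phi(O, S)
    # for the most violated configuration (Eqs. 5-6).
    g2 = float(np.dot(grad_prev, grad_prev))
    if loss_prev <= 0.0 or g2 == 0.0:
        return np.array(w_prev, float)        # no violation: keep the weights
    step = min(1.0 / eta, loss_prev / g2)     # Eq. (7)
    return np.array(w_prev, float) - step * np.asarray(grad_prev, float)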
Computation. The computationally taxing portion of Eq.(7) is to calculate the
sub-gradient of l_t(\omega) using Eq.(6). It is too computationally expensive to sample
locations around the ground truths of objects as in [14, 17]. Thus, we select \hat{S}_t from
\Omega_t, where \Omega_t contains all configurations inferred by Eq.(1) except the tracked
result, whose details will be described in Sec.5. Then, one needs to loop over
\Omega_t to get the most violated configuration \hat{S}_t. For each candidate \hat{S}_t, one needs
to compute \Phi(O_t, \hat{S}_t), containing the scores of IOMs and the mutual relation vectors,
and \Delta(S_t, \hat{S}_t), comparing a predicted configuration with the ground truth, both
of which make the computation O(M). Therefore, the sub-gradient can be
calculated in O(M|\Omega_t|). Moreover, if some objects are missing (i.e. very small
scores of IOMs), there is no need to update weights related to these objects.

4.2 Calculating Mutual Relation Vectors


Mutual relation vectors show the interactions between related objects. They
can be interpreted as a spring model that represents the relative placement of
¹ This update can easily be extended to an iterative scheme. However, we have not found
significant improvements in our experiments.

two objects. Ideally, more accurate locations of related objects indicate higher
scores. Unfortunately, only the first frame is labeled, as stated before. Motivated
by the deformable part based models of [16, 17], we calculate mutual relation vectors
from the changes of relative locations between two objects.
Following the notation of Sec.3, the relative location of p with respect to q at
time t is z_{t,p,q} = x_{t,p} - x_{t,q} and the estimated relative location is \hat{z}_{t,p,q} = \hat{x}_{t,p} - \hat{x}_{t,q},
where \hat{x}_{t,p} and \hat{x}_{t,q} are the estimated locations for objects p and q respectively. The
Kalman Filter [20] is adopted as the algorithm to estimate object locations,
where both linear Gaussian observation and motion models are assumed for
each object. Then, we define the mutual relation vectors as \psi_{t,p,q}(s_{t,p}, s_{t,q}) =
[dx dy dx^2 dy^2]^T where [dx dy]^T = z_{t,p,q} - \hat{z}_{t,p,q}. Although \hat{z}_{t,p,q} is always
biased from the ground truths, it does not have a direct impact on the whole
tracking system because mutual relation vectors are taken as latent variables
and relational weights are updated online.
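A sketch of this computation (our own helper, not the authors' code; the Kalman-predicted centres are passed in, since the filter itself is standard).

import numpy as np

def mutual_relation_vector(x_p, x_q, x_p_pred, x_q_pred):
    # x_p, x_q: current centres of objects p and q.
    # x_p_pred, x_q_pred: centres predicted by the per-object Kalman filters.
    z = np.asarray(x_p, float) - np.asarray(x_q, float)                # z_{t,p,q}
    z_hat = np.asarray(x_p_pred, float) - np.asarray(x_q_pred, float)  # estimate
    dx, dy = z - z_hat
    return np.array([dx, dy, dx * dx, dy * dy])   # psi_{t,p,q}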

4.3 Updating Relational Graphs


The relational graph determines the complexity of inferring Eq.(1). A general
representation is a fully connected graph. However, there is no guarantee of reaching
a global optimum on it. As our goal is to make use of frames in a short
time period, we suggest that a simplified graph, i.e. a tree, is reasonable enough,
where dynamic programming can be used for efficient inference, as will be
explained in Sec.5. However, after a long time period or when some objects move
violently, one graph may lose its effectiveness and even interfere with tracking

Table 1. Our tracking algorithm

Input: video sequences O1:T .


Initialize: multiple objects S_1 = {s_{1,1}, . . . , s_{1,M}}, the relational graph G_1 = (V_1, E_1) with
E_1 = ∅, the relational weight \omega_1 = [1; 0] by default (w_{1,p} = 1, w_{1,p,q} = 0), the frame interval
N_U for updating relational graphs, the object speed threshold θ_v, and the number of frames k
used for updating relational weights.
Output: tracked objects in all frames S_{1:T}.
Learn individual object models at the first frame.
For t = 1 : T
    Apply individual object models at sampled locations.
    Find the best configuration by inferring Eq.(1).
    Store the other inferred results into Ω_t.
    Update individual object models at the predicted locations S_t.
    Update the relational weight \omega_{t+1} by Eq.(7) on Ω_t.
    If t == k (i.e. constructing the relational graph), or t > k && (t % N_U == 0 || Λ_t ≠ ∅)
      where Λ_t = {v_{t,p} | v_{t,p} > θ_v} (i.e. updating the relational graph)
        + Evaluate connected weights between objects by Eq.(8).
        + Generate the Minimum Spanning Tree on a fully connected graph.
        + If the graph structure is changed, set \omega_{t-k+1} = [1; 0], update \omega_{i+1} on Ω_i
          for i = t - k + 1 to t, and obtain \omega_{t+1}. End If.
    End If
End For

performances. Although this interference may be reduced to some extent by


relational weights, it is supposed to be more reasonable to update the relational
graph over video sequences.
Adaptive Relational Graphs. Intuitively speaking, a reasonable adaptation en-
courages linking objects moving similarly and disconnecting those moving differ-
ently. Therefore, we learn reasonable relational graphs through analyzing object
trajectories and define the connected weights between two objects based on their
motion similarities in the latest k frames

    \rho_{t,p,q} = \sum_{i=t-k+2}^{t} 1 / (1 + \|v_{i,p} - v_{i,q}\|^2)    (8)

where v_{i,j} = x_{i,j} - x_{i-1,j} (j = p, q) and \|\cdot\| is the L2 norm. After calculating
all connected weights, one can perform Minimum Spanning Tree algorithms to
generate a new tree structure. Once the graph structure is changed, the relational
weights should be re-learned correspondingly. It is not reasonable to set the
relational weights by default, and thus we utilize the stored results of the latest k frames
{Ω_i}_{i=t-k+1}^{t} to learn \omega_{t+1}, in which we set \omega_{t-k+1} by default.
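A sketch of the graph update: the connected weights of Eq. (8) followed by a spanning-tree computation (numpy and scipy assumed, helper name ours). Using 1/ρ as the edge cost, so that the minimum spanning tree links similarly moving objects, is our reading of the text, not something the authors state explicitly.

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def update_relational_graph(trajectories):
    # trajectories: (M, k, 2) array of the object centres in the latest k
    # frames, giving k-1 velocity vectors v_{i,j} = x_{i,j} - x_{i-1,j}.
    traj = np.asarray(trajectories, float)
    vel = traj[:, 1:, :] - traj[:, :-1, :]
    M = traj.shape[0]
    rho = np.zeros((M, M))
    for p in range(M):
        for q in range(p + 1, M):
            d2 = np.sum((vel[p] - vel[q]) ** 2, axis=1)
            rho[p, q] = rho[q, p] = np.sum(1.0 / (1.0 + d2))   # Eq. (8)
    cost = np.where(rho > 0, 1.0 / np.maximum(rho, 1e-12), 0.0)
    tree = minimum_spanning_tree(cost)
    return [(int(p), int(q)) for p, q in zip(*tree.nonzero())]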

5 Inference

The inference of our formulation is to maximize Eq.(1) over all involved objects,

    S_t^* = \arg\max_{S_t} f(O_t, S_t).    (9)

Since we restrict the relational graph to be a tree, the inference can be efficiently
solved by dynamic programming. Letting child(p) be the set of children of object
p in G_t, we compute the message of p passing to its parent q as follows.

    score(p) = w_{t,p} \, \phi_{t,p}(O_t, s_{t,p}) + \sum_{b \in child(p)} m_b(p)    (10)

    m_b(p) = \max_{s_{t,b}} \big( score(b) + w_{t,b,p} \cdot \psi_{t,b,p}(s_{t,b}, s_{t,p}) \big)    (11)


Eq.(10) computes the local score of p at all possible locations, by collecting
all messages from its children. Eq.(11) computes the messages at all possible
locations of b and finds the best scoring location. Once all messages are passed
to the root object, one can generate object configurations by starting from a root
location and backtracking to find the location of each object at each maximal point,
with the aid of the stored argmax indices. The configuration with the
maximum root score corresponds to the tracked result in a frame.
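A sketch of this message passing with L candidate locations per object (our own helper names, not the authors' code; numpy assumed).

import numpy as np

def tree_inference(unary, pair_score, children, root):
    # unary[p]: length-L array of weighted IOM scores w_{t,p} * phi_{t,p}.
    # pair_score[(b, p)]: (L, L) array of w_{t,b,p} . psi_{t,b,p} between
    # child location i and parent location j.
    # children[p]: children of p in the relational graph; root: root object.
    argmax = {}

    def score(p):                                    # Eq. (10)
        s = np.array(unary[p], float)
        for b in children.get(p, []):
            msg = score(b)[:, None] + pair_score[(b, p)]   # Eq. (11)
            argmax[(b, p)] = msg.argmax(axis=0)
            s += msg.max(axis=0)
        return s

    best = {root: int(score(root).argmax())}

    def backtrack(p):                                # recover each location
        for b in children.get(p, []):
            best[b] = int(argmax[(b, p)][best[p]])
            backtrack(b)

    backtrack(root)
    return best                                      # best location per object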
Computation. After applying IOMs in one frame, we have scores of objects
at a number of locations. Considering the efficiency of online tracking, we
only keep the L highest scoring locations for each object. Then, we have to traverse
over L possible parent locations and compute over L possible child locations in
Eq.(11), resulting in a computation of O(L^2) for each object. Therefore, the total
complexity for M objects is O(ML^2). As each root location corresponds to an
inferred configuration, there are L inferred configurations in total. Except for the
tracked result, the other inferred configurations are stored in Ω_t for learning rela-
tional weights as in Sec.4.1. Thus, we have |Ω_t| = L - 1 and the complexity of
learning relational weights is O(M|Ω_t|) = O(ML).
So far, we have elaborated our framework for group tracking. For easy
reference, we summarize the whole tracking algorithm in Tab.1.

6 Experiment
In this section, we carry out experiments to demonstrate the advantages of
MRMs for multi-object tracking, where the tracked objects are not constrained
to be within a specified class. Besides MRMs, IOMs are also very important for
the whole tracking algorithm. As stated in Sec.3, many approaches for single
object tracking can be used as IOMs. But in this paper, we focus on showing
the performance of MRMs combined with one kind of IOMs. In the experiment,
we select MIL [1] as IOMs, which has achieved promising results in [1]. We use
the default parameters of MIL as in [1] and will demonstrate in the following
that the combination of MIL and MRMs improves tracking performances fur-
ther. Combining other kinds of IOMs with MRMs is one direction of the future
work. All experiments below are conducted on an Intel Core(TM)2 2.33GHz PC
with 2G memory.

6.1 Experimental Setup


Parameters. We select at most L = 100 locations for each object in each frame
and always set the weight update parameter η_t = 1000 in Eq.(7). We update
the relational graph every N_U = 50 frames or if the speed of one object is larger
than θ_v = 8. If a change of the relational graph is detected in Tab. 1, we use
the tracked results of the latest k = 3 frames to learn new relational weights.
Datasets. To build up and utilize mutual relations efficiently, a proper test
sequence requires that objects are fully or partially visible in most frames. Cur-
rently, there are no standard datasets for this objective. Therefore, we select ten
sequences from [1, 3, 21, 22] for evaluation, where we can easily label multi-
ple objects for tracking. We select david indoor, occluded face, occluded face2,
coke and cliffbar from [1] and shaking, animal and skating1² from [3], which
are widely used for single object tracking and cover the difficulties of occlu-
sions, changing appearances, varying illumination and abrupt motions. [21] is a
widely used dataset for pedestrian tracking. We select a relatively challenging
sequence, Seq.#3, because of frequent and significant light changes. [22] is a
sign language dataset which focuses on locating arms and hands. Abrupt mo-
tions of hands, similar appearances of hands and relatively low resolutions make
this dataset very difficult for online visual tracking. The whole sequence is very
large, and we select a representative clip (#191–#490) for evaluation.
² In most frames of skating1, several objects are not simultaneously visible due
to abrupt motions. To track several objects, we select #210–#345 for evaluation.

Metrics. Center location errors along with frame numbers are widely used for
evaluating single object tracking algorithms and the mean error over all frames
is a summarized performance. However, if a tracker tracks a nearby object but
not the concerned one for a long time, location errors fail to correctly capture
tracker performances. Another metric is the overlap criterion, where a result is
positive only if the intersection is larger than 50% of the union compared to
the object with the same label. Using this metric, we can calculate the detection
rate by TruePos/(FalseNeg + TruePos).
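For reference, the overlap criterion and the detection rate can be computed as in the sketch below (boxes as (x, y, w, h); helper names are ours, not from the authors' evaluation code).

def iou(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def detection_rate(results, ground_truths, thresh=0.5):
    # results[i] / ground_truths[i]: box of the object with the same label in
    # frame i, or None if missing; a result is positive only if the
    # intersection exceeds `thresh` of the union.
    tp = fn = 0
    for res, gt in zip(results, ground_truths):
        if gt is None:
            continue
        if res is not None and iou(res, gt) > thresh:
            tp += 1
        else:
            fn += 1
    return tp / float(tp + fn) if tp + fn else 0.0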
Compared Algorithms. To our knowledge, there are no implementations avail-
able publicly for group tracking, and thus we compare our approach with
the algorithms applying IOMs individually. Some applied state-of-the-art algo-
rithms for single object tracking are Online-AdaBoost (OAB) [23], FragTrack [6],
MIL [1], TLD [7] and VTD [3]. We utilize the codes provided by the authors
and carefully adjust the parameters of the trackers for fair comparisons. We use
the best ve results from multiple trials and average the location errors and
detection rates, or directly take results from the prior works.

6.2 Performance Evaluation


6.2.1 Evaluation on Two Objects
First, we evaluate our approach on the simplest situation, tracking two objects
together as illustrated in red in Fig. 3, where the relational graph always connects
the two objects. We show the location error plots of some objects in Fig. 4
and summarize the average location errors and detection rates in Fig. 5. Generally,
our approach improves on MIL significantly in most sequences, with lower
location errors and higher detection rates. Due to the challenges of the testing
sequences, MIL does not perform as well as VTD and TLD in many sequences,
while our approach performs as well as or even better than TLD and VTD in
some sequences. Note that Fig. 5(a) reports the reduction in average location
error of MRMs+IOMs compared to IOMs alone, where MIL is selected as the IOM.
Although our current implementation is not the best in average location error,
we believe that our framework could also improve other algorithms to some
extent if they were selected as IOMs.
We mention some typical results on face, coke, shaking and skating1. Because
of frequent occlusions in face, MIL performs much less precisely than Frag, TLD
and VTD. Our approach utilizes effective MRMs to achieve results as precise
as Frag, TLD and VTD. The evaluated object in coke is specular, which makes
it difficult to learn generative models and discriminative classifiers for it.
Thus VTD and MIL drift easily, while TLD obtains more stable results with
structured constraints on samples. With the help of effective MRMs, our approach
achieves the highest detection rate on coke. Because of abrupt motions
and severe illumination changes in shaking and skating1, TLD and MIL become
inaccurate and lose the objects in early frames. Our approach obtains a lower
detection rate than VTD on shaking. The main reason might be that our approach
can learn relatively robust mutual relations, but the applied individual model,

[Figure: qualitative tracking results with frame numbers overlaid on each panel; legend: OAB, FRAG, TLD, VTD, MIL, OUR.]
Fig. 3. Qualitative results on the david, face, face2, coke, cliffbar, shaking, animal and
skating1 sequences, from left to right and top to bottom

[Figure: per-sequence plots of position error (pixels) versus frame number for David, Occluded Face, Occluded Face2, Coke, Cliffbar, Shaking, Animal and Skating1, comparing FRAG, OAB, TLD, VTD, MIL and OUR.]
Fig. 4. Location error plots of some objects on test sequences

MIL, is not designed to handle both abrupt motions and severe illumination
changes. Moreover, our detection rate on skating1 is higher than that of VTD.
The reason is that the tracked objects in skating1 move violently from the very
beginning and undergo large variations of pose and scale, which makes it
difficult to build holistic appearance models.

6.2.2 Evaluation on More Objects

We then carry out experiments in more complex situations, tracking more than
two objects. Different from the fixed relational graphs used previously, we show the
performance of adaptive relational graphs in this subsection. We select two challenging
sequences, Seq.#3 [21] and Sign [22], and enumerate fixed relational graphs to
compare with adaptive relational graphs as shown in Fig. 6(a,c).

(a) Average location errors (pixels)
            OAB   Frag   TLD   VTD   MIL   OUR   Δ1
  david      32     46    12    26    23    15    8
  face       39      6    10    10    27    10   17
  face2      21     15    15     9    20    14    6
  coke       25     63     9    45    21    11   10
  cliffbar   14     34     6    12    12    11    1
  shaking    16     62     6     8    34    13   21
  animal     73     91    19    10    35    21   14
  skating1   66     86    10    13    60    48   12

(b) Detection rates
            OAB   Frag   TLD   VTD   MIL   OUR   Δ2
  david     0.34  0.06  0.59  0.72  0.65  0.96  0.31
  face      0.44  1     1     0.93  0.77  1     0.23
  face2     0.80  0.84  0.85  0.96  0.87  0.96  0.09
  coke      0.18  0.06  0.43  0.06  0.26  0.64  0.38
  cliffbar  0.54  0.22  0.79  0.68  0.71  0.8   0.09
  shaking   0.63  0.13  0.16  0.95  0.49  0.73  0.24
  animal    0.16  0.02  0.72  0.93  0.44  0.48  0.04
  skating1  0.29  0.12  0.1   0.37  0.41  0.56  0.15

Fig. 5. Quantitative results. Comparing our approach with MIL, Δ1 in (a) gives the
reduction in mean location error, and Δ2 in (b) gives the improvement in detection rate.

[Figure: bar charts comparing MIL, RG1, RG2, RG3 and RGada; detection rates on Seq.#3 and average location errors on Sign.]
Fig. 6. Tracking results on Seq.#3 (a,b) and Sign (c,d). (a,c) show the tracked objects in
rectangles and the enumerated trees of relational graphs. (b) compares the detection rates
on Seq.#3 and (d) compares average location errors on Sign, where RGada indicates
the approach using adaptive relational graphs.

The tracked objects in Seq.#3 are a man, a woman and their entirety, who are
walking together under frequent illumination changes. As Seq.#3 is fully annotated
with rectangles, we use detection rates to compare our approaches with MIL
in Fig. 6(b). Due to the frequent illumination changes, MIL drifts easily on the man and
the woman. Our approaches, with fixed relational graphs and with the adaptive relational
graph, all show significant improvements over MIL on the man, and in particular
the adaptive relational graph improves the detection rate by about 25%.
However, the improvement of the adaptive structure on the woman is slight
(about 3.7%) and RG3 drops the tracking performance slightly, mainly because
the illumination changes on the woman are more severe than those on the man.
The tracked objects in Sign are the head, the left hand and the right hand, whose motions
are abrupt. Because Sign is sparsely annotated with masks while the tracked results
are boxes of fixed ratio, we use average location errors to compare our approaches with
MIL in Fig. 6(d). MIL tracks the left hand well, and our approaches with mutual
relations maintain that accuracy. However, due to violent motions and
changing appearance of the right hand, MIL and the fixed relational graphs drift
easily. In contrast, the adaptive relational graph performs more accurately and
always tracks the real objects, as shown in Fig. 7.

[Figure: qualitative comparison on Sign and bar charts of detection rates and average location errors for groups G1–G6; annotations mark where objects are switched or drift.]
Fig. 7. Tracking results of MIL (top) and OUR (bottom) on frames #1, #11, #24,
#32 and #163 from Sign.
Fig. 8. Examples of the impact of the selected objects on tracking. Please see Sec. 6.2.3 for
more details.

6.2.3 Impacts of Selected Objects

Our framework consists of IOMs and MRMs. Accurate IOMs are the basis for
learning robust MRMs, and in return the object locations improved by robust MRMs
make the IOMs more accurate. If the selected objects encounter
great challenges, the IOMs may be inaccurate in some frames and the overall
performance is also affected. For example, the detection rates on coke, shaking,
animal and skating1 are lower than those on the other sequences in Sec. 6.2.1, mainly
because of severe illumination changes and abrupt motions. If the connections
between the selected objects are relatively strong, robust MRMs can be learned efficiently,
as in the sequences of Sec. 6.2.1. Due to frequent illumination changes, the
connection between the pedestrians in Seq.#3 is weak, leading to the reduced performance
of RG3 in Fig. 6(b). Similarly, the connection between the hands in Sign is weak due to
abrupt motions, and the fixed relational graphs drift easily.
As our main contribution is in MRMs, we give more analysis of MRMs,
especially in two-object tracking because it is the most basic situation. We choose
a typical sequence, cliffbar, where the evaluated object is the bar shown in red in
Fig. 8(a). We select five candidates, head, box, torso, hand and elbow, and then
track both the bar and one candidate at a time, resulting in the five groups G1–G5
in Fig. 8(b)(top). As MIL performs similarly on these objects, the performances
are mainly influenced by the MRMs. The connections between objects in G3/G4/G5
are relatively stronger than those in G1/G2, because the bar directly links with
the hand, elbow and torso and their motions are related to each other, whereas these
connections are much weaker between the bar and the head or the box. This explains the
results in Fig. 8(b)(bottom), where G3/G4/G5 achieve slightly better results than
MIL, but G1/G2 reduce the performance a lot.
Overall, our approach can exploit MRMs well for robust tracking when the selected
objects have relatively strong connections and some of the IOMs are relatively
accurate. To enhance the connections between objects, we need more appropriate
models to measure interactions of objects and construct relational graphs. To
cope with inaccurate IOMs, we may utilize different, more suitable IOMs for different objects.
Both are beyond the scope of this paper and will be studied in the future.

6.3 Discussion
Speed. Inference is highly efficient in our experiments (less than 10 ms),
while the bottleneck of our implementation is the online learning of IOMs. We
observe that the overall speed, excluding inference, is slightly slower than tracking
the objects individually. A partial reason might be that the inferred locations are not
discriminative, which causes some difficulties for the online learning of object models.
Comparison with Other Works. Although [8, 9] and our approach all address
multi-object tracking, their focus is different. Since the tracked
objects are known in [8, 9], these methods are provided with relatively good offline-learned
object models (detectors) and are able to cope with many objects simultaneously.
Thus, [8, 9] focus on handling ID switches, because the offline models cannot
distinguish objects of the same category, and on coping with fragments, because
the offline models cannot cover all kinds of objects in the category. However,
as objects are previously unseen in our problem, we must build object models
online, where great challenges come from changes of the objects and their
surroundings. Therefore, our main concern is to build up proper IOMs and MRMs
to avoid drifting and switching errors, particularly when the tracked objects have
similar appearances.

7 Conclusion and Future Work

In this paper, we present a novel framework combining IOMs and MRMs
to track multiple previously unseen objects. In the MRMs, the relational graph
indicates related objects, the mutual relation vectors describe the interactions, and
the relational weights balance all interactions and the IOMs. We use online LSVM
to learn the relational weights and analyze object trajectories to update the relational
graphs. Various experiments show the advantages of our framework.
There are several interesting ways to extend this work in the future. Firstly,
no single algorithm can cover all difficulties in visual tracking.
Fortunately, our framework provides an easy way to integrate different IOMs,
and we will investigate whether some models can complement each other. Secondly, we
will collect suitable real-world datasets for direct evaluation of tracking multiple
(≥ 3) previously unseen objects. Finally, our framework can be extended to track
a specific object if we select some regions or points online, as in [4].

Acknowledgements. This work was supported in part by the National Science
Foundation of China under Grant No. 61075026 and the National Basic Research
Program of China under Grant 2011CB302203.

References
1. Babenko, B., Yang, M.H., Belongie, S.: Visual tracking with online multiple instance learning. In: CVPR (2009)
2. Godec, M., Roth, P.M., Bischof, H.: Hough-based tracking of non-rigid objects. In: ICCV (2011)
3. Kwon, J., Lee, K.M.: Visual tracking decomposition. In: CVPR (2010)
4. Grabner, H., Matas, J., Gool, L.V., Cattin, P.: Tracking the invisible: Learning where the object might be. In: CVPR (2010)
5. Dinh, T.B., Vo, N., Medioni, G.: Context tracker: Exploring supporters and distracters in unconstrained environments. In: CVPR (2011)
6. Adam, A., Rivlin, E., Shimshoni, I.: Robust fragments-based tracking using the integral histogram. In: CVPR (2006)
7. Kalal, Z., Matas, J., Mikolajczyk, K.: P-N learning: Bootstrapping binary classifiers by structural constraints. In: CVPR (2010)
8. Breitenstein, M., Reichlin, F., Leibe, B., Koller-Meier, E., Gool, L.V.: Online multi-person tracking-by-detection from a single, uncalibrated camera. PAMI 33, 1820–1833 (2011)
9. Huang, C., Wu, B., Nevatia, R.: Robust Object Tracking by Hierarchical Association of Detection Responses. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part II. LNCS, vol. 5303, pp. 788–801. Springer, Heidelberg (2008)
10. Pellegrini, S., Ess, A., Schindler, K., van Gool, L.: You'll never walk alone: Modeling social behavior for multi-target tracking. In: ICCV (2009)
11. Felzenszwalb, P., Huttenlocher, D.: Pictorial structures for object recognition. IJCV 61, 55–79 (2005)
12. Sapp, B., Weiss, D., Taskar, B.: Parsing human motion with stretchable models. In: CVPR (2011)
13. Tsochantaridis, I., Joachims, T., Hofmann, T., Altun, Y.: Large margin methods for structured and interdependent output variables. JMLR 6, 1453–1484 (2005)
14. Branson, S., Perona, P., Belongie, S.: Strong supervision from weak annotation: Interactive training of deformable part models. In: ICCV (2011)
15. Desai, C., Ramanan, D., Fowlkes, C.: Discriminative models for multi-class object layout. In: ICCV (2009)
16. Felzenszwalb, P., McAllester, D., Ramanan, D.: A discriminatively trained, multiscale, deformable part model. In: CVPR (2008)
17. Yang, Y., Ramanan, D.: Articulated pose estimation using flexible mixtures of parts. In: CVPR (2011)
18. Finley, T., Joachims, T.: Training structural SVMs when exact inference is intractable. In: ICML (2008)
19. Hazan, E., Agarwal, A., Kale, S.: Logarithmic regret algorithms for online convex optimization. Machine Learning 69, 169–192 (2007)
20. Kalman, R.E.: A new approach to linear filtering and prediction problems. Transactions of the ASME, Journal of Basic Engineering 82, 35–45 (1960)
21. Ess, A., Leibe, B., Gool, L.V.: Depth and appearance for mobile scene analysis. In: ICCV (2007)
22. Buehler, P., Everingham, M., Huttenlocher, D., Zisserman, A.: Long term arm and hand tracking for continuous sign language TV broadcasts. In: BMVC (2008)
23. Grabner, H., Grabner, M., Bischof, H.: Real-time tracking via online boosting. In: BMVC (2006)
A Discrete Chain Graph Model
for 3d+t Cell Tracking
with High Misdetection Robustness

Bernhard X. Kausler¹, Martin Schiegg¹, Bjoern Andres¹,², Martin Lindner¹,
Ullrich Koethe¹, Heike Leitte¹, Jochen Wittbrodt³, Lars Hufnagel⁴, and Fred A. Hamprecht¹
¹ HCI/IWR, Heidelberg University, Germany
² SEAS, Harvard University, United States
³ COS, Heidelberg University, Germany
⁴ European Molecular Biology Laboratory (EMBL), Heidelberg, Germany
fred.hamprecht@iwr.uni-heidelberg.de

Abstract. Tracking by assignment is well suited for tracking a varying
number of divisible cells, but suffers from false positive detections. We
reformulate tracking by assignment as a chain graph (a mixed directed-undirected
probabilistic graphical model) and obtain a tracking simultaneously
over all time steps from the maximum a-posteriori configuration.
The model is evaluated on two challenging four-dimensional data sets from
developmental biology. Compared to previous work, we obtain improved
tracks due to an increased robustness against false positive detections and
the incorporation of temporal domain knowledge.

Keywords: chain graph, graphical model, cell tracking.

1 Introduction
One grand challenge of developmental biology is to find the lineage tree of all
cells in a growing organism, i.e. the complete ancestry of each cell [17]. The challenges
encountered include (i) simultaneously tracking an unknown and variable
number of cells; (ii) allowing for cell division; and (iii) high accuracy, because
each tracking error affects a complete subtree of the lineage.
We propose a tracking-by-assignment solution contributing these novelties:
1. In recent years graphical models have proved to be exceptionally successful in
computer vision [4]. We adopt this approach and present the first probabilistic
graphical model for cell tracking which can track an unknown number
of divisible objects that may appear or disappear and which can cope with
false detections.
2. The model achieves robustness against noise detections by solving the tracking
problem for all time steps at once, taking the expected time between
divisions into account. The most likely cell lineage is obtained as the model
configuration with maximum a-posteriori (MAP) probability.


[Figure: maximum intensity projection with overlaid tracks, and lineage trees over time steps 1–189 for our model, manual tracking, and Bise et al. [3].]
Fig. 1. Maximum intensity projection of a Drosophila embryo and tracking: common
ancestry is indicated by common color.
Fig. 2. Lineage trees for Drosophila: exemplary manually tracked lineage trees are
compared to the results of Bise et al. [3] and our model.

3. While all literature with similar data focuses on single organisms, we present
results on zebrafish and, for the first time, on Drosophila, together with gold
standard lineages and benchmarking tools (as supplementary material).
We obtain the exact MAP solution by integer linear programming (ILP) and,
when comparing to the gold standard lineages, obtain results that are superior
to existing methods.

2 Related Work

Tracking in science comprises many different application domains and approaches
[23]. Our work is concerned with tracking a varying number of divisible objects
with similar appearance in images, with its main application in cell lineage
reconstruction. It has to be distinguished from tracking many similar particles
that are not dividing [21].
When the temporal resolution of the raw data is high compared to the rate
of change, derivative-based methods such as optical flow [18] or level sets [20]
can be used to simultaneously segment and track the evolution of one or more
targets in space-time.
A lower temporal resolution, as is the case in our application, makes the
problem harder; in response, most algorithms separate the detection/segmentation
problem from the tracking problem: detections are found in each

time step and fed into a tracking routine. Two different kinds of models are
typically used: state space models and assignment models. The Kalman filter, a
linear Gaussian state space model, is the archetype of the former class, which
interprets detections as caused by a hidden target. It has been generalized in
a number of ways, allowing for nonlinear motion models, discrete state spaces,
non-Gaussian distributions [2,7], multiple target hypotheses, or even an unknown
(but fixed) number of targets [8]. While state space models easily accommodate
target properties such as velocity, size, and appearance and are robust against
noisy detections by design, they are unfortunately hard to coax into dealing with
a variable number of dividing objects.
Tracking by assignment, on the other hand, treats every detection as a potential
target itself. It easily accommodates multiple or dividing objects; however,
this increased flexibility must be reined in by enforcing the consistency (see Section 3)
of a tracking. This difficulty has so far been addressed in three ways.
Firstly, approximate methods can be applied that forego a consistency guarantee [10].
Secondly, tracks can be generated hierarchically from tracklets [3,15,6],
akin to the use of superpixels in image processing. Thirdly, tracking is performed
across pairs of frames only [11,12,19,16].
Approach [3] is most similar to ours since it is, to our best knowledge, the
only other global model of detections and assignments over many time steps.
Therefore we chose this method to compare with.

3 Graphical Model for Global Tracking by Assignment

Both state space and assignment models treat detection (segmentation) and
tracking as separate problems. We will focus on tracking in this article and
assume given detections with a high misdetection rate (≥ 10%).
Specifically, we propose a probabilistic model for the tracking problem that
meets the requirements set out in the introduction and can give limited feedback
to the detection step by incorporating the possibility of spurious detections, i.e.
it allows switching off erroneous detection hypotheses to address the high misdetection
rate. The model takes the form of a chain graph [9]: a directed graphical
model (Bayesian network) of supernodes, each of which consists of a conditional
random field over a set of subnodes. A chain graph model turns out to
be necessary because neither Bayesian networks nor Markov random fields alone
are compatible with the independence assumptions imposed by tracking.
Our model contains two types of random variables: detection variables and
assignment variables. A binary detection variable $X_i^{(t)}$ is associated with the $i$th
detected object (cell candidate) at time step $t$, and $X^{(t)}$ denotes the set of all
detection variables at time $t$. Values of these variables determine whether the
corresponding detections are accepted as a true object (cell) (and hence incorporated
into the tracking interpretation) or rejected as a false alarm.¹ Binary assignment
variables $Y_{ij}^{(t)}$ are associated with every pair of cell candidates $i, j$ at times
¹ We deliberately do not model missed detections (i.e. objects that are not visible in
the data) since they are very rare in the 3d images of typical applications.

$t \in [1, \ldots, T-1]$ and $t+1$, and $Y^{(t)}$ is the set of all assignment variables between
steps $t$ and $t+1$. A value of 1 expresses the belief that cell candidate $j$ at time
step $t+1$ is identical with, or a child of, cell candidate $i$ at time step $t$, and a
value of 0 says that $i$ and $j$ are unrelated.
In addition, a number of natural consistency constraints are imposed on these
variables by our application in developmental biology. First, each cell must have
at most one predecessor, i.e. $\sum_i Y_{ij}^{(t)} \le 1$. Second, each cell must have a unique
fate: if it does not disappear (by leaving the data frame or for other reasons), it
can either move to (be assigned to) a unique cell candidate in the next time
frame, or it can divide and be associated with exactly two cell candidates in
the next time frame. These children, in turn, may not have any other nonzero
incoming assignment variables. Third, there is a biological upper bound on the
distance traveled between time frames, excluding all assignments that are too far
apart. This leads to a significant reduction in the number of required assignment
variables; a sketch of such distance gating is given below.
These consistency constraints define which configurations are impossible, i.e.
have zero probability. In the next two subsections, we describe how the probabilities
of feasible configurations are defined by connecting variables into a chain
graph whose factors encode our knowledge about the plausibility of different
trackings as a function of data-dependent properties such as distance traveled,
probability of being a misdetection, etc.
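To make the distance bound concrete, the sketch below generates assignment variables only for detection pairs whose centroids are closer than an assumed maximum inter-frame displacement d_max. The function and variable names are placeholders for illustration, not the authors' implementation.

```python
import numpy as np

def candidate_assignments(centroids_t, centroids_t1, d_max):
    """Return index pairs (i, j) for which an assignment variable Y_ij is created."""
    pairs = []
    for i, c_i in enumerate(centroids_t):
        for j, c_j in enumerate(centroids_t1):
            # only nearby candidates may be linked between consecutive frames
            if np.linalg.norm(np.asarray(c_i) - np.asarray(c_j)) <= d_max:
                pairs.append((i, j))
    return pairs
```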

Conditional Random Field over Assignment Variables. Let us pretend
for the moment that the optimal values of the detection variables are already
known. Then the probability of a configuration $Y^{(t)}$ of assignment variables
connecting frames $t$ and $t+1$ given the detection variables $X^{(t)}$ and $X^{(t+1)}$ can
be expressed with an undirected graphical model, in particular a conditional
random field (CRF):

$$\mathrm{CRF}^{(t)}: \quad P^{(t)}(Y^{(t)} \mid X^{(t)}, X^{(t+1)}) \qquad (1)$$

where

$$P^{(t)}(Y^{(t)} \mid X^{(t)}, X^{(t+1)}) = \frac{1}{Z^{(t)}} \prod_{X_i^{(t)} \in X^{(t)}} \phi_i^{(t)}(X_i^{(t)}, Y_{i\cdot}^{(t)}) \prod_{X_j^{(t+1)} \in X^{(t+1)}} \psi_j^{(t)}(Y_{\cdot j}^{(t)}, X_j^{(t+1)}) , \qquad (2)$$

and $\phi_i^{(t)}$ and $\psi_j^{(t)}$ denote the factors for outgoing and incoming transitions, respectively.
Note that the CRF partition function $Z^{(t)}(X^{(t)}, X^{(t+1)})$ is only calculated
over the assignment variables, since the detection variables are considered
as given. A factor graph [14] representation of eq. (2) is shown in Fig. 3a. Each of
the maximal cliques of the underlying undirected graph corresponds to a factor
in the CRF. These CRFs will form the supernodes of the chain graph, while the
individual assignment variables are the subnodes.
Analyzing the graph with d-separation, the implicit independence assumptions
of the CRF can be identified: an assignment $Y_{ij}^{(t)}$ is conditionally independent of

Fig. 3. Chain graphical model for global tracking by assignment. (a) Undirected part
of the chain graph: a single assignment CRF is shown as a factor graph [14]; the dark
nodes represent fixed detection variables X (at t and t+1) and the smaller, light nodes
assignment variables Y. (b) Directed part of the chain graph: the boxes symbolize
supernodes containing assignment CRFs according to Fig. 3a.

all assignments $Y_{kl}^{(t)}$ involving different detection variables $k \neq i$, $l \neq j$, provided
that all incoming assignments $Y_{\cdot j}^{(t)}$ of detection $j$ and all outgoing assignments
$Y_{i\cdot}^{(t)}$ of detection $i$ are given:

$$Y_{ij}^{(t)} \;\perp\; Y_{kl}^{(t)} \;\big|\; Y_{\cdot j}^{(t)}, Y_{i\cdot}^{(t)} \qquad \forall\, l \neq j,\; i \neq k . \qquad (3)$$

In other words, the assignment decisions only influence a small neighborhood of
other assignment variables directly. By fixing enough variables, the CRF can even
decouple into two or more completely independent tracking problems, promoting
the tractability of the chain graph model.

Combining Assignment CRFs in a Chain Graph. In order to get rid of
the requirement that the detection variables are already known, we connect all CRFs
defined above by means of directed edges from detection variables to factors, in
a manner analogous to a Bayesian network. The joint probability over all time
steps can now be factorized as

$$P(\mathcal{X}, \mathcal{Y}) = \prod_{t=1}^{T} \prod_{X_i^{(t)} \in X^{(t)}} P_{\mathrm{det}}(X_i^{(t)}) \; \prod_{t=1}^{T-1} P^{(t)}(Y^{(t)} \mid X^{(t)}, X^{(t+1)}) , \qquad (4)$$

where $P_{\mathrm{det}}(X_i^{(t)})$ is the probability that a detection hypothesis should be accepted
into the tracking. The corresponding chain graphical model is sketched
in Fig. 3b. By treating each CRF as a single random variable (supernode)
with as many states as configurations of the assignment variables (subnodes),
d-separation can be easily applied to the directed part of the model. The (unobserved)
supernodes have only incoming edges and are blocking the path between
detection variables, revealing the following conditional independence relations:

1. Detection variables are independent of each other, i.e. $X_i \perp (\mathcal{X} \setminus X_i)$ for all $X_i$.
2. Consecutive assignment variables are conditionally independent given the
connecting detection variables, i.e. $Y^{(t)} \perp Y^{(t+1)} \mid X^{(t+1)}$.
These are rather strong independence assumptions and, of course, only an approximation
to the real world. Later, we introduce an extension (eqs. (11), (10))
that, on the one hand, improves the model accuracy but, on the other hand,
weakens these relations and makes inference harder.

Inference by Integer Linear Programming. Under the above model, the
single most likely tracking is given by the maximum a-posteriori (MAP) configuration
of variables. Finding this configuration can equivalently be restated as
an energy minimization problem, by expressing the probability in eq. (4) as a
Gibbs distribution $P(\mathcal{X}, \mathcal{Y}) = \frac{1}{Z} e^{-E(\mathcal{X}, \mathcal{Y})}$ such that

$$\operatorname*{argmax}_{\mathcal{X}, \mathcal{Y}} P(\mathcal{X}, \mathcal{Y}) = \operatorname*{argmin}_{\mathcal{X}, \mathcal{Y}} E(\mathcal{X}, \mathcal{Y}) . \qquad (5)$$

Rewriting the chain graph distribution in (4) as a sum of energies E allows us to
apply integer linear programming (ILP) for the minimization. This approach has two
advantages. Firstly, it allows us to use state-of-the-art ILP solvers that are able to
determine globally optimal solutions of (5). Any remaining shortcomings of our
tracking results can thus be attributed to our model and are not a consequence of
approximate optimization. Secondly, we can easily exclude biologically impossible
trackings by incorporating appropriate constraints in the linear program.
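As an illustration only, the sketch below sets up MAP inference for a toy two-frame instance as an ILP, using the open-source PuLP/CBC stack rather than the CPLEX backend used by the authors. Divisions and the full case-based energies of eqs. (8)–(9) introduced below would require additional indicator variables and are omitted here; all energy values are invented.

```python
import pulp

# Toy instance: E_det[(t, i)] = (energy if rejected, energy if accepted) per detection,
# and move energies for candidate links between frames t0 and t1.
E_det = {("t0", 0): (0.2, 1.6), ("t0", 1): (1.4, 0.3),
         ("t1", 0): (0.1, 2.0), ("t1", 1): (1.2, 0.4)}
E_move = {(0, 0): 0.5, (0, 1): 1.8, (1, 0): 2.0, (1, 1): 0.6}
C_init, C_term = 1.0, 1.0   # appearance / disappearance penalties (assumed)

prob = pulp.LpProblem("tracking_map", pulp.LpMinimize)
X = {k: pulp.LpVariable(f"X_{k[0]}_{k[1]}", cat="Binary") for k in E_det}
Y = {k: pulp.LpVariable(f"Y_{k[0]}_{k[1]}", cat="Binary") for k in E_move}

# Consistency (moves only, no divisions in this sketch): at most one child per accepted
# detection in t0, at most one parent per accepted detection in t1, no links for
# rejected detections.
for i in (0, 1):
    prob += pulp.lpSum(Y[(i, j)] for j in (0, 1)) <= X[("t0", i)]
for j in (0, 1):
    prob += pulp.lpSum(Y[(i, j)] for i in (0, 1)) <= X[("t1", j)]

# Energy: detection terms, move terms, and linear penalties for accepted but
# unmatched detections (disappearance in t0, appearance in t1).
prob += (
    pulp.lpSum(E_det[k][0] * (1 - X[k]) + E_det[k][1] * X[k] for k in E_det)
    + pulp.lpSum(E_move[k] * Y[k] for k in E_move)
    + pulp.lpSum(C_term * (X[("t0", i)] - pulp.lpSum(Y[(i, j)] for j in (0, 1))) for i in (0, 1))
    + pulp.lpSum(C_init * (X[("t1", j)] - pulp.lpSum(Y[(i, j)] for i in (0, 1))) for j in (0, 1))
)

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({v.name: int(v.value()) for v in prob.variables()})
```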

Application: Cell Tracking. Applying (5) to the chain graph model equations
(2) and (4) gives the energy function

$$E(\mathcal{X}, \mathcal{Y}) = \sum_{t=1}^{T} \sum_{X_i^{(t)} \in X^{(t)}} E_{\mathrm{det}}(X_i^{(t)}) + \sum_{t=1}^{T-1} \Big[ \sum_i E_{\mathrm{out}}(X_i^{(t)}, Y_{i\cdot}^{(t)}) + \sum_j E_{\mathrm{in}}(Y_{\cdot j}^{(t)}, X_j^{(t+1)}) \Big] \qquad (6)$$

where $E_{\mathrm{det}}$, $E_{\mathrm{out}}$, and $E_{\mathrm{in}}$ are the energies corresponding to the factors $P_{\mathrm{det}}$,
$\phi_i^{(t)}$, and $\psi_j^{(t)}$, respectively.
The first term allows us to incorporate evidence from the raw data regarding
the presence or absence of a cell at a location where a potential target was detected.
In our experiments, we use a Random Forest [5] to estimate the probability
$P_{\mathrm{det}}(X_i^{(t)})$ of a detection being truly a cell, based on features such as the intensity
distribution and volume of each cell candidate found by the detection module.
In keeping with the definition of the Gibbs distribution, we get

$$E_{\mathrm{det}}(X_i^{(t)}) = \begin{cases} -\ln P_{\mathrm{det}}(X_i^{(t)}), & X_i^{(t)} = 1 \\ -\ln\big(1 - P_{\mathrm{det}}(X_i^{(t)})\big), & X_i^{(t)} = 0 \end{cases} \qquad (7)$$
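A brief sketch of how eq. (7) could be evaluated in practice: a Random Forest (here scikit-learn, trained on random stand-in features purely for illustration; the authors' feature set and classifier configuration are not shown) yields P_det, which is then converted into the two detection energies.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
train_features = rng.normal(size=(200, 5))     # stand-ins for intensity/volume features
train_is_cell = rng.integers(0, 2, size=200)   # stand-in labels: 1 = cell, 0 = misdetection
candidate_features = rng.normal(size=(10, 5))

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(train_features, train_is_cell)
p_det = rf.predict_proba(candidate_features)[:, 1]   # P_det(X_i = 1) per candidate
p_det = np.clip(p_det, 1e-6, 1 - 1e-6)               # avoid log(0)
E_det_accept = -np.log(p_det)                        # energy if X_i = 1
E_det_reject = -np.log(1.0 - p_det)                  # energy if X_i = 0
```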

The assignment energies $E_{\mathrm{in}}$ and $E_{\mathrm{out}}$ need to embody two kinds of prior knowledge:
firstly, which trackings are consistent; and secondly, amongst all consistent
trackings, which ones are plausible. To enforce consistency, zero probability /
infinite energy is assigned to cells dividing into more than two descendants, cells
arising from the merging of two or more cells from the previous frame, and cell
candidates that are regarded as misdetections.
To penalize implausible trackings, we need to parameterize our model appropriately.
The model equations expose exactly five free parameters ($C_{\mathrm{init}}$, $C_{\mathrm{term}}$, $C_{\mathrm{opp}}$, $w$, $\bar d$).
$C_{\mathrm{init}}$ imposes a penalty for track initiation (i.e. when a
cell (re-)appears in the data frame) and $C_{\mathrm{term}}$ for track termination (i.e. when
a cell disappears from the data frame or is lost due to cell death). To discourage
trivial solutions where too many objects are explained as misdetections, we
exact an opportunity cost of $C_{\mathrm{opp}}$ for the lost opportunity to explain more of the
data. Furthermore, each move is associated with a cost dependent on the squared
length of that move. For cell divisions, finally, it is known that the descendants
tend to be localized at an average distance $\bar d$ from the parent cell, and we punish
deviations from that expected distance.
All of the above can be summarized in the following assignment energies:

$$E_{\mathrm{out}}(X_i^{(t)}, Y_{i\cdot}^{(t)}) = \begin{cases}
\infty, & X_i^{(t)} = 1 \wedge \sum_j Y_{ij}^{(t)} > 2 & \text{(> 2 children)} \\
w\big((d_1 - \bar d)^2 + (d_2 - \bar d)^2\big), & X_i^{(t)} = 1 \wedge \sum_j Y_{ij}^{(t)} = 2 & \text{(division)} \\
w\, d^2, & X_i^{(t)} = 1 \wedge \sum_j Y_{ij}^{(t)} = 1 & \text{(move)} \\
C_{\mathrm{term}}, & X_i^{(t)} = 1 \wedge \sum_j Y_{ij}^{(t)} = 0 & \text{(disappearance)} \\
C_{\mathrm{opp}}, & X_i^{(t)} = 0 \wedge \sum_j Y_{ij}^{(t)} = 0 & \text{(opportunity)} \\
\infty, & X_i^{(t)} = 0 \wedge \sum_j Y_{ij}^{(t)} > 0 & \text{(tracked misdetection)}
\end{cases} \qquad (8)$$

$$E_{\mathrm{in}}(Y_{\cdot j}^{(t)}, X_j^{(t+1)}) = \begin{cases}
\infty, & X_j^{(t+1)} = 1 \wedge \sum_i Y_{ij}^{(t)} > 1 & \text{(> 1 parent)} \\
0, & X_j^{(t+1)} = 1 \wedge \sum_i Y_{ij}^{(t)} = 1 & \text{(move)} \\
C_{\mathrm{init}}, & X_j^{(t+1)} = 1 \wedge \sum_i Y_{ij}^{(t)} = 0 & \text{(appearance)} \\
0, & X_j^{(t+1)} = 0 \wedge \sum_i Y_{ij}^{(t)} = 0 & \text{(misdetection, no parent)} \\
\infty, & X_j^{(t+1)} = 0 \wedge \sum_i Y_{ij}^{(t)} > 0 & \text{(misdetection with parent)}
\end{cases} \qquad (9)$$

Here $d$ denotes the length of a move and $d_1, d_2$ the distances from the parent to its two children.

Note that more informative features could be extracted from the raw data: for
instance, besides the length of a move, one could consider the similarity of the
associated cell candidates to obtain an improved estimate of their compatibility.
However, it is then no longer possible to select appropriate values for the parameters
with a simple grid search. Instead, a proper parameter learning strategy will
be needed, which will be a subject of our future research.
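For concreteness, here is a hedged sketch of the outgoing-assignment energy of eq. (8) as a plain function of a detection's state, the number of assigned successors, and their distances; the parameter values are placeholders, and the infinite cases mirror the consistency rules.

```python
import math

def e_out(accepted, child_distances, w=1.0, d_bar=10.0, c_term=5.0, c_opp=2.0):
    """Outgoing energy for one detection; child_distances are move/division lengths."""
    n = len(child_distances)
    if not accepted:
        return c_opp if n == 0 else math.inf        # opportunity cost / tracked misdetection
    if n == 0:
        return c_term                               # disappearance
    if n == 1:
        return w * child_distances[0] ** 2          # move: cost grows with squared length
    if n == 2:
        return w * sum((d - d_bar) ** 2 for d in child_distances)  # division
    return math.inf                                 # more than two children is forbidden
```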

Domain Specific Knowledge: Minimal Cell Cycle Length. The model
may be extended further by incorporating domain specific knowledge. For instance,
cells must pass through specific cell cycle states (interphase, prophase,
metaphase, anaphase, telophase) and thus there is a biological lower bound on
the duration of a cell cycle. In other words, it is impossible to observe a cell dividing
twice within a small number of subsequent frames, as is the case in Fig. 5a. This
biological constraint can be integrated into our cell tracking model by expanding
the detection variables $X_i^{(t)}$ to record the number of time steps since the last
division of an individual cell, i.e. $X_i^{(t)} \in \{0, 1, \ldots, \tau\}$, where $X_i^{(t)} = 0$ still stands
for misdetection and $\tau$ is a parameter for the minimal duration between division
events of an individual cell. Hence, the energy functions given above change only
slightly in that $X_i^{(t)} = 1$ becomes $X_i^{(t)} \ge 1$, and $X_i^{(t)} = \tau$ for the division energy.
Besides, another two factors need to be introduced to incorporate the counting
between successive detections. The corresponding energy functions are given by

$$E_{\mathrm{cnt}}(X_i^{(t)}, Y_{i\cdot}^{(t)}, X_{i\cdot}^{(t+1)}) = \begin{cases}
\infty, & Y_{ij}^{(t)} = 1 \wedge \sum_k Y_{ik}^{(t)} = 2 \wedge X_j^{(t+1)} \neq 1 \\
\infty, & Y_{ij}^{(t)} = 1 \wedge \sum_k Y_{ik}^{(t)} = 1 \wedge X_i^{(t)} < \tau \wedge X_j^{(t+1)} \neq X_i^{(t)} + 1 \\
\infty, & Y_{ij}^{(t)} = 1 \wedge \sum_k Y_{ik}^{(t)} = 1 \wedge X_i^{(t)} = \tau \wedge X_j^{(t+1)} \neq \tau \\
0, & \text{otherwise}
\end{cases} \qquad (10)$$

$$E'_{\mathrm{cnt}}(X_j^{(t+1)}, Y_{\cdot j}^{(t)}) = \begin{cases}
\infty, & \sum_k Y_{kj}^{(t)} = 0 \wedge X_j^{(t+1)} \neq 0 \wedge X_j^{(t+1)} \neq \tau \\
0, & \text{otherwise}
\end{cases} \qquad (11)$$

where $X_{i\cdot}^{(t+1)}$ are the detection variables connected to $X_i^{(t)}$ through $Y_{i\cdot}^{(t)}$. Technically,
these rules are implemented as hard constraints on indicator variables,
each representing a possible state of the detection variables. It should be noted
that with this modification the detection variables are no longer independent.
For that reason, and due to the increased number of variables, computation time
increases by a factor of 10.
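The constraint can also be checked after the fact on a reconstructed lineage. The sketch below is a post-hoc validation, not the ILP encoding described above; it flags any lineage branch that divides again within tau frames of its previous division.

```python
def violates_min_cycle(division_times_per_branch, tau=3):
    """division_times_per_branch: lists of sorted frame indices of successive divisions
    along individual lineage branches."""
    for times in division_times_per_branch:
        for earlier, later in zip(times, times[1:]):
            if later - earlier < tau:
                return True
    return False

# Example: a branch dividing at frames 5 and 7 violates tau = 3, one at 5 and 9 does not.
assert violates_min_cycle([[5, 7]], tau=3)
assert not violates_min_cycle([[5, 9]], tau=3)
```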

4 Experiments
We applied the chain graph model to track cells during early embryogenesis of
zebrafish (Danio rerio) and fruit fly (Drosophila): the GFP-stained nuclei were
imaged with different light sheet fluorescence microscopes.

4.1 Zebrafish Dataset

Keller et al. [13] recorded time-lapse volumetric images of developing zebrafish
embryos (3d+t). A first cell lineage reconstruction of the blastula period based on
that data was reported by [16]. Its segmentation-based cell detection method has
high recall (false negative rate of 41% for cell detection) but suffers in precision
due to segmentation errors and noise. Cell tracking was done by matching the
time slices with minimal cost by integer linear programming. The test dataset
consisted of 80 frames of the animal pole, growing from 66 cells up to ca. 2400 cells. Its
low segmentation precision warrants the application of the chain graph, which
explicitly models noisy detections.
We implemented the method of Lou et al. [16] and replicated its benchmark for
tracking the cells in the zebrafish dataset. The benchmark measures performance
in terms of precision, recall, and f-measure² of the reconstructed cell movements
² Precision: true pos./(true pos. + false pos.); recall: true pos./(true pos. + false neg.);
f-measure: harmonic mean of precision and recall.

Table 1. Tracking results on the Zebrafish dataset: our chain graph model shows
improved performance in terms of f-measure compared to [16].

              Lou et al. [16]   Chain Graph
  precision        0.807           0.897
  recall           0.861           0.850
  f-measure        0.833           0.873

Fig. 4. Drosophila dataset: synopsis of raw data and corresponding segmentation;
shown is a detail of slice 33 at time step 7.

and divisions between two time slices against a manually obtained tracking of the
first 25 time slices. We use the segmentation result of [16] for a fair comparison
of the two tracking methods.
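A small sketch of the benchmark metrics defined in footnote 2, with reconstructed and manually annotated events treated as sets of comparable items (for instance (frame, parent-id, child-id) tuples); this mirrors the definitions only, not the authors' benchmarking tools.

```python
def prf(predicted_events, reference_events):
    """Precision, recall, and f-measure of predicted events against a manual reference."""
    predicted, reference = set(predicted_events), set(reference_events)
    tp = len(predicted & reference)
    fp = len(predicted - reference)
    fn = len(reference - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```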

4.2 Drosophila Dataset

The other dataset consists of 3d image stacks of a fruit fly syncytial blastoderm
(cf. Fig. 1) over 40 time steps. We segmented the cells with the freely available
ILASTIK software [22]. Subsequently we obtained a manual tracking to
benchmark the automatic methods.
Compared to the zebrafish dataset, the fruit fly dataset has lower contrast,
resulting in more ill-shaped cell segmentations and false positive detections that
are hard to distinguish from the actual cells (see Fig. 4). On the other hand,
cells reside only near the surface of the embryo, lowering the chance of wrong
assignments to neighboring cells compared to the zebrafish embryo.
We examine the difference of our holistic approach over all time steps compared
to a time slice by time slice tracking on that dataset. For the latter, we
take the Random Forest predictions to filter out false alarms beforehand (0.5
probability threshold), add detection variables for all remaining objects to the
graphical model and fix them as active. Under these conditions the single assignment
random fields are independent and the model reduces to a time step-wise
tracking approach. Finally, we implemented the method of Bise et al. [3] and
evaluated it with the same dataset and benchmark.

4.3 Implementation

We implemented the chain graph model in C++ using the OpenGM [1] library
and wrote a MAP inference backend based on CPLEX.³
Parameter values are set according to an optimal f-measure of successfully
reconstructed move, division, appearance, and disappearance events relative to
a manually obtained tracking. We initialize the parameters with a reasonable
setting and fine-tune them with a search on a grid in parameter space.
³ A commercial (integer) linear programming solver. Free for academic use.
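A minimal sketch of such a grid search follows: `evaluate` is a placeholder that runs the tracker with one parameter setting and returns event counts against the manual tracking, and the grids shown are invented values, not the settings used in the paper.

```python
import itertools

def tune(evaluate, grids):
    """evaluate(params) -> (tp, fp, fn) of reconstructed events vs. the manual tracking."""
    best_params, best_f = None, -1.0
    for params in itertools.product(*grids):
        tp, fp, fn = evaluate(params)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        if f > best_f:
            best_f, best_params = f, params
    return best_params, best_f

# Example grids for (C_init, C_term, C_opp, w, d_bar); values are placeholders.
grids = ([1, 5, 10], [1, 5, 10], [0.5, 2], [0.1, 1.0], [5, 10, 20])
```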

Table 2. Tracking results on the Drosophila dataset: a timestep-wise tracking
with previously filtered misdetections (first col.) is inferior to our full chain graph model
optimized over all time steps (second col.). The third column shows the results of the
extended chain graph with the condition that division events of a cell must be at least
3 time steps apart from each other. Finally, the method of [3] (fourth col.) shows
decent recall but suffers in precision.

              Fixed        Unconditioned   Chain Graph    Bise et al. [3]
              Detections   Chain Graph     with τ = 3
  precision     0.889         0.953           0.956           0.550
  recall        0.933         0.957           0.960           0.718
  f-measure     0.911         0.955           0.958           0.623

Runtimes on the presented datasets are of the order of minutes, obviating
the need for approximate inference at the current problem size.
Finally, we implemented the methods of [16] and [3] in C++ and found optimal
parameters via grid search (improving over the ad hoc parametrization approach
of [3]). Source code and parameters are provided as supplementary material.

4.4 Results
We perform experiments on two datasets. First, we compare our tracker against
a previously published method [16] on the zebrafish dataset. The results are
summarized in Table 1. Second, we investigate the performance of the model on
the Drosophila dataset, comparing a chain graph having fixed detection variables
with an unconditioned chain graph and a chain graph with 4-state detection
variables enforcing a minimal duration of 3 time steps between division events
of a particular track. Additionally, we put the chain graph model in perspective
by comparing with a state-of-the-art cell tracking method by Bise et al. [3]. The
tracking results are presented in Table 2.
Finally, Table 3 shows the Drosophila results in terms of correctly and incorrectly
identified cells for both variants of the chain graph, in contrast to human
annotations and the Random Forest classifier alone (which was used to fix the
variables).

5 Discussion

The MAP solution of the chain graph is obtained from all time steps at once,
considering properties of complete tracks. In particular, a track needs a minimal
number of active transitions to overcome the track initialization and termination
costs together with the cost for misdetections. Cells that are only briefly
present (due to low signal to noise or due to entering/leaving the data frame)
will consequently be labeled as misdetections. On the other hand, cells that were
badly segmented and were classified as misdetections by the Random Forest may

Table 3. Cell identification performance on the Drosophila dataset: compared to
an object classifier, the chain graph recovers fewer actual cells (true positives) but finds
significantly more misdetections (true negatives). The unconditioned chain graph and
the variant with a minimal cell cycle length don't differ significantly in identification
performance (but do in tracking performance: see Table 2).

                   Classifier   Unconditioned   Chain Graph   Human
                                Chain Graph     with τ = 3
  true positive      12391         12330          12346       12493
  true negative       1284          1607           1598        1810
  false negative       102           163            147           -
  false positive       526           203            212           -

be set active as long as they reasonably continue a track (see Fig. 6 for an illustration
of this behavior). The chain graph model can therefore be seen as a
regularizer on the pure classifier output that trades some true positives (decreasing
recall) against a significantly reduced number of false positives (increasing
precision).
The above interpretation is supported by the results. Compared to the performance
of method [16] on the zebrafish dataset, our method loses 1.1 percentage
points (pp) in recall but gains 9pp in precision, leading to an overall better performance.
More insight into this precision-recall trade-off can be gained by looking
at the cell vs. misdetection identification performance. Compared to the classifier
alone, the (unconditioned) chain graph retrieves 61 fewer cells (fewer true positives),
but labels 323 more misdetections correctly (more true negatives). (The chain
graph controlling for minimal cell cycle length exposes the same behavior with
nonsignificant differences in performance.)
Since correct cell identifications are a precondition for a successful recovery of
move, division, appearance, and disappearance events, the higher object identification
rate should directly transfer to the tracking performance. The results on
the zebrafish support this assumption, but could also be caused by the
fact that our method employs a Random Forest classifier whereas the method of
[16] just treats every segmented object as a cell. To exclude the influence of the
Random Forest, we examined another variant of the chain graph model, where we
fix the detection variables as a preprocessing step using the very same Random
Forest predictions. Compared to the chain graph with previously fixed detection
variables, the full (unconditioned) model gains 2.4pp and 4pp in terms of recall
and precision, respectively. This is clear evidence that the performance gain over
the previously published step-by-step method is not only caused by the Random
Forest classifier, but also by the holistic chain graph model that is optimized
over many time steps at once.
Further improvements can be obtained by requiring a minimal time between
two divisions. Table 2 shows an increase in f-measure by 0.3pp for a minimal
temporal distance of three (τ = 3) compared to the unconditioned model. At
first glance the gain seems nonsignificant, but the f-measure puts the same
[Figure: (a) an impossible division; (b) performance on Drosophila as a function of τ.]
Fig. 5. Extending the cell tracking model by the requirement of a minimal
duration between division events: (a) shows a biologically implausible tracking
(left) and the result of our model when requiring a minimal duration of τ = 3 (right).
Performance results on the Drosophila dataset with varying τ are depicted in (b):
shown is the performance for all events (left panel) and for division events only (right
panel).

importance on every type of tracking event. Since the number of move events
exceeds the number of divisions by a factor of 20, the measure is dominated by the
former. Yet, the f-measure for divisions (exclusively) improves from 0.883 (precision
0.941, recall 0.832) to 0.917 (precision 0.923, recall 0.911). This is a significant
gain and very important for an accurate assessment of cell ancestry, since a single
mistracked division can spoil a whole subsequent lineage. In practice, setting
τ to a low value will already give a boost in performance with an acceptable
overhead. The effect on the overall and the division performance when varying
the parameter τ is depicted in Fig. 5b. The method of Bise et al. [3] shows convincing
results in regions with high data quality (cf. Fig. 2). However, its f-measure
is 33pp worse compared to the chain graph. This is most likely caused by the
high misdetection rate of 13% and can be seen from the low precision of 0.550
and Fig. 2.
Still, the performance of our approach could be further improved. A manual inspection
of incorrectly tracked divisions reveals that the tracker can be confused
by divisions happening in close local proximity. A better model for divisions
that also incorporates the change in shape and the geometric motion pattern
(descendants move away from the ancestor in roughly opposite directions) could
prevent such mistakes. Future datasets may contain cells that move in a more coordinated
fashion. The presented move energies are optimal in the case of Brownian
motion, but will most likely underperform for directed motions. In that case the
model can be extended with discrete momentum variables that are associated
with the detected objects, analogous to our extension that controls the minimal
cell cycle length. Finally, the model can easily be extended to consider merging
objects (due to undersegmentation or 2d applications) by allowing more than
one active incoming variable.

Fig. 6. Sample tracking sequence obtained with the chain graph model.
The eight consecutive time steps show a detail of the Drosophila dataset projected to
2d. The colored spots are segmented objects, where objects with the same color are
assigned to each other. Dark gray objects are misdetections, as indicated by inactive
detection variables. Ill-shaped objects will be treated as active if they bridge
two tracks in a sensible manner. On the contrary, cell-shaped objects may be labeled
inactive if they do not constitute a track of a certain length.

6 Conclusion
We presented a chain graph model for tracking by assignment over many time
slices simultaneously. This model is highly robust against misdetections and
can take temporal domain knowledge into account. It has been applied in the context
of 3d+t cell tracking and achieved better performance than two recently published
methods ([16], [3]). Furthermore, it showed that globally optimal tracking
by assignment over many time slices is superior to the popular time step-wise
approach. The performance could be further improved by introducing better
motion models for cell movements and divisions.
Acknowledgments. This work has been supported by the German Research
Foundation (DFG) within the programme "Spatio-/Temporal Graphical Models
and Applications in Image Analysis", grant GRK 1653.
References
1. Andres, B., Kappes, J.H., Koethe, U., Schnörr, C., Hamprecht, F.A.: An Empirical Comparison of Inference Algorithms for Graphical Models with Higher Order Factors Using OpenGM. In: Goesele, M., Roth, S., Kuijper, A., Schiele, B., Schindler, K. (eds.) DAGM 2010. LNCS, vol. 6376, pp. 353–362. Springer, Heidelberg (2010)
2. Arulampalam, M., Maskell, S., Gordon, N., Clapp, T.: A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing 50(2), 174–188 (2002)
3. Bise, R., Yin, Z., Kanade, T.: Reliable cell tracking by global data association. In: IEEE International Symposium on Biomedical Imaging, ISBI 2011 (2011)
4. Blake, A., Kohli, P., Rother, C. (eds.): Markov Random Fields for Vision and Image Processing. MIT Press (2011)
5. Breiman, L.: Random forests. Machine Learning 45, 5–32 (2001)
6. Brendel, W., Amer, M., Todorovic, S.: Multiobject tracking as maximum weight independent set. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2011), pp. 1273–1280 (2011)
7. Doucet, A., Johansen, A.: A tutorial on particle filtering and smoothing: Fifteen years later. In: Handbook of Nonlinear Filtering. Oxford University Press (2011)
8. Fox, E., Choi, D., Willsky, A.: Nonparametric Bayesian methods for large scale multi-target tracking. In: Fortieth Asilomar Conference on Signals, Systems and Computers, ACSSC 2006, pp. 2009–2013 (2006)
9. Frydenberg, M.: The chain graph Markov property. Scandinavian Journal of Statistics 17, 333–353 (1990)
10. Jiang, H., Fels, S., Little, J.J.: A linear programming approach for multiple object tracking. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2007 (2007)
11. Kachouie, N.N., Fieguth, P.W.: Extended-Hungarian-JPDA: exact single-frame stem cell tracking. IEEE Transactions on Bio-Medical Engineering 54 (2007)
12. Kanade, T., Yin, Z., Bise, R., Huh, S., Eom, S., Sandbothe, M.F., Chen, M.: Cell image analysis: Algorithms, system and applications. In: IEEE Workshop on Applications of Computer Vision (WACV) (2011)
13. Keller, P.J., Schmidt, A.D., Wittbrodt, J., Stelzer, E.H.: Reconstruction of zebrafish early embryonic development by scanned light sheet microscopy. Science 322, 1065–1069 (2008)
14. Kschischang, F., Frey, B., Loeliger, H.A.: Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory 47, 498–519 (2001)
15. Li, Y., Huang, C., Nevatia, R.: Learning to associate: HybridBoosted multi-target tracker for crowded scene. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2009 (2009)
16. Lou, X., Kaster, F.O., Lindner, M.S., Kausler, B.X., Koethe, U., Jaenicke, H., Hoeckendorf, B., Wittbrodt, J., Hamprecht, F.A.: DELTR: Digital embryo lineage tree reconstructor. In: IEEE International Symposium on Biomedical Imaging, ISBI 2011 (2011)
17. Meijering, E., Dzyubachyk, O., Smal, I., van Cappellen, W.A.: Tracking in cell and developmental biology. Seminars in Cell & Developmental Biology 20 (2009)
18. Melani, C., Peyrieras, N., Mikula, K., Zanella, C., Campana, M., Rizzi, B., Veronesi, F., Sarti, A., Lombardot, B., Bourgine, P.: Cells tracking in a live zebrafish embryo. In: IEEE Engineering in Medicine and Biology Society, EMBS 2007 (2007)
19. Padfield, D., Rittscher, J., Roysam, B.: Coupled minimum-cost flow cell tracking for high-throughput quantitative analysis. Medical Image Analysis 15(4) (2011)
20. Padfield, D., Rittscher, J., Thomas, N., Roysam, B.: Spatio-temporal cell cycle phase analysis using level sets and fast marching methods. Medical Image Analysis 13(1), 143–155 (2009)
21. Smal, I., Meijering, E., Draegestein, K., Galjart, N., Grigoriev, I., Akhmanova, A., Van Royen, M., Houtsmuller, A., Niessen, W.: Multiple object tracking in molecular bioimaging by Rao-Blackwellized marginal particle filtering. Medical Image Analysis 12(6), 764–777 (2008)
22. Sommer, C., Straehle, C., Koethe, U., Hamprecht, F.A.: ILASTIK: Interactive learning and segmentation toolkit. In: IEEE International Symposium on Biomedical Imaging, ISBI 2011 (2011)
23. Yilmaz, A., Javed, O., Shah, M.: Object tracking: A survey. ACM Computing Surveys (CSUR) 38 (2006)
Robust Tracking with Weighted Online
Structured Learning

Rui Yao¹,², Qinfeng Shi², Chunhua Shen², Yanning Zhang¹, and Anton van den Hengel²
¹ School of Computer Science, Northwestern Polytechnical University, China
² School of Computer Science, The University of Adelaide, Australia

Abstract. Robust visual tracking requires constant updates of the target
appearance model, but without losing track of previous appearance information.
One of the difficulties with the online learning approach to this
problem has been a lack of flexibility in the modelling of the inevitable
variations in target and scene appearance over time. The traditional
online learning approach to the problem treats each example equally,
which leads to previous appearances being forgotten too quickly and a
lack of emphasis on the most current observations. Through analysis of
the visual tracking problem, we instead develop a novel weighted form
of online risk which allows more subtlety in its representation. However,
the traditional online learning framework does not accommodate this
weighted form. We thus also propose a principled approach to weighted
online learning using weighted reservoir sampling and provide a weighted
regret bound as a theoretical guarantee of performance. The proposed
novel online learning framework can handle examples with different importance
weights for binary, multiclass, and even structured output labels,
in both linear and non-linear kernels. Applying the method to tracking
results in an algorithm which is both efficient and accurate even in the
presence of severe appearance changes. Experimental results show that
the proposed tracker outperforms the current state-of-the-art.

1 Introduction

Visual tracking is an important topic in computer vision and has a wide range
of practical applications, including video surveillance, perceptual user interfaces,
and video indexing. Tracking generic objects in a dynamic environment poses additional
challenges, including appearance changes due to illumination, rotation,
and scaling, and problems of partial vs. full visibility. In all cases the fundamental
problem is the variable nature of appearance, and the difficulties associated
with using such a changeable feature to distinguish the target from backgrounds.
A wide variety of approaches have been developed for modelling appearance
variations of targets within robust tracking. Many tracking algorithms attempt
to build robust object representations in a particular feature space and search for
the image region which best corresponds to that representation. Such approaches
include the superpixel tracker [1], the integral histogram [2], subspace learning


[3], visual tracking decomposition [4], interest point tracking [5], and sparse
representation tracking [6,7], amongst others.
An alternative approach has been to focus on learning, and particularly on identifying
robust classifiers capable of distinguishing the target from backgrounds,
which has been termed tracking-by-detection. This approach encompasses boosting
[8,9], ensemble learning [10], multiple instance learning [11], discriminative
metric learning [12], graph mode-based tracking [13], structured output tracking
[14], and many more. One of the features of these approaches has been the ability
to update the appearance model online and thus to adapt to appearance changes.
Despite their successes, existing online tracking-by-detection methods have
two main issues. First, these trackers typically rely solely on an online classifier
for adapting the object appearance model and thus completely discard past
training data. This makes it almost impossible to recover once the appearance
model begins to drift. Second, as tracking is clearly a time-varying process, more
recent frames should have a greater influence on the current tracking process
than older ones. Existing approaches to online learning of appearance models do
not take this into account, but rather either include images in the model fully
or discard them totally.

Contribution. Our contributions are: 1) We propose the first formulation of
visual tracking as a weighted online learning problem. 2) We propose the first
framework for weighted online learning. This is achieved using weighted reservoir
sampling. We also provide a weighted online regret bound for the new method. 3)
Using our weighted online learning framework, we propose a robust tracker with a
time-weighted appearance model. The proposed tracker is capable of exploiting
both structured outputs and kernels, and performs efficiently and accurately
even in the presence of severe appearance drift. The weights associated with
online samples equip the tracker with the ability to adapt quickly while retaining
previous appearance information. The fact that the latest observation is given
a higher weight means that it has a higher probability of being maintained in
the reservoir. This makes our method adapt quickly to newly observed data
and also offers flexibility in the modelling of the frequent variations in target
and scene appearance over time. On the other hand, unlike traditional online
learning, which sees only the recent data, our weighted online learning method
trains on both recent and historical data through the use of weighted
reservoir sampling. The combination of new and old information helps the tracker
recover from drift, since the patterns that the tracker has seen in the past are not
completely discarded. Overall, weighted online learning can update the object
appearance model over time without losing track of previous appearances, thus
improving performance.

1.1 Related Work

We now briefly review recent related tracking-by-detection methods. In [15], an off-line support vector machine (SVM) is used to distinguish the target vehicle
from the background, but it requires that all appearance variations be repre-
sented within the training data. Since the video frames are often obtained on
the fly, features are obtained in an online fashion [9]. For the same reason, online learning is brought into tracking. For example, Grabner et al. [9] proposed to
update selected features incrementally using online AdaBoost. Babenko et al.
[11] used online multiple instance boosting to overcome the problem of ambigu-
ous labeling of training examples. Grabner et al. [8] combined the decision of
a given prior and an online classier and formulated them in a semi-supervised
fashion.
Structured learning has shown superior performance on a number of computer
vision problems, including tracking. For example, in contrast to binary classifica-
tion, Blaschko et al. [16] considered object localisation as a problem of predicting
structured outputs: they modeled the problem as the prediction of the bounding
box of objects located in images. The degree of bounding box overlap to the
ground truth is maximised using a structured SVM [17]. Hare et al. [14] further
applied structured learning to visual object tracking. The approach, which is
termed Struck, directly predicts the change in object location between frames
using structured learning. Since Struck is closely related to our work, we now
provide more details here.
Struck: Structured output tracking with kernels. In [14] Hare et al. used a structured SVM [17] to learn a function $f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, such that, given the $t$-th frame $x_t \in \mathcal{X}$ and the bounding box of the object $B_{t-1}$ in the $(t-1)$-th frame, the bounding box $B_t$ in the $t$-th frame can be obtained via
$$B_t = B_{t-1} + y_t, \qquad y_t = \operatorname*{argmax}_{y \in \mathcal{Y}} f(x_t, y). \tag{1}$$
Here $B = (c, l, w, h)$, where $c, l$ are the column and row coordinates of the upper-left corner, and $w, h$ are the width and height. The offset is $y = (\Delta c, \Delta l, \Delta w, \Delta h)$. Hare et al. set $\Delta w$ and $\Delta h$ to always be 0, and $f(x, y) = \langle \mathbf{w}, \Phi(x, y) \rangle$ for the feature map $\Phi(x, y)$.
The discriminative function f is learned via the structured SVM [17] by minimising
$$\min_{\mathbf{w}} \; \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{t=1}^{T} \xi_t \tag{2}$$
$$\text{s.t.}\;\; \forall t: \xi_t \ge 0; \quad \forall t, \forall y \neq y_t: \langle \mathbf{w}, \Phi(x_t, y_t)\rangle - \langle \mathbf{w}, \Phi(x_t, y)\rangle \ge \Delta(y_t, y) - \xi_t,$$
where the label cost is
$$\Delta(y_t, y) = 1 - \frac{\mathrm{Area}\big((B_{t-1} + y_t) \cap (B_{t-1} + y)\big)}{\mathrm{Area}\big((B_{t-1} + y_t) \cup (B_{t-1} + y)\big)}, \tag{3}$$

which was introduced in [16] to measure the VOC bounding box overlap ratio.
In practice, the y's are uniformly sampled in the vicinity of $y_t$ in [14]. To enable kernels, Hare et al. updated the dual variables online following the work of [16,18,19] and achieved excellent results. Note that prior to the work of [14,16], supervised learning in detection and tracking often required large volumes of manually labeled training examples, which are not easy to obtain in many cases. Also, when tracking is considered as binary classification, the boundary between a target image patch and a non-target image patch is not well defined. By using structured learning to optimise a label cost like the VOC bounding box overlap ratio, the manual labeling requirement is significantly reduced and the ambiguity between target and non-target image patches is removed [14].
However, tracking is a time-varying process. It is not appropriate to weight each frame uniformly, as this can cause tracking failure when visual drift occurs. It is this problem which motivates our approach.

2 Weighted Online Learning


In this section, we propose a novel framework for online learning with weights on
examples. We first revisit the standard approach to online learning as applied to
visual tracking, and then propose a weighted online risk that is better suited to
the requirements of the problem. The novel formulation of risk we propose cannot
be directly optimised by traditional online learning, however. We thus propose
an approach based on reservoir sampling for both primal and dual variables. A
weighted regret bound is provided as a performance guarantee.
In online learning, at each iteration, the online algorithm has visibility of one observation $z$, or at most a small selection of recent data (a.k.a. a mini-batch). Here $z = (x, y)$ for a binary or multiclass problem, and $z = (x, \mathbf{y})$ for a multilabel or a structured problem. In our visual tracking context (as in [14]), $z = (x, y, \bar{y})$, where $x$ is an image frame, $y$ is the ground-truth offset of the target bounding box in the frame, and $\bar{y}$ is a generated offset with $\bar{y} \neq y$. The goal of online learning is to learn a predictor that predicts as well as batch learning (which has visibility over all data).
Unweighted Risk. The regularised unweighted (empirical) risk for batch learning is
$$C \sum_{i=1}^{m} \ell(\mathbf{w}, z_i) + \Omega(\mathbf{w}),$$
where $\Omega(\mathbf{w})$ is the regulariser (e.g. $\Omega(\mathbf{w}) = \|\mathbf{w}\|^2/2$) that penalises overly complicated models, and $C$ is a scalar. Here $\ell(\cdot,\cdot)$ is a loss function such as the hinge loss, and $\mathbf{w}$ is the primal parameter.
Online learning algorithms, by definition, use one datum or a very small number of recent data to learn at each iteration. Thus at iteration $i$, the online algorithm minimises the regularised risk
$$J(i, \mathbf{w}) = C\,\ell(\mathbf{w}, z_i) + \Omega(\mathbf{w}).$$

The regret $Q$ of an online algorithm measures the prediction difference between the online predictor and the batch learning predictor using the same loss,
$$Q(m, \mathbf{w}) = \frac{1}{m}\sum_{i=1}^{m} \ell(\mathbf{w}_i, z_i) - \frac{1}{m}\sum_{i=1}^{m} \ell(\mathbf{w}, z_i).$$
The regret reflects the amount of additional loss (compared with the batch algorithm) suffered by the online algorithm for not being able to see the entire data set in advance. It has been shown in [20,19] that minimising the risk $J(i, \mathbf{w})$ at each iteration reduces the regret $Q(m, \mathbf{w})$, and that the regret goes to zero as $m$ goes to infinity for a bounded $\Omega(\mathbf{w})$.

2.1 Weighted Risk


In visual tracking, scene variation over time and target appearance changes mean that different frames provide varying levels of information about the current appearance of the target and background. In vehicle tracking, for example, recent frames are often more important than earlier ones in the sequence due to location (and thus background) and target orientation changes. A weighted formulation of risk is capable of reflecting this variation between the levels of significance of particular frames. Given nonnegative weights $(\lambda_1, \lambda_2, \dots, \lambda_m)$, we define the weighted (empirical) risk as
$$C\sum_{i=1}^{m}\lambda_i\,\ell(\mathbf{w}, z_i) + \Omega(\mathbf{w}). \tag{4}$$

In a batch learning setting the minimisation of the weighted risk is straightforward. In online learning, however, the method processes only the latest observation, making it impossible to include weights on previous observations. Extending online learning naively to handle weights causes several problems. For example, if we drop the regularisation term $\Omega(\mathbf{w})$, the weighting with respect to other data points has no effect: $\operatorname{argmin}_{\mathbf{w}} C\lambda_i\,\ell(\mathbf{w}, z_i) = \operatorname{argmin}_{\mathbf{w}} \ell(\mathbf{w}, z_i)$, so $\lambda_i$ plays no role. Even with regularisation, the weights imposed still do not faithfully reflect the desired weighting of the data due to the shrinkage of $\Omega(\mathbf{w})$ over time. These are the problems with online learning that we aim to resolve.
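For concreteness, the following is a minimal NumPy sketch of the weighted empirical risk (4), here with a plain hinge loss and an L2 regulariser standing in for Omega; the loss, weights and toy data are illustrative placeholders rather than the tracking loss defined later.

```python
import numpy as np

def weighted_risk(w, data, lam, C, loss):
    """Weighted regularised risk C * sum_i lam_i * loss(w, z_i) + ||w||^2 / 2, cf. Eq. (4)."""
    reg = 0.5 * np.dot(w, w)
    return C * sum(l * loss(w, z) for l, z in zip(lam, data)) + reg

def hinge(w, z):
    """Binary hinge loss for an observation z = (x, y) with y in {-1, +1}."""
    x, y = z
    return max(0.0, 1.0 - y * np.dot(w, x))

w = np.zeros(3)
data = [(np.array([1.0, 0.0, 2.0]), 1), (np.array([0.5, -1.0, 0.0]), -1)]
lam = [0.5, 1.0]   # e.g. more recent observations receive larger weights
print(weighted_risk(w, data, lam, C=1.0, loss=hinge))
```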

2.2 Reservoir and Optimisation


Though (4) is what we would like to minimise, it requires that we are able to
recall all of the observed data at every stage, which is not feasible in an online
setting. Usually one may need to operate over large data sets and long time
periods. Reservoir sampling is an online method for random sampling from a
population without replacement. Using weighted reservoir sampling allows us
to develop an online method for selecting a (constant-sized) set of observations
from a data stream according to their weights.
Proposition 1 (Expected Reservoir Risk). Minimising (4) is equivalent to minimising
$$\frac{C Z_m}{|R_m|}\, \mathbb{E}_{R_m}\Big[\sum_{z \in R_m} \ell(\mathbf{w}, z)\Big] + \Omega(\mathbf{w}), \tag{5}$$
where $R_m$ is a weighted reservoir with weights $(\lambda_1, \lambda_2, \dots, \lambda_m)$ as in [21] and $Z_m = \sum_{i=1}^{m} \lambda_i$.
This is due to a property of weighted reservoir sampling (see the proof in sup-
plementary material). The reservoir is updated by replacing an element in the
reservoir by the current datum with a certain probability. See [21] for details. The method is applicable to data streams of unknown, or even infinite, length, which
means that rather than store all observations we might later require, we can
instead maintain a reservoir and minimise the expected risk over the samples
therein.
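As a concrete illustration, here is a minimal sketch of the weighted reservoir sampling scheme of Efraimidis and Spirakis [21], in which each item receives the random key $u^{1/\lambda}$ and the items with the largest keys are kept; the variable names and the heap-based bookkeeping are ours, not the authors' implementation.

```python
import heapq
import random

def weighted_reservoir(stream, capacity, seed=0):
    """Keep a weighted random sample of `capacity` items from (item, weight) pairs.

    Each item gets the key u**(1/weight) with u ~ Uniform(0, 1); the items with
    the largest keys form the reservoir (Efraimidis & Spirakis).
    """
    rng = random.Random(seed)
    heap = []  # min-heap of (key, item); the smallest key is evicted first
    for item, weight in stream:
        key = rng.random() ** (1.0 / weight)
        if len(heap) < capacity:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]

# Example: later items carry exponentially larger weights (q**t with q > 1),
# so they are more likely to be retained, mirroring the time weighting used here.
q = 1.8
stream = ((t, q ** t) for t in range(1, 50))
print(weighted_reservoir(stream, capacity=10))
```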
Reservoir Risk. Due to Proposition 1, at each iteration i, we minimise

$$C\sum_{z\in R_i}\ell(\mathbf{w}, z) + \Omega(\mathbf{w}). \tag{6}$$
One of the properties of weighted reservoir sampling is that data (i.e. $\{z_1, \dots, z_i\}$) with high weights have a higher probability of being maintained in the reservoir. The advantage of using (6) is that the weighted risk in (4) is now replaced with an unweighted (i.e. uniformly weighted) risk, to which traditional online algorithms are applicable. The (sub)gradient of the primal form may be used to update the parameter $\mathbf{w}_i$. Learning in the dual form can be achieved by performing OLaRank [19] on the elements in the reservoir. Details are provided in Section 3.1.

2.3 Regret Bound


Shalev-Shwartz and Singer [20] introduced a regret for online learning for binary classification. Bordes et al. [19] extended the regret bound to structured online learning. Based on the work in [20,19], we derive the following bound on the weighted regret as a theoretical guarantee of performance. Without loss of generality, from now on we assume convexity of $\Omega(\mathbf{w})$ and a fixed reservoir size at every iteration $i$; that is, denote $N = |R_i| < m$ for all $i$.
Theorem 1 (Weighted Regret Bound). At each iteration $i$ we perform OLaRank [19] on each training example $b_{ij} \in R_i$, where $R_i$ is the reservoir at iteration $i$, $1 \le j \le N$. Assume that for each update, the dual objective value increase after seeing the example $b_{ij}$ is at least $C\mu\big(\ell(\mathbf{w}_{ij}, b_{ij})\big)$, with
$$\mu(x) = \frac{1}{C}\min(x, C)\Big(x - \frac{1}{2}\min(x, C)\Big),$$
then we have, for any $\mathbf{w}$,
$$\frac{1}{m}\sum_{i=1}^{m} \mathbb{E}_{R_i}\Big[\frac{1}{N}\sum_{j=1}^{N} \ell(\mathbf{w}_{ij}, b_{ij})\Big] \le \frac{1}{m}\sum_{i=1}^{m} \frac{1}{Z_i}\sum_{j=1}^{i} \lambda_j\, \ell(\mathbf{w}, z_j) + \frac{\Omega(\mathbf{w})}{CmN} + \frac{C}{2},$$
Fig. 1. An illustration of greedy inference. LEFT: a contour plot of f in (8) for varying y_t. The centre blue solid square indicates the position of the object in the previous frame. The centre point and some random points (purple) are used as the initial positions. The y_t with the highest f is obtained (green solid point) via Algorithm 2. RIGHT: the contour plot and true bounding box shown over the relevant frame of the video.

where $Z_i = \sum_{j=1}^{i} \lambda_j$. The proof is provided in the supplementary material.

3 Tracking with Weights


In tracking, $z = (x, y, \bar{y}) \in R_t$ as mentioned earlier. Thus we have the loss
$$\ell\big(\mathbf{w}, (x, y, \bar{y})\big) = \big[\Delta(y, \bar{y}) - \langle \Phi(x, y), \mathbf{w}\rangle + \langle \Phi(x, \bar{y}), \mathbf{w}\rangle\big]_+,$$
where $[x]_+ = \max(0, x)$. The risk at the $t$-th frame becomes
$$C \sum_{(x, y, \bar{y}) \in R_t} \ell\big(\mathbf{w}, (x, y, \bar{y})\big) + \Omega(\mathbf{w}).$$

3.1 Kernelisation
Updating the dual variables enables us to use kernels, which often produce superior results in many computer vision tasks. Using a similar derivation to that introduced by [18], we rewrite the optimisation problem of (2) by introducing coefficients $\beta_i^{y}$ for each $(x_i, y_i, \bar{y}_i) \in R_t$, where $\beta_i^{y} = -\alpha_i^{y}$ if $y \neq y_i$ and $\beta_i^{y_i} = \sum_{\bar{y} \neq y_i} \alpha_i^{\bar{y}}$ otherwise, and denote $\boldsymbol{\beta}$ as the vector of the dual variables,
$$\max_{\boldsymbol{\beta}} \; -\sum_{i,y} \Delta(y, y_i)\,\beta_i^{y} - \frac{1}{2}\sum_{i,y,j,\bar{y}} \beta_i^{y}\beta_j^{\bar{y}}\, k\big((x_i, y), (x_j, \bar{y})\big) \tag{7}$$
$$\text{s.t.}\;\; \forall i, \forall y: \beta_i^{y} \le \delta(y, y_i)\,C; \quad \forall i: \sum_{y} \beta_i^{y} = 0,$$
where $\delta(y, y_i)$ is 1 when $y = y_i$ and 0 otherwise, and $k(\cdot, \cdot)$ is the kernel function. The discriminant scoring function becomes
$$f(x, y) = \sum_{i,\bar{y}} \beta_i^{\bar{y}}\, k\big((x_i, \bar{y}), (x, y)\big). \tag{8}$$
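To make the scoring function (8) concrete, here is a minimal NumPy sketch of evaluating a kernelised score over stored support patterns with a Gaussian kernel; the feature arrays, coefficients and bandwidth are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def gaussian_kernel(phi_a, phi_b, sigma=0.2):
    """RBF kernel between two feature vectors Phi(x, y)."""
    d = phi_a - phi_b
    return np.exp(-np.dot(d, d) / (2.0 * sigma ** 2))

def score(phi_candidate, support_feats, support_betas, sigma=0.2):
    """Discriminant score f(x, y) = sum_i beta_i * k(Phi_i, Phi(x, y)), cf. Eq. (8).

    support_feats : (n, d) feature maps of the stored support patterns.
    support_betas : (n,) corresponding dual coefficients beta.
    """
    return sum(b * gaussian_kernel(f, phi_candidate, sigma)
               for f, b in zip(support_feats, support_betas))

# Toy usage: three support patterns in a 4-D feature space.
rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 4))
betas = np.array([0.7, -0.4, -0.3])   # toy coefficients summing to zero
print(score(rng.normal(size=4), feats, betas))
```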

Algorithm 1. Online Dual Update for Tracking with Weights
Require: current frame x_t, offset y_t, previous dual variables β_{t-1}, reservoir R_{t-1}, the preset reservoir size N, number of offsets n_t, and the preset time factor q > 1 (we use q = 1.8 in our experiments).
1: Generate n_t offsets {ȳ_tj}_{j=1}^{n_t} in the vicinity of y_t and set R_t = R_{t-1}.
2: for each ȳ_tj do
3:   Calculate a key v_tj = u_tj^{1/λ_t}, where λ_t = q^t and u_tj ∼ Uniform(0, 1).
4:   if |R_t| < N then
5:     R_t = R_t ∪ {(x_t, y_t, ȳ_tj)}. [In implementation, only pointers are stored.]
6:   else
7:     Find the element (x, y, ȳ) in R_t with the smallest key, denoted v*.
8:     If v_tj > v*, replace (x, y, ȳ) in R_t with (x_t, y_t, ȳ_tj).
9:   end if
10: end for
11: for each (x, y, ȳ) ∈ R_t do
12:   Perform OLaRank to update β_{t-1}.
13: end for
14: β_t = β_{t-1}
15: return R_t and β_t

As in [14], at the t-th frame, n_t offsets ȳ_tj are generated. We update the reservoir and perform OLaRank on the elements of the reservoir. Pseudocode for the online update of the dual variables and the reservoir is provided in Algorithm 1. By Theorem 1 the following holds (for the proof see the supplementary material).

Corollary 1 (Tracking Bound). Assume that for each update in Algorithm 1, the dual increase after seeing the example $b_{ij} \in R_i$ is at least $C\mu\big(\ell(\mathbf{w}_{ij}, b_{ij})\big)$; then we have, for any $\mathbf{w}$,
$$\frac{1}{tN}\sum_{i=1}^{t}\sum_{j=1}^{N} \mathbb{E}_{R_i}\big[\ell(\mathbf{w}_{ij}, b_{ij})\big] \le \frac{1}{t}\sum_{i=1}^{t} \frac{1}{Z_i}\sum_{j=1}^{i} \lambda_j \sum_{l=1}^{n_j} \ell\big(\mathbf{w}, (x_j, y_j, \bar{y}_{jl})\big) + \sqrt{\frac{2\,\Omega(\mathbf{w})}{tN}},$$
where $Z_i = \sum_{j=1}^{i} \lambda_j n_j$.

3.2 Greedy Inference


Given f, we can predict y_t using (1) and (8). The exhaustive search used in [14] is computationally expensive, because kernel evaluations on all support vectors and all possible y ∈ Y are required. Note, however, that the contour of (8) is very informative and can guide the movement of the offset candidate y_t. We have therefore developed a significantly faster greedy search approach that uses the contour information (Algorithm 2) to predict the offset y_t. The current bounding box is predicted via B_t = B_{t-1} + y_t. An example of the inference is given in Fig. 1. The experiments in Section 4.2 show that our greedy inference accelerates prediction without compromising accuracy.

Algorithm 2. Greedy inference for prediction
Require: frame x_t, B_{t-1}, search radius s, and the number of initial points n.
1: Generate a set Ȳ = {ȳ_tj = (Δc_tj, Δl_tj, 0, 0)}_{j=1}^{n}, whose elements are randomly generated within the radius s, that is (Δc_tj)² + (Δl_tj)² < s².
2: for each offset ȳ_tj do
3:   ȳ_tj = ȳ_tj + δy, where δy = argmax_{δy} f(x_t, ȳ_tj + δy) as in (8). Here δy ∈ {(δ_c, δ_l, 0, 0) | δ_c, δ_l ∈ {0, 1, -1}} and the same location won't be re-checked.
4: end for
5: return y_t = argmax_{y ∈ Ȳ} f(x_t, y). [The search over Ȳ can be avoided by updating the maximum value and index.]
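A minimal sketch of this greedy hill-climbing over integer offsets; the callable `score` stands in for the kernelised f(x_t, ·) of (8), and the seeding and stopping details are simplifying assumptions of the illustration.

```python
import random

def greedy_offset(score, s, n_init, seed=0):
    """Greedy search over integer offsets, in the spirit of Algorithm 2.

    score : callable mapping an offset (dc, dl) to the SVM score f(x_t, y);
    s     : search radius; n_init : number of random starting offsets.
    """
    rng = random.Random(seed)
    seeds = [(0, 0)]
    while len(seeds) < n_init:
        c, l = rng.randint(-s, s), rng.randint(-s, s)
        if c * c + l * l < s * s:
            seeds.append((c, l))

    visited = set()
    best, best_val = None, float("-inf")
    for y in seeds:
        while y not in visited:          # climb until a checked location is reached
            visited.add(y)
            neighbours = [(y[0] + dc, y[1] + dl)
                          for dc in (-1, 0, 1) for dl in (-1, 0, 1)]
            y = max(neighbours, key=score)   # move to the best 8-neighbour (or stay)
        val = score(y)
        if val > best_val:
            best, best_val = y, val
    return best

# Toy score with a single smooth peak at offset (7, -3).
print(greedy_offset(lambda y: -((y[0] - 7) ** 2 + (y[1] + 3) ** 2), s=30, n_init=16))
```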

4 Experiments

In this section we first demonstrate the performance impact of different reservoir sizes and different numbers of initial offsets on greedy inference. We then present qualitative and quantitative tracking results for the proposed method and its competitors on 12 challenging sequences.
Experiment Setup. In order to enable a fair comparison, our tracker uses the same features as Struck [14], MIL [11] and OAB [9]. The search radius is set to 30 pixels for all sequences. As will be discussed in Section 4.1, the maximum reservoir size is set to 200, the time factor q is set to 1.8, and the weights to q^t. The number of initial offsets for inference is set to 48 for all sequences. Furthermore, the proposed tracking algorithm is implemented in C++ on a workstation with an Intel Core 2 Duo 2.66 GHz processor and 4 GB of RAM. Without specific code optimisation, the average running time of our tracker is about 0.08 seconds per frame in the above setting.
Test Sequences and Compared Algorithms. Tracking evaluation uses 12 challenging sequences with 6,435 frames in total. The coke, tiger1, tiger2, and sylv sequences have been used in [14]. The basketball, shaking and singer1 sequences were obtained from [4]. The other sequences were obtained from [1]. These video sequences contain a variety of scenes and object motion events. They present challenging lighting, large variation in pose and scale, frequent half or full occlusions, fast motion, shape deformation and distortion, and similar interference. Our tracker is compared with 8 of the latest state-of-the-art tracking methods, namely Struck (structured output tracker [14]), MIL (multiple instance boosting-based tracker [11]), OAB (online AdaBoost [9]), Frag (fragment-based tracker [2]), IVT (incremental subspace visual tracker [3]), VTD (visual tracking decomposition [4]) and L1T (L1 minimisation tracker [6]). These trackers were evaluated on the basis of the publicly available source code from the original authors. For OAB, there are two different versions, with r = 1 (OAB1) and r = 5 (OAB5) [11].
Evaluation Criteria. For quantitative performance comparison we use three popular evaluation criteria, i.e. centre location error (CLE), Pascal VOC overlap ratio (VOR), and VOC tracking success rate. The VOC overlap ratio is defined as
Fig. 2. Performance of our tracker with various reservoir sizes on four sequences (coke, tiger1, iceball, bird). LEFT and MIDDLE: the average VOR and CLE. RIGHT: the run time. The square, diamond, upward-pointing and downward-pointing triangles in the three plots correspond to the performance of Struck on the four sequences, respectively.
Fig. 3. Comparison of structured SVM scores and tracking results under various scenes. TOP ROW: the tracking results of our tracker with reservoir size N = 200 and of Struck, where the solid rectangle shows the result of our tracker and the dashed rectangle marks the result of Struck. BOTTOM ROW: the structured SVM score over time for our tracker and Struck. Ours always has a higher score.

$R_{\mathrm{overlap}} = \mathrm{Area}(B_T \cap B_{GT}) / \mathrm{Area}(B_T \cup B_{GT})$, where $B_T$ is the tracking bounding box and $B_{GT}$ the ground-truth bounding box. If the VOC overlap ratio is larger than 0.5, tracking in that frame is considered successful.
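For reference, a minimal sketch of these criteria for axis-aligned boxes in the (c, l, w, h) convention used above; the helper names are ours.

```python
def voc_overlap(box_a, box_b):
    """Pascal VOC overlap ratio: intersection area over union area.

    Boxes are (c, l, w, h) with (c, l) the upper-left corner.
    """
    ca, la, wa, ha = box_a
    cb, lb, wb, hb = box_b
    iw = max(0.0, min(ca + wa, cb + wb) - max(ca, cb))
    ih = max(0.0, min(la + ha, lb + hb) - max(la, lb))
    inter = iw * ih
    union = wa * ha + wb * hb - inter
    return inter / union if union > 0 else 0.0

def centre_location_error(box_a, box_b):
    """CLE: Euclidean distance between box centres, in pixels."""
    ax, ay = box_a[0] + box_a[2] / 2.0, box_a[1] + box_a[3] / 2.0
    bx, by = box_b[0] + box_b[2] / 2.0, box_b[1] + box_b[3] / 2.0
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5

# Per-frame success under the 0.5 VOR threshold used here.
tracked, truth = (100, 80, 40, 60), (110, 90, 40, 60)
print(voc_overlap(tracked, truth) > 0.5, centre_location_error(tracked, truth))
```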

4.1 Effect of Differing Reservoir Sizes

To show the effect of the reservoir size we use ten different reservoir sizes N, from 100 to 1000. We compute the average CLE, VOR and CPU time for each sequence with each reservoir size, and plot them in Fig. 2. We see that for 100 ≤ N ≤ 200, increasing the reservoir size N improves the CLE and VOR significantly. However, for N > 200, CLE and VOR do not change much. Meanwhile, the CPU run time increases with the reservoir size, as expected. The result of Struck is also shown in Fig. 2.

Fig. 4. Quantitative evaluation of tracking performance with different numbers of initial offsets for greedy inference. LEFT: the CPU time of the proposed tracker with greedy inference under various initial offsets on four video sequences. MIDDLE and RIGHT: the average CLE and VOR. The square, diamond, upward-pointing and downward-pointing triangles in the three plots correspond to the performance of the proposed tracker with full search on the four video sequences, respectively.

Evaluation of Structured SVM Score. Fig. 3 shows the quantitative structured SVM scores and qualitative tracking results of our tracker and Struck. Following the observation in Fig. 2, we set the reservoir size to 200. As can be seen, our tracker always has a higher SVM score than Struck under the same setting (i.e. the same features, kernel and regularisation). Our tracker manages to keep track of the object even under challenging illumination (coke), fast movement (tiger1), similar interference (iceball) and occlusion (bird).

4.2 Effect of the Number of Initial Offsets on Inference

The number of initial offsets for greedy inference affects tracking speed and accuracy. To investigate this, we fix the reservoir size to 200, and use the same four video sequences as in Fig. 2. A quantitative evaluation is shown in Fig. 4 for different numbers of initial offsets, equal to 4, 8, 16, 24, 36, 48 and 56. As can be seen in the left plot of Fig. 4, the CPU time for full search is significantly higher than that for the greedy algorithm on all four video sequences. In some cases, performance is not boosted as the number of initial offsets increases. For example, performance varies little with varying numbers of initial offsets on the iceball and bird sequences. This is due to the simple scene background and slow object movement; in such cases, the local maximum of the structured SVM score is often the global optimum. Furthermore, when the number of initial offsets is already large (e.g. 48), no significant improvement is obtained by increasing it further (see Fig. 4).

4.3 Comparison of Competing Trackers


We compare our tracker to eight state-of-the-art trackers. For visualisation purposes, we only show the results of the best four methods in Fig. 5. The full qualitative tracking results of all nine trackers over the twelve challenging video sequences are provided in the supplementary material.

Table 1. Compared average VOC overlap ratio on 12 sequences

Sequence Ours Struck MIL OAB1 OAB5 Frag IVT VTD L1T
Coke 0.69±0.15 0.54±0.17 0.35±0.22 0.18±0.18 0.17±0.16 0.13±0.16 0.08±0.22 0.09±0.22 0.05±0.20
Tiger1 0.75±0.13 0.66±0.22 0.61±0.18 0.44±0.23 0.24±0.26 0.21±0.29 0.02±0.12 0.10±0.24 0.45±0.37
Tiger2 0.68±0.15 0.57±0.18 0.63±0.14 0.37±0.25 0.18±0.20 0.15±0.23 0.02±0.12 0.20±0.23 0.27±0.32
Sylv 0.67±0.11 0.68±0.14 0.67±0.18 0.47±0.38 0.12±0.12 0.60±0.23 0.06±0.16 0.58±0.31 0.45±0.35
Iceball 0.69±0.18 0.50±0.33 0.37±0.28 0.08±0.22 0.40±0.30 0.52±0.31 0.03±0.12 0.56±0.28 0.03±0.10
Basketball 0.72±0.14 0.45±0.34 0.15±0.27 0.12±0.23 0.12±0.24 0.46±0.35 0.02±0.07 0.64±0.10 0.03±0.10
Shaking 0.72±0.14 0.10±0.18 0.60±0.26 0.56±0.28 0.50±0.21 0.33±0.27 0.03±0.11 0.69±0.14 0.05±0.16
Singer1 0.71±0.08 0.33±0.36 0.18±0.32 0.17±0.32 0.07±0.19 0.13±0.26 0.20±0.28 0.49±0.20 0.18±0.32
Bird 0.75±0.10 0.55±0.25 0.58±0.32 0.56±0.28 0.59±0.30 0.34±0.32 0.08±0.19 0.10±0.26 0.44±0.37
Box 0.74±0.17 0.29±0.38 0.10±0.18 0.16±0.27 0.08±0.18 0.06±0.17 0.46±0.29 0.41±0.34 0.13±0.26
Board 0.69±0.25 0.39±0.29 0.16±0.27 0.12±0.24 0.17±0.26 0.49±0.25 0.10±0.21 0.31±0.27 0.11±0.27
Walk 0.63±0.18 0.46±0.32 0.50±0.34 0.52±0.36 0.49±0.34 0.08±0.25 0.05±0.14 0.08±0.23 0.09±0.25
Average 0.71 0.46 0.41 0.31 0.26 0.29 0.10 0.35 0.19

Fig. 5. Visual tracking results. Yellow, blue, green, magenta and red rectangles show the results of our tracker, Struck, MIL, Frag and VTD, respectively.

Pose and Scale Changes. As can be seen in the first row of Fig. 5, our algorithm exhibits the best tracking performance amongst the group. In the box sequence, the other four trackers cannot keep track of the object after rotation, as they are not effective in accounting for appearance change due to large pose change. In the basketball sequence, the Struck, MIL and Frag methods lose the object completely after frame 476 due to the rapid appearance change of the object. The VTD tracker obtains an average error similar to ours on the basketball sequence, but its VOC overlap ratio is worse (see Tab. 2 and Tab. 1).
Illumination Variation and Occlusion. The second row of Fig. 5 illustrates performance under heavy lighting changes and occlusion. The MIL and Frag methods drift to a background area in early frames of the singer1 sequence, as they are not designed to handle full occlusions. Although Struck maintains track under strong illumination (frame 111 in the singer1 sequence), it loses the object completely after the illumination changes. As can be seen in the right image of Fig. 3, frequent occlusion also affects the performance of Struck. In contrast, our tracker adapts to the appearance changes, and achieves robust tracking results under illumination variation and partial or full occlusions.

Table 2. Compared average centre location errors (pixels) on 12 sequences

Sequence Ours Struck MIL OAB1 OAB5 Frag IVT VTD L1T
Coke 5.0±4.5 8.3±5.6 17.5±9.5 25.2±15.4 56.7±30.0 61.5±32.4 10.5±21.9 47.3±21.9 55.3±22.4
Tiger1 5.1±3.6 8.6±12.1 8.3±6.6 17.6±16.9 39.2±32.8 40.1±25.6 56.3±20.6 68.7±36.2 23.2±30.6
Tiger2 6.1±3.7 8.5±6.0 7.0±3.7 19.7±15.2 37.3±27.0 38.7±25.1 92.7±34.7 37.0±29.7 33.1±28.1
Sylv 8.8±3.5 8.4±5.3 8.7±6.4 33.0±36.6 75.3±35.3 12.6±11.5 73.9±37.6 21.6±35.9 21.7±32.6
Iceball 4.9±3.6 15.6±22.1 61.6±85.6 97.7±53.4 58.1±84.0 40.3±72.9 90.2±67.5 13.5±26.0 77.1±50.0
Basketball 9.6±6.2 89.6±163.7 163.9±104.0 167.8±100.6 199.5±99.9 53.9±59.8 31.5±36.7 9.8±5.0 120.1±66.2
Shaking 9.2±5.7 123.1±54.3 37.8±75.6 26.9±49.2 29.1±48.7 47.1±40.5 94.3±36.8 12.5±6.7 132.8±56.9
Singer1 5.1±1.8 29.2±23.6 187.9±120.1 189.4±114.3 158.0±48.6 172.5±94.5 17.3±13.0 10.5±7.5 108.5±78.5
Bird 8.6±4.1 20.7±14.7 49.0±85.3 47.9±87.7 48.6±86.2 50.0±43.3 107.7±57.6 143.9±79.3 46.3±49.3
Box 13.4±15.7 162.2±132.8 140.9±77.0 153.5±95.8 165.4±84.1 147.0±67.8 34.7±43.8 63.5±65.7 150.0±82.7
Board 32.9±32.5 85.3±56.5 211.7±124.8 216.5±111.6 213.7±96.1 65.9±48.6 223.8±97.5 112.3±66.9 240.5±96.2
Walk 11.3±7.7 36.8±46.4 33.8±47.3 35.5±49.0 36.9±48.5 102.6±46.1 90.2±56.5 100.9±46.9 79.6±41.3
Average 9.9 49.6 77.3 85.8 93.1 69.3 76.9 53.4 90.6

Table 3. Compared success rates on 12 sequences

Sequence Ours Struck MIL OAB1 OAB5 Frag IVT VTD L1T
Coke 0.95 0.71 0.27 0.05 0.03 0.06 0.08 0.08 0.05
Tiger1 0.96 0.82 0.83 0.41 0.18 0.19 0.01 0.13 0.47
Tiger2 0.85 0.66 0.81 0.34 0.12 0.09 0.01 0.15 0.30
Sylv 0.93 0.90 0.76 0.52 0.01 0.69 0.03 0.67 0.53
Iceball 0.86 0.61 0.30 0.09 0.42 0.67 0.02 0.76 0.02
Basketball 0.90 0.61 0.16 0.13 0.13 0.56 0.01 0.90 0.02
Shaking 0.96 0.07 0.78 0.55 0.46 0.29 0.01 0.86 0.02
Singer1 0.99 0.39 0.21 0.21 0.07 0.17 0.18 0.63 0.26
Bird 0.97 0.53 0.73 0.74 0.75 0.34 0.06 0.14 0.48
Box 0.92 0.36 0.06 0.19 0.07 0.04 0.55 0.48 0.15
Board 0.70 0.29 0.13 0.08 0.09 0.45 0.09 0.23 0.12
Walk 0.76 0.58 0.67 0.68 0.62 0.10 0.03 0.08 0.10
Average 0.90 0.54 0.48 0.33 0.25 0.30 0.09 0.43 0.21

Quantitative Comparisons to Competing Trackers. Tab. 1 and Tab. 2 show the average VOR and CLE (best results are highlighted in bold) obtained by our tracker and the other eight competing trackers. Because several trackers involve some randomness, we also show the standard deviation of each evaluation result. Tab. 3 reports the success rates of the nine trackers over the twelve video sequences. From Tab. 1, Tab. 2 and Tab. 3, we find that the proposed tracking algorithm achieves the best tracking performance on most video sequences. Furthermore, we report the CLE of the tracking results in the supplementary material¹.

5 Conclusion
We have developed a novel approach to online tracking which allows continuous development of the target appearance model without neglecting information gained from previous observations. The method is extremely flexible, in that the weights between observations are freely adjustable, and it is applicable to binary, multiclass, and even structured output labels with both linear and non-linear kernels. In formulating the method we also developed a novel, principled
¹ The demonstration videos are available at http://www.youtube.com/woltracker12.
method of optimising the weighted risk using weighted reservoir sampling. We also derived a weighted regret bound in order to provide a performance guarantee, and showed that the tracker outperforms state-of-the-art methods.

Acknowledgments. This work was partially supported by Australian Research Council grants DP1094764, LE100100235, DE120101161, and DP11010352. It was also partially supported by NSF (60872145, 60903126), National High-Tech. (2009AA01Z315), Postdoctoral Science Foundation (20090451397, 201003685) and Cultivation Fund from Ministry of Education (708085) of China. R. Yao's contribution was made while he was a visiting student at the University of Adelaide.

References
1. Wang, S., Lu, H., Yang, F., Yang, M.H.: Superpixel tracking. In: Proc. ICCV, pp. 1323–1330 (2011)
2. Adam, A., Rivlin, E., Shimshoni, I.: Robust fragments-based tracking using the integral histogram. In: Proc. CVPR, vol. 1, pp. 798–805 (2006)
3. Ross, D.A., Lim, J., Lin, R.S., Yang, M.H.: Incremental learning for robust visual tracking. Int. J. Comput. Vision 77, 125–141 (2008)
4. Kwon, J., Lee, K.M.: Visual tracking decomposition. In: Proc. CVPR (2010)
5. García Cifuentes, C., Sturzel, M., Jurie, F., Brostow, G.J.: Motion models that only work sometimes. In: Proc. BMVC (2012)
6. Mei, X., Ling, H.: Robust visual tracking and vehicle classification via sparse representation. IEEE Trans. on PAMI 33, 2259–2272 (2011)
7. Li, H., Shen, C., Shi, Q.: Real-time visual tracking using compressive sensing. In: Proc. CVPR, pp. 1305–1312 (2011)
8. Grabner, H., Leistner, C., Bischof, H.: Semi-supervised On-Line Boosting for Robust Tracking. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 234–247. Springer, Heidelberg (2008)
9. Grabner, H., Grabner, M., Bischof, H.: Real-time tracking via on-line boosting. In: Proc. BMVC, vol. 1, pp. 47–56 (2006)
10. Avidan, S.: Ensemble tracking. IEEE Trans. on PAMI 29, 261–271 (2007)
11. Babenko, B., Yang, M.H., Belongie, S.: Visual Tracking with Online Multiple Instance Learning. In: Proc. CVPR, pp. 983–999 (2009)
12. Li, X., Shen, C., Shi, Q., Dick, A.R., van den Hengel, A.: Non-sparse linear representations for visual tracking with online reservoir metric learning. In: Proc. CVPR (2012)
13. Li, X., Dick, A.R., Wang, H., Shen, C., van den Hengel, A.: Graph mode-based contextual kernels for robust SVM tracking. In: Proc. ICCV, pp. 1156–1163 (2011)
14. Hare, S., Saffari, A., Torr, P.: Struck: Structured output tracking with kernels. In: Proc. ICCV, pp. 263–270 (2011)
15. Avidan, S.: Support vector tracking. IEEE Trans. on PAMI 26(8), 1064–1072 (2004)
16. Blaschko, M.B., Lampert, C.H.: Learning to Localize Objects with Structured Output Regression. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 2–15. Springer, Heidelberg (2008)
17. Tsochantaridis, I., Joachims, T., Hofmann, T., Altun, Y.: Large margin methods for structured and interdependent output variables. JMLR 6, 1453–1484 (2005)
18. Bordes, A., Bottou, L., Gallinari, P., Weston, J.: Solving multiclass support vector machines with LaRank. In: Proc. ICML, pp. 89–96 (2007)
19. Bordes, A., Usunier, N., Bottou, L.: Sequence Labelling SVMs Trained in One Pass. In: Daelemans, W., Goethals, B., Morik, K. (eds.) ECML PKDD 2008, Part I. LNCS (LNAI), vol. 5211, pp. 146–161. Springer, Heidelberg (2008)
20. Shalev-Shwartz, S., Singer, Y.: A primal-dual perspective of online learning algorithms. Machine Learning 69, 115–142 (2007)
21. Efraimidis, P.S., Spirakis, P.G.: Weighted random sampling with a reservoir. Inf. Process. Lett. 97, 181–185 (2006)
Fast Regularization of Matrix-Valued Images

Guy Rosman¹, Yu Wang¹, Xue-Cheng Tai², Ron Kimmel¹, and Alfred M. Bruckstein¹
¹ Dept. of Computer Science, Technion – IIT, Haifa 32000, Israel
{rosman,yuwang,ron,freddy}@cs.technion.ac.il
² Dept. of Mathematics, University of Bergen, Johannes Brunsgate 12, Bergen 5007, Norway
tai@mi.uib.no

Abstract. Regularization of images with matrix-valued data is important in medical imaging, motion analysis and scene understanding. We propose a novel method for fast regularization of matrix group-valued images.
Using the augmented Lagrangian framework we separate total-variation regularization of matrix-valued images into regularization and projection steps. Both steps are computationally efficient and easily parallelizable, allowing real-time regularization of matrix-valued images on a graphics processing unit.
We demonstrate the effectiveness of our method for smoothing several group-valued image types, with applications in direction diffusion, motion analysis from depth sensors, and DT-MRI denoising.

Keywords: Matrix-valued, Regularization, Total-variation, Optimization, Motion understanding, DT-MRI, Lie-groups.

1 Introduction

Matrix Lie-group data, and specifically matrix-valued images, have become an integral part of computer vision and image processing. Such representations have been found useful for tracking [35,45], robotics, motion analysis, image processing and computer vision [10,32,34,36,48], as well as medical imaging [6,31]. Specifically, developing efficient regularization schemes for matrix-valued images is of prime importance for image analysis and computer vision. This includes applications such as direction diffusion [25,42,47] and scene motion analysis [27] in computer vision, as well as diffusion tensor MRI (DT-MRI) regularization [7,14,21,40,43] in medical imaging.

This research was supported by Israel Science Foundation grant no. 1551/09 and by the European Community's FP7 ERC program, grant agreement no. 267414.


In this paper we present an augmented Lagrangian method for efficient regularization of matrix-valued images with constraints on the singular values or eigenvalues of the matrices. Examples include the special-orthogonal, special-Euclidean, and symmetric positive-definite matrix groups. We show that the augmented Lagrangian technique allows us to separate the optimization process into a total-variation (TV, [38]) regularization, or higher-order regularization, step and an eigenvalue or singular value projection step, both of which are simple to compute, fast and easily parallelizable using consumer graphics processing units (GPUs), achieving real-time processing rates. The resulting framework unifies algorithms used in several domains into one framework, where only the projection operator differs slightly according to the matrix group in question. While such an optimization problem could have been approached by general saddle-point solvers such as [12], the domain of our problem is not convex, requiring such algorithms to be modified in order to allow provable convergence.
We suggest using two sets of auxiliary fields with appropriate constraints. One field allows us to simplify the total-variation regularization operator as done, for example, in [11,20,41]. Another field separates the matrix manifold constraint into a simple projection operator. This results in a unified framework for processing of SO(n), SE(n) and SPD(n) images, as we describe in Section 3. In Section 4 we demonstrate a few results of our method, for regularization in 3D motion analysis, direction diffusion and diffusion tensor imaging. Section 5 concludes the paper.

2 A Short Introduction to Lie-Groups

Lie-groups are groups endowed with a differentiable manifold structure and an appropriate group action. Their structure allows us to define priors on Lie-group data in computer vision and has been the subject of intense research efforts, especially involving statistics of matrix-valued data [31], and regularization of group-valued images [43], as well as describing the dynamics of processes involving Lie-group data [27]. We briefly describe the Lie-groups our algorithm deals with, and refer the reader to the literature for an introduction to Lie-groups [22].
The rotations group SO(n). The group SO(n) describes all rotation matrices of the n-dimensional Euclidean space,
$$SO(n) = \left\{ R \in \mathbb{R}^{n \times n},\ R^T R = I,\ \det(R) = 1 \right\}. \tag{1}$$

The special-Euclidean group SE(n). This group represents rigid transformations of the n-dimensional Euclidean space. It can be thought of as the product manifold of the rotation manifold SO(n) and the manifold $\mathbb{R}^n$ representing all translations of the Euclidean space. In matrix form this group is written as
$$SE(n) = \left\{ \begin{pmatrix} R & \mathbf{t} \\ 0 & 1 \end{pmatrix},\ R \in SO(n),\ \mathbf{t} \in \mathbb{R}^n \right\}. \tag{2}$$

The symmetric positive definite group SPD(n). This is the group of symmetric positive definite matrices. It has been studied extensively in control theory (see [17] for example), as well as in the context of diffusion tensor images [31], where the matrices are used to describe the diffusion coefficients along each direction. By definition, this group is given in matrix form as
$$SPD(n) = \left\{ A \in \mathbb{R}^{n \times n},\ A \succ 0 \right\}. \tag{3}$$

3 An Augmented Lagrangian Regularization Algorithm for Matrix-Valued Images

We now proceed to describe a fast regularization algorithm for images with matrix-valued data, referred to as Algorithm 1. The optimization problem we consider is
$$\operatorname*{argmin}_{u \in G} \int \left\| (\nabla u)\, u^{-1} \right\| + \lambda \left\| u - u_0 \right\|^2 \, dx, \tag{4}$$

where $\|\cdot\|$ is the Frobenius norm, u represents an element in an embedding of the Lie-group G into Euclidean space, specifically for the groups SO(n), SE(n), and SPD(n). We use the notation $\nabla u$ to denote the Jacobian of u, described as a column-stacked vector. The regularization term $\|(\nabla u)\,u^{-1}\|$ expresses smoothness in terms of the geometry of the Lie-group. Elements of SO(n) can be embedded into $\mathbb{R}^m$, $m = n^2$, and elements of SE(n) can similarly be embedded into $\mathbb{R}^m$, $m = n(n+1)$. The elements of SPD(n) can be embedded into $\mathbb{R}^m$, $m = n(n+1)/2$.
For brevity's sake, we use the same notation to represent the Lie-group element, its matrix representation, and the embedding into Euclidean space, as specified in each case we explore.
The term $\|(\nabla u)\,u^{-1}\|$ can be thought of as a regularization term placed on elements of the Lie algebra about each pixel. In order to obtain a fast regularization scheme, we look instead at regularization of an embedding of the Lie-group elements into Euclidean space,
$$\operatorname*{argmin}_{u \in G} \int \|\nabla u\| + \lambda \|u - u_0\|^2\, dx. \tag{5}$$

The rationale behind the different regularization term $\|\nabla u\|$ stems from the fact that SO(n) and SE(n) are isometries of Euclidean space; such a regularization is possible whenever the data consists of nonsingular matrices, and has also been used for SPD matrices [46]. We refer the reader to our technical report [37] for a more in-depth discussion of this important point. Next, instead of restricting u to G, we add an auxiliary variable, v, at each point, such that u = v, and restrict v to G, where the equality constraint is enforced via augmented Lagrangian terms [23,33]. The suggested augmented Lagrangian optimization now reads

$$\min_{v \in G,\, u \in \mathbb{R}^m}\; \max_{\mu}\; L(u, v; \mu) = \min_{v \in G,\, u \in \mathbb{R}^m}\; \max_{\mu} \int \|\nabla u\| + \lambda\|u - u_0\|^2 + \frac{r}{2}\|u - v\|^2 + \operatorname{tr}\!\big(\mu^T(u - v)\big)\, dx. \tag{6}$$
Given a fixed Lagrange multiplier $\mu$, the minimization w.r.t. u, v can be split into alternating minimization steps with respect to u and v, both of which are trivial to implement in an efficient and parallel manner.

3.1 Minimization w.r.t. v

The minimization w.r.t. v is a projection problem per pixel,
$$\operatorname*{argmin}_{v \in G} \frac{r}{2}\|v - u\|^2 + \operatorname{tr}\!\big(\mu^T(u - v)\big) = \operatorname*{argmin}_{v \in G} \frac{r}{2}\Big\|v - \Big(\frac{\mu}{r} + u\Big)\Big\|^2 = \operatorname{Proj}_{G}\Big(\frac{\mu}{r} + u\Big), \tag{7}$$
where $\operatorname{Proj}_G$ denotes the projection operator onto the specific matrix group G; its concrete form for SO(n), SE(n) and SPD(n) will be given later on.

3.2 Minimization w.r.t. u

Minimization with respect to u is a vectorial TV denoising problem
$$\operatorname*{argmin}_{u \in \mathbb{R}^m} \int \|\nabla u\| + \Big(\lambda + \frac{r}{2}\Big)\big\|u - \tilde{u}(u_0, v, \mu, r)\big\|^2\, dx, \tag{8}$$
with $\tilde{u} = \frac{2\lambda u_0 + r v + \mu}{2\lambda + r}$. This problem can be solved via fast minimization techniques for TV regularization of vectorial images, such as [9,16,19]. We chose to use the augmented-Lagrangian TV algorithm [41], as we now describe. In order to obtain fast optimization of the problem with respect to u, we add an auxiliary variable p, along with the constraint that $p = \nabla u$. Again, the constraint is enforced in an augmented Lagrangian manner. The optimal u now becomes a saddle point of the optimization problem
$$\min_{u \in \mathbb{R}^m,\, p \in \mathbb{R}^{2m}}\; \max_{\mu_2} \int \Big(\lambda + \frac{r}{2}\Big)\|u - \tilde{u}(u_0, v, \mu, r)\|^2 + \|p\| + \mu_2^T(p - \nabla u) + \frac{r_2}{2}\|p - \nabla u\|^2\, dx. \tag{9}$$
We solve for u using the Euler–Lagrange equation,
$$2\Big(\lambda + \frac{r}{2}\Big)(u - \tilde{u}) + \big(\operatorname{div} \mu_2 + r_2 \operatorname{div} p\big) - r_2\, \Delta u = 0, \tag{10}$$
for example, in the Fourier domain, or by Gauss–Seidel iterations.

The auxiliary field p is updated by rewriting the minimization w.r.t. p as
$$\operatorname*{argmin}_{p \in \mathbb{R}^{2m}} \|p\| + \mu_2^T p + \frac{r_2}{2}\|p - \nabla u\|^2, \tag{11}$$
with the closed-form solution [41]
$$p = \frac{1}{r_2}\max\Big(1 - \frac{1}{\|w\|},\, 0\Big)\, w, \qquad w = r_2 \nabla u - \mu_2. \tag{12}$$
Hence, the main part of the proposed algorithm is to iteratively update v, u, and p, respectively. Also, according to the optimality conditions, the Lagrange multipliers $\mu$ and $\mu_2$ should be updated by taking
$$\mu^k = \mu^{k-1} + r\big(v^k - u^k\big), \qquad \mu_2^k = \mu_2^{k-1} + r_2\big(p^k - \nabla u^k\big). \tag{13}$$
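A minimal NumPy sketch of the per-pixel shrinkage update (12); the array layout (a stacked 2m-vector per pixel) and the small epsilon guard are assumptions of this illustration, not the authors' GPU code.

```python
import numpy as np

def update_p(grad_u, mu2, r2):
    """Closed-form update of the auxiliary field p, as in Eq. (12).

    grad_u, mu2 : arrays of shape (..., 2m) holding, per pixel, the stacked
    gradient of u and the Lagrange multiplier; r2 : positive penalty parameter.
    """
    w = r2 * grad_u - mu2
    norm_w = np.linalg.norm(w, axis=-1, keepdims=True)
    scale = np.maximum(1.0 - 1.0 / np.maximum(norm_w, 1e-12), 0.0)
    return scale * w / r2

# Toy usage on a 4x4 "image" with a 2-vector per pixel.
rng = np.random.default_rng(1)
p = update_p(rng.normal(size=(4, 4, 2)), np.zeros((4, 4, 2)), r2=10.0)
print(p.shape)
```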

An algorithmic description is summarized as Algorithm 1.

Algorithm 1. Fast TV regularization of matrix-valued data
1: for k = 1, 2, . . ., until convergence do
2:   Update u^k(x), p^k(x) according to Equations (10) and (12).
3:   Update v^k(x) by projection onto the matrix group:
       for SO(n) matrices, according to Equation (14);
       for SE(n) matrices, according to Equation (15);
       for SPD(n) matrices, according to Equation (16).
4:   Update μ^k(x), μ₂^k(x) according to Equation (13).
5: end for

3.3 Regularization of Maps onto SO(n)

In the case of G = SO(n), although the embedding of SO(n) in Euclidean space is not a convex set, the projection onto the matrix manifold is easily achieved by means of the singular value decomposition [18]. Let $USV^T = \frac{\mu}{r} + u^k$ be the singular value decomposition (SVD) of $\frac{\mu}{r} + u^k$. We update v by
$$v^{k+1} = \operatorname{Proj}_{SO(n)}\Big(\frac{\mu}{r} + u^k\Big) = U(x)V^T(x), \qquad USV^T = \frac{\mu}{r} + u^k. \tag{14}$$
Other possibilities include using the Euler–Rodrigues formula, quaternions, or the polar decomposition [26]. We note that the non-convex domain SO(n) prevents a global convergence proof. The algorithm, in the case of G = SO(n) and G = SE(n), can be made provably convergent using the method of Attouch et al. [5]. The details and proof are shown in our technical report [3].
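For illustration, here is a minimal NumPy sketch of the SVD-based projection in (14); the determinant sign correction that enforces det = +1 is a standard refinement assumed for this sketch, not stated in the equation above.

```python
import numpy as np

def project_so_n(a):
    """Project a (possibly noisy) n x n matrix onto SO(n) via the SVD, cf. Eq. (14)."""
    u, _, vt = np.linalg.svd(a)
    d = 1.0 if np.linalg.det(u @ vt) >= 0 else -1.0
    u[:, -1] *= d                      # flip one singular direction if a reflection resulted
    return u @ vt

# Toy usage: a rotation corrupted by additive noise is snapped back to SO(2).
theta = 0.3
r = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
noisy = r + 0.05 * np.random.default_rng(2).normal(size=(2, 2))
proj = project_so_n(noisy)
print(proj, np.linalg.det(proj))
```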

3.4 Regularization of Maps onto SE(n)

In order to regularize images with values in SE(n), we use an embedding into $\mathbb{R}^{n(n+1)}$ as our main optimization variable, u, per pixel. The projection step w.r.t. v applies only to the $n^2$ elements of v describing the rotation matrix, leaving the translation component of SE(n) unconstrained.
Specifically, let $v = (v_R, v_t)$, $v_R \in \mathbb{R}^{n^2}$, $v_t \in \mathbb{R}^n$, denote the rotation and translation parts of the current solution, with a similar partition for the Lagrange multipliers $\mu = (\mu_R, \mu_t)$. Updating v in step 3 of Algorithm 1 assumes the form
$$v_R^{k+1} = \operatorname{Proj}_{SO(n)}\Big(\frac{\mu_R}{r} + u_R^k\Big), \qquad v_t^{k+1} = \frac{\mu_t}{r} + u_t^k, \tag{15}$$
$$v^{k+1} = \operatorname{Proj}_{SE(n)}(v^k) = \big(v_R^{k+1}, v_t^{k+1}\big).$$

3.5 Regularization of Maps onto SPD(n)

The technique described above can also be used for regularizing symmetric positive-definite matrices. Here, the intuitive choice of projecting the eigenvalues of the matrices onto the positive half-space is shown to be optimal [24]. Many papers dealing with the analysis of DT-MRI rely on the eigenvalue decomposition of the tensor as well, e.g. for tractography, anisotropy measurements, and so forth.
For G = SPD(n), the minimization problem w.r.t. v in step 3 of Algorithm 1 can be solved by projection of eigenvalues. Let $U \operatorname{diag}(\Lambda)\, U^T$ be the eigenvalue decomposition of the matrix $\frac{\mu}{r} + u^k$. v is updated according to
$$v^{k+1} = \operatorname{Proj}_{SPD(n)}(v^k) = U(x)\operatorname{diag}\big(\tilde{\Lambda}\big)\,U^T(x), \qquad \tilde{\Lambda}_i = \max\big(\Lambda_i,\, 0\big), \tag{16}$$
where $U \operatorname{diag}(\Lambda)\, U^T = \frac{\mu}{r} + u^k$, the matrix U is unitary, representing the eigenvectors of the matrix, and the eigenvalues $\tilde{\Lambda}$ are the positive projection of the eigenvalues $\Lambda_i$. Optimization w.r.t. u is done as in the previous cases, as described in Algorithm 1. Furthermore, the optimization w.r.t. u, v is now over the domain $\mathbb{R}^m \times SPD(n)$, and the cost function is convex, resulting in a convex optimization problem. The convex domain of optimization allows us to formulate a convergence proof for the algorithm similar to the proof by Tseng [44]. We refer the interested reader to our technical report [3]. An example of using the proposed method for DT-MRI denoising is shown in Section 4.
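A minimal NumPy sketch of the eigenvalue-clamping projection in (16); the explicit symmetrisation of the input is an extra safeguard assumed for this illustration.

```python
import numpy as np

def project_spd(a):
    """Project a symmetric matrix onto the PSD cone by clamping eigenvalues, cf. Eq. (16)."""
    a_sym = 0.5 * (a + a.T)                 # guard against numerical asymmetry
    eigvals, eigvecs = np.linalg.eigh(a_sym)
    eigvals = np.maximum(eigvals, 0.0)      # positive projection of the eigenvalues
    return eigvecs @ np.diag(eigvals) @ eigvecs.T

# Toy usage: an indefinite 3x3 diffusion-tensor-like matrix.
a = np.array([[1.0, 0.2, 0.0], [0.2, -0.3, 0.1], [0.0, 0.1, 0.5]])
print(np.linalg.eigvalsh(project_spd(a)))
```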

3.6 A Higher-Order Prior for Group-Valued Images

We note that the scheme we describe is susceptible to the staircasing effect, since it minimizes the total variation of the map u. Several higher-order priors that do not suffer from staircasing effects can be incorporated into our scheme. One such possible higher-order term generalizes the scheme presented by Wu and Tai [49], by replacing the per-element gradient operator with a Hessian operator. The resulting saddle-point problem becomes
$$\min_{\substack{u \in \mathbb{R}^m,\ p \in \mathbb{R}^{4m},\\ v \in G}}\; \max_{\mu_2} \int \|p\| + \Big(\lambda + \frac{r}{2}\Big)\big\|u - \tilde{u}(u_0, v, \mu, r)\big\|^2 + \mu_2^T(p - Hu) + \frac{r_2}{2}\|p - Hu\|^2\, dx, \tag{17}$$
where H denotes the per-element Hessian operator. We show an example using the appropriately modified scheme in Figures 1 and 3.
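As an illustration of the higher-order variant, here is a minimal finite-difference sketch of the per-element Hessian operator H acting on one scalar channel of u (the four second-derivative components per pixel that make up p in 2D); the central-difference stencils and the replicate padding at the border are simplifying assumptions, not the paper's discretization.

```python
import numpy as np

def hessian_components(u):
    """Per-pixel second derivatives (u_xx, u_xy, u_yx, u_yy) of a 2D channel u."""
    up = np.pad(u, 1, mode="edge")
    uxx = up[1:-1, 2:] - 2.0 * up[1:-1, 1:-1] + up[1:-1, :-2]
    uyy = up[2:, 1:-1] - 2.0 * up[1:-1, 1:-1] + up[:-2, 1:-1]
    uxy = 0.25 * (up[2:, 2:] - up[2:, :-2] - up[:-2, 2:] + up[:-2, :-2])
    return np.stack([uxx, uxy, uxy, uyy], axis=-1)   # shape (H, W, 4)

# Toy usage on one embedded-matrix channel of a 5x5 image.
print(hessian_components(np.arange(25.0).reshape(5, 5)).shape)
```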

4 Numerical Results
As discussed above, the proposed algorithmic framework is quite general and suitable for various applications. In this section, several examples from different applications are used to substantiate the effectiveness and efficiency of our algorithm.

4.1 Directions Regularization


Analysis of principal directions in an image or video is an important aspect
of modern computer vision, in elds such as video surveillance [30, and ref-
erences therein], vehicle control [15], crowd behaviour analysis [29], and other
applications[32].
Since SO(2) is isomorphic to S 1 , the suggested regularization scheme can be
used for regularizing directions, such as principal motion directions in a video
sequence. A reasonable choice for a data term would try to align the rst coor-
dinate axis after rotation with the motion directions in the neighborhood,
  
EP MD (U ) = U1,1 (vj )x + U1,2 (vj )y ,
(xj ,yj )N (i)
 
where xj , yj , (vj )x , (vj )y represent a sampled motion particle [29] in the video
sequence, and Ui,j represent elements of the solution u at each point.
In Figure 1 we demonstrate two sparsely sampled, noisy motion fields, and a dense reconstruction of the main direction of motion at each point. The data for the direction estimation was corrupted by adding component-wise Gaussian noise. In the first image, the motion field is comprised of 4 regions with a different motion direction in each region. The second image contains a sparse sampling of an expansion motion field of the form $v(x, y) = \frac{(x, y)^T}{\|(x, y)\|}$. Such an expansion field is often observed by forward-moving vehicles. Note that despite the fact that the vanishing point of the flow is clearly not smooth in terms of the motion directions, the estimation of the motion field is still correct.

Fig. 1. TV regularization of SO(n) data. Left-to-right, top-to-bottom: a noisy, a TV-denoised, and a higher-order regularized (minimizing Equation 17) version of a piecewise constant SO(2) image, followed by an expansion field directions image. Different colors mark different orientations of the initial/estimated dense field, black arrows signify the measured motion vectors, and blue arrows demonstrate the estimated field.

In Figure 2 we used the algorithm to obtain a smooth field of principal motion directions over a traffic sequence taken from the UCF crowd flow database [4]. Direction cues are obtained by initializing correlation-based trackers from arbitrary times and positions in the sequence, and observing all of them simultaneously. The result captures the main traffic lanes and shows the viability of our regularization for a real data sequence.
Yet another application of direction diffusion is the denoising of directions in fingerprint images. An example of direction diffusion on a fingerprint image taken from the Fingerprint Verification Competition datasets [1] can be seen in Figure 3. Adding Gaussian noise of σ = 0.05 to the image and estimating directions based on the structure tensor, we smoothed the direction field and compared it to the field obtained from the original image. We used our method with λ = 3, the modified method based on Equation 17 with λ = 10, as well as the method suggested by Sochen et al. [39] with parameter values 100 and T = 425. The resulting MSE values of the tensor field are 0.0317, 0.0270 and 0.0324, respectively, compared to an initial noisy field with MSE = 0.0449. These results demonstrate the effectiveness of our method for direction diffusion, even in cases where the staircasing effect may cause unwanted artifacts.

4.2 SE(n) Regularization


We now demonstrate a smoothing of SE(3) data obtained from locally matching
between two range scans obtained from a Kinect device. For each small surface
patch from the depth image we use an iterative closest point algorithm[8] to
match the surface from the previous frame. The background is segmented by sim-
ple thresholding. The results from this tracking process over raw range footage
are an inherently noisy measurements set. We use our algorithm to smooth this

Fig. 2. Regularization of principal motion directions. The red arrows demonstrate measurements of motion cues based on a normalized cross-correlation tracker. Blue arrows demonstrate the regularized direction fields.

Table 1. Processing times (ms) for various image sizes, with various iteration counts

Outer iterations   15    15    25    50    100
GS iterations       1     3     1     1      1
320 × 240          49    63    81   160    321
640 × 480         196   250   319   648   1295
1920 × 1080      1745  2100  2960  5732  11560

SE(3) image, as shown in Figure 4. It can be seen that, for a careful choice of the regularization parameter, total variation regularization of the group elements significantly reduces rigid motion estimation errors. Furthermore, it allows us to discern the main rigidly moving parts in the sequence by producing a scale-space of rigid motions. Visualization is accomplished by projecting the embedded matrix onto 3 different representative vectors in $\mathbb{R}^{12}$. The regularization is implemented using the CUDA framework, with computation times shown in Table 1 for various image sizes and iteration counts. In the GPU implementation the polar decomposition was chosen for its simplicity and efficiency. In practice, one Gauss–Seidel iteration sufficed to update u. Using 15 outer iterations, practical convergence is achieved in 49 milliseconds on an NVIDIA GTX-580 card for QVGA-sized images, demonstrating the efficiency of our algorithm and its potential for real-time applications. This is especially important for applications such as gesture recognition where fast computation is crucial.

Fig. 3. TV regularization of SO(2) data based on fingerprint direction estimation. Left-to-right, top-to-bottom: the fingerprint image with added Gaussian noise of σ = 0.05, the detected direction angles, the detected directions displayed as arrows, the detected directions after regularization with λ = 3, regularization results using the higher-order regularization term shown in Equation 17 with λ = 6, and the regularization result by Sochen et al. [39].

Fig. 4. Regularization of SE(3) images obtained from local ICP matching of surface patches between consecutive Kinect depth frames. Left-to-right: the diffusion scale-space obtained for different values of λ: 1.5, 1.2, 0.7, 0.2, 0.1, 0.05; the foreground segmentation based on the depth; and an intensity image of the scene. Top-to-bottom: different frames from the depth motion sequence.

4.3 DT-MRI Regularization

In Figure 5 we demonstrate smoothing of DT-MRI data from [28], based on the scheme suggested in Section 3.5. We show an axial view of the brain, with glyph-based visualization using Slicer3D [2] and anisotropy-based color coding.
The noise added is additive Gaussian noise in each of the tensor elements with σ = 0.1. Note that while different noise models are often assumed for diffusion-weighted images, at high noise levels the Gaussian model is a reasonable approximation. Regularization with λ = 30 is able to restore a significant amount of the white matter structure. At such noise levels, the bias of the TV-regularized data towards isotropic tensors (known as the swell effect [13]) is less significant. The RMS of the tensor representation was 0.0406 in the corrupted image and 0.0248 in the regularized image. Similarly, regularized reconstruction of DT-MRI signals from diffusion-weighted images is also possible using our method, but is beyond the scope of this paper.

Fig. 5. TV denoising of images with diffusion tensor data, visualized by 3D tensor ellipsoid glyphs colored by fractional anisotropy. Left-to-right: the original image, an image with added component-wise Gaussian noise of σ = 0.1, and the denoised image with λ = 30.

5 Conclusions
In this paper, a general framework for regularization of matrix-valued maps is proposed. Based on augmented Lagrangian techniques, we separate the optimization problem into a TV-regularization step and a projection step, both of which can be solved in an easy-to-implement and parallel way. Specifically, we show the efficiency and effectiveness of the resulting scheme through several examples whose data are taken from SO(2), SE(3), and SPD(3), respectively. We emphasize that, for matrix-valued images, our algorithms allow real-time regularization for tasks in image analysis and computer vision.
In future work we intend to explore other applications of matrix-valued image regularization, as well as to generalize our method to other types of maps and to other data and noise models.

References

1. Fingerprint Verification Competition database
2. 3DSlicer software package
3. Fast regularization of matrix-valued images. Technical report, Anonymized version submitted as supplementary material (2011)
4. Ali, S., Shah, M.: A Lagrangian particle dynamics approach for crowd flow segmentation and stability analysis. In: CVPR, pp. 1–6 (2007)
5. Attouch, H., Bolte, J., Redont, P., Soubeyran, A.: Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Lojasiewicz inequality. Math. Oper. Res. 35, 438–457 (2010)
6. Basser, P.J., Mattiello, J., LeBihan, D.: MR diffusion tensor spectroscopy and imaging. Biophysical Journal 66(1), 259–267 (1994)
7. Bergmann, Ø., Christiansen, O., Lie, J., Lundervold, A.: Shape-adaptive DCT for denoising of 3D scalar and tensor valued images. J. Digital Imaging 22(3), 297–308 (2009)
8. Besl, P.J., McKay, N.D.: A method for registration of 3D shapes. IEEE-TPAMI 14(2), 239–256 (1992)
9. Bresson, X., Chan, T.: Fast dual minimization of the vectorial total variation norm and applications to color image processing. Inverse Problems and Imaging 2(4), 455–484 (2008)
10. Del Bue, A., Xavier, J., Agapito, L., Paladini, M.: Bilinear Factorization via Augmented Lagrange Multipliers. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part IV. LNCS, vol. 6314, pp. 283–296. Springer, Heidelberg (2010)
11. Chambolle, A.: An algorithm for total variation minimization and applications. JMIV 20(1-2), 89–97 (2004)
12. Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision 40(1), 120–145 (2011)
13. Deriche, R., Tschumperle, D., Lenglet, C.: DT-MRI estimation, regularization and fiber tractography. In: ISBI, pp. 9–12 (2004)
14. Duits, R., Burgeth, B.: Scale Spaces on Lie Groups. In: Sgallari, F., Murli, A., Paragios, N. (eds.) SSVM 2007. LNCS, vol. 4485, pp. 300–312. Springer, Heidelberg (2007)
15. Dumortier, Y., Herlin, I., Ducrot, A.: 4D tensor voting motion segmentation for obstacle detection in autonomous guided vehicle. In: IEEE Int. Vehicles Symp., pp. 379–384 (2008)
16. Duval, V., Aujol, J.-F., Vese, L.A.: Mathematical modeling of textures: Application to color image decomposition with a projected gradient algorithm. JMIV 37(3), 232–248 (2010)
17. Fletcher, R.: Semi-definite matrix constraints in optimization. SIAM J. on Cont. and Optimization 23(4), 493–513 (1985)
18. Gibson, W.: On the least-squares orthogonalization of an oblique transformation. Psychometrika 27, 193–195 (1962)
19. Goldluecke, B., Strekalovskiy, E., Cremers, D.: The natural vectorial total variation which arises from geometric measure theory. SIAM J. Imag. Sci. (2012)
20. Goldstein, T., Bresson, X., Osher, S.: Geometric applications of the split Bregman method: Segmentation and surface reconstruction. J. of Sci. Comp. 45(1-3), 272–293
21. Gur, Y., Sochen, N.: Fast invariant Riemannian DT-MRI regularization. In: ICCV, pp. 1–7 (2007)
22. Hall, B.C.: Lie Groups, Lie Algebras, and Representations: An Elementary Introduction. Springer (2004)
23. Hestenes, M.R.: Multipliers and gradient methods. J. of Optimization Theory and Applications 4, 303–320 (1969)
24. Higham, N.J.: Matrix nearness problems and applications. In: Applications of Matrix Theory, pp. 1–27. Oxford University Press, Oxford (1989)
25. Kimmel, R., Sochen, N.: Orientation diffusion or how to comb a porcupine. Special Issue on PDEs in Image Processing, Comp. Vision, and Comp. Graphics, J. of Vis. Comm. and Image Representation 13, 238–248 (2002)
26. Larochelle, P.M., Murray, A.P., Angeles, J.: SVD and PD based projection metrics on SE(N). In: On Advances in Robot Kinematics, pp. 13–22. Kluwer (2004)
27. Lin, D., Grimson, W., Fisher, J.: Learning visual flows: A Lie algebraic approach. In: CVPR, pp. 747–754 (2009)
28. Lundervold, A.: On consciousness, resting state fMRI, and neurodynamics. Nonlinear Biomed. Phys. 4(Suppl. 1) (2010)
29. Mehran, R., Moore, B.E., Shah, M.: A Streakline Representation of Flow in Crowded Scenes. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part III. LNCS, vol. 6313, pp. 439–452. Springer, Heidelberg (2010)
30. Nicolescu, M., Medioni, G.: A voting-based computational framework for visual motion analysis and interpretation 27, 739–752 (2005)
31. Pennec, X., Fillard, P., Ayache, N.: A Riemannian framework for tensor computing. IJCV 66(1), 41–66 (2006)
32. Perona, P.: Orientation diffusions. IEEE Trans. Image Process. 7(3), 457–467 (1998)
33. Powell, M.J.: A method for nonlinear constraints in minimization problems. In: Optimization, pp. 283–298. Academic Press (1969)
34. Rahman, I.U., Drori, I., Stodden, V.C., Donoho, D.L., Schroeder, P.: Multiscale representations of manifold-valued data. Technical report, Stanford (2005)
35. Raptis, M., Soatto, S.: Tracklet Descriptors for Action Modeling and Video Analysis. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part I. LNCS, vol. 6311, pp. 577–590. Springer, Heidelberg (2010)
36. Rosman, G., Bronstein, M.M., Bronstein, A.M., Wolf, A., Kimmel, R.: Group-Valued Regularization Framework for Motion Segmentation of Dynamic Non-rigid Shapes. In: Bruckstein, A.M., ter Haar Romeny, B.M., Bronstein, A.M., Bronstein, M.M. (eds.) SSVM 2011. LNCS, vol. 6667, pp. 725–736. Springer, Heidelberg (2012)
37. Rosman, G., Wang, Y., Tai, X.-C., Kimmel, R., Bruckstein, A.M.: Fast regularization of matrix-valued images. Technical Report CIS-11-03, Technion (2011)
38. Rudin, L.I., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Physica D 60, 259–268 (1992)
39. Sochen, N.A., Sagiv, C., Kimmel, R.: Stereographic combing a porcupine or studies on direction diffusion in image processing. SIAM J. Appl. Math. 64(5), 1477–1508 (2004)
40. Steidl, G., Setzer, S., Popilka, B., Burgeth, B.: Restoration of matrix fields by second-order cone programming. Computing 81(2-3), 161–178 (2007)
41. Tai, X.-C., Wu, C.: Augmented Lagrangian method, dual methods and split Bregman iteration for ROF model. In: SSVM, pp. 502–513 (2009)
42. Tang, B., Sapiro, G., Caselles, V.: Diffusion of general data on non-flat manifolds via harmonic maps theory: The direction diffusion case. IJCV 36, 149–161 (2000)
43. Tschumperle, D., Deriche, R.: Vector-valued image regularization with PDEs: A common framework for different applications. IEEE-TPAMI 27, 506–517 (2005)
44. Tseng, P.: Coordinate ascent for maximizing nondifferentiable concave functions. LIDS-P 1940. MIT (1988)
45. Tuzel, O., Porikli, F., Meer, P.: Learning on Lie-groups for invariant detection and tracking. In: CVPR, pp. 1–8 (2008)
46. Vemuri, B.C., Chen, Y., Rao, M., McGraw, T., Wang, Z., Mareci, T.: Fiber tract mapping from diffusion tensor MRI. In: VLSM, pp. 81–88. IEEE Computer Society (2001)
47. Weickert, J., Brox, T.: Diffusion and regularization of vector- and matrix-valued images. Inverse problems, image analysis, and medical imaging, vol. 313 (2002)
48. Wen, Z., Goldfarb, D., Yin, W.: Alternating direction augmented Lagrangian meth-
ods for semidenite programming. CAAM TR09-42, Rice university (2009)
49. Wu, C., Tai, X.-C.: Augmented lagrangian method, dual methods, and split breg-
man iteration for ROF, vectorial TV, and high order models. SIAM J. Imaging
Sciences 3(3), 300339 (2010)
Blind Correction of Optical Aberrations

Christian J. Schuler, Michael Hirsch,
Stefan Harmeling, and Bernhard Schölkopf

Max Planck Institute for Intelligent Systems, Tübingen, Germany
{cschuler,mhirsch,harmeling,bs}@tuebingen.mpg.de
http://webdav.is.mpg.de/pixel/blind_lenscorrection/

Abstract. Camera lenses are a critical component of optical imaging systems, and lens imperfections compromise image quality. While traditionally, sophisticated lens design and quality control aim at limiting optical aberrations, recent works [1,2,3] promote the correction of optical flaws by computational means. These approaches rely on elaborate measurement procedures to characterize an optical system, and perform image correction by non-blind deconvolution.
In this paper, we present a method that utilizes physically plausible assumptions to estimate non-stationary lens aberrations blindly, and thus can correct images without knowledge of specifics of camera and lens. The blur estimation features a novel preconditioning step that enables fast deconvolution. We obtain results that are competitive with state-of-the-art non-blind approaches.

1 Introduction
Optical lenses image scenes by refracting light onto photosensitive surfaces. The
lens of the vertebrate eye creates images on the retina, the lens of a photographic
camera creates images on digital sensors. This transformation should ideally sat-
isfy a number of constraints formalizing our notion of a veridical imaging process.
The design of any lens forms a trade-off between these constraints, leaving us with residual errors that are called optical aberrations. Some errors are due to the fact that light coming through different parts of the lens cannot be focused onto a single point (spherical aberrations, astigmatism and coma), some
errors appear because refraction depends on the wavelength of the light (chro-
matic aberrations). A third type of error, not treated in the present work, leads
to a deviation from a rectilinear projection (image distortion). Camera lenses
are carefully designed to minimize optical aberrations by combining elements of
multiple shapes and glass types.
However, it is impossible to make a perfect lens, and it is very expensive to
make a close-to-perfect lens. A much cheaper solution is in line with the new
field of computational photography: correct the optical aberration in software.
To this end, we use non-uniform (non-stationary) blind deconvolution. Deconvo-
lution is a hard inverse problem, which implies that in practice, even non-blind
uniform deconvolution requires assumptions to work robustly. Blind deconvolu-
tion is harder still, since we additionally have to estimate the blur kernel, and

non-uniform deconvolution means that we have to estimate the blur kernels as a function of image position. The art of making this work consists of finding the right assumptions, sufficiently constraining the solution space while being at least approximately true in practice, and designing an efficient method to solve the inverse problem under these assumptions. Our approach is based on a forward model for the image formation process that incorporates two assumptions:

(a) The image contains certain elements typical of natural images, in particular,
there are sharp edges.
(b) Even though the blur due to optical aberrations is non-uniform (spatially
varying across the image), there are circular symmetries that we can exploit.

Inverting a forward model has the benefit that if the assumptions are correct, it will lead to a plausible explanation of the image, making it more credible than an image obtained by sharpening the blurry image using, say, an algorithm that filters the image to increase high frequencies.
Furthermore, we emphasize that our approach is blind, i.e., it requires as
an input only the blurry image, and not a point spread function that we may
have obtained using other means such as a calibration step. This is a substantial
advantage, since the actual blur depends not only on the particular photographic
lens but also on settings such as focus, aperture and zoom. Moreover, there are
cases where the camera settings are lost and the camera may even no longer be
available, e.g., for historic photographs.

2 Related Work and Technical Contributions

Correction of Optical Aberrations: The existing deconvolution methods to reduce blur due to optical aberrations are non-blind methods, i.e., they require a time-consuming calibration step to measure the point spread function (PSF) of the given camera-lens combination, and in principle they require this for all parameter settings. Early work is due to Joshi et al. [1], who used a calibration sheet to estimate the PSF. By finding sharp edges in the image, they also were able to remove chromatic aberrations blindly. Kee et al. [2] built upon this calibration method and looked at the problem of how lens blur can be modeled such that for continuous parameter settings like zoom, only a few discrete measurements are sufficient. Schuler et al. [3] use point light sources rather than a calibration sheet, and measure the PSF as a function of image location. The commercial software DxO Optics Pro (DXO) also removes "lens softness"1 relying on a previous calibration of a long list of lens/camera combinations referred to as "modules". Furthermore, Adobe's Photoshop comes with a "Smart Sharpener", correcting for lens blur after setting parameters for blur size and strength. It does not require knowledge about the lens used; however, it is unclear if a genuine PSF is inferred from the image, or the blur is just determined by the parameters.
1 http://www.dxo.com/us/photo/dxo_optics_pro/features/optics_geometry_corrections/lens_softness
Non-stationary Blind Deconvolution: The background for techniques of optical aberration deconvolution is recent progress in the area of removing camera shake. Beginning with Fergus et al.'s [4] method for camera shake removal, extending the work of Miskin and MacKay [5] with sparse image statistics, blind deconvolution became applicable to real photographs. With Cho and Lee's work [6], the running time of blind deconvolution has become acceptable. These early methods were initially restricted to uniform (space invariant) blur, and later extended to real world spatially varying camera blur [7,8]. Progress has also been made regarding the quality of the blur estimation [9,10], however, these methods are not yet competitive with the runtime of Cho and Lee's approach.

Technical Contributions: Our main technical contributions are as follows:

(a) we design a class of PSF families containing realistic optical aberrations, via
a set of suitable symmetry properties,
(b) we approximate this basis with an orthonormal one to improve conditioning and allow for direct PSF estimation,
(c) we avoid calibration to specific camera-lens combinations by proposing a blind approach for inferring the PSFs, widening the applicability to any photographs (e.g., with missing lens information such as historical images) and avoiding cumbersome calibration steps,
(d) we extend blur estimation to multiple color channels to remove chromatic aberrations as well, and finally
(e) we present experimental results showing that our approach is competitive
with non-blind approaches.

3 Spatially Varying Point Spread Functions

Optical aberrations cause image blur that is spatially varying across the image.
As such they can be modeled as a non-uniform point spread function (PSF), for
which Hirsch et al. [11] introduced the Efficient Filter Flow (EFF) framework,

y = Σ_{r=1}^{R} a^(r) ∗ ( w^(r) ⊙ x ) ,   (1)

Fig. 1. Optical aberration as a forward model (the blurry image results from combining blur parameters with a PSF basis)


where x denotes the ideal image and y is the image degraded by optical aberration. In this paper, we assume that x and y are discretely sampled images, i.e., x and y are finite-sized matrices whose entries correspond to pixel intensities. w(r) is a weighting matrix that masks out all of the image x except for a local patch by Hadamard multiplication (symbol ⊙, pixel-wise product). The r-th patch is convolved (symbol ∗) with a local blur kernel a(r), also represented as a matrix. All blurred patches are summed up to form the degraded image. The more patches are considered (R is the total number of patches), the better the approximation to the true non-uniform PSF. Note that the patches defined by the weighting matrices w(r) usually overlap to yield smoothly varying blurs. The weights are chosen such that they sum up to one for each pixel. In [11] it is shown that this forward model can be computed efficiently by making use of the short-time Fourier transform.

4 An EFF Basis for Optical Aberrations


Since optical aberrations lead to image degradations that can be locally mod-
eled as convolutions, the EFF framework is a valid model. However, not all blurs
expressible in the EFF framework do correspond to blurs caused by optical aber-
rations. We thus define a PSF basis that constrains EFF to physically plausible
PSFs only.
To define the basis we introduce a few notions. The image y is split into
overlapping patches, each characterized by the weights w(r) . For each patch, the
symbol lr denotes the line from the patch center to the image center, and dr the
length of line lr , i.e., the distance between patch center and image center. We
assume that local blur kernels a(r) originating from optical aberrations have the
following properties:
(a) Local Reflection Symmetry: a local blur kernel a(r) is reflection symmetric with respect to the line lr.
(b) Global Rotation Symmetry: two local blur kernels a(r) and a(s) at the
same distance to the image center (i.e., dr = ds ) are related to each other
by a rotation around the image center.
(c) Radial Behavior: along a line through the image center, the local blur
kernels change smoothly. Furthermore, the maximum size of a blur kernel is
assumed to scale linearly with its distance to the image center.
Note that these properties are compromises that lead to good approximations
of real-world lens aberrations.2
For two-dimensional blur kernels, we represent the basis by K basis elements b_k, each consisting of R local blur kernels b_k^(1), ..., b_k^(R). Then the actual blur kernel a^(r) can be represented as a linear combination of basis elements,
2 Due to issues such as decentering, real-world lenses may not be absolutely rotationally symmetric. Schuler et al.'s exemplar of the Canon 24mm f/1.4 (see below) exhibits PSFs that deviate slightly from the local reflection symmetry. The assumption, however, still turns out to be useful in that case.

a^(r) = Σ_{k=1}^{K} μ_k b_k^(r) .   (2)

To define the basis elements we group the patches into overlapping groups, such that each group contains all patches inside a certain ring around the image center, i.e., the center distance dr determines whether a patch belongs to a particular group. Basis elements for three example groups are shown in Figure 2. All patches
inside a group will be assigned similar kernels. The width and the overlap of
the rings determine the amount of smoothness between groups (see property (c)
above).
For a single group we define a series of basis elements as follows. For each
patch in the group we generate matching blur kernels by placing a single delta
peak inside the blur kernel and then mirror the kernel with respect to the line lr
(see, Figure 3). For patches not in the current group (i.e., in the current ring),
the corresponding local blur kernels are zero. This generation process creates
basis elements that fulfill the symmetry properties listed above. To increase smoothness of the basis and avoid effects due to pixelization, we place little
Gaussian blurs (standard deviation 0.5 pixels) instead of delta peaks.

Fig. 2. Three example groups of patches, each forming a ring

(a) outside, parallel to lr   (b) inside, parallel to lr   (c) perpendicular to lr

Fig. 3. Shifts to generate basis elements for the middle group of Figure 2

 





        

Fig. 4. SVD spectrum of a typical basis matrix B with cut-off

5 An Orthonormal EFF Basis


The basis elements constrain possible blur kernels to fulfill the above symmetry and smoothness properties. However, the basis is overcomplete and direct projection on the basis is not possible. Therefore we approximate it with an orthonormal one. To explain this step with matrices, we reshape each basis element as a column vector by vectorizing (operator vec) each local blur kernel b_k^(r) and stacking them for all patches r:

b_k = [ (vec b_k^(1))^T ... (vec b_k^(R))^T ]^T .   (3)

Let B be the matrix containing the basis vectors b_1, ..., b_K as columns. Then we can calculate the singular value decomposition (SVD) of B,

B = U S V^T ,   (4)

with S being a diagonal matrix containing the singular values of B. Figure 4 shows the SVD spectrum and the chosen cut-off of some typical basis matrix B, with approximately half of the eigenvalues being below numerical precision.
We define an orthonormal EFF basis as the matrix that consists of the column vectors of U that correspond to large singular values, i.e., that contains the relevant left singular vectors of B. Properly chopping these column vectors into shorter vectors, one per patch, and reshaping those back to the blur kernel, we obtain an orthonormal basis b̃_k^(r) for the EFF framework that is tailored to optical aberrations. This representation can be plugged into the EFF forward
model in Eq. (1),

y = μ ⊛ x := Σ_{k=1}^{K} Σ_{r=1}^{R} μ_k ( b̃_k^(r) ∗ ( w^(r) ⊙ x ) ) .   (5)

Note that the resulting forward model is linear in the parameters μ.
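A minimal sketch of the orthonormalization step of Section 5, assuming the overcomplete basis has already been assembled into a matrix B with one column per basis element (Eq. 3); the cut-off rule and function names are illustrative choices, and the paper's implementation instead uses SVDLIBC on large sparse matrices.

```python
import numpy as np

def orthonormalize_basis(B, rel_cutoff=1e-10):
    """Return the orthonormal EFF basis: left singular vectors of B kept
    above a relative cut-off (Eqs. 3-4, Figure 4)."""
    U, s, _ = np.linalg.svd(B, full_matrices=False)
    keep = s > rel_cutoff * s[0]      # discard directions near numerical precision
    return U[:, keep]

def project_onto_basis(a_stacked, U_keep):
    """Project stacked, vectorized local kernels onto the orthonormal basis
    and reconstruct the constrained kernels (used in the PSF estimation step)."""
    mu = U_keep.T @ a_stacked         # blur parameters
    return mu, U_keep @ mu
```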

6 Blind Deconvolution with Chromatic Shock Filtering


Having defined a PSF basis, we perform blind deconvolution by extending [6]
to our non-uniform blur model (5) (similar to [13,8]). However, instead of con-
sidering only a gray-scale image during PSF estimation, we are processing the
Original    Blurry    Shock filter [12]    Chromatic shock filter

Fig. 5. Chromatic shock filter removes color fringing (adapted from [12])

full color image. This allows us to better address chromatic aberrations by an improved shock filtering procedure that is tailored to color images: the color channels x_R, x_G and x_B are shock filtered separately but share the same sign expression depending only on the gray-scale image z:

x_R^{t+1} = x_R^t − Δt · sign(z_ηη^t) |∇x_R^t|
x_G^{t+1} = x_G^t − Δt · sign(z_ηη^t) |∇x_G^t|      with z^t = (x_R^t + x_G^t + x_B^t)/3      (6)
x_B^{t+1} = x_B^t − Δt · sign(z_ηη^t) |∇x_B^t|

where z_ηη denotes the second derivative in the direction of the gradient. We call this extension chromatic shock filtering since it takes all three color channels simultaneously into account. Figure 5 shows the reduction of color fringing on the example of Osher and Rudin [12] adapted to three color channels.
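The chromatic shock filter of Eq. (6) moves every color channel with the sign of the gray-scale second derivative along the gradient direction. The finite-difference sketch below shows one iteration; the step size is an assumed value, and [6] additionally combines this with bilateral filtering.

```python
import numpy as np

def chromatic_shock_step(x_rgb, dt=0.1):
    """One iteration of the chromatic shock filter (Eq. 6).

    x_rgb : (H, W, 3) color image; all channels share sign(z_etaeta) computed
            from the gray-scale image z = (x_R + x_G + x_B) / 3.
    """
    z = x_rgb.mean(axis=2)
    zy, zx = np.gradient(z)
    zyy, _ = np.gradient(zy)
    zxy, zxx = np.gradient(zx)
    grad2 = zx**2 + zy**2 + 1e-12
    # second derivative of z in the direction of its gradient
    z_etaeta = (zx**2 * zxx + 2.0 * zx * zy * zxy + zy**2 * zyy) / grad2
    s = np.sign(z_etaeta)

    out = np.empty_like(x_rgb)
    for c in range(3):
        gy, gx = np.gradient(x_rgb[..., c])
        out[..., c] = x_rgb[..., c] - dt * s * np.sqrt(gx**2 + gy**2)
    return out
```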
Combining the forward model y = μ ⊛ x defined above and the chromatic shock filtering, the PSF parameters μ and the image x (initialized by y) are estimated by iterating over three steps:
(a) Prediction Step: the current estimate x is first denoised with a bilateral filter, then edges are emphasized with chromatic shock filtering and by zeroing flat gradient regions in the image (see [6] for further details). The gradient selection is modified such that for every radius ring the strongest gradients are selected.
(b) PSF Estimation: if we work with the overcomplete basis B, we would like to find coefficients μ that minimize the regularized fit of the gradient images ∇y and ∇x,

‖∇y − Σ_{r=1}^{R} (B^(r) μ) ∗ (w^(r) ⊙ ∇x)‖² + α Σ_{r=1}^{R} ‖∇(B^(r) μ)‖² + β Σ_{r=1}^{R} ‖B^(r) μ‖²   (7)

where B^(r) is the matrix containing the basis elements for the r-th patch. Note that μ is the same for all patches. This optimization can be performed iteratively. The regularization parameters α and β are set to 0.1 and 0.01, respectively.
However, the iterations are costly, and we can speed up things by using the orthonormal basis. The blur is initially estimated unconstrained and then projected onto the orthonormal basis. In particular, we first minimize the
fit of the general EFF forward model (without the basis) with an additional regularization term on the local blur kernels, i.e., we minimize

‖∇y − Σ_{r=1}^{R} a^(r) ∗ (w^(r) ⊙ ∇x)‖² + α Σ_{r=1}^{R} ‖∇a^(r)‖² + β Σ_{r=1}^{R} ‖a^(r)‖²   (8)

This optimization problem is approximately minimized using a single step of direct deconvolution in Fourier space, i.e.,

a^(r) ≈ C_r^T F^H [ ( conj(F Z_x ∇x) ⊙ (F E_r Diag(w^(r)) Z_y ∇y) ) / ( |F Z_x ∇x|² + α |F Z_l l|² + β ) ]   for all r,   (9)

where l = [1, −2, 1]^T denotes the discrete Laplace operator, F the discrete Fourier transform, and Z_x, Z_y, Z_l, C_r and E_r appropriate zero-padding and cropping matrices. |u| denotes the entry-wise absolute value of a complex vector u, conj(u) its entry-wise complex conjugate. The fraction has to be implemented pixel-wise.
Finally, the resulting unconstrained blur kernels a^(r) are projected onto the orthonormal basis, leading to the estimate of the blur parameters μ.
(c) Image Estimation: For image estimation given the blurry image y and blur parameters μ, we apply Tikhonov regularization with weight γ = 0.01 on the gradients of the latent image x, i.e.

‖y − μ ⊛ x‖² + γ ‖∇x‖² .   (10)

As shown in [8], this expression can be approximately minimized with respect to x using a single step of the following direct deconvolution:

x ≈ N ⊙ Σ_r C_r^T F^H [ ( conj(F Z_b a^(r)) ⊙ (F E_r Diag(w^(r)) Z_y y) ) / ( |F Z_b a^(r)|² + γ |F Z_l l|² ) ] ,   (11)

where a^(r) = Σ_k μ_k b̃_k^(r) is the local blur kernel of the r-th patch, l = [1, −2, 1]^T denotes the discrete Laplace operator, F the discrete Fourier transform, and Z_b, Z_y, Z_l, C_r and E_r appropriate zero-padding and cropping matrices. |u| denotes the entry-wise absolute value of a complex vector u, conj(u) its entry-wise complex conjugate. The fraction has to be implemented pixel-wise. The normalization factor N accounts for artifacts at patch boundaries which originate from windowing (see [8]); a minimal single-patch sketch of this direct deconvolution is given after this list.
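To make the direct deconvolution of Eqs. (9) and (11) concrete, the sketch below solves the uniform, single-patch special case (one kernel for the whole image, periodic boundaries). The per-patch windowing, the padding/cropping matrices and the normalization N of the full method are omitted, and the Laplacian stencil used here is a 2D variant of the paper's 1D stencil, i.e., an assumption.

```python
import numpy as np

def direct_deconv_image(y, kernel, gamma=0.01):
    """Uniform, single-patch sketch of the direct image update (cf. Eq. 11):
    argmin_x ||y - a * x||^2 + gamma ||grad x||^2 via one pixel-wise division
    in Fourier space. The result is defined up to the circular shift given by
    the kernel origin."""
    H, W = y.shape
    A = np.fft.fft2(kernel, s=(H, W))
    # gradient penalty expressed through a 2D Laplacian stencil
    # (the paper uses the 1D stencil l = [1, -2, 1]^T)
    lap = np.zeros((H, W))
    lap[0, 0] = -4.0
    lap[0, 1] = lap[0, -1] = lap[1, 0] = lap[-1, 0] = 1.0
    L = np.fft.fft2(lap)
    Y = np.fft.fft2(y)
    X = (np.conj(A) * Y) / (np.abs(A) ** 2 + gamma * np.abs(L) ** 2)
    return np.real(np.fft.ifft2(X))
```

The kernel update of Eq. (9) has the same structure, with the roles of the known image and the unknown kernel exchanged and the additional constant β in the denominator.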
Similar to [6] and [8], the algorithm follows a coarse-to-fine approach. Having estimated the blur parameters μ, we use a non-uniform version of Krishnan and Fergus' approach [14,8] for the non-blind deconvolution to recover a high-quality estimate of the true image. For the x-sub-problem we use the direct deconvolution formula (11).

7 Implementation and Running Times


The algorithm is implemented on a Graphics Processing Unit (GPU) in Python
using PyCUDA.3 All experiments were run on a 3.0 GHz Intel Xeon with an
3
http://mathema.tician.de/software/pycuda
NVIDIA Tesla C2070 GPU with 6GB of memory. The basis elements gener-
ated as detailed in Section 4 are orthogonalized using the SVDLIBC library4 .
Calculating the SVD for the occurring large sparse matrices can require a few
minutes of running times. However, the basis is independent of the image con-
tent, so we can compute the orthonormal basis once and reuse it. Table 1 reports
the running times of our experiments for both PSF and nal non-blind decon-
volution along with the EFF parameters and image dimensions. In particular, it
shows that using the orthonormal basis instead of the overcomplete one improves
the running times by a factor of about six to eight.

Table 1. (a) Image sizes, (b) size of the local blur kernels, (c) number of patches horizontally and vertically, (d) runtime of PSF estimation using the overcomplete basis B (see Eq. (7)), (e) runtime of PSF estimation using the orthonormal basis (see Eq. (8)) as used in our approach, (f) runtime of the final non-blind deconvolution.

image        image dims   local blur   patches   using B    orthonormal   NBD
bridge       2601×1733    19×19        10×8      127 sec    16 sec        1.4 sec
bench        1097×730     81×81        10×6      85 sec     14 sec        0.7 sec
historical   2191×1464    29×29        10×6      103 sec    13 sec        1.0 sec
facade       2817×1877    29×29        12×8      166 sec    21 sec        1.7 sec
             (a)          (b)          (c)       (d)        (e)           (f)

8 Results
In the following, we show results on real photos and do a comprehensive com-
parison with other approaches for removing optical aberrations. Image sizes and
blur parameters are shown in Table 1.

8.1 Schuler et al.'s Lens (120mm)

Schuler et al. show deblurring results on images taken with a lens that consists
only of a single element, thus exhibiting strong optical aberrations, in particular
coma. Since their approach is non-blind, they measure the non-uniform PSF with
a point source and apply non-blind deconvolution. In contrast, our approach is
blind and is directly applied to the blurry image.
To better approximate the large blur of that lens, we additionally assume that
the local blurs scale linearly with radial position, which can be easily incorpo-
rated into our basis generation scheme. For comparison, we apply Photoshops
Smart Sharpening function for removing lens blur. It depends on the blur size
and the amount of blur, which are manually controlled by the user. Thus we
call this method semi-blind since it assumes a parametric form. Even though we
choose its parameters carefully, we are not able to obtain comparable results.

4
http://tedlab.mit.edu/~dr/SVDLIBC/
Comparing our blind method against the non-blind approach of [3], we observe
that our estimated PSF matches their measured PSFs rather well (see Figure 7).
However, surprisingly we are getting an image that may be considered sharper.
The reason could be over-sharpening or a less conservative regularization in the final deconvolution; it is also conceivable that the calibration procedure used by [3] is not sufficiently accurate. Note that neither DXO nor Kee et al.'s approach
can be applied, lacking calibration data for this lens.

8.2 Canon 24mm f/1.4


The PSF constraints we are considering assume local axial symmetry of the PSF
with respect to the radial axis. For a Canon 24mm f/1.4 lens also used in [3],
this is not exactly fulfilled, which can be seen in the inset in Figure 8. The
wings of the green blur do not have the same length. Nonetheless, our blind
estimation with enforced symmetry still approximates the PSF shape well and
yields a comparable quality of image correction. We stress the fact that this was
obtained blindly in contrast to [3].

8.3 Kee et al.'s Image


Figure 9 shows results on an image taken from Kee et al. [2]. The close-ups re-
veal that Kee's non-blind approach is slightly superior in terms of sharpness and
noise-robustness. However, our blind approach better removes chromatic aberra-
tion. A general problem of methods relying on a prior calibration is that optical
aberrations depend on the wavelength of the transmitting light continuously: an
approximation with only a few (generally three) color channels therefore depends
on the lighting of the scene and could change if there is a discrepancy between
the calibration setup and a photo's lighting conditions. This is avoided with a
blind approach.
We also apply DxO Optics Pro 7.2 to the blurry image. DXO uses a database
for combinations of cameras/lenses. While it uses calibration data, it is not clear
whether it additionally infers elements of the optical aberration from the image.
For comparison, we process the photo with the options "chromatic aberrations" and "DxO lens softness" set to their default values. The result is good and exhibits less noise than the other two approaches (see Figure 9); however, it is not clear if an additional denoising step is employed by the software.

8.4 Historical Images


A blind approach to removing optical aberrations can also be applied to histor-
ical photos, where information about the lens is not available. The left column
of Figure 10 shows a photo (and some detail) from the Library of Congress
archive that was taken around 1940.⁵ Assuming that the analog film used has a sufficiently linear light response, we applied our blind lens correction method
5
http://www.loc.gov/pictures/item/fsa1992000018/PP/
Blurred image    Our approach (blind)
Adobe's "Smart Sharpen" (semi-blind)    Schuler et al. [3] (non-blind)

Fig. 6. Schuler et al.'s lens. Full image and lower left corner
(a) blindly estimated by our approach (b) measured by Schuler et al. [3]

Fig. 7. Schuler et al.'s lens. Lower left corner of the PSF

Blurred image    Our approach (blind)    Schuler et al. [3] (non-blind)

Fig. 8. Canon 24mm f/1.4 lens. Shown is the upper left corner of the image. The PSF inset is three times the original size.

Blurry image Our approach Kee et al. DXO


(blind) (non-blind) (non-blind)

Fig. 9. Comparison between our blind approach and two non-blind approaches of Kee et al. [2] and DXO
Blurry image    Our approach    Adobe's "Smart Sharpen"


(blind) (semi-blind)

Fig. 10. Historical image from 1940

and obtained a sharper image. However, the blur appeared to be small, so al-
gorithms like Adobe's "Smart Sharpen" also give reasonable results. Note that neither DXO nor Kee et al.'s approach can be applied here since lens data is not
available.

9 Conclusion
We have proposed a method to blindly remove spatially varying blur caused by
imperfections in lens designs, including chromatic aberrations. Without relying
on elaborate calibration procedures, results comparable to non-blind methods
can be achieved. By creating a suitable orthonormal basis, the PSF is constrained
to a class that exhibits the generic symmetry properties of lens blurs, while fast
PSF estimation is possible.

9.1 Limitations
Our assumptions about the lens blur are only an approximation for lenses with
poor rotation symmetry.
The image prior used in this work is only suitable for natural images, and is
hence content specic. For images containing only text or patterns, this would
not be ideal.

9.2 Future Work


While it is useful to be able to infer the image blur from a single image, it does
not change for photos taken with the same lens settings. On the one hand, this
implies that we can transfer the PSFs estimated for these settings for instance
to images where our image prior assumptions are violated. On the other hand, it
suggests the possibility to improve the quality of the PSF estimates by utilizing
a substantial database of images.
Finally, while optical aberrations are a major source of image degradation,
a picture may also suer from motion blur. By choosing a suitable basis, these
two effects could be combined. It would also be interesting to see if non-uniform motion deblurring could profit from a direct PSF estimation step as introduced
in the present work.

References
1. Joshi, N., Szeliski, R., Kriegman, D.: PSF estimation using sharp edge prediction.
In: Proc. IEEE Conf. Comput. Vision and Pattern Recognition (June 2008)
2. Kee, E., Paris, S., Chen, S., Wang, J.: Modeling and removing spatially-varying
optical blur. In: Proc. IEEE Int. Conf. Computational Photography (2011)
3. Schuler, C., Hirsch, M., Harmeling, S., Schölkopf, B.: Non-stationary correction of
optical aberrations. In: Proc. IEEE Intern. Conf. on Comput. Vision (2011)
4. Fergus, R., Singh, B., Hertzmann, A., Roweis, S., Freeman, W.: Removing camera
shake from a single photograph. ACM Trans. Graph. 25 (2006)
5. Miskin, J., MacKay, D.: Ensemble learning for blind image separation and decon-
volution. Advances in Independent Component Analysis (2000)
6. Cho, S., Lee, S.: Fast Motion Deblurring. ACM Trans. Graph. 28(5) (2009)
7. Whyte, O., Sivic, J., Zisserman, A., Ponce, J.: Non-uniform deblurring for shaken
images. In: Proc. IEEE Conf. Comput. Vision and Pattern Recognition (2010)
8. Hirsch, M., Schuler, C., Harmeling, S., Schölkopf, B.: Fast removal of non-uniform
camera shake. In: Proc. IEEE Intern. Conf. on Comput. Vision (2011)
9. Krishnan, D., Tay, T., Fergus, R.: Blind deconvolution using a normalized sparsity
measure. In: Proc. IEEE Conf. Comput. Vision and Pattern Recognition (2011)
10. Levin, A., Weiss, Y., Durand, F., Freeman, W.: Efficient marginal likelihood opti-
mization in blind deconvolution. In: Proc. IEEE Conf. Comput. Vision and Pattern
Recognition. IEEE (2011)
11. Hirsch, M., Sra, S., Schölkopf, B., Harmeling, S.: Efficient filter flow for space-
variant multiframe blind deconvolution. In: Proc. IEEE Conf. Comput. Vision and
Pattern Recognition (2010)
12. Osher, S., Rudin, L.: Feature-oriented image enhancement using shock filters. SIAM
J. Numerical Analysis 27(4) (1990)
13. Harmeling, S., Hirsch, M., Schölkopf, B.: Space-variant single-image blind decon-
volution for removing camera shake. In: Advances in Neural Inform. Processing
Syst. (2010)
14. Krishnan, D., Fergus, R.: Fast image deconvolution using hyper-Laplacian priors.
In: Advances in Neural Inform. Process. Syst. (2009)
Inverse Rendering of Faces on a Cloudy Day

Oswald Aldrian and William A.P. Smith

Department of Computer Science
University of York, YO10 5GH, UK
{oswald,wsmith}@cs.york.ac.uk

Abstract. In this paper we consider the problem of inverse rendering faces under unknown environment illumination using a morphable model. In contrast to previous approaches, we account for global illumination effects by incorporating statistical models for ambient occlusion and bent normals into our image formation model. We show that solving for ambient occlusion and bent normal parameters as part of the fitting process improves the accuracy of the estimated texture map and illumination environment. We present results on challenging data, rendered under complex natural illumination with both specular reflectance and occlusion of the illumination environment.

1 Introduction
The appearance of a face in an image is determined by a combination of intrinsic
and extrinsic factors. The intrinsic properties of a face include its shape and
reectance properties (which vary spatially, giving rise to parameter maps or,
in the case of diuse albedo, texture maps). The extrinsic properties of the
image include illumination conditions, camera properties and viewing conditions.
Inverse rendering seeks to recover intrinsic properties from an image of an object.
These can subsequently be used for recognition or re-rendering under novel pose
or illumination.
The forward rendering process is very well understood and physically-based
rendering tools allow for photorealistic rendering of human faces. The inverse
process on the other hand is much more challenging. Perhaps the best known
results apply to convex Lambertian objects. In this case, reflectance is a function
solely of the local surface normal direction and irradiance (even under complex
environment illumination) can be accurately described using a low-dimensional
spherical harmonic approximation [1]. This observation underpins the successful
appearance-based approaches to face recognition.
However, faces are not globally convex and it has been shown that occlusions
of the illumination environment play an important role in human perception of
3D shape [2,3]. Prados et al. [4] have shown how shading caused by occlusion
under perfectly ambient illumination can be used to estimate 3D shape. In this
paper we take a step towards incorporating global illumination effects into the inverse rendering process. This is done in the context of fitting a 3D morphable
face model, so the texture is subject to a global statistical constraint.

We use a model which incorporates ambient occlusion [5] and bent normals
[6] into the image formation process. This is an approximation to the rendering
equation that is popular in graphics, because it can be precomputed and subse-
quently used in realtime rendering applications. Both properties are a function
of the 3D shape of an object. Figure 1 illustrates this concept graphically. For
non-convex objects, using surface normals to model illumination with spherical
harmonics leads to an approximation error; a problem well understood and often
referred to within the vision and graphics communities. We focus on a certain
object class, human faces, which possess non-convex regions. If the object shown
in Figure 1 is illuminated by a uniform distant light source, a further approxi-
mation error occurs: the entire object will appear un-shaded. In reality however,
at points on the surface where the upper hemisphere relative to the surface normal
intersects other parts of the scene (including the object itself), shading occurs
that is proportional to the degree of occlusion; a phenomenon known as ambient
shading.

a) Point of interest   b) Upper hemisphere   c) Un-occluded directions   d) Surface normal   e) Bent normal

Fig. 1. Spherical lighting modelled via surface normals results in a systematic error for
non-convex parts of a shape. Bent normals (the direction of least occlusion) provide a
more accurate approximation to the true underlying process.

Accurately calculating ambient occlusion and bent normals for a given shape is
computationally expensive. In this paper, we build a statistical model of ambient
occlusion and bent normals and learn the relationship between shape parameters
and ambient occlusion/bent normal parameters. This means an initial estimate of
the parameters can be made by statistical inference from the shape parameters.
During the inverse rendering process, the parameters are refined to best describe
the observed image. The result is that the texture map is not corrupted by dark
pixels in occluded regions and the accuracy of the estimated texture map and
illumination environment is increased. We present results on synthetic images
with corresponding ground truth.

2 Image Formation Process


We allow unconstrained illumination of arbitrary colour. Our image formation
process models additive Lambertian and specular terms. The Lambertian term
further distinguishes between an ambient term, which is pre-multiplied with an
occlusion model and an unconstrained part.

2.1 The Physical Image Formation Process


Consider a surface with additive diffuse (Lambertian) and specular reflectance illuminated by a distant spherical environment. The image irradiance at a point p with local surface normal n_p is given by an integral over the upper hemisphere Ω(n_p):

i_p = ∫_{Ω(n_p)} L(ω) V_{p,ω} [ ρ_p (n_p · ω) + s(n_p, ω, v) ] dω,   (1)

where L(ω) is the illumination function (i.e. the incident radiance from direction ω). V_{p,ω} is the visibility function, defined to be zero if p is occluded in the direction ω and one otherwise. ρ_p is the spatially varying diffuse albedo and we assume specular reflectance properties are constant over the surface.
A common assumption in computer vision is that the object under study is convex, i.e.: ∀p, ∀ω ∈ Ω(n_p): V_{p,ω} = 1. The advantage of this assumption is that the image irradiance reduces to a function of local normal direction which can be efficiently characterised by a low order spherical harmonic.
However, under point source illumination this corresponds to an assumption
of no cast shadows and under environment illumination it neglects occlusion of
regions of the illumination environment. In both cases, this can lead to a large
discrepancy between the modelled and observed intensity and, in the context of
inverse rendering, distortion of the estimated texture. Heavily occluded regions
are interpreted as regions with dark texture.
Many approximations to Equation 1 have been proposed in the graphics lit-
erature and several could potentially be incorporated into an inverse rendering
formulation. However, in this paper our aim is to demonstrate that even the
simplest approximation yields an improvement in inverse rendering accuracy.
Specifically, we use the ambient occlusion and bent normal model proposed by Zhukov et al. [5] and Landis [6]. Ambient occlusion is based on the simplification that the visibility term can be moved outside the integral:

i_p = a_p ∫_{Ω(n_p)} L(ω) ( ρ_p (n_p · ω) + s(n_p, ω, v) ) dω,   (2)

where the ambient occlusion a_p ∈ [0, 1] at a point p is given by:

a_p = (1 / 2π) ∫_{Ω(n_p)} V_{p,ω} (n_p · ω) dω.   (3)

Under this model, light from all directions is equally attenuated, i.e. the di-
rectional dependence of illumination and visibility are treated separately. For a
perfectly ambient environment (i.e. ∀ω ∈ S², L(ω) = k), the approximation is exact. Otherwise, the quality of the approximation depends on the nature of the illumination environment and reflectance properties of the surface. An extension
to ambient occlusion is the so-called bent normal. This is the average unoccluded
direction. It attempts to capture the dominant direction from which light arrives
at a point and is used in place of the surface normal for rendering.
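For intuition, ambient occlusion and the bent normal at a vertex can be estimated by Monte Carlo sampling of the upper hemisphere. The sketch below assumes a user-supplied ray-occlusion test `is_occluded(p, omega)` and normalizes so that an unoccluded vertex receives the value 1 (consistent with Section 3.3, and possibly differing from Eq. (3) by a constant factor); it is not the Meshlab computation used in the paper.

```python
import numpy as np

def ambient_terms(p, n, is_occluded, n_samples=256, rng=None):
    """Monte Carlo sketch of per-vertex ambient occlusion and bent normal.

    p, n        : 3D point and unit surface normal
    is_occluded : callable(point, direction) -> bool; hypothetical ray test
    Returns a_p (1 for a fully unoccluded vertex) and the bent normal
    (mean unoccluded direction, unit length).
    """
    rng = np.random.default_rng() if rng is None else rng
    vis_sum, cos_sum = 0.0, 0.0
    bent = np.zeros(3)
    drawn = 0
    while drawn < n_samples:
        omega = rng.normal(size=3)
        omega /= np.linalg.norm(omega)
        cos_t = float(omega @ n)
        if cos_t <= 0.0:                     # restrict to the upper hemisphere Omega(n_p)
            continue
        drawn += 1
        cos_sum += cos_t
        if not is_occluded(p, omega):
            vis_sum += cos_t                 # cosine-weighted visibility, as in Eq. (3)
            bent += omega
    a_p = vis_sum / cos_sum
    norm = np.linalg.norm(bent)
    bent = n if norm == 0 else bent / norm   # fall back to n if fully occluded
    return a_p, bent
```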
2.2 Model Approximation of the Image Formation Process


The image formation model stated in Equation 1 is approximated using the following multilinear system:

I_mod = (T b + t̄) ⊙ [ (H_a l_a) ⊙ (O c + ō) + H_b l_b ] + S l_x .   (4)

The factors and terms are defined as follows:
T b + t̄ : diffuse albedo      H_a l_a : DC lighting component      O c + ō : ambient occlusion
H_b l_b : diffuse lighting     S l_x : specular contribution.
Given a 3D shape and the camera projection matrix, the unknown photometric coefficients are: b, l_a, c, l_b and l_x.
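As a plain restatement of Eq. (4), the following sketch evaluates the multilinear model for given coefficients; the array shapes (one row per observed value) and the handling of color are simplifying assumptions rather than the authors' implementation.

```python
import numpy as np

def render_model(T, t_mean, O, o_mean, Ha, Hb, S, b, c, la, lb, lx):
    """Evaluate I_mod = (Tb + t_mean) . [(Ha la) . (Oc + o_mean) + Hb lb] + S lx."""
    albedo = T @ b + t_mean            # diffuse albedo
    occlusion = O @ c + o_mean         # ambient occlusion
    ambient = (Ha @ la) * occlusion    # occluded DC lighting component
    diffuse = Hb @ lb                  # unconstrained diffuse lighting
    return albedo * (ambient + diffuse) + S @ lx
```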

2.3 Inverse Rendering


We estimate the unknown coefficients given a single 2D image. To get correspondence between a subset of the model vertices and the image, we fit a statistical shape model using the method described in [7]. Assuming an affine mapping,
the method alternates between estimating rigid and non-rigid transformations.
Given a 3D shape, we have access to surface normals, which can be used to
construct spherical basis functions.
In this paper, we propose to construct a spherical harmonic basis from bent
normals rather than surface normals. Surface normals are still of interest to us
for two reasons. Firstly, they serve as reference for the proposed modification, and secondly, they are incorporated as part of a joint model which efficiently infers bent normals from 3D shape. The image formation model (Equation 4), and its algebraic solution with respect to the unknowns is not affected by this
substitution.
The basis set H_a ∈ R^{N×3} contains ambient constants per colour channel and is independent of the normals. H_b ∈ R^{N×24} contains linear and quadratic terms with respect to the normals. Similarly, the set S ∈ R^{N×s} contains higher order
approximations (up to polynomial degree s), which are used to model specular
contributions. However, in the specular case, the normals are rotated about
the viewing direction, v. The three sets are sparse and can directly be inferred
given a shape estimate. The parameters la , lb and lx depend on a single lighting
function l, and are coupled via Lambertian and specular BRDF parameters.
We use the method described in [8] to obtain the lighting function from the
reflectance parameters. In order to prevent overfitting, we add prior terms for
texture and ambient occlusion to the objective function: E(b, la , lb , lx , c). Both
priors penalise complexity and the overall objective takes the following form:
E = ‖I_mod − I_obs‖² + ‖b‖² + ‖c‖² ,   (5)
where Iobs are RGB measurements mapped onto the visible vertices of the pro-
jected shape model. Under the assumption that each aspect contributes indepen-
dently to the objective, the system can be solved with a global optimum for each
parameter-set. As opposed to conventional multilinear systems, which can only be solved up to global scale, our system factors the mean for texture and occlusion; this property makes the solution unique. We equate the partial derivatives
of Equation 5 to zero and obtain closed-form solutions for each parameter-set.
In order to preserve source integrity, we estimate la and lb jointly in one step.

3 Statistical Modelling

Our proposed framework requires five statistical models: a PCA model for 3D shape, diffuse albedo and ambient occlusion, respectively, and PGA models for surface normals and bent normals. Coefficients for global shape, texture and ambient occlusion are obtained directly in the fitting pipeline. Coefficients for bent normals on the other hand cannot be obtained in the same way, due to
bent normals on the other hand cannot be obtained in the same way, due to
higher order dependencies. We therefore propose a generative method to infer
bent normals from 3D shape. A supplemental surface normal model is used to
reduce generalisation error. Comparative results to using vertex data only are
presented in the experimental section.

3.1 Shape Model


We use a 3D morphable model (3DMM) [9] to describe global shape. The model
transforms 3D shape to a low-dimensional parameter space and provides a prior
to ensure reconstructions correspond to plausible face shapes. A 3DMM is con-
structed from m face meshes which are in dense correspondence. Each mesh con-
sists of p vertices and is written as a vector v = [x_1 y_1 z_1 ... x_p y_p z_p]^T ∈ R^n, where n = 3p. Applying principal components analysis (PCA) to the data matrix formed results in m−1 eigenvectors V_i, their corresponding variances σ²_{a,i} and the mean shape v̄. Any face shape can be approximated as a linear combination of the modes of variation:

v = v̄ + Σ_{i=1}^{m−1} a_i V_i ,   (6)

where a = [a_1 ... a_{m−1}]^T corresponds to a vector of parameters. We also define the variance-normalised vector as: e_a = [a_1/σ_{a,1} ... a_{m−1}/σ_{a,m−1}]^T.
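For completeness, a PCA model as in Eq. (6) can be built and evaluated with a few lines of linear algebra; the helper names below are illustrative and the per-mode standard deviations follow the usual sample-covariance convention, which is an assumption about the paper's exact normalization.

```python
import numpy as np

def build_pca_model(V):
    """Build a PCA model (Eq. 6) from m meshes stacked as rows of V (m x n)."""
    mean = V.mean(axis=0)
    _, s, Vt = np.linalg.svd(V - mean, full_matrices=False)
    modes = Vt[: len(s) - 1]                              # m-1 modes of variation V_i
    sigma = s[: len(s) - 1] / np.sqrt(V.shape[0] - 1)     # per-mode std dev sigma_{a,i}
    return mean, modes, sigma

def synthesize(mean, modes, a):
    """v = mean + sum_i a_i V_i (Eq. 6)."""
    return mean + modes.T @ a

def normalize_params(a, sigma):
    """Variance-normalised parameters e_a = a_i / sigma_{a,i}."""
    return a / sigma
```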

3.2 Surface Normal Model


In addition to 3D vertices, we make use of surface normals to capture local shape
variation. In contrast to data lying on a Euclidian manifold (Rn ), surface normals
are part of a Riemannian manifold and can not be simply modelled by applying
PCA to the samples. Fletcher et al. [10] introduced the concept of Principal
Geodesic Analysis (PGA), which can be seen as a generalisation of PCA to
the manifold setting. Smith and Hancock [11] showed how this framework can
be successfully applied to model directional data. In this work, we use PGA
to construct statistical models of surface normals and bent normals. For data
n lying on a spherical manifold S², first a mean vector μ intrinsic to the manifold is calculated. The mean vector serves as the reference point at which a tangent plane is constructed. In the next step, all samples n_j = (n_x, n_y, n_z)_j are projected to points v_j = (v_x, v_y)_j on the tangent plane in a geodesic-distance preserving manner using the Log map. The principal directions N_i are found by applying PCA to the projected samples. A sample can be back-projected onto the manifold by applying the Exponential map. Using this notation, we construct a model for normals as follows:

n = Log_μ^{-1} ( Σ_{i=1}^{m−1} b_i N_i ) ,   (7)

where b = [b_1 ... b_{m−1}]^T are parameter vectors. Variance-normalised they are defined as: e_b = [b_1/σ_{b,1} ... b_{m−1}/σ_{b,m−1}]^T.
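The Log and Exponential maps on S² used for PGA have simple closed forms. The sketch below keeps tangent vectors in the 3D embedding (the paper works with 2D tangent-plane coordinates), which is a presentational simplification; PGA then amounts to applying PCA to the Log-mapped samples.

```python
import numpy as np

def log_map(mu, n):
    """Log map of a unit vector n into the tangent plane at the (unit) mean mu."""
    cos_t = np.clip(mu @ n, -1.0, 1.0)
    theta = np.arccos(cos_t)                 # geodesic distance on the sphere
    if theta < 1e-12:
        return np.zeros(3)
    return theta * (n - cos_t * mu) / np.sin(theta)

def exp_map(mu, v):
    """Exponential map of a tangent vector v at mu back onto the sphere."""
    theta = np.linalg.norm(v)
    if theta < 1e-12:
        return mu.copy()
    return np.cos(theta) * mu + np.sin(theta) * v / theta
```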

3.3 Ambient Occlusion Model


For each of the m shape samples, we compute ground truth ambient occlusion using Meshlab [12]. Each vertex is assigned a single value o_i ∈ [0, 1], which corresponds to the occlusion value. A value of 1 indicates a completely unoccluded vertex. As for shape, we construct a PCA model for ambient occlusion:

o = ō + Σ_{i=1}^{m−1} c_i O_i ,   (8)

where c = [c_1 ... c_{m−1}]^T is a parameter vector. The computed mean value is defined as ō, and the O_i's are modes of variation capturing decreasing energy σ²_{c,i}.

3.4 Bent Normal Model


From a modelling perspective, bent normals are equivalent to surface normals (samples on a spherical manifold), and the model is constructed in the same way as the one described in Section 3.2:

b = Log_μ^{-1} ( Σ_{i=1}^{m−1} d_i B_i ) .   (9)

Note that μ here refers to the intrinsic mean of the bent normals. As in previously defined models, d = [d_1 ... d_{m−1}]^T is a vector of parameters. Variance-normalised they are defined as: e_c = [d_1/σ_{d,1} ... d_{m−1}/σ_{d,m−1}]^T.

3.5 Bent Normal Inference


We infer bent normals from shape data using a generative non-parametric model, namely a probabilistic linear Gaussian model with class-specific prior functions.
In a discrete setting, a joint instance comprising the knowns and unknowns is represented as a feature vector f = [f_k^T f_c^T]^T, where f_k^T = [e_a^T e_b^T]^T or f_k^T = e_a^T (depending on whether only vertices or vertices and surface normals are used) and f_c^T = e_c^T. The training set consists of m instances re-projected into the corresponding models. To ensure the scales of both models are commensurate, we use variance-normalised parameter vectors. We construct the design matrix D ∈ R^{(m_k + m_c) × m} by stacking the parameter vectors. From a conceptual perspective our model is equivalent to a probabilistic PCA model [13]. The non-deterministic part is described with a noise term, ε. Noise is assumed normally distributed, and its distribution is assumed stationary for all features. A joint occurrence is described as follows:

f = W α + μ + ε .   (10)

The parameter α and the noise term ε are assumed to be Gaussian distributed:

p(α) ∼ N(0, I),      p(ε) ∼ N(0, σ² I).   (11)

In probabilistic terms, a joint instance is written as: p(f | α) ∼ N(W α + μ, σ² I). The characteristic model parameters are W, μ and σ² (I is the identity matrix of appropriate dimension). The most likely values can be obtained by applying PCA to the mean-free design matrix, D̄ = (1/m) Σ_{i=1}^{m} (f_i − μ)(f_i − μ)^T = U Σ² V^T, and setting:

μ_ml = (1/m) Σ_{i=1}^{m} f_i ,    σ²_ml = (1/(m−1−u)) Σ_{i=u+1}^{m−1} Σ²_{i,i} ,    W_ml = U_u ( Σ²_u − σ²_ml I )^{1/2} .   (12)

The small letter u corresponds to the number of used modes. Our aim is to estimate f_c given f_k. For data which is jointly Gaussian distributed, the following marginalisation property holds true:

p(f) = p(f_k, f_c) ∼ N( [ μ_k ; μ_c ] , [ W_k W_k^T   W_k W_c^T ; W_c W_k^T   W_c W_c^T ] ).   (13)

Therefore we can write the probability p(f_k | α) ∼ N(W_k α + μ_k, σ² I). According to the specifications, we also have knowledge of how α is distributed (see relations 11). Applying Bayes' rule, we can infer the posterior for α as follows:

p(α | f_k) ∼ N( M^{-1} W_k^T (f_k − μ_k), σ² M^{-1} ),   (14)

where M = W_k^T W_k + σ² I (see for example [14] for an in-depth explanation of linear Gaussian models). Given an estimate for α, we can obtain the posterior distribution for the missing part p(f_c | α), with the mode centre (W_c α + μ_c) corresponding to the MAP estimate.

4 Experiments
We use a 3D morphable model [15] to represent shape and diffuse albedo. Since
we do not have access to the initial training data, we sample from the model
to construct a representative set. This is only required for the shape model, as
neither ambient occlusion nor bent normals exhibit a dependency on texture.
In order to capture the span of the model, we sample ±3 standard deviations
for each of the k = 199 principal components plus the mean shape. Because of
the non linear relationship between shape and ambient occlusion/bent normals,
we additionally sample 200 random faces from the model. This accounts for a
total of m = 599 training examples. For each sample, we calculate ground truth
ambient occlusion and bent normals using Meshlab [12]. We retain ma,b,c,d = 199
most significant modes for each model.
Using a physically based rendering toolkit (PBRT v2 [16]), we render 3D faces
of eight subjects in three pose angles and three illumination conditions. The
subjects are not part of the training set. To cover a wide range, we chose pose
angles: 60 , 0 and 45 . The faces are rendered in the following illumination
environments: White, Glacier and Pisa, where the latter two are obtained
from [17]. Skin reectance is composed of additive Lambertian and specular
terms with a ratio of 10/1. The test set consists of 8 3 3 = 72 images in total.
For each of the samples, we first recover 3D shape and pose from a sparse set of feature points using the algorithm of [7]. We project the shape into the images and obtain RGB values for a subset of the model vertices, n = 3p. Using the proposed image formation process and objective function, we decompose the observations into its contributions: texture, ambient shading,1 diffuse shading and specular reflection. We compare three settings: In a reference method, we calculate spherical harmonic basis functions using surface normals and do not account for ambient occlusion (Fit A). In a second setting, we use the same basis functions and fit ambient occlusion (Fit B). And finally, we derive the basis functions from bent normals and fit ambient occlusion (Fit C). Each of the settings is evaluated
according to three quantitative measures:
1. Texture reconstruction error
2. Fully synthesised model error
3. Illumination estimation accuracy
We also investigate how accurately bent normals are predicted from the joint
model by comparing against ground truth. The last part of this section shows
qualitative reconstructions for texture and full model synthesis and an appli-
cation of illumination transfer. Out-of-sample faces are labeled with three digit
numbers (001–323).

4.1 Bent Normal Generalisation Error


In this section, we investigate generalisation error of the bent normal model. In a first trial, we examine how well the model generalises to unseen data by projecting out-of-sample data into the model (Model). In a second and third trial, we measure the error induced by predicting bent normals from shape. The first method uses only vertex data (Joint A). The second method additionally
1
Ambient occlusion is only estimated in the second and third setting.
Table 1. Bent normal approximation error for eight subjects. Errors are measured in mean angular difference.

E_b (°):   001    002    014    017    052    053    293    323    mean
Model:     3.32   2.68   2.79   2.67   2.56   3.10   2.19   2.41   2.72
Joint A:   4.53   4.04   4.82   4.04   4.13   5.49   3.83   3.73   4.33
Joint B:   3.94   3.57   4.42   3.66   3.88   4.75   3.15   3.34   3.84

Table 2. Texture reconstruction errors averaged over subjects and pose angles. Individual entries are ×10⁻³.

E_t:    001     002     014     017     052     053     293     323     0°      45°     −60°    mean
Fit A   3.394   3.866   3.417   13.13   5.108   3.454   3.710   3.856   5.277   5.165   4.534   4.991
Fit B   3.213   3.603   2.843   12.63   4.387   3.511   3.254   3.109   4.596   4.704   4.410   4.569
Fit C   2.803   3.716   3.111   10.09   3.509   2.978   2.992   3.352   4.224   3.973   4.028   4.069

incorporates surface normals (Joint B). Reconstruction error is measured in mean angular distance:

E_b = (1/p) Σ_{i=1}^{p} arccos( (b_i^g · b_i^r) / (‖b_i^g‖ ‖b_i^r‖) ) ,   (15)

where b_i^g corresponds to the i-th ground truth bent normal and b_i^r to the reconstruction. Table 1 shows reconstruction errors for eight out-of-sample faces.
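For reference, the error of Eq. (15) can be computed as below; reporting in degrees is an assumption consistent with the magnitudes in Table 1.

```python
import numpy as np

def mean_angular_error(bg, br):
    """Mean angular difference (Eq. 15) between ground-truth and reconstructed
    bent normals, given as (p, 3) arrays; returned in degrees."""
    cos = np.sum(bg * br, axis=1) / (
        np.linalg.norm(bg, axis=1) * np.linalg.norm(br, axis=1))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))).mean()
```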

4.2 Texture Reconstruction Error


We use the 60 most significant principal components to model texture. Our evaluation is based on the squared Euclidean distance

E_t = (1/n) ‖t_g − t_r‖² ,   (16)

between ground truth texture t_g and reconstruction t_r. Individual values within each texture vector are within [0, 1]. Table 2 shows reconstruction errors for
all subjects averaged over illumination and over pose angles.

4.3 Full Model Composition Error


The difference between the fully synthesised model and the images is examined in this part. As for texture, we measure the difference in squared Euclidean distance

E_f = (1/n) ‖f_g − f_r‖² ,   (17)

where entries in f_g and f_r are within the range [0, 1]. The error is normalised over the number of observations n, and differs for subjects and pose, but is constant for the three methods. Results for this measurement are shown in Table 3.
Table 3. Full model approximation errors averaged over subjects and pose angles. Individual entries are ×10⁻³.

E_f:    001     002     014     017     052     053     293     323     0°      45°     −60°    mean
Fit A   2.130   2.662   2.651   3.574   2.402   2.449   2.346   2.569   2.272   3.106   2.538   2.598
Fit B   1.822   2.362   2.323   3.296   2.058   2.110   1.974   2.267   1.949   2.783   2.199   2.277
Fit C   1.819   2.387   2.268   3.269   2.080   2.083   2.009   2.204   1.908   2.729   2.257   2.264

Table 4. Light source approximation error for the three methods. Entries represent angular error averaged over all subjects and pose angles.

E_l (°):   White   Glacier   Pisa    mean
Fit A      10.14   16.61     19.05   15.25
Fit B      9.12    15.31     16.94   13.79
Fit C      6.44    12.70     14.03   11.06

4.4 Environment Map Approximation Error

In this section, we compare the lighting approximation error for the three methods. We obtain ground truth lighting coefficients by rendering a sphere in the same illumination conditions as the faces. The material properties are also set to be equal. As the normals and the texture of the sphere are known, we deconvolve the image formation and extract lighting coefficients. This also makes sense for white illumination, as with this procedure we obtain the overall magnitude of the light source intensity. We divide the reflectance vectors by the corresponding BRDF parameters and use them as ground truth l_g. For each of the images, we compute the angular distance between the recovered lighting coefficients l_r and the ground truth:

E_l = arccos( (l_g · l_r) / (‖l_g‖ ‖l_r‖) ) .   (18)
Results for the experiments are shown in Table 4, where we have averaged over
all subjects and pose angles.

4.5 Qualitative Results

For visual comparison, we show qualitative results for the three methods under
investigation. Figure 2 displays fitting results for various subjects, pose angles and illumination conditions. The figure shows full model synthesis and texture reconstructions for methods Fit A and Fit C.
As can be seen in Tables 2 and 3, quantitative differences between methods Fit B and Fit C are less pronounced. This also applies to perceptual differences. Method Fit C notably obtains more accurate reconstructions for regions which are severely occluded. Figure 3 shows magnifications of the full model synthesis of the nose region for two subjects.

(Columns: 052,−60,W; 053,0,G; 001,0,P; 293,45,G; 017,0,P)

Fig. 2. Comparison of methods Fit A and Fit C for five subjects in different pose angles and illumination conditions. The top row shows the input images. The second row shows the full model synthesis for method Fit A, and the third row for method Fit C. The fourth and fifth rows show the ground-truth texture (and shape) and the texture reconstruction using method Fit A. The last row shows the texture reconstructions using method Fit C. The labels on top of the images are Face ID, Pose, Illumination, where W, G and P correspond to White, Glacier and Pisa.

The most important feature to be extracted is the diffuse albedo. As an identity-specific parameter, it should be estimated consistently across pose and illumination. Figure 4 shows the model synthesis and texture reconstructions for method Fit C for one subject in all pose and illumination combinations, including shape and pose estimates.

Fig. 3. Comparison of the three methods Fit A, Fit B and Fit C for two subjects. Close-up of the nose region for face ID 001 (top) and ID 014 (bottom).

Fig. 4. Fitting results for one subject (ID 002) in all pose angles and illumination conditions. The top row shows the input images, the second row the full model synthesis using method Fit C, and the bottom row the texture reconstructions. Note that for the face shown, method Fit C does not achieve a lower texture reconstruction error than method Fit B.

4.6 Illumination Transfer

To demonstrate stability of estimated parameters, we combined lighting coe-


cients (ambient, diuse and specular) estimated from a set of subjects with iden-
tity parameters (shape, texture and ambient occlusion) and pose from the same
set. The results are shown in Figure 5. Diagonal entries show tting results to the
actual images, and o-diagonal entries show cross illumination/identity results.

Fig. 5. Illumination transfer example. The top row and first column show input images of six subjects. Estimated parameters for shape, pose, diffuse albedo and ambient occlusion from the columns are combined with the lighting estimates obtained from the rows (columns: identity — shape, albedo, ambient occlusion, pose; rows: illumination — ambient, diffuse, specular).

5 Conclusions
We have presented a generative method to estimate global illumination in an inverse rendering pipeline. To do so, we learn a statistical model of ambient occlusion and bent normals and incorporate it into the image formation process. The resulting objective function is convex in each parameter set and can be solved accurately and efficiently using alternating least squares. In addition to qualitative improvements, empirical results show that the reconstruction accuracy for texture, lighting and full model synthesis increases by around 10–18%. In future work, we would like to explore the performance of the proposed fitting algorithm in a recognition experiment and to consider more complex approximations to full global illumination.

On Tensor-Based PDEs and Their Corresponding Variational Formulations with Application to Color Image Denoising

Freddie Åström¹,², George Baravdish³, and Michael Felsberg¹,²

¹ Computer Vision Laboratory, Department of E.E., Linköping University
² Center for Medical Image Science and Visualization (CMIV), Linköping University
³ Department of Science and Technology, Linköping University
{firstname.lastname}@liu.se

Abstract. We investigate the case when a partial differential equation (PDE) can be considered as an Euler-Lagrange (E-L) equation of an energy functional consisting of a data term and a smoothness term. We show the necessary conditions for a PDE to be the E-L equation of a corresponding functional. This energy functional is applied to a color image denoising problem, and it is shown that the method compares favorably to current state-of-the-art color image denoising techniques.

1 Introduction
In their seminal work, Perona and Malik proposed a non-linear diffusion PDE to filter an image while retaining lines and structure [1]. This triggered researchers to investigate well-posedness of the underlying PDE, but also to address the drawbacks of the Perona and Malik formulation, namely that noise is preserved along lines and edges in the image structure. A tensor-based image diffusion formulation, which gained wide acceptance, was proposed by Weickert [2], where the outer product of the gradient is used to describe the local orientation of the image structure. As the main novelty of this work, we investigate the case when a given image diffusion PDE can be considered as a corresponding E-L equation of a functional of the form

J(u) = \frac{1}{p}\,\|u - u_0\|^p + R(u), (1)

where R(u) is a tensor-based smoothness term and the L^p-norm, 1 < p < \infty, is the data term. Furthermore, the derived E-L equation of (1) is applied to a color image denoising problem.
Related Work. State-of-the-art gray-value image denoising achieves impressive results. However, the extension from gray-valued image processing to color images still poses major problems due to the unknown mechanisms which govern the process of color perception. One of the most common color spaces used to represent images is the RGB color space, despite it being known that


the colors (red, green and blue) are correlated due to physical properties such as the design of color-sensitive filters in camera equipment. In [3], anisotropic diffusion in the RGB color space is proposed, where the color components are used to derive a weighted diffusion tensor. One drawback of this formulation is that color artifacts can appear on the boundary of regions with sharp color changes. Alternative color spaces investigated for image diffusion include manifolds [4], the CMY, HSV, Lab and Luv color spaces [5], luminance and chromaticity [6], and a maximal decorrelation approach [7]. Due to the simplicity of its application, we follow the color model used in [7], which was developed in [8].
With regard to variational approaches to image diffusion processes, it has been shown that non-linear diffusion and wavelet shrinkage are both related to variational approaches [9]. Also, anisotropic diffusion has been described using robust statistics of natural images [10]. However, in [10] an update scheme for the iterative anisotropic process is constructed which contains an extra convolution of the diffusion tensor. In [11], a tensor-based functional is given, but the tensor it describes is static, so the update scheme becomes linear. Generalizations of the variational formulation exist, e.g. in [12] the smoothness term is considered as an L^p-norm where p is a one-dimensional function.
The main contributions of this paper are: (1) we establish a new tensor-based variational formulation for image diffusion in Theorem 1; (2) in Theorem 2 we derive necessary conditions such that it is possible to find an energy functional given a tensor-based PDE; (3) the derived E-L equation of the established image diffusion functional is applied to a color image denoising problem. It is shown that the resulting denoising compares favorably to state-of-the-art techniques such as anisotropic diffusion [3], trace-based diffusion [13, 7] and BM3D [14].
In Section 2 we briefly review both scalar and tensor-based image diffusion filtering as a motivation for the established functional. In Section 3 we state the necessary conditions for a PDE to be considered as an E-L equation of the same functional. The derived functional is applied to a color image denoising problem in Section 4. The paper is concluded with some final remarks.

2 Image Diffusion and Variational Formulation

Convolving an image with a Gaussian filter reduces the noise variance, but does not take the underlying image structure, such as edges and lines, into account. To avoid blurring of lines and edges, Perona and Malik [1] introduced an edge-stopping function in the solution of the heat equation,

\partial_t u = \operatorname{div}(g(|\nabla u|^2)\nabla u), (2)

with diffusivity function g, and \nabla u = (u_x, u_y)^t, where u_x and u_y denote the derivatives of u with respect to x and y. If g \equiv 1 the heat equation is obtained. The Perona-Malik formulation produces visually pleasing denoising results and preserves the image structure for a large number of iterations. However, the

formulation only depends on the absolute image gradient, and hence noise is preserved at edges and lines. Weickert [2] replaced the diffusivity function with a tensor. This allows the filter to reduce noise along edges and lines and simultaneously preserve the image structure. The tensor-based image diffusion formulation is

\partial_t u = \operatorname{div}(D\nabla u), (3)

where D = D(T_s) is a positive-semidefinite diffusion tensor constructed by scaling the eigenvalues of the structure tensor T_s,

T_s = w * (\nabla u\,\nabla u^t), (4)

where * denotes the convolution operation and w is typically a Gaussian weight function [15].
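As a concrete illustration of Eq. (4), the structure tensor can be computed per pixel by smoothing the entries of the gradient outer product with a Gaussian window; the following Python sketch (assuming numpy and scipy are available) is illustrative and not the authors' implementation.

import numpy as np
from scipy.ndimage import gaussian_filter

def structure_tensor(u, sigma=1.5):
    # image derivatives; np.gradient returns derivatives along rows and columns
    uy, ux = np.gradient(u.astype(float))
    T11 = gaussian_filter(ux * ux, sigma)   # w * (u_x^2)
    T12 = gaussian_filter(ux * uy, sigma)   # w * (u_x u_y)
    T22 = gaussian_filter(uy * uy, sigma)   # w * (u_y^2)
    return T11, T12, T22                    # per-pixel tensor [[T11, T12], [T12, T22]]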

2.1 Scalar Diffusion

The functional

J(u) = \frac{1}{2}\int_\Omega (u - u_0)^2\,dxdy + \lambda\int_\Omega \varphi(|\nabla u|)\,dxdy, (5)

where \varphi is a strictly convex function and \lambda is a scalar, describes a nonlinear minimization problem for image denoising. By using the Gateaux derivative and the divergence theorem, the solution u which minimizes J(u) is the solution of the E-L equation

\frac{\partial J(u)}{\partial u} = 0 \text{ in } \Omega, \qquad \nabla u \cdot n = 0 \text{ on } \partial\Omega, (6)

where \Omega is a grid described by the image size in pixels and n is the normal vector on the boundary \partial\Omega. The functional J(u) satisfies the initial-boundary value problem for the PDE

\partial_t u - \operatorname{div}\!\left(\frac{\varphi'(|\nabla u|)}{|\nabla u|}\nabla u\right) = 0 \text{ in } \Omega,\ t > 0,
\qquad n \cdot \nabla u = 0 \text{ on } \partial\Omega,\ t > 0,
\qquad u(x, y, 0) = u_0(x, y) \text{ in } \Omega. (7)
It is commonly accepted that there is no known way of determining a functional such that \varphi is an arbitrary function. However, it is possible to derive a corresponding functional by equating the non-specified diffusivity function g with the known E-L equation (7),

g(|\nabla u|) = \frac{\varphi'(|\nabla u|)}{|\nabla u|}, (8)

hence, with s = |\nabla u|, one gets \varphi(s) = \int s\,g(s)\,ds = sG(s) - G_1(s) + C, where G_1' = G, G' = g, and the constant C is chosen so that \varphi is non-negative. For example, if g(s) = s^{p-2} then \varphi(s) = \frac{|\nabla u|^p}{p}, and similarly if g(s) = e^{-s} then \varphi = -s e^{-s} - e^{-s} + C with C \geq 1. Naturally, these results carry over to other types of diffusivity functions such as the popular negative exponential diffusivity function and the Perona-Malik function g(s) = \frac{1}{1+s}.
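The relation between g and \varphi can be checked symbolically; the following sketch (assuming sympy, with the concrete choice p = 4) verifies that \varphi(s) = \int s\,g(s)\,ds satisfies \varphi'(s)/s = g(s) for g(s) = s^{p-2}.

import sympy as sp

s = sp.symbols('s', positive=True)
p = 4                                  # concrete exponent chosen for illustration
g = s**(p - 2)
phi = sp.integrate(s * g, s)           # = s**4 / 4, i.e. |grad u|**p / p for p = 4
assert sp.simplify(sp.diff(phi, s) - s * g) == 0
print(phi)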

2.2 Deriving a New Tensor-Based Functional

To achieve better denoising results close to lines and edges than the scalar diffusion formulation allows, it is common to define a tensor T which describes the local orientation of the image structure. Hence the PDE in (7) reads

\partial_t u - \operatorname{div}(T\nabla u) = 0 \text{ in } \Omega,\ t > 0,
\qquad n \cdot \nabla u = 0 \text{ on } \partial\Omega,\ t > 0,
\qquad u(x, y, 0) = u_0(x, y) \text{ in } \Omega, (9)

where T(\nabla u) depends on the image gradient. Note that the common formulation of the tensor T is to let it depend on (x, y) \in \Omega. The ansatz to derive the new tensor-based image diffusion functional is based on equating (7) and (9) such that

\operatorname{div}(T\nabla u) = \operatorname{div}\!\left(\frac{\varphi'(|\nabla u|)}{|\nabla u|}\nabla u\right). (10)

The equality in (10) holds if

T\nabla u = \frac{\varphi'(|\nabla u|)}{|\nabla u|}\nabla u + \bar{C}, (11)

for some vector field \bar{C} with \operatorname{div}(\bar{C}) = 0. In this work we consider \bar{C} = 0. Extending the equation with the scalar product of \nabla u from the left, and integrating the result with respect to |\nabla u|, we find \varphi as

\varphi(|\nabla u|) = \int \frac{\nabla u^t T \nabla u}{|\nabla u|}\,d|\nabla u|. (12)

Using this result, we motivate the subsequent theorem by performing the integration with a tensor defined by the outer product of the gradient; then \varphi(|\nabla u|) = \int |\nabla u|^3\,d|\nabla u| = \frac{1}{4}|\nabla u|^4 + c, where c is a constant. Setting the constant to zero, we obtain the tensor-driven functional

J(u) = \frac{1}{2}\int_\Omega (u - u_0)^2\,dxdy + \frac{1}{4}\int_\Omega \nabla u^t T \nabla u\,dxdy. (13)

A generalization of this energy functional is now stated in the following theorem.
Theorem 1. Let u_0 be an observed image in a domain \Omega \subset \mathbb{R}^2, and let T(u_x, u_y) describe the local orientation of the underlying structure of u in \Omega. Denote by J(u) the functional

J(u) = \frac{1}{p}\,\|u - u_0\|^p + \int_\Omega \nabla u^t\, T(\nabla u)\, \nabla u\,dxdy, (14)

where u \in C^2 and 1 < p < \infty. Then the E-L equation of J(u) is

(u - u_0)^{p-1} - \operatorname{div}(S\nabla u) = 0 \text{ in } \Omega, \qquad n \cdot \nabla u = 0 \text{ on } \partial\Omega, (15)

where

S = \begin{pmatrix} \nabla u^t T_{u_x} \\ \nabla u^t T_{u_y} \end{pmatrix}^{\!t} + T + T^t, (16)

and T_{u_x}, T_{u_y} are the derivatives of T(u_x, u_y) with respect to u_x and u_y.
Proof. To find the E-L equation of J(u) we define the smoothness term

R(u) = \int_\Omega \nabla u^t\, T(\nabla u)\, \nabla u\,dxdy. (17)

Let v \in C^\infty(\bar{\Omega}) be an arbitrary function such that n \cdot \nabla v|_{\partial\Omega} = 0. By the definition of the Gateaux derivative it follows that

\frac{R(u + \epsilon v) - R(u)}{\epsilon} = \langle \nabla v, T(\nabla u + \epsilon\nabla v)\nabla u\rangle + \langle \nabla u, T(\nabla u + \epsilon\nabla v)\nabla v\rangle + \frac{\langle \nabla u, (T(\nabla u + \epsilon\nabla v) - T(\nabla u))\nabla u\rangle}{\epsilon} + \epsilon\,\langle \nabla v, T(\nabla u + \epsilon\nabla v)\nabla v\rangle, (18)

where \langle a, b\rangle = \iint a^t b\,dxdy for a, b : \mathbb{R}^2 \to \mathbb{R}^n denotes the scalar product on a function space. Note that, since T(\nabla u) = T(u_x, u_y),

\frac{T(\nabla u + \epsilon\nabla v) - T(\nabla u)}{\epsilon} \to T_{u_x} v_x + T_{u_y} v_y, \quad \text{as } \epsilon \to 0. (19)

Then the variation of the smoothness term with respect to v reads

\langle R'(u), v\rangle = \langle \nabla u, (T_{u_x} v_x + T_{u_y} v_y)\nabla u\rangle + \langle \nabla v, T\nabla u\rangle + \langle \nabla u, T\nabla v\rangle. (20)

Using the scalar product and the adjoint operator we obtain

\langle R'(u), v\rangle = \left\langle \nabla v, \begin{pmatrix} \nabla u^t T_{u_x} \\ \nabla u^t T_{u_y} \end{pmatrix}^{\!t} \nabla u\right\rangle + \langle \nabla v, T\nabla u\rangle + \langle \nabla v, T^t\nabla u\rangle. (21)

By Green's formula we obtain

\langle \nabla v, T\nabla u\rangle = -\int_\Omega v\,\operatorname{div}(T\nabla u)\,dxdy, (22)
\langle \nabla v, T^t\nabla u\rangle = -\int_\Omega v\,\operatorname{div}(T^t\nabla u)\,dxdy, (23)
\left\langle \nabla v, \begin{pmatrix} \nabla u^t T_{u_x} \\ \nabla u^t T_{u_y} \end{pmatrix}^{\!t} \nabla u\right\rangle = -\int_\Omega v\,\operatorname{div}\!\left(\begin{pmatrix} \nabla u^t T_{u_x} \\ \nabla u^t T_{u_y} \end{pmatrix}^{\!t} \nabla u\right) dxdy. (24)

Hence, using (22)-(24) now gives

\langle R'(u), v\rangle = -\int_\Omega v\,\operatorname{div}(S\nabla u)\,dxdy, (25)

Thus

S = \begin{pmatrix} \nabla u^t T_{u_x} \\ \nabla u^t T_{u_y} \end{pmatrix}^{\!t} + T + T^t (26)

in the E-L equation (15). Now, by taking the derivative of the data term in J(u) with respect to u, the proof is completed. \square


3 Necessary Conditions for the Existence of a Variational Formulation

In this section we formulate necessary conditions for the existence of an energy functional of the type in (14), given a tensor-based diffusion PDE (15).
Theorem 2. For a given tensor S = S(u_x, u_y) with entries s^{(i)} satisfying the necessary condition (NC)

s^{(3)} + u_x s^{(3)}_{u_x} - u_x s^{(1)}_{u_y} = s^{(2)} + u_y s^{(2)}_{u_y} - u_y s^{(4)}_{u_x}, (27)

there exists a symmetric tensor T(u_x, u_y) such that the tensor-based diffusion PDE

(u - u_0)^{p-1} - \operatorname{div}(S\nabla u) = 0 \text{ in } \Omega, \qquad n \cdot \nabla u = 0 \text{ on } \partial\Omega, (28)

where 1 < p < \infty, is the E-L equation of the functional J(u) in (14). Moreover, the entries in

T(u_x, u_y) = \begin{pmatrix} f(u_x, u_y) & g(u_x, u_y) \\ g(u_x, u_y) & h(u_x, u_y) \end{pmatrix} (29)

are given below. First we consider g(u_x, u_y) = \tilde{g}(\xi, \eta), where

\tilde{g}(\xi, \eta) = \frac{1}{\xi^2}\int \xi\,\tilde{R}(\xi, \eta)\,d\xi + \frac{1}{\xi^2}\psi(\eta), (30)

with \xi = u_x, \eta = u_x/u_y, and \tilde{R}(\xi, \eta) = R(u_x, u_y) the right-hand side of the NC (27). Second, f and h are obtained from

f(u_x, u_y) = \frac{1}{u_x^2}\int_0^{u_x} \tau\bigl(s^{(1)}(\tau, u_y) - u_y\,g_{u_x}(\tau, u_y)\bigr)\,d\tau + \frac{1}{u_x^2}\alpha(u_y), (31)

h(u_x, u_y) = \frac{1}{u_y^2}\int_0^{u_y} \tau\bigl(s^{(4)}(u_x, \tau) - u_x\,g_{u_y}(u_x, \tau)\bigr)\,d\tau + \frac{1}{u_y^2}\beta(u_x), (32)

where \alpha, \beta and \psi are arbitrary functions.


Proof. Let the system of equations

u_x f_{u_x} + u_y g_{u_x} + 2f = s^{(1)} (33)
u_x g_{u_x} + u_y h_{u_x} + 2g = s^{(2)} (34)
u_x f_{u_y} + u_y g_{u_y} + 2g = s^{(3)} (35)
u_x g_{u_y} + u_y h_{u_y} + 2h = s^{(4)} (36)

represent the expansion of (26) in the given notation. The function f in (31) is obtained from (33) by solving

\frac{\partial}{\partial u_x}\bigl(u_x^2 f\bigr) = u_x s^{(1)} - u_x u_y\,g_{u_x}. (37)

To construct g, differentiate f in (31) with respect to u_y. This gives

f_{u_y}(u_x, u_y) = \frac{1}{u_x^2}\int_0^{u_x} \tau\bigl(s^{(1)}_{u_y}(\tau, u_y) - g_{u_x}(\tau, u_y) - u_y\,g_{u_x u_y}(\tau, u_y)\bigr)\,d\tau + \frac{1}{u_x^2}\alpha'(u_y). (38)

Insert f_{u_y} into (35) and differentiate with respect to u_x to obtain

u_x g_{u_x} + u_y g_{u_y} + 2g = s^{(3)} + u_x s^{(3)}_{u_x} - u_x s^{(1)}_{u_y}. (39)

In the same way, solve for h in (34) to obtain (32). Using now (34) in (36) yields

u_x g_{u_x} + u_y g_{u_y} + 2g = s^{(2)} + u_y s^{(2)}_{u_y} - u_y s^{(4)}_{u_x}. (40)

Comparing (39) and (40), we deduce the necessary condition (NC) for the existence of g, f and h. From the NC we are now able to compute g satisfying

u_x g_{u_x} + u_y g_{u_y} + 2g = R(u_x, u_y), (41)

where R is the right-hand side (or the left-hand side) of the NC. Changing variables to \xi = u_x and \eta = u_x/u_y in (41), we get

\tilde{g}_\xi\,(u_x \xi_{u_x} + u_y \xi_{u_y}) + \tilde{g}_\eta\,(u_x \eta_{u_x} + u_y \eta_{u_y}) + 2\tilde{g} = \tilde{R}(\xi, \eta), (42)

where \tilde{R}(\xi, \eta) = R(u_x, u_y) and \tilde{g}(\xi, \eta) = g(u_x, u_y). Since \xi_{u_x} = 1, \xi_{u_y} = 0 and \eta_{u_x} = 1/u_y, \eta_{u_y} = -u_x/u_y^2, equation (42) can then be written as \xi\tilde{g}_\xi + 2\tilde{g} = \tilde{R}. We solve for \tilde{g} in the following way:

\frac{\partial}{\partial\xi}\bigl(\xi^2 \tilde{g}\bigr) = \xi\tilde{R}(\xi, \eta) \;\Longrightarrow\; \tilde{g} = \frac{1}{\xi^2}\int \xi\,\tilde{R}(\xi, \eta)\,d\xi + \frac{1}{\xi^2}\psi(\eta), (43)

where \psi is an arbitrary function. Reverting the change of variables, \tilde{g}(\xi, \eta) = g(u_x, u_y), yields g in (30). Now f and h can be obtained from (31) and (32). \square

Corollary 1. For the tensor S = \nabla u\,\nabla u^t there exists a tensor

T = \frac{1}{4}\begin{pmatrix} u_x^2 & u_x u_y \\ u_x u_y & u_y^2 \end{pmatrix}, (44)

such that the PDE

(u - u_0)^{p-1} - \operatorname{div}(S\nabla u) = 0 \text{ in } \Omega, \qquad n \cdot \nabla u = 0 \text{ on } \partial\Omega, (45)

is the E-L equation of the functional J(u) in (14).

Proof. It is straightforward to verify that the entries of the tensor S satisfy the NC, whose right-hand side (27) is

R(u_x, u_y) = 2 u_x u_y. (46)

For simplicity we choose \alpha, \beta and \psi to be zero in (30)-(32). With the change of variables \xi = u_x and \eta = u_x/u_y, (46) becomes \tilde{R}(\xi, \eta) = 2\xi^2/\eta, so that (30) gives \tilde{g}(\xi, \eta) = \frac{\xi^2}{2\eta}, and hence

g(u_x, u_y) = \frac{1}{2} u_x u_y. (47)

Using (31) and (32), f and h are

f(u_x, u_y) = \frac{1}{u_x^2}\int_0^{u_x} \Bigl(\tau^3 - \frac{\tau}{2} u_y^2\Bigr)\,d\tau = \frac{u_x^2}{4} - \frac{u_y^2}{4}, (48)

h(u_x, u_y) = \frac{1}{u_y^2}\int_0^{u_y} \Bigl(\tau^3 - \frac{\tau}{2} u_x^2\Bigr)\,d\tau = \frac{u_y^2}{4} - \frac{u_x^2}{4}. (49)

Let \tilde{T} be the tensor defined by (47)-(49); inserting this tensor into (14), we obtain

\nabla u^t \tilde{T}\,\nabla u = \nabla u^t \begin{pmatrix} u_x^2 - u_y^2 & 2 u_x u_y \\ 2 u_x u_y & u_y^2 - u_x^2 \end{pmatrix}\nabla u = \nabla u^t \begin{pmatrix} u_x^2 & u_x u_y \\ u_x u_y & u_y^2 \end{pmatrix}\nabla u = \nabla u^t T\,\nabla u, (50)

neglecting the factor 1/4. This concludes the proof, and (44) follows from (50). \square
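Corollary 1 can also be verified symbolically; the sketch below (assuming sympy) checks that, with the entries from (47)-(49), the tensor S of (26) reduces to the outer product of the gradient.

import sympy as sp

ux, uy = sp.symbols('u_x u_y')
grad = sp.Matrix([ux, uy])
T = sp.Matrix([[(ux**2 - uy**2) / 4, ux * uy / 2],
               [ux * uy / 2, (uy**2 - ux**2) / 4]])
Tux, Tuy = T.diff(ux), T.diff(uy)                    # derivatives of T w.r.t. u_x and u_y
N = sp.Matrix.vstack(grad.T * Tux, grad.T * Tuy)     # rows grad(u)^t T_ux and grad(u)^t T_uy
S = N.T + T + T.T                                    # Eq. (26)
assert (S - grad * grad.T).applyfunc(sp.simplify) == sp.zeros(2, 2)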

3.1 Weighted Tensor-Based Variational Formulation

In this section we want to find a weighted tensor

T = w(x, y) * \begin{pmatrix} f(u_x, u_y) & g(u_x, u_y) \\ g(u_x, u_y) & h(u_x, u_y) \end{pmatrix}, (51)

where * is the convolution operator and w is a smooth kernel, such that the functional

J(u) = \frac{1}{p}\,\|u - u_0\|^p + \int_\Omega \nabla u^t\, T(\nabla u)\, \nabla u\,dxdy (52)

has

(u - u_0)^{p-1} - \operatorname{div}(S\nabla u) = 0 \text{ in } \Omega, \qquad n \cdot \nabla u = 0 \text{ on } \partial\Omega, (53)

as the corresponding E-L equation, where S = w * (\nabla u\,\nabla u^t).


Proposition 1. For the tensor S = w * (\nabla u\,\nabla u^t), the PDE (53) is the E-L equation of the functional (52) with T as in (51) if the necessary condition

u_x(x, y)\,\bigl(w * g_{u_x}\bigr)(x, y) = u_y(x, y)\,\bigl(w * g_{u_y}\bigr)(x, y) (54)

is satisfied. Moreover, the entries of T can be obtained by solving the following system of differential equations:

u_x\,(w * f_{u_x}) + u_y\,(w * g_{u_x}) + 2\,(w * f) = w * u_x^2 (55)
u_x\,(w * g_{u_x}) + 2\,(w * g) = w * (u_x u_y) (56)
u_y\,(w * g_{u_y}) + 2\,(w * g) = w * (u_x u_y) (57)
u_x\,(w * g_{u_y}) + u_y\,(w * h_{u_y}) + 2\,(w * h) = w * u_y^2 (58)

Remark 1: We have chosen to include w in the formulation of the tensor (51). This allows us to include u_x and u_y inside the convolution operation in (54), assuming that u_x and u_y are sampled at different positions on the grid.
Remark 2: Theorem 1 describes the case when rank(S) = 1, i.e., the signal has an intrinsically one-dimensional structure (such as lines); Proposition 1 describes the general case when rank(S) = 2.
To conclude, we have derived a framework which states under what conditions a given PDE is the corresponding E-L equation of a tensor-based image diffusion energy functional. In the next section the derived E-L equation (15) is used on a color image denoising problem.

4 Application to Color Image Denoising

It is known that the components of the RGB color space are correlated; therefore decorrelation methods using PCA, HSV, Lab etc. have been investigated by the image processing community. In this work we utilize the decorrelation transform by Lenz and Carmona [8], which has been shown to perform well for color image denoising problems [7]. The transform is described by the matrix

\begin{pmatrix} I \\ C_1 \\ C_2 \end{pmatrix} = \frac{1}{\sqrt{6}} \begin{pmatrix} \sqrt{2} & \sqrt{2} & \sqrt{2} \\ 2 & -1 & -1 \\ 0 & \sqrt{3} & -\sqrt{3} \end{pmatrix} \begin{pmatrix} R \\ G \\ B \end{pmatrix} (59)

and, when applied to the RGB color space, a primary component describing the average gray-value, I, and two color-opponent components C = (C_1, C_2) are obtained.
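The transform can be applied per pixel with an ordinary matrix multiplication; the following Python sketch (numpy assumed, image stored as an (H, W, 3) float array) is illustrative and not the code of [7, 8].

import numpy as np

A = (1.0 / np.sqrt(6.0)) * np.array([[np.sqrt(2.0), np.sqrt(2.0), np.sqrt(2.0)],
                                     [2.0, -1.0, -1.0],
                                     [0.0, np.sqrt(3.0), -np.sqrt(3.0)]])

def rgb_to_icc(img_rgb):
    # per-pixel multiplication with the matrix of Eq. (59); channels become I, C1, C2
    return np.einsum('ij,hwj->hwi', A, img_rgb)

def icc_to_rgb(img_icc):
    return np.einsum('ij,hwj->hwi', np.linalg.inv(A), img_icc)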

4.1 From Structure Tensor to Diffusion Tensor

Let T be defined as the outer product of the image gradient weighted by a smooth kernel, see (4). Then T is symmetric and positive-semidefinite. To favor smoothing along edges, the tensor is scaled using a diffusivity function g. The scaling affects the eigenvalues of the tensor but preserves the eigenvectors, since they describe the orientation of the image structure. In this study we use the negative exponential function exp(-ks) as the scaling function, where k is a positive scalar, i.e.,

D(T) = g(T) = V g(\Lambda) V^{-1}, (60)

where V are the eigenvectors and \Lambda the eigenvalues of T [13]. \Lambda has the eigenvalues \lambda_1 and \lambda_2 on its main diagonal, such that g(\Lambda) = \operatorname{diag}(g(\lambda_1), g(\lambda_2)). We denote by D = D(T) the diffusion tensor defined by (60). With the same notation, define E = E(N), where N = \begin{pmatrix} \nabla u^t T_{u_x} \\ \nabla u^t T_{u_y} \end{pmatrix} as in (26). Since N is a non-symmetric tensor, E may be complex valued; however, in practice Im(E) \approx 0, hence this factor is neglected.
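A possible realisation of Eq. (60) on a field of 2x2 structure tensors is sketched below (numpy assumed); the tensor is given by its three component images as in the structure-tensor sketch of Sec. 2, and the sketch is illustrative rather than the authors' implementation.

import numpy as np

def diffusion_tensor(T11, T12, T22, k):
    # stack the per-pixel symmetric 2x2 tensors into an (H, W, 2, 2) array
    T = np.stack([np.stack([T11, T12], -1), np.stack([T12, T22], -1)], -2)
    lam, V = np.linalg.eigh(T)                 # eigenvalues and orthonormal eigenvectors
    g = np.exp(-k * lam)                       # scale the eigenvalues, keep the eigenvectors
    return np.einsum('...ij,...j,...kj->...ik', V, g, V)   # V diag(g) V^T = V g(Lambda) V^{-1}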

4.2 Discretization
To find the solution u of (15) with tensors E(N) and D(T), we define the parabolic equation

\partial_t u - \operatorname{div}(E(N)\nabla u) - 2\operatorname{div}(D(T)\nabla u) = 0. (61)

The term containing D(T) is implemented using the non-negativity scheme described in [2], resulting in a term A(u), whereas the other divergence term is implemented using the approach in [16], described by B(u). In an explicit discrete setting the iterative update equation becomes

u_{i+1} = u_i + \tau\,\bigl(A(u_i) + B(u_i)\bigr), (62)

where \tau is the step length for each iteration i.
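The update in Eq. (62) amounts to a simple explicit loop; the driver sketch below assumes that the discretised divergence terms A and B are provided as functions (they are not specified here).

def denoise(u0, A, B, tau=0.2, n_iter=100):
    # explicit iteration of Eq. (62); A(u) and B(u) discretise the two terms of Eq. (61)
    u = u0.copy()
    for _ in range(n_iter):
        u = u + tau * (A(u) + B(u))
    return u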

4.3 Experiments
The image noise variance, \sigma^2_{est}, is estimated using the technique in [17], which is the foundation for the estimate of the diffusion parameter in [13] (cf. (47), p. 6); the resulting diffusion parameter for the diffusion filters is

k = \frac{e - 1}{e}\,\frac{2}{\sigma^2_{est}}, (63)

where e is Euler's number. A naive approach is taken to approximate the noise of the color image: the noise estimate is the average of the estimated noise of each RGB color component. In most cases this approach works reasonably well for noise levels \sigma < 70 when compared to the true underlying noise distribution, but note that the image itself contains noise, hence the estimate will be biased. We use color images from the Berkeley segmentation dataset [18], more specifically images {87065, 175043, 208001, 8143}.jpg, labeled as lizard, snake, mushroom and owl (see Figure 1). All images predominantly contain scenes from nature and hence a large number of branches, leaves, grass and other commonly occurring structures. The estimated noise levels for the different images can be seen in Table 1.
In this work we found that scaling the estimate of the diffusion parameter with the color transform, k_I = 10^{-1}k and k_c = 10k, yields an improved result compared to using k without any scaling. The motivation for using different diffusion parameters in the different channels is that filtering in the average

Table 1. SSIM and PSNR values

SSIM PSNR
Noise Est TR Proposed RGB BM3D TR Proposed RGB BM3D
Lizard
5 17.0 0.99 0.99 0.98 0.97 38.0 38.0 35.2 34.1
10 19.2 0.96 0.96 0.94 0.95 32.7 32.8 30.6 32.3
20 24.2 0.90 0.91 0.87 0.92 27.3 27.8 26.7 29.3
50 57.1 0.75 0.77 0.68 0.73 22.6 23.5 22.0 23.2
70 67.1 0.65 0.70 0.59 0.66 20.8 21.8 20.5 21.7
Snake
5 15.5 0.98 0.98 0.97 0.96 38.3 38.3 35.3 34.6
10 18.1 0.95 0.95 0.92 0.94 33.0 33.1 30.9 32.6
20 23.5 0.87 0.88 0.83 0.90 27.7 28.4 27.3 29.7
50 53.9 0.69 0.73 0.63 0.69 23.2 24.3 23.0 24.4
70 65.4 0.59 0.64 0.54 0.61 21.5 22.7 21.6 22.9
Mushroom
5 9.9 0.97 0.97 0.95 0.96 38.7 38.9 36.9 38.2
10 13.9 0.92 0.93 0.89 0.93 33.8 34.3 32.8 35.1
20 21.1 0.83 0.86 0.79 0.88 29.3 30.4 29.2 31.8
50 49.0 0.66 0.70 0.58 0.70 25.1 26.0 24.7 26.4
70 60.4 0.57 0.62 0.52 0.63 23.0 23.8 22.8 24.1
Owl
5 61.4 0.99 0.99 0.98 0.78 36.9 37.0 34.6 23.2
10 19.8 0.97 0.97 0.95 0.96 33.0 33.0 30.1 32.0
20 25.7 0.93 0.93 0.88 0.93 27.8 28.0 25.8 28.7
50 65.1 0.80 0.81 0.68 0.72 22.6 23.0 20.5 21.8
70 71.2 0.71 0.73 0.59 0.66 20.4 21.0 18.8 20.5

gray-value channel I primarily affects the image noise, and hence it is preferable to reduce the k-parameter in this channel. However, one must be careful, since noise may be considered as structure and may then be preserved throughout the filtering process. Filtering in the color-opponent components provides intra-region smoothing in the color space, and hence color artifacts are reduced with a less structure-preserving diffusion parameter.
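Put together, a minimal sketch of the parameter selection used here (numpy assumed; per-channel noise estimates as input) reads as follows; it is an illustration of Eq. (63) and the channel scalings above, not the released code.

import numpy as np

def diffusion_parameters(sigma_est_rgb):
    # average the per-channel noise estimates (Sec. 4.3), then apply Eq. (63)
    sigma2 = np.mean(np.asarray(sigma_est_rgb, dtype=float)) ** 2
    k = (np.e - 1.0) / np.e * 2.0 / sigma2
    # channel-wise scaling used in the experiments: k_I for I, k_c for C1 and C2
    return {'k': k, 'k_I': 0.1 * k, 'k_c': 10.0 * k}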
To illustrate the capabilities of the proposed technique, it is compared to three state-of-the-art denoising techniques, namely trace-based diffusion (TR) [7], diffusion in the RGB color space [3] and the color version of BM3D [14]. Code was obtained from the authors of TR-diffusion and BM3D, whereas the RGB color space diffusion was reimplemented based on [3] and [2] (cf. p. 95); the k-parameter in this case is set to the value obtained from (63), since it uses the RGB color space. To promote research on diffusion techniques we intend to release the code for the proposed technique.

Fig. 1. Results (columns, left to right: Noisy, TR [7], Proposed, RGB [3], BM3D [14]). Even rows display subregions of the image from the respective previous row. The first four rows are illustrations using additive Gaussian noise of 50 std, the next two rows of 20 std, and the last two rows of 70 std. Best viewed in color.

For a quantitative evaluation we use the SSIM [19] and the peak signal-to-noise ratio (PSNR) error measurements. To the best of our knowledge there is no widely accepted method to evaluate color image perception, hence the error measurements are computed for each component of the RGB color space and averaged. Gaussian white noise with standard deviations (std) of {5, 10, 20, 50, 70} is added to the images. Before filtering commences, the noise is clipped to fit the 8-bit image value range. The step length for the diffusion techniques was set to 0.2 and manually scaled so that the images in Figure 1 are obtained after approximately half of the maximum number of iterations (100).
The purpose here is not to claim the superiority of any technique, but rather to illustrate the advantages and disadvantages of the different filter methodologies in various situations. The SSIM and PSNR error measurements are given to illustrate the strengths of the compared denoising techniques; the results are shown in Table 1.
It is clear from Table 1 that the established E-L equation performs well in terms of SSIM and PSNR values compared to other state-of-the-art denoising techniques used for color images. However, for images with large, approximately homogeneous surfaces (such as the mushroom image), BM3D is the favored denoising technique. This is due to the many similar patches which the algorithm averages over to reduce the noise variance. On the other hand, the diffusion techniques perform better in high-frequency regions due to the local description of the tensor. This is also depicted in Figure 1, where the diffusion techniques preserve the image structure in the lizard, snake and owl images, whereas BM3D achieves perceptually good results for the mushroom image.

5 Conclusion
In this work we have given necessary conditions for a tensor-based image diffusion PDE to be the E-L equation of an energy functional. Furthermore, we apply the E-L equation derived in Theorem 1 to a color image denoising application, with results comparable to state-of-the-art techniques. The generalization of Proposition 1 and its complete proof will be the subject of further study.

Acknowledgment. This research has received funding from the Swedish Research Council through a grant for the project Non-linear adaptive color image processing, from the EC's 7th Framework Programme (FP7/2007-2013), grant agreement 247947 (GARNICS), and by ELLIIT, the Strategic Area for ICT research, funded by the Swedish Government.

References
1. Perona, P., Malik, J.: Scale-space and edge detection using anisotropic diffusion. IEEE Trans. Pattern Analysis and Machine Intelligence 12, 629–639 (1990)
2. Weickert, J.: Anisotropic Diffusion in Image Processing. ECMI Series. Teubner-Verlag, Stuttgart (1998)
3. Weickert, J.: Coherence-enhancing diffusion of colour images. Image and Vision Computing 17, 201–212 (1999)
4. Sochen, N., Kimmel, R., Malladi, R.: A general framework for low level vision. IEEE Trans. Image Processing 7, 310–318 (1998)
5. Renner, A.I.: Anisotropic Diffusion in Riemannian Colour Space. PhD thesis, Ruprecht-Karls-Universität, Heidelberg (2003)
6. Tang, B., Sapiro, G., Caselles, V.: Color image enhancement via chromaticity diffusion. IEEE Trans. Image Processing 10, 701–707 (2001)
7. Åström, F., Felsberg, M., Lenz, R.: Color Persistent Anisotropic Diffusion of Images. In: Heyden, A., Kahl, F. (eds.) SCIA 2011. LNCS, vol. 6688, pp. 262–272. Springer, Heidelberg (2011)
8. Lenz, R., Carmona, P.L.: Hierarchical S(3)-coding of RGB histograms. Selected papers from VISAPP 2009, vol. 68, pp. 188–200. Springer (2010)
9. Black, M.J., Rangarajan, A.: On the unification of line processes, outlier rejection, and robust statistics with applications in early vision. IJCV 19, 57–91 (1996)
10. Scharr, H., Black, M., Haussecker, H.: Image statistics and anisotropic diffusion. In: Ninth IEEE ICCV, vol. 2, pp. 840–847 (2003)
11. Krajsek, K., Scharr, H.: Diffusion filtering without parameter tuning: Models and inference tools. In: CVPR, pp. 2536–2543 (2010)
12. Baravdish, G., Svensson, O.: Image reconstruction with p(x)-parabolic equation. In: 7th ICIPE, Orlando, Florida (2011)
13. Felsberg, M.: Autocorrelation-driven diffusion filtering. IEEE Trans. Image Processing 20, 1797–1806 (2011)
14. Dabov, K., Foi, A., Katkovnik, V., Egiazarian, K.: Color image denoising via sparse 3D collaborative filtering with grouping constraint in luminance-chrominance space. In: IEEE International Conference on Image Processing, pp. 313–316 (2007)
15. Förstner, W., Gülch, E.: A fast operator for detection and precise location of distinct points, corners and centres of circular features. In: ISPRS Intercommission Workshop, Interlaken, pp. 149–155 (1987)
16. Weickert, J.: A scheme for coherence-enhancing diffusion filtering with optimized rotation invariance. VCIR 13, 103–118 (2002)
17. Förstner, W.: Image preprocessing for feature extraction in digital intensity, color and range images. In: Geomatic Method for the Analysis of Data in the Earth Sciences. LNES, vol. 95, pp. 165–189 (2000)
18. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: Proc. ICCV, vol. 2, pp. 416–423 (2001)
19. Wang, Z., Bovik, A., Sheikh, H., Simoncelli, E.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Processing 13, 600–612 (2004)
Kernelized Temporal Cut for Online Temporal
Segmentation and Recognition

Dian Gong, Gerard Medioni, Sikai Zhu, and Xuemei Zhao

Institute for Robotics and Intelligent Systems, University of Southern California


Los Angeles, CA, 90089
{diangong,medioni,sikaizhu,xuemeiz}@usc.edu

Abstract. We address the problem of unsupervised online segmentation of human motion sequences into different actions. Kernelized Temporal Cut (KTC) is proposed to sequentially cut structured sequential data into different regimes. KTC extends previous work on online change-point detection by incorporating Hilbert space embedding of distributions to handle the nonparametric and high-dimensionality issues. Based on KTC, a realtime online algorithm and a hierarchical extension are proposed for detecting both action transitions and cyclic motions at the same time. We evaluate and compare the approach to state-of-the-art methods on motion capture data, depth sensor data and videos. Experimental results demonstrate the effectiveness of our approach, which yields realtime segmentation and produces higher action segmentation accuracy. Furthermore, by combining with sequence matching algorithms, we can online recognize actions of an arbitrary person from an arbitrary viewpoint, given realtime depth sensor input.

1 Introduction
Temporal segmentation of human motion sequences (motion capture data, 2.5D depth sensor data or 2D videos), i.e., temporally cutting sequences into segments with different semantic meanings, is an important step in building an intelligent framework to analyze human motion. Temporal segmentation can be applied to motion animation [1], action recognition [24], video understanding and activity analysis [5, 6]. In particular, temporal segmentation is crucial for human action recognition. Most recent work in human activity recognition focuses on simple primitive actions such as walking, running and jumping, in contrast to the fact that daily activity involves complex temporal patterns (e.g., walking, then sit-down and stand-up). Thus, recognizing such complex activities relies on accurate temporal structure decomposition [7].
Previous work on temporal segmentation can be mainly divided into two categories. On one side, many works in statistics, i.e., either offline or online (quickest) change-point detection [8], are restricted to univariate (1D) series, and the distribution is assumed to be known in advance [9]. Because of the complex structure of motion dynamics, these works are not suitable for temporal segmentation of human actions. On the other side, temporal clustering has been proposed for unsupervised learning of human motions [10]. However, temporal clustering is usually performed offline, and is thus not suitable for applications such as realtime action segmentation and recognition.


Fig. 1. Online Hierarchical Temporal Segmentation. A 22-second input sequence is temporally cut into two segments: a walking segment (S1), which is further cut into 6 action units, and a jumping segment (S2), which is further cut into 4 action units.

Fig. 2. System Flowchart. The input can be either Mocap, 2.5D depth sensor data or 2D videos. The outputs are temporal cuts for both action transitions and cyclic action units.

Motivated by these difficulties, we propose an online temporal segmentation method with no parametric distribution assumptions, which can be directly applied to detect action transitions. The proposed method, Kernelized Temporal Cut (KTC), is a temporal application of Hilbert space embedding of distributions [11, 12] and the kernelized two-sample test [13, 14] to online segmentation of structured sequential data. KTC can simultaneously detect action transitions and cyclic motion structures (action units) within a regime, as shown in Fig. 1. Furthermore, a realtime implementation of KTC is proposed, incorporating an incremental sliding-window strategy. Within a sliding window, segmentation is performed by a two-sample hypothesis test based on the proposed spatio-temporal kernel. The input and output of our system are shown in Fig. 2.
In summary, our approach presents several advantages:
Online: KTC can sequentially process the input and capture action transitions, which is extremely helpful for realtime applications such as continuous action recognition.
Hierarchical: KTC can simultaneously capture transition points between different actions and cyclic action units within an action regime, e.g., a walking segment contains several walking cycles. This is important for realtime activity recognition, i.e., an action can be recognized after only one action unit.

Nonparametric: nonparametric and high-dimensionality issues are handled by the Hilbert space embedding based on the spatio-temporal kernel, which can be directly applied to complex sequential data such as human motion sequences.
Online Action Recognition and Transfer Learning: KTC can be applied to a variety of inputs such as motion capture data, depth sensor data and 2D videos from an unknown viewpoint. More importantly, KTC can be applied to online action recognition from OpenNI and Kinect by combining it with methods such as [15]. The recognition is performed without any training data from the depth sensor, which is truly transfer learning.

2 Related Work

Temporal segmentation is a multifaceted area, and several related topics in machine learning, statistics, computer vision and graphics are discussed in this section.
Change-Point Detection. Most of the work in statistics, i.e., offline or quickest (online) change-point detection (CD) [8], is restricted to univariate (1D) series and parametric distribution assumptions, which do not hold for human motions with complex structure. [16] uses undirected sparse Gaussian graphical models and performs joint structure estimation and segmentation. Recently, as a nonparametric extension of Bayesian online change-point detection (BOCD) [9], [17] was proposed to combine BOCD and Gaussian Processes (GPs) to relax the i.i.d. assumption within a regime. Although GPs improve the ability to model complex data, they also bring in high computational cost. More relevant to us, kernel methods have been applied to non-parametric change-point detection on multivariate time series [18, 19]. In particular, [18] (KCD) utilizes the one-class SVM as an online training method and [19] (KCpA) performs sequential segmentation based on the Kernel Fisher Discriminant Ratio. Unlike all the above works, KTC detects not only action transitions but also cyclic motions.
Temporal Clustering. Clustering is a long-standing topic in machine learning [20, 21]. Recently, as an extension of clustering, some works focus on how to correctly segment time series temporally into different clusters. As an elegant combination of Kernel K-means and spectral clustering, Aligned Cluster Analysis (ACA) was developed for temporal clustering of facial behavior, with a multi-subject correspondence algorithm for matching facial expressions [6]. To estimate the unknown number of clusters, [22] uses the hierarchical Dirichlet process as a prior to improve the switching linear dynamical system (SLDS). Most of these works segment time series offline and provide cluster labels as in clustering. As a complementary approach, KTC performs online temporal segmentation, which is suitable for realtime applications.
Motion Analysis. In computer vision and graphics, some works focus on grouping human motions. Unusual human activity detection is addressed in [5] using (bipartite) graph spectral clustering. [23] extracts spatio-temporal features to address event clustering on video sequences. [10] proposes a geometric-invariant temporal clustering algorithm to cluster facial expressions. More relevantly, [1] proposes an online algorithm to decompose motion sequences into distinct action segments. Their method is an elegant temporal extension of Probabilistic Principal Component Analysis for change-point detection (PPCA-CD), which is computationally efficient but restricted to (approximately) Gaussian assumptions.

Action Recognition. Although significant progress has been made in human activity recognition [3, 4, 7, 24], the problem remains inherently challenging due to viewpoint change, partial occlusion and spatio-temporal variations. By combining KTC and alignment approaches such as [25, 15], we can perform online action recognition for input from a 2.5D depth sensor. Unlike other work on supervised joint segmentation and recognition [26], two significant features of our approach are viewpoint independence and the handling of an arbitrary person with only a few labeled Mocap sequences, in the transfer learning module.

3 Online Temporal Segmentation of Human Motion

This section describes Kernelized Temporal Cut (KTC), a temporal application of Hilbert space embedding of distributions [11] and the kernelized two-sample test [14, 13], to sequentially estimate temporal cut points in human motion sequences.

3.1 Problem Formulation

Given a stream input X_{1:L_x} = \{x_t\}_{t=1}^{t=L_x} (x_t \in \mathbb{R}^{D_t}, where D_t can be fixed or change over time), the goal of temporal segmentation is to predict temporal cut points c_i. For instance, if a person walks and then boxes, a temporal cut point must be detected. For depth sensor data, x_t is the vector representation of the tracked joints. More details on x_t are given in Sec. 6. From a machine learning perspective, the estimated \{c_i\}_{i=1}^{N_c} can be modeled by minimizing the following objective function,

L_X(\{c_i\}_{i=1}^{N_c}, N_c) = \sum_{i=1}^{N_c} I(X_{c_{i-1}:c_i-1}, X_{c_i:c_{i+1}-1}) (1)

where X_{c_i:c_{i+1}-1} \in \mathbb{R}^{D\times(c_{i+1}-c_i)} indicates the segment between two cut points c_i and c_{i+1} (c_1 = 1, c_{N_c+1} = L_x + 1). Here I(\cdot,\cdot) is the homogeneity function that measures the spatio-temporal consistency between two consecutive segments. It is worth noting that both \{c_i\}_{i=1}^{N_c} and N_c need to be estimated from Eq. 1. Next, the main task is to design I(\cdot,\cdot) and to optimize Eq. 1 online. As a counterpart, Eq. 1 could be optimized offline by dynamic programming when N_c is given, which is out of the scope of this paper.

3.2 KTC-S
Instead of jointly optimizing Eq. 1, the proposed Kernelized Temporal Cut (KTC) sequentially optimizes c_{i+1} based on c_i by minimizing the following loss function,

L_{\{X_{c_i:c_i+T-1}\}}(c_{i+1}) = I(X_{c_i:c_{i+1}-1}, X_{c_{i+1}:c_i+T-1}), \quad i = 1, 2, ..., N_c - 1 (2)

where c_i (c_1 = 1, c_{N_c+1} = L_x + 1) is provided by the previous step and T is a fixed length. We refer to this sequential optimization process for Eq. 2 as KTC-S, where S stands for sequential. Sequentially optimizing L is in effect a fixed-length sliding-window process, which is also used in [19]. However, setting T is a difficult task, and how

to improve this process is described in Sec. 3.3. Essentially, Eq. 2 is a two-class temporal clustering problem for X_{c_i:c_i+T-1} \in \mathbb{R}^{D\times T}. The crucial factor is constructing I(\cdot,\cdot), which is related to temporal versions of the (dis)similarity functions in spectral clustering [20, 21, 10] and information-theoretic clustering [27].
To handle the complex structure of human motion, unlike previous work, KTC utilizes Hilbert space embedding of distributions (HED) to map the distribution of X_{t_1:t_2} into a Reproducing Kernel Hilbert Space (RKHS). [11, 12] are seminal works on combining kernel methods and probability distribution analysis. Without going into details, the idea of using HED for temporal segmentation is straightforward. The change-point is detected by using a well-behaved (smooth) kernel function whose values are large on samples belonging to the same spatio-temporal pattern and small on samples from different patterns. By doing this, KTC not only handles the nonparametric and high-dimensionality problems but also rests on a solid theoretical foundation [11].
HED. Inspired by [12], probability distributions can be embedded in an RKHS. At the center of the Hilbert space embedding of distributions are the mean mapping functions,

\mu(P_x) = E_x(k(x, \cdot)), \qquad \mu(X) = \frac{1}{T}\sum_{t=1}^{T} k(x_t, \cdot) (3)

where \{x_t\}_{t=1}^{T} are assumed to be i.i.d. samples from the distribution P_x. Under mild conditions, \mu(P_x) (and likewise \mu(X)) is an element of the Hilbert space, with

\langle \mu(P_x), f\rangle = E_x(f(x)), \qquad \langle \mu(X), f\rangle = \frac{1}{T}\sum_{t=1}^{T} f(x_t).

The mappings \mu(P_x) and \mu(X) are attractive because:

Theorem 1. If the kernel k is universal, then the mean map \mu : P_x \mapsto \mu(P_x) is injective. [12]

This theorem states that distributions of x \in \mathbb{R}^D have a one-to-one correspondence with mappings \mu(P_x). Thus, for two distributions P_x and P_y, we can use the function norm \|\mu(P_x) - \mu(P_y)\| to quantitatively measure the difference (denoted D(P_x, P_y)) between these two distributions. Moreover, we do not need to access the actual distributions but rather finite samples to calculate D(P_x, P_y), because:

Theorem 2. Assume that \|f\|_\infty \leq C for all f \in H with \|f\|_H \leq 1. Then, with probability at least 1 - \delta, \|\mu(P_x) - \mu(X)\| \leq 2 R_T(H, P_x) + C\sqrt{-T^{-1}\log(\delta)}. [12]

As long as the Rademacher average R_T(H, P_x) is well behaved, finite samples yield an error that converges to zero; thus they empirically approximate \mu(P_x). Therefore, D(P_x, P_y) can be precisely approximated using the finite-sample estimate \|\mu(X) - \mu(Y)\|.
Thanks to the above facts, we use HED to construct I_{KTC}(X_{1:T_1}, Y_{1:T_2}) to measure the consistency between the distributions of two segments as follows,

\frac{2}{T_1 T_2}\sum_{i,j} k(x_i, y_j) - \frac{1}{T_1^2}\sum_{i,j} k(x_i, x_j) - \frac{1}{T_2^2}\sum_{i,j} k(y_i, y_j). (4)
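For illustration, the consistency measure of Eq. 4 can be estimated from finite samples as in the following sketch (numpy assumed); a plain Gaussian kernel is used here in place of the spatio-temporal kernel introduced below, so this is an illustrative simplification rather than the full method.

import numpy as np

def gram(X, Y, gamma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    return np.exp(-gamma * d2)

def I_ktc(X, Y, gamma=1.0):
    # Eq. 4: high when the two segments share the same distribution (negative biased MMD^2)
    return 2.0 * gram(X, Y, gamma).mean() - gram(X, X, gamma).mean() - gram(Y, Y, gamma).mean()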

Combining Eq. 2 and Eq. 4, c_{i+1} is estimated by minimizing the following function in matrix formulation:

L_{\{X_{c_i:c_i+T-1}\}}(c_{i+1}) = -\bigl(E_T^{c_i}\bigr)^{H} K^{KTC}_{c_i:c_i+T-1}\, E_T^{c_i}, \qquad E_T^{c_i} = \frac{1}{\Delta c_i}\, e_T^{1:\Delta c_i} - \frac{1}{\Delta d_i}\, e_T^{\Delta c_i+1:T} (5)

where \Delta c_i and \Delta d_i are short notations for c_{i+1} - c_i and c_i + T - c_{i+1}. Here e_T^{t_1:t_2} \in \mathbb{R}^{T\times 1} is a binary vector with 1 at positions t_1 to t_2 and 0 elsewhere, and K^{KTC}_{c_i:c_i+T-1} \in \mathbb{R}^{T\times T} is the kernel matrix based on the kernel function k_{KTC}(\cdot).


Kernel. The success of kernel methods largely depends on the choice of the kernel
function [11]. As mentioned before, the difficulty of human motion, is that both spatial
and temporal structures are important. Thus, we propose a novel spatio-temporal kernel
kKT C () as follows,

 i ), (x
kKT C (xi , xj ) = kS (xi , xj )kT (xi , xj ) = kS (xi , xj )kT ((x  j )) (6)


where kS () is the spatial kernel and kT () is the temporal kernel. (x) is the esti-
mated local tangent space at point x. kS () and kT () can be chosen according to the
domain knowledge or universal kernels such as Gaussian. For instance, the canonical
component analysis (CCA) kernel [28] is used for joint-position features as,

kSCCA (xi , xj ) = exp(S dCCA (xi , xj )2 ) (7)

where dCCA () is the CCA metric based on M 3 matrix representation of x


"3M11 . Or in general, we set them as,

kS (xi , xj ) = exp(S  xi xj 2 )
(8)
 i ), (x
kT ((x  j )) = exp(T ((x
 i ), (x
 j ))2 )

where S is the kernel parameter for kS () and T is the kernel parameter for kS ().
() is the notation of principal angle between two subspace (range from 0 to 2 ).
In short, the spatio-temporal kernel kKT C captures both spatial and temporal dis-
tributions of data (a visual example in Fig. 3), which is suitable to model structured
sequential data. As special cases, kST degenerates to spatial kernel if T 0 and to
temporal kernel if S 0.
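A sketch of this kernel for two frames is given below (numpy assumed); Phi_i and Phi_j stand for orthonormal bases of the estimated local tangent spaces, and the largest principal angle is used, which is one possible reading of theta(.) rather than the authors' exact choice.

import numpy as np

def k_ktc(x_i, x_j, Phi_i, Phi_j, gamma_s=1.0, gamma_t=1.0):
    k_spatial = np.exp(-gamma_s * np.sum((x_i - x_j) ** 2))         # Gaussian spatial kernel
    s = np.clip(np.linalg.svd(Phi_i.T @ Phi_j, compute_uv=False), -1.0, 1.0)
    theta = np.arccos(s.min())                                      # largest principal angle
    return k_spatial * np.exp(-gamma_t * theta ** 2)                # Eq. 6 with Eq. 8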
Optimization. Unlike the NP-hard optimization in spectral clustering [21], Eq. 5 can be solved efficiently because the feasible region of c_{i+1} is [c_i + 1, c_i + T - 1], allowing us to search the entire space to minimize L(c_{i+1}). For each step, minimizing Eq. 5 requires at most O(T^2) accesses of k_{KTC}(\cdot).

3.3 KTC-R
A procedure for sequentially optimizing Eq. 1 was given in Sec. 3.2. However, this process may not be suitable for realtime applications. A key feature of human motion is temporal variation,
1 M is the number of 3D joints from Mocap or the depth sensor.

Fig. 3. An illustration of KTC-R. Left: tracked joints from the depth sensor; right: K^{KTC}_{1:190} \in \mathbb{R}^{190\times 190} for the window X_{1:190}. The decision to make no cut between frames 1 and 110 is made before the current window, with a maximum delay of T_0 = 80 frames.

i.e., one action can last for a long time or only a few seconds. Thus, it is difficult to use a fixed-length-T sliding window to capture transitions. Small values of T cause over-segmentation and large values of T cause large delays (T = 300 for the depth sensor results in a 10 sec delay). To overcome this problem, we combine an incremental sliding-window strategy [1] and the two-sample test [14, 13] to design a realtime algorithm for Eq. 5, i.e., KTC-R (Fig. 3).
Given X_{1:L_x} = [x_1, ..., x_{L_x}] \in \mathbb{R}^{D\times L_x}, KTC-R sequentially processes the varying-length window X_t = [x_{n_t}, ..., x_{n_t+T_t}] at step t. This process starts from n_1 = 1 and T_1 = 2T_0, where T_0 is the pre-defined shortest possible action length. At step t (assume the last cut is c_i), if no action transition point is captured, the following update is performed,

n_{t+1} = n_t, \qquad T_{t+1} = T_t + \Delta T; (9)

otherwise, if there is a transition point,

c_{i+1} = n_t + T_t - T_0, \qquad n_{t+1} = c_{i+1}, \qquad T_{t+1} = T_1, (10)

where \Delta T is the step length by which the window is increased. The process ends when n_t \geq L_x - T_0. As shown in Eq. 9 and Eq. 10, X_{1:L_x} is processed sequentially, and each cut c_i is estimated when the algorithm receives the (c_i + T_0 - 1)-th frame (the same holds for non-cut frames). This means that KTC-R has a fixed-length time delay T_0, as shown in Fig. 3.
At each step, deciding on a cut (at frame n_t + T_t - T_0) is equivalent to the following hypothesis test,

H_0: \{x_i\}_{i=n_t}^{n_t'-1} \text{ and } \{x_i\}_{i=n_t'}^{n_t+T_t-1} \text{ are the same}; \qquad H_A: \text{not } H_0 (11)

where n_t' is a short notation for n_t + T_t - T_0. Eq. 11 is re-written by combining Eq. 5 as follows,

L_t = -\bigl(E_{T_t}^{n_t'-n_t}\bigr)^{H} K^{KTC}_{n_t:n_t+T_t-1}\, E_{T_t}^{n_t'-n_t};
\qquad H_0: L_t \geq \epsilon_t:\ n_t' \text{ is not a cut}; \quad H_A: L_t < \epsilon_t:\ n_t' \text{ is a cut} (12)

where \epsilon_t is the adaptive threshold for the hypothesis test (12). In fact, Eq. 12 is directly inspired by [14], which proposes a kernelized two-sample test method. L_t is analogous

to the negative square of the empirical estimate of the Maximum Mean Discrepancy (MMD), which has the following formulation,

MMD[\mathcal{F}, X_{1:T_1}, Y_{1:T_2}] = \Bigl(\frac{1}{T_1^2}\sum_{i,j=1}^{T_1} k(x_i, x_j) - \frac{2}{T_1 T_2}\sum_{i,j=1}^{T_1,T_2} k(x_i, y_j) + \frac{1}{T_2^2}\sum_{i,j=1}^{T_2} k(y_i, y_j)\Bigr)^{\frac{1}{2}} (13)

where \mathcal{F} is a unit ball in a universal RKHS H, and \{x_i\}_{i=1}^{T_1} and \{y_j\}_{j=1}^{T_2} are i.i.d. samples from the distributions P_x and P_y. It can be shown that

\lim_{\gamma_T \to 0} L_t = -MMD[\mathcal{F}, X_{n_t:n_t'-1}, X_{n_t':n_t+T_t-1}]^2 (14)

if the same kernel as in the MMD is used as the spatial kernel in k_{KTC}(\cdot) (Eq. 6), as k_T(\cdot) degenerates to 1 when \gamma_T \to 0. Based on Eq. 14, \epsilon_t is set as -B_R(t) + \bar{\epsilon}, where B_R(t) is an adaptive threshold calculated from the Rademacher bound [14], and \bar{\epsilon} is a fixed global threshold, which is the only non-trivial parameter in KTC-R (used to control the coarse-to-fine level of segmentation).
Analysis. In summary, both KTC-S and KTC-R are based on Eq. 5. The main differences are that KTC-S performs segmentation by sequential optimization in a two-class temporal clustering manner, whereas KTC-R performs segmentation using an incremental sliding window in a two-sample test manner. KTC-R requires more sliding windows than KTC-S, but for each one there is no optimization, and accessing k_{KTC}(\cdot) O(T_t \Delta T) times is enough (linear in T_t). Only when a new cut is detected are O(T_t^2) accesses required. Thus, KTC-R is extremely efficient and suitable for realtime applications.
It is notable that, even if the fixed-length sliding-window method (Sec. 3.2) is improved to decide whether or not a cut occurs in X_{c_i:c_i+T-1}, a small T is still not reliable for realtime applications. The reason is that a clear temporal cut for human motion requires a large number of observations before and after the cut. Indeed, the required number of frames varies from action to action, even for manual annotation.

4 Online Hierarchical Temporal Segmentation

Besides estimating \{c_i\}, decomposing an action segment X_{c_i:c_{i+1}-1} into an unknown number of action units (e.g., three walking cycles), if cyclic motion exists, is also needed [29]. This is not only helpful for understanding motion sequences, but also for other applications such as recognition and indexing. Thus, an online cyclic structure segmentation algorithm, Kernelized Alignment Cut (KAC), is proposed as a generalization of kernel embedding of distributions and temporal alignment [15, 6]. By combining KAC and KTC-R, we obtain the two-layer segmentation algorithm KTC-H, where H stands for hierarchical. Action unit segmentation is difficult for non-periodic motions (e.g., jumping), which are actions that are usually performed once locally. However, people can still perform two consecutive non-periodic motions, and these two

motions are not identical because of intra-person variations, which brings challenges for KAC.
KAC. As an online algorithm, KAC utilizes a sliding-window strategy. Each window X_{a_j+n_t-T_m:a_j+n_t-1} is processed sequentially, starting from n_1 = 2T_m and a_1 = c_i, where a_j is the j-th action unit cut. T_m is a parameter giving the minimal length of one action unit. We empirically find that the results are insensitive to T_m.
For each window X_{a_j+n_t-T_m:a_j+n_t-1}, this process has two branches. Either the last action unit continues: n_{t+1} = n_t + \Delta T_m; or there is a new action unit: a_{j+1} = a_j + n_t - T_m, n_{t+1} = 2T_m. Here \Delta T_m is the step length. This process ends when a new cut point c_{i+1} is received. Deciding whether X_{a_j+n_t-T_m:a_j+n_t-1} is the start of a new unit or not can be formulated as

S_t = S_{Align}(X_{a_j:a_j+T_m-1}, X_{a_j+n_t-T_m:a_j+n_t-1});
\qquad H_0: S_t \leq \eta_t:\ a_j + n_t - T_m \text{ is a new unit}; \quad H_A: S_t > \eta_t:\ \text{not } H_0 (15)

where S_{Align}(\cdot) is the metric measuring the structural similarity between X_{a_j:a_j+T_m-1} and X_{a_j+n_t-T_m:a_j+n_t-1}, used to handle intra-person variations. \eta_t is an adaptive threshold (empirically set by cross-validation) and should ideally be close to zero if the alignment can perfectly absorb the variations. Similar to KTC-R, KAC has a delay of T_m. In particular, KAC uses dynamic time warping (DTW) [15, 6] to design S_{KAC}(\cdot) by minimizing the following loss function based on the kernel from Eq. 6,

S_{KAC}\bigl(K^{a_j+n_t-T_m:a_j+n_t-1}_{a_j:a_j+T_m-1};\ W_1, W_2\bigr) (16)

where K is the cross-kernel matrix between the two segments, and W_1 and W_2 are binary temporal warping matrices encoding the temporal alignment path, as in [15]. Interested readers are referred to [15, 6] for more details about S(\cdot). Eq. 16 can be optimized by dynamic programming with complexity O(T_m^2), and S_{KAC}(\cdot) measures the similarity between the current action unit (a part of it) and the current window. Importantly, alignment methods such as DTW are not suitable for Eq. 12. This is because alignment requires the two segments to have roughly the same starting and ending points, which does not hold in Eq. 12.
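As a generic illustration of the alignment score, the sketch below runs standard dynamic time warping over a cross-kernel matrix K, converting similarities to costs 1 - K; this conversion is an illustrative choice and not the exact S_KAC formulation of [15, 6].

import numpy as np

def dtw_score(K):
    C = 1.0 - K                             # cost matrix from cross-kernel similarities
    n, m = C.shape
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = C[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)                # length-normalised alignment cost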
KTC-H. By combining KTC-R and KAC, we can capture action transitions (cuts) and action units sequentially and simultaneously in the integrated algorithm KTC-H. Formally, KTC-H uses a two-layer sliding window strategy, i.e., the outer loop (sec. 3.3) estimates $c_i$ and the inner loop estimates $a_j$ between $c_i$ and the current frame of the outer loop. Since KTC-R (eq. 12) and KAC (eq. 15) both have a fixed delay ($T_0$ and $T_m$), KTC-H is suitable for realtime transition and action unit segmentation.
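As a rough structural sketch of this two-layer strategy (not the authors' implementation; detect_cut and detect_unit are assumed callables standing in for the KTC-R test of eq. 12 and the KAC test of eq. 15):

def ktc_h(X, T0, Tm, dTm, detect_cut, detect_unit):
    # Two-layer online segmentation over a frame sequence X.
    # Outer loop: transition cuts c_i (KTC-R, delay T0).
    # Inner loop: action-unit cuts a_j (KAC, delay Tm, step dTm).
    cuts, units = [0], [0]
    n = 2 * Tm                                    # inner-window offset n_t
    for t in range(T0, len(X)):                   # frames processed online
        if detect_cut(X, cuts[-1], t, T0):        # transition inside X_{t-T0 : t-1}?
            cuts.append(t - T0)                   # cut reported with delay T0
            units.append(cuts[-1])                # restart unit segmentation
            n = 2 * Tm
        elif n <= t - units[-1] and detect_unit(X, units[-1], n, Tm):
            units.append(units[-1] + n - Tm)      # new action unit a_{j+1}
            n = 2 * Tm
        else:
            n += dTm                              # current unit continues
    return cuts, units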
Discussion. We compare with several related algorithms: (1) Spectral clustering [21]
can be extended to temporal clustering if only allowing temporal cuts (TSC) [10]. Sim-
ilarly, minimizing eq. 5 can be viewed as an instance of TSC motivated by embedding
distributions into RKHS. (2) PPCA-CD is proposed in [1] to model motion segments
by Gaussian models, where CD stands for change-point detection. Compared to [1],
KTC has higher computational cost but gains the ability to handle nonparametric dis-
tributions. (3) KTC is similar to KCpA [19], which uses the novel kernel Fisher dis-
criminant ratio. Compared to [19], KTC performs change-point detection by using the
incremental sliding-window. More importantly, KTC detects both change-points and cyclic structures. This is crucial for online recognition, since an action can be recognized after only one unit instead of the whole action. (4) As an elegant extension of
Kernel K-means and spectral clustering, ACA is proposed in [6] for offline temporal
clustering. KTC can be viewed as an online complementary approach to [6].

5 Online Action Recognition from OpenNI


We use KTC-H to recognize tracked OpenNI data online, based on labeled Mocap sequences only. To the best of our knowledge, this is the first work towards continuous action recognition from OpenNI data in a transfer learning framework, without the need for training data from OpenNI.
Recognition. In our system, action recognition is done by combining KTC-H and sequence alignment. Assume there are $N$ labeled Mocap segments (action units) $\{X^i_{mocap} \in \mathbb{R}^{3M \times L_i}\}_{i=1}^N$, each associated with a label $I^i \in \mathcal{I}$, where $\mathcal{I} = \{1, 2, \dots, C\}$ indicates $C$ action classes. Given a segmented action unit $X_{a_j:a_{j+1}-1} \in \mathbb{R}^{3K \times (a_{j+1}-a_j)}$ from KTC-H, the estimated action label $I^{Test}$ is given by using Dynamic Manifold Warping (DMW) [25],
$$I^{Test}_{a_j:a_{j+1}-1} = \arg\min_{i \in \{1,2,\dots,N\}} S_{\mathrm{DMW}}\big(X_{a_j:a_{j+1}-1},\ (X^i_{mocap}, I^i)\big) \qquad (17)$$

where $K$ can be $M$ or $M'$, the latter indicating noisy and possibly occluded tracking trajectories from depth sensor data obtained with the OpenNI tracker. The spatial and temporal variations between $X$ and $X^i_{mocap}$ are handled by DMW.
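Conceptually, the recognition step of eq. 17 is a nearest-neighbor assignment under an alignment-based distance. A minimal sketch, assuming some alignment distance align_dist (e.g., a DTW cost) as a stand-in for $S_{\mathrm{DMW}}$, which in the paper additionally handles spatial and manifold variations:

def recognize_unit(X_unit, mocap_segments, mocap_labels, align_dist):
    # Assign the label of the best-aligning labeled Mocap action unit (cf. eq. 17).
    # X_unit:         a segmented action unit from KTC-H (frames x features)
    # mocap_segments: list of N labeled Mocap action units
    # mocap_labels:   list of N class labels in {1, ..., C}
    # align_dist:     alignment-based distance; the paper uses DMW [25]
    costs = [align_dist(X_unit, Xi) for Xi in mocap_segments]
    best = min(range(len(costs)), key=costs.__getitem__)
    return mocap_labels[best], costs[best]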

6 Experimental Validation
We evaluate the performance of our system from two aspects: (1) temporal segmenta-
tion on depth sensor, Mocap and video, (2) online action recognition on depth sensor.

6.1 Online Temporal Segmentation of Human Action


In this section, a quantitative comparison of online temporal segmentation methods is provided. KTC-R is compared with other state-of-the-art methods, i.e., PPCA-CD [1] and TSC-CD [23, 10], where TSC-CD is a change-point detection algorithm based on temporal spectral clustering in our implementation. PPCA-CD uses the same incremental sliding-window strategy as sec. 3.3 and TSC-CD uses the fixed-length sliding window as sec. 3.2. Thresholds (for KTC-R as well as for the other methods) are set by cross-validation on one sequence. Methods like ACA [6] and [22] cannot be directly compared since they are offline. Results are evaluated by three metrics, i.e., precision, recall and rand index. The first two are for cut points and the last one is for all frames. The ground-truth for the rand index (RI) is labeled as consecutive numbers 1, 2, 3, ... for the different segments. Importantly, $T_0$ is set as 80, 250, 60 for depth sensor, Mocap and video respectively, making KTC-R have 2.3, 2.1 and 1 seconds delay.
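The three metrics can be computed as in the generic sketch below (tolerance-based matching of detected cut points to ground-truth cuts, plus the Rand index over frame labels); the matching tolerance tol is an assumption and not a value specified in the paper.

import numpy as np

def cut_precision_recall(pred_cuts, gt_cuts, tol):
    # Match predicted cut points to ground-truth cuts within +/- tol frames.
    matched_pred = sum(any(abs(p - g) <= tol for g in gt_cuts) for p in pred_cuts)
    matched_gt = sum(any(abs(g - p) <= tol for p in pred_cuts) for g in gt_cuts)
    precision = matched_pred / max(len(pred_cuts), 1)
    recall = matched_gt / max(len(gt_cuts), 1)
    return precision, recall

def rand_index(labels_a, labels_b):
    # Rand index between two frame labelings (segments labeled 1, 2, 3, ...).
    a, b = np.asarray(labels_a), np.asarray(labels_b)
    n = len(a)
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    agree = (same_a == same_b)
    # fraction of agreeing frame pairs over all unordered pairs
    return (agree.sum() - n) / (n * (n - 1))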
Fig. 4. Examples of Online Temporal Segmentation. Top (Depth Sensor): a sequence including 3 segments: walking, boxing and jumping. Noisy joint trajectories are tracked by OpenNI. Middle (Mocap): a sequence with 4579 frames and 7 action segments. Bottom (Video): a clip including walking and running. For all cases, KTC-R achieves the highest accuracy.

KTC-S achieves similar accuracy to KTC-R but with a longer delay; details are omitted due to lack of space. Results are very robust to $T_0$ and $T$. For instance, we obtained almost identical results when $T_0$ ranges from 60 to 120 on OpenNI data.
Depth Sensor. To validate online temporal segmentation on depth sensor data, 10 human motion sequences are captured by the PrimeSense sensor. Each sequence is a combination of 3 to 5 actions (e.g., walking to boxing) with a length of around 700 frames (30Hz). For human pose tracking, we use the available OpenNI tracker to automatically track joints on the human skeleton. $K \in [12, 15]$ key points are tracked, resulting in joint 3D positions $x_t$ in $\mathbb{R}^{36}$ to $\mathbb{R}^{45}$. Although human pose tracking results are often noisy (Fig. 4 and Fig. 5), we can correctly estimate action transitions from these noisy tracking

Table 1. Temporal Segmentation Results Comparison. Precision (P), recall (R) and rand index
(RI) are reported

Methods   PPCA-CD (online)            TSC-CD (online)             KTC-R (online)
Depth     0.73(P)/0.78(R)/0.80(RI)    0.77(P)/0.81(R)/0.81(RI)    0.87(P)/0.93(R)/0.88(RI)
Mocap     0.85(P)/0.90(R)/0.90(RI)    0.83(P)/0.86(R)/0.88(RI)    0.86(P)/0.91(R)/0.92(RI)
Video     --                          0.78(P)/0.85(R)/0.82(RI)    0.85(P)/0.92(R)/0.88(RI)
results. In particular, KTC-R ($T = 30$) significantly improves the accuracy over the other methods (Table 1). The main reason is that the positions of the noisily tracked joints have complex nonparametric structures, which are handled by the kernel embedding of distributions [14, 13, 12] in KTC.
KTC-H. Besides action transitions, results on detecting both cyclic motions and transitions are reported by performing KTC-H ($T_m = 50$, $\Delta T_m = 1$). Since the other methods do not have such a module, we report a quantitative comparison on online hierarchical segmentation by using KTC-H or the other methods plus our KAC algorithm from sec. 4. Results show (Table 2) that KTC-H achieves higher accuracy than the other combinations. It is notable that, because of the nature of RI, the RI metric will increase when the number of cuts increases, even for low P/R, which is the case for hierarchical segmentation (including two types of cuts).
Mocap. Similar to [1, 6], $M = 14$ joints are used to represent the human skeleton, resulting in quaternion representations of the joint angles in 42D. Online temporal segmentation methods are tested on 14 selected Mocap sequences from subject 86 in the CMU Mocap database. Each sequence is a combination of roughly 10 action segments and in total there are around $10^5$ frames (120Hz). Since our implementation of PPCA-CD differs from [1] (e.g., only a forward pass is allowed in our experiments), results are not the same as in [1]. Table 1 shows that the gain of KTC-R ($T = 50$) over the other methods on Mocap is reduced compared with the depth sensor data. This is because the Gaussian property is more likely to hold for the quaternion representation of noiseless Mocap data, which is not the case for real data in general.
Video. Furthermore, KTC-R is performed on a number of sequences from HumanEva-2, which is a benchmark for human motion analysis [30]. Silhouettes are extracted by background subtraction, resulting in a sequence of binary masks (60Hz). $x_t \in \mathbb{R}^{D_t}$ is set as the vector representation of the mask at frame $t$. It is notable that $D_t$ (the size of the mask) may not be identical in different frames, so PPCA-CD cannot be applied. This fact supports the advantage of KTC, which is applicable to complex sequential data as long as a (pseudo) kernel can be defined. In particular, we follow [6] to compute the matching distance of silhouettes to set the kernel. Results are shown in Fig. 4 and Table 1. As a reference, the state-of-the-art offline temporal clustering method ACA achieves higher accuracy than KTC-R on Mocap (96% precision). However, offline methods (1) are not suitable for real-time applications, and (2) require the number of clusters (segments) to be set in advance, which is not applicable in many cases.

6.2 Joint Online Segmentation and Recognition from OpenNI


We collect an additional 5109 frames ($N = 30$) covering 10 primitive actions from CMU Mocap as the labeled (training) data for recognition. In order to associate labeled Mocap sequences with data from other domains, joint-position trajectories ($M = 15$) are used in eq. 17 [25]. Testing data are the previously collected sequences from the depth sensor, and online segmentation and recognition are simultaneously performed by KTC-H and eq. 17. A significant feature of our approach is that there is no extra training process for the depth sensor, i.e., the knowledge from Mocap can be transferred to other motion
Fig. 5. Online action segmentation and recognition on the 2.5D depth sensor. Top to bottom: depth image sequences, KTC-H results and action recognition results. For segmentation, the blue line indicates the cut and different rectangles indicate different action units. The blue circle indicates noisy tracking results. For recognition: distance to labeled Mocap sequences, and inferred $3M$-D motion sequences.

sequences, based on proper features. Tracked trajectories from OpenNI in an action unit
(segmented by KTC-H) are associated with labeled Mocap sequences from 10 action
categories.

Table 2. Online hierarchical segmentation and recognition on 2.5D depth sensor

Methods   PPCA-CD+KAC                           KTC-H
Depth     0.72(P)/0.76(R)/0.89(RI)/0.71(Acc)    0.85(P)/0.87(R)/0.94(RI)/0.85(Acc)

Although OpenNI tracking results are often noisy (highlighted by blue circles in Fig. 5), we achieve 85% recognition accuracy (Acc) from these noisy tracking results (Table 2), without any additional training on depth sensor data. This result benefits not only from DMW [25], but also from KTC-H. DMW requires that the input contain only one action unit, and KTC-H performs the critical missing step, i.e., accurate online temporal segmentation, that enables recognition. As illustrated in Table 2, the accuracy on OpenNI is enhanced from 0.71 to 0.85, which strongly supports the effectiveness of KTC-H. Furthermore, complete and accurate $3M$-D human motion sequences can be inferred by associating the learned manifolds from Mocap.

7 Conclusion

In this paper, we propose an online temporal segmentation method, KTC, as a temporal extension of Hilbert space embedding of distributions for change-point detection based on a novel spatio-temporal kernel. Furthermore, a realtime implementation of KTC
and a hierarchical extension are designed, which can detect both action transitions and action units. Based on KTC, we achieve realtime temporal segmentation on Mocap data, motion sequences from a 2.5D depth sensor and 2D videos. Furthermore, temporal segmentation is combined with alignment, resulting in realtime action recognition on depth sensor input, without the need for training data from the depth sensor.

Acknowledgments. This work was supported in part by NIH Grant EY016093 and
DE-FG52-08NA28775 from the U.S. Department of Energy.

References
1. Barbic, J., Safonova, A., Pan, J.Y., Faloutsos, C., Hodgins, J.K., Pollard, N.S.: Segmenting motion capture data into distinct behaviors. In: Proc. Graphics Interface, pp. 185–194 (2004)
2. Lv, F., Nevatia, R.: Recognition and Segmentation of 3-D Human Action Using HMM and Multi-class AdaBoost. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006, Part IV. LNCS, vol. 3954, pp. 359–372. Springer, Heidelberg (2006)
3. Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: Proc. CVPR (2008)
4. Weinland, D., Ozuysal, M., Fua, P.: Making Action Recognition Robust to Occlusions and Viewpoint Changes. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part III. LNCS, vol. 6313, pp. 635–648. Springer, Heidelberg (2010)
5. Zhong, H., Shi, J., Visontai, M.: Detecting unusual activity in video. In: Proc. CVPR (2004)
6. Zhou, F., De la Torre, F., Hodgins, J.K.: Hierarchical aligned cluster analysis for temporal clustering of human motion. Under review at IEEE PAMI (2011)
7. Niebles, J.C., Chen, C.-W., Fei-Fei, L.: Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part II. LNCS, vol. 6312, pp. 392–405. Springer, Heidelberg (2010)
8. Chen, J., Gupta, A.: Parametric Statistical Change-point Analysis. Birkhäuser (2000)
9. Adams, R.P., MacKay, D.J.: Bayesian online changepoint detection. University of Cambridge Technical Report (2007)
10. De la Torre, F., Campoy, J., Ambadar, Z., Conn, J.F.: Temporal segmentation of facial behavior. In: Proc. ICCV (2007)
11. Hofmann, T., Schölkopf, B., Smola, A.: Kernel methods in machine learning. Annals of Statistics 36, 1171–1220 (2008)
12. Smola, A.J., Gretton, A., Song, L., Schölkopf, B.: A Hilbert Space Embedding for Distributions. In: Hutter, M., Servedio, R.A., Takimoto, E. (eds.) ALT 2007. LNCS (LNAI), vol. 4754, pp. 13–31. Springer, Heidelberg (2007)
13. Gretton, A., Fukumizu, K., Harchaoui, Z., Sriperumbudur, B.: A fast, consistent kernel two-sample test. In: NIPS, vol. 19, pp. 673–681 (2009)
14. Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample-problem. In: NIPS, vol. 19, pp. 513–520 (2007)
15. Zhou, F., De la Torre, F.: Canonical time warping for alignment of human behavior. In: NIPS, vol. 22, pp. 2286–2294 (2009)
16. Xuan, X., Murphy, K.: Modeling changing dependency structure in multivariate time series. In: Proc. ICML (2007)
17. Saatci, Y., Turner, R., Rasmussen, C.: Gaussian process change point models. In: Proc. ICML (2010)
18. Desobry, F., Davy, M., Doncarli, C.: An online kernel change detection algorithm. IEEE Trans. on Signal Processing 53, 2961–2974 (2005)
19. Harchaoui, Z., Bach, F., Moulines, E.: Kernel change-point analysis. In: NIPS 21, pp. 609–616 (2009)
20. Ng, A.Y., Jordan, M.I., Weiss, Y.: On spectral clustering: Analysis and an algorithm. In: NIPS, vol. 14, pp. 849–856 (2002)
21. von Luxburg, U.: A tutorial on spectral clustering. Statistics and Computing 17, 395–416 (2007)
22. Fox, E., Sudderth, E., Jordan, M., Willsky, A.: Nonparametric bayesian learning of switching linear dynamical systems. In: NIPS 21, pp. 457–464 (2009)
23. Zelnik-Manor, L., Irani, M.: Statistical analysis of dynamic actions. IEEE PAMI 28, 1530–1535 (2006)
24. Satkin, S., Hebert, M.: Modeling the Temporal Extent of Actions. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part I. LNCS, vol. 6311, pp. 536–548. Springer, Heidelberg (2010)
25. Gong, D., Medioni, G.: Dynamic manifold warping for view invariant action recognition. In: Proc. ICCV, pp. 571–578 (2011)
26. Hoai, M., Lan, Z., De la Torre, F.: Joint segmentation and classification of human actions in video. In: Proc. CVPR (2011)
27. Faivishevsky, L., Goldberger, J.: A nonparametric information theoretic clustering algorithm. In: Proc. ICML, pp. 351–358 (2010)
28. Bach, F.R., Jordan, M.I.: Kernel independent component analysis. JMLR 3, 1–48 (2003)
29. Laptev, I., Belongie, S., Perez, P., Wills, J.: Periodic motion detection and segmentation via approximate sequence alignment. In: Proc. ICCV (2005)
30. Sigal, L., Balan, A.O., Black, M.J.: Humaneva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. IJCV 87, 4–27 (2010)
Grain Segmentation of 3D Superalloy Images
Using Multichannel EWCVT
under Human Annotation Constraints

Yu Cao 1, Lili Ju 2, and Song Wang 1
1 Department of Computer Science & Engineering,
2 Department of Mathematics,
University of South Carolina, Columbia, SC 29208, USA
{cao,songwang}@cec.sc.edu, ju@math.sc.edu

Abstract. Grain segmentation of 3D superalloy images provides a superalloy's micro-structures, based on which many physical and mechanical properties can be evaluated. This is a challenging problem in the sense that (1) the number of grains in a superalloy sample could be thousands or even more; (2) the intensity within a grain may not be homogeneous; and (3) superalloy images usually contain carbides and noise. Recently, the Multichannel Edge-Weighted Centroid Voronoi Tessellation (MCEWCVT) algorithm [1] was developed for grain segmentation and showed better performance than many widely used image segmentation algorithms. However, as a general-purpose clustering algorithm, MCEWCVT does not consider possible prior knowledge from material scientists in the process of grain segmentation. In this paper, we address this issue by defining an energy minimization problem subject to certain constraints. We then develop a Constrained Multichannel Edge-Weighted Centroid Voronoi Tessellation (CMEWCVT) algorithm to effectively solve this constrained minimization problem. In particular, manually annotated segmentations on a very small set of 2D slices are taken as constraints and incorporated into the whole clustering process. Experimental results demonstrate that the proposed CMEWCVT algorithm significantly improves the previous grain-segmentation performance.

1 Introduction
For different industrial applications, engineers need to choose proper kinds of superalloys with desired mechanical or physical properties, such as lightness, hardness, stiffness, electrical conductivity and fluid permeability [2]. Such properties are related to the micro-structures of the superalloy material, which is usually made up of a set of grains [3]. Thus, in the development of and research on superalloy materials, a major task is to identify the superalloy micro-structures so that their mechanical and physical properties can be evaluated. Currently, material scientists mainly conduct manual annotations/segmentations on 2D superalloy image slices and then combine the 2D results to reconstruct the 3D grains.


Given the large number of grains in a superalloy sample and the large number of slices in its high-resolution 3D microscopic images, this manual-segmentation process is very tedious, time consuming and prone to error. Manual segmentation becomes even more difficult when the microscopic superalloy images are multi-channel images, where different channels reflect different parameter settings of the electronic microscope. For example, Figure 1 shows four channels of a single 2D slice, from which we can see that adjacent grains may be distinguishable by their intensity in certain channels but not in other channels. As a result, we may need to manually segment the image in each channel and then combine the segmentation results from different channels.
In principle, many existing segmentation methods may be useful to relieve material scientists from the manual segmentation. 2D segmentation methods, such as mean shift [4], watershed [5], statistical region merging [6], normalized cuts [7], graph cuts [8], level set [9] and watershed cuts [10, 11], can be applied to segment 2D slices. In particular, in [12], a region merging segmentation method called the stabilized inverse diffusion equation and a stochastic segmentation method called the expectation-maximization/maximization of the posterior marginals (EM/MPM) algorithm are combined for 2D Ni-based superalloy image segmentation. However, all the above 2D segmentation methods still cannot effectively and accurately establish the correspondence of segments across adjacent slices to achieve the desired 3D grain segmentation.


Fig. 1. Microscopic images of a slice of the superalloy sample taken by using 4 different
microscope settings. Small black points in the images are carbides. (a) 4000 Series. (b)
5000 Series. (c) 6000 Series. (d) 7000 Series.

3D image segmentation methods [13–16] can segment a volume image to achieve 3D grains directly. Recently, Cao et al. developed a Multichannel Edge-Weighted Centroid Voronoi Tessellation (MCEWCVT) algorithm for 3D multichannel superalloy grain segmentation, which outperforms many existing segmentation algorithms [1]. In MCEWCVT, the voxels in a 3D volume image are clustered into a set of groups which then form a segmentation of the 3D image space. The clustering is achieved by an unconstrained minimization of an energy function which combines a multichannel clustering energy term called "similarity" and an edge energy term called "regularity". The similarity term measures the distance from each voxel's multichannel intensity to the center of the cluster (in multichannel intensity space) to which the voxel is assigned. The
regularity term measures the amount of segmentation edges in image space.


Thus, when the MCEWCVT energy function is minimized, the voxels sharing
similar multichannel intensities are grouped into the same cluster while the small
over-segmentations are suppressed.
However, the MCEWCVT algorithm has two limitations: 1) the number of clusters is empirically chosen and the initial clusters are randomly generated; 2) no existing prior knowledge from material scientists is incorporated into the segmentation process. In practice, the prior knowledge could include the geometry of the grains' layout, the grains' phase identifications, and the grains' intensity similarity in different channels, etc. Material scientists have been exploring such kinds of prior knowledge for decades, and use it when they manually annotate the segmentations of 3D superalloy grains. In this paper, we incorporate such prior knowledge into the MCEWCVT model to further improve the grain segmentation performance. Specifically, we first construct a constrained minimization problem by enforcing a set of human annotated constraints (coming from a few manually segmented 2D slices) in the original MCEWCVT model. Then we develop a constrained multichannel edge-weighted centroid Voronoi tessellation (CMEWCVT) algorithm to solve the constrained minimization problem. In this algorithm, we also use the human annotated constraints to help determine the proper cluster number and the initial cluster centers.
The remainder of this paper is organized as follows. In Section 2, we briefly review the MCEWCVT model. The new constrained energy minimization problem is proposed in Section 3, and we then develop an effective CMEWCVT algorithm for solving the proposed problem in Section 4. We report the experimental and comparison results in Section 5. Section 6 concludes the paper.

2 The Multichannel Edge-Weighted Centroidal Voronoi Tessellation Model

In [1], segmentation of a multichannel superalloy volumetric image is mathematically formalized as an unconstrained minimization problem in which the energy (cost) function consists of a multichannel clustering energy term called "similarity" and an edge energy term called "regularity". Such a minimization problem can be effectively solved by an iterative MCEWCVT algorithm which guarantees a decrease of the energy in each iteration. In this section, we briefly review this algorithm for superalloy segmentation.

2.1 Multichannel Edge-Weighted Clustering Energy

Let $N$ denote the number of image channels of the superalloy volume, i.e., we have $N$ images, $u_1, u_2, \dots, u_N$, of the same superalloy sample taken under $N$ different microscope parameter settings. The set of intensity values of the 3D superalloy images $\mathbf{U}$ and the generators $\mathbf{W}$ can be written in vector form as

$$\mathbf{U} = \{\mathbf{u}(i,j,k) = (u_1, u_2, \dots, u_N)^T \in \mathbb{R}^N\}_{(i,j,k)\in D}$$



and
$$\mathbf{W} = \{\mathbf{w}_l = (w_{l1}, w_{l2}, \dots, w_{lN})^T \in \mathbb{R}^N\}_{l=1}^L,$$

respectively. Note here $L$ denotes the number of generators and $D = \{(i,j,k)\,|\,i = 1,\dots,I,\ j = 1,\dots,J,\ k = 1,\dots,K\}$ is an index set for a superalloy volume domain. The multichannel clustering energy can be defined as

$$E_C(\mathbf{W};\mathcal{D}) = \sum_{l=1}^{L} \sum_{(i,j,k)\in D_l} \rho(i,j,k)\,\|\mathbf{u}(i,j,k) - \mathbf{w}_l\|^2 \qquad (1)$$

where $\mathcal{D} = \{D_l\}_{l=1}^L$ is a tessellation of the image domain $D$ and $\rho$ is a predefined density function over $D$.
For each voxel $(i,j,k) \in D$, we denote by $N_\omega(i,j,k)$ a local neighborhood of it, which could be a $2\omega \times 2\omega \times 2\omega$ cube centered at $(i,j,k)$ or a sphere centered at $(i,j,k)$ with radius $\omega$. Further, if we define a local characteristic function $\chi_{(i,j,k)}: N_\omega(i,j,k) \to \{0,1\}$ as

$$\chi_{(i,j,k)}(i',j',k') = \begin{cases} 1 & \text{if } \hat{u}(i',j',k') \neq \hat{u}(i,j,k), \\ 0 & \text{otherwise,} \end{cases}$$

where $\hat{u}(i,j,k): D \to \{1,\dots,L\}$ tells which cluster $\mathbf{u}(i,j,k)$ belongs to, then, similar to [17], an edge energy can be defined as

$$E_L(\mathcal{D}) = \sum_{(i,j,k)\in D}\ \sum_{(i',j',k')\in N_\omega(i,j,k)} \chi_{(i,j,k)}(i',j',k'). \qquad (2)$$

When the density function is defined as $\rho = 1 + |\mathbf{u}|$, the total energy function, i.e., the multichannel edge-weighted clustering energy, is defined as:

$$
\begin{aligned}
E(\mathbf{W};\mathcal{D}) &= E_C(\mathbf{W};\mathcal{D}) + \lambda\, E_L(\mathcal{D}) \\
&= \sum_{l=1}^{L} \sum_{(i,j,k)\in D_l} (1 + |\mathbf{u}(i,j,k)|)\,\|\mathbf{u}(i,j,k) - \mathbf{w}_l\|^2 \\
&\quad + \lambda \sum_{(i,j,k)\in D}\ \sum_{(i',j',k')\in N_\omega(i,j,k)} \chi_{(i,j,k)}(i',j',k'),
\end{aligned} \qquad (3)
$$

where $\lambda$ is a weighting parameter to control the balance between $E_C$ and $E_L$. Thus, the desired 3D superalloy segmentation result is

$$(\mathbf{W};\mathcal{D}) = \arg\min_{(\mathbf{W};\mathcal{D})} E(\mathbf{W};\mathcal{D}). \qquad (4)$$
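For concreteness, the following sketch evaluates the two energy terms of eq. 3 for a given labeling. It is illustrative only (NumPy arrays, a cubic neighborhood of half-size $\omega$, and wrap-around at the volume boundary are assumptions), with the density $\rho$, weight $\lambda$ and neighborhood size $\omega$ passed in as parameters.

import numpy as np

def mcewcvt_energy(u, labels, W, rho, lam, omega):
    # Evaluate E = E_C + lam * E_L of eq. 3 (sketch).
    # u:      multichannel volume, shape (I, J, K, N)
    # labels: cluster index per voxel, shape (I, J, K), values in {0, ..., L-1}
    # W:      cluster generators, shape (L, N)
    # rho:    density per voxel, shape (I, J, K)
    # clustering term E_C: density-weighted squared distance to the assigned generator
    diff = u - W[labels]                                    # (I, J, K, N)
    E_C = float(np.sum(rho * np.sum(diff ** 2, axis=-1)))
    # edge term E_L: count neighbors within a cube of half-size omega whose label
    # differs from the center voxel (np.roll wraps at the boundary; fine for a sketch)
    E_L = 0
    for dx in range(-omega, omega + 1):
        for dy in range(-omega, omega + 1):
            for dz in range(-omega, omega + 1):
                if dx == dy == dz == 0:
                    continue
                shifted = np.roll(labels, (dx, dy, dz), axis=(0, 1, 2))
                E_L += int(np.count_nonzero(shifted != labels))
    return E_C + lam * E_L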

2.2 Energy Minimizer and MCEWCVT

As proved in [1], moving a voxel (i, j, k) to the cluster of a generator to which


it has the shortest multichannel edge-weighted distance will decrease the total
clustering energy $E(\mathbf{W};\mathcal{D})$ the most, where the multichannel edge-weighted distance is defined as

$$
\begin{aligned}
\mathrm{dist}((i,j,k), \mathbf{w}_l) &= \rho(i,j,k)\,\|\mathbf{u}(i,j,k) - \mathbf{w}_l\|^2 + 2\lambda\,n_l(i,j,k) \\
&= (1 + |\mathbf{u}(i,j,k)|)\,\|\mathbf{u}(i,j,k) - \mathbf{w}_l\|^2 + 2\lambda\,n_l(i,j,k)
\end{aligned} \qquad (5)
$$

where $n_l(i,j,k)$ denotes the number of voxels in $N_\omega(i,j,k) \setminus (D_l \cup \{(i,j,k)\})$.
Given a set of generators $\mathbf{W} = \{\mathbf{w}_l\}_{l=1}^L$, the multichannel edge-weighted Voronoi tessellation (MCEWVT) $\mathcal{D} = \{D_l\}_{l=1}^L$ associated with the generators $\mathbf{W}$ in the physical volume space $D$ can be defined as

$$D_l = \{(i,j,k) \in D \;|\; \mathrm{dist}((i,j,k), \mathbf{w}_l) \le \mathrm{dist}((i,j,k), \mathbf{w}_m),\ m = 1,\dots,L\}. \qquad (6)$$

When $\mathbf{W}$ is fixed, the MCEWVT $\mathcal{D}$ associated with $\mathbf{W}$ corresponds to the minimizer of the multichannel edge-weighted energy $E(\mathbf{W};\mathcal{D})$, i.e., $\mathcal{D} = \arg\min_{\mathcal{D}} E(\mathbf{W};\mathcal{D})$. On the other hand, given a partition $\mathcal{D} = \{D_l\}_{l=1}^L$ of $D$, the corresponding centroids $\mathbf{W} = \{\mathbf{w}_l\}_{l=1}^L$ are defined to be

$$\mathbf{w}_l = \arg\min_{\mathbf{w}} \sum_{(i,j,k)\in D_l} \rho(i,j,k)\,\|\mathbf{u}(i,j,k) - \mathbf{w}\|^2, \quad l = 1,\dots,L. \qquad (7)$$

It was shown in [1] that $(\mathbf{W};\mathcal{D})$ is a minimizer of $E(\mathbf{W};\mathcal{D})$ only if $(\mathbf{W};\mathcal{D})$ forms a multichannel edge-weighted centroidal Voronoi tessellation (MCEWCVT), i.e., $\mathbf{W}$ are simultaneously the corresponding centroids of the associated multichannel edge-weighted Voronoi regions $\{D_l\}_{l=1}^L$.
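The alternation implied by eqs. 5–7 can be organized as a Lloyd-type loop. Below is a minimal sketch over flattened voxels (illustrative only; the neighbor-count term of eq. 5 is delegated to an assumed callable neighbor_count, and with the squared norm the centroid of eq. 7 reduces to the density-weighted mean):

import numpy as np

def mcewcvt_iteration(u, labels, W, rho, lam, neighbor_count):
    # One sweep of the MCEWCVT alternation (cf. eqs. 5-7), sketched.
    # u:              voxel intensities, shape (V, N), V voxels flattened
    # labels:         current cluster index per voxel, shape (V,)
    # W:              generators, shape (L, N)
    # rho:            density per voxel, shape (V,)
    # neighbor_count: callable (labels, voxel_index, l) -> number of neighbors
    #                 of that voxel not belonging to cluster l (the n_l of eq. 5)
    L = len(W)
    # assignment step: move each voxel to the generator with the smallest
    # multichannel edge-weighted distance (eq. 5 / eq. 6)
    for v in range(len(u)):
        d = [rho[v] * np.sum((u[v] - W[l]) ** 2)
             + 2 * lam * neighbor_count(labels, v, l) for l in range(L)]
        labels[v] = int(np.argmin(d))
    # centroid step (eq. 7): the density-weighted mean minimizes the squared norm
    for l in range(L):
        idx = labels == l
        if idx.any():
            W[l] = np.average(u[idx], axis=0, weights=rho[idx])
    return labels, W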

3 Constrained 3D Superalloy Grain Segmentation


To improve 3D multichannel superalloy grain image segmentation, we may use
human annotated segmentation results on a small set of pre-selected superalloy
slices as the prior knowledge, i.e., constraints from the problem domain.


Fig. 2. Illustrations of the segmentation constraints on some pre-selected image slices

As illustrated in Fig. 2, 2D human annotated segmentations $S = \{S_m\}_{m=1}^M$ on $M$ (pre-selected) constraint superalloy image slices (the segmentations are
annotated by considering all the channels) can be written as $S_m = \bigcup_{n=1}^{N_m} g_n^m$, where $g_n^m$ denotes the $n$-th 2D grain region on the $m$-th constraint superalloy image slice and $N_m$ is the number of grains on the $m$-th constraint superalloy image slice according to the annotated segmentation. It satisfies

$$g_n^m \cap g_{n'}^m = \emptyset, \quad \text{if } n \neq n'. \qquad (8)$$
Thus the constraints over the tessellation (or, say, grouped clusters) $\mathcal{D}$ can be defined mathematically as:

$$C_1(\mathcal{D}) = \{\ \ell(p) = \ell(q),\ \forall p, q \in g_n^m\ \}$$
$$C_2(\mathcal{D}) = \{\ \ell(p) \neq \ell(q),\ \forall p \in g_n^m,\ q \in g_{n'}^m,\ g_n^m \text{ and } g_{n'}^m \text{ are neighboring grains}\ \} \qquad (9)$$

where $\ell(\cdot)$ denotes the generic form of the clustering function which provides a cluster label for a given voxel. Thus we can define the 3D multichannel superalloy image segmentation problem under such human annotation constraints to be a constrained energy minimization problem of the form

$$(\mathbf{W};\mathcal{D}) = \arg\min_{(\mathbf{W};\mathcal{D})} E(\mathbf{W};\mathcal{D}) \quad \text{subject to } C = \{C_1(\mathcal{D})\ \&\ C_2(\mathcal{D})\}. \qquad (10)$$

We would like to point out that the above defined constraints only take effect on the voxels within the same constraint 2D superalloy slice, but it is expected that those constraints will propagate their effects to other non-constraint slices during the solution process, giving improved grain segmentations.
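As an illustration (not the paper's implementation), the two constraint sets of eq. 9 can be materialized as must-link groups (voxels of one annotated grain) and cannot-link pairs (voxels of neighboring annotated grains); the data format below is hypothetical.

def build_constraints(annotated_slices, neighbor_pairs):
    # annotated_slices: dict {m: {n: list of (i, j, k) voxels}} giving grain g_n^m
    # neighbor_pairs:   iterable of ((m, n), (m, n')) for adjacent grains on a slice
    must_link = []      # each entry: voxels that must share one cluster label (C1)
    cannot_link = []    # each entry: two voxel groups that must get different labels (C2)
    for m, grains in annotated_slices.items():
        for n, voxels in grains.items():
            must_link.append(list(voxels))
    for (m, n), (m2, n2) in neighbor_pairs:
        cannot_link.append((annotated_slices[m][n], annotated_slices[m2][n2]))
    return must_link, cannot_link

def satisfies_constraints(labels, must_link, cannot_link):
    # Check C1 and C2 for a labeling dict {voxel: cluster_id}.
    for group in must_link:
        if len({labels[v] for v in group}) > 1:
            return False
    for g_a, g_b in cannot_link:
        if {labels[v] for v in g_a} & {labels[v] for v in g_b}:
            return False
    return True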

4 Constrained Multichannel Edge-Weighted Centroid Voronoi Tessellation Algorithm

In the following, we develop a constrained multichannel edge-weighted centroid Voronoi tessellation (CMEWCVT) algorithm to solve the above constrained minimization problem. In the algorithm, we first enforce the initial clusters to satisfy the constraints $C$, and then the constraints are imposed on the whole energy minimization process.
Denote the average multichannel intensities of the grains $g_n^m \in S$ by

$$\mathbf{U}^c = \{\mathbf{u}_1^1, \mathbf{u}_2^1, \dots, \mathbf{u}_{N_1}^1,\ \mathbf{u}_1^2, \mathbf{u}_2^2, \dots, \mathbf{u}_{N_2}^2,\ \dots,\ \mathbf{u}_1^M, \dots, \mathbf{u}_{N_M}^M\}. \qquad (11)$$

4.1 Determine an Initial Configuration Satisfying Constraints

We first run K-means on $\mathbf{U}^c$ with a small cluster number using a random initialization. Let $G$ be the neighbor graph of the grains in $S$. If there exist adjacent grains $g_n^m$ and $g_{n'}^m$ on the same constraint image slice whose average intensities $\mathbf{u}_n^m$ and $\mathbf{u}_{n'}^m$ are clustered into the same cluster, we add a new cluster whose generator is either $\mathbf{u}_n^m$ or $\mathbf{u}_{n'}^m$, and use the new set of generators to run K-means on $\mathbf{U}^c$ again. We repeat this process until we obtain a set of clusters which groups $\mathbf{U}^c$ with no adjacent grains on the same constraint image slice belonging to the same cluster. We then define the centers of these clusters as the initial generators for the CMEWCVT. See Algorithm 1 for a description of the whole procedure.
Algorithm 1. $(\mathbf{W}^c = \{\mathbf{w}_l^c\}_{l=1}^L,\ \mathcal{D}^S = \{D_l^S\}_{l=1}^L,\ L)$ = ConstrainedGenerators($\mathbf{U}^c$, $G$, $S$, $L_0$)
1: INPUT: Human annotated segmentations on certain superalloy image slices $S$. The average intensities $\mathbf{U}^c$ of the grain regions in $S$. Grain neighboring matrix $G$. Initial cluster number guess $L_0$.
2: START:
3: $L = L_0$ and randomly initialize $L$ cluster generators $\mathbf{W} = \{\mathbf{w}_l\}_{l=1}^L$.
4: Run the classic K-means on $\mathbf{U}^c$ with $\mathbf{W}$ to obtain a clustering of $\mathbf{U}^c$.
5: $\mathcal{D}^S \leftarrow \{D_l^S\}_{l=1}^L$: if $\mathbf{u}_n^m$ belongs to the cluster with generator $\mathbf{w}_l$, then all voxels in $g_n^m$ are in $D_l^S$.
6: if there exist neighboring grains $g_n^m$ and $g_{n'}^m$ whose average intensities $\mathbf{u}_n^m$ and $\mathbf{u}_{n'}^m$ belong to the same cluster then
7:   $L = L + 1$
8:   $\mathbf{w}_L = \mathbf{u}_n^m$
9:   $\mathbf{W} \leftarrow \{\mathbf{W}, \mathbf{w}_L\}$
10:  Go to 4.
11: else
12:  $\mathbf{W}^c = \mathbf{W}$
13:  Return $(\mathbf{W}^c, \mathcal{D}^S, L)$.
14: end if
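A compact Python sketch of this initialization loop follows (illustrative; scikit-learn's KMeans is used only as a stand-in for the "classic K-means" of step 4, and the data layout is assumed):

import numpy as np
from sklearn.cluster import KMeans

def constrained_generators(Uc, grain_ids, neighbor_pairs, L0, seed=0):
    # Uc:             (G, N) average multichannel intensity of each annotated grain
    # grain_ids:      list of grain identifiers aligned with the rows of Uc
    # neighbor_pairs: pairs of grain identifiers adjacent on one constraint slice
    # L0:             initial cluster number guess (assumed L0 <= G)
    rng = np.random.default_rng(seed)
    W = Uc[rng.choice(len(Uc), size=L0, replace=False)]   # random initial generators
    while True:
        km = KMeans(n_clusters=len(W), init=W, n_init=1).fit(Uc)
        assign = dict(zip(grain_ids, km.labels_))
        # look for adjacent annotated grains that fell into the same cluster
        conflict = next(((a, b) for a, b in neighbor_pairs
                         if assign[a] == assign[b]), None)
        if conflict is None:
            return km.cluster_centers_, assign             # constraints satisfied
        # split the conflict: add one of the two grain intensities as a new generator
        idx = grain_ids.index(conflict[0])
        W = np.vstack([km.cluster_centers_, Uc[idx]])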

4.2 Constrained Multichannel Edge-Weighted Voronoi Tessellation

In MCEWCVT, the clustering energy is minimized by iteratively transferring each voxel from its current cluster to the cluster to which it has the shortest edge-weighted distance defined by Eqn. (5). In the new CMEWCVT model, we need to constrain the transfer of voxels between clusters. Specifically, we force the voxels in the constraint grains (on the 2D constraint slices) to remain in their original cluster, while the voxels on the non-constraint image slices are allowed to be transferred between clusters.
Given a set of constrained generators $\mathbf{W}^c = \{\mathbf{w}_l^c\}_{l=1}^L$ and the corresponding constraints $S$, we define the constrained multichannel edge-weighted Voronoi tessellation (CMEWVT) $\mathcal{D} = \{D_l^c\}_{l=1}^L$ in the physical volume space $D$ as

$$D_l^c = \{(i,j,k) \in D \setminus S \;|\; \mathrm{dist}((i,j,k), \mathbf{w}_l^c) \le \mathrm{dist}((i,j,k), \mathbf{w}_m^c),\ m = 1,\dots,L\} \cup D_l^S, \qquad (12)$$

where $(i,j,k)$ refers to the voxels on the non-constraint image slices and $\mathcal{D}^S$ is given by Algorithm 1. Thus, we can define the constrained multichannel edge-weighted clustering energy as

$$
\begin{aligned}
E(\mathbf{W}^c;\mathcal{D};S) &= E_C(\mathbf{W}^c;\mathcal{D};S) + \lambda\,E_L(\mathcal{D}) \\
&= \sum_{l=1}^{L} \sum_{(i,j,k)\in D_l^c} (1 + |\mathbf{u}(i,j,k)|)\,\|\mathbf{u}(i,j,k) - \mathbf{w}_l^c\|^2 \\
&\quad + \lambda \sum_{(i,j,k)\in D}\ \sum_{(i',j',k')\in N_\omega(i,j,k)} \chi_{(i,j,k)}(i',j',k').
\end{aligned} \qquad (13)
$$
From Eqn. (5), it is also easy to find that when $\mathbf{W}^c$ and $S$ are fixed, the constrained multichannel edge-weighted Voronoi tessellation $\mathcal{D} = \{D_l^c\}_{l=1}^L$ associated with $\mathbf{W}^c$ and $S$ corresponds to the minimizer of the constrained multichannel edge-weighted clustering energy $E(\mathbf{W}^c;\mathcal{D};S)$, i.e.,

$$\mathcal{D} = \arg\min_{\mathcal{D}} E(\mathbf{W}^c;\mathcal{D};S).$$

Then we define the constrained multichannel edge-weighted Voronoi tessellation energy for a given set of constrained generators $\mathbf{W}^c = \{\mathbf{w}_l^c\}_{l=1}^L$ to be

$$E_{\mathrm{CMEWVT}}(\mathbf{W}^c, S) = E(\mathbf{W}^c;\mathcal{D};S). \qquad (14)$$

Algorithm 2 can be used to effectively construct the constrained multichannel edge-weighted Voronoi tessellation for a given set of constrained generators.

Algorithm 2. $\{D_l^c\}_{l=1}^L$ = CMEWVT($\mathbf{u}$, $\{\mathbf{w}_l^c\}_{l=1}^L$, $\{D_l^c\}_{l=1}^L$, $S$)
1: INPUT: A 3D $N$-channel image determined by $\mathbf{u}$, a set of constrained generators $\{\mathbf{w}_l^c\}_{l=1}^L$ and an initial constrained partition $\{D_l^c\}_{l=1}^L$ of the physical space $D$. Human annotated segmentation constraints $S$.
2: START:
3: for all voxels $(i,j,k) \in D$ do
4:   if voxel $(i,j,k) \notin S$ then
5:     a) calculate the multichannel edge-weighted distances from the voxel $(i,j,k)$ to all constrained generators $\{\mathbf{w}_l^c\}_{l=1}^L$.
6:     b) transfer the voxel $(i,j,k)$ from its current cluster to the cluster whose generator has the shortest multichannel edge-weighted distance to it.
7:   end if
8: end for
9: If no voxel in the loop was transferred, return the current partition $\{D_l^c\}_{l=1}^L$ and exit; otherwise, go to 3.
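A hedged sketch of this constrained assignment sweep (ew_dist is an assumed callable implementing the edge-weighted distance of eq. 5; voxels on the constraint slices are simply skipped):

def cmewvt_sweep(voxels, labels, W_c, constraint_voxels, ew_dist):
    # One Algorithm-2-style pass: reassign only non-constraint voxels.
    # labels is a dict voxel -> current cluster index, updated in place.
    # Returns the number of voxels transferred; repeat until this reaches zero.
    transferred = 0
    for v in voxels:
        if v in constraint_voxels:          # constrained voxels keep their cluster
            continue
        dists = [ew_dist(v, labels, l) for l in range(len(W_c))]
        best = min(range(len(W_c)), key=dists.__getitem__)
        if best != labels[v]:
            labels[v] = best
            transferred += 1
    return transferred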

4.3 The CMEWCVT Model and Its Construction

In order to define the CMEWCVT model, we need to further determine the centroids of a given partition $\mathcal{D} = \{D_l^c\}_{l=1}^L$ of $D$, i.e., find $\hat{\mathbf{W}}^c = \{\hat{\mathbf{w}}_l^c\}_{l=1}^L$ such that

$$\hat{\mathbf{w}}_l^c = \arg\min_{\mathbf{w}_l^c} \sum_{(i,j,k)\in D_l^c} \rho(i,j,k)\,\|\mathbf{u}(i,j,k) - \mathbf{w}_l^c\|^2 \qquad (15)$$

for $l = 1, 2, \dots, L$. With the norm used here, it is hard to find an analytical solution for $\hat{\mathbf{w}}_l^c$; usually, the $\hat{\mathbf{w}}_l^c$ defined through the above minimization process has to be computed numerically. For example, the Powell method [18] can be used to effectively approximate $\hat{\mathbf{w}}_l^c$ even though no derivative information is available.
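As an illustration of this numerical centroid computation, the sketch below uses SciPy's derivative-free Powell optimizer; the per-voxel norm is left as a parameter since its exact choice is an assumption here.

import numpy as np
from scipy.optimize import minimize

def cluster_centroid(U_l, rho_l, norm=lambda d: np.sum(np.abs(d))):
    # Numerically solve eq. 15 for one cluster.
    # U_l:   (M, N) multichannel intensities of the voxels in D_l^c
    # rho_l: (M,) density values of those voxels
    # norm:  per-voxel norm of the residual (assumed; the default is an L1 norm)
    def objective(w):
        return float(np.sum(rho_l * np.array([norm(u - w) ** 2 for u in U_l])))
    w0 = np.average(U_l, axis=0, weights=rho_l)   # weighted mean as starting point
    res = minimize(objective, w0, method="Powell")
    return res.x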
Definition (CMEWCVT). For a given constrained multichannel edge-weighted Voronoi tessellation $\big(\{\mathbf{w}_l^c\}_{l=1}^L;\ \{D_l^c\}_{l=1}^L;\ S\big)$ of $D$, we call it a constrained multichannel edge-weighted centroidal Voronoi tessellation (CMEWCVT) of $D$ if the generators $\{\mathbf{w}_l^c\}_{l=1}^L$ are also the corresponding centroids of the associated constrained multichannel edge-weighted Voronoi regions $\{D_l^c\}_{l=1}^L$, i.e.,

$$\mathbf{w}_l^c = \hat{\mathbf{w}}_l^c, \quad l = 1, 2, \dots, L.$$

Based on the CVT principle [19], we know that $(\mathbf{W}^c;\mathcal{D};S)$ is a minimizer of $E(\mathbf{W}^c;\mathcal{D};S)$ only if $(\mathbf{W}^c;\mathcal{D};S)$ forms a CMEWCVT of $D$. We propose Algorithm 3 for the construction of CMEWCVTs. As discussed in [17] for the EWCVT construction algorithms, some improvements of Algorithm 3 can be obtained by using a narrow-banded implementation. We also note that the energy $E_{\mathrm{CMEWVT}}(\mathbf{W}^c, S)$ keeps decreasing along the iterations of this algorithm.

Algorithm 3. $(\{\mathbf{w}_l^c\}_{l=1}^L, \{D_l^c\}_{l=1}^L)$ = CMEWCVT($\mathbf{u}$, $L_0$, $S$, $G$)
1: INPUT: A 3D $N$-channel image determined by $\mathbf{u}$ and an initial cluster number guess $L_0$. Constraints $S$ and the neighboring graph $G$ of the grains in $S$.
2: START:
3: Construct $\mathbf{U}^c$ using $S$ and $G$.
4: $(\{\mathbf{w}_l^c\}_{l=1}^L, \{D_l^S\}_{l=1}^L, L)$ = ConstrainedGenerators($\mathbf{U}^c$, $G$, $S$, $L_0$).
5: Construct $\{D_l^c\}_{l=1}^L$ based on $\{D_l^S\}_{l=1}^L$ and the Voronoi tessellation of $D \setminus S$ associated with $\{\mathbf{w}_l^c\}_{l=1}^L$ under the Euclidean distance in intensity space.
6: Construct the constrained multichannel edge-weighted Voronoi tessellation regions $\{D_l^c\}_{l=1}^L$ = CMEWVT($\mathbf{u}$, $\{\mathbf{w}_l^c\}_{l=1}^L$, $\{D_l^c\}_{l=1}^L$, $S$).
7: Calculate the cluster centroids $\{\hat{\mathbf{w}}_l^c\}_{l=1}^L$ of $D_l^c$, $l = 1,\dots,L$.
8: Taking $\{\hat{\mathbf{w}}_l^c\}_{l=1}^L$ as the generators, determine the corresponding constrained multichannel edge-weighted Voronoi clustering $\{\hat{D}_l^c\}_{l=1}^L$ by using Algorithm 2, i.e., $\{\hat{D}_l^c\}_{l=1}^L$ = CMEWVT($\mathbf{u}$, $\{\hat{\mathbf{w}}_l^c\}_{l=1}^L$, $\{D_l^c\}_{l=1}^L$, $S$).
9: If $\{\hat{D}_l^c\}_{l=1}^L$ and $\{D_l^c\}_{l=1}^L$ are the same, set $\{\mathbf{w}_l^c\}_{l=1}^L = \{\hat{\mathbf{w}}_l^c\}_{l=1}^L$, return $(\{\mathbf{w}_l^c\}_{l=1}^L; \{D_l^c\}_{l=1}^L)$ and exit; otherwise, set $D_l^c = \hat{D}_l^c$ for $l = 1,\dots,L$ and go to step 7.

5 Experiments and Evaluation

The proposed CMEWCVT algorithm is tested on a Ni-based 3D superalloy image dataset. The dataset consists of 4 channels of superalloy slice images taken under different electronic microscope parameter settings. Each slice was photographed as new facets appeared by repeatedly abrading the up-front facet of the superalloy sample. The size of each image slice is 671 × 671 and the number of slices in each channel is 170. The resolution within a slice is 0.2 μm and the resolution between slices is 1 μm.
The resolution within a 2D superalloy image slice is 5 times higher than that between adjacent image slices. In order to control the smoothness and compactness of the segmentation of the 3D superalloy image, we followed [1] by linearly interpolating the 3D superalloy image with 4 more slices between each two original slices. The interpolated superalloy volume image has 846 slices. In order to run the CMEWCVT with more iterations in reasonable time, we downsized the 3D volume image to half of its size along each direction and applied the proposed CMEWCVT algorithm on this downsized superalloy volume image. Finally, we scale the segmentation results back to the original resolution for performance evaluation. There are six parameters/factors that can be controlled in CMEWCVT: $L$, the number used as input for constructing the initial cluster generators; $\lambda$, the weighting parameter which balances the clustering energy term and the edge energy term; $\omega$, the neighbor size which defines the local search region $N_\omega(i,j,k)$ for each voxel $(i,j,k)$; $\epsilon$, the predefined threshold of the stop condition of CMEWCVT; and $S$, a set of constraint slices with human annotated segmentations.
In the experiments, we chose $L = 5$ and $\lambda = 500$. Considering the size of carbides and the energy balance (see discussions in [17]), we set $\omega = 6$ and $N_\omega(i,j,k)$ to be a sphere centered at voxel $(i,j,k)$. Theoretically, Algorithm 3 stops when the energy function Eqn. (13) is completely minimized. However, in practical applications, we can set the stop condition as

$$\frac{|E_{i+1} - E_i|}{E_i} < \epsilon \qquad (16)$$

where $E_i$ denotes the CMEWVT energy at the $i$-th iteration and $\epsilon$ is a predefined threshold. In the experiments, we selected $\epsilon = 10^{-4}$. In addition, we chose different numbers of constraint slices and analyzed the effect on the segmentation accuracy. In our experiments, we tried 3, 6, 9, 12, 15, 18 and 21 constraint slices.
To quantitatively evaluate the accuracy and performance of the CMEWCVT
algorithm, we calculated its boundary accuracy against the manually annotated
ground-truth segmentation and compared the boundary accuracy with other
segmentation/boundary-detection methods. Besides the MCEWCVT [1], we also
compare the segmentation accuracy with meanshift [4], SRM [6], pbCanny edge
detector [20], pbCGTG edge detector [20], gPb [21] and watershed [5].
For the MCEWCVT algorithm [1] and the proposed CMEWCVT algorithm, we examine the segmentation boundaries on the 170 non-interpolated image slices. The other 2D comparison methods are tested directly on the 170 non-interpolated slices. For the segmentation methods that are developed only for single-channel images, we applied them on each of the four channels independently and then combined the segmentation results from all four channels to get a unified segmentation for each slice. Specifically, for the Pb edge detectors, at each pixel we took the maximum Pb value over all four channels. For the other comparison methods, we combined the segmentations from all four channels using a logical OR operation.
For the quantitative comparison, we used the segmentation evaluation code provided in the Berkeley segmentation benchmark [20] to calculate the precision-recall
and $F$-measure (the harmonic mean of precision and recall), which is of the form

$$F = 2\,\frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}.$$
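A small helper reflecting this evaluation protocol (illustrative only; the reported numbers come from the Berkeley benchmark code [20]):

def f_measure(precision, recall):
    # Harmonic mean of precision and recall; returns 0 when both are 0.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def best_f_measure(pr_curve):
    # Best F over a precision-recall curve, e.g. obtained by sweeping the
    # threshold on a soft boundary map as done for the Pb detectors.
    return max(f_measure(p, r) for p, r in pr_curve)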
For the CMEWCVT algorithm, we evaluated the average best F-measure for Case I: over all non-interpolated slices except the constraint slices; and Case II: over all non-interpolated slices including the constraint slices. For the other comparison algorithms, including MCEWCVT and the 2D image segmentation algorithms, we evaluated the average best F-measure over all 170 superalloy image slices.
For the Pb-based edge detectors, varying the threshold on a soft boundary map produces a series of binarized boundary images, from which a precision-recall curve can be calculated and a best F-measure can be obtained to measure the segmentation performance. For the other comparison methods, the binary segmentation boundaries lead to a single precision-recall value, from which we can compute an F-measure. Generally, the higher the F-measure, the better the segmentation. From Table 1, it is easy to see that the proposed CMEWCVT algorithm achieves the best F-measures of 93.6% and 94.3% in Case I and Case II, respectively, which significantly outperforms MCEWCVT (F-measure: 89%) and the other six 2D segmentation/edge-detection methods. Here, we set $\omega = 6$ and $\lambda = 500$, and use 21 constraint slices.

Table 1. Segmentation performance of the proposed CMEWCVT algorithm and its comparison methods. The F-measure of CMEWCVT is based on using 21 constraint slices.

Methods     CMEWCVT (Case I)   CMEWCVT (Case II)   MCEWCVT   gPb   pbCanny
F-measure   93.6%              94.3%               89%       85%   83%

Methods     pbCGTG   meanshift   SRM   watershed
F-measure   82%      81%         80%   31%

Figure 3 shows a visual comparison between the segmentation results of CMEWCVT and MCEWCVT on a few selected image slices. We can see that different intensity variations are allowed within each grain by following the human annotated segmentation. This way, compared with MCEWCVT, many small over-segmentation errors are suppressed in the proposed CMEWCVT. Meanwhile, the intensity differences between neighboring grains are better defined by the human annotated segmentation. As a result, some under-segmentation errors of MCEWCVT are corrected in CMEWCVT.
We also conducted experiments to analyze the effect of using different numbers of constraint slices. As mentioned above, we tried 3, 6, 9, 12, 15, 18 and 21 constraint slices, which are all uniformly distributed across the whole 3D superalloy image volume, and Table 2 shows the resulting F-measures for Case I. As we can see, the F-measure increases as we use more constraint slices. This validates the proposed idea that incorporating prior knowledge from material

Fig. 3. Visual comparisons (two slices, (a) and (b)) between the segmentation results of the proposed CMEWCVT algorithm and the MCEWCVT algorithm

scientists can substantially improve the segmentation performance. In addition, the more constraint slices are used, the better the performance. Table 3 shows the segmentation performance for Case II.

Table 2. Comparison of the segmentation accuracy when using different numbers of constraint image slices. The numbers in this table are calculated without taking the constraint image slices into account (Case I).

Number of constraint slices 3 6 9 12 15 18 21


Average Recall 90.7% 93% 93% 89.5% 90.3% 90.8% 94.3%
Average Precision 90.9% 89.5% 89.8% 94.5% 94.9% 94.7% 92.9%
F-measure 90.8% 91.2% 91.3% 91.9% 92.5% 92.7% 93.6%

Table 3. Comparison of the segmentation accuracy when using different numbers of constraint image slices, taking the constraint image slices into account (Case II)

Number of constraint slices 3 6 9 12 15 18 21


Average Recall 90.8% 93.2% 93.3% 90.2% 91% 91.6% 94.9%
Average Precision 91% 89.9% 90.3% 94.9% 95.4% 95.2% 93.7%
F-measure 90.9% 91.5% 91.8% 92.5% 93.1% 93.4% 94.3%

From Fig. 4, another observation from the experiments is that the improvement of the average F-measure from CMEWCVT is contributed by almost all the superalloy image slices. In other words, by introducing the constraint slices, the segmentation on almost all the other slices is improved, instead of just the few slices near the constraint slices. This indicates that the prior knowledge provided by the constraint slices can be propagated to the whole 3D volume image by the proposed CMEWCVT. In addition, almost all the slices have an F-measure around 0.9 or even higher, except for the 40-th slice, which has an F-measure around 0.7 due to its poor imaging quality.
(Plot: per-slice F-measure curves, y-axis F-measure from 0.65 to 1, x-axis index of superalloy slices from 0 to 180, one curve for each of 3, 6, 9, 12, 15, 18 and 21 constraint slices at the same parameter setting.)

Fig. 4. F-measure computed from each superalloy image slice when using different numbers of constraint slices

6 Conclusions
In this paper, we developed a constrained multichannel edge-weighted centroidal Voronoi tessellation (CMEWCVT) algorithm for the purpose of 3D superalloy grain image segmentation. Compared to the previous MCEWCVT algorithm, the proposed algorithm can incorporate human annotated segmentations on a small set of superalloy slices as constraints for the MCEWCVT energy minimization. As a result, the professional prior knowledge from material scientists can indeed be incorporated to produce much better segmentation results, as demonstrated by our experiments.

Acknowledgments. This work was supported, in part, by AFOSR FA9550-11-1-0327, NSF IIS-0951754, NSF IIS-1017199, NSF DMS-0913491, NSF DMS-1215659 and ARL under Cooperative Agreement Number W911NF-10-2-0060 (DARPA Mind's Eye). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either express or implied, of AFOSR, NSF, ARL or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes, notwithstanding any copyright notation herein.
References
1. Cao, Y., Ju, L., Zou, Q., Qu, C., Wang, S.: A multichannel edge-weighted centroid Voronoi tessellation algorithm for 3D superalloy image segmentation. In: CVPR, pp. 17–24 (2011)
2. Reed, R.C.: The Superalloys Fundamentals and Applications. Cambridge Press (2001)
3. Voort, G.F.V., Warmuth, F.J., Purdy, S.M.: Metallography: Past, Present, and Future (75th Anniversary Vol.). ASTM Special Technical Publication (1993)
4. Comaniciu, D., Meer, P.: Mean shift: A robust approach toward feature space analysis. TPAMI 24, 603–619 (2002)
5. Meyer, F.: Topographic distance and watershed lines. Signal Processing 38, 113–125 (1994)
6. Nock, R., Nielsen, F.: Statistical region merging. TPAMI 26, 1425–1458 (2004)
7. Shi, J., Malik, J.: Normalized cuts and image segmentation. TPAMI 22, 888–905 (1997)
8. Felzenszwalb, P.F.: Efficient graph-based image segmentation. IJCV 59, 167–181 (2004)
9. Li, C., Xu, C., Gui, C., Fox, M.D.: Level set evolution without re-initialization: A new variational formulation. In: CVPR, pp. 430–436 (2005)
10. Couprie, C., Najman, L., Grady, L., Talbot, H.: Power watersheds: A new image segmentation framework extending graph cuts, random walker and optimal spanning forest. In: ICCV, pp. 731–738 (2009)
11. Cousty, J., Bertrand, G., Najman, L., Couprie, M.: Watershed cuts: thinnings, shortest-path forests and topological watersheds. TPAMI 32, 925–939 (2010)
12. Chuang, H.-C., Huffman, L.M., Comer, M.L., Simmons, J.P., Pollak, I.: An automated segmentation for Nickel-based superalloy. In: ICPR, pp. 2280–2283 (2008)
13. Boykov, Y., Funka-Lea, G.: Graph cuts and efficient N-D image segmentation. IJCV 70, 109–131 (2006)
14. Grady, L.: Random walks for image segmentation. TPAMI 28, 1768–1783 (2006)
15. Whitaker, R.T.: A level-set approach to 3D reconstruction from range data. IJCV 29, 203–231 (1998)
16. Magee, D., Bulpitt, A., Berry, E.: Level set methods for the 3D segmentation of CT images of abdominal aortic aneurysms. In: Medical Image Understanding and Analysis, pp. 141–144 (2001)
17. Wang, J., Ju, L., Wang, X.: An edge-weighted centroidal Voronoi tessellation model for image segmentation. IEEE TIP 18, 1844–1858 (2009)
18. Powell, M.J.D.: An efficient method for finding the minimum of a function of several variables without calculating derivatives. Computer Journal 7, 152–162 (1964)
19. Du, Q., Faber, V., Gunzburger, M.: Centroidal Voronoi tessellations: applications and algorithms. SIAM Review 41, 637–676 (1999)
20. Martin, D.R., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: ICCV, pp. 416–423 (2001)
21. Arbelaez, P., Maire, M., Fowlkes, C., Malik, J.: Contour detection and hierarchical image segmentation. TPAMI 33, 898–916 (2011)
Hough Regions for Joining
Instance Localization and Segmentation

Hayko Riemenschneider 1,2, Sabine Sternig 1, Michael Donoser 1, Peter M. Roth 1, and Horst Bischof 1
1 Institute for Computer Graphics and Vision, Graz University of Technology, Austria
2 Computer Vision Laboratory, ETH Zurich, Switzerland

Abstract. Object detection and segmentation are two challenging tasks in computer vision, which are usually considered as independent steps. In this paper, we propose a framework which jointly optimizes for both tasks and implicitly provides detection hypotheses and corresponding segmentations. Our novel approach is attachable to any of the available generalized Hough voting methods. We introduce Hough Regions by formulating the problem of Hough space analysis as Bayesian labeling of a random field. This exploits provided classifier responses, object center votes and low-level cues like color consistency, which are combined into a global energy term. We further propose a greedy approach to solve this energy minimization problem, providing a pixel-wise assignment to background or to a specific category instance. This way we bypass the parameter-sensitive non-maximum suppression that is required in related methods. The experimental evaluation demonstrates that state-of-the-art detection and segmentation results are achieved and that our method is inherently able to handle overlapping instances and an increased range of articulations, aspect ratios and scales.

1 Introduction
Detecting instances of object categories in cluttered scenes is one of the main
challenges in computer vision. Standard recognition systems dene this task as
localizing category instances in test images up to a bounding box representation.
In contrast, in semantic segmentation each pixel is uniquely assigned to one of a
set of pre-dened categories, where overlapping instances of the same category
are indistinguishable. Obviously, these tasks are quite interrelated, since the
segmentation of an instance directly delivers its localization and a bounding box
detection signicantly eases the segmentation of the instance.


This work was supported by the Austrian Research Promotion Agency (FFG) under
the projects CityFit (815971/14472-GLE/ROD) and MobiTrick (8258408) in the
FIT-IT program and SHARE (831717) in the IV2Splus program and the Austrian
Science Fund (FWF) under the projects MASA (P22299) and Advanced Learning
for Tracking and Detection in Medical Workflow Analysis (I535-N23).


Fig. 1. Hough regions (yellow) find reasonable instance hypotheses across scales and classes (left) despite strongly smeared vote maps (middle). By jointly optimizing for both segmentation and detection, we identify even strongly overlapping instances (right) compared to standard bounding box non-maximum suppression.

Both tasks have recently enjoyed vast improvements and we are now able
to accurately recognize, localize and segment objects in separate stages using
complex processing pipelines. Problems still arise where recognition is confused
by background clutter or occlusions, localization is confused by proximate fea-
tures of other classes, and segmentation is confused by ambiguous assignments
to foreground or background.
In this work we propose an object detection method which combines instance localization with its segmentation in an implicit manner for jointly finding optimal solutions. Related work in this field, as will be discussed in detail in the next section, either tries to combine detection and segmentation in separate subsequent stages [1] or aims at full scene understanding by jointly estimating a complete scene segmentation of an image for all object categories [2].
We place our method in between these two approaches, since we combine in-
stance localization and segmentation in a joint framework. We localize object
instances by extracting so-called Hough regions which exploit available informa-
tion from a generalized Hough voting method, as it is illustrated in Figure 1.
Hough regions are an alternative to the de-facto standard of bounding box non-
maximum suppression or vote clustering, as they are directly optimized in the
Hough vote space. Since we are considering maxima regions instead of single
pixel maxima, articulation and scale become far less important. Our method is
thus more tolerant against diverging scales and the number of overall scales in
testing is reduced. Further, we implicitly provide instance segmentations and to
increase the recall in scenes containing heavily overlapping objects.

2 Related Work

In this work we consider the problem of localizing and accurately segmenting


category instances. Thus, in this section we mainly discuss the related work in
the three most related research fields: non-maximum suppression, segmentation


and scene understanding.
One of the most prominent approaches for improving object detection perfor-
mance is non-maximum suppression (NMS) and its variants. NMS aims for sup-
pressing all hypotheses (i.e. bounding boxes) within a certain distance (e.g. the
widespread 50% PASCAL criterion) and localization certainty with respect to
each other. Barinova et al. [3] view the Hough voting step as an iterative proce-
dure, where each bounding box of an object instance is greedily considered. Desai
and Ramanan [4] see the bounding box suppression as a problem of context eval-
uation. In their work they learn pairwise context features, which determine the
acceptable bounding box overlap per category. For example, a couch may overlap
with a person, yet not with an airplane.
The second approach for improving object detection is to use the support of
segmentations. The work of Leibe et al. [5] introduced an implicit shape model
which captures the structure of an object in a generalized Hough voting manner.
They additionally provide segmentations per detected category instance, but
require ground truth segmentations for every positive training sample. To re-
cover from overlapping detections, they introduce a minimum description length
(MDL) criterion to combine detection hypotheses based on their costs as sep-
arate or grouped hypotheses. Borenstein and Ullman [6] generate class-specic
segmentations by a combination of object patches, yet this approach is decoupled
from the recognition process. Yu and Shi [7] show a parallel segmentation and
recognition system in a graph theoretic framework, but are limited to a set of a
priori known objects. Amit et al. [8] are treating parts as competing interpreta-
tions of the same object instances. Larlus and Jurie [9] showed how to combine
appearance models in a Markov Random Field (MRF) setup for category level
object segmentation. They used detection results to perform segmentation in
the areas of hypothesized object locations. Such an approach implicitly assumes
that the final detection bounding box contains the object-of-interest and cannot
recover from examples not sticking to this assumption, which is also the case
for methods in full scene understanding. Gu et al. [10] use regions as underlying
reasoning for object detection, however they rely on a single over-segmentation
of the image, which cannot recover from initial segmentation errors. Tu et al. [11]
propose the unication of segmentation, detection and recognition in a Bayesian
inference framework where bottom-up grouping and top-down recognition are
combined for text and faces.
The third approach for improving object detection is to strive for a full scene understanding to explain every object instance and all segmentations in an image. Gould et al. [12] jointly estimate detection and segmentation in a unified optimization framework, however with an approximation of the inference step, since their cost formulation is intractable otherwise. Such an approach cannot find the globally optimal solution. Wojek and Schiele [13] couple scene and detector information, but due to the inherent complexity the problem is not solvable in an exact manner. Winn and Shotton [14] propose a layout-consistent Conditional Random Field (CRF) which splits the object into parts, enforces a coherent layout when assigning the labels, and connects each Hough transform with a part to extract multiple objects. Yang et al. [15] propose a layered object model for image segmentation, which defines shape masks and explains the appearance, depth ordering, and labels of all pixels in an image. Ladicky et al. [2] combine semantic segmentation and object detection into a framework for full scene understanding. Their method trains various classifiers for stuff and things and incorporates them into a coherent cooperating scene explanation. So far, only Ladicky et al. managed to incorporate information about object instances, their location and spatial extent as important cues for a complete scene understanding, which makes it possible to answer questions like what, where, and how many object instances can be found in an image. However, their system is designed for difficult full scene understanding and not for an object detection task, as they only integrate detector responses and infer from the corresponding bounding boxes. This limits the ability to increase the detection recall or to improve the accuracy of instance localization.
Our method improves the accuracy of object detectors by using the objects' supporting features for joint segmentation and reasoning between object instances. We achieve a performance increase in recall and precision through the ability to better handle articulations and diverging scales. Additionally, we do not require a parameter-sensitive non-maximum suppression, since our method delivers unique segmentations per object instance. Note that, in contrast to related methods [5,16], these segmentation masks are provided without learning from ground truth segmentation masks for each category.

3 Joint Reasoning of Instances and Support


The main goal of this work is the joint reasoning about object instances and their segmentations. Our starting point is any generalized Hough voting method like the Implicit Shape Model (ISM) [5], Hough Forests [17] or the max-margin Hough transform [18]. We formulate our problem in terms of a Bayesian labeling of a first-order Conditional Random Field (CRF) aiming at the minimization of a global energy term. Since global inference in the required full random field is intractable, we propose a greedy method which couples two stages, iteratively solving each stage in a globally optimal manner. Our model inherently links classifier probabilities, corresponding Hough votes and low-level cues such as color consistency and centroid proximity.
We first describe in Section 3.1 the global energy term for providing pixel-wise assignments to category instances, analyzing unary and pairwise potentials. In Section 3.2 we introduce our greedy approach for minimizing the energy term. Finally, in Section 3.3 we directly compare the properties of our method to related work.

3.1 Probabilistic Global Energy


We assume that we are given any generalized Hough voting method like [17,5,18,19], which provides our N voting elements X_i within the test image and corresponding object center votes H_i into the Hough space. We further assume that we are given p(C|X_i, D_i) per voting element, which measures the likelihood that a voting element X_i belongs to category C by analyzing a local descriptor D_i. This could be feature channel differences between randomly drawn pixel pairs as in [17]. All this information can be directly obtained from the generalized Hough methods; see the experimental section for implementation details.
The goal of our method is to use the provided data of the generalized Hough voting method to explain a test image by classifying each pixel as background or as belonging to a specific instance of an object category. We formulate this problem as a Bayesian labeling of a first-order Conditional Random Field (CRF) by minimizing a global energy term to derive an optimal labeling.
Such random field formulations are widespread, especially in the related field of semantic segmentation, e.g. [20]. In a standard random field X each pixel is represented as a random variable X_i ∈ X, which takes a label from the set L = {l_1, l_2, ..., l_K} indicating one of K pre-defined object classes such as grass, people, cars, etc. Additionally, an arbitrary neighborhood relationship N is provided, which defines cliques c. A clique c is a subset of the random variables X_c, where for all i, j ∈ c it holds that i ∈ N_j and j ∈ N_i, i.e. they are mutually neighboring with respect to the defined neighborhood system N.
In our case, we not only want to assign each pixel to an object category, we additionally aim at providing a unique assignment to a specific instance of a category in the image, which is a difficult problem if category instances are highly overlapping. For the sake of simplicity, we define our method for the single-class case, but the method is easily extendable to a multi-class setting.
We also represent each pixel as a random variable X_i with i = 1...N, where N is the number of pixels in the test image, and aim at assigning each pixel a category-specific instance label from the set L = {l_0, l_1, ..., l_L}, where L is the number of instances. We use label l_0 for assigning a pixel to the background. We seek the optimal labeling x of the image, which is taken from the set L = L^N.
The optimal labeling minimizes the general energy term E(x) defined as

E(x) = -\log P(x \mid D) - \log Z = \sum_{c \in C} \psi_c(x_c),    (1)

where Z is a normalization constant, D is the observed data, C is the set of considered cliques defined by the neighborhood relationship N, and \psi_c(x_c) is a potential function of the defined clique. The Maximum a Posteriori (MAP) labeling x* is then found by minimizing this energy term as

x^* = \arg\min_{x \in L} E(x).    (2)

To obtain reasonable label assignments it is important to select powerful potentials, which range from simple unary terms (e.g. evaluating class likelihoods) and pairwise terms (e.g. the widespread Potts model) to even higher-order terms. While unary potentials ensure correct label assignments, pairwise potentials aim at providing a smooth label map, e.g. by avoiding the assignment of identical labels over high image gradients. The overall goal is to provide a smooth labeling consistent with the underlying data.
Our energy is modeled as the sum of unary and pairwise potentials. In the following, we assume that we are given a set of seed variables S which consists of L subsets S_l ⊂ X of random variables for which we know the assignment to a specific label in the image. We again assume that the set S_0 represents the background. These assignments are an essential step of our method, and how to obtain them is discussed in Section 3.2.
Based on the given assignments S_l, we define our instance-labeling problem as minimizing the following energy term

E(x) = \sum_{X_i \notin S_l} \phi(X_i) + \sum_{X_i \notin S_l} \gamma(X_i) + \sum_{X_i \notin S_l} \delta(X_i) + \sum_{(X_i, X_j) \in N} \psi(X_i, X_j),    (3)

which contains a unary, instance-specific class potential φ, a color-based unary potential γ, a distance-based unary potential δ and a pairwise potential ψ. The first unary potential φ contains the estimated likelihoods for a pixel taking a certain category label as provided by the generalized Hough voting method as

\phi = -\log p(C \mid X_i, D_i),    (4)

i.e. the unary potentials describe the likelihood of getting assigned to one of the identified seed regions or the background. Since we do not have a background likelihood, the corresponding values for l_0 are set to a constant value p_back, which defines the minimum class likelihood we want to have. The term φ drives our label assignments to correctly distinguish background from actual object hypotheses.
The unary potential γ analyzes the likelihood for assigning a pixel to one of the defined seed regions, e.g. analyzing local appearance information in comparison to the seed region. In general any kind of modeling scheme is applicable in this step. We model each subset S_l ⊂ X by Gaussian Mixture Models (GMM) G_l and define color potentials for assigning a pixel to each instance by

\gamma = -\log p(X_i \mid G_l).    (5)

The corresponding likelihoods for the background class l_0 are set to -\log(1 - \max_{l \in L} p(X_i \mid G_l)) since the background is mostly too complex to model. The term γ ensures that pixels are assigned to the right instances by considering the appearance of each instance.
The third unary potential δ defines a spatial distance function analyzing how close each pixel is to each seed region S_l by

\delta = \Delta(X_i, S_l),    (6)

where Δ is a distance function. The term δ ensures that correct assignments to instances are made, considering constraints such as that far-away pixels with diverging center votes are not assigned to the same instance.

Finally, our pairwise potentials ψ encourage neighboring pixels in the image to take the same label, considering a contrast-sensitive Potts model analyzing pixels included in the same cliques c, defined as

\psi = \begin{cases} 0 & \text{if } x_i = x_j \\ \theta_{ij} & \text{if } x_i \neq x_j, \end{cases}    (7)

where θ_{ij} is an image-specific gradient measure or local color difference. The term ψ mainly ensures that smooth label assignments are achieved.
This way we can effectively incorporate not only the probability of a single
pixel belonging to an object, but also consider the spatial extent and the local
appearance of connected pixels in our inference. Please note that this formulation
can e.g. be extended to superpixels to include higher-order potentials [21].
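To make the cost assembly concrete, the following minimal sketch (our illustration, not the authors' implementation) builds the three unary terms for a single instance from a given seed mask; the function and array names, the two-component color GMM, and the Euclidean distance transform are assumptions chosen for illustration, and the per-term weighting is left uniform.

import numpy as np
from scipy.ndimage import distance_transform_edt
from sklearn.mixture import GaussianMixture

def instance_unaries(class_prob, image_lab, seed_mask, p_back=0.125):
    # class_prob: HxW array of p(C|X_i, D_i) from the Hough voting method
    # image_lab:  HxWx3 per-pixel color features (e.g. Lab values)
    # seed_mask:  HxW boolean seed region S_l obtained from the Hough region
    h, w = class_prob.shape
    eps = 1e-8

    # phi: class potential, -log p(C|X_i, D_i); the background uses the constant p_back
    phi_fg = -np.log(class_prob + eps)
    phi_bg = -np.log(p_back) * np.ones((h, w))

    # gamma: color potential from a GMM fitted to the seed pixels
    gmm = GaussianMixture(n_components=2).fit(image_lab[seed_mask])
    loglik = gmm.score_samples(image_lab.reshape(-1, 3)).reshape(h, w)
    gamma_fg = -loglik
    gamma_bg = -np.log(np.clip(1.0 - np.exp(loglik), eps, 1.0))  # clipped for numerical stability

    # delta: distance potential, zero on the seed and a distance transform elsewhere (see Sec. 3.2)
    delta_fg = distance_transform_edt(~seed_mask)

    cost_instance = phi_fg + gamma_fg + delta_fg    # cost of assigning a pixel to instance l
    cost_background = phi_bg + gamma_bg             # cost of assigning it to the background
    return cost_instance, cost_background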
Finding an optimal solution without knowledge of the seeds S_l is infeasible since inference in such a dense graph is NP-hard. Therefore, in the next section we propose a greedy algorithm for solving the energy minimization problem presented above, which alternately finds optimal seed assignments and then segments instances by analyzing the Hough vector support. The final result of our method is a segmentation of the entire image into background and individual category instances.

3.2 Instance Labeling Inference

We propose a novel, greedy inference concept to solve the energy minimization problem defined in Eq. 3. The core idea is to alternately find a single, optimal seed region by analyzing the provided Hough space (Eq. 8), and afterwards to use this seed region to find an optimal segmentation of the corresponding instance in the image space (Eq. 3). The obtained segmentation is then used to update the Hough space information (Eq. 10), and we greedily obtain our final image labeling.
We assume that we are given the votes H_i of each pixel into the Hough vote map, providing a connection between pixelwise feature responses and projected Hough centers. Hence, we can build a two-layer graph for any image, where the nodes in the first layer I (image graph) are the underlying random variables X_i for all pixels in the image, and the second layer H (Hough graph) contains their transformed counterparts H(X_i) in the Hough vote map. Figure 2 illustrates this two-layer graph setup.
The first step is to optimally extract a seed region from the corresponding Hough graph H. Therefore, we propose a novel paradigm denoted as Hough regions, which formulates the seed pixel extraction step itself as a segmentation problem. A Hough region is defined as a connected, arbitrarily shaped subset of graph nodes H_l ⊂ H, which are the hypothesized center votes for the object instance in the Hough space. In the ideal case, all pixels of the instance would vote for the same center pixel; unfortunately, in the real world these Hough center votes are quite imperfect. Despite recent research on decreasing the Hough vector impurity [17,19], the Hough center is never a single pixel. This arises from
various sources of error such as changes in global and local scales, aspect ratios, articulation, etc. Perfect Hough maxima are unlikely and there will always exist inconsistent centroid votes. The idea and benefit of our Hough regions paradigm is to capture exactly this uncertainty by considering regions, and not single Hough maxima.

Fig. 2. Two-layer graph design (top: Hough graph, bottom: image graph) used to solve the proposed image labeling problem. We identify Hough regions as subgraphs of the Hough graph H. The backprojection into the image space, in combination with color information, distance transform values and contrast-sensitive Potts potentials, defines instance segmentations in the image graph I.
Thus, our goal is to identify Hough regions H_l in the Hough space H, which then allows us to directly define the corresponding seed region S_l by the back-projection into the image space, i.e. the random variables X_i ∈ S_l are the nodes in the image graph I which project into the Hough region H_l. For this reason, we define a binary image labeling problem in the Hough graph by

E(x) = \sum_{X_i \in X} \phi(H(X_i)) + \sum_{(X_i, X_j) \in N} \psi(H(X_i), H(X_j)),    (8)

which contains the projected class-specific unary potentials φ and a pairwise potential ψ. The potential φ for each variable X_i is projected to the Hough graph using H(X_i). The unary potential at any node H(X_i) is then the sum of the classifier responses voting for this node, as is common in general Hough voting methods. The pairwise potential ψ is defined on this very same graph H as a gradient-sensitive Potts model, where the gradient is calculated as the difference between the unary potentials of two nodes H(X_i) and H(X_j) in the Hough graph H. This binary labeling problem can be solved in a globally optimal manner using any available inference algorithm, e.g. graph cuts [22], and the obtained Hough region H_l is then back-projected to the seed region S_l.
The second step, after finding the optimal seed region S_l, is to identify all supporting pixels of the category instance in the image graph I. Since we are now given the required seed region S_l, we can apply our energy minimization problem defined in Eq. 3 and again solve a binary labeling problem for assigning each pixel to the background or the currently analyzed category instance. The
last part is the distance function Δ(X_i, S_l), which we define as

\Delta(X_i, S_l) = \begin{cases} 0 & \text{if } H(X_i) \in H_l \\ DT(X_i) & \text{otherwise,} \end{cases}    (9)

where H(X_i) is the Hough transformation of an image location to its associated Hough nodes and DT(X_i) is the distance transform over the elements of the current seed pixels, which are given by the Hough region H_l. Again, this binary labeling problem can be solved in a globally optimal manner using e.g. graph cut methods [22], which returns a binary segmentation mask M_l for the current category instance.
After finding the optimal segmentation M_l for the currently analyzed category instance in the image space, we update the Hough vote map, considering the already assigned Hough votes. In this way our framework is not prone to the spurious, incorrect updates of the Hough voting space that are common in related methods. This directly improves the detection performance, as occluded object instances are not removed. On the contrary, occluded instances are now more easily detectable by their visible segmentation, as they require less visibility with respect to competing object instances or other occlusions.
In detail, the update considers each image location and its (independent) votes, which are accumulated in the nodes of the Hough graph. An efficient update is possible by subtracting the previously segmented object instance from the class-wise unary potentials, which are initially φ_0(X_i) = φ(X_i), by

\phi_{t+1}(X_i) = \phi_t(X_i) - \phi(X_{M_l}),    (10)

where each random variable X_{M_l} within the segmentation mask M_l of the category instance is used in the update. Using our obtained segmentations, we focus the update solely on the areas of the graph where the current object instance plays a role. This leads to a much finer dissection of the image and Hough graphs, as shown in Figure 2.
After updating the Hough graph H with the same update step, we repeat finding the next optimal seed region and afterwards segmenting the corresponding category instance in the image graph I. This guarantees a monotonic decrease in max(φ_{t+1}), and our iteration stops when max(φ_{t+1}) < p_back, i.e. we have identified all Hough regions (object instances L) above a threshold.
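Putting the pieces together, a minimal sketch of the greedy loop could look as follows (our illustration, reusing the hypothetical helpers from the previous sketches; segment_instance stands in for the binary graph cut on the image graph of Eq. 3).

import numpy as np

def detect_instances(vote_map, pixel_to_node, class_prob, image_lab,
                     segment_instance, p_back=0.125, max_instances=50):
    detections = []
    votes = vote_map.astype(float).copy()
    for _ in range(max_instances):
        if votes.max() < p_back:                       # all Hough regions above threshold found
            break
        hough_region = extract_hough_region(votes)     # step 1: seed region in Hough space
        seed = backproject_seed(hough_region, pixel_to_node, class_prob.shape)
        mask = segment_instance(class_prob, image_lab, seed)   # step 2: instance mask M_l
        # step 3 (Eq. 10): subtract the votes cast by the segmented pixels from the Hough map
        cast = pixel_to_node[mask.ravel()]
        np.subtract.at(votes, (cast[:, 0], cast[:, 1]), class_prob.ravel()[mask.ravel()])
        np.clip(votes, 0.0, None, out=votes)
        detections.append(mask)
    return detections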

3.3 Comparison to Related Approaches

Our implicit step of updating the Hough vote maps by considering optimal segmentations in the image space is related to other approaches in the field of non-maximum suppression. In general, one can distinguish two different approaches in this field.

Bounding boxes are frequently selected as the underlying representation to perform non-maximum suppression and are the de-facto standard for defining the number of detections from a Hough image. In these methods the Hough space is analyzed using a Parzen-window estimate based on the Hough center votes. Local maxima in the Parzen-window estimate determine instance hypotheses by placing a bounding box, considering the current scale, onto the local maxima. Conflicting bounding boxes, e.g. within a certain distance and quality with respect to each other, are removed (NMS). Integrating such a setup in our method would mean that each bounding box represents one seed region S_l and additionally defines a binary distance function Δ(X_i, S_l), where Δ = 1 for all pixels within the bounding box and Δ = 0 for all other pixels.
Non-maximum suppression comes in many different forms, such as a) finding all local maxima and performing non-maximum suppression in terms of mutual bounding box overlap in image space to discard low-scoring detections; b) extracting global maxima iteratively and eliminating them by deleting the interior of the bounding box hypothesis in Hough space [17]; c) iteratively finding global maxima for the seed pixels [3], estimating bounding boxes and subtracting the corresponding Hough vectors. All pixels inside the bounding box are used to reduce the Hough vote map. This leads to a coarse dissection of the Hough map, which causes problems since it includes much background information and overlapping object instances.

Clustering methods allow a more elegant formulation, since they attempt to group Hough vote vectors to identify instance hypotheses. The Implicit Shape Model (ISM) [5], for example, employs a mean-shift clustering on sparse Hough center locations to find coherent votes. The work by Ommer and Malik [23] extends this to clustering Hough voting lines, which are infinite lines as opposed to finite Hough vectors. The benefit lies in extrapolating the scale from coherent Hough lines, as the line intersection in the 3D x-y-scale space determines the optimal scale. Such an approach relates to ours, since the clustering step can be interpreted as elegantly identifying the seed pixels we analyze. However, such approaches assume well-distinguishable cluster distributions and are not well suited for scenarios including many overlaps and occlusions. Furthermore, similar to the bounding box methods, which require a Parzen-window estimate to bind connected Hough votes, clustering methods require an intra- to inter-cluster balance. In terms of a mean-shift approach this is defined by the bandwidth parameter considered. An additional drawback of clustering methods is the limited scalability in terms of the number of input Hough vectors that can be efficiently handled. An interesting alternative is the grouping of dependent object parts [24], which yields significantly less uncertain group votes compared to the individual votes.
Our proposed approach has several advantages compared to the discussed related methods. A key benefit is that we do not have to fix a range for local neighborhood suppression, as is e.g. required in non-maximum suppression approaches, due to our integration of segmentation cues. As is also demonstrated in the experimental section, our method is robust to an increased range of variations in object articulations, aspect ratios and scales, which allows us to reduce the number of analyzed scales during testing while maintaining the detection performance. We implicitly also provide segmentations of all detected category instances, without requiring segmentation masks during training. Overall, the benefits of our Hough regions approach lead to higher precision and recall compared to related approaches.
[Figure 3: Recall/Precision curves for the tud_campus (left) and tud_crossing (right) sequences; the left panel compares Hough Regions and Barinova [3] at 1, 3 and 5 scales and the ISM at 5 scales, the right panel compares Hough Regions, Barinova [3] and the ISM at a single scale.]

Fig. 3. Recall/Precision curve for the TUD campus and TUD crossing sequences show-
ing analysis at multiple scales for highly overlapping instances. See text for details

4 Experimental Evaluation

We use the publicly available Hough Forest code [17] to provide the required class
likelihoods and the centroid votes. For non-maximum suppression (NMS), we ap-
ply the method of [3] (also based on [17]), which outperforms standard NMS in
scenes with overlapping instances. It is important to note that for datasets with-
out overlapping instances like PASCAL VOC, the performance is similar to [17].
For this reason, we mainly evaluate on datasets including severely overlapping
detections, namely TUD Crossing and TUD Campus. Additionally, we evaluate
on a novel window detection dataset GT240, designed for testing sensitivity to
aspect ratio and scale variations. We use two GMM components, equal weighting
between the energy terms, and fix p_back = 0.125.

4.1 TUD Campus

To demonstrate the ability of our system to deal with occluded and overlapping object instances (which results in an increased recall using the PASCAL 50% criterion), we evaluated on the TUD Campus sequence, which contains 71 images and 303 highly overlapping pedestrians with large scale changes. We evaluated the methods on multiple scale ranges (one, three and five scales) and Figure 3 shows the Recall-Precision curve (RPC) for the TUD Campus sequence. Aside from the fact that multiple scales benefit the detection performance for all methods, one can also see that our Hough regions method surpasses the performance at each scale range (+3%, +8%, +8% over [3]) and over fewer scales. For example, using Hough regions we only require three scales to achieve the recall and precision which are otherwise only achieved using five scales. This demonstrates that our method handles scale variations in an improved manner, which allows us to reduce the number of scales and thus the runtime during testing. The ISM approach of Leibe et al. [5] reaches good precision, but at the cost of decreased recall, since the MDL criterion limits the ability to find all partially occluded objects.
[Figure 4: Recall/Precision curve on the gt240 window detection dataset, comparing the baseline and Hough Regions.]

Fig. 4. Recall/Precision curve for the street-level window detection dataset GrazTheresa240, with strong distortion and aspect ratio changes, in 5400 images

4.2 TUD Crossing


The TUD Crossing dataset [25] contains 201 images showing side views of pedes-
trians in a relatively crowded scenario. The annotation by Andriluka et al. [25]
contains 1008 tight bounding boxes designed for pedestrians with at least 50%
visibility, ignoring highly overlapping pedestrians. For this reason we created a
pixelwise annotation. The new annotation is based on the original bounding box
annotation and now contains 1212 bounding box annotations with corresponding
segmentations. In addition to bounding boxes we also annotated the visibility
of pedestrians in fully visible body parts: head, front part of upper body, back
part of upper body, left leg or right leg.
Typically, three scales are evaluated on this sequence; however, to show the ability to handle scale, we evaluate only on a single scale for all methods. Figure 3 shows
the Recall-Precision curve (RPC) for the TUD crossing sequence in comparison
to an Implicit Shape Model (ISM) [5] and the probabilistic framework of [3]. Our
method achieves a better recall compared to the other approaches. We increase
the recall as well as the precision indicating that our method can better handle
the overlaps, scale and articulation changes.

4.3 GrazTheresa240
We also evaluated our method on a novel street-level window detection dataset
(denoted as GrazTheresa240), which consists of 240 buildings with 5400 redun-
dant images with a total of 5542 window instances. Window detection itself
is difficult due to immense intra-class appearance variations. Additionally, the dataset includes a large range of diverging aspect ratios and strong perspective distortions, which usually limit object detectors. In our experiment we tested on three different scales and a single aspect ratio. As shown in Figure 4 we can
substantially improve the detection performance in both recall and precision
(12% at EER) compared to the baseline [17]. Our Hough regions based detector
consistently delivers improved localization performance, mainly by reducing the number of false positives and by being less sensitive to diverging aspect ratios and perspective distortions.

Fig. 5. Segmentations for the TUD Crossing and TUD Campus datasets

4.4 Segmentation
As a final experiment, we analyze the achievable segmentation accuracy on the TUD Crossing sequence [25], for which we created binary segmentation masks for each object instance over the entire sequence. Segmentation performance is measured by the ratio of the number of correctly assigned pixels to the total number of pixels (segmentation accuracy). We compared our method to the Implicit Shape Model (ISM) [5], which also provides segmentations for each detected instance. Please note that the ISM requires segmentation masks for all training samples to provide reasonable segmentations, whereas our method learns from training images only. Nevertheless, we achieve a competitive segmentation accuracy of 98.59%, compared to 97.99% for the ISM. Illustrative results are shown in Figure 5.

5 Conclusion
In this work we proposed Hough regions for object detection to jointly solve instance localization and segmentation. We use any generalized Hough voting method, providing pixel-wise object centroid votes for a test image, as a starting point. Our novel Hough regions then determine the locations and segmentations of all category instances and implicitly handle large location uncertainty due to articulation, aspect ratio and scale. As shown in the experiments, our method jointly and accurately delineates the location and outline of object instances, leading to increased recall and precision for overlapping and occluded instances. These results confirm related research where combining the tasks of detection and segmentation improves performance, because of the benefit of joint optimization over separate individual solutions. In future work, we plan to consider our method also during training for providing more accurate foreground/background estimations without increasing manual supervision.
References
1. Ramanan, D.: Using segmentation to verify object hypotheses. In: CVPR (2007)
2. Ladicky, L., Sturgess, P., Alahari, K., Russell, C., Torr, P.H.S.: What, Where and How Many? Combining Object Detectors and CRFs. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part IV. LNCS, vol. 6314, pp. 424–437. Springer, Heidelberg (2010)
3. Barinova, O., Lempitsky, V., Kohli, P.: On the detection of multiple object instances using Hough transforms. PAMI 34, 1773–1784 (2012)
4. Desai, C., Ramanan, D., Fowlkes, C.: Discriminative models for multi-class object layout. IJCV 95, 1–12 (2011)
5. Leibe, B., Leonardis, A., Schiele, B.: Robust object detection with interleaved categorization and segmentation. IJCV 77, 259–289 (2008)
6. Borenstein, E., Ullman, S.: Class-Specific, Top-Down Segmentation. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002, Part II. LNCS, vol. 2351, pp. 109–122. Springer, Heidelberg (2002)
7. Yu, S., Shi, J.: Object-specific figure-ground segregation. In: CVPR (2003)
8. Amit, Y., Geman, D., Fan, X.: A coarse-to-fine strategy for multiclass shape detection. PAMI 26, 1606–1621 (2004)
9. Larlus, D., Jurie, F.: Combining appearance models and Markov random fields for category level object segmentation. In: CVPR (2008)
10. Gu, C., Lim, J., Arbelaez, P., Malik, J.: Recognition using regions. In: CVPR (2009)
11. Tu, Z., Chen, X., Yuille, A., Zhu, S.: Image parsing: Unifying segmentation, detection, and recognition. IJCV 62, 113–140 (2005)
12. Gould, S., Gao, T., Koller, D.: Region-based segmentation and object detection. In: NIPS (2009)
13. Wojek, C., Schiele, B.: A Dynamic Conditional Random Field Model for Joint Labeling of Object and Scene Classes. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part IV. LNCS, vol. 5305, pp. 733–747. Springer, Heidelberg (2008)
14. Winn, J., Shotton, J.: The layout consistent random field for recognizing and segmenting partially occluded objects. In: CVPR (2006)
15. Yang, Y., Hallman, S., Ramanan, D., Fowlkes, C.: Layered object models for image segmentation. PAMI 34, 1731–1743 (2011)
16. Floros, G., Rematas, K., Leibe, B.: Multi-Class Image Labeling with Top-Down Segmentation and Generalized Robust P^N Potentials. In: BMVC (2011)
17. Gall, J., Lempitsky, V.: Class-specific Hough forests for object detection. In: CVPR (2009)
18. Maji, S., Malik, J.: Object detection using a max-margin Hough transform. In: CVPR (2009)
19. Okada, R.: Discriminative generalized Hough transform for object detection. In: ICCV (2009)
20. Shotton, J., Johnson, M., Cipolla, R.: Semantic texton forests for image categorization and segmentation. In: CVPR (2008)
21. Kohli, P., Ladicky, L., Torr, P.: Robust higher order potentials for enforcing label consistency. IJCV 82, 302–324 (2009)
22. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. PAMI 23, 1222–1239 (2001)
23. Ommer, B., Malik, J.: Multi-scale object detection by clustering lines. In: ICCV (2009)
24. Yarlagadda, P., Monroy, A., Ommer, B.: Voting by Grouping Dependent Parts. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part V. LNCS, vol. 6315, pp. 197–210. Springer, Heidelberg (2010)
25. Andriluka, M., Roth, S., Schiele, B.: People-tracking-by-detection and people-detection-by-tracking. In: CVPR (2008)
Learning to Segment a Video to Clips
Based on Scene and Camera Motion

Adarsh Kowdle and Tsuhan Chen

Cornell University, Ithaca, NY, USA


apk64@cornell.edu, tsuhan@ece.cornell.edu

Abstract. In this paper, we present a novel learning-based algorithm for tempo-


ral segmentation of a video into clips based on both camera and scene motion, in
particular, based on combinations of static vs. dynamic camera and static vs. dy-
namic scene. Given a video, we first perform shot boundary detection to segment
the video to shots. We enforce temporal continuity by constructing a Markov Ran-
dom Field (MRF) over the frames of each video shot with edges between consec-
utive frames and cast the segmentation problem as a frame level discrete labeling
problem. Using manually labeled data we learn classifiers exploiting cues from
optical flow to provide evidence for the different labels, and infer the best labeling
over the frames. We show the effectiveness of the approach using user videos and
full-length movies. Using sixty full-length movies spanning 50 years, we show
that the proposed algorithm of grouping frames purely based on motion cues can
aid computational applications such as recovering depth from a video and also re-
veal interesting trends in movies, which leads to interesting novel applications in video analysis (time-stamping archive movies) and film studies.

Keywords: video temporal segmentation, film study.

1 Introduction

Video analysis such as video summarization, activity recognition, depth from video,
etc., are active research areas. Each of these communities focuses on specific modalities of
video; for example, estimating depth from video has been well explored for videos of a
static scene captured from a dynamic camera. In addition to computational applications,
studies in cognitive science [8] and film studies [3, 11, 25] require analyzing the visual
activity in movies, where currently exhaustive manual effort is used [1]. In this work, we
automatically segment an input video into clips based on the scene and camera motion
allowing for more focused algorithms for video editing and analysis.
The task of video segmentation has been studied for various applications. Shot bound-
ary detection [22, 32–34] performs a coarse segmentation of frames into the basic elements of a video, shots1. Several works build on top of this, grouping shots into scenes
to aid video summarization [15, 24, 31, 35] or performing spatio-temporal segmenta-
tion [7, 9, 14, 18, 29]. Other video segmentation works include camera motion segmen-
tation [10, 21, 23, 28, 30, 35] classifying camera motion such as tilt and zoom, albeit,
without focusing on the motion of the objects within the scene. In this work, we define
1 A shot is everything between turning the camera on and turning it off, consisting of many clips. A scene consists of all the shots at a location.

Fig. 1. The proposed algorithm takes a video as input (black box) and first segments the video to
shots, where each shot is the group of frames shown in an orange box. The algorithm segments
the frames of each shot to clips based on scene and camera motion, expressed as combinations
of static camera (CS ) vs. dynamic camera (CD ) and static scene (SS ) vs. dynamic scene (SD ),
shown with red, green, blue and yellow borders around the frame (color code on first row). The
above are some results of our algorithm on the movie Sound of Music.

a novel task of labeling frames into one of four categories based on combinations of
both camera and scene motion i.e. static vs. dynamic camera and static vs. dynamic
scene. While prior work on temporal segmentation used rule-based approaches, we propose a novel learning-based approach. Some results are shown in Fig 1.
An overview of our algorithm is shown in Fig 2. We formulate the task of video
segmentation as a discrete labeling problem. We first perform shot boundary detection
to segment the video into shots, and assume that the characteristics of motion between
shots are independent of each other. In order to capture the heavy temporal dependence
between frames within a shot, we construct an MRF over the frames of the shot with
edges between consecutive frames. We extract intuitive features from the optical flow
between consecutive frames and using labeled video data learn classifiers to distinguish
between the different classes. Using the classifier scores as evidence for the different
discrete categories we setup an energy function over the graph of frames enforcing
smoothness between adjacent frames and infer the minimum energy labeling over the
frames. We then use the resulting labeling to extract contiguous sequences of frames that are assigned the same label, thus segmenting the video to clips.
Contributions. The main contributions of this paper are:
We propose a novel learning-based approach to segmenting a video into clips based
on both camera and scene motion.
We show that the joint inference over all the frames of the shot as opposed to
independent decisions on each frame leads to significantly better performance.
Fig. 2. Algorithm overview: (a) We start with the input video and (b) extract the frames of the
video. (c) We represent each frame as a node in our graph and (d) perform shot boundary detection
to identify shot boundary frames, shown in orange. (e) We now construct an MRF over the frames
within each shot and formulate a discrete labeling problem where each frame has to be assigned
one of four discrete labels, the combinations of static camera (CS ) vs. dynamic camera (CD )
and static scene (SS ) vs. dynamic scene (SD ), in red, green, blue and yellow. (f) The minimum
energy labeling obtained via graph cuts allows us to group consecutive frames with the same label
into clips. Note that in addition to segmenting the video to shots, shots 1 and 3 have been further
segmented into clips based on camera and scene motion.

Motivated by film study, we apply our approach to a collection of sixty full-length


movies to reveal interesting trends of scene and camera motion over the years.
We successfully use the proposed approach for the novel application of time-
stamping archive movies and for recovering depth from video.

2 Related Work
With the number of video capture devices and the movie industry rapidly growing,
video analysis is gaining a lot of popularity. A number of researchers are studying dif-
ferent aspects of video analytics, from shot detection to video summarization. We give
an overview of related works with a focus on video segmentation and bring our work
into perspective amongst these works.
Shot Boundary Detection. One of the key steps in most video analysis applications,
is given a video extracting out the basic components shots, that allows for further pro-
cessing. Shot boundary detection is also a form of video segmentation albeit a coarse
one. Automatic shot boundary detection has been very well studied in the past with
approaches varying from using the changes in image color statistics across adjacent
frames to using edge change ratio [22, 33, 34]. There are a number of detailed surveys
that compare the various techniques [4, 13, 19]. We refer the reader to a recent formal study of shot boundary detection by Yuan et al. [32]. In our work, we perform shot
boundary detection as our first stage of coarse segmentation, but our goal is to go be-
yond and obtain a finer segmentation of the shots to clips.
Spatio-temporal Segmentation. With the shots from a video extracted using the shot
boundary detection algorithms, an active line of research is spatio-temporal segmenta-
tion of the video shot frames. The goal of these works is to obtain a spatial grouping
of pixels while enforcing temporal consistency [7, 9, 14, 18, 29]. In our work, we focus on
purely temporal segmentation where we segment the video shot into clips by giving the
entire frame a discrete label. However, we note that our proposed approach can allow
for better, more focused spatio-temporal segmentation.
Scene Segmentation. A line of active research motivated by the application of video
summarization, indexing and retrieval, is segmenting a video into scenes [35]. A scene
can consist of multiple shots, each with different camera and scene dynamics. Scene
segmentation leverages the use of semantics to group shots to scenes and extract key
frames to summarize the video. The Scene Transition Graph by Yeung et al. [31] is one such approach, where a graph is constructed with each node representing a shot and edges representing the transitions between shots. Hierarchically clustering the graph splits it into several subgraphs, resulting in the scenes. A similar clustering approach was proposed by Hanjalic et al. [15] to find logical story units in the MPEG compressed domain. More recently, Rasheed et al. proposed a similar approach for scene detection
in Hollywood movies and TV shows [24]. In our work, as opposed to clustering shots
to scenes, we break the shot into clips based on scene and camera motion.

Camera Motion Characterization. Camera motion characterization is related to our


work. Motion characterization is known to aid video compression and allow for a more
compact video representation for content-based video indexing. Dorai et. al. [10], Tan
et. al. [23] and Zhu et. al. [35], analyze motion vectors encoded in the P and B-frames
in the compressed MPEG stream to characterize the camera motion. However, the focus
here is to characterize camera motion such as tilt and zoom. Ngo et. al. proposed an ap-
proach to extract camera motion via temporal slice analysis [21]. Srinivasan et. al. [28]
and Xiong et. al. [30] introduced motion extraction methods analyzing the spatial dis-
tribution of optical flow. However, these rule-based approaches assume that the Field
of Expansion (FOE) [16] or Focus of Contraction (FOC) is at the center of the image,
which is not always satisfied. However, unlike our work, these works are rule-based
approaches considering only camera motion and are agnostic to the scene motion. Ci-
fuentes et. al. [12] perform supervised classification of pre-segmented clips to camera
motion classes to improve interest-point tracking that is complementary to our work.
In our work, we propose a learning-based algorithm to perform segmentation of the
video into clips by analyzing both scene and camera motion. We apply the proposed
algorithm to movies from different decades and show that such a grouping purely based
on motion cues can aid computational applications such as recovering depth from a
video and also reveal interesting trends in movies, which leads to interesting novel applications in video analysis (time-stamping archive movies) and film studies.

3 Algorithm
In this section, we describe the proposed algorithm in detail. We first describe the prob-
lem formulation followed by the description of the proposed approach in detail.

3.1 Problem Formulation


Given a video, we wish to split the video into shots and further split each shot into clips, each belonging to one of four categories based on camera and scene motion. We treat this as a frame-level discrete labeling problem. We first consider a binary labeling problem of shot boundary
detection, labeling the frame where the shot changes as a Shot-Boundary (SB). Using
these shots as structural elements of the movie, we cast the problem of splitting the
shots to clips as a 4-label discrete labeling problem where each frame is to be assigned
one of four discrete labels. We use the following notation throughout the paper,
Static Camera, Static Scene (CS , SS )
Dynamic Camera, Static Scene (CD , SS )
Static Camera, Dynamic Scene (CS , SD )
Dynamic Camera, Dynamic Scene (CD , SD )
We will finally use the labeling of frames to extract contiguous sequences of frames
which are assigned the same label, thus segmenting the video to clips.

3.2 Proposed Approach


We now describe the proposed approach in detail. We leverage the use of optical flow all
through our algorithm. Given a video we first compute the optical flow for each frame.

Shot Boundary Detection. A shot is a sequence of frames captured between turning the camera on and turning it off (in one shot). The first stage of our algorithm is identifying shot boundaries in the video. Shot boundary detection has been fairly well studied using features such as the change in color statistics, edge change ratio (ECR), etc. In our experiments with movies, we found that using the color statistics is not reliable, especially in the case of shot boundaries between scenes of natural environments. ECR [33], using image edges, is much more reliable; however, it suffers in the presence of image noise and requires manual tuning to work across movies.
An observation about optical flow is that, while the flow for frames on either side of the shot boundary has similar magnitude, the flow changes drastically at the shot boundary frame. Identifying the shot boundary using a fixed threshold on the flow between only two consecutive frames is, however, not too reliable, since the flow is often very noisy (especially in old movies). We therefore use more temporal support and dynamically change the threshold to detect a shot boundary. We use a sliding window W,
of ten frames centered on the test frame, and compute the median optical flow magni-
tude for each frame in this window. Given this vector of W flow magnitudes denoted by
|O|_W, our goal is to identify the outlier. We compute the median2 and standard deviation of this vector. Any frame with flow magnitude more than two standard deviations away from the median is identified as a shot boundary (SB).
2 Median is more reliable in identifying the outlier than using the mean.
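The following minimal sketch (our illustration, not the authors' code) shows this outlier test; the input flow_mag is assumed to already hold the per-frame median flow magnitude, and the window size and two-standard-deviation rule follow the description above.

import numpy as np

def detect_shot_boundaries(flow_mag, half_window=5, n_std=2.0):
    # flow_mag: 1-D array, median optical flow magnitude of each frame w.r.t. its predecessor
    flow_mag = np.asarray(flow_mag, dtype=float)
    boundaries = []
    for t in range(len(flow_mag)):
        lo, hi = max(0, t - half_window), min(len(flow_mag), t + half_window)
        window = flow_mag[lo:hi]                        # roughly ten frames centred on frame t
        med, std = np.median(window), np.std(window)
        if std > 0 and abs(flow_mag[t] - med) > n_std * std:
            boundaries.append(t)                        # outlier w.r.t. the local statistics: SB
    return boundaries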

Fig. 3. Illustration of the features used in the unary term. (a) shows a sample frame with the
corresponding color coded flow. Flow vectors are illustrated with arrows. (b) Row 1 shows the
log-bin histogram used to classify static vs. dynamic camera, and Row 2 shows the orientation
histogram used to classify static vs. dynamic scene. Refer to Section 3.2 for details.

Segmenting Video to Clips. The shot boundaries from the previous stage are treated
as the last frame of a shot, thus segmenting the video to shots. Our goal is to segment
each shot into clips based on camera and scene motion. We cast this segmentation task
as a frame-level discrete labeling problem. We formulate the multilabel problem as an
energy minimization problem over a graph of the frames in the shot. This is analogous
to constructing a graph over frames of the whole video and breaking the links when
there is a shot boundary assuming independent motion between shots.
Notation. Let the frames in the i-th shot of the video be denoted as F_i. The objective is to obtain a labeling L_i over the shot frames such that each frame f is given a label l_f ∈ {(C_S, S_S), (C_D, S_S), (C_S, S_D), (C_D, S_D)}. We build a graph over the frames f ∈ F_i, with edges between adjacent frames denoted by {N}. We define the energy function to minimize over the entire shot as:

E(L_i) = \sum_{f \in F_i} E_f(l_f) + \lambda \sum_{(f,g) \in N} E_{fg}(l_f, l_g)    (1)

The unary term E_f(l_f) measures the cost of assigning a frame f to the frame label l_f. The pairwise term E_{fg}(l_f, l_g) measures the penalty of assigning frames f and g to labels l_f and l_g respectively, and λ is a regularization parameter.

Unary Term (Ef ). We extract intuitive features from the optical flow and using la-
beled data learn classifiers to classify between static vs. dynamic camera categories and
static vs. dynamic scene categories. We now describe the unary term in detail.

Static Camera vs. Dynamic Camera. Our goal is to learn a classifier that, given the
features for the current frame returns a score indicating the likelihood that the current
frame was captured with a static camera vs. a dynamic camera. In the most trivial sce-
nario where a static camera is looking at a static scene, the optical flow between the
frames should have zero magnitude. In this case, a histogram of the optical flow mag-
nitude would peak at the zero bin indicating that it is a static camera. However, in a
typical video the scene can be composed of dynamic objects as well.
We handle this by computing a histogram of the optical flow magnitude, using log-
binning with larger bins as we move away from zero magnitude. We obtain a 15-dim vector of the normalized histogram, represented as Φ(f), to describe each frame f, as illustrated in Fig 3. Using manually labeled data, we use these vectors as the feature
vectors and learn a logistic classifier to classify the frame between static and dynamic camera. Let θ_S represent the logit coefficients we learn during training to classify static camera frames against dynamic camera frames. During inference, given a frame x, we compute the likelihood of the frame being captured by a static camera, represented as L(C_S), and that by a dynamic camera, L(C_D), given by:

L(C_S) = P(C_S \mid \Phi(x); \theta_S)
L(C_D) = 1 - P(C_S \mid \Phi(x); \theta_S)    (2)
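A minimal sketch of this classifier (our illustration, not the authors' code) is given below; the particular bin edges, the magnitude range, and the use of scikit-learn's LogisticRegression are assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression

BIN_EDGES = np.concatenate(([0.0], np.logspace(-2, 2, 15)))   # 16 edges -> 15 log-spaced bins

def camera_feature(flow):
    # flow: HxWx2 optical flow of one frame; returns the 15-dim normalized histogram Phi(f)
    mag = np.hypot(flow[..., 0], flow[..., 1])
    hist, _ = np.histogram(mag, bins=BIN_EDGES)
    return hist / max(hist.sum(), 1)

def train_camera_classifier(train_flows, labels):
    # labels: 1 for frames captured by a static camera, 0 for a dynamic camera
    X = np.stack([camera_feature(f) for f in train_flows])
    return LogisticRegression().fit(X, labels)

def camera_likelihoods(clf, flow):
    # returns (L(C_S), L(C_D)) as in Eq. 2
    probs = clf.predict_proba(camera_feature(flow)[None, :])[0]
    p_static = probs[list(clf.classes_).index(1)]
    return p_static, 1.0 - p_static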

Static Scene vs. Dynamic Scene. Consider a scene captured from a dynamic camera; we use a simple cue from the optical flow to separate a static from a dynamic scene. Given the flow for each frame of the shot, we compute the orientation of the flow field at each pixel and bin the orientations into a histogram with eight bins evenly spaced between 0 and 360 degrees, as illustrated in Fig 3. In addition, we use a no-flow bin that helps characterize a static camera.
Intuitively, the peak in this histogram indicates the dominant direction of camera motion. We remove the component of the histogram corresponding to this dominant motion, and normalize the residual histogram. We use the entropy of this residual distribution, H(f), as a feature to identify static scene vs. dynamic scene. We assume a Gaussian distribution over the entropy values and, using the labeled data, fit a parametric model to the residual entropies for frames labeled as static and frames labeled as dynamic, which results in two Gaussians N(\mu_S, \sigma_S) and N(\mu_D, \sigma_D), respectively. At inference, given a new frame x, we compute the residual entropy H(x) and compute the likelihoods for a static scene, L(S_S), and a dynamic scene, L(S_D), given by:

L(S_S) = P(S_S \mid H(x); \mu_S, \sigma_S)
L(S_D) = P(S_D \mid H(x); \mu_D, \sigma_D)    (3)
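A minimal sketch of the residual-entropy feature (our illustration, not the authors' code) follows; the eight orientation bins plus a no-flow bin follow the text, while the stillness threshold is an assumption. The two Gaussians of Eq. 3 would then be fitted to these entropy values over the labeled static and dynamic frames.

import numpy as np

def residual_entropy(flow, still_thresh=0.5):
    # flow: HxWx2 optical flow; returns the entropy H(f) of the orientation histogram
    # after removing its dominant (camera-motion) component
    mag = np.hypot(flow[..., 0], flow[..., 1])
    ang = np.arctan2(flow[..., 1], flow[..., 0])          # orientation in (-pi, pi]
    moving = mag > still_thresh
    hist = np.zeros(9)
    hist[8] = np.count_nonzero(~moving)                   # the no-flow bin
    bins = ((ang[moving] + np.pi) / (2 * np.pi) * 8).astype(int) % 8
    np.add.at(hist, bins, 1)
    hist[np.argmax(hist)] = 0.0                           # remove the dominant direction
    residual = hist / max(hist.sum(), 1)
    nonzero = residual[residual > 0]
    return float(-(nonzero * np.log(nonzero)).sum())      # entropy of the residual distribution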

Since the dominant camera motion has been extracted out in the computation of the
scene motion, we assume that the scene and camera motion are independent. We show
in Section 4.2 that this assumption performs respectably on both movies and user videos; however, it causes errors in the ambiguous case of a dynamic camera looking
at a static object vs. a static camera looking at a dynamic object up-close where the
moving object has a large spatial extent in the frame. This is a scenario ambiguous even
to humans, without contextual or semantic reasoning about the actual spatial extent of
the dynamic object and the surrounding scene. While we do not model this in our work,
we see that such a model can be incorporated into Equation 3.
We compute the likelihoods for the four discrete frame labels to define the unary
term in our MRF as follows:
L(C_S, S_S) = L(C_S) \cdot L(S_S)
L(C_D, S_S) = L(C_D) \cdot L(S_S)
L(C_S, S_D) = L(C_S) \cdot L(S_D)
L(C_D, S_D) = L(C_D) \cdot L(S_D)    (4)

We use the negative log-likelihood as the unary term (Ef ) for the four discrete labels.
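As a small worked illustration (ours, not the authors' code), the four likelihoods can be combined into negative-log-likelihood unary costs as follows.

import numpy as np

LABELS = [("CS", "SS"), ("CD", "SS"), ("CS", "SD"), ("CD", "SD")]

def unary_costs(l_cs, l_cd, l_ss, l_sd, eps=1e-8):
    # inputs are L(C_S), L(C_D), L(S_S), L(S_D) for one frame (Eq. 2 and 3)
    joint = np.array([l_cs * l_ss, l_cd * l_ss, l_cs * l_sd, l_cd * l_sd])  # Eq. 4
    return -np.log(joint + eps)          # E_f(l) for the four labels, in LABELS order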

Pairwise Term (E_{fg}). We model the pairwise term using a contrast-sensitive Potts model:

E_{fg}(l_f, l_g) = I(l_f \neq l_g) \cdot \exp(-\beta \, d_{fg})    (5)

where I(·) is an indicator function that is 1 (0) if the input argument is true (false), d_{fg} is the contrast between frames f and g, and β is a fixed parameter. The pairwise term tries to penalize label discontinuities among neighboring frames, modulated by a contrast-sensitive term. Given the optical flow between adjacent frames f and g, we warp the image from frame g to frame f. The mean error between the original frame f and the flow-induced warped image is used as the contrast d_{fg} between the two frames in the pairwise term. In practice, this enforces temporal continuity, since the contrast is low if the flow can describe the previous frame very well.
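The sketch below (our illustration, not the authors' code) computes this flow-induced contrast with OpenCV's remap and the corresponding Potts penalty; the value of beta is an assumption.

import numpy as np
import cv2

def frame_contrast(frame_f, frame_g, flow_fg):
    # frame_f, frame_g: HxWx3 images; flow_fg: HxWx2 optical flow from frame f to frame g
    h, w = flow_fg.shape[:2]
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow_fg[..., 0]).astype(np.float32)
    map_y = (grid_y + flow_fg[..., 1]).astype(np.float32)
    warped = cv2.remap(frame_g, map_x, map_y, cv2.INTER_LINEAR)   # flow-induced warp of g onto f
    return float(np.mean(np.abs(frame_f.astype(float) - warped.astype(float))))  # contrast d_fg

def pairwise_cost(label_f, label_g, d_fg, beta=0.1):
    # contrast-sensitive Potts term of Eq. 5
    return 0.0 if label_f == label_g else float(np.exp(-beta * d_fg))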
With the energy function set up, we use graph cuts with α-expansion to compute the MAP labels for each shot, using the implementation by Bagon [2] and Boykov et al. [5, 6, 17]. We show some of our results on the full-length movies and user videos in Fig 1 and 5. We note that we can extend the proposed algorithm using prior work (Section 2) to obtain a true camera motion characterization of frames into zoom, pan, tilt, etc. without scene motion clutter, by either post-processing the frames labeled by our algorithm as dynamic camera frames or adding additional discrete labels.
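Since each shot forms a simple chain of frames, the same MAP labeling can alternatively be computed exactly with a Viterbi-style dynamic program; the sketch below (our illustration, not part of the paper) takes precomputed unary and pairwise cost tables as input.

import numpy as np

def chain_map(unary, pairwise):
    # unary: TxK costs E_f(l); pairwise: (T-1)xKxK costs lambda * E_fg(l, l')
    T, K = unary.shape
    cost = unary[0].copy()                      # best cost of each label at the first frame
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        total = cost[:, None] + pairwise[t - 1] + unary[t][None, :]   # K x K transition costs
        back[t] = np.argmin(total, axis=0)                            # best predecessor per label
        cost = total[back[t], np.arange(K)]
    labels = np.empty(T, dtype=int)
    labels[-1] = int(np.argmin(cost))
    for t in range(T - 1, 0, -1):               # backtrack the minimum-energy label sequence
        labels[t - 1] = back[t, labels[t]]
    return labels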

4 Experiments and Results


In this section, we describe and discuss our results. We first describe our dataset fol-
lowed by quantitative and qualitative analysis of the proposed approach. We then dis-
cuss interesting applications and analysis applying our approach on full-length movies.

4.1 Dataset
Our dataset consists of 60 full-length movies and five user videos. The movies are a subset of the dataset used in [8] that spans from 1960 to 2010, with 12 full-length movies in each
decade. We divide the dataset into two parts.
The first subset of the dataset (Set-A) consists of five user videos and one full-length
movie that we use for the quantitative analysis. We sample the videos at a frame rate of
10fps. We manually label the frames with one of four discrete labels we define in this
work. We use these labeled 110,000 frames (made publicly available on our website3 )
to perform our quantitative analysis. We also label shot boundaries on a subset of the
movie to evaluate shot boundary detection.
The second subset of the dataset (Set-B) consists of the 60 full-length movies. We
sample the movies at 10fps resulting in an average of 50,000 frames for each movie.
We use the results of the proposed algorithm on this exhaustive dataset to infer trends
in scene and camera motion in movies over the years. We later use the dataset to show
a novel application of time-stamping archive movies.
We extract the optical flow between every two consecutive frames for each of the
videos in the dataset using the implementation by Liu et al. [20].
3 http://chenlab.ece.cornell.edu/projects/Video2Clips
[Figure 4 shows (a) a bar chart of the average frame-level labeling accuracy, (b) the confusion matrix for independent per-frame decisions, and (c) the confusion matrix for the proposed joint inference; the diagonal entries are .74/.67/.84/.69 in (b) and .85/.93/.97/.86 in (c) for (C_S,S_S)/(C_D,S_S)/(C_S,S_D)/(C_D,S_D).]


Fig. 4. Quantitative analysis: (a) Average accuracy of the frame level labeling using the learnt
models. The proposed approach of joint inference via the graph over frames achieves significant
improvement in performance over taking an independent frame level decision. Confusion ma-
trix: (b) Independent decisions, (c) Joint inference via the temporal graph over frames results in
significant improvement as evident in the diagonal elements.

4.2 Quantitative Analysis

We use Set-A to perform the quantitative analysis. Given the optical flow for each
of the videos, we first perform the shot boundary detection as described in Section 3.2.
We first evaluate the performance of the shot boundary detection in comparison to the
edge change ratio (ECR) approach [33], which is known to perform better than using
color histograms [32]. We use the manually labeled movie data to compute the F-score
of detecting the shot boundary. The ECR performed respectably with an F-score of
0.92 (with cross-validation to set the threshold), well above random. The proposed flow-based approach performed better, with an F-score of 0.96, without requiring a threshold to be set. While both approaches had errors due to missed detections in the case of slow fades
and dissolves, we observed that the lower performance of ECR is due to false positives
in case of long shots, or when text is instantaneously overlaid on the movie frame.
Other sophisticated approaches (Sec 2) can easily be plugged in here as shot boundary
detection serves as a pre-processing step in our approach.
In order to evaluate the performance of the algorithm in segmenting the video into the four classes, we treat the problem as a multi-class labeling problem and use the frame-level labeling accuracy as our metric. We perform leave-one-out cross-validation: using all the videos in the dataset except the test video, we learn the model parameters described in Section 3.2, and given these parameters we evaluate the performance on the test video. We note that while the task we tackle has not been addressed before, prior work on camera motion characterization uses rule-based approaches on features extracted from the motion vectors, albeit without reasoning about the scene dynamics and making decisions at a per-frame level. We compare the performance of the proposed approach of using a multilabel MRF over the frames against taking an independent decision on each frame using the learnt models. We report the performance in Fig. 4a. We see that using the learnt models on each frame independently performs much better than taking a random decision; the temporal consistency enforced by the proposed approach gives a significant additional boost of more than 20%. We compare the confusion matrices in Fig. 4 and observe via the diagonal elements that the proposed algorithm performs better across all the categories.
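The joint inference itself uses a multilabel MRF over frames (Section 3.2, not reproduced here). As a hedged illustration of why the temporal term helps, the sketch below smooths per-frame class costs with a Potts penalty on a simple chain over frames, solved exactly by dynamic programming; the chain structure, the cost convention, and the penalty value are assumptions for illustration rather than the paper's exact model.

```python
import numpy as np

# Hedged illustration: enforce temporal consistency of per-frame labels with a
# Potts penalty on a chain over frames, solved exactly by dynamic programming.
# `unary[t, c]` is the cost of assigning class c to frame t (e.g. negative
# log-likelihood under the learnt models); `pairwise` penalizes label changes.

def viterbi_potts(unary, pairwise=1.0):
    T, C = unary.shape
    cost = unary[0].copy()
    back = np.zeros((T, C), dtype=int)
    for t in range(1, T):
        # transition cost: 0 to keep the label, `pairwise` to switch it
        trans = cost[:, None] + pairwise * (1 - np.eye(C))
        back[t] = trans.argmin(axis=0)
        cost = trans.min(axis=0) + unary[t]
    labels = np.zeros(T, dtype=int)
    labels[-1] = cost.argmin()
    for t in range(T - 1, 0, -1):
        labels[t - 1] = back[t, labels[t]]
    return labels

rng = np.random.default_rng(0)
unary = rng.random((50, 4))   # 50 frames, 4 classes (CS/SS, CD/SS, CS/SD, CD/SD)
print(viterbi_potts(unary, pairwise=2.0))
```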

Observations. The key observation from the confusion matrices is that the model is able to learn to distinguish between the different classes, as evident from the dominant diagonal even in the independent case (Fig. 4b). We investigate the errors by considering the non-zero off-diagonal entries in Fig. 4c.
Consider the case of static camera, static scene (CS, SS), i.e., row 1 of the confusion matrix. We first note that the fraction of frames in a movie that belong to this category is fairly low, as we see in Sec. 4.4. The confusion here is due to the instability of the camera, which makes dynamic camera, static scene (CD, SS) a more appropriate label. We see even in Fig. 4b that there is a large confusion between the classes of dynamic camera, static scene (CD, SS) and dynamic camera, dynamic scene (CD, SD), i.e., rows 2 and 4 of the confusion matrix. The proposed approach (Fig. 4c) performs significantly better than the independent case; the remaining confusion is due to the dynamic content in the scene having varied spatial extents in the frames, as noted in Sec. 3.2.

4.3 Qualitative Analysis


We show some qualitative results of our proposed algorithm in Figs. 1 and 5. Please see the website^3 for sample result videos on the movie Sound of Music and the user videos. The proposed algorithm accurately segments the video into shots and the shots into clips. We note in particular the result on the 1990 movie Goodfellas (Fig. 5b), where the famous 3-minute-long Copacabana shot is accurately segmented out as a single shot. Other approaches using color statistics and edge change ratio failed to do so, due to the drastically changing image content. In addition, our algorithm accurately segmented the shot into a dynamic camera, dynamic scene clip. We also note the result on a user video (Fig. 5d), where the algorithm accurately segments the shot into clips captured using a dynamic camera, switching from a static scene to a dynamic scene (when the person walks across the scene) and back. Also note the negative results in Fig. 5e.

4.4 Trends in Movies over the Years


We use Set-B, with movies from 1960-2010, to analyze trends in movies. We run our algorithm on each movie to obtain a frame-level labeling. We then analyze the distribution of the various labels for movies from each decade to observe trends across time.
We compute the percentage of frames in each movie for each label and plot it across years in Fig. 6. Our first observation is that the movies from each decade follow a particular distribution of the four labels. We observe that the category static camera, static scene (CS, SS) is very sparse in movies, irrespective of the decade the movie belongs to. We also note that the category of dynamic camera, dynamic scene (CD, SD) dominates all other categories. We further observe that the categories (CS, SD) and (CD, SS) follow distinct trends: the dynamic camera categories (CD, SD) and (CD, SS) seem to follow an increasing trend, while the category of static camera, dynamic scene (CS, SD) follows a downward trend.
We explore this further by first grouping the static and dynamic scene categories to compare static camera vs. dynamic camera in Fig. 6b. Our earlier observation of the increase in the use of a dynamic camera clearly stands out.

(a) Lady Killers - 1960

(b) Goodfellas - 1990

(c) Charlie's Angels - 2000

(d) User video

(e) Negative results


(CD, SS) (CS, SD) (CD, SD)

Fig. 5. Results from the proposed approach (color code for the frame borders on the last row): (a) a shot from the movie Lady Killers where the algorithm accurately switches from a static camera to a dynamic camera; (b) the famous 3-minute-long Copacabana shot from the movie Goodfellas is not only identified accurately as a single shot but is also accurately identified as dynamic camera, dynamic scene; (c) a shot from the more recent movie Charlie's Angels that accurately switches from static scene to dynamic scene; (d) a shot from a user video captured using a dynamic camera accurately switches from static scene to dynamic scene and back; (e) two clips showing negative results from our algorithm. While the clips belong to the (CS, SD) category, they have been incorrectly labeled (CD, SS) and (CD, SD). The error arises because the dynamic scene has a large spatial extent and occludes the static background, providing no evidence of the static camera.

We found in the film-studies literature that movies have evolved with advances in capture technology, such as cameras on helicopters, zip-lines, stable camera harnesses, etc., for moving cameras, resulting in an increase in dynamic camera shots. Interestingly, our algorithm has picked up this trend, which researchers in film studies had observed manually [3,11,25]. We then group the static and dynamic camera categories to compare the static scene vs. dynamic scene categories in Fig. 6c. We see a decreasing trend for the dynamic scene category and an increasing trend for the static scene category. One explanation is that, with more sophisticated and stable dynamic cameras, more static scenes are now being captured. Another explanation, from recent research in cognitive science, is that recent movies could have reduced dynamics due to rapid shot changes or shortened shots [8].
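A hedged sketch of the bookkeeping behind Fig. 6: given the per-frame labels predicted for each movie and its release year, accumulate the average percentage of frames per label per decade. The input format and the variable names are assumptions for illustration.

```python
from collections import defaultdict
import numpy as np

# Hedged sketch of the per-decade statistics behind Fig. 6. `movies` is a list of
# (release year, per-frame labels) pairs; this input format is an assumption,
# not the paper's data structure.
LABELS = ["CS_SS", "CD_SS", "CS_SD", "CD_SD"]

def decade_trends(movies):
    per_decade = defaultdict(list)              # decade -> per-movie label percentages
    for year, frame_labels in movies:
        decade = (year // 10) * 10
        frac = np.array([np.mean([l == lab for l in frame_labels]) for lab in LABELS])
        per_decade[decade].append(100.0 * frac)  # percentage of frames per label
    return {d: np.mean(v, axis=0) for d, v in sorted(per_decade.items())}

movies = [(1965, ["CS_SD"] * 700 + ["CD_SD"] * 300),
          (2003, ["CD_SD"] * 800 + ["CD_SS"] * 200)]
print(decade_trends(movies))
```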

[Fig. 6: plots of the average percentage of frames in a movie per decade, 1960s-2000s: (a) the four categories, (b) static vs. dynamic camera, (c) static vs. dynamic scene.]


Fig. 6. Film study: (a) Trends over the years: the plot shows the trends for the four categories as the average percentage of frames per movie, from the 1960s to the 2000s. (b-c) Camera and scene motion trends: the plots show the trends for (b) static vs. dynamic camera and (c) static vs. dynamic scene, as the average percentage of frames per movie, from the 1960s to the 2000s.

Time-Stamping Archive Movies. Time-stamping archive data is an exciting new research area. While there has been some prior work on time-stamping archive photographs [26], we consider the novel application of time-stamping archive movies. We use the trends we observe (Fig. 6) and learn a regression model on the 4-dimensional vector of label proportions to predict the decade of an unknown movie. Using 80% of the movies for training and 20% for testing, we report the performance after 5-fold cross-validation in Fig. 7. We perform significantly better than a random decision and, in addition, we are almost always (> 96%) within a decade of the actual time stamp of the movie, as seen from the second set of bar graphs. We also performed a human study with 12 subjects: we provided two movies from each decade to define the task and asked the subjects to label the decade the other movies belong to. We report the performance in Fig. 7. Some of the features the users reported using were the age of famous actors, hairstyles, models of cars, the scene setting, etc. We note that, using just motion cues, our algorithm performs better than humans who used a number of high-level semantic features.
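A minimal scikit-learn sketch of the time-stamping idea: regress the decade from a movie's 4-dimensional label-distribution vector. The synthetic data and the choice of a plain linear regression are assumptions for illustration; the paper only states that a regression model is learnt on the 4-dim vectors.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hedged sketch of decade prediction from the 4-dim label distribution of a movie.
# The data here are synthetic and the specific regressor is an assumption.
rng = np.random.default_rng(0)
decades = rng.choice([1960, 1970, 1980, 1990, 2000], size=60)
# synthetic label distributions loosely following the observed trend
# (more dynamic camera in later decades)
dyn_cam = 0.4 + 0.004 * (decades - 1960) + 0.05 * rng.standard_normal(60)
X = np.stack([0.05 * np.ones(60), dyn_cam * 0.3, 1 - dyn_cam, dyn_cam * 0.7], axis=1)

reg = LinearRegression().fit(X[:48], decades[:48])      # 80% train
pred = np.round(reg.predict(X[48:]) / 10) * 10          # snap to a decade
within_one = np.mean(np.abs(pred - decades[48:]) <= 10)
print(f"fraction within one decade: {within_one:.2f}")
```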
Depth from Video. Another application of the proposed approach is in aiding depth from video. One of the key constraints for geometric approaches to structure from motion (SFM) and depth from video is a dynamic camera. Our approach can segment a long video (or movie) into clips based on the camera motion, thus singling out the frames on which an SFM algorithm can be leveraged. We illustrate depth from video on one of the dynamic camera, static scene clips extracted by the algorithm from the movie Sound of Music in Fig. 8.


[Fig. 7: bar chart comparing the proposed regression, random guessing, and human performance, for exact decade prediction and for at most one decade of error.]

Fig. 7. Time-stamping movies: learning a regression to estimate the year (decade) using the output of the proposed algorithm significantly outperforms random guessing and human performance (12 subjects). The first set of bars indicates the performance on exact decade prediction, while the second set indicates the performance with an error tolerance of at most one decade.

We first recover the camera parameters for the frames using structure from motion [27], followed by a simple fronto-parallel plane-sweep algorithm. It is worth noting that without first separating out these clips, one cannot estimate the camera parameters for the whole video. In addition, while most prior works consider the scenario of a dynamic camera looking at a static scene, our proposed algorithm also segments frames within a shot where a dynamic camera is looking at a dynamic scene, which can lead to interesting future extensions of this work.

Fig. 8. 2D to 3D video: Row 1 shows a clip from the movie Sound of Music labeled as (CD, SS); Row 2 shows the depth maps recovered using plane-sweep stereo; Row 3 shows anaglyph pairs generated using the depth maps (to be viewed with red-cyan glasses).

5 Conclusions and Future Work


In this paper, we have proposed a learning-based video segmentation algorithm for the task of segmenting a video into clips based on scene and camera motion. We showed the effectiveness of the algorithm via quantitative analysis and, using an exhaustive collection of movies spanning 50 years, we demonstrated applications such as inferring camera and scene motion trends, time-stamping archive movies, and depth from video. We believe the proposed temporal video segmentation lends itself to a number of exciting future extensions. For example, our algorithm can tease apart frames where a dynamic object was captured by a dynamic camera, which can be leveraged for applications such as 3D modeling of a dynamic object, dynamic object co-segmentation, etc.

Acknowledgements. The authors thank Jordan DeLong [8] for collecting and sharing
the dataset of movies across decades.

References
1. Cinemetrics, http://www.cinemetrics.lv
2. Bagon, S.: Matlab wrapper for graph cut (December 2006)
3. Bordwell, D.: The Way Hollywood Tells It: Story and Style in Modern Movies. A Hodder
Arnold Publication (2006)
4. Boreczky, J.S., Rowe, L.A.: Comparison of video shot boundary detection techniques. In:
SPIE (1996)
5. Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow algorithms
for energy minimization in vision. PAMI 26(9), 1124-1137 (2004)
6. Boykov, Y., Veksler, O., Zabih, R.: Efficient approximate energy minimization via graph cuts.
PAMI 20(12), 1222-1239 (2001)
7. Chan, A.B., Vasconcelos, N.: Modeling, clustering, and segmenting video with mixtures of
dynamic textures. PAMI 30(5), 909-926 (2008)
8. Cutting, J.E., DeLong, J.E., Brunick, K.L.: Visual activity in Hollywood film: 1935 to 2005
and beyond. PACA 5(2) (2010)
9. Dementhon, D.: Spatio-temporal segmentation of video by hierarchical mean shift analysis.
In: SMVP (2002)
10. Dorai, C., Kobla, V.: Extracting motion annotations from mpeg-2 compressed video for hdtv
content management applications. In: ICMCS (1999)
11. Elsaesser, T., Buckland, W.: Studying Contemporary American Film: A Guide to Movie
Analysis. University of California Press (2002)
12. Garca Cifuentes, C., Sturzel, M., Jurie, F., Brostow, G.J.: Motion models that only work
sometimes. In: BMVC (2012)
13. Gargi, U., Kasturi, R., Antani, S.: Performance characterization and comparison of video
indexing algorithms. In: CVPR (1998)
14. Grundmann, M., Kwatra, V., Han, M., Essa, I.: Efficient hierarchical graph-based video seg-
mentation. In: CVPR (2010)
15. Hanjalic, A., Lagendijk, R.L., Biemond, J.: Automated high-level movie segmentation for
advanced video-retrieval systems. CSVT (1999)
16. Jain, R.: Direct computation of the focus of expansion. PAMI 5(1), 58-64 (1983)
17. Kolmogorov, V., Zabih, R.: What energy functions can be minimized via graph cuts?
PAMI 26(2), 147-159 (2004)
18. Li, Y., Sun, J., Shum, H.-Y.: Video object cut and paste. In: ACM SIGGRAPH (2005)
19. Lienhart, R.: Reliable transition detection in videos: A survey and practitioner's guide.
International Journal of Image and Graphics 1, 469-486 (2001)
20. Liu, C.: Beyond Pixels: Exploring New Representations and Applications for Motion Anal-
ysis. PhD thesis, MIT (2009)
21. Ngo, C.-W., Pong, T.-C., Zhang, H.-J., Chin, R.T.: Motion characterization by temporal slices
analysis. In: CVPR (2000)
22. Otsuji, K., Tonomura, Y.: Projection detecting filter for video cut detection. ACM Multimedia
(1993)
23. Tan, Y.-P., Saur, D.D., Kulkarni, S.R., Ramadge, P.J.: Rapid estimation of camera motion
from compressed video with application to video annotation. CSVT 10, 133-146 (2000)
24. Rasheed, Z., Shah, M.: Scene detection in hollywood movies and tv shows. In: CVPR (2003)

25. Salt, B.: Statistical style analysis of motion pictures. Film Quarterly 28(1) (1974)
26. Schindler, G., Dellaert, F.: Probabilistic temporal inference on reconstructed 3d scenes. In:
CVPR (2010)
27. Snavely, N., Seitz, S., Szeliski, R.: Photo tourism: Exploring photo collections in 3d. In:
SIGGRAPH (2006)
28. Srinivasan, M.V., Venkatesh, S., Hosie, R.: Qualitative estimation of camera motion param-
eters from video sequences. Pattern Recognition 30(4), 593-606 (1997)
29. Wang, J., Thiesson, B., Xu, Y.-Q., Cohen, M.: Image and Video Segmentation by Anisotropic
Kernel Mean Shift. In: Pajdla, T., Matas, J. (eds.) ECCV 2004, Part II. LNCS, vol. 3022,
pp. 238-249. Springer, Heidelberg (2004)
30. Xiong, W., Lee, J.C.-M.: Efficient scene change detection and camera motion annotation for
video classification. CVIU 71(2), 166-181 (1998)
31. Yeung, M., Yeo, B.-L., Liu, B.: Segmentation of video by clustering and graph analysis.
CVIU 71, 94-109 (1998)
32. Yuan, J., Wang, H., Xiao, L., Zheng, W., Li, J., Lin, F., Zhang, B.: A formal study of shot
boundary detection. TCSVT (2007)
33. Zabih, R., Miller, J., Mai, K.: A feature-based algorithm for detecting and classifying scene
breaks. ACM Multimedia (1995)
34. Zhang, H., Kankanhalli, A., Smoliar, S.W.: Automatic partitioning of full-motion video.
Multimedia Syst. 1(1), 10-28 (1993)
35. Zhu, X., Elmagarmid, A.K., Xue, X., Wu, L., Catlin, A.C.: Insightvideo: toward hierarchical
video content organization for efficient browsing, summarization and retrieval. IEEE Trans-
actions on Multimedia 7(4), 648-666 (2005)
Evaluation of Image Segmentation Quality
by Adaptive Ground Truth Composition

Bo Peng^{1,2} and Lei Zhang^{2}

1 Dept. of Software Engineering, Southwest Jiaotong University, China
boopeng@gmail.com
2 Dept. of Computing, The Hong Kong Polytechnic University, Hong Kong
cslzhang@comp.polyu.edu.hk

Abstract. Segmenting an image is an important step in many computer vision applications. However, image segmentation evaluation is far from being well studied, in contrast to the extensive studies on image segmentation algorithms. In this paper, we propose a framework to quantitatively evaluate the quality of a given segmentation with multiple ground truth segmentations. Instead of comparing the given segmentation directly to the ground truths, we assume that if a segmentation is good, it can be constructed from pieces of the ground truth segmentations. Then, for a given segmentation, we adaptively construct a new ground truth which can be locally matched to the segmentation as much as possible while preserving the structural consistency in the ground truths. The quality of the segmentation can then be evaluated by measuring its distance to the adaptively composed ground truth. To the best of our knowledge, this is the first work that provides a framework to adaptively combine multiple ground truths for quantitative segmentation evaluation. Experiments are conducted on the benchmark Berkeley segmentation database, and the results show that the proposed method can faithfully reflect the perceptual qualities of segmentations.

Keywords: Image segmentation evaluation, ground truths, image segmentation.

1 Introduction

Image segmentation is a fundamental problem in computer vision. Over the past


decades, a large number of segmentation algorithms have been proposed with the
hope that a reasonable segmentation could approach the human-level interpreta-
tion of an image. With the emergence and development of various segmentation
algorithms, the evaluation of perceptual correctness on the segmentation output
becomes a demanding task. Despite the fact that image segmentation algorithms
have been, and are still being, widely studied, quantitative evaluation of image
segmentation quality is a much harder problem and the research outputs are
much less mature.

Corresponding author.


Human beings play an essential role in evaluating the quality of image segmentation. Subjective evaluation has long been regarded as the most reliable assessment of segmentation quality. However, it is expensive, time consuming and often impractical in real-world applications. In addition, even for segmentations which are visually close, different human observers may give inconsistent evaluations. As an alternative, objective evaluation methods, which aim to predict the segmentation quality accurately and automatically, are much more desirable.
The existing objective evaluation methods can be classified into ground-truth based ones and non-ground-truth based ones. In non-ground-truth based methods, empirical goodness measures are proposed to capture heuristic criteria of desirable segmentations; a score is then calculated based on these criteria to predict the quality of a segmentation. There have been examples of empirical measures on the uniformity of colors [1, 2] or luminance [3] and even the shape of object regions [4]. Since the criteria are mainly summarized from the common characteristics or semantic information of the objects (e.g., homogeneous regions, smooth boundaries, etc.), they are not accurate enough to describe the complex objects in images.
The ground-truth based methods measure the difference between the segmentation result and human-labeled ground truths. They are more intuitive than the empirical measures, since the ground truths can well represent the human-level interpretation of an image. Some measures in this category count the degree of overlap between regions, with strategies that are either intolerant [5] or tolerant [6] to region refinement. In contrast to working on regions, there are also measures [7, 8] matching the boundaries between segmentations. Considering only the region boundaries, these measures are more sensitive to the dissimilarity between the segmentation and the ground truths than the region-based measures. Some other measures use non-parametric tests to count the pairs of pixels that belong to the same region in different segmentations; the well-known Rand index [9] and its variants [10, 11] are of this kind. So far, there is no standard procedure for segmentation quality evaluation due to the ill-defined nature of image segmentation, i.e., there might be multiple acceptable segmentations which are consistent with the human interpretation of an image. In addition, there exists a large diversity in the perceptually meaningful segmentations of different images. These factors make the evaluation task very complex.
In this work, we focus on evaluating segmentation results with multiple ground truths. The existing methods [5-11] of this kind match the whole given segmentation against the ground truths for evaluation. However, the available human-labeled ground truths are only a small fraction of all the possible interpretations of an image, and the available dataset of ground truths might not contain the ground truth that is suitable to match the input segmentation. Hence such a comparison often leads to a certain bias in the result, or falls short of the goal of objective evaluation.

Fig. 1. A composite interpretation between the segmentation and the ground truths - the basic idea. (a) is a sample image and (b) is a segmentation of it. Different parts (shown in different colors in (c)) of the segmentation can be found to be very similar to parts of the different human-labeled ground truths, as illustrated in (d).

We propose a new framework to solve this problem. The basic idea is illustrated in Fig. 1. Fig. 1(b) shows a possible segmentation of the image in Fig. 1(a), and it is not directly identical to any of the ground truths listed in Fig. 1(d). However, one would agree that Fig. 1(b) is a good segmentation, and it is similar to these ground truths in the sense that it is composed of local structures similar to theirs. Inspired by this observation, we propose to design a segmentation quality measure which can generalize the configurations of the segmentation and preserve the structural consistency across the ground truths. We assume that if a segmentation is good, it can be constructed from pieces of the ground truths. To measure the quality of the segmentation, a composite ground truth can be adaptively produced to locally match the segmentation as much as possible. Note that in Fig. 1(c), the shared regions between the segmentation and the ground truths are typically irregularly shaped; therefore the composition is data driven and cannot be predefined. Also, the confidence of the pieces selected from the ground truths will be examined during the comparison: less reliance should be placed on ambiguous structures, even if they are very similar to the segmentation. The proposed measure integrates all these factors for a robust evaluation. Fig. 2 illustrates the flowchart of the proposed framework. First, a new composite ground truth is adaptively constructed from the ground truths in the database, and then the quantitative evaluation score is produced by comparing the input segmentation with the new ground truth.
Researchers have found that the human visual system (HVS) is highly adapted to extract structural information from natural scenes [14]. As a consequence, a perceptually meaningful measure should be error-sensitive to the structures in the segmentations. It is also known that human observers may pay different amounts of attention to different parts of an image [6, 8]. The different ground truth segmentations of an image therefore present different levels of detail of the objects in the image. This makes them rarely identical in the global view, while more consistent in the local structures. For this reason, the evaluation of image segmentation should be based on the local structures, rather than only the global views.

Fig. 2. Flowchart of the proposed segmentation evaluation framework

The standard similarity measures, such as Mutual Information [13], Mean Square Error (MSE) [14], the probabilistic Rand index [9] and Precision-Recall curves [8], compare the segmentation with the whole ground truths, producing a matching result based on the best or the average score. Since the HVS tends to capture the local structures of images, these measures cannot provide an HVS-like evaluation of the segmentation. The idea of matching images with composite pieces has been explored in [15-17], where there is no globally similar image in the given resource that exactly matches the query image. These methods pursue a composition of the given images that is meaningful for visual perception. In contrast, in this paper we propose to create a new composition that is not only visually similar to the given segmentation, but also keeps the consistency among the ground truths. The construction is performed automatically on the boundary maps instead of the original images. To the best of our knowledge, the proposed work is the first one that generalizes and infers the ground truths for segmentation evaluation.
The rest of the paper is organized as follows. Section 2 provides the theoretical framework for constructing the ground truth that is adaptive to the given segmentation. A segmentation evaluation measure is then presented in Section 3. Section 4 presents experimental results on 500 benchmark images. Finally, the conclusion is drawn in Section 5.

2 The Adaptive Composition of Ground Truth

2.1 Theoretical Framework

Many problems in computer vision, such as image segmentation, can be cast as a labeling problem. Consider a set of ground truths G = {G^1, G^2, ..., G^K} of an image X = {x_1, x_2, ..., x_N}, where G^i = {g^i_1, g^i_2, ..., g^i_N} denotes a labeling set of X, i = 1, ..., K, and N is the number of elements in the image (e.g., pixels). Let S = {s_1, s_2, ..., s_N} be a given segmentation of X, where s_j is the label of x_j (e.g., boundary or non-boundary), j = 1, ..., N. To examine the similarity between S and G, we compute the similarity between S and a new ground truth G', which is to be generated from G based on S. Denote by G' = {g'_1, g'_2, ..., g'_N}

the composite ground truth. We construct G' by putting together pieces from G, i.e., each piece g'_j ∈ {g^1_j, g^2_j, ..., g^K_j}. Clearly, one primary challenge is how to reduce the artifacts in the process of selecting and fusing image pieces. The pieces of the composite ground truth should be integrated seamlessly to keep the consistency of the image content. Thus, our principle for choosing these pieces can be summarized as: each g'_j should be most similar to its counterpart in S under the constraint of structural consistency across the ground truths. Once G' is constructed, the quality of the segmentation S can be evaluated by calculating the similarity between S and G'.
G' is a geometric ensemble of local pieces from G. We adopt an optimistic strategy to choose the elements of G', by which S will match G' as much as possible. G' can then be taken as a new segmentation of X by assigning a label to each pixel. Meanwhile, g'_j contains the information of the corresponding location in the K ground truths. To construct such a G', we introduce a label l_{g'_j} (l = 1, ..., K) for each g'_j in G'. Fig. 3 uses an example to illustrate how to construct the new ground truth G'. We can see that, given two ground truth images G^1 and G^2, G' is found by first computing the optimal labeling set for the ground truths. Then elements of G' which are labeled 1 (or 2) take their values from G^1 (or G^2). This leads to a maximum-similarity matching between S and G'.

Fig. 3. An example of adaptive ground truth composition for the given segmentation S. G^1 and G^2 are two ground truths by human observers. The optimal labeling {l_{G^1}, l_{G^2}} of G^1 and G^2 produces a composite ground truth G', which matches S as much as possible.

Given a set of labels L and the set of elements of G', to construct the segmentation G' a label l_g ∈ L needs to be assigned to each element g ∈ G'. The label set could be an arbitrary finite set, such as L = {1, 2, ..., K}. Let l = {l_g | l_g ∈ L} stand for a labeling, i.e., an assignment of labels to all elements of G'. We formulate the labeling problem in terms of energy minimization, and

then seek the labeling l that minimizes the energy. In this work, we propose an energy function that follows the Potts model [18]:

E(l) = \sum_{j} D(l_{g'_j}) + \lambda \sum_{\{g'_j, g'_{j'}\} \in M} u_{\{g'_j, g'_{j'}\}} \, T(l_{g'_j} \neq l_{g'_{j'}})    (1)

There are two terms in this energy function. The first term D(l_{g'_j}) is called the data term. It penalizes the decision of assigning l_{g'_j} to the element g'_j, and thus can be taken as a measure of difference. Denoting by d(s_j, g^l_j) the normalized distance between the segmentation S and the l-th ground truth at element j, we can define:

D(l_{g'_j}) = d(s_j, g^{l_{g'_j}}_j)    (2)

The second term u_{\{g'_j, g'_{j'}\}} T(l_{g'_j} \neq l_{g'_{j'}}) indicates the cost of assigning different labels to the pair of elements {g'_j, g'_{j'}} in G'. M is a neighborhood system, and T is an indicator function:

T(l_{g'_j} \neq l_{g'_{j'}}) = \begin{cases} 1 & \text{if } l_{g'_j} \neq l_{g'_{j'}} \\ 0 & \text{otherwise} \end{cases}    (3)

We call u_{\{g'_j, g'_{j'}\}} T(l_{g'_j} \neq l_{g'_{j'}}) the smoothness term in that it encourages the labels within the same region to be identical. In this way, the consistency of neighboring structures can be preserved. It is expected that the separation of regions should pay a higher cost on elements whose label is agreed upon by few ground truths, and a lower cost in the reverse case. Thus we define u_{\{g'_j, g'_{j'}\}} as:

u_{\{g'_j, g'_{j'}\}} = \min\{\bar{d}_j, \bar{d}_{j'}\}    (4)

where \bar{d}_j is the average distance between g'_j and {g^1_j, g^2_j, ..., g^K_j}.
In Eq. (1), the parameter \lambda controls the relative importance of the data term versus the smoothness term. If \lambda is very small, only the data term matters; in this case, the label of each element is independent of the other elements. If \lambda is very large, all the elements will have the same label. In our implementation, \lambda is set manually and its range for different images does not vary much (from 20 to 100), which is similar to many graph cut based works.
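To make Eq. (1) concrete, the sketch below evaluates the Potts energy of a candidate labeling on a 4-connected pixel grid, given per-pixel data costs and the average distances used by the weights of Eq. (4). It only computes the energy; the minimization itself is delegated to the expansion/swap moves of [19]. The 4-connectivity, the array layout, and the value of lambda are assumptions for illustration.

```python
import numpy as np

# Hedged sketch: evaluate the Potts energy of Eq. (1) for a candidate labeling on a
# 4-connected pixel grid. `data_cost[j, l]` is the data term D at pixel j for label l
# (flattened image), `dist_bar[j]` is the average distance used by the weights of
# Eq. (4). Only the energy is computed; minimization uses expansion/swap moves [19].

def potts_energy(labeling, data_cost, dist_bar, shape, lam=50.0):
    H, W = shape
    lab = labeling.reshape(H, W)
    dbar = dist_bar.reshape(H, W)
    energy = data_cost[np.arange(labeling.size), labeling].sum()          # data term
    for axis in (0, 1):                                                   # down / right neighbours
        a, b = (lab[:, :-1], lab[:, 1:]) if axis else (lab[:-1, :], lab[1:, :])
        da, db = (dbar[:, :-1], dbar[:, 1:]) if axis else (dbar[:-1, :], dbar[1:, :])
        u = np.minimum(da, db)                                            # Eq. (4)
        energy += lam * (u * (a != b)).sum()                              # smoothness term
    return energy

rng = np.random.default_rng(0)
H, W, K = 8, 8, 3
labeling = rng.integers(0, K, H * W)
print(potts_energy(labeling, rng.random((H * W, K)), rng.random(H * W), (H, W)))
```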
Minimization of Eq. (1) is NP-hard. We use the effective expansion-moves/swap-moves algorithm described in [19] to solve it. The algorithm computes minimum-cost multi-way cuts on a defined graph. Nodes in the graph are connected to their neighbors by n-links, and each n-link is assigned a weight u_{\{g'_j, g'_{j'}\}} as defined in the energy function Eq. (1). With K ground truths, there are K virtual nodes in the graph, representing the K labels of these ground truths. Each graph node connects to the K virtual nodes by t-links, which we weight by D(l_{g'_j}) to measure the similarity between the graph nodes and the virtual nodes. The K-way cuts divide the graph into K parts and induce a one-to-one correspondence with the labeling of the graph. Fig. 4 shows an example, where red lines are the graph cuts computed by the expansion-moves/swap-moves algorithm. The labeling of the graph is accordingly obtained

(shown in different colors). Finally, we copy the segmented pieces of regions from each ground truth to form the new ground truth G' that is driven by the given segmentation S.

Fig. 4. An example of the construction of G'. It is produced from copies of selected pieces (shaded areas) of the ground truths.

2.2 The Definition of Distance


The distance d, which is used in Eq. (2) and Eq. (4), needs to be defined in order to optimize the labeling energy function Eq. (1). Although many distance measures have been proposed in the literature, it is not trivial to perform the matching between the machine's segmentation and the human-labeled ground truths. For ground truth data, due to the localization errors produced in the drawing process, boundaries of the same object might not fully overlap. This is an inherent problem of human-labeled ground truths. In Fig. 5, we show an example of boundary distortions in different ground truths. If pixels are simply matched between the segmentation S and the ground truths G, S will probably be over-penalized by the unstable and slightly mis-localized boundaries in G. Moreover, different segmentation algorithms may produce object boundaries of different widths. For example, if we take the border pixels on both sides of adjacent regions, the boundaries will appear with a two-pixel width. To match the segmentation appropriately, the measure should tolerate some acceptable deformations between different segmentations. The work in [20] solves this problem by matching the boundaries under a defined threshold. However, since their method involves a minimum-cost perfect matching of a bipartite graph, it might be limited to operating on boundaries only.
As an alternative, we consider the structural similarity index CW-SSIM proposed by Sampat et al. [21]. It is a general-purpose image similarity index which benefits from the fact that the relative phase patterns of complex wavelet coefficients can well preserve the structural information of local image features, and that rigid translation of image structures leads to a constant phase shift. CW-SSIM overcomes the drawback of its previous version SSIM [15] in that it does not require precise correspondences between pixels, and thus becomes robust to

Fig. 5. An example of distorted boundaries among different human-labeled ground truths. The ground truths are drawn into the same image, where whiter boundaries indicate that more subjects marked a boundary. (Taken from the Berkeley dataset [6].)

the small geometric deformations during the matching process. Therefore, we adopt the principle of CW-SSIM and slightly modify it into a new index called G-SSIM, which uses the complex Gabor filtering coefficients of an image instead of the steerable complex wavelet transform coefficients used in [21]. Specifically, the Gabor filtering coefficients are obtained by convolving the segmentation with 24 Gabor kernels, on 3 different scales and along 8 different orientations. The G-SSIM on each Gabor kernel is then defined as:

H(c_s, c_{s'}) = \frac{2\sum_{i=1}^{N} |c_{s,i}||c_{s',i}| + \epsilon}{\sum_{i=1}^{N} |c_{s,i}|^2 + \sum_{i=1}^{N} |c_{s',i}|^2 + \epsilon} \cdot \frac{2\left|\sum_{i=1}^{N} c_{s,i}\,\bar{c}_{s',i}\right| + \delta}{2\sum_{i=1}^{N} |c_{s,i}\,\bar{c}_{s',i}| + \delta}    (5)

where c_s and c_{s'} are the complex Gabor coefficients of the two segmentations s and s', respectively, |c_{s,i}| is the magnitude of a complex Gabor coefficient, and \bar{c} is the conjugate of c. \epsilon and \delta are small positive constants that make the calculation stable.
It is easy to see that the maximum value of G-SSIM is 1, attained when c_s and c_{s'} are identical. Human labeling variations cause some technical problems in handling the multiple ground truths, and this wavelet-type measure can better cope with them: it preserves the local structural information of the image without requiring precise correspondences between pixels, and it is robust to small geometric deformations during the matching process. We now define the distance d as:

d(c_s, c_{s'}) = 1 - H(c_s, c_{s'})    (6)

where H(c_s, c_{s'}) is the average value of G-SSIM over the 24 Gabor kernels. Notice that although the value of d is defined at each pixel, it is inherently determined by the textural and structural properties of localized regions of the image pair. With the distance d defined in Eq. (6), we can optimize Eq. (1) so that the composite ground truth G' for the given segmentation S is obtained. Fig. 6 shows three examples of the adaptive ground truth composition. It should be noted that the composite ground truth does not have to contain closed boundaries, since it is designed to work in conjunction with the benchmark criterion (in Sec. 3).
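A small numpy sketch of Eqs. (5)-(6), assuming the complex Gabor coefficients of the two segmentations are already available (one array per kernel); the values of the stabilizing constants are placeholders, not the paper's settings.

```python
import numpy as np

# Hedged sketch of Eqs. (5)-(6): G-SSIM between the complex Gabor coefficients of two
# segmentations for one kernel, and the induced distance. The constants are
# placeholders; in practice the score is averaged over the 24 kernels
# (3 scales x 8 orientations) before taking the distance.

def g_ssim(cs, cs2, eps=1e-3, delta=1e-3):
    cs, cs2 = cs.ravel(), cs2.ravel()
    mag = (2 * np.sum(np.abs(cs) * np.abs(cs2)) + eps) / \
          (np.sum(np.abs(cs) ** 2) + np.sum(np.abs(cs2) ** 2) + eps)
    phase = (2 * np.abs(np.sum(cs * np.conj(cs2))) + delta) / \
            (2 * np.sum(np.abs(cs * np.conj(cs2))) + delta)
    return mag * phase

def g_ssim_distance(coeffs_s, coeffs_s2):
    # coeffs_s, coeffs_s2: lists of complex coefficient arrays, one per Gabor kernel
    h = np.mean([g_ssim(a, b) for a, b in zip(coeffs_s, coeffs_s2)])
    return 1.0 - h                                  # Eq. (6)

rng = np.random.default_rng(0)
c = [rng.standard_normal((16, 16)) + 1j * rng.standard_normal((16, 16)) for _ in range(24)]
print(g_ssim_distance(c, c))                        # identical coefficients -> distance 0
```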

Fig. 6. Examples of the composite ground truth G'. (a) Original images. (b)-(e) Human-labeled ground truths. (f) Input segmentations of the images by the mean-shift algorithm [22]. (g) The composite ground truth G'.

3 The Measure of Segmentation Quality


Once the composite ground truth G' that is adaptive to S is computed, the problem of image segmentation quality evaluation becomes a general image similarity measurement problem: given a segmentation S, how to calculate its similarity (or distance) to the ground truth G'. Various similarity measures can be employed, and in this section we present a simple but effective one.
In the process of constructing the ground truth, we have allowed for some reasonable deformations of the local structures of S. Once the distance d(s_j, g'_j) between s_j and g'_j is obtained, the distance for the whole segmentation can be calculated as the average of all d(s_j, g'_j). However, the confidence of such an evaluation between s_j and g'_j also needs to be considered, since less reliance should be placed on ambiguous structures, even if they are very similar to the segmentation. For this purpose, we introduce R_{s_j} as the empirical global confidence of g'_j w.r.t. G. For example, we can estimate R_{s_j} from the similarity between g'_j and {g^1_j, g^2_j, ..., g^K_j}:

R_{s_j} = 1 - \bar{d}_j    (7)

where \bar{d}_j is the average distance between g'_j and {g^1_j, g^2_j, ..., g^K_j}. In Eq. (7), R_{s_j} achieves the highest value 1 when the distance between g'_j and {g^1_j, g^2_j, ..., g^K_j} is zero, and the lowest value zero in the opposite situation. Since R_{s_j} is a positive factor describing the confidence, the similarity between s_j and g'_j should be normalized to [-1, 1] so that high confidence acts reasonably for both good and bad segmentations. If there are K instances in G and all of them contribute to the construction of G', we can decompose S into K disjoint sets {S_1, S_2, ..., S_K}. Based on the above considerations, the measure of segmentation quality is defined as:

M(S, G) = \frac{1}{N} \sum_{i=1}^{K} \sum_{s_j \in S_i} \left(1 - 2\,d(s_j, g'_j)\right) R_{s_j}    (8)

We can see that the proposed quality measure ranges from -1 to 1, and is an accumulated sum of the similarities contributed by the individual elements of S. The minimum value of d(s_j, g'_j) is zero, attained when s_j is completely identical to the ground truth g'_j; meanwhile, if all the ground truths in G are identical, R_{s_j} will be 1 and M(S, G) will achieve its maximum value 1. Note that the measure is determined by both the distances d(s_j, g'_j) and the confidences R_{s_j}. If S is only similar/dissimilar to G' without a high consistency among the ground truth data {g^1_j, g^2_j, ..., g^K_j}, M(S, G) will be close to zero. This might be a common case for images with complex content, where the perceptual interpretation of the image content is diverse. In other words, R_{s_j} reinforces confident decisions on the similarity/dissimilarity and therefore preserves the structural consistency in the ground truths.
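A minimal sketch of Eqs. (7)-(8), assuming the per-pixel distances to the composite ground truth and the average distances to the K ground truths have already been computed; the array-based representation is an assumption for illustration.

```python
import numpy as np

# Hedged sketch of Eqs. (7)-(8): the segmentation quality score, assuming the
# per-pixel distances d(s_j, g'_j) to the composite ground truth and the average
# distances d-bar_j to the K ground truths are available as flat arrays.

def segmentation_quality(d_to_composite, d_bar):
    confidence = 1.0 - d_bar                        # Eq. (7): R_{s_j}
    per_pixel = (1.0 - 2.0 * d_to_composite) * confidence
    return per_pixel.mean()                         # Eq. (8), values in [-1, 1]

rng = np.random.default_rng(0)
d_comp, d_bar = rng.random(10000) * 0.1, rng.random(10000) * 0.2
print(segmentation_quality(d_comp, d_bar))          # high for a good, consistent case
```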

4 Experiments

In this section, we conduct experiments to validate the proposed measure in comparison with the commonly used F-measure [20] and the Probabilistic Rand (PR) index [11] on a large number of segmentation results. The code of our method is available at http://www4.comp.polyu.edu.hk/~cslzhang/SQE/SQE.htm. It should be stressed that the proposed method evaluates the quality of an input segmentation; it is nonetheless possible to adopt it in a system for segmentation algorithm evaluation.
The F-measure is mainly used in boundary-based evaluation [20]. Specifically, a precision-recall framework is introduced for computing this measure. Precision is the fraction of detections that are true positives rather than false positives, while recall is the fraction of true positives that are detected rather than missed. In the context of probability, precision corresponds to the probability that a detection is valid, and recall is the probability that the ground truth is detected. In the presence of multiple ground truths, a reasonable combination of the values should be considered. In [20], only the boundaries which match no ground truth boundary are counted as false ones, and the precision value is averaged over all the related ground truths. A combination of precision and recall leads to the F-measure:

F = \frac{P R}{\alpha R + (1 - \alpha) P}    (9)

where \alpha is a relative cost between precision (P) and recall (R). We set it to 0.5 in the experiments.
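For reference, Eq. (9) in code (a trivial sketch; the example precision and recall values are made up):

```python
def f_measure(precision, recall, alpha=0.5):
    # Eq. (9); alpha = 0.5 reduces to the harmonic mean of precision and recall
    return precision * recall / (alpha * recall + (1 - alpha) * precision)

print(f_measure(0.8, 0.6))  # about 0.686
```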
The PR index [11] examines the pairwise relationships in the segmentation. If the labels of pixels x_j and x_{j'} are the same in the segmentation image, their labels are expected to be the same in the ground truth image for a good segmentation, and vice versa. Denote by l^S_j the label of pixel x_j in the segmentation S and by l^G_j the corresponding label in the ground truth G. The PR index for comparing S with multiple ground truths G is defined as:

P R(S, G) = \frac{1}{\binom{N}{2}} \sum_{j < j'} \left[ I(l^S_j = l^S_{j'})\, p_{j,j'} + I(l^S_j \neq l^S_{j'})\,(1 - p_{j,j'}) \right]    (10)

where p_{j,j'} is the ground truth probability that the pair of pixels carries the same label. In practice, p_{j,j'} is defined as the mean pixel-pair relationship among the ground truths:

p_{j,j'} = \frac{1}{K} \sum_{i=1}^{K} I(l^{G_i}_j = l^{G_i}_{j'})    (11)

According to the above definitions, the PR index ranges from 0 to 1, where a score of zero indicates that the segmentation and the ground truths have opposite pairwise relationships, while a score of 1 indicates that the two have the same pairwise relationships.
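A brute-force numpy sketch of Eqs. (10)-(11); it enumerates all pixel pairs, so it is only meant for small label maps, and the toy inputs are made up.

```python
import numpy as np

# Hedged sketch of Eqs. (10)-(11): the Probabilistic Rand index between a segmentation
# S and K ground truths, computed by brute force over all pixel pairs (quadratic in
# the number of pixels, so only suitable for small label maps).

def pr_index(seg, ground_truths):
    s = seg.ravel()
    same_s = (s[:, None] == s[None, :])
    p = np.mean([(g.ravel()[:, None] == g.ravel()[None, :]) for g in ground_truths], axis=0)
    agree = np.where(same_s, p, 1.0 - p)            # Eq. (10) summand
    iu = np.triu_indices(s.size, k=1)               # pairs j < j'
    return agree[iu].mean()

seg = np.array([[0, 0, 1], [0, 1, 1]])
gts = [np.array([[0, 0, 1], [0, 1, 1]]), np.array([[2, 2, 2], [2, 3, 3]])]
print(pr_index(seg, gts))   # 1 only if S agrees pairwise with every ground truth; here < 1
```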

Fig. 7. Examples of measure scores for different segmentations. For each original image, 5 segmentations are obtained by the mean-shift algorithm [22]. The rightmost column shows the plots of scores achieved by the F-measure (in blue), the PR index (in green) and our method (in red).

For a segmentation algorithm, different parameter settings lead to segmentations of different granularities. In Fig. 7, different segmentations of the given images are produced by the mean-shift method [22]. From the plots of scores (in the rightmost column) we can see that the proposed measure produces the results closest to human perception.

In contrast, both the F-measure and the PR index wrongly identify the best segmentation (in the 2nd, 3rd, 5th and 6th rows) or reduce the range of scores (in the 1st and 4th rows). The success of the proposed measure comes from the adaptive evaluation of the meaningful structures at different levels. From the definition of the F-measure [20], the highest value of recall is achieved when a segmentation contains all the boundaries in the ground truths, a strategy that is not very intuitive in practical applications. The PR index [11] suffers from issues such as a reduced dynamic range and a bias towards large regions. These drawbacks can be observed in the examples of Fig. 7.
Next, we examine the overall performance of the proposed measure on 500 natural images from the Berkeley segmentation database [6]. We apply three segmentation techniques, mean-shift (MS) [22], normalized cut (NC) [23] and the efficient graph-based algorithm (EG) [12], to produce segmentations of these images. For each image, we tune the parameters of the segmentation method so that two different segmentations are produced, and we ensured that both segmentations are visually meaningful so that observers can make a meaningful judgement. Thus 500 pairs of segmentations are obtained. Scores for these segmentations are computed with the proposed measure, the F-measure and the PR index. We then asked 10 observers to evaluate the 500 pairs of segmentations by pointing out which segmentation is better than the other, or whether there is a tie between them. We avoided giving specific instructions so that the task remains general. We took the answers of the majority, and if the two segmentations are in a tie, any judgment from a measure is taken as correct. In the subjective evaluation results, 85 pairs were classified as ties.
Table 1 shows the comparison results of the three measures. Each of them gives 500 results on the pairs of segmentations; our measure has 379 results consistent with the human judgment, while the F-measure and the PR index have only 333 and 266, respectively. We also count the number of results which are correctly produced by only one measure (we call them winning cases). Our measure obtains 114 winning cases, while the F-measure and the PR index obtain only 4 and 19, respectively. There are 16 results which are wrongly classified by all three measures. The proposed measure outperforms the other two in terms of both the consistent and the winning cases. In the meantime, we can make an interesting observation. Since the F-measure and the PR index are based on the boundary and on the region relationship, respectively, they have different preferences in segmentation quality evaluation. The PR index works better than the F-measure in terms of winning cases, but it works the worst in terms of consistent cases. This is mainly because the PR index is a region-based measure, which might compensate for the failure of boundary-based measures. In the PR index, the parameter p_{j,j'} is based on the ground truths of the image; however, having a sufficiently large number of valid segmentations of an image is often infeasible in practice. As a result, it is hard to obtain a good estimate of this parameter. Moreover, one can conclude that this limitation of the ground truth

segmentations in existing databases makes the proposed adaptive ground truth composition based segmentation evaluation method a more sensible choice.

Table 1. The number of results which are consistent with the human judgment, as well as the numbers of winning cases and of failure cases (pairs misclassified by all three measures)

             Consistent   Winning   Failure
F-measure       333           4
PR index        266          19        16
Ours            379         114

5 Conclusions

In this paper, we presented a novel framework for quantitatively evaluating the quality of image segmentations. Instead of directly comparing the input segmentation with the ground truth segmentations, the proposed method adaptively constructs a new ground truth from the available ground truths according to the given segmentation. In this way, it becomes more effective and reasonable to measure the local structures of the segmentation. The quality of the given segmentation can then be measured by comparing it with the adaptively composed ground truth. From the comparison between the proposed measure and two other popular measures on the benchmark Berkeley database of natural images, we can see that the proposed method better reflects the perceptual interpretation of the segmentations. Finally, we compared the measures on 1000 segmentations of 500 natural images. By counting the number of segmentations judged consistently with humans, it was shown that the proposed measure performs better than the F-measure and the PR index. The proposed method can be extended to produce closed contours for the composition, and this will be our future work.

References
1. Liu, J., Yang, Y.: Multiresolution color image segmentation. IEEE Trans. on Pattern Analysis
and Machine Intelligence 16, 689-700 (1994)
2. Borsotti, M., Campadelli, P., Schettini, R.: Quantitative evaluation of color image segmentation
results. Pattern Recognition Letters 19, 741-747 (1998)
3. Zhang, H., Fritts, J., Goldman, S.: An entropy-based objective segmentation evaluation
method for image segmentation. In: SPIE Electronic Imaging Storage and Retrieval Methods
and Applications for Multimedia, pp. 38-49 (2004)
4. Ren, X., Malik, J.: Learning a classification model for segmentation. In: IEEE Conference on
Computer Vision, pp. 10-17 (2003)
5. Christensen, H., Phillips, P.: Empirical evaluation methods in computer vision. World
Scientific Publishing Company (2002)

6. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images
and its application to evaluating segmentation algorithms and measuring ecological statistics.
In: International Conference on Computer Vision, pp. 416-423 (2001)
7. Freixenet, J., Muñoz, X., Raba, D., Martí, J., Cufí, X.: Yet Another Survey on Image
Segmentation: Region and Boundary Information Integration. In: Heyden, A., Sparr, G.,
Nielsen, M., Johansen, P. (eds.) ECCV 2002, Part III. LNCS, vol. 2352, pp. 408-422. Springer,
Heidelberg (2002)
8. Martin, D.: An empirical approach to grouping and segmentation. Ph.D. dissertation, U.C.
Berkeley (2002)
9. Rand, W.: Objective criteria for the evaluation of clustering methods. Journal of the American
Statistical Association 66, 846-850 (1971)
10. Fowlkes, E., Mallows, C.: A method for comparing two hierarchical clusterings. Journal of
the American Statistical Association 78, 553-569 (1983)
11. Unnikrishnan, R., Hebert, M.: Measures of similarity. In: IEEE Workshop on Applications of
Computer Vision, pp. 394-400 (2005)
12. Felzenszwalb, P., Huttenlocher, D.: Efficient graph-based image segmentation. International
Journal of Computer Vision 59, 167-181 (2004)
13. Viola, P., Wells, W.: Alignment by maximization of mutual information, vol. 3, pp. 16-23
(1995)
14. Wang, Z., Bovik, A.: Modern image quality assessment. Morgan and Claypool
Publishing Company, New York (2006)
15. Wang, Z., Bovik, A., Sheikh, H., Simoncelli, E.: Image quality assessment: From error
visibility to structural similarity. IEEE Transactions on Image Processing 13, 600-612 (2004)
16. Agarwala, A., Dontcheva, M., Agrawala, M., Drucker, S., Colburn, A., Curless, B., Salesin,
D., Cohen, M.: Interactive digital photomontage, vol. 23, pp. 294-302 (2004)
17. Russell, B., Efros, A., Sivic, J., Freeman, W., Zisserman, A.: Segmenting scenes by matching
image composites. In: Advances in Neural Information Processing Systems, pp. 1580-1588
(2009)
18. Potts, R.: Some generalized order-disorder transformation. Proceedings of the Cambridge
Philosophical Society 48, 106-109 (1952)
19. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts.
IEEE Transactions on Pattern Analysis and Machine Intelligence 23, 1222-1239 (2001)
20. Martin, D., Fowlkes, C., Malik, J.: Learning to detect natural image boundaries using local
brightness, color and texture cues. IEEE Trans. on Pattern Analysis and Machine Intelligence
26, 530-549 (2004)
21. Sampat, M., Wang, Z., Gupta, S., Bovik, A., Markey, M.: Complex wavelet structural
similarity: A new image similarity index. IEEE Transactions on Image Processing 18,
2385-2401 (2009)
22. Comaniciu, D., Meer, P.: Mean shift: A robust approach toward feature space analysis. IEEE
Trans. on Pattern Analysis and Machine Intelligence 24, 603-619 (2002)
23. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Transactions on Pattern
Analysis and Machine Intelligence 22, 888-905 (1997)
Exact Acceleration of Linear Object Detectors

Charles Dubout and François Fleuret

Idiap Research Institute, Centre du Parc,
Rue Marconi 19, CH-1920 Martigny, Switzerland
{charles.dubout,francois.fleuret}@idiap.ch

Abstract. We describe a general and exact method to considerably speed up linear object detection systems operating in a sliding, multi-scale window fashion, such as the individual part detectors of part-based models. The main bottleneck of many of those systems is the computational cost of the convolutions between the multiple rescalings of the image to process and the linear filters. We make use of properties of the Fourier transform and of clever implementation strategies to obtain a speedup factor proportional to the filters' sizes. The gain in performance is demonstrated on the well-known Pascal VOC benchmark, where we accelerate said convolutions by an order of magnitude.

Keywords: linear object detection, part-based models.

1 Introduction
A common technique for object detection is to apply a binary classifier at every possible position and scale of an image in a sliding-window fashion. However, searching the entire search space, even with a simple detector, can be slow, especially if a large number of image features is used.
To that end, linear classifiers have gained huge popularity in the last few years. Their simplicity allows for very large scale training and relatively fast testing, as they can be implemented in terms of convolutions. They can also reach state-of-the-art performance, provided one uses sufficiently discriminative features. Indeed, such systems have consistently ranked atop the popular Pascal VOC detection challenge [1,2]. Part-based deformable models [3,4] are the latest incarnation of such systems, and the current winners of the challenge.
The algorithm we propose leverages the classical use of the Fourier transform to accelerate the multiple evaluations of a linear predictor in a multi-scale sliding-window detection scheme. Despite relying on a classic result of signal processing, the practical implementation of this strategy is not straightforward and requires a careful organization of the computation. It can be summarized in three main ideas: (a) we exploit the linearity of the Fourier transform to avoid having one such transform per image feature (see Sect. 2.2), (b) we control the memory usage required to store the transforms of the filters by building patchworks combining the multiple scales of an image (see Sect. 3.1), and finally (c) we optimize the use of the processor cache by computing the Fourier domain point-wise multiplications in small fragments (see Sect. 3.2).


Our implementation is a drop-in replacement for the publicly available system from [5], and provides close to one order of magnitude acceleration of the convolutions (see Table 3). It is available under the GPL open source license at http://www.idiap.ch/scientific-research/resources.

1.1 Related Work


Popular methods to search a large space of candidate object locations include cascades of simple classifiers [6], salient regions [7], Hough transform based detection [8], and branch-and-bound [9]. Regarding part-based models, only the first method, building a cascade of classifiers, has been investigated [10]. Cascades in general, and [10] in particular, are approximate methods with no guaranteed speedup in the worst case. Their thresholds are also notoriously hard to tune in order to obtain good performance without sacrificing too much accuracy, often requiring a dedicated validation set [11,6]. The approach we pursue here is akin to [12], taking advantage of properties of the Fourier transform to speed up linear object detectors using multiple features.
Besides accelerating the evaluation of the detector at each possible location, other works have already dealt with the efficient computation of the feature pyramid and, in the case of part-based models, of the optimal assignment of the part locations. The fast construction of the complete image pyramid and the associated feature computation at each scale has been addressed by [13]. Their idea is to compute such features only once per octave and interpolate the scales in between, making the whole process typically an order of magnitude faster with only a minor loss in detection accuracy. [14] provides linear time algorithms for solving maximization problems involving an arbitrary sampled function and a spatial quadratic cost. By using deformation costs of this form, the optimal assignment of the part locations can be computed efficiently.

Table 1. Notations

C_std        computational cost in flops of a standard convolution
F            size (number of coefficients) of a fragment (see Sect. 3.2)
K            number of features
L            number of linear filters
R            number of patchworks (see Sect. 3.1)
             rescaling factor of the image pyramid
u(F)         time it takes to point-wise multiply together two planes of F coefficients
v(F)         time it takes to read (resp. write) F coefficients to (resp. from) the CPU cache
M x N        size of an image
x^k in R^{M x N}             the k-th feature plane of a particular image
P x Q        size of a filter
y^k in R^{P x Q}             the k-th feature plane of a particular filter
z in R^{(M+P-1) x (N+Q-1)}   the scores of a particular filter evaluated on a particular image

u(F) and v(F) are linear for large enough values of F



2 Linear Object Detectors and Fourier Transform

Typical linear object detectors, including the individual part detectors of part-based models, extract low-level features densely from every scale of an image pyramid. Those features are arranged in a coarse grid, with several features extracted from each grid cell. For example, Histogram of Oriented Gradients (HOG) features [15] correspond to the bins of a histogram of the gradient orientations of the pixels within the cell. Typically, cells of size 8 x 8 pixels are used [15,4], while the number of features per cell varies from around ten to a hundred (32 in [5], which we use as a baseline). An alternative description of the arrangement of the features is to view them as organized in planes, as depicted in Fig. 1. Those planes are analogous to the RGB channels of standard color images, but instead of colors they contain distinct features from each cell of the grid. The filters trained by the detector are similar in composition.

Fig. 1. The top figure shows the standard process, convolving and summing all image
and filter planes. The bottom figure depicts how such a process can be sped up by
taking advantage of the fact that the inverse Fourier transform that produces the final
detection score needs to be done only once per image / filter pair, and not once per
feature, since the sum across planes can be done in the Fourier domain.

2.1 Evaluation of a Linear Detector as a Convolution

Let K stand for the number of features, x^k ∈ R^{M×N} be the k-th feature plane
of a particular image, and y^k ∈ R^{P×Q} the k-th feature plane of a particular
filter. The scores z ∈ R^{(M+P−1)×(N+Q−1)} of a filter evaluated on an image are
given by the following formula:

    z_{ij} = Σ_{k=0}^{K−1} Σ_{p=0}^{P−1} Σ_{q=0}^{Q−1} x^k_{i+p, j+q} · y^k_{p,q},   (1)

that is, the sum across planes of the Frobenius inner products of the image's
sub-window of size P × Q starting at position (i, j) and the filter. Computing
the responses of a filter at all possible locations can thus be done by summing
the results of the (full) convolutions of the image and the (reversed) filter, i.e.

    z = Σ_{k=0}^{K−1} x^k ∗ ȳ^k,   (2)

where ȳ is the reversed filter (ȳ_{ij} = y_{P−1−i, Q−1−j}).


The cost of a standard convolution between an image of size M × N and a
filter of size P × Q is O(MNPQ). More precisely, the number of floating point
operations of a standard (full) convolution is

    C_std = 2 MNPQ,   (3)

corresponding to one multiplication and one addition for each image and each
filter coefficient. Ultimately one needs to convolve L filters and sum them across
K feature planes (see Fig. 1), bringing the total number of operations per image
to

    C_std/image = K L C_std.   (4)
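As an illustration of Eqs. (1)-(4), the following is a minimal NumPy/SciPy sketch (ours, not the paper's code) of the standard per-plane evaluation; the array shapes follow Table 1 and all names are ours.

```python
import numpy as np
from scipy.signal import convolve2d

def scores_direct(x, y):
    """Eq. (2): sum over the K feature planes of the full convolution of the
    image plane x[k] (M x N) with the reversed filter plane y[k] (P x Q)."""
    K, M, N = x.shape
    _, P, Q = y.shape
    z = np.zeros((M + P - 1, N + Q - 1))
    for k in range(K):
        y_rev = y[k, ::-1, ::-1]                 # reversed filter, as in Eq. (2)
        z += convolve2d(x[k], y_rev, mode="full")
    return z

# roughly 2*M*N*P*Q flops per plane (Eq. (3)), repeated K*L times per image (Eq. (4))
```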

2.2 Leveraging the Fourier Transform


It is well known that convolving in the original signal space is equivalent to point-wise
multiplying in the Fourier domain. Convolutions done by first computing
the Fourier transforms of the input signals, multiplying them in the Fourier domain,
and then taking the inverse transform of the result can also be more efficient if the
filter size is big enough. Indeed, the cost of a convolution done with the help of
the Fourier Transform is O(MN log MN).
If we define

    C_FFT ≈ 2.5 MN log₂ MN,   (5)
    C_mul = 4 MN,   (6)

the costs of one FT (the approximation of C_FFT comes from [16]) and of the
point-wise multiplications respectively, the total cost is approximately

    C_Fourier = 3 C_FFT + C_mul   (7)

per product for the three (two forward and one inverse) transforms using a Fast
Fourier Transform algorithm [16]. Note that the filters' forward FTs can be done
off-line, and thus should not be counted in the overall detection time, and that
an image's forward FT has to be done only once, independently of the number

of filters. Moreover, in the case of learning methods based on bootstrapping
samples, the images' forward FTs can also be done off-line for training.
Taking all this into account, and using the linearity property of the FT, one
can drastically reduce the cost per image from K L C_Fourier. Since the FT is
linear, it does not matter if the sum across planes is done before or after the
inverse transforms. If done before, only one inverse transform per filter will be
needed even if there are multiple planes. Together with the fact that the forward
transforms need to be done only once per filter or per image, the total cost per
image is

    C_Fourier/image = K C_FFT [forward FFTs] + L C_FFT [inverse FFTs] + K L C_mul [multiplications],   (8)

enabling large computational gains if K + L ≪ KL.
Plugging in typical numbers (M, N = 64, P, Q = 6, K = 32, L = 54 as in [5]),
doing the convolutions with Fourier results in a theoretical speedup factor of 13.
The cost is independent of the filters' size P × Q, resulting in even larger gains
for bigger filters. The FT is also very numerically accurate, as demonstrated by
our experiments. There is no precision loss for small filter sizes, and even an
increase in precision for larger ones. Finally, one can also reduce by half the
cost of computing the FTs of the filters if they are symmetric [4], or come in
symmetric pairs [5].
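A minimal NumPy sketch (ours; the paper's implementation uses FFTW and Eigen in C++) of the computation behind Eq. (8): K forward FFTs for the image planes, K forward FFTs for the filter planes (precomputable off-line), K point-wise products summed across planes in the Fourier domain, and a single inverse FFT per image/filter pair.

```python
import numpy as np

def scores_fourier(x, y):
    """Same result as the direct evaluation of Eq. (2), computed via the FFT.
    x: (K, M, N) image feature planes, y: (K, P, Q) filter feature planes."""
    K, M, N = x.shape
    _, P, Q = y.shape
    H, W = M + P - 1, N + Q - 1                   # size of the full convolution

    X = np.fft.rfft2(x, s=(H, W))                 # K forward FFTs (once per image)
    Y = np.fft.rfft2(y[:, ::-1, ::-1], s=(H, W))  # K forward FFTs of the reversed
                                                  # filters (can be done off-line)
    Z = (X * Y).sum(axis=0)                       # K point-wise products, summed
                                                  # across planes in Fourier space
    return np.fft.irfft2(Z, s=(H, W))             # a single inverse FFT per filter
```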

3 Implementation Strategies

Implementing the convolutions with the help of the Fourier transform is straightforward,
but involves two difficulties: memory over-consumption and lack of
memory bandwidth. Those two problems can be remedied using the methods presented
in the following subsections.

3.1 Patchworks of Pyramid Scales

The computational cost analysis of Sect. 2.2 was done under the assumption
that the Fourier Transforms of the filters were already precomputed. But the
computation of the point-wise multiplications between the FT of an image and
that of a filter requires first padding them to the same size. Images can be of
various sizes and aspect-ratios, especially since images are parsed at multiple
scales, and precomputing filters at all possible sizes as in Fig. 2(a) is unrealistic
in terms of memory consumption. Another approach could be to precompute the
FTs of the images and the filters padded only to the biggest image size, as shown
in Fig. 2(b). This would require as little memory as possible for the filters, but
would result in an additional computational burden to compute the FTs of the
images, and more importantly to perform the point-wise multiplications.
However, a simpler and more efficient approach exists, combining the advantages
of both alternatives. By grouping images together in patchworks of the

Fig. 2. The computation of the point-wise multiplications between the Fourier Transform
of an image and that of a filter requires padding them to the same size before
computing the FTs. Given that images have to be parsed at multiple scales, the naive
strategy is either to store for each filter the FTs corresponding to a variety of sizes
(a), or to store only one version of the filter's FT and to pad the multiple scales of
each image (b). Both of these strategies are unsatisfactory, either in terms of memory
consumption (a) or computational cost (b). We propose instead a patchwork approach
(c), which consists of creating patchwork images composed of the multiple scales of
the image, and has the advantages of both alternatives. See Sect. 3.1 and Table 2 for
details.

size of the largest image, one needs to compute the FTs of the filters only at
that size, while the amount of padding needed is much less than required by the
second approach. We observed it experimentally to be less than 20%, vs. 87%
for the second approach. The performance thus stays competitive with the first
approach while retaining the memory footprint of the second (see Table 2 for
an asymptotic analysis). The grouping of the images does not need to be optimal,
and very fast heuristics yielding good results exist, such as the bottom-left
bin-packing heuristic [17].
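The patchwork construction can be sketched as follows (ours, in Python). For brevity it uses a simple row-by-row "shelf" placement instead of the bottom-left bin-packing heuristic of [17], and it omits the margins a real implementation would leave between scales to avoid cross-talk in the convolution responses.

```python
import numpy as np

def build_patchworks(scales, K, M, N):
    """Place pyramid scales (each a (K, h, w) array with h <= M, w <= N) into as
    few (K, M, N) patchwork canvases as possible, using a simple shelf heuristic.
    Returns the canvases and, for each scale, its (canvas, row, col) position."""
    canvases = []
    positions = [None] * len(scales)
    cur, row, col, shelf_h = None, 0, 0, 0
    # placing taller scales first tends to pack better
    for i in sorted(range(len(scales)), key=lambda i: -scales[i].shape[1]):
        k, h, w = scales[i].shape
        assert k == K and h <= M and w <= N
        if cur is None or (col + w > N and row + shelf_h + h > M):
            cur = np.zeros((K, M, N), dtype=scales[i].dtype)   # open a new canvas
            canvases.append(cur)
            row, col, shelf_h = 0, 0, 0
        elif col + w > N:                                      # open a new shelf
            row, col, shelf_h = row + shelf_h, 0, 0
        cur[:, row:row + h, col:col + w] = scales[i]
        positions[i] = (len(canvases) - 1, row, col)
        col += w
        shelf_h = max(shelf_h, h)
    return canvases, positions
```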

3.2 Taking Advantage of the Cache

A naive implementation of the main computation, that is the point-wise multiplications
between the patchworks' Fourier Transforms and the filters' FTs, would
simply loop over all patchworks and all filters. This would require reloading both
from memory for each pairwise product, as they are likely too large to all fit in
cache. We observed in practice that such an implementation is indeed memory
limited.
However, reorganizing the computation allows us to remove this bottleneck. Let
R be the total number of patchworks to process, L the number of filters, K the
number of features, M × N the size of the patchworks' FTs, u(F) the time it
takes to point-wise multiply together two planes of F coefficients, and v(F) the

Table 2. Asymptotic memory footprint and computational cost for the three approaches
described in Sect. 3.1, to process one image of size M × N with L filters,
at scales 1, λ, λ², . . . The factor 1/(1 − λ²) = Σ_{k=0}^∞ λ^{2k} accounts for the multiple
scales of the image pyramid, while log(MN)/(2 log(1/λ)) is the number of scales to
visit. Taking the same typical values as in Sect. 2.2 for M, N = 64, and λ = 0.9 gives
1/(1 − λ²) ≈ 5.3 and log(MN)/(2 log(1/λ)) ≈ 44. Our patchwork method (c) combines
the advantages of both methods (a) and (b).

Approach   Memory (image + filters)                        Computational cost
(a)        MN/(1 − λ²) + LMN/(1 − λ²)                      LMN/(1 − λ²)
(b)        MN · log(MN)/(2 log(1/λ)) + LMN                 LMN · log(MN)/(2 log(1/λ))
(c)        MN/(1 − λ²) + LMN                               LMN/(1 − λ²)


Fig. 3. To compute the point-wise products between each of the Fourier Transforms
of the R patchworks and each of the FTs of the L filters, the naive procedure loops
through every pair. This strategy unfortunately requires multiple CPU cache violations,
since the transforms are likely to be too large to all fit in cache, resulting in a slow
computation of each one of the LR products. We propose to decompose the transforms
into fragments (here shown as red rectangles), and to have an outer loop through them.
With such a strategy, by loading a total of L + R fragments in the CPU cache, we end
up computing LR point-wise products between fragments. See Sect. 3.2 for details.

time it takes to read (resp. write) F coefficients from (resp. into) the memory
to (resp. from) the CPU cache.
A naive strategy going through every patchwork / filter pair results in a total
processing time of

    T_naive = 2 KLR v(MN) [reading] + KLR u(MN) [multiplications] + LR v(MN) [writing].   (9)

This is mainly due to the bad use of the cache, which is constantly reloaded with
new data from the main memory.
We can improve this strategy by decomposing the transforms into fragments of
size F, and by adding an outer loop through these MN/F fragments (see Fig. 3).
The cache usage will be K(R + 1)F, and the time to process all patchwork /
filter pairs will become

    T_fast = (MN/F) · ( K(L + R) v(F) [reading] + KLR u(F) [multiplications] + LR v(F) [writing] )   (10)
           = K(L + R) v(MN) + KLR u(MN) + LR v(MN),   (11)

where MN/F is the number of fragments.

By making F small, we could reduce the cache usage arbitrarily. However, CPUs
are able to load from the main memory in bursts, which makes values smaller
than that burst size sub-optimal (see Fig. 4). The speed ratio between the naive
and the fast methods is

    T_naive / T_fast = [ (2 + 1/K) + u(MN)/v(MN) ] / [ ((L + R)/(LR) + 1/K) + u(MN)/v(MN) ]   (12)
                     ≈ 2 v(MN)/u(MN) + 1.   (13)

In practice, the cache can hold at least one patchwork of size M N and the actual
speedup we observe is around 5.7. Decomposing the transforms into fragments
also scales better across multiple CPU cores, as they can focus on distinct parts
of the transforms, instead of all loading the same patchwork or filter.
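The reorganized loop of this section can be sketched as follows (ours, in NumPy; NumPy only makes the loop structure explicit here, the actual cache behaviour is of course controlled in the C++ implementation): the outer loop runs over the MN/F fragments, so that for each fragment roughly L + R small blocks are loaded while L·R products are accumulated.

```python
import numpy as np

def pointwise_products(P_hat, F_hat, frag):
    """P_hat: (R, K, S) patchwork FTs and F_hat: (L, K, S) filter FTs, each plane
    flattened to S Fourier coefficients; frag: fragment size F of Sect. 3.2.
    Returns the (R, L, S) per-pair transforms, summed over the K feature planes."""
    R, K, S = P_hat.shape
    L = F_hat.shape[0]
    out = np.empty((R, L, S), dtype=P_hat.dtype)
    for start in range(0, S, frag):            # outer loop over the ~S/F fragments
        sl = slice(start, min(start + frag, S))
        p = P_hat[:, :, sl]                    # R fragments loaded once ...
        f = F_hat[:, :, sl]                    # ... plus L fragments
        # L*R fragment products, summed across the K planes; the working set is
        # roughly (L + R) * K * F coefficients
        out[:, :, sl] = np.einsum("rks,lks->rls", p, f)
    return out
```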

4 Experiments

To evaluate our approach for linear object detector acceleration, we compared it
to the publicly available system from [5]. We used the trained models already
present in the system, trained on the Pascal VOC 2007 challenge [1] dataset,
which achieve state-of-the-art detection results. Note that [5] provides several
implementations of the convolutions, ranging from the most basic to the most
heavily optimized.

Fig. 4. Average time taken by the point-wise multiplications (in seconds) for different
fragment sizes (number of coefficients) for one image of the Pascal VOC 2007 challenge

The evaluation was done over all 20 classes of the challenge by looking at the
detection time speedup with respect to the fastest baseline convolution imple-
mentation on the same machine. The baseline is written in assembly and makes
use of both CPU SIMD instructions and multi-threading. As our method is ex-
act, the average precision should stay the same up to numerical precision issues.
The results are given in Table 4 for verification purposes.
We used the FFTW (version 3.3) library [16] to compute the FFTs, and the
Eigen (version 3.0) library [18] for the remaining linear algebra. Both libraries are
very fast as they make use of the CPU SIMD instruction sets. Our experiments
show that our approach achieves a significant speedup, being more than seven
times faster (see Table 3). We compare only the time taken by the convolutions
in order to be fair to the baseline, some of its other components being written in
Matlab, while our implementation is written fully in C++. The average time taken
by the baseline implementation to convolve a feature pyramid (10 scales per octave)
with all the filters of a particular class (54 filters, most of them of size 6 × 6)
was 413 ms. The average time taken by our implementation was 56 ms, including
the forward FFTs of the images. For comparison, the time taken in our implemen-
tation to compute the HOG features (including loading and resizing the image)
was on average 64 ms, while the time taken by the distance transforms was 42 ms,
the time taken by the remaining components of the system being negligible.
We also tested the numerical precision of both approaches. The maximum absolute
difference that we observed between the baseline and a more precise implementation
(using double precision) was 9.5 × 10⁻⁷, while for our approach it
was 4.8 × 10⁻⁷. The mean absolute differences were respectively 2.4 × 10⁻⁸ and
1.8 × 10⁻⁸.
While the speed and numerical accuracy of the baseline degrade proportionally
with the filters' sizes, they remain constant with our approach, enabling one to
use bigger filters for free. For example, if one were to use filters of size 8 × 8
instead of 6 × 6 as in most of the current models, the speedup of our method
over the baseline would increase by a factor (8 · 8)/(6 · 6) ≈ 1.78, and similarly
for the numerical precision.

Table 3. Pascal VOC 2007 challenge convolution time and speedup

aero bike bird boat bottle bus car cat chair cow table
V4 (ms) 409 437 403 414 366 439 352 432 417 429 450
Ours (ms) 55 56 53 56 57 56 54 56 56 57 57
Speedup (x) 7.4 7.8 7.6 7.4 6.4 7.9 6.5 7.7 7.5 7.5 8.0

dog horse mbike person plant sheep sofa train tv mean


V4 (ms) 445 439 429 379 358 351 425 458 433 413
Ours (ms) 57 59 57 54 54 55 57 58 55 56
Speedup (x) 7.8 7.5 7.6 7.0 6.6 6.4 7.4 7.9 7.9 7.4

Table 4. Pascal VOC 2007 challenge results

aero bike bird boat bottle bus car cat chair cow table
V4 (%) 28.9 59.5 10.0 15.2 25.5 49.6 57.9 19.3 22.4 25.2 23.3
Ours (%) 29.4 58.9 10.0 13.4 25.3 50.6 57.6 18.9 22.6 24.9 24.4

dog horse mbike person plant sheep sofa train tv mean


V4 (%) 11.1 56.8 48.7 41.9 12.2 17.8 33.6 45.1 41.6 32.3
Ours (%) 11.5 56.7 47.3 42.4 13.0 19.2 34.8 46.3 40.4 32.4

5 Conclusion
The idea motivating our work is that the Fourier transform is linear, enabling
one to do the addition of the convolutions across feature planes in Fourier space,
and be left in the end with only one inverse Fourier transform to do. To take ad-
vantage of this, we proposed two additional implementation strategies, ensuring
maximum efficiency without requiring huge memory space and/or bandwidth,
and thus making the whole approach practical.
The method increases the speed of many state-of-the-art object detectors sev-
eralfold with no loss in accuracy when using small filters, and becomes even faster
and more accurate with larger ones. That such an approach is possible is not
entirely trivial (the reference implementation of [5] contains five different ways
to do the convolutions, all at least an order of magnitude slower); nevertheless,
the analysis we developed is readily applicable to many other systems.

Acknowledgments. Charles Dubout was supported by the Swiss National Sci-


ence Foundation under grant 200021-124822 VELASH, and François Fleuret
was supported in part by the European Community's 7th Framework Programme
under grant agreement 247022 MASH.

References
1. Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., Zisserman, A.: The
PASCAL Visual Object Classes Challenge 2007 (VOC 2007) (2007) Results,
http://www.pascal-network.org/challenges/VOC/voc2007/workshop/
index.html
2. Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., Zisserman, A.: The
PASCAL Visual Object Classes Challenge 2011 (VOC 2011) (2011) Results,
http://www.pascal-network.org/challenges/VOC/voc2011/workshop/
index.html
3. Felzenszwalb, P., Huttenlocher, D.: Pictorial Structures for Object Recognition.
International Journal of Computer Vision 61, 55–79 (2005)
4. Felzenszwalb, P., Girshick, R., McAllester, D., Ramanan, D.: Object Detection
with Discriminatively Trained Part-Based Models. IEEE Transactions on Pattern
Analysis and Machine Intelligence 32(9) (2010)
5. Felzenszwalb, P., Girshick, R., McAllester, D.: Discriminatively Trained De-
formable Part Models, Release 4,
http://people.cs.uchicago.edu/~pff/latent-release4/
6. Viola, P., Jones, M.: Robust Real-time Object Detection. International Journal of
Computer Vision (2001)
7. Perko, R., Leonardis, A.: Context Driven Focus of Attention for Object Detection.
In: Paletta, L., Rome, E. (eds.) WAPCV 2007. LNCS (LNAI), vol. 4840, pp. 216–233.
Springer, Heidelberg (2007)
8. Maji, S., Malik, J.: Object detection using a max-margin Hough transform. In: IEEE
Conference on Computer Vision and Pattern Recognition, pp. 1038–1045 (2009)
9. Lampert, C., Blaschko, M., Hofmann, T.: Beyond sliding windows: Object local-
ization by efficient subwindow search. In: IEEE Conference on Computer Vision
and Pattern Recognition, pp. 1–8 (2008)
10. Felzenszwalb, P., Girshick, R., McAllester, D.: Cascade object detection with de-
formable part models. In: IEEE Conference on Computer Vision and Pattern
Recognition, pp. 2241–2248 (2010)
11. Zhang, C., Viola, P.: Multiple-Instance Pruning For Learning Efficient Cascade
Detectors. In: Advances in Neural Information Processing Systems (2007)
12. Cecotti, H., Graeser, A.: Convolutional Neural Network with embedded Fourier
Transform for EEG classification. In: International Conference on Pattern Recog-
nition, pp. 1–4 (2008)
13. Dollar, P., Belongie, S., Perona, P.: The Fastest Pedestrian Detector in the West.
In: British Machine Vision Conference (2010)
14. Felzenszwalb, P., Huttenlocher, D.: Distance transforms of sampled functions.
Technical report, Cornell Computing and Information Science (2004)
15. Dalal, N., Triggs, B.: Histograms of Oriented Gradients for Human Detection. In:
IEEE Conference on Computer Vision and Pattern Recognition, pp. 886–893 (2005)
16. Frigo, M., Johnson, S.: The design and implementation of FFTW3. Proceedings of
the IEEE 93(2), 216–231 (2005)
17. Chazelle, B.: The Bottom-Left Bin-Packing Heuristic: An Efficient Implementation.
IEEE Transactions on Computers, 697–707 (1983)
18. Guennebaud, G., Jacob, B.: Eigen v3, http://eigen.tuxfamily.org
Latent Hough Transform for Object Detection

Nima Razavi¹, Juergen Gall², Pushmeet Kohli³, and Luc van Gool¹,⁴

¹ Computer Vision Laboratory, ETH Zurich
² Perceiving Systems Department, MPI for Intelligent Systems
³ Microsoft Research Cambridge
⁴ IBBT/ESAT-PSI, K.U. Leuven

Abstract. Hough transform based methods for object detection work


by allowing image features to vote for the location of the object. While
this representation allows for parts observed in different training
instances to support a single object hypothesis, it also produces false
positives by accumulating votes that are consistent in location but in-
consistent in other properties like pose, color, shape or type. In this
work, we propose to augment the Hough transform with latent variables
in order to enforce consistency among votes. To this end, only votes
that agree on the assignment of the latent variable are allowed to sup-
port a single hypothesis. For training a Latent Hough Transform (LHT)
model, we propose a learning scheme that exploits the linearity of the
Hough transform based methods. Our experiments on two datasets in-
cluding the challenging PASCAL VOC 2007 benchmark show that our
method outperforms traditional Hough transform based methods leading
to state-of-the-art performance on some categories.

1 Introduction
Object category detection from 2D images is an extremely challenging and com-
plex problem. The fact that individual instances of a category can look very
different in the image due to pose, viewpoint, scale or imaging nuisances is one
of the reasons for the difficulty of the problem. A number of techniques have been
proposed to deal with this problem by introducing invariance to such factors.
While some approaches [1–3] aim at achieving partial invariance through specific
feature representations, others [4–9] divide the object into parts, assuming less
variation within each part and thus a better invariant representation.
Codeword or voting based methods for object detection belong to the latter
category. These methods work by representing the image by a set of voting ele-
ments such as interest points, pixels, patches or segments that vote for the values
of parameters that describe the object instance. The Implicit Shape Model [4],
an important instance of the Generalized Hough Transform (GHT), represents
an object by the relative locations of specific image features with respect to a
reference point. For object detection, the image features vote for the possible
locations of this reference point. Although this representation allows for parts
from dierent training examples to support a single object hypothesis, it also
produces false positives by accumulating votes that are consistent in location


but inconsistent in other properties like pose, color, shape or type. For example,
features taken from a frontal view image of a training example and a side view
image of another training example might agree in location, but an object can
not be seen from frontal and side views at the same time. It is our understanding
that this accumulation of inconsistent votes is the main reason behind the poor
performance of the voting based approaches.
To improve the detection performance, researchers have proposed enforcement
of consistency of the votes by estimating additional parameters like aspect ra-
tio [5] or pose [10–14]. While the use of more parameters obviously improves con-
sistency, it also increases the dimensionality of the Hough space. However, Hough
transform-based methods are known to perform poorly for high-dimensional
spaces [15]. Consistency can also be enforced by grouping the training data
and voting for each group separately. Such a grouping can be defined based
on manual annotations of the objects, if available, or obtained by clustering the
training data. While this does not increase the dimensionality of the voting space,
the votes for each group can become sparse due to a limited number of train-
ing examples, which impairs the detection performance. Even if annotations are
available for the training examples, it is not clear which properties to annotate
for an optimal detection performance since the importance of properties diers
from case to case. For instance, viewpoint is important for detecting airplanes
but far less so for detecting balls.
In this work, we propose to augment the Hough transform by latent variables
to enforce consistency of the votes. This is done by only allowing votes that
agree on the values of the latent variables to support a single hypothesis. We
discriminatively learn the optimal assignments of the training data to an arbi-
trary latent space to improve object detection performance. To this end, starting
from a random assignment, the training examples are reassigned by optimizing
an objective function that approximates the average precision on the training
set. In order to make the optimization feasible, the objective function exploits
the linearity of voting approaches. Further, we extend the concept that train-
ing instances can only be assigned to a single latent value. In particular, we let
training examples assume multiple values and further allow these associations
to be weighted, i.e. modeling the uncertainty involved in assigning a training
example to them. This generalization makes the latent Hough transform robust
with respect to the number of quantized latent values and we believe that the
same is applicable when learning latent variable models in other domains.
Experiments on the Leuven cars [10] and the PASCAL VOC 2007 bench-
mark [16] show that our latent Hough transform approach significantly out-
performs the standard Hough transform. We also compare our method to other
baselines with unsupervised clustering of the training data or by manual annota-
tions. In our experiments, we empirically demonstrate the superior performance
of our latent approach over these baselines. The proposed method performs better
than the best Hough transform based methods, and it even outperforms
state-of-the-art detectors on some categories of the PASCAL VOC 2007.

2 Related Work

A number of previous works have investigated the idea of using object properties
to group training data. For instance, the training data is clustered according to
the aspect ratios of the bounding boxes in [8]. Other grouping criteria like user-
annotated silhouettes [14], viewpoints [13, 1719], height [21], and pose [11, 20]
have been considered as well. To enforce consistency, the features vote for objects
depth in addition to locations in [22], the close-by features are grouped in [23]
to vote together. In [24], two subtypes are trained for each part by mirroring the
training data. The location of each part with respect to its parent joint is also
used in [25] to train sub-types. Instead of grouping the training examples, [26]
train a model for every single positive instance in the training data. While all
of these works divide the training data into disjoint groups, [27] proposes a
generative clustering approach that allows overlapping groups.
Latent variable models have been successfully used in numerous areas of com-
puter vision and machine learning to deal with unobserved variables in train-
ing data, e.g. [28–31]. Most related to our work, [8, 32] learn a latent mixture
model for object detection by discriminatively grouping the training examples
and training a part model consisting of a root filter and a set of (hierarchical)
parts for each group. In contrast to these works, our approach is a non-parametric
voting based method where we assume the parts are given in the form of a shared
vocabulary. As has been shown in previous works [17, 33], the shared vocabu-
lary allows better generalization when learning a model with few examples as
it makes use of the training data much more effectively. Given this vocabulary,
we aim at learning latent groupings of the votes, i.e. training patches, which
lead to consistent models and improve detection accuracy. The advantage of this
approach is that we need to train the parts only once and not re-train them from
scratch as in [8, 32].
Several approaches have been proposed for training the parts vocabulary.
Generative clustering of the interest points is used in [4, 12, 34] as the codebook
whereas [5, 13] discriminatively train a codebook by optimizing the classification
and localization performance of image patches. Discriminative learning of
weights for the codebook has been addressed before as well. In [34], a weight is
learned for each entry in a Max-Margin framework. [35] introduced a kernel to
measure similarity between two images and used it as a kernel to train weights
with an SVM classifier. Although increasing the detection performance, those
works differ from our approach in that they train weights within a group.
Detecting consistent peaks in the Hough-space for localization has been the
subject of many investigations as well. While Leibe et al. [4] used a mean-shift
mode estimation for accurate peak localizations, [36] utilize an iterative pro-
cedure to demist the voting space by removing the improbable votes from
the voting space. Barinova et al. [37] pose the detection as a labeling problem
where each feature can belong to only a single hypothesis and propose an it-
erative greedy optimization by detecting the maximum scoring hypothesis and
removing its votes from the voting space.

3 Detection with the Hough Transform


In this section, we briefly describe Hough transform based object detection approaches
that learn a codebook of voting elements, which can be some image
features [4] or simply dense image patches [5]. Since our method does not require
any retraining of the codebook, we refer to the corresponding works for the details
of the codebook learning. Having such a codebook, the voting elements of an
image I_i ∈ I are extracted and matched against the codebook to cast weighted
votes V(h | I_i) for an object hypothesis h, which encodes the position and scale
of the object in the image.
For a given object hypothesis h ∈ H, the score of h is determined by the sum
of votes that support the hypothesis: S(h) = Σ_i V(h | I_i), i.e., the accumulated
weight of the votes that agree on the location and scale of the object. For
detecting multiple objects, following the probabilistic approach of [37], the maximum
scoring hypothesis is localized and its supporting votes are removed from
the voting space. This process is iterated until the desired number of objects is
detected or a confidence threshold is reached.
Since in an Implicit Shape Model [4] the votes are estimated from training
patches in a non-parametric way and the score of each hypothesis is linear in
the votes, S can be written as the sum of votes from training instances t ∈ T:

    S(h) = Σ_{t∈T} Σ_i V(h | t, I_i),   (1)

where V(h | t, I_i) denotes the votes of element I_i from training image t. Although
the votes originating from a single training example are always consistent, this
formulation accumulates the votes over all training examples even if they are
inconsistent, e.g., in pose or shape. In the next section, we show how one can
use latent variables to enforce consistency among votes.
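A minimal sketch (ours, in Python) of this scoring and greedy detection scheme; the vote-casting step (matching elements against the codebook) is abstracted into a precomputed list of votes, and all names are ours.

```python
import numpy as np

def greedy_detect(votes, shape, n_objects):
    """votes: list of (h, w, i): voting element i casts weight w for hypothesis
    cell h (an index into the flattened Hough space of the given shape).
    Implements S(h) = sum_i V(h | I_i) and the greedy scheme of [37]: repeatedly
    take the maximum-scoring hypothesis and remove the votes of its elements."""
    S = np.zeros(int(np.prod(shape)))
    by_hyp, by_elem = {}, {}
    for n, (h, w, i) in enumerate(votes):
        S[h] += w
        by_hyp.setdefault(h, []).append(n)
        by_elem.setdefault(i, []).append(n)
    alive = [True] * len(votes)
    detections = []
    for _ in range(n_objects):
        h_star = int(np.argmax(S))
        detections.append((np.unravel_index(h_star, shape), float(S[h_star])))
        # elements supporting h_star are assigned to it: all their votes are removed
        elems = {votes[n][2] for n in by_hyp.get(h_star, []) if alive[n]}
        for i in elems:
            for n in by_elem[i]:
                if alive[n]:
                    alive[n] = False
                    S[votes[n][0]] -= votes[n][1]
    return detections
```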

4 Latent Hough Transform (LHT)


In detecting objects with the latent Hough transform, we augment the hypothesis
space by a latent space Z to enforce consistency of the votes in some latent
properties z Z. The score of a hypothesis in the augmented space can be
determined as 
S(h) = max V (h, z|t, Ii ) (2)
zZ
tT i

where V (h, z|t, Ii ) are the votes of an image element Ii from training image t to
the augmented latent space H Z. For instance, Z can be quantized viewpoints
of an object. Voting in this augmented space allows only votes that are consistent
in viewpoint to support a single hypothesis.
Similar to other latent variable models [8, 29, 30, 32], one can associate each
training image t to a single latent assignment z. This association groups the
training data into |Z| disjoint groups. Note that the number of these groups is
limited by the size of training data |T |.

The grouping of the training data by latent assignments can be represented
by a binary |Z| × |T| matrix, which we denote by W and refer to as the latent
matrix. The elements of W are denoted by w_{z,t}, where w_{z,t} ∈ {0, 1} and
Σ_z w_{z,t} = 1 for all t ∈ T. Observe that every W that satisfies these constraints
defines a grouping of the training data. Given a latent matrix W, we can rewrite
the hypothesis score as

    S(h, W) = max_{z∈Z} Σ_{t∈T} w_{z,t} V(h|t),   where   V(h|t) = Σ_i V(h | t, I_i).   (3)

The term V(h|t) is the sum of the votes originating from the training example
t. This term will be very important when learning the optimal W, as we will
discuss in Sec. 5.
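A small sketch (ours, in NumPy) of Eq. (3): once the contributions V(h|t) are precomputed into a |T| × |H| matrix, scoring all hypotheses for a given latent matrix W reduces to one matrix product followed by a maximum over latent values, which is what makes the learning in Sec. 5 tractable.

```python
import numpy as np

def lht_scores(W, V):
    """W: (|Z|, |T|) latent matrix (binary, or real-valued in [0, 1] for the
    generalized assignments of Sect. 4.1); V: (|T|, |H|) precomputed vote sums
    V(h|t). Returns S(h, W) of Eq. (3) for every hypothesis h."""
    per_z = W @ V              # (|Z|, |H|): hypothesis scores for each latent value
    return per_z.max(axis=0)   # Eq. (3): maximum over the latent values z

# |Z| = 1 with an all-ones W recovers the standard Hough transform score of Eq. (1)
```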

4.1 Generalized Latent Assignments

The association of the training data to a single latent value z does not make use
of the training data effectively. We therefore generalize the latent assignments
of the training data by letting a training image assume multiple latent values.
To this end, we relax the constraints on W and allow w_{z,t} to be real-valued in
[0, 1] and non-zero for more than a single assignment z. This can be motivated
by the uncertainties of the assignments for the training examples. In particular,
the latent space can be continuous and, with an increasing quantization of Z,
two elements of Z can become very similar and thus need to share training
examples. As we show in our experiments, this generalization makes the latent
Hough transform less sensitive to the number of quantizations, i.e., |Z|.

4.2 Special Cases of the Latent Matrix

The most basic special case of the latent matrix is when |Z| = 1 and w_{z,t} = 1
for all t ∈ T, which is equivalent to the original Hough transform formulation in
Eq. (1). Splittings of the training data by manual annotations or unsupervised
clustering are other special cases of the latent matrix, where each row
of the matrix represents one cluster. For splitting the training data, we have
considered manual annotations of the viewpoints and two popular clustering
methods: agglomerative clustering of the training instances based
on their similarity, and k-means clustering of the aspect ratios of the ground truth
bounding boxes. Similar to [35, 13], we define the similarity of two training hypotheses
as the χ² distance of their occurrence histograms. Groupings of the
training data based on the similarity measure and manual view annotations are
visualized in Figs. 1(a) and 3(b) respectively. As shown in these figures, the
groupings based on annotations or clustering might be visually very meaningful.
However, as we illustrate in our experiments, visual similarity alone may not
be optimal for detection. In addition, it is not clear what similarity measure
to choose for grouping and how to quantize it. These problems underline the
importance of learning the optimal latent matrix for detection.
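For reference, a small sketch (ours, in Python/SciPy) of the χ² distance and the agglomerative similarity-clustering baseline; the ε regularizer and the "average" linkage are our choices, not necessarily those of the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def chi2_distance(p, q, eps=1e-10):
    """Symmetric chi-squared distance between two occurrence histograms."""
    return 0.5 * np.sum((p - q) ** 2 / (p + q + eps))

def similarity_clusters(histograms, n_clusters):
    """histograms: (|T|, B) occurrence histograms of the training instances.
    Returns a cluster label per training instance."""
    T = len(histograms)
    D = np.zeros((T, T))
    for a in range(T):
        for b in range(a + 1, T):
            D[a, b] = D[b, a] = chi2_distance(histograms[a], histograms[b])
    Z = linkage(squareform(D), method="average")   # agglomerative clustering
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```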

Fig. 1. (a) Visualization of the χ² similarity metric using Isomap. The two ellipses show
the clustering of the training instances of the Aeroplane category of PASCAL VOC
2007. As can be seen, the clustering splits the data into very meaningful groups. (b)
The visualization is accurate since the first two dimensions cover most of the variation.

5 Discriminative Learning of the Latent Matrix


We formulate the problem of learning the optimal latent matrix as the optimization
problem

    W* = arg max_W O(W, R),   (4)

where R = {(h, y)} denotes the set of hypotheses h and their labels y ∈ {0, 1},
and O(W, R) is the objective function. A hypothesis is assigned the label y = 1
if it is a true positive and y = 0 otherwise. For each hypothesis h, we pre-compute
the contribution of every training instance t ∈ T to that hypothesis, i.e.,
V(h|t) of Eq. (3). It is actually the linearity of the Hough transform based approaches
in Eq. (1) that allows for this pre-computation, which is essential for learning the
latent matrix W in reasonable time.
As our objective function, we use the average precision measure on the validation
set, which is calculated as

    O(W, R) = (1 / Σ_j y_j) Σ_{k: y_k=1} [ Σ_{j: y_j=1} I(h_k, h_j) / Σ_j I(h_k, h_j) ],   (5)

where, for a given latent matrix W, I(h_k, h_j) indicates whether the score of
hypothesis h_k is smaller than that of hypothesis h_j or not:

    I(h_k, h_j) = 1 if S(h_j, W) ≥ S(h_k, W), and 0 otherwise.   (6)
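Given the precomputed V(h|t) matrix, the objective of Eqs. (5)-(6) can be evaluated very cheaply; a minimal sketch (ours, in NumPy, reusing the scoring idea of Sec. 4):

```python
import numpy as np

def ap_objective(W, V, y):
    """Eq. (5): average precision of the ranking induced by S(h, W).
    V: (|T|, |H|) precomputed vote sums V(h|t), y: (|H|,) 0/1 hypothesis labels."""
    S = (W @ V).max(axis=0)            # S(h, W) for every hypothesis, Eq. (3)
    ap, n_pos = 0.0, y.sum()
    for k in np.flatnonzero(y == 1):
        at_least = S >= S[k]           # I(h_k, h_j) of Eq. (6), for all j
        ap += y[at_least].sum() / at_least.sum()
    return ap / n_pos
```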
Learning the latent matrix to optimize the detection performance is very chal-
lenging. First, the number of parameters to be learned is proportional to the
number of training instances in the codebook which is usually very large. An-
other problem is that to be faithful to the greedy optimization in [37], with
every update in the weights, one needs to run the detector on the whole valida-
tion dataset in order to measure the performance.

Algorithm 1. Interacting Simulated Annealing (ISA) [38] with cross-validation.

{R_s} ← sample(R, maxNeg, maxPos)
ε ← 0.6
for p = 1 … n do
    W_p ← initialize W at random
end for
for epoch = 1 … maxEpochs do
    {R_s} ← sample(R, maxNeg, maxPos)
    c ← getMaxPerturbations(iter, epoch)   // adaptively reduce perturbation
    for iter = 1 … maxIter do
        W ← perturb(W, c)   // perturb c elements of W at random
        for p = 1 … n do
            o_p ← O(R_s, W_p)
        end for
        β ← 20 (epoch · maxIter + iter)
        {W} ← selection({W}, {o}, β, ε)
    end for
end for

In practice, we make an approximation and deal with the detection problem
by sampling a sparse set of hypotheses from the validation set, assuming that the
positions of the detections remain the same. To this end, we run the detector on the
validation set once and collect a large number of hypotheses R. To increase the
number of positive hypotheses, we also generate new object hypotheses from the
positive training examples by mirroring and rescaling the training examples.
The objective function in Eq. (5) is non-convex and not even continuous,
and thus it is not possible to optimize it with a gradient-based approach. For
optimization, we used Interacting Simulated Annealing (ISA) [38]. ISA is a
particle-based global optimization method similar to simulated annealing.
Starting from an initial set of weights for n particles, it iteratively (i) perturbs
the weights of selected particles, (ii) evaluates the objective value for each particle,
exponentiates these values with the algorithm parameter β, and normalizes them
to create a probability distribution, and (iii) randomly selects a number of particles
using this distribution. This process is continued until a strong local optimum is
reached. The perturbation of W at each iteration is done by selecting a random
number of elements c (maximum 10) and changing their weights randomly. c is
adaptively decreased at each epoch by the factor 1/√epoch.
Algorithm 1 gives an overview of the optimization with ISA. To avoid overfitting
effects due to the large number of parameters to be estimated, we run a
cross-validation loop inside the global optimization. For cross-validation, we use
a random subset R_s of R at each epoch. In practice we have kept all the positive
examples and 5% of the negatives for training at each epoch. We have also found
well-performing solutions, in terms of detection performance, by running the optimization
for a reasonable amount of time (a couple of hours on a single machine).
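A much-simplified sketch (ours, in NumPy) of the optimization loop around Algorithm 1: a population of candidate latent matrices is perturbed, scored with the AP objective (ap_objective from the sketch above) on a resampled subset, and resampled in proportion to exponentiated scores. The constants (population size, perturbation count, annealing rate) are placeholders, not the values used in the paper.

```python
import numpy as np

def learn_latent_matrix(V, y, n_z, n_particles=50, epochs=10, iters=10, rng=None):
    """Particle-based search for a (n_z, |T|) matrix W maximizing ap_objective.
    A simplified stand-in for ISA [38]; see Algorithm 1 for the actual procedure."""
    rng = np.random.default_rng() if rng is None else rng
    T = V.shape[0]
    W = rng.random((n_particles, n_z, T))                     # random initialization
    for epoch in range(1, epochs + 1):
        sub = rng.random(len(y)) < np.where(y == 1, 1.0, 0.05)  # all pos., 5% neg.
        c = max(1, int(10 / np.sqrt(epoch)))                  # shrink perturbations
        for it in range(iters):
            for p in range(n_particles):                      # perturb c entries
                idx = rng.integers(0, n_z * T, size=c)
                W[p].flat[idx] = rng.random(c)
            scores = np.array([ap_objective(W[p], V[:, sub], y[sub])
                               for p in range(n_particles)])
            beta = 1.0 + epoch * iters + it                   # placeholder annealing
            prob = np.exp(beta * (scores - scores.max()))
            prob /= prob.sum()
            pick = rng.choice(n_particles, size=n_particles, p=prob)
            W = W[pick].copy()                                # resample particles
    best = np.argmax([ap_objective(W[p], V, y) for p in range(n_particles)])
    return W[best]
```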

Fig. 2. This figure illustrates the result of using view annotations and unsupervised
clustering for grouping the training data of the aeroplane, bicycle and sheep categories
of PASCAL VOC 2007. Groupings based on aspect ratio are shown in the first row,
similarity clustering in the second, and the manual view annotations in the third row.
Although clustering increases the performance for aeroplane, it reduces it for sheep.
Also, the AR clustering performs better than similarity clustering for aeroplane
and bicycle, yet clustering similarities leads to better results for sheep. With
four clusters, the results deteriorate in all three categories, which is due to
the insufficient number of training data per cluster.

6 Experiments
We have evaluated our latent Hough transform on two popular datasets, namely
the Leuven cars dataset [10] and the PASCAL VOC 2007 [16]. As a baseline for
our experiments, we compare our approach with the marginalization over latent
variables by voting only for locations (Marginal), unsupervised clusterings of
aspect ratios (AR clustering) and image similarities (Similarity clustering),
and the manually annotated viewpoints (View Annotations) provided for both
Leuven cars and PASCAL VOC 2007.
In all our experiments, the codebook of the ISM is trained using the Hough
forests [5] with 15 trees and the bounding boxes for a detection are estimated
using backprojection. The trees are trained up to a maximum depth of 20, such
that at least 10 occurrences remain in every leaf. For performing detection
on a test image, we have used the greedy optimization in [37]. The multi-scale
detection was performed by running detection on a dense scale pyramid with a
resizing factor of 1/⁴√2. In addition, instead of penalizing the larger hypotheses by



Fig. 3. (a) Performance comparison of our latent Hough transform with the baselines
on the Leuven cars. As can be seen in the red curve, AP is clearly increased by
manually splitting the data into 14 views. By learning 14 latent groups, the performance is
significantly improved over both baselines (magenta curve). Learning the generalized
latent matrix (green) and increasing the number of groups (cyan) improves the results
further. (b) Examples of the 14 views manually annotated in the training data.

adding a negative bias as in [37], similar to [4], we allow larger deformations by


increasing the standard deviation of the smoothing kernel proportional to the
scale. The smoothing kernel is chosen as a Gaussian with σ = 1.25 at scale one. In
every test, 40 bounding boxes are detected and the hypothesis score in Eq.(2) is
assigned as the confidence of every detection. According to [16], precision/recall
curves and average precision measure (AP) were used for the evaluations.
Prior to learning the latent groups, we have collected a set of positive and
negative detections by running the detector on the whole validation set of a
category and detecting 100 bounding boxes from each image. The bounding
boxes with more than 60% overlap with ground truth were considered as positive
and the ones with less than 30% as negatives.
Leuven cars: For the Leuven cars dataset, 1471 cropped training images of
cars are provided. The viewing angle is divided into 14 views and training images
are annotated for 7 viewpoints. The training data of the other 7 viewpoints is
obtained by mirroring the training images, creating a total of 2942 training
images annotated for 14 viewpoints. Prior to the training, all positive objects in
the training instances are rescaled to a height of 70 pixels. In addition, for
the background category, we use the clutter set of Caltech 256 [39]. For training
each tree, a third of the positive images and 200 negative images are used, and
from each of them 250 patches are randomly sampled. As the validation set for learning
the latent groupings, the Amsterdam cars dataset [10] and the Graz02 cars [40]
are used. The Leuven sequence [10] is used as the test set and the detection was
performed on 12 scales starting from 2.7.
PASCAL VOC 2007: A separate forest is trained for each category. The
training is carried out by using all the positive examples and their mirrors in
the trainval set of a category as the positive set and the images not containing
the category as the negative set. The partial view annotations are ignored for
the training. Similar to the cars, the positive training instances are cropped and
normalized to have the maximum height or width of 100 pixels. For training



Fig. 4. This figure shows the result of learning the latent matrix on three categories.
By learning the latent matrix we can consistently outperform the clustering (AR
clustering) and the Hough transform baseline (Marginal). (a) When learning 2
latent groups, there is not much benet in assigning training examples to multiple
groups. (b) However, doing so already gives a benet for learning three latent groups
as it models the uncertainty in the assignments. (c-d) Results for two other categories.
(e) Shows the comparison of the training and testing performance as a function of the
number of epochs. Since the same training data is used for creating the ISM codebook
and learning the latent matrix, the overall performance of the training is much better.
Yet, the two curves correlate well and the training shows little overfitting.

each tree, 200 training images from the positive set and 200 from the negative set
are used, and from each of them 250 patches are sampled at random. The trainval
set of a category was used as the validation set for learning the latent groupings.
The performance of the method is evaluated on the test set. The multi-scale
detection is done with 18 scales starting from 1.8.
To evaluate the benefits of the discriminative learning against unsupervised
clustering and manual view annotations, we compared the results of learning on
the Leuven cars and PASCAL VOC 2007 datasets. For a fair comparison, we
train the Hough forests [13] for a category only once and without considering
the view annotations or learned groupings. For the optimization with ISA, we
used 500 particles and the numbers of epochs and iterations were both set to 40.
Figure 3 compares the performance of the learning with our baselines on the
Leuven cars [10]. Disjoint groupings of the training data based on view anno-
tations, improves the result by 4 AP percentage points. By learning the latent
groups the performance improves by 6.8 and 2.7 points w.r.t the marginalization
and manual view annotations respectively. Learning the generalized latent ma-
trix improves the result further by about 2.5 points. By allowing more latent

VOC 2007      aero  bike  bird  boat bottle  bus   car   cat  chair  cow  table
Our Approach
HT Marginal   16.1  35.6   2.9   3.3  20.4  15.8  25.5   7.5  10.9  37.2  10.3
HT + AR       21.8  42.7  11.4  10.2  19.6  19.1  25.4   6.0   6.3  38.2   7.6
HT + View     18.85 40.0   5.3   3.0    -   18.3  28.7   6.6   4.7  34.9    -
LHT Ours      24.3  43.2  11.1  10.5  20.7  20.3  24.0   8.0  11.9  38.8  10.5
Competing Approaches
HT ISK [35]   24.6  32.1   5.0   9.7   9.2  23.3  29.1  11.3   9.1  10.9   8.1
MKL [41]      37.6  47.8  15.3  15.2  21.9  50.7  50.6  30.0  17.3  33.0  22.5
Context [42]  53.1  52.7  18.1  13.5  30.7  53.9  43.5  40.3  17.7  31.9  28.0
LSVM [8]      29.0  54.6   0.6  13.4  26.2  39.4  46.4  16.1  16.3  16.5  24.5
VOC best      26.2  40.9   9.8   9.4  21.4  39.3  43.2  24.0  12.8  14.0   9.8

VOC 2007      dog  horse mbike person plant sheep sofa  train  tv   mean AP
Our Approach
HT Marginal    3.2  28.6  34.2   4.5  19.2  21.9  10.3   9.8  43.4  18.03
HT + AR        6.1  30.5  39.0   4.9  20.5  18.4  10.8  16.9  41.3  19.83
HT + View      2.9  29.8  38.6   6.1    -   23.3  10.5  21.2  38.7  19.50
LHT Ours       4.9  33.2  39.4   8.2  21.3  22.5  10.5  17.3  44.1  21.23
Competing Approaches
HT ISK [35]   13.0  31.8  29.5  16.6   6.1   7.3  11.8  22.6  21.9  16.65
MKL [41]      21.5  51.2  45.5  23.3  12.4  23.9  28.5  45.3  48.5  31.10
Context [42]  29.5  52.9  56.6  44.2  12.6  36.2  28.7  50.5  40.7  36.77
LSVM [8]       5.0  43.6  37.8  35.0   8.8  17.3  21.6  34.0  39.0  26.26
VOC best      16.2  33.5  37.5  22.1  12.0  17.5  14.7  33.4  28.9  23.33

Table 1. Detection results on the PASCAL VOC 2007 dataset [16]. The first block
compares the performance of the Hough transform without grouping (HT Marginal),
aspect ratio clustering (HT + AR), view annotations (HT+View), and our proposed
latent Hough transform (LHT Ours). As can be seen the clustering improves the results
for 14 categories over the marginalization but reduces it for the other 6. Yet, by learning
latent groups we outperform all three baselines on most categories and perform similar
or slightly worse (red) on others. The comparison to the state-of-the-art approaches is
shown in the second block. We outperform the best previously published voting-based
approach (ISK [35]) in mAP. Our performance is competitive on many categories with
the latent part model of [8] and is state-of-the-art on two categories (green).

assignments, one can learn finer groupings of the data and increase the perfor-
mance by 10 AP points compared to the marginalization baseline.
The detection performance with the two clusterings and view annotations on
three distinct categories (aeroplane, bicycle and sheep) of the VOC07 dataset is
summarized in Fig. 2. As can be seen, although grouping training examples may
improve performance, this improvement is very much dependent on the grouping
criteria, the category and the number of training data per group. For example,
in detecting airplanes, although using two clusters improves performance, using
more clusters impairs the results. As another example, in detecting sheep, the
marginalization is clearly outperforming clustering with two clusters. The clus-
tering or the view annotations do not lead to optimal groupings, and even finding
the well-performing ones requires plenty of trial and error.
In contrast to clustering, one can discriminatively learn optimal groupings
by treating them as latent variables. Figure 4 compares the performance of the
learning with the clustering and marginalization. By learning the latent groups,
we outperform clustering baselines. In addition, the learning is not sensitive to
selecting the right number of latent groups as it, unlike clustering, shares training
examples between groups. Table 1 gives the full comparison of our latent Hough
transform method with our baselines and other competing approaches. Some
qualitative results on the VOC07 dataset are shown in Fig. 5.

7 Conclusions
In this paper, we have introduced the Latent Hough Transform (LHT) to enforce
consistency among votes that support an object hypothesis. To this end, we have

augmented the Hough space with latent variables and discriminatively learned
the optimal latent assignments of the training data for object detection. Further,
to add robustness to the number of quantizations, we have generalized the latent
variable model by allowing the training instances to have multiple weighted as-
signments and have shown how the previous grouping approaches can be cast as
special cases of our model. In the future, it would be interesting to use the latent
formulation in a more general context, e.g., learning a multi-class LHT model or
learning latent transformations of the votes for better detection accuracy.

Acknowledgements. This project has been funded in part by Toyota Motor


Corporation/Toyota Motor Europe, and the European projects RADHAR (FP7
ICT 248873) and IURO (FP7 ICT 248314).

References
1. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In:
CVPR (2005)
2. Ferrari, V., Fevrier, L., Jurie, F., Schmid, C.: Groups of adjacent contour segments
for object detection. TPAMI 30, 36–51 (2008)
3. Ojala, T., Pietikäinen, M., Mäenpää, T.: Multiresolution gray-scale and rotation invari-
ant texture classification with local binary patterns. TPAMI 24, 971–987 (2002)
4. Leibe, B., Leonardis, A., Schiele, B.: Robust object detection with interleaved
categorization and segmentation. IJCV 77, 259–289 (2008)
5. Gall, J., Yao, A., Razavi, N., Van Gool, L., Lempitsky, V.: Hough forests for object
detection, tracking, and action recognition. TPAMI 33, 2188–2202 (2011)
6. Fergus, R., Perona, P., Zisserman, A.: Object class recognition by unsupervised
scale-invariant learning. In: CVPR (2003)
7. Hoiem, D., Rother, C., Winn, J.: 3d layoutcrf for multi-view object class recognition
and segmentation. In: CVPR (2007)
8. Felzenszwalb, P., Girshick, R., McAllester, D., Ramanan, D.: Object detection with
discriminatively trained part based models. TPAMI 32, 1627–1645 (2009)
9. Bergtholdt, M., Kappes, J., Schmidt, S., Schnörr, C.: A study of parts-based object
class detection using complete graphs. IJCV 87, 93–117 (2010)
10. Leibe, B., Cornelis, N., Cornelis, K., Van Gool, L.: Dynamic 3d scene analysis from
a moving vehicle. In: CVPR (2007)
11. Seemann, E., Leibe, B., Schiele, B.: Multi-aspect detection of articulated objects.
In: CVPR (2006)
12. Seemann, E., Fritz, M., Schiele, B.: Towards robust pedestrian detection in crowded
image sequences. In: CVPR (2007)
13. Razavi, N., Gall, J., Van Gool, L.: Backprojection revisited: Scalable multi-view
object detection and similarity metrics for detections. In: ECCV (2010)
14. Marszalek, M., Schmid, C.: Accurate object localization with shape masks. In:
CVPR (2007)
15. Stephens, R.: Probabilistic approach to the hough transform. Image and Vision
Computing 9, 66–71 (1991)
16. Everingham, M., Van Gool, L., Williams, C., Winn, J., Zisserman, A.: The pascal
visual object classes (voc) challenge. IJCV 88, 303–338 (2010)
17. Torralba, A., Murphy, K.P., Freeman, W.T.: Sharing features: efficient boosting
procedures for multiclass object detection. In: CVPR (2004)

18. Thomas, A., Ferrari, V., Leibe, B., Tuytelaars, T., Schiele, B., Van Gool, L.: To-
wards multi-view object class detection. In: CVPR (2006)
19. Ozuysal, M., Lepetit, V., Fua, P.: Pose estimation for category specific multiview
object localization. In: CVPR (2009)
20. Dantone, M., Gall, J., Fanelli, G., Van Gool, L.: Real-time facial feature detection
using conditional regression forests. In: CVPR (2012)
21. Sun, M., Kohli, P., Shotton, J.: Conditional regression forests for human pose
estimation. In: CVPR (2012)
22. Sun, M., Bradski, G., Xu, B.X., Savarese, S.: Depth-encoded hough voting for
coherent object detection, pose estimation, and shape recovery. In: ECCV (2010)
23. Yarlagadda, P., Monroy, A., Ommer, B.: Voting by grouping dependent parts. In:
ECCV (2010)
24. Girshick, R.B., Felzenszwalb, P.F., McAllester, D.: Object detection with grammar
models. In: NIPS (2011)
25. Yang, Y., Ramanan, D.: Articulated pose estimation with flexible mixtures-of-
parts. In: CVPR (2011)
26. Malisiewicz, T., Gupta, A., Efros, A.A.: Ensemble of exemplar-svms for object
detection and beyond. In: ICCV (2011)
27. Torsello, A., Bulo, S., Pelillo, M.: Beyond partitions: Allowing overlapping groups
in pairwise clustering. In: ICPR (2008)
28. Hofmann, T.: Unsupervised learning by probabilistic latent semantic analysis. Ma-
chine Learning 42, 177–196 (2001)
29. Farhadi, A., Tabrizi, M., Endres, I., Forsyth, D.: A latent model of discriminative
aspect. In: ICCV (2009)
30. Wang, Y., Mori, G.: A discriminative latent model of object classes and attributes.
In: ECCV (2010)
31. Bilen, H., Namboodiri, V., Van Gool, L.: Object and action classication with
latent variables. In: BMVC (2011)
32. Zhu, L., Chen, Y., Yuille, A., Freeman, W.: Latent hierarchical structural learning
for object detection. In: CVPR (2010)
33. Razavi, N., Gall, J., Van Gool, L.: Scalable multiclass object detection. In: CVPR
(2011)
34. Maji, S., Malik, J.: Object detection using a max-margin hough transform. In:
CVPR (2009)
35. Zhang, Y., Chen, T.: Implicit shape kernel for discriminative learning of the hough
transform detector. In: BMVC (2010)
36. Woodford, O., Pham, M., Maki, A., Perbet, F., Stenger, B.: Demisting the hough
transform. In: BMVC (2011)
37. Barinova, O., Lempitsky, V., Kohli, P.: On detection of multiple object instances
using hough transforms. In: CVPR (2010)
38. Gall, J., Potthoff, J., Schnörr, C., Rosenhahn, B., Seidel, H.: Interacting and an-
nealing particle filters: Mathematics and a recipe for applications. Journal of Math-
ematical Imaging and Vision 28, 1–18 (2007)
39. Griffin, G., Holub, A., Perona, P.: Caltech-256 object category dataset. Technical
Report 7694, California Institute of Technology (2007)
40. Opelt, A., Pinz, A., Fussenegger, M., Auer, P.: Generic object recognition with
boosting. TPAMI 28, 416–431 (2006)
41. Vedaldi, A., Gulshan, V., Varma, M., Zisserman, A.: Multiple kernels for object
detection. In: ICCV (2009)
42. Song, Z., Chen, Q., Huang, Z., Hua, Y., Yan, S.: Contextualizing object detection
and classication. In: CVPR (2011)

Fig. 5. Some qualitative results on the test set of the PASCAL VOC 2007 database for
the aeroplane, sheep, bicycle, bottle, potted plant and bird categories. Ground-truth
bounding boxes are in blue, correctly detected boxes in green and false positives in red.
Using Linking Features in Learning
Non-parametric Part Models

Leonid Karlinsky and Shimon Ullman

Weizmann Institute of Science,
234 Herzl Street, Rehovot 76100, Israel
{leonid.karlinsky,shimon.ullman}@weizmann.ac.il
http://www.weizmann.ac.il

Abstract. We present an approach to the detection of parts of highly


deformable objects, such as the human body. Instead of using kinematic
constraints on relative angles used by most existing approaches for mod-
eling part-to-part relations, we learn and use special observed linking
features that support particular pairwise part configurations. In addi-
tion to modeling the appearance of individual parts, the current ap-
proach adds modeling of the appearance of part-linking, which is shown
to provide useful information. For example, configurations of the lower
and upper arms are supported by observing corresponding appearances
of the elbow or other relevant features. The proposed model combines
the support from all the linking features observed in a test image to in-
fer the most likely joint configuration of all the parts of interest. The
approach is trained using images with annotated parts, but no a-priori
known part connections or connection parameters are assumed, and the
linking features are discovered automatically during training. We evalu-
ate the performance of the proposed approach on two challenging human
body parts detection datasets, and obtain performance comparable, and
in some cases superior, to the state-of-the-art. In addition, the generality
of the approach is shown by applying it without modification to part detection
on datasets of animal parts and of facial fiducial points.

Keywords: linking features, parts detection, FLPM model.

1 Introduction
In this paper, we present a method for detecting parts of highly deformable
objects. In these objects the parts configuration is flexible, and therefore the
detection of the correct configuration is a difficult problem. As a central applica-
tion we consider the detection of human body parts, namely: head, torso, lower
and upper arms (figure 1). Recently, several approaches have been proposed to
tackle this problem [1–10]. The central pipeline in all of these approaches can be
roughly summarized as: input image → extraction of image features → estimat-
ing part posteriors using dedicated part detectors → extracting the most likely
parts configuration by utilizing the human body kinematic constraints. For the
last step of the pipeline, prior parametric models of the human body are used,


Fig. 1. Examples of detected body parts on the Buffy (left) and PASCAL stickmen
(right) datasets. The detected parts are annotated using color-coded sticks.

which are based on the pictorial-structure (PS) model [1], with either standard
tree-connectivity of the body parts [1–4, 8–10] (tree edges representing joints
with constrained range of motion), or with extra non-tree edges for additional
spatial constraints between physically unconnected parts [5–7, 11]. Notably, [10]
use more parts in their tree PS model, allowing better treatment of non-rigid
part deformations, like foreshortening of limbs, leading to better performance. In
addition, [12] handle limb foreshortening in human pose estimation from video
by employing stretchable models which model joint locations instead of limbs.
In the standard PS model, the pairwise constraints between the related parts
are purely kinematic (e.g. a distribution of allowed relative angles) and do not
depend on the query image. Recently, several approaches [8, 9] introduced some
image dependence into the pairwise kinematic constraints of the PS model. In
[8] the pairwise kinematic constraints are adaptive, and are derived from a sub-
set of training images that are similar (in appearance) to the query image. In
[9], connected parts are required to follow common segmentation boundaries,
but only in the lower levels of the cascade, while in the higher cascade levels
kinematic constraints are used. In addition, [13] uses the detections of poselet
features to constrain the relative configuration between the parts. This state-of-
the-art person detector [13] focuses on the detection of entire persons and their
segmentation boundaries, and does not detect the arms and their sub-parts. Fi-
nally, a related idea of using chains of features to reach the part being detected
has appeared in [14]; however, [14] detects only a single location on a part of
interest (e.g. a hand), and its probabilistic model does not represent parts ex-
plicitly, leading to reduced detection performance compared with the proposed
scheme (section 3).
In this paper, we propose a new method for parts detection that takes a dif-
ferent modeling approach. The approach follows the above pipeline of choosing
the best configuration of detected part candidates. However, instead of estimat-
ing the quality of the relative configuration of part candidates using kinematic


Fig. 2. (a) The learned associations between parts. An arrow from part A to part B
shows how much part A affected (via the linking features) the detection of part B,
relative to all parts that affected part B. The incoming arrows sum to 100% (only
arrows with at least 15% are shown); (b) Since no a-priori skeleton model is assumed
(i.e. the object skeleton is not constrained by the model), the same approach can be
directly applied to other objects (e.g. ostrich); (c) A graphical illustration of the FLPM
model. The dotted orange and the solid blue rectangles are the features plate and the
query images plate, respectively. Only the $F_i^n$ variables are observed.

pairwise PS model constraints (either general or image-dependent), we propose


a simple generative non-parametric Feature-Linked Parts Model (FLPM) that
connects the part candidates through linking features observed in the query
image. Roughly speaking, the score of the relative configuration of two part
candidates in a test image is measured by the cumulative score of the detected
linking features. These features are patch descriptors learned, during training, to
support this particular configuration. For example, to detect a particular bent-
arm configuration, in addition to knowing that the relation between upper and
lower arm parts is kinematically possible, it is useful to verify the appearance
and location of features, such as the elbow, common to this arm configuration.
In other words, the elbow appearance and possibly other image features com-
mon to this bent arm, link the two parts of the arm together in this particular
configuration.
As opposed to the kinematic models, no a-priori known model for the parts
connectivity is assumed. The model implicitly adapts the parts connectivity
pattern to each query image using the detected linking features associated with
the different part-to-part connections. The adaptive nature of the connectivity
pattern can also use image features that support relations between two kine-
matically unrelated parts. For example, crossed hands or hand over torso have
some informative local image features associated with them that can be detected
and used by the FLPM model. Figure 2a shows the average connectivity pattern
obtained by averaging all of the linking-features based pairwise configuration
scores from all the test images of the human body parts detection experiments
(section 3).

[Fig. 3 image: panels (a)–(d); part labels head, torso, tla, tra, bla, bra]

Fig. 3. Main stages of the proposed approach. (a) SIFT descriptor features are ex-
tracted on a regular grid (section 2.2); (b) Each part is voted for by the features (top
shows a voting map for bottom left arm), and features are then clustered (section 2.4)
to get candidates for the part (illustrated by lines on the bottom image); (c) Features
vote for connections between parts, and linking features are identified to form the
FLPM model for the test image; the example shows two automatically discovered sets
of linking features (orange dots) for the lower-upper right arm and the lower-upper
left arm connections, respectively (section 2.1); (d) The final result is obtained through
approximate inference on the obtained FLPM model (section 2.1). Best viewed in color.

The experimental results show that without known kinematic constraints, and
using only simple SIFT descriptors computed on grayscale images as features
(without more complex features such as color and segmentation contours used
in [8, 9]), the approach achieves results comparable, and in some cases superior, to
the state-of-the-art on two challenging upper body datasets, Buffy [3] and PASCAL-
stickmen [3]. In addition, since no a-priori known connectivity between the parts
is assumed, the same approach can be successfully applied (without modification)
to other domains. This is illustrated by applying the approach to facial fiducial
points detection on the AR dataset [15, 16] (figure 5), and applying it to ostrich
parts detection (figure 2b).
The rest of the paper is organized as follows. Section 2 provides the details
of the proposed approach; section 3 describes the experimental evaluation; sum-
mary and discussion are provided in section 4.

2 Method
This section gives a brief overview of our approach; further details are provided
in the subsections below.
The method starts with two standard pre-processing stages (figure 3a): (i)
Extracting the region of interest (bounding box) around the object (e.g. hu-
man) whose parts are to be detected. It is a standard practice in pose estima-
tion approaches, e.g. [3, 4, 8, 9], to assume that the object is already detected
and roughly localized; and (ii) Computing image features (patch descriptors)
centered on dense regular grid locations inside the region of interest. For the
feature extraction stage, the step size of the grid and the size of the patches for
which the descriptors are computed are determined by the scale of the region of
interest, assuming the scale is roughly determined by the object detector. The
details of the pre-processing are described in section 2.2.

After the pre-processing, the method proceeds in three main stages. Roughly,
the first focuses on individual parts (figure 3b), the second on pairwise connec-
tivity between part candidates (figure 3c), and the third on extracting the most
likely global parts configuration (figure 3d). First, a set of candidate locations,
orientations and sizes, are detected for each part. The detection of candidates
is explained in section 2.4. Second, parameters of pairwise connectivity between
part candidates are estimated, using non-parametric probability computations.
At this stage, each configuration formed by a pair of part-candidates (e.g. relative
position of two sticks), is scored by using the linking features (observed in the
test image) that support this particular configuration. These linking features are
patch descriptors that were associated with this configuration during training.
The combined set of estimated connectivity parameters forms the FLPM model.
Third, the most consistent subset of candidate parts is computed by a greedy
approximate MAP inference over the estimated FLPM. The FLPM model, the
computation of its parameters, and the approximate inference procedure for it
are detailed in section 2.1.
The training phase of the method receives a set of images with annotated
parts (e.g. stick annotation in upper body experiments). As the method is
non-parametric, the training only involves building efficient data structures for
similar descriptor search and memorizing a set of parameters for each feature,
e.g. the relative offset from different parts. These data structures are later used in
order to compute the model probabilities for test images using Kernel Density
Estimation (KDE) (section 2.3).

2.1 Feature-Linked Parts Model (FLPM)


Model Definition. The FLPM is a generative model which consists of the following
random variables. $\{L_j^n\}_{j=1}^m$ are hidden variables describing the location,
scale and orientation of the parts being detected in a test image $I_n$. In case the
parts we are detecting are sticks (as in the Buffy and PASCAL-stickmen datasets),
the location is the $(x, y)$ image location of the center of the stick, the scale is
the length of the stick, and the orientation is its 0–180 degrees angle. We also
add an additional part denoted "full object", which is a union of all the parts,
and its corresponding $L_j^n$ has only a location (of the center of mass of the object)
and no orientation and scale. During the first step (section 2.4) we compute a
discrete pool of candidates for each part, and $L_j^n$ takes its possible values from
the corresponding pool for the $j$-th part. $\{F_i^n\}_{i=1}^{k_n}$ are observed variables for
the $k_n$ features collected from the region of interest (around the object) in the
test image $I_n$. In our implementation, the $F_i^n$ are the SIFT descriptors [17, 18] col-
lected during the pre-processing stage (section 2.2) together with their image
locations. The model includes two additional sets of auxiliary hidden random
variables, which reside in the feature plate (i.e. two additional unobserved vari-
ables for each feature). These variables are the pair-of-parts choices $\{R_i^n\}_{i=1}^{k_n}$ and
the linking feature indicators $\{A_i^n\}_{i=1}^{k_n}$. The exact meaning of these additional vari-
ables is explained below the definition of the joint distribution. The FLPM
joint distribution is defined as:

     
$$P\left(\{L_j^n\}_{j=1}^m, \{F_i^n, R_i^n, A_i^n\}_{i=1}^{k_n}\right) = P\left(\{L_j^n\}_{j=1}^m\right) \prod_{i=1}^{k_n} P(A_i^n)\, P\left(R_i^n \mid \{L_j^n\}_{j=1}^m\right) P\left(F_i^n \mid A_i^n, R_i^n, \{L_j^n\}_{j=1}^m\right) \qquad (1)$$

A graphical illustration of the model in plate notation is given in figure 2c.


The components of the model are explained below and the way to empirically
estimate the necessary probabilities from the training data is explained in section
2.3.
We do not impose any specific prior on the parts pose; therefore, the joint
$P(\{L_j^n\}_{j=1}^m)$ is assumed uniform. The $A_i^n$ are binary hidden variables, one for
each feature, identifying the linking features in the image:
$$P\left(F_i^n \mid A_i^n, R_i^n, \{L_j^n\}_{j=1}^m\right) = \begin{cases} P\left(F_i^n \mid R_i^n, \{L_j^n\}_{j=1}^m\right) & A_i^n = 1 \\ P(F_i^n) & A_i^n = 0 \end{cases} \qquad (2)$$

Here $P(F_i^n)$ is the probability of spontaneously observing the feature $F_i^n$ any-
where inside the region of interest, regardless of the parts configuration. During
the inference, the $A_i^n = 0$ option provides a more robust way of ignoring some
of the features that are either outside the object or are not potential linking
features. The $R_i^n$ are pair-of-parts hidden variables taking values from the set of
part pairs: $R_i^n = (p, q) \in \{(p, q) \mid 1 \le p \ne q \le m\}$. The $R_i^n$ provides the means to
associate the feature $F_i^n$ with a pair of parts from $\{L_j^n\}_{j=1}^m$. Roughly, $R_i^n = (p, q)$
means that the feature $F_i^n$ connects parts $p$ and $q$. The corresponding conditional
is defined as follows:
$$P\left(F_i^n \mid R_i^n = (p, q), \{L_j^n\}_{j=1}^m\right) = P\left(F_i^n \mid L_p^n, L_q^n\right) \qquad (3)$$

The conditional probability of $R_i^n = (p, q)$ depends only on the assignments to
$L_p^n$ and $L_q^n$ and is independent of the feature index $i$:
$$P\left(R_i^n = (p, q) \mid \{L_j^n\}_{j=1}^m\right) = \pi_{pq}\left(L_p^n, L_q^n\right) \qquad (4)$$
See section 2.3 for the way to compute $P(F_i^n \mid L_p^n, L_q^n)$ and $\pi_{pq}(L_p^n, L_q^n)$ from
the training data. The $\pi_{pq}$ can also contain any external prior on the pairwise con-
figuration of the parts, such as an articulation prior, and can be used to combine
the proposed approach with other methods.
Inference. The MAP inference in the FLPM model involves the computation of
the arg max, over part candidate choices $\{L_j^n\}_{j=1}^m$, of the following log-posterior:

$$\{L_j^n\}_{j=1}^{m,\mathrm{MAP}} = \arg\max_{\{L_j^n\}_{j=1}^m} \log P\left(\{L_j^n\}_{j=1}^m \mid \{F_i^n\}_{i=1}^{k_n}\right) \qquad (5)$$

It is shown in Appendix A of the supplementary material that, assuming that
$P(A_i^n)$ is such that $P(A_i^n = 1) \ll P(A_i^n = 0)$, reflecting the fact that most of

the features are not potential linking features (or belong to the background), we
can approximate the posterior as:
$$\{L_j^n\}_{j=1}^{m,\mathrm{MAP}} \approx \arg\max_{\{L_j^n\}_{j=1}^m} \sum_{p,q=1}^{m} \pi_{pq}\left(L_p^n, L_q^n\right) \sum_{i=1}^{k_n} \frac{P(F_i^n \mid L_p^n, L_q^n)}{P(F_i^n)} \qquad (6)$$

Note that although the posterior (6) does not depend on the $\{A_i^n\}$, these auxiliary
variables are necessary in order to derive it and make it more robust.
For a given test image $I_n$, the pre-processing (sec. 2.2) produces a set of $k_n$
features $\{F_i^n\}_{i=1}^{k_n}$, and the part candidates stage (sec. 2.4) produces a set of part
candidates $S_j$ for each of the $m$ parts (such that for every $j$, $L_j^n \in S_j$). Assume
for brevity that for every $j$: $|S_j| = r$, i.e. we have the same number of candidates
for each part. In order to perform MAP inference, we compute two tables
$W_R$ and $W_F$, both of size $m \times m \times r \times r$, each entry of which corresponds to a
choice of $1 \le p, q \le m$ ($m \times m$ options) and a choice of assignment to $(L_p^n, L_q^n)$
($r \times r$ options). The $W_R(p, q, r_1, r_2) = \pi_{pq}(L_p^n = S_p\{r_1\}, L_q^n = S_q\{r_2\})$ corre-
sponds to the pairs-of-parts prior; here $S_p\{r_1\}$ means the $r_1$-th element of the set
$S_p$ of part $p$ candidates. The $W_F(p, q, r_1, r_2) = \sum_{i=1}^{k_n} \frac{P(F_i^n \mid L_p^n = S_p\{r_1\}, L_q^n = S_q\{r_2\})}{P(F_i^n)}$
corresponds to the feature-based relative configuration prior. Note that the likeli-
hood ratio inside the sum is higher for features that are more strongly associated
with the considered relative part configuration. Finally, we multiply these two
tables entry-wise (denoted $\odot$) to form $W = W_R \odot W_F$, which is used for MAP
inference over the FLPM posterior. According to equation (6), we need to find an
optimal choice of indices $\{r_1^*, r_2^*, \ldots, r_m^*\}$ such that $\sum_{p=1}^m \sum_{q=1}^m W(p, q, r_p^*, r_q^*)$ is
maximal. In our experiments, we have tested two greedy approximate inference
approaches to the choice of these indices. In the first approach, denoted FLPM-
sum, we start from the most likely candidate of the auxiliary full object part
and then greedily select parts and their optimal candidates having the maximal sum
of weights (entries in $W$) to the already selected parts:

  
$$\left(q^{*}, r_q^{*}\right) \leftarrow \arg\max_{q, r_q} \sum_{p \in \mathrm{selected}} W\left(p, q, r_p^{*}, r_q\right) \qquad (7)$$

In the second approach, denoted FLPM-max, the next part and its optimal
candidate are greedily chosen using only the maximal weight to one of the already
selected candidates:
$$\left(q^{*}, r_q^{*}\right) \leftarrow \arg\max_{q, r_q} \left[\max_{p \in \mathrm{selected}} W\left(p, q, r_p^{*}, r_q\right)\right] \qquad (8)$$

Experiments have shown that FLPM-sum produces marginally better results


than FLPM-max and it was eventually the one we used for the FLPM inference.
This greedy inference procedure was selected based on its performance com-
pared with alternatives. In particular, in our experiments, an optimization (for

maximizing $\sum_{p=1}^{m} \sum_{q=1}^{m} W(p, q, r_p, r_q)$) based on a Quadratic Integer Program-
ming (QIP) approximation produced results slightly inferior to the greedy scheme
that we have eventually used.
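As an illustration of the greedy FLPM-sum rule of equation (7), the following is a minimal NumPy sketch of inference over a precomputed score table W; the function and variable names are ours, not the authors', and the handling of the auxiliary full-object part is an assumption of the sketch.

```python
import numpy as np

def flpm_sum_inference(W, full_object_idx, full_object_scores):
    """Greedy FLPM-sum inference sketch (illustrative, not the authors' code).

    W: array of shape (m, m, r, r); W[p, q, rp, rq] is the combined pairwise
       score of choosing candidate rp for part p and candidate rq for part q.
    full_object_idx: index of the auxiliary "full object" part.
    full_object_scores: length-r array used to pick its starting candidate.
    Returns a dict mapping part index -> chosen candidate index.
    """
    m, _, r, _ = W.shape
    chosen = {full_object_idx: int(np.argmax(full_object_scores))}
    remaining = set(range(m)) - set(chosen)

    while remaining:
        best = None  # (score, part q, candidate rq)
        for q in remaining:
            # Sum of weights from all already-selected (part, candidate) pairs
            # to every candidate of part q, as in the FLPM-sum rule (eq. 7).
            scores = np.zeros(r)
            for p, rp in chosen.items():
                scores += W[p, q, rp, :]
            rq = int(np.argmax(scores))
            if best is None or scores[rq] > best[0]:
                best = (scores[rq], q, rq)
        _, q, rq = best
        chosen[q] = rq
        remaining.remove(q)
    return chosen
```

Replacing the inner sum over the already selected parts with a maximum gives the FLPM-max variant of equation (8).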

2.2 Pre-processing and Feature Extraction


Given a training or test image In , and a bounding box around the object on that
image, we extract the region of interest (ROI) by scaling the bounding box to a
common 150 pixel height (in our implementation) and enlarging it by 20 pixels
on every side. Exactly as in [3, 4, 8, 9], for the tested human body datasets the
bounding boxes were automatically estimated from bounding boxes provided by
[3]. The features $\{F_i^n\}_{i=1}^{k_n}$ extracted from the above ROI are SIFT descriptors
computed for 20 × 20 pixel patches centered on a regular grid with a 3 pixel
step size.
The method parameters mentioned here and in subsequent sections 2.3 and
2.4 were set once (based on previous experiments with similar methods) and
not further optimized; it is possible that optimization of the parameters for the
current method could lead to even better performance.
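For concreteness, a dense-grid feature extraction along the lines of this section might look as follows; this is only a sketch using OpenCV's SIFT, which is not necessarily the descriptor implementation used by the authors, and the replicated-border padding is an assumption of the sketch.

```python
import cv2
import numpy as np

def extract_dense_sift(image_bgr, bbox, roi_height=150, pad=20,
                       patch_size=20, step=3):
    """Dense-grid SIFT extraction in the spirit of section 2.2 (a sketch)."""
    x, y, w, h = bbox
    roi = image_bgr[y:y + h, x:x + w]
    scale = roi_height / float(h)
    roi = cv2.resize(roi, (int(round(w * scale)), roi_height))
    roi = cv2.copyMakeBorder(roi, pad, pad, pad, pad, cv2.BORDER_REPLICATE)
    gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)

    # Keypoints on a regular grid; the keypoint "size" controls the patch
    # that SIFT describes (here roughly patch_size x patch_size pixels).
    ys, xs = np.mgrid[pad:gray.shape[0] - pad:step,
                      pad:gray.shape[1] - pad:step]
    keypoints = [cv2.KeyPoint(float(px), float(py), float(patch_size))
                 for py, px in zip(ys.ravel(), xs.ravel())]
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.compute(gray, keypoints)
    locations = np.array([kp.pt for kp in keypoints])
    return locations, descriptors
```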

2.3 Empirically Estimating FLPM Probabilities from the Training Images
The training phase of FLPM receives a set of images annotated with (body)
parts on objects of interest. In our human upper-body experiments we used
sticks annotation for parts [3], where each part is marked by a stick (two end-
points). No a-priori known connectivity pattern between the parts was assumed
by our method. In order to obtain features associated with a specific part on
each training image, we enlarge the part's stick in orthogonal directions to form
a rectangle. Training features associated with the j-th part are collected from
all the grid points (see section 2.2) residing inside the j-th part rectangles in
all the training images. For each part, we construct an efficient Approximate
Nearest Neighbors search data-structure (ANN) that allows efficiently searching
for (approximate) neighbors of descriptors of test features among the descriptors
of the training features. We use the ANN implementation of [19] in our experiments.
Denote by $ANN_j$ the ANN containing the training features associated with the
j-th part, and by $ANN_{all}$ the ANN containing all the collected training features.
Approximate Kernel Density Estimation (KDE) [20] over neighbors returned by
querying the respective ANNs is used to compute all the feature-related proba-
bilities, namely: $P(F_i^n)$ - the probability of $F_i^n$ appearing spontaneously inside the
object ROI; $P_j(F_i^n)$ - the probability of $F_i^n$ appearing somewhere on the j-th part
(see section 2.4); $P_j(F_i^n \mid L_j^n)$ - the probability of $F_i^n$ appearing in its particular
location on the j-th part (see section 2.4); and $P(F_i^n \mid L_j^n, L_k^n)$ - the probability of $F_i^n$
linking the j-th and the k-th parts.
Given a test feature descriptor $F_i^n$, the probabilities $P(F_i^n)$ and $P_j(F_i^n)$ are
obtained by querying $q = 25$ neighbors of $F_i^n$ in $ANN_{all}$ and $ANN_j$, respec-
tively. The probabilities are then computed by KDE over these neighbors, as an
 
average (over the $q$ neighbors) of $\exp(-0.5\, d_r^2 / \sigma^2)$, where $d_r$ is the distance to
the $r$-th neighbor ($1 \le r \le q$), and $\sigma = 0.2$. We assume that $P_j(L_j^n)$ and
$P(L_j^n, L_k^n)$ are uniform. Then $P_j(F_i^n \mid L_j^n) \propto P_j(F_i^n, L_j^n)$ is obtained using
the neighbors of $F_i^n$ returned by the $ANN_j$ query. It is computed as an average of
$\exp(-0.5\,[\,d_r^2/\sigma^2 + \|o^j - o_r^j\|^2/\sigma_L^2\,])$, where $\sigma_L = 15$ pixels, $o^j$ is the offset be-
tween $F_i^n$'s location (around which the descriptor was computed) and $L_j^n$'s center
location, and $o_r^j$ is the offset between the location of the $r$-th neighbor and the
center of the $j$-th part in the $r$-th neighbor's source image. The $P(F_i^n \mid L_j^n, L_k^n) \propto P(F_i^n, L_j^n, L_k^n)$
is computed as $0.5\,[\,P_j(F_i^n, L_j^n, L_k^n) + P_k(F_i^n, L_j^n, L_k^n)\,]$, where
$P_j(F_i^n, L_j^n, L_k^n)$ (and similarly $P_k(F_i^n, L_j^n, L_k^n)$) is obtained by KDE over the
neighbors of $F_i^n$ returned by the $ANN_j$ query. The $P_j(F_i^n, L_j^n, L_k^n)$ is an aver-
age over $r$ of:

$$\exp\!\left(-0.5\left[\,d_r^2/\sigma^2 + \left(\|o^j - o_r^j\|^2 + \|o^k - o_r^k\|^2\right)/\sigma_L^2 + \left((\theta^j - \theta_r^j)^2 + (\theta^k - \theta_r^k)^2\right)/\sigma_\theta^2\,\right]\right) \qquad (9)$$

where $d_r$, $o^j$, $o^k$, $o_r^j$, and $o_r^k$ are as above; $\theta^j$ and $\theta^k$ are the orientations of
$L_j^n$ and $L_k^n$, and $\theta_r^j$ and $\theta_r^k$ are the orientations of the $j$-th and $k$-th parts,
respectively, in the $r$-th neighbor's source image; we use $\sigma_\theta = 20$ degrees. In
addition, we restrict the above $P_j(F_i^n, L_j^n, L_k^n)$ computation only to neighbors
which were identified as candidate linking features during training. Those are
the features, collected from training images, that lie in the intersections of the
$j$-th and $k$-th part training rectangles enlarged by 10 pixels. This increases the
importance of more relevant mutual features for candidate part consistency
weighting (e.g. elbows for lower and upper arms).
Finally, we define $P(R_i^n = (p, q) \mid \{L_j^n\}_{j=1}^m) = \pi_{pq}(L_p^n, L_q^n)$ to be uniform,
thus providing no prior on the relative configuration of the parts. The $\pi_{pq}(L_p^n, L_q^n)$
can also incorporate the part candidate scores produced by the star model used
to generate the candidates (section 2.4). However, in our experiments using part
candidate scores did not result in a performance increase, due to the obvious dif-
ficulty of detecting parts independently of the other parts. In addition, any
kinematic prior on the pairwise relative configuration of the parts that is used
by the existing approaches may be incorporated into $P(R_i^n = (p, q) \mid \{L_j^n\}_{j=1}^m)$
by substituting it with the prior value.
by substituting it with the prior value.

2.4 Generating Part Candidates


In this section we describe how the part candidates are generated for a given
ROI in a test image In . The hidden variable Lnj of the FLPM takes values from
the candidates detected for the j-th part. The detection of candidates for the
j-th part proceeds in three steps (same steps are repeated independently for each
part to generate part candidates for all the parts).
First, all the collected features $\{F_i^n\}_{i=1}^{k_n}$ vote for the j-th part center of mass
location using the non-parametric star model. This star model was used as one

of the baselines in [14] and is described in more detail there. Briefly, we ac-
cumulate votes from all the features in a matrix $V_j$ with size equal to
the size of $I_n$, while each feature votes for 25 candidate locations of the part,
obtained from its 25 approximate nearest neighbors, with voting weight equal to
$P_j(F_i^n \mid L_j^n)/P_j(F_i^n)$ (section 2.3). Each vote is smoothed by a Gaussian with
an STD of 15 pixels.
Second, up to $t \le 100$ strongest local maxima $\{m_s\}_{s=1}^{t}$ are collected from $V_j$
(with the weakest one $m_t$ having at least 5% of the score of the global maximum
$m_1$ of $V_j$: $m_t \ge 0.05\, m_1$). Then for each feature $F_i^n$ we compute a $t$-sized vector $v_i$, each
entry of which contains the amount of weight it contributed to each of the chosen
$t$ maxima $\{m_s\}_{s=1}^{t}$. The features $\{F_i^n\}_{i=1}^{k_n}$ are then clustered into 20 clusters using
spectral clustering [21], where the association between two features $F_i^n$ and $F_l^n$
is computed as an inner product between $v_i$ and $v_l$. The reason for this clustering
is that for the more ambiguous parts such as lower or upper arms, small local
features along the boundary of the part have a wider range of possible part
center locations relative to them (and they essentially vote for many of those).
Third, for each feature cluster the voting (as in the first step) is repeated using
only features belonging to the cluster. The location of the maximal accumulated
vote for each cluster is taken as the center location for the corresponding part
candidate. Then the features of each cluster vote for the top three candidate
orientations and top one length estimate of the part. The weights for orientation
and length voting are the same as in the first step. This way, out of 20 feature
clusters, 60 candidates are produced for each of the m parts.
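A minimal sketch of the first voting step (accumulating the voting map $V_j$ and collecting its local maxima) is given below; the neighbor offsets and weights are assumed to be precomputed as in section 2.3, the names are illustrative, and the clustering stage is omitted.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def vote_part_centers(feature_locs, neighbor_offsets, neighbor_weights,
                      roi_shape, smooth_std=15.0, max_peaks=100, min_frac=0.05):
    """Sketch of the voting step of section 2.4: each feature casts weighted
    votes for part-center locations predicted by its nearest training neighbors."""
    votes = np.zeros(roi_shape, dtype=np.float64)
    for loc, offsets, weights in zip(feature_locs, neighbor_offsets, neighbor_weights):
        centers = np.round(loc[None, :] + offsets).astype(int)  # (25, 2) as (y, x)
        for (cy, cx), w in zip(centers, weights):
            if 0 <= cy < roi_shape[0] and 0 <= cx < roi_shape[1]:
                votes[cy, cx] += w
    votes = gaussian_filter(votes, sigma=smooth_std)

    # Keep up to max_peaks local maxima whose score is at least min_frac of
    # the global maximum, as candidate part-center locations.
    local_max = (votes == maximum_filter(votes, size=5)) & (votes > 0)
    peaks = np.argwhere(local_max)
    scores = votes[local_max]
    order = np.argsort(scores)[::-1][:max_peaks]
    keep = scores[order] >= min_frac * scores[order[0]]
    return peaks[order][keep], scores[order][keep]
```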
After the computation of part candidates, the computation proceeds as ex-
plained in the description of the inference procedure in section 2.1. The next
section describes the experimental evaluation of the proposed approach.

3 Results
To test the proposed approach we applied it to two challenging human body
parts datasets proposed by [3], namely Buffy (version 2.1) and PASCAL stick-
men (version 1.1). Moreover, in order to test the generality of the method, we
also applied it (without modification) to the AR facial fiducial points detection
dataset [15], and an ostrich parts detection dataset (see below). Table 4a sum-
marizes the numerical results and state-of-the-art comparisons. The detections
are assumed correct if they meet the standard $PCP_{0.5}$ criterion [3], where both
endpoints of the detected part should be within 0.5 of the ground-truth part length
from the ground-truth part endpoints (the full PCP curves are given in figure
5a). The results show that the FLPM produces comparable performance to the
state-of-the-art approaches, improving the best result on the PASCAL stickmen
v1.1 dataset. Note that FLPM is trained without knowing the kinematics of the
object (connections and allowed relative angles between the parts) that is used
by all the other methods [3, 8, 9], and using only grayscale images and simple
SIFT descriptor features (as opposed to complex features involving color and
segmentation boundaries used by [8, 9]). Figure 2a shows the average connectiv-
ity pattern discovered by the FLPM. This pattern is obtained by averaging all


Fig. 4. (a) Summary of Buffy and PASCAL stickmen results. Our results are in blue.
The candidate-recall test computes the maximum recall rates (disregarding the score)
of the individual part detectors (providing an upper bound on the potential performance).
To the best of our knowledge, [8] tested on the Buffy dataset version 1.0 (we test on
v2.1); (b) Baseline comparison that tests the contribution of the core component of the
model - the linking features. The comparison shows the prominent contribution of the
linking features to the FLPM performance; (c) More examples of successfully detected
parts on the tested datasets, illustrating a range of poses and appearances.

the entries of W (defined in section 2.1) corresponding to the part candidates


selected during the inference stage for each of the test images. We also tested the
maximal recall of the part candidates for the individual parts. In this experiment,
denoted candidate-recall, we computed, for each part, the percentage of the test
images on which at least one of the part candidates was correct (P CP0.5 ). The
results of candidate-recall provide an upper-bound for FLPM performance and
show that the correct part detection is included in the candidate set with high
probability and therefore does not significantly limit the FLPM performance.
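For reference, the $PCP_{0.5}$ check used throughout this section can be written as a small function; this is a sketch, and trying both endpoint orderings is an assumption rather than something specified in the text.

```python
import numpy as np

def pcp_correct(det_endpoints, gt_endpoints, thresh=0.5):
    """PCP check for one part (sketch). Each argument is a (2, 2) array of the
    two stick endpoints. A detection counts as correct if both endpoints lie
    within thresh * (ground-truth part length) of the matching ground-truth
    endpoints; both endpoint orderings are tried (an assumption of this sketch)."""
    det = np.asarray(det_endpoints, dtype=float)
    gt = np.asarray(gt_endpoints, dtype=float)
    part_len = np.linalg.norm(gt[0] - gt[1])
    for order in (det, det[::-1]):
        d = np.linalg.norm(order - gt, axis=1)
        if np.all(d <= thresh * part_len):
            return True
    return False
```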
Table 4b shows an FLPM vs. baseline comparison that tests the contribution of
the core component of the FLPM, the linking features. Instead of choosing part
candidates using the connectivity scores between the candidates (obtained by the
linking features), for each test image the baseline chose the maximal-scoring part
candidate for each part. The candidate scores are produced by the star model
that is used to generate them, as explained in section 2.4. Using these scores
resulted in a significant (> 35%) drop in performance, underscoring the importance
of the linking features to the approach. The results of the baseline remained
low even when adding a kinematic prior on the relative positions and angles

 





Fig. 5. (a) PCP curves for FLPM on the PASCAL stickmen & Buffy datasets; (b) Example
visual results of applying the FLPM approach (without modification) to the detection
of 130 facial fiducial points (light blue) on the AR dataset.

between parts. The FLPM has two sources of information to select the most
likely parts configuration, namely, the candidate part detection scores and the
part connection scores via the linking features. The baseline illustrates that
the most significant source of information in the model comes from the linking
features. Figure 4c shows more examples of successfully detected parts.
Buffy v2.1. This dataset consists of 748 frames from 5 episodes (2-6) of the Buffy the
Vampire Slayer TV series. On each image one person is annotated. The results
are reported for the usual 235 images from episodes 2, 5 & 6 detected by the
upper body detector used in [3]. In our Buffy experiments, the training and testing
is done in leave-one-episode-out manner, in which all images of one episode are
used for testing, while images of all the other episodes are used for training.
PASCAL Stickmen v1.1. This dataset consists of 549 unrelated images with
one annotated person per-image. In our experiments on this dataset, the training
and testing was done in 5-fold cross validation manner, in which we divided the
dataset images into 5 equal folds, and for testing on images of each fold the
model was trained using images of the other folds. As in [3, 8, 9], the results are
reported for the subset of 412 dataset images (denoted test images) on which
the person bounding box was successfully detected by the upper body detector
of [3]. Note, however, that the two previous schemes [8, 9] originally reported
their results on the previous (currently unavailable) version 1.0 of this dataset
(with only 360 test images). Therefore, in table 4a, we compare the results of
the proposed approach (FLPM) with the results of publicly available code of [9]
applied to the 1.1 version of this dataset (412 test images).
AR Dataset. This is a facial fiducial points detection dataset proposed by [15,
16]. It consists of 895 images (with ground truth annotations) of 112
different people with 130 facial landmarks to be detected (figure 5b). The

images exhibit different facial expressions, different lighting conditions, and some
occluders (glasses, facial hair, etc.). We treat each facial landmark as a part
whose location is to be detected on the test images. The linking features in
the FLPM model score the relative locations of the facial landmarks, and the
most coherent (according to the FLPM) configuration of the facial landmarks
is selected during inference. The experiments were conducted in a leave-person-
out manner, each time leaving all the images belonging to a test person out of
the training. The FLPM model attains an average detection error of 4.0 ± 1.3
pixels, comparing favorably to the average error of 8.4 ± 1.2 pixels reported by
the state-of-the-art method of [16].
Ostrich. The dataset consists of 558 frames of a running ostrich movie sequence
downloaded from YouTube. The training and testing was again done by 5-fold
cross validation, each time leaving a sequence of about 110 consecutive frames
for testing and training on the other frames. The average P CP0.5 was 90.7%,
with perfect detection of neck and torso, 78.2% detection of the upper legs and
93.8% detection of the lower legs. A movie with the visual results is provided in
the supplementary.

4 Summary and Discussion


We presented an approach to the detection of highly flexible parts of deformable
objects, which uses linking appearance features to weigh pairwise configurations
between part candidates. Previous models have focused on the appearance of the
individual parts combined with kinematic constraints (general or adaptive to the
query image); the current model also learns and uses the appearance of part-
linking. The results show that linking features supply highly useful information
for identifying likely configurations of complex objects. In our implementation,
linking features were used instead of the kinematic constraints that were used
by other approaches. An interesting future research direction is to combine both
the linking features constraints and the kinematic constraints in a single model,
which can enjoy the power of both. For example, choosing the best result between
the FLPM and the CPS method [9] for each test image of PASCAL stickmen
v1.1 has the potential to improve the average correct detection to 81.4% (choos-
ing all 6 parts together) or even to 83.7% (choosing each part separately). This
suggests that the two methods are complementary on many test images and
that a combined method is worth pursuing. In the current implementation, part
candidates are discovered using dedicated star models, and are then connected
by the linking features, which vote for consistent pairwise part configurations.
One plausible future extension is to use a single model which will simultane-
ously discover parts and link them through intermediate linking features. A
possible way to achieve this is to use an extension of the chains model [14], in
which parts can be connected by extended chains of linking features (between
parts) and internal features (within parts). Additional extensions include using
more complex features and color information, which have been used successfully,
e.g. in [8, 9, 13].

Acknowledgments. This work was supported by ERC Advanced Grant 269627


Digital Baby to SU. The authors would also like to thank Michael Dinerstein,
Danny Harari, Maria Zontak, Nimrod Dorfman and the anonymous reviewers
for valuable comments that helped improve this manuscript.

References
1. Felzenszwalb, P., Huttenlocher, D.: Pictorial structures for object recognition.
IJCV (2005)
2. Ramanan, D.: Learning to parse images of articulated objects. In: NIPS (2006)
3. Eichner, M., Ferrari, V.: Better appearance models for pictorial structures. In:
BMVC (2009)
4. Andriluka, M., Roth, S., Schiele, B.: Pictorial structures revisited: People detection
and articulated pose estimation. In: CVPR (2009)
5. Wang, Y., Mori, G.: Multiple Tree Models for Occlusion and Spatial Constraints in
Human Pose Estimation. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008,
Part III. LNCS, vol. 5304, pp. 710–724. Springer, Heidelberg (2008)
6. Jiang, H., Martin, D.R.: Global pose estimation using non-tree models. In: CVPR
(2008)
7. Yao, B., Fei-Fei, L.: Modeling mutual context of object and human pose in human-
object interaction activities. In: CVPR (2010)
8. Sapp, B., Jordan, C., Taskar, B.: Adaptive pose priors for pictorial structures. In:
CVPR (2010)
9. Sapp, B., Toshev, A., Taskar, B.: Cascaded Models for Articulated Pose Estimation.
In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part II. LNCS,
vol. 6312, pp. 406–420. Springer, Heidelberg (2010)
10. Yang, Y., Ramanan, D.: Articulated pose estimation using flexible mixtures of
parts. In: CVPR (2011)
11. Ioffe, S., Forsyth, D.: Human tracking with mixtures of trees. In: ICCV (2001)
12. Sapp, B., Weiss, D., Taskar, B.: Parsing human motion with stretchable models.
In: CVPR (2011)
13. Bourdev, L., Malik, J.: Poselets: Body part detectors trained using 3d human pose
annotations. In: ICCV (2009)
14. Karlinsky, L., Dinerstein, M., Harari, D., Ullman, S.: The chains model for detect-
ing parts by their context. In: CVPR (2010)
15. Martinez, A., Benavente, R.: The AR face database. CVC Technical Report num.
24 (1998)
16. Ding, L., Martinez, A.M.: Features versus context: An approach for precise and
detailed detection and delineation of faces and facial features. PAMI (2010)
17. Lowe, D.: Distinctive image features from scale-invariant keypoints. IJCV (2004)
18. Liu, C., Yuen, J., Torralba, A.: SIFT flow: dense correspondence across different
scenes and its applications. PAMI (2010)
19. Mount, D., Arya, S.: Ann: A library for approximate nearest neighbor searching.
CGC (1997)
20. Duda, R., Hart, P.: Pattern classification and scene analysis. Wiley (1973)
21. Ng, A., Jordan, M., Weiss, Y.: On spectral clustering: Analysis and an algorithm.
In: NIPS (2001)
Diagnosing Error in Object Detectors

Derek Hoiem, Yodsawalai Chodpathumwan, and Qieyun Dai

Department of Computer Science


University of Illinois at Urbana-Champaign

Abstract. This paper shows how to analyze the influences of object characteris-
tics on detection performance and the frequency and impact of different types of
false positives. In particular, we examine effects of occlusion, size, aspect ratio,
visibility of parts, viewpoint, localization error, and confusion with semantically
similar objects, other labeled objects, and background. We analyze two classes
of detectors: the Vedaldi et al. multiple kernel learning detector and different ver-
sions of the Felzenszwalb et al. detector. Our study shows that sensitivity to size,
localization error, and confusion with similar objects are the most impactful forms
of error. Our analysis also reveals that many different kinds of improvement are
necessary to achieve large gains, making more detailed analysis essential for the
progress of recognition research. By making our software and annotations avail-
able, we make it effortless for future researchers to perform similar analysis.

1 Introduction

Large datasets are a boon to object recognition, yielding powerful detectors trained on
hundreds of examples. Dozens of papers are published every year citing recognition
accuracy or average precision results, computed over thousands of images from Cal-
Tech or PASCAL VOC datasets [1, 2]. Yet such performance summaries do not tell
us why one method outperforms another or help understand how it could be improved.
For authors, it is difficult to find the most illustrative qualitative results, and the read-
ers may be suspicious of a few hand-picked successes or intuitive failures. To make
matters worse, a dramatic improvement in one aspect of recognition may produce only
a tiny change in overall performance. For example, our study shows that complete ro-
bustness to occlusion would lead to an improvement of only a few percent average
precision. There are many potential causes of failure: occlusion, intra-class variations,
pose, camera position, localization error, and confusion with similar objects or textured
backgrounds. To make progress, we need to better understand what most needs im-
provement and whether a new idea produces the desired effect. We need to measure the
modes of failure of our algorithms. Fortunately, we already have excellent large, diverse
datasets; now, we propose annotations and analysis tools to take full advantage.
This paper analyzes the influences of object characteristics on detection performance
and the frequency and impact of different types of false positives. In particular, we ex-
amine effects of occlusion, size, aspect ratio, visibility of parts, viewpoint, localization
error, confusion with semantically similar objects, confusion with other labeled objects,
and confusion with background. We analyze two types of detectors on the PASCAL

This work was supported by NSF awards IIS-1053768 and IIS-0904209, ONR MURI Grant
N000141010934, and a research award from Google.


VOC 2007 dataset [2]: the Vedaldi et al. [3] multiple kernel learning detector (called
VGVZ for the authors' initials) and the Felzenszwalb et al. [4, 5] detector (called FGMR).
By making our analysis software and annotations available, we will make it effortless
for future researchers to perform similar analysis.
Relation to Existing Studies: Many methods have been proposed to address a par-
ticular recognition challenge, such as occlusion (e.g., [68]), variation in aspect ratio
(e.g., [5, 3, 9]), or changes of viewpoint (e.g., [1012]). Such methods are usually eval-
uated based on overall performance, home-brewed datasets, or artificial manipulations,
making it difficult to determine whether the motivating challenge is addressed for nat-
urally varying examples. Researchers are aware of the value of additional analysis, but
there are no standards for performing or compactly summarizing detailed evaluations.
Some studies focus on particular aspects of recognition, such as interest point detec-
tion (e.g., [13, 14]), contextual methods [15, 16], dataset design [17], and cross-dataset
generalization [18]. The work by Divvala et al. [15] is particularly relevant: through
analysis of false positives, they show that context reduces confusion with textured back-
ground patches and increases confusion with semantically similar objects. The report on
PASCAL VOC 2007 by Everingham et al. [19] is also related in its sensitivity analysis
of size and qualitative analysis of failures. For size analysis, Everingham et al. measure
the average precision (AP) for increasing size thresholds (smaller-than-threshold ob-
jects are treated as "don't cares"). Because the measure is cumulative, some effects of
size are obscured, causing the authors to conclude that detectors have a limited prefer-
ence for large objects. In contrast, our experiments indicate that object size is the best
single predictor of performance.
Studies in specialized domains, such as pedestrian detection [20] and face recog-
nition [21, 22], have provided useful insights. For example, Dollar et al. [20] analyze
effects of scale, occlusion, and aspect ratio for a large dataset of pedestrian videos,
summarizing performance for different subsets with a single point on the ROC curve.
Contributions: Our main contribution is to provide analysis tools (annotations, soft-
ware, and techniques) that facilitate detailed and meaningful investigation of object
detector performance. Although benchmark metrics, such as AP, are well-established,
we had much difficulty to quantitatively explore causes and correlates of error; we wish
to share some of the methodology that we found most useful with the community.
Our second contribution is the analysis of two state-of-the-art detectors. We intend
our analysis to serve as an example of how other researchers can evaluate the frequency
and correlates of error for their own detectors. We sometimes include detailed analysis
for the purpose of exemplification, beyond any insights that can be gathered from the
results.
Third, our analysis also helps identify significant weaknesses of current approaches
and suggests directions for improvement. Equally important, our paper highlights the
danger of relying on overall benchmarks to measure short-term progress. Currently,
detectors perform well for the most common cases of objects and avoid egregious errors.
Large improvements in any one aspect of object recognition may yield small overall
gains and go under-appreciated without further analysis.

Details of Dataset and Detectors: Our experiments are based on the PASCAL VOC
2007 dataset [2], which is widely used to evaluate performance in object category de-
tection. The detection task is to find instances of a specific object category within each
input image, localizing each object with a tight bounding box. For the 2007 dataset,
roughly 10,000 images were collected from Flickr.com and split evenly into train-
ing/validation and testing sets. The dataset is favored for its representative sample of
consumer photographs and rigorous annotation and collection procedures. The dataset
contains bounding box annotations for 20 object categories, including some auxiliary
labels for truncation and five categories of viewpoint. We extend these annotations with
more detailed labels for occlusion, part visibility, and viewpoint. Object detections con-
sist of a bounding box and confidence. The highest-confidence bounding box with at least 50%
overlap is considered correct; all others are incorrect. Detection performance is mea-
sured with precision-recall curves and summarized with average precision. We use the
2007 version of VOC because the test set annotations are available (unlike later ver-
sions), enabling analysis on test performance of various detectors. Experiments on later
versions are very likely to yield similar conclusions.
The FGMR detector [23] consists of a mixture of deformable part models for each
object category, where each mixture component has a global template and a set of de-
formable parts. Both the global template and the deformable parts are represented by
HOG features captured at a coarser and finer scale respectively. A hypothesis score con-
tains both a data term (filter responses) and a spatial prior (deformable cost), and the
overall score for each root location is computed based on the best placement of parts.
Search is performed by scoring the root template at each position and scale. The training
of the object model is posed as a latent SVM problem where the latent variables specify
the object configuration. The main differences between v4 and v2 are that v4 includes
more components (3 vs. 2) and more latent parts (8 vs. 6) and has a latent left/right flip
term. We do not use the optional context rescoring.
The VGVZ detector [3] adopts a cascade approach by training a three-stage classi-
fier, using as features Bag of Visual Words, Dense Words, Histogram of Oriented Edges
and Self-Similarity Features. For each feature channel, a three-level pyramid of spatial
histograms is computed. In the first stage, a linear SVM is used to output a set of can-
didate regions, which are passed to the more powerful second and third stages, which
use quasi-linear and non-linear kernel SVMs, respectively. Search is performed using
jumping windows proposed based on a learned set of detected keypoints.

2 Analysis of False Positives


One major type of error is false positives, detections that do not correspond to the tar-
get category. There are different types of false positives which likely require different
kinds of solutions. Localization error occurs when an object from the target category
is detected with a misaligned bounding box (0.1 <= overlap < 0.5). Other overlap
thresholds (e.g., 0.2 <= overlap < 0.5) led to similar conclusions. Overlap is de-
fined as the intersection divided by union of the ground truth and detection bounding
boxes. We also consider a duplicate detection (two detections for one object) to be lo-
calization error because such mistakes are avoidable with good localization. Remaining
false positives that have at least 0.1 overlap with an object from a similar category are

[Fig. 1 image: top airplane, cat, and cow false positives, each labeled with its error type, overlap (ov), and 1-r]

Fig. 1. Examples of top false positives: We show the top six false positives (FPs) for the FGMR
(v4) airplane, cat, and cow detectors. The text indicates the type of error (loc=localization;
bg=background; sim=confusion with similar object), the amount of overlap (ov) with a
true object, and the fraction of correct examples that are ranked lower than the given false positive
(1-r, for 1-recall). Localization errors could be insufficient overlap (less than 0.5) or duplicate
detections. Qualitative examples, such as these, indicate that confusion with similar objects and
localization error are much more frequent causes of false positives than mislabeled background
patches (which provide many more opportunities for error).

counted as confusion with similar objects. For example, a dog detector may assign
a high score to a cat region. We consider two categories to be semantically similar
if they are both within one of these sets: {all vehicles}, {all animals including per-
son}, {chair, diningtable, sofa}, {aeroplane, bird}. Confusion with dissimilar objects
describes remaining false positives that have at least 0.1 overlap with another labeled
VOC object. For example, the FGMR bottle detector very frequently detects people
because the exterior contours are similar. All other false positives are categorized as
confusion with background. These could be detections within highly textured areas
or confusions with unlabeled objects that are not within the VOC categories. See Fig. 1
for examples of confident false positives.
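The false-positive taxonomy described above can be sketched as a small classification routine; the data structures and the handling of duplicate detections (which the text also counts as localization error) are simplifications, and the similarity sets follow the ones listed in this section.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / float(area(a) + area(b) - inter)

SIMILAR = [  # semantic similarity sets listed in the text
    {"aeroplane", "bicycle", "boat", "bus", "car", "motorbike", "train"},  # vehicles
    {"bird", "cat", "cow", "dog", "horse", "sheep", "person"},             # animals + person
    {"chair", "diningtable", "sofa"},
    {"aeroplane", "bird"},
]

def fp_type(det_box, det_class, gt_objects):
    """Classify a false positive as 'localization', 'similar', 'other', or
    'background' following the overlap rules in this section (a sketch;
    gt_objects is assumed to be a list of dicts with 'box' and 'class')."""
    overlaps = [(iou(det_box, g["box"]), g["class"]) for g in gt_objects]
    if any(0.1 <= ov < 0.5 and cls == det_class for ov, cls in overlaps):
        return "localization"
    similar = lambda a, b: any(a in s and b in s for s in SIMILAR)
    if any(ov >= 0.1 and similar(det_class, cls) for ov, cls in overlaps):
        return "similar"
    if any(ov >= 0.1 for ov, _ in overlaps):
        return "other"
    return "background"
```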
In Figure 2, we show the frequency and impact on performance of each type of
false positive. We count top-ranked false positives that are within the most confi-
dent Nj detections. We choose parameter Nj to be the number of positive examples
for the category, so that if all objects are correctly detected, no false positives would
remain. A surprisingly small fraction of confident false positives are due to confusion
with background (e.g., only 9% for VGVZ animal detectors on average). For animals,
most false positives are due to confusion with other animals; for vehicles, localization
and confusion with similar categories are both common. In looking at trends of false
positives with increasing rank, localization errors and confusion with similar objects
tend to be more common among the top-ranked than the lower-ranked false positives.
Confusion with other objects and confusion with background may be similar types
of errors. Some categories were often confused with semantically dissimilar categories.

[Fig. 2 image: pie charts and bar graphs titled "% Top False Positives, AP Impact: VGVZ09 Detector" and "% Top False Positives, AP Impact: FGMR (v4) Detector"]

Fig. 2. Analysis of Top-Ranked False Positives. Pie charts: fraction of top-ranked false positives
that are due to poor localization (Loc), confusion with similar objects (Sim), confusion with other
VOC objects (Oth), or confusion with background or unlabeled objects (BG). Each category
named within Sim is the source of at least 10% of the top false positives. Bar graphs display
absolute AP improvement by removing all false positives of one type (B removes confusion
with background and non-similar objects). L: the first bar segment displays improvement if
duplicate or poor localizations are removed; the second displays improvement if the localization
errors were corrected, turning false detections into true positives.

For example, bottles were often confused with people, due to the similarity of exterior
contours. Removing only one type of false positives may have a small effect, due to
the $TP/(TP+FP)$ form of precision. In particular, the potential improvement by removing
all background false detections is surprisingly small (e.g., 0.02 AP for animals, 0.04
AP for vehicles). Improvements in localization or differentiating between similar cat-
egories would lead to the largest gains. If poor localizations were corrected, e.g. with
an effective category-based segmentation method, performance would improve greatly
from additional high-confidence true positives, as well as fewer false positives.

3 False Negatives and Impact of Object Characteristics


Detectors may incur a false negative by assigning a low confidence to an object or by
missing it completely. Intuitively, an object may be difficult to detect due to occlusion,
truncation, small size, or unusual viewpoint. In this section, we measure the sensitivity
of detectors to these characteristics and others and also try to answer why so many
objects (typically about 40%) are not detected with even very low confidence.

3.1 Definitions of Object Characteristics


To perform our study, we added annotations to the PASCAL VOC dataset, including
level of occlusion and which sides and parts of an object are visible. We also use stan-
dard annotations, such as the bounding box. We created the extra annotations for seven

[Fig. 3 image panels: None, Low, Medium, High]

Fig. 3. Example of 4 levels of occlusion for the aeroplane class

categories (aeroplane, bicycle, bird, boat, cat, chair, diningtable) that span
the major groups of vehicles, animals, and furniture. Annotations were created by one
author to ensure consistency and are publicly available.
Object size is measured as the pixel area of the bounding box. We also considered
bounding box height as a size measure, which led to similar conclusions. We assign
each object to a size category, depending on the object's percentile size within its object
category: extra-small (XS: bottom 10%); small (S: next 20%); medium (M: next 40%);
large (L: next 20%); extra-large (XL: next 10%). Aspect ratio is defined as object width
divided by object height, computed from the VOC bounding box annotation. Similarly
to object size, objects are categorized into extra-tall (XT), tall (T), medium (M), wide
(W), and extra-wide (XW), using the same percentiles. Occlusion (part of the object
is obscured by another surface) and truncation (part of the object is outside the im-
age) have binary annotations in the standard VOC annotation. We replace the occlusion
labels with degree of occlusion (see Fig. 3): None, Low (slight occlusion), Moder-
ate (significant part is occluded), and High (many parts missing or 75% occluded).
Visibility of parts influences detector performance, so we add annotations for whether
each part of an object is visible. We annotate viewpoint as whether each side (bottom,
front, top, side, rear) is visible. For example, an object seen from the front-right
may be labeled as bottom=0, front=1, top=0, side=1, rear=0.
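As a concrete illustration of the binning just described (not the authors' released annotation tool), the following sketch assigns the size and aspect-ratio categories from a list of bounding boxes; the percentile cut-offs follow the text, while the function and variable names are our own.

```python
import numpy as np

# Percentile boundaries from the text: XS = bottom 10%, S = next 20%,
# M = next 40%, L = next 20%, XL = top 10% (same cut-offs for aspect ratio).
CUTS = [10, 30, 70, 90]
SIZE_LABELS = ["XS", "S", "M", "L", "XL"]
ASPECT_LABELS = ["XT", "T", "M", "W", "XW"]

def categorize(values, labels):
    """Bin each value by its percentile within the given list of values."""
    values = np.asarray(values, dtype=float)
    edges = np.percentile(values, CUTS)
    return [labels[b] for b in np.searchsorted(edges, values, side="right")]

# Hypothetical bounding boxes (xmin, ymin, xmax, ymax) for one object category;
# in the study, percentiles are computed within each category separately.
boxes = np.array([[10, 10, 60, 40], [5, 5, 300, 200], [0, 0, 30, 90],
                  [20, 20, 120, 80], [50, 50, 400, 350]], dtype=float)
areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
aspects = (boxes[:, 2] - boxes[:, 0]) / (boxes[:, 3] - boxes[:, 1])  # width / height

print(categorize(areas, SIZE_LABELS))      # one of XS/S/M/L/XL per object
print(categorize(aspects, ASPECT_LABELS))  # one of XT/T/M/W/XW per object
```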

3.2 Normalized Precision Measure


To analyze sensitivity to object characteristics, we would like to summarize the per-
formance of different subsets (e.g. small vs. large) of objects. Current performance
measures do not suffice: ROC curves are difficult to summarize, and average precision
is sensitive to the number of positive examples. We propose a simple way to normal-
ize precision so that we can easily measure and compare performance for objects with
particular characteristics.
The standard PASCAL VOC measure is average precision (AP), which summarizes
precision-recall curves with the average interpolated precision value of the positive ex-
amples. Recall R(c) is the fraction of objects detected with confidence of at least c.
Precision P(c) is the fraction of detections that are correct:

P(c) = \frac{R(c) N_j}{R(c) N_j + F(c)}    (1)

where N_j is the number of objects in class j and F(c) is the number of incorrect detec-
tions with at least c confidence. Before computing AP, precision is interpolated, such
that the interpolated precision value at c is the maximum precision value for any exam-
ple with at least confidence c. If N_j is large (such as for pedestrians), then the precision

Table 1. Detection Results on PASCAL VOC2007 Dataset. For each object category, we show the
total number of objects and the average precision (AP) and average normalized precision (APN)
for the VGVZ and FGMR detectors. For APN, the precision is normalized to be comparable
across categories. Categories with many positive examples (e.g., person) will have lower APN
than AP; the reverse is true for categories with few examples.
aero bike boat bus car mbike train bird cat cow dog horse sheep bottle chair table plant sofa tv pers
Num Objs 285 337 263 213 1201 325 282 459 358 244 489 348 242 469 756 206 480 239 308 4528
VGVZ AP .364 .468 .113 .513 .508 .450 .447 .106 .277 .312 .174 .512 .208 .193 .131 .191 .073 .266 .477 .200
APN .443 .531 .181 .612 .480 .511 .527 .136 .372 .418 .222 .565 .294 .222 .130 .318 .094 .294 .567 .073
FGMR AP .277 .581 .134 .481 .555 .475 .436 .028 .149 .214 .034 .582 .154 .225 .191 .204 .068 .306 .405 .410
APN .343 .631 .187 .574 .517 .537 .528 .039 .207 .311 .048 .631 .215 .252 .190 .346 .086 .409 .477 .214

would be higher than if N_j were small for the same detection rate. This sensitivity to
N_j invalidates AP comparisons for different sets of objects. For example, we cannot use
AP to determine whether people are easier to detect than cows, or whether big objects
are easier to detect than small ones.
We propose to replace N_j with a constant N to create a normalized precision:

P_N(c) = \frac{R(c) N}{R(c) N + F(c)}.    (2)
In our experiments, we set N = 0.15 N_images = 742.8, which is roughly equal to the
average N_j over the PASCAL VOC categories. Detectors with similar detection rates
and false positive rates will have similar normalized precision-recall curves. We can
summarize normalized precision-recall by averaging the normalized precision values of
the positive examples to create average normalized precision (APN). When computing
APN, undetected objects are assigned a precision value of 0. We compare AP and APN
in Table 1.
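For readers who wish to reproduce the measure, a minimal sketch of APN for a single category is given below, assuming a list of detections already sorted by decreasing confidence and matched to ground truth; it follows Eq. (2), interpolates precision as described, and assigns undetected objects a precision of 0. This is an illustrative re-implementation, not the authors' evaluation code.

```python
import numpy as np

def average_normalized_precision(is_correct, num_objects, n_const=742.8):
    """AP_N for one category (illustrative, not the released evaluation code).

    is_correct:  booleans for detections sorted by decreasing confidence
                 (True = correct detection of a previously unmatched object).
    num_objects: number of ground-truth objects N_j in the category.
    n_const:     the constant N that replaces N_j in Eq. (2).
    """
    is_correct = np.asarray(is_correct, dtype=bool)
    tp = np.cumsum(is_correct)                  # true positives at each rank
    fp = np.cumsum(~is_correct)                 # false positives F(c)
    recall = tp / float(num_objects)            # R(c)
    prec_n = recall * n_const / (recall * n_const + fp + 1e-12)   # Eq. (2)
    # Interpolate: precision at a rank is the best precision at that rank or later.
    prec_n = np.maximum.accumulate(prec_n[::-1])[::-1]
    # Average over the positive examples; undetected objects contribute 0.
    return prec_n[is_correct].sum() / float(num_objects)

# Toy example: 6 detections for a category with 5 ground-truth objects.
print(average_normalized_precision([True, True, False, True, False, False], 5))
```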

3.3 Analysis
To understand performance for a category, it helps to inspect the performance variations
for each characteristic. We include this detailed analysis primarily as an example of how
researchers can use our tool to understand the strengths and weaknesses of their own
detectors. Upon careful inspection of Figure 4, for example, we can learn the following
about airplane detectors: both detectors perform similarly, preferring similar subsets
of non-occluded, non-truncated, medium to extra-wide, side views in which all major
parts are visible. Performance for very small and heavily occluded airplanes is poor,
though performance on non-occluded airplanes is near average (because few airplanes
are heavily occluded). FGMR shows a stronger preference for exact side-views than
VGVZ, which may also account for differences in performance on small and extra-
large objects, which are both less likely to be side views. We can learn similar things
about the other categories. For example, both detectors work best for large cats, and
FGMR performs best with truncated cats, in which all parts except the face and ears are
hidden. Note that both detectors seem to vary in similar ways, indicating that their
sensitivities may be due to some objects being intrinsically more difficult to recognize.
The VGVZ cat detector is less sensitive to viewpoint and part visibility, which may be
due to its textural bag of words features.

[Fig. 4 plots: APN by occlusion, truncation, bounding box area, aspect ratio, sides visible, and parts visible, for VGVZ 2009 and FGMR (v4) on the aeroplane and cat categories.]

Fig. 4. Per-Category Analysis of Characteristics: APN (+) with standard error bars (red).
Black dashed lines indicate overall APN. Key: Occlusion: N=none; L=low; M=medium; H=high.
Truncation: N=not truncated; T=truncated. Bounding Box Area: XS=extra-small; S=small;
M=medium; L=large; XL=extra-large. Aspect Ratio: XT=extra-tall/narrow; T=tall; M=medium;
W=wide; XW=extra-wide. Part Visibility / Viewpoint: 1=part/side is visible; 0=part/side is
not visible. Standard error is used for the average precision statistic as a measure of significance,
rather than confidence bounds, due to the difficulty of modeling the precision distribution.

Since size and occlusion are such important characteristics, we think it worthwhile
to examine their effects across several categories (Fig. 5). Typically, detectors work best
for non-occluded objects, but when objects are frequently occluded (bicycle, chair, din-
ingtable) there is sometimes a small gain in performance for lightly occluded objects.
Although detectors are bad at detecting objects with medium to heavy occlusion, the impact on
overall performance is small. Researchers working on the important problem of occlu-
sion robustness should be careful to examine effects within the subset of occluded ob-
jects. Both detectors tend to prefer medium to large objects (the 30th to 90th percentile
in area). The difficulty with small objects is intuitive. The difficulty with extra-large ob-
jects may initially surprise, but qualitative analysis (e.g., Fig. 7) shows that large objects
are often highly truncated or have unusual viewpoints.
Fig. 6 provides a compact summary of the sensitivity to each characteristic and the
potential impact of improving robustness. The worst-performing and best-performing
subsets for each characteristic are averaged over 7 categories: the difference between
best and worst indicates sensitivity; the difference between best and overall indicates
potential impact. For example, detectors are very sensitive to occlusion and truncation,
but the impact is small (about 0.05 APN) because the most difficult cases are not com-
mon. Object area and aspect have a much larger impact (roughly 0.18 for area, 0.13
for aspect). Overall, VGVZ is more robust than FGMR to viewpoint and part visibility,
likely because it is better at encoding texture (due to its bag-of-words features), while
FGMR can accommodate only limited deformations. The performance of VGVZ varies
more than FGMR with object size. Efforts to improve occlusion or viewpoint robust-
ness should be validated with specialized analysis so that improvements are not diluted
by the more common easier cases.
One of the aims of our study is to understand why current detectors fail to detect
30–40% of objects, even at small confidence thresholds. We consider an object to be

[Fig. 5 plots: APN by occlusion level and by bounding box area for VGVZ 2009 and FGMR (v4), per category (airplane, bicycle, bird, boat, cat, chair, table).]

Fig. 5. Sensitivity and Impact of Object Characteristics: APN (+) with standard error bars
(red). Black dashed lines indicate overall APN. Key: Occlusion: N=none; L=low; M=medium;
H=high. Bounding Box Area: XS=extra-small; S=small; M=medium; L=large; XL=extra-large.

[Fig. 6 plots: sensitivity and impact of occ, trn, size, asp, view, and part for VGVZ 2009, FGMR (v2), and FGMR (v4).]

Fig. 6. Summary of Sensitivity and Impact of Object Characteristics: We show the average
(over categories) APN performance of the highest performing and lowest performing subsets
within each characteristic (occlusion, truncation, bounding box area, aspect ratio, viewpoint, part
visibility). Overall APN is indicated by the black dashed line. The difference between max and
min indicates sensitivity; the difference between max and overall indicates the impact.

undetected if there is no detection above 0.05 PN (roughly 1.5 FP/image). Our analysis
indicates that size is the best single explanation, as nearly all extra-small objects go
undetected and roughly half of undetected objects are in the smallest 30%. But size
accounts for only 20% of the entropy of whether an object is detected (measured by
information gain divided by entropy). Contrary to intuition, occlusion does not increase
the chance of going undetected (though it decreases the expected confidence); there is
also only a weak correlation with aspect ratio. As can be seen in Figure 7, misses are
due to a variety of factors, such as unusual appearance, unusual viewpoints, in-plane
rotation, and particularly disguising occlusions. Some missed detections are actually
detected with high confidence but poor localization. With a 10% overlap criterion, the
number of missed detections is reduced by 40–85% (depending on the category and detector).
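The "fraction of entropy explained" quantity mentioned above (information gain of a characteristic about the detected/undetected outcome, divided by the entropy of that outcome) can be computed as in the sketch below; the data shown are made up, and the helper names are ours.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a discrete label sequence."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def fraction_of_entropy_explained(characteristic, detected):
    """Information gain of `characteristic` about `detected`, divided by H(detected)."""
    h = entropy(detected)
    h_cond = 0.0
    for value in set(characteristic):
        subset = [d for c, d in zip(characteristic, detected) if c == value]
        h_cond += (len(subset) / float(len(detected))) * entropy(subset)
    return (h - h_cond) / h

# Made-up data: size bin and detection outcome for a handful of objects.
size = ["XS", "XS", "S", "M", "M", "L", "XL", "S", "M", "L"]
det  = [False, False, False, True, True, True, True, False, True, False]
print(fraction_of_entropy_explained(size, det))  # the text reports ~20% for size on VOC
```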
Conclusions and Discussion: As expected, objects that are small, heavily occluded,
seen from unusual views (e.g., the bottom), or that have important occluded parts are
hard to detect. But the deviations are interesting: slightly occluded bicycles and ta-
bles are easier; detectors often have trouble with very large objects; cats are easiest for


Fig. 7. Unexpectedly Difficult Detections: We fit a linear regressor to predict confidence (PN )
based on size, aspect ratio, occlusion and truncation for the FGMR (v4) detector. We show 5
of the 15 objects with the greatest difference between predicted confidence and actual detection
confidence. The ground truth object is in red, with predicted confidence in upper-right corner
in italics. The other box shows the highest scoring correct detection (green), if any, or highest
scoring overlapping detection (blue dashed) with the detection confidence in the upper-left corner.

FGMR when only the head is visible. Despite high sensitivity, the overall impact of
occlusion and many other characteristics is surprisingly small. Even if the gap between
no occlusions and heavy occlusions were completely closed, the overall AP gain would
be only a few percent. Also, note that smaller objects, heavily occluded objects, and
those seen from unusual viewpoints will be intrinsically more difficult, so we cannot
expect to close the gap completely.
We can also see the impact of design differences for two detectors. The VGVZ de-
tector incorporates more textural features and also has a more flexible window proposal
method. These detector properties are advantageous for detecting highly deformable
objects, such as cats, and they lead to reduced sensitivity to part visibility and view-
point. However, VGVZ is more sensitive to size (performing better for large objects
and worse for small ones) because the bag-of-words models are size-dependent, while
the FGMR templates are not. Comparing the FGMR (v2) and (v4) detectors in Figure 6,
we see that the additional components and left/right latent flip term lead to improved
robustness (improvement in both worst and best cases) to aspect ratio, viewpoint, and
truncation.

4 Conclusion

4.1 Diagnosis

We have analyzed the patterns of false and missed detections made by currently top-
performing detectors. The good news is that egregious errors are rare. Most false posi-
tives are due to misaligned detection windows or confusion with similar objects (Fig. 2).
Most missed detections are atypical in some way: small, occluded, viewed from an
unusual angle, or simply odd looking. Detectors seem to excel at latching onto the

common modes of object appearance and avoiding confusion with miscellaneous back-
ground patches. For example, detectors more easily detect lightly occluded objects for
categories that are typically occluded, such as bicycles and tables. Airplanes in the
stereotypical direct side-view are detected by FGMR with almost double the accuracy
of the average airplane.
Further progress, however, is likely to be slow, incremental, and to require careful
validation. Localization error is not easily handled by detectors because determining
object extent often requires looking well outside the window. A box around a cat face
could be a perfect detection in one image but only a small part of the visible cat in
another (Figs. 1, 8). Gradient histogram-based models, designed to accommodate mod-
erate positional slop, may be too coarse for differentiation of similar objects, such as
cows and horses or cats and dogs. Although detectors fare well for common views of
objects, there are many types of deviations that may require different solutions. Solving
only one problem will lead to small gains. For example, even if occluded objects could
be detected as easily as unoccluded ones, the overall AP would increase by 0.05 or less
(Fig. 6). Better detection of small objects could yield large gains (roughly 0.17 AP), but
the smallest objects are intrinsically difficult to detect. Our biggest concern is that suc-
cessful attempts to address problems of occlusion, size, localization, or other problems
could look unpromising if viewed only through the lens of overall performance. There
may be a gridlock in recognition research: new approaches that do not immediately ex-
cel at detecting objects typical of the standard datasets may be discarded before they
are fully developed. One way to escape that gridlock is through more targeted efforts
that specifically evaluate gains along the various dimensions of error.

4.2 Recommendations
Detecting Small/Large Objects: An object's size is a good predictor of whether it will
be detected. Small objects have fewer visual features; large objects often have unusual
viewpoints or are truncated. Current detectors make a weak perspective assumption that
an object's appearance does not depend on its distance to the camera. This assumption
is violated when objects are close, making the largest objects difficult to detect. When
objects are far away, only low resolution template models can be applied, and contextual
models may be necessary for disambiguation. There are interesting challenges in how
to benefit from the high resolution of nearby objects while being robust to near-field
perspective effects. Park et al. [24] offer one simple approach of combining detectors
trained at different resolutions. Other possibilities include using scene layout to predict
viewpoint of nearby objects, including features that take advantage of texture and small
parts that are visible for close objects, and occlusion reasoning for distant objects.
Improving Localization: Difficulty with localization and identification of duplicate
detections has a large impact on recognition performance for many categories. In part,
the problem is that template-based detectors cannot accommodate flexible objects.
For example, the FGMR (v4) detector works very well for localizing cat heads, but not
cat bodies [25]. Additionally, an object part, such as a head, could be considered correct
if the body is occluded, or incorrect otherwise; it is impossible to tell from within the
window alone (Fig. 8). Template detectors should play a major role in finding likely
object positions, but specialized processes are required for segmenting out the objects.

[Fig. 8 panel labels: Incorrect Localization (Right / Wrong); Dog Model; Challenges of Confusion with Similar Categories.]

Fig. 8. Challenging False Positives: In the top row, we show examples of dog detections that
are considered correct or incorrect, according to the VOC localization criteria. On the right, we
cropped out the rest of the image; it is not possible to determine correctness from within the win-
dow alone. On the bottom, we show several confident dog detections, some of which correspond
to objects from other similar categories. The robust HOG template detector (right), though good
for sorting through many background patches quickly, may be too robust for finer differen-
tiations.

In some applications, precise localization is unimportant; even so, the ability to identify
occluding and interior object contours would also be useful for category verification,
attribute recognition, or pose estimation. Current approaches to segment objects from
known categories (e.g., [8, 25]) tend to combine simple shape and/or color priors with
more generic segmentation techniques, such as graph cuts. There is an excellent oppor-
tunity for a more careful exploration of the interaction of material, object shape, pose,
and object category, through contour and texture-based inference. We encourage such
work to evaluate specifically in terms of improved localization of objects and to avoid
conflating detection and localization performance.
Reducing Confusion with Similar Categories: By far, the biggest task of detectors
is to quickly discard random background patches, and they do well with features that
are made robust to illumination, small shifts, and other variations. But a detector that
sorts through millions of windows per second may not be suited for differentiating be-
tween dogs and cats or horses and cows. Such differences require detailed comparison
of particular features, such as the shape of the head or eyes. Some recent work has
addressed fine-grained differentiation of birds and plants [26–28], and the ideas of find-
ing important regions for comparison may apply to category recognition as well. Also,
while HOG features [29] are well-suited to whole-object detection, some of the feature
representations originally developed for detection of small faces (e.g., [30]) may be
better for differentiating similar categories based on their localized parts. We encourage
such work to evaluate specifically on differentiating between similar objects. It may be
worthwhile to pose the problem simply as categorizing a set of well-localized objects
(even categorizing objects given PASCAL VOC bounding boxes is not easy).

Robustness to Object Variation: One interpretation of our results is that existing de-
tectors do well on the most common modes of appearance (e.g., side views of airplanes)
but fail when a characteristic, such as viewpoint, is unusual. Greatly improved recogni-
tion will require appearance models that better encode shape with robustness to moder-
ate changes in viewpoint, pose, lighting, and texture. One approach is to learn feature
representations from paired 3D and RGB images; a second is to learn the natural vari-
ations of existing features within categories for which examples are plentiful and to
extend that variational knowledge to other categories.
More Detailed Analysis: Most important, we hope that our work will inspire research
that targets and evaluates reduction in specific modes of error. In the supplemental ma-
terial, we include automatically generated reports for four detectors. Authors of future
papers can use our tools to perform similar analysis, and the results can be compactly
summarized in 1/4 page, as shown in Figs. 2, 6. We also encourage analysis of other
aspects of recognition, such as the effects of training sample size, cross-dataset gener-
alization, cross-category generalization, and recognition of pose and other properties.

References
1. Griffin, G., Holub, A., Perona, P.: Caltech-256 object category dataset. Technical Report
7694, California Institute of Technology (2007)
2. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL
Visual Object Classes Challenge 2007 (VOC 2007) (2007) Results,
http://www.pascal-network.org/challenges/VOC/voc2007/
workshop/index.html
3. Vedaldi, A., Gulshan, V., Varma, M., Zisserman, A.: Multiple kernels for object detection.
In: ICCV (2009)
4. Felzenszwalb, P., McAllester, D., Ramanan, D.: A discriminatively trained, multiscale, de-
formable part model. In: CVPR (2008)
5. Felzenszwalb, P., Girshick, R., McAllester, D., Ramanan, D.: Object detection with discrim-
inatively trained part based models. PAMI (2009)
6. Wu, B., Nevatia, R.: Detection of multiple, partially occluded humans in a single image by
bayesian combination of edgelet part detectors. In: ICCV (2005)
7. Wang, X., Han, T.X., Yan, S.: An hog-lbp human detector with partial occlusion handling.
In: ICCV (2009)
8. Yang, Y., Hallman, S., Ramanan, D., Fowlkes, C.: Layered object detection for multi-class
segmentation. In: CVPR (2010)
9. Vedaldi, A., Zisserman, A.: Structured output regression for detection with partial occulsion.
In: NIPS (2009)
10. Kushal, A., Schmid, C., Ponce, J.: Flexible object models for category-level 3d object recog-
nition. In: CVPR (2007)
11. Hoiem, D., Rother, C., Winn, J.: 3d layoutcrf for multi-view object class recognition and
segmentation. In: CVPR (2007)
12. Sun, M., Su, H., Savarese, S., Fei-Fei, L.: A multi-view probabilistic model for 3d object
classes. In: CVPR (2009)
13. Schmid, C., Mohr, R., Bauckhage, C.: Evaluation of interest point detectors. IJCV 37,
151–172 (2000)
14. Gil, A., Mozos, O.M., Ballesta, M., Reinoso, O.: A comparative evaluation of interest point
detectors and local descriptors for visual slam. Machine Vision and Applications 21(6),
905–920 (2009)

15. Divvala, S., Hoiem, D., Hays, J., Efros, A., Hebert, M.: An empirical study of context in
object detection. In: CVPR (2009)
16. Rabinovich, A., Belongie, S.: Scenes vs. objects: a comparative study of two approaches to
context based recognition. In: Intl. Wkshp. on Visual Scene Understanding, ViSU (2009)
17. Pinto, N., Cox, D.D., DiCarlo, J.J.: Why is real-world visual object recognition hard? PLoS
Computational Biology 4, e27 (2008)
18. Torralba, A., Efros, A.: Unbiased look at dataset bias. In: CVPR (2011)
19. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The pascal visual
object classes (voc) challenge. IJCV 88, 303–338 (2010)
20. Dollar, P., Wojek, C., Schiele, B., Perona, P.: Pedestrian detection: A benchmark. In: CVPR
(2009)
21. Sim, T., Baker, S., Bsat, M.: The CMU Pose, Illumination, and Expression (PIE) database
of human faces. Technical Report CMU-RI-TR-01-02, Carnegie Mellon, Robotics Institute
(2001)
22. Phillips, P.J., Flynn, P.J., Scruggs, T., Bowyer, K.W., Chang, J., Hoffman, K., Marques, J.,
Min, J., Worek, W.: Overview of the face recognition grand challenge, pp. 947–954 (2005)
23. Felzenszwalb, P.F., Girshick, R.B., McAllester, D.: Discriminatively trained deformable part
models, release 4,
http://people.cs.uchicago.edu/pff/latent-release4/
24. Park, D., Ramanan, D., Fowlkes, C.: Multiresolution Models for Object Detection. In: Dani-
ilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part IV. LNCS, vol. 6314, pp. 241–254.
Springer, Heidelberg (2010)
25. Parkhi, O., Vedaldi, A., Jawahar, C.V., Zisserman, A.: The truth about cats and dogs. In:
ICCV (2011)
26. Belhumeur, P.N., Chen, D., Feiner, S.K., Jacobs, D.W., Kress, W.J., Ling, H., Lopez, I.,
Ramamoorthi, R., Sheorey, S., White, S., Zhang, L.: Searching the World's Herbaria: A
System for Visual Identification of Plant Species. In: Forsyth, D., Torr, P., Zisserman, A.
(eds.) ECCV 2008, Part IV. LNCS, vol. 5305, pp. 116–129. Springer, Heidelberg (2008)
27. Welinder, P., Branson, S., Mita, T., Wah, C., Schroff, F., Belongie, S., Perona, P.: Caltech-
UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology
(2010)
28. Khosla, A., Yao, B., Fei-Fei, L.: Combining randomization and discrimination for fine-
grained image categorization. In: CVPR (2011)
29. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR (2005)
30. Schneiderman, H., Kanade, T.: A statistical model for 3-d object detection applied to faces
and cars. In: CVPR (2000)
Attributes for Classifier Feedback

Amar Parkash1 and Devi Parikh2


1 Indraprastha Institute of Information Technology, Delhi, India
2 Toyota Technological Institute, Chicago, US

Abstract. Traditional active learning allows a (machine) learner to query the


(human) teacher for labels on examples it finds confusing. The teacher then
provides a label for only that instance. This is quite restrictive. In this paper,
we propose a learning paradigm in which the learner communicates its belief (i.e.
predicted label) about the actively chosen example to the teacher. The teacher then
confirms or rejects the predicted label. More importantly, if rejected, the teacher
communicates an explanation for why the learner's belief was wrong. This ex-
planation allows the learner to propagate the feedback provided by the teacher to
many unlabeled images. This allows a classifier to better learn from its mistakes,
leading to accelerated discriminative learning of visual concepts even with few la-
beled images. In order for such communication to be feasible, it is crucial to have
a language that both the human supervisor and the machine learner understand.
Attributes provide precisely this channel. They are human-interpretable mid-level
visual concepts shareable across categories, e.g. "furry", "spacious", etc. We ad-
vocate the use of attributes for a supervisor to provide feedback to a classifier and
directly communicate his knowledge of the world. We employ a straightforward
approach to incorporate this feedback in the classifier, and demonstrate its power
on a variety of visual recognition scenarios such as image classification and an-
notation. This application of attributes for providing classifiers feedback is very
powerful, and has not been explored in the community. It introduces a new mode
of supervision, and opens up several avenues for future research.

1 Introduction
Consider the scenario where a teacher is trying to teach a child how to recognize gi-
raffes. The teacher can show example photographs of giraffes to the child, as well as
many pictures of other animals that are not giraffes. The teacher can then only hope
that the child effectively utilizes all these examples to decipher what makes a giraffe a
giraffe. In computer vision (and machine learning in general), this is the traditional way
in which we train a classifier to learn novel concepts.
It is clear that throwing many labeled examples at a passively learning child is very
time consuming. Instead, it might be much more efficient if the child actively engages
in the learning process. After seeing a few initial examples of giraffes and non-giraffes,
the child can identify animals that it finds most confusing (say a camel), and request the
teacher for a label for these examples where the labels are likely to be most informative.
This corresponds to the now well studied active learning paradigm.
While this paradigm is more natural, there are several aspects of this set-up that seem
artificially restrictive. For instance, the learner simply poses a query to the teacher, but

Fig. 1. We propose a novel use of attributes as a mode for a human supervisor to provide feedback
to a machine active learner allowing it to better learn from its mistakes

does not communicate to the teacher what its current model of the world (or giraffes!) is.
It simply asks "What is this?". Instead, it could just as easily say "I think this is a giraffe,
what do you think?". The teacher can then encouragingly confirm or gently reject this
belief. More importantly, if the teacher rejects the belief, he can now proactively provide
a focussed explanation for why the learner's current belief is inaccurate. If the learner
calls a tiger a giraffe, the teacher can say "Unfortunately, you're wrong. This is not
a giraffe, because its neck is not long enough." Not only does this give the learner
information about this particular example, it also gives the learner information that can
be transferred to many other examples: all animals with necks shorter than this query
animal (tiger) must not be giraffes. With this mode of supervision, the learner can learn
what giraffes are (not) with significantly fewer labeled examples of (not) giraffes.
To allow for such a rich mode of communication between the learner and the teacher,
we need a language that they both can understand. Attributes provide precisely this
mode of communication. They are mid-level concepts such as "furry" and "chubby"
that are shareable across categories. They are visual and hence machine detectable,
as well as human interpretable. Attributes have been used for a variety of tasks such
as multimedia retrieval [1–6], zero-shot learning of novel concepts [7, 8], describing
unusual aspects of objects [9], face verification [10] and others.
In this paper, we advocate the novel use of attributes as a mode of communication
for a human supervisor to provide feedback to a machine classifier. This allows for a
natural learning environment where the learner can learn much more from each mistake
it makes. The supervisor can convey his domain knowledge about categories, which can
be propagated to unlabeled images in the dataset. Leveraging these images allows for
effective discriminative learning even with a few labeled examples. We use a straight-
forward approach to incorporate this feedback in order to update the learnt classifiers.
In the above example, all images that have shorter necks than the tiger are labeled as
not-giraffe, and the giraffe classifier is updated with these new negative examples. We
demonstrate the power of this feedback in two different domains (scenes and faces) on
four visual recognition scenarios including image classification and image annotation.

2 Related Work

We discuss our work relative to existing work on exploiting attributes, collecting richer
annotation, active learning, providing a classifier feedback and discriminative guidance.

Attributes: Human-nameable visual concepts or attributes are often used in the mul-
timedia community to build intermediate representations for images [1–6]. Attributes
have also been gaining a lot of attention in the computer vision community over the
past few years [7–24]. The most relevant to our work would be the use of attributes for
zero-shot learning. Once a machine has learnt to detect attributes, a supervisor can teach
it a novel concept simply by describing which attributes the concept does or does not
have [7] or how the concept relates to other concepts the machine already knows [8].
The novel concept can thus be learnt without any training examples. However zero-shot
learning is restricted to being generative in nature and the category models have to be
built in the attribute space. Given a test image, its attribute values need to be predicted,
and then the image can be classified. Hence, while very light on supervision, zero-shot
learning often lags in classification performance. Our work allows for the marriage of
using attributes to alleviate supervision effort, while still allowing for discriminative
learning. Moreover, the feature space is not restricted to being the attributes space.
This is because we use attributes during training to transfer category labels to other
unlabeled image instances, as opposed to directly carving out portions of the attributes
space where a category can live. At test time, attribute detectors need not be used.

Detailed Annotation: With the advent of crowd-sourcing services, efforts are being
made at getting many more images labeled, and more importantly, gathering more
detailed annotations such as segmentation masks [25], parts [26], attributes [10, 20],
pose [27], etc. In contrast, the goal of our work is not to collect deeper annotations
(e.g. attribute annotations) for the images. Instead, we propose the use of attributes as
an efficient means for the supervisor to broadly transfer category-level information to
unlabeled images in the dataset, that the learner can exploit.

Active Learning: There is a large body of work, in computer vision and otherwise, that
utilizes active learning to efficiently collect data annotations. Several works consider
the problem of actively interleaving requests for different forms of annotation: object
and attribute labels [20], image-level annotation, bounding boxes or segmentations [28],
part annotations and attribute presence labels [21], etc. To re-iterate, our work focuses
only on gathering category-label annotations, but we use attributes as a mode for the
supervisor to convey more information about the categories, leading to propagation of
category labels to unlabeled images. The system proposed in [22] recognizes a bird
species with the help of a human expert answering actively selected questions pertain-
ing to visual attributes of the bird. Note the involvement of the user at test time. Our
approach uses a human in the loop only during training. In [29], the learner actively asks
the supervisor linguistic questions involving prepositions and attributes. In contrast, in
our work, an informative attribute is selected by the supervisor that allows the classi-
fier to better learn from its mistakes. Our work can be naturally extended to leveraging
the provided feedback to simultaneously update the attribute models as in [20], or for
discovering the attribute vocabulary in the first place as in [23].

Rationales: In natural language processing, works have explored human feature selec-
tion to highlight words relevant for document classification [3032]. A recent work
in vision [33] similarly solicits rationales from the annotator in the form of spatial

information or attributes for assigning a certain label to the image. This helps in iden-
tifying relevant image features or attributes for that image. But similar to zero-shot
learning, this can be exploited only if the feature space is the attributes space and thus
directly maps to the mode of feedback used by the annotator. Our work also uses at-
tributes to allow the supervisor to interactively communicate an explanation, but this
explanation is propagated as category labels to many unlabeled images in the dataset.
The feature space where category models are built is thus not constrained by the mode
of communication i.e. attributes used by the annotator and can hence be non-semantic
and quite complex (e.g. bag-of-words, gist, etc.).

Focussed Discrimination: The role of the supervisor in our work can be viewed as
that of providing a discriminative direction that helps eliminate current confusions of
the learner. Such a discriminative direction is often enforced by mining hard negative
examples that violate the margin in large-margin classifiers [34], or is determined after-
the-fact to better understand the learnt classifier [35]. In our work, a human supervisor
provides this direction by simply verbalizing his semantic knowledge about the domain
to steer the learner away from its inaccurate beliefs.1

3 Proposed Approach
We consider a scenario where a supervisor is trying to teach a machine visual concepts.
The machine has already learnt a vocabulary of attributes relevant to the domain. As
a learning aid, there is a pool of unlabeled images that the supervisor will label over the
course of the learning process for the classifier to learn from. The learning process starts
with the learner querying the supervisor for a label for a random example in the pool
of unlabeled images. At each subsequent iteration, the learner picks an image from the
unlabeled pool that it finds most confusing (e.g. a camel image in the giraffe learning
example). It communicates its own belief about this image to the supervisor in the form
of a predicted label for this image. The supervisor either confirms or rejects this label.
If rejected, the supervisor provides a correct label for the query. He also communicates
an explanation using attributes for why the learner's belief was wrong. Note that the
feedback can be relative to the category depicted in the query image ("necks of tigers
are not long enough to be giraffes") or relative to the specific image instance ("this
image is too open to be a forest image", see Fig. 1). The learner incorporates both the
label-feedback (whether accepted or rejected with correction) and the attributes-based
explanation (when applicable) to update its models. And the process continues.
We train a binary classifier h_k(x), k ∈ {1 . . . K}, for each of the K categories to be
learnt. At any point during the learning process, there is an unlabeled pool of images and
a training set T_k = {(x_k, y_k)} of labeled images for each classifier. The labeled images
x_k can lie in any feature space, including the space of predicted attribute values R^M.
The class labels are binary y_k ∈ {+1, −1}. In our implementation we use RBF SVMs
as the binary classifier with probability estimates as output. If p_k(x) is the probability
[Footnote 1: In similar spirit, contemporary work at this conference [36] uses attributes to prevent a semi-supervised approach from augmenting its training data with irrelevant candidates from an unlabeled pool of images, thus avoiding semantic-drift.]

the classifier assigns to image x belonging to class k, the confusion in x is computed
as the entropy H of the distribution p̃(x), where

p̃_k(x) = \frac{p_k(x)}{\sum_{k=1}^{K} p_k(x)}.    (1)
At each iteration in the learning process, the learner actively selects an unlabeled image
x* with maximum entropy:

x* = argmax_x H(p̃(x)).    (2)
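A minimal sketch of this selection step is shown below, assuming one probabilistic binary classifier per category (e.g. scikit-learn SVMs with probability estimates, as in the implementation described above) and a feature matrix for the unlabeled pool; the function name is illustrative.

```python
import numpy as np

def most_confusing_index(classifiers, X_unlabeled):
    """Index of the unlabeled image x* with maximum entropy (Eqs. (1)-(2)).

    classifiers: list of K binary classifiers exposing predict_proba(), where
                 column 1 gives p_k(x), the probability of the positive class.
    X_unlabeled: array of shape (num_unlabeled_images, num_features).
    """
    # p_k(x) for every class k and every unlabeled image x.
    P = np.column_stack([clf.predict_proba(X_unlabeled)[:, 1]
                         for clf in classifiers])
    # Normalize across classes (Eq. (1)), then compute the entropy H.
    P_tilde = P / P.sum(axis=1, keepdims=True)
    H = -(P_tilde * np.log(P_tilde + 1e-12)).sum(axis=1)
    return int(np.argmax(H))
```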

Note that there are two parts to the supervisor's response to the query image. The first
is label-based feedback, where the supervisor confirms or rejects and corrects the learn-
er's predicted label for the actively chosen image instance x*. And the second is an
attributes-based explanation that is provided if the learner's predicted label was incor-
rect and thus rejected. We now discuss how the attributes-based feedback is incorpo-
rated. The label-based feedback can be interpreted differently based on the application
at hand, and we discuss that in Section 4.

3.1 Incorporating Attribute-Based Explanation


Let's say the learner incorrectly predicts the label of the actively chosen image x* to be l.
The supervisor identifies an attribute a_m that he deems most appropriate to explain to
the learner why x* does not belong to l. In this work we consider two simple forms
of explanation. The supervisor can either say "x* is too a_m to be l" or "x* is not a_m
enough to be l", whichever be the case.
In the former case, the learner computes the strength of a_m in x* as r_m(x*), where
r_m is a pre-trained attribute strength predictor for attribute a_m (Section 3.2). The learner
identifies all images in the currently unlabeled pool of images U with attribute strength
of a_m more than r_m(x*). Clearly, if x* is too a_m to be l, all images depicting a higher
strength of a_m must not be l either (Fig. 1). Hence the training data T_l is updated to
be T_l = T_l ∪ {(x, −1)} ∀x ∈ U s.t. r_m(x) ≥ r_m(x*). Similarly, for the latter form
of feedback, T_l = T_l ∪ {(x, −1)} ∀x ∈ U s.t. r_m(x) ≤ r_m(x*). The category-label
information is thus propagated to other images in U aside from just x*.
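A sketch of this propagation step, using the notation above (r_m(x) = w_m · x is the pre-trained relative attribute predictor), might look as follows; the function signature and container format are our own, not the authors'.

```python
import numpy as np

def propagate_explanation(T_l, U, x_star, w_m, too_much=True):
    """Add the negative examples for class l implied by an attribute explanation.

    T_l:      list of (feature_vector, label) pairs, the training set for class l.
    U:        array (num_unlabeled, num_features), the unlabeled pool.
    x_star:   feature vector of the actively chosen image x*.
    w_m:      weights of the relative attribute predictor, r_m(x) = w_m . x.
    too_much: True for "x* is too a_m to be l",
              False for "x* is not a_m enough to be l".
    """
    r_star = float(np.dot(w_m, x_star))     # attribute strength of x*
    r_pool = U @ w_m                        # strengths over the unlabeled pool
    mask = (r_pool >= r_star) if too_much else (r_pool <= r_star)
    # All selected images receive a negative label for class l.
    T_l.extend((U[i], -1) for i in np.where(mask)[0])
    return T_l
```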
As is evident, the supervisor provides an explanation only when the classifier is
wrong. Arguably, it seems wasteful to request an explanation when the classifier is
already right. In addition, the form of feedback we consider only explains why an im-
age is not a certain class. As a result only negative labels are propagated to images. Our
approach can easily also incorporate explanations that explain why an image does be-
long to a certain class. We argue that more often than not, several factors come together
to make an image a certain concept e.g. an image is a coast image if it is open, natural,
displays the horizon and a body of water. Hence explaining to the system why an image
is a certain concept can be quite cumbersome. On the other hand, the supervisor needs
to identify only one reason why an image is not a certain class. Our formulation can be
easily extended to incorporate multi-attribute explanations if the supervisor so chooses.

Influence of Imperfect Attribute Predictors: The updated T_l may contradict the un-
derlying ground truth labels (not available to the system). For instance, when the su-
pervisor says "x* is too open to be a forest", there may be images in the unlabeled
pool U that are more open than x* but are in fact forest images. Or, due to faulty at-
tribute predictors, a forest image that is less open than the query image is predicted to
be more open, and is hence labeled to be not forest. Hence, when in conflict, the label-
based feedback from the supervisor (which is the ground truth by definition) overrides
the label of an image that was or will be deduced from any past or future attributes-
based feedback. In this way, if the entire pool of unlabeled images U were sequentially
presented to the supervisor, even with inaccurate attributes-based feedback or faulty
attribute predictors, it will be correctly labeled. Of course, the scenario we are propos-
ing is one where the supervisor need not label all images in the dataset, and employ
attributes-based feedback instead to propagate labels to unlabeled images. Hence, in
any realistic setting, some of these labels are bound to be incorrectly propagated. Our
approach relies on the assumption that the benefit obtained by propagating the labels to
many more images outweighs the harm done by inaccuracies in the propagation caused
by faulty attribute predictors. This is reasonable, especially since we use SVM classi-
fiers to build our category models, which involve maximizing a soft margin making the
classifier robust to outliers in the labeled training data. Of course, if the attribute pre-
dictors are severely flawed (in the extreme case: predicting the opposite attribute than
what they were meant to predict), this would not be the case. But as we see in our re-
sults in Section 6, in practice the attribute predictors are reliable enough allowing for
improved performance. Since the attribute predictors are pre-trained, their effectiveness
is easy to evaluate (perhaps on held-out set from the data used to train the attributes
in the first place). Presumably one would not utilize severely faulty attribute predictors
in real systems. Recall, attributes have been used in literature for zero-shot learning
tasks [7, 8] which also hinge on reasonable attribute predictors. We now describe our
choice of attribute predictors.

3.2 Attributes Predictors


The feedback provided by the supervisor relates the query image actively selected by
the learner to the concept the learner believes the image depicts. Hence relative at-
tributes [8] are a natural choice. They were shown to respect relative judgements from
humans more faithfully than the score of a binary classifier, which is desirable in our
approach. We provide a brief overview of relative attributes below.
Suppose we have a vocabulary of M attributes A = {a_m}. These attributes are
mid-level concepts that can be shared across the categories the supervisor is trying to
teach the machine. For celebrities, relevant attributes may be "age", "chubby", etc.,
while for scenes they may be "open", "natural", etc. This vocabulary of attributes is
learnt offline only once, using a set of training images I = {i} represented in R^n by
feature vectors {x_i}. For each attribute, we are given two forms of supervision: a set
of ordered pairs of images O_m = {(i, j)} and a set of un-ordered pairs S_m = {(i, j)}
such that (i, j) ∈ O_m ⇒ i ≻ j, i.e. image i has a stronger presence of attribute
a_m than j, and (i, j) ∈ S_m ⇒ i ∼ j, i.e. i and j have similar relative strengths of
a_m. Either O_m or S_m, but not both, can be empty. We wish to learn a ranking function

r_m(x_i) = w_m^T x_i for m = 1, . . . , M, such that the maximum number of the following
constraints is satisfied:

∀(i, j) ∈ O_m : w_m^T x_i > w_m^T x_j,    ∀(i, j) ∈ S_m : w_m^T x_i = w_m^T x_j.    (3)

This being an NP-hard problem, a relaxed version is solved using a large-margin learning-
to-rank formulation similar to that of Joachims [37] and adapted in [8]. With this, given
any image x in our pool of unlabeled images U, one can compute the relative strength
of each of the M attributes as r_m(x) = w_m^T x.
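As an illustration only, ordered pairs O_m can be turned into an approximate linear ranker by training a large-margin classifier on feature differences; this ignores the similarity constraints S_m and the exact slack formulation of [37]/[8], and the names used are ours.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_relative_attribute(X, ordered_pairs, C=0.1):
    """Learn w_m so that w_m . x_i > w_m . x_j for every (i, j) in O_m.

    X: (num_images, num_features) feature matrix.
    ordered_pairs: list of (i, j) index pairs with image i stronger in the attribute.
    Each pair contributes (x_i - x_j, +1) and (x_j - x_i, -1), so a linear
    max-margin classifier on the differences yields the ranking weights w_m.
    """
    diffs, labels = [], []
    for i, j in ordered_pairs:
        diffs.append(X[i] - X[j]); labels.append(+1)
        diffs.append(X[j] - X[i]); labels.append(-1)
    svm = LinearSVC(C=C, fit_intercept=False)
    svm.fit(np.asarray(diffs), np.asarray(labels))
    return svm.coef_.ravel()   # w_m; attribute strengths are then X_new @ w_m
```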

4 Applications
We consider four different applications to demonstrate our approach. For each of these
scenarios, the label-based feedback is interpreted differently, but the attributes-based
feedback has the same interpretation as discussed in Section 3.1. The label-based feed-
back carries different amounts of information, and thus influences the classifiers to vary-
ing degrees. Recall that via the label-based feedback, the supervisor confirms or rejects
and corrects the classifier's predicted label for the actively chosen image x*.
Classification: This is the classical scenario where an image is to be classified into only
one of a pre-determined number of classes. In this application, the label-based
feedback is very informative. It not only specifies what the image is, it also indicates
what the image is not (all other categories in the vocabulary). Let's say the label pre-
dicted by the classifier for the actively chosen image x* is l. If the classifier is correct,
the supervisor can confirm the prediction. This would result in T_l = T_l ∪ {(x*, 1)},
and T_n = T_n ∪ {(x*, −1)} ∀n ≠ l. On the other hand, if the prediction was in-
correct, the supervisor would indicate so, and provide the correct label q. In this case,
T_q = T_q ∪ {(x*, 1)} and T_n = T_n ∪ {(x*, −1)} ∀n ≠ q (note that this includes
T_l = T_l ∪ {(x*, −1)}). In this way, every label-based feedback impacts all the K classi-
fiers, and is thus very informative. Recall that the attributes-based explanation can only
affect (the negative side of) one classifier at a time. We would thus expect that attributes-
based feedback in a classification scenario would not improve the performance of the
learner by much, since the label-based feedback is already very informative.
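For concreteness, the label-based bookkeeping for the classification setting could be implemented as in the sketch below (training sets stored as lists of (feature, label) pairs); this is our paraphrase of the update rules above, not the authors' code.

```python
def classification_label_feedback(T, x_star, predicted, confirmed, correct=None):
    """Apply label-based feedback in the classification setting.

    T:         dict mapping class name -> list of (feature_vector, label) pairs.
    x_star:    feature vector of the actively chosen image x*.
    predicted: class l predicted by the learner.
    confirmed: True if the supervisor confirms the prediction.
    correct:   true class q supplied by the supervisor when rejecting.
    """
    positive = predicted if confirmed else correct
    for k in T:
        # The confirmed (or corrected) class gets x* as a positive example;
        # every other class, including a rejected l, gets it as a negative.
        T[k].append((x_star, +1 if k == positive else -1))
    return T
```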
Large Vocabulary Classification: Next we consider is one that is receiving a lot of
attention in the community today: a classification scenario where the number of classes
is very large (and perhaps evolving over time); for example (celebrity) face recogni-
tion [10] or specialized fine-grained classification of animal species [16] or classifying
thousands of object categories [38]. In this scenario, the supervisor can verify if the
classifier's prediction for x* is correct or not. But when incorrect, it would be very time
consuming for, or beyond the expertise of, the supervisor to seek out the correct label
from the large vocabulary of categories to provide as feedback. Hence the supervisor
only confirms or rejects the prediction, and does not provide a correction if rejecting
it. In this case, if the classifier's prediction l is confirmed, similar to classification,
T_l = T_l ∪ {(x*, 1)}, and T_n = T_n ∪ {(x*, −1)} ∀n ≠ l. However, if the classi-
fier's prediction is rejected, we only have T_l = T_l ∪ {(x*, −1)}. In this scenario, the
label feedback has less information when the classifier is wrong, and hence the classifier
would have to rely more on the attributes-based explanation to learn from its mistakes.

Annotation: In this application, each image can be tagged with more than one class.
Hence, specifying (one of the things) the image is, does not imply anything about what
else it may or may not be. Hence when the classifier predicts that x* has tag l (as
one of its tags), if the supervisor confirms it we have T_l = T_l ∪ {(x*, 1)}, and if the
classifier's prediction is rejected, we have T_l = T_l ∪ {(x*, −1)}. Note that only the
l-th classifier is affected by this feedback, and all other classifiers remain unaffected.
be consistent across the different applications, we assume that at each iteration in the
learning process, the classifier makes a single prediction, and the supervisor provides
a single response to this prediction. In the classification scenario, this single response
can affect all classifiers as we saw above. But in annotation, to affect all classifiers, the
supervisor would have to comment on each tag individually (as being relevant to image
x* or not). This would be time consuming and does not constitute a single response.
Since the label-based feedback is relatively uninformative in this scenario, we expect
the attributes-based feedback to have a large impact on the classifier's ability to learn.
Biased Binary Classification: The last scenario we consider is that of a binary clas-
sification problem, where the negative set of images is much larger than the positive
set. This is frequently encountered when training an object detector for instance, where
there are millions of negative image windows not containing the object, but only a few
hundreds or thousands of positive windows containing the object of interest. In this case,
an example can belong to either the positive or negative class (similar to classification),
and the label-based feedback is informative. However, being able to propagate the neg-
ative labels to many more instances via our attributes-based explanation can potentially
accelerate the learning process. Approaches that mine hard negative examples [34] are
motivated by related observations. However, what we describe above is concerned more
with the bias in the class distribution, and not so much with the volume of data itself.
For all three classification-based applications, at test time, an image is assigned to
the class that receives the highest probability. For annotation, all classes that have a
probability greater than 0.5 are predicted as tags for the image.

5 Experimental Set-Up
Datasets: We experimented with the two datasets used in [8]. The first is the out-
door scene recognition dataset [39] (Scenes) containing 2688 images from 8 categories:
coast, forest, highway, inside-city, mountain, open-country, street and tall-building de-
scribed via gist features [39]. The second dataset is a subset of the PubFig: Public
Figures Face Database [10] (Faces) containing 772 images of 8 celebrities: Alex Ro-
driguez, Clive Owen, Hugh Laurie, Jared Leto, Miley Cyrus, Scarlett Johansson, Viggo
Mortensen and Zac Efron described via gist and color features [8]. For both our datasets,
we used the pre-trained relative attributes made publicly available by Parikh et al. [8].
For Scenes they include natural, open, perspective, large-objects, diagonal-plane and
close-depth. For Faces: Masculine-looking, White, Young, Smiling, Chubby, Visible-
Forehead, Bushy-Eyebrows, Narrow-Eyes, Pointy-Nose, Big-Lips and Round-Face.

Applications: For Scenes, we show results on all four applications described above.
The binary classification problems are set up as learning only one of the 8 categories

(a) This image is not perspective enough to be a street scene. (b) Zac Efron is too young to be Hugh Laurie (bottom right). (c) open-country, mountain, forest. (d) street, inside-city, tall-building.

Fig. 2. (a), (b) Example feedback collected from subjects. (c), (d) Example images from the Scenes
dataset annotated with multiple tags by subjects

through the entire learning process, which is selected at random in each trial. Images
from the remaining 7 categories are considered to be negative examples. This results in
more negative than positive examples. Many of the images in this dataset can be tagged
with more than one category. For instance, many of the street images can just as easily
be tagged as inside-city, many of the coast images have mountains in the background,
etc. Hence, this dataset is quite suitable for studying the annotation scenario. We collect
annotation labels for the entire dataset as we will describe next. For Faces, we show
results for classification, binary classification and large vocabulary classification (anno-
tation is not applicable since each image displays only one persons face). For evaluation
purposes, we had to collect exhaustive real human data as attributes-based explanations
(more details below). This restricted the vocabulary of our dataset to be small. However
the domain of celebrity classification lends itself well to large vocabulary classification.

Collecting Attributes-Based Feedback: To gather attributes-based feedback from real


users, we conducted human studies on Amazon Mechanical Turk. We gather all the
data offline, allowing us to run automatic experiments without a live user in the loop,
while still using real data from users. Note that the attribute feedback depends on the
image being classified, and the label predicted by the classifier for that image. Since we
cannot anticipate ahead of time what the learner will predict as the label of an image
if actively chosen as the query at some point during the learning process, we collect
feedback for each image being classified into each one of the classes. As stated earlier,
this restricted the number of categories our datasets can have. Note that the restriction
on the number of classes was only to systematically evaluate our approach, and is not a
reflection of any limitation in our approach's ability to scale to a large number of classes.

Collecting Ground Truth Annotation Labels: We had 5 subjects provide feedback
for a total of 240 images (30 per category) from the Scenes dataset. Six attributes al-
low for 12 possible feedback statements for each category: "this image is too open to
be a forest" or "this image is not natural enough to be a forest" and so on. The study
asked subjects to select one of the 12 statements that was most true. An image was
annotated with a statement for that category if the statement was chosen by at least 2
subjects. In 18% of the cases, we found that no such statement existed, i.e., all 5 subjects

Fig. 3. Results for four different applications on the Scenes dataset. Top row: relative attributes space (att); bottom row: raw feature space (feat). Panels: Classification, Large Vocab. Classification, Biased Bin. Classification and Annotation; each plot shows accuracy versus the number of iterations for the zero-shot, baseline, proposed (oracle) and proposed (human) methods. See text for details.

selected a different statement. These cases were left unannotated. In our experiments,
no attributes-based explanation was provided for those particular image-category com-
binations. Hence the performance reported in our results is an underestimate.
For the Faces dataset, since the categories correspond to individuals, the attributes we
use can be expected to be the same across all images in the category. Hence we collect
feedback at the category-level. For each pair of categories, our 11 attributes allow for
22 statements such as "Miley Cyrus is too young to be Scarlett Johansson" or "Miley
Cyrus is too chubby to be Scarlett Johansson" and so on. We showed subjects example
images for the celebrities. Again, 5 subjects selected one of the statements that was
most true. We found that all category pairs had at least two subjects agreeing on the
feedback. Note that during the learning process, unlike Scenes, every time an image
from class q in the Faces dataset is classified as class l, the same feedback will be used.
Example responses collected from subjects can be seen in Fig. 2 (a), (b).
We annotate all 2688 images in the Scenes dataset with multiple tags. For each im-
age, 5 subjects on Amazon Mechanical Turk were asked to select any of the 8 categories
that they think can be aptly used to describe the image. A tag was retained if at least
2 subjects selected it. For the sake of completeness, for each image, we append its newly
collected tags with its ground truth tag from the original dataset (although in most cases
the ground truth label was already provided by subjects). On average, we have 1.6 tags
for every image in the dataset. Example annotations can be seen in Fig. 2 (c), (d).
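The tag-aggregation step can be sketched as follows (a hypothetical helper; the 2-vote threshold and the appending of the original ground-truth tag follow the description above, and all names are illustrative):

```python
from collections import Counter

def aggregate_tags(subject_tag_lists, ground_truth_tag, min_votes=2):
    """Retain a tag if at least `min_votes` of the subjects selected it, and
    always include the image's original ground-truth label."""
    votes = Counter(tag for tags in subject_tag_lists for tag in set(tags))
    retained = {tag for tag, count in votes.items() if count >= min_votes}
    retained.add(ground_truth_tag)
    return sorted(retained)

# Example: 5 subjects tag one image; 'street' and 'inside-city' get >= 2 votes.
subjects = [["street"], ["street", "inside-city"], ["inside-city"],
            ["tall-building"], ["street"]]
print(aggregate_tags(subjects, ground_truth_tag="street"))
```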

6 Results

We now present our experimental results on the Scenes and Faces datasets for the ap-
plications described above. We will compare the performance of our approach that uses
both the label-based feedback (Section 4) as well as attributes-based explanations (Sec-
tion 3.1) for each mistake the classifier makes, to the baseline approach of only using

Fig. 4. Results for three different applications on the Faces dataset. Top row: relative attributes space (att); bottom row: raw feature space (feat). Panels: Classification, Large Vocab. Classification and Biased Bin. Classification; each plot shows accuracy versus the number of iterations for the zero-shot, baseline, proposed (oracle) and proposed (human) methods. See text for details.

the label-based feedback (Section 4). Hence the difference in information in the label-
based feedback for the different applications affects both our approach and the baseline.
Any improvement in performance is due to the use of attributes for providing feedback.
Our results are shown in Figs. 3 and 4.² We show results in the raw feature space
(feat) i.e. gist for Scenes and gist+color for Faces, as well as in the relative attributes
space (att). The results shown are average accuracies of 20 random trials. For each trial,
we randomly select 50 images as the initial unlabeled training pool of images. Note
that for Scenes, these 50 images come from the 240 images we had subjects annotate
with feedback. The remaining images (aside from the 240 images) in the datasets are
used as test images. On the x-axis are the number of iterations in the active learning
process. Each iteration corresponds to one response from the supervisor as described
earlier. For classification-based applications, accuracy is computed as the proportion of
correctly classified images. For annotation, the accuracy of each of the tags (i.e. class)
in the vocabulary is computed separately and then averaged.
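For concreteness, the two accuracy measures can be sketched as below; the per-tag measure is our interpretation of computing each tag's accuracy separately and then averaging, and all names are illustrative:

```python
def classification_accuracy(pred_labels, true_labels):
    """Proportion of correctly classified images."""
    return sum(p == t for p, t in zip(pred_labels, true_labels)) / len(true_labels)

def annotation_accuracy(pred_tag_sets, true_tag_sets, vocabulary):
    """Accuracy of each tag (treated as a binary label over all images),
    averaged over the vocabulary."""
    per_tag = []
    for tag in vocabulary:
        hits = sum((tag in p) == (tag in t)
                   for p, t in zip(pred_tag_sets, true_tag_sets))
        per_tag.append(hits / len(true_tag_sets))
    return sum(per_tag) / len(per_tag)

# Toy usage on a handful of images.
print(classification_accuracy(["coast", "forest", "street"],
                              ["coast", "forest", "highway"]))
print(annotation_accuracy([{"street"}, {"coast"}],
                          [{"street", "inside-city"}, {"coast"}],
                          vocabulary=["street", "inside-city", "coast"]))
```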
As expected, for classification where the label-based feedback is already informa-
tive, attributes-based feedback provides little improvement. On the other hand, for an-
notation where label-based feedback is quite uninformative, attributes-based feedback
provides significant improvements. This demonstrates the power of attributes-based
feedback in propagating information (i.e. class labels), even if weak, to a large set of
unlabeled images. This results in faster learning of concepts. Particularly with large vo-
cabulary classification, we see that the attributes-based feedback boosts performance
in the raw feature space more than in the attributes space. This is because learning in
high-dimensional space with few examples is challenging. The additional feedback of
our approach can better tame the process. Attributes-based feedback also boosts perfor-
mance in the biased binary classification scenario more than in classical classification.
In datasets with a larger bias towards negative examples, the improvement may be even
² Number of iterations for binary classification was truncated because performance leveled off.

larger. Note that in many cases, the same classification accuracy can be reached via our
approach using only a fifth of the user's time compared with the baseline approach. Clearly,
attributes-based feedback allows classifiers to learn more effectively from their mis-
takes, leading to accelerated learning even with fewer labeled examples.
In Figs. 3 and 4 we also show an oracle-based accuracy for our approach. This is
computed by selecting the feedback at each iteration that maximizes the performance on
a validation set of images. This demonstrates the true potential our approach holds. The
gap between the oracle and the performance using human feedback is primarily due to
the lack of motivation of Mechanical Turk workers to identify the best feedback to provide.
We expect real users of the system to perform between the two solid (red and black)
curves. In spite of the noisy responses from workers, the red curve performs favorably
in most scenarios. This speaks to the robustness of the proposed straightforward idea of
using attributes for classifier feedback.
In Figs. 3 and 4 we also show the accuracy of zero-shot learning using the direct at-
tribute prediction model of Lampert et al. [7]. Briefly, each category is simply described
in terms of which attributes are present and which ones are not. These binary attributes
are predicted for a test image, and the image is assigned to the class with the most
similar signature. We use the binary attributes and binary attributes-based descriptions
of categories from [8]. Note that in this case, the binary attributes were trained using
images from all categories, and hence the categories are not "unseen". More impor-
tantly, the zero-shot learning descriptions of the categories were exactly what was used
to train the binary attribute classifiers in the first place. We expect performance to be
significantly worse if real human subjects provided the descriptions of these categories.
We see that even this optimistic zero-shot learning performance compares poorly to our
approach. While our approach also alleviates supervision burden (but not down to zero,
of course) by using attributes-based feedback, it still allows access to discriminative
learning, making the proposed form of supervision quite powerful.
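The zero-shot baseline, as described here (binary attribute predictions matched against per-class binary signatures), can be sketched as follows; the 0.5 thresholding and the Hamming-style matching are simplifying assumptions rather than the exact probabilistic procedure of [7]:

```python
import numpy as np

def zero_shot_predict(attribute_scores, class_signatures):
    """Threshold attribute scores into binary predictions and assign the image
    to the class whose binary attribute signature matches best (number of
    agreeing attributes).

    attribute_scores : length-K sequence of attribute classifier outputs in [0, 1]
    class_signatures : dict mapping class name -> length-K binary array
    """
    predicted = (np.asarray(attribute_scores) > 0.5).astype(int)
    similarity = {c: int(np.sum(predicted == sig))
                  for c, sig in class_signatures.items()}
    return max(similarity, key=similarity.get)

# Toy usage with three attributes and two classes.
signatures = {"forest": np.array([1, 0, 1]), "coast": np.array([1, 1, 0])}
print(zero_shot_predict([0.9, 0.2, 0.7], signatures))  # -> 'forest'
```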

7 Discussion and Conclusion

The proposed learning paradigm raises many intriguing questions. It would be interest-
ing to consider the different strategies a user may use to identify what feedback should
be provided to the classifier for its mistake. One strategy may be to provide the most
accurate feedback. That is, when saying "this image is too open to be a forest", ensur-
ing that there is very little chance any image that is more open than this one could be
a forest. This ensures that incorrect labels are not propagated across the unlabeled im-
ages, but there may be very few images to transfer this information to in the first place.
Another strategy may be to provide very precise feedback. That is, when saying "the
person in this image (Miley Cyrus) is too young to be Scarlett Johansson", ensuring
that Scarlett Johansson is very close in age to Miley Cyrus. This ensures that this feed-
back can be propagated to many celebrities (all that are older than Scarlett Johansson).
Yet another strategy may be to simply pick the most obvious feedback. Of the many
attributes celebrities can have, some are likely to stand out to users more than others.
For example, if a picture of Miley Cyrus is classified as Zac Efron, most of us would
probably react to the fact that the genders are not consistent. Analyzing the behavior

of users, and perceptual sensitivity of users along the different attributes is part of fu-
ture work. Developing active learning strategies for the proposed learning paradigm is
also part of future work. An image should be actively selected with considerations that
the supervisor will provide an explanation that would propagate to many other images
relative to the selected image. One can also envision accounting for distance between
images in the attributes space when propagating labels. For instance, if a particular im-
age is too open to be a forest, an image that is significantly more open is extremely
unlikely to be a forest. This leads to interesting calibration questions: when a human
views an image as being significantly more open, what is the corresponding difference
in the machine prediction of the relative attribute "open"? One could also explore
alternative interfaces for the supervisor to provide feedback e.g. providing an automati-
cally selected short-list of relevant attributes to select from. It would be worth exploring
the connections between using attributes-based feedback to transfer information to un-
labeled images and semi-supervised learning. Finally, exploring the potential of this
novel learning paradigm for other tasks such as object detection is part of future work.
Conclusion: In this work we advocate the novel use of attributes as a mode of com-
munication through which a human supervisor can provide feedback to an actively
learning machine classifier when it predicts an incorrect label for an image. This feedback allows the
classifier to propagate category labels to many more images in the unlabeled dataset
besides just the individual query images that are actively selected. This in turn allows
it to learn visual concepts faster, saving significant user time and effort. We employ a
straightforward approach to incorporate this attributes-based feedback into discrimi-
natively trained classifiers. We demonstrate the power of this feedback on a variety of
visual recognition applications including image classification and annotation on scenes
and faces. What is most exciting about attributes is their ability to allow for communi-
cation between humans and machines. This work takes a step towards exploiting this
channel to build smarter machines more efficiently.

References
1. Smith, J., Naphade, M., Natsev, A.: Multimedia semantic indexing using model vectors. In:
ICME (2003)
2. Rasiwasia, N., Moreno, P., Vasconcelos, N.: Bridging the gap: Query by semantic example.
IEEE Transactions on Multimedia (2007)
3. Naphade, M., Smith, J., Tesic, J., Chang, S., Hsu, W., Kennedy, L., Hauptmann, A., Curtis,
J.: Large-scale concept ontology for multimedia. IEEE Multimedia (2006)
4. Zavesky, E., Chang, S.F.: Cuzero: Embracing the frontier of interactive visual search for
informed users. In: Proceedings of ACM Multimedia Information Retrieval (2008)
5. Douze, M., Ramisa, A., Schmid, C.: Combining attributes and fisher vectors for efficient
image retrieval. In: CVPR (2011)
6. Wang, X., Liu, K., Tang, X.: Query-specific visual semantic spaces for web image re-ranking.
In: CVPR (2011)
7. Lampert, C., Nickisch, H., Harmeling, S.: Learning to detect unseen object classes by
between-class attribute transfer. In: CVPR (2009)
8. Parikh, D., Grauman, K.: Relative attributes. In: ICCV (2011)
9. Farhadi, A., Endres, I., Hoiem, D., Forsyth, D.: Describing objects by their attributes. In:
CVPR (2009)

10. Kumar, N., Berg, A., Belhumeur, P., Nayar, S.: Attribute and simile classifiers for face veri-
fication. In: ICCV (2009)
11. Berg, T.L., Berg, A.C., Shih, J.: Automatic Attribute Discovery and Characterization from
Noisy Web Data. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part I.
LNCS, vol. 6311, pp. 663–676. Springer, Heidelberg (2010)
12. Wang, J., Markert, K., Everingham, M.: Learning models for object recognition from natural
language descriptions. In: BMVC (2009)
13. Wang, G., Forsyth, D.: Joint learning of visual attributes, object classes and visual saliency.
In: ICCV (2009)
14. Wang, Y., Mori, G.: A Discriminative Latent Model of Object Classes and Attributes. In:
Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part V. LNCS, vol. 6315,
pp. 155–168. Springer, Heidelberg (2010)
15. Ferrari, V., Zisserman, A.: Learning visual attributes. In: NIPS (2007)
16. Branson, S., Wah, C., Schroff, F., Babenko, B., Welinder, P., Perona, P., Belongie, S.: Visual
Recognition with Humans in the Loop. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.)
ECCV 2010, Part IV. LNCS, vol. 6314, pp. 438–451. Springer, Heidelberg (2010)
17. Fergus, R., Bernal, H., Weiss, Y., Torralba, A.: Semantic Label Sharing for Learning with
Many Categories. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part I.
LNCS, vol. 6311, pp. 762–775. Springer, Heidelberg (2010)
18. Wang, G., Forsyth, D., Hoiem, D.: Comparative object similarity for improved recognition
with few or no examples. In: CVPR (2010)
19. Mahajan, D., Sellamanickam, S., Nair, V.: A joint learning framework for attribute models
and object descriptions. In: ICCV (2011)
20. Kovashka, A., Vijayanarasimhan, S., Grauman, K.: Actively selecting annotations among
objects and attributes. In: ICCV (2011)
21. Wah, C., Branson, S., Perona, P., Belongie, S.: Multiclass recognition and part localization
with humans in the loop. In: ICCV (2011)
22. Branson, S., Wah, C., Schroff, F., Babenko, B., Welinder, P., Perona, P., Belongie, S.: Visual
Recognition with Humans in the Loop. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.)
ECCV 2010, Part IV. LNCS, vol. 6314, pp. 438–451. Springer, Heidelberg (2010)
23. Parikh, D., Grauman, K.: Interactively building a discriminative vocabulary of nameable
attributes. In: CVPR (2011)
24. Bourdev, L., Maji, S., Malik, J.: Describing people: A poselet-based approach to attribute
classification. In: ICCV (2011)
25. Russell, B., Torralba, A., Murphy, K., Freeman, W.: Labelme: a database and web-based tool
for image annotation. IJCV (2008)
26. Farhadi, A., Endres, I., Hoiem, D.: Attribute-centric recognition for cross-category general-
ization. In: CVPR (2010)
27. Bourdev, L., Malik, J.: Poselets: Body part detectors trained using 3d human pose annota-
tions. In: ICCV (2009)
28. Vijayanarasimhan, S., Grauman, K.: Multi-level active prediction of useful image annota-
tions for recognition. In: NIPS (2008)
29. Siddiquie, B., Gupta, A.: Beyond active noun tagging: Modeling contextual interactions for
multi-class active learning. In: CVPR (2010)
30. Raghavan, H., Madani, O., Jones, R.: Interactive feature selection. IJCAI (2005)
31. Druck, G., Settles, B., McCallum, A.: Active learning by labeling features. In: EMNLP
(2009)
32. Zaidan, O., Eisner, J., Piatko, C.: Using annotator rationales to improve machine learning for
text categorization. In: NAACL - HLT (2007)
33. Donahue, J., Grauman, K.: Annotator rationales for visual recognition. In: ICCV (2011)

34. Felzenszwalb, P., Girshick, R., McAllester, D., Ramanan, D.: Object detection with discrim-
inatively trained part based models. PAMI (2010)
35. Golland, P.: Discriminative direction for kernel classifiers. In: NIPS (2001)
36. Shrivastava, A., Singh, S., Gupta, A.: Constrained Semi-Supervised Learning Using At-
tributes and Comparative Attributes. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y.,
Schmid, C. (eds.) ECCV 2012, Part III. LNCS, vol. 7574, pp. 370–384. Springer, Heidelberg
(2012)
37. Joachims, T.: Optimizing search engines using clickthrough data. In: KDD (2002)
38. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A Large-Scale Hier-
archical Image Database. In: CVPR 2009 (2009)
39. Oliva, A., Torralba, A.: Modeling the shape of the scene: A holistic representation of the
spatial envelope. IJCV (2001)
Constrained Semi-Supervised Learning
Using Attributes and Comparative Attributes

Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta

The Robotics Institute, Carnegie Mellon University, Pittsburgh (PA), USA


http://graphics.cs.cmu.edu/projects/constrainedSSL/

Abstract. We consider the problem of semi-supervised bootstrap learning for
scene categorization. Existing semi-supervised approaches are typically unreli-
able and face semantic drift because the learning task is under-constrained. This
is primarily because they ignore the strong interactions that often exist between
scene categories, such as the common attributes shared across categories as well
as the attributes which make one scene different from another. The goal of this
paper is to exploit these relationships and constrain the semi-supervised learn-
ing problem. For example, the knowledge that an image is an auditorium can
improve labeling of amphitheaters by enforcing constraint that an amphitheater
image should have more circular structures than an auditorium image. We pro-
pose constraints based on mutual exclusion, binary attributes and comparative
attributes and show that they help us to constrain the learning problem and avoid
semantic drift. We demonstrate the effectiveness of our approach through exten-
sive experiments, including results on a very large dataset of one million images.

1 Introduction
How do we exploit the sea of visual data available online? Most supervised computer
vision approaches are still impeded by their dependence on manual labeling, which, for
rapidly growing datasets, requires an incredible amount of manpower. The popularity
of Amazon Mechanical Turk and other online collaborative annotation efforts [1, 2] has
eased the process of gathering more labeled data, but it is still unclear whether such
an approach can scale up with the available data. This is exacerbated by the heavy-
tailed distribution of objects in the natural world [3]: a large number of objects occur so
sparsely that it would require a significant amount of labeling to build reliable models.
In addition, human labeling has a practical limitation in that it suffers from semantic and
functional bias. For example, humans might label an image of Christ/Cross as Church
due to high-level semantic connections between the two concepts.
An alternative way to exploit a large amount of unlabeled data is semi-supervised
learning (SSL). A classic example is the bootstrapping method: start with a small
number of labeled examples, train initial models using those examples, then use the
initial models to label the unlabeled data. The model is retrained using the confident
self-labeled examples in addition to original examples. However, most semi-supervised
approaches, including bootstrapping, have often exhibited low and unacceptable accu-
racy because the limited number of initially labeled examples are insufficient to con-
strain the learning process. This often leads to the well known problem of semantic


[Figure 1 layout: initial seed examples and unlabeled data for amphitheater and auditorium; candidates selected by standard bootstrapping; evaluation of binary-attribute constraints (indoor, has seat rows) and the comparative-attribute constraint (has more circular structure); candidates selected by constrained bootstrapping.]

Fig. 1. Standard Bootstrapping vs. Constrained Bootstrapping: We propose to learn multiple clas-
sifiers (auditorium and amphitheater) jointly by exploiting similarities (e.g., both are indoor and
have seating) and dissimilarities (e.g., amphitheater has more circular structures than auditorium)
between the two. We show that joint learning constrains the SSL, thus avoiding semantic drift.

drift [4], where newly added examples tend to stray away from the original meaning
of the concept. The problem of semantic drift is more evident in the field of visual
categorization because intra-class variation is often greater than inter-class variation.
For example, electric trains resemble buses more than steam engines and many
auditoriums appear very similar to amphitheaters.
This paper shows that we can avoid semantic drift and significantly improve perfor-
mance of bootstrapping approach by imposing additional constraints. We build upon
recent work in information extraction [5], and propose a novel semi-supervised image
classification framework where instead of each classifier selecting its own set of images,
we jointly select images for each classifier by enforcing different types of constraints.
We show that coupling scene categories via attributes and comparative attributes¹ pro-
vides us with the constraints necessary for a robust bootstrapping framework. For ex-
ample, consider the case shown in Figure 1. In the case of the nave bootstrapping
approach, the initial amphitheater and auditorium classifiers select self-labeled im-
ages independently which leads to incorrect instances being selected (outlined in red).
However, if we couple the scene categories and jointly label all the images in the dataset,
then we can use the auditorium images to clean the amphitheater images, since the lat-
ter should have more circular structures compared to the former. We demonstrate that
the joint labeling indeed makes the data selection robust and improves the performance
¹ Comparative attributes are special forms of attributes that are used to compare and express
relationships between two nouns. For example, banquet halls are bigger than bedrooms;
amphitheaters are more circular than auditoriums.

significantly. While we only explore the application of these new constraints to boot-
strapping approaches, we believe they are generic and can be applied to other semi-
supervised approaches as well.
Contributions: We present a framework for coupled bootstrap learning and explore its
application to the field of image classification. The input to our system is an ontology
which defines the set of target categories to be learned, the relationships between those
categories and a handful of initial labeled examples. We show that given these initial la-
beled examples and millions of unlabeled images, our approach can obtain much higher
accuracy by coupling the labeling of all the images using multiple classifiers. The key
contributions of our paper are: (a) a semi-supervised image classification framework
which jointly learns multiple classifiers, (b) demonstrating that sharing information
across categories via attributes [6, 7] is crucial to constrain the semi-supervised learning
problem, (c) extending the notion of sharing across categories and showing that infor-
mation sharing can also be achieved by capturing dissimilarities between categories.
Here we build upon the recent work on relative attributes [8] and comparative adjec-
tives [9] to capture differences across categories. In this paper, the relationships be-
tween attributes and scene categories are defined by a human annotator. As opposed to
image-level labeling, attribute and comparative attribute relationships are significantly
cheaper to annotate as they scale with the number of categories (not with the number of
images). Note that the attributes need not be semantic at all [10]. However, the general
benefit of using semantic attributes is that they are human-communicable [11] and we
can obtain them automatically using other sources of data such as text [12].

2 Prior Work
During the past decade, computer vision has seen some major successes due to the in-
creasing amount of data on the web. While using big data is a promising direction, it
is still unclear how we should exploit such a large amount of data. There is a spectrum
of approaches based on the amount of human labeling required to use this data. On
one end of the spectrum are supervised approaches that use as much hand-labeled data
as possible. These approaches have focused on using the power of crowds to generate
hand-labeled training data [1, 2]. Recent works have also focused on active learn-
ing [13–17, 11] to minimize human effort by selecting label requests that are most
informative. On the other end of the spectrum are completely unsupervised approaches,
which use no human supervision and rely on clustering techniques to discover image
categories [18, 19].
In this work, we explore the intermediate range of the spectrum; the domain of semi-
supervised approaches. Semi-supervised learning (SSL) techniques use a small amount
of labeled data in conjunction with a large amount of unlabeled data to learn reliable
and robust visual models. There is a large literature on semi-supervised techniques. For
brevity, we only discuss closely related works and refer the reader to a recent survey on
the subject [20]. The most commonly used semi-supervised approach is the bootstrap-
ping method, also known as self-training. However, bootstrapping typically suffers
from semantic drift [4]; that is, after many iterations, errors in labeling tend to accu-
mulate. To avoid semantic drift, researchers have focused on several approaches such as

using multi-class classifiers [21] or using co-training methods to exploit conditionally
independent feature spaces [22]. Another alternative is to use graph-based methods,
such as the graph Laplacian, for SSL. These methods capture the manifold structure
of the data and encourage similar points to share labels [23]. In computer vision, ef-
ficient graph based methods have been used for labeling of images as well [24]. The
biggest limitation with graph based approaches is the need for similarity measures that
create graphs with no inter-class connections. In the visual world, it is very difficult to
learn such a good visual similarity metric. Often, intra-class variations are larger than
inter-class variations, which make pair-wise similarity based methods of little utility.
To overcome this difficulty, researchers have focused on text based features for better
estimation of visual similarity [25].
In this work, we argue that there exists a richer set of constraints in the visual world
that can help us constrain the SSL-based approaches. We present an approach to com-
bine a variety of such constraints in a standard bootstrapping framework. Our work is
inspired by works from the textual domain [5] that try to couple learning of category
and relation classifiers. However, in our case, we build upon recent advances in the
field of visual attributes [6, 7] and comparative attributes [8, 9] and propose a set of
domain-specific visual constraints to model the coupling between scene categories.

3 Constrained Bootstrapping Framework

Our goal is to use the initial set of labeled seed examples (L) and a large unlabeled
dataset (U) to learn robust image classifiers. Our method iteratively trains classifiers
in a self-supervised manner. It starts by training classifiers using a small amount of
labeled data and then uses these classifiers to label unlabeled data. The most confident
new labels are promoted and added to the pool of data used to train the models, and
the process repeats. The key difference from the standard bootstrapping approach is the
set of constraints that restrict which data points are promoted to the pool of labeled data.
In this work, we focus on learning scene classifiers for image classification. We rep-
resent these classifiers as functions (f : X → Y) which, given input image features x,
predict some label y. Instead of learning these classifiers separately, we propose an ap-
proach which learns these classifiers jointly. Our central contribution is the formulation
of constraints in the domain of image classification. Specifically, we exploit the recently
proposed attribute-based approaches [6, 7] to provide another view of the same data and
enforce multi-view agreement constraints. We also build upon the recent framework of
comparative adjectives [9] to formulate pair-wise labeling constraints. Finally, we use
introspection to perform an additional step of self-refinement to weed out false positives
included in the training set. We describe all the constraints below.

3.1 Output Constraint: Mutual Exclusion (ME)

Classification of a single datapoint by multiple scene classifiers is not an independent
process. We can use this knowledge to enforce certain constraints on the functions
learned for the classifiers. Mathematically, if we know some constraint on the output
values of two classifiers f_1 : X → Y_1 and f_2 : X → Y_2 for an input x, then we can require the

learned functions to satisfy these. One such output constraint is the mutual exclusion
constraint (ME). In mutual exclusion, positive classification by one classifier imme-
diately implies negative classification for the other classifiers. For example, an image
classified as restaurant can be used as a negative example for barn, bridge, etc.
Current semi-supervised approaches enforce mutual exclusion by learning a multi-
class classifier where a positive example of one class is automatically treated as a nega-
tive example for all other classes. However, the multi-class classifier formulation is too
rigid for a learning algorithm. Consider, for example, banquet hall and restaurant,
which are very similar and likely to be confused by the classifier. For such classes, the
initial classifier learned from a few seed examples is not reliable enough; hence, adding
the mutual exclusion constraint causes the classifier to overfit.
We propose an adaptive mutual exclusion constraint. The basic idea is that during
initial iterations, we do not want to enforce mutual exclusion between similar classes
(y_1 and y_2), since this is likely to confuse the classifier. Therefore, we relax the ME
constraint for manually annotated similar classes: a candidate added to the pool of one
is not used as a negative example for the other. However, after a few iterations, we adapt
our mutual exclusion constraints and enforce these constraints across similar classes as
well.
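A minimal sketch of the adaptive mutual exclusion constraint is given below; the iteration at which ME is re-enabled for similar classes (`relax_until`) is an assumed knob, since the text only says "after a few iterations", and all names are illustrative:

```python
def negative_classes(target, all_classes, similar_pairs, iteration, relax_until=10):
    """Return the classes whose promoted images may be used as negatives for
    `target`. During early iterations, manually annotated similar classes are
    exempted; afterwards, mutual exclusion is enforced across all other classes."""
    similar = {b for a, b in similar_pairs if a == target} | \
              {a for a, b in similar_pairs if b == target}
    negatives = set(all_classes) - {target}
    if iteration < relax_until:
        negatives -= similar
    return negatives

classes = ["banquet hall", "restaurant", "barn"]
similar = [("banquet hall", "restaurant")]
print(negative_classes("banquet hall", classes, similar, iteration=3))   # {'barn'}
print(negative_classes("banquet hall", classes, similar, iteration=30))  # both others
```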

3.2 Sharing Commonalities: Binary-Attribute Constraint (BA)


For the second constraint, we exploit the commonalities shared by scene categories. For
example, both amphitheaters and auditoriums have large seating capacity; bed-
rooms and conference rooms are indoors and man-made. We propose to model these
shared properties via attributes [6, 7]. Modeling visual attributes helps us enforce a con-
straint that the promoted instances must also share these properties.
Formally, we model this constraint in a multi-view framework [22]. For a function
f : X → Y, we partition X into views (X_a, X_b) and learn two classifiers f_a and f_b
which can both predict Y. In our case, f_a : X_a → Y is the original classifier which uses
low-level features to predict the scene classes. We model the sharing between multiple
classes via f_b. f_b is a compositional function (f_b : X_b → A → Y) which uses low-level
features to predict attributes A and then uses them to predict scene classes. It should be
noted that even though we use a multi-view framework to model sharing, it is quite a
powerful constraint. In the case of sharing, the function f_b updates at each iteration by
learning a new attribute classifier, X_b → A, which collects large amounts of data from
multiple scene classes (e.g., the indoor attribute classifier picks up training instances
from restaurant, bedroom, conference-room, etc.). Also, note that the human-annotated
mapping from attribute space to scene class, A → Y, remains fixed in our case.
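The compositional second view f_b : X_b → A → Y can be sketched as follows; the simple agreement-style scoring and all names are illustrative, not the exact formulation used in Section 4:

```python
def attribute_view_score(x, attribute_classifiers, class_attribute_map, target_class):
    """Score how well the attributes predicted from low-level features x agree
    with the fixed, human-annotated attribute description of `target_class`.
    `attribute_classifiers` maps attribute name -> callable returning a
    probability; `class_attribute_map` maps class -> set of attributes."""
    score = 0.0
    for name, clf in attribute_classifiers.items():
        p = clf(x)  # probability that attribute `name` is present in x
        if name in class_attribute_map[target_class]:
            score += p          # the class is annotated as having this attribute
        else:
            score += 1.0 - p    # the class is annotated as lacking this attribute
    return score / len(attribute_classifiers)

# Toy usage with hand-made attribute "classifiers" over 2-D features.
clfs = {"indoor": lambda x: x[0], "has seat rows": lambda x: x[1]}
cmap = {"auditorium": {"indoor", "has seat rows"}, "amphitheater": {"has seat rows"}}
x = [0.9, 0.8]
print(attribute_view_score(x, clfs, cmap, "auditorium"))    # high agreement
print(attribute_view_score(x, clfs, cmap, "amphitheater"))  # lower: x looks indoor
```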

3.3 Pairwise Constraint: Comparative Attributes (CA)


The above two constraints are unary in nature: these constraints still assume that the
labeling procedures for two instances X1 and X2 should be completely independent
of each other. Graph-based approaches [20, 24] have focused on constraining the labels of
instances X_1, X_2 based on similarity; that is, if two images are similar they should have
the same labels. However, learning semantic similarity using image features is an extremely

difficult problem, specifically because of high intra-class variations. In this paper, we
model stronger and richer pairwise constraints on the labeling of unlabeled images using
comparative attributes. For example, if an image X_1 is labeled as auditorium, then
another image X_2 can be labeled as amphitheater if and only if it has more circular
structures than X_1.
Formally, for a given pair of scene classes, f_1 : X_1 → Y_1 and f_2 : X_2 → Y_2,
we model the pairwise constraints using a function f^c : (X_1, X_2) → Y_c and enforce
the constraint that (f_1, f_2, f^c) should produce a consistent triplet (y_1, y_2, y_c) for a given
pair of images (x_1, x_2). Some examples of consistent triplets in our case would include
(field, barn, more open space) and (church, cemetery, has larger structures) which mean
field has more open space than barn and church has larger structures than cemetery.
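The consistency check implied by these triplets can be sketched as follows (all names are illustrative; the soft version actually used in the energy function appears in Section 4, Eq. (3)):

```python
def consistent_pair(y1, y2, pair_features, comparative_classifiers, relations,
                    threshold=0.5):
    """Return True only if every comparative attribute that the ontology
    `relations` asserts between classes (y1, y2) also fires on the pair features.
    `comparative_classifiers` maps attribute name -> callable returning a score
    in [0, 1]; `relations` maps (class1, class2) -> list of comparative attributes."""
    for attribute in relations.get((y1, y2), []):
        if comparative_classifiers[attribute](pair_features) < threshold:
            return False
    return True

relations = {("field", "barn"): ["more open space"],
             ("church", "cemetery"): ["has larger structures"]}
clfs = {"more open space": lambda f: 0.9, "has larger structures": lambda f: 0.3}
print(consistent_pair("field", "barn", None, clfs, relations))       # True
print(consistent_pair("church", "cemetery", None, clfs, relations))  # False
```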

3.4 Introspection or Self-cleaning


In iterative semi-supervised approaches, a classifier should ideally improve with each
iteration. Empirically, these classifiers tend to make more mistakes in the earlier itera-
tions as they are trained on a very small amount of data. Based on these two observations,
we introduce an additional step of introspection where after every five iterations, start-
ing at fifteen, we use the full framework to score already included training data (instead
of the unlabeled data) and drop positives that receive very low scores. This results in
further performance improvement of the learned classifiers.
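A sketch of this introspection step is given below; the score threshold is an assumed parameter, since the text only specifies dropping positives that receive "very low scores", and all names are illustrative:

```python
def introspect(labeled_pool, scorer, iteration, start=15, every=5, min_score=0.2):
    """Every `every` iterations, starting at iteration `start`, re-score the
    already-promoted training examples with the full framework (`scorer`) and
    drop weak positives; otherwise return the pool unchanged."""
    if iteration < start or (iteration - start) % every != 0:
        return labeled_pool
    return [(x, y) for (x, y) in labeled_pool if scorer(x, y) >= min_score]

# Toy usage: drop the example that the current model scores poorly.
pool = [("img1", "barn"), ("img2", "barn")]
scores = {("img1", "barn"): 0.9, ("img2", "barn"): 0.05}
print(introspect(pool, lambda x, y: scores[(x, y)], iteration=15))
```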

4 Mathematical Formulation: Putting It Together

We now describe how we incorporate the constraints described above in a bootstrap-


ping semi-supervised approach. Figure 2 shows the outline of our approach. We have
a set of binary scene classifiers f1 ...fN , attribute classifiers f1a ...fK a
and comparative
c c
attribute classifiers f1 ...fM . Initially, these classifiers are learned using seed examples
but are updated at each iteration using new labeled data. At each iteration, we would
like to label the large unlabeled corpus (U) and obtain the confidence of each labeling.
Instead of labeling all images separately, we label them jointly using our constraints.
We represent all images in U as nodes in a fully connected graph. The most likely as-
signment of each image (node) in the graph can be posed as minimizing the following
energy function E(y) over class label assignments y = {y_1, ..., y_{|U|}}:

    E(y) = Σ_{x_i ∈ U} φ(x_i, y_i) + Σ_{(x_i, x_j) ∈ U²} ψ(x_i, x_j, y_i, y_j)    (1)

where φ(x_i, y_i) is the unary node potential for image i with features x_i and its candidate
label y_i, and ψ(x_i, x_j, y_i, y_j) is the edge potential for the labels y_i and y_j of a pair of images
i and j. It should be noted that y_i denotes the label assigned to image i, which can take
values in {l_1, ..., l_n}.
Fig. 2. Overview of our approach

The unary term φ(x_i, y_i) is the confidence in assigning label y_i to image i by the
combination of scene and attribute classifier scores. Specifically, if f_j(x_i) is the raw
score of the j-th scene classifier on x_i and f^{a_k}(x_i) is the raw score of the k-th attribute classifier
on x_i, then the unary potential is given by:

    φ(x_i, y_i = l_j) = σ(f_j(x_i)) + λ_1 Σ_{a_k ∈ A} (−1)^{1_{x_i,y_i}(a_k)} β(f^{a_k}(x_i))    (2)

The first term takes the raw scene classifier scores and converts them into potentials
using the sigmoid function σ(t) = exp(t)/(1 + exp(t)) [26]. The second term uses a
weighted voting scheme for agreement between the scene and attribute classifiers (λ_1
is the normalization factor). Here, A is the set of binary attributes, and 1_{x_i,y_i}(a_k)
is an indicator function which is 1 if the attribute a_k is detected in the image i but a_k
is not a property of scene class y_i (and vice versa). β(·) denotes the confidence in prediction
of the attribute classifier.² Intuitively, this second term votes positive if both the attribute
classifier and scene classifier agree in terms of class-attribute relationships. Otherwise,
it votes negative, where the vote is weighted by the confidence of prediction.

² The confidence in prediction is defined as β(f^{a_k}(x_i)) = max(σ(f^{a_k}(x_i)), 1 − σ(f^{a_k}(x_i))).
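A sketch of the unary potential of Eq. (2), including the sigmoid and the confidence β from footnote 2, is given below; the value of λ_1 is not specified in the text and is set arbitrarily here, and all names are illustrative:

```python
import math

def sigmoid(t):
    return math.exp(t) / (1.0 + math.exp(t))

def beta(raw):
    """Attribute-prediction confidence, as in footnote 2: max(sigma, 1 - sigma)."""
    s = sigmoid(raw)
    return max(s, 1.0 - s)

def unary_potential(scene_score, attribute_scores, class_has_attribute, lam1=0.1):
    """Unary term for one image / candidate class. `attribute_scores` maps each
    attribute to its raw classifier score; `class_has_attribute` encodes the
    human-annotated class-attribute relationships for the candidate class."""
    phi = sigmoid(scene_score)
    for a, raw in attribute_scores.items():
        detected = sigmoid(raw) > 0.5
        disagree = detected != class_has_attribute[a]  # indicator in Eq. (2)
        phi += lam1 * (-1.0 if disagree else 1.0) * beta(raw)
    return phi

print(unary_potential(1.2, {"indoor": 2.0, "has water": -1.5},
                      {"indoor": True, "has water": False}))
```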

The binary term ψ(x_i, x_j, y_i, y_j) between images i and j with labels y_i and y_j
represents the comparative-attribute relations between the labeled classes:

    ψ(x_i, x_j, y_i, y_j) = Σ_{c_k ∈ C} 1_{c_k}(y_i, y_j) log(σ(f^{c_k}(x_i, x_j)))    (3)

where C denotes the set of comparative attributes, 1_{c_k}(y_i, y_j) indicates whether a given com-
parative attribute c_k exists between the pair of classes y_i and y_j, and f^{c_k}(x_i, x_j) is the
score of the comparative-attribute classifier for the pair of images x_i, x_j. Intuitively, this
term boosts the labels y_i and y_j if a comparative attribute c_k scores high on the
pairwise features. For example, if instances i, j are labeled conference-room and bedroom,
their scores get boosted if the comparative attribute has more space scores high on the
pairwise features (since conference rooms have more space than bedrooms).
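The edge potential of Eq. (3) can be sketched analogously; the class-pair-to-comparative-attribute relationships are assumed to be given by the human-annotated ontology, and all names are illustrative:

```python
import math

def sigmoid(t):
    return math.exp(t) / (1.0 + math.exp(t))

def pairwise_potential(yi, yj, pair_scores, relations):
    """Sum, over the comparative attributes asserted between classes (yi, yj),
    of the log-sigmoid of the corresponding comparative-attribute classifier
    score on the image pair. `pair_scores` maps attribute -> raw score for the
    pair; `relations` maps (class_i, class_j) -> list of comparative attributes."""
    psi = 0.0
    for ck in relations.get((yi, yj), []):
        psi += math.log(sigmoid(pair_scores[ck]))
    return psi

relations = {("conference room", "bedroom"): ["has more space (indoor)"]}
scores = {"has more space (indoor)": 1.8}   # the classifier fires on this pair
print(pairwise_potential("conference room", "bedroom", scores, relations))
```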
Promoting Instances: Typically in the semi-supervised problem, |U| varies from tens
of thousands to a million images. Estimating the most likely label for each image in U
necessitates minimizing Eq. (1), which is computationally intractable in general. Since
our goal is to find a few very confident images to add to the labeled set L, we do not
need to minimize Eq. (1) over the entire U. Instead, we follow the standard practice
of pruning the image nodes which have low probability of being classified as one of
the n scene classes. Specifically, we evaluate the unary term φ(·), which represents the
confidence of assigning a label to an image, over the entire U and use it to prune out and
keep only the top-N candidate images for each class. In our experiments, we set N to
3 times the number of instances to be promoted.
While pruning image nodes reduces the search space, exact inference still remains
intractable. However, approximate inference techniques like loopy belief propagation
or Gibbs sampling can be used to find the most likely assignments. In this work, we
compute the marginals at each node by running one iteration of loopy belief propagation
on the reduced graph. This approximate inference gives us the confidence of the candidate
class labeling for each image, incorporating scene, attribute and comparative-attribute
constraints. We then select the C most confidently labeled images for each class (U^l),
add them to the labeled set (L ∪ U^l → L), remove them from the unlabeled set
(U \ U^l → U) and re-train our classifiers.
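The promotion step can be sketched as follows; `unary` and `marginal` are assumed callables standing in for Eq. (2) and the single round of loopy belief propagation on the reduced graph, and all other names are illustrative:

```python
import heapq
import random

def promote(unlabeled, classes, unary, marginal, n_promote=5, prune_factor=3):
    """For each class, keep only the top prune_factor * n_promote images ranked
    by the unary term, score the kept images with an approximate marginal, and
    promote the n_promote most confident images for that class."""
    promoted = {}
    for y in classes:
        kept = heapq.nlargest(prune_factor * n_promote, unlabeled,
                              key=lambda x: unary(x, y))
        ranked = sorted(kept, key=lambda x: marginal(x, y, kept), reverse=True)
        promoted[y] = ranked[:n_promote]
    return promoted

# Toy usage: random scores stand in for Eq. (2) and the BP marginals.
random.seed(0)
images = list(range(100))
scores = {(x, y): random.random() for x in images for y in ("barn", "coast")}
out = promote(images, ["barn", "coast"],
              unary=lambda x, y: scores[(x, y)],
              marginal=lambda x, y, kept: scores[(x, y)])
print({y: v[:3] for y, v in out.items()})
```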

4.1 Scene, Attribute and Comparative Attribute Classifiers


We now describe the classifiers used for scenes, attributes and comparative attributes.
Scene and Attribute Classifiers: Our scene category classifiers as well as our attribute
classifiers are trained using boosted decision trees [27]. We use 20 boosted trees with 8 internal
nodes for the scene classifiers and 40 boosted trees with 8 internal nodes for the
attribute classifiers. These classifiers are trained on the 2049-dimensional feature vector from
[28]. Our image feature includes 960D GIST [29] features, 75D RGB features (image
is resized to 5 × 5) [3], a 30D histogram of line lengths, a 200D histogram of orientation
of lines and a 784D 3D histogram in Lab color space (14 × 14 × 4).
Comparative Attribute Classifier: Given a pair of images (x_i, x_j), the goal of the com-
parative attribute classifier is to predict whether the pair satisfies comparative relation-
ships such as more circular and has more indoor space. To model and incorporate

comparative attributes, we follow the approach proposed in [9] and train classifiers over
differential features (x_i − x_j). We train the comparative attribute classifiers using ground-truth
pairs of images that follow such relationships; for the negative data we use random
pairs of images and inverse relationships. We use a boosted decision tree classifier
with 20 trees and 4 internal nodes.
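The comparative-attribute training procedure on differential features can be sketched as below; scikit-learn's gradient boosting is used here as a stand-in for the paper's boosted decision trees, and the toy data and all names are illustrative:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def train_comparative_attribute(pos_pairs, neg_pairs):
    """Train a comparative-attribute classifier on differential features
    (x_i - x_j). Positives are pairs that satisfy the relationship; negatives
    here are the inverse relationships (x_j - x_i) plus any extra random pairs."""
    X_pos = np.array([xi - xj for xi, xj in pos_pairs])
    X_neg = np.array([xj - xi for xi, xj in pos_pairs] +
                     [xi - xj for xi, xj in neg_pairs])
    X = np.vstack([X_pos, X_neg])
    y = np.concatenate([np.ones(len(X_pos)), np.zeros(len(X_neg))])
    clf = GradientBoostingClassifier(n_estimators=20, max_leaf_nodes=5)
    return clf.fit(X, y)

# Toy example: attribute "has more open space" on 2-D features.
rng = np.random.default_rng(0)
open_scenes = rng.normal(2.0, 0.2, (20, 2))
closed_scenes = rng.normal(0.0, 0.2, (20, 2))
pairs = list(zip(open_scenes, closed_scenes))
clf = train_comparative_attribute(pairs, neg_pairs=[])
print(clf.predict([open_scenes[0] - closed_scenes[1]]))  # expect class 1
```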

5 Experiments
We now present experimental results to demonstrate the effectiveness of constraints in
bootstrapping based approach. We first present a detailed experimental analysis of our
approach using the fully labeled SUN dataset [30]. Using a completely labeled dataset
allows us to evaluate the quality of unlabeled images being added to our classifiers
and how it affects the performance of our system. Finally, we evaluate our complete
system on a large scale dataset which uses approximately 1 million unlabeled images
to improve the performance of scene classifiers. For all the experiments we use a fixed
vocabulary of scene classes, attributes and comparative attributes as described below.
Vocabulary: We evaluate the performance of our coupled bootstrapping approach in
learning 15 scene categories³ randomly chosen from the SUN dataset. These classes
are: auditorium, amphitheater, banquet hall, barn, bedroom, bowling alley, bridge, casino
indoor, cemetery, church outdoor, coast, conference room, desert sand, field cultivated
and restaurant. We used 19 attribute classes: horizon visible, indoor, has water, has
building, has seat rows, has people, has grass, has clutter, has chairs, is man-made,
eating place, fun place, made of stone, meeting place, livable, part of house, relaxing,
animal-related, crowd-related. We used 10 comparative attributes: is more open, has
more space (indoor), has more space (outdoor), has more seating space, has larger struc-
tures, has horizontally longer structures, has taller structures, has more water, has more
sand and has more greenery. The relationship between scene category and attributes
were defined using a human annotator (see the website for list of relationships).
Baselines: The goal of this paper is to show the importance of additional constraints in
semi-supervised domain. We believe these constraints should improve the performance
irrespective of the choice of a particular approach. Therefore, as a baseline, we com-
pare the performance of our constrained bootstrapping approach with two versions of
the standard bootstrapping approach: one uses multiple independent binary classifiers and
the other uses a multi-class scene classifier.
For our first set of experiments, we also compare our constrained bootstrapping
framework to the state-of-the-art SSL technique for image classification based on eigen
functions [24]. Following experimental settings from [24], we map the GIST descrip-
tor for each image down to a 32D space using PCA and use k = 64 eigenfunctions, with
the remaining Laplacian parameters set to 50 and 0.2 respectively (see [24] for details).
Evaluation Metric: We evaluate the performance of our approach using two met-
rics. Firstly, we evaluate the performance of our trained classifiers in terms of Average
³ We expect our approach to scale and perform better with more categories: increasing the number of categories will add more constraints and enforce extensive sharing, which is exactly what our approach exploits.

Fig. 3. Qualitative results: we demonstrate how binary attribute (middle row) and comparative attribute (bottom row) constraints help us promote better instances, shown here for the amphitheater and bridge classifiers, as compared to naïve bootstrapping (top row).

Precision (AP) at each iteration on a held-out test dataset. Secondly, for the small-scale
experiments (Sections 5.1 and 5.2), we also evaluate the purity of the promoted instances in
terms of the fraction of correctly labeled images.

5.1 Experiment 1: Using Pre-trained Attribute Classifiers

We first evaluate the performance of our approach on the SUN dataset. We train the 15
scene classifiers using 2 images each (labeled set). We want to observe the effect of
these constraints when the attribute and comparative attribute classifiers are not re-
trained at each iteration. Therefore, in this case, we used fixed pre-trained attribute
classifiers and relative attribute classifiers. These classifiers were trained on 25 exam-
ples each (from a held-out dataset). Our unlabeled dataset consists of 18,000 images
from the SUN dataset. Out of these 18K images, 8.5K images are from these 15 categories
and the remainder are randomly sampled from the rest of the dataset. At each iteration,
we add 5 images per category from the unlabeled dataset to the classifier.
Figure 3 shows examples of how each constraint helps to select better instances that
should be added to the classifier. The bootstrapping approach clearly faces semantic
drift, as it adds bridge, coastal and park images to the amphitheater classi-
fier. It is the presence of binary attributes such as has water and has greenery that
help us to reject these bad candidates. While binary attributes do help to prune a lot

Fig. 4. Qualitative results showing the candidates selected by our approach at iterations 1, 10, 40 and 90 for the banquet hall, conference room and church outdoor categories. Notice that during the final iterations there are errors, such as bedroom images being added to conference rooms.

of bad instances, they sometimes promote bad instances like the cemetery image
(3rd in the 2nd row). However, comparative attributes help us clean up such instances. For ex-
ample, the cemetery image is rejected since it has less circular structure. Similarly,
the church image is rejected since it does not have the long horizontal structures of
other bridge images. Interestingly, our approach does not overfit to the seed examples
and can indeed cover a greater diversity, thus increasing recall. For example, in Figure 4,
the seed examples for banquet hall include close-up views of tables, but as iterations
proceed we incorporate distant views of banquet halls (e.g., the 3rd image in iterations 1 and 40).
Next, we quantitatively evaluate the importance of each constraint in terms of per-
formance on held-out test data. Figure 5(a) shows the performance of our approach
with different combinations of constraints. Our system shows significant improvement
in performance by adding attribute-based constraints. In fact, using a randomly chosen
set of ten attributes (ME+10BA) seems to provide enough constraints to avoid seman-
tic drift. Adding another nine attributes to the system does provide some improvement
during the initial iterations (ME+19BA). However, at later stages, the effect of these at-
tributes saturates. Finally, we evaluate the importance of comparative attributes by com-
paring the performance of our system with and without the CA constraint. Adding the CA
constraint does provide a significant boost (6–7%) in performance. Figure 5(b) shows
the comparison of our full system with other baseline approaches. Our approach shows
significant improvement over all the baselines, which include self-learning with independent
binary classifiers, self-learning with a multi-class classifier, and eigenfunctions [24].
We also show the upper bound on the performance of our
approach, which is achieved if all the unlabeled data is manually labeled and used to
train the scene classifiers. Since the binary attribute classifiers are pre-trained on some
other data, it would be interesting to analyze the performance of these classifiers alone

[Figure 5 layout: three panels, (a) Evaluating Constraints, (b) Comparisons with Baselines, (c) Purity-based Comparison; curves include Upper bound, ME, ME+10BA, ME+19BA, ME+19BA+CA (No Introspection), BA only (no scene), Self-learning (binary), Self-learning (multi-class), Eigen Functions and Our Approach; x-axis: number of iterations; y-axes: mean Average Precision and purity of the labeled set (L).]

Fig. 5. Quantitative evaluations: (a) we first evaluate the importance of each constraint in our system using control experiments; (b) and (c) show the comparison of our approach against standard baselines in terms of performance on held-out test data and purity of added instances, respectively.

Fig. 6. Selected candidates for the baseline (bootstrapping) and our approach at iterations 1 and 60, starting from the initial seeds, for the bedroom and coast categories.

(and without scene classifiers). Our approach performs significantly better than just us-
ing attributes alone. This indicates that coupling does provide constraints and helps in
better labeling of the unlabeled data. We also compare the performance of our full system
in terms of the purity of added instances (see Figure 5(c)).

5.2 Experiment 2: Co-training Attributes and Comparative Attributes


In the previous experiment we showed how each constraint is useful in improving the
performance of the semi-supervised approach. To isolate the reasons behind the perfor-
mance boost, we used fixed pre-trained attribute and comparative attribute classifiers. In
this experiment, we train our own attribute and comparative attribute classifiers. These
classifiers are trained using the same 30 images (15 categories × 2 images) which were
used to train the initial scene classifiers. Now, we use the co-training setup where we
retrain these attribute and comparative attribute classifiers at each iteration using the
new images from unlabeled dataset.
Figure 6 shows qualitative results of image instances which are added to the clas-
sifiers at iterations 1 and 60. The qualitative results show how the baseline approach
suffers from semantic drift and adds auditorium images to bedrooms and field images to coast.
On the other hand, our approach is more robust and even at the 60th iteration adds good
instances to the classifier. Figure 7 shows the comparison of our approach against

[Figure 7 layout: two panels plotting mean Average Precision and purity of the labeled set (L) against the number of iterations; curves: Self-learning (binary), Self-learning (multi-class), Eigen Functions and Our Approach.]

Fig. 7. Quantitative Evaluations: We evaluate our approach against standard baselines in terms of
(a) mean AP over all scene classes, (b) purity of added instances

Table 1. Quantitative Results on Large Scale Semi-Supervised Learning (AP Scores)


Category: Iteration-0 / Self (Binary) / Self (Multi-Class) / Our Approach
Amphitheater: 0.517 / 0.557 / 0.488 / 0.571
Auditorium: 0.317 / 0.269 / 0.254 / 0.298
Banquet Hall: 0.260 / 0.324 / 0.290 / 0.302
Barn: 0.309 / 0.255 / 0.261 / 0.352
Bedroom: 0.481 / 0.458 / 0.443 / 0.521
Bowling Alley: 0.602 / 0.590 / 0.601 / 0.627
Bridge: 0.144 / 0.156 / 0.162 / 0.209
Casino indoor: 0.647 / 0.644 / 0.655 / 0.650
Cemetery: 0.449 / 0.453 / 0.509 / 0.506
Church outdoor: 0.482 / 0.499 / 0.475 / 0.506
Coast: 0.539 / 0.466 / 0.548 / 0.571
Conference Room: 0.384 / 0.317 / 0.322 / 0.391
Desert Sand: 0.716 / 0.690 / 0.733 / 0.786
Field Cultivated: 0.700 / 0.572 / 0.657 / 0.702
Restaurant: 0.270 / 0.241 / 0.303 / 0.311
Mean: 0.455 / 0.433 / 0.447 / 0.487

baselines. Notice that our approach significantly outperforms all the baselines even
though in this case we used the same seed examples to train the attribute classifiers. This
shows that attribute classifiers can pool information from multiple classes to help avoid
semantic drift.

5.3 Experiment 3: Large Scale Semi-Supervised Learning


In the two experiments discussed above, we demonstrated the importance and effective-
ness of adding different constraints to the bootstrapping framework. As a final exper-
iment, we now demonstrate the utility of such constraints for large scale learning. We
start with 25 seed examples from the SUN dataset for each of the 15 categories. Our un-
labeled dataset consists of one million images selected from the ImageNet dataset [31].
At each iteration, we add 10 images per category from the unlabeled dataset to the clas-
sifiers. Table 1 shows the performance of the learned scene classifiers after 100 iterations.
The constrained bootstrapping approach not only improves the performance by 3.2%
but also outperforms all the baselines significantly.

Acknowledgments. The authors would like to thank Tom Mitchell, Martial Hebert,
Varun Ramakrishna and Debadeepta Dey for many helpful discussions. This work was
supported by Google, ONR-MURI N000141010934 and ONR Grant N000141010766.

References
1. von Ahn, L., Dabbish, L.: Labeling images with a computer game. In: SIGCHI Conference
on Human Factors in Computing Systems (2004)
2. Russell, B.C., Torralba, A., Murphy, K.P., Freeman, W.T.: Labelme: A database and web-
based tool for image annotation. IJCV (2009)
3. Torralba, A., Fergus, R., Freeman, W.T.: 80 million tiny images: a large database for non-
parametric object and scene recognition. PAMI (2008)
4. Curran, J.R., Murphy, T., Scholz, B.: Minimising semantic drift with mutual exclusion boot-
strapping. In: Conference of the Pacific Association for Computational Linguistics (2007)
5. Carlson, A., Betteridge, J., Hruschka Jr., E.R., Mitchell, T.M.: Coupling semi-supervised
learning of categories and relations. In: NAACL HLT Workshop on SSL for NLP (2009)
6. Farhadi, A., Endres, I., Hoiem, D., Forsyth, D.: Describing objects by their attributes. In:
CVPR (2009)
7. Lampert, C.H., Nickisch, H., Harmeling, S.: Learning to detect unseen object classes by
between-class attribute transfer. In: CVPR (2009)
8. Parikh, D., Grauman, K.: Relative attributes. In: ICCV (November 2011)
9. Gupta, A., Davis, L.S.: Beyond Nouns: Exploiting Prepositions and Comparative Adjectives
for Learning Visual Classifiers. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008,
Part I. LNCS, vol. 5302, pp. 16–29. Springer, Heidelberg (2008)
10. Rastegari, M., Farhadi, A., Forsyth, D.: Attribute Discovery via Predictable Discrimina-
tive Binary Codes. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.)
ECCV 2012, Part VI. LNCS, vol. 7577, pp. 876–889. Springer, Heidelberg (2012)
11. Parkash, A., Parikh, D.: Attributes for Classifier Feedback. In: Fitzgibbon, A., Lazebnik, S.,
Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012, Part III. LNCS, vol. 7574, pp. 355–369.
Springer, Heidelberg (2012)
12. Carlson, A., Betteridge, J., Kisiel, B., Settles, B., Hruschka Jr., E.R., Mitchell, T.M.: Toward
an architecture for never-ending language learning. In: AAAI (2010)
13. Vijayanarasimhan, S., Grauman, K.: Large-scale live active learning: Training object detec-
tors with crawled data and crowds. In: CVPR (2011)
14. Kapoor, A., Grauman, K., Urtasun, R., Darrell, T.: Active learning with gaussian processes
for object categorization. In: ICCV (2007)
15. Qi, G.J., Hua, X.S., Rui, Y., Tang, J., Zhang, H.J.: Two-dimensional active learning for image
classification. In: CVPR (2008)
16. Joshi, A., Porikli, F., Papanikolopoulos, N.: Multi-class active learning for image classifica-
tion. In: CVPR (2009)
17. Siddiquie, B., Gupta, A.: Beyond active noun tagging: Modeling contextual interactions for
multi-class active learning. In: CVPR (2010)
18. Russell, B., Freeman, W., Efros, A., Sivic, J., Zisserman, A.: Using multiple segmentations
to discover objects and their extent in image collections. In: CVPR (2006)
19. Kang, H., Hebert, M., Kanade, T.: Discovering object instances from scenes of daily living.
In: ICCV (2011)
20. Zhu, X.: Semi-supervised learning literature survey. Technical report, CS, UW-Madison
(2005)
21. Riloff, E., Jones, R.: Learning dictionaries for information extraction by multi-level boot-
strapping. In: AAAI (1999)
22. Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: COLT
(1998)
23. Ebert, S., Larlus, D., Schiele, B.: Extracting Structures in Image Collections for Object
Recognition. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part I. LNCS,
vol. 6311, pp. 720–733. Springer, Heidelberg (2010)

24. Fergus, R., Weiss, Y., Torralba, A.: Semi-supervised learning in gigantic image collections.
In: NIPS (2009)
25. Guillaumin, M., Verbeek, J., Schmid, C.: Multimodal semi-supervised learning for image
classification. In: CVPR (2010)
26. Tighe, J., Lazebnik, S.: Understanding scenes on many levels. In: ICCV (2011)
27. Hoiem, D., Efros, A.A., Hebert, M.: Recovering surface layout from an image. IJCV (2007)
28. Hays, J., Efros, A.A.: Scene completion using millions of photographs. ACM Transactions
on Graphics, SIGGRAPH (2007)
29. Oliva, A., Torralba, A.: Building the gist of a scene: the role of global image features in
recognition. Progress in Brain Research (2006)
30. Xiao, J., Hays, J., Ehinger, K., Oliva, A., Torralba, A.: SUN database: Large-scale scene
recognition from abbey to zoo. In: CVPR (2010)
31. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierar-
chical image database. In: CVPR (2009)
Renormalization Returns:
Hyper-renormalization and Its Applications

Kenichi Kanatani1 , Ali Al-Sharadqah2 ,


Nikolai Chernov3 , and Yasuyuki Sugaya4
1
Department of Computer Science, Okayama University, Okayama 700-8530 Japan
2
Department of Mathematics, University of Mississippi, Oxford, MS 38677, U.S.A.
3
Department of Mathematics, University of Alabama at Birmingham,
AL 35294, U.S.A.
4
Department of Information and Computer Sciences,
Toyohashi University of Technology, Toyohashi, Aichi 441-8580 Japan
kanatani@suri.cs.okayama-u.ac.jp, aalshara@olemiss.edu,
chernov@math.uab.edu, sugaya@iim.ics.tut.ac.jp

Abstract. The technique of renormalization for geometric estimation attracted much attention when it was proposed in the early 1990s for having higher accuracy than any other then known methods. Later, it was replaced by minimization of the reprojection error. This paper points out that renormalization can be modified so that it outperforms reprojection error minimization. The key fact is that renormalization directly specifies equations to solve, just as the estimating equation approach in statistics, rather than minimizing some cost. Exploiting this fact, we modify the problem so that the solution has zero bias up to high order error terms; we call the resulting scheme hyper-renormalization. We apply it to ellipse fitting to demonstrate that it indeed surpasses reprojection error minimization. We conclude that it is the best method available today.

1 Introduction
One of the most fundamental tasks of computer vision is to compute 2-D and 3-D shapes of objects from noisy observations by using geometric constraints. Many problems are formulated as follows. We observe N vector data x_1, ..., x_N, whose true values x̄_1, ..., x̄_N are supposed to satisfy a geometric constraint in the form

F(x̄; θ) = 0,   (1)

where θ is an unknown parameter vector which we want to estimate. We call this type of problem simply geometric estimation. In traditional domains of statistics such as agriculture, pharmaceutics, and economics, observations are regarded as repeated samples from a parameterized probability density model p_θ(x); the task is to estimate the parameter θ. We call this type of problem simply statistical estimation, for which the minimization principle has been a major tool: one chooses the value θ that minimizes a specified cost. The best


known approach is maximum likelihood (ML), which minimizes the negative log-likelihood l = −Σ_{α=1}^{N} log p_θ(x_α). Recently, an alternative approach is more and more in use: one directly solves specified equations, called estimating equations [5], in the form of g(x_1, ..., x_N, θ) = 0. This approach can be viewed as an extension of the minimization principle; ML corresponds to g(x_1, ..., x_N, θ) = ∇_θ l, known as the score. However, the estimating equations need not be the gradient of any function, and one can modify g(x_1, ..., x_N, θ) as one likes so that the resulting solution has desirable properties (unbiasedness, consistency, efficiency, etc.). In this sense, the estimating equation approach is more general and flexible, having the possibility of providing a better solution than the minimization principle.
In the domain of computer vision, the minimization principle, in particular reprojection error minimization, is the norm and is called the Gold Standard [6]. A notable exception is the renormalization of Kanatani [7,8]: instead of minimizing some cost, it iteratively removes the bias of weighted least squares (LS). Renormalization attracted much attention because it exhibited higher accuracy than any other then known methods. However, questions were repeatedly raised as to what it minimizes, perhaps out of the deep-rooted preconception that optimal estimation should minimize something. One answer was given by Chojnacki et al., who proposed in [4] an iterative scheme similar to renormalization, which they called FNS (Fundamental Numerical Scheme), for minimizing what is now referred to as the Sampson error [6]. They argued in [3] that renormalization can be rationalized if viewed as approximately minimizing the Sampson error. Leedan and Meer [14] and Matei and Meer [15] also proposed a different Sampson error minimization scheme, which they called HEIV (Heteroscedastic Errors-in-Variables). Kanatani and Sugaya [13] pointed out that the reprojection error can be minimized by repeated applications of Sampson error minimization: the Sampson error is iteratively modified so that it agrees with the reprojection error in the end. However, reprojection error minimization, which is ML statistically, still has some bias. Kanatani [9] analytically evaluated the bias of the FNS solution and subtracted it; he called this scheme hyperaccurate correction. Okatani and Deguchi [16,17] removed the bias of ML of particular types by analyzing the relationship between the bias and the hypersurface defined by the constraint [16] and by using the method of projected scores [17].
We note that renormalization is similar to the estimating equation approach for statistical estimation in the sense that it directly specifies equations to solve, which have the form of a generalized eigenvalue problem for geometric estimation. In this paper, we show that by doing high order error analysis using the perturbation technique of Kanatani [10] and Al-Sharadqah and Chernov [1], renormalization can achieve higher accuracy than reprojection error minimization and is comparable to bias-corrected ML. We call the resulting scheme hyper-renormalization.
In Sec. 2, we summarize the fundamentals of geometric estimation. In Sec. 3, we describe the iterative reweight, the most primitive form of the non-minimization approach. In Sec. 4, we reformulate Kanatani's renormalization as

an iterative improvement of the Taubin method [19]. In Sec. 5, we do a detailed error analysis of the generalized eigenvalue problem. In Sec. 6, the procedure of hyper-renormalization is derived as an iterative improvement of what is called HyperLS [11,12,18]. In Sec. 7, we apply our technique to ellipse fitting to demonstrate that it indeed outperforms reprojection error minimization. In Sec. 8, we conclude that hyper-renormalization is the best method available today.

2 Geometric Estimation
Equation (1) is a general nonlinear equation in x. In many practical problems, we can reparameterize the problem to make F(x; θ) linear in θ (but nonlinear in x), allowing us to write Eq. (1) as

(ξ(x), θ) = 0,   (2)

where and hereafter (a, b) denotes the inner product of vectors a and b. The vector ξ(x) is some nonlinear mapping of x from R^m to R^n, where m and n are the dimensions of the data x and the parameter θ, respectively. Since the vector θ in Eq. (2) has scale indeterminacy, we normalize it to unit norm: ‖θ‖ = 1.
Example 1 (Ellipse fitting). Given a point sequence (x_α, y_α), α = 1, ..., N, we wish to fit an ellipse of the form

Ax² + 2Bxy + Cy² + 2(Dx + Ey) + F = 0.   (3)

If we let

ξ = (x², 2xy, y², 2x, 2y, 1)ᵀ,   θ = (A, B, C, D, E, F)ᵀ,   (4)

Eq. (3) has the form of Eq. (2).


Example 2 (Fundamental matrix computation). Corresponding points (x, y) and (x′, y′) in two images of the same 3-D scene taken from different positions satisfy the epipolar equation [6]

(x, Fx′) = 0,   x ≡ (x, y, 1)ᵀ,   x′ ≡ (x′, y′, 1)ᵀ,   (5)

where F is a matrix of rank 2 called the fundamental matrix, from which we can compute the camera positions and the 3-D structure of the scene [6,8]. If we let

ξ = (xx′, xy′, x, yx′, yy′, y, x′, y′, 1)ᵀ,
θ = (F₁₁, F₁₂, F₁₃, F₂₁, F₂₂, F₂₃, F₃₁, F₃₂, F₃₃)ᵀ,   (6)

Eq. (5) has the form of Eq. (2).


We assume that each datum x_α is a deviation from its true value x̄_α by independent Gaussian noise of mean 0 and covariance matrix σ²V₀[x_α], where V₀[x_α] is a known matrix that specifies the directional dependence of the noise and σ is

an unknown constant that specifies the absolute magnitude; we call V₀[x_α] the normalized covariance matrix, and σ the noise level.
Let us write ξ(x_α) simply as ξ_α. It can be expanded in the form

ξ_α = ξ̄_α + Δ₁ξ_α + Δ₂ξ_α + ⋯,   (7)

where and hereafter bars indicate terms without noise and the symbol Δ_k means kth order noise terms O(σ^k). Using the Jacobian matrix of the mapping ξ(x), we can express the first order noise term Δ₁ξ_α in terms of the original noise terms Δx_α as follows:

Δ₁ξ_α = (∂ξ(x)/∂x)|_{x=x̄_α} Δx_α.   (8)

We define the covariance matrix of ξ_α by

V[ξ_α] = E[Δ₁ξ_α Δ₁ξ_αᵀ] = (∂ξ(x)/∂x)|_{x=x̄_α} E[Δx_α Δx_αᵀ] (∂ξ(x)/∂x)ᵀ|_{x=x̄_α} = σ² V₀[ξ_α],   (9)

where E[·] denotes expectation, and we define

V₀[ξ_α] ≡ (∂ξ(x)/∂x)|_{x=x̄_α} V₀[x_α] (∂ξ(x)/∂x)ᵀ|_{x=x̄_α}.   (10)
The true values x̄_α are used in this definition. In actual computation, we replace them by their observations x_α. It has been confirmed by many experiments that this does not affect the final result of practical problems. Also, V₀[ξ_α] takes only the first order error terms into account via the Jacobian matrix, but it has been confirmed by many experiments that incorporation of higher order terms does not affect the final result.
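For concreteness, here is a minimal sketch (not from the paper) of ξ(x) and V₀[ξ] for the ellipse case of Example 1, assuming the common choice V₀[x_α] = I (unit isotropic pixel noise).

```python
import numpy as np

def xi_ellipse(x, y):
    # Eq. (4): xi = (x^2, 2xy, y^2, 2x, 2y, 1)^T
    return np.array([x * x, 2 * x * y, y * y, 2 * x, 2 * y, 1.0])

def V0_xi_ellipse(x, y, V0_x=np.eye(2)):
    # Eq. (10): V0[xi] = J V0[x] J^T with J the Jacobian of xi with respect to (x, y),
    # evaluated at the observation (true values being unavailable in practice).
    J = np.array([[2 * x, 0.0],
                  [2 * y, 2 * x],
                  [0.0,   2 * y],
                  [2.0,   0.0],
                  [0.0,   2.0],
                  [0.0,   0.0]])
    return J @ V0_x @ J.T
```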

3 Iterative Reweight
The oldest method that is not based on minimization is the following iterative reweight (a code sketch follows the procedure):

1. Let W_α = 1, α = 1, ..., N, and θ₀ = 0.
2. Compute the following matrix M:

   M = (1/N) Σ_{α=1}^{N} W_α ξ_α ξ_αᵀ.   (11)

3. Solve the eigenvalue problem Mθ = λθ, and compute the unit eigenvector θ for the smallest eigenvalue λ.
4. If θ ≈ θ₀ up to sign, return θ and stop. Else, let

   W_α ← 1/(θ, V₀[ξ_α]θ),   θ₀ ← θ,   (12)

   and go back to Step 2.
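The sketch below (not the authors' implementation) spells out this loop for arrays of ξ_α vectors and normalized covariances V₀[ξ_α], such as those produced by the ellipse helpers above.

```python
import numpy as np

# xis: (N, n) array of xi_a vectors; V0s: (N, n, n) array of normalized covariances.
def iterative_reweight(xis, V0s, tol=1e-6, max_iter=100):
    N, n = xis.shape
    W = np.ones(N)
    theta0 = np.zeros(n)
    for _ in range(max_iter):
        M = (W[:, None, None] * np.einsum('ai,aj->aij', xis, xis)).mean(axis=0)  # Eq. (11)
        vals, vecs = np.linalg.eigh(M)
        theta = vecs[:, 0]                 # unit eigenvector for the smallest eigenvalue
        if min(np.linalg.norm(theta - theta0), np.linalg.norm(theta + theta0)) < tol:
            return theta                   # convergence up to sign
        W = 1.0 / np.einsum('i,aij,j->a', theta, V0s, theta)                     # Eq. (12)
        theta0 = theta
    return theta0
```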

The motivation of this method is the weighted least squares that minimizes

(1/N) Σ_{α=1}^{N} W_α (ξ_α, θ)² = (1/N) Σ_{α=1}^{N} W_α θᵀ ξ_α ξ_αᵀ θ = (θ, Mθ).   (13)

This is minimized by the unit eigenvector of the matrix M for the smallest eigenvalue. As is well known in statistics, the optimal choice of the weight W_α is the inverse of the variance of that term. Since (ξ_α, θ) = (Δ₁ξ_α, θ) + ⋯, the leading term of the variance is

E[(Δ₁ξ_α, θ)²] = E[θᵀ Δ₁ξ_α Δ₁ξ_αᵀ θ] = (θ, E[Δ₁ξ_α Δ₁ξ_αᵀ]θ) = σ² (θ, V₀[ξ_α]θ).   (14)

Hence, we should choose W_α = 1/(θ, V₀[ξ_α]θ), but θ is not known. So, we do iterations, determining the weight W_α from the value of θ in the preceding step. Let us call the first value of θ computed with W_α = 1 simply the initial solution. It minimizes Σ_{α=1}^{N} (ξ_α, θ)², corresponding to what is known as least squares (LS), algebraic distance minimization, and many other names [6]. Thus, iterative reweight is an iterative improvement of the LS solution.
It appears at first sight that the above procedure minimizes

J = (1/N) Σ_{α=1}^{N} (ξ_α, θ)² / (θ, V₀[ξ_α]θ),   (15)

which is known today as the Sampson error [6]. However, iterative reweight does not minimize it, because at each step we are computing the value of θ that minimizes the numerator part with the denominator part fixed. Hence, at the time of convergence, the resulting solution θ is such that

(1/N) Σ_{α=1}^{N} (ξ_α, θ)² / (θ, V₀[ξ_α]θ) ≤ (1/N) Σ_{α=1}^{N} (ξ_α, θ′)² / (θ, V₀[ξ_α]θ)   (16)

for any θ′, but the following does not necessarily hold:

(1/N) Σ_{α=1}^{N} (ξ_α, θ)² / (θ, V₀[ξ_α]θ) ≤ (1/N) Σ_{α=1}^{N} (ξ_α, θ′)² / (θ′, V₀[ξ_α]θ′).   (17)
The perturbation analysis in [10] shows that the covariance matrix V[θ] of the resulting solution agrees with a theoretical accuracy limit, called the KCR (Kanatani-Cramer-Rao) lower bound [2,8,10], up to O(σ⁴). Hence, further covariance reduction is practically impossible. However, it has been widely known that the iterative reweight solution has a large bias [8]. For ellipse fitting, for example, it almost always fits a smaller ellipse than the true shape. Thus, the following strategies were introduced to improve iterative reweight:

– Remove the bias of the solution.
– Exactly minimize the Sampson error in Eq. (15).

The former is Kanatani's renormalization [7,8], and the latter is the FNS of Chojnacki et al. [4] and the HEIV of Leedan and Meer [14] and Matei and Meer [15].
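For reference, the Sampson error of Eq. (15) is straightforward to evaluate; the following short sketch (illustrative only) uses the same notation as the iterative-reweight sketch above.

```python
import numpy as np

def sampson_error(theta, xis, V0s):
    # Eq. (15): mean of (xi_a, theta)^2 / (theta, V0[xi_a] theta)
    num = (xis @ theta) ** 2
    den = np.einsum('i,aij,j->a', theta, V0s, theta)
    return np.mean(num / den)
```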

4 Renormalization
Kanatani's renormalization [7,8] can be described as follows:

1. Let W_α = 1, α = 1, ..., N, and θ₀ = 0.
2. Compute the following matrices M and N:

   M = (1/N) Σ_{α=1}^{N} W_α ξ_α ξ_αᵀ,   N = (1/N) Σ_{α=1}^{N} W_α V₀[ξ_α].   (18)

3. Solve the generalized eigenvalue problem Mθ = λNθ, and compute the unit eigenvector θ for the smallest eigenvalue λ in absolute value.
4. If θ ≈ θ₀ up to sign, return θ and stop. Else, let

   W_α ← 1/(θ, V₀[ξ_α]θ),   θ₀ ← θ,   (19)

   and go back to Step 2.

This has a different appearance from the procedure described in [7], in which the generalized eigenvalue problem is reduced to the standard eigenvalue problem, but the resulting solution is the same [8]. The motivation of renormalization is as follows. Let M̄ be the true value of the matrix M in Eq. (18) defined by the true values ξ̄_α. Since (ξ̄_α, θ) = 0, we have M̄θ = 0. Hence, θ is the eigenvector of M̄ for eigenvalue 0. But M̄ is unknown, so we estimate it. Since E[Δ₁ξ_α] = 0, the expectation of M is to a first approximation

E[M] = E[(1/N) Σ_{α=1}^{N} W_α (ξ̄_α + Δ₁ξ_α)(ξ̄_α + Δ₁ξ_α)ᵀ] = M̄ + (1/N) Σ_{α=1}^{N} W_α E[Δ₁ξ_α Δ₁ξ_αᵀ]
     = M̄ + (σ²/N) Σ_{α=1}^{N} W_α V₀[ξ_α] = M̄ + σ²N.   (20)

Hence, M̄ = E[M] − σ²N ≈ M − σ²N, so instead of M̄θ = 0 we solve (M − σ²N)θ = 0, or Mθ = σ²Nθ. Assuming that σ² is small, we regard it as the eigenvalue λ closest to 0. As in the case of iterative reweight, we iteratively update the weight W_α so that it approaches 1/(θ, V₀[ξ_α]θ).
Note that the initial solution with W_α = 1 solves (Σ_{α=1}^{N} ξ_α ξ_αᵀ)θ = λ(Σ_{α=1}^{N} V₀[ξ_α])θ, which is nothing but the method of Taubin [19], known to be a very accurate algebraic method that does not require iterations. Thus, renormalization is an iterative improvement of the Taubin solution. According to many experiments, renormalization is more accurate than the Taubin method, with nearly comparable accuracy to the FNS and the HEIV. The accuracy of renormalization is analytically evaluated in [10], showing that the covariance matrix V[θ] of the solution agrees with the KCR lower bound up to O(σ⁴), just as for iterative reweight, but the bias is much smaller.
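In code, renormalization only changes the eigenvalue problem solved inside the loop of the iterative-reweight sketch above; one step could be sketched as follows (illustrative only, using SciPy for the generalized eigenproblem).

```python
import numpy as np
from scipy.linalg import eig

def renormalization_step(xis, V0s, W):
    M = (W[:, None, None] * np.einsum('ai,aj->aij', xis, xis)).mean(axis=0)   # Eq. (18)
    Nmat = (W[:, None, None] * V0s).mean(axis=0)                              # Eq. (18)
    vals, vecs = eig(M, Nmat)              # generalized problem M theta = lambda N theta
    mag = np.where(np.isfinite(vals), np.abs(vals), np.inf)
    theta = np.real(vecs[:, np.argmin(mag)])   # eigenvalue closest to zero
    return theta / np.linalg.norm(theta)
```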

Small as it may be, however, the bias is not exactly 0. The error analysis in [10] shows that the bias expression involves the matrix N. Our strategy is to optimize the matrix N in Eq. (18) to N = (1/N) Σ_{α=1}^{N} W_α V₀[ξ_α] + ⋯ so that the bias is zero up to high order error terms.

5 Error Analysis
Substituting Eq. (7) into the definition of the matrix M in Eq. (18), we can expand it in the form

M = M̄ + Δ₁M + Δ₂M + ⋯,   (21)

where Δ₁M and Δ₂M are given by

Δ₁M = (1/N) Σ_{α=1}^{N} W̄_α (ξ̄_α Δ₁ξ_αᵀ + Δ₁ξ_α ξ̄_αᵀ) + (1/N) Σ_{α=1}^{N} Δ₁W_α ξ̄_α ξ̄_αᵀ,   (22)

Δ₂M = (1/N) Σ_{α=1}^{N} W̄_α (Δ₁ξ_α Δ₁ξ_αᵀ + ξ̄_α Δ₂ξ_αᵀ + Δ₂ξ_α ξ̄_αᵀ)
      + (1/N) Σ_{α=1}^{N} Δ₁W_α (ξ̄_α Δ₁ξ_αᵀ + Δ₁ξ_α ξ̄_αᵀ) + (1/N) Σ_{α=1}^{N} Δ₂W_α ξ̄_α ξ̄_αᵀ.   (23)

Let θ = θ̄ + Δ₁θ + Δ₂θ + ⋯ be the corresponding expansion of the resulting θ. At the time of convergence, we have W_α = 1/(θ, V₀[ξ_α]θ). Substituting the expansion of θ, we obtain the expansion W_α = W̄_α + Δ₁W_α + Δ₂W_α + ⋯, where

Δ₁W_α = −2W̄_α² (θ̄, V₀[ξ_α]Δ₁θ),   (24)

Δ₂W_α = (Δ₁W_α)²/W̄_α − W̄_α² ((Δ₁θ, V₀[ξ_α]Δ₁θ) + 2(θ̄, V₀[ξ_α]Δ₂θ)).   (25)

(We omit the derivation.) Similarly expanding the eigenvalue λ and the matrix N yet to be determined, the generalized eigenvalue problem Mθ = λNθ has the form

(M̄ + Δ₁M + Δ₂M + ⋯)(θ̄ + Δ₁θ + Δ₂θ + ⋯)
   = (λ̄ + Δ₁λ + Δ₂λ + ⋯)(N̄ + Δ₁N + Δ₂N + ⋯)(θ̄ + Δ₁θ + Δ₂θ + ⋯).   (26)

Equating the noiseless terms on both sides, we have M̄θ̄ = λ̄N̄θ̄, but since M̄θ̄ = 0, we have λ̄ = 0. Equating the first and the second order terms on both sides, we obtain the following relationships:

M̄Δ₁θ + Δ₁Mθ̄ = Δ₁λ N̄θ̄,   (27)

M̄Δ₂θ + Δ₁MΔ₁θ + Δ₂Mθ̄ = Δ₂λ N̄θ̄.   (28)

Computing the inner product of Eq. (27) with θ̄ on both sides, we have

(θ̄, M̄Δ₁θ) + (θ̄, Δ₁Mθ̄) = Δ₁λ (θ̄, N̄θ̄),   (29)

but (θ̄, M̄Δ₁θ) = (M̄θ̄, Δ₁θ) = 0 and Eq. (22) implies (θ̄, Δ₁Mθ̄) = 0, so Δ₁λ = 0. The matrix M̄ has rank n − 1 (n is the dimension of θ), θ̄ being its null vector. Hence, if we let M̄⁻ be the pseudoinverse of M̄, the product M̄⁻M̄ equals the projection matrix P_θ̄ in the direction of θ̄. It follows that, by multiplying both sides of Eq. (27) by M̄⁻ from the left, Δ₁θ is expressed as follows:

Δ₁θ = −M̄⁻Δ₁Mθ̄.   (30)

Here, we have noted that since θ is normalized to unit norm, Δ₁θ is orthogonal to θ̄, so P_θ̄Δ₁θ = Δ₁θ. Substituting Eq. (30) into Eq. (28), we obtain

Δ₂λ N̄θ̄ = M̄Δ₂θ − Δ₁MM̄⁻Δ₁Mθ̄ + Δ₂Mθ̄ = M̄Δ₂θ + Tθ̄,   (31)

where we define the matrix T to be

T ≡ Δ₂M − Δ₁MM̄⁻Δ₁M.   (32)

Because θ is a unit vector, it has no error in the direction of θ̄ itself; we are interested in the error orthogonal to it. So, we define the second order error of θ to be the orthogonal component

Δ₂⊥θ ≡ P_θ̄ Δ₂θ = M̄⁻M̄ Δ₂θ.   (33)

Multiplying Eq. (31) by M̄⁻ on both sides from the left, we obtain Δ₂⊥θ in the following form:

Δ₂⊥θ = M̄⁻(Δ₂λ N̄θ̄ − Tθ̄).   (34)

Computing the inner product of Eq. (31) with θ̄ on both sides and noting that (θ̄, M̄Δ₂θ) = 0, we obtain Δ₂λ in the form

Δ₂λ = (θ̄, Tθ̄) / (θ̄, N̄θ̄).   (35)

Hence, Eq. (34) is rewritten as follows:

Δ₂⊥θ = M̄⁻( (θ̄, Tθ̄)/(θ̄, N̄θ̄) · N̄ − T )θ̄.   (36)

6 Hyper-renormalization
From Eq. (30), we see that E[Δ₁θ] = 0: the first order bias is 0. Thus, the bias is evaluated by the second order term E[Δ₂⊥θ]. From Eq. (36), we obtain

E[Δ₂⊥θ] = M̄⁻( (θ̄, E[T]θ̄)/(θ̄, N̄θ̄) · N̄ − E[T] )θ̄,   (37)

which implies that if we can choose such an N that its noiseless value N̄ satisfies E[T]θ̄ = cN̄θ̄ for some constant c, we will have E[Δ₂⊥θ] = 0. Then, the bias will be O(σ⁴), because the expectation of odd-order error terms is zero. After a lengthy analysis (we omit the details), we find that E[T]θ̄ = σ²N̄θ̄ holds if we define

N̄ = (1/N) Σ_{α=1}^{N} W̄_α ( V₀[ξ_α] + 2S[ξ̄_α ē_αᵀ] )
    − (1/N²) Σ_{α=1}^{N} W̄_α² ( (ξ̄_α, M̄⁻ξ̄_α) V₀[ξ_α] + 2S[V₀[ξ_α] M̄⁻ ξ̄_α ξ̄_αᵀ] ),   (38)

where S[·] denotes symmetrization (S[A] = (A + Aᵀ)/2) and the vectors ē_α are defined via

E[Δ₂ξ_α] = σ² ē_α.   (39)
From this result, we obtain the following hyper-renormalization (a code sketch follows the procedure):

1. Let W_α = 1, α = 1, ..., N, and θ₀ = 0.
2. Compute the following matrices M and N:

   M = (1/N) Σ_{α=1}^{N} W_α ξ_α ξ_αᵀ,   (40)

   N = (1/N) Σ_{α=1}^{N} W_α ( V₀[ξ_α] + 2S[ξ_α e_αᵀ] )
       − (1/N²) Σ_{α=1}^{N} W_α² ( (ξ_α, M⁻_{n−1} ξ_α) V₀[ξ_α] + 2S[V₀[ξ_α] M⁻_{n−1} ξ_α ξ_αᵀ] ).   (41)

   Here, M⁻_{n−1} is the pseudoinverse of M with truncated rank n − 1, i.e., with the smallest eigenvalue replaced by 0 in the spectral decomposition.
3. Solve the generalized eigenvalue problem Mθ = λNθ, and compute the unit eigenvector θ for the smallest eigenvalue λ in absolute value.
4. If θ ≈ θ₀ up to sign, return θ and stop. Else, let

   W_α ← 1/(θ, V₀[ξ_α]θ),   θ₀ ← θ,   (42)

   and go back to Step 2.
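A minimal sketch of the matrix N of Eq. (41) is given below (illustrative only; as in the text, true values are replaced by observations). The vectors e_α are those of Eq. (39); for the ellipse case of Example 1 with V₀[x_α] = I they reduce to (1, 0, 1, 0, 0, 0)ᵀ. The generalized eigenvalue problem is then solved exactly as in the renormalization sketch above.

```python
import numpy as np

def hyper_N(xis, V0s, es, W):
    # Eqs. (40)-(41); assumes M has rank at least n-1 (true for noisy data).
    N, n = xis.shape
    M = (W[:, None, None] * np.einsum('ai,aj->aij', xis, xis)).mean(axis=0)      # Eq. (40)
    vals, vecs = np.linalg.eigh(M)
    inv = np.zeros(n)
    inv[1:] = 1.0 / vals[1:]                 # truncated-rank (n-1) pseudoinverse
    M_pinv = vecs @ np.diag(inv) @ vecs.T
    sym = lambda A: 0.5 * (A + A.T)          # symmetrization S[.]
    out = np.zeros((n, n))
    for a in range(N):
        xi, V0, e, w = xis[a], V0s[a], es[a], W[a]
        out += (w / N) * (V0 + 2.0 * sym(np.outer(xi, e)))
        out -= (w ** 2 / N ** 2) * ((xi @ M_pinv @ xi) * V0
                                    + 2.0 * sym(V0 @ M_pinv @ np.outer(xi, xi)))
    return out
```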

It turns out that the initial solution with W_α = 1 coincides with what is called HyperLS [11,12,18], which is derived to remove the bias up to second order error terms within the framework of algebraic methods without iterations. The expression of Eq. (41) with W_α = 1 lacks one term as compared with the corresponding expression of HyperLS, but the same solution is produced. We omit the details, but all the intermediate solutions in the hyper-renormalization iterations are

also free of second order bias. Thus, hyper-renormalization is an iterative improvement of HyperLS. As in the case of iterative reweight and renormalization, the covariance matrix V[θ] of the hyper-renormalization solution agrees with the KCR lower bound up to O(σ⁴).
Standard linear algebra routines for solving the generalized eigenvalue problem Mθ = λNθ assume that N is positive definite, but the matrix N in Eq. (41) has both positive and negative eigenvalues. For the Taubin method and renormalization, the matrix N in Eq. (18) is positive semidefinite, having eigenvalue 0. This, however, causes no difficulty, because the problem can be rewritten as

Nθ = (1/λ) Mθ.   (43)

The matrix M in Eq. (40) is positive definite for noisy data, so we can use a standard routine to compute the eigenvector θ for the largest eigenvalue 1/λ in absolute value. If the matrix M happens to have eigenvalue 0, it indicates that the data are all exact, so its null vector θ is the exact solution.
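A small sketch of this reformulation (illustrative only) with SciPy:

```python
import numpy as np
from scipy.linalg import eig

def solve_via_eq43(M, Nmat):
    # Eq. (43): solve N theta = mu M theta with mu = 1/lambda and take the
    # eigenvector for the largest |mu|, i.e. the smallest |lambda|.
    mus, vecs = eig(Nmat, M)
    mag = np.where(np.isfinite(mus), np.abs(mus), -np.inf)
    theta = np.real(vecs[:, np.argmax(mag)])
    return theta / np.linalg.norm(theta)
```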

7 Ellipse Fitting Experiment


We define 30 equidistant points on the ellipse shown in Fig. 1(a). The major and minor axes are set to 100 and 50 pixels, respectively. We add random Gaussian noise of mean 0 and standard deviation σ to the x and y coordinates of each point independently and fit an ellipse to the noisy point sequence using the following methods: 1. LS, 2. iterative reweight, 3. the Taubin method, 4. renormalization, 5. HyperLS, 6. hyper-renormalization, 7. ML, 8. ML with hyperaccurate correction [9].
For our noise, ML means reprojection error minimization, which can be computed by repeated Sampson error minimization [13]. We used the FNS of Chojnacki et al. [4] for minimizing the Sampson error, but according to our experiments, the FNS solution agrees with the ML solution up to three or four significant digits, as also observed in [13], so we identified the FNS solution with ML.
Figures 1(b), (c) show fitting examples for σ = 0.5; although the noise magnitude is fixed, fitted ellipses are different for different noise. The true shape is indicated by dotted lines. Iterative reweight, renormalization, and hyper-renormalization all converged after four iterations, while FNS for ML computation required nine iterations for Fig. 1(b) and eight iterations for Fig. 1(c).
We can see that LS and iterative reweight produce much smaller ellipses than the true shape. The closest ellipse is fitted by hyper-renormalization in Fig. 1(b) and by ML with hyperaccurate correction in Fig. 1(c). For a fair comparison, we need statistical tests.
Since the computed θ and its true value θ̄ are both unit vectors, we measure their discrepancy by the orthogonal component Δ⊥θ = P_θ̄θ, where P_θ̄ (≡ I − θ̄θ̄ᵀ) is the orthogonal projection matrix along θ̄ (Fig. 2(a)). We generated


Fig. 1. (a) Thirty points on an ellipse. (b), (c) Fitted ellipses (σ = 0.5). 1. LS, 2. iterative reweight, 3. the Taubin method, 4. renormalization, 5. HyperLS, 6. hyper-renormalization, 7. ML, 8. ML with hyperaccurate correction. The dotted lines indicate the true shape.

10000 independent noise instances for each σ and evaluated the bias B (Fig. 2(b)) and the RMS (root-mean-square) error D (Fig. 2(c)) defined by

B = ‖ (1/10000) Σ_{a=1}^{10000} Δ⊥θ^(a) ‖,   D = √( (1/10000) Σ_{a=1}^{10000} ‖Δ⊥θ^(a)‖² ),   (44)

where θ^(a) is the solution in the ath trial. The dotted line in Fig. 2(c) indicates the KCR lower bound [2,8,10]

D_KCR = (σ/√N) √(tr M̄⁻),   (45)

where M is the pseudoinverse of the true value M (of rank 5) of the M in
Eqs. (11), (18), and (40), and tr stands for the trace. The interrupted plots
in Fig. 2(b) for iterative reweight, ML, and ML with hyperaccurate correction
indicate that the iterations did not converge beyond that noise level. Our con-
vergence criterion is  0  < 106 for the current value and the value 0
in the preceding iteration; their signs are adjusted before subtraction. If this cri-
terion is not satised after 100 iterations, we stopped. For each , we regarded
the iterations as nonconvergent if any among the 10000 trials did not converge.
Figure 3 shows the enlargements of Figs. 2(b), (c) for the small part.
We can see from Fig. 2(b) that LS and iterative reweight have very large bias, in contrast to which the bias of the Taubin method and renormalization is very small. The bias of HyperLS and hyper-renormalization is still smaller and even smaller than that of ML. Since the leading covariance is common to iterative reweight, renormalization, and hyper-renormalization, the RMS error reflects the magnitude of the bias as shown in Fig. 2(c). Because the hyper-renormalization solution does not have bias up to high order error terms, it has nearly the same accuracy as ML. A close examination of the small σ part (Fig. 3(b)) reveals that hyper-renormalization outperforms ML. The highest accuracy is achieved, although the difference is very small, by Kanatani's hyperaccurate correction of ML [9].


Fig. 2. (a) The true value θ̄, the computed value θ, and the orthogonal component Δ⊥θ of θ. (b), (c) The bias (b) and the RMS error (c) of the fitted ellipse for the standard deviation σ of the added noise over 10000 independent trials. 1. LS, 2. iterative reweight, 3. the Taubin method, 4. renormalization, 5. HyperLS, 6. hyper-renormalization, 7. ML, 8. ML with hyperaccurate correction. The dotted line in (c) indicates the KCR lower bound.


Fig. 3. (a) Enlargement of Fig. 2(b). (b) Enlargement of Fig. 2(c).

However, it first requires the ML solution, and the FNS iterations for its computation may not converge above a certain noise level, as shown in Figs. 2(b), (c). In contrast, hyper-renormalization is very robust to noise, because the initial solution is HyperLS, which is itself highly accurate as shown in Figs. 2 and 3. For this reason, we conclude that it is the best method for practical computations.
Figure 4(a) is an edge image of a scene with a circular object. We fitted an ellipse to the 160 edge points indicated in red, using various methods. Figure 4(b) shows the fitted ellipses superimposed on the original image, where the occluded part is also artificially composed for visual ease. In this case, iterative reweight converged after four iterations, and renormalization and hyper-renormalization converged after three iterations, while FNS for ML computation required six iterations. We can see that LS and iterative reweight produce much smaller ellipses than the true shape, as in Figs. 1(b), (c). All other fits are very close to the true ellipse, and ML gives the best fit in this particular instance.


Fig. 4. (a) An edge image of a scene with a circular object. An ellipse is fitted to the 160 edge points indicated in red. (b) Fitted ellipses superimposed on the original image. The occluded part is artificially composed for visual ease. 1. LS, 2. iterative reweight, 3. the Taubin method, 4. renormalization, 5. HyperLS, 6. hyper-renormalization, 7. ML, 8. ML with hyperaccurate correction.

8 Conclusions
We have reformulated iterative reweight and renormalization as geometric estimation techniques not based on the minimization principle and optimized the matrix N that appears in the renormalization computation so that the resulting solution has no bias up to the second order noise terms. We called the resulting scheme hyper-renormalization and applied it to ellipse fitting. We observed:
1. Iterative reweight is an iterative improvement of LS. The leading covariance of the solution agrees with the KCR lower bound, but the bias is very large, hence the accuracy is low.
2. Renormalization [7,8] is an iterative improvement of the Taubin method [19]. The leading covariance of the solution agrees with the KCR lower bound, and the bias is very small, hence the accuracy is high.
3. Hyper-renormalization is an iterative improvement of HyperLS [11,12,18]. The leading covariance of the solution agrees with the KCR lower bound, with no bias up to high order error terms. Its accuracy outperforms ML.
4. Although the difference is very small, ML with hyperaccurate correction exhibits the highest accuracy, but the iterations for its computation may not converge in the presence of large noise, while hyper-renormalization is robust to noise.
We conclude that hyper-renormalization is the best method for practical computations.

Acknowledgments. The authors thank Takayuki Okatani of Tohoku Univer-


sity, Japan, Mike Brooks and Wojciech Chojnacki of the University of Adelaide,
Australia, and Peter Meer of Rutgers University, U.S.A. This work was supported
in part by JSPS Grant-in-Aid for Challenging Exploratory Research (24650086).

References
1. Al-Sharadqah, A., Chernov, N.: A doubly optimal ellipse fit. Comp. Stat. Data Anal. 56(9), 2771–2781 (2012)
2. Chernov, N., Lesort, C.: Statistical efficiency of curve fitting algorithms. Comp. Stat. Data Anal. 47(4), 713–728 (2004)
3. Chojnacki, W., Brooks, M.J., van den Hengel, A.: Rationalising the renormalization method of Kanatani. J. Math. Imaging Vis. 21(11), 21–38 (2001)
4. Chojnacki, W., Brooks, M.J., van den Hengel, A., Gawley, D.: On the fitting of surfaces to data with covariances. IEEE Trans. Patt. Anal. Mach. Intell. 22(11), 1294–1303 (2000)
5. Godambe, V.P. (ed.): Estimating Functions. Oxford University Press, New York (1991)
6. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2000)
7. Kanatani, K.: Renormalization for unbiased estimation. In: Proc. 4th Int. Conf. Comput. Vis., Berlin, Germany, pp. 599–606 (May 1993)
8. Kanatani, K.: Statistical Optimization for Geometric Computation: Theory and Practice. Elsevier, Amsterdam (1996); reprinted, Dover, New York, U.S.A. (2005)
9. Kanatani, K.: Ellipse fitting with hyperaccuracy. IEICE Trans. Inf. & Syst. E89-D(10), 2653–2660 (2006)
10. Kanatani, K.: Statistical optimization for geometric fitting: Theoretical accuracy bound and high order error analysis. Int. J. Comput. Vis. 80(2), 167–188 (2008)
11. Kanatani, K., Rangarajan, P.: Hyper least squares fitting of circles and ellipses. Comput. Stat. Data Anal. 55(6), 2197–2208 (2011)
12. Kanatani, K., Rangarajan, P., Sugaya, Y., Niitsuma, H.: HyperLS for parameter estimation in geometric fitting. IPSJ Trans. Comput. Vis. Appl. 3, 80–94 (2011)
13. Kanatani, K., Sugaya, Y.: Unified computation of strict maximum likelihood for geometric fitting. J. Math. Imaging Vis. 38(1), 1–13 (2010)
14. Leedan, Y., Meer, P.: Heteroscedastic regression in computer vision: Problems with bilinear constraint. Int. J. Comput. Vis. 37(2), 127–150 (2000)
15. Matei, J., Meer, P.: Estimation of nonlinear errors-in-variables models for computer vision applications. IEEE Trans. Patt. Anal. Mach. Intell. 28(10), 1537–1552 (2006)
16. Okatani, T., Deguchi, K.: On bias correction for geometric parameter estimation in computer vision. In: Proc. IEEE Conf. Computer Vision Pattern Recognition, Miami Beach, FL, U.S.A., pp. 959–966 (June 2009)
17. Okatani, T., Deguchi, K.: Improving accuracy of geometric parameter estimation using projected score method. In: Proc. 12th Int. Conf. Computer Vision, Kyoto, Japan, pp. 1733–1740 (September/October 2009)
18. Rangarajan, P., Kanatani, K.: Improved algebraic methods for circle fitting. Electronic J. Stat. 3, 1075–1082 (2009)
19. Taubin, G.: Estimation of planar curves, surfaces, and non-planar space curves defined by implicit equations with applications to edge and range image segmentation. IEEE Trans. Patt. Anal. Mach. Intell. 13(11), 1115–1138 (1991)
Scale Robust Multi View Stereo

Christian Bailer, Manuel Finckh, and Hendrik P.A. Lensch

Computer Graphics, Tübingen University
72076 Tübingen, Germany

Abstract. We present a Multi View Stereo approach for huge unstructured image datasets that can deal with large variations in the surface sampling rate of individual images. Our method always reconstructs surface parts in the best available resolution. It considers scaling not only for large scale differences, but also between arbitrarily small ones for a weighted merging of the best partial reconstructions. We create depth maps with our GPU based depth map algorithm, which also performs normal optimization. It matches several images, found with a heuristic image selection method, to a reference image. We remove outliers by comparing depth maps against each other with a fast but reliable GPU approach. Then, we merge the different reconstructions from depth maps in 3D space by selecting the best points and optimizing them with the non-selected points. Finally, we create the surface by using a Delaunay graph cut.

1 Introduction

Thanks to fast camera calibration methods [1,2], today it is possible to automatically calibrate thousands of images in space, i.e., to determine the relative capturing position and camera pose for each image. An interesting application for such calibrated datasets is dense 3D reconstruction, as this allows for automatically creating accurate 3D models of large objects or even whole city parts out of common photographs. For famous locations, these photographs can even simply be taken from image portals like Flickr. However, there are still very few methods that can deal with such datasets, even though there are over 50 different Multi View Stereo approaches today. For example, this can be seen in the submission list of the Middlebury evaluation portal [3]. Many methods, which are based on voxels [4,5,6] or visual hulls [7,8,9], are not flexible enough to reconstruct big open datasets. Others, mainly those based on multiple depth maps or patches [10,11,12,13], are more flexible, but they usually use simple heuristics to find compatible images for depth map or patch creation and ignore the fact that images can show objects in strongly differing resolutions. Thus, they are primarily suited for structured datasets, where e.g. neighboring images in the image stream are also neighbors in space. To be able to reconstruct more general datasets, where image views can be arbitrarily scattered in space, an approach additionally needs some kind of image selection method that determines which images are worthwhile to compare because they show the same objects. Good image selection is important for accuracy but not trivial, because it depends on


Fig. 1. Reconstructed 3D model of the Ulm Minster (German: Ulmer Münster) dataset with very large resolution differences, created by our method. Note the huge scale difference from the leftmost image (about 160 m height) to the rightmost image (about 3 cm).

the surface distance (Figure 3(a)); the most similar images are not necessarily the best choice. Furthermore, an algorithm should also consider whether there are scale differences in the dataset, i.e., whether objects can be seen in different images from different distances at different scales. Otherwise this usually leads to a loss of reconstruction detail. Our approach provides solutions for both problems. For image selection we extend the image selection approach of Goesele et al. [14]. Scale robustness is achieved by considering scale in all stages of this work. These are, besides image selection, multi depth map filtering, point selection, point optimization and meshing. Multi depth map filtering is an efficient, GPU parallelizable way to remove outliers by comparing depth maps against each other, and point selection and optimization selects the assumed best points in space and optimizes their positions using the worse, non-selected points. This is only possible if we carefully consider their assumed accuracy.
2 Related Work
One of the most constrained approaches to reconstructing huge datasets is reconstruction from street level video. Pollefeys et al. [15] demonstrated that this can even be done in real time. Frahm et al. [16] developed an algorithm which is able to deal with millions of images. However, their method seems to be primarily interesting for image preselection and calibration and not for massive stereo reconstruction, as they demonstrated this only locally for single objects.
Jancosek et al. [17] proposed a scalable patch based method which can deal with huge unstructured image datasets, but it is not scale robust and has only a simple image selection heuristic. Goesele et al. [14] were probably the first to propose a robust image selection method for huge unstructured datasets with large scale differences. This allowed them to create good depth maps for such datasets. However, they didn't consider point accuracy for meshing. Thus, the overall method is not scale robust. They addressed this problem in recent publications where they presented scale robust meshing methods for depth

maps [18,19]. Furukawa et al. [20] clustered huge unstructured datasets into clusters of similar visibility and scale. These clusters can then be independently processed by simple Multi View Stereo algorithms without image selection, like their previous work [12]. Afterwards, they refine the point clouds of different clusters by comparing them against each other. Thereby, they also consider point accuracies and remove points from clusters if a surface part is represented by another cluster in clearly higher resolution. However, they don't demonstrate meshing for their point clouds.

3 Overview

This section gives an overview of the proposed pipeline. Before our approach can work with raw image datasets, these must first be calibrated in space. We use the publicly available Bundler software [1] for this task. Bundler delivers rotation matrices, translation vectors, distortion coefficients and focal lengths for all images that could successfully be calibrated, and a sparse set of surface points. This data is the input to our pipeline, which is outlined in Figure 2. We create depth maps from the perspective of each image by matching it to other images that are selected as described in Section 4. The selection process prefers images with similar scale, large view angles and a big overlap in field of vision with the reference image. Additionally, we demand that overlaps and view angles are not too similar in different selected images. Our depth map algorithm, which is described in Section 5, is GPU based and can perform normal optimization similar to region growing based approaches. It further filters out outliers. After creating and independently filtering all depth maps, they are compared against each other, and points in depth maps that have not enough support from other depth maps or have too many visibility conflicts are filtered out by the approach in Section 6. Subsequently, points are transformed into 3-dimensional space and ordered into a kd-tree. Because depth maps often have redundancy to one another, the point cloud is very dense in some regions. Thus, the best points are selected and the rest of the points are only used to optimize the positions of the selected points, as described in Section 7. Thereby, the redundancy can be used to remove surface noise. Selected local outliers are also removed in the optimization process by pulling them onto the surface. Finally, the mesh is created by a Delaunay graph cut described in Section 8.

Fig. 2. The pipeline of our approach



Fig. 3. a) Good image selection depends on the surface distance. Green indicates a good image distance. b) Our angular weighting scheme reduces the influence of close-by views, here demonstrated on the Ulm Minster dataset.

4 Image Selection
In this section we present an image selection heuristic that considers scale and robust stereo reconstruction. To find suitable images for stereo matching with diversity, Goesele et al. [14] demand view angles above 10 degrees between all images, including the reference image to which the other images are matched for depth map creation, if possible. However, works like [21] showed that 10 degrees is by far not enough for accurate stereo matching. Thus, we extend their approach by demanding especially big view angles between the reference and the other images. Furthermore, our approach prioritizes images in which regions of the reference image are visible that are hardly or not visible in already selected images.
To find suitable images I_s for stereo matching with a reference image R we use sparse Structure From Motion feature points f, delivered by Bundler [1]. These points already carry visibility information and are necessary because knowledge of the surface distance is essential for a good selection (Figure 3(a)). To rate an image, we use, similar to Goesele et al., a weighted sum over all feature points visible in R and the tested image:¹

gR (I) = wa (f )ws (f )wc (f ) (1)
f FR FI

where I is an image, wa is the angle weighting, ws the scale weighting, wc the


covering weighting and FX the set of feature points visible in image X. Our angle
weighting scheme diers from the reference and the already selected images:
x  y
I,R (f ) I,J (f )
wa (f ) = min ,1 min ,1 , (2)
max max
J|JIs f F (J)

whereX,Y (f ) is the view angle dierence at point f for rays from the center of
projection of image X and Y and Is the set of already selected images. As Is
1
Please consult [14] for a detailed motivation for using a weighted sum.
402 C. Bailer, M. Finckh, and H.P.A. Lensch

increases over time all images must be rerated after each selection of the best
image. We used mostly max = 35 , max = 14 , x = 1.5 and y = 1. Our
coverage weighting, which favors feature points that are only sparsely covered
by already selected images is given by:
rI (f )
wc (f ) =  . (3)
rI (f ) +
J|JIs pP (J) rJ (f )


A simple choice for rX that ignores scale would be rX = 1. However, we dont
want images with good scales be covered by images with bad scales, so we set it

to rX (f ) = min sR (f )2 /sX (f )2 , 1 . This can be especially important at depth
discontinuities. There, it might not be possible to select images with good scale
on both sides of the discontinuity. We also changed the scale weighting of Goesele
et al. to be more strict, as we dont want big angles with bad scale to be better
rated than small angles with good scale. If sX (f ) is the scale at f in an image
and r = sR (f )/sI (f ) our scale weighting is calculated as:


0 r > 1.8
r2 1 < r 1.8
ws (f ) = 1 1/1.6 < r 1 . (4)


 1.6 2
r else

As Figure 3(b) shows, the provided angles are often smaller than 35°. Anyway, for x = 1.5 we get more than twice as many large angles as Goesele et al. Furthermore, our approach also performs better for the accuracy weighting max(1, 1/r). Overall, we get better view angles not at the cost of worse scaling.
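The following sketch (not the authors' code) shows how the score of Eqs. (1)–(4) can be evaluated; the per-feature inputs (angles, scale ratio, coverage sum) and the exact piecewise form of Eq. (4) follow the reconstruction above and should be treated as assumptions.

```python
# Each feature f shared by R and the candidate image I is assumed to carry:
#   'ang_IR': view angle (deg) between I and R at f,
#   'ang_sel': smallest view angle (deg) to any already selected image seeing f, or None,
#   'r': scale ratio s_R(f)/s_I(f),
#   'cover': sum of r_J(f) over already selected images J seeing f (Eq. (3) denominator part).
def w_scale(r):
    if r > 1.8:
        return 0.0
    if r > 1.0:
        return 2.0 - r
    if r > 1.0 / 1.6:
        return 1.0
    return (1.6 * r) ** 2

def score_image(features, a_max=35.0, b_max=14.0, x=1.5, y=1.0):
    g = 0.0
    for f in features:
        wa = min(f['ang_IR'] / a_max, 1.0) ** x            # Eq. (2), reference term
        if f['ang_sel'] is not None:
            wa *= min(f['ang_sel'] / b_max, 1.0) ** y      # Eq. (2), selected-images term
        r = f['r']
        r_I = min(r ** 2, 1.0)                             # r_I(f) = min(s_R^2 / s_I^2, 1)
        wc = r_I / (r_I + f['cover'])                      # Eq. (3)
        g += wa * w_scale(r) * wc                          # Eq. (1)
    return g
```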

5 Depth Map Creation


After having selected images, we can now compute depth maps and filter them on the GPU. In contrast to common GPU based methods, our algorithm can perform normal optimization like region growing based approaches [14,12] to get high quality depth maps. Our method is based on the stochastic PatchMatch algorithm [22] for texture synthesis. Our depth map estimation pipeline is illustrated in Figure 4.
First, we initialize the depth map by inserting feature points provided by Bundler as depth values. Then, we propagate depths into neighboring pixels if there is no depth value or the matching error of the existing depth is bigger than the matching error of the propagated depth. The original algorithm propagates values for each pixel from the left and from the top like in Figure 5(c), i.e., (x − 1, y) → (x, y) and (x, y − 1) → (x, y). If pixels are processed in the order of Figure 5(a), good values can propagate arbitrarily far to the right and to the bottom in only one pass. We found that this approach can directly be parallelized by processing the blue lines in Figure 5(b) in parallel, which leads to similar results as Figure 5(a). However, the parallelism depends on the position.

Fig. 4. The depth map creation pipeline

Fig. 5. Ways to propagate depth values through the depth map

Thus, we use horizontal processing lines and the three propagation directions of Figure 5(d). We call this downward propagation. Additionally, we perform upward propagation at the same time. Up- and downward propagation is performed in all odd propagation steps, left- and rightward propagation in all even steps.
Fig. 6. Difference to ground truth. Left: no normal tuning. Right: tuning with random normals. Artifacts disappear nearly completely after the second propagation step.

After propagation, depth values are randomly shifted in depth and the new depth is kept if the matching error of the shifted depth is smaller. We perform three shifts with uniform distributions with maximum shift distances of d/(16f), d/(8f) and d/(4f), where d is the depth value and f the focal length of the image.
Depth values initially have no normals; hence, we assign normals with uniformly random direction and surface angle to them if the matching error decreases:

n_x = sin(α) tan(β),   n_y = cos(α) tan(β),   (5)

where n_x and n_y are the gradients of the tangent plane in x and y direction, α is the random direction and β the random surface angle. As Figure 6 shows, random normals are sufficient as the good ones get distributed over the depth map in the next propagation step. Overall, we perform 3 propagation steps, as this usually almost leads to convergence. Even with two steps we mostly retrieved good results. Important for fast convergence of sloped surfaces is the usage of normal information to propagate into the tangent direction in propagation steps 2 and 3. Finally, we optimize the depth values and normals with a

gradient descent similar to region growing based approaches. As matching error E over all selected images I_S to the reference image R we use a self-weighted average over all 1 minus Normalized Cross Correlation (1 − NCC) scores:

E = ( Σ_{I ∈ I_S} (1 − NCC_{R,I}) · 1/(1 − NCC_{R,I}) ) / ( Σ_{I ∈ I_S} 1/(1 − NCC_{R,I}) ) = ( Σ_{I ∈ I_S} 1 ) / ( Σ_{I ∈ I_S} 1/(1 − NCC_{R,I}) ).   (6)

The error function was occlusion robust in our tests, as it strongly prefers small errors. It performed even at non-occluded surface parts better than a common weighting.² For details about matching with normals please consult [12].
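Eq. (6) reduces to the harmonic mean of the (1 − NCC) scores over the selected images; a tiny sketch (illustrative only):

```python
import numpy as np

def matching_error(ncc_scores):
    # Eq. (6): each (1 - NCC) score is weighted by its own inverse, which is
    # equivalent to the harmonic mean of the scores.
    s = 1.0 - np.asarray(ncc_scores, dtype=float)   # one value per selected image
    s = np.maximum(s, 1e-12)                        # guard against division by zero
    return len(s) / np.sum(1.0 / s)
```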

Depth Map Filtering. To filter erroneous values in the depth map, we discard all depth values that have a (1 − NCC) matching error above a threshold in at least n images:

n = max(c, min(3, I_v/2)),   (7)

where I_v is the number of images that could see the point if there were no occlusion, i.e., the point does not lie outside the image area when projected into the image. We use c = 2 in most of our tests and c = 1 for sparse datasets.

Fig. 7. Left: Points don't resemble a surface. Right: Segmentation of our filter.

Additionally, we filter outliers if they don't resemble a surface (see Figure 7). This is done by segmenting the images by depth discontinuities. If there is a way from one point in the depth map to another point without crossing a depth discontinuity above a threshold, both are in the same region. Thereafter, we
delete all regions which have less than 15 points. The depth threshold is set to 2d/f, where d is the current depth and f the focal length of the image.

6 Multi Depth Map Filtering


In this section we describe how outliers and other inconsistent points are removed in a fast, GPU parallelizable, but very reliable way that works for arbitrary datasets. The idea is to compare depth maps against each other, as proposed by Merrell et al. [25], and to filter out points in depth maps that are inconsistent with other depth maps. To determine which depth maps should be compared to a reference depth map that we want to filter, we use our image selection algorithm.

² For tests in this section we used the datasets of [23] with ground truth depth maps of [24]. See: http://cvlab.epfl.ch/alumni/tola/daisy.html

Fig. 8. a) Depth maps can be consistent (B), inconsistent (A, C) or show different non-conflicting surfaces (not shown). b) Difficult scene for MVS, because of textureless and transparent objects and bad camera calibrations. Left: even the Delaunay graph cut cannot correct the unfiltered depth maps. Right: with our additional multi depth map filtering we can recover a lot of surface parts.

Especially the image selection makes the approach powerful, because the depth maps it finds show not only the same surface parts as the reference depth map, but are also very unlikely to have the same outliers, as they differ in view angle and thus in the set of images they were created from.
Figure 8(a) shows the different cases that can occur when corresponding points in two depth maps are compared to each other. In order to compare them, we assign a support value to each depth value. A depth value gains support for similarity (case B in Figure 8(a)) and loses support for visibility conflicts (cases A and C). Hereby, we take only the difference of case A and C conflicts as negative support into account, like in [25], because a point cannot be too near and too far at the same time. The support for a depth value is defined as:

S_R(x, y) = Σ_{D ∈ D_s} B_{R,D}(x, y) − r_b · | Σ_{D ∈ D_s} A_{R,D}(x, y) − Σ_{D ∈ D_s} C_{R,D}(x, y) |,   (8)

where A_{R,D}, B_{R,D} and C_{R,D} are booleans that determine the case, A, B, or C, between the reference depth map R and depth map D, and D_s is the set of compared depth maps. r_b is a constant, which is set to 2 in this work, because we think cases A and C are, despite taking the difference, still more strongly affected by outliers in other depth maps than case B. After all depth maps have been processed, points which meet the condition S_R(x, y) < 1 are filtered out.
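A sketch of this support computation over per-pixel case maps (illustrative; the boolean maps A, B, C are assumed to be precomputed for each compared depth map):

```python
import numpy as np

def support(A_maps, B_maps, C_maps, r_b=2.0):
    # Eq. (8): A_maps, B_maps, C_maps are stacks of boolean maps, one per compared
    # depth map, all aligned with the reference depth map.
    B = np.sum(B_maps, axis=0).astype(float)   # similarity votes
    A = np.sum(A_maps, axis=0).astype(float)
    C = np.sum(C_maps, axis=0).astype(float)
    S = B - r_b * np.abs(A - C)
    keep = S >= 1.0                            # points with S_R(x, y) < 1 are filtered out
    return S, keep
```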
We test for conflicts similarly to [25], but when we render a depth map for case C into the view of the reference depth map, we render each depth value into the 4 closest pixels³ instead of only one to avoid gaps (see Figure 9) that can otherwise strongly occur for large view angles and scale differences. There is a conflict if a depth value of the rendered depth map is significantly smaller than the value of the reference depth map.

³ Depth values can only be overwritten by smaller ones.

Case B could also be decided like case C. However, we handle it like case A because we think this is more accurate. This is done by transforming points of the reference depth map into the view of another depth map and comparing depth values there on the rays of the other view. Thereby, direct comparisons are possible without rendering. For the acceptance threshold we consider for case B only the accuracy of the reference depth map. For conflicts we take the worse of both accuracies:

T_A = T_C = c · max(d_R(X), d_{D_i}(X)),   T_B = (c/2) · d_R(X),   (9)

where c is a constant we set to 1.6, X is the 3D representation of the tested point in the reference depth map, R the reference depth map, D_i another depth map, and d the sampling distance of a depth map at a point, i.e., the depth the point would have in that depth map divided by the focal length of the depth map. We do not test for conflicts if 1.6 · d_{D_i}(X) > d_R(X), to avoid border effects at depth discontinuities caused by inaccurate borders in low resolution depth maps. After multi depth map filtering, we again filter the depth maps with the segmentation filter of Figure 7 to remove single surviving outliers.
The approach can also be used to filter depth values in small scale depth maps. Therefore, the selection algorithm can be modified to prefer depth maps with 1.6 times or higher sampling rates than the reference depth map, and depth values are filtered out if such a depth map is found for a pixel position. This is not necessary for scale robustness, but speeds up the rest of our approach. Practically, we use the filter but don't select extra high resolution depth maps.

Fig. 9. When rendering a depth map into another view some pixels are covered multiple
times, others never. This can be avoided by rendering points into 4 pixels at once.

7 Point Selection and Optimization


After all depth maps are filtered, the remaining depth values are transformed into 3-dimensional space. We transform only one depth value per 2x2 pixel window, as depth maps from stereo usually have no pixel accuracy. Alternatively, only half sized depth maps could be created. Anyway, we created full depth maps and used a small bilateral filter on the depth map to keep the extra information. In 3D space, each point gets, besides position and color, a normal, which is already delivered by some depth map algorithms, and a scale value and influence radius that we assume as equal for the moment: I = S = 2z/f, where z is the depth value in the depth map, f the focal length of the depth map, and the factor of 2 is set because we used only one value per 2x2 pixel window. Then, all points are defined as primary points that have no primary points with smaller I in their

own inuence radius. This is done by testing the points with the smallest I rst.
Neighbors of a point in the same depth map can slightly lie inside the inuence
radius and are ignored to keep the regular depth map structure. Primary points
then represent the surface approximately in the resolution of the best depth
map and are most likely the most accurate points. Nevertheless, they can still
be improved in position by using other points with similar scale to remove surface
noise. We use a modied minimum least square algorithm [26] for this:

p′_{i+1} = Σ_{q ∈ P_i} w(‖p_i − q‖)(S_p/S_q)² q / Σ_{q ∈ P_i} w(‖p_i − q‖)(S_p/S_q)²,   (10)

n_{i+1} = Σ_{q ∈ P_i} w(‖p_i − q‖)(S_p/S_q)² n_q / ‖ Σ_{q ∈ P_i} w(‖p_i − q‖)(S_p/S_q)² n_q ‖,   (11)

p_{i+1} = p_i + ((p′_{i+1} − p_i) · n_i) n_i.   (12)

p₀ = p is an unoptimized primary point, and P_i contains all unmodified points that


Fig. 10. (a) Local outlier avoidance with iteration. (b) Left: unoptimized primary points. Right: optimized primary points.

are in the radius of 2Ipi and have a scale that is smaller than 1.6 times the scale
of p. According to our simulations, (Sp /Sq )2 is the optimal weighting function
for all medium free noise functions. Of course, this is not fullled between depth
maps of dierent scales, because rough depth maps can not have information
about ne details. That is why we dont include inaccurate points to Pi . ni is
the local normal. pi is only allowed to move along the normal and not along
the tangent direction for pi+1 . pi is optimized to pi+1 until the process fails or
converges. If it fails because it does not terminate after 20 iterations, |Pi | < 3,
or the point moves out of its original inuence radius the point is discarded. The
process converges if:

\| p_n - p_{n+1} \| < \min(\epsilon, \| p_n - p_{n-1} \|) ,   (13)

with a small \epsilon > 0. As weighting function we use


w(x) = \frac{1}{|c\,(x/I_p)^3| + 1} .   (14)

We use this function to strengthen the effect illustrated in Figure 10(a) of pulling local outlier
points onto the surface. The commonly used Gaussian function falls off too quickly to
take much advantage of this effect. To speed up the optimization process we start
two optimization runs per point, one with c = 5 and one with c = 30. Local outliers
can slide between other points on the surface, so we discard primary points if
there are better primary points within 0.8 times their influence radius after optimization.
Finally, we use Eq. 10 with the color c_q instead of q to improve the color
of an optimized point. Figure 10(b) shows the difference between unoptimized and
optimized primary points: details are kept while noise is removed.
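To make the update concrete, the following Python sketch optimizes a single primary point with the weight of Eq. 14 and the convergence test of Eq. 13. It is only an illustration of the equations, not the authors' implementation; in particular, the neighbour set is kept fixed across iterations and all inputs are assumed to be numpy 3-vectors.

import numpy as np

def w(x, I_p, c):
    # Weighting function of Eq. 14.
    return 1.0 / (abs(c * (x / I_p) ** 3) + 1.0)

def optimize_point(p, n, S_p, I_p, neighbors, c=5.0, eps=1e-6, max_iter=20):
    """Sketch of Eqs. 10-12 for one primary point.  neighbors is a fixed
    list of (q, n_q, S_q) tuples with S_q < 1.6 * S_p and ||q - p|| < 2 * I_p;
    in the paper the set P_i is defined per iteration, here it is kept fixed."""
    if len(neighbors) < 3:
        return None                                        # discard the point
    p0, prev_step = p.copy(), np.inf
    for _ in range(max_iter):
        wts = np.array([w(np.linalg.norm(p - q), I_p, c) * (S_p / S_q) ** 2
                        for q, n_q, S_q in neighbors])
        q_mat = np.array([q for q, _, _ in neighbors])
        n_mat = np.array([n_q for _, n_q, _ in neighbors])
        p_mls = (wts[:, None] * q_mat).sum(axis=0) / wts.sum()   # Eq. 10
        n_sum = (wts[:, None] * n_mat).sum(axis=0)
        n = n_sum / np.linalg.norm(n_sum)                        # Eq. 11
        p_new = p + np.dot(n, p_mls - p) * n                     # Eq. 12
        step = np.linalg.norm(p_new - p)
        if np.linalg.norm(p_new - p0) > I_p:
            return None                       # moved out of its influence radius
        if step < min(eps, prev_step):        # convergence test of Eq. 13
            return p_new
        p, prev_step = p_new, step
    return None                               # no convergence after 20 iterations

The two optimization runs with c = 5 and c = 30 mentioned above would simply call this routine twice with different values of c.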
However, noise removal is only possible if there are enough samples with
similar scale in the influence radius. Otherwise, we must increase I before point
selection and decouple it from S to avoid noisy surfaces. We estimate the noise
N as
1/N = \sum_{q \in P_i} w(\|p_i - q\|)\,(I_p/I_q)^2 .   (15)
I is increased as long as 1/N < Z and I < 3S. We set Z to 2 and c in w to 10.


In practice, the best choice depends on the quality of the depth maps and, hence, on the dataset.

8 Meshing
For meshing of our optimized primary points we use the Delaunay graph cut of
Labatut et al. [27], because it is robust to changes in point density and considers
visibility, which helps to reconstruct difficult surface parts. In their work they
also showed that their approach is very robust against outliers, and we obtained similar
results with random noise and dense surface point clouds. However, as can be
seen in Figure 8(b), outlier avoidance in real datasets can be more difficult.
For their visibility constraint we consider not only the reference image from
which a point was created, but we also add the reference image of each sec-
ondary point to the nearest optimized primary point. Additionally, we bind their
constant to the scale of the primary point, to make the ray accuracy depend
on the point accuracy rather than the ray length:

\mathrm{const}_p = \mathrm{const} \cdot S_p .   (16)

9 Results
We tested our approach on several datasets. Results can be seen in Figures 1, 11
and 12. Our memory peak on the Ulm Minster dataset was 12.8 GB, caused mainly by the
Delaunay graph cut, for about 18 million triangles and 55 million points used
for point selection and optimization. Point processing took about 20
minutes on a single core and meshing 1 hour. Compared to [18,19] our approach is clearly faster
and needs clearly less memory, as far as results on different datasets are comparable.
Furthermore, it is probably also more accurate, as we perform careful continuous
optimizations instead of just using voxels of different scales. Depth map creation


Fig. 11. a) The upper row shows reconstructions from a Ladybug video dataset.
With short Ladybug videos it is possible to reconstruct huge scenes with our ap-
proach. The lower row shows our sofa and garden datasets. Thanks to point optimiza-
tion, the sofa dataset is very smooth yet accurate at locations where many
depth maps overlap. b) Ulm Minster surface.


Fig. 12. a) Some differences with (bottom) and without (top) covering weighting on
the garden dataset. b) Columns. The rightmost reconstruction was strongly reduced in size
to demonstrate the high resolution. c) Some parts of the chapel were photographed at
especially high resolution compared to the other parts. Thus, there is a strong resolution
discontinuity at the border of the close-up reconstruction, which is clearly visible in the left
image. The left image was, like the column in (b), reduced in size. The full reconstruction
can be seen on the right. d) and e) Further results for the Ulm Minster dataset. The
left column in d) has only pixel size in Figure 11 b).

took on average 35 seconds per depth map, computed from 9 images each, for the 333 depth
maps of 1863x1236 pixels of the Ulm Minster dataset. This is also very fast compared to the run times
of the region-growing based approaches in [3]. Multi depth map filtering took 192
seconds altogether; most of that time was needed for loading depth maps from the
hard disk. We got 6.8% more 3D points and 5.3% more mesh points with our
covering weighting on the garden dataset than without it. On the Ulm Minster
dataset, which has fewer occlusions, the gains were 3.4% and 1.8%, respectively. Figure 12(a) shows some differences
caused by the weighting.

10 Conclusion
We presented a scale robust Multi View Stereo approach that can process ar-
bitrary image datasets. Our core contribution for scale robustness is the
point selection and optimization method, which delivers refined point clouds that
can easily be processed by Delaunay based meshing methods. Nevertheless, our
image selection method is also important for high quality depth maps, and, thanks
to image selection, the depth map based outlier filter is a powerful and fast
method to remove outliers.

Acknowledgments. This work has been partially funded by the DFG Emmy
Noether fellowship (Le 1341/1-1) and by an NVIDIA Professor Partnership
Award.

References
1. Snavely, N., Seitz, S.M., Szeliski, R.: Photo tourism: Exploring photo collections in 3D. In: SIGGRAPH Conference Proceedings, pp. 835–846. ACM Press, New York (2006), http://phototour.cs.washington.edu/bundler/
2. Agarwal, S., Snavely, N., Simon, I., Seitz, S., Szeliski, R.: Building Rome in a day. In: 2009 IEEE 12th International Conference on Computer Vision, pp. 72–79. IEEE (2009)
3. Seitz, S., Curless, B., Diebel, J., Scharstein, D., Szeliski, R.: A comparison and evaluation of multi-view stereo reconstruction algorithms. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 519–528. IEEE (2006), http://vision.middlebury.edu/mview/
4. Pons, J., Keriven, R., Faugeras, O.: Multi-view stereo reconstruction and scene flow estimation with a global image-based matching score. International Journal of Computer Vision 72, 179–193 (2007)
5. Vogiatzis, G., Hernandez Esteban, C., Torr, P.H.S., Cipolla, R.: Multiview stereo via volumetric graph-cuts and occlusion robust photo-consistency. IEEE Trans. Pattern Anal. Mach. Intell. 29, 2241–2246 (2007)
6. Kolev, K., Pock, T., Cremers, D.: Anisotropic Minimal Surfaces Integrating Photo-consistency and Normal Information for Multiview Stereo. In: Daniilidis, K. (ed.) ECCV 2010, Part III. LNCS, vol. 6313, pp. 538–551. Springer, Heidelberg (2010)
7. Sinha, S., Pollefeys, M.: Multi-view reconstruction using photo-consistency and exact silhouette constraints: A maximum-flow formulation (2005)
8. Furukawa, Y., Ponce, J.: Carved Visual Hulls for Image-Based Modeling. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006, Part I. LNCS, vol. 3951, pp. 564–577. Springer, Heidelberg (2006)
9. Song, P., Wu, X., Wang, M.: Volumetric stereo and silhouette fusion for image-based modeling. The Visual Computer 26, 1435–1450 (2010)
10. Gallup, D., Frahm, J., Mordohai, P., Yang, Q., Pollefeys, M.: Real-time plane-sweeping stereo with multiple sweeping directions. In: 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. IEEE (2007)
11. Campbell, N.D.F., Vogiatzis, G., Hernandez, C., Cipolla, R.: Using Multiple Hypotheses to Improve Depth-Maps for Multi-View Stereo. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 766–779. Springer, Heidelberg (2008)

12. Furukawa, Y., Ponce, J.: Accurate, dense, and robust multiview stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1362–1376 (2009)
13. Hiep, V., Keriven, R., Labatut, P., Pons, J.: Towards high-resolution large-scale multi-view stereo. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2009, pp. 1430–1437. IEEE (2009)
14. Goesele, M., Snavely, N., Curless, B., Hoppe, H., Seitz, S.: Multi-view stereo for community photo collections. In: Proc. ICCV, pp. 1–8. Citeseer (2007)
15. Pollefeys, M., Nister, D., Frahm, J., Akbarzadeh, A., Mordohai, P., Clipp, B., Engels, C., Gallup, D., Kim, S., Merrell, P., et al.: Detailed real-time urban 3d reconstruction from video. International Journal of Computer Vision 78, 143–167 (2008)
16. Frahm, J.-M., Fite-Georgel, P., Gallup, D., Johnson, T., Raguram, R., Wu, C., Jen, Y.-H., Dunn, E., Clipp, B., Lazebnik, S., Pollefeys, M.: Building Rome on a Cloudless Day. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part IV. LNCS, vol. 6314, pp. 368–381. Springer, Heidelberg (2010)
17. Jancosek, M., Shekhovtsov, A., Pajdla, T.: Scalable multi-view stereo. In: 2009 IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops), pp. 1526–1533. IEEE (2009)
18. Mucke, P., Klowsky, R., Goesele, M.: Surface reconstruction from multi-resolution sample points. In: Proceedings of Vision, Modeling, and Visualization, pp. 105–112 (2011)
19. Fuhrmann, S., Goesele, M.: Fusion of depth maps with multiple scales. ACM Transactions on Graphics (TOG) 30, 148 (2011)
20. Furukawa, Y., Curless, B., Seitz, S., Szeliski, R.: Towards internet-scale multi-view stereo. In: Proceedings of IEEE CVPR. IEEE (2010)
21. Rumpler, M., Irschara, A., Bischof, H.: Multi-view stereo: Redundancy benefits for 3d reconstruction. In: Proceedings of the 35th Workshop of the Austrian Association for Pattern Recognition, AAPR/OAGM (2011)
22. Barnes, C., Shechtman, E., Finkelstein, A., Goldman, D.: PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics (TOG) 28, 24 (2009)
23. Strecha, C., Von Hansen, W., Van Gool, L., Fua, P., Thoennessen, U.: On benchmarking camera calibration and multi-view stereo for high resolution imagery (2008), http://cvlab.epfl.ch/~strecha/multiview/denseMVS.html
24. Tola, E., Lepetit, V., Fua, P.: Daisy: An efficient dense descriptor applied to wide-baseline stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 815–830 (2009)
25. Merrell, P., Akbarzadeh, A., Wang, L., Mordohai, P., Frahm, J., Yang, R., Nister, D., Pollefeys, M.: Real-time visibility-based fusion of depth maps. In: IEEE 11th Int. Conf. on Computer Vision, ICCV 2007, pp. 1–8. IEEE (2007)
26. Cuccuru, G., Gobbetti, E., Marton, F., Pajarola, R., Pintus, R.: Fast low-memory streaming MLS reconstruction of point-sampled surfaces. In: Proceedings of Graphics Interface 2009, pp. 15–22. Canadian Information Processing Society (2009)
27. Labatut, P., Pons, J., Keriven, R.: Robust and efficient surface reconstruction from range data. Computer Graphics Forum 28, 2275–2290 (2009)
Laplacian Meshes
for Monocular 3D Shape Recovery

Jonas Ostlund, Aydin Varol, Dat Tien Ngo, and Pascal Fua

EPFL CVLab.
1015 Lausanne, Switzerland

Abstract. We show that by extending the Laplacian formalism, which


was first introduced in the Graphics community to regularize 3D meshes,
we can turn the monocular 3D shape reconstruction of a deformable
surface given correspondences with a reference image into a well-posed
problem. Furthermore, this does not require any training data and elim-
inates the need to pre-align the reference shape with the one to be re-
constructed, as was done in earlier methods.

1 Introduction

Shape recovery of deformable surfaces from single images is inherently ambigu-


ous, given that many different configurations can produce the same projections.
In particular, this is true of template-based methods, in which a reference image
of the surface in a known configuration is available and point correspondences
between this reference image and an input image in which the shape is to be
recovered are given.
It has been shown that solving this problem amounts to solving a degenerate
linear system, which requires either reducing the number of degrees of freedom or
imposing additional constraints [1]. The first can be achieved by various dimen-
sionality reduction techniques [2, 3] while the second often involves assuming
the surface to be either developable [4–6] or inextensible [7–9, 3]. These two
approaches are sometimes combined and augmented by introducing additional
sources of information such as shading or textural clues [10, 11]. The result-
ing algorithms usually require solving a fairly large optimization problem and,
even though it is often well behaved or even convex [9, 7, 12, 13, 3], it remains
computationally demanding. Closed-form solutions have been proposed [14, 11],
but they also involve solving very large equation systems and making more
restrictive assumptions than the optimization-based ones, which can lower the
performance [3].
Here, we show that, by extending the Laplacian formalism first introduced
in the Graphics Community [15, 16], we can turn the degenerate linear system
mentioned above into a smaller non-degenerate one. More specifically, we ex-
press all vertex coordinates as linear combinations of those of a small number of

This work was supported in part by the Swiss National Science Foundation.


control vertices and inject this expression into the large system to produce the
small one we are looking for. Well-posedness stems from an implicit and novel
regularization term, which penalizes deviations away from the reference shape,
which does not have to be planar.
In other words, instead of performing a minimization involving many degrees
of freedom or solving a large system of equations, we end up simply solving
a very compact linear system. This yields an initial 3D shape estimate, which
is not necessarily accurate, but whose 2D projections are. In practice, this is
extremely useful to quickly and reliably eliminate erroneous correspondences. We
can then achieve better accuracy than the method of [3], which has been shown to
outperform other recent ones, by enforcing the same inextensibility constraints.
This is done by optimizing with respect to far fewer variables, thereby lowering
the computational complexity, and without requiring training data as [3] does.
Furthermore, unlike in other recent approaches that also rely on control
points [9, 17], our implicit regularization term is orientation invariant. As a
result, we do not have to pre-align the reference shape with the one to be re-
covered or to precompute a rotation matrix. To explore this aspect, we replaced
our Laplacian approach to dimensionality reduction by a simpler one based on
the same linear deformation modes as those of [14, 11] and again enforced the
inextensibility constraints of [3]. We will show that without pre-alignment, it
performs much worse. With manual pre-alignment it can yield the same accu-
racy but an automated version would necessitate an additional non-linear opti-
mization step. Furthermore, computing the deformation modes requires either
training data or a stiffness matrix, neither of which may be forthcoming.
In short, our contribution is a novel approach to reducing the dimensionality
of the surface reconstruction problem. It removes the need both for an estimate
of the rigid transformation with respect to the reference shape and for access
to training data or material properties, which may be unavailable or unknown.
Furthermore, this is achieved without sacrificing accuracy.
In the remainder of this paper, we first review existing approaches and remind
the reader of how the problem can be formulated as one of solving a linear
but degenerate system of equations, as was done in earlier approaches [1]. We
then introduce our extended Laplacian formalism, which lets us transform the
degenerate linear system into a smaller well-posed one. Finally, we present our
results and compare them against state-of-the-art methods.

2 Related Work
Reconstructing the 3D shape of a non-rigid surface from a single input image
is a severely under-constrained problem, even when a reference image of the
surface in a dierent but known conguration is available, which is the problem
we address here.
When point correspondences can be established between the reference
and input images, shape-recovery can be formulated in terms of solving an
ill-conditioned linear-system [1] and requires the introduction of additional
constraints to make it well-posed. The most popular ones involve preserving

Euclidean or Geodesic distances as the surface deforms and are enforced either
by solving a convex optimization problem [9, 7, 8, 12, 13, 3] or by solving in
closed form sets of quadratic equations [14, 11]. The latter is typically done by
linearization, which results in very large systems and is no faster than mini-
mizing a convex objective function, as is done in [3] which is one of the best
representatives of this class of techniques.
The complexity of the problem can be reduced using a dimensionality reduc-
tion technique such as Principal Component Analysis (PCA) to create morphable
models [2, 18], modal analysis [11, 12], or Free Form Deformations (FFDs) [9].
As shown in the result section, PCA and modal analysis can be coupled with
constraints such as those discussed above to achieve good accuracy. However,
this implies using training data or being able to build a stiffness matrix, which
is not always possible. The FFD approach [9] is particularly relevant since, like
ours, it relies on parameterizing the surface in terms of control points. However,
it requires placing the control points on a regular grid, whereas ours can be
chosen arbitrarily for maximal accuracy. This makes our approach closer to ear-
lier approaches for fitting 3D surfaces to point clouds that rely on Dirichlet Free Form
Deformations and also allow arbitrary placement of the control points [19].
Furthermore, none of these dimensionality reduction methods allow
orientation-invariant regularization because deformations are minimized with
respect to a reference shape or a set of control points that must be properly ori-
ented. Thus, global rotations must be explicitly handled to achieve satisfactory
results. To overcome this limitation, we took our inspiration from the Laplacian
formalism presented in [15] and the rotation invariant formulation of [16], which
like ours involves introducing virtual vertices. In both these papers, the mesh
Laplacian is used to define a regularization term that tends to preserve the shape
of a non-planar surface. Furthermore, vertex coordinates can be parameterized
as a linear combination of those of control vertices to solve specific minimization
problems [16]. In Section 4, we will go one step further by providing a generic
linear expression of the vertex coordinates as a function of those of the control
vertices, independently of the objective function to be minimized.

3 Linear Problem Formulation


As shown in [1], given point correspondences between a reference image in which
the 3D shape is known and an input image, recovering the new shape in that
image amounts to solving the linear system

Mx = 0 , \quad \text{where } x = \begin{bmatrix} v_1 \\ \vdots \\ v_{N_v} \end{bmatrix} ,   (1)

where v_i contains the 3D coordinates of the i-th vertex of the N_v-vertex trian-
gulated mesh representing the surface, and M is a matrix that depends on the
coordinates of the correspondences in the input image and on the camera internal pa-
rameters. A solution of this system defines a surface such that 3D feature points

that project at specific locations in the reference image reproject at matching


locations in the input image. Solving this system in the least-squares sense there-
fore yields surfaces, up to a scale factor, for which the overall reprojection error
is small.
The difficulty comes from the fact that, for all practical purposes, M is rank
deficient, with at least one third of its singular values being extremely small with
respect to the other two thirds even when there are many correspondences. This
is why the inextensibility constraints discussed in Section 2 were introduced.
A seemingly natural way to address this issue is to introduce a linear subspace
model and to write surface deformations as linear combinations of relatively few
basis vectors. This can be expressed as


x = x_0 + \sum_{i=1}^{N_s} w_i b_i = x_0 + B w ,   (2)

where x is the coordinate vector of Eq. 1, B is the matrix whose columns are the
basis vectors b_i, typically taken to be the eigenvectors of a stiffness matrix, and
w is the associated vector of weights w_i. Injecting this expression into Eq. 1
and adding a regularization term yields a new system
\begin{bmatrix} MB & Mx_0 \\ \lambda_r L & 0 \end{bmatrix} \begin{bmatrix} w \\ 1 \end{bmatrix} = 0 ,   (3)

which is to be solved in the least squares sense, where L is a diagonal matrix


whose elements are the inverses of the eigenvalues associated with the basis
vectors, and \lambda_r is a regularization weight. This favors the basis vectors that corre-
spond to the lowest-frequency deformations and therefore enforces smoothness.
In practice, the linear system of Eq. 3 is less poorly conditioned than the one
of Eq. 1. But, because there usually are several smooth shapes that all yield
virtually the same projection, its matrix still has a number of near zero singular
values. As a consequence, additional constraints still need to be imposed for the
problem to become well-posed. An additional difficulty is that, because rotations
are strongly non-linear, this linear formulation can only handle small ones. As
a result, the rest shape defined by x_0 must be roughly aligned with the shape
to be recovered, which means that a global rotation must be computed before
shape recovery can be attempted.
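For concreteness, solving Eq. 3 in the least-squares sense amounts to a ridge-like linear solve for the weight vector w. The Python sketch below assumes the regularization weight enters the stacked system as written above (so it appears squared in the normal equations); it is an illustration, not the authors' code.

import numpy as np

def solve_linear_subspace(M, B, x0, L, lam_r):
    """Least-squares solution of Eq. 3: minimize
    ||M B w + M x0||^2 + lam_r^2 ||L w||^2 over the modal weights w,
    then return the shape of Eq. 2."""
    MB = M @ B
    lhs = MB.T @ MB + (lam_r ** 2) * (L.T @ L)
    rhs = -MB.T @ (M @ x0)
    w = np.linalg.solve(lhs, rhs)
    return x0 + B @ w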
In the remainder of this paper, we will show that we can reduce the dimen-
sionality in a different way, which allows us to minimize the reprojection error
by solving a small and well-conditioned linear system.

4 Laplacian Formulation
In the previous section, we introduced the linear system of Eq. 1, which is so
badly conditioned that we cannot minimize the reprojection error by simply
solving it. In this section, we show how to turn it into a much smaller and
well-conditioned system.


Fig. 1. Linear parameterization of mesh vertices using control points. (a) Vertices and
control points in their rest state. (b) Control points and vertices in a deformed state.
Every vertex is a linear combination of the control points.

To this end, let us assume we are given a reference shape, which may be planar
or not, and let x_ref be the coordinate vector of its vertices, such as the ones highlighted
in Fig. 1. We first show that we can define a regularization matrix A such that
\|Ax\|^2 = 0, with x being the coordinate vector of Eq. 1, when x_ref = x up to
a rigid transformation. In other words, \|Ax\|^2 penalizes non-rigid deformations
away from the reference shape but not rigid ones. We then show that, given a
subset of N_c mesh vertices whose coordinates are

c = \begin{bmatrix} v_{i_1} \\ \vdots \\ v_{i_{N_c}} \end{bmatrix} ,   (4)

if we force the mesh both to go through these coordinates and to minimize


\|Ax\|^2, we can define a matrix P that is independent of the control vertex
coordinates and such that
x = Pc . (5)
In other words, we can linearly parameterize the mesh as a function of the control
vertex coordinates, as illustrated in Fig. 1. Injecting this parameterization into
Eq. 1 yields
MPc = 0 ,   (6)
where MP is a matrix without any zero singular values for sufficiently small num-
bers of control points, as demonstrated by Fig. 2(a). This system can therefore
be solved in the least-squares sense, up to a scale factor, by finding the singular vector
corresponding to its smallest singular value. In practice, because the correspon-
corresponding to its smallest singular value. In practice, because the correspon-
dences can be noisy, we further regularize by solving

\min_{c} \; \|MPc\|^2 + w_r \|APc\|^2 , \quad \text{s.t. } \|c\| = 1 ,   (7)

where w_r is a scalar coefficient; this minimization can still be performed in closed form. We
will show in the results section that this yields accurate reconstructions, even
with relatively small numbers of control points.
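As an illustration of this closed-form step, the minimizer of Eq. 7 is the right singular vector associated with the smallest singular value of the stacked matrix [MP; sqrt(w_r) AP]. A minimal numpy sketch, assuming M, A and P are given as dense arrays, is:

import numpy as np

def solve_eq7(M, A, P, w_r):
    """Closed-form solution of Eq. 7: the unit-norm c minimizing
    ||MPc||^2 + w_r ||APc||^2 is the last right singular vector of the
    stacked matrix [MP; sqrt(w_r) AP]."""
    stacked = np.vstack([M @ P, np.sqrt(w_r) * (A @ P)])
    _, _, Vt = np.linalg.svd(stacked, full_matrices=False)
    c = Vt[-1]                # control-vertex coordinates, up to sign and scale
    return c, P @ c           # also return the full mesh x = Pc (Eq. 5)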

[Plots for Fig. 2: normalized singular values versus index; panel (b) shows one curve per scale parameter value, ranging from 0.001 to 1e+03.]

Fig. 2. Singular values (a) of MP and M drawn with red and blue markers, respectively,
for a planar model and a given set of correspondences. The smallest singular value of
MP is about four, whereas it is practically zero for the first N_v singular values of M.
Normalized singular values (b) of the regularization matrix A computed on a non-planar
model for various values of the scale parameter \rho. Note that the curves are almost superposed for values
greater than one.

4.1 Regularization Matrix

We now turn to building the matrix A such that A x_ref = 0 and \|Ax\|^2 = \|Ax'\|^2
when x' is a rigidly transformed version of x. We first propose a very simple
scheme for the case when the reference shape is planar, and then a similar, but
more sophisticated, one when it is not.

Planar Rest Shape. Given a planar mesh that represents a reference surface
in its rest shape, consider every pair of facets that share an edge, such as those
depicted by Fig. 3(a). They jointly have four vertices v_{1,ref} to v_{4,ref}. Since they
lie on a plane, whether or not the mesh is regular, one can always find a unique
set of weights w_1, w_2, w_3 and w_4 such that

0 = w_1 v_{1,ref} + w_2 v_{2,ref} + w_3 v_{3,ref} + w_4 v_{4,ref} ,
0 = w_1 + w_2 + w_3 + w_4 ,   (8)
1 = w_1^2 + w_2^2 + w_3^2 + w_4^2 ,

up to a sign ambiguity, which can be resolved by simply requiring the first
weight to be positive. To form the A matrix, we first generate a matrix \tilde{A} for
one-dimensional vertex coordinates by considering every facet-pair configuration i
of four vertices j_1, j_2, j_3, j_4 and setting the elements of \tilde{A} to \tilde{a}_{i,j} = w_j for
j \in \{j_1, j_2, j_3, j_4\}. All remaining elements are set to zero, and the matrix A is
A = \tilde{A} \otimes I_3 .


Fig. 3. Building the A regularization matrix of Section 4.1. (a) Two facets that share
an edge. (b) Non-planar reference mesh. (c) Non-planar reference mesh with virtual
vertices and edges added. One of the pairs of tetrahedra used to construct the matrix
is shown in red.

It is easy to verify that Ax = 0 as long as x represents a planar mesh and
that \|Ax\| is invariant to rotations and translations, and in fact to all affine de-
formations. The first equality of Eq. 8 is designed to enforce planarity, while the
second guarantees invariance. The third equality is there to prevent the weights
from all being zero.
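The weights of Eq. 8 can be computed for each facet pair as a small null-space problem; the sketch below stacks the reference vertex coordinates with a row of ones and takes the unit-norm null vector, fixing the sign of the first weight. It is a sketch under the generic assumption that this null space is one-dimensional.

import numpy as np

def planar_facet_pair_weights(v1, v2, v3, v4):
    """Weights w1..w4 of Eq. 8 for four coplanar reference vertices
    (numpy 3-vectors): unit-norm null vector of the stacked system,
    with the sign chosen so that w1 > 0."""
    S = np.vstack([np.column_stack([v1, v2, v3, v4]),   # rows: x, y, z coordinates
                   np.ones((1, 4))])                    # row enforcing sum(w) = 0
    _, _, Vt = np.linalg.svd(S)
    w = Vt[-1]                                          # already unit norm
    return w if w[0] > 0 else -w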
Non-planar Rest Shape. When the rest shape is non-planar, there will be
facet pairs for which we cannot solve Eq. 8. As in [16], we extend the scheme described
above by introducing virtual vertices. As shown in Fig. 3(c), we create virtual
vertices above and below the center of each facet, at a distance controlled
by the scale parameter \rho. Formally, for each facet v_i, v_j and v_k, with center
v_c = \frac{1}{3}(v_i + v_j + v_k) and normal n = (v_j - v_i) \times (v_k - v_i), we take the
virtual vertices to be
v^{+}_{ijk} = v_c + \rho \frac{n}{\sqrt{\|n\|}} \quad \text{and} \quad v^{-}_{ijk} = v_c - \rho \frac{n}{\sqrt{\|n\|}} ,   (9)
where the norm of n/\sqrt{\|n\|} is approximately equal to the average length
of the facet's edges.
to build the blue tetrahedra of Fig. 3(c).
Let x_V be the coordinate vector of the virtual vertices only, and x^A = [x^T, x_V^T]^T
the coordinate vector of both real and virtual vertices. Let x^A_ref
be similarly defined for the reference shape. Given two tetrahedra that share a
facet, such as the red ones in Fig. 3(c), and the five vertices v^A_{1,ref} to v^A_{5,ref} they
share, we can now find weights w_1 to w_5 such that
0 = w_1 v^A_{1,ref} + w_2 v^A_{2,ref} + w_3 v^A_{3,ref} + w_4 v^A_{4,ref} + w_5 v^A_{5,ref} ,
0 = w_1 + w_2 + w_3 + w_4 + w_5 ,   (10)
1 = w_1^2 + w_2^2 + w_3^2 + w_4^2 + w_5^2 .

The three equalities of Eq. 10 serve the same purpose as those of Eq. 8. We form
a matrix A^A by considering all pairs of tetrahedra that share a facet, computing
the weights w_1 to w_5 that encode local linear dependencies between real and
virtual vertices, and using them to add successive rows to the matrix, as we did
to build the A matrix in the planar case. It is again easy to verify that A^A x^A_ref = 0
and that \|A^A x^A\| is invariant to affine deformations of x^A. In our scheme, the
regularization term can be computed as
C = \|A^A x^A\|^2 = \|A_x x + A_V x_V\|^2 ,   (11)
if we write A^A as [A_x \; A_V], where A_x has three times as many columns as there
are real vertices and A_V three times as many as there are virtual ones. Given the real vertices x, the
virtual vertex coordinates that minimize the term C of Eq. 11 are
x_V = -(A_V^T A_V)^{-1} A_V^T A_x x ,   (12)
so that
C = \|(A_x - A_V (A_V^T A_V)^{-1} A_V^T A_x)\, x\|^2 = \|Ax\|^2 ,
where A = A_x - A_V (A_V^T A_V)^{-1} A_V^T A_x. In other words, this matrix A is the regular-
ization matrix we are looking for, and its elements depend only on the coordinates
of the reference vertices and on the distance chosen to build the virtual vertices.
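With the block notation used above (A_x for the columns of A^A acting on real vertices and A_V for those acting on virtual ones, which is our labelling rather than necessarily the authors'), the regularization matrix A can be assembled as in the following sketch:

import numpy as np

def regularization_matrix(A_x, A_V):
    """A = A_x - A_V (A_V^T A_V)^{-1} A_V^T A_x, i.e. Eq. 12 substituted
    back into Eq. 11.  A least-squares solve replaces the explicit inverse."""
    # Z = (A_V^T A_V)^{-1} A_V^T A_x, assuming A_V has full column rank.
    Z, *_ = np.linalg.lstsq(A_V, A_x, rcond=None)
    return A_x - A_V @ Z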
To study the influence of the scale parameter \rho of Eq. 9, which controls the
distance of the virtual vertices from the mesh, we computed the A matrix and
its singular values for a sail-shaped mesh and many values of \rho. As can be seen
in Fig. 2(b), there is a wide range for which the distribution of singular values
remains practically identical. This suggests that \rho has limited influence on the
numerical character of the regularization matrix. In all our experiments, we set
\rho to 1 and were nevertheless able to obtain accurate reconstructions of many
different surfaces with very different physical properties.

4.2 Linear Parameterization


As before, let N_v be the total number of vertices and N_c < N_v the number of
control points we want to use to parameterize the mesh. Given the coordinate
vector c of these control vertices, the coordinates of the minimum-energy mesh
that goes through the control vertices and minimizes the regularization energy
can be found by minimizing
\|Ax\|^2 \quad \text{subject to} \quad P_c x = c ,   (13)
where A is the regularization matrix introduced above and P_c is a 3N_c \times 3N_v
matrix containing ones in the columns corresponding to the indices of the control vertices
and zeros elsewhere. We prove in the supplementary material that, then, there
exists a matrix P whose components can be computed from A and P_c such that
\forall c , \quad x = Pc ,   (14)
if x is the solution of the minimization problem of Eq. 13.
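A direct way to obtain P numerically, assuming the KKT matrix of the equality-constrained problem of Eq. 13 is non-singular, is sketched below. It is one possible construction consistent with Eqs. 13-14, not necessarily the derivation given in the supplementary material.

import numpy as np

def parameterization_matrix(A, Pc):
    """Matrix P of Eq. 14: for each unit control vector c = e_k, solve the
    KKT system of  min ||A x||^2  s.t.  Pc x = c  and collect the x parts
    as the columns of P."""
    n, m = A.shape[1], Pc.shape[0]          # 3*Nv and 3*Nc
    K = np.block([[2.0 * A.T @ A, Pc.T],
                  [Pc, np.zeros((m, m))]])
    rhs = np.vstack([np.zeros((n, m)), np.eye(m)])
    sol = np.linalg.solve(K, rhs)           # one solve for all control vectors
    return sol[:n, :]                       # P, such that x = P c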

4.3 Reconstruction
Given that we can write the coordinate vector x as Pc and given potentially
noisy correspondences between the reference and input images, recall that we
can find c as the solution of the linear least-squares problem of Eq. 7.
As shown in the results section, this yields a mesh whose reprojection is very
accurate but whose 3D shape may not be, because our regularization does not
penalize affine deformations away from the rest shape. In practice, we use this
initial mesh to eliminate erroneous correspondences. We then refine it by solving

\min_{c} \; \|MPc\|^2 + w_r \|APc\|^2 , \quad \text{s.t. } C(Pc) \le 0 ,   (15)

where C(Pc) are inextensibility constraints that prevent distances between
neighboring vertices from growing beyond a bound, such as their Euclidean distance
in their rest shape when it is planar. These are exactly the same constraints as
those used in [3], against which we compare ourselves below. The inequality con-
straints are reformulated as equality constraints with additional slack variables
whose norm is penalized to prevent the solution from shrinking to the origin.

4.4 Complexity and Robustness


To handle outliers in the correspondences, we iteratively perform the uncon-
strained optimization of Eq. 7, starting with a relatively high regularization
weight w_r and reducing it by half at each iteration. Given a current shape esti-
mate, we project it onto the input image, disregard the correspondences whose
reprojection error is higher than a pre-set radius, and reduce the radius by half for the next
iteration. Repeating this procedure a fixed number of times results in an initial
shape estimate and provides inlier correspondences for the more computation-
ally demanding constrained optimization that follows. This decoupled outlier
rejection mechanism gives us a computational advantage over the state-of-
the-art reconstruction method [3], which relies on a similar iterative scheme but
solves the nonlinear optimization at each iteration. Furthermore, while dealing
with the same number of constraints, our method typically uses ten times fewer
degrees of freedom (i.e., N_c < N_v/10), and thus solves the constrained least-
squares problem of Eq. 15 much faster.
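The decoupled outlier-rejection loop can be sketched as follows. The helpers build_M and reproject_errors are caller-supplied callables standing in for the correspondence-matrix construction and the reprojection-error computation, which are not specified here, and the number of iterations is an arbitrary choice; the halving schedule follows the text.

import numpy as np

def robust_initial_shape(build_M, reproject_errors, A, P, n_corr,
                         w_r=2e4, radius=150.0, n_iter=5):
    """Iterative unconstrained solve of Eq. 7 with outlier rejection:
    build_M(inliers) returns the matrix M for the current inlier set and
    reproject_errors(x) returns one reprojection error per correspondence."""
    inliers = list(range(n_corr))
    x = None
    for _ in range(n_iter):
        M = build_M(inliers)
        stacked = np.vstack([M @ P, np.sqrt(w_r) * (A @ P)])
        _, _, Vt = np.linalg.svd(stacked, full_matrices=False)
        x = P @ Vt[-1]                                  # solution of Eq. 7
        errs = reproject_errors(x)
        inliers = [i for i in inliers if errs[i] < radius]
        w_r, radius = w_r / 2.0, radius / 2.0           # halve weight and radius
    return x, inliers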

5 Results
We compare our approach to surface reconstruction from a single input image
given correspondences with a reference image against the recent method of [3],
which has been shown to outperform closed-form solutions based on a global
deformation model and inextensibility constraints [14] and solutions that rely
on a nonlinear local deformation models [20]. In all these experiments, we set
the initial regularization weight wr of Eq. 7 to 2e4, the scale parameter of
Eq. 9 to 1 and the initial radius for outlier rejection to 150 pixels.

To complete this comparison, we also implemented and tested an approach


to reducing dimensionality using a linear subspace model as discussed in Sec-
tion 3 and solving in the least squares sense the system of Eq. 3 under the same
inextensibility constraints as those of Eq. 15. We took our basis vectors to be
the eigenvectors of a stiffness matrix [21–23] and chose their number so that the
number of degrees of freedom using either the subspace approach or ours would
be the same. For both approaches, we report the results with and without per-
forming a rigid alignment of the reference model with the image prior to starting
the minimization.
In all of these experiments we used SIFT to compute correspondences between
a reference image in which the 3D shape is known and individual other
images. We assume the internal camera parameters are known and reason in
the camera coordinate frame. In the remainder of the section, we first evaluate the
accuracy of our reconstruction method against the competing methods on images
for which we have ground truth. We then present results on images of surfaces
made of different materials.

5.1 Quantitative Results

To provide quantitative results both in the planar and non-planar cases, we


acquired images of a deforming piece of paper using a Kinect™ camera and of
a deforming sail using a pair of cameras.

Paper Sequence. Our qualitative results are depicted by Fig. 4. The rest shape is
flat, modeled by an 11 × 9 rectangular mesh, and parameterized by 20 regularly
spaced control points such as those of Fig. 1. For reconstruction purposes, we
ignored the depth maps and used only the color images, along with SIFT corre-
spondences with the reference image. For evaluation purposes, we measured the
distance of the recovered vertices to the depth map.
We report the 3D reconstruction errors for different methods in Fig. 6(a). Our
approach yields the highest average accuracy, and it is followed closely by the
subspace method provided that the model is properly pre-aligned by computing
global rotation and translation.
Sail Images. We applied our reconstruction algorithm to sail images such as the
one of Fig. 5. Since images taken by a second camera are available, we used a
simple triangulation method to generate 3D points corresponding to the circular
markers on the sail, which serve as ground-truth measurements. We compare
the reconstructions against the scaled ground-truth points, with the scale chosen such that the mean
distance between the predicted marker positions and their ground-truth positions
is minimized. In this case, as shown in Fig. 6(b), our method achieves the highest
accuracy among the competing methods.
Our C++ implementation relies on the Ferns keypoint classifier [24] to match
feature points, and average computation times per frame are given in Fig. 6(c)
as a function of the number of control points. As stated above, the accuracy
results of Figs. 6(a,b) were obtained using 20 control points.

Fig. 4. Paper sequence with ground-truth. Top row: Reprojection of the con-
strained reconstructions on the input images. Middle row: Unconstrained solutions
seen from a different view-point. Bottom row: Constrained solutions seen from the
same view-point. We supply the corresponding video sequence as supplementary ma-
terial.


Fig. 5. Sail reconstructions (a) Reprojection of the unconstrained reconstruction


on the input image. (b)-(c) Triangulations seen from a different view-point with the
ground-truth points shown in green for unconstrained and constrained methods, re-
spectively. Note that the two reconstructions are visually the same.

5.2 Qualitative Results

Here we present our results on real images of surfaces made of three different
materials: paper, cloth, and plastic.
The paper model is a triangulated mesh of size 11 × 8, which is parameterized
by 20 regularly placed control points. Similarly, we model a t-shirt using a 7 × 10
mesh parametrized by 12 control points. Our reconstruction results for the paper
and t-shirt sequences are depicted by Figs. 7 and 8, respectively.
Our final example involves a plastic ball that we cut in half, leaving the
half shell shown in Fig. 9. We used our Kinect camera to produce the concave

[Plots for Fig. 6: (a), (b) mean reconstruction error for Ours, Subspace with aligning, Subspace without aligning, and Salzmann11; (c) computation time in seconds versus number of control points for the unconstrained, constrained, and total stages.]

Fig. 6. Reconstruction errors for different methods on (a) the paper sequence and
(b) the sail images. We omit the results of the subspace method without aligning in (b)
as they are very high compared to the others in the plot. All distances are reported
in millimeters. Computation times (c) for the C++ implementation, with
unconstrained outlier rejection (green), constrained optimization (red) and total time
including keypoint matching and extraction (blue).

Fig. 7. Paper sequence. Top row: Reprojections of our reconstructions on the input
images. Bottom row: Reconstructions seen from a different view-point. We supply
the corresponding video-sequences as supplementary material.

Fig. 8. T-shirt sequence. Top row: Reprojections of our reconstructions on the


input images. Bottom row: Reconstructions seen from a different view-point. We
supply the corresponding video-sequences as supplementary material.

reference shape of Fig. 9. Our method can then be used to recover its shape from
input images, whether it is concave or convex. This demonstrates our method's
ability to recover truly large non-planar deformations.

Fig. 9. Large deformations. Top row: The reference image with a concave reference
shape and two input images. In the first, the surface is concave and in the other convex.
Bottom row: The depth image used to build the reference mesh, reconstructions of
the concave and convex versions of the surface.

6 Conclusion

We have presented a novel approach to parameterizing the vertex coordinates of


a mesh as a linear combination of a subset of them. In addition to reducing the
dimensionality of the monocular 3D shape recovery problem, it yields a rotation-
invariant regularization term that lets us achieve good results without training
data or having to explicitly handle global rotations.
In our current implementation, the subset of vertices was chosen arbitrarily
and we applied constraints on every single edge of the mesh. In future work, we
will explore ways to optimally choose this subset and the constraints we enforce
to further decrease computational complexity without sacrificing accuracy.

References
1. Salzmann, M., Fua, P.: Deformable Surface 3D Reconstruction from Monocular Images. Morgan-Claypool Publishers (2010)
2. Blanz, V., Vetter, T.: A Morphable Model for the Synthesis of 3D Faces. In: SIGGRAPH, pp. 187–194 (1999)
3. Salzmann, M., Fua, P.: Linear Local Models for Monocular Reconstruction of Deformable Surfaces. PAMI 33, 931–944 (2011)
4. Gumerov, N., Zandifar, A., Duraiswami, R., Davis, L.S.: Structure of Applicable Surfaces from Single Views. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004, Part III. LNCS, vol. 3023, pp. 482–496. Springer, Heidelberg (2004)
5. Liang, J., Dementhon, D., Doermann, D.: Flattening Curved Documents in Images. In: CVPR, pp. 338–345 (2005)
6. Perriollat, M., Bartoli, A.: A Quasi-Minimal Model for Paper-Like Surfaces. In: BenCos: Workshop Towards Benchmarking Automated Calibration, Orientation and Surface Reconstruction from Images (2007)

7. Ecker, A., Jepson, A.D., Kutulakos, K.N.: Semidefinite Programming Heuristics for Surface Reconstruction Ambiguities. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 127–140. Springer, Heidelberg (2008)
8. Shen, S., Shi, W., Liu, Y.: Monocular Template-Based Tracking of Inextensible Deformable Surfaces under L2-Norm. In: Zha, H., Taniguchi, R.-i., Maybank, S. (eds.) ACCV 2009, Part II. LNCS, vol. 5995, pp. 214–223. Springer, Heidelberg (2010)
9. Brunet, F., Hartley, R., Bartoli, A., Navab, N., Malgouyres, R.: Monocular Template-Based Reconstruction of Smooth and Inextensible Surfaces. In: Kimmel, R., Klette, R., Sugimoto, A. (eds.) ACCV 2010, Part III. LNCS, vol. 6494, pp. 52–66. Springer, Heidelberg (2011)
10. White, R., Forsyth, D.: Combining Cues: Shape from Shading and Texture. In: CVPR (2006)
11. Moreno-Noguer, F., Salzmann, M., Lepetit, V., Fua, P.: Capturing 3D Stretchable Surfaces from Single Images in Closed Form. In: CVPR (2009)
12. Moreno-Noguer, F., Porta, J.M., Fua, P.: Exploring Ambiguities for Monocular Non-Rigid Shape Estimation. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part III. LNCS, vol. 6313, pp. 370–383. Springer, Heidelberg (2010)
13. Perriollat, M., Hartley, R., Bartoli, A.: Monocular Template-Based Reconstruction of Inextensible Surfaces. IJCV 95 (2011)
14. Salzmann, M., Moreno-Noguer, F., Lepetit, V., Fua, P.: Closed-Form Solution to Non-rigid 3D Surface Registration. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part IV. LNCS, vol. 5305, pp. 581–594. Springer, Heidelberg (2008)
15. Sorkine, O., Cohen-Or, D., Lipman, Y., Alexa, M., Rossl, C., Seidel, H.P.: Laplacian Surface Editing. In: Symposium on Geometry Processing (2004)
16. Summer, R., Popovic, J.: Deformation Transfer for Triangle Meshes. TOG (2004)
17. Bue, A.D., Bartoli, A.: Multiview 3D Warps. In: ICCV (2011)
18. Dimitrijevic, M., Ilic, S., Fua, P.: Accurate Face Models from Uncalibrated and Ill-Lit Video Sequences. In: CVPR (2004)
19. Ilic, S., Fua, P.: Using Dirichlet Free Form Deformation to Fit Deformable Models to Noisy 3-D Data. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002, Part II. LNCS, vol. 2351, pp. 704–717. Springer, Heidelberg (2002)
20. Salzmann, M., Urtasun, R., Fua, P.: Local Deformation Models for Monocular 3D Shape Recovery. In: CVPR (2008)
21. Terzopoulos, D., Witkin, A., Kass, M.: Symmetry-Seeking Models and 3D Object Reconstruction. IJCV 1, 211–221 (1987)
22. Cohen, L., Cohen, I.: Finite-Element Methods for Active Contour Models and Balloons for 2D and 3D Images. PAMI 15 (1993)
23. Metaxas, D., Terzopoulos, D.: Constrained Deformable Superquadrics and Nonrigid Motion Tracking. PAMI 15 (1993)
24. Ozuysal, M., Calonder, M., Lepetit, V., Fua, P.: Fast Keypoint Recognition Using Random Ferns. PAMI 32 (2010)
Soft Inextensibility Constraints
for Template-Free Non-rigid Reconstruction

Sara Vicente and Lourdes Agapito

School of Electronic Engineering and Computer Science,


Queen Mary University of London, Mile End Road, London E1 4NS, UK
{sara.vicente,lourdes.agapito}@eecs.qmul.ac.uk

Abstract. In this paper, we exploit an inextensibility prior as a way


to better constrain the highly ambiguous problem of non-rigid recon-
struction from monocular views. While this widely applicable prior has
been used before combined with the strong assumption of a known 3D-
template, our work achieves template-free reconstruction using only in-
extensibility constraints. We show how to formulate an energy function
that includes soft inextensibility constraints and rely on existing discrete
optimisation methods to minimise it. Our method has all of the follow-
ing advantages: (i) it can be applied to two tasks that have so far been
considered independently, template based reconstruction and non-rigid
structure from motion, producing comparable or better results than the
state-of-the-art methods; (ii) it can perform template-free reconstruction
from as few as two images; and (iii) it does not require post-processing
stitching or surface smoothing.

Keywords: Non-rigid reconstruction, inextensibility priors, MRF op-


timization.

1 Introduction

The reconstruction of non-rigid 3D objects from monocular sequences is an in-


herently ambiguous problem that has received a significant amount of attention
in recent years. Most methods for reconstruction of deformable objects fall into
two categories: template based methods [1–3], where the 3D shape of the object
is known for a reference frame; and methods that rely on points being tracked
along a video sequence, such as the pioneering non-rigid structure from motion
(nrsfm) methods based on the statistical prior that the shape of a non-rigid ob-
ject can be expressed in a compact way as a linear combination of an unknown
low-rank shape basis [4, 5].
In both cases, additional priors or assumptions have to be imposed to better
constrain a highly ambiguous problem. These priors vary in their applicability
range. Broadly applicable priors include temporal smoothness [5, 6] and surface

This work was funded by the European Research Council under the ERC Starting
Grant agreement 204871-HUMANIS.


(a) Input images with correspondences (b) 3D Reconstruction

Fig. 1. 3D Reconstruction obtained using our template-free method based


on soft inextensibility constraints. Given (a) two input images and point corre-
spondences (colour coded according to their estimated depth), (b) shows the 3D
reconstructions estimated using our approach. Best viewed in colour.

smoothness [2, 3]. Temporal smoothness is particularly meaningful for recon-


struction from video sequences, where there are only small deformations from
one frame to the next, but it relies on having coherent tracks for the full se-
quence. Surface smoothness is also a commonly used prior, but in general it is
too weak to be used alone.
In this paper we consider a more restrictive, but still widely applicable prior:
an inextensibility constraint. A surface is inextensible if it does not stretch or
shrink, i.e. if the geodesic distances between points are preserved under defor-
mations. This is satisfied in a variety of situations, for example, if the images
depict an inextensible material (e.g. paper and non-stretchable fabric) or if the
object is articulated.
Inextensibility constraints have been extensively used for template based re-
construction [1–3], but less explored when no template is available. The assump-
tion that a 3D template shape is available is quite restrictive and in practice
template based methods have mostly been applied to surfaces which are isomet-
ric with a plane, i.e. developable surfaces. This may be related to the difficulty
in obtaining an accurate template for other types of surfaces.
Besides not requiring a template shape, our method uses a soft version of the
inextensibility constraint. Instead of enforcing the constraint exactly, we penalise
deviations from it. There are several scenarios in which a soft constraint is more
adequate, for instance when there is a small level of stretching but inextensibility
is still broadly observed. Moreover, a soft constraint will result in improved
robustness in the presence of noise.
Inextensibility constraints are usually imposed in a small neighbourhood, such
that the distances of nearby points are preserved in the 3D reconstruction. This
can be interpreted as enforcing local rigidity. Local rigidity constraints have
been previously considered in [7, 8]. Both methods form part of a recent trend of
piecewise reconstruction methods in nrsfm. Instead of relying on a single model
for the full surface, these approaches model small patches of the surface inde-
pendently. The main advantage of piecewise modelling is that complex surfaces
can still be recovered, while keeping the complexity of each local model low.

On the other hand, since each patch is reconstructed independently, these meth-
ods require a post-processing stitching step. Compared to prior work, ours is the
first to include all of the following advantages:

– It does not require a template shape, in contrast to most previous meth-
ods that impose inextensibility constraints. Instead we rely on having point
correspondences between images.
– It does not require a temporal sequence; instead we require as few as two
images and do not use information about their temporal order.
– It imposes inextensibility as a soft constraint. A soft constraint guaran-
tees improved robustness.
– It does not require post-processing stitching, in contrast to existing piece-
wise reconstruction methods where patches are reconstructed independently
and then require a further 3D alignment step.

In this paper we propose a pointwise reconstruction method that recovers the


depth of each interest point in each image. We formulate the task of 3D recon-
struction under inextensibility constraints as an energy minimisation problem,
where the energy is a function of the depths. This energy is minimised using dis-
crete optimisation techniques. Besides inferring the depth for all points we also
recover a weak template of the underlying shape. An example of the application
of our method is shown in Fig. 1. Given two images with point correspondences
shown in (a), our method recovers the 3D shape shown in (b) by imposing in-
extensibility constraints without the need for a known template shape.

1.1 Related Work

The inextensibility prior has been most commonly used in template based meth-
ods. These focus on reconstructing a single image, assuming that correspon-
dences between the image and a 3D template shape are given. Template based
methods have been mostly applied to the reconstruction of developable surfaces.
Since developable surfaces are isometric with a plane, the template shape is usu-
ally a planar surface. A closed form solution for this problem was introduced in
[1], where the template was in the form of a 2D mesh and the inextensibility
constraint was enforced between mesh nodes. A different version of the inexten-
sibility constraint that allows shrinking to account for the differences between
Euclidean and geodesic distances was presented in [9] and shown to be more
suitable for reconstructing sharply folding surfaces. In [2] the known distances
between interest points were used to obtain upper bounds on the depth of each
point. This approach was later extended in [3] to account for noise in the corre-
spondences and in the template.
An interesting exception to this coupling between inextensibility priors and
template based methods was presented in [10]. This method assumed the surface
was developable and recovered a planar template given correspondences across
multiple images. In contrast to these methods, our approach is not only template
free, but also applicable to non-developable surfaces.

Table 1. Notation. When referring to the case of a single input image or a generic
image, we sometimes drop the index s.

Notation                 Description
S                        Number of input images.
N                        Number of interest points. We use i and j to index the set of points.
p^s_i = (x^s_i, y^s_i)   2D point i in image frame s.
Q^s_i                    3D coordinates of point i on the shape corresponding to frame s.
\lambda^s_i              Depth of the 3D point i on the shape corresponding to frame s.
N = {(i, j)}             Neighbourhood system: the set of all pairs (i, j) of neighbouring
                         points. Points i and j are neighbours if their image distance is small
                         in all frames.

Also related to our approach are the recently introduced piecewise models
[7, 8, 11, 6]. In general, these methods can cope with sequences with stronger
deformations than previous global approaches. They differ in the type of local
model considered (piecewise planarity [7], local rigidity [8] and quadratic defor-
mations [11, 6]) and in the way the surface is divided into patches. For example
in [8] each patch is a triangle formed by neighbouring points, while in [7] the
surface is divided into regular rectangular patches. A more principled approach
was proposed in [6], where the format of the patches is jointly optimised with
the model for each patch.
Inextensibility is imposed in the piecewise planar patches used in [7] and
the rigid triangles used in [8]. Neither of these approaches requires a template.
However, they both require a stitching step to recover a coherent final surface.
Furthermore, [7] requires a mesh fitting step with manual initialisation to re-
cover a smooth surface, while the method in [8] discards non-rigid triangles,
producing only partial reconstructions. Our method does not suffer from any
of these disadvantages. It produces a full reconstruction, even for parts which
violate the inextensibility constraint, and does not require any post-processing
step of stitching or mesh fitting.
Most related to our approach is the incremental rigidity scheme proposed
in [12] for temporal sequences. The scheme is applicable to both rigid and non-
rigid objects and it maintains a 3D model of the object which is incrementally
updated based on every new frame, by maximizing rigidity. Since the initializa-
tion of the model consists of a flat shape (at constant depth), the method is
not expected to recover a plausible reconstruction for the initial frames. This
scheme was later extended in [13] to reconstruct deforming developable surfaces.
Our method differs from these approaches in that we aim at recovering the 3D
shape corresponding to all frames and we do not require a temporal sequence.

1.2 Notation
We start by defining the notation used throughout the paper and discuss some
of the assumptions made. Our notation is summarised in Table 1.

(a) Orthographic camera (b) Perspective camera


Fig. 2. Illustration of the camera models considered. In both cases, we reduce


the problem of 3D reconstruction of point i to finding its depth \lambda_i.

As previously mentioned, a surface is inextensible if the geodesic distances


between points are preserved under deformation. Since our method performs
pointwise 3D reconstruction (as opposed to recovering a continuous surface) we
do not have access to the geodesic distance between points. Instead we rely on
the Euclidean distance being a reasonable approximation. We consider distances
between nearby points, making this approximation more meaningful than pre-
vious approaches that considered it for all possible pairs of points [2, 3].
For simplicity, we consider we have a single static camera and explore the
cases of both orthographic and calibrated perspective cameras. Similarly to [2],
we assume that the world frame is aligned with the camera coordinate system and
that the noise in the internal calibration is small so that the 3D reconstruction
reduces to finding the depth of the point in each image.
Orthographic Camera. Under orthographic projection, a 3D point Q_i can be
written as a function of its depth \lambda_i as Q_i(\lambda_i) = (x_i, y_i, \lambda_i). This is illustrated
in Fig. 2. Orthography is a simplifying assumption applicable when the relief of
the object is small compared with the distance to the camera.
Perspective Camera. For a calibrated perspective camera, a 3D point is given
by Q_i(\lambda_i) = \lambda_i v_i, where v_i is the direction of the sightline, defined as
v_i = M^{-1}\bar{p}_i / \|M^{-1}\bar{p}_i\|, where \bar{p}_i is the point p_i in homogeneous
coordinates and M is the matrix of intrinsic camera parameters.
The goal of our 3D reconstruction approach is to recover the depths \lambda^s_i and
corresponding 3D points Q^s_i for all images s.
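Both back-projection models can be written in a few lines; the Python sketch below, using the depth notation above, is only illustrative and not the authors' code.

import numpy as np

def backproject_orthographic(p, depth):
    """Orthographic model: Q_i(lambda_i) = (x_i, y_i, lambda_i)."""
    return np.array([p[0], p[1], depth])

def backproject_perspective(p, depth, M):
    """Calibrated perspective model: Q_i(lambda_i) = lambda_i * v_i, with
    v_i = M^{-1} p_bar / ||M^{-1} p_bar|| the unit sightline direction."""
    p_bar = np.array([p[0], p[1], 1.0])      # homogeneous coordinates
    v = np.linalg.solve(M, p_bar)
    return depth * (v / np.linalg.norm(v))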

2 Template-Based Reconstruction for a Single Image


We start by describing our method for reconstruction from a single image when
a template shape is known. This will be later extended to the case where no
template is available. We assume that point correspondences between the input
image and the template are given.

The geodesic distances of all pairs of points (i, j) ∈ N on the 3D shape to
be reconstructed, denoted by t_ij, are known and obtained from the template
shape, as in [2, 3], but could alternatively be obtained, for instance, from similar
training shapes. As previously mentioned, we assume that: (i) points (i, j) are
only neighbours if their distance is small; and (ii) the Euclidean distance is a
reasonable approximation of the geodesic distance in a small neighbourhood.
Recall that our method incorporates a soft inextensibility constraint. In order
to impose such a constraint, we formulate an energy function that penalises
deviations from inextensibility. For the case where the distances t_ij between pairs
of points are known, the energy function is of the form:

E(θ) = Σ_{(i,j)∈N} | ||Q_i(θ_i) − Q_j(θ_j)|| − t_ij |     (1)

where θ = {θ_i ∈ R : i = 1, ..., N} and ||Q_i − Q_j|| is the Euclidean distance
between points i and j in the reconstruction.
All the terms in this energy are non-negative, and the energy is minimised
when the inextensibility constraints are exactly satisfied, i.e. when the Euclidean
distance between two reconstructed points is the same as the distance between
those two points in the template.
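To make the energy concrete, the following minimal Python sketch evaluates (1) for a candidate depth vector under an orthographic camera; the function and argument names are our own assumptions rather than part of the original implementation.

```python
import numpy as np

def template_energy(depths, points_2d, template_dist, neighbours):
    """Energy (1): sum of absolute deviations from the template distances t_ij.

    depths        : (N,) candidate depths theta_i
    points_2d     : (N, 2) image coordinates (x_i, y_i), orthographic camera
    template_dist : dict {(i, j): t_ij} of geodesic distances from the template
    neighbours    : iterable of pairs (i, j) in the neighbourhood system N
    """
    Q = np.column_stack([points_2d, depths])   # Q_i(theta_i) = (x_i, y_i, theta_i)
    return sum(abs(np.linalg.norm(Q[i] - Q[j]) - template_dist[(i, j)])
               for i, j in neighbours)
```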

3 Template-Free Reconstruction from Multiple Images


The energy (1) is only applicable when the distances t_ij are known. A more
realistic assumption is that a template might not be available but multiple images
of the object can be acquired. In this section we assume that the distances between
neighbouring points are unknown and estimated together with the depths.
Similarly to Section 2, our main assumption is that the surface is close to inextensible,
so that the following relation should be satisfied for any two given images
s and r:

||Q_i^s − Q_j^s|| = ||Q_i^r − Q_j^r||   ∀(i, j) ∈ N     (2)
Since we assume that more than two images may be available, we introduce
auxiliary variables d_ij that represent the distance between points i and j. We
formulate the task of 3D reconstruction as an energy minimisation problem,
where the energy is a function of the form:

E(θ, d) = Σ_{s=1}^{S} Σ_{(i,j)∈N} | ||Q_i^s(θ_i^s) − Q_j^s(θ_j^s)|| − d_ij |     (3)

where the quantities to be estimated are θ = {θ_i^s ∈ R : s = 1, ..., S; i = 1, ..., N},
the vector encoding the depths of all points in all images, and d = {d_ij ∈ R≥0 :
(i, j) ∈ N}, the vector of distances between all pairs of neighbouring points. The
vector of estimated distances d can be loosely interpreted as a weak template
for the underlying (close to inextensible) shape, such that deviations from this
template are penalised in the reconstruction of each image.
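A corresponding sketch of the template-free energy (3), again for an orthographic camera and with hypothetical argument names; for a perspective camera Q_i^s(θ_i^s) = θ_i^s v_i^s would be used instead.

```python
import numpy as np

def template_free_energy(depths, points_2d, dists, neighbours):
    """Energy (3) for S images of N points (orthographic camera).

    depths    : (S, N) array of depths theta_i^s
    points_2d : (S, N, 2) image coordinates of the tracked points
    dists     : dict {(i, j): d_ij} of estimated neighbour distances
    neighbours: iterable of pairs (i, j) in N
    """
    energy = 0.0
    for s in range(depths.shape[0]):
        Q = np.column_stack([points_2d[s], depths[s]])   # 3D points for image s
        for i, j in neighbours:
            energy += abs(np.linalg.norm(Q[i] - Q[j]) - dists[(i, j)])
    return energy
```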

Reconstruction Ambiguities. For a perspective camera, the energy function
(3) has a trivial minimum at θ = 0 and d = 0. This is due to the depth/scale
ambiguity inherent to the perspective camera model. In order to avoid this trivial
solution, when using a perspective camera we impose the constraint θ_i^s ≥ k ∀s, i.
This constraint can be seen as fixing an arbitrary scale for the object. Note that
this trivial minimum only occurs for energy (3) and not for energy (1), since in
that case the scale of the object is fixed.
In contrast, the orthographic camera model presents a global depth ambiguity
and a flip ambiguity. The scale of the object is invariant under these ambiguities,
so this model does not suffer from the trivial minimum.

4 Discrete Optimisation
We rely on discrete optimisation methods and in particular fusion moves to
minimise both the template-based (1) and the template-free (3) energy functions.
While (1) is a pairwise energy function, the template-free problem involves a
higher-order energy function (3), with cliques of size 3. We describe in detail
how energy (3) is minimised and omit details on the minimisation of (1) since
it is analogous, but simpler.

4.1 Fusion Moves


Fusion moves [14, 15] have recently been introduced as a method for minimising
energy functions derived from Markov Random Field models. They are
particularly suitable for minimising energy functions over continuous variables and
they have been successfully applied to stereo [14] and optical flow [15] estimation.
Assume that the goal is to minimize an energy function E(x) and that we
are given two suboptimal proposals (or solutions) x^0 and x^1. A fusion move
fuses these two solutions into a new one, x^new, with the following guarantee:
E(x^new) ≤ E(x^0) and E(x^new) ≤ E(x^1), i.e. it produces a solution whose energy is
no larger than that of either proposal. The new solution x^new is obtained
by minimising a specially designed binary function E′(y). For more details on
how this binary function is constructed, see [15]. Given the optimal solution ŷ
to this binary problem, i.e. ŷ = arg min_y E′(y), the new solution x^new is
constructed as follows: x^new_i = x^0_i if ŷ_i = 0 and x^new_i = x^1_i if ŷ_i = 1, where i indexes
the vector x of unknowns. The form of the binary energy E′(y) depends on the
original problem. When minimising energy (3), the resulting binary problem is
also a sum of triple cliques. We use the reduction introduced in [16] to convert the
energy into a pairwise function and minimise it using the QPBO algorithm (see
e.g. [17] for a review). This algorithm is not guaranteed to produce an optimal
solution if the function is non-submodular. However, even if the global minimum
of the binary problem is not achieved, the fusion move is still guaranteed not to
increase the energy. We use the QPBO implementation of [17] and the max-flow
algorithm of [18] in our experiments.
Similarly to previous approaches, we use fusion moves in an iterative fashion
by fusing several proposals.
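The following sketch shows how a fused solution can be assembled once a binary labelling has been obtained from a QPBO-style solver (the solver itself is not shown, and the names are hypothetical).

```python
import numpy as np

def fuse_proposals(x0, x1, y, energy):
    """Combine two proposals given a binary labelling y: y[i] = 0 keeps x0[i],
    y[i] = 1 takes x1[i]. Even if y is suboptimal, falling back to the better
    input preserves the fusion-move guarantee of never increasing the energy."""
    x_new = np.where(y == 0, x0, x1)
    best_input = x0 if energy(x0) <= energy(x1) else x1
    return x_new if energy(x_new) <= energy(best_input) else best_input
```

In the iterative scheme described in Section 4.4, the current solution would typically play the role of x0 and each newly generated proposal the role of x1.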

Algorithm 1: Greedy region growing

input : A current solution (θ*, d*) and a seed region A ⊆ {1, ..., N}
output: A new proposal (θ, d)
initialisation: θ_i^s ← θ_i^{*s} ∀i ∈ A, ∀s;
                mark all i ∈ A as visited;
                sort indices i ∉ A according to their distance to the seed region.
1 forall the i ∉ A (sorted according to distance) do
2   forall the images s do
3     θ_i^s ← arg min_{θ∈D} Σ_{j:(i,j)∈N, j visited} | ||Q_i^s(θ) − Q_j^s(θ_j^s)|| − d*_ij |
4     mark i as visited
5   end
6 end
7 forall the (i, j) ∈ N do
8   d_ij ← arg min_d Σ_s | ||Q_i^s(θ_i^s) − Q_j^s(θ_j^s)|| − d |   // the median over s
9 end

4.2 Proposal Generation


A crucial step for the success of fusion moves is the quality of the proposals. In
our experiments we observed that using a chain of proposals, i.e. proposals which
are derived sequentially from the previous solution, produces better results than
the common strategy of generating independent proposals. We use two different
strategies for generating proposals based on a current solution: greedy region
growing and gradient descent.
Greedy Region Growing. Given a current solution (θ*, d*) and a seed region
A ⊆ {1, ..., N}, i.e. A is a subset of the feature points, we obtain a new proposal
(θ, d) by fixing the depths of the points in the seed region to the current solution.
Then, we iterate over all other points, sorted according to their distance to the
seed region, and update their depths. The new value for each depth is chosen so that
the distances between the point and its already visited neighbours better match
the current distances d*. The outline of the algorithm is shown in Algorithm 1.
Note that, in the minimisation defined in line 3 of Algorithm 1, we allow only
a discrete set D of possible depths in order to improve efficiency. As mentioned
in [15], a proposal does not need to be correct at all points in order to be
useful. Instead, it can still be useful if only a part of it corresponds to a good
solution. We expect a proposal generated using this algorithm to satisfy this
criterion.
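For concreteness, a rough Python transcription of Algorithm 1 for an orthographic camera; the data layout and helper names are our own assumptions, and the distance of a point to the seed region is approximated by its 2D distance to the nearest seed point in the first frame.

```python
import numpy as np

def greedy_region_growing(depths, dists, points_2d, adjacency, seed, depth_set):
    """Sketch of Algorithm 1 (orthographic camera).

    depths    : (S, N) current depth solution theta*
    dists     : dict {(i, j): d*_ij} with i < j, current distance estimates
    points_2d : (S, N, 2) 2D tracks of the N points over the S images
    adjacency : dict {i: set of neighbours j} derived from N
    seed      : set of point indices A whose depths are kept fixed
    depth_set : 1D array of candidate depths (the discrete set D of line 3)
    """
    S, N = depths.shape
    new_depths = np.array(depths, dtype=float)
    visited = set(seed)
    order = sorted(set(range(N)) - set(seed),
                   key=lambda i: min(np.linalg.norm(points_2d[0, i] - points_2d[0, a])
                                     for a in seed))
    for i in order:
        for s in range(S):
            nbrs = [j for j in adjacency[i] if j in visited]
            if nbrs:
                # Line 3: pick the candidate depth that best matches the distances d*.
                costs = [sum(abs(np.linalg.norm(np.r_[points_2d[s, i], d]
                                                - np.r_[points_2d[s, j], new_depths[s, j]])
                                 - dists[(min(i, j), max(i, j))])
                             for j in nbrs)
                         for d in depth_set]
                new_depths[s, i] = depth_set[int(np.argmin(costs))]
        visited.add(i)
    # Lines 7-8: the L1-optimal distance for each pair is the median over frames.
    new_dists = {}
    for (i, j) in dists:
        per_frame = [np.linalg.norm(np.r_[points_2d[s, i], new_depths[s, i]]
                                    - np.r_[points_2d[s, j], new_depths[s, j]])
                     for s in range(S)]
        new_dists[(i, j)] = float(np.median(per_frame))
    return new_depths, new_dists
```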
Gradient Descent. Another alternative to generate a proposal from a current
solution is to adopt a gradient descent strategy, as proposed in [19]. The new
proposal (θ, d) is obtained from the current one (θ*, d*) by differentiating the
energy function:

θ_i^s = θ_i^{*s} − λ ∂E(θ*, d*)/∂θ_i^s        d_ij = d*_ij − λ ∂E(θ*, d*)/∂d_ij     (4)

where λ is a weighting term. Note that this weighting is less crucial than in normal
gradient descent, since using this new proposal in a fusion move guarantees
that the energy does not increase.

4.3 Initialisation

We have described how to obtain new proposals from a current solution. We now
discuss how to define the initial proposal (θ*, d*). We start by approximating the
unknown distances d_ij from the known 2D projections by taking the maximum
2D distance between each pair of points over all images, i.e. we initialise the
distances as d*_ij = max_s ||p_i^s − p_j^s||. For an orthographic camera, these 2D distances
are guaranteed to be a lower bound on the real 3D distances.
All the depths are initialised to the same value, i.e. we set θ_i^s = const ∀i, s,
where const = 0 for orthographic and const = k for perspective cameras.
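A small sketch of this initialisation, with hypothetical argument names (points_2d holds the 2D tracks of all points over all frames):

```python
import numpy as np

def initial_proposal(points_2d, neighbours, camera="orthographic", k=1.0):
    """Initial proposal of Section 4.3: each distance starts at the maximum 2D
    separation over all frames (a lower bound under orthography), and all depths
    start at a constant (0 for orthographic, k for perspective cameras)."""
    dists = {(i, j): float(np.linalg.norm(points_2d[:, i] - points_2d[:, j], axis=1).max())
             for (i, j) in neighbours}
    const = 0.0 if camera == "orthographic" else k
    depths = np.full(points_2d.shape[:2], const)
    return depths, dists
```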

4.4 Full Optimisation Method

Our overall algorithm functions in the following way: we initialise the algorithm
8 times as described in Section 4.3. We then run 50 fusion moves for each
initialisation, where the proposals are generated using the greedy region growing
procedure summarised in Algorithm 1. As the seed region we alternate
between a pair of neighbouring points and a half-plane region defined by a
constraint of the form {i : x_i^s < m}, {i : x_i^s > m}, {i : y_i^s < m} or {i : y_i^s > m},
for some frame s and constant m.
The 8 initialisations are optimised independently, so we obtain 8 different
solutions and choose the one with minimum energy. Finally, we run
an additional 100 fusion moves on this solution, with proposals obtained from
gradient descent.

5 Extensions
5.1 Additional Priors

Two commonly used priors for the task of non-rigid reconstruction are temporal
and surface smoothness. Both priors can be included in our formulation
by adding extra terms to the energy function and using a similar optimization
strategy. Note that the addition of these extra terms requires a choice of sensible
weights for the different terms.

Temporal Smoothness. Temporal smoothness is a common assumption in


reconstruction from video sequences [5]. Note that so far we have not assumed
any particular order in the images used. If these are consecutive frames of a
video sequence, our formulation can be complemented with a temporal prior by


adding one of the following terms to the energy function:

T(θ) = Σ_{s=1}^{S−1} Σ_{i=1}^{N} |θ_i^s − θ_i^{s+1}|        T(θ) = Σ_{s=1}^{S−2} Σ_{i=1}^{N} |θ_i^s − 2θ_i^{s+1} + θ_i^{s+2}|     (5)

which correspond, respectively, to a first-order and a second-order temporal prior.
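Both priors of Eq. (5) are straightforward to compute when the depths are stored as an (S, N) array, as in the following sketch (our own code):

```python
import numpy as np

def temporal_priors(depths):
    """First- and second-order temporal smoothness terms of Eq. (5)."""
    first_order = np.abs(depths[:-1] - depths[1:]).sum()
    second_order = np.abs(depths[:-2] - 2.0 * depths[1:-1] + depths[2:]).sum()
    return first_order, second_order
```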

Surface Smoothness. Surface smoothness is usually imposed in approaches
that recover a continuous surface [3] or that represent the reconstruction as a
combination of smooth basis shapes [1]. Since we do not recover a continuous
surface or use a basis shape formulation, we use an alternative definition of
smoothness that follows [14] and that has been shown to considerably improve
the quality of stereo reconstructions. Assume the feature points form a regular
square grid. We consider triplets of neighbouring points (i, j, k) on the same
horizontal or vertical line. A second-order smoothness prior can then be imposed by
adding extra terms, one per triplet, to the energy function. The term for each
triplet has the form |θ_i − 2θ_j + θ_k|, where j is the index of the middle point
and we assume for simplicity that the camera is orthographic.
For most sequences no prior knowledge exists on the location of feature points
and they do not necessarily form a regular grid. In that case we approximate this
construction using triplets of nearby points which are collinear in all images.

5.2 Dealing with Missing Data


Our formulation does not require that all points are visible in all frames and can
easily be updated to deal with missing data. Let V_s be an indicator function for
each image s, such that V_s(i) = 1 if point i is visible in image s and V_s(i) = 0
otherwise. Also, let N_s = {(i, j) : (i, j) ∈ N ∧ V_s(i) = 1 ∧ V_s(j) = 1} be
a constrained neighbourhood system for each image s that takes visibility into
account. The energy function in the presence of missing data takes the form:

E(θ, d) = Σ_{s=1}^{S} Σ_{(i,j)∈N_s} | ||Q_i^s(θ_i^s) − Q_j^s(θ_j^s)|| − d_ij |     (6)

where θ_i^s is not defined if V_s(i) = 0, i.e. there is no 3D reconstruction for a
point in the images where it is occluded. Adapting the optimisation method to deal with missing data
is also straightforward, since it only requires taking the per-image
neighbourhood system into account in the region growing algorithm.
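A sketch of energy (6): compared with (3), pairs with an occluded endpoint are simply skipped (the variable names, including the visibility array, are our own assumptions):

```python
import numpy as np

def masked_energy(depths, points_2d, dists, neighbours, visible):
    """Energy (6) with missing data: `visible` is an (S, N) boolean array and a
    pair (i, j) contributes in image s only if both endpoints are visible."""
    energy = 0.0
    for s in range(depths.shape[0]):
        Q = np.column_stack([points_2d[s], depths[s]])
        for i, j in neighbours:
            if visible[s, i] and visible[s, j]:
                energy += abs(np.linalg.norm(Q[i] - Q[j]) - dists[(i, j)])
    return energy
```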

6 Experiments
We evaluate our approach in the two different scenarios it allows: template-based
reconstruction for a single image and template-free reconstruction from multiple
images. We use several datasets that have been previously introduced in [2, 8, 20,
images. We use several datasets that have been previously introduced in [2, 8, 20,
11]. We start by providing details about the construction of the neighbourhood
system and the computation of the error measures.

Median pwre for our method:

                 No noise   With noise
  Orthographic     0.25        0.77
  Perspective      0.39        1.80

Fig. 3. Template-based reconstruction for synthetic data. The table records the
median pwre for 1000 randomly generated paper sheets. We show results for two
typical paper sheets: the black grid is the ground truth and the blue points are our reconstruction.

Definition of the Neighbourhood System N. For all pairs of points (i, j),
1 ≤ i ≠ j ≤ N, we compute their furthest distance f_ij over all the frames:
f_ij = max_s ||p_i^s − p_j^s||. For each point i, let K_i be its set of k-nearest neighbours
according to the distances f_ij. The neighbourhood system is then defined as
N = ∪_i {(i, j) : j ∈ K_i, f_ij ≤ t}, where t corresponds to a threshold on the
distance. For most of the experiments we set k = 10 and t = 200. An example
of a neighbourhood system can be seen in Fig. 1(b), where each edge corresponds
to a pair of points in the neighbourhood.
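A possible implementation of this construction (a sketch under our own data layout, with points_2d an (S, N, 2) array of 2D tracks):

```python
import numpy as np

def build_neighbourhood(points_2d, k=10, t=200.0):
    """Neighbourhood system N: pairs (i, j) such that j is among the k-nearest
    neighbours of i under f_ij = max_s ||p_i^s - p_j^s|| and f_ij <= t."""
    S, N, _ = points_2d.shape
    diffs = points_2d[:, :, None, :] - points_2d[:, None, :, :]   # (S, N, N, 2)
    f = np.linalg.norm(diffs, axis=-1).max(axis=0)                # furthest distances
    np.fill_diagonal(f, np.inf)
    pairs = set()
    for i in range(N):
        for j in np.argsort(f[i])[:k]:
            if f[i, j] <= t:
                pairs.add((min(i, int(j)), max(i, int(j))))
    return pairs
```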

Error Measures. Some of the datasets have associated ground truth that can
be used to quantitatively evaluate our method. We denote the ground truth
reconstruction of point i in image s by Q̄_i^s. We found that there is no standard
error measure uniformly used in previous related work, so to enable an easier
comparison we use two different error metrics reported by previous authors: the
per-frame pointwise reconstruction error (pwre) [2, 3], given by
pwre = (1/N) Σ_{i=1}^{N} ||Q_i^s − Q̄_i^s||, and the root mean squared error (rmse) over a sequence
[8], defined as rmse = √( (1/(3NS)) Σ_{s=1}^{S} Σ_{i=1}^{N} ||Q_i^s − Q̄_i^s||² ). When considering an
orthographic camera, we resolve the depth ambiguity by subtracting the centroids
from both the ground truth and our solution. We then choose the flip with the smallest
error from the two possible flips. In contrast with some previous work, we do
not use Procrustes analysis to align the shapes.
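The two error measures and the orthographic alignment, as described above, can be computed as in the following sketch (Q and Q_gt are hypothetical names for our reconstruction and the ground truth):

```python
import numpy as np

def pwre(Q, Q_gt):
    """Per-frame pointwise reconstruction error; Q and Q_gt are (N, 3)."""
    return np.mean(np.linalg.norm(Q - Q_gt, axis=1))

def rmse(Q, Q_gt):
    """Root mean squared error over a sequence; Q and Q_gt are (S, N, 3)."""
    S, N, _ = Q.shape
    return np.sqrt(np.sum((Q - Q_gt) ** 2) / (3.0 * N * S))

def align_orthographic(Q, Q_gt):
    """Subtract per-frame centroids from both shapes and keep the depth flip
    with the smaller error (no Procrustes alignment is applied)."""
    Qc = Q - Q.mean(axis=1, keepdims=True)
    Gc = Q_gt - Q_gt.mean(axis=1, keepdims=True)
    flipped = Qc * np.array([1.0, 1.0, -1.0])
    return (Qc if rmse(Qc, Gc) <= rmse(flipped, Gc) else flipped), Gc
```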

6.1 Template-Based Reconstruction for a Single Image


To evaluate the performance of our algorithm for 3D reconstruction of a single
image given a known template, we use the synthetic dataset provided by [2, 3],
based on synthetically generated pieces of paper (i.e. developable surfaces). We
use the method in [21], and the authors' code, to generate 1000 synthetic pieces
of paper. Furthermore, we use the same experimental setup as reported in [3]:
a 200mm-wide piece of paper with 150 randomly extracted correspondences and
zero-mean Gaussian noise with a standard deviation of 1 pixel added to the point
correspondences.
The results for our method are summarised in Fig. 3. We consider both
an orthographic and a calibrated perspective camera and generate the images
accordingly. We set k = 20 in the generation of the neighbourhood. Our 3D

error for perspective projection with noise compares favourably with the results
reported in [3]. We outperform all other point-wise reconstruction methods
[9, 2, 3], all of which have a median pwre greater than 2, while ours is 1.8. The
only method that outperforms ours is the surface reconstruction method also
introduced in [3], which has a median pwre of around 1.5 (see figure 3 in [3]). However,
surface reconstruction methods use stronger priors than pointwise approaches.

6.2 Template-Free Reconstruction of a Small Collection of Images

We show results of our template-free reconstruction algorithm applied to a small


number of images. We report qualitative results on three datasets with real
images often used in the NRSfM literature.

Graffiti Dataset [3]: Fig. 1 shows the reconstructions we obtained by using
only the two images displayed.

Paper Sequence [20]: Fig. 4 shows the results of our method applied to re-
construct just 3 frames of the sequence. We use the tracks provided by [8].
In these reconstructions we make use of a calibrated perspective camera and of
the surface smoothness prior discussed in Section 5.1, since the interest points
are positioned in a grid. For both datasets, we obtain reconstructions which are
visually comparable to previous methods that use more information, either a
template [3] or temporal consistency [8, 6]. Furthermore, our method recovers

Fig. 4. Reconstruction of a small collection of images. Original images with
point correspondences colour-coded by estimated depth: red points are further from
the camera and blue points are closer. Top row: Paper sequence (frames 1, 30 and 70) with the estimated 3D
shapes for the 3 frames. Bottom row: Head sequence (frames 1, 39 and 60). Best viewed in colour.

correctly the relative depths of the reconstructed shapes. This can be seen in the
last column of Fig. 4.

Head Sequence [22]: This is an example of a non-developable surface with


articulated motion. We reconstruct the three frames shown in Fig. 4. The choice
of the neighbourhood system is particularly relevant for this sequence, since
a naive choice can connect points which are in different articulated parts (for
example the shoulder and the ear). In order to prevent this, we set the parameters
to t = 100 and k = 7. We correctly recover the differences in depth between the
two upper arms in the first two frames, as well as the facial features.

6.3 Reconstruction of Temporal Video Sequences


We also evaluate our method on four challenging MoCap temporal sequences
(with ground truth data). The wind [8] and flag [11] sequences correspond
respectively to paper and fabric waving in the wind, the jacky [5] sequence
shows a face and the bend [8] sequence depicts a piece of paper bending. We assume an
orthographic camera and report results without the use of the temporal prior
described in Section 5.1. The results are summarised in Table 2. Following [8],
we also report results when Gaussian noise of standard deviation σ = ρ/100 is
added to every point, where ρ is the maximum distance of an image point to the
centroid of all the points. Our method consistently outperforms Taylor et al. [8]
on all ground truth sequences, and performs comparably to Russell et al. [6]¹. On
the jacky sequence we also obtain results comparable to methods that use linear
basis shapes [5]. The near rigidity of this sequence is particularly favourable for

Table 2. Quantitative results for MoCap temporal sequences. We compare our method
with previous methods in terms of RMSE. For the flag sequence, we reconstructed only
the first 50 frames.

Sequence     S     N    Noise   Time (min)   RMSE (Ours)   RMSE (Competitors)
flag [11]    50    540    0        12           1.79        1.29 [6]* / 2.13 [8]†
wind [8]     1000  17     0        15           0.73        1.7 [8]
                          1        10           1.48        3.4 [8]
jacky [5]    1000  43     0        40           3.14        3.8 [8] / 3.2 [5]
                          1        49           3.85        4.7 [8] / 3.5 [5]
bend [8]     471   34     0        11           0.87        -

* Error value reproduced from [6]. This error corresponds to the full sequence and was
converted from the measure used in [6]: the Frobenius norm normalised by size.
† Error value reproduced from [8].

¹ [8, 6] are currently considered the state-of-the-art NRSfM methods.

Fig. 5. Flag sequence. Reconstruction (red dots) overlaid with ground truth (black
circles) for a few frames from different viewpoints.

this type of method. On the challenging MoCap flag sequence, our 3D error of
1.79 on the first 50 frames of the sequence is only slightly larger than that of
the piecewise quadratic approach of [6], which reports an error of 1.29 for
the full sequence. We show qualitative 3D reconstructions for the flag sequence
in Fig. 5.²

7 Conclusion

In this work, we introduced a soft inextensibility constraint for the reconstruction
of non-rigid objects from as few as two images. We showed how we can include
inextensibility in an energy minimisation framework and how to optimise it using
recently introduced techniques for discrete optimisation. Our method can
be used in two distinct scenarios that have so far been considered separately,
with specially designed methods: template-based reconstruction and NRSfM. We
evaluated our method on several datasets, including MoCap sequences, obtaining
performance similar to state-of-the-art methods. Our formulation allows the
incorporation of additional priors and we believe it can be further extended and
customised.

References
1. Salzmann, M., Moreno-Noguer, F., Lepetit, V., Fua, P.: Closed-Form Solution to
Non-rigid 3D Surface Registration. In: Forsyth, D., Torr, P., Zisserman, A. (eds.)
ECCV 2008, Part IV. LNCS, vol. 5305, pp. 581–594. Springer, Heidelberg (2008)
2. Perriollat, M., Hartley, R., Bartoli, A.: Monocular template-based reconstruction
of inextensible surfaces. In: BMVC (2008)
3. Brunet, F., Hartley, R., Bartoli, A., Navab, N., Malgouyres, R.: Monocu-
lar Template-Based Reconstruction of Smooth and Inextensible Surfaces. In:
Kimmel, R., Klette, R., Sugimoto, A. (eds.) ACCV 2010, Part III. LNCS, vol. 6494,
pp. 52–66. Springer, Heidelberg (2011)
4. Bregler, C., Hertzmann, A., Biermann, H.: Recovering non-rigid 3D shape from
image streams. In: CVPR (June 2000)
5. Torresani, L., Hertzmann, A., Bregler, C.: Non-rigid structure-from-motion: Esti-
mating shape and motion with hierarchical priors. PAMI (2008)
² Note that the apparent strip-like structure of the flag is present in the ground truth
3D and is due to the regular way in which the MoCap markers were placed.

6. Russell, C., Fayad, J., Agapito, L.: Energy based multiple model fitting for non-
rigid structure from motion. In: CVPR (2011)
7. Varol, A., Salzmann, M., Tola, E., Fua, P.: Template-free monocular reconstruction
of deformable surfaces. In: ICCV (2009)
8. Taylor, J., Jepson, A.D., Kutulakos, K.N.: Non-rigid structure from locally-rigid
motion. In: CVPR (2010)
9. Salzmann, M., Fua, P.: Reconstructing sharply folding surfaces: A convex formu-
lation. In: CVPR (2009)
10. Ferreira, R., Xavier, J., Costeira, J.P.: Shape from motion of nonrigid objects: the
case of isometrically deformable flat surfaces. In: BMVC (2009)
11. Fayad, J., Agapito, L., Del Bue, A.: Piecewise Quadratic Reconstruction of Non-
Rigid Surfaces from Monocular Sequences. In: Daniilidis, K., Maragos, P., Paragios,
N. (eds.) ECCV 2010, Part IV. LNCS, vol. 6314, pp. 297–310. Springer, Heidelberg
(2010)
12. Ullman, S.: Maximizing rigidity: the incremental recovery of 3-d structure from
rigid and rubbery motion. Perception (1984)
13. Jasinschi, R., Yuille, A.: Nonrigid motion and Regge calculus. Journal of the Optical
Society of America 6(7), 1088–1095 (1989)
14. Woodford, O.J., Torr, P.H.S., Reid, I.D., Fitzgibbon, A.W.: Global stereo recon-
struction under second order smoothness priors. In: CVPR (2008)
15. Lempitsky, V., Rother, C., Roth, S., Blake, A.: Fusion moves for Markov random
field optimization. PAMI 32(8), 1392–1405 (2010)
16. Ishikawa, H.: Higher-order clique reduction in binary graph cut. In: CVPR (2009)
17. Rother, C., Kolmogorov, V., Lempitsky, V., Szummer, M.: Optimizing binary
MRFs via extended roof duality. In: CVPR (2007)
18. Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow
algorithms for energy minimization in vision. PAMI (September 2004)
19. Ishikawa, H.: Higher-order gradient descent by fusion-move graph cut. In: ICCV
(2009)
20. Salzmann, M., Hartley, R., Fua, P.: Convex optimization for deformable surface
3-d tracking. In: ICCV (2007)
21. Perriollat, M., Bartoli, A.: A quasi-minimal model for paper-like surfaces. In:
CVPR (2007)
22. Yan, J., Pollefeys, M.: A General Framework for Motion Segmentation: Independent,
Articulated, Rigid, Non-rigid, Degenerate and Non-degenerate. In: Leonardis, A.,
Bischof, H., Pinz, A. (eds.) ECCV 2006, Part IV. LNCS, vol. 3954, pp. 94–106.
Springer, Heidelberg (2006)
Spatiotemporal Descriptor for Wide-Baseline Stereo
Reconstruction of Non-rigid and Ambiguous Scenes

Eduard Trulls, Alberto Sanfeliu, and Francesc Moreno-Noguer

Institut de Robotica i Informatica Industrial (CSIC/UPC)


C/ Llorens i Artigas 4-6, 08028 Barcelona, Spain
{etrulls,sanfeliu,fmoreno}@iri.upc.edu

Abstract. This paper studies the use of temporal consistency to match appear-
ance descriptors and handle complex ambiguities when computing dynamic depth
maps from stereo. Previous attempts have designed 3D descriptors over the space-
time volume and have been mostly used for monocular action recognition, as they
cannot deal with perspective changes. Our approach is based on a state-of-the-art
2D dense appearance descriptor which we extend in time by means of optical
flow priors, and can be applied to wide-baseline stereo setups. The basic idea be-
hind our approach is to capture the changes around a feature point in time instead
of trying to describe the spatiotemporal volume. We demonstrate its effectiveness
on very ambiguous synthetic video sequences with ground truth data, as well as
real sequences.

Keywords: stereo, spatiotemporal, appearance descriptors.

1 Introduction
Appearance descriptors have been used to successfully address both low- and high-
level computer vision problems by describing in a succinct and informative manner the
neighborhood around an image point. They have proved robust in many applications
such as 3D reconstruction, object classification or gesture recognition, and remain a
major topic of research in computer vision.
Stereo reconstruction is often performed matching descriptors from two or more dif-
ferent images to determine the quality of each correspondence, and applying a global
optimization scheme to enforce spatial consistency. This approach fails on scenes with
poor texture or repetitive patterns, regardless of the descriptor or the underlying opti-
mization scheme. In these situations we can attempt to solve the correspondence prob-
lem by incorporating dynamic information. Many descriptors are based on histograms
of gradient orientations [1–3], which can be extended to the space-time volume. Since
the local temporal structure depends strongly on the camera view, most efforts in this
direction have focused on monocular action recognition [4–7]. For stereo reconstruc-
tion, spatiotemporal descriptors computed in this manner should be oriented according

Work supported by the Spanish Ministry of Science and Innovation under projects Rob-
TaskCoop (DPI2010-17112), PAU+ (DPI2011-27510) and MIPRCV (Consolider-Ingenio
2010)(CSD2007-00018), and the EU ARCAS Project FP7-ICT-2011-287617. E. Trulls is sup-
ported by a scholarship from Universitat Politecnica de Catalunya.


to the geometry of the setup. This approach was applied in [8] to disparity estimation
for narrow-baseline stereo, but its application to wide-baseline scenarios remains unex-
plored.
We face the problem from a different angle. Our main contribution is a spatiotempo-
ral approach to 3D stereo reconstruction applicable to wide-baseline stereo, augment-
ing the descriptor with optical flow priors instead of computing 3D gradients or similar
primitives. We compute, for each camera, dense Daisy descriptors [3] for a frame over
the spatial domain, and then extend them over time with optical flow priors while dis-
carding incorrect matches, effectively obtaining a concatenation of spatial descriptors
for every pixel. After matching, global optimization algorithm is applied over the spa-
tiotemporal domain to enforce spatial and temporal consistency. The core idea is to
develop a primitive that captures the evolution of the spatial structure around a fea-
ture point in time, instead of trying to describe the spatiotemporal volume, which for
matching requires a knowledge of the geometry of the scene. We apply this approach to
dynamic sequences of non-rigid objects with a high number of ambiguous correspon-
dences and evaluate it on both synthetic and real sequences. We show that its reconstruc-
tions are more accurate and stable than those obtained with state-of-the-art descriptors.
In addition, we demonstrate that our approach can be applied to wide-baseline setups
with occlusions, and that it performs very strongly against image noise.

2 Related Work
The main reference among keypoint descriptors is still the SIFT descriptor [1], which
has shown great resilience against affine deformations on both the spatial and intensity
domains. Given its computational cost, subsequent work has often focused on devel-
oping more efficient descriptors, such as PCA-SIFT [9], GLOH [2] or SURF [10], and
recent efforts such as Daisy veer towards dense computation [3], i.e. computing a de-
scriptor for every pixel. Open problems include the treatment of non-rigid deformations,
scale, and occlusions. Current techniques accommodate scale changes by limiting their
application to singular points [1, 11, 10], where scale can be reliably estimated. A recent
approach [12] exploits a log-polar transformation to achieve scale invariance without
detection. Non-rigid deformations have been seldom addressed with region descriptors,
but recent advances have shown that kernels based on heat diffusion geometry can ef-
fectively describe local features of deforming surfaces [13]. Regarding occlusions, [3]
demonstrated performance improvements in multi-view stereo from the treatment of
occlusion as a latent variable and enforcing spatial consistency with graph cuts [14].
Although regularization schemes such as graph cuts may improve the spatial consis-
tency when matching pairs of images, in scenes with little texture or highly repetitive
patterns the problem can be too challenging. Dynamic information can then be used
to further discriminate amongst possible matches. In controlled settings, the correspon-
dence problem can be further relaxed via structured-light patterns [15, 16]. Otherwise,
more sophisticated descriptors need to be designed. For this purpose, SIFT and its many
variants have been extended to 3D data, although mostly for monocular cameras, as the
local temporal structure varies strongly with large viewpoint changes. For instance, this
has been applied to volumetric images on clinical data, and to video sequences for ac-
tion recognition [6, 5, 7, 17]. There are exceptions in which spatiotemporal descriptors

have also been applied to disparity estimation for narrow-baseline stereo [8], designing
primitives (Stequels) that can be reoriented and matched from slightly different view-
points. The application of this approach to wide-baseline scenarios remains unexplored.
Our approach is also related to scene flow, i.e. the simultaneous recovery of 3D
flow and geometry. We have not attempted to compute scene flow, as these approaches
usually require strong assumptions such as known reflectance models or relatively small
deformations to simplify the problem, and often use simple pixel-wise matching strate-
gies that cannot be applied to ambiguous scenes or wide baselines. A reflectance model
is exploited in [18] to estimate the scene flow under known illumination conditions.
Zhang and Kambhamettu [19] estimate an initial disparity map and compute the scene
flow iteratively, exploiting image segmentation to maintain discontinuities. A more re-
cent approach is that of [20], which couples dense stereo matching with optical flow
estimation; this involves a set of partial differential equations which are solved nu-
merically. Of particular interest to us is the work of [21], which decouples stereo from
motion while enforcing a scene flow consistent across different views; this approach
is agnostic to the algorithm used for stereo.

3 Spatiotemporal Descriptor

Our approach could be applied, in principle, to any spatial appearance descriptor de-
scribed in Section 2. We pick Daisy [3] for the following reasons: (1) it is designed
for dense computation, which has been proven useful for wide-baseline stereo, and
will particularly help in ambiguous scenes without differentiated salient points; (2) it is
efficient to compute, and more so for our purposes, as we can store the convolved ori-
entation maps for the frames that make up a descriptor without recomputing them; and
(3) unlike similar descriptors like SIFT [1] or GLOH [2], Daisy is computed over a dis-
crete grid, which can be warped in time to follow the evolution of the structure around
a point, allowing us to validate the optical flow priors and determine the stability of the
descriptor in time. The last point will help us particularly when dealing with occlusions.
This section describes the computation of the descriptor set for a single camera, and the
following section explains how to match descriptor sets for stereo reconstruction.
We extend the Daisy descriptor in the following way. For every new (gray-scale)
frame I_k, we compute the optical flow priors for each consecutive pair of frames from
the same camera, in both the forward and backward directions: F^+_{k−1} (from I_{k−1} to I_k)
and F^−_k (vice versa). To compute a full set of spatiotemporal descriptors for frame k
using T = 2B + 1 frames, we require frames I_{k−B} to I_{k+B} and their respective flow
priors. To compute the flow priors we use [22]. Since the size of the descriptor grows
linearly with T, we use small values, of about 3 to 11.
We then compute, for each frame, H gradient maps, defined as the gradient norm
at each location if it is greater than zero, to preserve polarity. Each of these maps
is convolved with multiple Gaussian kernels of different standard deviations to obtain the convolved
orientation maps, so that the size of the kernel determines the size of the region. The
Gaussian filters are separable, and the convolution for larger kernels is obtained from
consecutive convolutions with smaller kernels, for increased performance. The result is
a series of convolved orientation maps, which contain a weighted sum of gradient norms
444 E. Trulls, A. Sanfeliu, and F. Moreno-Noguer

around each pixel. Each map describes regions of a different size for a given orientation.
A Daisy descriptor for a given feature point would then be computed as a concatenation
of these values over certain pixel coordinates defined by a grid. The grid is defined as
equally spaced points over equally spaced concentric circles around the feature point
coordinates (see Fig. 1). The number of circles Q and the number of grid points on
each circle P , along with the number of gradient maps H, determine the size of the
descriptor, S = (1 + PQ)H. The number of circles also determines the number of
kernels to compute, as outer rings use convolution maps computed over larger regions
for some invariance against rotation. The histograms are normalized separately for each
grid point, for robustness against partial occlusions. We can compute a score for the
match between a pair of Daisy descriptors as:

D(D_1(x), D_2(x)) = (1/S) Σ_{g=1}^{S} ||D_1^[g](x) − D_2^[g](x)||,     (1)

where D_i^[g] denotes the histogram at grid point g = 1 . . . G. To enforce spatial coherence
in dealing with occlusions we apply binary masks over each grid point, obscuring
parts of its neighborhood and thus removing them from the matching process, following [3].
The choice of masks will be explained in the following section. The matching
score (cost) can then be computed in the following way:

D̃(D_1(x), D_2(x)) = (1 / Σ_{q=1}^{S} M[q]) Σ_{g=1}^{S} M[g] ||D_1^[g](x) − D_2^[g](x)||,     (2)

where M[g] is the mask for grid point g. Note that we do not use masks to build the
spatiotemporal descriptor, only to match two spatiotemporal descriptors for stereo reconstruction.
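As an illustration of Eqs. (1) and (2), the sketch below computes the matching cost for two descriptors stored as (G, H) arrays, one H-bin histogram per grid point; the layout and normalisations follow the equations as printed above and are otherwise our own assumptions.

```python
import numpy as np

def daisy_match_cost(d1, d2, mask=None):
    """Matching cost between two Daisy descriptors stored as (G, H) arrays.
    Without a mask this follows Eq. (1); with a binary mask over grid points
    (e.g. around occlusions) it follows Eq. (2)."""
    per_grid_point = np.linalg.norm(d1 - d2, axis=1)
    if mask is None:
        return per_grid_point.sum() / d1.size
    return (mask * per_grid_point).sum() / mask.sum()
```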
For our descriptor, we use the grid defined above to compute the sub-descriptor over
the feature point on the central frame. We then warp the grid through time by means of
the optical flow priors, translating each grid point independently. In addition to warping
the grid, we average the angular displacement of each grid point over the center of the
grid to estimate the change in rotation over the patch between the frames, and compute
a new sub-descriptor using the warped grid and the new orientation.
An example of this procedure is depicted in Fig. 1. We then match each sub-
descriptor computed with the warped grid against the sub-descriptor for the central
frame, and discard it if the matching score falls above a certain threshold. The matching
score is typically smaller than the occlusion cost in the regularization process (see sec-
tion 4). With this procedure we can validate the optical flow and discard a match if the
patch suffers significant transformations such as large distortions, lighting changes or,
in particular, partial occlusions. Note that we do not need to recompute the convolved
orientation maps for each frame, as we can store them in memory. Computing the de-
scriptors is then reduced to sampling the convolved orientation maps at the appropriate
grid points on each frame. To compute descriptors over different orientations we simply
rotate the grid and shift the histograms circularly. We also apply interpolation over both
the spatial domain and the gradient orientations. The resulting descriptor is assembled

Fig. 1. Computation of the spatiotemporal descriptor with flow priors. Sub-descriptors outside
the central frame are computed with a warped grid and matched against the sub-descriptor for the
central frame. Valid sub-descriptors are then concatenated to create the spatiotemporal descriptor.


by concatenating the sub-descriptors in time, D(x) = {D_{k−B}, . . . , D_{k+B}}, and its size
is at most S′ = T·S, where S is the size of a single Daisy descriptor.
To compute the distance between two spatiotemporal descriptors D_1 and D_2 we average the
distance between valid pairs of sub-descriptors:

D̃(D_1(x), D_2(x)) = (1/V) Σ_{l=k−B}^{k+B} v_l D(D_1^{l}(x), D_2^{l}(x)),     (3)

where D_i^{l} is the sub-descriptor for frame l, the v_l are binary flags that mark
valid matching sub-descriptor pairs (where both sub-descriptors pass the validation process),
and V = Σ_{l=k−B}^{k+B} v_l. Note that in the worst-case scenario we will always match
the sub-descriptors for frame k.
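A sketch of Eq. (3) under hypothetical containers: each spatiotemporal descriptor is a list of T sub-descriptors with per-frame validity flags, and sub_distance is the Daisy matching cost of Eq. (1).

```python
def spatiotemporal_distance(D1, D2, valid1, valid2, sub_distance):
    """Average the Daisy distance over the sub-descriptor pairs that are valid
    on both sides (Eq. 3); the central-frame pair is always valid by design."""
    flags = [a and b for a, b in zip(valid1, valid2)]
    V = sum(flags)
    return sum(sub_distance(d1, d2)
               for d1, d2, ok in zip(D1, D2, flags) if ok) / V
```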

4 Depth Estimation
For stereo reconstruction we use a pair of calibrated monocular cameras. Since Daisy is
not rotation-invariant we require the calibration data to compute the descriptors along
the epipolar lines, rotating the grid. We do this operation for the central frame, and
for frames forwards and backwards we use the flow priors to warp the grid and to
estimate the patch rotation between the frames. We discretize the 3D space from a
given perspective, then compute the depth for every possible match of descriptors and
store it if the match is the best for a given depth bin, building a cube of matching scores
of size W × H × L, where W and H are the width and height of the image and L is
the number of layers we use to discretize 3D space. We can think of these values as
costs, and then apply global optimization techniques such as graph cuts [14] to enforce

piecewise smoothness, computing a good estimate that balances the matching costs with
a smoothness cost that penalizes the use of different labels on neighboring pixels. To
deal with occlusions we incorporate an occlusion node in the graph structure with a
constant cost. We use the same value, 20% of the maximum cost, for all experiments.
We can use two strategies for global optimization: enforcing spatial consistency or
enforcing both spatial and temporal consistency. To enforce spatial consistency, we can
match two sets of spatiotemporal descriptors and run the optimization algorithm for a
single frame. The smoothness function for the optimization process will consider the 4
pixels adjacent to the feature point, and its cost is binary: a constant penalty for differing
labels, and 0 for matching labels.
To enforce spatial and temporal consistency, we match two sets of multiple spa-
tiotemporal descriptors. To do this we use a hypercube of matching costs of size
W × H × L × M, where M is the number of frames used in the optimization process. Note
that M does not need to match T , the number of frames used for the computation of
the descriptors. We then set 6 neighboring pixels for the smoothing function, so that
each pixel (x, y, t) is linked to its 4 adjacent neighbors over the spatial domain and the
two pixels that share its spatial coordinates on two adjacent frames, (x, y, t 1) and
(x, y, t + 1). We then run the optimization algorithm over the spatiotemporal volume
and obtain M separate depth maps. The value for M is constrained by the amount of
memory available, and we use M = 5 for all the experiments in this paper.
This strategy produces better, more stable results on dynamic scenes, but introduces
one issue. The smoothness function operates over both the spatial and the temporal
domain in the same manner, but we have a much shorter buffer in the temporal domain
(M , versus the image size). This results in over-penalizing label discontinuities over
the time domain, and in practice smooth gradients on the temporal dimension are often
lost in favor of two frames with the same depth values. Additionally, the frames at
either end of the spatiotemporal volume are linked to a single frame, as opposed to two,
and the problem becomes more noticeable. For 2D stereo we apply the Potts model,
F(α, β) = λ_S · T(α ≠ β), where α and β are discretized depth values for adjacent
pixels, λ_S is the smoothness cost, and T(·) is 1 if the argument is true and 0 otherwise.
For spatiotemporal stereo we use instead a truncated linear model:
F′(α, β) = { λ_S |α − β| / L    if 0 < |α − β| ≤ L
           { λ_S                 if |α − β| > L                    (4)
           { 0                   otherwise.

Notice that F = F′ for L = 1. This effectively relaxes the smoothness constraint,
but the end results are better due to the integration of spatiotemporal hypotheses. The
but the end results are better due to the integration of spatiotemporal hypotheses. The
frames at either end of the spatiotemporal buffer, which are linked to one single neigh-
bor, can still present this behaviour. We solve this problem discarding their depth maps,
which are reestimated in the following iteration. Note that we do not need to recom-
pute the cost cubes for the discarded frames: we just store the two cubes on the end
of the buffer, which become the first ones on the next buffer, and then compute the
following M 2 cubes (M 3). This procedure is depicted in Fig. 2. This strategy in-
troduces some overhead in the regularization process, but its computational cost is neg-
ligible with respect to the cost of matching dense descriptors. We use the spatiotemporal


Fig. 2. To perform spatiotemporal regularization, we apply a graph cuts algorithm over the spa-
tiotemporal hypercube of costs (here displayed as a series of cost cubes). The smoothness func-
tion connects pixels on a spatial and temporal neighborhood. To avoid singularities on the frames
at either end of the buffer, we discard their resulting depth maps and obtain them from the next
iteration, as pictured. We do not need to recompute the cost cubes.

We use the spatiotemporal optimization strategy for most of the experiments in this paper.
We refer to each strategy as GC-2D and GC-3D.
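Returning to the two smoothness terms, a minimal sketch of both models, assuming the reconstruction of Eq. (4) given above (with λ_S the smoothness cost and L the truncation parameter):

```python
def potts(a, b, lam_s):
    """Potts model used for 2D stereo: constant penalty for differing labels."""
    return lam_s if a != b else 0.0

def truncated_linear(a, b, lam_s, L):
    """Truncated linear model of Eq. (4), used for spatiotemporal stereo;
    it reduces to the Potts model when L = 1."""
    diff = abs(a - b)
    return lam_s * diff / L if diff <= L else lam_s
```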

4.1 Masks for Occlusions


To deal with occlusions we formalize binary masks over the descriptor, as in [3]. We
use a set of P + 1 binary masks, where P is the number of grid points on each ring.
To enforce spatial coherence the masks are predefined as follows: one mask enables all
grid points, and the rest enable the grid points in a half moon and disable the rest, for
different orientations of the half moon (see Fig. 3). After an initial iteration, we use the
first depth estimation to compute a score for each mask, using the formulation proposed
by [3], which penalizes differing depth values on enabled areas. We then compute a new
depth estimation, using for each pixel the mask with the highest score computed from
the previous depth map. Two or three iterations are enough to refine the areas around
occlusions.
As we explained in Section 1 our goal is to design a spatiotemporal descriptor that
can be applied to wide-baseline scenarios without prior knowledge of their geometry,
in order to deal with highly ambiguous video sequences. As such we are not focused on
dealing with occlusions, and in fact do not use masks for most experiments in this paper.
Masks only help around occlusions, and will be counter-productive if the first depth es-
timation contains errors due to ambiguous matches, as they will incorrectly reject valid
data in the following iterations. Nevertheless, we extend this approach to our descriptor
and apply it to a standard narrow-baseline video sequence with ground truth data (see
Section 5.4), to demonstrate its viability for spatiotemporal stereo. To apply a mask over
the spatio-temporal descriptor, we apply it separately to each of its sub-descriptors for
different frames. For simplicity, we use a GC-2D optimization for the first iterations,
and a GC-3D optimization on the last iteration. The masks are recomputed at each step.

Fig. 3. Application of masks to scenes with occlusions. Left to right: reference image, a depth
map after one iteration, and a posterior refinement using masks. Enabled grid points are shown
in blue, and disabled grid points are drawn in black. We show a simplified grid of 9 points
for clarity.

For reference, a round of spatiotemporal stereo for two 480×360 images takes about 24
minutes on a 2 GHz dual-core computer (without parallelization), up from about 11 minutes
for Daisy. This does not include optical flow estimation (approximately 1 minute
per image), for which we use an off-the-shelf Matlab implementation; the rest of the
code is C++. The bottleneck is descriptor matching; note that for these experiments
we allow matches along most of the epipolar lines.

5 Experiments and Results


Wide-baseline datasets are scarce, and we know of none for dynamic scenes, so we
create a series of synthetic sequences with ground truth depth data to evaluate our de-
scriptor. We compute a dense mesh over 3D space, texture it with a repetitive pattern,
and use it to generate images and ground truth depth maps from different camera poses.
We generate three datasets: (1) a dataset with 5 × 5 different mesh deformations: sinusoidal
oscillations, and random noise over the mesh coordinates; (2) a wide-baseline
dataset, with camera poses along five angular increments of π/16 rad (for a largest
baseline of π/4); and (3) a narrow-baseline dataset with Gaussian image noise.
We use the first dataset to gain an intuitive understanding of what kind of dynamic
information our descriptor can take advantage of. The second and third datasets are used
to benchmark our approach against state-of-the-art descriptors: SIFT [1], Daisy [3] and
the spatiotemporal Stequel descriptor [8]. Additionally, we demonstrate the application
of masks on a real dataset with occlusions. We also apply it to highly ambiguous real
sequences without ground truth data.
Each of our synthetic datasets contains 30 small (480 × 480) images from each cam-
era pose. All the experiments are performed over the 15 middle frames, and the spa-
tiotemporal descriptors use the additional frames as required. We refer to our descriptor
as Daisy-3D, and to the spatial Daisy descriptor as Daisy-2D. If necessary, we use
Daisy-3D-2D and Daisy-3D-3D to distinguish between the spatial and spatiotemporal
regularization schemes presented in Section 4. For Daisy-2D we use the configuration
suggested in [3], with a grid radius of 15 pixels, 25 grid points and 8 gradient ori-
entations, resulting in a descriptor size of 200. For Daisy-3D we use smaller spatial
sub-descriptors, concatenated in time, e.g. 136 × 7 means (at most) 7 sub-descriptors of

size 136. For Daisy-2D and SIFT, we apply a graph-cuts spatial regularization scheme.
To provide a fair comparison with SIFT we do not use feature points, but instead com-
pute the descriptors densely at a constant scale, over a patch rotated along the epipolar
lines, as we do for Daisy-2D and -3D. For the Stequel algorithm we rectify the images
with [23]. We do not compute nor match descriptors for background pixels. We choose
a wide scene range, as the actual range of our synthetic scenes is very narrow and the
correspondence problem would be too simple. To evaluate and plot the results, we prune
them to the actual scene range (i.e. white or black pixels in the depth map plots are out-
side the actual scene range). We follow this approach for every experiment except for
those on occlusions (5.4), which already feature very richly textured scenes.

5.1 Parameter Selection


A preliminary study with the synthetic datasets shows that, as expected, noise on the
mesh coordinates is very discriminating for spatial descriptors, and even more so for
spatiotemporal descriptors. Affine and non-rigid transformations are also informative, but
to a lesser degree. The optimal number of frames used to build the descriptor is often
small, between 5 and 9. Larger values require a smaller footprint for the spatial de-
scriptor, which reduces the overall performance. For the Daisy parameters, we prefer to
reduce the number of histograms (grid points) rather than bins (oriented gradients), as
the latter prove more discriminant. For Daisy-3D we use descriptors of 17 histograms
of 8 bins (size 136) and up to 7 frames in time for most of the experiments in this paper.

5.2 Wide Baseline Experiments


For this experiment we compare our descriptor with Daisy-2D, SIFT and Stequel. We
match video sequences of the same mesh from five different camera positions, each
rotated π/16 with respect to the mesh (up to π/4), computing depth maps from the
rightmost viewpoint. Figs. 4 and 6 show the results for two video sequences using
different image patterns. One contains a highly textured scene, while the other shows
a sparse, repetitive pattern. On the configuration with the narrowest baseline Stequel
performs better than any of the other algorithms, due to sub-pixel accuracy from a
Lucas-Kanade-like refinement [8], while the rest of the algorithms suffer from the dis-
cretization process in our regularization framework. Stequel still performs well on the
following setup, despite slight systematic errors. For the wide baseline setups the other
algorithms perform much better in comparison, with the Daisy-based descriptors edg-
ing out SIFT. In fact, large baselines prove very informative without occlusions. Note
that we do not apply masks to this experiment, since the scene contains no occlusions.

5.3 Image Noise Experiments


For these experiments we have 11 video sequences with different levels of image noise.
Despite the distortion over the flow estimates, our descriptor significantly outperforms
the others (see Figs. 5 and 7). For this experiment we also run our spatiotemporal de-
scriptor with spatial regularization (Daisy-3D-2D) and with spatiotemporal regulariza-
tion (Daisy-3D-3D), to demonstrate that the descriptor itself is robust to image noise.

Fig. 4. Baseline examples (π/4 baseline). Columns: source image, Stequel, Daisy-3D-3D.

Fig. 5. Samples from the image noise experiment (noise variance 0.1, 0.25 and 0.5). Columns: source image, SIFT, Stequel, Daisy-2D, Daisy-3D-3D, optical flow.

The Stequel descriptor proves very vulnerable to noise, and SIFT performs slightly bet-
ter than Daisy.

5.4 Experiments with Masks


To assess the behavior of our descriptor against occlusions we have used a narrow-
baseline stereo dataset from [8]. It contains a very richly textured scene, and we use a
very small descriptor, with 9 grid points and 4 gradients (size 36). Note that Daisy-2D
stereo already performs very well on these settings, and we do not expect significant
improvements with the spatiotemporal approach. Instead, we use it to demonstrate that
propagating sub-descriptors in time around occlusion areas does not impair the recon-
struction process if they are validated with the mechanism presented in Section 3. Fig. 8
shows that the reconstruction does not change significantly as the number of frames
used to build the spatiotemporal descriptor increases. This is to be expected, as the
scene contains few ambiguities and dynamic content. But we also observe that using a
high number of frames does not result in singularities around occlusions.

Fig. 6. Baseline results. Accuracy (fraction of pixels under an error threshold of 5% or 10% of the scene range) as a function of the baseline (π/16 to π/4), for the gravel and flowers textures, comparing SIFT, Daisy (200), Daisy (136x7)-3D-3D and Stequel.

Fig. 7. Image noise results. Accuracy (at error thresholds of 5% and 10% of the scene range) as a function of the noise variance (0 to 0.5), comparing SIFT, Daisy (200), Daisy (136x7)-3D-2D, Daisy (136x7)-3D-3D and Stequel.

(Plot: Daisy-3D-3D against occlusions; accuracy at error thresholds of 5%, 10% and 25% of the range, as a function of the number of frames used, from 1 to 11.)

Fig. 8. Benchmarking the spatiotemporal descriptor against occlusions. The image displays the
depth error after three iterations with masks, using the Daisy-3D3D approach with 11 frames. The
error is expressed in terms of layers: white indicates a correct estimation, light yellow indicates
an error of one layer, which is often due to quantization errors, and further deviations are drawn in
yellow to red. Red also indicates occlusions. Blue indicates a foreground pixel incorrectly labeled
as an occlusion. The plot shows accuracy values at different error thresholds for an increasing
number of frames used to compute the spatiotemporal descriptor.

5.5 Experiments with Real Sequences


In addition to the synthetic sequences, we apply the algorithm to highly ambiguous
real sequences. We do so qualitatively, as we cannot compute ground truth data at high
rates through traditional means: structured light would alter the scene, most 3D laser
range finders are not fast enough, and available time-of-flight cameras or motion sen-
sor devices such as the Kinect are not accurate enough to capture small variations in
depth. We require high rates to extract an accurate flow, which would otherwise fail on
sequences of this complexity. In fact for this experiment we compute the optical flow

Fig. 9. Depth reconstruction for five consecutive frames of two very challenging stereo sequences. For each sequence, the rows show the reference images and the results of Daisy-3D-3D, Daisy-2D, SIFT and Stequel.
Spatiotemporal Descriptor for Wide-Baseline Stereo 453

at a higher frame-rate, approximately 100 Hz. We then interpolate the flow data and
run the stereo algorithm at a third of that rate. We do so because even though we have
demonstrated that our spatiotemporal descriptor is very robust against perturbations of
the flow data (see Section 5.3), video sequences of this nature may suffer from the aper-
ture problemwe believe that in practice this constraint can be relaxed. Even without
ground truth data, simple visual inspection shows that the depth estimate is generally
much more stable in time and follows the expected motion (see Fig. 9).

6 Conclusions and Future Work

We have proposed a solution to stereo reconstruction that can handle very challeng-
ing situations, including highly repetitive patterns, noisy images, occlusions, non-rigid
deformations, and large baselines. We have shown that these artifacts can only be ad-
dressed by appropriately exploiting temporal information. Even so, we have not at-
tempted to describe cubic-shaped volumes of data in space and time, as is typically
done in current spatiotemporal stereo methods. In contrast, we have used a state-of-the-
art appearance descriptor to represent all pixels and their neighborhoods at a specific
moment in time, and have propagated this representation through time using optical
flow information. The resulting descriptor is able to capture the non-linear dynamics
that individual pixels undergo over time. Our approach is very robust against optical
flow noise and other artifacts such as mislabeling due to aperture noisethe spatiotem-
poral regularization can cope with these errors. Additionally, objects that suddenly pop
on-screen or show parts that were occluded will be considered for matching across
frames where they may not existwe address this problem matching warped descrip-
tors across time.
One of the main limitations of the current descriptor is its high computational cost
and dimensionality. Thus, as future work, we will investigate techniques such as vector
quantization or dimensionality reduction to compact our representation. In addition, we
believe that monocular approaches to non-rigid 3D reconstruction and action recogni-
tion [24, 25] may benefit from our methodology, specially when the scene or the person
is observed from very different points of view. We will also research along these lines.
We would also like to investigate the application of our approach to scene flow es-
timation. We currently do not check the flow estimates for consistencynote that we
perform alternative tests which are also effective: we do check the warped descriptors
for temporal consistency and discard them when they match poorly, and we apply a
global regularization to enforce spatiotemporal consistency. Given our computational
concerns, we could decouple stereo from motion as in [21], while enforcing joint con-
sistency.

References

1. Lowe, D.: Distinctive image features from scale-invariant keypoints. IJCV (2004)
2. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. T. PAMI 27
(2005)
454 E. Trulls, A. Sanfeliu, and F. Moreno-Noguer

3. Tola, E., Lepetit, V., Fua, P.: Daisy: An efficient dense descriptor applied to wide-baseline
stereo. T. PAMI 32 (2010)
4. Scovanner, P., Ali, S., Shah, M.: A 3-dimensional SIFT descriptor and its application to
action recognition. In: Int. Conf. on Multimedia (2007)
5. Derpanis, K., Sizintsev, M., Cannons, K., Wildes, R.: Efficient action spotting based on a
spacetime oriented structure representation. In: CVPR (2010)
6. Klaser, A., Marszaek, M., Schmid, C.: A spatio-temporal descriptor based on 3D-gradients.
In: BMVC (2008)
7. Laptev, I., Lindeberg, T.: Space-time interest points. In: ICCV (2003)
8. Sizintsev, M., Wildes, R.: Spatiotemporal stereo via spatiotemporal quadric element (stequel)
matching. In: CVPR (2009)
9. Ke, Y., Sukthankar, R.: PCA-SIFT: A more distinctive representation for local image de-
scriptors. In: CVPR, Washington, USA, pp. 511517 (2004)
10. Bay, H., Ess, A., Tuytelaars, T., Van Gool, L.: SURF: Speeded up robust features. CVIU
(2008)
11. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape con-
texts. T. PAMI (2002)
12. Kokkinos, I., Yuille, A.: Scale invariance without scale selection. In: CVPR (2008)
13. Moreno-Noguer, F.: Deformation and illumination invariant feature point descriptor. In:
CVPR (2011)
14. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. T.
PAMI 23, 12221239 (2001)
15. Zhang, L., Curless, B., Seitz, S.M.: Spacetime stereo: Shape recovery for dynamic scenes.
In: CVPR (2003)
16. Davis, J., Ramamoothi, R., Rusinkiewicz, S.: Spacetime stereo: A unifying framework for
depth from triangulation. In: CVPR (2003)
17. Rodriguez, M., Ahmed, J., Shah, M.: Action MACH: A spatio-temporal maximum average
correlation height filter for action recognition. In: CVPR (2008)
18. Carceroni, R., Kutulakos, K.: Multi-view scene capture by surfel sampling: From video
streams to non-rigid 3D motion, shape reflectance. In: ICCV, pp. 6067 (2001)
19. Zhang, Y., Kambhamettu, C., Kambhamettu, R.: On 3D scene flow and structure estimation.
In: CVPR, pp. 778785 (2001)
20. Huguet, F., Devernay, F.: A variational method for scene flow estimation from stereo se-
quences. In: ICCV, Rio de Janeiro, Brasil (2007)
21. Wedel, A., Rabe, C., Vaudrey, T., Brox, T., Franke, U., Cremers, D.: Efficient Dense Scene
Flow from Sparse or Dense Stereo Data. In: Forsyth, D., Torr, P., Zisserman, A. (eds.)
ECCV 2008, Part I. LNCS, vol. 5302, pp. 739751. Springer, Heidelberg (2008)
22. Liu, C.: Beyond pixels: Exploring new representations and applications for motion analysis.
PhD Thesis, MIT (2009)
23. Fusiello, A., Trucco, E., Verri, A.: A compact algorithm for rectification of stereo pairs.
Machine Vision and Applications 12, 1622 (2000)
24. Moreno-Noguer, F., Porta, J.M., Fua, P.: Exploring Ambiguities for Monocular Non-Rigid
Shape Estimation. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part III.
LNCS, vol. 6313, pp. 370383. Springer, Heidelberg (2010)
25. Simo-Serra, E., Ramisa, A., Alenya, G., Torras, C., Moreno-Noguer, F.: Single image 3D
human pose estimation from noisy observations. In: CVPR (2012)
Elevation Angle from Reectance Monotonicity:
Photometric Stereo
for General Isotropic Reectances

Boxin Shi1 , Ping Tan2 , Yasuyuki Matsushita3 , and Katsushi Ikeuchi1


1
The University of Tokyo
2
National University of Singapore
3
Microsoft Research Asia

Abstract. This paper exploits the monotonicity of general isotropic re-


ectances for estimating elevation angles of surface normal given the az-
imuth angles. With an assumption that the reectance includes at least
one lobe that is a monotonic function of the angle between the surface
normal and half-vector (bisector of lighting and viewing directions), we
prove that elevation angles can be uniquely determined when the surface
is observed under varying directional lights densely and uniformly dis-
tributed over the hemisphere. We evaluate our method by experiments
using synthetic and real data to show its wide applicability, even when
the assumption does not strictly hold. By combining an existing method
for azimuth angle estimation, our method derives complete surface nor-
mal estimates for general isotropic reectances.

1 Introduction

Photometric stereo estimates a pixel-wise surface normal direction from a set


of images taken under varying illumination and a xed viewpoint [1]. However,
the estimation suers from inaccuracy when the surface has non-Lambertian re-
ectances. Although more exible parametric reectance models (e.g., the Ward
model [2]) can be integrated to describe the non-Lambertian phenomenon, there
exists a trade-o between the computational complexity and generality of the
materials that can be dealt with. In this paper, we exploit the reectance mono-
tonicity for estimating elevation angles of surface normal given the azimuth
angles (e.g., from the method in [3]) to fully determine the surface normal for a
broad class of reectances.
As described by Chandraker and Ramamoorthi [4], an isotropic Bidirectional
Reectance Distribution Function (BRDF) consists of a sum of lobes, and the
reectance of each lobe monotonically decreases as the surface normal deviates
from the lobes projection direction, along which the reectance function is con-
centrated. Following their work, we assume that a surface reectance contains
a single dominant lobe projected on the half-vector (the bisector of lighting and
viewing direction). Under this assumption, we perform a pixel-wise 1-D search
for the elevation angle in the range [0, /2], given an azimuth angle for each

A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 455468, 2012.

c Springer-Verlag Berlin Heidelberg 2012
456 B. Shi et al.

pixel. We prove that the reectance monotonicity is maintained only when the
correct elevation angle is found, if the scene is observed under dense directional
lights uniformly distributed over the whole hemisphere.
Based on the evaluations using MERL BRDF database [5] and the conclusion
in [4], we show that many materials such as acrylic, phenolic, metallic-paint, and
some shiny plastics can be well approximated by the 1-lobe half-vector BRDF
model. While some other materials, such as fabric, require two or more lobes for
precise representation, in practice, our algorithm still shows robustness against
the deviations and performs accurate estimation when a dominant lobe projected
on half-vector exists. We assess the applicability of our method to 2-lobe BRDFs
as well and verify the eectiveness of the proposed method on various real data.

2 Related Works
Conventional photometric stereo algorithm [1] assumes Lamberts reectance
and can recover surface normal directions from as few as three images. With ad-
ditional images, non-Lambertian phenomenon such as shadow can be handled [6].
With more images, various robust techniques can be applied to statistically
handle non-Lambertian outliers. There are approaches based on RANSAC [7],
median ltering [8] and rank minimization [9]. Spatial information has also
been used for robustly solving the problem by using expectation maximiza-
tion [10]. Some photometric stereo methods explicitly model surface reectances
with parametric BRDFs. Georghiades [11] use the Torrance-Sparrow model [12]
for specular tting, and Goldman et al . [13] build their methods on the Ward
model [2].
For surfaces with general reectance, various properties have been exploited
to estimate surface normal, such as radiance similarity [14] and attached shadow
codes [15]. Higo et al .s method [16] uses monotonicity and isotropy of general
diuse reectances. Alldrin and Kriegman [3] show that the azimuth angle of
normal can be reliably estimated for isotropic materials using reectance sym-
metry, if lights locate on a view-centered ring. Later, their method is extended
to solve for both shape and reectance by further restricting the BRDF to be
bivariate [17]. Based on the symmetry property, a comprehensive theory is devel-
oped in [18], and a surface reconstruction method using a special lighting rig is
introduced in [19]. By focusing on the low-frequency reectance, the biquadratic
model [20] can also be applied to solve the problem here.
We exploit reectance monotonicity to compute the per-pixel elevation angle
of surface normal given estimated azimuth angles using [3]. Unlike [17]s ap-
proach that involves complex optimization for iteratively estimating shape and
reectance, our method avoids the complex optimization by taking the advan-
tages of the monotonicity from the 1-lobe BRDFs.

3 Elevation Angle Estimation


Assume that we have already known the azimuth angles of surface normal. For
example, we can use the method in [3] to obtain the azimuth angles rst, while
Elevation Angle from Reectance Monotonicity 457

h n
l
h

l
o
l

Fig. 1. Coordinate system and key variables

our method is not limited to a particular azimuth angle estimation method.


Given the azimuth angles, our method performs 1-D search for determining the
elevation angles ranging from 0 to /2 for visible normals.

3.1 1-Lobe BRDF


We use bold letters to denote unit 3-D vectors. l and v represent the lighting
and viewing directions, respectively, and h is their unit bisector. For photomet-
ric stereo, we use the view-centered coordinate system with v = (0, 0, 1)T . As
illustrated in Fig. 1, we use spherical coordinates (, ) and (l , l ) to represent
surface normal n and lighting direction l respectively.
The semi-parametric model proposed in [4] suggests that BRDFs can be well
represented by the summation of several lobes in the form of (nT k), where
() is a monotonically increasing function, and k is referred as a projection
direction. We further assume a dominant lobe (weighting is larger than other
lobes) of (nT h) exists. We analyze the monotonicity of this lobe, and use this
property for determining elevation angles.

3.2 BRDF Prole


At a scene point, each hypothesized elevation angle  uniquely determines a
normal direction n , which in turn leads to multiple hypothesized BRDF values
by dividing the observed scene radiance with the foreshortening term nT l. For
simplicity, we refer to these BRDF values across varying nT h as a BRDF prole.
Given the observed scene radiance i at a pixel, the hypothesized BRDF prole
 is computed as
i nT l
 (nT h) = = (nT h) . (1)
nT l nT l
We prove that  is only monotonic w.r.t. nT h when n is the correct surface
normal direction (i.e.,  is the correct elevation angle ), except for some degen-
erate lighting congurations. Based on this, we nd the correct elevation angle
using the monotonicity of the BRDF prole  .
458 B. Shi et al.

3.3 Monotonicity of the BRDF Prole


Consider two dierent lighting directions l1 and l2 , and their associated half-
vectors h1 and h2 . Without loss of generality, we assume a surface normal with
the azimuth angle = 0 and elevation angle , i.e., n = (cos , 0, sin )T . The
hypothesized elevation angle  leads to n = (cos  , 0, sin  )T . For notation
simplicity, we dene x1 = nT h1 , x2 = nT h2 , x1 = nT h1 , x2 = nT h2 and y1 =
(x1 ), y2 = (x2 ), y1 =  (x1 ), y2 =  (x2 ).
If is a monotonically increasing function1 , the following condition holds: for
x1 < x2 , we have y1 < y2 . There are two cases when  becomes non-monotonic:
1. The ordering of x is swapped, but y is not swapped: x1 > x2 and y1 < y2 ;
2. The ordering of y is swapped, but x is not swapped: x1 < x2 and y1 > y2 .
In the following, we rst discuss the conditions for reordering of x and y respec-
tively (which we call x-swap and y-swap for short hereafter), and then analyze
under what condition an incorrect estimate of the elevation angle  will break
the monotonicity of  .
Note here we focus on the case of two observations under lighting directions
l1 and l2 . In a photometric stereo setting, we often have far more than two input
observations. The discussion applies to any pair of observations:  becomes non-
monotonic, if any observation pair breaks its monotonicity.
Necessary and Sucient Condition for x-swap. Suppose that the ordering
of x is changed by the hypothesized  , i.e., (x1 x2 )(x1 x2 ) = (nT h1
nT h2 )(nT h1 nT h2 ) < 0. From the denition of h and v = (0, 0, 1)T , we have
l+v (lx , ly , lz + 1)T (lx , ly , lz + 1)T
h= = C = . (2)
l + v lx2 + ly2 + (lz + 1)2 2 + 2lz

After some simple derivations, we have


1
nT h 1 = ((l1z + 1) sin + l1x cos ) . (3)
2 + 2l1z
nT h2 can be computed in a similar way, and their dierence becomes

nT h1 nT h2 = A sin + B cos = A2 + B 2 sin( + ), (4)

where A = l2+2l
1z +1
l2+2l
2z +1 l1x
, B = 2+2l 2+2l
l2x
A,
, and = arctan B
1z 2z 1z 2z
[/2, /2]. We obtain a similar equation for n by replacing in Eq. (4) with  .


Therefore the necessary and sucient condition to change the ordering of x


is (nT h1 nT h2 )(nT h1 nT h2 ) = sin( + ) sin( + ) < 0. In other words,
sin( + ) and sin( + ) should have dierent signs. This condition is true only
when < 0 and < <  (or,  < < ), since and  are both within
[0, /2], and is within [/2, /2]. Note that is completely determined by

1
We discuss only the monotonically increasing case in this paper. The similar analysis
can also be applied to monotonically decreasing case.
Elevation Angle from Reectance Monotonicity 459

the two lighting directions l1 and l2 , and independent of and  . Hence, a larger
dierence between and  gives higher possibility for an x-swap to happen.

Sucient Condition for y-swap. Next, we consider when the ordering of


y will be changed, i.e., we discuss the case where y1 < y2 and y1 > y2 . We
assume that l1 is close to l2 such that (nT h1 ) is close to (nT h2 ). This is
true when lighting is dense and neighboring lighting directions are sampled,
and lighting directions do not cause the highlight reection at n. Note that by
focusing on nearby lighting directions, we can only derive the sucient condition
for swapping the ordering of y, because there could be two parted lights causing
y-swap to happen. Under these assumptions and according to Eq. (1), y  is
mainly determined by nT l/nT l. For l1 , we obtain

nT l 1 l1x cos + l1z sin sin( + 1 ) l1x


= = , tan = . (5)
nT l1 l1x cos  + l1z sin  sin( + 1 ) l1z
sin(+1 ) sin(+2 )
From the relationship sin(  + ) >
1 sin(  +2 ) > 0, we can derive the sucient
condition for y-swap as

( )l12 y < 0. (6)

Here, l12y is y-component of the cross product of l1 and l2 . The derivation can
be found in the Appendix. Eq. (6) indicates that the ordering of y depends on
two factors: (1) the sign of ( ), i.e., the hypothesized  is larger or smaller
than the true value, and (2) the relative positions of the two lights l1 and l2 .

Sucient Condition for Unique Solution. For any hypothesized  , if it


swaps the ordering of y (or, x) while keeps the ordering of x (or, y) unchanged,
the function  will become non-monotonic. Hence, if we can ensure an incorrect
 will always break the monotonicity of  , we can nd the correct elevation
angle by choosing the  that makes  monotonic.
For a monotonically increasing , we have x1 = nT h1 < x2 = nT h2 , or equiv-
alently h1 > h2 . We begin with the case where lights are densely distributed
along the same longitude as the normal n, i.e., l = = 0. Since we require
h1 > h2 , h2 is closer to n than h1 . Further, as illustrated in Fig. 2, h1 and h2
are restricted on the red dotted line when the lighting directions l1 and l2 move
on the green dotted line. By observing the geometric relationship between h and
l, we obtain the following results:
1. As shown in Fig. 2(a), if /4, we always have l1 > l2 (l1 is closer to v
than l2 ). This ensures l12y > 0. Then from Eq. (6), y-swap always happens if
the hypothesized  becomes  < . In contrast, y-swap might not2 happen
when  > ;
2. As shown in Fig. 2(b), if > /4, the relative positions of l1 and l2 cannot
be determined from h1 > h2 . There are two cases: (1) If h1 and h2 are
2
Because we only have the sucient condition for y-swap.
460 B. Shi et al.

(a) v (b) v
n
h1 l12y h2
h2 h2 h1
l1 l2
h2 l2 l1
y y l12y
n
l2 l2
x x

Fig. 2. Lights, half-vectors and normal on the same longitude. (a) /4; (b) > /4.

values withl = 0
(a) (/2, /2) -/4 (b) 0

-0.2
max()

-0.4
l2

-0.6

(0, 0) -/2 -0.8


l1 0 0.5 l 1 1.5

Fig. 3. (a) values for l = 0 and l1 , l2 [0, /2]; (b) max() values for l (0, /2]

both farther from v than n, we should have l2 > l1 (l2 is closer to v than
l1 ). Hence, l12y < 0; (2) If h1 and h2 are both closer to v than n, we should
have l1 > l2 (l1 is closer to v than l2 , which is a similar case as Fig. 2(a)).
Hence, l12y > 0. When lights are densely distributed along the longitude,
we can always nd a pair of lights satisfying these two cases. As a result,
Eq. (6) can hold to cause y-swap no matter what  is.
Now we know y-swap might not happen when /4 and  > . In the next,
we analyze when x-swap happens. If both swaps do not happen, an incorrect
estimate of the elevation angle  will not break the monotonicity of  , and
we cannot tell whether that  is correct or not. Due to the complexity of the
analytic form of , we simulate and plot all values for all combinations of l1
and l2 on the same longitude as n in Fig. 3(a). It is interesting to note that
continuously changes in [/2, max()], where max() /4. In other words,
covers the whole range of [ max(), /2]. Recall that the necessary and
sucient condition for x-swap to happen is that  < < (or, < <  ).
Thus, if both  and are smaller than max(), x-swap will never happen.
Combining the conditions that both x-swap and y-swap do not happen, we
conclude  [, max()] is the degenerate interval for normals with l = 0,
where the monotonicity of  is not broken by any incorrect  .
Next, we consider lights along other longitudes, i.e. l (0, /2]. The same
analysis can be applied, and similarly, the monotonicity of  is not broken by
Elevation Angle from Reectance Monotonicity 461

any incorrect  within the degenerate interval [, max()l ]. Here, max()l


depends on the azimuth angle of lighting directions at the longitude l . We plot
it as a function of l for those longitudes in Fig. 3(b). We nd that max()
increases from about /4 to 0 when l approaches /2. Since monotonicity of
 is lost if any pair of lights on a longitude can break its monotonicity, the nal
degenerate interval is the intersection of all degenerate intervals [, max()l ]
from each longitude. This makes the nal intersection of the interval an empty
set, because max()l approaches 0. In other words, if we have lights on all
longitudes with azimuth angles l [0, /2], we can uniquely determine the
elevation angle for normals with = 0.
To recover the elevation angle of arbitrary surface normals, we will need lights
spanned all longitudes that cover the whole hemisphere. These lights form a
dense and uniform distribution over the hemisphere. In practice, we capture
images under random lighting directions that approximate the uniformity.

4 Solution Method

We perform a 1-D search for within [0, /2] at each pixel. For each hypothe-
sized value  , we assess the reectance monotonicity. Specically, we compute
the hypothesized BRDF values y  and evaluate its monotonicity w.r.t. to x .
To measure the reectance monotonicity, we calculate the derivatives (discrete
dierentiation) of y  and sum up the absolute values of negative derivatives,
denoted as ( ):

d y
( ) = max  , 0 , (7)

dx
x

The cost penalizes monotonically decreasing sequence, i.e., a larger value


indicates a less monotonically increasing y  . We show a typical cost function in
Fig. 4 for a normal with = /6 and = 0. Fig. 4(a) is the cost computed
where all lights locate on the same longitude as the normal. As we have proved
in Sec. 3.3, there is a degenerate interval for  [, max()]. Indeed, our
cost function is almost a constant value in [/6, /4], and we cannot tell which
value within this interval is the correct elevation angle. Fig. 4(b) shows the cost
computed with lights distributed on the whole hemisphere. It has a clear global
minimum to estimate the elevation angle.
The complete normal estimation algorithm is summarized as Algorithm 1.
There are a few implementation details: (1) We discard shadows by simple
thresholding. For scene radiance values normalized to 1, we use a threshold
i0 = 106 for synthetic data, and in real data the threshold is manually chosen
through cross validation (typically set as 0.02 in our experiments); (2) When cal-
culating y  , we set it as a big value (i = 1010 ), if nT l 0; (3) When evaluating
the monotonicity, we apply a monotonic mapping as y  y  ( is empirically
determined as 5 in all experiments) for a data normalization purpose.
462 B. Shi et al.

(a) 1 (b) 1

0.8 0.8

0.6 0.6

0.4 1 0.4 1

0.2 0.5
0.2 0.5
2
2
f ( x) 1 f ( x) 1
1  e x 1  e x
0 0
0 5 10 0 5 10
0 0
10 20 30 40 50 60 70 80 90 10 20 30 40 50 60 70 80 90
T' T'

Fig. 4. Cost function for  [0 , 90 ] (normalized by f () shown in the right bottom


for visualization purpose). (a) Using lights on the same longitude as normal; (b) Using
lights covering the whole hemisphere.

Algorithm 1. Photometric Stereo for General Isotropic Reectances


INPUT: Scene radiance values i, lighting directions l, threshold i0 , i .
Compute the azimuth angle of normal using the method in [3];
for each pixel do
for  [0, /2] do
Let n = (cos  cos , cos  sin , sin  )T ;
Calculate y  based on Eq. (1);
If any l causes y i0 (in shadow), set its corresponding y  = i0 ;
If any nT l 0, set its corresponding y  = i ;
Order y  = i/(nT l) w.r.t. x = nT h;
Evaluate the cost values using Eq. (7);
end for
= argmin ( );


n = (cos cos , cos sin , sin )T ;


end for
OUTPUT: Normals for all pixels.

5 Experiments
We evaluate the accuracy of elevation angle estimation using all the 100 materials
in the MERL BRDF database [5], also with varying number of lights. Synthetic
experiments using 2-lobe BRDFs are performed to assess the robustness of our
method against the deviation from the 1-lobe BRDF assumption. Finally, we
show our normal estimates on real data.

5.1 Synthetic Data


Performance on Measured BRDFs. We sample 1620 normal directions from
the visible hemisphere by uniformly choosing 36 longitudes and 45 altitudes. We
generate their observations under 337 lighting directions uniformly sampled on
Elevation Angle from Reectance Monotonicity 463

4 0.07 0.12 0.12 0.4 0.8

A 0.06 0.1
0.1 0.3 0.6
0.05 0.08

U(n h)

U(n h)

U(n h)

U(n h)

U(n h)
3.5 0.08 0.2 0.4

T
0.04 0.06
Elevation angle error (deg.)

0.06 0.1 0.2


0.03 0.04

3 0.02
0 0.5
T
1
0.02
0 0.5
T
1
0.04
0 0.5
T
1
0
0 0.5
T
1
0
0 0.5
T
1
n h n h n h n h n h

2.5
B

1.5 C A B C D E
1

0.5
D E
0

green-metallic-paint2
light-red-paint
red-plastic
pink-fabric2
white-fabric2

delrin

teflon

chrome-steel

specular-green-phenolic

gold-metallic-paint2
gold-metallic-paint3
green-plastic
pink-fabric

chrome

silicon-nitrade

fruitwood-241
silver-metallic-paint2

tungsten-carbide
green-latex

white-fabric

specular-white-phenolic

gold-metallic-paint
black-phenolic

blue-metallic-paint2
color-changing-paint3

silver-metallic-paint

specular-maroon-phenolic

aventurnine
violet-acrylic
pickled-oak-260

white-marble
alumina-oxide
polyurethane-foam
beige-fabric
polyethylene
white-paint
pink-plastic

nylon

pure-rubber
pink-jasper

grease-covered-steel
dark-red-paint
aluminium
black-oxidized-steel

color-changing-paint1
green-metallic-paint
red-metallic-paint
alum-bronze
color-changing-paint2

two-layer-silver

pvc

gray-plastic
red-fabric2
maroon-plastic

specular-orange-phenolic
yellow-plastic
neoprene-rubber
pink-felt

silver-paint
orange-paint
pearl-paint

violet-rubber
green-fabric
blue-rubber

gold-paint
nickel
special-walnut-224
black-fabric
black-soft-plastic
colonial-maple-223

blue-metallic-paint
black-obsidian
dark-blue-paint
specular-black-phenolic
steel

light-brown-fabric
brass
green-acrylic

ipswich-pine-221
natural-209
two-layer-gold

red-specular-plastic

specular-red-phenolic
white-acrylic
white-diffuse-bball

yellow-paint

specular-yellow-phenolic
red-fabric
blue-acrylic

dark-specular-fabric

specular-blue-phenolic

yellow-phenolic

ss440

blue-fabric
specular-violet-phenolic
cherry-235

hematite

red-phenolic

purple-paint
yellow-matte-plastic
Material names

Fig. 5. Elevation angle errors (degree) on 100 materials (ranked by errors in a descend-
ing order). Rendered spheres of some representative materials and their BRDF values
in the -nT h space are shown near the curve. The spheres in the -nT h plots show
the 1-D projection RMS errors using all the directions on the hemisphere (dark blue
means small, and red means large errors).

the hemisphere. For the sampling, we use the vertices dened by an icosahedron
with a tessellation order of three. The average elevation angle errors of the 1620
samples for all materials in the MERL BRDF database are summarized in Fig. 5.
The average error on all the 100 materials is 0.77 from the ground truth.
We observe that the error is relatively large for materials that cannot be well
represented by a monotonic 1-lobe BRDF with half-vector as the projection di-
rection, for example the material A and B in Fig. 5 (see the -nT h plot above
the rendered spheres). For those materials, their optimal 1-lobe BRDF repre-
sentations have a much dierent projection direction from h (see the spheres
of RMS error distribution in the -nT h plots). Therefore, if we force them to
project on h and apply our method, the results have relatively larger errors.

Performance with Varying Number of Lights. Next, we evaluate the sys-


tem performance variation with dierent numbers of lighting directions (input
images). We perform the same experiment using 25, 50, 100 and 200 random
lighting directions. The errors and light distributions are shown in Fig. 6. Em-
pirically, about 100 random lights provide average elevation angle error of around
1 . Thus in the following test, we x the number of lights to 100.

Performance on 2-lobe BRDFs. To demonstrate the robustness of our


method, we evaluate our method on synthetic data using 2-lobe BRDFs in Fig. 7.
The weighting of the two lobes k1 and k2 are varied from 0 to 1. Fig. 7(a) shows
464 B. Shi et al.

4.5
Elevation angle error (deg.)
4 3.98
3.5

2.5 25 50
2 2.14

1.5 1.37

1 0.88 0.77
0.5

0
25 50 100 200 337 100 200 337
Number of lights Light distributions

Fig. 6. Average elevation angle errors (degree) on 100 materials varying with the num-
ber of lighting directions. Next to the curve, the blue dots on spheres show the light
distribution in each case.

(a) (1, 1) (b) (1, 1) (c) (1, 1)


k1

k1

k1

(0.1, 0.1) (0.1, 0.1) (0.1, 0.1)


k2 k2 k2

Fig. 7. Elevation angle errors (degree) on 2-lobe BRDFs. (a) The Cook-Torrance
model, where k1 is the Lambertian and k2 is the specular strength; (b) k1 1 (nT v) +
k2 2 (nT (v + 2l)); (c) k1 1 (nT h) + k2 2 (nT k), where k is a random direction.

our result on the Cook-Torrance model [21] (roughness m = 0.5) with a Lamber-
tian diuse lobe and a specular lobe. Note the specular lobe of the Cook-Torrance
model is not centered at h. However, our method always gives accurate result
for dierent relative strength of the two lobes. The error in estimated elevation
angles is the largest (about 1 ) when BRDF is completely dominated by the
specular lobe.
As discussed in [4], some fabric materials contain lobes with projection direc-
tion v or (v + 2l). Hence, we deliberately create such a BRDF as k1 1 (nT v) +
k2 2 (nT (v + 2l)) and evaluate our method on this BRDF. Note both lobes are
not centered at h. Here, 1 (x) = 2 (x) = x. In Fig. 7(b), we plot the elevation
angle errors for dierent values of k1 , k2 . The errors are smaller than 3 for most
of the combinations.
At last, we create a 2-lobe BRDF k1 1 (nT h) + k2 2 (nT k) by combining a
lobe centered at h and another one with a random direction k. k is a randomly
sampled direction on the sphere for each pixel under each lighting direction. The
elevation angle errors are shown in Fig. 7(c). We can observe that for more than
half of all cases, our method generates errors smaller than 5 . Our method can
work reasonably well with errors smaller than 3 when the relative strength of
the random lobe is below 0.3.
Elevation Angle from Reectance Monotonicity 465

(a) (b) (c) (d) (e)

Fig. 8. Results on Gourd1 (102) (The number in the bracket indicates the number of
images in the dataset.) [17]. (a) One input image; (b), (c) Our normal and Lambertian
shading; (d), (e) Normal and shading from Lambertian photometric stereo. Note that
(b) and (d) have an average angular dierence of about 12 .

(a) (b) (c) (d) (e)

Fig. 9. Results on Apple (112) [17]. (a) One input image; (b), (c) Our normal and
reconstructed surface; (d), (e) Normal and surface shown in paper [17]. Note here we
use the color mapping of ((nx + 1)/2, (ny + 1)/2, nz ) (R, G, B) for a comparison
purpose. The shapes look dierent partially due to we do not know exactly the same
reconstruction method and rendering parameters used in [17].

5.2 Real Data

We show the estimated normals by our method on real data. We visualize the
estimated normals by linearly encoding the x, y and z components of normals
in RGB channels of an image (except for the Apple data in Fig. 9 for a com-
parison purpose). First, in Fig. 8, our method is compared with the Lambertian
photometric stereo [1] by showing the estimated normals and the Lambertian
shadings calculated from the estimated normals with the same lighting direc-
tion as the input image. For such non-Lambertian materials, our method shows
much more reasonable normal estimates. Next, we compare with the method
in [17] by showing the surface reconstructions (according to the method in [22])
from the estimated normals in Fig. 9(b). Due to the lack of the ground truth,
we cannot make a quantitative comparison, but qualitatively, our method shows
similar results as [17]. In terms of the complexity, our method is much simpler
and computationally inexpensive for deriving the elevation angle.
Finally, we show our estimated normals on other materials with various re-
ectances, such as plastic, metal, paint, etc., as shown in Fig. 10. The datasets
on the left side are from [17] and [23]; right part of the datasets are captured by
ourselves with a Sony XCD-X710CR camera. By comparing the input images
and the Lambertian shadings, we claim our estimated normals are of high ac-
curacy. Some noisy points observed on the results are mainly caused by pixels
with too dark intensities and hence low signal-to-noise ratio.
466 B. Shi et al.

Fig. 10. Real data results. Left part: Gourd2 (98) [17], Helmet side right (119) (We
use only 119 out of all the 253 images in the original dataset.) [23], Kneeling Knight
(119) [23]; Right part: Pear (77), God (57), Dinosaur (118). We show one input
image, the estimated normals and the Lambertian shadings for each dataset.

6 Conclusions
We show a method for estimating elevation angles of surface normal by exploit-
ing reectance monotonicity given the azimuth angles. We assume the BRDF
contains a dominant monotonic lobe projected on the half-vector, and prove that
the elevation angle can be uniquely determined under dense lights uniformly dis-
tributed on the hemisphere. In synthetic experiments, we rst demonstrate the
accuracy of our method on a broad category of materials. We further evaluate its
robustness to deviations from our assumption about BRDFs. Various real-data
experiments also show the eectiveness of the proposed method.

Limitations. Our method assumes known azimuth angles. Joint estimation


of both azimuth and elevation angles makes the problem prohibitively dicult
due to its highly non-linear nature of the problem. When the assumed azimuth
angles are not accurate, the elevation angle estimation will be deteriorated ac-
cordingly. For example, in the experiment of Fig. 5, the average elevation angle
errors over 100 materials increase to {1.16, 2.24, 3.32, 4.72} degrees with additive
azimuth angle errors normally distributed with mean 0 and standard deviation
{0.5, 1, 1.5, 2} degrees, respectively. To make the proposed method robust against
inaccuracy of azimuth angles is an interesting direction. Besides, while we have
discussed only the sucient condition, deriving a compact lighting conguration
that uniquely determines the elevation angles is our future work.
Elevation Angle from Reectance Monotonicity 467

Acknowledgement. The authors would like to thank Neil Alldrin for providing
the code of [3] and Manmohan Chandraker for helpful discussions. Boxin Shi
and Katsushi Ikeuchi was partially supported by the Digital Museum Project,
Ministry of Education, Culture, Sports, Science & Technology in Japan. Ping
Tan was supported by the Singapore ASTAR grant R-263-000-592-305 and MOE
grant R-263-000-555-112.

References
1. Woodham, R.: Photometric method for determining surface orientation from mul-
tiple images. Optical Engineering 19, 139144 (1980)
2. Ward, G.: Measuring and modeling anisotropic reection. Computer Graphics 26,
265272 (1992)
3. Alldrin, N., Kriegman, D.: Toward reconstructing surfaces with arbitrary isotropic
reectance: A stratied photometric stereo approach. In: Proc. of International
Conference on Computer Vision, ICCV (2007)
4. Chandraker, M., Ramamoorthi, R.: What an image reveals about material re-
ectance. In: Proc. of International Conference on Computer Vision, ICCV (2011)
5. Matusik, W., Pster, H., Brand, M., McMillan, L.: A data-driven reectance model.
In: ACM Trans. on Graph (Proc. of SIGGRAPH) (2003)
6. Barsky, S., Petrou, M.: The 4-source photometric stereo technique for three-
dimensional surfaces in the presence of highlights and shadows. IEEE Trans. Pat-
tern Anal. Mach. Intell. 25, 12391252 (2003)
7. Mukaigawa, Y., Ishii, Y., Shakunaga, T.: Analysis of photometric factors based on
photometric linearization. Journal of the Optical Society of America 24, 33263334
(2007)
8. Miyazaki, D., Hara, K., Ikeuchi, K.: Median photometric stereo as applied to
the segonko tumulus and museum objects. International Journal of Computer Vi-
sion 86, 229242 (2010)
9. Wu, L., Ganesh, A., Shi, B., Matsushita, Y., Wang, Y., Ma, Y.: Robust Photometric
Stereo via Low-Rank Matrix Completion and Recovery. In: Kimmel, R., Klette, R.,
Sugimoto, A. (eds.) ACCV 2010, Part III. LNCS, vol. 6494, pp. 703717. Springer,
Heidelberg (2011)
10. Wu, T., Tang, C.: Photometric stereo via expectation maximization. IEEE Trans.
Pattern Anal. Mach. Intell. 32, 546560 (2010)
11. Georghiades, A.: Incorporating the Torrance and Sparrow model of reectance in
uncalibrated photometric stereo. In: Proc. of International Conference on Com-
puter Vision, ICCV (2003)
12. Torrance, K., Sparrow, E.: Theory for o-specular reection from roughened sur-
faces. Journal of the Optical Society of America 57, 11051114 (1967)
13. Goldman, D., Curless, B., Hertzmann, A., Seitz, S.: Shape and spatially-varying
BRDFs from photometric stereo. IEEE Trans. Pattern Anal. Mach. Intell. 32,
10601071 (2010)
14. Sato, I., Okabe, T., Yu, Q., Sato, Y.: Shape reconstruction based on similarity in
radiance changes under varying illumination. In: Proc. of International Conference
on Computer Vision, ICCV (2007)
15. Okabe, T., Sato, I., Sato, Y.: Attached shadow coding: estimating surface nor-
mals from shadows under unknown reectance and lighting conditions. In: Proc.
of International Conference on Computer Vision, ICCV (2009)
468 B. Shi et al.

16. Higo, T., Matsushita, Y., Ikeuchi, K.: Consensus photometric stereo. In: Proc. of
IEEE Conference on Computer Vision and Pattern Recognition, CVPR (2010)
17. Alldrin, N., Zickler, T., Kriegman, D.: Photometric stereo with non-parametric and
spatially-varying reectance. In: Proc. of IEEE Conference on Computer Vision and
Pattern Recognition, CVPR (2008)
18. Tan, P., Quan, L., Zickler, T.: The geometry of reectance symmetries. IEEE Trans.
Pattern Anal. Mach. Intell. 33, 25062520 (2011)
19. Chandraker, M., Bai, J., Ramamoorthi, R.: A theory of photometric reconstruction
for unknown isotropic reectances. In: Proc. of IEEE Conference on Computer
Vision and Pattern Recognition, CVPR (2011)
20. Shi, B., Tan, P., Matsushita, Y., Ikeuchi, K.: A biquadratic reectance model for
radiometric image analysis. In: Proc. of IEEE Conference on Computer Vision and
Pattern Recognition, CVPR (2012)
21. Cook, R., Torrance, K.: A reectance model for computer graphics. ACM Trans.
on Graph. 1, 724 (1982)
22. Kovesi, P.: Shapelets correlated with surface normals produce surfaces. In: Proc.
of International Conference on Computer Vision, ICCV (2005)
23. Einarsson, P., et al.: Relighting human locomotion with owed reectance elds.
In: Proc. of Eurographics Symposium on Rendering (2006)

Appendix: Proof of Eq. (6)


Proof. If nT l > 0 and nT l > 0, from Eq. (5), the following holds:

sin( + 1 ) sin( + 2 )
> > 0.
sin( + 1 ) sin( + 2 )

Therefore, we have

sin( + 1 ) sin( + 2 ) sin( + 1 ) sin( + 2 ) > 0.

By applying the product-to-sum trigonometric identities, it is simplied as

cos(  + 1 2 ) cos( + 1 2 ) > 0.

This further becomes as following by applying the sum-to-produce identities:

sin( ) sin(1 2 ) > 0.

Since ,  [0, /2], we can use ( ) to replace sin( ) with inequality


retained. From the denition of tan , we have the following relation

sin(1 2 ) = sin(1 ) cos(2 ) cos(1 ) sin(2 )


= 2l1x 2 2l2z 2 2l1z 2 2l2z 2 .
l1x +l1z l2x +l2z l1x +l1z l2x +l2z

This indicates the sign of sin(1 2 ) is the same as l1x l2z l1z l2x = l2 l1 =
l12y . Therefore, the inequality of Eq. (6) holds. 
Local Log-Euclidean Covariance Matrix (L2 ECM)
for Image Representation and Its Applications

Peihua Li and Qilong Wang

Heilongjiang University, School of CS, School of EE, China


peihualj@hotmail.com

Abstract. This paper presents Local Log-Euclidean Covariance Matrix


(L2 ECM) to represent neighboring image properties by capturing cor-
relation of various image cues. Our work is inspired by the structure
tensor which computes the second-order moment of image gradients for
representing local image properties, and the Diusion Tensor Imaging
which produces tensor-valued image characterizing the local tissue struc-
ture. Our approach begins with extraction of raw features consisting of
multiple image cues. For each pixel we compute a covariance matrix in
its neighboring region, producing a tensor-valued image. The covariance
matrices are symmetric and positive-denite (SPD) which forms a Rie-
mannian manifold. In the Log-Euclidean framework, the SPD matrices
form a Lie group equipped with Euclidean space structure, which en-
ables common Euclidean operations in the logarithm domain. Hence,
we compute the covariance matrix logarithm, obtaining the pixel-wise
symmetric matrix. After half-vectorization we obtain the vector-valued
L2 ECM image, which can be exibly handled with Euclidean operations
while preserving the geometric structure of SPD matrices. The L2 ECM
features can be used in diverse image or vision tasks. We demonstrate
some applications of its statistical modeling by simple second-order cen-
tral moment and achieve promising performance.

1 Introduction
Characterizing local image properties is of great research interest in recent years
[1]. The local descriptors can be either used sparsely for describing region of
interest (ROI) extracted by the region detectors [2], or be used at dense grid
points for image representation [3,4]. They play a fundamental role for success
of many middle-level or high-level vision tasks, e.g., image segmentation, scene
or texture classication, image or video retrieval and person recognition. It is
challenging to present feature descriptors which are highly distinctive and robust
to photometric or geometrical transformations.
The motivation of this paper is to present a general kind of image descriptors
for representing local image properties by fusing multiple cues. We are inspired
by the structure tensor method [5,6] and Diusion Tensor Imaging (DTI) [7],
both concerning tensor-valued (matrix-valued) images and enjoying important
applications in image or medical image processing. The former computes second-
order moment of image gradients at every pixel, while the latter associates to

A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 469482, 2012.

c Springer-Verlag Berlin Heidelberg 2012
470 P. Li and Q. Wang

each voxel a 3-D symmetric matrix describing the molecular mobility along three
directions and correlations between these directions. Our methodology consists
in the pixel-wise covariance matrix for representing the local image correlation of
multiple image cues. This leads to the model of Local Log-Euclidean Covariance
matrix, L2 ECM. It can fuse multiple image cues, e.g., intensity, lower- or higher-
order gradients, spatial coordinates, texture, etc. In addition, it is insensitive
to scale and rotation changes as well as illumination variation. Comparison of
structure tensor, DTI and L2 ECM is presented in Table 1. Our model can be
seen as a generalization of the structure tensor method; it can also be interpreted
as a kind of imaging method, in which each pixel point is associated with
logarithm of a covariance matrix, describing the local image properties through
the correlation of multiple image cues.

Table 1. Comparision of Tensor-valued (Matrix-valued) images

Structure Tensor DTI L2 ECM



C11 C1n vlogC1 vlogCmn+1
% & D11 D12 D13
Ix2 Ix Iy . . . .
D12 D22 D23 C=
.. . . vlog .. .
Ix Iy Iy2 . . . .
D13 D23 D33
Cn1 Cnn vlogCm
2nd -order moment
33 symmetric matrix Logarithm of nn (n=25) covariance matrix C
of partial deriva-
describing molecules of raw features followed by half-vectorization (m =
tives of image I
diusion (n2 + n)/2 due to symmetry)
w.r.t x, y.

Fig. 1(a) shows the outline of the modeling methodology of L2 ECM. Given
an image I(x, y), we extract raw features for all pixels, obtaining a feature map
f (x, y) which may contain intensity, lower- or higher-order derivatives, spatial
coordinates, texture, etc. Then for every pixel point, one covariance matrix is
computed using the raw features in the neighboring region around this pixel. We
thus obtain a tensor-valued (matrix-valued) image C(x, y). By computing the
covariance matrix logarithm followed by half-vectorization, we obtain the vector-
valued L2 ECM image which can be handled with Euclidean operations. For
illustration purpose, 3-dimensional (3-D) raw features are used which comprising
intensity and the absolute values of partial derivatives of image I w.r.t x, y; the
resulting L2 ECM features are 5-D and the corresponding slices are shown at the
bottom-right in Fig. 1(a). Refer to section 3 for details on L2 ECM features.
The covariance matrices are symmetric and positive-denite (SPD), the spaces
of which is not a Euclidean space but a smooth Riemannian manifold. In the
Log-Euclidean framework [8], the SPD matrices form a commutative Lie group
which is equipped with a Euclidean structure. This framework enables us to com-
pute the logarithms of SPD matrices, which can then be exibly and eciently
handled with common Euclidean operations. Our technique avoids complicated
and computationally expensive operations such as the geodesic distance, intrin-
sic mean computation or statistical modeling directly in Riemannian manifold.
Instead, we can conveniently operate in the Euclidean space.
Local Log-Euclidean Covariance Matrix (L2 ECM) 471

| I y (x, y)|
w | I x (x, y)|
I(x, y) C(1,1)

C(1, w) v log C(1,1) vlogC(1,w)
C(2,1) C(2, w) v log C(2,1) vlogC(2,w)

h

C(k,1) C(k, w)
v log C(k,1) vlogC(k,w)

C(h,1) C(h, w) v logC(h,1) vlogC(h,w)

I ( x, y ) f ( x, y ) C( x, y ) vlogC( x, y )

| I y (x, y)|
w | I x (x, y)|
I(x, y)
c11 c12 c13

h c21 c22 c23
c c33
31 c32

I ( x, y ) f ( x, y )
vlogC1 vlogC2 vlogC4

vlogC3 vlogC5

vlogC6

Fig. 1. Overview of L2 ECM (3-D raw features are used for illustration). (a) shows the
modeling methodology of L2 ECM. Given an image I(x, y), the raw feature image f (x, y)
is rst extracted; then the tensor-valued image C(x, y) is obtained by computing the
covariance matrix for every pixel; after the logarithm of C(x, y), the symmetric matrix
log C(x, y) is vectorized to get the 6-D vector-valued image denoted by vlogC(x, y),
slices of which are shown at the bottom-right. Refer to Section 3 for details on L2 ECM.
(b) shows the modeling methodology of Tuzel et al. [9]only one global covariance
matrix is computed for the overall image of interest.

The covariance matrix as a region descriptor was rst proposed by Tuzel et al.
[9], the modeling methodology of which was illustrated in Fig. 1(b). The main
dierence is their global modeling vs our local one: the former computes one
covariance matrix for the overall image region of interest; while the latter com-
putes a covariance matrix for every pixel point. Above all, Tuzel et al. employed
ane-Riemannian metric [10] which is known to have high computational cost
in computations of geodesic distance and statistics, e.g., the intrinsic mean, the
second-order moment, mixture models or Principal Component Analysis (PCA),
in the Riemannian space [11]. This may make the modeling of pixel-wise covari-
ance matrices prohibitively inecient. By contrast, we employ Log-Euclidean
metric [8] which transforms the operations from the Riemannian space into Eu-
clidean one so that the above-mentioned limitations are overcome.
The remainder of this paper is organized as follows. Section 2 reviews the
papers related to our work. Section 3 describes in detail the L2 ECM features
472 P. Li and Q. Wang

and the underlying theory. Applications of statistics modeling of L2 ECM by the


second-order central moment are presented in section 4. The conclusion is given
in section 5.

2 Related Work

The natural choice for region representation is to vectorize the intensities or


gradients [12] of pixels in the region. The most commonly used techniques are
modeling image gradient distributions through histograms. The Scale Invariant
Feature Transform (SIFT) descriptor [13] uniformly partitions an image region
into small cells and in each one an orientation histogram is computed, which
results in a 3D histogram of image gradient position and angles. The Gradient
Location and Orientation Histogram (GLOH) [1] is an extension of the SIFT
which quantizes the patch into log-polar cells instead of cartesian ones. Similar
techniques are also used in Histogram of Orientation Gradient (HOG) that are
particularly suitable for human detection [14]. SURF [15] and DAISY [16] de-
scriptors retain the strengths of SIFT and GLOH and can be computed quickly at
every pixel point. Other approaches may be based on spatial-frequency analysis,
image moments, etc. For an overview of local descriptors and their performance
evaluation, one may refer to [1]
Based on the global covariance descriptor, Tuzel et. al presented approaches
for human detection, texture classication or object tracking [9,17,18]. The un-
derlying theory of their approaches is the computation of intrinsic distance and
mean of covariance matrices by the ane-invariant Riemannian metric. In [19],
a novel online subspace learning method is presented for adapting to appear-
ance changes during tracking. They employed Log-Euclidean metric rather than
ane-invariant Riemannian one so that the mean and PCA can be computed
eciently in the Euclidean space. By augmenting the lower triangular matrix of
Cholesky factorization of the covariance matrix and the mean vector, the Shape
of Gaussian (SOG) descriptor is introduced and its Lie group structure is ana-
lyzed [20]. Nakayama et al. [21] used both covariance matrix and the mean vector
via Gaussian distribution to express statistics of dense features such as SIFT [13]
or SURF [15]; they presented some kernel metrics based on the theory of infor-
mation geometry for scene categorization. A novel method was proposed in [22]
for sparse decomposition of covariance matrices. The above papers employed the
covariance matrix as a global descriptor in the sense that one covariance matrix
is computed for representing the overall object region.
The structure tensor has a long history [5] and enjoys successful applications
ranging from image segmentation, motion estimation to corner detection. Be-
cause DTI is able to provide at the microscopic level the information on tissue
structure, it has been applied in many studies on neuroscience, neurodevelop-
ment, neurodegenerative studies etc. Review of the two techniques are beyond
the scope of this paper. One may refer to [6] for recent advances and development
on structure tensor and DTI.
Local Log-Euclidean Covariance Matrix (L2 ECM) 473

3 Local Log-Euclidean Covariance Matrix (L2 ECM)


The space of SPD matrices is not a vector space but a Riemannian manifold
(an open convex half-cone). Hence, the conventional Euclidean operations, e.g.,
the Euclidean distance, mean or the statistics do not apply. Two class of Rie-
mannian framework have been presented for dealing with SPD matrices: the
ane-invariant Riemannian framework [10,23] and the Log-Euclidean Rieman-
nian framework [8]. The latter has almost the same good theoretical properties
as the former, and in the meantime enjoys a drastic reduction in computational
cost. In the following we rst introduce briey the Log-Euclidean Framework
(refer to [8] for details) and then present the proposed L2 ECM features.

3.1 Log-Euclidean Framework on SPD Matrices


Matrix Exponential and Logarithm. The matrix exponential and logarithm
are fundamental to the Log-Euclidean framework. Let SPD(n) and S(n) de-
note the space of n n real SPD matrices and n n real symmetric matrices,
respectively. Any matrix S S(n) has the eigen-decomposition of the form
S = UUT , where U is an orthonormal matrix and = Diag(1 , , n ) is
a diagonal matrix composed of the eigenvalues i of S. Furthermore, if S is
positive-denite, i.e., S SPD(n), then i > 0 for i = 1, . . . , n. The expo-
nential map, exp: S(n)  SPD(n), is bijective, i.e., one to one and onto. By
eigen-decomposition, the exponential of a S S(n) can be computed as
exp(S) = U Diag(exp(1 ), . . . , exp(n )) UT (1)
For any SPD matrix S SPD(n), it has a unique logarithm log(S) in S(n):
log(S) = U Diag(log(1 ), . . . , log(n )) UT (2)
The exponential map, exp: S(n)  SPD(n), is dieomorphism, i.e., the bijective
map exp and its inverse log are both dierentiable.
Lie Group Structure on SPD(n). For any two matrices S1 , S2 SP D(n),
dene the logarithmic multiplication S1  S2 as
S1  S2  exp(log(S1 ) + log(S2 )) (3)
It can be shown that SPD(n) with the associated logarithmic multiplication
 satises the following axioms: (1) SPD(n) is closed, i.e., S1 SPD(n) and
S2 SPD(n) implies S1  S2 SPD(n). (2)  is associative: (S1  S2 )  S3 =
S1  (S2  S3 ). (3) For the n n identity matrix I, S  I = I  S = S. (4) For
matrix S and its inverse S1 , S  S1 = S1  S = I. (5) It is commutative, that
is, S1  S2 = S2  S1 . In viewing of the above, SPD(n) is a commutative Lie
group. Actually, (SPD(n), , I) is isomorphic to its Lie algebra spd(n) = S(n).
Vector Space Structure on SPD(n). The commutative Lie group SPD(n)
admits a bi-invariant Riemannian metric and the distance between two matrices
S1 , S2 is  
d(S1 , S2 ) = log(S1 ) log(S2 ) (4)
474 P. Li and Q. Wang

where || || is the Euclidean norm in the vector space S(n). This bi-invariant
metric is called Log-Euclidean metrics, which is invariant under similarity trans-
formation. For a real number , dene the logarithmic scalar multiplication
between and a SPD matrix S

 S  exp( log(S)) = S (5)

Provided with logarithmic multiplication (3) and logarithmic scalar multiplica-


tion (5), the SPD(n) is equipped with a vector space structure.
The desirable property of such a vector space structure of SPD(n) is that, by
matrix logarithm, the Rimannian manifold of SPD matrices is mapped to the
Euclidean space. As such, in the logarithmic domain, the SPD matrices can be
handled with simple Euclidean operations and, if necessary, the results can be
mapped back to the Riemannian space via the matrix exponential. This greatly
facilitates the statistical analysis of SPD matrices, for example, the geometric
mean of N SPD matrices S1 , , SN is the algebraic mean (the zeroth-order
moment) of their logarithms  mapped back  to SPD(n), which has a closed form
N
ELE (S1 , , SN ) = exp N i=1 log(Si ) .
1

3.2 L2 ECM Feature Image


We describe the feature descriptor of L2 ECM with the gray-level image. Given
an image of interest, I(x, y), (x, y) , we can extract the raw feature vector
f (x, y) which, for example, has the following form:
( )T
f (x, y) = I(x, y) |Ix (x, y)| |Iy (x, y)| |Ixx (x, y)| |Iyy (x, y)| (6)

where | | denotes the absolute value, Ix (resp. Ixx ) and Iy (resp. Iyy ) denote the
rst (resp. second)-order partial derivative with respect to x and y, respectively.
Other image cues, e.g., spatial coordinates, gradient orientation or texture fea-
ture can also be included in the raw features. For a color image, the gray-levels
of three channels may be combined as well.
Provided with the raw feature vectors, we can obtain a tensor-valued image
by computing the covariance matrix function C(x, y) at every pixel
1 
C(x, y) = (f (x , y  ) f (x, y))(f (x , y  ) f (x, y))T (7)
N Sr 1
(x ,y  )S r (x,y)

1 
f (x, y) = f (x , y  )
N Sr
(x ,y  )S r (x,y)

where Sr (x, y) = {(x , y  )||x x | r/2, |x x | r/2, (x, y) } and Nsr


denotes the number of points inside Sr . In practice, there may exist covariance
matrices which are not positive-denite and which we should take care in compu-
tations. From our experience with computation of a huge number of covariance
matrices, involved in human detection and texture classication in real images,
this violation is a rare event. As r increases, local image structure at larger scales is captured. A small r may help capture fine local structure, but it may also make the covariance matrix singular due to an insufficient number of samples for estimation. In our experiments r ≥ 16 is appropriate for most applications. The covariance matrix is robust to noise and insensitive to changes of scale, rotation and illumination [9].
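As a concrete illustration of Eqs. (6)–(7), the sketch below computes the raw feature image and per-pixel covariance matrices over r×r neighborhoods. It assumes NumPy/SciPy; the helper names and the brute-force sliding window are our own simplification, not the paper's integral-image implementation.

```python
import numpy as np
from scipy.ndimage import sobel

def raw_features(img):
    """Stack [I, |Ix|, |Iy|, |Ixx|, |Iyy|] per pixel, cf. Eq. (6)."""
    Ix, Iy = sobel(img, axis=1), sobel(img, axis=0)
    Ixx, Iyy = sobel(Ix, axis=1), sobel(Iy, axis=0)
    return np.stack([img, np.abs(Ix), np.abs(Iy), np.abs(Ixx), np.abs(Iyy)], axis=-1)

def local_covariances(feat, r=16, step=8):
    """Covariance of the raw features in each r x r window, cf. Eq. (7)."""
    H, W, d = feat.shape
    covs = {}
    for y in range(0, H - r, step):
        for x in range(0, W - r, step):
            patch = feat[y:y + r, x:x + r].reshape(-1, d)
            covs[(y, x)] = np.cov(patch, rowvar=False)   # unbiased, 1/(N-1)
    return covs

if __name__ == "__main__":
    img = np.random.rand(64, 64)
    F = raw_features(img)
    C = local_covariances(F, r=16, step=16)
    print(len(C), next(iter(C.values())).shape)          # number of windows, (5, 5)
```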
We wish to exploit these covariance matrices as fundamental features for vision applications. It is known that the Affine-Riemannian framework involves intensive computations of matrix square roots, matrix inverses, matrix exponentials and logarithms. Hence, we utilize the competent Log-Euclidean framework: C(x, y) in the commutative Lie group SPD(n) is mapped by the matrix logarithm to log C(x, y) in the vector space S(n). log C(x, y) can then be handled with Euclidean operations, and the intensive computations involved in the Affine-Riemannian framework are avoided. This also greatly facilitates further analysis or modeling of the SPD matrices.
From the tensor-valued image C(x, y) ∈ SPD(n), we compute the logarithm of the covariance matrix C(x, y) according to Eq. (2). log C(x, y) is a symmetric matrix, i.e., log C(x, y) ∈ S(n). Because of its symmetry, we perform half-vectorization of log C(x, y), denoted by vlogC(x, y), i.e., we pack the upper triangular part of log C(x, y) into a vector in column order. The final L2 ECM feature descriptor can thus be represented as

    vlogC(x, y) = ( vlogC_1(x, y)  vlogC_2(x, y)  . . .  vlogC_{n(n+1)/2}(x, y) )        (8)
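To make this mapping concrete, here is a small sketch, assuming NumPy, of the per-pixel operation behind Eq. (8): matrix logarithm of a local covariance followed by half-vectorization of its upper triangle. The helper names and the small regularization term are ours, not the paper's.

```python
import numpy as np

def spd_log(S):
    """Eigen-decomposition based matrix logarithm, as in Eq. (2)."""
    lam, U = np.linalg.eigh(S)
    return U @ np.diag(np.log(lam)) @ U.T

def half_vectorize(X):
    """Pack the upper triangle of a symmetric n x n matrix into a vector of
    length n(n+1)/2; any fixed ordering works as long as it is used consistently."""
    return X[np.triu_indices(X.shape[0])]

def l2ecm_descriptor(C, eps=1e-6):
    """L2ECM feature of one covariance matrix: vech(log(C + eps*I))."""
    n = C.shape[0]
    return half_vectorize(spd_log(C + eps * np.eye(n)))  # eps guards near-singular C
```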
The covariance matrices can be computed efficiently via integral images [9]. The computational complexity of constructing the integral images is O(|Ω| n²), where |Ω| is the image size. The complexity of the eigen-decompositions of all covariance matrices is O(|Ω| n³). In practice, n = 25 may suffice for most problems. In particular, when n = 2 or n = 3, the eigen-decomposition of the covariance matrices can be obtained analytically; hence, the L2 ECM features, which are 3-D (n = 2) or 6-D (n = 3), can be computed quickly through integral images and closed-form eigen-decomposition.
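For the n = 2 case mentioned above, the eigen-decomposition, and hence the matrix logarithm, indeed has a simple closed form. The sketch below, assuming NumPy, is our own worked example of that closed form; it is not taken from the paper.

```python
import numpy as np

def log_spd_2x2(S):
    """Closed-form matrix logarithm of a 2x2 SPD matrix [[a, b], [b, c]]."""
    a, b, c = S[0, 0], S[0, 1], S[1, 1]
    half_sum, half_dif = 0.5 * (a + c), 0.5 * (a - c)
    r = np.hypot(half_dif, b)                 # sqrt(((a-c)/2)^2 + b^2)
    lam1, lam2 = half_sum + r, half_sum - r   # eigenvalues, lam1 >= lam2
    theta = 0.5 * np.arctan2(2.0 * b, a - c)  # rotation angle diagonalizing S
    ct, st = np.cos(theta), np.sin(theta)
    U = np.array([[ct, -st], [st, ct]])       # columns = eigenvectors of lam1, lam2
    return U @ np.diag(np.log([lam1, lam2])) @ U.T
```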
The L2 ECM may be used in a number of ways:
– It may be seen as an imaging technology by which various novel multi-channel images are produced. When n = 2, by combining varying raw features, e.g., Ix and Iy, Ixx and Iyy, or I and √(Ix² + Iy²), we obtain different 3-D color images that may be suitable for a wide variety of image or vision tasks. Considering its desirable properties, it is particularly interesting to use the 3-D L2 ECM images instead of the original ones for object tracking [24,25]. For larger n, we obtain compact L2 ECM features which can be densely sampled and packed as region descriptors.
– It makes a wide variety of statistical techniques applicable to covariance matrices. For example, it is straightforward to model L2 ECM features by probabilistic mixture models, e.g., the Gaussian mixture model (GMM), principal component analysis, etc. This way, the geometric structure of covariance matrices is preserved while computationally expensive algorithms in Riemannian space [26,23] are avoided.
– We can describe region statistics simply by the first-order (mean) or second-order moment (covariance matrix) of L2 ECM features. Note that these actually represent the lower-order statistics of SPD matrices in the logarithm domain.
In the following, we demonstrate applications of statistical modeling of L2 ECM
features by the second-order moment (covariance matrix).

4 Applications of L2 ECM Features


In this section, we demonstrate applications of L2 ECM features to human detection, texture classification and object tracking.

4.1 Human Detection


For performance evaluation, we exploit the INRIA person dataset [14], a challenging benchmark containing large changes in pose and appearance, partial occlusions, illumination variation and cluttered backgrounds. It includes 2416 positive, normalized images and 1218 person-free images for training, together with 288 images of humans and 453 person-free images for testing.
The normalized images are of 96×160 pixels and the centered 64×128-pixel window is used; this way, boundary effects are eliminated. For a normalized image, we first compute the L2 ECM feature image (r = 16). Then we divide the vector-valued image into 12 overlapping 32×32 blocks with a stride of 16 pixels. We compute for each block the second-order moment (covariance matrix), which is again subject to matrix logarithm and half-vectorization. The resulting feature for the whole normalized image is a 1440-dimensional vector. We exploit a linear SVM [27] with default parameters for classification. Our training process is similar to that of Dalal and Triggs [14].
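A rough sketch of this block-wise descriptor is given below, assuming NumPy: covariance pooling of the L2 ECM feature image over overlapping blocks, followed by matrix logarithm and half-vectorization of each block, then concatenation. The block grid here is generic; the exact layout that yields the paper's 12 blocks (and hence 1440 dimensions) is not fully specified in the text, so the names and geometry are our own.

```python
import numpy as np

def spd_log(S):
    lam, U = np.linalg.eigh(S)
    return U @ np.diag(np.log(lam)) @ U.T

def block_descriptor(l2ecm, block=32, stride=16, eps=1e-6):
    """Concatenate vech(log(cov)) over blocks of an H x W x d feature image."""
    H, W, d = l2ecm.shape
    idx = np.triu_indices(d)
    parts = []
    for y in range(0, H - block + 1, stride):
        for x in range(0, W - block + 1, stride):
            patch = l2ecm[y:y + block, x:x + block].reshape(-1, d)
            C = np.cov(patch, rowvar=False) + eps * np.eye(d)
            parts.append(spd_log(C)[idx])        # 15-D features -> 120 dims per block
    return np.concatenate(parts)

if __name__ == "__main__":
    feat = np.random.rand(128, 64, 15)           # stand-in for an L2ECM feature image
    print(block_descriptor(feat).shape)          # 21 blocks x 120 dims with this grid
```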
Fig. 2 shows the Detection Error Tradeoff (DET) curves on a log-log scale for the proposed method and for the methods of [14] and [17]. The DET curves of the other methods are reproduced from their respective papers. It is clear that our method is superior to the methods using the linear or kernel SVM proposed in [14]. Our method is also better than the method of classification on Riemannian manifolds [17] when FFPW ≥ 7×10⁻⁴, whereas the method of [17] outperforms ours when FFPW < 7×10⁻⁴. At 10⁻⁴ FFPW, our method has the lowest miss rate of 5.7% while [17] has the second lowest at 6.8%.

4.2 Texture Classification


In this section, we apply the L2 ECM features to texture classification. The Brodatz database and the KTH-TIPS database [28] are used for performance evaluation.
Brodatz Dataset. The Brodatz dataset contains 111 textures (texture D14 is missing); each texture is represented by one 640×640 image. Note that though
[Figure: DET curves (miss rate vs. false positives per window (FFPW), log-log scale) for HOG Lin. SVM, HOG Ker. SVM, Tuzel et al., and Ours.]
Fig. 2. DET curves of human detection in the INRIA person dataset

the dataset does not integrate scale, viewpoint or illumination changes, it includes non-homogeneous textures which pose difficulties for classification.
For each texture, the corresponding 640×640 image is divided into 49 overlapping blocks, each of 160×160 pixels with a block stride of 80. Twenty-four images of each class are randomly selected for training and the remaining ones for testing. For each image, we first compute the L2 ECM feature image; it is then divided into four 80×80 patches, the covariance matrices of which are computed; the resulting four covariance matrices are subject to the logarithmic operation and half-vectorization. At test time, each covariance matrix of the testing image is compared to all covariance matrices of the training set and its label is determined by the KNN algorithm (k = 5). The majority vote of the four matrices associated with the testing image determines its classification. To avoid bias, the experiments are repeated twenty times and the result is reported as the average value and standard deviation over the twenty runs.
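A compact sketch of this patch-level classification protocol, assuming scikit-learn and pre-computed half-vectorized log-covariance descriptors, is shown below; the function and variable names are ours.

```python
import numpy as np
from collections import Counter
from sklearn.neighbors import KNeighborsClassifier

def classify_image(patch_vecs, train_vecs, train_labels, k=5):
    """patch_vecs: (4, D) vech(log C) descriptors of one test image."""
    knn = KNeighborsClassifier(n_neighbors=k).fit(train_vecs, train_labels)
    votes = knn.predict(patch_vecs)              # one label per patch
    return Counter(votes).most_common(1)[0][0]   # majority vote over the patches
```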
Fig. 3(a) shows the classification accuracy of our method and of the following six texture classification approaches: Harris detector + Laplacian detector + SIFT descriptor + SPIN descriptor (HLSS) [29], Lazebnik's method [30], VZ-joint [31], Jalba's method [32], Hayman's method [28], and Tuzel et al.'s method [9]. The numbers for the other methods are taken from their respective papers. Our method obtains the best classification accuracy; the random covariances method of Tuzel et al. obtains the second best accuracy but does not report a standard deviation. These two are far better than the remaining ones.
KTH-TIPS Database [28]. This dataset is challenging in the sense that it contains varying scale, illumination and pose. There are 10 texture classes, each of which is represented by 81 image samples. The samples are of size 200×200 pixels.
The classification method is similar to that used for the Brodatz database. For each sample, the L2 ECM feature image is computed (r = 32). Every feature image is uniformly divided into four blocks and a global covariance matrix is computed
[Figure: classification accuracy (%) vs. number of training images, comparing GG mean+std, Lazebnik, VZ-joint, Hayman, (HS+LS)(SIFT+SPIN), and Ours.]

Fig. 3. Texture classification in the Brodatz (a) and KTH-TIPS (b) databases

on each block. The logarithms of these covariance matrices are subject to half-vectorization.
Fig. 3(b) shows the classification accuracy against the number of training images. For comparison, the classification results of the following methods are also shown [29]: Lazebnik's method [30], VZ-joint [31], Hayman's method [28], global Gabor filters (GG) [33], and Harris detector + Laplacian detector + SIFT descriptor + SPIN descriptor ((HS+LS)(SIFT+SPIN)) [29]. It can be seen that the classification accuracy of our method is much better than all the others.

4.3 Object Tracking

Based on the covariance matrix as a global region descriptor, we compare our method based on L2 ECM (L2 ECM tracker for short) with that of Tuzel et al. [18] (Tuzel tracker for short). Fig. 4 illustrates the different covariance matrix computations in the two methods. In the L2 ECM tracker, 15×15 covariance matrices are computed from the L2 ECM image (r = 16, 5-D raw features (6) and 15-D L2 ECM features); in the Tuzel tracker, 9×9 covariance matrices are computed from the raw feature image, in which every 9-D feature vector comprises the x- and y-coordinates, intensity, the first- and second-order partial derivatives w.r.t. x and y, and the gradient magnitude and orientation. Both trackers are initialized by hand in the first frame, and three image sequences are used for comparison. The average tracking time of the L2 ECM tracker is approximately one half that of the Tuzel tracker. The most time-consuming procedure of the Tuzel tracker is the model update [18, Sec. 2.4], which involves mean computation via the affine-invariant Riemannian metric, known to be iterative and very expensive [10,11].
Car Sequence. This sequence (190 frames, size: 384×288) concerns a surveillance scenario and is available at http://www.cvg.cs.rdg.ac.uk/PETS2001/pets2001-dataset.html. The black car running towards the left is the object of interest.
Fig. 4. Covariance matrix computation in the L2 ECM (left) and Tuzel (right) trackers

In this sequence there is pose variation of the object, and there are cars nearby in the background with appearances similar to the object. Both trackers succeed in following the object throughout the sequence. As seen in Table 2, the L2 ECM tracker has smaller tracking errors than the Tuzel tracker. Some sample tracking results are shown in Fig. 5 (top panel).

Face Sequence. In the face sequence (1480 frames, size: 320×240) the object undergoes large illumination changes [34]. We track the object every four frames (so there are 370 frames in total) because the object motion between consecutive frames is very small. Though both trackers can follow the object across the sequence, the L2 ECM tracker has a much smaller tracking error than the Tuzel tracker. The average error is given in Table 2. Fig. 5 (middle panel) shows some sample tracking results.

Mall Sequence. This sequence (190 frames, size: 360×288) is filmed by a moving camera in a mall [35]. The object (an adult along with a child) undergoes large non-rigid pose variations and illumination changes, and partial occlusions also occur in the sequence. The Tuzel tracker loses the object around frame #116 despite its model update scheme, while the L2 ECM tracker successfully follows the object across the sequence. It can be seen from Table 2 that the L2 ECM tracker's error is much smaller than the Tuzel tracker's. Some sample results are shown in Fig. 5 (bottom panel).

Table 2. Comparison of average tracking errors (mean ± std) and number of successful frames vs. total frames

Image seq.   Method   Dist. err. (pixels)   Succ. frames
Car seq.     Tuzel    10.76 ± 5.72          190/190
             L2 ECM    7.55 ± 3.38          190/190
Face seq.    Tuzel    20.44 ± 13.87         370/370
             L2 ECM    4.45 ± 3.87          370/370
Mall seq.    Tuzel    30.92 ± 16.58         116/190
             L2 ECM   17.67 ± 10.45         190/190
Fig. 5. Tracking results in the car sequence (top panel), face sequence (middle panel) and mall sequence (bottom panel). In each panel, the results of the Tuzel tracker and the L2 ECM tracker are shown in the first and second rows, respectively.

5 Conclusion

The L2 ECM is analogous to the structure tensor and the diffusion tensor, and can be seen as an imaging technology that produces novel multi-channel images capturing the local correlation of multiple cues in the original image. We compute from the original image a tensor-valued image in which each point is associated with a locally computed covariance matrix. Through matrix logarithm and half-vectorization we obtain the vector-valued L2 ECM feature image. The theoretical foundation of L2 ECM is the Log-Euclidean framework, which endows the commutative Lie group formed by the SPD matrices with a linear space structure. This enables common Euclidean operations on covariance matrices in the logarithmic domain while preserving their geometric structure.
The paper demonstrates applications of simple statistical modeling of L2 ECM by the second-order moment to human detection, texture classification and object tracking. We achieve performance comparable with or superior to the state-of-the-art methods. In future work, we are interested in statistical modeling of L2 ECM by mixture models and in studying its applications in diverse image and vision tasks.
Acknowledgements. The work was supported by the National Natural Science
Foundation of China (60973080, 61170149), Program for New Century Excellent
Talents in University (NCET-10-0151), Key Project by Chinese Ministry of Edu-
cation (210063), and High-level professionals (innovative teams) of Heilongjiang
University (Hdtd2010-07). We thank Qi Sun for assistance in the experiment on
covariance tracking.

References
1. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. IEEE Trans. Pattern Anal. Mach. Intell. 27(10), 1615–1630 (2005)
2. Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffalitzky, F., Kadir, T., Van Gool, L.: A comparison of affine region detectors. Int. J. Comput. Vision 65(1/2), 43–72 (2005)
3. Liu, C., Yuen, J., Torralba, A.: SIFT flow: Dense correspondence across scenes and its applications. IEEE Trans. Pattern Anal. Mach. Intell. 33(5), 978–994 (2011)
4. Bosch, A., Zisserman, A., Muñoz, X.: Scene Classification via pLSA. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954, pp. 517–530. Springer, Heidelberg (2006)
5. Knutsson, H.: Representing local structure using tensors. In: Proc. 6th Scandinavian Conf. on Image Analysis, pp. 244–251 (1989)
6. Weickert, J., Hagen, H. (eds.): Visualization and Image Processing of Tensor Fields. Springer, Heidelberg (2006)
7. Bihan, D.L., Mangin, J., Poupon, C., Clark, C., Pappata, S., Molko, N.: Diffusion tensor imaging: Concepts and applications. J. Magn. Reson. Imaging 66, 534–546 (2001)
8. Arsigny, V., Fillard, P., Pennec, X., Ayache, N.: Geometric means in a novel vector space structure on symmetric positive-definite matrices. SIAM J. Matrix Anal. Appl. (2006)
9. Tuzel, O., Porikli, F., Meer, P.: Region Covariance: A Fast Descriptor for Detection and Classification. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 589–600. Springer, Heidelberg (2006)
10. Pennec, X., Fillard, P., Ayache, N.: A Riemannian framework for tensor computing. Int. J. Comput. Vision, pp. 41–66 (2006)
11. Arsigny, V., Fillard, P., Pennec, X., Ayache, N.: Fast and Simple Calculus on Tensors in the Log-Euclidean Framework. In: Duncan, J.S., Gerig, G. (eds.) MICCAI 2005. LNCS, vol. 3749, pp. 115–122. Springer, Heidelberg (2005)
12. Yan, K., Sukthankar, R.: PCA-SIFT: A more distinctive representation for local image descriptors. In: Proc. Int. Conf. Comp. Vis. Patt. Recog., pp. 506–513 (2004)
13. Lowe, D.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision 60, 91–110 (2004)
14. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: Proc. Int. Conf. Comp. Vis. Patt. Recog., pp. 886–893 (2005)
15. Bay, H., Ess, A., Tuytelaars, T., Van Gool, L.: Speeded-up robust features (SURF). Comput. Vis. Image Underst. 110, 346–359 (2008)
16. Tola, E., Lepetit, V., Fua, P.: DAISY: An efficient dense descriptor applied to wide-baseline stereo. IEEE Trans. Pattern Anal. Mach. Intell. 32(5), 815–830 (2010)
17. Tuzel, O., Porikli, F., Meer, P.: Human detection via classification on Riemannian manifolds. In: Proc. Int. Conf. Comp. Vis. Patt. Recog., pp. 1–8 (2007)
18. Porikli, F., Tuzel, O., Meer, P.: Covariance tracking using model update based on Lie algebra. In: Proc. Int. Conf. Comp. Vis. Patt. Recog., pp. 728–735 (2006)
19. Li, X., Hu, W., Zhang, Z., Zhang, X., Zhu, M., Cheng, J.: Visual tracking via incremental Log-Euclidean Riemannian subspace learning. In: Proc. Int. Conf. Comp. Vis. Patt. Recog., pp. 1–8 (2008)
20. Gong, L., Wang, T., Liu, F.: Shape of Gaussians as feature descriptors. In: Proc. Int. Conf. Comp. Vis. Patt. Recog., pp. 2366–2371 (2009)
21. Nakayama, H., Harada, T., Kuniyoshi, Y.: Global Gaussian approach for scene categorization using information geometry. In: Proc. Int. Conf. Comp. Vis. Patt. Recog., pp. 2336–2343 (2010)
22. Sivalingam, R., Boley, D., Morellas, V., Papanikolopoulos, N.: Tensor Sparse Coding for Region Covariances. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part IV. LNCS, vol. 6314, pp. 722–735. Springer, Heidelberg (2010)
23. Fletcher, P.T., Joshi, S.: Principal Geodesic Analysis on Symmetric Spaces: Statistics of Diffusion Tensors. In: Sonka, M., Kakadiaris, I.A., Kybic, J. (eds.) CVAMIA/MMBIA 2004. LNCS, vol. 3117, pp. 87–98. Springer, Heidelberg (2004)
24. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based object tracking. IEEE Trans. Pattern Anal. Mach. Intell. 25(5), 564–575 (2003)
25. Wang, S., Lu, H., Yang, F., Yang, M.H.: Superpixel tracking. In: Int. Conf. on Comp. Vis., pp. 1323–1330 (2011)
26. de Luis García, R., Westin, C.F., Alberola-López, C.: Gaussian mixtures on tensor fields for segmentation: Applications to medical imaging. Comp. Med. Imag. and Graph. 35(1), 16–30 (2011)
27. Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 27:1–27:27 (2011)
28. Hayman, E., Caputo, B., Fritz, M., Eklundh, J.-O.: On the Significance of Real-World Conditions for Material Classification. In: Pajdla, T., Matas, J. (eds.) ECCV 2004, Part IV. LNCS, vol. 3024, pp. 253–266. Springer, Heidelberg (2004)
29. Zhang, J., Marszalek, M., Lazebnik, S., Schmid, C.: Local features and kernels for classification of texture and object categories: A comprehensive study. Int. J. Comput. Vision 73, 213–238 (2007)
30. Lazebnik, S., Schmid, C., Ponce, J.: A sparse texture representation using local affine regions. IEEE Trans. Pattern Anal. Mach. Intell. 27, 1265–1278 (2005)
31. Varma, M., Zisserman, A.: Texture classification: Are filter banks necessary? In: Proc. Int. Conf. Comp. Vis. Patt. Recog. (2003)
32. Jalba, A., Roerdink, J., Wilkinson, M.: Morphological hat-transform scale spaces and their use in pattern classification. Pattern Recognition 37, 901–915 (2004)
33. Manjunath, B., Ma, W.: Texture features for browsing and retrieval of image data. IEEE Trans. Pattern Anal. Mach. Intell., 837–842 (1996)
34. Ross, D., Lim, J., Yang, M.-H.: Adaptive Probabilistic Visual Tracking with Incremental Subspace Update. In: Pajdla, T., Matas, J. (eds.) ECCV 2004, Part II. LNCS, vol. 3022, pp. 470–482. Springer, Heidelberg (2004)
35. Leichter, I., Lindenbaum, M., Rivlin, E.: Tracking by affine kernel transformations using color and boundary cues. IEEE Trans. Pattern Anal. Mach. Intell. 31(1), 164–171 (2009)
Ensemble Partitioning
for Unsupervised Image Categorization

Dengxin Dai, Mukta Prasad, Christian Leistner, and Luc Van Gool

Computer Vision Lab., ETH Zurich

Abstract. While the quality of object recognition systems can strongly benefit from more data, human annotation and labeling can hardly keep pace. This motivates the use of autonomous and unsupervised learning methods. In this paper, we present a simple, yet effective method for unsupervised image categorization, which relies on discriminative learners. Since automatically obtaining error-free labeled training data for the learners is infeasible, we propose the concept of the weak training (WT) set. WT sets have various deficiencies, but still carry useful information. Training on a single WT set cannot result in good performance, so we design a random walk sampling scheme to create a series of diverse WT sets. This naturally allows our categorization learning to leverage ensemble learning techniques. In particular, for each WT set, we train a max-margin classifier to further partition the whole dataset to be categorized. By doing so, each WT set leads to a base partitioning of the dataset, and all the base partitionings are combined into an ensemble proximity matrix. The final categorization is completed by feeding this proximity matrix into a spectral clustering algorithm. Experiments on a variety of challenging datasets show that our method outperforms competing methods by a considerable margin.

1 Introduction
Image categorization is an intensely studied field with many applications. In the last decade, advances in representations and supervised learning methods have led to great progress [1,2,3,4]. However, the explosion of available visual data, the cost of annotation, and the user-specific bias of annotations have resulted in an increased focus on learning with less supervision. While powerful methods for semi-supervised [5] or weakly-supervised learning [6,7] have been proposed, unsupervised methods have been studied less.
In this paper, we are interested in automatically discovering image categories from an unordered image collection. This task is hard as objects and scenes can appear with clutter, large variations in pose and lighting, high intra-class variance and low inter-class variance. The recent success of discriminative classifiers in supervised image categorization [1,2,4] suggests that one possible way to make progress in unsupervised image categorization is to construct a similar scenario, in which the power of discriminative classifiers can be leveraged. Since obtaining a pure training set automatically is infeasible, in this paper, we

A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 483–496, 2012.
© Springer-Verlag Berlin Heidelberg 2012
Fig. 1. The pipeline of our method for an image collection comprised of four categories: face, car, panda and duck. The pseudolabels in the weak training (WT) sets are assigned by our random walk sampling scheme. The image collection is partitioned by the classifiers learnt from the WT sets and the results are stored in binary proximity matrices. All of the binary matrices are then averaged into the ensemble proximity matrix (EnProx), which serves as input for spectral methods. As expected, the partitionings induced by the WT sets contain errors, but their errors differ, which allows the underlying truth to be discovered by ensemble learning techniques.

propose the concept of WT sets (cf. the definition in §2) that can be exploited with supervised learning techniques. Since the knowledge carried by a single WT set is limited, we develop a very simple, yet effective random walk (RW) sampling method to create a set of diverse WT sets. The RW sampling is performed in the space of pairwise similarities built from common distance measures. Samples that are hard to cluster using common distance measures are left out in the sampling step so that the sampled WT sets are not too noisy. In order to learn the knowledge contained in these WT sets for unsupervised categorization, we propose the ensemble partitioning framework based on the ensemble learning principle [8].
The remainder of this paper is structured as follows: §2 reports on related work and introduces our method. In §3 we make observations that inspired the approach described in §4. We show improvement over the state of the art through evaluations in §5. §6 concludes the paper.

2 Unsupervised Image Categorization

2.1 Related Work

Several unsupervised image categorization techniques have been proposed. Fergus et al. [9] modeled objects as constellations of visual parts and estimated parameters using the expectation-maximization algorithm for unsupervised recognition. Sivic et al. [10] proposed using aspect models to discover
object categories from an unordered image collection. Aspect models were originally developed for content analysis in textual document collections, but they perform very well for images as well [10]. Later on, [11] used Hierarchical Latent Dirichlet Allocation to automatically discover object class hierarchies. For scene category discovery, Dai et al. [12] proposed to combine information projection and cluster sampling. All these methods share assumptions about the sample distribution. Image categories, nevertheless, are arranged in complex and widely diverging shapes, making the design of explicit models difficult. An alternative strand, which is more versatile in handling structured data, builds on similarity-based methods. Frey and Dueck [13] applied the affinity propagation algorithm [14] to unsupervised image categorization. Grauman and Darrell [15] developed partially matching image features to compute image similarity and used spectral methods for image clustering. The main difficulty of this strand is how to measure image similarity as the semantic level goes up. For a more detailed survey, we refer readers to [16].
The method closest to ours is that of Gomes et al. [17]. They also propose a framework for learning discriminative classifiers while clustering the data, called regularized information maximization. Yet, the method still differs significantly from ours. In particular, [17] finds a partitioning of images that results in a good discriminative classifier, which is evaluated by three criteria: class separation, class balance and classifier complexity. Our method, however, more closely follows the training-test paradigm; that is, sample a series of weak training sets, learn knowledge from these sets and classify the whole dataset with this learned knowledge. Our method also shares similarities with the method of Lee and Grauman [18]. They proposed to use curriculum learning (CL) [19] for unsupervised object discovery. CL imitates the learning process of people; that is, using a pre-defined learning schedule, it starts by learning the easiest concepts first and gradually increases the complexity of the concepts [19]. Besides technical differences, the main benefit of our work compared to [19] is that we do not rely on pre-selected learning schedules.
As already stated, our method builds on the ensemble learning principle [8]. Ensemble methods build a committee of (usually weak) base learners that overall performs better than its individual parts. Thereby, they benefit from the strength and diversity of their base learners. Popular ensemble methods that have been successfully applied in computer vision are boosting [20], bagging [21] and random forests [22]. There has been previous work using ensemble methods for clustering, e.g. [23,24]. The main difference of our work from [23,24] is that we focus on high-level categories. [23,24] make basic assumptions about the underlying distributions, which are hard to obtain for our task and, hence, make previous methods only applicable to rather low-level clustering tasks. In contrast, we propose to learn category distinctions from sampled training data. We also differ from previous approaches in the way we combine our base learners. In particular, [24] used a boosting framework and [23] used hierarchical clustering, while our method is based on spectral techniques.
2.2 Notation and Overview


In this section, we first give the definition of the weak training set and then give an overview of our method.
Definition 1. Given a categorization task E, a weak training set is a labeled dataset B which has severe deficiencies as a training set, e.g. some categories may be absent, some may be impure, and/or some may be biased, but training on it can lead to a weak classifier φ(·) which performs better than random guessing on the task E.
Our dataset D consists of N images, each belonging to one of C categories. The task is to discover these categories. Each image is represented by a d-dimensional feature vector f_n, n ∈ {1 . . . N}. Assume that the ensemble partitioning learns knowledge from T WT sets in total. For the t-th trial, the WT set D_t = {s_t, c_t} is a set of sampled images s_t and their pseudolabels c_t, which imitate the role of labeled training data in supervised frameworks. The unordered image collection D is partitioned according to the discriminative classifier φ_t(·) learnt from the WT set D_t. Classification in each trial t creates a unique partitioning of the data D. Since pseudolabels cannot be mapped into correspondence across different trials, the classification agreement between image pairs in each trial t is stored as a binary proximity matrix A^t, where a^t_{ij} = 1 implies that images i, j ∈ {1 . . . N} belong to the same class. The final proximity matrix A is averaged over all T binary matrices: A = (1/T) Σ_t A^t. The final categorization results are obtained by feeding A to a spectral clustering algorithm. The whole pipeline of our method is sketched in Fig. 1. Though simple, our method outperforms competing methods by a considerable margin on various challenging datasets.
The main contribution of this paper is a novel approach to unsupervised image categorization. In particular, we (i) propose the concept of WT sets for image categorization, (ii) design a random-walk based sampling scheme to generate them, and (iii) propose the ensemble partitioning framework to learn the category distinctions for unsupervised image categorization.

3 Observations
In this section, we share some insights with the reader that help motivate our approach and explain why it works better. We raise two questions: First, how much does the impurity of the training data in WT sets influence the convergence of the learning? Second, given a standard distance metric, how does one approach the problem of getting good WT sets in an unsupervised manner?

Observation 1: Ensemble learning can learn the essence of categories from a series of diverse WT sets. We examined this idea in supervised image categorization. Given ground-truth data divided into training and test sets, D = {D_train, D_test}, (i) we artificially synthesized a set of WT sets D_t^train, t = 1, . . . , T from the training data D_train, and (ii) ensemble learning was then performed on these sets and its performance on test data classification was measured.
Fig. 2. Classification accuracies of ensemble learning on the 15-Scene dataset [1], for varying training label noise R and number of WT sets T. High precision can be learnt for a noise percentage as high as 80% given enough WT sets (best viewed in color).

In order to guarantee the diversity of the training sets (for ensemble learning), each WT set D_t^train is formed by randomly taking 30% of the images from D_train, and randomly re-assigning the labels of a fixed percentage R of these. Hence, R = 0 corresponds to the upper performance bound, as every sample is assigned its true label. A classifier is trained for each of these WT sets. At test time, each of these classifiers returns a category label for each image in D_test. The winning label is the mode of the results returned by all the classifiers. Fig. 2 evaluates this for the 15-Scene dataset [1]. Linear SVMs are used as the classifiers with a concatenation of GIST [25], PHOG [2] and LBP [26] as input. When the label noise percentage R is low, the classification precision starts out high and levels off quickly with T, as one would expect. But interestingly, for R even as high as 80%, the classification precision, which starts low, converges to a similarly high precision given more WT sets T (around 500). This shows that it is possible to learn the essence of image categories even from very weak (extremely mislabelled) training sets, given a sufficient number of them with diverse deficiencies. This is crucial for our task, because WT sets are much easier to obtain than pure training data in unsupervised settings.
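A toy version of this experiment can be written in a few lines. The sketch below, assuming scikit-learn, integer class labels in {0, . . . , n_classes−1} and pre-computed feature matrices, trains T linear SVMs on noisy bootstrapped subsets and classifies test images by majority vote; the sizes and names are placeholders, not the authors' exact protocol.

```python
import numpy as np
from sklearn.svm import LinearSVC

def noisy_ensemble_predict(Xtr, ytr, Xte, T=200, frac=0.3, R=0.8, n_classes=15, seed=0):
    rng = np.random.default_rng(seed)
    votes = np.zeros((Xte.shape[0], n_classes), dtype=int)
    for _ in range(T):
        idx = rng.choice(len(ytr), size=int(frac * len(ytr)), replace=False)
        y_noisy = ytr[idx].copy()
        flip = rng.random(len(idx)) < R                    # re-assign a fraction R of labels
        y_noisy[flip] = rng.integers(0, n_classes, flip.sum())
        clf = LinearSVC().fit(Xtr[idx], y_noisy)
        pred = clf.predict(Xte)
        votes[np.arange(len(pred)), pred] += 1
    return votes.argmax(axis=1)                            # mode over the T classifiers
```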

Observation 2: Given a standard distance metric, how does one approach the problem of getting good (precise and diverse) WT sets without supervision? An ideal image representation along with a powerful metric should ensure that all images of the same category are more similar to each other than to those of other categories. In order to examine this, we tabulate how often an image is from the same category as its k-th nearest neighbor. We refer to this frequency as the label co-occurrence probability p(k). p(k) is averaged across images and labels in the dataset. Various features and distance metrics were tested: GIST [25] with the Euclidean distance, PHOG [2] and LBP [26] with the χ² distance.
Fig. 3 shows the results on the 15-Scene dataset [1]. The results reveal that using the distance metric in conventional ways (e.g. clustering by K-means or spectral methods) will result in very noisy training sets, because the label co-occurrence probability p(k) drops very quickly with k. However, sampling in the very close neighborhood of a given image is likely to generate more instances of the same category, whereas sampling in very far-away regions of the space can gather samples of new categories. This suggests that samples along with a few very close neighbors,
Fig. 3. The label co-occurrence probability p(k) calculated on the 15-Scene dataset (4485 images in total) using GIST [25] with the Euclidean distance, and PHOG [2] and LBP [26] with the χ² distance (best viewed in color).

namely compact image clusters, can form a training set for a single class, and a set of such image clusters that are scattered far away from each other in feature space can serve as a good WT set for our task. Furthermore, sampling in this way provides the chance of creating a large number of diverse WT sets, due to the small size of each sampled WT set. This will be further examined in §4.1.
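For reference, the quantity p(k) can be computed directly from features and labels; the sketch below, assuming scikit-learn and NumPy arrays X (features) and y (labels), is our own illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def label_cooccurrence(X, y, k_max=50, metric="euclidean"):
    """p(k): fraction of images whose k-th nearest neighbor shares their label."""
    nn = NearestNeighbors(n_neighbors=k_max + 1, metric=metric).fit(X)
    _, idx = nn.kneighbors(X)                  # idx[:, 0] is the image itself
    same = (y[idx[:, 1:]] == y[:, None])       # (N, k_max) boolean matrix
    return same.mean(axis=0)                   # p(k) for k = 1 .. k_max
```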

4 Our Approach
In this section, we introduce our approach for unsupervised image category dis-
covery. It has two interleaved parts: 1) the creation of the weak training (WT)
sets by random walk (RW) sampling, and 2) learning the ensemble proximity
matrix A for categorization.

4.1 Creating WT Sets via Random Walk Sampling


A WT set D_t = {s_t, c_t} consists of a set s_t ∈ {1 . . . N}^Q of Q images sampled from D, where Q = CK and K is the ideal number of training images for each of the C classes. c_t ∈ {1 . . . C_t}^Q are pseudolabels, assigned during the creation of the WT set D_t. Note that C_t refers to the number of empirical categories estimated in the t-th trial and does not necessarily have to reflect the true C.
Recall, as an insight from §3, that in the creation process of WT sets, and with respect to a standard distance measure, images that are assigned the same pseudolabel c should be very close to each other and images with different pseudolabels far away from each other. Additionally, ensemble learning theory teaches us that weak learners should have low bias but high variance in order to form a powerful combined classifier [8]. We use strong discriminative learners (see below) to achieve the first requirement and we incorporate randomness into the sampling process to ensure the second [21,22].
To this end, we adopt a random walk [27] based sampling method. In detail, each image is taken as a node and the weight of the edge between nodes i and j is defined as

    w_{i,j} = 0  if i = j,   and   w_{i,j} = exp{ −dist(f_i, f_j)² / σ² }  otherwise        (1)
Fig. 4. A toy example of random walk sampling in an image collection comprised of images from three categories. Solid lines indicate the process of exploiting neighborhoods. Dashed lines indicate the process of exploring the space for new categories. The isolated red samples indicate the far samples explored. The numbers denote the pseudolabels. (Best viewed in color).

where dist(·,·) denotes a common distance measure and σ is a scale parameter.
Besides the N nodes from individual images, we take the current WT set D_t as the (N+1)-th node in order to provide the possibility of jumping to a node far away from all visited nodes. Its weight to node j ∈ {1, ..., N} is defined as w_{N+1,j} = max_{i∈s_t} w_{i,j} and to itself it is 0. The maximization is used to guarantee that nodes far away from node (N+1) are far away from all the visited nodes. Note that node N+1 and its weights need to be updated whenever a new node is visited, i.e., a new image is added to D_t. When D_t = ∅, its weights are uniformly distributed over all images. Having all the weights, we define the transition matrix to be

    P = [p_{i,j}]_{(N+1)×(N+1)},   p_{i,j} = w_{i,j} / Σ_k w_{i,k}.        (2)

Obtaining each WT set D_t involves sampling Q images (visiting Q nodes) and assigning pseudolabels c_t to them. Starting with D_t = {}, it is built up using the following scheme: at each step, either jump to a neighboring node i* of the previously visited node j to exploit the regions already explored, or jump to a node i* that is far away from all the visited nodes to explore the space of new categories. The corresponding image is added to s_t. The pseudolabels are generated as follows: every time a far node is explored, a new class label is assigned to it; if a node i* neighboring node j is selected, the class label of image j is assigned to it. The above scheme is formally described in Algo. 1 and a toy example is illustrated in Fig. 4. For the handling of node N+1, the reader is referred to Algo. 1.
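The graph construction of Eqs. (1)–(2), including the extra node that represents the current WT set, can be sketched as follows. It assumes NumPy; the squared Euclidean distance, the helper names and the handling of the extra node's row are our own simplifications, not prescribed by the paper.

```python
import numpy as np

def transition_matrix(F, visited, sigma=1.0):
    """Row-normalized transition matrix over N image nodes plus the WT-set node."""
    N = F.shape[0]
    d2 = ((F[:, None, :] - F[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    W = np.exp(-d2 / sigma ** 2)
    np.fill_diagonal(W, 0.0)                              # w_ii = 0, cf. Eq. (1)
    Wfull = np.zeros((N + 1, N + 1))
    Wfull[:N, :N] = W
    if visited:                                           # node N+1 stands for D_t
        Wfull[N, :N] = W[visited, :].max(axis=0)          # w_{N+1,j} = max_i w_ij
    else:
        Wfull[N, :N] = 1.0 / N                            # uniform when D_t is empty
    return Wfull / np.maximum(Wfull.sum(axis=1, keepdims=True), 1e-12)
```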

4.2 Ensemble Partitioning


We now explore the use of the WT sets acquired in §4.1 for image categorization. In particular, for each WT set D_t we train a discriminative classifier φ_t(·) → {1 . . . C_t}. The full dataset D is then classified by φ_t to obtain a binary proximity matrix A^t, where a^t_{ij} = 1 if φ_t(f_i) = φ_t(f_j), and 0 otherwise. The binary proximity matrices over multiple trials are combined into an ensemble proximity matrix A = (1/T) Σ_t A^t. We employ an SVM-based C_t-class classifier
Algorithm 1. RW Sampling in the t-th trial
Input: image data D, feature representation matrix F
Initialize the training set s_t = {any(1 . . . N)} and c_t = {1}
Initialize the transition matrix P
for m = 1 to Q − 1 do
  1. With probability (Q − C)/Q, exploit the neighborhood: jump to a neighboring node i* ∈ {1, . . . , N}\s_t with probability P(s_t^m, i*) / Σ_{j ∈ {1,...,N}\s_t} P(s_t^m, j), and set c* = c_t^m; OR
  2. With probability C/Q, explore new space: jump to a far node i* ∈ {1, . . . , N}\s_t with probability (1 − P(N+1, i*)) / Σ_{j ∈ {1,...,N}\s_t} (1 − P(N+1, j)), and assign a new category label to c*.
  Add sample i* to s_t and pseudolabel c* to c_t.
  Update the transition matrix P.
end for
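A sketch of Algorithm 1 is given below, reusing the transition_matrix helper from the previous sketch and the same assumptions. It alternates between exploiting the neighborhood of the last visited node and exploring a far-away node, which opens a new pseudolabel; it is our own illustration, not the authors' code.

```python
import numpy as np

def rw_sample_wt_set(F, C, K, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    N, Q = F.shape[0], C * K
    s_t = [int(rng.integers(N))]           # start from an arbitrary image
    c_t = [1]
    next_label = 2
    for _ in range(Q - 1):
        P = transition_matrix(F, s_t, sigma)
        remaining = np.setdiff1d(np.arange(N), s_t)
        if rng.random() < (Q - C) / Q:     # exploit: neighbor of the last visited node
            p = P[s_t[-1], remaining]
            label = c_t[-1]
        else:                              # explore: node far from all visited ones
            p = 1.0 - P[N, remaining]
            label = next_label
            next_label += 1
        p = p / p.sum()
        pick = int(rng.choice(remaining, p=p))
        s_t.append(pick)
        c_t.append(label)
    return np.array(s_t), np.array(c_t)
```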

Algorithm 2. Ensemble Partitioning
Input: the image dataset D, feature representation F.
Initialize the proximity matrix A = 0.
for t = 1 to T do
  1. Initialize the current proximity matrix A^t = 0.
  2. Sample D_t = {s_t, c_t} using Algo. 1.
  3. Train the classifier φ_t(·) → {1 . . . C_t} on D_t.
  4. Partition the entire dataset D according to a^t_{ij} = δ(φ_t(f_i), φ_t(f_j)), {i, j} ∈ {1 . . . N}².
end for
A = (1/T) Σ_t A^t.
Obtain the categorization by spectral clustering on A.

φ_t in each trial, as SVMs usually work well for most feature types. Additionally, they have been shown to generalize well even when trained on only a few training samples [28], which is an important characteristic as we keep the size of the individual WT sets relatively small. Given a collection of WT sets, the ensemble partitioning is summarized in Algo. 2. The final categorization can be completed by feeding A to any similarity-based clustering algorithm. In this work, we choose spectral clustering due to its simplicity and popularity. The whole pipeline of Fig. 1 has now been explained.
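Putting the pieces together, a compact sketch of Algorithm 2 might look as follows. It assumes scikit-learn, a known number of categories C, and the rw_sample_wt_set helper from the previous sketch; it is our illustration, not the authors' code.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.cluster import SpectralClustering

def ensemble_partition(F, C, T=200, K=9, sigma=1.0):
    N = F.shape[0]
    A = np.zeros((N, N))
    used = 0
    for t in range(T):
        s_t, c_t = rw_sample_wt_set(F, C, K, sigma, seed=t)
        if len(np.unique(c_t)) < 2:                          # degenerate WT set; skip
            continue
        clf = LinearSVC().fit(F[s_t], c_t)                   # classifier phi_t
        pred = clf.predict(F)                                # partition the whole set
        A += (pred[:, None] == pred[None, :]).astype(float)  # binary proximity A^t
        used += 1
    A /= max(used, 1)                                        # ensemble proximity matrix
    labels = SpectralClustering(n_clusters=C, affinity="precomputed",
                                random_state=0).fit_predict(A)
    return labels
```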

5 Experiments
In this section, we elaborate upon the datasets, evaluation criteria and experi-
mental settings for the evaluation of the proposed approach.
Evaluation Datasets: We tested our method on three datasets: 15-Scene [1],
Caltech-101 [29], and a 35-Compound dataset. The 15-Scene dataset contains 15
scene categories in both indoor and outdoor environments. Each category has 200 to 400 images, and they are of size 300×250 pixels on average. The Caltech-101 dataset [29] contains 101 object categories and is one of the most popular datasets for supervised object recognition. The large number of categories, combined with the uneven category distribution (from 31 to 800 images per category), poses a real challenge for unsupervised categorization. In order to evaluate our method on a more general image dataset, we composed a 35-Compound dataset by mixing the 15 scene categories [1] with 20 object categories selected from Caltech 256. The 20 object categories are: American flag, fern, French horn, leopards 101, pci card, tombstone, airplanes 101, diamond ring, fire extinguisher, ketch 101, mandolin, rotary phone, Pisa tower, face easy 101, dice, fireworks, killer whale, motorbikes 101, roulette wheel, and zebra.
Features: The following features are considered in our experiments: f1 = SIFT [30], f2 = GIST [25], f3 = PHOG [2], f4 = LBP [26], f5 = GIST + PHOG, f6 = GIST + LBP, f7 = PHOG + LBP, and f8 = GIST + PHOG + LBP, where + means simple concatenation. GIST features are computed on images resized to 256×256 pixels and the other features are computed on the original images. For PHOG, we computed the derivatives in 8 directions and used a two-layer pyramid. For LBP, the uniform LBP was used. For SIFT, we used Hessian detectors and k-means to obtain 300 centroids for the bag-of-features representation.

5.1 Experimental Setup

Our Method: We evaluated our approach against competing methods on the datasets and features described above. The number of categories C is assumed to be known. Although the transition matrix P of §4.1 and Algo. 1, and the classifier φ_t of §4.2 and Algo. 2, use the same notation for their features, speed-accuracy tradeoffs may cause different features to be optimal at each stage. Since GIST features have proven very effective for holistic image categorization with the Euclidean distance [25], we adopted them for building P in Algo. 1, where dist(·,·) in Eq. 1 is the Euclidean distance between features. For training the classifiers, f2–f8 were considered¹. For the evaluation and comparison across all the datasets and features, we used the same set of parameters: T = 1000 and K = 9. The influence of these two parameters on the performance of our method was also tested on the three datasets, where a set of T values in the range [1, 1000] and a set of K values in the range [3, 30] were evaluated. The σ in Eq. 1 is set automatically using the self-tuning method proposed in [31], which has already shown promising performance for similarity scaling.
Competing Methods: We compared our method to well-known clustering methods: k-means and spectral clustering. For k-means, four features, f2–f4 and f8, were tried. For spectral clustering, the same features were tried: f2 with the Euclidean distance measure, f3 and f4 with the χ² distance measure, and f8 with the Mahalanobis distance measure with diagonal covariances (the off-diagonal elements were set to 0). We selected these combinations because the χ² distance measure is superior for histogram-based features, and the Mahalanobis distance measure with diagonal covariances is capable of learning feature weights. We also compared our method to the PLSA-based method of [10], the affinity propagation method (AP) of [14], and the regularized information maximization method (RIM) of [17]. f1 is used as the input for the PLSA-based method. Spatial pyramid matching (SPM) [1] is used as the input for the RIM, as it was in [17]. The AP used the same input as the spectral clustering. For the implementations, we used the authors' code directly²,³,⁴. Since the number of categories cannot be set directly in the RIM and the AP (their input parameters are searched to yield the right number of categories), we only tested these two methods on the 15-Scene dataset.

¹ The reason why we did not consider f1 is to avoid severe over-fitting, given the small size of the WT sets and the high dimension of f1.
Baselines for Sampling: To investigate the performance of our random walk sampling scheme, we create two baselines for obtaining the WT sets. Everything else in Algo. 2 remains as before. GIST features are also used for the baselines.

1. Baseline 1: In the t-th trial, run k-means on the whole dataset to cluster the images into C groups. The WT set D_t is created by adding all images and their labels to s_t and c_t, respectively.
2. Baseline 2: In the t-th trial, run k-means on a bootstrap subset of the whole dataset. For a fair comparison to the WT sampling, CK images were randomly selected for each subset, with K = 9. Again, the WT set D_t is created by adding all images and their labels to s_t and c_t, respectively.

5.2 Results and Discussion

We follow [12,16] in using purity (the bigger the better) for performance evaluation. First, let us compare our method to existing methods. Table 1 lists the average and variance of the purity over 10 runs of all the methods on 15-Scene, Caltech-101 and the 35-Compound dataset. The table shows that our method outperforms all the competing methods by a substantial margin on all three datasets while having comparable variance. The superiority of our method can be attributed to its discriminative learning ability, which gives more weight to relevant variables and rejects the irrelevant ones. This can be seen by simply examining how the performance of our method improves when more features are provided to the classifier. This learning ability is crucial for image and object categorization, as so far no single feature can describe all image categories very well. The classifier trained on each individual WT set may over-fit, but this can be offset by the classifiers trained on other WT sets, as the sets are mutually diverse.
² http://www.robots.ox.ac.uk/~vgg/software
³ http://vision.caltech.edu/~gomes/software.html
⁴ http://www.psi.toronto.edu/index.php?q=affinity%20propagation
Table 1. Categorization results on 15-Scene, Caltech-101 and the 35-Compound dataset. SClustering, EnPar, G, P, L are shorthands for spectral clustering, ensemble partitioning, GIST, PHOG, and LBP, respectively. A(f) means that feature f is used for training the classifier of our method. A1 and A2 indicate that the WT sets are obtained by baseline 1 and baseline 2, respectively.

Methods        Input                    15-Scene       Caltech-101    35-Compound
                                        Purity (%)     Purity (%)     Purity (%)
K-means        G                        39.44 (1.01)   28.71 (0.30)   34.32 (0.39)
K-means        P                        29.99 (0.72)   30.35 (0.26)   26.71 (0.47)
K-means        L                        29.01 (0.66)   29.66 (0.31)   33.06 (0.34)
K-means        G+P+L                    31.50 (1.08)   35.51 (0.35)   35.28 (0.67)
SClustering    Euclidean(G)             43.48 (1.04)   33.48 (0.52)   41.12 (0.32)
SClustering    χ²(P)                    29.32 (0.89)   35.77 (0.82)   33.16 (0.73)
SClustering    χ²(L)                    34.09 (0.96)   34.71 (0.71)   44.49 (0.60)
SClustering    Mahalanobis(G+P+L)       42.05 (1.60)   37.72 (0.64)   48.51 (0.85)
PLSA [10]      SIFT                     29.34 (1.21)   26.57 (0.81)   28.21 (0.79)
RIM [17]       SPM [1]                  38.40 (1.04)   –              –
AP [14]        Euclidean(G)             40.71 (0.98)   –              –
AP [14]        χ²(P)                    32.14 (0.73)   –              –
AP [14]        χ²(L)                    35.23 (0.82)   –              –
AP [14]        Mahalanobis(G+P+L)       44.24 (1.01)   –              –
EnPar          A(G)                     46.31 (0.88)   35.79 (0.60)   45.14 (0.66)
EnPar          A(P)                     44.31 (0.81)   36.20 (0.58)   41.78 (0.59)
EnPar          A(L)                     46.44 (0.91)   37.16 (0.64)   53.18 (0.71)
EnPar          A(G+P)                   56.27 (0.98)   39.37 (0.79)   53.17 (0.68)
EnPar          A(G+L)                   55.41 (0.95)   39.60 (0.78)   55.01 (0.72)
EnPar          A(P+L)                   56.10 (1.01)   40.40 (0.83)   56.60 (0.81)
EnPar          A(G+P+L)                 61.49 (1.08)   42.31 (0.88)   58.34 (0.76)
EnPar          A1(G+P+L)                47.52 (0.84)   35.76 (0.55)   42.70 (0.65)
EnPar          A2(G+P+L)                55.43 (0.98)   39.90 (0.82)   52.12 (0.74)

Secondly, let us compare the three ways of creating the WT sets. From Table 1, we can see that the RW sampling outperforms the two baselines substantially. This superiority can be attributed to the good diversity and accuracy of the WT sets generated by the RW sampling, which clearly fulfil the requirements of successful ensemble learning [8]. The accuracy comes from the underlying principle of the RW sampling: with high probability it chooses the most separable images and leaves out the ambiguous ones, which are handled after abstracting knowledge by the max-margin learning. The great diversity comes from the random jumping of the RW sampling; the possibility that two WT sets are the same or very similar is very low. The superiority of baseline 2 over baseline 1 results from the higher diversity of the WT sets that baseline 2 generates. Using the bootstrap technique to increase the diversity of training sets, and in turn boost overall performance, is widely accepted for supervised learning. Our experimental results confirm that this diversity is also very important for unsupervised scenarios like ours.
Fig. 5. The purity of discovered categories as a function of T and K

One may argue that the WT sets generated by the RW sampling are not precise and that the precision of some individual sets may be affected by outliers in the dataset. This is true, but our method demands far less from a WT set than a supervised categorization method does from its training set: the label assignment of the WT sets only needs to be better than random guessing. The imprecision of one WT set can be offset by other WT sets, since their deficiencies are different. This is analogous to the weak learners of ensemble learning methods such as bagging and boosting: each weak learner may be imprecise, but the ensemble is precise and generalizes well. Also, our categorization framework is very flexible and not tied to the way the WT sets are created. Any method that can generate good WT sets can be used in this framework; the RW sampling is a case in point. Interestingly, even with the WT sets created by the two simple baselines, the ensemble partitioning framework generates more precise results than the competing methods.
Furthermore, let us examine the influence of the parameters T and K on the performance of our method. Fig. 5 shows the evaluation results over a variety of values for T and K. From the curves for T in Fig. 5, it is evident that the purity increases quickly with T and then stabilizes at a good performance. That is, our method benefits from adding more WT sets and provides accurate results when a fairly large number of WT sets is used. This is quite similar to how bagging [21] and random forests [22] behave with different numbers of training sets. For the parameter K = Q/C, we find that very small and very large values (e.g. 3 and 30 in our case) degrade the performance. A very small K leads to insufficient training samples per category and a very large K results in very noisy training sets. But as Fig. 5 shows, our method is not overly sensitive to the choice of K: there are large ranges from which to select appropriate values. Its choice can be based on the experience of choosing training data for supervised categorization.
Finally, though additional time is needed for training (most unsupervised categorization methods need no training), our method is still quite efficient. The efficiency comes from two factors: 1) training linear SVMs is very efficient using the optimized package⁵; and 2) the performance of our method stabilizes
⁵ www.csie.ntu.edu.tw/~cjlin/liblinear
quickly with respect to T, as Fig. 5 shows. The training times on a Core i5 2.80 GHz desktop PC are: 3.67 minutes for 15-Scene (4485 images in total), 66.18 minutes for Caltech-101 (8251 images in total), and 12.33 minutes for the 35-Compound dataset (8671 images in total). Furthermore, our method is inherently parallelizable and can take advantage of multi-core processors.

6 Conclusion
This paper has tackled the hard task of unsupervised image categorization. To this end, we leveraged the power of discriminative learning and ensemble methods. We presented the concept of weak training sets and proposed a new, yet simple, sampling scheme, denoted RW sampling, to generate them. For each weak training set, a discriminative classifier is trained to obtain a base partitioning of the unordered image collection, and all the base partitionings are then combined into a strong ensemble proximity matrix that can be fed to spectral clustering. In experiments on several challenging datasets, we showed that our method consistently outperforms the state-of-the-art methods. The study poses multiple interesting questions for future research: Is it possible to design more sophisticated methods for the creation of the WT sets, rather than relying on a common distance metric as we do? How can the method be extended to handle noisy web data? How could we apply the method to weakly-supervised image categorization?
Acknowledgements. The authors gratefully acknowledge support from the SNF project Aerial Crowds and the ERC Advanced Grant VarCity.

References
1. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: CVPR (2006)
2. Bosch, A., Zisserman, A., Muñoz, X.: Image classification using random forests and ferns. In: ICCV (2007)
3. Boiman, O., Shechtman, E., Irani, M.: In defense of nearest-neighbor based image classification. In: CVPR (2008)
4. Xiao, J., Hays, J., Ehinger, K.A., Oliva, A., Torralba, A.: SUN database: Large-scale scene recognition from abbey to zoo. In: CVPR (2010)
5. Fergus, R., Weiss, Y., Torralba, A.: Semi-supervised learning in gigantic image collections. In: NIPS (2009)
6. Deselaers, T., Alexe, B., Ferrari, V.: Localizing Objects While Learning Their Appearance. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part IV. LNCS, vol. 6314, pp. 452–466. Springer, Heidelberg (2010)
7. Blaschko, M.B., Vedaldi, A., Zisserman, A.: Simultaneous object detection and ranking with weak supervision. In: NIPS (2010)
8. Dietterich, T.G.: Ensemble Methods in Machine Learning. In: Kittler, J., Roli, F. (eds.) MCS 2000. LNCS, vol. 1857, pp. 1–15. Springer, Heidelberg (2000)
9. Fergus, R., Perona, P., Zisserman, A.: Object class recognition by unsupervised scale-invariant learning. In: CVPR (2003)
10. Sivic, J., Russell, B.C., Efros, A.A., Zisserman, A., Freeman, W.T.: Discovering objects and their location in images. In: ICCV (2005)
11. Sivic, J., Russell, B.C., Zisserman, A., Freeman, W.T., Efros, A.A.: Unsupervised discovery of visual object class hierarchies. In: CVPR (2008)
12. Dai, D., Wu, T., Zhu, S.C.: Discovering scene categories by information projection and cluster sampling. In: CVPR (2010)
13. Dueck, D., Frey, B.J.: Non-metric affinity propagation for unsupervised image categorization. In: ICCV (2007)
14. Frey, B.J., Dueck, D.: Clustering by passing messages between data points. Science 315, 972–976 (2007)
15. Grauman, K., Darrell, T.: Unsupervised learning of categories from sets of partially matching image features. In: CVPR (2006)
16. Tuytelaars, T., Lampert, C.H., Blaschko, M.B., Buntine, W.: Unsupervised object discovery: A comparison. IJCV 88, 284–302 (2009)
17. Gomes, R., Krause, A., Perona, P.: Discriminative clustering by regularized information maximization. In: NIPS (2010)
18. Lee, Y.J., Grauman, K.: Learning the easy things first: Self-paced visual category discovery. In: CVPR (2011)
19. Bengio, Y., Louradour, J., Collobert, R., Weston, J.: Curriculum learning. In: ICML (2009)
20. Friedman, J., Hastie, T., Tibshirani, R.: Additive Logistic Regression: A Statistical View of Boosting. The Annals of Statistics 28, 337–407 (2000)
21. Breiman, L.: Bagging predictors. ML 24, 123–140 (1996)
22. Breiman, L.: Random forests. ML 45, 5–32 (2001)
23. Leisch, F.: Bagged clustering. Working Paper Series (1999)
24. Saffari, A., Bischof, H.: Clustering in a boosting framework. In: CVWW (2007)
25. Oliva, A., Torralba, A.: Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV 41, 145–175 (2001)
26. Ojala, T., Pietikäinen, M., Mäenpää, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. PAMI 24, 971–987 (2002)
27. Harel, D., Koren, Y.: On Clustering Using Random Walks. In: Hariharan, R., Mukund, M., Vinay, V. (eds.) FSTTCS 2001. LNCS, vol. 2245, pp. 18–41. Springer, Heidelberg (2001)
28. Malisiewicz, T., Gupta, A., Efros, A.A.: Ensemble of exemplar-SVMs for object detection and beyond. In: ICCV (2011)
29. Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In: WGMBV (2004)
30. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV 60, 91–110 (2004)
31. Zelnik-Manor, L., Perona, P.: Self-tuning spectral clustering. In: NIPS (2004)
Set Based Discriminative Ranking for Recognition

Yang Wu(1), Michihiko Minoh(1), Masayuki Mukunoki(1), and Shihong Lao(2)

(1) Academic Center for Computing and Media Studies, Kyoto University
(2) OMRON Social Solutions Co., Ltd., Kyoto 619-0283, Japan
yangwu@mm.media.kyoto-u.ac.jp, {minoh,mukunoki}@media.kyoto-u.ac.jp,
lao@ari.ncl.omron.co.jp
Abstract. Recently, both face recognition and body-based person re-identification
have been extended from single-image scenarios to video-based or, more generally,
image-set based problems. Set-based recognition brings new research and application
opportunities while at the same time raising great modeling and optimization
challenges. How to make the best use of the available multiple samples for each
individual, while not being disturbed by the great within-set variations, is in our
view the major issue. Due to the difficulty of designing a globally optimal learning
model, most existing solutions are still based on unsupervised matching, which can
be further categorized into three groups: a) set-based signature generation, b) direct
set-to-set matching, and c) between-set distance finding. The first two count on
good feature representation, while the third explores data set structure and set-based
distance measurement. Their main shortcoming is the lack of learning-based
discrimination ability. In this paper, we propose a set-based discriminative ranking
model (SBDR), which iterates between set-to-set distance finding and discriminative
feature space projection to achieve simultaneous optimization of the two. Extensive
experiments on widely used face recognition and person re-identification datasets not
only demonstrate the superiority of our approach, but also shed some light on its
properties and application domain.
1 Introduction

The existing research on object recognition mainly focuses on single-image based
approaches, i.e., recognizing a single instance at a time. However, in many
real applications like recognizing people in video surveillance (either by face
or full body), it is not only possible but also easy to get multiple images or
video frames for both query and gallery objects. And if allowed, users usually
prefer to use as many images of each object as possible, because they think that

This work was supported by R&D Program for Implementation of Anti-Crime and
Anti-Terrorism Technologies for a Safe and Secure Society, Special Coordination
Fund for Promoting Science and Technology of the Ministry of Education, Culture,
Sports, Science and Technology, the Japanese Government.

(Fig. 1 shows four panels: (a) the original query and gallery sets; (b) between-set geometric distance finding, d^SS_W(Q, X_i) and d^SS_W(Q, X_j); (c) feature space projection (metric) learning; (d) the learned feature space and the resulting between-set distances.)

Fig. 1. Illustration of the proposed set-based discriminative ranking of a relevant and
an irrelevant image set in the gallery, given a query set of images

more candidates generally mean more information and should result in better
recognition performance. Such a belief comes from the experience of human
vision, and it has also been supported by recent research efforts [1-4].
Though set-based recognition is believed to be more promising than recognition
based on single images, it brings new challenges: the images in a single
set may be taken by different cameras under different conditions (illumination,
viewpoint, background, etc.), and the object itself may also exhibit large appearance
changes (pose, occlusion, etc.). Such wide coverage increases the chance
of finding the correct matches between sets, but could also increase the probability
of making mistakes. How to make the best use of multiple instances while
at the same time being discriminative in distinguishing image sets from different
classes becomes the key issue. Existing supervised classification methods usually
assume that the data to be recognized is a single instance which can be represented
by a feature vector in the same space as the training samples, so they are
not directly applicable to set-based recognition tasks.
In this paper, we propose a novel approach to optimize a global set-based
recognition objective function by iteratively alternating between between-set distance
finding and feature space projection (or, namely, metric learning). As shown in
Figure 1, given a query set Q and two arbitrary gallery sets X_i and X_j, each
of which contains multiple images of an object, our approach finds the distance
between each pair of query-gallery sets by computing the geometric distance
of their approximated convex hulls (distance of closest approach) [1]. It can be
treated as finding the closest points of the two hulls, though by convex approximation
these points may be virtual, as they can be linear combinations of actual
sample points. Then a maximum-margin-based ranking algorithm is adopted to
learn a good metric, making the closest approach between a correct query-gallery
pair smaller than that between incorrect ones as much as possible. Such metric
learning can be viewed as feature space projection. After that, the closest
approach between each query-gallery pair is updated in the new feature space,
which activates another metric learning step. The iteration between these two
continues until convergence is achieved. Experimental results on both face recognition
and person re-identification demonstrate that our approach consistently
outperforms state-of-the-art methods without parameter tuning.
2 Related Work

Existing methods for set-based object recognition depend strongly on the object
categories they work on. Generally speaking, research on face recognition
focuses on exploring the distribution [5] or structure of the data sets [6-8]
and on designing the between-set distance [9, 2], while in the person re-identification
literature more attention has been paid to feature representation [3, 10, 4].
This phenomenon is to some extent due to the differences between the two
categories: human bodies usually exhibit greater appearance variations and occlusions
than faces, which complicates feature representation. Nevertheless,
these are valuable attempts at solving the generic set-based recognition problem.
Here we briefly categorize them into three groups: a) set-based signature
generation, b) direct set-to-set matching, and c) between-set distance finding.

The first two groups count on exploring new features: group (a) aims
at a single global and informative representation for each query/gallery set, while
group (b) looks for good features for individual images. More concretely, in 2006
a spatiotemporal segmentation algorithm was employed to generate signatures
that are invariant to variations in illumination, pose, and the dynamic appearance
of clothing, followed by clustering and matching methods [11]. Three years
later, a new signature called HPE (Histogram Plus Epitome) [10] was proposed
to integrate a global HSV color histogram with local epitome descriptors (generic
and local epitomes). The most recent work in group (a) is MRCG (Mean
Riemannian Covariance Grid) [4], which divides each image into a dense grid
structure with overlapping cells and then describes each cell by region covariance
features [12]. Specially designed mean and variance functions are applied
to the cells to form a signature, and a similarity function is then used to
compare two signatures. Though these signatures are compact and directly compatible
with image-based matching/classification algorithms, it is hard to make
them both representative and discriminative. Unlike them, the feature representation
in group (b) does not have to meet such high demands, as it relies on
cooperation with set-to-set matching methods. A typical approach is
SDALF (Symmetry-driven Accumulation of Local Features) [3], which
exploits the symmetry and asymmetry of the human body to partition it into
three parts and then uses three types of features to describe each part (except
the smallest, head part). A weighted combination of distances on these features
defines the image-to-image distance, and the minimum distance among
all possible image pairs is adopted for set-to-set matching. Though this group
takes advantage of having multiple candidates in each set for matching, its
effectiveness still largely depends on the quality of the features, which requires
careful, case-dependent design.

Instead of focusing on features, the third group works on set structure extraction
and distance computation. Between-set distance finding approaches were
mainly proposed for face recognition. Among these efforts, two recently introduced
methods are remarkable. One is AHISD/CHISD (Affine/Convex Hull
based Image Set Distance) [1], which characterizes each image set (query/gallery)
in terms of an affine/convex hull of the feature vectors of its images, and then
finds the geometric distance (distance of closest approach) between two hulls to
serve as the set-to-set distance. Such a convex approximation is less prone to overfitting
than models based directly on sample points, because it can generate new samples
on the hull, and the approach can also be robust to outliers to some extent. The
other work is SANP (Sparse Approximated Nearest Points) [2], which enforces
sparsity of the samples used for point generation via linear combination.
Compared with direct set-to-set matching, this group of methods makes
better use of the data sets; however, as they are unsupervised, the feature space
in which the data sets live directly determines their effectiveness, since all
dimensions are equally weighted for distance computation.

3 Set Based Discriminative Ranking

The problem of set-based object recognition can be formulated as follows. Given
a query Q \in \mathcal{Q}, where \mathcal{Q} is the query space and Q itself is an image set containing
the same object, the goal is to recognize the identity/category of the object
by comparing Q to a gallery of image sets \mathcal{X}, each of which (X \in \mathcal{X}) is about
a specific object or object category(1). Motivated by both the theoretical and
practical advantages of using ranking-based models for recognition [13, 14], here
we propose a discriminative ranking method for solving the set-based recognition
problem(2). In this setting, for each query Q we divide the gallery into two groups
I_Q^+ and I_Q^-, where I_Q^+ contains the indices of the image sets containing the same object as
Q, i.e. the relevant sets, and I_Q^- denotes the irrelevant ones. Then the desired ranking
of the gallery sets y_Q^* \in \mathcal{Y} (where \mathcal{Y} is the space of all feasible rankings)
is the one that satisfies: \forall i \in I_Q^+, \forall j \in I_Q^-, X_i \succ_{y_Q^*} X_j, which means that a
relevant set should always be ranked before any irrelevant set.

(1) To be easily understandable, in the remainder of the paper recognition will simply
mean identification, though our model also applies to object categorization.
(2) Classification models (e.g. Structured SVM) can also be adopted here. We choose
ranking because of a potential application of our model to search and retrieval.
3.1 Set-Based Ranking Model

Given an arbitrary query set Q \in \mathcal{Q} and the whole gallery \mathcal{X} composed of
a batch of image sets, we define a joint feature map for a candidate ranking
y_Q of gallery \mathcal{X} as \Psi(Q, \mathcal{X}, y_Q). A desired ranking algorithm should then be
able to learn a model w that can successfully distinguish the correct ranking
y_Q^* from any other incorrect ranking y_Q. Following the maximum-margin
formulation, this can be approached by optimizing the following objective function:

    \min_{w, \xi} \; f(w) + \frac{C}{|\mathcal{Q}|} \sum_{Q \in \mathcal{Q}} \xi_Q,    (1)

    s.t. \; \langle w, \Psi(Q, \mathcal{X}, y_Q^*) \rangle \ge \langle w, \Psi(Q, \mathcal{X}, y_Q) \rangle + \Delta(y_Q^*, y_Q) - \xi_Q, \quad \forall Q \in \mathcal{Q}, \; \forall y_Q \ne y_Q^*;
    \xi_Q \ge 0, \quad \forall Q \in \mathcal{Q},

where \xi_Q is the slack variable for Q and C is the trade-off parameter. \Delta(y_Q^*, y_Q)
is the loss function measuring the penalty of predicting y_Q instead of y_Q^*. In the
test stage, given an input query Q, the model infers the best ranking
of \mathcal{X} from the compatibility score:

    \hat{y}_Q = \arg\max_{y_Q \in \mathcal{Y}} \langle w, \Psi(Q, \mathcal{X}, y_Q) \rangle.    (2)

We decompose the feature map using the so-called partial order features [15]:

    \Psi_{po}(Q, \mathcal{X}, y_Q) = \sum_{i \in I_Q^+} \sum_{j \in I_Q^-} y_{ij} \left[ \frac{\phi(Q, X_i)}{|I_Q^+|} - \frac{\phi(Q, X_j)}{|I_Q^-|} \right],    (3)

where

    y_{ij} = \begin{cases} +1, & X_i \succ_{y_Q} X_j \\ -1, & X_i \nsucc_{y_Q} X_j \end{cases}

and \phi(Q, X_i) is a feature map characterizing the relationship between query set
Q and gallery set X_i. As has been studied in single-instance-based ranking
[16], it can be proved that such a definition has a very attractive property: given
the model w, the ranking y_Q that maximizes \langle w, \Psi(Q, \mathcal{X}, y_Q) \rangle is the one that
sorts \mathcal{X} by descending \langle w, \phi(Q, X_i) \rangle, where X_i is an arbitrary set in \mathcal{X}. This
property turns the ranking problem into a weighted set-to-set similarity scoring
problem.
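To make this property concrete, the following minimal sketch (illustrative names only, assuming the per-set scores \langle w, \phi(Q, X_i) \rangle have already been computed) recovers the best ranking of Eqn. 2 by simple sorting:

    import numpy as np

    def infer_ranking(scores):
        """Return gallery indices ordered by descending compatibility score.

        scores[i] plays the role of <w, phi(Q, X_i)>; by the property discussed
        above, sorting the scores in descending order yields the ranking that
        maximizes <w, Psi(Q, X, y_Q)>.
        """
        return np.argsort(-np.asarray(scores))

    # Example: three gallery sets, the second one is scored highest.
    print(infer_ranking([-1.2, -0.3, -2.5]))  # -> [1 0 2]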
Now the problem becomes how to define a proper set-to-set joint feature map
\Phi(Q, X_i). It looks like an extension of the traditional relative feature map, or
relative distance as it is called in [17]; however, for two sets which may contain
different numbers of points it is nontrivial to design such a relative feature map,
since simple feature subtraction cannot be directly applied to point sets.

3.2 Set-to-Set Distance Metric

Let W \succeq 0 denote a symmetric, positive semi-definite matrix in R^{d x d}. The
distance between two arbitrary samples x_k and x_l in the d-dimensional feature
space under the metric defined by W is then \|x_k - x_l\|_W = \sqrt{(x_k - x_l)^T W (x_k - x_l)}.
To avoid the square root operation, we simply use the squared distance \|x_k - x_l\|_W^2 instead:

    d_W(x_k, x_l) = (x_k - x_l)^T W (x_k - x_l)    (4)
                  = \mathrm{tr}\big(W (x_k - x_l)(x_k - x_l)^T\big)    (5)
                  = \big\langle W, (x_k - x_l)(x_k - x_l)^T \big\rangle_F,    (6)

where \mathrm{tr}(\cdot) denotes the trace of a matrix and \langle \cdot, \cdot \rangle_F the Frobenius inner
product. Eqn. 6 provides a way to separate the metric matrix W from the operation
on the feature vectors. Therefore, as stated in [16], (x_k - x_l)(x_k - x_l)^T can be
used to define the relative feature map between x_k and x_l:

    \phi(x_k, x_l) = (x_k - x_l)(x_k - x_l)^T.    (7)
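As a quick numerical sanity check of Eqns. 4-7 (a sketch on randomly generated data, not part of the paper), the quadratic form and the Frobenius inner product with the relative feature map yield identical values:

    import numpy as np

    rng = np.random.default_rng(0)
    d = 5
    A = rng.standard_normal((d, d))
    W = A @ A.T                       # symmetric positive semi-definite metric
    xk, xl = rng.standard_normal(d), rng.standard_normal(d)

    diff = xk - xl
    d_quadratic = diff @ W @ diff     # Eqn. 4
    Phi = np.outer(diff, diff)        # relative feature map, Eqn. 7
    d_frobenius = np.sum(W * Phi)     # <W, Phi>_F, Eqn. 6

    assert np.isclose(d_quadratic, d_frobenius)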
Therefore, if we can define our set-to-set relative feature map analogously, then
by replacing the vector-style model parameter w with the metric matrix W and
changing the ordinary inner product \langle w, \phi(Q, X_i) \rangle to the Frobenius inner
product \langle W, \Phi(Q, X_i) \rangle_F, the large-margin objective for ranking will
directly optimize the set-to-set distance metric. For the regularization term
f(W) in the objective function there are different options, such as \mathrm{tr}(W),
\frac{1}{2}\mathrm{tr}(W^T W), etc.; \mathrm{tr}(W) is a good choice for us as it prefers sparse solutions.

Inspired by the closest-points based set-to-set distance and the nearest-neighbor
based set-to-set matching, we define our set-to-set distance metric as follows:

    d^{SS}_W(X_i, X_j) = \min_{x_{ik} \in X_i, \, x_{jl} \in X_j} d_W(x_{ik}, x_{jl})    (8)
                       = \min_{x_{ik} \in X_i, \, x_{jl} \in X_j} (x_{ik} - x_{jl})^T W (x_{ik} - x_{jl})    (9)
                       = \min_{x_{ik} \in X_i, \, x_{jl} \in X_j} \big\langle W, (x_{ik} - x_{jl})(x_{ik} - x_{jl})^T \big\rangle_F.    (10)

Then, we can choose

    \langle W, \Phi(Q, X_i) \rangle_F = -d^{SS}_W(Q, X_i)    (11)

for our ranking model. However, as the minimization cannot be exchanged with
the Frobenius inner product, we do not have an explicit form for \Phi(Q, X_i).
Nevertheless, this does not prevent the use of the distance metric in our model.
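A direct, sample-based implementation of the metric in Eqns. 8-10 is just a minimum over all pairwise quadratic-form distances; a small numpy sketch (illustrative names, without the convex approximation introduced below):

    import numpy as np

    def set_to_set_distance(Xi, Xj, W):
        """d^SS_W(Xi, Xj): min over pairs (x_ik, x_jl) of (x_ik-x_jl)^T W (x_ik-x_jl).

        Xi: (ni, d) array, Xj: (nj, d) array, W: (d, d) PSD metric matrix.
        """
        # All pairwise differences, shape (ni, nj, d).
        diffs = Xi[:, None, :] - Xj[None, :, :]
        # Quadratic form evaluated for every pair.
        dists = np.einsum('ijd,de,ije->ij', diffs, W, diffs)
        return dists.min()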

3.3 Geometric Distance between Convex Models

The definition of the set-to-set distance metric in Eqn. 9 has two limitations.
On the one hand, it requires computing the minimum pair-wise distance over two sets,
which is computationally expensive as it is quadratic in the number
of set points. On the other hand, the distance highly depends on the actual
positions of the sample points, which indicates low generalization ability. Similar
to the recently proposed approaches for set-based distance finding [1, 2], we
overcome these two limitations by using convex approximations of the point sets and
then measuring their dissimilarity by the geometric distance between them. For
example, when the affine hull is adopted for convex approximation, d^{SS}_W(X_i, X_j)
can be rewritten as the minimum distance between the two affine hulls of X_i and X_j:

    d^{SS}_W(X_i, X_j) = (X_i \alpha_i^* - X_j \alpha_j^*)^T W (X_i \alpha_i^* - X_j \alpha_j^*),

in which the linear combination coefficient vectors \alpha_i^* and \alpha_j^* are found by

    (\alpha_i^*, \alpha_j^*) = \arg\min_{\alpha_i, \alpha_j} (X_i \alpha_i - X_j \alpha_j)^T W (X_i \alpha_i - X_j \alpha_j),
    \quad s.t. \; \sum_{k=1}^{n_i} \alpha_{ik} = 1 = \sum_{k'=1}^{n_j} \alpha_{jk'},

where n_i and n_j are the numbers of points in X_i and X_j, respectively.
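One simple way to obtain the coefficient vectors above is to solve a small constrained least-squares problem; the sketch below uses scipy's SLSQP solver with the two sum-to-one constraints (bounds on the coefficients could be added to obtain the reduced-affine-hull or convex-hull variants). It is an illustration under those assumptions, not the authors' implementation:

    import numpy as np
    from scipy.optimize import minimize

    def affine_hull_distance(Xi, Xj, W):
        """Minimum squared W-distance between the affine hulls of two point sets.

        Xi: (d, ni), Xj: (d, nj) with points as columns, W: (d, d) PSD metric.
        Returns (distance, alpha_i, alpha_j).
        """
        ni, nj = Xi.shape[1], Xj.shape[1]

        def objective(a):
            ai, aj = a[:ni], a[ni:]
            diff = Xi @ ai - Xj @ aj
            return diff @ W @ diff

        constraints = [
            {'type': 'eq', 'fun': lambda a: np.sum(a[:ni]) - 1.0},
            {'type': 'eq', 'fun': lambda a: np.sum(a[ni:]) - 1.0},
        ]
        a0 = np.concatenate([np.full(ni, 1.0 / ni), np.full(nj, 1.0 / nj)])
        res = minimize(objective, a0, method='SLSQP', constraints=constraints)
        return res.fun, res.x[:ni], res.x[ni:]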


Recall that the metric matrix W is positive semi-definite, so we can perform an
eigen-decomposition of it:

    W = A \Lambda A^T = P P^T, \quad P = A \Lambda^{1/2},

where the columns of A are the eigenvectors of W and \Lambda is a diagonal matrix whose
diagonal entries are the corresponding eigenvalues. Therefore, the geometric distance
can be transformed to

    d^{SS}_W(X_i, X_j) = \| P^T X_i \alpha_i^* - P^T X_j \alpha_j^* \|^2 = \| X_i^P \alpha_i^* - X_j^P \alpha_j^* \|^2,    (12)

where X_i^P = P^T X_i and X_j^P = P^T X_j can be viewed as feature space projections
of the original point sets X_i and X_j. After the projection by P, it becomes a
traditional geometric distance between affine hulls. By bounding the coefficients
\alpha_{ik} and \alpha_{jk} within a predefined range [L, U], we can constrain the hulls to be
reduced affine hulls or even convex hulls (L = 0, U <= 1), so that the AHISD and
CHISD methods [1] can be directly used here, along with their extended kernel
versions. If we put sparsity constraints on \alpha_i and \alpha_j, it becomes the SANP
model. Therefore, both of the latest set-based distance finding models can be
used in our set-to-set distance metric.
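The projection of Eqn. 12 can be materialized directly from the eigen-decomposition of W; a short sketch (tiny negative eigenvalues, which can appear numerically, are clipped so that P stays real):

    import numpy as np

    def metric_to_projection(W):
        """Return P such that W ~= P P^T, so that x -> P^T x is the learned projection."""
        eigvals, eigvecs = np.linalg.eigh(W)       # W = A Lambda A^T
        eigvals = np.clip(eigvals, 0.0, None)      # guard against tiny negative values
        return eigvecs * np.sqrt(eigvals)          # P = A Lambda^{1/2}

    # After projecting, the W-metric reduces to the Euclidean one:
    # d_W(x, y) == ||P.T @ x - P.T @ y||^2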

3.4 Simultaneous Optimization

As the above formulation shows, our set-based ranking model directly embeds
the set-based geometric distance finding in the maximum-margin model
learning. Therefore, the optimization of the convex approximation coefficients
\alpha and the optimization of the ranking model (namely W) are mutually dependent,
which demands simultaneous optimization. However, it is hard to optimize them
directly in a single objective function because they operate over different data
(two sets vs. multiple sets) with different objectives and constraints.
Therefore, we use an iterative algorithm to perform the simultaneous
optimization.

The procedure of our Set-based Discriminative Ranking (SBDR) model is
briefly described in Algorithm 1. The framework is based on the 1-slack margin-rescaling
cutting-plane algorithm described in [18]. We have also extended the
efficient computation strategy proposed in [16] for the implementation of our set-to-set
distances d^{SS}_W(Q_i, X_i), adopting it for both constraint updating
and metric optimization. Readers are referred to [16] for implementation details.

In the iteration, on one hand, the set-to-set distance finding generally decreases
all the weighted joint features (i.e., the set-to-set distances), thus indirectly
decreasing \xi and reducing the objective function value, while on
the other hand, metric learning approaches the global optimum by optimizing W.
Therefore, the procedure finally converges. In the experiments presented below,
it always converged within 30 iterations.
Algorithm 1. Learning the SBDR model

Require: Query set collection \mathcal{Q}, gallery set collection \mathcal{X}, desired rankings y_1^*, ..., y_{|\mathcal{Q}|}^* of \mathcal{X} for each query set, slack trade-off C > 0, termination threshold \epsilon > 0.
Ensure: Metric matrix W.
 1: Initialize W with a diagonal matrix whose diagonal values are the standard deviations
    of the whole dataset on each feature dimension.
 2: Initialize the working set of constraints: \mathcal{C} \leftarrow \emptyset.
 3: repeat
 4:    Compute the currently optimal metric W (and slack \xi) from Eqn. 1 over the working set \mathcal{C}.
 5:    Compute the set-to-set distances: \forall Q_i \in \mathcal{Q}, \forall X_i \in \mathcal{X}, d^{SS}_W(Q_i, X_i).
 6:    Update the working set of constraints:
 7:    for i = 1, ..., |\mathcal{Q}| do
 8:        \hat{y}_{Q_i} \leftarrow \arg\max_{y \in \mathcal{Y}} \Delta(y_{Q_i}^*, y) + \langle W, \Psi_{po}(Q_i, \mathcal{X}, y) \rangle_F
 9:    end for
10:    \mathcal{C} \leftarrow \mathcal{C} \cup \{ (\hat{y}_1, ..., \hat{y}_{|\mathcal{Q}|}) \}
11: until
       \frac{1}{|\mathcal{Q}|} \sum_{i=1}^{|\mathcal{Q}|} \big[ \Delta(y_{Q_i}^*, \hat{y}_{Q_i}) - \langle W, \Psi_{po}(Q_i, \mathcal{X}, y_{Q_i}^*) - \Psi_{po}(Q_i, \mathcal{X}, \hat{y}_{Q_i}) \rangle_F \big] \le \xi + \epsilon.
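The listing below is a deliberately simplified, runnable toy of the alternation that Algorithm 1 performs: it uses closest sample pairs instead of the convex-hull closest approach, restricts W to a diagonal matrix, and replaces the 1-slack cutting-plane solver with a plain hinge-style subgradient step, so it only illustrates the iterate-distances-then-update-metric structure rather than the actual SBDR optimization. All names are illustrative.

    import numpy as np

    def closest_pair_distance(Q, X, w_diag):
        """Min squared distance between two sets under a diagonal metric diag(w_diag)."""
        diffs = Q[:, None, :] - X[None, :, :]
        d = (diffs ** 2 * w_diag).sum(-1)
        idx = np.unravel_index(d.argmin(), d.shape)
        return d[idx], diffs[idx] ** 2            # distance and its per-dimension profile

    def toy_sbdr(queries, gallery, relevant, n_iter=30, lr=0.01, margin=1.0):
        """queries/gallery: lists of (n_k, d) arrays; relevant[q] = index of the
        relevant gallery set for query q. Returns a learned diagonal metric."""
        d = gallery[0].shape[1]
        w = np.ones(d)
        for _ in range(n_iter):
            grad = np.zeros(d)
            for q, Q in enumerate(queries):
                # Step 1: set-to-set distances under the current metric.
                dists, profiles = zip(*(closest_pair_distance(Q, X, w) for X in gallery))
                pos = relevant[q]
                for neg in range(len(gallery)):
                    if neg == pos:
                        continue
                    # Step 2: hinge-style ranking constraint d_pos + margin <= d_neg.
                    if dists[pos] + margin > dists[neg]:
                        grad += profiles[pos] - profiles[neg]
            w = np.clip(w - lr * grad, 1e-6, None)   # keep the metric valid (w >= 0)
        return w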

4 Experiments and Results

We evaluate the proposed SBDR approach with CHISD (linear) and SANP
as its geometric distance finder (denoted SBDR_CHISD and SBDR_SANP,
respectively) on two representative set-based recognition tasks: face recognition
and person re-identification. They are not only important in many real applications
but are also representative of two typical research cases: faces are
relatively rigid with fewer appearance variations but very subtle between-class
differences, while person re-identification has greater appearance variations
(caused by both the object itself and its surrounding environment),
which challenge both feature representation and the recognition model.
Therefore, we present comparisons with state-of-the-art methods on widely used
databases for these two different tasks, with experimental details and results
presented separately for clarity.

4.1 Experiments on Face Recognition

The well-known Honda/UCSD [19] and CMU MoBo [20] datasets are used
for our experiments. The Honda/UCSD dataset was collected for video-based
face recognition and contains 20 individuals in 59 video sequences. We use the
version from [2] which has the faces detected, resized to gray-scale images of
size 20 x 20, and histogram equalized. The length of each sequence varies from
13 to 782 frames. Within each sequence there are large pose variations with moderate
expression changes. The raw pixels of the images were used as features. The CMU
MoBo dataset was originally built for pose recognition, but recently the human
faces have been detected and resized to 40 x 40 for face recognition. There are
24 individuals appearing in 96 sequences, which were captured from multiple
cameras with four different walking styles: slow, fast, inclined, and carrying a
ball. Each individual appears in 4 sequences covering the different walking styles.

We used exactly the same LBP features as adopted by CHISD and SANP for the CMU
MoBo dataset, and followed their original experimental setting: select one sequence
for each individual as the gallery set and use the other 3 as query sets.
The gallery-query split on the Honda/UCSD data exactly follows [2], i.e., 20 sequences
for query and the remaining 39 for gallery. Since performance on the
whole sequences of the Honda/UCSD dataset has become saturated (SANP claimed
100% accuracy), we sampled 50 or 100 frames per sequence instead. This strategy
was also applied to the CMU MoBo dataset for in-depth comparison.

The frame sampling strategy was originally proposed by [2]; however, it simply
treats the first 50/100 frames at the beginning of each sequence as testing samples,
which may be improper for long sequences with probably more variations in
the remaining frames. Therefore, we experiment with two different sampling
strategies to better show the properties of the data and recognition
algorithms: Type I refers to sampling from the beginning of each sequence for
testing and randomly sampling images from the rest for training, while Type II
stands for random sampling for both training and testing without overlap.
For short sequences which do not have enough frames, we sampled
as many as possible while keeping the training and test sets balanced. For
each setting, the results are averaged over 10 trials to eliminate the uncertainty
of random sampling.

As the recently proposed CHISD and SANP models give so far the best
performance on these two datasets and our model can directly utilize them,
in this paper we compare our SBDR model only with these two approaches.
We used Prec@k with k = 1 as our loss function, since the evaluation is on the
recognition rate, and we set C = 10 without further tuning.

Table 1. Average recognition rate (%) on faces. The stars indicate that the results
differ from the ones presented in [2] due to a small change of experimental
settings. Detailed explanations are given in the text.

                          Sampling strategy:  50 frames            100 frames
Dataset      Method                           Type I    Type II    Type I    Type II
Honda/UCSD   CHISD                            80.51     93.85      78.97     94.02
             SBDR_CHISD                       83.08     96.41      84.10     95.73
             SANP                             81.03     91.03      83.59     92.31
             SBDR_SANP                        87.69     95.64      89.23     97.95
CMU MoBo     CHISD                            90.28     96.11      93.06     96.67
             SBDR_CHISD                       93.89     97.78      95.56     98.61
             SANP                             88.89     93.06      93.06     96.67
             SBDR_SANP                        95.00     98.61      96.11     98.89
Experimental results are shown in Table 1. They not only clearly demonstrate
the superiority of the proposed SBDR approach but also provide important insights
on how to use it properly (data sampling and set-distance finder selection).
We explain the results from the following perspectives.

- SBDR significantly outperforms both CHISD and SANP, and SANP fits SBDR
better than CHISD. All the results consistently show that embedding discriminative
metric learning using the proposed SBDR model can significantly improve the
performance of the unsupervised set-based distance-finding methods (CHISD and
SANP), and the improvement is much greater over SANP. In almost all cases (with
only one exception), SBDR_SANP performs better than SBDR_CHISD, even when
SANP itself performs worse than CHISD. A plausible reason for this interesting
phenomenon is that SANP provides SBDR with between-set distances determined by
sparse but representative samples, which could be more informative for discriminative
learning than distances computed over all the samples as in the CHISD model.

- Wider within-class variation coverage results in better recognition performance.
For the same set size, random sampling is more likely to result in larger within-set
variations than sequential sampling from the beginning of the video sequences, as
witnessed by the difference between the training and testing sets for sampling
strategy Type I shown in Fig. 2. As expected, the experimental results clearly show
that, for all methods, the performance on randomly sampled testing sets is much
better than that on sequentially sampled ones. This is more significant on the
Honda/UCSD dataset, probably because the facial appearance changes more gradually
and sequentially in the Honda/UCSD dataset than in the CMU MoBo dataset.

- CHISD vs. SANP: CHISD performs better on sparse data, but SANP takes over as
the within-set data density increases. For sampling strategy Type II, CHISD performs
considerably better than SANP on both datasets when the set size equals 50, but this
advantage is weakened or even eliminated when the set size grows to 100. The same
trend with growing set size can be observed for sampling strategy Type I as well.
Results on the Honda/UCSD dataset demonstrate this more significantly, which
coincides with the experimental results in [2], though the exact numbers there differ
from the ones shown here for two reasons: we have much smaller testing sets for
persons with fewer than 50/100 images (in such cases only half of the images are used
for testing while the other half are reserved for training), and our results are averaged
over 10 trials instead of only one. With the same set size, SANP performs better
relative to CHISD on Type I than on Type II; this is because Type I has smaller
within-set variations, so the data density is relatively higher than for Type II.

Note that even with only 100 frames per set for testing, the recognition rate
of SBDR_SANP already surpasses the best results reported by other state-of-the-art
methods (excluding CHISD and SANP, as they have already been compared
Fig. 2. Different sampling styles (random and sequential) result in different within-class
variation coverage for each set. Without loss of generality, the training and testing sets
for the first 5 persons (P1 to P5) in the Honda/UCSD dataset are shown using
sampling strategy Type I with a set size of 50. Both the test sets for query
and gallery are thus sequentially sampled from the beginning, while the training sets are
randomly sampled from the remaining frames of each sequence. It can be clearly seen
that the training sets have larger within-set variations.

with) using full-length sequences. More concretely, the best previously reported rate
on the Honda/UCSD dataset is 97.44% with an average of 267 frames per set, while
that on the CMU MoBo dataset is 95.97% with an average of 496 frames per set [2].

4.2 Experiments on Person Re-identification

Datasets and Settings: There are two publicly available and commonly used
datasets for set-based (or, namely, multiple-shot) person re-identification: the
ETHZ dataset [21] and the i-LIDS dataset [22]. However, due to the recently
saturated performance on the ETHZ dataset [4], which suggests its unsuitability
for further comparison, we only experiment on the i-LIDS dataset in this paper.

The i-LIDS dataset for person re-identification was adapted from the 2008
i-LIDS Multiple-Camera Tracking Scenario (MCTS), released by the Home Office
of the UK for research on cross-camera human tracking. Since its introduction,
it has been tested by almost all of the approaches proposed for person
re-identification. It contains 476 images of 119 unique individuals. As the images
are automatically extracted, the width-height ratio varies and misalignment
happens. Though it has an average of 4 images for each individual, the actual
number varies from 2 to 8. This dataset is very challenging because it was
collected by two non-overlapping cameras from two different view angles at a
busy airport and there are many occlusions and truncations, as well as large
illumination changes, as shown in Fig. 3. Since there are too few images for
each person, we follow the setting in [17] and have only one image per person for
the gallery sets (so they are actually not sets any more), and the other images for the query sets.
(Fig. 3 panels: (a) the two non-overlapping views; (b) images of sampled individual persons.)

Fig. 3. The i-LIDS dataset for person re-identification. (a) shows the two non-overlapping
views selected for the dataset. (b) lists the cropped images of several
persons: the first two rows show the great intra-class variations, including
viewpoint, pose, illumination, occlusion and background changes, while the last two
rows present the rather small between-class variations between some pairs of persons.
Red/black bounding boxes are used to separate different persons.

The same color and texture features as used in [17] and [23] were adopted in our
experiments.

Comparison with Existing Methods: We compare our approach with the
state-of-the-art methods from each of the three groups mentioned in Section 2.
Concretely, they are: MRCG [4] for set-based signature generation, SDALF [3] for
direct set-to-set matching, and CHISD (linear) for set-based distance finding. Note
that the sparsity-based SANP model is not suitable here due to the extremely
small set sizes (N = 1 for gallery sets and on average N = 3 for query sets). These
represent the latest and so far the most powerful methods in each specific group.
Unfortunately, these three groups of methods are all unsupervised and, as far as
we are aware, there is no existing work on using supervised learning to directly
optimize multiple-shot (i.e. set-based) re-identification. Therefore, to show the
advantage of set-based recognition, we also present here the best results from
two recently proposed single-image based classification and ranking algorithms:
RankSVM [24] and PRDC [17], with exactly the same feature representation
and experimental setting as ours. For the loss function in our model, we
used the Mean Reciprocal Rank (MRR) as in [14]. The trade-off parameter C
was set to 10 without cross-validation. Since we randomly sampled the data for
training and testing, all experiments were repeated 10 times for averaging.
Results and Analysis: As shown in Table 2, using simple features for repre-
sentation and training with about the same amount of data as that for testing
Table 2. Top ranked matching rate (%) on the i-LIDS dataset, where r means rank

             Unsupervised        p = 50
Methods      SDALF    MRCG    RankSVM   PRDC    CHISD   SBDR_CHISD
r = 1        34.96    45.80   37.41     37.83   41.80   46.60
r = 5        60.92    66.81   63.02     63.70   68.80   71.60
r = 10       73.36    75.21   73.50     75.09   83.60   80.40
r = 20       83.78    83.61   88.30     88.35   94.40   90.40

(the number of individuals for testing p = 50), SBDR_CHISD performs significantly
better than SDALF and MRCG, which have spent great effort on feature
design. For comparison with other learning-based methods, we choose different
training-test ratios by changing p, following PRDC [17]. Though in that paper
PRDC was claimed to be less prone to overfitting than other learning-based approaches,
the experimental results presented in Table 3 show that our proposed SBDR
model is even more promising when the training set is small (degrading less
as the training set shrinks). Overall, SBDR_CHISD achieves better performance
(lower ranks matter more) than CHISD, indicating that SBDR_CHISD learns a
better feature space for the set-based distance metric than the original space.

Table 3. Top ranked matching rate (%) on the i-LIDS dataset, where r means rank

             p = 30                                    p = 80
Methods      RankSVM   PRDC    CHISD   SBDR_CHISD   RankSVM   PRDC    CHISD   SBDR_CHISD
r = 1        42.96     44.05   50.67   53.67        31.73     32.60   37.87   37.75
r = 5        71.30     72.74   76.00   79.00        55.69     54.55   61.75   64.13
r = 10       85.15     84.69   90.00   88.67        67.02     65.89   75.37   76.38
r = 20       96.99     96.29   98.67   97.67        77.78     78.30   84.13   84.63

5 Conclusions and Future Work

In this paper we present a novel model called Set Based Discriminative Ranking
for set-based recognition. It simultaneously optimizes the set-to-set geometric
distance finding and the feature space projection, resulting in a discriminative
set-distance-based model. As far as we are aware, this is the first time a globally
optimized learning-based model has been proposed for solving this challenging problem.
We demonstrate its superiority by comparing with state-of-the-art methods
on two representative object recognition tasks: face recognition and person
re-identification. Since our model is a general approach which makes no assumptions
on the data, future work can apply it to other object recognition
tasks, along with other related problems like image search and retrieval.

References
1. Cevikalp, H., Triggs, B.: Face recognition based on image sets. In: CVPR, pp. 2567-2573 (2010)
2. Hu, Y., Mian, A.S., Owens, R.: Sparse approximated nearest points for image set classification. In: CVPR, pp. 121-128 (2011)
3. Farenzena, M., Bazzani, L., Perina, A., Murino, V., Cristani, M.: Person re-identification by symmetry-driven accumulation of local features. In: CVPR (2010)
4. Bak, S., Corvee, E., Bremond, F., Thonnat, M.: Multiple-shot human re-identification by Mean Riemannian Covariance Grid. In: Advanced Video and Signal-Based Surveillance (2011)
5. Arandjelovic, O., Shakhnarovich, G., Fisher, J., Cipolla, R., Darrell, T.: Face recognition with image sets using manifold density divergence. In: CVPR, pp. 581-588 (2005)
6. Yamaguchi, O., Fukui, K., Maeda, K.: Face recognition using temporal image sequence. In: IEEE International Conference on Automatic Face and Gesture Recognition, p. 318 (1998)
7. Li, X., Fukui, K., Zheng, N.: Image-set based face recognition using boosted global and local principal angles. In: Zha, H., Taniguchi, R.-i., Maybank, S. (eds.) ACCV 2009, Part I. LNCS, vol. 5994, pp. 323-332. Springer, Heidelberg (2010)
8. Kim, T.K., Kittler, J., Cipolla, R.: Discriminative learning and recognition of image set classes using canonical correlations. IEEE Trans. Pattern Anal. Mach. Intell. 29, 1005-1018 (2007)
9. Wang, R., Shan, S., Chen, X., Gao, W.: Manifold-manifold distance with application to face recognition based on image set. In: CVPR (2008)
10. Bazzani, L., Cristani, M., Perina, A., Farenzena, M., Murino, V.: Multiple-shot person re-identification by HPE signature. In: ICPR, pp. 1413-1416 (2010)
11. Gheissari, N., Sebastian, T.B., Hartley, R.: Person reidentification using spatiotemporal appearance. In: CVPR, pp. 1528-1535 (2006)
12. Tuzel, O., Porikli, F., Meer, P.: Region covariance: A fast descriptor for detection and classification. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006, Part II. LNCS, vol. 3952, pp. 589-600. Springer, Heidelberg (2006)
13. Prosser, B., Zheng, W.S., Gong, S., Xiang, T.: Person re-identification by support vector ranking. In: BMVC (2010)
14. Wu, Y., Mukunoki, M., Funatomi, T., Minoh, M., Lao, S.: Optimizing mean reciprocal rank for person re-identification. In: 2011 8th IEEE International Conference on Advanced Video and Signal-Based Surveillance (AVSS), p. 6 (2011)
15. Joachims, T.: A support vector method for multivariate performance measures. In: ICML, pp. 377-384 (2005)
16. McFee, B., Lanckriet, G.: Metric learning to rank. In: ICML, pp. 775-782 (2010)
17. Zheng, W.-S., Gong, S., Xiang, T.: Person re-identification by probabilistic relative distance comparison. In: CVPR, pp. 649-656 (2011)
18. Joachims, T., Finley, T., Yu, C.N.: Cutting-plane training of structural SVMs. Machine Learning 77, 27-59 (2009)
19. Lee, K., Ho, J., Yang, M.H., Kriegman, D.: Video-based face recognition using probabilistic appearance manifolds. In: CVPR, pp. 313-320 (2003)
20. Gross, R., Shi, J.: The CMU motion of body (MoBo) database. Technical Report CMU-RI-TR-01-18, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA (2001)
21. Schwartz, W.R., Davis, L.S.: Learning discriminative appearance-based models using partial least squares. In: SIBGRAPI, pp. 322-329 (2009)
22. Zheng, W.S., Gong, S., Xiang, T.: Associating groups of people. In: BMVC (2009)
23. Gray, D., Tao, H.: Viewpoint invariant pedestrian recognition with an ensemble of localized features. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 262-275. Springer, Heidelberg (2008)
24. Joachims, T.: Optimizing search engines using clickthrough data. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 133-142 (2002)
A Global Hypotheses Verification Method for 3D Object Recognition

Aitor Aldoma(1), Federico Tombari(2), Luigi Di Stefano(2), and Markus Vincze(1)

(1) Vision4Robotics Group, ACIN, Vienna University of Technology
(2) Computer Vision Lab., DEIS, University of Bologna
{aa,vm}@acin.tuwien.ac.at, {federico.tombari,luigi.distefano}@unibo.it

Abstract. We propose a novel approach for verifying model hypotheses
in cluttered and heavily occluded 3D scenes. Instead of verifying one
hypothesis at a time, as done by most state-of-the-art 3D object recognition
methods, we determine object and pose instances according to a
global optimization stage based on a cost function which encompasses
geometrical cues. Peculiar to our approach is the inherent ability to detect
significantly occluded objects without increasing the number of false
positives, so that the operating point of the object recognition algorithm
can nicely move toward a higher recall without sacrificing precision. Our
approach outperforms the state of the art on a challenging dataset including
35 household models obtained with the Kinect sensor, as well as on the
standard 3D object recognition benchmark dataset.

1 Introduction

Object recognition has been extensively pursued during the last decade within
application scenarios such as image retrieval, robot grasping and manipulation,
scene understanding and place recognition, human-robot interaction, and localization
and mapping. A popular approach to tackle object recognition, especially
in robotic and retrieval scenarios, is to deploy 3D data, motivated by its inherently
higher effectiveness compared to 2D images in dealing with occlusions
and clutter, as well as by the possibility of achieving 6-degree-of-freedom (6DOF)
pose estimation of arbitrarily shaped objects. Moreover, the recent advent of low-cost,
real-time 3D cameras, such as the Microsoft Kinect and the ASUS Xtion,
has turned 3D sensing into an easily affordable technology. Nevertheless, object
recognition remains an unsolved task, particularly in challenging real-world
settings involving texture-less and/or smooth objects, significant occlusions and
clutter, and different sensing modalities and/or resolutions (see Fig. 1).

Algorithms for 3D object recognition can be divided into local and global ones.
Local approaches extract repeatable and distinctive keypoints from the 3D surface
of the models in the library and the scene, each being then associated with
a 3D descriptor of its local neighborhood [1-5]. Scene and model keypoints are
successively matched via their associated descriptors to attain point-to-point
correspondences. Once correspondences are established, they are usually

Fig. 1. With the proposed approach, heavily occluded and cluttered scenes (left) are
handled by evaluating a high number of hypotheses (center), then retaining only those
providing a coherent interpretation of the scene according to a global optimization
framework based on geometric cues (right)

clustered together by taking into account geometrical constraints derived from
the underlying assumption that model-to-scene transformations are rigid. Clustered
correspondences define model hypotheses, i.e. subsets of correspondences
holding consensus for a specific 6DOF pose instance of a given model in the
current scene [1, 3, 6, 5, 7, 8]. Conversely, global methods, e.g., [9, 10], compute
a single descriptor which encompasses the whole object shape: in the presence of
clutter and/or occlusions this requires the scene to be pre-processed by a suitable
3D segmentation algorithm capable of extracting the individual object instances.

These 3D pipelines usually comprise an additional final step whereby object
hypotheses are further verified so as to reject false detections. However, unlike
the previous stages, this final Hypothesis Verification (HV) step has been relatively
unexplored so far, with only a few techniques explicitly addressing it [11, 3, 6, 5].
The most common HV paradigm consists in considering one hypothesis at a time
and thresholding a consensus score depending on the amount of scene points
explained by the transformed model points. Hence, this paradigm disregards
interactions between different hypotheses, which implies the inability to detect
strongly occluded objects (scoring low in terms of explained scene points), unless
the consensus threshold is kept low, resulting in numerous false detections.

This paper proposes a study focused on the HV stage including three main
contributions: (i) a careful analysis of the geometrical cues to be deployed within the
HV stage, taking into account model-to-scene/scene-to-model fitting, occlusions
and model-to-model interactions; (ii) a more principled approach to the
HV stage where, instead of considering each model hypothesis individually, we
take into account the whole set of hypotheses as a global scene model by formalizing
the HV problem in terms of the minimization of a suitable global cost
function, trying to maximize the number of recognized models while taking into
account the aforementioned cues. Due to the computational burden involved
in the minimization of such a global cost function for relatively large solution
spaces, we explore the use of Simulated Annealing, an approximate method, to
retrieve accurate solutions within a limited amount of time and computational
resources. Finally, (iii) a complete local 3D recognition pipeline to efficiently
generate the model hypotheses to be validated by the HV stage. By means of experimental
results we demonstrate that the proposed approach neatly outperforms
state-of-the-art HV algorithms. In this respect, a major advantage provided by
the global HV approach is the dramatic increase of correct recognitions,
in particular weak ones, without increasing the number of false positives.
Furthermore, our approach brings a significant reduction of the number
of hard thresholds required by the recognition pipeline, thus providing a general
framework capable of handling different scenarios governed by a few parameters,
some easily derivable from the characteristics of the processed data.

2 Related Work

So far only a few methods have outlined a specific proposal for the HV stage. In
[1, 11], using the correspondences supporting a hypothesis as seeds, a set of scene
points is grown by iteratively including neighboring points which lie closer than
a pre-defined distance to the transformed model points. If the final set of points
is larger than a pre-defined fraction of the number of model points (from one
fourth to one third of the number of model points), the hypothesis is selected
and ICP is selectively run on the attained set of points in order to refine the object's
pose. Obviously, one disadvantage of such an approach is that it cannot handle
levels of occlusion higher than 75%.

The HV method proposed in [3] ranks hypotheses based on the quality of
supporting correspondences, so that they are then verified sequentially starting
from the highest rank. To verify each hypothesis, after ICP, two terms are
evaluated: the former, similarly to [1], is the ratio between the number of model
points having a correspondent in the scene and the total number of model points;
the latter is the product between this ratio and the quality score of the supporting
correspondences. This step requires setting three different thresholds. Two additional
checks are then enforced, so as to prune hypotheses based on the number
of outliers (model points without a correspondent in the scene) as well as on
the amount of occlusion generated by the current hypothesis with respect to the
remaining scene points. Again, these two additional checks require three thresholds.
If a hypothesis gets through each of these steps, it is accepted and its
associated scene points are eliminated from the scene, so that they will not be
taken into account when the next hypothesis is verified.

In [6], for each model yielding correspondences, the set of hypotheses associated
with the model is first pruned by thresholding the number of supporting
correspondences. Then, the best hypothesis is chosen based on the overlap area
A(H_best) between the model associated with that hypothesis and the scene, and
the initial pose is refined by means of ICP. Finally, the accuracy of the selected
hypothesis is given by the ratio A(H_best) / M_a(H_best), where M_a(H_best) is the total visible
surface of the model within the bounding box of the scene. The model is said
to be present in the scene if its accuracy is above a certain threshold and, upon
acceptance, the scene points associated with A(H_best) are removed.

Papazov and Burschka [5] evaluate how well a model hypothesis fits the
scene by means of an acceptance function which takes into account, as a
bonus, the number of transformed model points falling within a certain distance
from a scene point (support) and, as a penalty, the number of model points
that would occlude other scene points (i.e. their distance from the closest scene
point is above a threshold but they lie on the line of sight of a scene point). A
hypothesis is accepted by thresholding its support and occlusion sizes. Given
the hypotheses fulfilling the requirements set forth by the acceptance function, a
conflict graph is built, wherein forks are created every time two hypotheses share
a percentage of scene points above a (third) threshold. Surviving hypotheses are
then selected by means of a non-maxima suppression step carried out over the graph
and based on the acceptance function. This approach is the most similar to
ours, as, thanks to the conflict graph, interaction between hypotheses is taken
into account. Nevertheless, their method is only partially global, since the first
stage of the verification still relies on pruning hypotheses one at a time, and a
winner-take-all strategy is used for conflicting hypotheses.

Relevant to our work but aimed at piecewise surface segmentation on range
images, Leonardis et al. proposed in [12] a model selection strategy based on the
minimization of a cost function to produce a globally consistent solution. Even
though the minimization is formalized in terms of a Quadratic Boolean Problem,
the final solution is still attained by considering hypotheses sequentially
by means of a winner-take-all strategy.

3 Proposed Method

This section illustrates the proposed HV approach for 3D object recognition.
After introducing notation, we analyze the geometrical cues that ought to be
taken into account for global hypotheses verification. Then, we illustrate how
to formulate the cost function and tackle the optimization problem. Finally, we
describe the overall 3D object recognition pipeline.

3.1 Notation, Grounds and Geometrical Cues


We consider a model library consisting of m point clouds, M = {M_1, ..., M_m},
together with a scene point cloud, S. We address the general case of S containing
any number of instances from M (possibly none), including the
case of multiple instances of the same model. The pose T which relates a model
to its instance in S is given by a 6DOF rigid body transformation (i.e. a 3D
rotation and translation). We assume that the previous stages of the 3D pipeline
provide a set of n recognition hypotheses H = {h_1, ..., h_n}, each hypothesis h_i
given by the pair (M_{h_i}, T_{h_i}), with M_{h_i} in M being the model hypothesis and
T_{h_i} the pose hypothesis which relates M_{h_i} to S.

The goal of the proposed method is to choose an arbitrary number (up to n)
of items belonging to H in order to maximize the number of correct recognitions
(TPs) while minimizing the number of wrong recognitions (FPs). For this purpose, we
define and minimize a suitable cost function over the solution space of
the HV problem. In particular, we denote a solution as a set of boolean variables
X = {x_1, ..., x_n} having the same cardinality as H, with each x_i in B = {0, 1}
indicating whether the corresponding hypothesis h_i in H is dismissed/validated
(i.e. x_i = 0/1). Hence, the cost function can be expressed as F(X): B^n -> R,
B^n being the solution space, of cardinality 2^n.
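Minimizing F exactly over the 2^n Boolean assignments quickly becomes infeasible for large n; a generic single-bit-flip simulated-annealing minimizer, shown here purely as an illustration of the kind of approximate solver mentioned in Sect. 1 (the cost function, cooling schedule and parameter names are all illustrative), could look like this:

    import numpy as np

    def anneal_boolean(cost, n, n_steps=5000, t0=1.0, t_min=1e-3, rng=None):
        """Minimize cost(x) over x in {0,1}^n with single-bit-flip simulated annealing.

        cost: callable mapping a boolean numpy array of length n to a float.
        Returns the best solution found and its cost.
        """
        rng = np.random.default_rng() if rng is None else rng
        x = np.zeros(n, dtype=bool)                  # start with all hypotheses inactive
        fx = cost(x)
        best, best_f = x.copy(), fx
        for step in range(n_steps):
            t = max(t_min, t0 * (1.0 - step / n_steps))  # linear cooling schedule
            cand = x.copy()
            cand[rng.integers(n)] ^= True            # flip one hypothesis on/off
            fc = cost(cand)
            if fc < fx or rng.random() < np.exp((fx - fc) / t):
                x, fx = cand, fc
                if fx < best_f:
                    best, best_f = x.copy(), fx
        return best, best_f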

Occlusions. Given a hypothesis h_i, model parts not visible in the scene due to
self-occlusions or occlusions generated by other scene parts should be removed,
since they cannot provide consensus for h_i. Thus, given an instance of X, for
each x_i = 1 we compute a modified version of M_{h_i} by transforming the model
according to T_{h_i} and removing all occluded points. Hereinafter this new point
cloud will be denoted as M^v_{h_i}.

Establishing whether a model point is visible or occluded can be done efficiently
based on the range image associated with the scene point cloud, possibly
generating the range image from the point cloud whenever the former is not
available directly. Thus, similarly to [5, 3], a point p in M_{h_i} is considered occluded
if its back-projection into the rendered range map of the scene falls onto
a valid depth point and its depth is bigger than the corresponding depth of the
scene. The same reasoning applies to self-occlusions.
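A minimal sketch of such a visibility test, assuming a pinhole camera with known intrinsics K and a dense scene depth map in metres (all names, the tolerance, and the handling of points projecting outside the image are illustrative simplifications, not the authors' code):

    import numpy as np

    def visible_model_points(model_pts, depth_map, K, occ_tol=0.01):
        """Keep model points (already transformed into the camera frame) that are
        not occluded by the scene depth map (simplified sketch)."""
        H, W = depth_map.shape
        z = model_pts[:, 2]
        in_front = z > 0
        u = np.zeros(len(z), dtype=int)
        v = np.zeros(len(z), dtype=int)
        u[in_front] = np.round(K[0, 0] * model_pts[in_front, 0] / z[in_front] + K[0, 2]).astype(int)
        v[in_front] = np.round(K[1, 1] * model_pts[in_front, 1] / z[in_front] + K[1, 2]).astype(int)
        inside = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)

        visible = np.zeros(len(model_pts), dtype=bool)
        scene_z = depth_map[v[inside], u[inside]]
        # Occluded if a valid scene depth lies clearly in front of the model point.
        occluded = (scene_z > 0) & (scene_z < z[inside] - occ_tol)
        visible[inside] = ~occluded
        return model_pts[visible]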

Cues i, ii) Scene Fitting and Model Outliers. Once the set of visible points
of a model, M^v_{h_i}, has been computed, we want to determine whether these points
have a correspondent in the scene, i.e. how well they explain scene points. This
cue was exploited by thresholding scene points based on a fixed distance value
in the HV stages proposed in [1, 3, 6, 5]. Here, we refine this approach
by measuring how well each visible model point locally fits the scene. Hence, for
each x_i = 1 we associate with each scene point p a weight which estimates how
well the point is explained by h_i, by measuring the local fitting with respect to
its nearest neighbor in M^v_{h_i}, denoted N(p, M^v_{h_i}):

    \delta_{h_i}(p) = \delta\big(p, N(p, M^v_{h_i})\big)    (1)

The local fitting between two points p and q is measured by the function \delta(p, q), which
accounts for their distance as well as the local alignment of the surfaces, as is typically
done, e.g., to assess the quality of registration between two surfaces. Referring
to the normals at p and q as n_p and n_q respectively, \delta(p, q) is defined as

    \delta(p, q) = \begin{cases} \left(1 - \frac{\|p - q\|_2}{\rho_e}\right)(n_p \cdot n_q), & \|p - q\|_2 \le \rho_e \\ 0, & \text{elsewhere} \end{cases}    (2)

where \cdot is the dot product, which is rounded to 0 whenever negative to avoid
negative weights (note that all normals have a consistent orientation based on
the position of the sensor). As for the distance weight, so far we have employed
a simple linear function truncated to 0 when the distance between p and q gets
bigger than a threshold \rho_e, though in principle other choices may be considered,
e.g. one of the several M-estimators proposed in the literature.
Additionally, we point out that the use of weights performs a soft thresholding
for visible points, which helps in case \rho_e is not chosen properly.
For each solution X, we associate with each scene point p the sum of all
weights related to active hypotheses:

    \Omega_X(p) = \sum_{i=1}^{n} \delta_{h_i}(p) \cdot x_i    (3)

Then, a scene point is said to be explained by X if \Omega_X(p) > 0, and unexplained
otherwise (\Omega_X(p) = 0). Moreover, a point p in M^v_{h_i} is termed an outlier for
hypothesis h_i if it is not fitted by any scene point according to (2); it is termed
an inlier otherwise. Hereinafter, we will denote by \Phi_{h_i} the set of outliers for
hypothesis h_i and by |\Phi_{h_i}| the cardinality of each such set. In the bottom left
of Fig. 2-a) and -b), we provide an example of the classification of the model points
associated with a solution into outliers and inliers.

The number of explained scene points and of outliers are powerful geometrical
cues for evaluating the goodness of a solution X within a global HV framework.
In particular, i) the number of explained scene points should be maximized;
and ii) the number of outliers associated with all active hypotheses should be
minimized.
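The fitting weights and the explained/outlier bookkeeping of Eqns. 1-3 can be sketched with KD-tree nearest-neighbour queries (symbols follow the reconstruction above; this is an illustration, not the authors' implementation):

    import numpy as np
    from scipy.spatial import cKDTree

    def fitting_weights(scene_pts, scene_normals, model_pts, model_normals, rho_e):
        """Per-scene-point weight delta_hi(p) for one active hypothesis (Eqns. 1-2)
        and the hypothesis' outlier count (visible model points fitting no scene point)."""
        scene_tree = cKDTree(scene_pts)

        # Model -> scene: visible model points without a close scene point are outliers.
        d_ms, _ = scene_tree.query(model_pts)
        n_outliers = int((d_ms > rho_e).sum())

        # Scene -> model: weight each scene point by its nearest visible model point.
        model_tree = cKDTree(model_pts)
        d_sm, idx_sm = model_tree.query(scene_pts)
        dot = np.clip((scene_normals * model_normals[idx_sm]).sum(-1), 0.0, None)
        weights = np.where(d_sm <= rho_e, (1.0 - d_sm / rho_e) * dot, 0.0)
        return weights, n_outliers

    # Summing 'weights' over all active hypotheses gives Omega_X(p); a scene point
    # counts as explained when that sum is positive.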

Cue iii) Multiple Assignment. An important cue highlighting the existence
of incoherent hypotheses within a solution is a surface patch in the
scene being fitted by multiple models. According to our definitions, this can be
exploited by penalizing scene points explained by two or more hypotheses (see
Fig. 2-a) and -b) for a graphical explanation). Thus, given a solution X and a
scene point p, we define a function \Lambda_X(p)

    \Lambda_X(p) = \begin{cases} \sum_{i=1}^{n} \mathrm{sgn}\big(\delta_{h_i}(p)\big), & \sum_{i=1}^{n} \mathrm{sgn}\big(\delta_{h_i}(p)\big) > 1 \\ 0, & \text{elsewhere} \end{cases}    (4)

which counts the number of conflicting hypotheses with respect to p according to
the definition given in (1). Again, in the bottom right of Fig. 2-a) and -b), we show
an example of scene points explained by either a single or multiple hypotheses.
Hence, another cue for global HV to be enforced through \Lambda_X(p) is that iii) the
number of multiple hypothesis assignments to scene points should be minimized.

Cue iv) Clutter. In many application scenarios not all sensed shapes can be
fitted with some known object model. Exceptions might occur, for instance,
in controlled industrial environments where all the objects making up the
scene are known a priori. More generally, though, several visible scene parts
which do not correspond to any model in the library might locally, and erroneously,
fit some model shapes, potentially leading to false detections. Maximizing
the number of explained scene points (i.e. cue i)), although useful to increase
the number of correct recognitions, nevertheless favors this circumstance. On the
other hand, counting the outliers associated with these false positives (cue ii))
might not help, since the parts of the model which do not fit the scene might

Fig. 2. Proposed cues for global optimization. In both a) and b), top left: a solution
consisting of a set of active model hypotheses superimposed onto the scene. Top right:
scene labeling via smooth surface segmentation. Bottom left: classification of visible
model points into inliers (orange) and outliers (green). Bottom right: classification
of scene points into explained by a single hypothesis (blue), by multiple hypotheses
(black), unexplained (red), cluttered due to region growing (yellow) and due to proximity
(purple).

On the other hand, computing the outliers associated with these false positives (cue ii))
might not help, since the parts of the model which do not fit the scene might
turn out occluded or outside the field of view of the 3D sensor. This is the case,
e.g., of the chicken in Fig. 2-b) being wrongly fitted to the rhino (in particular,
the bottom left image shows that the potential outliers turn out indeed occluded
and hence the wrong hypothesis is not significantly penalized).
To counteract the effect of clutter, we devised an approach, inspired by
surface-based segmentation [13], aimed at penalizing a hypothesis that locally
explains some part of the scene but not nearby points belonging to the same
smooth surface patch. This is also useful to penalize hypotheses featuring correct
recognition but wrong alignment of the model in the scene. Surface-based
segmentation methods are based on the assumption that object surfaces are continuous
and smooth. Continuity is usually assessed by the density of points in space
and smoothness through surface normals. Following this idea, scene segmentation
is performed by identifying smooth clusters of points. Each new cluster
is initialized with a random point, then it is grown by iteratively including all
points p_j lying in its neighborhood which show a similar normal:

$$\|p_i - p_j\|_2 < t_d \;\wedge\; n_i \cdot n_j > t_n \qquad (5)$$


At the end of the process, each scene point is associated with a unique label l(p).
In the top right of Fig. 2-a) and -b), we report two examples of scene segmentation.

Hence, given a solution X, likewise in (3), we compute a clutter term, Υ_X(p),
at each unexplained scene point p, so as to penalize those that are likely to
belong to the same surface as nearby explained points:

$$\Upsilon_{X}(p) = \sum_{i=1}^{n} x_i \cdot \kappa\big(p, N(p, E_{h_i})\big) \qquad (6)$$

Υ_X(p) consists of contributions, κ(p, N(p, E_{h_i})), associated with each active
hypothesis h_i, where E_{h_i} is the set of scene points explained by h_i. Analogously to
δ(p, q), we want κ(p, q) to weight clutter based on the proximity of p to its
nearest neighbor, q ∈ E_{h_i}, as well as according to the alignment of their surface
patches:


$$\kappa(p, q) = \begin{cases} \gamma, & \|p - q\|_2 \le \rho_c \;\wedge\; l(p) = l(q) \\ \left(1 - \dfrac{\|p - q\|_2^2}{\rho_c^2}\right)(n_p \cdot n_q), & \|p - q\|_2 \le \rho_c \;\wedge\; l(p) \ne l(q) \\ 0, & \text{elsewhere} \end{cases} \qquad (7)$$

The radius given by ρ_c defines the spatial support related to κ(p, q), while γ
is a constant parameter used to penalize unexplained points that have been
assigned to the same cluster by the smooth segmentation step. Thanks to the
above formulation, wrong active hypotheses, such as the milk bottle in Fig. 2-a)
and the chicken model in Fig. 2-b), cause a significantly valued clutter term
Υ_X (e.g. the purple and yellow regions in the bottom right part of the Figure),
which will penalize their validation within the global cost function. Therefore,
we have derived the last cue: iv) the amount of unexplained scene points close
to an active hypothesis according to (7) should be minimized.
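As an illustration of the smooth segmentation used by the clutter cue, a greedy region-growing pass in the spirit of [13] is sketched below; seeds are visited in index order rather than randomly, and the scipy neighbourhood query is our own choice rather than the authors'.

```python
import numpy as np
from scipy.spatial import cKDTree

def smooth_segmentation(points, normals, t_d, t_n):
    """Cluster scene points into smooth patches (Eq. 5): a cluster grows by adding
    neighbours within distance t_d whose normals agree with the current point
    above t_n. Returns one integer label l(p) per point."""
    tree = cKDTree(points)
    labels = np.full(len(points), -1, dtype=int)
    current = 0
    for seed in range(len(points)):          # the paper initialises clusters from random points
        if labels[seed] != -1:
            continue
        labels[seed] = current
        queue = [seed]
        while queue:
            i = queue.pop()
            for j in tree.query_ball_point(points[i], t_d):
                if labels[j] == -1 and np.dot(normals[i], normals[j]) > t_n:
                    labels[j] = current
                    queue.append(j)
        current += 1
    return labels
```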

3.2 Cost Function

We have so far outlined four cues i)-iv). While i) is aimed at increasing as much
as possible the number of recognized model instances (thus TPs and FPs), ii),
iii) and iv) try to penalize unlikely hypotheses through geometrical constraints,
so as to minimize false detections (FPs). The cost function F we are looking for
is obtained as the sum of the terms related to the cues that need to be enforced
within our optimization framework:

$$F(X) = f_S(X) + \lambda \cdot f_M(X) \qquad (8)$$

where λ is a regularizer, and f_S, f_M account, respectively, for the cues defined on
scene points and model points:

$$f_S(X) = \sum_{p \in S} \big( \Lambda_{X}(p) + \Upsilon_{X}(p) - \Omega_{X}(p) \big) \qquad (9)$$

$$f_M(X) = \sum_{i=1}^{n} |\Phi_{h_i}| \cdot x_i \qquad (10)$$

The global cost formulation in (8) easily allows plugging in additional cost terms
derived from geometric constraints or from specific application characteristics.
For instance, the relatively common assumption, in indoor robotic scenarios,
that objects are placed on a planar surface [9, 10] would allow penalizing
hypotheses having object parts lying below, or intersecting with, this surface.
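Assuming the per-point terms and the per-hypothesis outlier counts have been precomputed for a candidate solution, the global cost can be assembled as follows. The sign convention (explained points lower the cost, the other cues raise it) follows cues i)-iv); the function is a sketch, not the authors' code.

```python
import numpy as np

def global_cost(omega_x, lambda_x, upsilon_x, outlier_counts, x, lam):
    """F(X) = f_S(X) + lam * f_M(X)  (Eqs. 8-10).

    omega_x, lambda_x, upsilon_x: per-scene-point terms for the current solution X.
    outlier_counts: |Phi_hi| for every hypothesis; x: binary solution; lam: regularizer.
    """
    f_s = float(np.sum(lambda_x + upsilon_x - omega_x))      # scene-point cues (Eq. 9)
    f_m = float(np.dot(outlier_counts, x))                   # model outlier cue (Eq. 10)
    return f_s + lam * f_m
```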

3.3 Optimization
Having defined the cost function, we need to devise a solver for the proposed
optimization problem. As previously pointed out, we are looking for a solution
X̃ which minimizes the function F(X) over the solution space B^n:

$$\widetilde{X} = \operatorname*{argmin}_{X \in \mathbb{B}^n} \big\{ F(X) = f_S(X) + \lambda \cdot f_M(X) \big\} \qquad (11)$$

As the cardinality of the solution space is 2^n, even with a relatively small number
of recognition hypotheses (e.g. in the order of tens) exhaustive enumeration
becomes prohibitive. As the defined problem belongs to the class of non-linear
pseudo-Boolean optimization, we adopt a classical approach from operations research
based on simulated annealing (SA) [14]. SA is a meta-heuristic algorithm
that optimizes a certain function without the guarantee of finding the global optimum.
It randomly explores the solution space by applying moves from a solution X^i
to another valid solution X^j. In order to deal with local optima, the algorithm
allows moves which increase the cost of the target solution. Such bad moves
are usually performed during the initial iterations (when the temperature of the
system is high), whilst they become less and less probable as the optimization
goes on (system cooling down). The algorithm stops when the temperature has
reached a minimum or no improvement has been achieved in the last N moves,
which occurs either when the algorithm reaches the global optimum or gets trapped
in a local minimum. We initialize SA assuming all hypotheses to be active,
i.e. X^0 = {1, ..., 1}, each move then consisting in switching on/off a specific
hypothesis at a time.
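A bare-bones annealing loop over B^n is sketched below for illustration only: unlike the MetsLib-based implementation used in the paper, it re-evaluates the full cost at every move instead of updating it incrementally, and the cooling schedule and constants are arbitrary.

```python
import math
import random

def anneal(cost_fn, n, iters=6000, t0=1.0):
    """Minimise cost_fn over binary solutions of length n (Eq. 11) with simulated
    annealing: start with all hypotheses active and flip one bit per move,
    accepting uphill moves with a temperature-dependent probability."""
    x = [1] * n
    current = best_cost = cost_fn(x)
    best = x[:]
    for k in range(iters):
        temp = max(t0 * (1.0 - k / iters), 1e-9)                 # linear cooling
        i = random.randrange(n)
        x[i] ^= 1                                                # switch hypothesis i on/off
        candidate = cost_fn(x)
        if candidate <= current or random.random() < math.exp((current - candidate) / temp):
            current = candidate
            if candidate < best_cost:
                best, best_cost = x[:], candidate
        else:
            x[i] ^= 1                                            # rejected: undo the move
    return best, best_cost
```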
In our experiments we relied on the SA implementation available in MetsLib¹,
based on linear cooling and on the default parameter values, except for the
number of iterations, for which we used a high enough value (6000) so that different
runs yielded the same results. We wish to point out that the proposed formulation
allows the terms included in the cost function to be pre-computed for each single
hypothesis (those related to f_M) and scene point (those related to f_S), so that at
each new move the cost function can be computed efficiently, and independently
of the total amount of scene points and number of hypotheses, by updating
only the hypothesis (and related scene points) being switched on/off. In Sec.
4 we will show how, despite being an approximate optimization algorithm, in
the addressed scenarios SA can yield accurate results while requiring reasonably
low computation times (typically below 2 seconds for hypothesis sets of up to 200
elements, see Fig. 3-a) and Fig. 4-b)).

¹ www.coin-or.org/metslib

3.4 3D Recognition Pipeline

We briefly present the complete pipeline which will be used for our object recognition
experiments. This pipeline is based on local features, although we wish
to point out that the proposed algorithm can be straightforwardly plugged into
pipelines based on global features, or into recognition pipelines combining several
descriptors with different strengths, as long as a 6DOF pose is provided.

Input Data. We assume input data to be either in the form of 3D meshes
(2.5D/3D) or point clouds. In most practical scenarios, scenes will be represented
by range images obtained from a single viewpoint. To build up the model library,
we transform each full 3D model into 2.5D clouds by placing a virtual camera
on a tessellated sphere around the mesh and rendering it from the centroids of
the triangles building up the tessellated sphere. In our pipeline, an icosahedron is
tessellated once, giving a total of 80 camera positions (i.e., similarly to [9]).
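For example, the 80 virtual camera positions can be obtained from the centroids of a once-subdivided icosahedron (20 faces, each split into 4), as in the sketch below; the vertex and face lists are the standard icosahedron construction, and the rendering step itself is not shown.

```python
import numpy as np

def icosahedron_view_centroids(radius=1.0):
    """Return the 80 centroids of a once-tessellated icosahedron, projected onto a
    sphere of the given radius, to be used as virtual camera positions."""
    t = (1.0 + 5 ** 0.5) / 2.0
    v = np.array([[-1, t, 0], [1, t, 0], [-1, -t, 0], [1, -t, 0],
                  [0, -1, t], [0, 1, t], [0, -1, -t], [0, 1, -t],
                  [t, 0, -1], [t, 0, 1], [-t, 0, -1], [-t, 0, 1]], float)
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    faces = [(0, 11, 5), (0, 5, 1), (0, 1, 7), (0, 7, 10), (0, 10, 11),
             (1, 5, 9), (5, 11, 4), (11, 10, 2), (10, 7, 6), (7, 1, 8),
             (3, 9, 4), (3, 4, 2), (3, 2, 6), (3, 6, 8), (3, 8, 9),
             (4, 9, 5), (2, 4, 11), (6, 2, 10), (8, 6, 7), (9, 8, 1)]
    centroids = []
    for a, b, c in faces:
        ab, bc, ca = (v[a] + v[b]) / 2, (v[b] + v[c]) / 2, (v[c] + v[a]) / 2
        for tri in ((v[a], ab, ca), (v[b], bc, ab), (v[c], ca, bc), (ab, bc, ca)):
            m = np.mean(tri, axis=0)
            centroids.append(radius * m / np.linalg.norm(m))
    return np.array(centroids)        # shape (80, 3)
```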

Keypoint Detection and Description. Keypoints are extracted at uniformly sampled
positions on the surface of models and scene, the parameter ρ_s being the sampling
distance. Then, the SHOT local descriptor [2] is computed at each keypoint
over a support size specified by radius ρ_d. As for the SHOT parameter values, we
have used those originally proposed in [2].

Correspondence Matching and Grouping. Descriptors are then matched to attain
point-to-point correspondences. To handle the case of multiple instances of the
same model, each scene descriptor is matched, via fast indexing [15], against all
model descriptors. We explicitly avoid using a matching threshold to reject weak
correspondences, given the ad-hoc choice of such thresholds and their strong
dependency on the metric being used.
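A thresholdless nearest-neighbour matching step of this kind might look as follows; a KD-tree is used here merely as a stand-in for the FLANN index of [15], and keypoint sampling and SHOT computation are assumed to have been done already.

```python
import numpy as np
from scipy.spatial import cKDTree

def match_descriptors(scene_desc, model_desc):
    """Match every scene descriptor to its nearest model descriptor (L2 distance),
    with no rejection threshold. Returns (scene_index, model_index) pairs."""
    tree = cKDTree(model_desc)
    _, nearest = tree.query(scene_desc, k=1)
    return np.column_stack([np.arange(len(scene_desc)), nearest])
```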
Successively, a Correspondence Grouping (CG) algorithm is run to obtain
the model hypotheses to be fed to the proposed global HV process. Our CG
approach, inspired by [16], iteratively groups subsets of correspondences based
on checking the geometric consistency of pairs of correspondences. In particular,
starting from a seed correspondence c_i = {p_i^M, p_i^S}, p_i^M and p_i^S being,
respectively, the model and scene 3D keypoints in c_i, and looping over all correspondences
not yet grouped, the correspondence c_j = {p_j^M, p_j^S} is added to the
group started by c_i if the following relation holds:

$$\big|\, \|p_i^M - p_j^M\|_2 - \|p_i^S - p_j^S\|_2 \,\big| < \epsilon \qquad (12)$$

ε being a parameter of the method, intuitively representing the inlier tolerance
for the consensus set. A threshold τ is usually deployed to discard subsets supported
by too few correspondences. Given that each subset of correspondences
yielded by CG defines a model hypothesis, the threshold τ influences the final cardinality
of the hypothesis set H (i.e. n) and, consequently, the computational
efficiency of the HV stage. As a further refinement, we run RANSAC on each
subset obtained out of the previous stage, the model being the 6DOF transformation
provided via Absolute Orientation (AO) [17].

Table 1. Parameters used for the proposed object recognition pipeline

                 Det./descr.        CG                HV
Dataset          ρ_s      ρ_d       τ     ε           ρ_e      ρ_c     λ    γ
Laser Scanner    0.5 cm   4 cm      5     0.5 cm      0.5 cm   3 cm    5    3
Kinect           1 cm     5 cm      5     1.5 cm      1 cm     5 cm    5    3

The consensus set tolerance and minimum cardinality for RANSAC are set, respectively, to ε and τ. Finally,
ICP is deployed on each subset to refine the 6DOF pose given by AO. At this
point, the hypothesis set H is ready to be fed to the proposed HV stage.
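The grouping rule in (12) can be sketched as the greedy loop below. It is a simplification of the approach of [16]: geometric consistency is checked against the seed correspondence only, and the subsequent RANSAC/AO/ICP refinement is omitted.

```python
import numpy as np

def group_correspondences(model_pts, scene_pts, eps, tau):
    """Greedy correspondence grouping: correspondence i pairs model_pts[i] with
    scene_pts[i]. A group seeded at i collects every ungrouped j whose model-side
    and scene-side distances to the seed agree within eps (Eq. 12); groups with
    fewer than tau members are discarded."""
    n = len(model_pts)
    used = np.zeros(n, dtype=bool)
    groups = []
    for seed in range(n):
        if used[seed]:
            continue
        group = [seed]
        for j in range(n):
            if used[j] or j == seed:
                continue
            dm = np.linalg.norm(model_pts[seed] - model_pts[j])
            ds = np.linalg.norm(scene_pts[seed] - scene_pts[j])
            if abs(dm - ds) < eps:
                group.append(j)
        if len(group) >= tau:
            used[group] = True
            groups.append(group)
    return groups
```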

4 Experimental Results

Two experiments were conducted to validate the proposed HV algorithm and
the proposed 3D recognition pipeline, as well as to evaluate the suitability of
our approach with respect to different kinds of 3D data. The first experiment
is performed on a novel dataset, referred to hereinafter as Kinect, whereby we
match a set of CAD models against scenes acquired with a Kinect sensor. In this
experiment we compare our HV algorithm to that proposed in [5], as the latter
appears to be currently the best performing HV algorithm (see Fig. 4-a)) and
is able to handle multiple model occurrences in the scene. For a fair comparison,
both algorithms are plugged into exactly the same 3D object recognition
pipeline, i.e. that described in Sec. 3.4. Furthermore, to validate the entire 3D
object recognition approach proposed in this paper, in the second experiment we
evaluate our proposal on the standard benchmark dataset for 3D object recognition
presented in [3], which comprises objects acquired by a laser scanner and
will be referred to as Laser Scanner.
Due to the different nature of the two datasets in terms of scene size, model
library size and noise (Kinect data is noisier than Laser Scanner data, especially
as the distance to the camera increases), parameters were slightly tuned
to accommodate the algorithms to the underlying data (see Table 1). For the
implementation of [5], the same values of ρ_e reported in Table 1 were deployed,
while the remaining parameters were tuned according to their performance on both
datasets: we used a support threshold of 0.08, a penalty threshold of 0.05 and
a value of 0.02 to decide when two hypotheses are in conflict. A TP is scored if
the id of the recognized model matches that in the ground truth and the RMSE
computed on the estimated model pose with respect to the ground truth is under a
certain threshold; otherwise, a FP is scored.

Kinect Dataset. This dataset consists of 50 scenes and a library of 35 models
including typical household objects².

² http://users.acin.tuwien.ac.at/aaldoma/datasets/ECCV.zip

Fig. 3. Kinect dataset: a) results reported by the proposed HV algorithm and that
in [5]; the chart also reports the upper bound on the recognition rate related to the
deployed 3D pipeline. b)-c): qualitative results yielded by the proposed algorithm on a
challenging dataset scene (b) and other simpler scenes (c).

This dataset is particularly challenging given the different nature of the 3D data
between scenes (2.5D, acquired with a Kinect) and models (fully 3D, mainly CAD,
the remaining acquired with a laser scanner). The relatively high number of models
present in the library allows testing how well the proposed approach scales with
the library size. Another relevant challenge of this dataset is represented by the
traits of the model shapes, some of which are highly symmetrical and hardly
descriptive (e.g. bowls, cups, glasses), many of them being characterized by a high
similarity (e.g. different kinds of glasses, different kinds of mugs, ...), as also
vouched by Fig. 3. Fig. 3-a) shows the Recognition vs. Occlusion curve and number
of FPs of the proposed HV algorithm and that in [5], both plugged into the 3D
pipeline presented in Sec. 3.4. The Figure also shows the upper bound curve, i.e.
the maximum recognition rate achievable with the proposed recognition pipeline.
It can be seen that our approach gets really close to the upper bound while
dramatically reducing the number of false positives. Our approach also outperforms
that in [5], in terms of Recognition rates as well as in terms of FPs. The overall
recognition rate on this dataset is 84.1% for the upper bound, 79.5% for the
proposed approach and 73.8% for [5]. Qualitative results obtained by the proposed
algorithm are also provided in Fig. 3-b) and 3-c).

Laser Scanner Dataset. This dataset consists of 50 scenes and 5 models,
and it can currently be regarded as the most popular benchmark for 3D object
recognition.

Fig. 4. Evaluation on the Laser Scanner dataset. a) Comparison of the proposed recognition
pipeline and HV approach against results published in [5, 7, 6]. b) Comparison
of the proposed HV stage and our implementation of that in [5], both plugged into the proposed
3D pipeline.

Fig. 4-a) shows the Recognition vs. Occlusion curve reported by the
proposed 3D pipeline together with the results reported in [5–7]: our method
clearly outperforms the state of the art. Moreover, to the best of our knowledge,
we are the first to achieve a recognition rate of 100% without any false positive.
Fig. 4-b) shows the Recognition vs. Occlusion curve and number of FPs of
the proposed HV stage compared to that proposed by Papazov et al. [5], with
the recognition pipeline proposed in Sec. 3.4 for different numbers of ICP iterations.
The average time required by each combination is also reported. It is worth noting
that, although the results obtained by both HV algorithms are equivalent at 0
and 10 ICP iterations in terms of recognition rate, the proposed algorithm is
able to yield fewer FPs (0 against 2). Then, at 30 ICP iterations, our method
accurately recognizes all models without yielding any FP, while the number of
FPs reported by [5] significantly increases due to the fact that a high number of
ICP iterations tends to locally align incorrect hypotheses to the scene points, so
that their respective support is large enough to be accepted by that method.

5 Conclusions and Future Work

We have proved the effectiveness of the proposed geometrical cues, as well as
of simultaneously considering the interaction between model hypotheses, so as to
dramatically reduce the number of false positives while preserving those correct
recognitions that, due to occlusions, exhibit a small support in the scene. Overall,
the potential of the HV stage to boost the performance of 3D object
recognition pipelines has also been highlighted, which in the authors' opinion
has been underrated so far in the literature.
Despite the relative efficiency of the proposed method in comparison to other
state-of-the-art strategies (see Fig. 4-b)), future work concerns exploiting GPU
parallelism for optimizing the main computational bottlenecks of the proposed
algorithm, namely the ICP stage and the initial computation of the cost terms
during the SA stage. Another direction regards the use of additional cues, par-
ticularly with the aim of penalizing physical intersections between active visible
models. We plan to publicly release the implementation of the proposed HV
algorithm and recognition pipeline in the open source Point Cloud Library.

Acknowledgements. The research leading to these results has received funding


from the Doctoral College on Computational Perception at TU Vienna and the
Austrian Science Foundation under grant agreement No. I513-N23.

References
1. Johnson, A.E., Hebert, M.: Using spin images for efficient object recognition in cluttered 3D scenes. IEEE Trans. PAMI (5) (1999)
2. Tombari, F., Salti, S., Di Stefano, L.: Unique Signatures of Histograms for Local Surface Description. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part III. LNCS, vol. 6313, pp. 356–369. Springer, Heidelberg (2010)
3. Mian, A., Bennamoun, M., Owens, R.: 3D model-based object recognition and segmentation in cluttered scenes. IEEE Trans. PAMI (10) (2006)
4. Mian, A., Bennamoun, M., Owens, R.: On the repeatability and quality of keypoints for local feature-based 3D object retrieval from cluttered scenes. IJCV (2010)
5. Papazov, C., Burschka, D.: An Efficient RANSAC for 3D Object Recognition in Noisy and Occluded Scenes. In: Kimmel, R., Klette, R., Sugimoto, A. (eds.) ACCV 2010, Part I. LNCS, vol. 6492, pp. 135–148. Springer, Heidelberg (2011)
6. Bariya, P., Nishino, K.: Scale-hierarchical 3D object recognition in cluttered scenes. In: Proc. CVPR (2010)
7. Drost, B., Ulrich, M., Navab, N., Ilic, S.: Model globally, match locally: Efficient and robust 3D object recognition. In: Proc. CVPR (2010)
8. Tombari, F., Di Stefano, L.: Object recognition in 3D scenes with occlusions and clutter by Hough voting. In: Proc. PSIVT (2010)
9. Aldoma, A., Blodow, N., Gossow, D., Gedikli, S., Rusu, R.B., Vincze, M., Bradski, G.: CAD-Model Recognition and 6DOF Pose Estimation Using 3D Cues. In: 3DRR Workshop, ICCV (2011)
10. Rusu, R.B., Bradski, G., Thibaux, R., Hsu, J.: Fast 3D recognition and pose using the viewpoint feature histogram. In: Proc. 23rd IROS (2010)
11. Johnson, A.E., Hebert, M.: Surface matching for object recognition in complex three-dimensional scenes. IVC (9) (1998)
12. Leonardis, A., Gupta, A., Bajcsy, R.: Segmentation of range images as the search for geometric parametric models. IJCV (1995)
13. Rabbani, T., van den Heuvel, F., Vosselmann, G.: Segmentation of point clouds using smoothness constraint. In: IEVM 2006 (2006)
14. Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P.: Optimization by simulated annealing. Science (4598) (1983)
15. Muja, M., Lowe, D.G.: Fast approximate nearest neighbors with automatic algorithm configuration. In: VISAPP. INSTICC Press (2009)
16. Chen, H., Bhanu, B.: 3D free-form object recognition in range images using local surface patches. Pattern Recognition Letters (10) (2007)
17. Arun, K., Huang, T., Blostein, S.: Least-squares fitting of two 3-D point sets. Trans. PAMI (1987)
Are You Really Smiling at Me? Spontaneous
versus Posed Enjoyment Smiles

Hamdi Dibeklioglu¹, Albert Ali Salah², and Theo Gevers¹

¹ Intelligent Systems Lab Amsterdam, University of Amsterdam, Amsterdam, The Netherlands
{h.dibeklioglu,th.gevers}@uva.nl
² Department of Computer Engineering, Bogazici University, Istanbul, Turkey
salah@boun.edu.tr

Abstract. Smiling is an indispensable element of nonverbal social interaction.


Besides, automatic distinction between spontaneous and posed expressions is im-
portant for visual analysis of social signals. Therefore, in this paper, we propose
a method to distinguish between spontaneous and posed enjoyment smiles by us-
ing the dynamics of eyelid, cheek, and lip corner movements. The discriminative
power of these movements, and the effect of different fusion levels are inves-
tigated on multiple databases. Our results improve the state-of-the-art. We also
introduce the largest spontaneous/posed enjoyment smile database collected to
date, and report new empirical and conceptual findings on smile dynamics. The
collected database consists of 1240 samples of 400 subjects. Moreover, it has the
unique property of having an age range from 8 to 76 years. Large scale experi-
ments on the new database indicate that eyelid dynamics are highly relevant for
smile classification, and there are age-related differences in smile dynamics.

Keywords: Face analysis, smile classification, affective computing.

1 Introduction
Human facial expressions are indispensable elements of non-verbal communication.
Since faces can reveal the mood or the emotional feeling of a person, automatic under-
standing and interpretation of facial expressions provide a natural way to interact with
computers. In recent studies, analysis of spontaneous facial expressions have gained
more interest. For social interaction analysis, it is necessary to distinguish spontaneous
(felt) expressions from the posed (deliberate) ones. Spontaneous expressions can re-
veal states of attention, agreement and interest, as well as deceit. The foremost facial
expression for spontaneity analysis is the smile, as it is the most frequently performed
expression, and used for signaling enjoyment, embarrassment, politeness, etc. [1]. It is
also used to mask other emotional expressions, since it is the easiest emotional facial
expression to pose voluntarily [2].
Several characteristics of spontaneous and posed smiles, such as symmetry, speed,
and timing are analyzed in the literature [3], [4], [5]. Their findings suggest that dif-
ferent facial regions contribute differently to the classification of smiles. In this paper,
we assess the facial dynamics for different face regions, and demonstrate that the eye
region contains the most useful information for this problem. Our method combines

region-specific smile dynamics (duration, amplitude, speed, acceleration, etc.) with eye-
lid movements to detect the genuineness of enjoyment smiles.
Our contributions are: 1) A region-specific analysis of facial feature movements un-
der various conditions; 2) An accurate smile classification method which outperforms
the state-of-the-art methods; 3) New empirical findings on age related differences in
smile expression dynamics; 4) The largest spontaneous/posed enjoyment smile database
in the literature for detailed and precise analysis of enjoyment smiles. The database, its
evaluation protocols and annotations are made available to the research community.

2 Related Work

The smile is the easiest emotional facial expression to pose voluntarily [2]. Broadly,
a smile can be identified as the upward movement of the mouth corners, which corre-
sponds to Action Unit 12 (AU12) in the facial action coding system (FACS) [6]. In terms
of anatomy, the zygomatic major muscle contracts and raises the corners of the lips dur-
ing a smile [4]. In terms of dynamics, smiles are composed of three non-overlapping
phases; the onset (neutral to expressive), apex, and offset (expressive to neutral), re-
spectively. Ekman individually identified 18 different smiles (such as enjoyment, fear,
miserable, embarrassment, listener response smiles) by describing the specific visual
differences on the face and indicating the accompanying action units [2].
Smiles of joy are called Duchenne smiles (D-smiles) in honor of Guillaume Du-
chenne, who did early experimental work on smiles. A good indicator for the D-smile
is the contraction of the orbicularis oculi, pars lateralis muscle that raises the cheek,
narrows the eye aperture, and forms wrinkles on the external side of the eyes. This ac-
tivation corresponds to Action Unit 6 and is called the Duchenne marker (D-marker)
in the literature. However, new empirical findings questioned the reliability of the D-
marker [7]. Recently, it has been shown that orbicularis oculi, pars lateralis can be
active or inactive under both spontaneous and posed conditions with similar frequen-
cies [8]. On the other hand, untrained people consistently use the D-marker to recognize
spontaneous and posed enjoyment smiles [9].
Symmetry is also potentially informative to distinguish spontaneous and posed en-
joyment smiles [4]. In [3], it is claimed that spontaneous enjoyment smiles are more
symmetrical than posed ones. Later studies reported no significant difference [7]. This
study has also failed to find significant effects of symmetry.
In the last decade, dynamical properties of smiles (such as duration, speed, and am-
plitude of smiles; movements of head and eyes) received attention as opposed to mor-
phological cues to discriminate between spontaneous and posed smiles. In [10], Cohn et
al. analyze correlations between lip-corner displacements, head rotations, and eye mo-
tion during spontaneous smiles. In another study, Cohn and Schmidt report that spon-
taneous smiles have smaller onset amplitude of lip corner movement, but a more stable
relation between amplitude and duration [5]. Furthermore, the maximum speed of the
smile onset is higher in posed samples and posed eyebrow raises have higher maximum
speed and larger amplitude, but shorter duration than spontaneous ones [7].
In [5], Cohn and Schmidt propose a system which distinguishes spontaneous and de-
liberate enjoyment smiles by a linear discriminant classifier using duration and amplitude
measures of smile onsets. They analyze the significance of the proposed
features and show that the amplitude of the lip corner movement is a strong linear func-
tion of duration in spontaneous smiles, but not in deliberate ones. In [11], Valstar et al.
propose a multimodal system to classify posed and spontaneous smiles. GentleSVM-
Sigmoid classifier is used with the fusion of shoulder, head and inner facial movements.
In [12], Pfister et al. propose a spatiotemporal method using both natural and infrared
face videos to discriminate between spontaneous and posed facial expressions. By en-
abling the temporal space and using the image sequence as a volume, they extend the
Completed Local Binary Patterns (CLBP) texture descriptor into the spatio-temporal
CLBP-TOP features for this task.
Recently, Dibeklioglu et al. have proposed a system which uses eyelid movements
to classify spontaneous and posed enjoyment smiles [13], where distance-based and
angular features are defined in terms of changes in eye aperture. Several classifiers are
compared and the reliability of eyelid movements is shown to be superior to that of
the eyebrows, cheek, and lip movements for smile classification.
In conclusion, the most relevant facial cues for smile classification in the literature
are 1) D-marker, 2) the symmetry, and 3) the dynamics of smiles. Instead of analyzing
these facial cues separately, in this paper, the aim is to use a more generic descriptor
set which can be applied to different facial regions to enhance the indicated facial cues
with detailed dynamic features. Additionally, we focus on the dynamical characteristics
of eyelid movements (such as duration, amplitude, speed, and acceleration), instead of
simple displacement analysis, motivated by the findings of [5] and [13].

3 Method

In this section, details of the proposed spontaneous/posed enjoyment smile classifica-


tion system will be summarized. The flow of the system is as follows. Facial fiducial
points are located in the first frame, and tracked during the rest of the smile video. These
points are used to calculate displacement signals of eyelids, cheeks, and lip corners.
Onset, apex, and offset phases of the smile are estimated using the mean displacement
of the lip corners. Afterwards, descriptive features for eyelid, cheek, and lip corner
movements are extracted from each phase. After a feature selection procedure, the most
informative features with minimum dependency are used to train the Support Vector
Machine (SVM) classifiers.

3.1 Facial Feature Tracking

To analyze the facial dynamics, 11 facial feature points (eye corners, center of upper
eyelids, cheek centers, nose tip, lip corners) are tracked in the videos (see Fig. 1(a)).
Each point is initialized in the first frame of the videos for precise tracking and analysis.
In our system, we use the piecewise Bezier volume deformation (PBVD) tracker, which
is proposed by Tao and Huang [14] (see Fig. 1(b)). While this method is relatively old,
we have introduced improved methods for its initialization, and it is fast and robust with
accurate initialization. The generic face model consists of 16 surface patches. To form
a continuous and smooth model, these patches are embedded in Bezier volumes.

   
       

    


   


Fig. 1. (a) Used facial feature points with their indices and (b) the 3D mesh model

3.2 Feature Extraction

Three different face regions (eyes, cheeks, and mouth) are used to extract descriptive
features. First of all, the tracked 3D coordinates of the facial feature points π_i (see Fig. 1(a))
are used to align the faces in each frame. We estimate the 3D pose of the face, and
normalize the face with respect to roll, yaw, and pitch rotations. Since three non-colinear
points are enough to construct a plane, we use three stable landmarks (eye centers and
nose tip) to define a plane P. Eye centers are defined as the middle points between the inner
and outer eye corners as c_1 = (π_1 + π_3)/2 and c_2 = (π_4 + π_6)/2. Angles between the positive
normal vector N_P of P and the unit vectors U on the X (horizontal), Y (vertical), and Z
(perpendicular) axes give the relative head pose as follows:

$$\theta = \arccos \frac{U \cdot N_P}{\|U\|\,\|N_P\|}, \quad \text{where } N_P = (c_2 - \pi_9) \times (c_1 - \pi_9). \qquad (1)$$

In Equation 1, (c_2 − π_9) and (c_1 − π_9) denote the vectors from point π_9 to points c_2 and c_1,
respectively. ‖U‖ and ‖N_P‖ are the magnitudes of the U and N_P vectors. According to
the face geometry, Equation 1 can estimate the exact roll (θ_z) and yaw (θ_y) angles of
the face with respect to the camera. If we assume that the face is approximately frontal
in the first frame, then the actual pitch angles (θ_x) can be calculated by subtracting the
initial value. Once the pose of the head is estimated, the tracked points are normalized with
respect to rotation, scale, and translation as follows:

$$\bar{\pi}_i = \left( \pi_i - \frac{c_1 + c_2}{2} \right) R_x(\theta_x)\, R_y(\theta_y)\, R_z(\theta_z)\, \frac{100}{\Psi(c_1, c_2)}, \qquad (2)$$

where π̄_i is the aligned point and R_x, R_y, and R_z denote the 3D rotation matrices for the
given angles. Ψ(·,·) denotes the Euclidean distance. On the normalized face, the middle
point between the eye centers is located at the origin and the interocular distance (distance
between eye centers) is set to 100 pixels. Since the normalized face is approximately
frontal with respect to the camera, we ignore the depth (Z) values of the normalized
feature points π̄_i, and denote them as l_i.
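A compact sketch of this alignment step is given below. Rather than recovering the three rotation angles separately as in Eq. (1), it builds the frontal rotation directly from an orthonormal face frame; the 0-based landmark indices are illustrative and must be adapted to the tracker's point layout.

```python
import numpy as np

def normalize_landmarks(pts3d, eye_in_l=0, eye_out_l=2, eye_in_r=3, eye_out_r=5, nose=8):
    """Align tracked 3D landmarks (n_points, 3): estimate the face plane from the
    eye centres and nose tip, rotate to a frontal pose, centre between the eyes
    and rescale so that the interocular distance is 100 pixels; return 2D points."""
    c1 = (pts3d[eye_in_l] + pts3d[eye_out_l]) / 2.0
    c2 = (pts3d[eye_in_r] + pts3d[eye_out_r]) / 2.0
    n = np.cross(pts3d[nose] - c2, pts3d[nose] - c1)      # face-plane normal
    n /= np.linalg.norm(n)
    x_axis = c2 - c1
    x_axis -= np.dot(x_axis, n) * n                       # make the frame orthonormal
    x_axis /= np.linalg.norm(x_axis)
    y_axis = np.cross(n, x_axis)
    R = np.stack([x_axis, y_axis, n])                     # rows = axes of the frontal frame
    centred = pts3d - (c1 + c2) / 2.0
    aligned = centred @ R.T
    aligned *= 100.0 / np.linalg.norm(c1 - c2)            # interocular distance -> 100 px
    return aligned[:, :2]                                 # drop depth: the l_i of Sec. 3.2
```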
After the normalization, onset, apex, and offset phases of the smile can be detected
using the approach proposed by Schmidt et al. [15], by calculating the amplitude of the
smile as the distance of the right lip corner to the lip center during the smile. Differently
from [15], we estimate the smile amplitude as the mean amplitude of the right and left
lip corners, normalized by the length of the lip. Let D_lip(t) be the value of the mean
amplitude signal of the lip corners in frame t. It is estimated as:

$$D_{lip}(t) = \frac{\Psi\!\left(\frac{l_{10}^1 + l_{11}^1}{2},\, l_{10}^t\right) + \Psi\!\left(\frac{l_{10}^1 + l_{11}^1}{2},\, l_{11}^t\right)}{2\,\Psi\!\left(l_{10}^1,\, l_{11}^1\right)}, \qquad (3)$$

where l_i^t denotes the 2D location of the ith point in frame t. The longest continuous
increase in D_lip is defined as the onset phase. Similarly, the offset phase is detected as
the longest continuous decrease in D_lip. The phase between the last frame of the onset
and the first frame of the offset defines the apex.
To extract features from the eyelids and the cheeks, additional amplitude signals are
computed. We estimate the (normalized) eyelid aperture D_eyelid and cheek displacement
D_cheek as follows:

$$D_{eyelid}(t) = \frac{\eta\!\left(\frac{l_{1}^t + l_{3}^t}{2},\, l_{2}^t\right)\Psi\!\left(\frac{l_{1}^t + l_{3}^t}{2},\, l_{2}^t\right) + \eta\!\left(\frac{l_{4}^t + l_{6}^t}{2},\, l_{5}^t\right)\Psi\!\left(\frac{l_{4}^t + l_{6}^t}{2},\, l_{5}^t\right)}{2\,\Psi\!\left(l_{1}^t,\, l_{3}^t\right)}, \qquad (4)$$

$$D_{cheek}(t) = \frac{\Psi\!\left(\frac{l_{7}^1 + l_{8}^1}{2},\, l_{7}^t\right) + \Psi\!\left(\frac{l_{7}^1 + l_{8}^1}{2},\, l_{8}^t\right)}{2\,\Psi\!\left(l_{7}^1,\, l_{8}^1\right)}, \qquad (5)$$

where η(l_i, l_j) denotes the relative vertical location function, which equals 1 if l_j
is located (vertically) below l_i on the face, and −1 otherwise. D_lip, D_eyelid, and D_cheek are
hereafter referred to as amplitude signals. In addition to the amplitudes, speed V and
acceleration A signals are extracted by computing the first and the second derivatives
of the amplitudes, respectively.
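For illustration, the lip-corner amplitude signal and the phase segmentation can be computed as below; the point indices are placeholders for the tracker's lip-corner landmarks, the onset is assumed to precede the offset, and speed and acceleration are simply the first and second differences of the returned amplitudes.

```python
import numpy as np

def lip_amplitude(landmarks, left=9, right=10):
    """D_lip(t) (Eq. 3): mean distance of the two lip corners to their first-frame
    midpoint, normalised by the first-frame lip length. landmarks: (T, n_points, 2)."""
    l0, r0 = landmarks[0, left], landmarks[0, right]
    centre = (l0 + r0) / 2.0
    d = (np.linalg.norm(landmarks[:, left] - centre, axis=1) +
         np.linalg.norm(landmarks[:, right] - centre, axis=1))
    return d / (2.0 * np.linalg.norm(l0 - r0))

def smile_phases(d_lip):
    """Onset = longest run of continuous increase, offset = longest run of
    continuous decrease, apex = the frames in between (onset assumed first)."""
    def longest_run(sign):
        best, start = (0, 0), 0
        for t in range(1, len(d_lip)):
            if sign * (d_lip[t] - d_lip[t - 1]) > 0:
                if t - start > best[1] - best[0]:
                    best = (start, t)
            else:
                start = t
        return best
    onset, offset = longest_run(+1), longest_run(-1)
    return onset, (onset[1], offset[0]), offset
```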
In summary, a description of the extracted features and the facial cues related to them
is given in Table 1. Note that the defined features are extracted separately from each
phase of the smile. As a result, we obtain three feature sets for each of the eye, mouth
and cheek regions. Each phase is further divided into increasing (+) and decreasing
(-) segments, for each feature set. This allows a more detailed analysis of the feature
dynamics.
In Table 1, signals symbolized with superindex (+) and (-) denote the segments
of the related signal with continuous increase and continuous decrease, respectively.
For example, D+ pools the increasing segments in D. ω(·) defines the length (number of
frames) of a given signal, and f is the frame rate of the video. D_L and D_R define the
amplitudes for the left and right sides of the face, respectively. For each face region,
three 25-dimensional feature vectors are generated by concatenating these features.
In some cases, features cannot be calculated. For example, if we extract features
from the amplitude signal of the lip corners D_lip using the onset phase, then the decreasing
segments will be an empty set (ω(D-) = 0). For such exceptions, all the features
describing the related segments are set to zero.
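A few of the Table 1 descriptors can be computed from a single smile phase of an amplitude signal as sketched below. The grouping of samples into D+ and D- follows the description above, the feature names are ours, and the list is deliberately incomplete (the paper uses 25 features per region and phase).

```python
import numpy as np

def segment_features(d, frame_rate):
    """Dynamic descriptors for one phase of an amplitude signal d (1-D array).
    D+/D- are the samples lying on increasing/decreasing runs of d, and V is its
    first derivative; empty segments yield zeros, as stated in the text."""
    d = np.asarray(d, dtype=float)
    diff = np.diff(d)
    d_plus, d_minus = d[1:][diff > 0], d[1:][diff < 0]
    v = diff * frame_rate
    v_plus, v_minus = v[v > 0], np.abs(v[v < 0])
    stat = lambda s, fn: float(fn(s)) if len(s) else 0.0
    return {
        "duration": len(d) / frame_rate,
        "duration_ratio_plus": len(d_plus) / len(d) if len(d) else 0.0,
        "max_amplitude": stat(d, np.max),
        "mean_amplitude_plus": stat(d_plus, np.mean),
        "total_amplitude_plus": float(d_plus.sum()),
        "net_amplitude": float(d_plus.sum() - d_minus.sum()),
        "max_speed_plus": stat(v_plus, np.max),
        "mean_speed_minus": stat(v_minus, np.mean),
    }
```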

3.3 Feature Selection and Classification


To classify spontaneous and posed smiles, individual SVM classifiers are trained for
different face regions. As described in Section 3.2, we extract three 25-dimensional
feature vectors for each face region.

Table 1. Definitions of the extracted features, and the facial cues related to them. The relation
with the D-marker is only valid for eyelid features.

Feature                       Definition                                      Related Cue(s)
Duration                      ω(D+)/f , ω(D-)/f , ω(D)/f                      Dynamics
Duration Ratio                ω(D+)/ω(D) , ω(D-)/ω(D)                         Dynamics
Maximum Amplitude             max(D)                                          Dynamics, D-marker
Mean Amplitude                ΣD/ω(D) , ΣD+/ω(D+) , Σ|D-|/ω(D-)               Dynamics, D-marker
STD of Amplitude              std(D)                                          Dynamics
Total Amplitude               ΣD+ , Σ|D-|                                     Dynamics
Net Amplitude                 ΣD+ − Σ|D-|                                     Dynamics
Amplitude Ratio               ΣD+/(ΣD+ + Σ|D-|) , Σ|D-|/(ΣD+ + Σ|D-|)         Dynamics
Maximum Speed                 max(V+) , max(|V-|)                             Dynamics
Mean Speed                    ΣV+/ω(V+) , Σ|V-|/ω(V-)                         Dynamics
Maximum Acceleration          max(A+) , max(|A-|)                             Dynamics
Mean Acceleration             ΣA+/ω(A+) , Σ|A-|/ω(A-)                         Dynamics
Net Ampl., Duration Ratio     (ΣD+ − Σ|D-|)/ω(D)                              Dynamics
Left/Right Ampl. Difference   Σ|D_L − D_R|/ω(D)                               Symmetry

To deal with feature redundancy, we use the Min-Redundancy Max-Relevance (mRMR)
algorithm to select discriminative features [16]. mRMR is an incremental method for
minimizing the redundancy while selecting the most relevant information as follows:

$$\max_{f_j \in F - S_{m-1}} \left[ I(f_j, c) - \frac{1}{m-1} \sum_{f_i \in S_{m-1}} I(f_j, f_i) \right], \qquad (6)$$

where I is the mutual information function and c indicates the target class. F and
S_{m-1} denote the feature set and the set of m-1 already selected features, respectively.
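An incremental selection loop implementing (6) might look as follows. It assumes discretised feature columns so that a plug-in mutual-information estimate applies; scikit-learn's mutual_info_score is used here as a convenience and is an implementation choice of ours, not the paper's.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mrmr(features, labels, n_select):
    """Select n_select column indices of `features` (n_samples, n_features) by
    min-redundancy max-relevance: start from the most relevant feature, then
    greedily add the feature maximising relevance minus mean redundancy."""
    n_feat = features.shape[1]
    relevance = np.array([mutual_info_score(features[:, j], labels) for j in range(n_feat)])
    selected = [int(np.argmax(relevance))]
    while len(selected) < n_select:
        best_j, best_score = None, -np.inf
        for j in range(n_feat):
            if j in selected:
                continue
            redundancy = np.mean([mutual_info_score(features[:, j], features[:, i])
                                  for i in selected])
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected
```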
During the training of our system, both individual feature vectors for onset, apex,
offset phases, and a vector with their fusion are generated. The most discriminative
features on each of the generated feature sets are selected using mRMR. Minimum
classification error on a separate validation set is used to determine the most informa-
tive facial region, and the most discriminative features on the selected region. Sim-
ilarly, to optimize the SVM configuration, different kernels (linear, polynomial, and
radial basis function (RBF)) with different parameters (size of RBF kernel, degree of
polynomial kernel) are tested on the validation set and the configuration with the mini-
mum validation error is selected. The test partition of the dataset is not used for param-
eter optimization.

4 Database
4.1 UvA-NEMO Smile Database
We have recently collected the UvA-NEMO Smile Database1 to analyze the dynamics
of spontaneous/posed enjoyment smiles. This database is composed of videos (in RGB
color) recorded with a Panasonic HDC-HS700 3MOS camcorder, placed on a monitor,
at approximately 1.5 meters away from the recorded subjects. Videos were recorded
with a resolution of 1920×1080 pixels at a rate of 50 frames per second under controlled
illumination conditions. Additionally, a color chart is present on the background of the
videos for further illumination and color normalization (See Fig. 2).
The database has 1240 smile videos (597 spontaneous, 643 posed) from 400 subjects
(185 female, 215 male), making it the largest smile database in the literature so far.
Ages of subjects vary from 8 to 76 years. 149 subjects are younger than 18 years (235
spontaneous, 240 posed). 43 subjects do not have spontaneous smiles and 32 subjects
have no posed smile samples. (See Fig. 3 for age and gender distributions).

Fig. 2. Spontaneous (top) and posed (bottom) enjoyment smiles from the UvA-NEMO Smile
Database

For posed smiles, each subject was asked to pose an enjoyment smile as realistically
as possible, sometimes after being shown the proper way in a sample video. Short, funny
video segments were used to elicit spontaneous enjoyment smiles. Approximately five
minutes of recordings were made per subject, and genuine smiles were segmented.
¹ This research was part of Science Live, the innovative research programme of science cen-
ter NEMO that enables scientists to carry out real, publishable, peer-reviewed research using
NEMO visitors as volunteers. See http://www.uva-nemo.org on how to obtain the UvA-NEMO
Smile Database.

 

Fig. 3. Age and gender distributions for the subjects (top), and for the smiles (bottom) in the
UvA-NEMO Smile Database

For each subject, a balanced number of spontaneous and posed smiles were selected
and annotated by seeking consensus of two trained annotators. Segments start/end with
neutral or near-neutral expressions. The mean duration of the smile samples is 3.9 sec-
onds (σ = 1.8). The average interocular distance on the database is approximately 200
pixels (estimated by using the tracked landmarks). 50 subjects wear eyeglasses.

4.2 Existing Smile Databases

Facial expression databases in the literature rarely contain spontaneous smiles. We


have used several existing databases to report results with the proposed method. Ta-
ble 2 shows a comparative overview of publicly available smile databases. BBC Smile
Dataset2 was gathered from Spot the fake smile test on the BBC website, with 10
spontaneous and 10 posed smile videos, each from a different subject, and each starting
and ending with a neutral face. MMI Facial Expression Database [17] is not specifically
gathered for smile classification, but includes 74 posed smiles from 30 subjects, as well
as spontaneous smiles from 25 subjects. SPOS Corpus [12] contains natural color and
infrared videos of 66 spontaneous and 14 posed smiles from seven subjects. At the
moment only the onsets of expressions are available, but the entire database will be
opened to the public. USTC-NVIE Database [18] consists of images and videos of six
basic facial expressions and neutral faces. Images are recorded in both natural color and
infrared, simultaneously, under three different illumination conditions. NVIE database
has two parts; spontaneous part includes image sequences of onset phases, where posed
² http://www.bbc.co.uk/science/humanbody/mind/surveys/smiles/

part consists of only the most expressive (apex) images. The database has 302 spontaneous
smiles (image sequences of onset phases) from 133 subjects, and 624 posed
smiles (single apex images) from 104 subjects.

Table 2. Overview of databases with spontaneous/posed smile content. Note that the posed part
of USTC-NVIE includes only the most expressive (apex) images.

Database          Participants   Resolution & Frame Rate                         Age Annotation (range)
BBC               20             314×286 pixels @25Hz                            No (unknown)
MMI [17]          55             720×576, 576×720 pixels @25Hz;                  Yes (19–64 years)
                                 640×480 pixels @29Hz
SPOS [12]         7              640×480 pixels @25Hz                            No (unknown)
USTC-NVIE [18]    148            640×480, 704×480 pixels @30Hz                   No (17–31 years)
UvA-NEMO          400            1920×1080 pixels @50Hz                          Yes (8–76 years)

5 Experimental Results
We use the UvA-NEMO smile database to evaluate our system, but report further results
on BBC and SPOS corpora in Section 5.4. We use a two level 10-fold cross-validation
scheme: Each time a test fold is separated, a 9-fold cross-validation is used to train
the system, and parameters are optimized without using the test partition. There is no
subject overlap between folds. For classification, linear SVM is found to perform better
than polynomial and RBF alternatives. For detailed analysis of the features and the
system, tracking is initialized by manually annotated facial landmarks. Additionally,
results with automatic initialization are also given in Section 5.4.

5.1 Assessment of Facial Regions and Feature Selection


To evaluate the discriminative power of eye, cheek, and mouth regions for smile clas-
sification, we use features of onset, apex, and offset phases of all regions, individually.
Additionally, features of all phases are concatenated and tested for each region (shown
as All in Fig. 4). Feature selection reduces the number of feature dimensions and in-
creases the correct classification rates for each feature set, except for apex features of
eyelid and lip corners, and offset features of cheeks (Fig. 4). The decrease in accuracy
for these three feature sets is around 1% (absolute), whereas feature selection increases
the accuracy by approximately 3% (absolute) on average. Since these results confirm
the usefulness of feature selection, it is used in the remainder of this section.
When we analyze the results with feature selection, it is seen that the best accuracy,
for individual phases, is achieved by the onset features of lip corners (82.58%). How-
ever, the discriminative power of apex and offset phases of lip corners do not reach
those of the eyelids and cheeks. The most reliable apex and offset features are obtained
from the eyelids, which provide the highest correct classification rate (87.10%) when
the features of all phases are concatenated. Lip corners (83.63%) and cheeks (83.15%)
follow the eyelids. Combined eyelid features have the minimum validation error in this
experiment.

Fig. 4. Effect of feature selection on correct classification rates for different facial regions

5.2 Assessment of Fusion Strategies

Three different fusion strategies (early, mid-level, and late) are defined and evaluated.
Each fusion strategy enables feature selection before classification. In early fusion, fea-
tures of onset, apex, and offset of all regions are fused into one low-abstraction vector
and classified by a single classifier. Mid-level fusion concatenates features of all phases
for each region, separately. Constructed feature vectors are individually classified by
SVMs and the classifier outputs are fused (either by SUM or PRODUCT rule, or by
voting3 ). In the late fusion scheme, feature sets of onset, apex, and offset for all facial
regions are individually classified by SVMs and fused.
As shown in Fig. 5 (a), mid-level fusion provides the best performance (88.87%
with voting), followed by early and late fusion, respectively. Elimination of redundant
information by feature selection after low-level feature abstraction on each region, sep-
arately, and following higher level of abstraction for classification on different facial
regions can explain the high performance of mid-level fusion.
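A minimal sketch of the mid-level fusion scheme with voting is given below: one linear SVM per facial region on the concatenated phase features, fused by majority vote. Feature selection, parameter tuning and the SUM/PRODUCT variants are omitted; scikit-learn is assumed to be available and labels are taken to be 0/1.

```python
import numpy as np
from sklearn.svm import SVC

def midlevel_fusion_predict(train_by_region, y_train, test_by_region):
    """train_by_region / test_by_region: one (n_samples, n_features) array per
    facial region (eyelids, cheeks, lip corners). Each region gets its own linear
    SVM; the binary predictions are fused by majority voting."""
    votes = []
    for X_tr, X_te in zip(train_by_region, test_by_region):
        clf = SVC(kernel="linear").fit(X_tr, y_train)
        votes.append(clf.predict(X_te))
    votes = np.array(votes)                              # (n_regions, n_test_samples)
    return (votes.mean(axis=0) >= 0.5).astype(int)       # majority vote over regions
```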

5.3 Effect of Age

The features on which we base our analysis may depend on the age of the subjects.
In this section, we split the UvA-NEMO smile database into two partitions as young
people (age < 18 years), and adults (age 18 years), and all training and evalua-
tion is repeated separately for the two partitions of the database. Fig. 5 (b) shows the
classification accuracies. Regional performances are given using the fused (onset, apex,
offset) features of the related region. Mid-level fusion (voting) accuracies are also given
in Fig. 5 (b).
³ The SUM and PRODUCT rules fuse the computed posterior probabilities for the target classes
of different classifiers. To estimate these posterior probabilities, sigmoids of SVM output dis-
tances are used.


Fig. 5. (a) Effect of different fusion strategies and (b) age on correct classification rates

Our results show that eyelid features perform better on adults (as well as on the
whole set) than cheeks and lip corners. For adults, eyelid features reach an accuracy of
89.41%, whereas features of cheeks and lip corners have an accuracy of 83.79% and
82.22%, respectively. However, the correct classification rate for young people with
eyelid features is approximately 5% and 3% less than for the adults and whole set, re-
spectively. Lip corner features provide the most reliable classification for young people
with an accuracy of 86.53%, and also have the minimum validation error for this parti-
tion. Additionally, results show that fusion does not increase the performance for adults
and young people, individually, and the highest correct classification rate is achieved on
the whole set.

5.4 Comparison with Other Methods

We compare our method with the state-of-the-art smile classification systems proposed
in the literature [5], [13], [12] by evaluating them on the UvA-NEMO database with the
same experimental protocols, as well as on BBC and SPOS. Results for [12] are given
by using only natural texture images. For a fair comparison, all methods are tested by
using the piecewise Bezier volume tracker [14] and the tracker is initialized automati-
cally by the method proposed in [19]. Correct classification rates are given in Table 3.
Results show that the proposed method outperforms the state-of-the-art methods. Mid-
level fusion with voting provides an accuracy of 87.02%, which is 9.76% (absolute)
higher than the performance of the method proposed in [5]. Using only eyelid features
decreases the correct classification rate by only 1.29% (absolute) in comparison to mid-
level fusion. This confirms the reliability of eyelid movements and the discriminative
power of the proposed dynamical eyelid features to distinguish between types of smiles.
Our system with eyelid features has an 85.73% accuracy, significantly higher than
that of [13] (71.05%), which uses only eyelid movement features without any temporal
segmentation. This shows the importance of temporal segmentation. The results of [12]
with 73.06% classification rate is less than the accuracy of our method with only onset
features, and shows that spatiotemporal features are not as reliable as facial dynamics.
Since [5] relies on solely the onset features of lip corners, we also tested our method
with onset features of lip corners, and obtained an accuracy of 80.73% (compared
to [5]'s 77.26%). We conclude that using automatically selected features from a large
pool of informative features serves better than enabling a few carefully selected measures
for this problem. Manually selected features may also show less generalization
power across different (database-specific) recording conditions.

Table 3. Correct classification rates on the BBC, SPOS, and UvA-NEMO databases

                                        Correct Classification Rate (%)
Method                                  BBC      SPOS     UvA-NEMO
Proposed, Eyelid Features               85.00    72.50    85.73
Proposed, Mid-level fusion (voting)     90.00    75.00    87.02
Cohn & Schmidt [5]                      75.00    72.50    77.26
Dibeklioglu et al. [13]                 85.00    66.25    71.05
Pfister et al. [12]                     70.00    67.50    73.06
It is important to note that the proposed method uses solely onset features on SPOS
corpus, since it has only onset phases of smiles. We have observed that spontaneous
smiles are generally classified better than posed ones for all methods. One possible
explanation is that dynamical facial features have more variance in posed smiles. Sub-
sequently, class boundaries of spontaneous smiles are more defined, and this leads to a
higher accuracy.

6 Discussion
In our experiments, onset features of lip corners perform best for individual phases. This
result is consistent with the findings of Cohn et al. [5]. However, when onset, apex, and
offset phases are fused, the eyelid movements are more descriptive than those of cheeks
and lip corners for enjoyment smile classification.
On UvA-NEMO, the best fusion scheme increases the correct classification rate by
only 1.29% (absolute) with respect to the accuracy of eyelid features. This finding sup-
ports our motivation and confirms the discriminative power and the reliability of eyelid
movements to classify enjoyment smiles. However, it is important to note that temporal
segmentation of the smiles is performed by using lip corner movements, which means
that additional information from the movements of lip corners is leveraged.
Highly significant (p < 0.001) feature differences (of selected features) between
adults and young people are obtained. For both spontaneous and posed smiles, max-
imum and mean apertures of eyes are larger for adults. During onset, both amplitude
of eye closure and closure speed are higher for young people. During offset, amplitude
and speed of eye opening are higher for young people. When we analyze the signifi-
cance levels of the most selected features for smile classification, we see that the size
of the eye aperture is smaller during spontaneous smiles. However, many subjects in
UvA-NEMO lower their eyelids also during posed smiles. This result is consistent with
the findings of Krumhuber and Manstead [8], which indicate that D-marker can exist
during both spontaneous and posed enjoyment smiles.
Another important finding is that the speed and acceleration of eyelid movements
are higher for posed smiles. As a result, since faster eyelid movements of young people
cause confusion with posed smiles, the classification accuracy with eyelid features is
higher for adults. Similarly, features extracted from the cheek region perform better for
adults, since cheek movements of adults are slower and more stationary. The durations
of spontaneous smiles are longer than those of posed ones, but the lip corner movement for
posed smiles is faster (also in terms of acceleration). This improves the accuracy of
classification with lip corner features in favor of young people, since the lip corner
movements of young people are significantly faster than adults during posed smiles.
Since eyelid and cheek features are reliable in adults as opposed to lip corners in
young people, fusion on a specific age group decreases the accuracy compared to the
performance on the whole set. Lastly, there is no significant symmetry difference (in
terms of amplitude) between spontaneous and posed smiles (as in [20], [7]) or between
young people and adults. More detailed analysis of age related smile dynamics is given
in [21], which uses the proposed features (facial dynamics) for age estimation.

7 Conclusions
In this paper, we have presented a smile classifier that can distinguish posed and spon-
taneous enjoyment smiles with high accuracy. The method is based on the hypothesis
that dynamics of eyelid movements are reliable cues for identifying posed and spon-
taneous smiles. Since the movements of eyelids are complex to analyze (because of
blinks and continuous change in eye aperture), novel features have been proposed to
describe dynamics of eyelid movements in detail. The proposed features also incorpo-
rate facial cues previously used in the literature (D-marker, symmetry, and dynamics)
for classification of smiles, and can be generalized to any facial region.
We have introduced the largest spontaneous/posed enjoyment smile database (1240
samples of 400 subjects, ages of subjects vary from 8 to 76 years) in the literature for
detailed and precise analysis of enjoyment smiles. On this database, we have verified
the discriminative power of eyelid movement dynamics and showed its superiority over
lip corner and cheek movements. We have evaluated fusion of features, and obtained
minor improvements over using only eyelid features. We provided comparative results
on three smile databases with the proposed method, as well as three other methods.
We report new and significant empirical findings on smile dynamics that can be lever-
aged to implement better facial expression analysis systems: 1) Maximum and mean
apertures of eyes are smaller and speed of eyelid (also cheek and lip corner) move-
ments are faster for young people compared to adults, during both spontaneous and
posed smiles. 2) Mean eye aperture is smaller during spontaneous smiles in comparison
to posed ones. 3) The speed and acceleration of eyelid movements are higher in posed
smiles. 4) There is no significant difference in (movement) symmetry between sponta-
neous and posed smiles, even when tested separately for young people and adults.

Acknowledgments. This work is supported by Bogazici University project BAP-6531.


Authors would like to thank Tomas Pfister for providing the executable of the CLBP-
TOP implementation.

References
1. Ambadar, Z., Cohn, J.F., Reed, L.I.: All smiles are not created equal: Morphology and timing of smiles perceived as amused, polite, and embarrassed/nervous. J. Nonverbal Behav. 33, 17–34 (2009)
2. Ekman, P.: Telling lies: Cues to deceit in the marketplace, politics, and marriage. W.W. Norton & Company, New York (1992)
3. Ekman, P., Hager, J.C., Friesen, W.V.: The symmetry of emotional and deliberate facial actions. Psychophysiology 18, 101–106 (1981)
4. Ekman, P., Friesen, W.V.: Felt, false, and miserable smiles. J. Nonverbal Behav. 6, 238–252 (1982)
5. Cohn, J.F., Schmidt, K.L.: The timing of facial motion in posed and spontaneous smiles. Int. Journal of Wavelets, Multiresolution and Information Processing 2, 121–132 (2004)
6. Ekman, P., Friesen, W.V.: The Facial Action Coding System: A technique for the measurement of facial movement. Consulting Psychologists Press Inc., San Francisco (1978)
7. Schmidt, K.L., Bhattacharya, S., Denlinger, R.: Comparison of deliberate and spontaneous facial movement in smiles and eyebrow raises. J. Nonverbal Behav. 33, 35–45 (2009)
8. Krumhuber, E.G., Manstead, A.S.R.: Can Duchenne smiles be feigned? New evidence on felt and false smiles. Emotion 9, 807–820 (2009)
9. Manera, V., Giudice, M.D., Grandi, E., Colle, L.: Individual differences in the recognition of enjoyment smiles: No role for perceptual-attentional factors and autistic-like traits. Frontiers in Psychology 2 (2011)
10. Cohn, J.F., Reed, L.I., Moriyama, T., Xiao, J., Schmidt, K.L., Ambadar, Z.: Multimodal coordination of facial action, head rotation, and eye motion during spontaneous smiles. In: IEEE AFGR, pp. 129–135 (2004)
11. Valstar, M.F., Pantic, M.: How to distinguish posed from spontaneous smiles using geometric features. In: ACM ICMI, pp. 38–45 (2007)
12. Pfister, T., Li, X., Zhao, G., Pietikainen, M.: Differentiating spontaneous from posed facial expressions within a generic facial expression recognition framework. In: ICCV Workshops, pp. 868–875 (2011)
13. Dibeklioglu, H., Valenti, R., Salah, A.A., Gevers, T.: Eyes do not lie: Spontaneous versus posed smiles. In: ACM Multimedia, pp. 703–706 (2010)
14. Tao, H., Huang, T.: Explanation-based facial motion tracking using a piecewise Bezier volume deformation model. In: CVPR, pp. 611–617 (1999)
15. Schmidt, K.L., Cohn, J.F., Tian, Y.: Signal characteristics of spontaneous facial expressions: Automatic movement in solitary and social smiles. Biological Psychology 65, 49–66 (2003)
16. Peng, H., Long, F., Ding, C.: Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. on PAMI 27, 1226–1238 (2005)
17. Valstar, M.F., Pantic, M.: Induced disgust, happiness and surprise: An addition to the MMI facial expression database. In: LREC, Workshop on EMOTION, pp. 65–70 (2010)
18. Wang, S., Liu, Z., Lv, S., Lv, Y., Wu, G., Peng, P., Chen, F., Wang, X.: A natural visible and infrared facial expression database for expression recognition and emotion inference. IEEE Trans. on Multimedia 12, 682–691 (2010)
19. Dibeklioglu, H., Salah, A.A., Gevers, T.: A statistical method for 2-D facial landmarking. IEEE Trans. on Image Processing 21, 844–858 (2012)
20. Schmidt, L.K., Ambadar, Z., Cohn, J.F., Reed, L.I.: Movement differences between deliberate and spontaneous facial expressions: Zygomaticus major action in smiling. J. Nonverbal Behav. 30, 37–52 (2006)
21. Dibeklioglu, H., Gevers, T., Salah, A.A., Valenti, R.: A smile can reveal your age: Enabling facial dynamics in age estimation. In: ACM Multimedia (2012)
Efficient Monte Carlo Sampler for Detecting
Parametric Objects in Large Scenes

Yannick Verdie and Florent Lafarge

INRIA Sophia Antipolis, France


{yannick.verdie,florent.lafarge}@inria.fr

Abstract. Point processes have demonstrated efficiency and competitiveness when addressing object recognition problems in vision. However, simulating these mathematical models is a difficult task, especially on large scenes. Existing samplers suffer from average performances in terms of computation time and stability. We propose a new sampling procedure based on a Monte Carlo formalism. Our algorithm exploits Markovian properties of point processes to perform the sampling in parallel. This procedure is embedded into a data-driven mechanism such that the points are non-uniformly distributed in the scene. The performances of the sampler are analyzed through a set of experiments on various object recognition problems from large scenes, and through comparisons to the existing algorithms.

1 Introduction

Markov point processes constitute an object-oriented extension of traditional Markov Random Fields (MRF). Whereas MRFs address labeling problems on static graphs, Markov point processes can tackle object recognition problems by directly manipulating parametric entities on dynamic graphs. These probabilistic models, introduced by Baddeley et al. [1], exploit random variables whose realizations are configurations of parametric objects, each object being assigned to a point positioned in the scene. The number of objects is itself a random variable, and thus does not need to be estimated or specified by a user. Another strength of Markov point processes is their ability to take into account complex spatial interactions between the objects and to impose global regularization constraints.

1.1 Point Processes for Vision Problems

Many recent works exploiting point processes have been proposed to address a large variety of computer vision problems [2-9]. The growing interest in these probabilistic models is motivated by the need to manipulate parametric objects interacting in scenes. Parametric objects can be defined in discrete and/or continuous domains. They usually correspond to geometric entities, e.g. segments, rectangles, circles or planes, but can more generally be any type of multi-dimensional function. A point process requires the formulation of a probability


density in order to measure the quality of a configuration of objects. The density is typically defined as a combination of a term assessing the consistency of objects to the data, and a term taking into account spatial interactions between objects in a Markovian context. The optimal configuration maximizing this density is usually searched for by a Monte Carlo based sampler capable of exploring the whole configuration space, in most cases a Markov Chain Monte Carlo (MCMC) algorithm [10, 11].

Descombes et al. [2] propose a point process for counting populations from aerial images, each entity being captured by an ellipse. Ge et al. [3] present a point process for a similar application, but dedicated to crowd detection from ground-based photos, for which objects are defined as a set of body shape templates learned from training data. Multi-view images are used by Utasi et al. [9] to detect people by a point process in 3D where the objects are specified by cylinders. Sun et al. [8] and Lacoste et al. [7] propose point processes for extracting line networks from images by taking into account spatial interactions between lines to favor connections between objects. Lafarge et al. [4] present a general model for extracting different types of geometric features from images, including lines, rectangles or disks. A mixture of object interactions is also considered such that the process can address different problems ranging from population counting to line network extraction, through to texture recognition. Lieshout [5] develops a point process for tracking rectangular objects from video. A mono-dimensional point process is proposed by Mallet et al. [6] for modeling 1D-signals by mixtures of parametric functions while imposing physical constraints between the modes.

1.2 Motivations
The results obtained by these point processes are particularly convincing and competitive, but the performances remain limited in terms of computation time and convergence stability, especially on large scenes. These drawbacks explain why industry has been reluctant until now to integrate these mathematical models into its products. Indeed, the point processes presented in Section 1.1 emphasize complex model formulations by proposing parametrically sophisticated objects [3, 4], advanced techniques to fit objects to the data [9], and non-trivial spatial interactions between objects [6, 8]. However, these works have only slightly addressed the optimization issues arising from such complex models.
Jump-Diffusion algorithms [4, 12, 13] have been designed to speed up the MCMC sampling by inserting diffusion dynamics for the exploration of the continuous subspaces. These samplers are unfortunately restricted to specific density forms. Data considerations have also been used to drive the sampling with more efficiency [14] for image segmentation problems. Some works have also proposed parallelization procedures by using multiple chains simultaneously [15] or decomposition schemes in configuration spaces of fixed dimension [16, 17]. However, they are limited by border effects, and are not designed to perform on large scenes. In addition, the existing decomposition schemes cannot be used for configuration spaces of variable dimension. A mechanism based on multiple creations and destructions of objects has also been developed for addressing population counting problems [2, 9]. Nevertheless, object creations require the discretization of the point coordinates, which induces a significant loss of accuracy.

These alternative versions of the conventional MCMC sampler globally allow the improvement of optimization performances in specific contexts. That said, the gains in terms of computation time remain weak and are usually realized at the expense of solution stability. Finding a fast, efficient sampler for general Markov point processes clearly represents a challenging problem.

1.3 Contributions
We present solutions to address this problem and to drastically reduce computation times while guaranteeing the quality and stability of the solution. Our algorithm presents several important contributions to the field.

Sampling in parallel - Contrary to the conventional MCMC sampler which makes the solution evolve by successive perturbations, our algorithm can perform a large number of perturbations simultaneously using a unique chain. The Markovian property of point processes is exploited to make the global sampling problem spatially independent outside a local neighborhood.

Non-uniform point distributions - Point processes mainly use uniform point distributions which are computationally easy to simulate, but make the sampling extremely slow. We propose an efficient mechanism allowing the modifications, creations or removals of objects by taking into account information on the observed scenes. Contrary to the data-driven solutions proposed by [3] and [14], our non-uniform distribution is not built directly from the image likelihood, but is created via space-partitioning trees to ensure the sampling parallelization.

Efficient GPU implementation - We propose an implementation on GPU which significantly reduces computing times with respect to existing algorithms, while increasing stability and improving the quality of the obtained solution.

3D object detection model from point clouds - To evaluate the algorithm in 3D, we propose an original 3D point process to detect parametric objects from point clouds. This model is applied to tree recognition from laser scans of large urban and natural environments. To our knowledge, it is the first point process sampler to date to perform in such highly complex state spaces.

2 Point Process Background


A point process describes random configurations of points in a continuous bounded set K. Mathematically speaking, a point process Z is a measurable mapping from a probability space (Ω, A, P) to the set of configurations of points in K such that

$$ \forall \omega \in \Omega,\; p_i \in K, \quad Z(\omega) = \{p_1, \ldots, p_{n(\omega)}\} \qquad (1) $$

where n(ω) is the number of points associated with the event ω. We denote by P the space of configurations of points in K. The most natural point process

is the homogeneous Poisson process, for which the number of points follows a discrete Poisson distribution whereas the positions of the points are uniformly and independently distributed in K. Point processes can also provide more complex realizations of points by being specified by a density h(.) defined on P and a reference measure μ(.), under the condition that the normalization constant of h(.) is finite:

$$ \int_{p \in \mathcal{P}} h(p)\, d\mu(p) < \infty \qquad (2) $$

The reference measure μ(.) is usually defined via the intensity measure of a homogeneous Poisson process. Specifying a density h(.) allows the insertion of data consistency, and also the creation of spatial interactions between the points. In particular, the Markovian property can be used in point processes, similarly to random fields, to create a spatial independence of the points outside a neighborhood. Note also that h(.) can be expressed by a Gibbs energy U(.) such that

$$ h(.) \propto \exp\left(-U(.)\right) \qquad (3) $$

From points to parametric objects - What makes point processes attractive for vision is the possibility of marking each point p_i by additional parameters m_i such that the point becomes associated with an object x_i = (p_i, m_i). We denote by C the corresponding space of object configurations, where each configuration is given by x = {x_1, ..., x_n(x)}. For example, a point process on K × M with K ⊂ R² and the additional parameter space M = ]−π/2, π/2] × [l_min, l_max] can be seen as random configurations of 2D line-segments, since an orientation and a length are added to each point (see Fig. 1). Such point processes are also called marked point processes in the literature.
The most popular family of point processes corresponds to the Markov point processes of objects specified by Gibbs energies on C of the form

$$ \forall x \in C, \quad U(x) = \sum_{x_i \in x} D(x_i) + \sum_{x_i \sim x_j} V(x_i, x_j) \qquad (4) $$

where ∼ denotes the symmetric neighborhood relationship of the Markov point process, D(x_i) is a unitary data term measuring the quality of object x_i with respect to data, and V(x_i, x_j) a pairwise interaction term between two neighboring objects x_i and x_j. The relationship ∼ is usually defined via a limit distance ε between points such that

$$ x_i \sim x_j \;\Longleftrightarrow\; (x_i, x_j) \in x^2,\; i > j,\; \|p_i - p_j\|_2 < \varepsilon \qquad (5) $$

In the sequel, we consider Markov point processes of this form. Note that this energy form has similarities with the standard labeling energies for MRFs. As explained in [18], our problem can be seen as a generalization of these models.

Simulation - Point processes are usually simulated by a Reversible Jump MCMC sampler [10] to search for the configuration which minimizes the energy U. This sampler consists of simulating a discrete Markov chain (X_t), t ∈ N, on

Fig. 1. From left to right: realizations of a point process in 2D, of a Markov point process, and of a Markov point process of line-segments. The grey dashed lines represent the pairs of points interacting with respect to the neighboring relationship, which is specified here by a limit distance between two points.

the configuration space C, converging towards an invariant measure specified by U. At each iteration, the current configuration x of the chain is locally perturbed to a configuration y according to a density function Q(x → .), also called a kernel. The perturbations are local, which means that x and y are close, and differ by no more than one object. The configuration y is then accepted as the new state of the chain with a certain probability depending on the energy variation between x and y, and a relaxation parameter T_t. The kernel Q can be formulated as a mixture of sub-kernels Q_m chosen with a probability q_m such that

$$ Q(x \rightarrow \cdot) = \sum_m q_m\, Q_m(x \rightarrow \cdot) \qquad (6) $$

Each sub-kernel is usually dedicated to specific types of moves, such as the creation/removal of an object (Birth and Death kernel) or the modification of the parameters of an object (e.g. translation, dilatation or rotation kernels). The kernel mixture must allow any configuration in C to be reached from any other configuration in a finite number of perturbations (irreducibility condition of the Markov chain), and each sub-kernel has to be reversible, i.e. able to propose the inverse perturbation.

Algorithm 1. RJMCMC sampler [10]

1- Initialize X_0 = x_0 and T_0 at t = 0;
2- At iteration t, with X_t = x,
   - Choose a sub-kernel Q_m according to probability q_m
   - Perturb x to y according to Q_m(x → .)
   - Compute the Green ratio
     $$ R = \frac{Q_m(y \rightarrow x)}{Q_m(x \rightarrow y)} \, \exp\!\left(\frac{U(x) - U(y)}{T_t}\right) \qquad (7) $$
   - Choose X_{t+1} = y with probability min(1, R), and X_{t+1} = x otherwise
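To make the acceptance test of Algorithm 1 concrete, the following minimal Python sketch mirrors one RJMCMC iteration under stated assumptions: U is a user-supplied energy function, and each kernel object exposes a hypothetical propose(x) method returning the perturbed configuration together with the forward and backward proposal densities (these names are illustrative and not part of the paper).

import math
import random

def rjmcmc_step(x, kernels, kernel_probs, U, T):
    # One RJMCMC iteration (Algorithm 1): propose with a randomly chosen
    # sub-kernel, then accept with probability min(1, R), R the Green ratio.
    Qm = random.choices(kernels, weights=kernel_probs, k=1)[0]
    # propose() is an assumed interface: it returns the perturbed
    # configuration y and the forward/backward proposal densities.
    y, q_forward, q_backward = Qm.propose(x)
    R = (q_backward / q_forward) * math.exp((U(x) - U(y)) / T)   # Eq. 7
    return y if random.random() < min(1.0, R) else x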

The RJMCMC sampler is controlled by the relaxation parameter T_t, called the temperature, which depends on time t and approaches zero as t tends to infinity. Although a logarithmic decrease of T_t is necessary to ensure the convergence to the global minimum from any initial configuration, one uses a faster geometric decrease which gives an approximate solution close to the optimum [1].

3 New Sampling Procedure

3.1 Simultaneous Multiple Perturbations

The conventional RJMCMC sampler performs successive perturbations on objects. Such a procedure is obviously long and fastidious, especially for large scale problems. A natural but still unexplored idea for Markov point processes consists in sampling objects in parallel by exploiting their conditional independence outside the local neighborhood. Such a strategy implies partitioning the space K so that simultaneous perturbations are performed at locations far enough apart not to interfere and break the convergence properties.

From Sequential to Parallel Sampling - Let (X_t), t ∈ N, be a Markov chain simulating a Markov point process, and {c_s} be a partition of the space K, where each component c_s is called a cell. Two cells c_s and c_s' are said to be independent on X if the transition probability for any random perturbation falling in c_s at any time t does not depend on either objects or perturbations falling in c_s', and vice versa. One can easily check that the transition probability of two successive perturbations falling in independent cells under the temperature T_t is equal to the product of the transition probabilities of each perturbation under the same temperature [18]. In other words, realizing two successive perturbations on independent cells at the same temperature is equivalent to performing them in parallel.

Ensuring Cell Independence - Two independent cells must be located at a minimum distance from each other. As illustrated in Fig. 2, this distance must take into account both the width of the neighboring relationship, i.e. ε, and the length of the biggest move allowed as an object perturbation, denoted by δ_max. Independence between two cells c_s and c_s' is then guaranteed if

$$ \min_{p \in c_s,\; p' \in c_{s'}} \|p - p'\|_2 \;\geq\; \varepsilon + 2\,\delta_{\max} \qquad (8) $$
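As a concrete illustration of Eq. 8, the short Python check below tests whether two axis-aligned cells, given as (min_corner, max_corner) tuples, are far enough apart; the box representation and the parameter names eps and delta_max follow the reconstruction above and are assumptions made for this sketch, not code from the paper.

def cells_independent(cell_a, cell_b, eps, delta_max):
    # Eq. 8: the minimum distance between any two points of the cells must
    # be at least eps + 2 * delta_max for the cells to be independent.
    (amin, amax), (bmin, bmax) = cell_a, cell_b
    # Per-axis gap between the two boxes (zero if they overlap on that axis).
    gaps = [max(bmin[d] - amax[d], amin[d] - bmax[d], 0.0)
            for d in range(len(amin))]
    min_dist = sum(g * g for g in gaps) ** 0.5
    return min_dist >= eps + 2.0 * delta_max

# Two unit cells separated by a gap of 2 are independent when eps + 2*delta_max = 1.5.
assert cells_independent(((0.0, 0.0), (1.0, 1.0)), ((3.0, 0.0), (4.0, 1.0)), 0.5, 0.5)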

Independent Cell Gathering - The natural idea consists of partitioning the space K into a regular mosaic of cells with size greater than or equal to ε + 2δ_max. The cells can then be regrouped into 2^(dim K) sets such that each cell is adjacent to cells belonging to different sets. Fig. 2 illustrates the regrouping scheme for dim K = 2 and dim K = 3. This regrouping scheme ensures the mutual independence between all the cells of a same set. In the sequel, such a set is called a mic-set (set of Mutually Independent Cells).
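One simple way to realize this regrouping on a regular grid, sketched below in Python, is to read the parity of each integer cell coordinate as one bit of the mic-set index; this automatically yields 2^(dim K) sets in which adjacent cells fall into different sets. It is only one possible implementation of the checkerboard-like scheme, not necessarily the one used by the authors.

def mic_set_index(cell_index):
    # The parity of each grid coordinate gives one bit of the mic-set id,
    # so the cells of any 2 x 2 (or 2 x 2 x 2) block fall into different sets.
    s = 0
    for bit, coord in enumerate(cell_index):
        s |= (coord % 2) << bit
    return s

# In 2D the four cells of a 2 x 2 block cover the four mic-sets.
assert {mic_set_index(c) for c in [(2, 3), (3, 3), (2, 4), (3, 4)]} == {0, 1, 2, 3}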

Fig. 2. Independence of cells. (a) The width of the cell c2 is not large enough to ensure the independence of the cells c1 and c3: the two grey points in c1 and c3 cannot be perturbed at the same time. (b) The cells c1 and c3 are independent as Eq. 8 is satisfied. (c) In dimension two, the cells are squares regrouped into 4 mic-sets (yellow, blue, red and green). Each cell is adjacent to cells belonging to different mic-sets. (d) In dimension three, the cells are cubes regrouped into 8 mic-sets.

3.2 Non-uniform Point Distributions

Sampling objects in parallel via a regular partitioning, as illustrated in Fig. 2-(c)&(d), is however not optimal because the spatial point distribution is necessarily uniform and does not take into account the characteristics of the observed scenes. To overcome this problem, a non-regular partitioning of the scene is created by exploiting data-based knowledge.

Non-uniform Kernel from Uniform Sub-Kernels - Mixtures of sub-kernels are frequently used to simulate point processes by MCMC dynamics, each sub-kernel corresponding to a perturbation type (e.g. birth and death, translation, rotation, etc). However, the idea of accumulating sub-kernels with spatial restrictions to create non-uniform point distributions has not been exploited in the literature. Let {c_s}^(1), ..., {c_s}^(L) be L partitions of the space K such that {c_s}^(i) is a subdivided partition of {c_s}^(i-1). The L partitions define a space-partitioning tree, denoted K, whose levels each correspond to a partition of K. By associating with each cell contained in the tree a uniform sub-kernel spatially restricted to the subspace supporting this cell, a non-uniform kernel can be created by accumulation, as defined in Eq. 6 and illustrated in Fig. 3.

Data-Driven Space-Partitioning Tree - A regular 1-to-2^(dim K) hierarchical subdivision scheme is considered to build a partitioning tree, typically a quadtree in dimension two and an octree in dimension three. The subdivision of the cells is driven by the data. We assume that a class of interest in K, to which the objects have a high probability of belonging, can be roughly distinguished from the data. The extraction of such a class is not addressed here, and is supposed to be done by a segmentation algorithm of the literature adapted to the considered application. A cell at a given level of the tree is divided into 2^(dim K) cells at the next level if it overlaps with the given class of interest. The hierarchical decomposition is stopped when the minimal size of the cell becomes smaller than ε + 2δ_max, i.e. when the cell independence condition (Eq. 8) is no longer valid.

The partitioning tree allows the creation of a non-uniform point distribution naturally and efficiently. Indeed, the density progressively decreases when moving far from the class of interest, as shown in Fig. 3, while being ensured to be non-null. The point distribution is thus not directly affected when the class of interest is inaccurately extracted.

Fig. 3. Space-partitioning tree in dimension two. (b) A class of interest (blue area) is estimated from (a) an input image. (c) A quadtree is created so that the levels are recursively partitioned according to the class of interest. Each level is composed of four mic-sets (yellow, blue, red and green sets of cells) to guarantee the sampling parallelization. (d) The cell accumulation at the different levels yields a non-uniform distribution. Note how the distribution naturally describes the class of interest by progressively decreasing the density when moving away from it.

Kernel Formulation - Given a space-partitioning tree K composed of L levels, and 2^(dim K) mic-sets for each level, we can formulate a general kernel Q as a mixture of uniform sub-kernels Q_{c,t}, each sub-kernel being defined on the cell c of the tree by the perturbation type t ∈ T, such that

$$ \forall x \in C, \quad Q(x \rightarrow \cdot) = \sum_{c \in \mathcal{K}} \sum_{t \in T} q_{c,t}\, Q_{c,t}(x \rightarrow \cdot) \qquad (9) $$

where q_{c,t} is the probability of choosing sub-kernel Q_{c,t}(x → .), given by

$$ q_{c,t} = \frac{\Pr(t)}{\#\,\text{cells in } \mathcal{K}} \qquad (10) $$

Four types of kernels are usually considered in practice, so that we have T = {birth and death, translation, rotation, scale}. The switching kernel can also be used when objects have several possible types. Note that the kernel Q is reversible as a sum of reversible sub-kernels. Note also that this kernel allows us to visit the whole configuration space C since this is guaranteed by the sub-kernels of the coarsest level of the tree.
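Eq. 10 is easy to materialize as a lookup table. The sketch below enumerates the (cell, move type) sub-kernels of a partitioning tree and assigns each of them the probability Pr(t) divided by the number of cells; the cell identifiers and move-type probabilities in the example are hypothetical and only illustrate the construction.

def sub_kernel_probs(tree_cells, move_type_probs):
    # q_{c,t} = Pr(t) / #cells (Eq. 10), for every cell of the tree and
    # every perturbation type.
    n_cells = len(tree_cells)
    return {(c, t): p_t / n_cells
            for c in tree_cells
            for t, p_t in move_type_probs.items()}

# Example: a 5-cell tree and the four usual move types; the q_{c,t} sum to 1.
probs = sub_kernel_probs(range(5), {"birth_death": 0.4, "translation": 0.2,
                                    "rotation": 0.2, "scale": 0.2})
assert abs(sum(probs.values()) - 1.0) < 1e-9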

3.3 Sampler Formulation


The kernel defined in Eq. 9 is embedded into the MCMC dynamics so that the new sampler allows the association of multiple perturbations performed in parallel with relevant non-uniform point distributions based on data-driven space-partitioning trees. Algorithm 2 details the sampling procedure.
Algorithm 2. Our parallel sampler

1- Initialize X_0 = x_0 and T_0 at t = 0;
2- Compute a space-partitioning tree K;
3- At iteration t, with X_t = x,
   - Choose a mic-set S_mic of the tree and a kernel type t ∈ T according to probability Σ_{c ∈ S_mic} q_{c,t}
   - For each cell c ∈ S_mic,
     - Perturb x in the cell c to a configuration y according to Q_{c,t}(x → .)
     - Compute the Green ratio
       $$ R = \frac{Q_{c,t}(y \rightarrow x)}{Q_{c,t}(x \rightarrow y)} \, \exp\!\left(\frac{U(x) - U(y)}{T_t}\right) \qquad (11) $$
     - Choose X_{t+1} = y with probability min(1, R), and X_{t+1} = x otherwise

Note that the temperature parameter is updated after each series of simultaneous perturbations, so that the temperature decrease is equivalent to a cooling schedule by plateaus in a standard sequential MCMC sampling. Note also that the hierarchical partitioning of K protects the sample from mosaic effects. In practice, the sampling is stopped when no perturbation has been accepted during a certain number of iterations.
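The CPU-side Python sketch below spells out one sweep of Algorithm 2. The callbacks propose_in_cell, local_energy_delta and apply_local_change are hypothetical stand-ins for the application-specific parts, and, for brevity, the mic-set and the move type are drawn uniformly here, whereas Algorithm 2 selects them according to the accumulated probabilities q_{c,t}. On the GPU implementation described in Section 4.1, the inner loop would correspond to one thread per cell.

import math
import random

def parallel_sweep(x, mic_sets, move_types, propose_in_cell,
                   local_energy_delta, apply_local_change, T):
    # One iteration of Algorithm 2: pick a mic-set and a move type, then run
    # an independent accept/reject test in every cell of the mic-set.
    mic = random.choice(mic_sets)
    t = random.choice(move_types)
    for cell in mic:
        proposal = propose_in_cell(x, cell, t)
        if proposal is None:           # nothing to perturb in this cell
            continue
        y_local, q_fwd, q_bwd = proposal
        dU = local_energy_delta(x, y_local, cell)  # U(y) - U(x), local thanks to the Markov property
        R = (q_bwd / q_fwd) * math.exp(-dU / T)    # Green ratio of Eq. 11
        if random.random() < min(1.0, R):
            x = apply_local_change(x, y_local, cell)
    return x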

4 Experiments
The proposed algorithm has been implemented on GPU for dim K = 2 and
dim K = 3, and tested on various problems including population counting, line-
network extraction from images, and object recognition from 3D point clouds.
Note that additional results and comparisons can be found in [18], as well as
details on energy formulations.

4.1 Implementation
The algorithm has been implemented on GPU using CUDA. A thread is dedicated to each simultaneous perturbation so that operations are performed in parallel for each cell of a mic-set. The sampler is thus all the more efficient as the mic-set contains many cells and, generally speaking, as the scene supported by K is large. Moreover, the code has been programmed to avoid time-consuming operations. In particular, the threads do not communicate with each other, and memory coalescing permits fast memory access. The memory transfer between CPU and GPU has also been minimized by indexing the parametric objects. The experiments presented in this section have been performed on a 2.5 GHz Xeon computer with a Nvidia graphics card (Quadro 4800, architecture 1.3).

4.2 Point Processes in 2D

The algorithm has been evaluated on population counting problems from large-scale images using a point process in 2D, i.e. with dim K = 2. The problem presented in Fig. 4 consists in detecting migrating birds to extract information on their number, their size and their spatial organization. The point process is marked by ellipses, which are simple geometric objects defined by a point (center of mass) and 3 additional parameters, and are well adapted to capture the bird contours. The energy is specified by a unitary data term based on the Bhattacharyya distance between the radiometry inside and outside of the object, and a pairwise interaction penalizing the strong overlapping of objects [18].

Fig. 4. Bird counting by a point process of ellipses. (right) More than ten thousand
birds are extracted by our algorithm in a few minutes from (left) a large scale aerial
image. (middle) A quad-tree structure is used to create a non-uniform point distribu-
tion. Note, on the cropped parts, how the birds are accurately captured by ellipses in
spite of the low quality of the image and the partial overlapping of birds.

Computation time, quality of the reached energy, and stability are the three important criteria used to evaluate and compare the performance of samplers. As shown in Fig. 5, our algorithm obtains the best results for each of the criteria compared to the existing samplers. In particular, we reach a better energy (-8.76 vs -5.78 for [2] and -2.01 for [10]) while significantly reducing computation times (269 sec vs 1078 sec for [2] and 2.8 × 10^6 sec for [10]). Note that, for the reasons mentioned in Section 4.1, this gap in performance increases when the input scene becomes larger. Fig. 5 also underlines an important limitation of the reference point process sampler for population counting [2] compared to our algorithm. Indeed, the discretization of the object parameters required in [2] causes approximate detection and localization of objects, which explains the average quality of the reached energy. The stability is analyzed by the coefficient of variation, defined as the standard deviation over the mean and known to be a relevant statistical measure for comparing methods having different means.

                                         Coefficient of variation
                                        energy    time    #objects
RJMCMC [10]                              7.3%     4.2%      1.7%
multiple birth and death [2]             5.0%     2.1%      1.3%
our sampler without partitioning tree    7.4%     6.2%      1.6%
our sampler with partitioning tree       4.4%     1.8%      1.1%

Fig. 5. Performances of the various samplers. (left) The graph describes the energy decrease over time from the bird image presented in Fig. 4. Time is represented using a logarithmic scale. Note that the RJMCMC algorithm [10] is so slow that the convergence is not displayed on the graph (2.9 × 10^7 iterations are required versus 1.8 × 10^4 for our algorithm). (right) The table presents the coefficients of variation of the energy, time and number of objects reached at convergence over 50 simulations.

Our sampler provides a better stability than the existing algorithms. The impact
of the data-driven space partitioning tree is also measured by performing tests
with uniform point distributions. The performances decrease but remain better
than the existing algorithms. In particular, the sampler loses stability, and the
objects are detected and located less accurately than with the partitioning tree.

Fig. 6. Line-network extraction by a point process of line-segments. (middle) Even if the point distribution is roughly estimated, (right) the road network is recovered (red segments) by our algorithm in 16 seconds from (left) a satellite image. Note that, as shown in the close-up, some parts of the network can be omitted when roads are hidden by trees at some locations: existing methods also encounter difficulties in such cases.

The algorithm has also been tested on line-network extraction from images. The parametric objects here are line-segments defined by a point (center of mass) and two additional parameters (length and orientation). Contrary to the population counting model, the pairwise potential includes a connection interaction for linking the line-segments. Figure 6 shows a road network extraction result obtained from a satellite image, whereas Table 1 provides elements of comparison

with existing methods. The result quality in terms of road under/over-detection is globally similar to existing approaches (higher than [4] and [19], and lower than [7]), but our algorithm significantly improves the computing times. 16 seconds are required in our case, compared to 7 minutes by a Jump-Diffusion algorithm [4], 155 minutes for a RJMCMC method [7], and 60 minutes for an Active Contour based approach [19].

Table 1. Comparisons with existing line-network extraction methods from the image presented in Fig. 6

                    our sampler   Jump-Diffusion [4]   RJMCMC [7]   Active contour [19]
Time (minute)          0.26              6.35              155               60
Over-detection         0.45%             2.06%             0.06%             2.28%
Under-detection        32.7%             52.7%             17%               58.9%
Representation      line-segment      line-segment      line-segment       pixels

4.3 Point Processes in 3D


We tested our algorithm with dim K = 3 on an original object recognition problem from laser scans. The goal is to extract trees from unstructured point clouds containing a lot of outliers, noise and other different objects (buildings, ground, cars, fences, wires, etc), and to recognize their shapes and types. The objects associated with the point process correspond to a library of different 3D-templates of trees detailed in [18]. The unitary data term of the energy measures the distance from points to a 3D-template, whereas the pairwise interaction takes into account constraints on object overlapping as well as on tree type competition. Compared to the former applications, the configuration space C is of higher dimension since the objects are parametrically more complex. This allows our algorithm to exploit its potential more deeply. The rotation kernel is not used here since the objects are invariant by rotation. However, we use a switching kernel in order to exchange the type of an object.
Fig. 7 shows results obtained from laser scans of large urban and natural environments. 30 (respectively 5.4) thousand trees are extracted in 96 (resp. 53) minutes on the 3.7 km² mountain area (resp. 1 km² urban area) from 13.8 (resp. 2.3) million input points. The computation times can appear high, but finding non-trivial 3D-objects in such large scenes by point processes is a challenge which, to our knowledge, has not been achieved until now due to the extreme complexity of the state space. Note also that the performances could be improved by reducing the space C with a 3D-point process on manifolds, i.e. where the z-coordinate of points is determined by an estimated ground surface.
Evaluating the detection quality with accuracy for this application is a difficult task since no ground truth exists. As illustrated on the cropped part in Fig. 7, we have manually indexed the trees in different zones from aerial images acquired with the laser scans. The objects are globally well located and fitted to the input points with few omissions, even when trees are surrounded by other types of urban
entities such as buildings. The non-overlapping constraint of the energy allows us to obtain satisfactory results for areas with high tree density. Errors frequently occur in distinguishing the tree type, in spite of the tree competition term of the energy.

Fig. 7. Tree recognition from point clouds by a 3D-point process specified by 3D-parametric models of trees. Our algorithm detects trees and recognizes their shapes in large-scale (left, input scan: 13.8M points) natural and (top right, input scan: 2.3M points) urban environments, in spite of other types of urban entities, e.g. buildings, cars and fences, contained in the input point clouds (red dot scans). An aerial image is joined to (bottom right) the cropped part to provide a more intuitive representation of the scene and the tree locations. Note, on the cropped part, how the parametric models fit well to the input points corresponding to trees, and how the interaction of tree competition allows the regularization of the tree type in a neighborhood.

5 Conclusion
We propose a new algorithm to sample point processes whose strengths lie in the exploitation of Markovian properties to enable the sampling to be performed in parallel, and in the integration of a data-driven mechanism allowing efficient distributions of the points in the scene. Our algorithm improves the performances of the existing samplers in terms of computing times and stability, especially on large scenes where the gain is very important. It can be used without particular restrictions, contrary to most samplers, and even appears as an interesting alternative to the standard optimization techniques for MRF labeling problems. In particular, one can envisage using the model proposed in Section 4.3 to extract any type of parametric objects from large 3D-point clouds.
In future works, it would be interesting to implement the algorithm on other
GPU architectures more adapted to the manipulation of 3D data structures,
e.g. architecture 2.0, so that the performances for point processes in 3D could
be improved.

Acknowledgments. This work was partially funded by the European Research Council (ERC Starting Grant Robust Geometry Processing, Grant agreement 257474). The authors thank the French Mapping Agency (IGN) and the Tour du Valat for providing the datasets.

References
1. Baddeley, A.J., Lieshout, M.V.: Stochastic geometry models in high-level vision. Journal of Applied Statistics 20 (1993)
2. Descombes, X., Minlos, R., Zhizhina, E.: Object extraction using a stochastic birth-and-death dynamics in continuum. JMIV 33 (2009)
3. Ge, W., Collins, R.: Marked point processes for crowd counting. In: CVPR, Miami, U.S. (2009)
4. Lafarge, F., Gimelfarb, G., Descombes, X.: Geometric feature extraction by a multi-marked point process. PAMI 32 (2010)
5. Lieshout, M.V.: Depth map calculation for a variable number of moving objects using Markov sequential object processes. PAMI 30 (2008)
6. Mallet, C., Lafarge, F., Roux, M., Soergel, U., Bretar, F., Heipke, C.: A marked point process for modeling lidar waveforms. IP 19 (2010)
7. Lacoste, C., Descombes, X., Zerubia, J.: Point processes for unsupervised line network extraction in remote sensing. PAMI 27 (2005)
8. Sun, K., Sang, N., Zhang, T.: Marked Point Process for Vascular Tree Extraction on Angiogram. In: Yuille, A.L., Zhu, S.-C., Cremers, D., Wang, Y. (eds.) EMMCVPR 2007. LNCS, vol. 4679, pp. 467-478. Springer, Heidelberg (2007)
9. Utasi, A., Benedek, C.: A 3-D marked point process model for multi-view people detection. In: CVPR, Colorado Springs, U.S. (2011)
10. Green, P.: Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82 (1995)
11. Hastings, W.: Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57 (1970)
12. Han, F., Tu, Z.W., Zhu, S.: Range image segmentation by an effective jump-diffusion method. PAMI 26 (2004)
13. Srivastava, A., Grenander, U., Jensen, G., Miller, M.: Jump-diffusion Markov processes on orthogonal groups for object pose estimation. Journal of Statistical Planning and Inference 103 (2002)
14. Tu, Z., Zhu, S.: Image Segmentation by Data-Driven Markov Chain Monte Carlo. PAMI 24 (2002)
15. Harkness, M., Green, P.: Parallel chains, delayed rejection and reversible jump MCMC for object recognition. In: BMVC, Bristol, U.K. (2000)
16. Byrd, J., Jarvis, S., Bhalerao, A.: On the parallelisation of MCMC-based image processing. In: IEEE International Symposium on Parallel and Distributed Processing, Atlanta, U.S. (2010)
17. Gonzalez, J., Low, Y., Gretton, A., Guestrin, C.: Parallel Gibbs sampling: From colored fields to thin junction trees. Journal of Machine Learning Research (2011)
18. Verdie, Y., Lafarge, F.: Towards the parallelization of Reversible Jump Markov Chain Monte Carlo algorithms for vision problems. Research report 8016, INRIA (2012)
19. Rochery, M., Jermyn, I., Zerubia, J.: Higher order active contours. IJCV 69 (2006)
Supervised Geodesic Propagation
for Semantic Label Transfer

Xiaowu Chen, Qing Li, Yafei Song, Xin Jin, and Qinping Zhao

State Key Laboratory of Virtual Reality Technology and Systems


School of Computer Science and Engineering, Beihang University, Beijing, China
{chen,liqing,songyf,jinxin,zhaoqp}@vrlab.buaa.edu.cn

Abstract. In this paper we propose a novel semantic label transfer method using supervised geodesic propagation (SGP). We use supervised learning to guide the seed selection and the label propagation. Given an input image, we first retrieve its similar image set from annotated databases. A Joint Boost model is learned on the similar image set of the input image. Then the recognition proposal map of the input image is inferred by this learned model. The initial distance map is defined by the proposal map: the higher the probability, the smaller the distance. In each iteration step of the geodesic propagation, the seed is selected as the one with the smallest distance from the undetermined superpixels. We learn a classifier as an indicator to indicate whether to propagate labels between two neighboring superpixels. The training samples of the indicator are annotated neighboring pairs from the similar image set. The geodesic distances of its neighbors are updated according to the combination of the texture and boundary features and the indication value. Experiments on three datasets show that our method outperforms the traditional learning based methods and the previous label transfer method for the semantic segmentation task.

1 Introduction
Semantic labeling, or multi-class segmentation, is a basic and important topic in computer vision and image understanding. In the last decades, there has been great progress in this research area [1-5]. However, it is still a challenging problem for current computer vision techniques to recognize and segment objects in an image as well as human beings do. Recently, some methods have been proposed to solve this problem with trained generative or discriminative models [1-3], which usually need a fixed dataset containing certain category classes. Moreover, some approaches aim to integrate low-level features and context priors into a bottom-up and top-down model [4]. These methods require a training procedure on the fixed dataset for the parametric model, and they are not scalable with the number of object categories. For example, to include more object categories in a learning-based model, we need to train the model with additional label categories to adapt the model parameters.

Corresponding authors.


Fig. 1. The objective. Our objective is to get the semantic segmentation of the input image: (a) input image, (b) our result, (c) ground truth. This input image is taken from the MSRC dataset [6].

With the increasing availability of image collections, such as the LabelMe database [7], large data driven methods have revealed the potential of nonparametric methods in several applications, such as object and scene recognition [8] and semantic segmentation [9, 10]. Since Liu et al. [9] addressed the nonparametric scene parsing work as label transfer for the first time, several scholars have paid attention to this topic [10-12] and given promising results. Label transfer, considered as transferring the labels of existing annotations to the unlabeled input image, involves two key issues to be solved. The first issue is how to retrieve proper similar images from the database for a given input image. The second issue is how to parse the input image with the annotated similar images. The first issue is well studied in some previous works [8, 13] and is not the main focus of our work. To solve the second issue, a precise matching between the similar images and the input image is needed, which is the dominant point in [9-12].
As the above label transfer methods usually use MRF optimization following the matching, a complete pixel-level or superpixel-level matching between the similar images and the input image needs to be implemented. We thus take the idea of partially matching a small number of superpixels and selecting initial seeds to propagate the semantic labels. Some segmentation works [14-16] mainly grow regions from foreground/background seeds employing the geodesic distance metric. In addition, Chen et al. [17] employ geodesic propagation in multi-class segmentation. Inspired by these works, we present a novel method for label transfer using supervised geodesic propagation (SGP).
Given an input image, we first retrieve its K-Nearest-Neighbor (KNN) similar images to form its similar image set from the annotated database with GIST matching [18]. Then the proposal map of the input image is inferred by the boosted recognition model learned on its similar image set. Besides the proposal map, a classifier for propagation indication is also learned on this similar image set. As the initial distance map of the input image has been derived from the proposal map, we start our geodesic propagation guided by the selected seed and the indication of label propagation. Moreover, the texture and boundary features of the input image are also integrated into the propagation procedure.

Our main contributions include: (1) A label transfer method with geodesic propagation. (2) A supervised seed selection and label propagation scheme in geodesic propagation for label transfer. Figure 1 shows our label transfer result.

2 Related Works
Liu et al. [9] first addressed the semantic labeling work within a nonparametric scene parsing framework denoted as label transfer. Following the idea of label transfer, Tighe and Lazebnik [10] use scene-level matching with global features, superpixel-level matching with local features and Markov random field (MRF) optimization for incorporating neighborhood context. However, these two nonparametric methods both require an existing large training database. In addition, finding highly similar training images for a given query image is difficult. To assure that the retrieved images are proper, multiple image sets covering all semantic categories in the input image are first obtained in [11]. Then a KNN-MRF matching scheme is proposed to establish dense correspondence between the input image and each found image set. A Markov random field optimization is used based on those matching correspondences. In [12], an ANN bilateral matching scheme is proposed. Both the retrieval of partially similar images and the parsing scheme are built upon the ANN bilateral matching. The test image is parsed by integrating multiple cues in a Markov random field. Inspired by these works, our method takes the similarity of retrieved images for semantic labeling without precise pixel-level or superpixel-level matching.
Geodesic distance, which is the shortest path between points in the space, is used as a metric to classify the pixels in [14]. The method proposed in [15] is also based on the idea of seed growing geodesic segmentation. To avoid the bias to seed placement and the lack of edge modeling in geodesics, Price et al. [15] present to combine geodesic distance information with edge information in a graph cut optimization framework. Gulshan et al. [16] introduce Geodesic Forests, which exploit the structure of shortest paths in implementing the star-convexity constraints. In addition to the methods [14-16] that mainly focus on foreground/background segmentation with geodesic distance, Chen et al. [17] employ geodesic distance for multi-class segmentation by integrating color and edge constraints on the edge weight. However, they [17] utilize no contextual similarity in the propagation. Besides, we are obviously different from [17] in the training set and the seed selection scheme. We are somewhat similar to [19, 20] in using the GIST feature and boosting classifiers; however, we use geodesic propagation to get a deterministic solution while [19, 20] use CRF or MRF to get the optimal solution. In addition, we utilize the contextual information of both the similar images and the test image itself, while [19, 20] utilize only the contextual information of the test image or test sequence.

3 Supervised Geodesic Propagation


The workflow of our method is illustrated in Figure 2. Given an input image, our method begins with getting the similar image set from the annotated dataset
Fig. 2. Overview of our supervised geodesic propagation. (a) The input image. (b) The dataset with manual annotations. Each color uniquely maps to a category label. The semantic label maps are best viewed in color. (c) We first get the similar image set from the dataset with Gist matching and re-ranking (Section 3.1). (d) Then we infer the proposal map of the input image using the boosted model which is trained on this similar image set (Section 3.2). (e) The propagation indicator is learned on the similar image set (Section 3.3). (f) and (g) are the texture and boundary features of the input image (Section 3.4). (h) The propagation indicator, the proposal map, and the texture and boundary features of the input image are integrated into the geodesic propagation framework (Section 3.4) to get the semantic label result of the input image.

through Gist matching [18]. We infer the proposal map of the input image for seed selection. Then the proposal map, the texture and boundary features of the input image, and the contextual similarity of the similar images are integrated into the geodesic propagation framework to get the segmentation result. There is no explicit energy to minimize in the framework; instead, multi-class semantic labeling is solved through the geodesic propagation process.
The denotations in this paper are presented here for clarity. Given an input image I, the semantic labeling problem is to assign each pixel x ∈ I a certain label l ∈ L, L = {1, 2, ..., N}. To solve this problem, we build up a neighbor-connected graph G = <V, E>. In this paper, each superpixel i is a vertex v in graph G and is assigned a specific label l contained in the dataset through the geodesic propagation procedure. The edge set E consists of edges
between neighboring vertices. We define the weight of edge W(i, j) on a hybrid manifold based on both texture and boundary features to indicate the smoothness between neighboring vertices i and j.

3.1 Similar Image Matching

In order to transfer labels of annotated images to the input image, the first thing we need to do is to select proper images similar to the input image in semantic categories and contextual information. Since similar image retrieval is not the main focus of this paper, we use Gist matching as recent label transfer methods [9-12] did. The gist descriptor [18] is employed to retrieve the K-Nearest Neighbors from the dataset, and the similar image set is formed with these neighbors. After gist matching, the K-Nearest Neighbors in the similar image set are re-ranked in the following way. We over-segment the input image and each of its similar images R ∈ {R} using the algorithm described in [21]. Then each superpixel i ∈ I is matched to a proper superpixel r(i) ∈ R which has the smallest matching distance to i. The following distance metric is used to compute the matching distance of correspondences in the re-ranking procedure. Given two images I and R, the matching distance D_r(I, R) is scored as:

$$ D_r(I, R) = \sum_{i \in I,\; r(i) \in R} \sqrt{\left(fv_i - fv_{r(i)}\right)^2} \qquad (1) $$

Here, fv_i is a 22-dimensional descriptor of i, including the average HSV colors, coordinates and 17-dimensional filter responses [6] of i. The Euclidean distance metric is used in our implementation. After re-ranking the gist similar images according to their matching scores, we get the top K similar images, which are denoted as {R_K}. In our experiments, we use {R_K} as the similar image set instead of {R}.
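As a sketch of this re-ranking score, assuming the 22-dimensional descriptors of both images are stacked into NumPy arrays, the nearest-descriptor matching and the summed Euclidean distances of Eq. 1 can be computed as follows; the array layout is an assumption made for illustration.

import numpy as np

def rerank_distance(fv_input, fv_candidate):
    # Eq. 1: match each superpixel descriptor of I to its nearest descriptor
    # in R, then sum the Euclidean distances of these matches.
    fv_input = np.asarray(fv_input, dtype=float)           # (n_I, 22)
    fv_candidate = np.asarray(fv_candidate, dtype=float)   # (n_R, 22)
    diff = fv_input[:, None, :] - fv_candidate[None, :, :]
    pairwise = np.sqrt((diff ** 2).sum(axis=2))
    return pairwise.min(axis=1).sum()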

3.2 Proposal Map for Seed Selection

Previous works [14-16] utilize manual scribbles as the initial seeds for each category or object, which bring human priors into the segmentation task. Instead, we take a dynamic seed selection for geodesic propagation. The similar images in {R_K} imply the possible categories in the input image. We assume that the categories l ∈ {R_K} cover all the categories in I. To exploit the inherent possibilities, the similar image set {R_K} is used as the training set for the input image I to learn the recognition model. The 17-dimensional raw texton features and the Jointboost algorithm are used in the learning procedure as [22] did.

When we get the recognition proposal map of I, the initial geodesic distances of all classes for each superpixel are defined upon this proposal map. Each superpixel is temporarily assigned the initial label that has the maximum probability p_l(i). According to Equation 2, we get the initial geodesic distance of the temporary class for each superpixel. Then a distance map of the input image can be obtained: superpixels which have higher probabilities will have smaller initial
geodesic distance. In each propagation step, the undetermined superpixel with the smallest distance of all classes is selected as the current seed to ensure the estimation is the best in probability (see Section 3.4 for more details).

$$ Dis_{initial}(i) = 1 - p_l(i) \qquad (2) $$
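Under the assumption that the proposal map is available as a (superpixels x classes) probability array, the initialization of Eq. 2 reduces to a couple of NumPy lines. This is a sketch of the idea, not the authors' code.

import numpy as np

def initial_distance_map(proposal):
    # Take the most probable class of each superpixel as its temporary label
    # and set its initial geodesic distance to 1 - p_l(i) (Eq. 2).
    labels = proposal.argmax(axis=1)
    dist = 1.0 - proposal[np.arange(len(proposal)), labels]
    return labels, dist

# Example: the most confident superpixel gets the smallest distance and
# would therefore be picked as the first seed.
proposal = np.array([[0.7, 0.3], [0.2, 0.8], [0.55, 0.45]])
labels, dist = initial_distance_map(proposal)
assert labels.tolist() == [0, 1, 0] and dist.argmin() == 1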

3.3 Propagation Indicator


Each image R ∈ {R_K} is similar to the input image in some aspects, such as the appearance and the contextual information. We thus assume that the contextual similarity between the similar image set and the input image can provide useful information for parsing the input image. Based on this assumption, we take a supervised classification for label propagation. A set of classifiers is learned on the similar image set to guide the propagation, and each semantic category has its corresponding classifier. We denote these classifiers as the propagation indicators. In this section, we introduce how to get the indicator of every semantic category. More details about how these indicators work in the propagation procedure will be introduced in Section 3.4.

Our indicator is used to classify whether to propagate a label from superpixel i to its neighbor j in the input image. For neighboring superpixels which are classified as the same category, we propagate the current label; otherwise, we do not propagate the current label. Our propagation indicators for each category are trained using random forests [23, 24], a competitive non-linear model that predicts by averaging over multiple regression trees. The random forests implementation available online [25] is used with default parameters in our method.
To generate training samples for the indicator of each category l in {R_K}, we get all neighboring superpixel pairs (i, j) as well as their category labels l_i and l_j, known from the annotation. Note that pair (i, j) is different from pair (j, i). For each pair (i, j), we denote fv(i, j) = <fv_i, fv_j> as a 44-dimensional feature vector, which includes the average HSV colors, coordinates and 17-dimensional filter responses [22] of both fv_i and fv_j. If l_j is consistent with l_i, then fv(i, j) is taken as a positive sample of the indicator of label l_i, otherwise as a negative sample. All the features in fv are normalized in the range of [0, 1]. In the testing procedure, we extract the fv(i, j) feature vector of neighboring superpixels and then get the confidence exported by the trained classifier as an indicator value for propagation. As shown in Equation 3, T_l(i, j) is the indicator function, con_l(i, j) is the confidence and θ is the threshold for the indicator.

$$ T_l(i, j) = \mathbf{1}\!\left[\, con_l(i, j) > \theta \,\right] \qquad (3) $$
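A minimal sketch of the indicator training and of the test of Eq. 3 is given below. It uses scikit-learn's RandomForestClassifier as a stand-in for the random-forest implementation of [25] (which the paper uses), and the threshold name theta follows the reconstructed notation above; both are assumptions of this sketch.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_indicator(pair_features, same_label):
    # pair_features: (n_pairs, 44) array of concatenated descriptors <fv_i, fv_j>;
    # same_label: 1 when the pair shares the category label, 0 otherwise.
    forest = RandomForestClassifier(n_estimators=100)
    forest.fit(pair_features, same_label)
    return forest

def should_propagate(forest, fv_i, fv_j, theta=0.5):
    # Eq. 3: propagate from i to j iff the predicted confidence exceeds theta.
    feature = np.concatenate([fv_i, fv_j]).reshape(1, -1)
    conf = forest.predict_proba(feature)[0, 1]
    return conf > theta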

3.4 Geodesic Propagation


We take the definition that the geodesic distance is the smallest integral of a weight function over all possible paths from the seeds to a point [14, 17]. There is no explicit energy to be minimized in our propagation. Our objective is to assign each pixel the category label which has the smallest geodesic distance. The weight W
of an edge on the graph G denotes the smoothness relationship between neighboring superpixels. Many features can be included in the weights. To propagate correct labels to superpixels, we exploit the intra contextual information of the input image by using color, texture and boundary features. These features can indicate the real object probability distribution on the input image to a certain extent [6, 21]. In this section, we will introduce our geodesic propagation procedure, which simultaneously propagates the geodesic distances of all the classes efficiently.

Here we integrate two components to measure the weight of edge W_ij on graph G: the texture component and the boundary component. The weight function W between neighboring vertices is given in Equation 4, where λ1 and λ2 are tuning parameters. Regions of different categories commonly present apparent texture disparities. We measure W_texture(i, j) with the Euclidean distance metric using a texture descriptor consisting of the average HSV colors and 17 filter response features [22]. The reliable Berkeley edge detector [21] is applied, combining color, brightness and texture cues, to capture the boundary confidence. The weight function for the boundary component W_bdry is defined in Equation 5, in which η is the threshold for the boundary confidence P_b(·). We detect the boundaries at the pixel level and then convert these boundary confidences to the superpixel level.

$$ W(i, j) = \lambda_1\, W_{texture}(i, j) + \lambda_2\, W_{bdry}(i, j) \qquad (4) $$

$$ W_{bdry}(i, j) = P_b(i, j, \eta) \qquad (5) $$
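A literal reading of Eqs. 4-5 gives the edge weight below. The treatment of the boundary term (Pb values at or below the threshold contribute nothing) is one plausible interpretation of Eq. 5, since the exact thresholding rule is not fully recoverable from the text; lambda1, lambda2 and eta stand for the tuning parameters and threshold named above.

import numpy as np

def edge_weight(fv_i, fv_j, pb_ij, eta, lambda1=1.0, lambda2=1.0):
    # Eq. 4: hybrid weight = texture term + boundary term.
    w_texture = np.linalg.norm(np.asarray(fv_i, dtype=float) -
                               np.asarray(fv_j, dtype=float))
    # Eq. 5 (as read here): keep the Berkeley boundary confidence Pb only
    # when it exceeds the threshold eta.
    w_bdry = pb_ij if pb_ij > eta else 0.0
    return lambda1 * w_texture + lambda2 * w_bdry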

Our supervised geodesic propagation algorithm starts with the initial geodesic distances and initial labels for all the vertices. In each step of the propagation, vertices which have undetermined labels are put into the unlabeled set Q and sorted to find the current seed that has the minimum geodesic distance. The initial seed for propagation is obtained from the recognition proposal map. Once a vertex is selected as a seed in a step, it is removed from the set Q with its semantic label determined as the current label. The weight of edge W(i, j) and the propagation indicator are integrated in the propagation iteration to decide how to update geodesic distances. In each step of geodesic propagation, the geodesic distances between a labeled seed and its neighboring undetermined superpixels have to be updated according to the corresponding indicator. Suppose i has been labeled as l_i and j has an undetermined label; we then employ the propagation indicator of category l_i and get the indicator value T_l(i, j). If T_l(i, j) is true and W(i, j) is less than a threshold on the edge weight, then the geodesic distance Dis(j) of j is updated as Dis(i) + W(i, j) and the label of j is assigned l_i; otherwise Dis(j) and the label of j keep their current statuses. Figure 2 (h) illustrates the propagation process step by step. The semantic labels are propagated from the initial seed to the entire image after iterating 616 times. Comparing the 233rd, 369th, 496th and 561st iterations, we can see that our method works well around the object edges and our algorithm propagates semantic labels among multiple classes. The algorithm converges in linear time depending on the image size, and we implement a queue as introduced in [26].
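The propagation loop itself can be written as a Dijkstra-style priority-queue traversal. The sketch below follows the description above, with a standard relaxation test (a neighbor's distance is only ever decreased) added as an assumption, and with weight, indicate and edge_threshold standing in for W(i, j), the indicator of Eq. 3 and the edge-weight threshold.

import heapq

def geodesic_propagation(n, neighbors, init_label, init_dist,
                         weight, indicate, edge_threshold):
    # Pop the undetermined superpixel with the smallest geodesic distance,
    # fix its label, then try to relax its undetermined neighbours when the
    # indicator of the seed's class agrees and the edge weight is small enough.
    label, dist = list(init_label), list(init_dist)
    determined = [False] * n
    heap = [(dist[i], i) for i in range(n)]
    heapq.heapify(heap)
    while heap:
        d, i = heapq.heappop(heap)
        if determined[i] or d > dist[i]:
            continue                          # stale heap entry
        determined[i] = True                  # i becomes the current seed
        for j in neighbors[i]:
            if determined[j]:
                continue
            w = weight(i, j)
            if (indicate(label[i], i, j) and w < edge_threshold
                    and dist[i] + w < dist[j]):
                dist[j] = dist[i] + w         # Dis(j) <- Dis(i) + W(i, j)
                label[j] = label[i]
                heapq.heappush(heap, (dist[j], j))
    return label, dist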

[Figure 3 panels: (a) Seed Selection, (b) Before Update Neighbors, (c) After Update Neighbors (all for the 313th iteration), (d) Final Result (the 616th iteration); markers indicate the current seed, the undetermined neighbors of the current seed, and whether they are updated.]

Fig. 3. Illustration of one iteration. The three statuses of the 313th iteration are shown
for detailed illustration. Dark green means superpixels determinately labeled as the class
grass, and dark blue determinately the cow. Light green means superpixels temporarily
labeled as the class grass, and light blue temporarily the cow. (a) In the first status, the
current seed is selected (red curve) and its label is determined as grass. (b) In the second
status, we obtain the undetermined neighboring superpixels of the current seed for the
update of both label and geodesic distance. We only show part of its neighbors as
examples. The intra features of the input image and our indicator jointly decide whether
to update the current label and distance of these neighbors. (c) In the third status, one
neighbor is updated while the other is not. The algorithm is then ready for the next
iteration. (d) In the final iteration (616), we obtain the final result with all superpixels
having their labels determined.

To illustrate how our supervised indicator guides the propagation direction, we
choose one step of the propagation as shown in Figure 3. Dark green denotes
superpixels determinately labeled as the class grass, and dark blue determinately
the cow. Light green denotes superpixels temporarily labeled as the class grass, and
light blue temporarily the cow. The first status is the beginning of the 313th
iteration: the current seed, denoted by the red curve, is selected and its label is
determined as grass. Then, in the second status, the undetermined neighboring
superpixels of the current seed are identified for update. The intra features of the
input image and our indicator jointly decide whether to update the current geodesic
distances and labels of these neighbors. Note that we show the 313th iteration as an
instance; the 314th-615th iterations are omitted. The final iteration (616) is
displayed for easy comparison.

4 Experiments
In this section we investigate the performance of our method on several challenging
datasets and compare our results with several state-of-the-art works. The datasets
are the Cambridge-driving Labeled Video dataset (CamVid) [27], the Microsoft
Research Cambridge (MSRC) dataset [6] and the CBCL StreetScenes (CBCL)
dataset [28]. Table 1 shows our semantic segmentation accuracy compared with
other methods. "Our Method without GP" indicates the accuracy of the proposal
map (without the Geodesic Propagation); "Our Method" indicates the accuracy
after our geodesic propagation procedure. Our method achieves the best results on
the CamVid dataset and the MSRC dataset. Although our method is 1.1% lower
than [11] on the CBCL dataset, it is 3.46% higher than [11] on the CamVid dataset.
Moreover, our method performs significantly better than [3] on the CBCL dataset.

Table 1. Comparison of semantic segmentation accuracy over three datasets

Method                   CamVid dataset   MSRC dataset   CBCL dataset
Zhang et al. [11]        84.4%            -              72.8%
Shotton et al. [22]      -                72.2%          -
Tu [1]                   -                77.7%          -
Shotton et al. [3]       -                -              61.9%
Our Method without GP    86.09%           75.4%          67.7%
Our Method               87.76%           79.2%          71.7%

[Figure 4 legend: Building, Tree, Sky, Car, Sign-symbol, Road, Pedestrian, Fence, Column-pole, Sidewalk, Bicyclist; columns show Input Image, Our Result, Ground Truth.]

Fig. 4. Our results on CamVid dataset. Left column shows the input image and middle
column shows our label transfer result. Ground truth is shown on the right. Legend is
shown on the bottom.

In our experiments, the training procedure of the Joint Boost model takes about
40 seconds per image, and the training of the label propagation scheme for each
image also takes about 40 seconds. Geodesic propagation takes about 5 seconds. On
each dataset, we use 500 rounds to train the Joint Boost model for each test image.
All experiments are run on standard PCs. Some results on the three datasets are
shown in Figure 4, Figure 5 and Figure 6.

4.1 CamVid Dataset


The CamVid dataset is the first collection of videos with object class semantic
labels. It provides 701 still images taken under different lighting conditions (day
and dusk). The images in the original dataset are of size 960×720 and cover 32
object classes. To compare with others, we group the dataset into 11 categories as
[11] did and resize the images to 480×360 pixels. The 11 dominant categories are
building, tree, sky, car, sign-symbol, road, pedestrian, fence, column-pole, sidewalk
and bicyclist. Besides, a void label indicates that a pixel does not belong to the 11
categories. We use 5 similar images for each test image

[Figure 5 legend: Building, Grass, Tree, Cow, Sheep, Sky, Aeroplane, Water, Face, Car, Bike, Flower, Sign, Bird, Book, Chair, Road, Cat, Dog, Body, Boat; columns show Input Image, Our Result, Ground Truth.]

Fig. 5. Our results on MSRC dataset. Left column shows the input image and middle
column shows our label transfer result. Ground truth is shown on the right. Legend is
shown on the bottom.

[Figure 6 legend: Car, Pedestrian, Bicycle, Building, Tree, Sky, Road, Sidewalk, Store; columns show Input Image, Our Result, Ground Truth.]

Fig. 6. Our results on CBCL dataset. Left column shows the input image and middle
column shows our label transfer result. Ground truth is shown on the right. Legend is
shown on the bottom.

to train the propagation indicator. The thresholds θe and θ are set to 0.8 and 0.75,
respectively. The segmentation accuracy of our method is 87.76%, as shown in
Table 1.

4.2 MSRC Dataset

The MSRC dataset is composed of 591 images of 21 object classes. We randomly
split this dataset into 491 training images and 100 testing images. The void label is
used to cope with pixels not belonging to any class in the dataset, and the manual
labeling is not aligned exactly with boundaries. Images in this dataset are about
320×213 pixels. We use 10 similar images for each test image to train the
propagation indicator. The thresholds θe and θ are set to 0.5 and 0.6, respectively.
The segmentation accuracy of our method on this dataset is 79.2%.

Figure 5 shows some results of the proposed method on this dataset. The
comparisons with previous methods are shown in Table 1. Our method is the best
among those methods on this dataset.

4.3 CBCL Dataset

The CBCL StreetScenes dataset contains 3547 still images of street scenes, which
include nine categories: car, pedestrian, bicycle, building, tree, sky, road, sidewalk,
and store. In our test, the pedestrian, bicycle, and store categories are not included,
as in [11]. To compare with [11], we resize the original images to 320×240 pixels.
We use 5 similar images for each test image to train the propagation indicator.
The thresholds θe and θ are set to 0.3 and 0.6, respectively. The segmentation
accuracy of our method is 71.7%, as shown in Table 1. Although our method is not
the best on this dataset (about 1 percent lower than [11]), it performs better than
[3], which has an accuracy of 61.9%. Figure 6 shows the results of our method.

5 Conclusion and Discussion

In this paper we propose supervised geodesic propagation for semantic label
transfer. Following the label transfer idea, given an input image, we first retrieve
its top K-nearest-neighbor similar images from an annotated dataset with GIST
matching. We then infer the recognition proposal map of the input image from the
similar images based on the trained model. The similar image set is not only used
to learn the category proposals, but also used to train the label propagation
scheme. Unlike other label transfer methods which focus on matching the test
image with the similar images, we employ supervised geodesic propagation to
utilize the similar images for semantic labeling of the input image. We then run
geodesic propagation starting from dynamically selected seeds. Experiments on
three datasets show that our method outperforms traditional learning-based
methods and previous label transfer methods for the semantic segmentation task.
Limitations and Future Work. As we assume the similar image set covers all
categories in the input image, our method is sensitive to the retrieved similar
images. If the retrieved images have little similarity with the test image, or the
retrieved images do not contain a category present in the test image at all, our
method will fail. In the future, we will pay attention to how to retrieve a similar
image set of high quality. In addition, some parameters in our implementation are
manually selected; we will also focus on an adaptive parameter selection scheme.

Acknowledgments. In addition to the anonymous reviewers who provided


insightful suggestions, the authors would like to thank Yi Liu for her data pro-
cessing. This work was partially supported by NSFC (60933006), 863 Program
(2012AA011504), R&D Program (2012BAH07B01), Special Program on ITER
(2012GB102008) and BUAA (YWF-12-LKGY-001 and YWF-12-RBYJ-035).

References
1. Tu, Z.: Auto-context and its application to high-level vision tasks. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. IEEE Press, Piscataway (2008)
2. Gould, S., Fulton, R., Koller, D.: Decomposing a Scene into Geometric and Semantically Consistent Regions. In: IEEE International Conference on Computer Vision, pp. 1–8. IEEE Press, Piscataway (2009)
3. Shotton, J., Johnson, M., Cipolla, R.: Semantic texton forests for image categorization and segmentation. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1–8. IEEE Press, Piscataway (2008)
4. Han, F., Zhu, S.-C.: Bottom-up/Top-Down Image Parsing by Attribute Graph Grammar. In: IEEE International Conference on Computer Vision, pp. 1778–1785. IEEE Press, Piscataway (2005)
5. Zhao, P., Fang, T., Xiao, J., Zhang, H., Zhao, Q., Quan, L.: Rectilinear parsing of architecture in urban environment. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 342–349. IEEE Press, Piscataway (2010)
6. Shotton, J., Winn, J., Rother, C., Criminisi, A.: TextonBoost for Image Understanding: Multi-Class Object Recognition and Segmentation by Jointly Modeling Texture, Layout, and Context. Int. J. Comput. Vis. 81, 2–23 (2009)
7. Russell, B., Torralba, A., Murphy, K., Freeman, W.: LabelMe: A database and web-based tool for image annotation. MIT AI Lab Memo (2005)
8. Torralba, A., Fergus, R., Freeman, W.: 80 Million Tiny Images: A Large Data Set for Nonparametric Object and Scene Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 30, 1958–1970 (2008)
9. Liu, C., Yuen, J., Torralba, A.: Nonparametric scene parsing: Label transfer via dense scene alignment. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1972–1979. IEEE Press, Piscataway (2009)
10. Tighe, J., Lazebnik, S.: SuperParsing: Scalable Nonparametric Image Parsing with Superpixels. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part V. LNCS, vol. 6315, pp. 352–365. Springer, Heidelberg (2010)
11. Zhang, H., Xiao, J., Quan, L.: Supervised Label Transfer for Semantic Segmentation of Street Scenes. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part V. LNCS, vol. 6315, pp. 561–574. Springer, Heidelberg (2010)
12. Zhang, H., Fang, T., Chen, X., Zhao, Q., Quan, L.: Partial similarity based nonparametric scene parsing in certain environment. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2241–2248. IEEE Press, Piscataway (2011)
13. Oliva, A., Torralba, A.: Building the gist of a scene: the role of global image features in recognition. Prog. Brain Res. 155, 23–36 (2006)
14. Bai, X., Sapiro, G.: A geodesic framework for fast interactive image and video segmentation and matting. In: IEEE International Conference on Computer Vision, pp. 1–8 (2007)
15. Price, B., Morse, B., Cohen, S.: Geodesic Graph Cut for Interactive Image Segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 3161–3168. IEEE Press, Piscataway (2010)
16. Gulshan, V., Rother, C., Criminisi, A., Blake, A., Zisserman, A.: Geodesic Star Convexity for Interactive Image Segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 3129–3136. IEEE Press, Piscataway (2010)

17. Chen, X., Zhao, D., Zhao, Y., Lin, L.: Accurate semantic image labeling by fast geodesic propagation. In: IEEE International Conference on Image Processing, pp. 4021–4024. IEEE Press, Piscataway (2009)
18. Oliva, A., Torralba, A.: Modeling the shape of the scene: a holistic representation of the spatial envelope. Int. J. Comput. Vis. 42, 145–175 (2001)
19. Xiao, J., Quan, L.: Multiple view semantic segmentation for street view images. In: Proceedings of 12th IEEE International Conference on Computer Vision, pp. 686–693. IEEE Press, Piscataway (2009)
20. Xiao, J., Fang, T., Zhao, P., Lhuillier, M., Quan, L.: Image-based street-side city modeling. ACM Trans. Graph. 28, 1–12 (2009)
21. Arbelaez, P., Maire, M., Fowlkes, C., Malik, J.: Contour Detection and Hierarchical Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 33, 898–916 (2011)
22. Shotton, J., Winn, J.M., Rother, C., Criminisi, A.: TextonBoost: Joint Appearance, Shape and Context Modeling for Multi-class Object Recognition and Segmentation. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006, Part I. LNCS, vol. 3951, pp. 1–15. Springer, Heidelberg (2006)
23. Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001)
24. Liaw, A., Wiener, M.: Classification and Regression by randomForest. R News 2, 18–22 (2002)
25. Jaiantilal, A.: Classification and regression by randomforest-matlab (2009), http://code.google.com/p/randomforest-matlab
26. Yatziv, L., Bartesaghi, A., Sapiro, G.: O(n) implementation of the fast marching algorithm. J. Comput. Phys. 212, 393–399 (2006)
27. Brostow, G., Fauqueur, J., Cipolla, R.: Semantic Object Classes in Video: A High-Definition Ground Truth Database. Pattern Recognit. Lett. 30, 88–97 (2009)
28. Bileschi, S.: CBCL StreetScenes challenge framework (2007), http://cbcl.mit.edu/software-datasets/streetscenes/
Bayesian Face Revisited: A Joint Formulation

Dong Chen¹, Xudong Cao³, Liwei Wang², Fang Wen³, and Jian Sun³
¹ University of Science and Technology of China, chendong@mail.ustc.edu.cn
² The Chinese University of Hong Kong, lwwang@cse.cuhk.edu.hk
³ Microsoft Research Asia, Beijing, China, {xudongca,fangwen,jiansun}@microsoft.com

Abstract. In this paper, we revisit the classical Bayesian face recognition method
by Baback Moghaddam et al. and propose a new joint formulation. The classical
Bayesian method models the appearance difference between two faces. We observe
that this difference formulation may reduce the separability between classes.
Instead, we model two faces jointly with an appropriate prior on the face
representation. Our joint formulation leads to an EM-like model learning at
training time and an efficient, closed-form computation at test time. In extensive
experimental evaluations, our method is superior to the classical Bayesian face and
many other supervised approaches. Our method achieves 92.4% test accuracy on
the challenging Labeled Faces in the Wild (LFW) dataset. Compared with the
current best commercial system, we reduce the error rate by 10%.

1 Introduction

Face verification and face identification are two sub-problems in face recognition.
The former is to verify whether two given faces belong to the same person, while
the latter answers the "who is who" question for a probe face set given a gallery
face set. In this paper, we focus on the verification problem, which is more widely
applicable and lays the foundation for the identification problem.
Bayesian face recognition [1] by Baback Moghaddam et al. is one of the
representative and successful face verification methods. It formulates the
verification task as a binary Bayesian decision problem. Let HI represent the
intra-personal (same) hypothesis that two faces x1 and x2 belong to the same
subject, and HE the extra-personal (not same) hypothesis that two faces are from
different subjects. Then, the face verification problem amounts to classifying the
difference Δ = x1 - x2 as intra-personal variation or extra-personal variation. Based
on the MAP (Maximum a Posteriori) rule, the decision is made by testing a log
likelihood ratio r(x1, x2):

r(x1, x2) = log [ P(Δ | HI) / P(Δ | HE) ].    (1)
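Under the Gaussian assumption of [1], Eqn. (1) is a difference of Gaussian log-densities of Δ. A minimal numpy sketch, assuming the two covariances have already been estimated from training differences (the function names are illustrative):

import numpy as np

def log_gaussian(delta, S):
    # log of a zero-mean Gaussian density N(delta; 0, S)
    sign, logdet = np.linalg.slogdet(S)
    quad = delta @ np.linalg.solve(S, delta)
    return -0.5 * (quad + logdet + len(delta) * np.log(2 * np.pi))

def classical_bayesian_score(x1, x2, Sigma_I, Sigma_E):
    # r(x1, x2) of Eqn. (1): log P(delta | H_I) - log P(delta | H_E)
    delta = np.asarray(x1) - np.asarray(x2)
    return log_gaussian(delta, Sigma_I) - log_gaussian(delta, Sigma_E)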

The above ratio can also be considered as a probabilistic measure of similarity
between x1 and x2 for the face verification problem. In [1], the two conditional
probabilities in Eqn. (1) are modeled as Gaussians, and eigen analysis is used for
model learning and efficient computation.
Because of the simplicity and competitive performance [2] of the Bayesian face,
further progress has been made along this research line. For example, Wang and
Tang [3] propose a unified framework for subspace face recognition which
decomposes the face difference into three subspaces: intrinsic difference,
transformation difference and noise. By excluding the transformation difference
and noise and retaining the intrinsic difference, better performance is obtained. In
[4], a random subspace is introduced to handle the multi-modal and
high-dimension problem. The appearance difference can also be computed in any
feature space, such as the Gabor feature space [5]. Instead of using a naive
Bayesian classifier, an SVM is trained in [6] to classify the difference face, which is
projected and whitened in an intra-person subspace.
However, all the above Bayesian face methods are based on the difference of a
given face pair. As illustrated by a 2D example in Fig. 1, modeling the difference is
equivalent to first projecting all 2D points onto a 1D line (x - y) and then
performing classification in 1D. While such a projection can capture the major
discriminative information, it may reduce the separability. Therefore, the power of
the Bayesian face framework may be limited by discarding discriminative
information that is visible when the two classes are viewed jointly in the original
feature space.

Fig. 1. The 2-D data is projected to 1-D by x - y. The two classes, which are separable in
the joint representation, become inseparable after the projection. Class 1 and Class 2
can be thought of as the intra-personal and extra-personal hypotheses in face recognition.

In this paper, we propose to directly model the joint distribution of {x1, x2} for
the face verification problem in the same Bayesian framework. We introduce an
appropriate prior on the face representation: each face is the summation of two
independent Gaussian latent variables, i.e., an intrinsic variable for identity and an
intra-personal variable for within-person variation. Based on this prior, we can
effectively learn the parametric models of the two latent variables with an EM-like
algorithm. Given the learned models, we can obtain the joint distributions of {x1, x2}

and derive a closed-form expression for the log likelihood ratio, which makes the
computation efficient in the test phase.
We also find interesting connections between our joint Bayesian formulation and
two other types of face verification methods: metric learning [7–10] and
reference-based methods [11–14]. On one hand, the similarity metric derived from
our joint formulation goes beyond the standard form of the Mahalanobis distance.
The new similarity metric preserves the separability in the original feature space
and leads to better performance. On the other hand, the joint Bayesian method
can be viewed as a kind of reference model with a parametric form.
Many supervised approaches, including ours, need good training data which
contain sufficient intra-person and extra-person variations. Good training data
should be both wide and deep: having a large number of different subjects and
having enough images of each subject. However, the current large face datasets in
the wild condition suffer from either small width (PubFig [11]) or small depth
(Labeled Faces in the Wild (LFW) [15]). To address this issue, we introduce a new
dataset, the Wide and Deep Reference dataset (WDRef), which is both wide
(around 3,000 subjects) and deep (2,000+ subjects with over 15 images, 1,000+
subjects with more than 40 images). To facilitate further research and evaluation
of supervised methods on the same test bed, we also share two kinds of extracted
low-level features of this dataset. The whole dataset can be downloaded from our
project website http://home.ustc.edu.cn/~chendong/JointBayesian/.
Our main contributions can be summarized as follows:
- A joint formulation of the Bayesian face with an appropriate prior on the face
representation. The joint model can be effectively learned from large-scale,
high-dimension training data, and verification can be efficiently performed with
the derived closed-form solution.
- We demonstrate that our joint Bayesian face outperforms state-of-the-art
supervised methods, through comprehensive comparisons on LFW and WDRef.
Our simple system achieves better average accuracy than the current best
commercial system (face.com) [16], which leverages an accurate 3D reconstruction
and billions of training images (details have not been published).
- A large dataset (with annotations and extracted low-level features) which is both
wide and deep is released.

2 Our Approach: A Joint Formulation


In this section, we first present a naive joint formulation and then introduce our
core joint formulation and model learning algorithm.

2.1 A Naive Formulation


A straightforward joint formulation is to directly model the joint distribution of
{x1, x2} as a Gaussian. Thus, we have P(x1, x2 | HI) = N(0, ΣI) and
P(x1, x2 | HE) = N(0, ΣE), where the covariance matrices ΣI and ΣE can be
estimated from the intra-personal pairs and extra-personal pairs respectively. The
mean of all faces is subtracted in the preprocessing step. At test time, the log
likelihood ratio between the two probabilities is used as the similarity metric. As
will be seen in later experiments, this naive formulation is moderately better than
the conventional Bayesian face.
In the above formulation, the two covariance matrices are directly estimated from
the data statistics. There are two factors which may limit its performance. First,
suppose the face is represented as a d-dimensional feature; in the naive
formulation, we need to estimate the covariance matrix in the higher-dimensional
(2d) feature space of [x1, x2]. We have a higher chance of obtaining a less reliable
statistic since we do not have sufficiently many independent training samples.
Second, since our collected training samples may not be completely independent,
the estimated ΣE may not be block-diagonal, although in theory it should be,
because x1 and x2 are statistically independent.
To deal with these issues, we next introduce a simple prior on the face
representation to form a new joint Bayesian formulation. The resulting model can
be more reliably and accurately learned.

Fig. 2. Prior on the face representation: both the identity distribution (left) and the
within-person variation (right) are modeled by Gaussians. Each face instance is
represented by the sum of its identity and its within-person variation.

2.2 A Joint Formulation


As already observed and used in previous works [17–20], the appearance of a face
is influenced by two factors: identity and intra-personal variation, as shown in
Fig. 2. A face is represented by the sum of two independent Gaussian variables:

x = μ + ε,    (2)

where x is the observed face with the mean of all faces subtracted, μ represents its
identity, and ε is the face variation (e.g., lightings, poses, and expressions) within
the same identity. Here, the latent variables μ and ε follow two Gaussian
distributions N(0, Sμ) and N(0, Sε), where Sμ and Sε are two unknown covariance
matrices. For brevity, we call the above representation and the associated
assumptions a face prior.
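A minimal sketch of this prior, useful only to generate toy data for the formulas that follow; the dimensions, subject counts and covariances are arbitrary assumptions, not data from the paper.

import numpy as np

rng = np.random.default_rng(0)
d, n_id, n_per_id = 4, 200, 6
A_mu, A_eps = rng.normal(size=(d, d)), 0.5 * rng.normal(size=(d, d))
S_mu_true, S_eps_true = A_mu @ A_mu.T, A_eps @ A_eps.T          # assumed ground-truth covariances
ids = rng.multivariate_normal(np.zeros(d), S_mu_true, n_id)     # one identity mu per subject
faces = np.array([mu + rng.multivariate_normal(np.zeros(d), S_eps_true)
                  for mu in ids for _ in range(n_per_id)])      # observed faces x = mu + eps
labels = np.repeat(np.arange(n_id), n_per_id)                   # subject label of each face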

Joint Formulation with Prior. Given the above prior, no matter under which
hypothesis, the joint distribution of {x1, x2} is also a Gaussian with zero mean.
Based on the linear form of Eqn. (2) and the independence assumption between μ
and ε, the covariance of two faces is:

cov(xi, xj) = cov(μi, μj) + cov(εi, εj),   i, j ∈ {1, 2}.    (3)

Under the HI hypothesis, the identities μ1, μ2 of the pair are the same and their
intra-personal variations ε1, ε2 are independent. Considering Eqn. (3), the
covariance matrix of the distribution P(x1, x2 | HI) can be derived as:

ΣI = [ Sμ + Sε   Sμ ; Sμ   Sμ + Sε ].

Under HE, both the identities and the intra-personal variations are independent.
Hence, the covariance matrix of the distribution P(x1, x2 | HE) is

ΣE = [ Sμ + Sε   0 ; 0   Sμ + Sε ].
With the above two conditional joint probabilities, the log likelihood ratio r(x1, x2)
can be obtained in closed form after simple algebraic manipulation:

r(x1, x2) = log [ P(x1, x2 | HI) / P(x1, x2 | HE) ] = x1^T A x1 + x2^T A x2 - 2 x1^T G x2,    (4)

where

A = (Sμ + Sε)^(-1) - (F + G),    (5)

[ F + G   G ; G   F + G ] = [ Sμ + Sε   Sμ ; Sμ   Sμ + Sε ]^(-1).    (6)

Note that in Eqn. (4) the constant term is omitted for simplicity.
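The closed-form metric can be sketched directly from Eqns. (4)-(6): build the 2d×2d joint covariance, invert it to read off F + G and G, and score a pair. The function names are illustrative only.

import numpy as np

def joint_bayesian_metric(S_mu, S_eps):
    # Compute A and G of Eqns. (5)-(6) from the learned covariances.
    d = S_mu.shape[0]
    Sigma = np.block([[S_mu + S_eps, S_mu],
                      [S_mu, S_mu + S_eps]])
    inv = np.linalg.inv(Sigma)                    # [[F+G, G], [G, F+G]], Eqn. (6)
    F_plus_G, G = inv[:d, :d], inv[:d, d:]
    A = np.linalg.inv(S_mu + S_eps) - F_plus_G    # Eqn. (5)
    return A, G

def log_likelihood_ratio(x1, x2, A, G):
    # r(x1, x2) of Eqn. (4), with the constant term omitted
    return x1 @ A @ x1 + x2 @ A @ x2 - 2 * x1 @ G @ x2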
There are three interesting properties of this log likelihood ratio metric. Readers
can refer to the supplementary materials for the proofs.
- Both matrices A and G are negative semi-definite.
- The negative log likelihood ratio degrades to a Mahalanobis distance if A = G.
- The log likelihood ratio metric is invariant to any full-rank linear transform of the feature.
We will discuss the new metric obtained by the log likelihood ratio, compared with
the Mahalanobis distance, further in Section 3.

2.3 Model Learning


Sμ and Sε are two unknown covariance matrices which need to be learned from the
data. We develop an EM-like algorithm to jointly estimate the two matrices in our
model.
E-Step: for each subject with m images, the relationship between the latent
variables h = [μ; ε1; …; εm] and the observations x = [x1; …; xm] is:

x = Ph,   where   P = [ I  I  0  …  0 ; I  0  I  …  0 ; …  ; I  0  0  …  I ].    (7)
The distribution of the hidden variable h is

h ∼ N(0, Σh),

where Σh = diag(Sμ, Sε, …, Sε). Therefore, based on Eqn. (7), we have the
distribution of x:

x ∼ N(0, Σx),

where

Σx = [ Sμ + Sε   Sμ   …   Sμ ; Sμ   Sμ + Sε   …   Sμ ; … ; Sμ   Sμ   …   Sμ + Sε ].

Given the observation x, the expectation of the hidden variable h is:

E(h | x) = Σh P^T Σx^(-1) x.    (8)
Directly computing Eqn. (8) is expensive because the complexity in memory and
computation is O(m²d²) and O(m³d³) respectively, where d is the dimension of the
feature. Fortunately, by taking advantage of the structure of Σh, P^T and Σx, the
complexity in memory and computation can be reduced to O(d²) and O(d³ + md²)
respectively. Readers can refer to the supplementary materials for the details.
M-Step: in this step, we update the values of the parameters θ = {Sμ, Sε}:

Sμ = cov(μ),   Sε = cov(ε),

where μ and ε are the expectations of the latent variables estimated in the E-step.
Initialization: In the implementation, Sμ and Sε are initialized by random positive
definite matrices, such as the covariance matrices of random data. We will discuss
the robustness of the EM algorithm further in the next section.
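The EM procedure above can be sketched as follows. This naive version builds Σx for each subject explicitly, so it is only meant for small toy data (e.g. the synthetic faces and labels generated in the earlier sketch); the paper's optimized version exploits the structure of Σh, P^T and Σx and avoids these large matrices. All names are illustrative assumptions.

import numpy as np

def em_joint_bayesian(X, labels, n_iter=20):
    d = X.shape[1]
    X = X - X.mean(0)                               # mean face is subtracted
    S_mu = np.cov(X.T) / 2 + 1e-3 * np.eye(d)       # any positive definite initialization works
    S_eps = np.cov(X.T) / 2 + 1e-3 * np.eye(d)
    for _ in range(n_iter):
        mus, epss = [], []
        for s in np.unique(labels):                 # E-step, one subject at a time
            x = X[labels == s]                      # m images of one subject
            m = len(x)
            P = np.hstack([np.tile(np.eye(d), (m, 1)), np.eye(m * d)])   # Eqn. (7)
            Sigma_h = np.zeros(((m + 1) * d, (m + 1) * d))
            Sigma_h[:d, :d] = S_mu
            Sigma_h[d:, d:] = np.kron(np.eye(m), S_eps)                  # diag(S_mu, S_eps, ..., S_eps)
            Sigma_x = P @ Sigma_h @ P.T
            h = Sigma_h @ P.T @ np.linalg.solve(Sigma_x, x.reshape(-1))  # Eqn. (8)
            mus.append(h[:d])
            epss.extend(h[d:].reshape(m, d))
        S_mu = np.cov(np.array(mus).T)              # M-step: S_mu = cov(mu)
        S_eps = np.cov(np.array(epss).T)            #          S_eps = cov(eps)
    return S_mu, S_eps

# Possible usage on the toy data: S_mu, S_eps = em_joint_bayesian(faces, labels)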

2.4 Discussion
In this section, we discuss three aspects of our joint formulation.
Robustness of EM. Generally speaking, the EM algorithm only guarantees
convergence. The quality of the solution depends on the objective function (the
likelihood function of the observations) and the initialization. Here we empirically
study the robustness of the EM algorithm with the following five experiments:
1. Sμ and Sε are set to the between-class and within-class matrices (no EM).
2. Sμ and Sε are initialized by the between-class and within-class matrices.
3. Sμ is initialized by the between-class matrix and Sε is initialized randomly.
4. Sμ is initialized randomly and Sε is initialized by the within-class matrix.
5. Both Sμ and Sε are initialized randomly.
These experiments use the training dataset (WDRef) and the testing dataset
(LFW, under the unrestricted protocol). We use PCA to reduce the dimension of
the feature to 2,000. The rest of the experimental settings are the same as those in
Section 4.2.
The experimental results in Table 1 show that: 1) EM optimization improves the
performance (87.8% to 89.44%); 2) EM optimization is robust to various
initializations. The estimated parameters converge to similar solutions from quite
different starting points. This indicates that our objective function may have very
good convexity properties.

Table 1. Evaluation of the robustness of the EM algorithm. Results of experiments 3–5
are averages over 10 trials

Experiments   1 (no EM)   2       3                4                5
Accuracy      87.80%      89.4%   89.34% ± 0.15%   89.38% ± 0.20%   89.39% ± 0.08%

Adaptive Subspace Representation. In our model, the space structures of μ and ε
are fully encoded in the covariance matrices Sμ and Sε. Even if μ and ε live in
lower-dimensional intrinsic spaces, we do not need to pre-define intrinsic
dimensions for them. The covariance matrices automatically adapt to the intrinsic
structure of the data. Therefore our model is free from parameter tuning for the
number of intrinsic dimensions and able to fully exploit the discriminative
information in the whole space.
Efficient Computation. As the matrix G is negative definite, we can decompose it
as G = -U^T U after training. At test time, we can represent each face by a
transformed vector y = Ux and a scalar c = x^T A x. Then, the log likelihood ratio
can be computed very efficiently as c1 + c2 + y1^T y2.
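A sketch of this test-time factorization is given below. The √2 scaling of U is our choice so that the compact score equals r(x1, x2) of Eqn. (4) exactly; the authors may absorb constants differently, and the function names are assumptions.

import numpy as np

def precompute_transform(G):
    # Decompose -G (positive semi-definite) so that the pair score reduces to c1 + c2 + y1.y2
    w, V = np.linalg.eigh(-G)                                     # -G = V diag(w) V^T, w >= 0
    U = np.sqrt(2.0) * (np.sqrt(np.clip(w, 0, None))[:, None] * V.T)
    return U                                                      # U^T U = -2G

def encode(x, A, U):
    return U @ x, x @ A @ x                                       # y = Ux, c = x^T A x

def fast_score(enc1, enc2):
    (y1, c1), (y2, c2) = enc1, enc2
    return c1 + c2 + y1 @ y2                                      # equals r(x1, x2) up to numerics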

3 Relationships with Other Works


In this section, we discuss the connections between our joint Bayesian formula-
tion and three other types of leading supervised methods which are widely used
in face recognition.

3.1 Connection with Metric Learning


Metric learning [7–10, 21] has recently attracted a lot of attention in face
recognition. The goal of metric learning is to find a new metric that makes two
classes more separable. One of the main branches is to learn a Mahalanobis distance:

(x1 - x2)^T M (x1 - x2),    (9)

where M is a positive definite matrix which parameterizes the Mahalanobis distance.
As we can see from the above equation, this method shares the same drawback as
the conventional Bayesian face: both first project the joint representation to a
lower dimension by the transform [I, -I]. As we have discussed, this transformation
may reduce the separability and degrade the accuracy.
In our method, using the joint formulation, the metric in Eqn. (4) is free from this
disadvantage. To make the connection clearer, we reformulate Eqn. (4) as

(x1 - x2)^T A (x1 - x2) + 2 x1^T (A - G) x2.    (10)


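A quick numerical check that Eqn. (10) is only a rewriting of Eqn. (4); the symmetric negative semi-definite matrices below are random placeholders standing in for the learned A and G.

import numpy as np

rng = np.random.default_rng(1)
d = 5
A = rng.normal(size=(d, d)); A = -(A @ A.T)        # placeholder symmetric negative semi-definite A
G = rng.normal(size=(d, d)); G = -(G @ G.T)        # placeholder symmetric negative semi-definite G
x1, x2 = rng.normal(size=d), rng.normal(size=d)
eq4 = x1 @ A @ x1 + x2 @ A @ x2 - 2 * x1 @ G @ x2
eq10 = (x1 - x2) @ A @ (x1 - x2) + 2 * x1 @ (A - G) @ x2
assert np.isclose(eq4, eq10)                       # the two expressions agree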
Comparing Eqn. (9) and Eqn. (10), we see that the joint formulation provides an
additional degree of freedom for the discriminant surface. The new metric can be
viewed as a more general distance which better preserves the separability.
To further understand our new metric in Eqn. (4), we rewrite it here:

x1^T A x1 + x2^T A x2 - 2 x1^T G x2.

There are two components in the metric: the cross inner product term x1^T G x2 and
the two norm terms x1^T A x1 and x2^T A x2. To investigate their roles, we perform an
experiment under five conditions: a) use both A and G; b) A → 0; c) G → 0;
d) A → G; e) G → A. The experimental results are shown in Table 2.
We can observe two things: most of the discriminative information lies in the cross
inner product term x1^T G x2; the norm terms x1^T A x1 and x2^T A x2 also play
significant roles. They serve as image-specific adjustments to the decision boundary.
There are many works trying to develop different learning methods for the
Mahalanobis distance. However, relatively few works investigate other forms of
metrics like ours. A recent work [22] explores a metric based on cosine similarity
learned discriminatively. Their promising results may inspire us to learn the log
likelihood ratio metric in a discriminative way in future work.

Table 2. Roles of the matrices A and G in the log likelihood ratio metric. Both training
and testing are conducted on LFW following the unrestricted protocol. The SIFT feature
is used and its dimension is reduced to 200 by PCA. Readers can refer to Section 4.4 for
more information.

Experiments   A and G   A → 0    G → 0    A → G    G → A
Accuracy      87.5%     84.93%   55.63%   83.73%   84.85%

3.2 Connection with LDA and Probabilistic LDA


Linear Discriminant Analysis (LDA) [17] learns discriminative projection directions
by maximizing the between-class variation and minimizing the within-class
variation. The projections are the solutions of an eigen problem. The discriminative
power of the projections decreases rapidly with the corresponding eigenvalue, which
makes the distribution of discriminative power very unbalanced. This
heterogeneous property makes it hard to find an appropriate metric that fully
exploits the discriminative information in all projections. Usually, LDA obtains its
best performance with only a few top eigenvectors, and its performance decreases
if more projections are added, even though they still contain useful information for
discrimination.
Probabilistic LDA (PLDA) [18, 19] uses factor analysis to decompose the face into
three factors: x = Bα + Wβ + ε, i.e., identity Bα, intra-personal variation Wβ and
noise ε. The latent variables α, β and ε are assumed to be Gaussian. The
covariance matrices of α and β are identity matrices and the covariance matrix of ε
is a diagonal matrix. B, W and the noise covariance are unknown parameters
which need to be learned. Similar to LDA, PLDA works well if the dimension of
the hidden variables is low, but its performance rapidly degrades if the dimension
of the hidden variables is high, in which case it is hard to get reliable estimates of
the parameters.
In contrast to LDA and PLDA, we do not make the low-dimension assumption and
treat each dimension equally. Our joint model can reliably estimate the parameters
in high dimensions and does not require elaborate initialization. These two
properties enable us to fully exploit the discriminative information of the
high-dimensional feature and achieve better performance.

3.3 Connection with Reference Based Methods


Reference-based methods [11–14] represent a face by its similarities to a set of
reference faces. For example, in [11], each reference is represented by an SVM
Simile classifier which is trained from multiple images of the same person. The
score of the SVM is used as the similarity.
From the Bayesian view, if we model each reference as a Gaussian N(μi, Sε), then
the similarity from a face x to each reference is the conditional likelihood
P(x | μi). Given n references, x can be represented as [P(x | μ1), …, P(x | μn)].
With this reference-based representation, we can define the similarity between two
faces {x1, x2} as the following log likelihood ratio:

log { (1/n) Σ_{i=1}^{n} P(x1 | μi) P(x2 | μi) } / { [ (1/n) Σ_{i=1}^{n} P(x1 | μi) ] [ (1/n) Σ_{i=1}^{n} P(x2 | μi) ] }.    (11)

If we consider the references to be infinite and independently sampled from a
distribution P(μ), the above equation can be rewritten as:

log { ∫ P(x1 | μ) P(x2 | μ) P(μ) dμ } / { [ ∫ P(x1 | μ) P(μ) dμ ] [ ∫ P(x2 | μ) P(μ) dμ ] }.    (12)

Interestingly, when P(μ) is a Gaussian, the above metric is equivalent to the metric
derived from our joint Bayesian formulation in Eqn. (4) (the proof can be found in
the supplementary material). Hence, our method can be considered as a kind of
probabilistic reference-based method, with infinite references, under Gaussian
assumptions.

4 Experimental Results

In this section, we compare our joint Bayesian approach with conventional


Bayesian face and other competitive supervised methods.

4.1 Dataset

Labeled Faces in the Wild (LFW) [15] contains face images of celebrities collected
from the Internet with large variations in pose, expression, and lighting. There are
a total of 5749 subjects, but only 95 persons have more than 15 images; for 4069
identities, just one image is available.
In this work, we introduce a new dataset, WDRef, to relieve this depth issue. We
collect and annotate face images from image search engines by querying a set of
people names. We emphasize that there is no overlap between our queries and the
names in LFW. The faces are then detected by a face detector and rectified by an
affine transform estimated from five landmarks, i.e., the eyes, nose and mouth
corners. The landmarks are detected by [23]. We extract two kinds of low-level
features, LBP [24] and LE [25], on the rectified holistic face.
Our final dataset contains 99,773 images of 2995 people, and 2065 of them have
more than 15 images. The large width and depth of this dataset will enable
researchers to better develop and evaluate supervised algorithms, which need
sufficient intra-personal and extra-personal variations. We will make the extracted
features publicly available.

Fig. 3. Sample images in WDRef dataset

4.2 Comparison with Other Bayesian Face Methods

In the first experiment, we compare the conventional Bayesian face, the naive joint
formulation, Wang and Tang's unified subspace work [3], and our joint Bayesian
method. All methods are tested on two datasets: LFW and WDRef. When testing
on LFW, all identities in WDRef are used for training. We vary the number of
identities in the training data from 600 to 3000 to study the performance w.r.t. the
training data size. When testing on WDRef, we split it into two mutually exclusive
parts: 300 subjects are used for testing and the others for training. Similar to the
protocol of LFW, the test images are divided into 10 cross-validation sets, and each
set contains 300 intra-personal and extra-personal pairs. We use the LBP feature
and reduce the feature dimension by PCA to the best dimension for each method
(2000 for joint Bayesian and unified subspace, 400 for the Bayesian and naive
formulation methods).

[Figure 4 plots: verification accuracy vs. number of training identities (600–3000) on LFW (left) and WDRef (right); curves: classic Bayesian [1], unified subspace [3], naive formulation, joint Bayesian formulation.]

Fig. 4. Comparison with other Bayesian-face-related works. The joint Bayesian method
is consistently better across different training data sizes and on both databases:
LFW (left) and WDRef (right).

As shown in Figure 4, by enforcing an appropriate prior on the face representation,
our proposed joint Bayesian method is substantially better for various training
data sizes. The unified subspace method stands in second place by taking
advantage of subspace selection, i.e., retaining the identity component and
excluding the intra-person variation component and noise. We also note that when
the training data size is small, the naive formulation is the worst. The reason is
that it needs to estimate more parameters in a higher dimension. However, as the
training data increase, the performance of the conventional Bayesian and unified
subspace methods (which use the difference of face pairs) gradually saturates. In
contrast, the performance of the naive joint formulation keeps increasing as the
training data increase; it surpasses the conventional Bayesian method and
approaches the unified subspace method. The trend of the joint Bayesian method
follows the same pattern as the naive joint formulation. This observation strongly
demonstrates that the joint formulation improves discriminability.

4.3 Comparison with LDA and PLDA


We use the same experiment setting (trained on WDRef) as described in Section
4.2. LDA is based our own implementation and PLDA is from authorss imple-
Bayesian Face Revisited: A Joint Formulation 577

mentation [19]. In the experiment, the original 5900 dimension LBP feature is
reduced to the best dimension by PCA for each method (2000 for all methods).
For LDA and PLDA, we further traverse to get the best sub-space dimension
(100 in our experiment). As shown in Figure 5, our method signicantly out-
performs the other two methods for any size of training data. Both PLDA and
LDA only use the discriminative information in such a low dimension space and
ignore the other dimensions even though there are useful information in them.
On the contrary, our method treats each dimension equally and leverages high
dimension feature.

[Figure 5 plots: verification accuracy vs. number of training identities on LFW (left) and WDRef (right); curves: LDA [17], PLDA [19], Joint Bayesian.]

Fig. 5. Comparison with LDA and PLDA on LFW (left) and WDRef (right)

4.4 Comparison under LFW Unrestricted Protocol

In order to compare fairly with other approaches published on the LFW website,
in this experiment we follow the LFW unrestricted protocol, using only LFW for
training. We combine the scores of 4 descriptors (SIFT, LBP, TPLBP and FPLBP)
with a linear SVM classifier. The same feature combination can also be found in
[12]. As shown in Table 3, our joint Bayesian method achieves the highest accuracy.

Table 3. Comparison with state-of-the-art methods following the LFW unrestricted
protocol. The results of other methods are taken from their original papers.

Method     LDML-MkNN [9]   Multishot [12]   PLDA [19]   Joint Bayesian
Accuracy   87.5%           89.50%           90.07%      90.90%

4.5 Training with Outside Data

Finally, we present our best performance on LFW along with existing
state-of-the-art methods and systems. Our purposes are threefold: 1) to verify the
generalization ability from one dataset to another; 2) to see what we can achieve
using limited outside training data; 3) to compare with other methods which also
rely on outside training data.
We follow the standard unrestricted protocol of LFW and simply combine the four
similarity scores computed from the LBP feature and three types of LE [25]
features. As shown by the ROC curves in Figure 6, our approach with the simple
joint Bayesian formulation is ranked No. 1 and achieves 92.4% accuracy. The error
rate is reduced by over 10% compared with the current best (commercial) system,
which takes the additional advantage of an accurate 3D normalization and billions
of training samples.

[Figure 6 ROC plot: true positive rate vs. false positive rate; curves: Attribute and Simile classifiers [11], Associate-Predict [13], face.com r2011b [16], Joint Bayesian (combined).]

Fig. 6. The ROC curve of the joint Bayesian method compared with state-of-the-art
methods that also rely on outside training data, on LFW

5 Conclusions
In this paper, we have revisited the classic Bayesian face recognition method and
proposed a joint formulation in the same probabilistic framework. The superior
performance in comprehensive evaluations shows that classic Bayesian face
recognition is still highly competitive, given modern low-level features and training
data of moderate size.

References
1. Moghaddam, B., Jebara, T., Pentland, A.: Bayesian face recognition. Pattern Recognition 33, 1771–1782 (2000)
2. Phillips, P.J., Moon, H., Rizvi, S.A., Rauss, P.J.: The FERET evaluation methodology for face-recognition algorithms. PAMI 22, 1090–1104 (2000)
3. Wang, X., Tang, X.: A unified framework for subspace face recognition. PAMI 26, 1222–1228 (2004)
4. Wang, X., Tang, X.: Subspace analysis using random mixture models. In: CVPR (2005)
5. Wang, X., Tang, X.: Bayesian face recognition using Gabor features, pp. 70–73 (2003)
6. Li, Z., Tang, X.: Bayesian face recognition using support vector machine and face clustering. In: CVPR (2004)
7. Weinberger, K.Q., Blitzer, J., Saul, L.K.: Distance metric learning for large margin nearest neighbor classification, vol. 10, pp. 207–244 (2005)
8. Davis, J.V., Kulis, B., Jain, P., Sra, S., Dhillon, I.S.: Information-theoretic metric learning. In: ICML (2007)
9. Guillaumin, M., Verbeek, J.J., Schmid, C.: Is that you? Metric learning approaches for face identification. In: ICCV (2009)
10. Ying, Y., Li, P.: Distance metric learning with eigenvalue optimization. Journal of Machine Learning Research 13, 1–26 (2012)
11. Kumar, N., Berg, A.C., Belhumeur, P.N., Nayar, S.K.: Attribute and simile classifiers for face verification. In: ICCV (2009)
12. Taigman, Y., Wolf, L., Hassner, T.: Multiple one-shots for utilizing class label information. In: BMVC (2009)
13. Yin, Q., Tang, X., Sun, J.: An associate-predict model for face recognition. In: CVPR (2011)
14. Zhu, C., Wen, F., Sun, J.: A rank-order distance based clustering algorithm for face tagging. In: CVPR (2011)
15. Huang, G.B., Ramesh, M., Berg, T., Learned-Miller, E., Hanson, A.: Labeled faces in the wild: A database for studying face recognition in unconstrained environments. In: ECCV (2008)
16. Taigman, Y., Wolf, L.: Leveraging billions of faces to overcome performance barriers in unconstrained face recognition. Arxiv preprint arXiv:1108.1122 (2011)
17. Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J.: Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. PAMI 19, 711–720 (1997)
18. Ioffe, S.: Probabilistic Linear Discriminant Analysis. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006, Part IV. LNCS, vol. 3954, pp. 531–542. Springer, Heidelberg (2006)
19. Prince, S., Li, P., Fu, Y., Mohammed, U., Elder, J.: Probabilistic models for inference about identity. PAMI 34, 144–157 (2012)
20. Susskind, J., Memisevic, R., Hinton, G., Pollefeys, M.: Modeling the joint density of two images under a variety of transformations. In: CVPR (2011)
21. Ramanan, D., Baker, S.: Local distance functions: A taxonomy, new algorithms, and an evaluation. In: ICCV (2009)
22. Nguyen, H.V., Bai, L.: Cosine Similarity Metric Learning for Face Verification. In: Kimmel, R., Klette, R., Sugimoto, A. (eds.) ACCV 2010, Part II. LNCS, vol. 6493, pp. 709–720. Springer, Heidelberg (2011)
23. Liang, L., Xiao, R., Wen, F., Sun, J.: Face Alignment Via Component-Based Discriminative Search. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part II. LNCS, vol. 5303, pp. 72–85. Springer, Heidelberg (2008)
24. Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. PAMI 24, 971–987 (2002)
25. Cao, Z., Yin, Q., Tang, X., Sun, J.: Face recognition with learning-based descriptor. In: CVPR (2010)
Beyond Bounding-Boxes: Learning Object Shape
by Model-Driven Grouping

Antonio Monroy and Björn Ommer

Interdisciplinary Center for Scientific Computing, University of Heidelberg, Germany
{antonio.monroy,bommer}@iwr.uni-heidelberg.de

Abstract. Visual recognition requires learning object models from training data.
Commonly, training samples are annotated by marking only the bounding-box of
objects, since this appears to be the best trade-off between labeling information
and effectiveness. However, objects are typically not box-shaped. Thus, the usual
parametrization of object hypotheses by only their location, scale and aspect ratio
seems inappropriate, since the box contains a significant amount of background
clutter. Most important, however, is that object shape only becomes explicit once
objects are segregated from the background. Segmentation is an ill-posed problem,
and so we propose an approach for learning object models for detection while,
simultaneously, learning to segregate objects from clutter and extracting their
overall shape. For this purpose, we exclusively use bounding-box annotated
training data. The approach groups fragmented object regions using the Multiple
Instance Learning (MIL) framework to obtain a meaningful representation of
object shape which, at the same time, crops away distracting background clutter
to improve the appearance representation.

1 Introduction
Recognizing and localizing all instances of object categories in a novel query image
is a core task on the way to automatic scene understanding. Object detection
typically proceeds by localizing object bounding-boxes (e.g. [1]), which are
parameterized by their location, scale and aspect ratio. A classifier is then
evaluated for each detection window, thereby providing hypotheses that are ranked
by their score. Such approaches have proven very successful in benchmarks, but
two issues remain unresolved. First, objects are not box-shaped, and so the
detection window contains a significant amount of background clutter that tends
to deteriorate the whole window's classification result. Indeed, even complex
models like [1] are eventually based on a holistic representation of the whole
bounding-box, including the clutter. The second problem is that object shape only
becomes available for detection once the object has been segregated from the
background. To overcome both problems, not only background suppression is
required, but also reasoning about the object shape is essential. Recent work (e.g.
[2]) in the field of segmentation has shown that relying only on low-level cues is
not enough. Furthermore, it appears reasonable


to combine class-specific top-down information to achieve better results. The
purpose of this paper is to learn object models for detection by explicitly
representing object shape and segregating it from the background, without
requiring manual segmentation of the training samples. Therefore, we propose a
model-based approach that does not require this supervision, but automatically
learns object shape and appearance while segregating objects from background.
Since we use more than a mere bottom-up segmentation, we are able to capture
the overall object shape in a model-driven manner by grouping the corresponding
foreground regions.
Detection approaches which use shape information can be classified according to
the degree of supervision they require during training. On the one hand, methods
like [3–5] require ground-truth pixel-wise segmentation masks during training. The
main disadvantage is that such information is usually not available for large-scale
detection tasks, or is tedious to obtain, and so we propose an automatic MIL-based
approach to circumvent these shortcomings. On the other hand, there are methods
which only require bounding-box information during training [6–12]. These
methods differ in how shape information is integrated into the detection task. Pure
bottom-up methods [6, 7] are susceptible to segmentation artifacts. While [6]
directly classifies bottom-up generated segments using a k-nearest neighbor
classifier, [7] computes hierarchical segmentations to find object subtrees similar to
those learned during training. [8, 9] can be viewed as top-down approaches. [8]
divides the bounding-box into cells and infers an occlusion map by clustering the
response scores of a linear SVM on each cell, where occluded regions are defined as
the groups with a negative overall response. This approach does not use any shape
information to train the linear SVM. Furthermore, negative response scores can be
caused not only by occlusion but also by other factors, such as background or an
uncommon shape. Based on the model of [13], [9] attempts to capture the object's
shape by means of a fixed number of coarse box-shaped patterns. Finally, methods
like [11, 10, 12] attempt to combine bottom-up and top-down cues. Gu et al. [11]
propose a method for detection using regions. Starting with regions as the basic
elements, a generalized Hough-like voting strategy is used for generating
hypotheses (see [14] for improvements to the idea of voting). The method's
drawbacks are twofold. First, it needs a general sliding window classifier for
verification, which does not take shape into account. Second, ground-truth
pixel-wise segmentation masks for the training data are required. Recently, [10]
proposed a method for object detection based on the category-independent
figure-ground segmentation masks of [15]. To train with only bounding-box
information, the authors assume that the best-ranked segment within the
bounding-box covers the entire object. This segment is then used to learn a
regression function that predicts the quality of query segments. Consequently, the
performance of their detection system depends heavily on the first bottom-up
generated segment actually covering the entire object. In datasets like PASCAL
VOC we observe that in many images this assumption is too strong. Finally, [12]
utilizes multiple over-segmentations to propose class-independent bounding-boxes
for classification. However, the authors discard the shape information contained in
the super-pixels; they sample 5 different color features at each pixel and utilize
them within a standard bag-of-words model to classify the object.


Fig. 1. Object representation (best viewed in color). We divide the bounding-box into
cells and calculate features on each cell. Inferring a foreground segmentation cell-mask
from unsegmented training data, we suppress the background features by setting the
corresponding cells to zero.

The paper is organized as follows. In Sec. 2 we explain how to suppress the
background and represent the shape of an object, and in Sec. 3 we describe how to
learn our model (without using pixel-wise segmentation masks in the training
stage) and infer the shape of an object using a MIL framework.

2 Model
2.1 Suppressing the Background
Detection window approaches like [16, 1] have demonstrated good performance on
difficult benchmarks. Consequently, such a framework offers a good basis on which
to implement our idea. The detection window is commonly divided into a grid of
cells, and we learn object shape in order to suppress cells in the clutter and
concentrate on the actual object. In this section we describe how to model a
foreground/background segregation.
Suppose an object Oj within image I is given, and assume for a moment that a
pixel-wise segmentation of the object's foreground is also given. In the next section
we will describe how to automatically learn a cell-accurate shape estimate of the
object's foreground.
First, we divide the bounding-box of Oj into an array of size l0 × h0. For each cell
we calculate a d-dimensional feature. This l0 × h0 × d matrix is called Φ0(p0^j),
where p0^j = (x, y) is the top-left position of the bounding-box in image I.
Specifically, in this paper we use histograms of oriented gradients (HoG) as
features. These widely used and fast-to-calculate descriptors capture the edge or
gradient structure that is very characteristic of local shape. Additionally, they
exhibit invariance to local geometric and photometric transformations ([1, 16]).
However, our framework is independent of this specific choice of features. A
combination of different descriptors (e.g. as in [12]) can be integrated into our
model and should enable further performance improvements.
The foreground of an object is modeled by defining a binary vector
m0^j ∈ B^(1×l0·h0). This vector contains a one if the corresponding cell is covered
by the object and a zero otherwise.

Fig. 2. Left: the first column shows a detection and the last two columns the two most
similar prototypical segments. Right: subset of prototypical segments for the category cow.

We call this vector m0^j the root-cell mask for object Oj (part-cell masks are
introduced in Sec. 4.1). Using m0^j, we set to zero the cells of Φ0(p0^j)
corresponding to the zero entries in m0^j. Formally, the foreground representation
of object Oj is defined as

Φ0(p0^j, m0^j) := (m0^j ⊗ 1_d) ∘ Φ0(p0^j),    (1)

where ⊗ defines the Hadamard product and ∘ the element-wise multiplication.
Fig. 1 shows how to suppress the background of a bounding-box when a root-cell
mask for the object's foreground is given.
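A minimal sketch of Eqn. (1) with array-shaped cell features; the function and variable names are illustrative assumptions, not the authors' code.

import numpy as np

def foreground_representation(phi, mask):
    # phi: (l0, h0, d) HoG cell features; mask: (l0, h0) binary root-cell mask
    # replicate the mask over the d feature channels and zero out background cells
    return phi * mask[:, :, None]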

2.2 Matching Objects


Due to pose variations, occlusion and clutter, the foreground root-cell masks m0^j
and m0^u of two objects may differ substantially. Therefore, computing a Euclidean
dot product between the feature representations Φ0(p0^j, m0^j) and Φ0(p0^u, m0^u),
as [3] or [1] do, leads to unstable matching scores. Rather than using a simple dot
product, we represent each object with a prototypical set of shape segments
C0 = {m0^ν}, ν = 1, …, |C0|. This set of segments is automatically learnt from
unsegmented training data (see Section 3.3 for more details). The idea is to reduce
the high intra-class shape variability by using a reduced number of typical
class-specific views of its shape. We then use a weighted sum to match both
representations. Precisely, the matching score is given by

d0(Φ0(p0^j, m0^j), Φ0(p0^u, m0^u)) := (1/|C0|) Σ_ν ⟨ a(m0^j, m0^ν) Φ0(p0^j, m0^ν), a(m0^u, m0^ν) Φ0(p0^u, m0^ν) ⟩,    (2)

where

a(m0^j, m0^ν) := exp( -λ ‖m0^j - m0^ν‖² / |m0^ν| )    (3)

represents the dissimilarity score between the root-cell mask m0^j and the
prototypical root-cell mask m0^ν. The parameter λ is obtained by cross-validation.
In our experiments, we obtained an optimal value in the range of 1.1 ± 0.1 for the
584 A. Monroy and B. Ommer

dierent object classes. Here |m0 | represents the total number of active cells in
the prototypical root-cell mask m0 .
Equation (2) induces a Mercer kernel, since a sum of Mercer kernels is again a Mercer kernel. By the kernel trick we know that there exists a (possibly unknown) transformation $\Phi$ into a space in which the kernel (2) is a scalar product. To keep the notation simple we identify $O^j := \Phi(\phi_0(p_0^j, m_0^j))$ and refer to this scalar product as

$$\langle O^j, O^u \rangle_{CB} := d_0\big(\phi_0(p_0^j, m_0^j), \phi_0(p_0^u, m_0^u)\big). \qquad (4)$$

In practice we do not need to evaluate the transformation $\Phi$ in order to learn our model, but use the kernel values instead. By defining the kernel (2) we have integrated both of our goals into the detection window approach: we suppress the features corresponding to the background and robustly represent the shape of an object through a prototypical set of shapes.
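To make the prototype-weighted kernel concrete, the sketch below evaluates Eqs. (2) and (3) for two objects under the notation reconstructed above. The array layout, the name lam for the cross-validated parameter and its exact placement inside the exponential are assumptions for illustration, not details taken verbatim from the paper.

```python
import numpy as np

def mask_weight(m, m_proto, lam=1.1):
    """Eq. (3): similarity between an object's cell mask and a prototype mask.
    lam plays the role of the cross-validated parameter (around 1.1 in the paper)."""
    return np.exp(-lam * np.sum((m - m_proto) ** 2) / m_proto.sum())

def d0(phi_j, m_j, phi_u, m_u, prototypes, lam=1.1):
    """Eq. (2): prototype-weighted matching score between two objects.
    phi_* : (l0, h0, d) cell features; m_* : (l0, h0) binary masks;
    prototypes : list of (l0, h0) binary prototype masks (the set C_0)."""
    score = 0.0
    for m_p in prototypes:
        fj = (phi_j * m_p[:, :, None]).ravel()      # features restricted to the prototype shape
        fu = (phi_u * m_p[:, :, None]).ravel()
        score += mask_weight(m_j, m_p, lam) * mask_weight(m_u, m_p, lam) * (fj @ fu)
    return score / len(prototypes)
```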

3 Learning
Let us assume for the moment that for all objects $O^j$ in the training data their root-cell masks $m_0^j$, containing the whole object foreground, are given. The training set is denoted by $\{(O^j, y^j)\}$. Here $y^j \in \{-1, 1\}$ denotes the label of object $O^j = \Phi(\phi_0(p_0^j, m_0^j))$. In this special case, we could easily learn a discriminative function

$$f\big(\phi_0(p_0^q, m_0^q)\big) = \sum_{i \in SV} y^i \alpha_i\, d_0\big(\phi_0(p_0^q, m_0^q), \phi_0(p_0^i, m_0^i)\big) + b \qquad (5)$$

to classify the query object $\phi_0(p_0^q, m_0^q)$ (SV is the set of support vectors). However, in contrast to [3], we are not provided with the foreground root-cell masks $m_0^j$ during training, but rather we automatically learn them from unsegmented training data. This is described in the next section. Similar to [3], [10] assumes that the best-ranked foreground segmentation mask of [15] covers the whole object. In practice this assumption is, however, not valid: the second row of figure 3 shows the best-ranked CMPC segments that lie within the object bounding-box. None of them covers exactly the whole object.

3.1 Learning from Unsegmented Training Data


The question now is how to learn the classification function $f$ if the foreground root-cell masks $m_0^j$ are not given during training.
Given a discriminatively trained function $f$, the problem of inferring the foreground root-cell mask $m_0^j$ for an object $O^j$ can be formulated as

$$m_0^j = \arg\max_{m_0}\; f\big(\phi_0(p_0^j, m_0)\big), \qquad (6)$$

i.e. the inference (6) is tackled by grouping cells in a model-driven, top-down manner so as to maximize the classification score.
Fig. 3. First row: we simultaneously detect and infer the object foreground. Second row: we show our data-driven grouping from which we infer the foreground of our object. For complex categories we cannot assume that the first CMPC segment covers the whole object.

We simultaneously learn the function $f$ and solve the grouping problem by formulating our problem in the Multiple Instance Learning (MIL) framework. Here, a bag contains features corresponding to different root-cell masks. For positive bags, at least one of these features corresponds to the foreground of an object. In the ideal case, a bag $B_0^j$ would contain all possible combinations of cells within the bounding-box. Since this is not tractable, we describe in the next section how to create a shortlist of meaningful groups in a bottom-up manner. Suppose we obtain $l$ different groups for a bounding-box $j$. The $i$-th group is represented by a root-cell mask $m_{0i}^j$, and we build the set $U^j := \{m_{0i}^j\}_{i=1}^{l}$. A bag is then defined as

$$B_0^j := \big\langle \phi_0(p_0^j, m_{0i}^j) \;\big|\; m_{0i}^j \in P(U^j) \big\rangle_{i=1}^{|P(U^j)|}, \qquad (7)$$

where $P(U^j)$ is the power set of $U^j$. If the bounding-box contains an object, the label $Y_j$ of the bag $B_0^j$ is set to $1$, otherwise it is $-1$. Using our kernel (2), the problem of learning the function $f$ transforms into:

$$\min_{w_0, b, \xi}\; \frac{1}{2}\|w_0\|^2 + C \sum_{I} \xi_I \qquad (8)$$

$$\text{s.t. } \forall I:\; Y_I \max_{i \in I}\big( \langle w_0, O_i^I \rangle_{CB} + b \big) \geq 1 - \xi_I, \quad \xi_I \geq 0, \qquad (9)$$

where $O_i^I = \Phi(\phi_0(p_0^I, m_{0i}^I))$ are object hypotheses and $i$ denotes the elements within the bag $I$. Once the function $f$ is learned, the inference problem (6) for a query image is transformed into

$$m_0^j = \arg\max_{m_{0i}^j \in P(U^j)}\; f\big(\phi_0(p_0^j, m_{0i}^j)\big). \qquad (10)$$

In other words, in (10) we look for the most positive instance within $B_0^j$ and, by doing this, we indirectly infer the corresponding root-cell segmentation mask

$m_0^j$ (see first row of Fig. 3). In practice the optimization problem (8) is solved by alternating the calculation of the hyperplane $w_0$ and bias $b$ with the calculation of the margin for the positive bags, $Y_I \max_{i \in I}(\langle w_0, O_i^I \rangle_{CB} + b)$ (we used the MIL solver of [17], but other approaches like [18] could also be used). This means that for every positive bag we fix the most positive instance and then use these, together with all instances of the negative bags, to learn an SVM using our Mercer kernel (2). In our experiments we used this MIL formulation since it is effective, fast (convergence is reached after a few iterations) and its performance was robust to varying initializations. Specifically, we randomly chose an element of every positive bag to initialize the algorithm.
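A compact way to picture this alternation is the following sketch. It is a minimal illustration, not the MIL solver of [17]: it uses a generic precomputed-kernel SVM from scikit-learn, a fixed number of iterations, and simplified bag handling; the callable kernel stands in for the Mercer kernel (2).

```python
import numpy as np
from sklearn.svm import SVC

def mil_alternation(pos_bags, neg_instances, kernel, n_iter=5, C=1.0, seed=0):
    """Alternating MIL optimisation in the spirit of Sec. 3.1 (a sketch).
    pos_bags      : list of (n_i, d) arrays of instance features per positive bag
    neg_instances : (m, d) array of negative instances
    kernel        : callable k(A, B) -> Gram matrix, assumed to be a Mercer kernel"""
    rng = np.random.default_rng(seed)
    # random initialisation: one instance per positive bag
    selected = [bag[rng.integers(len(bag))] for bag in pos_bags]
    svm = None
    for _ in range(n_iter):
        X = np.vstack([np.vstack(selected), neg_instances])
        y = np.r_[np.ones(len(selected)), -np.ones(len(neg_instances))]
        svm = SVC(C=C, kernel='precomputed').fit(kernel(X, X), y)
        score = lambda Z: svm.decision_function(kernel(Z, X))
        # re-select the most positive instance in every positive bag
        selected = [bag[np.argmax(score(bag))] for bag in pos_bags]
    return svm, selected
```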
In the first row of Fig. 3 we visualize the inference of the final foreground root-cell mask $m_0^j$, given a data-driven grouping of cells for the bounding-box.

3.2 Data-Driven Grouping


In this section we describe how to create a shortlist of candidate groups by means of a data-driven grouping of cells for a given bounding-box. This is necessary to render the inference problem (6) and the creation of bags (7) feasible.
Recently, [15] presented the combinatorial CMPC algorithm for generating a set of binary figure-ground segmentation hypotheses $\{S_t^I\}_{t=1}^{N_s}$ for an image $I$. In general we cannot assume (see second row of Fig. 3) that the best-ranked segment covers the whole object (as in [10]). However, the pool of CMPC segments yields a good basis to obtain groups of pixels which cover only parts of the object. An example of our grouping can be seen in Fig. 3 (second row).
Given a bounding-box $BB_j$ in image $I$, the idea is to first weight each pixel-wise segment $S_t^I$ generated by [15] according to the ratio between the number of pixels $p_{kl}$ belonging to the segment $S_t^I$ which lie outside the bounding-box and the total number of pixels covered by the bounding-box $|BB_j|$ itself:

$$r_t^j := 1 - \frac{1}{|BB_j|} \sum_{kl} [p_{kl} \in S_t^I]\, [p_{kl} \notin BB_j]. \qquad (11)$$

Only segments $S_t^I$ that fully lie within the bounding-box will get high scores, while straddling segments will be penalized. We then take the weighted sum of all segments which intersect the bounding-box and build a density map for this bounding-box,

$$H_{kl}^j := \frac{1}{N_s} \sum_{t=1}^{N_s} r_t^j\, [p_{kl} \in S_t^I]\, [p_{kl} \in BB_j]. \qquad (12)$$

The values in this map indicate which regions within the bounding-box were consistently covered by CMPC segments $S_t^I$. We then apply a mean-shift clustering algorithm to this 2D density map $H^j$ and enforce the connectedness of each of the resulting groups. The cells covering each of these groups define the root-cell masks $m_{0i}^j$ used to construct the bags in equation (7).
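The sketch below strings these steps together: segment weighting, density map, mean-shift clustering and conversion to cell masks. It uses scikit-learn's MeanShift, omits the connectedness enforcement and the adaptive bandwidth of the original method, and follows our reconstruction of Eq. (11); all names and the cell size are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import MeanShift

def grouping_masks(segments, box, cell_size=8):
    """Bottom-up grouping of Sec. 3.2 (a simplified sketch).
    segments : list of (H, W) binary figure-ground masks S_t from the CMPC pool
    box      : (x0, y0, x1, y1) bounding-box in pixel coordinates
    Returns a list of binary cell masks m_0i for the box grid."""
    x0, y0, x1, y1 = box
    inside = np.zeros(segments[0].shape, bool)
    inside[y0:y1, x0:x1] = True
    area = inside.sum()

    # Eq. (11)/(12): down-weight straddling segments, accumulate a density map inside the box
    H = np.zeros(segments[0].shape)
    for S in segments:
        S = S.astype(bool)
        r = max(0.0, 1.0 - (S & ~inside).sum() / area)   # clipped weight, see Eq. (11)
        H += r * (S & inside)
    H /= max(len(segments), 1)

    ys, xs = np.nonzero(H > 0)
    if len(xs) == 0:                                     # background boxes: no groups proposed
        return []
    labels = MeanShift(bin_seeding=True).fit(np.c_[xs, ys]).labels_

    masks = []
    gh, gw = (y1 - y0) // cell_size, (x1 - x0) // cell_size
    for lab in np.unique(labels):
        m = np.zeros((gh, gw), bool)
        for x, y in zip(xs[labels == lab], ys[labels == lab]):
            cy, cx = (y - y0) // cell_size, (x - x0) // cell_size
            if 0 <= cy < gh and 0 <= cx < gw:
                m[cy, cx] = True
        masks.append(m)
    return masks
```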
In practice, for bounding-boxes containing an object we typically obtain between 6 and 8 groups. For boxes in the background, our grouping algorithm typically does not generate any segment, since these regions are not covered by a CMPC segment (the weights in Eq. 11 are zero). This situation renders an exhaustive search using inference (6) feasible. We favor mean-shift over other clustering methods because it allows an adaptive bandwidth for different clusters.

3.3 Learning a Prototypical Set of Segments


Our goal is to represent every object through a prototypical set of segments. In Section 2.2 we use such a representation to robustly match different object instances. In this section we describe how to learn such a prototypical set.
The idea is to explain the shape complexity of a class through a reduced number of segments that are typical for that class. Using the bottom-up grouping described in Section 3.2, we obtain for every positive training sample $j$ a bag $B_0^j$. To find those specific segments that appear frequently within the class, we hierarchically cluster the elements of all positive bags (e.g. using Ward's method). Every group is then represented by its medoid, i.e. the element with the minimal average dissimilarity (using measure (3)) to all the objects in the cluster. The set of all medoids defines the prototypical set of segments $C_0 = \{m_0^\nu\}_{\nu=1}^{K}$ used to train our model. The number of clusters $K$ is chosen using cross-validation and ranges between 10 and 40 segments (see Fig. 2).
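A possible realisation of this clustering step with SciPy is sketched below. The conversion of measure (3) into a symmetric-enough dissimilarity and the fixed number of clusters are simplifying assumptions, not the authors' exact choices.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def prototype_masks(masks, n_clusters=20, lam=1.1):
    """Sec. 3.3 sketch: hierarchically cluster the cell masks collected from all positive
    bags (Ward's method) and keep one medoid per cluster as the prototype set C_0.
    masks : (n, l0*h0) binary matrix of flattened root-cell masks."""
    masks = np.asarray(masks, float)
    # pairwise dissimilarity in the spirit of Eq. (3); the asymmetry of (3) is ignored here
    d2 = ((masks[:, None, :] - masks[None, :, :]) ** 2).sum(-1)
    diss = 1.0 - np.exp(-lam * d2 / np.maximum(masks.sum(1)[None, :], 1))

    labels = fcluster(linkage(masks, method='ward'), n_clusters, criterion='maxclust')
    prototypes = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        medoid = idx[np.argmin(diss[np.ix_(idx, idx)].mean(1))]   # minimal avg. dissimilarity
        prototypes.append(masks[medoid].astype(bool))
    return prototypes
```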

4 Results
4.1 Implementation Details
We use a sliding window detection model similar to [1] to implement our idea. The model in [1] describes an object $O^j$ by means of a bounding-box covering the entire object (root window) as well as eight smaller windows (about half the size) that cover parts of the root window. Every part window $i$ is divided into a grid of cells of size $l_i \times h_i$, $i = 1 \ldots 8$, and a HoG feature is calculated for every cell. During training, weights (used as linear filters) are learnt for the root window and an additional 8 linear filters are trained for the parts. In our case, if we ignore the parts for a moment, we would first need to learn the prototypical set of segments using the positive training samples as described in Sec. 3.3 and then learn the classifier (Eq. 5) as described in Sec. 3.1. To include the concept of parts from [1], we first introduce the notion of a bag for each of the part windows and then extend our matching kernel (2) to these parts as well. Thereafter, the corresponding classifier can be trained analogously to [1], and we thus refer to that work for further details.
Modeling Parts: Running the bottom-up grouping described in Section 3.2 exclusively on the root window results in the bags $B_0^j$ for each training sample $j$ ($\kappa$ being the number of instances in each bag). We then define a bag $B_i^j$ for each of the part windows as follows:

$$B_i^j := \{\phi_i(p_i^j, m_{ik}^j)\}_{k=1}^{\kappa}, \quad i = 1 \ldots 8. \qquad (13)$$



Here $p_i^j$ denotes the position of the $i$-th part window for sample $j$. The binary vector $m_{ik}^j \in \mathbb{B}^{l_i \times h_i}$ denotes the $k$-th part-cell segmentation mask of part $i$. It is obtained by taking the overlap of part $i$ with the root-cell mask $m_{0k}^j \in \mathbb{B}^{l_0 \times h_0}$. In doing so, we obtain the feature representation $\phi_i(p_i^j, m_{ik}^j)$ for the $i$-th part (similar to Eq. (1)). Following this notation, the matching score of Eq. 2 between two objects $O^j$, $O^u$ can be extended to include parts,

$$d(O^j, O^u) := \sum_{i=0}^{8} \Big[ d_i\big(\phi_i(p_i^j, m_i^j), \phi_i(p_i^u, m_i^u)\big) + \big\langle p_i^j - p_0^j,\; p_i^u - p_0^u \big\rangle \Big]. \qquad (14)$$

Here the last term compares the displacement of the $i$-th part w.r.t. the object center, and $d_i(\cdot, \cdot)$ denotes the matching score for part $i$ defined as in Eq. (2). To obtain $d_i(\cdot, \cdot)$ we also use a set of prototypical segments to represent each of the parts. This set is obtained in a similar way as for the root window, by hierarchically clustering the elements of all positive bags $B_i^j$. In practice, 7 prototypical segments are used to represent each part. Using the kernel (14) we are then capable of learning a discriminative function along the lines of (5).
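How a part-cell mask could be derived from a root-cell mask by overlap is illustrated below. The nearest-neighbour resampling onto the part grid is our own assumption, since the exact overlap rule is not spelled out in the text; all names and coordinate conventions are hypothetical.

```python
import numpy as np

def part_cell_mask(root_mask, part_rect, part_grid):
    """Sketch of deriving a part-cell mask from a root-cell mask (Sec. 4.1).
    root_mask : (l0, h0) binary root-cell mask
    part_rect : (r0, c0, r1, c1) extent of the part window in root-cell coordinates
    part_grid : (li, hi) target cell grid of the part window"""
    r0, c0, r1, c1 = part_rect
    crop = root_mask[r0:r1, c0:c1].astype(float)
    li, hi = part_grid
    # nearest-neighbour resample the cropped root mask onto the part grid
    rows = (np.arange(li) * crop.shape[0] / li).astype(int)
    cols = (np.arange(hi) * crop.shape[1] / hi).astype(int)
    return crop[np.ix_(rows, cols)] > 0.5
```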

4.2 Experimental Results


The purpose of our experiments is to show that, if only bounding-box annotated data is available during training, using a top-down generated prototypical representation of the object shape, as well as suppressing the background within a bounding-box, helps to improve pixel-wise object detection.
The methods of [10] and [11] are the most similar to ours and therefore provide a baseline for our results. Both methods present pixel-wise detection results exclusively on the ETHZ-Shape dataset ([19]). Specifically, [10] also presents results for the PASCAL segmentation challenge. However, this challenge assesses a simpler problem than the one in our paper, since pixel-wise segmentation masks are used for training the model. For purposes of comparison with the state of the art [10, 11] we also use the ETHZ-Shape dataset to test our model's performance. Larger and more complex datasets for object detection (e.g. INRIA Horses or PASCAL VOC) are suboptimal for demonstrating this paper's purpose, since there are no pixel-wise masks for the whole test set and measuring detection performance is only possible up to a bounding-box.
[10] is currently the state of the art for pixel-wise detection on the ETHZ-Shape dataset. This dataset contains 5 object categories and 255 images. We follow the experimental settings in [19]. The image set is evenly split into training and testing sets and performance is averaged over 5 random splits. Following [11] and [10], we report pixel-wise average precision (AP) on each class. The PASCAL criterion is used to decide if a detection is correct. The ground-truth segmentation masks were provided by [11].
Our results are displayed in Table 1. Our method outperforms the state-of-the-art approach of [10] by 7% mean AP and our detection rate is comparable with the detection rate at 0.02, 0.3 and 0.4 FPPI in [10] (see Table 2).

Table 1. Detection results for the ETHZ-Shape dataset. Performance is measured as pixel-wise AP over 5 trials, following [10, 11]. For completeness, we include the performance of [1] measured using a bounding-box parametrization. We improve the state of the art by 7% AP.

           Our Method       Carreira et al. [10]  Gu et al. [11]    Felz. et al. [1]
Apples     0.963 ± 0.023    0.890 ± 0.019         0.772 ± 0.112     0.934 ± 0.048
Bottles    0.877 ± 0.011    0.900 ± 0.021         0.906 ± 0.015     0.891 ± 0.028
Giraffes   0.823 ± 0.038    0.754 ± 0.019         0.742 ± 0.025     0.817 ± 0.048
Mugs       0.885 ± 0.037    0.777 ± 0.059         0.760 ± 0.044     0.856 ± 0.073
Swans      0.927 ± 0.023    0.805 ± 0.028         0.606 ± 0.013     0.813 ± 0.125
Mean       0.896 ± 0.026    0.825 ± 0.012         0.757 ± 0.032     0.862 ± 0.051

For the sake of completeness, we also evaluate our model on the level of bounding-boxes for the detected objects (the standard setting). We use the INRIA horses dataset, which contains 340 images. Half of the images contain one or more horses and the rest are negative images. 50 horse images and 50 negative images are used for training. The remaining 120 horse images plus 120 negative images are used for testing. Results are listed in figure 5. Compared to [1], we improve the state-of-the-art detection rate at 0.1 FPPI by 3.5% and achieve a gain of 29.9% over the recent segmentation-based approach of [20].

Table 2. Detection rate at 0.02, 0.3 and 0.4 FPPI on ETHZ-Shape. We reach pixel-wise detection rates comparable to [10].

           Our Method           Carreira et al. [10]   Gu et al. [11]        Felz. et al. [1]
Apples     0.985/0.985/0.985    0.904/0.941/0.941      0.697/0.854/0.916     0.956/0.989/0.989
Bottles    0.860/0.975/0.975    0.891/0.975/0.975      0.745/0.932/0.958     0.835/0.981/0.981
Giraffes   0.830/0.924/0.924    0.920/0.970/0.970      0.543/0.736/0.800     0.675/0.936/0.943
Mugs       0.896/0.956/0.956    0.812/0.925/0.925      0.496/0.816/0.833     0.816/0.932/0.937
Swans      0.934/1/1            0.983/1/1              0.569/0.800/0.800     0.835/0.919/0.919
Mean       0.901/0.968/0.968    0.902/0.963/0.963      0.594/0.829/0.861     0.824/0.951/0.954

Next, we evaluate the impact of our bottom-up grouping (see Section 3.2) during training. For this experiment, the union of the first $n$ best-ranked CMPC segmentation masks of [15] lying within the bounding-box was taken to define the bags (13). This setting is equivalent to [10], which assumes that the best bottom-up generated segment covers the whole object. We varied the number of segments $n$ and measured the detection performance in terms of average precision (AP). The experiment was evaluated on the horse category of PASCAL VOC 2007. The result is plotted on the left side of figure 6. For large $n$ the performance reaches that of [1], since eventually all cells of $m_0^j$ are active. Conversely, performance significantly drops as we approach $n = 1$, which is the



Fig. 4. Detection results on ETHZ-Shape classes. Our method outperforms the state of the art by 7% mean AP, reaching a comparable detection rate at 0.02, 0.3 and 0.4 FPPI. (The per-class curves compare Our Method, Ferrari et al., Gu et al., and SvrSegm.)

Method       AP     Det. rate at 0.1 FPPI
Our Method   0.883  0.902
[21]         -      0.730
BoSS [20]    -      0.630
[22]         -      0.652
[23]         -      0.674
[1]          0.871  0.867

Fig. 5. Detection results for the INRIA horses dataset. We improve [1] by 3.5% and the segmentation-based approach [20] by 29.9% detection rate at 0.1 FPPI.

setting of [10]. Our full model is plotted as a constant line, since it is independent of the number of segments generated by [15].
In a second experiment, we tested the impact of our bottom-up grouping during testing. Instead of obtaining a bottom-up grouping for each sliding window, we tested our model exclusively on the CMPC segments. We considered the tight bounding-box around each figure-ground segment $S_t^I$ for an image $I$ and used this segment to construct the bags $B_i^j$ (in this case we have as many bags as segments $S_t^I$, see Eq. (13)). The experiment was carried out using the car category of VOC 2007 (see right plot in figure 6). We observed a 7.3% performance drop in AP. Hence, it is advisable to combine the different segments $S_t^I$ (as we do) to obtain a better detection performance.
We also tested the impact of using a prototypical set of segments (see Section 3.3) to represent object shape. Since the matching score (14) uses a prototypical set of segments to evaluate each $d_i(\cdot, \cdot)$, in this experiment we trained a linear SVM using the Euclidean dot product (instead of $d_i(\cdot, \cdot)$) between the feature representations $\phi_i(p_i^j, m_i^j)$ for all parts. In this case the matching score (14) is transformed into
$$\tilde{d}(O^j, O^u) := \sum_{i=0}^{8} \Big[ \big\langle \phi_i(p_i^j, m_i^j), \phi_i(p_i^u, m_i^u) \big\rangle + \big\langle p_i^j - p_0^j,\; p_i^u - p_0^u \big\rangle \Big]. \qquad (15)$$

In doing so, we obtained a very poor performance of 0.45 AP for the horse
category compared to the 0.578 AP of our model.

Fig. 6. Impact of bottom-up grouping. Left: we trained our model using the union of the n best-ranked CMPC segments. Right: testing exclusively on CMPC segments.

To the best of our knowledge, there is no approach which explicitly tries to infer the overall object form using a model learnt exclusively from bounding-box annotated training data for any category in the PASCAL dataset. In order to compare our approach with other detection methods, we evaluate our model using the standard setting on those PASCAL VOC 2007 categories where [1] performs best. In Table 3 and figure 7 we observe that our model exhibits robust performance (43.68 MAP, or Mean Average Precision) under challenging image conditions, while at the same time providing a richer output than just a bounding-box for detection. While [1] (42.34 MAP) is considered our baseline model, we also list comparable state-of-the-art detection methods. Due to the lack of exact precision numbers, the multi-feature approach of [12] is not listed in Table 3. However, from the diagram presented in their paper, we read an approximate MAP of 42 for this set of categories, and of 40 if [1] is evaluated exclusively on the proposed windows. Regardless of this, the strength of [12] lies in the usage of 5 different color features to train a bag-of-words model. While we use

Fig. 7. Detection examples for certain PASCAL VOC 2007 categories. The cells cor-
responding to the object foreground are grouped and used for detection.

Table 3. AP for best performing categories of [1] in PASCAL VOC 2007

horse cow cat train plane car mbike bus tv bicycle sofa person
Our approach 57.8 25.3 23.9 47.8 31.9 59.8 49.8 51.6 41.9 59.8 33.7 41.9
Felz. et al. [1] 56.8 25.2 19.3 45.1 28.9 57.9 48.7 49.6 41.6 59.5 33.6 41.9
best2007 [26] 37.5 14.0 24.0 33.4 26.2 43.2 37.5 39.3 28.9 40.9 14.7 22.1
UCI [27] 45.0 17.7 12.4 34.2 28.8 48.7 39.4 38.7 35.4 56.2 20.1 35.5
LHS [13] 50.4 19.3 21.3 36.8 29.4 51.3 38.4 44.0 39.3 55.8 25.1 36.6
C2F [28] 52.0 22.0 14.6 35.3 27.7 47.3 42.0 44.2 31.1 54.0 18.8 26.8
SMC [29] 51.0 23.0 16.0 41.0 26.0 50.0 45.0 47.0 38.0 56.0 29.0 37.0
HStruct [30] 48.5 18.3 15.2 34.1 31.7 48.0 38.9 41.3 39.8 56.3 18.8 35.8
LatentCRF [31] 49.1 18.5 14.5 34.3 31.9 49.3 41.9 49.8 41.3 57.0 23.3 35.7
MKL [24] 51.2 33.0 30.0 45.3 37.6 50.6 45.5 50.7 48.5 47.8 28.5 23.3

a single, standard feature type, multi-feature approaches (e.g. [12, 24, 25]) are
complementary and should enable further performance improvements.

5 Conclusion
We have presented a model that explicitly represents object shape and segregates it from the background, without requiring segmented training samples. The basis of this method is to capture the overall object form by grouping foreground regions in a model-driven manner and representing it through a class-specific prototypical set of segments automatically learnt from unsegmented training data. By using exclusively bounding-box annotated training data, our model improves pixel-wise detection results and at the same time provides a richer object parametrization for detecting object instances.

Acknowledgements. This work was supported by the Excellence Initiative of


the German Federal Government, DFG project number ZUK 49/1.

References
1. Felzenszwalb, P., Girshick, R., McAllester, D., Ramanan, D.: Object detection with
discriminatively trained part-based models. PAMI (2010)
2. Levin, A., Weiss, Y.: Learning to combine bottom-up and top-down segmentation. IJCV 81(1), 105–118 (2009)
3. Gao, T., Packer, B., Koller, D.: A segmentation-aware object detection model with occlusion handling. In: CVPR, pp. 1361–1368 (2011)
4. Marszalek, M., Schmid, C.: Accurate object recognition with shape masks. IJCV 97, 191–209 (2011)
5. Vijayanarasimhan, S., Grauman, K.: Efficient region search for object detection. In: CVPR (2011)
6. Malisiewicz, T., Efros, A.: Improving spatial support for objects via multiple segmentations. In: BMVC (2007)
7. Todorovic, S., Ahuja, N.: Learning subcategory relevances for category recognition.
In: CVPR (2008)

8. Wang, X., Han, T., Yan, S.: An hog-lbp human detector with partial occlusion
handling. In: ICCV (2009)
9. Chen, Y., Zhu, L., Yuille, A.: Active Mask Hierarchies for Object Detection. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part V. LNCS, vol. 6315, pp. 43–56. Springer, Heidelberg (2010)
10. Carreira, J., Li, F., Sminchisescu, C.: Object Recognition by Sequential Figure-
Ground Ranking. IJCV (November 2011)
11. Gu, C., Lim, J., Arbelaez, J., Malik, J.: Recognition using regions. In: ICCV (2009)
12. Van de Sande, K., Uijlings, J., Gevers, T., Smeulders, A.: Segmentation as selective
search for object recognition. In: ICCV (2011)
13. Zhu, L., Chen, Y., Yuille, A.L., Freeman, W.: Latent hierarchical structural learning for object detection. In: CVPR, pp. 1062–1069 (2010)
14. Ommer, B., Malik, J.: Multi-scale object detection by clustering lines. In: ICCV
(2009)
15. Carreira, J., Sminchisescu, C.: Constrained parametric min-cuts for automatic object segmentation. In: CVPR (2010)
16. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In:
CVPR (2005)
17. Andrews, S., Tsochantaridis, I., Hofmann, T.: Support vector machines for multiple-instance learning. In: NIPS, vol. 15 (2003)
18. Deselaers, T., Ferrari, V.: A conditional random field for multiple-instance learning. In: ICML (2010)
19. Ferrari, V., Jurie, F., Schmid, C.: Accurate object detection with deformable shape
models learnt from images. In: CVPR (2007)
20. Toshev, A., Taskar, B., Daniilidis, K.: Object detection via boundary structure
segmentation. In: CVPR (2010)
21. Yarlagadda, P., Monroy, A., Ommer, B.: Voting by Grouping Dependent Parts. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part V. LNCS, vol. 6315, pp. 197–210. Springer, Heidelberg (2010)
22. Maji, S., Malik, J.: Object detection using a max-margin hough transform. In:
CVPR (2009)
23. Ferrari, V., Fevrier, L., Jurie, F., Schmid, C.: Groups of adjacent contour segments for object detection. PAMI 30(1), 36–51 (2008)
24. Vedaldi, A., Gulshan, V., Varma, M., Zisserman, A.: Multiple kernels for object
detection. In: ICCV (2009)
25. Harzallah, H., Jurie, F., Schmid, C.: Combining efficient object localization and image classification. In: ICCV (2009)
26. Everingham, M., Van Gool, L., Williams, C., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes Challenge 2007 (VOC 2007) Results (2007)
27. Desai, C., Ramanan, D., Fowlkes, C.: Discriminative models for multi-class object layout. In: ICCV, pp. 229–236 (2009)
28. Pedersoli, M., Vedaldi, A., Gonzalez, J.: A coarse-to-fine approach for fast deformable object detection. In: CVPR (2011)
29. Razavi, N., Gall, J., van Gool, L.: Scalable multi-class object detection. In: CVPR (2011)
30. Schnitzspan, P., Fritz, M., Roth, S., Schiele, B.: Discriminative structure learning of hierarchical representations for object detection. In: CVPR, pp. 2238–2245 (2009)
31. Schnitzspan, P., Roth, S., Schiele, B.: Automatic discovery of meaningful object parts with latent crfs. In: CVPR, pp. 121–128 (2010)
In Defence of Negative Mining
for Annotating Weakly Labelled Data

Parthipan Siva, Chris Russell , and Tao Xiang

Queen Mary, University of London,


Mile End Road, London E1 4NS, United Kingdom
{psiva,chrisr,txiang}@eecs.qmul.ac.uk

Abstract. We propose a novel approach to annotating weakly labelled


data. In contrast to many existing approaches that perform annotation
by seeking clusters of self-similar exemplars (minimising intra-class vari-
ance), we perform image annotation by selecting exemplars that have
never occurred before in the much larger, and strongly annotated, nega-
tive training set (maximising inter-class variance). Compared to existing
methods, our approach is fast, robust, and obtains state-of-the-art results on two challenging data-sets: voc2007 (all poses) and the msr2 action data-set, where we obtain a 10% increase. Moreover, this use of negative mining complements existing methods that seek to minimise the intra-class variance, and can be readily integrated with many of them.

Keywords: Weakly Supervised Learning, Multiple-Instance Learning,


Negative Mining, Automatic Annotation.

1 Introduction
Detecting objects in images [1, 2] and actions in videos [3, 4] are among the most widely studied computer vision problems, with applications in consumer photography, surveillance and automatic media tagging. Typically, these standard detectors are fully supervised, that is, they require a large body of training data where the locations of the objects/actions in images/videos have been manually annotated, as shown in Fig. 1. With the emergence of digital media and the rise of high-speed internet, raw images and video are available for little to no cost. However, the manual annotation of object and action locations remains tedious, slow, and expensive, and as a result there has been great interest in training detectors with weak supervision [5–8].
For weakly supervised training of object detectors, each image in the training set is annotated with a weak label indicating whether the image contains the object of interest or not, but not the locations of the object. As shown in Fig. 2, a weakly labelled data-set consists of two types of images: a set of weakly labelled positive images where the exact location of the object is unknown, and a set of strongly labelled negative images for which we know that every location in the image does not contain the object of interest.

This author was funded by the European Research Council under the ERC Starting
Grant agreement 204871-HUMANIS.


(a) Bicycle Annotation (b) Hand-waving Annotation

Fig. 1. Manual annotation of object and action data-sets, which are required for fully
supervised detector training

Fig. 2. In the task of annotating weakly labelled data we have a set of images or videos where the object or action of interest is present and a set of images or videos where the object or action is not present. Absence of the object or action is strong information, as we know every part of the image or video is negative, whereas presence of the object or action is weak information, as we do not know where the object or action is located.

Automatically annotating weakly labelled training data is typically posed as a multiple-instance learning (mil) problem [5, 6, 8]. Within a mil framework, a single image weakly labelled with data such as "This image contains a bike." is represented as a bag containing a set of instances, or possible locations of the object. "Positive bag" is used to refer to an image containing at least one instance of the class, while negative bags are those that contain no positive instances. Given a set of positive and negative bags for training, the goal of mil is to train a classifier that can correctly classify a test bag or test instance as either positive or negative. The latter is more relevant in the context of detector learning. In particular, taking a mil approach, the problem of detector learning can be solved in two stages: in the first stage a decision is made as to which portion of the positive images represents the objects, and in the second stage a standard detector is trained from the decision made in the first stage. This approach has substantial pragmatic value, as it allows computationally intensive transductive [9] methods [8, 10] to be used for automatically annotating the object location in the training data, and efficient state-of-the-art detectors [2] to locate the objects in the test data.
Classical mil approaches [11, 12] make use of two different types of information to train a classifier: intra-class and inter-class. Intra-class information

concerns the selected positive instances. The information is typically exploited by enforcing that the selected positive instances look similar to each other. In contrast, inter-class information refers to the difference in appearance between selected positive and negative instances. This information is normally used by introducing a constraint that all instances selected as positive look dissimilar to the instances selected as negative. In the case of automatic annotation of object locations [6, 8] a third type of information, saliency, is often exploited. Saliency¹ refers to knowledge about the appearance of foreground objects, i.e. "things" [13], regardless of the class of object to which they belong. Saliency may capture generic knowledge regarding the typical size and location of objects in photos, or express a relationship between the strength of image edges and the location of object bounding boxes [14]. Saliency is typically used to prune the space of possible object or action locations a priori, allowing us to consider a reduced set of possible locations. The measure of saliency itself can also be used for selecting positive instances.
In this paper, we ask the question: "Which of intra- and inter-class information is more useful in practice?" A close examination of a widely used benchmarking data-set can give us some hints. The voc2007 data-set [15] is often used to evaluate object detectors and mil approaches. A typical class in this data-set has approximately 300 associated images containing the object and 4,700 images not containing that object. If 100 candidate object locations (instances) per image are extracted using a saliency measure, this gives 470,000 strongly labelled negative instances vs. 300 true objects located somewhere within the 30,000 instances in the positive images. Furthermore, the object locations proposed by the saliency measure may not include the true object location². Therefore, when considering intra-class distances, we have fewer than 300 unlabelled similar positive instances in a high-dimensional feature space vs. an inter-class distance based upon 470,000 strongly labelled instances in the same high-dimensional space. For object detection, the feature spaces tend to have thousands, or even hundreds of thousands, of dimensions. When using an rbf kernel or a nearest neighbour classifier in such high-dimensional feature spaces, the coverage provided by 470,000 labelled negative instances provides substantially more useful inter-class information than the intra-class information provided by the 300 positive instances.
This simple observation motivates our approach, which only makes use of the inter-class information (strongly labelled negative examples), and is referred to as negative mining. By combining negative mining with existing saliency measures [14] we are able to produce a classifier that outperforms the majority of existing approaches to image and video annotation, and that can be readily integrated with many of them.
In this paper we show 1) how strongly labelled negative data can be used to create a state-of-the-art classifier, bypassing the problem of resolving complex interdependencies between the positive bags, and 2) how our classifier can be

¹ Typically learned from meta-data.
² On the voc2007 data-set, in 30% of the images the saliency measure of [14] fails to propose a valid location.

fused with several existing approaches to mil-based image annotation [8, 16, 17],
where it invariably leads to an improvement over the method it was fused with.

2 Prior Works

Traditional approaches to mil [11, 12] were applied to image-level categorisation. We are interested in a related but different problem: selecting which instances in the positive training set are true positives, and our literature review focuses on those approaches which have been successfully applied to the annotation of weakly labelled data.
The mi-svm formulation of Andrews et al. [17] makes use of both inter- and intra-class information, as it seeks a linear classifier that maximises the margin between the positive and negative instances. Both [5] and [7] make use of this approach. It requires a set of initial positive instances, and for that both [5] and [7] use the entire positive image as the initial positive instance. We make no such assumptions and will show that negative mining in conjunction with a saliency measure can outperform these mi-svm formulations. We also show how fusing negative mining and saliency measures with the mi-svm formulation can further improve annotation accuracy.
Another method that uses only inter-class and intra-class information is that of Siva and Xiang [16] for annotating weakly labelled action data. They also made use of the distance to the nearest negative example; however, they combine this with an intra-class measure in a single cost function. We show that our negative mining measure alone outperforms their combined cost function on all data-sets. Furthermore, we propose the use of saliency, which significantly improves the results on the msr2 action data-set used in [16].
Deselaers et al. [6] and Siva and Xiang [8] made use of all three types of information: inter-class, intra-class and saliency. Deselaers et al. [6] only use their intra-class measure in the first iteration, then use the selected instances to iteratively tune their choice of inter-class and saliency measures on an additional, manually annotated, auxiliary data-set. Our experimental results strongly suggest that inter-class information is more reliable and would provide more useful information at initialisation. Siva and Xiang [8] treat inter-class and intra-class information independently from each other, then fuse the results at a score level. For their inter-class measure, they use the mi-svm formulation like [5] and [7]. They then merge intra-class and saliency in a single cost function used to select a set of similar-looking instances. Again we show that inter-class and saliency are more meaningful measures and that combining inter-class and saliency leads to greater accuracy than combining intra-class and saliency measures. We will also show that the results of [8] can be further improved by fusing with our approach.
Fu et al. [18] also made use of the distance of instances from the negative bags. They selected instances furthest from the negative bags and used them to initialise cluster centres, which were then used to create the bag-level feature descriptors of [12]. Their goal was a bag-level classifier, and we differ from them in that we are interested in the direct annotation of the instances in the training

set. Furthermore, unlike [18], we treat and evaluate negative mining as a classifier in its own right rather than as a pre-processing heuristic.

3 Methodology

A formal definition of negative mining is given in Section 3.1. Section 3.2 contains a discussion of feature normalisation, which is essential for getting negative mining to work as a mil method. Finally, we describe the saliency measure used to define instances in images and videos in Section 3.3.

3.1 Negative Mining

Consider a data-set consisting of a set of positive images $I_i^+$ that contain the object of interest and a set of negative images $I_i^-$ which do not. Following [6] and [8] we consider a set of 100 salient locations, or instances, $x_{i,j=1 \ldots 100}$ in each image $i$. Each instance $x_{i,j}$ is represented by a bag-of-words (bow) histogram. The goal is to select a single instance $x_i^+$ from each positive image $I^+$ corresponding to the correct location of an object of interest. The negative mining algorithm accomplishes this by selecting the instance that maximises the distance to the nearest neighbour in any image containing only negative instances $x_{i,j}^-$,

$$x_i^+ = \arg\max_{x_{i,j}^+} \big\| x_{i,j}^+ - N(x_{i,j}^+) \big\|_1, \qquad (1)$$

where $\|\cdot\|_1$ is the L1 norm and $N(x_{i,j}^+)$ refers to the negative nearest neighbour of $x_{i,j}^+$. As mentioned earlier, for a typical voc class we have 470,000 negative instances ($x_{i=1 \ldots 4700,\, j=1 \ldots 100}^-$) in a 2,000-dimensional space. As such, the key to an efficient algorithm is fast nearest neighbour look-up, and to handle the large volume of data efficiently we make use of a KD-tree based approximate nearest neighbour algorithm [19], with 16 trees.
This approach of mining the nearest negative instance relies on the abundance of known negative instances and, unlike [6, 8], requires no optimisation of an intra-class cost function, resulting in a computationally efficient algorithm. Furthermore, unlike [8], this inter-class measure does not assume that the average instance in each positive bag represents a positive instance, nor does it assume that the entire image is a positive instance as in [5, 7].
An important variation on (1) incorporates a saliency measure. If we have a measure $\theta(\cdot)$ which serves as a prior on how likely an instance is to be an object of interest, regardless of the choice of class, we can simply add $\theta(x_{i,j}^+)$ to our negative mining cost:

$$x_i^+ = \arg\max_{x_{i,j}^+} \Big( \big\| x_{i,j}^+ - N(x_{i,j}^+) \big\|_1 + \theta(x_{i,j}^+) \Big). \qquad (2)$$

Later, in Section 3.3, we define $\theta(x_{i,j}^+)$ for each of the data-sets considered.
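Putting Eqs. (1) and (2) together, negative mining reduces to a nearest-neighbour query against the pool of negative instances followed by an arg-max per positive bag. The sketch below uses an exact KD-tree from scikit-learn rather than the 16-tree approximate search of [19]; names and array shapes are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import KDTree

def negative_mine(pos_bags, neg_instances, saliency=None):
    """Eq. (1)/(2) sketch: in every positive bag, pick the instance with the largest
    L1 distance to its nearest negative neighbour, optionally adding a saliency prior.
    pos_bags      : list of (n_i, d) arrays of instance histograms
    neg_instances : (m, d) array of all negative instances
    saliency      : optional list of (n_i,) saliency scores per bag"""
    tree = KDTree(neg_instances, metric='manhattan')   # exact NN here; the paper uses
                                                       # approximate randomised KD-trees
    picks = []
    for b, bag in enumerate(pos_bags):
        nnn_dist, _ = tree.query(bag, k=1)             # distance to nearest negative
        score = nnn_dist.ravel()
        if saliency is not None:
            score = score + saliency[b]                # Eq. (2)
        picks.append(int(np.argmax(score)))
    return picks
```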

3.2 Normalisation Strategies

Finding the instance with the maximum negative nearest neighbour (nnn) distance in a positive bag is complicated by the drastic change in the size of proposed instances. Consider Fig. 3(a), where we plot the distance from all the instances in the positive bags of a single class to their nearest negative neighbour (Distance $= \|x_{i,j}^+ - N(x_{i,j}^+)\|_1$) vs. the instance's size in pixels ($\mathrm{sizeof}(x_{i,j})$). The plot uses standard normalised bow histograms, which are typically used for both object and action detection:

$$\hat{h}(i) = \frac{h(i)}{\sum_{i=1}^{B} h(i)}, \qquad (3)$$

where $h$ is the bow histogram composed of $B$ bins, and the L1 distance between them³. We observe that the nnn distance for instances small in size is almost always much greater than the nnn distance associated with instances large in size. This behaviour can be attributed to sampling artefacts: small boxes naturally contain fewer densely sampled words. Compared to a large box, the distribution of words associated with a small box is much more likely to have a few sharply peaked modes, and many empty bins.
As a consequence, when selecting the instances that maximise the normalised distance to the nearest negative instance, we select very small instances in each positive bag. The same behaviour can be observed for other types of distances based upon normalised bow histograms (Fig. 3(d)). On the voc data-set this bias is extremely inappropriate, as the majority of images contain large object instances, and on the voc2007 data-set this causes negative mining to perform ten times worse than the random selection of positive instances. A similar degradation in performance can be observed on the msr2 data-set [20].
On the contrary, if we use unnormalised histograms (Fig. 3(c)) together with the L1 distance, we observe the opposite effect. Large instances contain many dense words and, owing to the sheer number of words, typically have a large distance from their nearest negative neighbour, while small instances lie very close to their nnn. Although biased towards large instances, this measure is at least biased in the correct direction, and performs substantially better than random.
To minimise the bias towards either large or small boxes, we consider the novel measure of root-normalised histograms,

$$\hat{h}(i) = \frac{h(i)}{\sqrt[C]{\sum_{i=1}^{B} h(i)}}. \qquad (4)$$

Empirically this measure performs better than either normalised or unnormalised histograms. We compare the accuracy of weak annotation using negative mining with the different nearest-neighbour distances in Section 4. Figure 3(b) shows the relationship of distance vs. instance size for root-normalised histograms.
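The three normalisation strategies compared here differ only in the denominator applied to the raw word counts. A minimal sketch follows; since the root order C in Eq. (4) is not specified in the text above, the default value used here is purely an assumption.

```python
import numpy as np

def bow_normalise(h, mode='root', C=3):
    """BoW histogram variants compared in Sec. 3.2. h: (B,) raw word counts.
    'norm'  : standard L1 normalisation, Eq. (3)
    'root'  : divide by the C-th root of the total count, Eq. (4); C's value is assumed
    'unnorm': raw counts."""
    h = np.asarray(h, float)
    total = max(h.sum(), 1.0)
    if mode == 'norm':
        return h / total
    if mode == 'root':
        return h / total ** (1.0 / C)
    return h
```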

³ This is equivalent to the histogram distance $d(x, y) = 1 - \sum_k \min(x_k, y_k)$ proposed in [16].

(a) Normalised  (b) Root-normalised  (c) Unnormalised  (d) Chi-squared and L2 distances of normalised histograms

Fig. 3. The L1 distances from all instances in the positive bags of the cat class in voc2007 to the nearest negative instance, plotted against the size in pixels of each instance in the positive bag. The L1 distance is computed on a 2,000-word bow histogram formed from regular-grid sift features. (a-c) illustrate the effect of different normalisation strategies on bow histograms. We see that selecting the instance with the largest distance to the nearest negative is correlated with the size of the instance, and this correlation is less pronounced for root-normalisation. (d) shows that this phenomenon is not limited to the L1 distance.

3.3 Saliency
Saliency is used in two places in our framework. We require it to propose a small set of viable instances, or potential locations of objects and actions, in each image or video. We also require a saliency measure of how likely a location is to be an object or action of any class, which we use in (2).

Instance Definition. To propose potential object locations on the voc2007 data-set, we use the generic object detector [14], which was also used in other works on weak object class annotation [6, 8]. The first 100 samples from the generic object detector per image are used as instances. For direct comparison with [6, 8] we use version 1.01 of [14].

To propose potential locations of actions in the msr2 video data-set we follow


the procedure in [16]. We run a person detector [2] on each video frame and select
a cuboid surrounding the detected person as a potential action location. From all
these possible cuboids, 200 instances are selected based upon the spatio-temporal
interest point (stip) density.

Instance Measure. To measure how likely an instance $x_{i,j}$ is to be a positive location of any object ($\theta(x_{i,j}^+)$), we use the value of objectness returned by the generic object detector [14]. Objectness has also been used by [6] and [8] for this purpose.
For the action data-set, the current literature does not have an equivalent of objectness. Instead we propose a simple heuristic and choose $\theta(x_{i,j}^+) = 0.6D$, where $D$ is the density of stips in the defined cuboid. There are two reasons for using stip density: first, stip density is used to sample the potential action cuboids (instances) from each video as per [16], and second, we expect motion wherever an action is being performed, which will generate a lot of stips. We weight stip density by 0.6 as it is a simple heuristic in comparison with the objectness used for object detection.

4 Experiments
The data-sets and features used in our experiments are outlined in this section. Our main analysis is the comparison of our negative mining result with state-of-the-art results in Section 4.1. Improvements in other mil formulations when fused with negative mining are given in Section 4.2. Different histogram normalisation strategies are evaluated in Section 4.3. Finally, in Section 4.4 we evaluate a leave-one-out classifier, assuming we have complete manual annotation of the training set, to determine the theoretical upper bound on the intra-class information in comparison to the inter-class information.

Data-Sets. For object detection we use the challenging Pascal voc2007 data-set [15] and for actions we use the msr2 data-set [21]. For comparison with [8] we use vocAll, which consists of all 20 classes with no pose annotation, unlike [6] and [7] who use manual pose annotation. For comparison with [6] and [7] we also include voc6x2, which consists of six classes (aeroplane, bicycle, boat, bus, horse, and motorbike) where the left and right poses are considered as separate classes, for a total of 12 classes. For both vocAll and voc6x2 we only exclude images annotated as difficult in the voc2007 data-set. As in other works [6], all results are presented as the percentage of correct localisations, where a correct localisation refers to 50% overlap of the selected instance with the ground truth as defined in [15].
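For reference, the 50% overlap (PASCAL) criterion reduces to an intersection-over-union test between the selected box and the ground truth; a small, self-contained sketch with hypothetical box coordinates:

```python
def pascal_correct(box, gt, thresh=0.5):
    """PASCAL criterion used in Sec. 4: a selected instance counts as a correct
    localisation if its intersection-over-union with the ground truth exceeds 0.5.
    Boxes are (x0, y0, x1, y1)."""
    ix0, iy0 = max(box[0], gt[0]), max(box[1], gt[1])
    ix1, iy1 = min(box[2], gt[2]), min(box[3], gt[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box) + area(gt) - inter
    return inter / union > thresh
```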

Features. For features we use bow histograms. For the voc data-set, bow histograms are formed from regular-grid SIFT features constructed with vlfeat [22]. For the msr2 data-set, bow histograms are formed using the histogram of flow and histogram of gradient features at spatio-temporal interest points (stips) as defined in [23]. For both data-sets, k-means clustering is used to construct a 2,000-word codebook.

4.1 Comparison to State-of-the-Art


We report our comparison of negative mining with existing state-of-the-art image annotation algorithms in Table 1 (refer to Fig. 8 for more detailed results and Fig. 7 for illustrations). For vocAll, negative mining (N) alone (27.1%) outperforms both the inter-class (26.6%) and intra-class (23.9%) classifiers of Siva and Xiang [8]. When negative mining is combined with saliency (N + θ), our performance (29.0%) is slightly better than the combined inter- and intra-class method of [8] (28.9%).
For the restricted voc6x2 data-set our negative mining (N) method (35.0%) is similar to saliency (θ) alone (34.8%), and when combined (N + θ) the performance increases (37.1%). The combined method is better than [7] without their cropping heuristic and also better than [6] using a single feature. The inter-class measure proposed by [8] works better on the single-pose subset than our negative mining based inter-class measure. However, as we show in Section 4.2, the inter-class measure of [8] can be further improved by fusing its score with our inter-class measure. Through the use of post-processing [7], multiple features [6] and iteratively training object models [6-8], the annotation results can be further improved to 49%, 50%, and 61% for [6-8] respectively. We do not compare directly with these further refinements, as our analysis concerns the use of negative mining as a classifier for the initial annotation of the weakly annotated data. However, four rounds of iterative training of a detector using negative mining as initialisation results in an annotation accuracy of 46%.
More interestingly, we note that the intra-class measure reported in [8] is actually intra-class with saliency, as they included the objectness value in their intra-class measure (Eq. 5 of [8]). However, the performance of their intra-class measure (23.9%) is worse than simply using objectness alone (25.1%) for the vocAll set (Fig. 4) and the same (34.8%) for the voc6x2 data-set. This clearly shows that their use of the intra-class measure at best does not improve performance, and at worst significantly reduces the performance of saliency. In contrast, our inter-class measure consistently boosts the performance of saliency on all data-sets.
On the msr2 data-set negative mining outperforms all existing methods. Our proposed saliency measure (stip density) also outperforms mi-svm [17] by itself, but is not as strong as the methods of [16, 24]. However, saliency combined with negative mining gives a 10% performance boost over all pre-existing methods, including [24], which makes use of a single manual annotation.

4.2 Fusion with Existing Classifiers


We combine negative mining and saliency with the intra-class, inter-class and combined inter-intra-class measures of [8], as well as the intra-class measure of [16]. Each measure provides a score for each individual instance in the positive bag and

Fig. 4. Results using the negative mining (N), saliency (θ), and combined negative mining and saliency (N + θ) methods

Table 1. Results using the negative mining (N), saliency (θ), and combined negative mining and saliency (N + θ) methods

             Pandey [7]** Init. Only   Localisation Only [6]**   Siva and Xiang [8]*                MI-SVM   Nguyen
Data-set     no crop     with crop     1 Feat.    5 Feat.        Intra C   Inter C   Intra+Inter C  [17]*    et al. [5]*   θ      N      N + θ
vocAll Avg.  N/A         N/A           N/A        N/A            23.9      26.6      28.9           25.4     22.4          25.1   27.1   29.0
voc6x2 Avg.  36.7        43.7          35.0       39.0           34.8      39.0      39.6           38.7     24.9          34.8   35.0   37.1

Data-set     [16]    [24]    MI-SVM    θ       N       N + θ
msr2 Avg.    71.2    71.7    55.8      60.8    73.9    81.5

*As reported in [8]. **As reported in [7] and [6] respectively.

Table 2. Augmentation of existing methods by fusing with our negative mining (N) and saliency (θ)

             Intra [8]   Intra [8]   Nguyen et al. [5]    Inter [8]            Inter [8]   Inter [8]     Inter [8] + Intra [8]   Intra [16]   Intra [16]
Data-set                 + N+θ       MI-SVM Init. Img.    MI-SVM Init. Avg.    + N+θ       + Intra [8]   + N+θ                                + N+θ
                                     (one iteration)      (one iteration)
vocAll Avg.  23.9        28.8        21.3                 26.6                 29.6        28.9          30.2                    21.4         24.3
voc6x2 Avg.  34.8        36.5        24.6                 39.0                 40.8        39.6          40.2                    31.0         36.2

the instance that maximises the sum of all measure scores is selected as the true positive instance. We weight all measures equally. Average results are shown in Table 2 and more detailed per-class results are in Table 4. Note that the inter-class measure of [8] is mi-svm [17] run for a single iteration and initialised with the average instance in each positive bag. Initialising with the average instance performs better than initialising with the entire image, the method used by [5]. In all cases, we show that a score-level fusion with our method (combined negative mining and saliency, N + θ) consistently boosts the performance of all other methods by between 1 and 5%.
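The score-level fusion used throughout this section can be summarised as mapping each measure's scores onto [0, 1] (cf. Sec. 4.4) and taking the arg-max of their unweighted sum; a minimal sketch, with names of our own choosing:

```python
import numpy as np

def fuse_and_select(score_lists):
    """Equal-weight score-level fusion for one positive bag: each measure contributes
    one score per instance; scores are linearly mapped onto [0, 1] and summed, and the
    instance maximising the fused score is selected.
    score_lists : list of (n,) arrays, one per measure."""
    fused = np.zeros_like(np.asarray(score_lists[0], float))
    for s in score_lists:
        s = np.asarray(s, float)
        rng = s.max() - s.min()
        fused += (s - s.min()) / rng if rng > 0 else np.zeros_like(s)
    return int(np.argmax(fused))
```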

4.3 Effect of Normalisation Strategies

As mentioned in Section 3.2, the normalisation of the bow histogram is vital for nearest neighbour to work. Fig. 5 shows annotation accuracy with normalised,

Data-set    vocAll   voc6x2   msr2
Norm        1.4      2.3      14.7
Root-Norm   27.1     35.0     73.9
UnNorm      26.3     33.1     74.3
Rand        14.7     18.0     41.7

Fig. 5. Comparison of different normalisation strategies. See supplementary materials for per-class results.

Fig. 6. The localisation result using just negative mining for the different bow normalisation strategies. Normalised histograms tend to select small boxes, unnormalised tends to select large boxes, and root-normalised exhibits a weaker bias for large and small boxes.

Table 3. Leave-one-out results evaluating the usefulness of the intra-class distance (P) when manually annotating all but one bag, in comparison to negative mining (N) and saliency (θ). See supplementary materials for per-class results.

Data-set      θ      N      N+θ    P      P+N    P+θ    P+N+θ
vocAll Avg.   25.1   27.1   29.0   26.3   27.7   28.6   29.6
voc6x2 Avg.   34.8   35.0   37.1   38.1   37.3   38.4   39.2
msr2 Avg.     60.8   73.9   81.5   79.5   77.3   73.4   82.6

root-normalised and unnormalised bow histograms and the L1 distance. We also show the performance of randomly selecting an instance from each positive bag. The performance of normalised histograms for selecting the instance that maximises the distance to the nearest negative is substantially worse than random selection. This is because the use of normalised histograms actively selects the smallest instance (bounding box) in each positive bag, which is not the desired behaviour for most bags (Fig. 6). Unnormalised histograms have the opposite effect, that is, they tend to select the largest instance (bounding box). Root-normalised histograms tend to be a good compromise between normalised and unnormalised for all data-sets. The only exception is the msr2 data-set, where the unnormalised distance is slightly better due to the fact that there are no overly large instances proposed for the action data-set in comparison to the object data-set.

4.4 Leave-One-Out Positive Mining vs. Negative Mining


Our approach is based on the assumption that the intra-class measure is not suitable for the first step in annotating weakly labelled data, due to the availability of very little true positive data in a high-dimensional space. To validate this we take a leave-one-out approach in which we assume all but one bag is manually annotated; this gives a theoretical upper bound on the intra-class information. In this scenario the unannotated bag can be annotated by selecting the instance that minimises the distance to its nearest manually annotated ground-truth instance; in practice we maximise the negative distance to be consistent with (1). We also consider combining this leave-one-out intra-class distance with our negative mining and saliency measures. We use the normalisation strategies (see Section 3.2) that maximise classification accuracy, being always normalised for

Class            Intra C [8]*   Inter C [8]*   Intra+Inter C [8]*   MI-SVM [17]*   Nguyen et al. [5]*   θ      N      N + θ
aeroplane 31.1 41.2 45.4 37.8 30.7 32.4 45.4 38.7
bicycle 18.5 17.7 20.6 17.7 16.5 16.9 20.2 22.2
bird 25.2 28.2 29.7 26.7 23.0 22.4 29.1 27.6
boat 13.8 13.3 12.2 13.8 14.9 19.9 14.9 21.0
bottle 03.3 05.3 04.1 04.9 04.9 04.1 04.5 06.6
bus 31.2 35.5 37.1 34.4 29.6 31.2 31.7 33.3
car 26.7 33.0 41.0 33.7 26.5 34.2 34.2 39.4
cat 42.1 50.1 53.4 46.6 35.3 41.8 48.4 46.0
chair 06.7 05.4 06.5 05.4 07.2 07.6 07.6 08.1
cow 28.4 30.5 31.9 29.8 23.4 31.2 27.7 34.8
diningtable 24.0 16.0 20.5 14.5 20.5 29.0 23.5 31.5
dog 33.3 36.1 40.9 32.8 32.1 33.0 35.9 38.0
horse 30.7 38.7 37.3 34.8 24.4 32.4 34.5 37.6
motorbike 34.7 44.1 46.5 41.6 33.1 38.8 39.6 43.3
person 14.3 20.6 22.3 19.9 17.2 21.3 21.9 23.0
pottedplant 09.4 11.4 10.2 11.4 12.2 09.4 14.3 11.4
sheep 26.0 25.0 27.1 25.0 20.8 31.3 25.0 28.1
sofa 25.3 23.6 32.3 23.6 28.8 24.0 29.7 34.5
train 42.9 47.9 49.0 45.2 40.6 32.2 45.2 43.7
tvmonitor 10.6 08.6 09.8 08.6 07.0 09.4 08.2 10.5
Average 23.9 26.6 28.9 25.4 22.4 25.1 27.1 29.0
Aeroplane left 42.2 43.8 37.5 42.2 26.6 39.1 50.0 39.1
Aeroplane right 38.5 59.6 55.8 57.7 25.0 42.3 44.2 50.0
Bicycle left 23.9 26.9 34.3 29.9 20.9 19.4 25.4 28.4
Bicycle right 33.9 30.7 30.7 19.4 25.8 29.0 27.4 30.6
Boat left 09.4 01.9 09.4 01.9 05.7 22.6 09.4 15.1
Boat right 13.8 13.8 12.1 13.8 15.5 22.4 10.3 20.7
Bus left 44.8 31.0 37.9 37.9 24.1 44.8 31.0 31.0
Bus right 29.7 43.2 43.2 43.2 21.6 40.5 29.7 35.1
Horse left 43.9 40.9 48.5 42.4 30.3 34.8 51.5 48.5
Horse right 33.9 54.8 51.6 56.5 22.6 30.6 45.2 45.2
Motorbike left 46.3 55.6 50.0 55.6 31.5 42.6 46.3 46.3
Motorbike right 57.5 66.0 63.8 63.8 48.9 48.9 48.9 55.3
Average 34.8 39.0 39.6 38.7 24.9 34.8 35.0 37.1
*As reported in [8]
Class | [16] | [24] | MI-SVM | S | N | N+S
boxing 40.7 57.4 20.4 57.4 75.9 83.3
clapping 79.4 70.6 61.8 67.6 73.5 82.4
handwaving 93.6 87.2 85.1 57.4 72.3 78.8
Average 71.2 71.7 55.8 60.8 73.9 81.5
Fig. 7. Successes and failures of negative mining with saliency (N + S)
Fig. 8. Results using negative mining (N), saliency (S), and combined negative mining and saliency (N + S) methods


4 The exception being on the msr2 data-set, which was mapped onto [0, 0.6]; see section 3.3.
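
As a rough sketch of the leave-one-out comparison and the score fusion described above (instances are again assumed to be NumPy histograms, all names are illustrative, and the msr2-specific [0, 0.6] mapping from footnote 4 is ignored):

    import numpy as np

    def leave_one_out_scores(bag, annotated_positives):
        """Score each instance of the held-out bag by the (negated) L1 distance to its
        nearest manually annotated positive instance, so that larger is better."""
        return np.array([-min(np.abs(inst - gt).sum() for gt in annotated_positives)
                         for inst in bag])

    def fuse_and_select(*score_lists):
        """Map every cue linearly onto [0, 1] over the bag, sum the mapped scores,
        and return the index of the highest-scoring instance."""
        fused = np.zeros(len(score_lists[0]))
        for s in score_lists:
            s = np.asarray(s, dtype=float)
            rng = s.max() - s.min()
            if rng > 0:
                fused += (s - s.min()) / rng
        return int(np.argmax(fused))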

Table 4. Augmentation of existing methods by fusing with our negative mining (N) and saliency (S)

Data-set | Intra [8] | Intra [8] + N+S | Nguyen et al. [5] MI-SVM Init. Img. | Inter [8] MI-SVM Avg. Init. One Iteration | Inter [8] + N+S | Inter [8] + Intra [8] | Inter [8] + Intra [8] + N+S | Intra [16] | Intra [16] + N+S
aeroplane 31.1 37.8 27.7 41.2 46.6 45.4 45.8 31.9 36.6
bicycle 18.5 21.8 19.3 17.7 21.0 20.6 21.8 14.8 19.8
bird 25.2 27.0 20.3 28.2 29.7 29.7 30.9 24.8 26.1
boat 13.8 17.1 14.4 13.3 18.8 12.2 20.4 11.6 17.7
bottle 03.3 04.1 05.7 05.3 05.3 04.1 05.3 04.5 04.1
bus 31.2 31.2 31.7 35.5 37.6 37.1 37.6 25.8 23.7
car 26.7 39.8 29.0 33.0 40.7 41.0 40.8 21.5 22.9
cat 42.1 48.7 32.3 50.1 50.1 53.4 51.6 44.5 45.4
chair 06.7 08.5 07.4 05.4 06.7 06.5 07.0 06.1 06.5
cow 28.4 31.9 24.1 30.5 29.8 31.9 29.8 27.7 28.4
diningtable 24.0 32.0 15.0 16.0 26.5 20.5 27.5 18.0 24.5
dog 33.3 39.7 33.0 36.1 39.4 40.9 41.3 30.6 35.6
horse 30.7 38.7 17.1 38.7 41.8 37.3 41.8 30.3 33.1
motorbike 34.7 44.9 30.2 44.1 45.7 46.5 47.3 33.1 38.4
person 14.3 23.3 14.9 20.6 23.8 22.3 24.1 13.9 14.1
pottedplant 09.4 11.4 12.7 11.4 12.2 10.2 12.2 10.6 8.6
sheep 26.0 27.1 24.0 25.0 28.1 27.1 28.1 17.7 22.9
sofa 25.3 34.9 18.8 23.6 29.7 32.3 32.8 18.3 24.5
train 42.9 45.2 39.5 47.9 48.3 49.0 48.7 36.0 42.5
tvmonitor 10.6 10.5 08.2 08.6 09.8 09.8 09.4 05.5 11.3
Average 23.9 28.8 26.6 21.3 29.6 28.9 30.2 21.4 24.3
Aeroplane left 42.2 42.2 31.3 43.8 37.5 37.5 45.3 32.8 42.2
Aeroplane right 38.5 46.2 26.9 59.6 61.5 55.8 53.8 28.8 42.3
Bicycle left 23.9 28.4 25.4 26.9 28.4 34.3 31.3 25.4 23.9
Bicycle right 33.9 27.4 27.4 30.7 37.1 30.7 30.6 27.4 30.6
Boat left 09.4 17.0 07.5 01.9 15.1 09.4 13.2 05.7 15.1
Boat right 13.8 15.5 15.5 13.8 12.1 12.1 15.5 05.2 15.5
Bus left 44.8 34.5 24.1 31.0 34.5 37.9 37.9 34.5 37.9
Bus right 29.7 37.8 24.3 43.2 51.4 43.2 45.9 18.9 37.8
Horse left 43.9 50.0 24.2 40.9 51.5 48.5 50.0 47.0 48.5
Horse right 33.9 41.9 03.2 54.8 53.2 51.6 51.6 38.7 45.2
Motorbike left 46.3 46.3 31.5 55.6 50.0 50.0 50.0 46.3 42.6
Motorbike right 57.5 51.1 53.2 66.0 57.4 63.8 57.4 61.7 53.2
Average 34.8 36.5 24.6 39.0 40.8 39.6 40.2 31.0 36.2

This leave-one-out measure shows the maximal accuracy of a nearest neighbour
classifier on the voc2007 and msr2 data-sets given the best possible annotations;
as such it can be seen as an approximate upper bound on the quality of any 1-nn
algorithm. Despite this, on the full voc data-set, the classification accuracy of
negative mining substantially exceeds that of a nearest neighbour classifier based
on positive distance alone, and is only 0.6% worse than the combined classifier
using both positive and negative distances. Similar results can be seen on the
msr2 data-set, where negative mining and saliency together outperform all but
the combination of intra-distances, negative mining and saliency. On the voc6x2
data-set the additional annotations of left and right, and the fact that truncated
or occluded models are discarded, reduce the possible changes in appearance and
make intra-distances a more useful measure. Still, the combined measure of all
cues is only 2% better than negative mining and saliency, and this is a reasonable
trade-off for the much weaker annotation requirements.

5 Conclusion

This work has presented a simple, robust technique based upon negative mining.
By itself, this technique outperforms all previous existing techniques on the
msr2 and voc2007 (all views) data-sets. We have shown how this technique can
be readily combined with other approaches to MIL, by fusing it as an additional
potential, and that doing so has always led to a significant improvement in
performance over the original method. Our comprehensive experiments have
validated our approach. We believe that the conceptual and implementational
simplicity of our approach, alongside its state-of-the-art performance, will make
it a valuable tool for increasing the performance of many more sophisticated MIL
approaches in the future.

References
1. Viola, P.A., Jones, M.J.: Rapid object detection using a boosted cascade of simple
features. In: CVPR, pp. 511–518 (2001)
2. Felzenszwalb, P., Girshick, R., McAllester, D., Ramanan, D.: Object detection with
discriminatively trained part-based models. TPAMI 32(9) (2010)
3. Poppe, R.: A survey on vision-based human action recognition. Image and Vision
Computing 28(6), 976–990 (2010)
4. Weinland, D., Ronfard, R., Boyer, E.: A survey of vision-based methods for ac-
tion representation, segmentation and recognition. Computer Vision and Image
Understanding 115(2), 224–241 (2011)
5. Nguyen, M.H., Torresani, L., de la Torre, F., Rother, C.: Weakly supervised
discriminative localization and classification: a joint learning process. In: ICCV,
pp. 1925–1932 (2009)
6. Deselaers, T., Alexe, B., Ferrari, V.: Localizing Objects While Learning Their Ap-
pearance. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part IV.
LNCS, vol. 6314, pp. 452–466. Springer, Heidelberg (2010)
7. Pandey, M., Lazebnik, S.: Scene recognition and weakly supervised object local-
ization with deformable part-based model. In: ICCV (2011)
8. Siva, P., Xiang, T.: Weakly supervised object detector learning with model drift
detection. In: ICCV (2011)
9. Vapnik, V.N.: Statistical Learning Theory. Wiley-Interscience (1998)
10. Deselaers, T., Ferrari, V.: A conditional random field for multiple-instance learning.
In: ICML (2010)
11. Maron, O., Lozano-Perez, T.: A framework for multiple-instance learning. In: NIPS
(1998)
12. Chen, Y., Bi, J., Wang, J.: Miles: Multiple-instance learning via embedded instance
selection. PAMI 28(12), 1931–1947 (2006)
13. Adelson, E.H.: On seeing stuff: the perception of materials by humans and ma-
chines. In: SPIE, vol. 4299, pp. 1–12 (2001)
14. Alexe, B., Deselaers, T., Ferrari, V.: What is an object? In: CVPR, pp. 73–80
(2010)
15. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The
pascal visual object classes (voc) challenge. IJCV 88(2), 303–338 (2010)
16. Siva, P., Xiang, T.: Weakly supervised action detection. In: BMVC (2011)
17. Andrews, S., Tsochantaridis, I., Hofmann, T.: Support vector machines for
multiple-instance learning. In: NIPS, pp. 577–584 (2003)
18. Fu, Z., Robles-Kelly, A., Zhou, J.: MILIS: Multiple Instance Learning with Instance
Selection. TPAMI (99) (2010)
19. Muja, M., Lowe, D.G.: Fast approximate nearest neighbors with automatic algo-
rithm configuration. In: International Conference on Computer Vision Theory and
Application VISSAPP 2009, pp. 331–340. INSTICC Press (2009)
20. Yuan, J., Liu, Z., Wu, Y.: Discriminative subvolume search for efficient action
detection. In: CVPR, pp. 2442–2449 (2009)
21. Cao, L., Liu, Z., Huang, T.S.: Cross-data action detection. In: CVPR (2010)
22. Vedaldi, A., Fulkerson, B.: VLFeat: An open and portable library of computer
vision algorithms (2008), http://www.vlfeat.org/
23. Laptev, I., Lindeberg, T.: Space-time interest points. In: ICCV, Nice, France,
pp. 432–439 (2003)
24. Siva, P., Xiang, T.: Action detection in crowds. In: BMVC (2010)
Describing Clothing by Semantic Attributes

Huizhong Chen1, Andrew Gallagher2,3, and Bernd Girod1

1 Department of Electrical Engineering, Stanford University, Stanford, California
2 Kodak Research Laboratories, Rochester, New York
3 Cornell University, Ithaca, New York

Abstract. Describing clothing appearance with semantic attributes is an appealing
technique for many important applications. In this paper, we propose a fully
automated system that is capable of generating a list of nameable attributes for
clothes on human body in unconstrained images. We extract low-level features
in a pose-adaptive manner, and combine complementary features for learning at-
tribute classifiers. Mutual dependencies between the attributes are then explored
by a Conditional Random Field to further improve the predictions from inde-
pendent classifiers. We validate the performance of our system on a challenging
clothing attribute dataset, and introduce a novel application of dressing style anal-
ysis that utilizes the semantic attributes produced by our system.

1 Introduction
Over recent years, computer vision algorithms that describe objects on the semantic
level have attracted research interest. Compared to conventional vision tasks such as ob-
ject matching and categorization, learning meaningful attributes offers a more detailed
description about the objects. One example is the FaceTracer search engine [1], which
allows the user to perform face queries with a variety of descriptive facial attributes.
In this paper, we are interested in learning the visual attributes for clothing items.
As shown in Fig.1, a set of attributes is generated to describe the visual appearance
of clothing on the human body. This technique has a great impact on many emerging
applications, such as customer profile analysis for shopping recommendations. With a
collection of personal or event photos, it is possible to infer the dressing style of the per-
son or the event by analyzing the attributes of clothes, and subsequently make shopping
recommendations. The application of dressing style analysis is shown in Fig.9.
Another important application is context-aware person identification, where many
researchers have demonstrated superior performance by incorporating clothing infor-
mation as a contextual cue that complements facial features [2–4]. Indeed, within a
certain time frame (i.e., at a given event), people are unlikely to change their clothing.
By accurately describing the clothing, person identification accuracy can be improved
over conventional techniques that only rely on faces. In our study, we also found that
clothing carries significant information for inferring the gender of the wearer. This observa-
tion is consistent with the prior work of [5, 6], which exploits human body information
to predict gender. Consequently, a better gender classification system can be developed
by combining clothing information with traditional face-based gender recognition al-
gorithms.


Fig. 1. With the estimated human pose, our attribute learning algorithm generates semantic attributes for the clothing (example output: women's, no collar, white color, tank top, no sleeve, high skin exposure, solid pattern)

We present a fully automatic system that learns semantic attributes for clothing on
the human upper body. We take advantage of the recent advances in human pose estima-
tion [7], by adaptively extracting image features from different human body parts. Due
to the diversity of the clothing attributes that we wish to learn, a single type of feature is
unlikely to perform well on all attributes. Consequently, for each attribute, the predic-
tion is obtained by aggregating the classification results from several complementary
features. Last, but not the least, since the clothing attributes are naturally correlated,
we also explore the mutual dependencies between the attributes. Essentially, these mu-
tual dependencies between various clothing attributes capture the Rules of Style. For
example, neckties are rarely worn with T-shirts. To model the style rules, a conditional
random field (CRF) is employed on top of the classification predictions from individ-
ual attribute classifiers, and the final list of attributes will be produced as the inference
result of the CRF.
To evaluate the descriptive ability of our system, we have created a dataset with an-
notated images that contain clothed people in unconstrained settings. Learning clothing
attributes in unconstrained settings can be extremely difficult due to the wide variety of
clothing appearances, such as geometric and photometric distortions, partial occlusions,
and different lighting conditions. Even under such challenging conditions, our system
demonstrates very good performance. Our major contributions are as follows: 1) We
introduce a novel clothing feature extraction method that is adaptive to human pose;
2) We exploit the natural rules of fashion by learning the mutual relationship between
different clothing attributes, and show improved performance with this modeling; 3)
We propose a new application that predicts the dressing style of a person or an event by
analyzing a group of photos, and demonstrate gender classification from clothing that
supports similar findings by other researchers [5, 6].

2 Related Work
The study of clothing appearance has become a popular topic because many tech-
niques in context-aware person identification use clothing as an important contextual
cue. Anguelov et al. [2] constructed a Markov Random Field to incorporate clothing
and other contexts with face recognition. They extract clothing features by collecting
color and texture information from a rectangular bounding box under the detected face,
but this approach suffers from losing clothing information by neglecting the shape and
variable pose of the human body. To overcome this problem, Gallagher and Chen [4]
proposed clothing cosegmentation from a set of images and demonstrated improved
performance on person identification. However, clothing is described only by low-level
features for the purpose of object matching, which differs from our work that learns
mid-level semantic attributes for clothes.
Very recently, attribute learning has been widely applied to many computer vision
tasks. In [8], Ferrari and Zisserman present a probabilistic generative model to learn vi-
sual attributes such as red, stripes and dots. With the challenge of training mod-
els on noisy images returned from Google search, their algorithm has shown very good
performance. Farhadi et al. [9] learn discriminative attributes and effectively categorize
objects using the compact attribute representation. Similar work has been performed by
Russakovsky and Fei-Fei [10], by utilizing attributes and exploring transfer learning
for object classification on large-scale datasets. Also, for the task of object classifica-
tion, Parikh and Grauman [11] build a set of attributes that is both discriminative and
nameable by displaying categorized object images to a human in the loop. In [12],
Kumar et al. propose attribute and simile classifiers that describe face appearances,
and have demonstrated very competitive results for the application of face verification.
Siddiquie et al. [13] explore the co-occurrence of attributes for image ranking and re-
trieval with multi-attribute queries. One of the major challenges of attribute learning is
the lack of training data, since the acquisition of labels can be both labor-intensive and
time-consuming. To overcome this limitation, Berg et al. [14] proposed mining both
the catalog images and their descriptive text from the Internet and performing language
processing to discover attributes. Another interesting application is to incorporate the
discovered attributes with language models to generate a sentence for an image, in a
similar manner that a human might describe an image [15, 16]. Most work in the at-
tribute learning literature either assumes a pre-detected bounding box that contains the
object of interest, or uses the image as a whole for feature extraction. Our work is differ-
ent from previous work not simply because we perform attribute learning on clothing,
but also due to the fact that we extract features that are adaptive to unconstrained human
poses. We show that with the prior knowledge of human poses, features can be collected
in a more sensible way to make better attribute predictions.
Recently, Bourdev et al. [6] proposed a system that describes the appearance of peo-
ple, using 9 binary attributes such as is male, has T-shirt, and long hair. They
used a set of parts from Poselets [17] for extracting low-level features and perform sub-
sequent attribute learning. In comparison, we propose a system that comprehensively
describes the clothing appearance, with 23 binary attributes and 3 multi-class attributes.
We not only consider high-level attributes such as clothing categories, but also deal
with some very detailed attributes like collar presence, neckline shape, striped,
spotted and graphics. In addition, we explicitly exploit human poses by selectively
changing sampling positions of our model and experimentally demonstrate the benefits
of modeling pose. Further, [6] applies an additional layer of SVMs to explore the at-
tribute correlations, whereas in our system, the attribute correlations are modeled with a
CRF, which benefits from computational efficiency. Besides the system of [6] that learns
descriptive attributes for people, some limited work has been done to understand the
clothing appearance on the semantic level. Song et al. [18] proposed an algorithm that
predicts job occupation via human clothes and contexts, showing emerging applica-
tions by investigating clothing visual appearance. Yang and Yu [19] proposed a cloth-
ing recognition system that identifies clothing categories such as suit and T-shirt. In
contrast, our work learns a much broader range of attributes like collar presence and
sleeve length. The work that explores similar attributes to ours is [20]. However, their
system is built for images taken in a well controlled fitting room environment with con-
strained human poses in frontal view, whereas our algorithm works for unconstrained
images with varying poses. In addition, their attribute predictions use hand-designed
features, while we follow a more disciplined learning approach that more easily allows
new attributes to be added to the model.

3 Clothing Attributes and Image Data


By surveying multiple online catalogs, we produced a list of common attributes to de-
scribe clothing. As shown in Table 1, some of these attributes are binary like collar
presence, while some are multi-class such as clothing category. Although we tried
to include a full set of attributes that comprehensively describes clothing, we dropped a
few attributes like contains logo due to the very small number of positive examples.
Clothing images and labels are required to train attribute classifiers. We have col-
lected images from Sartorialist 1 and Flickr, by applying an upper body detector [21] to
select pictures with people. Altogether, we harvested 1856 images that contain clothed
people (mostly pedestrians on the streets). Some image samples from our dataset are
shown in Fig.8. We then use Amazon Mechanical Turk (AMT) to collect ground truth
labels. When labeling a piece of clothing on the human body, the AMT workers were
asked to make a choice for each attribute. For instance, the worker may select one an-
swer from 1) no sleeve; 2) short sleeve; 3) long sleeve for the attribute sleeve length.
Note that for the gender attribute, in order to avoid the ambiguity of judging whether
this piece of clothing is men's or women's, workers were explicitly told to label the
gender of the wearer. To eliminate noisy labels, every image was labeled by 6 work-
ers. Only labels that have 5 or more agreements are accepted as the ground truth. We
gathered 283,107 labels from the AMT workers, and the ground truth statistics of our
clothing attribute dataset are summarized in Table 1.

4 Learning Clothing Attributes


The flowchart of our system is illustrated in Fig.2. For an input image, human pose
estimation is performed to find the locations of the upper torso and arms. We then
extract 40 features from the torso and arm regions, and subsequently quantize them.
For each attribute, we perform SVM classification using the combined feature computed
from the weighted sum of the 40 features. Each attribute classifier outputs a probability
score that reflects the confidence of the attribute prediction. Next, a CRF is employed
to learn the stylistic relationships between various attributes. By feeding the CRF with
the probability scores from the attribute classifiers, the attribute relations are explored,
which leads to better predictions than independently using the attribute classifiers.
1 http://www.thesartorialist.com

Table 1. Statistics of the clothing attribute dataset. There are 26 attributes in total, including 23
binary-class attributes (6 for pattern, 11 for color and 6 miscellaneous attributes) and 3 multi-class
attributes (sleeve length, neckline shape and clothing category).

Clothing pattern (Positive / Negative): Solid (1052 / 441), Floral (69 / 1649), Spotted (101 / 1619), Plaid (105 / 1635), Striped (140 / 1534), Graphics (110 / 1668)
Major color (Positive / Negative): Red (93 / 1651), Yellow (67 / 1677), Green (83 / 1661), Cyan (90 / 1654), Blue (150 / 1594), Purple (77 / 1667), Brown (168 / 1576), White (466 / 1278), Gray (345 / 1399), Black (620 / 1124), > 2 Colors (203 / 1541)
Wearing necktie: Yes 211, No 1528
Collar presence: Yes 895, No 567
Gender: Male 762, Female 1032
Wearing scarf: Yes 234, No 1432
Skin exposure: High 193, Low 1497
Placket presence: Yes 1159, No 624
Sleeve length: No sleeve (188), Short sleeve (323), Long sleeve (1270)
Neckline shape: V-shape (626), Round (465), Others (223)
Clothing category: Shirt (134), Sweater (88), T-shirt (108), Outerwear (220), Suit (232), Tank Top (62), Dress (260)

4.1 Human Pose Estimation

Thanks to the recent progress in human pose estimation [7, 22, 23], the analysis of
complex human poses has been made possible. With the knowledge of the physical
pose of the person, clothing attributes can be learned more effectively, e.g., features
from the arm regions offer a strong clue for sleeve length.
Estimating the full body pose from 2-D images still remains a challenging problem,
partly because the lower body is occasionally occluded or otherwise not visible in some
images. Consequently, we only consider the clothing items on the upper body. We apply
the work in [7] for human pose estimation and briefly review the technique here for the
completeness of the paper. Given an input image, the upper body of the person is firstly
located by using complementary results of an upper body detector [21] and Viola-Jones
face detector [24]. The bounding box of the detected upper body is then enlarged, and
the GrabCut algorithm [25] is used to segment the person from the background. Person-
specific appearance models for different body parts are estimated within the detected
upper body window. Finally, an articulated pose is formed within the segmented person
area, by using the person-specific appearance models and generic appearance models.
The outputs of the pose estimation algorithm are illustrated in Fig.1, which include a
stick-man model and the posterior probability map of the six upper body regions (head,
torso, upper arms and lower arms). We threshold the posterior probability map to obtain
binary masks for the torso and arms, while ignoring the head region since it is not related
to clothing attribute learning.

4.2 Feature Extraction

Due to the large number of attributes that we wish to learn, the system is unlikely to
achieve optimal performance with a single type of feature. For example, while texture
descriptors are useful for analyzing the clothing patterns such as striped and dotted,
they are not suitable for describing clothing colors. Therefore, in our implementation
we use 4 types of base features, including SIFT [26], texture descriptors from the Maximum Response Filters [27], color in the LAB space, and skin probabilities from our skin detector.


Fig. 2. Flowchart of our system. Several types of features are extracted based on the human pose.
These features are combined based on their predictive power to train a classifier for each attribute.
A Conditional Random Field that captures the mutual dependencies between the attributes is
employed to make attribute predictions. The final output of the system is a list of nameable
attributes that describe the clothing appearance.

Fig. 3. Extraction of SIFT over the torso region. For better visualization, we display fewer de-
scriptors than the actual number. The left figure shows the location, scale and orientation of the
SIFT descriptors in green circles. The right figure depicts the relative positions between the sam-
pling points (red dots), the torso region (white mask) and the torso stick (green bar).

As mentioned in Section 1, the features are extracted in a pose-adaptive way. The
sampling location, scale and orientation of the SIFT descriptors depend on the esti-
mated human body parts and the stick-man model. Fig.3 illustrates the extraction of the
SIFT descriptors over the person's torso region. The sampling points are arranged ac-
cording to the torso stick and the boundary of the torso. The configuration of the torso
sampling points is a 2-D array, whose size is given by the number of samples along the
stick direction times the number of samples normal to the stick direction. The scale of
the SIFT descriptor is determined by the size of the torso mask, while the orientation is
simply chosen as the direction of the torso stick. The extraction of SIFT features over
the 4 arm regions is done in a similar way, except that the descriptors are only sampled
along the arm sticks. In our implementation, we sample SIFT descriptors at 64 32

Table 2. Summary of feature extraction

Feature type: SIFT, Texture descriptor, Color, Skin probability
Region: Torso, Left upper arm, Right upper arm, Left lower arm, Right lower arm
Aggregation: Average pooling, Max pooling

The remaining three of our base features, namely texture descriptors, color and
skin probabilities, are computed for each pixel in the five body regions.
Once the base features are computed, they are quantized by soft K-means with 5
nearest neighbors. As a general design guideline, features with larger dimensionality
should have more quantization centroids. Therefore we allocate 1000 visual words for
SIFT, 256 centroids for texture descriptor, and 128 centroids for color and skin proba-
bility. We employed a Kd-tree for efficient feature quantization. The quantized features
are then aggregated by performing average pooling or max pooling over the torso and
arm regions. Note that the feature aggregation for torso is done by constructing a 4 × 4
spatial pyramid [28] over the torso region and then average or max pooled.
Table 2 summarizes the features that we extract. In total, we have 40 different fea-
tures from 4 feature types, computed over 5 body regions with 2 aggregation methods.
Last but not least, we extract one more feature that is specifically designed for
learning clothing color attributes, which we call skin-excluded color feature. Obviously,
the exposed skin color should not be used to describe clothing colors. Therefore we
mask out the skin area (generated by our skin detector) to extract the skin-excluded
color feature. As the arms may contain purely skin pixels when the person is wearing
no-sleeve clothes, it may not be feasible to aggregate skin-excluded color feature for the
arm regions. Hence we perform aggregation over the whole upper body area (torso +
4 arm segments), excluding the skin region. Also, since the maximum color responses
are subject to image noise, only average pooling is adopted. So compared to the set of
40 features described above, we only use 1 feature for learning clothing color.
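
The quantisation and pooling steps can be illustrated with the following sketch. The inverse-distance weighting of the soft assignment is our own simplification (the paper only states that soft K-means with 5 nearest neighbours is used), and all names are illustrative.

    import numpy as np

    def soft_assign(descriptors, vocab, k=5):
        """Soft k-means assignment: each descriptor votes for its k nearest visual
        words, with votes weighted inversely to distance and normalised to sum to 1."""
        hist = np.zeros(len(vocab))
        for d in descriptors:
            dist = np.linalg.norm(vocab - d, axis=1)
            nn = np.argsort(dist)[:k]
            w = 1.0 / (dist[nn] + 1e-12)
            hist[nn] += w / w.sum()
        return hist

    def pool_region(per_point_hists, mode="average"):
        """Average or max pooling of the quantised descriptors over one body region."""
        H = np.vstack(per_point_hists)
        return H.mean(axis=0) if mode == "average" else H.max(axis=0)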

4.3 Attribute Classification

We utilize Support Vector Machines (SVMs) [29] to learn attributes since they demon-
strate state-of-the-art performance for classification. For clothing color attributes like
red and blue, we simply use the skin-excluded color feature to train one SVM per
color attribute. For each of the other clothing attributes such as wear necktie and
collar presence, we have 40 different features and corresponding weak classifiers but
it is uncertain which feature offers the best classification performance for the attribute
that we are currently learning. A naive solution would be to concatenate all features
into a single feature vector and feed it to an SVM for attribute classification. How-
ever, this approach suffers from three major drawbacks: 1) the model is likely to overfit
due to the extremely high dimensional feature vector; 2) due to the high feature di-
mension, classification can be slow; and most importantly, 3) within the concatenated
feature vector, high dimensional features will dominate over low dimensional ones so the classifier performance is similar to the case when only high dimensional features are used.


Fig. 4. A CRF model with M attribute nodes pairwise connected. Fi denotes the feature for
inferring attribute i

the classifier performance is similar to the case when only high dimensional features are
used. Indeed, in our experiments we found that the concatenated feature vector offers
negligible improvements over the SVM that uses only the SIFT features.
Another attribute classification approach is to train a set of 40 SVMs, one for each
feature type, and pick the best performing SVM as the attribute classifier. This method
certainly works, but we are interested in achieving an even better performance with all
the available features. Using the fact that SVM is a kernel method, we form a com-
bined kernel from the 40 feature kernels by weighted summation, where the weights
correspond to the classification performance of the features. Intuitively, better feature
kernels are assigned heavier weights than weaker feature kernels. The combined kernel
is then used to train an SVM classifier for the attribute. This method is inspired by the
work in [30], where they demonstrated significant scene classification improvement by
combining features. Our experimental results in Section 5.1 also show the advantage
offered by feature combination.
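
A minimal sketch of this kernel combination is given below, assuming each feature type is an array of histograms and that the per-feature weights have already been estimated (for example, from each feature's individual cross-validation accuracy); the helper names and data layout are ours, not the authors'.

    import numpy as np
    from sklearn.svm import SVC

    def chi2_kernel(A, B, gamma=1.0):
        """Exponentiated chi-squared kernel between rows of A and B (histograms)."""
        D = np.array([[np.sum((a - b) ** 2 / (a + b + 1e-12)) for b in B] for a in A])
        return np.exp(-gamma * D)

    def combined_kernel(feature_sets, weights):
        """Weighted sum of per-feature chi-squared kernels, normalised by the total
        weight; each entry of feature_sets is an (n_samples x dim) histogram array."""
        K = sum(w * chi2_kernel(X, X) for w, X in zip(weights, feature_sets))
        return K / float(sum(weights))

    # Hypothetical usage: 'feats' is a list of histogram arrays (one per feature type),
    # 'weights' their individual accuracies, 'labels' the binary attribute labels.
    # K = combined_kernel(feats, weights)
    # clf = SVC(kernel="precomputed", probability=True).fit(K, labels)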

4.4 Multi-attribute Inference


Due to the functionality and fashion of clothing, it is common to see correlations be-
tween clothing attributes. As an example, in our ground truth dataset there is only 1
instance when the person wears a necktie but does not have a collar. It should be noted
that the dependencies between the attributes are not symmetric, e.g., while having a
necktie strongly suggests collar presence, the presence of a collar does not indicate that
the person should wear a necktie.
We explore the mutual dependencies between attributes by applying a CRF with the
SVM margins from the previous attribute classification stage. Each attribute functions
as a node in the CRF, and the edge connecting every two attribute nodes reflects the joint
probability of these two attributes. We build a fully connected CRF with all attribute
nodes pairwise connected, as shown in Fig.4.
Let us consider the relation between two attributes, A1 and A2 . F1 and F2 are the
features that we use to infer A1 and A2 respectively. Our goal is to maximize the con-
ditional probability P (A1 , A2 |F1 , F2 ):
    P(A1, A2 | F1, F2) = P(F1, F2 | A1, A2) P(A1, A2) / P(F1, F2)                  (1)
                       ∝ P(F1, F2 | A1, A2) P(A1, A2)                              (2)
                       = P(F1 | A1) P(F2 | A2) P(A1, A2)                           (3)
                       ∝ [P(A1 | F1) / P(A1)] [P(A2 | F2) / P(A2)] P(A1, A2)       (4)

Fig. 5. Comparison of attribute classification under 3 scenarios: the best single feature with the pose model, the combined feature with the pose model, and the combined feature without the pose model

Equation 3 is consistent with our CRF model in Fig.4, assuming that the observed
feature Fi is independent of all other features once the attribute Ai is known. From
the ground truth of the training data, we can estimate the joint probability P (A1 , A2 )
as well as the priors P(A1) and P(A2). The conditional probabilities P(A1 | F1) and
P(A2 | F2) are given by the SVM probability outputs for A1 and A2 respectively. We
define the unary potential ψ(Ai) = -log( P(Ai | Fi) / P(Ai) ) and the edge potential
φ(Ai, Aj) = -log P(Ai, Aj). Following Equation 1, the optimal inference for (A1, A2)
is achieved by minimizing ψ(A1) + ψ(A2) + φ(A1, A2), where the first two terms are
the unary potentials associated with the nodes, and the last term is the edge potential
that describes the relation between the attributes.
For a fully connected CRF with a set of nodes S and a set of edges E, the optimal
graph configuration is obtained by minimizing the graph potential, given by:

    Σ_{Ai ∈ S} ψ(Ai) + λ Σ_{(Ai, Aj) ∈ E} φ(Ai, Aj)                                (5)

where λ assigns relative weights between the unary potentials and the edge potentials.
It is typically less than 1 because a fully connected CRF normally contains more edges
than nodes. The actual value of λ can be optimized by cross validation. In our imple-
mentation, we use belief propagation [31] to minimize the attribute label cost.
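
To make the potentials concrete, the following small sketch evaluates the unary and edge potentials from SVM posteriors, class priors, and attribute co-occurrence statistics, and minimises their weighted sum by brute force over a handful of binary attributes (the paper instead runs belief propagation [31] over all attributes). All input structures and names are hypothetical, and the sign convention follows the potentials as written above.

    import numpy as np
    from itertools import product

    def unary(p_posterior, p_prior):
        """psi(A = a) = -log( P(A = a | F) / P(A = a) ), from SVM posterior and prior."""
        return -np.log(p_posterior / p_prior)

    def edge(p_joint):
        """phi(A_i, A_j) = -log P(A_i, A_j), from training-set co-occurrence counts."""
        return -np.log(p_joint)

    def infer(posteriors, priors, joints, lam=0.5):
        """Brute-force minimisation of sum_i psi + lam * sum_{i<j} phi over binary labels.
        Feasible only for a few attributes; shown here purely for illustration."""
        n = len(posteriors)
        best_labels, best_cost = None, np.inf
        for labels in product([0, 1], repeat=n):
            cost = sum(unary(posteriors[i][labels[i]], priors[i][labels[i]]) for i in range(n))
            cost += lam * sum(edge(joints[(i, j)][labels[i], labels[j]])
                              for i in range(n) for j in range(i + 1, n))
            if cost < best_cost:
                best_labels, best_cost = labels, cost
        return best_labels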

5 Experiments

We have performed extensive evaluations of our system on the clothing attribute dataset.
In Section 5.1, we show that our pose-adaptive features offer better classification accu-
racies compared to the features extracted without a human pose model. The results from
section 5.2 demonstrate that the CRF improves attribute predictions since it explores
relations between the attributes. Finally, we show some potential applications that di-
rectly utilize the output of our system.

(a) Sleeve length (rows: actual; columns: predicted No sleeve / Short sleeve / Long sleeve)
No sleeve: 86.17% 10.11% 3.72%
Short sleeve: 23.40% 64.90% 11.70%
Long sleeve: 5.85% 10.11% 84.04%

(b) Neckline shape (rows: actual; columns: predicted V-shape / Round / Other style)
V-shape: 67.27% 13.45% 19.28%
Round: 24.22% 51.56% 24.22%
Other style: 24.66% 20.63% 54.71%

(c) Clothing category (rows: actual; columns: predicted Shirt / Sweater / T-shirt / Outerwear / Suit / Tank top / Dress)
Shirt: 43.56% 19.35% 1.61% 8.06% 8.06% 9.68% 9.68%
Sweater: 14.52% 54.84% 0% 16.13% 6.45% 1.61% 6.45%
T-shirt: 4.84% 1.61% 80.65% 3.23% 0% 6.44% 3.23%
Outerwear: 12.90% 6.45% 0% 61.30% 8.06% 4.84% 6.45%
Suit: 9.68% 8.06% 0% 16.13% 66.13% 0% 0%
Tank top: 4.84% 1.61% 0% 1.61% 0% 79.04% 12.90%
Dress: 4.84% 11.28% 3.23% 3.23% 3.23% 17.74% 56.45%

Fig. 6. Multi-class confusion matrices. Predictions are made using the combined feature extracted
with the pose model.

5.1 Attribute Classification


We use the chi-squared kernel for the SVM attribute classifiers since it outperforms
both the linear and the RBF kernels in our experiments. To examine the classification
accuracy, we partition the data such that each attribute has an equal number of examples
in all classes. For example, we balance the data of the wearing necktie attribute so
that it has the same number of positive and negative examples. We use leave-1-out cross
validation to report the classification accuracy for each attribute.
Fig.5 compares the SVM performances with three types of feature inputs: 1) With
pose model, using the best feature out of our 40 features; 2) With pose model, combining
the 40 features with the method described in Section 4.3; 3) No pose model, same ex-
periment settings as case 2 but with features extracted within a scaled clothing mask [4]
under the face. Note that the single best feature accuracies for the color attributes
are not displayed in Fig.5, because color attribute classifiers only use the skin-excluded
color feature, which we regard as the combined feature for color attributes.
First of all, we observe that the SVM classifiers with combined features always per-
form better than those using the single best feature. This is because the combined fea-
ture utilizes the complementary information of all extracted features. More importantly,
we are interested to know whether considering human pose helps the classification of
clothing attributes. As can be seen in Fig.5, pose-adaptive features offer improved per-
formance for most attributes, except a slight decrease in performance for 4 attributes
in colors and patterns. In particular, for those attributes like sleeve length that heav-
ily depend on human poses, we observe a significant boost in performance with pose-
adaptive features, compared to the classifier performance of using features without prior
knowledge of the human pose.
We also show the confusion matrices of the multi-class attributes in Fig.6, where
the attributes are predicted using the combined feature extracted with the pose model.
The misclassification patterns of the confusion matrices are plausible. For example, no
sleeve is misclassified more often as short sleeve than long sleeve, and tank top
is more often confused with dress than with other clothing categories.

5.2 Exploiting Attribute Relations with CRF


We apply a CRF as described in Section 4.4 to perform multi-attribute inference. As shown in Equation 1, the CRF uses the prior probability of each attribute, and the prior reflects the proportion of each attribute class in the dataset.

Fig. 7. Comparison of G-means before and after the CRF

Therefore, instead of testing
the classifier performance on the balanced data, we evaluate the classification perfor-
mance on the whole clothing attribute dataset. Since this is an unbalanced classification
problem, classification accuracy is no longer a proper evaluation metric. We use the
Geometric Mean (G-mean), which is a popular evaluation metric for unbalanced data
classification.
    G-mean = ( Π_{i=1}^{N} Ri )^(1/N)                                              (6)

where Ri is the retrieval rate for class i, and N is the number of classes.
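
Equation (6) translates directly into code; the short NumPy version below is ours and is included only for clarity.

    import numpy as np

    def g_mean(recalls):
        """Geometric mean of the per-class retrieval rates R_i, Equation (6)."""
        r = np.asarray(recalls, dtype=float)
        return float(np.prod(r) ** (1.0 / len(r)))

    # e.g. g_mean([0.9, 0.7]) is approximately 0.794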
The G-means before and after applying the CRF are shown in Fig.7. It can be seen
that the CRF offers better predictions for 19 out of 26 attributes, sometimes providing
large margins of improvement. For those attributes where the CRF fails to improve
performance, only very minor degradations are observed. Overall, better classification
performance is achieved by applying the CRF on top of our attribute classifiers. Fig.8
shows the attribute predictions of 4 images sampled from our dataset. As can be seen
in Fig.8, the CRF corrects some conflicting attributes that are generated by independent
classifiers. More clothing attribute illustration examples can be found online2.

5.3 Applications

Dressing Style Analysis. With a collection of customer photos, our system can be used
to analyze the customer's dressing style and subsequently make shopping recommen-
dations. For each attribute, the system can tell the percentage of its occurrences in the
group of photos. Intuitively, in order for an attribute class to be regarded as a personal
style, this class has to be observed much more frequently compared to the prior of the
general public. As an example, if a customer wears plaid-patterned clothing three days
a week, then wearing plaid is probably his personal style, because the frequency that he
2 https://sites.google.com/site/eccv2012clothingattribute/

Image 1: No necktie (Wear necktie), Has collar, Men's, Has placket, Low exposure, No scarf, Solid pattern, Black, Short sleeve (Long sleeve), V-shape neckline, Dress (Suit)
Image 2: Wear necktie, Has collar, Men's, Has placket, High exposure (Low exposure), No scarf, Solid pattern, Gray & black, Long sleeve, V-shape neckline, Suit
Image 3: No necktie, Has collar, Men's, Has placket, Low exposure, Wear scarf, Solid pattern, Brown & black, No sleeve (Long sleeve), V-shape neckline, Tank top (Outerwear)
Image 4: No necktie, No collar, Men's, Has placket, Low exposure, Wear scarf, Solid pattern, Blue & white, Long sleeve, Round neckline, T-shirt (Outerwear)

Fig. 8. For each image, the attribute predictions from independent classifiers are listed. Incorrect
attributes are highlighted in red. We describe the abnormal attributes generated by independent
classifiers for the above images, from left to right: man in dress, suit with high skin exposure,
no sleeves but wearing scarf, T-shirt with placket. By exploring attribute dependencies, the CRF
corrects some wrong attribute predictions. Attributes that are changed by the CRF are shown in
parentheses, and override the independent classifier result to the left.

is observed in plaid is much higher than the general public's prior of wearing plaid (the
prior of plaid is 6.4% according to our clothing attribute dataset). In our analysis, we
regard an attribute class as a dressing style if it is 20% higher than its prior. Of course,
this threshold can be flexibly adjusted based on specific application requirements.
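
A sketch of this rule (attribute frequency in the photo set compared against the dataset prior plus a 20% margin) is given below; photo_attributes and priors are hypothetical inputs and the function name is ours.

    def dressing_styles(photo_attributes, priors, margin=0.20):
        """Return the attribute classes whose frequency in the photo collection
        exceeds the general-public prior by at least the chosen margin."""
        n = float(len(photo_attributes))
        return [name for name, prior in priors.items()
                if sum(name in attrs for attrs in photo_attributes) / n >= prior + margin]

    # Hypothetical usage: photo_attributes = [{"plaid", "long sleeve"}, {"plaid"}, ...],
    # priors = {"plaid": 0.064, ...}; dressing_styles(photo_attributes, priors) -> ["plaid"]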
We perform dressing style analysis on both personal and event photos, as shown in
Fig.9. Firstly, we analyze the dressing style of Steve Jobs, who is well known for wear-
ing his black turtlenecks. Using 35 photos of Jobs from Flickr, our system summarizes
his dressing style as solid pattern, mens clothing, black color, long sleeves, round
neckline, outerwear, wear scarf. Most of the styles predicted by our system are very
sensible. The wrong inferences of outerwear and wearing scarf are not particularly
surprising, since Steve Jobs' high-collar turtlenecks share visual similarities with
outerwear and with scarves. Similarly, our system predicts the dressing style of
Mark Zuckerberg as gray, brown, T-shirt, outerwear, which is in agreement with
his actual dressing style.
Apart from personal clothing style analysis, we can also analyze the dressing style for
events. We downloaded 54 western-style wedding photos from Flickr, and consider the
dressing styles for men and women using the gender attribute predicted by our sys-
tem. The dressing style for men at weddings is predicted as: solid pattern, suit, long-
sleeves, V-shape neckline, wearing necktie, wear scarf, has collar, has placket,
while the dressing style for women is high skin exposure, no sleeves, dress, other
neckline shapes (i.e. neither v-shape nor round), white, >2 colors, floral pattern.
Most of these predictions agree well with the dressing style of weddings, except wear-
ing scarf for men, and >2 colors and floral pattern for women. These abnormal
predictions can be understood by considering the similarity between men's scarves and
neckties, as well as the visual confusion caused by including the wedding flowers in women's
hands when describing the color and pattern of their dresses.

Personal dressing styles. Steve Jobs: solid pattern, men's, black, long sleeves, round neckline, outerwear, wear scarf. Mark Zuckerberg: gray / brown, T-shirt / outerwear.
Event dressing styles. Wedding (men): solid pattern, suit, long sleeves, V-shape neckline, wear necktie, wear scarf, has collar, has placket. Wedding (women): high skin exposure, no sleeves, dress, other neckline shapes, white, >2 colors, floral pattern. Basketball: no sleeves, high skin exposure, other neckline shapes, tank top.

Fig. 9. With a set of person or event photos (sample images shown above), our system infers the
dressing style of a person or an event. Most of our predicted dressing styles are quite reasonable.
The wrong predictions are highlighted in red.

Similar analysis was done to predict the clothing style of the event of basketball. Using 61 NBA photos, the
basketball players' dressing style is predicted as no sleeves, high skin exposure, other
neckline shapes, tank top.

Gender Classification. Although some clothes are unisex in style, the visual ap-
pearance of clothes often carries significant information about gender. For example,
men are less likely to wear floral-patterned clothes, and women rarely wear a tie. Mo-
tivated by this observation, we are interested in combining the gender prediction from
clothing information with traditional gender recognition algorithms that use facial fea-
tures. We adopt the publicly available gender-from-face implementation of [32], which
outputs gender probabilities by projecting aligned faces in the Fisherface space. For the
clothing-based gender classification, we use the graph potential in Equation 5 as the
penalty score for assigning gender to a testing instance. The male and female predic-
tion penalties are simply given by the CRF potentials, by assigning male or female to
the gender node, while keeping all other nodes unchanged. An RBF-kernel SVM is
trained to give gender predictions that combine both the gender probabilities from faces
and penalties from clothing.
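
A rough sketch of the fusion step is shown below, assuming the face-based male probability and the two CRF penalties are already available for each person; the feature layout and names are our own illustration, not the authors' exact setup.

    import numpy as np
    from sklearn.svm import SVC

    def gender_descriptor(p_face_male, penalty_male, penalty_female):
        """Fuse cues: face-based male probability plus the clothing penalty gap
        (graph potential with the gender node clamped to female minus male)."""
        return np.array([p_face_male, penalty_female - penalty_male])

    # Hypothetical training: one descriptor per person, labels 1 = male, 0 = female.
    # X = np.vstack([gender_descriptor(p, m, f) for p, m, f in training_cues])
    # clf = SVC(kernel="rbf").fit(X, y)   # an RBF-kernel SVM, as in the paper
    # pred = clf.predict(gender_descriptor(0.6, 12.3, 14.1).reshape(1, -1))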
We evaluate the gender classification algorithms on our clothing attribute dataset. As
shown in Table 3, the combined gender classification algorithm offers a better performance than that of using each feature alone.

Table 3. Performance of gender classification algorithms


G-mean
Face only 0.715
Clothing only 0.810
Face + clothing 0.849

Interestingly, clothing-only gender recognition
outperforms face-only gender recognition.

6 Conclusions

We propose a fully automated system that describes the clothing appearance with se-
mantic attributes. Our system demonstrates superior performance on unconstrained im-
ages, by incorporating a human pose model during the feature extraction stage, as well
as by modeling the rules of clothing style by observing co-occurrences of the attributes.
We also show novel applications where our clothing attributes can be directly utilized.
In the future, we expect further improvements to our system, by employ-
ing the (almost ground truth) human pose estimated by Kinect sensors [23].

References
1. Kumar, N., Belhumeur, P.N., Nayar, S.K.: FaceTracer: A Search Engine for Large Collec-
tions of Images with Faces. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part
IV. LNCS, vol. 5305, pp. 340–353. Springer, Heidelberg (2008)
2. Anguelov, D., Lee, K., Gokturk, S.B., Sumengen, B.: Contextual identity recognition in per-
sonal photo albums. In: CVPR (2007)
3. Lin, D., Kapoor, A., Hua, G., Baker, S.: Joint People, Event, and Location Recognition in
Personal Photo Collections Using Cross-Domain Context. In: Daniilidis, K., Maragos, P.,
Paragios, N. (eds.) ECCV 2010, Part I. LNCS, vol. 6311, pp. 243–256. Springer, Heidelberg
(2010)
4. Gallagher, A.C., Chen, T.: Clothing cosegmentation for recognizing people. In: CVPR (2008)
5. Cao, L., Dikmen, M., Fu, Y., Huang, T.S.: Gender recognition from body. ACM Multimedia
(2008)
6. Bourdev, L., Maji, S., Malik, J.: Describing people: Poselet-based attribute classification. In:
ICCV (2011)
7. Eichner, M., Marin-Jimenez, M., Zisserman, A., Ferrari, V.: Articulated human pose estima-
tion and search in (almost) unconstrained still images. Technical Report 272, ETH Zurich,
D-ITET, BIWI (2010)
8. Ferrari, V., Zisserman, A.: Learning visual attributes. In: NIPS (2007)
9. Farhadi, A., Endres, I., Hoiem, D., Forsyth, D.: Describing objects by their attributes. In:
CVPR (2009)
10. Russakovsky, O., Fei-Fei, L.: Attribute learning in large-scale datasets. In: ECCV (2010)
11. Parikh, D., Grauman, K.: Interactively building a discriminative vocabulary of nameable
attributes. In: CVPR (2011)
12. Kumar, N., Berg, A.C., Belhumeur, P.N., Nayar, S.K.: Attribute and simile classifiers for face
verification. In: ICCV (2009)
13. Siddiquie, B., Feris, R.S., Davis, L.S.: Image ranking and retrieval based on multi-attribute
queries. In: CVPR (2011)
14. Berg, T.L., Berg, A.C., Shih, J.: Automatic Attribute Discovery and Characterization from
Noisy Web Data. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part I.
LNCS, vol. 6311, pp. 663–676. Springer, Heidelberg (2010)
15. Kulkarni, G., Premraj, V., Dhar, S., Li, S., Berg, A., Choi, Y., Berg, T.: Baby talk: Under-
standing and generating image descriptions. In: CVPR (2011)
16. Farhadi, A., Hejrati, M., Sadeghi, M.A., Young, P., Rashtchian, C., Hockenmaier, J.,
Forsyth, D.: Every Picture Tells a Story: Generating Sentences from Images. In: Daniilidis,
K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part IV. LNCS, vol. 6314, pp. 15–29.
Springer, Heidelberg (2010)
17. Bourdev, L., Malik, J.: Poselets: Body part detectors trained using 3d human pose annota-
tions. In: ICCV (2009)
18. Song, Z., Wang, M., Hua, X., Yan, S.: Predicting occupation via human clothing and con-
texts. In: ICCV (2011)
19. Yang, M., Yu, K.: Real-time clothing recognition in surveillance videos. In: ICIP (2011)
20. Zhang, W., Begole, B., Chu, M., Liu, J., Yee, N.: Real-time clothes comparison based on
multi-view vision. In: ICDSC (2008)
21. Felzenszwalb, P., Girshick, R., McAllester, D., Ramanan, D.: Object detection with discrim-
inatively trained part based models. PAMI (2009)
22. Yang, Y., Ramanan, D.: Articulated pose estimation with flexible mixtures-of-parts. In:
CVPR (2011)
23. Shotton, J., Fitzgibbon, A., Cook, M., Blake, A.: Real-time human pose recognition in parts
from single depth images. In: CVPR (2011)
24. Viola, P., Jones, M.: Robust real-time object detection. IJCV (2001)
25. Rother, C., Kolmogorov, V., Blake, A.: Grabcut - interactive foreground extraction using
iterated graph cuts. In: SIGGRAPH (2004)
26. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV (2004)
27. Varma, M., Zisserman, A.: A statistical approach to texture classification from single images.
IJCV (2005)
28. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for
recognizing natural scene categories. In: CVPR (2006)
29. Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines. ACM Trans. on
Intel. Sys. and Tech. (2011)
30. Xiao, J., Hays, J., Ehinger, K.A., Torralba, A., Oliva, A.: Sun database: Large scale scene
recognition from abbey to zoo. In: CVPR (2010)
31. Tappen, M.F., Freeman, W.T.: Comparison of graph cuts with belief propagation for stereo,
using identical mrf parameters. In: ICCV (2003)
32. Gallagher, A.C., Chen, T.: Understanding images of groups of people. In: CVPR (2009)
Graph Matching via Sequential Monte Carlo

Yumin Suh1, Minsu Cho2, and Kyoung Mu Lee1

1 Department of EECS, ASRI, Seoul National University, Seoul, Korea
2 INRIA - WILLOW / Ecole Normale Superieure, Paris, France

Abstract. Graph matching is a powerful tool for computer vision and
machine learning. In this paper, a novel approach to graph matching
is developed based on the sequential Monte Carlo framework. By con-
structing a sequence of intermediate target distributions, the proposed
algorithm sequentially performs a sampling and importance resampling
to maximize the graph matching objective. Through the sequential sam-
pling procedure, the algorithm effectively collects potential matches under
one-to-one matching constraints to avoid the adverse effect of outliers
and deformation. Experimental evaluations on synthetic graphs and real
images demonstrate its higher robustness to deformation and outliers.

Keywords: graph matching, sequential Monte Carlo, feature correspondence, image matching, object recognition.

1 Introduction
Graphs provide excellent means to represent features and the structural rela-
tionship between them, and the problem of feature correspondence is effectively
solved by graph matching. Therefore, graph matching has gained high popular-
ity in various applications in computer vision and machine learning (e.g., object
recognition [1], shape matching [2], and feature tracking [3]). Due to the NP-hard
nature of graph matching in general, its optimal solution is virtually unachiev-
able so that a myriad of algorithms are proposed to approximate it. In recent
computer vision literature, the graph matching problem is widely formulated
as the Integer Quadratic Program (IQP) [4,5,6,7] that explicitly considers both
local appearances and generic pairwise relations. This line of graph matching
research has further extended to hyper-graph matching [8,9], learning tech-
niques [10,11], and its progressive framework [12]. The existing methods, how-
ever, still suffer from severe distortion of the graphs (e.g., addition of outliers in
nodes or deformation of attributes). Robustness to such distortion is of particu-
lar importance in practical vision problems because real-world image matching
problems are widely exposed to background clutter and image variation.
In this paper, a robust graph matching algorithm based on the Sequential
Monte Carlo (SMC) framework [13] is proposed. Unlike previous methods, the
algorithm sequentially samples potential matches via proposal distributions to
maximize the graph matching objective in a robust and eective manner. To
treat the graph matching problem in the SMC framework, a sequence of artificial
target distributions is designed, which provides smooth transitions from an

initial distribution to the final target distribution as the graph matching objec-
tive. The proper design of an importance distribution and a resampling scheme
allows the algorithm to explore the solution space efficiently under the matching
constraints. Consequently, the proposed algorithm achieves higher robustness to
outliers and deformation than other graph matching algorithms.
Among numerous graph matching methods, the algorithms closely related to
ours are as follows. Maciel and Costeira [14] cast a graph matching problem
as an integer constrained minimization, which is popularly adopted for its gen-
erality in recent studies including ours. Many of the subsequent approaches to
the IQP formulation handle the original problem by solving a relaxed problem
and then enforcing the matching constraint on its solution. For example, Gold
and Rangarajan [15] relaxed the IQP by dropping the integer constraint and
iteratively estimated the convex approximations around the previous solution.
Leordeanu et al. [4] and Cour et al. [16] proposed spectral types of relaxation
followed by a discretization process. Torresani et al. [17] used dual decomposition
technique for graph matching optimization. Cho et al. [6] introduced a random
walk algorithm to explore the relaxed solution space in the consideration of the
matching constraints. More recently, to overcome the limitation of pair-wise re-
lations and embed higher-order information, hyper-graph matching techniques
were proposed based on the generalization of the IQP [8,9].
There have been many sampling-based algorithms for feature correspondence
and shape matching in literature. Dellaert et al. [18] used the Metropolis-Hastings
(MH) scheme to solve a correspondence problem in a Bayesian framework. Cao
et al. [2] used the MH scheme for partial shape matching. Tamminen et al. [19]
used the SMC framework to solve Bayesian matching of objects with occlusion.
Lu et al. [20] and Yang et al. [21] proposed a shape detection algorithm based on
the particle lter with static observations. However, all these methods for shape
matching problems cannot be applied to graph matching in general. Lee et al. [7]
introduced the data-driven Markov chain Monte Carlo (DDMCMC) algorithm
for graph matching which adopts spectral relaxation for data-driven proposals.
To our best knowledge, our work is the rst attempt to use the SMC framework
for the matching of general graphs.

2 Problem Definition

We consider two attributed graphs, G^P = (V^P, E^P, A^P) and G^Q = (V^Q, E^Q, A^Q), where V, E, and A denote a set of nodes, edges, and attributes, respectively. In visual feature matching problems, the attribute a_i^P ∈ A^P of node v_i^P ∈ V^P usually describes a local appearance at feature i in image P, and the attribute a_{ij}^P ∈ A^P of edge e_{ij}^P ∈ E^P represents the geometric relationship between features i and j in image P. In this paper, one-to-one matching constraints are adopted: every node in G^P is mapped to at most one node in G^Q and vice versa. Then, graph matching is to find node correspondences between G^P and G^Q which best preserve the attribute relations under the matching constraints. A graph matching solution M is represented by a set of assignments or matches (v_i^P, v_a^Q) between nodes v_i^P ∈ V^P and v_a^Q ∈ V^Q. M can be alternatively expressed by an assignment matrix X ∈ {0,1}^{n^P × n^Q}, where n^P and n^Q are the numbers of nodes in G^P and G^Q, respectively; X_{i,a} = 1 if (v_i^P, v_a^Q) ∈ M and zero otherwise. An assignment vector x denotes the column-wise concatenated vector of X, and the element of x corresponding to X_{i,a} is represented by x(i,a). The affinity matrix W consists of the relational similarity values between edges and nodes; the compatibility of two edge attributes a_{ij}^P and a_{ab}^Q, or equivalently of two matches (v_i^P, v_a^Q) and (v_j^P, v_b^Q), is encoded in the non-diagonal component W_{i,a;j,b}, and the compatibility of two node attributes a_i^P and a_a^Q is encoded in the diagonal component W_{i,a;i,a}. The graph matching score of x is evaluated by x^T W x, which corresponds to the sum of all similarity values covered by the matching. Therefore, graph matching is mathematically formulated as the following IQP problem [4,7,6]:

x* = argmax_x (x^T W x)
s.t.  x ∈ {0,1}^{n^P n^Q},  ∀i, Σ_a x(i,a) ≤ 1,  ∀a, Σ_i x(i,a) ≤ 1.   (1)
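As a concrete illustration of Eq. (1), the following minimal Python sketch (ours, not the authors' code) builds the column-wise assignment vector x from a list of matches and evaluates the matching score x^T W x; the helper names are our own.

import numpy as np

def assignment_vector(matches, nP, nQ):
    # Column-wise concatenation of the assignment matrix X in {0,1}^(nP x nQ).
    x = np.zeros(nP * nQ)
    for i, a in matches:          # match (v_i^P, v_a^Q)
        x[a * nP + i] = 1.0       # element x(i, a), i.e., entry X[i, a]
    return x

def matching_score(matches, W, nP, nQ):
    x = assignment_vector(matches, nP, nQ)
    return float(x @ W @ x)       # the IQP objective x^T W x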

3 Sequential Monte Carlo for Graph Matching


In this paper, graph matching is formulated as a sequential sampling of potential matches based on the SMC framework. SMC methods (e.g., the particle filter [22]) are widely used for dynamic problems in computer vision, such as object tracking and structure from motion. The basic concept of SMC is to sequentially estimate the target probability distribution π_t at time t with a set of particles, that is, a cloud of weighted random samples {(x_t^{(i)}, w_t^{(i)})}_{i=1}^{N}. A weight w_t^{(i)} represents the importance of the associated sample x_t^{(i)} [23].

The samples x_t^{(i)} at time t are generated from the importance distribution η_t, which is constructed by a Markov transition kernel K and the previous distribution π_{t-1} [24]; each sample x_t^{(i)} evolves from x_{t-1}^{(i)} according to the kernel K, and the importance distribution η_t is represented by

η_t(x_t) = Σ_{x_{t-1}} π_{t-1}(x_{t-1}) K(x_{t-1}, x_t).   (2)

The weight w_t^{(i)} of each sample x_t^{(i)} is proportional to the ratio of the target distribution π_t to the importance distribution η_t, and is defined as

W_t^{(i)} = π_t(x_t^{(i)}) / η_t(x_t^{(i)}),   w_t^{(i)} = W_t^{(i)} / Σ_{i=1}^{N} W_t^{(i)}.   (3)

Then, the particles as weighted samples asymptotically converge to the target distribution:

π̂_t(x_t) = Σ_{i=1}^{N} w_t^{(i)} δ_{x_t^{(i)}}(x_t) → π_t(x_t)  almost surely as N → ∞,   (4)

where δ denotes the Dirac delta function.
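The following generic sampling-importance-resampling (SIR) sketch in Python illustrates how Eqs. (3)-(4) are used in practice; target_pdf and proposal_pdf are hypothetical callables returning unnormalized densities, not objects defined in the paper.

import numpy as np

def sir_step(samples, target_pdf, proposal_pdf, rng):
    # Importance weights: ratio of target to importance density, Eq. (3).
    W = np.array([target_pdf(x) / proposal_pdf(x) for x in samples])
    w = W / W.sum()
    # Resampling; the resulting particle cloud approximates the target, Eq. (4).
    idx = rng.choice(len(samples), size=len(samples), p=w)
    return [samples[i] for i in idx]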



Fig. 1. The main idea of the proposed algorithm. The algorithm sequentially samples potential matches via proposal distributions to maximize the graph matching objective under the matching constraints. In every time step, new matches are sampled based on the affinity with the previously sampled matches.

Unlike typical SMC methods for dynamic problems [22], we do not assume any prior knowledge about the graph for generality, except for the affinity matrix. In this case, it is intractable to calculate the conditional distribution even up to a normalizing constant, and thus direct sequential sampling from conditional distributions is also intractable.
To formulate graph matching in the SMC framework, we introduce a sequence of intermediate distributions [24], which starts from a simple initial distribution and gradually moves toward the final target distribution equivalent to the objective function of (1). This SMC procedure performs graph matching by sequentially re-distributing particles over the evolving sample space according to the intermediate target distribution at each step. The procedure is illustrated in Fig. 2, and the details are described in the following subsections.

3.1 Sequential Target Distribution


In SMC graph matching, each particle incrementally collects a new match at each time step, so that each particle x_t^{(i)} at time step t always has t matches. The maximum number of matches in the solution (or the maximum time step in the sampling) is L = min(n^P, n^Q). We design the final target distribution as

π_L(x_L) ∝ exp(x_L^T W x_L / β),   (5)

where β is a scaling constant. Since the exponential function increases monotonically, the optimal solution of the graph matching score x^T W x in Eq. (1) maximizes the probability π_L of Eq. (5). Since x_L has the high dimensionality of n^P n^Q, direct sampling from π_L is not tractable in general cases. As illustrated in Fig. 2(a), a sequence of intermediate target distributions {π_t}_{t=1}^{L-1} is introduced for the SMC sampling as follows:

π_t(x_t) ∝ exp(x_t^T W x_t / β),   t = 1, ..., L-1.   (6)

A feasible set of the graph matching problem, or a set of all possible x satisfying the constraints of Eq. (1), is denoted by X. Each intermediate target distribution π_t in Eq. (6) is then defined on the subset X_t = {x ∈ X | Σ_{i,a} x(i,a) = t}. The more true matches a particle x_t^{(i)} includes, the higher the probability π_t(x_t^{(i)}) it has.
Fig. 2. Overview of the proposed method. (a) A sequence of intermediate target distributions from a simple initial distribution to the final target distribution is constructed. (b) Transition from time step t-1 to t is performed based on the sampling and importance resampling (SIR) technique, exploiting the particles from the previous distribution. (c) An example of the neighborhood relation between two candidate matchings: they share two common matches and differ in one match.

Thus, the sequence of distributions {π_t}_{t=1}^{L-1} effectively drives the particles toward the final target distribution π_L.
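In code, the (unnormalized) intermediate and final target densities of Eqs. (5)-(6) share one form; the sketch below assumes a binary assignment vector x with exactly t ones and uses β for the scaling constant, following the notation above.

import numpy as np

def target_unnormalized(x, W, beta):
    # pi_t(x) up to normalization: exponential of the partial matching score.
    return float(np.exp(x @ W @ x / beta))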

3.2 Sequential Sampling

In order to sample from the intermediate target distribution π_t, new particles are propagated from the previous particles {x_{t-1}^{(i)}}_{i=1}^{N} using the sampling and importance resampling (SIR) technique [22]. In each time step, each particle additionally selects a new match m in a probabilistic manner according to the transition kernel

K(x_{t-1}, x_{t-1} + e_m) = (1/Z) exp(e_m^T W x_{t-1} / β)  if (x_{t-1} + e_m) ∈ X_t,  and 0 otherwise,   (7)

where e_m is a unit vector that corresponds to the match m and Z is a normalizing constant. In selecting the new match m, the kernel reflects both the matching constraints and the affinity relations with the previously sampled matches in x_{t-1}. As the time step increases, previously selected inlier matches boost the sampling probability of other inlier matches for each particle. The augmented particle in this step is denoted by x̃_t = x_{t-1} + e_m. Given that the particles at time step t-1 follow the distribution π_{t-1}, the transition leads them to a new importance distribution
η_t(x_t) = Σ_{x_{t-1}} π_{t-1}(x_{t-1}) K(x_{t-1}, x_t) ≈ (1/N) Σ_{i=1}^{N} K(x_{t-1}^{(i)}, x_t).   (8)

For further discussion, let us denote the particles after the transition by {x̃_t^{(i)}}_{i=1}^{N}. The particles with equal matches are grouped together to form a set of n unique assignment vectors {g_t^{(j)}}_{j=1}^{n}, and the number of particles corresponding to the j-th group is denoted by n_j. Then, the weight of a particle is approximated as follows:

W_t^{(i)} = π_t(x̃_t^{(i)}) / η_t(x̃_t^{(i)}) ≈ π_t(x̃_t^{(i)}) / ( (1/N) Σ_{j=1}^{n} n_j δ(x̃_t^{(i)} − g_t^{(j)}) ) ∝ π_t(x̃_t^{(i)}) / n_{I(i)},   (9)

where I(i) is the group ID of the particle x̃_t^{(i)}. Therefore, to calculate the weights, particles with the same matches should first be grouped together, as detailed in Sec. 3.4. Then, the weights are easily obtained by computing π_t(x̃_t^{(i)}). Finally, the resampled particles based on these weights, {x_t^{(i)}}_{i=1}^{N}, follow the intermediate target distribution π_t.
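A minimal sketch of one SMC step under our reading of this section is given below: each particle (a frozenset of candidate-match indices m = a*nP + i) draws one new match from the kernel of Eq. (7), identical particles are grouped, and weights following Eq. (9) drive the resampling. All helper names are ours, and corner cases (e.g., particles with no feasible remaining match) are omitted.

import numpy as np
from collections import Counter

def conflicts(particle, m, nP):
    # One-to-one constraints: the new match may not reuse a node on either side.
    i, a = m % nP, m // nP
    return any(mm % nP == i or mm // nP == a for mm in particle)

def propagate(particles, W, nP, beta, rng):
    # Transition kernel of Eq. (7): probability proportional to exp(e_m^T W x / beta).
    new_particles = []
    for p in particles:
        x = np.zeros(W.shape[0]); x[list(p)] = 1.0
        logits = W @ x / beta
        probs = np.exp(logits - logits.max())
        for m in range(len(probs)):
            if m in p or conflicts(p, m, nP):
                probs[m] = 0.0
        probs /= probs.sum()
        m_new = int(rng.choice(len(probs), p=probs))
        new_particles.append(p | {m_new})
    return new_particles

def resample(particles, W, beta, rng):
    # Grouping of identical particles and weights of Eq. (9), then resampling.
    counts = Counter(particles)
    scores = []
    for p in particles:
        x = np.zeros(W.shape[0]); x[list(p)] = 1.0
        scores.append(float(x @ W @ x) / beta)
    scores = np.array(scores)
    w = np.exp(scores - scores.max()) / np.array([counts[p] for p in particles])
    w /= w.sum()
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return [particles[i] for i in idx]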

3.3 Particle Updating


To correct false matches sampled in early steps, we update the particles after sampling at every time step. As shown in Fig. 2(c), the particle updating spreads the current particle to a neighborhood support sharing t-1 matches by changing the most unreliable match to a better match. Among the matches in the particle x_t^{(i)}, the match sharing the lowest affinity values with the other matches is selected by

m_w = argmin_{m : e_m^T x_t^{(i)} = 1}  e_m^T W x_t^{(i)}.   (10)

The selected match m_w is replaced with a new match sampled by the transition kernel K(x_t^{(i)} − e_{m_w}, x_t^{(i)}) of Eq. (7). After repeating the grouping and resampling process of Sec. 3.2, the updated particles {x_t^{(i)}}_{i=1}^{N} are obtained. In this updating procedure, while following the intermediate target distribution, the particles are redistributed into the support with high probability under the target distribution.
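A small sketch of the selection rule in Eq. (10), using the same particle representation as in the propagation sketch above (a set of candidate-match indices); the replacement match would then be redrawn from the kernel of Eq. (7).

import numpy as np

def weakest_match(particle, W):
    # e_m^T W x for each match m currently in the particle, Eq. (10).
    x = np.zeros(W.shape[0]); x[list(particle)] = 1.0
    members = list(particle)
    support = [float(W[m] @ x) for m in members]
    return members[int(np.argmin(support))]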

3.4 Implementation Details


The proposed algorithm is summarized in Algorithm 1. Some implementation issues are addressed in the following.

Algorithm 1: SMC graph matching
  input : affinity matrix W
  output: assignment vector x*
  // Initialization
  Sample particles from the initial distribution π_1;
  // SMC loop
  while sampling available do
      Propagate each particle by sampling a new match according to Eq. (7);
      Resample the particles according to the weights of Eq. (9);
      Update the particles according to Sec. 3.3;
  end
  Choose a particle x* with the highest matching score as the final solution;

Computational Complexity. Each step of the procedure is composed of transition, grouping, resampling, and updating. The computational complexity of each sub-procedure is as follows: transition O(N n_cm + G_t n_cm + G_t t) and grouping O(N t), where n_cm is the number of candidate matches, N the number of particles, and G_t the number of unique groups at time step t (G_t ≤ N, t ≤ L ≤ n_cm). Therefore, the total complexity of one step is O(N n_cm), and the entire algorithm takes O(n_cm L N), where L denotes the number of sampling steps.
Initial Distribution. The initial distribution is calculated as the marginal of the intermediate target distribution π_2. Utilizing the symmetric property of the affinity matrix, the initial distribution π_1 is obtained as

π_1(x_1) ∝ 1^T exp(W/β) x_1,   (11)

where exp(W/β) denotes the element-wise exponential of (W/β) and 1 is a column vector of all ones.
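A sketch of drawing single-match particles from Eq. (11); β is the scaling constant in our notation, and each particle is represented by the index of its single candidate match.

import numpy as np

def sample_initial_particles(W, beta, N, rng):
    # pi_1(e_m) is proportional to the column sum 1^T exp(W/beta) e_m.
    probs = np.exp(W / beta).sum(axis=0)
    probs = probs / probs.sum()
    matches = rng.choice(W.shape[0], size=N, p=probs)
    return [frozenset([int(m)]) for m in matches]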
Transition Kernel. In order to reduce the computational and memory cost, the transition kernel K is computed in an approximated version in which the lowest-valued entries are forced to 0 and only a fixed fraction (the truncation ratio) of the largest entries is retained. Experimental results for different truncation ratios are shown on the synthetic random graph matching task in Sec. 4.3.
Grouping. Grouping of particles can be performed efficiently by using radix sort and a simple trick. First, a unique ID is assigned to every candidate match. Then each particle is represented by a combination of IDs. The IDs in each combination are sorted so that the sequence of IDs forms an N-ary t-digit number with the i-th digit corresponding to the i-th smallest ID. Now, after N-ary radix sort, the particles can be efficiently grouped in O(N t) (N: the number of particles, t: match size) by comparing only adjacent neighbors.
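The grouping step can be sketched as follows; a hash map over sorted ID tuples stands in for the radix-sort-based grouping described above.

from collections import Counter

def group_particles(particles):
    # Identical particles collapse to the same sorted-ID key.
    keys = [tuple(sorted(p)) for p in particles]
    return Counter(keys)          # group key -> number of particles n_j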
Final Solution. The SMC loop is terminated simply when no more sampling is possible while keeping the constraints. Then, a particle maximizing the score of Eq. (1) is selected as the final solution. Since the matches of each particle already satisfy the one-to-one matching constraints, no additional post-processing for discretization is needed in our approach.

4 Experiments
In this section, two sets of experiments were performed for quantitative evaluation: (1) synthetically generated random graph matching and (2) feature matching using real images. The proposed algorithm (SMCM) was compared with seven state-of-the-art methods including SM [4], SMAC [16], PM [25], IPFP [5], GAGM [15], RRWM [6], and DDMCM [7]. The authors' implementations were used for SM, SMAC1, PM2, IPFP3, DDMCM, and RRWM4. For GAGM, the implementation by Cour et al. [16] was adopted. All methods were implemented
in MATLAB and were tested on a 2.40 GHz Core2 Quad desktop PC. In each experiment, the same affinity matrix was shared by all algorithms, and the kernel truncation ratio is fixed to 0.1 in SMCM. In addition to the two experiments, experimental analyses of SMCM are provided in Sec. 4.3.

4.1 Synthetic Random Graph Matching

The random graph matching experiments were performed following the experimental protocol of [16,15,6]. In this experiment, the Hungarian algorithm [26] was used for the post-processing of discretization, and the scaling constant β = 2 is used for SMCM.
Two graphs, G^P and G^Q, were constructed at each trial with n = n_in + n_out nodes, where n_in and n_out denote the numbers of inlier and outlier nodes, respectively. The reference graph G^P was generated with random edges, where each edge (i,j) ∈ E^P was assigned a random attribute a_{ij}^P distributed uniformly in [0,1]. A perturbed graph G^Q was then created by adding noise on the edge attributes between inlier nodes: a_{p(i)p(j)}^Q = a_{ij}^P + ε, where p(·) is a random permutation function. The deformation noise ε was drawn from the Gaussian distribution N(0, σ²). All the other edges, connecting at least one of the outlier nodes, were generated with random attributes in the same way as for G^P. Thus, G^P and G^Q share a common but perturbed subgraph of size n_in. The affinity matrix W was computed by W_{i,a;j,b} = exp(−|a_{ij}^P − a_{ab}^Q|² / σ_s²) for e_{ij} ∈ E^P, e_{ab} ∈ E^Q, and the diagonal terms are set to zero. The scaling factor σ_s² was chosen to be 0.1 to perform best over all algorithms. The accuracy was computed as the ratio of the number of true positive matches to that of ground truths. The objective score was computed by the IQP objective x^T W x.
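For reference, a Python sketch of this synthetic protocol under our reading (complete graphs with symmetric random attributes; helper names are ours):

import numpy as np

def make_problem(n_in=20, n_out=0, sigma=0.1, s2=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n = n_in + n_out
    A_P = rng.uniform(0.0, 1.0, (n, n)); A_P = (A_P + A_P.T) / 2
    A_Q = rng.uniform(0.0, 1.0, (n, n)); A_Q = (A_Q + A_Q.T) / 2
    perm = rng.permutation(n)
    # Copy the inlier subgraph of G^P into G^Q with additive Gaussian deformation.
    for i in range(n_in):
        for j in range(i + 1, n_in):
            val = A_P[i, j] + rng.normal(0.0, sigma)
            A_Q[perm[i], perm[j]] = val
            A_Q[perm[j], perm[i]] = val
    # Affinity between candidate matches (i -> a) and (j -> b); diagonal left at zero.
    W = np.zeros((n * n, n * n))
    for i in range(n):
        for a in range(n):
            for j in range(n):
                for b in range(n):
                    if i != j and a != b:
                        W[a * n + i, b * n + j] = np.exp(-(A_P[i, j] - A_Q[a, b]) ** 2 / s2)
    return W, perm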
In the experimental setting, there are two independent variables: the number of outliers n_out and the amount of deformation noise σ. Using these, three experiments with different conditions were conducted: (1) varying the amount of deformation noise with no outliers, (2) varying the number of outliers with no deformation, and (3) varying the number of outliers with some deformation. Each experiment was repeated 100 times with different matching problems generated from the above procedure, and the average accuracy and objective score were recorded. In all experiments, the number of inliers n_in is fixed to 20. Figure 3(a) presents the results of varying the deformation noise σ from 0 to 0.3 in increments of 0.05 while fixing the number of outliers n_out = 0 and the number of particles N = 10000. Figure 3(b) shows the result of varying the number of outliers n_out from 0 to 20 in increments of two while fixing N = 2000 and the deformation noise σ = 0.
1 http://www.seas.upenn.edu/~timothee/
2 http://www.cs.huji.ac.il/~zass/
3 http://86.34.14.245/code.php
4 http://cv.snu.ac.kr/research/~RRWM/
Fig. 3. Synthetic graph matching experiments. The accuracies (top), objective scores (middle), and computational times (bottom) are plotted for three different conditions: (a) varying deformation noise with no outliers, (b) varying the number of outliers with no deformation noise, and (c) varying the number of outliers with some deformation noise.

Figure 3(c) presents the result of varying the number of outliers n_out from 0 to 20 in increments of two while fixing N = 5000 and the deformation noise σ = 0.1.
As observed in the score plots (middle) in Fig. 3, SMCM consistently outperforms all the other algorithms in the objective score, regardless of the degree of deformation and outliers; SMCM performs best as an optimizer of the graph matching score. In particular, the accuracy of SMCM significantly exceeds those of the other algorithms under increasing outliers, as shown in Figs. 3(b) and 3(c). This implies that, within the SMC framework, SMCM effectively avoids the adverse effects of outliers through sequential collection of inliers under the matching constraints. In the deformation experiment of Fig. 3(a), the accuracy of SMCM is comparable or slightly inferior to the best performer, RRWM. However, SMCM outperforms all other methods in realistic conditions where both outliers and deformation exist, as in Fig. 3(c). The bottom row of Fig. 3 shows the computational time.
[Fig. 4 panels: image pairs with correct-match counts SMCM 27/27, PM 9/27, SM 13/27, GAGM 25/27, RRWM 27/27; SMCM 16/20, SMAC 11/20, IPFP 6/20, GAGM 6/20, RRWM 5/20; and SMCM 36/40, SMCM 67/128, SMCM 12/13.]

Fig. 4. Some representative results of the real image matching experiments. True
positive matches are represented by blue lines and false positive matches are represented
by black lines.

4.2 Real Image Matching


In this experiment, we performed real image matching on the benchmark image dataset5 used in [6]. The dataset consists of 30 image pairs containing photos of various objects, most of which are collected from the Caltech-101 and MSRC datasets, and provides the detected MSER features [27], the initial matches, and the manually-labeled ground truth matches. For each image pair, two graphs were constructed by considering the features as nodes. To sparsify the affinity matrix, only the similarity values related to candidate correspondences are assigned in the affinity matrix. In our experiments, about 70 nodes and 720 candidate matches per image pair are used on average. The affinity matrix W_{i,a;j,b} = exp(−d_{i,a;j,b}/σ_s²) is computed using the symmetric transfer error d_{i,a;j,b}, which is used in [6], as the measure of affine-invariant dissimilarity between two feature pairs, (v_i^P, v_j^P) and (v_a^Q, v_b^Q). The scaling parameter σ_s² was set to 10 to enhance the overall accuracy over all algorithms. In this experiment, the greedy mapping was used for the post-processing of discretization, and the scaling constant β = 10 and 500 particles are used for SMCM. The matching accuracy and the running time were measured for SM, RRWM, PM, IPFP, SMAC, GAGM, and SMCM.6 The results are summarized in Table 1, and some examples are shown in Fig. 4. SMCM records the highest accuracy with an acceptable running time. Its robustness to outliers and deformations is observed again in this real image matching experiment.

5 The dataset is available from http://cv.snu.ac.kr/research/~RRWM/

Table 1. Average performance on the real images

Algorithm      SM    RRWM   PM    IPFP   SMAC   GAGM   SMCM
Accuracy (%)   65.3  69.0   60.5  65.7   50.8   67.6   69.6
Time (sec.)    0.02  1.00   0.05  0.13   0.03   1.67   0.89

Fig. 5. To determine the proper number of particles experimentally, we plot the average accuracy over 30 runs of the synthetic random graph matching experiment (Sec. 4.1) while varying the number of particles and the problem difficulty. The plots are obtained by varying (a) the number of outliers and (b) the deformation rate. In both cases, the accuracy saturates as the number of particles becomes large enough, except under the extreme conditions.

4.3 More Experimental Analyses


In this section, we performed three additional experiments to analyze the characteristics of SMCM. In the first experiment, we evaluated the performance of SMCM while varying the number of particles. The larger the number of particles is, the more likely sets of matches with low probability are sampled, resulting in a broader search.

6 For SMCM, we measured the average accuracy over 30 trials because SMCM is a stochastic algorithm, unlike the others.

Fig. 6. The average precision of particles and the number of particle groups at each time step, measured on the synthetic random graph matching of Sec. 4.1. (a) The average precision of particles increases as the number of matches in the particles increases. The plot is depicted while the number of outliers is varied. An abrupt increase occurs when enough true matches (3-8) are sampled. (b) The number of particle groups is depicted versus the time step while the number of outliers is varied. The number of groups decreases as the match size increases.

Therefore, more particles produce better performance at the cost of more computation. There is a trade-off between accuracy and time, and it can be controlled by the number of particles, depending on the application. Using the same synthetic random graph experiment of Sec. 4.1, we plotted the average accuracy over 30 runs with respect to the number of particles. Figures 5(a) and 5(b) provide the results of varying the number of outliers with no deformation, and of varying the deformation noise with no outliers, respectively. For the problem with outliers, 400-1000 samples are needed for saturation, and in the presence of deformation, more particles, about 5000, are needed to reach the saturation point. Unlike outlier noise, the deformation noise lowers the affinity values between true matches, making true and false matches indistinguishable. Therefore, more particles are needed in the presence of deformation noise. Nevertheless, the required number of particles is only a tiny fraction of the number of all possible combinations (20!).
In the second experiment, we evaluated the efficiency of the algorithm. First, the average precision of particles at each time step is depicted in Fig. 6(a) for the synthetic random graph matching experiment of Sec. 4.1 while the number of outliers is varied. It increases abruptly after about three to eight samplings, which means that previously sampled matches in the particles boost the sampling of other inlier matches. It decreases after time step 20 because the number of true matches is 20. Secondly, Fig. 6(b) shows the number of groups at each time step in the same experiments. Since the information of particles is shared within a group, the number of groups directly affects the computational cost. In Fig. 6, it decreases until time step 3-8 while the average precision increases, which implies that the weights are highly concentrated on reliable groups at the early stages.
Fig. 7. The accuracies for different settings of the kernel truncation ratio. The experimental settings are the same as for the synthetic random graph matching in Sec. 4.1.

When the precision saturates, the number of groups gradually decreases after a small increase, resulting in an effective reduction of the computational cost.
In the third experiment, we evaluated the accuracy for different settings of the kernel truncation ratio under the same experimental settings as in Sec. 4.1. The results are shown in Fig. 7. As the ratio increases, SMCM becomes more robust to the deformation noise. In the presence of outliers, however, the highest accuracy is achieved at 0.1, where the distracting outlier matches with relatively small affinity values are effectively ignored.

5 Conclusion

In this paper, we proposed a novel graph matching algorithm based on the Sequential Monte Carlo (SMC) framework. During the sequential sampling procedure, our algorithm effectively collects potential matches under the matching constraints and avoids the adverse effects of outliers and deformation. The experimental results show that it outperforms the state-of-the-art algorithms and is highly robust to outliers. Since the SMC framework can preserve samples from multiple modes in its particles, it can be extended to a multi-modal graph matching algorithm by exploiting this property. We leave this as future work.

References
1. Duchenne, O., Joulin, A., Ponce, J.: A graph-matching kernel for object categorization. In: ICCV (2011)
2. Cao, Y., Zhang, Z., Czogiel, I., Dryden, I., Wang, S.: 2d nonrigid partial shape matching using mcmc and contour subdivision. In: CVPR (2011)
3. Birchfield, S.: KLT: An implementation of the Kanade-Lucas-Tomasi feature tracker (1998)
4. Leordeanu, M., Hebert, M.: A spectral technique for correspondence problems using pairwise constraints. In: ICCV (2005)
5. Leordeanu, M., Hebert, M.: An integer projected fixed point method for graph matching and MAP inference. In: NIPS (2009)
6. Cho, M., Lee, J., Lee, K.M.: Reweighted Random Walks for Graph Matching. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part V. LNCS, vol. 6315, pp. 492-505. Springer, Heidelberg (2010)
7. Lee, J., Cho, M., Lee, K.M.: A graph matching algorithm using data-driven markov chain monte carlo sampling. In: ICPR (2010)
8. Duchenne, O., Bach, F., Kweon, I.S., Ponce, J.: A tensor-based algorithm for high-order graph matching. PAMI (2010)
9. Lee, J., Cho, M., Lee, K.M.: Hyper-graph matching via reweighted random walks. In: CVPR (2011)
10. Caetano, T., McAuley, J., Cheng, L., Le, Q., Smola, A.: Learning graph matching. PAMI (2009)
11. Leordeanu, M., Hebert, M.: Unsupervised learning for graph matching. In: CVPR (2009)
12. Cho, M., Lee, K.M.: Progressive graph matching: Making a move of graphs via probabilistic voting. In: CVPR (2012)
13. Cappe, O., Godsill, S., Moulines, E.: An overview of existing methods and recent advances in sequential monte carlo. Proc. IEEE (2007)
14. Maciel, J., Costeira, J.P.: A global solution to sparse correspondence problems. PAMI 25, 187-199 (2003)
15. Gold, S., Rangarajan, A.: A graduated assignment algorithm for graph matching. PAMI, 411-436 (1996)
16. Cour, T., Srinivasan, P., Shi, J.: Balanced graph matching. In: NIPS (2006)
17. Torresani, L., Kolmogorov, V., Rother, C.: Feature Correspondence Via Graph Matching: Models and Global Optimization. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part II. LNCS, vol. 5303, pp. 596-609. Springer, Heidelberg (2008)
18. Dellaert, F., Seitz, S.M., Thrun, S., Thorpe, C.: Feature correspondence: A markov chain monte carlo approach. In: NIPS (2001)
19. Tamminen, T., Lampinen, J.: Sequential monte carlo for bayesian matching of objects with occlusions. PAMI (2006)
20. Lu, C., Latecki, L.J., Adluru, N., Yang, X., Ling, H.: Shape guided contour grouping with particle filters. In: ICCV (2009)
21. Yang, X., Latecki, L.J.: Weakly Supervised Shape Based Object Detection with Particle Filter. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part V. LNCS, vol. 6315, pp. 757-770. Springer, Heidelberg (2010)
22. Isard, M., Blake, A.: Condensation - conditional density propagation for visual tracking. IJCV (1998)
23. Del Moral, P., Doucet, A.: Sequential monte carlo for bayesian computation. Journal of the Royal Statistical Society (2007)
24. Del Moral, P., Doucet, A., Jasra, A.: Sequential monte carlo samplers. Journal of the Royal Statistical Society, 411-436 (2006)
25. Zass, R., Shashua, A.: Probabilistic graph and hypergraph matching. In: CVPR (2008)
26. Munkres, J.: Algorithms for the assignment and transportation problems. SIAM (1957)
27. Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust wide baseline stereo from maximally stable extremal regions. In: BMVC (2002)
Jet-Based Local Image Descriptors

Anders Boesen Lindbo Larsen1 , Sune Darkner1 ,


Anders Lindbjerg Dahl2 , and Kim Steenstrup Pedersen1
1
Department of Computer Science, University of Copenhagen, DK
{abll,darkner,kimstp}@diku.dk
2
Department of Informatics and Mathematical Modelling,
Technical University of Denmark, DK
abd@imm.dtu.dk

Abstract. We present a general novel image descriptor based on higher-order differential geometry and investigate the effect of common descriptor choices. Our investigation is twofold in that we develop a jet-based descriptor and perform a comparative evaluation with current state-of-the-art descriptors on the recently released DTU Robot dataset. We demonstrate how the use of higher-order image structures enables us to reduce the descriptor dimensionality while still achieving very good performance. The descriptors are tested in a variety of scenarios including large changes in scale, viewing angle and lighting. We show that the proposed jet-based descriptor is superior to the state-of-the-art for DoG interest points and shows competitive performance for the other tested interest points.

1 Introduction

Image characterizations based on local interest points and descriptors like SIFT [1], GLOH [2], PCA-SIFT [3], DAISY [4,5], and many others have had great impact on computer vision research. Such descriptors have been used in a range of applications for efficiently determining image similarities. Common to these approaches is their aim at a description of a local image patch which is invariant with respect to photogrammetric variations like viewpoint, scale, and lighting change. Additionally, they all have low dimensionality compared to the original pixel representation, but the difference in dimensionality among the descriptors varies significantly.
These characterizations all describe the local geometry of the image, like the local distributions of first order derivatives in the SIFT descriptor [1], or the learned basis of PCA-SIFT [3] that encodes the parameters of this basis. The introduction of PCA-SIFT inevitably leads to the question of whether the complex structure and relatively high dimensionality of most descriptors are in fact over-representations, and whether much simpler, theoretically sound, and compact formulations exist. This observation has inspired us to investigate the local k-jet [6], a higher order differential descriptor, as a natural basis for encoding the geometry of an image patch. Thus, the descriptor we propose can, similar to
PCA-SIFT, be interpreted as a basis representation of the image patch. We will investigate the effect of a multi-scale representation in our descriptor as well as the effect of introducing multi-locality similar to the grid of histograms used in SIFT.
Local image descriptors based on higher order image differentials have previously been proposed [7,8], with little success for feature matching [2]. Compared to our descriptor, these descriptors do not rely on multi-locality and only use differential invariants (functions of the local jet) up to the third order. Our descriptor bears some resemblance to the jet descriptor formulated by Laptev and Lindeberg in [9], which is based on the local 4-jet, but that descriptor does not include multi-locality and is only used for spatio-temporal recognition.
We investigate the matching performance of our method compared to current state-of-the-art descriptors on the DTU Robot dataset [10]. This dataset allows for superior performance studies for certain perturbation scenarios (view angle, scale and lighting) because of its large number of different scenes and its systematic light variation and camera positions. The dataset used by Mikolajczyk and Schmid [11] offers more perturbation scenarios, including view angle, rotation, scale, focus, exposure level and compression artifacts, but consists of only eight scenes. In Winder et al. [4], a dataset with ground truth from 3D reconstructions of three different outdoor scenes from tourist photographs was used. This dataset represents the combination of many different perturbation factors (e.g., lighting, view angle, scale, noise, perspective). However, a bias towards the descriptor (SIFT) used for the 3D reconstruction is potentially present. Virtual scenes of photorealistic quality have recently been proposed [12,13]. These datasets offer better control over image perturbations and the ground truth is known, but they are synthetic, which may bias descriptor performance.
The contributions of this paper include a new approach to the construction of local image descriptors based on the local k-jet, as well as an evaluation of local image descriptors on the DTU Robot dataset. We show that our descriptor is very competitive with the state-of-the-art and that it outperforms these descriptors by a significant margin under certain conditions. Furthermore, our descriptor is not designed for a specific type of local image geometry as detected by corner or blob interest point detectors. As such, it is a general purpose descriptor applicable to any image patch. We will therefore investigate how the performance of our descriptor varies with the choice of interest point detector.

2 Jet Descriptor

To construct the jet descriptor we use the multi-scale k-jet as described by Florack et al. [6]. We consider the scale normalized derivatives of the linear scale-space of a 2D image I(r) : R² → R,

L_{x^n y^m}(r; σ) = σ^{n+m} ∂^{n+m}/(∂x^n ∂y^m) (G ∗ I)(r; σ),   r = (x, y),   (1)
where convolution is denoted by ∗ and G is the Gaussian aperture function

G(r; σ) = 1/(σ√(2π))^d · exp(−(r·r)/(2σ²)),   σ > 0,   G(r; 0) ≡ δ.   (2)

Here δ denotes the Dirac delta function centered at zero. The local k-jet is then given by the vector J_k(r; σ) ∈ R^K, K = (2+k)!/(2 k!) − 1,

J_k(r; σ) = ({L_{x^n y^m}(r; σ) | 0 < n + m ≤ k})^T.   (3)

By excluding the zeroth order term, the k-jet becomes invariant to additive changes to the intensities.
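A minimal Python sketch (ours, not the authors' implementation) of evaluating the scale-normalized k-jet of Eqs. (1)-(3) at a pixel, using Gaussian derivative filtering:

import numpy as np
from scipy.ndimage import gaussian_filter

def local_k_jet(image, x, y, sigma, k):
    # Coefficients L_{x^n y^m}, 0 < n + m <= k, with sigma^(n+m) scale normalization.
    jet = []
    for order in range(1, k + 1):
        for n in range(order, -1, -1):
            m = order - n
            # Derivative of order m along rows (y) and order n along columns (x).
            L = gaussian_filter(image.astype(float), sigma, order=(m, n))
            jet.append((sigma ** order) * L[y, x])
    return np.array(jet)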
By construction, the scale space derivatives are correlated, and under basic assumptions on the image statistics this correlation can be described analytically [14,15]. In order to create a scale normalized and decorrelated descriptor, we therefore perform a whitening of the local k-jet coefficients according to the covariance structure derived in [14,15]. This yields a vector whose elements are uncorrelated and of the same order of magnitude, allowing the use of the Euclidean distance as descriptor similarity measure. According to [14], the covariance between the jet coefficients L_{x^i y^j} and L_{x^k y^l}, where both n = i + k and m = j + l are even, is given by

cov(L_{x^i y^j}, L_{x^k y^l}) = (−1)^{(n+m)/2 + k + l} · β² n! m! / (2π σ^{n+m} 2^{n+m} (n+m) (n/2)! (m/2)!).   (4)

Here σ is the scale parameter of the local k-jet, and β is a model parameter that is irrelevant in our context and will be set to β = 1. If either n or m is odd, the covariance is 0. Finally, we remark that this normalization method is related, but not identical, to the descriptor similarity measure proposed in [8]. After the whitening, the descriptor is L2-normalized to achieve invariance to affine contrast changes.
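The whitening step can be sketched as follows, building the covariance from Eq. (4) as reconstructed above (with β = 1) in the same coefficient order as Eq. (3), Cholesky-whitening the jet, and L2-normalizing; this is our illustration under those assumptions, not the authors' code.

import numpy as np
from math import factorial, pi

def jet_orders(k):
    # (n, m) derivative orders in the same order as Eq. (3).
    return [(n, o - n) for o in range(1, k + 1) for n in range(o, -1, -1)]

def jet_covariance(k, sigma, beta=1.0):
    orders = jet_orders(k)
    C = np.zeros((len(orders), len(orders)))
    for p, (i, j) in enumerate(orders):
        for q, (kk, ll) in enumerate(orders):
            n, m = i + kk, j + ll
            if n % 2 or m % 2:
                continue                      # zero covariance, Eq. (4)
            C[p, q] = ((-1) ** ((n + m) // 2 + kk + ll) * beta ** 2
                       * factorial(n) * factorial(m)
                       / (2 * pi * sigma ** (n + m) * 2 ** (n + m) * (n + m)
                          * factorial(n // 2) * factorial(m // 2)))
    return C

def whiten_and_normalize(jet, C):
    w = np.linalg.solve(np.linalg.cholesky(C), jet)   # decorrelate the coefficients
    return w / np.linalg.norm(w)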
Using the local k-jet above, we wish to investigate which descriptor configurations yield good discriminability. More specifically, we investigate the effect of sampling jets in a multi-scale or multi-local fashion. We also explore to what extent we can increase the differential order k while still obtaining performance improvements. This leads to the descriptor proposals listed in Table 1. We use the following naming convention. The suffix -scale2 indicates that the jet is sampled at two different scales in a single point. The suffix -gridn indicates a multi-local jet sampling in a regular square grid of size n × n (similar to the spatial sampling grid used in SIFT [1]).
Following [2,16], image patches of size 64 × 64 pixels are extracted at three times the scale of the interest points generated by the interest point detector. Moreover, affine interest point regions are warped to isotropic regions. The jet descriptor parameters shown in Table 1 have been optimized manually using the dataset by the Oxford Visual Geometry Group1.

1 Data and code available at http://www.robots.ox.ac.uk/~vgg/research/affine.
The σ parameters denote the jet scales in pixel units. The -gridn sampling location parameters p denote the x and y pixel offsets of the grid columns and rows, respectively (the row and column offsets are the same since the grid is square, hence the shared p values).
The proposed jet descriptors are, by construction, invariant to changes in scale, translation, and contrast. The scale and translation invariance arises mainly from the interest point detectors, which provide a localization in scale-space of the interest point, and from the resampling of the interest region around the point to 64 × 64 pixels. However, this localization includes some noise depending on the quality of the detector algorithm and its implementation (see e.g. [10]). We compensate for this by using fairly large scales relative to the 64 × 64 pixel patch, making the jet descriptor robust towards small perturbations in the detected position and scale. We compensate for the reduction in derivative response level caused by the Gaussian blurring by using scale normalized derivatives (see (1)). The jet descriptors are not invariant to rotations around the optical axis, as we do not orient the jets according to e.g. a dominant orientation. This could, however, easily be achieved by rotating the coordinate frame according to some fiducial orientation of the image patch and expressing the jets in this coordinate system.

Table 1. The jet descriptor proposals, their parameters and dimensionalities. Bottom rows: state-of-the-art descriptors that we compare the jet descriptors to in our experiments.

Variant        Dim.  Parameters
J4             14    σ = 10.6
J5             20    σ = 10.6
J6             27    σ = 10.6
J7             35    σ = 10.6
J4-scale2      28    σ1 = 7.5, σ2 = 16
J5-scale2      40    σ1 = 7.5, σ2 = 16
J3-grid2       36    σ = 6.8, p1 = 21, p2 = 44
J4-grid2       56    σ = 6.8, p1 = 21, p2 = 44
J5-grid2       80    σ = 6.8, p1 = 21, p2 = 44
J3-grid4       144   σ = 5.2, p1 = 15, p2 = 26, p3 = 38, p4 = 50
SIFT [1]       128   -
GLOH [2]       128   -
SURF [17]      64    -
PCA-SIFT [3]   36    -
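To make the multi-local variants concrete, the sketch below assembles a -grid2 style descriptor from jets sampled at the four grid positions given in Table 1, assuming a local_k_jet helper like the one sketched in Sec. 2; the per-jet whitening and normalization of Sec. 2 is omitted here for brevity.

import numpy as np

def jet_grid2(patch, k=4, sigma=6.8, p1=21, p2=44):
    # Four sampling points (p1, p1), (p1, p2), (p2, p1), (p2, p2) in the 64x64 patch.
    jets = [local_k_jet(patch, x, y, sigma, k)
            for y in (p1, p2) for x in (p1, p2)]
    return np.concatenate(jets)      # e.g. J4-grid2 has 4 x 14 = 56 dimensions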

3 Dataset and Evaluation Criteria


The DTU Robot dataset2 [10,16] consists of 60 different scenes containing object categories such as miniature buildings, fabrics, books and groceries (see Fig. 1a for examples).

2 Dataset and code available at http://roboimagedata.imm.dtu.dk.

Fig. 1. The DTU Robot dataset. (a): Examples of different scenes. (b): A top-down view of all camera positions (circles) and light source positions (pluses). The key frame in the front center is used as the reference image when evaluating descriptor performance. Arc 1 is at a distance of 0.5 m from the scene and spans 40°, Arc 2 has distance 0.65 m and spans 25°, and Arc 3 has distance 0.8 m and spans 20°. The linear path spans the range [0.5 m; 0.8 m].

Each scene is photographed systematically with different configurations of camera position and light source. The camera positions are placed along four paths. Three of these paths are horizontal arcs at different distances to the scene. The fourth camera path is linear along the depth axis (z-axis). Note that there are no vertical variations in the placement of the camera, nor are there any rotations around the optical axis. The illumination possibilities cover light sources placed on two linear paths, along the horizontal axis (the x-axis) and the depth axis (z-axis), respectively. Note that the light sources are created artificially from the original dataset as described in [16]. An overview of the camera and light configurations is shown in Fig. 1b.
Descriptor performance is calculated following the evaluation criteria described in [16]. That is, for each feature descriptor in the key frame (see Fig. 1b), we perform nearest neighbor distance ratio matching with the feature descriptors of another image. We then calculate the receiver operating characteristic (ROC) curve from the number of correct and incorrect feature matches. In order to get a single value quantifying the descriptor performance on an image pair, we compute the area under the curve (AUC). For each configuration of camera and lighting position, we compute the mean AUC over the 60 scenes and use it to quantify the descriptor performance.
Note that the ROC analysis differs slightly from the evaluation criteria used in the popular study by Mikolajczyk and Schmid [2], where the recall vs. 1−precision curve is used instead of the ROC curve. We have experimented with both evaluation measures and have found that they reveal the same relative descriptor performance, as the ordering of the descriptors does not change.
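A sketch of this evaluation pipeline (nearest neighbor distance ratio matching followed by ROC AUC) is given below; ground-truth correctness of each putative match is assumed to be supplied externally, and all names are ours.

import numpy as np

def distance_ratio_matches(desc_ref, desc_other):
    scores, pairs = [], []
    for i, d in enumerate(desc_ref):
        dists = np.linalg.norm(desc_other - d, axis=1)
        j1, j2 = np.argsort(dists)[:2]
        ratio = dists[j1] / max(dists[j2], 1e-12)
        scores.append(1.0 - ratio)          # higher score = more confident match
        pairs.append((i, int(j1)))
    return pairs, np.array(scores)

def roc_auc(scores, is_correct):
    order = np.argsort(-scores)
    labels = np.asarray(is_correct)[order].astype(int)
    tp = np.cumsum(labels); fp = np.cumsum(1 - labels)
    tpr = tp / max(tp[-1], 1); fpr = fp / max(fp[-1], 1)
    return float(np.trapz(tpr, fpr))        # area under the ROC curve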
[Plots of mean AUC for the jet descriptor variants of Table 1 over the camera paths (Arc 1, Arc 2, Arc 3, linear path) and the light source paths (x-axis and z-axis).]
Fig. 2. Performance of different jet descriptors using MSER interest points

4 Experiments

Using the DTU Robot dataset, we investigate the performance of the different jet descriptor configurations from Table 1. For this experiment, we use interest points generated by the MSER detector [18], as it has been shown to perform well [16]. The experimental results are shown in Fig. 2. We see that for the single jets, the performance improvements stagnate around k = 6. The multi-scale approach does not seem to increase the discriminability significantly, as J4-scale2 and J5-scale2 only have a slight edge over J4 and J5, respectively. The multi-scale descriptors are slightly more discriminative than the single-scale descriptors for changes in view point. Under illumination variations, the performance of the multi-scale descriptors is no better than that of the single-scale descriptors. Conversely, the multi-local approach yields a visible improvement, as it consistently achieves better scores than the single- and multi-scale jets. We see, to some extent, that multi-locality can be replaced with higher-order image structure, as J4-grid2 performs slightly better than J3-grid4. Among the multi-local approaches, J4-grid2 offers a good trade-off between descriptor dimensionality and performance.
We compare our jet descriptors with the state-of-the-art descriptors listed in Table 1. We choose these descriptors because they are among the top performers in the comparative study by Dahl et al. [16]. These descriptors are generated using code made available by the Oxford Visual Geometry Group and from the SURF website3. Since our dataset does not support rotations around the optical axis, we have disabled rotational invariance for all descriptors to better reveal their discriminative ability. As representatives of our multi-local jet descriptors,

3 http://www.vision.ee.ethz.ch/~surf
[Plots of mean AUC over the camera paths (Arc 1, Arc 2, Arc 3, linear path) and the light source paths (x-axis and z-axis) for SIFT, GLOH, SURF, PCA-SIFT, J4-grid2, J5 and J7.]

Fig. 3. Performance comparison with other state-of-the-art descriptors using MSER interest points

we select J4-grid2. We also include J5 and J7, as we want to investigate the performance when relying on a single, low-dimensional jet. We do not compare with previously published differential-invariant-based descriptors, such as [7,8,9], because they have in our experiments (not part of this paper) been shown to be significantly under par.
In Fig. 3, we have plotted the performance of the descriptors on MSER interest points. We see that for changes in view angle (Arc 1), J4-grid2 is able to follow the top performers SIFT, GLOH and PCA-SIFT decently (though a bit under par). Under variations in scale (linear path), the jet descriptors perform significantly under par. For changes in lighting, however, J4-grid2 is a top performer together with PCA-SIFT.
In order to investigate the effect of different interest point detectors, and thereby different types of local image geometry, we show the descriptor performance on Difference of Gaussians (DoG) interest points [1] in Fig. 4. The picture is quite different from MSER interest points, as J4-grid2 is the top performer by a visible margin in all perturbation scenarios. Even our single-scale jets perform similarly to SIFT at a much lower dimensionality. Interestingly, J5 performs better than J7 under changes in viewpoint and vice versa under illumination changes. It is also interesting to notice that PCA-SIFT and J7 have similar performance, with J7 slightly under par in some situations.
We also show the performance of our descriptors on Harris-Laplace (HarLap) and Harris-Affine (HarAff) interest points [19] in Fig. 5 and Fig. 6, respectively. At this point, we recognize a trend, as J4-grid2 consistently shows superior robustness towards changes in lighting while being very competitive under viewpoint variations. The single jets are very competitive under illumination
[Plots of mean AUC over the camera and light source paths for SIFT, GLOH, SURF, PCA-SIFT, J4-grid2, J5 and J7 on DoG interest points.]

Fig. 4. Descriptor performance on DoG interest points

variations; however, for changes in viewpoint, their performance is mediocre in most cases. The performance of SIFT and GLOH is highly correlated, with an advantage to GLOH. Lastly, the SURF descriptor performs poorly in the majority of the evaluation scenarios.
As the final experiment, we examine to what extent descriptor performance is affected by the surface structure of the scene content. More specifically, we want to see whether planar versus non-planar surfaces have an influence on the discriminability of the descriptors. To represent planar surfaces, we use a total of 17 scenes containing miniature houses, books and building materials (wooden planks and bricks). To represent non-planar surfaces, we use a total of 22 scenes containing fabric, plush toys, vegetables and beer cans. We evaluate the descriptor performance using DoG and HarAff interest points (note that HarAff estimates an affine transformation of the interest points, which cannot be assumed for non-planar surfaces). The results are shown in Fig. 7. We have omitted the results for illumination changes to save space and because they show similar results. We see a notable difference in descriptor performance depending on the scene type. Feature matching on non-planar scenes is more difficult, as the AUC is generally lower than for planar scenes at the extreme camera positions on the arcs and the linear path. It seems that the jet descriptors are more robust on non-planar surfaces, as their performance decreases less than that of the other descriptors (especially for large view angle changes). Notice also the difference between the Harris-Affine and DoG detectors; in general, the Harris-Affine detector produces more noisy mean AUC curves, which indicates the presence of more noise in the matching of these points. This is also the case for the planar scenes, for which the underlying assumptions of the Harris-Affine detector should be fulfilled. Furthermore, DoG has a higher mean AUC close to the reference image
[Plots of mean AUC over the camera and light source paths for SIFT, GLOH, SURF, PCA-SIFT, J4-grid2, J5 and J7 on HarLap interest points.]

Fig. 5. Descriptor performance on HarLap interest points

[Plots of mean AUC over the camera and light source paths for SIFT, GLOH, SURF, PCA-SIFT, J4-grid2, J5 and J7 on HarAff interest points.]
Fig. 6. Descriptor performance on HarAff interest points


in Arc 1. One may speculate whether this difference arises from the different geometries detected by DoG and Harris-Affine (blobs versus corners) or from the affine correction performed by the Harris-Affine detector.

5 Discussion

From the experiments, we have seen that jet-based descriptors offer a viable alternative to state-of-the-art local image descriptors. More specifically, the multi-local jet descriptor J4-grid2 has shown superior results for illumination changes, competitive performance for changes in viewing angle, and performance slightly under par for scale variations, except for DoG interest points, on which we obtain superior performance for all perturbations. Thus, we have shown that the popular histogram approaches, as applied in e.g. SIFT [1] and its extensions [2,4,5], are not crucial for achieving good performance. Additionally, we have shown that the use of higher-order differential image structure allows us to reduce the complexity of the multi-locality (recall that e.g. SIFT is constructed from 16 histogram sampling points). In fact, a single jet can be competitive with both SIFT and GLOH in some situations. The reduction of multi-local sampling points leads to a much lower descriptor dimensionality, which has typically been obtained by adding a PCA step to the description algorithm.
The jet descriptors have, similar to PCA-SIFT, an interpretation as a basis representation of the image patch. However, instead of learning a basis for the patch, our descriptor relies on the monomial basis of the Taylor expansion of the patch. This interpretation gives us a low-dimensional descriptor which allows us to reconstruct the patch directly from the descriptor itself [20].
Descriptor performance has been evaluated using interest points generated by different popular detectors. We consider this an important part of our descriptor evaluation, as we have seen that the relative descriptor performance depends to a large extent on the detector. Thus, one should be very careful when drawing conclusions from a single interest point detector, as it will clearly favor some descriptors over others. From our results, we have seen that the jet descriptors offer superior performance on DoG interest points in all perturbation scenarios. For MSER, HarAff and HarLap interest points, the jet approach offers competitive performance.
The promising results of our jet descriptors are at first glance peculiar considering previous non-favorable evaluations of descriptors based on image differentials [2,8]. Our approach differs in the following ways: we use the local k-jet directly (instead of differential invariants), we extract differentials to a higher order, and we employ multi-local sampling of the jets. Including spatial sampling of jets and increasing the differential order is essential to the success of our descriptor. We speculate that part of the explanation for the good performance of our jets is that the DTU Robot dataset contains non-planar surfaces (as we have seen, jet descriptors are more robust in these situations). Previous evaluations of differential descriptors have been on datasets containing, to a large extent, planar or near-planar surfaces.
[Plots of mean AUC over Arc 1, Arc 2, Arc 3 and the linear path for four cases: HarAff on planar scenes, HarAff on non-planar scenes, DoG on planar scenes, and DoG on non-planar scenes; descriptors compared: SIFT, GLOH, SURF, PCA-SIFT, J4-grid2, J5, J7.]
Fig. 7. Descriptor performance for planar and non-planar image surface structures
Our jet descriptors are very simple to implement and rely on very few parameters (compared to histogram-based descriptors), making them easy to configure. We remark that J4-grid2 requires 14 convolutions of the input patch, which allows for a fast implementation in hardware. Single-scale jets can be implemented even more efficiently from a series of point-wise products.
Taking the positive performance results and the simplicity of the descriptor into account, we believe that the multi-local jet descriptor can prove to be valuable for many computer vision applications.

References
1. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 91-110 (2004)
2. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 1615-1630 (2005)
3. Ke, Y., Sukthankar, R.: PCA-SIFT: A more distinctive representation for local image descriptors. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), vol. 2, pp. 506-513 (2004)
4. Winder, S., Hua, G., Brown, M.: Picking the best daisy. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009), pp. 178-185 (2009)
5. Tola, E., Lepetit, V., Fua, P.: Daisy: An efficient dense descriptor applied to wide-baseline stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence 32, 815-830 (2010)
6. Florack, L., ter Haar Romeny, B.M., Viergever, M., Koenderink, J.: The gaussian scale-space paradigm and the multiscale local jet. International Journal of Computer Vision 18, 61-75 (1996)
7. Schmid, C., Mohr, R.: Local grayvalue invariants for image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 19, 530-535 (1997)
8. Balmashnova, E., Florack, L.: Novel similarity measures for differential invariant descriptors for generic object retrieval. Journal of Mathematical Imaging and Vision 31, 121-132 (2008)
9. Laptev, I., Lindeberg, T.: Local Descriptors for Spatio-temporal Recognition. In: MacLean, W.J. (ed.) SCVMA 2004. LNCS, vol. 3667, pp. 91-103. Springer, Heidelberg (2006)
10. Aanaes, H., Dahl, A., Steenstrup Pedersen, K.: Interesting interest points: A comparative study of interest point performance on a unique data set. International Journal of Computer Vision 97, 18-35 (2012)
11. Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffalitzky, F., Kadir, T., Gool, L.V.: A comparison of affine region detectors. International Journal of Computer Vision 65, 43-72 (2005)
12. Kaneva, B., Torralba, A., Freeman, W.T.: Evaluating image features using a photorealistic virtual world. In: IEEE International Conference on Computer Vision (2011)
13. Vedaldi, A., Ling, H., Soatto, S.: Knowing a Good Feature When You See It: Ground Truth and Methodology to Evaluate Local Features for Recognition. In: Cipolla, R., Battiato, S., Farinella, G. (eds.) Computer Vision. SCI, vol. 285, pp. 27-49. Springer, Heidelberg (2010)
14. Pedersen, K.S.: Statistics of natural image geometry. PhD thesis, University of Copenhagen, Department of Computer Science, Denmark (2003)
15. Markussen, B., Pedersen, K.S., Loog, M.: Second order structure of scale-space measurements. Journal of Mathematical Imaging and Vision 31, 207-220 (2008)
16. Dahl, A.L., Aanaes, H., Pedersen, K.S.: Finding the best feature detector-descriptor combination. In: The First Joint Conference of 3D Imaging, Modeling, Processing, Visualization and Transmission (2011)
17. Bay, H., Ess, A., Tuytelaars, T., Gool, L.V.: Speeded-up robust features (surf). Computer Vision and Image Understanding 110, 346-359 (2008)
18. Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust wide-baseline stereo from maximally stable extremal regions. Image and Vision Computing 22, 761-767 (2004)
19. Mikolajczyk, K., Schmid, C.: Scale & affine invariant interest point detectors. International Journal of Computer Vision 60, 63-86 (2004)
20. Lillholm, M., Nielsen, M., Griffin, L.D.: Feature-based image analysis. International Journal of Computer Vision 52, 73-95 (2003)
Abnormal Object Detection
by Canonical Scene-Based Contextual Model

Sangdon Park, Wonsik Kim, and Kyoung Mu Lee

Department of EECS, ASRI, Seoul National University, 151-742, Seoul, Korea


{sangdon,ultra16,kyoungmu}@snu.ac.kr
http://cv.snu.ac.kr

Abstract. Contextual modeling is a critical issue in scene understanding. Object detection accuracy can be improved by exploiting tendencies that are common among object configurations. However, conventional contextual models only exploit the tendencies of normal objects; abnormal objects that do not follow the same tendencies are hard to detect through a contextual model. This paper proposes a novel generative model that detects abnormal objects by meeting four proposed criteria of success. This model generates normal as well as abnormal objects, each following their respective tendencies. Moreover, this generation is controlled by a latent scene variable. All latent variables of the proposed model are predicted through optimization via population-based Markov Chain Monte Carlo, which has a relatively short convergence time. We present a new abnormal dataset classified into three categories to thoroughly measure the accuracy of the proposed model for each category; the results demonstrate the superiority of our proposed approach over existing methods.

Keywords: abnormal object detection, generative model, sampling.

1 Introduction

Contextual modeling is a critical issue in scene understanding, particularly in object detection [2-4, 1, 5, 6]. Contextual models exploit the prior knowledge that in a specific scene, specific objects follow common or normal configurations, such as cars on the road. Thus, conventional contextual models weaken car bounding boxes floating over the road, or reinforce bounding boxes in correct positions. However, detecting cars actually floating on the road is difficult if contextual models only consider normal configurations. In this paper, we propose a novel generative model that can detect out-of-context objects, also referred to as abnormal objects.

Finding abnormal objects is an important and interesting task. With the advent of image-manipulation tools such as Adobe Photoshop, the number of artificially manipulated images continues to increase. Additionally, the ability to detect abnormal objects can be useful in surveillance systems. Therefore, models able to understand abnormal scenes are needed.


Fig. 1. Examples of abnormal objects detected using the proposed method (e.g., sofa [0.57], person [0.91], person [0.69]). Each image contains the topmost abnormal object that violates co-occurrence, relative position, or relative scale among objects.

This paper focuses on abnormal object detection, that is, finding abnormal objects in given scenes. We define abnormal objects as those that do not match expectations set by the surrounding context or by common knowledge. Specifically, abnormal objects (1) do not co-occur with surrounding objects (co-occurrence-violating objects), (2) violate positional relationships with other objects (position-violating objects), or (3) have relatively huge or small sizes (scale-violating objects) (Fig. 1). Exploiting the distributions of normal objects is necessary for abnormal object detection, because abnormality can only be defined by the extent to which an object is not normal.

Four necessary properties of abnormal object detection methods are defined as follows: (1) affluent object relations, (2) quantitative object relations, (3) affluent context types, and (4) prior-free object search. First, affluent relations among objects, such as a fully connected relation among objects, are critical because abnormal objects rarely occur. If an abnormal object, such as a floating car, is related to only one object, the road, then identifying whether the target object is abnormal or not becomes difficult, because the model can only identify the floating car's abnormality once the road has been detected correctly. Second, object relations, particularly relative position/scale relations, should be defined quantitatively, because a qualitative representation is improper for determining contextual violations. For example, if the relative position is qualitatively defined, such as "above," then "person" and "road" can be related by the "above" relation. However, identifying the abnormality of a person hanging in the air becomes difficult because the person is still "above" the road. Third, contextual models become more informative when more context types, such as co-occurrence and relative position/scale among objects, are used [7]. Finally, the models should not restrict the interpretation of scenes in order to find abnormal objects properly. If an abnormal-object-detecting model has prior knowledge of where to search for objects, then finding abnormal objects is nearly impossible, because abnormal objects do not exist in locations where they are expected.
We propose a novel generative model that generates both normal and abnormal objects, following a learned configuration among the objects. This learned configuration is determined by transforming a standard configuration, also called

a Canonical Scene. This Canonical Scene, conditioned on a scene type, is a space wherein normal and abnormal objects can exist, following their physical positions and scales. In this sense, the Canonical Scene is similar to the real world. In the Canonical Scene, normal objects co-occur with each other without violating relative position and scale among objects, but abnormal objects do not. The proposed method can check whether the object configuration of the input scene is similar to that of the Canonical Scene by modeling the generative model of the Canonical Scene, thereby identifying abnormal and normal objects. The first three aforementioned necessary properties for abnormal object detection models are satisfied by the definition of the Canonical Scene. In addition, the proposed model does not dictate which or where objects exist, which satisfies the final property.
We use a population-based Markov Chain Monte Carlo (Pop-MCMC) tech-
nique [8] to predict the existence of abnormal objects via the proposed model.
This technique generates samples from multiple, dependent chains, resulting in
a high mixing rate. With regard to optimization, Pop-MCMC has more chances
of escaping from the local optimum, making it a proper optimization tool for
multimodal and/or high dimensional models.
We also propose a novel annotated dataset that contains one or more abnormal objects in each image to verify the accuracy of abnormal object detection tasks. This dataset consists of three separate sets of scenes in which (1) co-occurrence-violating, (2) location-violating, and (3) scale-violating objects exist. Whether a proposed abnormal object detection method is adequate for finding each type of abnormal object must be verified. At present, only a few studies on abnormal object detection have been conducted [1, 5]. This paper is inspired by these studies.

2 Related Work

Although a variety of contextual models have been proposed over the past decade [2-4, 1, 5, 9, 7, 6, 10, 11], contextual models that meet all four criteria for finding abnormal objects are difficult to find.

One group of such models incorporates object-object interaction [12], which is used to directly enforce contextual relations among objects. In [9], a conditional random field (CRF)-based contextual model provides only co-occurrence statistics among objects to refine mislabeled image segments. This work is further expanded in [7] so that relative position context, such as above/below and inside/around, among objects is qualitatively encoded in the CRF model. Furthermore, [4] proposed a far/near relation between objects to reduce the ambiguity of qualitative context representation. In [1], Choi represents the relation among objects in a tree model for computational efficiency, taking only a few significant relationships among the objects into consideration. This tree model is further expanded in [5] to also encode support context, a geometric context for prohibiting floating objects. The strategy of [3] is to first find easy objects, and then search for difficult objects based on the locations of the easy

Fig. 2. (a) Possible locations of objects ("car," "building," "tree") in the learned "outdoor" Canonical Scene. (b) The graphical model of the Canonical Scene model.

objects. This model combines boosting and a CRF to efficiently search for unknown objects based on known ones, making it unnecessary to search the entire image to locate objects. [6] leverages the qualitative spatial contexts of stuff, such as sky or road, to improve object detection in satellite images.

Another group of contextual models indirectly embeds the contextual relationships among objects via a latent variable, such as a scene. [2] correlates objects through a common-cause scene, thus including co-occurrence statistics in a hierarchical Bayesian model. The same contextual information is exploited by [10] to achieve scene classification, annotation, and segmentation together in a single framework. [13] solves the object detection problem by assuming that holistically similar images share the same object configuration. This method retrieves normal, annotated images with similar Gist features [14] and then constructs a graphical contextual model to constrain which objects appear and where they can be found.

3 Transformed-Canonical Scene Generating Model


Our goal is to model a joint distribution over scenes and objects; simultaneously inferring the scene type, which objects exist, where the objects are positioned, and what the objects' sizes are is made possible by maximizing this distribution. From these inferred quantities, we also infer which objects are abnormal.

The proposed graphical model is described in Fig. 2(b). The observed variables x and y can be considered the input image of the whole system. We assume that the input image is represented as $\sum_{o=1}^{O} N_o$ image patches (candidate objects), where O is the number of object categories and $N_o$ is the number of candidate objects in object category o. Normally or abnormally located candidate objects are simply called normal or abnormal objects, respectively, when the

candidate objects are correctly detected. Each candidate object is located at $x_{o,n}$ and has the appearance score $y_{o,n}$, where o represents an object category and n is the index of an instance of the candidate object. x is a location matrix, where the element $x_{o,n}$ is the location vector of the corresponding candidate object. y is an appearance matrix, where the element $y_{o,n}$ represents a quantified similarity between the appearance of object category o and the corresponding candidate object. In this paper, O = 10 and $N_o = 10$ for every object category o.

The latent variables s, c, and l can be considered the outputs of the system. The variable $s \in \{1, 2, \ldots, S\}$ is the scene category of the given image, where S is the number of scene categories. The matrix c is a correctness matrix of appearance, where the element $c_{o,n}$ is a boolean-valued flag. When the flag is set to 1, the corresponding candidate object is similar to object o in appearance. The matrix l is a location type matrix, where each element $l_{o,n}$ is also a boolean-valued flag. $l_{o,n} = 1$ means that the corresponding candidate object is positioned at a normal location for object category o. Note that the location is used to represent context information because this paper assumes that the co-occurrence, position, and scale information of the objects can be geometrically represented. We assume that two or more objects can exist in the same scene if and only if they co-occur with each other. Moreover, both the position information and the scale information of the objects can be geometrically represented by using the undo projectivity technique, which is described in the following section. This paper distinguishes between normal and abnormal objects through this geometric representation by assuming that all normal objects, excluding stuff, are located on the ground plane. In summary, all context information is compactly represented as objects' locations.

3.1 Image Representations and Assumptions

This paper adopts an object-level image representation (as in [15, 1]), which differs from conventional low- or mid-level representations of images, such as colors, SIFT, or bags of visual words. Representing an image at the object level is necessary because our goal includes modeling the objects' positions and scales in the real world based on the objects in the image.

An image is described as a set of candidate objects, $\{(x_{o,n}, y_{o,n}) \mid 1 \le o \le O,\ 1 \le n \le N_o\}$. This set can be a set of bounding boxes generated by applying conventional object detectors to a given image, but only the top-$N_o$-scored candidate objects are used among the outputs of the object-o detector. $x_{o,n} = (b_{o,n}, h_{o,n})$ is the location and height of the corresponding candidate object, where $b_{o,n}$ is the y-coordinate of the candidate object's center and $h_{o,n}$ is the height of the candidate object. We do not consider the x-coordinate and the width of a candidate object because these values are not informative [14]. Thus, the words scale and height are used interchangeably. $y_{o,n}$ is a score that represents the similarity between the appearance of the corresponding candidate object and that of the object category o. $y_{o,n}$ can be calculated by applying an object detector (such as [16]) to the corresponding candidate object.

We transform the location of the candidate object from the image coordinate frame into the camera coordinate frame to exploit the position relationships and the scale information between the objects. For this purpose, we apply Hoiem's undo projectivity technique [17], assuming that all instances of the same object category o have the same physical height. The object's location in the image coordinate frame is transformed into the camera coordinate frame by the following triangulation [1]: $G_{o,n} = (G^y_{o,n}, G^z_{o,n}) = (\frac{b_{o,n}}{h_{o,n}} H_o, \frac{f}{h_{o,n}} H_o)$. The coordinates are calculated from the center location $b_{o,n}$ of the candidate object, the candidate object's height $h_{o,n} \in [h_1, h_2]$, the hand-crafted physical height of the objects $H_o \in [H_1, H_2]$, and the focal length f. This paper assumes that the height of the images is normalized to one, the principal axis passes through the center of the image, $f = 1$, $h_1 = 0.1$, $h_2 = 1$, $H_1 = 0.1$, and $H_2 = 30$. Furthermore, with a normalization of the location space to $S = [-0.5, 0.5] \times [0, 1]$, we restate the location of the corresponding candidate object as $x_{o,n} = f_{\mathrm{norm}}(G_{o,n})$, which represents both position and scale information.
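To make the coordinate transform concrete, here is a minimal Python sketch (ours, not the authors' code). It maps a bounding-box center and height in a unit-height image to a normalized camera-coordinate location using the constants stated above; the names undo_projectivity and f_norm, and the particular squashing used for the normalization, are our own illustrative choices.

```python
import numpy as np

# Constants stated in the paper: f = 1, h in [0.1, 1], H_o in [0.1, 30],
# and the normalized location space S = [-0.5, 0.5] x [0, 1].
F = 1.0
H_IMG_MIN, H_IMG_MAX = 0.1, 1.0          # allowed image heights of a candidate
Z_MAX = 30.0 / H_IMG_MIN                 # hypothetical depth bound used for normalization

def undo_projectivity(b, h, H_o, f=F):
    """Map a candidate object's image location (center y-coordinate b and
    height h, both in a unit-height image) to camera coordinates, assuming
    physical object height H_o (the triangulation of Sec. 3.1)."""
    h = np.clip(h, H_IMG_MIN, H_IMG_MAX)
    G_y = (b / h) * H_o                  # vertical position in the camera frame
    G_z = (f / h) * H_o                  # depth: a small image height means far away
    return G_y, G_z

def f_norm(G_y, G_z):
    """Squash camera coordinates into S = [-0.5, 0.5] x [0, 1] (illustrative choice)."""
    y = np.clip(G_y / (2.0 * Z_MAX), -0.5, 0.5)
    z = np.clip(G_z / Z_MAX, 0.0, 1.0)
    return np.array([y, z])

# A person-sized object (H_o ~ 1.7) occupying 20% of the image height:
print(f_norm(*undo_projectivity(b=0.5, h=0.2, H_o=1.7)))
```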
This paper assumes that all instances of an object category o have the same physical height. Thus, a relatively small object instance in the image also has the same height as the normal objects in the camera coordinate frame. Even if the small object is difficult to identify based on its height alone, the value of the second component of the small object's $x_{o,n}$ is larger than that of the normal ones. Thus, the abnormal object can be identified using the second component of $x_{o,n}$.
This paper employs the exact sense of objects used by [18]. Therefore, we speak of the car object, the person object, or the sky stuff. However, both object and stuff are called object when no confusion exists.

3.2 Canonical Scene for Location Model


This subsection presents one of the main ideas of this paper: a novel approach for designing the location model p(K, x|s, l, T). This model measures whether the candidate objects' configuration in an input scene is similar to that of a predefined scene template, also called the Canonical Scene. This idea can be considered a template matching problem. Thus, the following processes are required: defining a scene template, matching transformation, and measuring the similarity between the input scene and the template.

Canonical Scene. Given a scene category s, a Canonical Scene $L_s = \{L_{o,n,d}\}$ is a set of locational random vectors of the nth candidate object of object category o, where $d \in \{0, 1\}$; $L_{o,n,0}$ and $L_{o,n,1}$ represent the locations of abnormal and normal candidate objects, respectively, and $L_{o,n,d}$ follows a truncated Gaussian distribution. For example, Fig. 2(a) represents a realized outdoor Canonical Scene in which possible normal locations for car, building, and tree are defined. Note that even though the range of $L_{o,n,1}$ is restricted to S, for simplicity we use a Gaussian instead of a truncated Gaussian distribution by assuming that almost all of the mass of the distribution of a candidate object's location lies inside the restricted domain. Likewise, truncated Gaussian distributions are approximated as uniform distributions by letting the locations of abnormal candidate objects follow uniform

distributions. This paper represents the parameters of the Gaussian and uniform distributions jointly, indexed by (s, o, d).

Matching Transformation. The Canonical Scene is transformed to match the input scene. This matching transformation can be arbitrary, depending on the application. Only an isometric transformation is considered in this paper, because changing the scale of the Canonical Scene would make identifying abnormal objects difficult. For example, if a scene has many person objects and one floating person, the distance from the ground plane to the floating person is a critical clue for identifying the abnormal object. However, if the scale of the Canonical Scene were changed, this distance would become vague. Thus, the two-dimensional (2-D) isometric matching transformation T consists only of a translation $[\tau\ 0]^{\top}$ and a rotation $R_\theta$, where $-\pi/2 < \theta < \pi/2$. Moreover, the isometry is restricted to congruence mappings with no reflection, and is therefore invertible.

Similarity Measure. Matching measures can be defined in many ways, such as SAD, SSD, and the maximum-likelihood measure [19]. The maximum-likelihood similarity measure between the Canonical Scene and the input scene is defined as

$m^T_{s,l}(L_s, x) = \prod_{o,n} m^T_{s,l_{o,n}}(L_{o,n}, x_{o,n}), \qquad (1)$

where $m^T_{s,l_{o,n}}(L_{o,n}, x_{o,n})$ is an arbitrary similarity measure between a template feature $L_{o,n}$ and an image feature $x_{o,n}$. This paper measures the similarity between the locations of candidate objects in the transformed Canonical Scene $T(L_s)$ and the locations of candidate objects in the input scene x, where $T(\cdot)$ is the isometric transformation given by T. Because $L_{o,n} = \{L_{o,n,0}, L_{o,n,1}\}$, matching one of them to $x_{o,n}$ is required. This paper defines the aforementioned matching as follows: if $l_{o,n} = d$, then $L_{o,n,d}$ is mapped to $x_{o,n}$. Thus, l controls the matches between $L_s$ and x.

Connection with the Location Model. If a Canonical Scene is properly learned, the configurations of candidate objects in the Canonical Scene are the same as those in the real world. For example, Fig. 2(a) is an outdoor Canonical Scene in which possible normal locations of candidate objects, such as the normal locations of car and building, are defined. The possible location of car is on the ground plane on average, and that of building is above the ground plane. Moreover, sofa does not exist in the outdoor Canonical Scene. Normal objects in the Canonical Scene co-occur with each other and exist without violating relative position/scale among objects.

Because the Canonical Scene embeds objects as they are located in the real world, the similarity between the Canonical Scene and an input scene can confirm whether or not the configuration of candidate objects in the input scene follows that of objects in the real world. By letting $m^T_{s,l_{o,n}}(L_{o,n}, x_{o,n}) \equiv p(K_{o,n}, x_{o,n} \mid s, l_{o,n}, T)$, where $T(L_{o,n}) = K_{o,n}$, the location model $p(K, x \mid s, l, T)$ can be naturally interpreted as the maximum-likelihood measure $m^T_{s,l}$ of a template matching problem, and can thus also be used to check normality.

3.3 Joint Location and Appearance Model


A joint distribution p(s, l, c, T, K, x, y) is designed as a graphical model (Fig. 2(b)) to join the location model with the appearance model. The joint distribution is restated as

$p(s, l, c, T, K, x, y) = p(K, x \mid s, l, T)\, p(y \mid c)\, p(s, l, c, T) \qquad (2)$
$= \Big[\prod_{o,n} p(K_{o,n}, x_{o,n} \mid s, l_{o,n}, T)\Big] \Big[\prod_{o,n} p(y_{o,n} \mid c_{o,n})\Big] \Big[p(s)\, p(T) \prod_{o,n} p(c_{o,n} \mid l_{o,n})\, p(l_{o,n})\Big],$

where the first term is the location model, the second is the appearance model, and the last term is the prior distribution over the latent variables.

The marginal location model $\int p(K_{o,n}, x_{o,n} \mid s, l_{o,n}, T)\, dK_{o,n}$ can be represented analytically. Because $p(x_{o,n} \mid l_{o,n}, K_{o,n})$ is defined as $\mathcal{N}(x_{o,n} \mid K_{o,n}, \Sigma_o)$ when the candidate object is correctly detected as a normal object ($l_{o,n} = 1$), the marginal location model is also a Gaussian, with mean $R_\theta \mu_{s,o,1} + [\tau\ 0]^{\top}$ and covariance $R_\theta \Sigma_{s,o,1} R_\theta^{\top} + \Sigma_o$. Moreover, if $l_{o,n} = 0$, $x_{o,n}$ is independent of $K_{o,n}$ and follows a uniform distribution, so the marginal location model also follows a uniform distribution. The appearance model $p(y_{o,n} \mid c_{o,n})$ is adopted from conventional models [2, 1]. It is defined indirectly through Bayes' theorem, $p(y_{o,n} \mid c_{o,n}) = p(c_{o,n} \mid y_{o,n})\, p(y_{o,n}) / p(c_{o,n})$, where $p(c_{o,n} \mid y_{o,n})$ is defined as a logistic regression model $\sigma(w_{o,c_{o,n}}^{\top} [1\ y]^{\top})$. The scene distribution is defined as $p(s) \sim \mathrm{Mult}(\pi)$, and the transformation distribution $p(T = (\tau, \theta))$ is defined as a bivariate Gaussian with no correlation between $\tau$ and $\theta$. The distribution over correctness of appearance, $p(c_{o,n} \mid l_{o,n})$, is modeled by a Bernoulli distribution whose parameter depends on $(o, l_{o,n})$, and the location distribution $p(l_{o,n})$ is a Bernoulli distribution with an object-specific parameter.
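As an illustration of the marginal location model just described, the sketch below (ours, not the authors' implementation; the symbols mu, Sigma, Sigma_o, tau, and theta follow the notation reconstructed above) evaluates the per-object location log-likelihood: a transformed Gaussian when the candidate is hypothesized to be normally located (l = 1), and a uniform density over S when it is hypothesized to be abnormal (l = 0).

```python
import numpy as np
from scipy.stats import multivariate_normal

AREA_S = 1.0  # area of the normalized location space S = [-0.5, 0.5] x [0, 1]

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def location_loglik(x, l, mu, Sigma, Sigma_o, tau, theta):
    """Log of the marginal location model for one candidate object.

    x          : observed 2-D location in S
    l          : 1 if the candidate is hypothesized to be normally located, else 0
    mu, Sigma  : canonical-scene Gaussian for this (scene, object) pair
    Sigma_o    : observation noise covariance
    tau, theta : isometry T = (translation along y, rotation angle)
    """
    if l == 0:
        # Abnormal locations are modeled as uniform over S.
        return -np.log(AREA_S)
    R = rotation(theta)
    mean = R @ mu + np.array([tau, 0.0])   # transformed canonical mean
    cov = R @ Sigma @ R.T + Sigma_o        # transformed covariance plus noise
    return multivariate_normal.logpdf(x, mean=mean, cov=cov)

# Toy usage: a "car" whose canonical location sits near the ground line.
print(location_loglik(x=np.array([0.1, 0.4]), l=1,
                      mu=np.array([0.0, 0.4]), Sigma=0.01 * np.eye(2),
                      Sigma_o=0.001 * np.eye(2), tau=0.05, theta=0.1))
```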

4 Parameter Learning
This paper explicitly estimates T to convert all random variables into observed ones for learning Canonical Scenes. Given the scene/object-level annotated datasets [20, 21], we separate the data into subsets $D = \{D_s\}$ and use $D_s$ to build the Canonical Scene for scene category s. The distributions of $L_{o,n,1}$ are estimated by the maximum a posteriori criterion, considering the relation $T(L_{o,n,1}) = K_{o,n,1}$. Assuming that all objects in the images are located on the ground plane, Algorithm 1 estimates T, which transforms objects on the ground plane to the slanted plane in the camera coordinate frame. In this estimate, only the locations of objects, not those of stuff, are used. Moreover, the distribution of object o's normal location $L_{o,n,1}$, for objects that do not appear in $D_s$, is set to a broad zero-mean Gaussian.
Learning the logistic regression model $p(c_{o,n} \mid y_{o,n}) = \sigma(w_{o,c_{o,n}}^{\top} [1\ y_{o,n}]^{\top})$ in the appearance model may seem trivial. However, when data classes are imbalanced,

Algorithm 1. Estimation of the isometry T of an image

Input: Locations of normal objects, $K_{o,n,1}$, for all o and n
Output: $T : L_{o,n,1} \mapsto K_{o,n,1}$
1: Estimate the line $z = a_0 + a_1 y$ passing through the points $K_{o,n,1}$ by minimizing the least-squares error
2: $K_{o,n,1} = R_\theta L_{o,n,1} + [\tau\ 0]^{\top}$, where $R_\theta$ is a 2-D rotation matrix with $\theta = \arctan(1/a_1)$ and $\tau = -a_0/a_1$.

parameter learning with a conventional machine learning algorithm can reduce the accuracy of the classification [22]. The data used in learning the appearance model are also imbalanced: they contain only a small number of correctly detected candidate objects, which results in low accuracy when classifying correctness. To handle this problem, we adopt conventional random oversampling, in which uniform sampling of the minority class provides balanced data; in our case, however, we apply more weight to samples with high $y_{o,n}$. The remaining model parameters are set experimentally.
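Algorithm 1 above can be written compactly as follows. This is our sketch under the reconstruction given there (with tau and theta as assumed names for the translation and rotation of T), not the authors' code: it fits the ground line z = a0 + a1*y to the camera-coordinate locations of the normal objects and reads the isometry off the fitted coefficients.

```python
import numpy as np

def estimate_isometry(K):
    """Estimate the isometry T = (tau, theta) from an (N, 2) array K of
    normal-object locations (y, z) in camera coordinates (cf. Algorithm 1).

    Fits z = a0 + a1 * y by least squares, then
        theta = arctan(1 / a1),   tau = -a0 / a1.
    """
    y, z = K[:, 0], K[:, 1]
    A = np.stack([np.ones_like(y), y], axis=1)   # [1, y] design matrix
    (a0, a1), *_ = np.linalg.lstsq(A, z, rcond=None)
    theta = np.arctan2(1.0, a1)                  # angle of the fitted ground line
    tau = -a0 / a1                               # where the fitted line crosses z = 0
    return tau, theta

# Synthetic check: points generated from a known ground line are recovered.
rng = np.random.default_rng(0)
y = rng.uniform(-0.3, 0.3, size=50)
z = 0.1 + 2.0 * y + 0.01 * rng.normal(size=50)
print(estimate_isometry(np.stack([y, z], axis=1)))
```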

5 Pop-MCMC for MAP Inference


Maximizing posterior distribution (eq. (3)) is required to detect abnormal ob-
jects. 
s, l, c = arg max p(s, l, c, T |x, y)dT. (3)
s,l,c

Because the integral in eq. (3) cannot be solved analytically and is computationally expensive, we approximate its value by assuming that most of the mass of the distribution over T is concentrated at a single point. This assumption seems valid because pictures are commonly taken in a conventional way. Therefore, $\int p(s, l, c, T \mid x, y)\, dT$ is approximated as $p(s, l, c, \hat{T} \mid x, y)$, where $\hat{T}$ is estimated by solving the non-linear optimization problem $\hat{T} = \arg\max_T p(T) \int p(K, x \mid s, l, c, T)\, dK$ using the gradient ascent method. Because T follows a bivariate Gaussian distribution, the mean of T is a proper initialization for the gradient method.
This paper adopts a sampling method called Pop-MCMC [8] to solve the optimization problem (3). Pop-MCMC generates multiple samples, also called chromosomes, from multiple Markov chains in parallel. This method can make global moves because it exchanges information between samples, thus leading to a higher mixing rate compared with conventional single-chain MCMC samplers. When it comes to an optimization problem, it is possible to escape from local optima via the chromosomes of Pop-MCMC. Therefore, optimization by means of Pop-MCMC is efficient when the objective function is multimodal and/or high-dimensional, as in the proposed model as well as in MRF models for stereo problems [23].
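As a rough illustration of the idea, the sketch below implements one common instance of population-based MCMC, parallel tempering with exchange moves, over a toy discrete state. The target distribution, temperatures, and bit-flip proposal are placeholders; this is not the sampler of [8] or the authors' implementation.

```python
import numpy as np

def pop_mcmc(log_post, init_states, temps, n_iter=2000, seed=0):
    """Toy population-based MCMC over binary vectors.

    log_post    : function mapping a binary state vector to its log-posterior
    init_states : list of initial states, one per chain
    temps       : per-chain temperatures (temps[0] == 1 is the target chain)
    Returns the best state found by the cold chain and its log-posterior.
    """
    rng = np.random.default_rng(seed)
    states = [s.copy() for s in init_states]
    logps = [log_post(s) for s in states]
    best_state, best_logp = states[0].copy(), logps[0]

    for _ in range(n_iter):
        # Mutation move: flip one bit in each chain (tempered Metropolis step).
        for k, t in enumerate(temps):
            prop = states[k].copy()
            prop[rng.integers(len(prop))] ^= 1
            lp = log_post(prop)
            if np.log(rng.random()) < (lp - logps[k]) / t:
                states[k], logps[k] = prop, lp
        # Exchange move: swap the states of two adjacent chains.
        k = rng.integers(len(temps) - 1)
        accept = (logps[k] - logps[k + 1]) * (1.0 / temps[k + 1] - 1.0 / temps[k])
        if np.log(rng.random()) < accept:
            states[k], states[k + 1] = states[k + 1], states[k]
            logps[k], logps[k + 1] = logps[k + 1], logps[k]
        if logps[0] > best_logp:
            best_state, best_logp = states[0].copy(), logps[0]
    return best_state, best_logp

# Toy target: prefer states matching a hidden pattern.
target = np.array([1, 0, 1, 1, 0, 0, 1, 0])
log_post = lambda s: -3.0 * np.sum(s != target)
init = [np.zeros_like(target) for _ in range(4)]
print(pop_mcmc(log_post, init, temps=[1.0, 2.0, 4.0, 8.0]))
```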
Note that detections should be scored for ranking and for drawing precision-recall curves to measure object detection results. A natural score for the detections is the posterior marginal [1] or the log-odds ratio [4]. The log-odds ratio

can quantify the oddness of realizations of random variables. In this paper, the log-odds ratio represents how abnormal a candidate object is. The odds ratio of object category o's nth candidate object is calculated as follows:

$m((l_{o,n}, c_{o,n}) = (0, 1)) = \log \dfrac{p((l_{o,n}, c_{o,n}) = (0, 1) \mid x, y)}{p((l_{o,n}, c_{o,n}) \neq (0, 1) \mid x, y)}.$

This log-odds ratio can be approximated as in [4] and calculated by modifying the MAP inference of eq. (3).

6 Evaluation
6.1 Abnormal Dataset and Experimental Setup
The dataset should be separated into each type of abnormal object to allow thorough evaluation of the accuracy of abnormal object detection models. The new dataset1 consists of three different types of images (50 images each, for a total of 150 images containing annotated objects that violate co-occurrence, relative position, and relative scale) and additional extra images that contain two or more different types of abnormal objects. The abnormal objects are annotated relative to normal objects contained in the normal dataset. For example, a conventional normal dataset, such as LabelMe, consists of cars on the road, and so a flying car is annotated as an abnormal object. Likewise, a person who is taller than common adults is also annotated. Even though an abnormal dataset has already been established [1, 5], that set is unsuitable for a thorough evaluation of all three types of abnormal objects because it has not been separated into each type.
Two datasets, one normal and one abnormal, are used for training and testing. The SUN dataset [20, 1] is used as the first dataset; it has scene-level annotations as well as many annotated objects in a single image. Therefore, SUN is a proper dataset for training and testing the relationships among objects. The second dataset is the proposed abnormal dataset. Object detector models [16] and the parameters of the proposed model are trained on randomly separated images from the SUN dataset. Evaluation is conducted on the test set of the SUN dataset and the entire abnormal dataset. Two scene categories and ten object categories2 are used to train and test the proposed model. The parameters of the proposed model (including the transformation prior and $\Sigma_o$) are set experimentally; in particular, $t_1 = 0.02$ and $t_2 = 0.004$. Note that optimizing eq. (3) takes about two minutes on an Intel Quad Core 3.3 GHz PC with 8 GB of memory.
We choose a hybrid of the co-occurrence context (CO) and support context (SUP) model [5] as the baseline method for abnormal object detection, because our proposed model simultaneously detects abnormal objects that violate co-occurrence context and position/scale context. The number of objects used in the baseline is 10+2, including support objects such as floor and road. These supporting objects, which are not needed by the proposed model, are strictly necessary for the baseline method because of its dependency on supporting objects. The baseline method is compared with the proposed method quantitatively (Fig. 3) and qualitatively (Fig. 4).
1 L. Wei [24] holds the copyright on several of the abnormal images.
2 indoor, outdoor / bed, bottle, building, car, monitor, person, sky, sofa, toilet, tree.

(a) co-occurrence   (b) position   (c) scale   (d) abnormal   (e) average precision

(e) Average precision (%) of the proposed method (CS) and the baseline (CO+SUP) at varying abnormal image ratios:

Abnormal image ratio (%)      CS              CO+SUP
100                           15.21 (0.91)    9.71 (0.05)
50                             9.71 (0.53)    4.55 (0.10)
10                             5.64 (0.91)    2.53 (0.19)
5                              7.32 (2.50)    3.38 (0.11)

Fig. 3. Quantitative comparisons. Each precision-recall curve ((a) to (d)) is generated using the classified abnormal datasets and a combination of the three classified datasets and the extra abnormal images. Experiment (e) is conducted on mixtures of the abnormal and normal datasets. Each mixture consists of a set of 150 images composed of 100%, 50%, 10%, and 5% abnormal images.

6.2 Results

Precision-recall curves and average precision measures are used to evaluate abnormal object detection accuracy; these are conventional measures [25] for object detection tasks.

The average precision in Fig. 3(e) shows that the proposed method outperforms the baseline method on both the abnormal and mixed datasets. Our method is robust on all three types of dataset, particularly on the position dataset (Fig. 3(b)), as shown by the analysis of the precision-recall curves (Figs. 3(a) to 3(c)). The reason is that the proposed Canonical Scene method is more robust at separating floating objects than the baseline method. Fig. 3(d) shows that our method also outperforms the baseline on the combination of each type of dataset and the extra abnormal images. Fig. 3(e) shows average precision scores on 150 images with varying ratios of abnormal and normal images. If we assume that a general image set is composed of 5% abnormal images, then this experiment verifies the robustness of the proposed method on a general image set. The proposed method may also be superior on a general image set based on these experimental

Fig. 4. Qualitative comparisons. The results in the top and middle rows are generated using the baseline method and the proposed method, respectively. In each image, the top-three abnormal objects are represented by bounding boxes. In the third row, the distribution of objects' locations in the Canonical Scene ("outdoor" scene; "sofa" and "person" objects) is represented together with the estimated abnormalities.

results. Note that Fig. 3(c) shows that the proposed method is weak at classifying abnormal objects with abnormally huge or small sizes. Improperly learned Canonical Scenes may cause this weakness.

The qualitative comparisons and results are shown in Figs. 4 and 5. Based on the last row of Fig. 4, we can conclude that the proposed method exploits the rich relations among objects and instances of objects. For example, the second figure in the last row represents the distribution of a normal person in an outdoor scene. The learned normal distribution of person is transformed to cover the person on the road. Therefore, the transformed distribution, or Canonical Scene, cannot cover the floating person, thus making it possible to classify the floating person as an abnormal object. Fig. 5 illustrates positive and negative results of the proposed method. The negative results are due to the improperly learned Canonical Scene, the experimentally set parameters, and the weaknesses of the base object detectors.

Fig. 5. Qualitative results. The images in the first, second, and third columns consist of co-occurrence-violating, relative position-violating, and relative scale-violating images, respectively. In each image, the top-five abnormal objects are represented by bounding boxes.

7 Conclusion

We proposed a generative model for abnormal object detection. The model mainly exploits rich and quantitative relations among objects and object instances via latent variables. Because of these considerations, our model outperformed the state-of-the-art method for abnormal object detection. In addition, we were able to thoroughly analyze the accuracy of the proposed method for different types of abnormality by classifying the evaluation dataset into three types. The analysis revealed that our model is strong at detecting abnormal co-occurrence and position, but not as effective at detecting scale-violating objects. We expect that accuracy will increase with fuller learning of the proposed model.

References
1. Choi, M., Lim, J., Torralba, A., Willsky, A.: Exploiting hierarchical context on a
large database of object categories. In: Proc. of IEEE Conf. on CVPR (2010)
2. Murphy, K., Torralba, A., Freeman, W.: Using the forest to see the trees: a graph-
ical model relating features, objects and scenes. In: NIPS (2003)
3. Torralba, A., Murphy, K., Freeman, W.: Contextual models for object detection
using boosted random fields. In: NIPS (2005)
4. Desai, C., Ramanan, D., Fowlkes, C.: Discriminative models for multi-class object
layout. In: Proc. of ICCV (2009)
5. Choi, M., Torralba, A., Willsky, A.: Context models and out-of-context objects.
Pattern Recognition Letters 33, 853–862 (2012)

6. Heitz, G., Koller, D.: Learning Spatial Context: Using Stuff to Find Things. In:
Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302,
pp. 30–43. Springer, Heidelberg (2008)
7. Galleguillos, C., Rabinovich, A., Belongie, S.: Object categorization using co-
occurrence, location and appearance. In: Proc. of IEEE Conf. on CVPR (2008)
8. Jasra, A., Stephens, D., Holmes, C.: On population-based simulation for static
inference. Statistics and Computing 17, 263–279 (2007)
9. Rabinovich, A., Vedaldi, A., Galleguillos, C., Wiewiora, E., Belongie, S.: Objects
in context. In: Proc. of ICCV (2007)
10. Li, L., Socher, R., Fei-Fei, L.: Towards total scene understanding: Classification,
annotation and segmentation in an automatic framework. In: Proc. of IEEE Conf.
on CVPR (2009)
11. Ladicky, L., Russell, C., Kohli, P., Torr, P.H.S.: Graph Cut Based Inference
with Co-occurrence Statistics. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.)
ECCV 2010, Part V. LNCS, vol. 6315, pp. 239–253. Springer, Heidelberg (2010)
12. Galleguillos, C., Belongie, S.: Context based object categorization: A critical sur-
vey. Computer Vision and Image Understanding 114, 712–722 (2010)
13. Russell, B., Torralba, A., Liu, C., Fergus, R., Freeman, W.: Object recognition by
scene alignment. In: NIPS (2007)
14. Torralba, A.: Contextual priming for object detection. IJCV 53, 169–191 (2003)
15. Li, L., Su, H., Xing, E., Fei-Fei, L.: Object bank: A high-level image representation
for scene classification and semantic feature sparsification. In: NIPS (2010)
16. Felzenszwalb, P., Girshick, R., McAllester, D., Ramanan, D.: Object detection with
discriminatively trained part-based models. IEEE Trans. on PAMI 32, 1627–1645
(2010)
17. Hoiem, D., Efros, A., Hebert, M.: Putting objects in perspective. In: Proc. of IEEE
Conf. on CVPR (2006)
18. Alexe, B., Deselaers, T., Ferrari, V.: What is an object? In: Proc. of IEEE Conf.
on CVPR (2010)
19. Olson, C.: Maximum-likelihood image matching. IEEE Trans. on PAMI 24,
853–857 (2002)
20. Xiao, J., Hays, J., Ehinger, K., Oliva, A., Torralba, A.: Sun database: Large-scale
scene recognition from abbey to zoo. In: Proc. of IEEE Conf. on CVPR (2010)
21. Russell, B., Torralba, A., Murphy, K., Freeman, W.: Labelme: a database and
web-based tool for image annotation. IJCV 77, 157–173 (2008)
22. Van Hulse, J., Khoshgoftaar, T., Napolitano, A.: Experimental perspectives on
learning from imbalanced data. In: ICML (2007)
23. Kim, W., Park, J., Lee, K.: Stereo matching using population-based mcmc.
IJCV 83, 195–209 (2009)
24. Wei, L. (2012), http://www.liweiart.com
25. Everingham, M., Gool, L., Williams, C., Winn, J., Zisserman, A.: The pascal visual
object classes (voc) challenge. IJCV 88, 303–338 (2010)
Shapecollage: Occlusion-Aware, Example-Based
Shape Interpretation

Forrester Cole, Phillip Isola, William T. Freeman,
Frédo Durand, and Edward H. Adelson

Massachusetts Institute of Technology


{fcole,phillipi,billf,fredo,adelson}@csail.mit.edu

Abstract. This paper presents an example-based method to interpret a 3D shape from a single image depicting that shape. A major difficulty in applying an example-based approach to shape interpretation is the combinatorial explosion of shape possibilities that occurs at occluding contours. Our key technical contribution is a new shape patch representation and corresponding pairwise compatibility terms that allow for flexible matching of overlapping patches, avoiding the combinatorial explosion by allowing patches to explain only the parts of the image they best fit. We infer the best set of localized shape patches over a graph of keypoints at multiple scales to produce a discontinuous shape representation we term a shape collage. To reconstruct a smooth result, we fit a surface to the collage using the predicted confidence of each shape patch. We demonstrate the method on shapes depicted in line drawing, diffuse and glossy shading, and textured styles.

1 Introduction

A long-standing goal of computer vision is to recover 3D shape from a single image. Early researchers developed techniques for specific domains such as line drawings of polyhedral objects [1], shape-from-shading for Lambertian objects [2], and shape-from-texture for simple textured objects [3]. While based on solid mathematical and physical foundations, these techniques have proven difficult to generalize beyond limited cases: even the seemingly simple line drawing and shaded images of Figure 1 confound all existing techniques.

Recently, machine learning methods (e.g., [4, 5]) have been proposed that allow generalization by learning the relationship between shape and training images. However, the flexibility and computational resources required for learning-based vision depend critically on the choice of shape representation. Occlusion boundaries, for example, lie between uncorrelated regions and cause a combinatorial explosion in possibilities if not explicitly included in the representation.

In this paper, we describe a fully data-driven approach that reconstructs an input shape by piecing together patches from a training set. We make this approach practical by using a new, irregular, multi-scale representation we term a

Fig. 1. A line drawing and diffuse rendering of a 3D shape (left) are interpreted as normal maps by our system (right). Each rendering produces a different plausible interpretation. Our system uses compatibility between local depth layers to interpret self-occlusion (blue box) and near-occlusion (red box).

shape collage, coupled with a new patch representation and compatibility measurement between patches that addresses the combinatorial explosion at occlusion boundaries and generalizes patches across scale, rotation, and translation. Given compatibility between patches, we use belief propagation to produce the most likely collage and then fit a smooth surface to produce the final shape interpretation (Figure 1).

We demonstrate our approach on synthetic renderings of blobby shapes in a range of rendering styles, including line drawings and glossy, textured images. Interpreting line drawings of curved shapes has been a long-standing unsolved problem in computer vision [6, 7]. Much progress has been made for drawings and wire-frame representations of 3D shapes with flat facets (e.g., [8]), but progress on the interpretation of generic line drawings has stalled. Interpretation of shaded renderings has been addressed more frequently in the computer vision community, but is still not a solved problem [2].

Our contributions are the design of the shape patches and the measurement of compatibility between them, the irregular, multi-scale relationships between patches, and the selection of a diverse set of shape candidates for each patch. Besides shape, our representation can also estimate factors such as the visual style of the input.

1.1 Related Work

The shape-from-a-single-image problem is perhaps the oldest in computer vision,


dating at least to Roberts [1]. Because of the difficulty of the problem, it has

over time been divided into subproblems such as shape-from-shading (see [2] for a recent survey), shape-from-line-drawing (e.g., [6-8]), and shape-from-texture (e.g., [3]). These parametric methods have difficulty capturing many of the cases that occur in the visual world, while learning-based methods can successfully capture such visual complexity. Saxena et al. [5] use a learning-based approach to find a parametric model of a 3D scene and also explicitly treat occlusion boundaries. While their system can process complex input due to its learned model, their parametric scene description is tailored for boxy shapes such as buildings and does not contain features other than shape. By contrast, our rich, example-based model extends to smooth shapes and can predict visual style as well as shape.

For line drawing recognition, Saund [9] applied belief propagation in a Markov Random Field (MRF) to label contours and edges of pre-parsed sketch images, although without any explicit representation of shape. Ouyang and Davis [10] combined learned local evidence for symbol sketches with MRF inference for sketch recognition, focusing on chemical sketches.

Our work is a spiritual successor to the work of Freeman et al. [11] on learning low-level vision. That system uses a network of multi-scale patches and a set of candidates at each patch to infer the most likely explanation for the stimulus image. Hassner and Basri [4] also use a patch-based approach to reconstruct depth, while adding the ability to synthesize new exemplar images on-the-fly. Unlike both systems, we use an irregular network of patches centered on interest points in the image, and directly tackle the problem of occlusion boundaries. Because our training data is synthetic we could also create new exemplars on-the-fly, but have not explored that option in this work.

Example-based image priors were also used by Fitzgibbon et al. [12] in the context of image interpolation between measurements from multiple cameras. The benefits of learning from large, synthetically generated datasets were shown emphatically in the recent work of Shotton et al. [13].

2 Overview
Our approach finds interest points in the image, selects candidate interpretations for each point, then defines a Markov random field tailored to those interest points and performs inference to determine the most likely shape configuration (Figure 2). Shape is represented by patches that contain the rendered appearance of the patch, a normal map, a depth map, a map of occluding contours, and an ownership mask that defines for which pixels the patch provides a shape explanation (Figure 3). Patches are placed and oriented using image keypoints computed at multiple scales. As described in Section 3, we found that current approaches for detecting keypoints do not provide a good set of points for stimuli such as line drawings, and designed our own method based on disk sampling of an interest map. We propose a simple set of multi-scale graph relationships for these keypoints.

The local appearance at each keypoint determines the selection of candidate patches (Section 4). There may be thousands of possible matches for the more

Fig. 2. The major steps of our approach. First, keypoints are extracted at multiple scales (1), then neighboring keypoints in both image space and scale are connected (2). Separately, a set of candidate shapes is selected for each keypoint based on local appearance (3), and the compatibility of each candidate is evaluated against the candidates at each connected keypoint (4). Loopy belief propagation is used to find the most likely candidates, which form a shape collage (5). Finally, a smooth surface is fit to the collage (6).

Fig. 3. Our shape patch representation (normals, depth, contours, ownership). Given a square, oriented training patch (left, red box) the normal map is extracted and rotated to match the keypoint orientation. The depth map and occluding contours are also stored, along with an ownership mask.

common patches, far more than our inference algorithm can support. It is critical to choose a diverse-enough subset of candidates, for which we use similarity of local shape descriptors.

The patch candidates are scored based on similarity of local appearance, normal and depth layer compatibility with adjacent patches, and prior probability (Section 5). The most likely shape collage given these scores is found using loopy belief propagation, and a thin-plate spline surface is fit to this shape collage.

3 Keypoints and Graph

The first step of shape interpretation is to extract interest-driven keypoints at multiple scales and connect them into a graph. Each keypoint will become the center of a patch. A keypoint has four parameters: image position (two parameters), scale, and orientation. All four are found by the detection stage and remain fixed throughout the interpretation process.

Fig. 4. Keypoint detection at three scales (blur at 16, 32, 64 pixels; image: 300x300 pixels). Left of each pair: level of interest in the image, computed by the trace of the structure tensor. Right of each pair: original image with keypoints overlaid. Keypoints are placed to cut out the most interesting pixel until a threshold is reached.

Our criteria for a keypoint detector are: coverage, all the interesting parts of the image should be covered by at least one keypoint; sparsity, keypoints should not be redundant; and repeatability, similar image patches should have a similar distribution of keypoints.

Standard methods such as Harris corner detection [14] and SIFT [15] are not directly suitable for our needs. Corner detection, for example, fails the coverage criterion, since we want keypoints along all linear features in addition to corners. The SIFT keypoint detector also fails the coverage criterion: some interesting image features do not correspond to maxima of the blob detector in scale space, and so do not produce a SIFT keypoint.

By contrast, a naive method such as a regular or randomized grid fails the repeatability criterion. Repeatability is important for two reasons: it reduces the number of required training examples, and it increases the effectiveness of local appearance matching (Section 5) by aligning image and patch features.

3.1 Detection
Our approach to keypoint detection is to compute an interest map over the image, then iteratively place keypoints at the highest point of this map that is not already near another keypoint. Intuitively, this approach uses a cookie-cutter to greedily stamp out keypoints at the most interesting points in the image.

The interest map I(p) is defined as the sum of the eigenvalues (i.e., the trace) of the 2x2 structure tensor of the test image. Prior to computing the structure tensor, the image is blurred to the desired level of scale. For a keypoint of radius r, the blur is defined as a Gaussian of $\sigma = r/3$.

The stamp shape for a keypoint at p is a radially symmetric function of the distance s from p, with a smooth rolloff:

$\mathrm{stamp}_p(s) = \begin{cases} I(p) & : s < 0.25r \\ I(p)\, G(s - 0.25r) & : s > 0.25r \end{cases} \qquad (1)$

where G is a Gaussian of $\sigma = r/3$. The stamp function is subtracted from the interest map for each keypoint. Detection stops when the maximum remaining value in the interest map falls below 1% of the original maximum value.
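A minimal version of this detector might look like the following (our sketch, not the authors' implementation). It computes a per-pixel structure-tensor trace as the interest map (window smoothing of the tensor is omitted for brevity), then greedily stamps out keypoint positions as described above; orientation assignment via the SIFT detector is left out.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def interest_map(img, r):
    """Per-pixel trace of the structure tensor of img after blurring to scale r."""
    blurred = gaussian_filter(img, sigma=r / 3.0)
    gy, gx = np.gradient(blurred)
    return gx * gx + gy * gy            # trace: Ix^2 + Iy^2

def detect_keypoints(img, r, stop_frac=0.01):
    """Greedily 'stamp out' keypoint positions at the most interesting pixels."""
    I = interest_map(img, r)
    yy, xx = np.mgrid[0:img.shape[0], 0:img.shape[1]]
    keypoints, threshold = [], stop_frac * I.max()
    while I.max() > threshold:
        p = np.unravel_index(np.argmax(I), I.shape)
        keypoints.append(p)
        # Radially symmetric stamp: flat within 0.25 r, Gaussian rolloff beyond.
        s = np.hypot(yy - p[0], xx - p[1])
        rolloff = np.exp(-0.5 * ((s - 0.25 * r) / (r / 3.0)) ** 2)
        stamp = I[p] * np.where(s < 0.25 * r, 1.0, rolloff)
        I = np.maximum(I - stamp, 0.0)
    return keypoints

# Toy usage on a synthetic image with a bright rectangle.
img = np.zeros((128, 128))
img[40:60, 70:90] = 1.0
print(len(detect_keypoints(img, r=16)))
```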

The orientation of a keypoint is found using the SIFT orientation detector, which defines orientation using the histogram of gradients inside the keypoint radius [15].

This detection method satisfies our three criteria: the iterative procedure ensures that the keypoints are comprehensive and cover all interesting areas in the image. The stamp function ensures that the keypoints are sparse. The keypoint selection is repeatable since the trace of the structure tensor is rotation and translation invariant.

3.2 Graph Connections


We treat each keypoint as a vertex in an MRF graph. The edges of the graph are defined by the positions and scales of the keypoints. The edge weights correspond to the compatibility between shape patches. To compute compatibility, we need sufficient overlap between patches. We therefore only connect two keypoints if

$\|p_1 - p_2\| < (r_1 + r_2) \cdot 0.8 \qquad (2)$

where $p_i$ are the keypoint positions and $r_i$ are the radii.


Additionally, two keypoints are only connected if they are close enough in scale. In general, large-scale patches will fit the test image more coarsely than small-scale patches. The tolerable margin of error for large patches may be wider than an entire patch at a small scale, making the compatibility score between very large and very small patches useless. Therefore, we only connect patches where

$|\log_2 r_1 - \log_2 r_2| \le 1 \qquad (3)$

or, in other words, where the radii differ by a factor of two or less.
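These two rules translate directly into code. The sketch below is our illustration and assumes keypoints are given as (x, y, radius) tuples.

```python
import numpy as np
from itertools import combinations

def connect_keypoints(keypoints, overlap_factor=0.8):
    """Return MRF edges (i, j) between keypoints that overlap enough in space
    (Eq. 2) and are within one octave of each other in scale (Eq. 3).

    keypoints: list of (x, y, r) tuples.
    """
    edges = []
    for (i, (x1, y1, r1)), (j, (x2, y2, r2)) in combinations(enumerate(keypoints), 2):
        close_in_space = np.hypot(x1 - x2, y1 - y2) < (r1 + r2) * overlap_factor
        close_in_scale = abs(np.log2(r1) - np.log2(r2)) <= 1
        if close_in_space and close_in_scale:
            edges.append((i, j))
    return edges

kps = [(10, 10, 8), (18, 12, 8), (12, 14, 16), (80, 80, 8)]
print(connect_keypoints(kps))   # the distant keypoint gets no edges
```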

4 Selecting Shape Candidates


We formulate shape interpretation as a discrete labeling problem on the keypoint graph, where the labels correspond to candidate shape patches.

Ideally, every patch in the training set would be a candidate at every vertex in our graph. However, it is computationally intractable to include all $\sim 10^6$ training patches at each vertex. We must carefully prune a subset of the training set for each vertex in our graph. Proper selection of this subset is critical to successful interpretation because it defines the space of possible shapes that the MRF inference can explore.

Our approach is as follows. Given a keypoint and its associated test image patch, we first select the subset of the training patches that are similar in appearance to the test patch. This subset may still include several thousand candidates, especially for common or ambiguous patches. Out of this large subset of candidates, we choose a constant, small number of patches with diverse shapes. This approach reduces the number of labels to a manageable number while maintaining a sufficiently wide space of candidate shapes.

4.1 Matching Appearance


Our appearance matching is based on comparison of standard 128-dimensional SIFT descriptors to provide invariance to scale and rotation. The SIFT descriptor is computed at the scale and orientation of the keypoint. A training descriptor is said to match the test descriptor if the Euclidean distance between the descriptors is less than a fixed threshold t, where t = 150 in our experiments. Each descriptor dimension varies from 0 to 255.

4.2 Choosing Diverse Shapes


After the appearance matching step, we have a set of candidates that look close enough. We want the most diverse set of shapes that lie in this subset.

When creating the training set, we compute a shape descriptor for each patch by resampling the normal map to 32x32 pixels, applying PCA to find the dominant directions of variation in the $32^2$-dimensional space, then keeping the principal components that account for 95% of the variance. This produces descriptors with 20-30 components, depending on the training set.

At interpretation time, we pick a diverse subset of k candidates from the close enough set of N candidates. We pick the first candidate uniformly at random from the set of N. For the nth pick, we choose the candidate with the maximum minimum distance in descriptor space from the previous n-1 selections. In other words, we choose the candidate farthest away from the closest previous pick.
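This greedy max-min (farthest-point) selection is easy to sketch. The code below is our illustration and assumes the candidates' PCA shape descriptors are already stacked into a matrix.

```python
import numpy as np

def select_diverse(descriptors, k, seed=0):
    """Greedy farthest-point selection of k rows from an (N, D) descriptor matrix.

    The first pick is uniform at random; each subsequent pick maximizes the
    minimum descriptor-space distance to the candidates chosen so far.
    """
    rng = np.random.default_rng(seed)
    n = descriptors.shape[0]
    chosen = [int(rng.integers(n))]
    # min_dist[i] = distance from candidate i to its nearest chosen candidate
    min_dist = np.linalg.norm(descriptors - descriptors[chosen[0]], axis=1)
    for _ in range(min(k, n) - 1):
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        d = np.linalg.norm(descriptors - descriptors[nxt], axis=1)
        min_dist = np.minimum(min_dist, d)
    return chosen

# Toy usage: 1000 candidates with 25-D shape descriptors, keep 20 diverse ones.
desc = np.random.default_rng(1).normal(size=(1000, 25))
print(select_diverse(desc, k=20))
```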

4.3 Null Candidates


In some cases the candidate selection process fails to find any shape patches suitable for a given keypoint. To handle these cases, we include a null or dummy label at each keypoint that provides no shape explanation but allows the inference to bail out if it cannot find an acceptable solution. The penalty for choosing this label should intuitively be the maximum error we are willing to tolerate in the interpretation. A negative aspect of allowing null candidates is that the shape interpretation can become disconnected. To keep the shape in one piece, we disallow null candidates for the finest-scale keypoints.

5 Occlusion-Aware Compatibility Scoring


Once a set of candidates is selected for each keypoint, we compute the likelihood of each candidate and the compatibilities between candidates to set up the MRF inference step. Each candidate is scored by its appearance match with the test image and its shape compatibility with the candidates at each neighboring keypoint. Shape compatibility scoring is involved because each patch may explain only part of its assigned area, and the context of the test image can affect the compatibility of two candidates even when their ownership masks do not overlap.

Consider Figure 5a: in order to avoid a combinatorial explosion in the necessary size of the training set, it must be possible for two separate shape patches

Fig. 5. Depth layer labeling. Left: test image with challenging areas outlined. Middle: contours of the test image (black) define regions in the overlap area and a graph of connections between them. Right: the depth map of each candidate shape is used to assign an ordinal layer to each overlap region. The compatibility of the candidates is the compatibility of the graph labelings, weighted by the size of the overlap regions. In (a), the two shapes are compatible because the overlap graph allows for a gap; in (b), the two shapes are incompatible because they compete to label both sides of the single contour; in (c), the large-scale patch has a single depth layer in the entire patch, but two layers inside the overlap region; in (d), at the T-junction the two labelings are slightly incompatible, but the incompatible labeling has small weight.

to explain two nearby contours. Simultaneously, two separate patches should not
claim opposite sides of the same contour (Figure 5b). The imperfect alignment
of shape patches to image features makes exact matching of pixels unreliable.
Instead, we propose a scoring mechanism based on the regions of the test image
inside the overlap area (Figure 5, middle and right).

5.1 Overlap Regions and Graph


The regions of the overlap area O between two patches are found by applying
image segmentation to the test image pixels restricted to the overlap area. As
detailed below, our scoring metric is robust to oversegmentation and we only
need a very basic segmentation algorithm. We currently edge detect, binarize,
then ood ll (bwlabel in MATLAB) the overlap area to nd regions.
Once we have regions Oi of the overlap area, we construct a graph with a
node for each Oi and edges between the Oi that abut in the test image (Figure 5,
middle). We say a candidate patch owns a node in this graph if more than
50% of the corresponding region's pixels are covered by the patch's ownership
mask. Labeling is only compared between nodes that are either owned by both
patches or that border nodes owned by both patches.
Each candidate patch assigns a labeling to Oi by constructing a local depth
layer representation inside the overlap area. The local depth layers are
constructed as follows: the depth map of the patch is masked by the overlap area

and divided into regions Ri by the patch's occluding contours. The depth values
inside each Ri are averaged, then the average values are sorted. The index j of
Rj in the sorted list is the local depth layer. Each Oi takes the mean of the local
depth layer pixels inside its region, rounded to the nearest integer.
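The local depth-layer construction could be sketched as follows, assuming the patch's occluding contours are available as a binary mask; all names are ours.

```python
import numpy as np
from scipy import ndimage

def local_depth_layers(depth_map, overlap_mask, occluding_contours):
    """Assign an ordinal depth layer to every pixel of the overlap area."""
    # Split the masked depth map into regions R_i bounded by occluding contours.
    regions, n = ndimage.label(overlap_mask & ~occluding_contours)
    # Average depth per region, then sort: rank in the sorted list = local layer index.
    mean_depths = ndimage.mean(depth_map, labels=regions, index=np.arange(1, n + 1))
    order = np.argsort(mean_depths)                 # region labels sorted near-to-far
    layer_of_region = np.empty(n + 1, dtype=int)
    layer_of_region[0] = 0                          # pixels outside the overlap area
    layer_of_region[1 + order] = np.arange(1, n + 1)
    return layer_of_region[regions]                 # per-pixel ordinal layer
```

Each Oi then takes the rounded mean of these per-pixel layers inside its region, as stated above.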

5.2 Depth Layer Compatibility


Because the depth layer representation is ordinal, not metric, scoring the
compatibility of two patches requires a fitting operation performed on the overlap
regions Oi, as follows.
First we select a subset of the overlap regions Oi in which to compute
compatibility. For depth layers, the important regions are those either owned by a
shape patch or adjacent to an owned region, and other regions in the graph may
be ignored (e.g., Figure 5a). Let O_Ai be the regions owned by or bordering a
region owned by patch A, and O_Bi be the same for patch B. Then the subset
of interest is:

    O_i = O_Ai ∩ O_Bi    (4)
Next we define the label sets Ai and Bi using the layer labels assigned by
the two candidates to Oi, but compressed so that they have no gaps (i.e., [1 3]
becomes [1 2]).
Finally, we perform a linear fit of Ai against Bi and vice versa. Define b as the
label set to fit against and a as the other set. Define w as the vector of weights
corresponding to the relative areas of Oi. Then we solve

    [aw, w] x = b    (5)

for the two-element parameter vector x in the least-squares sense. The best-fit
labels ã are found by multiplying through with x. The score of the fitting is then

    S_ab = Σ_{i=1..n} |ã_i − b_i| w_i    (6)

The final compatibility score is the max of the fitting scores in both directions.
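Read as a weighted affine fit of one label set to the other, Eqs. 5-6 amount to a two-parameter least-squares solve followed by a weighted L1 residual; the sketch below implements that reading (the use of area weights in the fit is our interpretation of Eq. 5, and all names are ours).

```python
import numpy as np

def depth_layer_fit_score(a, b, w):
    """Fit the layer labels a to b with an affine map (Eq. 5), score the residual (Eq. 6)."""
    a, b, w = (np.asarray(v, dtype=float) for v in (a, b, w))
    # Weighted least squares for the two-element x in  [a, 1] x ~= b,
    # with region-area weights w (our reading of Eq. 5).
    A = np.column_stack([a, np.ones_like(a)])
    sw = np.sqrt(w)
    x, *_ = np.linalg.lstsq(A * sw[:, None], b * sw, rcond=None)
    a_fit = A @ x                                   # best-fit labels (a-tilde)
    return np.sum(np.abs(a_fit - b) * w)            # Eq. 6: weighted L1 residual

def depth_layer_compatibility(a, b, w):
    # The final compatibility score is the max of the fit in both directions.
    return max(depth_layer_fit_score(a, b, w), depth_layer_fit_score(b, a, w))
```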

5.3 Normal Map Compatibility


Normal map compatibility is defined as the mean L1 distance between the 3D
normals of each patch. Because the ownership masks (areas where normals are
defined) of the two patches will not in general align perfectly, we only compute
the normal map compatibility inside the overlap regions Oi that are directly
owned by both patches. Pixels inside an owned region but outside the ownership
mask receive a value of (0, 0, 0) for purposes of comparison.

5.4 Local Appearance Score


Scoring local appearance is straightforward. The basic idea is to define the score
as the average difference in pixel value between the test image and the candidate
patch, for pixels where ownership is nonzero. However, naive masking has the
undesirable effect of benefiting patches with vanishingly small masks, since they
have fewer chances to make errors. We compromise by computing the average
error as usual, but rolling off to a constant "not explained" error if the area of
the ownership mask is too small (< 15% of the patch area).
In addition, the average difference of pixel values is very vulnerable to slight
misalignments in the matching patches, and misalignment of patches is the rule
rather than the exception. To make our metric robust to (slight) misalignment,
we first blur both the test and candidate patches with a Gaussian of σ = r/3,
where r is the radius of the patch. See [16] for details of the roll-off and blur.
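A rough sketch of this appearance score, with the σ = r/3 blur and a simple hard cutoff standing in for the smooth roll-off detailed in [16]; grayscale, same-size patches and all names are our assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def appearance_score(test_patch, cand_patch, ownership, not_explained=1.0,
                     min_mask_frac=0.15):
    """Masked mean absolute difference with a cutoff for tiny ownership masks."""
    # Patches assumed grayscale and square, of the same size.
    r = test_patch.shape[0] / 2.0                        # patch radius in pixels
    blur = lambda p: gaussian_filter(p, sigma=r / 3.0)   # robustness to misalignment
    diff = np.abs(blur(test_patch) - blur(cand_patch))
    if ownership.mean() < min_mask_frac:                 # mask too small: constant penalty
        return not_explained                             # (hard cutoff; the paper rolls off)
    return diff[ownership > 0].mean()
```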

5.5 Prior Probability Score


The candidate selection process ensures that a diverse set of shapes is present
at each keypoint. It is important to give greater weight to the more common
shape candidates than to the unusual ones, to avoid inferring a possible but very
implausible shape. Given a set of candidate patches C, we compute the prior
probability of a patch Pi using kernel density estimation on the training set
patches near Pi in shape descriptor space. The density is estimated by filtering
the space of descriptors with an N-dimensional Gaussian with σ = r/9, where r
is the distance between the farthest two patches in C. The density provides an
unnormalized probability p_i, which we convert to a likelihood score:

    l_i = log( p_i / Σ_{j=1..|C|} p_j )    (7)

This prior estimate is simple but considerably improves the quality of results,
especially for ambiguous stimuli such as line drawings.
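One way to realize this prior is sketched below: a Gaussian kernel density estimate over the training patches, evaluated at each candidate and converted to the log-ratio of Eq. 7. The exact filtering scheme in the paper may differ, and all names are ours.

```python
import numpy as np

def prior_likelihoods(cand_descriptors, train_descriptors):
    """Unnormalized KDE prior for each candidate, converted to a log-likelihood (Eq. 7)."""
    C = np.asarray(cand_descriptors, dtype=float)   # (|C|, D)
    T = np.asarray(train_descriptors, dtype=float)  # (N_train, D)
    # Kernel bandwidth: sigma = r / 9, with r the largest inter-candidate distance.
    pairwise = np.linalg.norm(C[:, None, :] - C[None, :, :], axis=-1)
    sigma = pairwise.max() / 9.0
    # Gaussian KDE centered on the training patches, evaluated at each candidate.
    d2 = np.linalg.norm(C[:, None, :] - T[None, :, :], axis=-1) ** 2
    p = np.exp(-d2 / (2.0 * sigma ** 2)).sum(axis=1)     # unnormalized density p_i
    return np.log(p / p.sum())                           # l_i = log(p_i / sum_j p_j)
```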

6 Implementation and Results


Our approach produces an estimate of the surface normal at each point and an
estimate of the rendering style of the test image. We present results on synthetic
data rendered in six styles: line drawings with occluding and suggestive contours [17],
diffuse shading with frontal illumination, glossy shading with frontal
illumination, isotropic solid texture with no lighting, texture with diffuse shading,
and texture with glossy shading.

6.1 Training and Test Set


The training data for our approach consists of many synthetic renderings of
abstract blobs. The blobs are constructed by finding isosurfaces of filtered, 3D


Fig. 6. Examples of normal map estimation for varying rendering styles. Top: ground
truth normal map. (a,b,c): line drawing, diffuse and glossy shading with solid material,
with interpreted normal map. (d): same as (c) but with texture only, diffuse and glossy
shading with texture. Bottom: normal interpolation from shape boundaries.

white noise. We generate these shapes procedurally so that we can produce an


arbitrary amount of training data. For each shape (and set of random camera
positions), we render a depth map, a normal map, an image containing occlusion
boundaries, and one image for each of the six rendering styles.
The training set itself is constructed by detecting keypoints in each rendered
image (see Section 3) at four scale levels: radius 16, 32, 64, and 128 pixels. The
input images were 300×300 pixels. We use a training set of N = 96 shapes,
k = 20 cameras per shape, 100–200 keypoints per rendered image, and six
styles, for a total of approximately 1.2M patches. The entire training set is <1 GB,
a modest size compared to other data-driven vision systems (e.g., [18]).
The test set is 10 shapes with the same blobby characteristics as the training
set shapes. All shapes and renderings are available in [16].

6.2 Shape Interpretation

Given a full set of candidate patches and likelihood scores between them, finding
the most probable shape collage is a straightforward MRF inference task. We
use min-sum loopy belief propagation for this purpose.
To produce a complete surface from the most likely shape collage, we fit a
thin-plate spline to the most likely patches, taking care to allow the spline to split
at occluding contours. An example collage and fit surface is shown in Figure 2.

To measure and visualize our results, we mask out background pixels using
ground truth. Additional details of our inference and fitting steps may be found
in [16].

6.3 Normal Map Estimation

The principal goal of our method is to estimate a normal map for the stimulus
image. Figure 6 shows example normal maps computed by our system. Table 1
shows the average errors for our test set against ground truth.
For a baseline comparison we propose the following simple method, similar to
that proposed by Tappen [19]: clamp the normals to ground truth values at the
occluding contour, and smoothly interpolate the normals in the interior of the
shape. This method produces smooth blobs (Figure 6, bottom). Note that while
the interpolation method is very simple, it is still given ground truth normals at
shape boundaries, whereas our method is not.
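The paper does not specify the interpolation scheme for this baseline; one simple realization is harmonic (Laplace) filling of the interior with the contour normals clamped, sketched below with our own names and a fixed iteration count.

```python
import numpy as np

def interpolate_normals(boundary_normals, interior_mask, iters=2000):
    """Baseline: smoothly interpolate normals inward from the occluding contour.

    boundary_normals: (H, W, 3) array, valid where interior_mask is False;
                      interior values are overwritten by the interpolation.
    interior_mask:    (H, W) bool array of pixels to fill (away from the image border).
    """
    n = boundary_normals.astype(float).copy()
    for _ in range(iters):                      # Jacobi iterations of Laplace's equation
        avg = 0.25 * (np.roll(n, 1, 0) + np.roll(n, -1, 0) +
                      np.roll(n, 1, 1) + np.roll(n, -1, 1))
        n[interior_mask] = avg[interior_mask]   # boundary values stay clamped
    norm = np.linalg.norm(n, axis=-1, keepdims=True)
    return n / np.maximum(norm, 1e-8)           # re-normalize to unit normals
```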
Overall our method produces accurate interpretations of the stimuli, with average
angular error between 20° and 26° for the shaded and textured stimuli, and
32° for line drawings (Table 1). In some cases, particularly the line drawings, the
input stimulus is ambiguous, and our system proposes a plausible interpretation
that is different from the original shape (e.g., Figure 6c). We also tested with
training sets containing only patches of the same style as the input and found a
small improvement in accuracy.
The running time for a single stimulus is approximately 20 minutes on a
modern, multicore PC. The implementation is written entirely in MATLAB.

6.4 Style Prediction

The inferred patches can also be used trivially to approximate the stimulus image
and thus make an estimate of the rendering style (Figure 7). We simply take the
original appearance of each training patch and average it with the appearance
of its neighbors. To make an estimate of the stimulus style, we simply find the
style used by the majority of the selected shape patches. In our experiments the
styles are sufficiently distinct that this estimate is very accurate.

Table 1. Average per-image RMS error in degrees for normal estimates for each style,
trained with all styles and only with the target style. Boundary interpolation does not
vary between styles. Front-facing error is deflection from a front-facing normal.

                                  line  diffuse  glossy  only tex.  tex.+diff.  tex.+gloss  all
Training set w/ all styles         32     20       26       23         21           23       24
Train. set w/ only target style    30     20       25       23         21           23       24
Boundary interpolation              -      -        -        -          -            -       39
Front-facing                        -      -        -        -          -            -       46

[Fig. 7 panels: three stimulus / reconstructed image pairs]

Fig. 7. Reconstructing the stimulus from patch appearance. Reconstructed patches are
usually faithful to the original rendering, though they are sometimes confused (boxed). The
set of styles used for reconstruction gives an estimate of the stimulus style.

6.5 Discussion
So far we have experimented with synthetic datasets. The renderings were chosen
to span a range of styles that challenge parametric interpretation systems.
Suggestive contours, for example, have been shown to mimic lines drawn by
artists [20], but are often disconnected and noisy. Disconnected lines violate the
major assumption of shape-from-line systems (e.g., [6]) that the line drawing
be a complete, connected graph. The glossy, textured stimuli violate the Lambertian
assumptions of even recent work on shape-from-shading (e.g., [21]). Our
system can interpret all six styles with no modification and a single training set.
There are several limitations to the current method that must be overcome
in order to process real-world stimuli. Most importantly, we need a rich training
set of photographs and associated 3D shape. Such data may become more
commonplace as 3D acquisition hardware becomes more robust and easy to use.
Also, realistic computer graphics renderings may provide an accurate enough
approximation of real photographs to provide training data (similar to [22]).
Some aspects of the algorithm itself would also require extension. The appearance
matching step (Section 4.1), for example, matches based on the appearance
of the entire patch. This approach works well when the background is empty,
such as in our stimuli, but can become confused by noise or distracting elements.
An extended method could match appearance based on the ownership masks
of training set patches. The appearance scoring metric (Section 5.4) would also
need to be adapted to more general stimuli.

7 Conclusion
We have presented an approach to infer a normal map from a single image of a 3D
shape. The treatment of occlusion between patches and the irregular, interest-
driven patch placement that we introduce dramatically reduce the complexity
of example-based shape interpretation, allowing our method to interpret images
with multiple layers of depth and self-occlusion using a moderately sized training
set. Because it is data-driven, our method can interpret ambiguous stimuli such
as sparse line drawings and complex stimuli such as glossy, textured surfaces,
and it uses the same machinery in all cases. To our knowledge, this property is
unique among current vision systems.

Acknowledgements. We thank Ce Liu and Jon Barron for helpful comments.


This material is based upon work supported by the National Science Foundation
under Grant No. 1111415, by ONR MURI grant N00014-09-1-1051, by NIH
grant R01-EY019262 to E. Adelson, and a grant from NTT Research Labs to E.
Adelson. P. Isola is supported by an NSF graduate research fellowship.

References
1. Roberts, L.: Machine perception of three-dimensional solids, dspace.mit.edu (1963)
2. Durou, J., Falcone, M., Sagona, M.: Numerical methods for shape-from-shading: A new
survey with benchmarks. Comp. Vis. and Img. Understanding 109, 22–43 (2008)
3. Malik, J., Rosenholtz, R.: Computing local surface orientation and shape from
texture for curved surfaces. IJCV 23, 149–168 (1997)
4. Hassner, T., Basri, R.: Example based 3d reconstruction from single 2d images. In:
CVPR Workshops: Beyond Patches, pp. 1–8 (2006)
5. Saxena, A., Sun, M., Ng, A.: Make3d: Learning 3d scene structure from a single
still image. PAMI 31, 824–840 (2009)
6. Malik, J.: Interpreting line drawings of curved objects. IJCV (1987)
7. Wang, Y., Chen, Y., Liu, J., Tang, X.: 3d reconstruction of curved objects from
single 2d line drawings. In: CVPR, pp. 1834–1841 (2009)
8. Xue, T., Liu, J., Tang, X.: Symmetric piecewise planar object reconstruction from
a single image. In: CVPR, pp. 2577–2584 (2011)
9. Saund, E.: Logic and MRF circuitry for labeling occluding and thinline visual
contours. In: Neural Information Processing Systems, NIPS (2005)
10. Ouyang, T., Davis, R.: Learning from neighboring strokes: combining appearance
and context for multi-domain sketch recognition. In: NIPS (2009)
11. Freeman, W., Pasztor, E., Carmichael, O.: Learning low-level vision. International
Journal of Computer Vision 40, 25–47 (2000)
12. Fitzgibbon, A.W., Wexler, Y., Zisserman, A.: Image-based rendering using image-
based priors. In: Intl. Conf. Computer Vision, ICCV (2003)
13. Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R.,
Kipman, A., Blake, A.: Real-time human pose recognition in parts from a single
depth image. In: CVPR (2011)
14. Harris, C., Stephens, M.: A combined corner and edge detector. In: Alvey Vision
Conference, vol. 15, p. 50 (1988)
15. Lowe, D.: Object recognition from local scale-invariant features. In: ICCV (1999)
16. Supplemental Material (2012), http://people.csail.mit.edu/fcole/shapecollage
17. DeCarlo, D., Finkelstein, A., Rusinkiewicz, S., Santella, A.: Suggestive contours
for conveying shape. ACM Trans. Graph. 22, 848–855 (2003)
18. Hays, J., Efros, A.: Im2gps: estimating geographic information from a single image.
In: CVPR, pp. 1–8 (2008)
19. Tappen, M.: Recovering shape from a single image of a mirrored surface from
curvature constraints. In: CVPR, pp. 2545–2552 (2011)
20. Cole, F., Golovinskiy, A., Limpaecher, A., Barros, H., Finkelstein, A., Funkhouser,
T., Rusinkiewicz, S.: Where do people draw lines? In: SIGGRAPH (2008)
21. Barron, J.T., Malik, J.: High-frequency shape and albedo from shading using natural
image statistics. In: CVPR (2011)
22. Kaneva, B., Torralba, A., Freeman, W.: Evaluation of image features using a
photorealistic virtual world. In: ICCV (2011)
Interactive Facial Feature Localization

Vuong Le¹, Jonathan Brandt², Zhe Lin², Lubomir Bourdev³, and Thomas S. Huang¹

¹ University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
² Adobe Systems Inc., San Jose, CA 95110, USA
³ Facebook Inc., Menlo Park, CA 94025, USA

Abstract. We address the problem of interactive facial feature localization
from a single image. Our goal is to obtain an accurate segmentation
of facial features on high-resolution images under a variety of pose, expression,
and lighting conditions. Although there has been significant
work in facial feature localization, we are addressing a new application
area, namely to facilitate intelligent high-quality editing of portraits,
that brings requirements not met by existing methods. We propose an
improvement to the Active Shape Model that allows for greater independence
among the facial components and improves on the appearance
fitting step by introducing a Viterbi optimization process that operates
along the facial contours. Despite the improvements, we do not expect
perfect results in all cases. We therefore introduce an interaction model
whereby a user can efficiently guide the algorithm towards a precise solution.
We introduce the Helen Facial Feature Dataset consisting of annotated
portrait images gathered from Flickr that are more diverse and
challenging than currently existing datasets. We present experiments
that compare our automatic method to published results, and also a
quantitative evaluation of the effectiveness of our interactive method.

1 Introduction
Accurate facial component localization is useful for intelligent editing of pictures
of faces, such as opening the eyes, red eye removal, highlighting the lips, or
making a smile. In addition, it is an essential component for face recognition,
tracking, and expression analysis. The problem remains challenging due to pose
and viewpoint variation, expression, or occlusion (e.g., with sunglasses, hair or a
beard). While the techniques have improved a lot over the past five years [1–5],


we are still far from a reliable and fully-automated facial component localization
system. For all practical purposes, an interactive, user-assisted face localization
would be necessary.
Our goal is a facial component localization system that achieves high-fidelity
results with minimal user interaction. Our contributions are as follows:

1. A novel facial component localization algorithm designed to accommodate
partial labels. We use a part-based formulation which allows us to edit one
part without affecting the correct localization of other face parts. The landmark
predictions of each face part are jointly fit by finding the global minimum
of the energy using dynamic programming.
2. An extension of the feature localization algorithm to accommodate partial
observations. We allow the user to specify landmarks and adjust the remaining
ones to minimize the residual error.
3. A new challenging dataset for facial component localization which contains
2330 high-resolution, accurately labeled face images and has a larger degree of
out-of-plane orientation and occlusion typical of real-world scenarios.

We show that our system in a fully automated mode outperforms other state-of-the-art
systems on this challenging dataset. Furthermore, we show that the error
decreases significantly after only a minimal number of user corrections.

2 Related Work
Active shape model (ASM) [6] and Active appearance model (AAM) [7] form
classic families of methods for facial feature point detection. Comparing these
two models, ASM has the advantages of being more accurate in point (or contour)
localization, less sensitive to lighting variations, and more efficient; hence it is more
suitable for applications requiring accurate contour fitting.
Since the first introduction of the traditional ASM, there have been many
extensions for improving its robustness, performance, and efficiency. For example,
[8] employs mixtures of Gaussians for representing the shape, [9] uses Kernel
PCA and SVM and models nonlinear shape changes by 3D rotation, [10] applies
robust least-squares algorithms to match the shapes to observations, [11] uses
more robust texture descriptors to replace the 1D profile model and uses k-nearest-neighbor
search for profile searching, and [12] relies on Bayesian inference.
Among those extensions to the classical ASM, the recent work of Milborrow and
Nicolls [1], with the introduction of the 2D profile model and a denser point set,
obtained promising results through quantitative evaluation. The recent work [4]
combines global and local models based on an MRF and an iterative fitting scheme,
but this approach is mainly focused on localizing a very sparse set of landmark
points.
Other notable recent works exploring alternatives to ASM include [2]. In [2],
a robust approach for facial feature localization is proposed via a discriminative
search approach combining component detectors with learned directional classifiers.
In [3], a generative model with a shape regularization prior is learned and

Fig. 1. Sample images from the Helen Facial Feature Dataset

used for face alignment to robustly deal with challenging cases such as expression,
occlusion, and noise. As an alternative to the parametric model approaches,
a principled optimization strategy with nonparametric representations for
deformable shapes was recently proposed in [13].
A component-based ASM is introduced in [14], which implements a set of
ASMs for facial features and stacks up the PCA parameters into one long vector,
then fits those vectors in a single global model using a Gaussian Process
Latent Variable Model to handle nonlinear distributions. In comparison with our
approach, they do not model the global spatial configuration of features of faces
and therefore may generate invalid configurations. In [2], the face components are
found by an AAM-style fitting method and a discriminative classification is used
for predicting the movement of components. In another effort to explore the
configuration of face parts, a Bayesian objective function is combined with local
detectors in [15] for localization of facial parts.
Motivated by the pictorial structure model in [16], we aim to decompose a
face shape into components, and model variations of individual components as
well as the spatial configuration between components. In [16], the face/human
pose inference problem is modeled as a part-based problem formulated as energy
minimization, where the energy (or cost) is expressed as the linear combination
of a unary fitting term and a pair-wise potential term. Following this work,
we represent the global face shape model as a set of inter-related components,
and model their spatial relationships. Unlike [16], where the unknown parameter
space is low dimensional, in our approach we need to estimate component
locations as well as shapes, which is a higher dimensional space. Therefore, we
adopt PCA to model both the component shape variation and the relative spatial
configuration. Due to the component decomposition, our approach is more
flexible in modeling face shapes for handling larger pose changes and has better
generalization capabilities than the standard ASM. Our approach uses PCA to
model the relative positions between components, hence it is simpler and more
efficient than [4], which relies on expensive MRF inference.

3 Helen Facial Feature Dataset


Our goal is a facial feature localization algorithm that can operate reliably and
accurately under a broad range of appearance variation, including pose, lighting,
expression, occlusion, and individual differences. In particular, it is necessary
that the training set include high-resolution examples so that, at test time, a high-resolution
test image can be fit accurately. Although a number of face databases
exist, we found none that meet our requirements, particularly the resolution
requirement. Consequently, we constructed a new dataset using annotated Flickr
images.
Specifically, the dataset was constructed as follows: First, a large set of candidate
photos was gathered using a variety of keyword searches on Flickr. In
all cases the query included the keyword "portrait" and was augmented with
different terms such as "family", "outdoor", "studio", "boy", "wedding", etc.
(An attempt was made to avoid cultural bias by repeating the queries in several
different languages.) A face detector was run on the resulting candidate set
to identify a subset of images that contain sufficiently large faces (greater than
500 pixels in width). The subset was further filtered by hand to remove false
positives, profile views, as well as low-quality images. For each accepted face,
we generated a cropped version of the original image that includes the face and
a proportional amount of background. In some cases, the face is very close to or
in contact with the edge of the original image and is consequently not centered
in the cropped image. Also, the cropped image can contain other face instances,
since many photos contain more than one person in close proximity.
Finally, the images were hand-annotated using Amazon Mechanical Turk to
precisely locate the eyes, nose, mouth, eyebrows, and jawline. (We adopted the
same annotation convention as the PUT Face Database [17].) To assist the Turk
worker in this task, we initialized the point locations to be the result of the
STASM [1] algorithm that had been trained on the PUT database. However,
since the Helen Dataset is much more diverse than PUT, the automatically
initialized points were often far from the correct locations.
In any case, we found that this particular annotation task required an unusual
amount of review and post-processing of the data in order to ensure high quality
results. Ultimately this is attributable to the high number of degrees of freedom
involved. For example, it frequently happened that a Turk worker would permute
components (swap eyes and brows or inner lip for outer lip), or alternatively shift
the positions of the points sufficiently that their roles would be changed (such as
selecting a different vertex to serve as an eye or mouth corner). Graphical cues
in the interface, as well as a training video and qualifying test were employed
to assist with the process. Also, automated processes were developed to enforce
consistency and uniformity in the dataset. In addition to the above, the faces
were reviewed at the component level manually by the authors to identify errors
in the annotations. Components with unacceptable error were resubmitted to
the Turk for correction.
The resulting dataset consists of 2000 training and 330 test images with highly
accurate, detailed, and consistent annotations of the primary facial components.

A sampling of the dataset is depicted in Fig. 1. The full dataset is publicly


available at http://www.ifp.illinois.edu/~vuongle2/helen.

4 Facial Feature Localization


We first briefly review the global ASM, and then derive our new component-based
model and the facial feature localization algorithm.

4.1 Classic ASM


In the classical global ASM, the shape is holistically represented as a set of
pre-defined points, called landmarks, along the shape contour (Fig. 1 left), and
described by a vector concatenating all the x coordinates of the ordered landmark
points followed by all the y coordinates. We relate one shape with another using a
similarity transformation.
ASM is composed of two submodels: the profile model and the shape model.
At each landmark, the profile model is the normalized gradient vector in the
direction orthogonal to the shape boundaries. The profile distance is computed
as the Mahalanobis distance over the training set. The global shape model is the
statistics of the global shape vector for the training set, and is represented as a
PCA linear subspace.
Given a face rectangle obtained by a generic face detector, ASM fitting first
initializes the landmark points by placing the mean shape¹ in the detected face
frame, and then repeats the following steps: (i) adjust each landmark point
independently by profile template matching; (ii) fit a global shape model to
the adjusted point set. Due to the greedy scheme, many of those individually
adjusted points might be incorrect, and the point set will be regularized based
on the global shape model.

4.2 Component-Based ASM


Shape Model. The classical global ASM relies on a dense correlation matrix of
all the member landmark locations in a single Gaussian model. As a consequence,
it imposes strong spatial constraints for every pair of landmarks in the face. These
constraints make the model work fairly well on studio data, where the variances
are small and test images are similar to training samples. However, when applied
to natural photos with large variations in face postures and occlusions, the full
correlation matrix of the global ASM tends to be too strict and cannot simply be
generalized to a wide variety of face configurations. An example of a failure case
for the global ASM is shown in Fig. 2(a).
Inspired by this observation and the pictorial structure work [16], we introduce
a new component-based model, referred to as the Component-based ASM (CompASM),
for handling the wide variety of variations in natural face images. In our
model, shape variation of each facial component is modeled independently, up
to a similarity transformation, and the relative positions of facial components
are encoded by a configuration model.

¹ The mean shape refers to the average of aligned shapes for training faces, where
alignment is done by fitting with a similarity transformation.

Fig. 2. (a) A failure example of the classic ASM on a face image by STASM [1]. The
bigger left eye constrains the right eye and the tilted right brow pulled the left brow
off the correct location. (b) A better fitting result using our CompASM.

We specifically consider a facial landmark set as a union of seven components:
jawline, nose, lips, left eye, right eye, left brow, right brow. Each component
has its own coordinate frame, called the local frame, centered at the center of
mass of the component. Those centers are successively represented in a higher-level
canonical face frame called the configuration frame. An illustration of our
component-based model is shown in Fig. 3.
In this model, there are three coordinate frames: the global, configuration, and
local frames. The j-th landmark point of component i has global coordinates p_ij
in the global coordinate frame, and local coordinates q_ij in the local coordinate
frame.
In the local frame, the coordinates of the landmark points for a component are
represented as a single shape model centered at the component's centroid. It is
a combination of principal vectors obtained by PCA:

    q_i = q̄_i + Φ_i b_i    (1)

where q_i is the concatenation of the q_ij; q̄_i and Φ_i are the mean vector and the shape
basis learnt from the training data, respectively; and b_i denotes the set of linear
coefficients of the fitting.
In the configuration frame, the location of each component's centroid is represented
by a displacement vector t_i from the face center. The linear subspace
representation of these vectors is learnt to be:

    t = t̄ + Ψ r,    (2)

where Ψ is the configuration basis, and r denotes the set of PCA coefficients.
The global coordinates of the landmark points can be obtained from the local
coordinates and configuration coordinates by a similarity transform:

    p_ij = sR(q_ij + t_i) + t_0,    (3)

where R, s, and t_0 denote the rotation, scale, and translation that align a face
shape from the global frame with the configuration frame; t_i denotes the location

Fig. 3. Illustration of our component-based model

coordinates of component i in the configuration frame. While the t_i are different from
component to component, the three similarity transform parameters in s and R
are shared among the components.
Taking the mean of Eq. 3 for each component, we obtain p̄_i = sR(q̄_i + t_i) + t_0,
where q̄_i is the local coordinate of the centroid of component i and is effectively zero.
We can therefore re-write it as p̄_i = sR t_i + t_0.
Further combining the equations for all components, the model can be written
in concatenated form as:

    p̄ = sR t + t_0.    (4)
Eqs. 4 and 2 are directly used in the configuration fitting step of the shape search
algorithm; see Section 4.3 for the detailed algorithm.
In CompASM, the configuration model constrains the relative locations between
face components and is responsible for finding the orientation and scale of the
face and estimating optimal locations of components, while the local model for
each component is responsible for optimally fitting the component's shape model
to the observation. In other words, each local model is fitted independently, and hence
can handle larger global shape variations.
Furthermore, the fact that each component shape is a single connected contour
opens the possibility of jointly searching for multiple landmark points in the
profile model. The new profile model adopting joint landmark optimization is
introduced in the following section.
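To make the notation concrete, a minimal sketch of mapping per-component PCA coefficients back to global landmark coordinates (Eqs. 1-3) is given below; the coordinate packing (x coordinates first, then y) and all names are our own assumptions.

```python
import numpy as np

def reconstruct_landmarks(b, r, shape_mean, shape_basis, cfg_mean, cfg_basis,
                          s, R, t0):
    """Map per-component PCA coefficients to global landmark coordinates.

    b:  list of per-component shape coefficient vectors b_i (Eq. 1)
    r:  configuration PCA coefficients (Eq. 2)
    shape_mean[i], shape_basis[i]: mean and basis of component i
    cfg_mean, cfg_basis: mean and basis of the stacked centroid vector t
    s, R, t0: similarity transform aligning the face with the configuration frame
    """
    # Eq. 2: component centroids; vectors packed as all x's followed by all y's.
    t = (cfg_mean + cfg_basis @ r).reshape(2, -1).T       # (n_components, 2)
    points = []
    for i, b_i in enumerate(b):
        q_i = (shape_mean[i] + shape_basis[i] @ b_i).reshape(2, -1).T   # Eq. 1
        points.append((s * (R @ (q_i + t[i]).T)).T + t0)                # Eq. 3
    return np.vstack(points)
```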

Profile Model. In the classic ASM [6, 1], during the profile fitting step, at each
of the N landmarks of a given component we consider M candidate locations
along a line orthogonal to the shape boundary. We evaluate the profile model
at each location and pick the one with the highest score. The problem with this
greedy algorithm is that each landmark location is chosen independently of its
neighbors, and neighboring landmarks could end up far from each other.

Fig. 4. An example of profile search by the greedy method of the classic ASM (a) and by our
Viterbi global optimization (b)

To address this problem we introduce a binary constraint for each pair of
adjacent landmarks to ensure that they are at an appropriate distance from
each other. We jointly fit all the landmark locations by finding the sequence of
candidate locations with a maximum sum of unary scores (at each candidate
location) and binary scores (at each pair of adjacent locations). We can find the
globally optimal such sequence in O(NM²) time using dynamic programming
(the Viterbi algorithm [18]).
Our unary score is the probability of candidate location i of a given landmark,
defined as:

    p_i = (1 − d_i)/(M − 1)    (5)

where d_i is the Mahalanobis distance at the candidate location, normalized so
that Σ_{i=1..M} d_i = 1, which ensures that Σ_{i=1..M} p_i = 1.
Our binary score is the probability that two adjacent landmarks are a given
distance x apart. We model it using a continuous Poisson:

    p(x) = λ^x e^{−λ} / Γ(x + 1)    (6)

where x is the Euclidean distance between the two locations and Γ is the gamma
function. We chose the Poisson distribution because it has a suitable PDF and
a single parameter λ, which we fit separately for each component.
To find the optimal set of candidate locations we run Viterbi, which maximizes
the sum of the log probabilities. In our experiments section we show that our
joint optimization of the landmarks outperforms the greedy approach in the
traditional ASM algorithm. In addition, our model easily accommodates user
interaction, as discussed in Section 5. An example of the profile search results of
the two methods is depicted in Fig. 4.
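A compact sketch of this joint profile fit, combining the unary term of Eq. 5 with the continuous-Poisson pairwise term of Eq. 6 in a standard Viterbi pass; the λ value, array shapes, and names are illustrative assumptions.

```python
import numpy as np
from scipy.special import gammaln

def viterbi_profile_fit(unary, candidates, lam=10.0):
    """Jointly pick one of M candidate locations for each of N landmarks.

    unary:      (N, M) unary probabilities p_i (Eq. 5) for each landmark.
    candidates: (N, M, 2) candidate (x, y) locations along each profile line.
    lam:        Poisson parameter lambda of the pairwise term (Eq. 6).
    """
    N, M = unary.shape
    log_u = np.log(unary + 1e-12)

    def log_pair(a, b):                       # log continuous-Poisson of the distance
        x = np.linalg.norm(candidates[a][:, None, :] -
                           candidates[b][None, :, :], axis=-1)   # (M, M)
        return x * np.log(lam) - lam - gammaln(x + 1.0)

    score = log_u[0].copy()
    back = np.zeros((N, M), dtype=int)
    for n in range(1, N):                     # forward pass, O(N M^2)
        total = score[:, None] + log_pair(n - 1, n) + log_u[n][None, :]
        back[n] = np.argmax(total, axis=0)
        score = np.max(total, axis=0)
    path = [int(np.argmax(score))]            # backtrack the best sequence
    for n in range(N - 1, 0, -1):
        path.append(int(back[n][path[-1]]))
    return path[::-1]                         # chosen candidate index per landmark
```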

4.3 Our Algorithm

Here we describe how we combine our new shape model and profile model into
the facial feature localization algorithm. Given an input image, after initialization
by face detection, the shape model and profile model are applied in an
interleaved, iterative way in a coarse-to-fine manner in the pyramid.

In each particular iteration step, profile search is applied first, which gives
the suggested candidate locations of the landmarks. The landmark candidates
are then regularized by the local shape model. In each pyramid level, the fitting
process is performed by repeating configuration model fitting and local model
fitting until convergence.
The details of our facial feature localization method using CompASM are illustrated
in Algorithm 1.

Algorithm 1. Facial Feature Localization Algorithm

Detect faces, initialize the shape model based on the face rectangle
for each resolution level do
  repeat
    a. Do profile search to suggest new landmark locations
       a1. Collect unary scores by profile matching
       a2. Jointly find the optimal path using the Viterbi algorithm
    b. Update the landmark locations with the local shape and configuration model
       b1. Find the centroid of the suggested landmarks for each component
       b2. Fit the centroids to the configuration model using Eqs. 2 and 4
       b3. Apply the new configuration to the suggested landmarks using Eq. 3
       b4. Fit the local shape model PCA subspace to the landmarks using Eq. 1
    c. Form a new global shape by applying the inverse similarity transformation
  until number of points moved between two consecutive rounds < 20%
  Map the localized result to the next resolution level
end for
Return the result from the highest resolution level

5 User Interaction Model

In this section we describe our algorithm for user-assisted facial feature localization.
Once the automatic fitting is performed, the user is instructed to pick
the landmark with the largest error and move it to the correct location, after
which we adjust the locations of the remaining landmarks to take into account
the user input. This interaction step is called an interaction round. This process
is repeated until the user is satisfied with the results. Our goal is to adjust the
locations of the remaining landmarks so as to minimize the number of interaction
rounds.
Our update procedure consists of two main steps: Linearly scaled movement
followed by a Model fitting update.
In the Linearly scaled movement step, when the user moves a landmark p_i,
we move the neighboring landmarks on the same contour as p_i along the same
direction. The amount of movement for neighboring landmarks is proportional to
their proximity to p_i. Specifically, we identify landmarks p_a and p_b on both sides
of p_i which define the span of landmarks that will be affected. Each landmark
with index j ∈ (a, i] moves by ((a − j)/(a − i)) d, where d is the user-specified
displacement vector of p_i.² The landmark p_a is specified as the closest along the
contour to p_i of the following three landmarks: (1) the one ⌊n/4⌋ landmarks away,
where n is the number of landmarks in the component, (2) the nearest user-corrected
landmark, and (3) the end landmark if the component contour is open. The
landmark p_b on the other side of the contour is determined analogously.
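A rough sketch of the linearly scaled movement step, treating the landmarks of one component as an ordered array; the symmetric handling of the p_b side and all names are our own simplifications.

```python
import numpy as np

def linearly_scaled_move(points, i, a, b, d):
    """Propagate a user drag of landmark i to its neighbors on the same contour.

    points: (n, 2) float array of landmark coordinates (modified in place).
    i:      index of the user-moved landmark; d: its displacement vector.
    a, b:   indices bounding the affected span on either side of i.
    """
    d = np.asarray(d, dtype=float)
    points[i] += d                             # the dragged landmark gets the full move
    for j in range(a + 1, i):                  # span between p_a and p_i
        points[j] += (j - a) / (i - a) * d
    for j in range(i + 1, b):                  # symmetric treatment towards p_b
        points[j] += (b - j) / (b - i) * d
    return points
```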
After the rst step, in most cases, the neighboring points of pi have moved
close enough to the real curve that their prole models can snap to the right
locations. That enables the second step of Model tting update. In this step, we
do several component based ASM tting iterations at the highest resolution.
Unlike regular ASM tting, in these iterations we update only the current com-
ponent. We also adjust the tting so that the user-corrected landmarks dont
move. Specically, as we describe in Section 4, our tting step iterates between
shape model and prole model tting. During the shape model tting, we uti-
lize constrained linear least squares to nd the best location of the component
given that it must go through the xed landmarks. In our prole tting, as we
describe in Section 4.2 we specify candidate locations for each landmark and use
the Viterbi algorithm to nd the optimal contour. For user-specied landmarks,
we choose the user-specied location as the only candidate location, thereby
forcing Viterbi to nd a path through the user-specied landmarks.

6 Experiments

6.1 Comparison to STASM

Our algorithm is designed to allow for high independence and freedom in component
shape so that it can deal with natural photos with higher variations than
studio data. Nevertheless, we need to ensure that it works adequately well on
standard studio data as well. Therefore, in our first experiment, we compare
our algorithm with STASM [1, 19] on the standard datasets chosen in their work.
In this experiment, following [19], we train our algorithm on the MUCT dataset [19]
and test on the BioID dataset [20]. MUCT contains 3755 images of 276 subjects
taken in a studio environment. BioID consists of 1521 frontal images of 23 different
test persons taken under the same lighting.
On the BioID test set, the fitting performance of STASM, which features
the stacked model and 2D profile trained on MUCT, is reported to be the best
among current algorithms [19]. The MUCT and BioID point sets include five
standalone points; therefore their shapes cannot be separated into single contours.
Consequently, we are not able to use Viterbi-optimized profile searching
in this experiment. Following STASM, we evaluate the performance using the
me-17 measure. This measure is the mean distance between the fitting result of 17
points in the face and the manually marked ground truth, divided by the distance
between the two eye pupils. The comparison of the two algorithms' performance is
shown in Table 1. The result shows that both algorithms' results on BioID
are very close to perfect, with STASM making a slightly lower error.
² Note that neighboring landmarks in our dataset are equidistant.

Fig. 5. CDF of me-194 error for CompASM and STASM on the Helen dataset

Table 1. Comparison of STASM and CompASM on the two test sets. For
MUCT/BioID we use the me-17 measure. For Helen, we use the me-194 measure.

Dataset      Algorithm        Mean   Median  Min    Max
MUCT/BioID   STASM            0.043  0.040   0.020  0.19
MUCT/BioID   CompASM/Greedy   0.045  0.043   0.021  0.23
Helen        STASM            0.111  0.094   0.037  0.411
Helen        CompASM/Greedy   0.097  0.080   0.035  0.440
Helen        CompASM/Viterbi  0.091  0.073   0.035  0.402

While BioID is an easy dataset for fitting algorithms to deal with, the Helen
dataset is a more challenging one. The Helen data consists of natural images
which not only vary in pose and lighting but are also more diverse in subject identities
and expression. In this dataset, the face features are highly uncorrelated,
with a non-Gaussian distribution which our model can exploit better. To measure
fitting performance on Helen, we use me-194, similar to the me-17 measure,
which calculates the mean deviation of 194 points from ground truth normalized
by the distance between the two eye centroids. We divided the Helen images into a
training set of 2000 images and a testing set of 330 images. The comparison of STASM
and CompASM is shown in Table 1 and Figure 5. Table 1 shows that CompASM
outperforms STASM by 16%. The result shows that CompASM is more
robust over diverse face images. Some examples of fitting results of STASM and
CompASM are shown in Fig. 6.
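For reference, the me-17 / me-194 measures reduce to a normalized mean point-to-point error; a sketch is below, where the choice of indices used for the normalizing inter-eye distance depends on the annotation convention and is an assumption.

```python
import numpy as np

def normalized_mean_error(pred, gt, left_eye_idx, right_eye_idx):
    """Mean landmark deviation normalized by the inter-eye distance (me-17 / me-194)."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)   # (n_points, 2) each
    eye_dist = np.linalg.norm(gt[left_eye_idx].mean(axis=0) -
                              gt[right_eye_idx].mean(axis=0))
    return np.linalg.norm(pred - gt, axis=1).mean() / eye_dist
```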

6.2 Evaluation of User Interaction


Here we evaluate the effectiveness of our method described in Section 5 to refine
the automatically determined facial component locations through user interaction.
Qualitatively, we wish to minimize the user's effort in obtaining a

Fig. 6. Some examples of fitting results on the Helen test set of CompASM (first row)
and STASM (second row)

[Fig. 7 plot: mean me-194 error vs. interaction round (rounds 2–14), for four strategies:
linearly scaled movement and refit; linearly scaled movement, no refit; move isolated
point and refit; move isolated point, no refit]

Fig. 7. The mean me-194 error measure across the Helen test set as a function of
interaction round. One interaction round constitutes a single user action (see text for
details). Each curve is the mean value across the test set under varying experimental
conditions. Using both the linearly scaled movement and constrained refitting steps
results in the most effective reduction in error.

satisfactory fitting result. More concretely, we wish to observe how the fitting
error decays as a function of the number of interaction steps.
To this end, we simulate the user's typical behavior when interacting with the
system, namely to choose the landmark point located farthest from its true position
and move it to its true position. After being moved, it becomes a constraint
for subsequent fitting rounds as described in Sec. 5.
We repeat this procedure for a number of rounds and record the me-194 error
measure at each round. Fig. 7 depicts the distribution of the error across our test
set over 15 interactivity rounds. To evaluate the effectiveness of the two steps
of interactivity described in Sec. 5, we compare the performance of the case

Fig. 8. Examples of a sequence of editing steps where the most erroneous point is
selected at each step. The red filled point denotes the point selected for editing. The
open yellow circles denote points that have already been edited and fixed.

where both of the update steps are done with that of the case where only one
of the steps is done, and also against a baseline where only the most erroneous
point is moved. The comparative effectiveness of these strategies is apparent in
Fig. 7, from which we can conclude that both steps used in conjunction are most
effective at reducing the overall error most quickly. Fig. 8 depicts some example
runs of the interaction process. More examples with animation are available at
the website http://www.ifp.uiuc.edu/~vuongle2/helen/interactive

7 Conclusion
In this paper, we proposed a new component-based ASM model (CompASM)
and an interactive refinement algorithm for practical facial feature localization
on real-world face images. CompASM extends the classical, global ASM by decomposing
global shape fitting into two modules: component shape fitting and
configuration model fitting. The model provides more flexibility for individual
components' shape variation and the relative configuration between components,
and hence is more suitable for handling images with large variation. We improve
profile matching by introducing a new joint landmark optimization scheme using the
Viterbi algorithm, which outperforms the standard greedy-based approach. We
also propose an interactive refinement algorithm for facial feature localization
to minimize fitting errors for bad initial fittings. Furthermore, a new challenging
facial feature localization dataset is introduced which contains more than
2000 high-resolution, fully annotated real-world face images. Experimental results
show that our approach outperforms the state of the art on the new, real-world
facial feature localization dataset. Our interactive refinement algorithm is
capable of reducing errors quickly, but there is space to improve it further by
learning from users' behavior or by guiding the interaction through statistics of
landmark uncertainties, which we leave as future work.

References
1. Milborrow, S., Nicolls, F.: Locating Facial Features with an Extended Active Shape
Model. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part IV. LNCS,
vol. 5305, pp. 504–513. Springer, Heidelberg (2008)
2. Liang, L., Xiao, R., Wen, F., Sun, J.: Face Alignment Via Component-Based Discriminative
Search. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008,
Part II. LNCS, vol. 5303, pp. 72–85. Springer, Heidelberg (2008)
3. Gu, L., Kanade, T.: A Generative Shape Regularization Model for Robust Face
Alignment. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I.
LNCS, vol. 5302, pp. 413–426. Springer, Heidelberg (2008)
4. Tresadern, P., Bhaskar, H., Adeshina, S., Taylor, C., Cootes, T.: Combining local
and global shape models for deformable object matching. In: BMVC (2009)
5. Zhu, X., Ramanan, D.: Face Detection, Pose Estimation, and Landmark Localization
in the Wild. In: CVPR (2012)
6. Cootes, T., Taylor, C.J., Cooper, D., Graham, J.: Active shape models - their
training and application. Computer Vision and Image Understanding 61, 38–59
(1995)
7. Cootes, T., Edwards, G.J., Taylor, C.J.: Active appearance models. IEEE Trans.
PAMI 23, 681–685 (2001)
8. Cootes, T., Taylor, C.: A mixture model for representing shape variation. Image
and Vision Computing 17, 567–574 (1999)
9. Romdhani, S., Gong, S., Psarrou, A.: A multi-view non-linear active shape model
using kernel pca. In: BMVC (1999)
10. Rogers, M., Graham, J.: Robust Active Shape Model Search. In: Heyden, A.,
Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002, Part IV. LNCS, vol. 2353,
pp. 517–530. Springer, Heidelberg (2002)
11. Van Ginneken, B., Frangi, A., Staal, J., Ter Haar Romeny, B., Viergever, M.: Active
shape model segmentation with optimal features. IEEE Trans. Medical Imaging 21,
924–933 (2002)
12. Zhou, Y., Gu, L., Zhang, H.: Bayesian tangent shape model: Estimating shape and
pose parameters via bayesian inference. In: CVPR (2003)
13. Saragih, J.M., Lucey, S., Cohn, J.F.: Deformable model fitting by regularized landmark
mean-shift. Int. J. Comput. Vision 91, 200–215 (2011)
14. Huang, Y., Liu, Q., Metaxas, D.: A component based deformable model for generalized
face alignment. In: ICCV (2007)
15. Belhumeur, P.N., Jacobs, D.W., Kriegman, D.J., Kumar, N.: Localizing parts of
faces using a consensus of exemplars. In: CVPR 2011, pp. 545–552 (2011)
16. Felzenszwalb, P.F., Huttenlocher, D.P.: Pictorial structures for object recognition.
International Journal of Computer Vision 61, 55–79 (2005)
17. Kasinski, A., Florek, A., Schmidt, A.: The PUT face database. Image Processing
and Communications 13, 59–64 (2008)
18. Forney Jr., G.D.: The Viterbi algorithm. Proceedings of IEEE 61, 268–278 (1973)
19. Milborrow, S., Morkel, J., Nicolls, F.: The MUCT Landmarked Face Database.
Pattern Recognition Association of South Africa (2010)
20. Jesorsky, O., Kirchberg, K.J., Frischholz, R.: Robust Face Detection Using the
Hausdorff Distance. In: Bigun, J., Smeraldi, F. (eds.) AVBPA 2001. LNCS,
vol. 2091, pp. 90–95. Springer, Heidelberg (2001)
Propagative Hough Voting
for Human Activity Recognition

Gang Yu¹, Junsong Yuan¹, and Zicheng Liu²

¹ School of Electrical and Electronic Engineering, Nanyang Technological University
  gyu1@e.ntu.edu.sg, jsyuan@ntu.edu.sg
² Microsoft Research, Redmond, WA, USA
  zliu@microsoft.com

Abstract. Hough-transform based voting has been successfully applied
to both object and activity detection. However, most current Hough
voting methods will suffer when insufficient training data is provided. To
address this problem, we propose propagative Hough voting for activity
analysis. Instead of letting local features vote individually, we perform
feature voting using random projection trees (RPT), which leverage the
low-dimensional manifold structure to match feature points in the high-dimensional
feature space. Our RPT can index the unlabeled feature
points in an unsupervised way. After the trees are constructed, the label
and spatial-temporal configuration information are propagated from
the training samples to the testing data via the RPT. The proposed activity
recognition method does not rely on human detection and tracking, and
can well handle the scale and intra-class variations of the activity patterns.
The superior performances on two benchmarked activity datasets
validate that our method outperforms the state-of-the-art techniques not
only when there is sufficient training data, such as in activity recognition,
but also when there is limited training data, such as in activity search
with one query example.

1 Introduction

Hough-transform based local feature voting has shown promising results in both
object and activity detection. It leverages the ensemble of local features, where
each local feature votes individually for the hypothesis, and thus can provide robust
detection results even when the target object is partially occluded. Meanwhile,
it takes the spatial or spatio-temporal configurations of the local features into
consideration, thus can provide reliable detection in cluttered scenes, and
can well handle rotation or scale variation of the target object.
Despite previous successes, most current Hough-voting based detection approaches
require sufficient training data to enable discriminative voting of local
patches. For example, [7] requires sufficient labeled local features to train the
random forest. When limited training examples are provided, e.g., given one or
a few query examples to search for similar activity instances, the performance of
previous methods is likely to suffer. The root of this problem lies in the challenges


of matching local features to the training model. Due to the possibly large variations
of activity patterns, if limited training examples are provided, it is difficult
to tell whether a given local feature belongs to the target activity or not. Thus
the inaccurate voting scores will degrade the detection performance.
In this paper, we propose propagative Hough voting for activity analysis.
To improve the local feature point matching, we introduce random projection
trees [18], which are able to capture the intrinsic low-dimensional manifold structure
to improve matching in high-dimensional space. With the improved matching,
the voting weight for each matched feature point pair can be computed more
reliably. Besides, as the number of trees grows, our propagative Hough voting
algorithm is theoretically guaranteed to converge to the optimal detection.
Another nice property of the random projection tree is that its construction is
unsupervised, thus making it perfectly suitable for leveraging the unlabeled test
data. When the amount of training data is small, such as in activity search with
one or few queries, one can use the test data to construct the random projection
trees. After the random projection trees are constructed, the label information
in the training data can then be propagated to the testing data by the trees.
Our method is explained in Fig. 1. For each local patch (or feature) from
the training example, it searches for the best matches through the RPT. Once the
matches in the testing video are found, the label and spatial-temporal configuration
information are propagated from the training data to the testing data.
The accumulated Hough voting score can be used for recognition and detection.
By applying the random projection trees, our proposed method is as efficient as
the existing Hough-voting based activity recognition approach, e.g., the random
forest used in [7]. However, our method does not rely on human detection and
tracking, and can well handle the intra-class variations of the activity patterns.
With an iterative scale refinement procedure, our method can handle small scale
variations of activities as well.
We evaluate our method on two benchmarked datasets, UT-Interaction [10]
and TV Human Interaction [19]. To fairly compare with existing methods, we test
our propagative Hough voting with (1) insufficient training data, e.g., in activity
search with a few query examples, and (2) sufficient training data, e.g., in activity
recognition with many training examples. The superior performances over the
state of the art validate that our method can outperform in both conditions.

2 Related Work

Based on the successful development of video features, e.g., STIP [1], cuboids [13],
and 3D HoG [22], many human activity recognition methods have been developed.
Previous works [15][7][16][11] rely on human detection or even human pose
estimation for activity analysis. But human detection, tracking, and pose estimation
in uncontrolled environments are challenging problems.
Without relying on auxiliary algorithms such as human detection, [12][5][6] perform
activity recognition by formulating the problem as a template matching process.
In [12], it learns a spatial-temporal graph model for each activity and classifies

[Fig. 1 panels, left to right: implicit spatial-temporal shape model; distribution modeling
with random projection trees; Hough voting (axes x, y, t)]

Fig. 1. Propagative Hough Voting: The left figure illustrates our implicit spatial-temporal
shape model on a training video. Three sample STIPs from the testing videos
are illustrated with blue triangles in the right figure. Several matches will be found for
the three STIPs given the RPT. We choose three yellow dots to describe the matched
STIPs from the training data in the middle figure. For each training STIP (yellow dot),
the spatial-temporal information will be transferred to the matched testing STIPs (blue
triangles) in the testing videos. By accumulating the votes from all the matching pairs, a
sub-volume is located in the right figure. The regions marked with magenta color refer
to the low-dimensional manifold learned with the RPT, which can be built on either training
data or testing data. (Best viewed in color)

the testing video as the one with the smallest matching cost. In [5], temporal information
is utilized to build the string of feature graphs. Videos are segmented at
a fixed interval and each segment is modeled by a feature graph. By combining the
matching cost from each segment, they can determine the category of the testing
video as the one with the smallest matching cost. In [6], a similar idea of partitioning
videos into small segments is used, but the video partition problem is solved with
dynamic programming. There are several limitations in these matching based algorithms.
First, the template matching algorithms, e.g., graph matching [12][5],
are computationally intensive. For example, in order to use these template based
methods for activity localization, we need to use sliding windows to scan all the
possible sub-volumes, which is an extremely large search space. Second, despite the fact
that [6] can achieve fast speed, the proposed dynamic BoW based matching is not
discriminative since it drops all the spatial information from the interest points.
Similarly, in [12][5], the temporal models are not flexible enough to handle speed variations
of the activity pattern. Third, a large training dataset will be needed to learn the
activity model in [12][6]. However, in some applications such as activity search, the
amount of training data is extremely limited.
To avoid enumerating all the possible sub-volumes and thus save computational cost, Hough voting has been used to locate the potential candidates. In [7], Hough voting is employed to vote for the temporal center of the activity, while the spatial locations are pre-determined by human tracking. However, tracking humans in unconstrained environments is a challenging problem. In contrast, our algorithm does not rely on tracking: both spatial and temporal centers are determined by Hough voting, and the scale can be further refined with back-projection. Besides, the trees in [7] are constructed in a supervised way for classification, while our trees model the underlying data distribution in an unsupervised way. Furthermore, our trees can be built on the test data, which allows our propagative Hough voting to handle the limited training data problem well.

3 Activity Recognition by Detection


Spatial-temporal interest points (STIP) [1] are first extracted from each video. Each STIP is described with a Histogram of Gradient and a Histogram of Optical Flow; in total, the feature dimension is 162. We represent each training video with an implicit spatial-temporal shape model built on the extracted STIP points, as shown in Fig. 1. Although we only apply the sparse STIP feature in our experiments, our method is also applicable to dense local features. We refer to the training data as R = \{d_r = [f_r, l_r];\ r = 1, 2, \ldots, N_R\}, where f_r and l_r are the descriptor and 3D location of interest point d_r, respectively, and N_R is the number of interest points.
Suppose we have a set of testing videos, denoted by S = \{V_1, V_2, \ldots, V_{N_S}\}; we want to recognize and locate in S the specific activity given by the training video set R, where R can contain one or more training examples. Our goal is to find a video sub-volume V^* that maximizes the following similarity function:

\max_{V \subseteq S} S(V, R) = \max_{x, t, \sigma} S(V(x, t, \sigma), R),    (1)

where V(x, t, \sigma) refers to the sub-volume with temporal center t and spatial center x, \sigma refers to the spatial scale and duration, and S(\cdot, \cdot) is the similarity measure. In total, we have 6 parameters (center position x, y, t, and width, height, duration) to locate the optimal sub-volume V^*. For the problem of multi-class activity recognition (suppose we have K classes), Eq. 1 is solved K times with training data R from the different categories, and the class with the highest score of V^* is picked as the predicted label for the testing video.
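As a rough illustration of this recognition-by-detection loop (not the authors' implementation), the sketch below assumes a hypothetical detect_best_subvolume routine that returns the best score and sub-volume of Eq. 1 for one class:

def classify_video(test_video, training_sets, detect_best_subvolume):
    # training_sets: dict mapping class label -> training data R_k for that class.
    best_label, best_score, best_volume = None, float("-inf"), None
    for label, R_k in training_sets.items():
        # maximum of Eq. 1 over (x, t, sigma) for this class
        score, volume = detect_best_subvolume(test_video, R_k)
        if score > best_score:
            best_label, best_score, best_volume = label, score, volume
    return best_label, best_volume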
To measure the similarity between V(x, t, \sigma) and the training data R, similar to [17], we define S(V(x, t, \sigma), R) in Eq. 1 as follows:

S(V(x,t,\sigma), R) = \sum_{d_r \in R} p([x,t,\sigma], d_r) = \sum_{d_r \in R} p([x,t,\sigma] \mid d_r)\, p(d_r),    (2)

where d_r = [f_r, l_r], with f_r representing the feature descriptor and l_r the location of the r-th STIP point in the training videos. p([x,t,\sigma], d_r) is the probability that there exists a target activity at position [x,t,\sigma] together with a matched STIP point d_r in the training data. Since it is reasonable to assume a uniform prior over d_r, we skip p(d_r) and focus on the local feature voting p([x,t,\sigma] \mid d_r):

p([x,t,\sigma] \mid d_r) = \sum_{d_s \in S} p([x,t,\sigma], d_s \mid d_r) = \sum_{d_s \in S} p([x,t,\sigma] \mid d_s, d_r)\, p(d_s \mid d_r) = \sum_{d_s \in S} p([x,t,\sigma] \mid l_s, l_r)\, p(f_s \mid f_r).    (3)

In Eq. 3, p(f_s \mid f_r) determines the voting weight, which relies on the similarity between f_s and f_r. We elaborate on how to compute p(f_s \mid f_r) in Section 4.

On the other hand, p([x,t,\sigma] \mid l_s, l_r) determines the voting position. Suppose d_r = [f_r, l_r] \in R matches d_s = [f_s, l_s] \in S; we cast the spatial-temporal information from the training data to the testing data with voting position l_v = [x_v, t_v]:

x_v = x_s - \Delta_x (x_r - c_{x_r}), \qquad t_v = t_s - \Delta_t (t_r - c_{t_r}),    (4)

where [x_s, t_s] = l_s, [x_r, t_r] = l_r, [c_{x_r}, c_{t_r}] is the spatio-temporal center position of the training activity, and \Delta = [\Delta_x, \Delta_t] refers to the relative scale level and duration level (the scale size of the testing video, i.e., \sigma, over the scale size of the matched training video).
Once the voting position for the testing sequence is available, we can compute p([x,t,\sigma] \mid l_s, l_r) as:

p([x,t,\sigma] \mid l_s, l_r) = \frac{1}{Z} e^{-\frac{\|[x_v - x,\; t_v - t]\|^2}{\beta^2}},    (5)

where Z is a normalization constant and \beta^2 is a bandwidth parameter.
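To make the voting step concrete, the sketch below accumulates the Gaussian-weighted votes of Eqs. 4-5 into a quantized (x, y, t) grid. It is only an illustration under our own naming (cast_votes, vote_grid, the bandwidth argument standing in for the \beta^2 of Eq. 5); the paper does not prescribe this exact implementation.

import numpy as np

def cast_votes(matched_pairs, train_center, delta, vote_grid, bandwidth):
    # matched_pairs: list of ((xs, ys, ts), (xr, yr, tr)) giving a testing STIP
    # location and the training STIP location it was matched to.
    # train_center: (cx, cy, ct), centre of the training activity.
    # delta: (dx, dy, dt), relative scale of the testing video w.r.t. the training video.
    cx, cy, ct = train_center
    dx, dy, dt = delta
    grid_x = np.arange(vote_grid.shape[0])
    grid_y = np.arange(vote_grid.shape[1])
    grid_t = np.arange(vote_grid.shape[2])
    for (xs, ys, ts), (xr, yr, tr) in matched_pairs:
        # Eq. 4: predicted activity centre voted by this match.
        xv = xs - dx * (xr - cx)
        yv = ys - dy * (yr - cy)
        tv = ts - dt * (tr - ct)
        # Eq. 5: separable Gaussian vote around (xv, yv, tv).
        gx = np.exp(-((grid_x - xv) ** 2) / bandwidth)
        gy = np.exp(-((grid_y - yv) ** 2) / bandwidth)
        gt = np.exp(-((grid_t - tv) ** 2) / bandwidth)
        vote_grid += gx[:, None, None] * gy[None, :, None] * gt[None, None, :]
    return vote_grid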

4 Propagative Interest Point Matching


The matching of local features p(f_s \mid f_r) plays an essential role in our Hough voting. According to Eq. 3, each d_r \in R is matched against all d_s \in S, so an efficient and accurate matching is essential. We propose to use random projection trees [18] (RPT), which are constructed in an unsupervised way, to model the underlying low-dimension feature distribution, shown as the light magenta regions in Fig. 1. Compared with the traditional Euclidean distance, which ignores the hidden data distribution, RPT gives a more accurate evaluation of p(f_s \mid f_r) with the help of the underlying data distribution.
RPT has three unique benefits compared with other data structures, e.g., [21]. First, as proven in [18], random projection trees can adapt to a low-dimension manifold embedded in a high-dimension feature space; thus, the matching found by random projection trees is superior to the nearest neighbor based on Euclidean distance, an advantage further validated by our experimental results in Section 6. Second, similar to the BoW model, we quantize the feature space with tree structures: rather than enumerating all possible interest point matches, we can efficiently find the matches by passing a query interest point from the root to a leaf node, which saves a lot of computational cost. Third, we can make a more accurate estimation by increasing the number of trees; later, we prove that our random projection tree based Hough voting generates the optimal solution when the number of trees approaches infinity. In the following section, we describe how to implement the random projection trees.

4.1 Random Projection Trees


Depending on the application, our random projection trees can be built on (1) the training data only, e.g., for standard action classification and detection, (2) the testing data only, e.g., for activity search, or (3) both the training and testing data. The trees are constructed in an unsupervised way, and the labels from the training data are only used in the voting step. Assume we have a set of STIPs, denoted by D = \{d_i;\ i = 1, 2, \ldots, N_D\}, where d_i = [f_i, l_i] as defined in Section 3 and N_D is the total number of interest points. The feature dimension is n = 162, so f_i \in \mathbb{R}^n.

Algorithm 1. Trees = ConstructRPT(D)

1: for i = 1 to N_T do
2:   BuildTree(D, 0)
3: end for

4: Proc Tree = BuildTree(D, depth)

5: if depth < d then
6:   Choose a random unit direction v \in \mathbb{R}^n
7:   Pick any x \in D; find the farthest point y \in D from x
8:   Choose \delta uniformly at random in [-1, 1] \cdot 6\|x - y\|/\sqrt{n}
9:   Rule(x) := x \cdot v \le (\mathrm{median}(\{z \cdot v;\ z \in D\}) + \delta)
10:  LTree <- BuildTree(\{x \in D;\ \mathrm{Rule}(x) = \mathrm{true}\}, depth+1)
11:  RTree <- BuildTree(\{x \in D;\ \mathrm{Rule}(x) = \mathrm{false}\}, depth+1)
12: end if

We implement random projection trees [18] as shown in Algorithm 1. There are two parameters related to the construction of the trees: N_T is the number of trees and d is the maximum tree depth. Each tree can be considered as one partition of the feature space that indexes the interest points.
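As a minimal sketch of how such a forest could be built (our own Python rendering of Algorithm 1; the RPNode class and build_rpt_forest helper are our naming, not the authors' code):

import numpy as np

class RPNode:
    # One node of a random projection tree (sketch of Algorithm 1).
    def __init__(self, indices, depth, max_depth, features, rng):
        self.indices = indices            # interest-point indices reaching this node
        self.left = self.right = None
        self.direction = self.threshold = None
        if depth >= max_depth or len(indices) < 2:
            return                        # leaf: keep the point indices
        n = features.shape[1]
        v = rng.normal(size=n)
        v /= np.linalg.norm(v)            # random unit direction
        pts = features[indices]
        x = pts[rng.integers(len(pts))]
        y = pts[np.argmax(np.linalg.norm(pts - x, axis=1))]   # farthest point from x
        jitter = rng.uniform(-1, 1) * 6 * np.linalg.norm(x - y) / np.sqrt(n)
        proj = pts @ v
        self.direction, self.threshold = v, np.median(proj) + jitter
        self.left = RPNode(indices[proj <= self.threshold], depth + 1, max_depth, features, rng)
        self.right = RPNode(indices[proj > self.threshold], depth + 1, max_depth, features, rng)

def build_rpt_forest(features, num_trees=10, max_depth=8, seed=0):
    # features: (N_D, n) array of STIP descriptors.
    rng = np.random.default_rng(seed)
    idx = np.arange(len(features))
    return [RPNode(idx, 0, max_depth, features, rng) for _ in range(num_trees)]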
At the matching step, p(f_s \mid f_r) in Eq. 3 is computed as:

p(f_s \mid f_r) = \frac{1}{N_T} \sum_{i=1}^{N_T} I_i(f_s, f_r),    (6)

where N_T refers to the number of trees and

I_i(f_s, f_r) = \begin{cases} 1, & \text{if } f_s \text{ and } f_r \text{ belong to the same leaf in tree } T_i \\ 0, & \text{otherwise.} \end{cases}    (7)
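Continuing the sketch above (and reusing its RPNode structure, which is our own construction rather than the paper's), the leaf co-membership count of Eqs. 6-7 could be evaluated as follows:

def find_leaf(node, f):
    # Descend one random projection tree to the leaf containing feature f.
    while node.left is not None:
        node = node.left if f @ node.direction <= node.threshold else node.right
    return node

def match_probability(forest, f_s, f_r_index):
    # Eq. 6: fraction of trees in which f_s falls into the same leaf as the
    # training point with index f_r_index (the forest was built on those points).
    hits = 0
    for tree in forest:
        leaf = find_leaf(tree, f_s)
        if f_r_index in leaf.indices:
            hits += 1
    return hits / len(forest)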
Thus, Eq. 2 becomes

S(V(x,t,\sigma), R) \propto \sum_{d_r \in R} \sum_{i=1}^{N_T} \sum_{d_s \in S} I_i(f_s, f_r)\, p([x,t,\sigma] \mid l_s, l_r) = \sum_{d_r \in R} \sum_{i=1}^{N_T} \sum_{d_s \in S \,\wedge\, I_i(f_s, f_r) = 1} p([x,t,\sigma] \mid l_s, l_r),    (8)

where d_s \in S \wedge I_i(f_s, f_r) = 1 refers to the interest points from S that fall in the same leaf as d_r in the i-th tree. Based on Eq. 5, we can compute the voting score as:

S(V(x,t,\sigma), R) \propto \sum_{d_r \in R} \sum_{i=1}^{N_T} \sum_{d_s \in S \,\wedge\, I_i(f_s, f_r) = 1} e^{-\frac{\|[x_v - x,\; t_v - t]\|^2}{\beta^2}}.    (9)

4.2 Theoretical Justification

The matching quality of Eq. 6 depends on the number of trees N_T. To justify the correctness of using random projection trees for interest point matching, we show that, when the number of trees is sufficient, our Hough voting algorithm obtains the optimal detection result. For simplicity, we assume our hypothesis space is of size W \times H \times T, where W, H, T refer to the width, height and duration of the testing data, respectively. Each element refers to a possible center position for one activity, and the scale is fixed. We further assume there is only one target activity in the search space, at position l^* = [x^*, t^*], so in total there are N_H = W \cdot H \cdot T - 1 background positions. To further simplify the problem, we let each match vote for a single position rather than a smoothed region as in Eq. 5, that is,

p(l \mid l_s, l_r) = \begin{cases} 1, & l = l_v \\ 0, & \text{otherwise.} \end{cases}    (10)

We introduce a random variable x(i) with a Bernoulli distribution to indicate whether the i-th match votes for the position l^* or not. We refer to the match accuracy as q, so p(x(i) = 1) = q. We introduce another Bernoulli random variable y(i) to indicate whether the i-th match votes for a background position l_j (l_j \neq l^*) or not. Suppose each background position has an equal probability of being voted for; then p(y(i) = 1) = \frac{1-q}{N_H}. We prove the following theorem in the supplementary material.

Theorem 1 (Asymptotic property of propagative Hough voting). When the number of trees N_T \to \infty, we have S(V(l^*), R) > S(V(l_j), R) with probability 1 - \Phi\left(-\frac{(q - \frac{1-q}{N_H})\sqrt{N_M}}{\sigma_{xy}}\right). Specifically, if q > \frac{1}{N_H + 1}, we have S(V(l^*), R) > S(V(l_j), R) when the number of trees N_T \to \infty.

In Theorem 1, \Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-\frac{u^2}{2}} du and \sigma_{xy} refers to the variance. N_M refers to the number of matches according to Eq. 8: N_M = N_T \times N_R \times N_L if we build our RPT on the testing data, and N_M = N_T \times N_S \times N_L if we build our RPT on the training data. N_L, the average number of interest points in each leaf, can be estimated as N_L \approx \frac{N_D}{2^d}, where d denotes the tree depth and N_D the size of the data used to build the RPT. Based on our empirical simulations, q is much larger than \frac{1}{N_H + 1}, so the asymptotic property holds.

5 Scale Determination

To estimate \sigma in activity localization, we propose an iterative refinement method that alternates between Hough voting and scale refinement. The reason we use an iterative algorithm is that we have 6 parameters to search for, which cannot be handled well by traditional Hough voting [17], especially when there is not a sufficient amount of training data. The iterative refinement has two steps: 1) fix the scale and search for the activity center with Hough voting; 2) fix the activity center and determine the scale by back-projection. We iterate the two steps until convergence.
The initial scale \sigma is set to the average scale of the training videos. Based on the Hough voting step discussed in Section 3, we obtain a rough position of the activity center. Then back-projection, which has been used in [17][14] for 2D object segmentation or localization, is used to determine the scale parameters.
After the Hough voting step, we obtain a back-projection score for each interest point d_s from the testing video based on Eq. 2:

s_{d_s} = \sum_{d_r \in R} p(l^* \mid l_s, l_r)\, p(f_s \mid f_r) = \frac{1}{Z} \sum_{d_r \in R} e^{-\frac{\|l^* - l_s\|^2}{\beta^2}}\, p(f_s \mid f_r),    (11)

where l^* is the activity center computed in the last round, and Z and \beta^2 are, respectively, the normalization constant and kernel bandwidth, the same as in Eq. 5. p(f_s \mid f_r) is computed by Eq. 6. The back-projection score s_{d_s} represents how much the interest point d_s contributes to the voting center l^*. For each sub-volume detected in the previous Hough voting step, we first enlarge the original sub-volume in both the spatial and temporal domains by 10%. We refer to the extended volume as V^{l^*}_{W \times H \times T}, meaning a volume centered at l^* with width W, height H and duration T. We then need to find a sub-volume V^{l^*}_{w' \times h' \times t'} that maximizes the following function:

\max_{w', h', t'} \sum_{d_s \in V^{l^*}_{w' \times h' \times t'}} s_{d_s} + \lambda\, w' h' t',    (12)

where \lambda is a small negative value that constrains the size of the volume.
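For illustration only, the sketch below computes the back-projection scores of Eq. 11 (up to the constant 1/Z) and then approximates the maximization in Eq. 12 with a coarse grid search; the paper does not specify this particular search strategy, and all function and parameter names here are ours.

import numpy as np

def backprojection_scores(test_locs, match_prob, center, bandwidth):
    # Eq. 11 up to the constant 1/Z.
    # test_locs: (N_S, 3) array of testing interest-point locations (x, y, t).
    # match_prob: (N_S, N_R) array with match_prob[s, r] = p(f_s | f_r) from Eq. 6.
    # center: detected activity centre l* from the Hough-voting step.
    d2 = np.sum((test_locs - np.asarray(center)) ** 2, axis=1)
    return np.exp(-d2 / bandwidth) * match_prob.sum(axis=1)

def refine_scale(test_locs, scores, center, W, H, T, lam=-0.01, steps=5):
    # Coarse grid search approximating Eq. 12 inside the enlarged W x H x T volume.
    cx, cy, ct = center
    best, best_whd = -np.inf, (W, H, T)
    for w in np.linspace(W / steps, W, steps):
        for h in np.linspace(H / steps, H, steps):
            for t in np.linspace(T / steps, T, steps):
                inside = (np.abs(test_locs[:, 0] - cx) <= w / 2) & \
                         (np.abs(test_locs[:, 1] - cy) <= h / 2) & \
                         (np.abs(test_locs[:, 2] - ct) <= t / 2)
                obj = scores[inside].sum() + lam * w * h * t
                if obj > best:
                    best, best_whd = obj, (w, h, t)
    return best_whd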


We assume each interest point which belongs to the detected activity would
contribute in the Hough voting step, i.e., it should have a high back-projection
score sds . Thus, for those interest points with low back-projection scores, we
consider them as the background. This motivates us to use the method in Eq. 12

to locate the optimal sub-volume Vwl h t .
Once we obtain the scale information of the sub-volume, we replace in Eq. 1
with [w , h , t ] computed from Eq. 12 and start a new round of Hough voting.
The process iterates until convergence or reaching to a pre-dened iteration
number.
For activity classication, since the testing videos have already been seg-
mented, the scale can be determined by the width, height and duration of the
testing video. The similarity between the training activity model and testing
video dened in Eq. 1 is S(V (x , t , ), R) where [x , t ] refers to the center
position of the testing video.

6 Experiments
Two datasets are used to validate the performance of our algorithm: UT-Interaction [10] and the TV Human Interaction dataset [19]. We perform two types of tests: 1) activity recognition with few training examples but a large amount of testing data (building the RPT on the testing data), and 2) activity recognition when the training data is sufficient (building the RPT on the training data).

6.1 RPT on the Testing Data


In the following experiments, we rst show that our algorithm is able to handle
the cases when the training data is not sucient. Experiments on UT-Interaction
dataset and TV Human Interaction validate the performance of our algorithm.
In these experiments, we build RPT using the testing data without labels. The
reason why we do not train our RPT with both the training and testing data is
that we need to handle the activity search problem, where we do not have the
prior knowledge on query (training) data initially.

Table 1. Comparison of classification results on UT-Interaction (20% training)

Method    | [10]  | [12]  | NN + HV | RPT + HV
Accuracy  | 0.708 | 0.789 | 0.75    | 0.854

Activity Classification on UT-Interaction Dataset with 20% Data for Training. We use the setting with 20% of the data for training and the other 80% for testing on the UT-Interaction dataset; this evaluation protocol has been used in [10][12]. Since the training data is not sufficient, we build our random projection trees from the testing data. We list our results in Table 1, where NN + HV refers to the variant in which nearest neighbor search replaces the RPT for feature point matching. Our algorithm shows a significant performance advantage over the state of the art.

Activity Classification on UT-Interaction Dataset with One Video Clip Only for Training. [5] reported activity classification results with training on only one video clip for each activity type and testing on the other video clips. To compare with [5], we performed the same experiment with just a single video clip as the training data for each activity type. We obtain an average accuracy of 73%, which is significantly better than the average accuracy of 65% reported in [5].

Activity Search with Localization on UT-Interaction Dataset. The activity search experiments are performed on the continuous UT-Interaction dataset. In the activity search scenario, there are usually only a few, or even a single, training sample available to indicate what kind of activity the user wants to find. Following this application scenario, we test our algorithm with only one query sample randomly chosen from the segmented videos; if more training samples were available, the performance would be further boosted. With the help of our iterative activity search

[Figure: precision-recall curves for activity retrieval on six activities (hug, kick, point, punch, push, shakehands), comparing HV without scale refinement, HV with scale refinement, [20], and NN + Hough voting. The legend AP values are: hug 0.4019 / 0.4019 / 0.0456 / 0.3769; kick 0.1389 / 0.2491 / 0.0388 / 0.1107; point 0.2456 / 0.3202 / no detections / 0.0889; punch 0.2262 / 0.2500 / 0.1016 / 0.1774; push 0.3283 / 0.3283 / 0.1504 / 0.1992; shakehands 0.1836 / 0.2436 / 0.0668 / 0.1369.]

Fig. 2. Activity search results on UT-Interaction dataset

algorithm, we can efficiently locate all similar activities in a large unsegmented (continuous) video set. To compute precision and recall, we consider a detection correct if Volume(V ∩ G) / Volume(V ∪ G) > 1/2, where G is the annotated ground truth sub-volume and V is the detected sub-volume.
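This overlap criterion is the usual spatio-temporal intersection-over-union; a minimal sketch (with our own helper names) is:

def spatiotemporal_iou(det, gt):
    # Each box is (x1, y1, t1, x2, y2, t2) with mins first and maxes last.
    ix = max(0, min(det[3], gt[3]) - max(det[0], gt[0]))
    iy = max(0, min(det[4], gt[4]) - max(det[1], gt[1]))
    it = max(0, min(det[5], gt[5]) - max(det[2], gt[2]))
    inter = ix * iy * it
    vol = lambda b: (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    union = vol(det) + vol(gt) - inter
    return inter / union if union > 0 else 0.0

def is_correct_detection(det, gt):
    # Correct if Volume(V ∩ G) / Volume(V ∪ G) > 1/2.
    return spatiotemporal_iou(det, gt) > 0.5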
Fig. 2 shows the results of the different algorithms. The difference between our activity search setting and previous work is that we are given only one query video clip, and our system has no prior information about the number of activity categories in the database; in contrast, [10][5] assume that at least one video clip per activity type is provided as training data. As previous works on activity search do not provide precision-recall curves, we only compare with the following algorithms: Branch&Bound [20][2] (magenta curves) and nearest neighbor + Hough voting without scale determination (green curves), using the code provided by [20]. We list two variants of our results: 1) red curves: results after one step of Hough voting without scale refinement, and 2) blue curves: results after one round of the iteration (both Hough voting and scale refinement). Compared with NN search, we see clear improvements from applying the RPT to match feature points; in addition, back-projection refines the results from Hough voting. Since the dataset does not have very large spatial and temporal scale changes, we only present the results after one round of our iterative algorithm; the performance does not improve significantly when we further increase the number of iterations.
Fig. 3 provides sample results of our activity search algorithm. One segmented video (a sample frame for each category is shown in the first column) is used as the query, and three detected results (marked with red rectangles) are shown from the second to the fourth column of Fig. 3.

Fig. 3. Activity search results on the UT-Interaction dataset. We show two categories of results in each row. For each category, the first image is from the query video and the following three images are sample detection results. The red regions refer to our detected sub-volumes.

Activity Search on TV Human Interaction Dataset. Since UT-Interaction is recorded in a controlled environment, we use the TV Human Interaction dataset [19] to show that our algorithm is also capable of handling activity recognition in uncontrolled environments. The dataset contains 300 video clips segmented from different TV shows, covering four activities: hand shake, high five, hug and kiss.
We performed an experiment on the TV Human Interaction dataset to evaluate performance with different numbers of training samples. The setting is as follows: 100 videos (25 per category) form the database, and a number of other videos are randomly selected as queries. The RPT is built on the database (testing data). Fig. 4 (left) compares our results with those of NN + Hough voting, showing the benefit of our RPT based matching over nearest neighbor based matching.

6.2 RPT on the Training Data


We use two experiments to further show that our algorithm also gives promising results for the traditional activity recognition problem, i.e., when the training data is sufficient. One is on the UT-Interaction dataset with leave-one-out cross validation and the other is on the TV Human Interaction dataset.

Activity Classification on UT-Interaction Dataset with Leave-One-Out Validation. This setting was used in the activity classification contest [8]: a 10-fold leave-one-out cross validation. Table 2 lists the published results on the two sets of videos. Since enough training data is provided, we build our unsupervised random projection trees from the training data without using the labels. The experimental results show that our algorithm outperforms the state-of-the-art methods on the classification problem when the amount of training data is sufficient.

[Figure: (left) average precision versus the number of query videos, comparing our proposed method with NN + Hough voting; (right) precision-recall curves per activity with AP values shakehands 0.6609, highfive 0.5800, hug 0.7107, kiss 0.6947.]

Fig. 4. Left: average precision versus the number of training videos provided on the TV Human Interaction dataset (testing on a database with 100 videos). Right: PR curves for activity recognition on the TV Human Interaction dataset (25 videos per activity used for training).

Table 2. Comparison of classification results on UT-Interaction Set 1 and Set 2 with the leave-one-out cross validation setting

Set 1:
Method        | Shake | Hug  | Kick | Point | Punch | Push | Total
[1] + kNN     | 0.18  | 0.49 | 0.57 | 0.88  | 0.73  | 0.57 | 0.57
[1] + Bayes   | 0.38  | 0.72 | 0.47 | 0.9   | 0.5   | 0.52 | 0.582
[1] + SVM     | 0.5   | 0.8  | 0.7  | 0.8   | 0.6   | 0.7  | 0.683
[13] + kNN    | 0.56  | 0.85 | 0.33 | 0.93  | 0.39  | 0.72 | 0.63
[13] + Bayes  | 0.49  | 0.86 | 0.72 | 0.96  | 0.44  | 0.53 | 0.667
[13] + SVM    | 0.8   | 0.9  | 0.9  | 1     | 0.7   | 0.8  | 0.85
[7]           | 0.7   | 1    | 1    | 1     | 0.7   | 0.9  | 0.88
[6] BoW       | -     | -    | -    | -     | -     | -    | 0.85
Our proposed  | 1     | 1    | 1    | 1     | 0.6   | 1    | 0.933

Set 2:
Method        | Shake | Hug  | Kick | Point | Punch | Push | Total
[1] + kNN     | 0.3   | 0.38 | 0.76 | 0.98  | 0.34  | 0.22 | 0.497
[1] + Bayes   | 0.36  | 0.67 | 0.62 | 0.9   | 0.32  | 0.4  | 0.545
[1] + SVM     | 0.5   | 0.7  | 0.8  | 0.9   | 0.5   | 0.5  | 0.65
[13] + kNN    | 0.65  | 0.75 | 0.57 | 0.9   | 0.58  | 0.25 | 0.617
[13] + Bayes  | 0.26  | 0.68 | 0.72 | 0.94  | 0.28  | 0.33 | 0.535
[13] + SVM    | 0.8   | 0.8  | 0.6  | 0.9   | 0.7   | 0.4  | 0.7
[7]           | 0.5   | 0.9  | 1    | 1     | 0.8   | 0.4  | 0.77
Our proposed  | 0.7   | 0.9  | 1    | 0.9   | 1     | 1    | 0.917

As we can see from Table 2, results from cuboid features [13] are better than
those from STIP features [1]. Even though we use STIP features, we still achieve
better results than the state-of-the-art techniques that use cuboid features.

Action Recognition on TV Human Interaction Dataset. We test our algorithm using the standard setting of [19]: training with 25 videos for each activity and testing on the remaining videos. Besides [19], other works have published results on this dataset, but they use additional information provided with the dataset, e.g., actor position, head orientation and the interaction label of each person; it would be unfair to compare with them since we only use the video data.
Following the evaluation protocol in [19], we also evaluate our algorithm based on average precision. Table 3 compares our results with those reported in [19]; "+Neg" means we add 100 negative videos that do not contain the target activities to the testing dataset. The precision-recall curves of our algorithm are shown in Fig. 4 (right).

Table 3. Comparison of activity classification on the TV Human Interaction dataset based on average precision

Method         | 100 Videos | 100 Videos + 100 Neg
[19]           | 0.3933     | 0.3276
Our algorithm  | 0.6616     | 0.5595

6.3 Computational Complexity


Here we only discuss the online computational cost, as the RPT can be built offline. The Hough voting step takes O(N_M) + O(W' \cdot H' \cdot T'), where N_M is the number of matches defined in Section 4.2 and W', H', T' are the width, height and duration of the testing video, respectively. The back-projection step takes O(N_M) + O(W \cdot H \cdot T), where W, H, T are the width, height and duration of the extended sub-volume defined in Section 5 and T \ll T'. It takes approximately 10 seconds to perform activity classification for each 4-second testing video and 15 seconds for activity search on a 1-minute testing video on the UT-Interaction dataset. Feature extraction takes a few more seconds depending on the length of the video. The system is implemented in C++ and runs on a regular desktop PC.

7 Conclusion
Local feature voting plays an essential role in Hough voting-based detection. To enable discriminative Hough voting with limited training examples, we propose propagative Hough voting for human activity analysis. Instead of matching the local features with the training model directly, our technique employs random projection trees to leverage the low-dimension manifold structure in the high-dimensional feature space. This provides significantly better matching accuracy and better activity detection results without increasing the computational cost too much. As the number of trees grows, our propagative Hough voting algorithm converges to the optimal detection. The superior performance on two benchmark datasets validates that our method performs well not only with sufficient training data, e.g., in activity recognition, but also with limited training data, e.g., in activity search with one query example.

Acknowledgement. This work was supported in part by the Nanyang Assistant Professorship (SUG M58040015) to Dr. Junsong Yuan.

References
1. Laptev, I.: On space-time interest points. International Journal of Computer Vision 64(2-3), 107–123 (2005)
2. Yuan, J., Liu, Z., Wu, Y.: Discriminative video pattern search for efficient action detection. IEEE Trans. on PAMI (2011)

3. Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: Proc. CVPR (2008)
4. Yuan, F., Prinet, V., Yuan, J.: Middle-level representation for human activities recognition: the role of spatio-temporal relationships. In: ECCV Workshop on Human Motion (2010)
5. Gaur, U., Zhu, Y., Song, B., Roy-Chowdhury, A.: A string of feature graphs model for recognition of complex activities in natural videos. In: ICCV (2011)
6. Ryoo, M.S.: Human activity prediction: early recognition of ongoing activities from streaming videos. In: ICCV (2011)
7. Gall, J., Yao, A., Razavi, N., Van Gool, L., Lempitsky, V.: Hough forests for object detection, tracking, and action recognition. PAMI, 2188–2202 (2011)
8. Ryoo, M.S., Chen, C., Aggarwal, J.: An overview of contest on semantic description of human activities (SDHA) (2010)
9. Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hashing. In: International Conference on Very Large Data Bases (VLDB), pp. 518–529 (1999)
10. Ryoo, M.S., Aggarwal, J.K.: Spatio-temporal relationship match: video structure comparison for recognition of complex human activities. In: ICCV (2009)
11. Amer, M.R., Todorovic, S.: A chains model for localizing participants of group activities in videos. In: ICCV (2011)
12. Brendel, W., Todorovic, S.: Learning spatiotemporal graphs of human activities. In: ICCV (2011)
13. Dollar, P., Rabaud, V., Cottrell, G., Belongie, S.: Behavior recognition via sparse spatio-temporal features. In: Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance (2005)
14. Razavi, N., Gall, J., Van Gool, L.: Backprojection revisited: scalable multi-view object detection and similarity metrics for detections. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part I. LNCS, vol. 6311, pp. 620–633. Springer, Heidelberg (2010)
15. Choi, W., Savarese, S.: Learning context for collective activity recognition. In: CVPR (2011)
16. Yao, B., Fei-Fei, L.: Modeling mutual context of object and human pose in human-object interaction activities. In: CVPR (2010)
17. Leibe, B., Leonardis, A., Schiele, B.: Robust object detection with interleaved categorization and segmentation. IJCV 77(1-3), 259–289 (2007)
18. Dasgupta, S., Freund, Y.: Random projection trees and low dimensional manifolds. In: ACM Symposium on Theory of Computing (STOC), pp. 537–546 (2008)
19. Patron-Perez, A., Marszalek, M., Zisserman, A., Reid, I.: High Five: recognising human interactions in TV shows. In: BMVC (2010)
20. Yu, G., Yuan, J., Liu, Z.: Unsupervised random forest indexing for fast action search. In: CVPR (2011)
21. Moosmann, F., Nowak, E., Jurie, F.: Randomized clustering forests for image classification. PAMI 30, 1632–1646 (2008)
22. Klaser, A., Marszalek, M.: A spatio-temporal descriptor based on 3D-gradients. In: BMVC (2008)
Spatio-Temporal Phrases
for Activity Recognition

Yimeng Zhang1,*, Xiaoming Liu2, Ming-Ching Chang2, Weina Ge2, and Tsuhan Chen1

1 School of Electrical and Computer Engineering, Cornell University
2 GE Global Research Center, 1 Research Circle, Niskayuna, NY
{yz457,tsuhan}@cornell.edu, {liux,changm,gewe}@research.ge.com

Abstract. Local feature based approaches have become popular for activity recognition. A local feature captures the local movement and appearance of a local region in a video, and thus can be ambiguous; e.g., it cannot tell whether a movement is from a person's hand or foot when the camera is far away from the person. To better distinguish different types of activities, people have proposed using combinations of local features to encode the relationships of local movements. Due to computational limits, previous work only creates a combination from neighboring features in space and/or time. In this paper, we propose an approach that efficiently identifies both local and long-range motion interactions; taking the push activity as an example, our approach can capture the combination of the hand movement of one person and the foot response of another person, local features which are both spatially and temporally far away from each other. Our computational complexity is linear in the number of local features in a video. Extensive experiments show that our approach is generically effective for recognizing a wide variety of activities, including activities spanning a long term, compared to a number of state-of-the-art methods.

Keywords: Activity Recognition, Spatio-Temporal Phrases.

1 Introduction
Activity recognition in videos has attracted increasing interest recently. An activity can be defined as a certain spatial and temporal pattern involving the movements of a single actor or multiple actors. The recognition task requires capturing enough spatial and temporal information to distinguish different activity categories, while handling the large intra-category variations. Also, most video analysis applications, such as surveillance, require high computational efficiency.

* This work was done while the first author was visiting GE Global Research as an intern. This work was supported by the National Institute of Justice, US Department of Justice, under the award #2009-SQ-B9-K013. The opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the Department of Justice.



Fig. 1. An example co-occurring spatio-temporal (ST) phrase in two video sequences. The left images show the space-time locations of the local features in each video; red points indicate the features composing the co-occurring phrase. The right images show the frames (sampled at rate 5, with frame index shown) composing the phrase; red circles indicate the local features composing the phrase. The ST phrase captures the causality relationships among body parts from different individuals over a long time span for the same activity, push.

The bag-of-words (BoW) model based on local features is very popular for activity recognition due to its robustness in real-world environments [1–5]. Despite its computational efficiency and intra-category invariance, the BoW model discards any structural spatial and temporal information among the local features. However, a local word only captures the local movement of a particular region, and can sometimes be ambiguous. For example, using the local features alone, the activities push and kick can appear quite similar when it is hard to tell whether a movement is from a person's hand or foot due to the video resolution.
In order to better distinguish different types of activities, many works have incorporated spatio-temporal relationships among the local features [6–12]. In particular, approaches that model the mutual relationships among the words are spatio-temporally shift invariant [6, 8, 13–15], and thus do not require knowledge of where and when the activity occurs in a video. However, due to computational concerns, these works have one or more of the following limitations: 1) they only create a combination (phrase) from words in a local space or time neighborhood [8, 14, 15], and thus cannot capture the long-term interactions of different body parts, which may come from different individuals; 2) they only consider a pair of local words [6, 13], and therefore cannot encode interactions among more than two time stamps or local regions; 3) they only encode the distances between the words, while discarding the temporal ordering and spatial layout among them [6, 13], and thus are incapable of modeling causality relationships.
Along the direction of modeling the mutual relationships among words, this paper presents a generic activity recognition approach that addresses all of the aforementioned limitations. Our approach is built around a novel concept named the spatio-temporal phrase (ST phrase). An ST phrase of length k is defined as a combination of k words in a certain spatial and temporal structure, including their order and relative positions. Hence, it encodes rich temporal ordering and spatial geometry information of local words. Fig. 1 illustrates an example ST phrase in two videos, both of which include the push activity at different time stamps. The ST phrase captures patterns such as the hand movement of one person in one frame and the foot movement of another person several frames later. The illustrated phrase consists of local movements that do not occur within the same space-time neighborhood region. Moreover, the same phrase, which characterizes the push activity, occurs at different locations and time stamps in the two videos.
The algorithm for identifying ST phrases extends a recent work [16] proposed for image retrieval, which efficiently detects co-occurring 2D phrases in two static images. We extend this work to the space-time domain and show that we can identify all ST phrases of any length in a video in time linear in the number of local features. We also provide an algorithm that makes the ST phrases speed invariant, to deal with different speeds or durations of the activities. The co-occurring ST phrases are further used to compute a kernel for discriminative learning with an SVM, which implicitly determines the importance of different phrases. In addition, we propose an algorithm that updates the kernel values incrementally when a new frame arrives, thereby enabling us to efficiently detect activities from video streams in an online fashion.

1.1 Related Work


Many techniques have been proposed to incorporate spatial and temporal information into the BoW model. Space-time pyramid matching [2] captures absolute spatio-temporal locations of the local features with quantized space-time bins; therefore, it is not spatio-temporally shift invariant. Aligned space-time pyramid matching [10] relaxes the fixed bin matching and identifies an optimal matching between the bins of two videos; the method is shift invariant at the cost of discarding the spatio-temporal ordering among different bins. Trajectories of the local features [12, 17, 18] are created by feature tracking methods and are capable of incorporating temporal information of the same feature over a certain period; however, exploring rich spatio-temporal relationships among different trajectories remains challenging. Hough voting and the implicit shape model [7, 19] allow the local features to vote for the action center in both the space and time domains, and can encode space-time information of the words relative to the action center; however, the recognition process requires an iterative EM procedure to predict the center [7] or a preprocessing step that detects the bounding box of the person [19]. Mutual word relationships have been explored in [6, 8, 13–15]. As discussed in Section 1,

[Figure: local features of two videos V and V' plotted in their (X, Y, T) volumes, and the resulting votes in the (ΔX, ΔY, ΔT) offset space.]

Fig. 2. 3D correspondence transform. Circles represent local features, and colors rep-
resent word assignments. A co-occurring ST phrase of length 4 is shown.

due to the high computational cost, previous works do not capture higher order
and long range dependencies.

2 Approach

In activity recognition, a video can be represented as a collection of visual words in the xyt-space, created by clustering the local descriptors detected at local space-time volumes. Various promising local feature detectors, descriptors and clustering methods [1, 3–5] can be adopted. Specifically, a video is denoted as V = \{(w_i, x_i, y_i, t_i)\}, where w_i is the word entry for feature i, and (x_i, y_i) and t_i are the space and time indices of the feature, respectively.

2.1 Bag of Spatio-Temporal Phrase

A spatio-temporal (ST) phrase of length k is defined as a combination of k local words in a certain spatial and temporal structure, including their order and relative positions. We represent a video as a bag of ST phrases (BoP): V = \{P_i\}, which encodes much richer spatio-temporal information than the BoW model. A straightforward use of the BoP would be to calculate the histogram of the phrases and then feed it to a classifier such as an SVM. However, the length of this histogram, which is the number of possible ST phrases, is exponential in the phrase length k.

2.2 3D Correspondence Transform


Rather than representing a video with an intractably long histogram of phrases, we directly estimate the similarity of two videos by finding their co-occurring ST phrases; this similarity can then be used by SVM learning via the kernel trick. A co-occurring ST phrase of length k in two videos must have the same k words and the same space-time layout, as shown in Fig. 1.

The algorithm for identifying co-occurring ST phrases is an extension of the correspondence transform algorithm proposed in [16, 20] for 2D images. As illustrated in Fig. 2, for each pair of local features (w_i, x_i, y_i, t_i) and (w'_i, x'_i, y'_i, t'_i) that have the same word assignment (w_i = w'_i) in the two videos V and V', we calculate their space-time offset as (\Delta x_i, \Delta y_i, \Delta t_i) = (x_i - x'_i,\ y_i - y'_i,\ t_i - t'_i), and create a vote in the \Delta X \Delta Y \Delta T offset space.
In the quantized \Delta X \Delta Y \Delta T offset space, if we have k votes at the same location, we have a co-occurring ST phrase of length k in the two videos with the same word entries and the same spatio-temporal structure. Thus, to count the number of co-occurring ST phrases of length k, we can simply use the offset space. Whenever we have n_l (n_l \ge k) votes at the same location l = (\Delta x, \Delta y, \Delta t) in the offset space, we increase the number of co-occurrences by n_l choose k, \binom{n_l}{k}, since the same space-time structure of n_l words also implies the same structure for each of its \binom{n_l}{k} subsets of k words. Thus, the number of co-occurring length-k ST phrases K_k(V, V') is calculated as:

K_k(V, V') = \sum_{l} \binom{n_l}{k}.    (1)
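A compact sketch of this counting step (the function name and the bin_size quantization parameter are illustrative choices of ours, not values from the paper):

from collections import defaultdict
from math import comb

def count_cooccurring_phrases(video1, video2, k, bin_size=(16, 16, 10)):
    # Each video is a list of (word, x, y, t) tuples; bin_size quantizes the
    # (dx, dy, dt) offset space.
    by_word = defaultdict(list)
    for w, x, y, t in video2:
        by_word[w].append((x, y, t))
    votes = defaultdict(int)
    for w, x, y, t in video1:
        for (x2, y2, t2) in by_word[w]:            # same word assignment
            dx, dy, dt = x - x2, y - y2, t - t2    # space-time offset
            cell = (dx // bin_size[0], dy // bin_size[1], dt // bin_size[2])
            votes[cell] += 1
    # n_l votes in one cell contribute C(n_l, k) co-occurring length-k phrases.
    return sum(comb(n, k) for n in votes.values() if n >= k)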

Activity Speed Variations: The same activity category can occur at different speeds and with different time durations in different videos. For example, people may run at different speeds, or take different amounts of time to form a group in order to fight. To explicitly deal with this problem, we would like to detect k words from two videos as the same ST phrase if their only structural difference is the temporal rate. To this end, we add another dimension to the offset space, indicating the temporal scaling d, and allow each pair of words to vote for multiple values of d. The offset location of a pair of features with the same word assignment becomes (\Delta x_i, \Delta y_i, \Delta t_i, d) = (x_i - x'_i,\ y_i - y'_i,\ t_i - t'_i \cdot d,\ d).
In this 4D space, if we have k votes at the same location, we have a co-occurring ST phrase with a particular temporal scale difference d. In the experiments, we set the scale d to 1, 1.5, 1.5^2, \ldots, 5, which tolerates up to a 5-times speed difference between the activities of two videos.
Scale Variations: People in an activity can appear at different scales in different videos. Scale invariance can be obtained by adding yet another dimension to the offset space for the scale difference, similar to the solution for speed invariance. We can also utilize the detected scales of the local features and vote only for the scale difference that matches that of the local features; therefore, we still create one vote for each pair of identical local words when speed is not considered.

2.3 ST Phrase Kernel


Using a similar proof to that of [20] for 2D phrases, we can show that the number of co-occurring ST phrases of length k equals exactly the inner product of the bag-of-length-k-ST-phrase histograms of the two videos. Therefore, we can use the number of co-occurrences (Eqn. 1), which can be computed efficiently, as the kernel for the SVM. The kernel obviously satisfies Mercer's condition. The final kernel

[Figure: the support vectors S1, S2 (with SVM coefficients, e.g., 0.9 and 0.4 in the illustration) and their offset spaces; the enlarged offset space between the segment Vt and S1 marks votes added by the new frame in red and votes deleted for the expired frame in blue.]

Fig. 3. Illustration of the incremental kernel computing algorithm. The support vectors (selected training videos) S and their coefficients (second column) are obtained during training, while the offset spaces for each support vector (third column) are computed online during testing. The rightmost image shows the enlarged offset space between the video segment Vt and support vector S1.

value between two videos V and V' is a weighted summation of the normalized kernel values for the ST phrases of all lengths:

K(V, V') = \sum_{k \ge 1} \beta^k \frac{K_k(V, V')}{\sqrt{K_k(V, V)\, K_k(V', V')}},    (2)

where \beta is a factor (greater than 1) chosen to ensure that the normalized kernel values for different k are on similar scales, i.e., the values are not too small.
The SVM with the above kernel projects the word space into the higher-order ST phrase space for classification. During training, we use both positive and negative examples to compute the kernel matrix, which is used to train an SVM. The SVM implicitly assigns different weights to different ST phrases by learning coefficients for the training videos. Let the training videos with non-zero weights (support vectors) be S_j and their coefficients be \alpha_j; the decision score for a test video V and a particular activity category is computed as:

\mathrm{Score}(V) = \sum_{j} \alpha_j\, K(V, S_j).    (3)
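A small sketch of Eqs. 2-3, reusing the count_cooccurring_phrases helper above; the max_k and beta values are illustrative choices rather than the paper's settings, and the beta^k weighting follows our reading of "a factor greater than 1" in the text:

import math

def st_phrase_kernel(v1, v2, count_fn, max_k=4, beta=2.0):
    # count_fn(a, b, k) returns K_k(a, b), e.g. count_cooccurring_phrases.
    total = 0.0
    for k in range(1, max_k + 1):
        norm = math.sqrt(count_fn(v1, v1, k) * count_fn(v2, v2, k))
        if norm > 0:
            total += (beta ** k) * count_fn(v1, v2, k) / norm
    return total

def decision_score(test_video, support_vectors, alphas, kernel_fn):
    # Eq. 3: weighted sum of kernel values with the SVM support vectors.
    return sum(a * kernel_fn(test_video, s) for a, s in zip(alphas, support_vectors))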

2.4 Incremental Kernel Computing for Activity Localization

Some applications, such as surveillance or security monitoring, require online activity detection from real-time video streams instead of recognizing temporally segmented video clips in a batch mode. A simple yet commonly used approach is to make a prediction for every frame based on the video segment composed of the current frame and the previous T frames as context. If some detection delay is allowed, the following T' frames can also be included in the segment.
As illustrated in Fig. 3, to determine the activity category of the current frame, we need to compute the kernel values of the corresponding video segment with the support vectors as in Eqn. (3). The essential step in obtaining the kernel value between

Algorithm 1: Compute kernels for frame t and a support vector S_j

1. Initialize K_k(V_t, S_j) = K_k(V_{t-1}, S_j)
2. For each same-word pair in video S_j and frame t of the test video:
   (a) Compute their offset l = (\Delta x, \Delta y, \Delta t)
   (b) Increase K_k(V_t, S_j) += \binom{n_l}{k-1}, where n_l is the original number of votes at l
   (c) Add one vote at location l
   Save the list of offset locations contributed by frame t
3. For each saved offset location l of frame t - T:
   (a) Decrease K_k(V_t, S_j) -= \binom{n_l - 1}{k-1}
   (b) Delete one vote from location l

a video segment V_t and a support vector S_j is to compute the offset space, which is created from the votes of the local features. As discussed in Section 2.2, this computational complexity is linear in the number of local features in the segment.
However, the video segment of the current frame and that of the previous frame overlap significantly, so we can further accelerate the kernel computation with an incremental updating algorithm. The difference between the video segments of the current and the previous frame involves only two frames: the current frame t, and the frame at t - T (Fig. 3). If we have stored the offset spaces for the previous segment V_{t-1}, then to compute the new offset space between the current V_t and a support vector S_j we only need to add the votes made by the local features of the current frame, and delete the votes contributed by the features of frame t - T.
When we add a vote at a location l in the offset space that originally had n_l votes, the number of co-occurring length-k ST phrases (Eqn. (1)) changes as follows:

K_k^{new}(V, V') = K_k^{old}(V, V') + \binom{n_l + 1}{k} - \binom{n_l}{k} = K_k^{old}(V, V') + \binom{n_l}{k-1}.    (4)

The change in the number of co-occurring length-k ST phrases when we delete a vote can be calculated similarly.
The procedure for updating the kernel values is listed in Algorithm 1. Consequently, to compute the kernel values for the current frame and a support vector, we only need to perform the operation of Eqn. (4) 2N_t times, where N_t is the number of local features at frame t. This acceleration enables us to consider a long-term context (a large number of previous frames) without increasing the recognition time per frame, a property that is especially useful when detecting complex activities that usually last a long period of time.
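The sketch below maintains K_k(V_t, S_j) for one support vector with these O(1)-per-vote updates; it is our own rendering of Algorithm 1 (the class name, bin sizes, and the assumption that same-word pairs are supplied externally are all ours):

from collections import defaultdict
from math import comb

class IncrementalPhraseKernel:
    # Sliding-window maintenance of K_k(V_t, S_j) for one support vector.
    def __init__(self, k, bin_size=(16, 16, 10)):
        self.k = k
        self.bin_size = bin_size
        self.votes = defaultdict(int)   # offset cell -> n_l
        self.value = 0                  # current K_k(V_t, S_j)

    def _cell(self, feat, sv_feat):
        (w, x, y, t), (w2, x2, y2, t2) = feat, sv_feat
        return ((x - x2) // self.bin_size[0],
                (y - y2) // self.bin_size[1],
                (t - t2) // self.bin_size[2])

    def add_vote(self, feat, sv_feat):
        cell = self._cell(feat, sv_feat)
        n = self.votes[cell]
        self.value += comb(n, self.k - 1)       # Eq. 4: C(n+1,k) - C(n,k) = C(n,k-1)
        self.votes[cell] += 1

    def remove_vote(self, feat, sv_feat):
        # Assumes the corresponding vote was previously added.
        cell = self._cell(feat, sv_feat)
        n = self.votes[cell]
        self.value -= comb(n - 1, self.k - 1)   # inverse of add_vote
        self.votes[cell] -= 1

In use, every same-word pair between the incoming frame t and S_j triggers add_vote, every pair stored for the expiring frame t - T triggers remove_vote, and value then equals the updated K_k(V_t, S_j).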

2.5 Computational Complexity


The computational complexity for calculating the number of co-occurring ST
phrases (Section 2.2) is linear to the number of same word pairs in two videos.
Let N be the number of local features per video. The computation is O(N 2 ) for
worst case when all the words are the same, but O(N ) in practise. For memory
usage, although the oset space has high dimension (3D or 4D), it is actually
714 Y. Zhang et al.

very sparse with most locations having zero votes. Since we only care about the
number of votes that fall to the same location, we can simply store the non-zero
locations, which is again linear to the local feature number per video. Thus, we
have the same O(N ) computational complexity with BoW for classifying a video
clip, when BoW uses a non-linear kernel, such as 2 or Gaussian kernel.
With the incremental kernel computing, computing the kernel value between
a support vector and a video segment dened for the current frame depends only
on the number of words in the current and previous frames. Therefore, making
prediction for one frame requires O(SNt ) computations in total, where S is the
number of support vectors, and Nt is the number of words per frame.

3 Experiments

We first perform experiments on single-person activities with the KTH dataset [21], the YouTube Action dataset [5], and a hospital surveillance dataset. The KTH dataset was created in a controlled environment, while the hospital dataset is collected from real surveillance cameras in a more complex environment, and the YouTube Action dataset includes more pose and scale variations and was taken with shaky cameras.
We then evaluate our performance on the multiple-person activity recognition problem, an emerging topic addressed by recent work [8, 22, 23], using the UT-Interaction dataset [24] and the MPR dataset [22]. On the MPR dataset, we evaluate our incremental algorithm (Section 2.4) for online activity detection.
Baseline: We aim to verify that the proposed bag-of-ST-phrase (BoP) representation outperforms BoW with exactly the same local features, the same visual words, and the same classifier (SVM). For BoW, we use the \chi^2 kernel, which already captures the co-occurrence statistics of different words in a video. Therefore, the comparison of BoW and BoP verifies whether the spatial-temporal layout of the words helps activity recognition. Moreover, we compare favorably with other state-of-the-art approaches [2, 13, 19, 25, 26] that incorporate spatial and/or temporal information into the BoW representation.

3.1 Single Person Activity: KTH Dataset

The KTH dataset includes 2391 videos performed by 25 different subjects. We use the same local features (HOF) as [2], computed with the code published by the authors.¹
Vocabulary Size: Table 1(a) compares BoP with BoW using different vocabulary sizes under the same settings as [2]: videos from 16 subjects for training and videos from the other 9 for testing. The performance of BoW decreases when a smaller vocabulary is used, since local words become more ambiguous. Meanwhile, BoP achieves similar accuracies even with a smaller vocabulary, since the

¹ For fair comparison, the HOF feature uses the version 1.1 code, same as [2].

Table 1. (a) Classification accuracies (%) with different vocabulary sizes on the KTH dataset. (b) Accuracies (%) under each train/test setting on the KTH dataset, comparing BoP with other methods that incorporate spatio-temporal relationships into BoW.

(a)
Method | 500  | 1000 | 2000 | 4000
BoW    | 90.8 | 91.2 | 91.3 | 91.5
BoP    | 94.0 | 93.8 | 94.1 | 94.0

(b)
Setting            | BoW  | BoP  | [2]  | [19] | [13] | [26]
16 train / 9 test  | 91.5 | 94.0 | 91.8 | n/a  | n/a  | 91.1
5-fold             | 92.9 | 94.6 | n/a  | 92.0 | n/a  | n/a
leave-one-out      | 91.9 | 95.5 | n/a  | n/a  | 94.2 | 93.8

ST phrases capture the spatio-temporal relationships among the words, thereby increasing the discrimination.
Comparison: Table 1(b) compares BoP with other methods that encode spatial relationships into BoW. By capturing higher-order and more detailed spatio-temporal information with the ST phrases, our approach outperforms the other methods: spatio-temporal pyramid matching [2], Hough voting [19], the correlogram [13], and predefined spatio-temporal relationship rules [26].

3.2 Single Person Activity: Hospital Surveillance Dataset

This dataset includes realistic surveillance videos of 6 patients in the private sickrooms of a hospital.² The goal is to distinguish abnormal behaviors of the patients (getting up from the bed) from normal ones, which is a real application used in the hospital. The environment of this dataset is much more complex than the KTH dataset, since the patients perform many variations of normal activities and many noisy motion features are detected on non-person objects or on other people; e.g., a nurse cleaning the bed may lead to a false positive. To collect the ground truth, we segmented the videos into 10-second clips and manually labeled each clip. In total, the dataset has 124 positive (abnormal) examples and 1067 negative examples. We perform a leave-one-patient-out cross-validation experiment. We extract features similar to those for the KTH dataset and create a vocabulary of size 500. Figure 4 shows that our approach outperforms the BoW method.

3.3 Single Person Activity: YouTube Action Dataset

The YouTube Action dataset [5] consists of 11 categories with 1168 videos. The videos are separated into 25 groups taken in different environments or by different photographers. Following [5], we perform a leave-one-group-out cross-validation experiment. We extract HOG/HOF features with the code of [2] and create a vocabulary of size 2000 with K-means.

² For patient privacy, we cannot show example images from this dataset.

[Figure: (a) ROC curves (true positive rate vs. false positive rate) for BoW and BoP; (b) per-patient AUC scores (Patient-1 through Patient-6) for BoW and BoP.]
Fig. 4. Hospital Surveillance Dataset. (a) ROC curve for all patients. (b) The AUC
score for each patient.

[Figure: per-category accuracy bars for BoW and BoP, and the confusion matrix for BoP (average 72.9%).]


Fig. 5. (a) Accuracies for each category of the YouTube Action dataset; the average accuracies for BoW and BoP are 63.7% and 72.9%, respectively. (b) Confusion matrix using BoP.

Figure 5 shows the classification performance. BoW achieves 63.7% accuracy averaged over the 11 categories; our implementation of BoW achieves performance similar to that in [5] (65.4%) when similar (motion) features are used. With BoP, we improve the performance by 9.2%. Our BoP result (72.9%) is also better than the final result of [5] (71.2%), where additional static features, feature pruning techniques, and semantic vocabulary learning are adopted. These techniques, which improve the local features, are complementary to our work, so further improvement can be expected when they are applied.
We notice that the main improvement is in discriminating the different swing activities, and the different categories that all involve jump actions, such as basketball shooting, trampoline jumping, and horse riding. The main reason is that BoP can capture the ST relationships among the local movements.

3.4 Multiple-person Activities


We evaluate our approach using the UT-Interaction dataset [24], which consists of six two-person interaction activities: shake hands, hug, kick, point, punch,

(a) BoW (b) Hough Voting [25] (c) BoP

Fig. 6. Confusion Matrices on the UT interaction dataset (Set 1)

and push. There are two sets in the dataset. Set 1 was recorded with a relatively stable camera, while Set 2 was taken on a lawn on a windy day: the background moves slightly (e.g., tree leaves move), and the set contains more camera jitter. For each set, every type of activity has 10 video clips. These high-level activities are more complex and involve a combination of atomic movements of two people over a period of time; therefore, the BoW model, which only captures the atomic movements, may work poorly. This dataset was used in the activity recognition contest at ICPR 2010 [24]. We use the cuboid [1] as the local feature and a vocabulary of size 500.
Comparison: We compare BoP with BoW and with the best results reported in the contest [25]. We adopt the classification setting described for the contest, a 10-fold leave-one-set-out cross validation. Figure 6 shows the confusion matrices of the different methods on Set 1. Our implementation of BoW with a \chi^2 kernel for the SVM achieves 76.7% accuracy, similar to the BoW rates reported in the contest. Table 2(b) compares the performance on both Set 1 and Set 2. By modeling the atomic moves relative to the activity center, the Hough voting based method [25] improves over BoW by around 7%; using BoP, we further outperform the Hough voting method by around 10%.
Analysis: Since the activities in this dataset involve more atomic-movement interactions of different body parts and of different persons, the rich spatio-temporal interactions modeled with BoP are beneficial. As shown in Fig. 6, BoW confuses push vs. punch and shake hands vs. hug, since the local movements of each body part are similar for these activities. With the ST phrases, we can capture the spatio-temporal combinations, thereby better discriminating these activities.
Analysis for Spatial and Temporal Information: We show the benefit of simultaneous spatial and temporal modeling with ST phrases in Table 2(a). To incorporate space alone, we still use the BoP algorithms proposed in this paper, but discard the temporal domain when computing the co-occurring ST phrases; in other words, the temporal domain is modeled with BoW. To incorporate time alone, we ignore the space domain. According to the table,

Table 2. Recognition accuracies on the UT interaction dataset. (a) The performance of incorporating spatial, temporal, and spatio-temporal information into BoW on Set 1. Note that the ST-phrase is not a simple combination of spatial and temporal phrases. (b) The accuracies of different methods on both Set 1 and Set 2 of the dataset. For Hough Voting, we cite the results reported in [25].

(a) [Bar chart: accuracy on Set 1 for BoW and for phrases using time only, space only, and space-time (ST) jointly.]
(b)
Dataset  Method        Avg.  Shake  Hug   Kick  Point  Push  Punch
Set 1    BoW           0.77  0.70   0.80  0.90  1.00   0.50  0.70
Set 1    Hough Voting  0.83  0.50   1.00  1.00  1.00   0.70  0.80
Set 1    BoP           0.95  1.00   1.00  1.00  0.90   0.90  0.90
Set 2    BoW           0.73  0.70   0.70  0.80  0.80   0.70  0.70
Set 2    Hough Voting  0.80  0.70   0.90  1.00  1.00   0.80  0.40
Set 2    BoP           0.90  0.80   1.00  1.00  0.80   0.90  0.90

(a) group dispersion (b) group formation (c) approaching (d) group flanking (e) group fighting

Fig. 7. Example scenarios we aim to recognize from videos of the MPR dataset

both spatial and temporal information improve BoW, and incorporating spatial information helps more than incorporating temporal information. Simultaneous spatial and temporal modeling obtains the best result.

3.5 Online Group-Level Activity Detection


Finally, we evaluate on group-level activities with the Mock Prison Riot (MPR) dataset [22]. The dataset has 19 surveillance videos captured in an abandoned prison yard. Several volunteer correctional officers enact typical behaviors of the inmates, including 6 group-level events of interest: group formation, group dispersion, group following, group chasing, group flanking, and group fighting. The goal is to detect these events of interest among other random behaviors of the group of people. The duration of each event ranges from 2 to 30 seconds. These properties of the dataset require the system to use a long-duration context when detecting events at each frame, and thus the speed of the detection algorithm is essential. Figure 7 gives sample snapshots of the dataset. We extract the same features as [27], which uses a BoW approach on these features.
Online Detection with Incremental Kernel Computing: We perform experiments for event detection on continuous videos. For this task, we use the incremental algorithm proposed in Section 2.4. We randomly select 60% of the 19 videos for training and the rest for testing. Prediction for every frame is made using observations from a four-second temporal window ([t − 4s, t]), i.e., the previous 4 seconds. Figure 8(a) shows the predicted probabilities for one of


Fig. 8. (a) The predicted probabilities (vertical) of various events at each frame index (horizontal) for a test video. (b) Comparison of area under curve (AUC) scores for each event category on the whole test dataset. We compare the rule-based method [22], BoW [27], and BoP. Note that the rules for fighting are not defined in [22].

the test videos. Although BoW can detect the events of interest well, it generates many more false positives than BoP because it cannot discriminate the events of interest from normal behaviors. Figure 8(b) compares BoP with previous works using the AUC (area under curve) scores of the ROC curves for different categories. BoP improves on the previous works, especially for the group forming, group following, and group fighting events, where the local group changes are easily confused with those of random behaviors.
Running Time: We implement our approach with C++ and Python on an Intel 2.4 GHz dual-core computer. Excluding person tracking and feature extraction (around 0.02 s per frame), classifying a 640 × 480 frame using a 4-second context takes less than 5 milliseconds with the incremental algorithm proposed in Section 2.4. Therefore, we can perform event detection in real time for continuous video streams. In comparison, directly classifying the 4-second context of each frame takes more than 200 milliseconds.

4 Conclusion

We proposed spatio-temporal (ST) phrases to model rich spatial and temporal relationships among local features, and presented an algorithm which can identify all ST phrases in time linear in the number of local features in a video. Experimental results demonstrate that our approach improves on state-of-the-art approaches in activity recognition. Our approach is independent of the local feature representation and is widely applicable to a large variety of activities, as illustrated by the diversity of the experimental datasets.

References
1. Dollar, P., Rabaud, V., Cottrell, G., Belongie, S.J.: Behavior recognition via sparse
spatio-temporal features. In: PETS Workshop (2005)
2. Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human
actions from movies. In: CVPR (2008)
3. Willems, G., Tuytelaars, T., Van Gool, L.: An Efficient Dense and Scale-Invariant Spatio-Temporal Interest Point Detector. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part II. LNCS, vol. 5303, pp. 650–663. Springer, Heidelberg (2008)
4. Wang, X.G., Ma, X.X., Grimson, W.E.L.: Unsupervised activity perception in
crowded and complicated scenes using hierarchical Bayesian models. PAMI 31,
539–555 (2009)
5. Liu, J.G., Luo, J.B., Shah, M.: Recognizing realistic actions from videos in the
wild. In: CVPR (2009)
6. Liu, J.G., Shah, M.: Learning human actions via information maximization. In:
CVPR (2008)
7. Wong, S.F., Kim, T.K., Cipolla, R.: Learning motion categories using both seman-
tic and structural information. In: CVPR (2007)
8. Gaur, U., Zhu, Y., Song, B., Roy-Chowdhury, A.: A string of feature graphs
model for recognition of complex activities in natural videos. In: ICCV (2011)
9. Wang, P., Abowd, G.D., Rehg, J.M.: Quasi-periodic event analysis for social game
retrieval. In: ICCV (2009)
10. Duan, L., Xu, D., Tsang, I.W.H., Luo, J.: Visual event recognition in videos by
learning from web data. In: CVPR (2010)
11. Nowozin, S., Bakir, G., Tsuda, K.: Discriminative subsequence mining for action
classification. In: ICCV (2007)
12. Sun, J., Wu, X., Yan, S.C., Cheong, L.F., Chua, T.S., Li, J.T.: Hierarchical spatio-
temporal context modeling for action recognition. In: CVPR (2009)
13. Savarese, S., Pozo, A.D., Niebles, J.C., Li, F.F.: Spatial-temporal correlatons for
unsupervised action classification. In: WMVC (2008)
14. Gilbert, A., Illingworth, J., Bowden, R.: Scale Invariant Action Recognition Using
Compound Features Mined from Dense Spatio-temporal Corners. In: Forsyth, D.,
Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 222–233.
Springer, Heidelberg (2008)
15. Kovashka, A., Grauman, K.: Learning a hierarchy of discriminative space-time
neighborhood features for human action recognition. In: CVPR (2010)
16. Zhang, Y., Jia, Z., Chen, T.: Image retrieval with geometry-preserving visual
phrases. In: CVPR (2011)
17. Wu, S., Moore, B.E., Shah, M.: Chaotic invariants of lagrangian particle trajectories
for anomaly detection in crowded scenes. In: CVPR (2010)
18. Messing, R., Pal, C., Kautz, H.A.: Activity recognition using the velocity histories
of tracked keypoints. In: ICCV (2009)
19. Yao, A., Gall, J., Van Gool, L.: A Hough transform-based voting framework for
action recognition. In: CVPR (2010)
20. Zhang, Y., Chen, T.: Efficient kernels for identifying unbounded-order spatial fea-
tures. In: CVPR (2009)
21. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: a local SVM ap-
proach. In: ICPR (2004)
22. Chang, M.C., Krahnstoever, N., Ge, W.: Probabilistic group-level motion analysis
and scenario recognition. In: ICCV (2011)

23. Brendel, W., Todorovic, S.: Learning spatiotemporal graphs of human activities.
In: ICCV (2011)
24. Ryoo, M.S., Chen, C.-C., Aggarwal, J.K., Roy-Chowdhury, A.: An Overview of
Contest on Semantic Description of Human Activities (SDHA) 2010. In: Unay, D.,
Cataltepe, Z., Aksoy, S. (eds.) ICPR 2010. LNCS, vol. 6388, pp. 270–285. Springer,
Heidelberg (2010)
25. Waltisberg, D., Yao, A., Gall, J., Van Gool, L.: Variations of a Hough-Voting Action
Recognition System. In: Unay, D., Cataltepe, Z., Aksoy, S. (eds.) ICPR 2010. LNCS,
vol. 6388, pp. 306–312. Springer, Heidelberg (2010)
26. Ryoo, M.S., Aggarwal, J.K.: Spatio-temporal relationship match: Video structure
comparison for recognition of complex human activities. In: ICCV (2009)
27. Zhang, Y., Ge, W., Chang, M.C., Liu, X.: Group context learning for event recog-
nition. In: WACV (2012)
Complex Events Detection
Using Data-Driven Concepts

Yang Yang and Mubarak Shah

Computer Vision Lab, University of Central Florida


{yyang,shah}@eecs.ucf.edu

Abstract. Automatic event detection in a large collection of unconstrained videos is a challenging and important task. The key issue is to describe long, complex videos with high-level semantic descriptors, which should find the regularity of events in the same category while distinguishing those from different categories. This paper proposes a novel unsupervised approach to discover data-driven concepts from multi-modality signals (audio, scene and motion) to describe the high-level semantics of videos. Our method consists of three main components: we first learn the low-level features separately from the three modalities; secondly, we discover the data-driven concepts based on the statistics of the learned features mapped to a low-dimensional space using deep belief nets (DBNs); finally, a compact and robust sparse representation is learned to jointly model the concepts from all three modalities. Extensive experimental results on a large in-the-wild dataset show that our proposed method significantly outperforms state-of-the-art methods.

1 Introduction

User-uploaded videos on the internet have been growing explosively in recent years. Automatic event detection in videos is an interesting and important task with great potential for many applications, such as on-line video search and indexing, consumer content management, etc. However, it is a very challenging task to deal with large corpora of unconstrained videos with huge content variations and uncontrolled capturing conditions (as illustrated in Fig. 1).
Common approaches in event recognition rely on hand-crafted low-level features such as SIFT [1], STIP [2], MFCC [3], and human-defined high-level concepts [4]. The use of high-level semantic concepts has been proven effective in representing complex events [5]. However, how to discover a powerful set of semantic concepts is still unclear and has not been investigated in previous works. The drawbacks of human-defined concepts include: (1) it is hard to extend these concepts to a larger scale, (2) they cannot handle multiple modalities, and (3) the concepts do not generalize well to new datasets.
In this paper, we propose a novel unsupervised approach to discover event concepts directly from training data in three modalities (audio, image frames and video).


Fig. 1. Randomly selected example videos from our dataset. Each row shows frames
of four videos from two categories.

We first learn low-level features for each modality using Topography Independent Component Analysis (TICA), which has shown superior performance over popular hand-designed features in [6]. Then we map our low-level features to more compact representations using deep belief networks (DBNs) [7]. After that, our data-driven concepts are learned by clustering the training data in a low-dimensional space using vector quantization (VQ). This dimension reduction step is crucial to produce reasonable clustering results. Finally, we merge the concepts from the three different modalities by learning compact sparse representations. The whole framework is shown in Fig. 2.
We argue that unsupervised learning of concepts is appropriate for two reasons. First, the disconnection between limited linguistic words and the complexity of real-world events makes the human definition of visual concepts very hard, if not impossible. We will later show that a large number of learned concepts helps to improve recognition accuracy significantly. Second, most of the time, the insufficiency of annotated data prevents us from learning concepts in a supervised manner. We present an extensive evaluation of our method. The results show that our proposed approach significantly outperforms popular baselines.
The rest of the paper is organized as follows. We first review the related literature in Section 2. The proposed method is presented in Sections 3-5 in the following order: Low-level Feature Learning, Data-driven Concept Discovery, and Event Representation Learning. Extensive experimental results, comparisons and analysis are reported in Section 6. Finally, we conclude in Section 7.

2 Related Work
Most of the existing works [8,5,9] investigate different hand-crafted features such as SIFT [1], STIP [2], Dollar [10] and MFCC [3]. Recently, there has been a growing interest in learning visual features using biologically-inspired networks, such as Independent Component Analysis (ICA) [11] and Independent Subspace Analysis [12]. In [13], Le shows that using their learned 3D (spatiotemporal) filters obtained by ISA, the action recognition performance is comparable to other hand-designed features. In [6], TICA, another extension of ICA, was proposed for static images and achieves state-of-the-art performance on object recognition.

Fig. 2. Framework of the proposed method. Each video is divided into short clips. We first learn low-level features for each modality using Topography Independent Component Analysis (TICA). Then we map our low-level features to more compact representations using deep belief networks (DBNs). After that, our data-driven concepts are learned by clustering the training data in a low-dimensional space using vector quantization (VQ). Finally, we merge the concepts from the three different modalities by learning compact sparse representations.

Concept detectors provide a high-level semantic representation for videos with complicated content, which can be very useful for developing powerful retrieval or filtering systems for consumer media. Much effort [4,14] has been devoted to building huge datasets for training concept detectors. However, most of them are recorded under well-constrained conditions [15,16], which are not suitable for detecting actions in complex events. [4] provides a benchmark dataset with 25 selected concepts over a set of 1,338 consumer videos, but its concept collection is based on static images only; audio or motion concepts are not used. Due to the large diversity of the data and insufficient training samples, concept detectors perform far below expectation. In this paper, we propose an unsupervised approach to discover concepts from three modalities using DBNs, which were originally proposed to solve digit recognition and achieved promising results [7]. Besides, it has been shown in [17] that the DBN performs better than PCA and LLE (Locally Linear Embedding) in terms of dimension reduction.
Multiple data sources can be combined using either early fusion or late fusion strategies [18,4,5,8]. Traditional fusion methods treat each source independently [5]. We argue that it is desirable to exploit the relationships between multiple sources to achieve robust classification. In this paper, we propose sparse coding [19] to perform late fusion and empirically show the benefits of such an approach.

3 Learning Low-Level Features


We use the TICA feature learning networks [6] to learn the invariant audio,
scene and motion features from 1D audio signal, 2D image patches and 3D video

cuboids, respectively. To make the paper self-contained, we briefly describe TICA in the context of event recognition. For more details, please refer to [6] and [20].
We write x^(p) ∈ R^n as the p-th whitened local raw signal extracted from one modality of the video clips. For 2D image patches and 3D video cuboids, we flatten them into 1D vectors. Learning features can be viewed as learning a set of filters that map the raw signal into feature space by calculating the filter responses. TICA is a two-layered network. The first layer learns m filters S ∈ R^{m×n} from the input x^(p) by minimizing Eqn. 3. The filter responses are the activations of the first hidden layer units H. The second layer's filters V are manually fixed to pool over a small neighborhood of adjacent first-layer units H, representing the subspace structure of the neurons in the first layer. More specifically, in a 2D topography, the h_k units lie on a 2D grid, with each activation r_i of the second layer pooling over a connected 3 × 3 block of H units through V.
In more detail, the activation of unit k in the first layer is:

    h_k\big(x^{(p)}; S\big) = S_k x^{(p)},    (1)

where S_k is the k-th row of S. The activation of unit i in the second layer is:

    r_i(h; V) = \sqrt{\sum_{k=1}^{m} V_{ik} h_k^2},    (2)

where V ∈ R^{m×m} is a fixed matrix that encodes the topography of the hidden units H, and m is the number of hidden units in the first layer.
In the filter learning process, the optimal S is learned by minimizing the function:

    S = \arg\min_{S} \sum_{p=1}^{T} \sum_{i=1}^{m} r_i\big(x^{(p)}; S, V\big), \quad \text{s.t. } SS^{T} = I,    (3)
where T is the total number of training samples. The orthogonality constraint SS^T = I provides competitiveness and ensures that the learned features are diverse. In the feature extraction process, given S and a new whitened local raw signal x, the activations R in the second layer serve as the feature of x.
Considering that our data are highly diverse and large in amount, we argue that learning good features directly from the data is very efficient. We choose TICA as our low-level building block because of two advantages: feature robustness and lower computational complexity. The pooling architecture of TICA ensures that the learned features are invariant to slight location and orientation shifts, and selective to frequency, rotation and motion velocity. The filter learning is much faster than other methods such as GRBM [21], since the gradient of the objective function in Eqn. 3 is tractable. The feature extraction is also fast compared with sparse coding, as the feature is simply computed through matrix-vector products.
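To make the two-layer computation of Eqns. 1-3 concrete, the following is a minimal NumPy sketch of TICA feature extraction with already-learned filters; the toy filter matrix S, the fixed 3 × 3 topographic pooling matrix V, and the random whitened inputs are illustrative placeholders rather than the setup used in our experiments.

import numpy as np

def build_topographic_pooling(grid_w, grid_h):
    # Fixed pooling matrix V: each second-layer unit pools a 3x3 neighborhood of
    # first-layer units arranged on a 2D grid (toroidal wrap-around is an assumed choice).
    m = grid_w * grid_h
    V = np.zeros((m, m))
    for i in range(m):
        r, c = divmod(i, grid_w)
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                j = ((r + dr) % grid_h) * grid_w + (c + dc) % grid_w
                V[i, j] = 1.0
    return V

def tica_features(X, S, V):
    # X: (n_patches, n) whitened raw signals; S: (m, n) learned filters.
    # First layer: h_k = S_k x (Eqn. 1); second layer: r_i = sqrt(sum_k V_ik h_k^2) (Eqn. 2).
    H = X @ S.T                                    # first-layer responses, shape (n_patches, m)
    R = np.sqrt(np.maximum(H ** 2 @ V.T, 1e-12))   # pooled second-layer activations
    return R

# toy usage with random stand-ins for S and whitened patches
rng = np.random.default_rng(0)
S = rng.standard_normal((100, 400))                # 100 filters on flattened 20x20 patches
V = build_topographic_pooling(10, 10)              # 10x10 topographic grid of hidden units
X = rng.standard_normal((5, 400))                  # 5 whitened patches
print(tica_features(X, S, V).shape)                # (5, 100)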

4 Data-Driven Concept Discovery


Previous works [15,16] use human-defined concepts for action recognition. However, this is not suitable for event recognition for two reasons: first, defining concepts that describe the huge diversity of human actions using limited linguistic words is not practical; second, current event datasets [22] do not have detailed annotations of action concepts for each video clip, which makes it hard to train concept detectors. These two problems originally motivated us to propose data-driven concept discovery.
We assume that only one type of concept from each of the three modalities appears in a single-shot clip. The idea is that we want to find a representation Y ∈ R^d which can map each clip of the different modalities from the raw signal space to a semantic space, where clips with similar concepts are near each other. Considering the high diversity of our data, instead of pooling the low-level TICA features spatio-temporally, we use a bag-of-words (BoW) histogram Q ∈ R^D obtained with the vector quantization (VQ) technique using K-means soft assignment [23].
One problem is that the BoW histogram is usually long (corresponding to a large codebook) in order to capture the variations of the data, and K-means is well known to be sensitive to noise in a high-dimensional space, especially when the Euclidean distance is used as the similarity measurement. To address this, we propose to use deep belief nets (DBNs) [17] to learn a lower-dimensional representation for the clips from each modality. A DBN is a network formed by stacking restricted Boltzmann machines (RBMs): the activations of the lower RBM serve as the input of the upper RBM. In each RBM, the hidden layer captures strong correlations of the unit activities in the layer below. For our highly complex event data, stacking several RBMs is an efficient way to progressively expose low-dimensional, non-linear structure. We begin by describing the RBM in the case of real-valued input, following the descriptions in [17] and [24], and then we show how we use the learned clip representation to discover data-driven concepts from each modality.
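Before turning to the RBM details, the sketch below illustrates the soft-assignment BoW step described above, building a clip histogram Q from a set of local TICA features; the Gaussian weighting and its width sigma follow the general codebook-uncertainty idea of [23], but the exact weighting scheme, parameter values and the use of MiniBatchKMeans are assumptions for illustration.

import numpy as np
from sklearn.cluster import MiniBatchKMeans

def learn_codebook(train_features, n_words=4000, seed=0):
    # Cluster training descriptors into a visual vocabulary.
    km = MiniBatchKMeans(n_clusters=n_words, random_state=seed, n_init=3)
    km.fit(train_features)
    return km.cluster_centers_

def soft_bow_histogram(clip_features, codebook, sigma=1.0):
    # Soft-assign each descriptor to all words with Gaussian weights,
    # then accumulate and L1-normalize into a clip-level histogram Q.
    d2 = ((clip_features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    w /= w.sum(axis=1, keepdims=True) + 1e-12
    Q = w.sum(axis=0)
    return Q / (Q.sum() + 1e-12)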
DBN Learning: We start with the visible units Q in the bottom layer, which are essentially the BoW representation of each clip. A set of hidden units l is connected to them through symmetric connection weights represented by a weight matrix W. We can view the RBM as an undirected graphical model, and the energy of any state in it is given by the following function:

    E(q, l) = -\log P(q, l) = \frac{1}{2\sigma^2} \sum_i q_i^2 - \frac{1}{\sigma^2}\Big(\sum_i c_i q_i + \sum_j b_j l_j + \sum_{i,j} w_{ij} q_i l_j\Big).    (4)

Here, σ is the standard deviation of the Gaussian density, l_j are the hidden unit variables, q_i are the visible unit variables, w_{ij} is the weight connecting q_i and l_j, and c_i and b_j are the bias terms of the visible and hidden units, respectively. The learning process is to estimate w_{ij}, c_i and b_j by minimizing the energy of states drawn from the data distribution of q, and raising the energy of states that are

improbable given the data. We follow [24] and use contrastive divergence learning, which gives an efficient approximation to the gradient of the energy function. Further, in each iteration, we apply the contrastive divergence update rule, followed by one step of gradient descent using the gradient of the regularization term. Once a layer of the network is trained, we feed its output values as the inputs of the next higher layer. Finally, after training all the layers, we obtain the clip representation as the outputs of the last layer, denoted as Y ∈ R^d. By doing so, we map the original features to a much lower-dimensional space, since d ≪ D.
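As a rough illustration of one contrastive-divergence step for the Gaussian-visible, binary-hidden RBM of Eqn. 4 (without the sparsity regularization term mentioned above), the following NumPy sketch uses the conditionals implied by that energy; the learning rate, σ = 1 default and mean-field reconstruction are assumptions, not our exact training settings.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(Q, W, b, c, lr=1e-3, sigma=1.0):
    # One CD-1 update for a Gaussian-visible / binary-hidden RBM.
    # Q: (batch, D) visible data (BoW histograms); W: (D, m); b: (m,); c: (D,).
    ph0 = sigmoid((Q @ W + b) / sigma ** 2)          # p(l_j = 1 | q) from Eqn. 4
    h0 = (np.random.rand(*ph0.shape) < ph0).astype(float)
    Q1 = c + h0 @ W.T                                # mean-field Gaussian reconstruction
    ph1 = sigmoid((Q1 @ W + b) / sigma ** 2)
    n = Q.shape[0]
    # approximate gradient of the log-likelihood (contrastive divergence)
    W += lr * (Q.T @ ph0 - Q1.T @ ph1) / n
    b += lr * (ph0 - ph1).mean(axis=0)
    c += lr * (Q - Q1).mean(axis=0)
    return W, b, c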
Building Concepts: The low-dimensional representations Y of similar clips are then grouped into concepts with a semantic meaning. In our framework, concepts are obtained from the three modalities separately, and each event video is represented as the occurrence frequency of each concept from the three modalities, denoted as Z.

5 Event Representation Learning


It is common that concepts from different modalities are highly correlated with each other. For example, in a birthday party event, the action concept "people dancing" almost always co-occurs with the concept "happy music" or the scene concept "crowd of people", rather than with "horrible music" or a "traffic scene". By modeling the interaction context and the inter-modality co-occurrence of concepts, we can remove noisy concepts and further improve the event representation. The idea is that we want to learn a set of bases which capture the co-occurrence information of the concepts, so that an event can be represented as a linear combination of the bases. Further, by imposing sparsity on the coefficients, the noisy occurrences of irrelevant concepts are removed.
More precisely, given N events represented in terms of concatenated concepts from the three modalities, {Z^(1), ..., Z^(i), ..., Z^(N)}, we learn the bases by modeling this as a sparse coding problem [19]:
    \Phi = \arg\min_{a, \Phi} \sum_i \Big\| Z^{(i)} - \sum_j a_j^{(i)} \phi_j \Big\|_2^2 + \lambda \big\| a^{(i)} \big\|_1, \quad \text{s.t. } \|\phi_j\|_2 \le 1, \; \forall j \in \{1, 2, \dots, s\},    (5)

where \phi_j is the j-th basis vector and a_j^{(i)} is the coefficient of the i-th event associated with the j-th basis. The first term in Eqn. 5 is the reconstruction error, while the second term enforces sparsity of the coefficients; \lambda is the relative weight that balances the two terms. We use the sparse coding algorithm in [19] to solve this minimization problem.
After learning the set of bases \Phi, we can encode an input event Z^(t) as a sparse linear combination of the bases. The combination coefficients a^(t) serve as the final representation of this event and can be obtained by solving Eqn. 6:

    a^{(t)} = \arg\min_{a^{(t)}} \Big\| Z^{(t)} - \sum_j a_j^{(t)} \phi_j \Big\|_2^2 + \lambda \big\| a^{(t)} \big\|_1.    (6)

An SVM with a χ² kernel is used for classification [25].
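A minimal sketch of this late-fusion step using scikit-learn's dictionary learning, which optimizes an objective of the same form as Eqns. 5 and 6 (squared reconstruction error plus an l1 penalty with unit-norm atoms); the number of bases s and the weight alpha are placeholder values, and the random data only stand in for real concept histograms.

import numpy as np
from sklearn.decomposition import DictionaryLearning

# Z: (N, K) matrix of concatenated concept histograms (audio + scene + motion)
rng = np.random.default_rng(0)
Z = np.abs(rng.standard_normal((200, 300)))       # stand-in for real event data

dl = DictionaryLearning(n_components=64,          # number of bases s (assumed)
                        alpha=1.0,                # sparsity weight (lambda)
                        transform_algorithm='lasso_cd',
                        transform_alpha=1.0,
                        random_state=0)
A_train = dl.fit_transform(Z)                     # sparse codes a^(i) for training events

# a new event Z_t is encoded against the learned bases (Eqn. 6)
Z_t = np.abs(rng.standard_normal((1, 300)))
a_t = dl.transform(Z_t)                           # final sparse representation fed to the SVM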

6 Experiments
In this section, we describe the datasets and discuss several interesting observations.

6.1 Datasets and Experimental Settings


We tested our approach on the TRECVID 2011 event collection [22], which has 15 categories: Boarding trick, Flash mob, Feeding animal, Landing fish, Wedding, Woodworking project, Birthday party, Changing tire, Vehicle unstuck, Grooming animal, Making sandwich, Parade, Parkour, Repairing appliance, and Sewing project. As shown in Fig. 1, it is a new set of videos characterized by a high degree of diversity in content, style, production qualities, collection devices, language, etc. The frame rate ranges from 12 to 30 fps, the resolution ranges from 320 × 640 to 1280 × 2000, and the duration ranges from 30 seconds to 5 minutes.
We manually defined and annotated an action concept dataset based on the TRECVID event collection, which has 62 action concepts (e.g., open box, person cheering, animal approaching, wheel rotating, etc.) for approximately 9,000 videos. To the best of our knowledge, this is the largest action concept dataset in the related literature.
In the experiments, we first resize all the videos to 480 × 640, and then divide each video into 4- and 10-second clips with 2 seconds of overlap, based on our observation that the duration of motion concepts varies from 2 to 10 seconds. There are approximately 300,000 clips in total. Performance is evaluated in terms of Mean Average Precision (MAP) on the 15 events.
We also compare our low-level features with other hand-designed features on the UCF YouTube action dataset, which has 11 action categories; 25-fold cross-validation is used. It is important to note that although the YouTube dataset is one of the most extensive realistic action datasets in the vision community, it is still less noisy and much simpler than the TRECVID data in terms of inner-class diversity.

6.2 Low-Level Feature Extraction


We use TICA to learn feature representations for the three modalities: audio, image and video. For each modality, approximately 200,000 sampled signals/patches/video blocks are used to train the filters, and 600 filters per modality are finally chosen as the bases for feature construction. The audio signal is extracted with a sampling rate of 16 kHz. The input sizes of the visible layer are 800, 20 × 20, and 20 × 20 × 10 in the three modalities, respectively. Figures 3, 4 and 5 show randomly selected learned filters on 1D, 2D and 3D training examples, respectively.
In order to demonstrate that our learned features outperform other classical hand-designed features, we use the same bag-of-words framework as [26], where the codebook is generated using K-means and the histogram is classified using an SVM with a χ² kernel. The codebook size is set to 4,000. We compare our results on the

Fig. 3. 24 out of 600 audio filters learned from the TRECVID event collection

Fig. 4. 48 out of 600 image (2D) filters learned from the TRECVID event collection. Since the training patches have 3 channels (RGB), the learned filters also have 3 channels. The color information of the filters mainly captures the scene concepts, such as indoor and outdoor.
manually annotated 62 action concepts, the EC, and the UCF-11 dataset, using MFCC [3], MBH [27], SIFT [1] and STIP [2].
Table 1 summarizes the results. Our learned features work 10% better on average, in terms of recognition accuracy, than all the other hand-designed features on the EC and 62-concepts datasets. The performance of our 3D TICA feature is 20% higher than the STIP (30.9%) motion feature on the 62 action concepts dataset. The results also show that combining features from different modalities improves the overall accuracy. This suggests that features from different modalities capture complementary information and that using features from different sources is necessary.
Interestingly, we notice that the learned features are not better than Motion Boundary Histograms (MBH) on the UCF-11 dataset. The reason is presumably that MBH tends to overfit to a relatively easy dataset such as UCF-11, which contains only well-defined actions with relatively simple and clean backgrounds. However, its performance plunges by half, to 44%, on the difficult TRECVID dataset, where our method yields the highest accuracy of 63.2%. This demonstrates the robustness of the learned local features and suggests that feature discovery is important and necessary, especially under uncontrolled in-the-wild conditions.
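As a hedged sketch of the bag-of-words evaluation pipeline used for this comparison (codebook histograms classified by an SVM with a χ² kernel), using scikit-learn's exponential χ² kernel with a precomputed Gram matrix; the data below are random stand-ins, and C and the default gamma are placeholder choices rather than our tuned values.

import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

# X_train / X_test: bag-of-words histograms (rows are non-negative), y_train: labels
rng = np.random.default_rng(0)
X_train = rng.dirichlet(np.ones(4000), size=100)
y_train = rng.integers(0, 11, size=100)
X_test = rng.dirichlet(np.ones(4000), size=10)

K_train = chi2_kernel(X_train, X_train)      # exp(-gamma * additive chi2) Gram matrix
K_test = chi2_kernel(X_test, X_train)

clf = SVC(kernel='precomputed', C=10.0).fit(K_train, y_train)
print(clf.predict(K_test))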

Fig. 5. 12 out of 600 spatiotemporal (3D) filters learned from the TRECVID event collection. The filter size is 20 × 20 × 10. Each row shows two 3D filters.

Table 1. A comparison of performance using different features and modality combinations. Our learned features outperform all the other hand-designed features on the difficult TRECVID dataset; combining features from different modalities improves the overall accuracy.

                     UCF-11   TRECVID 62 concepts   TRECVID 15 Events
MFCC [3]                -            31.1                 34.8
SIFT [1]              58.1           40.3                 30.1
STIP [2]              57.5           30.9                 41.0
MFCC+SIFT+Dollar        -             -                   51.1
MBH [27]              83.9           36.0                 44.0
ISA [13]              75.8           51.3                 53.5
TICA 1D                 -            31.7                 39.7
TICA 2D               56.4           43.3                 45.2
TICA 3D               74.3           53.5                 55.2
TICA 2+3D             79.1           57.9                 59.5
TICA 1+2+3D             -            58.1                 63.2

6.3 Data-Driven Concept Discovery


We trained a five-layer deep belief net using an RBM at each layer. The RBMs were initialized with small random weights and zero biases, and trained for 60 epochs using mini-batches of size 100. For the linear-binary RBM we used a learning rate of 0.001. We reduced the learning rate at the beginning of learning, when the gradient can be large, and also at the end of learning, in order to minimize fluctuations in the final weights. We also used a momentum of 0.8 to speed up the learning.
Fig. 6 shows the detection results, which evaluate the performance of the clip representation at each layer: after each layer is trained, clips are grouped into concepts based on the new representation, and an SVM is then used for classification. We first attempt to use K-means directly on the initial clip representation without any dimension reduction of the data. The event detection MAP is 39% based on this concept representation. Then we adopt RBMs recursively to reduce the

Fig. 6. A comparison of event detection performance as a function of the number of hidden units (dimensionality) for RBM, PCA, LLE and EigenMaps. The accuracy of directly using K-means clustering on the 4,000-dimensional representation without any dimension reduction is 39%. We compare different dimension reduction techniques. RBMs achieve the best performance of 66% when the number of hidden units is 1,000. Note that the accuracy generally increases, for all techniques, as we reduce the dimension of the clip representation, which suggests the necessity of dimension reduction in concept learning.

data dimension from 4,000 to 100. Fig. 6 shows that when the dimension of the clip representation is reduced from 4,000 to 1,000, the event detection MAP increases from 39% to 66%, reaching its highest point at 1,000 dimensions. This supports our assumption that the initial representation of the clips lies in a high-dimensional space where the Euclidean distance cannot measure the true similarity, and that the DBN learns the regularity between the clips correctly. If we keep decreasing the dimension of the clip representation, the accuracy goes down: the high-dimensional data are compressed into too concise a space, and some useful information may be lost. Further, we repeat the same experiments using other manifold learning methods, namely PCA, EigenMaps and LLE. Fig. 6 shows the detection results; it is clear that the DBN performs significantly better.
Fig. 7 shows the classification results based on different numbers of concepts using the three modalities and their combination. The results show that motion concepts play an important role in the event detection problem, and that a large number of concepts helps recognition, mainly because a larger pool of concepts captures finer-level variations of actions, e.g., the running action from different viewpoints. However, when the number of concepts increases further, the accuracy drops, presumably due to the insufficiency of training video samples, so that the SVM runs into overfitting.
We observe that the curve of the audio signal (TICA 1D) peaks at 500 concepts, while those of the scene (TICA 2D) and motion (TICA 3D) reach their highest performance much later. This implies that the underlying variation of the audio signal is less than that of the motion and 2D scene signals, which is consistent with common sense. Combined concepts (TICA 1D+2D+3D) achieve the best results, since audio, scene and motion concepts capture complementary information in the videos. We also use the STIP feature to run the same experiments.

Fig. 7. A comparison of event detection performance using the proposed approach employing audio, image and video features individually and jointly, as a function of the number of discovered concepts. We also show the performance using standard STIP features. Finally, we show the importance of the discovered concepts compared to the 62 manually annotated action concepts. The results show that a larger number of data-driven concepts improves the detection rate and is better than human-annotated concepts.

(a) (b)

Fig. 8. (a) A comparison of event detection using different numbers of concepts discovered from the three modalities jointly and separately. The former fuses the low-level features from the three modalities first as the initial clip representation and then learns concepts jointly; the latter discovers the audio, scene and motion concepts separately. The results show that concepts discovered from the modalities separately outperform the early-fusion variant. (b) A comparison of event detection using the sparse representation and the bag-of-concepts representation. The sparse representation works consistently better than the bag-of-concepts representation.

The performance is significantly worse than using the TICA features. This shows that low-level features are important for discovering meaningful data-driven concepts.
Interestingly, the model using the 62 human-defined concepts, trained in a supervised manner, outperforms its counterpart using the same number of concepts

Fig. 9. Comparison of our proposed method with the combination of MFCC, SIFT and STIP features in terms of detection accuracy for each event category. Our mean average precision is 68.2%; the MAP of the combination method is 51.1%.

but discovered in an unsupervised way. This suggests that more supervision helps when only a small number of concepts is used. However, when the number of concepts increases, the data-driven concepts perform better. This shows that a large number of concepts improves event detection.
We also compare the performance of concepts discovered using early feature fusion of the three modalities to that of concepts learned from the three modalities separately. Fig. 8a shows that discovering concepts separately is always better.

6.4 Sparse Video Representation


After the concepts are discovered, each long event video can be represented in terms of concepts. Fig. 8b shows the comparison of the sparse representation and the bag-of-concepts representation in terms of detection rate. In addition, based on the best results, the detection rate of each category compared with the baseline (SIFT + MFCC + STIP) is shown in Fig. 9. The MAP of our method over the 15 events is 68.2%. In comparison, the MAP of SIFT+STIP+MFCC is 51.1%.

7 Conclusion and Future Work


In this paper, we presented a three-step approach which learns a sparse video representation based on data-driven concepts from three modalities (audio, image and video) in an unsupervised manner. Through learning the low-level features and the clip representation, high-level semantic concepts are discovered. Extensive experiments show that our method significantly outperforms baselines using human-designed features on a complex in-the-wild event recognition dataset.

Acknowledgements. The research presented in this paper is supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20071. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.

References

1. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Com-
put. Vision 60, 91–110 (2004)
2. Laptev, I., Lindeberg, T.: Space-time interest points. In: ICCV, pp. 432–439 (2003)
3. Rabiner, L., Juang, B.: Fundamentals of Speech Recognition, Englewood Cliffs,
New Jersey. Prentice-Hall Signal Processing Series (1993)
4. Loui, A.C., Luo, J., Chang, S.F., Ellis, D., Jiang, W., Kennedy, L.S., Lee, K.,
Yanagawa, A.: Kodak's consumer video benchmark data set: concept definition and annotation. In: Multimedia Information Retrieval, pp. 245–254 (2007)
5. Wei, X.Y., Jiang, Y.G., Ngo, C.W.: Concept-driven multi-modality fusion for video
search. IEEE Trans. Circuits Syst. Video Techn. 21, 62–73 (2011)
6. Le, Q.V., Ngiam, J., Chen, Z., Chia, D., Koh, P.W., Ng, A.Y.: Tiled convolutional
neural networks. In: NIPS 2010 (2010)
7. Hinton, G.E., Osindero, S., Whye Teh, Y.: A fast learning algorithm for deep belief
nets. Neural Computation (2006)
8. Schindler, G., Zitnick, L., Brown, M.: Internet video category recognition. In:
CVPRW 2008, pp. 1–7 (2008)
9. Wang, Z., Zhao, M., Song, Y., Kumar, S., Li, B.: Youtubecat: Learning to categorize
wild web videos. In: CVPR 2010, pp. 879–886 (2010)
10. Dollar, P., Rabaud, V., Cottrell, G., Belongie, S.: Behavior recognition via sparse
spatio-temporal features. In: VS-PETS (2005)
11. van Hateren, J.H., Ruderman, D.L.: Independent component analysis of natural
image sequences yields spatiotemporal lters similar to simple cells in primary
visual cortex (1998)
12. Hyvarinen, A., Hoyer, P.: Emergence of phase- and shift-invariant features by de-
composition of natural images into independent feature subspaces. Neural Compu-
tation (2000)
13. Le, Q., Zou, W., Yeung, S., Ng, A.: Learning hierarchical invariant spatio-temporal
features for action recognition with independent subspace analysis. In: CVPR 2011,
pp. 3361–3368 (2011)
14. Liu, X., Huet, B.: Automatic concept detector refinement for large-scale video semantic annotation. In: 2010 IEEE Fourth International Conference on Semantic Computing (ICSC), pp. 97–100 (2010)
15. Liu, J., Luo, J., Shah, M.: Recognizing realistic actions from videos in the wild.
In: CVPR (2009)
16. Rodriguez, M.D., Ahmed, J., Shah, M.: Action mach a spatio-temporal maximum
average correlation height filter for action recognition. In: CVPR. IEEE Computer
Society (2008)
17. Hinton, G., Salakhutdinov, R.: Reducing the dimensionality of data with neural
networks. Science 313, 504–507 (2006)

18. Snoek, C.G.M.: Early versus late fusion in semantic video analysis. ACM Multi-
media, 399–402 (2005)
19. Olshausen, B.A.: Sparse coding of time-varying natural images. In: Proc. of the
Int. Conf. on Independent Component Analysis and Blind Source Separation,
pp. 603–608 (2000)
20. Hyvarinen, A., Hoyer, P., Inki, M.: Topographic ICA as a model of V1 receptive fields. In: IJCNN 2000, vol. 4, pp. 83–88 (2000)
21. Taylor, G.W., Fergus, R., LeCun, Y., Bregler, C.: Convolutional Learning of Spatio-
temporal Features. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010,
Part VI. LNCS, vol. 6316, pp. 140–153. Springer, Heidelberg (2010)
22. Trecvid (2011), http://www-nlpir.nist.gov/projects/tv2011/tv2011.html
23. van Gemert, J.C., Veenman, C.J., Smeulders, A.W.M., Geusebroek, J.M.: Vi-
sual word ambiguity. IEEE Transactions on Pattern Analysis and Machine In-
telligence 32, 1271–1283 (2010)
24. Lee, H., Ekanadham, C., Ng, A.Y.: Sparse deep belief net model for visual area V2.
In: Advances in Neural Information Processing Systems 20, pp. 873–880 (2008)
25. Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines. ACM
Transactions on Intelligent Systems and Technology 2, 27:1–27:27 (2011), Software http://www.csie.ntu.edu.tw/~cjlin/libsvm
26. Wang, H., Ullah, M.M., Klaser, A., Laptev, I., Schmid, C.: Evaluation of local
spatio-temporal features for action recognition. In: British Machine Vision Confer-
ence, p. 127 (2009)
27. Wang, H., Klaser, A., Schmid, C., Liu, C.L.: Action recognition by dense trajecto-
ries. In: CVPR 2011, pp. 3169–3176 (2011)
Directional Space-Time Oriented Gradients
for 3D Visual Pattern Analysis

Ehsan Norouznezhad1,2, Mehrtash T. Harandi1,2, Abbas Bigdeli1,2 , Mahsa Baktash1,2 ,


Adam Postula1,2 , and Brian C. Lovell1,2
1
NICTA, P.O. Box 6020, St. Lucia, QLD 4067, Australia
2
The University of Queensland, School of ITEE, QLD 4072, Australia

Abstract. Various visual tasks such as the recognition of human actions, ges-
tures, facial expressions, and classification of dynamic textures require modeling
and the representation of spatio-temporal information. In this paper, we propose
representing space-time patterns using directional spatio-temporal oriented gra-
dients. In the proposed approach, a 3D video patch is represented by a histogram
of oriented gradients over nine symmetric spatio-temporal planes. Video com-
parison is achieved through a positive definite similarity kernel that is learnt by
multiple kernel learning. A rich spatio-temporal descriptor with a simple trade-off
between discriminatory power and invariance properties is thereby obtained. To
evaluate the proposed approach, we consider three challenging visual recognition
tasks, namely the classification of dynamic textures, human gestures and human
actions. Our evaluations indicate that the proposed approach attains significant
classification improvements in recognition accuracy in comparison to state-of-
the-art methods such as LBP-TOP, 3D-SIFT, HOG3D, tensor canonical correla-
tion analysis, and dynamical fractal analysis.

1 Introduction
The goal of visual pattern recognition is to detect the presence of a particular object or
pattern in a given image or video. This usually involves representing patterns in a suit-
able feature space to achieve robustness against a broad range of environmental changes
like photometric variations, occlusions, background clutter, geometric transformations,
and variations in view angle.
The study of space-time patterns such as human actions, gestures, facial expressions,
and dynamic textures has attracted growing attention, mainly due to the wide range of
applications in real world [1]. One of the major theme of research in space-time pattern
classification is shaped around devising robust spatio-temporal local descriptors [1,2].
Broadly speaking, spatio-temporal local descriptors can be categorized into three main
classes.
The first and the largest category includes the direct extension of 2D local descriptors
to 3D. The underlying idea is to replace the rectangular regions in 2D descriptors with
3D patches and recast the functions/procedures from the spatial domain (ie. (x, y))
into the spatio-temporal space (ie. (x, y, t)). Examples include 3D-SIFT by Scovanner
et al. [3], HOG3D by Klaser et al. [4], Volume Local Binary Patterns (VLBP) by Zhao
et al. [5] and Extended-SURF (ESURF) by Willems et al. [6].


In the second category, the static (ie. spatial) and dynamic (ie. temporal) properties
of a given 3D patch are independently captured and then merged to form the descriptor.
The HOG/HOF descriptor proposed by Laptev et al. [7] is an example of this school
of thought. Other examples of this category include combining appearance and motion
descriptors for recognizing human actions as proposed by Schindler et al. [8] and Huang
et al. [9].
In the third category, no direct extension from 2D to 3D is considered. Instead, a
set of 2D local descriptors are extracted from three orthogonal spatio-temporal planes
within a 3D patch. Local Binary Patterns on Three Orthogonal Planes (LBP-TOP) pro-
posed by Zhao et al. [5] was one of the very first attempts in this direction and has been
successfully applied in classification of various space-time patterns such as dynamic
textures, facial expressions [5] and human actions [10]. In LBP-TOP, a video patch
is encoded by computing LBP features over the three orthogonal XY , XT , and Y T
planes. Other examples of encoding video patches by three orthogonal planes include
extensions of the Weber's local descriptor (WLD) [11] and Local Phase Quantization
(LQP) [12] to WLD-TOP [10] and LQP-TOP [13] respectively.
Despite considerable success, two major shortcomings are associated with the no-
tion of three orthogonal planes (TOP). First, TOP is not able to optimally capture the
dynamical properties of a given 3D patch. This is mainly due to the fact that dynamical
information is only encoded by two planes (XT and YT). Second, to create the video
descriptor, the descriptors associated with the XY, XT and YT planes are simply con-
catenated together. This is restrictive and may lead to inappropriate modeling in more
complex recognition scenarios.
With the aim of addressing the aforementioned shortcomings, in this paper we extend
the TOP approach to Nine Symmetry Planes (NSP). Unlike the TOP approach, NSP
benefits from six more diagonal planes to model the dynamical information of a video
patch. In our proposal, a video patch is described by a Histogram of Oriented Gradients
(HOG) using gradient information in all nine symmetric planes. Moreover, to overcome
the second shortcoming, we propose to learn the optimal fusion of sub-descriptors by
a positive definite similarity kernel. This can be done through the notion of Multiple
Kernel Learning (MKL) and is subtly different from previous studies on feature fusion¹. As such, our proposed method, the HOG-NSP, is free from concatenation at any level.

1.1 Contributions
The three main novelties in this work are (i) we propose to encode a video patch through
nine spatio-temporal symmetry planes, and (ii) we apply a positive definite similarity
kernel learnt by multiple kernel learning to compare two videos, and (iii) the proposed
descriptor has been analyzed and contrasted against state-of-the-art methods in three
visual recognition tasks, namely the classification of dynamic textures, hand gestures
and human actions.
The remainder of the paper is organized as follows: We elaborate on the NSP ap-
proach in Section 2. In Section 3 we compare the performance of the proposed method
¹ While in previous studies MKL is employed to fuse features of different kinds, such as color, texture, and motion, to the authors' knowledge this is the first paper that adapts MKL to fuse sub-descriptors of the same kind.

Fig. 1. The HOG-NSP feature extraction procedure: (a) input space-time pattern as a 3D volume;
(b) extract cuboid patches on dense spatio-temporal grid; (c) nine symmetry planes within the
patch are considered for feature extraction with each plane corresponding to a specific space-
time direction; (d) extract oriented gradients on each plane and represent as locally normalized
histogram

with previous approaches on the aforementioned visual classification tasks. Finally, the
main findings and possible future directions are summarised in Section 4.

2 Proposed Approach
The proposed method consists of three main components:
1. Space-Time Plane Descriptor. A video is decomposed into a set of 3D cuboid patches. Each cuboid patch is then represented by histograms of oriented gradients over nine spatio-temporal symmetry planes. The final descriptor of a 3D cuboid patch is obtained within the bag-of-features (BoF) framework. As such, a dictionary is trained for each spatio-temporal plane and used to encode 3D cuboid patches.
2. Video Sub-descriptors. A video is described by a set of sub-descriptors using a spatio-temporal pyramid. More specifically, individual representations of a given video are created in each channel of the spatio-temporal pyramid. The total number of sub-descriptors is equal to the number of plane descriptors multiplied by the number of spatio-temporal grids.
3. Similarity-Based Classification. Video comparison is achieved through a positive definite similarity kernel that is learnt by multiple kernel learning. The similarity kernel takes into account all video sub-descriptors and can be seen as a form of feature fusion.
Each of these components is explained in detail in the following sections.
Each of the components is explained in detail in the following sections.

2.1 Space-Time Plane Descriptor


To describe a video sequence, HOG-NSP decomposes the video into a set of cuboid patches of size w_s × h_s × l_s, with possible overlap, using a fixed and dense grid structure (see Fig. 1). We note that HOG-NSP can be used in conjunction with interest

Fig. 2. The nine symmetry planes: (a) the first row illustrates the XY plane and its rotations around the X axis; (b) the middle row demonstrates the YT plane and its rotations around the Y axis; (c) the bottom row shows the XT plane and its rotations around the T axis. It should be noted that we are left with nine planes since some planes are repeated twice (P(X,0) = P(Y,0), P(T,90) = P(Y,90) and P(T,0) = P(X,90)).

point detectors [14] to represent a video sparsely. However, in this study we confine
ourselves to dense grid-based representation. This is mainly driven by the recent success
of dense representation in various recognition scenarios [15,1].
Within each cuboid patch, nine space-time directions are considered for feature extraction. Each space-time direction corresponds to one symmetry plane inside the cuboid patch, as shown in Fig. 1. The planes can be obtained systematically by rotating the XY, XT and YT planes. For example, by rotating the XY plane around the X axis by 45°, 90° and 135°, three planes orthogonal to the YT plane can be attained. The rotated planes are shown in the top row of Fig. 2 and labeled P(X,45), P(X,90) and P(X,135), respectively. Similarly, three planes orthogonal to the XT plane and three planes orthogonal to the XY plane are constructed by rotating the XY and XT planes, as shown in the middle and bottom rows of Fig. 2. Since some planes are repeated twice (P(X,0) = P(Y,0), P(T,90) = P(Y,90) and P(T,0) = P(X,90)), we are left with nine distinct planes.
Having nine symmetry planes at our disposal, a cuboid patch is then represented by normalized gradient histograms over the spatio-temporal planes. This is accomplished by dividing each plane into smaller blocks of size M × N and computing a histogram of oriented gradients within each block. For normalization, we pursue a strategy similar to Lowe [16]: the descriptor in each plane is normalized using the l2-norm with a predefined cut value c. In our experiments, the cut value is set to c = 0.25. Furthermore, the sub-block and histogram parameters are set to M = 4, N = 4 and B = 8, which results in a 128-dimensional plane descriptor.
HOG-NSP should not be confused with other 3D video descriptors such as HOG3D or 3D-SIFT. HOG3D and 3D-SIFT represent 3D video patches by measuring and quantizing the 3D gradient orientations using a regular n-sided polyhedron. In contrast, HOG-NSP uses directional spatio-temporal oriented gradients to model the static and dynamic properties of 3D patches. In other words, local oriented gradients in nine discrete space-time directions (planes) within a 3D patch are used for representation.
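As an illustration of the plane-based encoding (not the authors' implementation), the sketch below slices a cubic video patch into central orthogonal planes and main-diagonal planes standing in for the nine symmetry planes, and computes a HOG descriptor per plane with scikit-image; the slicing scheme, the HOG parameters and the assumption of a cubic patch (w_s = h_s = l_s) are simplifications.

import numpy as np
from skimage.feature import hog

def nine_plane_slices(patch):
    # patch: (s, s, s) array indexed as [t, y, x] (cubic for simplicity).
    # Returns 2D slices standing in for the nine symmetry planes: three central
    # orthogonal planes plus diagonal slices obtained by pairing two axes.
    s = patch.shape[0]
    idx, ridx = np.arange(s), np.arange(s)[::-1]
    mid = s // 2
    return [
        patch[mid, :, :],            # XY
        patch[:, mid, :],            # XT
        patch[:, :, mid],            # YT
        patch[idx, idx, :],          # XY rotated around X by 45 deg (t = y)
        patch[idx, ridx, :],         # 135 deg counterpart
        patch[idx, :, idx],          # diagonal pairing t = x
        patch[idx, :, ridx],
        patch[:, idx, idx],          # diagonal pairing y = x
        patch[:, idx, ridx],
    ]

def hog_nsp_descriptor(patch):
    # One HOG per plane (8 orientations over a 4x4 cell layout assumed).
    feats = []
    for plane in nine_plane_slices(patch):
        side = plane.shape[0]
        feats.append(hog(plane, orientations=8,
                         pixels_per_cell=(max(side // 4, 1),) * 2,
                         cells_per_block=(1, 1), feature_vector=True))
    return feats   # nine plane descriptors, later quantized with separate codebooks

print(len(hog_nsp_descriptor(np.random.rand(16, 16, 16))))   # 9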

Fig. 3. The classification procedure: (a) input space-time pattern; (b) separate visual dictionar-
ies (codebook) are learned by quantizing the feature space for each plane-descriptor using the
training dataset; each codebook has a number of visual words; only three visual words are shown
in this figure for illustration purposes (c) HOG-NSP features are extracted for a given video; (d)
Spatio-temporal pyramids are created for representing videos; for illustrative purposes, only three
levels (L = 0, 1, 2) are shown; (e) individual representations of a given video will be created in
each channel; the number of channels is equal to the number of plane-descriptors (i.e. 9) multi-
plied by the number of spatio-temporal grids (i.e. 1, 8 and 64 for L = 0, 1, 2 respectively); (f) the
representation for each channel is assigned a kernel and multiple kernel learning (MKL) is used
to find the optimal weights for the kernels; SVM is used for classification.

2.2 Bag-of-Features

Given a set of spatio-temporal local descriptors, the next step is to represent the video based on the extracted descriptors. We use the bag-of-features (BoF) framework for this purpose. In the BoF framework, the first step is to construct a visual dictionary called a codebook. This is achieved by clustering the descriptors extracted from the videos in the training set using k-means, which results in nine separate visual vocabularies for the nine plane descriptors. In our experiments, the number of visual words in each dictionary is set to 800 by default, which has been shown empirically to give good results for the datasets under test. Given a video, the extracted plane descriptors are assigned to their closest visual word using the Euclidean distance. The occurrences of each of the visual words of each plane descriptor can then be used to represent a given video.

2.3 Encoding Global Structural Information

The bag-of-features (BoF) approach represents a video as an orderless collection of local patches; it does not preserve any information regarding the global spatio-temporal distribution of the local patches. To overcome this problem, Lazebnik et al. [17] proposed the Spatial Pyramid Matching (SPM) method. SPM encodes the coarse layout of local patches, which has proven to be very effective for object and scene classification [17]. In their approach, the given image is repeatedly subdivided and distributions of local features are computed and compared at increasingly finer resolutions. The SPM approach was later extended to the spatio-temporal domain and utilized for representing space-time patterns in videos [7,18]. In our proposed approach, Spatio-Temporal Pyramid Matching (STPM) is employed to encode global structural information into the representation. A sequence of increasingly finer binary partitions is constructed at resolutions 0, ..., L, such that the partition at each level l has 2^l cells along each dimension. Fig. 3 illustrates the STPM approach for L = 2. Combining the spatio-temporal grids at L = 0, 1, 2 results in 73 spatio-temporal grids.
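A small sketch of the pyramid bookkeeping described above: mapping a feature location to its grid cell at a given level, and counting grids and channels; the video-size variables are placeholders, and the cell-indexing convention is an assumption.

def stpm_cell(x, y, t, width, height, length, level):
    # Index of the pyramid cell containing (x, y, t) at a given level:
    # each dimension is split into 2**level equal bins.
    n = 2 ** level
    bx = min(int(n * x / width), n - 1)
    by = min(int(n * y / height), n - 1)
    bt = min(int(n * t / length), n - 1)
    return (bx * n + by) * n + bt

levels = [0, 1, 2]
grids = sum((2 ** l) ** 3 for l in levels)       # 1 + 8 + 64 = 73 grids
channels = 9 * grids                             # one channel per plane descriptor per grid
print(grids, channels)                           # 73 657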

2.4 Similarity-Based Classification


Up to this point, we have elaborated on how a video sequence v_i can be represented by
a set of local descriptors \{x_{i1}, x_{i2}, \ldots, x_{in}\}, x_{ij} \in R^{n_{dic}}. One could create a unified video
descriptor by concatenating the local descriptors into a large feature vector and use it
for classification. Nevertheless, simple concatenation is not suitable for HOG-NSP due
to the curse of dimensionality. More specifically, in HOG-NSP the size of each local
descriptor is equal to the length of the associated dictionary. Each video descriptor
consists of 9 \cdot n_{STPM} local descriptors. As a result, by concatenating local descriptors
the size of the video descriptor would become 9 \cdot n_{STPM} \cdot n_{dic}, which is restrictive in
many practical applications. To remedy this problem, we propose to compare two video
sequences based on the notion of similarity classification and Multiple Kernel Learning
(MKL) [19,20]. More specifically, the similarity of two videos v_s and v_t, each represented
by a set of local descriptors \{x_{s1}, x_{s2}, \ldots, x_{sn}\} and \{x_{t1}, x_{t2}, \ldots, x_{tn}\}, is defined
as:

K(v_s, v_t) = \sum_{i=1}^{n} d_i \exp\left(-\chi^2(x_{si}, x_{ti})\right)    (1)

where d_i \geq 0, \sum_{i=1}^{n} d_i = 1 are kernel combination weights and \chi^2(\cdot, \cdot) is the chi-squared
distance defined as:

\chi^2(x, y) = \sum_{j=1}^{n_{dic}} \frac{(x(j) - y(j))^2}{x(j) + y(j)}    (2)

The kernel combination weights, (di ), can be learnt through a convex optimization prob-
lem as discussed for example in [20]. Having a positive definite similarity kernel at our
disposal, Support Vector Machines can be utilized to classify video sequences.
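As a concrete illustration, a minimal sketch of Eqs. (1)-(2) is given below (in Python; the absence of a kernel scaling factor and the use of uniform weights are stand-in assumptions, since the real weights d_i would come from an MKL solver such as SimpleMKL [20]):

    import numpy as np

    def chi2(x, y, eps=1e-10):
        # chi-squared distance of Eq. (2) between two BoF histograms
        return np.sum((x - y) ** 2 / (x + y + eps))

    def video_kernel(xs, xt, d):
        # Eq. (1): weighted sum of per-descriptor chi-squared RBF kernel values
        # xs, xt: (n, n_dic) arrays of local descriptors; d: non-negative weights summing to one
        return sum(di * np.exp(-chi2(si, ti)) for di, si, ti in zip(d, xs, xt))

    n, n_dic = 9, 800
    xs, xt = np.random.rand(n, n_dic), np.random.rand(n, n_dic)
    print(video_kernel(xs, xt, np.ones(n) / n))   # uniform weights as a stand-in for MKL

In the full system, each channel's representation contributes one kernel and the combination weights are learned jointly by MKL, as described above.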

2.5 Properties of HOG-NSP Descriptor


The HOG-NSP enjoys a number of properties that are worth emphasizing. Firstly, while
local oriented gradients on the spatial planes characterize appearance, those on the
spatio-temporal planes represent motion and appearance in the space-time domain. Therefore
HOG-NSP encodes both the static and dynamic properties of a video pattern through the
spatial and spatio-temporal planes respectively. Secondly, thanks to the inherent properties
of the oriented gradient descriptor and the normalization scheme, the proposed descriptor
is robust against photometric variations. Lastly, the approach is robust against geometric
variations (e.g., rotation, shift or scale). This

Fig. 4. Representative examples from the UCLA dynamic texture database [24] used for evalu-
ation

is achieved through locally measuring the static and dynamic properties of a given
pattern. Furthermore, our experiments demonstrate that utilizing a multi-plane approach
for feature extraction and reducing the angle between spatio-temporal planes (the angle
between spatio-temporal planes is reduced to 45°) further increases the robustness
of the proposed descriptor.

3 Experiments
To evaluate and contrast the performance of the proposed HOG-NSP approach, three
sets of experiments are considered in this section. First, we appraise HOG-NSP against
state-of-the-art methods such as tensor canonical correlation analysis [21], dynamic
fractal analysis [22] and kinematic features with multiple instance learning [23] in
classifying dynamic textures, hand gestures and human actions. Second, we compare
the performance of the HOG-NSP descriptor against spatio-temporal local descriptors such
as 3D-SIFT and LBP-TOP; for this case we only consider the recognition of dynamic
textures. Finally, the computational complexity of the HOG-NSP descriptor is compared
against state-of-the-art spatio-temporal local descriptors.

3.1 Dynamic Texture Recognition


Dynamic textures (DTs) are video sequences that exhibit some form of stationary pattern
in time [24]. Examples of dynamic textures such as fire, smoke, rivers, clouds and
windblown vegetation are commonly found in the real world. To assess HOG-NSP, we
have selected two publicly available datasets, namely UCLA [24] and DynTex++ [25].
The UCLA dynamic texture dataset has been widely used for benchmarking DT classification
methods and contains 50 classes of different dynamic textures. Each texture
has four video sequences (i.e., 200 DT sequences in total) recorded from various viewpoints.
The dataset is originally recorded in color with a resolution of 220 × 320 pixels.
In our tests, the videos were cropped to 48 × 48 pixels and converted to grayscale.
For a thorough evaluation, we have considered four different test protocols [22].
The first test protocol consists of 50 classes of dynamic textures, each having 4 se-
quences. In the second protocol, the video sequences are grouped into 9 classes by
combining classes which refer to similar dynamic textures recorded from different viewpoints.
The DT classes in the second protocol are boiling water (8), fire (8), flowers
(12), sea (12), smoke (4), fountains (20), water (12), waterfall (16) and plants (108),
where the numbers denote the number of sequences in each class. In the third test

Fig. 5. Classification rate (%) on each class of the DynTex++ dataset

protocol [26], the number of classes is reduced to 8 by removing the plants class. Finally,
the fourth test protocol, shift-invariant recognition (SIR), was proposed in [27] to
evaluate the shift-invariance properties of descriptors. Representative examples of the
UCLA dataset are shown in Fig. 4.
The second dataset is the DynTex++ dataset. This dataset is compiled from the DynTex
dataset [28] and was proposed in [25]. It contains 3600 dynamic textures, grouped into 36
classes. Each sequence has a size of 50 × 50 × 50 pixels. We follow the test protocol
proposed in [25]. In all our experiments the data was randomly split into equal-sized
training and test sets. The random split was repeated 20 times and the average
classification accuracy is reported here.
For representing dynamic textures, a spatio-temporal pyramid with l = 0 was used,
which is the standard bag-of-features (BoF) representation. This is because in our
experiments we did not observe significant improvement from using higher levels of
the pyramid. Furthermore, the size of the codebook and of the cuboid patches was
empirically set to 800 and 16 × 16 × 16 pixels with 50% overlap, respectively.
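For reference, a small sketch (not from the paper; the clip dimensions are placeholders) of how cuboid patches of size 16 × 16 × 16 with 50% overlap tile a clip, i.e. with a stride of 8 pixels/frames along every axis:

    def cuboid_positions(width, height, n_frames, size=16, overlap=0.5):
        # top-left-front corners of all cuboids; 50% overlap means a stride of size/2
        stride = int(size * (1.0 - overlap))
        return [(x, y, t)
                for t in range(0, n_frames - size + 1, stride)
                for y in range(0, height - size + 1, stride)
                for x in range(0, width - size + 1, stride)]

    print(len(cuboid_positions(48, 48, 75)))  # 5 * 5 * 8 = 200 cuboids in a 48 x 48 x 75 clip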
The proposed approach is compared against the following methods: bag of dynamical
systems (BDS) [26], distance learning using DL-PEGASOS [25], space-time oriented
structures [27], and dynamic fractal spectrum (DFS) descriptors [22]. BDS models
dynamic textures using linear dynamical systems (LDS) in the bag-of-features framework.
Space-time oriented structures and DFS utilize 3D oriented filters for representing
dynamic textures. DFS exploits dynamic fractal analysis for modelling dynamic textures.
DL-PEGASOS uses a maximum margin distance learning (MMDL) approach based on the
Pegasos algorithm for dynamic texture recognition.
The classification accuracies are presented in Tables 1 and 2 for the UCLA and DynTex++
datasets respectively. On the UCLA dataset, the proposed approach achieves the highest
performance, with 98.1% and 78.2% in the 9-class and SIR test protocols respectively. In
the 8-class test protocol, the proposed approach achieves 98.7%, which is very close to the
state-of-the-art (99%). Moreover, in the 50-class test protocol the proposed approach achieves
the second highest performance with 97.2%, compared to the DFS method [22] which achieves
perfect recognition accuracy. On the DynTex++ dataset, the proposed approach achieves the
highest classification rate with 90.1%, in comparison to DL-PEGASOS [25] and DFS [22].

Table 1. Classification rate (%) for dynamic texture recognition on the UCLA dataset using
BDS [26], DL-PEGASOS [25], space-time oriented structures (STOS) [27], DFS descriptors [22]
and the proposed approach (HOG-NSP descriptor in the STPM framework using MKL for
sub-descriptor fusion)

Method           50-class   9-class  8-class  SIR
BDS [26]         N/A        N/A      80       N/A
DL-PEGASOS [25]  N/A        95.6     99       N/A
STOS [27]        81         N/A      N/A      42.3
DFS [22]         89.5, 100  97.5     99       73.8
HOG-NSP          97.2       98.1     98.7     78.2

Table 2. Classification rate (%) for dynamic texture recognition on the DynTex++ dynamic
texture database using DL-PEGASOS [25], DFS [22], and the proposed approach

Method DL-PEGASOS [25] DFS [22] HOG-NSP

Accuracy 63.7 89.9 90.1

3.2 Gesture Recognition

For gesture recognition, we used the Cambridge hand gesture dataset [21]. It consists
of 900 image sequences of 9 gesture classes. The gesture classes are defined by three
primitive hand shapes and three primitive motions. Each class has 100 image sequences
which are performed by 2 subjects, captured under 5 illuminations and 10 arbitrary
motions. Each sequence was recorded at 30 fps with a resolution of 320 × 240, in front
of a fixed camera, with the gestures roughly isolated in space and time. Representative
examples of this dataset are shown in Fig. 6.
Following the test protocol in [21], sequences are divided into five sets where the
fifth set with normal illumination is used for training and the remaining sets are used for
testing. As per [21], we report the classification rate for all the four illumination sets.
The proposed method is compared against the original Canonical Correlation Analy-
sis (CCA) [29], Tensor Canonical Correlation Analysis (TCCA) [21], and Product
Manifolds (PM) [30]. Canonical correlation analysis (CCA) is a standard method for
measuring the similarity between subspaces. TCCA, as the name implies, is the exten-
sion of canonical correlation analysis to multiway data arrays or tensors. In the PM
method, a tensor is characterised as a point on a product manifold and classification is
performed on this space.
For the hand gesture test, HOG-NSP with three levels of the spatio-temporal pyramid
attains the maximum performance. As such, each video sequence is described by (1 +
8 + 64) × 9 spatio-temporal sub-descriptors. The size of the codebook and cuboid patch is

Fig. 6. Examples of actions in the Cambridge hand-gesture video dataset [21]

Table 3. Average correct classification rate (%) for gesture recognition on the Cambridge hand-
gesture database using CCA [29], TCCA [21], Product Manifold [30] and the proposed approach

Method     Set1  Set2  Set3  Set4  Overall
CCA [29]   63    61    65    69    65 ± 3.2
TCCA [21]  81    81    78    86    82 ± 3.5
PM [30]    89    86    89    87    88 ± 2.1
HOG-NSP    91    87    90    89    90 ± 1.3

the same as in the previous experiment. The results, presented in Table 3, demonstrate that
the proposed approach achieves the highest performance, with an overall classification rate
of 90%. We note that the performance gap between CCA, TCCA and HOG-NSP is
quite significant. The proposed approach also achieves the lowest variation, with a 1.3%
standard deviation. This is an indication of robustness in the presence of photometric
variations.

3.3 Action Recognition


In order to evaluate the performance of the proposed approach for action recognition, we
used two publicly available datasets, namely the KTH [31] and Weizmann [32] datasets.
Both datasets consist of videos of a single person performing various actions in front of
a static camera with uncluttered homogeneous backgrounds. The KTH dataset contains
six different human action classes (see Fig. 7 for examples), performed by 25 subjects in
four different scenarios: outdoors, outdoors with scale variation, outdoors with different
clothes, and indoors. In total, there are 2391 sequences. The Weizmann actions dataset
consists of ten different human action classes performed by 9 subjects. There are 93
sequences in total. For a fair comparison, the test protocols presented in [31] and [32]
were used for the KTH and Weizmann datasets respectively. For both datasets, the
classification performance is reported as the average accuracy over all classes. The proposed
method is compared against six other state-of-the-art approaches: HOG/HOF [7],
3D-SIFT [3], HOG3D [4], LBP-TOP [10], kinematic features with multiple instance
learning (abbreviated as KF-MIL) [23], and unsupervised feature learning using independent
subspace analysis (abbreviated as UFL-ISA) [33]. For the human action experiments,
the number of spatio-temporal pyramid levels, the size of the codebook and the cuboid
patch size were empirically set to three, 1000 and 18 × 18 × 18 pixels with 50% overlap,
respectively. The results presented in Table 4 demonstrate that the proposed approach
achieves the highest performance on both datasets.

Fig. 7. Sample frames from (a) the KTH dataset (Walking, Jogging, Running, Boxing, Waving,
Clapping) and (b) the Weizmann dataset (Bending, JumpingJack, JumpingForward, Running,
Walking)

Table 4. Comparison of mean classification accuracy on the KTH and Weizmann datasets

Method 3D-SIFT [3] HOG3D [4] HOG/HOF [7] LBP-TOP [10] KF-MIL [23] UFL-ISA [33] HOG-NSP

KTH N/A 91.4 % 91.8 % 86.25 % 87.7 % 93.9 % 96.4%

Weizmann 82.6 % 84.3 % N/A N/A 95.75 % N/A 95.9 %

3.4 Comparison with Local Descriptors


In order to contrast the performance of the proposed spatio-temporal local descriptor
against previous video descriptors, we performed another set of experiments. For a fair
comparison, the MKL framework is removed from HOG-NSP and the sub-descriptors are
uniformly concatenated. The proposed descriptor is evaluated against three state-of-the-art
spatio-temporal local descriptors, namely 3D-SIFT [3], VLBP and LBP-TOP [5].
For all descriptors, cuboid patches of 16 × 16 × 16 pixels with 50% overlap were used.
In line with the evaluation framework presented in [1], the size of the codebook was set
to 4000. For classification, we used a nearest-neighbour (NN) classifier for all descriptors.
The performance of each descriptor was evaluated on both the UCLA and DynTex++
datasets. For the UCLA dataset, three test protocols, the 50-class, 8-class and
shift-invariant recognition (SIR) protocols, were used. The results presented in Table 5
demonstrate that the proposed HOG-NSP descriptor outperforms the other three
descriptors by a notable margin. Moreover, the shift-invariance test indicates that
HOG-NSP enjoys a better level of robustness in the presence of geometric variations.
Comparing HOG-NSP, HOG-TOP and LBP-TOP, which utilize spatio-temporal planes for
feature extraction, against direct-extension methods such as 3D-SIFT and VLBP shows
that the former approaches achieve higher accuracy.

Table 5. Classification rate (%) for dynamic texture recognition on the UCLA and DynTex++ datasets

Method        UCLA 50-class  UCLA 8-class  UCLA SIR  Dyntex++
3D-SIFT [3] 77.3 86.2 44.7 63.7
VLBP [5] 79.8 91.4 40.6 61.1
LBP-TOP [5] 86.1 96.8 60.3 71.2
HOG-TOP 83.7 95.4 62.7 72.8
HOG-NSP 88.5 97.5 66.3 78.7

3.5 Computational Complexity


As a final evaluation, we compare the computational complexity of the HOG-NSP descriptor
against other state-of-the-art 3D video descriptors. To this end, we measured the average
time required for computing HOG-NSP, LBP-TOP [5], 3D-SIFT [3] and HOG3D [4]
on a set of videos from the KTH actions dataset [31]. Experiments were performed on
an Intel 3.00GHz processor using Matlab. Moreover, all videos were downsampled to
160 × 120 and a grid of non-overlapping 16 × 16 × 16 cuboid patches was used to
compute each descriptor.² Table 6 presents the average runtime for all studied descriptors.
The proposed HOG-NSP descriptor is more than 8 times faster than HOG3D and
3D-SIFT. Moreover, the complexity of HOG-NSP is roughly similar to that of LBP-TOP.

Table 6. Average run-time (sec) for computing HOG-NSP, LBP-TOP [5], 3D-SIFT [3] and
HOG3D [4] descriptors on KTH dataset

Descriptor HOG-NSP LBP-TOP [5] 3D-SIFT [3] HOG3D [4]

Average run-time (sec) 8.6 7.4 74.7 73.3

4 Main Findings and Future Directions


In this paper, we proposed a novel approach for space-time pattern recognition. The
main contributions of this work are the proposed 3D video descriptors and the applica-
tion of Multiple Kernel Learning (MKL) for optimal fusion of sub-descriptors. In our
proposal, a local 3D video patch is described by histograms of oriented gradients on nine
discrete space-time planes, which are the symmetry planes of the 3D patch. This results
in a rich descriptor that encodes both the static and dynamic properties of a video pattern.
Furthermore, to compare two video sequences, we proposed to learn a positive-definite
similarity kernel by combining local video descriptors over a spatio-temporal pyramid
structure.
² The source code of the LBP-TOP and 3D-SIFT descriptors was downloaded from the authors'
websites, while HOG3D was implemented in Matlab by ourselves.

We extensively evaluated the performance of the proposed approach on three challenging
visual recognition tasks, namely the classification of dynamic textures, hand gestures
and human actions. The experiments indicate that the proposed approach achieves
superior or comparable performance to state-of-the-art methods. Furthermore, the proposed
HOG-NSP descriptor outperforms state-of-the-art 3D video descriptors such as LBP-TOP,
3D-SIFT and HOG3D, with a much lower computational load compared to HOG3D or
3D-SIFT. We also note that the proposed approach has achieved the highest reported
performance on DynTex++, a challenging dataset for evaluating dynamic texture methods.
In this study, the symmetry spatio-temporal planes are encoded using oriented gradient
features. However, our proposed approach is quite general and is not restricted to gradient
features. As such, future avenues of research include exploring the integration of other
2D local descriptors inside the HOG-NSP machinery. We also plan to appraise HOG-NSP
on other space-time classification tasks such as facial expression recognition.

Acknowledgements. NICTA is funded by the Australian Government as represented
by the Department of Broadband, Communications and the Digital Economy, as well
as by the Australian Research Council through the ICT Centre of Excellence program.

References

1. Wang, H., Ullah, M., Kläser, A., Laptev, I., Schmid, C.: Evaluation of local spatio-temporal
features for action recognition (2009)
2. de Campos, T., Barnard, M., Mikolajczyk, K., Kittler, J., Yan, F., Christmas, W., Windridge,
D.: An evaluation of bags-of-words and spatio-temporal shapes for action recognition. In:
WACV, pp. 344–351 (2011)
3. Scovanner, P., Ali, S., Shah, M.: A 3-dimensional sift descriptor and its application to action
recognition. In: International Conference on Multimedia, pp. 357–360 (2007)
4. Kläser, A., Marszałek, M., Schmid, C.: A spatio-temporal descriptor based on 3d-gradients.
In: BMVC, pp. 995–1004 (2008)
5. Zhao, G., Pietikäinen, M.: Dynamic texture recognition using local binary patterns with an
application to facial expressions. PAMI 29(6), 915–928 (2007)
6. Willems, G., Tuytelaars, T., Van Gool, L.: An Efficient Dense and Scale-Invariant Spatio-
Temporal Interest Point Detector. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008,
Part II. LNCS, vol. 5303, pp. 650–663. Springer, Heidelberg (2008)
7. Laptev, I., Marszałek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from
movies. In: CVPR, pp. 1–8 (2008)
8. Schindler, K., Van Gool, L.: Action snippets: How many frames does human action recogni-
tion require? In: CVPR, pp. 1–8 (2008)
9. Jhuang, H., Serre, T., Wolf, L., Poggio, T.: A biologically inspired system for action recog-
nition. In: ICCV, pp. 1–8 (2007)
10. Mattivi, R., Shao, L.: Human Action Recognition Using LBP-TOP as Sparse Spatio-
Temporal Feature Descriptor. In: Jiang, X., Petkov, N. (eds.) CAIP 2009. LNCS, vol. 5702,
pp. 740–747. Springer, Heidelberg (2009)
11. Chen, J., Shan, S., He, C., Zhao, G., Pietikäinen, M., Chen, X., Gao, W.: Wld: A robust local
image descriptor. PAMI 32(9), 1705–1720 (2010)

12. Ojansivu, V., Heikkilä, J.: Blur Insensitive Texture Classification Using Local Phase Quanti-
zation. In: Elmoataz, A., Lezoray, O., Nouboud, F., Mammass, D. (eds.) ICISP 2008. LNCS,
vol. 5099, pp. 236–243. Springer, Heidelberg (2008)
13. Päivärinta, J., Rahtu, E., Heikkilä, J.: Volume Local Phase Quantization for Blur-Insensitive
Dynamic Texture Classification. In: Heyden, A., Kahl, F. (eds.) SCIA 2011. LNCS, vol. 6688,
pp. 360–369. Springer, Heidelberg (2011)
14. Laptev, I.: On space-time interest points. IJCV 64(2), 107–123 (2005)
15. Fei-Fei, L., Perona, P.: A bayesian hierarchical model for learning natural scene categories.
In: CVPR, vol. 2, pp. 524–531 (2005)
16. Lowe, D.: Distinctive image features from scale-invariant keypoints. IJCV 60(2), 91–110
(2004)
17. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for
recognizing natural scene categories. In: CVPR, vol. 2, pp. 2169–2178 (2006)
18. Choi, J., Jeon, W., Lee, S.: Spatio-temporal pyramid matching for sports videos. In: ACM
Int. Conf. on Multimedia Information Retrieval, pp. 291–297 (2008)
19. Chen, Y., Garcia, E.K., Gupta, M.R., Rahimi, A., Cazzanti, L.: Similarity-based classifica-
tion: Concepts and algorithms. JMLR 10, 747–776 (2009)
20. Rakotomamonjy, A., Bach, F., Canu, S., Grandvalet, Y.: SimpleMKL. JMLR 9, 2491–2521
(2008)
21. Kim, T., Cipolla, R.: Canonical correlation analysis of video volume tensors for action cate-
gorization and detection. PAMI 31(8), 1415–1428 (2009)
22. Xu, Y., Quan, Y., Ling, H., Ji, H.: Dynamic texture classification using dynamic fractal anal-
ysis. In: ICCV (2011)
23. Ali, S., Shah, M.: Human action recognition in videos using kinematic features and multiple
instance learning. PAMI 32(2), 288–303 (2010)
24. Doretto, G., Chiuso, A., Wu, Y., Soatto, S.: Dynamic textures. IJCV 51(2), 91–109 (2003)
25. Ghanem, B., Ahuja, N.: Maximum Margin Distance Learning for Dynamic Texture Recogni-
tion. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part II. LNCS, vol. 6312,
pp. 223–236. Springer, Heidelberg (2010)
26. Ravichandran, A., Chaudhry, R., Vidal, R.: View-invariant dynamic texture recognition using
a bag of dynamical systems. In: CVPR, pp. 1651–1657 (2009)
27. Derpanis, K., Wildes, R.: Dynamic texture recognition based on distributions of spacetime
oriented structure. In: CVPR, pp. 191–198 (2010)
28. Péteri, R., Fazekas, S., Huiskes, M.: DynTex: A comprehensive database of dynamic textures.
Pattern Recognition Letters 31(12), 1627–1632 (2010)
29. Kim, T., Kittler, J., Cipolla, R.: Discriminative learning and recognition of image set classes
using canonical correlations. PAMI 29(6), 1005–1018 (2007)
30. Lui, Y., Beveridge, J., Kirby, M.: Action classification on product manifolds. In: CVPR,
pp. 833–839 (2010)
31. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: A local svm approach. In:
ICPR, vol. 3, pp. 32–36 (2004)
32. Gorelick, L., Blank, M., Shechtman, E., Irani, M., Basri, R.: Actions as space-time shapes.
PAMI 29(12), 2247–2253 (2007)
33. Le, Q., Zou, W., Yeung, S., Ng, A.: Learning hierarchical invariant spatio-temporal features
for action recognition with independent subspace analysis. In: CVPR, pp. 3361–3368 (2011)
Learning to Recognize Unsuccessful Activities
Using a Two-Layer Latent Structural Model

Qiang Zhou1 and Gang Wang1,2


1
Advanced Digital Sciences Center, Singapore
Zhou.Qiang@adsc.com.sg
2
Nanyang Technological University, Singapore
wanggang@ntu.edu.sg

Abstract. In this paper, we propose to recognize unsuccessful activities
(e.g., one tries to dress himself but fails), which have much more complex
temporal structures, as we don't know when the activity performer fails
(the failing moment is called the point of failure in this paper). We develop a
two-layer latent structural SVM model to tackle this problem: the first layer
specifies the point of failure, and the second layer specifies the temporal
positions of a number of key stages accordingly. The stages before the
point of failure are successful stages, while the stages after the point
of failure are background stages. Given weakly labeled training data,
our training algorithm alternates between inferring the two-layer latent
structure and updating the structural SVM parameters. In recognition,
our method can not only recognize unsuccessful activities, but also infer
the latent structure. We demonstrate the effectiveness of our proposed
method on several newly collected datasets.

1 Introduction
In recent years, researchers have made astonishing progress in developing vision
systems to recognize human activities. For a target activity class, recognition is
usually cast as a binary classification problem: is someone performing it or not.
However, in many applications, particularly in health care, we would also
like to know whether an activity is successfully completed. For example, a
senior person in a nursing home may try to dress himself, but fail at some stage
(e.g., he cannot raise his hands). If a sensor is installed in the room and this
unsuccessful activity is recognized, then a nurse can be automatically alerted
to help him. Recognizing unsuccessful activities is extremely useful for helping
patients with dementia, which is one of the most costly diseases. Patients with
dementia tend to forget things because of their deteriorated cognitive ability. As
a result, they have to be fully taken care of by caregivers. If a computer system
is able to recognize their unsuccessful activities, then an instructional video can
be automatically played to remind (or guide) them. This would help them live
independently and significantly reduce economic cost.
To tackle this interesting and useful research problem, we propose to recognize
unsuccessful activities using vision algorithms. As shown in Fig. 1, we now have


Fig. 1. The unsuccessful activity recognition task. There are three possible class labels:
successful, unsuccessful and background. It is different from traditional binary activity
recognition, which only deals with whether someone is performing the target activity or not.

three possible class labels for a target activity class: performing the activity and
succeeding (abbreviated as successful); performing the activity but failing at some stage
(abbreviated as unsuccessful); not performing the target activity (abbreviated
as background). Given a test video, we then aim to automatically categorize
it as one of these three classes.
One of our target applications is health care, where patients' privacy information
must be preserved. Hence we choose the Kinect sensor to record activity
videos; in a real application we can turn off the RGB camera and capture only depth
information, which is user friendly. However, our proposed approach can also be
applied to RGB videos, as shown in the experimental section.
Recognizing unsuccessful activities is hard, partially because we don't know
at which stage the activity performer fails. For example, one person may fail to
dress himself at the beginning or at the end. This means that we cannot take a
holistic approach [1, 2] to represent unsuccessful activities. Instead, we decompose an
activity into a number of key temporal stages (e.g., "hand up", "hand down"). In
an unsuccessful activity video clip, the activity performer completes some stages
but not all of them. The changing point is called the point of failure
in this paper. It is prohibitive to manually annotate these key stages and the
point of failure. Instead, we treat them as latent variables and let the model
automatically discover them. We develop a two-layer latent structural model

based on the general latent structural SVMs [3]. Our temporal structure has two
layers: the first layer defines the temporal position of the point of failure; the
second layer defines the temporal positions of a number of key temporal stages.
The stages before the point of failure are successful stages, while the stages after
the point of failure are background stages. One advantage of this method is
that it can deal with the three-class classification problem with a single model,
rather than training one classifier for each class independently as in conventional
methods.
In the recognition procedure, the learned model selects a discriminative
decomposition for matching. It can not only categorize a test video, but also
discover the two-layer temporal structure to support the categorization decision
(where is the point of failure? where do the different key stages take place?).
To the best of our knowledge, there are no previous vision algorithms tackling this
research problem and there is no existing dataset available either. We collect a
new dataset to study it. Our experimental results demonstrate that it is promising
to solve this problem using computer vision techniques, and our proposed
two-layer latent structural SVM model shows superior performance compared to
several previous approaches.

1.1 Related Work


Human Activity Recognition. Human activity analysis is an important task
in computer vision, and much work has been done on this topic over the last decade;
a recent survey is [4]. Previous works [1, 2, 5–7] usually treat activity
recognition as a binary classification problem. Bag-of-words representations
[1, 2, 5, 7] have achieved great success on this task. Different from them,
we focus on unsuccessful activity recognition. An illustration of the difference is
shown in Fig. 1. The most related work is [8], which develops a probabilistic formulation
to recognize unfinished human activities in video surveillance. We argue
that early recognition of unfinished activities [8] and recognition of unsuccessful
activities are two different problems developed for different applications.
[8] is more suitable for surveillance applications, while ours is more suitable for
health-care applications.
Latent Structural Model. Our two-layer latent structural model is closely
related to the general latent structural SVMs [3], but has a deeper structure
and is specially designed for the unsuccessful activity recognition problem. In
computer vision, latent SVMs [9, 10] are successfully used to detect objects.
In their models, the positions of object parts are treated as latent variables.
Niebles et al. [7] employ latent SVMs to model complex activities as a set of
stages. However, the formulation of [7] is derived from the perspective of multiple
instance SVM, which is only suitable for binary classification. Here, we have
to tackle a three-class classification problem and more complicated temporal
structures. Different from [7], our model is built on the general latent structural
SVMs [3], and is able to deal with the three-class classification problem with a
single model by modeling a two-layer latent temporal structure.

Activity Recognition in Depth Videos. Previously, activity recognition has
mainly been performed on videos collected using RGB cameras [1, 2, 7]. With the
advent of depth sensors, more and more researchers are paying attention to activity
recognition in depth videos [11–13]. Our collected datasets will enrich the existing
depth video databases.

1.2 Contributions
We summarize our contributions as follows.
(1) Our major contribution is to show the promise of recognizing unsuccessful
human activities, which is expected to be useful in health care. To the best of our
knowledge, this is the first vision algorithm for unsuccessful activity recognition.
(2) We develop a two-layer latent structural SVM model to recognize unsuccessful
activities and discover the temporal structures.
(3) We create a new dataset that enables further investigation and study of
unsuccessful activity recognition.

2 Approach
Given a test video, we aim to categorize it into one of three possible classes:
successful, unsuccessful or background (related random activities), which are
denoted as +1, 0 and -1 respectively. We write the label set as Y.

2.1 A Two-Layer Latent Structural Model


To handle these three categories, a simple approach is to train a classifier for
each category in the 1-vs-all fashion. However, this is not desirable because of two
limitations. First, it cannot explain when the activity performer fails in an unsuccessful
activity. Second, unsuccessful and successful activities should share some
key stages at the beginning, and unsuccessful and background activities may share some
key stages at the end. Training one model for each category independently cannot
take advantage of this sharing structure and may even produce inaccurate
models. We propose a new model to address this problem. Fig. 2 illustrates our
model. The intuition is: an unsuccessful activity can be modeled using a two-layer
latent structure, where the first layer defines the temporal position of the point
of failure and the second layer defines the temporal positions of a number of key
temporal stages (e.g., "hand up", "hand down"). Before the point of failure, the
key temporal stages should be similar to those of a successful activity; after
the point of failure, the key temporal stages should be dissimilar to any key temporal
stage of a successful activity. This model is very natural for three-class
classification. If the point of failure is between any two key stages, then it is an
unsuccessful activity video; if all the stages are successful stages and the point
of failure is not observed, then it is a successful activity video; otherwise, it is a
background video.

Fig. 2. Illustration of our two-layer latent structural model for unsuccessful activity
recognition. The structure model has two layers: the first layer defines the point of
failure (where the activity performer fails to perform the activity); the second layer
defines the temporal positions of a number of key stages. w denotes the model parameters.
fp is the point of failure. (h_1, \ldots, h_K) are the temporal positions of the K key stages.
\phi_A(x, h_k) is the appearance feature vector; when k \leq fp, \phi_A(x, h_k) should be compatible
with the class label +1, since this stage should be similar to a temporal stage of a
successful activity; when k > fp, \phi_A(x, h_k) should be compatible with the class label -1,
since this stage should be dissimilar to any key temporal stage of a successful activity.
\phi_T(h_k) denotes the temporal displacement feature. In recognition, we infer the latent
structure and calculate a score for each temporal stage. We sum over all the
temporal stages to produce a final classification score.

We develop our model based on the general latent structural SVM formulation
[3], but with deeper temporal structures. Given a video x, let z = (h_1, \ldots, h_K, fp)
specify a particular two-layer structure. (h_1, \ldots, h_K) are the second-layer latent
variables, where h_k specifies the temporal position of the k-th temporal stage.
In our model, we normalize h_k by the total number of video frames so that it
falls in the range [0, 1]. The first-layer latent variable fp denotes the point of
failure in unsuccessful activities, and fp is specified as a number in \{0, 1, \ldots, K\}.
This structure means that the person successfully performs the first fp stages and fails
to perform the remaining (K - fp) stages. For successful activities, fp should
be equal to K since all K stages have been successfully performed. For
unsuccessful activities, fp must be in \{1, \ldots, K - 1\}. Similarly, fp must be 0 for
background activities because no key stages have been successfully performed.
Given learned model parameters w, the overall score of a hypothesis z can be
expressed as the dot product between the model parameters and the joint
feature vector \Phi(x, y, z):

f_w(x, y, z) = w \cdot \Phi(x, y, z)    (1)

Eq. (1) measures the compatibility between the input x and output y when the latent
variable z is specified [3, 10]. \Phi(x, y, z) is a concatenation of features extracted
at the different key temporal stages:

\Phi(x, y, z) = [\phi(x, y, h_1), \ldots, \phi(x, y, h_K)]    (2)

\phi(x, y, h_k) is the joint feature vector for the k-th temporal stage, which includes an
appearance feature \phi_A(x, h_k) and a temporal displacement feature \phi_T(h_k). In this
paper, we experiment with two types of appearance features: one is bag-of-words
features (HOG and HOF are used as local features), and the other is human
pose descriptors [14]. The details of extracting appearance features are described
in Sec. 3. Each temporal stage has an anchor point t_k. If a temporal stage
hypothesis has a temporal displacement relative to its anchor point, a deformation
cost is enforced. Similar to [7], \phi_T(h_k) is defined as:

\phi_T(h_k) = (dt_k, dt_k^2), \quad dt_k = h_k - t_k    (3)

where dt_k is the temporal displacement of the k-th key temporal stage relative
to its anchor point t_k.
When y = \pm 1, the feature vector is defined as:

\phi(x, y, h_k) = [y \cdot \phi_A(x, h_k), \phi_T(h_k)]    (4)

For an unsuccessful activity (y = 0), when k \leq fp, \phi_A(x, h_k) should be compatible
with the class label +1, since this stage should be similar to a temporal stage of a
successful activity; when k > fp, \phi_A(x, h_k) should be compatible with the class
label -1, because this stage should be dissimilar to any key temporal stage of a
successful activity. Hence the feature vector is defined as:

\phi(x, y, h_k) = \begin{cases} (+1 \cdot \phi_A(x, h_k), \phi_T(h_k)) & \text{if } k \leq fp \\ (-1 \cdot \phi_A(x, h_k), \phi_T(h_k)) & \text{if } k > fp \end{cases}    (5)

For the k-th stage, we define a stage appearance filter F_k to measure the appearance
response, and a two-dimensional vector d_k as the deformation cost parameter.
Our model parameters are:

w = (F_1, d_1, \ldots, F_K, d_K)    (6)
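For concreteness, the following minimal Python sketch (not the authors' code; names and array conventions are assumptions) assembles the joint feature vector of Eqs. (2)-(5) for one video hypothesis:

    import numpy as np

    def stage_feature(appearance, h_k, t_k, sign):
        # one stage: signed appearance part plus quadratic temporal displacement (Eq. (3))
        dt = h_k - t_k
        return np.concatenate([sign * appearance, [dt, dt * dt]])

    def joint_feature(appearances, h, anchors, y, fp):
        # appearances: list of K vectors phi_A(x, h_k); h, anchors: K normalised positions in [0, 1]
        parts = []
        for k, (a, hk, tk) in enumerate(zip(appearances, h, anchors), start=1):
            if y == 0:
                sign = 1.0 if k <= fp else -1.0   # unsuccessful: sign flips after the point of failure
            else:
                sign = float(y)                   # y = +1 (successful) or y = -1 (background)
            parts.append(stage_feature(a, hk, tk, sign))
        return np.concatenate(parts)              # Phi(x, y, z) of Eq. (2)

    # example: joint_feature([np.random.rand(500) for _ in range(4)],
    #                        [0.1, 0.4, 0.6, 0.9], [0.125, 0.375, 0.625, 0.875], y=0, fp=2)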

2.2 Model Inference


Suppose the model parameters w have been learned. Given a test video, the inference
task is to find a class label and a two-layer structure (y^*, z^*) with the highest
score: (y^*, z^*) = \arg\max_{y,z} [w \cdot \Phi(x, y, z)].
In our two-layer latent structure model, we use a recursive procedure to find
the class label and the best two-layer latent structure. The values of the second-layer
latent variables h_k depend on the value of the first-layer variable fp. For a
class label y, we enumerate all possible values of fp. The set of possible fp is:

FP = \begin{cases} \{K\} & \text{if } y = +1 \\ \{1, \ldots, K - 1\} & \text{if } y = 0 \\ \{0\} & \text{if } y = -1 \end{cases}    (7)

We find that K is usually small (below 5) when good performance is achieved in our
experiments, so it is very quick to try all possible fp. After determining the
value of fp, we use dynamic programming and distance transform techniques
[15, 16], as in [9, 7], to find the values of the second-layer latent variables. The
class label and two-layer structure (y^*, z^*) with the highest score according to Eq. (1)
are returned as the inference result. We show the inference procedure in
Alg. 1.

Algorithm 1. Inference of Our Two-Layer Latent Structural Model
Input: Model parameters w and a test video x
1 forall the y ∈ Y do
2   forall the fp ∈ FP do
3     Extract appearance features from the video x
4     Compute the response of each stage appearance filter F_k
5     Transform the responses to allow temporal displacement using the distance
      transform technique
6     Find the best values of the second-layer latent variables, i.e. those with the
      highest score w \cdot \Phi(x, y, z), using dynamic programming
7   end
8 end
9 Compare all possible labels and hypotheses and find the one with the highest score
  according to Eq. (1)
Output: class label y^* and a two-layer structure z^* (key temporal stage
positions and the point of failure)
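A simplified sketch of this inference loop is given below (Python; score_fn is a hypothetical callback returning the filter response of stage k at normalised position h with appearance sign sign, minus the deformation cost, and the plain per-stage search over candidate positions stands in for the distance-transform machinery used in the paper):

    def failure_points(y, K):
        # Eq. (7): admissible points of failure for each class label
        if y == 1:
            return [K]
        if y == 0:
            return list(range(1, K))
        return [0]                                   # y == -1 (background)

    def infer(score_fn, K, candidates, labels=(1, 0, -1)):
        best = (float('-inf'), None, None)           # (score, label, (stage positions, fp))
        for y in labels:
            for fp in failure_points(y, K):
                h, total = [], 0.0
                for k in range(1, K + 1):
                    sign = 1 if (y != 0 or k <= fp) else -1
                    s, hk = max((score_fn(k, c, sign), c) for c in candidates)
                    h.append(hk)
                    total += s
                if total > best[0]:
                    best = (total, y, (h, fp))
        return best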

2.3 Model Learning
Similar to latent SVM training in object detection [9, 10] and activity recognition
[7], we train our model in a weakly labeled setting, since it is prohibitive to label
the point of failure or the temporal position of each stage. Let D = \{x_i, y_i\}, i =
1, \ldots, N be a set of labeled examples, where y_i \in \{-1, 0, 1\}. We learn the model
parameters by solving the following optimization problem:

\min_{w, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i    (8)

s.t. \; \max_z w \cdot \Phi(x_i, y_i, z) - \max_{(y \neq y_i, z)} w \cdot \Phi(x_i, y, z) \geq 1 - \xi_i, \quad \xi_i \geq 0, \; \forall i    (9)

where \frac{1}{2}\|w\|^2 is a regularization term and \xi_i is the slack variable for the i-th
training example. The problem attempts to learn model parameters w such that the score
of the ground-truth label y_i, given its best latent structure, is greater
than the score of any other class label, given its best latent structure,
by at least 1.
As mentioned in previous works [9, 3, 7, 10], this two-layer latent structural
SVM model leads to a nonconvex optimization problem. But after we fix the
value of z for each (x_i, y_i), the optimization problem becomes convex. Similar

to previous works [3, 10, 5], we can obtain a local optimum of Eq. (8) using the
following iterative procedure:

Step 1: Fixing w, infer the latent variables z for each (x_i, y_i):
z_i = \arg\max_z w \cdot \Phi(x_i, y_i, z).
Step 2: Holding z_i fixed, optimize w by solving a standard structural SVM learning
problem.

The task of Step 1 is the same as the inference problem described in Sec. 2.2. In
Step 2, following [9], we use a stochastic gradient descent method to train the
model. Another choice is the cutting plane algorithm in [17, 3, 10].
In stochastic gradient descent, we randomly select a training example x_i; the
hinge loss is:

L_i = \max\left(0, \; 1 - w \cdot \Phi(x_i, y_i, z_i) + \max_{(y, z)} w \cdot \Phi(x_i, y, z)\right)    (10)

The gradient is calculated as:

g = w + C \, \partial L_i    (11)

\partial L_i(w, x_i, y_i) = \begin{cases} 0 & \text{if } L_i = 0 \\ \Phi(x_i, y^*, z^*) - \Phi(x_i, y_i, z_i) & \text{if } L_i > 0 \end{cases}    (12)

where
(y^*, z^*) = \arg\max_{(y, z)} w \cdot \Phi(x_i, y, z)    (13)

Finding (y^*, z^*) is also equivalent to an inference problem. We then take a step
to update w:

w_{t+1} = w_t - \eta_t g    (14)

We use an adaptive learning rate \eta_t and set \eta_t = 10/(1000 + t) in our experiments.
We average the temporal positions over all training data for each temporal stage to
obtain the new temporal stage anchor points.
Our learning algorithm is summarized in Alg. 2.
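One stochastic-gradient step of Eqs. (10)-(14) can be sketched as follows (Python; phi and argmax_over_labels are hypothetical helpers implementing Eq. (2) and Eq. (13), and the standard hinge-loss subgradient is used):

    import numpy as np

    def sgd_step(w, x_i, y_i, z_i, phi, argmax_over_labels, C, t):
        phi_true = phi(x_i, y_i, z_i)                          # Phi for the ground-truth label
        y_star, z_star = argmax_over_labels(w, x_i)            # Eq. (13): best competing hypothesis
        phi_star = phi(x_i, y_star, z_star)
        loss = max(0.0, 1.0 - w @ phi_true + w @ phi_star)     # Eq. (10)
        d_loss = np.zeros_like(w) if loss == 0.0 else (phi_star - phi_true)
        g = w + C * d_loss                                     # Eqs. (11)-(12)
        eta = 10.0 / (1000.0 + t)                              # adaptive learning rate
        return w - eta * g                                     # Eq. (14)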

2.4 Model Initialization
In such an iterative learning procedure, model initialization is very important.
In this paper, we train a standard binary SVM classifier using LIBSVM [18]
to initialize the appearance model. We use successful activity videos as
positive examples and background activity videos as negative examples. Each
video is evenly decomposed into K temporal segments, and we take the center
of each segment as the initial temporal position of the corresponding stage. This
simple initialization method works well in our experiments. The displacement cost
parameters for each key temporal stage are initialized as d_k = (0, 1).
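A tiny sketch of the temporal part of this initialization (the appearance filters themselves would come from the binary SVM trained with LIBSVM, which is not shown here; the function name is illustrative):

    def init_temporal_parameters(K):
        # K evenly spaced segments in [0, 1]; each stage is anchored at its segment centre
        anchors = [(k + 0.5) / K for k in range(K)]
        deform = [(0.0, 1.0)] * K          # d_k = (0, 1) for every stage
        return anchors, deform

    print(init_temporal_parameters(4))     # anchors at 0.125, 0.375, 0.625, 0.875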

Algorithm 2. Our Learning Algorithm
Input: Training data D = \{x_i, y_i\}, i = 1, \ldots, N
1 Initialize the model parameters w_0
2 repeat
3   for i = 1 to N do
4     Use Alg. 1 to infer the best two-layer structure z_i for x_i
5   end
6   for t = 1 to T_max do
7     Randomly select one training example and compute the gradient g using Eq. (11)
8     Update the model parameters using Eq. (14)
9   end
10 until max iteration;
Output: Model parameters w

3 Implementation Details
We tried two types of features in our experiments. The first is a bag-of-words
representation (HOG and HOF are used as local features). We extract these
low-level features from depth videos. We also compare with baselines which
extract bag-of-words features from RGB videos. The second is the geometric
pose descriptor [14], used to represent human poses.

3.1 HOG/HOF Features
Low-level spatio-temporal features with a bag-of-words representation have
recently achieved great success in activity recognition [1, 2, 7, 8]. Following [1],
we first detect spatio-temporal interest points (STIPs) [19] in each video and
then extract a histogram of gradients (HOG) and a histogram of flow (HOF) for
each STIP. Each STIP is represented by a 162-dimensional feature (72 for HOG and 90
for HOF). A bag-of-words representation is employed to summarize these
low-level features. We cluster all HOG/HOF features from the training videos
with the k-means algorithm and keep V centers as the codebook (in our experiments,
V = 500). Each key temporal stage is then represented as a 500-dimensional
vector. Each key temporal stage is assumed to last for 10%-20% of all
the video frames. (We usually have 3-6 key stages.)

3.2 Geometric Pose Descriptor (GPD) Features
Human poses are very discriminative for activity recognition [20, 21]. In depth
videos, we can obtain human poses using the state-of-the-art method of [22]. In
our experiments, we save the human pose information when collecting videos with
the Kinect sensor. Some poses are shown in Fig. 2.
We use the geometric pose descriptor (GPD) [14] to represent human poses.
After the human poses are obtained, a set of joints, lines and planes is defined

on the human skeleton. In our experiments, we only use joint-joint orientations to
describe 3D poses, since we find them more discriminative for our problem. Each
human pose is described by 16 3D joint points. Suppose (J_x^i, J_y^i, J_z^i) and (J_x^j, J_y^j, J_z^j)
are the 3D coordinates of two joints J^i and J^j respectively; we use the normalized
vector f_{J^i J^j} to represent the orientation from J^i to J^j:

f_{J^i J^j} = \left( \frac{J_x^j - J_x^i}{norm(J^i, J^j)}, \; \frac{J_y^j - J_y^i}{norm(J^i, J^j)}, \; \frac{J_z^j - J_z^i}{norm(J^i, J^j)} \right)    (15)

norm(J^i, J^j) = \sqrt{(J_x^j - J_x^i)^2 + (J_y^j - J_y^i)^2 + (J_z^j - J_z^i)^2}    (16)

We represent each pose as a 16 \times 15 \times 3 / 2 = 360-dimensional GPD feature.
Note that when we use GPD features, each temporal stage only has one frame.
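The joint-joint orientation part of the GPD feature can be sketched as follows (Python; the assumption is that joints is a (16, 3) array of 3D joint coordinates from the Kinect skeleton):

    import numpy as np

    def gpd_orientations(joints, eps=1e-10):
        # one unit vector per unordered joint pair (Eqs. (15)-(16)); 16*15/2 pairs * 3 = 360 dims
        feats = []
        n = len(joints)
        for i in range(n):
            for j in range(i + 1, n):
                v = joints[j] - joints[i]
                feats.append(v / (np.linalg.norm(v) + eps))
        return np.concatenate(feats)

    pose = np.random.rand(16, 3)
    print(gpd_orientations(pose).shape)   # (360,)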

4 Experiments and Results

4.1 Dataset
Since there are no datasets for unsuccessful activity recognition, we collected a
new one using the Microsoft Kinect sensor. It contains 7 daily human activities:
drinking, dressing, undressing, calling, mopping, get up and stand up. We invited
10 volunteers, and each activity is performed by about 8 of them. For each activity,
they were instructed to perform its three versions: successful, unsuccessful and
background. Our dataset includes about 650 activity video clips for RGB and depth
respectively. Each video has 200-450 frames. We collected RGB videos only for
research study and comparison. In real applications, we would capture only depth
videos to preserve privacy. The resolution of both the RGB and depth video frames
is 320 × 240. The frame rate of both the RGB and depth videos is 30 frames per
second. We recorded the videos using a stationary sensor in a lab. Many previous
datasets have more complicated backgrounds; however, for health care applications,
an indoor environment is very typical.

4.2 Baselines
We compare our approach with three baselines. In Baseline 1, we represent the
whole video with one bag-of-words feature vector (HOG and HOF are used as
local features). For classification, we use a support vector machine with a \chi^2
kernel.
In Baseline 2, a video is partitioned into spatio-temporal grids. In our experiments,
we use three types of spatio-temporal grids: 1 × 1 × t1, 1 × 1 × t2 and
h3 × 1 × t1. More details can be found in [1].
We apply these two baseline methods to RGB and depth videos respectively.
Furthermore, we also compare our two-layer latent structural SVM with the
one-layer version (for each activity, we train one one-layer latent structural SVM
model for each of the three categories in the 1-vs-rest fashion).

Table 1. Recognition results of our two-layer latent structural SVM model and three
baseline approaches. Baseline 1 is the standard bag-of-words method and Baseline 2 is
spatio-temporal pyramid matching [1]. We also compare our approach with the one-layer
latent structural model. RGB and Depth denote RGB videos and depth videos respectively.
We also try two types of features (HOG/HOF features and GPD features) with our model.
The column K = 4 shows the results of our model with 4 temporal stages, and the column
Best K reports the results of our model with the best K for each activity. When people lie
on a bed, we can't get joints using the SDK, so the result for "get up" using the GPD
feature is not available. A01: Calling, A02: Dressing, A03: Drinking, A04: Stand Up,
A05: Get Up, A06: Mopping, A07: Undressing.

Method                              A01    A02    A03    A04    A05    A06    A07    Avg.
Baseline1 (RGB)                     63.5%  72.9%  66.7%  72.9%  89.2%  79.8%  72.6%  73.9%
Baseline1 (Depth)                   64.6%  68.8%  66.7%  60.4%  89.2%  85.7%  64.3%  71.4%
Baseline2 (RGB)                     68.8%  74.0%  76.0%  80.2%  92.8%  84.5%  72.6%  78.4%
Baseline2 (Depth)                   64.6%  67.7%  69.8%  69.8%  86.9%  84.5%  67.9%  73.0%
One-Layer Model (GPD, K = 4)        65.6%  61.5%  65.6%  57.9%  N.A.   72.6%  56.0%  63.2%
One-Layer Model (GPD, Best K)       65.6%  65.6%  65.6%  69.5%  N.A.   73.8%  64.3%  67.4%
One-Layer Model (HOG/HOF, K = 4)    72.9%  70.8%  77.1%  72.9%  85.7%  81.0%  64.3%  75.0%
One-Layer Model (HOG/HOF, Best K)   72.9%  70.8%  77.1%  72.9%  86.9%  84.5%  64.3%  75.6%
Our Model (GPD, K = 4)              81.3%  62.5%  80.2%  75.0%  N.A.   81.0%  54.8%  72.5%
Our Model (GPD, Best K)             82.3%  70.8%  80.2%  80.2%  N.A.   81.0%  56.0%  75.1%
Our Model (HOG/HOF, K = 4)          86.5%  76.0%  85.4%  92.7%  92.9%  81.0%  65.5%  82.9%
Our Model (HOG/HOF, Best K)         86.5%  76.0%  86.5%  92.7%  92.9%  86.9%  71.4%  84.7%

4.3 Results of Unsuccessful Human Activity Recognition
In this section, we compare our two-layer latent structural SVM with the baselines
on unsuccessful activity recognition. We use the leave-one-out method in our
experiments: each time, the data of one performer is used for testing and all the
other data for training. All of the experiments were run on a 2.8 GHz Intel Xeon PC.
It takes about 2 hours to train an activity model (K = 4) on training data with 84
videos, and inference takes around 15 seconds on a video with 400 frames. We report
the average accuracy over all performers. The results of the different methods are
summarized in Table 1. When people lie on a bed, we can't get joints using the SDK,
so the result for "get up" using the GPD feature is not available. We can see that our
method significantly outperforms the baseline methods on almost all the activities.
This suggests the effectiveness of our two-layer latent structural model for unsuccessful
human activity recognition. With our model, HOG/HOF features work better than GPD
features. One possible reason is that with HOG/HOF features multiple frames are used
to represent each temporal stage, while with GPD features only one frame can be used
to represent each temporal stage. Fig. 3 shows the confusion matrices obtained by our
model with HOG/HOF features.
In Fig. 5, we show the two-layer structure inferred by our approach on two
unsuccessful activity videos. Our method can find the point of failure and the
positions of the different temporal stages.

Fig. 3. Confusion matrices of our model using HOG/HOF features (Best K). B, U
and S are short for Background, Unsuccessful and Successful respectively.

Fig. 4. The effect of different temporal stage sizes. We show the recognition accuracy of
each class for different stage sizes. The top image shows the results using GPD features;
the bottom image shows the results using HOG/HOF features.

4.4 Effect of Different Temporal Stage Sizes
In this section, we evaluate the effect of different temporal stage sizes. We test
four sizes: K = 3, 4, 5, 6. We perform the evaluation with geometric pose descriptors
and HOG/HOF features respectively. The results are shown in Fig. 4.
For HOG/HOF features, the best stage size is K = 3; for GPD features, the
best stage size is K = 4.

Fig. 5. Two examples of latent structure inferred using our learned model (K = 3)
with HOG/HOF features. The top image shows an unsuccessful "get up" video and
the bottom image shows an unsuccessful "calling" video. We use red boxes to represent
unsuccessful temporal stages and green boxes to represent successful temporal stages.
(RGB images are for illustration only; our model is built on depth videos.)

5 Conclusions and Discussions
We have introduced unsuccessful activity recognition, which is an interesting new
research problem with much left to investigate. We propose a two-layer latent
structural SVM model, which can deal with the three-class (successful, unsuccessful,
and background) classification problem with a single model, and is able to fully take
advantage of the information shared among the three categories. Our model can also
be used in other applications which need to model multiple classes and deeper
structures. Promising results have been presented. In the future, we are interested
in combining cameras with other types of sensors, such as motion sensors, to further
improve the performance. We are also interested in collecting more realistic data from
hospitals or nursing homes to study unsuccessful activity recognition more deeply.

Acknowledgments. This study is supported by the research grant for the
Human Sixth Sense Programme at the Advanced Digital Sciences Center from
Singapore's Agency for Science, Technology and Research (A*STAR).

References
1. Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human
actions from movies. In: Proc. CVPR (2008)
2. Liu, J., Luo, J., Shah, M.: Recognizing realistic actions from videos in the wild.
In: Proc. CVPR (2009)

3. Yu, C.N.J., Joachims, T.: Learning structural svms with latent variables. In: Proc.
ICML (2009)
4. Aggarwal, J.K., Ryoo, M.S.: Human activity analysis: A review. ACM Computing
Surveys 43(3), 16 (2011)
5. Wang, Y., Mori, G.: Max-margin hidden conditional random fields for human ac-
tion recognition. In: Proc. CVPR (2009)
6. Ni, B., Yan, S., Kassim, A.A.: Recognizing human group activities with localized
causalities. In: Proc. CVPR (2009)
7. Niebles, J.C., Chen, C.-W., Fei-Fei, L.: Modeling Temporal Structure of Decompos-
able Motion Segments for Activity Classification. In: Daniilidis, K., Maragos, P.,
Paragios, N. (eds.) ECCV 2010, Part II. LNCS, vol. 6312, pp. 392–405. Springer,
Heidelberg (2010)
8. Ryoo, M.S.: Human activity prediction: Early recognition of ongoing activities from
streaming videos. In: Proc. ICCV (2011)
9. Felzenszwalb, P.F., Girshick, R.B., McAllester, D.A., Ramanan, D.: Object detec-
tion with discriminatively trained part-based models. IEEE Trans. Pattern Anal.
Mach. Intell. 32(9), 1627–1645 (2010)
10. Zhu, L., Chen, Y., Yuille, A.L., Freeman, W.T.: Latent hierarchical structural
learning for object detection. In: Proc. CVPR (2010)
11. Li, W., Zhang, Z., Liu, Z.: Action recognition based on a bag of 3d points. In:
CVPR Workshops (2010)
12. Sung, J., Ponce, C., Selman, B., Saxena, A.: Human activity detection from rgbd
images. In: AAAI Workshop (2011)
13. Ni, B., Wang, G., Moulin, P.: Rgbd-hudaact: A color-depth video database for
human daily activity recognition. In: 1st IEEE Workshop on Consumer Depth
Cameras for Computer Vision, in Conjunction with ICCV (2011)
14. Chen, C., Zhuang, Y., Nie, F., Yang, Y., Wu, F., Xiao, J.: Learning a 3d human
pose distance metric from geometric pose descriptor. IEEE Trans. Vis. Comput.
Graph. 17(11), 1676–1689 (2011)
15. Felzenszwalb, P.F., Huttenlocher, D.P.: Distance transforms of sampled func-
tions. Cornell Computing and Information Science Technical Report TR2004-1963
(September 2004)
16. Felzenszwalb, P.F., Huttenlocher, D.P.: Pictorial structures for object recognition.
IJCV 61(1), 55–79 (2005)
17. Tsochantaridis, I., Hofmann, T., Joachims, T., Altun, Y.: Support vector machine
learning for interdependent and structured output spaces. In: Proc. ICML (2004)
18. Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines. ACM
Transactions on Intelligent Systems and Technology 2, 27:1–27:27 (2011), Software
available at http://www.csie.ntu.edu.tw/~ cjlin/libsvm
19. Laptev, I.: On space-time interest points. International Journal of Computer Vi-
sion 64(2-3), 107–123 (2005)
20. Yao, B., Li, F.F.: Modeling mutual context of object and human pose in human-
object interaction activities. In: Proc. CVPR (2010)
21. Yang, W., Wang, Y., Mori, G.: Recognizing human actions from still images with
latent poses. In: Proc. CVPR (2010)
22. Shotton, J., Fitzgibbon, A.W., Cook, M., Sharp, T., Finocchio, M., Moore, R.,
Kipman, A., Blake, A.: Real-time human pose recognition in parts from single
depth images. In: Proc. CVPR (2011)
Action Recognition Using Subtensor Constraint

Qiguang Liu and Xiaochun Cao

School of Computer Science and Technology


Tianjin University, Tianjin, China
{qliu,xcao}@tju.edu.cn

Abstract. Human action recognition from videos has drawn tremendous
interest over the past many years. In this work, we first find that the trifocal
tensor resides in a twelve-dimensional subspace of the original space if
the first two views are already matched and the fundamental matrix between
them is known, which we refer to as the subtensor. We then use the
subtensor to perform the task of action recognition under three views.
We find that treating the two template views separately, or not considering
the correspondence relation already known between the first two
views, omits a lot of useful information. Experiments and datasets are
designed to demonstrate the effectiveness and improved performance of
the proposed approach.

1 Introduction
As one of the most active recognition tasks in computer vision, human action
recognition from videos draws tremendous interest [1–3]. In this work, we focus on
the task of action recognition, which can be stated as follows: given a video sequence
containing one person performing an action, design a system that automatically
recognizes which action is performed in this video.
Approaches targeting different challenges in action recognition have been intensively
studied. An action can be decomposed into spatial and temporal parts. Image
features that are discriminative with respect to the posture and motion of the
human body are first extracted. Using these features, spatial representations that
discriminate actions can be constructed. Many approaches represent the spatial
structure of actions with reference to the human body [4–7]. Holistic representations
capture the spatial structure by densely computing features on a regular
grid bounded by a detected region of interest (ROI) centered around the person
[8–10]. Other approaches recognize actions based on the statistics of local features
from all regions, relying neither on explicit body part labeling nor on
explicit human detection and localization [11–13].
Different representations can also be used to learn the temporal structure of actions
from these features. Some approaches represent an action as a sequence of
moments, each with its own appearance and dynamics [14][15]. Some methods
attempt to directly learn the complete temporal appearance from example
sequences [16]. Temporal statistics attempt to model the appearance of actions
without an explicit model of their dynamics [17–19].


View-invariance is also of great importance in action recognition. It aims to find representations or matching functions that are independent of the assumed camera model. Some methods search for view-invariant representations in 3D [20], e.g. 3D body part trajectories [21], 3D shape context and spherical harmonics [22], etc. Other approaches find view-invariant representations in 2D, e.g. histogramming [23], motion history volumes [24], frame-to-frame self-similarities within a sequence [25], and transfer learning [26]. A very promising group of approaches [27–30] explores constraints for recognizing an action based on the theory of multi-view geometry. Our method falls into this group.
In this work, we recognize an action under three views, which comprise two template videos of an action captured from two different views and a testing video with unknown action and view. We assume that the human body joints representing the poses in every frame of the videos are already tracked, and that their correspondences in the template views are matched. A constraint in three views, called the subtensor constraint, is extracted to match the same pose in different videos. Dynamic Time Warping is then applied to align the frame sequences in the three views, and the average matching score is used as the similarity measure between the template views and the test view.

2 Related Work

The authors in [27] use a similarity measure that is not affected by camera viewpoint changes for matching two actions. Based on affine epipolar geometry, they claim that two trajectories of the same action match if and only if the observation matrix is of rank at most 3. However, their approach assumes an affine camera model and treats individual body points separately, ignoring the spatial relation between them.
The work in [28] demonstrates the plausibility of the assumption that the proportions between human body parts can be captured by a projective transformation. The fundamental matrix of epipolar geometry is then applied as a constraint for matching poses in the testing video and a template video. In [31], a two-body fundamental matrix is used to capture the geometry of dynamic action scenes and helps derive an action matching score across different views without any prior segmentation of actors. Since the proposed rank constraint treats individual poses separately, dynamic time warping is used in a separate step to enforce temporal consistency.
To address the temporal information, the work in [29] develops a constraint called the fundamental ratio to capture the pose transition between successive frames. But for estimating the two epipoles of two virtual cameras, they assume the second view undergoes some non-general motion relative to the first view. The authors in [30] use the property that a planar homography has two equal eigenvalues to match the pose transition. Though these methods treat sets of body joints separately instead of imposing an overall constraint on the whole body, they rely on globally estimated epipoles, which again are constrained by the epipolar geometry.

Fig. 1. Two poses of a basketball player playing basketball. An action can be viewed as a collection of static poses represented by the body joints.

All the above approaches under perspective geometry work in the framework of two-view geometry: one view is used for the template video of an action, and the other for the testing video. Our work adds more information and certainty for recognizing an action by introducing another template view. The reasons why we choose three views are the following. First, by adding only one template view, the information for matching an action is doubled without the burden of maintaining a large template dataset. Also, tri-view geometry is well studied in the literature, giving a clear and understandable way of finding constraints under this framework. One may think that the problem becomes trivial since one can reconstruct the template poses in 3D given two template views. However, knowing only the point correspondences, the pose can only be reconstructed up to a projective ambiguity, which distorts the actual 3D model in an unexpected way. Though finding the right projection may match the 3D pose with the 2D test view, the projection and matching errors will depend on the degree of distortion introduced by the projective transformation determining the reconstruction. We therefore believe that indirectly reconstructing the 3D template pose, instead of extracting a direct view-invariant geometric constraint, is not an effective approach.
The contributions of this paper include: 1) We show that the trifocal tensor lies in a twelve-dimensional subspace of the original 27-D space if the first two views are already matched and the fundamental matrix between them is known, which we refer to as the subtensor constraint; 2) We use this subtensor constraint for recognizing actions, which requires fewer matching points but achieves better performance; 3) We experimentally show that treating the two template views separately, or not considering the correspondence relation already known between the first two views, omits a lot of useful information.

3 Our Method
As shown in Fig. 1, an action can be viewed as a sequence of frames, each of which contains a static pose. Each pose can be represented by the joints or anatomical landmarks (spheres in Fig. 1) of the body, which contain sufficient information for recognizing an action [28]. Our approach is designed to find constraints on the joints under three views. Suppose we have a testing video in which the action

it represents is unknown, and two template videos recorded for a known action under two different views. We can consider the unknown video as a third view of the known action if these three views satisfy certain tri-view geometry properties. Hence, we want to find a constraint on the body joint correspondences whose degree of satisfaction implies the similarity between the testing action and the template actions. We assume that the 2D image coordinates of the body joints are already extracted and recognized from the video sequence or synthesized from motion capture data.

3.1 Subtensor Constraint

Our goal is to find an overall constraint between the joints of the pose in a frame of the testing video and the joints of the two corresponding poses in the two template action videos. Using the two template views separately is clearly not a good solution. Since our approach works under three views, another naive proposal would be to use the trifocal tensor [32]. But directly using the trifocal tensor omits a lot of useful information, given our assumption that the two views in the two template videos are already correctly matched. Though the trifocal tensor contains 27 parameters that form a 27-D space, we will first show that it actually resides in a 12-dimensional subspace of the original 27-D space if we already know the correct correspondence relationship between the first and the second views.
If we have two views, views 1 and 2, of the same pose and the joint coordinates in each view, the correspondence relation between the joints in the two views is partially encoded by the fundamental matrix F21, so that a pair of corresponding points x, x' in the first two views satisfies x'^T F21 x = 0. We can first estimate F21 from the known correspondences between two corresponding frames of the two template videos. In the third view, the testing video, a line l'' induces a homography H12 mapping points from the first view to the second view through the trifocal tensor T = [T1, T2, T3]:

    H12 = [T1, T2, T3] l'' ,                                        (1)

where [T1, T2, T3] l'' denotes the linear combination Σ_i l''_i T_i. The homography H12 is compatible with the fundamental matrix F21 if and only if the matrix H12^T F21 is skew-symmetric, so

    H12^T F21 + F21^T H12 = 0_{3×3} .                               (2)

Putting (1) and (2) together, one obtains

    ([T1, T2, T3] l'')^T F21 + F21^T ([T1, T2, T3] l'') = 0_{3×3} .  (3)

If we write F21 as [f1 f2 f3]^T, a series of deductions from (3) yields

    ( F21 T1 + ( [T1^T, T2^T, T3^T] f1 )^T ) l'' = 0_{3×1} ,         (4)

    ( F21 T2 + ( [T1^T, T2^T, T3^T] f2 )^T ) l'' = 0_{3×1} ,         (5)


    ( F21 T3 + ( [T1^T, T2^T, T3^T] f3 )^T ) l'' = 0_{3×1} .         (6)

Since equations (4), (5) and (6) hold for an arbitrary line l'' in the third view, the following equations must be true:

    F21 T1 + ( [T1^T, T2^T, T3^T] f1 )^T = 0_{3×3} ,                 (7)

    F21 T2 + ( [T1^T, T2^T, T3^T] f2 )^T = 0_{3×3} ,                 (8)

    F21 T3 + ( [T1^T, T2^T, T3^T] f3 )^T = 0_{3×3} .                 (9)

Equations (7), (8) and (9) yield a linear system of 27 equations, of which only 15 are linearly independent; the reader can verify this by expanding (7), (8) and (9) with respect to t, the vector form of T. This means that the trifocal tensor T resides in a 12-dimensional subspace once F21 is known. Any valid trifocal tensor, written in vector form t, can therefore be expressed as

    t = B t̃ ,                                                        (10)

where B is a 27 × 12 matrix whose columns are the basis vectors spanning the subspace determined by the 15 independent constraints, and t̃ is a 12-dimensional vector which we refer to as the subtensor.
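Because equations (7)-(9) are linear in t, the basis B can be computed numerically without deriving the 27 × 27 coefficient matrix by hand. The sketch below (Python/NumPy; our illustration, not the authors' code) probes the linear map defined by (7)-(9) with the 27 canonical basis vectors and returns a null-space basis; it assumes our reading of the notation [T1^T, T2^T, T3^T] f_k as Σ_i (f_k)_i T_i^T and an arbitrary but fixed vectorization t = vec(T).

```python
import numpy as np

def subtensor_basis(F21):
    """Sketch: 27x12 basis B of trifocal tensors compatible with a known
    fundamental matrix F21 (Eqs. 7-9). The map t -> residuals of (7)-(9) is
    linear, so we build its matrix column by column and take the null space."""
    F21 = np.asarray(F21, dtype=float)
    M = np.zeros((27, 27))
    for k in range(27):
        t = np.zeros(27)
        t[k] = 1.0
        T = t.reshape(3, 3, 3)                            # T[i] plays the role of T_i
        residuals = []
        for j in range(3):                                # one 3x3 residual per Eq. (7)-(9)
            f_j = F21[j]                                  # j-th row of F21
            lin = sum(f_j[i] * T[i].T for i in range(3))  # [T1^T, T2^T, T3^T] f_j
            residuals.append(F21 @ T[j] + lin.T)
        M[:, k] = np.concatenate([r.ravel() for r in residuals])
    # For a valid F21, M has rank 15; the 12 right singular vectors belonging
    # to the smallest singular values span the subtensor subspace.
    _, _, Vt = np.linalg.svd(M)
    return Vt[-12:].T                                     # 27 x 12 matrix B
```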
A point correspondence {x, x', x''} across the three views must satisfy

    [x']_× ( Σ_i x_i T_i ) [x'']_× = 0_{3×3} ,                       (11)

where i ranges from 1 to 3 and [.]_× denotes the skew-symmetric cross-product matrix. Four linearly independent equations constraining T can be drawn from (11). Using the vector form t of T, these constraints can be expressed as

    Ã t = 0_{4×1} ,                                                  (12)

where Ã is a 4 × 27 matrix obtained from the coordinates of {x, x', x''}. Since every point correspondence across three views yields one Ã and provides 4 constraints, n correspondences give 4n constraints and form a larger matrix A of size 4n × 27 by concatenating the Ã of every correspondence, so that

    A t = 0_{4n×1} .                                                 (13)

Putting (10) and (13) together, we obtain

    G t̃ = 0_{4n×1} ,                                                 (14)

where G = AB. Since F21 already encodes part of the correspondence relation between the first and second views, every point correspondence now provides only two independent constraints under this new form, instead of four.

[Figure 2: four frame-wise similarity matrices (a)-(d); horizontal axis: testing frame #, vertical axis: template frame #.]

Fig. 2. Subtensor constraint. (a) Frame-wise similarity estimated from three views of the template jumping jacks action. (b) Frame-wise similarity using two views of the template action and one view of the same action played by a different actor. (c) Frame-wise similarity between the template action and a shoot basketball action. (d) Frame-wise similarity between the template action and a pirouette in dance action.

As the dimension of t̃ is 12, six correspondences are theoretically sufficient to estimate t̃, fewer than the seven correspondences necessary for estimating t. In the context of action recognition, the same body pose recorded in different videos can be considered as being recorded under different views, and the same body joint in different videos can be viewed as a point correspondence. So there must exist a subtensor t̃ approximately satisfying (14) for all the body joints. For such a t̃ to exist and to be unique up to a scale factor, the rank of G should be 11. In other words, σ12, the smallest singular value of G, should be 0, which formulates our subtensor constraint. Because of noise and anthropometric differences, σ12 may not be exactly 0. So, similar to [28], we use the quotient of the smallest and largest singular values of G, Eg = σ12/σ1, to indicate how well the constraint is satisfied.
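For completeness, a matching sketch of the scoring step (again NumPy, our illustration, using the same vectorization of t as above): it stacks the linear constraints of Eq. (11) for a set of homogeneous joint correspondences, forms G = AB and returns Eg. For simplicity it keeps all nine rows of Eq. (11) per correspondence instead of only the four independent ones, and it assumes image coordinates have been normalized beforehand.

```python
import numpy as np

def skew(v):
    """Cross-product matrix [v]_x of a homogeneous 3-vector."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def incidence_rows(x, xp, xpp):
    """Rows of the linear system in t implied by Eq. (11) for one
    correspondence (x, x', x'') across the three views."""
    A = np.zeros((9, 27))
    for k in range(27):
        t = np.zeros(27)
        t[k] = 1.0
        T = t.reshape(3, 3, 3)                      # same vectorization as for B
        M = sum(x[i] * T[i] for i in range(3))      # sum_i x_i T_i
        A[:, k] = (skew(xp) @ M @ skew(xpp)).ravel()
    return A

def subtensor_score(correspondences, B):
    """Eg = sigma_min / sigma_max of G = A B (Sec. 3.1) for a list of
    homogeneous correspondences [(x, x', x''), ...]; small Eg means the
    three observed poses are consistent with a common 3D pose."""
    A = np.vstack([incidence_rows(*c) for c in correspondences])
    G = A @ B
    s = np.linalg.svd(G, compute_uv=False)
    return s[-1] / s[0]
```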
Figure 2 shows an intuitive example of the effectiveness of the proposed constraint. Using the above constraint, we check the frame-wise similarity between the two template videos of the jumping jacks action and another action video. First, we estimate Eg between every frame of a third view of the template action and every frame of the two template videos, shown in Fig. 2 (a). The row number in the figure gives the frame number of the two template videos, and the column number gives the frame number of the video to be checked. Since the three videos are three views of the template action, their frames are correctly matched and precisely satisfy the constraint. Therefore, values on the upper-left to lower-right diagonal in Fig. 2 (a) should be zero and form the darkest line in the figure. Because of the smooth movement of the body and the small difference between successive poses, Eg gives small values for similar poses around the dark line and large values elsewhere. Fig. 2 (a) also captures the temporally symmetric property of this action, resulting in the symmetric texture. Fig. 2 (b) shows the frame-wise similarity between the two views of the template action and a view of the same action played by another actor. One can also notice a dark path around the upper-left to lower-right diagonal of the figure and a texture similar to Fig. 2 (a), which indicates the similarity between the actions.

We also estimate the frame-wise similarity between the template action and two other actions, shoot basketball (Fig. 2 (c)) and pirouette in dance (Fig. 2 (d)). No dark path can be found in these two figures as in (a) and (b), implying actions different from the template action.

3.2 Action Recognition


To perform action recognition, we need to decide to which of several classes of actions a testing action belongs. To do this, we maintain an action database for the K classes of actions. There are two template image sequences for each action, captured from two different views, {I^t_{1k}} and {I^t_{2k}}, where k ranges from 1 to K and t is the frame index. The frames and joints of the two templates are matched beforehand. Given a testing video {I^t_3}, dynamic time warping (DTW) is first applied, similar to [28], using Eg to temporally align the templates and the testing video. The action is then assigned to the class c whose templates have the smallest mean alignment error with the testing video:

    c = argmin_c (1/|P|) Σ_{(i,j)∈P} Eg(I^i_{1c}, I^i_{2c}, I^j_3) ,   (15)

where P is the set of matched frame pairs after the DTW alignment.
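A minimal sketch of this alignment and classification step (Python/NumPy; our illustration, with function and variable names of our choosing): a standard DTW pass over a precomputed frame-wise cost matrix of Eg values recovers the matched pairs P, and the class with the smallest mean alignment error is selected as in Eq. (15).

```python
import numpy as np

def dtw_mean_error(cost):
    """Dynamic time warping on a frame-wise cost matrix
    cost[i, j] = Eg(template frame i, testing frame j); returns the mean
    cost along the optimal alignment path, i.e. the inner term of Eq. (15)."""
    n, m = cost.shape
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j],
                                               D[i, j - 1],
                                               D[i - 1, j - 1])
    # Backtrack to recover the set P of matched frame pairs.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return float(np.mean([cost[a, b] for a, b in path]))

def classify(cost_matrices_per_class):
    """Assign the testing video to the class whose templates give the
    smallest mean alignment error (Eq. 15)."""
    return int(np.argmin([dtw_mean_error(c) for c in cost_matrices_per_class]))
```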

4 Experiments and Discussion


In this section, we will rst introduce the two datasets used to evaluate the pro-
posed approach. Then, experimental results are given. Simulation is used to prove
the robustness of the approach under view changes and noises. Then quantitative
results are given using the datasets to compare our and other approaches.

4.1 Datasets
Our approach requires two template views in which the correspondences are matched. For evaluation, however, there is no publicly available dataset that provides matched correspondences in two views to serve as templates, not even the multi-view IXMAS dataset [24] widely used for view-independent recognition. We therefore evaluate our approach on two challenging datasets generated from motion capture data under different views. We use 11 body joints: root, left femur, left tibia, right femur, right tibia, lower neck, head, left humerus, left wrist, right humerus and right wrist. Though these are simulated clean data, the actions are played by different actors in different trials, varying largely in speed, style, anthropometric proportions, viewpoint, etc. Therefore, these data are representative of the types of testing poses in real videos.


Fig. 3. Dataset generation. (a) Views used to generate the two datasets; the small red spheres represent the camera locations. (b) and (c) Example images from the CMU mocap dataset. (d) and (e) Example images from the HDM05 mocap dataset.

1) CMU Motion Capture Dataset: The first dataset is generated using the CMU motion capture database. Six classes of actions are used: walk, run, jump, soccer, climb ladder and walk on steps. Each class is played by at least two actors in three trials each. One trial is used for generating the template views; the other two trials are used for generating the testing videos.
2) HDM05 Motion Capture Dataset: The second dataset is generated using the Motion Capture Database HDM05 [33]. We use the sports category, which has the largest number of activity types, and take the first action of each activity that is played by at least two persons: basic step waltz, kick (right foot forward), throw (sit on the floor), forward rotation (right arm), jumping jacks, rope skipping and badminton (low serve). Each action is played by at least two actors in three trials each. The first trial played by the first person is used for generating the template views; the other two trials are used for generating the testing videos. This dataset is more challenging due to the larger number of classes and the large variation between the same actions played by different actors.
To generate the datasets under different views, we place the camera at the locations marked by the small red spheres on the hemisphere in Fig. 3 (a). The radius of the hemisphere is 150, and the camera looks at the center of the hemisphere. The image size is set to 1200 × 900 pixels, and the focal length is chosen randomly from 1500 to 2500. The size of the actor in the images is approximately 400 × 300 pixels. Example images of the two datasets are shown in Fig. 3. Fifty testing videos are generated for each action, giving 300 testing videos in the CMU mocap dataset and 350 testing videos in the HDM05 mocap dataset.
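The sketch below (Python/NumPy) illustrates one plausible way to synthesize such a view; it is our reconstruction of the setup rather than the generation code used for the datasets. The look-at convention, the world up-axis and the principal point at the image center are our assumptions; only the hemisphere radius of 150, the 1200 × 900 image size and the focal length range [1500, 2500] come from the description above.

```python
import numpy as np

def look_at_camera(center, target, up=(0.0, 0.0, 1.0)):
    """Rotation R and translation t of a camera at `center` looking at `target`
    (assumes the camera is not placed exactly at the pole of the hemisphere)."""
    z = np.asarray(target, float) - np.asarray(center, float)
    z /= np.linalg.norm(z)
    x = np.cross(z, up)
    x /= np.linalg.norm(x)
    y = np.cross(z, x)
    R = np.stack([x, y, z])                  # world -> camera rotation
    t = -R @ np.asarray(center, float)
    return R, t

def project_joints(joints3d, azimuth, zenith, radius=150.0,
                   focal=None, image_size=(1200, 900)):
    """Place a pinhole camera on the hemisphere and project 3D joints to pixels."""
    joints3d = np.asarray(joints3d, float)   # shape (num_joints, 3)
    if focal is None:
        focal = np.random.uniform(1500.0, 2500.0)
    center = radius * np.array([np.sin(zenith) * np.cos(azimuth),
                                np.sin(zenith) * np.sin(azimuth),
                                np.cos(zenith)])
    R, t = look_at_camera(center, target=(0.0, 0.0, 0.0))
    K = np.array([[focal, 0.0, image_size[0] / 2.0],
                  [0.0, focal, image_size[1] / 2.0],
                  [0.0, 0.0, 1.0]])
    Xc = (R @ joints3d.T).T + t              # joints in camera coordinates
    x = (K @ Xc.T).T
    return x[:, :2] / x[:, 2:3]              # 2D pixel coordinates
```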

4.2 Experimental Results

In this subsection, we first check the view-invariance of the proposed approach. Fig. 4 (a) and (b) show a pose from a playing soccer action and a pose from a playing basketball action, respectively. Both are observed by two cameras at the locations marked by the two red spheres in Fig. 4 (c) and serve as the template poses. Another pose, identical to the pose in Fig. 4 (a) but performed by a different actor, is observed at the locations marked by the cyan spheres in Fig. 4 (c) and serves as the testing pose. The testing pose in the different views is then compared with the two template poses in (a) and (b) using Eg.


Fig. 4. View invariance. (a) A pose in a playing soccer action. (b) A pose in a playing basketball action. (c) The same pose as in (a) played by another actor; this pose is observed from the views marked by the cyan spheres, while the poses in (a) and (b) are observed from views on the two red spheres as templates. (d) The lower surface shows the matching errors between template (a) and the test pose (c) over the different views; the higher surface shows the matching errors between template (b) and the test pose (c).

with the two template poses in (a) and (b) using Eg , and the matching errors
are plotted in Fig. 4 (d) respectively. The lower and higher surfaces represent
the matching errors with the template poses generated using Fig. 4 (a) and (b),
respectively. As we can see, since the testing pose is the same to the template
pose (a), the error is much smaller for all the views.
Due to tracking inaccuracy, the positions of the body joints may not be precisely identified. Developing an effective method for pose estimation and tracking of the body joints is, however, beyond the scope of this work, so we examine our approach when the positions of the body joints suffer from tracking noise. As shown in Fig. 5, the configuration is the same as in Fig. 4 (d), except that the positions of the body joints in the testing poses are corrupted by Gaussian noise with standard deviations of 1, 5, 10, 15 and 20 pixels. The resulting matching errors are plotted in Fig. 5 (a)-(e). Fig. 5 (f) shows how the mean difference of the matching error between different poses and the same pose changes with the noise level. As can be seen, even with a noise standard deviation of up to 20 pixels, the matching error of the same pose is still lower than that of the different pose in most views, which means that poses can still be matched even if the tracked body joints deviate severely from their true positions. With a robust method for estimating and tracking the body joints, our approach will work effectively.
We compare our approach with the two-view approach in [28] as well as with two other naive approaches for recognizing an action in three views: treating the two template views separately, or directly using T without considering the correspondence relation already known between the first two views. First, we give a result that intuitively shows why our approach works better. As shown in Fig. 6, frame-wise similarity is estimated using F as in [28], T, and the subtensor constraint for the same action played by two different actors. One pose of the action is shown in the first image of Fig. 6, and 7, 9, 11 and 15 joints are used to estimate the matching error, respectively. The three rows are estimated using F, T and the subtensor constraint, respectively.


Fig. 5. Robustness to noise. (a) Matching error under a noise level of 1 pixel. (b) Noise level of 5 pixels. (c) Noise level of 10 pixels. (d) Noise level of 15 pixels. (e) Noise level of 20 pixels. (f) Mean difference of the matching error between different poses and the same poses with respect to the noise level.


Fig. 6. Frame-wise similarity estimated for an action using F, T and the subtensor with 7, 9, 11 and 15 body joints available.

Table 1. Results on CMU Mocap Dataset. Each row shows the number of correctly
recognized testing videos for each action class and the total precision for an approach.
There are 50 testing videos for each class.

Action      walk  run  jump  soccer  climb  step  total (%)
Fund 1        43   44    49      45     38    48      89.00
Fund 2        40   45    45      39     44    47      86.00
Fund 1&2      42   47    49      45     44    46      91.00
Trifocal      47   50    48      50     48    50      97.67
Subtensor     48   50    50      48     50    50      98.67

Table 2. Results on HDM05 Mocap Dataset

Action      waltz  kick  throw  rotation  jump  skip  badminton  total (%)
Fund 1         41    37     20        46    31    24         35      66.86
Fund 2         45    42     29        41    25    32         35      71.14
Fund 1&2       44    41     24        46    27    25         32      68.29
Trifocal       44    43     15        50    50    45         36      80.86
Subtensor      41    36     29        50    50    49         45      85.71

Since F requires at least 9 correspondences, the 7-joint case is not used for F. There should be a dark path around the diagonal area of each figure, and the path becomes more blurred as the number of correspondences decreases because of the loss of information. As can be seen, the path is blurred even with a sufficient 15 correspondences for F. As for T, the path is very blurred with 9 correspondences and can hardly be recognized when the number of correspondences is reduced to 7. In contrast, for the subtensor, the path can still be recognized in the case of 7 joints.
The approaches are tested on the two datasets. The examined approaches are: 1) using the first view as the template action for F in [28]; 2) using the second view as the template action for F; 3) using both the first and second templates, but separately, for F, where the smaller matching error is used for recognizing an action; 4) directly using the trifocal tensor; 5) using the proposed subtensor constraint. All 11 point correspondences are used for matching. The results are given in Tab. 1 and Tab. 2, showing the number of correctly recognized testing videos for each action class and the total precision of every approach.
From Tab. 1 and 2, we can see that recognizing actions using the tri-view constraint is more effective than the two-view constraint, and that simply combining the two-view constraint with the two templates used separately does not obtain much gain either. The best performance is obtained using the subtensor. Since all 11 joints are available, using T also achieves a fair performance. In real-life situations, however, some of the body joints in a testing video may be missing due to the failure of a tracking algorithm or self-occlusion. To evaluate the performance under such situations, we test our approach on the two datasets with some of the joints missing.

Table 3. Results on CMU Mocap Dataset with occlusion. Each row shows the number
of correctly recognized testing videos for each action class and the total precision for
an approach. There are 50 testing videos for each class.

Action      walk  run  jump  soccer  climb  step  total (%)
Trifocal      35   36    48      22     22    45      69.33
Subtensor     46   48    48      44     38    50      91.33

Table 4. Results on HDM05 Mocap Dataset with occlusion

Action      waltz  kick  throw  rotation  jump  skip  badminton  total (%)
Trifocal       38    30      8        38    32    14         23      52.29
Subtensor      40    27     18        50    49    41         33      73.75

The occlusion datasets are generated from the above CMU Mocap and HDM05 Mocap datasets. In each frame of an action sequence, we randomly choose 1 to 4 joints with probabilities of 0.4, 0.3, 0.2 and 0.1, respectively, and assume that they are missing. These sequences are then used as the testing videos. As shown in Tab. 3 and Tab. 4, there is a significant degradation in performance for the approach using T, while our approach still works well. We believe that because the known correspondence relation between the first two views is not considered when directly using T, a lot of information is omitted.
To compare how quickly performance degrades with the degree of occlusion, we again generate data from the CMU Mocap and HDM05 Mocap datasets. For every frame in the entire dataset, m points are randomly assumed to be occluded and missing. Fig. 7 shows the precision of the two approaches when 0 to 5 points are missing. The case of 5 missing points is not considered for T, since T requires at least 7 points. As we can see from the figure, the subtensor is more robust to occlusion.

[Fig. 7 panels: left, CMU mocap dataset; right, HDM05 mocap dataset. Precision (%) vs. number of missing points for the subtensor and T.]

Fig. 7. Performance degradation with respect to the degree of occlusion, simulated by a certain number of missing points.

5 Conclusion

In this work, we first proved that the trifocal tensor lies in a twelve-dimensional subspace of the original 27-D space if the fundamental matrix between the first two views is known, which we refer to as the subtensor constraint. We then used this constraint for recognizing actions under three views, which requires fewer matching points but achieves better performance. We found that 1) recognizing actions under three views is more effective than the two-view constraint; 2) simply adding another template view while still using the two-view constraint does not obtain much gain; 3) the naive solution of using the trifocal tensor instead of the subtensor loses useful information and leads to a deficiency of the constraint. One weakness of our approach is that the precondition that the 2D image coordinates of the body joints are extracted may be considered a strong assumption. However, as shown in the experiments, our approach performs well even when the tracked body joints suffer from noise. With a robust tracking method, our approach will work effectively.

Acknowledgment. This work was supported by the National Natural Science Foundation of China (No. 60905019), the Tianjin Key Technologies R&D program (No. 11ZCKFGX00800), the Tsinghua-Tencent Joint Laboratory for Internet Innovation Technology, SKL of CG&CAD, and the Strategic Priority Research Program of the Chinese Academy of Sciences (XDA06030601).

References
1. Moeslund, T.B., Hilton, A., Kruger, V.: A survey of advances in vision-based human motion capture and analysis. CVIU 104, 90–126 (2006)
2. Wang, L., Hu, W., Tan, T.: Recent developments in human motion analysis. PR 36, 585–601 (2003)
3. Weinland, D., Ronfard, R., Boyer, E.: A survey of vision-based methods for action representation. CVIU 115, 224–241 (2011)
4. Ikizler, N., Forsyth, D.A.: Searching video for complex activities with finite state models. In: CVPR, pp. 18–23 (2007)
5. Agarwal, A., Triggs, B.: Recovering 3d human pose from monocular images. PAMI 28, 44–58 (2006)
6. Peursum, P., Venkatesh, S., West, G.A.W.: Tracking-as-recognition for articulated full-body human motion analysis. In: CVPR, pp. 1–8 (2007)
7. Yilmaz, A., Shah, M.: Recognizing human actions in videos acquired by uncalibrated moving cameras. In: CVPR, pp. 150–157 (2005)
8. Yilmaz, A., Shah, M.: Actions sketch: A novel action representation. In: CVPR, pp. 984–989 (2005)
9. Fathi, A., Mori, G.: Action recognition by learning mid-level motion features. In: CVPR, pp. 24–32 (2008)
10. Schindler, K., Gool, L.J.V.: Action snippets: How many frames does human action recognition require? In: CVPR, pp. 1–8 (2008)
11. Laptev, I.: On space-time interest points. IJCV 64, 107–123 (2005)

12. Gilbert, A., Illingworth, J., Bowden, R.: Scale Invariant Action Recognition Using Compound Features Mined from Dense Spatio-temporal Corners. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 222–233. Springer, Heidelberg (2008)
13. Niebles, J.C., Li, F.F.: A hierarchical model of shape and appearance for human action classification. In: CVPR, pp. 1–8 (2007)
14. Morency, L.P., Quattoni, A., Darrell, T.: Latent-dynamic discriminative models for continuous gesture recognition. In: CVPR, pp. 1–8 (2007)
15. Thurau, C., Hlavac, V.: Pose primitive based human action recognition in videos or still images. In: CVPR, pp. 1–8 (2008)
16. Gorelick, L., Blank, M., Shechtman, E., Irani, M., Basri, R.: Actions as space-time shapes. PAMI 29, 2247–2253 (2007)
17. Niebles, J.C., Wang, H., Fei-Fei, L.: Unsupervised learning of human action categories using spatial-temporal words. IJCV 79, 299–318 (2008)
18. Nowozin, S., Bakir, G.H., Tsuda, K.: Discriminative subsequence mining for action classification. In: ICCV, pp. 1–8 (2007)
19. Scovanner, P., Ali, S., Shah, M.: A 3-dimensional sift descriptor and its application to action recognition. In: ACMMM, pp. 357–360 (2007)
20. Parameswaran, V., Chellappa, R.: View invariants for human action recognition. In: CVPR, pp. 613–619 (2003)
21. Campbell, L.W., Becker, D.A., Azarbayejani, A., Bobick, A.F., Pentland, A.: Invariant features for 3-d gesture recognition. In: FG, pp. 157–163 (1996)
22. Holte, M.B., Moeslund, T.B., Fihl, P.: View invariant gesture recognition using the csem swissranger sr-2 camera. IJISTA 5, 295–303 (2008)
23. Zelnik-Manor, L., Irani, M.: Event-based analysis of video. In: CVPR, pp. 123–130 (2001)
24. Weinland, D., Ronfard, R., Boyer, E.: Free viewpoint action recognition using motion history volumes. CVIU 104, 249–257 (2006)
25. Junejo, I.N., Dexter, E., Laptev, I., Perez, P.: View-independent action recognition from temporal self-similarities. PAMI 33, 172–185 (2011)
26. Farhadi, A., Tabrizi, M.K.: Learning to Recognize Activities from the Wrong View Point. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 154–166. Springer, Heidelberg (2008)
27. Rao, C., Yilmaz, A., Shah, M.: View-invariant representation and recognition of actions. IJCV 50, 203–226 (2002)
28. Gritai, A., Sheikh, Y., Rao, C., Shah, M.: Matching trajectories of anatomical landmarks under viewpoint, anthropometric and temporal transforms. IJCV 84, 325–343 (2009)
29. Shen, Y., Foroosh, H.: View-invariant action recognition using fundamental ratios. In: CVPR, pp. 1–8 (2008)
30. Shen, Y., Foroosh, H.: View-invariant action recognition from point triplets. PAMI 31, 1898–1905 (2009)
31. ul Haq, A., Gondal, I., Murshed, M.: On dynamic scene geometry for view-invariant action matching. In: CVPR, pp. 3305–3312 (2011)
32. Faugeras, O.D., Papadopoulo, T.: A nonlinear method for estimating the projective geometry of three views. In: ICCV, pp. 477–484 (1998)
33. Muller, M., Roder, T., Clausen, M., Eberhardt, B., Kruger, B., Weber, A.: Documentation mocap database HDM05. Technical report CG-2007-2, Universitat Bonn (2007)
Globally Optimal Closed-Surface Segmentation for Connectomics

Bjoern Andres1,*, Thorben Kroeger1,*, Kevin L. Briggman2, Winfried Denk3, Natalya Korogod4, Graham Knott4, Ullrich Koethe1, and Fred A. Hamprecht1

1 HCI, University of Heidelberg
2 NIH, Bethesda
3 MPI for Medical Research, Heidelberg
4 EPFL, Lausanne

* Contributed equally.

Abstract. We address the problem of partitioning a volume image into a previously unknown number of segments, based on a likelihood of merging adjacent supervoxels. Towards this goal, we adapt a higher-order probabilistic graphical model that makes the duality between supervoxels and their joint faces explicit and ensures that merging decisions are consistent and surfaces of final segments are closed. First, we propose a practical cutting-plane approach to solve the MAP inference problem to global optimality despite its NP-hardness. Second, we apply this approach to challenging large-scale 3D segmentation problems for neural circuit reconstruction (Connectomics), demonstrating the advantage of this higher-order model over independent decisions and finite-order approximations.

1 Introduction
This paper studies the problem of partitioning a volume image into a previously un-
known number of segments, based on a likelihood of merging adjacent supervoxels.
We choose a graphical model approach in which binary variables are associated with
the joint faces of supervoxels, indicating for each face whether the two adjacent super-
voxels should belong to the same segment (0) or not (1). Models of low order can lead
to inconsistencies where a face is labeled as 1 even though there exists a path from one
of the adjacent segments to the other along which all faces are labeled as 0. As a result,
the union of all faces labeled as 1 need not form closed surfaces. Such inconsistencies
can be excluded by a higher-order conditional random field (CRF) [1] that constrains
the binary labelings to the multicut polytope [2], thus ensuring closed surfaces. While
the number of multicut constraints can be exponential [3], constraints that are violated
by a given labeling can be found in quadratic time [4]. The MAP inference problem
can therefore be addressed by the cutting-plane method, i.e. by solving a sequence of
relaxed problems to global optimality until no more constraints are violated [4].
Here, we show that the optimization scheme described in [1] is unsuitable for large 3D segmentations where the supervoxel adjacency graph is denser and non-planar. We therefore extend the cutting-plane approach by adding only constraints which are facet-defining by a property of the multicut polytope (Section 4.3), a double-ended, parallel search to find violated constraints (Section 4.2), and a problem-specific warm-start heuristic (Section 4.4).


Fig. 1. Segmenting volume images based on a likelihood of merging adjacent supervoxels is difficult if merging decisions are made independently. Left: Segmentation errors remain a problem even if the model is biased optimally with β = 0.8 in Eq. (4) (note the under-segmentation at the bottom and missing segments at the top). Right: Multicut constraints alleviate this problem and allow for an unbiased, parameter-free model.

This approach scales gracefully to volume images that consist of 10^9 voxels and are initially segmented into 10^6 supervoxels (Section 5).
The fact that exact MAP inference remains tractable at this scale is important for the reconstruction of neural circuits from electron microscopic volume images (Fig. 1), where multicut constraints substantially improve the quality of segmentations across different imaging techniques (Section 5).
For this quantitative analysis, we manually segmented 10^9 voxels of two different datasets acquired at different laboratories using different imaging techniques, and assess the performance of three models. The simplest, local model uses learned unary conditional probabilities that state, independently, whether each face should be on (establish part of an object boundary) or off (merge adjacent supervoxels). Optimization of this model almost always results in inconsistent labelings (Figs. 3b, 3e). The intermediate, finite-order model guarantees consistency across a small local horizon, leading to better results. The best results are obtained with the fully constrained model, which admits only labelings that are globally consistent. With the cutting-plane approach proposed here, the full model is often faster to optimize than the finite-order approximation. Figs. 3, 4, and 5 summarize why the fully constrained method is the one we recommend.

2 Related Work
The problem we address is known as the multicut problem [5] in combinatorial op-
timization and as correlation clustering [6,7] in statistics, and it is a special case of
the partition problem [2]. Both problems are NP-hard [6,8]. Instances of the multicut
problem have been solved by tightening an outer approximation of the multicut poly-
tope [9] via cutting planes [10,4]. In computer vision, this technique has been applied
in [11,12,13] where the relaxed LP is solved first, as well as in [1] where the inte-
grality constraints are kept throughout the cutting-plane loop. Cutting planes have also
been used to enforce connectivity in foreground vs. background image segmentation
[14,15,16]. Here, we build on the probabilistic formulation in [1] but without the likeli-
hood terms w.r.t. geometry that were shown to have a negligible effect on segmentations
of photographs.

We concentrate on exact MAP inference for which we propose a cutting-plane ap-


proach that is efficient for large-scale 3D segmentation. In particular, we discuss the
efficient search for violated constraints that are facet-defining and the parallelization of
this search. We measure how the optimization runtime scales with the size of the image
and the number of variables, respectively, with and without improvements.
Results were shown in [1] for photographs from the Berkeley Segmentation Dataset [17], which consist of 10^5 pixels that were initially partitioned into 10^4 superpixels, leading to optimization problems with at most 10^4 variables. Here, we segment 3D images of up to 10^9 voxels which are initially partitioned into 10^5 supervoxels, leading to 100 times larger optimization problems with 10^6 variables.

3 Probabilistic Model
In order to make the duality between supervoxels and their joint faces explicit, we build a cell-complex (C, ≺, dim) representation [18] of the supervoxel segmentation by connected component labeling of the finest possible grid-cell topology. This yields disjoint sets of supervoxels C3, joint faces between supervoxels C2, lines C1 and points C0. The function dim: C → N maps each cell to its dimension. For any cells c, c' ∈ C = ∪_{j=0}^{3} C_j, c ≺ c' indicates that c bounds c', which implies dim(c) < dim(c'). This representation has the advantage that multiple disconnected joint faces separating the same pair of supervoxels can be treated independently for feature extraction and classification (see below). We model the posterior probability of a joint labeling y ∈ {0,1}^{|C2|} of all faces, given
– m ∈ N features f_c ∈ R^m of each face c ∈ C2, summarized in a vector F ∈ R^{m|C2|},
– the bounding relation between faces and supervoxels, encoded in a topology matrix T ∈ {0,1}^{|C2|×|C3|} in which T_{cc'} = 1 if and only if c ≺ c'.
We assume that the features are independent of the topology, a) F ⊥ T, b) F ⊥ T | y, that the features of any faces c ≠ c' are independent, c) f_c ⊥ f_{c'}, d) f_c ⊥ f_{c'} | y, and that the label of any face c is independent of the label of any face c' ≠ c given the features f_c, e) y_c ⊥ y_{c'} | f_c. (Note, however, that y_c and y_{c'} are not independent given T.) From these conditional independence assumptions follows

    p(y|F,T) = p(F,T|y) p(y) / p(F,T)
             =(a,b) p(F|y) p(T|y) p(y) / ( p(F) p(T) )
             =(c,d) [ p(T|y) / p(T) ] Π_{c∈C2} p(f_c|y) p(y) / p(f_c)
             =      [ p(T|y) / p(T) ] Π_{c∈C2} p(y|f_c)
             =(e)   [ p(T|y) / p(T) ] Π_{c∈C2} p(y_c|f_c)
             =      [ p(T|y) / p(T) ] Π_{c∈C2} p(f_c|y_c) p(y_c) / p(f_c)
             ∝ p(T|y) Π_{c∈C2} p(f_c|y_c) p(y_c) .                   (1)

The prior p(y_c) is assumed to be identical for all faces and is specified with a single design parameter β ∈ (0,1) as p(y_c = 1) = β and p(y_c = 0) = 1 − β.

The likelihood p(T|y) is the instrument used to enforce consistency: while uninformative for all consistent labelings, it assigns zero probability to all inconsistent labelings. Consistent labelings y ∈ MC, i.e. labelings inside the multicut polytope¹, are defined in (2) w.r.t. the set SC(n) of all simple cycles (s1, t1, ..., sn, tn, s1) over n ∈ N pairwise distinct supervoxels (s1, ..., sn) via pairwise distinct faces (t1, ..., tn). The inequalities in (2) ensure that along each cycle either none or more than one face is labeled as 1:

    SC(n) = { (c_j)_{j ∈ N_{2n+1}} in C |  c_1 = c_{2n+1},
              ∀ j ∈ N_n: c_{2j-1} ∈ C3 ∧ c_{2j} ∈ C2,
              ∀ j ∈ N_n: c_{2j} ≺ c_{2j-1} ∧ c_{2j} ≺ c_{2j+1},
              ∀ j,k ∈ N_n: j ≠ k ⇒ (c_{2j} ≠ c_{2k} ∧ c_{2j-1} ≠ c_{2k-1}) }

    MC(n) = { y ∈ {0,1}^{|C2|} |  ∀ (s1, t1, ..., sn, tn, s1) ∈ SC(n):  y_{t1} ≤ Σ_{j=2}^{n} y_{tj} }

    MC = ∩_{j=1}^{|C2|} MC(j)                                        (2)

¹ The multicut polytope is the convex hull of MC, by Lemma 2.2 in [2].

The likelihood p(f_c|y_c) is learned by means of a random forest. More precisely, we learn p̂(y_c|f_c) from class-balanced training data, i.e. with p̂(y_c) = 0.5, and assume p̂(f_c|y_c) = p(f_c|y_c). Therefore, p(f_c|y_c) ∝ p̂(y_c|f_c) p(f_c) and thus

    p(y|F,T) ∝ p(T|y) Π_{c∈C2} p(y_c|f_c) p(y_c) .                   (3)

For comparison, we consider two simpler models: a local model in which p(T|y) is uniform and thus faces are labeled independently,

    p'(y|F,T) ∝ Π_{c∈C2} p(y_c|f_c) p(y_c) ,                         (4)

and a finite-order approximation of (3) in which not all multicut constraints need to be fulfilled but only those that correspond to cycles up to length 4. With p''(T|y) = const. if y ∈ ∩_{j=1}^{4} MC(j) and p''(T|y) = 0 otherwise,

    p''(y|F,T) ∝ p''(T|y) Π_{c∈C2} p(y_c|f_c) p(y_c) .               (5)

4 MAP Inference
4.1 Integer Linear Programming Problem (ILP)
Instead of maximizing (3), we minimize its negative log-likelihood by solving the ILP

    min_{y ∈ {0,1}^{|C2|}}  w^T y
    subject to  y ∈ MC                                               (6)

Fig. 2. Depicted is one slice of a supervoxel segmentation. Supervoxels (green dots) are bounded by joint faces which are labeled as 0 (white dots) or 1 (black dots). The simple cycles drawn in long-dashed blue (short-dashed red) are chordless (chordal).

with w ∈ R^{|C2|} such that for all j ∈ {1, ..., |C2|}:

    w_j = log( p(y_j = 0 | f_j) / p(y_j = 1 | f_j) ) + log( (1 − β) / β )    (7)
We start by solving the (trivial) ILP without multicut constraints. We then search for
constraints that are violated by the solution, add these to the constraint pool and re-
solve the constrained ILP using the branch-and-cut algorithm of a state-of-the-art solver.
This procedure is repeated until no more multicut constraints are violated and thus, the
original problem (6) has been solved to optimality.
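The sketch below (Python; our illustration, not the authors' C++/Cplex/Gurobi implementation) shows the structure of this loop: the weights of Eq. (7) followed by repeated integer solves with a growing pool of cycle inequalities. It assumes a separation routine `find_violated(y)` (cf. Section 4.2) and uses scipy.optimize.milp as a stand-in solver; re-solving from scratch in every round is a simplification of the branch-and-cut scheme described here.

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def face_weights(p_on, beta=0.5):
    """Eq. (7): w_j = log(p(y_j=0|f_j) / p(y_j=1|f_j)) + log((1-beta)/beta)."""
    p_on = np.clip(np.asarray(p_on, float), 1e-9, 1.0 - 1e-9)
    return np.log((1.0 - p_on) / p_on) + np.log((1.0 - beta) / beta)

def map_inference(w, find_violated):
    """Cutting-plane loop for the ILP (6): solve without multicut constraints,
    add the cycle inequalities violated by the integer solution, and repeat
    until none remain. Each constraint is a pair (face_on, faces_off) encoding
    y[face_on] <= sum(y[f] for f in faces_off)."""
    w = np.asarray(w, float)
    n = len(w)
    pool = []
    while True:
        A = np.zeros((max(len(pool), 1), n))        # one all-zero row if pool is empty
        for r, (on, off) in enumerate(pool):
            A[r, on] = 1.0
            A[r, list(off)] = -1.0
        res = milp(c=w,
                   integrality=np.ones(n),
                   bounds=Bounds(0, 1),
                   constraints=LinearConstraint(A, -np.inf, 0.0))
        y = np.round(res.x).astype(int)
        violated = find_violated(y)
        if not violated:
            return y                                # consistent, globally optimal labeling
        pool.extend(violated)
```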

4.2 Search for Violated Constraints


If any multicut constraints are violated by a given labeling y, then at least one can be found by considering all faces c ∈ C2 with y_c = 1 and looking for a shortest cycle (r1, t1 = c, r2, t2, ..., rn, tn, r1) ∈ SC(n) with n ∈ N along which all other faces t2, ..., tn are labeled as 0 [1,4]. Shortest cycles correspond to constraints with minimal numbers of variables.
Double-Ended Search. Although a breadth-first search for such a cycle can be carried
out in time O(|C2 |+|C3 |), starting from either r1 or r2 , the absolute runtime to perform
this task for all relevant faces can be comparable to that of solving the ILP (Section 5).
We therefore propose to grow two search trees, rooted at r1 and r2 , simultaneously.
This saves runtime because both trees are only half as deep as a single one would be, at
the point when a shortest cycle is found.
Parallelization. Shortest paths need to be found for many faces, from one adjacent
supervoxel to the other, and yet not so many that it would be profitable to solve the All
Pairs Shortest Path problem. We therefore propose to solve Single Pair Shortest Path
problems in parallel with a space complexity that is linear in the number of threads. In
practice, we use OpenMP for this embarrassingly parallel task.
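A single-ended, sequential version of this separation step might look as follows (Python sketch; our illustration, whereas the implementation described here grows two search trees and parallelizes over faces with OpenMP). The data structures are assumptions: `faces[c]` holds the two supervoxels bounded by face c, and `bounded_by[s]` lists the faces incident to supervoxel s.

```python
from collections import deque

def find_violated(y, faces, bounded_by):
    """For every face labeled 1, breadth-first search for a path between its two
    supervoxels that uses only 0-labeled faces; each such path closes a cycle
    that violates a multicut inequality. Returns (face_on, faces_off) pairs."""
    constraints = []
    for c, (u, v) in enumerate(faces):
        if y[c] != 1:
            continue
        parent = {u: (None, None)}                 # node -> (previous node, via face)
        queue = deque([u])
        while queue and v not in parent:
            a = queue.popleft()
            for f in bounded_by[a]:
                if y[f] != 0:                      # only walk along 0-labeled faces
                    continue
                b = faces[f][1] if faces[f][0] == a else faces[f][0]
                if b not in parent:
                    parent[b] = (a, f)
                    queue.append(b)
        if v in parent:                            # violated cycle found
            path_faces, node = [], v
            while parent[node][0] is not None:
                node, f = parent[node]
                path_faces.append(f)
            constraints.append((c, path_faces))
    return constraints
```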

4.3 Chordality Check


Not all inequalities in (2) define a facet of the multicut polytope, due to the following theorem [2]: If n ∈ N and (c_j) = (s1, t1, ..., sn, tn, s1) ∈ SC(n), the inequality y_{t1} ≤ Σ_{j=2}^{n} y_{tj} is facet-defining if and only if (c_j) is chordless² (Fig. 2).
² A path is chordless if each node is connected only to its successor and predecessor. Here, the path of segments via joint faces is chordless if each segment is connected by a face only to its successor and predecessor.

We exploit this algorithmically by adding violated inequalities only if they correspond to chordless cycles (Fig. 2).
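Checking this condition amounts to an adjacency test over all non-consecutive supervoxels of the cycle; a possible sketch (our illustration, with `adjacent(u, v)` assumed to report whether any joint face bounds both supervoxels u and v):

```python
def is_chordless(cycle_supervoxels, adjacent):
    """Facet-defining check of Sec. 4.3: a cycle of supervoxels is chordless
    iff no joint face connects two non-consecutive supervoxels of the cycle."""
    n = len(cycle_supervoxels)
    for i in range(n):
        for j in range(i + 2, n):
            if i == 0 and j == n - 1:              # consecutive via the wrap-around edge
                continue
            if adjacent(cycle_supervoxels[i], cycle_supervoxels[j]):
                return False
    return True
```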

4.4 Warm Start Heuristic


When violated constraints are added, the solution of the relaxed problem becomes infeasible and thus the upper bound on the global minimum is lost. However, the structure of the problem allows us to find a new feasible solution efficiently: a given labeling y ∈ {0,1}^{|C2|} is mapped to a labeling on the multicut polytope by labeling all variables of all violated inequalities as 0. Since violated inequalities have already been found, this heuristic does not change the runtime complexity of the optimization scheme overall.
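One concrete way to obtain such a feasible point of the multicut polytope is the transitive merging that is also used for the unconstrained baseline in Section 5.1: merge supervoxels across all 0-labeled faces and relabel every face between two supervoxels of the same component as 0. The union-find sketch below is our illustration of this mapping.

```python
def project_to_multicut(y, faces, num_supervoxels):
    """Map an arbitrary labeling y to a consistent one: supervoxels joined by
    any 0-labeled face end up in one segment, and every face inside a merged
    segment is relabeled 0 (cf. Sec. 5.1)."""
    parent = list(range(num_supervoxels))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]          # path halving
            a = parent[a]
        return a

    for c, (u, v) in enumerate(faces):
        if y[c] == 0:
            parent[find(u)] = find(v)
    return [0 if find(u) == find(v) else 1 for (u, v) in faces]
```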

5 Applications
5.1 SBEM Volume Image
A volume image, referred to as E1088 in [19], was acquired using serial block-face electron microscopy (SBEM) [20] and shows a section of rabbit retina at the almost isotropic resolution of 22 × 22 × 30 nm³. A small subset is shown in Fig. 3a. The bright intra-cellular space that makes up more than 90% of the volume contrasts with the stained extra-cellular space that forms thin membranous faces. This staining [21] simplifies the automated segmentation because no intra-cellular structures such as mitochondria or vesicles are visible (Fig. 3d shows a different staining). A supervoxel segmentation is obtained as described in Appendix C.
Features f_c of each face c ∈ C2, described in Appendix B, include statistics of voxel features over c as well as characteristics of the two supervoxels that are bounded by c. In order to learn p(y_c|f_c) by means of a random forest, 437 faces per class were labeled interactively in a subset of 250³ voxels in about one hour, starting with obvious cases and continuing where the predictions needed improvement. A custom extension of ILASTIK [22] implements this online learning workflow. The user can mark faces as on, off or unlabeled via a single mouse click and inspect intermediate predictions by viewing faces colored according to p(y_c|f_c).
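Outside the interactive workflow, the same learning step can be approximated offline. The snippet below (Python with scikit-learn as a stand-in for the random forest; the tree count is our choice, not a value reported here) fits the classifier to the labeled face features and predicts p(y_c = 1 | f_c) for all faces, which later enter the weights of Eq. (7).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def learn_face_probabilities(F_train, y_train, F_all, n_trees=255):
    """Learn p(y_c|f_c) from class-balanced face labels and predict boundary
    probabilities for all faces. F_train/y_train/F_all are placeholders for
    the interactively labeled features, their 0/1 labels and the features of
    all faces, respectively."""
    rf = RandomForestClassifier(n_estimators=n_trees, n_jobs=-1)
    rf.fit(F_train, y_train)
    on_column = list(rf.classes_).index(1)
    return rf.predict_proba(F_all)[:, on_column]   # p(y_c = 1 | f_c)
```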
Qualitative results on independent test data are shown in Fig. 3a-c. The global maxi-
mum of the local model (4) is inconsistent, i.e. not all surfaces are closed. A consistent
labeling with closed surfaces and thus a segmentation is obtained by merging super-
voxels transitively, i.e. whenever there exists a path from one supervoxel to the other
along which all faces are labeled as 0 (cf. Section 4.4), regardless of how many faces
between these supervoxels are labeled as 1. This mapping to the multicut polytope is
biased towards under-segmentation (Fig. 3b). In contrast, the global maximum of the
fully constrained model (3) is consistent and directly yields a segmentation with closed
surfaces (Fig. 3c). Note also the quality of the initial supervoxel segmentation from
which only 19.3% of all faces are removed in the found global optimum.
Quantitatively, the effect of introducing multicut constraints is shown in Fig. 4a, where maxima of (3), the local model (4) and the finite-order approximation (5) are compared to a man-made segmentation of 400 × 200 × 200 voxels (Appendix A) in terms of the Variation of Information (VI) [23] and the Rand Index (RI) [24].

(a) SBEM raw data   (b) without constraints, β = 0.5   (c) with constraints, β = 0.5
(d) FIBSEM raw data   (e) without constraints, β = 0.5   (f) with constraints, β = 0.5
(scale bars: 1 μm)

Fig. 3. Segmentations of the SBEM dataset (a-c, 242² voxels) and the FIBSEM dataset (d-f, 512² voxels). Faces are colored to show whether the algorithm decides that they are part of a segment boundary (blue) or not (magenta). Importantly, yellow faces are decided to be part of a segment boundary but are ignored because the bounded supervoxels are merged elsewhere.

(a) SBEM (400 × 200 × 200 voxels)   (b) FIBSEM (900 × 900 × 900 voxels)

Fig. 4. Variation of Information (VI) and Rand Index (RI) from a comparison of man-made segmentations (Appendix A) with segmentations obtained as maxima of the fully constrained model (Eq. 3), the finite-order approximation (Eq. 5), and the local model (Eq. 4), for various priors β. Exact optimization of the finite-order approximation becomes intractable for most β on the FIBSEM data. In contrast, optima of the full model are found in less than 13 minutes (Fig. 5b).

Table 1. Globally optimal closed-surface segmentations of the SBEM and FIBSEM volume images of 800³ voxels are found in less than 13 minutes using the proposed cutting-plane method, which is about 22 times as fast as the optimization scheme in [1].

Dataset   Voxels   |C3|      |C2|        Runtime [min.] for optimizing (3)
                                         [1] Cplex   [1] Gurobi   Ours Cplex   Ours Gurobi
SBEM      800³     194 809   1 603 683     249.0        299.0         11.4          12.9
FIBSEM    800³      73 099     624 747     386.2        291.7          9.2           7.2

The overall best segmentation is obtained from (3), i.e. with all multicut constraints and without an artificial bias (β = 0.5); note that the bias term in (7) vanishes for β = 0.5. Without any multicut constraints, the best segmentation, obtained for β = 0.8, is worse in terms of both VI and RI. Segmentations of intermediate quality are obtained from the finite-order approximation (5).
Runtimes for optimizing (3) are shown in Fig. 5a and Tab. 1 for our C++ implementation. The ILP (6) is solved with IBM ILOG Cplex and Gurobi alternatively, with the duality gap set to 0 in order to obtain globally optimal solutions. Global optima of (3), for blocks of 800³ voxels with 1.6 × 10^6 and 6 × 10^5 faces (and variables), respectively, are found in less than 13 minutes (Tab. 1), about 22 times as fast as with the optimization scheme in [1].
Maximizing (3) can be faster than maximizing the finite-order approximation (5)
that does not guarantee closed surfaces and yields worse segmentations empirically
(runtimes not shown). We therefore recommend to use all multicut constraints.

5.2 FIBSEM Volume Image

A volume image acquired with a focused ion beam serial-section electron microscope (FIBSEM) [25] shows a section of adult mouse somatosensory cortex at the almost isotropic resolution of 5 × 5 × 6 nm³. A small subset is shown in Fig. 3d. Intra- and extra-cellular space are indistinguishable by brightness and texture. Not only cell membranes but also intra-cellular structures such as mitochondria and vesicles are visible due to a different staining. The resolution is four times as high as that of the SBEM image. However, intra-cellular structures have membranous surfaces themselves and thus make the problem of segmenting entire cells more difficult. A supervoxel segmentation is obtained as described in Appendix C.
We use the same features as for the SBEM dataset, but adjusted in scale. Our labeling strategy was to annotate the membranes of both cells and mitochondria³ as y_c = 1.
Qualitative results are shown in Fig. 3d-f for a slice of 512² voxels. The MAP labeling of the local model (4) is inconsistent, and almost all faces are removed by transitivity (note the lack of blue faces in Fig. 3e).
³ It is still possible to obtain the geometry of cells because mitochondria can be detected reliably [26], even via their mean gray-value once a segmentation is available. Segments which are classified as mitochondria are disregarded when computing VI [23] and RI [24].

(a) SBEM dataset   (b) FIBSEM dataset

Fig. 5. Wall-clock runtimes for optimizing (3) with β = 0.5 (on an 8-core Intel i7 at 2.8 GHz), with and without improvements to the optimization scheme in [1], for increasing problem size (number of joint faces of supervoxels, left, and number of voxels, right), for datasets ranging in size from 150³ through 800³ voxels. all: proposed optimization scheme, noC: no chordality check, noD: no double-ended search, noP: no parallel search, noCDPW: no improvements. Analogous plots for Gurobi instead of Cplex are provided as supplementary material.

(a) frames 1285 and 1950   (b) without multicut constraints, β = 0.5   (c) with multicut constraints, β = 0.5
Fig. 6. Color video segmentation. Supervoxel faces are colored blue if yc = 1, magenta if yc = 0
and yellow if the decision is inconsistent, i.e. yc = 1 but the adjacent supervoxels are merged.
Segments are filled with their mean color.

In contrast, the global maximum of the fully constrained model (3) is consistent (Fig. 3f). Unlike for the SBEM dataset, in which only 19.3% of the faces are removed, 80.7% of all faces are removed here, due to the inferior supervoxel segmentation.
Quantitative results are shown in Fig. 4b. VI and RI are computed w.r.t. a complete segmentation of 900³ voxels carried out by a neurobiologist (Appendix A). Similarly to the SBEM volume segmentation, the best segmentations are obtained from the full model (3). Exact MAP inference for the finite-order approximation (5) becomes intractable for most β.
Runtimes are shown in Fig. 5b for problem instances from blocks between 150³ and 800³ voxels. Unlike for the SBEM dataset (Fig. 5a), the overall speedup is dominated by the chordality check.

5.3 Video Segmentation

As a proof of concept, we applied the same model to bottom-up video segmentation, treating the video⁴ as a three-dimensional (x, y, t)-volume. Obtaining a supervoxel segmentation that strikes a balance between negligible under-segmentation and a small number of excessive supervoxels has proven difficult. We settled for a marker-based watershed transformation of the color gradient magnitude in the Lab color space. A 486 × 360 video with 1000 frames (Fig. 6a) is thus partitioned into |C_2| = 22 056 supervoxels with |C_2| = 256 734 joint faces.
As features f_c, we use (i) the mean (over the face c) of the 2D patch features in [1], which are computed per frame, and (ii) the absolute distance of the color histograms of the bounded supervoxels. A training set of 153 labels per class was acquired using the same tool and protocol as for the SBEM dataset. Only faces intersecting frame 1015 were labeled.
⁴ Frames 1000–2000 of the video at youtube.com/watch?v=YN0I-TZFn58
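For illustration, a sketch of the second face feature, under the assumption that the pixels of the two supervoxels bounding a face are available as color arrays (the bin count is an arbitrary placeholder, not the value used in the experiments):

import numpy as np

def histogram_distance(pixels_a, pixels_b, bins=8):
    # absolute (L1) distance between the normalized 3D color histograms of the
    # two supervoxels adjacent to a face; pixels_* have shape (N, 3)
    def hist(p):
        h, _ = np.histogramdd(p, bins=(bins, bins, bins), range=[(0, 256)] * 3)
        return h.ravel() / max(h.sum(), 1.0)
    return float(np.abs(hist(pixels_a) - hist(pixels_b)).sum())

a = np.random.randint(0, 256, size=(500, 3))
b = np.random.randint(0, 256, size=(400, 3))
print(histogram_distance(a, b))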
Qualitative results are shown in Fig. 6. A video of the segmentation is provided as supplementary material. While the finite-order model is intractable for this problem, global optimization of the fully constrained model (3) with the bias parameter set to 0.5 takes 131 seconds.

6 Discussion

In a fully connected graph with $n \in \mathbb{N}$ nodes, the $\binom{n}{3}$ cycle constraints of order 3 imply all $\sum_{j=0}^{n-4} \frac{n!}{j!}$ higher-order cycle constraints. However, in the sparse graphs that
we consider, the higher-order constraints need to be dealt with explicitly. An example
is depicted in Fig. 2 where the constraint that corresponds to the blue cycle on the
left has order 4, excluding from the feasible set a locally closed loop that is globally
inconsistent. Our experiments have shown that including these higher-order constraints
is essential to achieve the best performance w.r.t. ground truth.
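For reference, the generic cycle inequalities from the multicut literature [2,4] take the following form, stated here for a cycle C of joint faces (edges of the supervoxel adjacency graph), with y = 1 meaning that a face is preserved; the order-4 constraint discussed above is the instance |C| = 4:

$$y_e \;\le\; \sum_{f \in C \setminus \{e\}} y_f \qquad \text{for every cycle } C \text{ and every } e \in C,$$

i.e. a single face of a cycle cannot be preserved while all other faces of that cycle are removed.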
When inconsistent labelings are permitted, unlike in the fully constrained model
(3), and mapped to the multicut polytope as described in Section 4.4, the risk of false
mergers is higher in 3D than in 2D because there are on average more and shorter paths
from one supervoxel to another along which faces can be incorrectly labeled as 0. We
therefore expect multicut constraints to be more important in 3D than in 2D.
Solving (6) as proposed here requires the solution of problems of an NP-hard class.
Whether or not this is tractable in practice depends on the quality of the predictions
p(yc |fc ). The learning of this function is therefore important.
The warm start heuristic described in Section 4.4 is biased maximally towards under-segmentation. Smarter heuristics that explore local neighborhoods in the cell complex are the subject of future work. The model (3) can be grafted onto any method, 2D or 3D, that finds superpixels or supervoxels and probabilities that the joint boundaries of these should be preserved or removed. Whether it can improve segmentations of images acquired by transmission electron microscopy is the subject of future research.
The reconstruction of neural circuits such as a neocortical column or the central nervous system of Drosophila melanogaster will eventually require the segmentation of volume images of 10¹² voxels. The result that 10⁹ voxels can be segmented by optimizing a non-submodular higher-order multicut objective exactly on a single computer in 13 minutes is encouraging.

7 Conclusion
We have addressed the problem of segmenting volume images based on a learned like-
lihood of merging adjacent supervoxels. To solve this problem, we have adapted a
probabilistic model that enforces consistent decisions via multicut constraints to 3D
cell topologies and suggested a fast scheme for exact MAP inference. The resulting
22-fold speedup has allowed us to systematically study the positive effect of multicut
constraints in large-scale 3D segmentation problems for neural circuit reconstruction.
The best segmentations have been obtained for an unbiased parameter-free model with
multicut constraints.
(a) SBEM, 1400 × 1400 × 600 voxels (b) FIBSEM, 900 × 900 × 900 voxels
Fig. 7. The largest volume images we have segmented with multicut constraints. Shown are 600
segments in the SBEM dataset and 50 objects in the FIBSEM dataset.

A Acquisition of Ground Truth

We manually segmented subsets of the SBEM and the FIBSEM dataset (Fig. 8) using the interactive method [27]. Each object was segmented independently. Some independently segmented objects needed correction because there was overlap. This ensures a consistent ground truth and shows that the segmentation problem is non-trivial, even for a human. For the SBEM dataset, 528 objects (90.8% of 400 × 200 × 200 voxels) were segmented by one expert in two weeks. For the FIBSEM dataset, 514 objects (96.8% of a cubic block of 900³ voxels) were segmented by a neurobiologist in three weeks.

(a) SBEM (400 × 200 × 200) (b) FIBSEM (900³)

Fig. 8. Man-made segmentations (ground truth). Shown are orthogonal slices in which a random
color is assigned to each segment.
Table 2. Features of supervoxel faces. The statistics (*) include the min, max, mean, median,
standard deviation, 0.25- and 0.75-quantile over all voxels adjacent to the face.

(a) Features of supervoxel faces

Index   Feature                                     Details
1       Size of the face
2-3     Sizes v_1 and v_2 of adjacent supervoxels   (v_1 + v_2)^(1/3), |v_1 - v_2|^(1/3)
4-10    Bilateral filter                            *
11-17   Gradient magnitude                          *
18-24   Hessian matrix, max. eigenvalue             *
25-31   Voxel classifier                            *

(b) Features of voxel neighborhoods

Index   Feature
1       Volume image
2       Bilateral filter, with spatial weight w_s(r) = (2π)^(-3/2) σ_s^(-3) exp(-r² / (2σ_s²)) and value weight w_v(v) = 1 / (1 + v²/σ_v²)
3-4     Gradient magnitude
5-16    Structure tensor eigenvalues
17-28   Hessian matrix eigenvalues

B Features
From every joint face of adjacent supervoxels, 31 features are extracted (Tab. 2a). One of these features is the response of a Random Forest that discriminates between membranes on the one hand and intra-/extra-cellular tissue on the other hand, based on 28 rotation-invariant non-linear features of local neighborhoods of 11³ voxels (Tab. 2b).
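As an illustration of how the starred statistics of Tab. 2a can be accumulated, a small sketch (the function and array names are hypothetical; it assumes a voxel-level feature map and the coordinates of the voxels adjacent to one face are already available):

import numpy as np

def face_statistics(feature_map, face_voxels):
    # feature_map: 3D array of one voxel-level feature (e.g. gradient magnitude)
    # face_voxels: (N, 3) integer array of voxel coordinates adjacent to the face
    vals = feature_map[tuple(face_voxels.T)]
    return np.array([vals.min(), vals.max(), vals.mean(), np.median(vals),
                     vals.std(), np.quantile(vals, 0.25), np.quantile(vals, 0.75)])

volume = np.random.rand(10, 10, 10)
coords = np.array([[1, 2, 3], [1, 2, 4], [2, 2, 3]])
print(face_statistics(volume, coords))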

C Supervoxel Segmentation
We computed supervoxels by marker-based watersheds. To obtain an elevation map and markers for the SBEM dataset, we train a random forest classifier to distinguish between two classes of voxels, extra-cellular space and intra-cellular space, based on rotation-invariant features of local neighborhoods (Tab. 2b and Appendix B). As training data, 1600 voxels per class were labeled interactively in two subsets of 150³ voxels, which took three hours using ILASTIK [22]. Predicted probabilities are used as elevation levels. Connected components of at least three voxels classified as intra-cellular space are used as markers. For the FIBSEM dataset, the elevation level is defined as the largest eigenvalue of the Hessian matrix at scale 1.6. Markers are taken to be maximal plateaus of the raw data that consist of at least two voxels.
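A minimal sketch of such a marker-based watershed, assuming a membrane-probability volume is already available (scikit-image and SciPy; the threshold and minimum component size are illustrative placeholders):

import numpy as np
from scipy import ndimage as ndi
from skimage.segmentation import watershed

def supervoxels_from_probability(membrane_prob, threshold=0.5, min_size=3):
    # membrane_prob: 3D array of predicted membrane probabilities (elevation map)
    interior = membrane_prob < threshold          # likely intra-cellular voxels
    markers, _ = ndi.label(interior)              # connected components as seeds
    sizes = np.bincount(markers.ravel())
    small = sizes < min_size                      # drop components that are too small
    small[0] = False                              # label 0 marks unlabeled voxels
    markers[small[markers]] = 0
    return watershed(membrane_prob, markers=markers)

prob = np.random.rand(20, 20, 20)
supervoxels = supervoxels_from_probability(prob)
print(int(supervoxels.max()), "supervoxel labels")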

References
1. Andres, B., Kappes, J.H., Beier, T., Köthe, U., Hamprecht, F.A.: Probabilistic image segmentation with closedness constraints. In: ICCV (2011)
2. Chopra, S., Rao, M.R.: The partition problem. Math. Program. 59, 87–115 (1993)
3. Alt, H., Fuchs, U., Kriegel, K.: On the Number of Simple Cycles in Planar Graphs. In: Möhring, R.H. (ed.) WG 1997. LNCS, vol. 1335, pp. 15–24. Springer, Heidelberg (1997)
4. Grötschel, M., Wakabayashi, Y.: A cutting plane algorithm for a clustering problem. Math. Program. 45, 59–96 (1989)
5. Costa, M.-C., Létocart, L., Roupin, F.: Minimal multicut and maximal integer multiflow: A survey. European J. of Oper. Res. 162, 55–69 (2005)
6. Bansal, N., Blum, A., Chawla, S.: Correlation clustering. Machine Learning 56, 89–113 (2004)
7. Demaine, E.D., Emanuel, D., Fiat, A., Immorlica, N.: Correlation clustering in general weighted graphs. Theoretical Computer Science 361, 172–187 (2006)
8. Garey, M., Johnson, D.: Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, New York (1979)
9. Sontag, D., Jaakkola, T.: New outer bounds on the marginal polytope. In: NIPS (2008)
10. Barahona, F., Grötschel, M., Jünger, M., Reinelt, G.: An application of combinatorial optimization to statistical physics and circuit layout design. Oper. Res. 36, 493–513 (1988)
11. Kappes, J.H., Speth, M., Andres, B., Reinelt, G., Schnörr, C.: Globally Optimal Image Partitioning by Multicuts. In: Boykov, Y., Kahl, F., Lempitsky, V., Schmidt, F.R. (eds.) EMMCVPR 2011. LNCS, vol. 6819, pp. 31–44. Springer, Heidelberg (2011)
12. Kim, S., Nowozin, S., Kohli, P., Yoo, C.D.: Higher-order correlation clustering for image segmentation. In: NIPS (2011)
13. Nowozin, S., Jegelka, S.: Solution stability in linear programming relaxations: graph partitioning and unsupervised learning. In: ICML (2009)
14. Vicente, S., Kolmogorov, V., Rother, C.: Graph cut based image segmentation with connectivity priors. In: CVPR (2008)
15. Nowozin, S., Lampert, C.H.: Global connectivity potentials for random field models. In: CVPR (2009)
16. Lempitsky, V., Kohli, P., Rother, C., Sharp, T.: Image segmentation with a bounding box prior. In: ICCV (2009)
17. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: ICCV (2001)
18. Hatcher, A.: Algebraic Topology. Cambridge Univ. Press (2002)
19. Helmstaedter, M., Briggman, K.L., Denk, W.: High-accuracy neurite reconstruction for high-throughput neuroanatomy. Nature Neuroscience 14, 1081–1088 (2011)
20. Denk, W., Horstmann, H.: Serial block-face scanning electron microscopy to reconstruct three-dimensional tissue nanostructure. PLoS Biology 2, e329 (2004)
21. Briggman, K.L., Helmstaedter, M., Denk, W.: Wiring specificity in the direction-selectivity circuit of the retina. Nature 471, 183–188 (2011)
22. Sommer, C., Straehle, C., Köthe, U., Hamprecht, F.A.: Ilastik: Interactive Learning and Segmentation Toolkit. In: ISBI (2011)
23. Meilă, M.: Comparing clusterings – an information based distance. J. of Multivariate Anal. 98, 873–895 (2007)
24. Rand, W.M.: Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66, 846–850 (1971)
25. Knott, G., Marchman, H., Wall, D., Lich, B.: Serial section scanning electron microscopy of adult brain tissue using focused ion beam milling. J. Neurosci. 28, 2959–2964 (2008)
26. Lucchi, A., Smith, K., Achanta, R., Knott, G., Fua, P.: Supervoxel-based segmentation of mitochondria in EM image stacks with learned shape features. IEEE Transactions on Medical Imaging 31, 474–486 (2012)
27. Straehle, C.N., Köthe, U., Knott, G., Hamprecht, F.A.: Carving: Scalable Interactive Segmentation of Neural Volume Electron Microscopy Images. In: Fichtinger, G., Martel, A., Peters, T. (eds.) MICCAI 2011, Part I. LNCS, vol. 6891, pp. 653–660. Springer, Heidelberg (2011)
Reduced Analytical Dependency Modeling
for Classifier Fusion

Andy Jinhua Ma and Pong Chi Yuen

Department of Computer Science, Hong Kong Baptist University
Kowloon Tong, Hong Kong
{jhma,pcyuen}@comp.hkbu.edu.hk

Abstract. This paper addresses the independent assumption issue in the classifier fusion process. In the last decade, dependency modeling techniques were developed under some specific assumptions which may not be valid in practical applications. In this paper, using analytical functions on posterior probabilities of each feature, we propose a new framework to model dependency without those assumptions. With the analytical dependency model (ADM), we give an equivalent condition to the independent assumption from the properties of marginal distributions, and show that the proposed ADM can model dependency. Since ADM may contain an infinite number of undetermined coefficients, we further propose a reduced form of ADM, based on the convergent properties of analytical functions. Finally, under the regularized least square criterion, an optimal Reduced Analytical Dependency Model (RADM) is learned by approximating posterior probabilities such that all training samples are correctly classified. Experimental results show that the proposed RADM outperforms existing classifier fusion methods on Digit, Flower, Face and Human Action databases.

Keywords: Dependency modeling, analytical function, classifier fusion, pattern classification.

1 Introduction

Many computer vision and pattern recognition applications face the challenges of complex scenes with cluttered backgrounds, small inter-class variations and large intra-class variations. To solve this problem, many algorithms have been developed to extract local or global discriminative features such as Local Binary Patterns [1], Laplacianfaces [2], etc. Instead of extracting a highly discriminative feature, classifier fusion has been proposed and the results are encouraging [3] [4] [5] [6] [7] [8] [9] [10]. While many classifier combination techniques [3] have been studied and developed in the last decade, it is a general assumption that classification scores are conditionally independently distributed. With this assumption, the joint probability of all the scores can be expressed as the product of marginal probabilities. The conditionally independent assumption could simplify the problem, but may not be valid in many practical applications.

In [5], instead of taking advantage of the conditionally independent assumption, classifier fusion methods are proposed by estimating the joint distribution of multiple classifiers, and performance is improved. However, when the number of classifiers is large, it needs numerous data to accurately estimate the joint density [11]. On the other hand, Terrades et al. [7] proposed to combine classifiers in a non-Bayesian framework by linear combination. Under the dependent normal assumption (DN), they formulated the classifier combination problem into a constrained quadratic programming problem. Nevertheless, if the normal assumption is not valid, the results will not be optimal. In this context, Ma and Yuen [10] proposed a linear classifier dependency model (LCDM), which can model dependency under the assumption that the posteriors do not deviate very much from the priors.
Apart from the methods mentioned above, the optimal weighting method [4], LPBoost [12] and its multi-class variants [8] aim at determining the correct weighting for linear classifier combination to improve recognition performance. Since linear classifier combination methods are limited to linearly separable systems, Toh et al. [6] developed a reduced multivariate polynomial model (RM) to describe the nonlinear input-output relationships for classifier fusion. Although these methods are derived without the conditionally independent assumption, they do not take full advantage of the probabilistic properties in the specific task of dependency modeling.
In this paper, we develop a novel framework for dependency modeling, and propose a method, namely Reduced Analytical Dependency Modeling (RADM), for classifier fusion. Inspired by the Product rule [3] (with independent assumption) and LCDM [10] (without independent assumption), we propose to model dependency by analytical functions on posterior probabilities of each feature. With the analytical dependency model (ADM), we give an equivalent condition to the independent assumption from the properties of marginal distributions, and show that the proposed ADM can model dependency. Since there may be an infinite number of undetermined coefficients in the ADM model, we further propose a reduced form of ADM, based on the convergent properties of analytical functions. At last, under the regularized least square criterion, the optimal RADM is learned by approximating posterior probabilities such that all training samples are correctly classified. The contributions of this paper are two-fold.
– We develop a new framework for dependency modeling by analytical functions on posterior probabilities of each feature. It is shown that the Product rule [3] and LCDM [10] can be unified by the proposed ADM framework. On the other hand, we give an equivalent condition for when the independent assumption holds from the properties of marginal distributions, and show that the proposed ADM can model dependency.
– We propose a novel RADM method for classifier fusion. Since the ADM model may contain an infinite number of undetermined coefficients, a reduced form of ADM, which can model dependency as well, is proposed from the convergent properties of analytical functions. After that, an optimal RADM is learned by
a new constrained quadratic programming problem, which minimizes the regularized least square error to approximate posterior probabilities such that all training samples are correctly classified.
The rest of this paper is organized as follows. We first review related works on classifier fusion. Section 3 presents the proposed method. Experimental results and conclusion are given in Section 4 and Section 5, respectively.

2 Related Works on Classifier Fusion


Combining classifiers is one of the strategies to improve recognition performance in general recognition problems. According to Bayesian theory [13], under the conditionally independent assumption, the posterior probability is given by

$$\Pr(\omega_l \mid x_1, \ldots, x_M) \;=\; \frac{P_0}{\Pr(\omega_l)^{M-1}} \prod_{m=1}^{M} \Pr(\omega_l \mid x_m) \tag{1}$$

where $\omega_l$ denotes the class label, M is the number of feature measurements, $x_m$ is the m-th measurement and $P_0 = \prod_{m=1}^{M}\Pr(x_m)\,/\,\Pr(x_1,\ldots,x_M)$. The Product rule was then derived from (1) in [3]. Moreover, with the assumption that the posterior probabilities of each classifier do not deviate dramatically from the priors, the Sum rule [3] was induced. Based on the Product rule and Sum rule, Kittler et al. [3] justified that the commonly used classifier combination rules, i.e. Max, Min, Median and Majority Vote, can be derived. Besides these combination rules developed under the Bayesian framework [3], Terrades et al. [7] tackled the classifier combination problem using a non-Bayesian probabilistic framework. Under the assumptions that classifiers can be combined linearly and the scores follow independent normal distributions, the independent normal (IN) combination rule was derived [7].
Without the conditionally independent assumption, the posterior probability of classifiers can be computed by joint distribution estimation. For example, in [5], Parzen window density estimation was used to estimate the joint density of posterior probabilities by a selected set of classifiers. Since it needs numerous data to ensure that the estimation of the joint distribution is accurate [11], Terrades et al. [7] proposed to combine classifiers by a linear model under a normal distribution assumption. When features are not conditionally independent, the covariance matrix in the normal distribution is not diagonal. In this case, the dependent normal (DN) rule [7] was formulated into a constrained quadratic programming problem, which can be solved by nonlinear programming techniques [14]. Removing the normal distribution assumption on scores, Ma and Yuen [10] proposed to add dependency terms to each posterior probability, and expand the product formulation as the linear classifier dependency model (LCDM) by neglecting high order terms, i.e.

$$\Pr(\omega_l \mid x_1, \ldots, x_M) \;\approx\; P_0 \Big[ (1 - M)\Pr(\omega_l) + \sum_{m=1}^{M} a_{lm} \Pr(\omega_l \mid x_m) \Big] \tag{2}$$
where $a_{l1}, \ldots, a_{lM}$ are the dependency weights. Then, the optimal LCDM model was learned by solving a standard linear programming problem, which maximizes the margins between genuine and imposter posterior probabilities in [10].
Besides the explicit dependency modeling methods [5] [7] [10], the optimal weighting method (OWM) [4], LPBoost approaches [8] [12] and the reduced multivariate polynomial model (RM) [6] can be used to combine classifiers with different kinds of features to improve recognition performance as well. OWM and LPBoost methods aim at determining the correct weighting for linear combination by minimizing the classification error and the 1-norm soft margin error, respectively. In order to describe the nonlinear input-output relationships, the reduced multivariate polynomial (RM) model was introduced in [6]. Since the number of terms increases exponentially with the order of the multivariate polynomial, Toh et al. [6] proposed to approximate the full polynomial by a modified lumped multinomial. Then, the optimal RM model was learned by a weight-decay regularization problem in [6].

3 Reduced Analytical Dependency Modeling


In this section, we propose a novel Reduced Analytical Dependency Modeling (RADM) method to model dependency for classifier fusion. In Section 3.1, we first derive the analytical dependency model (ADM) by unifying the Product rule (1) and LCDM (2), and give a detailed explanation of how the proposed ADM models dependency. Since ADM may contain an infinite number of undetermined coefficients, a reduced form of ADM is derived in Section 3.2. Finally, a method to learn the optimal RADM is presented in Section 3.3.

3.1 Analytical Dependency Modeling


Consider a combination problem in which there are M distinct feature descriptors $f_1, \ldots, f_M$ for any object O. Denote the feature measurements $x_1, \ldots, x_M$ as $x_m = f_m(O)$. The objective of dependency modeling is to estimate the posterior probability $\Pr(\omega_l \mid x_1, \ldots, x_M)$ for better classification performance. Since the modalities of the feature measurements can be different, e.g. $x_m$ can be a vector or a set of points, direct dependency modeling at the feature level is difficult. In turn, we consider modeling dependency by the posterior probabilities of each feature, $\Pr(\omega_l \mid x_m)$. Let us denote $s_{lm} = \Pr(\omega_l \mid x_m)$ and $s_l = (s_{l1}, \ldots, s_{lM})^T$. Suppose the prior probabilities are the same, i.e. $\Pr(\omega_l) = 1/L$, where L is the number of classes. With these notations, the Product rule (1) under the independent assumption and LCDM (2) for dependency modeling can be rewritten as

$$\text{Product:}\quad \Pr(\omega_l \mid x_1, \ldots, x_M) = P_0 \Big( L^{M-1} \prod_{m=1}^{M} s_{lm} \Big) = P_0\, h_{\text{Product}}(s_l)$$
$$\text{LCDM:}\quad \Pr(\omega_l \mid x_1, \ldots, x_M) \approx P_0 \Big( \sum_{m=1}^{M} a_{lm} s_{lm} + \frac{1-M}{L} \Big) = P_0\, h_{\text{LCDM}}(s_l) \tag{3}$$
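For concreteness, a toy sketch of the two fusion functions in (3), operating on the vector of per-feature posteriors of a single class (the LCDM weights a_lm used below are placeholders; in practice they would be learned as in [10]):

import numpy as np

def h_product(s_l, L):
    # Product rule: L^(M-1) * prod_m s_lm (conditionally independent features)
    M = len(s_l)
    return L ** (M - 1) * float(np.prod(s_l))

def h_lcdm(s_l, a_l, L):
    # LCDM: sum_m a_lm * s_lm + (1 - M) / L (linear dependency model)
    M = len(s_l)
    return float(np.dot(a_l, s_l)) + (1 - M) / L

s_l = np.array([0.4, 0.6, 0.5])     # posteriors of one class from M = 3 features
print(h_product(s_l, L=10))         # both values are defined up to the common factor P0
print(h_lcdm(s_l, a_l=np.ones(3), L=10))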
As indicated in (3), the Product rule and the LCDM model can be formulated as two different functions, $h_{\text{Product}}$ and $h_{\text{LCDM}}$, on the posterior probabilities $s_{l1}, \ldots, s_{lM}$. This implies that, if we choose a function different from $h_{\text{Product}}$ with the independent assumption, e.g. $h_{\text{LCDM}}$, dependency can be modeled. On the other hand, LCDM was proposed under the assumption that the posterior probabilities of each classifier do not deviate dramatically from the priors, as mentioned in [10]. However, without these assumptions, the function $h_l$ for class $\omega_l$ on $s_{l1}, \ldots, s_{lM}$ should be different from $h_{\text{Product}}$ and $h_{\text{LCDM}}$. Generally speaking, $h_l$ can be any function which models dependency between feature measurements by the posterior probabilities $s_{l1}, \ldots, s_{lM}$, i.e.

$$\Pr(\omega_l \mid x_1, \ldots, x_M) = P_0\, h_l(s_{l1}, \ldots, s_{lM}) \tag{4}$$

Analytical functions are very popular and have been employed in the Product rule and LCDM. We follow this direction and consider $h_l$ as an analytical function. According to the definition of analytical functions [15], $h_l$ can be expressed explicitly by the converged power series as

$$h_l(s_l;\theta_l) = \sum_{|\alpha|=0}^{\infty} \theta_{l\alpha}\, s_l^{\alpha} \tag{5}$$

where $\alpha = (n_1, \ldots, n_M)^T$, $n_1, \ldots, n_M$ are non-negative integers, $|\alpha| = n_1 + \cdots + n_M$, $s_l^{\alpha} = \prod_{m=1}^{M} s_{lm}^{n_m}$, and $\theta_l = (\theta_{l\mathbf{0}}, \ldots, \theta_{l\alpha}, \ldots)^T$ is the weighting coefficient vector, in which $\mathbf{0} = (0, \ldots, 0)^T$.
With the analytical dependency model (ADM) given by (5), we further investigate how it can model dependency from the probabilistic aspect. The ADM model (5) can be rewritten according to the order of $s_{lm}$ as

$$h_l(s_l;\theta_l) = \sum_{r=0}^{\infty} g_{lmr}(\bar{s}_{lm};\theta_{lmr})\, s_{lm}^{r} \tag{6}$$

where $\bar{s}_{lm} = (s_{l1}, \ldots, s_{l(m-1)}, s_{l(m+1)}, \ldots, s_{lM})^T$ and $g_{lmr}$ is an analytical function of $\bar{s}_{lm}$ with coefficient vector $\theta_{lmr}$. On the other hand, the posterior probabilities can be given by the Bayes rule [13] as follows,

$$\Pr(\omega_l \mid x_m) = \frac{\Pr(x_m \mid \omega_l)\Pr(\omega_l)}{\Pr(x_m)}, \qquad \Pr(\omega_l \mid x_1,\ldots,x_M) = \frac{\Pr(x_1,\ldots,x_M \mid \omega_l)\Pr(\omega_l)}{\Pr(x_1,\ldots,x_M)} \tag{7}$$

Since the conditional probability $\Pr(x_m \mid \omega_l)$ in (7) can be viewed as the marginal probability of the joint density $\Pr(x_1, \ldots, x_M \mid \omega_l)$ over the random measurements except $x_m$ [13], we get

$$\Pr(x_m \mid \omega_l) = \int \Pr(x_1,\ldots,x_M \mid \omega_l)\; dx_1 \cdots dx_{m-1}\, dx_{m+1} \cdots dx_M \tag{8}$$
With $P_0 = \prod_{m=1}^{M}\Pr(x_m)\,/\,\Pr(x_1,\ldots,x_M)$ as mentioned in Section 2 and equations (4), (6) and (7), the conditional joint density can be written as

$$\Pr(x_1,\ldots,x_M \mid \omega_l) = \frac{\prod_{m=1}^{M}\Pr(x_m)}{\Pr(\omega_l)} \sum_{r=0}^{\infty} g_{lmr}(\bar{s}_{lm};\theta_{lmr})\, s_{lm}^{r} \tag{9}$$

With the notation $s_{lm} = \Pr(\omega_l \mid x_m)$, substituting the probabilities $\Pr(x_m \mid \omega_l)$ in (7) and $\Pr(x_1,\ldots,x_M \mid \omega_l)$ in (9) into (8), we get

$$s_{lm} = \int \prod_{i\neq m}\Pr(x_i)\,\Big[\sum_{r=0}^{\infty} g_{lmr}(\bar{s}_{lm};\theta_{lmr})\, s_{lm}^{r}\Big]\, dx_1 \cdots dx_{m-1}\, dx_{m+1} \cdots dx_M \tag{10}$$

According to Abel's Lemma [15], the series in (10) converges uniformly. Thus, it can be integrated term by term, and equation (10) becomes

$$s_{lm} = \sum_{r=0}^{\infty} G_{lmr}(\theta_{lmr})\, s_{lm}^{r} \tag{11}$$

where $G_{lmr}(\theta_{lmr}) = \int \prod_{i\neq m}\Pr(x_i)\, g_{lmr}(\bar{s}_{lm};\theta_{lmr})\, dx_1 \cdots dx_{m-1}\, dx_{m+1} \cdots dx_M$.
Comparing the left and right hand sides of (11), the following equations can be obtained,

$$G_{lm1}(\theta_{lm1}) = 1, \tag{12}$$
$$G_{lm0}(\theta_{lm0}) = 0,\; G_{lm2}(\theta_{lm2}) = 0,\; \ldots,\; G_{lmr}(\theta_{lmr}) = 0,\; \ldots \tag{13}$$

According to the definition in (6), $g_{lmr}(\bar{s}_{lm};\theta_{lmr})$ is an analytical function similar to $h_l(s_l;\theta_l)$ in (5), and the vector $\bar{s}_{lm}$ can be considered as the mapping from the feature measurements $x_1,\ldots,x_{m-1},x_{m+1},\ldots,x_M$ to their posterior probabilities. Therefore, the integration of $\prod_{i\neq m}\Pr(x_i)\, g_{lmr}(\bar{s}_{lm};\theta_{lmr})$ over all feature measurements except $x_m$, which is denoted by $G_{lmr}(\theta_{lmr})$, is a linear function of the coefficient vector $\theta_{lmr}$. Without calculating the integration, we can observe that $\theta_{lm0} = 0,\ \theta_{lm2} = 0,\ \ldots,\ \theta_{lmr} = 0,\ \ldots$ is a trivial solution to (13). Substituting this trivial solution into (6), and under the assumption that ADM is symmetric in each $s_{lm}$, we have the equations $h_l(s_l;\theta_l) = g_{lm1}(\bar{s}_{lm};\theta_{lm1})\, s_{lm}$ for $1 \le m \le M$. This implies that each term in the power series $h_l$ contains all variables $s_{l1},\ldots,s_{lM}$ and that the order of each $s_{lm}$ cannot be larger than one. In this case, there is only one non-zero term $\prod_{m=1}^{M} s_{lm}$ in the analytical function $h_l$. In addition, according to (12), the ADM model becomes $h_l(s_l;\theta_l) = L^{M-1}\prod_{m=1}^{M} s_{lm}$, which is the Product rule under the conditionally independent assumption. This means the independent condition is equivalent to the situation where the solution to (13) is trivial. In other words, if the solution to (13) is non-trivial, dependency can be modeled. For general analytical functions, the weight vectors $\theta_{lm0},\theta_{lm2},\ldots,\theta_{lmr},\ldots$ are not necessarily zero. Consequently, ADM can model dependency by setting a non-trivial solution to (13).
3.2 Reduced Form of the ADM Model


The ADM model may have an infinite number of coefficients, in which case directly estimating the coefficient vector $\theta_l$ is infeasible. In turn, we propose to approximate the ADM model based on the convergent properties of the series defined by (5) and (6).
Let us consider equation (5) again. According to the definition of convergence of series [16], for any positive number $\epsilon$, there exists a positive integer K such that $\big|\sum_{|\alpha|=K+1}^{\infty} \theta_{l\alpha} s_l^{\alpha}\big| \le \epsilon$. If $\epsilon$ tends to zero, the analytical function can be approximated by the following equation,

$$h_l(s_l;\theta_l) \approx h_l(s_l;\theta_l;K) = \sum_{|\alpha|=0}^{K} \theta_{l\alpha}\, s_l^{\alpha} \tag{14}$$

Similarly, since the series defined by (6) is convergent, $h_l$ can be approximated by the first R + 1 terms in (6) as follows,

$$h_l(s_l;\theta_l) \approx h_l(s_l;\theta_l;R) = \sum_{r=0}^{R} g_{lmr}(\bar{s}_{lm};\theta_{lmr})\, s_{lm}^{r} \tag{15}$$

where R is a positive integer. According to the analysis in Section 3.1, the ap-
proximation in (15) can model dependency by setting solutions to the rst R
equations in (13) as non-trivial. Therefore, the reduced form of ADM can model
dependency. Equations (14) and (15) indicate that the highest order of each term
and each variable are K and R, respectively. Combining these two equations,
the reduced analytical dependency model (RADM) is given by


K
hl (sl ; l ; K, R) = l sl , s.t. = (n1 , , nM )T and 0 nm R (16)
||=0

Since the model order should be larger than or equal to the variable order,
K R. On the other hand, if K > M R, hl (sl ; l ; K, R) = hl (sl ; l ; M R, R).
This means the RADM model degenerates to hl (sl ; l ; M R, R) when K > M R.
Therefore, the relationship between the model order K and variable order R in
RADM is restricted as R K M R.
Denote the score vector z l = (s0l , , sl , )T , where s0l , , sl , are the
terms in (16). With these notations, the RADM model (16) can be written as
hl (sl ; l ; K, R) = Tl z l . The algorithmic procedure to obtain z l in RADM for
class l is given in Algorithm 1.

3.3 Learning the Optimal RADM Model


The optimal coefficient vector $\theta_l$ in (16) is determined by a learning process from J training samples $O_1, \ldots, O_J$ and their corresponding labels $y_1, \ldots, y_J$. Let us set the posterior probabilities as follows: $\Pr(\omega_l \mid O_j)$ is equal to one if $\omega_l = y_j$, and zero otherwise. This ensures that all samples are correctly classified.
Algorithm 1. Construct the score vector z_l in the RADM model.

Require: Posterior probabilities s_l1, ..., s_lM and model parameters K, R;
1: Set D = (0, 1, ..., R)^T and z_l = (1, s_l1, s_l1^2, ..., s_l1^R)^T;
2: for m = 2, 3, ..., M do
3:   Set D̃ = (D, 0), where 0 = (0, ..., 0)^T with the same column dimension as D;
4:   for r = 1, 2, ..., R do
5:     Update D̃ = (D̃; (D, r·1)), the column concatenation of D̃ and (D, r·1), where 1 = (1, ..., 1)^T with the same column dimension as D;
6:     Update z_l = (z_l; s_lm^r · z_l), the column concatenation of z_l and s_lm^r · z_l;
7:     Delete the rows in D̃ and the corresponding elements in z_l for which the row sums of D̃ are larger than K;
8:   end for
9:   Set D = D̃;
10: end for
11: return z_l.
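A possible Python rendering of the same construction (a sketch that enumerates the exponent vectors directly rather than reproducing the incremental bookkeeping of Algorithm 1; the resulting set of monomials is the same):

import numpy as np
from itertools import product

def radm_score_vector(s_l, K, R):
    # s_l: per-feature posteriors (s_l1, ..., s_lM) of one class
    # returns all monomials s_l^alpha with 0 <= n_m <= R and |alpha| <= K, cf. (16)
    M = len(s_l)
    exponents = [alpha for alpha in product(range(R + 1), repeat=M)
                 if sum(alpha) <= K]
    s = np.asarray(s_l, dtype=float)
    z_l = np.array([np.prod(s ** np.array(alpha)) for alpha in exponents])
    return exponents, z_l

exps, z = radm_score_vector([0.4, 0.6, 0.5], K=2, R=1)
print(len(z), "monomials")   # constant, linear and pairwise terms for M=3, K=2, R=1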

On the other hand, with equation (4), the posterior probability is computed by $P_0\, h_l(s_l;\theta_l)$. Since $P_0$ is a parameter depending on $O_j$, denote $p_j$ as the parameter with respect to $O_j$. With these notations, we get

$$h_l(s_{jl};\theta_l) = q_{jl} = \delta_{jl}/p_j, \qquad \delta_{jl} = \begin{cases} 1, & \omega_l = y_j \\ 0, & \omega_l \neq y_j \end{cases} \tag{17}$$

where $s_{jl} = (s_{jl1},\ldots,s_{jlM})^T$, $s_{jlm} = \Pr(\omega_l \mid x_{jm})$ and $x_{jm} = f_m(O_j)$. Denote $z_{jl} = (s_{jl}^{\mathbf{0}},\ldots,s_{jl}^{\alpha},\ldots)^T$, where $s_{jl}^{\mathbf{0}},\ldots,s_{jl}^{\alpha},\ldots$ are the terms in (16). The objective function for learning the optimal RADM is given by approximating the distribution (17) under the regularized least square criterion [17] as follows

$$\min_{\theta,q} \sum_{l=1}^{L}\sum_{j=1}^{J} (\theta_l^T z_{jl} - q_{jl})^2 + b \sum_{l=1}^{L} \|\theta_l\|^2 \tag{18}$$

where $\|\cdot\|$ denotes the $L_2$-norm and b is a regularization constant.


As mentioned in Section 2, $p_j = \prod_{m=1}^{M}\Pr(x_{jm})\,/\,\Pr(x_{j1},\ldots,x_{jM})$. Since estimations of the probability $\Pr(x_{jm})$ for each m and of the joint density $\Pr(x_{j1},\ldots,x_{jM})$ are difficult, direct computation of $p_j$ may not be feasible. Consequently, $p_1,\ldots,p_J$ are treated as undetermined variables. Thus, different from traditional least square regularization [17], the optimization problem (18) cannot be solved directly. In this context, we rewrite the objective function (18) as follows, so that it can be solved.
According to (17), $q_{jl} = 0$ if $y_j \neq \omega_l$. The optimization problem (18) becomes

$$\min_{\theta,q} \sum_{l=1}^{L}\Big[\sum_{y_j=\omega_l} (\theta_l^T z_{jl} - q_{jl})^2 + \sum_{y_j\neq\omega_l} (\theta_l^T z_{jl})^2\Big] + b\sum_{l=1}^{L}\|\theta_l\|^2 \tag{19}$$

Let $q_l = (q_{j_1 l},\ldots,q_{j_{N_l}l})^T$, such that $y_{j_n} = \omega_l$, where $N_l$ is the number of samples for class $\omega_l$. On the other hand, denote $A_l$ and $B_l$ as the matrices made up of the vectors
$z_{jl}$ with $y_j = \omega_l$ and $y_j \neq \omega_l$, respectively. With these notations, and adding a regularization term for $q_l$, the optimization problem (19) can be reformulated as

$$\min_{\theta,q} \sum_{l=1}^{L}\Big[(\theta_l^T A_l - q_l^T)(A_l^T\theta_l - q_l) + \theta_l^T B_l B_l^T\theta_l + b\,\theta_l^T\theta_l + c\,q_l^T q_l\Big] \tag{20}$$

where c is a regularization constant for $q_1,\ldots,q_L$. Since the undetermined vector tuples $(\theta_1; q_1),\ldots,(\theta_L; q_L)$ are independent of each other with respect to the index l, problem (20) can be broken down into L independent optimization sub-problems, each of which is written as

$$\min_{\gamma_l}\; \gamma_l^T H_l \gamma_l, \quad \text{s.t. } H_l = \begin{pmatrix} A_l A_l^T + B_l B_l^T + b I_{\theta_l} & -A_l \\ -A_l^T & (1+c) I_{q_l} \end{pmatrix}, \quad \gamma_l = (\theta_l; q_l) \tag{21}$$

where $I_{\theta_l}$ and $I_{q_l}$ are identity matrices with the same dimensions as $\theta_l$ and $q_l$, respectively.
In order to ensure that the probabilities are positive in the optimal RADM, constraints need to be added to the optimization problem (21). Since $q_{jl} = 1/p_j = \Pr(x_{j1},\ldots,x_{jM})\,/\,\prod_{m=1}^{M}\Pr(x_{jm}) > 0$ for $y_j = \omega_l$, the first constraint is set as $q_{jl} \ge \epsilon$ for $y_j = \omega_l$, where $\epsilon$ is a positive number such that $\epsilon \le \min_{j, y_j=\omega_l} q_{jl}$. On the other hand, according to the analysis in Section 3.2, the RADM model approximates (but is not exactly equal to) the posterior probability $\Pr(\omega_l \mid O_j)$. This means $\Pr(\omega_l \mid O_j) = p_j(\theta_l^T z_{jl} + \epsilon_{jl}) \approx p_j\,\theta_l^T z_{jl}$, where $\epsilon_{jl}$ represents a remainder term close to zero. Since $\Pr(\omega_l \mid O_j)$ is positive, $\theta_l^T z_{jl} \ge -\epsilon_{jl}$. To avoid introducing the extra parameters $\epsilon_{jl}$, we set the necessary condition that the posterior probabilities are positive as the second constraint, i.e. $\theta_l^T z_{jl} \ge -\epsilon_0$, where $\epsilon_0$ is a small constant such that $\epsilon_0 \ge \max_{j,l} \epsilon_{jl}$.
With these two constraints, the optimal $\theta_l$ in the RADM model is learned by the following optimization problem,

$$\min_{\gamma_l}\; \gamma_l^T H_l \gamma_l \qquad \text{s.t. i) } q_{ln} \ge \epsilon,\ 1 \le n \le N_l; \quad \text{ii) } \theta_l^T z_{jl} \ge -\epsilon_0,\ 1 \le j \le J \tag{22}$$

where $\gamma_l = (\theta_l; q_l)$ and $H_l$ is defined in (21). The solution to (22) can be determined by any standard nonlinear programming technique [14], e.g. active-set, cutting plane or interior point methods. Since our experiments are performed in the Matlab environment, a Matlab built-in function is employed to solve (22).
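For readers working in Python, the same sub-problem can be sketched with SciPy's SLSQP solver on random toy data (this is not the Matlab routine used in the paper, and the dimensions and regularization constants below are placeholders):

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
d, n_pos, n_neg = 7, 5, 8              # monomials, genuine and imposter samples
A = rng.random((d, n_pos))             # columns are z_jl with y_j equal to class l
B = rng.random((d, n_neg))             # columns are z_jl with y_j different from l
b, c, eps, eps0 = 1e-3, 1e-3, 0.5, 0.01

H = np.block([[A @ A.T + B @ B.T + b * np.eye(d), -A],
              [-A.T, (1 + c) * np.eye(n_pos)]])

objective = lambda g: g @ H @ g        # gamma^T H gamma with gamma = (theta; q)
cons = [{"type": "ineq", "fun": lambda g: g[d:] - eps},         # i)  q_n >= eps
        {"type": "ineq", "fun": lambda g: A.T @ g[:d] + eps0},  # ii) theta^T z_jl >= -eps0
        {"type": "ineq", "fun": lambda g: B.T @ g[:d] + eps0}]

res = minimize(objective, np.ones(d + n_pos), constraints=cons, method="SLSQP")
theta, q = res.x[:d], res.x[d:]
print(res.success, np.round(theta, 3))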

4 Experiments

In this section, we compare the proposed RADM with state-of-the-art classifier fusion algorithms, including Sum [3], IN [7], LPBoost [12], LP-B [8], RM [6], DN [7] and LCDM [10], in four different domains of recognition problems. In Sections 4.1 and 4.2, these combination methods are evaluated with the Digit [18]
and Flower [19] databases, respectively. After that, the results for face recognition using CMU PIE [20] and FERET [21], and for human action recognition using the Weizmann [22] and KTH [23] databases, are reported in Sections 4.3 and 4.4, respectively. It is important to point out that the main objective of these experiments is to evaluate the performance of different classifier fusion methods, and not state-of-the-art digit, flower, face and human action recognition algorithms.

4.1 Digit Recognition

The Multiple Feature Digit database [18] contains ten digits from 0 to 9, with 200 examples for each digit. Six features, namely Fourier coefficients, profile correlations, Karhunen-Loève coefficients, pixel averages, Zernike moments and morphological features, are extracted [18]. In this experiment, we randomly select 20 samples of each digit for training and the rest for testing. Since the probabilities are hard to determine accurately due to the problem of limited training samples, we use SVM classifiers [24] and normalize the SVM outputs by the double sigmoid method [25] to approximate the probabilities. To select the best parameters, five-fold cross validation (CV) is performed. The parameter C introduced in the soft margin SVM is selected from {0.01, 0.1, 1, 10, 100, 1000}. The CV outputs of the SVMs are used to train the weights for classifier combination. The RADM parameters $\epsilon$ and $\epsilon_0$ in (22) are set as follows. If features are independent, the point-wise dependencies $\Pr(x_{j1},\ldots,x_{jM})\,/\,\prod_{m=1}^{M}\Pr(x_{jm})$ for $1 \le j \le J$ are equal to 1. Since the parameter $\epsilon$ represents the lower bound of the point-wise dependencies, we set $\epsilon = 0.5 < 1$, so that RADM includes the independent case. On the other hand, the parameter $\epsilon_0$ in (22) must be a small number, so $\epsilon_0$ is set to 0.01. Other parameters are selected from suitable sets as follows. Regularization parameters b, c are selected from $\{10^{-6},\ldots,10^{-1},1\}$, the variable order R is selected from {1, 2} and the model order K is selected from one to the number of features. For LPBoost, LP-B and LCDM, the best parameter is selected from {0.05, 0.1, ..., 0.95}. This experiment has been repeated ten times.
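One common form of the double sigmoid normalization of [25] is sketched below; the reference point t and the edge parameters r1, r2 are placeholders that would be chosen per classifier, and the exact variant used in the experiments is an assumption:

import numpy as np

def double_sigmoid(scores, t=0.0, r1=1.0, r2=1.0):
    # maps raw SVM outputs into (0, 1); r1 / r2 control the slope below / above t
    s = np.asarray(scores, dtype=float)
    r = np.where(s < t, r1, r2)
    return 1.0 / (1.0 + np.exp(-2.0 * (s - t) / r))

print(double_sigmoid([-2.0, 0.0, 3.0]))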
The mean accuracies of the best single feature (BestFea) and the different classifier combination methods are reported in the second column of Table 1. From Table 1, we can see that the proposed RADM obtains the highest recognition rate of 96.84% on this database. Moreover, the recognition accuracies of the classifier fusion methods are higher than that of the best single feature. This confirms that the performance can be improved by combining classifiers, and that RADM outperforms other fusion methods by better modeling dependency.

4.2 Flower Recognition

The Oxford Flowers database [19] contains 17 categories of flowers with 80 images per category. Seven features, including shape, color, texture, HSV, HoG, SIFT internal, and SIFT boundary, are extracted using the methods reported in [19] [26]. We evaluate the proposed method using 17 × 40 images for training, 17 × 20 for validation and 17 × 20 for testing. The best parameters are selected by the
Table 1. Recognition accuracies (%) of different methods on all databases

Method        Digit  Flower  CMU PIE  FERET  Weizmann  KTH
BestFea       94.77  70.39   88.87    83.33  82.22     78.70
Sum [3]       96.23  85.39   91.75    86.11  84.44     84.72
IN [7]        95.63  85.49   93.32    88.19  85.56     84.26
LPBoost [12]  96.41  82.74   92.14    88.43  83.33     83.33
LP-B [8]      96.57  85.49   92.00    87.65  84.44     85.19
RM [6]        96.71  85.39   94.14    90.05  84.44     88.43
DN [7]        94.93  84.22   93.91    87.73  84.44     83.80
LCDM [10]     96.79  86.27   93.01    88.81  85.56     85.19
RADM          96.84  88.04   94.34    90.97  85.56     90.28

validation set, similar to the procedure for the Digit database. Following the settings in [8], kernel SVMs are used in this experiment, and the kernel matrices are defined as exp(−d(x, x′)/γ), where d is the distance and γ is the mean of the pairwise distances. This experiment was repeated three times using the predefined splits of this database [19].
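A small sketch of building such a kernel matrix from precomputed feature vectors (the underlying distance d is feature-dependent and is only illustrated here with the Euclidean distance):

import numpy as np
from scipy.spatial.distance import cdist

def distance_kernel(X, metric="euclidean"):
    # K(x, x') = exp(-d(x, x') / gamma), with gamma the mean pairwise distance
    D = cdist(X, X, metric=metric)
    gamma = D.mean()
    return np.exp(-D / gamma)

X = np.random.rand(5, 3)
print(distance_kernel(X).shape)      # (5, 5) kernel matrix for 5 samples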
The third column in Table 1 shows the mean accuracies of the different methods. From Table 1, we can see that RADM obtains an improvement of 2.55% and 1.77% over the classifier fusion methods with the independent assumption and those without the independent assumption, respectively. This indicates that modeling dependency without specific assumptions, like those in DN and LCDM, helps to improve the recognition performance.

4.3 Face Recognition


Two publicly available face databases, CMU PIE [20] and FERET [21], are used for the classifier fusion experiments. The CMU PIE face database contains 68 subjects with 41,368 images captured under varying pose, illumination and expression. We used 105 near-frontal face images for each individual, randomly selecting six for training, four for validation and the rest for testing. In the FERET database, we select 72 individuals with six near-frontal face images per person under different face expressions. The six images of each individual are randomly separated into training, validation and testing sets of equal size, i.e. two images per set. Four features, Eigenfaces [27], Fisherfaces [27], Laplacianfaces [2] and local binary patterns (LBP) [1], are extracted in both databases. Inspired by the experiments in [1] [2] [27], the parameters introduced by these features are determined as follows. The dimensions of Eigenfaces, Fisherfaces and Laplacianfaces are set to 100 and the number of nearest neighbors in Laplacianfaces is set to 3, while the window size is set to 16 × 12 with LBP^{u2}_{8,2}. SVM outputs of the training data are used to train the weights for the classifier fusion methods. The best parameters are selected by the validation set, similar to the procedure mentioned in Section 4.1. These experiments were repeated ten times on CMU PIE and three times on the FERET database.
Fig. 1. (a), (b) CMC curves on the CMU PIE and FERET face databases. (c) ROC curves on the Weizmann database. (d) Recognition rates of RADM and RM with varying model order on the KTH database.

The mean accuracies on these two databases are reported in the fourth and fifth columns of Table 1. The same conclusion can be drawn: RADM outperforms the other methods on both face databases. While the results in Table 1 only show the Rank-1 accuracies, the CMC curves of the top four methods on the CMU PIE and FERET databases are plotted in Fig. 1(a) and Fig. 1(b), respectively, for detailed comparison. It can be seen that RADM outperforms IN and DN on CMU PIE, and LPBoost and LCDM on the FERET database, and is slightly better than RM at different numbers of ranks. This indicates that RADM with dependency modeling gives the best performance for face recognition as well.

4.4 Human Action Recognition


In this section, we compare the classifier fusion methods on the Weizmann [22] and KTH [23] human action databases. The Weizmann database contains 93 videos of nine persons, each performing ten actions. Eight out of the nine persons in this database are used for training, and the remaining one is used for the evaluation. This is repeated nine times and the rates are averaged. On the other hand, there are 25 subjects performing six actions under four scenarios in the KTH database. We follow the common setting [23] to separate the video set into training (8 persons), validation (8 persons), and testing (9 persons) sets. Eight features, including
intensity, intensity difference, HoF, HoG, HoF2D, HoG2D, HoF3D and HoG3D, are extracted from the videos as reported in [10]. On these two databases, eight-fold CV is performed on the training data, and the CV outputs are used to train the weights for classifier fusion. The best parameters are selected by the CV outputs on Weizmann and by the validation set on KTH, respectively, similar to the procedure mentioned in Section 4.1.
The last two columns in Table 1 show the recognition rates of the different methods. We can see that RADM, LCDM and IN achieve the highest recognition rate of 85.56% on Weizmann, while RADM outperforms the other methods on the KTH database. We further compare the best three algorithms on the Weizmann database by the ROC measurement. The ROC curves in Fig. 1(c) show that RADM gives better performance when the false positive rate is larger than 10%, and the areas under the curves (AUC) are 0.8524 for RADM, 0.8472 for LCDM and 0.8424 for IN. This also confirms that the proposed RADM is better than other classifier fusion methods for human action recognition. At last, we compare RADM and RM with varying model order K on the KTH database. From Fig. 1(d), we can see that RADM outperforms RM for different model orders and is less sensitive to changes in the model order. This is another advantage of the proposed method.

5 Conclusion
In this paper, we have designed and proposed a new framework for dependency modeling by analytical functions on posterior probabilities of each feature. It is shown that the Product rule [3] (with independent assumption) and LCDM [10] (without independent assumption) can be unified by the proposed analytical dependency model (ADM). With the ADM, we give an equivalent condition to the independent assumption from the properties of marginal distributions, and show that ADM can model dependency. Since ADM may contain an infinite number of undetermined coefficients, a reduced form is proposed based on the convergent properties of analytical functions. At last, the optimal Reduced Analytical Dependency Model (RADM) is learned by a modified least square regularization problem, which aims at approximating posterior probabilities such that all training samples are correctly classified. Experimental results show that RADM outperforms existing classifier fusion methods on Digit, Flower, Face and Human Action databases. This indicates that RADM can better model dependency and helps to improve the performance in many recognition problems.

Acknowledgments. This project was partially supported by the Science Faculty Research Grant of Hong Kong Baptist University, National Natural Science Foundation of China Research Grants 61128009 and 61172136, and NSFC-GuangDong Research Grant U0835005.

References
1. Ahonen, T., Hadid, A., Pietikäinen, M.: Face Recognition with Local Binary Patterns. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3021, pp. 469–481. Springer, Heidelberg (2004)
2. He, X., Yan, S., Hu, Y., Niyogi, P., Zhang, H.J.: Face recognition using Laplacianfaces. TPAMI 27, 328–340 (2005)
3. Kittler, J., Hatef, M., Duin, R.P.W., Matas, J.: On combining classifiers. TPAMI 20, 226–239 (1998)
4. Ueda, N.: Optimal linear combination of neural networks for improving classification performance. TPAMI 22, 207–215 (2000)
5. Prabhakar, S., Jain, A.K.: Decision-level fusion in fingerprint verification. Pattern Recognition 35, 861–874 (2002)
6. Toh, K.A., Yau, W.Y., Jiang, X.: A reduced multivariate polynomial model for multimodal biometrics and classifiers fusion. IEEE Transactions on Circuits and Systems for Video Technology 14, 224–233 (2004)
7. Terrades, O.R., Valveny, E., Tabbone, S.: Optimal classifier fusion in a non-Bayesian probabilistic framework. TPAMI 31, 1630–1644 (2009)
8. Gehler, P., Nowozin, S.: On feature combination for multiclass object classification. In: ICCV, pp. 221–228 (2009)
9. Scheirer, W., Rocha, A., Micheals, R., Boult, T.: Robust Fusion: Extreme Value Theory for Recognition Score Normalization. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part III. LNCS, vol. 6313, pp. 481–495. Springer, Heidelberg (2010)
10. Ma, A.J., Yuen, P.C.: Linear dependency modeling for feature fusion. In: ICCV, pp. 2041–2048 (2011)
11. Silverman, B.W.: Density Estimation for Statistics and Data Analysis. Chapman and Hall (1986)
12. Demiriz, A., Bennett, K.P., Shawe-Taylor, J.: Linear programming boosting via column generation. JMLR 46, 225–254 (2002)
13. Feller, W.: An Introduction to Probability Theory and Its Applications, vol. 1. Wiley (1968)
14. Luenberger, D.G., Ye, Y.: Linear and Nonlinear Programming. Springer (2008)
15. Krantz, S.G., Parks, H.R.: A Primer of Real Analytic Functions. Birkhäuser (2002)
16. Rudin, W.: Principles of Mathematical Analysis. McGraw-Hill (1976)
17. Rifkin, R., Yeo, G., Poggio, T.: Regularized least squares classification. NATO Science Series III: Computer and Systems Sciences 190, 131–153 (2003)
18. Breukelen, M.V., Duin, R.P.W., Tax, D.M.J., Hartog, J.E.D.: Handwritten digit recognition by combined classifiers. Kybernetika 34, 381–386 (1998)
19. Nilsback, M.E., Zisserman, A.: A visual vocabulary for flower classification. In: CVPR (2006)
20. Sim, T., Baker, S., Bsat, M.: The CMU pose, illumination, and expression database. TPAMI 25, 1615–1618 (2003)
21. Phillips, P.J., Moon, H., Rizvi, S.A., Rauss, P.J.: The FERET evaluation methodology for face recognition algorithms. TPAMI 22, 1090–1104 (2000)
22. Gorelick, L., Blank, M., Shechtman, E., Irani, M., Basri, R.: Actions as space-time shapes. TPAMI 29, 2247–2253 (2007)
23. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: A local SVM approach. In: ICPR (2004)
24. Canu, S., Grandvalet, Y., Guigue, V., Rakotomamonjy, A.: SVM and kernel methods Matlab toolbox. Perception Systèmes et Information, INSA de Rouen, Rouen, France (2005)
25. Jain, A., Nandakumar, K., Ross, A.: Score normalization in multimodal biometric systems. Pattern Recognition 38, 2270–2285 (2005)
26. Nilsback, M.E., Zisserman, A.: Automated flower classification over a large number of classes. In: ICVGIP (2008)
27. Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J.: Eigenfaces vs. Fisherfaces: recognition using class specific linear projection. TPAMI 19, 711–720 (1997)
Learning to Match Appearances
by Correlations in a Covariance Metric Space

Sławomir Bąk, Guillaume Charpiat, Etienne Corvée, François Brémond, and Monique Thonnat

INRIA Sophia Antipolis, STARS Group
2004, Route des Lucioles, BP93
06902 Sophia Antipolis Cedex, France
firstname.surname@inria.fr

Abstract. This paper addresses the problem of appearance matching across disjoint camera views. Significant appearance changes, caused by variations in view angle, illumination and object pose, make the problem challenging. We propose to formulate the appearance matching problem as the task of learning a model that selects the most descriptive features for a specific class of objects. Learning is performed in a covariance metric space using an entropy-driven criterion. Our main idea is that different regions of the object appearance ought to be matched using various strategies to obtain a distinctive representation. The proposed technique has been successfully applied to the person re-identification problem, in which a human appearance has to be matched across non-overlapping cameras. We demonstrate that our approach improves state-of-the-art performance in the context of pedestrian recognition.

Keywords: covariance matrix, re-identification, appearance matching.

1 Introduction
Appearance matching of the same object registered in disjoint camera views is one of the most challenging issues in every video surveillance system. The present work addresses the problem of appearance matching, in which non-rigid objects change their pose and orientation. The changes in object appearance, together with inter-camera variations in lighting conditions, different color responses, different camera viewpoints and different camera parameters, make the appearance matching task extremely difficult.
Recent studies tackle this topic in the context of pedestrian recognition. Human recognition is of primary importance not only for people behavior analysis in large area networks but also in security applications for people tracking. Determining whether a given person of interest has already been observed over a network of cameras is referred to as person re-identification. Our approach is focused on, but not limited to, this application.
Appearance-based recognition is of particular interest in video surveillance, where biometrics such as iris, face or gait might not be available, e.g. due to the sensors' scarce resolution or low frame rate. Thus, appearance-based techniques rely

on clothing, assuming that individuals wear the same clothes between different sightings.
In this paper, we propose a novel method, which learns how to match the appearance of a specific object class (e.g. the class of humans). Our main idea is that different regions of the object appearance ought to be matched using different strategies to obtain a distinctive representation. Extracting region-dependent features allows us to characterize the appearance of a given object class in a more efficient and informative way. Using different kinds of features to characterize various regions of an object is fundamental to our appearance matching method.
We propose to model the object appearance using the covariance descriptor [1], yielding rotation and illumination invariance. The covariance descriptor has already been successfully used in the literature for appearance matching [2,3]. In contrast to these approaches, we do not define an a priori feature vector for extracting covariance, but we learn which features are the most descriptive and distinctive depending on their localization in the object appearance. Characterizing a specific class of objects (e.g. humans), we select only the essential features for this class, removing irrelevant redundancy from covariance feature vectors and ensuring low computational cost. The output model is used for extracting the appearance of an object from a set of images, while producing a descriptive and distinctive representation.
In this paper, we make the following contributions:
– We formulate the appearance matching problem as the task of learning a model that selects the most descriptive features for a specific class of objects (Section 3).
– By using a combination of small covariance matrices (4 × 4) between few relevant features, we offer an efficient and distinctive representation of the object appearance (Section 3.1).
– We propose to learn a general model for appearance matching in a covariance metric space using a correlation-based feature selection (CFS) technique (Section 3.2).
We evaluate our approach in Section 4 before discussing future work and concluding.

2 Related Work
Recent studies have focused on the appearance matching problem in the context of pedestrian recognition. Person re-identification approaches concentrate either on distance learning regardless of the representation choice [4,5], or on feature modeling, producing a distinctive and invariant representation for appearance matching [6,7]. Learning approaches use training data to search for strategies that combine the given features, maximizing inter-class variation whilst minimizing intra-class variation. Instead, feature-oriented approaches concentrate on an invariant representation, which should handle viewpoint and camera changes. A further classification of appearance-based techniques distinguishes the
single-shot and the multiple-shot approaches. The former class extracts appearance using a single image [8,9], while the latter employs multiple images of the same object to obtain a robust representation [6,7,10].
Single-Shot Approaches: In [9], a model for the shape and appearance of the object is presented. A pedestrian image is segmented into regions and their color spatial information is registered into a co-occurrence matrix. This method works well if the system considers only a frontal viewpoint. For more challenging cases, where viewpoint invariance is necessary, an ensemble of localized features (ELF) [11] is selected by a boosting scheme. Instead of designing a specific feature for characterizing people's appearance, a machine learning algorithm constructs a model that provides maximum discriminability by filtering a set of simple features. In [12], pairwise dissimilarity profiles between individuals are learned and adapted for nearest neighbor classification. Similarly, in [13], a rich set of feature descriptors based on color, textures and edges is used to reduce the amount of ambiguity among the human class. The high-dimensional signature is transformed into a low-dimensional discriminant latent space using a statistical tool called Partial Least Squares (PLS) in a one-against-all scheme. However, both methods demand an extensive learning phase based on the pedestrians to re-identify, extracting discriminative profiles, which makes the approaches non-scalable. The person re-identification problem is reformulated as a ranking problem in [14]. The authors present an extensive evaluation of learning approaches and show that a ranking-based model can improve reliability and accuracy. Distance learning is also the main topic of [5]. A probabilistic model maximizes the probability of a true match pair having a smaller distance than that of a wrong match pair. This approach focuses on maximizing matching accuracy regardless of the representation choice.
Multiple-Shot Approaches: In [15], every individual is represented by two models: descriptive and discriminative. The discriminative model is learned using the descriptive model as an assistance. In [10], a spatiotemporal graph is generated for ten consecutive frames to group spatiotemporally similar regions. Then, a clustering method is applied to capture the local descriptions over time and to improve matching accuracy. In [7], the authors propose a feature-oriented approach which combines three features: (1) chromatic content (HSV histogram); (2) maximally stable color regions (MSCR); and (3) recurrent highly structured patches (RHSP). The extracted features are weighted using the idea that features closer to the body's axes of symmetry are more robust against scene clutter. Recurrent patches are presented in [6]. Using epitome analysis, highly informative patches are extracted from a set of images. In [16], the authors show that features are not as important as precise body part detection, looking for part-to-part correspondences.
Learning approaches concentrate on distance metrics regardless of the representation choice. In the end, those approaches use very simple features such as color histograms to perform recognition. Instead of learning, feature-oriented approaches concentrate on the representation without taking into account discriminative analysis. In fact, learning using a sophisticated feature representation
is very hard or even unattainable. In [15], the authors deteriorate the covariance feature to apply learning. As covariances do not live in a Euclidean space, it is difficult to perform learning on an unknown manifold without a suitable metric.
This work overcomes these learning issues by considering a covariance metric space using an entropy-driven technique. We combine the advantages of a strong descriptor with an efficient selection method, thus producing a robust representation for appearance matching.
3 The Approach

Our appearance matching requires two operations. The first stage overcomes color
dissimilarities caused by variations in lighting conditions. We apply the histogram
equalization technique [17] to the color channels (RGB) to maximize the entropy
in each of these channels and to obtain a camera-independent color representation.
The second step is responsible for appearance extraction (Section 3.3) using a
general model learned for a specific class of objects (e.g. humans). The following
sections describe our feature space and the learning by which the appearance
model for matching is generated.
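For concreteness, the following Python/NumPy sketch illustrates the kind of per-channel histogram equalization used in this first stage; the function name and the 8-bit input assumption are ours, not taken from the original implementation.

```python
import numpy as np

def equalize_rgb(image):
    """Apply histogram equalization independently to each RGB channel.

    `image` is assumed to be an H x W x 3 uint8 array; each channel is remapped
    so that its intensity histogram is approximately flat, which maximizes the
    per-channel entropy and removes camera-specific color shifts.
    """
    out = np.empty_like(image)
    for c in range(3):
        channel = image[..., c]
        hist, _ = np.histogram(channel, bins=256, range=(0, 256))
        cdf = hist.cumsum().astype(np.float64)
        cdf /= cdf[-1]                                 # normalized cumulative distribution
        lut = np.round(255.0 * cdf).astype(np.uint8)   # monotone remapping table
        out[..., c] = lut[channel]
    return out
```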

3.1 Feature Space

Our object appearance is characterized using the covariance descriptor [1]. This
descriptor encodes information on feature variances inside an image region, their
correlations with each other and their spatial layout. The performance of the
covariance features is found to be superior to other methods, as rotation and
illumination changes are absorbed by the covariance matrix.
In contrast to [1,2,15], we do not limit our covariance descriptor to a single feature
vector. Instead of defining the feature vector a priori, we use a machine learning
technique to select the features that provide the most descriptive representation of
the appearance of an object.
Let L = {R, G, B, I, ∇I, θI, . . . } be a set of feature layers, in which each layer
is a mapping such as color, intensity, gradients or filter responses (texture filters,
i.e. Gabor, Laplacian or Gaussian). Instead of using covariances between all
of these layers, which would be computationally expensive, we compute covariance
matrices of a few relevant feature layers. These relevant layers are selected
depending on the region of the object (see Section 3.2). In addition, let layer D be
the distance between the center of the object and the current pixel location. The covariance
of the distance layer D and three other layers l (l ∈ L) forms our descriptor, which is
represented by a 4 × 4 covariance matrix. By using the distance D in every covariance,
we keep a spatial layout of the feature variances which is rotation invariant.
State-of-the-art techniques very often use the pixel location (x, y) instead of the distance
D, yielding a better description of an image region. However, in our detailed
experiments, using D rather than (x, y) did not decrease the recognition
accuracy in the general case, while decreasing the number of features in the covariance matrix.
Fig. 1. A meta covariance feature space. Example of three different covariance features.
Every covariance is extracted from a region (P), the distance layer (D) and three channel
functions (e.g. the bottom covariance feature is extracted from region P3 using layers D,
I (intensity), ∇I (gradient magnitude) and θI (gradient orientation)).

This may be due to the fact that we hold spatial information twofold: (1) by the
location of the rectangular sub-region from which the covariance is extracted and
(2) by D in the covariance matrix. We constrain our covariances to combinations
of 4 features, ensuring computational efficiency. Also, bigger covariance matrices
tend to include superfluous features which can clutter the appearance matching.
4 × 4 matrices provide sufficiently descriptive correlations while keeping low the
computational time needed for calculating the generalized eigenvalues during
distance computation.
Different combinations of three feature layers produce different kinds of covariance
descriptors. By using different covariance descriptors, assigned to different
locations in an object, we are able to select the most discriminative covariances
according to their positions. The idea is to characterize different regions of an
object by extracting different kinds of features (e.g. when comparing human
appearances, edges coming from the shapes of arms and legs are not discriminative
enough in most cases, as every instance possesses similar features). Taking into
account this phenomenon, we minimize redundancy in the appearance representation
by an entropy-driven selection method.
Let us define the index space Z = {(P, l_i, l_j, l_k) : P ∈ P; l_i, l_j, l_k ∈ L} of our meta
covariance feature space C, where P is a set of rectangular sub-regions of the
object and l_i, l_j, l_k are color/intensity or filter layers. The meta covariance feature
space C is obtained by the mapping Z → C : cov_P(D, l_i, l_j, l_k), where cov_P(φ) is the
covariance descriptor [1] of the feature vector φ:

    cov_P(φ) = 1/(|P| − 1) Σ_{k∈P} (φ_k − μ)(φ_k − μ)^T ,

with μ the mean of φ over P.
Fig. 1 shows different feature layers as well as examples of three different types
of covariance descriptor. The dimension n = |Z| = |C| of our meta covariance
feature space is the product of the number of possible rectangular regions and the
number of different combinations of feature layers.
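As an illustration of the descriptor itself, the sketch below computes one 4 × 4 covariance of the distance layer D and three chosen feature layers over a rectangular sub-region; the layer stack, region coordinates and function name are hypothetical inputs of ours, not the authors' code.

```python
import numpy as np

def region_covariance(layers, region, center):
    """Compute a 4x4 covariance descriptor cov_P(D, l_i, l_j, l_k).

    `layers` : list of three H x W arrays (the selected feature layers).
    `region` : (top, left, height, width) of the rectangular sub-region P.
    `center` : (row, col) of the object center, used for the distance layer D.
    """
    top, left, h, w = region
    rows, cols = np.mgrid[top:top + h, left:left + w]
    # Distance layer D: distance of every pixel in P to the object center.
    d = np.sqrt((rows - center[0]) ** 2 + (cols - center[1]) ** 2)
    feats = [d.ravel()] + [l[top:top + h, left:left + w].ravel() for l in layers]
    samples = np.stack(feats, axis=0)   # 4 x |P| matrix of per-pixel feature vectors
    return np.cov(samples)              # unbiased estimate, divides by |P| - 1
```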

3.2 Learning in a Covariance Metric Space


Let a^c_i = {a^c_{i,1}, a^c_{i,2}, . . . , a^c_{i,m}} be a set of relevant observations of an object i in
camera c, where a^c_{i,j} is an n-dimensional vector composed of all possible covariance
features extracted from image j of object i in the n-dimensional meta covariance
feature space C. We define the distance vector between two samples a^c_{i,j} and a^{c'}_{k,l}
as follows

    d(a^c_{i,j}, a^{c'}_{k,l}) = ( ρ(a^c_{i,j}[z], a^{c'}_{k,l}[z]) )^T_{z∈Z} ,      (1)

where ρ is the geodesic distance between covariance matrices [18], and a^c_{i,j}[z],
a^{c'}_{k,l}[z] are the corresponding covariance matrices (the same region P and the
same combination of layers). The index z is an iterator over C.
We cast the appearance matching problem into the following distance learning
problem. Let Δ+ be the distance vectors computed using pairs of relevant samples
(of the same person captured in different cameras, i = k, c ≠ c') and let Δ− be the
distance vectors computed between pairs of related irrelevant samples (i ≠ k,
c ≠ c'). The pairwise elements Δ+ and Δ− are distance vectors which stand for
positive and negative samples, respectively. These distance vectors define a
covariance metric space. Given Δ+ and Δ− as training data, our task is to find a
general model of appearance that maximizes matching accuracy by selecting
relevant covariances and thus defining a distance.
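To make the covariance metric space concrete, the sketch below evaluates the geodesic distance of [18] between two covariance matrices through generalized eigenvalues, and stacks it over all indices z to form a distance vector; names such as `geodesic` and `distance_vector` are illustrative, not taken from the original code.

```python
import numpy as np
from scipy.linalg import eigvalsh

def geodesic(c1, c2):
    """Metric of Forstner and Moonen [18]: sqrt(sum_i ln^2 lambda_i(c1, c2)),
    where lambda_i are the generalized eigenvalues of the SPD pair (c1, c2)."""
    lam = eigvalsh(c1, c2)          # generalized eigenvalues of c1 x = lam c2 x
    return np.sqrt(np.sum(np.log(lam) ** 2))

def distance_vector(sig_a, sig_b):
    """Distance vector between two samples: one geodesic distance per index z.

    `sig_a`, `sig_b` are lists of 4x4 covariance matrices, aligned so that the
    z-th entries share the same region P and the same layer combination."""
    return np.array([geodesic(a, b) for a, b in zip(sig_a, sig_b)])
```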

Riemannian Geometry: Covariance descriptors, as positive definite symmetric
matrices, lie on a manifold that is not a vector space (they do not lie in a Euclidean
space). Treating the covariance manifold as Riemannian, we can apply differential
geometry [19] to perform the usual operations such as mean or distance. However,
learning on a manifold is a difficult and unsolved challenge. The methods of [2,20]
perform classification by regression over the mappings from the training data to
a suitable tangent plane. By defining the tangent plane over the Karcher mean of the
positive training data points, we can preserve the local structure of the points.
Unfortunately, the models extracted using means of the positive training data
points tend to overfit. These models concentrate on tangent planes obtained
from the training data and do not generalize well. In [15], the authors
avoid this problem by casting covariance matrices into Sigma Points that lie in
an approximate covariance space.
Neither using tangent planes over Karcher means extracted from the training
data, nor casting covariances into Sigma Points, satisfies our matching model. As
we want to take full advantage of the covariance manifold, we propose to
extract a general model for appearance matching by identifying the most salient
features. Based on the hypothesis "A good feature subset is one that contains
features highly correlated with (predictive of) the class, yet uncorrelated with (not
predictive of) each other" [21], we build our appearance model using covariance
features f_z (f_z ∈ C) chosen by a correlation-based feature selection technique.

Correlation-based Feature Selection (CFS): Correlation-based feature selection
(CFS) [21] is a filter algorithm that ranks feature subsets according to
a correlation-based evaluation function (see Fig. 2). This evaluation function favors feature
subsets which contain features highly correlated with the class and uncorrelated
with each other. In our distance learning problem, we define the positive and
negative classes by Δ+ and Δ−, i.e. relevant and irrelevant pairs of samples (see
Section 3.2). Further, a feature f_z ∈ C is characterized by the distribution of the
z-th elements in the distance vectors Δ+ and Δ−. The feature-class correlation and
the feature-feature inter-correlation are measured using a symmetrical uncertainty
model [21]. As this model requires nominal valued features, we discretize f_z using
the method of Fayyad and Irani [22]. Let X be the nominal valued feature
obtained by discretization of f_z.

Fig. 2. Correlation-based feature selection
We assume that a probabilistic model of X can be formed by estimating
the probabilities of the values x ∈ X from the training data. The
uncertainty can be measured by the entropy H(X) = −Σ_{x∈X} p(x) log₂ p(x).
A relationship between features X and Y can be given by the conditional entropy
H(X|Y) = −Σ_{y∈Y} p(y) Σ_{x∈X} p(x|y) log₂ p(x|y). The amount by which the entropy of X
decreases reflects additional information on X provided by Y and is called the
information gain (mutual information), defined as Gain = H(X) − H(X|Y) =
H(Y) − H(Y|X) = H(X) + H(Y) − H(X,Y).
Even if the information gain is a symmetrical measure, it is biased in favor
of features with more discrete values. Thus, the symmetrical uncertainty r_XY is
used to overcome this problem: r_XY = 2 · Gain / (H(X) + H(Y)).
Having the correlation measure, a subset of features S is evaluated using the
function M(S) defined as

    M(S) = k r̄_cf / sqrt( k + k(k − 1) r̄_ff ) ,      (2)

where k is the number of features in subset S, r̄_cf is the average feature-class
correlation and r̄_ff is the average feature-feature inter-correlation:

    r̄_cf = (1/k) Σ_{f_z∈S} r_{c f_z} ,    r̄_ff = 2/(k(k − 1)) Σ_{f_i,f_j∈S, i<j} r_{f_i f_j} ,      (3)

where c is the class, or relevance feature, which is +1 on Δ+ and −1 on Δ−. The
numerator in Eq. 2 indicates the predictive ability of subset S and the denominator
stands for the redundancy among the features.

Fig. 3. Extraction of the appearance using the model (the set of features selected by
CFS). Different colors and shapes in the model refer to different kinds of covariance
features.

Equation 2 is the core of CFS, which ranks feature subsets in the search
space of all possible feature subsets. Since exhaustive enumeration of all possible
feature subsets is prohibitive in most cases, a heuristic search strategy has to be
applied. We have investigated different search strategies, among which best first
search [23] performs best.
Best first search is an AI search strategy that allows backtracking along the
search path. Our best first search starts with no features and progresses forward through
the search space by adding single features. The search terminates if T consecutive
subsets show no improvement over the current best subset (we set T = 5 in the
experiments). By using this stopping criterion we prevent the best first search
from exploring the entire feature subset search space. Fig. 2 illustrates the CFS
method. Let Φ be the output of CFS, i.e. the selected feature subset of C. This feature
subset forms the model that is used for appearance extraction and matching.
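The following sketch, written with our own naming conventions, shows how the symmetrical uncertainty and the merit of Eq. 2 can be evaluated for a candidate subset of discretized features; it is a minimal illustration of CFS scoring, not the selection code used by the authors (which additionally runs the best first search of [23]).

```python
import numpy as np

def entropy_counts(counts):
    """Shannon entropy (base 2) from a vector of event counts."""
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def entropy(x):
    _, counts = np.unique(x, return_counts=True)
    return entropy_counts(counts)

def joint_entropy(x, y):
    _, counts = np.unique(np.stack([x, y], axis=1), axis=0, return_counts=True)
    return entropy_counts(counts)

def symmetrical_uncertainty(x, y):
    """r_XY = 2 * Gain / (H(X) + H(Y)) for two discrete-valued arrays."""
    hx, hy = entropy(x), entropy(y)
    denom = hx + hy
    if denom == 0:
        return 0.0
    gain = hx + hy - joint_entropy(x, y)    # mutual information
    return 2.0 * gain / denom

def merit(features, labels):
    """CFS merit of Eq. 2 for a list of discretized feature columns and class labels."""
    k = len(features)
    r_cf = np.mean([symmetrical_uncertainty(f, labels) for f in features])
    if k == 1:
        return r_cf
    r_ff = np.mean([symmetrical_uncertainty(features[i], features[j])
                    for i in range(k) for j in range(i + 1, k)])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)
```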

3.3 Appearance Extraction


Having the general model Φ for a specific object class (e.g. humans), we compute
the appearance representation using a set of frames (see Fig. 3). In the context
of person re-identification, our method belongs to the group of multiple-shot
approaches, where multiple images of the same person are used to extract a
compact representation. This representation can be seen as a signature of the
multiple instances.
Using our model Φ, a straightforward way to extract the appearance would
be to compute covariance matrices of a video volume directly for every f_z.
However, using volume covariances we lose information on the real feature
distribution (the temporal feature characteristics are merged). Thus, similarly to [2], we
compute Karcher means on the Riemannian manifold. The mean covariance matrix,
as an intrinsic average, blends appearance information from multiple images.
For every feature f_z ∈ Φ we compute the mean covariance matrix. The set of
mean covariance matrices stands for the appearance representation of an object
(its signature).
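The Karcher mean under the affine-invariant Riemannian metric has no closed form; a standard fixed-point iteration is sketched below (our sketch, not the authors' implementation):

```python
import numpy as np
from scipy.linalg import sqrtm, logm, expm, inv

def karcher_mean(covs, iters=20, tol=1e-6):
    """Intrinsic (Karcher) mean of a list of SPD matrices under the
    affine-invariant metric, computed by gradient descent on the manifold."""
    mu = np.mean(covs, axis=0)              # initialize with the arithmetic mean
    for _ in range(iters):
        s = np.real(sqrtm(mu))
        s_inv = inv(s)
        # Average of the log-maps of all samples at the current estimate.
        tangent = np.real(np.mean([logm(s_inv @ c @ s_inv) for c in covs], axis=0))
        mu = s @ np.real(expm(tangent)) @ s  # exp-map back to the manifold
        if np.linalg.norm(tangent) < tol:
            break
    return mu
```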
Fig. 4. Example of person re-identification on i-LIDS-MA. The left-most image is
the probe image. The remaining images are the top 20 matched gallery images. The
red boxes highlight the correct matches.

3.4 Appearance Matching

Let A and B be two object signatures. A signature consists of the mean covariance
matrices extracted using the feature set Φ. The similarity between two signatures A and B
is defined as

    S(A, B) = (1/|Φ|) Σ_{i∈Φ} 1 / max( ρ(μ_{A,i}, μ_{B,i}), ε ) ,      (4)

where ρ is the geodesic distance [18], μ_{A,i} and μ_{B,i} are the mean covariance matrices
extracted using covariance feature i, and ε = 0.1 is introduced to prevent the
denominator from approaching zero. By averaging the similarities computed over the
feature set Φ, the appearance matching becomes robust to noise.
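A direct transcription of Eq. 4, with our hypothetical function names and reusing the `geodesic` sketch above, could look like:

```python
def signature_similarity(sig_a, sig_b, eps=0.1):
    """Similarity of Eq. 4 between two signatures, each a list of mean
    covariance matrices ordered consistently with the selected feature set."""
    total = sum(1.0 / max(geodesic(a, b), eps) for a, b in zip(sig_a, sig_b))
    return total / len(sig_a)
```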

4 Experimental Results

We carry out experiments on 3 i-LIDS datasets1 : i-LIDS-MA [2], i-LIDS-AA


[2] and i-LIDS-119 [24]. These datasets have been extracted from the 2008 i-
LIDS Multiple-Camera Tracking Scenario (MCTS) dataset with multiple non-
overlapping camera views. These datasets tackle the person re-identication
problem, where appearances of the same person acquired by dierent cameras,
has to be matched. The results are analyzed in terms of recognition rate, using
the cumulative matching characteristic (CMC) [25] curve. The CMC curve rep-
resents the expectation of nding the correct match in the top n matches (see
Fig. 4).
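For reference, a CMC curve can be computed as follows (a generic sketch assuming a square probe-by-gallery similarity matrix with matching identities on the diagonal; not code from the paper):

```python
import numpy as np

def cmc_curve(similarity):
    """Cumulative matching characteristic from a probe x gallery similarity
    matrix, assuming probe i corresponds to gallery identity i."""
    n_probe, n_gallery = similarity.shape
    ranks = np.empty(n_probe, dtype=int)
    for i in range(n_probe):
        order = np.argsort(-similarity[i])          # gallery sorted by decreasing similarity
        ranks[i] = int(np.where(order == i)[0][0])  # 0-based rank of the true match
    return np.array([np.mean(ranks < k) for k in range(1, n_gallery + 1)])
```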

4.1 Experimental Setup

Feature Space: We scale every human image to a fixed-size window of 64 × 192
pixels. The set of rectangular sub-regions P is produced by shifting 32 × 8 and
16 × 16 pixel regions with a step of 8 pixels (up and down). This gives |P| = 281
overlapping rectangular sub-regions.
¹ The Image Library for Intelligent Detection Systems (i-LIDS) is the UK government's
benchmark for Video Analytics (VA) systems.
We set L = {(l, ∇l, θl)_{l=I,R,G,B}, G_{i=1...4}, N, L}, where
I, R, G, B refer to the intensity, red, green and blue channels, respectively; ∇ is the
gradient magnitude; θ corresponds to the gradient orientation; G_i are Gabor
filters with parameters (γ, θ, λ, σ²) set to (0.4, 0, 8, 2), (0.4, π/2, 8, 2), (0.8, π/4, 8, 2)
and (0.8, 3π/2, 8, 2), respectively; N is a Gaussian and L is a Laplacian filter. A
learning process involving all possible combinations of three layers would not be
computationally tractable (229,296 covariances to consider in Section 3.2). Thus,
instead, we experimented with different subsets of combinations and selected a
reasonably efficient one. Among all possible combinations of three layers, we
choose 10 combinations (C_{i=1...10}) to ensure inexpensive computation. We set the
C_i to (R, G, B), (∇R, ∇G, ∇B), (θR, θG, θB), (I, ∇I, θI), (I, G3, G4), (I, G2, L),
(I, G2, N), (I, G1, N), (I, G1, L), (I, G1, G2), respectively, separating color and
texture features. A similar idea was already proposed in [11]. Note that we add
layer D to every combination C_i, thus generating our final 4 × 4 covariance
descriptors. The dimension of our meta covariance feature space is n = |C| =
10 · |P| = 2810.
Learning and Testing: Let us assume that we have (p + q) individuals seen
from two different cameras. For every individual, m images from each camera are
given. We take q individuals to learn our model, while p individuals are used to
set up the gallery set. We generate positive training examples by comparing the m
images of the same individual from one camera with the m images from the second
camera. Thus, we produce |Δ+| = q·m² positive samples. Pairs of images coming
from different individuals stand for negative training examples, producing
|Δ−| = q·(q − 1)·m² of them.
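The construction of Δ+ and Δ− can be sketched as plain index bookkeeping (variable names are ours):

```python
def training_pairs(q, m):
    """Return index tuples (person_a, image_a, person_b, image_b) for the
    positive (same person, different cameras) and negative (different persons,
    different cameras) training pairs: q*m^2 and q*(q-1)*m^2 of them."""
    positives, negatives = [], []
    for i in range(q):
        for k in range(q):
            for a in range(m):
                for b in range(m):
                    pair = (i, a, k, b)
                    (positives if i == k else negatives).append(pair)
    return positives, negatives
```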
Acronym: We name our approach COrrelation-based Selection of covariance
MATrIces (COSMATI).

4.2 Results
i-LIDS-MA [2]: This dataset consists of 40 individuals extracted from two
non-overlapping camera views. For each individual a set of 46 images is given.
The dataset contains in total 40 × 2 × 46 = 3680 images. For each individual we
randomly select m = 10 images. Then, we randomly select q = 10 individuals to
learn a model. The evaluation is performed on the remaining p = 30 individuals.
Every signature is used as a query against the gallery set of signatures from the other
camera. This procedure is repeated 10 times to obtain averaged CMC
curves.
COSMATI vs. C_i: We first evaluate the improvement brought by using different
combinations of features for appearance matching. We compare models based
on a single combination of features with the model which employs several
combinations of features. From Fig. 5(a) it is apparent that using different kinds of
covariance features improves the matching accuracy.
COSMATI w.r.t. the Number of Shots: We carry out experiments to show
the evolution of the performance with the number of frames given per individual.
Fig. 5. Performance comparison on i-LIDS-MA (p = 30): (a) with models based on a single
combination of features; (b) w.r.t. the number of given frames

Fig. 6. Performance comparison with LCP [2] using CMC curves: (a) i-LIDS-MA, p = 30;
(b) i-LIDS-AA, p = 100

The results (Fig. 5(b)) indicate that the larger the number of frames, the better
the performance. This clearly shows that averaging covariance matrices on
the Riemannian manifold using multiple shots leads to much better recognition
accuracy. It is worth noting that N = 50 is usually affordable in practice, as it
corresponds to only 2 seconds of video from a standard 25 fps camera.
COSMATI vs. LCP: We compare our results with the LCP [2] method. This
method employs an a priori 11 × 11 covariance descriptor. Discriminative
characteristics of an appearance are enhanced using a one-against-all boosting scheme.
Both methods are evaluated on the same subsets (p = 30) of the original data.
Our method achieves slightly better results than LCP, as shown in Fig. 6(a). This
shows that, by using specific descriptors for different regions of the object, we are
able to obtain a representation as distinctive as LCP, which uses an exhaustive
one-against-all learning scheme.
i-LIDS-AA [2]: This dataset contains 100 individuals automatically detected
and tracked in two cameras. The cropped images are noisy, which makes the dataset
more challenging (e.g. detected bounding boxes are not accurately centered
around the people, and only part of a person may be detected due to occlusions).
COSMATI vs. LCP: Using the models learned on i-LIDS-MA, we evaluate
our approach on the 100 individuals. Results are illustrated in Fig. 6(b). Our method
again performs better than LCP.
Fig. 7. Performance comparison on i-LIDS-119 using CMC curves, where p is the size of the
gallery set (larger p means a smaller training set): (a) p = 50 and (b) p = 80, ours vs. PRDC [5];
(c) p = 119, ours vs. LCP [2], CPS [16], SDALF [7], HPE [6], Group Context [13]

It is relevant to mention that, once our model is learned offline, it does not need any
additional discriminative analysis. By specifying informative regions and their
characteristic features, we obtain a distinctive representation of the object appearance.
In contrast to [2], our offline learning is scalable and does not require any reference dataset.
i-LIDS-119 [24]: i-LIDS-119 does not fit the multiple-shot case very well, because
the number of images per individual is very low (4 on average). However,
this dataset has been used extensively in the literature for evaluating person
re-identification approaches. Thus, we also use this dataset to compare with state-of-the-art
results. The dataset consists of 119 individuals with 476 images. It
is very challenging since there are many occlusions and often only the
top part of the person is visible. As only a few images are given, we extract our
signatures using at most N = 2 images.
COSMATI vs. PRDC [5]: We compare our approach with the PRDC method,
which also requires offline learning. PRDC focuses on distance learning
that can maximize matching accuracy regardless of the representation choice.
We reproduce the same experimental settings as [5]. All images of p randomly
selected individuals are used to set up the test set. The remaining images form
the training data. Each test set is composed of a gallery set and a probe set.
In contrast to [5], we use multiple images (at most N = 2) to create the
gallery and the probe set. The procedure is repeated 10 times to obtain reliable
statistics. Our results (Fig. 7(a,b)) clearly show that COSMATI outperforms
PRDC. The results indicate that, using strong descriptors, we can significantly
increase matching accuracy.
COSMATI vs. LCP [2], CPS [16], SDALF [7], HPE [6] and Group
Context [13]: We have used the models learned on i-LIDS-MA to evaluate our
approach on the full dataset of 119 individuals. Our CMC curve is obtained
by averaging over 10 trials. Results are presented in Fig. 7(c). Our approach
performs the best among all considered methods. We believe that this is
due to the informative appearance representation obtained by the CFS technique. It
clearly shows that the combination of a strong covariance descriptor with an efficient
selection method produces distinctive models for the appearance matching
problem.
Table 1. Percentage of the different covariance features embedded in the learned models

  C1      C2     C3     C4     C5     C6      C7      C8      C9      C10
 9.61%   5.30%  4.62%  4.37%  5.47%  12.70%  14.29%  17.35%  12.12%  14.16%

Model Analysis: The strength of our approach is the combination of many different
kinds of features into one similarity function. Table 1 shows the percentage of the
different kinds of covariance features embedded in all the models used during the
evaluation. Unlike [11], our method concentrates more on texture filters than on
color features. It appears that the higher the resolution of the images, the more
frequent the usage of texture filters is. Fig. 8 illustrates that models extracted at
higher resolution employ more texture features.

Fig. 8. Extracted models for (a) lower and (b) higher resolutions: red indicates color
features, which turn out to be more prominent in (a)
Using multiple images, a straightforward way to extract an appearance would be to use
a background subtraction algorithm to obtain foreground regions. Unfortunately, in many
real scenarios consecutive frames are not available due to gaps in the tracking results.
For a still image, we could apply the extended graph-cut approach GrabCut [26] to obtain
a human silhouette. This approach can be driven by cues coming from the detection result
(a rectangle around the desired object). Surprisingly, employing GrabCut in our
framework did not increase matching accuracy. The main reason is that
GrabCut often mis-segments a significant part of the foreground region. Further, by
learning a model, our approach already focuses on features corresponding to the
foreground.

Computational Complexity: In our experiments, for q = 10 and m = 10 we
generate |Δ+| = 1000 and |Δ−| = 9000 training samples. Learning on 10,000 samples
takes around 20 minutes on an Intel quad-core 2.4 GHz CPU. The model is composed
of 150 covariance features on average. The calculation of the generalized eigenvalues
of 4 × 4 covariance matrices (distance computation) takes 2 μs without applying
any hardware-dependent optimization routines (e.g. the LAPACK library can
perform faster using block operations optimized for the architecture).

5 Conclusion
We proposed to formulate the appearance matching problem as the task of learning
a model that selects the most descriptive features for a specific class of
objects. Our strategy is based on the idea that different regions of the object
appearance ought to be matched using different strategies. Our experiments demonstrate
that: (1) by using different kinds of covariance features w.r.t. the region
of an object, we obtain a clear improvement in appearance matching performance;
(2) our method outperforms state-of-the-art methods in the context of pedestrian
recognition. In the future, we plan to integrate the notion of motion into our
recognition framework. This would allow distinguishing individuals by their
behavioral characteristics and extracting only the features which surely belong
to the foreground region.

Acknowledgements. This work has been supported by the VANAHEIM and
ViCoMo European projects.

References
1. Tuzel, O., Porikli, F., Meer, P.: Region Covariance: A Fast Descriptor for Detection
and Classification. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006.
LNCS, vol. 3952, pp. 589–600. Springer, Heidelberg (2006)
2. Bak, S., Corvee, E., Bremond, F., Thonnat, M.: Boosted human re-identification
using Riemannian manifolds. Image and Vision Computing (2011)
3. Oncel, F.P., Porikli, F., Tuzel, O., Meer, P.: Covariance tracking using model update
based on Lie algebra. In: CVPR (2006)
4. Dikmen, M., Akbas, E., Huang, T.S., Ahuja, N.: Pedestrian Recognition with a
Learned Metric. In: Kimmel, R., Klette, R., Sugimoto, A. (eds.) ACCV 2010, Part
IV. LNCS, vol. 6495, pp. 501–512. Springer, Heidelberg (2011)
5. Zheng, W.-S., Gong, S., Xiang, T.: Person re-identification by probabilistic relative
distance comparison. In: CVPR (2011)
6. Bazzani, L., Cristani, M., Perina, A., Farenzena, M., Murino, V.: Multiple-shot
person re-identification by HPE signature. In: ICPR, pp. 1413–1416 (2010)
7. Farenzena, M., Bazzani, L., Perina, A., Murino, V., Cristani, M.: Person re-
identification by symmetry-driven accumulation of local features. In: CVPR (2010)
8. Park, U., Jain, A., Kitahara, I., Kogure, K., Hagita, N.: ViSE: Visual search engine
using multiple networked cameras. In: ICPR, pp. 1204–1207 (2006)
9. Wang, X., Doretto, G., Sebastian, T., Rittscher, J., Tu, P.: Shape and appearance
context modeling. In: ICCV, pp. 1–8 (2007)
10. Gheissari, N., Sebastian, T.B., Hartley, R.: Person reidentification using spatiotem-
poral appearance. In: CVPR, pp. 1528–1535 (2006)
11. Gray, D., Tao, H.: Viewpoint Invariant Pedestrian Recognition with an Ensemble
of Localized Features. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008,
Part I. LNCS, vol. 5302, pp. 262–275. Springer, Heidelberg (2008)
12. Lin, Z., Davis, L.S.: Learning Pairwise Dissimilarity Profiles for Appearance Recog-
nition in Visual Surveillance. In: Bebis, G., Boyle, R., Parvin, B., Koracin, D.,
Remagnino, P., Porikli, F., Peters, J., Klosowski, J., Arns, L., Chun, Y.K., Rhyne,
T.-M., Monroe, L. (eds.) ISVC 2008, Part I. LNCS, vol. 5358, pp. 23–34. Springer,
Heidelberg (2008)
13. Schwartz, W.R., Davis, L.S.: Learning discriminative appearance-based models us-
ing partial least squares. In: SIBGRAPI, pp. 322–329 (2009)
14. Prosser, B., Zheng, W.-S., Gong, S., Xiang, T.: Person re-identification by support
vector ranking. In: BMVC, pp. 21.1–21.11 (2010)
15. Hirzer, M., Beleznai, C., Roth, P.M., Bischof, H.: Person Re-identification by Descrip-
tive and Discriminative Classification. In: Heyden, A., Kahl, F. (eds.) SCIA 2011.
LNCS, vol. 6688, pp. 91–102. Springer, Heidelberg (2011)
16. Cheng, D.S., Cristani, M., Stoppa, M., Bazzani, L., Murino, V.: Custom pictorial
structures for re-identification. In: BMVC, pp. 68.1–68.11 (2011)
17. Hordley, S.D., Finlayson, G.D., Schaefer, G., Tian, G.Y.: Illuminant and device
invariant colour using histogram equalisation. Pattern Recognition 38 (2005)
18. Förstner, W., Moonen, B.: A metric for covariance matrices. In: Quo vadis geode-
sia...?, Festschrift for Erik W. Grafarend on the occasion of his 60th birthday, TR
Dept. of Geodesy and Geoinformatics, Stuttgart University (1999)
19. Pennec, X., Fillard, P., Ayache, N.: A Riemannian framework for tensor computing.
Int. J. Comput. Vision 66, 41–66 (2006)
20. Tuzel, O., Porikli, F., Meer, P.: Pedestrian detection via classification on Rieman-
nian manifolds. IEEE Trans. Pattern Anal. Mach. Intell. 30, 1713–1727 (2008)
21. Hall, M.A.: Correlation-based Feature Subset Selection for Machine Learning. PhD
thesis, Department of Computer Science, University of Waikato (1999)
22. Fayyad, U.M., Irani, K.B.: Multi-interval discretization of continuous-valued at-
tributes for classification learning. In: IJCAI, pp. 1022–1027 (1993)
23. Rich, E., Knight, K.: Artificial Intelligence. McGraw-Hill Higher Education (1991)
24. Zheng, W.-S., Gong, S., Xiang, T.: Associating groups of people. In: BMVC (2009)
25. Gray, D., Brennan, S., Tao, H.: Evaluating Appearance Models for Recognition,
Reacquisition, and Tracking. In: PETS (2007)
26. Rother, C., Kolmogorov, V., Blake, A.: "GrabCut": interactive foreground extrac-
tion using iterated graph cuts. In: SIGGRAPH, pp. 309–314 (2004)
On the Convergence of Graph Matching:
Graduated Assignment Revisited

Yu Tian1, Junchi Yan1, Hequan Zhang3, Ya Zhang1, Xiaokang Yang1, and Hongyuan Zha2
1 Shanghai Jiao Tong University, 200240 Shanghai, China
2 Georgia Institute of Technology, 30032 Atlanta, USA
3 Beijing Institute of Technology, 100081 Beijing, China
{tianyu0124,ethanyan1985}@gmail.com, zhanghequan1987@126.com,
{ya zhang,xkyang}@sjtu.edu.cn, zha@cc.gatech.edu

Abstract. We focus on the problem of graph matching, which is fundamental in
computer vision and machine learning. Many state-of-the-art methods formulate
it as integer quadratic programming, which incorporates both unary and
second-order terms. This formulation is in general NP-hard, so obtaining an exact
solution is computationally intractable. Therefore most algorithms seek an
approximate optimum via relaxation techniques. This paper commences with the
finding that the solution chain obtained by iterative Gradient Assignment (via the
Hungarian method) in the discrete domain is circular, and proposes
a method for guiding the solver to converge to a fixed point, resulting in a
convergent algorithm for graph matching in the discrete domain. Furthermore, we extend
the algorithms to their counterparts in the continuous domain, proving that the classical
graduated assignment algorithm converges to a double-circular solution chain,
and that the proposed Soft Constrained Graduated Assignment (SCGA) method
converges to a fixed (discrete) point, both under mild conditions. Competitive
performances are reported in both synthetic and real experiments.

1 Introduction and Problem Formulation

In computer vision and machine learning, many tasks that require finding correspon-
dences between two node sets can be formulated as graph matching, such that the rela-
tions between nodes are preserved as much as possible. Although extensive research
[1] has been done for decades, graph matching is still challenging, mainly for two
reasons: 1) in general, the objective function is non-convex and prone to local optima;
2) the solution needs to satisfy combinatorial constraints. The graph matching problem
is widely formulated as an integer quadratic program (IQP), by considering both
unary and second-order terms reflecting the local similarities together with the pair-
wise relations. Concretely, given two graphs G^L(V^L, E^L, A^L) and G^R(V^R, E^R, A^R),
where V denotes nodes, E edges and A attributes, an affinity matrix M is defined
whose element M_{ia;jb} measures the affinity of the candidate edge pair (v^L_i, v^L_j) vs. (v^R_a, v^R_b).
The work is supported by NSF IIS-1116886, NSF IIS-1049694, NSFC 61129001/F010403
and the 111 Project (B07022).

The diagonal term M_{ia;ia} describes the unary affinity of a node match (v^L_i, v^R_a).
By introducing a permutation matrix x ∈ {0, 1}^{n_L × n_R}, whereby x_{ia} = 1 if node v^L_i
matches node v^R_a (and x_{ia} = 0 otherwise), we obtain the following formulation:

    x* = arg max_x (x^T M x)    s.t.  Ax = 1,  x ∈ {0, 1}^{n_L n_R}      (1)

Here x denotes the vectorized version of the permutation matrix, and the constraints
Ax = 1 encode the one-to-one matching from G^L to G^R, such that one feature from
one image can be matched to at most one feature from the other image. The difficulty
depends on the structure of the matrix M; in general the problem is NP-hard and no
polynomial-time algorithm exists.

2 Algorithm and Theories


We commence with a simple gradient-based method, Gradient Assignment (GA): given
an initial value x_0, relax the objective from the quadratic assignment x_k^T M x_k to a linear
assignment in an iterative fashion: x_{k+1}^T M x_k, whereby the Hungarian method [8] (de-
noted by Hd in the sequel) is performed to obtain the discrete solution in each iteration:
x_{k+1} = Hd(M x_k). We will show it induces a circular solution sequence given that the
solution returned by the Hungarian method is unique, as described in Assumption 1:

Assumption 1 (θ_min). For all i and all j, k with j ≠ k:

    (p_j − p_k)^T M p_i ≠ 0      (2)

where p_i, p_j, p_k are vectorized permutation matrices. θ_min equals the smallest absolute
value among the nonzero results of Eq. 2: θ_min = min_{p_j, p_k, p_i, j≠k} |(p_j − p_k)^T M p_i|.
Theorem 1. If Assumption 1 holds, then after a finite number of iterations k the Gradient
Assignment (GA) method converges to a sequence {p_i, p_{i+1}, p_i, ...} or {p_i, p_i, p_i, ...}.
Proof: The Hungarian method induces an ascending solution path: p_{i+1}^T M p_i =
p_i^T M p_{i+1} ≤ p_{i+2}^T M p_{i+1} ≤ ... ≤ p_{i+k+1}^T M p_{i+k} = p_{i+1}^T M p_i. 1) There must exist a k
satisfying p_{i+k} = p_i, since the finite permutation space is covered by a large
enough k; 2) Assumption 1 ensures that equal scores denote identical solutions, so the
closed chain of equalities forces a cycle of period at most two.
This circular property is unwanted, since the relaxation deviates largely from the original
objective x^T M x when x_{k−1} and x_k are not close. Thus we are motivated to add a con-
straint term ‖x_k − x_{k−1}‖ < ε, in the spirit of encouraging convergence to a fixed point.
Using the L2 norm, one obtains the following formulation:

    max_{x_{k+1}}  x_{k+1}^T (M + λI) x_k − λ x_{k+1}^T x_{k+1} − λ x_k^T x_k      (3)

We drop the quadratic terms for three reasons: a) the new quadratic terms increase the
difficulty of the optimization; b) x^T x = n holds for the one-to-one matching considered in
this paper; c) the resulting quadratic-free formulation does not change the optimal solu-
tion of the original objective, since the deviated score associated with the diagonal
λI plays an equal role for all matching candidates:

    max_{x_{k+1}}  x_{k+1}^T (M + λI) x_k      (4)
Using Eq. 4, the Constrained Gradient Assignment (CGA) algorithm is proposed in
Alg. 1. It always measures the current solution via the original affinity matrix M_0 (the
updated M_k is only used as the input to the Hungarian method to obtain a new solution) and
selects the best solution along the whole path. Theorem 2 explores its convergence property.

Algorithm 1. Constrained Gradient Assignment (CGA)
Initial: step size λ; k = 0; x_0 = [1/n², ..., 1/n²]^T; x* = x*_{−1} = x_0; M = M_0; score = 0
repeat
  x_{k+1} = Hd(M x_k);
  if (x_{k+1}^T M_0 x_{k+1} > score) then
    score = x_{k+1}^T M_0 x_{k+1}; x* = x_{k+1}; x*_{−1} = x_k;
  end if
  if (x_{k+1} == x_k) then
    return x*;
  else if (x_{k+1} == x_{k−1}) then
    x_{k+1} = x*; x_k = x*_{−1}; M = M + λI;
  end if
  k++;
until k exceeds the max iteration: return x*
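A compact Python rendering of Alg. 1 is sketched below, using `scipy.optimize.linear_sum_assignment` as the Hungarian solver Hd; the row-major vectorization of the n × n assignment and the function names are our own conventions, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian(profit_vec, n):
    """Hd: best permutation (as a vectorized 0/1 matrix) for a profit vector."""
    profit = profit_vec.reshape(n, n)
    rows, cols = linear_sum_assignment(-profit)    # maximize total profit
    x = np.zeros((n, n))
    x[rows, cols] = 1.0
    return x.ravel()

def cga(M0, n, lam, max_iter=500):
    """Constrained Gradient Assignment (Alg. 1), sketched.

    M0  : N x N affinity matrix (N = n*n), assumed symmetric and non-negative.
    lam : step size added to the diagonal whenever a 2-cycle is detected."""
    M = M0.copy()
    x_prev = np.full(n * n, 1.0 / (n * n))
    x_curr = x_prev.copy()
    best, best_prev, best_score = x_curr.copy(), x_prev.copy(), -np.inf
    for _ in range(max_iter):
        x_next = hungarian(M @ x_curr, n)
        score = x_next @ M0 @ x_next               # always scored on the original M0
        if score > best_score:
            best_score, best, best_prev = score, x_next, x_curr
        if np.array_equal(x_next, x_curr):         # fixed point reached
            return best
        if np.array_equal(x_next, x_prev):         # 2-cycle: restart from the best pair
            x_next, x_curr = best.copy(), best_prev.copy()
            M = M + lam * np.eye(n * n)
        x_prev, x_curr = x_curr, x_next
    return best
```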

Theorem 2. CGA converges to a fixed point provided Assumption 1 (θ_min) holds.

Proof: As stated in Theorem 1, when Assumption 1 (θ_min) holds, for a fixed M the
GA algorithm must converge in the form p_{k+1} = p_{k−1}. Meanwhile, M is updated as
M = M + λI. After a finite number of rounds, M becomes a symmetric
positive definite matrix (M is symmetric). If CGA converges before M becomes positive definite,
the proof is complete; otherwise M can be decomposed as M = M_dec M_dec^T. Without
loss of generality, assume the iteration sequence converges to {p_i, p_{i+1}, p_i, ...}. Thus we
have:

    p_{i+1}^T M p_i = Hd(M p_i)^T M p_i ≥ p_i^T M p_i      (5)
    p_i^T M p_{i+1} = Hd(M p_{i+1})^T M p_{i+1} ≥ p_{i+1}^T M p_{i+1}

Summing up the two inequalities above yields

    2 ⟨M_dec^T p_{i+1}, M_dec^T p_i⟩ ≥ ⟨M_dec^T p_{i+1}, M_dec^T p_{i+1}⟩ + ⟨M_dec^T p_i, M_dec^T p_i⟩      (6)

which, by the Cauchy-Schwarz inequality, can only hold with equality. Equality holds
iff p_{i+1}^T M p_{i+1} = p_{i+1}^T M p_i = p_i^T M p_i; using Assumption 1, p_i = p_{i+1}.
CGA is sensitive to the initial point and sometimes gets trapped at an unsatisfactory point
early. To address this issue, on one hand, observing that the affinity matrix may be
distorted after CGA finishes, we can extend the solution chain by framing an
outer loop whereby the output of the current CGA iteration is the input to the next itera-
tion, in which CGA is performed again. The outer loop stops if the solution with
respect to the original objective cannot be improved by the new iteration; we call this
outer-loop CGA algorithm LCGA. On the other hand, the Hungarian method, con-
fined to the discrete domain, misses many optima which could otherwise be
explored by a soft counterpart in the continuous domain. Thus we leverage the softmax
tool [6] (instead of the Hungarian method), as applied in the classical
Graduated Assignment algorithm (GAGM) [2], and extend CGA to its soft version:
Soft Constrained Graduated Assignment (SCGA), described in Alg. 2.

Algorithm 2. Soft Constrained Graduated Assignment (SCGA)
Initial: β = β_0; step size λ; convergence threshold ε; k = 0; x_0 = [1/n², ..., 1/n²]^T;
x* = x*_{−1} = Hd(M x_0); M = M_0; score = 0
repeat
  x_{k+1} = SoftMax_β(M x_k);
  increase β;
  if (Hd(M x_k)^T M_0 Hd(M x_k) ≥ score) then
    score = Hd(M x_k)^T M_0 Hd(M x_k); x* = x_k; x*_{−1} = x_{k−1};
  end if
  if (‖x_{k+1} − x_k‖ < ε) then
    return Hd(M x*)
  end if
  if (‖x_{k+1} − x_{k−1}‖ < ε) then
    x_{k+1} = x*; x_k = x*_{−1}; M = M + λI;
  end if
  k++;
until k exceeds the max iteration: return x*
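The SoftMax_β step can be realized by the softassign of [2, 6, 13]: exponentiate the relaxed gradient and then alternate row/column normalizations (Sinkhorn iterations) until the matrix is approximately doubly stochastic. A minimal sketch, with our own iteration count and stability shift:

```python
import numpy as np

def softassign(grad_vec, n, beta, sinkhorn_iters=30):
    """SoftMax_beta applied to the vectorized gradient M x_k.

    Exponentiates beta * (M x_k) element-wise and projects the resulting n x n
    matrix toward the doubly stochastic polytope by alternating normalizations."""
    a = np.exp(beta * (grad_vec.reshape(n, n) - grad_vec.max()))  # shift for numerical stability
    for _ in range(sinkhorn_iters):
        a /= a.sum(axis=1, keepdims=True)   # normalize rows
        a /= a.sum(axis=0, keepdims=True)   # normalize columns
    return a.ravel()
```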

The algorithmic procedure of Alg. 2 is empirically established based on GAGM and CGA, yet we are
more interested in the underlying question: do the continuous methods (SCGA/GAGM)
have convergence properties similar to their discrete counterparts (CGA/GA)? An affirmative
answer is given in Theorem 3, Corollary 1 and Corollary 2, provided that Assumption
1 holds in addition to Assumption 2, which is described below:

Assumption 2 (s_j, T_j, Δ). Let T = 1/β, where β is the parameter of the SoftMax. During
the iterations, there exist s_j, T_j, Δ satisfying¹:

    ∀ i ≠ j + 1:  (p_{j+1} − p_i)^T M s_j ≥ Δ      (7)

    (T_j / Δ) · n N² log N < θ_min / 2      (8)

    0 < Δ ≤ θ_min − 2 (T_j / Δ) · n N² log N      (9)

¹ θ_min is defined in Assumption 1, N = n² where n is the number of nodes in one graph,
Δ is defined in Lemma 4, and s_j, p_j are defined in Eq. 10 and Eq. 11 in the
appendix, respectively.
Theorem 3. If Assumption 1 (θ_min) and Assumption 2 (s_j, T_j, Δ) hold, then, using the con-
tinuous solution chain {s_j} as defined in Eq. 10, the induced discrete solution chain
{p_j} defined by Eq. 11 asymptotically (for j large enough) satisfies
s_{j+1} = p_{j+1}, and {p_j} satisfies the definition of the Hungarian method: p_{j+k+1} =
Hd(M p_{j+k}).
Theorem 3 directly leads to two important corollaries:
Corollary 1. During the iteration, if there exists s_j such that Assumption 1 (θ_min) and As-
sumption 2 (s_j, T_j, Δ) hold, then GAGM converges to a double-circular solution
chain.
Corollary 2. During the iteration, if there exists s_j such that Assumption 1 (θ_min) and As-
sumption 2 (s_j, T_j, Δ) hold, then SCGA converges to a fixed discrete point.
Proof details can be found in the appendix.

3 Related Work
Early work by Christmas, Kittler and Petrou [9] and Wilson and Hancock [10] showed
how relaxation labeling can be applied to the graph matching problem by modeling
the probability distribution of matching errors. Drawing on ideas from this con-
nectionist literature, Gold and Rangarajan [2] developed a relaxation scheme based
on softassign, which gradually updates the derivative of the relaxed IQP, i.e. Gradu-
ated Assignment (GAGM) [2]. Leordeanu and Hebert [3] present a spectral matching
(SM) method which computes the leading eigenvector of the symmetric non-negative affin-
ity matrix. The constraints are entirely dropped and the integer constraints are only considered
in the final discretization step. Later, spectral graph matching with affine constraints
(SMAC) was developed [11]. In addition, [11] suggests preprocessing the input
matrix by normalizing it into a doubly stochastic matrix. More recently, Cho et al. [4]
apply random walks and introduce a reweighted jump scheme to incorporate the mapping
constraints. None of these approaches are concerned with the original
integer constraints during optimization; they assume that the final continuous solution
leads to an optimal feasible solution after discretization. Contrary to this idea, an
iterative matching method (IPFP) is proposed by Leordeanu et al. [5]. In their method,
the optimized solution x_k in each iteration is continuous, yet they keep the discrete
solution b_k induced by x_k if it satisfies the optimality condition. When the iterations end, the
solution is selected from the stored discrete set instead of from the binarization of the
continuous solution of the final iteration. This paper addresses the general learning-free
graph matching problem. Regarding the convergence study emphasized in this pa-
per, Rangarajan et al. [7] show that GAGM converges to a discrete point provided
the affinity matrix is positive definite, while giving no clear answer for the general case.
This paper clarifies that for the general graph matching problem GAGM converges
to a circular solution chain, and that progressively adding to the diagonal of the
affinity matrix results in an algorithm that converges to a fixed discrete
point after a finite number of iterations, under mild conditions. Motivated by this theo-
retical finding, we further carefully design the mechanism for when and by how much the
addition is performed during the iterations. In general, the proposed methods prolong the
search path in both the discrete and the continuous domain, avoiding early traps at unwanted
points. Moreover, the proposed mechanism keeps the relaxed solver from deviating from
the original objective function and thus improves the solution quality.

4 Experiments and Discussion


Choice of Step Size and Max Iterations: We keep the max iterations of CGA/SCGA at
500, and that of LCGA at 10. As the maximum value of the input affinity matrix M is 1,
from the Gershgorin circle theorem we know that when the diagonal elements of M are increased to N
the matrix becomes positive definite; thus the step size of λ is N/500.
Experiments on both synthetic and real images are designed to evaluate the proposed
algorithms (SCGA, LCGA) on various graph matching tasks against state-of-the-art
methods: SM [3], GAGM [2], IPFP [5], RRWM [4]. For RRWM and IPFP, the publicly
available code by [4] was used, while SM and GAGM were implemented by ourselves. The
experimental methodology and design details closely follow the protocol in [4]
for a fair comparison. All methods were implemented in MATLAB and tested on a
2.67 GHz PC. For each trial, the same affinity matrix was shared as input and the
Hungarian algorithm was commonly used as the final discretization step for all methods.
Each quantitative result in the synthetic experiments was averaged over
20 random trials. Due to the common characteristic of LCGA and IPFP that they can refine
another approach's output taken as their input, we tested them independently, as well as in
conjunction with others as a post-processing step. When tested standalone, LCGA
and IPFP are always initialized with a flat, uniform solution. Control parameters of
GAGM and RRWM were based on the authors' papers and tuned for better performance. For
SCGA, we set β the same as [4], and λ, k as discussed before, in all experiments.
Synthetic Random Graph: We follow exactly the experimental protocol of [2,
4, 11]. For each matching trial, two graphs are constructed: a reference graph G^L with
n_L = n_in nodes and a perturbed graph G^R with n_R = n_in + n_out nodes. The perturbed
graph G^R is created by permuting the order of the n_in nodes and adding Gaussian noise. The
affinity matrix M is constructed by M_{ia,jb} = exp(−(d^L_{ij} − d^R_{ab})² / σ_s²), where σ_s is set to
0.15, the same value as in [4]. The performance of each method is measured in both
accuracy and score, as defined in [4]. Based on the aforementioned settings, we con-
duct two sub-experiments to verify the robustness against deformation and outliers. In
the deformation test, we change the deformation noise parameter σ from 0 to 0.4 with
an interval of 0.05; to compare with RRWM on a larger baseline, we set the inlier number n_in
to 50 and 60. In the outlier noise experiment, the number of outliers n_out varies from 0 to
20 in increments of 2, while n_in is fixed to 50. The experimental results are plotted in Fig.
1. As observed from the deformation test, SCGA outperforms the others in accuracy while LCGA
and RRWM are comparable, and LCGA achieves a higher score under increased noise.
These phenomena may imply that a continuous technique like SoftMax is a key factor
against large noise. In contrast, a looping scheme like LCGA pays more attention to the
objective score. In the case of small noise, the proposed two methods are both very fast,
yet they become relatively slow in the presence of larger noise. Considering efficiency, all
algorithms are stopped when the iteration count exceeds a fixed threshold. In the outlier test, both
SCGA and LCGA outperform the others in accuracy and score, and in the score test LCGA
is better again. This suggests that LCGA is suitable when the noise is small, as in this case
a better score always means a better match. We also tested LCGA against its corre-
sponding simple versions CGA and Gradient Assignment (GA). The results in Fig. 1
verify the efficacy of gradually adding constraints and of the outer loop. Note that CGA shows
good performance at a relatively low time cost.
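For completeness, a sketch of the synthetic affinity construction described above (random 2-D points, pairwise distances, and M_{ia,jb} = exp(−(d^L_{ij} − d^R_{ab})²/σ_s²)); the choice of 2-D points and of a zero diagonal are assumptions on our side, not details taken from the paper.

```python
import numpy as np

def synthetic_affinity(n_in, n_out, noise, sigma_s=0.15, seed=0):
    """Build the affinity matrix between a reference graph and its perturbed,
    outlier-augmented copy, following the deformation/outlier protocol above."""
    rng = np.random.default_rng(seed)
    pts_l = rng.random((n_in, 2))
    perm = rng.permutation(n_in)
    pts_r = pts_l[perm] + noise * rng.standard_normal((n_in, 2))  # deformed inliers
    pts_r = np.vstack([pts_r, rng.random((n_out, 2))])            # appended outliers
    d_l = np.linalg.norm(pts_l[:, None] - pts_l[None, :], axis=-1)
    d_r = np.linalg.norm(pts_r[:, None] - pts_r[None, :], axis=-1)
    n_l, n_r = n_in, n_in + n_out
    # M[(i,a),(j,b)] = exp(-(dL_ij - dR_ab)^2 / sigma_s^2), zero on the diagonal.
    diff = d_l[:, None, :, None] - d_r[None, :, None, :]          # shape (nl, nr, nl, nr)
    M = np.exp(-(diff ** 2) / sigma_s ** 2).reshape(n_l * n_r, n_l * n_r)
    np.fill_diagonal(M, 0.0)
    return M, perm
```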

Fig. 1. Evaluation on synthetic random graphs. Two top rows: deformation and outlier evaluation
among GAGM, LCGA, IPFP, SCGA, RRWM, SM; two bottom rows: deformation and outlier
evaluation among GA, CGA, LCGA

Performance as a Post-Processing Step: As shown in Fig. 2, in this test we use IPFP and
LCGA as post-processors using the results of RRWM, SM, SCGA and GAGM
as their initial input (denoted RRWM-IPFP and so on). Solid lines denote LCGA as the post-
processor, while dashed lines denote IPFP. We find that LCGA outperforms IPFP for all methods
except SM, which is worth future study. Note that SCGA-IPFP and SCGA-LCGA
are comparable and achieve the best results, similarly to the outlier test.
Fig. 2. Performance evaluation of using LCGA and IPFP as post-processors, optimizing the
results of other algorithms taken as their initial input

Real Image: In the second, image-based experiment, following [4, 12], 30 image pairs
from the Caltech-101 and MSRC datasets are selected, from which feature points are ex-
tracted by the MSER detector and described with SIFT. We followed the setup of [4, 12]
exactly. The complexity of this experiment stems from the large intra-category variations
in shape and the large number of outliers, posing challenges regarding both noise and outliers.
Some comparative examples are shown in Fig. 3, and the average performance compar-
ison on the dataset is summarized in Table 1; the results show that LCGA outperforms the others,
especially in score, while SCGA is comparable with the state-of-the-art algorithms. We
also compare our algorithm with IPFP as a post-processing step in this test; the result is sim-
ilar to the synthetic experiment: both methods can elevate the performance, especially in
score. LCGA is slightly better than IPFP in most cases except for SM. Both SCGA-IPFP
and SCGA-LCGA achieve competitive results in accuracy and score.

Table 1. Evaluation of various matching algorithms. (+IPFP) and (+LCGA) denote performing
IPFP or LCGA as a post step using the other method's output as input.

Method             GAGM    LCGA    IPFP    SCGA    RRWM    SM
Accuracy (%)       72.45   75.84   73.6    71.2    73.61   62.58
Score (%)          91.06   98.69   94.63   92.21   93.01   79.63
Time (s)           0.23    0.38    0.12    0.36    0.31    0.03
Accuracy (+IPFP)   75.82   -       -       74.78   76.28   71.32
Accuracy (+LCGA)   75.88   -       -       75.12   74.94   62.58
Score (+IPFP)      99.51   -       -       99.15   97.23   92.45
Score (+LCGA)      99.87   -       -       99.78   97.84   79.62
Fig. 3. Evaluation on real images; in brackets: accuracy and score for each method.
Top example: GAGM (31/45, 92.85), IPFP (31/45, 101.55), LRGA (30/45, 102.58),
RRWM (24/45, 97.50), SRGA (32/45, 92.12), SM (23/45, 86.22).
Bottom example: GAGM (8/8, 6.07), IPFP (6/8, 5.38), LRGA (8/8, 6.12),
RRWM (0/8, 5.41), SRGA (8/8, 6.07), SM (0/8, 4.56).

Conclusion: We have proposed a novel framework and the induced algorithms for graph
matching, whose convergence is also proved. The experiments show that it outper-
forms the state-of-the-art methods in the presence of outliers and deformation,
especially when the number of graph nodes is large. The comparison reveals that the
matching accuracy and convergence rate in challenging situations largely depend
on the effective exploitation of the matching constraints in the search space.

References
1. Conte, D., Foggia, P., Sansone, C., Vento, M.: Thirty years of graph matching in pattern
recognition. IJPRAI (2004)
2. Gold, S., Rangarajan, A.: A graduated assignment algorithm for graph matching. PAMI
(1996)
3. Leordeanu, M., Hebert, M.: A Spectral Technique for Correspondence Problems using Pair-
wise Constraints. In: ICCV (2005)
4. Cho, M., Lee, J., Lee, K.M.: Reweighted Random Walks for Graph Matching. In: Daniilidis,
K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part V. LNCS, vol. 6315, pp. 492–505.
Springer, Heidelberg (2010)
5. Leordeanu, M., Hebert, M.: An integer projected fixed point method for graph matching and
map inference. In: NIPS (2009)
6. Gold, S., Rangarajan, A.: Softmax to softassign: neural network algorithms for combinatorial
optimization. J. Artif. Neural Netw., 381–399 (1995)
7. Rangarajan, A., Yuille, A.: Convergence properties of the softassign quadratic assignment
algorithm. Neural Computation (1999)
8. Kuhn, H.W.: The Hungarian method for the assignment problem. Naval Research Logistics
Quarterly 2(1-2), 83–97 (1955)
9. Christmas, W., Kittler, J., Petrou, M.: Structural matching in computer vision using proba-
bilistic relaxation. PAMI 19(6), 634–648 (1997)
10. Wilson, R., Hancock, E.: Structural matching by discrete relaxation. IEEE Trans. Pattern
Anal. Mach. Intell. 17(8), 749–764 (1995)
11. Cour, T., Srinivasan, P., Shi, J.: Balanced Graph Matching. In: NIPS (2006)
12. Lee, J., Cho, M., Lee, K.M.: Hyper-graph Matching via Reweighted Random Walks. In:
CVPR (2011)
13. Kosowsky, J.J., Yuille, A.L.: The Invisible Hand Algorithm: Solving the Assignment Prob-
lem With Statistical Physics. Neural Networks 7(3) (1994)

Appendix: Proof Details of Theorem 3


First we introduce some preliminaries to facilitate the later proofs. Let the initial
value for SCGA be s_0, and define the solution chains {s_j}, {p_j} as:

    s_{j+1} = SoftMax_β(M s_j)      (10)
    p_{j+1} = Hd(M s_j)             (11)

The proofs for GA and CGA are both based on Assumption 1 (θ_min). Although Assumption
1 (θ_min) seems strong, the following analysis shows that it is built on an even
weaker precondition:

    ∀ e ∈ M, ∃ s, k ∈ Z  s.t.  e = s · 10^{−k}      (12)

Eq. 12 states that any element e of the matrix M can be expressed with a precision bounded by
k. From a theoretical perspective, this excludes irrational numbers and infinitely repeating decimals;
yet from a practical perspective, on current computer architectures all elements are trun-
cated to a bounded precision anyway. Thus, in practice, Eq. 12 always holds.
We then show a technique by which Assumption 1 (θ_min) can always be made to hold, and
in turn the convergence properties of GA and CGA are ensured.
Lemma 1. If Assumption 1 (θ_min) does not hold, one can add specks to M that
satisfy the precondition of Eq. 12 and induce the establishment of Assumption 1 (θ_min),
while keeping p^T M p almost unchanged (the slight distortion to M does not change the op-
timality property of the original score). Furthermore, such a distortion is constructible.

Proof: Assume the precision of M's elements is 10^{−k} (Eq. 12), and that the number of
elements in M is N². One can add specks to M of the form 10^{−k−1−K}·2^{−1}, 10^{−k−1−K}·
2^{−2}, ..., 10^{−k−1−K}·2^{−N²} to obtain a new matrix M_new. For any i and any j ≠ k,
each element of p_j − p_k is from the discrete set {−1, 0, 1} and each element of p_i is in {0, 1}.
As a result, (p_j − p_k)^T M p_i is a linear combination of the elements of the original M
plus the additional specks. When (p_j − p_k)^T M p_i = 0 holds for the original M, the
corresponding value for M_new is nonzero thanks to the additional specks, whose
contribution (p_j − p_k)^T (M_new − M) p_i (a linear combination) cannot be zero, since
each basis element is 2^{−L}, L ≤ N², while the coefficients are in {−1, 0, 1}. Furthermore, when k is
large enough, the magnitude of (p_j − p_k)^T (M_new − M) p_i is smaller than the precision of the original
expression, so an original result that is nonzero remains nonzero after this modification.

Using Lemma 1, Theorem 1 and Theorem 2 ensure the convergence of GA and CGA, respectively. In what follows, we present the detailed proof flow for the convergence of SCGA and GAGM as stated in the main text.

Lemma 2. $(s_j, T)$ Given a fixed $\beta$, or $T = 1/\beta$, $\mathrm{SoftMax}(M s_j)$ converges to the solution $s_{j+1}$, whose matrix form is doubly stochastic.

Lemma 3. $(T)$ Given a fixed $T$, let $p_{j+1} = H_d(M s_j)$; we have $p_{j+1}^T M s_j - TN\log N \le s_{j+1}^T M s_j \le p_{j+1}^T M s_j$, where $N = n^2$ and $n$ is the number of graph nodes.

Lemma 4. $(s_j, T, \Delta)$ Given $M s_j$, if there exists a unique best solution $p_{j+1} = H_d(M s_j)$ returned by the Hungarian method, let $p_i$ be the second best (discrete) one and let $\Delta = |(p_{j+1} - p_i)^T M s_j|$; then we have:
$\max_i |p_{j+1}^i - s_{j+1}^i| \le \frac{TN\log N}{\Delta}$ (13)
where $p_{j+1}^i$ denotes the $i$-th element of the vectorized permutation matrix $p_{j+1}$.

Lemma 2, Lemma 3 and Lemma 4 have been proven by Kosowsky and Yuille in [13], and one can verify that Assumption 2 $(s_j, T_j, \Delta)$ implies the establishment of Lemma 4 $(s_j, T_j, \Delta)$.

Lemma 5. $(s_j, T_j, \Delta)$ If Assumption 1 ($\theta_{\min}$) and Assumption 2 $(s_j, T_j, \Delta)$ hold, then
$\forall i,\quad |p_i^T M p_{j+1} - p_i^T M s_{j+1}| \le \frac{T_j}{\Delta}\, nN^2\log N$ (14)

Proof: Obviously Lemma 4 $(s_j, T_j, \Delta)$ holds. Then
$|p_i^T M p_{j+1} - p_i^T M s_{j+1}| = |p_i^T M (p_{j+1} - s_{j+1})| \le p_i^T E \cdot \frac{T_j}{\Delta} C = \frac{T_j}{\Delta}\, nN^2\log N$

where $E$ is the $N \times N$ matrix whose elements are all one, and $C = [N\log N, \ldots, N\log N]^T$. The above bound is based on the observation that $p_i$ is a vectorized permutation matrix (so it contains $n$ ones) and that the elements of $M$ are bounded within $[0, 1]$.

Lemma 6. $(s_j, T_j, \Delta)$ If Assumption 1 ($\theta_{\min}$) and Assumption 2 $(s_j, T_j, \Delta)$ hold, then $p_{j+2} = H_d(M p_{j+1})$.

Proof: By contradiction. Obviously Lemma 5 holds. If Lemma 6 does not hold then, due to Assumption 1 ($\theta_{\min}$), there exists $i$ such that $p_{j+2}^T M p_{j+1} < p_i^T M p_{j+1}$ and
$p_{j+2}^T M p_{j+1} \le p_i^T M p_{j+1} - \theta_{\min}$ (15)

According to Lemma 5, we have
$p_i^T M p_{j+1} - \theta_{\min} \le p_i^T M s_{j+1} - \theta_{\min} + \frac{T_j}{\Delta}\, nN^2\log N$

$p_{j+2}^T M s_{j+1} - \frac{T_j}{\Delta}\, nN^2\log N \le p_{j+2}^T M p_{j+1}$ (16)


By combining the above three equations we have:
$p_{j+2}^T M s_{j+1} \le p_i^T M s_{j+1} - \theta_{\min} + 2\frac{T_j}{\Delta}\, nN^2\log N$ (17)

By the definition $p_{j+2} \triangleq H_d(M s_{j+1})$ we have $p_{j+2}^T M s_{j+1} - p_i^T M s_{j+1} \ge 0$, while by Assumption 2 $(s_j, T_j, \Delta)$ we have $\frac{T_j}{\Delta}\, nN^2\log N < \frac{\theta_{\min}}{2}$. Thus a contradiction exists.

Lemma 7. If Assumption 1 ($\theta_{\min}$) and Assumption 2 $(s_j, T_j, \Delta)$ hold, we have:
$\forall i \ne j+2,\quad p_{j+2}^T M s_{j+1} \ge p_i^T M s_{j+1} + \Delta$ (18)
i.e. Assumption 2 $(s_{j+1}, T_j, \Delta)$ also holds.

Proof: Obviously Lemma 5 $(s_j, T_j, \Delta)$ and Lemma 6 $(s_j, T_j, \Delta)$ hold. By Lemma 5:
$p_{j+2}^T M s_{j+1} \ge p_{j+2}^T M p_{j+1} - \frac{T_j}{\Delta}\, nN^2\log N$ (19)

By Assumption 1 ($\theta_{\min}$) and Lemma 6 $(s_j, T_j, \Delta)$ we have:
$p_{j+2}^T M p_{j+1} - \frac{T_j}{\Delta}\, nN^2\log N \ge p_i^T M p_{j+1} + \theta_{\min} - \frac{T_j}{\Delta}\, nN^2\log N$ (20)

By Lemma 5 $(s_j, T_j, \Delta)$ we have:
$p_i^T M p_{j+1} + \theta_{\min} - \frac{T_j}{\Delta}\, nN^2\log N \ge p_i^T M s_{j+1} + \theta_{\min} - 2\frac{T_j}{\Delta}\, nN^2\log N$

Combining the three equations we reach:
$p_{j+2}^T M s_{j+1} \ge p_i^T M s_{j+1} + \theta_{\min} - 2\frac{T_j}{\Delta}\, nN^2\log N$ (21)

By Assumption 2 $(s_j, T_j, \Delta)$, $\theta_{\min} - 2\frac{T_j}{\Delta}\, nN^2\log N \ge \Delta$, so the lemma holds.

Lemma 8. If Assumption 2 $(s_{j+1}, T_j, \Delta)$ holds, then Assumption 2 $(s_{j+1}, T_{j+1}, \Delta)$ holds.

Proof: Since $T_{j+1} < T_j$, the relations below hold, and thus Assumption 2 $(s_{j+1}, T_{j+1}, \Delta)$ follows.
$\frac{T_{j+1}}{\Delta}\, nN^2\log N < \frac{T_j}{\Delta}\, nN^2\log N < \frac{\theta_{\min}}{2}$ (22)
$0 < \Delta \le \theta_{\min} - 2\frac{T_j}{\Delta}\, nN^2\log N < \theta_{\min} - 2\frac{T_{j+1}}{\Delta}\, nN^2\log N$ (23)

We commence with the satisfaction of Assumption 2 $(s_j, T_j, \Delta)$ at a specific iteration $j$, and then prove that Assumption 2 $(s_{j+1}, T_{j+1}, \Delta)$ will also hold in the next iteration. Thus for each $k > 0$ in the later iterations, Assumption 2 $(s_{j+k}, T_{j+k}, \Delta)$ will always hold,

and this leads to our main result. Here we prove our main theoretical conclusion, stated as Theorem 3 in the main text.

Proof: For $k = 0, 1, \ldots$, Lemma 4 $(s_{j+k+1}, T_{j+k+1}, \Delta)$ and Lemma 6 $(s_{j+k+1}, T_{j+k+1}, \Delta)$ obviously hold. By Lemma 6 $(s_{j+k}, T_{j+k}, \Delta)$, we know $p_{j+k+2} = H_d(M p_{j+k+1})$. According to Assumption 1 ($\theta_{\min}$) and Lemma 4 $(s_{j+k+1}, T_{j+k+1}, \Delta)$, when $k$ is large enough the sequence $s_{j+k}$ converges to the discrete sequence $p_{j+k}$.

Now we prove the different convergence properties of the classical GAGM and the proposed SCGA, as stated in Corollary 1 and Corollary 2 in the main text.

Proof: By combining Theorem 1 and Theorem 3, we know that GAGM converges to a circular sequence. By combining Theorem 2 and Theorem 3, we know that when $M$ becomes positive definite, SCGA converges to a fixed discrete point.

Remark 1. (The mildness of Assumption 2) According to Equ. 8 and Equ. 9, $\Delta$ lies in:
$\left[\frac{\theta_{\min} - \sqrt{\theta_{\min}^2 - 8T_j\, nN^2\log N}}{2},\ \frac{\theta_{\min} + \sqrt{\theta_{\min}^2 - 8T_j\, nN^2\log N}}{2}\right]$ (24)
When $T_j \to 0$, this formula implies $\Delta \in (0, \theta_{\min})$; also, as $T_j$ goes to zero, the SoftMax result $\mathrm{SoftMax}(M s_j)$ approaches a discrete result (a vectorized permutation matrix). From Lemma 1 we know that as $s_j$ approaches a discrete result, the score difference between the best solution $p_{j+1}$ and the suboptimal solution $p_i$ approaches $\theta_{\min}$; thus after several iterations we can choose $\Delta = \frac{\theta_{\min}}{2}$, which lies in the range $(0, \theta_{\min})$ and is smaller than the difference between the best and suboptimal scores, so the existence of a valid $\Delta$ for Assumption 2 is assured. In addition, based on our previous derivation, for a given $s_j$, once there exists a $\Delta$ satisfying the condition of Assumption 2, all later $s_{j+1}$ satisfy the constraint. These facts consolidate our judgement that Assumption 2 is weak. Furthermore, we have a more concrete observation, described in Lemma 9.
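To see where the interval in Equ. 24 comes from, the following short derivation can be used; it assumes, as in the proof of Lemma 7, that Assumption 2 requires $\theta_{\min} - 2\frac{T_j}{\Delta}\, nN^2\log N \ge \Delta$ with $\Delta > 0$:
$\theta_{\min} - 2\frac{T_j}{\Delta}\, nN^2\log N \ge \Delta \;\Longleftrightarrow\; \Delta^2 - \theta_{\min}\Delta + 2T_j\, nN^2\log N \le 0,$
and the roots of this quadratic in $\Delta$ are exactly the two endpoints of Equ. 24; as $T_j \to 0$ the interval widens to $(0, \theta_{\min})$.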

Lemma 9. By adding specks to $M$, there exist $s_1$ and $T_1$ s.t. Assumption 2 holds.

Proof: One can select $s_0 = (\frac{1}{N^2}, \ldots, \frac{1}{N^2})^T$; then $\Delta$ obviously lies within $[\frac{1}{N^2}\theta_{\min}, \frac{N^2-1}{N^2}\theta_{\min}]$, and we can select an appropriate $T_1$ such that $\Delta$ lies in the range of Equ. 24.

At the end of this appendix, we give conceptual illustrations showing the idea of our proof, and suppose $T \to 0$ without loss of generality. First we introduce the initial status. Dotted arrows denote the generating process: red dotted arrows denote the iterative step of GAGM [2], where $s_{j+1} = \mathrm{SoftMax}(M s_j)$; blue dotted arrows denote the discrete optimum, where $p_{j+1} = H_d(M s_j)$. Yellow solid lines denote the corresponding function score. A bidirectional arrow denotes the quantitative relationship between targets: a dashed bidirectional arrow denotes a gap relationship, while a solid bidirectional arrow denotes an asymptotic relationship.
Fig. 4. Conceptualization of applying [13]'s conclusions to the proposed iteration chain

As illustrated in Fig. 4, according to Lemma 3 we know that as $T$ approaches 0, for all $k$, $s_{j+k+1}^T M s_{j+k}$ approaches $p_{j+k+1}^T M s_{j+k}$. Based on Assumption 2 $(s_j, T_j, \Delta)$ we know there is a gap between the optimal score $p_{j+1}^T M s_j$ and any other score. Thus from Lemma 4 we get the asymptotic relationship between $s_{j+1}$ and $p_{j+1}$. The green curved arrow illustrates the role of Assumption 2; here we use $T_j$ to represent Assumption 2 $(s_j, T_j, \Delta)$.
Kosowsky and Yuille focus on the connection between the Hungarian solution and the SoftMax one, which is a first-order assignment problem, whereas the pairwise graph matching task we address in this paper is a quadratic assignment problem solved by transforming it into a series of iterative first-order assignment subproblems. By repeatedly applying their conclusion to the separate steps of the quadratic case, we know that, starting from $s_j$, each score $s_{j+k+1}^T M s_{j+k}$ approximates $p_{j+k+1}^T M s_{j+k}$. (Roughly speaking, the approximation of the two scores does not by itself ensure $s_{j+k+1} \approx p_{j+k+1}$ for all $k$, while we show in Fig. 5 that the answer is affirmative as long as Assumption 2 holds for the first iteration.) In addition, under the precondition of Assumption 2 $(s_j, T_j, \Delta)$, according to the conclusion from [13] one can obtain a single $s_{j+1}$ that approaches $p_{j+1}$. Note that the conclusion of [13] does not address the chain property of the quadratic assignment problem, where a first-order assignment is performed in each iteration and thus multiple $(s_{j+k+1}, p_{j+k+1})$ pairs are involved. In other words, if one directly applies their conclusion to the chain case, one has to check the existence of a specific and separate $\Delta_k$ that builds the asymptotic relation between the respective $s_{j+k+1}$ and $p_{j+k+1}$ in each iteration, and further the convergence property of the iterative algorithm. As an extension, we rigorously analyze and give an affirmative answer regarding convergence for this iteration chain, under a rather weak precondition: as long as Assumption 2 $(s_j, T_j, \Delta)$ holds once at some beginning iteration, say $j$, then in all subsequent iterations the updated Assumption 2 $(s_{j+k+1}, T_{j+k+1}, \Delta)$ holds, and $s_{j+k+1}$ is bound to approximate $p_{j+k+1}$.
We also prove that $p_{j+k+2}$ is the Hungarian solution of $M p_{j+k+1}$, so for any other $p_i$ the score gap exists (see the dotted ellipses); the discrete sequence $p_{j+k+1}$ induced by $s_{j+k}$ thus composes an iteration chain by itself. We show the proof flow in Fig. 5 and summarize its key steps as follows:

Fig. 5. Conceptualization of the proof flow for the iteration chain in the pairwise graph matching problem. The main idea is to explore the connection between the continuous solution and the discrete one: $s_j \approx p_j$, as well as $p_{j+1} = H_d(M p_j)$, where $p_{j+1}$ is in fact calculated from $H_d(M s_j)$. In this spirit, SCGA's convergence proof is similar to that of CGA.

1. Using Assumption 2, suppose there is a gap $\Delta$ between the function score of the optimal solution $p_{j+1}$ and that of a suboptimal $p_i$;
2. From Lemma 4 we know $p_{j+1} \approx s_{j+1}$, which we will use in the next step;
3. Prove the asymptotic relationship between $p_i^T M s_{j+1}$ and $p_i^T M p_{j+1}$ (also $p_{j+2}^T M s_{j+1}$ and $p_{j+2}^T M p_{j+1}$) by Lemma 5, as illustrated by the red solid bidirectional arrows;
4. In Lemma 6, we prove that $p_{j+2}$ is the Hungarian solution of $M p_{j+1}$, so $p_{j+2}^T M p_{j+1}$ is bigger than any $p_i^T M p_{j+1}$; according to Assumption 1, there exists a gap $\theta_{\min}$ between them, as illustrated by the black dashed bidirectional arrows;
5. We know $p_i^T M s_{j+1} \approx p_i^T M p_{j+1}$ and $p_{j+2}^T M s_{j+1} \approx p_{j+2}^T M p_{j+1}$; using the gap $\theta_{\min}$ we obtain the gap $\Delta$ in Lemma 7;
6. Update $T_j$ to $T_{j+1}$ in Lemma 8, so Assumption 2 $(s_{j+1}, T_{j+1}, \Delta)$ holds, and obtain the asymptotic relationship between $p_{j+2}$ and $s_{j+2}$;
7. Finally, by combining the above results, Theorem 3 proves that the discrete sequence $p_{j+k}$ is indeed a Hungarian sequence, so there exists a gap $\theta_{\min}$ between adjacent discrete scores (indicated by the nodes $p_{j+2}^T M p_{j+1}$ and $p_{j+3}^T M p_{j+2}$ in Fig. 5).
Image Annotation
Using Metric Learning
in Semantic Neighbourhoods

Yashaswi Verma and C.V. Jawahar

International Institute of Information Technology, Hyderabad, India - 500032


yashaswi.verma@research.iiit.ac.in, jawahar@iiit.ac.in

Abstract. Automatic image annotation aims at predicting a set of tex-


tual labels for an image that describe its semantics. These are usually
taken from an annotation vocabulary of few hundred labels. Because
of the large vocabulary, there is a high variance in the number of im-
ages corresponding to different labels (class-imbalance). Additionally,
due to the limitations of manual annotation, a significant number of
available images are not annotated with all the relevant labels (weak-
labelling). These two issues badly affect the performance of most of the
existing image annotation models. In this work, we propose 2PKNN,
a two-step variant of the classical K-nearest neighbour algorithm, that
addresses these two issues in the image annotation task. The first step
of 2PKNN uses image-to-label similarities, while the second step uses
image-to-image similarities; thus combining the benefits of both. Since
the performance of nearest-neighbour based methods greatly depends on
how features are compared, we also propose a metric learning frame-
work over 2PKNN that learns weights for multiple features as well as
distances together. This is done in a large margin set-up by generaliz-
ing a well-known (single-label) classification metric learning algorithm
for multi-label prediction. For scalability, we implement it by alternating
between stochastic sub-gradient descent and projection steps.
Extensive experiments demonstrate that, though conceptually sim-
ple, 2PKNN alone performs comparably to the current state-of-the-art
on three challenging image annotation datasets, and shows significant
improvements after metric learning.

1 Introduction
Automatic image annotation is a labelling problem which has potential appli-
cations in image retrieval [5,11,16], image description [23], etc. Given an unseen
image, the goal is to predict multiple textual labels describing that image. Re-
cent outburst of multimedia content has raised the demand for auto-annotation
methods, thus making it an active area of research [5,11,16,19,20]. Several meth-
ods have been proposed in the past for image auto-annotation which try to
model image-to-image [5,11,16], image-to-label [10,20] and label-to-label [11,20]
similarities. Our work falls under the category of the supervised annotation mod-
els such as [4,5,10,11,13,16,19,22] that work with large annotation vocabularies.


Among these, K-nearest neighbour (or KNN) based methods such as [5,11,16]
have been found to give some of the best results despite their simplicity. The
intuition is that similar images share common labels [11]. In most of these
approaches, this similarity is determined only using image features. Though
this can handle label-to-label dependencies to some extent, it fails to address
the class-imbalance (large variations in the frequency of different labels) and
weak-labelling (many available images are not annotated with all the relevant
labels) problems that are prevalent in the popular datasets (Sec. 4.1) as well as
real-world databases. E.g., in an experiment on the Corel 5K dataset, we found
that for the 20% least frequent labels, JEC [11] achieves an F-score of 19.7%,
whereas it gives reasonably good performance for the 20% most frequent labels
with F-score being 50.6%.
As per our knowledge, no attempt has been made in the past that directly
addresses these issues. An indirect attempt was made in TagProp [16] to address
class-imbalance, which we discuss in Sec. 2. To address these issues in a nearest-
neighbour set-up, we need to make sure that (i) for a given image, the (subset
of) training images that are considered for label prediction should not have large
dierences in the frequency of dierent labels; and (ii) the comparison criteria
between two images should make use of both image-to-label and image-to-image
similarities (image-to-image similarities also capture label-to-label similarities
in a nearest-neighbour scenario). With this motivation, we present a two-step
KNN-based method. We call this 2-Pass K-Nearest Neighbour (or 2PKNN)
algorithm. For an image, we say that its few nearest neighbours from a given class
constitute its semantic neighbourhood, and these neighbours are its semantic
neighbours. For a particular class, these are the samples that are most related
with a new image. Given an unseen image, in the first step of 2PKNN we identify
its semantic neighbours corresponding to all the labels. Then in the second step,
only these samples are used for label prediction. This relates to the idea of
bottom-up pruning common in day-to-day scenarios such as buying a car
or selecting clothes to wear, where first the potential candidates are short-listed
based on a quick analysis, and then another set of criteria is used for the final selection.
It is well-known that the performance of KNN based methods largely depends
on how two images are compared [11]. Usually, this comparison is done using a
set of features extracted from images and some specialized distance metric for
each feature (such as L1 for colour histograms, L2 for Gist) [11,16]. In such a
scenario, (i) since each base distance contributes differently, we can learn ap-
propriate weights to combine them in the distance space [11,16]; and (ii) since
every feature (such as SIFT or colour histogram) itself is represented as a multi-
dimensional vector, its individual elements can also be weighted in the feature
space [12]. As the 2PKNN algorithm works in the nearest-neighbour setting, we
would like to learn weights that maximize the annotation performance. With this
goal, we perform metric learning over 2PKNN by generalizing the LMNN [12]
algorithm for multi-label prediction. Our metric learning framework extends
LMNN in two major ways: (i) LMNN is meant for single-label classification (or
simply classification) problems, while we adapt it for image annotation, which

is a multi-label classification task; and (ii) LMNN learns a single Mahalanobis
metric in the feature space, while we extend it to learn linear metrics for multi-
ple features as well as distances together. Since we need to learn a large number
of weights, and iteratively perform pair-wise comparisons between large sample
sets, scalability appears as one of the major concerns. To address this, we im-
plement metric learning by alternating between stochastic sub-gradient descent
and projection steps (similar to Pegasos [9]). This allows us to optimize the weights
iteratively using a small number of comparisons at each iteration, thus making our
learning easily scalable to large datasets with samples represented in very high
dimensions. In our experiments on three benchmark image annotation datasets,
our method (i.e., 2PKNN with metric learning) significantly outperforms the
previous results.
In the next section, we discuss some of the notable contributions in the field
of image annotation. In Sec. 3, we formalize 2PKNN and the metric learning
model; in Sec. 4, we discuss the experiments; and finally conclude in Sec. 5.

2 Related Work
The image annotation problem was initially addressed using generative models;
e.g. translation models [2,3] and nearest-neighbour based relevance models [4,5].
Recently, a Markov Random Field [13] based approach was proposed that can
flexibly accommodate most of the previous generative models. Though these
methods are directly extendable to large datasets, they might not be ideal for
the annotation task as their underlying joint distribution formulations assume
independence of image features and labels, whereas recent developments [17]
emphasize using conditional dependence to achieve Bayes optimal prediction.
Among discriminative models, SML [10] treats each label as a class of a multi-
class multi-labelling problem, and learns class-specific distributions. However, it
requires large (class-)balanced training data to estimate these distributions. Also,
label interdependencies might result in corrupted distribution models. Another
nearest-neighbour based method [19] tries to benefit from feature sparsity and
clustering properties using a regularization based algorithm for feature selection.
JEC [11] treats image annotation as retrieval. Using multiple global features, a
greedy algorithm is used for label transfer from neighbours. They also performed
metric learning in the distance space, but it could not do any better than using
equal weights. This is because they used a classification-based metric learning
approach for the annotation task, which is multi-label classification by nature.
Though JEC looks simple at the modelling level, it reported the best results on
benchmark annotation datasets when it was proposed. TagProp [16] is a weighted
KNN based method that transfers labels by taking a weighted average of keyword
presences among the neighbours. To address the class-imbalance problem,
logistic discriminant models are wrapped over the weighted KNN method with
metric learning. This boosts the importance given to infrequent labels and sup-
presses it for frequent labels appearing among the neighbours.
Nearly all image annotation models can be considered as multi-label clas-
sification algorithms, as they associate multiple labels with an image. Recent

methods such as [14,21] treat it as a multi-label ranking problem. Given an im-
age, instead of predicting some fixed number of labels, they generate a ranked
list of all the labels based on their chances of getting assigned to that image.
E.g., in [21] an algorithm was proposed to learn from incompletely labelled data;
i.e., only a subset of the ground-truth labels of each training image is used at the
time of learning. Of late, some annotation methods such as [15,20] have been
proposed that try to model image features and labels as well as dependencies
among them, but most of these work on small vocabularies containing a few tens
of labels, and hence class-imbalance and weak-labelling are not a big concern.
Our work is more comprehensive and falls under the category of previous works
on image annotation [5,10,11,13,16,19,22] that address a more realistic and chal-
lenging scenario where the vocabulary contains a few hundred labels and the
datasets seriously suffer from class-imbalance and weak-labelling issues.

3 Label Prediction Model


Here, first we describe the 2PKNN algorithm, and then formulate the metric
learning over it.

3.1 The 2PKNN Algorithm


Let $\{I_1, \ldots, I_t\}$ be a collection of images and $Y = \{y_1, \ldots, y_l\}$ be a vocabulary
of $l$ labels (or semantic concepts). The training set $T = \{(I_1, Y_1), \ldots, (I_t, Y_t)\}$
consists of pairs of images and their corresponding label sets, with each $Y_i \subseteq Y$.
Similar to SML [10], we assume the conditional probabilities $P(A|y_i)$ that
model the feature distribution of an image $A$ given a semantic concept $y_i \in Y$.
Using this, we model image annotation as a problem of finding the posterior
probabilities
$P(y_i|A) = \frac{P(A|y_i)\,P(y_i)}{P(A)}$, (1)
where $P(y_i)$ is the prior probability of the label $y_i$. Then, given an unannotated
image $J$, the best label for it will be given by
$y^{*} = \arg\max_i P(y_i|J)$. (2)
Let $T_i \subseteq T$, $i \in \{1, \ldots, l\}$ be the subset of the training data that contains all the
images annotated with the label $y_i$. Since each set $T_i$ contains images with one
semantic concept (or label) common among them, we consider it as a semantic
group (similar to [20]). It should be noted that the sets $T_i$ are not disjoint, as an
image usually has multiple labels and hence belongs to multiple semantic groups.
Given an unannotated image $J$, from each semantic group we pick $K_1$ images
that are most similar to $J$ and form corresponding sets $T_{J,i} \subseteq T_i$. Thus, each $T_{J,i}$
contains those images that are most informative in predicting the probability
of the label $y_i$ for $J$. The samples in each set $T_{J,i}$ are the semantic neighbours
of $J$ corresponding to $y_i$. These semantic neighbours incorporate image-to-label similarity.

Once the sets $T_{J,i}$ are determined, we merge them all to form a set $T_J = \{T_{J,1} \cup \ldots \cup T_{J,l}\}$.
This way, we obtain a subset of the training data $T_J \subseteq T$ specific
to $J$ that contains its semantic neighbours corresponding to all the labels in the
vocabulary $Y$. This is the first pass of 2PKNN. It can be easily noted that in
$T_J$, each label appears (at least) $K_1$ times, thus addressing the class-imbalance
issue. To understand how this step also handles the issue of weak-labelling, we
analyze its cause. Weak-labelling occurs because obvious labels are
often missed by human annotators while building a dataset, and hence many
images depicting such a concept are actually not annotated with it. Under
this situation, given an unseen image, if we use only its few nearest neighbours
from the entire training data (as in [11,16]), then such labels may not appear
among these neighbours and hence not get apt scores. In contrast, the first pass
of 2PKNN finds a neighbourhood where all the labels are present explicitly.
Therefore, now they (i.e., the obvious labels) have better chances of getting
assigned to a new image, thus addressing the weak-labelling issue.
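To make the first pass concrete, the following minimal Python sketch builds the per-label semantic neighbourhoods and merges them into $T_J$; the function and variable names and the distance function dist are illustrative assumptions, not the authors' implementation.

import heapq

def two_pknn_first_pass(J, images, label_sets, K1, dist):
    """First pass of 2PKNN (illustrative sketch): for every label y, keep the
    K1 training images annotated with y that are closest to the query image J,
    then take the union of these per-label neighbourhoods to obtain T_J."""
    # Semantic groups: indices of the training images carrying each label (not disjoint).
    groups = {}
    for idx, labels in enumerate(label_sets):
        for y in labels:
            groups.setdefault(y, []).append(idx)
    T_J = set()
    for y, members in groups.items():
        # K1 semantic neighbours of J from the semantic group of label y.
        nearest = heapq.nsmallest(K1, members, key=lambda i: dist(J, images[i]))
        T_J.update(nearest)  # every label ends up represented (at least) K1 times
    return T_J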


Fig. 1. For a test image from the IAPR TC-12 dataset (first column), the first row
on the right section shows its 4 nearest images (and their labels) from the training
data after the first pass of 2PKNN (K1 = 1), and the second row shows the 4 nearest
images using JEC [11]. The labels in bold are the ones that match with the ground-
truth labels of the test image. Note the frequency (9 vs. 6) and diversity ({sky, house,
landscape, bay, road, meadow} vs. {sky, house}) of matching labels for 2PKNN vs. JEC.

Figure 1 shows an example from the IAPR TC-12 dataset (Sec. 4.1) illus-
trating how the first pass of 2PKNN addresses both class-imbalance and weak-
labelling. For a given test image (first column) along with its ground-truth labels,
we can notice the presence of the rare labels {landscape, bay, road, meadow}

among its four nearest images found after the first pass of 2PKNN (first row
on the right), without compromising on the frequent labels {sky, house}.
In contrast, the neighbours obtained using JEC [11] (second row) contain only
the frequent labels. It can also be observed that though the labels {landscape,
meadow} look obvious for the neighbours found using JEC, these are actually
absent in their ground-truth annotations (weak-labelling), whereas the first pass
of 2PKNN makes their presence among the neighbours explicit.
The second pass of 2PKNN is a weighted sum over the samples in $T_J$ to
assign importance to labels based on image similarity. This gives the posterior
probability for $J$ given a label $y_k \in Y$ as
$P(J|y_k) = \sum_{(I_i, Y_i) \in T_J} \theta_{J,I_i} \cdot P(y_k|I_i) = \sum_{(I_i, Y_i) \in T_J} \exp(-D(J, I_i)) \cdot \delta(y_k \in Y_i)$, (3)
where $\theta_{J,I_i} = \exp(-D(J, I_i))$ (see Eq. (4) for the definition of $D(J, I_i)$) denotes
the contribution of image $I_i$ in predicting the label $y_k$ for $J$ depending on their
visual similarity; and $P(y_k|I_i) = \delta(y_k \in Y_i)$ denotes the presence/absence of
label $y_k$ in the label set $Y_i$ of $I_i$, with $\delta(\cdot)$ being 1 only when the argument
holds true and 0 otherwise. Assuming that the first pass of 2PKNN gives
a subset of the training data where each label has comparable frequency, we
set the prior probability in Eq. (1) as a uniform distribution; i.e., $P(y_i) = \frac{1}{|T|}$,
$\forall i \in \{1, \ldots, l\}$. Putting Eq. (3) in Eq. (1) generates a ranking of all the labels
based on their probability of getting assigned to the unseen image $J$. Note that,
along with image-to-image similarities, the second pass of 2PKNN implicitly
takes care of label-to-label dependencies, as the labels appearing together in the
same neighbouring image will get equal importance.
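Under the same illustrative conventions as the first-pass sketch, the second pass reduces to accumulating the exponential weights of Eq. (3) on the labels of the semantic neighbours; the uniform prior is omitted since it does not affect the ranking.

import math

def two_pknn_second_pass(J, T_J, images, label_sets, dist):
    """Second pass of 2PKNN (illustrative sketch): weight every semantic
    neighbour I_i in T_J by exp(-D(J, I_i)) and add that weight to each label
    present in Y_i, as in Eq. (3); return the labels ranked by score."""
    scores = {}
    for i in T_J:
        w = math.exp(-dist(J, images[i]))   # contribution of neighbour I_i
        for y in label_sets[i]:             # indicator delta(y in Y_i) = 1
            scores[y] = scores.get(y, 0.0) + w
    return sorted(scores, key=scores.get, reverse=True)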
Both conceptually and practically, 2PKNN is entirely different from the
previous two-step variants of KNN such as [1,7]. They use a few (global) nearest
neighbours of a sample to apply some other, more sophisticated technique such as
linear discriminant analysis [1] or SVM [7]. In contrast, the first pass of 2PKNN
considers all the samples, but in localized semantic groups. Also, the previous vari-
ants were designed for the classification task, while 2PKNN addresses the more
challenging problem of image annotation (which is multi-label classification by
nature), where the datasets suffer from high class-imbalance and weak-labelling
(note that weak-labelling is not a concern in classification problems).

3.2 Metric Learning (ML)


Most of the classification metric learning algorithms try to increase inter-class
and reduce intra-class distances, thus treating each pair of samples in a bi-
nary manner (recall that a similar approach was used in JEC [11] but could not
improve the annotation performance). Since image annotation is a multi-label
classification task, here two samples relate in the continuous space [0, 1]; hence
classification metric learning cannot be applied directly. As part of metric learn-
ing, our aim is to learn (non-negative) weights over multiple features as well as
base distances that maximize the annotation performance of 2PKNN. For this

purpose, we extend the classical LMNN [12] algorithm for multi-label prediction.
Let there be two images $A$ and $B$, each represented by $n$ features $\{f_A^1, \ldots, f_A^n\}$
and $\{f_B^1, \ldots, f_B^n\}$ respectively. Each feature is a multi-dimensional vector, with
the dimensionality of a feature $f^i$ being $N_i$, i.e. $f^i \in \mathbb{R}^{N_i}$ for $i = 1, \ldots, n$. We de-
note an entry of a vector $x$ as $x(\cdot)$. The distance between two images is computed
by finding the distance between their corresponding features using some special-
ized distance measure for each feature (such as $L_1$ for colour histograms, $\chi^2$ for
SIFT features, etc.), and then combining them all. In order to learn weights in
the feature space, it should be noted that some of the popular distance measures
such as $L_1$, (squared) $L_2$ and $\chi^2$ can be written as a dot product of two vectors. E.g.,
given any two corresponding feature vectors $f_A^i$ and $f_B^i$, if we consider a vector
$d_{AB}^i \in \mathbb{R}_+^{N_i}$ such that $d_{AB}^i(j) = |f_A^i(j) - f_B^i(j)|$, $\forall j \in \{1, \ldots, N_i\}$, then the $L_1$
distance between the two feature vectors can be written as $L_1(f_A^i, f_B^i) = v^i \cdot d_{AB}^i$;
where $|\cdot|$ gives the absolute value, and $v^i \in \mathbb{R}_+^{N_i}$ is usually taken as a normalized
unit vector$^1$. Note that $v^i$ can be replaced by any non-negative real-valued normal-
ized vector that assigns appropriate weights to the individual dimensions of a feature
vector in the feature space. Moreover, we can also learn weights $w \in \mathbb{R}_+^n$ in the
distance space to optimally combine multiple feature distances. Based on this,
we write the distance between $A$ and $B$ as
$D(A, B) = \sum_{i=1}^{n} w(i) \cdot \left( \sum_{j=1}^{N_i} v^i(j) \cdot d_{AB}^i(j) \right)$ . (4)
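As an illustration of Eq. (4), the weighted distance can be evaluated as in the sketch below; the base difference vectors $d^i_{AB}$ are assumed to be precomputed, and the names are ours rather than the authors'.

import numpy as np

def weighted_distance(d_AB, w, v):
    """D(A, B) of Eq. (4) (sketch). d_AB is a list of n non-negative base
    difference vectors d^i_AB (e.g. |f_A^i - f_B^i| for the L1 case), v the
    list of per-dimension weight vectors v^i, and w the n per-feature weights
    that combine the individual feature distances."""
    return sum(w[i] * float(np.dot(v[i], d_AB[i])) for i in range(len(d_AB)))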

Now we describe how to learn the weights appearing in Eq. (4). For a given
labelled sample $(I_p, Y_p) \in T$, we define its (i) target neighbours as its $K_1$ nearest
images from the semantic group $T_q$, $\forall q$ s.t. $y_q \in Y_p$, and (ii) impostors as its $K_1$
nearest images from $T_r$, $\forall r$ s.t. $y_r \in Y \setminus Y_p$. Our objective is to learn the weights
such that the distance of a sample from its target neighbours is minimized, and
is also less than its distance from any of the impostors (i.e., pull the target
neighbours and push the impostors). In other words, given an image $I_p$ along
with its labels $Y_p$, we want to learn weights such that its nearest ($K_1$) semantic
neighbours from the semantic groups $T_q$ (i.e., the groups corresponding to its
ground-truth labels) are pulled closer, and those from the remaining semantic
groups are pushed far. With this goal, for a sample image $I_p$, its target neighbour
$I_q$ and its impostor $I_r$, the loss function is given by
$E_{loss} = \sum_{pq} \eta_{pq}\, D(I_p, I_q) + \lambda \sum_{pqr} \eta_{pq}(1 - \mu_{pr})\,[1 + D(I_p, I_q) - D(I_p, I_r)]_{+}$ . (5)
Here, $\lambda > 0$ handles the trade-off between the two error terms. The variable $\eta_{pq}$
is 1 if $I_q$ is a target neighbour of $I_p$ and 0 otherwise; $\mu_{pr} = \frac{|Y_p \cap Y_r|}{|Y_r|} \in [0, 1]$, with
$Y_r$ being the label set of an impostor $I_r$ of $I_p$. And $[z]_{+} = \max(0, z)$ is the hinge
loss, which will be positive only when $D(I_p, I_r) < D(I_p, I_q) + 1$ (i.e., when for a
sample $I_p$, its impostor $I_r$ is nearer than its target neighbour $I_q$). To make sure

$^1$ $d_{AB}^i$ can similarly be computed for other measures such as squared $L_2$ and $\chi^2$.

that a target neighbour $I_q$ is much closer than an impostor $I_r$, a margin (of size 1)
is used in the error function. Note that $\mu_{pr}$ is in the continuous range [0, 1], thus
scaling the hinge loss depending on the overlap between the label sets of a given
image $I_p$ and its impostor $I_r$. This means that for a given sample, the amount of
push applied to its impostor varies depending on its conceptual similarity with
that sample. An impostor with large similarity will be pushed less, whereas an
impostor with small similarity will be pushed more. This makes it suitable for
multi-label classification tasks such as image annotation. The above loss function
is minimized by the following constrained optimization problem:
$\min_{w,v} \sum_{pq} \eta_{pq}\, D(I_p, I_q) + \lambda \sum_{pqr} \eta_{pq}(1 - \mu_{pr})\,\xi_{pqr}$
s.t. $D(I_p, I_r) - D(I_p, I_q) \ge 1 - \xi_{pqr} \quad \forall p, q, r$
$\xi_{pqr} \ge 0 \quad \forall p, q, r$
$w(i) \ge 0 \;\; \forall i; \quad \sum_{i=1}^{n} w(i) = n$
$v^i(j) \ge 0 \;\; \forall i, j; \quad \sum_{j=1}^{N_i} v^i(j) = 1 \;\; \forall i \in \{1, \ldots, n\}$ (6)

Here, the slack variables $\xi_{pqr}$ represent the hinge loss in Eq. (5), and $v$ is a vec-
tor obtained by concatenating all the $v^i$ for $i = 1, \ldots, n$. We solve the above
optimization problem in the primal form itself. Since image features are usually
of very high dimension, the number of variables is large ($= n + \sum_{i=1}^{n} N_i$). This
makes scaling the above optimization problem difficult using conven-
tional gradient descent. To overcome this issue, we solve it by alternately using
stochastic sub-gradient descent and projection steps (similar to Pegasos [9]). This
gives an approximately optimal solution using a small number of comparisons, and
thus helps in achieving high scalability. To determine the optimal weights, we
alternate between the weights in the distance space and the feature space.
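A heavily simplified sketch of one such alternation step for the distance-space weights $w$ is given below; it only conveys the flavour of the update (sample a few (p, q, r) triplets, take a sub-gradient step on the active hinge terms of Eq. (5), then project back onto the constraints of Eq. (6)). The step size, triplet mining, the analogous update of the feature-space weights $v$, and early stopping are omitted, and the helper feat_dists as well as all names are illustrative assumptions rather than the authors' implementation.

import random
import numpy as np

def project_w(w, n):
    """Project onto the w-constraints of Eq. (6): w >= 0, sum_i w(i) = n
    (a simple clip-and-rescale sketch, not an exact simplex projection)."""
    w = np.maximum(w, 0.0)
    s = w.sum()
    return w * (n / s) if s > 0 else np.full(n, 1.0)

def sgd_step_w(w, triplets, feat_dists, lam=1.0, lr=1e-3, batch=32):
    """One stochastic sub-gradient step on the distance-space weights w for the
    loss of Eq. (5), followed by projection. feat_dists(p, q) is assumed to
    return the n v-weighted per-feature distances as a NumPy array, so that
    D(p, q) = w . feat_dists(p, q) and the sub-gradient of D w.r.t. w is
    feat_dists(p, q) itself. triplets holds (p, q, r, eta_pq, mu_pr) tuples."""
    grad = np.zeros_like(w)
    for (p, q, r, eta, mu) in random.sample(triplets, min(batch, len(triplets))):
        d_pq, d_pr = feat_dists(p, q), feat_dists(p, r)
        grad += eta * d_pq                                  # pull term of Eq. (5)
        if 1.0 + w.dot(d_pq) - w.dot(d_pr) > 0.0:           # hinge of Eq. (5) is active
            grad += lam * eta * (1.0 - mu) * (d_pq - d_pr)  # push the impostor
    return project_w(w - lr * grad, len(w))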
Our extension of LMNN conceptually differs from its previous extensions such
as [18] in at least two significant ways: (i) we adapt LMNN in its choice of tar-
gets/impostors to learn metrics for multi-label prediction problems, whereas [18]
uses the same definition of targets/impostors as in LMNN to address classifica-
tion problems in a multi-task setting; and (ii) in our formulation, the amount of
push applied to an impostor varies depending on its conceptual similarity w.r.t.
a given sample, which makes it suitable for multi-label prediction tasks.

4 Experiments

4.1 Data Sets and Their Characteristics

We have used three popular image annotation datasets, Corel 5K, ESP Game
and IAPR TC-12, to test and compare the performance of our method with
previous approaches. Corel 5K was first used in [3], and since then it has become
a benchmark for comparing annotation performance. ESP Game contains images
annotated using an on-line game, where two (mutually unknown) players are
randomly given an image for which they have to predict the same keyword(s) to

score points [6]. This way, many people participate in the manual annotation
task thus making this dataset very challenging and diverse. IAPR TC-12 was
introduced in [8] for cross-lingual retrieval. In this, each image has a detailed
description from which only nouns are extracted and treated as annotations.
In Table 1, columns 2-5 show the general statistics of the three datasets;
and in columns 6-8, we highlight some interesting statistics that provide better
insights about the structure of the three datasets. It can be noticed that for
each dataset, around 75% of the labels have frequency less than the mean label
frequency (column 8), and also the median label frequency is far less than the
corresponding mean frequency (column 7). This verifies the claim we previously
made in Sec. 1 (i.e., the datasets badly suffer from the class-imbalance problem).

Table 1. General (columns 2-5) and some insightful (columns 6-8) statistics for the
three datasets. In column 6 and 7, the entries are in the format mean, median, maxi-
mum. Column 8 (Labels#) shows the number of labels whose frequency is less than
the mean label frequency.

Dataset | Number of images | Number of labels | Training images | Testing images | Labels per image | Images per label (or label frequency) | Labels#
Corel 5K | 5,000 | 260 | 4,500 | 500 | 3.4, 4, 5 | 58.6, 22, 1004 | 195 (75.0%)
ESP Game | 20,770 | 268 | 18,689 | 2,081 | 4.7, 5, 15 | 326.7, 172, 4553 | 201 (75.0%)
IAPR TC-12 | 19,627 | 291 | 17,665 | 1,962 | 5.7, 5, 23 | 347.7, 153, 4999 | 217 (74.6%)

Though it is not straightforward to quantify weak-labelling, we try to analyze


it from the number of labels per image (column 6). We argue that a large gap
between the mean (or median) and maximum number of labels per image indicates
that many images are not labelled with all the relevant labels. Based on this, we
can infer that both the ESP Game and IAPR TC-12 datasets suffer from weak-
labelling. For the Corel 5K dataset, we examined the images and their corresponding
annotations to confirm weak-labelling.

4.2 Features and Evaluation Measures


Features. To compare our model's performance with the previous methods,
we use similar features as in [16]. These are a combination of local and
global features. The local features include the SIFT and hue descriptors obtained
densely from a multi-scale grid, and from Harris-Laplacian interest points. The
global features comprise histograms in the RGB, HSV and LAB colour spaces,
and the Gist features. To encode some spatial information about an image, all
but the Gist features are also computed over three equal horizontal partitions of
each image. To calculate the distance between two features, the $L_1$ measure is used for
the colour histograms, $L_2$ for the Gist, and $\chi^2$ for the SIFT and hue descriptors.

Evaluation Measures. To analyze the annotation performance, we compute


precision and recall of each label in a dataset. Suppose a label yi is present in the

ground-truth of $m_1$ images, and it is predicted for $m_2$ images during testing, out
of which $m_3$ predictions are correct ($m_3 \le m_2$ and $m_3 \le m_1$); then its precision
will be $m_3/m_2$ and its recall will be $m_3/m_1$. We average these values over
all the labels of a dataset and get the percentage mean precision P and percentage
mean recall R, similar to the previous annotation methods such as [11,16,19,22].
Using these two measures, we get the percentage F1-score F1 = 2·P·R/(P+R),
which takes care of the trade-off between precision and recall. The different image
annotation methods are compared using two criteria: first F1, and second N+,
which is the number of labels that are correctly assigned to at least one test
image (i.e., the number of labels with positive recall).
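These per-label measures can be computed directly from the ground-truth and predicted label sets of the test images, e.g. as in the following sketch (the variable layout is ours, not prescribed by the paper).

def annotation_scores(gt, pred, vocabulary):
    """Percentage mean precision P, mean recall R, F1 and N+ over a label
    vocabulary. gt and pred are lists of label sets (ground truth / predicted),
    one per test image."""
    P_sum, R_sum, n_plus = 0.0, 0.0, 0
    for y in vocabulary:
        m1 = sum(y in g for g in gt)                  # images annotated with y
        m2 = sum(y in p for p in pred)                # images for which y is predicted
        m3 = sum((y in g) and (y in p) for g, p in zip(gt, pred))  # correct predictions
        P_sum += m3 / m2 if m2 else 0.0
        R_sum += m3 / m1 if m1 else 0.0
        n_plus += int(m3 > 0)                         # y has positive recall
    P = 100.0 * P_sum / len(vocabulary)
    R = 100.0 * R_sum / len(vocabulary)
    F1 = 2 * P * R / (P + R) if (P + R) else 0.0
    return P, R, F1, n_plus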
To compare with [14], we use three measures (see [14] for more details): (i)
One-error is similar to classification error; it tells the number of times the
label predicted with the highest probability is not present in the ground-truth.
(ii) Coverage is a measure of the worst rank assigned to any of the ground-truth
labels. (iii) Average Precision (not to be confused with the percentage mean
precision P used to measure annotation performance) gives the area under the precision-
recall curve used to evaluate a ranked list of labels. To compare with [21], we
adopt the Area Under the ROC curve (or AUC) as the evaluation measure.
For One-error and Coverage, a smaller value implies better performance, while
for all other measures, a higher value implies better performance.

4.3 Experimental Details

A randomly-sampled subset of the training data consisting of 3000 samples is used to
learn the weights $w$ and $v^i$ in a leave-one-out manner, and a separate (random)
validation set of 400 samples is used for early stopping. This is repeated five
times, and the model that performs best during validation is used to evaluate
the performance on the test data. For the first pass of 2PKNN, we set $K_1$ as 4, 2
and 2 for the Corel 5K, ESP Game and IAPR TC-12 datasets respectively. The
results corresponding to 2PKNN are determined using an $L_1$-normalized ones
vector for each $v^i$, and the vector $w$ is replaced by a scalar that controls the decay
of $\theta_{J,I_i}$ (Eq. (3)) with distance (similar to [16]). The results for 2PKNN+ML are
obtained by using weighted features and base distances in 2PKNN, with the weights
being determined after metric learning. The probability scores are appropriately
scaled such that the relevance of a label for any image is never above one.

4.4 Comparisons

Now we present the quantitative analysis on the three datasets. First, for the sake
of completeness, we compare with recent multi-label ranking methods [14,21],
and then we show detailed comparisons with the previous annotation methods.

Comparison with Multi-label Ranking Methods. In order to show that
our method does not compromise the performance on frequent labels, we
quantitatively compare our method with the best reported results of MIML [14]

in Table 2 using the same conventions as theirs. To be specific, we use the Corel
5K dataset and consider only the 20 most frequent labels, which results in 3,947
training and 444 test images with an average of 1.8 labels per image. As we can see,
our method performs significantly better than [14] on all three measures.
Notable is the considerable reduction in the coverage score, which indicates that
in most of the cases, all the ground-truth annotations are included among the
top 4 labels predicted by our method.

Table 2. Comparison between MIML [14] and 2PKNN combined with metric learning
on a subset of the Corel 5K dataset using only 20 most frequent labels as in [14]

One-error Coverage Average Precision


MIML [14] 0.565 5.507 0.535
2PKNN+ML (This work) 0.427 3.38 0.644

To show that our method addresses the weak-labelling issue, we compare with [21].
Though [21] addresses the problem of incomplete labelling (see Sec. 2), conceptu-
ally it overlaps with the weak-labelling issue as both work under the scenario of
unavailability of all relevant labels. Following their protocol, we select those im-
ages from the entire ESP Game dataset that are annotated with at least 5 labels,
and then test on four cases. First we use all the labels in the ground-truth, and
then randomly remove 20%, 40% and 60% of the labels respectively from the ground-
truth annotation of each training image. On average, they achieved 83.7475%
AUC over these cases, while we get 85.4739%, which is better by 1.7264%.

Comparison with Image Annotation Methods. To compare the image an-
notation performance, we follow the pre-defined partitions for training and testing
as used in [11,16]. To each test image, we assign the top five labels predicted
using Eq. (2). Our results as well as those reported by the previous models for
image annotation are summarized in Table 3. We can see that on all three
datasets, our base method 2PKNN itself performs comparably to the previous
best results. Notable is the significant increase in the number of labels with pos-
itive recall (the credit goes to the first pass of 2PKNN). After combining metric
learning with it (2PKNN+ML), the performance improves significantly. Pre-
cisely, for the Corel 5K, ESP Game and IAPR TC-12 datasets, we gain 6.7%,
3.8%, and 4.1% respectively in terms of F1; and 19, 13 and 12 respectively in
terms of N+ over the current state-of-the-art. We also found that F1 improves
by up to 2% in general on using both $w$ and $v^i$ as compared to using only one
of them.

These empirical evaluations show that our method consistently achieves su-
perior performance to the previous models under multiple evaluation criteria,
thus establishing its overall effectiveness.

Table 3. Comparison of annotation performance among different methods. The top
section shows the performances reported by the previous methods. The bottom section
shows the performance achieved by our method (2PKNN), and that combined with
metric learning (2PKNN+ML). The best results in both sections are highlighted in bold.

Corel 5K ESP Game IAPR TC-12


Method P R F1 N+ P R F1 N+ P R F1 N+
CRM [4] 16 19 17.4 107
MBRM [5] 24 25 24.5 122 18 19 18.5 209 24 23 23.5 223
SML [10] 23 29 25.7 137
JEC [11] 27 32 29.3 139 22 25 23.4 224 28 29 28.5 250
GS [19] 30 33 31.4 146 32 29 30.4 252
MRFA [13] 31 36 33.3 172
CCD (SVRMKL+KPCA) [22] 36 41 38.3 159 36 24 28.8 232 44 29 35.0 251
TagProp(ML) [16] 31 37 33.7 146 49 20 28.4 213 48 25 32.9 227
TagProp(σML) [16] 33 42 37.0 160 39 27 31.9 239 46 35 39.8 266
2PKNN (This work) 39 40 39.5 177 51 23 31.7 245 49 32 38.7 274
2PKNN+ML (This work) 44 46 45.0 191 53 27 35.7 252 54 37 43.9 278

4.5 Discussion

In Figure 2, we analyze how 2PKNN addresses the class-imbalance issue as
compared to the traditional weighted KNN method (used in TagProp [16]). For
this purpose, we use the annotation performance in terms of mean recall. The
labels are partitioned into two groups based on their frequency. The first partition
consists of the 50% least frequent labels and the second partition consists of the
50% most frequent labels. Three observations can be made by looking at this
figure. First, for all three datasets, unlike weighted KNN, 2PKNN performs
comparably for both label-partitions despite the large differences in their
frequency (compare the median label frequency with the mean and maximum label
frequency in Table 1, column 7). This suggests that 2PKNN actually addresses
the class-imbalance problem in these challenging datasets with large vocabularies.
Second, 2PKNN does not compromise the performance on frequent labels
compared to weighted KNN, and always performs better than it. This shows
that 2PKNN can be a better option than weighted KNN for the complicated
image annotation task. And third, after metric learning, the performance always
improves for both label-partitions on all the datasets. This confirms that
our metric learning approach benefits both rare as well as frequent labels. In
Figure 3, we show some qualitative annotation results from the three datasets.
Each image is an example of a weakly-labelled image. It can be seen that for
all these images, our method predicts all the ground-truth labels. Moreover, the
additional labels predicted are actually depicted in the corresponding images, but
missing in their ground-truth annotations. Recall that for the quantitative analysis,
we experimentally compared our method with [21] (Sec. 4.4) and achieved better
performance. These results show that our method is capable of addressing the weak-
labelling issue prevalent in real-world datasets.


Fig. 2. Annotation performance in terms of mean recall (vertical axis) for the three
datasets obtained using weighted KNN as in TagProp [16] (blue), using 2PKNN (red),
and 2PKNN combined with metric learning (green). The labels are grouped based on
their frequency in a dataset (horizontal axis). The first bin corresponds to the 50%
least frequent labels and the second bin corresponds to the 50% most frequent labels.


Fig. 3. Annotations for example images from the three datasets. The second row shows
the ground-truth annotations and the third row shows the labels predicted using our
method 2PKNN+ML. The labels in blue (bold) are those that match with ground-
truth. The labels in red (italics) are those that, though depicted in the corresponding
images, are missing in their ground-truth annotations and are predicted by our method.

5 Conclusion
We showed that our variant of the KNN algorithm, i.e. 2PKNN, combined with
metric learning performs better than the previous methods on three challenging
image annotation datasets. It also addresses the class-imbalance and weak-labelling
issues that are prevalent in real-world scenarios. This can be useful for natu-
ral image databases where tags obey Zipf's law. We also demonstrated how
a classification metric learning algorithm can be effectively adapted for the more
complicated multi-label classification problems such as annotation. We hope that
this work sheds some light on the possibilities of extending the popular discrim-
inative margin-based classification methods to multi-label prediction tasks.

References
1. Hastie, T., Tibshirani, R.: Discriminant adaptive nearest neighbor classification.
PAMI 18(6), 607–616 (1996)
2. Mori, Y., Takahashi, H., Oka, R.: Image-to-word transformation based on dividing
and vector quantizing images with words. In: First International Workshop on
Multimedia Intelligent Storage and Retrieval Management (1999)
3. Duygulu, P., Barnard, K., de Freitas, J.F.G., Forsyth, D.: Object Recognition
as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary. In:
Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002, Part IV.
LNCS, vol. 2353, pp. 97–112. Springer, Heidelberg (2002)
4. Lavrenko, V., Manmatha, R., Jeon, J.: A model for learning the semantics of
pictures. In: NIPS (2003)
5. Feng, S.L., Manmatha, R., Lavrenko, V.: Multiple Bernoulli relevance models for
image and video annotation. In: CVPR, pp. 1002–1009 (2004)
6. von Ahn, L., Dabbish, L.: Labeling images with a computer game. In: ACM SIGCHI
(2004)
7. Zhang, H., Berg, A.C., Maire, M., Malik, J.: SVM-KNN: Discriminative nearest
neighbor classification for visual category recognition. In: CVPR (2006)
8. Grubinger, M.: Analysis and Evaluation of Visual Information Systems Perfor-
mance. PhD thesis, Victoria University, Melbourne, Australia (2007)
9. Shwartz, S.S., Singer, Y., Srebro, N.: Pegasos: Primal estimated sub-gradient solver
for SVM. In: ICML (2007)
10. Carneiro, G., Chan, A.B., Moreno, P.J., Vasconcelos, N.: Supervised learning of
semantic classes for image annotation and retrieval. PAMI 29(3), 394–410 (2007)
11. Makadia, A., Pavlovic, V., Kumar, S.: A New Baseline for Image Annotation. In:
Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part III. LNCS, vol. 5304,
pp. 316–329. Springer, Heidelberg (2008)
12. Weinberger, K.Q., Saul, L.K.: Distance metric learning for large margin nearest
neighbor classification. JMLR 10(2), 207–244 (2009)
13. Xiang, Y., Zhou, X., Chua, T.-S., Ngo, C.W.: A Revisit of generative models for
automatic image annotation using Markov random fields. In: CVPR (2009)
14. Jin, R., Wang, S., Zhou, Z.H.: Learning a distance metric from multi-instance
multi-label data. In: CVPR, pp. 896–902 (2009)
15. Wang, H., Huang, H., Ding, C.: Image annotation using multi-label correlated
Green's function. In: ICCV (2009)
16. Guillaumin, M., Mensink, T., Verbeek, J., Schmid, C.: Tagprop: Discriminative met-
ric learning in nearest neighbour models for image auto-annotation. In: ICCV (2009)
17. Dembczynski, K., Cheng, W., Hullermeier, E.: Bayes optimal multilabel classifica-
tion via probabilistic classifier chains. In: ICML (2010)
18. Parameswaran, S., Weinberger, K.Q.: Large margin multi-task metric learning. In:
NIPS (2010)
19. Zhang, S., Huang, J., Huang, Y., Yu, Y., Li, H., Metaxas, D.N.: Automatic image
annotation using group sparsity. In: CVPR (2010)
20. Wang, H., Huang, H., Ding, C.: Image annotation using bi-relational graph of
images and semantic labels. In: CVPR (2011)
21. Bucak, S.S., Jin, R., Jain, A.K.: Multi-label learning with incomplete class assign-
ments. In: CVPR, pp. 2801–2808 (2011)
22. Nakayama, H.: Linear distance metric Learning for large-scale generic image recog-
nition. PhD thesis, The University of Tokyo, Japan (2011)
23. Gupta, A., Verma, Y., Jawahar, C.V.: Choosing linguistics over vision to describe
images. In: AAAI (2012)
Dynamic Programming
for Approximate Expansion Algorithm

Olga Veksler

University of Western Ontario


London, Canada
olga@csd.uwo.ca
http://www.csd.uwo.ca/~olga/

Abstract. The expansion algorithm is a popular optimization method for
labeling problems. For many common energies, each expansion step can
be optimally solved with a min-cut/max-flow algorithm. While the ob-
served performance of max-flow for the expansion algorithm is fast, its
theoretical time complexity is worse than linear in the number of pixels.
Recently, Dynamic Programming (DP) was shown to be useful for 2D
labeling problems via a tiered labeling algorithm, although the struc-
ture of the allowed (tiered) labelings is quite restrictive. We show another use of DP
in a 2D labeling case. Namely, we use DP for an approximate expansion
step. Our expansion-like moves are more limited in structure than the
max-flow expansion moves. In fact, our moves are more restrictive than
the tiered labeling structure, but their complexity is linear in the number
of pixels, making them extremely efficient in practice. We illustrate the
performance of our DP-expansion on the Potts energy, but our algorithm
can be used for any pairwise energies. We achieve better efficiency with
almost the same energy compared to the max-flow expansion moves.

1 Introduction
The expansion algorithm [1] has been used for a variety of pixel labeling problems
arising in vision and graphics [2-6]. In a pixel labeling problem the goal is to
assign a label to each image pixel. In general, pixel labeling problems are NP-
hard [7], but can be solved efficiently in some special cases [8-12].
The expansion algorithm is an approximate optimization method with certain
optimality guarantees [1]. It works by iteratively visiting labels and perform-
ing an $\alpha$-expansion move on that label. An expansion move finds an optimal
subset of pixels to switch to label $\alpha$. Each expansion step is performed with a
min-cut/max-flow algorithm [13]. There is a max-flow algorithm [14] that was
specifically designed for vision problems and it performs very efficiently in prac-
tice on many instances. Nevertheless, the best known worst-case time complexity
of max-flow is worse than linear in the number of image pixels.
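For reference, the overall expansion algorithm simply cycles over the labels, as in the sketch below; expand_move stands for the optimal $\alpha$-expansion computed with min-cut/max-flow in [1], and it is exactly this component that the present paper replaces with a DP-based approximation (the names and loop structure here are illustrative, not taken from either paper).

def expansion_algorithm(labeling, labels, energy, expand_move):
    """Generic alpha-expansion loop (sketch). expand_move(labeling, alpha)
    returns a labeling in which every pixel either keeps its current label or
    switches to alpha; energy(labeling) evaluates the pairwise energy. Both are
    abstract callables (max-flow based in [1], DP based in this paper)."""
    improved = True
    while improved:                      # iterate until no label lowers the energy
        improved = False
        for alpha in labels:
            candidate = expand_move(labeling, alpha)
            if energy(candidate) < energy(labeling):
                labeling, improved = candidate, True
    return labeling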
Recently, a tiered labeling [12] algorithm based on Dynamic Programming
(DP) was proposed for labeling 2D scenes. This algorithm finds the exact

This research was supported, in part, by NSERC, CFI, and ERA grants.

A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 850863, 2012.

c Springer-Verlag Berlin Heidelberg 2012

global minimum, but only for scenes with a tiered label layout. In particular,
the scene is assumed to be divided by two simple horizontal curves into three
main tiers, top, middle, and bottom. The top and bottom tiers receive a fixed
label, T and B, respectively. The middle tier can be further subdivided into
tiers by vertical straight lines, and each sub-tier has a choice of one out of k
labels. This tiered scene layout is rather limited, but an optimal labeling can
be found in $O(kn^{1.5})$ time for a square image of $n$ pixels and the Potts energy
model. Most importantly, [12] shows that dynamic programming can be useful
for labeling 2D scenes, whereas traditionally, in computer vision, DP was applied
mostly to one-dimensional or low treewidth structures [15].
Inspired by tiered labeling, we propose an approximate expansion algorithm
based on fast DP. Just as in tiered labeling, the main idea of our DP-expansion
moves is to allow only a certain structure layout inside the image to switch to
label $\alpha$. This is in contrast to the classical $\alpha$-expansion, which allows an arbitrary
subset of pixels to switch to $\alpha$. By limiting the structure of the subsets that
can switch to $\alpha$, we modify the problem to make it suitable for dynamic
programming. In fact, our moves are even more restrictive than the tiered
labeling structure, but their complexity is linear in the number of pixels, making
them extremely efficient in practice. Our moves are based on the observation that,
instead of expanding on the optimal subset of pixels in a single step, it may be
sufficient in practice, although clearly suboptimal, to find an expansion region
where only the left (or right, or top, or bottom) boundaries are accurate. This
allows linear efficiency in the number of pixels. To recover accurate boundaries
in all directions, we apply our moves iteratively.
Like tiered labeling, due to DP, we are limited to pairwise energies with a 4-
or 8-connected neighborhood structure. This is a clear limitation, but longer-
range interactions are seldom used. In addition to speed, a clear advantage of
our DP-expansion moves over the max-flow expansion moves is that, like tiered
labeling, we can handle arbitrary pairwise terms, again due to DP. In fact, due
to the simplicity of our moves, we can handle arbitrary terms with the same
complexity as Potts pairwise terms, i.e. $O(n)$, where $n$ is the number of pixels.
We illustrate the performance of our DP-expansion on the Potts energy, but
our algorithm can be used for any pairwise energies. We achieve better efficiency
with almost the same energy compared to the max-flow expansion moves.
This paper is organized as follows. We review the most related work in Sec-
tion 2. We motivate and explain our DP-expansion moves in Section 3. We
describe our efficient optimization algorithm in Section 4, present experimental
results in Section 5, and conclude with a short discussion in Section 6.

2 Related Work

There has been a lot of interest in vision community to develop ecient opti-
mization methods for pixel labeling problems, see a survey in [16]. The popular
methods evaluated in [16] include the expansion and swap algorithms [7], loopy
belief propagation (LBP) [17], and sequential tree re-weighted message passing
852 O. Veksler

(TRW-S) [18]. One of the conclusions of [16] is that the expansion algorithm
performs best in terms of speed and accuracy for Potts energies.
Since our optimization is based on making approximate expansion moves, our
work is related to the general direction of move-making algorithms. In addition to
the expansion and swap algorithms [7], a number of new move making algorithms
have been developed recently. In [19], they develop a fusion move that instead of
expanding on a xed label , can expand on a dierent but single xed label for
each pixel in a move, thereby fusing the current solution and a new proposal
solution together. In [20, 21], range moves that are designed specically for
energies tuned to truncated convex pairwise terms are developed. These moves,
unlike the previous ones, are multi-label, they allow each pixel a choice of more
than one label to switch to. In [22] they develop multi-label moves for more
general energy functions. The approach in [23] also uses multi-label moves for
fast approximate optimization of a tractable (submodular) multi-label energies
that can, in fact, be exactly optimized by such methods as [10], but the time
and space complexities of the exact method are heavy.
Another line of related methods are those optimization techniques that limit
the structure of the allowed labellings. Tiered labeling [12], which inspired our
work, uses a limited scene layout and relies on DP for optimization. Another
very related approach is [24], where they use move-making algorithm based on
tiered labeling for approximate optimization of more general energies. Their time
complexity is rather large, since using tiered labeling for a single move has cost
higher than linear in the number of pixels. We construct a simpler move than
that in [24], but our eciency is much better. Another related method is [25].
In [25] they propose a min-cut based approach for a ve-label energy function
with certain label structure layout. This layout is motivated by the application of
indoor geometric scene labeling. The approach in [25] is approximate, unlike [12],
even though for a four-connected neighborhood system tiered labeling layout is
more general than that in [25]. Recently, [26] showed how to optimize the energy
in [25] globally an optimally with DP. Another DP-based optimization method
on 2D domains is [27], although their objective function is suitable only for
object/background segmentation, not a more general scene labeling problem.
Finally, since our goal is to speed-up the expansion algorithm, there are re-
lated methods in this direction as well, which signicantly improve eciency of
expansion. In [28], they attain speed-ups by considering the dual formulation.
In [29], they exploit several techniques, such as recycling information from pre-
vious instances, clever intialization, etc. In [30], they nd the order of iterations
over labels likely to lead to the fastest energy decrease. Our approach explores
a previously unexplored direction, namely we limit the allowed move structure.

3 DP-Expansion Moves
In this section, we explain and motivate our DP-expansion moves.
Dynamic Programming for Approximate Expansion Algorithm 853

3.1 Energy Function


In a pixel labeling problem we are given a set of pixels P and a set of labels L.
We need to assign a label from the label set to each image pixel p P. Let lp be
the label assigned to pixel p and l be the collection of all label assignments for
all pixels in P. The goal is to nd a labeling minimizing the following energy:
 
E(l) = Dp (lp ) + Vpq (lp , lq ), (1)
pP (p,q)N

where Dp (lp ) and Vpq (lp , lq ) are unary and pairwise energy terms, and N is a
neighborhood system on P. We assume N is 4-connected with ordered pairs
(p, q) such that p is directly to the left or above q. Our algorithm can be easily
extended to the 8-connected neighborhood that includes diagonal edges with
almost no increase in computational time. This is unlike max-ow expansion,
where including diagonal links noticeably increases computational time.
The unary term Dp () is the cost for assigning label to pixel p. It usu-
ally depends on the value of pixel p in the observed image. The pairwise term
Vpq (, ) species the cost for assigning labels and to pixels p and q. This
term is usually used to enforce some smoothness constraints on a low-cost label-
ing. Unlike most work on optimizing Eq. (1), we impose no constraints on unary
and pairwise terms. In particular, Vpq (, ) does not have to be submodular [31].

3.2 Review of Tiered Labeling


We rst review the tiered labeling algorithm [12] that inspired our approach.
Let T and B be the special labels allowed only for the top and bottom tiers and
M be the set of labels allowed for the middle tier. Therefore M = L\{T, B}.
Let p = (i, j) be the pixel in row i column j and li,j be its label. A labeling is
tiered if in each image column, there exists integers t and b s.t. all pixels strictly
above t are labeled as T , all pixels below b are labeled as B, and all other pixels
take one of the labels from the label set M. That is for each column j, t b
s.t. i < t, li,j = T , i b, li,j = B, and t i < b, li,j L. Notice that if
t = b, then there is no middle tier label in column j, only labels T and B. This
property is important for our DP-expansion moves. A simplied tiered structure,
where the middle tier has only one (green) label, is illustrated in Figure 1.

Fig. 1. Simplied version of tiered labeling: the middle tier is only allowed one label
854 O. Veksler

Similar to [24], we can use tiered labeling to nd an approximate -expansion


move in the following manner. Let l be the current labeling. The bottom and
the top tiers will correspond to the pixels that do not switch their label in the -
expansion move. That is for each image pixel p, labels B and T correspond to the
old label of pixel p, namely lp . We allow only the label for the middle tier, that
is M = {}. The optimal tiered labeling under such construction corresponds to
the optimal set of pixels with restricted structure to switch to . Namely, we can
nd the best contiguous set of pixels in each column to switch. Some columns
can choose not to switch any pixels to . If no column makes a switch, the
optimal -expansion is empty and all pixels retain old labels. This is important
for ensuring that the energy never goes up in a move-making algorithm.
This approach is clearly inferior to the max-ow expansion that can nd an
unrestricted subset of pixels to switch to . However, the combined result of
repeated applications of this move can nd rather complex regions to switch
to , see [24]1 . The move described modies the top and bottom boundaries of
the region to switch to . We can improve the accuracy further by repeating
this move in the orthogonal direction, by turning the image by /2, nding the
optimal contiguous subset of pixels in each row to switch to .
The approach outlined above (basically following [24]) is expensive. Tiered
labeling complexity is O(n2 m), where n is the number of rows and m is the
number of columns in the image. Tiered labeling works by collapsing each image
column to a single state. Each column has two possible break points t and b, so
the number of possible states is O(n2 ). The algorithm in [12] develops a clever
speed-up strategy that allows to reduce the DP complexity from O(n4 m) to just
O(n2 m). Still, the complexity is worse than linear.

3.3 DP-Expansion

To speed up, we need to reduce the number of states in each column further.
Observe that in the approach outlined in Section 3.2 a single move is capable
of carving out a relatively complex upper and lower boundaries of a region
to switch to , leaving left and right boundaries relatively simpler. Being able
to carve out complex boundaries is clearly important for faithfully recovering
object boundaries. However, it may be sucient to work on one boundary at
a time. Our main idea is to develop an even simpler move. Our move allows a
complex boundary in only one direction. The trade-o is that the state space is
reduced and we can achieve a linear time complexity DP.
This motivates the following DP-based moves. We have four dierent move
types, top-anchored, left-anchored, bottom-anchored, and right-anchored. For
the top-anchored move, if any pixels in column j decide to switch their labels
to , they must form a contiguous set. Furthermore this set must contain the
topmost, row 0, pixel. That is, for each column j, only pixel subsets of form
{(i, j)|0 i < b} for some b n are allowed to switch to label . Here n is the
1
Actually, [24] introduced more powerful moves than we just described, but we believe
these moves will also work almost as well as those in [24].
Dynamic Programming for Approximate Expansion Algorithm 855

number of rows in the image. This move is illustrated in Figure 2(a). It allows
recovering of a region with a complex boundary on the bottom.
The other moves, illustrated in Figure 2(b,c,d) are similar. They require any
switching to start on the left, bottom, right of the image, respectively, and
they are capable of recovering a region with complex boundaries on the right,
top, and left, respectively.

(a) top-anchored (b) left-anchored (c) bottom-anchored (d) right-anchored

Fig. 2. Four DP-expansion moves. The old label is white and the new label is green.
Top-anchored move in (a): if any pixels in a column switch to label , they must form
a contiguous region and the top pixel must switch to . Left-, bottom-, and right-
anchored moves require switching to start at the left, bottom, right of the image.

We show how to implement these moves in time linear in the number of pixels
in Section 4. While very fast, these moves would not be powerful if they were
applied only to the whole image, since any switch to a new label must start at
the image top (or bottom, left, right). We found it eective to apply these moves
to image blocks of dierent sizes in random order.
When applying a DP-expansion move to an image block, only the unary and
pairwise terms that depend on pixels entirely within the block are involved.
However, the pairwise terms Vpq that involve one pixel in the block and one
pixel outside the block are also eected by the changes within the block, and
must be addressed to ensure the energy strictly decreases. Like previous work [1],
the x is simple. Let l be the old labeling. Let Vpq have p inside the block and
q outside the block. We add Vpq (lp , lq ) to Dp (lp ) and Vpq (, lq ) to Dp ().

4 Fast Optimization
We now explain how to do fast optimization for the DP-expansion moves dened
in Sec. 3.3. Even though most of our moves are performed on image blocks, it is
easier to explain how optimization works assuming it is done on the whole image,
avoiding the extra sub-indexing notation. The x that needs to be done when
performing the moves on image blocks was explained at the end of Section 3.3.
Our fast DP derivation closely follows the tiered labeling speed-ups in [12].
However, our speed-up of DP is simpler since we have a less complex state space.
We explain optimization for the case of top-anchored moves. The other cases are
handled by appropriately rotating the image.
For most derivations in this paper, for pixel p = (i, j) it is convenient to
denote Dp as Dij , and lp as li,j . Also, if p = (i, j) and q = (i + 1, j), that is a
v
vertical term, then we denote Vpq as Vi,j . If p = (i, j), and q = (i, j + 1), that is a
856 O. Veksler

Fig. 3. Site j in state d. Pixels above row d (in green) are labeled as in column j.
Pixels below and including row d (in white) in column j retain the old label.

h
horizontal term, we denote Vpq as Vi,j . Let image have n rows, indexed from 0 to
n 1, and m columns, indexed from 0 to m 1, see Fig. 3. Let the old labeling
be l and we want to nd the best top-anchored expansion move. That is, in
each column, we want to nd the best contiguous subset of pixels to switch to
, provided that any switch in column j, must begin at row 0, i.e. at pixel (0, j).

4.1 Reduction to Dynamic Programming


We represent each image column j with a single site j, visually collapsing it
vertically. Thus there are m sites indexed by 0, 1..., m 1. The state of site j is
denoted by sj . The number of possible states sj corresponds to the number of
possible labellings in column j. A possible state for column j can be represented
with one integer 0 d n, which is decoded as follows. If sj = d, then all pixels
above row d are labeled as and all pixels at row d and below retain their old
label. See Figure 3 for illustration. If d = 0 than no pixels in column j get label
. The total number of states for column j is n + 1. Let s be the assignment of
states for all sites.
Let l be a labeling satisfying the top-anchored move constraints and s be the
corresponding assignment of states of the one-dimensional problem. The energy
function in Equation (1) can be written as:


m1 
m2
E(s) = Uj (sj ) + Bj (sj , sj+1 ), (2)
j=0 j=0

where Uj (sj ) adds up all the unary terms and pairwise terms of the energy in
Equation (1) that are contained entirely in column j, and Bj (sj , sj+1 ) adds up
all the pairwise terms of the energy in Equation (1) that rely on exactly one pixel
from column j and one pixel from column j + 1. Let us dene a labeling lj,d as
follows. Pixels in columns other than j can have any labels. Pixels in column j
and rows 0 to d 1 have label . Pixels in column j and rows d to n 1 have
the same label as they have in the old labeling l . Then
Dynamic Programming for Approximate Expansion Algorithm 857


n1
j,s

n2
j,s j,s
v
Uj (sj ) = Dij (li,j j ) + Vi,j (li,j j , li+1,j
j
), (3)
i=0 i=0

and

n1
j,s j+1,s
h
Bj (sj , sj+1 ) = Vi,j (li,j j , li,j+1 j+1 ). (4)
i=0

As in [12], any Uj and Bj can be computed in O(1) time with precomputed


cumulative sums of data terms, horizontal and vertical terms separately in each
column. This is like the well known integral image technique [32], only applied
to columns and therefore slightly simpler.
The standard DP for optimizing the energy in Equation (2) proceeds as fol-
lows. We build look-up tables, Ej , for 0 j < m. Ej (t) is the cost of the best
assignment to the rst j + 1 states, namely states (0, 1, ..., j) where the last state
is xed to have value t, i.e. sj+1 = t. Ej are computed in order of increasing j
using the following recurrence, and for all t {0, 1, ..., n}:

E0 (t) = U0 (t), (5)


Ej (t) = Uj (t) + min

[Ej1 (t ) + Bj1 (t , t)] . (6)
t

We also compute the optimum previous state in a look-up table Pj (t). It is used
to trace back the optimal sequence of states, starting from the last state and
going backwards. That is the optimal state for the last site is computed with
sm1 = argmaxt Em1 (t). After that, the previous optimal state is computed
with sj1 = Pj (sj ), where sj is the optimal state for site j.
For a xed j, we have to compute Ej (t) for n+ 1 possible values of t . For each
xed j and t, we need to search over all previous states t , and there are n + 1 of
them. Therefore, computing Ej (t) is O(n) for a xed j and t. Computing Ej (t)
is O(n2 ) for all n + 1 values of t, and, nally, computing Ej (t) for all m values of
j is O(n2 m), which is too expensive. We explain how to speed it up to O(nm),
i.e. linear in the number of pixels in Section 4.2.

4.2 Fast Dynamic Programming

We now explain how, for a xed j and t, compute Ej (t) in O(1) time instead
of the straightforward O(n) time. This will result in the total time complexity
O(nm), i.e. linear in the number of pixels.
The expensive part of Ej (t) computation is the search over the previous states
t . Let us give this part its own name Fj (t):

Fj (t) = min

[Ej1 (t ) + Bj1 (t , t)] . (7)
t

We can break optimization of Fj (t) in two cases, for t t and t > t:


Fj (t) = min {Tj (t), Lj (t)} , (8)
858 O. Veksler

where
Tj (t) = min

[Ej1 (t ) + Bj1 (t , t)] (9)
t t

and
Lj (t) = min

[Ej1 (t ) + Bj1 (t , t)] (10)
t >t

(a) (b)

Fig. 4. Case 1: Minimization over t t. (a) The state of column j is xed to t, which
means that all pixels above row t in column j take new label . All possible choices for
t in column j 1 are illustrated. We have to nd the smallest cost choice. Below the
red line, all four choices have exactly the same labels, therefore the costs are dierent
only for the energy terms that include pixels above the red line. (b) is the same as in
(a) but now column j is in state n. Still the costs of four possible states diers only in
the energy terms above the red line, since labels below the red line are the same.

We are going to minimize these two cases separately, and each one in O(1)
time. Consider rst computation of Tj (t), i.e. minimization is over t t. In this
case, we are assigning to pixels (0, j), (1, j), ...(t 1, j) and the rest of pixels in
column j retain their old label. We have to nd the best possible state t t in
the previous column j 1. That is the previous column we can have in rows 0
through t 1, where t can range from 0 to t. This is illustrated in Figure 4(a).
When considering which t 0, ..., t is the optimal state to go through in
column j 1, observe that no matter which particular value we choose for t ,
both in column j 1 and column j, in rows including and below t (below the
red line in Figure 4(a)), all the pixels have the same labels (the old labels they
had in l ). Therefore all pairwise terms V and unary terms D of the original 2D
labeling that depend only on pixels in rows below and including t are the same
for all choices of t 0, ..., t. This means that we can nd the optimal t only
considering the unary and pairwise terms that involve pixels above row t.
Furthermore, consider Figure 4(b). It illustrates the case when we are com-
puting Tj (n), that is the jth column has all pixels labeled as . Notice that out
of t t, the best t is the same for both (a) and (b). This is because again,
the best t is independent of the labels of pixels in rows below the red line.
And above the red line, all cases for (a) and (b) are identical. Thus t mini-
mizing (a) and (b) is the same, but the values at the minimum t are dierent.
Dynamic Programming for Approximate Expansion Algorithm 859

The dierence between minimum values depends only on t, let us denote it by


k(t). To compute k(t), we just have to add up all the energy terms in (a) that
depend on pixels below the red line and subtract all the energy terms in (b) that
depend on pixels below the red line. Therefore we get:

Tj (t) = min

[Ej1 (t ) + Bj1 (t , t)] = k(t) + min

[Ej1 (t ) + Bj1 (t , n)] , (11)
t t t t

where

n1
( )
  
k(t) = h
Vi,j1 (li,j1 , li,j ) Vi,j1
h
(li,j1 , ) .
i=t

We can compute k(t) in O(1) time using cumulative sums on the columns that
store pairwise costs V h , in the same way as Bj terms in Equation (4).
Let us now turn to the ecient computation of the second term in Equa-
tion (11). First dene a table C of size n, indexed starting at 0, as:

C(i) = Ej1 (i) + Bj1 (i, n)

We compute running minimum of C as another array C  , also of size n:

C  (i) = min

C(i). (12)
i i

We can compute C  in O(n) time. Initialize C  (0) = C(0). Then update, in order
of increasing i, C  (i) = min {C(i), C  (i 1)}. After the table C  is computed,
the minimum in Equation (11) can be simply looked up as C  (t). Since we reuse
C  for all t in column j, computing Tj (t) in Equation (11) takes O(n) time for
all t {0, 1, ..., n 1}.

(a) (b)

Fig. 5. Case 2: Minimization over t > t. (a) When column j is xed to state t, all
possible choices for t in column j 1 are illustrated. Above the red line, all four choices
have exactly the same labels, therefore the costs are dierent only for the energy terms
that include pixels below the red line. (b) is the same as in (a) but now column j is in
state 0. Still the costs of four possible states diers only in the energy terms below the
red line, since labels above the red line are the same.
860 O. Veksler

The computation of Lj (t), i.e. minimization is over t > t is handled very


similarly. The illustration is in Figure 5. Observe that in rows smaller than t,
all pixel labels are the same (and equal to ) for all dierent choices of t , see
Figure 5(a). This means that we can nd the optimal t without considering any
energy terms that depend entirely on pixels in rows smaller than or equal to t.

(a) E = 221,884, 5.2 sec. (b) E=223,314, 10.2 sec.

(c) E = 196,088, 5.5 sec. (d)E=195,869, 11.9 sec.

(e) E=350,958, 1.1 sec. (f) E = 346,085, 2.2 sec.

Fig. 6. Results on Middlebury stereo pairs. The left column shows our results, the right
one max-ow expansion results. We report the energy and the running time in seconds
for each case.

Furthermore, consider Figure 5(b). It illustrates the case when we are com-
puting Tj (0), that is the jth column retains old labels for all pixels. Notice that
the minimum state t is the same between (a) and (b), when only t > t are
considered. The dierence in minimum values between (a) and (b) dier by k  (t)
that depends only on t, not t . Therefore we get:
Li (t) = min

[Ei1 (t ) + Bi1 (t , t)] = k  (t) + min

[Ei1 (t ) + Bi1 (t , 0)] , (13)
t >t t t
Dynamic Programming for Approximate Expansion Algorithm 861

where

t1
( )
k  (t) = h
Vi,j1 (, ) Vi,j1
h 
(, li,j ) .
i=0

We compute k (t) in O(1) time using cumulative sums, similar to that of k(t).
Let us dene a table H of size n as:

H(i) = Ej1 (i) + Bj1 (i, 0).

We need to compute running minimum for array H, but now in the other direc-
tion than what we used for array C. Dene running minimum H  as:

H  (i) = min

H(i ). (14)
i i

We can compute H  in O(n) time. Initialize H  (n1) = H(n1). Then compute,


in order of decreasing i, H  (i) = min {H(i), H  (i + 1)}. After the table H  is
computed, the minimum in Equation (13) can be simply looked up as H  (t).
Since we reuse H  for all t in column j, computing Lj (t) in Equation (11) takes
O(n) time for all t {0, 1, ..., n 1}.
The parent table P that is used to recover the optimal state in DP is computed
very similar to the tables E, therefore we omit the description.

5 Experimental Results
In this section, we evaluate the performance of our DP-expansion algorithm
and compare it to the max-ow expansion. We do not compare our method
to previous work that improves eciency of expansion, such as [2830], since
our goal is to introduce a previously unexplored direction for speed-ups, rather
than compete with previous work. In addition, since our approach is orthogonal
to [2830], we can reuse some of their ideas.
Even though we can handle arbitrary pairwise potentials, we evaluate here
only the Potts model, namely Vpq (lp , lq ) = wpq min{|lp lq |, 1}. As in previous
work [1], wpq depends on the strength of the gradient between pixels p and q in
the observed image. The stronger the gradient, the smaller is wpq .
We evaluate our performance on the application of stereo correspondence and
on the set of ten Middlebury stereo pairs [33, 34]. For the data term, we use a
simple truncated absolute value dierence between the pixel in the left image
and the pixel in the right image shifted according to the disparity label. That
is if L is the left image and R is the right image, Dp (lp ) = min{T, |L(p)
R(p lp )|}. We used T = 20. Since our goal is to evaluate our DP-expansion
moves as an optimization method, we did not tune parameters for the best stereo
correspondence performance. We ran the expansion algorithm for two iterations,
since it has been observed that the energy decreases most during the rst two
iterations. We ran our algorithm for box sizes from 1/10 of the image to the
largest image dimension, increasing them by a factor of 2 and making them
overlap by half of their side length.
862 O. Veksler

It has been shown [1] that rst two iterations decrease the energy the most
for the max-ow expansion. Thus to make comparisons more fair to max-ow
expansion, we run it for two iterations. On average, max-ow expansion nds an
energy 1.3% lower, where we measured energy dierence between DP-expansion
and max-ow expansion relative to the energy computed by max-ow expansion.
The largest relative energy dierence was 3%, and in some case DP-expansion
returned a slightly lower energy. This is only because max-ow expansion was
not run until convergence. The running time of our approach was, on average,
1.8 times faster. Fig. 6 illustrates some of the results.

6 Discussion
We have presented DP-expansion moves that work by nding a restricted struc-
ture subset of pixels to switch to . The restriction allows to implement them
very eciently, in linear time. There are many other aspects for future research.
First, there are many variants of such moves to explore. For example, it is easy
to extend our moves to the case when the anchoring happens not necessarily
at the same row for each column. Right now, the anchoring is done in row 0 (or
at the left, right, bottom of the picture). It is easy to see that as long as the
one of the end points is xed, not necessarily to the same value for each column,
we can compute the move with with linear time eciency. These moves can be
used for taking a low-quality but cheap solution, exploring the connected sets
of pixels assigned to the same label, and improving their boundaries one side at
a time, leaving the other sides xed. It is also easy to parallelize our algorithm,
expanding on a non-overlapping set of boxes at the same time. Most of the boxes
are small to medium in size and it is easy to come up with a non-overlapping
subsets of them. Finally, just as there are several speed-ups to the basic max-ow
expansion suggested in the literature [29, 30], based on clever heuristics, there
are multiple speed-ups possible for our approach too.
References
1. Boykov, Y., Veksler, O., Zabih, R.: Ecient approximate energy minimization via
graph cuts. PAMI 26, 12221239 (2001)
2. Wills, J., Agarwal, S., Belongie, S.: What went where. In: CVPR, vol. I, pp. 3744
(2003)
3. Agarwala, A., Dontcheva, M., Agrawala, M., Drucker, S., Colburn, A., Curless, B.,
Salesin, D., Cohen, M.: Interactive digital photomontage. ACM Transactions on
Graphics 23, 294302 (2004)
4. Kwatra, V., Schdl, A., Essa, I., Turk, G., Bobick, A.: Graphcut textures: Image
and video synthesis using graph cuts. SIGGRAPH 2003 22, 277286 (2003)
5. Hoiem, D., Rother, C., Winn, J.: 3d layout crf for multi-view object class recogni-
tion and segmentation. In: CVPR, pp. 18 (2007)
6. Delong, A., Osokin, A., Isack, H.N., Boykov, Y.: Fast approximate energy mini-
mization with label costs. IJCV 96, 127 (2012)
7. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via
graph cuts. PAMI 23, 12221239 (2001)
8. Greig, D., Porteous, B., Seheult, A.: Exact maximum a posteriori estimation for
binary images. Journal of the Royal Statistical Society, Series B 51, 271279 (1989)
Dynamic Programming for Approximate Expansion Algorithm 863

9. Ishikawa, H.: Exact optimization for markov random elds with convex priors.
PAMI 25, 13331336 (2003)
10. Schlesinger, D., Flach, B.: Transforming an arbitrary minsum problem into a binary
one. Technical Report TUD-FI06-01, Dresden University of Technology (2006)
11. Kolmogorov, V., Rother, C.: Minimizing nonsubmodular functions with graph cuts
- a review. PAMI 29, 12741279 (2007)
12. Felzenszwalb, P.F., Veksler, O.: Tiered scene labeling with dynamic programming.
In: CVPR (2010)
13. Ford, L., Fulkerson, D.: Flows in Networks. Princeton University Press (1962)
14. Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-ow
algorithms for energy minimization in vision. PAMI 26, 11241137 (2004)
15. Amini, A., Weymouth, T., Jain, R.: Using dynamic programming for solving vari-
ational problems in vision. PAMI 12, 855867 (1990)
16. Szeliski, R., Zabih, R., Scharstein, D., Veksler, O., Kolmogorov, V., Agarwala, A.,
Tappen, M.F., Rother, C.: A comparative study of energy minimization methods for
markov random elds with smoothness-based priors. PAMI 30, 10681080 (2008)
17. Pearl, J.: Probabilistic reasoning in intelligent systems: networks of plausible in-
ference. Morgan Kaufmann (1988)
18. Kolmogorov, V.: Convergent tree-reweighted message passing for energy minimiza-
tion. PAMI 28, 15681583 (2006)
19. Lempitsky, V.S., Rother, C., Roth, S., Blake, A.: Fusion moves for markov random
eld optimization. PAMI 32, 13921405 (2010)
20. Veksler, O.: Graph cut based optimization for mrfs with truncated convex priors.
In: CVPR, pp. 18 (2007)
21. Kumar, M.P., Torr, P.H.S.: Improved moves for truncated convex models. In:
Koller, D., Schuurmans, D., Bengio, Y., Bottou, L. (eds.) NIPS, pp. 889896 (2009)
22. Gould, S., Amat, F., Koller, D.: Alphabet soup: A framework for approximate
energy minimization. In: CVPR, pp. 903910 (2009)
23. Carr, P., Hartley, R.: Solving multilabel graph cut problems with multilabel swap.
In: DICTA (2009)
24. Vineet, V., Warrell, J., Torr, P.: A tiered move-making algorithm for general pair-
wise mrfs. In: CVPR (2012)
25. Liu, X., Veksler, O., Samarabandu, J.: Order preserving moves for graph cut based
optimization. PAMI (2010)
26. Bai, J., Song, Q., Veksler, O., Wu, X.: Fast dynamic programming for labeling
problems with ordering constraints. In: CVPR (2012)
27. Asano, T., Chen, D., Katoh, N., Tokuyama, T.: Ecient algorithms for
optimization-based image segmentation. IJCGA 11, 145166 (2001)
28. Komodakis, N., Tziritas, G., Paragios, N.: Fast, approximately optimal solutions
for single and dynamic mrfs. In: CVPR (2007)
29. Alahari, K., Kohli, P., Torr, P.H.S.: Reduce, reuse & recycle: Eciently solving
multi-label mrfs. In: CVPR (2008)
30. Batra, D., Kohli, P.: Making the right moves: Guiding alpha-expansion using local
primal-dual gaps. In: CVPR, pp. 18651872 (2011)
31. Kolmogorov, V., Zabih, R.: What energy function can be minimized via graph
cuts? PAMI 26, 147159 (2004)
32. Viola, P.A., Jones, M.J.: Rapid object detection using a boosted cascade of simple
features. In: CVPR (1), pp. 511518 (2001)
33. Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo
correspondence algorithms. IJCV 47, 742 (2001)
34. Scharstein, D., Szeliski, R.: High-accuracy stereo depth maps using structured light.
In: CVPR, pp. 195202 (2003)
Real-Time Compressive Tracking

Kaihua Zhang1 , Lei Zhang1 , and Ming-Hsuan Yang2


1
Dept. of Computing, The Hong Kong Polytechnic University
2
Electrical Engineering and Computer Science, University of California at Merced
{cskhzhang,cslzhang}@comp.polyu.edu.hk, mhyang@ucmerced.edu

Abstract. It is a challenging task to develop eective and ecient ap-


pearance models for robust object tracking due to factors such as pose
variation, illumination change, occlusion, and motion blur. Existing on-
line tracking algorithms often update models with samples from obser-
vations in recent frames. While much success has been demonstrated,
numerous issues remain to be addressed. First, while these adaptive
appearance models are data-dependent, there does not exist sucient
amount of data for online algorithms to learn at the outset. Second,
online tracking algorithms often encounter the drift problems. As a re-
sult of self-taught learning, these mis-aligned samples are likely to be
added and degrade the appearance models. In this paper, we propose a
simple yet eective and ecient tracking algorithm with an appearance
model based on features extracted from the multi-scale image feature
space with data-independent basis. Our appearance model employs non-
adaptive random projections that preserve the structure of the image
feature space of objects. A very sparse measurement matrix is adopted
to eciently extract the features for the appearance model. We com-
press samples of foreground targets and the background using the same
sparse measurement matrix. The tracking task is formulated as a binary
classication via a naive Bayes classier with online update in the com-
pressed domain. The proposed compressive tracking algorithm runs in
real-time and performs favorably against state-of-the-art algorithms on
challenging sequences in terms of eciency, accuracy and robustness.

1 Introduction

Despite that numerous algorithms have been proposed in the literature, object
tracking remains a challenging problem due to appearance change caused by
pose, illumination, occlusion, and motion, among others. An eective appearance
model is of prime importance for the success of a tracking algorithm that has
been attracting much attention in recent years [110]. Tracking algorithms can
be generally categorized as either generative [1, 2, 6, 10, 9] or discriminative [3
5, 7, 8] based on their appearance models.
Generative tracking algorithms typically learn a model to represent the target
object and then use it to search for the image region with minimal reconstruction
error. Black et al. [1] learn an o-line subspace model to represent the object of
interest for tracking. The IVT method [6] utilizes an incremental subspace model

A. Fitzgibbon et al. (Eds.): ECCV 2012, Part III, LNCS 7574, pp. 864877, 2012.

c Springer-Verlag Berlin Heidelberg 2012
Real-Time Compressive Tracking 865

to adapt appearance changes. Recently, sparse representation has been used in


the 1 -tracker where an object is modeled by a sparse linear combination of target
and trivial templates [10]. However, the computational complexity of this tracker
is rather high, thereby limiting its applications in real-time scenarios. Li et al. [9]
further extend the 1 -tracker by using the orthogonal matching pursuit algorithm
for solving the optimization problems eciently. Despite much demonstrated
success of these online generative tracking algorithms, several problems remain
to be solved. First, numerous training samples cropped from consecutive frames
are required in order to learn an appearance model online. Since there are only
a few samples at the outset, most tracking algorithms often assume that the
target appearance does not change much during this period. However, if the
appearance of the target changes signicantly at the beginning, the drift problem
is likely to occur. Second, when multiple samples are drawn at the current target
location, it is likely to cause drift as the appearance model needs to adapt to
these potentially mis-aligned examples [8]. Third, these generative algorithms do
not use the background information which is likely to improve tracking stability
and accuracy.
Discriminative algorithms pose the tracking problem as a binary classication
task in order to nd the decision boundary for separating the target object from
the background. Avidan [3] extends the optical ow approach with a support
vector machine classier for object tracking. Collins et al. [4] demonstrate that
the most discriminative features can be learned online to separate the target
object from the background. Grabner et al. [5] propose an online boosting algo-
rithm to select features for tracking. However, these trackers [35] only use one
positive sample (i.e., the current tracker location) and a few negative samples
when updating the classier. As the appearance model is updated with noisy and
potentially misaligned examples, this often leads to the tracking drift problem.
Grabner et al. [7] propose an online semi-supervised boosting method to allevi-
ate the drift problem in which only the samples in the rst frame are labeled
and all the other samples are unlabeled. Babenko et al. [8] introduce multiple
instance learning into online tracking where samples are considered within posi-
tive and negative bags or sets. Recently, a semi-supervised learning approach [11]
is developed in which positive and negative samples are selected via an online
classier with structural constraints.
In this paper, we propose an eective and ecient tracking algorithm with
an appearance model based on features extracted in the compressed domain.
The main components of our compressive tracking algorithm are shown by Fig-
ure 1. Our appearance model is generative as the object can be well represented
based on the features extracted in the compressive domain. It is also discrimi-
native because we use these features to separate the target from the surround-
ing background via a naive Bayes classier. In our appearance model, features
are selected by an information-preserving and non-adaptive dimensionality re-
duction from the multi-scale image feature space based on compressive sensing
theories [12, 13]. It has been demonstrated that a small number of randomly
generated linear measurements can preserve most of the salient information and
866 K. Zhang, L. Zhang, and M.-H. Yang

x x x x x x
x x x x x x x x x "
xx xx xx " x x x " xx xx xx
x x x xx xx xx
# x x x


x x x
x x x x x x
x x x x x x "
# xx xx xx " x x x
x x x xxx
x x x x x x "

xx xx xx x x x
Sparse xxx
Multiscale Multiscale measurement Compressed
Frame(t) Samples Classifer
filter bank image features matrix vectors

(a) Updating classier at the t-th frame


Sparse
Multiscale measurement
filter bank Classifier
matrix
Multiscale Compressed
image features vectors Sample with maximal
Frame(t+1) classifier response

(b) Tracking at the (t + 1)-th frame

Fig. 1. Main components of our compressive tracking algorithm

allow almost perfect reconstruction of the signal if the signal is compressible such
as natural images or audio [1214]. We use a very sparse measurement matrix
that satises the restricted isometry property (RIP) [15], thereby facilitating ef-
cient projection from the image feature space to a low-dimensional compressed
subspace. For tracking, the positive and negative samples are projected (i.e.,
compressed) with the same sparse measurement matrix and discriminated by
a simple naive Bayes classier learned online. The proposed compressive track-
ing algorithm runs at real-time and performs favorably against state-of-the-art
trackers on challenging sequences in terms of eciency, accuracy and robustness.

2 Preliminaries
We present some preliminaries of compressive sensing which are used in the
proposed tracking algorithm.

2.1 Random Projection


A random matrix R Rnm whose rows have unit length projects data from
the high-dimensional image space x Rm to a lower-dimensional space v Rn

v = Rx, (1)

where n $ m. Ideally, we expect R provides a stable embedding that approxi-


mately preserves the distance between all pairs of original signals. The Johnson-
Lindenstrauss lemma [16] states that with high probability the distances be-
tween the points in a vector space are preserved if they are projected onto a
randomly selected subspace with suitably high dimensions. Baraniuk et al. [17]
proved that the random matrix satisfying the Johnson-Lindenstrauss lemma also
Real-Time Compressive Tracking 867

holds true for the restricted isometry property in compressive sensing. There-
fore, if the random matrix R in (1) satises the Johnson-Lindenstrauss lemma,
we can reconstruct x with minimum error from v with high probability if x is
compressive such as audio or image. We can ensure that v preserves almost all
the information in x. This very strong theoretical support motivates us to ana-
lyze the high-dimensional signals via its low-dimensional random projections. In
the proposed algorithm, we use a very sparse matrix that not only satises the
Johnson-Lindenstrauss lemma, but also can be eciently computed for real-time
tracking.

2.2 Random Measurement Matrix

A typical measurement matrix satisfying the restricted isometry property is the


random Gaussian matrix R Rnm where rij N (0, 1), as used in numerous
works recently [14, 9, 18]. However, as the matrix is dense, the memory and
computational loads are still large when m is large. In this paper, we adopt a
very sparse random measurement matrix with entries dened as

1 with probability 2s 1

rij = s 0 with probability 1 1s (2)



1 with probability 2s1
.

Achlioptas [16] proved that this type of matrix with s = 2 or 3 satises the
Johnson-Lindenstrauss lemma. This matrix is very easy to compute which re-
quires only a uniform random generator. More importantly, when s = 3, it is
very sparse where two thirds of the computation can be avoided. In addition, Li
et al. [19] showed that for s = O(m) (x Rm ), this matrix is asymptotically
normal. Even when s = m/ log(m), the random projections are almost as accu-
rate as the conventional random projections where rij N (0, 1). In this work,
we set s = m/4 which makes a very sparse random matrix. For each row of R,
only about c, c 4, entries need to be computed. Therefore, the computational
complexity is only O(cn) which is very low. Furthermore, we only need to store
the nonzero entries of R which makes the memory requirement also very light.

3 Proposed Algorithm

In this section, we present our tracking algorithm in details. The tracking prob-
lem is formulated as a detection task and our algorithm is shown in Figure 1.
We assume that the tracking window in the rst frame has been determined. At
each frame, we sample some positive samples near the current target location
and negative samples far away from the object center to update the classier. To
predict the object location in the next frame, we draw some samples around the
current target location and determine the one with the maximal classication
score.
868 K. Zhang, L. Zhang, and M.-H. Yang

x 1 v1

x 2 v2

u # vi = rij x j
j
v
# n

Rnm



x m

Fig. 2. Graphical representation of compressing a high-dimensional vector x to a low-


dimensional vector v. In the matrix R, dark, gray and white rectangles represent neg-
ative, positive, and zero entries, respectively. The blue arrows illustrate that one of
nonzero entries of one row of R sensing an element in x is equivalent to a rectangle
lter convolving the intensity at a xed position of an input image.

3.1 Ecient Dimensionality Reduction


For each sample z Rwh , to deal with the scale problem, we represent it by
convolving z with a set of rectangle lters at multiple scales {h1,1 , . . . , hw,h }
dened as 
1, 1 x i, 1 y j
hi,j (x, y) = (3)
0, otherwise
where i and j are the width and height of a rectangle lter, respectively. Then,
we represent each ltered image as a column vector in Rwh and then concatenate
these vectors as a very high-dimensional multi-scale image feature vector x =
(x1 , ..., xm ) Rm where m = (wh)2 . The dimensionality m is typically in the
order of 106 to 1010 . We adopt a sparse random matrix R in (2) with s = m/4
to project x onto a vector v Rn in a low-dimensional space. The random
matrix R needs to be computed only once o-line and remains xed throughout
the tracking process. For the sparse matrix R in (2), the computational load is
very light. As shown by Figure 2, we only need to store the nonzero entries in
R and the positions of rectangle lters in an input image corresponding to the
nonzero entries in each row of R. Then, v can be eciently computed by using R
to sparsely measure the rectangular features which can be eciently computed
using the integral image method [20].

3.2 Analysis of Low-Dimensional Compressive Features


As shown in Figure 2, each element vi in the low-dimensional feature v Rn
is a linear combination of spatially distributed rectangle features at dierent
scales. As the coecients in the measurement matrix can be positive or negative
(via (2)), the compressive features compute the relative intensity dierence in
a way similar to the generalized Haar-like features [8] (See also Figure 2). The
Haar-like features have been widely used for object detection with demonstrated
success [20, 21, 8]. The basic types of these Haar-like features are typically de-
signed for dierent tasks [20, 21]. There often exist a very large number of
Real-Time Compressive Tracking 869

0.2 0.1 0.12

0.1
0.08
0.15
0.08
0.06
0.1 0.06
0.04
0.04
0.05
0.02
0.02

0 0 0
0 0.5 1 1.5 2 0 2 4 6 8 10 12 0 5 10 15
5 4 4
x 10 x 10 x 10

Fig. 3. Probability distributions of three dierent features in a low-dimensional space.


The red stair represents the histogram of positive samples while the blue one represents
the histogram of negative samples. The red and blue lines denote the corresponding
estimated distributions by our incremental update method.

Haar-like features which make the computational load very heavy. This problem
is alleviated by boosting algorithms for selecting important features [20, 21]. Re-
cently, Babenko et al. [8] adopted the generalized Haar-like features where each
one is a linear combination of randomly generated rectangle features, and use
online boosting to select a small set of them for object tracking. In our work,
the large set of Haar-like features are compressively sensed with a very sparse
measurement matrix. The compressive sensing theories ensure that the extracted
features of our algorithm preserve almost all the information of the original im-
age. Therefore, we can classify the projected features in the compressed domain
eciently without curse of dimensionality.

3.3 Classier Construction and Update

For each sample z Rm , its low-dimensional representation is v = (v1 , . . . , vn )


Rn with m . n. We assume all elements in v are independently distributed
and model them with a naive Bayes classier [22],
n  n
p(vi |y = 1)p(y = 1) p(vi |y = 1)
H(v) = log i=1
n = log , (4)
i=1 p(vi |y = 0)p(y = 0) i=1
p(vi |y = 0)

where we assume uniform prior, p(y = 1) = p(y = 0), and y {0, 1} is a binary
variable which represents the sample label.
Diaconis and Freedman [23] showed that the random projections of high di-
mensional random vectors are almost always Gaussian. Thus, the conditional
distributions p(vi |y = 1) and p(vi |y = 0) in the classier H(v) are assumed to
be Gaussian distributed with four parameters (1i , i1 , 0i , i0 ) where

p(vi |y = 1) N (1i , i1 ), p(vi |y = 0) N (0i , i0 ). (5)

The scalar parameters in (5) are incrementally updated

1i 1i + (1 )1
C
i1 (i1 )2 + (1 )( 1 )2 + (1 )(1i 1 )2 , (6)
870 K. Zhang, L. Zhang, and M.-H. Yang
C 
n1
k=0|y=1 (vi (k) ) and
1
where > 0 is a learning parameter, 1 = n
1 2

1
 n1
1 = n k=0|y=1 vi (k). The above equations can be easily derived by maximal
likelihood estimation. Figure 3 shows the probability distributions for three dif-
ferent features of the positive and negative samples cropped from a few frames of
a sequence for clarity of presentation. It shows that a Gaussian distribution with
online update using (6) is a good approximation of the features in the projected
space. The main steps of our algorithm are summarized in Algorithm 1.

Algorithm 1. Compressive Tracking


Input: t-th video frame

1. Sample a set of image patches, D = {z|||l(z) lt1 || < } where lt1 is the
tracking location at the (t-1)-th frame, and extract the features with low dimen-
sionality.
2. Use classier H in (4) to each feature vector v(z) and nd the tracking location
lt with the maximal classier response.
3. Sample two sets of image patches D = {z|||l(z) lt || < } and D, = {z| <
||l(z) lt || < } with < < .
4. Extract the features with these two sets of samples and update the classier
parameters according to (6).

Output: Tracking location lt and classier parameters

3.4 Discussion
We note that simplicity is the prime characteristic of our algorithm in which the
proposed sparse measurement matrix R is independent of any training samples,
thereby resulting in a very ecient method. In addition, our algorithm achieves
robust performance as discussed below.
Dierence with Related Work. It should be noted that our algorithm is
dierent from the recently proposed 1 -tracker [10] and compressive sensing
tracker [9]. First, both algorithms are generative models that encode an ob-
ject sample by sparse representation of templates using 1 -minimization. Thus
the training samples cropped from the previous frames are stored and updated,
but this is not required in our algorithm due to the use of a data-independent
measurement matrix. Second, our algorithm extracts a linear combination of gen-
eralized Haar-like features but these trackers [10][9] use the holistic templates
for sparse representation which are less robust as demonstrated in our experi-
ments. In [9], an orthogonal matching pursuit algorithm is applied to solve the
1 -minimization problems. Third, both of these tracking algorithms [10][9] need
to solve numerous time-consuming 1 -minimization problems but our algorithm
is ecient as only matrix multiplications are required.
Random Projection vs. Principal Component Analysis. For visual track-
ing, dimensionality reduction algorithms such as principal component analysis
Real-Time Compressive Tracking 871

0.12 0.18 0.2

0.16 0.18
0.1
0.16
0.14

0.14
0.08 0.12
0.12
0.1
0.06 0.1
0.08
0.08
0.04 0.06
0.06

0.04
0.04
0.02
0.02 0.02

0 0 0
0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2
5 5 5
x 10 x 10 x 10

Fig. 4. Illustration of robustness of our algorithm to ambiguity in detection. Top row:


three positive samples. The sample in red rectangle is the most correct positive sam-
ple while other two in yellow rectangles are less correct positive samples. Bottom
row: the probability distributions for a feature extracted from positive and negative
samples. The red markers denote the feature extracted from the most correct pos-
itive sample while the yellow markers denote the feature extracted from the two less
correct positive samples. The red and blue stairs as well as lines denote the estimated
distributions of positive and negative samples as shown in Figure 3.

and its variations have been widely used in generative tracking methods [1, 6].
These methods need to update the appearance models frequently for robust
tracking. However, these methods are sensitive to occlusion due to the holis-
tic representation schemes. Furthermore, it is not clear whether the appearance
models can be updated correctly with new observations (e.g., without alignment
errors to avoid tracking drift). In contrast, our algorithm does not suer from the
problems with online self-taught learning approaches [24] as the proposed model
with the measurement matrix is data-independent. It has been shown that for
image and text applications, favorable results can be achieved by methods with
random projection than principal component analysis [25].
Robustness to Ambiguity in Detection. The tracking-by-detection methods
often encounter the inherent ambiguity problems as shown in Figure 4. Recently
Babenko et al. [8] introduced multiple instance learning schemes to alleviate the
tracking ambiguity problem. Our algorithm is robust to the ambiguity problem
as illustrated in Figure 4. While the target appearance changes over time, the
most correct positive samples (e.g., the sample in the red rectangle in Figure
4) are similar in most frames. However, the less correct positive samples (e.g.,
samples in yellow rectangles of Figure 4) are much more dierent as they involve
some background information which vary much more than those within the tar-
get object. Thus, the distributions for the features extracted from the most
correct positive samples are more concentrated than those from the less cor-
rect positive samples. This in turn makes the features from the most correct
positive samples much more stable than those from the less correct positive
samples (e.g., bottom row in Figure 4, the features denoted by red markers are
more stable than those denoted by yellow markers). Thus, our algorithm is able
to select the most correct positive sample because its probability is larger than
872 K. Zhang, L. Zhang, and M.-H. Yang

those of the less correct positive samples (See the markers in Figure 4). In ad-
dition, our measurement matrix is data-independent and no noise is introduced
by mis-aligned samples.
Robustness to Occlusion. Each feature in our algorithm is spatially localized
(Figure 2) which is less sensitive to occlusion than holistic representations. Simi-
lar representations, e.g., local binary patterns [26] and generalized Haar-like fea-
tures [8], have been shown to be more eective in handling occlusion. Moreover,
features are randomly sampled at multiple scales by our algorithm in a way similar
to [27, 8] which have demonstrated robust results for dealing with occlusion.
Dimensionality of Projected Space. Assume there exist d input points in
Rm . Given 0 <  < 1 as well as > 0, and let R Rnm be a random
matrix projecting data from Rm to Rn , the theoretical bound
 for the dimension
n that satises the Johnson-Lindenstrauss lemma is n 2 /24+2
3 /3 ln(d) [16].
In practice, Bingham and Mannila [25] pointed out that this bound is much
higher than that suces to achieve good results on image and text data. In their
applications, the lower bound for n when  = 0.2 is 1600 but n = 50 is sucient
to generate good results. In our experiments, with 100 samples (i.e., d = 100),
 = 0.2 and = 1, the lower bound for n is approximately 1600. Another bound
derived from the restricted isometry property in compressive sensing [15] is much
tighter than that from Johnson-Lindenstrauss lemma, where n log(m/)
and and are constants. For m = 106 , = 1, and = 10, it is expected
that n 50. We nd that good results can be obtained when n = 50 in our
experiments.

4 Experiments
We evaluate our tracking algorithm with 7 state-or-the-art methods on 20 chal-
lenging sequences among which 16 are publicly available and 4 are our own.
The Animal, Shaking and Soccer sequences are provided in [28] and the Box
and Jumping are from [29]. The 7 trackers we compare with are the fragment
tracker (Frag) [30], the online AdaBoost method (OAB) [5], the Semi-supervised
tracker (SemiB) [7], the MILTrack algorithm [8], the 1 -tracker [10], the TLD
tracker [11], and the Struck method [31]. We note that the source code of [9]
is not available for evaluation and the implementation requires some technical
details and parameters not discussed therein. It is worth noticing that we use
the most challenging sequences from the existing works. For fair comparison, we
use the source or binary codes provided by the authors with tuned parameters
for best performance. For our compared trackers, we either use the tuned pa-
rameters from the source codes or empirically set them for best results. Since
all of the trackers except for Frag involve randomness, we run them 10 times
and report the average result for each video clip. Our tracker is implemented in
MATLAB, which runs at 35 frames per second (FPS) on a Pentium Dual-Core
2.80 GHz CPU with 4 GB RAM. The source codes and datasets are available at
http://www4.comp.polyu.edu.hk/~cslzhang/CT/CT.htm.
Real-Time Compressive Tracking 873

Table 1. Success rate (SR)(%). Bold fonts indicate the best performance while the
italic fonts indicate the second best ones. The total number of evaluated frames is 8270.

Sequence CT MILTrack OAB SemiB Frag 1 -track TLD Struck


Animal 76 73 15 47 3 5 75 97
Bolt 79 83 1 16 39 2 1 10
Biker 75 21 42 62 26 31 42 35
Box 89 65 13 38 16 5 92 92
Coupon book 100 99 98 37 27 100 16 99
Cliff bar 89 65 23 65 22 38 67 70
David indoor 89 68 31 46 8 41 98 98
Girl 78 50 71 50 68 90 57 99
Jumping 100 99 86 84 36 9 99 18
Kitesurf 68 90 31 67 10 31 65 40
Occluded face 2 100 99 47 40 52 84 46 78
Panda 81 75 69 67 7 56 29 13
Sylvester 75 80 70 68 34 46 94 87
Skiing 70 42 69 69 7 10 59 80
Shaking 92 85 40 31 28 10 16 1
Soccer 78 17 8 9 27 13 10 14
Twinings 89 72 98 23 69 83 46 98
Tiger 1 78 39 24 28 19 13 65 73
Tiger 2 60 45 37 17 13 12 41 22
Walking person 89 32 86 81 32 98 55 100
Average SR 84 69 52 48 31 46 49 64

4.1 Experimental Setup


Given a target location at the current frame, the search radius for drawing
positive samples is set to = 4 which generates 45 positive samples. The inner
and outer radii for the set X , that generates negative samples are set to = 8
and = 30, respectively. We randomly select 50 negative samples from set X , .
The search radius for set D to detect the object location is set to = 20 and
about 1100 samples are generated. The dimensionality of projected space is set
to n = 50, and the learning parameter is set to 0.85.

4.2 Experimental Results


All of the video frames are in gray scale and we use two metrics to evaluate
the proposed algorithm against the 7 state-of-the-art trackers. The first metric is the
success rate, score = area(ROI_T ∩ ROI_G) / area(ROI_T ∪ ROI_G), where ROI_T is the
tracking bounding box and ROI_G is the ground truth bounding box. If the score is
larger than 0.5 in one frame, the tracking result is considered a success. The other
metric is the center location error measured against manually labeled ground truth data.
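For concreteness, both metrics can be computed from axis-aligned bounding boxes as in the short sketch below; the (x, y, w, h) box format and the helper names are our own illustration rather than part of the evaluation code.

import numpy as np

def overlap_score(box_t, box_g):
    """area(ROI_T ∩ ROI_G) / area(ROI_T ∪ ROI_G) for boxes given as (x, y, w, h)."""
    xt, yt, wt, ht = box_t
    xg, yg, wg, hg = box_g
    iw = max(0.0, min(xt + wt, xg + wg) - max(xt, xg))   # width of the intersection
    ih = max(0.0, min(yt + ht, yg + hg) - max(yt, yg))   # height of the intersection
    inter = iw * ih
    union = wt * ht + wg * hg - inter
    return inter / union if union > 0 else 0.0

def center_location_error(box_t, box_g):
    """Euclidean distance between the centers of the two boxes, in pixels."""
    ct = np.array([box_t[0] + box_t[2] / 2.0, box_t[1] + box_t[3] / 2.0])
    cg = np.array([box_g[0] + box_g[2] / 2.0, box_g[1] + box_g[3] / 2.0])
    return float(np.linalg.norm(ct - cg))

# A frame counts toward the success rate when the overlap score exceeds 0.5.
is_success = overlap_score((10, 10, 40, 60), (15, 12, 40, 60)) > 0.5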
Table 1 and Table 2 show the quantitative results averaged over 10 runs as
described above. We note that although the TLD tracker is able to re-locate the
target during tracking, it often loses the target completely for some frames
in most of the test sequences. Thus, we only report the center location errors for
the sequences in which TLD maintains track throughout. The proposed compressive
tracking algorithm achieves the best or second best results in most sequences in
terms of both success rate and center location error. Furthermore, our tracker
runs faster than all the other algorithms, although most of them (except for the TLD
method and the ℓ1-tracker) are implemented in C or C++, which is intrinsically more
efficient than MATLAB. Figure 5 shows screenshots of some tracking results.

Table 2. Center location error (CLE) (in pixels) and average frames per second (FPS).
Bold fonts indicate the best performance and italic fonts the second best. The total
number of evaluated frames is 8270.

Sequence CT MILTrack OAB SemiB Frag ℓ1-track TLD Struck


Animal 17 32 62 25 99 155 17
Bolt 8 8 227 101 43 58 157
Biker 11 38 12 13 72 74 14
Box 14 57 74 119 160 196 11
Coupon book 4 6 9 74 63 6 10
Cliff bar 7 14 33 56 34 35 20
David indoor 16 19 57 37 73 42 12 9
Girl 21 25 23 50 26 13 10
Jumping 6 10 11 11 29 99 8 42
Kitesurf 9 5 33 10 55 51 30
Occluded face 2 10 16 36 39 58 19 25
Panda 6 9 10 10 69 10 20 67
Sylvester 9 10 12 14 47 42 7 9
Skiing 10 15 12 11 134 189 166
Shaking 9 12 22 133 41 192 166
Soccer 16 64 96 135 54 189 95
Twinings 9 14 7 70 15 10 15 7
Tiger 1 10 27 42 39 39 48 12
Tiger 2 13 18 22 29 37 57 22
Walking person 5 8 5 7 59 4 6 2
Average CLE 10 19 36 49 56 60 36
Average FPS 35.2 10.5 8.6 6.7 3.8 0.1 9.6 0.01

Scale, Pose and Illumination Change. For the David indoor sequence shown
in Figure 5(a), the illumination and pose of the object both change gradually.
The MILTrack, TLD and Struck methods perform well on this sequence. For
the Shaking sequence shown in Figure 5(b), when the stage light changes dras-
tically and the pose of the subject changes rapidly as he performs, all the other
trackers fail to track the object reliably. The proposed tracker is robust to pose
and illumination changes because object appearance can be modeled well by random
projections (based on the Johnson-Lindenstrauss lemma) and a classifier with
online update is used to separate foreground and background samples. Generative
subspace trackers (e.g., IVT [6]) have been shown to be effective in dealing with
large illumination changes, while discriminative tracking methods with local
features (e.g., MILTrack [8]) have been demonstrated to handle pose variation ad-
equately; our tracker is a discriminative model with local features and thus inherits
the latter property. Furthermore, the features we use are similar to generalized
Haar-like features, which have been shown to be robust to scale and orientation
change [8], as illustrated in the David indoor sequence. In addition, our tracker
performs well on the Sylvester and Panda sequences in which the target objects
undergo significant pose changes (see the supplementary material for details).
Occlusion and Pose Variation. The target object in the Occluded face 2 sequence
in Figure 5(c) undergoes large pose variation and heavy occlusion. Only the MIL-
Track and Struck methods as well as the proposed algorithm perform well on
this sequence. In addition, our tracker achieves the best performance in terms
of success rate, center location error, and frame rate.

Fig. 5. Screenshots of some sampled tracking results. (a) David indoor, (b) Shaking,
(c) Occluded face 2, (d) Soccer, (e) Kitesurf, (f) Animal, (g) Biker, (h) Tiger 2,
(i) Coupon book, (j) Bolt. Trackers compared: Frag, OAB, SemiB, MILTrack, TLD,
Struck, ℓ1-track, CT.

The target player in the Soccer sequence is heavily occluded by others many times
when he is holding up the trophy, as shown in Figure 5(d). In some frames, the
object is almost fully occluded (e.g., #120). Moreover, the object undergoes
drastic motion blur and
illumination change (#70 and #300). All the other trackers lose track of the
target in numerous frames. Due to such drastic scene changes, it is unlikely that
online appearance models are able to adapt quickly and correctly. Our tracker can
handle occlusion and pose variations well as its appearance model is discrimi-
natively learned from the target and background with a data-independent measure-
ment, thereby alleviating the influence of the background (see also Figure 4).
Furthermore, our tracker performs well for objects with non-rigid pose variation
and camera view change in the Bolt sequence (Figure 5(j)) because the appear-
ance model of our tracker is based on local features which are insensitive to
non-rigid shape deformation.
Out of Plane Rotation and Abrupt Motion. The object in the Kitesurf
sequence (Figure 5(e)) undergoes acrobatic movements with 360-degree out-of-
plane rotation. Only the MILTrack, SemiB and the proposed trackers perform
well on this sequence. Both our tracker and the MILTrack method are designed
to handle object location ambiguity in tracking with classifiers and discrimina-
tive features. The object in the Animal sequence (Figure 5(f)) exhibits abrupt

motion. Both the Struck and the proposed methods perform well on this se-
quence. However, when the out of plane rotation and abrupt motion both occur
in the Biker and Tiger 2 sequences (Figure 5(g),(h)), all the other algorithms
fail to track the target objects well. Our tracker outperforms the other methods
in all the metrics (accuracy, success rate and speed). The Kitesurf, Skiing, Twin-
ings, Girl and Tiger 1 sequences all contain out of plane rotation while Jumping
and Box sequences include abrupt motion. Similarly, our tracker performs well
in terms of all metrics.
Background Clutters. The object in the Cliff bar sequence changes in scale and
orientation, and the surrounding background has similar texture. As the ℓ1-
tracker uses a generative appearance model that does not take background in-
formation into account, it has difficulty keeping track of the object correctly. The
object in the Coupon book sequence undergoes significant appearance change at
the 60th frame, and then another coupon book appears. Both the Frag and
SemiB methods are distracted into tracking the other coupon book (#230 in Fig-
ure 5(i)) while our tracker successfully tracks the correct one. Because the TLD
tracker relies heavily on the visual information in the first frame to re-detect the
object, it also suffers from the same problem. Our algorithm is able to track the
right objects accurately in these two sequences because it extracts discriminative
features for the most correct positive sample (i.e., the target object) online
(see Figure 4) with classifier update for foreground/background separation.

5 Concluding Remarks
In this paper, we proposed a simple yet robust tracking algorithm with an ap-
pearance model based on non-adaptive random projections that preserve the
structure of the original image space. A very sparse measurement matrix was adopted
to efficiently compress features from the foreground targets and background
ones. The tracking task was formulated as a binary classication problem with
online update in the compressed domain. Our algorithm combines the merits of
generative and discriminative appearance models to account for scene changes.
Numerous experiments with state-of-the-art algorithms on challenging sequences
demonstrated that the proposed algorithm performs well in terms of accuracy,
robustness, and speed.
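To make the above summary concrete, the sketch below generates a very sparse random measurement matrix in the spirit of Achlioptas [16] and Li et al. [19] and applies it to a high-dimensional feature vector. The sparsity setting s = m/4 and the storage scheme are illustrative choices for this sketch, not a description of the tracker's implementation.

import numpy as np

def sparse_measurement_matrix(n, m, s, rng):
    """Very sparse random projection R (n x m): each entry equals +sqrt(s) or
    -sqrt(s) with probability 1/(2s) each, and 0 otherwise [16,19].
    Only the nonzero entries are stored, as per-row (columns, values) pairs."""
    rows = []
    for _ in range(n):
        nnz = rng.binomial(m, 1.0 / s)                  # expected nonzeros per row: m/s
        cols = rng.choice(m, size=nnz, replace=False)   # positions of the nonzeros
        vals = rng.choice([-1.0, 1.0], size=nnz) * np.sqrt(s)
        rows.append((cols, vals))
    return rows

def project(R, x):
    """Compute v = R x using only the stored nonzero entries of R."""
    return np.array([np.dot(vals, x[cols]) for cols, vals in R])

rng = np.random.default_rng(0)
m, n = 10**6, 50                                        # high-dimensional features -> 50 measurements
R = sparse_measurement_matrix(n, m, s=m // 4, rng=rng)  # s = m/4 leaves ~4 nonzeros per row
x = rng.random(m)                                       # placeholder feature vector
v = project(R, x)                                       # compressed representation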

References
1. Black, M., Jepson, A.: Eigentracking: Robust matching and tracking of articulated
objects using a view-based representation. IJCV 38, 63–84 (1998)
2. Jepson, A., Fleet, D., Maraghi, T.: Robust online appearance models for visual
tracking. PAMI 25, 1296–1311 (2003)
3. Avidan, S.: Support vector tracking. PAMI 26, 1064–1072 (2004)
4. Collins, R., Liu, Y., Leordeanu, M.: Online selection of discriminative tracking
features. PAMI 27, 1631–1643 (2005)
5. Grabner, H., Grabner, M., Bischof, H.: Real-time tracking via online boosting. In:
BMVC, pp. 47–56 (2006)

6. Ross, D., Lim, J., Lin, R., Yang, M.-H.: Incremental learning for robust visual
tracking. IJCV 77, 125–141 (2008)
7. Grabner, H., Leistner, C., Bischof, H.: Semi-supervised On-Line Boosting for Ro-
bust Tracking. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I.
LNCS, vol. 5302, pp. 234–247. Springer, Heidelberg (2008)
8. Babenko, B., Yang, M.-H., Belongie, S.: Robust object tracking with online multi-
ple instance learning. PAMI 33, 1619–1632 (2011)
9. Li, H., Shen, C., Shi, Q.: Real-time visual tracking using compressive sensing. In:
CVPR, pp. 1305–1312 (2011)
10. Mei, X., Ling, H.: Robust visual tracking and vehicle classification via sparse rep-
resentation. PAMI 33, 2259–2272 (2011)
11. Kalal, Z., Matas, J., Mikolajczyk, K.: P-N learning: bootstrapping binary classifiers
by structural constraints. In: CVPR, pp. 49–56 (2010)
12. Donoho, D.: Compressed sensing. IEEE Trans. Inform. Theory 52, 1289–1306
(2006)
13. Candes, E., Tao, T.: Near optimal signal recovery from random projections and
universal encoding strategies. IEEE Trans. Inform. Theory 52, 5406–5425 (2006)
14. Wright, J., Yang, A., Ganesh, A., Sastry, S., Ma, Y.: Robust face recognition via
sparse representation. PAMI 31, 210–227 (2009)
15. Candes, E., Tao, T.: Decoding by linear programming. IEEE Trans. Inform. The-
ory 51, 4203–4215 (2005)
16. Achlioptas, D.: Database-friendly random projections: Johnson-Lindenstrauss with
binary coins. J. Comput. Syst. Sci. 66, 671–687 (2003)
17. Baraniuk, R., Davenport, M., DeVore, R., Wakin, M.: A simple proof of the re-
stricted isometry property for random matrices. Constr. Approx. 28, 253–263 (2008)
18. Liu, L., Fieguth, P.: Texture classification from random features. PAMI 34, 574–586
(2012)
19. Li, P., Hastie, T., Church, K.: Very sparse random projections. In: KDD, pp. 287–296
(2006)
20. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple
features. In: CVPR, pp. 511–518 (2001)
21. Li, S., Zhang, Z.: Floatboost learning and statistical face detection. PAMI 26, 1–12
(2004)
22. Ng, A., Jordan, M.: On discriminative vs. generative classifiers: a comparison of
logistic regression and naive Bayes. In: NIPS, pp. 841–848 (2002)
23. Diaconis, P., Freedman, D.: Asymptotics of graphical projection pursuit. Ann.
Stat. 12, 228–235 (1984)
24. Raina, R., Battle, A., Lee, H., Packer, B., Ng, A.Y.: Self-taught learning: Transfer
learning from unlabeled data. In: ICML (2007)
25. Bingham, E., Mannila, H.: Random projection in dimensionality reduction: Appli-
cations to image and text data. In: KDD, pp. 245–250 (2001)
26. Ahonen, T., Hadid, A., Pietikainen, M.: Face description with local binary patterns:
application to face recognition. PAMI 28, 2037–2041 (2006)
27. Leonardis, A., Bischof, H.: Robust recognition using eigenimages. CVIU 78, 99–118
(2000)
28. Kwon, J., Lee, K.: Visual tracking decomposition. In: CVPR, pp. 1269–1276 (2010)
29. Santner, J., Leistner, C., Saffari, A., Pock, T., Bischof, H.: PROST: Parallel Robust
Online Simple Tracking. In: CVPR (2010)
30. Adam, A., Rivlin, E., Shimshoni, I.: Robust fragments-based tracking using the
integral histogram. In: CVPR, pp. 798–805 (2006)
31. Hare, S., Saffari, A., Torr, P.: Struck: structured output tracking with kernels. In:
ICCV, pp. 263–270 (2011)
Author Index

Adelson, Edward H. 665 Dahl, Anders Lindbjerg 638


Agapito, Lourdes 426 Dai, Dengxin 483
Ai, Haizhou 129 Dai, Qieyun 340
Aldoma, Aitor 511 Darkner, Sune 638
Aldrian, Oswald 201 Denk, Winfried 778
Al-Sharadqah, Ali 384 Dibeklioglu, Hamdi 525
Andres, Bjoern 144, 778 Di Stefano, Luigi 511
Astrom, Freddie 215 Donoser, Michael 258
Astrom, Kalle 100 Duan, Genquan 129
Avrithis, Yannis 15 Dubout, Charles 301
Durand, Fredo 665
Bailer, Christian 398 Eichner, Marcin 114
Bak, Slawomir 806
Baktash, Mahsa 736 Felsberg, Michael 215
Baravdish, George 215 Ferrari, Vittorio 114
Batista, Jorge 57 Finckh, Manuel 398
Bigdeli, Abbas 736 Fletcher, P. Thomas 1
Bischof, Horst 258 Fleuret, Francois 301
Bourdev, Lubomir 679 Freeman, William T. 85, 665
Brandt, Jonathan 679 Fua, Pascal 412
Bremond, Francois 806
Briggman, Kevin L. 778 Gall, Juergen 312
Brostow, Gabriel J. 71 Gallagher, Andrew 609
Bruckstein, Alfred M. 173 Ge, Weina 707
Gevers, Theo 525
Girod, Bernd 609
Campbell, Neill D.F. 71
Gong, Dian 229
Cao, Song 129
Gupta, Abhinav 369
Cao, Xiaochun 764
Cao, Xudong 566 Hamprecht, Fred A. 144, 778
Cao, Yu 244 Harandi, Mehrtash T. 736
Caseiro, Rui 57 Harmeling, Stefan 187
Chang, Ming-Ching 707 Henriques, Joao F. 57
Charpiat, Guillaume 806 Hinkle, Jacob 1
Chen, Dong 566 Hirsch, Michael 187
Chen, Huizhong 609 Hoiem, Derek 340
Chen, Tsuhan 272 Huang, Thomas S. 679
Chen, Xiaowu 553 Hufnagel, Lars 144
Cheny, Tsuhan 707
Chernov, Nikolai 384 Ikeuchi, Katsushi 455
Cho, Minsu 624 Isola, Phillip 665
Chodpathumwan, Yodsawalai 340
Cole, Forrester 665 Jammalamadaka, Nataraj 114
Corvee, Etienne 806 Jawahar, C.V. 114, 836

Jin, Xin 553 Ommer, Bjorn 580


Joshi, Sarang 1 Ostlund, Jonas 412
Ju, Lili 244
Parikh, Devi 354
Kalantidis, Yannis 15 Park, Sangdon 651
Kanatani, Kenichi 384 Parkash, Amar 354
Karlinsky, Leonid 326 Pedersen, Kim Steenstrup 638
Kausler, Bernhard X. 144 Peng, Bo 287
Kim, Wonsik 651 Postula, Adam 736
Kimmel, Ron 173 Prasad, Mukta 483
Knott, Graham 778 Razavi, Nima 312
Koethe, Ullrich 144, 778 Riemenschneider, Hayko 258
Kohli, Pushmeet 312 Rosman, Guy 173
Korogod, Natalya 778 Roth, Peter M. 258
Kowdle, Adarsh 272 Rubinstein, Michael 85
Kroeger, Thorben 778 Russell, Chris 594
Kuang, Yubin 100
Salah, Albert Ali 525
Lafarge, Florent 539 Sanfeliu, Alberto 441
Lao, Shihong 129, 497 Schiegg, Martin 144
Larsen, Anders Boesen Lindbo 638 Scholkopf, Bernhard 187
Le, Vuong 679 Schuler, Christian J. 187
Lee, Kyoung Mu 624, 651 Shah, Mubarak 722
Leistner, Christian 483 Shen, Chunhua 158
Leitte, Heike 144 Shi, Boxin 455
Lensch, Hendrik P.A. 398 Shi, Qinfeng 158
Li, Peihua 469 Shrivastava, Abhinav 369
Li, Qing 553 Singh, Saurabh 369
Lin, Zhe 679 Siva, Parthipan 594
Lindner, Martin 144 Smith, Brandon M. 43
Liu, Ce 85 Smith, William A.P. 201
Liu, Qiguang 764 Song, Yafei 553
Liu, Xiaoming 707 Sternig, Sabine 258
Liu, Zicheng 693 Sugaya, Yasuyuki 384
Lovell, Brian C. 736 Suh, Yumin 624
Sun, Jian 29, 566
Ma, Andy Jinhua 792
Mac Aodha, Oisin 71 Tai, Xue-Cheng 173
Martins, Pedro 57 Tan, Ping 455
Matsushita, Yasuyuki 455 Thonnat, Monique 806
Medioni, Gerard 229 Tian, Yu 821
Minoh, Michihiko 497 Tombari, Federico 511
Monroy, Antonio 580 Trulls, Eduard 441
Moreno-Noguer, Francesc 441 Ullman, Shimon 326
Mukunoki, Masayuki 497
Muralidharan, Prasanna 1 van den Hengel, Anton 158
Van Gool, Luc 312, 483
Nair, Arun 71 Varol, Aydin 412
Ngo, Dat Tien 412 Veksler, Olga 850
Norouznezhad, Ehsan 736 Verdie, Yannick 539

Verma, Yashaswi 836 Yao, Rui 158


Vicente, Sara 426 Yu, Gang 693
Vincze, Markus 511 Yuan, Junsong 693
Yuen, Pong Chi 792
Wang, Gang 750
Wang, Liwei 566
Zha, Hongyuan 821
Wang, Qilong 469
Zhang, Hequan 821
Wang, Song 244
Zhang, Kaihua 864
Wang, Yu 173
Zhang, Lei 287, 864
Wei, Yichen 29
Zhang, Li 43
Wen, Fang 29, 566
Zhang, Ya 821
Wittbrodt, Jochen 144
Zhang, Yanning 158
Wu, Yang 497
Zhang, Yimeng 707
Xiang, Tao 594 Zhao, Qinping 553
Zhao, Xuemei 229
Yan, Junchi 821 Zhou, Qiang 750
Yang, Ming-Hsuan 864 Zhu, Sikai 229
Yang, Xiaokang 821 Zhu, Wangjiang 29
Yang, Yang 722 Zisserman, Andrew 114
