
MSc. Thesis

Scene layout segmentation of traffic environments using a Conditional Random Field

Fernando Cervigni Martinelli

Honda Research Institute Europe GmbH

A Thesis Submitted for the Degree of
MSc Erasmus Mundus in Vision and Robotics (VIBOT)

2010
Abstract
At least 80% of the traffic accidents in the world are caused by human mistakes. Whether
drivers are too tired, drunk or speeding, most accidents have their root in the improper behavior
of drivers. Many of these accidents could be avoided if cars were equipped with some kind of
intelligent system able to detect inappropriate actions of the driver and autonomously intervene
by controlling the car in emergency situations. Such an advanced driver assistance system
needs to be able to understand the car environment and, from that information, predict the
appropriate behavior of the driver at every instant. In this thesis project we investigate the
problem of scene understanding solely based on images from an off-the-shelf camera mounted
to the car.
A system has been implemented that is capable of performing semantic segmentation and
classification of road scene video sequences. The object classes which are to be segmented can
be easily defined as input parameters. Some important classes for the prediction of the driver
behavior include road, sidewalk, car and building, for example. Our system is trained in
a supervised manner and takes into account information such as color, location, texture and
also spatial context between classes. These cues are integrated within a Conditional Random
Field model, which offers several practical advantages in the domain of image segmentation
and classification. The recently proposed CamVid database, which contains challenging inner-
city road video sequences with very precise ground truth segmentation data, has been used for
evaluating the quality of our segmentation, including a comparison to state-of-the-art methods.
Everything should be made as simple as possible, but not simpler . . .
Albert Einstein
Contents
Acknowledgments v
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Thesis outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Problem definition 4
2.1 Combined segmentation and recognition . . . . . . . . . . . . . . . . . . . . . . . 4
3 State of the art 6
3.1 Features for image segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.1.1 Spatial prior knowledge . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.1.2 Sparse 3D cues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.1.3 Gradient-based edges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.1.4 Color distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.1.5 Texture cues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.1.6 Context features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2 Probabilistic segmentation framework . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2.1 Conditional Random Fields . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2.2 Energy minimization for label inference . . . . . . . . . . . . . . . . . . . 13
3.3 Example: TextonBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.3.1 Potentials without context . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.3.2 Texture-layout potential . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.4 Application to road scenes (Sturgess et al.) . . . . . . . . . . . . . . . . . . . . . 20
4 Methodology 21
4.1 CRF framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.2 Basic model: location and edge potentials . . . . . . . . . . . . . . . . . . . . . . 21
4.3 Texture potential model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.3.1 Feature vector and choice of filter bank . . . . . . . . . . . . . . . . . . . 24
4.3.2 Boosting of feature vectors . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.3.3 Adaptive training procedure . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.4 Texture-layout potential model (context) . . . . . . . . . . . . . . . . . . . . . . . 31
4.4.1 Training procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.4.2 Practical considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5 Results 37
5.1 Model without context features . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.2 Model with context features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.2.1 Influence of number of weak classifiers . . . . . . . . . . . . . . . . . . . . 38
5.2.2 Influence of the different model potentials . . . . . . . . . . . . . . . . . . 39
5.2.3 Influence of 3D features . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.3 CamVid sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.4 Comparison to state of the art . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6 Conclusions 49
Bibliography 54
List of Figures
2.1 Example of ideal segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
3.1 3D features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.2 Gradient-based edges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.3 GrabCut: segmentation using color GMMs and user interaction . . . . . . . . . . 9
3.4 The Leung-Malik (LM) filter bank . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.5 Clique layouts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.6 Sample results of TextonBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.7 Image textonization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.8 Texture-layout filters (context) . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.9 Sturgess: higher-order cliques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.1 Examples of location potential . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2 Intuitive example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.3 Filter bank responses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.4 The MR8 filter bank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.5 3D features interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.6 Adaboost training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.7 Adaboost classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.8 Software architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.1 Confusion matrix without adaptive training . . . . . . . . . . . . . . . . . . . . . 38
5.2 Confusion matrix after adaptive training . . . . . . . . . . . . . . . . . . . . . . . 38
5.3 Example of texture-layout features (context) . . . . . . . . . . . . . . . . . . . . 39
5.4 Influence of number of weak classifiers . . . . . . . . . . . . . . . . . . . . . . . 40
5.5 Influence of the different potentials . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.6 Confusion matrices for 4-class segmentation . . . . . . . . . . . . . . . . . . . . . 44
5.7 Example of segmentations for 4-class set . . . . . . . . . . . . . . . . . . . . . . . 45
5.8 Results for 11-class set segmentation . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.9 Example of segmentations for 11-class set . . . . . . . . . . . . . . . . . . . . . . 47
5.10 Comparison to state of the art . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
6.1 Adaptive scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Acknowledgments
I would like to thank above all my family for the constant support. They are always with me,
even though they live on the other side of the Atlantic ocean.
My heartfelt thanks to my supervisors at Honda, Jannik Fritsch, who has been so nice and
given me all the support I needed, and Martin Heracles, who has carefully revised this thesis
report and given precious advice all along these four months. For his help with the iCub
repository and for providing me with his essential CRF code, I would like to sincerely thank
Andrew Dankers.
I wish also to thank my supervisor, Prof. Fabrice Meriaudeau, and all professors of the
Vibot Masters. It is hard to fathom how much I learned with you during these 2 years. Thanks
also for offering this program, which has been an amazing and unforgettable experience.
Last but not least, I wish to thank all my Vibot mates, who have been a great company
studying before exams or chilling at the bar.
Chapter 1
Introduction
1.1 Motivation
Within the Honda Research Institute Europe (HRI-EU), the Attentive Co-Pilot project (ACP)
conducts research on a multi-function Advanced Driver Assistance System (ADAS). It is desired
and to be expected that, in the future, cars will autonomously respond to inappropriate actions
taken by the driver. If he or she does not stop the car when the traffic lights are red, or falls
asleep and slowly deviates from the normal driving course, the car should trigger an emergency
procedure and warn the driver. A similar warning should come up, for example, if the driver
gets distracted and the car in front suddenly brakes, without the driver noticing it. It would
be even safer if the car had the capability of not only recognizing it and warning the driver, but
also of taking over control in critical situations and safely correcting the driver's inappropriate
actions. Since human mistakes, and not technical problems, are by far the main cause of traffic
accidents, countless lives could be saved and much damage avoided if such reliable advanced
driver assistance systems existed and were widely implemented.
If, however, this Advanced Driver Assistance System is to become responsible for saving
lives, in a critical real-time context, it cannot afford to fail. In order to manage the extremely
challenging task of building such an intelligent system, many smaller problems have to be
successfully tackled. One of the most important is related to understanding and adequately
representing the environment in which the car operates. For that, a variety of sensors and input
data can be used. Indeed, participants of the DARPA Urban Challenge [5], which requires
autonomous vehicles to drive through specic routes in a restricted city environment, rely on
a wide range of sensors such as GPS, Radar, Lidar, inertial guidance systems as well as on the
use of annotated maps.
One of our aspirations, though, is to achieve the task of scene understanding by visual
perception alone, using an off-the-shelf camera mounted in the car. We humans prove in our
daily life as drivers that seeing the world is largely sufficient to achieve an understanding of the
traffic environment. By ruling out the use of complicated equipment and sensing techniques, we
aim at, once a reliable driver assistance system is achieved, manufacturing it cheaply enough for
it to be highly scalable. Considering their great potential for increasing the safety of drivers,
and therefore also of pedestrians, bicyclists and other traffic participants, such advanced
driver assistance systems will most likely become an indispensable car component, like today's
seat belts.
1.2 Goal
A first step towards understanding and representing the world surrounding the car is to segment
the images acquired by the camera into meaningful regions and objects. In our case, meaningful
regions are understood as the regions that are potentially relevant for the behavior of the
driver. Examples of such regions are the road, sidewalks, other cars, traffic signs, pedestrians,
bicyclists and so on. In contrast, in our context it is not so important, for example, to segment
and distinguish a building on the side of the road as an individual class, since, as far as the
driver behavior is concerned, it makes no difference whether there is a building, a fence or even
a tree at that location.
In order to correctly segment such meaningful regions, we need to consider semantic aspects
of the scene rather than only its appearance, that is, even if the road consists of dark and bright
regions because of shadows, it should still be segmented as only one semantic region. This can
be achieved by supervised training using ground truth segmentation data.
The work described in this thesis aims at performing this task of semantic segmentation,
exploring the most recent insights of researchers in the field, as well as well-known and state-
of-the-art image processing and segmentation techniques.
1.3 Thesis outline
This thesis is structured into five more chapters. In Chapter 2, the main goal of the investigation
done in this thesis project is formalized and explained. Chapter 3 investigates the state of the art
in the field of semantic segmentation and road scene interpretation. Cutting-edge algorithms
like TextonBoost are described in greater detail as they are fundamental to state-of-the-art
methods. In Chapter 4, the methodology and implementation steps followed throughout this
thesis project are detailed. Chapter 5 shows the results obtained for the CamVid database, both
for a set of four classes and for a set of eleven classes. A comparison of these results to the
state of the art mentioned in Chapter 3 is also shown. Finally, in Chapter 6 the conclusions
of the thesis are presented and suggestions regarding the areas on which future efforts should
focus are given.
Chapter 2
Problem definition
2.1 Combined segmentation and recognition
The main goal of this thesis project is to investigate, implement and evaluate a system that
performs segmentation of road scene images including a classification of the different object
classes involved. More specifically, each input color image x ∈ G^{M×N×3}, where
G = {0, 1, 2, ..., 255} and M and N are the image height and width, respectively, must be
pixel-wise segmented. That means that each pixel i of the image has to be assigned one of N
pre-defined classes or labels of a set L = {l_1, l_2, l_3, ..., l_N}. In mathematical terms, the
segmentation investigated can be defined as a function f that takes the color image
x ∈ G^{M×N×3} and returns a label image f(x) = y ∈ L^{M×N}, also called a labeling of x.
This is achieved by supervised training, which means that the system is given labeled training
images, from which it should learn in order to subsequently segment new, unseen images.
According to the state of the art, supervised segmentation techniques yield better results
than unsupervised techniques (see Chapter 3). This is not surprising, since unsupervised
segmentation techniques do not have ground truth information from which to learn semantic
properties, and hence can only segment the images based on purely data-driven features.
Figure 2.1 shows a typical inner-city road scene as considered in this thesis project, as well
as an ideal segmentation obtained by manual annotation. The example is taken from the CamVid
database [4], which is a recently proposed image database with high-quality, manually-labeled
ground truth which we use for training our system. The images have been acquired by a car-
mounted camera, filming the scene in front of the car while driving in a city. More detail about
the CamVid dataset is given in Chapter 5.
Theoretically, it would be ideal if the segmentation algorithm proposed could precisely
(a) (b)
Figure 2.1: (a) An example of a typical inner-city road scene extracted from the CamVid
database. (b) The corresponding manually labeled ground truth, taking into account classes
like road, pedestrian, sidewalk and sky, among others. The goal of the segmentation
system to be implemented is to produce, given an image (a), an automatic segmentation that
is as close as possible to the ground truth (b).
segment all 32 classes annotated in the CamVid database. However, the more classes one tries
to segment the more challenging and time-consuming the problem becomes. Although our
system is supposed to be able to segment an arbitrary set of classes, as long as they are present
in the training database, a compromise between computational efficiency and the number of
classes to segment has to be reached. More importantly, many of the classes defined in the
CamVid database have little if any influence at all on the behaviour of the driver. Bearing
this in mind, the segmentation algorithm should be optimized and tailored towards the most
behaviorally relevant classes.
Furthermore, a related study recently conducted in the ACP Group suggests that, in order
to achieve a good prediction of the driver's behavior, more effort should be invested in how to
use such a segmentation of meaningful classes in terms of segmentation-based features rather
than in precisely segmenting a vast number of classes that may not influence, after all, how the
driver controls the car [11].
Chapter 3
State of the art
The problem of image segmentation has been the focus, for some decades already, of countless
image processing researchers around the globe. Although the problem itself is old, the solution
to many segmentation tasks remains, still today, under active investigation, in particular for
image segmentation applied to highly complex real-world scenes (e.g. traffic scenes). This
chapter describes some of the techniques for image segmentation that have been applied in
areas related to the one investigated in this thesis project.
3.1 Features for image segmentation
3.1.1 Spatial prior knowledge
One of the simplest yet most useful cues that may be explored when segmenting images in a
supervised fashion is the location information of objects in the scene. For many object classes,
there is an important correlation between the label to which a region in an image belongs and its
location in the image. For instance, the fact that the road is mostly in the lower part of pictures
could be helpful for its segmentation. The same applies to the segmentation of the sky, which
is normally in the upper part of an image. Many similar examples can be mentioned, like
buildings usually being on the sides of the image, which makes this feature powerful despite
its simplicity.
3.1.2 Sparse 3D cues
Different regions in an image often have different depths. Therefore, if available, the information
of how far each point in the image was from the camera when the image was acquired can be very
Figure 3.1: The algorithm proposed by Brostow et al. uses 3D point clouds estimated from
video sequences and performs, using motion and structure features, a very satisfactory 11-class
semantic segmentation.
useful for segmentation purposes. Since individual images do not carry any depth information,
3D cues can only be explored in specific cases where one can either measure or infer how far
the objects in an image are. If the use of radars or equipment that directly measure distance
is to be discarded, 3D information can be inferred by using a stereo camera set or, in the case
of a single camera, by using structure-from-motion techniques [9]. When dealing with frames
taken from an ordinary monocular video, structure-from-motion techniques must be applied.
Figure 3.1, extracted from the work of Brostow et al. [3], shows how accurate the segmen-
tation of road scenes can get only by using reconstructed 3D point clouds.
Brostow et al. based their work on the following features, which can be extracted from the
sparse 3D point cloud:

- Height above the ground;
- Shortest distance from the car's path;
- Surface orientation of groups of points;
- Residual error, which is a measure of how objects in the scene move with respect to the world;
- Density of points.
3.1.3 Gradient-based edges
When one thinks of image segmentation, it is natural to expect that the label boundaries
correspond to strong edges in the image being segmented. For example, the image of a blue
car on a city street will have rather strong edges where, in a perfect labeling, the boundaries
between the labels car and street are located. Some methods, such as active contour
snakes [13], exploit gradient-based edge information for segmentation. Figure 3.2 shows an
(a) (b)
Figure 3.2: (a) Original grayscale image of Lena. (b) Edge image obtained by calculating the
image gradients. Edge-based segmentation methods exploit the information in (b) to propose
a meaningful segmentation of (a). Note how Lena's hat, face and shoulder could be quite well
segmented with this edge cue alone.
example picture of Lena and its gradient. The white pixels have a greater probability of being
located on boundaries between labels in a segmentation.
Notice that although this is a very reasonable and useful cue, it can also turn out to be
misleading. When dealing, for example, with shadowed scenes, very often there are stronger
edges inside regions that belong to the same label than there are on the boundaries between
labels. This is particularly challenging for real-world scenes such as the traffic scenes considered
in this thesis project. The way this cue was explored in this project is explained in detail in
Section 3.3.1.
3.1.4 Color distribution
Early methods, like [20], tackle the problem of image segmentation by relying solely on color
features, which can be modeled as histogram distributions or by Gaussian Mixture Models
(GMMs). A Gaussian Mixture Model represents a probability distribution P(x) which is
obtained by summing different Gaussian distributions:

P(x) = Σ_k P_k(x)    (3.1)

where

P_k(x) = N(x | μ_k, Σ_k)    (3.2)

μ_k, Σ_k being the mean and variance of the individual Gaussian distribution k.
Figure 3.3: Segmentation achieved by GrabCut using color GMMs and user interaction.

The use of GMMs to model colors in images has also proven very efficient in binary segmentation
problems, as shown by Rother et al. [19] with their GrabCut algorithm. In such
problems, one wants to separate a foreground object from the background for image editing,
object recognition and so on. When possible, user interaction can be very useful to refine the
results by giving important feedback after the initial automatic segmentation (see Figure 3.3).
However, in most cases, either the number of images to segment is prohibitive or the real-time
nature of the segmentation task prevents any user interaction at all. Both of these remarks
hold in the field of traffic scene segmentation for driver assistance.
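To make the color-GMM idea concrete, the following minimal sketch (an illustration only, not the method of [19] or [20]) fits a per-class color model with scikit-learn and scores new pixels under it; the variable class_pixels is a hypothetical placeholder for colors sampled from one ground-truth class.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical training data: color values (e.g. CIELab) of pixels from one class
class_pixels = np.random.rand(5000, 3)

# Fit a 5-component GMM modeling the color distribution of this class
gmm = GaussianMixture(n_components=5, covariance_type='full', random_state=0)
gmm.fit(class_pixels)

# Per-pixel log-likelihood of new colors under the class model; comparing these
# values across class models gives a simple color-based labeling
new_pixels = np.random.rand(10, 3)
log_likelihood = gmm.score_samples(new_pixels)
```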
3.1.5 Texture cues
Along with color, texture information is often considered and can bring significant improvement
to the segmentation accuracy, as in [7], where graylevel texture features were combined with color
ones. Nowadays, most if not all of the research effort on segmentation also incorporates texture
information. This can be extracted and modeled in two main ways:

1. Statistical Models, which try to describe the statistical correlation between pixel colors
within a restricted vicinity. Among such methods, co-occurrence matrices have been
successfully used, for instance, for seabed classification [18];
Figure 3.4: The LM filter bank has a mix of edge, bar and spot filters at multiple scales and
orientations. It has a total of 48 filters: 2 Gaussian derivative filters at 6 orientations and 3
scales, 8 Laplacian of Gaussian filters and 4 Gaussian filters.
2. Filter bank convolution, where the image is convolved with a carefully selected set of filter
primitives, usually composed of Gaussians, Gaussian derivatives and Laplacians. A well-
known example, the Leung-Malik (LM) filter bank [16], is shown in Figure 3.4. It is
interesting to mention that such filter banks have similarities with the receptive fields of
neurons in the human visual cortex.
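As an illustration of the filter-bank approach, the sketch below convolves a grayscale image with a handful of Gaussian, derivative-of-Gaussian and Laplacian-of-Gaussian filters using SciPy and stacks the responses per pixel; it is a toy bank for demonstration purposes only, not the LM bank itself.

```python
import numpy as np
from scipy import ndimage

def filter_bank_responses(gray):
    """Convolve a grayscale image with a small, illustrative filter bank.

    Returns an array of shape (H, W, D), one D-dimensional response vector per pixel.
    """
    responses = []
    for sigma in (1.0, 2.0, 4.0):                                        # three scales
        responses.append(ndimage.gaussian_filter(gray, sigma))           # smoothing
        responses.append(ndimage.gaussian_laplace(gray, sigma))          # spot filter
        responses.append(ndimage.gaussian_filter(gray, sigma, order=(0, 1)))  # x-derivative
        responses.append(ndimage.gaussian_filter(gray, sigma, order=(1, 0)))  # y-derivative
    return np.stack(responses, axis=-1)

# Toy usage on a random image standing in for a road scene frame
image = np.random.rand(240, 320)
features = filter_bank_responses(image)   # shape (240, 320, 12)
```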
3.1.6 Context features
Although color and texture may efficiently characterize image regions, they are far from enough
for a high quality semantic segmentation if considered alone. For instance, even humans may
not be able to tell apart, when looking only at a local patch of an image, a blue sky from the
walls of a blue building. The key aspect of which humans naturally take advantage, and that
allows them to unequivocally understand scenes, is the context. Even if one sees a building
wall painted with exactly the same color as the sky, one just knows that that wall cannot be
the sky because it is surrounded by windows. In the case of road scenes segmentation, typical
spatial relationships between objects can be a very strong cue; for example, the car is always
on the road, which, in turn, is usually surrounded by sidewalks.
With this in mind, computer vision researchers are now frequently looking beyond low-level
features and are more interested in contextual issues [7, 10, 14]. In Section 3.3, an example of
how context in images can be exploited for segmentation is described.
3.2 Probabilistic segmentation framework
The choice of image features, described in the previous section, is independent of the theoretical
framework or machine learning technique applied for segmentation inference. One can choose
the very same features as in [7], where belief networks are used, and process them using Support
Vector Machines, for example. In recent years, Conditional Random Fields (CRFs) have played
an increasingly central role. CRFs were introduced by Lafferty et al. in [15] and have ever
since been systematically used in cutting-edge segmentation and classification approaches like
TextonBoost [21], image sequence segmentation [27], contextual analysis of textured scenes [24]
and traffic scene understanding [22], to name a few. Conditional Random Fields are based on
Markov Random Fields and offer practical advantages for image classification and segmentation.
These advantages are explained in the next section, after the formal definition of Markov
Random Fields is given.
3.2.1 Conditional Random Fields
In Random Field theory, an image can be described by a lattice S composed of sites i,
which can be thought of as the image pixels. The sites in S are related to one another via
a neighborhood system, which is defined as N = {N_i, i ∈ S}, where N_i is the set of sites
neighbouring i. Additionally, i ∉ N_i and i ∈ N_j ⇔ j ∈ N_i.
Let y denote a labeling configuration of the lattice S belonging to the set of all possible
labelings Y. In the image segmentation context, y can be seen as a labeling image, where each
of the sites (or pixels) i from the lattice S is assigned one label y_i in the set of possible labels
L = {l_1, l_2, l_3, ..., l_N}, which are the object classes. The pair (S, N) can be referred to as a
Random Field.
Moreover, (S, N) is said to be a Markov Random Field (MRF) if and only if

P(y) > 0,  ∀ y ∈ Y,  and    (3.3)

P(y_i | y_{S\{i}}) = P(y_i | y_{N_i})    (3.4)

That means, firstly, that the probability of any defined label configuration must be greater
than zero^1 and, secondly and most importantly, that the probability of a site assuming a given
label depends only on its neighboring sites. The latter statement is also known as the Markov
condition.
According to the Hammersley-Clifford theorem [1], an MRF as defined above can equivalently
be characterized by a Gibbs distribution. Thus, the probability of a labeling y can be
written as

P(y) = Z^{-1} exp(−U(y)),    (3.5)

where

Z = Σ_{y ∈ Y} exp(−U(y))    (3.6)

^1 This assumption is usually taken for convenience, as it, in practical terms, does not influence the problem.
is a normalizing constant called the partition function, and U(y) is an energy function of the
form

U(y) = Σ_{c ∈ C} V_c(y).    (3.7)

C is the set of all possible cliques and each clique c has a clique potential V_c(y) associated
with it. A clique c is defined as a subset of sites in S in which every pair of distinct sites
are neighbours, with single-site cliques as a special case (see Figure 3.5). Due to the Markov
condition, the value of V_c(y) depends only on the local configuration of clique c.
(a) (b) (c)
Figure 3.5: (a) Example of a 4-pixel neighborhood. (b) Possible unary clique layout. (c)
Possible binary clique layouts.
Now let us consider the observation x_i, for each site i, which is a state belonging to a set of
possible states W = {w_1, w_2, ..., w_n}. In this manner, we can represent the image we want to
segment, where each pixel i is assigned to one state of the set W. If one thinks of a gray scale
image with 8-bit resolution, for example, the set of possible states for each site (or pixel) would
be defined as W = {0, 1, 2, ..., 255}. The segmentation problem then boils down to finding the
labeling y* such that P(y*|x), the posterior probability of labeling y* given the observation x,
is maximized. Bayes' theorem tells us that

P(y|x) = P(x|y)P(y)/P(x)    (3.8)

where P(x) is a normalization factor, as Z in Eq. 3.5, and plays no role in the maximization.
Thanks to the Hammersley-Clifford theorem, one can greatly simplify this maximization problem
by defining the clique potential functions V_c(x, y, θ) only locally. How to choose the forms
and parameters of the potential functions for a specific application is a major topic in MRF
modeling and will be further discussed in Chapter 4.
The main difference between MRFs and CRFs lies in the fact that MRFs are generative
models, whereas CRFs are discriminative. That is, CRFs directly model the posterior distribution
P(y|x), while MRFs learn the underlying distributions P(x|y) and P(y), arriving at the
posterior distribution by applying Bayes' theorem.

In other words, for MRFs, the posterior is obtained from the learned distributions as
P(y|x) = P(x|y)P(y)/P(x), where x represents the observation and y the corresponding labeling
configuration. For CRFs, however, it is not required to model the likelihood P(x|y) and the
prior P(y) as for MRFs, since the posterior P(y|x) is modeled directly.

This directly modeled posterior probability is simpler to implement and usually sufficient for
segmenting images. Hence, for the road scene segmentation and classification problem at hand,
CRFs are advantageous in comparison to MRFs. This is the main reason why they became so
popular [21, 22, 27].
3.2.2 Energy minimization for label inference
Finding the labeling y* that maximizes the a posteriori probability expressed in Eq. 3.5 is
equivalent to finding the y* that minimizes the energy function in Eq. 3.7. An efficient way of
finding a good approximation of the energy minimum of such functions is the alpha-expansion
graph-cut algorithm [2], which is widely used along with MRFs and CRFs. The idea of the alpha-
expansion algorithm is to reduce the problem of minimizing a function like U(y) with multiple
labels to a sequence of binary minimization problems. These sub-problems are referred to as
alpha-expansions, and are shortly described here for completeness (for details see [2]).
Suppose that we have a current image labeling y and one randomly chosen label α ∈ L =
{l_1, l_2, l_3, ..., l_N}. In the alpha-expansion operation, each pixel i makes a binary decision: it
can either keep its old label y_i or switch to label α, provided that this change decreases the
value of the energy function. For that, we introduce a binary vector s ∈ {0, 1}^{M×N} which
indicates which pixels in the image (of size M × N) keep their label and which switch to label
α. This defines the auxiliary configuration y[s] as

y_i[s] = y_i  if s_i = 0,   and   y_i[s] = α  if s_i = 1    (3.9)
This auxiliary configuration y[s] transforms the function U with multiple labels into a function
of binary variables U'(s) = U(y[s]). If the function U is composed of attractive potentials,
which can be seen as a kind of convex functions, the global minimum of this binary function^2
is guaranteed to be found exactly using standard graph cuts [21].

The expansion move algorithm starts with any initial configuration y_0, which could be set,
for instance, by taking, for each pixel, the label with maximum location prior probability^3. It
then computes optimal alpha-expansion moves for labels α in a random order, accepting the
moves only if they decrease the energy function. The algorithm is guaranteed to converge, and
its output is a strong local minimum, characterized by the property that no further alpha-
expansion can decrease the value of function U.

^2 Notice that this does not mean that the global minimum of the multi-label function is found.
^3 In the road scene segmentation case, for instance, pixels at the top of the image could start with label sky and pixels at the bottom with label road. This is equivalent to exploiting the features described in Section 3.1.1.
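The overall control flow of the expansion-move algorithm can be summarized as in the sketch below. It is only an outline: the callable expand stands in for the binary keep-or-switch-to-alpha sub-problem, which in practice is solved exactly with a graph cut, and energy evaluates U(y); neither corresponds to a specific library function.

```python
import random

def alpha_expansion(labels, energy, expand, label_set, max_sweeps=5):
    """Approximate minimization of a multi-label energy U(y) by expansion moves.

    labels:    current labeling (whatever structure the callables understand)
    energy:    callable returning U(y) for a labeling
    expand:    callable (labels, alpha) -> proposed labeling after an alpha-expansion
    label_set: iterable containing all labels in L
    """
    all_labels = list(label_set)
    for _ in range(max_sweeps):
        improved = False
        for alpha in random.sample(all_labels, len(all_labels)):
            proposal = expand(labels, alpha)
            if energy(proposal) < energy(labels):   # accept only if the energy decreases
                labels = proposal
                improved = True
        if not improved:          # strong local minimum: no expansion lowers U anymore
            break
    return labels
```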
3.3 Example: TextonBoost
One CRF-based approach to image segmentation that is currently fundamental for state-of-
the-art methods is TextonBoost [21]. In their research, Shotton et al. have used the Microsoft
Research Cambridge (MSRC) database^4, which is composed of 591 photographs of the following
21 object classes: building, grass, tree, cow, sheep, sky, airplane, water, face, car, bicycle, flower,
sign, bird, book, chair, road, cat, dog, body, and boat. Approximately half of those pictures
are picked for training, in a way that ensures proportional contributions from each class. Some
results of their semantic segmentation on previously unseen images are shown in Figure 3.6.
Figure 3.6: TextonBoost results extracted from [21]. Above, unseen test images. Below, seg-
mentation using a color-coded labeling. Textual labels are superimposed for better visualization.
Since the algorithm implemented for the segmentation of road scenes in this master thesis
has been mainly inspired by TextonBoost, a short description of the way it works is provided.
The inference framework used is a conditional random field (CRF) model [15]. The CRF
learns, through the training of the parameters of the clique potentials, the conditional distribution
over the possible labels given an input image. The use of a conditional random field allows
the incorporation of texture, layout, color, location, and edge cues in a single, unified model.
The energy function U(y|x, θ), which is the sum of all the clique potentials (see Eq. 3.7), is
defined as:

U(y|x, θ) = Σ_i [ ψ(y_i, i; θ_ψ) + π(y_i, x_i; θ_π) + λ_i(y_i, x; θ_λ) ] + Σ_{(i,j) ∈ E} φ(y_i, y_j, g_ij(x); θ_φ)    (3.10)

where ψ, π, λ_i and φ denote, respectively, the location, color, texture-layout and edge
potentials, y is the labeling or segmentation, x is a given image, E is the set of edges in a
4-connected neighborhood, θ = {θ_ψ, θ_π, θ_λ, θ_φ} are the model parameters, and i and j index
pixels in the image, which correspond to sites in the lattice of the Conditional Random Field.
Notice that the model consists of three unary potentials, which depend only on one site i in
the lattice, and one pairwise potential, depending on pairs of neighboring sites.

Each of the potentials is subsequently explained in a simplified way; for details please see [21].

^4 The MSRC database can be downloaded at http://research.microsoft.com/vision/cambridge/recognition/
3.3.1 Potentials without context
Location potential
The unary location potentials ψ(y_i, i; θ_ψ) capture the correlation between the class label and the
absolute location of the pixel in the image. For the databases with which TextonBoost was tested,
the location potentials had rather low importance since the context of the pictures is very
diverse. In the case of our road scene segmentation, which is a more structured environment,
they have had significantly more relevance, as discussed in Chapter 5.
Color potential
In TextonBoost, the color distributions of object classes are represented as Gaussian Mixture
Models (see Section 3.1.4) in CIELab color space, where the mixture coefficients depend on the
class label. The conditional probability of the color x of a pixel labeled with class y is given by

P(x|y) = Σ_k P(x|k) P(k|y)    (3.11)

with color clusters (mixture components) P(x|k). Notice that the clusters are shared between
different classes, and that only the coefficients P(k|y) depend on the class label. This makes
the model more efficient to learn than a separate GMM for each class, which is important since
TextonBoost takes into account a high number of classes.
Edge potential
The pairwise edge potentials have the form of a contrast sensitive Potts model [2],

φ(y_i, y_j, g_ij(x); θ_φ) = θ_φ^T g_ij(x) [y_i ≠ y_j],    (3.12)

with [·] the zero-one indicator function:

[condition] = 1 if the condition is true, 0 otherwise    (3.13)
The edge feature g_ij measures the difference in color between the neighboring pixels, as
suggested by [19],

g_ij = [ exp(−β ||x_i − x_j||²), 1 ]^T    (3.14)

where x_i and x_j are three-dimensional vectors representing the CIELab colors of pixels i and j
respectively. Including the unit element allows a bias to be learned, to remove small, isolated
regions^5. The quantity β is an image-dependent contrast term, and is set separately for each
image to (2⟨||x_i − x_j||²⟩)^{-1}, where ⟨·⟩ denotes an average over the image. The two scalar
constants that compose the parameter vector θ_φ are appropriately set by hand.
3.3.2 Texture-layout potential
The texture-layout potential is the most important contribution of TextonBoost. It is based on
a set of novel features which are introduced in [21] as texture-layout filters. These new features
are capable of capturing, at once, the correlation between texture, spatial layout, and textural
context in an image.

Here, we quickly describe how the texture-layout features are calculated and the boosting
approach used to automatically select the best features and, thereby, learn the texture-layout
potentials used in Eq. 3.10.
Image textonization
As a first step, the images are represented by textons [17] in order to arrive at a compact
representation of the vast range of possible appearances of objects or regions of interest^6. The
process of textonization is depicted in Figure 3.7, and proceeds as follows. At first, each of the
training images is convolved with a 17-dimensional filter bank. The responses for all training
pixels are then whitened, so that they have zero mean and unit covariance, and clustered
using a standard Euclidean-distance K-means clustering algorithm for dimension reduction.
Finally, each pixel in each image is assigned to the nearest cluster center found with K-means,
producing the texton map T, where pixel i has value T_i ∈ {1, ..., K}.
Texture-Layout Filters
The texture-layout filter is defined by a pair (r, t) of an image region, r, and a texton t, as
illustrated in Figure 3.8. Region r is referenced relative to the pixel i being classified, and
texton t belongs to the texton map T. For efficiency reasons, only rectangular regions are
implemented in TextonBoost, although any arbitrary region shape could be considered. A set
R of candidate rectangles is chosen at random, such that every rectangle lies inside a fixed
bounding box.

^5 The unit element means that for every pair of pixels that have different labels a constant potential is added to the whole. This makes contiguous labels preferable when the energy function is minimized.
^6 Textons have been proven effective in categorizing materials [25] as well as generic object classes [28].

Figure 3.7: The process of image textonization, as proposed by [21]. All training images are
convolved with a filter bank. The filter responses are clustered using K-means. Finally, each
pixel is assigned a texton index corresponding to the nearest cluster center to its filter response.
The feature response at pixel i of texture-layout filter (r, t) is the proportion of pixels under
the offset region r + i that have been assigned texton t in the textonization process,

v_[r,t](i) = (1 / area(r)) Σ_{j ∈ (r+i)} [T_j = t].    (3.15)

Any part of the region r + i that lies outside the image does not contribute to the feature
response.
An efficient and elegant way to calculate the filter responses anywhere over an image can
be achieved with the use of integral images [26]. For each texton t in the texton map T, a
separate integral image I^(t) is calculated. In this integral image, the value at pixel i = (u_i, v_i)
is defined as the number of pixels in the original image that have been assigned to texton t in
the rectangular region with top left corner at (1, 1) and bottom right corner at (u_i, v_i):

I^(t)(i) = Σ_{j: (u_j ≤ u_i) & (v_j ≤ v_i)} [T_j = t].    (3.16)

The advantage of integral images is that they can later be used to compute the texture-layout
filter responses in constant time: if I^(t) is the integral image for texton channel t defined as
above, then the feature response is computed as:
v_[r,t](i) = ( I^(t)(r_br) − I^(t)(r_bl) − I^(t)(r_tr) + I^(t)(r_tl) ) / area(r)    (3.17)

where r_br, r_bl, r_tr and r_tl denote the bottom right, bottom left, top right and top left corners
of rectangle r.
(a) (c) (e)
(b) (d) (f)

Figure 3.8: Graphical explanation of texture-layout filters, extracted from [21]. (a, b) An image
and its corresponding texton map (colors represent texton indices). (c) Texture-layout filters
are defined relative to the point i being classified (yellow cross). In this first example feature,
region r_1 is combined with texton t_1 in blue. (d) A second feature where region r_2 is combined
with texton t_2 in green. (e) The response v_[r_1,t_1](i) of the first feature is calculated at three
positions in the texton map (magnified). In this example, v_[r_1,t_1](i_1) ≈ 0, v_[r_1,t_1](i_2) ≈ 1, and
v_[r_1,t_1](i_3) ≈ 1/2. (f) The second feature (r_2, t_2), where t_2 corresponds to grass, can learn
that points i (such as i_4) belonging to sheep regions tend to produce large values of v_[r_2,t_2](i),
and hence can exploit the contextual information that sheep pixels tend to be surrounded by
grass pixels.
Texture-layout features are sufficiently general to allow for an automatic learning of layout
and context information. Figure 3.8 illustrates how texture-layout filters are able to model
textural context and layout.
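A minimal NumPy sketch of Eqs. 3.15 to 3.17, with illustrative variable names: one integral image per texton channel is built with cumulative sums, after which the count of any texton inside any rectangle, and hence the texture-layout response, is obtained in constant time.

```python
import numpy as np

def texton_integral_images(texton_map, num_textons):
    """One integral image per texton: entry (v, u) counts pixels of that texton in the
    rectangle from the image origin to (v, u), inclusive (Eq. 3.16)."""
    one_hot = np.stack([(texton_map == t) for t in range(num_textons)], axis=0)
    return one_hot.cumsum(axis=1).cumsum(axis=2)        # shape (num_textons, H, W)

def texture_layout_response(integrals, t, top, left, bottom, right):
    """Proportion of the rectangle's pixels assigned texton t (Eqs. 3.15 and 3.17).
    Parts of the rectangle falling outside the image contribute zero to the count."""
    I = integrals[t]
    H, W = I.shape
    area = (bottom - top + 1) * (right - left + 1)       # area of the full region r
    t0, l0 = max(top, 0), max(left, 0)
    b0, r0 = min(bottom, H - 1), min(right, W - 1)
    if t0 > b0 or l0 > r0:
        return 0.0
    count = I[b0, r0]
    if t0 > 0:
        count -= I[t0 - 1, r0]
    if l0 > 0:
        count -= I[b0, l0 - 1]
    if t0 > 0 and l0 > 0:
        count += I[t0 - 1, l0 - 1]
    return count / area

# Toy usage: a random texton map with K = 8 textons
texton_map = np.random.randint(0, 8, size=(120, 160))
integrals = texton_integral_images(texton_map, 8)
v = texture_layout_response(integrals, t=3, top=10, left=20, bottom=40, right=60)
```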
Boosting of texture-layout filters
A Boosting algorithm iteratively selects the most discriminative texture-layout filters (r, t) as
weak learners and combines them into a strong classifier used to derive the texture-layout
potential in Eq. 3.10. The boosting scheme used in TextonBoost shares each weak learner
between a set of classes C, so that a single weak learner classifies for several classes at once.
According to the authors, this allows for classification with cost sub-linear in the number of
classes, and leads to improved generalization.

The strong classifier learned is the sum over the classification confidences h_i^m(c) of M weak
learners

H(y_i, i) = Σ_{m=1}^{M} h_i^m(c)    (3.18)

The confidence value H(y_i, i) for pixel i is then simply multiplied by a negative constant, so
that a positive confidence turns into a negative energy, which will be preferred in the energy
minimization, to give the texture-layout potentials λ_i used in Eq. 3.10:

λ_i(y_i, x; θ_λ) = −θ_λ H(y_i, i)    (3.19)
Each weak learner is a decision stump based on the feature response v_[r,t](i) of the form

h_i(c) = a[v_[r,t](i) > θ] + b  if c ∈ C,   and   h_i(c) = k_c  otherwise,    (3.20)

with parameters (a, b, k_c, θ, C, r, t). The region r and texton index t together specify the texture-
layout filter feature, and v_[r,t](i) denotes the corresponding feature response at position i. For
the classes that share this feature, that is, c ∈ C, the weak learner gives h_i(c) ∈ {a + b, b}
depending on whether v_[r,t](i) is, respectively, greater or lower than the threshold θ. For classes
not sharing the feature (c ∉ C), the constant k_c ensures that unequal numbers of training
examples of each class do not adversely affect the learning procedure.
In order to choose the weak classifiers, TextonBoost uses the standard boosting algorithm
introduced by Schapire et al. in [8], which is explained here for completeness. Suppose we are
choosing the m-th weak classifier. Each training example i, a pixel in a training image, is paired
with a target value z_i^c ∈ {−1, +1}, where +1 means that pixel i has ground truth class c and
−1 that it does not, and is assigned a weight w_i^c specifying its classification accuracy for class c
after the m−1 previous rounds of boosting. The m-th weak classifier is chosen by minimizing
an error function J_error weighted by w_i^c:

J_error = Σ_i w_i^c (z_i^c − h_i^m(c))²    (3.21)

The training examples are then re-weighted

w_i^c := w_i^c e^{−z_i^c h_i^m(c)}    (3.22)

Minimizing the error function J_error requires, for each new weak classifier, an expensive
brute-force search over the possible sharing classes C, features (r, t), and thresholds θ. As
shown in [21], however, given these parameters, a closed form solution does exist for a, b and k_c.
3.4 Application to road scenes (Sturgess et al.)
In the more specific field of road scene segmentation, Sturgess et al. [22] have recently quite
successfully segmented inner-city road scenes into 11 different classes. Their method builds on
the work of Shotton et al. (see Section 3.3) and on that of Brostow et al. [3], integrating
the appearance-based features from TextonBoost with the structure-from-motion features from
Brostow et al. (see Section 3.1.2) in a higher order CRF. According to the authors, the use
of higher-order cliques, that is, cliques with several pixels instead of only pairs of pixels as
in TextonBoost, produces accurate segmentations with precise object boundaries. Figure 3.9
shows how Sturgess et al. use an unsupervised meanshift segmentation of the input image to
obtain regions that are used as higher-order cliques and included in the energy function U to be
minimized.
Figure 3.9: The original image (left), its ground truth labelling (centre) and the meanshift
segmentation of the image (right). The segments in the meanshift segmentation on the right
are used to define higher-order potentials, allowing for more precise object boundaries in the
final segmentation.
Sturgess et al. achieved an overall accuracy of 84% compared to the previous state-of-
the-art accuracy of 69% [3] on the challenging CamVid database [4]. The work of Sturgess is
therefore especially important for this thesis as it successfully tackles the same inner-city scene
segmentation problem. The CamVid database will be described in more detail in Chapter 5, where the
results obtained by our implementation are compared with those of Sturgess et al. [22].
Chapter 4
Methodology
4.1 CRF framework
After thorough consideration of related work, CRFs have been deemed very suitable and up
to date for dealing with the problem proposed in this thesis project. As discussed in Chapter 3,
conditional random fields allow the incorporation of a wide variety of cues in a single, unified
model. Moreover, state-of-the-art work in the field of image segmentation (see Section 3.3,
TextonBoost) and also more specifically in the domain of inner-city road scene understanding
(see Section 3.4, Sturgess et al.) has used CRFs. Sturgess et al. have been able to very
successfully segment eleven different classes in road scenes, some of which are very important
to our final goal of driver behavior prediction.
4.2 Basic model: location and edge potentials
Location and edge cues, as mentioned in Section 3.1, are very meaningful and can significantly
contribute to the quality of any segmentation. In our case, location cues are all the more important
because we deal with a very spatially structured scene. The road will, for example, never be at
the top of the image and the sky will never be at the bottom. We can then extract precious
information as to where to expect our classes to be located in the image.
If, for a better understanding of the problem, we consider at first a model with just the
location and edge potentials, then the energy function to be minimized in order to infer the
most likely labeling becomes

U(y|x, θ) = Σ_i ψ(y_i, i; θ_ψ) + Σ_{(i,j) ∈ E} φ(y_i, y_j, g_ij(x); θ_φ).    (4.1)
The location potential is calculated based on the incidence, over all the training images, of each
class at each pixel:

ψ(y_i, i; θ_ψ) = −log( (N_{y_i,i} + ε) / (N_i + ε) )    (4.2)

where N_{y_i,i} is the number of pixels at position i assigned class y_i in the training images, N_i is
the total number of pixels at position i, and ε is a small integer to avoid the indefinition log(0)
when N_{y_i,i} = 0 (we use ε = 1). Figure 4.1 illustrates the location potential of classes road
and sidewalk in images from the CamVid database.
(a) (b)

Figure 4.1: (a) Location potential of class road, ψ(road, i; θ_ψ). (b) Location potential of
class sidewalk, ψ(sidewalk, i; θ_ψ). The whiter a pixel, the higher the incidence of pixels from the
corresponding class in the training images.
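In practice, Eq. 4.2 amounts to counting, over the training set, how often each class occurs at each pixel position and taking the negative log of the resulting frequency. The sketch below is a minimal NumPy version, assuming all ground truth label images share the same resolution and that labels are stored as integer indices (both assumptions are illustrative).

```python
import numpy as np

def location_potentials(label_images, num_classes, eps=1.0):
    """Location potential of Eq. 4.2: -log((N_{y_i,i} + eps) / (N_i + eps)).

    label_images: list of (H, W) integer arrays with values in [0, num_classes).
    Returns an array of shape (num_classes, H, W), one potential map per class.
    """
    stack = np.stack(label_images, axis=0)                       # (T, H, W)
    counts = np.stack([(stack == c).sum(axis=0)                  # N_{y_i,i} per class
                       for c in range(num_classes)], axis=0)
    totals = counts.sum(axis=0)                                  # N_i (here simply T)
    return -np.log((counts + eps) / (totals + eps))

# Toy usage with random "ground truth" labelings of 4 classes
labels = [np.random.randint(0, 4, size=(90, 120)) for _ in range(20)]
psi = location_potentials(labels, num_classes=4)                 # (4, 90, 120)
```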
The pairwise edge potential has the form of a contrast sensitive Potts model [2], as defined
in TextonBoost:

φ(y_i, y_j, g_ij(x); θ_φ) = θ_φ^T g_ij(x) [y_i ≠ y_j],    (4.3)

with [·] the zero-one indicator function. The edge feature g_ij measures the difference in color
between the neighboring pixels, as suggested by [19],

g_ij = [ exp(−β ||x_i − x_j||²), 1 ]^T    (4.4)
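For the 4-connected neighborhood, the first component of Eq. 4.4 can be computed for all horizontal and vertical pixel pairs at once. The sketch below assumes the image is already in CIELab as a float array and sets the contrast term beta from the image-wide mean squared color difference, as described for Eq. 3.14 (array names are illustrative).

```python
import numpy as np

def edge_features(lab_image):
    """exp(-beta * ||x_i - x_j||^2) for horizontally and vertically adjacent pixels."""
    dx = lab_image[:, 1:, :] - lab_image[:, :-1, :]      # horizontal color differences
    dy = lab_image[1:, :, :] - lab_image[:-1, :, :]      # vertical color differences
    sq_dx = (dx ** 2).sum(axis=-1)
    sq_dy = (dy ** 2).sum(axis=-1)
    # Image-dependent contrast term: beta = (2 * <||x_i - x_j||^2>)^-1
    beta = 1.0 / (2.0 * np.mean(np.concatenate([sq_dx.ravel(), sq_dy.ravel()])))
    return np.exp(-beta * sq_dx), np.exp(-beta * sq_dy)

# Toy usage with a random array standing in for a CIELab road scene image
lab = np.random.rand(60, 80, 3)
g_horizontal, g_vertical = edge_features(lab)
```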
With the help of an intuitive example, shown in Figure 4.2a, we can see how location and
edge potentials interact, resulting in a meaningful segmentation. In this example, we want to
segment the toy image into three different classes: background, foreground-1 and foreground-2.
Figures 4.2b, 4.2d and 4.2f show the unary location potentials ψ(y_i, i; θ_ψ) for classes foreground-
1, foreground-2 and background, respectively, at every pixel i^1. A white pixel represents a
high probability of a class being present at that pixel, which is equivalent to saying that the
energy potential is low, impelling the function minimization to prefer labelings where the pixels
are white rather than black. Figure 4.2c shows the gradient image, which is a way to visualize
the edge potential calculated as in Eq. 4.3. The segmentation boundaries are more likely to
be located where the edge potential is white. Figure 4.2e shows the final segmentation obtained
through the minimization of Eq. 4.1.
(a) (c) (e)
(b) (d) (f)
Figure 4.2: (a) Noisy toy image to be segmented. (c) Gradient image as basis for edge po-
tential. (b,d,f) Location potentials of classes foreground-1, foreground-2 and background,
respectively. (e) Final segmentation inferred from the minimization of Eq. 4.1.
Note that the nal segmentation correctly ignores the noise, as it is not present at the same
pixels simultaneously in the edge and location potentials. The red and yellow structures inside
the main blob are all segmented as class foreground-1 thanks to the contribution of its location
potential. The constant term in Eq. 4.3, which adds a given cost for any pixel belonging to a
label boundary, helps suppress the appearance of noisy, small foreground regions.

^1 The location potential of the class background is complementary to the foreground classes' potentials. That is, when either class foreground-1 or foreground-2 is likely, class background is unlikely, and vice-versa.
4.3 Texture potential model
Although the segmentation of the toy example, obtained with the location and edge potentials
described in the last section, was robust against noise, the location potentials provided were
very similar to the regions we wanted to segment. In real images, not only are the location potentials
less correlated with the position of the labels, but there are also much more complex objects
to be segmented that cannot be differentiated just by using location and edge potentials. The
next step towards a better segmentation is then modeling the texture information present in the
images. We can represent this new potential by rewriting the energy function U as:
U(y|x, θ) = Σ_i [ ψ(y_i, i; θ_ψ) + λ_i(y_i, x; θ_λ) ] + Σ_{(i,j) ∈ E} φ(y_i, y_j, g_ij(x); θ_φ)    (4.5)

where ψ, λ_i and φ are the location, texture and edge potentials, respectively.
Note that the texture potential represents local texture only, i.e., it does not take into
account context. It is merely a local feature. Context and layout are explored in Section 4.4,
where the use of simplified texture-layout filters is investigated.
In order to represent the texture information of the images to segment, we opted, similarly
to TextonBoost [21], for the use of filter banks. By using an N-dimensional filter bank F,
one obtains an N-dimensional feature vector f_x(i) for each pixel i. Each component of this
vector is the result of the convolution of the input image converted to grayscale, x, with the
corresponding filter, evaluated at the position of i:

f_x(i) = [ (F_1 * x)|_i, (F_2 * x)|_i, ..., (F_N * x)|_i ]^T    (4.6)
Equivalently, the result of the convolution of an N-dimensional filter bank with an image can
be understood by considering the convolution of the image with each component of the filter
bank at a time. Figure 4.3 shows an example input image and the response images for some of
the Leung-Malik filter bank components [16].
4.3.1 Feature vector and choice of filter bank
The choice of the filter bank used to represent the texture in the images to be segmented was
based on the following criteria:

- Good coverage of possible textures without too much redundancy between filters;
- Fast and efficient filter response calculation;
- Ready-to-use implementation available.

(a) (c) (e)
(b) (d) (f)

Figure 4.3: (a) Example of an inner-city road scene image. (b-f) Examples of responses of five
different filter components of the LM filter bank, which are shown at the bottom left corner of
each figure.
Considering those criteria, a very interesting implementation by the Intelligent Systems Lab
of the University of Amsterdam has been found. It is implemented as a Matlab .mex file, which
means it is actually a C routine which is pre-compiled and then called by Matlab at execution
time. The libraries are freely available for research purposes^2.

Using this fast .mex implementation, 5 different filter banks have been assessed
by segmenting images using only the texture potential in Eq. 4.5. Four classes have been
considered: road, sidewalk, others^3 and sky.

^2 Source code available at: http://www.science.uva.nl/~mark.
^3 Class others is assigned to any pixel that is not labeled as one of the other three classes; it can thus be seen as the complement of the other three classes.
The filter banks assessed were the following:

MR8: The MR8 filter bank consists of 38 filters but only 8 filter responses. The filter bank
contains filters at multiple orientations but their outputs are collapsed by recording only
the maximum filter response across all orientations (see Figure 4.4);
MR8 - no maxima: The rotation invariance of the MR8 filter bank, achieved by taking
only the maximum response over all orientations, may not be a desired property of a texture
filter bank used for segmentation, since some classes could be described by the orientation of
their features. Therefore, a filter bank called MR8 - no maxima has been defined, where all
the 38 responses are kept;
MR8 - separate channels: Here, the MR8 filter bank is applied individually to each of the
three color channels, in an attempt to verify whether discriminative texture information
is unevenly distributed over the color channels;

MR8 - multi-scale: This filter bank is composed of three MR8 filter banks at three
subsequent scales. Although the MR8 filter bank itself already uses filters at different
scales, we found it interesting to try to cover even more scales, as road scenes almost
always contain objects whose distance may vary by many orders of magnitude^4;

TextonBoost's filter bank: This filter bank has 17 dimensions and is based on the CIELab
color space. It consists of Gaussians at scales k, 2k and 4k, x and y derivatives of Gaussians
at scales 2k and 4k, and Laplacians of Gaussians at scales k, 2k, 4k and 8k. The Gaussians
are applied to all three color channels, while the other filters are applied only to the
luminance.

^4 For instance, there might be a car immediately in front of the camera but also another one tens of meters away.
Figure 4.4: The MR8 filter bank is low dimensional, rotationally invariant and yet capable of
picking out oriented features. Note that only the maximum response of the filters in each of
the first 6 rows is taken.
As all the filter banks, except MR8 - separate channels and TextonBoost's filter bank, are convolved with grayscale images, we also concatenated to the texture feature vector f_x(i), which is the response of the filter bank, the L, a and b color values of the corresponding pixel:

    f'_x(i) = [ f_x(i)^T, L_i, a_i, b_i ]^T    (4.7)

In this manner, the color information was merged with the texture, giving an extra cue to the Adaboost classifiers. (Tests have been performed with different color spaces, yielding the best results when CIELab was used. This comes from the fact that the CIELab color space is partially invariant to scene lighting modifications: only the L dimension changes, in contrast to the three dimensions of the RGB color space, for instance.)
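As an illustration of Eq. 4.7, the sketch below stacks, for every pixel, the filter-bank responses with the pixel's L, a and b values. It is only a plausible re-implementation of the idea: the filter kernels are assumed to be given, scikit-image's rgb2lab is used as a stand-in for the color conversion, and the simple grayscale conversion is an assumption rather than the thesis code.

    import numpy as np
    from scipy.ndimage import convolve
    from skimage.color import rgb2lab

    def per_pixel_features(rgb_image, kernels):
        """Return an (H*W, N+3) matrix: N filter responses plus L, a, b per pixel."""
        gray = rgb_image.mean(axis=2)                      # crude grayscale for the filter bank
        responses = [convolve(gray, k, mode="reflect") for k in kernels]
        lab = rgb2lab(rgb_image)                           # (H, W, 3) CIELab values
        stacked = np.dstack(responses + [lab])             # (H, W, N+3)
        return stacked.reshape(-1, stacked.shape[2])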
The results of the tests showed that the filter bank that yielded the best segmentation results and, thus, best represented the texture information in the road scene images was MR8 - multi-scale. This is probably due to the aforementioned fact that road scene images have similar objects and regions that may vary greatly in depth. This variation is well captured by the multi-scale characteristic of the MR8 - multi-scale filter bank.
Combination of 3D cues into the feature vector
As discussed in section 3.1.2, 3D information can be extracted from images in a video sequence
using structure from motion techniques. Those techniques can only infer the 3D position of
characteristic points in the image, that is, points that can be located, described and then
matched in subsequent images. In this thesis this has been done using the Harris corner detector,
with normalized cross-correlation over patches for matching. Other possible patch descriptors
are, for example, SIFT and SURF.
All 3D features mentioned in section 3.1.2 have been concatenated, just like the L, a and b color values, to the feature vector described in Eq. 4.7:
    f''_x(i) = [ f'_x(i)^T, 3Dfeature_1(i), ..., 3Dfeature_5(i) ]^T    (4.8)
However, in order to include these 3D cues in our feature vector, they need to be defined for every pixel of an input image. That means that we have to transform the sparse 3D features obtained using reconstruction techniques into dense features. This can be done by interpolation, where every pixel is assigned 3D feature values based on the values of the sparse neighbors that could be determined with reconstruction techniques.
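A minimal sketch of this densification step is given below. It assumes the sparse 3D feature values and their pixel coordinates come from the structure-from-motion stage, and uses scipy's griddata as a stand-in for whatever interpolation scheme was actually employed.

    import numpy as np
    from scipy.interpolate import griddata

    def densify(sparse_xy, sparse_values, image_shape):
        """Interpolate sparse per-point 3D feature values to every pixel.

        sparse_xy:     (K, 2) (row, col) pixel coordinates of reconstructed points
        sparse_values: (K,) 3D feature value at each reconstructed point
        image_shape:   (H, W) of the input image
        """
        rows, cols = np.mgrid[0:image_shape[0], 0:image_shape[1]]
        dense = griddata(sparse_xy, sparse_values, (rows, cols), method="linear")
        # pixels outside the convex hull of the sparse points get no linear
        # estimate; fall back to nearest-neighbour values there
        nearest = griddata(sparse_xy, sparse_values, (rows, cols), method="nearest")
        return np.where(np.isnan(dense), nearest, dense)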
Figure 4.5 shows an example of dense interpolation of the 3D feature height above ground
for an image taken from the CamVid database.
Figure 4.5: (a) A dusk image taken from the CamVid database. (b) The calculated height above ground 3D feature. After determining a point cloud with structure from motion techniques, the sparse features have been interpolated so as to yield a dense representation. Notice how the sky has high values and that we can see a faint blob where the car is located in the original image.
It is important to mention that, before concatenating them to the feature vector as shown in Eq. 4.8, the 3D features have been appropriately normalized. The normalization guarantees that they do not overshadow the texture and color features during the clustering process. This could happen if the values of the 3D features were much greater than the values of the other features. Since the clustering method implemented uses Euclidean distances, such an imbalance in the feature values would result in biased cluster centers. The influence of the use of 3D features on the segmentation results is discussed in Chapter 5.
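One simple way to realize this normalization, sketched below under the assumption that per-dimension standardization (zero mean, unit variance over the training pixels) is sufficient, prevents the Euclidean distances used by the clustering from being dominated by any single cue:

    import numpy as np

    def standardize_columns(features, eps=1e-8):
        """Scale each feature dimension to zero mean and unit variance so that no
        single cue (e.g. a large-valued 3D feature) dominates Euclidean distances."""
        mean = features.mean(axis=0)
        std = features.std(axis=0)
        return (features - mean) / (std + eps), mean, std

    # at test time the same training statistics are reused:
    # test_scaled = (test_features - mean) / (std + eps)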
4.3.2 Boosting of feature vectors

Having defined the feature vector as in Eq. 4.8, we then need to find patterns in the features extracted from training images and try to recognize them in new, unseen images. For instance, we want to learn what texture, color and 3D cues are typical of each of the classes we want to segment. Some of the machine learning techniques suitable for this task are neural networks, belief networks or Gaussian Mixture Models in the N-dimensional feature space (where N is the number of filters in the filter bank). Nonetheless, an Adaboost approach has been preferred for its generalization power and ease of use.

A short overview of the way Adaboost works is given here. For more details about its implementation and theoretical grounds, please see [8]. For this thesis project we have utilized a ready-to-use Matlab implementation from the Moscow State University (source code available at http://graphics.cs.msu.ru/en/science/research/machinelearning/modestada).

Figure 4.6: Example of the training procedure for classifier road. The Q x K data matrix D is represented by the red vectors, whereas the 1 x K label vector L is indicated by the green arrows.
Note that, since we are dealing with binary Adaboost classification, a classifier is trained for each of the classes we want to segment in a one-versus-all manner. For the training of each classifier, a learning data matrix D ∈ R^(Q×K) is taken as input by the Adaboost trainer. Matrix D has size Q×K, where Q is the number of dimensions of the feature vector from Eq. 4.8 (Q = N, the number of dimensions of the filter-bank response, plus 3 for L, a, b, plus 5 for the 3D features) and K is the number of training vectors used for training (the feature vectors are extracted from pixels in the training images). Another input, a 1×K vector L ∈ {0,1}^(1×K), contains the labels of the training data D. Vector L is comprised of ones for the pixels belonging to the class of the classifier being trained, and zeros otherwise. Figure 4.6 illustrates how individual classifiers for each class are trained.
The Adaboost classifier of class c is composed of M stump weak classifiers h_c(f),

    h_c(f) = { 1   if f_p > θ
             { 0   otherwise        (4.9)

where f_p is the p-th dimension of vector f and θ is a threshold. The strong classifier H^class_c(f(i)) is built by choosing the most discriminative weak learners, minimizing the error with respect to the target value, as explained in Section 3.3.2. Figure 4.7 shows how a trained classifier outputs a confidence value between zero and one for feature vectors from unseen images.
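The following minimal sketch (illustrative names only; not the Moscow State University implementation actually used) shows how such a strong classifier combines decision-stump weak classifiers into a single confidence for one feature vector. The weak-classifier weights alpha produced by the Adaboost training are assumed to be given, and the weighted vote is simply normalized to the [0, 1] range.

    import numpy as np

    class Stump:
        """Decision stump firing on one feature dimension p against threshold theta."""
        def __init__(self, p, theta, alpha):
            self.p, self.theta, self.alpha = p, theta, alpha

        def __call__(self, f):
            return 1.0 if f[self.p] > self.theta else 0.0

    def strong_confidence(stumps, f):
        """Weighted vote of all weak classifiers, normalized to a [0, 1] confidence."""
        votes = sum(s.alpha * s(f) for s in stumps)
        total = sum(s.alpha for s in stumps)
        return votes / total if total > 0 else 0.0

    # example with two hypothetical stumps and a random 25-dimensional feature vector
    stumps = [Stump(p=3, theta=0.2, alpha=1.5), Stump(p=10, theta=-0.7, alpha=0.8)]
    confidence = strong_confidence(stumps, np.random.randn(25))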
Figure 4.7: Given a trained classifier, a classification confidence is computed based on how similar the input feature vector is to the positive examples, and on how different it is from the negative ones, provided in the training phase illustrated in Figure 4.6.

Once we have defined a strong classifier H for each class, the texture potential of Eq. 4.5 can be defined as:

    ψ_i(y_i, x; θ_ψ) = -θ_ψ · H^class_{y_i}(f_x(i))    (4.10)
The output of the strong classifier H^class_{y_i}(f_x(i)) is multiplied by a negative constant so that a positive confidence turns into a negative energy, which will be preferred in the energy minimization. θ_ψ is the set of all parameters used in the Adaboost training of H, for instance the number of weak classifiers.
4.3.3 Adaptive training procedure
In order to make the training of the Adaboost classifiers more tractable, not every pixel of every training image has been selected to build the training data matrix D. Since there is a lot of redundancy between pixels, this simplification has not adversely affected the quality of the Adaboost classifiers.
Although the selection of pixels for the extraction of training feature vectors was initially random, a smarter, innovative algorithm has been developed.
The adaptive training procedure works by iteratively choosing an unequal proportion of feature vectors from each label. The idea is that, based on the confusion matrix of a given segmentation experiment, we know the strengths and weaknesses of the classifiers trained. For instance, suppose that for a given segmentation experiment class sky is not confused as much as street and sidewalk. Then, it is reasonable to choose, in the next segmentation experiment, more feature vectors from classes street and sidewalk and fewer feature vectors from class sky for the training of classifiers street and sidewalk.
Formally, if we represent the weight (or proportion) of training feature vectors from class i, used in the Adaboost training of classifier j, as W_ij, the update of every weight after each segmentation iteration (experiment) can be expressed as:

    W'_ij = (1/Z) · W_ij · e^(η·Cm_ij)          if i ≠ j
    W'_ij = (1/Z) · W_ij · e^(η·(1-Cm_ij))      if i = j        (4.11)
where Cm_ij is the element in the i-th row and j-th column of the confusion matrix of the previous segmentation iteration, η is a learning speed factor and Z is a normalization factor that guarantees that

    Σ_i W'_ij = 1,        (4.12)

or, in other words, that the sum of the proportions of feature vectors from each class remains equal to 1. The weights are all initialized equally as W_ij = 1/N_c, with N_c representing the number of classes.
Notice that in the case of a perfect segmentation, where the confusion matrix is equal to the identity matrix, the proportion of training feature vector samples W'_ij does not change.
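A small sketch of this update rule, assuming a row-normalized confusion matrix Cm and the learning-speed factor written here as eta, shows how the sampling proportions shift toward frequently confused classes while remaining normalized as in Eq. 4.12:

    import numpy as np

    def update_sampling_weights(W, Cm, eta=1.0):
        """Update sampling proportions W[i, j] (class i used to train classifier j)
        from the confusion matrix Cm of the previous segmentation iteration."""
        n = W.shape[0]
        exponent = np.where(np.eye(n, dtype=bool), 1.0 - Cm, Cm)
        W_new = W * np.exp(eta * exponent)
        # normalize every column so the proportions for each classifier sum to 1
        return W_new / W_new.sum(axis=0, keepdims=True)

    # with a perfect (identity) confusion matrix the weights are unchanged:
    W0 = np.full((4, 4), 0.25)
    assert np.allclose(update_sampling_weights(W0, np.eye(4)), W0)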
Although the adaptive learning algorithm considerably improved the segmentation quality (see Section 5.1), the use of local features alone is intrinsically limited. As precise and discriminative as a classifier may be, there are cases where class sidewalk is virtually identical to class road for every local feature imaginable. The natural next step towards a better segmentation is to use context information. Then, the fact that sidewalks are normally alongside roads, separating them from buildings or other regions, can be exploited and help us correctly differentiate what is locally indistinguishable.
4.4 Texture-layout potential model (context)
In order to model contextual information, we opt for utilizing the texture-layout features introduced by TextonBoost. This new potential replaces the texture potential explained in the previous section, as the texture-layout features are more general. We then have the following energy function:

    U(y | x, θ) = Σ_i [ λ(y_i, i; θ_λ) + ψ_i(y_i, x; θ_ψ) ] + Σ_(i,j) φ(y_i, y_j, g_ij(x); θ_φ)    (4.13)

where the first term inside the sum over pixels is the location potential, the second is the texture-layout potential, and the sum over neighboring pixel pairs (i, j) is the edge potential.
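To make the structure of Eq. 4.13 concrete, the sketch below evaluates the energy of a given labeling from precomputed potential tables. The array layout, the 4-connected neighborhood and the constant pairwise-cost table are assumptions made for illustration only; in the thesis the pairwise term additionally depends on the image gradient g_ij(x), and the actual minimization is delegated to the C++ layer described in Section 4.4.2.

    import numpy as np

    def energy(labels, location_pot, texlayout_pot, pairwise_cost):
        """Energy U(y | x) of a labeling for an H x W image with C classes.

        labels:        (H, W) integer class of each pixel
        location_pot:  (H, W, C) unary location potentials
        texlayout_pot: (H, W, C) unary texture-layout potentials
        pairwise_cost: (C, C) edge costs (assumed constant over the image here)
        """
        rows, cols = np.indices(labels.shape)
        unary = (location_pot[rows, cols, labels] +
                 texlayout_pot[rows, cols, labels]).sum()
        # pairwise term over a 4-connected neighborhood (right and down edges)
        right = pairwise_cost[labels[:, :-1], labels[:, 1:]].sum()
        down = pairwise_cost[labels[:-1, :], labels[1:, :]].sum()
        return unary + right + down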
In this equation, the texture-layout potentials are defined similarly to the way they are defined in TextonBoost:

    ψ_i(y_i, x; θ_ψ) = -θ_ψ · H(y_i, i)    (4.14)
The confidence H(y_i, i) is the output of a strong classifier found by boosting weak classifiers,

    H(y_i, i) = Σ_{m=1}^{M} h^m_{y_i}(i)    (4.15)
Each weak classifier, in turn, is defined based on the response of a texture-layout filter:

    h^m_{y_i}(i) = { a,   if v_[r,t](i) > θ
                   { b,   otherwise        (4.16)
Notice the difference from the definition in Eq. 3.20 of TextonBoost: bearing in mind our final goal of behavior prediction, we do not need to classify as many classes as in TextonBoost, where up to 32 different classes are segmented. TextonBoost shares weak classifiers so that the computation cost becomes sub-linear in the number of classes. Since we do not need as many classes, it is possible for us to simplify the calculation of the strong classifiers by not sharing weak classifiers. Therefore, in our approach, each strong classifier has its own, exclusive weak classifiers.
The texture-layout filter response v_[r,t](i) is the proportion of pixels in the input image, from all those lying in the rectangle r with its origin shifted to pixel i, that have been assigned texton t in the textonization process illustrated in section 3.3.2:

    v_[r,t](i) = (1/area(r)) · Σ_{j ∈ (r+i)} [T_j = t]    (4.17)
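The response of Eq. 4.17 can be evaluated in constant time per rectangle with one integral image per texton, as discussed in Section 3.3. The sketch below illustrates this; the texton map, the rectangle encoding and all names are assumptions for illustration.

    import numpy as np

    def texton_integral_images(texton_map, n_textons):
        """One zero-padded integral image per texton index, built from the texton map T."""
        H, W = texton_map.shape
        integrals = np.zeros((n_textons, H + 1, W + 1), dtype=np.uint32)
        for t in range(n_textons):
            mask = (texton_map == t).astype(np.uint32)
            integrals[t, 1:, 1:] = mask.cumsum(axis=0).cumsum(axis=1)
        return integrals

    def filter_response(integrals, t, i_row, i_col, rect):
        """v_[r,t](i): fraction of pixels with texton t inside rectangle rect shifted to pixel i.

        rect = (top, left, height, width), offsets relative to pixel i."""
        top, left, h, w = rect
        n_rows, n_cols = integrals.shape[1], integrals.shape[2]
        r0 = int(np.clip(i_row + top, 0, n_rows - 1))
        c0 = int(np.clip(i_col + left, 0, n_cols - 1))
        r1 = int(np.clip(r0 + h, 0, n_rows - 1))
        c1 = int(np.clip(c0 + w, 0, n_cols - 1))
        I = integrals[t]
        count = int(I[r1, c1]) - int(I[r0, c1]) - int(I[r1, c0]) + int(I[r0, c0])
        return count / float(h * w)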
4.4.1 Training procedure
We used, for our textonization process, the same feature vector definition as in Eq. 4.8, which contains texture, color and 3D cues.
In order to build a strong classifier (note that we need to train one strong classifier for each of the classes we want to segment the image into), weak classifiers are added one by one following this boosting procedure:
1. Generation of weak classifier candidates: Each weak classifier is composed of a texture-layout filter (r, t) and a threshold θ. The candidates are generated by randomly choosing a rectangular region r inside a bounding box, a texton index t ∈ T = {1, 2, ..., K}, where K is the number of clusters used in the textonization process, and finally a threshold θ between 0 and 1. For the addition of each weak classifier, an arbitrary number of candidates, N_cd, is generated.
2. Calculation of parameters a and b for all candidates: Each weak classifier candidate must also be assigned values a and b so that its response, h^m_c(i), is fully defined (see Eq. 4.16). As described by Torralba et al. [23], who use the same boosting approach (except that ours does not share weak classifiers), a and b can be calculated as follows:

    b = ( Σ_i w^c_i z^c_i [v_[r,t](i) ≤ θ] ) / ( Σ_i w^c_i [v_[r,t](i) ≤ θ] ),    (4.18)
    a = ( Σ_i w^c_i z^c_i [v_[r,t](i) > θ] ) / ( Σ_i w^c_i [v_[r,t](i) > θ] ),    (4.19)
where c is the label for which the classifier is being trained, z^c_i = +1 or z^c_i = -1 for pixels i which, respectively, have ground truth label c or a label different from c, and the w^c_i are the classification accuracy weights used by Adaboost (see Section 3.3.2).
Note, from Eq. 4.18 and Eq. 4.19, that, for the calculation of a and b, the response of the texture-layout filters, v_[r,t](i), must be calculated for all training pixels i and compared to the threshold θ.
3. Search for the best weak classifier candidate: Once each weak classifier candidate is fully defined, that is, all its parameters (r, t, θ, a, b) are set, the most discriminative among the candidates is found by minimizing the error function with respect to the target values z^c_i (a small code sketch of this fitting and selection step is given after this enumeration).
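The per-candidate fitting of a and b and the selection of the most discriminative candidate can be sketched as follows, assuming the branch-wise weighted means reconstructed in Eqs. 4.18-4.19 and a weighted squared error as the selection criterion (the concrete error function is not spelled out in the text); names are illustrative.

    import numpy as np

    def fit_stump(v, z, w, theta):
        """Fit a and b of a stump h(i) = a if v_i > theta else b (cf. Eqs. 4.18-4.19)."""
        above = v > theta
        eps = 1e-12                                   # guard against empty branches
        a = np.sum(w * z * above) / (np.sum(w * above) + eps)
        b = np.sum(w * z * ~above) / (np.sum(w * ~above) + eps)
        return a, b

    def best_candidate(candidates, responses, z, w):
        """Pick the candidate with the lowest weighted error to the targets z.

        responses[k] holds v_[r,t](i) of candidate k for all training pixels i;
        z are the +/-1 targets and w the Adaboost example weights."""
        best = None
        for k, theta in candidates:                   # (response index, threshold) pairs
            v = responses[k]
            a, b = fit_stump(v, z, w, theta)
            h = np.where(v > theta, a, b)
            err = np.sum(w * (z - h) ** 2)            # weighted squared error
            if best is None or err < best[0]:
                best = (err, k, theta, a, b)
        return best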
In Chapter 5 we see how texture-layout strong classifiers can learn the context between objects. We also observe how the number of weak classifiers influences the segmentation quality.
4.4.2 Practical considerations
System architecture
Due to the short period of time available for this thesis work, the implementation of the software had to be efficient and fast. Owing to its flexibility, and to the variety of ready-to-use image processing, statistics, plotting and other functions available, Matlab has been the preferred tool for the implementation of the solution.
Conditional Random Fields are, however, intrinsically highly demanding in computational resources. This is due to the iterative nature of the minimization procedure of the cost function U, detailed in section 3.2.1. As Matlab is an interpreted programming language, it is significantly slower at processing loops than compiled languages such as C or C++. Therefore, Matlab has proven unable to cope with the massive calculations needed for the segmentation inference, when the cost function U is minimized.
Figure 4.8: Software architecture. The Matlab layer is responsible for the higher-level processing
whereas the C++ layer takes the heavy energy minimization computation.
In the context of the iCub project [12], which is led by the RobotCub Consortium, consisting of several European universities, a good C++ framework for the minimization of Markov Random Field energy functions has been found. The main goal of the iCub platform is to study cognition through the implementation of biologically motivated algorithms. The project is open source: both the hardware design and the software are freely available.
The software implemented has then been based on a two-layer layout, as illustrated in Figure 4.8. Matlab, on the higher level, pre-processes the images, calculating, for instance, filter convolutions, whereas the C++ program calculates the minimum of the energy function U. In other words, the C++ layer infers, from the given clique potentials and the Matlab pre-processed input data, what the maximum a posteriori labeling is.
The assessment of the quality of the segmentations, the storage of results and all complementary software functionalities are handled by Matlab in the higher-level layer.
Implementation challenges and optimizations
Unlike the case of the texture potential explained in the previous section, we could not find any ready-to-use Matlab implementation of the boosting procedure for the texture-layout potential, as it is very specific to this problem. The whole algorithm therefore had to be implemented from scratch. Moreover, since there are countless loops involved in the training algorithm described above, Matlab was ruled out as the programming environment for the implementation, being replaced by C++.
Two main practical problems have been faced in the C++ implementation of the algorithm described above: firstly, the long processing time and, secondly, the limited amount of RAM memory.
1. Processing time: The boosting procedure described in the previous section requires computations over all training pixels. If we consider 100 images (a typical number for a training data set), each composed of, for instance, 800 × 600 pixels, we already have 48 million calculation loops for each step. This turns out to be impractical for today's processors. The solution found was to resize all dataset images before segmenting them and also to consider, as training pixels, only a subsampled set of each image. By resizing the images to half their original size and subsampling the training pixels in 5-pixel steps along both axes, we could already reduce the number of calculation loops 100 times. After this simplification was applied, the decrease in segmentation quality was almost imperceptible, which indicates that the information necessary for training the classifiers was not lost with the resizing and subsampling.
2. RAM memory: As discussed in section 3.3, the use of integral images is essential for the efficient calculation of the texture-layout filter responses v_[r,t]. If we consider that 100 textons have been defined in the textonization process, we have, for each training image, 100 integral images, one for each texton index. Again, considering 100 training images already resized to half their original size, we have ten thousand 400 × 300 matrices (each matrix represents an integral image). If we use a normal int variable for each matrix element, which in C++ occupies 4 bytes, we need 10000 × 400 × 300 × 4 = 4.8 gigabytes of RAM memory.
The first attempt to avoid this memory problem was to load only some of the integral images at a time. However, for the calculation of the texture-layout filter responses of the weak classifier candidates, all the integral matrices are necessary. They therefore had to be simultaneously accessible in RAM.
The solution was to use short unsigned integers, with only 2 bytes, which were big enough for all the integral matrices analyzed, and also to subsample the integral image matrices:

    I^(t)(u, v) = I^(t)_sub( round( (u, v) / SubsamplingFactor ) )    (4.20)

where I^(t)_sub denotes the subsampled integral image stored for texton t.
Again, the subsampling almost did not change the results of the final segmentations. One of the reasons why the results did not change much is probably that the subsampling rate of 3 used is much smaller than the sizes of the rectangular regions r used in the texture-layout features. Although the subsampling reduced the amount of RAM memory necessary for loading the integral images, there is still a limit on the number of training images that can be used without causing memory problems. A small sketch of this storage scheme is given after the note below.
(Each short unsigned integer can store a number of up to 65535. If we consider a 400 × 300 pixel image, the maximum value of an integral image, if all pixels were assigned to one single texton, would be 120000. However, since each pixel is assigned to one of many texton indexes, the integral image of each texton never has values close to the 65535 limit.)
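The combined effect of the two measures (2-byte storage and the subsampled lookup of Eq. 4.20) can be sketched as follows; the subsampling factor of 3 matches the text, while the array names are illustrative.

    import numpy as np

    SUBSAMPLING = 3  # subsampling rate used for the stored integral images

    def build_subsampled_integral(texton_mask):
        """Integral image of a binary texton mask, stored subsampled as 2-byte integers."""
        full = texton_mask.astype(np.uint32).cumsum(axis=0).cumsum(axis=1)
        return full[::SUBSAMPLING, ::SUBSAMPLING].astype(np.uint16)

    def lookup(integral_sub, u, v):
        """Approximate I(u, v) of the full-resolution integral image (Eq. 4.20)."""
        r = min(int(round(u / SUBSAMPLING)), integral_sub.shape[0] - 1)
        c = min(int(round(v / SUBSAMPLING)), integral_sub.shape[1] - 1)
        return int(integral_sub[r, c])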
Chapter 5
Results
In this chapter, we investigate the performance of our semantic segmentation system on the challenging CamVid dataset and compare our results with existing work. Firstly, we show preliminary results obtained with the texture features described in Section 4.3 without considering any context. We then analyse our final model with the context features (texture-layout features) described in Section 4.4. The effect of different aspects and parameters of the model is discussed before we present the best results obtained and analyse them quantitatively and qualitatively.
5.1 Model without context features
Figure 5.1 shows the confusion matrix of the segmentation of approximately 200 pictures, with classifiers trained on 140 other pictures, all randomly taken from the CamVid database. For this segmentation experiment, 500 training feature vectors have been randomly chosen per training image. The segmentations have been computed by minimization of Eq. 4.5, which does not include any context feature. Notice how sidewalks are almost not recognized at all.
The adaptive training procedure described in Section 4.3.3 chooses, for the training of the Adaboost classifiers, more examples of feature vectors from labels that are confused, like road and sidewalk, than from those that are easily recognized, like sky. The confusion matrix of Figure 5.1 shows the results of the segmentation of the first iteration of this adaptive Adaboost training algorithm, where all training vectors are chosen randomly. After three iterations, examples are selectively chosen and the confusion matrix of the segmentation results, shown in Figure 5.2, shows much better discernment between classes that were initially mixed up.
Although the adaptive training procedure improved the segmentation quality, context information, as discussed in the next section, helps differentiate classes even better.
Figure 5.1: Confusion matrix of a segmentation experiment choosing random feature vectors for training the Adaboost classifiers. Each row shows what proportion of the ground truth class has been assigned to each class by the classifiers. Class others is the union of all classes defined in the CamVid database except street, sidewalk and sky. For an ideal segmentation, the confusion matrix would be equal to the identity matrix.
Figure 5.2: Confusion matrix of segmentation after three iterations of the adaptive training.
Initially, 65% of class sidewalk was wrongly assigned to class road, as compared to only 25%
with the adaptive learning. The percentage of class sidewalk correctly assigned also increased
from 9% to 61%.
5.2 Model with context features
Our final model includes the texture-layout potential (see Section 4.4). This model and its results are discussed in detail in the following sections.
5.2.1 Influence of the number of weak classifiers

As illustrated in Figure 3.8, texture-layout filters work by exploiting the contextual correlation of textures (and, in our solution, also color) between neighboring regions. Figure 5.3 shows the rectangular region r of each of the first ten texture-layout features for the classifier of class road. Notice that the location distribution of the regions r is slightly biased towards either the top half or the bottom half of the image. This comes, probably, from the fact that most of the correlations between textures present in class road and other textures happen in a vertical fashion: the road is normally below other classes.
Figure 5.3: The regions r of the first ten weak classifiers composing the strong classifier for class road. The yellow cross in the middle indicates the pixel i being classified and the blue rectangle represents the bounding box within which all the weak classifier candidates are created. The bigger the bounding box, the farther away context can be modeled. The downside of defining a big bounding box is that, if the context near pixel i is more discriminative, there is a lower probability of a weak classifier being generated in that region than for a smaller bounding box.
As seen in section 4.4.1, the boosting scheme used guarantees that the target function, that is, the labeling of the training images, is approximated with an increasing number of weak classifiers. However, the quality of the segmentation of unseen test images seems to have a clear upper bound, as shown in Figure 5.4.
The number of weak classifier candidates generated for every round of the boosting is a compromise between computation time and how discriminative the weak classifiers found are. Increasing the number of generated candidates increases the computation time for training proportionally. The quality of the segmentation, however, increases in a logarithmic fashion. At a given point, it is better to boost more weak classifiers than to generate too many candidates and boost just a few of them. Tests showed that 1000 weak classifier candidates resulted in a good compromise between training time and texture-layout classifier quality.
Figure 5.4: (a) Notice how the training error J_error with respect to the target function, that is, the ground truth labeling of the training images, decreases exponentially with the number of weak classifiers. (b) The segmentation accuracy for unseen test images, however, seems to be bounded at around 92%. These accuracies have been calculated by segmenting the images using only the texture-layout potential. In this experiment no overfitting has occurred, but it is possible that, with other test images, the accuracy decreases again after a certain number of weak classifiers is reached.

5.2.2 Influence of the different model potentials

Although all the different potentials included in the model contribute to the final quality of the segmentation, we observed that the most important contribution comes from the texture-layout potential. This potential alone correctly segments the bulk of the scene, lacking however coherent and smooth boundaries, as this aspect is not explicitly modeled in the texture-layout features. The edge potential, on the other hand, is responsible for a better delineation of boundaries, smoothing them and making them stick to existing edges in the input image. Although this does not contribute substantially to the overall pixel accuracy, it increases perceptual quality, making the segmentation much more natural to human observers, which is also useful for our final goal of driver behavior prediction. The location potential is also important to correct regions wrongly segmented by the texture-layout potential. This happens, for instance, when white, saturated regions in the image are assigned by the texture-layout potential to class sky, even if those regions are located at the bottom of the image. Figure 5.5 shows how the perceived segmentation quality and the pixel-wise accuracy increase as we add the different potentials.
5.2.3 Influence of 3D features

The influence of the 3D features, described in Section 4.3.1, was, due to time constraints, only tested for the 4-class set comprised of road, sidewalk, others and sky. After the concatenation of the 3D features to the feature vector, no improvement in the quality of the segmentations has been noticed. This might be due to the fact that the classes which were most often mixed up, road and sidewalk, do not have significantly different 3D features.
Figure 5.5: (a) Original image to be segmented. (b) Manually labeled ground truth provided by the CamVid dataset. (c) Segmentation obtained by using only the texture-layout potential, with an overall accuracy of 90.7%. (d) Segmentation obtained with texture-layout and location potentials. Notice how some pixels assigned to class sidewalk in (c) now turn into class road because of their location. The accuracy is increased to 91.3%. (e) Segmentation obtained with texture-layout and edge potentials. Note how classes now have coherent and smooth boundaries, increasing the overall accuracy to 92.6%. (f) Finally, the segmentation obtained with all potentials combined. Notice how the strengths add up, as the labeling is spatially coherent and has smooth boundaries. The final accuracy achieved is 92.9%.
Therefore, in the case of this 4-class set, no discriminative cues have been added through the combination of the 3D features.
Nonetheless, if we are to consider more classes in the segmentation, the 3D features could be useful, as the additional classes may have quite different 3D characteristics. For instance, signs, pedestrians and bicyclists can stand out in comparison to buildings behind them because they are much closer to the car camera.
Another possible way of improving the contribution of the 3D features would be to cluster them separately from the appearance features described. This technique has been applied by Sturgess et al.
Table 5.3: The four sequences of the CamVid database.

Type   Seq. Name   # Images
Day    Seq05VD     171
Day    0016E5      305
Day    0006R0      101
Dusk   0001TP      124
5.3 CamVid sequences
The CamVid database is composed of four sequences of inner-city road scenes. Three of them have been recorded during the day, with good sun illumination, and the fourth one as it was getting dark. The four sequences are described in table 5.3.
We next show the results obtained using each of the sequences separately. For each of them, the first half of the sequence has been used for training and the second half for testing. The most important parameters have been set as follows (the system devised has many more parameters than those listed; however, they have turned out to have less influence on the final results than those mentioned here, which was a sign of the robustness achieved by the developed system):
- Number of clusters for textonization: 30;
- Number of weak classifiers: 500;
- Number of weak classifier candidates: 1000;
- Resize factor of original images: 0.5;
- Bounding box of texture-layout filters: 300 × 300 pixels.
Initially, we intended to tune the parameters automatically by developing a test bench that would automatically run tests and change the input parameters, trying to find maxima in the segmentation quality, for instance by judging quality by the overall pixel accuracy. Although this is in theory an interesting idea, it turned out to be impractical, as we need hours to complete a single segmentation experiment. All system parameters, including those cited above, have then been manually tweaked to yield the best results possible.
Two different label sets have been tested and are separately described.
4-class set
Bearing in mind that the final goal of this segmentation system is to help predict the driver's behavior, a set with four of the most important classes has been defined: road, sidewalk, others and sky. The road is the most important of all, as it indicates the region where
the car has freedom to drive. The sidewalk helps define in which direction the road follows its course, giving an important cue regarding the steering wheel behavior, that is, whether the driver should turn left, turn right or go straight ahead (if the behavior is to be predicted more precisely, other curve granularities can be defined, for example sharp right, slight left and so on). Class others represents all sorts of obstacles, such as other cars, pedestrians, bicyclists, buildings and so forth. Class sky has been defined as it is very characteristic and easy to differentiate from all others.
Figure 5.6 shows the overall accuracy and confusion matrices of the segmentation results for each of the four sequences of the CamVid dataset. As explained in section 5.1, the confusion matrices show, for the ground truth class in each row, what proportion was assigned to which label. The names of the assigned labels are indicated at the top of each column. Overall accuracy is calculated by dividing the number of pixels correctly assigned by the total number of pixels.
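For completeness, the evaluation measures used throughout this chapter can be computed as in the short sketch below (a straightforward re-implementation from their definitions, not the thesis code):

    import numpy as np

    def evaluate(gt, pred, n_classes):
        """Row-normalized confusion matrix plus overall (global) and average class accuracy."""
        cm = np.zeros((n_classes, n_classes), dtype=np.float64)
        for g, p in zip(gt.ravel(), pred.ravel()):
            cm[g, p] += 1
        overall = np.trace(cm) / cm.sum()              # correctly labeled pixels / all pixels
        row_sums = cm.sum(axis=1, keepdims=True)
        cm_norm = cm / np.maximum(row_sums, 1)         # each row: proportions per ground truth class
        present = row_sums.ravel() > 0
        average = np.diag(cm_norm)[present].mean()     # equal weight to every class that occurs
        return cm_norm, overall, average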
Some examples of segmentation from all four sequences in the CamVid dataset are shown
in Figure 5.7.
Considering the four classes mentioned above and the dusk image sequence, 0001TP, which has 124 images, our system took 6.3 hours to train and about one minute to segment each test image. The processor used was a Pentium 4, 3.20 GHz, with 1 gigabyte of RAM memory. It is important to notice that, within the one minute elapsed for the segmentation of each test image, a considerable amount of time, approximately 20 seconds, is spent by Matlab in writing the integral images to .txt files and by the C++ program in reading them. Hence, a great deal of processing time could be saved by integrating the whole system into the C++ platform.
11-class set
In order to be able to compare our system with the state of the art and to use our segmentation to try to predict the driver behaviour similarly to Ess et al. [6], we decided to use the same 11-class set as they did. This set is defined by the classes Building, Tree, Sky, Car, Sign-Symbol, Road, Pedestrian, Fence, Column-pole, Sidewalk and Bicyclist.
Since the confusion matrices of the segmentation results for this 11-class set are 11 × 11, Figure 5.8 shows only their diagonal for each of the sequences considered.
Some examples of segmentation from all four sequences in the CamVid dataset are shown
in Figure 5.9.
Considering the eleven classes mentioned above and the dusk image sequence, 0001TP, which has 124 images, our system took 16.5 hours to train and 3 minutes to segment each test image. The same Pentium 4, 3.20 GHz processor with 1 gigabyte of RAM memory was used.
Figure 5.6: Confusion matrices and overall pixel accuracies for all sequences of the CamVid database. Notice that all overall accuracies have been above 90%. Classes that are less present, for instance sidewalk, have yielded the worst results. This is also caused by the fact that these smaller classes have more complicated forms, like narrow stripes in the case of sidewalks, which makes it easier to mix them up with neighboring classes. In sequence 0006R0, which was mostly shot in a car parking lot, there are very few examples of class sidewalk, which justifies the fact that the classifiers did not learn well enough to handle this class, resulting in a zero column in the confusion matrix (c).
Figure 5.7: (a) Examples of images from the CamVid dataset to be segmented. (b) Ground truth annotation for the 4 classes considered. (c) Our segmentation using all three potentials implemented: texture-layout, edge and location. Note that the third image, from sequence 0006R0, has very few pixels of class sidewalk. This leads to insufficient learning and to the absence of pixels assigned to class sidewalk in our segmentations. Notice how, in the fourth image, which comes from the dusk sequence, the dark bumper of the car is confused with the dark road.
Figure 5.8: Accuracy is again computed by comparing the ground truth pixels to the inferred segmentation. For each sequence we report individual class accuracies (the diagonal of the confusion matrix), the average accuracy over all classes, and the overall segmentation accuracy (column Global). The average accuracy measure applies equal importance to all 11 classes, despite the widely varying class incidences, and is therefore a harder performance metric than the overall pixel accuracy. Notice how results vary between 81%, for sequence Seq05VD, and 66% for sequence 0006R0. Again, sequence 0006R0 has been the most difficult to segment, as it has been recorded mostly in a parking lot, hence some classes are almost not present in it. Indeed, for this sequence, no example of class bicyclist was present in the training images, therefore this class could not be recognized at all. For overall quality comparison, the baseline obtained by a chance segmentation would achieve a global accuracy of about 9%.
5.4 Comparison to state of the art
In order to be able to compare our results with the state of the art, we have used the same training and test split of the CamVid database as Sturgess et al. [22] and Brostow et al. [3], which is shown in table 5.3. These final results have been obtained by using images from the whole database in a single experiment, that is, images from day scenes as well as dusk scenes have been used together for training and testing. This is a good test for the generalization power of the segmentation system, giving also an insight into how invariant to lighting conditions and other varying parameters our features are. Figure 5.10 shows our results in comparison to the results of Sturgess et al. and Brostow et al. using the same 11-class set detailed in the previous section.
It is important to mention, though, that due to time and memory constraints, the results we compared to the state of the art have been obtained using fewer weak classifiers (200 instead of the 500 used in the results shown previously) and also with only around 50% of the images from the CamVid database. Our results would probably improve if we had used more weak classifiers (see Section 5.2.1) and also all the images in the database.
Figure 5.9: (a) Examples of images to be segmented. (b) Ground truth annotation for the 11 classes considered. (c) Our segmentation using all three potentials implemented: texture-layout, edge and location. Notice that some of the most important classes for driver behavior prediction, like road, cars, buildings and sidewalk, are still well recognized for the 11-class set. Unfortunately, other more finely structured classes like signs and pedestrians, which are also important for driver behavior prediction, are not well recognized. Suggestions to overcome this problem are given in Chapter 6.
Figure 5.10: Comparison of our results with the state-of-the-art road-scene segmentation systems of Sturgess et al. [22] and Brostow et al. [3]. Notice how, as in the results shown for the segmentation of each sequence separately, finely detailed structures like column-poles, bicyclists and pedestrians are almost not recognized at all by our system. Sturgess et al. obtain better results on these classes probably because they use more appropriate features to represent them, like histograms of gradients (HoGs), for instance. Brostow et al. use only structure from motion 3D cues, which have proven more robust to variations in scene lighting, yielding also very good results for most classes. According to Sturgess et al. and Brostow et al., the use of shared weak classifiers for the training of the texture-layout classifiers improves the generalization power of the classification. The fact that we did not share weak classifiers in our training may also have been a reason why our results were in general worse than the state-of-the-art ones.
Chapter 6
Conclusions
This thesis project has been highly pleasing, as it was possible to put into practice many of the concepts learned during the Vibot Masters while investigating cutting-edge techniques in a very interesting field like the development of driver assistance systems. It was also very important to understand, from the beginning of the project, the importance of semantically segmenting inner-city road scenes as a step towards the final goal of predicting the driver behavior. Although this master's project had to be concluded in only four months, satisfactory results have been achieved. This short period of time had to be efficiently distributed among research of the state of the art, investigation of new techniques, software implementation and report writing. Even if the initial planning of activities was not rigorously put into practice, it helped focus the efforts throughout the thesis so that all different aspects of this project have been given their deserved attention.
The initial research of existing methods and state-of-the-art techniques has shown that significant improvements have been made in the last few years in the field of image segmentation and recognition. Important contributions have been made not only to the segmentation of general scenes but also in our more specific context of inner-city road scenes.
The segmentation system implemented has been based on very up-to-date features and segmentation techniques, which had to be efficiently adapted taking into account our goal of driver behaviour prediction. Some aspects of the implementation of the techniques used have been simplified due to our strict time constraints. However, these simplifications, like not sharing weak classifiers during boosting, have been carefully assessed so that the accomplishment of the main goals of the thesis project was not jeopardized.
The quantitative as well as the qualitative results of the segmentation of the challenging CamVid database have fulfilled our expectations. Although we did not obtain results with the same quality as the state-of-the-art ones, parallel research [11] within the group in which this thesis was developed showed that they were good enough for predicting, in a basic way, the driver's behaviour. In this parallel work, state-of-the-art techniques have been applied to model the driver's behaviour based on the segmentation of the road scene. It has been noticed, by applying these techniques, that the quality of the behaviour prediction using the ground truth segmentations was not much better than the quality achieved using the segmentations from the system implemented in this master's thesis. The conclusion is that, although the quality of the segmentation has an impact on the final quality of the behaviour prediction, more effort has to be invested in improving the behaviour prediction than in the semantic segmentation itself.
Future work
In an attempt to encourage the continuation of the work performed in this master's thesis, we propose some ideas that might help improve the quality of the segmentation system implemented:

Depth-adaptive scaling of feature vectors

In our semantic segmentation system, the feature vector defined in Eq. 4.7 is obtained by convolution with the same filter bank for every image pixel i. Suppose, now, that we adapt the scale of the filter bank used for calculating the feature vector according to the depth of pixel i in the image. By doing so, we could, for example, represent the texture of a sidewalk in an image by one single feature vector cluster. Figure 6.1 illustrates the working principle of the depth-adaptive scaling feature extraction.
The depth information necessary for this algorithm could be inferred from the same structure from motion techniques used to obtain the 3D cues described in section 3.1.2. It could also be inferred for every image at test time by using an automatic scale recognition technique, as done in SIFT by finding the peak response of a multi-scale pyramid convolution.
Addition of more features
One of the probable reasons why our results were not as good as those from Sturgess et al. [22], which is the state of the art in the field of road scene segmentation and classification, is that we did not use as many features as they did. They used, for instance, histograms of gradients, HoGs, to describe the orientation of local edges. By doing so, they could better discriminate finely detailed structures like signs, bicyclists, cars and so on. Sturgess et al. clustered these features separately from the textons, letting the boosting procedure decide, for each classifier, whether to use texton features or other features.
Figure 6.1: Notice how the texture of the sidewalk changes its scale as it gets farther from the car camera. If we could estimate depth information and use it to adapt the scale of our filter bank convolution, we would be able to cluster class features more appropriately. In the diagram, the blue circle represents a filter bank scale of 2.5 and the red circle a scale of 1. With such adapted scales, feature vectors all over the sidewalk would be very similar and easier to learn. This would be similarly valid for other classes.

Further optimization of C++ code

State-of-the-art implementations are faster than ours. For instance, Sturgess et al. train their system with more images than we train ours and manage to segment an unseen image in 30 to 40 seconds, while we need 3 minutes. As Figure 5.4 suggests, the number of weak classifiers used to obtain a good segmentation could be reduced from 500 to around 200; this would make the classification much faster. We could also save a lot of time by integrating the whole system into a single C++ program.
Bibliography
[1] J. Besag. Spatial interaction and the statistical analysis of lattice systems (with discussion). J. of Royal Statist. Soc., series B, 36(2):192-326, 1974.
[2] Y. Boykov and M.-P. Jolly. Interactive graph cuts for optimal boundary and region segmentation of objects in n-d images. Computer Vision, volume 1, pages 105-112, July 2001.
[3] G. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla. Segmentation and recognition using structure from motion point clouds. ECCV, 2008.
[4] G. J. Brostow, J. Fauqueur, and R. Cipolla. Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters, 30(2):88-97, 2009.
[5] DARPA. Darpa urban challenge rulebook. http://www.darpa.mil/GRANDCHALLENGE/docs/Urban Challenge Rules 102707.pdf.
[6] A. Ess, T. Müller, H. Grabner, and L. van Gool. Segmentation-based urban traffic scene understanding. British Machine Vision Conference (BMVC), 2009.
[7] X. Feng, C. Williams, and S. Felderhof. Combining belief networks and neural networks for scene segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(4):467-483, April 2002.
[8] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, no. 55, 1997.
[9] R. Hartley and A. Zisserman. Multiple view geometry in computer vision. Cambridge University Press, 2003.
[10] X. He, R. Zemel, and M. Carreira-Perpinan. Multiscale conditional random fields for image labeling. IEEE International Conference on Computer Vision and Pattern Recognition, volume 2, pages 695-702, 2004.
[11] M. Heracles, F. Martinelli, and J. Fritsch. Vision-based behavior prediction in urban traffic environments by scene categorization. British Machine Vision Conference (BMVC), 2010 (submitted).
[12] iCub. RobotCub project, funded by the European Commission. http://www.robotcub.org/.
[13] M. Kass, A. Witkin, and D. Terzopoulos. Snakes: Active contour models. IJCV, 1(4):321-331, 1988.
[14] S. Kumar and M. Hebert. A discriminative framework for contextual interaction in classification. IEEE International Conference on Computer Vision, pages 1150-1157, 2003.
[15] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proceedings of ICML-01, pages 282-289, 2001.
[16] T. Leung and J. Malik. Representing and recognizing the visual appearance of materials using three-dimensional textons. IJCV, 43(1):29-44, 2001.
[17] J. Malik, S. Belongie, T. Leung, and J. Shi. Contour and texture analysis for image segmentation. Int. J. Computer Vision, 43(1):7-27, June 2001.
[18] N. Pican, E. Trucco, M. Ross, D. M. Lane, and Y. Petillot. Texture analysis for seabed classification: Co-occurrence matrices vs self-organizing maps. IEEE, 1998.
[19] C. Rother, V. Kolmogorov, and A. Blake. GrabCut - interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics, 23(3):309-314, August 2004.
[20] E. Saber, A. Tekalp, R. Eschbach, and K. Knox. Automatic image annotation using adaptive color classification. Graphical Models and Image Processing, 58(2):115-126, 1996.
[21] J. Shotton, J. M. Winn, C. Rother, and A. Criminisi. TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. ECCV, volume 1, pages 1-15, 2006.
[22] P. Sturgess, K. Alahari, L. Ladicky, and P. Torr. Combining appearance and structure from motion features for road scene understanding. British Machine Vision Conference (BMVC), 7-10 September 2009.
[23] A. Torralba, K. P. Murphy, and W. T. Freeman. Sharing visual features for multiclass and multi-view object detection. IEEE Trans. on Pattern Analysis and Machine Intelligence, 19(5):854-869, May 2007.
[24] M. Turtinen and M. Pietikaeinen. Contextual analysis of textured scene images. British Machine Vision Conference (BMVC), 2006.
[25] M. Varma and A. Zisserman. A statistical approach to texture classification from single images. Int. J. Computer Vision, 62(1-2):61-81, April 2005.
[26] P. Viola and M. J. Jones. Rapid object detection using a boosted cascade of simple features. Proc. IEEE Conf. Computer Vision and Pattern Recognition, volume 1, pages 511-518, December 2001.
[27] Y. Wang and Q. Ji. A dynamic conditional random field model for object segmentation in image sequences. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 1, pages 264-270, 2005.
[28] J. Winn, A. Criminisi, and T. Minka. Categorization by learned universal visual dictionary. In Proc. Int. Conf. on Computer Vision, volume 2, pages 1800-1807, October 2005.