P(x) = \sum_k \pi_k P_k(x)  (3.1)

where

P_k(x) = \mathcal{N}(x \mid \mu_k, \Sigma_k)  (3.2)

\mu_k and \Sigma_k being the mean and covariance of the individual Gaussian distribution k.
The use of GMMs to model colors in images has also proven very efficient in binary segmentation problems, as shown by Rother et al. [19] with their GrabCut algorithm. In such problems, one wants to separate a foreground object from the background for image editing, object recognition and so on. When possible, user interaction can be very useful to refine the results by giving important feedback after the initial automatic segmentation (see Figure 3.3). However, in most cases, either the number of images to segment is prohibitive or the real-time nature of the segmentation task prevents any user interference at all. Both of these remarks hold in the field of traffic scene segmentation for driver assistance.

Figure 3.3: Segmentation achieved by GrabCut using color GMMs and user interaction.
3.1.5 Texture cues
Along with color, texture information is often considered and can bring significant improvement to the segmentation accuracy, as in [7], where graylevel texture features were combined with color ones. Nowadays, most if not all research effort on segmentation also incorporates texture information. This can be extracted and modeled in two main ways:
1. Statistical models, which try to describe the statistical correlation between pixel colors within a restricted vicinity. Among such methods, co-occurrence matrices have been successfully used, for instance, for seabed classification [18];
Figure 3.4: The LM filter bank has a mix of edge, bar and spot filters at multiple scales and orientations. It has a total of 48 filters: 2 Gaussian derivative filters at 6 orientations and 3 scales, 8 Laplacian of Gaussian filters and 4 Gaussian filters.
2. Filter bank convolution, where the image is convolved with a carefully selected set of filter primitives, usually composed of Gaussians, Gaussian derivatives and Laplacians (see the sketch below). A well known example, the Leung-Malik (LM) filter bank [16], is shown in Figure 3.4. It is interesting to mention that such filter banks have similarities with the receptive fields of neurons in the human visual cortex.
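To make the filter bank approach concrete, the following Matlab sketch (variable names are ours, not from any cited implementation) builds three such filter primitives at a single scale and convolves a grayscale image with them; a full bank such as LM repeats this over several scales and orientations:

```matlab
% Minimal sketch: three LM-style filter primitives at one scale.
% img is assumed to be a grayscale image of class double.
sigma = 2;  hw = ceil(3 * sigma);            % kernel half-width (3-sigma support)
[u, v] = meshgrid(-hw:hw, -hw:hw);
g = exp(-(u.^2 + v.^2) / (2 * sigma^2));     % Gaussian (spot filter)
g = g / sum(g(:));
lg = ((u.^2 + v.^2 - 2 * sigma^2) / sigma^4) .* g;         % Laplacian of Gaussian
theta = pi / 6;                              % one of several orientations
dg = -((u * cos(theta) + v * sin(theta)) / sigma^2) .* g;  % oriented Gaussian derivative
responses = cat(3, conv2(img, g,  'same'), ...
                   conv2(img, lg, 'same'), ...
                   conv2(img, dg, 'same'));  % one response image per filter
```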
3.1.6 Context features
Although color and texture may efficiently characterize image regions, they are far from enough for a high quality semantic segmentation if considered alone. For instance, even humans may not be able to tell apart, when looking only at a local patch of an image, a blue sky from the walls of a blue building. The key aspect of which humans naturally take advantage, and that allows them to unequivocally understand scenes, is the context. Even if one sees a building wall painted with exactly the same color as the sky, one just knows that that wall cannot be the sky because it is surrounded by windows. In the case of road scene segmentation, typical spatial relationships between objects can be a very strong cue: for example, the fact that the car is always on the road, which, in turn, is usually surrounded by sidewalks.
With this in mind, computer vision researchers are now frequently looking beyond low-level features and are more interested in contextual issues [7, 10, 14]. In Section 3.3, an example of how context in images can be exploited for segmentation is described.
3.2 Probabilistic segmentation framework
The choice of image features, described in the previous section, is independent of the theoretical framework or machine learning technique applied for segmentation inference. One can choose the very same features as in [7], where belief networks are used, and process them using Support Vector Machines, for example. In recent years, Conditional Random Fields (CRFs) have played an increasingly central role. CRFs were introduced by Lafferty et al. in [15] and have ever since been systematically used in cutting-edge segmentation and classification approaches like TextonBoost [21], image sequence segmentation [27], contextual analysis of textured scenes [24] and traffic scene understanding [22], to name a few. Conditional Random Fields are based on Markov Random Fields and offer practical advantages for image classification and segmentation. These advantages are explained in the next section, after the formal definition of Markov Random Fields is given.
3.2.1 Conditional Random Fields
In Random Field theory, an image can be described by a lattice S composed of sites i, which can be thought of as the image pixels. The sites in S are related to one another via a neighborhood system, which is defined as N = {N_i, i ∈ S}, where N_i is the set of sites neighbouring i. Additionally, i ∉ N_i and i ∈ N_j ⇔ j ∈ N_i.
Let y denote a labeling configuration of the lattice S belonging to the set of all possible labelings Y. In the image segmentation context, y can be seen as a labeling image, where each of the sites (or pixels) i from the lattice S is assigned one label y_i in the set of possible labels L = {l_1, l_2, l_3, ..., l_N}, which are the object classes. The pair (S, N) can be referred to as a Random Field.
Moreover, (S, N) is said to be a Markov Random Field (MRF) if and only if

P(y) > 0, \quad \forall y \in Y, and  (3.3)

P(y_i \mid y_{S \setminus \{i\}}) = P(y_i \mid y_{N_i})  (3.4)

That means, firstly, that the probability of any defined label configuration must be greater than zero^1 and, secondly and most importantly, that the probability of a site assuming a given label depends only on its neighboring sites. The latter statement is also known as the Markov condition.
According to the Hammersley-Clifford theorem [1], an MRF as defined above can equivalently be characterized by a Gibbs distribution. Thus, the probability of a labeling y can be written as

P(y) = Z^{-1} \exp(-U(y)),  (3.5)

where

Z = \sum_{y \in Y} \exp(-U(y))  (3.6)

^1 This assumption is usually taken for convenience, as it, in practical terms, does not influence the problem.
is a normalizing constant called the partition function, and U(y) is an energy function of the form

U(y) = \sum_{c \in C} V_c(y).  (3.7)

C is the set of all possible cliques, and each clique c has a clique potential V_c(y) associated with it. A clique c is defined as a subset of sites in S in which every pair of distinct sites are neighbours, with single-site cliques as a special case (see Figure 3.5). Due to the Markov condition, the value of V_c(y) depends only on the local configuration of clique c.
Figure 3.5: (a) Example of a 4-pixel neighborhood. (b) Possible unary clique layout. (c) Possible binary clique layouts.
Now let us consider the observation x_i for each site i, which is a state belonging to a set of possible states W = {w_1, w_2, ..., w_n}. In this manner, we can represent the image we want to segment, where each pixel i is assigned one state of the set W. If one thinks of a gray scale image with 8-bit resolution, for example, the set of possible states for each site (or pixel) would be defined as W = {0, 1, 2, ..., 255}. The segmentation problem then boils down to finding the labeling y that best explains the observed states x.
3.3 Example: TextonBoost

TextonBoost, introduced by Shotton et al. [21], casts semantic segmentation as the minimization of a CRF energy combining location, color, texture-layout and edge potentials:

U(y|x, \theta) = \sum_i \Big( \underbrace{\lambda(y_i, i; \theta_\lambda)}_{\text{location}} + \underbrace{\pi(y_i, x_i; \theta_\pi)}_{\text{color}} + \underbrace{\psi_i(y_i, x; \theta_\psi)}_{\text{texture-layout}} \Big) + \sum_{(i,j) \in \varepsilon} \underbrace{\phi(y_i, y_j, g_{ij}(x); \theta_\phi)}_{\text{edge}}  (3.10)

where y is the labeling or segmentation, x is a given image, and \varepsilon is the set of edges in a 4-connected neighborhood.

Color potential

The color potentials model the class-dependent color distributions with shared Gaussian mixture components,

\pi(y_i, x_i; \theta_\pi) = \log \sum_k P(x_i \mid k) \, P(k \mid y_i)  (3.11)

with color clusters (mixture components) P(x_i | k). Notice that the clusters are shared between different classes, and that only the coefficients P(k | y_i) depend on the class label. This makes the model more efficient to learn than a separate GMM for each class, which is important since TextonBoost takes into account a high number of classes.

^4 The MSRC database can be downloaded at http://research.microsoft.com/vision/cambridge/recognition/
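As an illustration of this shared-cluster model, here is a hedged Matlab sketch (our variable names, not the thesis code; requires the Statistics and Machine Learning Toolbox) that fits one set of color clusters, estimates per-class coefficients, and evaluates Eq. 3.11 for a test color:

```matlab
% Sketch (assumed names): shared color clusters with class-dependent
% coefficients, as in Eq. 3.11. X: Kx3 CIELab training colors; lbl: Kx1
% class labels; nClusters, nClasses given.
gm = fitgmdist(X, nClusters);                  % shared clusters P(x|k)
R  = posterior(gm, X);                         % K x nClusters responsibilities
Pky = zeros(nClusters, nClasses);
for y = 1:nClasses
    Pky(:, y) = mean(R(lbl == y, :), 1)';      % P(k|y) estimated per class
end
% Color potential of a test color x (1x3) for every class:
Pxk = zeros(1, nClusters);
for k = 1:nClusters
    Pxk(k) = mvnpdf(x, gm.mu(k, :), gm.Sigma(:, :, k));   % P(x|k)
end
piColor = log(Pxk * Pky);                      % 1 x nClasses, Eq. 3.11
```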
Edge potential

The pairwise edge potentials have the form of a contrast sensitive Potts model [2],

\phi(y_i, y_j, g_{ij}(x); \theta_\phi) = \theta_\phi^T \, g_{ij}(x) \, [y_i \neq y_j],  (3.12)

with [\cdot] the zero-one indicator function:

[\text{condition}] = \begin{cases} 1, & \text{if condition is true} \\ 0, & \text{otherwise} \end{cases}  (3.13)
The edge feature g_{ij} measures the difference in color between the neighboring pixels, as suggested by [19],

g_{ij} = \begin{bmatrix} \exp(-\beta \, \|x_i - x_j\|^2) \\ 1 \end{bmatrix}  (3.14)

where x_i and x_j are three-dimensional vectors representing the CIELab colors of pixels i and j respectively. Including the unit element allows a bias to be learned, to remove small, isolated regions^5. The quantity \beta is an image-dependent contrast term, and is set separately for each image to (2 \langle \|x_i - x_j\|^2 \rangle)^{-1}, where \langle \cdot \rangle denotes an average over the image. The two scalar constants that compose the parameter vector \theta_\phi weight the contrast-sensitive term and the bias, respectively.
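A minimal Matlab sketch (our variable names) of this edge feature over all 4-connected neighbour pairs, with β estimated from the image as just described:

```matlab
% Sketch (assumed names): contrast-sensitive edge feature of Eq. 3.14.
% labImg: HxWx3 CIELab image of class double.
dx = labImg(:, 2:end, :) - labImg(:, 1:end-1, :);   % horizontal neighbour pairs
dy = labImg(2:end, :, :) - labImg(1:end-1, :, :);   % vertical neighbour pairs
sqDx = sum(dx.^2, 3);  sqDy = sum(dy.^2, 3);        % ||x_i - x_j||^2
beta = 1 / (2 * mean([sqDx(:); sqDy(:)]));          % image-dependent contrast term
gH = cat(3, exp(-beta * sqDx), ones(size(sqDx)));   % g_ij = [exp(-beta*||.||^2); 1]
gV = cat(3, exp(-beta * sqDy), ones(size(sqDy)));
```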
Texture-layout potential

The texture-layout potentials are derived from texture-layout filters. Each filter is a pair (r, t) of a rectangular region r and a texton index t, and its response at pixel i is the proportion of pixels inside r, shifted so that its origin lies at i, that have been assigned texton t in the texton map T:

v_{[r,t]}(i) = \frac{1}{\text{area}(r)} \sum_{j \in (r+i)} [T_j = t].  (3.15)
Any part of the region r + i that lies outside the image does not contribute to the feature
response.
An efficient and elegant way to calculate the filter responses anywhere over an image can be achieved with the use of integral images [26]. For each texton t in the texton map T, a separate integral image I^{(t)} is calculated. In this integral image, the value at pixel i = (u_i, v_i) is defined as the number of pixels in the original image that have been assigned to texton t in the rectangular region with top left corner at (1, 1) and bottom right corner at (u_i, v_i):

I^{(t)}(i) = \sum_{j : (u_j \leq u_i) \wedge (v_j \leq v_i)} [T_j = t].  (3.16)
The advantage of integral images is that they can later be used to compute the texture-layout filter responses in constant time: if I^{(t)} is the integral image for texton channel t defined as above, then the feature response is computed as

v_{[r,t]}(i) = \big( I^{(t)}(r_{br}) - I^{(t)}(r_{bl}) - I^{(t)}(r_{tr}) + I^{(t)}(r_{tl}) \big) / \text{area}(r)  (3.17)

where r_{br}, r_{bl}, r_{tr} and r_{tl} denote the bottom right, bottom left, top right and top left corners of rectangle r.
Figure 3.8: Graphical explanation of texture-layout filters extracted from [21]. (a, b) An image and its corresponding texton map (colors represent texton indices). (c) Texture-layout filters are defined relative to the point i being classified (yellow cross). In this first example feature, region r_1 is combined with texton t_1 in blue. (d) A second feature where region r_2 is combined with texton t_2 in green. (e) The response v_{[r_1,t_1]}(i) of the first feature is calculated at three positions in the texton map (magnified). In this example, v_{[r_1,t_1]}(i_1) ≈ 0, v_{[r_1,t_1]}(i_2) ≈ 1, and v_{[r_1,t_1]}(i_3) ≈ 1/2. (f) The second feature (r_2, t_2), where t_2 corresponds to grass, can learn that points i (such as i_4) belonging to sheep regions tend to produce large values of v_{[r_2,t_2]}(i), and hence can exploit the contextual information that sheep pixels tend to be surrounded by grass pixels.
Texture-layout features are sufficiently general to allow for an automatic learning of layout and context information. Figure 3.8 illustrates how texture-layout filters are able to model textural context and layout.
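As a worked illustration of Eqs. 3.16 and 3.17, the Matlab sketch below (our variable names, not the thesis implementation) builds the integral image for one texton channel and evaluates a single filter response in constant time:

```matlab
% Sketch (assumed names): one texture-layout filter response via a
% zero-padded integral image, following Eqs. 3.16 and 3.17.
% T: HxW texton map; t: texton index; i = [ui vi]: pixel being classified;
% r = [du1 dv1 du2 dv2]: top-left and bottom-right offsets of r relative to i.
[H, W] = size(T);
It = zeros(H + 1, W + 1);
It(2:end, 2:end) = cumsum(cumsum(T == t, 1), 2);   % I^(t), Eq. 3.16
u1 = min(max(i(1) + r(1), 1), H + 1);  v1 = min(max(i(2) + r(2), 1), W + 1);
u2 = min(max(i(1) + r(3), 0), H);      v2 = min(max(i(2) + r(4), 0), W);
areaR = (r(3) - r(1) + 1) * (r(4) - r(2) + 1);     % area(r), full rectangle
% Four-corner lookup, Eq. 3.17; the clamping makes parts of r + i that lie
% outside the image contribute zero, as required.
v = (It(u2+1, v2+1) - It(u1, v2+1) - It(u2+1, v1) + It(u1, v1)) / areaR;
```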
Boosting of texture-layout filters

A boosting algorithm iteratively selects the most discriminative texture-layout filters (r, t) as weak learners and combines them into a strong classifier used to derive the texture-layout potential in Eq. 3.10. The boosting scheme used in TextonBoost shares each weak learner between a set of classes C, so that a single weak learner classifies for several classes at once. According to the authors, this allows for classification with cost sub-linear in the number of classes, and leads to improved generalization.
The strong classifier learned is the sum over the classification confidences h_i^m(y_i) of M weak learners,

H(y_i, i) = \sum_{m=1}^{M} h_i^m(y_i)  (3.18)

The confidence value H(y_i, i) for pixel i is then simply multiplied by a negative constant (so that a positive confidence turns into a negative energy, which will be preferred in the energy minimization) to give the texture-layout potentials \psi_i used in Eq. 3.10:

\psi_i(y_i, x; \theta_\psi) = -\kappa \, H(y_i, i)  (3.19)
Each weak learner is a decision stump based on the feature response v_{[r,t]}(i), of the form

h_i(c) = \begin{cases} a \, [v_{[r,t]}(i) > \theta] + b, & \text{if } c \in C \\ k_c, & \text{otherwise,} \end{cases}  (3.20)

with parameters (a, b, k_c, \theta, C, r, t). The region r and texton index t together specify the texture-layout filter feature, and v_{[r,t]}(i) denotes the corresponding feature response at position i. For the classes that share this feature (c ∈ C), the weak learner gives h_i(c) = a + b or h_i(c) = b, depending on whether v_{[r,t]}(i) is, respectively, greater or lower than the threshold \theta. For classes not sharing the feature (c ∉ C), the constant k_c ensures that unequal numbers of training examples of each class do not adversely affect the learning procedure.
In order to choose the weak classifiers, TextonBoost uses the standard boosting algorithm introduced by Schapire et al. in [8], which is explained here for completeness. Suppose we are choosing the m-th weak classifier. Each training example i, a pixel in a training image, is paired with a target value z_i^c ∈ {−1, +1}, where +1 means that pixel i has ground truth class c and −1 that it does not, and assigned a weight w_i^c specifying its classification accuracy for class c after the m − 1 previous rounds of boosting. The m-th weak classifier is chosen by minimizing an error function J_{error} weighted by w_i^c:

J_{error} = \sum_i w_i^c \, (z_i^c - h_i^m(c))^2  (3.21)
The training examples are then re-weighted:

w_i^c := w_i^c \, e^{-z_i^c \, h_i^m(c)}  (3.22)
Minimizing the error function J_{error} requires, for each new weak classifier, an expensive brute-force search over the possible sharing classes in C, features (r, t), and thresholds \theta. As shown in [21], however, given these parameters, a closed form solution does exist for a, b and k_c.
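For concreteness, a small Matlab sketch (our names; the minus sign in the re-weighting follows the standard algorithm of [8]) of the error evaluation and re-weighting for one boosting round:

```matlab
% Sketch (assumed names): weighted error and re-weighting for one boosting
% round (Eqs. 3.21-3.22). z, w, h: 1xK targets in {-1,+1}, weights, and
% weak-classifier responses on the K training examples.
Jerr = sum(w .* (z - h).^2);      % Eq. 3.21: pick the candidate minimizing this
w = w .* exp(-z .* h);            % Eq. 3.22: well-classified examples lose weight
w = w / sum(w);                   % keep weights normalized (common practice)
```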
3.4 Application to road scenes (Sturgess et al.)

In the more specific field of road scene segmentation, Sturgess et al. [22] have recently quite successfully segmented inner-city road scenes into 11 different classes. Their method builds on the work of Shotton et al. (see Section 3.3) and on that of Brostow et al. [3], integrating the appearance-based features from TextonBoost with the structure-from-motion features from Brostow et al. (see Section 3.1.2) in a higher order CRF. According to the authors, the use of higher-order cliques (that is, cliques with several pixels, instead of only pairs of pixels as in TextonBoost) produces accurate segmentations with precise object boundaries. Figure 3.9 shows how Sturgess et al. use an unsupervised meanshift segmentation of the input image to obtain regions that are used as higher-order cliques and included in the energy function U to be minimized.
Figure 3.9: The original image (left), its ground truth labelling (centre) and the meanshift segmentation of the image (right). The segments in the meanshift segmentation on the right are used to define higher-order potentials, allowing for more precise object boundaries in the final segmentation.
Sturgess et al. achieved an overall accuracy of 84%, compared to the previous state-of-the-art accuracy of 69% [3], on the challenging CamVid database [4]. The work of Sturgess et al. is therefore especially important for this thesis, as it successfully tackles the same inner-city scene segmentation problem. The CamVid database will be described in more detail in Chapter 5, where the results obtained by our implementation are compared with those of Sturgess et al. [22].
Chapter 4
Methodology
4.1 CRF framework
After thorough consideration of related work, CRFs have been deemed very suitable and up-to-date for dealing with the problem proposed in this thesis project. As discussed in Section 2, conditional random fields allow the incorporation of a large variety of cues in a single, unified model. Moreover, state-of-the-art work in the field of image segmentation (see Section 3.3, TextonBoost) and also more specifically in the domain of inner-city road scene understanding (see Section 3.4, Sturgess et al.) has used CRFs. Sturgess et al. have been able to very successfully segment eleven different classes in road scenes, some of which are very important to our final goal of driver behavior prediction.
4.2 Basic model: location and edge potentials
Location and edge cues, as mentioned in Section 3.1, are very meaningful and can significantly contribute to the quality of any segmentation. In our case, location cues are all the more important because we deal with a very spatially structured scene. The road will, for example, never be at the top of the image and the sky will never be at the bottom. We can thus extract precious information as to where to expect our classes to be located in the picture.
If, for a better understanding of the problem, we consider at first a model with just the location and edge potentials, then the energy function to be minimized in order to infer the most likely labeling becomes
U(y|x, \theta) = \sum_i \underbrace{\lambda(y_i, i; \theta_\lambda)}_{\text{location}} + \sum_{(i,j) \in \varepsilon} \underbrace{\phi(y_i, y_j, g_{ij}(x); \theta_\phi)}_{\text{edge}}.  (4.1)
The location potential is calculated based on the incidence, over all the training images, of each class at each pixel:

\lambda(y_i, i; \theta_\lambda) = -\log \left( \frac{N_{y_i,i} + \alpha}{N_i + \alpha} \right)  (4.2)

where N_{y_i,i} is the number of pixels at position i assigned class y_i in the training images, N_i is the total number of pixels at position i, and \alpha is a small smoothing constant. The edge potential is defined exactly as in TextonBoost (Eq. 3.12),

\phi(y_i, y_j, g_{ij}(x); \theta_\phi) = \theta_\phi^T \, g_{ij}(x) \, [y_i \neq y_j],  (4.3)

with [\cdot] the zero-one indicator function. The edge feature g_{ij} measures the difference in color between the neighboring pixels, as suggested by [19],

g_{ij} = \begin{bmatrix} \exp(-\beta \, \|x_i - x_j\|^2) \\ 1 \end{bmatrix}  (4.4)
With the help of an intuitive example, shown in Figure 4.2a, we can see how location and edge potentials interact, resulting in a meaningful segmentation. In this example, we want to segment the toy image into three different classes: background, foreground-1 and foreground-2. Figures 4.2b, 4.2d and 4.2f show the unary location potentials \lambda(y_i, i; \theta_\lambda) for each of the three classes.

4.3 Texture potential model

Adding a texture potential to the basic model, the energy function becomes

U(y|x, \theta) = \sum_i \Big( \underbrace{\lambda(y_i, i; \theta_\lambda)}_{\text{location}} + \underbrace{\psi_i(y_i, x; \theta_\psi)}_{\text{texture}} \Big) + \sum_{(i,j) \in \varepsilon} \underbrace{\phi(y_i, y_j, g_{ij}(x); \theta_\phi)}_{\text{edge}}  (4.5)
Note that the texture potential represents local texture only, i.e., it does not take into account context. It is merely a local feature. Context and layout are explored in Section 4.4, where the use of simplified texture-layout filters is investigated.
In order to represent the texture information of the images to segment, we opted, similarly to TextonBoost [21], for the use of filter banks. By using an N-dimensional filter bank F, one obtains an N-dimensional feature vector f_x(i) for each pixel i. Each component of this vector is the result of the convolution of the input image converted to grayscale, x, with the corresponding filter, evaluated at the position of i:

f_x(i) = \begin{bmatrix} (F_1 * x)|_i \\ (F_2 * x)|_i \\ \vdots \\ (F_N * x)|_i \end{bmatrix}  (4.6)

Equivalently, the result of the convolution of an N-dimensional filter bank with an image can be understood by considering the convolution of the image with each component of the filter bank, one at a time. Figure 4.3 shows an example input image, and the response images for some of the Leung-Malik filter bank components [16].
4.3.1 Feature vector and choice of filter bank
The choice of the filter bank used to represent the texture in the images to be segmented was based on the following criteria:
- Good coverage of possible textures without too much redundancy between filters;
- Fast and efficient filter response calculation;
- Ready-to-use implementation available.

Figure 4.3: (a) Example of inner-city road scene image. (b-f) Examples of responses of five different filter components of the LM filter bank, which are shown at the bottom left corner of each figure.
Considering those criteria, a very interesting implementation by the Intelligent Systems Lab of the University of Amsterdam has been found. It is implemented as a Matlab .mex file, which means it is C code that is pre-compiled and then called by Matlab at execution time. The libraries are freely available for research purposes^2.
Using this fast .mex implementation, 5 different filter banks have been assessed by segmenting images using only the texture potential in Eq. 4.5. Four classes have been considered: road, sidewalk, others^3 and sky.
The filter banks assessed were the following:
MR8: The MR8 filter bank consists of 38 filters but only 8 filter responses. The filter bank contains filters at multiple orientations, but their outputs are collapsed by recording only the maximum filter response across all orientations (see Figure 4.4);
MR8 - no maxima: The rotation invariance of the MR8 filter bank, achieved by taking only the maximum response over all orientations, may not be a desired property of a texture filter bank used for segmentation, since some classes could be described by the orientation of their features. Therefore, a filter bank called MR8 - no maxima has been defined, where all 38 responses are kept;

^2 Source code at: http://www.science.uva.nl/~mark.
^3 Class others is assigned to any pixel that is not labeled as one of the other three classes; it can thus be seen as the complement of the other three classes.
MR8 - separate channels: Here, the MR8 filter bank is applied individually to each of the three color channels, in an attempt to verify whether discriminative texture information is unevenly distributed over the color channels;
MR8 - multi-scale: This filter bank is composed of three MR8 filter banks at three subsequent scales. Although the MR8 filter bank itself already uses filters at different scales, we found it interesting to try to cover even more scales, as road scenes almost always contain objects whose distance may vary over many orders of magnitude^4;
TextonBoost's filter bank: This filter bank has 17 dimensions and is based on the CIELab color space. It consists of Gaussians at scales k, 2k and 4k, x and y derivatives of Gaussians at scales 2k and 4k, and Laplacians of Gaussians at scales k, 2k, 4k and 8k. The Gaussians are applied to all three color channels, while the other filters are applied only to the luminance.
Figure 4.4: The MR8 filter bank is low dimensional, rotationally invariant and yet capable of picking out oriented features. Note that only the maximum response of the filters in each of the first 6 rows is taken.
As all the filter banks (except MR8 - separate channels and TextonBoost's filter bank) are convolved with grayscale images, we also concatenated to the texture feature vector f_x(i), which is the response of the filter bank, the L, a and b color values of the corresponding pixel:

f_x(i) = \begin{bmatrix} f_x(i) \\ L_i \\ a_i \\ b_i \end{bmatrix}  (4.7)

In this manner, the color information was merged with the texture, giving an extra cue to the Adaboost classifiers^5.

^4 For instance, there might be a car immediately in front of the camera but also another one tens of meters away.
^5 Tests have been performed with different color spaces, yielding the best results when CIELab was used. This comes from the fact that the CIELab color space is partially invariant to scene lighting modifications: only the L dimension changes, in contrast to the three dimensions of the RGB color space, for instance.
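For concreteness, a hedged Matlab sketch of how such a per-pixel feature vector can be assembled (our names; assumes the Image Processing Toolbox for the color conversions):

```matlab
% Sketch (assumed names): per-pixel feature vectors of Eq. 4.7, i.e.
% filter-bank responses on the grayscale image with CIELab values appended.
grayImg = rgb2gray(img);                   % img: HxWx3 RGB image, double in [0,1]
labImg  = rgb2lab(img);
[H, W]  = size(grayImg);
N = numel(filterBank);                     % filterBank: cell array of 2-D kernels
F = zeros(H, W, N + 3);
for n = 1:N
    F(:, :, n) = conv2(grayImg, filterBank{n}, 'same');   % Eq. 4.6
end
F(:, :, N+1:N+3) = labImg;                 % append L, a and b (Eq. 4.7)
f = reshape(F, H * W, N + 3)';             % column f(:, i) is the vector f_x(i)
```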
The results of the tests showed that the filter bank that yielded the best segmentation results, and thus best represented the texture information in the road scene images, was MR8 - multi-scale. This is probably due to the aforementioned fact that road scene images contain similar objects and regions that may vary greatly in depth. This variation is well captured by the multi-scale characteristic of the MR8 - multi-scale filter bank.
Combination of 3D cues into the feature vector
As discussed in Section 3.1.2, 3D information can be extracted from images in a video sequence using structure from motion techniques. Those techniques can only infer the 3D position of characteristic points in the image, that is, points that can be located, described and then matched in subsequent images. In this thesis this has been done using the Harris corner detector, with normalized cross-correlation over patches for matching. Other possible patch descriptors are, for example, SIFT and SURF.
All 3D features mentioned in Section 3.1.2 have been concatenated, just like the L, a and b color values, to the feature vector described in Eq. 4.7:
f_x(i) = \begin{bmatrix} f_x(i) \\ \text{3Dfeature}_1(i) \\ \vdots \\ \text{3Dfeature}_5(i) \end{bmatrix}  (4.8)
However, in order to include these 3D cues in our feature vector, they need to be defined for every pixel of an input image. That means that we have to transform the sparse 3D features obtained using reconstruction techniques into dense features. This can be done by interpolation, where every pixel is assigned 3D feature values based on the values of the sparse neighbors that could be determined with reconstruction techniques.
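One way to implement such a densification in Matlab is with scattered-data interpolation (a sketch with our variable names; the thesis does not specify the interpolation method):

```matlab
% Sketch (assumed names): densifying one sparse 3D feature by interpolation.
% pts: Px2 pixel positions (row, column) of reconstructed points;
% vals: Px1 feature values, e.g. height above ground; H, W: image size.
Fint = scatteredInterpolant(pts(:, 1), pts(:, 2), vals, 'linear', 'nearest');
[V, U] = meshgrid(1:W, 1:H);               % all pixel coordinates
dense = reshape(Fint(U(:), V(:)), H, W);   % dense HxW feature image
```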
Figure 4.5 shows an example of dense interpolation of the 3D feature height above ground for an image taken from the CamVid database.

Figure 4.5: (a) A dusk image taken from the CamVid database. (b) The calculated height above ground 3D feature. After determining a point cloud with structure from motion techniques, the sparse features have been interpolated so as to yield a dense representation. Notice how the sky has high values and that we can see a faint blob where the car is located in the original image.
It is important to mention that, before concatenating them to the feature vector as shown in Eq. 4.8, the 3D features have been appropriately normalized. The normalization guarantees that they do not overshadow the texture and color features during the clustering process, which could happen if the values of the 3D features were much greater than the values of the other features. Since the clustering method implemented uses Euclidean distances, such an imbalance in the feature values would result in biased cluster centers. The influence of the use of 3D features on the segmentation results is discussed in Chapter 5.
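One simple way to do this (a sketch; the thesis does not specify the exact normalization scheme) is per-dimension z-scoring of the 3D rows of the feature matrix:

```matlab
% Sketch (assumed names): z-scoring the five 3D feature rows of the QxK
% feature matrix f so that no cue dominates the Euclidean distances used
% in clustering (uses implicit expansion, Matlab R2016b or later).
f3d = f(end-4:end, :);
f(end-4:end, :) = (f3d - mean(f3d, 2)) ./ std(f3d, 0, 2);
```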
4.3.2 Boosting of feature vectors
Having defined the feature vector as in Eq. 4.8, we then need to find patterns in the features extracted from training images and try to recognize them in new, unseen images. For instance, we want to learn which texture, color and 3D cues are typical of each of the classes we want to segment. Some of the machine learning techniques suitable for this task are neural networks, belief networks or Gaussian Mixture Models in the N-dimensional space (where N is the number of filters in the filter bank). Nonetheless, an Adaboost approach has been preferred for its generalization power and ease of use.
A short overview of the way Adaboost works is given here. For more details about its implementation and theoretical grounds, please see [8]. For this thesis project we have utilized a ready-to-use Matlab implementation from Moscow State University^6.

Figure 4.6: Example of training procedure for classifier road. The Q×K data matrix D is represented by the red vectors whereas the 1×K label vector L is indicated by the green arrows.
Note that, since we are dealing with binary Adaboost classification, a classifier is trained for each of the classes we want to segment, in a one-versus-all manner. For the training of each classifier, a learning data matrix D ∈ R^{Q×K} is taken as input by the Adaboost trainer. Matrix D has size Q×K, where Q is the number of dimensions^7 of the feature vector from Eq. 4.8 and K is the number of feature vectors used for training (the feature vectors are extracted from pixels in the training images). Another input, a 1×K vector L ∈ {0, 1}^{1×K}, contains the labels of the training data D. Vector L is comprised of ones for the pixels belonging to the class of the classifier being trained, and zeros otherwise. Figure 4.6 illustrates how individual classifiers for each class are trained.
The Adaboost classifier of class c is composed of M stump weak classifiers h_c(f),

h_c(f) = \begin{cases} 1, & \text{if } f_p > \theta \\ 0, & \text{otherwise} \end{cases}  (4.9)

where f_p is the p-th dimension of vector f and \theta is a threshold. The strong classifier H_{class_c}(f(i)) is built by choosing the most discriminative weak learners, minimizing the error with respect to the target value, as explained in Section 3.3.2. Figure 4.7 shows how a trained classifier outputs a confidence value between zero and one for feature vectors from unseen images.
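A hedged Matlab sketch of how such a strong classifier is evaluated (our names; the per-stump voting weights alpha are an assumption about the library, following standard Adaboost):

```matlab
% Sketch (assumed names): evaluating a strong classifier built from M
% decision stumps of Eq. 4.9 on one feature vector f.
conf = 0;
for m = 1:numel(stumps)             % stumps: struct array (fields p, theta, alpha)
    h = double(f(stumps(m).p) > stumps(m).theta);   % weak response, Eq. 4.9
    conf = conf + stumps(m).alpha * h;              % weighted vote
end
% Dividing conf by sum([stumps.alpha]) maps the confidence into [0, 1].
```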
Figure 4.7: Given a trained classifier, a classification confidence is computed based on how similar the input feature vector is to the positive examples, and on how different it is from the negative ones, provided in the training phase illustrated in Figure 4.6.

^6 Source code available at http://graphics.cs.msu.ru/en/science/research/machinelearning/modestada.
^7 Q = N (number of dimensions of the filter bank) + 3 (L, a, b) + 5 (3D features).

Once we have defined a strong classifier H for each class, the texture potential of Eq. 4.5
can be defined as:

\psi_i(y_i, x; \theta_\psi) = -\kappa \, H_{class_{y_i}}(f_x(i))  (4.10)
The output of the strong classifier H_{class_{y_i}}(f_x(i)) is multiplied by a negative constant -\kappa, so that a positive confidence turns into a negative energy, which will be preferred in the energy minimization. \theta_\psi is the set of all parameters used in the Adaboost training of H, for instance, the number of weak classifiers.
4.3.3 Adaptive training procedure
In order to make the training of the Adaboost classifiers more tractable, not every pixel of every training image has been selected to build the training data matrix D. Since there is a lot of redundancy between pixels, this simplification has not adversely affected the quality of the Adaboost classifiers.
Although the selection of pixels for the extraction of training feature vectors was initially random, a smarter and innovative algorithm has been developed.
The adaptive training procedure works by choosing, in an iterative way, an unequal proportion of feature vectors from each label. The idea is that, based on the confusion matrix of a given segmentation experiment, we know the strengths and weaknesses of the classifiers trained. For instance, suppose that for a given segmentation experiment class sky is not confused as much as street and sidewalk. Then, it is reasonable to choose, in the next segmentation experiment, more feature vectors from classes street and sidewalk and fewer feature vectors from class sky for the training of classifiers street and sidewalk.
Formally, if we represent the weight (or proportion) of training feature vectors from class i, used in the Adaboost training of classifier j, as W_{ij}, the update of every weight after each
segmentation iteration (experiment) can be expressed as:

W_{ij} = \begin{cases} W_{ij} \, e^{\eta \, Cm_{ij}} / Z, & \text{if } i \neq j \\ W_{ij} \, e^{\eta \, (1 - Cm_{ij})} / Z, & \text{if } i = j \end{cases}  (4.11)
where Cm_{ij} is the element in the i-th row and j-th column of the confusion matrix of the previous segmentation iteration, \eta is a learning speed factor and Z is a normalization factor that guarantees that

\sum_i W_{ij} = 1,  (4.12)

or, in other words, that the sum of the proportions of feature vectors from each class remains equal to 1. The weights are all equally initialized as W_{ij} = 1/N_c, with N_c representing the number of classes.
Notice that in the case of a perfect segmentation, where the confusion matrix is equal to the identity matrix, the proportion of training feature vector samples W_{ij} does not change.
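A compact Matlab sketch of this update (our variable names):

```matlab
% Sketch (assumed names): one update of the sampling proportions W,
% following Eqs. 4.11-4.12. Cm: confusion matrix of the previous
% iteration; eta: learning speed factor.
Nc = size(W, 1);
d  = logical(eye(Nc));                      % diagonal mask
W(~d) = W(~d) .* exp(eta * Cm(~d));         % grow where classes are confused
W(d)  = W(d)  .* exp(eta * (1 - Cm(d)));    % grow where a class is poorly recognized
W = W ./ sum(W, 1);                         % Eq. 4.12: each column sums to 1
```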
Although the adaptive learning algorithm considerably improved the segmentation quality (see Section 5.1), the use of local features alone is intrinsically limited. As precise and discriminative as a classifier may be, there are cases where class sidewalk is virtually identical to class road for every local feature imaginable. The natural next step towards a better segmentation is to use context information. Then, the fact that sidewalks normally run alongside roads, separating them from buildings or other regions, can be exploited and help us correctly differentiate what is locally indistinguishable.
4.4 Texture-layout potential model (context)
In order to model contextual information, we opt for the texture-layout features introduced by TextonBoost. This new potential replaces the texture potential explained in the previous section, as it is more general. We then have the following energy function:

U(y|x, \theta) = \sum_i \Big( \underbrace{\lambda(y_i, i; \theta_\lambda)}_{\text{location}} + \underbrace{\psi_i(y_i, x; \theta_\psi)}_{\text{texture-layout}} \Big) + \sum_{(i,j) \in \varepsilon} \underbrace{\phi(y_i, y_j, g_{ij}(x); \theta_\phi)}_{\text{edge}}  (4.13)
In this equation, the texture-layout potentials are defined similarly to the way they are defined in TextonBoost:

\psi_i(y_i, x; \theta_\psi) = -\kappa \, H(y_i, i)  (4.14)
The confidence H(y_i, i) is the output of a strong classifier found by boosting weak classifiers,

H(y_i, i) = \sum_{m=1}^{M} h^m_{y_i}(i)  (4.15)
Each weak classifier, in turn, is defined based on the response of a texture-layout filter:

h^m_{y_i}(i) = \begin{cases} a, & \text{if } v_{[r,t]}(i) > \theta \\ b, & \text{otherwise,} \end{cases}  (4.16)
Notice the difference from the definition in Eq. 3.20 of TextonBoost: bearing in mind our final goal of behavior prediction, we do not need to classify as many classes as TextonBoost, where up to 32 different classes are segmented. TextonBoost shares weak classifiers because the computation cost then becomes sub-linear in the number of classes. Since we do not need as many classes, it is possible for us to simplify the calculation of the strong classifiers by not sharing weak classifiers. Therefore, in our approach, each strong classifier has its own, exclusive weak classifiers.
The texture-layout filter response v_{[r,t]}(i) is the proportion of pixels in the input image, among all those lying in the rectangle r with its origin shifted to pixel i, that have been assigned texton t in the textonization process illustrated in Section 3.3.2:

v_{[r,t]}(i) = \frac{1}{\text{area}(r)} \sum_{j \in (r+i)} [T_j = t].  (4.17)
4.4.1 Training procedure
We used, for our textonization process, the same feature vector definition as in Eq. 4.8, which contains texture, color and 3D cues.
In order to build a strong classifier (note that we need to train one strong classifier for each of the classes we want to segment our image into), weak classifiers are added one by one following this boosting procedure:
1. Generation of weak classifier candidates: Each weak classifier is composed of a texture-layout filter (r, t) and a threshold \theta. The candidates are generated by randomly choosing a rectangular region r inside a bounding box, a texton index t ∈ T = {1, 2, ..., K}, where K is the number of clusters used in the textonization process, and finally a threshold \theta between 0 and 1. For the addition of each weak classifier, an arbitrary number of candidates, N_{cd}, is generated.
2. Calculation of parameters a and b for all candidates: Each weak classifier candidate must also be assigned values a and b so that its response h^m_c(i) is fully defined (see Eq. 4.16). As described by Torralba et al. [23], who use the same boosting approach (except that ours does not share weak classifiers), a and b can be calculated in closed form (see the sketch after this list):

b = \frac{\sum_i w^c_i \, z^c_i \, [v_{[r,t]}(i) \leq \theta]}{\sum_i w^c_i \, [v_{[r,t]}(i) \leq \theta]},  (4.18)

a = \frac{\sum_i w^c_i \, z^c_i \, [v_{[r,t]}(i) > \theta]}{\sum_i w^c_i \, [v_{[r,t]}(i) > \theta]},  (4.19)
where c is the label for which the classifier is being trained, z^c_i = +1 or z^c_i = −1 for pixels i whose ground truth label is, respectively, c or different from c, and the w^c_i are the classification accuracy weights used by Adaboost (see Section 3.3.2).
Note, from Eq. 4.18 and Eq. 4.19, that, for the calculation of a and b, the response of the texture-layout filters, v_{[r,t]}(i), must be calculated for all training pixels i and compared to the threshold \theta.
3. Search for the best weak classifier candidate: Once each weak classifier is fully defined, that is, all parameters (r, t, \theta, a, b) are determined, the most discriminative among the candidates is found by minimizing the error function with respect to the target values z^c_i.
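In Matlab, steps 2 and 3 reduce to a few weighted sums per candidate (a sketch with our variable names):

```matlab
% Sketch (assumed names): closed-form a and b of Eqs. 4.18-4.19 for one
% weak classifier candidate (r, t, theta). v: 1xK responses v_[r,t](i) on
% the K training pixels; z: 1xK targets in {-1,+1}; w: 1xK Adaboost weights.
above = v > theta;
a = sum(w(above)  .* z(above))  / sum(w(above));     % Eq. 4.19
b = sum(w(~above) .* z(~above)) / sum(w(~above));    % Eq. 4.18
% Weighted error of this fully defined candidate, used in step 3:
Jerr = sum(w .* (z - (a * above + b * ~above)).^2);
```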
In Chapter 5 we see how texture-layout strong classifiers can learn the context between objects. We also observe how the number of weak classifiers influences the segmentation quality.
4.4.2 Practical considerations
System architecture
Due to the short period of time available for this thesis work, the implementation of the software had to be efficient and fast. Owing to its flexibility, and the variety of ready-to-use image processing, statistics, plotting and other functions available, Matlab has been the preferred tool for the implementation of the solution.
Conditional Random Fields are, however, intrinsically highly demanding in computational resources. This is due to the iterative nature of the minimization procedure of the cost function U, detailed in Section 3.2.1. As Matlab is an interpreted programming language, it is significantly slower at processing loops than compiled languages such as C or C++. Therefore, Matlab has proven unable to cope with the massive calculations needed for the segmentation inference, when the cost function U is minimized.
Figure 4.8: Software architecture. The Matlab layer is responsible for the higher-level processing whereas the C++ layer takes on the heavy energy minimization computation.
In the context of the iCub project [12], which is led by the RobotCub Consortium, consisting of several European universities, a good C++ framework for the minimization of Markov Random Field energy functions has been found. The main goal of the iCub platform is to study cognition through the implementation of biologically motivated algorithms. The project is open-source: both the hardware design and the software are freely available.
The software implemented has then been based on a two-layer layout, as illustrated in Figure 4.8. Matlab, on the higher level, pre-processes images, calculating, for instance, filter convolutions, whereas the C++ program calculates the minimum of the energy function U. In other words, the C++ layer infers, from the given clique potentials and the Matlab pre-processed input data, what the maximum a posteriori labeling is.
The assessment of the quality of the segmentations, the storage of results and all complementary software functionalities are handled by Matlab on the higher-level layer.
Implementation challenges and optimizations
Differently from the case of the texture potential explained in the previous section, we could not find any ready-to-use Matlab implementation of the boosting procedure for the texture-layout potential, as it is very specific to this problem. The whole algorithm therefore had to be implemented from scratch. Moreover, since there are countless loops involved in the training algorithm described above, Matlab was ruled out as the programming environment for this implementation, being replaced by C++.
Two main practical problems have been faced in the C++ programming of the algorithm described above: firstly, the long processing time and, secondly, the lack of RAM.
1. Processing time: The boosting procedure described in the previous section requires computations over all training pixels. If we consider 100 images (a typical number for a training data set), each composed of, for instance, 800×600 pixels, we already have 48 million calculation loops for each step. This turns out to be impractical for today's processors. The solution found was to resize all dataset images before segmenting them and also to consider, as training pixels, only a subsampled set of each image. By resizing the images to half their original size and subsampling the training pixels in 5-pixel steps, we could reduce the number of calculation loops 100-fold. After this simplification was applied, the decrease in segmentation quality was almost imperceptible, which indicates that the information necessary for training the classifiers was not lost with the resizing and subsampling.
2. RAM memory: As discussed in Section 3.3, the use of integral images is essential for the efficiency of the calculation of the texture-layout filter responses v_{[r,t]}. If we consider that 100 textons have been defined in the textonization process, we have, for each training image, 100 integral images, one for each texton index. Again, considering 100 training images already resized to half their original size, we have ten thousand 400×300 matrices (each matrix representing an integral image). If we use a normal int variable for each matrix element, which in C++ occupies 4 bytes, we need 10000 × 400 × 300 × 4 = 4.8 gigabytes of RAM.
The first attempt to avoid this memory problem was to load only some of the integral images at a time. However, for the calculation of the texture-layout filter responses of the weak classifier candidates, all the integral matrices are necessary. They therefore had to be simultaneously accessible in RAM.
The solution was to use short unsigned integers, with only 2 bytes, which were big enough for all the integral matrices analyzed^8, and also to subsample the integral image matrices:

I^{(t)}(u, v) = I^{(t)}\big( \text{round}((u, v) / \text{SubsamplingFactor}) \big)  (4.20)
Again, the subsampling barely changed the results of the final segmentations. One reason the results did not change much is probably that the subsampling factor of 3 used is much smaller than the sizes of the rectangular regions r used in the texture-layout features. Although the subsampling reduced the amount of RAM necessary for loading the integral images, there is a limit to the number of training images that can be used without causing memory problems.
^8 Each short unsigned integer can store a number of up to 65535. If we consider a 400×300 pixel image, the maximum value of an integral image, if all pixels were assigned to one single texton, is 120000. However, since each pixel is assigned to one of many texton indices, the integral image of each texton never has values close to the limit of 65535.
Chapter 5
Results
In this chapter, we investigate the performance of our semantic segmentation system on the challenging CamVid dataset and compare our results with existing work. Firstly, we show preliminary results obtained with the texture features described in Section 4.3, without considering any context. We then analyse our final model with the context features (texture-layout features) described in Section 4.4. The effect of different aspects and parameters of the model is discussed before we present the best results obtained and analyse them quantitatively and qualitatively.
5.1 Model without context features
Figure 5.1 shows the confusion matrix of the segmentation of approximately 200 pictures, with classifiers trained on 140 other pictures, all randomly taken from the CamVid database. For this segmentation experiment, 500 training feature vectors have been randomly chosen per training image. The segmentations have been computed by minimization of Eq. 4.5, which does not include any context feature. Notice how sidewalks are almost not recognized at all.
The adaptive training procedure described in Section 4.3.3 chooses, for the training of the Adaboost classifiers, more examples of feature vectors from labels that are confused, like road and sidewalk, than from those which are easily recognized, like sky. The confusion matrix of Figure 5.1 shows the results of the segmentation of the first iteration of this adaptive Adaboost training algorithm, where all training vectors are chosen randomly. After three iterations, examples are selectively chosen and the confusion matrix of the segmentation results, shown in Figure 5.2, shows much better discernment between classes that were initially mixed up.
Although the adaptive training procedure improved the segmentation quality, context information, as discussed in the next section, differentiates the classes even better.
Figure 5.1: Confusion matrix of the segmentation experiment choosing random feature vectors for training the Adaboost classifiers. Each row shows what proportion of the ground truth class has been assigned to each class by the classifiers. Class others is the union of all classes defined in the CamVid database except street, sidewalk and sky. For an ideal segmentation, the confusion matrix would be equal to the identity matrix.
Figure 5.2: Confusion matrix of the segmentation after three iterations of the adaptive training. Initially, 65% of class sidewalk was wrongly assigned to class road, as compared to only 25% with the adaptive learning. The percentage of class sidewalk correctly assigned also increased from 9% to 61%.
5.2 Model with context features
Our final model includes the texture-layout potential (see Section 4.4). This model and its results are discussed in detail in the following sections.
5.2.1 Influence of the number of weak classifiers
As illustrated in Figure 3.8, texture-layout filters work by exploiting the contextual correlation of textures (and, in our solution, also color) between neighboring regions. Figure 5.3 shows the rectangular region r of each of the first ten texture-layout features for the classifier of class road. Notice that the location distribution of the regions r is slightly biased towards either the top half or the bottom half of the image. This probably comes from the fact that most of the correlations between textures present in class road and other textures happen in a vertical fashion: the road is normally below other classes.
Figure 5.3: Regions r of the first ten weak classifiers composing the strong classifier for class road. The yellow cross in the middle indicates the pixel i being classified and the blue rectangle represents the bounding box within which all the weak classifier candidates are created. The bigger the bounding box, the farther the context can be modeled. The downside of defining a big bounding box is that, if the context near pixel i is more discriminative, there is a lower probability of a weak classifier being generated in that region than for a smaller bounding box.
As seen in Section 4.4.1, the boosting scheme used guarantees that the target function, that is, the labeling of the training images, is approximated with an increasing number of weak classifiers. However, the quality of the segmentation of unseen test images seems to have a clear upper bound, as shown in Figure 5.4.
The number of weak classifier candidates generated for every round of boosting is a compromise between computation time and how discriminative the weak classifiers found are. Increasing the number of generated candidates increases the computation time for training proportionally. The quality of the segmentation increases, however, in a logarithmic fashion. At a given point, it is better to boost more weak classifiers than to generate too many candidates and boost just a few of them. Tests showed that 1000 weak classifier candidates resulted in a good compromise between training time and texture-layout classifier quality.
Figure 5.4: (a) Notice how the training error J_{error} with respect to the target function, that is, the ground truth labeling of the training images, decreases exponentially with the number of weak classifiers. (b) The segmentation accuracy for unseen test images, however, seems to be bounded at 92%. These accuracies have been calculated by segmenting the images using only the texture-layout potential. In this experiment no overfitting has occurred, but it is possible that, with other test images, the accuracy decreases again after a certain number of weak classifiers is reached.

5.2.2 Influence of the different model potentials
Although all the different potentials included in the model contribute to the final quality of the segmentation, we observed that the most important contribution comes from the texture-layout potential. This potential alone correctly segments the bulk of the scene, lacking, however,
coherent and smooth boundaries, as this aspect is not explicitly modeled in the texture-layout features. The edge potential, on the other hand, is responsible for a better delineation of boundaries, smoothing them and making them stick to existing edges in the input image. Although this does not contribute substantially to the overall pixel accuracy, it increases perceptual quality, making the segmentation much more natural to human observers, which is also useful for our final goal of driver behavior prediction. The location potential is also important, correcting regions wrongly segmented by the texture-layout potential. This happens, for instance, when white, saturated regions of the image are assigned by the texture-layout potential to class sky, even if those regions are located at the bottom of the image. Figure 5.5 shows how perceived segmentation quality and pixel-wise accuracy increase as we add the different potentials.
5.2.3 Influence of 3D features
The influence of the 3D features described in Section 4.3.1 was, due to time constraints, only tested for the 4-class set comprised of road, sidewalk, others and sky. After the concatenation of the 3D features to the feature vector, no improvement in the quality of the segmentations has been noticed. This might be due to the fact that the classes which were most mixed up, road and sidewalk, do not have significantly different 3D features.
Figure 5.5: (a) Original image to be segmented. (b) Manually labeled ground truth provided by the CamVid dataset. (c) Segmentation obtained by using only the texture-layout potential, with an overall accuracy of 90.7%. (d) Segmentation obtained with texture-layout and location potentials. Notice how some pixels assigned to class sidewalk in (c) now turn into class road because of their location. The accuracy is increased to 91.3%. (e) Segmentation obtained with texture-layout and edge potentials. Note how classes now have coherent and smooth boundaries, increasing the overall accuracy to 92.6%. (f) Finally, the segmentation obtained with all potentials combined. Notice how the strengths are added, as the labeling is spatially coherent and has smooth boundaries. The final accuracy achieved is 92.9%.
Therefore, in the case of this 4-class set, no discriminative cues have been added through the combination of the 3D features.
Nonetheless, if we are to consider more classes in the segmentation, the 3D features could be useful, as the additional classes may have quite different 3D characteristics. For instance, signs, pedestrians and bicyclists can stand out in comparison to buildings behind them because they are much closer to the car camera.
Another possible way of improving the contribution of the 3D features would be to cluster them separately from the appearance features described. This technique has been applied by Sturgess et al.
Table 5.3: The four sequences of the CamVid database.

Type   Seq. Name   # Images
Day    Seq05VD     171
Day    0016E5      305
Day    0006R0      101
Dusk   0001TP      124
5.3 CamVid sequences
The CamVid database is composed of four sequences of inner-city road scenes. Three of them have been recorded during the day, with good sun illumination, and the fourth one as it was getting dark. The four sequences are described in Table 5.3.
We next show the results obtained using each of the sequences separately. For each of them, the first half of the sequence has been used for training and the second half for testing. The most important parameters^1 have been set as follows:
- Number of clusters for textonization: 30;
- Number of weak classifiers: 500;
- Number of weak classifier candidates: 1000;
- Resize factor of original images: 0.5;
- Bounding box of texture-layout filters: 300×300 pixels.
Initially, we intended to tune the parameters automatically by developing a test bench that would run tests and change input parameters, trying to find maxima in the segmentation quality, for instance judging quality by the overall pixel accuracy. Although this is in theory an interesting idea, it turned out to be impractical, as we need hours to complete a single segmentation experiment. All system parameters, including those cited above, have therefore been manually tweaked to yield the best results possible.
Two different label sets have been tested and are described separately.
4-class set
Bearing in mind that the final goal of this segmentation system is to help predict the driver's behavior, a set with four of the most important classes has been defined: road, sidewalk, others and sky. The road is the most important of all, as it indicates the region where the car has freedom to drive. The sidewalk helps define in what direction the road follows its course, giving an important cue regarding the steering wheel behavior, that is, whether the driver should turn left, right or go straight ahead^2. Class others represents all sorts of obstacles, such as other cars, pedestrians, bicyclists, buildings and so forth. Class sky has been defined as it is very characteristic and easy to differentiate from all the others.

^1 The system devised has many more parameters than listed. However, they have turned out to have less influence on the final results than those mentioned here. This is a sign of the robustness achieved by the developed system.
Figure 5.6 shows the overall accuracy and confusion matrices of the segmentation results for each of the four sequences of the CamVid dataset. As explained in Section 5.1, each row of a confusion matrix shows, for one ground truth class, what proportion was assigned to which label. The names of the assigned labels are indicated at the top of each column. Overall accuracy is calculated by dividing the number of pixels correctly assigned by the total number of pixels.
Some examples of segmentations from all four sequences in the CamVid dataset are shown in Figure 5.7.
Considering the four classes mentioned above and the dusk image sequence, 0001TP, which has 124 images, our system took 6.3 hours to train and about one minute to segment each test image. The processor used was a Pentium 4 at 3.20 GHz with 1 gigabyte of RAM. It is important to notice that, of the one minute elapsed for the segmentation of each test image, a considerable amount of time (approximately 20 seconds) is spent by Matlab in writing the integral images to .txt files and by the C++ program in reading them. Hence, a great deal of processing time could be saved by integrating the whole system in the C++ platform.
11-class set
In order to be able to compare our system with the state of the art and to use our segmentation to try to predict the driver behaviour similarly to Ess et al. [6], we decided to use the same 11-class set as they did. This set is defined by the classes Building, Tree, Sky, Car, Sign-Symbol, Road, Pedestrian, Fence, Column-pole, Sidewalk and Bicyclist.
Since the confusion matrices of the segmentation results for this 11-class set are 11×11, Figure 5.8 shows only their diagonals for each of the sequences considered.
Some examples of segmentations from all four sequences in the CamVid dataset are shown in Figure 5.9.
Considering the eleven classes mentioned above and the dusk image sequence, 0001TP, which has 124 images, our system took 16.5 hours to train and 3 minutes to segment each test image. The same Pentium 4, 3.20 GHz processor with 1 gigabyte of RAM was used.

^2 If the behavior is to be predicted more precisely, other curve granularities can be defined, for example, sharp right, slight left and so on.
Figure 5.6: Confusion matrices and overall pixel accuracies for all sequences of the CamVid database. Notice that all overall accuracies are above 90%. Classes that are less present, for instance sidewalk, have yielded the worst results. This is also caused by the fact that these smaller classes have more complicated shapes, like narrow stripes in the case of sidewalks, which makes it easier to mix them with neighboring classes. In sequence 0006R0, which was mostly shot in a car parking lot, there are very few examples of class sidewalk, which explains why the classifiers did not learn to handle this class well enough, resulting in a zero column in the confusion matrix (c).
Figure 5.7: (a) Examples of images from the CamVid dataset to be segmented. (b) Ground truth annotation for the 4 classes considered. (c) Our segmentation using all three potentials implemented: texture-layout, edge and location. Note that the third image, from the sequence 0006R0, has very few pixels of class sidewalk. This leads to insufficient learning and the absence of pixels assigned to class sidewalk in our segmentations. Notice how, in the fourth image, which comes from the dusk sequence, the dark bumper of the car is confused with the dark road.
Figure 5.8: Accuracy is again computed by comparing the ground truth pixels to the inferred
segmentation. For each sequence we report individual class accuracies (the diagonal of the
confusion matrix), the average accuracy over all classes, and the overall segmentation accuracy
(column Global). The average accuracy measure gives equal importance to all 11 classes,
despite the widely varying class incidences, and is therefore a harder performance metric than
the overall pixel accuracy. Notice how results vary between 81%, for sequence Seq05VD, and
66%, for sequence 0006R0. Again, sequence 0006R0 has been the most difficult to segment,
as it was recorded mostly in a parking lot, so some classes are almost absent from it.
Indeed, for this sequence, no example of class bicyclist was present in the training images,
and therefore the corresponding class accuracy is zero. For overall quality comparison, the
baseline obtained by chance segmentation would achieve a global accuracy of about 9%.
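To make the two accuracy measures above concrete, the C++ sketch below computes the per-class, average and global accuracies from a confusion matrix of raw pixel counts. It assumes that rows index ground-truth classes and columns index predictions; this convention is an assumption of the sketch, not a detail stated in the text.

#include <cstddef>
#include <vector>

// Sketch of the three metrics reported in Figure 5.8, assuming the confusion
// matrix C stores raw pixel counts with rows = ground truth, cols = predicted
// (an assumed convention, chosen for illustration).
struct Metrics {
    std::vector<double> perClass;  // diagonal of the row-normalized matrix
    double average = 0.0;          // unweighted mean over all classes
    double global = 0.0;           // overall pixel accuracy
};

Metrics computeMetrics(const std::vector<std::vector<long long>>& C) {
    const std::size_t n = C.size();
    Metrics m;
    long long correct = 0, total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        long long rowSum = 0;
        for (std::size_t j = 0; j < n; ++j) rowSum += C[i][j];
        m.perClass.push_back(rowSum > 0 ? double(C[i][i]) / rowSum : 0.0);
        m.average += m.perClass.back() / n;  // every class weighted equally
        correct += C[i][i];
        total += rowSum;
    }
    m.global = total > 0 ? double(correct) / total : 0.0;
    return m;
}

Under this convention, a uniformly random labelling over 11 classes has an expected global accuracy of 1/11, roughly 9%, which matches the chance baseline quoted in the caption.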
5.4 Comparison to state of the art
In order to compare our results with the state of the art, we used the same
training and test split of the CamVid database as Sturgess et al. [22] and Brostow et al. [3],
which is shown in Table 5.3. These final results were obtained by using images from the
whole database in a single experiment, that is, images from day scenes as well as dusk scenes
were used together for training and testing. This is a good test of the generalization power
of the segmentation system, also giving an insight into how invariant our features are to
lighting conditions and other varying parameters. Figure 5.10 shows our results in comparison
to those of Sturgess et al. and Brostow et al. using the same 11-class set detailed in the previous
section.
It is important to mention, though, that due to time and memory constraints, the results we
compared to the state of the art were obtained using fewer weak classifiers (200 instead of
the 500 used in the results shown previously) and also with only around 50% of the images
from the CamVid database. Our results would probably improve if we had used more weak
classifiers (see Section 5.2.1) and all the images in the database.
Figure 5.9: (a) Example images to be segmented. (b) Ground truth annotation for the
11 classes considered. (c) Our segmentation using all three potentials implemented: texture-
layout, edge and location. Notice that some of the most important classes for the driver behavior
prediction, like the road, cars, buildings and sidewalks, are still well recognized for the 11-class
set. Unfortunately, other finer-structured classes like signs and pedestrians, which are also
important for the driver behavior prediction, are not well recognized. Suggestions to overcome
this problem are given in Chapter 6.
Figure 5.10: Comparison of our results with the state-of-the-art road-scene segmentation systems
of Sturgess et al. [22] and Brostow et al. [3]. Notice how, as in the results shown for
the segmentation of each sequence separately, finely detailed structures like column-poles,
bicyclists and pedestrians are almost not recognized at all by our system. Sturgess et al.
obtain better results in these classes probably because they use more appropriate features to
represent them, such as Histograms of Oriented Gradients (HOGs). Brostow et al. use only
structure-from-motion 3D cues, which have proven more robust to variations in scene lighting,
also yielding very good results for most classes. According to Sturgess et al. and Brostow et
al., the use of shared weak classifiers for the training of the texture-layout classifiers improves
the generalization power of the classification. The fact that we did not share weak classifiers
in our training may also have been a reason why our results were in general worse than the
state-of-the-art ones.
Chapter 6
Conclusions
This thesis project has been highly rewarding, as it was possible to put into practice many of the
concepts learned during the Vibot Masters while investigating cutting-edge techniques in a very
interesting field, the development of driver assistance systems. It was also very important to
understand, from the beginning of the project, the importance of semantically segmenting inner-
city road scenes as a step towards the final goal of predicting the driver behavior. Although
this master's project had to be concluded in only four months, satisfactory results have been
achieved. This short period of time had to be efficiently distributed among research of the state
of the art, investigation of new techniques, software implementation and report writing. Even if
the initial planning of activities was not rigorously put into practice, it helped focus the efforts
throughout the thesis so that all different aspects of this project were given their deserved
attention.
The initial research of existing methods and state-of-the-art techniques has shown that
significant progress has been made in the last few years in the field of image segmentation
and recognition. Important contributions have been made not only to the segmentation of
general scenes but also in our more specific context of inner-city road scenes.
The segmentation system implemented has been based on very up-to-date features and
segmentation techniques, which had to be efficiently adapted to our goal of driver-behaviour
prediction. Some aspects of the implementation of the techniques used have been simplified
due to our strict time constraints. However, these simplifications, like not sharing weak
classifiers during boosting, have been carefully assessed so that the accomplishment of the
main goals of the thesis project was not jeopardized.
The quantitative as well as the qualitative results of the segmentation of the challenging
CamVid database have fulfilled our expectations. Although we did not obtain results of the
same quality as the state-of-the-art ones, parallel research [11] within the group in which this
thesis was developed showed that they were good enough for predicting, in a basic way, the
driver's behaviour. In this parallel work, state-of-the-art techniques have been applied to model
the driver's behaviour based on the segmentation of the road scene. By applying these
techniques, it has been noticed that the quality of the behaviour prediction using the ground truth
segmentations was not much better than the quality achieved using the segmentation from the
system implemented in this master's thesis. The conclusion is that, although the quality of the
segmentation has an impact on the final quality of the behaviour prediction, more effort has to
be invested in improving the behaviour prediction than in the semantic segmentation itself.
Future work
In an attempt to encourage the continuation of the work performed in this master's thesis, we
propose some ideas that might help improve the quality of the segmentation system imple-
mented:
Depth-adaptive scaling of feature vectors
In our semantic segmentation system, the feature vector defined in Eq. 4.7 is obtained by
convolution with the same filter bank for every image pixel i. Suppose, now, that we adapt
the scale of the filter bank used for calculating the feature vector according to the depth of pixel
i in the image. By doing so, we could, for example, represent the texture of a sidewalk in an
image by one single feature vector cluster. Figure 6.1 illustrates the working principle of the
depth-adaptive scaling feature extraction.
The depth information necessary for this algorithm could be inferred from the same structure-
from-motion techniques used to obtain the 3D cues described in Section 3.1.2. It could also be
inferred for every image at test time by using an automatic scale recognition technique, as
done in SIFT by finding the peak response of a multi-scale pyramid convolution.
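A minimal sketch of how such a depth-to-scale mapping might look is given below, in C++ for consistency with our implementation platform. The inverse-depth rule and all constants (the reference depth and the scale clamps) are illustrative assumptions, not values taken from this work.

#include <algorithm>

// Hypothetical sketch of the depth-adaptive idea: pick a filter-bank scale
// per pixel so that nearby (large-appearing) texture is analyzed with a
// larger kernel than distant texture. The inverse-depth mapping and the
// reference constants below are assumptions for illustration only.
double filterScaleForDepth(double depthMeters,
                           double refDepth = 10.0,  // depth at which scale == 1
                           double minScale = 0.5,
                           double maxScale = 2.5) {
    if (depthMeters <= 0.0) return maxScale;  // invalid depth: be conservative
    // Apparent size falls off roughly as 1/depth under perspective projection,
    // so scale the kernels by refDepth / depth and clamp to the bank's range.
    const double s = refDepth / depthMeters;
    return std::min(maxScale, std::max(minScale, s));
}

In the feature extraction loop, pixel i would then be convolved with the filter bank resized by filterScaleForDepth(depth(i)); the scales 2.5 (near) and 1 (far) illustrated in Figure 6.1 correspond to two such samples of this mapping.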
Addition of more features
One of the probable reasons why our results were not as good as those of Sturgess et
al. [22], the state of the art in the field of road scene segmentation and classification,
is that we did not use as many features as they did. They used, for instance, Histograms
of Oriented Gradients (HOGs) to describe the orientation of local edges. By doing so, they
could better discriminate finely detailed structures like signs, bicyclists, cars and so on.
Sturgess et al. clustered these features separately from the textons, letting the boosting
procedure decide, for each classifier, whether to use texton features or other features.
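The sketch below illustrates this pooling idea: weak-learner candidates are kept in separate texton and HOG pools, and each boosting round simply picks the candidate with the lowest weighted error across both pools. The WeakLearner structure and its fields are hypothetical placeholders, not the interface used by Sturgess et al.

#include <vector>

// Hypothetical sketch of per-round weak-learner selection over two feature
// pools; boosting itself (sample reweighting) is omitted for brevity.
struct WeakLearner {
    int featurePool = 0;         // 0 = texton-layout, 1 = HOG (assumed encoding)
    int featureIndex = -1;       // which feature within the pool
    double threshold = 0.0;      // decision-stump threshold
    double weightedError = 1.0;  // error under the current boosting weights
};

WeakLearner selectWeakLearner(const std::vector<WeakLearner>& textonCandidates,
                              const std::vector<WeakLearner>& hogCandidates) {
    WeakLearner best;  // weightedError starts at 1.0, the worst possible
    for (const auto* pool : {&textonCandidates, &hogCandidates})
        for (const auto& c : *pool)
            if (c.weightedError < best.weightedError) best = c;
    return best;  // boosting then reweights the samples and repeats
}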
Further optimization of C++ code
Figure 6.1: Notice how the texture of the sidewalk changes its scale as it gets farther from the car
camera. If we could estimate depth information and use it to adapt the scale of our filter bank
convolution, we would be able to cluster class features more appropriately. In the diagram,
the blue circle represents a filter bank scale of 2.5 and the red circle a scale of 1. With such
adapted scales, feature vectors all over the sidewalk would be very similar and easier to learn.
This would be similarly valid for other classes.
State-of-the-art implementations are faster than ours. For instance, Sturgess et al. train
their system with more images than we do and manage to segment an unseen image in
30 to 40 seconds, while we need 3 minutes. As Figure 5.4 suggests, the number of weak classifiers
used to obtain a good segmentation could be reduced from 500 to around 200, which would make
the classification much faster. We could also save a lot of time by integrating the whole system
into a single C++ program.
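As a sketch of the first suggestion, truncating the boosted classifier at evaluation time only requires summing the first 200 weak responses; the Stump structure below is a hypothetical stand-in for our trained weak classifiers.

#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical decision stump standing in for a trained weak classifier.
struct Stump {
    int featureIndex = 0;   // texture-layout feature tested by this stump
    double threshold = 0.0;
    double alpha = 0.0;     // boosting weight
};

// Evaluate only the first maxRounds weak classifiers of the ensemble.
// Figure 5.4 suggests accuracy saturates well before 500 rounds, so
// truncating at 200 trades little accuracy for a large speed-up.
double strongResponse(const std::vector<Stump>& ensemble,
                      const std::vector<double>& features,
                      std::size_t maxRounds = 200) {  // 200 instead of 500
    double sum = 0.0;
    const std::size_t n = std::min(ensemble.size(), maxRounds);
    for (std::size_t t = 0; t < n; ++t) {
        const Stump& s = ensemble[t];
        const double h = features[s.featureIndex] > s.threshold ? 1.0 : -1.0;
        sum += s.alpha * h;
    }
    return sum;  // the sign (or a per-class argmax) gives the label
}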
Bibliography
[1] J. Besag. Spatial interaction and the statistical analysis of lattice systems (with discussion).
Journal of the Royal Statistical Society, Series B, 36(2):192-236, 1974.
[2] Y. Boykov and M.-P. Jolly. Interactive graph cuts for optimal boundary and region seg-
mentation of objects in N-D images. International Conference on Computer Vision, volume 1,
pages 105-112, July 2001.
[3] G. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla. Segmentation and recognition using
structure from motion point clouds. ECCV, 2008.
[4] G. J. Brostow, J. Fauqueur, and R. Cipolla. Semantic object classes in video: A high-
definition ground truth database. Pattern Recognition Letters, 30(2):88-97, 2009.
[5] DARPA. DARPA Urban Challenge rulebook. http://www.darpa.mil/GRANDCHALLENGE/
docs/Urban Challenge Rules 102707.pdf.
[6] A. Ess, T. Müller, H. Grabner, and L. van Gool. Segmentation-based urban traffic scene
understanding. British Machine Vision Conference (BMVC), 2009.
[7] X. Feng, C. Williams, and S. Felderhof. Combining belief networks and neural networks
for scene segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence,
24(4):467-483, April 2002.
[8] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and
an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.
[9] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge
University Press, 2003.
[10] X. He, R. Zemel, and M. Carreira-Perpinan. Multiscale conditional random fields for image
labeling. IEEE International Conference on Computer Vision and Pattern Recognition,
volume 2, pages 695-702, 2004.
[11] M. Heracles, F. Martinelli, and J. Fritsch. Vision-based behavior prediction in urban traffic
environments by scene categorization. British Machine Vision Conference (BMVC), 2010
(Submitted).
[12] iCub. RobotCub project, funded by the European Commission. http://www.robotcub.
org/.
[13] M. Kass, A. Witkin, and D. Terzopoulos. Snakes: Active contour models.
IJCV, 1(4):321-331, 1988.
[14] S. Kumar and M. Hebert. A discriminative framework for contextual interaction in classi-
fication. IEEE International Conference on Computer Vision, pages 1150-1157, 2003.
[15] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models
for segmenting and labeling sequence data. Proceedings of ICML-01, pages 282-289, 2001.
[16] T. Leung and J. Malik. Representing and recognizing the visual appearance of materials
using three-dimensional textons. IJCV, 43(1):29-44, 2001.
[17] J. Malik, S. Belongie, T. Leung, and J. Shi. Contour and texture analysis for image
segmentation. Int. J. Computer Vision, 43(1):7-27, June 2001.
[18] N. Pican, E. Trucco, M. Ross, D. M. Lane, and Y. Petillot. Texture analysis for seabed classi-
fication: Co-occurrence matrices vs self-organizing maps. IEEE, 1998.
[19] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: Interactive foreground extraction
using iterated graph cuts. ACM Transactions on Graphics, 23(3):309-314, August 2004.
[20] E. Saber, A. Tekalp, R. Eschbach, and K. Knox. Automatic image annotation using
adaptive color classification. Graphical Models and Image Processing, 58(2):115-126, 1996.
[21] J. Shotton, J. M. Winn, C. Rother, and A. Criminisi. TextonBoost: Joint appearance,
shape and context modeling for multi-class object recognition and segmentation. ECCV,
volume 1, pages 1-15, 2006.
[22] P. Sturgess, K. Alahari, L. Ladicky, and P. Torr. Combining appearance and structure
from motion features for road scene understanding. British Machine Vision Conference
(BMVC), September 2009.
[23] A. Torralba, K. P. Murphy, and W. T. Freeman. Sharing visual features for multiclass and
multi-view object detection. IEEE Trans. on Pattern Analysis and Machine Intelligence,
29(5):854-869, May 2007.
[24] M. Turtinen and M. Pietikäinen. Contextual analysis of textured scene images. British
Machine Vision Conference (BMVC), 2006.
[25] M. Varma and A. Zisserman. A statistical approach to texture classification from single
images. Int. J. Computer Vision, 62(1-2):61-81, April 2005.
[26] P. Viola and M. J. Jones. Rapid object detection using a boosted cascade of simple features.
Proc. IEEE Conf. Computer Vision and Pattern Recognition, volume 1, pages 511-518,
December 2001.
[27] Y. Wang and Q. Ji. A dynamic conditional random field model for object segmentation
in image sequences. IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR'05), volume 1, pages 264-270, 2005.
[28] J. Winn, A. Criminisi, and T. Minka. Categorization by learned universal visual dictionary.
Proc. Int. Conf. on Computer Vision, volume 2, pages 1800-1807, October 2005.