
MSc Dissertation

NAO project:
object recognition and localization
Author: Sylvain Cherrier
MSc: Robotics and Automation
Period: 04/04/2011 to 30/09/2011
Acknowledgement
In order to carry out my project, I used a lot of helpful documents which simplified the work for me. I could mention a website by Utkarsh Sinha (SIFT, mono-calibration and other topics), the publication of David G. Lowe, Distinctive Image Features from Scale-Invariant Keypoints (SIFT and feature matching), and the dissertation of a previous MSc student, Suresh Kumar Pupala, entitled 3D-Perception from Binocular Vision (stereovision and other topics).
Next, I would like to thank Dr. Bassem Alachkar for having worked with me on the recognition. He guided me towards several feature extraction methods (SIFT and SURF) and presented me with several publications which allowed me to understand feature matching.
Then, I would like to thank my project supervisor, Dr. Samia Nefti-Meziani, who allowed me to use the laboratory and in particular the NAO robot and the stereovision system.
Finally, I would like to thank all the researchers and students from the laboratory for their kindness and their help.
Table of contents
1 Introduction......................................................................................................................................3
2 Subject : object recognition and localization on NAO robot............................................................5
2.1 Mission and goals......................................................................................................................5
2.2 Recognition...............................................................................................................................7
2.3 Localization.............................................................................................................................22
2.4 NAO control............................................................................................................................42
3 Implementation and testing...........................................................................................................58
3.1 Languages, softwares and libraries.........................................................................................58
3.2 Tools and methods..................................................................................................................59
3.3 Steps of the project.................................................................................................................64
4 Conclusion.......................................................................................................................................67
5 References......................................................................................................................................68
6 Appendices.....................................................................................................................................69
6.1 SIFT: a method of feature extraction......................................................................................69
6.2 SURF : another method adapted for real time applications...................................................84
6.3 Mono-calibration....................................................................................................................92
6.4 Stereo-calibration.................................................................................................................100
1 Introduction
Video cameras are interesting devices to work with because they provide a lot of data. For human beings or animals, image data inform them about the objects within the environment they encounter: a goal for the future is to understand how we recognize an object. First we learn about it, and afterwards we are able to recognize it within a different environment.
The strength of the brain is to be able to recognize an object among different classes of objects: we call this ability clustering. But above all, the brain is able to extract interesting features from an object that make it belong to a class.
Thus, the difficulty of implementing a recognition ability on a robot has two levels:
which features should be extracted from an object?
How can a specific object be identified among many others?
We can say that we first have to choose good features before being able to identify an object, so the first question has to be answered before the second one.
Besides recognition, we often need the location and orientation of the object in order to make the robot interact with it (control). Thus, the vision system must also enable object localization, as it enriches the description of the object seen.
In this report, I will first present the university and the laboratory where I worked, before describing the context and goals of my project.
Then I will focus on the three subjects I had to study: object recognition, object localization and NAO control, before presenting how I implemented and managed my project.
2 Subject: object recognition and localization on the NAO robot
2.1 Mission and goals
2.1.1. Context
In real life, the environment can change easily and our brain can easily adapt itself to match objects whatever the conditions.
In robotics, in order to tackle recognition problems, we often think about a specific context; I can mention three examples:
For industrial robots, we may have to make robots recognize circular objects on a moving belt: the features to extract are well defined and we will implement an effective program to match only circular ones. In this first example, the program is designed to work only when the background is that of the moving belt. Thus, a first problem is the background: in real applications, it can vary a lot!
When we want to improve a CCTV camera to recognize vehicles, we can use the thickness of their edges to distinguish them from people walking in the street. Then, we can also use the car blobs to match a specific size, or the optical flow to match a specific speed. The problem here can be at the occlusion level: indeed, the car blob may not be valid if a tree hides the vehicle, and the car would then not be identified.
Using the HSV color space, we can recognize single-colored objects under constant illumination. Even if the use of blobs can improve the detection, a big problem is illumination variation.
Thus, from these examples we can underline three types of variations:
background
occlusion
illumination
But there are many others, such as all the transformations in space:
3D translation (on the image: translation (X, Y) and scale (1/Z))
3D rotation
2.1.2. State of the art
A lot of people have worked on finding invariant features to recognize an object whatever the conditions.
An interesting line of study is to work on invariant keypoints:
The Scale-Invariant Feature Transform (SIFT) was published in 1999 by David G. Lowe, a researcher from the University of British Columbia. This algorithm detects local keypoints within an image; these features are invariant to scale, rotation and illumination. Using keypoints instead of keyregions makes it possible to deal with occlusion.
The Speeded Up Robust Features (SURF) method was presented in 2006 by Herbert Bay et al., researchers from ETH Zurich. This algorithm is partly inspired by SIFT, but it is several times faster and is claimed by its authors to be more robust against different image transformations.
2.1.3. My mission
My mission was to recognize an object in difficult conditions (occlusion, background, different rotations, scales and positions). Moreover, the goal was also to apply it for a robotic purpose: the control of the robot needs to be linked to the recognition. Thus, I had to study localization, which sits between recognition and control, in order to orient the work towards robotic applications.
In the following, I will say more about these methods and how to use them to match an object.
Then I will focus on localization, as it allows a complete description of the object in space: it enables more interaction between the robot and the object.
As I worked on the NAO robot, I will describe its hardware and software and the ways to program it.
Then, in a final part, I will link all the previous parts (recognition, localization and robot control) in order to describe my implementation and testing. I will also describe the way I managed my project.
2.2 Recognition
2.2.1. Introduction
In pattern recognition, there are two main steps before being able to make a judgement about an object:
Firstly, we have to analyze the image to find the features (which can be, for example, keypoints or keyregions) likely to be interesting to work with. In the following, I will call this step feature extraction.
Secondly, we have to compare these features with those stored in the database. I will call this step feature matching.
Concerning the database, there are two ways to create it:
either it is created manually by the user and stays fixed (the user teaches the robot)
or it is updated automatically by the robot (the robot learns by itself)
2.2.2. SIFT and SURF : two methods of invariant feature extraction
SIFT and SURF are two invariant-feature extraction methods which are divided into two main steps:
Keypoint extraction: keypoints are found at specific locations in the image and an orientation is assigned to them. They are selected depending on their stability across scale and position: the invariance to scale is obtained at this step.
Descriptor generation: a descriptor is generated for each keypoint in order to identify it accurately. In order to be flexible to transformations, it has to be invariant: the invariances to rotation, brightness and contrast are obtained at this step.
During the keypoint extraction, each method has to build a scale space in order to estimate the Laplacian of Gaussian. The Laplacian allows the detection of pixels with rapid intensity changes, but it is necessary to apply it to a Gaussian-smoothed image in order to obtain the invariance to scale. Moreover, the Gaussian makes the Laplacian less noisy. From the Laplacian of Gaussian applied to the image, the keypoints are found at the extrema of the function, as the intensity changes very quickly around their location. Then, the second step is to calculate their orientation using gradients from their neighborhood.
Concerning the descriptor generation, we first need to rotate the axes around each keypoint by its orientation in order to obtain invariance to rotation. Then, as for the calculation of the keypoint orientation, it is necessary to calculate the gradients within a specific neighborhood (or window). This window is then divided into several subwindows in order to build a vector containing several gradient parameters. Finally, a threshold is applied to this vector and it is normalized to unit length in order to make it invariant to brightness and contrast respectively. This vector has a specific number of values depending on the method: with SIFT it has 128 values, and with SURF, 64.
The fact that SURF works on integral images and uses simplified masks reduces the computation and makes the algorithm quicker to execute. For these reasons, SURF is better suited to real-time applications than SIFT, even if SIFT can be very robust.
For more details about the feature extraction methods SIFT and SURF, you can refer to the appendices where I describe each of these methods in more depth.
The diagram below sums up the main common steps used by SIFT and SURF.
[Diagram: common pipeline of SIFT and SURF. Keypoint extraction: build a scale space (invariance to scale) and estimate the Laplacian of Gaussian (keypoint location); calculate the gradients around the keypoints within a window W1 and estimate the orientations (keypoint orientation). Descriptor generation: rotate the keypoint axes by the keypoint orientation (invariance to rotation), calculate the gradients around the keypoints within a window W2, divide the window W2 into several subwindows, calculate the gradient parameters for each subwindow, then threshold and normalize the vector (invariance to brightness and contrast) to obtain the descriptor.]
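As an illustration of the extraction step common to both methods, the short sketch below uses the OpenCV library (one possible implementation choice; the exact constructor name depends on the OpenCV version installed) to extract keypoints and descriptors from a grayscale image.

    import cv2

    # Load the model image in grayscale (the path is only an example)
    image = cv2.imread("model.png", cv2.IMREAD_GRAYSCALE)

    # Create a SIFT detector; depending on the OpenCV version this may be
    # cv2.SIFT_create() or cv2.xfeatures2d.SIFT_create()
    sift = cv2.SIFT_create()

    # detectAndCompute returns the keypoints (position, scale, orientation)
    # and a 128-value descriptor per keypoint
    keypoints, descriptors = sift.detectAndCompute(image, None)

    print("Number of keypoints:", len(keypoints))
    print("Descriptor size:", descriptors.shape[1])  # 128 for SIFT (64 for SURF)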
2.2.3. Feature matching
We have now seen two methods of feature extraction which locate and orientate keypoints and assign them an invariant descriptor. Feature matching consists in recognizing a model within an acquired image.
The goal of the matching is to compare two lists of keypoint-descriptor pairs in order to:
identify the presence of an object
know the 2D position, 2D size, 2D orientation and scale of the object in the image
A big problem of recognition is to generalize it to 3D objects: a 3D object is complicated to read with a camera, and the 3D position, 3D size and 3D orientation are difficult to estimate with only one camera.
Thus, we will see how a matching can be done using only one model we want to recognize within the acquired image. It can be divided into three main steps:
compare the descriptor vectors (descriptor matching)
compare the features of the matched keypoints (keypoint matching)
calculate the 2D features of the model within the acquired image (calculation of 2D model features)
[Diagram: principle of the feature matching. Features are extracted (SIFT or SURF) from the acquired image and from the model image, giving the keypoint lists kpList_acq and kpList_model and the descriptor lists descList_acq and descList_model. These feed the three steps of the feature matching: descriptor matching, keypoint matching and calculation of 2D model features.]
Sylvain Cherrier Msc Robotics and Automation
The diagram above shows the principle of the feature matching: features are extracted from the acquired image and from the model image in order to match their descriptors and keypoints. The goal is to obtain, at the end of the matching, the 2D features of the model within the acquisition.
2.2.3.1 descriptor matching
2.2.3.1.1 Compare distances
The descriptors we have previously calculated using SIFT or SURF allow us to have data invariant in
scale, rotation, brightness and contrast.
Firstly, in order to compare descriptors, we can focus on their distance or on their angle. If $desc_{acq} = (desc_{acq}(1), desc_{acq}(2), \ldots, desc_{acq}(size_{desc}))$ is a descriptor from the acquired image and $desc_{model} = (desc_{model}(1), desc_{model}(2), \ldots, desc_{model}(size_{desc}))$ is one from the model:

$d_{euclidean} = \sqrt{(desc_{acq}(1) - desc_{model}(1))^2 + \ldots + (desc_{acq}(size_{desc}) - desc_{model}(size_{desc}))^2}$

$\theta = \arccos(desc_{acq}(1)\, desc_{model}(1) + \ldots + desc_{acq}(size_{desc})\, desc_{model}(size_{desc}))$
Thus, there are two basic solutions to compare descriptors: using the Euclidean distance or the angle (in the following I refer to the angle as a distance as well). There are other types of distances that I do not mention here in order to keep things simple.
Thus, we could calculate the distances between one descriptor from the acquired image and each one from the model image: the smallest distance could then be compared to a specific threshold.
But how do we find the threshold in this case? Indeed, each descriptor could have its own valid threshold. Thus, the threshold has to be relative to a distance.
David G. Lowe had the idea of comparing the smallest distance with the next smallest distance, and found that by rejecting matches with a distance ratio ($ratio_{distance} = distance_{smallest} / distance_{next\text{-}smallest}$) greater than 0.8, we eliminate 90% of the false matches while losing just 5% of the correct ones.
The figure below represents the Probability Density Functions (PDF) of the ratio of distances (closest/next closest). As you can see, removing matches above 0.8 keeps most of the correct matches and rejects most of the false ones. We can also keep only matches under 0.6 in order to focus exclusively on correct matches (but we lose data).
This method allows us to quickly eliminate matches from the background of the acquisition which are completely incompatible with the model. At the end of the procedure, we have a number of interesting keypoints from the acquired image that are supposed to match the model, but we need a more precise method than distances alone.
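A minimal sketch of this distance-ratio test, assuming descriptors have already been extracted; des_model and des_acq are hypothetical descriptor arrays, and OpenCV's brute-force matcher is used here instead of a k-d tree for brevity:

    import cv2

    def ratio_test_matches(des_model, des_acq, ratio=0.8):
        """Keep only matches whose closest/next-closest distance ratio is below `ratio`."""
        matcher = cv2.BFMatcher(cv2.NORM_L2)
        # For each model descriptor, find its two nearest neighbors in the acquisition
        knn = matcher.knnMatch(des_model, des_acq, k=2)
        good = []
        for pair in knn:
            if len(pair) < 2:
                continue
            best, second = pair
            if best.distance < ratio * second.distance:
                good.append(best)
        return good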
2.2.3.1.2 k-d tree
The goal now is to match the nearest neighbors more accurately. To do that, we can use the k-d tree method.
First of all, a k-d tree uses two lists of vectors of at least k dimensions (as it will focus only on the k dimensions specified):
the reference (N vectors), which refers to the vectors used to build the tree
the query, here called the model (M vectors), which refers to the vectors for which we look for a nearest neighbor
Thus, when the k-d tree has finished its querying for each vector from the model, it has a nearest reference vector for each of them: we have M matches at the end.
In order to avoid several model vectors matching the same reference vector, we need to have:

$N \geq M$

Then, depending on the previous operation (comparing distances), we will need to assign the model and acquisition descriptor lists correctly, depending on the number of descriptors to match:
if there are more descriptors in the acquisition than in the model, the acquisition will be the reference and the model will be the query of the k-d tree.
If there are more descriptors in the model than in the acquisition, it will be the opposite.
In our case, if we want accuracy, the dimension of the tree has to be the dimension of the corresponding descriptors (128 for SIFT and 64 for SURF). The drawback is that the more dimensions we have, the more computation there is and the slower the process becomes.
I do not give more details about the k-d tree as I used it as a tool.
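As a sketch of this nearest-neighbor search, the example below builds a k-d tree with SciPy (one possible tool; OpenCV's FLANN matcher is another option), using the larger descriptor list as the reference:

    import numpy as np
    from scipy.spatial import cKDTree

    def kdtree_nearest(des_reference, des_query):
        """Return, for each query descriptor, the index and distance of its
        nearest neighbor in the reference list (the reference should be the larger list)."""
        tree = cKDTree(des_reference)              # build the tree on the reference vectors
        distances, indices = tree.query(des_query, k=1)
        return indices, distances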
2.2.3.2 Keypoint matching
After finding the nearest neighbors, the descriptors are not useful any more.
Now, we have to check whether the matches are coherent at the level of the geometry of the keypoints. Thus, we will now use the data relative to the keypoints (2D position, scale, magnitude and orientation) in order to check whether there are still some false matches.
To do this, we can use a Generalized Hough Transform (GHT), of which there are two types:
GHT invariant to 2D translation (classical GHT)
GHT invariant to 2D translation, scale and rotation (invariant GHT)
The classical GHT uses only the 2D translation of the keypoints as a reference. It has the advantage of already giving an idea of the scale and orientation of the model within the acquired image, but it requires more computation as it is necessary to go through several possible scales and 2D rotations. Moreover, it is necessary to know a specific range of scale and 2D rotation to work with, and at what sampling.
The invariant GHT uses both the 2D translation and the orientation of the keypoints as references. It has the main advantage of being invariant to all the 2D transformations of the object, but it does not inform us about the scale and 2D rotation, and as it uses orientations, it is sensitive to their accuracy.
In the following, we will study both the classical Generalized Hough Transform and the invariant one.
2.2.3.2.1 Classical GHT
The Generalized Hough Transform relies mainly on keypoint coordinates defined relative to a reference point. Indeed, using a single reference point for all the keypoints enables us to keep a quantity invariant to 2D translation.
As the goal of the GHT here is to check whether the keypoints are correct, it takes as input the keypoints matched between the acquired image and the model.
Build the Rtable of the model
Before all, the model needs a corresponding Rtable which describes, for each of its keypoints, its displacement from the reference point:

$\gamma(1, 0) = \begin{pmatrix} x_{model} - x_r \\ y_{model} - y_r \end{pmatrix}$

where $x_{model}$ and $y_{model}$ are the coordinates of the keypoint from the model and $x_r$ and $y_r$ are those of the reference point. The two parameters of the displacement $\gamma(s, \theta)$ refer respectively to the scale ($s = 1$) and the rotation ($\theta = 0$ rad), as the model is considered as the base.
The reference point can simply be calculated as the mean position of all the keypoints:

$x_r = \frac{\sum x_{model}}{nbKp_{model}} \qquad y_r = \frac{\sum y_{model}}{nbKp_{model}}$

where $nbKp_{model}$ refers to the number of keypoints within the model.
Thus, the Rtable for each model is structured as below:

$T = \begin{pmatrix} x_{model}(1) - x_r & y_{model}(1) - y_r \\ x_{model}(2) - x_r & y_{model}(2) - y_r \\ \ldots & \ldots \\ x_{model}(nbKp_{model}) - x_r & y_{model}(nbKp_{model}) - y_r \end{pmatrix}$
Check the features of the acquisition
The second step is to estimate the position of the reference point within the acquired image and to check which keypoints give a good result.
As we already have a match between the keypoints from the acquisition and those from the model, we can directly use the Rtable in order to make each keypoint vote for a specific position of the reference point.
Indeed, if we consider a transformation invariant both in scale and rotation, we have:

$x_r = x_{acq} - Rtable(numKp_{model}, 1) \qquad y_r = y_{acq} - Rtable(numKp_{model}, 2)$

where $x_{acq}$ and $y_{acq}$ refer to the position of the keypoint from the acquisition and $numKp_{model}$ refers to the corresponding keypoint within the model matched by the acquisition.
With scale $s$ and rotation $\theta$, we have to modify the value of the displacement given by the Rtable:

$\gamma(s, \theta) = s \begin{pmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{pmatrix} \gamma(1, 0)$

Thus, we have our new reference point for the acquisition:

$x_r = x_{acq} - s\,(\cos(\theta)\, Rtable(numKp_{model}, 1) - \sin(\theta)\, Rtable(numKp_{model}, 2))$

$y_r = y_{acq} - s\,(\sin(\theta)\, Rtable(numKp_{model}, 1) + \cos(\theta)\, Rtable(numKp_{model}, 2))$
Thus, at the end of this step, we have an accumulator of votes of the size of the acquired image.
In the image below, you can see an example of an accumulator: the maximum shows the most probable position of the reference point within the acquired image.
A problem is to know at which scale and rotation we have the best results: personally, I chose by focusing on the maximum number of votes, but there can be several maxima for different scales and rotations. An idea would be to detect where the votes are the most grouped.
Adding the scale and rotation, another problem is that we add more computation, and the difficulty here is to know what range to use and how many iterations (the more we increase the number of iterations, the more accurate it will be, but the more time it will take).
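A minimal sketch of this voting step, assuming the Rtable has already been built as an array of (dx, dy) displacements and that matches pairs each acquisition keypoint position with the index of its matched model keypoint (all names are illustrative):

    import numpy as np

    def ght_vote(matches, rtable, img_shape, s=1.0, theta=0.0):
        """Accumulate votes for the reference point position in the acquired image.

        matches : list of ((x_acq, y_acq), model_index) pairs
        rtable  : array of shape (nbKp_model, 2) holding (x_model - x_r, y_model - y_r)
        """
        acc = np.zeros(img_shape, dtype=np.int32)   # one cell per pixel of the acquisition
        c, s_ = np.cos(theta), np.sin(theta)
        for (x_acq, y_acq), idx in matches:
            dx, dy = rtable[idx]
            xr = int(round(x_acq - s * (c * dx - s_ * dy)))
            yr = int(round(y_acq - s * (s_ * dx + c * dy)))
            if 0 <= xr < img_shape[1] and 0 <= yr < img_shape[0]:
                acc[yr, xr] += 1                     # vote for this reference point position
        return acc

    # The best candidate is the accumulator maximum:
    # yr_best, xr_best = np.unravel_index(np.argmax(acc), acc.shape)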
2.2.3.2.2 Invariant GHT
Contrary to the previous method, we will not use the displacement this time.
We will use a relation between the speed φ' of the keypoint and its acceleration φ''.
Speed φ' and acceleration φ''
It is easier to define the speed and acceleration by focusing on a template rather than on keypoints. The studied template has a reference point as origin $(x_0, y_0)$.
The speed (at a specific angle of the template) is defined as the tangent to its border. The acceleration is defined as passing through the border point and the reference point $(x_0, y_0)$.
We can prove that between the angle of the speed $\varphi'_a$ and the angle of the acceleration $\varphi''_a$ there is a value k invariant to translation, rotation and scale:

$k = \varphi'_a - \varphi''_a$

In our case, we will use the orientation of the keypoint as the angle of the speed $\varphi'_a$, and concerning the acceleration $\varphi''$, it is defined by:

$\varphi'' = \frac{y - y_0}{x - x_0} \qquad \varphi''_a = \arctan(\varphi'')$

Build the Rtable of the model
As previously, we will build an Rtable for the model, but this time we will record the value of k and not the displacement. Thus we will use both the position and the orientation of each keypoint to build the Rtable.
Check the features of the acquisition
Using the position of the keypoints within the acquisition and their corresponding matches, we obtain an associated value of k for each one.
Then, knowing the orientation of the keypoints ($\varphi'_a$), we obtain the angle of the acceleration $\varphi''_a$, which defines a line passing through the keypoint on which the reference point must lie.
Contrary to the previous method, each keypoint will therefore not vote for a precise position but for a line.
2.2.3.2.3 Conclusion
Thus, at the end of both methods, we have to look for the maximum of votes in the accumulator, or for the place where the votes are the most grouped.
I chose to focus on the maximum as it is simpler.
In order to be more tolerant, I added a radius of tolerance around the maximum. The false matches outside the tolerance zone are rejected: this keeps the result coherent at the geometric level.
The GHT allows us to reduce errors in the calculation of the 2D model features, which we will see next.
2.2.3.3 calculation of 2D model features
Least Squares Solution
In order to calculate the 2D features, we will have to use a Least Squares Solution.
Indeed, a point of the model ($p_{model}$) is expressed in the acquired image ($p_{model/acq}$) using a specific translation t, a scale s (scalar) and a rotation matrix R:

$p_{model/acq} = s\, R\, p_{model} + t$ where $R = \begin{pmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{pmatrix}$ and $t = \begin{pmatrix} t_x \\ t_y \end{pmatrix}$

$\theta$ is the angle of rotation between the model and the model within the acquired image. We consider that the model has a scale s of 1 and an angle of 0 radians.
Thus, we will apply the Least Squares method in order to find two matrices:

$M = s\, R = \begin{pmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{pmatrix}$ and $t = \begin{pmatrix} t_x \\ t_y \end{pmatrix}$

The idea of the Least Squares is to rearrange the previous equation in order to group the unknowns $m_{11}$, $m_{12}$, $m_{21}$, $m_{22}$, $t_x$ and $t_y$ within a vector:

$p_{model/acq} = \begin{pmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{pmatrix} p_{model} + \begin{pmatrix} t_x \\ t_y \end{pmatrix}$ with $p_{model/acq} = \begin{pmatrix} x_{model/acq} \\ y_{model/acq} \end{pmatrix}$ and $p_{model} = \begin{pmatrix} x_{model} \\ y_{model} \end{pmatrix}$

Thus, we also have:

$A\, mt = \begin{pmatrix} x_{model/acq} \\ y_{model/acq} \end{pmatrix}$ with $A = \begin{pmatrix} x_{model} & y_{model} & 0 & 0 & 1 & 0 \\ 0 & 0 & x_{model} & y_{model} & 0 & 1 \end{pmatrix}$ and $mt = \begin{pmatrix} m_{11} & m_{12} & m_{21} & m_{22} & t_x & t_y \end{pmatrix}^T$

Thus, the vector mt can be determined using the inverse of the matrix A; as it is not square, we have to calculate a pseudo-inverse:

$mt = ((A^T A)^{-1} A^T) \begin{pmatrix} x_{model/acq} \\ y_{model/acq} \end{pmatrix}$

As we have six unknowns, solving this problem requires at least three positions $p_{model}$ with their three corresponding positions within the acquired image $p_{model/acq}$. But in order to increase the accuracy of the Least Squares Solution, it is better to have more positions: all the matches can be considered (of course, the previous steps such as the descriptor matching and the keypoint matching have to be correct).
Thus, we have:

$A\, mt = v_{model/acq}$

with

$A = \begin{pmatrix} x_{model}(1) & y_{model}(1) & 0 & 0 & 1 & 0 \\ 0 & 0 & x_{model}(1) & y_{model}(1) & 0 & 1 \\ \ldots & \ldots & \ldots & \ldots & \ldots & \ldots \\ x_{model}(nb_{matches}) & y_{model}(nb_{matches}) & 0 & 0 & 1 & 0 \\ 0 & 0 & x_{model}(nb_{matches}) & y_{model}(nb_{matches}) & 0 & 1 \end{pmatrix}$

and

$v_{model/acq} = \begin{pmatrix} x_{model/acq}(1) \\ y_{model/acq}(1) \\ \ldots \\ x_{model/acq}(nb_{matches}) \\ y_{model/acq}(nb_{matches}) \end{pmatrix}$

$nb_{matches}$ is the number of matches detected.
Thus, as previously, the solution is the following:

$mt = ((A^T A)^{-1} A^T)\, v_{model/acq}$

We now have the matrix M and the translation t, but we still need the scale s and the angle $\theta$ describing the transformation from the model to the acquisition.
To calculate them, we can simply rely on the properties of the rotation matrix R:

$R\, R^T = I$ and $R = \begin{pmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{pmatrix}$

We conclude:

$M\, M^T = s^2 I$ and $s = \sqrt{M M^T(1,1)} = \sqrt{M M^T(2,2)} = \sqrt{\frac{M M^T(1,1) + M M^T(2,2)}{2}}$

We deduce the rotation matrix R and the angle $\theta$:

$R = \frac{M}{s}$ and $\theta = \arctan\left(\frac{R(2,1)}{R(1,1)}\right)$

Then, at the end of this step, we have the translation t, the scale s and the angle $\theta$ referring to the transformation of the model within the acquisition.
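As a sketch of this estimation, the snippet below solves the same least-squares system with NumPy for a set of matched point pairs; the arrays pts_model and pts_acq are assumed to hold the matched 2D positions:

    import numpy as np

    def estimate_2d_transform(pts_model, pts_acq):
        """Estimate M = sR and t such that p_acq ≈ M p_model + t (least squares)."""
        n = len(pts_model)
        A = np.zeros((2 * n, 6))
        v = np.zeros(2 * n)
        for i, ((xm, ym), (xa, ya)) in enumerate(zip(pts_model, pts_acq)):
            A[2 * i]     = [xm, ym, 0, 0, 1, 0]
            A[2 * i + 1] = [0, 0, xm, ym, 0, 1]
            v[2 * i], v[2 * i + 1] = xa, ya
        mt, *_ = np.linalg.lstsq(A, v, rcond=None)    # pseudo-inverse solution
        M = mt[:4].reshape(2, 2)
        t = mt[4:]
        s = np.sqrt((M @ M.T)[0, 0])                   # scale from M M^T = s^2 I
        theta = np.arctan2(M[1, 0], M[0, 0])           # rotation angle
        return M, t, s, theta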
In the diagram below, the red points refer to examples of matches between the acquisition and the model: it is necessary to have at least three couples of positions (acquisition-model) in order to calculate the features. You can also see some useful measures: $C_{model}$ and $C_{model/acq}$ refer respectively to the center of rotation within the model and within the acquisition. The Least Squares Solution only gives us an estimation of the origin of the model within the acquired image, $O_{model/acq}$, but the center of rotation is easily calculated using the size of the model ($sW_0$, $sH_0$), where $W_0$ and $H_0$ refer to the model image from which the keypoints were extracted.
[Diagram: the model image of size ($W_0$, $H_0$) with origin $O_{model}$ and center $C_{model}$, and the model within the acquisition with origin $O_{model/acq}$, center $C_{model/acq}$, size ($sW_0$, $sH_0$), rotation $\theta$ and translation ($t_x$, $t_y$) from the acquisition origin $O_{acq}$.]
Reprojection error
In order to have an idea of the quality of the Least Squares Solution, it is necessary to calculate the reprojection error.
The reprojection error is simply obtained by calculating the new position vector of the model within the acquisition $v'_{model/acq}$, using the position vector of the model $v_{model}$ and the matrix M and translation t previously calculated.
Thus, we have an error (the reprojection error $error_{repro}$) defined by the distance between the calculated position vector $v'_{model/acq}$ and the real one $v_{model/acq}$:

$d_{error} = v'_{model/acq} - v_{model/acq} \qquad error_{repro} = \frac{\| d_{error} \|}{\| v_{model/acq} \|}$
2.3 Localization
2.3.1. Introduction
There are a lot of ways to localize objects:
ultrasonic sensors
infrared sensors
cameras
...
Localization with cameras provides a lot of data, and working with pixels is more accurate. However, it needs more processing time and computation.
In the context of the project, localization improves the interaction between the robot and the detected object.
We will see in the following how we can localize objects using one camera and then several cameras; we will also see the drawbacks and advantages of these two methods.
But before all, we need to focus on the parameters of a video camera, which are important to define.
2.3.2. Parameters of a video camera
A video camera can be defined by several parameters:
extrinsic
intrinsic
The extrinsic parameters describe the position of the camera in space, whereas the intrinsic parameters deal with the internal features of the camera. In the following, we will describe each of these parameters.
2.3.2.1 extrinsic
The extrinsic parameters of a camera refer to the position and orientation of the camera in relation to the observed object. Thus, a position expressed in the camera frame, $p_{cam}$, is defined as below:

$p_{cam} = R_{cam}\, p_{world} + T_{cam} = \begin{pmatrix} R_{cam} & T_{cam} \end{pmatrix} p_{world}$

(the right-hand form uses homogeneous coordinates), where:
$p_{world}$ refers to the same position expressed in the world frame
$R_{cam}$ and $T_{cam}$ refer respectively to the rotation and translation of the camera relative to the reference point

$R_{cam} = \begin{pmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{pmatrix} \qquad T_{cam} = \begin{pmatrix} t_x \\ t_y \\ t_z \end{pmatrix}$

In our case, it is interesting to know the object position $p_{obj/world}$ expressed in the camera frame, $p_{obj/cam}$. In order to build this relation, we consider that $R_{cam}$ and $T_{cam}$ are referenced to the object and not to the world as before. As we want to express $T_{cam}$ in the object frame, we have a translation of $-T_{cam}$ before the rotation:

$p_{obj/cam} = R_{cam}\,(p_{obj/world} - T_{cam}) = R_{cam} \begin{pmatrix} I & -T_{cam} \end{pmatrix} p_{obj/world}$

[Diagram: the object/world frame ($X_{obj/world}$, $Y_{obj/world}$, $Z_{obj/world}$, origin $O_{obj/world}$) and the object/camera frame ($X_{obj/cam}$, $Y_{obj/cam}$, $Z_{obj/cam}$, origin $O_{obj/cam}$), related by ($R_{cam}$, $T_{cam}$).]
2.3.2.2 intrinsic
The intrinsic parameters can be divided into two categories:
the projection parameters, allowing us to build the projection relation between the 2D points seen and the real 3D points
the distortions, defined in relation to an ideal pinhole camera
2.3.2.2.1 projection parameters
The projection parameters allow us to fill the projection matrix linking 3D points from the world with the 2D points (or pixels) seen by the camera.
This matrix applies the relation followed by ideal pinhole cameras. A pinhole camera is a simple camera without a lens and with a single small aperture; we can assume that a video camera with a lens has a similar behavior to the pinhole camera, because the rays go through the optical center of the lens.
This relation is easily determined using the geometry of a front-projection pinhole camera:

$\frac{x'_{img}}{x_{obj/cam}} = \frac{y'_{img}}{y_{obj/cam}} = \frac{f_{cam}}{z_{obj/cam}}$

We deduce:

$x'_{img} = f_{cam}\, \frac{x_{obj/cam}}{z_{obj/cam}} \qquad y'_{img} = f_{cam}\, \frac{y_{obj/cam}}{z_{obj/cam}}$

These relations allow us to build the projection matrix linking the 2D points on the image $p'_{img}$ and the 3D object points expressed relative to the video camera frame $p_{obj/cam}$:

$p'_{img} = s\, P_{cam}\, p_{obj/cam}$

[Diagram: front-projection pinhole camera model, showing the optical center $O_{cam}$, the optical axis, the image plane with origin $O_{img}$, the focal length $f_{cam}$, the distance $z_{obj/cam}$, the object point ($x_{obj/cam}$, $y_{obj/cam}$) at $O_{obj}$ and its projection ($x'_{img}$, $y'_{img}$).]
where $P_{cam} = \begin{pmatrix} f_{cam} & 0 & 0 \\ 0 & f_{cam} & 0 \\ 0 & 0 & 1 \end{pmatrix}$ and $s = \frac{1}{z_{obj/cam}}$

s is a scale factor depending on the distance between the object and the camera: the higher it is, the bigger the object will appear on the image.
The position on the image $p'_{img}$ is expressed in distance units; it is now necessary to convert it into pixel units ($p_{img}$) and apply an offset:

$p_{img} = P_{img}\, p'_{img}$ where $P_{img} = \begin{pmatrix} N_x & N_x \cot(\theta) & c_x \\ 0 & \frac{N_y}{\sin(\theta)} & c_y \\ 0 & 0 & 1 \end{pmatrix}$

$N_x$ and $N_y$ refer respectively to the number of pixels per unit of distance horizontally and vertically.
$c_x$ and $c_y$ refer to the coordinates of the image center (in pixels); as the z axis of the camera frame lies on the optical axis, it is necessary to offset $p_{img}$ by half of the image width in x and half of the image height in y ($c_x$ and $c_y$ respectively).
$\theta$ refers to the angle between the x and y axes; it can be responsible for a skew distortion on the image. The axes are generally orthogonal, so this angle is very close to 90°.
Finally, we can deduce the final projection matrix $P_{cam\text{-}px}$, allowing us to obtain directly the position on the image in pixels $p_{img}$:

$p_{img} = s\, P_{img}\, P_{cam}\, p_{obj/cam} = s\, P_{cam\text{-}px}\, p_{obj/cam}$

where $P_{cam\text{-}px} = \begin{pmatrix} N_x f_{cam} & N_x f_{cam} \cot(\theta) & c_x \\ 0 & \frac{N_y f_{cam}}{\sin(\theta)} & c_y \\ 0 & 0 & 1 \end{pmatrix} = \begin{pmatrix} f_x & f_x \cot(\theta) & c_x \\ 0 & \frac{f_y}{\sin(\theta)} & c_y \\ 0 & 0 & 1 \end{pmatrix}$

We have $f_x = N_x f_{cam}$ and $f_y = N_y f_{cam}$; generally, the term $\sin(\theta) = 1$ as $\theta \approx 90°$, but the term $\cot(\theta)$ is often kept. Thus, a skew coefficient is defined as $\alpha_c = \cot(\theta)$, and the projection matrix can be approximated:

$P_{cam\text{-}px} \approx \begin{pmatrix} f_x & f_x\, \alpha_c & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix}$ where $\alpha_c \approx 0$
2.3.2.2.2 distortions
There are two different types of distortion:
the radial distortion dr
the tangential distortions in x ($dt_x$) and in y ($dt_y$)
The radial distortion comes from the convexity of the lens: the farther we go from the center of the lens (the center of the image), the more distortion we have. You can see below the effect of radial distortion.
The tangential distortions come from the lack of parallelism between the lens and the image plane.
Thus, we can express the distorted and normalized position ($p_n^d$) relative to the normalized object position $s\, p_{obj/cam} = (x_n, y_n)$:
$p_n^d = dr\; s\, p_{obj/cam} + \begin{pmatrix} dt_x \\ dt_y \\ 1 \end{pmatrix} = M_d\; s\, p_{obj/cam}$

where $M_d = \begin{pmatrix} dr & 0 & dt_x \\ 0 & dr & dt_y \\ 0 & 0 & 1 \end{pmatrix}$ and $s = \frac{1}{z_{obj/cam}}$

You can notice that we have to normalize the object position before applying the distortions, and this must be done before the camera projection seen in the previous part. Indeed, the focal lengths, the center of the image and the skew coefficient will be applied to the distorted and normalized position $p_n^d$ instead of directly to the object position $p_{obj/cam}$; it is then not necessary to normalize again by 1/s during the projection.
dr, $dt_x$ and $dt_y$ are expressed as functions of the object position normalized by the z axis, $s\, p_{obj/cam} = (x_n, y_n)$:

$x_n = \frac{x_{obj/cam}}{z_{obj/cam}} \qquad y_n = \frac{y_{obj/cam}}{z_{obj/cam}} \qquad r = \sqrt{x_n^2 + y_n^2}$
Thus, the distortion coefficients are calculated as shown below:

$dr = 1 + dr_1 r^2 + dr_2 r^4 + dr_3 r^6 + \ldots + dr_n r^{2n}$

$dt_x = 2\, dt_1\, x_n y_n + dt_2 (r^2 + 2 x_n^2) \qquad dt_y = dt_1 (r^2 + 2 y_n^2) + 2\, dt_2\, x_n y_n$
2.3.2.3 Conclusion
Using the equations above, we can express a real 2D pixel point on the image $p_{img}$ as a function of a 3D object point within the world $p_{obj/world}$:

$p_{img} = s\, P_{cam\text{-}px}\, M_d\, R_{cam} \begin{pmatrix} I & -T_{cam} \end{pmatrix} p_{obj/world}$

$p_{img} = s \begin{pmatrix} f_x & f_x \alpha_c & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} dr & 0 & dt_x \\ 0 & dr & dt_y \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \end{pmatrix} p_{obj/world} = M\, p_{obj/world}$

M will be called the intrinsic matrix in the following.
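To make the projection concrete, here is a small numeric sketch (illustrative values only) that builds the pixel projection matrix P_cam-px and projects a 3D point expressed in the camera frame, ignoring distortion:

    import numpy as np

    # Illustrative intrinsic parameters (focal lengths and image center in pixels)
    fx, fy = 600.0, 600.0
    cx, cy = 320.0, 240.0
    alpha_c = 0.0                        # skew coefficient, usually close to zero

    P_cam_px = np.array([[fx, fx * alpha_c, cx],
                         [0.0, fy,          cy],
                         [0.0, 0.0,         1.0]])

    # A 3D point expressed in the camera frame (meters)
    p_obj_cam = np.array([0.1, -0.05, 1.5])

    # Projection: scale by s = 1/z, then apply the matrix
    p_img = P_cam_px @ (p_obj_cam / p_obj_cam[2])
    print("Pixel coordinates:", p_img[0], p_img[1])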
2.3.3. Localization with one camera
With one camera, there is a simple way to localize objects in space, even if it is necessary to know at least one dimension in distance units.
It consists of the following steps:
calibrate the camera (mono-calibration)
undistort the image
find patterns
calculate the position of the patterns
As the diagram below shows, the calibration of the camera is only needed at the beginning of the process; the other steps form the real-time application.
[Diagram: mono-camera localization pipeline — calibrate the camera once, then, in the real-time application: undistort the image, find patterns, calculate the position of the patterns.]
In the following, we will describe each one of these steps.
2.3.3.1 Calibrate the camera (mono-calibration)
First of all, we have to calibrate the camera in order to know its intrinsic parameters. We can notice that the calibration can also inform us about the extrinsic parameters, but as they are always variable, they are less interesting.
What are the different methods of calibration?
We can notice that there are several ways to calibrate a camera:
photogrammetric calibration
self-calibration
Photogrammetric calibration needs to know the exact geometry of the object in 3D space. Thus, this method is not flexible in an unknown environment and, moreover, it is difficult to implement because of the need for precise 3D objects.
Self-calibration does not need any measurements and still produces accurate results. It is this method that was chosen, calibrating with a chessboard which has simple repetitive patterns.
Chessboard corner detection
A simple way to calibrate is to use a chessboard: this allows the position of the corners on the image to be known accurately.
In the appendices, you can find more details about this type of calibration; I describe the Tsai method, which is a simple way to proceed.
Thus, the calibration informs us about the intrinsic parameters of the camera:
the focal lengths in x ($f_x$) and in y ($f_y$)
the position of the center ($c_x$, $c_y$)
the skew coefficient ($\alpha_c$)
the radial distortion (dr) and the tangential ones ($dt_x$ and $dt_y$)
Calibration is only needed at the beginning of a process; the intrinsic parameters can be saved in an XML file.
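As an illustration (not necessarily the exact procedure used in the project), OpenCV provides a chessboard-based calibration along these lines; the board size and file names are placeholders:

    import cv2
    import numpy as np
    import glob

    pattern = (9, 6)                                   # inner corners of the chessboard
    objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)  # corner grid, z = 0

    obj_points, img_points = [], []
    for path in glob.glob("calib_*.png"):              # calibration images (placeholder names)
        gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        found, corners = cv2.findChessboardCorners(gray, pattern)
        if found:
            obj_points.append(objp)
            img_points.append(corners)

    # Returns the intrinsic matrix (fx, fy, cx, cy) and the distortion coefficients
    ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
        obj_points, img_points, gray.shape[::-1], None, None)
    print("Intrinsic matrix:\n", K)
    print("Distortion coefficients:", dist.ravel())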
2.3.3.2 Undistort the image
Undistorting the image is the first step of the real-time application, which will be executed at a specific rate.
Distortions can be responsible for a bad arrangement of the pixels within the image, which would have a big impact on the accuracy of the localization.
Thus, the distortions have to be compensated by an undistortion process that will remap the image.
To do so, we use the distortion matrix $M_d$ and the projection matrix $P_{cam\text{-}px}$ in order to calculate the ideal pixel from the distorted one.
If we consider $p_{img}^i$ as the ideal pixel and $p_{img}^d$ as the distorted one, considering what we said previously, we have:

$p_{img}^i = s\, P_{cam\text{-}px}\, p_{obj/cam}$ and $p_{img}^d = s\, P_{cam\text{-}px}\, M_d\, p_{obj/cam}$

The idea is to calculate the distorted pixel from the ideal one:

$p_{img}^d = s\, P_{cam\text{-}px}\, M_d\, \frac{1}{s}\, P_{cam\text{-}px}^{-1}\, p_{img}^i = P_{cam\text{-}px}\, M_d\, P_{cam\text{-}px}^{-1}\, p_{img}^i$

We cannot calculate the ideal pixel directly, because the distortion matrix $M_d$ is computed from the ideal pixels. Because of this problem, we need to use image mapping:

$x_{img}^d = map_x(x_{img}^i, y_{img}^i)$ and $y_{img}^d = map_y(x_{img}^i, y_{img}^i)$

Thus, in order to find the ideal image, we remap: the pixel at $(x_{img}^i, y_{img}^i)$ in the ideal image takes the value of the distorted image at $(map_x(x_{img}^i, y_{img}^i),\ map_y(x_{img}^i, y_{img}^i))$.
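A possible OpenCV sketch of this remapping, reusing the intrinsic matrix K and the distortion coefficients dist from the calibration sketch above (both names are assumptions from that example):

    import cv2

    # Pre-compute the undistortion maps once (ideal pixel -> distorted pixel)
    h, w = 480, 640
    map_x, map_y = cv2.initUndistortRectifyMap(
        K, dist, None, K, (w, h), cv2.CV_32FC1)

    # In the real-time loop, remap every acquired frame
    frame = cv2.imread("frame.png")                    # placeholder acquisition
    undistorted = cv2.remap(frame, map_x, map_y, cv2.INTER_LINEAR)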
2.3.3.3 find patterns
Then, we have to find the patterns we want to focus on.
We can either:
find keypoints (Harris, Shi-Tomasi, SUSAN, SIFT, SURF, ...)
or find keyregions (RANSAC, Hough, ...)
SIFT and SURF allow robust pattern recognition invariant to scale, rotation, brightness and contrast, even if they require more computation than other corner detection methods.
I do not give more details about the corner detection methods, but they are divided into two types:
Harris, Shi-Tomasi: a Harris matrix is calculated using the local gradients and, depending on its eigenvalues, it is possible to extract specific corners.
SUSAN (Smallest Univalue Segment Assimilating Nucleus): it works with a SUSAN mask and generates a USAN (Univalue Segment Assimilating Nucleus) region area depending on the pixel intensities. It can be improved using shaped rings counting the number of intensity changes.
As my project is oriented towards pattern recognition, I used SIFT and SURF to extract keypoints.
2.3.3.4 calculate the position of the patterns
Knowing the projection matrix $P_{cam\text{-}px}$ given by the calibration, it is possible to relate the 3D coordinates ($x_{obj/cam}$, $y_{obj/cam}$, $z_{obj/cam}$) to the pixels ($x_{img}$, $y_{img}$):

$\begin{pmatrix} x_{img} \\ y_{img} \\ 1 \end{pmatrix} = s\, P_{cam\text{-}px} \begin{pmatrix} x_{obj/cam} \\ y_{obj/cam} \\ z_{obj/cam} \end{pmatrix} \approx \frac{1}{z_{obj/cam}} \begin{pmatrix} f_x\, x_{obj/cam} + c_x\, z_{obj/cam} \\ f_y\, y_{obj/cam} + c_y\, z_{obj/cam} \\ z_{obj/cam} \end{pmatrix}$ if $\alpha_c \approx 0$

In the following, we consider $\alpha_c = 0$.
Knowing the distance to the object ($z_{obj/cam}$), we can directly calculate the position of the object ($x_{obj/cam}$, $y_{obj/cam}$):

$x_{obj/cam} = \frac{x_{img} - c_x}{f_x}\, z_{obj/cam}$ and $y_{obj/cam} = \frac{y_{img} - c_y}{f_y}\, z_{obj/cam}$

Knowing one dimension of the object ($d_{obj}$), we can calculate the distance to the object ($z_{obj/cam}$):

$z_{obj/cam} = \frac{f_0}{d_{img} - c_0}\, d_{obj}$

where $f_0$ and $c_0$ are the mean values of $f_x$, $f_y$ and of $c_x$, $c_y$ respectively. The dimension $d_{img}$ refers to the dimension $d_{obj}$ as seen on the image.
Thus, we see that there are two main problems with localization using one camera:
either we need to know the distance to the object $z_{obj/cam}$ (a small numeric sketch of this case is given below)
or we need to know at least one dimension of the object ($d_{obj}$) and know how to extract it from the image ($d_{img}$).
In order to make the localization more autonomous, we need more than one camera, as we will see in the next part using stereovision.
2.3.4. Localization with two cameras : stereovision
We will now focus on stereovision using two cameras (one on the left and another on the right): it enables autonomous localization, even if it is more complicated to implement.
It requires the same main steps as with one camera, but adds a few more:
mono- and stereo-calibrate the cameras
rectify, reproject and undistort the image
find patterns in the left camera
find the correspondences in the right camera
calculate the position of the patterns
As previously, the calibration is isolated from the real-time application.
In the following, I give more details about each one of these steps.
[Diagram: stereovision localization pipeline — mono- and stereo-calibrate the cameras once, then, in the real-time application: rectify, reproject and undistort the images, find patterns in the left camera, find the correspondences in the right image, calculate the position of the patterns.]
2.3.4.1 mono and stereo-calibrate the cameras
This operation consists, as its name suggests, in mono-calibrating each one of the cameras (left and right) and stereo-calibrating the system made up of the two cameras.
A simple way to do this is first to mono-calibrate each one of the cameras in order to obtain their respective intrinsic and extrinsic parameters. Then, the stereo-calibration uses their extrinsic parameters in order to know their relationship.
This relationship is defined by:
the relative position between the left camera and the right one $P_{lr}$ (rotation $R_{lr}$ and translation $T_{lr}$)
the essential and fundamental matrices E and F
[Diagram: stereo geometry — an object point P seen from the two optical centers $O_l$ and $O_r$, with positions $P_l$ and $P_r$ in the left and right camera frames, and the relative transformation ($R_{lr}$, $T_{lr}$) between the cameras.]
In the diagram above, you can see $P_l$ and $P_r$ as the positions of the object P relative to the left and right cameras. $T_{lr}$ and $R_{lr}$ refer to the relative translation and rotation from the left camera to the right one.
The essential matrix E links the positions $P_l$ and $P_r$ relative to the left and right cameras (it considers only extrinsic parameters):

$P_r^T\, E\, P_l = 0$

The fundamental matrix F links the image projections of the left camera $p_l$ and of the right one $p_r$ (it considers both extrinsic and intrinsic parameters):

$p_r^T\, F\, p_l = 0$ where $p_l = M_l\, P_l$ and $p_r = M_r\, P_r$

$M_l$ and $M_r$ refer to the intrinsic matrices of the left and right cameras.
The matrix F gives us a clear relation between one pixel from the left camera and the pixels from the right one. Indeed, one pixel on the left image corresponds to a line in the right one (called the right epipolar line $l_r$); the same holds for one pixel on the right image (left epipolar line $l_l$).
In the diagram below, you can see a pixel $p_l$ from the left image $img_l$ with its corresponding right epiline $l_r$ in the right image $img_r$.
$O_l$ and $O_r$ refer to the optical centers of the left and right cameras respectively.
Thus, thanks to the fundamental matrix F, we can calculate the epilines:
if $p_l$ is a pixel from the left camera, the corresponding right epiline is $l_r = F\, p_l$.
if $p_r$ is a pixel from the right camera, the corresponding left epiline is $l_l = F^T\, p_r$.
[Diagram: epipolar geometry — a pixel $p_l$ in the left image $img_l$ (optical center $O_l$) and its corresponding right epipolar line $l_r$ in the right image $img_r$ (optical center $O_r$).]
Moreover, using the relative position $P_{lr}$, we are able to calculate rectification and reprojection matrices for each one of the cameras.
But why rectify and reproject?
As we are working with a stereovision system of two cameras, the goal is to have a simple relationship between them in order to:
simplify the correspondence step (on the right image)
but above all, easily calculate the disparity between the two images (position calculation step)
Indeed, a translation only along x enables us to predict more easily the position of the features within the right camera: it avoids having to use the fundamental matrix F and the epilines to make the correspondence.
The fact that the cameras have just a shift along x between them makes the epilines lie at the same y as the pixel from the other image, as the diagram below shows.
The rectification compensates the relative position between the cameras in order to have only a shift along x between them. A rectification matrix, which is a rotation matrix, is applied to each one of the cameras ($R_l^{rect}$ and $R_r^{rect}$).
The reprojection unifies the projection matrices of the two cameras ($P_l^{cam\text{-}px}$ and $P_r^{cam\text{-}px}$) in order to build a global one, $P_{global}^{cam\text{-}px}$. Indeed, the goal here is to find the global camera equivalent to the stereovision system. A reprojection matrix is also applied to each one of the cameras ($P_l^{repro}$ and $P_r^{repro}$).
[Diagram: after rectification, the right epipolar line $l_r$ corresponding to a pixel $p_l$ of the left image $img_l$ is a horizontal line at the same y in the right image $img_r$.]
In the appendices, there are more explanations about the epipolar geometry used in stereovision and about the calculation of the rectification and reprojection matrices.
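For illustration, OpenCV exposes this chain roughly as follows; this is a sketch assuming the mono-calibration results K_l, dist_l, K_r, dist_r and the matched chessboard points obj_points, img_points_l, img_points_r are already available:

    import cv2

    image_size = (640, 480)

    # Stereo-calibration: estimate the rotation R and translation T between the cameras,
    # plus the essential and fundamental matrices E and F
    ret, K_l, dist_l, K_r, dist_r, R, T, E, F = cv2.stereoCalibrate(
        obj_points, img_points_l, img_points_r,
        K_l, dist_l, K_r, dist_r, image_size,
        flags=cv2.CALIB_FIX_INTRINSIC)

    # Rectification (R1, R2) and reprojection (P1, P2) matrices, plus the
    # disparity-to-depth matrix Q
    R1, R2, P1, P2, Q, roi_l, roi_r = cv2.stereoRectify(
        K_l, dist_l, K_r, dist_r, image_size, R, T)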
2.3.4.2 rectify, reproject and undistort the image
This step is similar to the undistortion step seen in the mono-camera case: the difference here is that we also apply a rectification matrix and a reprojection matrix to each camera.
Our ideal pixel is now:

$p_{img}^i = s_{global}\, P_{repro}\, R_{rect}\, p_{obj/cam}$

Thus, we can express the distorted (and unrectified) pixel $p_{img}^d$ as a function of $p_{img}^i$:

$p_{img}^d = s\, P_{cam\text{-}px}\, M_d\, \frac{1}{s_{global}}\, R_{rect}^T\, P_{repro}^{-1}\, p_{img}^i$

Then, similarly to the undistortion, we use image mapping.
2.3.4.3 find patterns in the left camera
This is exactly the same as in the one-camera case, except that the patterns are now extracted from the left camera.
2.3.4.4 find the correspondences in the right camera
The correspondences on the right image can be calculated using :
a matching function on epipolar lines
an optical flow (Lucas-Kanade for example)
Matching function on epipolar lines
The first method takes advantage of the rectification, which enables us to work on a horizontal line (constant y).
Using the pixels of the patterns found in the left image, it goes through the horizontal line of the right image at the same y: as we work on the right image, the correspondence points are expected to be on the left side of the x position of the pattern. A matching score over a small window, the Sum of Absolute Differences (SAD), is calculated at each possible pixel of the right image (this operation relies on the texture).
Thus, the best value of the matching function (the minimum for SAD) gives the best correspondence.
The diagram below explains the principle.
Optical flow
Optical flow or optic flow is the pattern of apparent motion of objects, surfaces and edges in a
visual scene caused by the relative motion between an observer and the scene.
Thus, you can extract an optical flow using:
several images from one camera at different instants
several images from different cameras with different viewpoints
In stereovision, we are in the second case.
The Lucas-Kanade method is an accurate method to find the velocity in a sequence of images, but it relies on several assumptions:
brightness constancy: a pixel from a detected object has a constant intensity.
temporal persistence: the image motion is small.
spatial coherence: neighboring points in the world have the same motion, belong to the same surface and are neighbors on the image.
Using a Taylor series, we can express the intensity of the next image $I(x+\delta x, y+\delta y, t+\delta t)$ as a function of the previous one $I(x, y, t)$:

$I(x+\delta x, y+\delta y, t+\delta t) \approx I(x, y, t) + \frac{\partial I}{\partial x}\delta x + \frac{\partial I}{\partial y}\delta y + \frac{\partial I}{\partial t}\delta t$

Using the assumption of brightness constancy and taking the relation above as an equality, we have:

$I(x+\delta x, y+\delta y, t+\delta t) = I(x, y, t)$ and $\frac{\partial I}{\partial x}V_x + \frac{\partial I}{\partial y}V_y + \frac{\partial I}{\partial t} = 0$

where $V_x = \frac{\delta x}{\delta t}$ and $V_y = \frac{\delta y}{\delta t}$

As we have two unknowns for the components of the velocity ($V_x$ and $V_y$), the equation cannot be solved directly. We need to use a region made up of several pixels (at least two pixels) where we consider that we have a constant velocity vector (spatial coherence assumption).
As the Lucas-Kanade method focuses on local features (a small window), we see why it is important to have small movements (temporal persistence assumption).
In the following, I will consider:

$I_x = \frac{\partial I}{\partial x}, \quad I_y = \frac{\partial I}{\partial y} \quad \text{and} \quad I_t = \frac{\partial I}{\partial t}$
Then, using the previous formula applied to small regions, we have for a region of n pixels:

$\begin{pmatrix} I_{x1} & I_{y1} \\ I_{x2} & I_{y2} \\ \ldots & \ldots \\ I_{xn} & I_{yn} \end{pmatrix} \begin{pmatrix} V_x \\ V_y \end{pmatrix} = -\begin{pmatrix} I_{t1} \\ I_{t2} \\ \ldots \\ I_{tn} \end{pmatrix}$

In order to calculate the velocity, we need to calculate a Least Squares Solution:

$\begin{pmatrix} V_x \\ V_y \end{pmatrix} = -((A^T A)^{-1} A^T) \begin{pmatrix} I_{t1} \\ I_{t2} \\ \ldots \\ I_{tn} \end{pmatrix}$ where $A = \begin{pmatrix} I_{x1} & I_{y1} \\ I_{x2} & I_{y2} \\ \ldots & \ldots \\ I_{xn} & I_{yn} \end{pmatrix}$
The Least Squares Solution works well for features such as corners, but less well for flat zones.
The Lucas-Kanade method can be improved to also handle large movements by using the scale dimension. Indeed, in this case, we can calculate the velocity ($V_x$, $V_y$) at a coarse scale and refine it several times before working on the raw image (pyramidal Lucas-Kanade). The diagram below shows the principle of this method.
Thus, using the velocity, we can calculate the displacement vector d and therefore the correspondence pixels within the right image.
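As an illustration, OpenCV's pyramidal Lucas-Kanade implementation can be used to track the points detected in the left image into the right image (a sketch; img_left, img_right and pts_left are assumed to come from the previous steps):

    import cv2
    import numpy as np

    # pts_left: Nx1x2 float32 array of pattern positions found in the left image
    pts_left = np.float32(pts_left).reshape(-1, 1, 2)

    # Pyramidal Lucas-Kanade: track the left points into the right image
    pts_right, status, err = cv2.calcOpticalFlowPyrLK(
        img_left, img_right, pts_left, None,
        winSize=(21, 21), maxLevel=3)

    # Keep only the points that were successfully tracked
    good_left = pts_left[status.ravel() == 1]
    good_right = pts_right[status.ravel() == 1]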
2.3.4.5 calculate the position of the patterns
This step is the core of the use of stereovision for localization.
The diagram below illustrates the stereovision system once the two images have been undistorted, rectified and reprojected.
We can find a geometrical relation between the cameras in order to obtain the distance to the object Z (where T is the distance between the two optical centers and f the focal length of the rectified cameras):

$\frac{T - (x_l - x_r)}{Z - f} = \frac{T}{Z}$ thus $Z = \frac{f\, T}{x_l - x_r}$

The disparity between the two cameras is defined as:

$d = x_l - x_r$

Thus, similarly to the mono-camera case, knowing Z we can calculate the complete position of the object relative to the camera (X, Y, Z):

$X = \frac{x_l - c_x^l}{f}\, Z$ and $Y = \frac{y_l - c_y^l}{f}\, Z$
disparity
The disparity d is an interesting parameter to measure, as it is inversely proportional to the distance of the object Z:
for an object far from the camera, it will be near zero: the localization becomes less accurate.
For an object close to the camera, it will tend to infinity: the localization becomes very accurate.
Thus, we will have to find a maximal distance up to which the object is detected with enough accuracy.
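A minimal sketch of this triangulation, assuming rectified pixel coordinates of a matched point in the left and right images and illustrative values for the focal length f and the baseline T:

    # Illustrative stereo parameters: focal length (pixels) and baseline (meters)
    f, T = 600.0, 0.12
    cx_l, cy_l = 320.0, 240.0

    # Rectified pixel coordinates of the same point in the left and right images
    x_l, y_l, x_r = 350.0, 260.0, 320.0

    disparity = x_l - x_r                  # d = x_l - x_r
    Z = f * T / disparity                  # distance to the object
    X = (x_l - cx_l) / f * Z               # lateral position
    Y = (y_l - cy_l) / f * Z               # vertical position
    print("3D position relative to the left camera:", X, Y, Z)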
2.4 NAO control
2.4.1. Introduction
The NAO is an autonomous humanoid robot built by the French company Aldebaran-Robotics. This robot is well suited to education purposes, as it is designed to be easy and intuitive to program.
It integrates several sensors and actuators which allow students to work on locomotion, grasping, audio and video signal processing, voice recognition and more.
A humanoid robot such as NAO can make courses more fun and motivate the students to work on practical projects: it allows complicated algorithms to be tested very quickly, in real time and in a real environment.
2.4.2. Hardware
2.4.2.1 General
Firstly, it is necessary to know that there are several models of NAO which enable different kinds of applications:
NAO T2 (humanoid torso): sensing, thinking
This model allows the focus to be put on signal processing and artificial intelligence.
NAO T14 (humanoid torso): sensing, thinking, interacting, controlling, grasping
This model adds a more practical aspect to the previous one, with the capacity to work on control.
NAO H21 (humanoid): sensing, thinking, interacting, controlling, travelling
This first humanoid model allows NAO to move within the environment: a lot of applications are possible, for example mapping.
NAO H25 (humanoid): sensing, thinking, interacting, controlling, travelling, being autonomous, grasping
This last model is the most evolved: it adds to the humanoid the ability to grasp objects (as the torso NAO T14 also does).
[Figure: the NAO T2, NAO T14, NAO H21 and NAO H25 models.]
Note: besides these different models of NAO, there are some modules which can be added to NAO, such as the laser head. The use of a laser improves the navigation of the robot in complex indoor environments.
Besides the model, each NAO robot corresponds to a specific version (V3+, V3.2, V3.3, ...).
The robots available in the laboratory are NAO H25, version V3.2: this hardware is a good base for a lot of applications.
2.4.2.2 Features
The NAO robot (NAO H25) has the following features:
height of 0.57 meters and weight of 4.5 kg
sensors: 2 CMOS digital cameras, 4 microphones, 2 ultrasonic emitters and receivers (sonars), 2 infrared emitters and receivers, 1 inertial board (2 one-axis gyrometers and 1 three-axis accelerometer), 9 tactile sensors (buttons), 8 pressure sensors (FSR), 36 encoders (Hall effect sensors)
actuators: brushed DC motors
other outputs: 1 voice synthesizer, LED lights, 2 high-quality speakers
connectivity: internet (Ethernet and WiFi) and infrared (remote control, other NAOs, ...)
a CPU (x86 AMD GEODE 500 MHz + 256 MB SDRAM / 2 GB flash memory) located in the head, running a Linux kernel (32-bit x86 ELF) and the Aldebaran Software Development Kit (SDK), NAOqi
a second CPU located in the torso
a 55 watt-hour lithium-ion battery (nearly 1.5 hours of autonomy)
The NAO is well equipped in order to be very flexible in its applications.
We can notice that the motors are controlled by a dsPIC microcontroller and through their encoders (Hall effect sensors).
The robot has 25 Degrees Of Freedom (DOF):
head: 2 DOF
arms: 4 DOF in each arm
pelvis: 1 DOF
legs: 5 DOF in each leg
hands: 2 DOF in each hand
For my project, I focused on the head at the beginning in order to work on object tracking.
Afterwards, I planned to work on object grasping: this task requires more calculations to synchronise the actuators correctly.
Depending on the time available, I had to leave some of the control aside, as my main topic was image processing for object recognition and localization.
In the next part, I will give more details about the two cameras of the NAO robot as I focused on
these sensors.
2.4.2.3 Video cameras
NAO has two identical video cameras which are located on its face. They provide a resolution of 640x480 and they can run at 30 frames per second.
Moreover, the two cameras have these features:
camera output: YUV422
Field Of View (FOV): 58 degrees (diagonal)
focus range: 30 cm to infinity
focus type: fixed focus
The upper camera is oriented straight through the head of NAO: it enables the robot to see in front of it.
The lower camera, shifted downwards, is oriented to see the ground: it enables NAO to see what it meets on the ground close to it.
We can notice that these cameras can't be used for stereovision as there is no overlap between their fields of view.
2.4.3. Software
There are several pieces of software which enable NAO to be programmed very quickly and easily. They also allow the user to interact easily with the robot.
2.4.3.1 Choregraphe
Choregraphe is a cross-platform application that allows the user to edit NAO's movements and behaviors. A behavior is a group of elementary actions linked together following an event-based and a time-based approach.
It consists of a simple GUI (Graphical User Interface) based on the NAOqi SDK, which is the framework used to control NAO: we will focus on it later.
Interactions
Choregraphe displays a NAO robot on the screen which imitates the actions of the real NAO in real time. It enables the user to work with the simulated NAO to save time during tests. The software also allows the motors to be un-enslaved, which means imposing no stiffness on each motor. This aspect allows the user to focus on the simulated NAO if he has to test the code more quickly and safely.
Moreover, the software enables each joint to be controlled in real time from the GUI. The interface displays the new joint values in real time even if the user moves the robot manually. Thanks to these features, the user can easily test the range of each joint, for example.
Boxes
In Choregraphe, each movement and detection (action) can be represented as a box; these boxes are arranged in order to create an application. As mentioned previously, behaviors, which are groups of actions, can also be represented as boxes but with several boxes inside.
In order to connect the boxes, they have input(s) and output(s) which generally allow the action to be started and finished. For specific boxes, such as sensor or computation ones, there can be several outputs (the value of the detected signals, calculated variables, ...) and several inputs (variables, conditions, ...).
Timeline
An interesting feature of Choregraphe is the possibility to create an action or a sequence of behaviors using the timeline. Contrary to the system of boxes seen previously, the timeline gives a view of the time, which is the real conductor of the application.
The screenshot above shows the timeline (marked as 1), the behavior layers (marked as 2) and the position on the frame (marked as 3). Thus, it's easier to manage several behaviors in series or in parallel.
Moreover, the timeline enables an action to be created from several positions of the joints of the robot. Indeed, it's only necessary to select a specific frame from the timeline, put NAO in a specific position and save the joints. The software will automatically calculate the required speed of the actuators in order to follow the constraints of the frames. It's also possible to edit the speed laws of each motor manually, even if it's necessary to be careful!
In the screenshot above, you can see the keyframes (squares) and several functions controlling the speed of the motors.
Editing a movement using keyframes allows, for example, a dance to be implemented easily for NAO.
Choregraphe offers other possibilities such as a video monitor panel enabling the user to:
display the images from NAO's cameras
recognize an object, using the selection of a pattern on the image as model
2.4.3.2 Telepathe
Telepathe is an application that allows the user to get feedback from the robot, as well as to send basic orders. Like Choregraphe, it's a cross-platform GUI but a customizable one (the user can load plugins in different widgets). These widgets or windows can be connected to different NAO robots.
Actually, Telepathe is a more sophisticated piece of software than Choregraphe for the configuration of devices and the visualization of data. It has a direct link with the memory of NAO which allows a lot of variables to be followed in real time.
Memory viewer
The ALMemory module of NAO (which is part of NAOqi) allows data to be written to and read from the memory of the robot. Telepathe uses this module to display variables with tables and graphs. The user can monitor some variables accurately during testing.
Video configuration
Moreover, Telepathe enables deep configuration of the video camera. Thus, it makes the testing of computer vision applications easier.
2.4.4. NAOqi SDK
The Software Development Kit (SDK) NAOqi is a robotic framework built especially for NAO. There are a lot of others for use on a large range of robots: Orocos, YARP (Yet Another Robot Platform), Urbi and others. They can be specialized in motion or real time, be AI-oriented, and they can sometimes be open source in order to enable everyone to work on the source code.
NAOqi improves the management of events and the synchronization of tasks as a real-time kernel can do. It works with a system of modules and around a shared memory (the ALMemory module).
2.4.4.1 Languages
The framework uses mainly three languages:
Python: as only a simple interpreter is needed to run the code, the language enables quick control of NAO and makes parallelism easier to manage.
C++: the language needs a compiler to produce the executable; it's used to program new modules as they work well with object-oriented languages and as they are expected to stay fixed.
URBI (UObjects and urbiScript): URBI is a robotic framework like NAOqi which can be started from NAOqi (as a NAOqi module called ALUrbi). It enables UObjects, which refer to modules, to be created and uses urbiScript to manage events and parallelism.
URBI is an interesting framework as it's compatible with a lot of robots (Aibo, Mindstorms NXT, NAO, Spykee, ...) but in my internship, I focused on NAOqi which follows the same principle. Thus, in the following, I will focus on the Python and C++ languages which use the core of NAOqi.
NAOqi is very flexible regarding languages (cross-language) as C++ modules can be called from Python code (and the opposite): this simplifies the programming and the exchange of programs between people.
Moreover, NAOqi ships with several libraries such as Qt or OpenCV which enable, respectively, GUIs to be created and images to be manipulated.
2.4.4.2 Brokers and modules
NAOqi works using brokers and modules:
a broker is an executable and a server that listens for remote commands on an IP address and port.
a module is both a class and a library which deals with a specific device or action of NAO.
Modules can be called from a broker using proxies: a proxy enables methods from a local or remote module to be used. Indeed, a broker can be executed on NAO but also on other NAOs and computers: it enables a single program to work across several devices.
An interesting property of modules is that they can't call other modules directly: only proxies enable them to use methods from different modules, as proxies refer to the IP address and port of a specific broker.
In the diagram above, you can see the architecture adopted by both NAO and its software; purple rectangles refer to brokers and green circles to modules. The broker mainBroker is the broker of the NAO robot: it is executed directly when the robot is switched on and has an IP address (Ethernet or WiFi).
This architecture is a safe one as the robot and each piece of software work with different brokers: if there is a crash in Choregraphe or Telepathe, the robot will be less likely to fall as the brokers will not communicate for 100 ms.
When using NAOqi as a developer, we can choose to use the mainBroker (unsafe but fast) or create our own remote broker. Personally, as I focused firstly on the actuators of the head for the tracking, I preferred to use the mainBroker as it was more intuitive and fast.
As the diagram above shows, there are a lot of modules available for NAO (mainBroker) but we can add more using the module generation from C++ classes.
The modules often refer to devices available on the robot:
sensors (ALSensors): tactile sensors
motion (ALMotion): actuators
leds (ALLeds): LED outputs
videoInput (ALVideoDevice): video cameras
audioIn (ALSoundDetection): microphones
AudioPlayer (ALAudioPlayer): speakers
...
But there are others which are built for a specific purpose:
interpret Python code: pythonbridge (ALPythonBridge)
run computational processes: faceDetection (ALVisionRecognition), landmark (ALLandMarkDetection), ...
communicate with the lower-level software controlling electrical devices: DCM (DCM)
manage a shared memory between modules: memory (ALMemory)
...
Concerning Telepathe, we find a memory module and a camera module, as expected.
In order to enable a module to make its methods available to the brokers, the methods need to be bound to the module's API (Application Programming Interface). This binding enables the methods to be called both locally and remotely.
As we said previously, Python code allows NAO to be controlled quickly because the NAOqi SDK provides a lot of Python functions to call proxies from one or several brokers and use these proxies to execute methods.
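As a minimal illustration of this principle (the robot address 192.168.1.10 is a hypothetical value; 9559 is the usual NAOqi port), a Python script only needs a proxy per module to call its bound methods:

    from naoqi import ALProxy

    # Hypothetical IP address of the robot running the mainBroker.
    NAO_IP, NAO_PORT = "192.168.1.10", 9559

    # Each proxy points to one module of the broker.
    tts = ALProxy("ALTextToSpeech", NAO_IP, NAO_PORT)
    motion = ALProxy("ALMotion", NAO_IP, NAO_PORT)

    tts.say("Hello")                                        # bound method of ALTextToSpeech
    motion.setStiffnesses("Head", 1.0)                      # enslave the head motors
    motion.angleInterpolation("HeadYaw", 0.5, 1.0, True)    # move the head yaw to 0.5 rad in 1 s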
NAOqi provides the same kind of methods for C++, but oriented more towards the generation of modules.
2.4.4.3 Generation of modules in C++
First of all, the module can be generated using a Python script called module_generator.py. It automatically generates several C++ templates:
projectNamemain.cpp: it allows the module to be created for a specific platform (NAO or PC).
moduleName.h and moduleName.cpp: these are the files which will be customized by the user.
projectName and moduleName are replaced by the desired names.
Thus, as NAOqi provides module libraries and other libraries (OpenCV for example) for C++, the user can develop easily with the framework. When the module is ready, it's necessary to compile it.
In order to build modules, NAOqi uses CMake, which compiles the sources using a configuration file (generally named CMakeLists) and other options such as the location of the sources or of the binaries. The Python script module_generator.py also creates a CMakeLists file which facilitates the compiling process. CMake can be executed with a command, as is the case on Linux, or it can open a GUI, as on Windows.
In order to compile modules, we use cross-compiling so as to make the code run on a specific platform (PC or robot). Indeed, the PC does not have the same modules as the robot as it does not have the same devices. Cross-compiling is done using a CMake toolchain file: toolchain-pc.cmake to compile a module for the PC and toolchain-geode.cmake for NAO.
On Linux, the CMake command has to mention another option in order to cross-compile:
cmake -DCMAKE_TOOLCHAIN_FILE=path/toolchain_file.cmake ..
Cross-compilation from Windows for NAO is impossible as NAO has a Linux OS; however, Cygwin on Windows can be an alternative.
After execution, CMake provides:
a Makefile configuration file on Linux: it describes the compiler to use (g++ in our case) and a lot of other options, and it has to be executed with the make command.
a Visual Studio Solution file (.sln) on Windows: it can be opened in Visual Studio and has to be built.
At the end of the process, both operating systems create a library file (.so on Linux and .dll on Windows). The problem with Windows in this case is that it can't provide a library for NAO as the robot uses .so libraries (Linux). The module library can then be moved into the lib folder of NAO or of the PC, and the configuration file autoload.ini has to be modified in order to load this module.
2.4.4.4 The vision module (ALVideoDevice)
As I focused on vision, I present in this part the management of the NAO cameras through the vision module called ALVideoDevice.
Vision on NAO (the ALVideoDevice module) is controlled by three components:
the V4L2 driver
It allows frames to be acquired from a device: a real camera, a simulated camera or a file. It uses a circular buffer which contains a specific number of image buffers (for NAO, there are four images). These images are updated at a specific frame rate, resolution and format depending on the features of the device.
the Video Input Module (VIM)
This module manages the video device and the V4L2 driver: it opens and closes the video device using the I2C bus, starts it in streaming mode (circular buffer) and stops it. It configures the driver for a specific device (real camera, simulated camera, file). Moreover, it can extract an image from the circular buffer (unmap buffer) at a specific frame rate and convert the image to a specific resolution and format. We can notice that several VIMs can be created at the same time for different or identical devices; they are identified by a name.
the Generic Video Module (GVM)
This module is the main interface between the video device and the user: it allows the user to choose a device, a frame rate, a resolution and a format; the GVM configures the VIM accordingly. Moreover, this module enables the user to control the video device in local or remote mode. The difference is at the level of the data received: in local mode, the image data (from the VIM) can be accessed using a pointer, whereas in remote mode, the image data has to be duplicated from the image of the VIM and it will be an ALImage. The GVM also has methods to retrieve the raw images from the VIM (without any conversion): this improves the frame rate even if it restricts the number of VIMs, depending on the number of buffers enabled by the driver.
The diagram below shows the three components with their own threads (driver, VIM and GVM). The GVM can access the image in two ways:
access 1, with conversion: by pointer if local and by an ALImage variable if remote
access 2, without conversion (raw image): by pointer if local and by an ALImage variable if remote
In my case, I used the GVM in remote and local mode with conversion. I used the front camera of NAO in RGB and BGR modes: as the camera natively outputs YUV422, a conversion is needed.
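As a sketch of how such an acquisition looks from Python in remote mode (the module and method names are those of the NAOqi 1.x API as I used it; the robot address is a placeholder):

    import numpy as np
    from naoqi import ALProxy

    cam = ALProxy("ALVideoDevice", "192.168.1.10", 9559)   # hypothetical robot address

    # resolution 2 = 640x480 (kVGA), colour space 13 = BGR, 30 fps
    handle = cam.subscribe("my_gvm", 2, 13, 30)
    try:
        frame = cam.getImageRemote(handle)                 # remote mode: the image is copied (ALImage)
        width, height, data = frame[0], frame[1], frame[6]
        image = np.frombuffer(bytearray(data), dtype=np.uint8).reshape((height, width, 3))
    finally:
        cam.unsubscribe(handle)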
[Diagram: the camera feeds the driver thread, whose circular buffer is unmapped by the VIM thread; the image is then delivered to the GVM threads either after format conversion (access 1) or raw (access 2).]
3 Implementation and testing
In this part, I will focus on the hardware and software I used to accomplish my mission: object recognition and localization on a NAO robot for control purposes.
3.1 Languages, software and libraries
I worked in two ways:
simulating and testing in Matlab
Matlab is a very useful tool to test algorithms as it's very easy to manipulate data and display it in figures. Moreover, a lot of toolboxes are available on the web: this enables users to quickly reuse existing programs as tools.
Moreover, Matlab is cross-platform: it can work on Windows, Linux or Mac. It's cross-language as the mex command allows C, C++ or Fortran code to be used from Matlab code and the opposite.
It offers a lot of toolboxes for several fields such as aerospace, image processing, advanced control, ... Matlab also offers functions to read data from a serial port (serial) or from other hardware such as video cameras, which allows a large range of devices to be manipulated. All these aspects make Matlab very flexible for any application needing computation.
real implementation in C++/Python with OpenCV for NAO
NAO can be programmed using both C++ and Python as we saw in the NAO control part. C++ allows modules to be built for NAO whereas Python, which is directly interpreted, allows applications using these modules to be written quickly.
In my case, as vision was the core of my project, I chose to build a vision module in order to extract the main features of the recognized object and to enable the user to choose which object to recognize, among other parameters.
I used Python for control as I didn't focus on control during my project. If there is a problem in the control part, it's very easy to modify the Python code as it's not necessary to compile it. I could have used Matlab to simulate the control as I did for the vision part but I didn't have enough time.
I used the Open Source Computer Vision (OpenCV) library in order to implement the vision part more easily.
3.2 Tools and methods
3.2.1. Recognition
For recognition, I used two ways to test in Matlab:
Firstly, I used an image (studied image) and a transformed version of it (cropped, rotated, scaled) as the model image.
This approach makes it easy to test the number of keypoints detected and the number of matches depending on the rotation, scale and size of the cropped image. Using this method, we can test both the feature extraction (SIFT/SURF) and the feature matching; a small matching sketch in this spirit is given after these two points.
I focused first of all on the invariance in rotation for this test: I used several rotated images and calculated the number of keypoints and matches with the studied image (with SIFT). You can see below the keypoints detected for several rotations and the number of keypoints as a function of the rotation of the model image (green points: studied image, red points: model image, blue points: matches).
The number of matches decreased with the rotation of the image: the SIFT implementation was not efficient and I had to improve it.
Afterwards, I focused on the calculation of the 2D model features within the studied image: the model image was cropped, rotated and scaled depending on the inputs of the user. The matching could estimate these values and, in order to test it, we could calculate the translation, rotation and scale errors.
Then, I used images acquired from the video camera of my computer and model images of objects.
This approach allows the algorithm to be tested on real 3D objects from the world. It shows the influence of the 3D rotation of the object, which limits the recognition. To keep it simple, I focused firstly on planar objects (books for example) as only one model image is required. Afterwards, for specific objects such as robots, several model images are needed. The 2D features of the model within the acquisition are always interesting to calculate.
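The kind of rotation test described in the first point can be sketched with OpenCV's own SIFT and brute-force matcher (a recent OpenCV build is assumed, not my Matlab implementation, and the file name is a placeholder):

    import cv2

    studied = cv2.imread("studied.png", cv2.IMREAD_GRAYSCALE)
    h, w = studied.shape

    # Rotate the studied image by 30 degrees to obtain a model image.
    R = cv2.getRotationMatrix2D((w / 2, h / 2), 30, 1.0)
    model = cv2.warpAffine(studied, R, (w, h))

    sift = cv2.SIFT_create()
    kp1, desc1 = sift.detectAndCompute(studied, None)
    kp2, desc2 = sift.detectAndCompute(model, None)

    # Descriptor matching with the ratio test on the two nearest neighbours.
    matcher = cv2.BFMatcher()
    good = [m for m, n in matcher.knnMatch(desc1, desc2, k=2) if m.distance < 0.8 * n.distance]
    print(len(kp1), len(kp2), len(good))   # keypoints in each image and number of matches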
3.2.2. Localization
For localization, I didn't spend time testing in Matlab and I focused directly on the implementation in C++ using the Open Source Computer Vision (OpenCV) library. I used a stereovision system available in the laboratory (a previous MSc student, Suresh Kumar Pupala, had designed it for his dissertation). I wanted to learn more about mono-calibration, stereo-calibration and stereo correspondence as the OpenCV functions hide the processing. Thus, I also used Matlab to test programs for calibration and stereo correspondence. I thought that understanding the concepts would make me more comfortable with camera devices. The photos below show the stereovision system and NAO wearing it: the system can be connected to a PC (USB ports); in order to use it with NAO, it's necessary to start NAOqi both on NAO and on the PC.
You can see below the screenshots showing the stereo-calibration and the stereovision using C++/OpenCV.
The blue points refer to the corners found in the left image (left camera), the green points to the corresponding points in the right image found using stereo correspondence (right camera), and the red points to the 3D points deduced, drawn in the left image.
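For the stereo correspondence step, a minimal sketch using OpenCV block matching on an already rectified pair is given below; the file names, focal length and baseline are placeholders (in practice they come from the stereo-calibration):

    import cv2
    import numpy as np

    left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)    # hypothetical rectified pair
    right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

    # Block-matching stereo correspondence: one disparity value per pixel.
    stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
    disparity = stereo.compute(left, right).astype(np.float32) / 16.0

    # Depth from disparity: Z = f * B / d, with f in pixels and the baseline B in metres.
    f, B = 700.0, 0.09                                      # assumed values
    with np.errstate(divide="ignore", invalid="ignore"):
        depth = np.where(disparity > 0, f * B / disparity, 0.0)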
3.2.3. NAO control
My final goal was to use a vision module (enabling object recognition and localization) combined with a program in Python for NAO control. I firstly wanted to do some basic control of NAO (such as tracking the object with its head, moving towards the object, saying what object it is watching) and then, if possible, more complicated control (grasping an object for example). In order to practise on NAO, I wrote a small vision module to track a red ball (HSV color space) and I made NAO focus on the barycentre of the object by rotating its head (pitch and yaw).
In the screenshot above, you can see the original image on the left and the HSV mask on the right: tracking using the HSV colorspace is not perfect as it depends a lot on the brightness, but it was a simple project to practise on NAO.
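A minimal sketch of one step of this kind of tracking loop, combining an OpenCV HSV mask with an ALMotion proxy (the HSV bounds, gains and robot address are assumptions, not the exact values of my module):

    import cv2
    import numpy as np
    from naoqi import ALProxy

    motion = ALProxy("ALMotion", "192.168.1.10", 9559)      # hypothetical robot address
    motion.setStiffnesses("Head", 1.0)

    def track_step(bgr):
        """One tracking step: mask the red ball, find its barycentre, correct the head angles."""
        hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
        mask = cv2.inRange(hsv, (0, 120, 70), (10, 255, 255))   # rough bounds for a red ball
        m = cv2.moments(mask)
        if m["m00"] == 0:
            return                                              # ball not visible
        cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]
        h, w = mask.shape
        # Proportional correction of yaw and pitch towards the barycentre.
        motion.changeAngles(["HeadYaw", "HeadPitch"],
                            [-0.3 * (cx - w / 2) / w, 0.3 * (cy - h / 2) / h], 0.1)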
3.3 Steps of the project
My project was divided into several steps:
As we said before, recognition is divided into two main steps: feature extraction and feature matching. I worked firstly on feature extraction with SIFT. I implemented SIFT in Matlab using a C++ code written by an Indian student in Computer Science, Utkarsh Sinha. Like me, he used the publication of David G. Lowe (Distinctive Image Features from Scale-Invariant Keypoints). I focused on a clear program instead of a fast one because I wanted to divide it into several distinct functions: by doing this, the program becomes more flexible.
Afterwards, I studied the feature matching and I focused on the descriptor matching and the calculation of the 2D model features (least squares solution).
Then, I looked into the localization by learning about the parameters of a video camera, calibration and stereovision. In order to implement it, I used a stereovision system adaptable to NAO's head which was available in the laboratory. I used a lot of references such as the MSc dissertation of a previous student in Robotics and Automation (3D-Perception from Binocular Vision, Suresh Kumar Pupala) or the publication of David G. Lowe. In order to implement it in C++, I used the Open Source Computer Vision (OpenCV) library. I also did some experiments in Matlab in order to understand the calibration and the stereo correspondence better. Before linking the localization with the
recognition, I focused only on corners: this allows the evolution of the disparity and the values of the 3D positions to be studied.
Afterwards, I learned about the control of NAO and I tested a small recognition module in C++ (ballTracking) combined with a control part in Python. The goal of this small project was to deal with the use of the NAO camera or another camera (laptop camera), in remote or local mode, and to build a simple control of the head of NAO. The project consisted in tracking a red ball in the image from the camera (HSV color space), calculating the barycentre, and controlling the head of NAO in order to make it focus its camera on the object.
Later, I studied another invariant feature extraction method (SURF). Contrary to SIFT, I didn't implement it in Matlab; I only downloaded the OpenSURF library which is available as either Matlab code or C++. As this was quicker than SIFT, I used it to study the feature matching.
Finally, I learned about the Generalized Hough Transform which enables keypoint matching focusing on the position and the orientation of the keypoints. I first studied the algorithms outside the matching and then I integrated them into it; I used only Matlab to test them.
My work is not finished as I didn't implement the feature matching in C++. The goal is then to replace the corner detection of the localization by SURF feature extraction and add the feature matching before doing the stereo correspondence on the right image: the recognition will be done on the left image only. Thus, combining recognition and localization, we will have keypoints localized within space, which enables the average position of the object to be estimated and maybe more (3D orientation, 3D size, ...).
[Project roadmap diagram: SIFT (SIFT in Matlab, some testing) → Feature matching (descriptor matching, 2D features calculation) → Localization (parameters of a video camera, calibration, stereovision) → NAO control (ball tracking project) → SURF (use of the OpenSURF library, some testing) → Feature matching (keypoint matching) → Future work (feature matching in C++, SURF and feature matching combined with stereovision in a module for NAO).]
4 Conclusion
The project revolved around three subjects:
object recognition
object localization
NAO control
I spoke firstly about these subjects before presenting what I did and what I still have to do. I used different methods to deal with each one of these subjects: I preferred to simulate the recognition more in Matlab whereas I chose to practise quickly on the localization and the control of NAO (using C++, OpenCV and Python). However, dealing with the localization, I also used Matlab in order to understand better what is inside the calibration and stereo-calibration.
I can conclude that this project allowed me to learn more about the use of video cameras for robotic purposes. Indeed, recognition is useful at the beginning to track the desired object and localization enables the robot to interact with it using control: the behavior of the robot will depend on these three parts (recognition, localization and control).
Concerning my project, as I said when I described its steps, it's not finished: it needs some implementation in C++ and some testing, which I hope will be done. Contrary to SIFT, SURF can be used through open source implementations such as OpenSURF: thus, the recognition I presented is not only academic and could be used by companies.
5 References
3D-Perception from Binocular Vision (Suresh Kumar Pupala)
Website of Utkarsh Sinha: http://www.aishack.in/
A simple camera calibration method based on sub-pixel corner extraction of the chessboard image (Yang Xingfang, Huang Yumei and Gao Feng): http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5658280&tag=1
University of Nevada, Reno: http://www.cse.unr.edu/~bebis/CS791E/Notes/EpipolarGeonetry.pdf
A simple rectification method of stereo image pairs (Huihuang Su, Bingwei He): http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5678343
OpenCV tutorials (Noah Kuntz): http://www.pages.drexel.edu/~nk752/tutorials.html
Learning OpenCV (Gary Bradski and Adrian Kaehler)
OpenCV 2.1 C Reference: http://opencv.willowgarage.com/documentation/c/index.html
Structure from Stereo Vision using Optical Flow (Brendon Kelly): http://www.cosc.canterbury.ac.nz/research/reports/HonsReps/2006/hons_0608.pdf
Distinctive Image Features from Scale-Invariant Keypoints (David G. Lowe): http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf
Feature Extraction and Image Processing (Mark S. Nixon and Alberto S. Aguado)
Speeded-Up Robust Features (SURF) (Herbert Bay, Andreas Ess, Tinne Tuytelaars and Luc Van Gool): ftp://ftp.vision.ee.ethz.ch/publications/articles/eth_biwi_00517.pdf
Aldebaran official website (Aldebaran): http://www.aldebaran-robotics.com/
Aldebaran Robotics documentation (Aldebaran): http://users.aldebaran-robotics.com/docs/site_en/index_doc.html
URBI (Gostai): http://www.gostai.com/downloads/urbi-sdk/2.x/doc/urbi-sdk.htmldir/index.html#urbi-platforms.html
NAO tutorials (Robotics Group of the University of León): http://robotica.unileon.es/mediawiki/index.php/Nao_tutorial_1:_First_steps
6 Appendices
6.1 SIFT: a method of feature extraction
6.1.1. Definition and goal
Find keypoints of the object that are invariant to scale, rotation and illumination.
In pattern recognition, SIFT has two advantages:
it can identify complex objects in a scene using specific keypoints
it can identify the same object at several positions and rotations of the camera
However, it has one main disadvantage, the processing time. SURF (Speeded Up Robust Features) leans on the same principle as SIFT but is quicker, relying in particular on a fast (Fast-Hessian) keypoint detection.
6.1.2. Description
The SIFT process can be divided into two main steps which we will study in detail:
find the location and orientation of the keypoints (keypoint extraction)
generate a specific descriptor for each one (descriptor generation)
Indeed, we first have to know the location of the keypoints on the image but afterwards, we have to build a very discriminating ID, which is why we need one descriptor for each of them. We can have a lot of keypoints so we have to be able to be very selective in our choice.
In SIFT, one descriptor is a vector of 128 values.
6.1.3. Keypoint extraction
This first process is divided into two main steps:
find the location of the keypoints
find their orientation
These two steps have to be done very carefully because they have a big impact on the descriptor generation we will see later.
Firstly, in order to find their location, we will have to work on the Laplacian of Gaussian.
6.1.3.1 Keypoint location
6.1.3.1.1 The laplacian of gaussian (LoG) operator
The Laplacian operator can be explained thanks to the following formula:
$$L(x,y) = \Delta I(x,y) = \nabla^2 I(x,y) = \frac{\partial^2 I(x,y)}{\partial x^2} + \frac{\partial^2 I(x,y)}{\partial y^2}$$
The Laplacian allows pixels with a rapid intensity change to be detected: they correspond to extrema of the Laplacian.
However, as it's a second derivative measurement, it's very sensitive to noise, so before applying the Laplacian, it's usual to blur the image in order to remove its high frequency noise. To do this, we use a Gaussian operator which blurs the image in 2D according to the parameter σ:
$$G(x,y,\sigma) = \frac{1}{2\pi\sigma^2}\, e^{-\frac{x^2+y^2}{2\sigma^2}}$$
We call the resulting operator (Gaussian followed by Laplacian) the Laplacian of Gaussian (LoG).
We can estimate the Laplacian of Gaussian:
$$\Delta G(x,y,\sigma) = \frac{\partial^2 G(x,y,\sigma)}{\partial x^2} + \frac{\partial^2 G(x,y,\sigma)}{\partial y^2}$$
$$\Delta G(x,y,\sigma) = \frac{x^2+y^2-2\sigma^2}{2\pi\sigma^6}\, e^{-\frac{x^2+y^2}{2\sigma^2}}$$
Actually, in the SIFT method, we don't calculate the Laplacian of Gaussian but focus on the difference of Gaussians (DoG), which has several advantages we will see in the next part.
6.1.3.1.2 The operator used in SIFT: the difference of Gaussians (DoG)
Using the heat diffusion equation, we have an estimation of the Laplacian of Gaussian:
$$\frac{\partial G(x,y,\sigma)}{\partial \sigma} = \sigma\, \nabla^2 G(x,y,\sigma) \qquad (1)$$
where $\frac{\partial G(x,y,\sigma)}{\partial \sigma}$ can be calculated thanks to a difference of Gaussians:
$$\frac{\partial G(x,y,\sigma)}{\partial \sigma} \approx \frac{G(x,y,k\sigma) - G(x,y,\sigma)}{k\sigma - \sigma} \qquad (2)$$
From equations (1) and (2), we deduce:
$$G(x,y,k\sigma) - G(x,y,\sigma) \approx (k-1)\,\sigma^2\, \nabla^2 G(x,y,\sigma) \qquad (3)$$
Using the difference of Gaussians instead of the Laplacian, we have three main advantages:
it reduces the processing time because the Laplacian requires two derivatives
it's less sensitive to noise than the Laplacian
it's already scale invariant (normalization by σ² shown in equation (3)). We can notice that the factor k−1 is constant in the SIFT method so it doesn't have an impact on the extrema location.
6.1.3.1.3 The resulting scale space
In order to generate these differences of Gaussians, we have to build a scale space. Instead of blurring the raw image many times, we also resize it several times in order to reduce the calculation time. Indeed, the more we blur, the more high frequency details we lose, and the better we can approximate the image by reducing its size.
Below, you can see the structure of the scale space for the Gaussians.
Several constants have to be specified:
the number of octaves (octavesNb) specifying the number of scales to work with
the number of intervals (intervalsNb) specifying the number of local extrema images to generate for each octave
Of course, the more octaves and intervals there are, the more keypoints are found but the longer the processing time.
In order to generate intervalsNb local extrema images, we have to generate intervalsNb+2 Differences of Gaussians (DoG) and intervalsNb+3 Gaussians.
Indeed, two Gaussians are subtracted to generate one DoG and three DoGs are needed to find local extrema.
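As an illustration of this structure, here is a minimal Python/OpenCV sketch of such a scale space (the octave and interval counts and the base σ are usual example values, not tied to my Matlab implementation):

    import cv2
    import numpy as np

    def build_dog_pyramid(gray, octavesNb=4, intervalsNb=3, sigma=1.6):
        # For each octave: intervalsNb+3 Gaussians, hence intervalsNb+2 DoGs.
        k = 2.0 ** (1.0 / intervalsNb)
        img = gray.astype(np.float32)
        pyramid = []
        for o in range(octavesNb):
            gaussians = [cv2.GaussianBlur(img, (0, 0), sigma * k ** i)
                         for i in range(intervalsNb + 3)]
            dogs = [g2 - g1 for g1, g2 in zip(gaussians, gaussians[1:])]
            pyramid.append(dogs)
            # Next octave: halve the size of the image.
            img = cv2.resize(gaussians[intervalsNb],
                             (img.shape[1] // 2, img.shape[0] // 2))
        return pyramid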
[Scale-space diagram: for each octave o, the image size is divided by 2^o and the scale runs from 2^o·2^(0/intervalsNb)·σ to 2^o·2^((intervalsNb+2)/intervalsNb)·σ; the intervalsNb+3 Gaussians give, by subtraction, intervalsNb+2 Differences of Gaussians, from which intervalsNb local extrema (MIN/MAX) images are obtained.]
Comment: prior smoothing
In order to increase the number of stable keypoints, a prior smoothing of the raw image can be done. David G. Lowe found experimentally that a Gaussian of σ = 1.6 could increase the number of stable keypoints by a factor of 4. In order to keep the highest frequencies, the size of the raw image is doubled. Thus, two Gaussians are applied:
one before doubling the size of the image, for anti-aliasing (σ = 0.5)
one after, for pre-blurring (σ = 1.0)
6.1.3.1.4 Keeping stable keypoints
Local extrema can be numerous at each scale, so we have to be selective in our choice.
There are several ways we can be more selective:
keep keypoints whose DoG has enough contrast
keep keypoints whose DoG is located on a corner
In order to find whether a keypoint is located on a corner, we use the Hessian matrix:
$$H(x,y) = \begin{pmatrix} \dfrac{\partial^2 I(x,y)}{\partial x^2} & \dfrac{\partial^2 I(x,y)}{\partial x \partial y} \\[2mm] \dfrac{\partial^2 I(x,y)}{\partial y \partial x} & \dfrac{\partial^2 I(x,y)}{\partial y^2} \end{pmatrix} = \begin{pmatrix} d_{xx} & d_{xy} \\ d_{xy} & d_{yy} \end{pmatrix}$$
Using the classical central difference approximation, we can estimate the different derivatives of the Hessian matrix:
$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h} = \lim_{h \to 0} \frac{f(x+h) - f(x-h)}{2h} \qquad (8)$$
Using (8) with h = 0.5, we deduce:
$$d_{x,h=0.5}(x,y) = I(x+0.5,\,y) - I(x-0.5,\,y)$$
So,
$$d_{xx} \approx d_{x,h=0.5}(x+0.5,\,y) - d_{x,h=0.5}(x-0.5,\,y) = I(x+1,\,y) + I(x-1,\,y) - 2\,I(x,y)$$
We have similarly:
$$d_{yy} \approx I(x,\,y+1) + I(x,\,y-1) - 2\,I(x,y)$$
And then, in order to calculate $d_{xy}$, we can use h = 1:
$$d_{x,h=1}(x,y) = \frac{I(x+1,\,y) - I(x-1,\,y)}{2} \quad\text{and}\quad d_{xy} \approx \frac{d_{x,h=1}(x,\,y+1) - d_{x,h=1}(x,\,y-1)}{2}$$
So,
$$d_{xy} \approx \frac{I(x+1,\,y+1) + I(x-1,\,y-1) - I(x+1,\,y-1) - I(x-1,\,y+1)}{4}$$
Afterwards, we can express the trace and determinant of the Hessian matrix thanks to its eigenvalues α and β:
$$tr(H) = d_{xx} + d_{yy} = \alpha + \beta \quad\text{and}\quad \det(H) = d_{xx}\,d_{yy} - d_{xy}^2 = \alpha\,\beta$$
Thus, we can estimate a curvature value C, supposing α = rβ where r ≥ 1:
$$C = \frac{tr(H)^2}{\det(H)} = \frac{(\alpha+\beta)^2}{\alpha\,\beta} = \frac{(r+1)^2}{r} = f(r)$$
The function f has a minimal value when r = 1, i.e. when the two eigenvalues are close. The closer they are, the more the corresponding keypoint is a corner; the more they differ, the more it is similar to an edge. Thus, we can be more or less selective depending on the value of r we choose to threshold C:
$$C \leq f(r_{threshold})$$
David G. Lowe used r = 10 in his experiments.
Moreover, in the case of flat zones, one or both of the eigenvalues are near zero, so these zones can be identified by det(H) ≈ 0.
Then, if det(H) < 0, the eigenvalues have opposite signs; in this case also, the point is discarded as not being an extremum.
To conclude, after this step, we know the position of the keypoints in their specific octave and the corresponding scale where they were found. The scale is calculated depending on the octave and interval numbers.
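A minimal NumPy sketch of this contrast and edge rejection test on a DoG image D (the contrast threshold is just an example value):

    import numpy as np

    def is_stable_keypoint(D, x, y, contrast_thr=0.03, r=10.0):
        """Keep a local extremum of the DoG image D only if it has enough
        contrast and does not lie on an edge (curvature test)."""
        if abs(D[y, x]) < contrast_thr:
            return False
        dxx = D[y, x + 1] + D[y, x - 1] - 2.0 * D[y, x]
        dyy = D[y + 1, x] + D[y - 1, x] - 2.0 * D[y, x]
        dxy = (D[y + 1, x + 1] + D[y - 1, x - 1]
               - D[y + 1, x - 1] - D[y - 1, x + 1]) / 4.0
        tr, det = dxx + dyy, dxx * dyy - dxy * dxy
        if det <= 0:                       # flat zone or eigenvalues of opposite sign
            return False
        return tr * tr / det <= (r + 1.0) ** 2 / r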
6.1.3.2 Keypoint orientation
After finding the location of the keypoints (position in the octave and scale), we have to find their orientation.
Finding the keypoint orientation can be explained in three steps:
gradient parameters (magnitude and orientation) are calculated around all the pixels of the Gaussians
feature windows around each keypoint are generated (one feature window for magnitude and another for orientation)
the orientation of each keypoint (centre of the feature windows) is estimated by averaging the orientations within the corresponding feature window
6.1.3.2.1 Calculation of the gradient parameters within the Gaussians
As mentioned previously, the gradient parameters refer to the magnitude and orientation. Gradients allow contrast changes on the image to be classified:
the magnitude shows how big the change is
the orientation refers to the direction of the main change
Thus, a gradient can be represented as a vector whose position, magnitude and orientation are specified.
Comment: gradients on the image are oriented from black to white.
In order to calculate the gradient parameters, we only have to calculate the first derivative at each pixel of the image. Using equation (8) shown before, we deduce with h = 1:
$$d_x = \frac{\partial I(x,y)}{\partial x} \approx \frac{I(x+1,\,y) - I(x-1,\,y)}{2} \quad\text{and}\quad d_y = \frac{\partial I(x,y)}{\partial y} \approx \frac{I(x,\,y+1) - I(x,\,y-1)}{2}$$
Knowing $d_x$ and $d_y$, we deduce the magnitude M and the orientation θ:
$$M = \sqrt{d_x^2 + d_y^2} \quad\text{and}\quad \theta = \arctan\!\left(\frac{d_y}{d_x}\right)$$
6.1.3.2.2 Generation of keypoint feature windows
Now, we will speak about the way we create the feature window.
Two parameters have to be studied:
the Gaussian to apply to the feature windows
the size of the feature windows
Why apply a Gaussian?
It allows a better weighted average of the orientation of the keypoint. Indeed, the keypoint (located at the centre of the feature window) has to have the largest weight whereas the pixels located near the limits of the feature window have to have the smallest weight. Then, instead of using the magnitude directly, we use a weight W:
$$\text{weight} = G \cdot M$$
with G being the Gaussian mask and M the magnitudes of the entire image.
How to choose the σ of the Gaussian and the size of the feature window?
The σ to choose depends on the scale we are at. Indeed, the more blurred the image is (i.e. the higher the scale, or the higher the octave and interval numbers), the fewer details we have on the image, and the more pixels we need to estimate the average orientation.
David G. Lowe uses:
$$\sigma_{kpFeatureWindow} = 1.5\,\sigma$$
where σ is the absolute sigma of the corresponding image studied (depending on the scale).
Dealing with the size of the feature window, it's chosen depending on $\sigma_{kpFeatureWindow}$. Indeed, the Gaussian can be considered negligible beyond a certain distance from the centre of the feature window. Thus, the size can be approximated:
$$size_{kpFeatureWindow} = 3\,\sigma_{kpFeatureWindow}$$
6.1.3.2.3 Averaging the orientations within each feature window
From the feature windows we have generated, we can now estimate an average of the orientation of each keypoint.
This is done thanks to a histogram of 36 bins.
Each bin covers a range of 10 degrees of orientation so that the entire histogram covers orientations from 0 to 360 degrees.
If an orientation matches a specific bin, its weight is added to the corresponding bin.
The maximum of the histogram will refer to the average of the keypoint orientation.
Comment: accuracy
As we work with a histogram, we must be accurate and we can't accept an error of 10 degrees, which is the length of a bin.
Thus, in order to approximate the real maximum, we use the left and right neighbouring bins of the matched one in order to calculate the parabola which goes through these three points. Thanks to the parabolic equation, it's then easy to calculate the maximum.
The accuracy can also be improved by making the bin filling more accurate. Indeed, as we have a real orientation at the beginning, it's always located between two bin centres. Thus, we have to compare this value with the middle of each neighbouring bin and spread the weight accordingly.
[Histogram diagram: weight as a function of orientation θ (36 bins of 10 degrees from 0 to 360); the orientation bin(θ) falls between the centres m_{n-1} and m_n of bins n-1 and n.]
As we can see in the diagram above, the weights we have to add to bins n−1 and n can be calculated easily:
$$\mathrm{weight}_n \leftarrow \mathrm{weight}_n + \frac{\mathrm{bin}(\theta) - m_{n-1}}{m_n - m_{n-1}}\; W(\theta)$$
$$\mathrm{weight}_{n-1} \leftarrow \mathrm{weight}_{n-1} + \frac{m_n - \mathrm{bin}(\theta)}{m_n - m_{n-1}}\; W(\theta)$$
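A minimal NumPy sketch of this orientation assignment (36-bin histogram with linear spreading between the two nearest bin centres; the window extraction and Gaussian weighting are assumed already done, giving arrays of orientations and weights):

    import numpy as np

    def dominant_orientation(orientations_deg, weights, bins=36):
        """Fill a 36-bin histogram, spreading each weight between the two
        nearest bin centres, and return the dominant orientation in degrees."""
        bin_width = 360.0 / bins
        hist = np.zeros(bins)
        for theta, w in zip(orientations_deg, weights):
            pos = (theta / bin_width) - 0.5          # position relative to bin centres
            left = int(np.floor(pos)) % bins
            right = (left + 1) % bins
            frac = pos - np.floor(pos)
            hist[left] += (1.0 - frac) * w
            hist[right] += frac * w
        return (np.argmax(hist) + 0.5) * bin_width   # centre of the strongest bin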
6.1.4. Descriptor generation
This second process is divided into several steps:
calculate the interpolated gradient parameters of each Gaussian (interpolated magnitudes and orientations)
generate the descriptor feature windows
generate the descriptor of each keypoint
6.1.4.1 Calculation of the interpolated gradient parameters within the Gaussians
Why calculate interpolated gradient parameters?
Contrary to the previous case with the keypoints, in order to generate the descriptors, we will consider the keypoint as a subpixel point. The pixel located at its position doesn't interest us, but its neighbours do.
Thus, we have to work on the subpixels of the image instead of working on the real pixels.
There are several ways to do so; we could either:
first generate each interpolated Gaussian and then calculate the gradient parameters
or directly estimate the interpolated gradient parameters from the Gaussians
We chose the second method.
The calculation of the interpolated gradients is similar to the calculation used for the non-interpolated ones (keypoints). The x gradient dx and the y gradient dy are determined using the same formulas except that now we have to use averages:
that now we will have to use averages :
I
inter
( x+1, y)=I ( x+0.5, y)
I ( x , y)+I ( x+1, y)
2
I
inter
( x1, y)=I ( x1.5, y)
I ( x2, y)+I ( x1, y)
2
I
inter
( x , y+1)=I ( x, y+0.5)
I ( x , y)+I ( x , y+1)
2
I
inter
( x , y1)=I ( x, y1.5)
I ( x , y2)+I ( x , y1)
2
Thus, we deduce dx and dy for the interpolated case :
dx
inter
I
inter
( x+1, y)I
inter
( x1, y)
I ( x+1, y)+I (x , y)I (x1, y)I ( x2, y)
2
dy
inter
I
inter
( x, y+1)I
inter
( x, y1)
I ( x, y+1)+I ( x , y)I ( x , y1)I ( x , y2)
2
The calculation of the magnitude and orientation is exactly the same as previously with the keypoints.
6.1.4.2 Generation of the descriptor feature windows
As we said before, we are interested in the subpoints around the keypoint. In order to have a symmetric window, the size of the feature window should be even.
David G. Lowe used:
$$size_{descFeatureWindow} = 16$$
This value is convenient for the generation of the subwindows we will see later. Indeed, this number is easily divisible by 4.
As for the keypoint case, we need a Gaussian in order to calculate the weights within the descriptor feature window. This time, the size of the window we choose can help us to find a good Gaussian. David G. Lowe used:
$$\sigma_{descFeatureWindow} = \frac{size_{descFeatureWindow}}{2}$$
6.1.4.3 Generation of the descriptor components for each keypoint
As we have generated the feature windows for each of our keypoints, we can now focus on the most important part: the descriptor generation.
This process can be divided into several steps:
First of all, we have to rotate the feature window by the orientation of the corresponding keypoint (invariance in rotation).
Then, we divide our feature window into 16 subwindows (of the same size).
For each subwindow, we fill an 8-bin histogram using the relative orientation and the magnitude of the (interpolated) gradients.
We threshold (invariance in brightness) and normalize (invariance in contrast) the vector (given by the histograms).
We will focus on the filling of the 8-bin histograms.
First of all, we have to notice that, in order to have invariance in rotation, we have to fill the histograms using the interpolated orientations minus the keypoint orientation:
$$\theta_{diff} = \theta_{inter} - \theta$$
Apart from this fact, the filling of the histograms uses the interpolated weight as in the keypoint case; however, we could also subtract the keypoint weight from the interpolated weight:
$$\mathrm{weight}_{diff} = \mathrm{weight}_{inter} - M$$
We can be more accurate by spreading the weight between two adjacent bins as in the keypoint case.
[Diagram: the feature window is rotated by the keypoint orientation, split into 16 subwindows, one 8-bin histogram is filled for each subwindow, and the resulting 128 (= 16x8) descriptor components are thresholded and normalized.]
6.1.5. Conclusion and drawbacks
To conclude, SIFT is divided into two steps:
keypoint extraction: the location is found using a scale space of images and by calculating differences of Gaussians. The orientation is found using a feature window of the gradients around the keypoint: a histogram of 36 bins gives the value.
descriptor generation: descriptors are generated using a feature window of the interpolated gradients around the keypoint. This feature window is rotated by the orientation of the keypoint and divided into 16 subwindows: for each subwindow, a histogram of 8 bins is filled using the relative orientations to generate 8 values. Then, the final vector (128 values) is thresholded and normalized.
SIFT has one main drawback which is its speed: it's very slow mainly because it needs to blur the image several times.
Besides this problem, the extracted keypoints are not invariant to all 3D rotations: indeed, they are above all invariant to rotations within the plane of the image and less to the others.
In order to improve the speed of SIFT for use in real-time applications, Herbert Bay et al. presented Speeded Up Robust Features in 2006.
Contrary to SIFT, SURF works on the integral of the image, on a scale space of filters and on Haar wavelets for the orientation assignment and descriptor generation, which makes it faster.
In the following, we will see in more detail how SURF works.
6.2 SURF: another method adapted for real-time applications
6.2.1. Keypoint extraction
As for SIFT, keypoint extraction consists of two steps:
find the location of the keypoints
find their main orientation
6.2.1.1 Keypoint location
6.2.1.1.1 LoG approximation: Fast-Hessian
In order to calculate the Laplacian of Gaussian (LoG) within the image, SURF uses a Fast-Hessian calculation.
Fast-Hessian focuses on the Hessian matrix generalised to Gaussian operators:
$$H(x,\sigma) = \begin{pmatrix} L_{xx}(x,\sigma) & L_{xy}(x,\sigma) \\ L_{xy}(x,\sigma) & L_{yy}(x,\sigma) \end{pmatrix}$$
where $L_{xx}$ is the convolution of the Gaussian second order derivative $\frac{\partial^2}{\partial x^2}$ with the image I at point x, and similarly for $L_{xy}$ and $L_{yy}$.
In order to extract keypoints invariant in scale, we study the determinant of the Hessian matrix:
$$\det(H) = L_{xx}\,L_{yy} - L_{xy}^2$$
As the determinant is the product of the eigenvalues, we can conclude that:
if it's negative, the eigenvalues have opposite signs and the point doesn't correspond to a local extremum.
if it's positive, both eigenvalues have the same sign and the point is a local minimum or maximum.
In order to calculate the values of the Hessian matrices (known as the Laplacian of Gaussian), SURF doesn't use the Difference of Gaussians as SIFT does, but a mask as an approximation of the Gaussian second derivative.
In the figure below, you can see:
on the first row, the real Gaussian second order derivatives (from left to right: $L_{xx}$, $L_{yy}$ and $L_{xy}$).
on the second row, the simplified Gaussian second order derivatives used in SURF (from left to right: $D_{xx}$, $D_{yy}$ and $D_{xy}$).
The simplified mask improves the speed while keeping enough accuracy. Using this mask, the determinant has to be adjusted:
$$\det(H_{approx}) = D_{xx}\,D_{yy} - (0.9\,D_{xy})^2 \quad\text{where}\quad H_{approx}(x,\sigma) = \begin{pmatrix} D_{xx}(x,\sigma) & D_{xy}(x,\sigma) \\ D_{xy}(x,\sigma) & D_{yy}(x,\sigma) \end{pmatrix}$$
6.2.1.1.2 Work on an integral image
SURF uses the simplified Gaussian second order derivative D and applies it not directly to the image but to the integral image, which reduces the number of calculations.
Indeed, working with an integral image makes the calculation independent of the size of the area studied. The integral image $I_{\Sigma}$ is calculated as the sum of the values between the point and the origin:
$$I_{\Sigma}(x,y) = \sum_{i=0}^{i<x} \sum_{j=0}^{j<y} I(i,j)$$
Working with an integral image, we need only four values to calculate the sum over whatever rectangular area within the original image:
$$\Sigma = A + D - (B + C)$$
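A small NumPy sketch of this box-sum trick (the corner convention follows the formula above; exact index conventions vary between implementations):

    import numpy as np

    def integral_image(I):
        # Cumulative sum along both axes, padded with a row/column of zeros.
        S = np.zeros((I.shape[0] + 1, I.shape[1] + 1), dtype=np.float64)
        S[1:, 1:] = np.cumsum(np.cumsum(I, axis=0), axis=1)
        return S

    def box_sum(S, x0, y0, x1, y1):
        """Sum of I over the rectangle [y0, y1) x [x0, x1) using four values:
        A (top-left), B (top-right), C (bottom-left), D (bottom-right)."""
        A, B, C, D = S[y0, x0], S[y0, x1], S[y1, x0], S[y1, x1]
        return A + D - (B + C)

    I = np.arange(16, dtype=np.float64).reshape(4, 4)
    S = integral_image(I)
    print(box_sum(S, 1, 1, 3, 3), I[1:3, 1:3].sum())   # both give the same value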
6.2.1.1.3 Scale space
As for SIFT, SURF has to use a scale space in order to apply its Fast-Hessian operation.
As we saw previously, SIFT blurs the image within an octave and reduces the size of the image to go to the next octave: this method is time consuming!
SURF doesn't modify the image but the mask it uses (the simplified Gaussian second order derivative D). Indeed, as it works on an integral image, the size of the mask doesn't change the number of calculations, as it always requires four values per area.
Thus, in SURF, in order to build a scale space over several octaves, it's necessary to increase the size of the mask D: each octave corresponds to a doubling of scale.
The minimum size of D is determined using a real Gaussian of σ = 1.2. The smallest D is represented in the figure above. For example, $D_{xx}$ has its three lobes of width three pixels and of height five pixels.
In order to calculate the approximate scale $\sigma_{approx}$, we can use a proportional relation between the real operator L and the approximated one D:
$$\sigma_{approx} = size_{mask}\,\frac{\sigma_{mask\,base}}{size_{mask\,base}} = size_{mask}\,\frac{1.2}{9}$$
In order to increase the scale of D, the minimum step is to add six pixels on the longer side (as it's necessary to have a pixel at the centre and three lobes of identical size) and four on the smaller side (to keep the structure).
6.2.1.1.4 Thresholding and non-maximal suppression
Similarly to SIFT, after applying the Fast-Hessian with a specific mask D, we reject pixels without enough intensity.
Then, it's necessary to compare each remaining pixel with its neighbours (eight at the same scale and nine in both the previous and the next scales).
6.2.1.2 Keypoint orientation: Haar wavelets
6.2.1.2.1 Haar wavelets
Haar wavelets are simple filters which can be used to find gradients in the x and y directions.
In the diagram below, you can see, from left to right: a Haar wavelet filter for the x-direction gradient and another one for the y-direction gradient.
The white side has a weight of 1 whereas the black one has a weight of -1. Working with integral images is interesting because it needs only 6 operations to convolve a Haar wavelet with the image.
6.2.1.2.2 Calculation of the orientation
To determine the orientation, Haar wavelet responses of size 4σ are calculated for a set of pixels within a radius of 6σ around the detected point (σ refers to the scale of the corresponding keypoint). The set of pixels is sampled with a step of σ.
As for SIFT, there is a weighting by a Gaussian centred at the keypoint; it is chosen to have $\sigma_{Gaussian} = 2.5\,\sigma$.
In order to know the main orientation, the x and y responses of the Haar wavelets are combined into a vector response. The dominant orientation is selected by rotating a circle segment covering an angle of π/3 around the origin. At each angle of the circle, the vector responses are summed and form a new vector: at the end of the rotation, the longest vector gives the main orientation of the keypoint.
6.2.2. Descriptor generation: Haar wavelets
As it did to calculate the main orientation of the keypoints, SURF again uses Haar wavelets to generate the descriptor.
The descriptor is calculated within a 20σ window which is rotated by the main orientation in order to make it invariant to rotation. The descriptor window is divided into 4 by 4 subregions where Haar wavelet responses of size 2σ are calculated. As for the main orientation calculation, the set of pixels used for the calculation is sampled with a step of σ.
For each subregion, a vector of four values is estimated:
$$V_{subregion} = \left( \sum d_x,\; \sum d_y,\; \sum |d_x|,\; \sum |d_y| \right)$$
where dx and dy are respectively the x-response and y-response of each set of pixels within the subwindow.
Thus, we conclude that SURF has a descriptor of 4 x 4 x 4 = 64 values, which is less than SIFT.
As for SIFT, in order to make the descriptor invariant to brightness and contrast, it's necessary to threshold it and normalize it to unit length.
[Diagram: the feature window is rotated by the keypoint orientation, split into 16 subwindows, the 4-value vector is calculated for each subwindow, and the resulting 64 (= 16x4) descriptor components are thresholded and normalized.]
6.2.3. Conclusion
To conclude, like SIFT, SURF is divided into two steps:
keypoint extraction: the location is found using a scale space of masks and by calculating the determinant of the Hessian matrix at each point. The orientation is generated using Haar wavelets for the calculation of the gradients within feature windows around the keypoints: the value is calculated by estimating average gradient vectors.
descriptor generation: descriptors are generated using a feature window of gradients (also given by Haar wavelets). This feature window is rotated by the orientation of the keypoint and divided into 16 subwindows: 4 values are calculated for each to build the vector. The final vector (64 values) is then thresholded and normalized.
The fact that SURF works on integral images and uses simplified masks reduces the computation and makes the algorithm quicker to execute.
6.3 Mono-calibration
Why mono-calibrate?
The goal of the mono-calibration is to know the extrinsic and intrinsic parameters of the camera:
knowing the distortion coefficients, it will be possible to undistort the image
knowing the focal length, we will be able to measure dimensions in units of distance within the image
knowing the extrinsic parameters, it will be possible to rectify images from a stereovision system (stereo-calibration)
What are the different methods of calibration?
We can notice that there are several ways to calibrate a camera:
photogrammetric calibration
self-calibration
Photogrammetric calibration needs to know the exact geometry of the object in 3D space. Thus, this method isn't flexible in an unknown environment and, moreover, it's difficult to implement because of the need for precise 3D objects.
Self-calibration doesn't need any measurements and produces accurate results. It's this method that was chosen, calibrating with a chessboard which has simple repetitive patterns.
Chessboard corner detection
A simple way to calibrate is to use a chessboard; this allows the position of the corners on the image to be known accurately.
There are two main steps in this process:
the generation of the 3D corner points and of the 2D corner points
the two-step calibration
Dealing with the two-step calibration, I chose to use the Tsai method which is a simple way to find the extrinsic and intrinsic parameters.
6.3.1. Generation of the 3D corner points and of the 2D corner points
Firstly, it's necessary to generate the 3D corner points and the 2D corner points within the image in order to use them for the two-step calibration.
6.3.1.1 3D corner points
The idea is to choose the chessboard as reference. This way, it's easy to generate the 3D corner points as functions of the world frame.
As you can see on the diagram below, the chessboard is assumed fixed and the camera is assumed to be moving, whereas in reality the camera is fixed and the user moves the chessboard. Thus, the 3D points only need to be generated once, depending on the dimensions of the rectangles on the chessboard (dx, dy). We will see later that only the relation between dx and dy (the rectangle ratio) will interest us: if we work with squares, the ratio will be 1 and the calibration will truly be a self-calibration.
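For illustration, a minimal sketch of this 3D corner point generation (the 7×8 inner-corner pattern and the cell sizes are arbitrary example values; with square cells the ratio dx/dy is 1):

```python
import numpy as np

rows, cols = 7, 8      # inner corners of the chessboard (example values)
dx = dy = 1.0          # cell dimensions; only the ratio dx/dy matters here

# One (X, Y, 0) point per corner, expressed in the chessboard (world) frame:
# the board defines the Z = 0 plane, so every corner has Z = 0.
object_points = np.zeros((rows * cols, 3), np.float32)
object_points[:, :2] = np.mgrid[0:cols, 0:rows].T.reshape(-1, 2) * (dx, dy)
```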
6.3.1.2 2D corner points
Dealing with the 2D corner points on the image, we can detect them in two ways:
- manual selection
- automatic detection
[Figure: the chessboard defines the world frame ($O_w$, $X_w$, $Y_w$, $Z_w$) with cell dimensions dx and dy; two camera poses ($O_{cam1}$ and $O_{cam2}$, each with axes $X_{cam}$, $Y_{cam}$, $Z_{cam}$) observe it.]
With manual selection, the user has to select each of the corners on the image. This method can be very laborious if the chessboard is big or if the user wants to register many positions of the chessboard. Moreover, this method is not accurate and often needs to be checked by the software. The software can, for example, use the first points given by the user to generate the others using pattern recurrence; the user can then check whether the result is valid.
With automatic detection, a corner detection technique is used. I will not go into details in this report, but an interesting method is to use the SUSAN operator combined with a ring-shaped operator.
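As an alternative to the SUSAN-based detector mentioned above, OpenCV ships a ready-made automatic chessboard detector; a minimal sketch, assuming a 7×8 inner-corner board and a hypothetical image file name:

```python
import cv2

gray = cv2.imread("chessboard_view.png", cv2.IMREAD_GRAYSCALE)

# Coarse detection of the inner corners of a 7x8 chessboard.
found, corners = cv2.findChessboardCorners(gray, (8, 7))

if found:
    # Refine each corner to sub-pixel accuracy in an 11x11 search window.
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 0.001)
    corners = cv2.cornerSubPix(gray, corners, (11, 11), (-1, -1), criteria)
```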
6.3.2. Two-step calibration
Using the Tsai method, we will use two least-squares solutions in order to find the projection matrix, the extrinsic parameters and the distortions.
We start from the equation presented in the previous part (describing the parameters of the camera); this relation links a 2D pixel in the image with a 3D point (in distance units) in the world:
$$p_{img} = s \begin{pmatrix} f_x & f_x o_c & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} dr & 0 & dt_x \\ 0 & dr & dt_y \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \end{pmatrix} p_{obj/world} = M\, p_{obj/world}$$
We will have to separate this into two problems:
- find the projection matrix and the extrinsic parameters
- find the distortions
6.3.2.1 First least-squares solution (projection matrix and extrinsic parameters)
Firstly, we neglect the distortions in order to simplify the equation:
$$p_{img} = s \begin{pmatrix} f_x & f_x o_c & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \end{pmatrix} p_{obj/world}$$
$$p_{img} = s\, M\, p_{obj/world} \qquad (1)$$
where
$$p_{img} = \begin{pmatrix} x_{img} \\ y_{img} \\ 1 \end{pmatrix}, \qquad M = \begin{pmatrix} m_{11} & m_{12} & m_{13} & m_{14} \\ m_{21} & m_{22} & m_{23} & m_{24} \\ m_{31} & m_{32} & m_{33} & m_{34} \end{pmatrix} \qquad \text{and} \qquad p_{obj/world} = \begin{pmatrix} X_{obj/world} \\ Y_{obj/world} \\ Z_{obj/world} \\ 1 \end{pmatrix}$$
The scale s can be easily deduced from M:
$$s = \frac{1}{m_{31} X_{obj/world} + m_{32} Y_{obj/world} + m_{33} Z_{obj/world} + m_{34}}$$
Thus, equation (1) can also be expressed as:
$$(m_{31} X_{obj/world} + m_{32} Y_{obj/world} + m_{33} Z_{obj/world} + m_{34}) \begin{pmatrix} x_{img} \\ y_{img} \\ 1 \end{pmatrix} = \begin{pmatrix} m_{11} & m_{12} & m_{13} & m_{14} \\ m_{21} & m_{22} & m_{23} & m_{24} \\ m_{31} & m_{32} & m_{33} & m_{34} \end{pmatrix} \begin{pmatrix} X_{obj/world} \\ Y_{obj/world} \\ Z_{obj/world} \\ 1 \end{pmatrix}$$
$$m_{11} X_{obj/world} + m_{12} Y_{obj/world} + m_{13} Z_{obj/world} + m_{14} - m_{31} X_{obj/world}\, x_{img} - m_{32} Y_{obj/world}\, x_{img} - m_{33} Z_{obj/world}\, x_{img} = m_{34}\, x_{img} \qquad (2)$$
$$m_{21} X_{obj/world} + m_{22} Y_{obj/world} + m_{23} Z_{obj/world} + m_{24} - m_{31} X_{obj/world}\, y_{img} - m_{32} Y_{obj/world}\, y_{img} - m_{33} Z_{obj/world}\, y_{img} = m_{34}\, y_{img} \qquad (3)$$
We can express the two relations from above using a matrix A and a vector m:
$$A\, m = \begin{pmatrix} m_{34}\, x_{img} \\ m_{34}\, y_{img} \end{pmatrix} = p^{0}_{img}$$
where:
$$A = \begin{pmatrix} X_{obj/world} & Y_{obj/world} & Z_{obj/world} & 1 & 0 & 0 & 0 & 0 & -X_{obj/world}\, x_{img} & -Y_{obj/world}\, x_{img} & -Z_{obj/world}\, x_{img} \\ 0 & 0 & 0 & 0 & X_{obj/world} & Y_{obj/world} & Z_{obj/world} & 1 & -X_{obj/world}\, y_{img} & -Y_{obj/world}\, y_{img} & -Z_{obj/world}\, y_{img} \end{pmatrix}$$
$$m = \begin{pmatrix} m_{11} & m_{12} & m_{13} & m_{14} & m_{21} & m_{22} & m_{23} & m_{24} & m_{31} & m_{32} & m_{33} \end{pmatrix}^{T}$$
Thus, we can calculate m using the least squares method (pseudo inverse calculation):
$$m = \left( (A^{T} A)^{-1} A^{T} \right) p^{0}_{img}$$
At the beginning, we have to suppose $m_{34} = 1$, but we will calculate its real value later. When we modify $m_{34}$, it will be necessary to multiply all the other coefficients accordingly, as they are directly proportional to it.
Thanks to this first solution, we can get an idea of the coefficients of M, but in order to calculate a correct least-squares solution, we have to know at least 6 corner features (corresponding 2D and 3D corner points); the more corner features are recorded, the better the approximation of M is.
From a global viewpoint, if we consider K as the number of chessboard views to process and N the number of corner points (pairs of 2D and 3D corner points), we have:
- 2NK constraints (2 because of the 2D coordinates)
- 6K unknown extrinsic parameters (3 for rotation and 3 for translation)
- 5 unknown projection parameters ($f_x$, $f_y$, $o_c$, $c_x$ and $c_y$)
Thus, in order to find a solution, we must have:
$$2NK \geq 6K + 5$$
So, if we choose to work with only 1 chessboard view, we need at least 6 corner points, as shown before, whereas with 2 chessboard views we only need 5 corners per view.
However, in reality, one chessboard plane offers only four corner points' worth of information whatever the number of corner points chosen; this is why we have to work with at least three chessboard views.
In practice, because of noise and numerical stability, we have to take more points and more chessboard views in order to get more accuracy: 10 views of a 7-by-8 or larger chessboard is a good choice.
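A minimal NumPy sketch of this first least-squares step, stacking the two rows of A per corner with $m_{34}$ provisionally set to 1 (the arrays `pts3d` and `pts2d` are assumed to hold matched 3D and 2D corner points):

```python
import numpy as np

def estimate_projection(pts3d, pts2d):
    """First least-squares step: estimate the 3x4 matrix M with m34 set to 1."""
    rows, rhs = [], []
    for (X, Y, Z), (x, y) in zip(pts3d, pts2d):
        rows.append([X, Y, Z, 1, 0, 0, 0, 0, -X * x, -Y * x, -Z * x])
        rows.append([0, 0, 0, 0, X, Y, Z, 1, -X * y, -Y * y, -Z * y])
        rhs += [x, y]                       # m34 * x_img and m34 * y_img with m34 = 1
    m, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return np.append(m, 1.0).reshape(3, 4)  # append m34 = 1 and reshape into M
```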
The next step is to split the matrix M into two matrices: projection matrix and extrinsic matrix.
A simple way is to split the matrix M into $M_r$ and $M_t$ in order to use the properties of orthogonal matrices:
$$M = ( M_r \mid M_t ) = ( P_{cam-px} R_{cam} \mid P_{cam-px} R_{cam} T_{cam} )$$
As $R_{cam}$ is a rotation matrix, $R_{cam} R_{cam}^{T} = I$.
Thus, we deduce:
$$M_r M_r^{T} = P_{cam-px} R_{cam} ( P_{cam-px} R_{cam} )^{T} = P_{cam-px} R_{cam} R_{cam}^{T} P_{cam-px}^{T} = P_{cam-px} P_{cam-px}^{T}$$
By calculating the coefficients of $P_{cam-px} P_{cam-px}^{T}$, it's possible to estimate the projection parameters:
$$P_{cam-px} P_{cam-px}^{T} = \begin{pmatrix} (1+o_c^2) f_x^2 + c_x^2 & o_c f_x f_y + c_x c_y & c_x \\ o_c f_x f_y + c_x c_y & f_y^2 + c_y^2 & c_y \\ c_x & c_y & 1 \end{pmatrix}$$
We can notice a condition which allows us to find the real value of $m_{34}$:
$$P_{cam-px} P_{cam-px}^{T}(3,3) = M_r M_r^{T}(3,3) = 1 \;\Rightarrow\; m_{34} = \sqrt{\frac{1}{M_r M_r^{T}(3,3)}}$$
Thus, we have to normalize the matrix M by multiplying it by the coefficient $m_{34}$.
After the normalization, we can solve the system of equations:
$$c_x = M_r M_r^{T}(1,3), \qquad c_y = M_r M_r^{T}(2,3)$$
$$f_y = \sqrt{M_r M_r^{T}(2,2) - c_y^2}, \qquad f_x = \sqrt{M_r M_r^{T}(1,1) - c_x^2 - k^2}$$
$$o_c = \frac{k}{f_x} \qquad \text{with} \qquad k = \frac{M_r M_r^{T}(1,2) - c_x c_y}{f_y}$$
When the projection matrix is known, it's easy to deduce the external matrix (rotation and
translation matrices of the camera):
$$R_{cam} = P_{cam-px}^{-1} M_r \qquad \text{and} \qquad T_{cam} = R_{cam}^{T} P_{cam-px}^{-1} M_t$$
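A small sketch of this decomposition, transcribing the formulas above (it assumes M comes from the first least-squares step; this is an illustration, not the project's actual code):

```python
import numpy as np

def decompose_projection(M):
    """Split the estimated M into projection matrix, rotation and translation."""
    m34 = 1.0 / np.sqrt((M[:, :3] @ M[:, :3].T)[2, 2])  # real value of m34
    M = M * m34                                         # normalize M
    Mr, Mt = M[:, :3], M[:, 3]
    Q = Mr @ Mr.T                                       # equals P_campx P_campx^T
    cx, cy = Q[0, 2], Q[1, 2]
    fy = np.sqrt(Q[1, 1] - cy ** 2)
    k = (Q[0, 1] - cx * cy) / fy                        # k = oc * fx
    fx = np.sqrt(Q[0, 0] - cx ** 2 - k ** 2)
    oc = k / fx
    P_campx = np.array([[fx, fx * oc, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]])
    R_cam = np.linalg.inv(P_campx) @ Mr
    T_cam = R_cam.T @ np.linalg.inv(P_campx) @ Mt
    return P_campx, R_cam, T_cam
```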
6.3.2.2 Second least-squares solution (distortions)
Thanks to the matrix M previously calculated, we can estimate the ideal 2D corner points $p^{i}_{img} = ( x^{i}_{img}, y^{i}_{img} )$ using equations (2) and (3):
$$x^{i}_{img} = \frac{m_{11} X_{obj/world} + m_{12} Y_{obj/world} + m_{13} Z_{obj/world} + m_{14}}{m_{31} X_{obj/world} + m_{32} Y_{obj/world} + m_{33} Z_{obj/world} + m_{34}}$$
$$y^{i}_{img} = \frac{m_{21} X_{obj/world} + m_{22} Y_{obj/world} + m_{23} Z_{obj/world} + m_{24}}{m_{31} X_{obj/world} + m_{32} Y_{obj/world} + m_{33} Z_{obj/world} + m_{34}}$$
Here, we have to consider the normalized points before the projection, $p_{img0}$ and $p^{i}_{img0}$:
$$p_{img0} = \frac{P_{cam-px}^{-1}}{m_{34}}\, p_{img} \qquad \text{and} \qquad p^{i}_{img0} = \frac{P_{cam-px}^{-1}}{m_{34}}\, p^{i}_{img}$$
Then, using the deformation equation:
$$p_{img0} = dr\, p^{i}_{img0} + \begin{pmatrix} dt_x \\ dt_y \\ 1 \end{pmatrix}$$
$$\Delta p_{img0} = ( p_{img0} - p^{i}_{img0} ) = ( dr_1 r^2 + dr_2 r^4 + dr_3 r^6 + \dots + dr_n r^{2n} )\, p^{i}_{img0} + \begin{pmatrix} dt_x \\ dt_y \\ 1 \end{pmatrix}$$
with
$$r = \sqrt{ (x^{i}_{img0})^2 + (y^{i}_{img0})^2 }$$
$$dt_x = 2\, dt_1\, x^{i}_{img0} y^{i}_{img0} + dt_2 \left( r^2 + 2 (x^{i}_{img0})^2 \right), \qquad dt_y = dt_1 \left( r^2 + 2 (y^{i}_{img0})^2 \right) + 2\, dt_2\, x^{i}_{img0} y^{i}_{img0}$$
In order to calculate $k_c = ( dr_1, dr_2, \dots, dr_n, dt_1, dt_2 )^{T}$, we have to express the equation above using a matrix $A_d$; the principle is the same as previously with A:
$$\Delta p_{img0} = A_d\, k_c$$
where
$$A_d = \begin{pmatrix} r^2 x^{i}_{img0} & r^4 x^{i}_{img0} & \dots & r^{2n} x^{i}_{img0} & 2\, x^{i}_{img0} y^{i}_{img0} & r^2 + 2 (x^{i}_{img0})^2 \\ r^2 y^{i}_{img0} & r^4 y^{i}_{img0} & \dots & r^{2n} y^{i}_{img0} & r^2 + 2 (y^{i}_{img0})^2 & 2\, x^{i}_{img0} y^{i}_{img0} \end{pmatrix}$$
As before, the least-squares solution of this equation is as follows:
$$k_c = \left( (A_d^{T} A_d)^{-1} A_d^{T} \right) \Delta p_{img0}$$
The minimal number of points required depends on the maximal degree we want for the radial distortion. Stopping at the sixth degree offers enough precision (three radial coefficients); so, in this case, we need to work with at least three 2D corner points. As before, the more points we have, the better the least-squares solution is.
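In practice, rather than hand-coding the two least-squares passes, the whole mono-calibration can be delegated to OpenCV, which returns the camera matrix, the distortion coefficients and one extrinsic pose per view; a minimal sketch (the lists `object_points_list` and `image_points_list`, gathering the corner points of every chessboard view, and `image_size` are assumed to be prepared as in the earlier examples):

```python
import cv2

# object_points_list: one array of 3D chessboard corners per view (all identical)
# image_points_list:  one array of detected 2D corners per view
rms, camera_matrix, dist_coeffs, rvecs, tvecs = cv2.calibrateCamera(
    object_points_list, image_points_list, image_size, None, None)

print("RMS reprojection error:", rms)
print("Projection matrix:\n", camera_matrix)
print("Distortion coefficients (k1, k2, p1, p2, k3):", dist_coeffs.ravel())
```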
6.3.3. A main use of mono-calibration: undistortion
As we said previously, a main use of mono-calibration is to undistort the images. To do so, we use the distortion matrix $M_d$ and the projection matrix $P_{cam-px}$ in order to calculate the ideal pixel from the distorted one.
If we consider $p^{i}_{img}$ as the ideal pixel and $p^{d}_{img}$ as the distorted one, we have:
$$p^{i}_{img} = s\, P_{cam-px}\, p_{obj/cam} \qquad \text{and} \qquad p^{d}_{img} = s\, P_{cam-px}\, M_d\, p_{obj/cam}$$
The idea is to calculate the distorted pixel from the ideal one:
$$p^{d}_{img} = s\, P_{cam-px}\, M_d\, \frac{1}{s}\, P_{cam-px}^{-1}\, p^{i}_{img} = P_{cam-px}\, M_d\, P_{cam-px}^{-1}\, p^{i}_{img}$$
We can't calculate the ideal pixel directly, because the distortion matrix $M_d$ is expressed in terms of the ideal pixels. Because of this problem, we will need to use image mapping:
$$x^{d}_{img} = map_x ( x^{i}_{img}, y^{i}_{img} ) \qquad \text{and} \qquad y^{d}_{img} = map_y ( x^{i}_{img}, y^{i}_{img} )$$
Thus, in order to build the ideal image, we remap: the ideal image at $( x^{i}_{img}, y^{i}_{img} )$ takes the value of the distorted image at $\left( map_x ( x^{i}_{img}, y^{i}_{img} ),\; map_y ( x^{i}_{img}, y^{i}_{img} ) \right)$.
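This is exactly what OpenCV's remapping functions do; a minimal sketch, assuming `camera_matrix` and `dist_coeffs` come from the calibration above and `img` is a distorted frame:

```python
import cv2

h, w = img.shape[:2]

# Build map_x, map_y: for each ideal pixel, the coordinates of the distorted pixel to read.
map_x, map_y = cv2.initUndistortRectifyMap(
    camera_matrix, dist_coeffs, None, camera_matrix, (w, h), cv2.CV_32FC1)

# Remap the distorted image onto the ideal (undistorted) grid.
undistorted = cv2.remap(img, map_x, map_y, cv2.INTER_LINEAR)
```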
6.4 Stereo-calibration
Stereo-calibration deals with the calibration of a system made up of several cameras. Thus, it focuses on the location of each camera relative to the others in order to establish relations between them.
In the following, because it's easier and because the tool was available in the laboratory, we will focus only on a stereovision system made up of two cameras (one on the left and the other on the right).
6.4.1. Epipolar geometry
Epipolar geometry is the geometry used in stereovision. We first have to focus on it in order to better understand the stereovision system.
On the diagram below, the left camera parameters carry the subscript l and the right ones the subscript r.
P is an object point visible to the two cameras.
$O_i$ is the centre, $P_i$ the position relative to the object P, $\pi_i$ the image plane and $p_i$ the image projection of each camera (i = l, r).
$e_l$ and $e_r$ are respectively the epipoles of the left and right cameras. Their position is defined by the intersection between the line $O_l O_r$ and each of the image planes $\pi_l$ and $\pi_r$. If the cameras are stationary, they are always located at the same place within each image.
The epipolar lines $l_l$ and $l_r$ are defined respectively by the segments $e_l p_l$ and $e_r p_r$.
Using this geometry, we can define two interesting matrices which allow us to link the parameters of the two cameras:
- the essential matrix E links the camera positions $P_l$ and $P_r$
- the fundamental matrix F links the image projections $p_l$ and $p_r$
6.4.1.1 The essential matrix E
Actually, the essential matrix E describes the extrinsic relations between the left and right cameras (due to the location of each of them).
Indeed, each of the cameras is defined by its position relative to the object, as we saw during the mono-calibration:
$$P_l = R_l\, [\, I \mid -T_l \,]\, P \qquad \text{and} \qquad P_r = R_r\, [\, I \mid -T_r \,]\, P$$
We deduce a relative relation from the left camera to the right one, $P_{l-r}$:
$$P_{l-r} = P_r\, P_l^{-1} \qquad \text{or} \qquad P_{l-r} = R_{l-r}\, [\, I \mid -T_{l-r} \,]$$
with
$$R_{l-r} = R_r R_l^{T} \qquad \text{and} \qquad T_{l-r} = T_l - T_r$$
Using a coplanarity relation, we can calculate the essential matrix E. The three vectors $\vec{PO_l}$, $\vec{PO_r}$ and $\vec{O_l O_r}$ shown on the diagram below have to be coplanar; we deduce these relations:
$$\vec{PO_r} \cdot \left( \vec{O_l O_r} \times \vec{PO_l} \right) = 0 \;\Rightarrow\; P_r^{T} \left( T_{l-r} \times P_l \right) = 0$$
[Figure: epipolar geometry — the object point P is seen from the camera centres $O_l$ and $O_r$ as $P_l$ and $P_r$; the two cameras are related by $R_{l-r}$ and $T_{l-r}$.]
Using the diagram above, we can see that, in terms of vectors, we have:
$$P_r = P_l - T_{l-r}$$
Moreover, we can express $P_r$ as a function of $P_{l-r}$ using the definitions of $P_l$ and $P_r$:
$$P_r = P_{l-r} P_l = R_{l-r}\, [\, I \mid -T_{l-r} \,]\, P_l = R_{l-r} ( P_l - T_{l-r} )$$
Thanks to the two equations mentioned above, we can simplify the coplanarity relation:
$$P_r^{T} ( T_{l-r} \times P_l ) = 0 \;\Rightarrow\; ( P_l - T_{l-r} )^{T} ( T_{l-r} \times P_l ) = 0 \;\Rightarrow\; ( R_{l-r}^{T} P_r )^{T} ( T_{l-r} \times P_l ) = 0$$
Finally, we can define the essential matrix E:
$$( R_{l-r}^{T} P_r )^{T} ( T_{l-r} \times P_l ) = P_r^{T} R_{l-r} ( T_{l-r} \times P_l ) = P_r^{T} R_{l-r}\, S\, P_l = 0 \qquad \text{and} \qquad E = R_{l-r}\, S$$
where S expresses the vector product between $T_{l-r}$ and $P_l$ (i.e. $S\, P_l = T_{l-r} \times P_l$):
$$S = \begin{pmatrix} 0 & -t_z & t_y \\ t_z & 0 & -t_x \\ -t_y & t_x & 0 \end{pmatrix} \qquad \text{with } \operatorname{rank}(S) = 2$$
Thus, the positions of the left and right cameras are linked as follows:
$$P_r^{T}\, E\, P_l = 0$$
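A small NumPy sketch of this construction, building S from $T_{l-r}$ and then $E = R_{l-r} S$ (the rotation `R_lr` and translation `T_lr` are assumed to be known, for example from the stereo-calibration):

```python
import numpy as np

def essential_matrix(R_lr, T_lr):
    """E = R_lr * S, where S is the skew-symmetric matrix of T_lr (S p = T_lr x p)."""
    tx, ty, tz = T_lr
    S = np.array([[0.0, -tz,  ty],
                  [ tz, 0.0, -tx],
                  [-ty,  tx, 0.0]])
    return R_lr @ S
```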
6.4.1.2 The fundamental matrix F
The fundamental matrix F takes the intrinsic parameters of each camera into account in order to link their pixels.
Each camera has an intrinsic matrix ($M_l$ and $M_r$) linking its image projection to its position relative to the object:
$$p_l = M_l P_l \qquad \text{and} \qquad p_r = M_r P_r$$
As we saw with the mono-calibration, $M_l$ and $M_r$ correspond to the projection and distortion matrices of each camera.
Thus, the previous coplanarity relation can be written as below:
$$P_r^{T} E P_l = 0 \;\Rightarrow\; ( M_r^{-1} p_r )^{T} E ( M_l^{-1} p_l ) = p_r^{T} ( M_r^{-1} )^{T} E\, M_l^{-1} p_l = 0$$
F is defined as:
$$F = ( M_r^{-1} )^{T} E\, M_l^{-1} = ( M_r^{-1} )^{T} R_{l-r}\, S\, M_l^{-1}$$
Knowing the fundamental matrix, it's possible to know:
- the epipolar line of one camera corresponding to one point in the other
- the epipoles of the two cameras ($e_l$ and $e_r$)
Only one relation is needed to know all these parameters:
$$p_r^{T} F\, p_l = p_l^{T} F^{T} p_r = 0$$
Moreover, as we work with a projective representation of lines, if l is a line and p a point:
$$p \in l \;\Leftrightarrow\; p^{T} l = 0$$
Thus, we conclude from these two relations:
- if $p_l$ is a pixel from the left camera, the corresponding right epiline $l_r$ will be $F p_l$;
- if $p_r$ is a pixel from the right camera, the corresponding left epiline $l_l$ will be $F^{T} p_r$.
Dealing with the epipoles, they are defined as points lying on all the epilines; so we have:
$$e_l^{T} l_l = 0 \qquad \text{and} \qquad e_r^{T} l_r = 0$$
We conclude:
$$e_l^{T} l_l = 0 \;\Rightarrow\; l_l^{T} e_l = 0 \;\Rightarrow\; p_r^{T} F e_l = 0 \;\Rightarrow\; F e_l = 0$$
$$e_r^{T} l_r = 0 \;\Rightarrow\; l_r^{T} e_r = 0 \;\Rightarrow\; p_l^{T} F^{T} e_r = 0 \;\Rightarrow\; F^{T} e_r = 0$$
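OpenCV exposes this relation directly; a minimal sketch computing, for a set of left-image pixels, the corresponding right epilines $l_r = F p_l$ (the array `pts_left` and the matrix `F` are assumed given):

```python
import cv2
import numpy as np

# pts_left: Nx2 array of pixel coordinates in the left image.
pts = pts_left.reshape(-1, 1, 2).astype(np.float32)

# Each returned line (a, b, c) satisfies a*x + b*y + c = 0 in the right image.
lines_right = cv2.computeCorrespondEpilines(pts, 1, F).reshape(-1, 3)
```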
6.4.1.3 Roles of the stereo-calibration
We deduce from the previous calculations that the main roles of the stereo-calibration are to calculate:
- the relative position between the two cameras, $P_{l-r}$ ($R_{l-r}$ and $T_{l-r}$)
- the essential and fundamental matrices E and F
The stereo-calibration can also integrate the mono-calibration of each camera, but that is a complicated solution. Thus, an easy way to proceed is to mono-calibrate each of the cameras before stereo-calibrating the stereovision system made up of them.
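A minimal OpenCV sketch of this procedure, where each camera's intrinsics (`K_l`, `dist_l`, `K_r`, `dist_r`) come from its own mono-calibration and are kept fixed while the relative pose, E and F are estimated (the corner-point lists are of the same kind as those used for mono-calibration; variable names are illustrative assumptions):

```python
import cv2

flags = cv2.CALIB_FIX_INTRINSIC   # keep the mono-calibration results unchanged

(rms, K_l, dist_l, K_r, dist_r,
 R_lr, T_lr, E, F) = cv2.stereoCalibrate(
    object_points_list, image_points_left, image_points_right,
    K_l, dist_l, K_r, dist_r, image_size, flags=flags)
```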
6.4.2. Calculation of the rectification and reprojection matrices
6.4.2.1 Rectification matrices
Thanks to the calibration, we know some key links between the cameras.
Now, we want to simplify the stereovision system in order to match the same object easily in the two cameras. Working in the epipolar plane (the plane defined by $e_l$, $e_r$ and P) makes this matching easier: this means aligning the cameras, so that their positions relative to the object, $P_l$ and $P_r$, become collinear. Thus, we need a rectification matrix for each camera, $R^{rect}_l$ and $R^{rect}_r$, in order to apply a correcting rotation.
In order to know the rectification matrices, it's necessary to know the orientation of the frame defined by $\vec{O_l O_r} = T_{l-r}$, as it is this vector which defines the orientation of the epipolar plane. A simple way to do this is to calculate the unit vectors $e_1$, $e_2$ and $e_3$ defining this frame:
- $e_1$ is the unit vector along the baseline: $e_1 = T_{l-r} / \| T_{l-r} \|$
- $e_2$ is chosen as the vector product between $e_1$ and the z axis:
$$e_2 = \frac{e_1 \times (0 \;\; 0 \;\; 1)^{T}}{\left\| e_1 \times (0 \;\; 0 \;\; 1)^{T} \right\|} = \frac{1}{\sqrt{t_x^2 + t_y^2}} \begin{pmatrix} -t_y \\ t_x \\ 0 \end{pmatrix}$$
- $e_3$ is directly deduced from $e_1$ and $e_2$:
$$e_3 = \frac{e_1 \times e_2}{\| e_1 \times e_2 \|} = \frac{1}{\sqrt{t_x^2 t_z^2 + t_y^2 t_z^2 + (t_x^2 + t_y^2)^2}} \begin{pmatrix} -t_x t_z \\ -t_y t_z \\ t_x^2 + t_y^2 \end{pmatrix}$$
From these three unit vectors, we deduce the rectification matrix for the left camera, $R^{camRect}_l$:
$$R^{camRect}_l = ( e_1 \;\; e_2 \;\; e_3 )$$
As the right camera has a rotation relative to the left one defined by $R_{l-r}$, we have for $R^{camRect}_r$:
$$R^{camRect}_r = R_{l-r}\, R^{camRect}_l$$
As you can see, these are the rectification matrices for the cameras and not for the images; indeed, in order to simulate a real rotation of the camera in one direction, we need to rotate the image in the opposite direction. Thus, we deduce the rectification matrices for the images:
$$R^{rect}_l = ( R^{camRect}_l )^{T} = \begin{pmatrix} e_1^{T} \\ e_2^{T} \\ e_3^{T} \end{pmatrix} \qquad \text{and} \qquad R^{rect}_r = ( R^{camRect}_l )^{T} R_{l-r}^{T} = R^{rect}_l\, R_{l-r}^{T}$$
Thus, we can calculate the rectified position of each camera, $P^{rect}_l$ and $P^{rect}_r$:
$$P^{rect}_l = R^{rect}_l P_l \qquad \text{and} \qquad P^{rect}_r = R^{rect}_r P_r$$
6.4.2.2 Reprojection matrices
Once the rectified positions are calculated, we can calculate the new projection, or reprojection, matrices. The goal is to consider the system of multiple cameras as one global camera with only one general projection matrix which could represent the whole system. Thus, we will need a reprojection matrix for each camera, $P^{repro}_l$ and $P^{repro}_r$, in order to reproject the pixels from each camera to the global camera.
The projection matrix of the global camera can be expressed as follows:
$$P^{global}_{cam-px} = \begin{pmatrix} f_x^{global} & f_x^{global} o_c^{global} & c_x^{global} \\ 0 & f_y^{global} & c_y^{global} \\ 0 & 0 & 1 \end{pmatrix}$$
where
$$f_x^{global} = \frac{f_x^{l} + f_x^{r}}{2}, \qquad f_y^{global} = \frac{f_y^{l} + f_y^{r}}{2}, \qquad o_c^{global} = \frac{o_c^{l} + o_c^{r}}{2}$$
- if this is a horizontal stereo system, $c_y^{global} = \dfrac{c_y^{l} + c_y^{r}}{2}$ and $c_x^{global}$ is not common to the two cameras ($c_x^{r} = c_x^{l} + dx$, where dx is the disparity along the x axis);
- if this is a vertical stereo system, $c_x^{global} = \dfrac{c_x^{l} + c_x^{r}}{2}$ and $c_y^{global}$ is not common to the two cameras ($c_y^{r} = c_y^{l} + dy$, where dy is the disparity along the y axis).
In order to simplify the model and to use the global projection matrix $P^{global}_{cam-px}$, we can consider the disparity to be 0. Thus, we can also take the mean of $c^{l}$ and $c^{r}$.
Geometrically, focusing on a horizontal stereovision system, the projections in the left camera and the right one are separated by $\Delta x = f_x^{global}\, t / z^{global}$ (where t is the norm of $T_{l-r}$); thus, if we consider the left camera as the reference frame, pixel coordinates from the right one have to be offset by $\Delta x$.
Finally, in the case of a horizontal stereovision system, we deduce the reprojection matrices $P^{repro}_l$ and $P^{repro}_r$ (if the left camera is considered as the reference frame):
$$P^{repro}_l = \begin{pmatrix} f_x^{global} & f_x^{global} o_c^{global} & c_x^{global} & 0 \\ 0 & f_y^{global} & c_y^{global} & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \qquad P^{repro}_r = \begin{pmatrix} f_x^{global} & f_x^{global} o_c^{global} & c_x^{global} & f_x^{global} t_x \\ 0 & f_y^{global} & c_y^{global} & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}$$
Then, we can express the stereo-rectified pixel of each camera, $p^{rect}_l$ and $p^{rect}_r$, as functions of the position of the object relative to each camera, $P_l$ and $P_r$:
$$p^{rect}_l = P^{repro}_l P^{rect}_l = P^{repro}_l R^{rect}_l P_l \qquad \text{and} \qquad p^{rect}_r = P^{repro}_r P^{rect}_r = P^{repro}_r R^{rect}_r P_r$$
[Figure: rectified stereo geometry — the object point P, at depth $z^{global}$, is seen from the camera centres $O_l$ and $O_r$ at the rectified positions $P^{rect}_l$ and $P^{rect}_r$; the corresponding projections are offset by $\Delta x = f_x^{global}\, t / z^{global}$.]
6.4.3. Rectification and reprojection of an image
In order to rectify and reproject an image from a camera within a stereovision system, we will use the same method as for undistortion, adding the rectification and reprojection matrices.
Indeed, our ideal pixel will now be:
$$p^{i}_{img} = s^{global}\, P^{repro}\, R^{rect}\, p_{obj/cam}$$
Thus, we can express the distorted (and unrectified) pixel $p^{d}_{img}$ as a function of $p^{i}_{img}$:
$$p^{d}_{img} = s\, P_{cam-px}\, M_d\, \frac{1}{s^{global}}\, ( R^{rect} )^{T}\, ( P^{repro} )^{-1}\, p^{i}_{img}$$
Then, similarly to undistortion, we will use image mapping.
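In practice, these rectification and reprojection matrices and the corresponding image maps can be obtained from OpenCV in one go; a minimal sketch reusing the names from the stereo-calibration sketch above (`left_image` is assumed to be a frame from the left camera):

```python
import cv2

# R_rect_l, R_rect_r: rectification rotations; P_repro_l, P_repro_r: reprojection matrices.
R_rect_l, R_rect_r, P_repro_l, P_repro_r, Q, _, _ = cv2.stereoRectify(
    K_l, dist_l, K_r, dist_r, image_size, R_lr, T_lr)

# Undistortion + rectification maps for the left image, then remapping as before.
map_x, map_y = cv2.initUndistortRectifyMap(
    K_l, dist_l, R_rect_l, P_repro_l, image_size, cv2.CV_32FC1)
left_rectified = cv2.remap(left_image, map_x, map_y, cv2.INTER_LINEAR)
```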