
White Paper: Camera Calibration and Stereo Vision

Peter Hillman
Square Eyes Software
15/8 Lochrin Terrace, Edinburgh EH3 9QL
pedro@peterhillman.org.uk
www.peterhillman.org.uk

October 27, 2005

Abstract

This white paper outlines a process for camera calibration: computing the mapping between points in the real
world and where they arrive in the image. This allows graphics to be rendered into an image in the correct position.
Given this information for a pair of stereo cameras, it is possible to reverse the process to compute the 3D position of
a feature given its position in each image — one of the most important tasks in machine vision. The system presented
here requires the capture of a calibration chart with known geometry.

1 Introduction and Overview


This document is intended to be a tutorial which describes a process for calibrated stereo vision: it is meant as a simple
get-it-working overview, with some of the important bits of the theory “glossed over” rather than explained in detail.
The idea is that after reading this paper you should be able to implement a reliable and reasonably robust system,
rather than understand all of the theory. For detail, refer to Hartley and Zisserman’s Multiple View Geometry [4].
In this paper the terms 3D position and 3D co-ordinates refer to the actual position of an object in the real world. A
real-world object is also called a feature. When viewed through a camera, the feature appears at some position in the
image. This is referred to as the feature's Image Point, which has 2D co-ordinates measured in pixels.
The remainder of this paper is arranged as follows: virtually all the processing uses homogeneous co-ordinates, which
are explained in Section 2. The mapping between 3D co-ordinates of features and 2D positions of their corresponding
image points is given by a 3×4 Projection Matrix P . Section 3 explains this matrix and shows how it is used to find
where a feature in space appears in an image. To reverse the process and find the 3D position of a
feature, two different views of the point are required. This is the classic stereo vision problem and is presented in
Section 4.
These sections assume that P is known. Section 5 shows how a calibration chart can be used to find this data.
Although stereo vision is best understood by presenting the algorithms in this order, it is best to implement the
paper backwards: the calibration algorithm in Section 5 first, then the 3D reconstruction algorithm in Section 4.

1.1 Assumed background
This paper assumes you know a little bit about linear algebra, but not much. If you’ve ever done matrix multiplication
and tried to do matrix inversion, that's probably enough. You should also have met some kind of image processing
and have some idea of the multitude of different techniques that you can use to identify the position of a feature in an
image.

2 Homogeneous co-ordinates
In normal co-ordinates, a point in an image (an image point) which is x pixels to the right of the origin and y pixels
below (or above) it is described as a pair (x, y) or as a column vector
\[
\begin{bmatrix} x \\ y \end{bmatrix}
\]
However, in most computational geometry systems, homogeneous co-ordinates are used. An extra element w is
tagged onto the vector
\[
\begin{bmatrix} a \\ b \\ w \end{bmatrix}
\]
for 2D and
\[
\begin{bmatrix} a \\ b \\ c \\ w \end{bmatrix}
\]
for 3D. To convert from homogeneous co-ordinates to normal co-ordinates, simply divide a,b (and c if 3D) by w to
get x, y and z: x = a/w, y = b/w, z = c/w. Choosing w = 1 makes this process simpler: just chop off the last
element.
In this paper, co-ordinates are sometimes written as transposed row vectors such as [a b w]^T, since this takes less space
on the page. Similarly, homogeneous points in 3D are written [a b c w]^T.
If you choose it to be 1, why bother with w at all? Two reasons. Firstly, it means that matrix multiplication can be
used much more effectively to manipulate points. With normal (inhomogeneous) co-ordinates, a 2 × 2 matrix can
rotate or scale a point, but only about the origin: it cannot apply a translation. To achieve translation, a constant must
be added; with homogeneous co-ordinates, the translation can be folded into the matrix multiplication itself. Consider this case:
 ′   
x 1 0 tx x
 y ′  =  0 1 ty   y 
w′ 0 0 1 1

Here, w on the right hand side has been set to 1 for convenience, which means a = x and b = y. Applying matrix
multiplication gives x' = x + t_x and y' = y + t_y with w' = 1. Thus, a constant vector [t_x t_y]^T has been added to each
point using matrix multiplication. (You should try to convince yourself that this works no matter what value w is set to.)
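As a quick illustration, here is a minimal sketch in Python with numpy (a tool this paper does not otherwise assume) of the conversion to and from homogeneous co-ordinates and the translation above:

    import numpy as np

    def to_homogeneous(p):
        # Append w = 1 to an inhomogeneous point.
        return np.append(p, 1.0)

    def from_homogeneous(p):
        # Divide a, b (and c in 3D) by w, then drop w.
        return p[:-1] / p[-1]

    # Translate the 2D point (3, 4) by (tx, ty) = (10, 20).
    tx, ty = 10.0, 20.0
    T = np.array([[1.0, 0.0, tx],
                  [0.0, 1.0, ty],
                  [0.0, 0.0, 1.0]])
    p = to_homogeneous(np.array([3.0, 4.0]))  # [3, 4, 1]
    print(from_homogeneous(T @ p))            # [13. 24.]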
Another reason to use homogeneous co-ordinates is the ability to represent points infinitely far away: [100]T is a
point infinitely far away on the x-axis (the direction in which the x-axis points), [010]T is infinitely far away on the
y-axis. Why would you want to represent infinite points? Because points which are infinitely far away in 3D space
can appear at a fixed, finite position in an image, thanks to the projection matrix P described in the next section. The
“vanishing point” — the point at which parallel railway lines appear to converge — is infinitely far away but shows
up in an image. Stars can also be thought of as infinitely far away, but they will still be at a given position in an image.

3 The Projection Matrix
The projection matrix P is a mapping between the 3D co-ordinates of a feature and the feature's image point: P
maps points in 3D space (the world) to 2D space (the image). A graphics renderer applies a projection
matrix to a feature in 3D space in order to find where to draw the feature in the image. A real camera effectively does
the same. Why a matrix multiplication is sufficient for the job will not be explained here, but such a matrix can be
composed of the following components:

• the 3D position of the camera — a translation
• the pixel pitch (which is also related to the image size) — a scaling
• the viewing direction of the camera (where it is looking) — a set of rotations
• the effective focal length of the camera. This causes the 2D x and y positions to depend on the 3D z
position, so that features appear to move in the image as they approach the camera.

Since a vector describing the position of a 3D point has 4 elements including w, and a vector describing a position of
a 2D point has 3 elements, P has 3 rows and 4 columns. To map a 3D point to a 2D image point using P , we
apply the following formula:

\[
\begin{bmatrix} a' \\ b' \\ w' \end{bmatrix} = P \begin{bmatrix} a \\ b \\ c \\ w \end{bmatrix} \tag{1}
\]

or, more verbosely,

\[
\begin{bmatrix} a' \\ b' \\ w' \end{bmatrix} =
\begin{bmatrix} p_{11} & p_{12} & p_{13} & p_{14} \\ p_{21} & p_{22} & p_{23} & p_{24} \\ p_{31} & p_{32} & p_{33} & p_{34} \end{bmatrix}
\begin{bmatrix} a \\ b \\ c \\ w \end{bmatrix} \tag{2}
\]

Now you have enough information to make a 3D graphics rendering package! The following algorithm is at the heart
of most packages:
Algorithm 1 To draw 3D points in an image at the correct location:

• decide on the value of P , based on where you want the camera to be, where you want it to be looking and what
field of view you want.
• for each point p = (x, y, z) in the 3D scene
– add an element w = 1 to form a homogeneous co-ordinate vector [xyz1]T
– compute [abw]T = P [xyz1]T .
– divide by w to find the 2D image point: x = a/w and y = b/w
– draw a point at that position in the image

Of course this algorithm doesn’t do any of the fancy lighting effects or hidden surface removal you’ll need to make a
realistic image. For that, read Foley et al. [2].
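A minimal sketch of Algorithm 1 in Python with numpy follows; the projection matrix and scene points below are made-up values, purely for illustration:

    import numpy as np

    def project_points(P, points_3d):
        # P is the 3x4 projection matrix; points_3d is an (n, 3) array of (x, y, z).
        n = points_3d.shape[0]
        homog = np.hstack([points_3d, np.ones((n, 1))])  # add w = 1 to each point
        abw = (P @ homog.T).T                            # homogeneous image points
        return abw[:, :2] / abw[:, 2:3]                  # divide by w to get pixels

    # Made-up projection matrix and scene points.
    P = np.array([[800.0, 0.0, 320.0, 0.0],
                  [0.0, 800.0, 240.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0]])
    points = np.array([[0.0, 0.0, 2.0],
                       [0.1, -0.1, 3.0]])
    for x, y in project_points(P, points):
        print(f"draw a point at ({x:.1f}, {y:.1f})")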

Figure 1: Reconstruction: the 2D image point (shown as a cross) corresponds to a line of points (dashed line) in 3D;
any point along this line would project to the same image point. Thus, the 3D position of a point cannot in general be
determined from one camera.

4 3D recovery
The previous section describes how to compute the 2D position of the point given the 3D co-ordinates. How can we
go the other way? Can we work out where a point must be in real 3D space given its position in an image (its image
point)? The simple answer is you cannot, except in very special circumstances. This is because all points on a 3D
line (called the back projection line) end up at the same image point: given an image point, you know that the feature
must lie on the back projection line but you cannot tell where on the line the feature is. So, P applied in reverse maps
a 2D point to a 3D line: see Figure 1.

Figure 2: Reconstruction 2: the same point observed in two cameras results in two different back projection lines
which intersect at the 3D feature. Hence, 3D reconstruction from stereo cameras is possible.

With two cameras, each camera gives a different 3D back projection line. These two back projection lines usually
intersect at exactly one point (Fig. 2). So, given a stereo setup it is possible to find the 3D position of a point
by observing its position in two different cameras. Why “usually”? It is possible that the point is infinitely far
away (think back to the stars), and if the cameras are looking in the same direction separated only by a translation
(like binoculars), then the back projection lines are parallel, and will not (strictly speaking) intersect. However, if
homogeneous points are used carefully, an “infinite point” of the form [abc0]T will be recovered, which can be used to
compute the direction of the point.

Recovery of the 3D position of a point proceeds as follows:
For two cameras with projection matrices P 1 and P 2 we have

\[
\begin{bmatrix} a_1 \\ b_1 \\ w_1 \end{bmatrix} = P^1 \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \tag{3}
\]

\[
\begin{bmatrix} a_2 \\ b_2 \\ w_2 \end{bmatrix} = P^2 \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \tag{4}
\]

We know the observed image points in each camera, x1, y1, x2 and y2, given by a1/w1, b1/w1, a2/w2 and b2/w2
respectively, but we cannot assume that w = 1. We wish to solve for [X Y Z]^T, since this is the 3D position of the
feature.
Now, multiply out in terms of the P^i, where i is 1 or 2. Here, p^i_{11} means element (1,1) of P^i:

\[ a_i = X p^i_{11} + Y p^i_{12} + Z p^i_{13} + p^i_{14} \tag{5} \]
\[ b_i = X p^i_{21} + Y p^i_{22} + Z p^i_{23} + p^i_{24} \tag{6} \]
\[ w_i = X p^i_{31} + Y p^i_{32} + Z p^i_{33} + p^i_{34} \tag{7} \]

Substituting into x_i w_i = a_i and y_i w_i = b_i gives

\[ X x_i p^i_{31} + Y x_i p^i_{32} + Z x_i p^i_{33} + x_i p^i_{34} = X p^i_{11} + Y p^i_{12} + Z p^i_{13} + p^i_{14} \tag{8} \]
\[ X y_i p^i_{31} + Y y_i p^i_{32} + Z y_i p^i_{33} + y_i p^i_{34} = X p^i_{21} + Y p^i_{22} + Z p^i_{23} + p^i_{24} \tag{9} \]

Since we want to solve this for X, Y and Z, those terms are collected on the left and the other terms on the right:

\[ X(x_i p^i_{31} - p^i_{11}) + Y(x_i p^i_{32} - p^i_{12}) + Z(x_i p^i_{33} - p^i_{13}) = p^i_{14} - x_i p^i_{34} \tag{10} \]
\[ X(y_i p^i_{31} - p^i_{21}) + Y(y_i p^i_{32} - p^i_{22}) + Z(y_i p^i_{33} - p^i_{23}) = p^i_{24} - y_i p^i_{34} \tag{11} \]

We substitute for i = 1 and i = 2 to give us four equations and write them as the rows of a matrix equation:

\[
\begin{bmatrix}
x_1 p^1_{31} - p^1_{11} & x_1 p^1_{32} - p^1_{12} & x_1 p^1_{33} - p^1_{13} \\
y_1 p^1_{31} - p^1_{21} & y_1 p^1_{32} - p^1_{22} & y_1 p^1_{33} - p^1_{23} \\
x_2 p^2_{31} - p^2_{11} & x_2 p^2_{32} - p^2_{12} & x_2 p^2_{33} - p^2_{13} \\
y_2 p^2_{31} - p^2_{21} & y_2 p^2_{32} - p^2_{22} & y_2 p^2_{33} - p^2_{23}
\end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \end{bmatrix}
=
\begin{bmatrix}
p^1_{14} - x_1 p^1_{34} \\
p^1_{24} - y_1 p^1_{34} \\
p^2_{14} - x_2 p^2_{34} \\
p^2_{24} - y_2 p^2_{34}
\end{bmatrix}
\tag{12}
\]

\[ A X = B \tag{13} \]

We need to solve this for X. We'd like to invert A directly but this isn't allowed (A isn't even square), so instead we
play the little pseudo-inverse trick: we can left-multiply each side by A^T, then by (A^T A)^{-1}. The inversion here is
usually OK because A^T A is always square. So we have

\[ (A^T A)^{-1} A^T A X = (A^T A)^{-1} A^T B \tag{14} \]

You’ll see that the first part of the left hand side of this is a matrix multiplied by its inverse which by definition is the
identity matrix I. Since IM = M for all M , we can write

\[ X = (A^T A)^{-1} A^T B \tag{15} \]

That, believe it or not, is the basic equation of stereo vision. Let's restate it in the form of an algorithm:
Algorithm 2

• Find P 1 and P 2 , the projection matrices for each camera. This can be done “off-line” as they only change when
the camera geometry changes.
• To find the 3D position of a point:
• Locate the position of the point, (x1 , y1 ) and (x2 , y2 ), in the image from each camera.
• Form matrices A and B from (x1 , y1 ), (x2 , y2 ), P 1 and P 2 using equation (12).
• Form and invert (A^T A). It is best to use Singular Value Decomposition to do the inversion, as it gives you the
closest fit. (The OpenCV and VXL libraries contain routines for this, as do Matlab and Scilab; the Numerical
Recipes implementation still appears to be broken despite several attempts to fix it.)
• Solve equation (15) to give the 3D position.

There you are! Now you can find points in 3D from a stereo camera setup. At least, you can once you’ve read
section 5, which explains how to find P .
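A minimal sketch of Algorithm 2 in Python with numpy follows. The pseudo-inverse solve of equation (15) is done here with numpy's least-squares routine, which uses SVD internally; P1, P2 and the located image points are assumed to be known already:

    import numpy as np

    def triangulate(P1, P2, pt1, pt2):
        # pt1 = (x1, y1) observed in camera 1, pt2 = (x2, y2) observed in camera 2.
        rows, rhs = [], []
        for P, (x, y) in ((P1, pt1), (P2, pt2)):
            # Rows of A and elements of B, as in equations (10)-(12).
            rows.append(x * P[2, :3] - P[0, :3])
            rhs.append(P[0, 3] - x * P[2, 3])
            rows.append(y * P[2, :3] - P[1, :3])
            rhs.append(P[1, 3] - y * P[2, 3])
        A, B = np.array(rows), np.array(rhs)
        # Least-squares solution of AX = B, equivalent to X = (A^T A)^-1 A^T B.
        X, *_ = np.linalg.lstsq(A, B, rcond=None)
        return X  # the reconstructed (X, Y, Z)

Equation (15) could also be coded literally as np.linalg.inv(A.T @ A) @ A.T @ B, but the least-squares call is better behaved numerically.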

4.1 Epipolar Geometry

Figure 3: Epipolar geometry: all points on the back projection line corresponding to a point observed in the left
image project to a line in the right image called an epipolar line.

Figure 3 shows the same set-up as Figure 2. Suppose an image point is observed in the left image from a 3D feature.
The exact position of the point is unknown, but it will definitely lie somewhere on the back projection line (shown
dashed). If we take this line and project it onto the right camera image, we get a line in the right camera image.
(“Projecting a line” means taking every point on the line, projecting it using P , then combining the projected points
together to form a line.)

This line in the image is called an epipolar line. It is an important concept to grasp: If a feature projects to a point
(x, y) in one camera view, the corresponding image point in the other camera view must lie somewhere on an
epipolar line in the camera image. An image point in camera 1 corresponds to an epipolar line in camera 2 and
vice versa.
Computing the equation for the epipolar line requires combining P 1 and P 2 to form a Fundamental Matrix F .
Details are given in Hartley and Zisserman’s Multiple View Geometry, which provides a Matlab code sample for
computing F from P 1 and P 2 . It is mentioned here because it can be much more efficient to use epipolar lines. The
algorithm would look like this:
Algorithm 3

• Find P 1 and P 2 , the projection matrices for each camera, and combine them to form F . This can be done “off-line” as
they only change when the camera geometry changes.
• To find the 3D position of a feature:
• Locate the position of the feature p1 = [x1 y1 1]T as observed by camera 1.
• Compute the epipolar line using e = F p1 . e is of the form [a b w]^T , where points on the line satisfy ax + by + w = 0.
• Scan along e in the image from camera 2 to find the feature at p2 = [x2 y2 1]^T .
• Use equation (15) to find the 3D position from p1 and p2 .

This is almost twice as fast as the simpler algorithm, since scanning the whole image from each camera is slower
than scanning the whole of one image followed by one line in another. However, it does rely on the feature being
accurately located in camera 1. If the feature is mis-located, the epipolar line is likely to be wrong. Finding the best
match on this incorrect epipolar line might give a reconstructed point which is significantly inaccurate.
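The following sketch (in Python with numpy, rather than the Matlab sample mentioned above) builds F from the two projection matrices using one standard construction given by Hartley and Zisserman [4], via the epipole in image 2 and the pseudo-inverse of P1:

    import numpy as np

    def fundamental_from_projections(P1, P2):
        # Camera centre of camera 1: the (homogeneous) null vector of P1.
        _, _, Vt = np.linalg.svd(P1)
        centre1 = Vt[-1]
        # Epipole in image 2: the image of camera 1's centre seen by camera 2.
        e2 = P2 @ centre1
        # Skew-symmetric matrix so that e2x @ v equals the cross product of e2 with v.
        e2x = np.array([[0.0, -e2[2], e2[1]],
                        [e2[2], 0.0, -e2[0]],
                        [-e2[1], e2[0], 0.0]])
        # F = [e2]_x P2 P1^+, where P1^+ is the pseudo-inverse of P1.
        return e2x @ P2 @ np.linalg.pinv(P1)

    def epipolar_line(F, p1):
        # p1 = [x1, y1, 1]; returns [a, b, w] with a*x + b*y + w = 0 in image 2.
        return F @ p1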

4.2 Errors in 3D projection

Figure 4: Reconstruction errors: with erroneously located features (red crosses), Algorithm 2 will find the best-fit
3D reconstructed point, which re-projects into the images at different points (black crosses) from the located feature
points. The sum of the distances between the red cross and the black cross in each image gives a measurement of the
error.

We return to the situation of Algorithm 2, where the whole of both images is scanned. If the feature is located
precisely in both images, then the image point in camera 2 will lie on the epipolar line corresponding to the image
point in camera 1 and vice versa. If the feature is inaccurately located in both images, this is unlikely to be the case.
Another effect is that equation (13) will not hold exactly. An error in pixels can be computed once X has been found.

Simply take the computed point X and project it using P 1 and P 2 to find two reconstructed image points. Taking the
distance between these reconstructed points and the original found feature points gives a measure of error. Figure 4
shows the setup.
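A minimal sketch of this error measure in Python with numpy, assuming the reconstructed point X and the two located image points are already available:

    import numpy as np

    def reprojection_error(P1, P2, X, pt1, pt2):
        # X is the reconstructed 3D point; pt1 and pt2 are the located image points.
        Xh = np.append(X, 1.0)
        error = 0.0
        for P, pt in ((P1, pt1), (P2, pt2)):
            abw = P @ Xh
            reprojected = abw[:2] / abw[2]   # back to pixel co-ordinates
            error += np.linalg.norm(reprojected - np.asarray(pt))
        return error  # sum of the two pixel distances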

5 Calibration: Finding the Projection Matrix

5.1 Choosing a co-ordinate frame for 3D points


2D points are measured in pixels, with the origin in some given position, usually the top left corner. But what about
the 3D points? How should they be measured? Basically, any way you want. You must decide where you want the
origin of the 3D space to be, what directions you want the axes to point in (make sure they are orthogonal, though!),
and what scale you want. It will depend on what your application is. If, for example, you are using the system to control
a robotic arm, you will probably want to align the origin and axes with the system used to control the robotic arm.
For a forward-looking vehicle-mounted camera, you could do worse than setting the origin to be the camera position,
the scale to be in metres, the z axis to be along the road and y to be up. That way, the magnitude of a 3D co-ordinate tells
you how far away the feature is.

5.2 Solving for the Projection Matrix


The 3D position of points cannot be recovered until P is found for each camera. If we have a set of features (points
in 3D) and know the corresponding image points, then we can solve for P . Let’s suppose we have a set N of points,
with Cn = (X, Y, Z) being the known 3D point and cn = (x, y) being the known image point corresponding to Cn .
Let’s restate Equations (8) and (9), discarding the i superscripts and subscripts, since we are solving for each matrix
separately:

\[ X x p_{31} + Y x p_{32} + Z x p_{33} + x p_{34} = X p_{11} + Y p_{12} + Z p_{13} + p_{14} \tag{16} \]
\[ X y p_{31} + Y y p_{32} + Z y p_{33} + y p_{34} = X p_{21} + Y p_{22} + Z p_{23} + p_{24} \tag{17} \]

Moving everything onto one side and writing the result in matrix form gives

\[
\begin{bmatrix}
X & Y & Z & 1 & 0 & 0 & 0 & 0 & -Xx & -Yx & -Zx & -x \\
0 & 0 & 0 & 0 & X & Y & Z & 1 & -Xy & -Yy & -Zy & -y
\end{bmatrix}
\begin{bmatrix}
p_{11} \\ p_{12} \\ p_{13} \\ p_{14} \\ p_{21} \\ p_{22} \\ p_{23} \\ p_{24} \\ p_{31} \\ p_{32} \\ p_{33} \\ p_{34}
\end{bmatrix}
= 0
\tag{18}
\]

\[ C p = 0 \tag{19} \]

Here, the 12-element column vector p is just the elements of the 3×4 matrix P stacked into a vector so we can express
equations (16) and (17) as a matrix multiplication. Once we've solved for p, all we need to do is reshape the elements
to form P . We have two equations with 12 unknowns, which isn't a promising start, but so far we've only used one
point. However, we can add more points simply by adding rows to the matrix C using more correspondences Cn
and cn . Let's add subscripts to identify different points:

 
\[
\begin{bmatrix}
X_1 & Y_1 & Z_1 & 1 & 0 & 0 & 0 & 0 & -X_1 x_1 & -Y_1 x_1 & -Z_1 x_1 & -x_1 \\
0 & 0 & 0 & 0 & X_1 & Y_1 & Z_1 & 1 & -X_1 y_1 & -Y_1 y_1 & -Z_1 y_1 & -y_1 \\
X_2 & Y_2 & Z_2 & 1 & 0 & 0 & 0 & 0 & -X_2 x_2 & -Y_2 x_2 & -Z_2 x_2 & -x_2 \\
0 & 0 & 0 & 0 & X_2 & Y_2 & Z_2 & 1 & -X_2 y_2 & -Y_2 y_2 & -Z_2 y_2 & -y_2 \\
\vdots & & & & & & & & & & & \vdots \\
X_n & Y_n & Z_n & 1 & 0 & 0 & 0 & 0 & -X_n x_n & -Y_n x_n & -Z_n x_n & -x_n \\
0 & 0 & 0 & 0 & X_n & Y_n & Z_n & 1 & -X_n y_n & -Y_n y_n & -Z_n y_n & -y_n
\end{bmatrix}
p = 0
\tag{20}
\]

\[ C p = 0 \tag{21} \]

If we have n points, C will have 2n rows, and 0 is a column vector with 2n elements, all of which are zero.
So how do we solve for p? Since the right hand side of equation (21) is zero, we need to find the nullspace of C. We
can solve this using Singular Value Decomposition [5]. This finds a decomposition of a matrix (in this case C) into
three matrices U W V^T , where W is a diagonal matrix (only the elements on the diagonal of W are non-zero). The
nullspace is those columns of V for which the corresponding element of W is (nearly) zero. So, if element W_aa is
zero, or very close to zero, then column a of V is part of the nullspace.
To solve for p, use a standard implementation of Singular Value Decomposition (for example from VXL or
Matlab), check that the elements of W are in descending order (they will be in the case of VXL and Matlab), and
reshape the rightmost column of V into P . Any problems with this “ignorant” approach will be mopped up by the
RANSAC algorithm.
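A minimal sketch of this step in Python with numpy; the inputs are assumed to be matched arrays of 3D features and 2D image points:

    import numpy as np

    def solve_projection_matrix(points_3d, points_2d):
        # points_3d: (n, 3) known feature positions; points_2d: (n, 2) located image points.
        rows = []
        for (X, Y, Z), (x, y) in zip(points_3d, points_2d):
            # The two rows contributed by each correspondence, as in equation (20).
            rows.append([X, Y, Z, 1, 0, 0, 0, 0, -X * x, -Y * x, -Z * x, -x])
            rows.append([0, 0, 0, 0, X, Y, Z, 1, -X * y, -Y * y, -Z * y, -y])
        C = np.array(rows, dtype=float)
        # Nullspace of C: the right singular vector with the smallest singular value.
        _, _, Vt = np.linalg.svd(C)
        p = Vt[-1]              # singular values are returned in descending order
        return p.reshape(3, 4)  # stack the 12 elements back into P

Note the normalisation advice in Section 5.4 before using this on real data.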

5.3 The RANSAC algorithm


We have a method for finding P using some number of points. But how many points should be used? Well, it makes
sense to pick at least 6, since there are 12 elements in P , and we get two equations per point, but this is not strictly
necessary. Presumably, you will have a system that identifies points automatically in the image and somehow knows
what the real position of the point is in space. One way of getting this data is using the calibration chart reading
algorithm described in section 6. There will probably be some error in the location of the points, which will introduce
errors into C. The effect of SVD is to give a kind of least squares fit to the points provided. So, in some sense, the
more points you use the better, since this will give a better fit. But you might have an outlier: you might mistake some
speck of dust for one of the calibration chart points, which will give you at least one erroneous point. Including this
erroneous point in C will skew the results and introduce errors in P . The answer is to use only the accurate points.
The outliers are difficult to detect, but the RANSAC (Random Sample Consensus) [1] algorithm provides a brute
force solution. RANSAC repeatedly chooses a random subset of the points (hopefully one without outliers), solves
for P using just those, then looks at how good the solution is. Given perfect input data, P will map all the 3D points
to the located 2D image positions. Of course, a few won’t match because there are outliers, and the rest won’t be
in exactly the right position because of minor location errors. So, we say that the best solution for P is the one that
reconstructs the most 3D points to a position close to the located 2D positions.
Written algorithmically:
Algorithm 4

• Given a list N of n point correspondences between 3D co-ordinates and the corresponding 2D image points,
do the following for several iterations:
– pick x point pairs out of N to form C
– Solve for P using SVD
– For each 3D to 2D point correspondence C ⇒ c in N
∗ project C to the image point c′ using c′ = P C
∗ If c′ is fairly close to c, count as a good match
• Set the final P to be the one that has the most good matches

Here, a “good match” means the Euclidean distance between c and c′ is less than a parameter. This parameter
should be proportional to the expected error in inlier matches. That is, look at the output from your point correspon-
dence finding algorithm, throw away the obvious outliers, and measure the errors in the remaining correspondences.
So how many points x should be picked each time? The higher the error in the inliers, the more points you need, since
you rely more on the least-squares fit to smooth out those errors. The maximum value you should use is n − r, where n
is the number of correspondences and r is the maximum number of expected outliers. If you set it higher than this, then
you will always have an outlier in the matrix C and you will be doomed. The more points you use, the more iterations of
RANSAC you should run, since it will take longer to stumble upon a group of matches that contains no outliers.
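A minimal sketch of Algorithm 4 in Python with numpy, reusing solve_projection_matrix from the sketch in Section 5.2; the sample size, iteration count and distance threshold below are illustrative values only:

    import numpy as np

    def ransac_projection(points_3d, points_2d, sample_size=6,
                          iterations=500, threshold=2.0):
        # points_3d: (n, 3) array; points_2d: (n, 2) array of located image points.
        n = len(points_3d)
        homog = np.hstack([points_3d, np.ones((n, 1))])
        best_P, best_count = None, -1
        for _ in range(iterations):
            # Pick a random subset and solve for P from just those correspondences.
            idx = np.random.choice(n, sample_size, replace=False)
            P = solve_projection_matrix(points_3d[idx], points_2d[idx])
            # Re-project every 3D point and count the ones that land close enough.
            abw = (P @ homog.T).T
            reprojected = abw[:, :2] / abw[:, 2:3]
            distances = np.linalg.norm(reprojected - points_2d, axis=1)
            count = int(np.sum(distances < threshold))
            if count > best_count:
                best_P, best_count = P, count
        return best_P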

5.4 Normalisation
Before you reach for your keyboard and implement all this junk, there is an important point to note. Suppose you want to
measure the real-world co-ordinates in metres, and have a fancy high-resolution camera. Columns 1-8 of C would then
be of the order of 0.1 to 1, columns 9-11 of the order 1-100, and column 12 of the order 100-1000. This is bad, because
the least squares action of SVD will be skewed by these large numbers in the last column. Hartley and Zisserman suggest
(and rather emphatically insist upon) a normalisation process to make sure that the columns are all of the same order: before
computing the matrix C, find the mean of all 3D points and remove this mean. Now scale these points so that the
variance is 1. Do exactly the same to the 2D points c. The projection matrix P ′ found from these scaled points needs
to be pre- and post-multiplied so that it can work with un-normalised values:

\[
P = \begin{bmatrix} \sigma_x & 0 & \bar{x} \\ 0 & \sigma_y & \bar{y} \\ 0 & 0 & 1 \end{bmatrix}
P'
\begin{bmatrix} \sigma_X & 0 & 0 & \bar{X} \\ 0 & \sigma_Y & 0 & \bar{Y} \\ 0 & 0 & \sigma_Z & \bar{Z} \\ 0 & 0 & 0 & 1 \end{bmatrix}^{-1}
\tag{22}
\]

where x̄ and ȳ are the mean x and y positions of the located image points and σx and σy their standard deviations,
and X̄, Ȳ and Z̄ are the mean X, Y and Z positions of the features and σX , σY and σZ their standard deviations.
The 3×3 matrix un-normalises the image co-ordinates, and the inverse of the 4×4 matrix normalises the world
co-ordinates before P ′ is applied.
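A minimal sketch of the normalisation and de-normalisation in Python with numpy; the points are scaled by their standard deviations, and the de-normalisation follows equation (22):

    import numpy as np

    def normalise(points):
        # Remove the mean and scale each co-ordinate so that its variance is 1.
        mean = points.mean(axis=0)
        std = points.std(axis=0)
        return (points - mean) / std, mean, std

    def denormalise_P(P_norm, mean_2d, std_2d, mean_3d, std_3d):
        # Build the un-normalising transforms of equation (22) and recover P.
        T2 = np.array([[std_2d[0], 0.0, mean_2d[0]],
                       [0.0, std_2d[1], mean_2d[1]],
                       [0.0, 0.0, 1.0]])
        T3 = np.diag([std_3d[0], std_3d[1], std_3d[2], 1.0])
        T3[:3, 3] = mean_3d
        return T2 @ P_norm @ np.linalg.inv(T3)

Normalise the 3D and 2D point lists, run the calibration (or RANSAC) step on the normalised points to get P ′, then de-normalise the result.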

6 Reading a calibration chart


A calibration object is some object with identifiable features, where the positions of the features are known. The idea
is that you know where the features are in 3D space, and you can identify the corresponding image points, so you have
the pairs of Ci and ci required to compute the projection matrix P . It is common to use an array of squares on a flat
piece of paper. This is actually a bad idea, since all the points then lie in a single plane and you only solve for points
on two out of the three axes. It is better to have
identifiable features on two pieces of paper 90 degrees apart. This is harder to make because the angle needs to be
correct, but the improved accuracy pays dividends.

Figure 5: Calibration chart of 4cm squares on two sheets of paper 90 degrees apart. The
y axis is the normal to the bottom sheet, the z axis normal to the top sheet. The
x axis is along the fold line. The origin is on the fold half way between the two
centre squares.

Figure 5 shows an example calibration chart. Note that it has one major problem: you can't tell which way up the
camera is. If you were to turn the camera upside-down, the projection would be inverted. Putting unique identifiers in
the chart (like a bright red square in one corner) would go a long way towards solving this problem.
Now, we can start searching and building correspondences. Writing this robustly is difficult, since the pixel spacing
of the squares is unknown and there can be any amount of rotation. A feature detector will do a reasonably good job
of finding the corners of the squares, but it will also pick up the corners of the chart and anything else “corner-like”
in the image. Note you really don’t need to understand the following algorithms: feel free to skip to the next section.
There are many ways of reading a calibration chart, and this one (my own invention) has advantages and disadvantages.
If you need to read a calibration chart, this algorithm might give you some ideas.
The first step in reading a calibration chart is to locate where in the image the corners of each square of the calibration
chart are. Then, the squares are identified using their layout in the image.
This approach uses quadrants: whatever the rotation of the square, one corner will lie in each quadrant of the square:
one in the top left, one in the top right, one in the bottom left and one in the bottom right, with respect to the centre
of the region. Figure 6 shows the process.
Algorithm 5
To locate the positions of the corners of each square in the image:

• Threshold the image at the average intensity of the centre portion of the image (assuming this is where the
chart is)
• Region grow (also called connected components analysis) on all black regions in the thresholded image, to find
the centre and size of each black region. Filter out any region which is far too small or big to be a black square.
The output is a region image: each pixel in this image indicates the region number that the pixel belongs to in
the original image. A region number of zero indicates that the point is white or belongs to a region that is too

11
Harris corner

Centre of region

Figure 6: Example of corner finding: The grey area is the square found by the region
finder. The region is divided into quadrants which meet at the centre of the
region. The filled circles are Harris features, marked as corners of this square.
The hollow circle is a Harris feature rejected as a corner of the square since it is
not the furthest corner in its quadrant from the region centre

big or small.
• Run a Harris feature detector [3] and select the 144 strongest points (there are 96 corners so there are 50%
more points in this list than expected corners). Call this list of points H.
• Dilate the region image by one pixel (to make sure that the located corner points lie within a region)
• For each corner h in H (with position hij ):
– Identify to which region each corner point hij belongs by reading the pixel ij from the dilated region
image. Let R be the region to which it belongs and xy be the centre point of R
– Identify to which quadrant the point belongs based on whether i is less than or greater than x and j is less
than or greater than y
– Mark this point as being in the appropriate quadrant of square R.
• If any quadrant of any region has more than one point, set the corner to be the furthest point from the centre of
the region, using the city block distance d(a, b) = |a_x − b_x | + |a_y − b_y |.

At this point, we have (hopefully) identified the exact position of the corner of each square in the image, and using
the quadrant system we have identified which corner is which. As the Harris feature detector might fail to locate the
corners of some of the squares, some of the quadrants might be empty. Now comes the really tricky bit: working
out which square is which. Because some quadrants might be empty, some points might have to be extrapolated
(predicted).
Algorithm 6

• Find the top-left most region in the image with all corner points marked in each quadrant. Assume (to start with
at least) that this is the top left square of the calibration chart.
• output the correspondence between the 3D position of the top left square and the image position of the centre
of the square.
• predict the location of the next square (see Fig. 7). If a, b, c and d are the corners, the corners of the next square will
lie close to 2b − a, 3b − 2a, 2d − c and 3d − 2c.
• Find the square s closest to the predicted location.

Figure 7: Identifying which square is which: corners a, b, c and d of the current square are used to predict the corners
of the next square.

• If any corners are missing from s, use the predicted location for the corner, otherwise use the actual location.
• output the correspondence between the 3D position of s and the image position of the centre of the square.
• continue until the end of the row, predicting the position of each next square from the previous one. Then move on
to the next row, predicting the first square of the next row downwards from the top left square.

The centre points are used for correspondences since they are more predictably located than the corners.
If too many points needed to be predicted rather than read from the actual location, the chances are that the region
assumed to be the top left of the chart was wrong. So, pick the next top-left most region and see if you need to predict
fewer points.
Note that the first row of black squares on the bottom sheet starts in the fold. This makes it easier to predict the
position of these squares.
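As a small illustration of the prediction step in Algorithm 6, here is a sketch in Python with numpy; the corner positions are assumed to be 2D image co-ordinates:

    import numpy as np

    def predict_next_square(a, b, c, d):
        # a, b, c, d are the image positions of the current square's corners.
        # The next square along the row should have its corners near these points.
        a, b, c, d = (np.asarray(v, dtype=float) for v in (a, b, c, d))
        return 2 * b - a, 3 * b - 2 * a, 2 * d - c, 3 * d - 2 * c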

7 Recommended exercise: Your first test


The rather convoluted algorithms of section 6 give reasonably accurate results if you can get them to locate most of
the regions correctly. To start with, set up a couple of cameras on tripods, make a calibration chart like the one in
Fig. 5, and photograph it. Then locate (by hand if you don’t fancy implementing section 6) the corners of the squares,
and output them and the corresponding 3D positions of the squares. Taking the origin as half way along the centre
fold means the top left corner of the top left square will be (-14,24,0) and the bottom right corner will be at (14,0,-20),
with the z axis being depth away from the camera and the scale in cms.
Feed all those correspondences into the RANSAC algorithm to find P for both cameras. You can confirm that it has
worked by drawing the reconstructed positions of all points on top of the image.
Now that you've got P 1 and P 2 , put some object in the scene, measuring its 3D position relative to the calibration chart
origin. Photograph again, and feed the pixel positions into the Stereo algorithm. If you get roughly the right 3D
position out, you know you are on the right track!

Shameless plug
I (the author of the paper) am a freelance researcher and developer, working on problems like this, and specialising in
development of plug-ins for visual media post–production/digital special effects packages like Shake and Maya. I’d
be more than happy to advise on any questions you have about any aspect of stereo vision or any other image processing
problem, and I’d be much more than happy to take some money off you to develop specialised imaging software for
any application.
Refer to my website, http://www.peterhillman.org.uk/ for more information and more shameless plugs

References
[1] H. Cantzler. Random Sample Consensus (RANSAC). http://www.inf.ed.ac.uk/.
[2] James Foley, Andries van Dam, Steven Feiner, John Hughes, and Richard Phillips. An Introduction to Computer
Graphics. Addison-Wesley, 1993.
[3] Chris Harris and Mike Stephens. A combined corner and edge detector. In Alvey Vision Conference, 1988.
[4] Richard Hartley and Andrew Zisserman. Multiple View Geometry. Cambridge University Press, second edition, 2003.
[5] William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. Numerical Recipes in C: The
Art of Scientific Computing, chapter 2.6: Singular Value Decomposition, pages 59–70. Cambridge University
Press, second edition, 1992.
