www.elsevier.com/locate/trc
a,* Digital Imaging Research Centre, Faculty of Computing, Information Systems and Mathematics, Kingston University, Penrhyn Road, Kingston upon Thames, Surrey KT1 2EE, United Kingdom
b Ipsotek Ltd., P.O. Box 54055, London SW19 4WE, United Kingdom
c Centre for Transport Studies, University College London, Gower St, London WC1E 6BT, United Kingdom
Received 14 August 2002; received in revised form 23 March 2006; accepted 25 May 2006
Abstract
The timely detection of potentially dangerous situations involving passengers in public transport sites is vital to improve the safety and confidence of the travelling public. Conventional CCTV systems are monitored manually, so that a single observer is typically responsible for dealing with tens or hundreds of cameras at a time. Thus, important events might be missed or detected too late for effective action. This paper gives an overview of motion-based methods used in a system, developed as part of an EU-funded research project, to detect three important situations of interest to public transport operators. The style has been kept intentionally general so as to provide a broad understanding of the transport needs being addressed. Emphasis is given to the performance of these methods as assessed with a large set of video recordings supplied by metropolitan railway networks in London, Paris and Milan.
© 2006 Elsevier Ltd. All rights reserved.
Keywords: Visual surveillance; Personal security; Public transport security; Pedestrian monitoring; Motion estimation; Background estimation
1. Introduction
There is widespread recognition that public transport networks can make a significant contribution towards shifting patterns of travel, especially in big cities, from private means to public means. National and supranational policies thus aim to reduce the levels of congestion and pollution and, in general, improve the quality of life of citizens. This is a complex problem that concerns the implementation of truly integrated modes of transport, improvements in mobility and accessibility, long-term investment in infrastructure, taxing regimes
*
Corresponding author. Tel.: +44 20 8547 7719; fax: +44 20 8547 7972.
E-mail addresses: sergio.velastin@kingston.ac.uk, sergio.velastin@iee.org (S.A. Velastin), boghos.boghossian@ipsotek.com (B.A.
Boghossian), mavs@transport.ucl.ac.uk (M.A. Vicencio-Silva).
0968-090X/$ - see front matter © 2006 Elsevier Ltd. All rights reserved.
doi:10.1016/j.trc.2006.05.006
and fares policies. As discussed by Sanchez-Svensson et al. (2001), an important part of the necessary improvements is to make public transport systems safer in terms of personal security, both in actual terms and, perhaps more importantly, in how they are perceived by the travelling public, especially those that currently either choose not to use them or are effectively excluded from them. The increasing demands on public transport networks have therefore led to extensive deployment of Closed Circuit Television (CCTV) systems to improve the safety and confidence of the travelling public. Continuous or time-lapsed recordings of CCTV cameras are kept as visual evidence to help with a posteriori investigation of unusual events or criminal actions. CCTV systems can therefore potentially play an important role in the planning of urban crowd management and routine pedestrian data collection.
On-line management of crowds requires continuous monitoring by surveillance officers. In periods with high levels of crowding it is crucial that those managing a facility have timely information on the areas where problems are likely to arise, so that incidents can be prevented before they adversely affect the normal operation of a transport facility. In a typical network, images from a number of cameras are routed to a Control Centre located at a station (in large interchanges) or at a remote location dealing with many stations. A Control Centre might deal with video signals originating from 100 to 500 cameras. A small subset of these (5-30) is then shown, for manual observation, on an array of monitors. Most networks use CCTV in what can be called a passive or reactive mode, whereby careful monitoring only takes place once an unusual situation has been reported by ground personnel or the public. Some networks use CCTV in an active mode, where one or two human observers are responsible for permanently looking at the monitors to detect situations of interest. Trained CCTV operators have an excellent ability to spot abnormal behaviour even when the subjects cover less than 5% of the observed image. However, there is a limit to what such operators can do. For example, in a busy commuter station all the monitoring effort is normally concentrated on detecting congestion on train platforms. Important events at other locations might be missed completely or not detected promptly. Moreover, according to research carried out by the Police Scientific Research Branch at the UK's Home Office (Wallace and Diffey, 1998), CCTV observers suffer from "video blindness" after 20-40 min of observation. Network operators might not be in a position to employ more monitoring staff and, in fact, there is a desire and public demand for the existing staff to be in closer contact with the public on the ground.
In summary, partly as a response to public demands for increased safety, there has been a rapid increase in the number of CCTV systems installed to monitor public places (e.g. London has over 20,000 cameras for public transport alone). Being operated on a daily basis, these systems generate huge volumes of data that could provide valuable information on routine patterns of behaviour and site usage, but which is too expensive and tedious to analyse manually. Events that require immediate action are missed because the few available human observers cannot see all the cameras simultaneously and spend much of their time dealing with uneventful situations. It is increasingly difficult for those in charge of CCTV systems to deal with and manage the large amount of potentially useful information that they generate.
There is, therefore, a need for automating the pedestrian monitoring task. To investigate and demonstrate this were the main aims of the project CROMATICA (Crowd Management through Telematic Imaging and Communications Assistance), funded under the European Union's (EU) Framework IV Research and Development Programme. This paper provides an overview of the methods used and the results obtained by the authors within that project. Their work dealt with the detection of potentially dangerous situations involving people in underground metropolitan railway stations. The term "potentially dangerous" refers to situations that require the attention of a human operator to prevent a possible uncontrollable condition. These results formed the basis for the follow-up projects, ADVISOR and PRISMATICA, also funded by the European Union (under its Framework V Research and Development Programme), whereby the methods outlined here were integrated as part of an advanced multi-camera, multi-sensor (video, audio, smart cards and wireless cameras) system. Note that, to provide sufficient description of results and a global appreciation of the work described here, space limitations prevent in-depth description and characterisation of algorithms. Full details are given in Boghossian (2001). The reader is also directed, for example, to Velastin et al. (2004) and Lo et al. (2003) for complementary work carried out as part of the PRISMATICA project.
The starting point was a worldwide survey (Langlais, 1996) carried out among public transport operators that identified their monitoring priorities. This was followed up by detailed informal 2-h interviews with the 3 shifts of control room operators in one of the busiest stations in the London Underground network for the
researchers to gain sufficient understanding of their manual observation methods and the areas more likely to benefit from automatic detection assistance. Further details are given in Sanchez-Svensson et al. (2005). It became clear that a number of situations are detected manually using motion cues. Those rated by operators as being most critical, and then considered as amenable to detection through image processing, are:
1. Overcrowding, defined as the presence of too many people in a given area (i.e. where density has exceeded a pre-defined safe threshold). An interesting situation occurs when overcrowding is associated with lack of movement (i.e. congestion), where from an operational point of view the thresholds of acceptability tend to be lower than those applied for moving crowds.
2. Forbidden or unexpected directions of motion (e.g. counter-flow in a one-way corridor), defined as a significant amount of motion outside a range of permitted directions.
3. Stationary individuals or objects, defined as the consistent non-moving presence of people or objects (over a minimum size) exceeding a pre-determined safe (typical) time threshold.
The work reported here dealt with the detection of these situations in main circulation areas (ticket halls and corridors). For subsequent work on estimating congestion in underground platforms, please see Lo and Velastin (2001).
This paper is organised as follows. Section 2 reviews some of the relevant previous work, while Section 3 describes the main hardware and algorithmic components of the developed system. Then Section 4 shows how detection mechanisms have been built from these components and Section 5 presents the experimental results with a representative set of real-world data. The paper ends with Section 6, which gives the main conclusions and suggestions for further work.
2. Relevant work
2.1. Obtaining motion vectors
The changes in illumination that occur from one image to another can be represented by so-called motion vectors. At any given position in the image, the change is measured by a displacement and a direction. Given the large amounts of data involved (for example in images of 512 × 512 pixels each), this process is computationally intensive but needs to be robust over a range of small and large displacements. The net result is a (motion) vector field in image space that correlates to changes in the scene arising from object movements, illumination changes or camera movements. This is useful information for a computer-based video analysis system. A popular method of computing image motion is the so-called block-matching algorithm, whereby for any given rectangular block in an image, a similar block is found in the subsequent image. When such a block is found, the difference in position between the two blocks corresponds to the motion vector (i.e. it indicates by how much the original block has moved). The use of the block-matching technique was first proposed in Velastin et al. (1993) for estimating the general trends in motion of crowds by considering regions in histograms of velocity directions. The use of a block-matching technique to detect the direction of motion of crowds was also considered by Bouchafa et al. (1997), while a more detailed study conducted by Yin (1996) showed that sufficiently accurate estimation of crowd movements can be obtained through the appropriate settings of the operating parameters (size of block, size of search window).
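As an illustration of the principle only (not the authors' hardware implementation; the exhaustive sum-of-absolute-differences search and the default block and search-window sizes are assumptions consistent with the values quoted later in the paper), a minimal block-matching sketch is:

```python
import numpy as np

def block_match(prev, curr, block=8, search=24):
    """Estimate one motion vector per non-overlapping block by exhaustive
    SAD (sum of absolute differences) search in a search window centred
    on the block's position in the previous frame."""
    h, w = prev.shape
    r = (search - block) // 2                      # search radius around the block
    vectors = np.zeros((h // block, w // block, 2), dtype=int)
    for by in range(h // block):
        for bx in range(w // block):
            y, x = by * block, bx * block
            ref = prev[y:y + block, x:x + block].astype(int)
            best, best_dv = np.inf, (0, 0)
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy, xx = y + dy, x + dx
                    if yy < 0 or xx < 0 or yy + block > h or xx + block > w:
                        continue                   # candidate falls outside the image
                    cand = curr[yy:yy + block, xx:xx + block].astype(int)
                    sad = np.abs(ref - cand).sum()
                    if sad < best:
                        best, best_dv = sad, (dy, dx)
            vectors[by, bx] = best_dv              # (row, column) displacement
    return vectors
```

The hardware described in Section 3 performs the same exhaustive search in a systolic array; in software this triple loop is the bottleneck that motivates the dedicated board.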
2.2. Estimating the background
A major component of a visual surveillance system is the process that separates the (irrelevant) background from the (relevant) foreground. In this context, the background corresponds to the fixed environment (floors, walls, pillars, ticket offices and so on) and the foreground to the generally transient objects (staff, passengers and their belongings). As human beings, we have the ability (the mechanisms of which are yet to be fully understood) to recognise objects. We know what a table or a floor looks like and, more importantly, what their purpose is in a given environment, so that when we see a scene (directly or through a television screen) we can automatically concentrate on the foreground even if there are short-term or long-term changes such as sudden or
gradual variations in lighting. The applications discussed here clearly require robust continuous unattended operation, and a system that cannot adapt to environmental changes cannot be considered. There has been much effort in the computer vision research community to emulate this ability, at least at the very low level of pixel intensity (as opposed to objects and their contextual meaning). At its most basic level, it is possible to work on differences of images (known as interframe images) so as to eliminate longer-term variability. These are effective for detecting intrusion into sterile zones (and hence somewhat misleadingly called "motion detectors" in the surveillance industry) but suffer from noise (inherent in differential operation). Some early work (e.g. Ridder et al., 1995; Wern et al., 1997; Tsuchikawa et al., 1995) dealt with methods to model and filter out illumination variability, e.g. through the use of Kalman filters. A more interesting approach is that first proposed by Stauffer et al. (2000), where the temporal properties of pixel illumination (and colour) are regarded as stochastic and approximated by a mixture of Gaussian distributions. A given pixel at a given point in time is considered to be either background or foreground depending on its probability of belonging to the calculated distributions. At the core of this approach is the assumption that background pixels occur more often. This is not necessarily the case in crowded conditions. Also, an object that remains in the same position for some time eventually becomes part of the background and disappears. When the object moves off again, it leaves behind a "ghost" image (of the background it was previously covering). So, these approaches have serious limitations for the applications considered here. The novelty of our approach is to qualify the background estimation process (and hence that of event detection) with motion vector information.
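For concreteness, a much-simplified single-pixel sketch of the mixture-of-Gaussians idea follows. It is a loose reading of Stauffer et al.'s method with hypothetical parameter values, and it is emphatically not the motion-qualified approach proposed in this paper; a real system would keep one mixture per pixel and operate on colour.

```python
import numpy as np

class PixelMixture:
    """Mixture of k Gaussians for ONE greyscale pixel (illustrative only)."""
    def __init__(self, k=3, alpha=0.05, match_sigma=2.5, bg_ratio=0.7):
        self.w = np.full(k, 1.0 / k)           # component weights
        self.mu = np.linspace(0.0, 255.0, k)   # component means
        self.var = np.full(k, 900.0)           # component variances
        self.alpha, self.match_sigma, self.bg_ratio = alpha, match_sigma, bg_ratio

    def update(self, x):
        """Fold intensity x into the mixture; return True if x looks like background."""
        d = np.abs(x - self.mu)
        matched = d < self.match_sigma * np.sqrt(self.var)
        if matched.any():
            i = int(np.argmin(np.where(matched, d, np.inf)))  # closest matching component
            self.mu[i] += self.alpha * (x - self.mu[i])
            self.var[i] += self.alpha * ((x - self.mu[i]) ** 2 - self.var[i])
        else:
            i = int(np.argmin(self.w))          # replace the weakest component
            self.mu[i], self.var[i] = float(x), 900.0
        hit = np.zeros_like(self.w)
        hit[i] = 1.0
        self.w = (1.0 - self.alpha) * self.w + self.alpha * hit
        self.w /= self.w.sum()
        # background = the highest-weight components covering bg_ratio of the mass
        order = np.argsort(-self.w)
        cum = np.cumsum(self.w[order])
        bg = {int(c) for c, s in zip(order, cum) if s <= self.bg_ratio}
        bg.add(int(order[0]))                   # the dominant component always counts
        return i in bg
```

The sketch makes the two weaknesses noted above visible: a pixel covered by a crowd for long enough shifts the dominant component towards the crowd's intensity, and a newly revealed patch of true background is initially classified as foreground (a "ghost").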
2.3. Measuring crowding levels
The estimation of crowding levels in public places gained significant interest in the early literature. See, for example, Velastin et al. (1993, 1994), Ivanov et al. (1998), Regazzoni et al. (1993), Regazzoni and Tesei (1994), Schofield et al. (1995), Coianiz et al. (1996), Ottonello et al. (1992) and Marana et al. (1997). This is because the measurement of crowding levels plays an important role in ensuring public safety and in measuring levels of service. One of the approaches is to establish a direct relationship between the number of feasible image features (e.g. edge pixels, vertical edges, foreground pixels, circles, blobs, etc.) and the crowding levels (or the number of people in the scene). In this paper we follow the work reported by Davies et al. (1995) and Tsuchikawa et al. (1995) by using foreground blocks as features to estimate the number of people in the scene. Moreover, we address the problem of perspective distortion and its effect on the accuracy of the estimation results. We also present a perspective-distortion correction method derived by automatic camera calibration.
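The feature-counting idea can be sketched in a few lines. Here the per-row weight vector stands in for the perspective-correction curve derived later in the paper, and `blocks_per_person` is a hypothetical calibration constant (the number of foreground blocks a single nearby pedestrian typically occupies), not a value from the paper:

```python
import numpy as np

def crowd_estimate(foreground, row_weight, blocks_per_person=6.0):
    """Crowding level from a binary map of foreground image blocks.
    `row_weight` holds one perspective-correction factor per block row
    (1 at the bottom image row, rising with depth)."""
    weighted_blocks = (foreground.astype(float) * row_weight[:, None]).sum()
    return weighted_blocks / blocks_per_person
```

With `row_weight` set to all ones this degenerates to a plain foreground-block count; the weights matter precisely when the crowd is not distributed homogeneously in the image.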
2.4. Detecting stationarity
The detection of stationary objects or people in complex (cluttered) environments has been addressed in the past mainly through three approaches: temporal filtering as in Takatoo et al. (1996), frequency domain methods as in Davies et al. (1995) and motion estimation as in Bouchafa et al. (1997). The typical problems associated with the detection of stationarity in complex scenes are:
- Frequent occlusion of the stationary object by moving pedestrians.
- Occlusion of the stationary object by moving pedestrians wearing shades of colour similar to the background.
- Continuous change in the pose and position of human subjects suspiciously waiting in public places.
In Sections 3.3 and 4.2 we show how some of these challenges can be addressed using the foreground and
motion information extracted from images.
2.5. Perspective distortion
The data obtained from a single fixed camera corresponds to a projection, on the imaging plane, of the reflection from objects in the real world. So, there is inevitably a distortion that results from perspective
(e.g. objects nearer to the camera appear to be bigger than those further away). If the distortion is not taken into account, measurements based on image features (such as crowding level estimation) will be biased, especially if what is observed is not distributed homogeneously in the images. There are well-established means of compensating for this distortion by establishing the correspondence between the image plane and the (real) ground plane. This is typically done through a process of careful camera calibration (see for example Seitz and Dyer, 1995). When contemplating the deployment of this type of system over hundreds and perhaps thousands of cameras (which are inevitably moved from time to time), such manual methods become an obstacle. Methods to automatically estimate the scene structure in similar cases have tended to follow one of the following approaches:
- Scene structure from controlled or uncontrolled camera movement as in Vieville and Faugeras (1995).
- Depth from defocus (Nayar et al., 1996).
- Stereo and multi-camera imaging (Rander et al., 1996).
- Range sensors (Indyk and Velastin, 1994).
- Surface orientation from repetitive texture and pattern analysis (Schaffalitzky and Zisserman, 1998).
Here we propose a method based on using motion information and exploiting the observation that, on the whole, people move within a narrow range of speeds.
3. Base system components
The prototype system (Fig. 1) consisted of a Pentium PC fitted with a monochrome (256-level grey scale) video digitiser and a specific-purpose motion detection board developed by the authors. The digitiser converts analogue images (direct from cameras or pre-recorded on video tape) at 25 frames per second (fps) and a resolution of 512 × 512 pixels. The PC feeds these images to the motion detector (described in the next section), which then returns motion vectors for further processing, namely noise reduction, foreground extraction, perspective correction and event detection. In this section we concentrate on the preliminary processes that take place before event detection.
3.1. Block-matching motion detector
A block-matching motion detection approach has been used because it preserves motion discontinuities and allows efficient implementation through a hardware systolic array. The work reported in Yin (1996) showed that accurate results can be obtained with a block size of 8 × 8 pixels and a search window of 24 × 24 pixels. When using non-overlapping blocks, a 512 × 512 frame would be processed in about 500 ms (2 fps) on a 2.4 GHz Pentium processor. Although this seems fast at first sight, we need to bear in mind that this is only the first processing step and further processing is required to reduce noise, estimate background/foreground, correct perspective
Fig. 1. System architecture: camera → PX500 video digitiser → STi3220 motion detector → Pentium PC (image processor) → alarm to alert the operators.
Fig. 2. Typical raw motion vectors (shown in white superimposed on the input image).
and detect events. We also need to consider that the cost of a general-purpose computer can still be a significant overhead when compared to the cost of a surveillance camera.
In this system, real-time block-matching motion detection is performed by specialised hardware that operates on images of 512 × 512 pixels. Further details are given in Boghossian and Velastin (1998). Motion vectors are calculated at full video rate (25 fps) from consecutive pairs of images, using non-overlapping blocks of 8 × 8 pixels (software-selectable). A typical set of output motion vectors (raw: before any further processing) is shown in Fig. 2. Including the additional processing of data carried out by the PC leading to event detection, the system performed at rates between 5.5 and 16 fps (depending on the complexity of feature extraction and detection), which is well within the requirements of potential end-users (Langlais, 1996).
3.2. Motion noise reduction
The raw outcome of the block-matching motion detection stage is inevitably sensitive to image noise. The most significant noise sources experienced are camera and mains frequency interference, digitisation and recording media noise. Therefore, a pre-processing stage is necessary to eliminate outliers, using a 3 × 3 mean filter followed by a 3 × 3 median filter. Throughout the processing chain, a user-defined Region of Interest (ROI) is used to disregard areas where analysis is not needed (e.g. ceilings and far away regions). A typical result is shown in Fig. 3.
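The filtering chain just described can be sketched as follows (a simplification of the actual pipeline: each component of the vector field is filtered independently, with the 3 × 3 sizes quoted above, and the ROI is a binary mask):

```python
import numpy as np

def filt3(a, op):
    """Apply a 3 x 3 neighbourhood operation `op` (np.mean or np.median)
    to a 2-D array, with edge replication at the borders."""
    p = np.pad(a, 1, mode='edge')
    windows = np.stack([p[i:i + a.shape[0], j:j + a.shape[1]]
                        for i in range(3) for j in range(3)])
    return op(windows, axis=0)

def denoise_motion(vectors, roi):
    """3 x 3 mean filter followed by a 3 x 3 median filter on each component
    of the motion-vector field, then masked by the Region of Interest."""
    out = np.empty(vectors.shape, dtype=float)
    for c in range(vectors.shape[-1]):
        out[..., c] = filt3(filt3(vectors[..., c].astype(float), np.mean), np.median)
    return out * roi[..., None]        # zero all vectors outside the ROI
```

The mean filter attenuates isolated spurious vectors and the median filter then removes the residual blur around them, so a single-block outlier is reduced to the level of its neighbourhood.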
3.3. Background estimation
Motion features are used to control a statistical estimation of the background image, whereby only stationary blocks are considered to be background updating candidates. In the context here, stationary blocks are those determined as not being foreground (explained later), having a filtered block-matching motion vector of zero and having an interframe difference of zero (an example of an interframe image is shown in Fig. 4). Starting from a fast rough estimation of the background (based on instantaneous detection of non-moving blocks), the process adaptively reduces ambiguities in the motion vector map. Then, stationary scene patterns are kept in a multi-layer temporal (history) array that is updated continuously to hold the most frequently repeated patterns at the top layer. Patterns that persist over a sufficiently long period of time, within a narrow range of intensities to account for slight variations of illumination mainly due to shadows, are then considered to correspond to the scene background. This period of time is selected on the basis of the expected reaction time of operators upon the detection of a stationary person or object (typically of the order of 5 min). Fig. 5 shows a typical image at the start of this process and Fig. 6 shows the corresponding result of background estimation.
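A simplified sketch of the multi-layer history idea for a single image block follows; the number of layers, the intensity tolerance and the persistence threshold are hypothetical values, and the motion-based gating described above is assumed to have happened before a block is offered to the history:

```python
import numpy as np

def update_history(history, counts, block, tol=10.0):
    """Fold one stationary block sample into a multi-layer temporal history.
    `history` is (layers x pixels), `counts` records how often each layer's
    pattern has repeated.  A sample matching a stored pattern within `tol`
    grey levels reinforces that layer; otherwise it evicts the rarest one.
    Layers are kept sorted so the most frequent pattern sits on top."""
    diffs = np.abs(history - block).mean(axis=1)
    i = int(np.argmin(diffs))
    if diffs[i] < tol:
        counts[i] += 1
        history[i] += (block - history[i]) / counts[i]   # running mean of the pattern
    else:
        i = int(np.argmin(counts))                       # evict the least-repeated layer
        history[i], counts[i] = block.astype(float), 1
    order = np.argsort(-counts)                          # most repeated on top
    return history[order], counts[order]

def background(history, counts, min_count=50):
    """Accept the top layer as background once it has persisted long enough
    (min_count samples, standing in for the ~5 min operator reaction time)."""
    return history[0] if counts[0] >= min_count else None
```

Because transient patterns (passing crowds) rarely repeat at the same intensity, they churn through the lower layers while the true background accumulates repetitions at the top.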
Fig. 4. Example of an interframe image: (a) image at time t, (b) image at time t + 1 and (c) interframe image (white: no motion).
Fig. 8 shows the imaging model and describes the perspective distortion problem. It also shows a linear correction curve as a function of the vertical position in the image. A linear correction curve is used to compensate for the perspective distortion effect because the latter is proportional to the inverse of the object depth. Considering the bottom image row as the base for the linear correction curve projection, a weight of one is assigned to it. Based on the geometry of the scene and the camera (Yin et al., 1995), a linear increase in the weights is introduced along the ground plane rows with a slope proportional to the maximum perspective distortion expected {R}. Finally, the rows above the projected ground plane are assigned a constant weight because they lie at the same depth. Two variables are involved in the estimation of the correction curve, namely: the maximum projection distortion {R} and the extent of the ground plane {H}.
The fact that pedestrians circulate in planes perpendicular to the ground plane and at different depths in the scene enforces the need to use a single correction factor throughout the body of each pedestrian. This factor corresponds to the ground plane projection of the body position. This would require complex and time-consuming segmentation techniques to define the area occupied by each pedestrian in the scene. Alternatively, it is possible to encode the average pedestrian height variation into the correction curve in a way that allows the integral of the correction curve over the pedestrian's body height to be equal to the correction factor at the feet. Fig. 9 shows a typical updated correction curve for one of the ticket halls in one of the stations used for experimental purposes (shown in Fig. 7).
Fig. 8. Imaging model: projection of the scene onto the image plane, with maximum distortion R = H1/H2 and the linear distortion correction curve.
Fig. 9. Original and updated perspective correction curves for one of the ticket halls.
Fig. 10. Pinhole camera scene-projection model.
Fig. 10 shows a pinhole camera scene-projection model for a monocular imaging system with a fixed camera. Hence, the transformation of world co-ordinates to image plane co-ordinates is given by Eq. (1) and velocity component projections are given by Eq. (2):
x' = fx/(f + z),   y' = fy/(f + z)                    (1)
vx' = fvx/(f + z),   vy' = lfvy/(z(f + z))            (2)
where (x, y, z) and (x', y') are world and image co-ordinates respectively, (vx, vy, vz) and (vx', vy') are world and image velocities respectively, f is the imaging system focal length and l is the camera height.
Assuming a constant world velocity and constant imaging parameters (f, l), the horizontal component of the projected object velocity (vx') is inversely proportional to the object depth (z), whereas the vertical component is inversely proportional to the square of (z). Consequently, the object depth, and hence the scene structure, can be estimated from these cues.
Fig. 11 shows the scene model used to estimate the two variables involved in the derivation of the distortion correction curve as in Fig. 9. The velocity at each row in the image is estimated via a temporal averaging filter and the velocity curve (plotted on the right) is used directly to estimate {R} and {H}. The maximum projection distortion {R} is defined as the ratio of maximum to minimum experienced velocities, as they correspond to the velocities of the closest and furthest pedestrians respectively. The average height of the closest pedestrian {H2} is estimated as the length of the region with maximum velocity. This can be used to estimate the average height of the furthest pedestrian by dividing by {R}. Then, the ground plane border is derived from {H2} and the length of the motion-free region.
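This estimation can be sketched from a per-row average speed profile; the minimum-speed threshold and the 0.9 band used to delimit the "maximum velocity" region are hypothetical values, not parameters from the paper:

```python
import numpy as np

def scene_parameters(row_speed, min_speed=0.1):
    """Estimate {R}, {H2} and the ground-plane extent from a per-row average
    speed profile (index 0 = top image row).  R = vmax/vmin over the moving
    rows; H2 = length of the near-maximum-speed region (closest pedestrian);
    the furthest pedestrian is roughly H2/R rows tall, so the ground-plane
    border sits about that far below the motion-free region."""
    moving = row_speed > min_speed
    v = row_speed[moving]
    R = float(v.max() / v.min())
    h2 = int((row_speed > 0.9 * row_speed.max()).sum())
    motion_free = int(np.argmax(moving))          # rows above any observed motion
    border = motion_free + h2 / R                 # feet of the furthest pedestrian
    ground_extent = len(row_speed) - border       # {H}: rows covering the ground plane
    return R, h2, ground_extent
```

As the results in Section 5.5 suggest, this works best for dominantly horizontal motion, since the vertical velocity component decays with the square of depth and quickly drops below any usable threshold.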
Fig. 11. Scene model, average velocity profile and correction curve used to estimate {R} and {H}.
Fig. 12. Abnormal direction of motion, shown by white vectors (Paris Metro).
the digital video they could quickly see who left the package and make a judgement on their intentions. Without this confirmatory mechanism, stations are being unnecessarily evacuated and closed (this is a daily occurrence in a network such as London's), with the subsequent loss in revenue and inconvenience to passengers.
We define what we call a scene information array that holds the number of samples (images) during which the corresponding image block (8 × 8 pixels) has been stationary. An image block is first set as a candidate for a stationary area once it satisfies two conditions: it does not belong to the background and it experiences no motion. Subsequently, cells in the information array corresponding to candidate blocks are incremented on each new sample unless they belong to the background and there is no motion. These two sets of conditions, which control the beginning and resetting of the information array counters, provide immunity against occlusion, including cases of moving people with shades of grey similar to the background. A region-growing algorithm is used to update the information array cells as candidate individuals change position or pose. Image blocks removed from the information array due to sudden changes in position are reintroduced to the array at the new positions by this algorithm, allowing slow or overlapping changes to be recovered within a few seconds (typically 3.25 s). A final process clusters neighbouring blocks that have remained stationary for a period longer than a user-defined value (typically the 2-min period mentioned earlier). The presence of one or more such clusters triggers the detection of this type of abnormal situation. Figs. 13 and 14 show typical examples.
An abnormal stationarity event (of an object, a person or a group of people and so on, as we are not concerned here with classifying different types of objects or identifying people) is then said to have occurred if there is a region of a (perspective-corrected) size that exceeds a user-defined value over a period of time also determined by the user.
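A minimal sketch of the counter logic described above (the region-growing and clustering steps are omitted, and block-level foreground/motion flags are assumed to come from the earlier stages):

```python
import numpy as np

def update_stationarity(counts, foreground, moving):
    """One update step of the per-block stationarity counters.
    A block starts (or keeps) counting while it is non-background and
    motionless, keeps counting while occluded (foreground AND moving),
    and is reset only when it is background with no motion."""
    start = foreground & ~moving            # candidate condition
    reset = ~foreground & ~moving           # background and motionless
    counts = np.where(reset, 0, counts)
    active = start | (counts > 0)
    return np.where(active, counts + 1, counts)

def stationary_alarm(counts, threshold):
    """Blocks stationary for longer than the user-defined period
    (threshold = number of samples corresponding to, e.g., 2 min)."""
    return counts >= threshold
```

The occlusion immunity falls out of the reset condition: a pedestrian passing in front of a stationary object makes the block "foreground and moving", which neither starts nor resets the counter, so counting simply continues.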
Fig. 13. Dealing with occlusion from moving pedestrians (images magnified for clarity): (a) stationary person detected, (b) stationary person is occluded and (c) stationary person still detected.
Fig. 14. Dealing with changes in position and pose (images magnified for clarity): (a) stationary person detected, (b) person moves to the right and (c) person re-detected after 3 s.
Table 1
Performance figures for various background reference-image estimation methods

Method           Estimation time    MAE
1                40                 30.97
2                22                 9.22
3                18                 21.39
4                20                 6.40
5                14                 6.34
6 (this work)    13                 6.20
background labelling operation used is based on absolute similarity. Table 1 shows the performance measures for six methods operating on the same video sequence (the last one corresponds to the work presented here). The estimation time is an important factor when continuous adaptation to variations in the scene background is needed, e.g. as in Ivanov et al. (1998), and in such cases a simple statistical model is fast and good enough. However, the additional use of motion information in the process introduces robustness against situations where the background is occluded by moving crowds for a long period. It therefore achieves better results in shorter periods.
5.2. Estimation of crowd motion direction
For ground truthing purposes, a pedestrian entering and leaving the camera's field of view is defined as an event. The system estimation is then compared with the manual observation records to estimate performance figures. The under-sampling approach to overcome the up-down head and body movements discussed in Section 4.2 proved ineffective, whereas the segmentation approach proved more successful in avoiding false alarms. Table 2 shows the performance assessment figures of the latter method.
5.3. Stationarity
The algorithm presented in Section 4.3 has been evaluated to verify its robustness against:
1. 100% occlusion. Complete occlusion by moving or standing pedestrians.
2. Occlusion with the same colour as the background. Occlusion by moving pedestrians wearing grey shades similar to the background shades (note that only eight cases were considered due to lack of data).
3. Pose and position variations. Movement of limbs and torso or shift in standing location with at least 1% overlap with the original position, with an updating period of 3.25 s.
Moreover, the accuracy of the detection delay is assessed for each of the above tests. Consequently, the evaluation metrics are defined as stability with occlusion, stability with occlusion with background shades, accuracy of detection delay and accuracy in updating the stationary area. In this evaluation process, an event is defined as the case of a pedestrian standing within the area of interest for more than the user-defined period (2 min). Table 3 shows the performance figures for the tests mentioned above.
5.4. Correction of perspective distortion
The effect of the perspective distortion correction approach has been evaluated for the range of crowding levels present in our dataset to estimate its performance. Fig. 16 shows the automated and manual estimation
Table 2
Motion direction estimation performance (for 250 events)

Walking direction:   Up 68%,  Down 32%
True positive: 99.6%    False positive: 0.8%    True negative: 0.4%
Table 3
Performance figures for stationary object detection (detection percentage)

Test                                      True positive    False positive    True negative
Normal occlusion                          97.9             0                 2.1
Occlusion with background colour          87.5             0                 12.5
Detection delay accuracy 2 min ± 5 s      100              0                 0
Position updating in 3.25 s               100              4                 0
Fig. 16. Estimated and manual crowding levels.
gures. It is obvious that the eect of occlusion at high crowding levels is signicant. That eect is embedded
in the adopted approach, where the features used to estimate the number of pedestrians in the scene (nonbackground image blocks) are not immune to occlusion. The non-linearity in the relation between automated
and manual measurements can be ignored if operation is within the linear region. However, setting high
crowding alarm-triggering values causes the system response to become slower and the true negative rate to
increase. On the other hand, the performance of the proposed non-linear correction procedure has been tested
against the conventional (linear) approach, showing that the latter overestimates the crowding levels when
pedestrians are close to the camera whereas the former gives more accurate results.
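The paper does not give the correction formula itself. A minimal sketch of a depth-weighted foreground-block count, assuming a linear model of apparent person size versus image row (the function name and calibration inputs are hypothetical), illustrates the idea that blocks near the camera should count for less:

```python
import numpy as np

def corrected_block_count(fg_mask: np.ndarray,
                          size_top: float, size_bottom: float) -> float:
    """Sum foreground blocks with a per-row weight compensating perspective:
    a pedestrian near the camera (bottom rows) covers more blocks, so each
    of those blocks contributes less to the crowd estimate.

    size_top / size_bottom: apparent person heights (in blocks) at the top
    and bottom image rows, e.g. obtained from a calibration step.
    """
    rows = fg_mask.shape[0]
    # Linear interpolation of apparent person size down the image.
    sizes = np.linspace(size_top, size_bottom, rows)
    # Normalise every row to the far (top) row.
    weights = size_top / sizes
    return float((fg_mask * weights[:, None]).sum())
```

With `size_bottom > size_top`, the weighted count is always below the raw block count, counteracting the overestimation that a linear (uncorrected) approach produces for pedestrians close to the camera.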
5.5. Scene structure from motion
The algorithm has been tested on locations with different geometries, pedestrian movement paths, obstacle
distributions (queues, columns, etc.) and crowding levels. The estimated scene parameters have been compared
with the manually generated ones to measure performance. By analysing the experimental results in Table 4 it
can be seen that in many of the cases studied (especially those involving horizontal motion) the approach gives
results comparable to human performance (whose error is itself estimated to be around 3%). However, we can observe the
following:

- The structure parameters for scenes with dominant vertical paths are poorly estimated, because the vertical
component of the image object velocity vanishes rapidly with depth, causing reduced accuracy in the
estimation.
- Queues and obstacles have a significant effect on the world velocities of objects (pedestrians), therefore causing errors in scene parameters.
- Very low mounted cameras (less than 2.5 m) increase the effect of occlusion, causing uncertainty and thus
errors in estimation.
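The first observation follows directly from perspective projection: for a camera at height h looking over the ground plane, a point at depth Z projects to image height y = f·h/Z, so motion along the camera axis ("vertical" paths in the image) produces an image velocity that falls off as 1/Z², while lateral motion falls off only as 1/Z. A small illustration of this (symbols as just defined; not code from the paper):

```python
def image_velocities(f: float, h: float, Z: float,
                     v_lateral: float, v_depth: float) -> tuple:
    """Image-plane speeds of a ground-plane point at depth Z seen by a
    camera at height h with focal length f (pixel units).

    Lateral world motion:  x = f*X/Z      -> dx/dt = f*v_lateral/Z
    Depth world motion:    y = f*h/Z      -> dy/dt = f*h*v_depth/Z**2
    """
    dx_dt = f * v_lateral / Z       # falls off as 1/Z
    dy_dt = f * h * v_depth / Z**2  # falls off as 1/Z**2
    return dx_dt, dy_dt
```

Doubling the depth halves the lateral image speed but quarters the depth-motion image speed, which is why structure estimates from vertical-path scenes degrade so quickly with distance.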
Table 4
Scene structure parameter estimation for eight different scenes (test conditions per scene include obstacles, horizontal/vertical/diagonal paths, queues and low camera mounting)

         Manual measurement                   Automatic estimation
Scene    Perspective     Ground plane         Perspective     Ground plane
         distortion      extent               distortion      extent
1        2.20            40                   2.33            40
2        n/a             n/a                  n/a             n/a
3        n/a             n/a                  n/a             n/a
4        3.90            49                   4.00            50
5        3.30            43                   3.33            44
6        2.80            39                   2.60            38
7        2.28            46                   2.00            46
8        2.30            45                   2.00            43
Table 5
Performance figures for the overcrowding and congestion estimates (detection percentage)

Method          True positive   False positive   True negative
Overcrowding    95.62           4.00             0.37
Congestion      98.51           0.28             1.21
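One plausible reading of the percentages reported in Tables 3 and 5 (rates relative to the total number of evaluated events; the paper does not spell out the normalisation) can be computed from ground-truth and detection flags as:

```python
def detection_percentages(gt: list, det: list) -> dict:
    """Per-event percentages in the style of Tables 3 and 5, assuming
    each rate is expressed relative to the total number of events.

    gt:  ground-truth event flags (True = event really occurred)
    det: detector output flags  (True = event was reported)
    """
    n = len(gt)
    tp = sum(g and d for g, d in zip(gt, det))          # correctly reported
    fp = sum((not g) and d for g, d in zip(gt, det))    # spurious reports
    tn = sum((not g) and (not d) for g, d in zip(gt, det))
    return {"TP%": 100 * tp / n, "FP%": 100 * fp / n, "TN%": 100 * tn / n}
```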
- The period required to estimate scene parameters is completely dependent on the available motion in the
scene.
- The algorithm does not converge in cases where large permanent obstacles restrict pedestrian movements.
This is the case in test 3, where the camera gives a side view of underground ticket barriers.
feasibility of such systems. Although much progress is being made in visual surveillance, we are still far
from being able to emulate the human ability to assess, based on experience and a complex set of clues and
contextual information, whether a given situation is likely to develop into a problem. A significant amount of work
still needs to be done to be able, for example, to track people in cluttered conditions and from one place to
another (with spatial and temporal gaps in visibility), to interpret the interactions between people and between
people and the environment, to make sense of posture and gesture clues and, above all, to carry out such interpretation in a robust manner (e.g. one that uses different sources of data to reinforce situational assessment
and that degrades gracefully, or at least can self-assess degradation and ask for assistance, as environmental conditions worsen).
Acknowledgements
The work described in this paper was mainly carried out as part of the EC project TR-1016 CROMATICA. Other partners in this project included UCL (University College London), RATP (Paris public transport operator), LUL (London Underground), Politecnico di Milano, ATM (Milan public transport operator),
Molynx Ltd. (UK), INRETS (French Transport Research Laboratory), USTL (University of Lille) and CEA-LETI (French Atomic Energy Authority). The authors are particularly grateful to Mr. Gary Trimmer, Group
Station Manager, for his cooperation in allowing access to a major London Underground station and its staff.
B.A. Boghossian is now based at Ipsotek Ltd. (UK).
References

Boghossian, B.A., 2001. Motion-based image processing algorithms applied to crowd monitoring systems. Ph.D. Thesis, Department of
Electronic Engineering, King's College London.
Boghossian, B.A., Velastin, S.A., 1998. Real-time motion detection of crowds in video signals. In: IEE Colloquium on High Performance
Architecture for Real-Time Image Processing, London, UK, February 1998, pp. 12/1–12/6.
Bouchafa, S., Aubert, D., Bouzar, S., 1997. Crowd motion estimation and motionless detection in subway corridors by image processing.
In: IEEE Conference on Intelligent Transportation Systems (ITSC'97), pp. 332–337.
Coianiz, T., Boninsegna, M., Caprile, B., 1996. A fuzzy classifier for visual crowding estimates. In: IEEE International Conference on
Neural Networks, 3–6 June 1996, vol. 2, pp. 1174–1178.
Davies, A.C., Yin, J.H., Velastin, S.A., 1995. Crowd monitoring using image processing. Electronics and Communication Engineering
Journal 7 (1), 37–47.
Indyk, D., Velastin, S.A., 1994. Survey of range vision systems. Mechatronics 4 (4), 417–449.
Ivanov, Y., Bobick, A., Liu, J., 1998. Fast lighting independent background subtraction. In: IEEE Workshop on Visual Surveillance, pp.
49–55.
Langlais, A., 1996. Deliverable D2: User Needs Analysis. CROMATICA TR-1016 (CEC Framework IV Telematics Applications
Programme), November 1996 (available on request from the authors).
Lo, B.P.L., Velastin, S.A., 2001. Automatic congestion detection system for underground platforms. In: International Symposium on
Intelligent Multimedia, Video and Speech Processing, IEEE, Hong Kong, 2–4 May 2001, pp. 159–161.
Lo, B.P.L., Sun, J., Velastin, S.A., 2003. Fusing visual and audio information in a distributed intelligent surveillance system for public
transport systems. Acta Automatica Sinica 29 (3), 393–407.
Marana, N., Velastin, S.A., Costa, L.F., Lotufo, R., 1997. Estimation of crowd density using image processing. In: IEE Colloquium on
Image Processing for Security Applications, London, UK, 1997, pp. 11/1–11/8.
Nayar, S.K., Watanabe, M., Noguchi, M., 1996. Real-time focus range sensor. IEEE Transactions on Pattern Analysis and Machine
Intelligence 18 (12), 1186–1198.
Ottonello, C., Peri, M., Regazzoni, C.S., Tesei, A., 1992. Integration of multisensor data for overcrowding estimation. In: IEEE
International Conference on Systems, Man and Cybernetics, 1992, pp. 791–796.
Rander, P.W., Narayanan, P.J., Kanade, T., 1996. Recovery of dynamic scene structure from multiple image sequences. In: IEEE/SICE/
RSJ International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI'96), 1996, pp. 305–312.
Regazzoni, C.S., Tesei, A., 1994. Density evaluation and tracking of multiple objects from image sequences. In: IEEE International
Conference on Image Processing, 1994 (ICIP-94), pp. 545–549.
Regazzoni, C.S., Tesei, A., Murino, V., 1993. A real-time vision system for crowding monitoring. In: International Conference on
Industrial Electronics 1993 (IECON'93), pp. 1860–1964.
Ridder, C., Munkelt, O., Kirchner, H., 1995. Adaptive background estimation and foreground detection using Kalman-filtering. In:
International Conference on Recent Advances in Mechatronics, ICRAM 1995, pp. 193–199.
Sanchez-Svensson, M., Heath, C., Hindmarsh, J., Luff, P., Vicencio-Silva, M.A., Allsop, R.E., Tyler, N., 2001. Deliverable D4: Report on
Requirements for Project Tools and Processes. Part I: Operational Requirements; Part II: Empirical Studies of the Perception of Key
Stakeholders, PRISMATICA Project (GRD1-2000-10601), European Commission, Brussels, September 2001.
Sanchez-Svensson, M., Heath, C., Luff, P., 2005. Monitoring practice: event detection and system design. In: Velastin, S.A., Remagnino,
P. (Eds.), Intelligent Distributed Surveillance Systems. The Institution of Electrical Engineers (IEE), ISBN 0-86341-504-0, pp. 31–54.
Schaffalitzky, F., Zisserman, A., 1998. Geometric grouping of repeated elements within images. In: British Machine Vision Conference,
BMVC 1998, pp. 13–22.
Schofield, A.J., Stonham, T.J., Mehta, P.A., 1995. A RAM based neural network approach to people counting. In: Fifth International
Conference on Image Processing and its Applications, 4–6 July 1995. IEE, pp. 652–656, Conference Publication No. 410.
Seitz, S.M., Dyer, C.R., 1995. Complete scene structure from four point correspondences. In: Fifth International Conference on Computer
Vision, 1995, pp. 330–337.
Stauffer, C., Grimson, W.E.L., 2000. Learning patterns of activity using real-time tracking. IEEE Transactions on Pattern Analysis
and Machine Intelligence 22 (8), 747–757.
Takatoo, M., Onuma, C., Kobayashi, Y., 1996. Detection of objects including persons using image processing. In: 13th IEEE
International Conference on Pattern Recognition (ICPR'96), pp. 466–472.
Tsuchikawa, M., Sato, A., Koike, H., Tomono, A., 1995. A moving-object extraction method robust against illumination level changes for
pedestrian counting system. In: Fifth International Symposium on Computer Vision, pp. 563–568.
Velastin, S.A., Davies, A.C., Yin, J.H., Vicencio-Silva, M.A., Allsop, R.E., Penn, A., 1993. Analysis of crowd movements and densities in
built-up environments using image processing. In: IEE Colloquium on Image Processing for Transport Applications, London, UK,
1993, pp. 8/1–8/6.
Velastin, S.A., Davies, A.C., Yin, J.H., Vicencio-Silva, M.A., Allsop, R.E., Penn, A., 1994. Automated measurement of crowd density and
motion using image processing. In: Seventh International Conference on Road Traffic Monitoring and Control, London, UK, pp. 127–
132.
Velastin, S.A., Lo, B.P.L., Sun, J., 2004. A flexible communications protocol for a distributed surveillance system. Journal of Network and
Computer Applications 27 (4), 221–253.
Vieville, T., Faugeras, O.D., 1995. Motion analysis with a camera with unknown, and possibly varying intrinsic parameters. In: Fifth
International Conference on Computer Vision, pp. 750–756.
Wallace, E., Diffley, C., 1998. CCTV Control Room Ergonomics. Police Scientific Development Branch (PSDB), UK Home Office,
Publication No. 14/98.
Wren, C.R., Azarbayejani, A., Darrell, T., Pentland, A., 1997. Pfinder: real-time tracking of the human body. IEEE Transactions on Pattern
Analysis and Machine Intelligence 19 (7), 780–785.
Yin, J.H., 1996. Automation of crowd data-acquisition and monitoring in confined areas using image processing. Ph.D. Thesis, King's
College London.
Yin, J.H., Velastin, S.A., Davies, A.C., 1995. Image processing techniques for crowd density estimation using a reference image. In:
Second Asian Conference on Computer Vision, Singapore, 5–8 December 1995, vol. III, pp. 6–10.