
MOTION DETECTION WITH AN UNSTABLE CAMERA

Pierre-Marc Jodoin, Janusz Konrad, Venkatesh Saligrama, Vincent Veilleux-Gaboury

Université de Sherbrooke, Département d'informatique, 2500 Boul. de l'Université, Sherbrooke, Qc, Canada, J1K 2R1
[Pierre-Marc.Jodoin,Vincent.Veilleux.Gaboury]@usherbrooke.ca

Boston University, Electrical and Computer Engineering, Boston, MA 02215, USA
[jkonrad,srv]@bu.edu

ABSTRACT

Fast and accurate motion detection in the presence of camera jitter is known to be a difficult problem. Existing statistical methods often produce abundant false positives since jitter-induced motion is difficult to differentiate from scene-induced motion. Although frame alignment by means of camera motion compensation can help resolve such ambiguities, the additional steps of motion estimation and compensation increase the complexity of the overall algorithm. In this paper, we address camera jitter by applying background subtraction to scene dynamics instead of scene photometry. In our method, an object is assumed to be moving if its dynamical behavior differs from the average dynamics observed in a reference sequence. Our method is conceptually simple, fast, requires little memory, and is easy to train, even on videos containing moving objects. It has been tested and performs well on indoor and outdoor sequences with strong camera jitter.

Index Terms: Motion detection, background subtraction, camera jitter.

1. INTRODUCTION

Detecting motion is a key problem in many surveillance applications. Since most applications use a fixed camera, resulting in a static background, the most intuitive and fastest motion detection methods are those that compare pixel color changes between two frames. An intensity difference is usually computed between two successive frames, or between a frame and a reference image containing no moving objects [1, 2]. The computed intensity difference is then thresholded with a predetermined global threshold. Although adaptive thresholds [3] and spatial priors [4] can be used, these methods are sensitive to the very phenomena that violate the basic assumptions of motion detection (a fixed camera with a static, noise-free background). In order to account for noise more accurately, Wren et al. [5] proposed to model the intensity distribution of every background pixel with a Gaussian. Based on the assumption that noise is uncorrelated in time, the parameters of the Gaussian are learned from a sequence of frames with stationary background. Motion detection is then performed by finding, for each video frame, the set of pixels whose probability value is below a predetermined threshold. In many outdoor applications, the background cannot be assumed stationary because of natural events such as wind-shaken trees, animated water, snow, or other natural phenomena. In those applications, the intensity distribution at each pixel is often multimodal and can hardly be modeled by a single Gaussian. Thus, other models have been proposed. Among the most common are models using a mixture of Gaussians [6, 7, 8] and those using non-parametric distributions based on the Parzen-window approach [9, 10]. Still other methods use Wiener [11] or Kalman filters [12, 13] to predict intensity values in the presence of dynamic backgrounds. Such predictors have the advantage of being able to learn repetitive patterns and thus detect moving objects even if their intensity distributions are similar to those of the background.
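To make the single-Gaussian model concrete, the sketch below fits a per-pixel mean and variance over a stack of training frames and flags low-likelihood pixels. The function names, the Mahalanobis-style test, and the threshold value are our own illustrative assumptions, not the implementation of [5].

```python
import numpy as np

def fit_background_gaussian(training_frames):
    """Fit one Gaussian per pixel from a stack of training frames
    (shape M x H x W) with a stationary background."""
    frames = training_frames.astype(np.float32)
    mean = frames.mean(axis=0)
    var = frames.var(axis=0) + 1e-6      # floor to avoid division by zero
    return mean, var

def gaussian_motion_labels(frame, mean, var, k=3.0):
    """Label a pixel as moving when its intensity deviates from the
    per-pixel mean by more than k standard deviations (equivalent to
    thresholding the Gaussian likelihood at a fixed level)."""
    d2 = (frame.astype(np.float32) - mean) ** 2 / var
    return d2 > k ** 2                   # binary motion label field
```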

[Figure 1: an input frame and its motion label field L_t, with pixels A, B and C marked; below, binary activity plots L(x_B, t) and L(x_C, t) over frames 0-360.]

Fig. 1. Typical motion detection result from an unstable camera (top), and motion activity plots for two pixels (bottom). The pixel located near an intensity edge (B) produces many more false positives than the one located in a uniform region (C).

While these methods compensate well for background instability, experiments have shown that many of them fail in the presence of heavy camera jitter. As can be seen in Fig. 1, they often detect false positives near strong intensity edges because extreme intensity shifts occur temporally in those areas. These existing statistical methods have difficulty differentiating such shifts from the ones induced by true scene motion. Although frame alignment by means of camera motion compensation can help resolve such ambiguities, the additional steps of motion estimation and compensation increase the complexity of the overall algorithm. Furthermore, stabilization methods based on phase correlation are only effective when the number of moving objects is small. Therefore, we propose in this paper a new motion detection method that uses no motion compensation to align frames but still performs well in the presence of camera jitter. As opposed to previous methods that model temporal intensity/color distributions, our statistical approach models motion label distributions. The method proposed here makes four contributions. First, it is robust to heavy camera jitter. Second, it uses scene dynamics instead of photometry which, to our knowledge, has never been done before. Third, it is fast and requires a small amount of memory at runtime. Finally, it is easy to train, even on videos containing moving objects.

2. MOTION DETECTION WITH A STABLE CAMERA

Let b = {b(x) | x ∈ S} and I_t = {I(x, t) | x ∈ S} denote, respectively, a background image and a frame at time t that contains the background b with objects moving in front of it, both defined on an A × B orthogonal lattice S. For implementation purposes, b(x) and I(x, t) are assumed to be scalars between 0 and 255 for grayscale sequences and 3-vectors with components between (0, 0, 0) and (255, 255, 255) for color sequences. The goal of motion detection is to estimate a binary label field L(x, t) that indicates whether a pixel x at time t corresponds to a moving point or not. In simple background subtraction, L(x, t) is computed by simply thresholding a frame difference |I(x, t) − Ī(x)|, where Ī is typically the background image b or the previous frame I_{t−1}. Note that in applications where no background images free of moving objects are available, b can be computed with a temporal median filter. In probabilistic motion detection, the temporal intensity/color distribution of every background pixel is modeled by a likelihood probability density function (pdf): P(I(x, t) | L(x, t)). Motion is then detected by comparing this pdf to a predefined threshold. As already mentioned, this pdf is typically modeled by a single Gaussian, a mixture of Gaussians, or a Parzen-window estimate.
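As a concrete reference point, simple background subtraction as just defined fits in a few lines of NumPy. The median-filtered background follows the remark above; the threshold name eps1 anticipates the ε1 used later, and its default value is for illustration only.

```python
import numpy as np

def median_background(frames):
    """Temporal median over a stack of frames (M x H x W): a
    background estimate usable even when objects pass through."""
    return np.median(frames, axis=0)

def subtract_background(frame, b, eps1=35):
    """L(x, t) = 1 where |I(x, t) - b(x)| > eps1, and 0 elsewhere."""
    return np.abs(frame.astype(np.float32) - b) > eps1
```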

[Figure 2: the A-image, the O-image, and the resulting motion label field.]

Fig. 2. The proposed method detects motion by thresholding the difference between the A and O images (A contains the average background activity and O contains the average amount of activity monitored at time t). Subtracting A from O results in a motion label field made of true positives only.

3. PROPOSED METHOD

As mentioned before, camera jitter can be largely removed by camera motion compensation. This, however, requires the additional, computationally complex steps of motion estimation and compensation. The goal of our method is to account for camera jitter without motion compensation. Unlike standard motion detection that uses intensity/color (photometry), we propose to detect motion based on dynamics. Although various characteristics of image sequence dynamics could be used, we believe that among the simplest, but most meaningful, ones is the binary result obtained from background subtraction, namely L. As can be seen in Fig. 1, the motion activity plot for a pixel (L(x_B, t) and L(x_C, t)) strongly depends on where the pixel is located in the image. For instance, since pixel B is located near an edge, the temporal difference |I(x_B, t) − b(x_B)| is often large, resulting in many false positives. Thus, for all practical purposes, the distribution L(x_B, t) has a global average of about 0.5, i.e., motion is detected about half of the time at this pixel. On the other hand, since pixel C is located in a uniform region, in the absence of any moving object its temporal difference is small, resulting in almost no false positives. Thus, for this motion distribution, the global average motion is almost zero. Interestingly, the true scene motion at pixels B and C can be detected using those two distributions only. Based on the assumption that a moving object produces a high amount of local activity, it can be seen that a pedestrian has walked through pixel B around frame 210 and through pixel C around frames 110 and 300. More formally, true positives at time t can be detected when the average amount of activity monitored during the previous w frames, (1/w) Σ_{i=t−w+1}^{t} L(x, i), is significantly larger than the average amount of activity monitored during a training sequence, (1/M) Σ_{t=1}^{M} L(x, t), where L here stands for the binary sequence obtained during training. Since storing L in memory is prohibitive as M gets large, one can pre-compute and store the average amount of activity in L using a two-dimensional structure. We call this structure the A-image and define it as follows:

A(x) = \frac{1}{M} \sum_{i=1}^{M} L(x, i)     (1)

where M is the total number of training frames. As shown in Fig. 2, the A-image contains the average background activity. It thus contains large values near edges (due to camera jitter) and small values in uniform regions. Note that although the training sequence used in Fig. 2 contains moving objects, those objects barely influence the global average contained in the A-image (for sufficiently large M). Similarly, the average amount of activity at time t in the observed sequence can be computed and stored in a 2D image as follows:

O(x, t) = \frac{1}{w} \sum_{i=t-(w-1)}^{t} L(x, i).     (2)
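In NumPy terms, Eqs. (1) and (2) are plain temporal means over binary label stacks. Below is a sketch under that assumption (training labels as an M x H x W boolean array, the observation window as the last w label fields); the incremental variant is our own design note, not taken from the paper.

```python
import numpy as np

def a_image(training_labels):
    """Eq. (1): average background activity over M training label
    fields (boolean stack of shape M x H x W)."""
    return training_labels.mean(axis=0)

def o_image(window_labels):
    """Eq. (2): average activity over the last w label fields
    L(x, t-w+1), ..., L(x, t)."""
    return window_labels.mean(axis=0)

def o_image_slide(O, newest, oldest, w):
    """Incremental form of Eq. (2): slide the window one frame
    forward without re-summing the whole stack."""
    return O + (newest.astype(np.float32) - oldest.astype(np.float32)) / w
```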

As shown in Fig. 2, the O-image contains the average amount of activity observed at time t. Similarly to the A-image, the O-image contains large values near edges (caused by camera jitter), but it also contains large values around moving objects (caused by true positives). Thus, once the A and O images have been computed, a new motion label field L'(x, t) (largely void of spurious false positives) can be computed by means of simple image subtraction and thresholding:

L'(x, t) = \begin{cases} 1 & \text{if } O(x, t) - A(x) > \epsilon_2 \\ 0 & \text{otherwise} \end{cases}     (3)

where ε2 is a global threshold. This can be interpreted as subtraction of the average background activity (A) from the average activity observed at time t (O). Note that in order to account for variations in time, the A-image can also be updated as follows:

A = \alpha A + (1 - \alpha) O     (4)

where α is an updating factor whose value ranges between 0 and 1. This permits adaptation of the average background activity A to changes occurring in the monitored scene (e.g., a weather change introducing wind). Furthermore, the background image b used to compute L can be similarly updated. Pseudo-code for our motion detection method is given in Algorithm 1.
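Eqs. (3) and (4) then amount to one subtraction, one comparison, and one blend per frame. In the sketch below, eps2 and alpha mirror ε2 and α; the default alpha = 0.95 is an illustrative value the paper does not specify.

```python
def detect_and_adapt(O, A, eps2=0.5, alpha=0.95):
    """Eq. (3): threshold the excess activity O - A to get the motion
    label field L'. Eq. (4): blend O into A so the average background
    activity tracks slow changes in the scene."""
    L_prime = (O - A) > eps2            # Eq. (3)
    A = alpha * A + (1.0 - alpha) * O   # Eq. (4)
    return L_prime, A
```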

4. RESULTS

We have tested the proposed motion detection method on indoor and outdoor video sequences. For every sequence, we used ε1 = 35, ε2 = 0.5, M = 300 and w = 24. Also, as shown in the pseudo-code, the background image b is computed with a temporal median filter. We compared our approach with simple background subtraction and with three probabilistic methods: one based on Parzen windows, one based on a single Gaussian, and a Markovian approach (single Gaussian plus a prior label model with an Ising potential [14]). For every method implemented, the training was done regardless of the content of the video sequence, and no post-processing was applied. The outdoor sequences were captured by a network of video cameras (Information Systems and Sciences Visual Sensor Network) deployed on Boston University's Charles River Campus. The camera jitter is due to building ventilation fans near which the cameras are located. Fig. 3 shows results for simple background subtraction and the proposed algorithm on the highway sequence, whereas Fig. 4 shows similar results, plus those for the Parzen-window approach and the two Gaussian models, on the sidewalk and badminton (indoor) sequences. As can be seen, our method eliminates significantly more false positives.

[Figure 3: an input frame from the highway sequence, with the motion label fields produced by background subtraction and by our method.]
Fig. 3. The highway sequence processed with a background subtraction method and with our method.

Algorithm 1 Our Motion Detection Method
Input: I: input video
Output: L': motion label field
Initialization
1: b ← Median(I_0, I_30, I_60, ..., I_N)
2: A ← 0
3: for each frame t from 1 to M do
4:   L_t ← |I_t − b| > ε1
5:   A ← A + L_t / M
6: end for
Motion detection
7: k ← 0
8: for each frame t do
9:   L[k] ← |I_t − b| > ε1
10:  O ← (1/w) Σ_{i=1}^{w} L[i]
11:  L'_t ← (O − A) > ε2
12:  A ← αA + (1 − α)O
13:  k ← modulo(k + 1, w)
14: end for
15: return L'
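Algorithm 1 translates nearly line for line into NumPy. The sketch below assumes grayscale frames and stores the circular buffer L[.] as w boolean masks; packing the masks into bits (e.g., with np.packbits) would give the w/8 bytes per pixel quoted in Section 4. The parameter defaults are the paper's experimental values, but the helper names are ours, and the training step medians all frames rather than every 30th as in line 1.

```python
import numpy as np

def train(frames, eps1=35):
    """Lines 1-6: median background b and A-image from M training frames."""
    frames = frames.astype(np.float32)
    b = np.median(frames, axis=0)                   # line 1 (simplified)
    A = (np.abs(frames - b) > eps1).mean(axis=0)    # lines 3-6
    return b, A

def detect(frames, b, A, eps1=35, eps2=0.5, alpha=0.95, w=24):
    """Lines 7-15: per-frame detection with a circular buffer of the
    last w label fields. Yields one motion label field per frame."""
    buf = np.zeros((w,) + b.shape, dtype=bool)      # L[.]
    k = 0                                           # line 7
    for frame in frames:
        buf[k] = np.abs(frame.astype(np.float32) - b) > eps1  # line 9
        O = buf.mean(axis=0)            # line 10 (meaningful once w frames seen)
        yield (O - A) > eps2            # line 11: motion labels L'
        A = alpha * A + (1.0 - alpha) * O           # line 12
        k = (k + 1) % w                             # line 13
```

A caller would run b, A = train(training_frames) on roughly M = 300 frames and then iterate over detect(video, b, A).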

Note that the badminton sequence, in addition to heavy jitter, contains interlacing artifacts. However, since our method is based on cumulative dynamics, those artifacts are largely suppressed. As mentioned previously, our method uses a small amount of memory at runtime. For each pixel, it requires one floating-point number for each of O and A, and w/8 bytes for L[.] (line 9 of the pseudo-code). This corresponds to a total of 11 bytes per pixel when w = 24, which is significantly less than the 12 floating-point numbers per pixel needed by the one-Gaussian-per-pixel method for color videos. Our method currently runs in Matlab at 20 fps on color images of size 240 × 352 using a 2.1 GHz dual-core laptop. Full video sequences of the detection results can be downloaded from www.dmi.usherb.ca/jodoin/projects/jitter/.

5. CONCLUSION AND FUTURE WORK

We presented a motion detection method that is based on the dynamics of a scene instead of its intensity/color and that does not use explicit camera-shake compensation. Our method detects motion by accumulating the average background activity in an A-image and the average amount of activity at time t in an O-image. Motion is then detected by simply thresholding the difference between O and A, i.e., by removing from O the average amount of background activity accumulated in A. Our method has the advantage of being robust to severe camera instability; it is also conceptually simple, fast, requires a small amount of memory at runtime, and is easy to train, even on sequences containing moving objects. Note that our method relies on two predetermined thresholds (namely ε1 and ε2), which can be a limitation for some applications. As a workaround, one can automatically estimate both thresholds from characteristics of the histograms of |I_t − b| and (O − A), such as the entropy, various statistical moments, or the maximum peak value. A good literature review on this topic is given by Rosin et al. [2]. In the future, we plan to adapt Kalman and Wiener filtering to our method (similar to the methods presented in [11, 12, 13]). We hope this will allow us to better handle camouflage effects and to better detect fast-moving objects. Other, more sophisticated test statistics could also be implemented. For instance, one could learn empirical motion density estimates, suitably sparsified using kernel methods, for both the training and observed data; these empirical densities could then be compared using measures such as the Kullback-Leibler distance.
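As a rough illustration of the histogram-based estimation mentioned above, a moment-based rule might place the threshold a few standard deviations above the mean of the difference image. This toy heuristic is ours, not a method from this paper or from Rosin et al. [2].

```python
import numpy as np

def moment_threshold(diff, k=3.0):
    """Toy moment-based rule: treat values more than k standard
    deviations above the mean of the difference image as outliers.
    Purely illustrative; see [2] for principled criteria."""
    return float(diff.mean() + k * diff.std())
```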
6. REFERENCES

[1] D. Zhang and G. Lu, "Segmentation of moving objects in image sequence: A review," Circuits, Systems and Signal Process., vol. 20, no. 2, pp. 143-183, 2001.

[2] P.L. Rosin and E. Ioannidis, "Evaluation of global image thresholding for change detection," Pattern Recognition Letters, vol. 24, pp. 2345-2356, 2003.


[3] A. Neri, S. Colonnese, G. Russo, and P. Talone, "Automatic moving object and background separation," Signal Processing, vol. 66, no. 2, pp. 219-232, April 1998.

[4] J. Konrad, "Motion detection and estimation," chapter 3.10, Elsevier Academic Press, 2005.

[5] C.R. Wren, A. Azarbayejani, T. Darrell, and A.P. Pentland, "Pfinder: Real-time tracking of the human body," IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 7, pp. 780-785, 1997.

[6] N. Friedman and S.J. Russell, "Image segmentation in video sequences: A probabilistic approach," in Proc. of UAI, 1997, pp. 175-181.

[7] A. Mittal and D. Huttenlocher, "Scene modeling for wide area surveillance and image synthesis," in Proc. of CVPR, 2000, pp. 160-167.

[8] C. Stauffer and E.L. Grimson, "Learning patterns of activity using real-time tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 747-757, 2000.

[9] A. Elgammal, R. Duraiswami, D. Harwood, and L.S. Davis, "Background and foreground modeling using nonparametric kernel density for visual surveillance," Proceedings of the IEEE, vol. 90, pp. 1151-1163, 2002.

[10] A. Mittal and N. Paragios, "Motion-based background subtraction using adaptive kernel density estimation," in Proc. of CVPR, 2004, pp. 302-309.

[11] K. Toyama, J. Krumm, B. Brumitt, and B. Meyers, "Wallflower: Principles and practice of background maintenance," in Proc. of ICCV, 1999, pp. 255-261.

[12] D. Koller, J. Weber, T. Huang, J. Malik, G. Ogasawara, B. Rao, and S. Russell, "Towards robust automatic traffic scene analysis in real-time," in Proc. of ICPR, 1994, pp. 126-131.

[13] J. Zhong and S. Sclaroff, "Segmenting foreground objects from a dynamic textured background via a robust Kalman filter," in Proc. of ICCV, 2003, pp. 44-50.

[14] J.M. McHugh, "Probabilistic methods for adaptive background subtraction," M.S. thesis, Boston University, Dept. of Electr. and Comp. Eng., Jan. 2008.

[15] E.L. Lehmann, Testing Statistical Hypotheses, Springer Texts in Statistics, Springer, 2nd edition, 1997.

[Figure 4: input frames from the badminton and sidewalk sequences, with the motion label fields produced by background subtraction, Parzen windows, a single Gaussian, a single Gaussian with prior, and our method.]
Fig. 4. The badminton and sidewalk sequences. From left to right and top to bottom: an input frame and the motion label fields obtained with background subtraction, a Parzen-window approach, a one-Gaussian approach, a one-Gaussian-with-Ising-prior approach, and our method.
