
A VERY FAST ADAPTIVE FACE DETECTION SYSTEM

Renaud Séguier
IETR Group AC, Team ETSN
Institute of Electronics and Telecommunications of Rennes
Supélec, Avenue de la Boulaie, 35511 Cesson-Sévigné, France
email: Renaud.Seguier@supelec.fr

ABSTRACT
In this article we present a real-time face detector which is robust in difficult conditions (bad camera color calibration and complex background) for embedded systems (mobile phones, PDAs, laptop computers and game-boy-like devices). Our system provides a reliable score qualifying the detection of the faces during the sequence. This score makes it possible to focus the attention of the detector on the shape, the color and the movement of the faces in order to be robust in difficult conditions. No a priori assumption is made on the skin color, which is not the case for the usual fast face detectors. Our algorithm executes in only 5ms on a generic computer, thus leaving 35ms per image to carry out other treatments. Its genericity (C code, no specific hardware or software optimization) is illustrated by an implementation on a PALM (Sony Clié NX70-V) and on a Pentium 4 2.6GHz.

KEY WORDS
Human-computer Interaction, Face Localization.

1 Introduction

Face detection in the field of videophony and video games will have to be implemented on embedded systems (PDAs, game consoles, mobile phones). In this context two major constraints appear: the algorithms must be at the same time robust (against face variability, illumination conditions and complex backgrounds) and rapid (<10ms), in order to allow the system to carry out other processes in real time (compression, 3D face modelling, gesture recognition, etc.). Although many systems have been proposed in the past, meeting these two constraints is still a challenge. Current face detection systems can be classified according to whether they are based on the whole face or on characteristic features [1] [2]. In the first approach a representative database is generated, from which a classifier learns what a face is (Neural Networks, Support Vector Machines, Principal Component Analysis - Eigenfaces...). These systems are sometimes remarkably robust [3] [4] but too complex to be carried out in real time. In the second approach three levels of analysis can be distinguished. At the lowest level, gray pixel values, movement or color are taken into account to detect blobs which look like a frontal face. These approaches are not robust but can be achieved in real time. At the medium level, characteristics independent of light conditions and face orientation are sought. At the highest level, face features such as eyes, nose, mouth and face outlines are associated; deformable models, snakes or Point Distribution Models can then be used. These last models require a good image resolution and are not easily achievable in real time. However, once the face is detected, it is then possible to track it [5]. Our work takes place in the second approach, at the medium level of analysis. The originality of our system lies in its adaptation and in its processing speed without any particular optimization.

When we have confidence in the quality of the detection (the detected face's form and movement must be coherent during several instants), we focus our attention on adapting the system parameters to the color and the shape of the face in the image. The result is a very fast localization system which is robust to the light conditions, the background, and the orientation and color of the detected face. In section 2 we present the adaptive architecture of the system. In the following section we detail the different modules one by one. In section 4 we illustrate the performances of our system on real sequences. Finally, we conclude with the last section, which reports implementation performances and future work.

2 Face Detection system

We model a face as an occasionally moving ellipse delimiting an area of a particular color, which we learn in the very first seconds, and whose movement in the image is not chaotic. Several teams have already implemented an adaptation of the skin color [6]; nevertheless, they postulate that the camera is correctly calibrated and thus use a skin color signature which is known a priori [3] and progressively refined during the sequence. Our system is dedicated to mobile phones, PDAs, laptop computers and game-boy-like systems. Facing the difficulty of having correctly calibrated cameras in those systems (even using automatic white balance) and the tendency of the CCD response to vary with temperature, we do not use a priori information about the skin color. The general idea is to take benefit from the redundancy of information during the sequence (it is always the same face which is detected for several seconds running) to adapt the system parameters to the face characteristics.

The real-time processing (the time quantum being imposed by the acquisition frequency of the camera) increases the robustness of our system for two reasons. First, the adaptation is all the more effective as we exploit the maximum of information, so we must treat all the acquired images; our system being a closed loop, it must thus run at the video frequency. Second, when a face moves quickly, if the acquisition frequency is lower than 20Hz the movement of the face appears chaotic; in the contrary case the movement is coherent and we can, if our process runs at the camera frequency, estimate the future face position in the image. This estimator allows us to qualify the detection quality at every instant: the better it is, the more robust the system will be. Our system consists of eight modules organized according to the diagram of figure 1. In the following section, we detail each one of these modules.
Figure 1. The adaptive face detection system.

3 Face Detection modules

The first four modules (skin learning, color filter, edge extraction and tracking) work directly on the pixels. The two following modules (motion evaluation and ellipse detection) use image contours, and the last two, high-level modules (detection and supervision) exploit the outputs of the preceding modules.

3.1 Pixel level modules

3.1.1 Skin learning module

When the system has confidence in the current detection, a histogram is computed on the Cb and Cr components of the image region detected as a face (gray rectangle in figure 2). We estimate the means {µCb, µCr} and the standard deviations {σCb, σCr} of these histograms; these values constitute the color signature of the face that we wish to lock onto. Each time we have confidence in the detection, we perform the following adaptation for each Sig in {µCb, µCr, σCb, σCr}:

Sig[new] = 0.9 · Sig[old] + 0.1 · Sig[current]    (1)

where Sig[current] is the signature learned at the current moment, Sig[old] is the mean signature defined the last time the system had confidence in the detection, and Sig[new] is the updated mean signature. Figure 2 compares the histograms of the whole image (top histograms), those of the face (bottom histograms) and the Gaussians estimated at the end of ten detections.

Figure 2. Skin signature estimation.

3.1.2 Color Filter module

At the beginning of the sequence, until we have a signature of the skin, we use the red chrominance component Cr of the image, from which we evaluate contours. This component is independent of brightness and makes it possible to prevent parasitic contours on the face due to profile lighting. Once the system has learnt the face signature, this module filters the image and replaces each pixel value by O(Cb, Cr):

O(Cb, Cr) = 255 · exp(−((Cr − µCr)² / (2(α σCr)²) + (Cb − µCb)² / (2(α σCb)²)))    (2)

The parameter α is defined in the supervision module. It decreases in the course of time and makes it possible to focus the attention of the system on the face color as we gain confidence in the detections.

3.1.3 Edge Extraction module

We use a Shen-Castan filter [7] for its precision and execution time. Detected contours are labelled according to four different orientations (horizontal, vertical, and the two diagonals). In order to be able to control the processing time, we adapt the threshold used in the Shen-Castan filter so that the number of contour points is roughly equal to 5% of the total number of pixels in the image. Thus even if the image contrast changes (which often occurs outdoors), generating more contour points, the system is not penalized and runs at the same speed.

3.1.4 Tracking module

The face detected at time t−1 is looked for in the image acquired at time t by a simple block matching. The face position estimated by this algorithm is used in the supervision module to qualify the quality of the current detection.

3.2 Contour level modules

3.2.1 Motion evaluation module

Traditional motion detectors working on the pixels are time consuming; for that reason we detect the movement on the basis of the oriented contours. Because they represent a small percentage of the image and are labelled according to their orientation (see 3.1.3), it is possible to use them to detect the motion very quickly and in a reliable way. For each orientation we compare the lists of contour points of the images acquired at times t−1 and t. The sum of the contour points that differ between the two images gives us an idea of the amount of motion, and the comparison of this sum to a threshold enables us to decide whether the face moved in the sequence.

3.2.2 Ellipse detection module

We use a Fuzzy Generalized Hough Transform (FGHT) to locate objects which look like ellipses. This particular Hough transform has the advantage of being able to detect objects of a great variability around an average template. The vote in the accumulators is weighted by two Gaussian functions. The first one takes into account the difference between the position of the contour point and that of the contour point corresponding to the searched template [8]. The second one takes into account the difference between the angle detected on the current contour point and that of the contour point corresponding to the searched template [9]. We do not use a Randomized Hough Transform [10] because the selected contour points are increasingly relevant during the sequence (because of the adaptive filtering) and are in a very restricted number (the implemented accumulators are dedicated to ellipses of a little more than twenty pixels in width, which does not allow a very great number of points devoted to the face outlines).

Figure 3. Hough space when the face is still or moving.

3.3 High level modules

3.3.1 Detection module

We detect the cell of maximum value in the accumulators generated by the FGHT. The position of this cell in the accumulator indicates the position of the face in the image; the index of the accumulator gives us the size of the localized face. This module can seek this maximum value in a restricted number of accumulators if the system estimates that the face has a particular size.

3.3.2 Supervision module

This module supervises the execution of the previously described modules. If any movement is detected, it forces the ellipse detection module to take into account only the contours which moved. The interest is that the contours associated with the background are then not considered, making the detection more robust. As one can see in figure 3, the signal in the accumulators is much less disturbed when only moving contours are taken into account. Nevertheless, when no motion is detected, the ellipse detection module considers all the edges of the image. This is why we have a greater confidence in a detection when it was done on moving contours.

We compare the estimated face position delivered by the tracking module to that given by the ellipse detection module. If these two positions are close (estimated centers within a radius of a third of the detected face width), it is natural to have a little more confidence in the detection than in the contrary case. In addition, if the size of the detected face does not vary too much in the course of time, it is also an additional argument to have confidence in the current detection. The strategy is thus the following. If:
- the tracking and ellipse detection modules give appreciably the same values,
- movement is detected,
- the detected face sizes are the same three times running,
then we estimate we have caught the face. If the face is caught five times running, then we have confidence in the current detection. When we have confidence in the detection, two actions are performed:

- The skin learning module is activated. At the beginning of the sequence, we wait until we have analyzed the face color N times (N = 5 in practice) before giving the skin signature to the filter module. The positive value α used by this last module decreases in the course of time (see figure 4). If Nb is the number of times the system had confidence in the detection since the beginning of the sequence, each time Nb is a multiple of N, α is modified in the following way:

α = 2.5 − Nb / (2N)    (3)

Figure 4. Evolution of α.

- It is indicated to the detection module to seek the face only in the accumulator corresponding to the currently detected ellipse size and in the two closest accumulators. We thus focus the attention of the detector on particular face sizes.

4 Face Detection performances

In order to accelerate the treatments, we carry out each module in a multiresolution mode. In practice the images are acquired at a resolution of 320x240 pixels, in 4:2:0 YCbCr; the Cb and Cr components thus have a resolution of 160x120 pixels. We set up three levels of resolution by sub-sampling the images by factors of 2, 3 and 4. For each resolution level, we use two accumulators dedicated to ellipses of 20 and 24 pixels width. Given the sub-sampling factors, we are thus able to detect faces ranging from 40 to 100 pixels wide. As one can see in figure 5, the system is able to locate faces of variable sizes, in spite of their orientation, their form and the presence of noise in the background (noise in terms of contours and color). Its capacity to distinguish between arms and face makes it interesting compared to real-time detection systems based on skin color and blob analysis.

Figure 5. Detection examples.

The system being finally based on contours, it is interesting to test the ability of the FGHT to detect ellipses in spite of a background strongly disturbed in terms of contours. Figure 6 shows the result of detection under such conditions, before the system has learned the detected face color, during a rotation.

Figure 6. Rotated face in a complex background.

To illustrate the adaptation of our system to different face colors, we constrained the camera to film the same scene under different camera calibrations by specifying: an automatic white balance (sequence 1), an indoor light different from neon (sequence 2, blue dominant), and an outdoor light (sequence 3, yellow dominant), whereas the scene is in fact illuminated by neon. As we can note in figure 7, the histograms (Cr and Cb) of the face contained in the scene vary in a significant way from one sequence to another.
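The face-size coverage of the multiresolution setup described at the beginning of this section can be checked with a short sketch (our own illustration: the products of the two nominal accumulator widths, 20 and 24 pixels, by the sub-sampling factors 2, 3 and 4 span the reported 40 to roughly 100 pixel range).

```c
#include <stddef.h>

/* Illustrative check of the multiresolution coverage: detectable
 * face widths in the full-resolution image are the accumulator
 * ellipse widths multiplied by the sub-sampling factors. */
#define N_FACTORS 3
#define N_WIDTHS  2

static const int factors[N_FACTORS] = { 2, 3, 4 };
static const int widths[N_WIDTHS]   = { 20, 24 };

/* Fills out[] with the detectable nominal face widths, smallest
 * factor first; returns the number of entries written. */
static size_t detectable_widths(int out[N_FACTORS * N_WIDTHS]) {
    size_t n = 0;
    for (size_t i = 0; i < N_FACTORS; ++i)
        for (size_t j = 0; j < N_WIDTHS; ++j)
            out[n++] = factors[i] * widths[j];
    return n;
}
```

The nominal widths run from 2×20 = 40 to 4×24 = 96 pixels; since the FGHT tolerates variability around each template, this matches the 40 to 100 pixel range stated above.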

Figure 7. Skin estimation at times t1, t2 and t3 (sequence 1: automatic white balance; sequence 2: indoor light; sequence 3: outdoor light).

As we said in the introduction, our system makes no a priori assumption on the face color. Thereby, as one can see in the graphs of figure 7, our system learns the signature of the detected face more and more finely as the sequence progresses. At time t1 the Gaussians are rather broad (α = 2); they are refined quickly at time t2 and end up correctly estimating the color of the face at time t3. The impact of the filter on the extracted contours is significant, as figure 8 shows. Progressively during the sequence we note an attenuation of the background, which contains panels whose colors are close to that of the face.
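The narrowing of the Gaussians over the sequence follows the supervision schedule of equation (3). A minimal sketch (the lower clamp is our own assumption, added so the filter width never reaches zero; the paper's figure for the evolution of α stops at 0.5):

```c
/* Equation (3): alpha = 2.5 - Nb / (2N), re-evaluated each time the
 * number of confident detections Nb reaches a multiple of N. */
static double alpha_for(int nb, int n) {
    double a = 2.5 - (double)nb / (2.0 * n);
    return a > 0.5 ? a : 0.5; /* clamp: assumed, not stated in the paper */
}
```

With N = 5 this gives α = 2 after the first N confident detections (the broad Gaussians at time t1), then 1.5, 1.0 and 0.5 as confidence accumulates, which is the refinement visible between t1 and t3.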

Figure 8. Background attenuation.

We tested our system on the first ten persons of the European database M2VTS (Multi Modal Verification for Teleservices and Security applications [11]), which gives us a database of 6400 still images. The persons pronounce four times (at one-week intervals) the digits from 0 to 9. We chose this base to compare our system with the work of [12], which proposed a very fast detection system. They estimate the face to be correctly located if the distance between the detected center and the real face center is lower than 30% of the face width; they reach a rate of 96.5% of good detections. We locate the faces correctly in 97.5% of the cases on this base. Our results are thus comparable with the state of the art when we apply our algorithms to still images. Let us note that we implement a localization system (only one face is searched in the image), which is why we do not present a false detection rate.

5 Implementation and conclusion

This system was implemented in C code on a Pentium 4 2.6GHz without any specific optimization (neither hardware nor software). A usual webcam is used and acquires the video at 25Hz via the USB port. Our detector is carried out in only 5ms, thus leaving 35ms per image to process other treatments (compression, face analysis and so on). These performances are to be compared with those of [13], which takes 92ms (extrapolated from the results in the article: 300ms for a 320x240 image on a P3 800MHz); of [14], which takes 66ms to locate a face with a DSP; and of [15], which takes 33ms to detect a face (extrapolated from the results presented in the article: 110ms for a 320x240 image on a P3 800MHz). The most powerful results in terms of speed and effectiveness are those of [12], which only considers the oriented contours for frontal face detection. The robustness of their system is comparable to that of neural architectures, with a very low execution time (40ms for the detection of a 27x32 face in 320x240 images on an Athlon 1GHz). Ported to our P4 2.6GHz, their detector would reach about 15ms. We thus remain three times faster, while being able to detect frontal and profile faces and while providing the color signature of the detected faces. We have also implemented this detector on a PALM (Sony Clié NX70-V). Unfortunately, the integrated video camera drivers not being available, the PALM version of our detector is restricted to still images acquired beforehand with the PALM. We still have to integrate this detector in a more complete tool for audio-video speech recognition [16], in order to produce a flexible, non-invasive and real-time voice recognition system in noisy audio environments.

References

[1] M.H. Yang, D.J. Kriegman, and N. Ahuja. Detecting faces in images: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002.
[2] E. Hjelmas and B.K. Low. Face detection: a survey. Computer Vision and Image Understanding, 2001.
[3] R. Féraud, O.J. Bernier, J.-E. Viallet, and M. Collobert. A fast and accurate face detector based on neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001.
[4] C. Garcia and M. Delakis. A neural architecture for fast and robust face detection. In Proceedings of the IEEE-IAPR International Conference on Pattern Recognition (ICPR'02), 2002.
[5] K. Toennies, F. Behrens, and M. Aurhammer. Feasibility of Hough-transform-based iris localisation for real-time application. In International Conference on Pattern Recognition, 2002.
[6] V. Girondel, L. Bonnaud, and A. Caplier. Hands detection and tracking for interactive multimedia applications. In International Conference on Computer Vision and Graphics, 2002.
[7] J. Shen and S. Castan. An optimal linear operator for step edge detection. Graphical Models and Image Processing (CVGIP), 54(2), 1991.
[8] S.M. Bhandarkar. A fuzzy probabilistic model for the generalized Hough transform. IEEE Transactions on Systems, Man, and Cybernetics, 1994.
[9] R. Séguier, A. Le Glaunec, and B. Loriferne. Human faces detection and tracking in video sequence. In Proc. 7th Portuguese Conf. on Pattern Recognition, 1995.
[10] N. Kiryati, H. Kalviainen, and S. Alaoutinen. Randomized or probabilistic Hough transform: unified performance evaluation. Pattern Recognition Letters, 2000.
[11] S. Pigeon. M2VTS. www.tele.ucl.ac.be/PROJECTS/M2VTS/m2fdb.html, 1996.
[12] B. Froba and C. Kublbeck. Robust face detection at video frame rate on edge orientation features. In International Conference on Automatic Face and Gesture Recognition, 2002.
[13] X. He, Z.M. Liu, and J.L. Zhou. Real-time human face detection in color image. 2003.
[14] K. Imagawa et al. Real-time face detection with MPEG4 codec LSI for a mobile multimedia terminal. In International Conference on Consumer Electronics, 2003.
[15] C.C. Chiang, W.N. Tai, M.T. Yang, Y.T. Huang, and C.J. Huang. A novel method for detecting lips, eyes and faces in real time. Real-Time Imaging, 9, 2003.
[16] R. Séguier and N. Cladel. A multiobjective genetic snakes application on audio-visual speech recognition. In 4th EURASIP Conference focused on Video, Image Processing and Multimedia Communications, 2003.
