
AN MPEG-4 FACIAL ANIMATION PARAMETERS GENERATION SYSTEM

Gunnar Hovden and Nam Ling, Computer Engineering Department, Santa Clara University, Santa Clara, USA

ABSTRACT

We present a method for generating MPEG-4 FAPs (Facial Animation Parameters) from a video sequence of a talking head. The method includes a render unit¹ that animates a face based on a set of FAPs. The render unit provides feedback to the FAP generation process as guidance toward an optimal set of FAPs. Our optimization process consists of minimizing a penalty function, which includes a matching function and a few barrier functions. The matching function measures how well an animated face matches the original face. Each barrier function indicates the level of distortion from a normal-looking face for a certain part of the face, and advises the optimizer. Unnecessary FAPs are eliminated and the search is partitioned to speed up the optimization process. Three different search techniques, the Steepest Descent Method, the Linear Search Method, and the Cyclic Coordinates Method, are applied to derive an optimum.

I. INTRODUCTION

The MPEG-4 FBA (Face and Body Animation) standard [1] defines a face model that can be used to animate the human face. A total of 68 FAPs (Facial Animation Parameters) are defined by the standard. Each FAP describes the movement of a certain feature point on the face (for example, the left corner of the inner part of the lip or the right corner of the right eyebrow). A stream of FAPs can animate the movements, moods, and expressions of a talking face. Very high compression can be achieved by this method, making it suitable for many consumer electronics products, including video phones and video conferencing, cell phones and PDAs with video capabilities, and Internet agents with virtual human face interfaces. In addition, PC and video games will benefit from the added realism provided by close-ups of animated human faces, and characters in cartoons can be animated with FAPs generated by a human actor for better facial expressions.

The success of facial animation relies on being able to reliably and accurately track facial features and generate FAPs. Many methods have been proposed, including the use of eigenspaces and deformable graphs [2], segmentation and geometrical characteristics [3], color snakes [4], color tracking [5], as well as edge detection and templates [6]. Known difficulties with the above methods include lack of robustness and accuracy. The uniqueness of our method is the use of feedback from the render unit to ensure good resemblance between the animated face and the original face from the video sequence. This improves the robustness and accuracy over existing methods.

II. PROPOSED FAP GENERATION SYSTEM

Figure 1 shows an overview of the proposed FAP generation system. A video sequence contains frames of a talking person's face. The render unit [7] is based on a three-dimensional face model with texture mapping and delivers high-quality, lifelike animations of the face based on the FAPs. For the first iteration all FAPs are set to 0, which corresponds to a face with a neutral expression and closed mouth looking straight ahead. The animated face is compared to the original face from the video sequence and the penalty for the animated face is computed. The penalty indicates how well the animated face matches the original face; it reflects how humans perceive colors, patterns, and similarities between pictures. A higher value means a poorer match, and if the penalty is 0, the original face and the animated face are identical. The penalty is fed back to the FAP optimizer and is used to guide it toward a better set of FAPs. Determining the values of the best FAPs is an iterative process, which continues until some stopping criteria are fulfilled; the FAPs are then finalized.

¹ We would like to thank face2face animation inc. [7] for generously letting us use their face model and render unit.
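In outline, the system is an analysis-by-synthesis loop. The following minimal C sketch shows only that control flow; compute_penalty() and propose_better_faps() are hypothetical names, reduced here to dummy arithmetic standing in for the render-and-compare step (Section III) and the optimizer update (Section VII).

```c
/* Analysis-by-synthesis loop from Section II. The two helper routines
 * are hypothetical stand-ins: a real system would render the face from
 * the FAPs and compare it pixel by pixel with the original frame. */
#include <stdio.h>

#define NUM_FAPS 66                       /* fap3 .. fap68 */

/* Dummy objective with its minimum away from the neutral face. */
static double compute_penalty(const double faps[NUM_FAPS])
{
    double p = 0.0;
    for (int i = 0; i < NUM_FAPS; i++)
        p += (faps[i] - 0.1) * (faps[i] - 0.1);
    return p;
}

/* Dummy optimizer update: move halfway toward the dummy minimum. */
static void propose_better_faps(double faps[NUM_FAPS])
{
    for (int i = 0; i < NUM_FAPS; i++)
        faps[i] += 0.5 * (0.1 - faps[i]);
}

int main(void)
{
    double faps[NUM_FAPS] = { 0.0 };      /* first iteration: neutral face */

    for (int iter = 0; iter < 1000; iter++) {
        double penalty = compute_penalty(faps);  /* render + compare */
        if (penalty < 1e-9)                      /* stopping criterion */
            break;
        propose_better_faps(faps);               /* feedback to optimizer */
    }
    printf("final penalty: %g\n", compute_penalty(faps));
    return 0;
}
```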


Fig. 1 Overview of the FAP generation process

III. PENALTY FUNCTION

The penalty is a function of the 66 FAPs (not counting the first two FAPs, which are high-level FAPs) defined in the MPEG-4 standard. We can express the penalty as

penalty = f(fap3, fap4, ..., fap68)   (1)

where fap3 is the value for FAP #3, etc. The complexity of the penalty function depends on the quality of the face model and the render unit. If the image from the render unit is lifelike, then a simple penalty function can be used with great success. Our render unit has very high quality, giving very lifelike results, so a pixel-by-pixel penalty function suffices. Each pixel in the original image is compared with the pixel at the corresponding location in the animated image. We use the RGB color space; colors are represented as three-tuples of real numbers (r, g, b) ranging from 0 to 1. A match function, match(r1, g1, b1, r2, g2, b2), tells how well two colors match. We have derived the following match function:

match(r1, g1, b1, r2, g2, b2) = (1 − abs(r1 − r2)) · (1 − abs(g1 − g2)) · (1 − abs(b1 − b2))   (2)

where abs(...) is the absolute value. The match function returns 1 if two colors and their intensities are identical; otherwise it returns a nonnegative value less than 1. It returns a value lower than 1 if the two colors have different intensity levels. The skin on the face is mostly of the same color, so different intensity levels are important for matching. Different colors that happen to have the same intensities also result in a value lower than 1, so that the lips are clearly distinguished from the surrounding area even if they have similar intensities. The penalty is based on the match function and is defined as

penalty = f(fap3, ..., fap68) = Σ(x,y) mask(x, y) · (1 − match(ori(x, y), ani(x, y)))   (3)

where match is applied to the RGB components of the two pixels,

mask(x, y) = 1 if the pixel at (x, y) is part of the face, and 0 otherwise,

and ori(x, y) is the pixel at position (x, y) in the original frame and ani(x, y) is the pixel at position (x, y) in the animated frame. The mask function in (3) masks out the part of the image that is not part of the face, so that the background does not affect the penalty. The mask is based on the rendered face rather than on the original face, since the area occupied by the rendered face, as opposed to the original face, can be determined with 100% accuracy. Figure 2 shows an example of a face and the corresponding mask.

Fig. 2(a) A rendered face (b) The corresponding mask

Although (3) defines the penalty in terms of pixels, color matches, and masks, it is important to remember that this merely provides a way of evaluating the function f(fap3, ..., fap68). The penalty is a function of the 66 FAPs.
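As an illustration, (2) and (3) translate almost directly into C. The pixel type, buffer layout, and function names below are our own minimal stand-ins, not the paper's data structures.

```c
/* Direct transcription of (2) and (3): per-pixel color match and the
 * masked sum of mismatches over the frame. */
#include <math.h>
#include <stddef.h>

typedef struct { double r, g, b; } Pixel;   /* components in [0, 1] */

/* Equation (2): 1.0 for identical colors, smaller otherwise. */
static double match(Pixel a, Pixel b)
{
    return (1.0 - fabs(a.r - b.r)) *
           (1.0 - fabs(a.g - b.g)) *
           (1.0 - fabs(a.b - b.b));
}

/* Equation (3): sum of (1 - match) over face pixels; mask[i] is 1 where
 * the rendered face covers pixel i and 0 elsewhere. */
static double penalty(const Pixel *ori, const Pixel *ani,
                      const unsigned char *mask, size_t n_pixels)
{
    double p = 0.0;
    for (size_t i = 0; i < n_pixels; i++)
        if (mask[i])
            p += 1.0 - match(ori[i], ani[i]);
    return p;
}
```

Each evaluation of this penalty walks the frame buffers once, so its cost grows with the frame area.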


Searching in a 66-dimensional space is hard. Add noise and the problem of local versus global minima, and the task becomes overwhelming. Hence, we need search strategies to make the search practical.
IV. ELIMINATING UNNECESSARY FAPS

Not all of the FAPs defined in the MPEG-4 FBA standard are necessary to make a lifelike animation that is truthful to the original face. A total of 45 FAPs are used in our experiment:

• Jaw (2 FAPs)
• Outer lip (10 FAPs)
• Inner lip (10 FAPs)
• Cheeks (4 FAPs)
• Eyelids (4 FAPs)
• Eyeballs (4 FAPs)
• Eyebrows (8 FAPs)
• Head pose (3 FAPs)

The MPEG-4 FBA standard assumes that the head does not move vertically, horizontally, or back and forth. It is, however, necessary to know the location of the head before the other FAPs can be found. Our search space is therefore augmented to 48 dimensions, as sketched below.
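For concreteness, one possible memory layout of the 48-dimensional search vector is sketched below in C; the grouping offsets and names are our own illustration, not prescribed by the paper or the standard.

```c
/* One possible layout of the 48-dimensional search vector: the 45
 * selected FAPs plus 3 head-location values. Offsets and names are
 * illustrative only. */
enum {
    JAW        = 0,   /*  2 values */
    OUTER_LIP  = 2,   /* 10 values */
    INNER_LIP  = 12,  /* 10 values */
    CHEEKS     = 22,  /*  4 values */
    EYELIDS    = 26,  /*  4 values */
    EYEBALLS   = 30,  /*  4 values */
    EYEBROWS   = 34,  /*  8 values */
    HEAD_POSE  = 42,  /*  3 values: roll, pitch, yaw */
    HEAD_LOC   = 45,  /*  3 values: horizontal, vertical, scale (not FAPs) */
    SEARCH_DIM = 48
};

typedef double SearchVector[SEARCH_DIM];
```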
V. PARTITION THE SEARCH SPACE

The complexity of the optimization problem can be reduced by partitioning the 48-tuple of FAP values into smaller, independent optimization problems. It is considerably easier (and faster) to partition a search space into independent spaces of lower dimension and search in each smaller space than to search in the original space. Dissimilarities between the original and the animated face in the upper part of the face (eyes and eyebrows) do not affect the dissimilarities in the lower part of the face (nose, lips, cheeks, and chin), and vice versa. We can therefore search for an optimal solution for FAPs related to the upper face independently from the FAPs related to the lower face. Further partitioning is possible; an example is to partition the left from the right part of the face. The partitions used in our experiment are as follows (a search sketch follows the list):

• Location (horizontal position, vertical position, and scale) and pose (roll, pitch, and yaw) of the head (6 FAPs)
• Jaw (2 FAPs)
• Inner lip (10 FAPs)
• Outer lip (10 FAPs)
• Left cheek (2 FAPs)
• Right cheek (2 FAPs)
• Left eyelid (2 FAPs)
• Right eyelid (2 FAPs)
• Left eyeball (2 FAPs)
• Right eyeball (2 FAPs)
• Left eyebrow (4 FAPs)
• Right eyebrow (4 FAPs)
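The search sketch referenced above: each coordinate group is optimized in turn while the rest of the vector is held fixed. The partition table follows the list above under the illustrative layout of Section IV; optimize_subspace() is a stub standing in for the search methods of Section VII.

```c
/* Independent optimization of each partition of the 48-vector.
 * Offsets follow the illustrative layout given in Section IV. */
#define SEARCH_DIM 48

typedef struct { int start, count; } Partition;

static const Partition partitions[] = {
    { 42, 6 },   /* head location and pose: searched first */
    { 0,  2 },   /* jaw       */
    { 12, 10 },  /* inner lip */
    { 2,  10 },  /* outer lip */
    { 22, 2 },   /* left cheek   */  { 24, 2 },  /* right cheek   */
    { 26, 2 },   /* left eyelid  */  { 28, 2 },  /* right eyelid  */
    { 30, 2 },   /* left eyeball */  { 32, 2 },  /* right eyeball */
    { 34, 4 },   /* left eyebrow */  { 38, 4 },  /* right eyebrow */
};

static void optimize_subspace(double x[], int start, int count)
{
    (void)x; (void)start; (void)count;   /* placeholder: see Section VII */
}

static void optimize_all_partitions(double x[SEARCH_DIM])
{
    int n = (int)(sizeof partitions / sizeof partitions[0]);
    for (int i = 0; i < n; i++)
        optimize_subspace(x, partitions[i].start, partitions[i].count);
}
```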

VI. ANATOMICAL CONSTRAINTS

The penalty function as given by (3) has no knowledge of the appearance of a normal face. What may seem to be a good fit according to the penalty may look exaggerated, twisted, and distorted to a human. We need to guide the FAP optimizer unit to avoid such faces. So far we have attempted to generate FAPs by minimizing the penalty function, penalty = f(...). We now add one term, which includes knowledge about how normal, non-distorted faces look. The problem then becomes to minimize penalty = f(fap3, ..., fap68) + barrier(fap3, ..., fap68). The purpose of the function barrier(...) is to tell the FAP optimizer unit whether a set of FAPs generates a distorted or unnatural face. If barrier(...) = 0, the set of FAPs generates a normal-looking, undistorted face. A set of FAPs resulting in barrier(...) > 0 represents an unnatural or distorted face; the higher the value of barrier(...), the more distorted the face is. The name of the function is chosen because it serves as a barrier that is difficult for optimization algorithms to cross, forcing the optimization to produce FAPs that do not generate distorted or unnatural faces. The barrier function does not prohibit any set of FAP values; it simply advises the optimizer unit on how a natural face should look. Even unnatural or unusual facial expressions are permitted if the minimal point in f(fap3, ..., fap68) is very dominant.

A. Shape of lips

A problem encountered while optimizing the lips is that feature point 2.9 gets too close to feature point 2.7 and feature point 2.6 gets too close to feature point 2.8, as depicted in Figure 3. This occasionally happens when the mouth is almost, but not fully, closed. The shape of the upper lip can be approximated by a sine function with the argument ranging from 0 at the right inner lip corner (feature point 2.5) to π at the left inner lip corner (feature point 2.4), where the magnitude is given by the mid inner lip (feature point 2.2). This sine function is drawn with a dashed line in Figure 3. We use a barrier function to suggest to the optimizer unit that a natural-looking upper lip should resemble a sine function from 0 to π. The upper lip is certainly allowed to deviate from the shape of a sine function to ensure accuracy with the original lip.

173

Authorized licensed use limited to: MULTIMEDIA UNIVERSITY. Downloaded on August 2, 2009 at 22:48 from IEEE Xplore. Restrictions apply.

The barrier function is simply a guide toward natural-looking lips. We will skip the geometry and express the barrier as

barrier1(fap3, ..., fap68) = c1 · (distance from feature point 2.7 to the sine function) + c2 · (distance from feature point 2.6 to the sine function)   (4)

for suitable constants c1 and c2. A sketch of this barrier follows.
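A C sketch of (4), under the assumption that "distance" is measured vertically from each feature point to the sine curve at that point's horizontal position (the text skips the exact geometry). The coordinate convention and the constant values are illustrative.

```c
/* Sketch of barrier (4): penalize feature points 2.7 and 2.6 for
 * straying from the sine-shaped upper-lip model. */
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

typedef struct { double x, y; } Point2D;

/* Height of the model lip at x, for a lip spanning from the right inner
 * corner (feature point 2.5) to the left inner corner (feature point
 * 2.4), with the magnitude mid_y taken from feature point 2.2. */
static double lip_model_y(double x, Point2D right, Point2D left, double mid_y)
{
    double t = M_PI * (x - right.x) / (left.x - right.x);   /* 0 .. pi */
    return mid_y * sin(t);
}

static double barrier1(Point2D p27, Point2D p26,
                       Point2D p25, Point2D p24, double mid_y)
{
    const double C1 = 1.0, C2 = 1.0;   /* placeholders for c1, c2 */
    double d27 = fabs(p27.y - lip_model_y(p27.x, p25, p24, mid_y));
    double d26 = fabs(p26.y - lip_model_y(p26.x, p25, p24, mid_y));
    return C1 * d27 + C2 * d26;
}
```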

Fig. 3 Inner lip feature points

B. Upper lip above lower lip

It is difficult to distinguish between the upper and the lower lip when the mouth is closed. Sometimes the optimizer unit will position the lower part of the upper lip below the upper part of the lower lip. We can avoid this problem by insisting that the lower part of the upper lip should always be above the upper part of the lower lip, i.e., feature point 2.7 should be above feature point 2.9, feature point 2.2 should be above feature point 2.3, and feature point 2.6 should be above feature point 2.8. We adopt the notation that 2.3.y means the y-value of feature point 2.3 in the animated face. The following barrier function provides the optimizer unit with the necessary guidance:

barrier2(fap3, ..., fap68) =
    { 0,                     if 2.7.y ≥ 2.9.y
    { c3 · (2.9.y − 2.7.y),  if 2.7.y < 2.9.y
  + { 0,                     if 2.2.y ≥ 2.3.y          (5)
    { c4 · (2.3.y − 2.2.y),  if 2.2.y < 2.3.y
  + { 0,                     if 2.6.y ≥ 2.8.y
    { c5 · (2.8.y − 2.6.y),  if 2.6.y < 2.8.y

The values of the constants c3, c4, and c5 are much higher than the values restricting the shape of the lips (c1 and c2), because we insist that the upper lip should be above the lower lip while we merely make a recommendation regarding the shape of the lips.

C. Thickness of lips

Although the thickness of the lips will vary during the video sequence depending on the mouth opening and the shape of the lips, the result looks best if the variation is limited. We therefore apply a barrier function to limit the variation of the thickness of the lips. A neutral face, according to the MPEG-4 FBA standard, with the mouth closed is obtained by setting all FAPs to zero. We let this be our preferred lip thickness. The barrier function is given by comparing the FAPs affecting the inner lip with the FAPs affecting the outer lip:

barrier3(fap3, ..., fap68) = c6 · (fap4 − fap51)² + c7 · (fap5 − fap52)² + c8 · (fap6 − fap53)² + c9 · (fap7 − fap54)² + c10 · (fap8 − fap55)² + c11 · (fap9 − fap56)² + c12 · (fap10 − fap57)² + c13 · (fap11 − fap58)²   (6)

for suitable constants c6, ..., c13. The search space containing the lips is the most difficult to optimize. The search spaces for the other facial features have much lower dimension and are therefore easier to optimize, so we do not need to impose anatomical constraints on the other facial features. The anatomical constraints we impose on the solution result in the following minimization problem:

penalty = f(fap3, ..., fap68) + barrier1(fap3, ..., fap68) + barrier2(fap3, ..., fap68) + barrier3(fap3, ..., fap68)   (7)

The set of FAP values, fap3, ..., fap68, that minimizes (7) is the solution to our FAP generation problem. Sketches of barrier2 and barrier3 follow.
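C sketches of (5) and (6). The feature-point y-values are read from the animated face; faps[] is indexed so that faps[0] holds fap3; and the constants are placeholders for c3, ..., c13, with the barrier2 values chosen much larger to reflect the text's requirement that c3, c4, c5 dominate c1 and c2.

```c
/* Sketches of barriers (5) and (6). All constants are placeholders. */

typedef struct { double y27, y29, y22, y23, y26, y28; } InnerLipY;

static double sq(double v) { return v * v; }

/* Equation (5): one-sided penalties that grow whenever an upper-lip
 * point falls below its lower-lip counterpart. */
static double barrier2(InnerLipY p)
{
    const double C3 = 100.0, C4 = 100.0, C5 = 100.0;
    double b = 0.0;
    if (p.y27 < p.y29) b += C3 * (p.y29 - p.y27);
    if (p.y22 < p.y23) b += C4 * (p.y23 - p.y22);
    if (p.y26 < p.y28) b += C5 * (p.y28 - p.y26);
    return b;
}

/* Equation (6): quadratic penalty on the difference between each inner
 * lip FAP and its outer lip counterpart (fap4..fap11 vs fap51..fap58).
 * faps[] is indexed so that faps[0] holds fap3. */
static double barrier3(const double faps[])
{
    const double C[8] = { 1, 1, 1, 1, 1, 1, 1, 1 };   /* c6 .. c13 */
    double b = 0.0;
    for (int i = 0; i < 8; i++)
        b += C[i] * sq(faps[4 - 3 + i] - faps[51 - 3 + i]);
    return b;
}
```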
VII. SOLUTIONS TO OUR SEARCH PROBLEM

Three different search techniques are used to solve our FAP generation problem.

• Steepest Descent Method: The optimal position (vertical and horizontal), scale, and rotation (around the x-, y-, and z-axes) of the head are first found using the Steepest Descent Method. The other features of the face, with the exception of the lips (e.g., eyes, eyebrows, and eyelids), are then determined, also by the Steepest Descent Method.

• Linear Search: The lips move very freely on a person's face. A simple linear search is used to provide the more sophisticated Steepest Descent Method with a coarse estimate of the lip position, by gradually opening the mouth in 30 steps. The mouth opening and the shape of the lips are then optimized by the Steepest Descent Method.

• Cyclic Coordinates Method: The last step in the FAP generation process is two iterations of the Cyclic Coordinates Method to refine the FAPs before they are finalized (a sketch follows this section).

Each of the partitioned subproblems that comprise the complete penalty function, penalty = f(fap3, ..., fap68) + barrier1(fap3, ..., fap68) + barrier2(fap3, ..., fap68) + barrier3(fap3, ..., fap68), has been shown to have a condition number close to 1. This indicates that even a simple optimization method like the Cyclic Coordinates Method will perform well in terms of the number of iterations required to reach the minimum. The condition number is calculated as the ratio of the largest eigenvalue to the smallest eigenvalue of the Hessian matrix of the penalty function.
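The sketch referenced above: a minimal cyclic coordinate descent with a step-halving line search. The objective f() is a stand-in quadratic; in the real system it would be the full rendered-face objective (7). Names, step sizes, and tolerances are illustrative.

```c
/* Minimal Cyclic Coordinates Method: minimize along one coordinate at a
 * time, halving the step when neither direction improves the objective. */
#include <stdio.h>

#define SEARCH_DIM 48

/* Stand-in objective (separable quadratic, minimum at the origin); the
 * real objective is the rendered-face penalty plus barriers, eq. (7). */
static double f(const double x[SEARCH_DIM])
{
    double v = 0.0;
    for (int i = 0; i < SEARCH_DIM; i++)
        v += x[i] * x[i];
    return v;
}

static void cyclic_coordinates(double x[SEARCH_DIM], int sweeps, double step0)
{
    for (int s = 0; s < sweeps; s++) {
        for (int i = 0; i < SEARCH_DIM; i++) {
            double step = step0;
            while (step > 1e-4) {
                double best = f(x);
                x[i] += step;                 /* try increasing x[i] */
                if (f(x) < best) continue;
                x[i] -= 2.0 * step;           /* try decreasing x[i] */
                if (f(x) < best) continue;
                x[i] += step;                 /* no improvement: restore */
                step *= 0.5;                  /* and halve the step */
            }
        }
    }
}

int main(void)
{
    double x[SEARCH_DIM];
    for (int i = 0; i < SEARCH_DIM; i++)
        x[i] = 1.0;                           /* arbitrary starting point */
    cyclic_coordinates(x, 2, 1.0);            /* two sweeps, as in the paper */
    printf("f after refinement: %g\n", f(x));
    return 0;
}
```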
VIII. RESULTS

Fig. 4 Neutral face generated by the render unit

All tests are performed in full color, with each frame having a resolution of 480 by 420 pixels and 24 bits per pixel. The implementation of the FAP generation system described above is written in C and runs on an AMD K7 processor running at 1.533 GHz with a RADEON 7500 graphics card. On average, 2263 iterations are required to compute the FAPs for each frame. The time to process each frame is approximately 46 seconds. Faster hardware could reduce this time significantly. Figure 4 shows the neutral animated face (all FAPs are zero). The left column of Figure 5 shows three frames from the video sequence, and the right column shows the corresponding animated faces based on the FAPs found through optimization. In particular, the pose, the gaze direction, the eyelids, the eyebrows, and the lips are very close to the original face, and the animated face looks very natural.

IX. CONCLUSION AND FUTURE WORK

This paper introduces a new method for automatic FAP generation and demonstrates that the method works in a real-world system. Unlike previous methods in the literature, the introduced method utilizes feedback from the render unit to ensure that the generated FAPs produce an animation that resembles the face in the original video sequence. The method is very robust, and the resulting animations are accurate, lifelike, and truthful to the original face.

Fig. 5 Original face (left column) and the corresponding animated face (right column) for three different frames of the video sequence


The emphasis in this paper has been on the quality of the results. Possible ways to improve the speed of the FAP generation include:

• Reducing the number of iterations required to reach the minimum
• Tweaking the source code to speed up the program
• Using faster hardware (CPU and graphics card)
• Removing memory traffic bottlenecks
• Utilizing the MMX and 3DNow! technology already present in the CPU

X. REFERENCES

[1] ISO/IEC JTC1/SC29/WG11, "Final Draft of International Standard ISO/IEC 14496-2, Coding of Audio-visual Objects: Visual", Atlantic City, Oct. 1998.

[2] J. Ahlberg, "Facial Feature Extraction using Eigenspaces and Deformable Graphs", International Workshop on Synthetic-Natural Hybrid Coding and Three Dimensional Imaging, Sep. 1999, pp. 8-11.

[3] N. Sarris, P. Karagiannis, and M. Strintzis, "Automatic Extraction of Facial Feature Points for MPEG4 Videophone Applications", IEEE International Conference on Consumer Electronics, 2000, pp. 130-131.

[4] K. Seo, W. Kim, C. Oh, and J. Lee, "Face Detection and Facial Feature Extraction Using Color Snake", Proceedings of the 2002 IEEE International Symposium on Industrial Electronics, Volume 2, 2002, pp. 457-462.

[5] S. Ahn and H. Kim, "Automatic FDP (Facial Definition Parameters) Generation for a Video Conferencing System", International Workshop on Synthetic-Natural Hybrid Coding and Three Dimensional Imaging, Sep. 1999, pp. 16-19.

[6] J. Kim, M. Song, I. Kim, Y. Kwon, H. Kim, and S. Ahn, "Automatic FDP/FAP Generation from an Image Sequence", IEEE International Symposium on Circuits and Systems, May 2000, pp. I-40-42.

[7] face2face animation inc. [Online]. Available: http://www.f2fanimation.com

