
A Baseline Algorithm for

Face Detection and Tracking in Video


Vasant Manohar† , Padmanabhan Soundararajan† , Valentina Korzhova† , Matthew Boonstra† ,
Dmitry Goldgof† , Rangachar Kasturi† , Rachel Bowers‡ and John Garofolo‡
† Computer Science & Engineering, University of South Florida, Tampa, FL, U.S.A
{vmanohar, psoundar, korzhova, boonstra, goldgof, r1k}@cse.usf.edu
‡ National Institute of Standards and Technology, Gaithersburg, MD, U.S.A
{rachel.bowers, john.garofolo}@nist.gov

ABSTRACT
Establishing benchmark datasets, performance metrics, and baseline algorithms has considerable research significance in gauging the progress in any application domain. These primarily allow both users and developers
to compare the performance of various algorithms on a common platform. In our earlier works, we focused on
developing performance metrics and establishing a substantial dataset with ground truth for object detection and
tracking tasks (text and face) in two video domains – broadcast news and meetings. In this paper, we present
the results of a face detection and tracking algorithm on broadcast news videos with the objective of establishing
a baseline performance for this task-domain pair.
The detection algorithm uses a statistical approach that was originally developed by Viola and Jones and later
extended by Lienhart. The algorithm uses a feature set that is Haar-like and a cascade of boosted decision tree
classifiers as a statistical model. In this work, we used the Intel Open Source Computer Vision Library (OpenCV)
implementation of the Haar face detection algorithm. The optimal values for the tunable parameters of this
implementation were found through an experimental design strategy commonly used in statistical analyses of
industrial processes. Tracking was accomplished as continuous detection with the detected objects in two frames
mapped using a greedy algorithm based on the distances between the centroids of bounding boxes. Results on the evaluation set containing 50 sequences (≈ 2.5 mins.), scored with the developed performance metrics, show good performance that reflects the state of the art, which makes the algorithm an appropriate choice as the baseline for this task-domain pair.
Keywords: Baseline algorithm, Face detection & tracking, Broadcast news, Performance evaluation

1. INTRODUCTION
Since the very beginning of image processing, as in the case of software engineering, a portion of the development process has been committed to algorithm testing. Its objective is to determine whether or not a particular algorithm satisfies its specifications with respect to established criteria such as accuracy and robustness.
There are two important purposes of testing,1 namely,

1. Provide a quantitative method of evaluating an algorithm


2. Provide a comparative measure of the algorithm against similar algorithms using similar criteria

One of the major challenges in designing image processing algorithms lies in defining the criteria used to gauge the results. This involves a trade-off among measuring the sensitivity of parameters, the robustness of the algorithm, and the accuracy of the results. Performance evaluation in a general sense involves the measurement of some essential behavior of an algorithm, which could be realizable accuracy, robustness, or extensibility. It allows for both the emphasis of intrinsic characteristics of an algorithm and the assessment of its benefits and limitations.
Several such evaluation frameworks for different image processing and computer vision tasks are available.
Image thresholding methods were investigated by Sezgin and Sankur.2 Heath et al.3 presented their work on evaluating edge detection algorithms. Range segmentation algorithms were compared by Hoover et al.4 Mikolajczyk and Schmid5 analyzed the performance of descriptors computed for local interest regions. The gait identification problem was formally addressed by Sarkar et al.6 Cappelli et al.7 evaluated fingerprint verification systems with a goal of establishing a new common benchmark for an unambiguous comparison of fingerprint-based biometric systems. To provide a means for measuring progress and characterizing the properties of face recognition, the Face Recognition Grand Challenge (FRGC) was introduced by Phillips et al.8
Evaluating the performance of any image processing algorithm is complicated by the fact that several factors influence the results. As noted by Heath et al.,3 there are many determinants of the performance of an algorithm, namely,

• The algorithm itself


• The nature of dataset used to test the algorithm
• The parameter settings of the algorithm used in the evaluation
• The testing methodology used for evaluating the algorithm

One can observe that, besides the algorithm, there are several other factors that define an algorithm’s per-
formance. The nature of the testing set used is an important element. Using an easy set of images overestimates
the capability of the algorithm, leading to unsatisfactory results when it is fielded in real-world applications. On the other hand, using a difficult set of testing images underestimates the efficacy of the algorithm, leading to a misrepresentation of the state-of-the-art for the problem.
The number of parameters that must be specified for an algorithm determines how difficult it is to evaluate its performance. For instance, an algorithm that requires setting one parameter is much easier to evaluate than an algorithm with five parameters whose values must be trained for optimal performance.
Finally, the testing strategy used to evaluate an algorithm also dictates its performance. There are no
strict guidelines as to how the performance evaluation process should be characterized. However, there are
certain aspects to be considered, such as the testing protocol, performance metrics, databases, and benchmark
algorithms.
While the testing protocol, performance metrics, and testing databases deal with the objective of providing a quantitative method of evaluating an algorithm, establishing benchmark algorithms aims at providing a comparative measure of an algorithm's performance. We focused on the first three facets of evaluation in our earlier works.9, 10 In this paper, we present a baseline algorithm for face detection and tracking with the goal of providing a reference for future performance comparisons for this task.
This paper is organized in the following fashion. Section 2 provides a formal description of the face detection
and tracking task accompanied by a set of guidelines for consistent and reliable annotation of the groundtruth.
Section 3 describes the baseline face detection and tracking algorithm implemented in this evaluation. Section 4
details the experimental design strategy used to compute the optimal values of the user-selectable parameters
of the OpenCV implementation of the Haar face detection algorithm. Section 5 presents the results of the
baseline algorithm on the broadcast news corpus along with a brief description of the metrics used to compute
the performance scores. We conclude the findings of this work in Section 6.

2. TASK DEFINITION AND REFERENCE ANNOTATIONS


A typical object detection and tracking task can be defined as detecting the particular object(s) of interest in
each individual frame and tracking them through all of the frames in a given sequence. During the process of
generating the reference annotation, a clear and exhaustive set of guidelines were established and followed in
order to reduce the intra-annotator variability (the same annotator marking the boxes inconsistently at different
times) and inter-annotator variability (mismatch between different annotators). Further effort was directed toward developing a ground truth that is rich in detail. Hence, each object block is associated with a set of attributes that characterize the region from both an evaluation and an informational point of view. This section explains
the set of guidelines and additional flags used in generating the groundtruth annotations for the face detection
and tracking task.9 We also present the specific evaluation settings used for assessing the performance of the
baseline algorithm.

2.1. Face Ground Truth
A Face is enclosed by an oriented bounding box. The face features are used as guides to mark the bounds of
this box. If for any reason, the features are obstructed, then the box is approximated. A face is defined to be
VISIBLE in the scene as long as one eye, part of the nose, and part of the mouth are seen. Each face bounding
box has the added properties of how clear the face is (AMBIGUITY), whether it is a real face or SYNTHETIC
(cartoon, for example), whether the face is OCCLUDED by another object in the scene, and whether the person is wearing a hat or sunglasses (HEADGEAR). The clarity of the face is denoted by 3 levels of AMBIGUITY (0 =
Very clear when the face can be clearly seen, 1 = Partially clear when two of three features are visible, and 2
= Very confusing when none of three features are visible). This set of attributes makes the annotations rich
and can be used for sub-scoring in the evaluations. A sample annotation for face depicting different levels of
AMBIGUITY is shown in Figure 1.

(a) AMBIGUITY = 0: all facial features are visible. (b) AMBIGUITY = 1: motion blur distorts the face. (c) AMBIGUITY = 2: facial features are barely discernible.

Figure 1. Sample Annotation for Face Depicting Different Levels of AMBIGUITY.

Each face bounding box has a unique ID and is maintained in the entire clip as long as the face exists without
any temporal breaks. If a face object exits the clip at frame n and comes back in frame m, then a new ID is
given to this face. The annotator is merely performing a detection/tracking task and not a recognition task.

2.2. Evaluation Settings


One can observe that the reference is richly annotated with a variety of information. The additional set of
attributes is used in deciding whether a particular object should be evaluated or not. The specific setting used
for the face task is shown in Table 1.

TASK                    EVALUATION SETTINGS

Face Detect & Track     VISIBLE = TRUE
                        AMBIGUITY = 0
                        SYNTHETIC = FALSE
                        OCCLUDED = FALSE
                        HEADGEAR = FALSE

Table 1. Evaluation Settings for the Face Task.

All other annotated regions are treated as “Don’t Care” where the system output is neither penalized for
missing nor given credit for detecting the unscored region. It has to be noted that each of these attributes can
be selectively specified to be included in evaluation through the scoring tool that we have developed.

3. FACE DETECTION AND TRACKING
Detecting faces in images is an essential step to intelligent vision-based human-computer interaction. It is the
first step in many research efforts in face processing including face recognition, pose estimation, and expression
recognition. Numerous techniques have been proposed to detect faces in images and video.11, 12 This section
describes the baseline face detection and tracking algorithm used in this evaluation. The main purpose of
presenting a baseline algorithm is to provide a reference for future performance comparison and to show how a
real-life system can be subjected to evaluation.

3.1. Face Detection


The baseline face detection algorithm uses a statistical approach for detection, an approach originally developed
by Viola and Jones13 and later extended by Lienhart.14 The algorithm uses Haar-like features together with a cascade of boosted decision tree classifiers as its statistical model. These two components form the core of the approach.
Each Haar-like feature, so named because of its similarity to the coefficients of Haar wavelet transforms, is essentially a scalable template that can be applied to a search window on the image. Lienhart14 extended the original set of eight templates to fourteen.
Detection is accomplished by sliding a search window through the image and checking the response of the classifier to decide whether a given location looks like a face or not. In order to detect faces of different sizes, the image would ordinarily have to be rescaled several times. However, Viola and Jones13 developed an elegant alternative in which the classifier is scaled instead of the image. This is implemented efficiently using the Summed Area Table described in their paper.13
Following this, Viola and Jones13 propose building a cascade of boosted weak classifiers of increasing complexity. During detection, these classifiers are applied to the search window one at a time in the order of priority learned during training. Each classifier either rejects the image region in the search window or passes it on to the next classifier in the chain, which further analyzes the search window in a similar fashion. This technique greatly increases the detection speed because most of the computation is spent on face-like regions, while the majority of the non-face regions are rejected in the first few levels of the cascade.
3.1.1. Implementation
The Intel Open Source Computer Vision Library (OpenCV) implementation (Version 1.0rc1) of the Haar face
detection algorithm was used as the baseline for this task. OpenCV provides both low-level and high-level APIs
for face detection. Using the low-level API, users can analyze an individual location of the image using the
classifier to determine if it is a face candidate or not. Auxiliary functions can be used to calculate the integral
images (Summed Area Table) and to scale the classifier to detect faces of different sizes. Alternatively, the
high-level API can be used that does all of the above steps automatically. The low-level API was used in this
paper.
The face detector was not trained on the training set of the Broadcast News corpus. Instead, the trained
classifier cascade available with the OpenCV distribution was used.
There are three user selectable parameters for the OpenCV implementation of Haar face detection, namely:

• The Scaling factor – Specifies the factor by which the search window is scaled in a subsequent iteration. If
the user specifies a value of 1.25, then in a subsequent iteration, the search window is made 25% larger.
• The Grouping factor – Signifies the minimum number of neighboring face rectangles that should be joined
into a single “face”. For instance, a group factor of 2 means groups of 3 or more neighboring rectangles
are joined into a single face. Smaller groups are rejected for a lack of sufficient evidence.
• The Pre-processing flag – When set, the algorithm works with edge data (Canny edge detector15) instead of the raw data. This makes the algorithm run faster due to the reduction in information.

The optimal values for each of these parameters were found using an experimental design strategy 16 described
in Section 4. The final set of values was: scaling factor = 1.2, grouping factor = 3 and pre-processing flag =
TRUE.
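As a hedged illustration of how these parameters are exposed, the sketch below uses the modern OpenCV Python API (cv2.CascadeClassifier.detectMultiScale) rather than the low-level C API of version 1.0rc1 that was actually used in this work; the parameter mapping (scaleFactor, minNeighbors, CASCADE_DO_CANNY_PRUNING) is approximate, and the minSize value is our own choice:

import cv2

# The pre-trained frontal-face cascade ships with the OpenCV distribution.
cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    # scaleFactor=1.2: the search window grows by 20% between iterations (scaling factor).
    # minNeighbors=3: roughly the grouping factor; isolated hits are rejected.
    # CASCADE_DO_CANNY_PRUNING: the pre-processing flag; regions with too few edges are skipped.
    faces = cascade.detectMultiScale(
        gray,
        scaleFactor=1.2,
        minNeighbors=3,
        flags=cv2.CASCADE_DO_CANNY_PRUNING,
        minSize=(24, 24),
    )
    return [(int(x), int(y), int(w), int(h)) for (x, y, w, h) in faces]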

3.2. Face Tracking
The tracking algorithm in our face baseline links the objects detected in each frame across the given sequence. Each tracked object carries a unique ID; once an object has been detected, the output must carry the same object ID across frames to be scored as correct. The objects are represented by bounding rectangles defined by the upper-left corner (x, y) together with the width (w) and height (h). A distance-based greedy algorithm is used to track each object throughout the given sequence. A short description of the algorithm is given below, followed by a code sketch of the matching step:

1. Three consecutive frames (current, previous, and 2nd previous) with their detected objects are considered. The number of detected objects in the current frame is n, and the numbers of detected objects in the previous and 2nd previous frames are n1 and n2, respectively; also, two thresholds are predefined (threshold1 and threshold2).

2. The centroids $(c_{x_k}, c_{y_k})$, $k = 1, \ldots, M$, $M = n + n_1 + n_2$, of the bounding rectangles of the detected objects are calculated using the formulae $c_x = x + \frac{w}{2}$ and $c_y = y + \frac{h}{2}$.

3. The minimum Euclidean distance $d_{min_i}$ from the i-th detected object of the previous frame to the detected objects in the current frame is found using a greedy approach:
$$d_{min_i} = \min_{j} \sqrt{(c^{p}_{x_i} - c^{c}_{x_j})^2 + (c^{p}_{y_i} - c^{c}_{y_j})^2}, \quad j = 1, \ldots, n,$$
where $(c^{p}_{x_i}, c^{p}_{y_i})$ is the centroid of the i-th detected object of the previous frame and $(c^{c}_{x_j}, c^{c}_{y_j})$ is the centroid of the j-th detected object of the current frame.

4. If $d_{min_i} \leq$ threshold1, the ID of the previous object is assigned to the object of the current frame that achieves this minimum distance. When the minimum distance is achieved by two or more different objects in the current frame, the object whose area is closest to that of the previous object is chosen.

5. Repeat steps 2 − 4 n1 times.

6. Repeat the above steps n2 times for the detected objects of the 2nd previous frame and the detected objects of the current frame, with additional conditions: a different distance threshold (threshold2) and the constraint that two or more objects cannot have the same ID.
7. New IDs are assigned for any of the unassigned objects in the current frame.
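A minimal sketch of the greedy centroid-matching step above is given below (our own Python illustration; it covers only the previous-frame pass of steps 2-5, the threshold value is a placeholder, and ties are broken by visiting order rather than by closest area):

import math
from itertools import count

_next_id = count(1)

def centroid(box):
    # box = (x, y, w, h); centroid per Section 3.2: (x + w/2, y + h/2).
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)

def match_to_previous(prev, curr, threshold1=40.0):
    # prev: list of (obj_id, box) from the previous frame; curr: list of boxes in the current frame.
    assigned = {}
    used = set()
    for obj_id, pbox in prev:
        pcx, pcy = centroid(pbox)
        best_j, best_d = None, float("inf")
        for j, cbox in enumerate(curr):
            if j in used:
                continue
            ccx, ccy = centroid(cbox)
            d = math.hypot(pcx - ccx, pcy - ccy)
            if d < best_d:
                best_j, best_d = j, d
        # Step 4: propagate the ID only if the minimum distance is within the threshold.
        if best_j is not None and best_d <= threshold1:
            assigned[best_j] = obj_id
            used.add(best_j)
    # Step 7: unmatched detections in the current frame receive new IDs.
    return [(assigned.get(j, next(_next_id)), cbox) for j, cbox in enumerate(curr)]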

4. PARAMETER TRAINING FOR THE FACE DETECTOR


This section describes an experimental design technique16 used to identify the optimal values for the three user-
selectable parameters of the OpenCV implementation of the Haar face detection algorithm. It has to be noted
that the purpose of this training was not to learn the parameters of the Haar wavelets themselves. Instead,
the algorithm was treated as a black box and the values of the tunable parameters of the implementation were
computed.
Our task in this process was to analyze the following aspects of the algorithm: (1) the optimal parameter values that would maximize the performance; (2) the significance of each parameter; (3) the correlation between parameters (called factor interaction in statistical learning).
After careful deliberation, an experiment was designed with the following combinations –

• Scale factor (Factor A) at three levels – 1.1, 1.2 and 1.4.

• Grouping factor (Factor B) at two levels – 2 and 3.

• Pre-processing flag (Factor C) at two levels – On and Off.

Thus, it is a three-factor, mixed-level experiment. For a given parameter setting, the algorithm, being deterministic, produces the same output every time it is run, so there is no random error involved in the process. This results in a one-replicate design.
To accomplish the tasks described earlier, the following strategy was adopted.

1. Perform an ANalysis Of VAriance (ANOVA) on the performance values for various settings to identify the
significance of individual parameters and their interactions.
2. Based on the results of the ANOVA, build a regression model with the significant factors.

3. Follow it up with an optimization step to compute the optimal parameter settings.

In order to test the generality of the solution, the above process was repeated on six training videos. For each
parameter setting, the algorithm output was obtained and the score was computed by using the SFDA metric.
Table 2 shows sample data for one of the videos.

              Preprocessing flag at zero                Preprocessing flag at one

                     Grouping                                   Grouping
                  2           3                              2           3
        1.1    0.669158   0.785116                 1.1    0.669633   0.785496
Scale   1.2    0.766816   0.842430        Scale    1.2    0.767765   0.842430
        1.4    0.807907   0.796853                 1.4    0.807907   0.795713

Table 2. Sample Data Showing Performance Values for Different Parameter Settings.

Since this is a one-replicate experiment, a reduced model had to be built to estimate the error. By inspecting the experimental data, we inferred that the level of factor C (the pre-processing flag) was not significant to the performance of the algorithm. From Table 2, one can observe that changing the level of factor C alters the score only in the fourth decimal place. Hence, we used the sum of squares of factor C as our initial estimate of the error.
Through the ANOVA, it was found that the factors A & B (scale and grouping factors) and the interaction
between them were the only significant sources of variation. This observation was consistent across all six training
videos. Based on these results, the final model was built by combining the sums of squares of C, AC, BC, and ABC to get a better estimate of the sum of squares of the error.
A regression model was then built with the significant factors as the model parameters. Since a linear model was not able to accurately capture the variation in the underlying distribution space, a quadratic model, $z = \beta_0 + \beta_1 X_A + \beta_2 X_B + \beta_3 X_A X_B + \beta_4 X_A^2$, was built by introducing a quadratic term for the scale factor. The average $R^2$ value for this model was 0.88, which indicated that the model was adequate. The optimal values for each of the six videos were obtained using this model.
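As an illustrative sketch (not the original analysis code), the quadratic model above can be fit by ordinary least squares; the six design points below are the scores from Table 2 with the pre-processing flag at one, and the closed-form optimum of the fitted quadratic in the scale factor is shown for a fixed grouping factor of 3:

import numpy as np

# Design points (scale factor XA, grouping factor XB) and SFDA scores from Table 2
# (pre-processing flag at one), used here purely for illustration.
XA = np.array([1.1, 1.1, 1.2, 1.2, 1.4, 1.4])
XB = np.array([2.0, 3.0, 2.0, 3.0, 2.0, 3.0])
z = np.array([0.669633, 0.785496, 0.767765, 0.842430, 0.807907, 0.795713])

# Columns correspond to z = b0 + b1*XA + b2*XB + b3*XA*XB + b4*XA^2.
X = np.column_stack([np.ones_like(XA), XA, XB, XA * XB, XA ** 2])
beta, *_ = np.linalg.lstsq(X, z, rcond=None)

fitted = X @ beta
r_squared = 1.0 - np.sum((z - fitted) ** 2) / np.sum((z - z.mean()) ** 2)

# With the grouping factor fixed at 3, the fitted quadratic is maximized (when beta[4] < 0)
# where its derivative with respect to XA vanishes.
xa_opt = -(beta[1] + beta[3] * 3.0) / (2.0 * beta[4])
print(beta, r_squared, xa_opt)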
Table 3 presents the optimal values for scale and grouping factors for each of the six videos.

Input Data        Factor A: Scale Factor        Factor B: Grouping Factor
Video-1           1.25684                       3
Video-2           1.28917                       3
Video-3           1.24587                       3
Video-4           1.25851                       3
Video-5           1.26124                       3
Video-6           1.25851                       3

Table 3. Optimal Values for Scale and Grouping Factors for the Training Videos.

A final regression model was built based on these optimal values. The motivation behind this step was to find values that generalize across a larger set of videos. After this optimization step, the values for the scale and grouping factors were 1.20 and 3, respectively. With the pre-processing flag set to TRUE for faster execution, these values were used on the test set.

5. BASELINE ALGORITHM EVALUATION


This section presents the results of the baseline face detection and tracking algorithm on a dataset containing 50 video sequences (≈ 2.5 mins long) from the Broadcast News domain, comprising feeds from ABC and CNN. Before reporting the results, we briefly describe the metrics used to compute the performance scores. The complete details of these metrics can be found elsewhere.9, 17

5.1. Performance Measures


All of the performance measures are primarily area-based and depend on the spatial overlap between the ground
truth and the system output objects to generate the score. To fairly score an algorithm’s performance, we
perform a one-to-one mapping between the ground truth and the system output objects such that the metric
scores are maximized. All the measure scores are computed such that better performance gets a numerically higher score.
Sequence Frame Detection Accuracy: The Sequence Frame Detection Accuracy (SFDA) is a frame-level measure that accounts for the number of objects detected, missed detections, false positives, and the spatial alignment of system output and ground truth objects.
For a given frame t, the Frame Detection Accuracy (F DA(t)) is calculated as:
(t)
Overlap Ratio (t)
T (t)
PNmapped |Gi D |
F DA(t) =  (t) (t)  where, Overlap Ratio = i=1 (t)
S i(t) (1)
NG +ND |Gi Di |
2

(t) (t) (t)


Here, NG is the number of ground truth objects, ND is the number of detected objects, and Nmapped is the
number of mapped object pairs, where the mapping is done between objects which have the best spatial overlap
in the given frame t, using the Hungarian matching strategy.18, 19
The Sequence Frame Detection Accuracy (SFDA) is calculated as the average of the FDA measure over all
the relevant frames in the sequence.
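A minimal sketch of Equation 1 and the SFDA average follows (our own Python illustration assuming axis-aligned boxes; the actual evaluation uses oriented boxes and the scoring tool's own mapping, and scipy's linear_sum_assignment stands in for the Hungarian matching):

import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    # Spatial overlap |G ∩ D| / |G ∪ D| for axis-aligned boxes (x, y, w, h).
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def fda(gt_boxes, det_boxes):
    # Frame Detection Accuracy (Equation 1) for a single frame.
    if not gt_boxes or not det_boxes:
        return 0.0
    overlaps = np.array([[iou(g, d) for d in det_boxes] for g in gt_boxes])
    rows, cols = linear_sum_assignment(-overlaps)  # one-to-one mapping maximizing total overlap
    overlap_ratio = overlaps[rows, cols].sum()
    return overlap_ratio / ((len(gt_boxes) + len(det_boxes)) / 2.0)

def sfda(gt_frames, det_frames):
    # Average of FDA over the relevant frames (frames with any ground truth or detections).
    scores = [fda(g, d) for g, d in zip(gt_frames, det_frames) if g or d]
    return float(np.mean(scores)) if scores else 1.0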
Multiple Object Detection Accuracy: To assess the accuracy aspect of system performance, we utilize the missed detection and false positive counts. Assuming that the number of misses is indicated by $m_t$ and the number of false positives is indicated by $fp_t$ for each frame t, we can compute the Normalized Multiple Object Detection Accuracy (N-MODA) for the sequence as:

$$N\text{-}MODA = 1 - \frac{\sum_{t=1}^{N_{frames}} \left( c_m(m_t) + c_f(fp_t) \right)}{\sum_{t=1}^{N_{frames}} N_G^t} \qquad (2)$$

where $c_m$ and $c_f$ are the cost functions for the missed detects and false positives, and $N_G^t$ is the number of ground truth objects in the t-th frame.
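A corresponding sketch of Equation 2, with the cost functions c_m and c_f taken as simple unit weights (an assumption on our part), is:

def n_moda(misses, false_positives, num_gt, c_m=1.0, c_f=1.0):
    # misses, false_positives, num_gt: per-frame counts over the sequence (Equation 2).
    error = sum(c_m * m + c_f * fp for m, fp in zip(misses, false_positives))
    total_gt = sum(num_gt)
    return 1.0 - error / total_gt if total_gt > 0 else 1.0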
Multiple Object Detection Precision: We use the spatial overlap information between the ground truth and the system output (similar usage as in Equation 1) to compute the Mapped Overlap Ratio as defined in Equation 3.

$$\text{Mapped Overlap Ratio} = \sum_{i=1}^{N_{mapped}^{t}} \frac{|G_i^{(t)} \cap D_i^{(t)}|}{|G_i^{(t)} \cup D_i^{(t)}|} \qquad (3)$$

where $G_i^{(t)}$ denotes the i-th ground truth object in the t-th frame, $D_i^{(t)}$ denotes the detected object mapped to $G_i^{(t)}$, and $N_{mapped}^{t}$ is the number of mapped object pairs in frame t.

Using the assignment sets, the Multiple Object Detection Precision (MODP) for each frame t can be computed as:

$$MODP(t) = \frac{\text{Mapped Overlap Ratio}}{N_{mapped}^{t}} \qquad (4)$$

The Normalized Multiple Object Detection Precision (N-MODP) is computed as the average of the MODP
measure over all the relevant frames in the sequence.
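Equations 3 and 4 reduce to an average of the per-pair overlap ratios of a frame; a minimal sketch (assuming the per-frame mapped overlap ratios have already been computed, e.g. with the iou and mapping code above) is:

def modp(mapped_overlaps):
    # MODP (Equation 4) for one frame; mapped_overlaps holds |G ∩ D| / |G ∪ D| for each mapped pair.
    return sum(mapped_overlaps) / len(mapped_overlaps) if mapped_overlaps else 0.0

def n_modp(per_frame_mapped_overlaps):
    # N-MODP: MODP averaged over the relevant frames (here, frames with at least one mapped pair).
    scores = [modp(o) for o in per_frame_mapped_overlaps if o]
    return sum(scores) / len(scores) if scores else 0.0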
Average Tracking Accuracy: The Average Tracking Accuracy (ATA) is a spatio-temporal measure which
penalizes fragmentations in both the temporal and spatial dimensions while accounting for the number of objects
detected and tracked, missed objects, and false positives.
The Sequence Track Detection Accuracy (STDA) is defined as:

$$STDA = \sum_{i=1}^{N_{mapped}} \frac{\sum_{t=1}^{N_{frames}} \frac{|G_i^{(t)} \cap D_i^{(t)}|}{|G_i^{(t)} \cup D_i^{(t)}|}}{N_{(G_i \cup D_i \neq \emptyset)}} \qquad (5)$$

Analyzing the numerator of Equation 5, we observe that it is merely the overlap of the detected object over
the ground truth, which is very similar to Equation 1. The only difference is that, in tracking we measure the
overlap in the spatio-temporal dimension while in detection the overlap is in the spatial dimension alone.
The Average Tracking Accuracy (ATA) is calculated as the average of STDA over all the unique objects in
the sequence.
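A sketch of Equation 5 and the ATA average follows, assuming the per-frame overlap ratios of each mapped track pair have been collected beforehand (the normalization by the number of unique objects follows the wording above and is our reading of the metric):

def stda(track_overlaps, track_lengths):
    # Equation 5: track_overlaps[i] is the list of per-frame |G ∩ D| / |G ∪ D| ratios for the
    # i-th mapped track pair; track_lengths[i] is the number of frames in which either the
    # ground-truth or the output object of that pair exists (N_{G_i ∪ D_i ≠ ∅}).
    return sum(sum(ovl) / n for ovl, n in zip(track_overlaps, track_lengths) if n > 0)

def ata(track_overlaps, track_lengths, num_unique_objects):
    # ATA: STDA averaged over the unique objects in the sequence.
    return stda(track_overlaps, track_lengths) / num_unique_objects if num_unique_objects else 0.0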
Multiple Object Tracking Accuracy: To extract the accuracy aspect of the system output track, we compute
the number of missed detects, false positives, and switches in the system output track for a given reference
ground truth track.
$$MOTA = 1 - \frac{\sum_{t=1}^{N_{frames}} \left( c_m(m_t) + c_f(fp_t) + \log_{10}(\textit{ID-SWITCHES}_t) \right)}{\sum_{t=1}^{N_{frames}} N_G^t} \qquad (6)$$

where, after computing the mapping for frame t, $m_t$ is the number of misses, $fp_t$ is the number of false positives, and $\textit{ID-SWITCHES}_t$ is the number of ID mismatches in frame t considering the mapping in frame (t − 1). It should be noted that because of the log function, we start the ID-SWITCH count at 1.
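A sketch of Equation 6, again with unit cost functions and with the ID-switch count handled as described above so that a frame with no switches contributes zero, is:

import math

def mota(misses, false_positives, id_switches, num_gt, c_m=1.0, c_f=1.0):
    # Equation 6; id_switches holds the per-frame number of ID mismatches (zero-based here,
    # shifted by one inside the log so that log10(1) = 0 when there are no switches).
    error = sum(
        c_m * m + c_f * fp + math.log10(sw + 1)
        for m, fp, sw in zip(misses, false_positives, id_switches)
    )
    total_gt = sum(num_gt)
    return 1.0 - error / total_gt if total_gt > 0 else 1.0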
Multiple Object Tracking Precision: To obtain the precision score, we calculate the spatio-temporal overlap
between the reference tracks and the system output tracks. The Multiple Object Tracking Precision (MOTP) is
defined as:
 
$$MOTP = \frac{\sum_{i=1}^{N_{mapped}} \sum_{t=1}^{N_{frames}} \frac{|G_i^{(t)} \cap D_i^{(t)}|}{|G_i^{(t)} \cup D_i^{(t)}|}}{\sum_{t=1}^{N_{frames}} N_{mapped}^{t}} \qquad (7)$$

where $N_{mapped}$ refers to the mapped system output objects over an entire reference track, taking into account splits and merges, and $N_{mapped}^{t}$ refers to the number of mapped objects in the t-th frame.
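Finally, a sketch of Equation 7, reusing the per-track overlap lists from the STDA sketch together with the per-frame mapped-pair counts:

def motp(track_overlaps, per_frame_mapped_counts):
    # Equation 7: total spatio-temporal overlap of all mapped track pairs, normalized by the
    # number of mapped object pairs summed over all frames.
    total_overlap = sum(sum(ovl) for ovl in track_overlaps)
    total_mapped = sum(per_frame_mapped_counts)
    return total_overlap / total_mapped if total_mapped > 0 else 0.0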

5.2. Results
Figure 2 shows some sample images depicting the detection accuracy of the baseline Haar face detection algorithm.
From a visual inspection, one can observe that the algorithm suffers from both missed detections and false alarms. Misses primarily arise from the fact that the Viola-Jones face detector mostly detects only close-to-frontal, horizontally aligned faces. False alarms result from the basic design principle that the classifiers should trigger a response on any face-like image region. Despite these shortcomings, the Haar face detection algorithm achieves a respectable detection rate.

Figure 2. Images Showing the Detection Accuracy of the Haar Face Detection Algorithm.

Figure 3 shows the boxplots of the detection and tracking scores for the baseline algorithm. For detection, the
specific measures that we observe are the MODP, MODA, and SFDA whose mean values are 0.339, 0.423, and
0.797. While the SFDA measures the detection performance comprehensively and has a high score, the MODP
and MODA are lower. Specifically, the MODP shows that the overall spatial overlap (on the mapped objects)
is about 0.34 and on some clips as high as nearly 0.40. The MODA, which shows how accurate the counts of
the objects are, measures at around 0.42 overall and on some clips as high as 0.80. It has to be noted that the SFDA score is thresholded at 10% spatial overlap, meaning that if the ground truth box and the system output box overlap by at least 10% in Equation 1, the pair is treated as having 100% spatial overlap.
For the tracking performance, the specific measures we use are MOTP, MOTA, and ATA. The MOTP and
MOTA are similar to their detection counterparts but are slightly lower. The ATA score averages around 0.23
and on some clips as high as nearly 0.40.

6. CONCLUSIONS
Though there have been several frameworks for the performance evaluation of various image processing and
computer vision applications, there has not been a significant effort to evaluate the key computer vision topic
of object detection and tracking in video. The objective of our work is to systematically address this problem
and to provide useful resources such as data, metrics, and tools for the evaluation of such detection and tracking
algorithms. To that extent, in our past publications we had developed performance metrics, testing databases,
and an evaluation strategy to provide a quantitative methodology for performance assessment. In this paper,
we presented a baseline algorithm for face detection and tracking to supply a comparative measure for such
algorithms. From the results, it was observed that the performance of the baseline algorithm is comparable to the state of the art in face detection and tracking.
This work, together with our earlier papers describing the evaluation protocol, gives a thorough treatment
to every aspect of formal evaluation and provides researchers an invaluable resource to advance research on the
topic of object detection and tracking.

[Figure 3: boxplots of Performance Score for the metrics N-MODP, N-MODA, MOTP, MOTA, SFDA, and ATA.]

Figure 3. Boxplot of the Performance Scores of the Baseline Face Detection and Tracking Algorithm on the Broadcast News Corpus (+ indicates mean value).

REFERENCES
1. M. Wirth, M. Fraschini, M. Masek, and M. Bruynooghe, “Performance Evaluation in Image Processing,”
EURASIP Journal on Applied Signal Processing 2006, pp. 1–3, Article ID 45742, 2006.
2. M. Sezgin and B. Sankur, “Survey Over Image Thresholding Techniques and Quantitative Performance
Evaluation,” Journal of Electronic Imaging 13(1), pp. 146–168, 2004.
3. M. D. Heath, S. Sarkar, T. Sanocki, and K. W. Bowyer, “A Robust Visual Method for Assessing the
Relative Performance of Edge–Detection Algorithms,” IEEE Transactions on Pattern Analysis and Machine
Intelligence 19, pp. 1338–1359, Dec 1997.
4. A. Hoover, G. Jean-Baptiste, X. Jiang, P. J. Flynn, H. Bunke, D. Goldgof, K. Bowyer, D. W. Eggert,
A. Fitzgibbon, and R. B. Fisher, “An Experimental Comparison of Range Image Segmentation Algorithms,”
IEEE Transactions on Pattern Analysis and Machine Intelligence 18(7), pp. 673–689, 1996.
5. K. Mikolajczyk and C. Schmid, “A Performance Evaluation of Local Descriptors,” IEEE Transactions on
Pattern Analysis and Machine Intelligence 27(10), pp. 1615–1630, 2005.
6. S. Sarkar, P. Phillips, Z. Liu, I. Vega, P. Grother, and K. Bowyer, “The HumanID Gait Challenge Prob-
lem: Data Sets, Performance, and Analysis,” IEEE Transactions on Pattern Analysis and Machine Intelli-
gence 27, pp. 162–177, February 2005.
7. R. Cappelli, D. Maio, D. Maltoni, J. L. Wayman, and A. K. Jain, “Performance Evaluation of Fingerprint
Verification Systems,” IEEE Transactions on Pattern Analysis and Machine Intelligence 28, pp. 3–18,
Jan 2006.
8. P. J. Phillips, P. J. Flynn, T. Scruggs, K. W. Bowyer, J. Chang, K. Hoffman, J. Marques, J. Min, and
W. Worek, “Overview of the Face Recognition Grand Challenge,” in Proceedings of the 2005 IEEE Computer
Society Conference on Computer Vision and Pattern Recognition, 1, pp. 947–954, 2005.
9. V. Manohar, P. Soundararajan, H. Raju, D. Goldgof, R. Kasturi, and J. Garofolo, “Performance Evaluation
of Object Detection and Tracking in Video,” in Proceedings of the Seventh Asian Conference on Computer
Vision (ACCV), Part II, LNCS 3852, pp. 151–161, Springer, 2006.

10. V. Manohar, P. Soundararajan, M. Boonstra, H. Raju, D. Goldgof, R. Kasturi, and J. Garofolo, “Perfor-
mance Evaluation of Text Detection and Tracking in Video,” in Proceedings of the Seventh International
Workshop on Document Analysis Systems (DAS), LNCS 3872, pp. 576–587, Springer, 2006.
11. E. Hjelmasa and B. Low, “Face Detection: A Survey,” Computer Vision and Image Understanding 83(3),
pp. 236–274, 2001.
12. M. Yang, D. Kriegman, and N. Ahuja, “Detecting Faces in Images: A Survey,” IEEE Transactions on
Pattern Analysis and Machine Intelligence 24(1), pp. 34–58, 2002.
13. P. Viola and M. J. Jones, “Robust Real-Time Face Detection,” International Journal of Computer Vi-
sion 57(2), pp. 137–154, 2004.
14. R. Lienhart and J. Maydt, “An Extended Set of Haar-like Features for Rapid Object Detection,” in Pro-
ceedings of the International Conference on Image Processing, pp. 900–903, IEEE, 2002.
15. J. Canny, “A Computational Approach to Edge Detection,” IEEE Transactions on Pattern Analysis and
Machine Intelligence 8(6), pp. 679–698, 1986.
16. D. C. Montgomery, Design and Analysis of Experiments, John Wiley & Sons, Inc., Hoboken, NJ, USA,
sixth ed., 2005.
17. R. Stiefelhagen, K. Bernardin, R. Bowers, J. Garofolo, D. Mostefa, and P. Soundararajan, “The CLEAR
2006 Evaluation,” in Multimodal Technologies for Perception of Humans, LNCS 4122, pp. 1–44, Springer,
2006.
18. J. R. Munkres, “Algorithms for the Assignment and Transportation Problems,” J. SIAM 5, pp. 32–38, 1957.
19. C. H. Papadimitriou and K. Steiglitz, Combinatorial Optimization: Algorithms and Complexity, Prentice-
Hall, Inc., Upper Saddle River, NJ, USA, 1982.

