ABSTRACT
Establishing benchmark datasets, performance metrics, and baseline algorithms has considerable research significance in gauging the progress in any application domain. These resources primarily allow both users and developers
to compare the performance of various algorithms on a common platform. In our earlier works, we focused on
developing performance metrics and establishing a substantial dataset with ground truth for object detection and
tracking tasks (text and face) in two video domains – broadcast news and meetings. In this paper, we present
the results of a face detection and tracking algorithm on broadcast news videos with the objective of establishing
a baseline performance for this task-domain pair.
The detection algorithm uses a statistical approach that was originally developed by Viola and Jones and later
extended by Lienhart. The algorithm uses a set of Haar-like features and a cascade of boosted decision tree
classifiers as a statistical model. In this work, we used the Intel Open Source Computer Vision Library (OpenCV)
implementation of the Haar face detection algorithm. The optimal values for the tunable parameters of this
implementation were found through an experimental design strategy commonly used in statistical analyses of
industrial processes. Tracking was accomplished as continuous detection with the detected objects in two frames
mapped using a greedy algorithm based on the distances between the centroids of bounding boxes. Results on
the evaluation set containing 50 sequences (≈ 2.5 mins.) using the developed performance metrics show good
performance of the algorithm, reflecting the state of the art and making it an appropriate choice as the baseline
algorithm for the problem.
Keywords: Baseline algorithm, Face detection & tracking, Broadcast news, Performance evaluation
1. INTRODUCTION
Since the very beginning of image processing, as in the case of software engineering, there has been a portion of the
development process committed to algorithm testing. It deals with determining whether or not a particular
algorithm satisfies its specifications with respect to established criteria such as accuracy and robustness.
Testing serves two important purposes.1
One of the major challenges in designing image processing algorithms lies in defining the criteria used to
gauge the results. This involves a trade-off among measuring the sensitivity of parameters, the robustness of the algorithm,
and the accuracy of the results. Performance evaluation in a general sense involves the measurement of some essential
behavior of an algorithm, which could be its realizable accuracy, robustness, or extensibility. It allows both the
emphasis of intrinsic characteristics of an algorithm and the assessment of its benefits and limitations.
Several such evaluation frameworks for different image processing and computer vision tasks are available.
Image thresholding methods were investigated by Sezgin and Sankur.2 Heath et al.3 presented their work on evaluating
edge detection algorithms. Range segmentation algorithms were compared by Hoover et al.4 Mikolajczyk
and Schmid5 analyzed the performance of descriptors computed for local interest regions. The gait identification
problem was formally addressed by Sarkar et al.6 Cappelli et al.7 evaluated fingerprint verification systems with
a goal of establishing a new common benchmark for an unambiguous comparison of fingerprint-based biometric
systems. To provide a means for measuring progress and characterizing the properties of face recognition, the
Face Recognition Grand Challenge (FRGC) was introduced by Phillips et al.8
Evaluating the performance of any image processing algorithm is complicated by the fact that several
factors influence the results. As noted by Heath et al.,3 there are many determinants of an algorithm's
performance.
One can observe that, besides the algorithm, there are several other factors that define an algorithm's performance.
The nature of the testing set used is an important element. Using an easy set of images overestimates
the capability of the algorithm, leading to unsatisfactory results when it is fielded in real-world applications. On
the other hand, using a difficult set of testing images underestimates the efficacy of the algorithm, leading to a
misrepresentation of the state of the art for the problem.
The number of parameters that an algorithm requires to be specified determines the difficulty of evaluating its
performance. For instance, an algorithm that requires one parameter to be set is much easier to evaluate than an
algorithm with five parameters whose values have to be tuned for optimal performance.
Finally, the testing strategy used to evaluate an algorithm also dictates its performance. There are no
strict guidelines as to how the performance evaluation process should be characterized. However, there are
certain aspects to be considered, such as the testing protocol, performance metrics, databases, and benchmark
algorithms.
While the testing protocol, performance metrics, and testing databases address the objective of providing a
quantitative method of evaluating an algorithm, establishing benchmark algorithms aims at providing a comparative
measure of an algorithm's performance. We focused on the first three facets of evaluation in our earlier
works.9, 10 In this paper, we present a baseline algorithm for face detection and tracking with the goal of providing
a reference for future performance comparisons for this task.
This paper is organized as follows. Section 2 provides a formal description of the face detection
and tracking task, accompanied by a set of guidelines for consistent and reliable annotation of the ground truth.
Section 3 describes the baseline face detection and tracking algorithm implemented in this evaluation. Section 4
details the experimental design strategy used to compute the optimal values of the user-selectable parameters
of the OpenCV implementation of the Haar face detection algorithm. Section 5 presents the results of the
baseline algorithm on the broadcast news corpus along with a brief description of the metrics used to compute
the performance scores. We conclude the findings of this work in Section 6.
2.1. Face Ground Truth
A face is enclosed by an oriented bounding box. The face features are used as guides to mark the bounds of
this box. If, for any reason, the features are obstructed, then the box is approximated. A face is defined to be
VISIBLE in the scene as long as one eye, part of the nose, and part of the mouth can be seen. Each face bounding
box has the added properties of how clear the face is (AMBIGUITY), whether it is a real face or SYNTHETIC
(a cartoon, for example), whether the face is OCCLUDED by another object in the scene, and whether the person
is wearing a hat or sunglasses (HEADGEAR). The clarity of the face is denoted by three levels of AMBIGUITY (0 =
Very clear, when the face can be clearly seen; 1 = Partially clear, when two of the three features are visible; and 2
= Very confusing, when none of the three features are visible). This set of attributes makes the annotations rich
and can be used for sub-scoring in the evaluations. A sample face annotation depicting the different levels of
AMBIGUITY is shown in Figure 1.
Each face bounding box has a unique ID that is maintained throughout the clip as long as the face exists without
any temporal breaks. If a face exits the clip at frame n and reappears in frame m, then a new ID is
given to this face. The annotator is merely performing a detection/tracking task and not a recognition task.
All other annotated regions are treated as “Don’t Care” where the system output is neither penalized for
missing nor given credit for detecting the unscored region. It has to be noted that each of these attributes can
be selectively specified to be included in evaluation through the scoring tool that we have developed.
3. FACE DETECTION AND TRACKING
Detecting faces in images is an essential step to intelligent vision-based human-computer interaction. It is the
first step in many research efforts in face processing including face recognition, pose estimation, and expression
recognition. Numerous techniques have been proposed to detect faces in images and video.11, 12 This section
describes the baseline face detection and tracking algorithm used in this evaluation. The main purpose of
presenting a baseline algorithm is to provide a reference for future performance comparison and to show how a
real-life system can be subjected to evaluation.
3.1. Face Detection
The detector used in this baseline is the OpenCV implementation of the Haar face detection algorithm, which exposes the following user-selectable parameters:
• The Scaling factor – Specifies the factor by which the search window is scaled in a subsequent iteration. If
the user specifies a value of 1.25, then in a subsequent iteration, the search window is made 25% larger.
• The Grouping factor – Signifies the minimum number of neighboring face rectangles that should be joined
into a single “face”. For instance, a group factor of 2 means groups of 3 or more neighboring rectangles
are joined into a single face. Smaller groups are rejected for a lack of sufficient evidence.
• The Pre-processing flag – When set, the algorithm works with edge data (Canny edge detector15) instead
of the raw data. This makes the algorithm run faster due to the reduction in information.
The optimal values for each of these parameters were found using an experimental design strategy16 described
in Section 4. The final set of values was: scaling factor = 1.2, grouping factor = 3, and pre-processing flag =
TRUE.
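To make the effect of the scaling factor concrete, the sketch below counts how many window sizes a multi-scale detector sweeps over a given image. The base window size of 24 pixels and the stopping rule are illustrative assumptions for this sketch, not OpenCV's exact internals.

```python
def num_search_scales(image_size, base_window=24, scale_factor=1.2):
    """Count the window sizes a multi-scale detector sweeps when the
    search window grows by `scale_factor` per iteration, starting at
    `base_window` pixels and stopping once the window exceeds the
    smaller image dimension.  Illustrative only; OpenCV differs in detail."""
    smaller_dim = min(image_size)
    scales = 0
    window = float(base_window)
    while window <= smaller_dim:
        scales += 1
        window *= scale_factor
    return scales

# A coarser scaling factor sweeps fewer scales and therefore runs faster,
# at the risk of missing faces whose size falls between two scales.
fine = num_search_scales((352, 240), scale_factor=1.1)
coarse = num_search_scales((352, 240), scale_factor=1.25)
```

The chosen value of 1.2 sits between these extremes, trading a modest loss of scale resolution for speed.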
3.2. Face Tracking
The tracking algorithm used in our face baseline uses the objects detected in each frame to track them throughout
the given sequence. The ID assigned to each object is used to uniquely identify the object across all frames
in the sequence. The objects are identified by bounding rectangles, which are defined by the upper right
corner (x, y) and the values of the height (h) and width (w). Once a detection algorithm identifies an object,
that object must carry the same ID across frames to be scored as correct. A distance-based algorithm is used
to track each object throughout the given sequence. A short description of the algorithm is given below:
1. Three consecutive frames (current, previous, and 2nd previous) with their detected objects are specified.
The number of detected objects in the current frame is n, and the numbers of detected objects in the previous
and 2nd previous frames are n1 and n2, respectively. Two thresholds (threshold1 and threshold2) are also
predefined.
2. The centroids (cx_k, cy_k), k = 1, ..., M, M = n + n1 + n2, of the bounding rectangles of the detected objects are
calculated using the formulae cx = x + w/2 and cy = y + h/2.
3. The minimum Euclidean distance (dmin_i) from the i-th detected object of the previous frame to the detected
objects in the current frame is found using a greedy approach: dmin_i = min_j sqrt((cp_xi − cc_xj)² + (cp_yi − cc_yj)²),
j = 1, ..., n, where (cp_xi, cp_yi) is the centroid of the i-th detected object of the previous frame and (cc_xj, cc_yj)
is the centroid of the j-th detected object of the current frame.
4. If dmin_i ≤ threshold1, assign the ID of the previous object to the corresponding object of the current
frame with the minimum distance. In cases where the minimum distance is achieved for two or more different
objects in the current frame, the object whose area is closest to that of the previous object is chosen.
6. Repeat the above steps n2 times for the detected objects of the 2nd previous frame and the detected objects
of the current frame, with additional conditions (a different distance threshold (threshold2) and the constraint
that two or more objects cannot have the same ID).
7. New IDs are assigned for any of the unassigned objects in the current frame.
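The steps above can be sketched in a few lines of Python. This is an illustrative reimplementation of the greedy centroid matching, not the authors' exact code; the area-based tie-break of step 4 is omitted for brevity, and (x, y) is taken as a corner of the box so that the centroid formulae of step 2 apply.

```python
import math

def centroid(box):
    """Centroid of a bounding box (x, y, w, h): cx = x + w/2, cy = y + h/2."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)

def greedy_match(prev_objects, curr_boxes, threshold, already):
    """Greedily give each previous object's ID to its nearest unclaimed
    current detection, if the centroid distance is within `threshold`.
    `already` holds current-frame indices claimed in an earlier pass.
    Returns a partial mapping {current-frame index -> object ID}."""
    out = {}
    for obj_id, pbox in prev_objects.items():
        px, py = centroid(pbox)
        best_j, best_d = None, threshold
        for j, cbox in enumerate(curr_boxes):
            if j in already or j in out:
                continue
            cx, cy = centroid(cbox)
            d = math.hypot(px - cx, py - cy)
            if d <= best_d:
                best_j, best_d = j, d
        if best_j is not None:
            out[best_j] = obj_id
    return out

def track_frame(curr_boxes, prev, prev2, threshold1, threshold2, next_id):
    """Match against the previous frame first, then (with the looser
    threshold2) against the 2nd previous frame for objects whose IDs are
    still unclaimed; any remaining detections get fresh IDs (step 7)."""
    ids = greedy_match(prev, curr_boxes, threshold1, set())
    leftover = {i: b for i, b in prev2.items() if i not in ids.values()}
    ids.update(greedy_match(leftover, curr_boxes, threshold2, set(ids)))
    for j in range(len(curr_boxes)):
        if j not in ids:
            ids[j] = next_id
            next_id += 1
    return ids, next_id
```

The second pass over the 2nd previous frame lets a face survive a one-frame detection dropout without being issued a new ID.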
4. EXPERIMENTAL DESIGN
Thus, it is a three-factor mixed-level experiment. For a given parameter setting, the algorithm, being deterministic,
will produce the same output when run multiple times; hence, there is no random error involved in the
process. This results in a one-replicate design experiment.
To accomplish the tasks described earlier, the following strategy was adopted.
1. Perform an ANalysis Of VAriance (ANOVA) on the performance values for various settings to identify the
significance of individual parameters and their interactions.
2. Based on the results of the ANOVA, build a regression model with the significant factors.
In order to test the generality of the solution, the above process was repeated on six training videos. For each
parameter setting, the algorithm output was obtained and the score was computed by using the SFDA metric.
Table 2 shows sample data for one of the videos.
Table 2. Sample Data Showing Performance Values for Different Parameter Settings.
Being a one-replicate experiment, there was a need to build a reduced model to estimate the error. By
observing the experimental data, we inferred that the level of factor C (pre-processing flag) was not significant
to the performance of the algorithm. From Table 2, one can observe that the level of factor C causes a difference
in the score only in the fourth decimal place. Hence, we used the sum of squares of factor C as our
initial estimate of the error.
Through the ANOVA, it was found that the factors A & B (scale and grouping factors) and the interaction
between them were the only significant sources of variation. This observation was consistent across all six training
videos. Based on these results, the final model was built by combining sum of squares of C, AC, BC and ABC
to get a better estimate for the sum of squares of the error.
A regression model was then built with the significant factors as the model parameters. Since a linear
model was not able to accurately capture the variation in the underlying distribution space, a quadratic model
(z = β0 + β1·XA + β2·XB + β3·XA·XB + β4·XA²) was built by introducing a quadratic term for the scale factor. The
average R² value for this model was 0.88, which indicated that the model was adequate. The optimal values for
each of the six videos were obtained using this model.
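Given fitted coefficients, the optimal scale factor for a fixed grouping factor follows from setting the model's derivative with respect to XA to zero. The sketch below uses hypothetical coefficients chosen only to make the arithmetic concrete; they are not the values fitted in this work.

```python
def predict(beta, xa, xb):
    """Quadratic model: z = b0 + b1*XA + b2*XB + b3*XA*XB + b4*XA**2."""
    b0, b1, b2, b3, b4 = beta
    return b0 + b1 * xa + b2 * xb + b3 * xa * xb + b4 * xa * xa

def optimal_scale(beta, xb):
    """For a fixed XB the model is concave in XA when b4 < 0; setting
    dz/dXA = b1 + b3*XB + 2*b4*XA = 0 gives the stationary point."""
    _, b1, _, b3, b4 = beta
    if b4 >= 0:
        raise ValueError("model is not concave in the scale factor")
    return -(b1 + b3 * xb) / (2.0 * b4)

def best_setting(beta, xb_levels):
    """The grouping factor is discrete, so evaluate the per-level optimum
    of XA at each candidate XB and keep the best predicted score."""
    return max((predict(beta, optimal_scale(beta, xb), xb),
                optimal_scale(beta, xb), xb) for xb in xb_levels)

# Hypothetical coefficients, chosen only to make the arithmetic concrete
# (not the values fitted in this work): peak near XA of about 1.24.
beta = (-28.0, 50.0, 0.2, -0.1, -20.0)
```

With coefficients of this shape, the per-video optima cluster tightly around one scale factor, which is what motivates fitting a final model across videos.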
Table 3 presents the optimal values for the scale and grouping factors for each of the six videos.

Table 3. Optimal Values for Scale and Grouping Factors for the Training Videos.

Input Data    Scale Factor (A)    Grouping Factor (B)
Video-1       1.25684             3
Video-2       1.28917             3
Video-3       1.24587             3
Video-4       1.25851             3
Video-5       1.26124             3
Video-6       1.25851             3
A final regression model was built based on these optimal values. The motivation behind this step is to
find values that generalize across a wider range of videos. The values for the scale and grouping factors after this
optimization step were 1.20 and 3, respectively. With the pre-processing flag set to TRUE for faster execution,
these values were used on the test set.
Using the assignment sets, the Multiple Object Detection Precision (MODP) for each frame t can be computed
as:

MODP(t) = \frac{\text{Mapped Overlap Ratio}}{N_{mapped}^{(t)}}    (4)
The Normalized Multiple Object Detection Precision (N-MODP) is computed as the average of the MODP
measure over all the relevant frames in the sequence.
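A minimal sketch of this computation, assuming axis-aligned (x, y, w, h) boxes and that the Mapped Overlap Ratio is the sum of Jaccard overlaps over the mapped pairs:

```python
def jaccard(a, b):
    """Spatial overlap |A ∩ B| / |A ∪ B| for axis-aligned (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def modp(mapped_pairs):
    """MODP for one frame (Equation 4): the summed overlaps of the mapped
    ground-truth/output pairs divided by the number of mapped pairs."""
    if not mapped_pairs:
        return 0.0
    return sum(jaccard(g, d) for g, d in mapped_pairs) / len(mapped_pairs)

def n_modp(per_frame_pairs):
    """N-MODP: MODP averaged over all relevant frames in the sequence."""
    if not per_frame_pairs:
        return 0.0
    return sum(modp(pairs) for pairs in per_frame_pairs) / len(per_frame_pairs)
```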
Average Tracking Accuracy: The Average Tracking Accuracy (ATA) is a spatio-temporal measure which
penalizes fragmentations in both the temporal and spatial dimensions while accounting for the number of objects
detected and tracked, missed objects, and false positives.
The Sequence Track Detection Accuracy (STDA) is defined as:

STDA = \sum_{i=1}^{N_{mapped}} \frac{\sum_{t=1}^{N_{frames}} |G_i^{(t)} \cap D_i^{(t)}| / |G_i^{(t)} \cup D_i^{(t)}|}{N_{(G_i \cup D_i \neq \emptyset)}}    (5)
Analyzing the numerator of Equation 5, we observe that it is merely the overlap of the detected object over
the ground truth, which is very similar to Equation 1. The only difference is that, in tracking we measure the
overlap in the spatio-temporal dimension while in detection the overlap is in the spatial dimension alone.
The Average Tracking Accuracy (ATA) is calculated as the average of STDA over all the unique objects in
the sequence.
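Under the same box conventions as before, a per-track term of Equation 5 can be sketched as follows; the function names are illustrative. Frames where only one of the two tracks exists contribute zero overlap but still enlarge the denominator, which is how temporal fragmentation is penalized.

```python
def jaccard(a, b):
    """Spatial overlap |A ∩ B| / |A ∪ B| for axis-aligned (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def track_term(gt_track, out_track):
    """One summand of STDA (Equation 5): summed frame overlaps of a mapped
    ground-truth/output track pair, normalised by the number of frames in
    which either track exists.  Tracks map frame index -> box."""
    frames = set(gt_track) | set(out_track)
    if not frames:
        return 0.0
    # Frames where only one track exists add zero overlap but still count
    # in the denominator, penalising temporal fragmentation.
    total = sum(jaccard(gt_track[t], out_track[t])
                for t in frames if t in gt_track and t in out_track)
    return total / len(frames)

def stda(mapped_track_pairs):
    """STDA: sum of the per-track terms over all mapped track pairs."""
    return sum(track_term(g, d) for g, d in mapped_track_pairs)

def ata(mapped_track_pairs, n_unique_objects):
    """ATA: STDA averaged over the unique objects in the sequence."""
    return stda(mapped_track_pairs) / n_unique_objects if n_unique_objects else 0.0
```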
Multiple Object Tracking Accuracy: To extract the accuracy aspect of the system output track, we compute
the number of missed detects, false positives, and switches in the system output track for a given reference
ground truth track.
MOTA = 1 - \frac{\sum_{t=1}^{N_{frames}} \left( c_m(m_t) + c_f(fp_t) + \log_{10}(\text{ID-SWITCHES}_t) \right)}{\sum_{t=1}^{N_{frames}} N_G^{(t)}}    (6)

where, after computing the mapping for frame t, m_t is the number of misses, fp_t is the number of false positives,
and ID-SWITCHES_t is the number of ID mismatches in frame t considering the mapping in frame (t − 1). It
should be noted that, because of the log function, we start the ID-SWITCH count at 1.
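A sketch of Equation 6, assuming unit costs c_m = c_f = 1 and interpreting "start the ID-SWITCH count at 1" as taking log10 of (1 + switches) so that zero switches contribute nothing; the authors' exact convention may differ.

```python
import math

def mota(frames, c_m=1.0, c_f=1.0):
    """Multiple Object Tracking Accuracy (Equation 6).  `frames` is a
    sequence of per-frame tuples (misses, false_positives, id_switches,
    n_ground_truth).  The switch term enters through a log10 whose
    argument starts at 1, so zero switches cost nothing (log10(1) = 0)."""
    error = 0.0
    total_gt = 0
    for misses, false_pos, switches, n_gt in frames:
        error += c_m * misses + c_f * false_pos + math.log10(1 + switches)
        total_gt += n_gt
    return 1.0 - error / total_gt if total_gt else 0.0
```

The log compresses the switch penalty, so a track that flickers between two IDs is punished far less than one that misses the object outright.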
Multiple Object Tracking Precision: To obtain the precision score, we calculate the spatio-temporal overlap
between the reference tracks and the system output tracks. The Multiple Object Tracking Precision (MOTP) is
defined as:
MOTP = \frac{\sum_{i=1}^{N_{mapped}} \sum_{t=1}^{N_{frames}^{i}} |G_i^{(t)} \cap D_i^{(t)}| / |G_i^{(t)} \cup D_i^{(t)}|}{\sum_{t=1}^{N_{frames}} N_{mapped}^{(t)}}    (7)

where N_{mapped} refers to the mapped system output objects over an entire reference track, taking into account
splits and merges, and N_{mapped}^{(t)} refers to the number of mapped objects in the t-th frame.
5.2. Results
Figure 2 shows some sample images depicting the detection accuracy of the baseline Haar face detection algorithm.
From a visual inspection, one can observe that the algorithm suffers from both missed detections and false alarms.
Misses primarily arise from the fact that the Viola-Jones face detector mostly detects close-to-frontal,
horizontally aligned faces. False alarms are a consequence of the basic design principle that the classifiers should
trigger a response to any face-like image region. Despite these shortcomings, the Haar face detection algorithm
achieves a respectable detection rate.
Figure 2. Images Showing the Detection Accuracy of the Haar Face Detection Algorithm.
Figure 3 shows the boxplots of the detection and tracking scores for the baseline algorithm. For detection, the
specific measures that we observe are the MODP, MODA, and SFDA, whose mean values are 0.339, 0.423, and
0.797, respectively. While the SFDA measures the detection performance comprehensively and has a high score,
the MODP and MODA are lower. Specifically, the MODP shows that the overall spatial overlap (on the mapped
objects) is about 0.34, and on some clips as high as nearly 0.40. The MODA, which shows how accurate the counts
of the objects are, measures around 0.42 overall and on some clips as high as 0.80. It has to be noted that the
SFDA score is thresholded at 10% spatial overlap, meaning that if the ground-truth box and the system output
box overlap by at least 10% in Equation 1, the overlap is counted as 100%.
For the tracking performance, the specific measures we use are MOTP, MOTA, and ATA. The MOTP and
MOTA are similar to their detection counterparts but are slightly lower. The ATA score averages around 0.23
and on some clips is as high as nearly 0.40.
6. CONCLUSIONS
Though there have been several frameworks for the performance evaluation of various image processing and
computer vision applications, there has not been a significant effort to evaluate the key computer vision topic
of object detection and tracking in video. The objective of our work is to systematically address this problem
and to provide useful resources such as data, metrics, and tools for the evaluation of such detection and tracking
algorithms. To that end, in our past publications we developed performance metrics, testing databases,
and an evaluation strategy to provide a quantitative methodology for performance assessment. In this paper,
we presented a baseline algorithm for face detection and tracking to supply a comparative measure for such
algorithms. From the results, it was observed that the performance of the baseline algorithm is comparable to
the state of the art for face detection and tracking.
This work, together with our earlier papers describing the evaluation protocol, gives a thorough treatment
to every aspect of formal evaluation and provides researchers an invaluable resource to advance research on the
topic of object detection and tracking.
[Figure 3: boxplot of the scores for the metrics N-MODP, N-MODA, MOTP, MOTA, SFDA, and ATA; the vertical axis spans roughly −0.6 to 1.]
Figure 3. Boxplot of the Performance Scores of the Baseline Face Detection and Tracking Algorithm on the Broadcast
News Corpus (+ indicates mean value).
REFERENCES
1. M. Wirth, M. Fraschini, M. Masek, and M. Bruynooghe, "Performance Evaluation in Image Processing," EURASIP Journal on Applied Signal Processing 2006, pp. 1–3, Article ID 45742, 2006.
2. M. Sezgin and B. Sankur, "Survey Over Image Thresholding Techniques and Quantitative Performance Evaluation," Journal of Electronic Imaging 13(1), pp. 146–168, 2004.
3. M. D. Heath, S. Sarkar, T. Sanocki, and K. W. Bowyer, "A Robust Visual Method for Assessing the Relative Performance of Edge-Detection Algorithms," IEEE Transactions on Pattern Analysis and Machine Intelligence 19, pp. 1338–1359, Dec 1997.
4. A. Hoover, G. Jean-Baptiste, X. Jiang, P. J. Flynn, H. Bunke, D. Goldgof, K. Bowyer, D. W. Eggert, A. Fitzgibbon, and R. B. Fisher, "An Experimental Comparison of Range Image Segmentation Algorithms," IEEE Transactions on Pattern Analysis and Machine Intelligence 18(7), pp. 673–689, 1996.
5. K. Mikolajczyk and C. Schmid, "A Performance Evaluation of Local Descriptors," IEEE Transactions on Pattern Analysis and Machine Intelligence 27(10), pp. 1615–1630, 2005.
6. S. Sarkar, P. Phillips, Z. Liu, I. Vega, P. Grother, and K. Bowyer, "The HumanID Gait Challenge Problem: Data Sets, Performance, and Analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence 27, pp. 162–177, February 2005.
7. R. Cappelli, D. Maio, D. Maltoni, J. L. Wayman, and A. K. Jain, "Performance Evaluation of Fingerprint Verification Systems," IEEE Transactions on Pattern Analysis and Machine Intelligence 28, pp. 3–18, Jan 2006.
8. P. J. Phillips, P. J. Flynn, T. Scruggs, K. W. Bowyer, J. Chang, K. Hoffman, J. Marques, J. Min, and W. Worek, "Overview of the Face Recognition Grand Challenge," in Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1, pp. 947–954, 2005.
9. V. Manohar, P. Soundararajan, H. Raju, D. Goldgof, R. Kasturi, and J. Garofolo, "Performance Evaluation of Object Detection and Tracking in Video," in Proceedings of the Seventh Asian Conference on Computer Vision (ACCV), Part II, LNCS 3852, pp. 151–161, Springer, 2006.
10. V. Manohar, P. Soundararajan, M. Boonstra, H. Raju, D. Goldgof, R. Kasturi, and J. Garofolo, "Performance Evaluation of Text Detection and Tracking in Video," in Proceedings of the Seventh International Workshop on Document Analysis Systems (DAS), LNCS 3872, pp. 576–587, Springer, 2006.
11. E. Hjelmås and B. Low, "Face Detection: A Survey," Computer Vision and Image Understanding 83(3), pp. 236–274, 2001.
12. M. Yang, D. Kriegman, and N. Ahuja, "Detecting Faces in Images: A Survey," IEEE Transactions on Pattern Analysis and Machine Intelligence 24(1), pp. 34–58, 2002.
13. P. Viola and M. J. Jones, "Robust Real-Time Face Detection," International Journal of Computer Vision 57(2), pp. 137–154, 2004.
14. R. Lienhart and J. Maydt, "An Extended Set of Haar-like Features for Rapid Object Detection," in Proceedings of the International Conference on Image Processing, pp. 900–903, IEEE, 2002.
15. J. Canny, "A Computational Approach to Edge Detection," IEEE Transactions on Pattern Analysis and Machine Intelligence 8(6), pp. 679–698, 1986.
16. D. C. Montgomery, Design and Analysis of Experiments, John Wiley & Sons, Inc., Hoboken, NJ, USA, sixth ed., 2005.
17. R. Stiefelhagen, K. Bernardin, R. Bowers, J. Garofolo, D. Mostefa, and P. Soundararajan, "The CLEAR 2006 Evaluation," in Multimodal Technologies for Perception of Humans, LNCS 4122, pp. 1–44, Springer, 2006.
18. J. R. Munkres, "Algorithms for the Assignment and Transportation Problems," J. SIAM 5, pp. 32–38, 1957.
19. C. H. Papadimitriou and K. Steiglitz, Combinatorial Optimization: Algorithms and Complexity, Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1982.