1 Introduction
semantic content inherent in most videos. As a result, a long list of search results is expected. Users pinpoint their desired video segment by watching each video from the beginning or by skimming through the video using fast-forward and fast-reverse operations. This search approach is tedious, time-consuming, and resource-intensive.
Content-based video browsing and retrieval is an alternative that allows users to browse and retrieve desired video segments in a non-sequential fashion. Video segmentation divides a video file into basic units called shots, each defined as a contiguous sequence of video frames recorded from a single camera operation [1,2,3,4]. More meaningful high-level aggregates of shots are then generated for browsing and retrieval. This is because (i) users are more likely to recall important events rather than a particular shot or frame [5]; and (ii) there are too many shots in a typical film (about 600-1500 [6]) for effective browsing. Since manual segmentation is very time-consuming (about ten hours of work for one hour of video [6]), recent years have seen a plethora of research on automatic video segmentation techniques.
A typical automatic video segmentation technique consists of three important steps. The first step is shot boundary detection (SBD): a shot boundary is declared if a dissimilarity measure between consecutive frames exceeds a threshold value. Some recent SBD techniques are [3,7,1,2,8,9,10,11,12,13]. The second step is key-frame selection. For each shot, one or more frames that best represent the shot, termed key-frame(s), are extracted to reduce the processing overhead in the next step. For instance, key-frames can be selected from one or more pre-determined locations in a shot; more frames in the same shot can be selected if they are visually different from the previously selected key-frames. Recent key-frame selection techniques include [14,15,2,3,16,17]. The final step is scene segmentation, which groups related shots into a meaningful high-level unit termed a scene in this paper. We focus on this step for a narrative film, a film that tells a story [18]. Most movies are narrative. In narrative films, viewers understand a complex story by identifying important events and associating them by cause and effect, time, and space.
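As an illustration of the SBD step, the following is a minimal sketch of threshold-based boundary detection as described above. The dissimilarity function (e.g., a pairwise color-histogram difference) and all names are placeholders for this sketch, not the measure of any particular cited technique.

    def detect_shot_boundaries(frames, dissimilarity, threshold):
        """Declare a shot boundary wherever the dissimilarity of two
        consecutive frames exceeds the threshold."""
        boundaries = [0]  # the first shot starts at frame 0
        for i in range(1, len(frames)):
            if dissimilarity(frames[i - 1], frames[i]) > threshold:
                boundaries.append(i)  # frame i starts a new shot
        return boundaries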
by crosscutting these events together (i.e., the shot of the plane taking off is followed by the shot of Kevin walking downstairs, followed by the shot of the plane in the air, and so on). Nevertheless, it is still unclear what constitutes an event in these definitions. For instance, when several events happen in the same locale, should each event be considered a scene, or should they all belong to the same scene?
Existing scene segmentation techniques can be divided into techniques using only visual features (e.g., [20,5,22,21]) and those using both visual and audio features (e.g., [23,24,25]). In both categories, visual similarities of entire shots or key-frames (i.e., global features such as global color histograms or color moments) are typically used for clustering shots into scenes. That is, global features of key-frames of nearby shots are compared. If the dissimilarity measure of the shots is within the threshold, these shots and the shots in between them are considered to be in the same scene. Global features, however, tend to be too coarse for shot clustering because they include noise: objects that are excluded when humans group shots into scenes. Designing a shot clustering technique that can decide which areas of video frames should be used to approximate a scene is challenging. Even if an object can be reliably recognized using advanced object recognition techniques, the problem of determining at what time which object should be used for shot clustering still remains.
[Fig. 1. Example result of ToC segmentation: Scene 0 consists of group 0 (shots 0, 3, 5, 7), group 1 (shots 1, 4, 6, 8), and group 2 (shot 2); Scene 1 consists of group 3 (shots 9, 10, 12, 13) and group 4 (shots 11, 14, 15); Scene 2, ...]
Fig. 1 demonstrates the final result of the segmentation. ToC organizes shots into groups and then scenes as follows. Initially, shot 0 is assigned to group 0 and scene 0. The next shot, shot 1, is compared with all existing groups to locate the group that is most similar to this shot. If no such group is sufficiently similar to the shot (i.e., the similarity of the shot and the last shot of the group based on the above criterion is below a pre-determined group threshold), a new group consisting of only this shot is created. Next, the shot is compared with all existing scenes to locate the nearest similar scene. If the most similar scene is not sufficiently similar to the shot (i.e., the similarity measure between the shot and the average similarity of all the groups in the scene is smaller than a pre-determined scene threshold), a new scene is created (e.g., scene 1 has only group 1 at this point). The next shot, shot 2, is then considered. In this example, shot 2 is not similar to any existing groups or scenes, so it is assigned a new group (group 2) and a new scene (scene 2). Shot 3 is considered next and is found similar to group 0; therefore, shot 3 is assigned to group 0. All the groups between this shot and the last shot in group 0 are then considered to be in the same scene as group 0. That is, all shots between shot 0 and shot 3 belong to scene 0, and the temporary scene 1 and scene 2 created previously are removed. Shot 4 is considered next, and a similar process is repeated until all shots in the video have been considered. Our experiments indicate that ToC is likely to generate more false scenes by separating shots of the same scene, because in ToC, noise (visual information of objects not relevant to shot clustering) is included in the global features.
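The following Python sketch illustrates this grouping walk-through. It is our illustrative reconstruction, not the exact algorithm of [20]: the similarity function sim and the group threshold are placeholders, and the scene-threshold test for newly created groups is elided.

    from typing import Callable, List

    def toc_cluster(key_frames: List[object],
                    sim: Callable[[object, object], float],
                    group_thr: float) -> List[List[int]]:
        """Illustrative ToC-style grouping: assign each shot to its most
        similar group, then merge the scenes of all groups spanned by the
        match (placeholder similarity; scene-threshold test elided)."""
        groups: List[List[int]] = []   # each group is a list of shot indices
        scene_of: List[int] = []       # scene id of each group
        next_scene = 0
        for s in range(len(key_frames)):
            # Compare the shot with the last shot of every existing group.
            best = max(range(len(groups)), default=None,
                       key=lambda g: sim(key_frames[s],
                                         key_frames[groups[g][-1]]))
            if best is not None and \
               sim(key_frames[s], key_frames[groups[best][-1]]) >= group_thr:
                prev_last = groups[best][-1]
                groups[best].append(s)
                # Groups whose last shot lies between the matched group's
                # previous shot and this shot join the matched group's scene,
                # removing any temporary scenes created in between.
                for g in range(len(groups)):
                    if prev_last <= groups[g][-1] <= s:
                        scene_of[g] = scene_of[best]
            else:
                groups.append([s])     # new group and (temporary) new scene
                scene_of.append(next_scene)
                next_scene += 1
        # Collect the shots of each remaining scene.
        scenes = {}
        for g, members in enumerate(groups):
            scenes.setdefault(scene_of[g], []).extend(members)
        return [sorted(v) for _, v in sorted(scenes.items())]

On the example of Fig. 1, shots 0, 1, and 2 would first form three temporary scenes; when shot 3 matches group 0, all three groups collapse into scene 0, as described above.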
[Fig. 3. Feature extraction regions: Region 1 (background) consists of a horizontal bar of width w with two upper corners of height h; Region 2 consists of the two lower corners of size Lh × Lw; the frame size is FW × FH in DC coefficients. Features extracted from these regions form the feature vector.]
in the same locale. For instance, in a scene of a party held in a dining room, since the four walls of the room typically have the same shade of color, when the camera changes its focus from one person (in the first shot) to the next person (in the next shot) sitting along a different wall, the backgrounds of the two shots are still likely to have similar colors. In Fig. 4(a), the two corners of Region 1 are used to detect shots taken in the same locale in the presence of a close-up of an object or an object moving toward the camera. Although the middle of the horizontal bar of the background region is disrupted, the two corners in consecutive shots are still likely to be similar since close-ups typically occur in the center of the frame.
[Fig. 4. (a) A primary object in shot i expanded in shot i+1 due to a camera zoom; the upper corners of Region 1 remain comparable across the shots. (b) The primary object appears in the lower corners of the first and second key-frames of consecutive shots in a traveling scene.]
Region 2 in Fig. 3 consists of the lower left and the lower right corners for detecting a simple traveling scene. In Fig. 4(b), in each shot, the main character begins at one corner of a frame and travels to the other corner by the last frame of the same shot. In the next shot, the same character travels in the same direction to maintain the viewers' perception that the character is still traveling. The backgrounds of these two shots tend to be different because the character travels across different locales, but the two lower corners are likely to be similar since they capture the primary character.
The sizes of the regions are calculated as follows. Let FW and FH be the width and the height of a frame in DC coefficients, respectively. Let w and h be the width of the horizontal bar in the background region and the height of the upper corner in DC coefficients, respectively. Lw, Lh, w, and h shown in Fig. 3(a) are computed using the following equations.

    Lh = w + h        (1)
    h = 2w            (2)
    2 Lh = 0.8 FH     (3)
    3 Lw = 0.9 FW     (4)
The area of the lower corner (Lh × Lw) is made slightly larger than the area of the upper corner since the lower corners are for capturing the traveling of primary characters, whereas the upper corners are to exclude close-up objects. Therefore, in Equation (1), Lh is made about w more than h. In Equation (2), h is chosen to be twice w. Equation (3) ensures that the upper corner and the lower corner do not meet vertically, while Equation (4) prevents the two lower corners from covering the center bottom area of the frame horizontally. This avoids including the visual properties of the middle area of the frame, which often captures many objects.
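Since Equations (1)-(4) determine every region dimension from the frame size alone, they can be evaluated directly; a minimal sketch (the rounding behavior is our assumption):

    def region_sizes(fw: int, fh: int):
        """Derive the extraction-region dimensions (in DC coefficients)
        from Equations (1)-(4), given the frame size FW x FH."""
        lh = 0.8 * fh / 2    # Eq. (3): 2*Lh = 0.8*FH
        lw = 0.9 * fw / 3    # Eq. (4): 3*Lw = 0.9*FW
        w = lh / 3           # Eqs. (1)+(2): Lh = w + h = 3w
        h = 2 * w            # Eq. (2)
        return round(lw), round(lh), round(w), round(h)

    # e.g., a 352x240 frame has 44x30 DC coefficients:
    # region_sizes(44, 30) -> (13, 12, 4, 8)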
If shot m, where m = k+1, is found similar to shot k, the two shots represent a non-interacting event, or an interacting event in the same location if the background criterion is satisfied. If the lower corners are used to group these shots together, the two shots are part of a traveling scene. If shot m, where m > k+1, is found similar to shot k, both shots are highly likely to capture (i) the same character in an interacting event (e.g., in a conversation scene, shots k and m focus on the same person in one location while a shot in between them captures another person, possibly in the same or another location) or (ii) the same event in parallel events. We note that the 10% threshold is selected in all three comparisons since it consistently gives better performance than other threshold values.
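The sketch below summarizes how a match between shots k and m is interpreted. The names are illustrative only, and the comparison itself (the three region comparisons with the 10% threshold) is abstracted behind the boolean argument.

    def interpret_match(k: int, m: int, matched_on_lower_corners: bool) -> str:
        """Classify the relation of two shots k and m (k < m) found similar."""
        if m == k + 1:
            if matched_on_lower_corners:
                return "traveling scene"   # primary character in motion
            return "same-locale event"     # background criterion holds
        # Non-adjacent match: same character in an interacting event,
        # or the same thread of parallel (crosscut) events.
        return "interacting or parallel events"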
The input of our clustering algorithm consists of the video file, the total number of frames in the video, the list of shot boundaries (i.e., a list of frame numbers of the first and last frames of each shot), the forward search range (F), and the backward search range (R). Both F and R are used to limit the temporal distance of shots within the same scene. The output of the clustering algorithm is the list of frame numbers, each indicating the beginning frame of a detected scene. The variable CtShotForward is the number of shots between the shot being considered (the current shot) and the last shot of the video, whereas CtShotBack denotes the number of shots from the current shot to the first shot of the video. For ease of presentation, we assume that these two variables are implicitly updated to the correct values as the algorithm progresses.
Step 1: By manipulating the shot boundary information, a shot that lasts less than half a second is combined with its previous shot, if any, and the shot boundaries are updated accordingly. The rationale is that such a very short shot appears too briefly to be meaningful by itself. It is typically the result of an imperfect shot detection technique that creates false boundaries in the presence of fast tilting and panning, a sudden brightness change such as a flashlight, or a large object moving quickly within a shot.
Step 2: The first shot is the current shot. The following steps in Step 2 are repeated until the last shot in the video has been considered.

Step 2a: Forward Comparison: At most min(F, CtShotForward) subsequent shots are compared with the current shot to locate the nearest similar shot (termed the matching shot hereafter), using the feature comparison discussed in Section 3.4. Feature extraction of relevant shots is done on demand since it can be performed very quickly. The extracted features are kept in memory and purged when the associated shot is no longer needed.
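A minimal sketch of Steps 1 and 2a follows, under stated assumptions: shot boundaries are (first_frame, last_frame) pairs, fps is the video frame rate, and similar(a, b) stands for the feature comparison of Section 3.4.

    def merge_short_shots(boundaries, fps):
        """Step 1: fold any shot shorter than half a second into its
        previous shot, updating the boundaries accordingly."""
        merged = []
        for first, last in boundaries:
            if merged and (last - first + 1) < fps / 2:
                merged[-1] = (merged[-1][0], last)  # extend previous shot
            else:
                merged.append((first, last))
        return merged

    def forward_match(cur, shots, F, similar):
        """Step 2a: compare at most min(F, CtShotForward) subsequent shots
        with the current shot; return the index of the nearest similar
        shot (the matching shot), or None if there is no match."""
        limit = min(F, len(shots) - 1 - cur)
        for m in range(cur + 1, cur + 1 + limit):
            if similar(shots[cur], shots[m]):
                return m
        return None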
[Figure: shot comparison flow; (a) a scene with an establishing shot, (b) a scene without establishing shots.]
The selection of shots to compare in the forward comparison and the backtrack comparison was used in SIM [5]. The feature extraction, the comparisons, and the other steps are introduced in this paper. We use the shot selection in SIM since it is easy to implement and uses a small memory space, storing key-frames of only F + R + 2 shots during the entire clustering process (F + R + 1 shots during the forward and the backtrack comparisons and one more for the reestablishing shot check).
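This bounded memory use can be pictured as a small cache holding the features of at most F + R + 2 shots. The sketch below is our illustration of the purge-when-no-longer-needed policy, not code from SIM or ShotWeave.

    from collections import OrderedDict

    class FeatureCache:
        """Keep features of at most F + R + 2 shots; evict the oldest
        entry once the capacity is exceeded."""
        def __init__(self, F: int, R: int):
            self.capacity = F + R + 2
            self.store: OrderedDict = OrderedDict()

        def put(self, shot_id: int, features) -> None:
            self.store[shot_id] = features
            if len(self.store) > self.capacity:
                self.store.popitem(last=False)  # purge the oldest shot

        def get(self, shot_id: int):
            return self.store.get(shot_id)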
4 Performance Evaluation
In this section, the performance of the two techniques is evaluated on two full-length movies, each lasting more than 100 minutes. Let Nc and Nt be the number of correct scene boundaries and the total number of scene boundaries detected by a scene segmentation technique, respectively. Nt includes both correct and false scene boundaries. False scene boundaries do not correspond to any manually segmented scene boundaries. Na denotes the total number of manually segmented scene boundaries. The following metrics are used.

Recall (Hr) = Nc / Na. High recall is desirable; it indicates that the technique is able to uncover most of the scene boundaries judged by humans.

Precision (Hp) = Nc / Nt. High precision is desirable since it indicates that most of the detected scene boundaries are correct.
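These metrics, together with the utility Um used below, can be computed directly. The reported Um values are consistent with a weighted sum of recall and precision using equal weights of 0.5; this sketch assumes that form, so the weights are our assumption.

    def evaluate(nc: int, nt: int, na: int,
                 w_recall: float = 0.5, w_precision: float = 0.5):
        """Compute recall Hr, precision Hp, and a weighted utility Um
        (equal weights assumed)."""
        hr = nc / na   # recall
        hp = nc / nt   # precision
        um = w_recall * hr + w_precision * hp
        return hr, hp, um

    # e.g., ToC at scene threshold 0.8, with Na = 62 manually segmented
    # boundaries (implied by the tables):
    # evaluate(11, 298, 62) -> approximately (0.177, 0.037, 0.107)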
When the scene threshold equals 0.8, ToC gives the highest utility (i.e., Um = 0.107). This value was therefore selected as the best scene threshold and used for determining the best group threshold.
Table 2. Performance of ToC with varying scene thresholds (st); Rt is the running time in seconds.

st    Nc   Nt    Hr     Hp     Um     Rt
0.5    7   254   0.113  0.028  0.070  96
0.6    9   268   0.145  0.034  0.090  96
0.7    9   283   0.145  0.032  0.088  97
0.8   11   298   0.177  0.037  0.107  96
0.9   11   337   0.177  0.033  0.105  96
1.0   11   369   0.177  0.030  0.104  96
Table 3 shows the performance of ToC using a scene threshold of 0.8 and different group thresholds. When the group threshold is 1.6, the highest utility and recall are achieved. Beyond this threshold, the number of correct scene boundaries does not improve, but the number of false boundaries increases, as indicated by a drop in precision. Therefore, we select the best group and scene thresholds for ToC to be 1.6 and 0.8, respectively.
Table 3. Performance of ToC with varying group thresholds (gt) at a scene threshold of 0.8.

gt    Nc   Nt    Hr     Hp     Um     Rt
0.9    7   287   0.113  0.024  0.069   66
1.0    8   337   0.129  0.024  0.076   73
1.25  11   298   0.177  0.037  0.107   96
1.6   12   353   0.194  0.034  0.114  149
1.8   12   377   0.194  0.032  0.113  188
2.0   10   393   0.161  0.025  0.093  232
Table 4 illustrates the performance of ToC on the two test videos using these best thresholds. The results are different from those reported in the original work [20] because a stricter scene definition and different test videos were used. In addition, the test videos in the original work were much shorter. The longer the video, the higher the probability that different types of camera motions and filming techniques are used, affecting the effectiveness of the technique.
Table 4. Performance of ToC on the two test videos using the best thresholds.

Video         Nc   Nt    Hr     Hp     Um     Rt
Home Alone    12   353   0.194  0.034  0.114  149
Far and Away   8   418   0.138  0.019  0.078  164
ShotWeave achieves the best utility when F equals two because of the very high recall of 0.71. In other words, about 71% of the correct scene boundaries are uncovered by ShotWeave. However, the number of detected scenes is also high. As F increases, recall drops and precision increases. To compare the performance of ShotWeave with that of ToC, an F of 3 was chosen since it gives the second highest utility with fewer total scenes detected. When comparing the scene boundaries detected by ShotWeave with those detected manually, we observe that if F could be dynamically determined based on the number of participating characters in an interacting event, the performance of the technique might be further improved. For instance, if there are three people participating in an interacting event, an F of 2 is too limited because it takes at least three shots, each capturing one of the persons, to convey the idea that these persons are interacting.
Table 5. Performance of ShotWeave with varying forward search range F.

F    Nc   Nt    Hr     Hp     Um     Rt
2    44   428   0.709  0.102  0.406  6.14
3    32   216   0.516  0.148  0.332  6.25
4    25   148   0.403  0.168  0.286  6.32
5    23   106   0.371  0.216  0.293  6.45
6    16    63   0.258  0.253  0.256  6.26
7    14    48   0.225  0.291  0.258  6.5
8     9    37   0.145  0.243  0.194  6.61
Table 6 depicts the results when R was varied while F was fixed at 3. The results indicate that an R of 2 offers the highest precision while the same recall is maintained. Thus, (F, R) of (3, 2) was selected as the best parameter setting for ShotWeave. Table 7 demonstrates the results of ShotWeave on both videos using these best (F, R) values.
Table 6. Performance of ShotWeave with varying backward search range R (F = 3).

R    Nc   Nt    Hr     Hp     Um     Rt
1    32   216   0.516  0.148  0.332  6.25
2    32   205   0.516  0.156  0.336  7.7
Table 7. Performance of ShotWeave on the two test videos using (F, R) = (3, 2).

Video         Nc   Nt    Hr     Hp     Um     Rt
Home Alone    32   205   0.516  0.156  0.336  7.7
Far and Away  36   249   0.621  0.144  0.383  6.47
The two techniques, using their best parameters, are compared in terms of segmentation accuracy and segmentation time. The results are shown in Table 8. ShotWeave outperforms ToC in all four metrics, offering as much as about five times higher recall and precision than ToC on the test videos. Furthermore, ShotWeave takes much less time than ToC to identify scene boundaries. Note that the time for feature extraction was included in the segmentation time for ShotWeave, but not for ToC. The short running time of less than 10 seconds allows ShotWeave to be run on the fly after users specify their desired weights for recall and precision.
When the detected scenes are analyzed, several scenes, each consisting of a single shot, are found. These single-shot scenes are in fact establishing shots of the nearest subsequent scene consisting of more than one shot. In several cases, these establishing shots are not visually similar to any of the shots in the scene, causing a reduction in the precision of ShotWeave.
5 Concluding Remarks
Our results with ShotWeave suggest that the use of visual properties alone will not improve its performance much further, since establishing and reestablishing shots are not visually similar to the rest of the shots in the same scene. We are investigating the use of sound together with ShotWeave to improve the segmentation performance.
References
1. Zhang, H.J., Kankanhalli, A., Smoliar, S.W.: Automatic partitioning of full-motion video. ACM Multimedia Systems 1 (1993) 10–28
2. Zhang, H.J., Wu, J.H., Zhong, D., Smoliar, S.: Video parsing, retrieval and browsing: An integrated and content-based solution. Pattern Recognition (Special Issue on Image Databases) 30 (1997) 643–658
3. Zhuang, Y., Rui, Y., Huang, T.S., Mehrotra, S.: Adaptive key frame extraction using unsupervised clustering. In: Proc. of Int'l Conf. on Image Processing, Chicago, IL (1998) 866–870
4. Oh, J., Hua, K.A., Liang, N.: A content-based scene change detection and classification technique using background tracking. In: Proc. of SPIE. (1999)
5. Hanjalic, A., Lagendijk, R.L.: Automated high-level movie segmentation for advanced video-retrieval systems. IEEE Transactions on Circuits and Systems for Video Technology 9 (1999) 580–588
6. Bimbo, A.D.: Content-based Video Retrieval. Morgan Kaufmann Publishers, Inc., San Francisco, CA (1999)
7. Aigrain, P., Joly, P.: The automatic real-time analysis of film editing and transition effects and its applications. Computers and Graphics 18 (1994) 93–103
8. Zhang, H.J., Low, C.Y., Smoliar, S.W.: Video parsing and browsing using compressed data. Multimedia Tools and Applications 1 (1995) 89–111
9. Yeo, B.L., Liu, B.: Rapid scene analysis on compressed video. IEEE Transactions on Circuits and Systems for Video Technology 5 (1995) 533–544
10. Shin, T., Kim, J.G., Lee, H., Kim, J.: A hierarchical scene change detection in an MPEG-2 compressed video sequence. In: Proc. of IEEE Int'l Symposium on Circuits and Systems. Volume 4. (1998) 253–256
11. Gamaz, N., Huang, X., Panchanathan, S.: Scene change detection in MPEG domain. In: Proc. of IEEE Southwest Symposium on Image Analysis and Interpretation. (1998) 12–17
12. Dawood, A.M., Ghanbari, M.: Clear scene cut detection directly from MPEG bit streams. In: Proc. of IEEE Int'l Conf. on Image Processing and Its Applications. Volume 1. (1999) 286–289
13. Nang, J., Hong, S., Ihm, Y.: An efficient video segmentation scheme for MPEG video stream using macroblock information. In: Proc. of ACM MM'99, Orlando, FL (1999) 23–26
14. Xiong, W., Lee, J.C.M., Ma, R.: Automatic video data structuring through shot partitioning. Machine Vision and Applications 10 (1997) 51–65
15. Ferman, A.M., Tekalp, A.M.: Multiscale content extraction and representation for video indexing. In: Multimedia Storage and Archival Systems, Dallas, TX (1997)
16. Wolf, W.: Key frame selection by motion analysis. In: Proc. of IEEE Int'l Conf. on Image Processing, U.S.A. (1996) 1228–1231
17. Girgensohn, A., Boreczky, J.: Time-constrained keyframe selection technique. In: Proc. of Int'l Conf. on Multimedia Computing and Systems, Florence, Italy (1999) 756–761
18. Bordwell, D., Thompson, K.: Film Art: An Introduction. McGraw-Hill Companies, Inc. (1997)
19. Oh, J., Hua, K.A.: Efficient and cost-effective techniques for browsing and indexing large video databases. In: ACM SIGMOD. (2000)
20. Rui, Y., Huang, T.S., Mehrotra, S.: Constructing table-of-content for videos. Multimedia Systems 7 (1999) 359–368
21. Corridoni, J.M., Bimbo, A.D.: Structured representation and automatic indexing of movie information content. Pattern Recognition 31 (1998) 2027–2045
22. Yeung, M.M., Liu, B.: Efficient matching and clustering of video shots. In: Proc. of IEEE Int'l Conf. on Image Processing. Volume 1., U.S.A. (1995) 338–341
23. Sundaram, H., Chang, S.F.: Determining computable scenes in films and their structures using audio-visual memory models. In: Proc. of ACM Multimedia'00, Los Angeles, CA, USA (2000) 95–104
24. Sundaram, H., Chang, S.F.: Video scene segmentation using audio and video features. In: Proc. of IEEE ICME, New York, NY, USA (2000)
25. Adams, B., Dorai, C., Venkatesh, S.: Novel approach to determining tempo and dramatic story sections in motion pictures. In: Proc. of IEEE ICME. (2000) 283–286
26. Zhou, J.: ShotWeave: A shot clustering technique for story browsing for large video databases. M.S. Thesis, Iowa State University (2001)