1 Introduction
semantic content inherent in most videos. As a result, a long list of search results is expected. Users pinpoint their desired video segment by watching each video from the beginning or by skimming through the video using fast-forward and fast-reverse operations. This search approach is tedious, time-consuming, and resource-intensive.
Content-based video browsing and retrieval is an alternative that allows users to browse and retrieve desired video segments in a non-sequential fashion. Video segmentation divides a video file into basic units called shots, each defined as a contiguous sequence of video frames recorded from a single camera operation [1,2,3,4]. More meaningful high-level aggregates of shots are then generated for browsing and retrieval. This is because (i) users are more likely to recall important events rather than a particular shot or frame [5]; and (ii) there are too many shots in a typical film (about 600-1500 [6]) for effective browsing. Since manual segmentation is very time-consuming (about ten hours of work for one hour of video [6]), recent years have seen a plethora of research on automatic video segmentation techniques.
A typical automatic video segmentation technique consists of three important steps. The first step is shot boundary detection (SBD): a shot boundary is declared if a dissimilarity measure between consecutive frames exceeds a threshold value. Some recent SBD techniques are [3,7,1,2,8,9,10,11,12,13]. The second step is key-frame selection. For each shot, one or more frames that best represent the shot, termed key-frame(s), are extracted to reduce the processing overhead in the next step. For instance, key-frames can be selected from one or more pre-determined locations in a shot; more frames in the same shot can be selected if they are visually different from the previously selected key-frames. Recent key-frame selection techniques include [14,15,2,3,16,17]. The final step is scene segmentation, which groups related shots into a meaningful high-level unit termed a scene in this paper. We focus on this step for a narrative film, a film that tells a story [18]. Most movies are narrative. In narrative films, viewers understand a complex story by identifying important events and associating them by cause and effect, time, and space.
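As an illustration of the SBD step, the following is a minimal sketch of threshold-based boundary detection as described above. The dissimilarity function (e.g., a pairwise color-histogram difference) and all names are placeholders for this sketch, not the measure of any particular cited technique.

    def detect_shot_boundaries(frames, dissimilarity, threshold):
        """Declare a shot boundary wherever the dissimilarity of two
        consecutive frames exceeds the threshold."""
        boundaries = [0]  # the first shot starts at frame 0
        for i in range(1, len(frames)):
            if dissimilarity(frames[i - 1], frames[i]) > threshold:
                boundaries.append(i)  # frame i starts a new shot
        return boundaries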
by crosscutting these events together (i.e., the shot of the plane taking off is followed by the shot of Kevin walking downstairs, followed by the shot of the plane in the air, and so on). Nevertheless, it is still unclear what constitutes an event in these definitions. For instance, when several events happen in the same locale, should each event be considered a scene, or should they all belong to the same scene?
Existing scene segmentation techniques can be divided into techniques using only visual features (e.g., [20,5,22,21]) and those using both visual and audio features (e.g., [23,24,25]). In both categories, visual similarities of entire shots or key-frames (i.e., global features such as global color histograms or color moments) are typically used for clustering shots into scenes. That is, global features of key-frames of nearby shots are compared. If the dissimilarity measure of the shots is within the threshold, these shots and the shots in between them are considered to be in the same scene. Global features, however, tend to be too coarse for shot clustering because they include noise: objects that are excluded when humans group shots into scenes. Designing a shot clustering technique that can decide which areas of video frames should be used to approximate a scene is challenging. Even if an object can be reliably recognized using advanced object recognition techniques, the problem of determining at what time which object should be used for shot clustering still remains.
[Fig. 1. Example result of ToC segmentation: Scene 0 consists of group 0 (shots 0, 3, 5, 7), group 1 (shots 1, 4, 6, 8), and group 2 (shot 2); Scene 1 consists of group 3 (shots 9, 10, 12, 13) and group 4 (shots 11, 14, 15); Scene 2, ...]
Fig. 1 demonstrates the final result of the segmentation. ToC organizes shots into groups and then scenes as follows. Initially, shot 0 is assigned to group 0 and scene 0. The next shot, shot 1, is compared with all existing groups to locate the group that is most similar to this shot. If no such group is sufficiently similar to the shot (i.e., the similarity of the shot and the last shot of the group based on the above criterion is below a pre-determined group threshold), a new group consisting of only this shot is created. Next, the shot is compared with all existing scenes to locate the nearest similar scene. If the most similar scene is not sufficiently similar to the shot (i.e., the similarity measure between the shot and the average similarity of all the groups in the scene is smaller than a pre-determined scene threshold), a new scene is created (e.g., scene 1 has only group 1 at this point). The next shot, shot 2, is then considered. In this example, shot 2 is not similar to any existing groups or scenes, so it is assigned a new group (group 2) and a new scene (scene 2). Shot 3 is considered next and is found similar to group 0; therefore, shot 3 is assigned to group 0. All the groups between this shot and the last shot in group 0 are then considered to be in the same scene as group 0. That is, all shots between shot 0 and shot 3 belong to scene 0, and the temporary scene 1 and scene 2 created previously are removed. Shot 4 is considered next, and a similar process is repeated until all shots in the video have been considered. Our experiments indicate that ToC is likely to generate more false scenes by separating shots of the same scene, because in ToC, noise (visual information of objects not relevant to shot clustering) is included in the global features.
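The following Python sketch illustrates this grouping walk-through. It is our illustrative reconstruction, not the exact algorithm of [20]: the similarity function sim and the group threshold are placeholders, and the scene-threshold test for newly created groups is elided.

    from typing import Callable, List

    def toc_cluster(key_frames: List[object],
                    sim: Callable[[object, object], float],
                    group_thr: float) -> List[List[int]]:
        """Illustrative ToC-style grouping: assign each shot to its most
        similar group, then merge the scenes of all groups spanned by the
        match (placeholder similarity; scene-threshold test elided)."""
        groups: List[List[int]] = []   # each group is a list of shot indices
        scene_of: List[int] = []       # scene id of each group
        next_scene = 0
        for s in range(len(key_frames)):
            # Compare the shot with the last shot of every existing group.
            best = max(range(len(groups)), default=None,
                       key=lambda g: sim(key_frames[s],
                                         key_frames[groups[g][-1]]))
            if best is not None and \
               sim(key_frames[s], key_frames[groups[best][-1]]) >= group_thr:
                prev_last = groups[best][-1]
                groups[best].append(s)
                # Groups whose last shot lies between the matched group's
                # previous shot and this shot join the matched group's scene,
                # removing any temporary scenes created in between.
                for g in range(len(groups)):
                    if prev_last <= groups[g][-1] <= s:
                        scene_of[g] = scene_of[best]
            else:
                groups.append([s])     # new group and (temporary) new scene
                scene_of.append(next_scene)
                next_scene += 1
        # Collect the shots of each remaining scene.
        scenes = {}
        for g, members in enumerate(groups):
            scenes.setdefault(scene_of[g], []).extend(members)
        return [sorted(v) for _, v in sorted(scenes.items())]

On the example of Fig. 1, shots 0, 1, and 2 would first form three temporary scenes; when shot 3 matches group 0, all three groups collapse into scene 0, as described above.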
[Fig. 3. Feature extraction regions: Region 1 (background) consists of a horizontal bar of width w with two upper corners of height h; Region 2 consists of the two lower corners of size Lh × Lw; the frame size is FW × FH in DC coefficients. Features extracted from these regions form the feature vector.]
in the same locale. For instance, in a scene of a party held in a dining room, since the four walls of the room typically have the same shade of color, when the camera changes its focus from one person (in the first shot) to the next person (in the next shot) sitting along a different wall, the backgrounds of the two shots are still likely to have similar colors. In Fig. 4(a), the two corners of Region 1 are used to detect shots taken in the same locale in the presence of a close-up of an object or an object moving toward the camera. Although the middle of the horizontal bar of the background region is disrupted, the two corners in consecutive shots are still likely to be similar since close-ups typically occur in the center of the frame.
[Fig. 4. (a) A primary object in shot i expanded in shot i+1 due to a camera zoom; the upper corners of Region 1 remain comparable across the shots. (b) The primary object appears in the lower corners of the first and second key-frames of consecutive shots in a traveling scene.]
Region 2 in Fig. 3 consists of the lower left and the lower right corners for detecting a simple traveling scene. In Fig. 4(b), in each shot, the main character begins at one corner of a frame and travels to the other corner by the last frame of the same shot. In the next shot, the same character travels in the same direction to maintain the viewers' perception that the character is still traveling. The backgrounds of these two shots tend to be different because the character travels across different locales, but the two lower corners are likely to be similar since they capture the primary character.
The sizes of the regions are calculated as follows. Let FW and FH be the width and the height of a frame in DC coefficients, respectively. Let w and h be the width of the horizontal bar in the background region and the height of the upper corner in DC coefficients, respectively. Lw, Lh, w, and h shown in Fig. 3(a) are computed using the following equations.

    Lh = w + h        (1)
    h = 2w            (2)
    2 Lh = 0.8 FH     (3)
    3 Lw = 0.9 FW     (4)
The area of the lower corner (Lh × Lw) is made slightly larger than the area of the upper corner since the lower corners are for capturing the traveling of primary characters, whereas the upper corners are to exclude close-up objects. Therefore, in Equation (1), Lh is made about w more than h. In Equation (2), h is chosen to be twice w. Equation (3) ensures that the upper corner and the lower corner do not meet vertically, while Equation (4) prevents the two lower corners from covering the center bottom area of the frame horizontally. This avoids including the visual properties of the middle area of the frame, which often captures many objects.
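Since Equations (1)-(4) determine every region dimension from the frame size alone, they can be evaluated directly; a minimal sketch (the rounding behavior is our assumption):

    def region_sizes(fw: int, fh: int):
        """Derive the extraction-region dimensions (in DC coefficients)
        from Equations (1)-(4), given the frame size FW x FH."""
        lh = 0.8 * fh / 2    # Eq. (3): 2*Lh = 0.8*FH
        lw = 0.9 * fw / 3    # Eq. (4): 3*Lw = 0.9*FW
        w = lh / 3           # Eqs. (1)+(2): Lh = w + h = 3w
        h = 2 * w            # Eq. (2)
        return round(lw), round(lh), round(w), round(h)

    # e.g., a 352x240 frame has 44x30 DC coefficients:
    # region_sizes(44, 30) -> (13, 12, 4, 8)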
If shot m, where m = k+1, is found similar to shot k, the two shots represent a non-interacting event, or an interacting event in the same location if the background criterion is satisfied. If the lower corners are used to group these shots together, the two shots are part of a traveling scene. If shot m, where m > k+1, is found similar to shot k, both shots are highly likely to capture (i) the same character in an interacting event (e.g., in a conversation scene, shots k and m focus on the same person in one location while a shot in between them captures another person, possibly in the same or another location) or (ii) the same event in parallel events. We note that the 10% threshold is selected in all three comparisons since it consistently gives better performance than other threshold values.
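The sketch below summarizes how a match between shots k and m is interpreted. The names are illustrative only, and the comparison itself (the three region comparisons with the 10% threshold) is abstracted behind the boolean argument.

    def interpret_match(k: int, m: int, matched_on_lower_corners: bool) -> str:
        """Classify the relation of two shots k and m (k < m) found similar."""
        if m == k + 1:
            if matched_on_lower_corners:
                return "traveling scene"   # primary character in motion
            return "same-locale event"     # background criterion holds
        # Non-adjacent match: same character in an interacting event,
        # or the same thread of parallel (crosscut) events.
        return "interacting or parallel events"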
The input of our clustering algorithm consists of the video file, the total number of frames in the video, the list of shot boundaries (i.e., a list of frame numbers of the first and last frames of each shot), the forward search range (F), and the backward search range (R). Both F and R are used to limit the temporal distance of shots within the same scene. The output of the clustering algorithm is the list of frame numbers, each indicating the beginning frame of a detected scene. The variable CtShotForward is the number of shots between the shot being considered (the current shot) and the last shot of the video, whereas CtShotBack denotes the number of shots from the current shot to the first shot of the video. For ease of presentation, we assume that these two variables are implicitly updated to the correct values as the algorithm progresses.
Step 1: By manipulating the shot boundary information, a shot that lasts less than half a second is combined with its previous shot, if any, and the shot boundaries are updated accordingly. The rationale is that such a very short shot appears too briefly to be meaningful by itself. It is typically the result of an imperfect shot detection technique that creates false boundaries in the presence of fast tilting and panning, a sudden brightness change such as a flashlight, or a large object moving quickly within a shot.
Step 2: The first shot is the current shot. The following steps in Step 2 are repeated until the last shot in the video has been considered.

Step 2a: Forward Comparison: At most min(F, CtShotForward) subsequent shots are compared with the current shot to locate the nearest similar shot (termed the matching shot hereafter), using the feature comparison discussed in Section 3.4. Feature extraction of relevant shots is done on demand since it can be performed very quickly. The extracted features are kept in memory and purged when the associated shot is no longer needed.
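A minimal sketch of Steps 1 and 2a follows, under stated assumptions: shot boundaries are (first_frame, last_frame) pairs, fps is the video frame rate, and similar(a, b) stands for the feature comparison of Section 3.4.

    def merge_short_shots(boundaries, fps):
        """Step 1: fold any shot shorter than half a second into its
        previous shot, updating the boundaries accordingly."""
        merged = []
        for first, last in boundaries:
            if merged and (last - first + 1) < fps / 2:
                merged[-1] = (merged[-1][0], last)  # extend previous shot
            else:
                merged.append((first, last))
        return merged

    def forward_match(cur, shots, F, similar):
        """Step 2a: compare at most min(F, CtShotForward) subsequent shots
        with the current shot; return the index of the nearest similar
        shot (the matching shot), or None if there is no match."""
        limit = min(F, len(shots) - 1 - cur)
        for m in range(cur + 1, cur + 1 + limit):
            if similar(shots[cur], shots[m]):
                return m
        return None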
[Figure: shot comparison flow; (a) a scene with an establishing shot, (b) a scene without establishing shots.]
The selection of shots to compare in the forward comparison and the backtrack comparison was used in SIM [5]. The feature extraction, the comparisons, and the other steps are introduced in this paper. We use the shot selection in SIM since it is easy to implement and uses a small memory space, storing key-frames of only F + R + 2 shots during the entire clustering process (F + R + 1 shots during the forward and the backtrack comparisons and one more for the reestablishing shot check).
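This bounded memory use can be pictured as a small cache holding the features of at most F + R + 2 shots. The sketch below is our illustration of the purge-when-no-longer-needed policy, not code from SIM or ShotWeave.

    from collections import OrderedDict

    class FeatureCache:
        """Keep features of at most F + R + 2 shots; evict the oldest
        entry once the capacity is exceeded."""
        def __init__(self, F: int, R: int):
            self.capacity = F + R + 2
            self.store: OrderedDict = OrderedDict()

        def put(self, shot_id: int, features) -> None:
            self.store[shot_id] = features
            if len(self.store) > self.capacity:
                self.store.popitem(last=False)  # purge the oldest shot

        def get(self, shot_id: int):
            return self.store.get(shot_id)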
4 Performance Evaluation
In this section, the performance of the two techniques is evaluated on two full-length movies, each lasting more than 100 minutes. Let Nc and Nt be the number of correct scene boundaries and the total number of scene boundaries detected by a scene segmentation technique, respectively. Nt includes both correct and false scene boundaries. False scene boundaries do not correspond to any manually segmented scene boundaries. Na denotes the total number of manually segmented scene boundaries. The following metrics are used.

Recall (Hr) = Nc / Na. High recall is desirable; it indicates that the technique is able to uncover most of the scene boundaries judged by humans.

Precision (Hp) = Nc / Nt. High precision is desirable since it indicates that most of the detected scene boundaries are correct.
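These metrics, together with the utility Um used below, can be computed directly. The reported Um values are consistent with a weighted sum of recall and precision using equal weights of 0.5; this sketch assumes that form, so the weights are our assumption.

    def evaluate(nc: int, nt: int, na: int,
                 w_recall: float = 0.5, w_precision: float = 0.5):
        """Compute recall Hr, precision Hp, and a weighted utility Um
        (equal weights assumed)."""
        hr = nc / na   # recall
        hp = nc / nt   # precision
        um = w_recall * hr + w_precision * hp
        return hr, hp, um

    # e.g., ToC at scene threshold 0.8, with Na = 62 manually segmented
    # boundaries (implied by the tables):
    # evaluate(11, 298, 62) -> approximately (0.177, 0.037, 0.107)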
When the scene threshold equals 0.8, ToC gives the highest utility (i.e., Um = 0.107). This value was therefore selected as the best scene threshold and used for determining the best group threshold.
Table 2. Performance of ToC with varying scene thresholds (st); Rt is the running time in seconds.

st    Nc   Nt    Hr     Hp     Um     Rt
0.5    7   254   0.113  0.028  0.070  96
0.6    9   268   0.145  0.034  0.090  96
0.7    9   283   0.145  0.032  0.088  97
0.8   11   298   0.177  0.037  0.107  96
0.9   11   337   0.177  0.033  0.105  96
1.0   11   369   0.177  0.030  0.104  96
Table 3 shows the performance of ToC using a scene threshold of 0.8 and different group thresholds. When the group threshold is 1.6, the highest utility and recall are achieved. Beyond this threshold, the number of correct scene boundaries does not improve, but the number of false boundaries increases, as indicated by a drop in precision. Therefore, we select the best group and scene thresholds for ToC to be 1.6 and 0.8, respectively.
Table 3. Performance of ToC with varying group thresholds (gt) at a scene threshold of 0.8.

gt    Nc   Nt    Hr     Hp     Um     Rt
0.9    7   287   0.113  0.024  0.069   66
1.0    8   337   0.129  0.024  0.076   73
1.25  11   298   0.177  0.037  0.107   96
1.6   12   353   0.194  0.034  0.114  149
1.8   12   377   0.194  0.032  0.113  188
2.0   10   393   0.161  0.025  0.093  232
Table 4 illustrates the performance of ToC on the two test videos using these best thresholds. The results are different from those reported in the original work [20] because a stricter scene definition and different test videos were used. In addition, the test videos in the original work were much shorter. The longer the video, the higher the probability that different types of camera motions and filming techniques are used, affecting the effectiveness of the technique.
Table 4. Performance of ToC on the two test videos using the best thresholds.

Video         Nc   Nt    Hr     Hp     Um     Rt
Home Alone    12   353   0.194  0.034  0.114  149
Far and Away   8   418   0.138  0.019  0.078  164
ShotWeave achieves the best utility when F equals two because of the very high recall of 0.71. In other words, about 71% of the correct scene boundaries are uncovered by ShotWeave. However, the number of detected scenes is also high. As F increases, recall drops and precision increases. To compare the performance of ShotWeave with that of ToC, an F of 3 was chosen since it gives the second highest utility with fewer total scenes detected. When comparing the scene boundaries detected by ShotWeave with those detected manually, we observe that if F could be dynamically determined based on the number of participating characters in an interacting event, the performance of the technique might be further improved. For instance, if there are three people participating in an interacting event, an F of 2 is too limited because it takes at least three shots, each capturing one of the persons, to convey the idea that these persons are interacting.
Table 5. Performance of ShotWeave with varying forward search range F.

F    Nc   Nt    Hr     Hp     Um     Rt
2    44   428   0.709  0.102  0.406  6.14
3    32   216   0.516  0.148  0.332  6.25
4    25   148   0.403  0.168  0.286  6.32
5    23   106   0.371  0.216  0.293  6.45
6    16    63   0.258  0.253  0.256  6.26
7    14    48   0.225  0.291  0.258  6.5
8     9    37   0.145  0.243  0.194  6.61
Table 6 depicts the results when R was varied while F was fixed at 3. The results indicate that an R of 2 offers the highest precision while the same recall is maintained. Thus, (F, R) of (3, 2) was selected as the best parameter setting for ShotWeave. Table 7 demonstrates the results of ShotWeave on both videos using these best (F, R) values.
Table 6. Performance of ShotWeave with varying backward search range R (F = 3).

R    Nc   Nt    Hr     Hp     Um     Rt
1    32   216   0.516  0.148  0.332  6.25
2    32   205   0.516  0.156  0.336  7.7
Table 7. Performance of ShotWeave on the two test videos using (F, R) = (3, 2).

Video         Nc   Nt    Hr     Hp     Um     Rt
Home Alone    32   205   0.516  0.156  0.336  7.7
Far and Away  36   249   0.621  0.144  0.383  6.47
The two techniques, using their best parameters, are compared in terms of segmentation accuracy and segmentation time. The results are shown in Table 8. ShotWeave outperforms ToC in all four metrics, offering as much as about five times higher recall and precision than ToC on the test videos. Furthermore, ShotWeave takes much less time than ToC to identify scene boundaries. Note that the time for feature extraction was included in the segmentation time for ShotWeave, but not for ToC. The short running time of less than 10 seconds allows ShotWeave to be run on the fly after users specify their desired weights for recall and precision.
When the detected scenes are analyzed, several scenes, each consisting of a single shot, are found. These single-shot scenes are in fact establishing shots of the nearest subsequent scene consisting of more than one shot. In several cases, these establishing shots are not visually similar to any of the shots in the scene, causing a reduction in the precision of ShotWeave.
5 Concluding Remarks
Our results with ShotWeave suggest that the use of visual properties alone will not improve its performance much further, since establishing and reestablishing shots are not visually similar to the rest of the shots in the same scene. We are investigating the use of sound together with ShotWeave to improve the segmentation performance.
References
1. Zhang, H.J., Kankanhalli, A., Smoliar, S.W.: Automatic partitioning of full-motion video. ACM Multimedia Systems 1 (1993) 10–28
2. Zhang, H.J., Wu, J.H., Zhong, D., Smoliar, S.: Video parsing, retrieval and browsing: An integrated and content-based solution. Pattern Recognition (Special Issue on Image Databases) 30 (1997) 643–658
3. Zhuang, Y., Rui, Y., Huang, T.S., Mehrotra, S.: Adaptive key frame extraction using unsupervised clustering. In: Proc. of Int'l Conf. on Image Processing, Chicago, IL (1998) 866–870
4. Oh, J., Hua, K.A., Liang, N.: A content-based scene change detection and classification technique using background tracking. In: Proc. of SPIE. (1999)
5. Hanjalic, A., Lagendijk, R.L.: Automated high-level movie segmentation for advanced video-retrieval systems. IEEE Transactions on Circuits and Systems for Video Technology 9 (1999) 580–588
6. Bimbo, A.D.: Content-based Video Retrieval. Morgan Kaufmann Publishers, Inc., San Francisco, CA (1999)
7. Aigrain, P., Joly, P.: The automatic real-time analysis of film editing and transition effects and its applications. Computers and Graphics 18 (1994) 93–103
8. Zhang, H.J., Low, C.Y., Smoliar, S.W.: Video parsing and browsing using compressed data. Multimedia Tools and Applications 1 (1995) 89–111
9. Yeo, B.L., Liu, B.: Rapid scene analysis on compressed video. IEEE Transactions on Circuits and Systems for Video Technology 5 (1995) 533–544
10. Shin, T., Kim, J.G., Lee, H., Kim, J.: A hierarchical scene change detection in an MPEG-2 compressed video sequence. In: Proc. of IEEE Int'l Symposium on Circuits and Systems. Volume 4. (1998) 253–256
11. Gamaz, N., Huang, X., Panchanathan, S.: Scene change detection in MPEG domain. In: Proc. of IEEE Southwest Symposium on Image Analysis and Interpretation. (1998) 12–17
12. Dawood, A.M., Ghanbari, M.: Clear scene cut detection directly from MPEG bit streams. In: Proc. of IEEE Int'l Conf. on Image Processing and Its Applications. Volume 1. (1999) 286–289
13. Nang, J., Hong, S., Ihm, Y.: An efficient video segmentation scheme for MPEG video stream using macroblock information. In: Proc. of ACM MM'99, Orlando, FL (1999) 23–26
14. Xiong, W., Lee, J.C.M., Ma, R.: Automatic video data structuring through shot partitioning. Machine Vision and Applications 10 (1997) 51–65
15. Ferman, A.M., Tekalp, A.M.: Multiscale content extraction and representation for video indexing. In: Multimedia Storage and Archival Systems, Dallas, TX (1997)
16. Wolf, W.: Key frame selection by motion analysis. In: Proc. of IEEE Int'l Conf. on Image Processing, U.S.A. (1996) 1228–1231
17. Girgensohn, A., Boreczky, J.: Time-constrained keyframe selection technique. In: Proc. of Int'l Conf. on Multimedia Computing and Systems, Florence, Italy (1999) 756–761
18. Bordwell, D., Thompson, K.: Film Art: An Introduction. McGraw-Hill Companies, Inc. (1997)
19. Oh, J., Hua, K.A.: Efficient and cost-effective techniques for browsing and indexing large video databases. In: ACM SIGMOD. (2000)
20. Rui, Y., Huang, T.S., Mehrotra, S.: Constructing table-of-content for videos. Multimedia Systems 7 (1999) 359–368
21. Corridoni, J.M., Bimbo, A.D.: Structured representation and automatic indexing of movie information content. Pattern Recognition 31 (1998) 2027–2045
22. Yeung, M.M., Liu, B.: Efficient matching and clustering of video shots. In: Proc. of IEEE Int'l Conf. on Image Processing. Volume 1., U.S.A. (1995) 338–341
23. Sundaram, H., Chang, S.F.: Determining computable scenes in films and their structures using audio-visual memory models. In: Proc. of ACM Multimedia'00, Los Angeles, CA, USA (2000) 95–104
24. Sundaram, H., Chang, S.F.: Video scene segmentation using audio and video features. In: Proc. of IEEE ICME, New York, NY, USA (2000)
25. Adams, B., Dorai, C., Venkatesh, S.: Novel approach to determining tempo and dramatic story sections in motion pictures. In: Proc. of IEEE ICME. (2000) 283–286
26. Zhou, J.: ShotWeave: A shot clustering technique for story browsing for large video databases. M.S. Thesis, Iowa State University (2001)