Abstract: Motion trajectories provide a meaningful clue in motion characterization of humans, robots, and moving objects. This paper addresses motion trajectory recognition by exploring local self-similarities of motion trajectories over time. Such temporal self-similarities within a motion trajectory are observed by building a Self-Similarity Matrix (SSM) based on the sigmoid distances between all pairs of points along the motion trajectory. From an analysis of SSMs, we develop a self-similarity descriptor that captures the layout of local temporal similarities within a motion trajectory. Such descriptors exhibit stability to noise and invariance to group transformations. Temporal pyramid ordering is used in the bag-of-features (BoF) approach to quantize a set of self-similarity descriptors as a histogram of visual words, forming a temporal pyramid representation that serves as the input data for recognition. Our method for recognizing motion trajectories is validated on a sign language dataset. It shows similar or superior performance in comparison with other methods. In particular, a significant improvement in recognition efficiency and robustness to noise is achieved using our method.

I. INTRODUCTION

Visual motion recognition of humans and objects of interest has drawn much attention in different areas such as computer vision, machine learning, and robotics. In robotics especially, it is a critical task, as it can be widely applied in human-robot interaction, vision-based manipulation, humanlike behavior imitation, etc. A motion trajectory, a set of positions of moving objects in 3-dimensional Euclidean space (3D), can provide a compact and rich clue for motion characterization, as shown in Fig. 1(a). In this paper, we are especially interested in recognizing motion trajectories, as motions are abstracted as motion trajectories. Methods for such tasks are usually based on rich and effective descriptions capturing local spatio-temporal patterns within a motion trajectory. Therefore, studying effective descriptions for motion trajectories is very important for motion recognition.

Various applications using motion trajectories have been proposed over the past years [1]-[3]. However, in most of the related works, raw data and some simple features of motion trajectories are used directly in those applications. Such raw data are quantities including trajectory positions, velocities, and angular changes, which are not flexible and robust in practical applications. A feasible alternative is to introduce particular shape and curve descriptors to describe motion trajectories in both space and time. Designing shape descriptors is an important topic in computer vision for shape (curve) matching [4], [5], classification [6], and retrieval [7]. Such descriptors include compactness [8], curvature scale space (CSS) [7], shape context [5], B-spline [9], moment invariant [10], integral invariants [11], and Fourier and wavelet descriptors [12], [13]. Most of these descriptors were initially designed for planar closed shapes and curves, and thus are not sufficient to represent temporal trajectories in the 3D case. Our previous work [14] develops some effective and robust invariant descriptors that can provide substantial advantages over the raw trajectory data and other sensitive invariants in trajectory matching and retrieval, but they suffer from low efficiency in recognition due to the direct template matching.

Although these relevant descriptors for motion trajectories vary significantly, they all share the same basic framework: geometrical properties and temporal dynamics of a shape or trajectory are extracted by investigating spatial interrelation, statistical distributions, and transformed spaces. They perform well when applied to matching and retrieval. Nevertheless, recognition tasks require richer descriptions whose statistical distribution for each motion class can be encoded effectively to fit a classification model, achieving good performance in both accuracy and efficiency. Therefore, we intend to discover rich spatio-temporal patterns in a short period of frames to build such descriptions. Shechtman et al. [15] proposed a local self-similarity descriptor to match the local similarities across images and videos, which captures the internal geometric layouts of local self-similarities within images and videos. Later, Junejo et al. [16] investigated the temporal self-similarities of a human action sequence over time to achieve stable view-independent action recognition. Similar work can also be found in [17]. All of the related studies first build a spatial or temporal self-similarity matrix (SSM) by computing distances between extracted features of all frame pairs in a sequence.

With built SSMs, Shechtman et al. [15] transformed an SSM at each pixel directly into a binned log-polar representation that accounts for local non-rigid deformations. However, it is by nature a global descriptor for images, and does not take temporal information into account. In [16], local descriptors are extracted from an SSM by accumulating histograms of gradient orientations in local patches of the SSM. They are then quantized as histograms of visual words based on the bag of features (BoF) approach [19]. However, the SSMs in [16] are built so that the elements of an SSM are Euclidean distances of all pairs of time frames. The Euclidean distance is a common metric but is sensitive to noise and outliers [21]. Moreover, the BoF there discards the temporal information in the sequence. Similar limitations can be observed in [17].

*The work was supported by a grant from the Research Grants Council of Hong Kong [Project No. CityU 118613] and NSFC [61273286].
Zhanpeng Shao, Y.F. Li, and Yao Guo are with the Department of Mechanical and Biomedical Engineering, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong (phone: 852-34426778; fax: 852-34420172; e-mail: perry.shao@my.cityu.edu.hk, meyfli@cityu.edu.hk, yaoguo4-c@my.cityu.edu.hk).
self-similarities. That means that the distance $d_{ij}$ in an SSM is the sigmoid distance between the pair of corresponding points at frames $i$ and $j$ of a motion trajectory.

A motion trajectory records a sequence of position vectors of a moving object in 3-dimensional Euclidean space, and it is parameterized as $\Gamma(t) = \{x(t), y(t), z(t)\},\ t \in [1, N]$, in the discrete $\{x, y, z, t\}$ space. The SSM for $\Gamma$ is a square symmetric matrix of size $N \times N$,

$$
[d_{ij}]_{i,j = 1, \dots, N} =
\begin{bmatrix}
0 & d_{12} & \cdots & d_{1N} \\
d_{21} & 0 & \cdots & d_{2N} \\
\vdots & \vdots & \ddots & \vdots \\
d_{N1} & d_{N2} & \cdots & 0
\end{bmatrix}, \tag{1}
$$

where $d_{ij}$ is the local distance between the points at two instants $i$ and $j$ of a motion trajectory, and is defined by the sigmoid function

$$
d(i, j) = \tanh\left(\alpha \, \|\Gamma(i) - \Gamma(j)\|_2 + c\right), \tag{2}
$$

where $\alpha$ is a positive constant that determines the steepness of the curve of the sigmoid function in (2), and $c$ is the bias. The sigmoid function $d(i, j)$ so defined is a monotonically increasing function of $\|\Gamma(i) - \Gamma(j)\|_2$, and we claim that $d(i, j)$ is a metric since three conditions are satisfied: (1) $d(i, j) \ge 0$, $i = j \Rightarrow d(i, i) = 0$ (non-negativity); (2) $d(i, j) = d(j, i)$ (symmetry); (3) $d(i, k) \le d(i, j) + d(j, k)$ (triangle inequality).

Figure 2. The distance measure for the Euclidean metric and the sigmoid metric with different $\alpha$.

As we have claimed, the sigmoid distance is a metric that is more robust to noise and outliers than the Euclidean distance. The sigmoid distance function can map any Euclidean distance to the range $[0, 1]$, as shown in Fig. 2. Given an appropriate value of $\alpha$, the sigmoid function maps a corresponding range of Euclidean distances between pairs of points of a trajectory into the range $[0, 1)$ as the distance measure. In this way, the noisy points and outliers in motion trajectories yield abnormal distances between themselves and their neighboring points in the SSM. The abnormal distances lie beyond the range determined by $\alpha$, and thus are mapped to the upper bound (the value 1) of the sigmoid function. In contrast, the abnormal distances are mapped linearly when the Euclidean distance is used directly in computing SSMs. In other words, the normal distances are mapped to the approximately linear region of the sigmoid function, which gives a large weight to those normal points, while the abnormal distances are mapped to the non-linear region of the sigmoid function, which gives a small weight to those noisy points and outliers. Therefore, $\alpha$ is a key parameter that is obtained by training on a particular dataset, since the raw data of different motion trajectories have various scales and sampling rates.
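
The construction of an SSM from sigmoid distances can be illustrated with a minimal sketch. It assumes that the distance in (2) has the form tanh(alpha * ||p_i - p_j|| + c); the function name `sigmoid_ssm`, the parameter names `alpha` and `c`, and the toy trajectory are hypothetical and chosen only for illustration. The diagonal is kept at zero, as in (1).

```python
import math


def sigmoid_ssm(trajectory, alpha, c=0.0):
    """Return the N x N self-similarity matrix of a 3D trajectory.

    trajectory: sequence of (x, y, z) points, one per frame.
    alpha:      steepness of the tanh curve (assumed parameter name).
    c:          bias added inside tanh (assumed form; 0 by default).
    """
    n = len(trajectory)
    ssm = [[0.0] * n for _ in range(n)]  # diagonal entries stay 0, as in (1)
    for i in range(n):
        for j in range(i + 1, n):
            euclid = math.dist(trajectory[i], trajectory[j])
            d = math.tanh(alpha * euclid + c)
            ssm[i][j] = ssm[j][i] = d  # symmetric by construction
    return ssm


# Toy trajectory: points on a straight line, with one outlier at frame 3.
points = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (50, 40, 0), (4, 0, 0)]
ssm = sigmoid_ssm(points, alpha=0.5)
```

In this toy example the distances involving the outlier at frame 3 saturate near 1, while distances between neighboring regular points remain in the nearly linear part of the curve. Choosing `alpha` by hand here is only for demonstration; the text above obtains it by training on a dataset.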
Figure 3. Examples of SSMs for the motion trajectory of an instance of the sign "all" from the ASL dataset [20]. (a) The original trajectory. (b) The transformed version. (c) The SSM of (a) using Euclidean distances. (d) The SSM of (a) using the sigmoid distances. (e) The SSM of (b) using the sigmoid distances.
Figure 4. Example of building a self-similarity descriptor. (a) A sign trajectory extracted from an instance of the sign "all" from the ASL dataset [20]. (b) An SSM is computed from the pairwise sigmoid distances at all frames. (c) A local self-similarity descriptor is computed by accumulating the histograms of gradient orientations with (d) a log-polar patch that is partitioned into cells with a set of parameters of the log-polar coordinate.

We adjust $\alpha$ so that the corresponding sigmoid function maps normal distances and abnormal distances to the ranges of $[0, 1)$ and 1, respectively.

Such SSMs are not only robust to noise and outliers but also achieve invariance to rotation, translation and scaling in motion trajectory representation. The sigmoid distance in an SSM is a function of the Euclidean distance, and the Euclidean distance is intrinsically invariant with respect to those transformations in Euclidean space. Examples of SSMs computed from an example motion trajectory are shown in Fig. 3. To illustrate the invariance of SSMs, we transform the original trajectory in Fig. 3(a) to a new one shown in Fig. 3(b) by a series of actions: first rotating by 30° and 45° about the x and z axes, respectively, then translating by 200 mm and 500 mm along the x and y directions, respectively, and finally scaling by a factor of 0.5. Note the visual difference between the SSMs using Euclidean distances and the sigmoid distances in Fig. 3(c-d). Note also the similarity of the SSMs computed for the same trajectory despite the above-mentioned transformations of the motion trajectory.

III. SSM-BASED RECOGNITION

A. SSM-Based Description

With built SSMs, most current work decomposes and transforms them into a reduced-dimension space, or extracts image-based features. However, these are global features that discard temporal information. We intend to discover local self-similarities by extracting a self-similarity descriptor in each local patch centered at elements along the diagonal of an SSM, as shown in Fig. 4(b). Self-similarity descriptors are obtained by accumulating histograms of gradient orientations [18] in local patches along the diagonal of the SSM, with a log-polar cell structure that is defined by the following parameters: $r$, the radius; the numbers of bins along the radial and angular directions; and $n_\theta$, the number of gradient orientations at each cell. An example of building such self-similarity descriptors is shown in Fig. 4, where for the log-polar coordinate centered at frame $i$, the corresponding patch is partitioned into 25 cells ($8 \times 4$, i.e. 8 angular bins and 4 radial bins, with the center cells combined into a single cell). A 6-bin unsigned ($n_\theta = 6$; gradient orientations are limited to $0 \sim \pi$) histogram of gradient orientations within each cell of the local patch at frame $i$ is computed as

$$
h_i^{k} = \left[ h_i^{k}(1), \dots, h_i^{k}(n_\theta) \right], \tag{3}
$$

where $k$ is the cell order from 1 to the number of cells $n_c$ within the patch. Such histograms of all the cells in a patch are concatenated into a self-similarity descriptor at frame $i$, $H_i(\Gamma) = \left[ h_i^{1}, \dots, h_i^{n_c} \right]$. For histograms with cells falling outside an SSM, we set them to zero. Thus, a set of self-similarity descriptors is built as $H(\Gamma) = \left[ H_i(\Gamma) \right]^{T}_{i = 1:N}$ for a motion trajectory $\Gamma$.

B. SSM-Based Motion Trajectory Recognition

To recognize motion trajectories, a recent BoF approach is employed to encode the statistics of self-similarity descriptors by quantizing the descriptors into histograms of visual words. Following the classic BoF approach, a visual vocabulary is learned offline by k-means clustering of $K$ random local self-similarity descriptors from the training data. By clustering, we obtain a predefined number of clusters, $D$, whose centers are the words of the visual vocabulary. In training and testing, the set of self-similarity descriptors $H(\Gamma)$ for a motion trajectory $\Gamma$ is quantized into a normalized histogram $z$ of visual words. Unlike an orderless BoF, in our situation we need to take temporal information into account when building the histogram of visual words. We follow an extension of BoF in [19] to first partition an SSM into temporal sub-blocks from a fine to a coarse scale, and compute the histograms of local descriptors across different sub-blocks and over different temporal scales. Typically, $2^{\ell}$ sub-blocks, $\ell = 1, \dots, L$, are used. An example of partitioning an SSM into temporal sub-blocks at three scales is shown in Fig. 5, where $z^{\ell}(s)$ denotes the histogram from the $s$-th sub-block at the $\ell$-th scale. By concatenating the histograms from various sub-blocks at different scales, an SSM's temporal pyramid representation $z = [z^{0}, \dots, z^{L}]$ is obtained. Such temporal pyramid representations are then input to a nonlinear SVM classifier with the pyramid matching kernel [19] defined as

$$
\kappa^{L} = \frac{1}{2^{L}} I^{0} + \sum_{\ell = 1}^{L} \frac{1}{2^{L - \ell + 1}} I^{\ell}, \tag{4}
$$

where the histogram intersection function $I$ between the histograms of self-similarity descriptors of a given pair of trajectories $x$ and $y$ is:
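
The temporal pyramid and the kernel in (4) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: it assumes one visual-word index per frame, $2^{\ell}$ equal temporal blocks at levels $\ell = 0, \dots, L$, normalization by the number of frames, and the standard histogram intersection of [19] (the sum of bin-wise minima). The function names `temporal_pyramid`, `intersection`, and `pyramid_match` are hypothetical.

```python
def temporal_pyramid(words, vocab_size, levels):
    """Histograms of visual words over 2**l temporal blocks, l = 0..levels.

    words: one visual-word index per frame, in temporal order.
    Returns a list indexed by level; each entry concatenates the
    per-block histograms of that level.
    """
    n = len(words)
    pyramid = []
    for level in range(levels + 1):
        blocks = 2 ** level
        hist = [0.0] * (blocks * vocab_size)
        for t, w in enumerate(words):
            b = t * blocks // n                   # block containing frame t
            hist[b * vocab_size + w] += 1.0 / n   # normalise by frame count
        pyramid.append(hist)
    return pyramid


def intersection(h1, h2):
    """Histogram intersection: sum of element-wise minima."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def pyramid_match(px, py):
    """Pyramid match kernel of (4) for two pyramids with equal depth."""
    top = len(px) - 1  # L in (4)
    k = intersection(px[0], py[0]) / 2 ** top
    for level in range(1, top + 1):
        k += intersection(px[level], py[level]) / 2 ** (top - level + 1)
    return k


# Two toy sequences with the same words in opposite temporal order.
pa = temporal_pyramid([0, 0, 1, 1], vocab_size=2, levels=1)
pb = temporal_pyramid([1, 1, 0, 0], vocab_size=2, levels=1)
k_same = pyramid_match(pa, pa)
k_swapped = pyramid_match(pa, pb)
```

The two toy sequences have identical orderless histograms at level 0 and differ only in temporal order, so the finer level lowers their kernel value relative to the self-match. This is the effect the temporal pyramid is meant to add over an orderless BoF.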
(AII) (HMM-AII), [14] as observations, (4) an SVM classifier using an AII-based BoF approach, in which the descriptors clustered and quantized into vocabulary words are AII descriptors rather than the self-similarity descriptors of our method (SVM-AII-BoF), (5) the 1-NN classifier using Fourier descriptors (FD) with Euclidean distances (1-NN-FD).

The recognition experiments for all the methods are implemented using MATLAB on a common PC with a Core i5-2400 3.1 GHz CPU (32-bit) and 4 GB RAM.
classifier on FD descriptors with Euclidean distances. Thus, we can see that the result via the 1-NN using FD is 78.07%. The 1-NN-FD method depends on exhaustive matching, which leads to a higher average time cost of 430 ms per query. Our method achieves a significant improvement in recognition efficiency, with an average time cost of 52 ms as recorded in Table II. Finally, a classic left-right HMM is employed to model the temporal dependency for each motion trajectory using AII descriptors as observations, and an average accuracy of 66.90% is obtained. Its time cost is 151 ms.

C. Noise Effects

In order to support our claim that SSMs computed by the sigmoid distance are more robust to noise and outliers, we set up the recognition experiments in the same configuration as before but add white Gaussian noise to the motion trajectories of signs from the test set of the ASL dataset. In this paper, the noise is measured by its normalized standard deviation, which is increased from 0 to 0.3. The recognition results are shown in Fig. 6, which plots how the recognition accuracy drops as the additive noise is increased when using SSM-sig-TPM and SSM-raw-TPM. As indicated in Fig. 6, the accuracy using SSM-raw-TPM drops more sharply in noisy trajectory recognition.

Figure 6. Noise effects on recognition accuracy when the added noise is increased from 0 to 0.3.

V. CONCLUSION

We propose a self-similarity descriptor to capture rich spatio-temporal patterns within motion trajectories, and use such descriptors to perform fast recognition tasks. With the sigmoid distance as the basic metric for computing SSMs, the SSMs have been demonstrated to be more discriminative and robust than those using Euclidean distances as basic units. Moreover, the temporal pyramid matching for BoF histograms of self-similarity descriptors yields a significant improvement in recognition accuracy. Compared with other methods, the advantage of our method in recognition accuracy and efficiency is clearly confirmed.

In our method, the BoF approach uses clustering to build a visual vocabulary for a training dataset, which yields a coarse reconstruction of self-similarity descriptors. It is believed that a less coarse reconstruction of self-similarity descriptors can be found by sparse coding, which is a refined approximation for building statistics of local descriptors; this will be our forthcoming research issue.

REFERENCES

[1] C. Rao, A. Yilmaz, and M. Shah, "View-invariant representation and recognition of actions," Int. J. Comput. Vis., vol. 50, no. 2, pp. 203-226, 2002.
[2] J. Beh, D. K. Han, R. Duraiswami, and H. Ko, "Hidden Markov Model on a unit hypersphere space for gesture trajectory recognition," Pattern Recognit. Lett., vol. 36, no. 1, pp. 144-153, 2014.
[3] M. Bennewitz, "Learning Motion Patterns of People for Compliant Robot Motion," Int. J. Rob. Res., vol. 24, no. 1, pp. 31-48, Jan. 2005.
[4] C. Xu, J. Liu, and X. Tang, "2D shape matching by contour flexibility," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 1, pp. 180-186, 2009.
[5] G. Mori, S. Belongie, and J. Malik, "Efficient shape matching using shape contexts," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 11, pp. 1832-1837, 2005.
[6] M. Devanne, H. Wannous, S. Berretti, P. Pala, M. Daoudi, and A. Del Bimbo, "3-D human action recognition by shape analysis of motion trajectories on Riemannian manifold," IEEE Trans. Cybern., vol. 45, no. 7, 2015.
[7] F. Bashir and A. Khokhar, "Curvature scale space based affine-invariant trajectory retrieval," in Proceedings of IEEE International Multitopic Conference, 2004, pp. 20-25.
[8] J. Xu, J. Faruque, C. F. Beaulieu, D. Rubin, and S. Napel, "A comprehensive descriptor of shape: Method and application to content-based retrieval of similar appearing lesions in medical images," J. Digit. Imaging, vol. 25, no. 1, pp. 121-128, 2012.
[9] A. Oikonomopoulos, M. Pantic, and I. Patras, "Sparse B-spline polynomial descriptors for human activity recognition," Image Vis. Comput., vol. 27, no. 12, pp. 1814-1825, 2009.
[10] J. Flusser, J. Kautsky, and F. Šroubek, "Implicit Moment Invariants," Int. J. Comput. Vision, vol. 86, no. 1, pp. 72-86, Jan. 2010.
[11] B. Hong and S. Soatto, "Shape Matching using Multiscale Integral Invariants," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 01, pp. 1-10, 2014.
[12] S. Khalid, "Motion-based behaviour learning, profiling and classification in the presence of anomalies," Pattern Recognit., vol. 43, no. 1, pp. 173-186, 2010.
[13] E. Bala and A. E. Cetin, "Computationally efficient wavelet affine invariant functions for shape recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 8, pp. 1095-1099, 2004.
[14] Z. Shao and Y. F. Li, "Integral invariants for space motion trajectory matching and recognition," Pattern Recognition, vol. 48, no. 8, pp. 2418-2432, 2015.
[15] E. Shechtman and M. Irani, "Matching local self-similarities across images and videos," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1-8.
[16] I. N. Junejo, E. Dexter, I. Laptev, and P. Pérez, "View-independent action recognition from temporal self-similarities," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 1, pp. 172-185, 2011.
[17] A.-R. Lee, H.-I. Suk, and S.-W. Lee, "View-invariant 3D action recognition using spatiotemporal self-similarities from depth camera," in Proceedings of International Conference on Pattern Recognition, 2014, pp. 501-505.
[18] N. Dalal and W. Triggs, "Histograms of oriented gradients for human detection," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2005, pp. 886-893.
[19] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2006, vol. 2, pp. 2169-2178.
[20] UCI KDD ASL Archive, "Australian sign language dataset," Available: http://kdd.ics.uci.edu/databases/auslan2/auslan.html.
[21] K. Wu and M. Yang, "Alternative c-means clustering algorithms," Pattern Recognit., vol. 35, no. 10, pp. 2267-2278, 2002.
[22] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Trans. Intell. Syst. Technol., vol. 2, pp. 27:1-27:27, 2011.