Article info

Article history: Available online 7 May 2015

Keywords: Human activity recognition (HAR); Linear Discriminant Analysis; K-Nearest Neighbor; Support Vector Machine; Hybrid classifier

Abstract

The aim of this paper is to present a new approach for human activity recognition in a video sequence by exploiting the key poses of the human silhouettes and constructing a new classification model. The spatio-temporal shape variations of the human silhouettes are represented by dividing the key poses of the silhouettes into a fixed number of grids and cells, which leads to a noise-free depiction. The computation of the parameters of the grids and cells leads to the modeling of feature vectors, and this computation is arranged in such a manner as to preserve the time sequence of the silhouettes. To classify these feature vectors, a hybrid classification model is proposed based upon a comparative study of the Linear Discriminant Analysis (LDA), K-Nearest Neighbor (K-NN), and Support Vector Machine (SVM) classifiers. The proposed hybrid classification model is a combination of the SVM and 1-NN models and is termed SVM–NN. The effectiveness of the proposed activity representation and classification model is tested on three public data sets, i.e. Weizmann, KTH, and Ballet Movement. The comparative analysis shows that the proposed method is superior in terms of recognition accuracy to similar state-of-the-art methods.

© 2015 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.eswa.2015.04.039
D.K. Vishwakarma, R. Kapoor / Expert Systems with Applications 42 (2015) 6957–6965
the classification accuracy of the HAR system, a hybrid classification model of SVM–NN is constructed.

The rest of the paper is organized as follows: Section 2 presents the past work carried out in the field of human activity recognition. The details of the proposed framework, which comprises silhouette extraction, feature extraction, feature representation, and the various classifiers, are presented in Section 3. Section 4 gives the details of the experimental work and a discussion of the results.

2. Related work

A significant amount of work has been reported in the literature on the recognition of human action and activity using video sequences, and most HAR methods rely on local features, global features, key points, spatio-temporal features, bags of words, etc. (Agrawal & Ryoo, 2011; Chaaraoui et al., 2012; Poppe, 2010; Vishwakarma & Agrawal, 2012; Weinland et al., 2011; Ziaeefar & Bergevin, 2015). All these methods generate a set of features, and an appropriate machine-learned classifier is then used for the recognition of the activity. A brief review of local-feature-based spatio-temporal interest point (STIP) approaches, and of holistic approaches that incorporate both local and global features, is given next.

An efficient approach to spatio-temporal interest points based on local features, using a temporal Gabor filter and a spatial Gaussian filter, was introduced by Dollar, Rabaud, Cottrell, and Belongie (2005). Thereafter, a number of STIP detectors and descriptors have been proposed by several researchers (Chakraborty, Holte, Moeslund, & González, 2012; Everts, Gemert, & Gevers, 2014; Jargalsaikhan, Little, Direkoglu, & O'Connor, 2013; Laptev, 2005; Ryoo & Aggarwal, 2009). These local-feature-based descriptors became popular due to their robustness against noise, illumination change, and background movement. However, these methods seem less effective for complex activity modeling (e.g. Ballet Movement).

A holistic approach to human action recognition that relies on human silhouette sequences was proposed by several researchers (Bobick & Davis, 2001; Chaaraoui & Revuelta, 2014; Eweiwi, Cheema, Thurau, & Bauckhage, 2011; Gorelick, Blank, Shechtman, Irani, & Basri, 2007; Olivieri, Conde, & Sobrino, 2012; Weinland, Boyer, & Ronfard, 2007; Wu & Shao, 2013). In silhouette-based methods, the foreground is extracted using background segmentation and features are then extracted from the silhouettes. Bobick and Davis (2001) presented a silhouette-based method in which Motion History Images and Motion Energy Images (MHI, MEI) are used for activity recognition. The MEI and MHI images are extracted from the video frames and then stacked so as to preserve the temporal content of the activity. Chaaraoui and Revuelta (2014) proposed a HAR method that uses optimized parameters, with the human silhouette considered as the basic entity for feature estimation; the optimized parameters are evaluated using evolutionary computation. Weinland et al. (2007) worked on template-matching techniques in which the region of interest (ROI) is divided into a fixed spatial or temporal grid, due to which the effect of noise present in an image, and of viewpoint variance, can be reduced significantly. Thurau and Hlavac (2008) used a histogram-of-oriented-gradients approach to represent activity, and also concluded that silhouettes are the best information unit for representing human activity. Wu and Shao (2013) proposed a modified bag-of-words model called bag of correlated poses, which uses the advantages of both global and local features. They addressed the problem of losing geometric information in the bag-of-visual-words representation, which is generally implemented using the k-means clustering algorithm. In these methods, it has been observed that the holistic approach results in high dimensionality of the descriptor (Agrawal & Ryoo, 2011); hence there is a need for dimensionality reduction techniques for efficient recognition. PCA is a popular linear dimensionality reduction technique that has been widely used for dimension reduction and classification in activity recognition (Masoud & Papanikolopoulos, 2003; Olivieri et al., 2012). The low-dimensional feature sets are then efficiently classified using various linear and nonlinear classifiers. Gorelick et al. (2007) used a nonlinear classification approach for human activity, applying K-NN with the Euclidean distance to global features of the silhouette. Batra, Chen, and Sukthankar (2008) used the nearest-neighbor classification approach on local features computed in the form of histograms of code words. Another classification approach widely used for human activity recognition based on local features is the SVM, used by Cao, Masoud, Boley, and Papanikolopoulos (2009), Laptev, Caputo, Schuldt, and Lindeberg (2007), and Schuldt, Laptev, and Caputo (2004). Laptev et al. (2007) used the SVM as well as a KNN classifier to classify human activity and showed that the SVM gives better accuracy than KNN, but some factors, such as interclass similarity and intraclass dissimilarity, confine their performance.

Based upon the analysis of these earlier state-of-the-art methods on human action and activity recognition, we have identified the following problems and outlined our solutions to them:

- The holistic representation of human activity requires an efficient method for extracting the silhouette from the video sequence. Usually, foreground segmentation is done using background modeling and background subtraction, but it is not always possible to get good results, due to inaccurate modeling of the background. Hence, in this work we have used a texture-based segmentation approach to extract the silhouette in the context of human activity recognition.
- The problem of losing geometrical information in the bag of visual words is addressed by selecting key poses of the human silhouettes. Further, to describe the silhouette information, we have proposed a simple scheme which preserves the spatial change in the human silhouette over time.
- The performance of a classifier degrades when the activities have interclass similarity and intraclass dissimilarity. Therefore, to improve the classification of the HAR system, we have constructed a hybrid classification model as a combination of SVM and NN classifiers.

3. Proposed framework

Our approach is based on the silhouette of the human body, which is extracted from the video sequence of the activity by segmentation techniques. The segmented silhouette is preprocessed to improve its quality for feature extraction. Features generated from the different silhouette images are then arranged in a representable form. Further, dimension reduction and classification are applied. The workflow of the proposed framework is depicted in Fig. 1, and a description of each block is presented in the following subsections.

Fig. 1. Workflow diagram of the proposed framework: Input Activity Video Sequences → Silhouette Extraction → Features Extraction & Representation → Dimension Reduction (PCA) → Classification (LDA, KNN, SVM, SVM-NN) → Recognized Activity.
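The block flow of Fig. 1 can be sketched as a small pipeline in which each stage is a pluggable function. The sketch below is illustrative only, not the paper's implementation: the stage functions and the toy data are assumptions, and the real stages (texture-based segmentation, grid features, PCA, SVM–NN) are described in the following subsections.

```python
def recognize_activity(frames, segment, extract_features, reduce_dim, classify):
    """Run the silhouette -> features -> reduction -> classification chain."""
    silhouettes = [segment(f) for f in frames]
    feature_vector = extract_features(silhouettes)
    return classify(reduce_dim(feature_vector))

# Toy stand-ins to show the data flow only:
frames = [[0, 1, 1], [0, 0, 1]]
label = recognize_activity(
    frames,
    segment=lambda f: [p > 0 for p in f],                  # binary silhouette
    extract_features=lambda sils: [sum(s) for s in sils],  # white-pixel counts
    reduce_dim=lambda v: v,                                # identity in this sketch
    classify=lambda v: "walk" if sum(v) > 2 else "bend",   # dummy decision rule
)
print(label)
```

Each stand-in stage would be replaced by the corresponding block of Fig. 1 in a real system; the point here is only the composition order of the blocks.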
3.1. Extraction of silhouette using texture information

For human activity recognition, background subtraction is considered the fundamental challenge of vision-based activity recognition. The other challenges which make the task difficult are illumination change, dynamic backgrounds, shadows, video noise, etc. (Brutzer, Hoferlin, & Heidemann, 2011). In background subtraction, the fundamental concept is to construct and update a model of the background scene; foreground object pixels are detected if they differ from the background model beyond a certain limit. In the past, Gaussian Mixture Model (GMM) and Local Binary Pattern (LBP) based background models have been widely used. In a GMM, several Gaussian mixture distributions are used to model each pixel, where each Gaussian distribution characterizes the intensity distribution of the pixel over time. For a noisy video sequence, the parameter estimation is unpredictable under the Gaussian assumption; hence, it can be concluded that this assumption is not always valid. LBP is a very efficient textural operator which labels the pixels of the image by thresholding the neighborhood of each pixel, resulting in a binary pattern.

Haralick, Shanmugam, and Dinstein (1973) initially proposed a textural-feature-based segmentation approach using the Gray Level Co-occurrence Matrix (GLCM). Thereafter, numerous textural-feature-based segmentation techniques have been proposed (Chua, Wang, & Leman, 2012; Komorkiewicz & Gorgon, 2013; Tahir, Bouridane, & Kurugollu, 2005). A robust textural approach using LBP for the background subtraction model was proposed by Chua et al. (2012), who demonstrated that textural-feature-based segmentation gives more effective results for video surveillance applications. Fast and efficient textural segmentation using the GLCM has been implemented on FPGAs (Komorkiewicz & Gorgon, 2013; Tahir et al., 2005) for different real-life applications. The realization of texture-based algorithms in real-life applications has encouraged us to use textural-feature-based background subtraction in this work.

A method for describing different textures was originally presented by Haralick et al. (1973). They proposed the Gray-Level Co-occurrence Matrix, which allows describing texture based on differences in intensity in different directions, and used 14 different features for the classification of different textures in the image. Entropy is one of the most important parameters describing the texture information in an image, and it can be expressed as:

f = − Σ_i Σ_j q(i, j) log q(i, j)    (1)

where q(i, j) = M(i, j) / Σ_{i,j} M(i, j) is the probability density function; here i and j are indices into the co-occurrence matrix M. The entropy of the image is used to describe its complexity: the higher the entropy, the higher the complexity of the image. An entropy filter is applied to the image to represent the different textures present in it; the filter value for each pixel is the entropy calculated over its 9 × 9 neighborhood. Converting this filter output into binary form with some thresholding gives an image with white spots in different areas. For a two-textured image, one part contains spots of the same size, and the same holds for the other part. Removing one such part gives the mask for obtaining the human blob, and applying this mask to the raw image provides the silhouette, as shown in Fig. 2.

The segmented image may contain several white contours, but not all of them are human silhouettes. By comparing the sizes of these contours, one can find the contour with the largest area, which is the human silhouette. This part of the image is selected and classified as the human silhouette. As shown in Fig. 2, two parts have the same texture, but the human part has the larger area and has therefore been selected as the silhouette.

3.2. Feature extraction

Recently, it has been observed that the concept of visual recognition in static images (Guo & Lai, 2014) has been successfully extended to the representation of video sequences. The various methods (Weinland et al., 2011) used for representing human activities are: feature detectors, feature descriptors, bag-of-features representations, and local features based on voting for localization. Feature extraction is the prime step in the analysis of video sequences, and the extracted features must be robust and invariant against variations in recording conditions, body pose, etc. All the analysis is done over the collected feature set.
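The entropy of Eq. (1) in Section 3.1 can be illustrated concretely. The sketch below builds a co-occurrence matrix for horizontally adjacent pixel pairs and computes its entropy; the (0, 1) offset and the tiny two-level patches are assumptions made for the example, not the paper's settings.

```python
import math
from collections import Counter

def glcm_entropy(image):
    """Entropy of the co-occurrence matrix M of horizontally adjacent
    pixel pairs, per Eq. (1): f = -sum_ij q(i,j) log q(i,j),
    with q(i,j) = M(i,j) / sum_ij M(i,j)."""
    pairs = Counter()
    for row in image:
        for a, b in zip(row, row[1:]):  # neighbor at offset (0, 1)
            pairs[(a, b)] += 1
    total = sum(pairs.values())
    return -sum((m / total) * math.log(m / total) for m in pairs.values())

# A uniform patch has zero entropy; a mixed-texture patch does not.
flat = [[0, 0, 0], [0, 0, 0]]
mixed = [[0, 1, 0], [1, 0, 1]]
print(glcm_entropy(flat), glcm_entropy(mixed))
```

In the segmentation step of Section 3.1, this quantity would be evaluated locally over the 9 × 9 neighborhood of each pixel and thresholded, yielding the binary texture mask described above.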
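The derivation that follows resumes with the per-frame white-pixel counts f_i; the per-cell counts w that build them (Eqs. (2)–(5) fall outside this excerpt) can be sketched as below. The helper name and the 5 × 5 cell size (taken from the experimental setup in Section 4) are assumptions for illustration, not the paper's code.

```python
def cell_counts(silhouette, cell=5):
    """Divide a binary silhouette frame into cell x cell blocks and
    count the white pixels in each block, giving one w-vector per
    key frame (the building block of Eqs. (6)-(8))."""
    h, w = len(silhouette), len(silhouette[0])
    counts = []
    for r in range(0, h, cell):
        for c in range(0, w, cell):
            counts.append(sum(silhouette[i][j]
                              for i in range(r, min(r + cell, h))
                              for j in range(c, min(c + cell, w))))
    return counts

# A 10 x 10 frame whose top-left 5 x 5 block is white gives a 2 x 2 grid
# of cells with counts [25, 0, 0, 0].
frame = [[1 if i < 5 and j < 5 else 0 for j in range(10)] for i in range(10)]
print(cell_counts(frame))
```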
where f_i contains the white pixel count of the ith key frame. Thus, the feature vector of an activity video sequence is expressed as:

F_i = [f_1, f_2, f_3, f_4, ..., f_k], where i = 1, 2, 3, 4, ..., V_T    (6)

where V_T is the total number of videos in the data set. Substituting Eq. (5) into Eq. (6), the feature vector for a single video of an activity is:

F_i = [w_1, w_2, w_3, w_4, ..., w_Nc (first frame); w_1, w_2, w_3, w_4, ..., w_Nc; ...; w_1, w_2, w_3, w_4, ..., w_Nc (last, i.e. kth, frame)]    (7)

Similarly, the feature set of a dataset which contains V_T videos of all classes can be represented as:

F_v = [F_1; F_2; ...; F_VT]    (8)

where each row is the concatenated per-frame vector of Eq. (7), and v corresponds to the videos of all classes. One of the important aspects of the feature set is its dimension: if the dimension is high, it must be reduced to a lower dimension for efficient and speedy classification. The dimension of the feature set F_v can be determined from the dimension of the feature vector representing a single video in Eq. (7); it is N_f = N_c × k, and after concatenation the dimension is written as 1 × N_f. Hence, the dimension of the final feature set F_v is V_T × N_f. Finally, in order to recognize actions, these feature vectors and their labels are passed to the classifiers.

3.3. Classification model

Usually, the computed feature set contains correlated and uncorrelated data together, and the correlated feature sets of different classes make the classification complex and slow. Hence, for efficient and fast classification, the feature set must contain uncorrelated data, since an uncorrelated feature set has lower dimension. Therefore, for easy handling of the data and improved classification, the dimension of the feature set must be reduced. PCA (Jolliffe, 1986) is a popular method for reducing the dimension of the feature set; it maximizes the variance of the feature set and maps it into a lower-dimensional space.

The reduced-dimension data is classified through training and testing on the feature set. Two types of classifiers are used for human activity recognition: (i) linear and (ii) nonlinear. We have used one linear and two nonlinear classifiers: the linear classifier is LDA, and the nonlinear classifiers are KNN and SVM. In addition to these, one more nonlinear classification model is proposed using the combination of SVM and KNN.

The concept of Linear Discriminant Analysis (LDA), also known as Fisher's Discriminant Analysis, was originally proposed by Fisher (1936). It provides improved classification of various classes of data by maximizing the margin between dissimilar classes and minimizing the margin within the same class.

The KNN classifier (Cover & Hart, 1967) selects the K closest samples of the training feature set to the new instance, and the nearest class with the highest vote is mapped to the test instance. One of the biggest advantages of this classifier is its non-parametric nature: it does not require any assumptions and easily classifies data even in a higher-dimensional space.

The Support Vector Machine (SVM) is a machine learning classifier based on the structural risk minimization principle (Vapnik, 1999). It is one of the most widely used classifiers for the classification of human activity (Hsu & Lin, 2002). The support vectors are the feature samples that lie closest to the hyperplane of an SVM; thus the most important data lies near the hyperplane, which is formed by the set of lines drawn between the class samples.

To enhance the recognition accuracy, a hybrid SVM–NN classification model is introduced which utilizes the individual benefits of the SVM and NN models. The proposed layout of the SVM–NN classification model is depicted in Fig. 5. In this model, the SVM is first used to classify the input feature sets; some are correctly classified and some are wrongly classified. The wrongly classified feature sets lie near the separating hyperplane and are the support vectors. These support vectors are then classified using NN and are considered as representative points.

Fig. 5. Flow of the proposed SVM–NN classification model (blocks: PCA, SVM classification with a yes/no decision, KNN, Recognized Activity).

4. Experimental results and discussion

In order to assess the effectiveness of the proposed approach, we have conducted experiments on three public benchmarks: the Weizmann data set (Gorelick et al., 2007), the KTH data set (Schuldt et al., 2004), and the Ballet data set (Fathi & Mori, 2008). The assessments include a range of variations resulting from different resolutions, lighting conditions, occlusions, background disorder, and uneven motions. In this experiment, for the representation of all videos of the data sets, 31 key frames of size 40 × 25 are used to represent an activity sequence, and each frame has 40 cells, obtained by considering a cell size of 5 × 5. Hence, the dimension of a silhouette is 31 × 40 = 1240, and after concatenation the silhouette feature vector is represented as 1 × 1240. To evaluate the outcome of the action classifications, a leave-one-out (LOO) cross-validation scheme is opted for
all the data sets. For each data set, we have used three machine-learned classifiers (LDA, KNN, SVM). In addition to these, to enhance the recognition accuracy, a hybrid SVM–NN classification model is proposed. The average recognition accuracy (ARA) is computed using Eq. (9):

ARA = (TP + TN) / (TP + TN + FP + FN) × 100 (in percentage)    (9)

where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively. The highest obtained recognition accuracies of these classifiers are compared with similar state-of-the-art methods.

Weizmann data set: This dataset was introduced by Gorelick et al. (2007). It consists of 90 videos with a frame rate of 25 fps, and each frame has a size of 144 × 180. In the video sequences, 9 people perform 10 different actions, categorized as walk, run, jumping-jack, bend, jumping forward on one leg, jumping on two legs in the forward direction, jumping in place, sideways jump, one-hand wave, and two-hand wave. This is one of the established benchmark data sets for human action recognition, and most of the earlier methods (Chaaraoui, Pérez, & Revuelta, 2013; Gorelick et al., 2007; Goudelis, Karpouzis, & Kollia, 2013; Melfi, Kondra, & Petrosino, 2013; Touati & Mignotte, 2014; Wu & Shao, 2013) report recognition accuracy on it. Sample frames of the data set are shown in Fig. 6.

Fig. 6. Sample frames of the Weizmann data set (Run, Side, Skip, Jump, PJump, Bend, Jack, Walk, Wave1, Wave2).

KTH data set: This dataset was introduced by Schuldt et al. (2004) and is more challenging than the Weizmann dataset. It consists of six basic activities, namely Hand-Clapping, Hand-Waving, Jogging, Jumping, Running, and Walking. Each activity has 100 videos over four different scenarios with different lighting conditions, indoors and outdoors. All video sequences are recorded over a uniform background with a static camera at a frame rate of 25 fps and are down-sampled to a spatial resolution of 160 × 120 pixels. The recording conditions of the videos in the KTH data set are not stationary, and there is a significant amount of camera movement and lighting effects in some cases. Therefore, the silhouette extraction is not straightforward, and a simple background subtraction method may not be suitable; hence, for the silhouette extraction we have incorporated the texture-based segmentation method. Sample images of the data set are depicted in Fig. 7.

Fig. 7. Sample images of the KTH data set across the four recording scenarios.

Ballet data set: The Ballet Movement action data set (Fathi & Mori, 2008) is one of the complex human action data sets. It consists of eight Ballet Movements performed by three actors; these movements are named Hopping (HP), Jump (JP), Left-to-Right Hand Opening (LRHO), Leg Swinging (LS), Right-to-Left Hand Opening (RLHO), Standing with Hand Opening (SHO), Stand Still (SS), and Turning (TR). The data set is highly challenging due to the considerable amount of intra-class dissimilarity in terms of spatial and temporal scale, speed, and clothing. Sample frames of the data set are depicted in Fig. 8.

Fig. 8. Images of the Ballet data set depicting the eight movements.

4.1. Classification results

The classification results of the proposed approach on the three data sets, using four different classification models including the proposed SVM–NN, are reported in Table 1. The aim of this presentation is to show the effectiveness of the proposed descriptor as well as the performance of the proposed classification model in comparison with the existing models.

The classification strategy adopted for the Weizmann data set follows Gorelick et al. (2007). From Table 1, the recognition accuracy obtained by LDA is lower than that of the other classifiers. This is due to the similarity between activities of the data set, such as running, jumping, and walking, which are very difficult to separate using a linear classification model. The highest ARA achieved in our experiment, 100%, is obtained with the hybrid SVM–NN classification model. The hybrid classifier is the combination of two nonlinear classifiers, which are more robust to interclass similarity and intraclass dissimilarity, hence the improved result.
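The ARA of Eq. (9), used for all results in this section, can be computed directly from the counts of a confusion matrix. The counts below are made-up numbers for illustration, not results from the paper.

```python
def average_recognition_accuracy(tp, tn, fp, fn):
    """Eq. (9): ARA = (TP + TN) / (TP + TN + FP + FN) * 100, in percent."""
    return (tp + tn) / (tp + tn + fp + fn) * 100.0

# Illustrative counts only: 90 correct decisions out of 100 trials.
print(average_recognition_accuracy(tp=45, tn=45, fp=5, fn=5))
```

Under the leave-one-out scheme described above, these counts would be accumulated over the held-out videos of each fold before applying Eq. (9).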
Table 1
Classification results on the data sets in terms of ARA (%).

Data sets \ Classifiers    Weizmann    KTH     Ballet    mRA (%)
LDA                        94.4        90      75        86.4
KNN                        96.6        91.7    90.2      92.8
SVM                        97.7        92.4    92.75     94.28
SVM–NN                     100         96.4    94        96.8

The highest values in the table are presented in bold.

For the KTH data set, the variation in recording conditions is greater than for the Weizmann data set, due to which the extraction of the silhouette in this data set is a difficult task. Simple frame-differencing methods for silhouette extraction may give good results on the Weizmann data set, but in the case of KTH it is very difficult to extract the silhouette using these methods. The texture of an object is least affected by the variation in recording conditions; hence, we utilize texture-based foreground extraction. The highest ARA achieved on the KTH data set in our experiment is 96.4%, as shown in Table 1. From Table 1, it can also be seen that the performance of the classifiers on the KTH data set improves successively from LDA to SVM–NN.

For the Ballet data set, the highest ARA achieved is 94%, through the SVM–NN classifier, as shown in Table 1. The performance of our approach is lowest on this data set, compared to the Weizmann and KTH data sets, because of the complex motion patterns, whose execution differs from actor to actor. The misclassification error is instigated by "hopping", as it is confused with the closely related activity "jump".

4.2. Comparison of recognition accuracy

LDA is a linear classifier whose processing speed is faster than that of a kernel-based classifier, but the speed comes at the cost of accuracy. Its MCE is 13.6%, which is higher than that of the others; hence it can be concluded that it may not be a good classifier for a dataset with more interclass similarity. The MCE of the nearest-neighbor classifier is about 7.2%, lower than that of LDA, and the reason for the lower MCE is the nonlinear nature of the classifier. The SVM is kernel-based and gives a unique solution because the optimization problem is convex; its MCE is 5.7%, lower than both LDA and KNN. The individual performance of the SVM and KNN approaches for the classification of human activity is already proven (Chaaraoui et al., 2013; Conde & Olivieri, 2015; Gorelick et al., 2007; Goudelis et al., 2013; Melfi et al., 2013; Rahman, Song, Leung, Lee, & Lee, 2014; Sadek, Hamadi, Elmezain, Michaelis, & Sayed, 2012; Saghafi & Rajan, 2012; Touati & Mignotte, 2014; Wu & Shao, 2013). Therefore, a hybrid SVM–NN classifier is proposed; its MCE of 3.2% is the lowest among all the classifiers used.

The effectiveness of our proposed description methodology of human activity is analyzed by comparing the highest ARA achieved on each data set with that of other methodologies, as presented in Tables 2–4. The highest ARA is achieved by the hybrid SVM–NN classification model on all the data sets used. Tables 2 and 3 show the comparison of our results achieved through the SVM–NN classification model with the similar state-of-the-art methods on the Weizmann and KTH data sets,

Fig. 9. Comparative performance of the classifiers in terms of misclassification error (%): LDA 13.6, NN 7.2, SVM 5.72, SVM–NN 3.2.

Table 2
Comparison of our results with other similar state-of-the-art methods on the Weizmann data set.

Methods                      Input         Classifiers    Test scheme    ARA (%)
Sadek et al. (2012)          Silhouettes   SVM            –              93.30
Saghafi and Rajan (2012)     Silhouettes   KNN            LOO            92.6
Goudelis et al. (2013)       Silhouettes   SVM            LOPO           93.14
Melfi et al. (2013)          Silhouettes   SVM            LOO            95.25
Rahman et al. (2014)         Silhouettes   KNN            LOO            94.49
Conde and Olivieri (2015)    Images        KNN            –              91.3
Proposed method              Silhouettes   SVM–NN         LOO            96.70
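The SVM–NN routing described in Section 3.3 can be sketched as follows. This is a minimal sketch under stated assumptions: it assumes a trained SVM that exposes a signed decision value, uses a toy linear decision function as a stand-in for that SVM, uses 1-NN as the fallback, and the margin threshold is a hypothetical parameter, not a value from the paper.

```python
def nearest_neighbor_label(x, train_X, train_y):
    """1-NN by squared Euclidean distance."""
    dists = [sum((a - b) ** 2 for a, b in zip(x, t)) for t in train_X]
    return train_y[dists.index(min(dists))]

def svm_nn_predict(x, svm_decision, svm_label, train_X, train_y, margin=1.0):
    """Hybrid rule: trust the SVM away from the hyperplane; near it
    (|decision value| < margin, i.e. support-vector territory),
    fall back to the nearest-neighbor vote."""
    score = svm_decision(x)
    if abs(score) >= margin:
        return svm_label(score)
    return nearest_neighbor_label(x, train_X, train_y)

# Toy stand-in for a trained linear SVM: decision value x0 - x1.
def svm_decision(x):
    return x[0] - x[1]

def svm_label(score):
    return "A" if score > 0 else "B"

train_X, train_y = [(0.0, 1.0), (1.0, 0.0)], ["B", "A"]

# A point far from the hyperplane is decided by the SVM alone;
# a point near it is rerouted to the 1-NN vote.
print(svm_nn_predict((3.0, 0.0), svm_decision, svm_label, train_X, train_y))
print(svm_nn_predict((0.4, 0.6), svm_decision, svm_label, train_X, train_y))
```

The design choice mirrors the text above: the SVM handles the easy, well-separated samples cheaply, while the non-parametric NN step resolves only the ambiguous samples near the separating hyperplane.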
References

Eweiwi, A., Cheema, S., Thurau, C., & Bauckhage, C. (2011). Temporal key poses for human action recognition. In Proceedings of the IEEE international conference on computer vision (pp. 1310–1317).
Fathi, A., & Mori, G. (2008). Action recognition by learning mid-level motion features. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179–188.
Gorelick, L., Blank, M., Shechtman, E., Irani, M., & Basri, R. (2007). Actions as space-time shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(12), 2247–2253.
Goudelis, G., Karpouzis, K., & Kollia, S. (2013). Exploring trace transform for robust human action recognition. Pattern Recognition, 46, 3238–3248.
Guha, T., & Ward, R. K. (2012). Learning sparse representations for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(8), 1576–1588.
Guo, G., & Lai, A. (2014). A survey on still image based human action recognition. Pattern Recognition, 47, 3343–3361.
Haralick, R. M., Shanmugam, K., & Dinstein, I. (1973). Textural features for image classification. IEEE Transactions on Systems, Man, and Cybernetics, 6, 610–621.
Hsu, C. W., & Lin, C. J. (2002). A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13(2), 415–425.
Iosifidis, A., Tefas, A., & Pitas, I. (2014). Discriminant bag of words based representation for human action recognition. Pattern Recognition Letters, 49, 185–192.
Jargalsaikhan, I., Little, S., Direkoglu, C., & O'Connor, N. E. (2013). Action recognition based on sparse motion trajectories. In Proceedings of the IEEE international conference on image processing (pp. 3982–3985).
Jolliffe, I. (1986). Principal component analysis. New York: Springer-Verlag.
Komorkiewicz, M., & Gorgon, M. (2013). Foreground object features extraction with GLCM texture descriptor in FPGA. In Proceedings of the IEEE conference on design and architectures for signal and image processing (DASIP).
Laptev, I. (2005). On space-time interest points. International Journal of Computer Vision, 64(2/3), 107–123.
Laptev, I., Caputo, B., Schuldt, C., & Lindeberg, T. (2007). Local velocity-adapted motion events for spatio-temporal recognition. Computer Vision and Image Understanding, 108(3), 207–229.
Masoud, O., & Papanikolopoulos, N. (2003). A method for human action recognition. Image and Vision Computing, 21(8), 729–743.
Melfi, R., Kondra, S., & Petrosino, A. (2013). Human activity modeling by spatio-temporal textural appearance. Pattern Recognition Letters, 34, 1990–1994.
Ming, X. L., Xia, H. J., & Zheng, T. L. (2013). Human action recognition based on chaotic invariants. Journal of Central South University, 20, 3171–3179.
Olivieri, D. N., Conde, I. G., & Sobrino, X. A. V. (2012). Eigen space based fall detection, and activity recognition from motion templates and machine learning. Expert Systems with Applications, 39, 5935–5945.
Poppe, R. (2010). A survey on vision-based human action recognition. Image and Vision Computing, 28, 976–990.
Rahman, S. A., Song, I., Leung, M. K. H., Lee, I., & Lee, K. (2014). Fast action recognition using negative space features. Expert Systems with Applications, 41, 574–587.
Ryoo, M. S., & Aggarwal, J. K. (2009). Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities. In Proceedings of the IEEE international conference on computer vision, Los Alamitos, CA.
Sadek, S., Hamadi, A. A., Elmezain, M., Michaelis, B., & Sayed, U. (2012). Human action recognition via affine moment invariants. In 21st international conference on pattern recognition (pp. 218–221).
Saghafi, B., & Rajan, D. (2012). Human action recognition using pose-based discriminant embedding. Signal Processing: Image Communication, 27, 96–111.
Schuldt, C., Laptev, I., & Caputo, B. (2004). Recognizing human actions: A local SVM approach. In Proceedings of the international conference on pattern recognition, 3, 32–36.
Tahir, M. A., Bouridane, A., & Kurugollu, F. (2005). An FPGA based coprocessor for GLCM and Haralick texture features and their application in prostate cancer classification. Analog Integrated Circuits and Signal Processing, 43, 205–215.
Thurau, C., & Hlavac, V. (2008). Pose primitive based human action recognition in videos or still images. In Proceedings of the conference on computer vision and pattern recognition (pp. 1–6).
Touati, R., & Mignotte, M. (2014). MDS-based multi-axial dimensionality reduction model for human action recognition. In Proceedings of the IEEE Canadian conference on computer and robot vision (pp. 262–267).
Vapnik, V. N. (1999). An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5), 988–999.
Vishwakarma, S., & Agrawal, A. (2012). A survey on activity recognition and behavior understanding in video surveillance. The Visual Computer, 29, 983–1009.
Wang, Y., & Mori, G. (2009). Human action recognition by semilatent topic models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(10), 1762–1774.
Weinland, D., Boyer, E., & Ronfard, R. (2007). Action recognition from arbitrary views using 3D exemplars. In IEEE international conference on computer vision (pp. 1–7).
Weinland, D., Ronfard, R., & Boyer, E. (2011). A survey of vision-based methods for action representation, segmentation, and recognition. Computer Vision and Image Understanding, 115, 224–241.
Wu, D., & Shao, L. (2013). Silhouette analysis-based action recognition via exploiting human poses. IEEE Transactions on Circuits and Systems for Video Technology, 23(2), 236–243.
Ziaeefar, M., & Bergevin, R. (2015). Semantic human activity recognition: A literature review. Pattern Recognition. http://dx.doi.org/10.1016/j.patcog.2015.03.006.