
HUMAN ACTIVITY RECOGNITION USING SKELETON BODY POSE FEATURES AND SVM CLASSIFIER
Megha D Bengalur
BVBCET, Hubli
Department of Electronics and Communication Engineering.
megha776@gmail.com
ABSTRACT
In this paper, we address the problem of human activity recognition, using a support vector machine (SVM) classifier to classify different types of activities. Activity recognition aims to recognize the actions and goals of one or more agents from a series of observations of the agents' actions and environmental conditions. As a compact representation of postures, we use 3D skeleton joints taken from a depth sensor (Microsoft Kinect), which provides adequate accuracy for real-time full-body tracking of humans. We create a complete human activity dataset depicting each activity with RGBD images and motion capture data, and we make use of the skeleton information obtained from these videos to recognize the activities. We test our method on detecting and recognizing 13 different types of activities performed by 10 individuals from varied views in indoor and outdoor environments and achieve good performance. We show better results for the detection of activities even if the person was not seen in the training set, and achieve an overall detection accuracy of 89%.
Index Terms: Support vector machine (SVM), skeletal joint features, RGBD image, Microsoft Kinect, ROC curves.
1. INTRODUCTION
In this paper, we use a supervised machine learning approach for human action recognition, with particular emphasis on feature selection, data modeling, and classifier structure. Recently, human activity analysis has become one of the most active research areas in computer vision, owing to promising applications in areas such as visual surveillance, human performance analysis, and human-computer interfaces [1][2][3]. We recognize activities performed by individuals using an RGBD (Microsoft Kinect) sensor. A human joint sequence is an effective representation of structured motion [4], so we utilize only a sequence of tracked human joints inferred from RGBD images as features. We generate a dataset to evaluate various features for the detection of activities via SVMs. We collect a dataset for 13 different activities (drinking, walking, reading, waving, writing, clapping, stretching, dozing, etc.) from 10 participants. We

Figure 1: Illustration of the walk activity.

evaluate several geometric relational body-pose features, including joint features and plane features, on our dataset for activity detection. Experimentally, we find that joint features perform better than the other feature choices on this dataset. To date, research has mainly focused on learning and recognizing actions from video sequences taken by a single visible-light camera. Recently, the rapid development of depth sensors (e.g., Microsoft Kinect) has provided adequate accuracy for real-time full-body tracking at low cost. This enables us to once again explore the feasibility of skeleton-based features for activity recognition. The authors of [5] used a hierarchical maximum-entropy Markov model to recognize activities in unstructured environments; they infer the two-layered graph structure using a dynamic programming approach [6]. The authors of [7] use a generative model to classify activities with a Hidden Markov Model (HMM), which models randomly generated observable data. But in real-world applications, activities are seldom performed in structured environments, and different people perform activities at different rates. To overcome this problem, we use the PrimeSense skeleton tracking system (provided by the Microsoft Kinect), which extracts only the skeleton of a person. Hence we utilize only a sequence of tracked human joints inferred from RGBD images as features. It is interesting to evaluate body-pose features motivated by motion capture data [8][9][10] using skeletons tracked from a single depth sensor, so it does not matter whether the environment is structured or unstructured. We make use of the support vector machine (SVM) [11] to tackle irrelevant actions in whole-sequence classification. We find that the SVM-based classifier has good classification accuracy for detecting the activity even when the person is not present in the training set.
The contributions of this paper are:
1. We propose a method for human activity recognition for 13 different types of activities captured through a Kinect camera, using an SVM classifier.
2. We achieve good results even if the person is not present in the training set, by using human skeletal joint features.
3. We use discriminative models to achieve a better accuracy rate for all activities.
The proposed method is described in Section 2, which also provides a detailed description of the activity dataset and defines the geometric relational body-pose features used for activity detection. Section 3 describes the SVM classifier, Section 4 presents the experimental results, and Section 5 concludes the paper.
2. PROPOSED METHODOLOGY
We use the idea of the SVM to solve the problem of activity recognition in whole-sequence classification. Figure 2 shows our proposed methodology, with a database containing a training set for learning and a test set for validation. We extract distinct features for each activity to build an SVM model.

Figure 2: Overview of the method

2.1. Activity Dataset
We collect the activity dataset using the Microsoft Kinect sensor [12][13][14]. The camera captures both a color image and a depth image. The videos form a discriminative database captured in two environments (indoor and outdoor), so that we can train our learning model to detect activities irrespective of the environment. Each video in the training dataset contains six to seven sequences per action category, with activities performed in a way that differs from person to person. The dataset contains some activities, such as drinking and writing, performed with both the left and the right hand. The dataset also contains videos captured from varied views: left, right, front, and back. The entire dataset has a total of 600 videos comprising all 13 activities, viz. drinking, dozing, reading, stretching, writing, clapping, jumping, waving, running, walking, shaking hands, hugging, and walking while drinking.
Figure 3 provides some sample images from our dataset. Our dataset consists of 13 types of activities, motivated by the activity classes in [15][6][16]. These action categories are challenging because they are not only periodic actions but also have very similar body movements. For instance, stretching hands and waving hands contain common body movements, where the actor extends and then withdraws the arms. Similarly, shaking hands might be confused with hugging, where both actors approach, perform the activity, and depart. Figure 4 shows different views of an activity, and Figure 5 shows variations of subjects performing the same drinking activity.
2.2. Feature Extraction
2.2.1. Skeleton Joints
We use a supervised learning approach in which we collect labeled training data for training the proposed model. For the input data taken from the Kinect, different features are extracted, which are provided as input to the multi-class one-against-one and one-against-all SVM classifiers to train the model.
The skeleton is described by the lengths of the links and the joint angles, from which the human pose features are computed. We obtain the coordinates of each skeleton joint and compute the orientation matrix for each joint. Taking the person's torso as the origin, each orientation matrix is transformed to obtain a true pose, i.e., one that is invariant to the sensor location. Orientation information is obtained for only 10 of the 15 joints; the end joints of the two hands and two feet are excluded. For these 10 joints, the rotation matrix is converted into half-space quaternion values, which represent the data compactly. We obtain information about the person's posture (i.e., standing or sitting) by calculating the position of each foot with respect to the torso position.
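As an illustration of this step, the sketch below transforms each joint's rotation matrix into the torso frame and encodes it as a sign-normalized (half-space) quaternion. It is a minimal sketch under our own assumptions: the tracker is assumed to deliver a 3x3 rotation matrix per joint, and the 10-joint list is an illustrative guess, not the paper's exact set.

```python
import numpy as np

def torso_relative(joint_rot, torso_rot):
    """Express a joint's 3x3 rotation matrix in the torso frame,
    making the pose invariant to the sensor location."""
    # R_rel = R_torso^T @ R_joint (rotation matrices are orthogonal)
    return torso_rot.T @ joint_rot

def rot_to_quaternion(R):
    """Convert a 3x3 rotation matrix to a unit quaternion (w, x, y, z).
    This trace-based form is stable only when the trace is not close
    to -1, which is adequate for a sketch."""
    w = np.sqrt(max(0.0, 1.0 + R[0, 0] + R[1, 1] + R[2, 2])) / 2.0
    x = (R[2, 1] - R[1, 2]) / (4.0 * w)
    y = (R[0, 2] - R[2, 0]) / (4.0 * w)
    z = (R[1, 0] - R[0, 1]) / (4.0 * w)
    q = np.array([w, x, y, z])
    # "Half-space" normalization: q and -q encode the same rotation,
    # so force w >= 0 to remove the ambiguity.
    return q if q[0] >= 0 else -q

def pose_feature(rotations, torso_rot):
    """10 joints (hand/foot end joints excluded) -> 10 quaternions = 40 values.
    rotations: dict mapping joint name -> 3x3 rotation matrix from the tracker.
    The joint list below is illustrative, not the paper's exact set."""
    joints = ["head", "neck", "l_shoulder", "r_shoulder", "l_elbow",
              "r_elbow", "l_hip", "r_hip", "l_knee", "r_knee"]
    feats = [rot_to_quaternion(torso_relative(rotations[j], torso_rot))
             for j in joints]
    return np.concatenate(feats)
```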


2.2.2. Body Position


Figure 3: Activities a) Drinking b) Dozing c) Reading d) Stretching e) Writing f) Clapping g) Jumping h) Waving i) Running j) Walking k) Shaking hands l) Hugging m) Walking and Drinking.

We have activities performed in two environments (indoor and outdoor). A person performing an activity indoors is usually in a seated position, so the hands play a major role in performing the activities. To capture information about the hand gestures, the joint positions and the joint distances of the hand joints with respect to the torso and head are computed. The joint distance is defined as the Euclidean distance between all pairs of joints of a person at a particular time T; it captures the distance between two joints in a single pose. We also record the vertical and horizontal positions of the hands over the last n frames by capturing the geometric relationship between a plane and a joint. The plane features are computed with emphasis on the geometric relationship between a plane and a joint; for example, we can find how far a person's hand lies in front of the plane spanned by the hip, torso, and neck.
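To make these two feature types concrete, here is a small illustrative sketch of the pairwise joint distances and of a plane feature for the hip-torso-neck plane. The joint names and the sign convention are our own assumptions, not the paper's exact implementation.

```python
import numpy as np
from itertools import combinations

def joint_distances(joints):
    """Euclidean distance between all pairs of joints in one pose.
    joints: dict mapping joint name -> 3D position (np.array of shape (3,))."""
    names = sorted(joints)
    return np.array([np.linalg.norm(joints[a] - joints[b])
                     for a, b in combinations(names, 2)])

def plane_feature(joints, joint="r_hand"):
    """Signed distance of one joint from the plane spanned by hip, torso,
    and neck (the sign convention depends on the cross-product order)."""
    p_hip, p_torso, p_neck = joints["hip"], joints["torso"], joints["neck"]
    normal = np.cross(p_neck - p_torso, p_hip - p_torso)
    normal /= np.linalg.norm(normal)  # unit normal of the body plane
    return float(np.dot(joints[joint] - p_torso, normal))
```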
2.2.3. Motion and Velocity Information
Motion information plays a key role in classifying activities such as running and walking. We compute it by selecting n frames spaced over a specified time as n/2, 2n/2, ..., 10n/2, where n gives the number of frames to be chosen. Using this motion, we compute the joint rotation and joint angle changes that have occurred in each of these frames, represented as half-space quaternions. The velocity feature captures the velocity of one joint along the direction between two other joints at time T. The velocity information plays an especially key role in detecting the walking and running activities, where the only feature with much change is the velocity of the person.
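One possible reading of this feature, sketched under our own assumptions about the data layout: difference a joint's position across the sampled frames, then project the resulting velocity onto the unit direction between two reference joints.

```python
import numpy as np

def sampled_frames(n):
    """Frame offsets n/2, 2n/2, ..., 10n/2 described above (integer indices)."""
    return [k * n // 2 for k in range(1, 11)]

def joint_velocity(positions, t, dt, joint, ref_a, ref_b):
    """Velocity of `joint` at frame t along the direction from ref_a to ref_b.
    positions: dict of joint name -> array of shape (num_frames, 3)."""
    vel = (positions[joint][t] - positions[joint][t - dt]) / dt  # finite difference
    direction = positions[ref_b][t] - positions[ref_a][t]
    direction /= np.linalg.norm(direction)  # unit direction between reference joints
    return float(np.dot(vel, direction))
```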

3. SVM CLASSIFIER
We make use of multi-class SVM classifiers to classify a dataset with more than one class of activities. As SVMs [11][3][17] are inherently binary classifiers, the traditional way to do multi-class classification is to use the one-against-one or one-against-all method [18][19]. We make use of both methods in our approach: the NuSVC classifier [20][21][22] for the one-against-one approach and LinearSVC for one-against-all. In the case of one-against-one, each binary classifier gives one vote to its winning class and the test point is labeled with the class receiving the highest vote. In the case of one-against-all, we choose the class that classifies the test data with the largest margin. We have a set of 13 different activities with common features applied to all, so we create a feature space with a separate plane for every feature, each plane accommodating the corresponding feature of all the activities. Once the features are mapped into the feature space, we apply the SVM to each of these spaces so that the classifier distinguishes all the activities with separate hyperplanes. In the case of one-against-one we obtain 13(13-1)/2 machines (i.e., 78 machines), whereas one-against-all generates only 13 two-class cases. We compute the centroid of each activity in each feature space and use this data for test classification. Table 1 shows the results of the one-against-one and one-against-all classifiers.
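NuSVC and LinearSVC are, for instance, the scikit-learn classifier names matching this description (NuSVC uses one-against-one internally, while LinearSVC is one-against-all by default). The following is a minimal training sketch, not our exact pipeline; the array shapes and random placeholder data are illustrative only.

```python
import numpy as np
from sklearn.svm import NuSVC, LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# X: one 605-dimensional feature vector per video, y: activity labels (13 classes)
X_train = np.random.rand(650, 605)        # placeholder for extracted features
y_train = np.random.randint(0, 13, 650)   # placeholder activity labels

# One-against-one: NuSVC trains k(k-1)/2 = 78 pairwise machines for k = 13
ovo_clf = make_pipeline(StandardScaler(), NuSVC(nu=0.5, kernel="rbf"))
ovo_clf.fit(X_train, y_train)

# One-against-all: LinearSVC trains 13 one-vs-rest machines
ova_clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0))
ova_clf.fit(X_train, y_train)

X_test = np.random.rand(195, 605)
print(ovo_clf.predict(X_test[:5]), ova_clf.predict(X_test[:5]))
```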
The minimum distance classifier uses the Euclidean distance [23][24] to compute the distances between the test point T(F_1, ..., F_605) and the activity centroids A(F_1, ..., F_605). The two distances d_TA and d_T are calculated using the following formulas for our activities with 605 features.

d_{TA} = \sqrt{(F_1^T - F_1^A)^2 + \cdots + (F_{605}^T - F_{605}^A)^2}    (1)

d_T = \min(d_{TA_1}, d_{TA_2}, \ldots, d_{TA_{13}})    (2)

Figure 4: Different views of an activity.

where d_TA is the distance between the test sample and the centroid of the corresponding activity, and d_T is the minimum distance among all the activities. The decision is thus made that the test sample belongs to the group with the minimum distance.
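A direct transcription of equations (1) and (2) could look as follows, under the assumption that each activity centroid is the mean of that activity's training feature vectors; the helper names are our own.

```python
import numpy as np

def fit_centroids(X, y):
    """Centroid A(F_1..F_605) of each activity class: the mean of its
    training feature vectors."""
    return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

def classify_min_distance(x, centroids):
    """Equations (1) and (2): Euclidean distance d_TA from test point x to
    each centroid, then assign the class with minimum distance d_T."""
    dists = {label: np.linalg.norm(x - c) for label, c in centroids.items()}
    return min(dists, key=dists.get)
```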
4. EXPERIMENTS AND RESULTS
We demonstrate the effectiveness of the proposed classifier on detecting 13 different activities. We test our classifier on 3 different types of activity classes, namely: a person performing one activity in indoor and outdoor environments, two-person interaction activities, and a single person performing two activities simultaneously. We have a total of 50 videos per activity, taken in varied poses, as training data and 15 videos as test data. We extract the features of all the activities and train our learning model. We classify the test data with the SVM
classifier. In the case of one-against-one, the classifier gives one vote to the winning class and the test point is labeled with the class with the highest vote. In the case of one-against-all, we choose the class that classifies the test data with the largest margin. One disadvantage of one-against-all relative to one-against-one is that its performance can be compromised by an unbalanced training dataset. However, one-against-one is computationally intensive, since the results of more SVM pairs must be computed. Figure 6 shows examples of frame sequences in which one activity is misclassified as another in the beginning of the sequence. For such cases, we have trained on many sequences of the activities performed periodically.

Figure 6: Examples of sequences of frames from real-time interaction detection. Each row shows the detected activity, with boxes marking false detections. Top: the first 4 frames are incorrectly classified as writing instead of drinking. Bottom: the first 4 frames are classified as shaking hands instead of hugging. All these false detections are caused by irrelevant actions in the training data.

Table 1 shows the results of the one-against-one and one-against-all classifiers for all the activities under the different cases Test 1, Test 2, and Test 3. Table 2 shows the overall performance analysis of both classifiers. From Table 1 we see that the classification accuracy of the classifiers decreases slightly in Test 3, but this decrease is largely compensated by our discriminative model of the video dataset. Hence the detection accuracy of our proposed model is better when compared to the accuracy of the other papers in the literature survey.

The portion of the test videos correctly classified as the corresponding activity (true positives) and wrongly classified as other activities (false positives) is used to obtain the receiver operating characteristic (ROC) curves [25][20]. The ROC curves of our proposed framework for human activity detection in varied environments are shown in Figures 7, 8, 9, and 10. We obtain a good detection rate for all three classes of activities.
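Such per-class ROC curves can be derived from the classifiers' decision scores. A sketch using scikit-learn's roc_curve follows, assuming a fitted classifier that exposes decision_function with one column per class; the function name is our own.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def per_class_roc(clf, X_test, y_test, n_classes=13):
    """One ROC curve per activity: the class's decision score against a
    binary 'is this activity' label (one-vs-rest view of the test set)."""
    scores = clf.decision_function(X_test)  # shape (n_samples, n_classes)
    curves = {}
    for c in range(n_classes):
        fpr, tpr, _ = roc_curve(y_test == c, scores[:, c])
        curves[c] = (fpr, tpr, auc(fpr, tpr))
    return curves
```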

Figure 5: Variations of subjects performing the same action.

Table 2: Performance analysis on whole-sequence classification for both a training set containing the test set (Set 1) and a training set not containing the test set (Set 2). One-against-one has better classification results than one-against-all SVM.

Classifier      | Set 1   | Set 2   | Performance decrease
One-against-one | 89.538% | 80.230% | -9.308%
One-against-all | 87%     | 78.234% | -8.766%

5. CONCLUSION
We have addressed the problem of human activity recognition using a supervised learning approach. We have used the one-against-one and one-against-all multi-class SVM approaches to classify all the activities. We have proposed skeletal joint features extracted from depth videos with varied poses
Figure 7: ROC graph of indoor activities for one-against-one SVM.

Figure 8: ROC graph of outdoor activities for one-against-one SVM.

Figure 9: ROC graph of indoor activities for one-against-all SVM.

Figure 10: ROC graph of outdoor activities for one-against-all SVM.

as the features for detecting all the activities. We have demonstrated that the proposed method gives an overall detection rate better than what is reported in the literature for different classes of activities, such as two-person interaction and two activities performed simultaneously. We find that our approach performs well in detecting activities even if the person was not seen in the training set, with a detection accuracy of 89%. The main significance of our approach is that we have evaluated it on a relatively large dataset captured from multiple views, with activities performed differently from person to person. The papers in the literature survey detect only a single person or two-person interaction. Even though the results are not 100% accurate, our approach succeeds in recognizing three kinds of activities: a single person performing one activity, two-person interaction, and a single person performing two activities simultaneously, which the papers in the literature survey fail to do. We thus propose the recognition of a single person performing two activities as the novelty of our approach.

6. REFERENCES
[1] R. W. Poppe, "A survey on vision-based human action recognition."
[2] A. Yao, J. Gall, G. Fanelli, and L. Van Gool, "Does human action recognition benefit from pose estimation?"
[3] M. Scholten, "Testing of the support vector machine for binary-class classification."
[4] J. Gu, X. Ding, S. Wang, and Y. Wu, "Action and gait recognition from recovered 3-D human joints."
[5] J. Sung, C. Ponce, B. Selman, and A. Saxena, "Unstructured human activity detection from RGBD images."
[6] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: a local SVM approach."
[7] L. Xia, C.-C. Chen, and J. K. Aggarwal, "View invariant human action recognition using histograms of 3D joints."
[8] A. Baak, M. Müller, and H.-P. Seidel.
[9] L. Kovar and M. Gleicher, "Automated extraction and parameterization of motions in large data sets."
[10] M. Müller, T. Röder, and M. Clausen, "Efficient content-based retrieval of motion capture data."
[11] C.-W. Hsu, C.-C. Chang, and C.-J. Lin, "A practical guide to support vector classification."
[12] PrimeSense, http://www.primesense.com, 2010.
[13] Y. M. Kim, "Microsoft Kinect," Geometric Computing Group, 2012.
[14] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, R. Moore, A. Kipman, and A. Blake, "Real-time human pose recognition in parts from single depth images," Microsoft Research Cambridge and Xbox Incubation, 2011.
[15] J. C. Niebles, H. Wang, and L. Fei-Fei, "Unsupervised learning of human action categories using spatial-temporal words."
[16] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, "Learning realistic human actions from movies."
[17] N. Stanevski and D. Tsvetkov, "Using support vector machine as a binary classifier."
[18] Y. Liu and Y. F. Zheng, "One-against-all multi-class SVM classification using reliability measures."
[19] K.-B. Duan, J. C. Rajapakse, and M. N. Nguyen, "One-versus-one and one-versus-all multiclass SVM-RFE for gene selection in cancer classification."
[20] T. Fawcett, "An introduction to ROC analysis."
[21] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee, "Choosing multiple parameters for support vector machines."
[22] K.-P. Wu and S.-D. Wang, "Choosing the kernel parameters for support vector machines by the inter-cluster distance in the feature space," National Taiwan University, Taipei, Taiwan, 2009.
[23] B. Burgstaller and F. Pillichshammer, "The average distance between two points," Linköping University, 2008.
[24] Y. Ye, "Semidefinite programming for Euclidean distance geometric optimization," Stanford University, 2009.
[25] C. E. Metz, "Basic principles of ROC analysis."

Table 1: One-against-one (OAO) and one-against-all (OAA) SVM classifier results. In Test 1, 1/3 of the samples are used as training samples and the rest as testing samples. In Test 2, 2/3 of the samples are used as training samples. Test 3 is a cross-subject test: half of the subjects are used for training and the rest for testing.

Activity               | Test 1 OAO | Test 1 OAA | Test 2 OAO | Test 2 OAA | Test 3 OAO | Test 3 OAA
Indoor activities
  Drinking             |    94%     |    90%     |    94%     |    92%     |    84%     |    82%
  Dozing               |    94%     |    90%     |    94%     |    92%     |    84%     |    82%
  Reading              |    90%     |    89%     |    92%     |    88%     |    80%     |    79%
  Stretching           |    82%     |    82%     |    84%     |    82%     |    75%     |    70%
  Writing              |    90%     |    87%     |    90%     |    86%     |    80%     |    78%
  Clapping             |    86%     |    84%     |    88%     |    84%     |    76%     |    72%
Outdoor activities
  Jumping              |    90%     |    88%     |    91%     |    90%     |    78%     |    70%
  Waving               |    82%     |    80%     |    86%     |    86%     |    80%     |    75%
  Running              |    86%     |    84%     |    88%     |    85%     |    80%     |    75%
  Walking              |    86%     |    85%     |    90%     |    86%     |    81%     |    76%
Two-person interaction activities
  Shaking hands        |    88%     |    87%     |    92%     |    90%     |    84%     |    78%
  Hugging              |    89%     |    86%     |    94%     |    92%     |    86%     |    84%
One person performing two activities
  Walking and Drinking |    83%     |    80%     |    85%     |    80%     |    81%     |    76%
