Introduction:
The most common form of interaction between humans is spoken language. Among the
many ways we communicate, spoken language has given humans an advantage over
other living species, enabling us to share ideas and thoughts directly.
Unfortunately, some people are not able to speak or hear and require an
alternative method of communication. Sign language is the primary alternative to
spoken language. It uses manual movements and body language to convey thoughts
to others; its basic components include hand movements, arm movements, and
facial expressions. Just as every region of the world has a unique spoken
language, every region has a unique sign language. Thus, sign language varies
from culture to culture and from region to region (Sandler, 2006). There are 25
sign languages in Africa alone, and America, Asia/Pacific, Europe, and the
Middle East each have their own sign languages (Aarons & Philemon, 2002). People
with speech and/or hearing impairment find it difficult to communicate with
other individuals via sign language because most people do not understand it.
ID:6703
Abstract:
The proposed system addresses the problem described above and uses image
processing techniques to assist individuals with hearing and speech disabilities
by translating their sign language gestures into spoken language. The basic tool
used is a Microsoft Kinect 360™ camera. The Kinect can directly provide depth
images of body joints, and its infrared camera helps minimize the effect of
lighting conditions. We use a Dynamic Time Warping (DTW) based algorithm to
recognize a performed gesture, which is then translated to spoken language by an
off-the-shelf software tool, NaturalReader 12.0
(http://www.naturalreaders.com/pc_nr12.php). Details on the DTW algorithm can be
found in Al-Naymat, Chawla, and Taheri (2009). Users with a hearing disability
only are provided with a text-based interface. The proposed solution also allows
new gestures to be added to the dictionary, which can later be recognized by the
system. For testing, we have focused on Indo-Pak sign language. However, for the
sake of completeness and to demonstrate the system's generality to other
gestures, we have tested the solution on three sets of gestures: (a) Pakistani
Sign Language (PSL), (b) generic gestures, and (c) American Sign Language (ASL).
The proposed system detects gestures while being insensitive to finger movements
(finger movements are part of only a few gestures; examples in PSL include the
Quit and Order gestures). The system detects gestures performed between the head
and the hip. It can be used by hearing- and speech-disabled individuals to
communicate better with other members of society, and it can also help make
places such as schools, shopping marts, and customer service counters more
accessible to the speech and hearing impaired. Research questions are described
in the next section. A review of literature is presented in the Related Work
section. The proposed system is presented in the Methodology section. Data
analysis and
Related Work:
This section presents a review of the literature closely related to the problem
addressed in this article. It first gives an overview of sign language, followed
by prominent work in gesture recognition; lastly, a review of Kinect-based
assistive technologies is presented.
Sign Language:
Sign language requires not only movement of the hands and arms but also specific
finger formations, facial expressions, and head movements. Most importantly,
sign language is not universal. Different countries have their own sign
languages; for instance, Pakistani sign language contains nearly 4,000 signs for
different words. Different regions, towns, and cities can also have their own
local sign language. Short signs can convey more meaning than a single short
word (Alvi et al., 2005; Hollis, 2011; Li, Lothrop, Gill, & Lau, 2011). Proper
sign language was first introduced in 1960. The manual component involves the
hands, whereas the involvement of the rest of the body in representing signs is
described as the non-manual component. Two signs can differ in hand location,
orientation, shape, and movement. Single-handed signs can be in motion or can be
represented with the hand in a static, resting position. In double-handed signs,
either one hand dominates the other or both hands share equal priority (Diwakar
& Basu, 2008; Li et al., 2011). Past, present, and future tenses in sign
languages are expressed through differences in how a sign is performed, for
example, performing the same sign with a different facial expression or with
movement of different body parts. To convey verbal aspects, a sign is performed
with repetitions. Sign language also has syntax; for instance, a question is
indicated by raising the eyebrows (Lucas, 1990). This requires including facial
features when recognizing gestures. However, the work presented in this article
focuses only on the hands; incorporating facial features into gesture
recognition is left as a future extension.
Gesture Recognition:
The gesture recognition literature addresses the problem with a diverse set of
techniques from artificial intelligence and image processing. Li (2012) used
Kinect to implement a gesture recognition system for a media player. A 3D vector
was used to capture hand motion, and a Hidden Markov Model (HMM; Lang, Block, &
Rojas, 2012) was used to detect hand gestures. The system was limited to hand
gestures that do not involve different fingertip alignments, and it could detect
top-to-bottom or left-to-right hand movements. The system of Raheja, Dutta,
Kalita, and Lovendra (2012) was able to locate fingertips and identify the
center of the palm; segmentation was used to separate the hand from video
frames.
The authors in Ren, Meng, Yuan, and Zhang (2011) presented a hand-gesture-based
Human Computer Interaction (HCI) solution using the Kinect sensor. They used the
Finger-Earth Mover's Distance (FEMD) for hand-gesture recognition and showed its
utility in two real-life applications. The authors in Abdur, Qamar, Ahmed,
Ataur, and Basalamah (2013) used Kinect in an interactive multimedia-based
environment that acts as a therapist for the rehabilitation of disabled
children. The system in Abdur et al. (2013) can be used in homes, in the absence
of an actual therapist, at the users' convenience. The authors claim to obtain
Methodology:
The proposed system consists of a series of steps for detecting a particular
gesture and translating it into vocal signals. Initially, the subject has to
position him/herself to face the Kinect device and start performing the gesture.
The user has to perform a specific pre-defined gesture: in our case, the system
requires the user to place both hands below the hip bone, with the distance
between the hands less than 0.5 m, to start or end a sign language gesture. We
call this gesture the Recording Translation Gesture (RTG). After the RTG is
performed, the remaining sequence of steps is as follows.
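As an illustration, the RTG check can be sketched as below. This is a minimal sketch, not the authors' exact implementation; representing joints as (x, y, z) tuples in metres with the y axis pointing up is an assumption.

```python
from math import dist  # Euclidean distance (Python 3.8+)

def is_rtg(left_hand, right_hand, hip_center):
    """True when both hands are below the hip joint and less than 0.5 m apart.

    Joints are (x, y, z) tuples in metres in Kinect camera space, y pointing up.
    """
    both_below_hip = left_hand[1] < hip_center[1] and right_hand[1] < hip_center[1]
    return both_below_hip and dist(left_hand, right_hand) < 0.5
```

For example, `is_rtg((-0.2, -0.7, 2.0), (0.1, -0.7, 2.0), (0.0, 0.0, 2.0))` holds, since both hands sit below the hip and are 0.3 m apart.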
Joints of Interest:
The Microsoft Kinect SDK provides functions to get the Cartesian coordinates of
the joints. We use these coordinates to store the movements performed by users
and record the performed gesture. The Microsoft Kinect SDK can track 20 joints
of the human body; however, we are interested only in those joints required for
detecting sign language gestures. For RTG gestures, we require the hip joint and
the center joint. Figure 2 shows the joints of interest; we use (1) head, (2)
right wrist, (3) left wrist, (4) right hand, (5) left hand, (6) spine, (7) hip
bone, (8) left shoulder, (9) center shoulder, and (10) right shoulder. We store
and track the coordinates of these joints and then normalize them.
Normalization:
Every user's height and dimensions can be different, and this has a large impact
on the performance of the system, because the X, Y, and Z coordinates of each
user's joints may differ. This can also happen because of the user's varying
position relative to the Kinect. Ideally, a user should stand at a distance of
six feet from the Kinect, directly in front of the camera, but this is not
always the case. The data therefore need to be normalized to increase the
accuracy of gesture recognition. The coordinates, when captured, are in the
Cartesian coordinate system.
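One plausible way to normalize the captured joint coordinates is sketched below: express each joint relative to the spine joint, convert to spherical coordinates, and divide the radius by the user's head-to-hip distance so that users of different sizes produce comparable values. The choice of reference joint and scale factor here is an assumption, not the authors' stated formula.

```python
from math import sqrt, atan2, acos, dist

def to_spherical(p, origin):
    """Cartesian point p relative to origin, as (r, theta, phi)."""
    x, y, z = (p[i] - origin[i] for i in range(3))
    r = sqrt(x * x + y * y + z * z)
    theta = acos(z / r) if r else 0.0  # polar angle from the z axis
    phi = atan2(y, x)                  # azimuth in the x-y plane
    return r, theta, phi

def normalize_frame(joints):
    """joints: dict joint-name -> (x, y, z) in Kinect camera space.

    Returns spherical coordinates relative to the spine joint, with the
    radius scaled by the head-to-hip distance (a proxy for body size).
    """
    scale = dist(joints["head"], joints["hip_center"]) or 1.0
    spine = joints["spine"]
    out = {}
    for name, p in joints.items():
        r, theta, phi = to_spherical(p, spine)
        out[name] = (r / scale, theta, phi)
    return out
```

With this scheme, a tall and a short user performing the same sign yield joint trajectories on a comparable scale, which is the point of the normalization step.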
Temporary Storage:
After normalization, the data are stored in memory. A linked list is maintained
to store the normalized skeleton frames until the gesture is completed. This is
done by storing the coordinates of the joints in private variables of an object
of the gesture class and forming a linked list of such objects. The ending RTG
marks the end of a gesture. When the complete system executes, the dictionary
(explained in the next section) is loaded into memory in the form of a
two-dimensional linked list of objects. As shown in Figure 3, all gestures are
linked vertically, whereas each individual gesture is connected to a list of
objects containing the joint values, spherical coordinates, and normalized
coordinates for each frame.
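To make the matching step concrete, the sketch below models each stored frame as a dict of normalized joint coordinates (a plain Python list stands in for the linked list of frame objects) and compares a performed gesture against dictionary entries with the standard DTW recurrence. The frame-distance function is an assumption, not the authors' exact cost.

```python
from math import dist, inf

def frame_distance(f1, f2):
    """Sum of Euclidean distances over the joints present in both frames."""
    return sum(dist(f1[j], f2[j]) for j in f1.keys() & f2.keys())

def dtw_cost(seq_a, seq_b):
    """Classic O(n*m) DTW alignment cost between two frame sequences."""
    n, m = len(seq_a), len(seq_b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of the three admissible alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def recognize(performed, dictionary):
    """Name of the dictionary gesture with the lowest DTW cost."""
    return min(dictionary, key=lambda name: dtw_cost(performed, dictionary[name]))
```

Because DTW aligns sequences non-linearly in time, the same gesture performed faster or slower still maps to a low cost, which is why it suits free-form signing speeds.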
Dictionary:
A dictionary is maintained to store the gestures that can be recognized by the
system. The dictionary has two modes: recording and translation. To add new
gestures to the dictionary, the recording mode is enabled. In recording mode,
once the ending RTG is performed, the normalized skeleton frame linked list is
written to the gesture dictionary. The gesture dictionary is a text file
Controlled environment:
This environment was based on a laboratory setup, and the research group's
premises were used for this purpose.
Field evaluation:
This environment was based on a real-life shopping mart, and the Institute's
shopping area was utilized for this purpose. A shopping mart was chosen to test
the system's performance in a real-life scenario in which the disabled interact
with relatively unacquainted people. A total
details are listed in Table 2. All 20 gestures were tested in the controlled
experiment, and 15 of these 20 in the field experiment.
recognized by the system. Table 3 also shows the detailed results of this
experiment. For the third configuration, users were positioned at a distance of
12 feet from the Kinect with the weight assigned to the hand and elbow as 1. The
same test was repeated keeping the distance from the Kinect constant and
changing the weights of the hand and elbow first to 0.5 and 0.2 and then to 0.3
and 0.1. All subjects tested the system three times by performing the 20
gestures in the dictionary. With the aforementioned configurations, all gestures
except 4 were successfully recognized by the system. The last three columns of
Table 3 show the detailed results of this experiment. The system successfully
detected almost all of the gestures. The overall accuracy of the system is found
by averaging the results of the individual tests, giving an overall accuracy of
91%.
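The aggregation itself is a simple mean. The per-test values below are hypothetical placeholders chosen only to illustrate the computation; the actual per-test figures are those in Table 3.

```python
# Hypothetical per-test recognition accuracies (fraction of the 20 gestures
# recognized in each test run); NOT the values from Table 3.
per_test_accuracy = [1.00, 0.95, 0.90, 0.85, 0.85]

overall = sum(per_test_accuracy) / len(per_test_accuracy)
print(f"overall accuracy: {overall:.0%}")  # prints: overall accuracy: 91%
```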
different days. A survey was conducted immediately after this session. The
survey covered the questions listed in the Research Questions section of this
paper. The participants of the survey were the 10 subjects and the 5 individuals
handling the cash counter with whom these subjects had interacted, both with and
without the support of the proposed system. The participants were asked
questions related to the acceptability, usefulness, portability, overall output,
and importance of the system, and had to answer each question with one of three
options: agreed, disagreed, or uncertain. The survey showed that 100% of the
individuals agreed about the proposed solution's acceptability. Regarding
usefulness, around 87% agreed that the solution is useful and helps in effective
communication, whereas almost 7% disagreed. Although the reported usefulness is
high, it could be improved further if the individuals using the system received
some training, since the system requires proper positioning in front of the
Kinect to start recognizing gestures. Around 53% of the participants found the
solution portable, whereas 46% disagreed about its portability. Since the
proposed solution requires moving the Kinect and connecting it to a processing
unit, its portability is certainly reduced. However, the system can be installed
at places such as schools, shopping marts, and customer service counters to make
them more friendly to speech- and hearing-impaired individuals. Of the
participants, 73% agreed that the overall output of the system makes it worth
using. Based on this discussion, Table 4 summarizes the key benefits and
limitations of the proposed solution. The questions listed in the Research
Questions section are answered by the proposed system with an average agreement
value of 74% among the participants, considering acceptability,
usefulness, portability, and overall output of the system. Although the results
of the field survey suggest improved communication between sign language
speakers and non-sign language speakers due to the proposed system, to further
assess the system's ability to assist in terms of time consumed, we compare the
time taken by each of the 10 disabled individuals with and without the support
of the system. The duration shows the number of minutes consumed by the subject
from entry to exit from the shopping mart. Table 5 lists the durations and shows
an average time of 13.2 minutes without the system, which is reduced to an
average of 8.4 minutes with the support of the proposed system. Every subject's
time decreased, with a maximum decrease of 7 minutes and a minimum of 2 minutes.
A t-test has been performed on the data presented in Table 5. The null
hypothesis was that the system's support does not reduce the time required for a
specific job; the alternative hypothesis was that the support of the proposed
system significantly reduces the time. The task is completed in 36% less time
with the proposed system's support. For the t-test with α = 0.05 (95% confidence
level), the test gave a p-value of .0000153, which is less than α. Thus, we
reject the null hypothesis and accept the alternative hypothesis.
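A paired t-test of this kind can be reproduced as follows. The per-subject timings below are hypothetical stand-ins consistent with the reported summary (means of 13.2 and 8.4 minutes, decreases between 2 and 7 minutes); the real values are in Table 5, so the resulting t statistic differs from the paper's.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical minutes spent in the shopping mart by 10 subjects,
# without and with the system (NOT the actual Table 5 data).
without_system = [15, 12, 14, 11, 16, 13, 12, 14, 10, 15]
with_system    = [9,  8,  9,  8,  9,  8,  8,  9,  8,  8]

diffs = [w - s for w, s in zip(without_system, with_system)]
n = len(diffs)
t_stat = mean(diffs) / (stdev(diffs) / sqrt(n))  # paired (one-sample) t statistic

reduction = 1 - mean(with_system) / mean(without_system)  # fraction of time saved

# Tabulated one-tailed critical value of Student's t for alpha = 0.05, df = 9.
T_CRITICAL = 1.833
reject_null = t_stat > T_CRITICAL
```

With these stand-in numbers the time saving is about 36% and the t statistic far exceeds the critical value, so the null hypothesis of no time reduction would be rejected, matching the paper's conclusion.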
Discussion:
Based on the two types of evaluation, the major finding of this work is that the
use of the proposed assistive technology enhances the communication capability
of the speech and hearing impaired. The key feature is the system's ability to
recognize a performed gesture and communicate it to a non-sign language speaker
via an audio or text-based interface. In the controlled experiment, the accuracy
of the system was tested using the three sets of gestures by varying the
distance of the
user from the Kinect and the weights assigned to the hand and elbow. For the
closest distance (6 feet) between the user and the Kinect, all gestures were
recognized by the system. As the distance increased from 6 feet to 9 feet, the
average recognition accuracy decreased from 100% to an average of 86%.
Increasing the distance further, to 12 feet, reduced recognition accuracy to
75%. Thus, the accuracy of the system decreases as the Kinect-to-user distance
increases. As the distance increases, the largest drops in recognition accuracy
are observed for the Direct free kick, Quit, Order, and Cost gestures. This is
because these gestures involve minor variations of the fingers, and as the
distance increases, the system either fails to recognize them or misclassifies
them as another, closely resembling gesture. The field evaluation tests the
system from the point of view of its users, including both sign language and
non-sign language speakers. The field evaluation results in 100% acceptability
of the proposed system, and 88% of the subjects find the system useful. However,
the portability of the system is rated at 53%, which needs to be addressed in
the future. Around 66% of the subjects find the system helpful in assisting them
with better communication. If we consider only the individuals with hearing or
speech disability, the system's rating in terms of assisting better
communication increases to 90%. The overall importance of the system is 66%.
However, if the system's importance is assessed by considering the disabled
subjects only, it rises to 80%. This shows that the proposed system is more
important to the disabled than to the rest. In summary, the proposed solution
helps improve communication between sign language speakers and non-sign language
speakers; however, its portability needs to be improved.
Limitations:
Some limitations of the proposed system, based on the controlled experiment and
field evaluation, are mentioned in this section. The proposed system recognizes
gestures stored in its dictionary; however, it performs poorly on gestures that
involve minor variations of the fingers. For example, the Quit and Order
gestures in PSL differ only in minor variations of finger orientation, and the
system shows poor performance in both cases. The proposed system also needs a
feature through which hearing people could communicate with deaf people. This
would require the system to record the voice of hearing people and convert it to
sign language gestures to be displayed on a screen. From the point of view of
the field evaluation, the system has been tested only in the shopping mart
scenario; it would be interesting to see the system's performance in other
situations and places as well. The system also has limitations from a
portability perspective: it would be more useful for the disabled if the system
were already provided at particular places, since carrying the complete system
for better communication seems undesirable, given the size and weight of the
gadgets. Finally, the system has been tested with only 20 gestures. It would be
useful to evaluate its performance with a larger number of words in its
dictionary.
Future work:
Future work will involve making the system more portable so that the disabled
can carry it at their convenience. This could be achieved by using a mobile
phone camera to obtain a sequence of images of hand gestures and then applying
image processing techniques to recognize the gesture. Since the mobile camera is
2D, calculating the depth factor of gestures will be challenging. Another
direction is to combine the two techniques of gesture recognition and finger
detection into a complete system capable of detecting any type of gesture,
especially those involving minor variations of the fingers. The presented system
is designed to recognize gestures and then use off-the-shelf software to convert
them to voice/text commands. It would be very useful to add a feature that
performs the converse process for all possible sign languages. Since there are
many sign languages, each with its own dictionary, optimizing memory utilization
will be important in this case. Incorporating facial expressions into sign
language recognition is another important direction to investigate. Finally,
this work can be extended by evaluating its performance with a larger dictionary
of more than 20 gestures and studying the effect on accuracy and speed.
References:
Aarons, D., & Philemon, A. (2002). South African sign language: One language or
many? In R. Mesthrie (Ed.), Language in South Africa (pp. 127–147). Cambridge,
UK: Cambridge University Press.
Abdur, R. M., Qamar, A. M., Ahmed, M. A., Ataur, R. M., & Basalamah, S. (2013,
April). Multimedia interactive therapy environment for children having physical
disabilities. Proceedings of the 3rd ACM Conference on International Conference
on Multimedia Retrieval, pp. 313–314.
Al-Naymat, G., Chawla, S., & Taheri, J. (2009). SparseDTW: A novel approach to
speed up dynamic time warping. Proceedings of the Eighth Australasian Data
Mining Conference, pp. 117–127.
Alvi, A. K., Azhar, M. Y. B., Usman, M., Mumtaz, S., Rafiq, S., Rehman, R. U., &
Ahmed, I. (2005). Pakistan sign language recognition using statistical template
matching. Proceedings of World Academy of Science, Engineering and Technology
(Vol. 3), pp. 1–4.
Armin, K., Mehrana, Z., & Fatemeh, D. (2013). Using Kinect in teaching children
with hearing and visual impairment. In M. Moazeni et al. (Eds.), Proceedings of
the 4th
Chai, X., Li, G., Chen, X., Zhou, M., Wu, G., & Li, H. (2013, October).
VisualComm: A tool to support communication between deaf and hearing persons
with the Kinect. Proceedings of the 15th International ACM SIGACCESS Conference
on Computers and Accessibility, p. 76.
Chang, C.-L., Chen, C.-C., Chen, C.-Y., & Lin, B.-S. (2013). Kinect-based
powered wheelchair control system. In D. Al-Dabass et al. (Eds.), Proceedings of
the Fourth International Conference on Intelligent Systems, Modelling and
Simulation, ISMS (pp. 186–189). January 29–30, Bangkok, Thailand: IEEE.
Chang, Y. J., Chen, S. F., & Huang, J. D. (2011). A Kinect-based system for
physical rehabilitation: A pilot study for young adults with motor disabilities.
Research in Developmental Disabilities, 32(6), 2566–2570.
Cooper, R. G., & Kleinschmidt, E. J. (2011). New products: The key factors in
success. Decatur, GA: Marketing Classics Press.
Diwakar, S., & Basu, A. (2008). A multilingual multimedia Indian sign language
dictionary tool. Proceedings of the 6th Workshop on Asian Language Resources,
Hyderabad, India, 11–12 January 2008, pp. 65–72.
Halim, Z., Baig, A. R., & Hasan, M. (2012). Evolutionary search for
entertainment in computer games. Intelligent Automation & Soft Computing, 18(1),
33–47.
Hollis, S. (2011). Sign language for beginners: Discover the art of sign
language. Bournemouth, UK: Print Smarter.
Lahamy, H., & Lichti, D. (2010). Real-time hand gesture recognition using range
cameras. Proceedings of the Canadian Geomatics Conference, Calgary, 15–18 June
2010, pp. 1–6.
Lang, S., Block, M., & Rojas, R. (2012). Sign language recognition with Kinect.
Proceedings of the 11th International Conference on Artificial Intelligence and
Soft Computing, Zakopane, 29 April–3 May 2012, pp. 394–402.
Li, K. F., Lothrop, K., Gill, E., & Lau, S. (2011). A web-based sign language
translator using 3D video processing. Proceedings of the IEEE International
Conference on Network-Based Information Systems, Melbourne, 26–28 September
2011, pp. 356–361.
Li, Y. (2012). Hand gesture recognition using Kinect. Proceedings of the 3rd
IEEE International Conference on Software Engineering and Service Science,
Beijing, 22–24 June 2012, pp. 196–199.
Lucas, C. (Ed.). (1990). Sign language research: Theoretical issues. Washington,
DC: Gallaudet University Press.
Masood, S., Parvez, Q. M., Shah, M. B., Ashraf, S., Halim, Z., & Abbas, G.
(2014). Dynamic time warping based gesture recognition. In A. Ghafoor et al.
(Eds.), Proceedings of the International Conference on Robotics and Emerging
Allied Technologies in Engineering, iCREATE (pp. 205–210). April 22–24,
Islamabad, Pakistan: IEEE.