
CERTIFICATE

TABLE OF CONTENTS

Declaration

Acknowledgement

Certificate

1. Abstract

2. Related Work

3. Introduction

i) Objective
ii) Indian Sign Language
iii) Description

4. Proposed Work

i) Image Pre-processing
ii) Feature Extraction
iii) Classification
iv) Flowchart

5. Results

6. Conclusion

7. Future Work

8. References

ABSTRACT

Sign languages are a set of languages that use predefined actions and movements to convey a
message. These languages are primarily developed to aid deaf and other verbally challenged people.
They use a simultaneous and precise combination of hand movement, hand orientation, hand shape,
etc. Different regions have different sign languages, such as American Sign Language and Indian
Sign Language. We focus on Indian Sign Language in this project.

Indian Sign Language (ISL) is a sign language that is predominantly used in South Asian countries. It
is sometimes referred to as Indo-Pakistani Sign Language (IPSL). Several features distinguish ISL
from other sign languages, notably its number signs, family relationship signs and use of space.
Also, ISL does not have any temporal inflection.

In this project, we aim to analyze and recognize various alphabets from a database of sign
images. The database consists of images captured under different lighting conditions and with
different hand orientations. With such a diverse data set, we are able to train our system well
and thus obtain good results.

We investigate different machine learning techniques, namely Support Vector Machines (SVM), Logistic
Regression and K-Nearest Neighbours (KNN), together with a neural network technique, Convolutional
Neural Networks (CNN), for detection of sign language.

The system is implemented in Python and trained using a dataset of around 5000 images. The
system can be extended to new datasets.

RELATED WORK

With increasing awareness of the need to help specially challenged people around the world, a lot of
work is being carried out in the field of sign language recognition.

Pansare et al. worked on recognizing Indian Sign Language (ISL). They first removed background
noise using median and Gaussian filters, which made it easier to apply morphological operations.
To detect the edges of the Region of Interest, the Sobel edge detection method was used, which
further aided in finding the centroid and area of the sign. 100 samples each of 26 alphabets were
classified by calculating their Euclidean distance. The average accuracy achieved was 90%.

Other work in this field involved segmentation of the images prior to feature extraction.
Rekha et al. proposed recognition of signs by first segmenting the hand region using skin
colour detection in the YCbCr colour space. Feature detection was then carried out using the
Principal Curvature Based Region (PCBR) detector, Wavelet Packet Decomposition (WPD-2) and the
convexity defects method. With 40 training samples each for 23 alphabets, a multi-class SVM was
used for classification, giving an accuracy of 91.3%.

Real-time sign recognition has also been implemented, although it is still at an early stage. Dardas
and Georganas used Bag-of-Features (BoF) and a Support Vector Machine (SVM) to implement such a
real-time framework. The hand region was segmented from the background in the HSV colour space. The
Scale-Invariant Feature Transform (SIFT) algorithm was employed to extract keypoints from the
detected hand region, which were first quantized using K-means clustering and then mapped into a BoF
representation.

In Arabic Sign Language (ArSL) recognition, Tharwat et al. [4] first extract SIFT features from the
sign performed. The authors used Linear Discriminant Analysis (LDA) to reduce the dimensionality of
the extracted SIFT features, which was shown to reduce computational time. The performance of SVM,
k-Nearest Neighbour and minimum distance classifiers is compared, and classification using SVM is
found to yield the highest accuracy. Using 30 ArSL signs with 7 training images each, an average
accuracy of 99% is obtained.

More recent research suggests and implements the use of additional hardware for more effective sign
language recognition. Chai et al. created a database of 3D motion trajectories by obtaining colour
and depth data using Kinect. The Euclidean distance between a new motion trajectory and those in the
database is then calculated for matching. Sign language recognition has also been facilitated using
the Leap Motion Controller (LMC). Although the LMC has some drawbacks, such as blind spots caused by
finger occlusion, it was still able to provide useful features like palm centre, fingertip position,
hand angle and orientation to improve the classification result.

INTRODUCTION
1. OBJECTIVE

The aim of the project is to carry out a comparative study of different machine learning algorithms
and a neural network technique, CNN, in order to achieve higher accuracy. A significant amount of
time is also spent on image pre-processing methods so as to improve the results of the applied
algorithms.

2. INDIAN SIGN LANGUAGE

Sign language is a form of communication based on hand gestures involving visual motions and signs,
used notably by the deaf and verbally challenged community. The symbols for the different alphabets
are as follows:

[Figure: hand signs for the 24 alphabets A, B, C, D, E, F, G, H, I, K, L, M, N, O, P, Q, R, S, T, U, V, W, X and Y]

3. DESCRIPTION

Sign language recognition in general involves a few phases, namely segmentation, feature
extraction and classification.

Detailed Explanation of the processes involved:

a) Segmentation:

The main objective of the segmentation phase is to remove the background and noise, leaving only
the Region of Interest (ROI), which is the only useful information in the image. This is achieved
via skin masking, which defines a threshold on the colour values, followed by conversion of the
colour image to a grey-scale image. Finally, the Canny edge technique is employed to identify and
detect the presence of sharp discontinuities in the image, thereby detecting the edges of the figure
in focus.

[Figure: segmentation stages, showing the BGR to HSV, Masked and Canny-Edge images]

b) Feature Extraction:

The Speeded Up Robust Features (SURF) technique is used to extract descriptors from the segmented
hand gesture images. SURF is a novel feature extraction method which is robust against rotation,
scaling, occlusion and variation in viewpoint.

[Figure: SURF features]

c) Classification:

The SURF descriptors extracted from each image differ in number but all have the same dimension
(64). However, a multiclass SVM requires a feature vector of uniform dimension as its input. Bag of
Features (BoF) is therefore implemented to represent the features as a histogram of visual
vocabulary rather than as the raw descriptors. The extracted descriptors are first quantized into
150 clusters using K-means clustering: given a set of descriptors, K-means clustering assigns each
descriptor to one of K cluster centres.

The clustered features then form the visual vocabulary, where each feature corresponds to an
individual sign language gesture. With the visual vocabulary, each image is represented by the
frequency of occurrence of all clustered features. BoF thus represents each image as a histogram of
features, in this case for the 24 classes of sign language gestures.

PROPOSED WORK

1. Image Pre-processing
The original data-set consists of a number of images for each alphabet. Of these images, 60%
are used for training our system and 40% for testing it. All of the images are in RGB format.

To apply feature extraction algorithms to these images and obtain effective results, it is
important to obtain a grey-scale version of these images. This is done as follows:

a. The entire data-set is transformed from the RGB model to the HSV model.
b. The colour of skin falls within a certain colour range, so skin masking is performed by
selecting only those colour components in all the images that fall in that range.
c. The resultant images are converted into single-channel grey-scale format.
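
The three steps above can be sketched as follows. This is a minimal illustration using only the Python standard library; the skin range constants (`H_MAX`, `S_MIN`, `S_MAX`, `V_MIN`) and the function name `skin_mask_to_grey` are hypothetical and are not the values or code used in the project.

```python
import colorsys

# Hypothetical skin range in HSV, with every component scaled to [0, 1].
# These bounds are illustrative assumptions, not the project's thresholds.
H_MAX = 50 / 360
S_MIN, S_MAX = 0.15, 0.75
V_MIN = 0.35

def skin_mask_to_grey(rgb_image):
    """Return a single-channel image: grey level for skin pixels, 0 elsewhere."""
    grey = []
    for row in rgb_image:
        grey_row = []
        for r, g, b in row:
            # Step a: RGB -> HSV.
            h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
            # Step b: keep only pixels whose colour falls in the skin range.
            is_skin = h <= H_MAX and S_MIN <= s <= S_MAX and v >= V_MIN
            # Step c: convert to grey scale using the usual luma weights.
            level = round(0.299 * r + 0.587 * g + 0.114 * b)
            grey_row.append(level if is_skin else 0)
        grey.append(grey_row)
    return grey

# A tiny 2x2 image: two skin-like tones, one blue pixel, one near-black pixel.
image = [[(224, 172, 140), (30, 60, 200)],
         [(10, 10, 10), (198, 134, 100)]]
mask = skin_mask_to_grey(image)
```

In practice the range would have to be tuned to the lighting conditions and skin tones present in the data set.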

Now, to filter out redundant information and obtain the primary required data from each
image, we use the Canny edge detection method.

Canny edge detection is a technique to identify and detect the presence of sharp
discontinuities in an image. It is a multi-stage algorithm and its stages are:

a. Noise Reduction:
Since edge detection is sensitive to noise in the image, the noise is first filtered out
with a Gaussian filter.

b. Intensity Gradient:
The smoothed image is filtered with a Sobel kernel in both the horizontal and vertical
directions to get the first derivatives in the horizontal (Gx) and vertical (Gy) directions.
From these two images, the gradient magnitude and direction are found for each pixel.

c. Non-maximum Suppression:
After getting the gradient magnitude and direction, a full scan of the image is done to remove
any unwanted pixels which may not constitute an edge. For this, every pixel is checked to see
whether it is a local maximum in its neighbourhood in the direction of the gradient.

For example, suppose point A lies on a vertical edge. The gradient direction is normal to the
edge, and points B and C lie on either side of A along the gradient direction. Point A is
therefore compared with points B and C to see if it forms a local maximum. If so, it is
considered for the next stage; otherwise, it is suppressed (set to zero).

d. Hysteresis Thresholding:
This stage decides which of the detected edges are really edges and which are not. For this, we
need two threshold values, minVal and maxVal. Any edges with an intensity gradient greater than
maxVal are sure to be edges, and those below minVal are sure to be non-edges and are discarded.
Those that lie between these two thresholds are classified as edges or non-edges based on their
connectivity: if they are connected to "sure-edge" pixels, they are considered to be part of
the edges; otherwise, they are also discarded.
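
Two of the stages above, the intensity gradient and hysteresis thresholding, can be illustrated with a short NumPy sketch. It is a simplified illustration rather than the routine used in the project: Gaussian smoothing and non-maximum suppression are omitted, and the helper names (`filter3x3`, `gradient`, `hysteresis`) are hypothetical.

```python
from collections import deque

import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def filter3x3(image, kernel):
    """Correlate a 2-D image with a 3x3 kernel; the 1-pixel border stays zero."""
    h, w = image.shape
    out = np.zeros((h, w), dtype=float)
    for dy in range(3):
        for dx in range(3):
            out[1:h - 1, 1:w - 1] += kernel[dy, dx] * image[dy:h - 2 + dy, dx:w - 2 + dx]
    return out

def gradient(image):
    """Return gradient magnitude and direction from the Sobel derivatives Gx, Gy."""
    gx = filter3x3(image, SOBEL_X)
    gy = filter3x3(image, SOBEL_Y)
    return np.hypot(gx, gy), np.arctan2(gy, gx)

def hysteresis(magnitude, min_val, max_val):
    """Keep sure edges (> max_val) plus in-between pixels connected to them."""
    strong = magnitude > max_val
    candidate = magnitude >= min_val
    edges = strong.copy()
    queue = deque(zip(*np.nonzero(strong)))
    h, w = magnitude.shape
    while queue:
        y, x = queue.popleft()
        # Grow the edge into 8-connected neighbours that pass the lower threshold.
        for ny in range(max(y - 1, 0), min(y + 2, h)):
            for nx in range(max(x - 1, 0), min(x + 2, w)):
                if candidate[ny, nx] and not edges[ny, nx]:
                    edges[ny, nx] = True
                    queue.append((ny, nx))
    return edges

# A vertical step edge: the gradient is large only along the step.
step = np.zeros((5, 6))
step[:, 3:] = 100.0
magnitude, direction = gradient(step)

# One row of made-up gradient magnitudes: 70 passes minVal but is isolated.
edges = hysteresis(np.array([[0.0, 50.0, 120.0, 60.0, 0.0, 70.0]]), min_val=40, max_val=100)
```

In the second example the pixels with values 50 and 60 are kept because they touch the sure edge (120), while the isolated 70 is discarded.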

2. Feature Extraction
The SURF technique, which was developed based on SIFT, is employed to extract descriptors
from the segmented hand gesture images. SIFT is a feature extraction method which is
robust against rotation, scaling, occlusion and variation in viewpoint. SURF was proposed as a
more computationally efficient alternative to it. As opposed to the Difference of Gaussian
(DoG) used in SIFT, SURF approximates the Laplacian of Gaussian (LoG) with a box filter. The
convolution with a box filter can be calculated quickly using integral images and can be done
in parallel for different scales, so SURF is much faster than SIFT. To detect interest points,
SURF uses an integer approximation of the determinant of Hessian blob detector, which can be
computed with three integer operations using a pre-computed integral image. Its feature
descriptor is based on the sum of the Haar wavelet responses around the point of interest.
Square-shaped filters are used as an approximation of Gaussian smoothing. The integral image at
a location (x, y) is the sum of the intensity values of all points in the image with a location
less than or equal to (x, y).
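
To illustrate why integral images make box filters cheap, the sketch below builds an integral image with NumPy and then evaluates the sum over any rectangle from a handful of lookups. The function names `integral_image` and `box_sum` are hypothetical and the example is not taken from the project code.

```python
import numpy as np

def integral_image(image):
    """S[y, x] = sum of image[0:y+1, 0:x+1]."""
    return image.cumsum(axis=0).cumsum(axis=1)

def box_sum(integral, top, left, bottom, right):
    """Sum of image[top:bottom+1, left:right+1] using at most four lookups."""
    total = integral[bottom, right]
    if top > 0:
        total -= integral[top - 1, right]
    if left > 0:
        total -= integral[bottom, left - 1]
    if top > 0 and left > 0:
        total += integral[top - 1, left - 1]
    return total

image = np.arange(16).reshape(4, 4)
ii = integral_image(image)
inner = box_sum(ii, 1, 1, 2, 2)   # sum of the central 2x2 block
```

The cost of `box_sum` does not depend on the size of the rectangle, which is what allows box filters of different scales to be evaluated at the same speed.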

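SURF employs the Hessian blob detector to obtain interest points. The determinant of the Hessian
matrix describes the extent of the response and is an expression of the local change around the
area. The Hessian matrix at point $\mathbf{x} = (x, y)$ and scale $\sigma$ is defined as

$$H(\mathbf{x}, \sigma) = \begin{pmatrix} L_{xx}(\mathbf{x}, \sigma) & L_{xy}(\mathbf{x}, \sigma) \\ L_{xy}(\mathbf{x}, \sigma) & L_{yy}(\mathbf{x}, \sigma) \end{pmatrix}$$

where $L_{xx}(\mathbf{x}, \sigma)$ is the convolution of the second-order Gaussian derivative with
the image at point $\mathbf{x}$, and similarly for $L_{xy}$ and $L_{yy}$.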

3. Classification

The SURF descriptors obtained differ in number from image to image, although each has the same
dimension. However, a multiclass SVM and other ML techniques require a feature vector of uniform
dimension as input. Bag of Features (BoF) is therefore implemented to represent the features as a
histogram.

The following steps are followed to achieve this:

a) The descriptors extracted are first clustered into 150 clusters using K-means clustering.

b) The K-means clustering technique categorizes the m descriptors into k cluster centres.

c) The clustered features form the basis for the histogram, i.e. each image is represented by
the frequency of occurrence of all clustered features.

d) BoF represents each image as a histogram of features; in our case, histograms are generated
for the 24 classes of sign language.
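
Step c can be sketched as below: once a vocabulary of cluster centres exists, every image is reduced to a fixed-length histogram no matter how many descriptors it produced. The random arrays are stand-ins for real SURF descriptors and K-means centres, and `bof_histogram` is a hypothetical helper name.

```python
import numpy as np

def bof_histogram(descriptors, vocabulary):
    """Map an (m, d) descriptor array to a length-k histogram of visual words."""
    # Squared Euclidean distance from every descriptor to every cluster centre.
    diffs = descriptors[:, None, :] - vocabulary[None, :, :]
    distances = (diffs ** 2).sum(axis=2)
    # Each descriptor votes for its nearest cluster centre.
    nearest = distances.argmin(axis=1)
    return np.bincount(nearest, minlength=len(vocabulary))

rng = np.random.default_rng(0)
vocabulary = rng.normal(size=(150, 64))   # stand-in for the 150 K-means centres
image_a = rng.normal(size=(37, 64))       # images yield different numbers
image_b = rng.normal(size=(112, 64))      # of 64-dimensional descriptors
hist_a = bof_histogram(image_a, vocabulary)
hist_b = bof_histogram(image_b, vocabulary)
```

Both histograms have 150 bins, so they can be fed to a classifier that expects inputs of uniform dimension.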

K-Means Clustering:

K-means clustering is a method of vector quantization, originally from signal processing, that
is popular for cluster analysis in data mining. K-means clustering aims to partition n
observations into k clusters in which each observation belongs to the cluster with the nearest
mean, serving as a prototype of the cluster.

The problem is computationally difficult (NP-hard); however, there are efficient heuristic
algorithms that are commonly employed and converge quickly to a local optimum. These are
usually similar to the expectation-maximization algorithm for mixtures of Gaussian
distributions via an iterative refinement approach employed by both algorithms. Additionally,
they both use cluster centres to model the data; however, k-means clustering tends to find
clusters of comparable spatial extent, while the expectation-maximization mechanism allows
clusters to have different shapes.

The algorithm has a loose relationship to the k-nearest neighbour classifier, a popular
machine learning technique for classification that is often confused with k-means because of
the k in the name. One can apply the 1-nearest neighbour classifier on the cluster centres
obtained by k-means to classify new data into the existing clusters. This is known as nearest
centroid classifier or Rocchio algorithm.

Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real
vector, k-means clustering aims to partition the n observations into k (≤ n) sets S = {S1, S2,
…, Sk} so as to minimize the within-cluster sum of squares (WCSS), i.e. the sum of squared
distances of each point in a cluster to that cluster's centre. In other words, its objective is
to find:

$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$

where μi is the mean of points in Si.

PSEUDO CODE

1. Select K points as the initial centroids.
2. Repeat
3.     Form K clusters by assigning all points to the closest centroid.
4.     Re-compute the centroid of each cluster.
5. Until the centroids don't change.
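
The pseudo code above translates almost line by line into NumPy. The sketch below is a minimal illustration, not the project's implementation; the function name `k_means`, the `max_iter` safeguard and the choice to pass the initial centroids explicitly are assumptions made here so that the example is deterministic.

```python
import numpy as np

def k_means(points, initial_centroids, max_iter=100):
    """Lloyd's algorithm: returns the final centroids and the label of each point."""
    centroids = np.asarray(initial_centroids, dtype=float)   # step 1 (chosen by caller)
    k = len(centroids)
    for _ in range(max_iter):                                 # step 2: repeat
        # Step 3: assign every point to its closest centroid.
        distances = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recompute each centroid (keep the old one if a cluster is empty).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated groups of made-up 2-D points.
points = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
centroids, labels = k_means(points, initial_centroids=points[[0, 3]])
```

Because the algorithm only converges to a local optimum, the result depends on the initial centroids; a different starting pair can end in a worse partition.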

After the histograms for the 24 classes of Indian Sign Language are built, various machine
learning algorithms are applied for a comparative study.
A comparison is also made to measure the effect of the SURF feature extraction technique
on overall accuracy. As expected, there is a significant improvement in results after applying
the SURF technique.
The following classification methods are used:

1. NAIVE BAYES CLASSIFIER:

Naive Bayes is a simple technique for constructing classifiers: models that assign class labels
to problem instances, represented as vectors of feature values, where the class labels are
drawn from some finite set. It is not a single algorithm for training such classifiers, but a
family of algorithms based on a common principle: all naive Bayes classifiers assume that the
value of a particular feature is independent of the value of any other feature, given the class
variable.

For some types of probability models, naive Bayes classifiers can be trained very efficiently
in a supervised learning setting. In many practical applications, parameter estimation for
naive Bayes models uses the method of maximum likelihood; in other words, one can work
with the naive Bayes model without accepting Bayesian probability or using any Bayesian
methods.
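
As a rough illustration of maximum-likelihood training for one member of this family, the sketch below fits a Gaussian naive Bayes model with NumPy. The Gaussian variant, the function names and the toy data are all assumptions made for illustration; the report does not state which naive Bayes variant was used.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Maximum-likelihood estimates of class priors and per-feature Gaussians."""
    model = {}
    for label in np.unique(y):
        rows = X[y == label]
        model[label] = (
            len(rows) / len(X),        # prior P(class)
            rows.mean(axis=0),         # per-feature mean
            rows.var(axis=0) + 1e-9,   # per-feature variance, padded to avoid zero
        )
    return model

def predict_gaussian_nb(model, x):
    """Pick the class maximising log P(class) + sum_i log P(x_i | class)."""
    def log_posterior(params):
        prior, mean, var = params
        log_likelihood = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
        return np.log(prior) + log_likelihood
    return max(model, key=lambda label: log_posterior(model[label]))

# Made-up two-feature data for two classes.
X = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]])
y = np.array([0, 0, 0, 1, 1, 1])
model = fit_gaussian_nb(X, y)
```

The sum over features in `log_posterior` is where the independence assumption enters: each feature contributes its own term, with no interaction between features.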

2. LOGISTIC REGRESSION CLASSIFIER:

Logistic regression is a regression model where the dependent variable (DV) is categorical.
Logistic regression measures the relationship between the categorical dependent variable and
one or more independent variables by estimating probabilities using a logistic function, which
is the cumulative logistic distribution.

Logistic regression can be seen as a special case of generalized linear model and thus
analogous to linear regression. The model of logistic regression, however, is based on quite
different assumptions (about the relationship between dependent and independent variables)
from those of linear regression. In particular, the key differences between these two models can
be seen in the following two features of logistic regression. First, the conditional distribution
$y \mid x$ is a Bernoulli distribution rather than a Gaussian distribution, because the dependent
variable is binary. Second, the predicted values are probabilities and are therefore restricted to
(0, 1) through the logistic distribution function, because logistic regression predicts the
probability of particular outcomes.
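
The restriction of predictions to (0, 1) can be seen in a few lines of Python. This is a sketch of the binary case only; the weights and features are made-up values and `predict_probability` is a hypothetical helper. How the project extended the model to 24 classes is not stated in the report.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_probability(weights, bias, features):
    """P(y = 1 | x) for a binary logistic regression model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical model parameters and input.
p = predict_probability(weights=[2.0, -1.0], bias=0.5, features=[3.0, 1.0])
```

Whatever the value of the linear term, the output stays strictly between 0 and 1 and can be read as a probability.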

3. K-NEAREST-NEIGHBOURS:

The k-Nearest Neighbours algorithm is a non-parametric method used for classification and
regression. In both cases, the input consists of the k closest training examples in the feature
space. The output depends on whether k-NN is used for classification or regression:

In k-NN classification, the output is a class membership. An object is classified by a majority
vote of its neighbours, with the object being assigned to the class most common among its k
nearest neighbours (k is a positive integer, typically small). If k = 1, then the object is simply
assigned to the class of that single nearest neighbour.

In k-NN regression, the output is the property value for the object. This value is the average
of the values of its k nearest neighbours.

k-NN is a type of instance-based learning, or lazy learning, where the function is only
approximated locally and all computation is deferred until classification. The k-NN algorithm
is among the simplest of all machine learning algorithms.

Both for classification and regression, it can be useful to assign weight to the contributions of
the neighbours, so that the nearer neighbours contribute more to the average than the more
distant ones. For example, a common weighting scheme consists in giving each neighbour a
weight of 1/d, where d is the distance to the neighbour.
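
The 1/d weighting scheme can be sketched as follows, using only the standard library (`math.dist` requires Python 3.8 or later). The function name `knn_classify` and the toy training points are hypothetical.

```python
import math
from collections import defaultdict

def knn_classify(train, query, k=3):
    """train: list of (feature_vector, label). Returns the label with the
    largest 1/d-weighted vote among the k nearest neighbours."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = defaultdict(float)
    for features, label in neighbours:
        d = math.dist(features, query)
        if d == 0:
            return label          # exact match: its class wins outright
        votes[label] += 1.0 / d   # nearer neighbours contribute more
    return max(votes, key=votes.get)

# One 'A' point close to the query, two 'B' points further away.
train = [((0, 0), "A"), ((3, 0), "B"), ((3, 1), "B")]
label = knn_classify(train, (0.5, 0), k=3)
```

Here a plain majority vote over the three neighbours would return "B", whereas the distance weighting returns "A" because the single "A" neighbour is much closer.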

The neighbours are taken from a set of objects for which the class (for k-NN classification) or
the object property value (for k-NN regression) is known. This can be thought of as the
training set for the algorithm, though no explicit training step is required.

A shortcoming of the k-NN algorithm is that it is sensitive to the local structure of the data.
The algorithm has nothing to do with and is not to be confused with k-means, another popular
machine learning technique.

4. SUPPORT VECTOR MACHINE:

Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification
problems. In this algorithm, we plot each data item as a point in n-dimensional space, with the
value of each feature being the value of a particular co-ordinate. Then, we perform classification
by finding the hyper-plane that best separates the two classes.
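
Once a separating hyper-plane has been found, classifying a point only requires checking which side of it the point lies on. The sketch below shows that decision rule for a linear SVM with a hypothetical, hand-picked hyper-plane; it does not show training, and the function names are assumptions.

```python
import math

def decision_value(w, b, x):
    """Signed score w . x + b; its sign tells which side of the hyper-plane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    return 1 if decision_value(w, b, x) >= 0 else -1

def margin_width(w):
    """Distance between the two margin hyper-planes w . x + b = +1 and -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyper-plane x1 + x2 - 3 = 0 in a two-feature space.
w, b = (1.0, 1.0), -3.0
below = classify(w, b, (0.0, 0.0))
above = classify(w, b, (3.0, 3.0))
```

Training an SVM amounts to choosing `w` and `b` so that `margin_width(w)` is as large as possible while the training points stay on the correct side.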

5. CONVOLUTIONAL NEURAL NETWORK:

This method is implemented without extracting the SURF features and gives the highest accuracy
among all the methods used without the SURF feature extraction technique. Convolutional
Neural Networks are very similar to ordinary Neural Networks: they are made up of neurons
that have learnable weights and biases. Each neuron receives some inputs, performs a dot
product and optionally follows it with a non-linearity. The whole network still expresses a
single differentiable score function, from the raw image pixels on one end to class scores at
the other.
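
The "dot product followed by a non-linearity" can be made concrete with a single convolutional filter. The NumPy sketch below is only an illustration of that one operation, with ReLU assumed as the non-linearity and a made-up kernel; it is not the network architecture used in the project, which the report does not describe.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def conv2d_valid(image, kernel, bias=0.0):
    """Slide the kernel over the image; each output is dot(patch, kernel) + bias,
    followed by the ReLU non-linearity."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            patch = image[y:y + kh, x:x + kw]
            out[y, x] = np.sum(patch * kernel) + bias
    return relu(out)

image = np.arange(16.0).reshape(4, 4)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])   # made-up 2x2 weights
feature_map = conv2d_valid(image, kernel, bias=6.0)
```

In a trained network the kernel weights and bias are the learnable parameters, and many such filters are stacked in layers.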

FLOWCHART
RESULTS

Computer Specifications:
The system used for training has the following specifications:

Operating System    Windows 10 Home Single Language
Processor           Intel® Core™ i5-7200U CPU @ 2.50GHz-2.70GHz
RAM                 4.00 GB DDR4
System Type         64-bit Operating System, x64-based processor

Recognition Rate:

[Figure: Recognition rate for each alphabet (A to Y, excluding J and Z), comparing SVM with SURF and CNN without SURF]

The performance of the sign language recognition framework is evaluated for each of the 24 gestures
comprising the sign language alphabet: A, B, C, D, E, F, G, H, I, K, L, M, N, O, P, Q, R, S, T, U,
V, W, X and Y.

A total of 4972 images were used, with 2995 images used for training our system and the remaining
1977 images for testing it.

An overall accuracy of 0.92 was achieved with SVM (including SURF) and an accuracy of 0.78 with
CNN (excluding SURF).

Confusion Matrix:

The confusion matrix gives an idea of which alphabets have greater similarity and are thus prone
to be misclassified.
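
For readers unfamiliar with the layout, the sketch below builds a confusion matrix from true and predicted labels and derives the overall accuracy from it. The labels are made-up illustrative data, not results from the project, and the function names are hypothetical.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix

def accuracy(matrix):
    """Fraction of samples on the diagonal, i.e. correctly classified."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Made-up labels for three alphabets.
classes = ["M", "N", "O"]
truth = ["M", "M", "M", "N", "N", "N", "O", "O"]
predicted = ["M", "N", "N", "N", "M", "N", "O", "O"]
cm = confusion_matrix(truth, predicted, classes)
overall = accuracy(cm)
```

Large off-diagonal entries, such as the M row and N column in this toy example, point to pairs of alphabets that the classifier confuses.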

[Figure: Confusion Matrix of SVM with SURF]

[Figure: Confusion Matrix of CNN without SURF]

Accuracy:

[Figure: Accuracy without SURF (axis 0 to 0.9) for Naive Bayes, Logistic Regression, K nearest neighbours, Support Vector Machine and Convolutional neural networks]

[Figure: Accuracy with SURF (axis 0.76 to 0.94) for K nearest neighbours, Logistic Regression, Naïve Bayes and Support Vector Machine]

CONCLUSION

Through this project, we have shown how a good feature extraction technique, along with some
pre-processing of images, can lead to higher accuracy. Canny edge detection and skin masking
are used to segment the hand gesture from the background captured by the camera.

SURF descriptors were then extracted from the segmented hand gesture and clustered into
150 clusters to improve the accuracy. Bag of Features was employed to get a feature vector of
uniform dimension, and finally various machine learning techniques were used to get the results.

A comparative study was also done by employing machine learning algorithms with and without the
SURF descriptors. The Convolutional Neural Network (CNN) gave an accuracy of 0.78 without the
feature extraction step. Our proposed framework gave an accuracy of 0.92 with SVM, and its
minimum accuracy, 0.80 with Naïve Bayes, is still better than that of the novel framework.

It should also be noted that the test images were captured under extreme conditions, i.e. from
highly illuminated to poorly illuminated backgrounds, along with a wide range of skin colours.
The accuracy achieved is therefore quite high given the kind of dataset used, which comprises
around 5000 images.

It is also observed that sign language alphabets which share great similarity in appearance are
prone to be misclassified, which results in lower accuracy. In our case, the alphabets M and N
were largely misclassified due to their very high similarity.

Overall, this project achieved a high level of accuracy considering the diversity of the dataset
used, and hence may lead to a practical solution for helping the verbally challenged to overcome
the communication barrier.

FUTURE WORK

Our proposed framework can be extended to a dynamic sign language recognition system, i.e. instead
of recognizing saved images, it could recognize signs in a live stream, which could then be
converted to suitable audio as required. This would help the verbally challenged to overcome the
communication gap.

Such a system requires low computation time along with a high level of accuracy. Introducing such
systems on mobile platforms could also be very handy. Future work should also focus on supporting
a wide range of vocabulary, not limited to a particular region.

To achieve the goal of real-time sign language recognition, more advanced algorithms may
replace the current ones so as to improve the accuracy and processing speed of the system, since
processing speed will play a key role in such real-time systems.

Also, the hardware of such systems may be enhanced to achieve efficient and effective levels of
sign recognition, reducing the ambiguity or failures occurring in current systems.

REFERENCES

1. Zaid Omar and Jin, "A Mobile Application of American Sign Language Translation via Image
Processing Algorithms," 2016 IEEE Region 10 Symposium (TENSYMP), Bali, Indonesia.
- Research paper implemented

2. H. Bay, T. Tuytelaars and L. Van Gool, "SURF: Speeded up robust features," In Computer
Vision - ECCV 2006, pp. 404-417. Springer Berlin Heidelberg, 2006.
- SURF technique

3. D.G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal
of Computer Vision, 60(2), pp. 91-110, 2004.
- SIFT and its robustness

4. A. Tharwat, T. Gaber, A.E. Hassanien, M.K. Shahin and B. Refaat, "SIFT-based Arabic sign
language recognition system," In Afro-European Conference for Industrial Advancement (pp.
359-370). Springer International Publishing, 2015.
- Comparison of results with this paper

5. A. Ben-Hur and J. Weston, "A user's guide to support vector machines," Data Mining
Techniques for the Life Sciences, pp. 223-239, 2010.
- Study of data mining techniques and SVM

6. K. Singh and S. Chander, "Content Based Image Retrival Using SURF, SVM and Color
Histogram - A Review," International Journal of Emerging Technology and Advanced
Engineering, Vol. 4, 2250-2459, 2014.
- SURF and histogram formation

7. N.A. Ibraheem and R.Z. Khan, "Vision based gesture recognition using neural networks
approaches: A review," International Journal of Human Computer Interaction (IJHCI), 3(1),
pp. 1-14, 2012.
- Gesture recognition using neural networks

8. http://docs.opencv.org/trunk/da/d22/tutorial_py_canny.html
- Canny edge detection

9. N.H. Dardas and N.D. Georganas, "Real-time hand gesture detection and recognition using
bag-of-features and support vector machine techniques," IEEE Transactions on Instrumentation
and Measurement, 60.
- Bag of Features implementation guide
