
Spam Filter - Machine Learning

Project report submitted in partial fulfillment


of the requirements for the degree of

Bachelor of Technology
in
Electronics and Communication Engineering

by

Ashish Dhameja - 15UEC015


Vishal Jain - 15UEC071
Yash Gupta - 15UEC075

Under Guidance of
Dr. Kshitiz Verma

Department of Electronics and Communication Engineering


The LNM Institute of Information Technology, Jaipur

April 2018
Copyright © The LNMIIT 2017
All Rights Reserved
The LNM Institute of Information Technology
Jaipur, India

CERTIFICATE

This is to certify that the project entitled Spam Filter - Machine Learning, submitted by Ashish Dhameja (15UEC015),
Vishal Jain (15UEC071) and Yash Gupta (15UEC075) in partial fulfillment of the requirements of the degree
of Bachelor of Technology (B.Tech), is a bona fide record of work carried out by them at the Department
of Electronics and Communication Engineering, The LNM Institute of Information Technology, Jaipur
(Rajasthan), India, during the academic session 2016-2017 under my supervision and guidance, and the
same has not been submitted elsewhere for the award of any other degree. In my opinion, this thesis is
of the standard required for the award of the degree of Bachelor of Technology (B.Tech).

Date Adviser: Dr. Kshitiz Verma


Acknowledgments

In the successful accomplishment of this project, many people have bestowed upon us their
wholehearted support and blessings.

First and foremost, we are truly thankful to Dr. Kshitiz Verma, our B.Tech. project mentor, who
guided and helped us throughout these two semesters and was always ready to support us at any time of
the day. His help and support kept us motivated to hold our work to a high standard. We would
also like to thank our friends with prior knowledge of the subject we have been working
on, for solving our basic problems and difficulties and helping keep our project up to the mark.

We are also thankful to the various faculty and staff members of LNMIIT who helped us at various
stages of our project with the information they provided. We are obliged for their
cooperation and support in the completion of our project.

Abstract

With the Short Message Service (SMS) on our phones, we generally tend to separate messages into
two groups: those we intend to receive (ham) and those we do not (spam). The messages on our phones
are typically from friends, or are, for example, OTPs or order-related details. Spam messages, used for
promotional purposes by various companies, can upset clients and even cause services to lose subscribers.
In this project, we apply concepts of machine learning to create an algorithm that categorizes messages
received by the client as spam or ham. The most common technique for building a spam filter is to use
labelled datasets as a reference to train an algorithm that can automatically flag an SMS as spam or ham.

Contents

Chapter Page

1 Spam Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 The Area of Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem Addressed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Proposed Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Work Done . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3 Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3.1 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3.2 Naive Bayes Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3.3 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.4 Decision Tree Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.5 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.5.1 Activation Function . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.4 Visualizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4.2 Libraries Used . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4.3 Data Cleaning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4.3.1 Data Pre-Processing Steps . . . . . . . . . . . . . . . . . . . . . . . 11
2.4.3.2 Representation of Data . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4.3.3 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4.4 Word Cloud for messages . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3 Simulation and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18


3.1 Simulation and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

4 Conclusions and Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19


4.1 SCOPE OF FURTHER WORK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

Chapter 1

Spam Filtering

1.1 The Area of Work


Machine learning is the study of how computer systems can learn from data sets, that is, continuously
improve performance on an explicit task using information, without being explicitly programmed. Machine
learning is closely tied to computational processes that make decisions on the basis of information fed in
as labelled training sets. Machine learning these days is often linked with data mining, where data mining
focuses on exploratory data analysis and falls under unsupervised learning. Machine learning can likewise
be unsupervised: it can learn and build up baseline expectations and predictions for different entities,
which can later be used to find deviations and anomalies.

1.2 Problem Addressed


It is hard to identify exactly who first had the simple idea that if you forward an advertisement to
a mass of people, at least one receiver will respond to it regardless of the proposal. E-mail is the
cheapest way for a sender to spread massive amounts of advertisement, and this unfortunate fact is
nowadays massively abused by various organizations. The major consequence is that mailboxes almost
everywhere get filled with unwanted mass mail, also known as spam or junk mail. Extremely inexpensive
to send, spam creates a lot of trouble for the entire Internet: massive numbers of junk mails cause
delays in the sending and receiving of legitimate e-mails, and the great number of people still using
dial-up networks suffer slower browsing and waste significant bandwidth just loading a mailbox full of
spam. Sorting spam from legitimate mail in the mailbox is quite time consuming and may sometimes
lead to accidental deletion of important mail. Moreover, there are many pornographic or drug-related
spams that are not meant to reach younger audiences. Many methods of eliminating spam have been
proposed. The first is the legal measure proposed in the US, the anti-spam campaign; another is to avoid
reacting to and opening spam, and to never publish personal information such as an e-mail address on
public websites or wherever there is a threat of its misuse. There are other ways to block spam, such as
blocking spammers' IP addresses and e-mail filtering. Automatic spam filtering is the only method that
has succeeded in blocking the majority of spam from landing in our inboxes. As spammers refine their
techniques, our blocking mechanisms must be refined as well. Just a few years ago, almost every spam
had a fixed set of subject words and was easily blocked by e-mail spam filtering; to avoid such easy
identification, spammers flooded their messages with misspelled words (e.g. BUYY, NOWW). The two
approaches widely used for automated spam filtering are knowledge engineering (KE) and machine
learning (ML), the latter of which learns from a growing number of spam examples and becomes more
effective without much human effort. In the former case, explicit rules are specified according to which
mails are classified as SPAM or HAM (legitimate mail). A common example: if the mail contains words
like BUY NOW or BUYY NOWW or similar, it is spam and does not get delivered to the mailbox. The
big disadvantage of this method is that the rules need to be continuously changed and amended in order
to keep our inboxes free of spam. In machine learning there are no explicit rules; instead there are large
numbers of classified mails (called a dataset) from which the machine learns, adapts, and classifies
further mails arriving in the mailbox. The system usually gets better and better with time and new data.
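The knowledge-engineering approach described above can be sketched as a handful of hand-written rules; the keyword list here is purely illustrative, not taken from any real rule set:

```python
# A minimal knowledge-engineering (rule-based) spam filter sketch.
# Real rule sets need constant updating as spammers adapt.
SPAM_KEYWORDS = {"buy now", "buyy noww", "free", "winner"}

def is_spam(message: str) -> bool:
    text = message.lower()
    # Flag the mail as spam if any known spam phrase occurs verbatim.
    return any(keyword in text for keyword in SPAM_KEYWORDS)

print(is_spam("BUY NOW and save big!"))   # flagged as spam
print(is_spam("See you at lunch?"))       # legitimate (ham)
```

The weakness discussed above is visible here: a message containing "B-U-Y N-O-W" slips straight past the verbatim match.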

1.3 System
The system we have used in our project is code written in the Python language. We used some
previously available datasets to train our machine; the framework we used falls under supervised
learning. In supervised learning we have a data set and prior knowledge of what our result should look
like: there is a known relationship between the input and the output. The algorithm keeps learning as
new labelled data is fed in, and stops learning when it reaches an adequate level of performance. In this
code we consider different models, such as logistic regression, the Bayesian classifier, SVM and other
classifiers, then reduce the cost function by regularization and increase the accuracy of the code. The
framework carries out a comparison between these models and plots a graph to illustrate the differences
between the various regression and classifier models. We have also created word cloud images to gain a
better understanding of what separates 'HAM' from 'SPAM' messages. We additionally apply data
cleaning, which does the necessary work of removing articles, pronouns and so on from the messages,
in order to obtain clean words and make it simpler and more effective to classify the messages.

Figure 1.1

Chapter 2

Proposed Work

2.1 Introduction

The report comprises solutions for spam filters by Group 17. We follow some machine learning
fundamentals in order to solve our selected spam filtering problem. Reading e-mail nowadays is a habit:
messages are the most convenient way to send important information quickly and safely, which makes
them a favourite in both professional and personal correspondence. There are, however, many obscure
sources intentionally sending promotional e-mails or e-mails with other motives. By some estimates,
over 60% or even 90% of e-mails are of this kind, and they are frequently unlawful. These are the kind
of messages we have nightmares about. These messages are classified as spam.

2.2 Work Done

We have used several machine learning algorithms to solve our problem. The main aim is to reduce
the number of spam messages reaching one's phone by creating an algorithm using machine learning
techniques. The techniques we have used are the logistic regression model, naive Bayes, the decision
tree classifier, neural networks, and the support vector machine.

2.3 Learning Algorithms

2.3.1 Logistic Regression

We have used random messages from the NUS SMS Corpus (NSC) as the dataset for the problem. We
have also added some messages manually from the Grumbletext Web site, for a total of 5574 messages.
Some of the messages are pre-tagged as spam or ham for testing purposes and others are not, so as to
obtain actual results from the algorithm. We trained the data with the logistic regression model to
classify between ham and spam messages.

The class we have used implements regularized logistic regression using the liblinear library and the
newton-cg, sag and lbfgs solvers. It can handle both dense and sparse input. For a binary classification
logistic regression model, the output is squashed by a sigmoid function, i.e.

h_w(x) = σ(w^T x) = exp(w^T x) / (exp(w^T x) + 1)

w^T x = w_0 + w_1 x_1 + ... + w_d x_d

The probability that y is true given the input x is:

P(y = 1 | x) = h_w(x)

P(y = 0 | x) = 1 − P(y = 1 | x)

The loss function or the cost function we use for logistic regression is the negative log-likelihood
function.

L(w) = −Σ_i [ y^(i) log(h_w(x^(i))) + (1 − y^(i)) log(1 − h_w(x^(i))) ]

It is the sum over the training data (the i-th data point is x^(i), y^(i)) of the negative log probability
assigned to the target class. We have used the gradient descent optimization technique. The loss function
was chosen with useful gradients in mind; the log used in the loss/cost function also undoes the
sigmoid function.
We estimated the logistic regression coefficients using maximum likelihood estimation. Unlike linear
regression with normally distributed residuals, there is no closed-form expression for the coefficient
values that maximize the likelihood function, so an iterative process must be used instead. The process
begins with a tentative solution and revises it repeatedly until no further improvement can be made, at
which point the logistic regression coefficients have converged to their maximum likelihood estimates.
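The iterative fit above can be sketched with plain gradient descent on the negative log-likelihood; the one-feature toy data and learning rate below are illustrative, not the project's:

```python
import math

# Logistic regression trained by gradient descent on the negative
# log-likelihood; one feature per message (e.g. a count of spammy words).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0, 0, 1, 1]            # 0 = ham, 1 = spam

w0, w1 = 0.0, 0.0            # coefficients: bias w_0 and weight w_1

def h(x):
    """Sigmoid of the linear score w^T x."""
    z = w0 + w1 * x
    return math.exp(z) / (math.exp(z) + 1)

for _ in range(2000):                                    # gradient descent steps
    g0 = sum(h(x) - y for x, y in zip(xs, ys))           # dL/dw0
    g1 = sum((h(x) - y) * x for x, y in zip(xs, ys))     # dL/dw1
    w0 -= 0.1 * g0
    w1 -= 0.1 * g1

print(h(0.0), h(3.0))   # low probability for the ham-like input, high for spam-like
```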

Figure 2.1 Sigmoid function

2.3.2 Naive Bayes Classifier

Naive Bayes classifiers are a collection of classification algorithms based on Bayes' theorem. It is not
a single algorithm but a family of algorithms that share a common principle: every pair of features being
classified is assumed independent of the others.
In simple terms, a naive Bayes classifier weighs the presence of a particular feature in a class
independently of any other feature's presence. For example, if a fruit is red and has a diameter of around
3 inches, it may be considered an apple. Even if these features depend on each other, each one
contributes separately to the probability of the fruit being an apple, and this is the reason the method is
called 'naive'.
The naive Bayes algorithm is extremely useful for large datasets. It is simple to use, and is well
known to outperform even highly sophisticated classification methods in many cases.

P(c|x) = P(x|c) P(c) / P(x)

where:
P(c|x) is the posterior probability of class c (the target) given predictor x (the attributes),
P(c) is the prior probability of the class,
P(x|c) is the likelihood, the probability of the predictor given the class,
P(x) is the prior probability of the predictor.

If we want to design and develop a Bayesian classifier for spam detection, we have to determine, in
some way, the probabilities P(x | c) and P(c) for any x and c. For practical reasons we cannot know them
exactly; rather, we estimate them from the data. For example, P(S) can be approximated by the proportion
of spam messages in the training data. Estimating P(x | c) is harder and depends on how the feature vector
x is chosen for the message m. We take the simplest case, where a component of the feature vector
indicates the presence of a particular word w in the given message: if we characterize an incoming
message with the feature x_w, we get 1 if the word is present in the message and 0 if it isn't.
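Estimating P(x_w = 1 | c) from counts, as just described, can be sketched directly; the tiny labelled corpus below is illustrative, not the project's dataset:

```python
# Estimate the naive Bayes word-presence likelihood from labelled messages.
messages = [
    ("win a free prize now", "spam"),
    ("free entry in a contest", "spam"),
    ("are we meeting for lunch", "ham"),
    ("call me when you are free", "ham"),
]

def presence_probability(word, label):
    """Estimate P(x_w = 1 | c = label) by counting messages containing w."""
    in_class = [text for text, lab in messages if lab == label]
    containing = sum(1 for text in in_class if word in text.split())
    return containing / len(in_class)

print(presence_probability("free", "spam"))  # 2 of 2 spam messages contain "free"
print(presence_probability("free", "ham"))   # 1 of 2 ham messages contain "free"
```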

2.3.3 Support Vector Machine

The support vector machine, or SVM, is a classification algorithm that seeks to maximize the margin
between the positive and negative classes. For instance, consider our spam-or-ham classification problem
in a model comprising a hundred features. In order to maximize the margin between the positive and
negative classes, an SVM using a kernel function could internally map those features into a
million-dimensional space.
In this classification, we plot each item of the data set as a point in an n-dimensional space (n =
number of features), where the value of each coordinate is the value of a particular feature. Classification
is then performed by finding the hyperplane that best separates the two classes. Support vectors are
simply the coordinates of the individual observations closest to that separating boundary.
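A minimal SVM fit can be sketched with scikit-learn, which the project's Python stack is assumed to provide; the 2-D points below are illustrative:

```python
# Margin-maximizing classification with a linear-kernel SVM.
from sklearn.svm import SVC

X = [[0, 0], [1, 1], [4, 5], [5, 4]]   # two well-separated clusters of points
y = [0, 0, 1, 1]                       # 0 = ham, 1 = spam

clf = SVC(kernel="linear")             # maximize the margin with a linear kernel
clf.fit(X, y)

print(clf.predict([[0.5, 0.5], [5, 5]]))
```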

2.3.4 Decision Tree Classifier

A decision tree is a supervised learning algorithm, i.e. an algorithm with a predefined target variable,
and it is one of the most widely used algorithms for classification problems. It is suitable for both types
of input variables, categorical and continuous. It divides the data set into smaller and smaller subsets
while incrementally building an associated decision tree; the end result is a tree with decision nodes and
leaves that partitions the feature space into regions. A decision tree can be easily interpreted by humans.

2.3.5 Neural Networks

A neural network is a classification model loosely modelled after the human brain and built with the
intention of recognizing patterns. It interprets sensory data through a kind of machine perception,
labelling or grouping raw input. It recognizes numerical patterns contained in vectors, into which any
real-world data such as images, sound, text or time series may be translated.
Neural networks help us label and group; we can think of them as a labelling or grouping layer sitting
on top of the data we store and manage. A neural network takes various inputs, processes them via
multiple neurons arranged in multiple hidden layers, and returns the result through an output layer. This
process is termed "forward propagation".
We then compare the result with the actual output. The goal is to make the approximated result as
close as possible to the actual result. But how do we reduce the error?
In order to reduce the error, we adjust the weights of the neurons that contribute most to the error,
travelling back through the network and checking where the error arises. This process is termed
"backward propagation".

Figure 2.2 Maximum-margin hyperplane and margins for an SVM trained with samples from two classes.
Samples on the margin are known as the support vectors.

2.3.5.1 Activation Function

The activation function takes the weighted sum of the inputs (w1*x1 + w2*x2 + w3*x3 + 1*b) as an
argument and gives the output of the neuron. The activation function is generally used to make a
non-linear transformation, which enables us to fit non-linear hypotheses and estimate complex functions.
There are various activation functions, such as sigmoid, tanh, ReLU and many others.
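These activation functions, and a single neuron applying one to its weighted input, can be sketched with the standard library:

```python
import math

# Common activation functions used by neural networks.
def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def tanh(z):
    return math.tanh(z)

def relu(z):
    return max(0.0, z)

# A single neuron: weighted sum of inputs plus bias, squashed by sigmoid.
def neuron(xs, ws, b):
    z = sum(w * x for w, x in zip(ws, xs)) + b
    return sigmoid(z)

print(sigmoid(0))             # 0.5 at the midpoint
print(relu(-2.0), relu(3.0))  # negative inputs are clipped to 0
```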

Figure 2.3 Decision Trees

Figure 2.4 Combining adjustable weights with input features is how we assign significance to those
features with respect to how the network classifies and clusters input.

Figure 2.5 Types of Activation Functions

2.4 Visualizations

2.4.1 Dataset

The SMS Spam Collection is the dataset we took from Kaggle, which contains tagged SMS messages.
It contains 5574 English SMS messages, tagged as ham (legitimate) or spam. The data set contains one
complete SMS per line, each line having just two columns: v1 contains the label (ham or spam) and v2
contains the raw text. A collection of 425 SMS spam messages was manually extracted from the
Grumbletext Web site.

2.4.2 Libraries Used

Figure 2.6 System Libraries

2.4.3 Data Cleaning

Data cleaning is the process of detecting and removing (or correcting) inaccurate records from a
dataset, table, or database. It involves identifying incomplete, incorrect, inaccurate or irrelevant parts of
the data and then replacing, modifying, or deleting the dirty or coarse data. Data cleaning may be
performed as batch processing through scripting, or interactively with data wrangling tools.

2.4.3.1 Data Pre-Processing Steps

In spam filtering, the pre-processing of the textual data is absolutely basic and essential. The primary
goal of text pre-processing is to remove data that gives no useful information about the class of the
document; moreover, we also need to remove data that is redundant. The most widely used data cleaning
steps in text retrieval tasks are removing stop words and performing stemming to reduce the vocabulary.
In addition to these two steps, we also removed words of length less than or equal to two.
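The three pre-processing steps above can be sketched in plain Python; the stop-word list and the crude suffix-stripping "stemmer" are illustrative stand-ins, not the project's actual implementation:

```python
# Pre-processing sketch: stop-word removal, crude stemming, and dropping
# words of length <= 2. Stop words and suffix rules are illustrative only.
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "and", "of", "in", "for"}

def crude_stem(word):
    # Strip a few common suffixes to shrink the vocabulary.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(message):
    words = message.lower().split()
    words = [w for w in words if w not in STOP_WORDS]   # remove stop words
    words = [crude_stem(w) for w in words]              # reduce vocabulary
    return [w for w in words if len(w) > 2]             # drop very short words

print(preprocess("Winning a FREE prize is waiting for you"))
```

A real implementation would typically use an established stemmer (e.g. the Porter stemmer) and a full stop-word list.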

Figure 2.7 Data Pre-processing

2.4.3.2 Representation of Data

The next primary task is the representation of the data. This step is needed because it is very hard to
do computations directly with textual data. The representation must reflect the actual content of the text:
the raw textual records are transformed into suitable numbers. Furthermore, the representation must
facilitate the classification task and be simple enough to implement.
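A common representation of this kind is the bag-of-words vector, where each message becomes a vector of word counts over a shared vocabulary; a minimal sketch with illustrative messages:

```python
# Bag-of-words sketch: each message becomes a count vector over the vocabulary.
messages = ["free prize waiting", "call me for lunch", "free lunch today"]

# Build the vocabulary: one dimension per distinct word, in sorted order.
vocab = sorted({w for m in messages for w in m.split()})

def to_vector(message):
    words = message.split()
    return [words.count(word) for word in vocab]

print(vocab)
print(to_vector("free free lunch"))
```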

2.4.3.3 Classification

In simple terms, classification is the process of learning the patterns that exist in a given data set with
the help of previously known instances, and associating those data patterns with classes. Later, if
unknown data is given, the classifier searches for similar patterns from previous instances and then picks
a class based on the presence or absence of a similar data pattern.

Pseudo code:

Figure 2.8 Data Cleaning

Figure 2.9 Output of Data Cleaning

2.4.4 Word Cloud for messages


Words like 'discount', 'free' and 'congrats' are great indicators of spam and have large ('malicious')
weights. Early spam filters based on heuristic filtering could easily detect and filter spam messages
based on the presence of such obvious words. (The bigger a word appears in the cloud, the more
frequently it has been found to occur in spam.) Spammers adapted rapidly by making sure such obvious
words are not encountered verbatim in their messages. To defeat the filters, they resorted to simple
obfuscation techniques, such as:

breaking the word into multiple pieces: f-r-e-e
inserting special characters (e.g. an HTML comment): fr<!--xx-->ee
hiding the word inside HTML markup: <a href="...">free</a>
encoding characters with HTML ASCII codes: fr&#101;e
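A word cloud sizes each word by its frequency; the counting underneath can be sketched with the standard library (the spam messages below are illustrative):

```python
from collections import Counter

# Word clouds render the most frequent words largest; this is the count step.
spam_messages = [
    "free prize congrats you won",
    "claim your free discount now",
    "free discount offer expires",
]

counts = Counter(w for m in spam_messages for w in m.split())
print(counts.most_common(3))   # the words that would render largest
```

Libraries such as `wordcloud` take exactly these frequencies and draw the image.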

Figure 2.10

Figure 2.11 The above image is the generated word cloud for 'SPAM'.

Figure 2.12 The above image is the word cloud for 'HAM', generated with the mentioned code.

Pseudo Code:

Figure 2.13

Chapter 3

Simulation and Results

3.1 Simulation and Results


1. First we took labelled data of approximately 5500 messages of all types.
2. We then split the data set into a train set and a test set.
3. We then sorted the messages as spam and ham.
4. We generated word clouds for both ham and spam messages.
5. We performed data cleaning.
6. We applied machine learning algorithms: regression, the naive Bayes classifier, the decision tree
classifier, and support vector machines.
7. We analyzed the accuracy and results of the models used.
8. We compared the results and accuracies obtained by the various regression and classification
methods and generated a graph.
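The comparison loop in the steps above can be sketched with scikit-learn (assumed available, as in the project's stack); the tiny corpus is illustrative, and the score here is training accuracy purely for illustration, whereas the real evaluation uses the held-out test split:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC

texts = ["free prize now", "win a free offer", "lunch at noon",
         "see you at home", "claim free cash", "meeting moved to monday"]
labels = [1, 1, 0, 0, 1, 0]                   # 1 = spam, 0 = ham

X = CountVectorizer().fit_transform(texts)    # bag-of-words representation

models = {
    "logistic regression": LogisticRegression(),
    "naive bayes": MultinomialNB(),
    "decision tree": DecisionTreeClassifier(),
    "svm": LinearSVC(),
}
scores = {}
for name, model in models.items():
    model.fit(X, labels)
    scores[name] = model.score(X, labels)     # accuracy of each fitted model
    print(name, scores[name])
```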

The figure below shows the distribution of spam and non-spam messages.

Figure 3.1 Message Distribution

Chapter 4

Conclusions and Future Works

We drew effective conclusions about the efficiency of the various approaches used to recognize spam
messages by implementing the approaches and comparing their respective accuracies. Currently deployed
anti-spam frameworks couple several machine learning strategies for text classification; SpamAssassin,
for instance, uses genetic programming to generate its Bayesian classifier for each release. Text
classification systems such as Bayesian classifiers and neural networks offer a good theoretical and
practical foundation to fight the problem of spam. Nonetheless, two obstacles stand against such
relatively simple approaches. First, the definition of unsolicited e-mail varies from one person to the
next: a generalized classification can penalize users who are genuinely interested in some products
advertised electronically. The second drawback is that a mail can be something other than simple
text.

4.1 SCOPE OF FURTHER WORK


In the future we wish to build on what we have learned and improve our accuracy by using the
K-nearest-neighbour (KNN) classifier, a learning algorithm that can be used for both classification and
regression. When used in the classification setting, the algorithm predicts the class of an unlabeled
example as the majority class among its k nearest neighbours in the vector space. In the regression
setting, the label of an unlabeled example is computed as an average of the labels of its k nearest
neighbours. The distance from one example to another is typically given by a similarity metric.
We further aim to add features and plan to use the random forest classifier, which is built on the
decision tree classifier. The random forest algorithm introduces additional randomness into the model
when growing the trees: instead of searching for the best feature when splitting a node, it searches for
the best feature among a random subset of features. This creates wide diversity, which generally results
in a better model. In view of these text classification strategies, we can conclude that no strategy can
promise a perfect result of 0% false positives and 0% false negatives. There is plenty of scope for
research in classifying text messages as well as multimedia messages.
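The proposed KNN classifier can be sketched in plain Python; the 2-D feature vectors and labels below are illustrative:

```python
from collections import Counter
import math

# K-nearest-neighbour classification: predict the majority label among
# the k training points closest to the query.
train = [([0.0, 0.0], "ham"), ([1.0, 0.0], "ham"),
         ([5.0, 5.0], "spam"), ([6.0, 5.0], "spam"), ([5.0, 6.0], "spam")]

def knn_predict(x, k=3):
    # Sort training points by Euclidean distance to x, take the k closest,
    # and return the majority vote among their labels.
    nearest = sorted(train, key=lambda item: math.dist(x, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict([0.5, 0.5]))   # the nearest neighbours are mostly ham
print(knn_predict([5.5, 5.5]))   # the nearest neighbours are mostly spam
```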
