International Journal on Recent and Innovation Trends in Computing and Communication
Volume: 2, Issue: 12, ISSN: 2321-8169, pp. 4205-4209

A Machine Learning Approach for Detection of Phished Websites Using Neural Networks

Charmi J. Chandan, Hiral P. Chheda, Disha M. Gosar, Hetal R. Shah, Prof. Uday Bhave
charmichandan@yahoo.in, chheda.hiral@yahoo.com, disha6gosar@gmail.com, shahhetalnov@gmail.com

Abstract: Phishing is a means of obtaining confidential information through fraudulent websites that appear to be legitimate. Detecting all the relevant criteria involves ambiguities and subjective considerations, so neural network techniques are used to build an effective tool for identifying phished websites. Many phishing detection techniques are available, but a central problem is that web browsers rely on a blacklist of known phishing websites, while some phishing websites have a lifespan as short as a few hours. Websites with such a short lifespan are known as zero-day phishing websites. Thus, a faster recognition system needs to be developed so that the web browser can identify zero-day phishing websites. To develop such a system, a neural network technique is used, which reduces the error and increases the performance. This paper describes a framework to better classify and predict phishing sites.

I. INTRODUCTION

Phishing is a type of online fraud in which a scam artist uses an e-mail or website to illicitly obtain confidential information. It is a semantic attack which targets the user rather than the computer, and it is a relatively new internet crime. The phishing problem is hard because it is very easy for an attacker to create an exact replica of a good banking website that looks very convincing to users. The communication (usually e-mail) directs the user to visit a website where they are asked to update personal information, such as passwords and credit card, social security, and bank account numbers, that the legitimate organization already has.
There are some characteristics in the webpage source code that distinguish a phishing website from a legitimate one, so we can detect phishing attacks by checking the webpage and searching for these characteristics in the source code file, if it exists. In this paper, we propose a heuristic-based approach for phishing detection. This approach checks one or more characteristics of a website to detect phishing rather than looking up a blacklist. These characteristics can be the uniform resource locator (URL), the hypertext mark-up language (HTML) code, or the page content itself. Most of the heuristics are targeted at the HTML source code. We extract some phishing characteristics and check for each of them in the webpage source code; whenever we find a phishing characteristic, we decrease the initial secure weight. Finally, we calculate the security rating based on the final weight: a high rating indicates a secure website, while a low rating indicates that the website is most likely a phishing website.
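To make the weight-decrement idea above concrete, the following is a minimal sketch of the scoring step. The characteristic names, penalty values, and classification threshold are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch of the weight-decrement scoring described above.
# The penalty values and the threshold are illustrative assumptions only.

INITIAL_SECURE_WEIGHT = 100

# Hypothetical penalties subtracted once per detected phishing characteristic.
PENALTIES = {
    "https_in_body": 20,
    "external_image": 15,
    "suspicious_url": 25,
    "external_domain": 20,
    "hidden_iframe": 20,
}

def security_rating(detected_characteristics):
    """Start from the initial secure weight and decrease it for every
    phishing characteristic found in the page; return the final rating."""
    rating = INITIAL_SECURE_WEIGHT
    for name in detected_characteristics:
        rating -= PENALTIES.get(name, 0)
    return max(rating, 0)

def classify(rating, threshold=60):
    """A high rating indicates a secure website; a low one suggests phishing."""
    return "legitimate" if rating >= threshold else "phishing"

if __name__ == "__main__":
    found = ["https_in_body", "external_image"]   # e.g. results of a source-code scan
    score = security_rating(found)
    print(score, classify(score))                 # 65 legitimate
```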
The goal of this project is to apply multilayer neural networks to phishing websites and evaluate the effectiveness of this approach. We design the feature set, process the phishing dataset, and implement the neural network systems. We then use cross-validation to evaluate the performance of the neural network with different numbers of hidden units and activation functions. We also compare the performance of the neural network with other major machine learning algorithms. From the statistical analysis, we conclude that a neural network with an appropriate number of hidden units can achieve satisfactory accuracy even when training examples are scarce. Moreover, our feature selection is effective in capturing the characteristics of phishing websites, as most machine learning algorithms can yield reasonable results with it.
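As a rough sketch of the kind of evaluation described above, cross-validation over different hidden-layer sizes and activation functions can be written with scikit-learn's MLPClassifier. The feature matrix, label vector, hidden-unit counts, and fold count below are placeholders, not the paper's actual configuration.

```python
# Sketch of cross-validating a multilayer network over different numbers of
# hidden units and activation functions (placeholder data and settings).
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((200, 27))            # 27 features per website (placeholder values)
y = rng.integers(0, 2, size=200)     # 1 = phishing, 0 = legitimate (placeholder labels)

for hidden in (5, 10, 20):
    for activation in ("logistic", "tanh", "relu"):
        clf = MLPClassifier(hidden_layer_sizes=(hidden,),
                            activation=activation,
                            max_iter=1000, random_state=0)
        scores = cross_val_score(clf, X, y, cv=5)        # 5-fold cross-validation
        print(f"hidden={hidden:2d} activation={activation:8s} "
              f"accuracy={scores.mean():.3f}")
```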
II. PHISHING SCAMS
There are many ways in which someone can use phishing for social engineering. For example, an attacker can manipulate a website address to make it look like you are going to a legitimate website, when in fact you are going to a website hosted by a criminal.
The process of phishing involves five steps, namely planning, setup, attack, collection, and identity theft and fraud. During the planning stage the phishers decide which business to target and determine how to get e-mail addresses for the customers of that business. They often use the same mass-mailing and address collection techniques as spammers. In the setup stage, once they know which business to spoof and who their victims are, the phishers create methods for delivering the message and collecting the data. Most often, this involves e-mail addresses and a web page. The attack stage is the step people are most familiar with: the phisher sends a phony message that appears to be from a reputable source. The collection stage is the one in which phishers record the information entered by victims into web pages or popup windows. The final stage is identity theft and fraud, where the phishers use the information they have gathered to make illegal purchases
or otherwise commit fraud. As many as a fourth of the victims never fully recover.
If the phisher wants to coordinate another attack, he evaluates the successes and failures of the completed scam and begins the cycle again. Phishing scams take advantage of software and security weaknesses on both the client and server sides.
III. PHISHING CHARACTERISTICS AND INDICATORS

Phishing is prevalent nowadays. The phishing problem is a hard problem because it is very easy for an attacker to create an exact replica of a good legitimate site, which looks very convincing to users. Based on the case studies conducted, 7 features and indicators were gathered and clustered into six criteria [1]. Those six criteria are URL & domain identity, Security & encryption, Source code & JavaScript, Page style & contents, Web address bar, and Social human factor.

TABLE 1: PHISHING INDICATORS WITH THEIR CRITERIA

IV. PHISHING WEBSITES DETECTION METHODOLOGY

A. The phishing characteristics:
Phishers use some tricks to fool users, so our approach is to check for these tricks and factors in the webpage source code, calculate a security value, and compare that value with a pre-defined value based on these factors in order to classify the webpage as secure or not. The characteristics are:
1) Https: HTTPS is the secure protocol that indicates a website is secured, but it should appear in the URL of the website, not in the body, i.e. the source of the webpage. Phishers place https inside their source code file to suggest that an image or a link is secured when it is not. A normal page should look like <img src="mona.png" />, but some phishers reference the SSL certificate in the source code like <img src="https://www.xx.com/mona.png" />. They use https to make us think it is a secured website, but it is not. Many similar phishing attacks use a certificate that can be expected to trigger a browser warning.
2) Images: All images in the website, including the website logo, should load from the same URL as the website and not from another website, so all links should be internal links, not external links. Therefore, we check the links to detect any external links inside the source code; for example, <img src="https://www.Phishers.com/logo.jpg"> is a phishing characteristic.
3) Suspicious URLs: Most phishers use an IP address instead of the actual domain name. Others use @ marks to obscure their host names. For example: http://192.185.74.105/~verify/user-verfication.
4) Domain: This check concerns external domains: if we log on to a website whose name is www.paypal.com and we find URLs of links in the source code such as www.pay-pal.com, which is not the source URL, it means that this website is trying to steal our information. Phishers also use domain forwarding, also called domain redirection, a technique on the World Wide Web for making a webpage available under many URLs.
5) iframe: The iframe is an HTML tag used to embed another webpage into the current webpage. It creates a frame or window on a webpage so that another page can load inside this frame. Phishers use the iframe and make it invisible, i.e. without frame borders, so that when the user visits the website he or she cannot tell that another page is also loading in the iframe window. It is a big problem that most people do not know about; it is like a small website opening inside the current webpage. For example, www.google.com can be opened inside www.mona.com by using an iframe, so when people enter our website they will see the secured website opened, even though it is not the page itself but is loaded through the iframe. Example: http://www.phisher.com/index.php?search="'><iframe src=http://google.com ></iframe> (replace http://google.com with the phishing page).
V. NEURAL NETWORKS AND PHISHING PREDICTION

We are going to utilize neural network techniques in our new phishing website detection model, as shown in Table 1, to find the most important phishing features and significant patterns of phishing characteristics or factors in the phishing website archive data. Each indicator takes one of the input values genuine, doubtful, or legitimate, and these are given numeric values, ranging between 0 and 1, for comparison. Using these values, rules are formed and the network is trained to give an output that ranges over very legitimate, legitimate, suspicious, phishy, and very phishy, which is then compared with a pre-defined error rate.
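A small sketch of how the indicator inputs and the five output labels described above might be encoded numerically is given below. The specific numeric mapping and the cut-off points are assumptions for illustration; the paper only names the levels and labels.

```python
# Illustrative encoding of indicator inputs and the five-way output label
# (the numeric mapping and cut-offs are assumptions, not the paper's values).

# Each indicator level is given a value between 0 and 1 (assumed mapping).
INDICATOR_VALUES = {"genuine": 0.0, "doubtful": 0.5, "legitimate": 1.0}

OUTPUT_LABELS = ["very legitimate", "legitimate", "suspicious", "phishy", "very phishy"]

def encode_indicators(indicators):
    """Map a dict of indicator names to their numeric values in [0, 1]."""
    return [INDICATOR_VALUES[v] for v in indicators.values()]

def label_from_output(score):
    """Map a network output in [0, 1] to one of the five labels."""
    bins = [0.2, 0.4, 0.6, 0.8]                  # assumed cut-off points
    for i, edge in enumerate(bins):
        if score < edge:
            return OUTPUT_LABELS[i]
    return OUTPUT_LABELS[-1]

if __name__ == "__main__":
    sample = {"url_identity": "doubtful", "security_encryption": "genuine",
              "source_code": "legitimate"}
    print(encode_indicators(sample))              # [0.5, 0.0, 1.0]
    print(label_from_output(0.75))                # 'phishy'
```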
As the error rate decreases, the accuracy of detection of phished websites increases. Hence, the error rate to which the final value is compared is minimized during training.
An artificial neural network (ANN), usually called a neural network (NN), is a mathematical or computational model that is inspired by the structure and/or functional aspects of biological neural networks. A neural network consists of an interconnected group of artificial neurons, and it processes information using a connectionist approach to computation. In most cases an ANN is an adaptive system that changes its structure based on external or internal information that flows through the network during the learning phase. The most interesting feature of neural networks is the possibility of learning, which trains the network based upon the input data. The network learns when examples with known results are presented to it. The weighting factors are adjusted by an algorithm to bring the final output closer to the known result.
Each unit in a neural network performs a simple computation: it receives signals from its input links and computes a new activation level that it sends along each of its output links. The computation of the activation level is based on the values of each input signal received from a neighbouring node and the weights on each input link. The computation is split into two components. First is a linear component, called the input function, in_i, that computes the weighted sum of the unit's input values. Second is a nonlinear component called the activation function, g, that transforms the weighted sum into the final value that serves as the unit's activation value, a_i. The total weighted input is the sum of the input activations times their respective weights:

in_i = Σ_j W_{j,i} a_j,   a_i = g(in_i)
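The unit computation just described (a weighted sum followed by an activation function) can be written compactly. The sketch below uses a sigmoid for g, which is an assumption, since the paper does not fix a particular activation function here.

```python
# One unit's computation: weighted input in_i = sum_j W[j] * a[j], then a_i = g(in_i).
import math

def g(x):
    """Sigmoid activation function (assumed choice)."""
    return 1.0 / (1.0 + math.exp(-x))

def unit_activation(weights, inputs):
    """weights[j] is W_{j,i} on the link from input j to this unit i."""
    in_i = sum(w * a for w, a in zip(weights, inputs))   # linear input function
    return g(in_i)                                       # nonlinear activation

if __name__ == "__main__":
    print(unit_activation([0.4, -0.2, 0.1], [1.0, 0.5, 1.0]))   # g(0.4) ≈ 0.599
```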

Most neural network learning algorithms follow the current-best-hypothesis approach. In this case, the hypothesis is a network, defined by the current values of the weights. The initial network has randomly assigned weights, usually from the range [-0.5, 0.5]. The network is then updated to try to make it consistent with the examples. This is done by making small adjustments in the weights to reduce the difference between the observed and predicted values. Typically, the updating process is divided into epochs. Each epoch involves updating all the weights for all the examples. For perceptrons, the weight update rule is particularly simple. If the predicted output for the single output unit is O, and the correct output should be T, then the error is given by

Err = T - O

If the error is positive, then we need to increase O; if it is negative, we need to decrease O. Now each input unit contributes W_j a_j to the total input, so if a_j is positive, an increase in W_j will tend to increase O, and if a_j is negative, an increase in W_j will tend to decrease O. Thus, we can achieve the effect we want with the following rule:

W_j ← W_j + α × a_j × Err

where the term α is a constant called the learning rate.
In multilayer networks, there are many weights connecting each input to an output, and each of these weights contributes to more than one output. The back-propagation algorithm is a sensible approach to dividing the contribution of each weight. At the output layer, the weight update rule is similar to the perceptron rule, with two differences: the activation of the hidden unit, a_j, is used instead of the input value, and the rule contains a term for the gradient of the activation function. If Err_i is the error (T_i - O_i) at the output node, then the weight update rule for the link from unit j to unit i is

W_{j,i} ← W_{j,i} + α × a_j × Err_i × g'(in_i)

where g' is the derivative of the activation function g. We will find it convenient to define a new error term Δ_i, which for output nodes is defined as Δ_i = Err_i g'(in_i). The update rule then becomes

W_{j,i} ← W_{j,i} + α × a_j × Δ_i

For updating the connections between the input units and the hidden units, we need to define a quantity analogous to the error term for output nodes. Here is where we do the error back-propagation. The hidden node j is "responsible" for some fraction of the error Δ_i in each of the output nodes to which it connects. Thus, the Δ_i values are divided according to the strength of the connection between the hidden node and the output node, and propagated back to provide the Δ_j values for the hidden layer. The propagation rule for the Δ values is the following:

Δ_j = g'(in_j) Σ_i W_{j,i} Δ_i

Now the weight update rule for the weights between the inputs and the hidden layer is almost identical to the update rule for the output layer:

W_{k,j} ← W_{k,j} + α × a_k × Δ_j

The algorithm given below explains how the weight update rule works for the output layer and the hidden layer.
The algorithm takes a network, a set of examples, and the learning rate α as inputs.
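The algorithm listing itself is not reproduced in the extracted text, so the following is a minimal sketch of one training epoch of back-propagation for a single-hidden-layer network, following the update rules above. The network sizes, the sigmoid activation, and the data are assumptions for illustration.

```python
# Minimal back-propagation sketch for one hidden layer, following the update
# rules above (network sizes, sigmoid activation, and data are assumptions).
import numpy as np

rng = np.random.default_rng(0)

def g(x):            # sigmoid activation
    return 1.0 / (1.0 + np.exp(-x))

def g_prime(x):      # derivative of the sigmoid
    s = g(x)
    return s * (1.0 - s)

n_in, n_hidden, n_out, alpha = 7, 5, 1, 0.1
W_kj = rng.uniform(-0.5, 0.5, (n_in, n_hidden))    # input -> hidden weights
W_ji = rng.uniform(-0.5, 0.5, (n_hidden, n_out))   # hidden -> output weights

def train_epoch(examples):
    """One epoch: update all weights for every (a_k, T) example."""
    global W_kj, W_ji
    for a_k, T in examples:
        # forward pass
        in_j = a_k @ W_kj          # weighted input of hidden units
        a_j = g(in_j)              # hidden activations
        in_i = a_j @ W_ji          # weighted input of output units
        O = g(in_i)                # network output

        # backward pass
        delta_i = (T - O) * g_prime(in_i)              # Delta_i = Err_i * g'(in_i)
        delta_j = g_prime(in_j) * (W_ji @ delta_i)     # Delta_j = g'(in_j) * sum_i W_ji Delta_i

        # weight updates
        W_ji += alpha * np.outer(a_j, delta_i)         # W_ji <- W_ji + alpha * a_j * Delta_i
        W_kj += alpha * np.outer(a_k, delta_j)         # W_kj <- W_kj + alpha * a_k * Delta_j

if __name__ == "__main__":
    data = [(rng.random(n_in), np.array([1.0])), (rng.random(n_in), np.array([0.0]))]
    for _ in range(100):
        train_epoch(data)
```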

This algorithm explains how the weights are updated in a multilayer network. Initially, all the phishing website details are collected and stored in the phishing website archive. The archive is then sent to a preprocessor to convert it into a machine-understandable format, and the result is stored as records in the database. The database also stores configuration parameters (the 7 phishing indicators that are extracted from the code). Using the data collected in the database, rules are generated to detect the website phishing rate using neural network techniques. For each indicator an appropriate value is assigned, and after applying the algorithm the final value is calculated. Once the neural network has been created, it needs to be trained with the existing data in the archive. One way of doing this is to initialize the neural net with random weights and then feed it a series of inputs. We then check what its output is and adjust the weights accordingly, so that whenever it sees something resembling the existing data it outputs the same result as that data.
For our implementation we plan to use two publicly available datasets. The first is PhishTank, from phishtank.com. The PhishTank database records the URL of each suspected website that has been reported and the time of that report, and sometimes further detail such as screenshots of the website; it is publicly available. The second is the Anti-Phishing Working Group (APWG), which maintains a phishing archive describing phishing attacks. In addition, 27 features are used to train and test the classifiers. We will use a series of short scripts to programmatically extract the above features and store them in an Excel sheet for quick reference. The age of the dataset is the most significant problem, which is particularly relevant with the phishing corpus. E-banking phishing websites are short-lived, often lasting only on the order of 48 hours. Some of our features can therefore not be extracted from older websites, making our tests difficult. The average phishing site stays live for approximately 2.25 days, and some are zero-day phishing websites which are live only during peak hours.

VI. CASE STUDIES ON E-BANKING PHISHING SITES

Website Phishing:
Consider the example of the original website and the phished website of a bank, say the State Bank of India (SBI), which involves e-banking. Even when the user is a regular visitor of the site, it is nearly impossible for the user to identify the authenticity of the website based on its look and feel. When the user takes a close look at the two sites, some differences can be noticed: (1) Difference in URL - the URL of the actual original site is www.onlinesbi.com [4] and the URL of the phished website is www.sbionline.com [5]; and (2) Validation of the EV SSL certificate - Extended Validation Secure Sockets Layer (SSL) certificates are special SSL certificates that work with high security, and web browsers clearly identify a website's organizational identity. Extended Validation (EV) helps you make sure a website is genuine and verified. In the actual original website, the address bar turns green, indicating that the site is secured by an EV certificate.
Figure 1 below is the original website of SBI.

Figure 1: Original website of SBI
Figure 2 below shows the phished website of SBI.

Figure 2: Phished website of SBI


VII. IMPLEMENTATION

Figure 3: Legitimate website detection

Figure 4: Phished website detection

VIII. CONCLUSION AND FUTURE WORK

The prediction of phishing websites is essential, and this can be done using neural networks. For the prediction of phishing websites, earlier works were done using various data mining classification algorithms, but the error rate of those algorithms was very high. When an element of a neural network fails, the network can continue without any problem because of its parallel nature. Thus performance can be improved by using neural networks, as they reduce the error and give better classification. We believe that this framework works better and gives a lower error rate.
In this paper we proposed a phishing detection approach that classifies webpage security by checking the webpage source code. We extract some phishing characteristics to evaluate the security of the websites and check the webpage source code; if we find a phishing characteristic, we decrease the initial secure weight. Finally, we calculate the security percentage based on the final weight: a high percentage indicates a secure website, while a low percentage indicates that the website is most likely a phishing website.
In Figures 3-4 the final values of a legitimate and a phished website are compared, and hence detection is performed. In future work we can add other checks to the program and handle source code written in many languages, such as PHP, CSS, ASP, Java, Perl, etc.

REFERENCES
[1] Ammar ALmomani, Tat-Chee Wan, "Evolving Fuzzy Neural Network for Phishing Emails Detection".
[2] Ningxia Zhang, Yongqing Yuan, "Phishing Detection Using Neural Network".
[3] Ram Basnet, Srinivas Mukkamala, Andrew H. Sung, "Detection of Phishing Attacks: A Machine Learning Approach".
[4] A. Martin, Na. Ba. Anutthamaa, M. Sathyavathy, Marie Manjari Saint Francois, Dr. Prasanna Venkatesan, "A Framework for Predicting Phishing Websites Using Neural Networks".
[5] APWG, "Phishing Activity Trends Report - 2nd Quarter 2012".
[6] Michael Blasi, "Techniques for Detecting Zero Day Phishing Websites".
[7] www.phishtank.com