THESIS
SUBMITTED TO
GANPAT UNIVERSITY
KHERVA
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE AND APPLICATION)
BY
JYOTINDRA N. DHARWA
A.M.PATEL INSTITUTE OF COMPUTER STUDIES
GANPAT UNIVERSITY, KHERVA
DR. A. R. PATEL
DIRECTOR, DEPARTMENT OF COMPUTER SCIENCE
HEMCHANDRACHARYA NORTH GUJARAT UNIVERSITY, PATAN
APRIL 2010
CONTENTS
Abstract I
Acknowledgement II
List of Tables V
Chapter Contents XI
Chapter 1: Introduction 1
(TRSGM)
Chapter 8: Proposed Financial Cyber Crime Prevention Model & Conclusion 245
ABSTRACT
The Internet in India is growing rapidly. It has given rise to new opportunities in every
field we can think of, be it entertainment, business, sports or education. Like every coin,
however, the Internet has two sides, and one of its major disadvantages is cyber crime:
illegal activity committed on the Internet. Connecting to such a large network also exposes
us to security risks. Computers today are being misused for illegal activities like e-mail
espionage, credit card fraud, spam, software piracy and so on, which invade our privacy and
offend our senses. Criminal activities in cyberspace are on the rise.
Developing a financial cyber crime detection system is a challenging task. When an online
transaction is performed with a credit card, no system can declare with certainty that the
transaction is fraudulent; it can only estimate the likelihood that it is.
We propose a novel approach for online transaction fraud detection, which combines evidence
from current as well as past behavior. The proposed Transaction Risk Score Generation Model
(TRSGM) consists of five major components, namely, the DBSCAN algorithm, a linear equation,
rules, a historical transaction database and a Bayesian learner. The DBSCAN algorithm is used
to form clusters of the customer's past transaction amounts and to find the deviation of a new
incoming transaction amount as well as the cluster coverage. The patterns generated by the
Transaction Pattern Generation Tool (TPGT), along with their weights, are used in the linear
equation to generate a risk score for each new incoming transaction. The guidelines published
on various web sites and in the print and electronic media as indications of fraudulent online
transactions for credit card companies are implemented as rules in TRSGM.
Using the first four components, we determine the suspicion level of each incoming transaction
based on the extent of its deviation from good patterns. The transaction is classified as
genuine, fraudulent or suspicious depending on this initial belief. Once a transaction is found
to be suspicious, the belief is further strengthened or weakened according to its similarity
with the fraudulent or genuine transaction history using Bayesian learning.
ACKNOWLEDGEMENT
I take this opportunity to express my deep gratitude towards my Ph.D. guide,
Dr. A. R. Patel, for his suggestions and constant inspiration at every stage of the research.
He is an extremely sympathetic and principle-centered person. His skills as a researcher
and guide helped me to overcome all the hurdles. Without his constant support and
encouragement, I would not have been able to complete my research work successfully.
I am thankful to my brother, a manager, who provided me with statistical data at the initial
stage of the research; my initial model design was possible only due to his support.
I would like to thank my colleagues Dr. N. J. Patel, Dr. S. M. Parikh and staff at Acharya
Motibhai Patel Institute of Computer Studies for their invaluable encouragement and
help.
My parents have their own share in my success. I firmly believe that their blessings
always enlighten my path ahead. I salute my father Nathalal and my late mother Menaben.
I would like to thank my brother Navinbhai, sister-in-law Lilaben, and nephews Vikas and
Kunal for their love, blessings and moral support throughout my research work. I give my
special thanks to my wife Urmila and daughters Mudra and Aditi, without whose support and
sacrifice this thesis would not have been possible.
Finally, I thank one and all for the divine blessings.
Jyotindra N. Dharwa
CERTIFICATE
I hereby certify that Mr. Jyotindra N. Dharwa has completed his Ph.D. thesis for the
doctorate degree on the topic “Data Mining Techniques: Study, Analysis, Prevention &
Detection for Financial Cyber Crime and Frauds”.
I further certify that the whole work done by him is his own and original, and that no
degree, diploma or distinction has been conferred on him on the basis of this thesis by
either Ganpat University or any other university.
DECLARATION
I, Mr. Jyotindra N. Dharwa, hereby declare that my Ph.D. thesis titled “Data Mining
Techniques: Study, Analysis, Prevention & Detection for Financial Cyber Crime and
Frauds” is written as a partial fulfillment of the requirements for a doctorate degree on
the topic. The complete study is based on a literature survey, the study of periodicals,
journals and websites, and the building of a model for proving the concept studied and designed.
I further declare that the complete thesis work, including all analysis, hypotheses,
inferences and interpretation of data and information, has been done by me and is my own,
original work. Moreover, I declare that no degree, diploma or distinction has previously
been conferred on me by Ganpat University or any other university on the basis of this
thesis.
LIST OF TABLES
Chapter 2
Table 2.1 Steps in the Evolution of Data Mining 22
Table 2.2 Time Line of Data Mining Development 25
Table 2.3 Initial Weight Values for the Neural Network Shown in Figure 2.4 45
Table 2.4 Comparison of Clustering Algorithms 89
Table 2.5 Data Mining Technique for Data Mining Task 90
Chapter 3
Table 3.1 Average (Median) Loss Per Typical Complaint Demographics 109
Table 3.2 Losses based on fraud category wise 110
Chapter 5
Table 5.1 Transaction 126
Table 5.2 Customer_Master 127
Table 5.3 Creditcard_Master 128
Table 5.4 Seller_Master 128
Table 5.5 Address_Master 129
Table 5.6 Product_Master 130
Table 5.7 Product_Category_Master 130
Table 5.8 Shipping_Master 131
Table 5.9 Location_Master 132
Table 5.10 City_Master 133
Table 5.11 State_Master 133
Table 5.12 Country_Master 134
Table 5.13 User_Log_Master 134
Table 5.14 Cardholder_Master 135
Table 5.15 Fraud 136
Table 5.16 Suspect 137
Table 5.17 Customer_DailyCount 138
Table 5.18 Customer_WeeklyCount 139
Table 5.19 Customer_FortnightlyCount 139
Table 5.20 Customer_MonthlyCount 140
Table 5.21 Customer_SundayCount 140
Table 5.22 Customer_HolidayCount 141
Table 5.23 Statistical data of expenditure in category by income 144
Table 5.24 Components of Gaussian distribution 144
Table 5.25 Sample Data of Table Transaction 145
Table 5.26 Credit Card Parameters 152
Chapter 7
Table 7.1 Parameters of the Equation 192
Table 7.2 Sample output of the application for different transaction amounts 231
Table 7.3 Sample output of the application for different sellers 234
Table 7.4 Sample output of the application for different locations 237
LIST OF FIGURES
Chapter 2
Figure 2.1 Historical Perspective of Data Mining 24
Figure 2.2 Decision Tree for Example 2.1 38
Figure 2.3 Decision Tree for Example 2.2 40
Figure 2.4 A Fully Connected Feed-Forwarded Neural Network 44
Figure 2.5 Radial Basis Function Network 70
Figure 2.6 Classification of Clustering Algorithms 72
Figure 2.7 Example of Dendrogram 73
Chapter 3
Figure 3.1 Affecting the Person by Cyber Crime (in %) 93
Figure 3.2 IC3 Complaint Categories (in %) 107
Figure 3.3 Percentage of Referrals by Monetary Loss 108
Figure 3.4 Plastic Card Fraud Losses on UK-issued Cards 1998-2008 109
Figure 3.5 Percentage of Different Plastic Card Fraud Category in Year 1998 110
Figure 3.6 Percentage of Different Plastic Card Fraud Category in Year 2008 111
Figure 3.7 Internet/E-Commerce Fraud Losses on UK-issued Cards 111
Figure 3.8 Revenue Lost to Online Fraud (in %) 113
Chapter 4
Figure 4.1 Architecture of 2-Stage Solution 116
Chapter 5
Figure 5.1 Data warehouse Design Layout-I 142
Figure 5.2 Data warehouse Design Layout-II 143
Figure 5.3 Credit Card Number Semantic Graph 151
Figure 5.4 Sample of Credit Card 154
Chapter 6
Figure 6.1 Parameters of TPGT 156
Figure 6.2 Subparameters of DP 157
Figure 6.3 Subparameters of CP 158
Figure 6.4 Subparameters of PP 159
Figure 6.5 Subparameters of TP 160
Figure 6.6 Subparameters of WP 161
Figure 6.7 Subparameters of VP 162
Figure 6.8 Subparameters of AP 162
Figure 6.9 Subparameters of FP 163
Figure 6.10 Subparameters of MP 164
Figure 6.11 Subparameters of SP 164
Figure 6.12 Subparameters of HP 165
Figure 6.13 Subparameters of LP 166
Figure 6.14 Subparameters of GP 167
Chapter 7
Figure 7.1 Block Diagram of Proposed Financial Cyber Crime Detection System 209
Figure 7.2 Graph of clusters formed by DBSCAN algorithm for Card id=1 210
Figure 7.3 Graph of clusters formed by DBSCAN algorithm for Card id=5 210
Figure 7.4 Graph of clusters formed by DBSCAN algorithm for Card id=100 211
Figure 7.5 Graph of clusters formed by DBSCAN algorithm for Card id=1507 211
Figure 7.6 Sample output of Clusters formed by DBSCAN Algorithm – I 212
Figure 7.7 Sample output of Clusters formed by DBSCAN Algorithm – II 212
Figure 7.8 Sample output of Clusters formed by DBSCAN Algorithm – III 213
Figure 7.9 Sample output of Clusters formed by DBSCAN Algorithm – IV 213
Figure 7.10 Sample output of Clusters formed by DBSCAN Algorithm – V 214
Figure 7.11 Sample output of Clusters formed by DBSCAN Algorithm – VI 214
Figure 7.12 Sample output of Clusters formed by DBSCAN Algorithm – VII 215
Figure 7.13 Sample output of Clusters formed by DBSCAN Algorithm – VIII 215
Figure 7.14 Sample output of Clusters formed by DBSCAN Algorithm – IX 216
Figure 7.15 Sample output of Clusters formed by DBSCAN Algorithm – X 216
Figure 7.16 Sample output of Clusters formed by DBSCAN Algorithm – XI 217
Figure 7.17 Sample output of Clusters formed by DBSCAN Algorithm – XII 217
Figure 7.18 Sample output of Data Mining Application for Genuine Transaction - I 223
Figure 7.19 Sample output of Data Mining Application for Genuine Transaction - II 223
Figure 7.20 Sample output of Data Mining Application for Genuine Transaction - III 224
Figure 7.21 Sample output of Data Mining Application for Fraudulent Transaction - I 225
Figure 7.22 Sample output of Data Mining Application for Fraudulent Transaction - II 225
Figure 7.23 Sample output of Data Mining Application for Fraudulent Transaction - III 226
Figure 7.24 Sample output of Data Mining Application for Suspicious Transaction - I 227
Figure 7.25 Sample output of Data Mining Application for Suspicious Transaction - II 228
Figure 7.26 Sample output of Data Mining Application for Suspicious Transaction - III 228
Figure 7.27 Sample output of Data Mining Application for Multiple Order Product Support - I 229
Figure 7.28 Sample output of Data Mining Application for Multiple Order Product Support - II 229
Figure 7.29 Sample output of Data Mining Application for Multiple Order Product Support - III 230
Figure 7.30 Sample output of Data Mining Application for different transaction amounts - I 232
Figure 7.31 Sample output of Data Mining Application for different transaction amounts - II 232
Figure 7.32 Sample output of Data Mining Application for different transaction amounts - III 233
Figure 7.33 Sample output of Data Mining Application for different transaction amounts - IV 233
Figure 7.34 Sample output of Data Mining Application for different sellers – I 235
Figure 7.35 Sample output of Data Mining Application for different sellers – II 235
Figure 7.36 Sample output of Data Mining Application for different sellers – III 236
Figure 7.37 Sample output of Data Mining Application for different sellers – IV 236
Figure 7.38 Sample output of Data Mining Application for different locations – I 238
Figure 7.39 Sample output of Data Mining Application for different locations – II 238
Figure 7.40 Sample output of Data Mining Application for different locations – III 239
Figure 7.41 Sample output of Data Mining Application for different locations – IV 239
Figure 7.42 Sample output of Data Mining Application for Cluster Coverage 240
Figure 7.43 Sample output of Data Mining Application for maximum purchasing habit input - I 241
Figure 7.44 Sample output of Data Mining Application for maximum purchasing habit input - II 242
Figure 7.45 Sample output of Data Mining Application for maximum purchasing habit input - III 242
Figure 7.46 Sample output of Data Mining Application for Bayesian Learning - I 243
Figure 7.47 Sample output of Data Mining Application for Bayesian Learning - II 244
CHAPTERS CONTENTS
Chapter 1 Introduction 1
1.1 Motivation 2
1.2 Objective of the research 2
1.3 Related Work 4
1.3.1 In Fraud Detection 4
1.3.2 In Financial Cyber crime Prevention 11
1.4 Research Issues 13
1.5 Outline of the Research 15
1.6 References 16
2.7.7 Weaknesses 47
2.8 Genetic Algorithms 47
2.8.1 Where GAs can be used? 48
2.8.2 Explanation of terms 48
2.8.3 Applications of GA 50
2.8.4 Strengths of GA 50
2.8.5 Weaknesses of GA 50
2.9 Classification 51
2.9.1 Statistical-Based Algorithms 51
2.9.2 Distance-Based Algorithms 55
2.9.3 Decision Tree-Based Algorithms 58
2.9.4 Neural Network-Based Algorithms 65
2.10 Clustering 71
2.10.1 Hierarchical Algorithms 72
2.10.2 Agglomerative Algorithms 73
2.10.3 Partitional Algorithms 75
2.10.4 Clustering Large Databases 82
2.10.5 Comparison of Clustering Algorithms 87
2.11 Selection Criteria of a Data Mining Technique 87
2.12 References 90
3.5 Types of Fraud 95
3.5.1 Credit Card Fraud 95
3.5.2 Telecommunications Fraud 97
3.5.3 Computer Intrusion 98
3.6 Financial Crimes 99
3.6.1 Types of Financial Crimes 99
3.7 Ways of Online Banking Fraud 105
3.7.1 Phishing 105
3.7.2 Malware 105
3.7.3 Spyware 106
3.8 2008 Internet Crime Report 106
3.8.1 Complaint Characteristics 106
3.8.2 Case Studies of APACS 109
3.9 Online Fraud Report, Cybersource 2010 112
3.10 References 113
5.1.3 Fact Constellation Architecture 125
5.2 Fact Table 125
5.3 Dimensional Tables 127
5.4 Lookup Tables 138
5.5 Data Collection 143
5.6 Sample Data 145
5.7 Credit Card Number Generation 151
5.7.1 The Luhn Algorithm 152
5.7.2 An example of Luhn Validation Technique 153
5.8 References 154
6.2.5 Weekly Parameters (WP) 171
6.2.6 Seller or Vendor Parameters (VP) 171
6.2.7 Address Parameters (AP) 172
6.2.8 Fortnightly Parameters (FP) 172
6.2.9 Monthly Parameters (MP) 173
6.2.10 Sunday Parameters (SP) 173
6.2.11 Holiday Parameters (HP) 174
6.2.12 Location Parameters (LP) 175
6.2.13 Transaction Gap Parameters (GP) 176
6.3 Computations of the Patterns 177
6.3.1 TP1 to TP8 177
6.3.2 TP11 and TP12 179
6.3.3 GP1 to GP7 180
6.3.4 AP1 and AP2 181
6.4 References 182
7.5.2 Inter Transaction Gap Recording 219
7.5.3 Maximum Value Finding 222
7.6 Sample Results 223
7.6.1 Genuine Transaction 223
7.6.2 Fraudulent Transaction 225
7.6.3 Suspicious Transaction 227
7.6.4 Multiple Product Order Support 229
7.7 Result Analysis & Discussions 231
7.8 References 244
CHAPTER 1
INTRODUCTION
1.1 MOTIVATION
1.2 OBJECTIVE OF THE RESEARCH
1.3 RELATED WORK
1.4 RESEARCH ISSUES
1.5 OUTLINE OF THE RESEARCH
1.6 REFERENCES
The Internet in India is growing rapidly. It has given rise to new opportunities in every
field we can think of, be it entertainment, business, sports or education. Like every coin,
however, the Internet has two sides, and one of its major disadvantages is cyber crime:
illegal activity committed on the Internet. Connecting to such a large network also exposes
us to security risks. Computers today are being misused for illegal activities like e-mail
espionage, credit card fraud, spam, software piracy and so on, which invade our privacy and
offend our senses. Criminal activities in cyberspace are on the rise.
In today’s electronic society, e-commerce has become an essential sales channel for
global business. Due to the rapid advancement of e-commerce, the use of credit cards for
purchases has increased dramatically. Unfortunately, fraudulent or illegal use of credit
cards has also become an attractive source of revenue for fraudsters. Occurrences of credit
card fraud are increasing dramatically due to the exposure of security weaknesses in
traditional credit card processing systems, resulting in losses worth billions of dollars
every year. Fraudsters have become very dynamic and use sophisticated techniques to
perpetrate credit card fraud. These fraudulent activities present unique challenges
worldwide to banks and other financial institutions that issue credit cards.
According to the 2008 Internet Crime Report [41] of the Internet Crime Complaint Center
(IC3), from January 1, 2008 to December 31, 2008 the IC3 website received 275,284 complaint
submissions. This is a 33.1% increase over 2007, when 206,884 complaints were received.
These filings were composed of complaints primarily related to fraudulent and non-fraudulent
issues on the Internet. The dollar loss of referred complaints was at an all-time high in
2008, $264.59 million, exceeding the previous year's record-breaking dollar loss of
$239.09 million. On average, men lost more money than women.
A Gartner survey [40] of more than 160 companies reveals that 12 times more fraud exists
in Internet transactions than in traditional ones, and that e-tailers are paying credit card
discount rates that are 66 percent higher than traditional retailer fees. Moreover, web
merchants bear the liability and costs in cases of fraud, while credit card companies
generally absorb the fraud losses for traditional retailers.
1.1 MOTIVATION
The various cyber crime cases involving credit cards that appear frequently in the daily
newspapers, and their broad coverage in the television media, inspired me to work in this area.
The purpose of the research is, first, to discuss the different financial cyber crimes and
frauds seen today in the form of credit card fraud, phishing, etc.; secondly, to study
different data mining techniques such as neural networks, clustering techniques and decision
trees; and eventually to show how these techniques can be used and applied to detect
financial cyber crime and frauds.
Fraud Prevention describes measures to stop fraud occurring in the first place. In contrast,
fraud detection involves identifying fraud as quickly as possible once it has been
perpetrated. Fraud detection comes into play once fraud prevention has failed. In practice,
fraud detection must be used continuously, as one will typically be unaware that fraud
prevention has failed. We can try to prevent credit card fraud by guarding our cards
assiduously, but if nevertheless the card’s details are stolen, then we need to be able to
detect, as soon as possible, that fraud is being perpetrated.
Currently, data mining is a popular way to combat fraud because of its effectiveness.
The task of data mining is to analyze a massive amount of data and to extract usable
information that we can interpret for future use. In doing so, we have to define a clear
goal for the data mining exercise and find the right structure of a possible model or
patterns that fit the given data set. Once we have the right model for the data, we can
use the model to predict future events by classifying the data. In terms of data mining,
fraud detection can be understood as classification of the data. Input data are analyzed
with the appropriate model to determine whether they imply any fraudulent activity or not.
A well-defined classification model is developed by recognizing the patterns of former
fraudulent behavior. The model can then be used to predict any suspicious activities
implied by a new data set.
The prediction of user behavior in financial systems can be used in many situations.
Predicting client migration, marketing or public relations can save a lot of money and
other resources. One of the most interesting fields of prediction is fraud on credit lines,
especially credit card payments. With a high data traffic of 400,000 transactions per day,
even a 2.5% reduction in fraud translates into substantial savings every year.
Certainly, all transactions that deal with accounts of known misuse are not authorized.
Nevertheless, there are transactions which are formally valid, but which experienced people
can tell are probably misuse, caused by stolen cards or fake merchants. So the task is to
stop a fraudulent credit card transaction before it is known to be illegal.
Data mining methods have made the most impact on fraud detection. This is typically
because large quantities of the information are numerical or can easily be converted into
numerical form as counts and proportions. We should also consider the speed of detection.
A key issue of the proposed work is how effective the tools are in detecting fraud. A
problem in fraud detection is that one typically does not know how many fraudulent cases
slip through the net. In such applications, the average time to detection after fraud starts
(in minutes, number of transactions, etc.) should also be reported. Measures of this aspect
interact with measures of the final detection rate: in many situations an account, telephone,
etc. will have to be used for several fraudulent transactions before it is detected as
fraudulent, so that several false-negative classifications will necessarily be made.
Credit card fraud detection has drawn a lot of research interest, and a number of
techniques, with special emphasis on data mining, have been suggested. Ghosh and Reilly
[1] have developed a fraud detection system based on a neural network. Their system is
trained on a large sample of labeled credit card account transactions. These transactions
contain example fraud cases due to lost cards, stolen cards, application fraud, counterfeit
fraud, mail-order fraud and non-received issue (NRI) fraud.
E. Aleskerov et al. [2] present CARDWATCH, a database mining system used for credit
card fraud detection. The system is based on a neural learning module and provides an
interface to a variety of commercial databases.
Dorronsoro et al. [3] have noted two particular characteristics of fraud detection: a very
limited time span for decisions and a large number of credit card operations to be
processed. They have separated fraudulent operations from normal ones by using Fisher's
discriminant analysis.
Syeda et al. [4] have used a parallel granular neural network for improving the speed of
data mining and knowledge discovery in credit card fraud detection. A complete system
has been implemented for this purpose.
Chan et al. [5] have divided a large set of transactions into smaller subsets and then
applied distributed data mining for building models of user behavior. The resultant base
models are then combined to generate a meta-classifier for improving detection accuracy.
Chiu and Tsai [7] consider web services for data exchange among banks. A fraud pattern
mining (FPM) algorithm has been developed for mining fraud association rules which
give information regarding the new fraud patterns to prevent attacks.
Some survey papers have been published which categorize, compare and summarize
articles in the area of fraud detection. Phua et al. [8] did an extensive survey of data
mining based fraud detection systems and presented a comprehensive report. Kou et al.
[9] have reviewed the various fraud detection techniques for credit card fraud,
telecommunication fraud and computer intrusion detection. Bolton and Hand [10]
describe the tools available for statistical fraud detection and the areas in which fraud
detection technologies are most commonly used. D.W.Abbott et al. [21] compare five of
the most highly acclaimed commercial data mining tools on a fraud detection application,
with descriptions of their distinctive strengths and weaknesses, based on the lessons
learned by the authors during the process of evaluating the products. D.Yue et al. [32]
conduct an extensive literature review to answer questions such as: (1) Can financial
statement fraud (FSF) be detected, how likely is it, and how can it be done? (2) What data
features can be used to predict FSF? (3) What kinds of algorithms can be used to detect
FSF? (4) How can the performance of the detection be measured? and (5) How effective are
these algorithms in terms of fraud detection?
V.Hanagandi et al. [11] generate a fraud score using historical information on credit
card account transactions. They describe a fraud/non-fraud classification methodology
using a radial basis function network (RBFN) with a density-based clustering approach.
The input data are transformed into principal component space, and clustering as well as
RBFN modeling is done using a few principal components.
A.Shen et al. [12] investigate the efficacy of applying classification models to credit
card fraud detection problems. They tested three classification methods, i.e. neural
network, decision tree and logistic regression, for their applicability to fraud detection.
H.Shao et al. [13] introduced an application of data mining to detect fraudulent behavior
in customs declaration data, using data mining technology such as an easy-to-expand
multi-dimension-criterion data model and a hybrid fraud-detection strategy.
K.B.Bignell [14] outlines a framework for internet banking security using multi-layered,
feed-forward artificial neural networks.
A. Srivastava et al. [15] model the sequence of operations in credit card transaction
processing using a Hidden Markov Model (HMM) and show how it can be used for the
detection of frauds. An HMM is initially trained with the normal behavior of a cardholder.
If an incoming credit card transaction is not accepted by the trained HMM with sufficiently
high probability, it is considered to be fraudulent. At the same time, they also try to
ensure that genuine transactions are not rejected.
B.Zhang et al. [16] consider network-level features, such as users' beliefs about other
users, to deal with fraud in group behavior. They use a loopy belief propagation algorithm
and apply it to network-level fraud detection, classifying users as fraudsters, accomplices
or honest users.
J.E.Carbal et al. [17] propose a methodology based on rough sets and KDD for detecting
fraud committed by electrical energy consumers. This methodology performs a detailed
evaluation of the boundary region between fraudulent and normal customers, identifying
patterns of fraudulent behavior in the historical data sets of electricity companies. They
derive classification rules from these patterns, which permit the detection, in the databases
of electricity companies, of those clients that present fraudulent features.
J.Quah et al. [18] focus on real-time fraud detection and present a new and innovative
approach to understanding spending patterns in order to decipher potential fraud cases.
They make use of a self-organizing map to decipher, filter and analyze customer behavior
for the detection of fraud.
E.L.Barse et al. [19] generate synthetic test data for fraud detection in an IP-based
video-on-demand service while ensuring that important statistical properties of the
authentic data are preserved.
J.Xu et al. [20] present an anomaly detection technique based on behavior mining and
monitoring that works at both the individual and the system level. They utilize a
frequent-pattern tree to profile normal behavior adaptively and design a novel tree-based
pattern matching algorithm to discover individual-level anomalies.
Recently, a fraud detection system was developed by Suvasini Panigrahi et al. [22], which
consists of four components, namely, a rule-based filter, a Dempster-Shafer adder, a
transaction history database and a Bayesian learner. In the rule-based component, they
determine the suspicion level of each incoming transaction based on the extent of its
deviation from good patterns. Dempster-Shafer theory is used to combine multiple such
evidences and an initial belief is computed.
Yi Peng et al. [23] apply two clustering techniques, SAS EM and CLUTO, to a large real-
life health insurance dataset and compare the performances of these two methods.
J.Tuo et al. [24] propose a case-based genetic artificial immune system for fraud
detection (AISFD). Their system is a self-adapted system designed for credit card fraud
detection. With the case-based learning model and genetic algorithm, their system can
perform online learning with limited time and cost, and update its fraud detection
capability as transactions and commerce activities grow rapidly.
J.Kim et al. [25] propose a novel artificial immune system, called CIFD (Computer Immune
System for Fraud Detection), which adopts both negative selection and positive selection to
generate artificial immune cells. CIFD also employs an analogy of the self-major
histocompatibility complex (MHC) molecules when antigen data is presented to the
system. This novel mechanism improves the scalability of CIFD, which is designed to
process gigabytes or more of transaction data per day.
S.J.Stolfo et al. [26] developed the JAM distributed data mining system for the real-world
problem of fraud detection in financial information systems. They have shown that cost-
based metrics are more relevant in certain domains, and defining such metrics poses
significant and interesting research questions both in evaluating systems and alternative
models, and in formalizing the problems to which one may wish to apply data mining
technologies. They also demonstrate how the techniques developed for fraud detection
can be generalized and applied to the important area of intrusion detection in networked
information systems.
F.Yu et al. [27] focus on how to build a data-mining-algorithm-centered application system
for common users. They present a case study of building a fraudulent tax declaration
detection system using a decision tree classification algorithm.
A.Leung et al. [28] shed some light on the design issues of an add-on fraud detection
module, namely the Fraud Detection Manager. Their design is based on the concept of
atomic transactions, called Coupons, implemented in e-wallet accounts.
W.Chai et al. [29] propose a method to convert fraud classification rules learned from a
genetic algorithm to a fuzzy score representing the degree to which a company’s financial
statements match those rules.
B.Garner and F.Chen [30] propose a paradigm involving an anomaly detection model,
case-based hypothesis generation and hypothesis synthesis, which is deemed to provide
a basic platform for management intelligence systems and fraud detection in electronic
data processing environments.
V.Aggelis [31] demonstrates a successful fraud detection model. His scope is to
present its contribution to the fast and reliable detection of any "strange" transaction,
including fraudulent ones.
S.Rozsnyai et al. [33] introduce a solution architecture for detecting and preventing fraud
in real time by using an event-based system called SARI (Sense and Respond
Infrastructure). They present the architecture and components of a real-time fraud
management solution which can easily be adapted to the business needs of domain
experts and business users. Their SARI system provides functions to monitor customer
behavior and can steer and optimize customer processes in real time. They show
fraud scenarios of an online gambling service provider.
T.M.Padmaja et al. [34] propose a new approach called extreme outlier elimination and
hybrid sampling. They use the k-reverse-nearest-neighbors (kRNN) concept as a data
cleaning method for eliminating extreme outliers in minority regions. They conducted
experiments with the classifiers C4.5, Naïve Bayes, k-NN and radial basis function
networks and compared the performance of their approach against a simple hybrid
sampling technique. The results obtained show that eliminating extreme outliers from the
minority class produces high prediction accuracy for both the fraud and non-fraud classes.
Z.Ferdousi et al. [35] use Peer Group Analysis (PGA), an unsupervised technique, to find
outliers in time-series financial data. They apply the tool to stock market data collected
from the Bangladesh Stock Exchange to assess its performance in stock fraud detection.
They observe that PGA can detect brokers who suddenly start selling stock in a way
different from other brokers to whom they were previously similar. They also apply
t-statistics to find the deviations effectively.
M.Sternberg et al. [36] utilize a cultural algorithm (CA) to respond to dynamic changes in
the application of a rule-based expert system. The CA provides self-adaptive capabilities
which can generate the information necessary for the expert system to respond
dynamically.
O.Dandash et al. [37] present a security analysis of their proposed internet banking model
compared with the current existing models used in fraudulent internet payment
detection and prevention. Their proposed model facilitates internet banking fraud
detection and prevention (FDP) by applying two new secure mechanisms, Dynamic Key
Generation (DKG) and Group Key (GK).
S.Viaene et al. [38] apply the weight of evidence reformulation of AdaBoosted naive
Bayes scoring to the problem of diagnosing insurance claim fraud. Their method
effectively combines the advantages of boosting and the explanatory power of the weight
of evidence scoring framework.
E.Lundin et al. [39] developed a method for generating synthetic data that is derived from
authentic data. They also note that in many cases synthetic data are more suitable than
authentic data for the testing and training of fraud detection systems.
It is well known that every cardholder has certain purchasing habits, which establish
an activity profile for him or her. Almost all the existing fraud detection techniques try to
capture these behavioral patterns as rules and check for any violation in subsequent
transactions. However, these rules are largely static in nature. As a result, they become
ineffective when the cardholder develops new patterns of behavior that are not yet known
to the fraud detection system (FDS). The goal of a reliable detection system is to learn the
behavior of users dynamically so as to minimize its own loss. Thus, systems that cannot
evolve or "learn" may soon become outdated, resulting in a large number of false alarms.
A fraudster can also attempt new types of attacks which should still get detected by the
FDS. For example, a fraudster may aim at deriving maximum benefit either by making a few high-value
purchases or a large number of low-value purchases in order to evade detection. Thus,
there is a need for developing fraud detection systems which can integrate multiple
evidences, including the patterns of genuine cardholders as well as those of fraudsters.
We propose a credit card fraud detection system that combines different types of
evidence effectively.
The first attempt at making online credit card transactions secure was to take the
transaction off-line. Many sites will allow us to call in our credit card number to a
customer support person. This solves the problem of passing the credit card number over
the Internet, but eliminates the merchant's ability to automate the purchasing process.
The next method that was developed, and which is currently used by many sites, is hosting
the WWW site on a secure server. A secure server is one that uses a protocol such as SSL
or S-HTTP to transmit data between the browser and the server. These protocols encrypt
the data being transmitted, so when we submit our credit card number through a WWW
form it travels to the server in encrypted form. This section describes three of the most
well-known systems for secure credit card transactions: First Virtual, CyberCash and SET
(Secure Electronic Transactions).
1.3.2.1 First Virtual
First Virtual was the first successfully used model that made internet transactions
secure. Instead of using credit card numbers, transactions are done using a First
Virtual PIN which references the buyer's First Virtual account. These PINs can be
sent over the Internet because, even if they are intercepted, they cannot be used to charge
purchases to the buyer's account. A person's account is never charged without email
verification from them accepting the charge.
Their payment system is based on existing Internet protocols, with the backbone of the
system designed around Internet email and the MIME (Multipurpose Internet Mail
Extensions) standard. First Virtual uses email to communicate with a buyer to confirm
charges against their account. Sellers use email, Telnet or automated programs that
make use of First Virtual's Simple MIME Exchange Protocol (SMXP) to verify accounts
and initiate payment transactions. To use this transaction scheme, both the customer and
the merchant must have an account on First Virtual's server. The First Virtual model
was one of the most successfully used models, but it is no longer in use.
1.3.2.2 CyberCash
CyberCash makes safe passage over the Internet for credit card transaction data. They
take the data that is sent to them from the merchant, and pass it to the merchant's
acquiring bank for processing. Except for dealing with the merchant through CyberCash's
server, the acquiring bank processes the credit card transaction as they would process
transactions received through a point of sale (POS) terminal in a retail store.
The CyberCash payment system is centered on the CyberCash Wallet software program,
which buyers use when making a purchase. This program handles passing payment
information, encrypted, between the buyer and the merchant.
1.3.2.3 SET
MasterCard and Visa have developed SET as a license-free protocol for credit card
transactions over the Internet. SET is based on two earlier protocols, STT (Secure
Transaction Technology) and SEPP (Secure Electronic Payment Protocol). Secure
Electronic Transaction (SET) is a system for ensuring the security of financial
transactions on the Internet. It was supported initially by MasterCard, Visa, Microsoft,
Netscape, and others. With SET, a user is given an electronic wallet (digital certificate)
and a transaction is conducted and verified using a combination of digital certificates and
digital signatures among the purchaser, a merchant, and the purchaser's bank in a way
that ensures privacy and confidentiality.
SET makes use of the Secure Sockets Layer (SSL) and the Secure Hypertext Transfer
Protocol (S-HTTP). SET uses some but not all aspects of a public key infrastructure (PKI).
Many other systems, such as PayPal and DigiCash, are also in use.
These systems are highly secure but are rarely used by customers and merchants. These
models secure our transactions over the Internet but cannot stop forgery if the credit card
information is lost physically or if the customer's information falls into the wrong hands.
Anshul Jain et al. [43] have proposed a model for preventing such misuse. According to
this model, a login id and a password are issued by the bank along with the credit card.
Once the customer logs in, he is asked for his credit card details in order to make sure
that the person logging in is in possession of the card, thus avoiding leakage of the id and
password. If the user is authenticated, then an internet virtual credit card number is issued.
The user has to select an expiry date between the present date and the actual expiry date of
the card. Customers who transact very often could activate the internet virtual credit card
for only a few days, in order to avoid forgery.
Financial fraud detection is quite confidential and not much is disclosed in public. The
major issue in this domain is that financial institutions and banks do not share their live
data with researchers, as they have strict policies and cannot disclose it. Also, there is no
benchmark data set available in this area. Consequently, there are very few researchers
(just one or two) who have worked with real-life credit card data and reported their results.
Most researchers have generated synthetic data based on statistical techniques.
It may be noted that Aleskerov et al. [2] tested the performance of their CARDWATCH
system on sets of synthetic data based on a Gaussian distribution. Chan et al. [5] have used
a skewed distribution to generate a training set of labeled transactions and have performed
experiments to determine the most effective training distribution. Li and Zhang [42] have
modeled a customer's payments by a Poisson process, which can only capture the time gap
between two transactions. Panigrahi et al. [22] have generated synthetic data using a
Markov modulated Poisson process (MMPP) and two Gaussian distribution functions.
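For illustration only, the short sketch below shows the kind of statistical synthesis these studies describe: drawing synthetic transaction amounts from two Gaussian components. The means, standard deviations and sample sizes are arbitrary assumptions and are not taken from any of the cited works or from this thesis.

import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical spending profile: most purchases are small, a few are large.
# The two means/std-devs below are illustrative assumptions only.
n_small, n_large = 180, 20
small = rng.normal(loc=40.0, scale=10.0, size=n_small)    # everyday purchases
large = rng.normal(loc=600.0, scale=120.0, size=n_large)  # occasional big-ticket items

amounts = np.clip(np.concatenate([small, large]), a_min=1.0, a_max=None)
rng.shuffle(amounts)

print(f"{len(amounts)} synthetic transaction amounts, "
      f"mean={amounts.mean():.2f}, max={amounts.max():.2f}")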
According to E.L.Barse et al. [19], using synthetic data for evaluation, training and
testing offers several advantages over using authentic data. The properties of synthetic
data can be tailored to meet various conditions not available in authentic data sets. They
discuss the motivation for using synthetic data, since authentic data cannot be used in
some cases for a number of reasons. The target service may still be under development
and thus produce irregular or only small amounts of authentic data. Synthetic data can be
designed to demonstrate certain key properties or to include attacks not available in the
authentic data, giving a high degree of freedom during testing and training. Synthetic data
can also cover extensive periods of time or represent a large number of users, a necessary
property for training some of the more "intelligent" detection schemes.
There are two types of data mining techniques: unsupervised and supervised methods.
Unsupervised methods do not need prior knowledge of fraudulent and non-fraudulent
transactions in a historical database, but instead detect changes in behavior or unusual
transactions. These methods model a baseline distribution that represents normal
behavior and then detect observations that show the greatest departure from this norm.
Outliers are a basic form of non-standard observation that can be used for fraud detection.
In supervised methods, models are trained to discriminate between fraudulent and non-
fraudulent behavior so that new observations can be assigned to classes. Supervised
methods require accurate identification of fraudulent transactions in historical databases
and can only be used to detect frauds of a type that have previously occurred. An
advantage of using unsupervised methods over supervised methods is that previously
undiscovered types of fraud may be detected. Supervised methods are only trained to
discriminate between legitimate transactions and previously known fraud.
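As a minimal illustration of the unsupervised idea, the sketch below scores new transaction amounts by their deviation, in standard deviations, from a customer's historical baseline; the history values and the threshold of 3 are assumptions for illustration and are not part of the thesis model.

from statistics import mean, stdev

def outlier_scores(history, new_amounts, threshold=3.0):
    """Score new transaction amounts by their deviation (in standard
    deviations) from a customer's historical spending baseline."""
    mu, sigma = mean(history), stdev(history)
    scores = [(x, abs(x - mu) / sigma) for x in new_amounts]
    return [(x, z, z > threshold) for x, z in scores]

history = [25, 40, 32, 60, 45, 38, 55, 30]        # assumed past amounts
print(outlier_scores(history, [48, 900]))          # 900 is flagged as an outlier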
All the techniques or models of fraud detection merely indicate the likelihood of fraud;
no method can confirm with certainty that a transaction is fraudulent.
When a user performs a transaction on the Internet, transaction-related data are
generated. These data are stored in a data warehouse designed using dimensional modeling.
The Transaction Pattern Generation Tool (TPGT) generates different patterns (parameters),
such as the maximum transaction amount, the time passed since the last transaction, the
time passed since the same category was purchased, etc., based on the historical data
stored in the data warehouse. All these parameters collectively represent the normal
purchasing behavior of the customer.
Whenever a deviation from this normal behavior occurs, the model should raise an alarm.
The Transaction Risk Score Generation Model (TRSGM) works on this principle. For each
transaction, the model predicts how far from or close to the previous set of all normal
transactions it is, and generates a risk score between 0 and 1. A transaction with a score
below 0.5 is considered genuine; if the score is greater than or equal to 0.8, the transaction
is considered fraudulent and is verified by confirming with the customer. If the risk score
is between 0.5 and 0.8, the transaction is considered suspicious and an additional layer, a
Bayesian learner, is applied by the model. Once a transaction is found suspicious, the model
waits for the next transaction on the same card. When the next transaction occurs on the
same card, a risk score is generated again. If this risk score is less than 0.5, the transaction
is declared genuine; if it is greater than 0.8, the transaction is declared fraudulent; and if it
is again found suspicious, the Bayesian learner calculates, from the genuine and fraudulent
transaction sets, the posterior probability of the transaction coming from a normal customer
or from a fraudster. If the probability of a normal transaction is higher, the transaction is
declared genuine; otherwise it is treated as fraudulent.
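The decision flow described above can be summarized in a short sketch. The 0.5 and 0.8 thresholds follow the description given here; the posterior probabilities are placeholders standing in for the Bayesian learner, and the function is illustrative rather than the thesis's actual implementation.

def classify(risk_score, p_genuine=None, p_fraud=None, low=0.5, high=0.8):
    """Map a TRSGM-style risk score in [0, 1] to a decision, following the
    thresholds described above. For suspicious scores, fall back on the
    Bayesian comparison of posterior probabilities (placeholders here)."""
    if risk_score < low:
        return "genuine"
    if risk_score >= high:
        return "fraudulent"          # verified by confirming with the customer
    # Suspicious band: compare posteriors from genuine vs. fraud history,
    # if they are available for this card.
    if p_genuine is not None and p_fraud is not None:
        return "genuine" if p_genuine >= p_fraud else "fraudulent"
    return "suspicious"              # wait for the next transaction on the card

print(classify(0.3))                                 # genuine
print(classify(0.9))                                 # fraudulent
print(classify(0.65))                                # suspicious
print(classify(0.65, p_genuine=0.2, p_fraud=0.8))    # fraudulent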
The model is implemented using three data mining techniques, 1) rules, 2) the DBSCAN
algorithm and 3) a Bayesian learner, in Oracle 9i.
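As an illustration of the DBSCAN component, the sketch below clusters a card's past transaction amounts and measures the deviation of a new amount from the nearest cluster center. It uses scikit-learn's DBSCAN rather than the Oracle 9i implementation of the thesis, and the amounts, eps and min_samples values are assumed for illustration.

import numpy as np
from sklearn.cluster import DBSCAN

# Assumed past transaction amounts for one card (one-dimensional feature).
amounts = np.array([22, 25, 30, 28, 35, 40, 38, 520, 540, 515, 1200]).reshape(-1, 1)

# eps and min_samples are illustrative; the thesis tunes its own parameters.
labels = DBSCAN(eps=15, min_samples=2).fit(amounts).labels_

for cluster_id in sorted(set(labels)):
    members = amounts[labels == cluster_id].ravel()
    name = "noise" if cluster_id == -1 else f"cluster {cluster_id}"
    print(name, members)

# A new incoming amount can then be compared with the nearest cluster center
# to measure its deviation, as the deviation component of TRSGM does.
new_amount = 610
centers = [amounts[labels == c].mean() for c in set(labels) if c != -1]
print("minimum deviation from a cluster center:",
      min(abs(new_amount - c) for c in centers))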
Chapter 2 gives an overview of data mining and compares various data mining
techniques in terms of ease of understanding and implementation, input and output
issues, applications, strengths, weaknesses, etc. It then discusses the criteria that are
helpful for selecting a data mining technique, such as whether learning is supervised or
unsupervised, the nature of the input and output data, the presence of noisy data, time
(speed) issues (algorithms for building decision trees and production rules typically
execute much faster than neural networks or genetic algorithms), and classification accuracy.
Various types of financial cyber crimes and frauds committed worldwide are discussed in
Chapter 3. Chapter 4 discusses how various data mining techniques and rules become
helpful in financial crime detection. Chapter 5 describes the design and implementation
of the data warehouse and the various tables maintained by the financial cyber crime
detection system (FCDS). The development of the Transaction Pattern Generation Tool
(TPGT), which generates various parameters for the customer performing an online
transaction, is discussed in Chapter 6. The development of the Transaction Risk Score
Generation Model (TRSGM), which assigns a risk or fraud score (0-1) to each transaction,
is discussed in Chapter 7. The features of the developed data mining application software,
the significance of the research, the limitations of the study and the future scope of the
research are discussed as the conclusion in Chapter 8.
1.6 REFERENCES
[1] S.Ghosh, D.L.Reilly, “Credit card fraud detection with a neural-network”, in:
Proceedings of the Twenty-seventh Hawaii International Conference on System Sciences,
1994, pp. 621-630.
[13] H.Shao, H. Zhao, G.Chang, “Applying Data mining to detect fraud behavior in
customs declaration”, in: Proceedings of the First International Conference on Machine
Learning and Cybernetics, Beijing, November 2002, pp.1241-1244
[14] K.B.Bignell, “Authentication in an internet banking environment strategy; towards
developing a strategy for fraud detection” in: Proceedings of International Conference
ICISP 2006, 26-28 Aug. 2006, pp.23
[15] A.Srivastava, A.Kundu, S.Sural, A.K.Majumdar, “Credit card fraud detection using
hidden markov model”, in: IEEE transactions on dependable and secure computing,Vol.
5, No. 1, January-March 2008.
[16] B.Zhang, Y. Zhou, C. faloutsos, “Toward a comprehensive model in internet auction
fraud detection”, in: Proceedings of the 41st Hawaii International Conference on System
sciences, 2008
[17] J.E.Carbal, J.Pinto, S.C.Linares, M.A.C.Pinto, Methodology for fraud detection
using rough sets, http://www.ieeexplore.ieee.org/iel5/10898/34297/01635791.pdf
[18] J.Quah, M.Sriganesh, “Real time credit card fraud detection using computational
intelligence”, in: Proceedings of the International Joint Conference on Neural Networks,
Florida, U.S.A, August 2007
[19] E.L.Barse, H.Kvanstrom, E.Jonsson, “Synthesizing test data for fraud detection
system”, in: Proceedings of the 18th Annual Computer Security Applications
Conference,2003
[20] J.Xu, A.H.Sung, Q.Liu,”Tree based behavior monitoring for adaptive fraud
detection”, in: Proceedings of the 18th International Conference on pattern recognition,
2006
[21] D.W.Abbott, I.P.Matkovsky, J.F. Elder IV, “An Evaluation of High-end Data
Mining Tools for Fraud Detection”, 1998, IEEE Xplore
[22] Suvasini Panigrahi, Amlan Kundu, Shamik Sural, A.K.Majumdar, “Credit card
fraud detection: A fusion approach using Dempster-Shafer theory and Bayesian
learning”, www.sciencedirect.com
[23] Yi Peng, Gang Kou, A. Sabatka, Z. Chen, D.Khazanchi, Y.Shi, “Application of
Clustering Methods to Health Insurance Fraud Detection”, www.ieeexplore.ieee.org
[24] J.Tuo, S.Ren, W.Liu, X.Li, B.Li, L.Lei, “Artificial Immune System for Fraud
Detection”, in: Proceedings of the International Conference on systems, Man and
Cybernetics
[25] J.Kim, A.Ong, R.E.Overill, “Design of an Artificial Immune System as a Novel
Anomaly Detector for Combating Financial Fraud in the Retail Sector”, in: Evolutionary
Computation, 2003. CEC ’03, pp 405-412 Vol.1
[26] S.J.Stolfo, W.Lee, A.Prodromidis, P.K.Chan, “Cost-based Modeling for Fraud and
Intrusion Detection: Results from the JAM Project”,
http://www.citeseer.ist.psu.edu/244959.html
[27] F.Yu, Z.Qin, X.Jia, “Data Mining Issues in Fraudulent Tax Declaration Detection”,
in: Proceedings of the Second International Conference on Machine Learning and
Cybernetics, Xian, November 2003, pp.2202-2206
[28] A.Leung, Z.Yan, S.Fong, “On Designing a Flexible E-Payment System with Fraud
Detection Capability”, in: Proceedings of the IEEE International Conference on E-
Commerce Technology, 2004
[29] W.Chai, B.K.Hoogs, B.T.Verschueren, “Fuzzy Ranking of Financial Statements for
Fraud Detection”, in: Proceedings of the IEEE International Conference on Fuzzy
Systems, Canada, 2006, pp.152-158
[30] B.Garner, F.Chen, “Hypothesis Generation Paradigm for Fraud Detection”,
http://www.ieeexplore.ieee.org/iel2/2978/8447/00369309.pdf
[31] V.Aggelis, “Offline Internet Banking Fraud Detection”, in: Proceedings of the First
International Conference on Availability, Reliability and Security, 2006
[32] D.Yue, X.Wu, Y.Wang, Y.Li,C-H Chu, “A Review of Data Mining-based Financial
Fraud Detection Research”,
http://www.ieeexplore.ieee.org/iel5/4339774/4339775/04341127.pdf
[33] S.Rozsnyai, J.Schiefer, A.Schatten, “Solution Architecture for Detecting and
Preventing Fraud in Real Time”, in: Proceedings of the 2nd International Conference
ICDIM ’07, Volume 1, pp:152-158
[34] T.M.Padmaja, N.Dhulipalla, R.S.Bapi, P.R.Krishna, “Unbalanced Data
Classification Using Extreme Outlier Elimination and Sampling Techniques for Fraud Detection”
CHAPTER 2
A COMPARATIVE STUDY OF DATA MINING TECHNIQUES
Data Mining is the process of employing one or more computer learning techniques to
automatically analyze and extract knowledge from data contained within a database. The
purpose of a data mining session is to identify trends and patterns in data.
Data mining has been defined as “the nontrivial extraction of implicit, previously
unknown, and potentially useful information from data” and as “the science of extracting
useful information from large data sets or databases.” Hand et al. define data mining as
“a well-defined procedure that takes data as input and produces output in the forms of
models or patterns.”
Data mining techniques are the result of a long process of research and product
development. This evolution began when business data was first stored on computers,
continued with improvements in data access, and more recently, generated technologies
that allow users to navigate through their data in real time. Data mining takes this
evolutionary process beyond retrospective data access and navigation to prospective and
proactive information delivery. Data mining is ready for application in the business
community because it is supported by three technologies that are now sufficiently mature:
massive data collection, powerful multiprocessor computers, and data mining algorithms.
Commercial databases are growing at unprecedented rates. A recent META group survey
of data warehouse projects found that 19% of respondents are beyond the 50-gigabyte
level. In some industries, such as retail, these numbers can be much larger. The
accompanying need for improved computational engines can now be met in a cost-
effective manner with parallel multiprocessor computer technology. Data mining
algorithms embody techniques that have existed for at least 10 years, but have only
recently been implemented as mature, reliable, understandable tools that consistently
outperform older statistical methods.
In the evolution from business data to business information, each new step has built upon
the previous one. For example, dynamic data access is critical for drill-through in data
navigation applications, and the ability to store large databases is critical to data mining.
Today, the maturity of these techniques, coupled with high-performance relational
database engines and broad data integration efforts, make these technologies practical for
current data warehouse environments.
The current evolution of data mining functions and products is the result of years of
influence from many disciplines, including databases, information retrieval, statistics,
algorithms and machine learning (Figure 2.1). Another computer science area that has
had a major impact on the KDD process is multimedia and graphics.
Figure 2.1 Historical Perspective of Data Mining (disciplines such as information
retrieval, statistics and databases converging into data mining)
2.4 DATA MINING PROCESS
A practical data mining application is often complex. It is interactive and iterative,
involving a number of key steps:
1. Understanding the application domain and the application goals.
2. Extracting one or more target data sets from databases.
3. Cleaning data, e.g., removing noise and handling missing data.
4. Removing irrelevant attributes and tuples from the data.
5. Choosing the data mining task, i.e., deciding whether the goal of the data mining
process is classification, association, clustering, etc., or a combination of them.
6. Choosing the data mining algorithms.
7. Data mining: using the selected algorithms to discover hidden patterns in data.
8. Post-processing the discovered patterns, i.e., analyzing the patterns automatically
or semi-automatically to identify those truly interesting/useful patterns for the user.
There are many statistical concepts that are the basis for data mining techniques. Here is a
brief review of some of these concepts.
Point estimation is a well-known and computationally tractable tool for learning the
parameters of a data mining model. It can be used for many data mining tasks such as
summarization and time-series prediction. Summarization is the process of extracting or
deriving representative information about the data. Point estimation is used to estimate
mean, variance, standard deviation, or any other statistical parameter for describing the
data. In time-series prediction, point estimation is used to predict one or more values
appearing later in a sequence by calculating parameters for a sample.
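A minimal illustration, assuming an arbitrary sample of values: the usual point estimates of the mean, variance and standard deviation can be computed directly.

import numpy as np

# Assumed sample of transaction amounts (illustrative data only).
sample = np.array([23.0, 41.5, 37.0, 55.2, 30.8, 44.1, 39.9, 28.4])

print("mean:", sample.mean())
print("variance:", sample.var(ddof=1))        # unbiased sample variance
print("std. deviation:", sample.std(ddof=1))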
Several methods exist for obtaining point estimates, including least squares, the method
of moments, maximum likelihood estimation, Bayes estimators, and robust estimation.
The method of moments, introduced by Karl Pearson circa 1894, is one of the oldest
methods of determining estimates.
The r-th sample moment, calculated from a sample X_1, ..., X_n, is

φ_r = (1/n) Σ_{i=1}^{n} X_i^r        (2.1)
(i) Express the first k population moments as functions of the k unknown parameters
θ1, ..., θk.
(ii) Equate the population moments obtained in step (i) to the corresponding sample
moments calculated using the above equation, and solve for θ1, ..., θk as the estimates of
the parameters.
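A small worked example, not taken from the text: for an exponential distribution the first population moment is E[X] = 1/λ, so equating it to the first sample moment gives λ̂ = 1/X̄. The sample below is generated with an assumed true scale of 2.0.

import numpy as np

# Assumed sample believed to follow an exponential distribution with unknown rate.
rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=1000)    # true scale = 1/lambda = 2.0

# Step (i): for the exponential distribution, E[X] = 1 / lambda.
# Step (ii): equate E[X] to the first sample moment and solve for lambda.
first_sample_moment = sample.mean()
lambda_hat = 1.0 / first_sample_moment

print("method-of-moments estimate of lambda:", lambda_hat)   # close to 0.5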
Sir Ronald A. Fisher, circa 1920, introduced the method of maximization of likelihood
functions. Given a random sample X1, X2, ..., Xn distributed with the density (mass)
function f(x; Θ), the likelihood function of the random sample is the joint probability
density function, denoted by

L(Θ) = f(x1, x2, ..., xn; Θ)        (2.2)

In the above equation, Θ is the set of unknown population parameters {θ1, ..., θk}. If the
random sample consists of random variables that are independent and identically
distributed with a common density function f(x; Θ), the likelihood function can be
reduced to

L(Θ) = ∏_{i=1}^{n} f(xi; Θ)        (2.3)

which is the product of the individual density functions evaluated at each sample point.
The maximum likelihood estimate, therefore, is a set of parameter values Θ̂ = {θ̂1, ..., θ̂k}
that maximizes the likelihood function of the sample. A well-known approach to finding Θ̂
is to take the derivative of L, set it equal to zero and solve for Θ. Thus, Θ̂ can be obtained
by solving the likelihood equation

∂L(Θ)/∂Θ = 0        (2.4)
ALGORITHM 2.1
Input:
Ø = { Ø1,……., Øp} //Parameters to be estimated
Xobs ={x1,……..,xk} //Input database values observed
Xmiss={xk+1,…..,xn} //Input database values missing
Output:
Ô //Estimates for Ø
EM algorithm:
    i := 0;
    Obtain initial parameter MLE estimate, Ôi;
    repeat
        Estimate missing data, Xi_miss;
        i++;
        Obtain next parameter estimate, Ôi, to maximize likelihood;
    until estimates converge;
An initial set of estimates for the parameters is obtained. Given these estimates and the
training data as input, the algorithm 2.1 then calculates a value for the missing data. For
example, it might use the estimated mean to predict a missing value. These data are then
used to determine an estimate for the mean that maximizes the likelihood. These steps are
applied iteratively until successive parameter estimates converge. Any approach can be
used to find the initial parameter estimates. In algorithm 2.1 it is assumed that the input
database has actual observed values Xobs ={x1,……..,xk} as well as values that are
missing Xmiss={xk+1,…..,xn}. It is assumed that the entire database is actually X= Xobs
∪ Xmiss.
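As an illustration only (not part of the thesis model), the following minimal Python sketch mimics the spirit of Algorithm 2.1 for a single mean with missing values: the missing entries are imputed with the current estimate (E-step) and the mean is then re-estimated from the completed data (M-step) until successive estimates converge. The data values are invented.

# Minimal sketch of the EM idea from Algorithm 2.1 for one parameter (a mean).
def em_mean(x_obs, n_missing, tol=1e-6, max_iter=100):
    theta = sum(x_obs) / len(x_obs)                   # initial MLE from observed values
    for _ in range(max_iter):
        x_miss = [theta] * n_missing                  # E-step: estimate the missing data
        completed = x_obs + x_miss
        new_theta = sum(completed) / len(completed)   # M-step: re-estimate the mean
        if abs(new_theta - theta) < tol:              # stop when estimates converge
            break
        theta = new_theta
    return theta

print(em_mean([1.0, 5.0, 10.0, 4.0], n_missing=2))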
There are several different ways (estimators) to estimate unknown parameters. In order to
assess the usefulness of estimators, some criteria are necessary to measure the
performance of estimators. These are – bias, mean squared error, standard error,
efficiency and consistency.
2.5.2.1 Bias
The bias of an estimator is the difference between the expected value of the estimator and the actual value of the parameter:

Bias(θ̂) = E[θ̂] − θ
An unbiased estimator is one whose bias is 0. While point estimators for small data sets
may actually be unbiased, for larger database applications we would expect that most
estimators are biased.
2.5.2.2 Mean Squared Error (MSE)
The MSE is defined as the expected value of the squared difference between the estimate and the actual value:

MSE(θ̂) = E[(θ̂ − θ)²]
2.5.2.3 Standard Error
The standard error gives a measure of the precision of the estimators. The standard error of an estimator is the standard deviation of its sampling distribution.
The sample mean can be used as an example to illustrate the concept of standard error.
Let f(x) represent a probability density function with finite variance σ² and mean μ. Let X̄ be the sample mean for a random sample of size n drawn from this distribution. By the Central Limit Theorem, the distribution of X̄ is approximately normal with mean μ and variance σ²/n. So the standard error is given by

SE(X̄) = σ_X̄ = σ / √n                (2.8)
When the standard deviation σ for the underlying population is unknown, an estimate S of that parameter can be used as a substitute, which leads to the estimated standard error

SÊ(X̄) = σ̂_X̄ = S / √n                (2.9)
2.5.2.4 Efficiency
Another measure used to compare estimators is called efficiency. Suppose there are two
estimators Ò and Õ for a parameter Ø based on the sample X1,…..,Xn. If the MSE of one
estimator is less than the MSE of the other, i.e. MSE(Ò) < MSE(Õ), then the estimator Ò
is said to be more efficient than Õ. The relative efficiency of Ò with respect to Õ is defined as the ratio

rel.eff(Ò, Õ) = MSE(Õ) / MSE(Ò)

If this ratio is greater than one, then Ò is a more efficient estimator of the parameter Ø. When the estimators are unbiased, the ratio is just the ratio of their variances, and the most efficient estimator is the one with minimum variance.
2.5.2.5 Consistency
Unlike the four measures defined previously, consistency is defined for increasing sample sizes, not a fixed sample size. Like efficiency, consistency is also defined using the MSE: an estimator is consistent if its MSE tends to zero as the sample size increases. When the MSE is written in terms of bias and variance, this holds if and only if both the bias and the variance tend to zero.
One resampling technique used to assess an estimator is the jackknife, in which an estimate is obtained by leaving out one observation at a time. For example, the jackknife estimate of the mean with the ith value omitted is

μ̂(i) = ( Σ_{j=1}^{i−1} xj + Σ_{j=i+1}^{n} xj ) / (n − 1)                (2.12)
Here the subscript (i) indicates that this estimate is obtained by omitting the ith value.
Given a set of jackknife estimates θ̂(i), these can in turn be used to obtain an overall estimate

θ̂(·) = ( Σ_{j=1}^{n} θ̂(j) ) / n                (2.13)
There are many basic concepts that provide an abstraction and summarization of the data
as a whole. The basic well-known statistical concepts such as mean, variance, standard
deviation, median and mode are simple models of the underlying population. Fitting a
population to a specific frequency distribution provides an even better model of the data.
Of course, doing this with large databases that have multiple attributes, have complex
and/or multimedia attributes, and are constantly changing is not practical.
There are also many well-known techniques to display the structure of the data
graphically. For example, a histogram shows the distribution of the data. A box plot is a
more sophisticated technique that illustrates several different features of the population at
once.
Another visual technique to display data is called a scatter diagram. This is a graph on a
two-dimensional axis of points representing the relationships between x and y values.
P(h1 | xi) = P(xi | h1) P(h1) / [ P(xi | h1) P(h1) + P(xi | h2) P(h2) ]                (2.14)
Here P (h1 | xi) is called the posterior probability, while P (h1) is the prior probability
associated with hypothesis h1. P (xi ) is the probability of the occurrence of data value xi
and P (xi | h1) is the conditional probability that, given a hypothesis, the tuple satisfies it.
P(xi) = Σ_{j=1}^{m} P(xi | hj) P(hj)                (2.15)
Thus we have
P(h1 | xi) = P(xi | h1) P(h1) / P(xi)                (2.16)
Bayes rule allows us to assign probabilities to hypotheses given a data value, P(hj | xi). Here we speak of tuples, although in actuality each xi may be an attribute value or other data label. Each hj may be an attribute value, a set of attribute values, or even a combination of attribute values.
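As a small illustration of equations (2.15) and (2.16), the following Python sketch computes a posterior for two hypotheses from assumed prior and conditional probabilities; the numbers are invented for illustration only.

# Sketch of Bayes rule: posterior P(h1 | xi) from priors P(hj) and likelihoods P(xi | hj).
def posterior(prior, likelihood, target):
    p_x = sum(likelihood[h] * prior[h] for h in prior)     # P(xi), equation (2.15)
    return likelihood[target] * prior[target] / p_x        # P(h | xi), equation (2.16)

prior = {"h1": 0.6, "h2": 0.4}        # assumed prior probabilities of the hypotheses
likelihood = {"h1": 0.2, "h2": 0.5}   # assumed P(xi | h) for the observed value xi
print(posterior(prior, likelihood, "h1"))   # posterior P(h1 | xi) = 0.375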
Hypothesis testing attempts to find a model that explains the observed data by first
creating a hypothesis and then testing the hypothesis against the data. The hypothesis
usually is verified by examining a data sample. If the hypothesis holds for the sample, it
is assumed to hold for the population in general. Given a population, the initial
hypothesis to be tested, H0, is called the null hypothesis. Rejection of the null hypothesis
causes another hypothesis, H1, called the alternative hypothesis, to be made.
One technique to perform hypothesis testing is based on the use of the chi-squared
statistic. Actually, there is a set of procedures referred to as chi squared. These
procedures can be used to test the association between two observed variable values and
to determine if a set of observed variable values is statistically significant. A hypothesis
is first made, and then the observed values are compared based on this hypothesis.
Assuming that O represents the observed data and E represents the values expected under the hypothesis, the chi-squared statistic, χ², is defined as

χ² = Σ (O − E)² / E                (2.17)
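The following short Python sketch applies equation (2.17) to a made-up set of observed counts against an assumed hypothesis of equal expected frequencies; the values are illustrative only.

# Sketch of the chi-squared statistic of equation (2.17).
def chi_squared(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [18, 22, 20, 40]
expected = [25, 25, 25, 25]          # hypothesis of equal frequencies
print(chi_squared(observed, expected))   # larger values cast more doubt on H0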
Both bivariate regression and correlation can be used to evaluate the strength of a
relationship between two variables. Regression is generally used to predict future values
based on past values by fitting a set of points to a curve. Correlation, however, is used to
examine the degree to which the values for two variables behave similarly.
Linear regression assumes that a linear relationship exists between the input data and the
output data. The common formula for a linear relationship is used in this model:

y = c0 + c1 x1 + c2 x2 + … + cn xn
Here there are n input variables, which are called predictors or regressors; one output
variable (the variable being predicted), which is called the response; and n + 1 constants,
which are chosen during the modeling process to match the input examples. This is called
multiple linear regression because there is more than one predictor.
The decision tree method of decision analysis uses a tree structure to illustrate the
decision process. Probabilities are assigned to events, and the expected value of each
alternative is determined. The alternative with the most attractive total expected value is
chosen. Depending on the decision, the most attractive expected value may be the highest
or lowest number.
It is based on the “Twenty Questions” game that children play, as illustrated by Example
2.1. Figure 2.2 graphically shows the steps in the game. This tree has as the root the first
question asked. Each subsequent level in the tree consists of questions at that stage in the
game. Nodes at the third level show questions asked at the third level in the game. Leaf
nodes represent a successful guess as to the object being predicted. This represents a
correct prediction. Each question successfully divides the search space much as a binary
search does. As with a binary search, questions should be posed so that the remaining
space is divided into two equal parts. Often young children tend to ask poor questions by
being too specific, such as initially asking "Is it my Mother?" This is a poor approach
because the search space is not divided into two equal parts.
EXAMPLE 2.1
Mudra and Vikas are playing a game of “Twenty Questions”. Vikas has in mind some
object that Mudra tries to guess with no more than 20 questions. Mudra’s first question is
“Is this object alive?” Based on Vikas’s answer, Mudra then asks a second question. Her
second question is based on the answer that Vikas provides to the first question. Suppose
that Vikas says "yes" as his first answer. Mudra’s second question is "Is this a person?"
When Vikas responds “yes”, Mudra asks “Is it a friend?”. When Vikas says “no”, Mudra
then asks “Is it someone in my family?”. When Vikas responds “yes”, Mudra then begins
asking the names of family members and can immediately narrow down the search space
to identify the target individual. This game is illustrated in Figure 2.2.
[Figure 2.2: Decision tree for the "Twenty Questions" game of Example 2.1, with root question "Alive?" and successive questions such as "Person?", "Friend?", "In Family?" and "Mom?" leading to the final guess]
DEFINITION 2.4. A decision tree (DT) is a tree where the root and each internal node is
labeled with a question. The arcs emanating from each node represent each possible
answer to the associated question. Each leaf node represents a prediction of a solution to
the problem under consideration.
The building of the tree may be accomplished via an algorithm that examines data from a
training sample or could be created by a domain expert. Most decision tree techniques
differ in how the tree is created. Algorithm 2.2 shows the basic steps in applying a tuple
to the DT, step three in Definition 2.5. We assume here that the problem to be performed
is one of prediction, so the last step is to make the prediction as dictated by the final leaf
node in the tree. The complexity of the algorithm is straightforward to analyze. For each
tuple in the database, we search the tree from the root down to a particular leaf. At each
level, the maximum number of comparisons to make depends on the branching factor at
that level. So the complexity depends on the product of the number of levels and the
maximum branching factor.
ALGORITHM 2.2
Input:
T //Decision Tree
D //Input database
Output:
M //Model prediction
DTProc algorithm:
//Simplistic algorithm to illustrate prediction technique using DT
for each t ∈ D do
    n = root node of T;
    while n not leaf node do
        Obtain answer to question on n applied to t;
        Identify arc from n which contains correct answer;
        n = node at end of this arc;
    Make prediction for t based on labeling of n;
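As a concrete, hedged rendering of the same idea, the following Python sketch routes a tuple from the root of a small decision tree to a leaf and returns the leaf label. The tree, attributes and values are invented for illustration and are not the trees built later in this thesis.

# Sketch of DT-based prediction in the spirit of Algorithm 2.2.
class Node:
    def __init__(self, attribute=None, children=None, label=None):
        self.attribute = attribute      # question asked at an internal node
        self.children = children or {}  # answer -> child node (the arcs)
        self.label = label              # prediction stored at a leaf

def predict(tree, tuple_):
    n = tree
    while n.label is None:              # while n is not a leaf node
        answer = tuple_[n.attribute]    # answer to the question on n
        n = n.children[answer]          # follow the arc with the matching answer
    return n.label                      # prediction for the tuple

tree = Node("Gender", {
    "F": Node("Height", {"short": Node(label="Medium"), "tall": Node(label="Tall")}),
    "M": Node("Height", {"short": Node(label="Short"), "tall": Node(label="Tall")}),
})
print(predict(tree, {"Gender": "F", "Height": "tall"}))   # -> Tall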
We use Example 2.2 to further illustrate the use of decision trees.
[Figure: Decision tree for Example 2.2, splitting first on Gender (=F or =M) and then on Height]
EXAMPLE 2.2
2.6.1 Strengths
Decision trees have several advantages. Here is a list of a few of the many advantages
decision trees have to offer.
• Decision trees are easy to understand and map nicely to a set of production
rules.
• Decision trees have been successfully applied to real problems.
• Decision trees make no prior assumptions about the nature of the data.
• Decision trees are able to build models with datasets containing numerical as
well as categorical data.
2.6.2 Weaknesses
• Output attributes must be categorical, and multiple output attributes are not
allowed.
• Decision tree algorithms are unstable in that slight variations in the training
data can result in different attribute selections at each choice point within the
tree. The effect can be significant as attribute choices affect all descendant subtrees.
• Trees created from numeric datasets can be quite complex as attribute splits
for numeric data are typically binary.
Neural networks offer a mathematical model that attempts to mimic the human brain.
Knowledge is often represented as a layered set of interconnected processors. These
processor nodes are frequently referred to as neurodes so as to indicate a relationship with
the neurons of the brain. Each node has a weighted connection to several other nodes in
adjacent layers. Individual nodes take the input received from connected nodes and use
the weights together with a simple function to compute output values.
Neural networks, with their remarkable ability to derive meaning from complicated or
imprecise data, can be used to extract patterns and detect trends that are too complex to
be noticed by either humans or other computer techniques. A trained neural network can
be thought of as an “expert” in the category of information it has been given to analyze.
This expert can then be used to provide projections given new situations of interest and
answer “what if” questions.
The NN approach, like decision trees, requires that a graphical structure be built to
represent the model and then that the structure be applied to the data. The NN can be
viewed as a directed graph with source (input), sink (output) and internal (hidden) nodes.
The input nodes exist in an input layer, while the output nodes exist in an output layer.
The hidden nodes exist over one or more hidden layers. To perform the data mining task,
a tuple is input through the input nodes and the output node determines what the
prediction is. Unlike decision trees, which have only one input node (the root of the tree),
the NN has one input node for each attribute value to be examined to solve the data
mining function. Unlike decision trees, after a tuple is processed, the NN may be changed
to improve future performance. Although the structure of the graph does not change, the
labeling of the edges may change.
DEFINITION 2.6. A neural network (NN) is a directed graph, F=(V,A) with vertices
V= {1,2,….,n} and arcs A={(i,j) | 1<=i,j<=n}, with the following restrictions.
1. V is partitioned into set of input nodes, VI, hidden nodes, VH and output
nodes, Vo.
2. The vertices are also partitioned into layers {1,….,k} with all input nodes
in layer 1 and output nodes in layer k. All hidden nodes are in layers 2 to
k-1 which are called the hidden layers.
3. Any arc (i,j) must have node i in layer h-1 and node j in layer h.
4. Arc (i,j) is labeled with a numeric value wij.
5. Node i is labeled with a function fi.
Definition 2.6 is a very simplistic view of NNs. Although there are many more
complicated types that do not fit this definition, this defines the most common type of
NN.
Figure 2.4 shows a fully connected feed-forward neural network structure together with a
single input instance [1.0,0.4,0.7]. Arrow indicates the direction of flow for each new
instance as it passes through the network. The network is fully connected because nodes
at one layer are connected to all nodes in the next layer.
The number of input attributes found within individual instances determines the number
of input layer nodes. The user specifies the number of hidden layers as well as the
number of nodes within a specific hidden layer. Determining a best choice for these
values is a matter of experimentation. In practice, the total number of hidden layers is
usually restricted to two. Depending on the application, the output layer of the neural
network may contain one or several nodes.
[Figure 2.4: Fully connected feed-forward neural network with input instance [1.0, 0.4, 0.7], hidden nodes such as i and j, an output node k, and weights such as w1i, w2j, w3j and wjk labeling the arcs]
The input to individual neural network nodes must be numeric and fall in the closed
interval range [0, 1]. Because of this, we need a way to numerically represent categorical
data. We also require a conversion method for numerical data falling outside the [0, 1]
range.
The output nodes of a neural network represent continuous values in the [0, 1] range.
However, the output can be transformed to accommodate categorical class values.
The purpose of each node within a feed-forward neural network is to accept input values
and pass an output value to the next higher network layer. The nodes of the input layer
pass input attribute values to the hidden layer unchanged. Therefore for the input instance
shown in figure 2.4, the output of node 1 is 1.0, the output of node 2 is 0.4 and the output
of node 3 is 0.7.
Table 2.3: Initial Weight Values for the Neural Network Shown in Figure 2.4
A hidden or output layer node n takes input from the connected nodes of the previous
layer, combines the previous node values into a single value, and uses the new value as
input to an evaluation function. The output of the evaluation function is a number in the
closed interval [0, 1]. This value represents the output of node n.
Let’s look at an example. Table 2.3 shows sample weight values for the neural network
of Figure 2.4. Consider node j. To compute the input to node j, we determine the sum
total of the multiplication of each input weight by its corresponding input layer node
value. That is:
Therefore 0.25 represents the input value for node j’s evaluation function.
The first criterion of an evaluation function is that the function must output values in the
[0, 1] interval range. A second criterion is that the function should output a value close to
1 when sufficiently excited. The sigmoid function is computed as:

f(x) = 1 / (1 + e^(−x))                (2.20)
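The following tiny Python sketch shows how a hidden or output node combines its inputs and applies the sigmoid evaluation function of equation (2.20). The weights below are assumed stand-ins chosen so that node j receives the input value 0.25 mentioned in the text; they are not the actual Table 2.3 values.

# Sketch of a node's weighted input and sigmoid output.
import math

def node_output(inputs, weights):
    s = sum(x * w for x, w in zip(inputs, weights))   # weighted sum of incoming values
    return s, 1.0 / (1.0 + math.exp(-s))              # sigmoid f(x) = 1/(1+e^-x)

inputs = [1.0, 0.4, 0.7]          # the input instance of Figure 2.4
weights = [0.2, 0.3, -0.1]        # assumed weights (illustrative only)
s, y = node_output(inputs, weights)
print(s, y)                       # s = 0.25, y ≈ 0.562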
Character Recognition – The idea of character recognition has become very important
as handheld devices like the Palm Pilot are becoming increasingly popular. Neural
networks can be used to recognize handwritten characters.
Image Compression – Neural networks can receive and process vast amounts of
information at once, making them useful in image compression. With the Internet
explosion and more sites using more images on their sites, using neural networks for
image compression is worth a look.
Stock Market Prediction – The day-to-day business of the stock market is extremely
complicated. Many factors weigh in whether a given stock will go up or down on any
given day. Since neural networks can examine a lot of information quickly and sort it all
out, they can be used to predict stock prices.
Traveling Salesman’s Problem – Interestingly enough, neural networks can solve the
traveling salesman problem, but only to a certain degree of approximation.
Medicine, Electronic Nose, Security and Loan Applications – These are some
applications that are in their proof-of-concept stage, with the exception of a neural
network that will decide whether or not to grant a loan, something that has already been
used more successfully than many humans.
2.7.6 Strengths
• Neural networks work well with datasets containing large amounts of noisy input
data. Neural network evaluation functions such as the sigmoid function
naturally smooth input data variations caused by outliers and random error.
• Neural networks can process and predict numeric as well as categorical
outcomes. However, categorical data conversions can be tricky.
• Neural networks can be used for applications that require a time element to be
included in the data.
• Neural networks have performed consistently well in several domains.
• Neural networks can be used for both supervised learning and unsupervised
clustering.
2.7.7 Weaknesses
• Probably the biggest criticism of neural networks is that they lack the ability
to explain their behavior.
• Neural network learning algorithms are not guaranteed to converge to an
optimal solution. With most types of neural networks, the problem can be
dealt with by manipulating various learning parameters.
• Neural networks can easily be overtrained to the point of working well on the
training data but poorly on test data. This problem can be monitored by
consistently measuring test set performance.
Genetic algorithms are different from other heuristic methods in several ways. The most
important difference is that a GA works on a population of possible solutions, while other
heuristic methods use a single solution in their iterations. Another difference is that GAs
are probabilistic (stochastic), not deterministic.
Each individual in the GA population represents a possible solution to the problem. The
suggested solution is coded into the “genes” of the individual. One individual might have
these genes:”1100101011”, another has these:”0101110001” (just examples). The values
(0 or 1) and their position in the “gene string” tell the genetic algorithm what solution the
individual represents.
GAs can be used wherever optimization is needed, that is, wherever there are many possible solutions to a problem and we have to find the best one. For example, GAs can be used to find good moves in chess and to attack mathematical and financial problems, among many other areas.
Fitness: Fitness is the value assigned to an individual. It is based on how far or close an individual is from the solution. The greater the fitness value, the better the solution it contains.
Fitness function: Fitness function is a function which assigns fitness value to the
individual. It is problem specific.
Breeding: Taking two fit individuals and intermingling their chromosomes to create two new individuals.
Crossover: The first genetic operator; it forms new elements for the population by combining parts of two elements currently in the population.
Mutation: A second genetic operator is sparingly applied to elements chosen for
elimination. Mutation can be applied by randomly flipping bits (or attribute values)
within a single element.
Selection: A third genetic operator that is sometimes used. With selection, the elements
deleted from the population are replaced by copies of elements that pass the fitness test
with high scores.
ALGORITHM 2.3
Input:
P //Initial population
Output:
P’ //Improved population
Genetic algorithm:
//Algorithm to illustrate genetic algorithm
repeat
    N = |P|;
    P' = ∅;
    repeat
        i1, i2 = select(P);
        o1, o2 = cross(i1, i2);
        o1 = mutate(o1);
        o2 = mutate(o2);
        P' = P' ∪ {o1, o2};
    until |P'| = N;
    P = P';
until termination criteria satisfied;
Algorithm 2.3 outlines the steps performed by a genetic algorithm. Initially, a population
of individuals, P, is created. Although different approaches can be used to perform this
step, they typically are generated randomly. From this population, a new population, P’,
of the same size is created. Algorithm 2.3 repeatedly selects individuals from which
to create new ones. These parents, i1 and i2, are then used to produce two offspring, o1 and o2,
using a crossover process. Then mutants may be generated. The process continues until
the new population satisfies the termination condition.
We assume here that the entire population is replaced with each iteration. An alternative
would be to replace the two individuals with the smallest fitness. Although this algorithm
is quite general, it is representative of all genetic algorithms. There are many variations
on this general theme.
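Purely as an illustration of this generational loop, the following Python sketch evolves 10-bit strings whose fitness is the number of 1 bits. The encoding, fitness function, tournament-style selection and parameter values are assumptions made for this sketch and are not taken from the thesis.

# Sketch of the generational GA loop of Algorithm 2.3 on a toy bit-string problem.
import random

def fitness(ind):                 # problem-specific fitness function: count of 1 bits
    return sum(ind)

def select(pop):                  # pick two parents, biased towards fitter individuals
    return tuple(max(random.sample(pop, 3), key=fitness) for _ in range(2))

def cross(a, b):                  # single-point crossover
    point = random.randint(1, len(a) - 1)
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(ind, rate=0.05):       # flip bits with a small probability
    return [1 - g if random.random() < rate else g for g in ind]

pop = [[random.randint(0, 1) for _ in range(10)] for _ in range(20)]
for generation in range(50):
    new_pop = []
    while len(new_pop) < len(pop):
        i1, i2 = select(pop)
        o1, o2 = cross(i1, i2)
        new_pop.extend([mutate(o1), mutate(o2)])
    pop = new_pop               # replace the entire population each generation

print(max(pop, key=fitness))    # best individual found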
2.8.3 Applications of GA
2.8.4 Strengths of GA
• The major advantage to the use of genetic algorithms is that they are easily
parallelized.
• It can quickly scan a vast solution set. Bad proposals do not affect the end
solution negatively as they are simply discarded. The inductive nature of the
GA means that it doesn’t have to know any rules of the problem – it works by
its own internal rules. This is very useful for complex or loosely defined
problems.
2.8.5 Weaknesses of GA
2.9 CLASSIFICATION
Classification is the most familiar and most popular data mining technique. Examples of
classification applications include image and pattern recognition, medical diagnosis, loan
approval, detecting faults in industry applications, and classifying financial market
trends. Estimation and prediction may be viewed as types of classification.
2.9.1.1 Regression
Regression problems deal with an estimation of an output value based on input values.
When used for classification, the input values are values from the database and the output
values represent the classes. Regression can be used to solve classification problems, but
it can also be used for other applications such as forecasting.
There are many reasons why the linear regression model may not be used to estimate
output data. One is that the data do not fit a linear model. It is possible, however, that the
data generally do actually represent a linear model, but the linear model generated is poor
because noise or outliers exist in the data. Noise is erroneous data. Outliers are data
values that are exceptions to the actual and expected data.
Suppose there are k points in the training sample; then there are k formulas of the form yi = c0 + c1x1i + εi.
With a simple linear regression, given an observable value (x1i, yi), ε i is the error, and
thus the squared error technique introduced in the above section can be used to indicate
the error. To minimize the error, a method of least squares is used to minimize the least
square error. This approach finds coefficients c0,c1 so that the squared error is minimized
for the set of observable values. The sum of the squares of the errors is
L = Σ_{i=1}^{k} εi² = Σ_{i=1}^{k} (yi − c0 − c1 x1i)²                (2.23)
Taking the partial derivatives (with respect to the coefficients) and setting them equal to zero, we can obtain the least squares estimates for the coefficients, c0 and c1.
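As a brief sketch of this least squares step, the following Python code computes c0 and c1 from the usual closed-form solution of the normal equations; the data points are invented for illustration.

# Sketch of simple linear regression by least squares, as in equation (2.23).
def least_squares(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    c1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
         sum((x - mean_x) ** 2 for x in xs)
    c0 = mean_y - c1 * mean_x
    return c0, c1

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
print(least_squares(xs, ys))    # roughly c0 ≈ 0.05 and c1 ≈ 1.99 for these points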
Regression can be used to perform classification using two different approaches:
1. Division: The data are divided into regions based on class.
2. Prediction: Formulas are generated to predict the output class value.
If the predictors in the linear regression function are modified by some function (square, square root, etc.), then the model looks like

y = c0 + c1 f1(x1) + c2 f2(x2) + … + cn fn(xn)

where fi is the function being used to transform the predictor. In this case the regression is
called nonlinear regression. Linear regression techniques, while easy to understand, are
not applicable to most complex data mining applications. They do not work well with
nonnumeric data. They also make the assumption that the relationship between the input
value and the output value is linear, which of course may not be the case.
Linear regression is not always appropriate because the data may not fit a straight line,
but also because the straight line values can be greater than 1 and less than 0. Thus, they
certainly cannot be used as the probability of occurrence of the target class. Another
commonly used regression technique is called logistic regression. Instead of fitting the
data to a straight line, logistic regression uses a logistic curve such as illustrated in
Figure. The formula for a univariate logistic curve is
p = e^(c0 + c1x1) / (1 + e^(c0 + c1x1))                (2.25)
The logistic curve gives a value between 0 and 1 so it can be interpreted as the
probability of class membership. As with linear regression, it can be used when
classification into two classes is desired. To perform the regression, the logarithmic
function can be applied to obtain the logistic function
log( p / (1 − p) ) = c0 + c1 x1                (2.26)
Here p is the probability of being in the class and 1-p is the probability that it is not.
However, the process chooses values for c0 and c1 that maximize the probability of observing the values.
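The shape of the logistic curve of equation (2.25) can be seen with the small Python sketch below; the coefficients c0 and c1 are arbitrary illustrative choices.

# Sketch of the univariate logistic curve: output p always lies between 0 and 1.
import math

def logistic(x1, c0, c1):
    z = c0 + c1 * x1
    return math.exp(z) / (1.0 + math.exp(z))

for x1 in (-4, -1, 0, 1, 4):
    print(x1, round(logistic(x1, c0=0.5, c1=1.2), 3))   # p rises smoothly towards 1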
Assuming that the contributions of all attributes are independent and that each contributes
equally to the classification problem, a simple classification scheme called naïve Bayes
classification has been proposed that is based on Bayes rule of conditional probability as
stated in Definition 2.3. This approach was briefly outlined in an earlier section. By analyzing the contribution of each independent attribute, a conditional probability is determined.
Given a training set, the naïve Bayes algorithm first estimates the prior probability P( Cj )
for each class by counting how often each class occurs in the training data. For each
attribute, xi, the number of occurrences of each attribute value xi can be counted to
determine P (xi ). Similarly, the probability P (xi | Cj ) can be estimated by counting how
often each value occurs in the class in the training data. A tuple in the training data may
have many different attributes, each with many values. This must be done for all
attributes and all values of attributes. We then use these derived probabilities when a new
tuple must be classified. This is why naïve Bayes classification can be viewed as both a
descriptive and a predictive type of algorithm. The probabilities are descriptive and are
then used to predict the class membership for a target tuple.
When classifying a target tuple, the conditional and prior probabilities generated from the
training set are used to make the prediction. This is done by combining the effects of the
different attribute values from the tuple. Suppose that tuple ti has p independent attribute
values {xi1, xi2, ……., xip}. From the descriptive phase, we know P ( xik | Cj ), for each
class Cj and attribute xik. We then estimate P ( ti | Cj ) by
P(ti | Cj) = ∏_{k=1}^{p} P(xik | Cj)                (2.27)
At this point in the algorithm, we then have the needed prior probabilities P ( Cj ) for
each class and the conditional probability P ( ti | Cj ). To calculate P (ti ), we can estimate
the likelihood that ti is in each class. This can be done by finding the likelihood that this
tuple is in each class and then adding all these values. The probability that ti is in a class
is the product of the conditional probabilities for each attribute value. The posterior
probability P ( Cj | ti ) is then found for each class. The class with the highest probability
is the one chosen for the tuple.
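A minimal Python sketch of this descriptive-then-predictive process is given below: priors P(Cj) and conditionals P(xik | Cj) are counted from a tiny invented training set, and a target tuple is assigned to the class with the highest unnormalised posterior P(Cj) · ∏k P(xik | Cj). The data and attribute values are illustrative only.

# Sketch of naïve Bayes training (counting) and classification per equation (2.27).
from collections import Counter, defaultdict

def train(rows, labels):
    n = len(labels)
    prior = {c: cnt / n for c, cnt in Counter(labels).items()}   # P(Cj)
    cond = defaultdict(lambda: defaultdict(float))               # P(xik | Cj)
    for c in prior:
        class_rows = [r for r, l in zip(rows, labels) if l == c]
        for k in range(len(rows[0])):
            for value, cnt in Counter(r[k] for r in class_rows).items():
                cond[c][(k, value)] = cnt / len(class_rows)
    return prior, cond

def classify(t, prior, cond):
    score = {c: prior[c] for c in prior}
    for c in prior:
        for k, value in enumerate(t):
            score[c] *= cond[c].get((k, value), 0.0)             # multiply conditionals
    return max(score, key=score.get)

rows = [("F", "short"), ("F", "tall"), ("M", "tall"), ("M", "short")]
labels = ["medium", "tall", "tall", "short"]
prior, cond = train(rows, labels)
print(classify(("F", "tall"), prior, cond))   # -> "tall"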
2.9.1.2.1 Strengths
• It is easy to use.
• Unlike other classification approaches, only one scan of the training data is
required.
• The naïve Bayes approach can easily handle missing values by simply omitting
that probability when calculating the likelihoods of membership in each class.
• In cases where there are simple relationships, the technique often does yield good
results.
2.9.1.2.2 Weaknesses
• Although the naïve Bayes approach is straightforward to use, it does not always
yield satisfactory results.
• The technique does not handle continuous data. Dividing the continuous values
into ranges could be used to solve this problem, but the division of the domain
into ranges is not an easy task, and how this is done can certainly impact the
results.
Each item that is mapped to the same class may be thought of as more similar to the other
items in that class than it is to items found in other classes. Therefore, similarity (or
distance) measures may be used to identify the “alikeness” of different items in the
database.
Using a similarity measure for classification where the classes are predefined is
somewhat simpler than using a similarity measure for clustering where the classes are not
known in advance.
To calculate these similarity measures, the representative vector for each class must be
determined. A simple classification technique, then, would be to place each item in the
class where it is most similar to the center of that class. The representative for the class
may be found in other ways. For example, in pattern recognition problems, a predefined
pattern can be used to represent each class. Once a similarity measure is defined, each
item to be classified will be compared to each predefined pattern. The item will be placed
in the class with the largest similarity value. Algorithm 2.4 illustrates a straightforward
distance-based approach assuming that each class, ci, is represented by its center or
centroid. In Algorithm 2.4, ci denotes the center for its class. Since each tuple
must be compared to the center for a class and there are a fixed number of classes, the
complexity to classify one tuple is O (n).
ALGORITHM 2.4
Input:
c1, ……, cm // Centers for each class
t // Input tuple to classify
Output:
c //Class to which t is assigned
One common classification scheme based on the use of distance measures is that of the K
nearest neighbors (KNN). The KNN technique assumes that the entire training set
includes not only the data in the set but also the desired classification for each item. In
effect, the training data become the model. When a classification is to be made for a new
item, its distance to each item in the training set must be determined. Only the K closest
entries in the training set are considered further. The new item is then placed in the class
that contains the most items from this set of K closest items.
Algorithm 2.5 outlines the use of the KNN algorithm. We use T to represent the training
data. Since each tuple to be classified must be compared to each element in the training
data, this is O(q). Given n elements to be classified, this becomes an O(nq) problem.
Given that the training data are of a constant size, this can be viewed as an O(n) problem.
ALGORITHM 2.5
Input:
T //Training data
K //Number of neighbors
t //Input tuple to classify
Output:
C //Class to which t is assigned
KNN algorithm:
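A minimal Python sketch of the same idea (not the thesis's Algorithm 2.5 itself) is shown below: distances from the new item to every training item are computed, the K closest are kept, and the majority class among them is returned. The one-dimensional training data and the value of K are invented for illustration.

# Sketch of K nearest neighbors classification.
from collections import Counter

def knn(training, k, t):
    # training: list of (value, class) pairs; t: item to classify
    by_distance = sorted(training, key=lambda pair: abs(pair[0] - t))
    nearest = by_distance[:k]                                 # the K closest entries
    return Counter(c for _, c in nearest).most_common(1)[0][0]

training = [(2, "low"), (3, "low"), (4, "low"), (10, "high"), (12, "high"), (11, "high")]
print(knn(training, k=3, t=9))    # -> "high"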
The decision tree approach is most useful in classification problems. With this technique,
a tree is constructed to model the classification process. Once the tree is built, it is applied
to each tuple in the database and results in a classification for that tuple. There are two
basic steps in the technique: building the tree and applying the tree to the database. Most
research has focused on how to build effective trees as the application process is
straightforward.
The decision tree approach to classification is to divide the search space into rectangular regions. A tuple is classified based on the region into which it falls. A definition for a decision tree used in classification is contained in Definition 2.10. There are alternative definitions; for example, in a binary DT the nodes could be labeled with the predicates themselves and each arc would be labeled with yes or no (like in the "Twenty Questions" game).
DEFINITION 2.10 Given a database D = {t1, ….. ,tn} where ti=<ti1, ….., tin> and the
database schema contains the following attributes {A1,A2, ……., An}. Also given is a set
of classes C= {C1, …., Cm}. A decision tree (DT) or classification tree is a tree associated
with D that has the following properties:
There are many advantages to the use of DTs for classification. DTs are certainly easy to
use and efficient. Rules can be generated that are easy to interpret and understand. They
scale well for large databases because the tree size is independent of the database size.
Each tuple in the database must be filtered through the tree. This takes time proportional
to the height of the tree, which is fixed. Trees can be constructed for data with many
attributes.
ALGORITHM 2.6
Input:
D //Training data
Output:
T //Decision tree
DTBuild algorithm:
//Simplistic algorithm to illustrate naïve approach to building DT
T = ∅;
Determine best splitting criterion;
T = Create root node and label with splitting attribute;
T = Add arc to root node for each split predicate and label;
for each arc do
    D = Database created by applying splitting predicate to D;
    if stopping point reached for this path, then
        T' = Create leaf node and label with appropriate class;
    else
        T' = DTBuild(D);
    T = Add T' to arc;
Disadvantages also exist for DT algorithms. First, they do not easily handle continuous
data. These attribute domains must be divided into categories to be handled. Handling
missing data is difficult because correct branches in the tree could not be taken. Since the
DT is constructed from the training data, overfitting may occur. This can be overcome via
tree pruning. Finally, correlations among attributes in the database are ignored by the DT
process.
2.9.3.1 ID3
The concept used to quantify information is called entropy. Entropy is used to measure
the amount of uncertainty or surprise or randomness in a set of data. Certainly, when all
data in a set belong to a single class, there is no uncertainty. In this case the entropy is
zero.
ALGORITHM 2.7
The algorithm is based on Occam's razor: it prefers smaller decision trees over larger ones. However, it does not always produce the smallest tree, and is therefore a heuristic. Occam's razor is formalized using the concept of information entropy:

IE(i) = − Σ_{j=1}^{m} f(i, j) log2 f(i, j)                (2.28)
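The behaviour of equation (2.28) can be seen with the short Python sketch below, where f(i, j) is the relative frequency of class j at node i; the class counts are invented.

# Sketch of information entropy at a node: zero for a pure node, maximal for an even split.
import math

def entropy(class_counts):
    total = sum(class_counts)
    freqs = [c / total for c in class_counts if c > 0]
    return -sum(f * math.log2(f) for f in freqs)

print(entropy([8, 0]))   # 0.0  -- all examples in one class, no uncertainty
print(entropy([4, 4]))   # 1.0  -- maximum uncertainty for two classes
print(entropy([6, 2]))   # ≈ 0.811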
2.9.3.2 C4.5
At each node of the tree, C4.5 chooses one attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. Its criterion is the normalized information gain (difference in entropy) that results from choosing an attribute for splitting the data. The attribute with the highest normalized information gain is chosen to make the decision. The C4.5 algorithm then recurses on the smaller sublists.
• None of the features provide any information gain. In this case, C4.5 creates a
decision node higher up the tree using the expected value of the class.
• An instance of a previously-unseen class is encountered. Again, C4.5 creates a
decision node higher up the tree using the expected value.
ALGORITHM 2.8
2.9.3.3 CART
Φ(s/t) = 2 PL PR Σ_{j=1}^{m} | P(Cj | tL) − P(Cj | tR) |                (2.29)
This formula is evaluated at the current node, t, and for each possible splitting attribute and criterion, s. Here L and R are used to indicate the left and right subtrees of the current node in the tree. PL and PR are the probabilities that a tuple in the training set will be on the left or right side of the tree. This is defined as |tuples in subtree| / |tuples in training set|. We assume that the right branch is taken on equality. P(Cj | tL) or P(Cj | tR) is the probability that a tuple is in class Cj and in the left or right subtree. This is defined as |tuples of class j in subtree| / |tuples at the target node|. At each step, only one criterion is chosen as the best over all possible criteria.
CART handles missing data by simply ignoring that record in calculating the goodness of
a split on that attribute. The tree stops growing when no split will improve the
performance. Even though it is the best for the training data, it may not be the best for all
possible data to be added in the future. The CART algorithm also contains a pruning
strategy.
With neural networks (NNs), just as with decision trees, a model representing how to
classify any given database tuple is constructed. The activation functions typically are
sigmoidal. When a tuple must be classified, certain attribute values from that tuple are
input into the directed graph at the corresponding source nodes. There often is one sink
node for each class. The output value is generated indicates the probability that the
corresponding input tuple belongs to that class. The tuple will then be assigned to the
class with the highest probability of membership. The learning process modifies the
labeling of the arcs to better classify tuples. Given a starting structure and value for all
the labels in the graph, as each tuple in the training set is sent through the network, the
projected classification made by the graph can be compared with the actual classification.
Based on the accuracy of the prediction, various labeling in the graph can change. This
learning process continues with all the training data or until the classification accuracy is
adequate.
2.9.4.1 Propagation
The normal approach used for processing is called propagation. Given a tuple of values
input to the NN, X = <x1,x2,…….,xn>, one value is input at each node in the input layer.
Then the summation and activation functions are applied at each node, with an output
value created for each output arc from that node. These values are in turn sent to the
subsequent nodes. This process continues until a tuple of output values, Y = <y1,….,ym>,
is produced from the nodes in the output layer. The process of propagation is shown in
algorithm 2.9 using a neural network with one hidden layer. Here a hyperbolic tangent
activation function is used for nodes in the output layer. We assume that the constant c in
the activation function has been provided. We also use k to denote the number of edges coming into a node.
ALGORITHM 2.9
Input:
N //neural network
X=<x1,x2,…….,xn> //Input tuple consisting of values for input attributes only
Output:
Y=<y1,y2,……,ym> //Tuple consisting of output values from NN
Propagation algorithm:
//Algorithm illustrates propagation of a tuple through a NN
for each node i in the input layer do
    output xi on each output arc from i;
for each hidden layer do
    for each node i do
        Si = Σ_{j=1}^{k} (wji xji);
        output yi = 1 / (1 + e^(−c·Si));
The NN starting state is modified based on feedback of its performance with the data in
the training set. This type of learning is referred to as supervised because it is known a
priori what the desired output should be. Unsupervised learning can also be performed if
the output is not known. With unsupervised approaches, no external teacher set is used. A
training set may be provided, but no labeling of the desired outcome is included.
Supervised learning in an NN is the process of adjusting the arc weights based on its
performance with a tuple from the training set. The training set can be used as a “teacher”
during the training process. The output from the network is compared to this known
desired behavior. Algorithm 2.10 outlines the steps required.
ALGORITHM 2.10
Input:
N //Starting neural network
X //Input tuple from training set
D //Output tuple desired
Output:
N //Improved neural network
Suplearn algorithm:
//Simplistic algorithm to illustrate approach to NN learning
Propagate X through N producing output Y;
Calculate error by comparing D to Y;
Assuming that the output from node i is yi but should be di, the error produced from a
node in any layer can be found by
| yi – di | (2.30)
(yi − di)² / 2                (2.31)
ALGORITHM 2.11
Input:
N //Starting neural network
X=<x1,x2,…..,xn> //Input tuple from training set
D=<d1,d2,…...,dm> //Output tuple desired
Output:
N //Improved neural network
Backpropagation algorithm:
//Illustrate backpropagation
Propagation(N, X);
E = 1/2 Σ_{i=1}^{m} (di − yi)²;
Gradient(N, E);
ALGORITHM 2.12
Input:
N //Starting neural network
E //Error found from back algorithm
Output:
N //Improved neural network
Gradient algorithm:
//Illustrates incremental gradient descent
for each node i in output layer do
    for each node j input to i do
        Δwji = η (di − yi) yj (1 − yi) yi;
        wji = wji + Δwji;
layer = previous layer;
for each node j in this layer do
    for each node k input to j do
        Δwkj = η yk ((1 − yj²) / 2) Σ_m (dm − ym) wjm ym (1 − ym);
The weights may be changed either after the entire training set has been processed or after each tuple in the training set is applied. The incremental technique is usually preferred because it requires less space and may actually examine more potential solutions (weights), thus leading to a better solution.
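As a small, hedged sketch of the output-layer update used in Algorithm 2.12, the Python snippet below applies the weight change η (di − yi) yj (1 − yi) yi to a single arc; all the numeric values are invented for illustration.

# Sketch of one incremental gradient-descent update for an output-layer weight wji.
def delta_wji(eta, d_i, y_i, y_j):
    # weight change for the arc from node j into output node i
    return eta * (d_i - y_i) * y_j * (1 - y_i) * y_i

w_ji = 0.3
w_ji = w_ji + delta_wji(eta=0.5, d_i=1.0, y_i=0.6, y_j=0.8)
print(w_ji)    # 0.3 + 0.5 * 0.4 * 0.8 * 0.4 * 0.6 = 0.3384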
A radial function or a radial basis function (RBF) is a class of functions whose value
decreases (or increases) with the distance from a central point. An RBF has a Gaussian
shape, and an RBF network is typically an NN with three layers. The input layer is used
to simply input the data. A Gaussian activation function is used at the hidden layer, while
a linear activation function is used at the output layer. The objective is to have the hidden
nodes learn to respond only to a subset of the input, namely, that where the Gaussian
function is centered. This is usually accomplished via supervised learning. When RBF
functions are used as the activation functions on the hidden layer, the nodes can be
sensitive to a subset of the input values. Figure 2.5 shows the basic structure of an RBF
unit with one output node.
[Figure 2.5: Structure of an RBF network with inputs X1, X2 and X3, Gaussian hidden units f1 and f2 with centers c1 and c2, and a single summed output y]
2.9.4.4 Perceptrons
2.10 CLUSTERING
Informally, a cluster can be described as follows:
• Set of like elements. Elements from different clusters are not alike.
• The distance between points in a cluster is less than the distance between a point
in the cluster and any point outside it.
With hierarchical clustering, a nested set of clusters is created. Each level in the hierarchy
has a set of clusters. At the lowest level, each item is in its own unique cluster. At the
highest level, all items belong to the same cluster. With hierarchical clustering, the
desired number of clusters is not input. With partitional clustering, the algorithm creates
only one set of clusters.
[Figure: Dendrogram for six items A, B, C, D, E and F]
The space complexity for hierarchical algorithms is O(n2) because this is the space
required for the adjacency matrix. The space required for the dendrogram is O(kn),
which is much less than O(n2). The time complexity for hierarchical algorithms is O(kn2)
because there is one iteration for each level in the dendrogram. Depending on the specific
algorithm, however, this could actually be O(maxd n2) where maxd is the maximum
distance between points.
Hierarchical techniques are well suited for many clustering applications that naturally
exhibit a nesting relationship between clusters. For example, in biology, plant and animal
taxonomies could easily be viewed as a hierarchy of clusters.
Agglomerative algorithms start with each individual item in its own cluster and
iteratively merge clusters until all items belong in one cluster. Different agglomerative
algorithms differ in how the clusters are merged at each level. Algorithm 2.13 illustrates
the typical agglomerative clustering algorithm. It assumes that a set of elements and
distances between them is given as input. We use an n * n vertex adjacency matrix, A, as
input. Here the adjacency matrix, A, contains a distance value rather than a simple
boolean value: A[i,j] = dis(ti,tj). The output of Algorithm 2.13 is a dendrogram, DE,
which we represent as a set of ordered triples <d,k,K> where d is the threshold distance, k
is the number of clusters, and K is the set of clusters.
ALGORITHM 2.13
Input:
D = {t1,t2,……,tn} //set of elements
A //Adjacency matrix showing distance between elements
Output:
DE // Dendrogram represented as a set of ordered triples
Agglomerative algorithm:
d=0;
k=n;
K={{t1},….,{tn}};
DE={<d,k,K>}; // Initially dendrogram contains each element
in its own cluster.
repeat
oldk=k;
d=d+1;
Ad=Vertex adjacency matrix for graph with threshold distance of d;
<k,K>=NewClusters(Ad,D);
if oldk <> k then
DE=DE U <d,k,K>; // New set of clusters added to dendrogram
until k=1
The single link technique is based on the idea of finding maximal connected components
in a graph. A connected component is a graph in which there exists a path between any
two vertices. With the single link approach, two clusters are merged if there is at least one
edge that connects two clusters; that is, if the minimum distance between any two points
is less than or equal to the threshold distance being considered. For this reason, it is often
called the nearest neighbor clustering technique.
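A compact Python sketch of the single link idea is given below: at every step the two clusters whose closest members are nearest to each other are merged, until one cluster remains. One-dimensional points are used so that the distance is simply the absolute difference; the data are invented for illustration.

# Sketch of single link (nearest neighbour) agglomerative clustering.
def single_link(points):
    clusters = [[p] for p in points]          # each item starts in its own cluster
    dendrogram = [list(clusters)]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single link: minimum distance between any pair of members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]     # merge the two closest clusters
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        dendrogram.append(list(clusters))
    return dendrogram

for level in single_link([1, 2, 8, 9, 20]):
    print(level)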
Although the complete link algorithm is similar to the single link algorithm, it looks for
cliques rather than connected components. A clique is a maximal graph in which there is
an edge between any two vertices. Here a procedure is used to find the maximum
distance between any clusters so that two clusters are merged if the maximum distance is
less than or equal to the distance threshold. In this algorithm, we assume the existence of
a procedure, clique, which finds all cliques in a graph.
The average link technique merges two clusters if the average distance between any two
points in the two target clusters is below the distance threshold.
With divisive clustering, all items are initially placed in one cluster and clusters are
repeatedly split in two until all items are in their own cluster. The idea is to split up
clusters where some elements are not sufficiently close to other elements.
Since the clustering problem is to define a mapping, the output of this algorithm shows
the clusters as a set of ordered pairs <ti,j> where f(ti) = Kj.
ALGORITHM 2.14
Input:
D = {t1,t2,…..,tn} // Set of elements
A //Adjacency matrix showing distance between elements
k // Number of desired clusters
Output:
f // Mapping represented as a set of ordered pairs
Partitional MST Algorithm:
M = MST(A)
identify inconsistent edges in M;
remove k-1 inconsistent edges;
create output representation;
The squared error clustering algorithm 2.15 minimizes the squared error. The squared
error for a cluster is the sum of the squared Euclidean distances between each element in
the cluster and the cluster centroid, Ck. Given a cluster Ki, let the set of items mapped to
that cluster be {ti1,ti2,……,tim}. The squared error is defined as
seKi = Σ_{j=1}^{m} || tij − Ck ||²                (2.32)

seK = Σ_{j=1}^{k} seKj                (2.33)
ALGORITHM 2.15
Input:
D={t1,t2,……,tn} // set of elements
k // Number of desired clusters
Output:
K // Set of clusters
Squared error algorithm:
assign each item ti to a cluster;
calculate center for each cluster;
repeat
assign each item ti to the cluster which has the closest center;
calculate new center for each cluster;
calculate squared error;
until the difference between successive squared errors is below a threshold;
The K-Means algorithm (Lloyd, 1982) is a simple yet effective statistical clustering
technique. It is an iterative clustering algorithm 2.16 in which items are moved among
sets of clusters until the desired set is reached.
mi = (1/m) Σ_{j=1}^{m} tij                (2.34)
ALGORITHM 2.16
Input:
D = {t1,t2,….,tn} //Set of elements
k //Number of desired clusters
Output :
K //Set of clusters
K-means algorithm:
assign initial values for means m1, m2,…, mk;
repeat
    assign each item to the cluster which has the closest mean;
    calculate new mean for each cluster;
until convergence criteria is met;
EXAMPLE 2.3
Suppose that the items to be clustered are
{2,4,10,12,3,20,30,11,25}
and suppose that k=2. We initially assign the means to the first two values: m1=2 and m2=4. Using Euclidean distance, the successive means and clusters are:
m1 m2 K1 K2
3 18 {2,3,4,10} {12,20,30,11,25}
4.75 19.6 {2,3,4,10,11,12} {20,30,25}
7 25 {2,3,4,10,11,12} {20,30,25}
Note that the clusters in the last two steps are identical. This will yield identical means,
and thus the means have converged. Our answer is thus K1={2,3,4,10,11,12} and
K2={20,30,25}.
The time complexity of K-means is O(tkn) where t is the number of iterations. K-means
finds a local optimum and may actually miss the global optimum.
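The iteration of Example 2.3 can be reproduced with the short Python sketch below, run on the same data with k = 2 and the first two values as initial means; it converges to the same final clusters K1 = {2,3,4,10,11,12} and K2 = {20,30,25}. The implementation details are a sketch only and assume that no cluster becomes empty during the iterations.

# Sketch of the K-means loop of Algorithm 2.16 on one-dimensional data.
def k_means(items, means, max_iter=100):
    for _ in range(max_iter):
        clusters = [[] for _ in means]
        for x in items:                                     # assign to the closest mean
            idx = min(range(len(means)), key=lambda i: abs(x - means[i]))
            clusters[idx].append(x)
        new_means = [sum(c) / len(c) for c in clusters]     # recompute each mean
        if new_means == means:                              # convergence reached
            return clusters
        means = new_means
    return clusters

print(k_means([2, 4, 10, 12, 3, 20, 30, 11, 25], means=[2, 4]))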
2.10.3.3.1 Strengths
2.10.3.3.2 Weaknesses
• Although the K-means algorithm often produces good results, it is not time-
efficient and does not scale well.
• The algorithm only works with real-valued data. If we have a categorical
attribute in our dataset we must either discard the attribute or convert the
attribute values to numeric equivalents.
• The K-means algorithm works best when the clusters that exist in the data are
of approximately equal size. This being the case, if an optimal solution is
represented by clusters of unequal size, the K-Means algorithm is not likely to
find a best solution.
• There is no way to tell which attributes are significant in determining the
formed clusters. For this reason several irrelevant attributes can cause less
than optimal results.
An algorithm similar to the single link technique is called the nearest neighbor algorithm.
With this serial algorithm, items are iteratively merged into the existing clusters that are
closest. In this algorithm a threshold, t, is used to determine if items will be added to
existing clusters or if a new cluster is created.
ALGORITHM 2.17
Input:
D={t1,t2,….,tn} // Set of elements
A // Adjacency matrix showing distance between elements
Output:
K // Set of clusters
Nearest neighbor algorithm:
K1 = {t1};
K = {K1};
k = 1;
for i = 2 to n do
    find the tm in some cluster Km in K such that dis(ti,tm) is the smallest;
    if dis(ti,tm) <= t then
        Km = Km ∪ {ti};
    else
        k = k + 1;
        Kk = {ti};
The PAM (partitioning around medoids) algorithm, also called the K-medoids algorithm,
represents a cluster by a medoid. Using a medoid is an approach that handles outliers
well. Initially, a random set of k items is taken to be the set of medoids. Then at each
step, all items from the input dataset that are not currently medoids are examined one by
one to see if they should be medoids. By looking at all pairs of medoid, non-medoid
objects, the algorithm chooses the pair that improves the overall quality of the clustering
the best and exchanges them. Quality here is measured by the sum of all distances from a
non-medoid object to the medoid for the cluster it is in. An item is assigned to the cluster
represented by the medoid to which it is closest (minimum distance).
The total impact to quality by a medoid change TCih is given by
TCih = Σ_{j=1}^{n} Cjih                (2.35)
ALGORITHM 2.18
Input:
D={t1,t2,….,tn} // Set of elements
A // Adjacency matrix showing distance between elements
k // Number of desired clusters
Output:
K // Set of clusters
PAM algorithm:
arbitrarily select k medoids from D;
repeat
for each th not a medoid do
for each medoid ti do
calculate TCih;
find i,h where TCih is the smallest;
if TCih < 0, then
replace medoid ti with th;
until TCih >= 0;
for each ti ∈ D do
assign ti to Kj, where dis(ti,tj) is the smallest over all medoids;
2.10.4.1 BIRCH
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is designed for clustering large databases and incorporates an outlier handling technique. BIRCH applies only to numeric data. Algorithm 2.19 uses a tree called a CF tree as defined in Definition 2.12.
DEFINITION 2.12: A clustering feature (CF) is a triple (N,LS,SS), where the number of
the points in the cluster is N, LS is the sum of the points in the cluster, and SS is the sum
of the squares of the points in the cluster.
DEFINITION 2.13: A CF tree is a balanced tree with a branching factor B. Each internal
node contains a CF triple for each of its children. Each leaf node also represents a cluster
and contains a CF entry for each subcluster in it. A subcluster in a leaf node must have a
diameter no greater than a given threshold value T.
ALGORITHM 2.19
Input:
  D = {t1, t2, …, tn}   // Set of elements
  T                     // Threshold for CF tree construction
Output:
  K                     // Set of clusters
BIRCH clustering algorithm:
  for each ti ∈ D do
    determine correct leaf node for ti insertion;
    if threshold condition is not violated, then
      add ti to cluster and update CF triples;
    else
      if room to insert ti, then
        insert ti as single cluster and update CF triples;
      else
        split leaf node and redistribute CF features;
BIRCH is linear in both space and I/O time. The choice of threshold values is critical
to an efficient execution of the algorithm; otherwise, the tree may have to be rebuilt
many times to ensure that it can remain memory-resident. This gives a worst-case time
complexity of O(n²).
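One reason BIRCH can build the CF tree in a single scan is that the CF triple of Definition 2.12 is additive: two subclusters can be merged by simply adding their triples, and the centroid and diameter needed for the threshold test can be derived from the triple alone. The following Python sketch illustrates this for one-dimensional points; the function names and the exact layout of the diameter formula are assumptions of this sketch rather than the thesis code.

def cf(points):
    # Clustering feature (N, LS, SS) for a list of one-dimensional points
    n = len(points)
    ls = sum(points)                      # linear sum of the points
    ss = sum(p * p for p in points)       # sum of the squared points
    return n, ls, ss

def merge_cf(cf1, cf2):
    # CF triples are additive, so two subclusters merge component-wise
    return tuple(a + b for a, b in zip(cf1, cf2))

def centroid_and_diameter_sq(cf_triple):
    n, ls, ss = cf_triple
    centroid = ls / n
    # average pairwise squared distance, compared against the threshold T
    diameter_sq = (2 * n * ss - 2 * ls * ls) / (n * (n - 1)) if n > 1 else 0.0
    return centroid, diameter_sq

a = cf([100.0, 120.0])
b = cf([110.0])
merged = merge_cf(a, b)
print(merged)                              # (3, 330.0, 36500.0)
print(centroid_and_diameter_sq(merged))    # centroid 110.0, squared diameter 200.0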
2.10.4.2 DBSCAN
DEFINITION 2.14: Given values Eps and MinPts, a point p is directly density-reachable
from a point q if
• dis(p,q) <= Eps and
• | {r | dis(r,q) <= Eps} | >= MinPts
ALGORITHM 2.20
Input:
D={t1,t2,….,tn} //Set of elements
MinPts // Number of points in cluster
Eps // Maximum distance for density measure
Output:
K={K1,K2,….,KK}
DBSCAN algorithm:
  k = 0;    // Initially there are no clusters
  for i = 1 to n do
    if ti is not in a cluster, then
      X = {tj | tj is density-reachable from ti};
      if X is a valid cluster, then
        k = k + 1;
        Kk = X;
2.10.4.2.1 Strengths
• DBSCAN does not require the number of clusters to be specified in advance.
• DBSCAN can find arbitrarily shaped clusters and explicitly identifies noise points (outliers).
2.10.4.2.2 Weaknesses
• The quality of the clustering that DBSCAN produces depends on the distance measure
used in the function getNeighbors(P, epsilon). The most common distance metric used is
the Euclidean distance measure; especially for high-dimensional data, this metric can be
rendered almost useless (a minimal sketch of this neighborhood search appears below).
• DBSCAN does not respond well to data sets with varying densities.
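Because the algorithm listing above is abbreviated, the following minimal Python sketch shows the usual flow of DBSCAN, built around a get_neighbors helper corresponding to the getNeighbors(P, epsilon) function mentioned in the first weakness. The function names, the use of Euclidean distance and the small example are assumptions of this sketch, not the thesis implementation.

import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def get_neighbors(points, i, eps):
    # indices of all points within Eps of points[i], including i itself
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    labels = {}                                  # point index -> cluster id, -1 means noise
    cluster_id = 0
    for i in range(len(points)):
        if i in labels:
            continue
        neighbors = get_neighbors(points, i, eps)
        if len(neighbors) < min_pts:             # not a core point: mark as noise for now
            labels[i] = -1
            continue
        cluster_id += 1                          # start a new cluster around core point i
        labels[i] = cluster_id
        seeds = [j for j in neighbors if j != i]
        while seeds:                             # expand the cluster through core points
            j = seeds.pop()
            if labels.get(j) == -1:
                labels[j] = cluster_id           # a noise point becomes a border point
            if j in labels:
                continue
            labels[j] = cluster_id
            j_neighbors = get_neighbors(points, j, eps)
            if len(j_neighbors) >= min_pts:      # j is itself a core point: keep expanding
                seeds.extend(j_neighbors)
    return labels

data = [(1,), (2,), (3,), (10,), (11,), (12,), (50,)]
print(dbscan(data, eps=2.0, min_pts=2))          # two clusters plus one noise point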
One objective of the CURE (Clustering Using Representatives) clustering algorithm is to
handle outliers well. It has both a hierarchical component and a partitioning component.
First, a constant number of points, c, are chosen from each cluster. These well-scattered
points are then shrunk toward the cluster’s centroid by applying a shrinkage factor, α.
When α is 1, all points are shrunk to just one point, the centroid. These points represent the
cluster better than a single point (such as a medoid or centroid) could. With multiple
representative points, clusters of unusual shapes can be better represented. CURE then
uses a hierarchical clustering algorithm. At each step in the agglomerative algorithm, the
clusters with the closest pair of representative points are chosen to be merged. The
distance between them is defined as the minimum distance between any pair of points in
the representative sets from the two clusters.
In Algorithm 2.21, we assume that each entry u in the heap contains the set of
representative points, u.rep; the mean of the points in the cluster, u.mean; and the cluster
closest to it, u.closest. We use the heap operations: heapify to create the heap, min to
extract the minimum entry in the heap, insert to add a new entry, and delete to delete an
entry. A merge procedure is used to merge two clusters. In CURE, a k-D tree is used to
assist in the merging of clusters.
ALGORITHM 2.21
Input:
D= {t1,t2,….,tn} //Set of elements
k // Desired number of clusters
Output:
Q // Heap containing one entry for each cluster
CURE algorithm:
  T = build(D);
  Q = heapify(D);    // Initially build heap with one entry per item
  repeat
    u = min(Q);
    v = u.closest;
    delete(Q, v);
    w = merge(u, v);
    delete(T, u);
    delete(T, v);
    insert(T, w);
    for each x ∈ Q do
      x.closest = find closest cluster to x;
      if x is closest to w, then
        w.closest = x;
    insert(Q, w);
  until number of nodes in Q is k;
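The two steps that distinguish CURE, choosing c well-scattered representative points and shrinking them toward the centroid by the factor α, can be sketched in a few lines of Python. The function names, the farthest-point selection heuristic and the example cluster below are assumptions of this illustration, not the thesis code.

import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

def scattered_points(points, c):
    # pick c well-scattered points: start from the point farthest from the centroid,
    # then repeatedly add the point farthest from those already chosen
    centre = centroid(points)
    chosen = [max(points, key=lambda p: dist(p, centre))]
    while len(chosen) < min(c, len(points)):
        chosen.append(max((p for p in points if p not in chosen),
                          key=lambda p: min(dist(p, q) for q in chosen)))
    return chosen

def shrink(representatives, centre, alpha):
    # move each representative a fraction alpha of the way toward the centroid;
    # alpha = 1 collapses all representatives onto the centroid itself
    return [tuple(r[d] + alpha * (centre[d] - r[d]) for d in range(len(r)))
            for r in representatives]

cluster = [(0.0, 0.0), (0.0, 4.0), (4.0, 0.0), (4.0, 4.0), (2.0, 2.0)]
centre = centroid(cluster)
reps = scattered_points(cluster, c=3)
print(shrink(reps, centre, alpha=0.5))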
A comparison of the different clustering algorithms, based on type, space and time
complexity and on whether each is incremental or iterative, is given in Table 2.4.
There are some guidelines for selecting an appropriate data mining technique.
1. Does the data contain several missing values?
Most data mining researchers agree that, if applicable, neural networks tend to
outperform other models when a wealth of noisy data are present.
2. Is time an issue?
Algorithms for building decision trees and production rules typically execute
much faster than neural network or genetic learning approaches.
3. Do we know the distribution of the data?
Datasets containing more than a few hundred instances can be a problem for
data mining techniques that require the data to conform to certain standards.
For example, many statistical techniques assume the data to be normally
distributed.
4. Do we know which attributes best define the data to be modeled?
Decision trees and certain statistical approaches can determine those attributes
most predictive of class membership. Neural networks, nearest neighbor and
various clustering approaches assume attributes to be of equal importance.
This is a problem when several attributes not predictive of class membership
are present in the data.
5. Which technique is most likely to give best classification accuracy?
For a particular problem, some of these questions have obvious answers. For example, we
know a neural network is a black-box structure; therefore, this technique is a poor choice
if an explanation of what has been learned is required. Also, association rules are usually
the best choice when attributes are allowed to play multiple roles in the data mining process.
We can also select a data mining technique based on the data mining task we want to
perform. In Table 2.5, data mining problem types are related to appropriate modeling
techniques.
2.12 REFERENCES
CHAPTER 3
Financial Cyber Crime and Frauds
Cyber crime encompasses any criminal act dealing with computers and networks (called
hacking). Additionally, cyber crime also includes traditional crimes conducted through
the Internet. For example, hate crimes, telemarketing and Internet fraud, identity theft,
and credit card account thefts are considered to be cyber crimes when the illegal activities
are committed through the use of a computer and the Internet.
The Information Systems Security Association (ISSA), Ireland conducts the IRIS cyber crime
survey every year. They developed a questionnaire in which respondents indicated the
types of cyber crime incident which had affected their organization. Figure 3.1 details the
responses received in the year 2007.
Figure 3.1 Number of respondents reporting each type of cyber crime incident in the 2007
IRIS survey: system or network intrusion (internal source), electronic employee harassment
(external source), electronic financial fraud (external and internal), organisational identity
theft (e.g. cloned website), theft of intellectual property, phishing directed against the
organisation, telecom fraud, and attacks against manufacturing, SCADA or process control
systems.
One example of financial crime is, a website offered to sell Alphonso mangoes at a
throwaway price. Initially very few people responded to or supplied the website with
their credit card numbers. These people were actually sent the Alphonso mangoes. The
word about this website now spread like wildfire. Thousands of people from all over the
country responded to this site and ordered mangoes by providing their credit card
numbers. The owners of what was later proven to be a bogus website then fled taking the
numerous credit card numbers and proceeded to spend huge amounts of money much to
the chagrin of the card owners.
For an online transaction we simply have to type the credit card number, expiry date and
CVV number into the vendor's web page. If electronic transactions are not secured, the
credit card numbers can be stolen by hackers, who can misuse the card by impersonating
the credit card owner.
Copying the company’s confidential data in order to extort a huge amount from that
company.
3.3.3 Phishing
In such a crime the criminal makes insignificant changes in such a manner that the changes
go unnoticed. The criminal writes a program that deducts a small amount, such as
Rs. 1.00 per month, from the accounts of all the customers of a bank and deposits it in his
own account. No account holder will approach the bank over such a small amount, but the
criminal gains a huge amount.
This crime can be committed by sale and purchase through the net. There are web sites which
offer the sale and shipment of contraband drugs. They may use the technique of
steganography for hiding the messages.
Fraud may be defined as a dishonest or illegal use of services with the intention of avoiding
service charges. Frauds have plagued telecommunication industries, financial institutions
and other organizations for a long time. These frauds cost businesses a great deal every
year. As a result, fraud detection has become an important and urgent task for these
businesses. At present a number of methods have been implemented to detect frauds,
from both statistical approaches (e.g. data mining) and hardware approaches (e.g.
firewalls, smart cards).
We discuss the types of fraud like Credit card fraud, Telecommunications fraud and
Intrusion in computer systems.
Credit card fraud occurs when an individual uses another individual’s credit card for
personal reasons while the owner of the card and the card issuer are not aware of the fact
that the card is being used. Further, the individual using the card has no connection with the
cardholder or issuer, and has no intention of either contacting the owner of the card or
making repayments for the purchases made.
Generally we can categorize credit card fraud into two main types: 1. identity theft fraud
and 2. non-identity theft fraud.
While identity theft and what we call credit card fraud are both pernicious crimes, and
both constitute fraud, we would like to distinguish the two for policy purposes. We place
identity theft into two basic categories.
This involves the unlawful acquisition and use of another person’s identifying
information to obtain credit, or the use of that information to create a fictitious identity to
establish an account.
In order to commit identity theft by means of a fraudulent application, the perpetrator needs
to acquire not just a name, address or credit card number but unique identifiers such as the
mother’s maiden name, social security number and detailed information about a person’s
credit history, such as the amount of their most recent mortgage payment. This is why more
than 40 percent of the identity theft cases that we see are committed by someone familiar to
the victim, frequently a family member or someone in a position of intimacy or trust.
This variety of identity theft represents three percent of our total fraud cases.
This occurs when someone unlawfully uses another person’s identifying information to
take ownership of an account. This would typically occur by making an unauthorized
change of address followed by a request for a new product such as a card or check, or
perhaps a PIN number. This variety of identity theft represents less than one percent of
our total fraud cases.
3.5.1.2 Non-identity Theft Fraud-The Other 96 Percent of Our Total Fraud Cases
This type of fraud constitutes the vast majority of occurrences and falls under four basic
headings.
1) Lost or Stolen Cards: The card is actually in possession of the customer and is
subsequently lost or stolen.
2) Non-Receipt: The card is never received by the customer and is intercepted by
the perpetrator prior to or during mail delivery.
The prevention of credit card fraud is an important application for prediction techniques.
One major obstacle for the neural network training technique is the high diagnostic quality
required: since only about one financial transaction in a thousand is invalid, no prediction
success rate below 99.9% is acceptable.
Johnson defines telecommunications fraud as any transmission of voice or data across
a telecommunications network where the intent of the sender is to avoid or reduce
legitimate call charges. In a similar vein, Davis and Goyal define fraud as obtaining
unbillable services and undeserved fees.
There are many different types of telecoms fraud, and these can occur at various levels.
The two most prevalent types are subscription fraud and superimposed or ‘surfing’ fraud.
Subscription fraud: This occurs when a fraudster obtains a subscription to a service, often
with false identity details, with no intention of paying. The fraud is thus at the level of a phone
number – all transactions from this number will be fraudulent.
Superimposed fraud: This is the use of a service without having the necessary authority
and is usually detected by the appearance of ‘phantom’ calls on a bill. There are several
ways to carry out superimposed fraud, including mobile phone cloning and obtaining
calling card authorization details. Superimposed fraud will generally occur at the level of
individual calls – the fraudulent calls will be mixed in with the legitimate ones.
Subscription fraud will generally be detected at some point through the billing process –
though one would aim to detect it well before that, since large costs can quickly be run
up. Superimposed fraud can remain undetected for a long time.
Intrusion detection plays a vital role in today’s networked environment. Intrusions into
computer systems include unauthorized users penetrating the computer systems and
authorized users abusing their privileges. Intrusion into computer systems is the most
epidemic type of fraud since it is easy to commit. Furthermore, it is very difficult to trace
the intruders because they may hide in any corner of the world so long as they have the
Internet connection.
In recent years, computer security has become increasingly important and an international
priority. Intrusion detection techniques are largely categorized into two types: anomaly
detection and misuse detection.
Anomaly detection: In this technique, the task is focused on extracting normal (non-
fraudulent) usage patterns and finding out deviation from them.
Misuse detection: In this technique, the patterns of previous intrusions and the
vulnerable spots of a system are captured based on the historical data. Then, an intrusion
trail is compared with these identified previous patterns.
This would include cheating, credit card frauds, money laundering etc.
Credit-card fraud detection is especially challenging because the analyst needs to identify
both the physical theft of a card, as well as an individual's identity; this means stolen
cards, as well as cloned and personal identification number (PIN) thefts. This type of
fraud can also be the result of the theft of an individual's identification, such as his or her
home address, for the creation of new accounts under false or stolen identities.
Credit-card theft will defraud the credit-card issuer or merchant. It has a profile of many
small amounts, and an out-of-character purchasing pattern. The fraud activity is time-
constrained. The card will be reported as stolen at some point and identity theft will be
detected, at least by the next statement date. This time constraint forces perpetrators to
use the card rapidly and for amounts normally out of pattern—this is the signature of this
financial crime and a method to its detection. It is a crime where, inevitably, some loss
will occur before detection. This crime is both highly organized and opportunistic.
Internet and phone-order transactions are the classic card-not-present (CNP) sales. They
are also time-sensitive crimes, where the thieves are racing to beat the credit-card
monthly statement mailing date.
Clues to these perpetrators are the use of Web-based e-mail addresses and different
shipping and billing addresses.
This type of financial crime involves the manipulation and inflation of an individual's
credit rating prior to performing a "sting," leading to a loan default and a loss for the
financial service provider.
This financial crime relies on creating a false identity and takes time to develop. Once an
account has been created with a stolen or false identity, the marketing initiatives
employed by the bank or credit-card issuer assist the perpetrator in building a portfolio of
credit-cards, loan accounts, and a viable credit-rating and history—before defaulting on
them.
This financial crime involves the creation of fictitious bank accounts for the conduit of
money and the siphoning of other legitimate accounts. It may also be for fictitious
account purchases, particularly in association with investment accounts, bond and bearer
bond transactions.
Many of the methods of executing internal fraud are similar to money laundering, except
there is an obvious attempt to defraud the bank, whereas in money laundering the
objective is simply to hide the funds. In addition, this fraud often works in conjunction
with the establishment of creditworthy accounts, lines of credit, and fictitious accounts.
The sting is often a single or small number of large-volume transactions, often related to
real estate purchases, business investments, and the like.
messages between banks—"wire transfer"—is one way to swiftly move illegal profits
beyond the easy reach of law enforcement agents and at the same time begin to launder
the funds by confusing the audit trail.
To launder money is to disguise the origin or ownership of illegally gained funds to make
them appear legitimate. Hiding legitimately acquired money to avoid taxation, or moving
money for the financing of terrorist attacks also qualify as money laundering activities.
1. Placement: introducing cash into the banking system or into legitimate commerce
2. Layering: separating the money from its criminal origins by passing it through
several financial transactions, such as transferring it into and then out of several
bank accounts, or exchanging it for travelers' checks or a cashier's check
3. Integration: aggregating the funds with legitimately obtained money or providing
a plausible explanation for its ownership
Wire transfers of illicit funds are yet another key vehicle for moving and laundering
money through the vast electronic funds transfer systems. Using data mining technologies
and techniques for the identification of these illicit transfers could reveal previously
unsuspected criminal operations or make investigations and prosecutions more effective
by providing evidence of the flow of illegal profits.
There are many ways to launder money. Any system that attempts to identify money
laundering will need to evaluate wire transfers against multiple profiles. In addition,
money launderers are believed to change their MOs frequently. If one method is
discovered and used to arrest and convict a ring of criminals, activity will switch to
alternative methods. Law enforcement and intelligence community experts stress that
criminal organizations engaged in money laundering are highly adaptable and flexible.
For example, they may use non bank financial institutions, such as exchange houses and
check cashing services and instruments like postal money orders, cashier's checks, and
certificates of deposit. In this way, money launderers resemble individuals who engage in
ordinary fraud: They are adaptive and devise complex strategies to avoid detection. They
often assume their transactions are being monitored and design their schemes so that each
transaction fits a profile of legitimate activity.
As with other criminal detection applications the major obstacle to using data mining
techniques is the absence of data uniformity. Related issues, such as the absence of
experts, high costs, and privacy concerns, are being reevaluated in light of the recent
terrorist attacks. The post-9/11 environment is changing the priorities of years ago. One
of the biggest obstacles to using data mining to detect the use of wire transfers for illegal
money laundering was the poor quality of the data; ineffective standards did not ensure
that all the data fields in the reporting forms were complete and validated.
Insurance fraud and health care-related crimes are widespread and very costly to carriers,
the government, and the consumer public. Insurance fraud involves intentional deception
or misrepresentation intended to result in an unauthorized benefit. An example would be
billing for health care services that have not been rendered. Health care crime involves
charging for services that are not medically necessary, do not conform to professionally
recognized standards, or are unfairly priced. An example would be performing a
laboratory test on a large number of patients when only a few should have it. Health care
crime may be similar to insurance fraud, except that it is not possible to establish that the
abusive acts were done with intent to deceive the insurer.
False-claim schemes are the most common type of health-insurance fraud. The goal in
these schemes is to obtain undeserved payment for a claim or series of claims.
This includes billing for services, procedures, or supplies that were not provided or used,
as well as misrepresentation of what was provided, when it was provided, the condition
or diagnosis, the charges involved, or the identity of the provider recipient. This may also
involve providing unnecessary services or ordering unnecessary tests.
Illegal billing schemes involve charging a carrier for a service that was not performed.
This includes unbundling of claims—that is, billing separately for procedures that
normally are covered by a single fee. A variation is double billing, charging more than
once for the same service. Another is upcoding, the scam of charging for a more complex
service than was performed. This may also involve kickbacks, in which a person receives
payment or other benefits for making referrals.
Many instances have been discovered in which corrupt attorneys and health care
providers, usually chiropractors or medical clinics, combine to bill insurance companies
for nonexistent or minor injuries. The typical scam includes "cappers" or "runners," who
are paid to recruit legitimate or fake auto-accident victims or worker's compensation
claimants. Victims are commonly told they need multiple visits.
Mills fabricate diagnoses and reports, providing expensive, but unnecessary, services.
The lawyers then initiate negotiations on settlements based upon these fraudulent or
exaggerated medical claims.
3.6.1.11 Miscoding
analysis may be billed as one or more tests for vitamin deficiency. Nonstandard allergy
tests may be coded as standard ones.
Scams such as phishing, spyware and malware are responsible for online banking fraud.
3.7.1 Phishing
Phishing is the name given to the practice of sending emails at random, purporting to come
from a genuine company operating on the internet, in an attempt to trick customers of that
company into disclosing information at a bogus website operated by fraudsters. These
emails usually claim that it is necessary to ‘update’ or ‘verify’ your password, and they
urge the recipient to click on a link in the email that leads to the bogus website. Any
information entered on the bogus website will be captured by the criminals for their own
fraudulent purposes.
Phishing originated because the banks’ own systems have proved incredibly difficult to
attack. Criminals have therefore turned their attention to phishing attacks on individual
internet users in order to gain personal or secret information that can be used online for
fraudulent purposes.
3.7.2 Malware
Although the rising number of phishing incidents has undoubtedly helped to raise fraud
losses, we also know that online banking customers are increasingly being targeted by
malware attacks. Malware (malicious software) includes computer viruses that can be
installed on a computer without the user’s knowledge, typically when users click on a
link in an unsolicited email or download suspicious software. Malware is capable
of logging keystrokes, thereby capturing passwords and other financial information.
3.7.3 Spyware
Spyware is a type of computer virus that can be installed on a computer without the user
realizing it. Spyware is sometimes capable of acting as a ‘keystroke logger’, capturing all
of the keystrokes entered on a computer keyboard. Typically the fraudsters will send out
emails at random to get people to click on a link in the email and visit a malicious
website, where vulnerabilities on the customer’s computer are exploited to install the
spyware. The emails are not normally related to internet banking, and try to dupe people
into visiting, or clicking on the link to, the malicious website using a variety of excuses.
The Internet Crime Complaint Centre (IC3) was established with a mission to serve as a
vehicle to receive, develop, and refer criminal complaints regarding the rapidly
expanding arena of cyber crime. IC3 accepts online Internet crime complaints from either
the person who believes they were defrauded or from a third party to the complainant.
During 2008, non-delivery of merchandise and/or payment was by far the most reported
offense, comprising 32.9% of referred crime complaints. This represents a 32.1%
increase from the 2007 levels of non-delivery of merchandise and/or payment reported to
IC3. In addition, during 2008, auction fraud represented 25.5% of complaints (down
28.6% from 2007), and credit and debit card fraud made up an additional 9.0% of
complaints. Confidence fraud such as Ponzi schemes, computer fraud, and check fraud
complaints represented 19.5% of all referred complaints. Other complaint categories such
as Nigerian letter fraud, identity theft, financial institutions fraud, and threat complaints
together represented less than 9.7% of all complaints (See Figure 3.2).
Figure 3.2 2008 Top 10 IC3 Complaint Categories (percentage of referred complaints),
including non-delivery, auction fraud, computer fraud, check fraud, identity theft and threat
complaints.
Source : www.ic3.gov
During 2008, non-delivered merchandise and/or payment were, by far, the most reported
offense, comprising 32.9% of referred complaints. Internet auction fraud accounted for
25.5% of referred complaints. Credit/debit card fraud made up 9.0% of referred
complaints. Confidence fraud, computer fraud, check fraud, and Nigerian letter fraud
round out the top seven categories of complaints referred to law enforcement during the
year.
A key area of interest regarding Internet fraud is the average monetary loss incurred by
complainants contacting IC3 (See Figure 3.3). Such information is valuable because it
provides a foundation for estimating average Internet fraud losses in the general
population. To present information on average losses, two forms of averages are offered:
the mean and the median. The mean represents a form of averaging that is familiar to the
general public: the total dollar amount divided by the total number of complaints.
Because the mean can be sensitive to a small number of extremely high or extremely low
loss complaints, the median is also provided. The median represents the 50th percentile,
or midpoint, of all loss amounts for all referred complaints. The median is less
susceptible to extreme cases, whether high or low cost.
Of the 72,940 fraudulent referrals processed by IC3 during 2008, 63,382 involved a
victim who reported a monetary loss. Other complainants who did not file a loss may
have reported the incident prior to victimization (e.g., received a fraudulent business
investment offer online or in the mail), or may have already recovered money from the
incident prior to filing (e.g., zero liability in the case of credit/debit card fraud).
The total dollar loss from all referred cases of fraud in 2008 was $264.6 million. That loss
was greater than in 2007, when a total loss of $239.1 million was reported. Of those
complaints with a reported monetary loss, the mean dollar loss was $4,174.50 and the
median was $931.00. Nearly fifteen percent (14.8%) of these complaints involved losses of
less than $100.00, and 36.5% reported a loss between $100.00 and $1,000.00. In other
words, over half of these cases involved a monetary loss of less than $1,000.00. Nearly a
third (33.7%) of the complainants reported a loss between $1,000.00 and $5,000.00.
Figure 3.3 Distribution of Referred Complaints by Reported Dollar Loss, 2008 (loss ranges
from $.01–$99.99 up to $100,000.00 and over)
Source : www.ic3.gov
Amount Lost per Referred Complaint by Selected Complainant Demographics

Demographic      Average (Median) Loss Per Typical Complaint
Male             $993.76
Female           $860.98
Under 20         $500.00
20-29            $873.58
30-39            $900.00
40-49            $1,010.23
50-59            $1,000.00
60 and older     $1,000.00
3.8.2 Case Studies of APACS (UK Payment Association and UK Card Association)
Figure 3.4 Total Credit and Debit Card Fraud Losses in the U.K., 1998–2008 (£ millions)
Table 3.2 Plastic Card Fraud Losses by Type in the U.K. (£ millions)

Fraud Type        98    99    00    01    02    03    04    05    06    07    08
Card Not Present  13.6  29.3  72.9  95.7  110.1 122.1 150.8 183.2 212.7 290.5 328.4
Counterfeit       26.8  50.3  107.1 160.4 148.5 110.6 129.7 96.8  98.6  144.3 169.8
Lost/Stolen       65.8  79.7  101.9 114.0 108.3 112.4 114.5 89.0  68.5  56.2  54.1
Card ID Theft     16.8  14.4  17.4  14.6  20.6  30.2  36.9  30.5  31.9  34.1  47.4
Mail non-receipt  12.0  14.6  17.7  26.8  37.1  45.1  72.9  40.0  15.4  10.2  10.2
Total             135.0 188.4 317.0 411.5 424.6 420.4 504.8 439.4 427.0 535.2 609.9
APACS has been the forum for the co-operative activity of banks, building societies and
card issuers on payments and payment systems in the U.K. since the mid-80s. Figure 3.4
shows the total losses in £ millions from 1998 to 2008 in the U.K. due to credit and debit
card fraud. Table 3.2 shows how this plastic card fraud occurs category-wise: card-not-present,
counterfeit, lost/stolen, card ID theft and mail non-receipt.
3.8.2.2 Card fraud losses split by type (as percentage of total losses)
Figure 3.5 Percentage of Different Plastic Card Fraud Category in Year 1998
Source : www.cardwatch.org.uk
Figure 3.6 Percentage of Different Plastic Card Fraud Category in Year 2008
Source : www.cardwatch.org.uk
Figure 3.7 Financial Cyber Crime Losses in the U.K., 2000–2008 (£ millions)
Figure 3.7 shows financial cyber crime in £ millions from year 2000 to 2008 in U.K.
According to the Cybersource, 11th Annual Online Fraud Report, which is based on
U.S.A. and Canadian online merchants, from 2006 to 2008 the percent of online revenues
lost to payment fraud was stable. However, total dollar losses from online payment fraud
in the U.S. and Canada steadily increased during this period as eCommerce continued to
grow.
The percent of accepted orders which are later determined to be fraudulent also fell in
2009. In 2009, merchants reported an overall average fraudulent order rate of 0.9%, down
from 1.1% in 2008 for their U.S. and Canadian orders. Over the past six years the average
percent of accepted orders which turn out to be fraudulent has varied from 1.0% to 1.3%.
2009 represents the first time this rate has dropped below the 1% threshold. Among
industry sectors, Consumer Electronics reported the highest fraudulent order rate,
averaging 1.5%, but this was down from 2.0% in 2008.
Since 2007, the percent of orders rejected due to suspicion of fraud has fallen from 4.2%
to 2.4% in 2009, a decline of more than 40% in order rejection, representing a 1.8%
increase in total orders accepted.
3.10 REFERENCES
[1] Jesus Mena : Investigative Data Mining for Security and Criminal Detection
[2] Website – www.ic3.gov
[3] Website – www.cardwatch.org.uk
[4] Website – www.cybersource.com
[5] Website – www.issaireland.org/cybercrime
[6] Website – www.en.wikipedia.org
[7] Website – www.fas.org
[8] Website – www.citizencentre.virtualpune.com
[9] Website – www.indiaforensic.com
[10] Website – www.itbusinessedge.com
CHAPTER 4
Role of Data Mining in Financial Crime Detection
Today, industry is facing huge losses due to these types of financial crimes. If financial
crime can be detected through data mining techniques and prevented, it will be of great
benefit to the industry.
In this chapter we have suggested a two-tier architecture model for financial crime
detection. In the first stage the financial transaction is verified against the rule-based
system and is given a risk score by the system. These rules contain human insight. The
transaction is then passed to the second stage, a data mining technique, which learns from
the past experience of fraudulent transactions and then decides about the current
transaction. The accuracy of prediction therefore increases, as the financial transaction
has to pass through two stages: one a rule-based system and the second a data mining
technique based system.
Figure 4.1 shows the architecture of the two-stage solution for financial crime detection.
In the first stage, the rule-based system contains static rules which are generally based on
human knowledge, i.e. human insight. If the financial transaction passes through this
phase then it is passed to the second phase.
In the second stage, data mining techniques generate dynamic rules based on past
fraudulent transactions. Here learning is totally dynamic, so if the pattern of fraudulent
transactions changes then the model learns from the transactions and generates dynamic
rules for prediction of financial crime.
Figure 4.1 Two-Stage Architecture for Financial Crime Detection: a financial transaction
passes through Stage 1 (Rule Based System) and then Stage 2 (Data Mining Based System).
In this section we suggest the two-stage solution for each type of financial crime.
2. If the current transaction amount is very much greater than the average transaction
amount and the income range is medium, then recommendation = Fraud.
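As an illustration of how such a static rule could be encoded in the Stage 1 rule-based system, the following Python sketch checks the ratio of the current amount to the customer's average. The field names, the ratio threshold of 5 and the risk score of 80 are illustrative assumptions, not values taken from the thesis.

def rule_amount_vs_average(transaction, profile, ratio_threshold=5.0):
    # Stage 1 static rule: flag a transaction whose amount greatly exceeds the
    # customer's average while the customer's income range is medium
    amount = transaction["amount"]
    avg = profile["average_transaction_amount"]
    if avg > 0 and amount / avg >= ratio_threshold and profile["income_range"] == "medium":
        return "Fraud", 80               # recommendation and an illustrative risk score
    return "Genuine", 0

txn = {"amount": 52000}
profile = {"average_transaction_amount": 4000, "income_range": "medium"}
print(rule_amount_vs_average(txn, profile))      # ('Fraud', 80)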
4.2.1.2 Detection Technique: Sequencing of purchases will change; the merchant mix
will be out of character compared to previous consumer transactions. Frequency,
monetary, and recency (FMR) techniques can be examined and employed. Time-
sequence accumulated-risk scores may be used as an input to aggregated risk exposure.
A change in location may indicate a ring operation. There are a number of leads that
relate specifically to credit card and debit card fraud. They are common points-of-
purchase (CPP) detection, particularly with regard to new merchant agents. The main
method of detection is to look for outliers and changes in the normal patterns of usage.
A SOM neural network can be used to perform an autonomous clustering of patterns in
the data.
4.2.2.2 Detection Technique: Indicators include looking for repeated attempts with
slight variations of card numbers or the use of different names and addresses. Another
possible indication of trouble is an IP address at variance with other data. If
demographics are available, a model may be developed. The absence of certain data,
such as activity in a credit report, is also a signal of possible identity theft and fraud.
This type of financial crime involves the manipulation and inflation of an individual's
credit rating prior to performing a "sting," leading to a loan default and a loss for the
financial service provider.
This financial crime is done by creating a false identity and it takes time to develop.
Once an account has been created with a stolen or false identity, the marketing
initiatives employed by the bank or credit-card issuer assist the perpetrator in building a
portfolio of credit-cards, loan accounts, and a viable credit-rating and history—before
defaulting on them.
A rule-based scoring system can be developed for preventing loan default based on various
parameters such as age (a younger borrower is given more points, an older borrower fewer),
educational qualification (more points for higher studies or degrees, otherwise fewer), the
number of assets owned by the borrower at home (more points for more assets), the
borrower's income, margin, etc.
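A minimal Python sketch of such a scoring function is given below. All of the point values, the income cut-off and the education categories are invented for illustration; an actual scorecard would be calibrated on historical loan data.

def loan_default_safety_points(age, education_level, assets_owned, annual_income, margin):
    # Illustrative rule-based scoring: each rule awards points and a higher
    # total indicates a borrower who is less likely to default
    points = 0
    points += 20 if age < 35 else 10                     # younger borrower: more points
    points += {"school": 5, "graduate": 10, "postgraduate": 20}.get(education_level, 0)
    points += min(assets_owned * 5, 20)                  # more assets owned at home: more points
    points += 20 if annual_income > 500000 else 10       # borrower's income
    points += 10 if margin >= 0.2 else 0                 # borrower's own margin contribution
    return points

print(loan_default_safety_points(age=30, education_level="postgraduate",
                                 assets_owned=3, annual_income=600000, margin=0.25))   # 85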
4.2.3.2 Detection Technique: There are many lead indicators available. There is often
only one "pot" of money that is cycled through the various accounts—a pattern of cash
withdrawals from credit cards, and then at the end of the credit cycle, a similar amount
repaid, usually using a cash withdrawal from another credit card. Lead indicators
include credit cards that are rarely used to make actual merchant purchases and have
small outstanding credit balances. Another pattern to look for is a loan account that is
left unused. These techniques inflate a centrally controlled credit rating, providing a
false impression that the account is deemed responsible. Detection has to occur before
the "sting," which is a use of the credit and loan accounts very rapidly within a credit
cycle. This financial crime can result in high losses. Detection must occur before the
loss, because the sting has a short execution time.
The critical factor in detecting all of these financial fraud crimes is knowing the
behavior of credit, bank, and loan accounts and developing an understanding of the
categories of customers. Data mining can be used to spot outliers or account usage that is
out of character. Sometimes the account seems "too good to be true,"
and it often is. The absence of telephone numbers or other contact information may
indicate a "ring." These rings enable fraudulent activities to be distanced from their
sources and add complexity to criminal detection. Another clue is the multiple use of
the same address or phone number for different accounts.
A rule-based scoring system can also be developed for insurance crime: for example, a neck
injury can be given more risk points than a leg injury, and a laboratory or x-ray report that
is not relevant or is unnecessary for the disease can be assigned a high risk score according
to the degree of irrelevancy, etc.
In the insurance industry, there are various methods by which carriers attempt to review
for fraud while processing policy claims. The following are some important data
attributes for detecting potential fraud claims:
• Duration of illness
• Net amount cost
• Illness (disease)
• Claimant sex
• Claimant age
• Claim cost
• Hospital
Using these variables, analyses can be performed to identify outliers for each, such as
test costs, hospital charges, illness duration, and doctor charges. These are some of the
temporal parameters for analyzing insurance claims.
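One simple way to operationalize the outlier analysis described above is to flag claims whose cost lies several standard deviations away from the historical mean for the same illness. The sketch below uses a z-score for this; the function name, the threshold of 3 and the sample figures are assumptions made for illustration only.

import statistics

def flag_outlier_claims(historical_costs, new_claims, z_threshold=3.0):
    # Flag claims whose cost deviates by more than z_threshold standard
    # deviations from the historical mean cost for the same illness
    mean = statistics.mean(historical_costs)
    stdev = statistics.pstdev(historical_costs)
    flagged = []
    for claim_id, cost in new_claims:
        z = (cost - mean) / stdev if stdev > 0 else 0.0
        if abs(z) > z_threshold:
            flagged.append((claim_id, cost, round(z, 2)))
    return flagged

history = [1200, 1350, 1100, 1280, 1400, 1250, 1320]     # past costs for one illness
claims = [("C101", 1300), ("C102", 5200)]
print(flag_outlier_claims(history, claims))              # only C102 is flagged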
4.2.5.2.1 Detection Technique: Depending on the insurance carrier, we can use various
methods in an attempt to identify false claims, including red-flag reviews by fraud
specialists, both on-line and behind the scenes. A carrier may also use an expert system,
which is a rule-based program that codifies the rules of a human reviewer. Link analysis
may be used to look for a ring of fraudulent providers, and, of course, data mining tools,
such as neural networks, may be used for training and detection if samples of fraud
cases exist. The net amount of the claim may be too large compared to the average
amount of similar claims.
4.2.5.3.1 Detection Technique: The methods are the same as with false claims. In
addition, a carrier may use models and rules developed by insurance specialists coupled
with those from data mining analyses, such as decision trees or rule generators, to detect
these schemes.
4.2.5.4.1 Detection Technique: Mill activity can be suspected when claims are
submitted for many unrelated individuals who receive similar treatment from a small
number of providers. These claims are typically manually reviewed by claim specialists;
however, link analysis and rule generators can also be used for screening large volumes
of claims.
4.2.5.5 MISCODING
4.2.5.5.1 Detection Technique: Any code that is not standard must be subject to review
and matched against prior claims from similar clinics or practitioners, typically
performed by red-flag claim specialists. Clustering of historical data can be used to
detect outliers automatically, and to check a disease (illness) against average duration
and cost using a historical claims database to generate a histogram.
4.3. CONCLUSION
Data mining techniques like neural networks, decision trees, link analysis etc. can be
very helpful for financial crime detection. These techniques can be used in combination
with a rule-based system, so the accuracy of prediction increases considerably. The two-tier
architecture model can be used very effectively for verifying any financial transaction.
A financial transaction has to pass through two levels of verification, so the prediction
gets closer to the real outcome; moreover, a genuine or normal transaction is not caught
by the model as a fraudulent transaction, so the normal or genuine customer does not have
to suffer.
Here we have suggested a two-stage solution for financial crime detection, which is in fact
a hybrid approach containing both human insight and machine insight. For these types of
crime the hybrid approach proves more powerful than any single-stage solution, and the
accuracy of prediction increases drastically.
In this type of model or system, we also need to take care that no normal or genuine
transaction is caught as a fraudulent transaction, creating overhead for the customer. If
any customer suffers in this way then we might lose him.
4.4. REFERENCES
CHAPTER 5
Data Warehouse Implementation
We can define a data warehouse as a historical database designed for decision support.
A more precise definition is given by W. H. Inmon (1996). Specifically, a data warehouse
is a subject-oriented, integrated, time-variant and nonvolatile collection of data in support
of management's decision-making process.
In Data Warehouse environments, the relational model can be transformed into the
following architectures:
• Star schema
• Snowflake schema
• Constellation schema
Star schema architecture is the simplest data warehouse design. The main feature of a star
schema is a table at the center, called the fact table, and the dimension tables which allow
browsing of specific categories, summarizing, drill-downs and specifying criteria.
Typically, most of the fact tables in a star schema are in database third normal form,
while dimension tables are de-normalized (second normal form). Despite the fact that
the star schema is the simplest data warehouse architecture, it is the one most commonly
used in data warehouse implementations across the world today (in about 90-95% of cases).
The snowflake schema is a variation of the star schema model, where some dimension
tables are normalized, thereby further splitting the data into additional tables. The
resulting schema graph forms a shape similar to a snowflake.
The major difference between the snowflake and star schema models is that the
dimension tables of the snowflake model may be kept in normalized form to reduce
redundancies. Such a table is easy to maintain and saves storage space, because a large
dimension table can become enormous when the dimensional structure is included as
columns. However, this saving of space is negligible in comparison to the typical
magnitude of the fact table.
Sophisticated applications may require multiple fact tables to share dimensional tables.
This kind of schema can be viewed as a collection of stars, and hence is called a galaxy
schema or a fact constellation.
Here we have designed the data warehouse using snowflake schema architecture. Data
warehouse design layout is given in Figure 5.1 and 5.2.
Description: This table contains information regarding the transaction performed by card
holder online. Whenever user orders any product through internet, then his transaction
details mentioned below are stored in this table. This is the fact table, so instead of
storing direct values, it stores the links from tables like product_master,
product_category_master, customer_master, creditcard_master, shipping_master,
location_master and seller_master.
Primary key: Serial_id
Foreign Key: Product_id references Product_Master
Product_cat_id references Product_Category_Master
Customer_id references Customer_Master
Creditcard_id references Creditcard_Master
Seller_id references Seller_Master
Shipping_id references Shipping_Master
Location_id references Location_Master
TRANSACTION_DAY_TYPE   NUMBER(1)   1: Holiday, 0: Working day
Description: This table contains personal information mentioned below of the customer
who is currently performing the online transaction. This information is required by the
web site through which the customer wants to perform the transaction.
Primary key: Customer_id
Description: It contains the credit card details of the credit card holder. Whenever a user
wants to purchase anything with the credit card, he must give the credit card number,
expiry date and card verification value (CVV) number; then he is able to perform the
transaction.
Primary key: Creditcard_id
Description: It contains the seller or vendor name, with which customer is performing
the transaction.
Primary key: Seller_id
Description: It contains billing address or residential address of the credit card holder.
During the online transaction, this address is verified against the shipping address entered
by the buyer to decide the sensitivity of the transaction.
Primary key: Address_id
Foreign key: Cityid references City_Master
Stateid references State_Master
Countryid references Country_Master
Description: Product category information is stored in this table. This table is useful to
study the customer purchase behavior in different categories, so the incoming transaction
is predicted according to this behavior.
Primary key: Product_cat_id
Description: This is the address entered by the customer during the online transaction,
where the customer wants his product to be shipped. This address may be different from
billing address.
Primary key: Shipping_id
Foreign key: Cityid references City_Master
Stateid references State_Master
Countryid references Country_Master
Description: It contains the address details of the place from where the customer requests
to purchase the product through the internet. There are several free tools available for
capturing this kind of information. The system matches the city where the transaction is
performed with the billing address's city and takes the time zone into account if the two
are found to be in different countries.
Primary key: Location_id
Foreign key: Cityid references City_Master
Stateid references State_Master
Countryid references Country_Master
Description: This table contains city-related information along with the time zone.
Whenever any online transaction is performed outside the customer's country, the system
uses this table to convert the time zone of one city to the time zone of another city.
Primary key: Cityid
Description: This table is used to store the user id and login date, time of the user.
System resets the value of the tables customer_daily_count, customer_weekly_count,
customer_fortnightly_count and customer_monthly_count by using this table only. E.g. If
logon_day contains the value of 1st January, 2010 then the next day 2nd January, 2010 the
value of daily_count and amount field of customer_daily_count table becomes zero. After
the completion of week, the value of weekly_count and amount field of
customer_weekly_count table becomes zero and accordingly for
customer_fortnightly_count and customer_monthly_count tables.
Primary key: User_id
Description: This table contains personal information mentioned below of the customer
who is the credit card holder.
Primary key: Cardholder_id
Foreign key: Address_id references Address_Master
Cardid references Creditcard_Master
Description: It is a generic fraud table maintained by the system. It stores the number of
fraud transactions performed within the different time periods given below. The system
records the time gap between each pair of consecutive transactions. If a transaction is
found suspicious by the system, it uses this table to calculate the posterior probability
using Bayesian learning and decides about the sensitivity of the transaction.
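The Bayesian update referred to here can be sketched as follows. The prior belief, the likelihood values for observing a very short gap between two transactions and the variable names are assumptions chosen only to illustrate how the posterior strengthens or weakens the suspicion.

def bayes_posterior(prior_fraud, likelihood_fraud, likelihood_genuine):
    # P(F | E) = P(E | F) P(F) / (P(E | F) P(F) + P(E | G) P(G))
    p_f = prior_fraud
    p_g = 1.0 - prior_fraud
    evidence = likelihood_fraud * p_f + likelihood_genuine * p_g
    return likelihood_fraud * p_f / evidence

# Illustrative likelihoods of observing a very short gap between transactions,
# estimated from the generic fraud table and from the genuine transaction history
p_short_gap_given_fraud = 0.60
p_short_gap_given_genuine = 0.05
prior = 0.10                       # initial belief produced by the earlier components

posterior = bayes_posterior(prior, p_short_gap_given_fraud, p_short_gap_given_genuine)
print(round(posterior, 3))         # about 0.571: the belief in fraud is strengthened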
Description: Whenever any transaction is found suspicious by the system, then its details
are stored in the following table. It stores the different time periods since the last
transaction. If another transaction on the same card is found suspicious, then this table is
updated accordingly till either the next transaction is found genuine or the generated risk
score reaches the specified threshold.
Foreign key: Cardid references Creditcard_Master
Apart from these, there are additional tables maintained by the system which hold the
current transaction details of the card holder on a daily, weekly, fortnightly or monthly
basis.
Description: Whenever the customer performs a transaction during the day, this table is
updated automatically by the system. It stores the total number of transactions and the
total amount of purchasing during the current day. (E.g. if the customer first performs a
transaction of Rs. 4000, then transcount contains 1 and amount contains 4000. If the
customer performs a second transaction of Rs. 5000 on the same day, then transcount
contains 2 and amount contains 9000.) The next day, the number of transactions and the
amount of purchasing are automatically reset to zero by the system. This table is therefore
used to observe the daily behavior of the customer. The system then matches this data
against the customer's past daily behavior.
Foreign key: Cardid references Creditcard_Master
Description: This table is used to observe the behavior of current week of customer. All
the transactions performed in the current week are automatically reflected in this table.
This data is used to match past weekly customer behavior. The next week value of these
fields becomes zero.
Foreign key: Cardid references Creditcard_Master
Description: This table stores the transaction details of the current fifteen days only. At
the end of the fifteen days, the values are reset to zero by the system. Here also a
comparison is made between the current fifteen days' behavior and the past fifteen days'
behavior to decide the validity of the transaction.
Foreign key: Cardid references Creditcard_Master
Description: It contains the total transaction details of the current month only. After the
completion of the month, the value of transcount and amount again starts to update
according to the transactions performed by the customer. This table is used to compare
the current monthly behavior of the customer with the past monthly behavior.
Foreign key: Cardid references Creditcard_Master
Description: It stores the transaction details of the whole current day if the customer
performs transactions on a holiday. The customer's current holiday behavior is checked
against the past holiday behavior to predict whether the transaction is genuine or not.
Foreign key: Cardid references Creditcard_Master
Figure 5.1 and Figure 5.2 Data Warehouse Design Layout (snowflake schema): the
Transaction fact table is linked to the dimension tables Product_Master,
Product_Category_Master, Customer_Master, Creditcard_Master, Seller_Master,
Shipping_Master and Location_Master, with Location_Master further linked to
City_Master, State_Master and Country_Master.
The data used in this work was gathered from an online shopping firm. Even though the
firm provided real credit card data for this research, it required that the firm's name be
kept confidential. Although real credit card transactional data was obtained, real credit
card numbers and customers' personal information were not given due to confidentiality,
and fraudulent transactional records were not available.
We have also generated a large amount of synthetic data based on the statistical data, to
test the model's speed on large-scale data. We used a Gaussian distribution to generate
this data. The number of transactional records is more than 10,00,000.
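A minimal sketch of the generation step is shown below. The mean and standard deviation used here are placeholders; in the thesis they are taken from the statistical data summarized in the tables that follow.

import random

def synthetic_amounts(mean, std_dev, count, seed=42):
    # Draw synthetic transaction amounts from a Gaussian distribution,
    # truncated at a small positive minimum so no amount is negative
    random.seed(seed)
    return [max(round(random.gauss(mean, std_dev), 2), 1.0) for _ in range(count)]

# e.g. a customer profile whose purchases average Rs. 4,000 with spread Rs. 1,200
print(synthetic_amounts(mean=4000, std_dev=1200, count=10))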
Category  <20000  20000-29999  30000-39999  40000-49999  50000-59999  60000-69999  70000-79999  80000-89999  >90000
1         .19     .18          .18          .17          .16          .15          .15          .14          .13
2         .36     .38          .37          .36          .34          .32          .32          .30          .31
3         .06     .05          .05          .05          .04          .04          .04          .04          .04
4         .16     .15          .16          .17          .19          .20          .20          .21          .18
5         .05     .08          .09          .09          .08          .07          .07          .06          .04
6         .04     .04          .04          .04          .05          .05          .05          .05          .06
7         .14     .12          .11          .12          .14          .17          .17          .20          .24
The data is generated using Gaussian distribution with the following mean and standard
deviation.
There is a large amount of data in the data warehouse, from the year 2005 to 2009. Here is
a sample of transactions of some customers from the year 2005.
We also studied how realistic credit card numbers are generated. To generate realistic
credit card numbers, we use the semantic graph shown in Figure 5.3.
Figure 5.3 Semantic Graph for Credit Card Number Generation (MII, Card Type, Card
Issuer, Credit Card Number)
The first digit of a credit card number is the Major Industry Identifier (MII), which
represents the source from which the credit card was issued. For example, a credit card
number starting with 6 is assigned for merchandising and banking purposes, as in the case
of the Discover card. Credit card numbers starting with 4 and 5 are used for banking and
financing purposes, as in the case of Visa and MasterCard. The digit 3 is used to represent
travel and entertainment, for instance the American Express card. Table 5.26 gives an
overview of the rules for numbering credit cards. The first six digits, including the MII,
represent the issuer identifier. The rest of the digits on the credit card represent the
cardholder's account number, except the last digit. The lone digit at the very right end of
the complete 15 or 16 digit credit card number sequence is known as the “check digit”,
which often is the final number that is computer generated to satisfy the mathematical
formulations of the Luhn check sum process. Meanwhile, in between the first 6 digits and
the last single check digit is the actual personalized account number – the 8 or 9 digit
sequence given by the card issuer.
The Luhn Algorithm is the check sum formula used by payment verification systems and
mathematicians to verify the sequential integrity of real credit card numbers. It’s used to
help bring order to seemingly random numbers and used to prevent erroneous credit card
numbers from being cleared for use. The Luhn algorithm is not used for straight credit
card number generation from scratch, but rather utilized as a simple computational way to
distinguish valid credit card numbers from random collections of numbers put together.
The validation formula works with most debit cards as well.
The Luhn formula was created and filed as a patent (now freely in the public domain) in
1954 by Hans Peter Luhn of IBM to detect numerical errors in pre-existing and
newly generated identification numbers. Since then, its primary use has been in the area
of check sum validation, made popular by its use to verify the validity of important
sequences such as credit card numbers. Currently, almost all credit card numbers issued
today are generated and verified using the Luhn Algorithm. The Luhn algorithm only
validates the 15-16 digit credit card number and not the other critical components of a
genuine card account such as the expiration date and the commonly used Card
Verification Value (CVV) and Card Verification Code (CVC) numbers.
ALGORITHM 5.1
1. The Luhn Algorithm always starts from right to left, beginning with the rightmost
digit on the credit card face (the check digit). Starting with the check digit and
moving left, double the value of every alternate digit. Non-doubled digits will
remain the same. The check digit is never doubled. For example, if the credit card
is a 16 digit Visa card, the check digit would be the rightmost 16th digit. Thus we
would double the value of the 15th, 13th, 11th, 9th digits, and so on until all odd
digits have been doubled. The even digits would be left the same.
2. For any digit that becomes a two digit number of 10 or more when doubled, add
the two digits together. For example, the digit 5 when doubled will become 10,
which turns into a 1.
3. Now, lay out the new sequence of numbers. The new doubled digits will replace
the old digits. Non-doubled digits will remain the same.
4. Add up the new sequence of numbers together to get a sum total. If the combined
tally is perfectly divisible by ten, then the account number is mathematically valid
according to the Luhn formula. If not, the credit card number provided is not valid
and thus fake or improperly generated.
We can follow the Luhn steps 1 to 4 below, starting with the rightmost digit. I have taken my own credit card number to check how it is mathematically correct according to the Luhn validation technique.
(1) Write down the card number, starting with the rightmost digit:
5 1 7 6 5 3 0 0 9 2 2 4 5 0 0 3
(2) Double every other digit, moving left from the check digit. The doubled values are:
10 14 10 0 18 4 10 0
(3) Where a doubled value has two digits, add the digits together and put the result in place of the original digit, keeping the other digits as they are:
1 1 5 6 1 3 0 0 9 2 4 4 1 0 0 3
(4) Add up the new sequence of digits. The total is 40, which is perfectly divisible by 10, so according to the Luhn algorithm it is a valid credit card number.
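Since the whole generation tool is written in Oracle, the same check can be expressed as a small PL/SQL function. The following is a minimal sketch of Algorithm 5.1 (the function name is illustrative): starting from the rightmost digit, every second digit is doubled, two-digit results are reduced by adding their digits, and the number is valid when the total is divisible by 10.

CREATE OR REPLACE FUNCTION is_luhn_valid(p_card_no IN VARCHAR2) RETURN NUMBER IS
  v_sum    NUMBER  := 0;
  v_digit  NUMBER;
  v_double BOOLEAN := FALSE;    -- the rightmost (check) digit is never doubled
BEGIN
  FOR i IN REVERSE 1 .. LENGTH(p_card_no) LOOP
    v_digit := TO_NUMBER(SUBSTR(p_card_no, i, 1));
    IF v_double THEN
      v_digit := v_digit * 2;
      IF v_digit > 9 THEN
        v_digit := v_digit - 9;   -- equivalent to adding the two digits of the result
      END IF;
    END IF;
    v_sum    := v_sum + v_digit;
    v_double := NOT v_double;
  END LOOP;
  IF MOD(v_sum, 10) = 0 THEN
    RETURN 1;   -- mathematically valid according to the Luhn formula
  ELSE
    RETURN 0;
  END IF;
END;
/

For the worked example above, SELECT is_luhn_valid('5176530092245003') FROM dual returns 1, since the digit total is 40.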
5.8 REFERENCES
[3] Ian H. Witten, Eibe Frank – DATA MINING Practical Machine Learning Tools and
Techniques, Morgan Kaufmann Publishers, ISBN: 0-12-088407-0
[4] K.V.S Sarma – Statistics Made Simple Do It Yourself on PC, Prentice Hall of India,
ISBN: 81-203-1741-6
[5] R.S. Bhardwaj – Business Statistics, Excel Books, ISBN: 81-7446-181-7
[6] Ivan Bayross – SQL, PL/SQL The Programming Language of Oracle, BPB
Publications, ISBN 81-7656-964-X
[7] Nilesh Shah – Database Systems Using Oracle, Prentice Hall of India, ISBN: 81-203-
2147-2
[8] A. Leon, M. Leon – Database Management Systems, Vikas Publishing House, ISBN:
0-81-259-1165-0
[9] http://www.citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.84.1081&rep...
[10] http://www.thetaoofmakingmoney.com/2007/04/12/324.html
[11] http://www.etl-tools.info/en/bi/datawarehouse_star-schema.htm
CHAPTER 6: DEVELOPMENT OF TRANSACTION PATTERN GENERATION TOOL (TPGT)
The transaction pattern generation tool (TPGT) generates the patterns (parameters) based on the historical data stored in the data warehouse. TPGT is implemented in Oracle 9i. All the patterns generated by TPGT collectively decide the purchasing behavior of the card holder. These patterns are very useful for deciding or verifying the current transaction performed by the card holder online. The implementation code is given in the Appendix.
TPGT generates thirteen groups of parameters: DP, CP, PP, TP, WP, VP, AP, FP, MP, SP, HP, LP and GP.
6.1.1 Subparameters of DP
6.1.2 Subparameters of CP
6.1.3 Subparameters of PP
PP has two subparameters: PP1 and PP2.
6.1.4 Subparameters of TP
TP has twelve subparameters: TP1 to TP12.
6.1.5 Subparameters of WP
6.1.6 Subparameters of VP
VP has two subparameters: VP1 and VP2.
6.1.7 Subparameters of AP
AP has two subparameters: AP1 and AP2.
AP1: Number of transactions shipped with the same current shipping address
AP2: Number of transactions with different shipping and billing address
6.1.8 Subparameters of FP
6.1.9 Subparameters of MP
6.1.10 Subparameters of SP
6.1.11 Subparameters of HP
6.1.12 Subparameters of LP
6.1.13 Subparameters of GP
This parameter contains the average amount of purchases per day. For example, if the total amount of purchases made by the customer in one year is Rs. 30,000, then this value is divided by 365 (giving roughly Rs. 82 per day) to derive the value of the parameter.
6.2.2.3 Number of times transactions have taken place within the same category (CP3)
The total number of transactions in each category is also stored by the tool in this parameter. For example, if the customer currently buys a product of the electronics category and has performed six transactions within the same category in the past, then this parameter has the value six.
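As a minimal sketch of how such a count can be derived from the data warehouse (the table and column names below are assumptions based on the schema of Chapter 5; the actual TPGT code is given in the Appendix):

DECLARE
  v_card_id  NUMBER := 1507;   -- illustrative card id
  v_category NUMBER := 2;      -- category of the current transaction
  v_cp3      NUMBER;
BEGIN
  -- count the past transactions of this card in the same product category
  SELECT COUNT(*)
    INTO v_cp3
    FROM transaction_master
   WHERE creditcard_id  = v_card_id
     AND product_cat_id = v_category;
  DBMS_OUTPUT.PUT_LINE('CP3 = ' || v_cp3);
END;
/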
6.2.7.2 Number of transactions with different shipping and billing address (AP2)
The tool finds how many transactions the customer has performed with a shipping address other than his billing address. The customer's habit of shipping to addresses other than the billing address can thus be studied by the model to decide the sensitivity of a new incoming transaction.
6.2.12.2 Number of transactions ordered in a different city within the same state (LP2)
If the customer initiates an order from a city other than his own city but within the same state, then it is added to this parameter.
6.2.12.3 Number of transactions ordered in a different city outside of the state (LP3)
If the customer orders a product from outside of his state but within his country, then it is added to this parameter.
6.2.12.5 Number of transactions shipped to a different city within the same state (LP5)
This parameter stores the number of transactions in which the user has requested to ship the items to a city other than his billing address city, but within his state.
6.2.12.6 Number of transactions shipped to a different city outside of the state (LP6)
This parameter stores the number of transactions in which the user has requested to ship the items to a state other than his billing address state, but within his country.
The calculation of the parameters TP1 to TP8 in the tool is done as follows. The tool divides all the transactions of the customer into eight time frames:
T1 becomes true if the past transaction on the card Ck in the data warehouse was performed in the 3:00 to 6:00 time frame.
T2 becomes true if it was performed in the 6:00 to 9:00 time frame.
T3 becomes true if it was performed in the 9:00 to 12:00 time frame.
T4 becomes true if it was performed in the 12:00 to 15:00 time frame.
T5 becomes true if it was performed in the 15:00 to 18:00 time frame.
T6 becomes true if it was performed in the 18:00 to 21:00 time frame.
T7 becomes true if it was performed in the 21:00 to 0:00 time frame.
T8 becomes true if it was performed in the 0:00 to 3:00 time frame.
The tool then finds the total number of transactions performed by the customer in each of the time frames T1 to T8.
TP1 = occurrences (count) of T1 on the card Ck from the data warehouse (6.9)
TP2 = occurrences (count) of T2 on the card Ck from the data warehouse (6.10)
TP3 = occurrences (count) of T3 on the card Ck from the data warehouse (6.11)
TP4 = occurrences (count) of T4 on the card Ck from the data warehouse (6.12)
TP5 = occurrences (count) of T5 on the card Ck from the data warehouse (6.13)
TP6 = occurrences (count) of T6 on the card Ck from the data warehouse (6.14)
TP7 = occurrences (count) of T7 on the card Ck from the data warehouse (6.15)
TP8 = occurrences (count) of T8 on the card Ck from the data warehouse (6.16)
Finally the percentage of all the parameters of all the transactions is computed as follows.
Percent_TP1=(TP1 * 100) / total transactions on card Ck from the data warehouse (6.17)
Percent_TP2=(TP2 * 100) / total transactions on card Ck from the data warehouse (6.18)
Percent_TP3=(TP3 * 100) / total transactions on card Ck from the data warehouse (6.19)
Percent_TP4=(TP4 * 100) / total transactions on card Ck from the data warehouse (6.20)
Percent_TP5=(TP5 * 100) / total transactions on card Ck from the data warehouse (6.21)
Percent_TP6=(TP6 * 100) / total transactions on card Ck from the data warehouse (6.22)
Percent_TP7=(TP7 * 100) / total transactions on card Ck from the data warehouse (6.23)
Percent_TP8=(TP8 * 100) / total transactions on card Ck from the data warehouse (6.24)
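As a minimal sketch of equations (6.9)-(6.24) in SQL (the table and column names are assumptions based on the Chapter 5 schema, and the card is assumed to have at least one transaction), TP1 and Percent_TP1 for one card could be derived as follows; the other time frames only change the hour bounds.

DECLARE
  v_card_id NUMBER := 1507;   -- illustrative card id
  v_tp1     NUMBER;
  v_total   NUMBER;
BEGIN
  SELECT COUNT(*) INTO v_total
    FROM transaction_master
   WHERE creditcard_id = v_card_id;

  -- TP1: transactions falling in the 3:00 to 6:00 time frame
  SELECT COUNT(*) INTO v_tp1
    FROM transaction_master
   WHERE creditcard_id = v_card_id
     AND TO_NUMBER(TO_CHAR(transaction_date, 'HH24')) BETWEEN 3 AND 5;

  DBMS_OUTPUT.PUT_LINE('Percent_TP1 = ' || ROUND(v_tp1 * 100 / v_total, 2));
END;
/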
L1 becomes true if the transaction on the card Ck in the data warehouse is performed between 0:00 and 4:00.
L2 becomes true if the transaction on the card Ck in the data warehouse is performed at any time other than 0:00 to 4:00.
TP11 = occurrences (count) of L1 on the card Ck from the data warehouse (6.27)
TP12 = occurrences (count) of L2 on the card Ck from the data warehouse (6.28)
G1 becomes true if the transaction occurs within 4 hours of the previous transaction on the same card Ck in the data warehouse.
G2 becomes true if the transaction occurs within 5 to 8 hours of the previous transaction on the same card Ck in the data warehouse.
G3 becomes true if the transaction occurs within 9 to 16 hours of the previous transaction on the same card Ck in the data warehouse.
G4 becomes true if the transaction occurs within 17 to 24 hours of the previous transaction on the same card Ck in the data warehouse.
G5 becomes true if the transaction occurs from the 2nd day up to within a week of the previous transaction on the same card Ck in the data warehouse.
G6 becomes true if the transaction occurs from the second week up to within 15 days of the previous transaction on the same card Ck in the data warehouse.
G7 becomes true if the transaction occurs after 15 days from the previous transaction on
the same card Ck from the data warehouse.
GP1 = occurrences (count) of G1 on the card Ck from the data warehouse (6.36)
GP2 = occurrences (count) of G2 on the card Ck from the data warehouse (6.37)
GP3 = occurrences (count) of G3 on the card Ck from the data warehouse (6.38)
GP4 = occurrences (count) of G4 on the card Ck from the data warehouse (6.39)
GP5 = occurrences (count) of G5 on the card Ck from the data warehouse (6.40)
GP6 = occurrences (count) of G6 on the card Ck from the data warehouse (6.41)
GP7 = occurrences (count) of G7 on the card Ck from the data warehouse (6.42)
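A minimal sketch of how a transaction gap g (expressed in hours) maps to one of the gap events above is shown next; the query is purely illustrative.

-- 36 hours since the previous transaction falls in the 2nd day, hence event G5
SELECT CASE
         WHEN gap_hours <= 4        THEN 'G1'
         WHEN gap_hours <= 8        THEN 'G2'
         WHEN gap_hours <= 16       THEN 'G3'
         WHEN gap_hours <= 24       THEN 'G4'
         WHEN gap_hours <= 7 * 24   THEN 'G5'
         WHEN gap_hours <= 15 * 24  THEN 'G6'
         ELSE 'G7'
       END AS gap_event
  FROM (SELECT 36 AS gap_hours FROM dual);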
A1 becomes true if past transactions in the data warehouse have also been shipped to the same shipping address.
A2 becomes true if the transaction is performed with different shipping and billing addresses.
AP1 = occurrences (count) of A1 on the card Ck from the data warehouse (6.45)
AP2 = occurrences (count) of A2 on the card Ck from the data warehouse (6.46)
6.4 REFERENCES
CHAPTER 7: DEVELOPMENT OF TRANSACTION RISK GENERATION MODEL (TRSGM)
This is one of the most important parameters considered by the model. When the shipping address entered by the customer is different from the billing address, the model checks how many previous transactions have been shipped to the same shipping address by checking the value of the parameter AP1. If it is greater than zero, the model considers the transaction highly genuine and generates a risk score of 0. The model also learns how many transactions in total the customer has performed with different billing and shipping addresses through the parameter AP2.
The model considers the location from which the current online transaction is performed. It uses the parameter LP1, the number of transactions ordered from the same location, and generates a risk score accordingly. If no transaction has been performed from the same location, a higher risk score is generated; the more transactions that have been performed from that location, the lower the risk score.
When a customer purchases a product in any category, the model finds how much the customer has spent in this category. It uses the parameter CP1 for generating the risk score. The higher the value of CP1, the lower the risk score, as the model assumes the transaction matches the customer's purchasing pattern. The lower the value of CP1, the higher the risk score, as the transaction is far from the customer's purchasing habit.
The model also uses the parameter PP1, the time passed since the same product was purchased. If the value of this parameter is small and the product is costly, the model considers the transaction sensitive and generates the risk score accordingly. The number of times the same product has been purchased is also recorded by the model.
The parameters TP1 to TP8 hold the percentage of the total number of transactions performed within each particular time frame. The model records the current transaction time and finds the percentage of transactions in this time frame using the parameters TP1 to TP8. If the percentage is high, a lower risk score is generated, as most of the past transactions were performed within this time frame. If the percentage is low, a higher risk score is generated, as the transaction does not match the customer's past transaction times.
If the current transaction is performed late at night, the model checks the parameter TP12 to find the total number of past transactions performed by the customer late at night. If the value of TP12 is 0 or very small (compared with the total number of transactions), the model considers the transaction sensitive and generates the risk score accordingly.
If the customer is active and the time passed since the last transaction (TP9) is large, the model considers the transaction sensitive and generates the risk score accordingly.
The model also finds the deviation from the maximum amount of all past transactions (TP10). It generates a risk score based on how much the current transaction amount exceeds TP10; if the excess is small, the risk score generated is small.
For a new incoming transaction with a seller, the model checks the parameter VP2 to find the total amount of transactions performed by the customer with the same seller. The more purchasing that has been done with that seller, the lower the risk score; the less purchasing, the higher the risk score.
The customer's first transaction amount on a Sunday is compared with SP4, the maximum individual amount of transactions on Sunday, and the risk score is generated accordingly. If the customer subsequently performs more transactions on that day, their total amount and total number are compared with SP5, the maximum total amount of transactions on Sunday, and SP3, the maximum number of transactions on Sunday.
For the first transaction performed by the customer on a holiday, its amount is compared with HP4, the maximum individual amount of transactions on a holiday, and the risk score is generated accordingly. If the customer subsequently performs more transactions on that day, their total amount and total number are compared with HP5, the maximum total amount of transactions on a holiday, and HP3, the maximum number of transactions on a holiday.
All the transactions performed by the customer on the current day are monitored by the model and stored in the table customer_dailycount. They are compared with the daily parameters to find how close or far the current day's behavior is from the past daily behavior.
The total amount of transactions on the current day is compared with the parameter DP2, the maximum amount of purchase daily, and the risk score is generated accordingly; a higher risk score is generated as its value exceeds DP2.
The total number of transactions on the current day is matched with DP3, the maximum number of transactions in a day, and the risk score is generated accordingly; a higher risk score is generated as its value exceeds DP3, otherwise a lower one.
The transactions of the current week are updated in the table customer_weeklycount. Their values are matched with the weekly parameters to find the deviation from the past weekly behavior.
The weekly transaction amount is compared with WP4, the maximum amount of purchase weekly; if it is greater than WP4 a higher risk score is generated, and if it is less than WP4 a lower risk score is generated.
The total number of transactions of the current week is matched with WP3, the maximum number of transactions in a week. If it is higher, a higher risk score is generated, otherwise a lower one.
The total number of transactions in the current fortnight is checked against the parameter FP3, the maximum number of transactions in a fortnight, and the risk score is generated accordingly. The total amount of transactions in the current fortnight is checked against FP4, the maximum amount of purchase fortnightly, and the risk score is generated accordingly.
All the transactions of the current month are stored in the customer_monthlycount table. This table is used to find how far or close the current month's behavior is from the past monthly behavior by comparing it with the monthly parameters.
The total number of transactions of the current month is compared with MP3, the maximum number of transactions in a month, and the risk score is generated accordingly. The total amount of transactions is compared with MP4, the maximum amount of purchase monthly, and the risk score is generated accordingly.
An important feature of the model is that it records the transaction gap between every two successive transactions performed by the customer. Seven transaction gap parameters, GP1 to GP7, are generated according to the transaction gap.
Whenever any transaction is found suspicious by the model, it updates the field suspect_count of the suspect table. The model then finds which event occurs on this card and finds the probability that it originates from the generic fraudulent transaction set or the normal transaction set by using these parameters. Finally, the posterior probabilities are computed by the model.
Here the time gap between successive transactions on the same card is considered to capture the frequency of card use. The transaction gap is divided into seven mutually exclusive and exhaustive events: E1, E2, E3, E4, E5, E6 and E7. The occurrence of each event depends on the time since the last purchase (the transaction gap, g) on any particular card.
The event E1 is defined as the occurrence of a transaction on the same card Ck within 4 hours of the last transaction.
The event E2 is defined as the occurrence of a transaction on the same card Ck from the 4th to the 8th hour after the last transaction.
The event E3 is defined as the occurrence of a transaction from the 8th to the 16th hour after the last transaction.
The event E4 is defined as the occurrence of a transaction from the 16th to the 24th hour after the last transaction.
The event E5 is defined as the occurrence of a transaction within a week (from the 2nd day to the 7th day) of the last transaction.
The event E6 is defined as the occurrence of a transaction within a fortnight (from the 8th day to the 15th day) of the last transaction.
The event E7 is defined as the occurrence of a transaction after 15 days of the last transaction.
In the TRSGM, a number of rules are used to analyze the deviation of each incoming
transaction from the normal profile of the cardholder by computing the patterns generated
by TPGT. The initial belief value is obtained as the risk score. The belief is further
strengthened or weakened according to its similarity with fraudulent or genuine
transaction history using Bayesian learning. In order to meet this functionality, the
TRSGM is designed with the following five major components: the DBSCAN algorithm, a linear equation, rules, the historical transaction database (HTD) and a Bayesian learner.
A customer usually carries out similar types of transactions in terms of amount, which
can be visualized as part of a cluster. Since a fraudster is likely to deviate from the
customer’s profile, his transactions can be detected as exceptions to the cluster – a
process known as outlier detection. It has important applications in the field of fraud
detection and has been used for quite some time to detect anomalous behavior.
d_outlier = 1 − (ε / v_avg), if |N_ε(P)| < MinPts (7.8)
d_outlier = 0, otherwise
where
MinPts: minimum number of points required in the ε-neighborhood of each point to form a cluster;
ε: maximum radius of the neighborhood, with N_ε(p) = {q ∈ D | dist(p, q) ≤ ε}.
The key idea of the DBSCAN algorithm is that for each point p in a cluster ci, there are at
least a minimum number of points (MinPts) in the ε - neighborhood of that point p
denoted as N ε (p) i.e. the density in the ε - neighborhood has to exceed some threshold.
The larger the ε - neighborhood, the less is the number of clusters formed. If it is set too
high, there will be no cluster since the MinPts condition is not satisfied. However, if both
the parameters are small, there can be a lot of clusters. If MinPts is set to 1, then each
point in the database is treated as a separate cluster and even noise gets identified as a
separate cluster.
Here the DBSCAN algorithm is used to form clusters of the transaction amounts spent by the customer. Whenever a new transaction is performed by the customer, the algorithm finds the cluster coverage of this particular amount. If this amount has occurred more than once in the past, then the TRSGM considers it a highly genuine transaction.
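A minimal sketch of the core idea for one new amount follows (the table and column names are assumptions based on the Chapter 5 schema; ε = 500 and MinPts = 5 are the values used for the cluster graphs later in this chapter). It only checks the ε-neighborhood density condition for the new point, which is the test DBSCAN applies to every point when growing a cluster.

DECLARE
  v_eps        NUMBER := 500;     -- maximum radius of the neighborhood
  v_minpts     NUMBER := 5;       -- minimum points required to form a cluster
  v_new_amount NUMBER := 2300;    -- amount of the new incoming transaction
  v_neighbours NUMBER;
BEGIN
  -- count past amounts of this card lying within eps of the new amount
  SELECT COUNT(*)
    INTO v_neighbours
    FROM transaction_master
   WHERE creditcard_id = 1507
     AND ABS(amount - v_new_amount) <= v_eps;

  IF v_neighbours >= v_minpts THEN
    DBMS_OUTPUT.PUT_LINE('Amount lies in a dense region of past amounts (cluster condition met)');
  ELSE
    DBMS_OUTPUT.PUT_LINE('Amount is a potential outlier for this card');
  END IF;
END;
/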
The TRSGM is based on the following linear equation, which generates a risk score indicating how far or close the current transaction is from the normal profile of the customer. If the generated risk score is close to 0, the transaction is considered to closely match the customer's normal profile. If the risk score is greater than 0.5 or close to 1, it is considered a heavy deviation from the customer's normal profile.
Risk score = (1 − threshold) * Σ (i = 1 to n) (Pi * Wi) (7.9)
where
threshold = 0.5
Pi = parameter generated by TPGT
Wi = weightage of the parameter, which is given as input to Algorithm 7.1; the weightage is expressed as a percentage.
7.2.2.1 Parameters

Sr No  Parameter                                              Weightage
1      Location from which the product is ordered             W1 %
2      Amount of the transaction                              W2 %
3      Number of the transactions                             W3 %
4      Category of the purchase                               W4 %
5      Time frame during which the product is ordered         W5 %
6      Seller or vendor from whom the product is purchased    W6 %
7      Same product purchased within a short time             W7 %
8      Time passed since the last transaction                 W8 %
9      Late night transaction                                 W9 %
10     Overseas transaction                                   W10 %
f(x) = 1 / (1 + e^(−x)) (7.10)
where e is the base of natural logarithms, approximately 2.718282.
This function is used when the value of a parameter cannot be expressed as a percentage, as it maps the computed value into the range [0, 1].
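A minimal sketch of how equations (7.9) and (7.10) combine into a single score is given below. The parameter values, the weightages and the normalisation of the weightages to fractions are illustrative assumptions, not the weightages actually recommended by the credit card company.

DECLARE
  TYPE num_tab IS TABLE OF NUMBER INDEX BY PLS_INTEGER;
  p           num_tab;                      -- parameter scores, each in [0, 1]
  w           num_tab;                      -- weightages, in percent
  v_threshold CONSTANT NUMBER := 0.5;
  v_risk      NUMBER := 0;

  -- sigmoid of equation (7.10): maps any computed value into (0, 1)
  FUNCTION sigmoid(x NUMBER) RETURN NUMBER IS
  BEGIN
    RETURN 1 / (1 + EXP(-x));
  END;
BEGIN
  p(1) := 0.8;           w(1) := 30;        -- e.g. location-based score, weightage W1 = 30 %
  p(2) := sigmoid(1.2);  w(2) := 20;        -- e.g. amount-deviation score mapped by the sigmoid

  FOR i IN 1 .. p.COUNT LOOP
    v_risk := v_risk + p(i) * (w(i) / 100); -- Pi * Wi, with Wi taken as a fraction here
  END LOOP;

  v_risk := (1 - v_threshold) * v_risk;     -- equation (7.9)
  DBMS_OUTPUT.PUT_LINE('Risk score = ' || TO_CHAR(ROUND(v_risk, 4)));
END;
/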
The weightages of the different parameters have been derived and implemented using artificial intelligence. However, the application is not tied to these weightages; they are made dynamic and can be changed if any credit card company wishes to do so. It is also observed that within a particular month or period fraudsters become very active and fraudulent transactions increase drastically. The dynamic weightage is therefore useful, because we can give more weightage to any sensitive parameter when there is a fear of fraudsters in a particular time period.
7.2.3 Rules
Various guidelines are given on several websites and in print and electronic media as indications of fraudulent transactions. These guidelines are implemented as rules in the TRSGM.
is also considered as sensitive, it is monitored by the TRSGM also and risk score
is also generated according to the duration of time since the last transaction
performed.
• Generally, a customer does not purchase a costly, luxury product again within a short time. So the TRSGM raises an alarm by generating a risk score if a similar event occurs on the same card.
• An overseas transaction is also considered highly sensitive by the TRSGM if no overseas transaction has been performed on the same card in the past.
HTD is the transaction repository component of the proposed TRSGM, which is stored in
the data warehouse. The expected behavior of a fraudster is to maximize his benefit from
a stolen card. This can be achieved by carrying out high value transactions frequently.
However, to avoid detection, the fraudsters can make either high value purchases at
longer time gaps or smaller value purchases at shorter time gaps. Contrary to such usual
behavior, a fraudster may also carry out low value purchases at longer time gaps. This
would be difficult for the TRSGM to detect if it resembles the genuine cardholder’s
profile. However, in such cases, the total loss incurred by the credit card company will
also be quite low.
To capture the frequency of card use, we consider the time gap between successive
transactions on the same card. The transaction gap is divided into seven mutually
exclusive and exhaustive events – E1, E2, E3, E4, E5, E6 and E7. Occurrence of each event
depends on the time since the last purchase (transaction gap) on any particular card. All the
events have already been defined according to equations (7.1) to (7.7).
The Event E is the union of all the seven events E1, E2, E3, E4, E5, E6 and E7 such that:
P(E) = Σ (i = 1 to 7) P(Ei) = 1 (7.21)
Now compute P(Ei|f) and P(Ei|¬f) from the normal transaction set of that card holder and the generic fraud transaction set. P(Ei|f) measures the probability of occurrence of Ei given that the transaction originates from a fraudster, and P(Ei|¬f) measures the probability of occurrence of Ei given that it is genuine. These likelihood functions are combined as follows:
P(Ei) = P(Ei|f) * P(f) + P(Ei|¬f) * P(¬f) (7.24)
P(f|Ei) = [P(Ei|f) * P(f)] / P(Ei) (7.25)
P(f|Ei) = [P(Ei|f) * P(f)] / [P(Ei|f) * P(f) + P(Ei|¬f) * P(¬f)] (7.26)
We use Bayesian learning once the transaction is found suspicious, in the light of the new evidence Ei. Ψ is the probability that the current transaction is fraudulent. The credit card fraud detection problem has the following two hypotheses: f: fraud and ¬f: not fraud. By substituting the values obtained from equations (7.22) and (7.23) in (7.26), the posterior probability for the hypothesis f: fraud is given as:
P(fraud|Ei) = [P(Ei|fraud) * P(fraud)] / [P(Ei|fraud) * P(fraud) + P(Ei|¬fraud) * P(¬fraud)] (7.27)
P(¬fraud|Ei) = [P(Ei|¬fraud) * P(¬fraud)] / [P(Ei|¬fraud) * P(¬fraud) + P(Ei|fraud) * P(fraud)] (7.28)
Depending on which of the two posterior values is greater, future actions are decided by
the TRSGM.
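A minimal sketch of this posterior update is given below; the prior P(fraud) and the likelihoods P(Ei|fraud) and P(Ei|¬fraud), which in the TRSGM are estimated from the generic fraud transaction set and the card holder's normal transaction set, are illustrative values here.

DECLARE
  p_fraud      NUMBER := 0.1;    -- illustrative prior probability of fraud
  p_e_fraud    NUMBER := 0.60;   -- illustrative P(Ei | fraud)
  p_e_genuine  NUMBER := 0.15;   -- illustrative P(Ei | not fraud)
  post_fraud   NUMBER;
  post_genuine NUMBER;
BEGIN
  -- equation (7.27); equation (7.28) is simply its complement
  post_fraud   := (p_e_fraud * p_fraud) /
                  (p_e_fraud * p_fraud + p_e_genuine * (1 - p_fraud));
  post_genuine := 1 - post_fraud;

  IF post_fraud > post_genuine THEN
    DBMS_OUTPUT.PUT_LINE('Declared fraudulent, posterior = ' || ROUND(post_fraud, 4));
  ELSE
    DBMS_OUTPUT.PUT_LINE('Treated as genuine, posterior = ' || ROUND(post_genuine, 4));
  END IF;
END;
/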
7.3 ALGORITHM
The working principle of the proposed TRSGM is presented in Algorithm 7.1. It takes the transaction parameters - card id, transaction amount, product, product category, shipping address, location id from where the transaction is performed and transaction day type (working day or normal day) - as well as the design parameters ε, MinPts and Wi (weightage of the parameter Pi) as input.
An incoming transaction is first checked for an address mismatch. If the shipping address and billing address are found to be the same, then the transaction is considered genuine, it is approved, and no other check is performed. Otherwise, the incoming transaction amount is checked against the clusters formed by the DBSCAN algorithm for its coverage. If the coverage is found to be more than 10%, then the transaction is considered genuine, it is approved, and no other check is performed on the transaction. Then the linear equation over the patterns generated by TPGT, along with their weightages (Wi), generates a risk score for the transaction. If the risk score < 0.5, the transaction is considered genuine and is approved. On the other hand, if the risk score > 0.8, then the transaction is declared fraudulent and manual confirmation is made with the cardholder. In case 0.5 ≤ risk score ≤ 0.8, the transaction is allowed but the card Ck is labeled as suspicious. If this is the first suspicious transaction on this card, the field suspect_count is incremented to 1 for this card number in a suspect table. The TRSGM then waits until the next transaction occurs on the same card number.
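The threshold logic just described can be summarised in a few lines of PL/SQL; the variable names and the example score are illustrative.

DECLARE
  v_risk_score NUMBER := 0.65;     -- illustrative score from the first four components
  v_status     VARCHAR2(12);
BEGIN
  IF v_risk_score < 0.5 THEN
    v_status := 'GENUINE';         -- approve; no further check is performed
  ELSIF v_risk_score > 0.8 THEN
    v_status := 'FRAUDULENT';      -- manual confirmation is made with the cardholder
  ELSE
    v_status := 'SUSPICIOUS';      -- allow, label the card and increment suspect_count
  END IF;
  DBMS_OUTPUT.PUT_LINE('Transaction status: ' || v_status);
END;
/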
When the next transaction occurs on the same card Ck, it is also passed to the TRSGM. The first four components of the TRSGM again generate a risk score for the transaction. In case the transaction is found to be suspicious, the following events take place. Since each transaction is time stamped, from the time gap g between the current and the last transaction the TRSGM determines which event Ei has occurred out of the seven and retrieves the corresponding P(Ei|f) and P(Ei|¬f). The posterior probabilities P(f|Ei) and P(¬f|Ei) are next computed using Eqs. (7.14) and (7.15). If P(f|Ei) > P(¬f|Ei)
ALGORITHM 7.1:
Input: Ck, Tamount(i), Saddr, Location, ε, MinPts, categoryi, producti, selleri, day_type, Wi,
no_of_products // Number of products the customer has ordered online
Tamount_daily ; // It stores total amount of current day purchase and update table
customer_dailycount accordingly
Ttotal_daily ; // It stores total number of current day transactions and update table
customer_dailycount accordingly
Tamount_weekly; // It stores total amount of current week purchase and update table
customer_weeklycount accordingly
Ttotal_weekly ; // It stores total number of current week transactions and update table
customer_weeklycount accordingly
Tamount_fortnightly ; // It stores total amount of current fortnight transactions and update table
customer_fortnightlycount accordingly
Ttotal_fortnightly ; // It stores total number of current fortnight transactions and update table
customer_fortnightlycount accordingly
Tamount_monthly; // It stores total amount of current month transactions and update table
customer_monthlycount accordingly
Ttotal_monthly ; // It stores total number of current month transactions and update table
customer_monthlycount accordingly
Tamount_sunday ; // It stores total amount of current day(if Sunday) purchase and update
table customer_sundaycount accordingly
Ttotal_sunday ; // It stores total number of current day(if Sunday) transactions and update
table customer_dailycount accordingly
Tamount_holiday ; // It stores total amount of current day(if holiday) purchase and update
table customer_dailycount accordingly
Ttotal_holiday ; // It stores total number of current day(if holiday) transactions and update
table customer_dailycount accordingly
Ψ=0
trans_amount = 0
i=1
Tamount_daily=Tamount_daily + Tamount;
Ttotal_daily=Ttotal_daily + 1;
Update_customer_daily_count_table(Tamount_daily, Ttotal_daily );
End if; // End of current day
risk_score Ψ = generate_and_update_risk_score_8 (DP);//DP: Daily
Parameters
// At the end of day, trigger is automatically executed and update
Table customer_dailycount(Tamount_daily=0, Ttotal_daily=0 )
Tamount_weekly=Tamount_weekly + Tamount;
Ttotal_weekly=Ttotal_weekly + 1;
Update_customer_weekly_count_table(Tamount_weekly, Ttotal_weekly );
End if; // End of current week
risk_score Ψ = generate_and_update_risk_score_9 (WP);
//WP: Weekly Parameters
// At the end of week, trigger is automatically executed and update
table customer_weeklycount(Tamount_weekly=0, Ttotal_weekly=0 )
Tamount_fortnightly=Tamount_fortnightly + Tamount;
Ttotal_fortnightly=Ttotal_fortnightly + 1;
Update_customer_fortnightlycount_table(Tamount_fortnightly, Ttotal_fortnightly );
End if; // End of current fortnight
risk_score Ψ = generate_and_update_risk_score_10 (FP);
//FP: Fortnightly Parameters
// At the end of fortnight, trigger is automatically executed and update
table customer_fortnightlycount(Tamount_fortnightly=0, Ttotal_fortnightly=0 )
Tamount_monthly=Tamount_monthly + Tamount;
Ttotal_monthly=Ttotal_monthly + 1;
Update_customer_monthly_count_table(Tamount_monthly, Ttotal_monthly );
End if; // End of current month
risk_score Ψ = generate_and_update_risk_score_11 (MP);
//MP: Monthly Parameters
// At the end of month, trigger is automatically executed and update
Table customer_monthlycount(Tamount_monthly=0, Ttotal_monthly=0 )
Variable Meaning
Ck Current online transaction is performed on a card Ck
Tamount Purchase amount of current online transaction of each product
Trans_amount Total purchase amount of all the products of current online transaction
Ttotal_Sunday It stores total number of current day(if Sunday) transactions till the
current day is completed
Tamount_holiday It stores total amount of current day(if holiday) purchase till the
current day is completed
Ttotal_holiday It stores total number of current day(if holiday) transactions till the
current day is completed
Ψ Risk score generated by the model
Clusteri It indicates the particular cluster formed by DBSCAN Algorithm
g Transaction gap e.g. Number of hours since the last transaction on the
same card
E The model finds the event based on equations (7.1) to (7.7)
Ef Probability of event E coming from the fraudulent transaction set
E¬f Probability of event E coming from the normal transaction set
Posterior_f Posterior probability of event E that the transaction is fraudulent
Posterior_¬f Posterior probability of event E that the transaction is genuine
Suspect_count If the current transaction is found suspicious, then the value of
suspect_count is incremented to 1 and system waits for the next
transaction.
Wi Weightage of the parameter
• First, the algorithm checks the shipping address entered by the customer against the billing address given by the customer while performing the online transaction. If both are the same, it considers the transaction highly genuine and generates a risk score of 0.
• If the shipping address is different from the billing address, then the algorithm checks the parameter AP1 generated by TPGT to see whether past transactions have been successfully performed with the same shipping address. If products have been successfully shipped to the current shipping address, then it also considers the transaction highly genuine and generates a risk score of 0.
the transactions are found genuine then they are stored in the data warehouse, so
the next parameters are generated accordingly by TPGT.
• The block diagram of the proposed financial cyber crime detection system is shown in Figure 7.1, which is a brief pictorial representation of Algorithm 7.1.
Figure 7.1 Block Diagram of Proposed Financial Cyber Crime Detection System
(The current transaction and the patterns are fed to the TRSGM; based on the risk score (< 0.8 or not) the transaction is routed to genuine, suspicious or fraudulent, and for a suspicious transaction the comparison of the posteriors Po_f > Po_¬f decides the final outcome.)
Here we have generated scatter graphs of the different clusters formed by the DBSCAN algorithm on the transaction amount attribute for various customers. In all the examples, ε = 500 and MinPts = 5 were taken.
Figure 7.2 Graph of clusters formed by DBSCAN algorithm for Card id=1 (x-axis: Transaction Amount, y-axis: Cluster Number)
Figure 7.3 Graph of clusters formed by DBSCAN algorithm for Card id=5 (x-axis: Transaction Amount, y-axis: Cluster Number)
Figure 7.4 Graph of clusters formed by DBSCAN algorithm for Card id=100 (x-axis: Transaction Amount, y-axis: Cluster Number)
Figure 7.5 Graph of clusters formed by DBSCAN algorithm for Card id=1507 (x-axis: Transaction Amount, y-axis: Cluster Number)
Here a result is shown of the clusters formed by the DBSCAN algorithm, implemented in the data mining application, for the various transaction amounts spent by the customer having card id 1507.
The implementation of the FCDS has been done in Oracle 9i. The data warehouse is designed and implemented in Oracle 9i and consists of a number of tables, as shown in Chapter 6; descriptions of all the tables are also given in the same chapter. Lookup tables are designed to store the current spending behavior of the customer. The current online transaction is given as input to the FCDS. The linear equation, along with the rules implemented in the TRSGM, generates a risk score for this transaction.
Stored procedures, functions, packages and triggers were written to facilitate the functioning of the setup. These were used to check the deviation of each transaction from the customer's normal profile.
The following trigger is automatically executed when logging into the system and
updates all the lookup tables according to their specified time duration.
begin
select logon_day into previousday from user_log_master;
commit;
As discussed in Chapter 6, TPGT generates the parameters GP1 to GP7 for the inter-transaction gap (the time duration between every two successive transactions on the same card). For this, the following procedure time_previous_transaction() is implemented in the data mining application.
/* This procedure finds the time difference in days, hours, minutes and seconds between
each two successive transactions. */
PROCEDURE time_previous_transaction(a_array1 IN tpg_date_array,time_diff out
tpg_array,days out tpg_array,hrs out tpg_array,mins out tpg_array,secs out tpg_array) is
hrs_frac number(12,6);
mins_frac number(12,6);
secs_frac number(12,6);
hrs_int number(12,6);
mins_int number(12,6);
secs_int number(12,6);
hrs_full number(12,6);
mins_full number(10,2);
secs_full number(12,6);
index_time number(7):=1;
BEGIN
for i in 2..a_array1.LAST
LOOP
select (a_array1(i) - a_array1(i-1)) into time_diff(index_time) from
dual;
SELECT floor(((a_array1(i)-a_array1(i-1))*24*60*60)/3600),
floor((((a_array1(i)-a_array1(i-1))*24*60*60) -
floor(((a_array1(i)-a_array1(i-1))*24*60*60)/3600)*3600)/60),
round((((a_array1(i)-a_array1(i-1))*24*60*60) -
floor(((a_array1(i)-a_array1(i-1))*24*60*60)/3600)*3600 -
(floor((((a_array1(i)-a_array1(i-1))*24*60*60) -
floor(((a_array1(i)-a_array1(i-1))*24*60*60)/3600)*3600)/60)*60) ))
into hrs_frac,mins_frac,secs_frac
FROM dual;
days(index_time):=floor(hrs_frac/24);             -- whole days in the gap
hrs_int:=hrs_frac/24-floor(hrs_frac/24);
hrs_full:=floor(hrs_int*24);                      -- remaining whole hours
mins_full:=floor((hrs_full - floor(hrs_full))*60);
hrs(index_time):=hrs_full;
secs_full:=floor((mins_full - floor(mins_full))*60);
mins(index_time) := mins_full + mins_frac;
secs(index_time):=secs_full + secs_frac;
index_time := index_time + 1;                     -- move to the next transaction gap
END LOOP;
END;
To find the maximum value from a given array, the find_maximum() function is implemented in the data mining application. This function is called several times to find the maximum amount of a transaction and the maximum number of transactions.
/* This function finds the maximum value from the given array. */
FUNCTION find_maximum(a_array1 IN tpg_array) RETURN number IS
max_value number(12,2);
BEGIN
max_value:=0;
for m in a_array1.FIRST .. a_array1.LAST
loop
if a_array1(m) > max_value then
max_value:=a_array1(m);
end if;
end loop;
return max_value;
END;
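A short illustrative call of this function is given below. It assumes that tpg_array, which is declared elsewhere in the application, is a dense numeric associative array that can be filled by direct index assignment and that find_maximum() is callable in this scope (if it lives inside a package, the package name must prefix the call; if tpg_array is a nested table, it would first have to be initialised and extended).

DECLARE
  amounts tpg_array;
BEGIN
  amounts(1) := 1200;
  amounts(2) := 450;
  amounts(3) := 7800;
  DBMS_OUTPUT.PUT_LINE('Maximum transaction amount = ' || find_maximum(amounts));
END;
/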
In this way, several procedures, functions, triggers and packages are implemented in the data mining application.
Figure 7.18 Sample output of Data Mining Application for Genuine Transaction - I
Figure 7.19 Sample output of Data Mining Application for Genuine Transaction - II
Figure 7.20 Sample output of Data Mining Application for Genuine Transaction -III
Figure 7.21 Sample output of Data Mining Application for Fraudulent Transaction - I
Figure 7.22 Sample output of Data Mining Application for Fraudulent Transaction - II
Figure 7.23 Sample output of Data Mining Application for Fraudulent Transaction - III
Here a sample of a suspicious transaction is shown along with the probability of the transaction being genuine or fraudulent. A snapshot of the table suspect is also shown, where the field suspect_count is incremented.
Figure 7.24 Sample output of Data Mining Application for Suspicious Transaction - I
Figure 7.25 Sample output of Data Mining Application for Suspicious Transaction - II
Figure 7.26 Sample output of Data Mining Application for Suspicious Transaction - III
Figure 7.27 Sample output of Data Mining Application for Multiple Order Product Support - I
Figure 7.28 Sample output of Data Mining Application for Multiple Order Product Support - II
Figure 7.29 Sample output of Data Mining Application for Multiple Order Product Support - III
• The most interesting result of the TRSGM is that the risk score it generates is very dynamic: if the customer makes a purchase with a very minor change in the transaction amount while all other inputs are kept the same, the risk score generated is still different. This minor change is reflected in the risk score. We have run the application several times for different transaction amounts with slight variations, keeping all other inputs fixed. Before taking the result for the second time and onwards, we also reset all the lookup tables. Here is an example.
Table 7.2 Sample output of the application for different transaction amounts
Figure 7.30 Sample output of Data Mining Application for different transaction amounts - I
Figure 7.31 Sample output of Data Mining Application for different transaction amounts - II
Figure 7.32 Sample output of Data Mining Application for different transaction amounts - III
Figure 7.33 Sample output of Data Mining Application for different transaction amounts - IV
In the same way, we have changed the seller while keeping the same product, category, amount, shipping address and location. It is observed that this change is also reflected in the risk score. Here is an example.
Figure 7.34 Sample output of Data Mining Application for different sellers - I
Figure 7.35 Sample output of Data Mining Application for different sellers - II
Figure 7.36 Sample output of Data Mining Application for different sellers - III
Figure 7.37 Sample output of Data Mining Application for different sellers - IV
We have also checked that if the customer purchases the same product, category, amount, seller and shipping address from a different location, then this change is reflected in the risk score. Here is an example.
Figure 7.38 Sample output of Data Mining Application for different locations - I
Figure 7.39 Sample output of Data Mining Application for different locations-II
Figure 7.40 Sample output of Data Mining Application for different locations - III
Figure 7.41 Sample output of Data Mining Application for different locations - IV
• The application finds the cluster coverage of each new incoming transaction amount, and if it is greater than 10% the model assumes that it is a genuine transaction, treating it as a regular payment of the customer. So the application generates a risk score of 0 for the transaction. Here is an example.
Figure 7.42 Sample output of Data Mining Application for Cluster Coverage
• The author has extensively run the application and checked that a transaction which closely matches the customer's purchasing habits (i.e. maximum purchases in this category, maximum number of transactions in this time frame, maximum number of transactions ordered from the same location, etc.) generates the least risk score. A transaction which does not fall within the customer's purchasing habits and deviates more from the normal profile generates a higher risk score. Here is an example; as more and more transactions are performed within this particular set, the risk score decreases further.
The customer having card id 1570 has his maximum purchasing activity in the following fields:
Category :2
Time frame : 18:01 to 21:00
Location Id : 205
Seller Id : 257
• In the domain of credit card fraud detection, the system should not raise too many false alarms (i.e. genuine transactions should not be caught as fraudulent), because a credit card company needs to minimize its losses but, at the same time, does not wish the cardholder to feel restricted too often. In the same way, fraudulent transactions should not go undetected. Considering both of these concerns, the model is designed to be flexible. Here we have taken the upper threshold value as 0.8, but with more learning it can be changed. All the parameters' weightages are also set according to the recommendation of the credit card company.
• There is one interesting result from Bayesian learning. The customer having card id 8 first performs a transaction of 17000, which is considered suspicious. After a short while he performs another transaction of 13500, which is predicted as fraudulent by Bayesian learning. Once a transaction is found suspicious, the time duration since the last transaction is also stored in the table suspect. If we consider both transactions individually they seem to be normal, but it is the power of Bayesian learning that the occurrence of the subsequent transaction so soon after the first is predicted as fraudulent. Here is an example.
Figure 7.47 Sample output of Data Mining Application for Bayesian Learning-II
7.8 REFERENCES
CHAPTER 8: PROPOSED FINANCIAL CYBER CRIME PREVENTION MODEL & CONCLUSION
As we discussed in Chapter 1, different methods such as First Virtual, CyberCash and SET are used for financial cyber crime prevention. These systems are highly secure but are rarely used by customers and merchants. These models secure our transactions over the internet but cannot stop any forgery if the credit card information is lost physically or when the customer gives his information into the wrong hands.
Anshul Jain et al. [1] have given an Internet Virtual Credit Card Model. In this model, a login id and password are given by the bank. Then, after logging into the bank's website, a virtual credit card number and the expiry date of this virtual credit card are issued by the bank. So the customer has to provide and remember four details - login id, password, virtual credit card number and the expiry date of this virtual card - while performing an online transaction. In my opinion, this creates overhead for the customer and an extra burden of remembering these additional details.
Recently in India, the Reserve Bank of India has mandated all banks to issue a separate password to their credit card holders for online transactions. In other countries this tactic is already being used. In my opinion, this tactic is not enough to prevent fraud: the first transaction is highly secure, but the subsequent transactions we cannot
surely consider as highly secure, because while the customer performs the first transaction, the password can be stolen by a fraudster by hacking the computer or by other tactics. Also, the card holder is not given any control or flexibility at his own end to prevent the fraud.
Considering all the limitations of the above models, the following financial cyber crime prevention model has been proposed.
In this model, not only is a separate password for online transactions given to the credit card holder, but the validity of this password is also set by the card holder according to his choice. The customer has to log into the bank's website. There he can set his password along with the expiry date of this password for online transactions. Whenever the customer performs an online transaction, he requires the password to complete the transaction. The model checks the validity of the password; if the password has expired, he is not able to complete the transaction. If he is the genuine card holder, he then has to log into the bank's website and set a new password and an expiry date for this password.
So in this model the password remains valid only until its expiry date. When the password expires, the customer has to obtain a password along with its validity from the bank again. The expiry date selected by the customer must lie between the present date and the actual expiry date of the card.
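A minimal sketch of the validity check at the heart of this model is shown below; the table online_password_master and its columns are hypothetical names introduced only for illustration, and in practice the password would of course be stored in hashed form rather than in plain text.

DECLARE
  v_valid NUMBER;
BEGIN
  -- the transaction may proceed only if the self-chosen password matches
  -- and its customer-selected expiry date has not yet passed
  SELECT COUNT(*)
    INTO v_valid
    FROM online_password_master
   WHERE creditcard_id        = 1507                 -- illustrative card id
     AND online_password      = 'user_chosen_word'   -- password entered at checkout
     AND password_expiry_date >= SYSDATE;

  IF v_valid = 1 THEN
    DBMS_OUTPUT.PUT_LINE('Password valid - transaction may proceed');
  ELSE
    DBMS_OUTPUT.PUT_LINE('Password wrong or expired - transaction refused');
  END IF;
END;
/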
So here control and flexibility are given to the customer at his own end. He can set the expiry date of the password according to his convenience, keeping in mind the avoidance of forgery. Compared with the Internet Virtual Credit Card Model, the customer cannot easily remember the virtual credit card number, as it is long and issued by the bank, so he has to store it somewhere. In our model, the password is chosen by the user, so it is a user-defined word that he can easily keep in mind. Customers who transact very often could set the expiry date of their password very short, in order
to avoid forgery. He can also set the expiry date of the password such that it expires on the next day. In this way the customer can himself make each of his online transactions very safe. Customers who do not transact very often, or who consider this an overhead, can select a long expiry date. Whenever financial cyber crime increases drastically in a particular month, the user can set a shorter expiry date to avoid forgery.
Thus the benefit of this model is that the user can temporarily suspend his credit or debit card by setting a short expiry date whenever he fears that his information may have been stolen or when a period comes in which cyber crime cases increase drastically. Then no one can use his or her credit card information for online purchases.
allow any parameter to increase its share in the final risk score as it maps the
value in the range [0, 1].
• Flexibility: In consultation with the bank, the weightages of the different parameters have been derived and implemented in the software. But the software is not tied to these weightages only; it is flexible, and we can change the weightage of any parameter according to the recommendations of the credit card company.
• The developed data mining application is intended only for those customers who make credit card purchases frequently. It is not for those who transact once or very rarely in a year. The model has to learn all the purchasing habits of the customer so that it can predict properly for a new incoming transaction. As more and
more transactions are performed by the customer, the model becomes stronger,
learns the customer behavior and predicts the transaction more accurately.
• The application is also not used for a new customer, for the same reason as above.
• Though the application is global and implemented keeping all countries in view, the holiday parameter is not the same for all countries. So the application requires minor changes to accommodate this difference.
• In the current work, the location from which the customer performs the online transaction is considered. The computer on which the online transaction is performed is not taken into account, but in future work the IP address can also be considered and patterns can be generated for this IP address. The only problem is that an IP address is not static but dynamic, so care should be taken when considering it as a parameter.
• It may be worthwhile to generate more parameters to closely match the
customer’s purchasing habits.
• More dynamic rules can be derived from the historical data and applied for the
initial belief.
• Full care has been taken to ensure that the research is designed and conducted to achieve the research objectives. This is really a thrilling domain, in which one cannot stop; it requires constant refreshing to incorporate the dynamic changes that occur in real problems.
• Though the data mining algorithm DBSCAN is implemented only for the transaction amount, it can be implemented for other attributes as well.
8.6 REFERENCES