THESIS
SUBMITTED TO
GANPAT UNIVERSITY
KHERVA
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE AND APPLICATION)
BY
JYOTINDRA N. DHARWA
A.M.PATEL INSTITUTE OF COMPUTER STUDIES
GANPAT UNIVERSITY, KHERVA
DR. A. R. PATEL
DIRECTOR, DEPARTMENT OF COMPUTER SCIENCE
HEMCHANDRACHARYA NORTH GUJARAT UNIVERSITY, PATAN
APRIL 2010
CONTENTS
Abstract I
Acknowledgement II
List of Tables V
Chapter Contents XI
Chapter 1: Introduction 1
(TRSGM)
Chapter 8: Proposed Financial Cyber Crime Prevention Model & Conclusion 245
ABSTRACT
The Internet in India is growing rapidly. It has given rise to new opportunities in every
field we can think of, be it entertainment, business, sports or education. Like every coin,
however, the Internet has two sides, and one of its major disadvantages is cyber crime:
illegal activity committed on the Internet. Connecting to such a large network also exposes
us to security risks. Computers today are being misused for illegal activities like e-mail
espionage, credit card fraud, spam, software piracy and so on, which invade our privacy and
offend our senses. Criminal activities in cyberspace are on the rise.
Developing a financial cyber crime detection system is a challenging task. When an online
transaction is performed with a credit card, no system can declare with certainty that the
transaction is fraudulent; it can only estimate the likelihood that it is.
We propose a novel approach for online transaction fraud detection, which combines evidence
from current as well as past behavior. The proposed Transaction Risk Score Generation Model
(TRSGM) consists of five major components, namely, the DBSCAN algorithm, a linear equation,
rules, a historical transaction database and a Bayesian learner. The DBSCAN algorithm is used
to form clusters of the customer's past transaction amounts and to find the deviation of a new
incoming transaction amount as well as the cluster coverage. The patterns generated by the
Transaction Pattern Generation Tool (TPGT), along with their weights, are used in the linear
equation to generate a risk score for each new incoming transaction. The guidelines published
on various web sites and in the print and electronic media as indications of fraudulent online
transactions for credit card companies are implemented as rules in TRSGM.
Using the first four components, we determine the suspicion level of each incoming transaction
based on the extent of its deviation from good patterns. The transaction is classified as
genuine, fraudulent or suspicious depending on this initial belief. Once a transaction is found
to be suspicious, the belief is further strengthened or weakened according to its similarity
with the fraudulent or genuine transaction history using Bayesian learning.
ACKNOWLEDGEMENT
I take this opportunity to express my deep gratitude towards my Ph.D. guide,
Dr. A. R. Patel, for his suggestions and constant inspiration at every stage of the research.
He is an extremely sympathetic and principle-centered person. His skills as a researcher
and guide helped me to overcome all the hurdles. Without his constant support and
encouragement, I would not have been able to complete my research work successfully.
I am thankful to my brother, a manager, who provided me with statistical data at the initial
stage of the research; my initial model design was possible only due to his support.
I would like to thank my colleagues Dr. N. J. Patel, Dr. S. M. Parikh and staff at Acharya
Motibhai Patel Institute of Computer Studies for their invaluable encouragement and
help.
My parents have their own share in my success. I firmly believe that their blessings
always enlighten my path ahead. I salute my father Nathalal and my late mother Menaben.
I would like to thank my brother Navinbhai, sister-in-law Lilaben, and nephews Vikas and
Kunal for their love, blessings and moral support throughout my research work. I give my
special thanks to my wife Urmila and daughters Mudra and Aditi, without whose support and
sacrifice this thesis would not have been possible.
Finally, I thank one and all for the divine blessings.
Jyotindra N. Dharwa
CERTIFICATE
I hereby certify that Mr. Jyotindra N. Dharwa has completed his Ph.D. thesis for the
doctorate degree on the topic “Data Mining Techniques: Study, Analysis, Prevention &
Detection for Financial Cyber Crime and Frauds”.
I further certify that the whole work done by him is his own and original, and that no
degree, diploma or distinction has been conferred on him on the basis of this thesis by
either Ganpat University or any other university.
DECLARATION
I, Mr. Jyotindra N. Dharwa, hereby declare that my Ph.D. thesis titled “Data Mining
Techniques: Study, Analysis, Prevention & Detection for Financial Cyber Crime and
Frauds” is written as a partial fulfillment of the requirements for a doctorate degree on
the topic. The complete study is based on a literature survey, the study of periodicals,
journals and websites, and the building of a model for proving the concept studied and designed.
I further declare that the complete thesis work, including all analysis, hypotheses,
inferences and interpretation of data and information, has been done by me and is my own,
original work. Moreover, I declare that no degree, diploma or distinction has previously
been conferred on me by Ganpat University or any other university on the basis of this
thesis.
LIST OF TABLES
Chapter 2
Table 2.1 Steps in the Evolution of Data Mining 22
Table 2.2 Time Line of Data Mining Development 25
Table 2.3 Initial Weight Values for the Neural Network Shown in Figure 2.4 45
Table 2.4 Comparison of Clustering Algorithms 89
Table 2.5 Data Mining Technique for Data Mining Task 90
Chapter 3
Table 3.1 Average (Median) Loss Per Typical Complaint Demographics 109
Table 3.2 Losses based on fraud category wise 110
Chapter 5
Table 5.1 Transaction 126
Table 5.2 Customer_Master 127
Table 5.3 Creditcard_Master 128
Table 5.4 Seller_Master 128
Table 5.5 Address_Master 129
Table 5.6 Product_Master 130
Table 5.7 Product_Category_Master 130
Table 5.8 Shipping_Master 131
Table 5.9 Location_Master 132
Table 5.10 City_Master 133
Table 5.11 State_Master 133
Table 5.12 Country_Master 134
Table 5.13 User_Log_Master 134
Table 5.14 Cardholder_Master 135
Table 5.15 Fraud 136
Table 5.16 Suspect 137
Table 5.17 Customer_DailyCount 138
Table 5.18 Customer_WeeklyCount 139
Table 5.19 Customer_FortnightlyCount 139
Table 5.20 Customer_MonthlyCount 140
Table 5.21 Customer_SundayCount 140
Table 5.22 Customer_HolidayCount 141
Table 5.23 Statistical data of expenditure in category by income 144
Table 5.24 Components of Gaussian distribution 144
Table 5.25 Sample Data of Table Transaction 145
Table 5.26 Credit Card Parameters 152
Chapter 7
Table 7.1 Parameters of the Equation 192
Table 7.2 Sample output of the application for different transaction amounts 231
Table 7.3 Sample output of the application for different sellers 234
Table 7.4 Sample output of the application for different locations 237
LIST OF FIGURES
Chapter 2
Figure 2.1 Historical Perspective of Data Mining 24
Figure 2.2 Decision Tree for Example 2.1 38
Figure 2.3 Decision Tree for Example 2.2 40
Figure 2.4 A Fully Connected Feed-Forwarded Neural Network 44
Figure 2.5 Radial Basis Function Network 70
Figure 2.6 Classification of Clustering Algorithms 72
Figure 2.7 Example of Dendrogram 73
Chapter 3
Figure 3.1 Affecting the Person by Cyber Crime (in %) 93
Figure 3.2 IC3 Complaint Categories (in %) 107
Figure 3.3 Percentage of Referrals by Monetary Loss 108
Figure 3.4 Plastic Card Fraud Losses on UK-issued Cards 1998-2008 109
Figure 3.5 Percentage of Different Plastic Card Fraud Category in Year 1998 110
Figure 3.6 Percentage of Different Plastic Card Fraud Category in Year 2008 111
Figure 3.7 Internet/E-Commerce Fraud Losses on UK-issued Cards 111
Figure 3.8 Revenue Lost to Online Fraud (in %) 113
Chapter 4
Figure 4.1 Architecture of 2-Stage Solution 116
Chapter 5
Figure 5.1 Data warehouse Design Layout-I 142
Figure 5.2 Data warehouse Design Layout-II 143
Figure 5.3 Credit Card Number Semantic Graph 151
Figure 5.4 Sample of Credit Card 154
Chapter 6
Figure 6.1 Parameters of TPGT 156
Figure 6.2 Subparameters of DP 157
Figure 6.3 Subparameters of CP 158
Figure 6.4 Subparameters of PP 159
Figure 6.5 Subparameters of TP 160
Figure 6.6 Subparameters of WP 161
Figure 6.7 Subparameters of VP 162
Figure 6.8 Subparameters of AP 162
Figure 6.9 Subparameters of FP 163
Figure 6.10 Subparameters of MP 164
Figure 6.11 Subparameters of SP 164
Figure 6.12 Subparameters of HP 165
Figure 6.13 Subparameters of LP 166
Figure 6.14 Subparameters of GP 167
Chapter 7
Figure 7.1 Block Diagram of Proposed Financial Cyber Crime Detection System 209
Figure 7.2 Graph of clusters formed by DBSCAN algorithm for Card id=1 210
Figure 7.3 Graph of clusters formed by DBSCAN algorithm for Card id=5 210
Figure 7.4 Graph of clusters formed by DBSCAN algorithm for Card id=100 211
Figure 7.5 Graph of clusters formed by DBSCAN algorithm for Card id=1507 211
Figure 7.6 Sample output of Clusters formed by DBSCAN Algorithm – I 212
Figure 7.7 Sample output of Clusters formed by DBSCAN Algorithm – II 212
Figure 7.8 Sample output of Clusters formed by DBSCAN Algorithm – III 213
Figure 7.9 Sample output of Clusters formed by DBSCAN Algorithm – IV 213
Figure 7.10 Sample output of Clusters formed by DBSCAN Algorithm – V 214
Figure 7.11 Sample output of Clusters formed by DBSCAN Algorithm – VI 214
Figure 7.12 Sample output of Clusters formed by DBSCAN Algorithm – VII 215
Figure 7.13 Sample output of Clusters formed by DBSCAN Algorithm – VIII 215
Figure 7.14 Sample output of Clusters formed by DBSCAN Algorithm – IX 216
Figure 7.15 Sample output of Clusters formed by DBSCAN Algorithm – X 216
Figure 7.16 Sample output of Clusters formed by DBSCAN Algorithm – XI 217
Figure 7.17 Sample output of Clusters formed by DBSCAN Algorithm – XII 217
Figure 7.18 Sample output of Data Mining Application for Genuine Transaction - I 223
Figure 7.19 Sample output of Data Mining Application for Genuine Transaction - II 223
Figure 7.20 Sample output of Data Mining Application for Genuine Transaction - III 224
Figure 7.21 Sample output of Data Mining Application for Fraudulent Transaction - I 225
Figure 7.22 Sample output of Data Mining Application for Fraudulent Transaction - II 225
Figure 7.23 Sample output of Data Mining Application for Fraudulent Transaction - III 226
Figure 7.24 Sample output of Data Mining Application for Suspicious Transaction - I 227
Figure 7.25 Sample output of Data Mining Application for Suspicious Transaction - II 228
Figure 7.26 Sample output of Data Mining Application for Suspicious Transaction - III 228
Figure 7.27 Sample output of Data Mining Application for Multiple Order Product Support - I 229
Figure 7.28 Sample output of Data Mining Application for Multiple Order Product Support - II 229
Figure 7.29 Sample output of Data Mining Application for Multiple Order Product Support - III 230
Figure 7.30 Sample output of Data Mining Application for different transaction amounts - I 232
Figure 7.31 Sample output of Data Mining Application for different transaction amounts - II 232
Figure 7.32 Sample output of Data Mining Application for different transaction amounts - III 233
Figure 7.33 Sample output of Data Mining Application for different transaction amounts - IV 233
Figure 7.34 Sample output of Data Mining Application for different sellers – I 235
Figure 7.35 Sample output of Data Mining Application for different sellers – II 235
Figure 7.36 Sample output of Data Mining Application for different sellers – III 236
Figure 7.37 Sample output of Data Mining Application for different sellers – IV 236
Figure 7.38 Sample output of Data Mining Application for different locations – I 238
Figure 7.39 Sample output of Data Mining Application for different locations – II 238
Figure 7.40 Sample output of Data Mining Application for different locations – III 239
Figure 7.41 Sample output of Data Mining Application for different locations – IV 239
Figure 7.42 Sample output of Data Mining Application for Cluster Coverage 240
Figure 7.43 Sample output of Data Mining Application for maximum purchasing habit input - I 241
Figure 7.44 Sample output of Data Mining Application for maximum purchasing habit input - II 242
Figure 7.45 Sample output of Data Mining Application for maximum purchasing habit input - III 242
Figure 7.46 Sample output of Data Mining Application for Bayesian Learning - I 243
Figure 7.47 Sample output of Data Mining Application for Bayesian Learning - II 244
CHAPTERS CONTENTS
Chapter 1 Introduction 1
1.1 Motivation 2
1.2 Objective of the research 2
1.3 Related Work 4
1.3.1 In Fraud Detection 4
1.3.2 In Financial Cyber crime Prevention 11
1.4 Research Issues 13
1.5 Outline of the Research 15
1.6 References 16
2.7.7 Weaknesses 47
2.8 Genetic Algorithms 47
2.8.1 Where GAs can be used? 48
2.8.2 Explanation of terms 48
2.8.3 Applications of GA 50
2.8.4 Strengths of GA 50
2.8.5 Weaknesses of GA 50
2.9 Classification 51
2.9.1 Statistical-Based Algorithms 51
2.9.2 Distance-Based Algorithms 55
2.9.3 Decision Tree-Based Algorithms 58
2.9.4 Neural Network-Based Algorithms 65
2.10 Clustering 71
2.10.1 Hierarchical Algorithms 72
2.10.2 Agglomerative Algorithms 73
2.10.3 Partitional Algorithms 75
2.10.4 Clustering Large Databases 82
2.10.5 Comparison of Clustering Algorithms 87
2.11 Selection Criteria of a Data Mining Technique 87
2.12 References 90
3.5 Types of Fraud 95
3.5.1 Credit Card Fraud 95
3.5.2 Telecommunications Fraud 97
3.5.3 Computer Intrusion 98
3.6 Financial Crimes 99
3.6.1 Types of Financial Crimes 99
3.7 Ways of Online Banking Fraud 105
3.7.1 Phishing 105
3.7.2 Malware 105
3.7.3 Spyware 106
3.8 2008 Internet Crime Report 106
3.8.1 Complaint Characteristics 106
3.8.2 Case Studies of APACS 109
3.9 Online Fraud Report, Cybersource 2010 112
3.10 References 113
5.1.3 Fact Constellation Architecture 125
5.2 Fact Table 125
5.3 Dimensional Tables 127
5.4 Lookup Tables 138
5.5 Data Collection 143
5.6 Sample Data 145
5.7 Credit Card Number Generation 151
5.7.1 The Luhn Algorithm 152
5.7.2 An example of Luhn Validation Technique 153
5.8 References 154
6.2.5 Weekly Parameters (WP) 171
6.2.6 Seller or Vendor Parameters (VP) 171
6.2.7 Address Parameters (AP) 172
6.2.8 Fortnightly Parameters (FP) 172
6.2.9 Monthly Parameters (MP) 173
6.2.10 Sunday Parameters (SP) 173
6.2.11 Holiday Parameters (HP) 174
6.2.12 Location Parameters (LP) 175
6.2.13 Transaction Gap Parameters (GP) 176
6.3 Computations of the Patterns 177
6.3.1 TP1 to TP8 177
6.3.2 TP11 and TP12 179
6.3.3 GP1 to GP7 180
6.3.4 AP1 and AP2 181
6.4 References 182
7.5.2 Inter Transaction Gap Recording 219
7.5.3 Maximum Value Finding 222
7.6 Sample Results 223
7.6.1 Genuine Transaction 223
7.6.2 Fraudulent Transaction 225
7.6.3 Suspicious Transaction 227
7.6.4 Multiple Product Order Support 229
7.7 Result Analysis & Discussions 231
7.8 References 244
CHAPTER 1
INTRODUCTION
1.1 MOTIVATION
1.2 OBJECTIVE OF THE RESEARCH
1.3 RELATED WORK
1.4 RESEARCH ISSUES
1.5 OUTLINE OF THE RESEARCH
1.6 REFERENCES
The Internet in India is growing rapidly. It has given rise to new opportunities in every
field we can think of, be it entertainment, business, sports or education. Like every coin,
however, the Internet has two sides, and one of its major disadvantages is cyber crime:
illegal activity committed on the Internet. Connecting to such a large network also exposes
us to security risks. Computers today are being misused for illegal activities like e-mail
espionage, credit card fraud, spam, software piracy and so on, which invade our privacy and
offend our senses. Criminal activities in cyberspace are on the rise.
In today’s electronic society, e-commerce has become an essential sales channel for
global business. Due to the rapid advancement of e-commerce, the use of credit cards for
purchases has increased dramatically. Unfortunately, fraudulent or illegal use of credit
cards has also become an attractive source of revenue for fraudsters. Occurrences of credit
card fraud are increasing dramatically due to the exposure of security weaknesses in
traditional credit card processing systems, resulting in losses worth billions of dollars
every year. Fraudsters have become very dynamic and use sophisticated techniques to
perpetrate credit card fraud. These fraudulent activities present unique challenges
worldwide to banks and other financial institutions that issue credit cards.
According to the 2008 Internet Crime Report [41] of the Internet Crime Complaint Center
(IC3), from January 1, 2008 to December 31, 2008 the IC3 website received 275,284 complaint
submissions. This is a 33.1% increase over 2007, when 206,884 complaints were received.
These filings were composed of complaints primarily related to fraudulent and non-fraudulent
issues on the Internet. The dollar loss of referred complaints was at an all-time high in
2008, $264.59 million, exceeding the previous year's record-breaking dollar loss of
$239.09 million. On average, men lost more money than women.
A Gartner survey [40] of more than 160 companies reveals that 12 times more fraud exists
in Internet transactions than in traditional ones, and that e-tailers are paying credit card
discount rates that are 66 percent higher than traditional retailer fees. Moreover, web
merchants bear the liability and costs in cases of fraud, while credit card companies
generally absorb the fraud losses for traditional retailers.
1.1 MOTIVATION
The various cyber crime cases involving credit cards that appear frequently in the daily
newspapers, and their broad coverage in the television media, inspired me to work in this area.
The purpose of the research is, first, to discuss the different financial cyber crimes and
frauds seen today in the form of credit card fraud, phishing, etc.; secondly, to study
different data mining techniques such as neural networks, clustering techniques and decision
trees; and eventually to show how these techniques can be used and applied to detect
financial cyber crime and frauds.
Fraud Prevention describes measures to stop fraud occurring in the first place. In contrast,
fraud detection involves identifying fraud as quickly as possible once it has been
perpetrated. Fraud detection comes into play once fraud prevention has failed. In practice,
fraud detection must be used continuously, as one will typically be unaware that fraud
prevention has failed. We can try to prevent credit card fraud by guarding our cards
assiduously, but if nevertheless the card’s details are stolen, then we need to be able to
detect, as soon as possible, that fraud is being perpetrated.
Currently, data mining is a popular way to combat fraud because of its effectiveness.
The task of data mining is to analyze a massive amount of data and to extract usable
information that we can interpret for future use. In doing so, we have to define a clear
goal for the data mining exercise and find the right structure of a possible model or
patterns that fit the given data set. Once we have the right model for the data, we can
use the model to predict future events by classifying the data. In terms of data mining,
fraud detection can be understood as classification of the data. Input data are analyzed
with the appropriate model to determine whether they imply any fraudulent activity or not.
A well-defined classification model is developed by recognizing the patterns of former
fraudulent behavior. The model can then be used to predict any suspicious activities
implied by a new data set.
The prediction of user behavior in financial systems can be used in many situations.
Predicting client migration, marketing or public relations can save a lot of money and
other resources. One of the most interesting fields of prediction is fraud on credit lines,
especially credit card payments. With a high data traffic of 400,000 transactions per day,
even a 2.5% reduction in fraud translates into substantial savings every year.
Certainly, all transactions that deal with accounts of known misuse are not authorized.
Nevertheless, there are transactions which are formally valid, but which experienced people
can tell are probably misuse, caused by stolen cards or fake merchants. So the task is to
stop a fraudulent credit card transaction before it is known to be illegal.
Data mining methods have made the most impact on fraud detection. This is typically
because large quantities of the information are numerical or can easily be converted into
numerical form as counts and proportions. We should also consider the speed of detection.
A key issue of the proposed work is how effective the tools are in detecting fraud. A
problem in fraud detection is that one typically does not know how many fraudulent cases
slip through the net. In such applications, the average time to detection after fraud starts
(in minutes, number of transactions, etc.) should also be reported. Measures of this aspect
interact with measures of the final detection rate: in many situations an account, telephone,
etc. will have to be used for several fraudulent transactions before it is detected as
fraudulent, so that several false-negative classifications will necessarily be made.
Credit card fraud detection has drawn a lot of research interest, and a number of
techniques, with special emphasis on data mining, have been suggested. Ghosh and Reilly
[1] have developed a fraud detection system based on a neural network. Their system is
trained on a large sample of labeled credit card account transactions. These transactions
contain example fraud cases due to lost cards, stolen cards, application fraud, counterfeit
fraud, mail-order fraud and non-received issue (NRI) fraud.
E. Aleskerov et al. [2] present CARDWATCH, a database mining system used for credit
card fraud detection. The system is based on a neural learning module and provides an
interface to a variety of commercial databases.
Dorronsoro et al. [3] have noted two particular characteristics of fraud detection: a very
limited time span for decisions and a large number of credit card operations to be
processed. They have separated fraudulent operations from normal ones by using Fisher's
discriminant analysis.
Syeda et al. [4] have used a parallel granular neural network for improving the speed of
data mining and knowledge discovery in credit card fraud detection. A complete system
has been implemented for this purpose.
Chan et al. [5] have divided a large set of transactions into smaller subsets and then
applied distributed data mining for building models of user behavior. The resultant base
models are then combined to generate a meta-classifier for improving detection accuracy.
Chiu and Tsai [7] consider web services for data exchange among banks. A fraud pattern
mining (FPM) algorithm has been developed for mining fraud association rules which
give information regarding the new fraud patterns to prevent attacks.
Some survey papers have been published which categorize, compare and summarize
articles in the area of fraud detection. Phua et al. [8] did an extensive survey of data
mining based fraud detection systems and presented a comprehensive report. Kou et al.
[9] have reviewed the various fraud detection techniques for credit card fraud,
telecommunication fraud and computer intrusion detection. Bolton and Hand [10]
describe the tools available for statistical fraud detection and the areas in which fraud
detection technologies are most commonly used. D.W.Abbott et al. [21] compare five of
the most highly acclaimed commercial data mining tools on a fraud detection application,
with descriptions of their distinctive strengths and weaknesses, based on the lessons
learned by the authors during the process of evaluating the products. D.Yue et al. [32]
conduct an extensive literature review to answer questions such as: (1) Can financial
statement fraud (FSF) be detected, how likely is it, and how can it be done? (2) What data
features can be used to predict FSF? (3) What kinds of algorithms can be used to detect
FSF? (4) How can the performance of the detection be measured? and (5) How effective are
these algorithms in terms of fraud detection?
V.Hanagandi et al. [11] generate a fraud score using historical information on credit
card account transactions. They describe a fraud/non-fraud classification methodology
using a radial basis function network (RBFN) with a density-based clustering approach.
The input data are transformed into principal component space, and clustering as well as
RBFN modeling is done using a few principal components.
A.Shen et al. [12] investigate the efficacy of applying classification models to credit
card fraud detection problems. They tested three classification methods, i.e. neural
network, decision tree and logistic regression, for their applicability to fraud detection.
H.Shao et al. [13] introduced an application of data mining to detect fraudulent behavior
in customs declaration data, using data mining technology such as an easy-to-expand
multi-dimension-criterion data model and a hybrid fraud-detection strategy.
K.B.Bignell [14] outlines a framework for internet banking security using multi-layered,
feed-forward artificial neural networks.
A. Srivastava et al. [15] model the sequence of operations in credit card transaction
processing using a Hidden Markov Model (HMM) and show how it can be used for the
detection of frauds. An HMM is initially trained with the normal behavior of a cardholder.
If an incoming credit card transaction is not accepted by the trained HMM with sufficiently
high probability, it is considered to be fraudulent. At the same time, they also try to
ensure that genuine transactions are not rejected.
B.Zhang et al. [16] consider network-level features, such as users' beliefs about other
users, to deal with fraud in group behavior. They use a loopy belief propagation algorithm
and apply it to network-level fraud detection, classifying users as fraudsters, accomplices
or honest users.
J.E.Carbal et al. [17] propose a methodology based on rough sets and KDD for detecting
fraud committed by electrical energy consumers. This methodology performs a detailed
evaluation of the boundary region between fraudulent and normal customers, identifying
patterns of fraudulent behavior in the historical data sets of electricity companies. They
derive classification rules from these patterns, which permit the detection, in the databases
of electricity companies, of those clients that present fraudulent features.
J.Quah et al. [18] focus on real-time fraud detection and present a new and innovative
approach to understanding spending patterns in order to decipher potential fraud cases.
They make use of a self-organizing map to decipher, filter and analyze customer behavior
for the detection of fraud.
E.L.Barse et al. [19] generate synthetic test data for fraud detection in an IP-based
video-on-demand service while ensuring that important statistical properties of the
authentic data are preserved.
J.Xu et al. [20] present an anomaly detection technique based on behavior mining and
monitoring that works at both the individual and the system level. They utilize a
frequent-pattern tree to profile normal behavior adaptively and design a novel tree-based
pattern matching algorithm to discover individual-level anomalies.
Recently, a fraud detection system was developed by Suvasini Panigrahi et al. [22], which
consists of four components, namely, a rule-based filter, a Dempster-Shafer adder, a
transaction history database and a Bayesian learner. In the rule-based component, they
determine the suspicion level of each incoming transaction based on the extent of its
deviation from good patterns. Dempster-Shafer theory is used to combine multiple such
evidences and an initial belief is computed.
Yi Peng et al. [23] apply two clustering techniques, SAS EM and CLUTO, to a large real-
life health insurance dataset and compare the performances of these two methods.
J.Tuo et al. [24] propose a case-based genetic artificial immune system for fraud
detection (AISFD). Their system is a self-adapted system designed for credit card fraud
detection. With the case-based learning model and genetic algorithm, their system can
perform online learning with limited time and cost, and update its fraud detection
capability as transactions and commerce activities grow rapidly.
J.Kim et al. [25] propose a novel artificial immune system, called CIFD (Computer Immune
System for Fraud Detection), which adopts both negative selection and positive selection to
generate artificial immune cells. CIFD also employs an analogy of the self-major
histocompatibility complex (MHC) molecules when antigen data is presented to the
system. This novel mechanism improves the scalability of CIFD, which is designed to
process gigabytes or more of transaction data per day.
S.J.Stolfo et al. [26] developed the JAM distributed data mining system for the real-world
problem of fraud detection in financial information systems. They have shown that cost-
based metrics are more relevant in certain domains, and defining such metrics poses
significant and interesting research questions both in evaluating systems and alternative
models, and in formalizing the problems to which one may wish to apply data mining
technologies. They also demonstrate how the techniques developed for fraud detection
can be generalized and applied to the important area of intrusion detection in networked
information systems.
F.Yu et al. [27] focus on how to build a data-mining-algorithm-centered application system
for common users. They present a case study of building a fraudulent tax declaration
detection system using a decision tree classification algorithm.
A.Leung et al. [28] shed some light on the design issues of an add-on fraud detection
module, namely the Fraud Detection Manager. Their design is based on the concept of
atomic transactions, called Coupons, implemented in e-wallet accounts.
W.Chai et al. [29] propose a method to convert fraud classification rules learned from a
genetic algorithm to a fuzzy score representing the degree to which a company’s financial
statements match those rules.
B.Garner and F.Chen [30] propose a paradigm involving an anomaly detection model,
case-based hypothesis generation and hypothesis synthesis, which is deemed to provide
a basic platform for management intelligence systems and fraud detection in electronic
data processing environments.
V.Aggelis [31] demonstrates a successful fraud detection model. His scope is to
present its contribution to the fast and reliable detection of any "strange" transaction,
including fraudulent ones.
S.Rozsnyai et al. [33] introduce a solution architecture for detecting and preventing fraud
in real time by using an event-based system called SARI (Sense and Respond
Infrastructure). They present the architecture and components of a real-time fraud
management solution which can easily be adapted to the business needs of domain
experts and business users. Their SARI system provides functions to monitor customer
behavior and can steer and optimize customer processes in real time. They show
fraud scenarios of an online gambling service provider.
T.M.Padmaja et al. [34] propose a new approach called extreme outlier elimination and
hybrid sampling. They use the k-reverse-nearest-neighbors (kRNN) concept as a data
cleaning method for eliminating extreme outliers in minority regions. They conducted
experiments with the classifiers C4.5, Naïve Bayes, k-NN and radial basis function
networks and compared the performance of their approach against a simple hybrid
sampling technique. The results obtained show that eliminating extreme outliers from the
minority class produces high prediction accuracy for both the fraud and non-fraud classes.
Z.Ferdousi et al. [35] use Peer Group Analysis (PGA), an unsupervised technique, to find
outliers in time-series financial data. They apply the tool to stock market data collected
from the Bangladesh Stock Exchange to assess its performance in stock fraud detection.
They observe that PGA can detect brokers who suddenly start selling stock in a way
different from other brokers to whom they were previously similar. They also apply
t-statistics to find the deviations effectively.
M.Sternberg et al. [36] utilize a cultural algorithm (CA) to respond to dynamic changes in
the application of a rule-based expert system. The CA provides self-adaptive capabilities
which can generate the information necessary for the expert system to respond
dynamically.
O.Dandash et al. [37] present a security analysis of their proposed internet banking model
compared with the current existing models used in fraudulent internet payment
detection and prevention. Their proposed model facilitates internet banking fraud
detection and prevention (FDP) by applying two new secure mechanisms, Dynamic Key
Generation (DKG) and Group Key (GK).
S.Viaene et al. [38] apply the weight of evidence reformulation of AdaBoosted naive
Bayes scoring to the problem of diagnosing insurance claim fraud. Their method
effectively combines the advantages of boosting and the explanatory power of the weight
of evidence scoring framework.
E.Lundin et al. [39] developed a method for generating synthetic data that is derived from
authentic data. They also note that in many cases synthetic data are more suitable than
authentic data for the testing and training of fraud detection systems.
It is well known that every cardholder has certain purchasing habits, which establish
an activity profile for him or her. Almost all the existing fraud detection techniques try to
capture these behavioral patterns as rules and check for any violation in subsequent
transactions. However, these rules are largely static in nature. As a result, they become
ineffective when the cardholder develops new patterns of behavior that are not yet known
to the fraud detection system (FDS). The goal of a reliable detection system is to learn the
behavior of users dynamically so as to minimize its own loss. Thus, systems that cannot
evolve or "learn" may soon become outdated, resulting in a large number of false alarms.
A fraudster can also attempt new types of attacks which should still get detected by the
FDS. For example, a fraudster may aim at deriving maximum benefit either by making a few high-value
purchases or a large number of low-value purchases in order to evade detection. Thus,
there is a need for developing fraud detection systems which can integrate multiple
evidences, including the patterns of genuine cardholders as well as those of fraudsters.
We propose a credit card fraud detection system that combines different types of
evidence effectively.
The first attempt at making online credit card transactions secure was to take the
transaction off-line. Many sites will allow us to call in our credit card number to a
customer support person. This solves the problem of passing the credit card number over
the Internet, but eliminates the merchant's ability to automate the purchasing process.
The next method that was developed, and which is currently used by many sites, is hosting
the WWW site on a secure server. A secure server is one that uses a protocol such as SSL
or S-HTTP to transmit data between the browser and the server. These protocols encrypt
the data being transmitted, so when we submit our credit card number through a WWW
form it travels to the server in encrypted form. This section describes three of the most
well-known systems for secure credit card transactions: First Virtual, CyberCash and SET
(Secure Electronic Transactions).
1.3.2.1 First Virtual
First Virtual was the first successfully used model that made internet transactions
secure. Instead of using credit card numbers, transactions are done using a First
Virtual PIN which references the buyer's First Virtual account. These PINs can be
sent over the Internet because, even if they are intercepted, they cannot be used to charge
purchases to the buyer's account. A person's account is never charged without email
verification from them accepting the charge.
Their payment system is based on existing Internet protocols, with the backbone of the
system designed around Internet email and the MIME (Multipurpose Internet Mail
Extensions) standard. First Virtual uses email to communicate with a buyer to confirm
charges against their account. Sellers use email, Telnet or automated programs that
make use of First Virtual's Simple MIME Exchange Protocol (SMXP) to verify accounts
and initiate payment transactions. To use this transaction scheme, both the customer and
the merchant must have an account on First Virtual's server. The First Virtual model
was one of the most successfully used models, but it is no longer in use.
1.3.2.2 CyberCash
CyberCash makes safe passage over the Internet for credit card transaction data. They
take the data that is sent to them from the merchant, and pass it to the merchant's
acquiring bank for processing. Except for dealing with the merchant through CyberCash's
server, the acquiring bank processes the credit card transaction as they would process
transactions received through a point of sale (POS) terminal in a retail store.
The CyberCash payment system is centered on the CyberCash Wallet software program,
which buyers use when making a purchase. This program handles passing payment
information, encrypted, between the buyer and the merchant.
1.3.2.3 SET
MasterCard and Visa have developed SET as a license-free protocol for credit card
transactions over the Internet. SET is based on two earlier protocols, STT (Secure
Transaction Technology) and SEPP (Secure Electronic Payment Protocol). Secure
Electronic Transaction (SET) is a system for ensuring the security of financial
transactions on the Internet. It was supported initially by MasterCard, Visa, Microsoft,
Netscape, and others. With SET, a user is given an electronic wallet (digital certificate)
and a transaction is conducted and verified using a combination of digital certificates and
digital signatures among the purchaser, a merchant, and the purchaser's bank in a way
that ensures privacy and confidentiality.
SET makes use of the Secure Sockets Layer (SSL) and the Secure Hypertext Transfer
Protocol (S-HTTP). SET uses some but not all aspects of a public key infrastructure (PKI).
Many other systems, such as PayPal and DigiCash, are also in use.
These systems are highly secure but are rarely used by customers and merchants. These
models secure our transactions over the Internet but cannot stop forgery if the credit card
information is lost physically or if the customer's information falls into the wrong hands.
Anshul Jain et al. [43] have proposed a model for preventing such misuse. According to
this model, a login id and a password are issued by the bank along with the credit card.
Once the customer logs in, he is asked for his credit card details in order to make sure
that the person logging in is in possession of the card, thus avoiding leakage of the id and
password. If the user is authenticated, then an internet virtual credit card number is issued.
The user has to select an expiry date between the present date and the actual expiry date of
the card. Customers who transact very often could activate the internet virtual credit card
for only a few days, in order to avoid forgery.
Financial fraud detection is quite confidential and not much is disclosed in public. The
major issue in this domain is that financial institutions and banks do not share their live
data with researchers, as they have strict policies and cannot disclose it. Also, there is no
benchmark data set available in this area. Consequently, there are very few researchers
(just one or two) who have worked with real-life credit card data and reported their results.
Most researchers have generated synthetic data based on statistical techniques.
It may be noted that Aleskerov et al. [2] tested the performance of their CARDWATCH
system on sets of synthetic data based on a Gaussian distribution. Chan et al. [5] have used
a skewed distribution to generate a training set of labeled transactions and have performed
experiments to determine the most effective training distribution. Li and Zhang [42] have
modeled a customer's payments by a Poisson process, which can only capture the time gap
between two transactions. Panigrahi et al. [22] have generated synthetic data using a
Markov modulated Poisson process (MMPP) and two Gaussian distribution functions.
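For illustration only, the short sketch below shows the kind of statistical synthesis these studies describe: drawing synthetic transaction amounts from two Gaussian components. The means, standard deviations and sample sizes are arbitrary assumptions and are not taken from any of the cited works or from this thesis.

import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical spending profile: most purchases are small, a few are large.
# The two means/std-devs below are illustrative assumptions only.
n_small, n_large = 180, 20
small = rng.normal(loc=40.0, scale=10.0, size=n_small)    # everyday purchases
large = rng.normal(loc=600.0, scale=120.0, size=n_large)  # occasional big-ticket items

amounts = np.clip(np.concatenate([small, large]), a_min=1.0, a_max=None)
rng.shuffle(amounts)

print(f"{len(amounts)} synthetic transaction amounts, "
      f"mean={amounts.mean():.2f}, max={amounts.max():.2f}")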
According to E.L.Barse et al. [19], using synthetic data for evaluation, training and
testing offers several advantages over using authentic data. The properties of synthetic
data can be tailored to meet various conditions not available in authentic data sets. They
discuss the motivation for using synthetic data, since authentic data cannot be used in
some cases for a number of reasons. The target service may still be under development
and thus produce irregular or only small amounts of authentic data. Synthetic data can be
designed to demonstrate certain key properties or to include attacks not available in the
authentic data, giving a high degree of freedom during testing and training. Synthetic data
can also cover extensive periods of time or represent a large number of users, a necessary
property for training some of the more "intelligent" detection schemes.
There are two types of data mining techniques: unsupervised and supervised methods.
Unsupervised methods do not need prior knowledge of fraudulent and non-fraudulent
transactions in a historical database, but instead detect changes in behavior or unusual
transactions. These methods model a baseline distribution that represents normal
behavior and then detect observations that show the greatest departure from this norm.
Outliers are a basic form of non-standard observation that can be used for fraud detection.
In supervised methods, models are trained to discriminate between fraudulent and non-
fraudulent behavior so that new observations can be assigned to classes. Supervised
methods require accurate identification of fraudulent transactions in historical databases
and can only be used to detect frauds of a type that have previously occurred. An
advantage of using unsupervised methods over supervised methods is that previously
undiscovered types of fraud may be detected. Supervised methods are only trained to
discriminate between legitimate transactions and previously known fraud.
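As a minimal illustration of the unsupervised idea, the sketch below scores new transaction amounts by their deviation, in standard deviations, from a customer's historical baseline; the history values and the threshold of 3 are assumptions for illustration and are not part of the thesis model.

from statistics import mean, stdev

def outlier_scores(history, new_amounts, threshold=3.0):
    """Score new transaction amounts by their deviation (in standard
    deviations) from a customer's historical spending baseline."""
    mu, sigma = mean(history), stdev(history)
    scores = [(x, abs(x - mu) / sigma) for x in new_amounts]
    return [(x, z, z > threshold) for x, z in scores]

history = [25, 40, 32, 60, 45, 38, 55, 30]        # assumed past amounts
print(outlier_scores(history, [48, 900]))          # 900 is flagged as an outlier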
All the techniques or models of fraud detection merely indicate the likelihood of fraud;
no method can confirm with certainty that a transaction is fraudulent.
When a user performs a transaction on the Internet, transaction-related data are
generated. These data are stored in a data warehouse designed using dimensional modeling.
The Transaction Pattern Generation Tool (TPGT) generates different patterns (parameters),
such as the maximum transaction amount, the time passed since the last transaction, the
time passed since the same category was purchased, etc., based on the historical data
stored in the data warehouse. All these parameters collectively represent the normal
purchasing behavior of the customer.
Whenever a deviation from this normal behavior occurs, the model should raise an alarm.
The Transaction Risk Score Generation Model (TRSGM) works on this principle. For each
transaction, the model predicts how far from or close to the previous set of all normal
transactions it is, and generates a risk score between 0 and 1. A transaction with a score
below 0.5 is considered genuine; if the score is greater than or equal to 0.8, the transaction
is considered fraudulent and is verified by confirming with the customer. If the risk score
is between 0.5 and 0.8, the transaction is considered suspicious and an additional layer, a
Bayesian learner, is applied by the model. Once a transaction is found suspicious, the model
waits for the next transaction on the same card. When the next transaction occurs on the
same card, a risk score is generated again. If this risk score is less than 0.5, the transaction
is declared genuine; if it is greater than 0.8, the transaction is declared fraudulent; and if it
is again found suspicious, the Bayesian learner calculates, from the genuine and fraudulent
transaction sets, the posterior probability of the transaction coming from a normal customer
or from a fraudster. If the probability of a normal transaction is higher, the transaction is
declared genuine; otherwise it is treated as fraudulent.
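The decision flow described above can be summarized in a short sketch. The 0.5 and 0.8 thresholds follow the description given here; the posterior probabilities are placeholders standing in for the Bayesian learner, and the function is illustrative rather than the thesis's actual implementation.

def classify(risk_score, p_genuine=None, p_fraud=None, low=0.5, high=0.8):
    """Map a TRSGM-style risk score in [0, 1] to a decision, following the
    thresholds described above. For suspicious scores, fall back on the
    Bayesian comparison of posterior probabilities (placeholders here)."""
    if risk_score < low:
        return "genuine"
    if risk_score >= high:
        return "fraudulent"          # verified by confirming with the customer
    # Suspicious band: compare posteriors from genuine vs. fraud history,
    # if they are available for this card.
    if p_genuine is not None and p_fraud is not None:
        return "genuine" if p_genuine >= p_fraud else "fraudulent"
    return "suspicious"              # wait for the next transaction on the card

print(classify(0.3))                                 # genuine
print(classify(0.9))                                 # fraudulent
print(classify(0.65))                                # suspicious
print(classify(0.65, p_genuine=0.2, p_fraud=0.8))    # fraudulent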
The model is implemented using three data mining techniques, 1) rules, 2) the DBSCAN
algorithm and 3) a Bayesian learner, in Oracle 9i.
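As an illustration of the DBSCAN component, the sketch below clusters a card's past transaction amounts and measures the deviation of a new amount from the nearest cluster center. It uses scikit-learn's DBSCAN rather than the Oracle 9i implementation of the thesis, and the amounts, eps and min_samples values are assumed for illustration.

import numpy as np
from sklearn.cluster import DBSCAN

# Assumed past transaction amounts for one card (one-dimensional feature).
amounts = np.array([22, 25, 30, 28, 35, 40, 38, 520, 540, 515, 1200]).reshape(-1, 1)

# eps and min_samples are illustrative; the thesis tunes its own parameters.
labels = DBSCAN(eps=15, min_samples=2).fit(amounts).labels_

for cluster_id in sorted(set(labels)):
    members = amounts[labels == cluster_id].ravel()
    name = "noise" if cluster_id == -1 else f"cluster {cluster_id}"
    print(name, members)

# A new incoming amount can then be compared with the nearest cluster center
# to measure its deviation, as the deviation component of TRSGM does.
new_amount = 610
centers = [amounts[labels == c].mean() for c in set(labels) if c != -1]
print("minimum deviation from a cluster center:",
      min(abs(new_amount - c) for c in centers))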
Chapter 2 gives an overview of data mining and compares various data mining
techniques in terms of ease of understanding and implementation, input and output
issues, applications, strengths, weaknesses, etc. It then discusses the criteria that are
helpful for selecting a data mining technique, such as whether learning is supervised or
unsupervised, the nature of the input and output data, the presence of noisy data, time
(speed) issues (algorithms for building decision trees and production rules typically
execute much faster than neural networks or genetic algorithms), and classification accuracy.
Various types of financial cyber crimes and frauds committed worldwide are discussed in
Chapter 3. Chapter 4 discusses how various data mining techniques and rules become
helpful in financial crime detection. Chapter 5 describes the design and implementation
of the data warehouse and the various tables maintained by the financial cyber crime
detection system (FCDS). The development of the Transaction Pattern Generation Tool
(TPGT), which generates various parameters for the customer performing an online
transaction, is discussed in Chapter 6. The development of the Transaction Risk Score
Generation Model (TRSGM), which assigns a risk or fraud score (0-1) to each transaction,
is discussed in Chapter 7. The features of the developed data mining application software,
the significance of the research, the limitations of the study and the future scope of the
research are discussed as the conclusion in Chapter 8.
1.6 REFERENCES
[1] S.Ghosh, D.L.Reilly, “Credit card fraud detection with a neural-network”, in:
Proceedings of the Twenty-seventh Hawaii International Conference on System Sciences,
1994, pp. 621-630.
[13] H.Shao, H. Zhao, G.Chang, “Applying Data mining to detect fraud behavior in
customs declaration”, in: Proceedings of the First International Conference on Machine
Learning and Cybernetics, Beijing, November 2002, pp.1241-1244
[14] K.B.Bignell, “Authentication in an internet banking environment strategy; towards
developing a strategy for fraud detection” in: Proceedings of International Conference
ICISP 2006, 26-28 Aug. 2006, pp.23
[15] A.Srivastava, A.Kundu, S.Sural, A.K.Majumdar, “Credit card fraud detection using
hidden markov model”, in: IEEE transactions on dependable and secure computing,Vol.
5, No. 1, January-March 2008.
[16] B.Zhang, Y. Zhou, C. faloutsos, “Toward a comprehensive model in internet auction
fraud detection”, in: Proceedings of the 41st Hawaii International Conference on System
sciences, 2008
[17] J.E.Carbal, J.Pinto, S.C.Linares, M.A.C.Pinto, Methodology for fraud detection
using rough sets, http://www.ieeexplore.ieee.org/iel5/10898/34297/01635791.pdf
[18] J.Quah, M.Sriganesh, “Real time credit card fraud detection using computational
intelligence”, in: Proceedings of the International Joint Conference on Neural Networks,
Florida, U.S.A, August 2007
[19] E.L.Barse, H.Kvanstrom, E.Jonsson, “Synthesizing test data for fraud detection
system”, in: Proceedings of the 18th Annual Computer Security Applications
Conference,2003
[20] J.Xu, A.H.Sung, Q.Liu,”Tree based behavior monitoring for adaptive fraud
detection”, in: Proceedings of the 18th International Conference on pattern recognition,
2006
[21] D.W.Abbott, I.P.Matkovsky, J.F. Elder IV, “An Evaluation of High-end Data
Mining Tools for Fraud Detection”, 1998, IEEE Xplore
[22] Suvasini Panigrahi, Amlan Kundu, Shamik Sural, A.K.Majumdar, “Credit card
fraud detection: A fusion approach using Dempster-Shafer theory and Bayesian
learning”, www.sciencedirect.com
[23] Yi Peng, Gang Kou, A. Sabatka, Z. Chen, D.Khazanchi, Y.Shi, “Application of
Clustering Methods to Health Insurance Fraud Detection”, www.ieeexplore.ieee.org
[24] J.Tuo, S.Ren, W.Liu, X.Li, B.Li, L.Lei, “Artificial Immune System for Fraud
Detection”, in: Proceedings of the International Conference on systems, Man and
Cybernetics
[25] J.Kim, A.Ong, R.E.Overill, “Design of an Artificial Immune System as a Novel
Anomaly Detector for Combating Financial Fraud in the Retail Sector”, in: Evolutionary
Computation, 2003. CEC ’03, pp 405-412 Vol.1
[26] S.J.Stolfo, W.Lee, A.Prodromidis, P.K.Chan, “Cost-based Modeling for Fraud and
Intrusion Detection: Results from the JAM Project”,
http://www.citeseer.ist.psu.edu/244959.html
[27] F.Yu, Z.Qin, X.Jia, “Data Mining Issues in Fraudulent Tax Declaration Detection”,
in: Proceedings of the Second International Conference on Machine Learning and
Cybernetics, Xian, November 2003, pp.2202-2206
[28] A.Leung, Z.Yan, S.Fong, “On Designing a Flexible E-Payment System with Fraud
Detection Capability”, in: Proceedings of the IEEE International Conference on E-
Commerce Technology, 2004
[29] W.Chai, B.K.Hoogs, B.T.Verschueren, “Fuzzy Ranking of Financial Statements for
Fraud Detection”, in: Proceedings of the IEEE International Conference on Fuzzy
Systems, Canada, 2006, pp.152-158
[30] B.Garner, F.Chen, “Hypothesis Generation Paradigm for Fraud Detection”,
http://www.ieeexplore.ieee.org/iel2/2978/8447/00369309.pdf
[31] V.Aggelis, “Offline Internet Banking Fraud Detection”, in: Proceedings of the First
International Conference on Availability, Reliability and Security, 2006
[32] D.Yue, X.Wu, Y.Wang, Y.Li,C-H Chu, “A Review of Data Mining-based Financial
Fraud Detection Research”,
http://www.ieeexplore.ieee.org/iel5/4339774/4339775/04341127.pdf
[33] S.Rozsnyai, J.Schiefer, A.Schatten, “Solution Architecture for Detecting and
Preventing Fraud in Real Time”, in: Proceedings of the 2nd International Conference
ICDIM ’07, Volume 1, pp:152-158
[34] T.M.Padmaja, N.Dhulipalla, R.S.Bapi, P.R.Krishna, “Unbalanced Data
Classification Using Extreme Outlier Elimination and Sampling Techniques for Fraud Detection”
CHAPTER 2
A COMPARATIVE STUDY OF DATA MINING TECHNIQUES
Data Mining is the process of employing one or more computer learning techniques to
automatically analyze and extract knowledge from data contained within a database. The
purpose of a data mining session is to identify trends and patterns in data.
Data mining has been defined as “the nontrivial extraction of implicit, previously
unknown, and potentially useful information from data” and as “the science of extracting
useful information from large data sets or databases.” Hand et al. define data mining as
“a well-defined procedure that takes data as input and produces output in the forms of
models or patterns.”
Data mining techniques are the result of a long process of research and product
development. This evolution began when business data was first stored on computers,
continued with improvements in data access, and more recently, generated technologies
that allow users to navigate through their data in real time. Data mining takes this
evolutionary process beyond retrospective data access and navigation to prospective and
proactive information delivery. Data mining is ready for application in the business
community because it is supported by three technologies that are now sufficiently mature:
massive data collection, powerful multiprocessor computers, and data mining algorithms.
Commercial databases are growing at unprecedented rates. A recent META group survey
of data warehouse projects found that 19% of respondents are beyond the 50-gigabyte
level. In some industries, such as retail, these numbers can be much larger. The
accompanying need for improved computational engines can now be met in a cost-
effective manner with parallel multiprocessor computer technology. Data mining
algorithms embody techniques that have existed for at least 10 years, but have only
recently been implemented as mature, reliable, understandable tools that consistently
outperform older statistical methods.
In the evolution from business data to business information, each new step has built upon
the previous one. For example, dynamic data access is critical for drill-through in data
navigation applications, and the ability to store large databases is critical to data mining.
Today, the maturity of these techniques, coupled with high-performance relational
database engines and broad data integration efforts, make these technologies practical for
current data warehouse environments.
The current evolution of data mining functions and products is the result of years of
influence from many disciplines, including databases, information retrieval, statistics,
algorithms and machine learning (Figure 2.1). Another computer science area that has
had a major impact on the KDD process is multimedia and graphics.
Figure 2.1 Historical Perspective of Data Mining (disciplines such as information
retrieval, statistics and databases converging into data mining)
2.4 DATA MINING PROCESS
A practical data mining application is often complex. It is interactive and iterative,
involving a number of key steps:
1. Understanding the application domain and the application goals.
2. Extracting one or more target data sets from databases.
3. Cleaning data, e.g., removing noise and handling missing data.
4. Removing irrelevant attributes and tuples from the data.
5. Choosing the data mining task, i.e., deciding whether the goal of the data mining
process is classification, association, clustering, etc., or a combination of them.
6. Choosing the data mining algorithms.
7. Data mining: using the selected algorithms to discover hidden patterns in data.
8. Post-processing the discovered patterns, i.e., analyzing the patterns automatically
or semi-automatically to identify those truly interesting/useful patterns for the user.
There are many statistical concepts that are the basis for data mining techniques. Here is a
brief review of some of these concepts.
Point estimation is a well-known and computationally tractable tool for learning the
parameters of a data mining model. It can be used for many data mining tasks such as
summarization and time-series prediction. Summarization is the process of extracting or
deriving representative information about the data. Point estimation is used to estimate
mean, variance, standard deviation, or any other statistical parameter for describing the
data. In time-series prediction, point estimation is used to predict one or more values
appearing later in a sequence by calculating parameters for a sample.
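A minimal illustration, assuming an arbitrary sample of values: the usual point estimates of the mean, variance and standard deviation can be computed directly.

import numpy as np

# Assumed sample of transaction amounts (illustrative data only).
sample = np.array([23.0, 41.5, 37.0, 55.2, 30.8, 44.1, 39.9, 28.4])

print("mean:", sample.mean())
print("variance:", sample.var(ddof=1))        # unbiased sample variance
print("std. deviation:", sample.std(ddof=1))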
Several methods exist for obtaining point estimates, including least squares, the method
of moments, maximum likelihood estimation, Bayes estimators, and robust estimation.
The method of moments, introduced by Karl Pearson circa 1894, is one of the oldest
methods of determining estimates.
The r-th sample moment, calculated from a sample X_1, ..., X_n, is

φ_r = (1/n) Σ_{i=1}^{n} X_i^r        (2.1)
(i) Express the first k population moments as functions of the k unknown parameters
θ1, ..., θk.
(ii) Equate the population moments obtained in step (i) to the corresponding sample
moments calculated using the above equation, and solve for θ1, ..., θk as the estimates of
the parameters.
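A small worked example, not taken from the text: for an exponential distribution the first population moment is E[X] = 1/λ, so equating it to the first sample moment gives λ̂ = 1/X̄. The sample below is generated with an assumed true scale of 2.0.

import numpy as np

# Assumed sample believed to follow an exponential distribution with unknown rate.
rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=1000)    # true scale = 1/lambda = 2.0

# Step (i): for the exponential distribution, E[X] = 1 / lambda.
# Step (ii): equate E[X] to the first sample moment and solve for lambda.
first_sample_moment = sample.mean()
lambda_hat = 1.0 / first_sample_moment

print("method-of-moments estimate of lambda:", lambda_hat)   # close to 0.5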
Sir Ronald A. Fisher, circa 1920, introduced the method of maximization of likelihood
functions. Given a random sample X1, X2, ..., Xn distributed with the density (mass)
function f(x; Θ), the likelihood function of the random sample is the joint probability
density function, denoted by

L(Θ) = f(x1, x2, ..., xn; Θ)        (2.2)

In the above equation, Θ is the set of unknown population parameters {θ1, ..., θk}. If the
random sample consists of random variables that are independent and identically
distributed with a common density function f(x; Θ), the likelihood function can be
reduced to

L(Θ) = ∏_{i=1}^{n} f(xi; Θ)        (2.3)

which is the product of the individual density functions evaluated at each sample point.
The maximum likelihood estimate, therefore, is a set of parameter values Θ̂ = {θ̂1, ..., θ̂k}
that maximizes the likelihood function of the sample. A well-known approach to finding Θ̂
is to take the derivative of L, set it equal to zero and solve for Θ. Thus, Θ̂ can be obtained
by solving the likelihood equation

∂L(Θ)/∂Θ = 0        (2.4)
ALGORITHM 2.1
Input:
Ø = { Ø1,……., Øp} //Parameters to be estimated
Xobs ={x1,……..,xk} //Input database values observed
Xmiss={xk+1,…..,xn} //Input database values missing
Output:
Ô //Estimates for Ø
EM algorithm:
    i := 0;
    Obtain initial parameter MLE estimate, Ôi;
    repeat
        Estimate missing data, Xi_miss;
        i++;
        Obtain next parameter estimate, Ôi, to maximize likelihood;
    until estimates converge;
An initial set of estimates for the parameters is obtained. Given these estimates and the
training data as input, the algorithm 2.1 then calculates a value for the missing data. For
example, it might use the estimated mean to predict a missing value. These data are then
used to determine an estimate for the mean that maximizes the likelihood. These steps are
applied iteratively until successive parameter estimates converge. Any approach can be
used to find the initial parameter estimates. In algorithm 2.1 it is assumed that the input
database has actual observed values Xobs ={x1,……..,xk} as well as values that are
missing Xmiss={xk+1,…..,xn}. It is assumed that the entire database is actually X= Xobs
∪ Xmiss.
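As an illustration only (not part of the thesis model), the following minimal Python sketch mimics the spirit of Algorithm 2.1 for a single mean with missing values: the missing entries are imputed with the current estimate (E-step) and the mean is then re-estimated from the completed data (M-step) until successive estimates converge. The data values are invented.

# Minimal sketch of the EM idea from Algorithm 2.1 for one parameter (a mean).
def em_mean(x_obs, n_missing, tol=1e-6, max_iter=100):
    theta = sum(x_obs) / len(x_obs)                   # initial MLE from observed values
    for _ in range(max_iter):
        x_miss = [theta] * n_missing                  # E-step: estimate the missing data
        completed = x_obs + x_miss
        new_theta = sum(completed) / len(completed)   # M-step: re-estimate the mean
        if abs(new_theta - theta) < tol:              # stop when estimates converge
            break
        theta = new_theta
    return theta

print(em_mean([1.0, 5.0, 10.0, 4.0], n_missing=2))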
There are several different ways (estimators) to estimate unknown parameters. In order to
assess the usefulness of estimators, some criteria are necessary to measure the
performance of estimators. These are – bias, mean squared error, standard error,
efficiency and consistency.
2.5.2.1 Bias
The bias of an estimator is the difference between the expected value of the estimator and the actual value of the parameter:

Bias(θ̂) = E[θ̂] − θ
An unbiased estimator is one whose bias is 0. While point estimators for small data sets
may actually be unbiased, for larger database applications we would expect that most
estimators are biased.
2.5.2.2 Mean Squared Error (MSE)
The MSE is defined as the expected value of the squared difference between the estimate and the actual value:

MSE(θ̂) = E[(θ̂ − θ)²]
2.5.2.3 Standard Error
The standard error gives a measure of the precision of the estimators. The standard error of an estimator is the standard deviation of its sampling distribution.
The sample mean can be used as an example to illustrate the concept of standard error.
Let f(x) represent a probability density function with finite variance σ² and mean μ. Let X̄ be the sample mean for a random sample of size n drawn from this distribution. By the Central Limit Theorem, the distribution of X̄ is approximately normal with mean μ and variance σ²/n. So the standard error is given by

SE(X̄) = σ_X̄ = σ / √n                (2.8)
When the standard deviation σ for the underlying population is unknown, an estimate S of that parameter can be used as a substitute, which leads to the estimated standard error

SÊ(X̄) = σ̂_X̄ = S / √n                (2.9)
2.5.2.4 Efficiency
Another measure used to compare estimators is called efficiency. Suppose there are two
estimators Ò and Õ for a parameter Ø based on the sample X1,…..,Xn. If the MSE of one
estimator is less than the MSE of the other, i.e. MSE(Ò) < MSE(Õ), then the estimator Ò
is said to be more efficient than Õ. The relative efficiency of Ò with respect to Õ is defined as the ratio

rel.eff(Ò, Õ) = MSE(Õ) / MSE(Ò)

If this ratio is greater than one, then Ò is a more efficient estimator of the parameter Ø. When the estimators are unbiased, the ratio is just the ratio of their variances, and the most efficient estimator is the one with minimum variance.
2.5.2.5 Consistency
Unlike the four measures defined previously, consistency is defined for increasing sample sizes, not a fixed sample size. Like efficiency, consistency is also defined using the MSE: an estimator is consistent if its MSE tends to zero as the sample size increases. When the MSE is written in terms of bias and variance, this holds if and only if both the bias and the variance tend to zero.
One resampling technique used to assess an estimator is the jackknife, in which an estimate is obtained by leaving out one observation at a time. For example, the jackknife estimate of the mean with the ith value omitted is

μ̂(i) = ( Σ_{j=1}^{i−1} xj + Σ_{j=i+1}^{n} xj ) / (n − 1)                (2.12)
Here the subscript (i) indicates that this estimate is obtained by omitting the ith value.
Given a set of jackknife estimates θ̂(i), these can in turn be used to obtain an overall estimate

θ̂(·) = ( Σ_{j=1}^{n} θ̂(j) ) / n                (2.13)
There are many basic concepts that provide an abstraction and summarization of the data
as a whole. The basic well-known statistical concepts such as mean, variance, standard
deviation, median and mode are simple models of the underlying population. Fitting a
population to a specific frequency distribution provides an even better model of the data.
Of course, doing this with large databases that have multiple attributes, have complex
and/or multimedia attributes, and are constantly changing is not practical.
There are also many well-known techniques to display the structure of the data
graphically. For example, a histogram shows the distribution of the data. A box plot is a
more sophisticated technique that illustrates several different features of the population at
once.
Another visual technique to display data is called a scatter diagram. This is a graph on a
two-dimensional axis of points representing the relationships between x and y values.
P(h1 | xi) = P(xi | h1) P(h1) / [ P(xi | h1) P(h1) + P(xi | h2) P(h2) ]                (2.14)
Here P (h1 | xi) is called the posterior probability, while P (h1) is the prior probability
associated with hypothesis h1. P (xi ) is the probability of the occurrence of data value xi
and P (xi | h1) is the conditional probability that, given a hypothesis, the tuple satisfies it.
P(xi) = Σ_{j=1}^{m} P(xi | hj) P(hj)                (2.15)
Thus we have
P(h1 | xi) = P(xi | h1) P(h1) / P(xi)                (2.16)
Bayes rule allows us to assign probabilities to hypotheses given a data value, P(hj | xi). Here we speak of tuples, although in actuality each xi may be an attribute value or other data label. Each hj may be an attribute value, a set of attribute values, or even a combination of attribute values.
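As a small illustration of equations (2.15) and (2.16), the following Python sketch computes a posterior for two hypotheses from assumed prior and conditional probabilities; the numbers are invented for illustration only.

# Sketch of Bayes rule: posterior P(h1 | xi) from priors P(hj) and likelihoods P(xi | hj).
def posterior(prior, likelihood, target):
    p_x = sum(likelihood[h] * prior[h] for h in prior)     # P(xi), equation (2.15)
    return likelihood[target] * prior[target] / p_x        # P(h | xi), equation (2.16)

prior = {"h1": 0.6, "h2": 0.4}        # assumed prior probabilities of the hypotheses
likelihood = {"h1": 0.2, "h2": 0.5}   # assumed P(xi | h) for the observed value xi
print(posterior(prior, likelihood, "h1"))   # posterior P(h1 | xi) = 0.375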
Hypothesis testing attempts to find a model that explains the observed data by first
creating a hypothesis and then testing the hypothesis against the data. The hypothesis
usually is verified by examining a data sample. If the hypothesis holds for the sample, it
is assumed to hold for the population in general. Given a population, the initial
hypothesis to be tested, H0, is called the null hypothesis. Rejection of the null hypothesis
causes another hypothesis, H1, called the alternative hypothesis, to be made.
One technique to perform hypothesis testing is based on the use of the chi-squared
statistic. Actually, there is a set of procedures referred to as chi squared. These
procedures can be used to test the association between two observed variable values and
to determine if a set of observed variable values is statistically significant. A hypothesis
is first made, and then the observed values are compared based on this hypothesis.
Assuming that O represents the observed data and E represents the values expected under the hypothesis, the chi-squared statistic, χ², is defined as

χ² = Σ (O − E)² / E                (2.17)
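The following short Python sketch applies equation (2.17) to a made-up set of observed counts against an assumed hypothesis of equal expected frequencies; the values are illustrative only.

# Sketch of the chi-squared statistic of equation (2.17).
def chi_squared(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [18, 22, 20, 40]
expected = [25, 25, 25, 25]          # hypothesis of equal frequencies
print(chi_squared(observed, expected))   # larger values cast more doubt on H0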
Both bivariate regression and correlation can be used to evaluate the strength of a
relationship between two variables. Regression is generally used to predict future values
based on past values by fitting a set of points to a curve. Correlation, however, is used to
examine the degree to which the values for two variables behave similarly.
Linear regression assumes that a linear relationship exists between the input data and the
output data. The common formula for a linear relationship is used in this model:

y = c0 + c1 x1 + c2 x2 + … + cn xn
Here there are n input variables, which are called predictors or regressors; one output
variable (the variable being predicted), which is called the response; and n + 1 constants,
which are chosen during the modeling process to match the input examples. This is called
multiple linear regression because there is more than one predictor.
The decision tree method of decision analysis uses a tree structure to illustrate the
decision process. Probabilities are assigned to events, and the expected value of each
alternative is determined. The alternative with the most attractive total expected value is
chosen. Depending on the decision, the most attractive expected value may be the highest
or lowest number.
It is based on the “Twenty Questions” game that children play, as illustrated by Example
2.1. Figure 2.2 graphically shows the steps in the game. This tree has as the root the first
question asked. Each subsequent level in the tree consists of questions at that stage in the
game. Nodes at the third level show questions asked at the third level in the game. Leaf
nodes represent a successful guess as to the object being predicted. This represents a
correct prediction. Each question successfully divides the search space much as a binary
search does. As with a binary search, questions should be posed so that the remaining
space is divided into two equal parts. Often young children tend to ask poor questions by
being too specific, such as initially asking "Is it my Mother?" This is a poor approach
because the search space is not divided into two equal parts.
EXAMPLE 2.1
Mudra and Vikas are playing a game of “Twenty Questions”. Vikas has in mind some
object that Mudra tries to guess with no more than 20 questions. Mudra’s first question is
“Is this object alive?” Based on Vikas’s answer, Mudra then asks a second question. Her
second question is based on the answer that Vikas provides to the first question. Suppose
that Vikas says "yes" as his first answer. Mudra’s second question is "Is this a person?"
When Vikas responds “yes”, Mudra asks “Is it a friend?”. When Vikas says “no”, Mudra
then asks “Is it someone in my family?”. When Vikas responds “yes”, Mudra then begins
asking the names of family members and can immediately narrow down the search space
to identify the target individual. This game is illustrated in Figure 2.2.
[Figure 2.2: Decision tree for the "Twenty Questions" game of Example 2.1, with root question "Alive?" and successive questions such as "Person?", "Friend?", "In Family?" and "Mom?" leading to the final guess]
DEFINITION 2.4. A decision tree (DT) is a tree where the root and each internal node is
labeled with a question. The arcs emanating from each node represent each possible
answer to the associated question. Each leaf node represents a prediction of a solution to
the problem under consideration.
The building of the tree may be accomplished via an algorithm that examines data from a
training sample or could be created by a domain expert. Most decision tree techniques
differ in how the tree is created. Algorithm 2.2 shows the basic steps in applying a tuple
to the DT, step three in Definition 2.5. We assume here that the problem to be performed
is one of prediction, so the last step is to make the prediction as dictated by the final leaf
node in the tree. The complexity of the algorithm is straightforward to analyze. For each
tuple in the database, we search the tree from the root down to a particular leaf. At each
level, the maximum number of comparisons to make depends on the branching factor at
that level. So the complexity depends on the product of the number of levels and the
maximum branching factor.
ALGORITHM 2.2
Input:
T //Decision Tree
D //Input database
Output:
M //Model prediction
DTProc algorithm:
//Simplistic algorithm to illustrate prediction technique using DT
for each t ∈ D do
    n = root node of T;
    while n not leaf node do
        Obtain answer to question on n applied to t;
        Identify arc from n which contains correct answer;
        n = node at end of this arc;
    Make prediction for t based on labeling of n;
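As a concrete, hedged rendering of the same idea, the following Python sketch routes a tuple from the root of a small decision tree to a leaf and returns the leaf label. The tree, attributes and values are invented for illustration and are not the trees built later in this thesis.

# Sketch of DT-based prediction in the spirit of Algorithm 2.2.
class Node:
    def __init__(self, attribute=None, children=None, label=None):
        self.attribute = attribute      # question asked at an internal node
        self.children = children or {}  # answer -> child node (the arcs)
        self.label = label              # prediction stored at a leaf

def predict(tree, tuple_):
    n = tree
    while n.label is None:              # while n is not a leaf node
        answer = tuple_[n.attribute]    # answer to the question on n
        n = n.children[answer]          # follow the arc with the matching answer
    return n.label                      # prediction for the tuple

tree = Node("Gender", {
    "F": Node("Height", {"short": Node(label="Medium"), "tall": Node(label="Tall")}),
    "M": Node("Height", {"short": Node(label="Short"), "tall": Node(label="Tall")}),
})
print(predict(tree, {"Gender": "F", "Height": "tall"}))   # -> Tall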
We use Example 2.2 to further illustrate the use of decision trees.
[Figure: Decision tree for Example 2.2, splitting first on Gender (=F or =M) and then on Height]
EXAMPLE 2.2
2.6.1 Strengths
Decision trees have several advantages. Here is a list of a few of the many advantages
decision trees have to offer.
• Decision trees are easy to understand and map nicely to a set of production
rules.
• Decision trees have been successfully applied to real problems.
• Decision trees make no prior assumptions about the nature of the data.
• Decision trees are able to build models with datasets containing numerical as
well as categorical data.
2.6.2 Weaknesses
• Output attributes must be categorical, and multiple output attributes are not
allowed.
• Decision tree algorithms are unstable in that slight variations in the training
data can result in different attribute selections at each choice point within the
tree. The effect can be significant as attribute choices affect all descendant subtrees.
• Trees created from numeric datasets can be quite complex as attribute splits
for numeric data are typically binary.
Neural networks offer a mathematical model that attempts to mimic the human brain.
Knowledge is often represented as a layered set of interconnected processors. These
processor nodes are frequently referred to as neurodes so as to indicate a relationship with
the neurons of the brain. Each node has a weighted connection to several other nodes in
adjacent layers. Individual nodes take the input received from connected nodes and use
the weights together with a simple function to compute output values.
Neural networks, with their remarkable ability to derive meaning from complicated or
imprecise data, can be used to extract patterns and detect trends that are too complex to
be noticed by either humans or other computer techniques. A trained neural network can
be thought of as an “expert” in the category of information it has been given to analyze.
This expert can then be used to provide projections given new situations of interest and
answer “what if” questions.
The NN approach, like decision trees, requires that a graphical structure be built to
represent the model and then that the structure be applied to the data. The NN can be
viewed as a directed graph with source (input), sink (output) and internal (hidden) nodes.
The input nodes exist in an input layer, while the output nodes exist in an output layer.
The hidden nodes exist over one or more hidden layers. To perform the data mining task,
a tuple is input through the input nodes and the output node determines what the
prediction is. Unlike decision trees, which have only one input node (the root of the tree),
the NN has one input node for each attribute value to be examined to solve the data
mining function. Unlike decision trees, after a tuple is processed, the NN may be changed
to improve future performance. Although the structure of the graph does not change, the
labeling of the edges may change.
DEFINITION 2.6. A neural network (NN) is a directed graph, F=(V,A) with vertices
V= {1,2,….,n} and arcs A={(i,j) | 1<=i,j<=n}, with the following restrictions.
1. V is partitioned into set of input nodes, VI, hidden nodes, VH and output
nodes, Vo.
2. The vertices are also partitioned into layers {1,….,k} with all input nodes
in layer 1 and output nodes in layer k. All hidden nodes are in layers 2 to
k-1 which are called the hidden layers.
3. Any arc (i,j) must have node i in layer h-1 and node j in layer h.
4. Arc (i,j) is labeled with a numeric value wij.
5. Node i is labeled with a function fi.
Definition 2.6 is a very simplistic view of NNs. Although there are many more
complicated types that do not fit this definition, this defines the most common type of
NN.
Figure 2.4 shows a fully connected feed-forward neural network structure together with a
single input instance [1.0,0.4,0.7]. Arrow indicates the direction of flow for each new
instance as it passes through the network. The network is fully connected because nodes
at one layer are connected to all nodes in the next layer.
The number of input attributes found within individual instances determines the number
of input layer nodes. The user specifies the number of hidden layers as well as the
number of nodes within a specific hidden layer. Determining a best choice for these
values is a matter of experimentation. In practice, the total number of hidden layers is
usually restricted to two. Depending on the application, the output layer of the neural
network may contain one or several nodes.
[Figure 2.4: Fully connected feed-forward neural network with input instance [1.0, 0.4, 0.7], hidden nodes such as i and j, an output node k, and weights such as w1i, w2j, w3j and wjk labeling the arcs]
The input to individual neural network nodes must be numeric and fall in the closed
interval range [0, 1]. Because of this, we need a way to numerically represent categorical
data. We also require a conversion method for numerical data falling outside the [0, 1]
range.
The output nodes of a neural network represent continuous values in the [0, 1] range.
However, the output can be transformed to accommodate categorical class values.
The purpose of each node within a feed-forward neural network is to accept input values
and pass an output value to the next higher network layer. The nodes of the input layer
pass input attribute values to the hidden layer unchanged. Therefore for the input instance
shown in figure 2.4, the output of node 1 is 1.0, the output of node 2 is 0.4 and the output
of node 3 is 0.7.
Table 2.3: Initial Weight Values for the Neural Network Shown in Figure 2.4
A hidden or output layer node n takes input from the connected nodes of the previous
layer, combines the previous node values into a single value, and uses the new value as
input to an evaluation function. The output of the evaluation function is a number in the
closed interval [0, 1]. This value represents the output of node n.
Let’s look at an example. Table 2.3 shows sample weight values for the neural network
of Figure 2.4. Consider node j. To compute the input to node j, we determine the sum
total of the multiplication of each input weight by its corresponding input layer node
value. That is:
Therefore 0.25 represents the input value for node j’s evaluation function.
The first criterion of an evaluation function is that the function must output values in the
[0, 1] interval range. A second criterion is that the function should output a value close to
1 when sufficiently excited. The sigmoid function is computed as:

f(x) = 1 / (1 + e^(−x))                (2.20)
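The following tiny Python sketch shows how a hidden or output node combines its inputs and applies the sigmoid evaluation function of equation (2.20). The weights below are assumed stand-ins chosen so that node j receives the input value 0.25 mentioned in the text; they are not the actual Table 2.3 values.

# Sketch of a node's weighted input and sigmoid output.
import math

def node_output(inputs, weights):
    s = sum(x * w for x, w in zip(inputs, weights))   # weighted sum of incoming values
    return s, 1.0 / (1.0 + math.exp(-s))              # sigmoid f(x) = 1/(1+e^-x)

inputs = [1.0, 0.4, 0.7]          # the input instance of Figure 2.4
weights = [0.2, 0.3, -0.1]        # assumed weights (illustrative only)
s, y = node_output(inputs, weights)
print(s, y)                       # s = 0.25, y ≈ 0.562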
Character Recognition – The idea of character recognition has become very important
as handheld devices like the Palm Pilot are becoming increasingly popular. Neural
networks can be used to recognize handwritten characters.
Image Compression – Neural networks can receive and process vast amounts of
information at once, making them useful in image compression. With the Internet
explosion and more sites using more images on their sites, using neural networks for
image compression is worth a look.
Stock Market Prediction – The day-to-day business of the stock market is extremely
complicated. Many factors weigh in whether a given stock will go up or down on any
given day. Since neural networks can examine a lot of information quickly and sort it all
out, they can be used to predict stock prices.
Traveling Salesman’s Problem – Interestingly enough, neural networks can solve the
traveling salesman problem, but only to a certain degree of approximation.
Medicine, Electronic Nose, Security and Loan Applications – These are some
applications that are in their proof-of-concept stage, with the exception of a neural
network that will decide whether or not to grant a loan, something that has already been
used more successfully than many humans.
2.7.6 Strengths
• Neural networks work well with datasets containing large amounts of noisy input
data. Neural network evaluation functions such as the sigmoid function
naturally smooth input data variations caused by outliers and random error.
• Neural networks can process and predict numeric as well as categorical
outcomes. However, categorical data conversions can be tricky.
• Neural networks can be used for applications that require a time element to be
included in the data.
• Neural networks have performed consistently well in several domains.
• Neural networks can be used for both supervised learning and unsupervised
clustering.
2.7.7 Weaknesses
• Probably the biggest criticism of neural networks is that they lack the ability
to explain their behavior.
• Neural network learning algorithms are not guaranteed to converge to an
optimal solution. With most types of neural networks, the problem can be
dealt with by manipulating various learning parameters.
• Neural networks can easily be overtrained to the point of working well on the
training data but poorly on test data. This problem can be monitored by
consistently measuring test set performance.
Genetic algorithms are different from other heuristic methods in several ways. The most
important difference is that a GA works on a population of possible solutions, while other
heuristic methods use a single solution in their iterations. Another difference is that GAs
are probabilistic (stochastic), not deterministic.
Each individual in the GA population represents a possible solution to the problem. The
suggested solution is coded into the “genes” of the individual. One individual might have
these genes:”1100101011”, another has these:”0101110001” (just examples). The values
(0 or 1) and their position in the “gene string” tell the genetic algorithm what solution the
individual represents.
GAs can be used wherever optimization is needed, that is, wherever there are many possible solutions to a problem and we have to find the best one. For example, GAs can be used to find good moves in chess and to attack mathematical and financial problems, among many other areas.
Fitness: Fitness is the value assigned to an individual. It is based on how far or close an individual is from the solution. The greater the fitness value, the better the solution it contains.
Fitness function: Fitness function is a function which assigns fitness value to the
individual. It is problem specific.
Breeding: Taking two fit individuals and intermingling their chromosomes to create two new individuals.
Crossover: The first genetic operator; it forms new elements for the population by combining parts of two elements currently in the population.
Mutation: A second genetic operator is sparingly applied to elements chosen for
elimination. Mutation can be applied by randomly flipping bits (or attribute values)
within a single element.
Selection: A third genetic operator that is sometimes used. With selection, the elements
deleted from the population are replaced by copies of elements that pass the fitness test
with high scores.
ALGORITHM 2.3
Input:
P //Initial population
Output:
P’ //Improved population
Genetic algorithm:
//Algorithm to illustrate genetic algorithm
repeat
    N = |P|;
    P' = ∅;
    repeat
        i1, i2 = select(P);
        o1, o2 = cross(i1, i2);
        o1 = mutate(o1);
        o2 = mutate(o2);
        P' = P' ∪ {o1, o2};
    until |P'| = N;
    P = P';
until termination criteria satisfied;
Algorithm 2.3 outlines the steps performed by a genetic algorithm. Initially, a population
of individuals, P, is created. Although different approaches can be used to perform this
step, they typically are generated randomly. From this population, a new population, P’,
of the same size is created. Algorithm 2.3 repeatedly selects individuals from which
to create new ones. These parents, i1 and i2, are then used to produce two offspring, o1 and o2,
using a crossover process. Then mutants may be generated. The process continues until
the new population satisfies the termination condition.
We assume here that the entire population is replaced with each iteration. An alternative
would be to replace the two individuals with the smallest fitness. Although this algorithm
is quite general, it is representative of all genetic algorithms. There are many variations
on this general theme.
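Purely as an illustration of this generational loop, the following Python sketch evolves 10-bit strings whose fitness is the number of 1 bits. The encoding, fitness function, tournament-style selection and parameter values are assumptions made for this sketch and are not taken from the thesis.

# Sketch of the generational GA loop of Algorithm 2.3 on a toy bit-string problem.
import random

def fitness(ind):                 # problem-specific fitness function: count of 1 bits
    return sum(ind)

def select(pop):                  # pick two parents, biased towards fitter individuals
    return tuple(max(random.sample(pop, 3), key=fitness) for _ in range(2))

def cross(a, b):                  # single-point crossover
    point = random.randint(1, len(a) - 1)
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(ind, rate=0.05):       # flip bits with a small probability
    return [1 - g if random.random() < rate else g for g in ind]

pop = [[random.randint(0, 1) for _ in range(10)] for _ in range(20)]
for generation in range(50):
    new_pop = []
    while len(new_pop) < len(pop):
        i1, i2 = select(pop)
        o1, o2 = cross(i1, i2)
        new_pop.extend([mutate(o1), mutate(o2)])
    pop = new_pop               # replace the entire population each generation

print(max(pop, key=fitness))    # best individual found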
2.8.3 Applications of GA
2.8.4 Strengths of GA
• The major advantage to the use of genetic algorithms is that they are easily
parallelized.
• It can quickly scan a vast solution set. Bad proposals do not affect the end
solution negatively as they are simply discarded. The inductive nature of the
GA means that it doesn’t have to know any rules of the problem – it works by
its own internal rules. This is very useful for complex or loosely defined
problems.
2.8.5 Weaknesses of GA
2.9 CLASSIFICATION
Classification is the most familiar and most popular data mining technique. Examples of
classification applications include image and pattern recognition, medical diagnosis, loan
approval, detecting faults in industry applications, and classifying financial market
trends. Estimation and prediction may be viewed as types of classification.
2.9.1.1 Regression
Regression problems deal with an estimation of an output value based on input values.
When used for classification, the input values are values from the database and the output
values represent the classes. Regression can be used to solve classification problems, but
it can also be used for other applications such as forecasting.
There are many reasons why the linear regression model may not be used to estimate
output data. One is that the data do not fit a linear model. It is possible, however, that the
data generally do actually represent a linear model, but the linear model generated is poor
because noise or outliers exist in the data. Noise is erroneous data. Outliers are data
values that are exceptions to the actual and expected data.
Suppose there are k points in the training sample; then there are k formulas of the form yi = c0 + c1x1i + εi.
With a simple linear regression, given an observable value (x1i, yi), ε i is the error, and
thus the squared error technique introduced in the above section can be used to indicate
the error. To minimize the error, a method of least squares is used to minimize the least
square error. This approach finds coefficients c0,c1 so that the squared error is minimized
for the set of observable values. The sum of the squares of the errors is
L = Σ_{i=1}^{k} εi² = Σ_{i=1}^{k} (yi − c0 − c1 x1i)²                (2.23)
Taking the partial derivatives (with respect to the coefficients) and setting them equal to zero, we can obtain the least squares estimates for the coefficients, c0 and c1.
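As a brief sketch of this least squares step, the following Python code computes c0 and c1 from the usual closed-form solution of the normal equations; the data points are invented for illustration.

# Sketch of simple linear regression by least squares, as in equation (2.23).
def least_squares(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    c1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
         sum((x - mean_x) ** 2 for x in xs)
    c0 = mean_y - c1 * mean_x
    return c0, c1

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
print(least_squares(xs, ys))    # roughly c0 ≈ 0.05 and c1 ≈ 1.99 for these points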
Regression can be used to perform classification using two different approaches:
1. Division: The data are divided into regions based on class.
2. Prediction: Formulas are generated to predict the output class value.
If the predictors in the linear regression function are modified by some function (square, square root, etc.), then the model looks like

y = c0 + c1 f1(x1) + c2 f2(x2) + … + cn fn(xn)

where fi is the function being used to transform the predictor. In this case the regression is
called nonlinear regression. Linear regression techniques, while easy to understand, are
not applicable to most complex data mining applications. They do not work well with
nonnumeric data. They also make the assumption that the relationship between the input
value and the output value is linear, which of course may not be the case.
Linear regression is not always appropriate because the data may not fit a straight line,
but also because the straight line values can be greater than 1 and less than 0. Thus, they
certainly cannot be used as the probability of occurrence of the target class. Another
commonly used regression technique is called logistic regression. Instead of fitting the
data to a straight line, logistic regression uses a logistic curve such as illustrated in
Figure. The formula for a univariate logistic curve is
p = e^(c0 + c1x1) / (1 + e^(c0 + c1x1))                (2.25)
The logistic curve gives a value between 0 and 1 so it can be interpreted as the
probability of class membership. As with linear regression, it can be used when
classification into two classes is desired. To perform the regression, the logarithmic
function can be applied to obtain the logistic function
log( p / (1 − p) ) = c0 + c1 x1                (2.26)
Here p is the probability of being in the class and 1-p is the probability that it is not.
However, the process chooses values for c0 and c1 that maximize the probability of observing the values.
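The shape of the logistic curve of equation (2.25) can be seen with the small Python sketch below; the coefficients c0 and c1 are arbitrary illustrative choices.

# Sketch of the univariate logistic curve: output p always lies between 0 and 1.
import math

def logistic(x1, c0, c1):
    z = c0 + c1 * x1
    return math.exp(z) / (1.0 + math.exp(z))

for x1 in (-4, -1, 0, 1, 4):
    print(x1, round(logistic(x1, c0=0.5, c1=1.2), 3))   # p rises smoothly towards 1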
Assuming that the contributions of all attributes are independent and that each contributes
equally to the classification problem, a simple classification scheme called naïve Bayes
classification has been proposed that is based on Bayes rule of conditional probability as
stated in Definition 2.3. This approach was briefly outlined in an earlier section. By analyzing the contribution of each independent attribute, a conditional probability is determined.
Given a training set, the naïve Bayes algorithm first estimates the prior probability P( Cj )
for each class by counting how often each class occurs in the training data. For each
attribute, xi, the number of occurrences of each attribute value xi can be counted to
determine P (xi ). Similarly, the probability P (xi | Cj ) can be estimated by counting how
often each value occurs in the class in the training data. A tuple in the training data may
have many different attributes, each with many values. This must be done for all
attributes and all values of attributes. We then use these derived probabilities when a new
tuple must be classified. This is why naïve Bayes classification can be viewed as both a
descriptive and a predictive type of algorithm. The probabilities are descriptive and are
then used to predict the class membership for a target tuple.
When classifying a target tuple, the conditional and prior probabilities generated from the
training set are used to make the prediction. This is done by combining the effects of the
different attribute values from the tuple. Suppose that tuple ti has p independent attribute
values {xi1, xi2, ……., xip}. From the descriptive phase, we know P ( xik | Cj ), for each
class Cj and attribute xik. We then estimate P ( ti | Cj ) by
P(ti | Cj) = ∏_{k=1}^{p} P(xik | Cj)                (2.27)
At this point in the algorithm, we then have the needed prior probabilities P ( Cj ) for
each class and the conditional probability P ( ti | Cj ). To calculate P (ti ), we can estimate
the likelihood that ti is in each class. This can be done by finding the likelihood that this
tuple is in each class and then adding all these values. The probability that ti is in a class
is the product of the conditional probabilities for each attribute value. The posterior
probability P ( Cj | ti ) is then found for each class. The class with the highest probability
is the one chosen for the tuple.
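A minimal Python sketch of this descriptive-then-predictive process is given below: priors P(Cj) and conditionals P(xik | Cj) are counted from a tiny invented training set, and a target tuple is assigned to the class with the highest unnormalised posterior P(Cj) · ∏k P(xik | Cj). The data and attribute values are illustrative only.

# Sketch of naïve Bayes training (counting) and classification per equation (2.27).
from collections import Counter, defaultdict

def train(rows, labels):
    n = len(labels)
    prior = {c: cnt / n for c, cnt in Counter(labels).items()}   # P(Cj)
    cond = defaultdict(lambda: defaultdict(float))               # P(xik | Cj)
    for c in prior:
        class_rows = [r for r, l in zip(rows, labels) if l == c]
        for k in range(len(rows[0])):
            for value, cnt in Counter(r[k] for r in class_rows).items():
                cond[c][(k, value)] = cnt / len(class_rows)
    return prior, cond

def classify(t, prior, cond):
    score = {c: prior[c] for c in prior}
    for c in prior:
        for k, value in enumerate(t):
            score[c] *= cond[c].get((k, value), 0.0)             # multiply conditionals
    return max(score, key=score.get)

rows = [("F", "short"), ("F", "tall"), ("M", "tall"), ("M", "short")]
labels = ["medium", "tall", "tall", "short"]
prior, cond = train(rows, labels)
print(classify(("F", "tall"), prior, cond))   # -> "tall"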
2.9.1.2.1 Strengths
• It is easy to use.
• Unlike other classification approaches, only one scan of the training data is
required.
• The naïve Bayes approach can easily handle missing values by simply omitting
that probability when calculating the likelihoods of membership in each class.
• In cases where there are simple relationships, the technique often does yield good
results.
2.9.1.2.2 Weaknesses
• Although the naïve Bayes approach is straightforward to use, it does not always
yield satisfactory results.
• The technique does not handle continuous data. Dividing the continuous values
into ranges could be used to solve this problem, but the division of the domain
into ranges is not an easy task, and how this is done can certainly impact the
results.
Each item that is mapped to the same class may be thought of as more similar to the other
items in that class than it is to items found in other classes. Therefore, similarity (or
distance) measures may be used to identify the “alikeness” of different items in the
database.
Using a similarity measure for classification where the classes are predefined is
somewhat simpler than using a similarity measure for clustering where the classes are not
known in advance.
To calculate these similarity measures, the representative vector for each class must be
determined. A simple classification technique, then, would be to place each item in the
class where it is most similar to the center of that class. The representative for the class
may be found in other ways. For example, in pattern recognition problems, a predefined
pattern can be used to represent each class. Once a similarity measure is defined, each
item to be classified will be compared to each predefined pattern. The item will be placed
in the class with the largest similarity value. Algorithm 2.4 illustrates a straightforward
distance-based approach assuming that each class, ci, is represented by its center or
centroid. In Algorithm 2.4, ci denotes the center for its class. Since each tuple
must be compared to the center for a class and there are a fixed number of classes, the
complexity to classify one tuple is O (n).
ALGORITHM 2.4
Input:
c1, ……, cm // Centers for each class
t // Input tuple to classify
Output:
c //Class to which t is assigned
One common classification scheme based on the use of distance measures is that of the K
nearest neighbors (KNN). The KNN technique assumes that the entire training set
includes not only the data in the set but also the desired classification for each item. In
effect, the training data become the model. When a classification is to be made for a new
item, its distance to each item in the training set must be determined. Only the K closest
entries in the training set are considered further. The new item is then placed in the class
that contains the most items from this set of K closest items.
Algorithm 2.5 outlines the use of the KNN algorithm. We use T to represent the training
data. Since each tuple to be classified must be compared to each element in the training
data, this is O(q). Given n elements to be classified, this becomes an O(nq) problem.
Given that the training data are of a constant size, this can be viewed as an O(n) problem.
ALGORITHM 2.5
Input:
T //Training data
K //Number of neighbors
t //Input tuple to classify
Output:
C //Class to which t is assigned
KNN algorithm:
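A minimal Python sketch of the same idea (not the thesis's Algorithm 2.5 itself) is shown below: distances from the new item to every training item are computed, the K closest are kept, and the majority class among them is returned. The one-dimensional training data and the value of K are invented for illustration.

# Sketch of K nearest neighbors classification.
from collections import Counter

def knn(training, k, t):
    # training: list of (value, class) pairs; t: item to classify
    by_distance = sorted(training, key=lambda pair: abs(pair[0] - t))
    nearest = by_distance[:k]                                 # the K closest entries
    return Counter(c for _, c in nearest).most_common(1)[0][0]

training = [(2, "low"), (3, "low"), (4, "low"), (10, "high"), (12, "high"), (11, "high")]
print(knn(training, k=3, t=9))    # -> "high"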
The decision tree approach is most useful in classification problems. With this technique,
a tree is constructed to model the classification process. Once the tree is built, it is applied
to each tuple in the database and results in a classification for that tuple. There are two
basic steps in the technique: building the tree and applying the tree to the database. Most
research has focused on how to build effective trees as the application process is
straightforward.
The decision tree approach to classification is to divide the search space into rectangular regions. A tuple is classified based on the region into which it falls. A definition for a decision tree used in classification is contained in Definition 2.10. There are alternative definitions; for example, in a binary DT the nodes could be labeled with the predicates themselves and each arc would be labeled with yes or no (like in the "Twenty Questions" game).
DEFINITION 2.10 Given a database D = {t1, ….. ,tn} where ti=<ti1, ….., tin> and the
database schema contains the following attributes {A1,A2, ……., An}. Also given is a set
of classes C= {C1, …., Cm}. A decision tree (DT) or classification tree is a tree associated
with D that has the following properties:
There are many advantages to the use of DTs for classification. DTs are certainly easy to
use and efficient. Rules can be generated that are easy to interpret and understand. They
scale well for large databases because the tree size is independent of the database size.
Each tuple in the database must be filtered through the tree. This takes time proportional
to the height of the tree, which is fixed. Trees can be constructed for data with many
attributes.
ALGORITHM 2.6
Input:
D //Training data
Output:
T //Decision tree
DTBuild algorithm:
//Simplistic algorithm to illustrate naïve approach to building DT
T = ∅;
Determine best splitting criterion;
T = Create root node and label with splitting attribute;
T = Add arc to root node for each split predicate and label;
for each arc do
    D = Database created by applying splitting predicate to D;
    if stopping point reached for this path, then
        T' = Create leaf node and label with appropriate class;
    else
        T' = DTBuild(D);
    T = Add T' to arc;
Disadvantages also exist for DT algorithms. First, they do not easily handle continuous
data. These attribute domains must be divided into categories to be handled. Handling
missing data is difficult because correct branches in the tree could not be taken. Since the
DT is constructed from the training data, overfitting may occur. This can be overcome via
tree pruning. Finally, correlations among attributes in the database are ignored by the DT
process.
2.9.3.1 ID3
The concept used to quantify information is called entropy. Entropy is used to measure
the amount of uncertainty or surprise or randomness in a set of data. Certainly, when all
data in a set belong to a single class, there is no uncertainty. In this case the entropy is
zero.
ALGORITHM 2.7
The algorithm is based on Occam's razor: it prefers smaller decision trees over larger ones. However, it does not always produce the smallest tree, and is therefore a heuristic. Occam's razor is formalized using the concept of information entropy:

IE(i) = − Σ_{j=1}^{m} f(i, j) log2 f(i, j)                (2.28)
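The behaviour of equation (2.28) can be seen with the short Python sketch below, where f(i, j) is the relative frequency of class j at node i; the class counts are invented.

# Sketch of information entropy at a node: zero for a pure node, maximal for an even split.
import math

def entropy(class_counts):
    total = sum(class_counts)
    freqs = [c / total for c in class_counts if c > 0]
    return -sum(f * math.log2(f) for f in freqs)

print(entropy([8, 0]))   # 0.0  -- all examples in one class, no uncertainty
print(entropy([4, 4]))   # 1.0  -- maximum uncertainty for two classes
print(entropy([6, 2]))   # ≈ 0.811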
2.9.3.2 C4.5
At each node of the tree, C4.5 chooses one attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. Its criterion is the normalized information gain (difference in entropy) that results from choosing an attribute for splitting the data. The attribute with the highest normalized information gain is chosen to make the decision. The C4.5 algorithm then recurses on the smaller sublists.
• None of the features provide any information gain. In this case, C4.5 creates a
decision node higher up the tree using the expected value of the class.
• An instance of a previously-unseen class is encountered. Again, C4.5 creates a
decision node higher up the tree using the expected value.
ALGORITHM 2.8
2.9.3.3 CART
Φ(s/t) = 2 PL PR Σ_{j=1}^{m} | P(Cj | tL) − P(Cj | tR) |                (2.29)
This formula is evaluated at the current node, t, and for each possible splitting attribute and criterion, s. Here L and R are used to indicate the left and right subtrees of the current node in the tree. PL and PR are the probabilities that a tuple in the training set will be on the left or right side of the tree. This is defined as |tuples in subtree| / |tuples in training set|. We assume that the right branch is taken on equality. P(Cj | tL) or P(Cj | tR) is the probability that a tuple is in class Cj and in the left or right subtree. This is defined as |tuples of class j in subtree| / |tuples at the target node|. At each step, only one criterion is chosen as the best over all possible criteria.
CART handles missing data by simply ignoring that record in calculating the goodness of
a split on that attribute. The tree stops growing when no split will improve the
performance. Even though it is the best for the training data, it may not be the best for all
possible data to be added in the future. The CART algorithm also contains a pruning
strategy.
With neural networks (NNs), just as with decision trees, a model representing how to
classify any given database tuple is constructed. The activation functions typically are
sigmoidal. When a tuple must be classified, certain attribute values from that tuple are
input into the directed graph at the corresponding source nodes. There often is one sink
node for each class. The output value is generated indicates the probability that the
corresponding input tuple belongs to that class. The tuple will then be assigned to the
class with the highest probability of membership. The learning process modifies the
labeling of the arcs to better classify tuples. Given a starting structure and value for all
the labels in the graph, as each tuple in the training set is sent through the network, the
projected classification made by the graph can be compared with the actual classification.
Based on the accuracy of the prediction, various labeling in the graph can change. This
learning process continues with all the training data or until the classification accuracy is
adequate.
2.9.4.1 Propagation
The normal approach used for processing is called propagation. Given a tuple of values
input to the NN, X = <x1,x2,…….,xn>, one value is input at each node in the input layer.
Then the summation and activation functions are applied at each node, with an output
value created for each output arc from that node. These values are in turn sent to the
subsequent nodes. This process continues until a tuple of output values, Y = <y1,….,ym>,
is produced from the nodes in the output layer. The process of propagation is shown in
algorithm 2.9 using a neural network with one hidden layer. Here a hyperbolic tangent
activation function is used for nodes in the output layer. We assume that the constant c in
the activation function has been provided. We also use k to denote the number of edges coming into a node.
ALGORITHM 2.9
Input:
N //neural network
X=<x1,x2,…….,xn> //Input tuple consisting of values for input attributes only
Output:
Y=<y1,y2,……,ym> //Tuple consisting of output values from NN
Propagation algorithm:
//Algorithm illustrates propagation of a tuple through a NN
for each node i in the input layer do
    output xi on each output arc from i;
for each hidden layer do
    for each node i do
        Si = Σ_{j=1}^{k} (wji xji);
        output yi = 1 / (1 + e^(−c·Si));
The NN starting state is modified based on feedback of its performance with the data in
the training set. This type of learning is referred to as supervised because it is known a
priori what the desired output should be. Unsupervised learning can also be performed if
the output is not known. With unsupervised approaches, no external teacher set is used. A
training set may be provided, but no labeling of the desired outcome is included.
Supervised learning in an NN is the process of adjusting the arc weights based on its
performance with a tuple from the training set. The training set can be used as a “teacher”
during the training process. The output from the network is compared to this known
desired behavior. Algorithm 2.10 outlines the steps required.
ALGORITHM 2.10
Input:
N //Starting neural network
X //Input tuple from training set
D //Output tuple desired
Output:
N //Improved neural network
Suplearn algorithm:
//Simplistic algorithm to illustrate approach to NN learning
Propagate X through N producing output Y;
Calculate error by comparing D to Y;
Assuming that the output from node i is yi but should be di, the error produced from a
node in any layer can be found by
| yi – di | (2.30)
(yi − di)² / 2                (2.31)
ALGORITHM 2.11
Input:
N //Starting neural network
X=<x1,x2,…..,xn> //Input tuple from training set
D=<d1,d2,…...,dm> //Output tuple desired
Output:
N //Improved neural network
Backpropagation algorithm:
//Illustrate backpropagation
Propagation(N, X);
E = 1/2 Σ_{i=1}^{m} (di − yi)²;
Gradient(N, E);
ALGORITHM 2.12
Input:
N //Starting neural network
E //Error found from back algorithm
Output:
N //Improved neural network
Gradient algorithm:
//Illustrates incremental gradient descent
for each node i in output layer do
    for each node j input to i do
        Δwji = η (di − yi) yj (1 − yi) yi;
        wji = wji + Δwji;
layer = previous layer;
for each node j in this layer do
    for each node k input to j do
        Δwkj = η yk ((1 − yj²) / 2) Σ_m (dm − ym) wjm ym (1 − ym);
The weights may be changed either after the entire training set has been processed or after each tuple in the training set is applied. The incremental technique is usually preferred because it requires less space and may actually examine more potential solutions (weights), thus leading to a better solution.
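As a small, hedged sketch of the output-layer update used in Algorithm 2.12, the Python snippet below applies the weight change η (di − yi) yj (1 − yi) yi to a single arc; all the numeric values are invented for illustration.

# Sketch of one incremental gradient-descent update for an output-layer weight wji.
def delta_wji(eta, d_i, y_i, y_j):
    # weight change for the arc from node j into output node i
    return eta * (d_i - y_i) * y_j * (1 - y_i) * y_i

w_ji = 0.3
w_ji = w_ji + delta_wji(eta=0.5, d_i=1.0, y_i=0.6, y_j=0.8)
print(w_ji)    # 0.3 + 0.5 * 0.4 * 0.8 * 0.4 * 0.6 = 0.3384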
A radial function or a radial basis function (RBF) is a class of functions whose value
decreases (or increases) with the distance from a central point. An RBF has a Gaussian
shape, and an RBF network is typically an NN with three layers. The input layer is used
to simply input the data. A Gaussian activation function is used at the hidden layer, while
a linear activation function is used at the output layer. The objective is to have the hidden
nodes learn to respond only to a subset of the input, namely, that where the Gaussian
function is centered. This is usually accomplished via supervised learning. When RBF
functions are used as the activation functions on the hidden layer, the nodes can be
sensitive to a subset of the input values. Figure 2.5 shows the basic structure of an RBF
unit with one output node.
[Figure 2.5: Structure of an RBF network with inputs X1, X2 and X3, Gaussian hidden units f1 and f2 with centers c1 and c2, and a single summed output y]
2.9.4.4 Perceptrons
2.10 CLUSTERING
Informally, a cluster can be described as follows:
• Set of like elements. Elements from different clusters are not alike.
• The distance between points in a cluster is less than the distance between a point
in the cluster and any point outside it.
With hierarchical clustering, a nested set of clusters is created. Each level in the hierarchy
has a set of clusters. At the lowest level, each item is in its own unique cluster. At the
highest level, all items belong to the same cluster. With hierarchical clustering, the
desired number of clusters is not input. With partitional clustering, the algorithm creates
only one set of clusters.
[Figure: Dendrogram for six items A, B, C, D, E and F]
The space complexity for hierarchical algorithms is O(n2) because this is the space
required for the adjacency matrix. The space required for the dendrogram is O(kn),
which is much less than O(n2). The time complexity for hierarchical algorithms is O(kn2)
because there is one iteration for each level in the dendrogram. Depending on the specific
algorithm, however, this could actually be O(maxd n2) where maxd is the maximum
distance between points.
Hierarchical techniques are well suited for many clustering applications that naturally
exhibit a nesting relationship between clusters. For example, in biology, plant and animal
taxonomies could easily be viewed as a hierarchy of clusters.
Agglomerative algorithms start with each individual item in its own cluster and
iteratively merge clusters until all items belong in one cluster. Different agglomerative
algorithms differ in how the clusters are merged at each level. Algorithm 2.13 illustrates
the typical agglomerative clustering algorithm. It assumes that a set of elements and
distances between them is given as input. We use an n * n vertex adjacency matrix, A, as
input. Here the adjacency matrix, A, contains a distance value rather than a simple
boolean value: A[i,j] = dis(ti,tj). The output of Algorithm 2.13 is a dendrogram, DE,
which we represent as a set of ordered triples <d,k,K> where d is the threshold distance, k
is the number of clusters, and K is the set of clusters.
ALGORITHM 2.13
Input:
D = {t1,t2,……,tn} //set of elements
A //Adjacency matrix showing distance between elements
Output:
DE // Dendrogram represented as a set of ordered triples
Agglomerative algorithm:
d=0;
k=n;
K={{t1},….,{tn}};
DE={<d,k,K>}; // Initially dendrogram contains each element
in its own cluster.
repeat
oldk=k;
d=d+1;
Ad=Vertex adjacency matrix for graph with threshold distance of d;
<k,K>=NewClusters(Ad,D);
if oldk <> k then
DE=DE U <d,k,K>; // New set of clusters added to dendrogram
until k=1
The single link technique is based on the idea of finding maximal connected components
in a graph. A connected component is a graph in which there exists a path between any
two vertices. With the single link approach, two clusters are merged if there is at least one
edge that connects two clusters; that is, if the minimum distance between any two points
is less than or equal to the threshold distance being considered. For this reason, it is often
called the nearest neighbor clustering technique.
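A compact Python sketch of the single link idea is given below: at every step the two clusters whose closest members are nearest to each other are merged, until one cluster remains. One-dimensional points are used so that the distance is simply the absolute difference; the data are invented for illustration.

# Sketch of single link (nearest neighbour) agglomerative clustering.
def single_link(points):
    clusters = [[p] for p in points]          # each item starts in its own cluster
    dendrogram = [list(clusters)]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single link: minimum distance between any pair of members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]     # merge the two closest clusters
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        dendrogram.append(list(clusters))
    return dendrogram

for level in single_link([1, 2, 8, 9, 20]):
    print(level)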
Although the complete link algorithm is similar to the single link algorithm, it looks for
cliques rather than connected components. A clique is a maximal graph in which there is
an edge between any two vertices. Here a procedure is used to find the maximum
distance between any clusters so that two clusters are merged if the maximum distance is
less than or equal to the distance threshold. In this algorithm, we assume the existence of
a procedure, clique, which finds all cliques in a graph.
The average link technique merges two clusters if the average distance between any two
points in the two target clusters is below the distance threshold.
With divisive clustering, all items are initially placed in one cluster and clusters are
repeatedly split in two until all items are in their own cluster. The idea is to split up
clusters where some elements are not sufficiently close to other elements.
Since the clustering problem is to define a mapping, the output of this algorithm shows
the clusters as a set of ordered pairs <ti,j> where f(ti) = Kj.
ALGORITHM 2.14
Input:
D = {t1,t2,…..,tn} // Set of elements
A //Adjacency matrix showing distance between elements
k // Number of desired clusters
Output:
f // Mapping represented as a set of ordered pairs
Partitional MST Algorithm:
M = MST(A)
identify inconsistent edges in M;
remove k-1 inconsistent edges;
create output representation;
The squared error clustering algorithm 2.15 minimizes the squared error. The squared
error for a cluster is the sum of the squared Euclidean distances between each element in
the cluster and the cluster centroid, Ck. Given a cluster Ki, let the set of items mapped to
that cluster be {ti1,ti2,……,tim}. The squared error is defined as
seKi = Σ_{j=1}^{m} || tij − Ck ||²                (2.32)

seK = Σ_{j=1}^{k} seKj                (2.33)
ALGORITHM 2.15
Input:
D={t1,t2,……,tn} // set of elements
k // Number of desired clusters
Output:
K // Set of clusters
Squared error algorithm:
assign each item ti to a cluster;
calculate center for each cluster;
repeat
assign each item ti to the cluster which has the closest center;
calculate new center for each cluster;
calculate squared error;
until the difference between successive squared errors is below a threshold;
The K-Means algorithm (Lloyd, 1982) is a simple yet effective statistical clustering
technique. It is an iterative clustering algorithm 2.16 in which items are moved among
sets of clusters until the desired set is reached.
mi = (1/m) Σ_{j=1}^{m} tij                (2.34)
ALGORITHM 2.16
Input:
D = {t1,t2,….,tn} //Set of elements
k //Number of desired clusters
Output :
K //Set of clusters
K-means algorithm:
assign initial values for means m1, m2,…, mk;
repeat
    assign each item to the cluster which has the closest mean;
    calculate new mean for each cluster;
until convergence criteria is met;
EXAMPLE 2.3
Suppose that the items to be clustered are
{2,4,10,12,3,20,30,11,25}
and suppose that k=2. We initially assign the means to the first two values: m1=2 and m2=4. Using Euclidean distance, the successive means and clusters are:
m1 m2 K1 K2
3 18 {2,3,4,10} {12,20,30,11,25}
4.75 19.6 {2,3,4,10,11,12} {20,30,25}
7 25 {2,3,4,10,11,12} {20,30,25}
Note that the clusters in the last two steps are identical. This will yield identical means,
and thus the means have converged. Our answer is thus K1={2,3,4,10,11,12} and
K2={20,30,25}.
The time complexity of K-means is O(tkn) where t is the number of iterations. K-means
finds a local optimum and may actually miss the global optimum.
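The iteration of Example 2.3 can be reproduced with the short Python sketch below, run on the same data with k = 2 and the first two values as initial means; it converges to the same final clusters K1 = {2,3,4,10,11,12} and K2 = {20,30,25}. The implementation details are a sketch only and assume that no cluster becomes empty during the iterations.

# Sketch of the K-means loop of Algorithm 2.16 on one-dimensional data.
def k_means(items, means, max_iter=100):
    for _ in range(max_iter):
        clusters = [[] for _ in means]
        for x in items:                                     # assign to the closest mean
            idx = min(range(len(means)), key=lambda i: abs(x - means[i]))
            clusters[idx].append(x)
        new_means = [sum(c) / len(c) for c in clusters]     # recompute each mean
        if new_means == means:                              # convergence reached
            return clusters
        means = new_means
    return clusters

print(k_means([2, 4, 10, 12, 3, 20, 30, 11, 25], means=[2, 4]))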
2.10.3.3.1 Strengths
2.10.3.3.2 Weaknesses
• Although the K-means algorithm often produces good results, it is not time-
efficient and does not scale well.
• The algorithm only works with real-valued data. If we have a categorical
attribute in our dataset we must either discard the attribute or convert the
attribute values to numeric equivalents.
• The K-means algorithm works best when the clusters that exist in the data are
of approximately equal size. This being the case, if an optimal solution is
represented by clusters of unequal size, the K-Means algorithm is not likely to
find a best solution.
• There is no way to tell which attributes are significant in determining the
formed clusters. For this reason several irrelevant attributes can cause less
than optimal results.
An algorithm similar to the single link technique is called the nearest neighbor algorithm.
With this serial algorithm, items are iteratively merged into the existing clusters that are
closest. In this algorithm a threshold, t, is used to determine if items will be added to
existing clusters or if a new cluster is created.
ALGORITHM 2.17
Input:
D={t1,t2,….,tn} // Set of elements
A // Adjacency matrix showing distance between elements
Output:
K // Set of clusters
Nearest neighbor algorithm:
K1 = {t1};
K = {K1};
k = 1;
for i = 2 to n do
    find the tm in some cluster Km in K such that dis(ti,tm) is the smallest;
    if dis(ti,tm) <= t then
        Km = Km ∪ {ti};
    else
        k = k + 1;
        Kk = {ti};
The PAM (partitioning around medoids) algorithm, also called the K-medoids algorithm,
represents a cluster by a medoid. Using a medoid is an approach that handles outliers
well. Initially, a random set of k items is taken to be the set of medoids. Then at each
step, all items from the input dataset that are not currently medoids are examined one by
one to see if they should be medoids. By looking at all pairs of medoid, non-medoid
objects, the algorithm chooses the pair that improves the overall quality of the clustering
the best and exchanges them. Quality here is measured by the sum of all distances from a
non-medoid object to the medoid for the cluster it is in. An item is assigned to the cluster
represented by the medoid to which it is closest (minimum distance).
The total impact to quality by a medoid change TCih is given by
TCih = Σ_{j=1}^{n} Cjih                (2.35)
ALGORITHM 2.18
Input:
D={t1,t2,….,tn} // Set of elements
A // Adjacency matrix showing distance between elements
k // Number of desired clusters
Output:
K // Set of clusters
PAM algorithm:
arbitrarily select k medoids from D;
repeat
for each th not a medoid do
for each medoid ti do
calculate TCih;
find i,h where TCih is the smallest;
if TCih < 0, then
replace medoid ti with th;
until TCih >= 0;
for each ti ∈ D do
assign ti to Kj, where dis(ti,tj) is the smallest over all medoids;
2.10.4.1 BIRCH
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is designed for clustering large databases and incorporates an outlier handling technique. BIRCH applies only to numeric data. Algorithm 2.19 uses a tree called a CF tree as defined in Definition 2.12.
DEFINITION 2.12: A clustering feature (CF) is a triple (N,LS,SS), where the number of
the points in the cluster is N, LS is the sum of the points in the cluster, and SS is the sum
of the squares of the points in the cluster.
DEFINITION 2.13: A CF tree is a balanced tree with a branching factor B. Each internal
node contains a CF triple for each of its children. Each leaf node also represents a cluster
and contains a CF entry for each subcluster in it. A subcluster in a leaf node must have a
diameter no greater than a given threshold value T.
ALGORITHM 2.19
Input:
  D = {t1, t2, …, tn}   // Set of elements
  T                     // Threshold for CF tree construction
Output:
  K                     // Set of clusters
BIRCH clustering algorithm:
  for each ti ∈ D do
    determine correct leaf node for ti insertion;
    if threshold condition is not violated, then
      add ti to cluster and update CF triples;
    else
      if room to insert ti, then
        insert ti as single cluster and update CF triples;
      else
        split leaf node and redistribute CF features;
BIRCH is linear in both space and I/O time. The choice of threshold values is critical
to an efficient execution of the algorithm; otherwise, the tree may have to be rebuilt
many times to ensure that it can remain memory-resident. This gives a worst-case time
complexity of O(n²).
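One reason BIRCH can build the CF tree in a single scan is that the CF triple of Definition 2.12 is additive: two subclusters can be merged by simply adding their triples, and the centroid and diameter needed for the threshold test can be derived from the triple alone. The following Python sketch illustrates this for one-dimensional points; the function names and the exact layout of the diameter formula are assumptions of this sketch rather than the thesis code.

def cf(points):
    # Clustering feature (N, LS, SS) for a list of one-dimensional points
    n = len(points)
    ls = sum(points)                      # linear sum of the points
    ss = sum(p * p for p in points)       # sum of the squared points
    return n, ls, ss

def merge_cf(cf1, cf2):
    # CF triples are additive, so two subclusters merge component-wise
    return tuple(a + b for a, b in zip(cf1, cf2))

def centroid_and_diameter_sq(cf_triple):
    n, ls, ss = cf_triple
    centroid = ls / n
    # average pairwise squared distance, compared against the threshold T
    diameter_sq = (2 * n * ss - 2 * ls * ls) / (n * (n - 1)) if n > 1 else 0.0
    return centroid, diameter_sq

a = cf([100.0, 120.0])
b = cf([110.0])
merged = merge_cf(a, b)
print(merged)                              # (3, 330.0, 36500.0)
print(centroid_and_diameter_sq(merged))    # centroid 110.0, squared diameter 200.0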
2.10.4.2 DBSCAN
DEFINITION 2.14: Given values Eps and MinPts, a point p is directly density-reachable
from a point q if
• dis(p,q) <= Eps and
• | {r | dis(r,q) <= Eps} | >= MinPts
ALGORITHM 2.20
Input:
D={t1,t2,….,tn} //Set of elements
MinPts // Number of points in cluster
Eps // Maximum distance for density measure
Output:
K={K1,K2,….,KK}
DBSCAN algorithm:
  k = 0;    // Initially there are no clusters
  for i = 1 to n do
    if ti is not in a cluster, then
      X = {tj | tj is density-reachable from ti};
      if X is a valid cluster, then
        k = k + 1;
        Kk = X;
2.10.4.2.1 Strengths
• DBSCAN does not require the number of clusters to be specified in advance.
• DBSCAN can find arbitrarily shaped clusters and explicitly identifies noise points (outliers).
2.10.4.2.2 Weaknesses
• The quality of the clustering that DBSCAN produces depends on the distance measure
used in the function getNeighbors(P, epsilon). The most common distance metric used is
the Euclidean distance measure; especially for high-dimensional data, this metric can be
rendered almost useless (a minimal sketch of this neighborhood search appears below).
• DBSCAN does not respond well to data sets with varying densities.
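Because the algorithm listing above is abbreviated, the following minimal Python sketch shows the usual flow of DBSCAN, built around a get_neighbors helper corresponding to the getNeighbors(P, epsilon) function mentioned in the first weakness. The function names, the use of Euclidean distance and the small example are assumptions of this sketch, not the thesis implementation.

import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def get_neighbors(points, i, eps):
    # indices of all points within Eps of points[i], including i itself
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    labels = {}                                  # point index -> cluster id, -1 means noise
    cluster_id = 0
    for i in range(len(points)):
        if i in labels:
            continue
        neighbors = get_neighbors(points, i, eps)
        if len(neighbors) < min_pts:             # not a core point: mark as noise for now
            labels[i] = -1
            continue
        cluster_id += 1                          # start a new cluster around core point i
        labels[i] = cluster_id
        seeds = [j for j in neighbors if j != i]
        while seeds:                             # expand the cluster through core points
            j = seeds.pop()
            if labels.get(j) == -1:
                labels[j] = cluster_id           # a noise point becomes a border point
            if j in labels:
                continue
            labels[j] = cluster_id
            j_neighbors = get_neighbors(points, j, eps)
            if len(j_neighbors) >= min_pts:      # j is itself a core point: keep expanding
                seeds.extend(j_neighbors)
    return labels

data = [(1,), (2,), (3,), (10,), (11,), (12,), (50,)]
print(dbscan(data, eps=2.0, min_pts=2))          # two clusters plus one noise point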
One objective of the CURE (Clustering Using Representatives) clustering algorithm is to
handle outliers well. It has both a hierarchical component and a partitioning component.
First, a constant number of points, c, are chosen from each cluster. These well-scattered
points are then shrunk toward the cluster’s centroid by applying a shrinkage factor, α.
When α is 1, all points are shrunk to just one point, the centroid. These points represent the
cluster better than a single point (such as a medoid or centroid) could. With multiple
representative points, clusters of unusual shapes can be better represented. CURE then
uses a hierarchical clustering algorithm. At each step in the agglomerative algorithm, the
clusters with the closest pair of representative points are chosen to be merged. The
distance between them is defined as the minimum distance between any pair of points in
the representative sets from the two clusters.
In Algorithm 2.21, we assume that each entry u in the heap contains the set of
representative points, u.rep; the mean of the points in the cluster, u.mean; and the cluster
closest to it, u.closest. We use the heap operations: heapify to create the heap, min to
extract the minimum entry in the heap, insert to add a new entry, and delete to delete an
entry. A merge procedure is used to merge two clusters. In CURE, a k-D tree is used to
assist in the merging of clusters.
ALGORITHM 2.21
Input:
D= {t1,t2,….,tn} //Set of elements
k // Desired number of clusters
Output:
Q // Heap containing one entry for each cluster
CURE algorithm:
  T = build(D);
  Q = heapify(D);    // Initially build heap with one entry per item
  repeat
    u = min(Q);
    v = u.closest;
    delete(Q, v);
    w = merge(u, v);
    delete(T, u);
    delete(T, v);
    insert(T, w);
    for each x ∈ Q do
      x.closest = find closest cluster to x;
      if x is closest to w, then
        w.closest = x;
    insert(Q, w);
  until number of nodes in Q is k;
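The two steps that distinguish CURE, choosing c well-scattered representative points and shrinking them toward the centroid by the factor α, can be sketched in a few lines of Python. The function names, the farthest-point selection heuristic and the example cluster below are assumptions of this illustration, not the thesis code.

import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

def scattered_points(points, c):
    # pick c well-scattered points: start from the point farthest from the centroid,
    # then repeatedly add the point farthest from those already chosen
    centre = centroid(points)
    chosen = [max(points, key=lambda p: dist(p, centre))]
    while len(chosen) < min(c, len(points)):
        chosen.append(max((p for p in points if p not in chosen),
                          key=lambda p: min(dist(p, q) for q in chosen)))
    return chosen

def shrink(representatives, centre, alpha):
    # move each representative a fraction alpha of the way toward the centroid;
    # alpha = 1 collapses all representatives onto the centroid itself
    return [tuple(r[d] + alpha * (centre[d] - r[d]) for d in range(len(r)))
            for r in representatives]

cluster = [(0.0, 0.0), (0.0, 4.0), (4.0, 0.0), (4.0, 4.0), (2.0, 2.0)]
centre = centroid(cluster)
reps = scattered_points(cluster, c=3)
print(shrink(reps, centre, alpha=0.5))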
A comparison of the different clustering algorithms, based on type, space and time
complexity and on whether each is incremental or iterative, is given in Table 2.4.
There are some guidelines for selecting an appropriate data mining technique.
1. Does the data contain several missing values?
Most data mining researchers agree that, if applicable, neural networks tend to
outperform other models when a wealth of noisy data are present.
2. Is time an issue?
Algorithms for building decision trees and production rules typically execute
much faster than neural network or genetic learning approaches.
3. Do we know the distribution of the data?
Datasets containing more than a few hundred instances can be a problem for
data mining techniques that require the data to conform to certain standards.
For example, many statistical techniques assume the data to be normally
distributed.
4. Do we know which attributes best define the data to be modeled?
Decision trees and certain statistical approaches can determine those attributes
most predictive of class membership. Neural networks, nearest neighbor and
various clustering approaches assume attributes to be of equal importance.
This is a problem when several attributes not predictive of class membership
are present in the data.
5. Which technique is most likely to give best classification accuracy?
For a particular problem, some of these questions have obvious answers. For example, we
know a neural network is a black-box structure; therefore, this technique is a poor choice
if an explanation of what has been learned is required. Also, association rules are usually
the best choice when attributes are allowed to play multiple roles in the data mining process.
We can also select a data mining technique based on the data mining task we want to
perform. In Table 2.5, data mining problem types are related to appropriate modeling
techniques.
2.12 REFERENCES
CHAPTER 3
Financial Cyber Crime and Frauds
Cyber crime encompasses any criminal act dealing with computers and networks (called
hacking). Additionally, cyber crime also includes traditional crimes conducted through
the Internet. For example, hate crimes, telemarketing and Internet fraud, identity theft,
and credit card account thefts are considered to be cyber crimes when the illegal activities
are committed through the use of a computer and the Internet.
The Information Systems Security Association (ISSA), Ireland conducts the IRIS cyber crime
survey every year. They developed a questionnaire in which respondents indicated the
types of cyber crime incident which had affected their organization. Figure 3.1 details the
responses received in the year 2007.
Figure 3.1 Number of respondents reporting each type of cyber crime incident in the 2007
IRIS survey: system or network intrusion (internal source), electronic employee harassment
(external source), electronic financial fraud (external and internal), organisational identity
theft (e.g. cloned website), theft of intellectual property, phishing directed against the
organisation, telecom fraud, and attacks against manufacturing, SCADA or process control
systems.
One example of financial crime is, a website offered to sell Alphonso mangoes at a
throwaway price. Initially very few people responded to or supplied the website with
their credit card numbers. These people were actually sent the Alphonso mangoes. The
word about this website now spread like wildfire. Thousands of people from all over the
country responded to this site and ordered mangoes by providing their credit card
numbers. The owners of what was later proven to be a bogus website then fled taking the
numerous credit card numbers and proceeded to spend huge amounts of money much to
the chagrin of the card owners.
For an online transaction we simply have to type the credit card number, expiry date and
CVV number into the vendor's web page. If electronic transactions are not secured, the
credit card numbers can be stolen by hackers, who can misuse the card by impersonating
the credit card owner.
Copying the company’s confidential data in order to extort a huge amount from that
company.
3.3.3 Phishing
In such a crime the criminal makes insignificant changes in such a manner that the changes
go unnoticed. The criminal writes a program that deducts a small amount, such as
Rs. 1.00 per month, from the accounts of all the customers of a bank and deposits it in his
own account. No account holder will approach the bank over such a small amount, but the
criminal gains a huge amount.
This crime can be committed by sale and purchase through the net. There are web sites which
offer the sale and shipment of contraband drugs. They may use the technique of
steganography for hiding the messages.
Fraud may be defined as a dishonest or illegal use of services with the intention of avoiding
service charges. Frauds have plagued telecommunication industries, financial institutions
and other organizations for a long time. These frauds cost businesses a great deal every
year. As a result, fraud detection has become an important and urgent task for these
businesses. At present a number of methods have been implemented to detect frauds,
from both statistical approaches (e.g. data mining) and hardware approaches (e.g.
firewalls, smart cards).
We discuss the types of fraud like Credit card fraud, Telecommunications fraud and
Intrusion in computer systems.
Credit card fraud occurs when an individual uses another individual’s credit card for
personal reasons while the owner of the card and the card issuer are not aware of the fact
that the card is being used. Further, the individual using the card has no connection with the
cardholder or issuer, and has no intention of either contacting the owner of the card or
making repayments for the purchases made.
Generally we can categorize credit card fraud into two main types: 1. identity theft fraud
and 2. non-identity theft fraud.
While identity theft and what we call credit card fraud are both pernicious crimes, and
both constitute fraud, we would like to distinguish the two for policy purposes. We place
identity theft into two basic categories.
This involves the unlawful acquisition and use of another person’s identifying
information to obtain credit, or the use of that information to create a fictitious identity to
establish an account.
In order to commit identity theft by means of a fraudulent application, the perpetrator needs
to acquire not just a name, address or credit card number but unique identifiers such as the
mother’s maiden name, social security number and detailed information about a person’s
credit history, such as the amount of their most recent mortgage payment. This is why more
than 40 percent of the identity theft cases that we see are committed by someone familiar to
the victim, frequently a family member or someone in a position of intimacy or trust.
This variety of identity theft represents three percent of our total fraud cases.
This occurs when someone unlawfully uses another person’s identifying information to
take ownership of an account. This would typically occur by making an unauthorized
change of address followed by a request for a new product such as a card or check, or
perhaps a PIN number. This variety of identity theft represents less than one percent of
our total fraud cases.
3.5.1.2 Non-identity Theft Fraud-The Other 96 Percent of Our Total Fraud Cases
This type of fraud constitutes the vast majority of occurrences and falls under four basic
headings.
1) Lost or Stolen Cards: The card is actually in possession of the customer and is
subsequently lost or stolen.
2) Non-Receipt: The card is never received by the customer and is intercepted by
the perpetrator prior to or during mail delivery.
The prevention of credit card fraud is an important application for prediction techniques.
One major obstacle for the neural network training technique is the high diagnostic quality
required: since only about one financial transaction in a thousand is invalid, no prediction
success rate below 99.9% is acceptable.
Johnson defines telecommunications fraud as any transmission of voice or data across
a telecommunications network where the intent of the sender is to avoid or reduce
legitimate call charges. In a similar vein, Davis and Goyal define fraud as obtaining
unbillable services and undeserved fees.
There are many different types of telecoms fraud, and these can occur at various levels.
The two most prevalent types are subscription fraud and superimposed or ‘surfing’ fraud.
Subscription fraud: This occurs when a fraudster obtains a subscription to a service, often
with false identity details, with no intention of paying. The fraud is thus at the level of a phone
number – all transactions from this number will be fraudulent.
Superimposed fraud: This is the use of a service without having the necessary authority
and is usually detected by the appearance of ‘phantom’ calls on a bill. There are several
ways to carry out superimposed fraud, including mobile phone cloning and obtaining
calling card authorization details. Superimposed fraud will generally occur at the level of
individual calls – the fraudulent calls will be mixed in with the legitimate ones.
Subscription fraud will generally be detected at some point through the billing process –
though one would aim to detect it well before that, since large costs can quickly be run
up. Superimposed fraud can remain undetected for a long time.
Intrusion detection plays a vital role in today’s networked environment. Intrusions into
computer systems include unauthorized users penetrating the computer systems and
authorized users abusing their privileges. Intrusion into computer systems is the most
epidemic type of fraud since it is easy to commit. Furthermore, it is very difficult to trace
the intruders because they may hide in any corner of the world so long as they have the
Internet connection.
In recent years, computer security has become increasingly important and an international
priority. Intrusion detection techniques are largely categorized into two types: anomaly
detection and misuse detection.
Anomaly detection: In this technique, the task is focused on extracting normal (non-
fraudulent) usage patterns and finding out deviation from them.
Misuse detection: In this technique, the patterns of previous intrusions and the
vulnerable spots of a system are captured based on the historical data. Then, an intrusion
trail is compared with these identified previous patterns.
This would include cheating, credit card frauds, money laundering etc.
Credit-card fraud detection is especially challenging because the analyst needs to identify
both the physical theft of a card, as well as an individual's identity; this means stolen
cards, as well as cloned and personal identification number (PIN) thefts. This type of
fraud can also be the result of the theft of an individual's identification, such as his or her
home address, for the creation of new accounts under false or stolen identities.
Credit-card theft will defraud the credit-card issuer or merchant. It has a profile of many
small amounts, and an out-of-character purchasing pattern. The fraud activity is time-
constrained. The card will be reported as stolen at some point and identity theft will be
detected, at least by the next statement date. This time constraint forces perpetrators to
use the card rapidly and for amounts normally out of pattern—this is the signature of this
financial crime and a method to its detection. It is a crime where, inevitably, some loss
will occur before detection. This crime is both highly organized and opportunistic.
Internet and phone-order transactions are the classic card-not-present (CNP) sales. They
are also time-sensitive crimes, where the thieves are racing to beat the credit-card
monthly statement mailing date.
Clues to these perpetrators are the use of Web-based e-mail addresses and different
shipping and billing addresses.
This type of financial crime involves the manipulation and inflation of an individual's
credit rating prior to performing a "sting," leading to a loan default and a loss for the
financial service provider.
This financial crime relies on creating a false identity and takes time to develop. Once an
account has been created with a stolen or false identity, the marketing initiatives
employed by the bank or credit-card issuer assist the perpetrator in building a portfolio of
credit-cards, loan accounts, and a viable credit-rating and history—before defaulting on
them.
This financial crime involves the creation of fictitious bank accounts for the conduit of
money and the siphoning of other legitimate accounts. It may also be for fictitious
account purchases, particularly in association with investment accounts, bond and bearer
bond transactions.
Many of the methods of executing internal fraud are similar to money laundering, except
there is an obvious attempt to defraud the bank, whereas in money laundering the
objective is simply to hide the funds. In addition, this fraud often works in conjunction
with the establishment of creditworthy accounts, lines of credit, and fictitious accounts.
The sting is often a single or small number of large-volume transactions, often related to
real estate purchases, business investments, and the like.
messages between banks—"wire transfer"—is one way to swiftly move illegal profits
beyond the easy reach of law enforcement agents and at the same time begin to launder
the funds by confusing the audit trail.
To launder money is to disguise the origin or ownership of illegally gained funds to make
them appear legitimate. Hiding legitimately acquired money to avoid taxation, or moving
money for the financing of terrorist attacks also qualify as money laundering activities.
1. Placement: introducing cash into the banking system or into legitimate commerce
2. Layering: separating the money from its criminal origins by passing it through
several financial transactions, such as transferring it into and then out of several
bank accounts, or exchanging it for travelers' checks or a cashier's check
3. Integration: aggregating the funds with legitimately obtained money or providing
a plausible explanation for its ownership
Wire transfers of illicit funds are yet another key vehicle for moving and laundering
money through the vast electronic funds transfer systems. Using data mining technologies
and techniques for the identification of these illicit transfers could reveal previously
unsuspected criminal operations or make investigations and prosecutions more effective
by providing evidence of the flow of illegal profits.
There are many ways to launder money. Any system that attempts to identify money
laundering will need to evaluate wire transfers against multiple profiles. In addition,
money launderers are believed to change their MOs frequently. If one method is
discovered and used to arrest and convict a ring of criminals, activity will switch to
alternative methods. Law enforcement and intelligence community experts stress that
criminal organizations engaged in money laundering are highly adaptable and flexible.
For example, they may use non bank financial institutions, such as exchange houses and
check cashing services and instruments like postal money orders, cashier's checks, and
certificates of deposit. In this way, money launderers resemble individuals who engage in
ordinary fraud: They are adaptive and devise complex strategies to avoid detection. They
often assume their transactions are being monitored and design their schemes so that each
transaction fits a profile of legitimate activity.
As with other criminal detection applications the major obstacle to using data mining
techniques is the absence of data uniformity. Related issues, such as the absence of
experts, high costs, and privacy concerns, are being reevaluated in light of the recent
terrorist attacks. The post-9/11 environment is changing the priorities of years ago. One
of the biggest obstacles to using data mining to detect the use of wire transfers for illegal
money laundering was the poor quality of the data; ineffective standards did not ensure
that all the data fields in the reporting forms were complete and validated.
Insurance fraud and health care-related crimes are widespread and very costly to carriers,
the government, and the consumer public. Insurance fraud involves intentional deception
or misrepresentation intended to result in an unauthorized benefit. An example would be
billing for health care services that have not been rendered. Health care crime involves
charging for services that are not medically necessary, do not conform to professionally
recognized standards, or are unfairly priced. An example would be performing a
laboratory test on a large number of patients when only a few should have it. Health care
crime may be similar to insurance fraud, except that it is not possible to establish that the
abusive acts were done with intent to deceive the insurer.
False-claim schemes are the most common type of health-insurance fraud. The goal in
these schemes is to obtain undeserved payment for a claim or series of claims.
This includes billing for services, procedures, or supplies that were not provided or used,
as well as misrepresentation of what was provided, when it was provided, the condition
or diagnosis, the charges involved, or the identity of the provider recipient. This may also
involve providing unnecessary services or ordering unnecessary tests.
Illegal billing schemes involve charging a carrier for a service that was not performed.
This includes unbundling of claims—that is, billing separately for procedures that
normally are covered by a single fee. A variation is double billing, charging more than
once for the same service. Another is upcoding, the scam of charging for a more complex
service than was performed. This may also involve kickbacks, in which a person receives
payment or other benefits for making referrals.
Many instances have been discovered in which corrupt attorneys and health care
providers, usually chiropractors or medical clinics, combine to bill insurance companies
for nonexistent or minor injuries. The typical scam includes "cappers" or "runners," who
are paid to recruit legitimate or fake auto-accident victims or worker's compensation
claimants. Victims are commonly told they need multiple visits.
Mills fabricate diagnoses and reports, providing expensive, but unnecessary, services.
The lawyers then initiate negotiations on settlements based upon these fraudulent or
exaggerated medical claims.
3.6.1.11 Miscoding
analysis may be billed as one or more tests for vitamin deficiency. Nonstandard allergy
tests may be coded as standard ones.
Scams such as phishing, spyware and malware are responsible for online banking fraud.
3.7.1 Phishing
Phishing is the name given to the practice of sending emails at random, purporting to come
from a genuine company operating on the internet, in an attempt to trick customers of that
company into disclosing information at a bogus website operated by fraudsters. These
emails usually claim that it is necessary to ‘update’ or ‘verify’ your password, and they
urge the recipient to click on a link in the email that leads to the bogus website. Any
information entered on the bogus website will be captured by the criminals for their own
fraudulent purposes.
Phishing originated because the banks’ own systems have proved incredibly difficult to
attack. Criminals have therefore turned their attention to phishing attacks on individual
internet users in order to gain personal or secret information that can be used online for
fraudulent purposes.
3.7.2 Malware
Although the rising number of phishing incidents has undoubtedly helped to raise fraud
losses, we also know that online banking customers are increasingly being targeted by
malware attacks. Malware (malicious software) includes computer viruses that can be
installed on a computer without the user’s knowledge, typically when users click on a
link in an unsolicited email or download suspicious software. Malware is capable
of logging keystrokes, thereby capturing passwords and other financial information.
3.7.3 Spyware
Spyware is a type of computer virus that can be installed on a computer without the user
realizing it. Spyware is sometimes capable of acting as a ‘keystroke logger’, capturing all
of the keystrokes entered on a computer keyboard. Typically the fraudsters will send out
emails at random to get people to click on a link in the email and visit a malicious
website, where vulnerabilities on the customer’s computer are exploited to install the
spyware. The emails are not normally related to internet banking, and try to dupe people
into visiting, or clicking on the link to, the malicious website using a variety of excuses.
The Internet Crime Complaint Centre (IC3) was established with a mission to serve as a
vehicle to receive, develop, and refer criminal complaints regarding the rapidly
expanding arena of cyber crime. IC3 accepts online Internet crime complaints from either
the person who believes they were defrauded or from a third party to the complainant.
During 2008, non-delivery of merchandise and/or payment was by far the most reported
offense, comprising 32.9% of referred crime complaints. This represents a 32.1%
increase from the 2007 levels of non-delivery of merchandise and/or payment reported to
IC3. In addition, during 2008, auction fraud represented 25.5% of complaints (down
28.6% from 2007), and credit and debit card fraud made up an additional 9.0% of
complaints. Confidence fraud such as Ponzi schemes, computer fraud, and check fraud
complaints represented 19.5% of all referred complaints. Other complaint categories such
as Nigerian letter fraud, identity theft, financial institutions fraud, and threat complaints
together represented less than 9.7% of all complaints (See Figure 3.2).
Figure 3.2 2008 Top 10 IC3 Complaint Categories (percentage of referred complaints),
including non-delivery, auction fraud, computer fraud, check fraud, identity theft and threat
complaints.
Source : www.ic3.gov
During 2008, non-delivered merchandise and/or payment were, by far, the most reported
offense, comprising 32.9% of referred complaints. Internet auction fraud accounted for
25.5% of referred complaints. Credit/debit card fraud made up 9.0% of referred
complaints. Confidence fraud, computer fraud, check fraud, and Nigerian letter fraud
round out the top seven categories of complaints referred to law enforcement during the
year.
A key area of interest regarding Internet fraud is the average monetary loss incurred by
complainants contacting IC3 (See Figure 3.3). Such information is valuable because it
provides a foundation for estimating average Internet fraud losses in the general
population. To present information on average losses, two forms of averages are offered:
the mean and the median. The mean represents a form of averaging that is familiar to the
general public: the total dollar amount divided by the total number of complaints.
Because the mean can be sensitive to a small number of extremely high or extremely low
loss complaints, the median is also provided. The median represents the 50th percentile,
or midpoint, of all loss amounts for all referred complaints. The median is less
susceptible to extreme cases, whether high or low cost.
Of the 72,940 fraudulent referrals processed by IC3 during 2008, 63,382 involved a
victim who reported a monetary loss. Other complainants who did not file a loss may
have reported the incident prior to victimization (e.g., received a fraudulent business
investment offer online or in the mail), or may have already recovered money from the
incident prior to filing (e.g., zero liability in the case of credit/debit card fraud).
The total dollar loss from all referred cases of fraud in 2008 was $264.6 million. That loss
was greater than in 2007, when a total loss of $239.1 million was reported. Of those
complaints with a reported monetary loss, the mean dollar loss was $4,174.50 and the
median was $931.00. Nearly fifteen percent (14.8%) of these complaints involved losses of
less than $100.00, and 36.5% reported a loss between $100.00 and $1,000.00. In other
words, over half of these cases involved a monetary loss of less than $1,000.00. Nearly a
third (33.7%) of the complainants reported a loss between $1,000.00 and $5,000.00.
Figure 3.3 Distribution of Referred Complaints by Reported Dollar Loss, 2008 (loss ranges
from $.01–$99.99 up to $100,000.00 and over)
Source : www.ic3.gov
Amount Lost per Referred Complaint by Selected Complainant Demographics

Demographic      Average (Median) Loss Per Typical Complaint
Male             $993.76
Female           $860.98
Under 20         $500.00
20-29            $873.58
30-39            $900.00
40-49            $1,010.23
50-59            $1,000.00
60 and older     $1,000.00
3.8.2 Case Studies of APACS (UK Payment Association and UK Card Association)
Figure 3.4 Total Credit and Debit Card Fraud Losses in the U.K., 1998–2008 (£ millions)
Table 3.2 Plastic Card Fraud Losses by Type in the U.K. (£ millions)

Fraud Type        98    99    00    01    02    03    04    05    06    07    08
Card Not Present  13.6  29.3  72.9  95.7  110.1 122.1 150.8 183.2 212.7 290.5 328.4
Counterfeit       26.8  50.3  107.1 160.4 148.5 110.6 129.7 96.8  98.6  144.3 169.8
Lost/Stolen       65.8  79.7  101.9 114.0 108.3 112.4 114.5 89.0  68.5  56.2  54.1
Card ID Theft     16.8  14.4  17.4  14.6  20.6  30.2  36.9  30.5  31.9  34.1  47.4
Mail non-receipt  12.0  14.6  17.7  26.8  37.1  45.1  72.9  40.0  15.4  10.2  10.2
Total             135.0 188.4 317.0 411.5 424.6 420.4 504.8 439.4 427.0 535.2 609.9
APACS has been the forum for the co-operative activity of banks, building societies and
card issuers on payments and payment systems in the U.K. since the mid-80s. Figure 3.4
shows the total losses in £ millions from 1998 to 2008 in the U.K. due to credit and debit
card fraud. Table 3.2 shows how this plastic card fraud occurs category-wise: card-not-present,
counterfeit, lost/stolen, card ID theft and mail non-receipt.
3.8.2.2 Card fraud losses split by type (as percentage of total losses)
Figure 3.5 Percentage of Different Plastic Card Fraud Category in Year 1998
Source : www.cardwatch.org.uk
Figure 3.6 Percentage of Different Plastic Card Fraud Category in Year 2008
Source : www.cardwatch.org.uk
Figure 3.7 Financial Cyber Crime Losses in the U.K., 2000–2008 (£ millions)
Figure 3.7 shows financial cyber crime in £ millions from year 2000 to 2008 in U.K.
According to the Cybersource, 11th Annual Online Fraud Report, which is based on
U.S.A. and Canadian online merchants, from 2006 to 2008 the percent of online revenues
lost to payment fraud was stable. However, total dollar losses from online payment fraud
in the U.S. and Canada steadily increased during this period as eCommerce continued to
grow.
The percent of accepted orders which are later determined to be fraudulent also fell in
2009. In 2009, merchants reported an overall average fraudulent order rate of 0.9%, down
from 1.1% in 2008 for their U.S. and Canadian orders. Over the past six years the average
percent of accepted orders which turn out to be fraudulent has varied from 1.0% to 1.3%.
2009 represents the first time this rate has dropped below the 1% threshold. Among
industry sectors, Consumer Electronics reported the highest fraudulent order rate,
averaging 1.5%, but this was down from 2.0% in 2008.
Since 2007, the percent of orders rejected due to suspicion of fraud has fallen from 4.2%
to 2.4% in 2009, a decline of more than 40% in order rejection, representing a 1.8%
increase in total orders accepted.
3.10 REFERENCES
[1] Jesus Mena : Investigative Data Mining for Security and Criminal Detection
[2] Website – www.ic3.gov
[3] Website – www.cardwatch.org.uk
[4] Website – www.cybersource.com
[5] Website – www.issaireland.org/cybercrime
[6] Website – www.en.wikipedia.org
[7] Website – www.fas.org
[8] Website – www.citizencentre.virtualpune.com
[9] Website – www.indiaforensic.com
[10] Website – www.itbusinessedge.com
CHAPTER 4
Role of Data Mining in Financial Crime Detection
Today, industry is facing huge losses due to these types of financial crimes. If financial
crime can be detected through data mining techniques and prevented, it will be of great
benefit to the industry.
In this chapter we have suggested a two-tier architecture model for financial crime
detection. In the first stage the financial transaction is verified against the rule-based
system and is given a risk score by the system. These rules contain human insight. The
transaction is then passed to the second stage, a data mining technique, which learns from
the past experience of fraudulent transactions and then decides about the current
transaction. The accuracy of prediction therefore increases, as the financial transaction
has to pass through two stages: one a rule-based system and the second a data mining
technique based system.
Figure 4.1 shows the architecture of the two-stage solution for financial crime detection.
In the first stage, the rule-based system contains static rules which are generally based on
human knowledge, i.e. human insight. If the financial transaction passes through this
phase then it is passed to the second phase.
In the second stage, data mining techniques generate dynamic rules based on past
fraudulent transactions. Here learning is totally dynamic, so if the pattern of fraudulent
transactions changes then the model learns from the transactions and generates dynamic
rules for prediction of financial crime.
Figure 4.1 Two-Stage Architecture for Financial Crime Detection: a financial transaction
passes through Stage 1 (Rule Based System) and then Stage 2 (Data Mining Based System).
In this section we suggest the two-stage solution for each type of financial crime.
2. If the current transaction amount is very much greater than the average transaction
amount and the income range is medium, then recommendation = Fraud.
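As an illustration of how such a static rule could be encoded in the Stage 1 rule-based system, the following Python sketch checks the ratio of the current amount to the customer's average. The field names, the ratio threshold of 5 and the risk score of 80 are illustrative assumptions, not values taken from the thesis.

def rule_amount_vs_average(transaction, profile, ratio_threshold=5.0):
    # Stage 1 static rule: flag a transaction whose amount greatly exceeds the
    # customer's average while the customer's income range is medium
    amount = transaction["amount"]
    avg = profile["average_transaction_amount"]
    if avg > 0 and amount / avg >= ratio_threshold and profile["income_range"] == "medium":
        return "Fraud", 80               # recommendation and an illustrative risk score
    return "Genuine", 0

txn = {"amount": 52000}
profile = {"average_transaction_amount": 4000, "income_range": "medium"}
print(rule_amount_vs_average(txn, profile))      # ('Fraud', 80)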
4.2.1.2 Detection Technique: Sequencing of purchases will change; the merchant mix
will be out of character compared to previous consumer transactions. Frequency,
monetary, and recency (FMR) techniques can be examined and employed. Time-
sequence accumulated-risk scores may be used as an input to aggregated risk exposure.
A change in location may indicate a ring operation. There are a number of leads that
relate specifically to credit card and debit card fraud. They are common points-of-
purchase (CPP) detection, particularly with regard to new merchant agents. The main
method of detection is to look for outliers and changes in the normal patterns of usage.
A SOM neural network can be used to perform an autonomous clustering of patterns in
the data.
4.2.2.2 Detection Technique: Indicators include looking for repeated attempts with
slight variations of card numbers or the use of different names and addresses. Another
possible indication of trouble is an IP address at variance with other data. If
demographics are available, a model may be developed. The absence of certain data,
such as activity in a credit report, is also a signal of possible identity theft and fraud.
This type of financial crime involves the manipulation and inflation of an individual's
credit rating prior to performing a "sting," leading to a loan default and a loss for the
financial service provider.
This financial crime is done by creating a false identity and it takes time to develop.
Once an account has been created with a stolen or false identity, the marketing
initiatives employed by the bank or credit-card issuer assist the perpetrator in building a
portfolio of credit-cards, loan accounts, and a viable credit-rating and history—before
defaulting on them.
A rule-based scoring system can be developed for preventing loan default based on various
parameters such as age (a younger borrower is given more points, an older borrower fewer),
educational qualification (more points for higher studies or degrees, otherwise fewer), the
number of assets owned by the borrower at home (more points for more assets), the
borrower's income, margin, etc.
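A minimal Python sketch of such a scoring function is given below. All of the point values, the income cut-off and the education categories are invented for illustration; an actual scorecard would be calibrated on historical loan data.

def loan_default_safety_points(age, education_level, assets_owned, annual_income, margin):
    # Illustrative rule-based scoring: each rule awards points and a higher
    # total indicates a borrower who is less likely to default
    points = 0
    points += 20 if age < 35 else 10                     # younger borrower: more points
    points += {"school": 5, "graduate": 10, "postgraduate": 20}.get(education_level, 0)
    points += min(assets_owned * 5, 20)                  # more assets owned at home: more points
    points += 20 if annual_income > 500000 else 10       # borrower's income
    points += 10 if margin >= 0.2 else 0                 # borrower's own margin contribution
    return points

print(loan_default_safety_points(age=30, education_level="postgraduate",
                                 assets_owned=3, annual_income=600000, margin=0.25))   # 85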
4.2.3.2 Detection Technique: There are many lead indicators available. There is often
only one "pot" of money that is cycled through the various accounts—a pattern of cash
withdrawals from credit cards, and then at the end of the credit cycle, a similar amount
repaid, usually using a cash withdrawal from another credit card. Lead indicators
include credit cards that are rarely used to make actual merchant purchases and have
small outstanding credit balances. Another pattern to look for is a loan account that is
left unused. These techniques inflate a centrally controlled credit rating, providing a
false impression that the account is deemed responsible. Detection has to occur before
the "sting," which is a use of the credit and loan accounts very rapidly within a credit
cycle. This financial crime can result in high losses. Detection must occur before the
loss, because the sting has a short execution time.
The critical factor in detecting all of these financial fraud crimes is knowing the
behavior of credit, bank, and loan accounts and developing an understanding of the
categories of customers. Data mining can be used to spot outliers or account usage that is
out of character. Sometimes the account seems "too good to be true,"
and it often is. The absence of telephone numbers or other contact information may
indicate a "ring." These rings enable fraudulent activities to be distanced from their
sources and add complexity to criminal detection. Another clue is the multiple use of
the same address or phone number for different accounts.
A rule-based scoring system can also be developed for insurance crime: for example, a neck
injury can be given more risk points than a leg injury, and a laboratory or x-ray report that
is not relevant or is unnecessary for the disease can be assigned a high risk score according
to the degree of irrelevancy, etc.
In the insurance industry, there are various methods by which carriers attempt to review
for fraud while processing policy claims. The following are some important data
attributes for detecting potential fraud claims:
• Duration of illness
• Net amount cost
• Illness (disease)
• Claimant sex
• Claimant age
• Claim cost
• Hospital
Using these variables, analyses can be performed to identify outliers for each, such as
test costs, hospital charges, illness duration, and doctor charges. These are some of the
temporal parameters for analyzing insurance claims.
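One simple way to operationalize the outlier analysis described above is to flag claims whose cost lies several standard deviations away from the historical mean for the same illness. The sketch below uses a z-score for this; the function name, the threshold of 3 and the sample figures are assumptions made for illustration only.

import statistics

def flag_outlier_claims(historical_costs, new_claims, z_threshold=3.0):
    # Flag claims whose cost deviates by more than z_threshold standard
    # deviations from the historical mean cost for the same illness
    mean = statistics.mean(historical_costs)
    stdev = statistics.pstdev(historical_costs)
    flagged = []
    for claim_id, cost in new_claims:
        z = (cost - mean) / stdev if stdev > 0 else 0.0
        if abs(z) > z_threshold:
            flagged.append((claim_id, cost, round(z, 2)))
    return flagged

history = [1200, 1350, 1100, 1280, 1400, 1250, 1320]     # past costs for one illness
claims = [("C101", 1300), ("C102", 5200)]
print(flag_outlier_claims(history, claims))              # only C102 is flagged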
4.2.5.2.1 Detection Technique: Depending on the insurance carrier, we can use various
methods in an attempt to identify false claims, including red-flag reviews by fraud
specialists, both on-line and behind the scenes. A carrier may also use an expert system,
which is a rule-based program that codifies the rules of a human reviewer. Link analysis
may be used to look for a ring of fraudulent providers, and, of course, data mining tools,
such as neural networks, may be used for training and detection if samples of fraud
cases exist. The net amount of the claim may be too large compared to the average
amount of similar claims.
4.2.5.3.1 Detection Technique: The methods are the same as with false claims. In
addition, a carrier may use models and rules developed by insurance specialists coupled
with those from data mining analyses, such as decision trees or rule generators, to detect
these schemes.
4.2.5.4.1 Detection Technique: Mill activity can be suspected when claims are
submitted for many unrelated individuals who receive similar treatment from a small
number of providers. These claims are typically manually reviewed by claim specialists;
however, link analysis and rule generators can also be used for screening large volumes
of claims.
4.2.5.5 MISCODING
4.2.5.5.1 Detection Technique: Any code that is not standard must be subject to review
and matched against prior claims from similar clinics or practitioners, typically
performed by red-flag claim specialists. Clustering of historical data can be used to
detect outliers automatically, and to check a disease (illness) against average duration
and cost using a historical claims database to generate a histogram.
4.3. CONCLUSION
Data mining techniques like neural networks, decision trees, link analysis etc. can be
very helpful for financial crime detection. These techniques can be used in combination
with a rule-based system, so the accuracy of prediction increases considerably. The two-tier
architecture model can be used very effectively for verifying any financial transaction.
A financial transaction has to pass through two levels of verification, so the prediction
gets closer to the real outcome; moreover, a genuine or normal transaction is not caught
by the model as a fraudulent transaction, so the normal or genuine customer does not have
to suffer.
Here we have suggested a two-stage solution for financial crime detection, which is in fact
a hybrid approach containing both human insight and machine insight. For these types of
crime the hybrid approach proves more powerful than any single-stage solution, and the
accuracy of prediction increases drastically.
In this type of model or system, we also need to take care that no normal or genuine
transaction is caught as a fraudulent transaction, creating overhead for the customer. If
any customer suffers in this way then we might lose him.
4.4. REFERENCES
CHAPTER 5
Data Warehouse Implementation
We can define a data warehouse as a historical database designed for decision support.
A more precise definition is given by W. H. Inmon (1996). Specifically, a data warehouse
is a subject-oriented, integrated, time-variant and nonvolatile collection of data in support
of management's decision-making process.
In Data Warehouse environments, the relational model can be transformed into the
following architectures:
• Star schema
• Snowflake schema
• Constellation schema
Star schema architecture is the simplest data warehouse design. The main feature of a star
schema is a table at the center, called the fact table, and the dimension tables which allow
browsing of specific categories, summarizing, drill-downs and specifying criteria.
Typically, most of the fact tables in a star schema are in database third normal form,
while dimension tables are de-normalized (second normal form). Despite the fact that
the star schema is the simplest data warehouse architecture, it is the one most commonly
used in data warehouse implementations across the world today (in about 90-95% of cases).
The snowflake schema is a variation of the star schema model, where some dimension
tables are normalized, thereby further splitting the data into additional tables. The
resulting schema graph forms a shape similar to a snowflake.
The major difference between the snowflake and star schema models is that the
dimension tables of the snowflake model may be kept in normalized form to reduce
redundancies. Such a table is easy to maintain and saves storage space, because a large
dimension table can become enormous when the dimensional structure is included as
columns. However, this saving of space is negligible in comparison to the typical
magnitude of the fact table.
Sophisticated applications may require multiple fact tables to share dimensional tables.
This kind of schema can be viewed as a collection of stars, and hence is called a galaxy
schema or a fact constellation.
Here we have designed the data warehouse using snowflake schema architecture. Data
warehouse design layout is given in Figure 5.1 and 5.2.
Description: This table contains information regarding the transaction performed by card
holder online. Whenever user orders any product through internet, then his transaction
details mentioned below are stored in this table. This is the fact table, so instead of
storing direct values, it stores the links from tables like product_master,
product_category_master, customer_master, creditcard_master, shipping_master,
location_master and seller_master.
Primary key: Serial_id
Foreign Key: Product_id references Product_Master
Product_cat_id references Product_Category_Master
Customer_id references Customer_Master
Creditcard_id references Creditcard_Master
Seller_id references Seller_Master
Shipping_id references Shipping_Master
Location_id references Location_Master
TRANSACTION_DAY_TYPE   NUMBER(1)   1: Holiday, 0: Working day
Description: This table contains personal information mentioned below of the customer
who is currently performing the online transaction. This information is required by the
web site through which the customer wants to perform the transaction.
Primary key: Customer_id
Description: It contains the credit card details of the credit card holder. Whenever a user
wants to purchase anything with the credit card, he must give the credit card number,
expiry date and card verification value (CVV) number; then he is able to perform the
transaction.
Primary key: Creditcard_id
Description: It contains the seller or vendor name, with which customer is performing
the transaction.
Primary key: Seller_id
Description: It contains billing address or residential address of the credit card holder.
During the online transaction, this address is verified against the shipping address entered
by the buyer to decide the sensitivity of the transaction.
Primary key: Address_id
Foreign key: Cityid references City_Master
Stateid references State_Master
Countryid references Country_Master
Description: Product category information is stored in this table. This table is useful to
study the customer purchase behavior in different categories, so the incoming transaction
is predicted according to this behavior.
Primary key: Product_cat_id
Description: This is the address entered by the customer during the online transaction,
where the customer wants his product to be shipped. This address may be different from
billing address.
Primary key: Shipping_id
Foreign key: Cityid references City_Master
Stateid references State_Master
Countryid references Country_Master
Description: It contains the address details of the place from where the customer requests
to purchase the product through the internet. There are several free tools available for
capturing this kind of information. The system matches the city where the transaction is
performed with the billing address's city and takes the time zone into account if the two
are found to be in different countries.
Primary key: Location_id
Foreign key: Cityid references City_Master
Stateid references State_Master
Countryid references Country_Master
Description: This table contains city-related information along with the time zone.
Whenever any online transaction is performed outside the customer's country, the system
uses this table to convert the time zone of one city to the time zone of another city.
Primary key: Cityid
Description: This table is used to store the user id and login date, time of the user.
System resets the value of the tables customer_daily_count, customer_weekly_count,
customer_fortnightly_count and customer_monthly_count by using this table only. E.g. If
logon_day contains the value of 1st January, 2010 then the next day 2nd January, 2010 the
value of daily_count and amount field of customer_daily_count table becomes zero. After
the completion of week, the value of weekly_count and amount field of
customer_weekly_count table becomes zero and accordingly for
customer_fortnightly_count and customer_monthly_count tables.
Primary key: User_id
Description: This table contains personal information mentioned below of the customer
who is the credit card holder.
Primary key: Cardholder_id
Foreign key: Address_id references Address_Master
Cardid references Creditcard_Master
Description: It is a generic fraud table maintained by the system. It stores the number of
fraud transactions performed within the different time periods given below. The system
records the time gap between each pair of consecutive transactions. If a transaction is
found suspicious by the system, it uses this table to calculate the posterior probability
using Bayesian learning and decides about the sensitivity of the transaction.
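The Bayesian update referred to here can be sketched as follows. The prior belief, the likelihood values for observing a very short gap between two transactions and the variable names are assumptions chosen only to illustrate how the posterior strengthens or weakens the suspicion.

def bayes_posterior(prior_fraud, likelihood_fraud, likelihood_genuine):
    # P(F | E) = P(E | F) P(F) / (P(E | F) P(F) + P(E | G) P(G))
    p_f = prior_fraud
    p_g = 1.0 - prior_fraud
    evidence = likelihood_fraud * p_f + likelihood_genuine * p_g
    return likelihood_fraud * p_f / evidence

# Illustrative likelihoods of observing a very short gap between transactions,
# estimated from the generic fraud table and from the genuine transaction history
p_short_gap_given_fraud = 0.60
p_short_gap_given_genuine = 0.05
prior = 0.10                       # initial belief produced by the earlier components

posterior = bayes_posterior(prior, p_short_gap_given_fraud, p_short_gap_given_genuine)
print(round(posterior, 3))         # about 0.571: the belief in fraud is strengthened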
Description: Whenever any transaction is found suspicious by the system, then its details
are stored in the following table. It stores the different time periods since the last
transaction. If another transaction on the same card is found suspicious, then this table is
updated accordingly till either the next transaction is found genuine or the generated risk
score reaches the specified threshold.
Foreign key: Cardid references Creditcard_Master
Apart from these, there are additional tables maintained by the system which hold the
current transaction details of the card holder on a daily, weekly, fortnightly or monthly
basis.
Description: Whenever the customer performs a transaction during the day, this table is
updated automatically by the system. It stores the total number of transactions and the
total amount of purchasing during the current day. (E.g. if the customer first performs a
transaction of Rs. 4000, then transcount contains 1 and amount contains 4000. If the
customer performs a second transaction of Rs. 5000 on the same day, then transcount
contains 2 and amount contains 9000.) The next day, the number of transactions and the
amount of purchasing are automatically reset to zero by the system. This table is therefore
used to observe the daily behavior of the customer. The system then matches this data
against the customer's past daily behavior.
Foreign key: Cardid references Creditcard_Master
Description: This table is used to observe the behavior of current week of customer. All
the transactions performed in the current week are automatically reflected in this table.
This data is used to match past weekly customer behavior. The next week value of these
fields becomes zero.
Foreign key: Cardid references Creditcard_Master
Description: This table stores the transaction details of the current fifteen days only. At
the end of the fifteen days, the values are reset to zero by the system. Here also a
comparison is made between the current fifteen days' behavior and the past fifteen days'
behavior to decide the validity of the transaction.
Foreign key: Cardid references Creditcard_Master
Description: It contains the total transaction details of the current month only. After the
completion of the month, the value of transcount and amount again starts to update
according to the transactions performed by the customer. This table is used to compare
the current monthly behavior of the customer with the past monthly behavior.
Foreign key: Cardid references Creditcard_Master
Description: It stores the transaction details of the whole current day if the customer
performs transactions on a holiday. The customer's current holiday behavior is checked
against the past holiday behavior to predict whether the transaction is genuine or not.
Foreign key: Cardid references Creditcard_Master
Figure 5.1 and Figure 5.2 Data Warehouse Design Layout (snowflake schema): the
Transaction fact table is linked to the dimension tables Product_Master,
Product_Category_Master, Customer_Master, Creditcard_Master, Seller_Master,
Shipping_Master and Location_Master, with Location_Master further linked to
City_Master, State_Master and Country_Master.
The data used in this work was gathered from an online shopping firm. Even though the
firm provided real credit card data for this research, it required that the firm's name be
kept confidential. Although real credit card transactional data was obtained, real credit
card numbers and customers' personal information were not given due to confidentiality,
and fraudulent transactional records were not available.
We have also generated a large amount of synthetic data based on the statistical data, to
test the model's speed on large-scale data. We used a Gaussian distribution to generate
this data. The number of transactional records is more than 10,00,000.
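A minimal sketch of the generation step is shown below. The mean and standard deviation used here are placeholders; in the thesis they are taken from the statistical data summarized in the tables that follow.

import random

def synthetic_amounts(mean, std_dev, count, seed=42):
    # Draw synthetic transaction amounts from a Gaussian distribution,
    # truncated at a small positive minimum so no amount is negative
    random.seed(seed)
    return [max(round(random.gauss(mean, std_dev), 2), 1.0) for _ in range(count)]

# e.g. a customer profile whose purchases average Rs. 4,000 with spread Rs. 1,200
print(synthetic_amounts(mean=4000, std_dev=1200, count=10))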
Category  <20000  20000-29999  30000-39999  40000-49999  50000-59999  60000-69999  70000-79999  80000-89999  >90000
1         .19     .18          .18          .17          .16          .15          .15          .14          .13
2         .36     .38          .37          .36          .34          .32          .32          .30          .31
3         .06     .05          .05          .05          .04          .04          .04          .04          .04
4         .16     .15          .16          .17          .19          .20          .20          .21          .18
5         .05     .08          .09          .09          .08          .07          .07          .06          .04
6         .04     .04          .04          .04          .05          .05          .05          .05          .06
7         .14     .12          .11          .12          .14          .17          .17          .20          .24
The data is generated using Gaussian distribution with the following mean and standard
deviation.
There is a large amount of data in the data warehouse, from the year 2005 to 2009. Here is
a sample of transactions of some customers from the year 2005.
We also studied how realistic credit card numbers are generated. To generate realistic
credit card numbers, we use the semantic graph shown in Figure 5.3.
Figure 5.3 Semantic Graph for Credit Card Number Generation (MII, Card Type, Card
Issuer, Credit Card Number)
The first digit of a credit card number is the Major Industry Identifier (MII), which
represents the source from which the credit card was issued. For example, a credit card
number starting with 6 is assigned for merchandising and banking purposes, as in the case
of the Discover card. Credit card numbers starting with 4 and 5 are used for banking and
financing purposes, as in the case of Visa and MasterCard. The digit 3 is used to represent
travel and entertainment, for instance the American Express card. Table 5.26 gives an
overview of the rules for numbering credit cards. The first six digits, including the MII,
represent the issuer identifier. The rest of the digits on the credit card represent the
cardholder's account number, except the last digit. The lone digit at the very right end of
the complete 15 or 16 digit credit card number sequence is known as the “check digit”,
which often is the final number that is computer generated to satisfy the mathematical
formulations of the Luhn check sum process. Meanwhile, in between the first 6 digits and
the last single check digit is the actual personalized account number – the 8 or 9 digit
sequence given by the card issuer.
The Luhn Algorithm is the check sum formula used by payment verification systems and
mathematicians to verify the sequential integrity of real credit card numbers. It’s used to
help bring order to seemingly random numbers and used to prevent erroneous credit card
numbers from being cleared for use. The Luhn algorithm is not used for straight credit
card number generation from scratch, but rather utilized as a simple computational way to
distinguish valid credit card numbers from random collections of numbers put together.
The validation formula works with most debit cards as well.
The Luhn formula was created and filed as a patent (now freely in the public domain) in
1954 by Hans Peter Luhn of IBM to detect numerical errors in pre-existing and
newly generated identification numbers. Since then, its primary use has been in the area
of check sum validation, made popular by its use to verify the validity of important
sequences such as credit card numbers. Currently, almost all credit card numbers issued
today are generated and verified using the Luhn Algorithm. The Luhn algorithm only
validates the 15-16 digit credit card number and not the other critical components of a
genuine card account such as the expiration date and the commonly used Card
Verification Value (CVV) and Card Verification Code (CVC) numbers.
ALGORITHM 5.1
1. The Luhn Algorithm always starts from right to left, beginning with the rightmost
digit on the credit card face (the check digit). Starting with the check digit and
moving left, double the value of every alternate digit. Non-doubled digits will
remain the same. The check digit is never doubled. For example, if the credit card
is a 16 digit Visa card, the check digit would be the rightmost 16th digit. Thus we
would double the value of the 15th, 13th, 11th, 9th digits, and so on until all odd
digits have been doubled. The even digits would be left the same.
2. For any digit that becomes a two digit number of 10 or more when doubled, add
the two digits together. For example, the digit 5 when doubled will become 10,
which turns into a 1.
3. Now, lay out the new sequence of numbers. The new doubled digits will replace
the old digits. Non-doubled digits will remain the same.
4. Add up the new sequence of numbers together to get a sum total. If the combined
tally is perfectly divisible by ten, then the account number is mathematically valid
according to the Luhn formula. If not, the credit card number provided is not valid
and thus fake or improperly generated.
We can follow the Luhn steps 1 to 4 below, starting with the rightmost digit. I have taken my own credit card number to check how it is mathematically correct according to the Luhn validation technique.
(1) Write down the card number, starting with the rightmost digit:
5 1 7 6 5 3 0 0 9 2 2 4 5 0 0 3
(2) Double every other digit, moving left from the check digit. The doubled values are:
10 14 10 0 18 4 10 0
(3) Where a doubled value has two digits, add the digits together and put the result in place of the original digit, keeping the other digits as they are:
1 1 5 6 1 3 0 0 9 2 4 4 1 0 0 3
(4) Add up the new sequence of digits. The total is 40, which is perfectly divisible by 10, so according to the Luhn algorithm it is a valid credit card number.
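Since the whole generation tool is written in Oracle, the same check can be expressed as a small PL/SQL function. The following is a minimal sketch of Algorithm 5.1 (the function name is illustrative): starting from the rightmost digit, every second digit is doubled, two-digit results are reduced by adding their digits, and the number is valid when the total is divisible by 10.

CREATE OR REPLACE FUNCTION is_luhn_valid(p_card_no IN VARCHAR2) RETURN NUMBER IS
  v_sum    NUMBER  := 0;
  v_digit  NUMBER;
  v_double BOOLEAN := FALSE;    -- the rightmost (check) digit is never doubled
BEGIN
  FOR i IN REVERSE 1 .. LENGTH(p_card_no) LOOP
    v_digit := TO_NUMBER(SUBSTR(p_card_no, i, 1));
    IF v_double THEN
      v_digit := v_digit * 2;
      IF v_digit > 9 THEN
        v_digit := v_digit - 9;   -- equivalent to adding the two digits of the result
      END IF;
    END IF;
    v_sum    := v_sum + v_digit;
    v_double := NOT v_double;
  END LOOP;
  IF MOD(v_sum, 10) = 0 THEN
    RETURN 1;   -- mathematically valid according to the Luhn formula
  ELSE
    RETURN 0;
  END IF;
END;
/

For the worked example above, SELECT is_luhn_valid('5176530092245003') FROM dual returns 1, since the digit total is 40.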
5.8 REFERENCES
[3] Ian H. Witten, Eibe Frank – DATA MINING Practical Machine Learning Tools and
Techniques, Morgan Kaufmann Publishers, ISBN: 0-12-088407-0
[4] K.V.S Sarma – Statistics Made Simple Do It Yourself on PC, Prentice Hall of India,
ISBN: 81-203-1741-6
[5] R.S. Bhardwaj – Business Statistics, Excel Books, ISBN: 81-7446-181-7
[6] Ivan Bayross – SQL, PL/SQL The Programming Language of Oracle, BPB
Publications, ISBN 81-7656-964-X
[7] Nilesh Shah – Database Systems Using Oracle, Prentice Hall of India, ISBN: 81-203-
2147-2
[8] A. Leon, M. Leon – Database Management Systems, Vikas Publishing House, ISBN:
0-81-259-1165-0
[9] http://www.citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.84.1081&rep...
[10] http://www.thetaoofmakingmoney.com/2007/04/12/324.html
[11] http://www.etl-tools.info/en/bi/datawarehouse_star-schema.htm
CHAPTER 6: DEVELOPMENT OF TRANSACTION PATTERN GENERATION TOOL (TPGT)
The transaction pattern generation tool (TPGT) generates the patterns (parameters) based on the historical data stored in the data warehouse. TPGT is implemented in Oracle 9i. All the patterns generated by TPGT collectively decide the purchasing behavior of the card holder. These patterns are very useful for deciding or verifying the current transaction performed by the card holder online. The implementation code is given in the Appendix.
TPGT generates thirteen groups of parameters: DP, CP, PP, TP, WP, VP, AP, FP, MP, SP, HP, LP and GP.
6.1.1 Subparameters of DP
6.1.2 Subparameters of CP
6.1.3 Subparameters of PP
PP has two subparameters: PP1 and PP2.
6.1.4 Subparameters of TP
TP has twelve subparameters: TP1 to TP12.
6.1.5 Subparameters of WP
6.1.6 Subparameters of VP
VP has two subparameters: VP1 and VP2.
6.1.7 Subparameters of AP
AP has two subparameters: AP1 and AP2.
AP1: Number of transactions shipped with the same current shipping address
AP2: Number of transactions with different shipping and billing address
6.1.8 Subparameters of FP
6.1.9 Subparameters of MP
6.1.10 Subparameters of SP
6.1.11 Subparameters of HP
6.1.12 Subparameters of LP
6.1.13 Subparameters of GP
This parameter contains the average amount of purchases per day. For example, if the total amount of purchases made by the customer in one year is Rs. 30,000, then this value is divided by 365 (giving roughly Rs. 82 per day) to derive the value of the parameter.
6.2.2.3 Number of times transactions have taken place within the same category (CP3)
The total number of transactions in each category is also stored by the tool in this parameter. For example, if the customer currently buys a product of the electronics category and has performed six transactions within the same category in the past, then this parameter has the value six.
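As a minimal sketch of how such a count can be derived from the data warehouse (the table and column names below are assumptions based on the schema of Chapter 5; the actual TPGT code is given in the Appendix):

DECLARE
  v_card_id  NUMBER := 1507;   -- illustrative card id
  v_category NUMBER := 2;      -- category of the current transaction
  v_cp3      NUMBER;
BEGIN
  -- count the past transactions of this card in the same product category
  SELECT COUNT(*)
    INTO v_cp3
    FROM transaction_master
   WHERE creditcard_id  = v_card_id
     AND product_cat_id = v_category;
  DBMS_OUTPUT.PUT_LINE('CP3 = ' || v_cp3);
END;
/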
6.2.7.2 Number of transactions with different shipping and billing address (AP2)
The tool finds how many transactions the customer has performed with a shipping address other than his billing address. The customer's habit of shipping to addresses other than the billing address can thus be studied by the model to decide the sensitivity of a new incoming transaction.
6.2.12.2 Number of transactions ordered in a different city within the same state (LP2)
If the customer initiates an order from a city other than his own city but within the same state, then it is added to this parameter.
6.2.12.3 Number of transactions ordered in a different city outside of the state (LP3)
If the customer orders a product from outside of his state but within his country, then it is added to this parameter.
6.2.12.5 Number of transactions shipped to a different city within the same state (LP5)
This parameter stores the number of transactions in which the user has requested to ship the items to a city other than his billing address city, but within his state.
6.2.12.6 Number of transactions shipped to a different city outside of the state (LP6)
This parameter stores the number of transactions in which the user has requested to ship the items to a state other than his billing address state, but within his country.
The calculation of the parameters TP1 to TP8 in the tool is done as follows. The tool divides all the transactions of the customer into eight time frames:
T1 becomes true if the past transaction on the card Ck in the data warehouse was performed in the 3:00 to 6:00 time frame.
T2 becomes true if it was performed in the 6:00 to 9:00 time frame.
T3 becomes true if it was performed in the 9:00 to 12:00 time frame.
T4 becomes true if it was performed in the 12:00 to 15:00 time frame.
T5 becomes true if it was performed in the 15:00 to 18:00 time frame.
T6 becomes true if it was performed in the 18:00 to 21:00 time frame.
T7 becomes true if it was performed in the 21:00 to 0:00 time frame.
T8 becomes true if it was performed in the 0:00 to 3:00 time frame.
The tool then finds the total number of transactions performed by the customer in each of the time frames T1 to T8.
TP1 = occurrences (count) of T1 on the card Ck from the data warehouse (6.9)
TP2 = occurrences (count) of T2 on the card Ck from the data warehouse (6.10)
TP3 = occurrences (count) of T3 on the card Ck from the data warehouse (6.11)
TP4 = occurrences (count) of T4 on the card Ck from the data warehouse (6.12)
TP5 = occurrences (count) of T5 on the card Ck from the data warehouse (6.13)
TP6 = occurrences (count) of T6 on the card Ck from the data warehouse (6.14)
TP7 = occurrences (count) of T7 on the card Ck from the data warehouse (6.15)
TP8 = occurrences (count) of T8 on the card Ck from the data warehouse (6.16)
Finally the percentage of all the parameters of all the transactions is computed as follows.
Percent_TP1=(TP1 * 100) / total transactions on card Ck from the data warehouse (6.17)
Percent_TP2=(TP2 * 100) / total transactions on card Ck from the data warehouse (6.18)
Percent_TP3=(TP3 * 100) / total transactions on card Ck from the data warehouse (6.19)
Percent_TP4=(TP4 * 100) / total transactions on card Ck from the data warehouse (6.20)
Percent_TP5=(TP5 * 100) / total transactions on card Ck from the data warehouse (6.21)
Percent_TP6=(TP6 * 100) / total transactions on card Ck from the data warehouse (6.22)
Percent_TP7=(TP7 * 100) / total transactions on card Ck from the data warehouse (6.23)
Percent_TP8=(TP8 * 100) / total transactions on card Ck from the data warehouse (6.24)
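As a minimal sketch of equations (6.9)-(6.24) in SQL (the table and column names are assumptions based on the Chapter 5 schema, and the card is assumed to have at least one transaction), TP1 and Percent_TP1 for one card could be derived as follows; the other time frames only change the hour bounds.

DECLARE
  v_card_id NUMBER := 1507;   -- illustrative card id
  v_tp1     NUMBER;
  v_total   NUMBER;
BEGIN
  SELECT COUNT(*) INTO v_total
    FROM transaction_master
   WHERE creditcard_id = v_card_id;

  -- TP1: transactions falling in the 3:00 to 6:00 time frame
  SELECT COUNT(*) INTO v_tp1
    FROM transaction_master
   WHERE creditcard_id = v_card_id
     AND TO_NUMBER(TO_CHAR(transaction_date, 'HH24')) BETWEEN 3 AND 5;

  DBMS_OUTPUT.PUT_LINE('Percent_TP1 = ' || ROUND(v_tp1 * 100 / v_total, 2));
END;
/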
L1 becomes true if the transaction on the card Ck in the data warehouse is performed between 0:00 and 4:00.
L2 becomes true if the transaction on the card Ck in the data warehouse is performed at any time other than 0:00 to 4:00.
TP11 = occurrences (count) of L1 on the card Ck from the data warehouse (6.27)
TP12 = occurrences (count) of L2 on the card Ck from the data warehouse (6.28)
G1 becomes true if the transaction occurs within 4 hours of the previous transaction on the same card Ck in the data warehouse.
G2 becomes true if the transaction occurs within 5 to 8 hours of the previous transaction on the same card Ck in the data warehouse.
G3 becomes true if the transaction occurs within 9 to 16 hours of the previous transaction on the same card Ck in the data warehouse.
G4 becomes true if the transaction occurs within 17 to 24 hours of the previous transaction on the same card Ck in the data warehouse.
G5 becomes true if the transaction occurs from the 2nd day up to within a week of the previous transaction on the same card Ck in the data warehouse.
G6 becomes true if the transaction occurs from the second week up to within 15 days of the previous transaction on the same card Ck in the data warehouse.
G7 becomes true if the transaction occurs after 15 days from the previous transaction on
the same card Ck from the data warehouse.
GP1 = occurrences (count) of G1 on the card Ck from the data warehouse (6.36)
GP2 = occurrences (count) of G2 on the card Ck from the data warehouse (6.37)
GP3 = occurrences (count) of G3 on the card Ck from the data warehouse (6.38)
GP4 = occurrences (count) of G4 on the card Ck from the data warehouse (6.39)
GP5 = occurrences (count) of G5 on the card Ck from the data warehouse (6.40)
GP6 = occurrences (count) of G6 on the card Ck from the data warehouse (6.41)
GP7 = occurrences (count) of G7 on the card Ck from the data warehouse (6.42)
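A minimal sketch of how a transaction gap g (expressed in hours) maps to one of the gap events above is shown next; the query is purely illustrative.

-- 36 hours since the previous transaction falls in the 2nd day, hence event G5
SELECT CASE
         WHEN gap_hours <= 4        THEN 'G1'
         WHEN gap_hours <= 8        THEN 'G2'
         WHEN gap_hours <= 16       THEN 'G3'
         WHEN gap_hours <= 24       THEN 'G4'
         WHEN gap_hours <= 7 * 24   THEN 'G5'
         WHEN gap_hours <= 15 * 24  THEN 'G6'
         ELSE 'G7'
       END AS gap_event
  FROM (SELECT 36 AS gap_hours FROM dual);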
A1 becomes true if past transactions in the data warehouse have also been shipped to the same shipping address.
A2 becomes true if the transaction is performed with different shipping and billing addresses.
AP1 = occurrences (count) of A1 on the card Ck from the data warehouse (6.45)
AP2 = occurrences (count) of A2 on the card Ck from the data warehouse (6.46)
6.4 REFERENCES
CHAPTER 7: DEVELOPMENT OF TRANSACTION RISK GENERATION MODEL (TRSGM)
This is one of the most important parameters considered by the model. When the shipping address entered by the customer is different from the billing address, the model checks how many previous transactions have been shipped to the same shipping address by checking the value of the parameter AP1. If it is greater than zero, the model considers the transaction highly genuine and generates a risk score of 0. The model also learns how many transactions in total the customer has performed with different billing and shipping addresses through the parameter AP2.
The model considers the location from which the current online transaction is performed. It uses the parameter LP1, the number of transactions ordered from the same location, and generates a risk score accordingly. If no transaction has been performed from the same location, a higher risk score is generated; the more transactions that have been performed from that location, the lower the risk score.
When a customer purchases a product in any category, the model finds how much the customer has spent in this category. It uses the parameter CP1 for generating the risk score. The higher the value of CP1, the lower the risk score, as the model assumes the transaction matches the customer's purchasing pattern. The lower the value of CP1, the higher the risk score, as the transaction is far from the customer's purchasing habit.
The model also uses the parameter PP1, the time passed since the same product was purchased. If the value of this parameter is small and the product is costly, the model considers the transaction sensitive and generates the risk score accordingly. The number of times the same product has been purchased is also recorded by the model.
The parameters TP1 to TP8 hold the percentage of the total number of transactions performed within each particular time frame. The model records the current transaction time and finds the percentage of transactions in this time frame using the parameters TP1 to TP8. If the percentage is high, a lower risk score is generated, as most of the past transactions were performed within this time frame. If the percentage is low, a higher risk score is generated, as the transaction does not match the customer's past transaction times.
If the current transaction is performed late at night, the model checks the parameter TP12 to find the total number of past transactions performed by the customer late at night. If the value of TP12 is 0 or very small (compared with the total number of transactions), the model considers the transaction sensitive and generates the risk score accordingly.
If the customer is active and the time passed since the last transaction (TP9) is large, the model considers the transaction sensitive and generates the risk score accordingly.
The model also finds the deviation from the maximum amount of all past transactions (TP10). It generates a risk score based on how much the current transaction amount exceeds TP10; if the excess is small, the risk score generated is small.
For a new incoming transaction with a seller, the model checks the parameter VP2 to find the total amount of transactions performed by the customer with the same seller. The more purchasing that has been done with that seller, the lower the risk score; the less purchasing, the higher the risk score.
The customer's first transaction amount on a Sunday is compared with SP4, the maximum individual amount of transactions on Sunday, and the risk score is generated accordingly. If the customer subsequently performs more transactions on that day, their total amount and total number are compared with SP5, the maximum total amount of transactions on Sunday, and SP3, the maximum number of transactions on Sunday.
For the first transaction performed by the customer on a holiday, its amount is compared with HP4, the maximum individual amount of transactions on a holiday, and the risk score is generated accordingly. If the customer subsequently performs more transactions on that day, their total amount and total number are compared with HP5, the maximum total amount of transactions on a holiday, and HP3, the maximum number of transactions on a holiday.
All the transactions performed by the customer on the current day are monitored by the model and stored in the table customer_dailycount. They are compared with the daily parameters to find how close or far the current day's behavior is from the past daily behavior.
The total amount of transactions on the current day is compared with the parameter DP2, the maximum amount of purchase daily, and the risk score is generated accordingly; a higher risk score is generated as its value exceeds DP2.
The total number of transactions on the current day is matched with DP3, the maximum number of transactions in a day, and the risk score is generated accordingly; a higher risk score is generated as its value exceeds DP3, otherwise a lower one.
The transactions of the current week are updated in the table customer_weeklycount. Their values are matched with the weekly parameters to find the deviation from the past weekly behavior.
The weekly transaction amount is compared with WP4, the maximum amount of purchase weekly; if it is greater than WP4 a higher risk score is generated, and if it is less than WP4 a lower risk score is generated.
The total number of transactions of the current week is matched with WP3, the maximum number of transactions in a week. If it is higher, a higher risk score is generated, otherwise a lower one.
The total number of transactions in the current fortnight is checked against the parameter FP3, the maximum number of transactions in a fortnight, and the risk score is generated accordingly. The total amount of transactions in the current fortnight is checked against FP4, the maximum amount of purchase fortnightly, and the risk score is generated accordingly.
All the transactions of the current month are stored in the customer_monthlycount table. This table is used to find how far or close the current month's behavior is from the past monthly behavior by comparing it with the monthly parameters.
The total number of transactions of the current month is compared with MP3, the maximum number of transactions in a month, and the risk score is generated accordingly. The total amount of transactions is compared with MP4, the maximum amount of purchase monthly, and the risk score is generated accordingly.
An important feature of the model is that it records the transaction gap between every two successive transactions performed by the customer. Seven transaction gap parameters, GP1 to GP7, are generated according to the transaction gap.
Whenever any transaction is found suspicious by the model, it updates the field suspect_count of the suspect table. The model then finds which event occurs on this card and finds the probability that it originates from the generic fraudulent transaction set or the normal transaction set by using these parameters. Finally, the posterior probabilities are computed by the model.
Here the time gap between successive transactions on the same card is considered to capture the frequency of card use. The transaction gap is divided into seven mutually exclusive and exhaustive events: E1, E2, E3, E4, E5, E6 and E7. The occurrence of each event depends on the time since the last purchase (the transaction gap, g) on any particular card.
The event E1 is defined as the occurrence of a transaction on the same card Ck within 4 hours of the last transaction.
The event E2 is defined as the occurrence of a transaction on the same card Ck from the 4th to the 8th hour after the last transaction.
The event E3 is defined as the occurrence of a transaction from the 8th to the 16th hour after the last transaction.
The event E4 is defined as the occurrence of a transaction from the 16th to the 24th hour after the last transaction.
The event E5 is defined as the occurrence of a transaction within a week (from the 2nd day to the 7th day) of the last transaction.
The event E6 is defined as the occurrence of a transaction within a fortnight (from the 8th day to the 15th day) of the last transaction.
The event E7 is defined as the occurrence of a transaction after 15 days of the last transaction.
In the TRSGM, a number of rules are used to analyze the deviation of each incoming
transaction from the normal profile of the cardholder by computing the patterns generated
by TPGT. The initial belief value is obtained as the risk score. The belief is further
strengthened or weakened according to its similarity with fraudulent or genuine
transaction history using Bayesian learning. In order to meet this functionality, the
TRSGM is designed with the following five major components: the DBSCAN algorithm, a linear equation, rules, the historical transaction database (HTD) and a Bayesian learner.
A customer usually carries out similar types of transactions in terms of amount, which
can be visualized as part of a cluster. Since a fraudster is likely to deviate from the
customer’s profile, his transactions can be detected as exceptions to the cluster – a
process known as outlier detection. It has important applications in the field of fraud
detection and has been used for quite some time to detect anomalous behavior.
d_outlier = 1 − (ε / v_avg), if |N_ε(P)| < MinPts (7.8)
d_outlier = 0, otherwise
where
MinPts: minimum number of points required in the ε-neighborhood of each point to form a cluster;
ε: maximum radius of the neighborhood, with N_ε(p) = {q ∈ D | dist(p, q) ≤ ε}.
The key idea of the DBSCAN algorithm is that for each point p in a cluster ci, there are at
least a minimum number of points (MinPts) in the ε - neighborhood of that point p
denoted as N ε (p) i.e. the density in the ε - neighborhood has to exceed some threshold.
The larger the ε - neighborhood, the less is the number of clusters formed. If it is set too
high, there will be no cluster since the MinPts condition is not satisfied. However, if both
the parameters are small, there can be a lot of clusters. If MinPts is set to 1, then each
point in the database is treated as a separate cluster and even noise gets identified as a
separate cluster.
Here the DBSCAN algorithm is used to form clusters of the transaction amounts spent by the customer. Whenever a new transaction is performed by the customer, the algorithm finds the cluster coverage of this particular amount. If this amount has occurred more than once in the past, then the TRSGM considers it a highly genuine transaction.
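A minimal sketch of the core idea for one new amount follows (the table and column names are assumptions based on the Chapter 5 schema; ε = 500 and MinPts = 5 are the values used for the cluster graphs later in this chapter). It only checks the ε-neighborhood density condition for the new point, which is the test DBSCAN applies to every point when growing a cluster.

DECLARE
  v_eps        NUMBER := 500;     -- maximum radius of the neighborhood
  v_minpts     NUMBER := 5;       -- minimum points required to form a cluster
  v_new_amount NUMBER := 2300;    -- amount of the new incoming transaction
  v_neighbours NUMBER;
BEGIN
  -- count past amounts of this card lying within eps of the new amount
  SELECT COUNT(*)
    INTO v_neighbours
    FROM transaction_master
   WHERE creditcard_id = 1507
     AND ABS(amount - v_new_amount) <= v_eps;

  IF v_neighbours >= v_minpts THEN
    DBMS_OUTPUT.PUT_LINE('Amount lies in a dense region of past amounts (cluster condition met)');
  ELSE
    DBMS_OUTPUT.PUT_LINE('Amount is a potential outlier for this card');
  END IF;
END;
/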
The TRSGM is based on the following linear equation, which generates a risk score indicating how far or close the current transaction is from the normal profile of the customer. If the generated risk score is close to 0, the transaction is considered to closely match the customer's normal profile. If the risk score is greater than 0.5 or close to 1, it is considered a heavy deviation from the customer's normal profile.
Risk score = (1 − threshold) * Σ (i = 1 to n) (Pi * Wi) (7.9)
where
threshold = 0.5
Pi = parameter generated by TPGT
Wi = weightage of the parameter, which is given as input to Algorithm 7.1; the weightage is expressed as a percentage.
7.2.2.1 Parameters

Sr No  Parameter                                              Weightage
1      Location from which the product is ordered             W1 %
2      Amount of the transaction                              W2 %
3      Number of the transactions                             W3 %
4      Category of the purchase                               W4 %
5      Time frame during which the product is ordered         W5 %
6      Seller or vendor from whom the product is purchased    W6 %
7      Same product purchased within a short time             W7 %
8      Time passed since the last transaction                 W8 %
9      Late night transaction                                 W9 %
10     Overseas transaction                                   W10 %
f(x) = 1 / (1 + e^(−x)) (7.10)
where e is the base of natural logarithms, approximately 2.718282.
This function is used when the value of a parameter cannot be expressed as a percentage, as it maps the computed value into the range [0, 1].
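A minimal sketch of how equations (7.9) and (7.10) combine into a single score is given below. The parameter values, the weightages and the normalisation of the weightages to fractions are illustrative assumptions, not the weightages actually recommended by the credit card company.

DECLARE
  TYPE num_tab IS TABLE OF NUMBER INDEX BY PLS_INTEGER;
  p           num_tab;                      -- parameter scores, each in [0, 1]
  w           num_tab;                      -- weightages, in percent
  v_threshold CONSTANT NUMBER := 0.5;
  v_risk      NUMBER := 0;

  -- sigmoid of equation (7.10): maps any computed value into (0, 1)
  FUNCTION sigmoid(x NUMBER) RETURN NUMBER IS
  BEGIN
    RETURN 1 / (1 + EXP(-x));
  END;
BEGIN
  p(1) := 0.8;           w(1) := 30;        -- e.g. location-based score, weightage W1 = 30 %
  p(2) := sigmoid(1.2);  w(2) := 20;        -- e.g. amount-deviation score mapped by the sigmoid

  FOR i IN 1 .. p.COUNT LOOP
    v_risk := v_risk + p(i) * (w(i) / 100); -- Pi * Wi, with Wi taken as a fraction here
  END LOOP;

  v_risk := (1 - v_threshold) * v_risk;     -- equation (7.9)
  DBMS_OUTPUT.PUT_LINE('Risk score = ' || TO_CHAR(ROUND(v_risk, 4)));
END;
/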
The weightages of the different parameters have been derived and implemented using artificial intelligence. However, the application is not tied to these weightages; they are made dynamic and can be changed if any credit card company wishes to do so. It is also observed that within a particular month or period fraudsters become very active and fraudulent transactions increase drastically. The dynamic weightage is therefore useful, because we can give more weightage to any sensitive parameter when there is a fear of fraudsters in a particular time period.
7.2.3 Rules
Various guidelines are given on several websites and in print and electronic media as indications of fraudulent transactions. These guidelines are implemented as rules in the TRSGM.
is also considered as sensitive, it is monitored by the TRSGM also and risk score
is also generated according to the duration of time since the last transaction
performed.
• Generally, a customer does not purchase a costly, luxury product again within a short time. So the TRSGM raises an alarm by generating a risk score if a similar event occurs on the same card.
• An overseas transaction is also considered highly sensitive by the TRSGM if no overseas transaction has been performed on the same card in the past.
HTD is the transaction repository component of the proposed TRSGM, which is stored in
the data warehouse. The expected behavior of a fraudster is to maximize his benefit from
a stolen card. This can be achieved by carrying out high value transactions frequently.
However, to avoid detection, the fraudsters can make either high value purchases at
longer time gaps or smaller value purchases at shorter time gaps. Contrary to such usual
behavior, a fraudster may also carry out low value purchases at longer time gaps. This
would be difficult for the TRSGM to detect if it resembles the genuine cardholder’s
profile. However, in such cases, the total loss incurred by the credit card company will
also be quite low.
To capture the frequency of card use, we consider the time gap between successive
transactions on the same card. The transaction gap is divided into seven mutually
exclusive and exhaustive events – E1, E2, E3, E4, E5, E6 and E7. Occurrence of each event
depends on the time since the last purchase (transaction gap) on any particular card. All the
events have already been defined according to equations (7.1) to (7.7).
The Event E is the union of all the seven events E1, E2, E3, E4, E5, E6 and E7 such that:
P(E) = Σ (i = 1 to 7) P(Ei) = 1 (7.21)
Now compute P(Ei|f) and P(Ei|¬f) from the normal transaction set of that card holder and the generic fraud transaction set. P(Ei|f) measures the probability of occurrence of Ei given that the transaction originates from a fraudster, and P(Ei|¬f) measures the probability of occurrence of Ei given that it is genuine. These likelihood functions are combined as follows:
P(Ei) = P(Ei|f) * P(f) + P(Ei|¬f) * P(¬f) (7.24)
P(f|Ei) = [P(Ei|f) * P(f)] / P(Ei) (7.25)
P(f|Ei) = [P(Ei|f) * P(f)] / [P(Ei|f) * P(f) + P(Ei|¬f) * P(¬f)] (7.26)
We use Bayesian learning once the transaction is found suspicious, in the light of the new evidence Ei. Ψ is the probability that the current transaction is fraudulent. The credit card fraud detection problem has the following two hypotheses: f: fraud and ¬f: not fraud. By substituting the values obtained from equations (7.22) and (7.23) in (7.26), the posterior probability for the hypothesis f: fraud is given as:
P(fraud|Ei) = [P(Ei|fraud) * P(fraud)] / [P(Ei|fraud) * P(fraud) + P(Ei|¬fraud) * P(¬fraud)] (7.27)
P(¬fraud|Ei) = [P(Ei|¬fraud) * P(¬fraud)] / [P(Ei|¬fraud) * P(¬fraud) + P(Ei|fraud) * P(fraud)] (7.28)
Depending on which of the two posterior values is greater, future actions are decided by
the TRSGM.
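A minimal sketch of this posterior update is given below; the prior P(fraud) and the likelihoods P(Ei|fraud) and P(Ei|¬fraud), which in the TRSGM are estimated from the generic fraud transaction set and the card holder's normal transaction set, are illustrative values here.

DECLARE
  p_fraud      NUMBER := 0.1;    -- illustrative prior probability of fraud
  p_e_fraud    NUMBER := 0.60;   -- illustrative P(Ei | fraud)
  p_e_genuine  NUMBER := 0.15;   -- illustrative P(Ei | not fraud)
  post_fraud   NUMBER;
  post_genuine NUMBER;
BEGIN
  -- equation (7.27); equation (7.28) is simply its complement
  post_fraud   := (p_e_fraud * p_fraud) /
                  (p_e_fraud * p_fraud + p_e_genuine * (1 - p_fraud));
  post_genuine := 1 - post_fraud;

  IF post_fraud > post_genuine THEN
    DBMS_OUTPUT.PUT_LINE('Declared fraudulent, posterior = ' || ROUND(post_fraud, 4));
  ELSE
    DBMS_OUTPUT.PUT_LINE('Treated as genuine, posterior = ' || ROUND(post_genuine, 4));
  END IF;
END;
/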
7.3 ALGORITHM
The working principle of the proposed TRSGM is presented in Algorithm 7.1. It takes the transaction parameters - card id, transaction amount, product, product category, shipping address, location id from where the transaction is performed and transaction day type (working day or normal day) - as well as the design parameters ε, MinPts and Wi (weightage of the parameter Pi) as input.
An incoming transaction is first checked for an address mismatch. If the shipping address and billing address are found to be the same, then the transaction is considered genuine, it is approved, and no other check is performed. Otherwise, the incoming transaction amount is checked against the clusters formed by the DBSCAN algorithm for its coverage. If the coverage is found to be more than 10%, then the transaction is considered genuine, it is approved, and no other check is performed on the transaction. Then the linear equation over the patterns generated by TPGT, along with their weightages (Wi), generates a risk score for the transaction. If the risk score < 0.5, the transaction is considered genuine and is approved. On the other hand, if the risk score > 0.8, then the transaction is declared fraudulent and manual confirmation is made with the cardholder. In case 0.5 ≤ risk score ≤ 0.8, the transaction is allowed but the card Ck is labeled as suspicious. If this is the first suspicious transaction on this card, the field suspect_count is incremented to 1 for this card number in a suspect table. The TRSGM then waits until the next transaction occurs on the same card number.
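The threshold logic just described can be summarised in a few lines of PL/SQL; the variable names and the example score are illustrative.

DECLARE
  v_risk_score NUMBER := 0.65;     -- illustrative score from the first four components
  v_status     VARCHAR2(12);
BEGIN
  IF v_risk_score < 0.5 THEN
    v_status := 'GENUINE';         -- approve; no further check is performed
  ELSIF v_risk_score > 0.8 THEN
    v_status := 'FRAUDULENT';      -- manual confirmation is made with the cardholder
  ELSE
    v_status := 'SUSPICIOUS';      -- allow, label the card and increment suspect_count
  END IF;
  DBMS_OUTPUT.PUT_LINE('Transaction status: ' || v_status);
END;
/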
When the next transaction occurs on the same card Ck, it is also passed to the TRSGM. The first four components of the TRSGM again generate a risk score for the transaction. In case the transaction is found to be suspicious, the following events take place. Since each transaction is time stamped, from the time gap g between the current and the last transaction the TRSGM determines which event Ei has occurred out of the seven and retrieves the corresponding P(Ei|f) and P(Ei|¬f). The posterior probabilities P(f|Ei) and P(¬f|Ei) are next computed using Eqs. (7.14) and (7.15). If P(f|Ei) > P(¬f|Ei)
ALGORITHM 7.1:
Input: Ck, Tamount(i), Saddr, Location, ε, MinPts, categoryi, producti, selleri, day_type, Wi,
no_of_products // Number of products the customer has ordered online
Tamount_daily ; // It stores total amount of current day purchase and update table
customer_dailycount accordingly
Ttotal_daily ; // It stores total number of current day transactions and update table
customer_dailycount accordingly
Tamount_weekly; // It stores total amount of current week purchase and update table
customer_weeklycount accordingly
Ttotal_weekly ; // It stores total number of current week transactions and update table
customer_weeklycount accordingly
Tamount_fortnightly ; // It stores total amount of current fortnight transactions and update table
customer_fortnightlycount accordingly
Ttotal_fortnightly ; // It stores total number of current fortnight transactions and update table
customer_fortnightlycount accordingly
Tamount_monthly; // It stores total amount of current month transactions and update table
customer_monthlycount accordingly
Ttotal_monthly ; // It stores total number of current month transactions and update table
customer_monthlycount accordingly
Tamount_sunday ; // It stores total amount of current day(if Sunday) purchase and update
table customer_sundaycount accordingly
Ttotal_sunday ; // It stores total number of current day(if Sunday) transactions and update
table customer_dailycount accordingly
Tamount_holiday ; // It stores total amount of current day(if holiday) purchase and update
table customer_dailycount accordingly
Ttotal_holiday ; // It stores total number of current day(if holiday) transactions and update
table customer_dailycount accordingly
Ψ=0
trans_amount = 0
i=1
Tamount_daily=Tamount_daily + Tamount;
Ttotal_daily=Ttotal_daily + 1;
Update_customer_daily_count_table(Tamount_daily, Ttotal_daily );
End if; // End of current day
risk_score Ψ = generate_and_update_risk_score_8 (DP);//DP: Daily
Parameters
// At the end of day, trigger is automatically executed and update
Table customer_dailycount(Tamount_daily=0, Ttotal_daily=0 )
Tamount_weekly=Tamount_weekly + Tamount;
Ttotal_weekly=Ttotal_weekly + 1;
Update_customer_weekly_count_table(Tamount_weekly, Ttotal_weekly );
End if; // End of current week
risk_score Ψ = generate_and_update_risk_score_9 (WP);
//WP: Weekly Parameters
// At the end of week, trigger is automatically executed and update
table customer_weeklycount(Tamount_weekly=0, Ttotal_weekly=0 )
Tamount_fortnightly=Tamount_fortnightly + Tamount;
Ttotal_fortnightly=Ttotal_fortnightly + 1;
Update_customer_fortnightlycount_table(Tamount_fortnightly, Ttotal_fortnightly );
End if; // End of current fortnight
risk_score Ψ = generate_and_update_risk_score_10 (FP);
//FP: Fortnightly Parameters
// At the end of fortnight, trigger is automatically executed and update
table customer_fortnightlycount(Tamount_fortnightly=0, Ttotal_fortnightly=0 )
Tamount_monthly=Tamount_monthly + Tamount;
Ttotal_monthly=Ttotal_monthly + 1;
Update_customer_monthly_count_table(Tamount_monthly, Ttotal_monthly );
End if; // End of current month
risk_score Ψ = generate_and_update_risk_score_11 (MP);
//MP: Monthly Parameters
// At the end of month, trigger is automatically executed and update
Table customer_monthlycount(Tamount_monthly=0, Ttotal_monthly=0 )
Variable Meaning
Ck Current online transaction is performed on a card Ck
Tamount Purchase amount of current online transaction of each product
Trans_amount Total purchase amount of all the products of current online transaction
Ttotal_Sunday It stores total number of current day(if Sunday) transactions till the
current day is completed
Tamount_holiday It stores total amount of current day(if holiday) purchase till the
current day is completed
Ttotal_holiday It stores total number of current day(if holiday) transactions till the
current day is completed
Ψ Risk score generated by the model
Clusteri It indicates the particular cluster formed by DBSCAN Algorithm
g Transaction gap e.g. Number of hours since the last transaction on the
same card
E The model finds the event based on equations (7.1) to (7.7)
Ef Probability of event E coming from the fraudulent transaction set
E¬f Probability of event E coming from the normal transaction set
Posterior_f Posterior probability of event E that the transaction is fraudulent
Posterior_¬f Posterior probability of event E that the transaction is genuine
Suspect_count If the current transaction is found suspicious, then the value of
suspect_count is incremented to 1 and system waits for the next
transaction.
Wi Weightage of the parameter
• First, the algorithm checks the shipping address entered by the customer against the billing address given by the customer while performing the online transaction. If both are the same, it considers the transaction highly genuine and generates a risk score of 0.
• If the shipping address is different from the billing address, then the algorithm checks the parameter AP1 generated by TPGT to see whether past transactions have been successfully performed with the same shipping address. If products have been successfully shipped to the current shipping address, then it also considers the transaction highly genuine and generates a risk score of 0.
the transactions are found genuine then they are stored in the data warehouse, so
the next parameters are generated accordingly by TPGT.
• The block diagram of the proposed financial cyber crime detection system is shown in Figure 7.1, which is a brief pictorial representation of Algorithm 7.1.
Figure 7.1 Block Diagram of Proposed Financial Cyber Crime Detection System
(The current transaction and the patterns are fed to the TRSGM; based on the risk score (< 0.8 or not) the transaction is routed to genuine, suspicious or fraudulent, and for a suspicious transaction the comparison of the posteriors Po_f > Po_¬f decides the final outcome.)
Here we have generated scatter graphs of the different clusters formed by the DBSCAN algorithm on the transaction amount attribute for various customers. In all the examples, ε = 500 and MinPts = 5 were taken.
Figure 7.2 Graph of clusters formed by DBSCAN algorithm for Card id=1 (x-axis: Transaction Amount, y-axis: Cluster Number)
Figure 7.3 Graph of clusters formed by DBSCAN algorithm for Card id=5 (x-axis: Transaction Amount, y-axis: Cluster Number)
Figure 7.4 Graph of clusters formed by DBSCAN algorithm for Card id=100 (x-axis: Transaction Amount, y-axis: Cluster Number)
Figure 7.5 Graph of clusters formed by DBSCAN algorithm for Card id=1507 (x-axis: Transaction Amount, y-axis: Cluster Number)
Here a result is shown of the clusters formed by the DBSCAN algorithm, implemented in the data mining application, for the various transaction amounts spent by the customer having card id 1507.
The implementation of the FCDS has been done in Oracle 9i. The data warehouse is designed and implemented in Oracle 9i and consists of a number of tables, as shown in Chapter 6; descriptions of all the tables are also given in the same chapter. Lookup tables are designed to store the current spending behavior of the customer. The current online transaction is given as input to the FCDS. The linear equation, along with the rules implemented in the TRSGM, generates a risk score for this transaction.
Stored procedures, functions, packages and triggers were written to facilitate the functioning of the setup. These were used to check the deviation of each transaction from the customer's normal profile.
The following trigger is automatically executed when logging into the system and
updates all the lookup tables according to their specified time duration.
begin
select logon_day into previousday from user_log_master;
commit;
As discussed in Chapter 6, TPGT generates the parameters GP1 to GP7 for the inter-transaction gap (the time duration between every two successive transactions on the same card). For this, the following procedure time_previous_transaction() is implemented in the data mining application.
/* This procedure finds the time difference in days, hours, minutes and seconds between
each two successive transactions. */
PROCEDURE time_previous_transaction(a_array1 IN tpg_date_array,time_diff out
tpg_array,days out tpg_array,hrs out tpg_array,mins out tpg_array,secs out tpg_array) is
hrs_frac number(12,6);
mins_frac number(12,6);
secs_frac number(12,6);
hrs_int number(12,6);
mins_int number(12,6);
secs_int number(12,6);
hrs_full number(12,6);
mins_full number(10,2);
secs_full number(12,6);
index_time number(7):=1;
BEGIN
for i in 2..a_array1.LAST
LOOP
select (a_array1(i) - a_array1(i-1)) into time_diff(index_time) from
dual;
SELECT floor(((a_array1(i)-a_array1(i-1))*24*60*60)/3600),
floor((((a_array1(i)-a_array1(i-1))*24*60*60) -
floor(((a_array1(i)-a_array1(i-1))*24*60*60)/3600)*3600)/60),
round((((a_array1(i)-a_array1(i-1))*24*60*60) -
floor(((a_array1(i)-a_array1(i-1))*24*60*60)/3600)*3600 -
(floor((((a_array1(i)-a_array1(i-1))*24*60*60) -
floor(((a_array1(i)-a_array1(i-1))*24*60*60)/3600)*3600)/60)*60) ))
into hrs_frac,mins_frac,secs_frac
FROM dual;
days(index_time):=floor(hrs_frac/24);             -- whole days in the gap
hrs_int:=hrs_frac/24-floor(hrs_frac/24);
hrs_full:=floor(hrs_int*24);                      -- remaining whole hours
mins_full:=floor((hrs_full - floor(hrs_full))*60);
hrs(index_time):=hrs_full;
secs_full:=floor((mins_full - floor(mins_full))*60);
mins(index_time) := mins_full + mins_frac;
secs(index_time):=secs_full + secs_frac;
index_time := index_time + 1;                     -- move to the next transaction gap
END LOOP;
END;
To find the maximum value from a given array, the find_maximum() function is implemented in the data mining application. This function is called several times to find the maximum amount of a transaction and the maximum number of transactions.
/* This function finds the maximum value from the given array. */
FUNCTION find_maximum(a_array1 IN tpg_array) RETURN number IS
max_value number(12,2);
BEGIN
max_value:=0;
for m in a_array1.FIRST .. a_array1.LAST
loop
if a_array1(m) > max_value then
max_value:=a_array1(m);
end if;
end loop;
return max_value;
END;
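A short illustrative call of this function is given below. It assumes that tpg_array, which is declared elsewhere in the application, is a dense numeric associative array that can be filled by direct index assignment and that find_maximum() is callable in this scope (if it lives inside a package, the package name must prefix the call; if tpg_array is a nested table, it would first have to be initialised and extended).

DECLARE
  amounts tpg_array;
BEGIN
  amounts(1) := 1200;
  amounts(2) := 450;
  amounts(3) := 7800;
  DBMS_OUTPUT.PUT_LINE('Maximum transaction amount = ' || find_maximum(amounts));
END;
/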
In this way, several procedures, functions, triggers and packages are implemented in the data mining application.
Figure 7.18 Sample output of Data Mining Application for Genuine Transaction - I
Figure 7.19 Sample output of Data Mining Application for Genuine Transaction - II
Figure 7.20 Sample output of Data Mining Application for Genuine Transaction -III
Figure 7.21 Sample output of Data Mining Application for Fraudulent Transaction - I
Figure 7.22 Sample output of Data Mining Application for Fraudulent Transaction - II
Figure 7.23 Sample output of Data Mining Application for Fraudulent Transaction - III
Here a sample of a suspicious transaction is shown along with the probability of the transaction being genuine or fraudulent. A snapshot of the table suspect is also shown, where the field suspect_count is incremented.
Figure 7.24 Sample output of Data Mining Application for Suspicious Transaction - I
Figure 7.25 Sample output of Data Mining Application for Suspicious Transaction - II
Figure 7.26 Sample output of Data Mining Application for Suspicious Transaction - III
Figure 7.27 Sample output of Data Mining Application for Multiple Order Product Support - I
Figure 7.28 Sample output of Data Mining Application for Multiple Order Product Support - II
Figure 7.29 Sample output of Data Mining Application for Multiple Order Product Support - III
• The most interesting result of the TRSGM is that the risk score it generates is very dynamic: if the customer makes a purchase with a very minor change in the transaction amount while all other inputs are kept the same, the risk score generated is still different. This minor change is reflected in the risk score. We have run the application several times for different transaction amounts with slight variations, keeping all other inputs fixed. Before taking the result for the second time and onwards, we also reset all the lookup tables. Here is an example.
Table 7.2 Sample output of the application for different transaction amounts
Figure 7.30 Sample output of Data Mining Application for different transaction amounts - I
Figure 7.31 Sample output of Data Mining Application for different transaction amounts - II
Figure 7.32 Sample output of Data Mining Application for different transaction amounts - III
Figure 7.33 Sample output of Data Mining Application for different transaction amounts - IV
In the same way, we have changed the seller while keeping the same product, category, amount, shipping address and location. It is observed that this change is also reflected in the risk score. Here is an example.
Figure 7.34 Sample output of Data Mining Application for different sellers - I
Figure 7.35 Sample output of Data Mining Application for different sellers - II
Figure 7.36 Sample output of Data Mining Application for different sellers - III
Figure 7.37 Sample output of Data Mining Application for different sellers - IV
We have also checked that if the customer purchases the same product, category, amount, seller and shipping address from a different location, then this change is reflected in the risk score. Here is an example.
Figure 7.38 Sample output of Data Mining Application for different locations - I
Figure 7.39 Sample output of Data Mining Application for different locations-II
Figure 7.40 Sample output of Data Mining Application for different locations - III
Figure 7.41 Sample output of Data Mining Application for different locations - IV
• The application finds the cluster coverage of each new incoming transaction amount, and if it is greater than 10% the model assumes that it is a genuine transaction, treating it as a regular payment of the customer. So the application generates a risk score of 0 for the transaction. Here is an example.
Figure 7.42 Sample output of Data Mining Application for Cluster Coverage
• The author has extensively run the application and checked that a transaction which closely matches the customer's purchasing habits (i.e. maximum purchases in this category, maximum number of transactions in this time frame, maximum number of transactions ordered from the same location, etc.) generates the least risk score. A transaction which does not fall within the customer's purchasing habits and deviates more from the normal profile generates a higher risk score. Here is an example; as more and more transactions are performed within this particular set, the risk score decreases further.
The customer having card id 1570 has his maximum purchasing activity in the following fields:
Category :2
Time frame : 18:01 to 21:00
Location Id : 205
Seller Id : 257
• In the domain of credit card fraud detection, the system should not raise too many false alarms (i.e. genuine transactions should not be caught as fraudulent), because a credit card company needs to minimize its losses but, at the same time, does not wish the cardholder to feel restricted too often. In the same way, fraudulent transactions should not go undetected. Considering both of these concerns, the model is designed to be flexible. Here we have taken the upper threshold value as 0.8, but with more learning it can be changed. All the parameters' weightages are also set according to the recommendation of the credit card company.
• There is one interesting result from Bayesian learning. The customer having card id 8 first performs a transaction of 17000, which is considered suspicious. After a short while he performs another transaction of 13500, which is predicted as fraudulent by Bayesian learning. Once a transaction is found suspicious, the time duration since the last transaction is also stored in the table suspect. If we consider both transactions individually they seem to be normal, but it is the power of Bayesian learning that the occurrence of the subsequent transaction so soon after the first is predicted as fraudulent. Here is an example.
Figure 7.47 Sample output of Data Mining Application for Bayesian Learning-II
7.8 REFERENCES
CHAPTER 8: PROPOSED FINANCIAL CYBER CRIME PREVENTION MODEL & CONCLUSION
As we discussed in Chapter 1, different methods such as First Virtual, CyberCash and SET are used for financial cyber crime prevention. These systems are highly secure but are rarely used by customers and merchants. These models secure our transactions over the internet but cannot stop any forgery if the credit card information is lost physically or when the customer gives his information into the wrong hands.
Anshul Jain et al. [1] have given an Internet Virtual Credit Card Model. In this model, a login id and password are given by the bank. Then, after logging into the bank's website, a virtual credit card number and the expiry date of this virtual credit card are issued by the bank. So the customer has to provide and remember four details - login id, password, virtual credit card number and the expiry date of this virtual card - while performing an online transaction. In my opinion, this creates overhead for the customer and an extra burden of remembering these additional details.
Recently in India, the Reserve Bank of India has mandated all banks to issue a separate password to their credit card holders for online transactions. In other countries this tactic is already being used. In my opinion, this tactic is not enough to prevent fraud: the first transaction is highly secure, but the subsequent transactions we cannot
surely consider as highly secure, because while the customer performs the first transaction, the password can be stolen by a fraudster by hacking the computer or by other tactics. Also, the card holder is not given any control or flexibility at his own end to prevent the fraud.
Considering all the limitations of the above models, the following financial cyber crime prevention model has been proposed.
In this model, not only is a separate password for online transactions given to the credit card holder, but the validity of this password is also set by the card holder according to his choice. The customer has to log into the bank's website. There he can set his password along with the expiry date of this password for online transactions. Whenever the customer performs an online transaction, he requires the password to complete the transaction. The model checks the validity of the password; if the password has expired, he is not able to complete the transaction. If he is the genuine card holder, he then has to log into the bank's website and set a new password and an expiry date for this password.
So in this model the password remains valid only until its expiry date. When the password expires, the customer has to obtain a password along with its validity from the bank again. The expiry date selected by the customer must lie between the present date and the actual expiry date of the card.
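A minimal sketch of the validity check at the heart of this model is shown below; the table online_password_master and its columns are hypothetical names introduced only for illustration, and in practice the password would of course be stored in hashed form rather than in plain text.

DECLARE
  v_valid NUMBER;
BEGIN
  -- the transaction may proceed only if the self-chosen password matches
  -- and its customer-selected expiry date has not yet passed
  SELECT COUNT(*)
    INTO v_valid
    FROM online_password_master
   WHERE creditcard_id        = 1507                 -- illustrative card id
     AND online_password      = 'user_chosen_word'   -- password entered at checkout
     AND password_expiry_date >= SYSDATE;

  IF v_valid = 1 THEN
    DBMS_OUTPUT.PUT_LINE('Password valid - transaction may proceed');
  ELSE
    DBMS_OUTPUT.PUT_LINE('Password wrong or expired - transaction refused');
  END IF;
END;
/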
So here control and flexibility are given to the customer at his own end. He can set the expiry date of the password according to his convenience, keeping in mind the avoidance of forgery. Compared with the Internet Virtual Credit Card Model, the customer cannot easily remember the virtual credit card number, as it is long and issued by the bank, so he has to store it somewhere. In our model, the password is chosen by the user, so it is a user-defined word that he can easily keep in mind. Customers who transact very often could set the expiry date of their password very short, in order
to avoid forgery. He can also set the expiry date of the password such that it expires on the next day. In this way the customer can himself make each of his online transactions very safe. Customers who do not transact very often, or who consider this an overhead, can select a long expiry date. Whenever financial cyber crime increases drastically in a particular month, the user can set a shorter expiry date to avoid forgery.
Thus the benefit of this model is that the user can temporarily suspend his credit or debit card by setting a short expiry date whenever he fears that his information may have been stolen or when a period comes in which cyber crime cases increase drastically. Then no one can use his or her credit card information for online purchases.
allow any parameter to increase its share in the final risk score as it maps the
value in the range [0, 1].
• Flexibility: In consultation with the bank, the weightages of the different parameters have been derived and implemented in the software. But the software is not tied to these weightages only; it is flexible, and we can change the weightage of any parameter according to the recommendations of the credit card company.
• The developed data mining application is intended only for those customers who make credit card purchases frequently. It is not for those who transact once or very rarely in a year. The model has to learn all the purchasing habits of the customer so that it can predict properly for a new incoming transaction. As more and
more transactions are performed by the customer, the model becomes stronger,
learns the customer behavior and predicts the transaction more accurately.
• The application is also not used for a new customer, for the same reason as above.
• Though the application is global and implemented keeping all countries in view, the holiday parameter is not the same for all countries. So the application requires minor changes to accommodate this difference.
• In the current work, the location from which the customer performs the online transaction is considered. The computer on which the online transaction is performed is not taken into account, but in future work the IP address can also be considered and patterns can be generated for this IP address. The only problem is that an IP address is not static but dynamic, so care should be taken when considering it as a parameter.
• It may be worthwhile to generate more parameters to closely match the
customer’s purchasing habits.
• More dynamic rules can be derived from the historical data and applied for the
initial belief.
• Full care has been taken to ensure that the research is designed and conducted to achieve the research objectives. This is really a thrilling domain, in which one cannot stop; it requires constant refreshing to incorporate the dynamic changes that occur in real problems.
• Though the data mining algorithm DBSCAN is implemented only for the transaction amount, it can be implemented for other attributes as well.
8.6 REFERENCES