
International Journal of Applied Engineering Research

ISSN 0973-4562 Volume 9, Number 24 (2014) pp. 30795-30812


© Research India Publications
http://www.ripublication.com

Anomaly Detection Via Eliminating Data Redundancy and Rectifying Duplication in Uncertain Data Streams

M. Nalini 1* and S. Anbu 2

1* Research Scholar, Department of Computer Science and Engineering,
St. Peter's University, Avadi, Chennai, India, email: nalinicseme@gmail.com
2 Professor, Department of Computer Science and Engineering, St. Peter's College of
Engineering and Technology, Avadi, Chennai, India, email: anbuss16@gmail.com

Abstract

Anomaly detection is an important problem in emerging fields such as data
warehousing and data mining. Various generic anomaly detection techniques have
already been developed for common applications. Most real-time systems that rely
on consistent data to offer high-quality services are affected by record duplication,
quasi-replicas, or partially erroneous data. Government and private organizations
alike are developing procedures for eliminating replicas from their data repositories.
To maintain data quality, databases and database-related applications need data that
is error-free, replica-free, and de-duplicated, which improves the accuracy of the
queries passed to them. In this paper, we propose a Particle Swarm Optimization
(PSO) approach to record de-duplication that combines several pieces of evidence
extracted from the data content to identify a de-duplication method capable of
deciding whether two entries in a data repository are replicas or not. In our
experiments, the approach outperforms the state-of-the-art methods found in the
literature. The experimental results show that the proposed approach, implemented
in the DOTNET framework (2010), is more efficient than the existing approaches.

Keywords: Data Duplication, Error Correction, DBMS, Data Mining, TPR, FNR.

INTRODUCTION
Anomaly detection refers to the problem of finding patterns in data that do not
conform to expected behavior. These non-conforming patterns are often referred to
as anomalies, outliers, discordant observations, exceptions, aberrations, surprises,
peculiarities, or contaminants in different application domains. Of these, anomalies
and outliers are the two terms used most commonly in the context of anomaly
detection, and sometimes interchangeably. Anomaly detection finds extensive use in
a wide variety of applications, such as fraud detection for credit cards, insurance, or
health care; intrusion detection for cyber-security; fault detection in safety-critical
systems; and military surveillance for enemy activities.
The importance of anomaly detection stems from the fact that anomalies in data
translate into significant (and often critical) actionable information in a wide
variety of application domains. For example, an anomalous traffic pattern in a
computer network could mean that a hacked computer is sending out sensitive data to
an unauthorized destination. An anomalous MRI image may indicate the presence of
malignant tumors. Anomalies in credit card transaction data could indicate credit card
or identity theft, and anomalous readings from a spacecraft sensor could signify a fault
in some component of the spacecraft. Detecting outliers or anomalies in data has
been studied in the statistics community since as early as the 19th century. Over time,
a variety of anomaly detection techniques have been developed in several research
communities. Many of these techniques have been developed specifically for certain
application domains, while others are more generic.
In this paper, the data taken into account are uncertain data streams, and the size of
the data is also very large. The PSO approach is applied here to find and count
duplicates: it combines several pieces of evidence extracted from the data content to
produce a de-duplication method. This approach is able to identify whether any two
entries in a repository are the same or not. In the same manner, more pieces of the
data are combined, taken as evidence, and compared against the whole data used as
training data. This function is applied repeatedly over the whole data set or the
repositories. Newly inserted data can also be compared in the same manner, using the
evidence, to avoid replicas. A method applied to record de-duplication must
accomplish distinct but potentially conflicting objectives; in particular, it should
effectively increase the identification of replicated records. GP [15] is chosen as the
baseline approach, since it is suitable for finding accurate answers to a given problem
without searching the whole data space. For the record de-duplication problem, the
existing approaches [14, 16] apply genetic programming to provide good solutions.
      In this paper, the results of the existing system in [16] are taken for comparison
with our PSO-based approach, which is able to automatically find more effective
de-duplication methods. Moreover, the PSO-based approach can interoperate with the
best existing de-duplication methods by adapting the replica identification limits used
to classify a pair of records as a match or not. In our experiments, the real-time data
sets contain scientific article citations and hotel index records. In addition,
synthetically generated data sets are used to provide a controlled experimental
environment. In all cases, our approach can be applied to every scenario considered.

On the whole, the contribution of this paper, a PSO-based approach to find and
count duplicates, is as follows:
 A solution with lower computational time for duplicate detection.
 Reduced individual comparisons, using the PSO approach to find the
similarity values.
 Identification of replicas by computing TPR and FPR among the data.
 Rectification of errors in the data entries.

RELATED WORKS
In [3] the authors proposed an approach to data reduction. Data reduction
functions are essential to machine learning and data mining, and an agent-based
population algorithm is used to solve the data reduction problem. Data reduction
alone, however, is not sufficient for improving the quality of databases. Databases of
various sizes are used to provide high-quality classification among the data in order
to find anomalies. In [4], two classes of algorithms, evolutionary and
non-evolutionary, are applied and their results are compared to find the algorithm
best suited to anomaly detection. N-ary relations are computed to define the patterns
in the data set [5], where they provide relations in one-dimensional data. DBLEARN
and DBDISCOVER [6] are two systems developed to analyze relational DBMSs. The
main objective of the data mining technique in [7] is to detect and classify data in a
huge database without compromising the speed of the process; PCA is used for data
reduction and SVM for data classification. In [8] a data redundancy method is
explored using a mathematical representation. Software with safe, correct, and
reliable operations has been developed for avionics and automobile database systems
[9]. A statistical QA (Question Answer) model is applied to develop a prototype that
avoids web-based data redundancy [10]. For GDW (Geographic Data Warehouses)
[11], SOLAP (Spatial On-Line Analytical Processing) is applied to the GiST database
and other spatial databases for analysis, indexing, and generating various reports
without error. In [12], an effective method was proposed for P2P data sharing, in
which data duplication is removed during sharing. Web entity data extraction
associated with the attributes of the data [13] can be obtained using a novel approach
that exploits duplicated attribute-value pairs.
de Carvalho et al. [1] used genetic programming to mark duplication and perform
de-duplication in the data, concentrating mainly on identifying whether entries in a
repository are replicas. Their approach outperformed earlier approaches, providing
6.2% higher accuracy for the two data sets found in [2]. Our proposed approach can
be extended to various benchmark data sets with real-time data such as time series
data, clinical data, the 20 Newsgroups collection, etc.

PARTICLE SWARM OPTIMIZATION GENERAL CONCEPTS


Virtually all living things are influenced by the natural selection process, and this
idea has inspired evolutionary programming approaches. Particle swarm optimization
(PSO) is one of the best-known evolutionary computation techniques. It is considered
a heuristic approach and was initially applied to optimizing data properties and
availability. PSO can also be applied to multi-objective problems constrained by the
environment. PSO and the other evolutionary approaches are widely known and
applied to a variety of applications because of their good performance when
searching over a large data space. Instead of processing a single point in the search
space of the problem, PSO maintains a population of individuals. This behavior is the
essential aspect of the PSO approach: it creates additional new solutions with new
combined features and moves them forward, compared with the existing solutions in
the search space.

PARTICLE OPERATIONS
PSO generates random particles representing individuals. In this paper, the
individuals are modeled as trees representing arithmetic functions, as illustrated in
Figure-1. When using this tree representation with the PSO-based approach, the set of
all inputs, variables, constants, and methods must be defined [8]. The nodes
terminating the trees are called leaves. A collection of operators, statements, and
methods is used in the PSO evolutionary process to manipulate the terminal values;
these methods are placed in the internal nodes of the tree shown in Figure-1. In
general, PSO models the social behavior of birds. In order to search for food, every
bird in a flock moves with a velocity based on its personal experience and on the
information collected by interacting with the other birds in the flock. This is the basic
idea of PSO: each particle denotes a bird, and its flight denotes searching the subspace
of the optimization problem for the optimum solution. In PSO, the solutions within an
iteration are called a swarm and are equal in number to the population.

[Figure-1 shows an example expression tree: internal nodes hold operators and leaf nodes hold variables and constants.]

FIGURE-1: Tree Used for Mapping a Function



PROPOSED APPROACH
The proposed approach utilizes the PSO optimization method to find the difference
between entities in each record of a database. This difference indicates the similarity
between two data entities and thus decides whether they are duplicates: if the distance
between two data entities x[i] and y[j] is less than a threshold value ε, then x[i] and
y[j] are declared duplicates. The PSO algorithm applied in this paper is given here:
1. Generate a random population P, representing each individual data entry.
2. Assume a random feasible solution from the particles.
3. For i = 1 to P
4.     Evaluate all particles based on the objective function
5.     The objective function: Pr{dist(x[i], y[j]) ≤ ε} ≥ α
6.     Gbest = particle with the best solution
7.     Compute the velocity of the Gbest particles
8.     Update the current position of the best solution
9. Next i
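To make the loop above concrete, the following is a minimal sketch of a canonical PSO iteration in Python. It is not the paper's implementation (which is in DOTNET); the particle encoding, the maximization sense of the objective, and parameters such as the inertia weight `w` and coefficients `c1`, `c2` are assumptions for illustration, and `objective` is a placeholder for the duplicate-detection fitness referred to in step 5.

```python
import random

# Minimal PSO sketch (assumed encoding: each particle is a real-valued
# vector, and a larger objective value means a better de-duplication fit).
def pso(objective, dim, n_particles=30, n_iter=50, w=0.7, c1=1.5, c2=1.5):
    # Step 1: generate a random population of particles.
    pos = [[random.random() for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_val[i])
    gbest_pos, gbest_val = pbest[g][:], pbest_val[g]   # Step 6: global best

    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                # Steps 7-8: velocity update and position update.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest_pos[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = objective(pos[i])           # Steps 4-5: evaluate the particle.
            if val > pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val > gbest_val:
                    gbest_pos, gbest_val = pos[i][:], val
    return gbest_pos, gbest_val
```

In this setting, `objective` could, for instance, return the number of labelled duplicate pairs whose distance falls below ε on a training sample.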

A database is a rectangular table consisting of a number of records:

DB = {R1, R2, …, Rn}   --- (1)

and each record Ri has a number of entities:

Ri = {Ei1, Ei2, …, Eim}   --- (2)

where Eij is the entity at row i and column j of the data; here i represents the row
and j represents the column. In this paper the threshold value is a user-defined small
value between 0 and 1.

[Fig.1 depicts the proposed pipeline: load data → pre-process the data → divide the data into windows → normalize the data → find the similarity ratio → check for data redundancy and errors → mark redundant data → anomaly detection → persist the data.]

Fig.1: Proposed Approach

The overall functionality of the proposed approach is depicted in Fig.1. The
database may be in any form, such as ORACLE, SQL, MySQL, MS-ACCESS, or
EXCEL.

PREPROCESSING
Consider, as an example, employee data for an MNC company whose branches are
located all over the world. The entire data set is read from the database and checked
for '~', empty spaces, '#', '*', and other irrelevant characters placed as an entity in the
database. [For example, if an entity is numerical data, it should contain only the
digits 0 to 9. If it is a name, it should consist of alphabets combined only with '.',
'_', or '-'.] If irrelevant characters are present in any entity, that entity is treated as
erroneous data and is corrected, removed, or replaced by appropriate characters.
      If the data type of the field is a string, the preprocessing function assigns
"NULL" to the corresponding entity; if the data type of the field is numeric, the
preprocessing function assigns 0's [according to the length of the numeric data type]
to the corresponding entity. Similarly, the preprocessing function replaces the entity
with today's date if the data type is 'date', with '*' if the data type is 'character', and
so on. Once the data are preprocessed, SQL queries return good results; otherwise
errors are generated.
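The replacement rule described above can be sketched as a small per-field function. This is an illustration only; the character classes, the function name `clean_field`, and the treatment of dates are assumptions rather than the paper's exact implementation.

```python
import re
from datetime import date

IRRELEVANT = re.compile(r"[~#*]")            # characters treated as errors above
NUMERIC_ONLY = re.compile(r"^[0-9]+$")       # numeric entities: digits 0-9 only
NAME_ONLY = re.compile(r"^[A-Za-z ._\-]+$")  # names: alphabets plus . _ -

def clean_field(value, dtype):
    """Apply the per-field replacement rule to a single entity (sketch)."""
    value = value.strip()
    if dtype == "numeric":
        return value if NUMERIC_ONLY.match(value) else "0"
    if dtype == "string":
        return value if NAME_ONLY.match(value) else "NULL"
    if dtype == "date":
        # assumed: garbled or empty dates are replaced with today's date
        return value if value and not IRRELEVANT.search(value) else date.today().isoformat()
    return value

# clean_field("##", "numeric") -> "0";  clean_field("~ty", "string") -> "NULL"
```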
For example, in Table-1 below, the first row gives the field names and the remaining
rows contain records. In Table-1, the fourth field of the first record contains the
irrelevant character "~". Similarly, the third field of the second record contains "##"
instead of numbers. This produces an error when the query

Select City from EMP;

is passed to the table EMP [Table-1]. To avoid errors during query processing, the
City and Age fields are corrected by verifying the original data sources. If this is not
possible, then "NULL" is applied to alphanumeric fields and "0" to numeric fields to
replace and correct the error. If a record cannot be corrected, it is marked ['*'] and
moved to a separate pool area.

Table-1: Sample Error Records Pre-Processed and Marked [‘*’] [EMP].

No      Name   Age  City     State  Comment
0001*   Kabir  45   ~ty             Employee
0002*   Ramu   ##   Chennai  TN     Employee

The entire data set can be divided into sub-windows for easy and fast processing. Let
the data set be DB; it can be divided into sub-windows, shown in Fig.2, as DB1 and
DB2. Each of DB1 and DB2 contains a number of windows W1, W2, ….

DATA NORMALIZING
In general, an uncertain data stream is considered for anomaly detection. The main
problem addressed in this paper is anomaly detection for any kind of data stream.
Since the size of the data stream is huge, in our approach the complete data set is
divided into subsets of data streams. A data stream DS is divided into two uncertain
data streams, DS1 and DS2, for our problem, where each data stream consists of a
sequence of continuously arriving uncertain objects at various time intervals, denoted
as

DS1 = {x[1], x[2], …, x[t], …}   --- (3)

DS2 = {y[1], y[2], …, y[t], …}   --- (4)

where x[i] or y[i] is a k-dimensional uncertain object at time interval i, and t is the
current time interval. According to nearest-neighbor grouping, the operator should
retrieve close pairs of objects within a period. Thus a compartment window concept
is adopted for the uncertain stream group (USG) operator. As shown in Fig.2, a USG
operator always considers the most recent cw uncertain objects in the stream, that is

CW(DS1) = {x[t − cw + 1], x[t − cw + 2], …, x[t]}   --- (5)

CW(DS2) = {y[t − cw + 1], y[t − cw + 2], …, y[t]}   --- (6)



at the current time interval t. In other words, when a new uncertain object x[t+1]
(y[t+1]) arrives at the next time interval (t+1), the new object x[t+1] (y[t+1]) is
appended to DS1 (DS2). At that moment the old object x[t−cw+1] (y[t−cw+1])
expires and is ejected from memory. Thus, USG at time interval (t+1) is conducted on
a new compartment window {x[t−cw+2], …, x[t+1]} ({y[t−cw+2], …, y[t+1]}) of
size cw.
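A compartment window of size cw can be maintained with a bounded buffer, so that appending x[t+1] automatically evicts x[t−cw+1]. The sketch below assumes each stream element is an arbitrary Python object; the class name `CompartmentWindow` is illustrative, not from the paper.

```python
from collections import deque

class CompartmentWindow:
    """Keeps only the most recent cw objects of one uncertain stream."""
    def __init__(self, cw):
        self.buf = deque(maxlen=cw)    # bounded buffer of size cw

    def push(self, obj):
        self.buf.append(obj)           # new object at time t+1; oldest one expires
        return list(self.buf)          # current window contents

# cw1, cw2 = CompartmentWindow(50), CompartmentWindow(50)
# for x, y in zip(stream1, stream2):   # stream1/stream2 are assumed iterables
#     window1, window2 = cw1.push(x), cw2.push(y)
```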

[Fig.2 illustrates the two uncertain data streams DS1 and DS2, their compartment windows CW(DS1) and CW(DS2) at time interval t, the new uncertain object arriving at time interval (t+1), the expired object leaving the window, and the USG answers produced from the two windows.]

Fig.2: Data Set Divided as Sub-Windows

To group the uncertain data streams, the two data streams DS1 and DS2, a distance
threshold ε, and a probabilistic threshold α ∈ [0, 1] are given. A group operator on
uncertain data streams continuously monitors pairs of uncertain objects x[i] and y[j]
within the compartment windows CW(DS1) and CW(DS2), respectively, of size cw at
the current time stamp t. Here, the data streams DS1 and DS2 are compared to find
the similarity distance, which can be obtained using PSO, such that

Pr{dist(x[i], y[j]) ≤ ε} ≥ α   --- (7)

holds, where t − cw + 1 ≤ i, j ≤ t, and dist(·,·) is the Euclidean distance function
between two objects. To perform a USG query as in (7), users need to register two
parameters, the distance threshold ε and the probabilistic threshold α, in PSO. Since
each uncertain object at a timestamp consists of R samples, the grouping probability
Pr{dist(x[i], y[j]) ≤ ε} in Inequality (7) can be rewritten via samples as

Pr{dist(x[i], y[j]) ≤ ε} = Σ over sample pairs (x1[i], y2[j]) of { x1[i].p · y2[j].p, if dist(x1[i], y2[j]) ≤ ε; 0 otherwise }   --- (8)

Note that one straightforward method to perform USG directly over compartment
windows is to follow the USG definition: for every object pair <x[i], y[j]> from the
compartment windows CW(DS1) and CW(DS2), respectively, we compute the
grouping probability that x[i] is within distance ε of y[j] (via samples) based on (8).
If the resulting probability is greater than or equal to the probabilistic threshold α,
then the pair <x[i], y[j]> is reported as a USG answer; otherwise it is a false alarm
and can safely be discarded. The number of false alarms is counted using PSO by
repeating the procedure n times and generating particles in the search space for each
individual data item. For comparisons, verification, and other related tasks, the
window-based data makes the task easy and fast for any DBMS. For example, a
database of 1000 records can be divided into 4 sub-datasets of 250 records each.
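Under the simplifying assumption that the R samples of each object are equally weighted (so the sample probabilities in Eq. (8) are all 1/R), the per-pair check can be sketched as follows; `grouping_probability` and `is_usg_answer` are illustrative names, not functions from the paper.

```python
import math

def grouping_probability(x_samples, y_samples, eps):
    """Fraction of sample pairs whose Euclidean distance is <= eps
    (Eq. (8) with uniform sample weights 1/R)."""
    hits = 0
    for xs in x_samples:
        for ys in y_samples:
            if math.dist(xs, ys) <= eps:   # Euclidean distance between two samples
                hits += 1
    return hits / (len(x_samples) * len(y_samples))

def is_usg_answer(x_samples, y_samples, eps, alpha):
    # Report <x[i], y[j]> as a USG answer only when the grouping
    # probability reaches the probabilistic threshold alpha.
    return grouping_probability(x_samples, y_samples, eps) >= alpha
```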
Data in the database can be normalized using any normal form for fast and accurate
query processing. In this paper a user-defined normalization is also applied to
improve efficiency, such as arranging the data in ascending or descending order
according to the SQL query keywords.

PSO BASED SIMILARITY COMPUTATION


This paper applies a PSO-based comparison to decide whether two items are similar
or dissimilar. The underlying measure between two data items in a database is
defined over appropriate features. Since it accounts for unequal variances as well as
the correlation between features, it evaluates the distance adequately by assigning
different weights, or importance factors, to the features of the data entities. In this
way the inconsistency of data can be removed in real-time digital libraries.
Assume two groups G1 and G2 containing data about girls and boys in a school. The
girls are categorized into the same sub-group of G1 since their attributes or
characteristics are the same. This is computed by PSO as

d(G1, G2) = (gi − gj)² ≤ 1   --- (9)

The correlation among data sets is computed using the Similarity-Distance. Data
entities are the main objects of data mining, and they are arranged in order according
to their attributes. A data entity with K attributes is considered a K-dimensional
vector, represented as:

xi = (xi1, xi2, …, xiK)   --- (10)

N such data entities form a set

X = (x1, x2, …, xN) ⊂ ℝ^K   --- (11)

known as the data set. X can be represented by an N × K matrix

X = [xij]   --- (12)

where xij is the jth component of the data entity xi. There are various methods used
for data mining; many of them, for example nearest-neighbor classification, cluster
analysis, and multi-dimensional scaling, are based on measures of similarity between
data. Instead of measuring similarity, measuring dissimilarity among the entities
gives the same results. One of the parameters that can be used to measure
dissimilarity is distance. This category of measures is also known as separability,
divergence, or discrimination measures.
      A distance metric is a real-valued function d(·,·) such that for any data points
x, y, and z:

d(x, y) ≥ 0, and d(x, y) = 0 if and only if x = y   --- (13)

d(x, y) = d(y, x)   --- (14)

d(x, z) ≤ d(x, y) + d(y, z)   --- (15)

The first property (13), positive definiteness, ensures that the distance is non-negative
and is zero only when the two points are the same. The second property states the
symmetry of distance, and the third is the triangle inequality. Various distance
formulas are available, such as Euclidean, Manhattan, Lp-norm, and the
Similarity-Distance. In this paper the Similarity-Distance is taken as the main method
to find the similarity distance between two data sets. The distance between a set of
observed groups in m-dimensional space determined by m variables is known as the
Similarity-Distance. A smaller distance value indicates that the data in the groups are
very close; a larger value indicates that they are not. The mathematical formula of the
Similarity-Distance for two data samples X and Y is written as:

D(X, Y) = sqrt( (X − Y)ᵀ Σ⁻¹ (X − Y) )   --- (16)

where Σ⁻¹ is the inverse covariance matrix.
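Equation (16) has the form of a Mahalanobis distance, which can be sketched with NumPy as below. The covariance matrix is estimated from a sample matrix whose rows are observations; the use of a pseudo-inverse and the function name `similarity_distance` are assumptions made for illustration.

```python
import numpy as np

def similarity_distance(x, y, samples):
    """Similarity-Distance of Eq. (16): sqrt((x - y)^T S^-1 (x - y)),
    with S estimated from `samples` (rows = observations, columns = features)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    cov = np.cov(np.asarray(samples, float), rowvar=False)
    cov_inv = np.linalg.pinv(cov)          # pseudo-inverse, assumed for robustness
    diff = x - y
    return float(np.sqrt(diff @ cov_inv @ diff))
```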


The similarity value between the sub-windows of data set DB1 and data set DB2 is
computed and the result is stored in a variable named score:

score[i] = Σ ( Wi(DB1) − Wi(DB2) )   --- (17)

score[i] ≤ threshold  →  1
score[i] = 0          →  0    --- (18)
score[i] > threshold  →  −1

The first case in (18) says that the data available in the two windows Wi(DB1) and
Wi(DB2) are more or less similar, the second case says they are exactly the same, and
the third case says the data are different. Whenever the distance between the data sets
satisfies score[i] = 0 and score[i] ≤ threshold, both data items are marked in the DB.
The value of score[i] leads to one of two decisions:

TPR—if the similarity value lies above this boundary [−1 to 1], the records are
considered replicas;
TNR—if the similarity value lies below this boundary, the records are considered not
to be replicas.

If the similarity value lies between the two boundaries, the records are classified as
"possible matches", and human judgment is needed to settle the matching score.
Most existing approaches to replica identification depend on several choices for their
parameters, which may not always be optimal. Setting these parameters requires
accomplishing the following tasks: selecting the best evidence to use (using more
evidence means more processing to compute the similarity among the data, so finding
the duplication takes more time); deciding how to combine the best evidence, since
some evidence may be more effective for duplicate identification than others; and
finding the best boundary values to use, since bad boundaries may increase the
number of identification errors (e.g., false positives and false negatives), nullifying
the whole process. Window 1 from DB1 is compared with windows 1, 2, 3, and so on
from DB2, which can be written as:

score[i] = Σ ( W1(DB1) − Wi(DB2) )   --- (19)

If score[i] = 0, then W1(DB1) and Wi(DB2) are the same, and the pair is marked as a
duplicate. Otherwise W1(DB1) is compared with the next window of DB2.
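The window-level scoring of Eqs. (17)-(19) can be sketched as follows, assuming two equally sized numeric windows and a user-defined threshold; the three-way labels mirror the cases of Eq. (18), and the function names are illustrative.

```python
def window_score(w1, w2):
    """Summed element-wise difference between two equal-length windows (Eqs. 17/19)."""
    return sum(a - b for a, b in zip(w1, w2))

def classify_pair(score, threshold):
    # Three-way decision of Eq. (18): exact match, near match, different.
    if score == 0:
        return "duplicate"         # windows are identical; mark both in the DB
    if abs(score) <= threshold:
        return "possible match"    # close enough; may need human judgment
    return "different"
```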
      The objective of this paper is to improve the quality of the data in a DBMS so
that it is error-free and can provide fast responses to any SQL query. It also
concentrates on de-duplication where possible in the data model. Removing
duplicates in government organizations is inefficient and difficult. Avoiding duplicate
data enables high-quality retrieval from huge data sets such as those in banking.

DATA:
For evaluating the proposed approach, two real-world data sets commonly employed
for record de-duplication experiments are used; they are based on data gathered from
web indexes. Additionally, further data sets were created using a synthetic data set
generator. The first is the Cora data set, a collection of 1,295 distinct citations to 122
computer science papers taken from the Cora research paper search engine. These
citations were split into multiple attributes (author names, year, title, venue, pages,
and other info) by an information extraction system. The other real-world data set is
the Restaurants data set, which comprises 864 records of restaurant names and
supplementary data, including 112 replicas obtained by integrating records from the
Fodor and Zagat guidebooks. The following attributes of this data set are used:
(restaurant) name, address, city, and specialty. The synthetic data sets were created
using the Synthetic Data Set Generator (SDG) [32] available in the Febrl [26]
package.

Since real-world data sets such as time series data, the 20 Newsgroups data set, and
customer data from OLX.in are not always sufficient or easily accessible for the
experiment, synthetic data are also used. The records contain fields such as name,
age, city, address, phone numbers, etc. (like a social security number). Using SDG,
errors and duplications can also be introduced into the data manually, and some
modifications can be applied at the record attribute level. The data taken for the
experiments are:

DATA-1: This data set contains four files of 1000 records (600 originals and 400
duplicates) with a maximum of five duplicates based on one original record (using a
Poisson distribution of duplicate records) and with a maximum of two modifications
in a single attribute and in full record.

DATA-2: This data set contains four files of 1000 records (750 originals and 250
duplicates), with a maximum of five duplicates based on one original record (using a
Poisson distribution of duplicate records) and with a maximum of two modifications
in a single attribute and four in the full record.

DATA-3: This data set contains four files of 1000 records (800 originals and 200
duplicates) with a maximum of seven duplicates, based on one original record (using
a Poisson distribution of duplicate records) and with a maximum of four
modifications in a single attribute and five in the full record. The duplication can be
applied to each attribute of the data in the form of evidence pairs

⟨attribute, similarity-function⟩

The experiment on the time series data was carried out in MATLAB, and the time
complexity was compared with the existing system. The elapsed time taken to run the
proposed approach is 5.482168 seconds. The results obtained for all the
functionalities defined in Fig.1 are depicted in Fig.3 to Fig.6.

Fig.3: Original Data Not Preprocessed

Fig.3 shows the original data as taken from the web; it contains errors, redundancy,
and noise. The three lines show that the data DB is divided into DB1, DB2, and DB3.
It is clear from the figure that DB1, DB2, and DB3 coincide and overlap in many
places, which indicates data redundancy. The zigzag form also shows that the data
are not preprocessed. In the time series data, 14 numerical values were preprocessed
[replaced by 0's], as verified from the database.

Fig.4: Preprocessed Data



Fig.4 shows the data after preprocessing and normalization. The user-defined
normalization arranges the data in order for easy processing. Even within DB1, DB2,
and DB3 there is overlapping data, which indicates that many data items are similar;
Fig.4 clearly shows that DB1 and DB2 have many similar, overlapping values. After
finding the similarity index, these identical data items can be marked as duplicates
and removed for easier processing.

Fig.5: Single Window Data in DB1

After normalization, the data are divided into windows, as shown in Fig.5, where the
window size of 50 is defined by the developer; each window holds 50 data items for
fast comparison. To confirm the behavior observed with real data, we conducted
additional experiments using our synthetic data sets. The user-selected evidence setup
in this experiment was built using the following list of evidence:

<firstname, PSO>, <lastname, PSO>, <street number, string distance>,

<address1, PSO>, <address2, PSO>, <suburb, PSO>, <postcode, string distance>,

<state, PSO>, <date of birth, string distance>, <age, string distance>,

<phone number, string distance>, <social security number, string distance>.

This list of evidence, using the PSO similarity function for free-text attributes and a
string distance function for numeric attributes, was chosen because it required less
processing time in our initial tuning tests.
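The evidence list above pairs each attribute with a similarity function. A minimal sketch of such a configuration is given below; the dictionary `EVIDENCE`, the helper `record_similarity`, the use of `difflib` as a stand-in string distance, and the averaging of scores are all illustrative assumptions, and `pso_similarity` is assumed to be the PSO-weighted comparison described earlier.

```python
import difflib

def string_distance(a, b):
    # Simple character-level similarity in [0, 1]; a stand-in for the
    # string-distance evidence applied to numeric attributes.
    return difflib.SequenceMatcher(None, str(a), str(b)).ratio()

# Attribute -> kind of evidence used, mirroring the list above.
EVIDENCE = {
    "firstname": "pso", "lastname": "pso", "address1": "pso",
    "address2": "pso", "suburb": "pso", "state": "pso",
    "street number": "string", "postcode": "string",
    "date of birth": "string", "age": "string",
    "phone number": "string", "social security number": "string",
}

def record_similarity(r1, r2, pso_similarity):
    """Average the per-attribute evidence scores for two records (dicts)."""
    scores = []
    for attr, kind in EVIDENCE.items():
        fn = pso_similarity if kind == "pso" else string_distance
        scores.append(fn(r1.get(attr, ""), r2.get(attr, "")))
    return sum(scores) / len(scores)
```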

Table-2: Original Data

Data set          Original Data  Good Data  Similar Data  Error Data
Time Series       1000           600        400           24%
Restaurant        1000           750        250           15%
Student Database  1000           800        200           12.4%
Cora              1000           700        300           19.2%

Table-3: Data Duplication Detection and De-Duplication

Data set     Original Data  Marked Duplication  De-Duplicated  Not De-Duplicated
Time Series  1000           400                 395            5
Restaurant   1000           250                 206            44
Student DB   1000           200                 146            54
Cora         1000           300                 244            46

Fig.6: Performance Evaluation of Proposed Approach

The performance of the proposed approach is evaluated by comparing the detection
of duplication and errors, the marking of duplicates, the number of de-duplications
achieved, and the error correction for the various data sets. Fig.6 shows the
performance evaluation of the proposed approach using the Similarity-Distance.
According to the distance score, the duplicate and erroneous records are detected and
marked. The Similarity-Distance rectifies errors of 24%, 15%, 12.4%, and 19.2% for
the Time series, Restaurant, Student, and Cora data, respectively.
      The number of duplicate records detected by PSO is 400, 250, 200, and 300
for the Time series, Restaurant, Student, and Cora data, and the de-duplicated records
number 395, 206, 146, and 244, respectively. Because of complexity or errors in the
data, the de-duplication does not reach 100%.
      Some performance metrics can be calculated to quantify the accuracy of our
proposed approach:

TPR = (Number of duplications found correctly) / (Total number of data)

FNR = (Number of duplications wrongly obtained) / (Total number of data to be identified)

Sensitivity = TP / P = 99%

Specificity = TN / N = 88.5%

Accuracy = (TP + TN) / (P + N) = 96.3%

where P = TP + FN and N = FP + TN. The proposed approach proved more efficient
in terms of duplication detection, error detection, and de-duplication, with an
accuracy of 96.3%. Hence the Similarity-Distance-based duplication detection is
more efficient.
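For reference, the sensitivity, specificity, and accuracy figures above follow directly from the confusion counts; a small sketch of the computation (with P = TP + FN and N = FP + TN, as defined above) is:

```python
def detection_metrics(tp, fp, tn, fn):
    """Sensitivity = TP/P, specificity = TN/N, accuracy = (TP+TN)/(P+N)."""
    p, n = tp + fn, fp + tn
    sensitivity = tp / p if p else 0.0
    specificity = tn / n if n else 0.0
    accuracy = (tp + tn) / (p + n) if (p + n) else 0.0
    return sensitivity, specificity, accuracy
```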

CONCLUSION
In this paper the PSO-based distance method is taken as the main method for finding
similarity [redundancy] in any database. The similarity score is computed for various
databases and the performance is compared. The accuracy obtained using the
proposed approach is 96.3% across four different databases. The time series data are
stored in Excel, the Cora data in a database table, the student data in MS-Access, and
the restaurant data in a SQL table. The experimental results show that, with the
proposed approach, it is easy to perform anomaly detection and removal in terms of
data redundancy and errors. In future work, reliability and scalability will be
investigated with respect to data size and data variation.

REFERENCES

[1]. Moisés G. de Carvalho, Alberto H.F. Laender, Marcos André Gonçalves,
and Altigran S. da Silva, “A Genetic Programming Approach to Record
Deduplication”, IEEE Transactions on Knowledge and Data Engineering,
Vol. 24, No. 3, March 2012.
[2]. M. Wheatley, “Operation Clean Data,” CIO Asia Magazine, http:// www.cio-
asia.com, Aug. 2004.
[3]. Ireneusz Czarnowski, Piotr Jędrzejowicz, “Data Reduction Algorithm for
Machine Learning and Data Mining”, Volume 5027, 2008, pp 276-285.
[4]. Jose Ramon Cano, Francisco Herrera, Manuel Lozano, “Strategies for Scaling
Up Evolutionary Instance Reduction Algorithms for Data Mining”, Book-Soft
Computing, Volume 163, 2005, pp 21-39.
[5]. Gabriel Poesia, Loïc Cerf, “A Lossless Data Reduction for Mining Constrained
Patterns in n-ary Relations”, Machine Learning and Knowledge Discovery in
Databases, Volume 8725, 2014, pp 581-596.
[6]. Nick J. Cercone, Howard J. Hamilton, Xiaohua Hu, Ning Shan, “Data Mining
Using Attribute-Oriented Generalization and Information Reduction”, Rough
Sets and Data Mining, 1997, pp 199-22.
[7] Vikrant Sabnis, Neelu Khare, “An Adaptive Iterative PCA-SVM Based
Technique for Dimensionality Reduction to Support Fast Mining of Leukemia
Data”, SocProS 2012.
[8] Paul Ammann, Dahlard L. Lukes, John C. Knight, “Applying data redundancy
to differential equation solvers”, Journal of Annals of Software Engineering,
1997, Volume 4, Issue 1, pp 65-77.
[9] P. E. Ammann, “Data Redundancy for the Detection and Tolerance of
Software Faults”, Computing Science and Statistics, 1992, pp 43-52.
[10] Rita Aceves-Pérez, Luis Villaseñor-Pineda, Manuel Montes-y-Gomez, “
Towards a Multilingual QA System Based on the Web Data Redundancy”,
Computer Science Volume 3528, 2005, pp 32-37.
[11]. Thiago Luís Lopes Siqueira, Cristina Dutra de Aguiar Ciferri, Valéria Cesário
Times, Anjolina Grisi de Oliveira, Ricardo Rodrigues Ciferri, “The impact of
spatial data redundancy on SOLAP query performance”,Journal of the
Brazilian Computer Society, June 2009, Volume 15, Issue 2, pp 19-34.
[12]. Ahmad Ali Iqbal, Maximilian Ott, Aruna Seneviratne, “Removing the
Redundancy from Distributed Semantic Web Data”, Database and Expert
Systems Applications Lecture Notes in Computer Science, Volume 6261,
2010, pp 512-519.
[13] Yanxu Zhu, Gang Yin, Xiang Li, Huaimin Wang, Dianxi Shi, Lin Yuan,
“Exploiting Attribute Redundancy for Web Entity Data Extraction”, Digital
Libraries: For Cultural Heritage, Knowledge Dissemination, and Future
Creation Lecture Notes in Computer Science Volume 7008, 2011, pp 98-107.
[14] M.G. de Carvalho, M.A. Gonçalves, A.H.F. Laender, and A.S. da Silva,
“Learning to Deduplicate,” Proc. Sixth ACM/IEEE CS Joint Conf. Digital
Libraries, pp. 41-50, 2006.

[15] J.R. Koza, Genetic Programming: On the Programming of Computers by
Means of Natural Selection. MIT Press, 1992.
[16] M.G. de Carvalho, A.H.F. Laender, M.A. Gonçalves, and A.S. da Silva,
“Replica Identification Using Genetic Programming,” Proc. 23rd Ann. ACM
Symp. Applied Computing (SAC), pp. 1801-1806, 2008.
