
TDML: A Data Mining Language for transaction databases

Muthukumar A* Nadarajan R**


* Kumaraguru College of Technology, Coimbatore-641006, India
** PSG College of Technology, Coimbatore-641004, India
amk_in_2000@yahoo.com, nadarajan_psg@yahoo.co.in

Abstract

A desired feature of data mining systems is the ability to support ad hoc and interactive data mining in order to facilitate flexible and effective knowledge discovery. Data mining query languages can be designed to support such a feature, and languages such as DMQL exist for mining relational databases. In this paper we propose a new data mining language, called TDML, for mining transaction databases. The proposed language supports association rule mining and sequential pattern mining, and uses a new bitmap processing approach with buffered storage of results. The types of data mining supported include generalized mining, multilevel mining, multidimensional mining, distributed mining, partition mining, incremental mining, online mining, merge mining, transaction reduction, stream mining and targeted itemset mining.

1. Introduction

Designing a comprehensive data mining language is challenging because data mining covers a wide spectrum of tasks, from data characterization to mining association rules, data classification, and evolution analysis [8]. Each task has different requirements. The design of an effective data mining query language requires a deep understanding of the power, limitations, and underlying mechanisms of the various kinds of data mining tasks [4]. The primitives for defining a data mining task in the form of a data mining query are as follows.

• The set of task-relevant data to be mined
• The kind of knowledge to be mined
• The background knowledge to be used in the discovery process
• The interestingness measures and thresholds for pattern evaluation
• The expected representation for visualizing the discovered patterns
• The dynamic nature of data growth
• Special mining tasks like targeted itemset mining, stream mining, and distributed and partition data mining

Based on these primitives, we design a query language for data mining called TDML (Transaction Data Mining Language). TDML allows the ad hoc mining of several kinds of knowledge from transaction databases and data warehouses at multiple levels of abstraction [4,5].

2. Data Mining Primitives

Each user has a data mining task in mind to be performed. A data mining task can be specified in the form of a data mining query, which is input to the data mining system. A data mining query is defined in terms of the following primitives.

Task relevant data: This is the database portion to be investigated. Most data are available in relational database form, but there is still a flat-file form of data representation called transaction databases. Any data warehouse or relational data can easily be converted into a transaction database, but not vice versa. Since there is no data mining language specifically designed for transaction databases, we have taken this problem as our research and designed TDML.

The kinds of knowledge to be mined: This specifies the data mining functions to be performed, such as characterization, discrimination, association, classification, clustering, or evolution analysis. In TDML we have considered association rule mining and sequential pattern mining.

Background knowledge: Users can specify background knowledge, or knowledge about the domain to be mined. This knowledge is useful for guiding the knowledge discovery process and for evaluating the patterns found. We represent background knowledge through concept hierarchies, which allow data to be mined at multiple levels of abstraction.

Interestingness measures: These functions are used to separate uninteresting patterns from knowledge. They may be used to guide the mining process or, after discovery, to evaluate the discovered patterns. Different kinds of knowledge may have different interestingness measures. The interestingness measures for association rules and sequential pattern mining are the support and confidence parameters.

Presentation and visualization of discovered patterns: This refers to the forms in which discovered patterns are to be displayed. Users can choose from different forms of knowledge presentation, such as rules, tables, charts, graphs and cubes [4,5].

Fourth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2007)
0-7695-2874-0/07 $25.00 © 2007
Maintenance of data: Data are dynamic in nature and grow with time. We need not run the whole process each time to determine knowledge; instead, an incremental mining method can add the knowledge of the incremental database to the whole knowledge. Similarly, when the support or confidence thresholds need to be changed, they alone can be changed and rule patterns identified on the fly [7]; this is online mining. When different databases are mined at different times, their results can be merged and new knowledge identified through the merge mining process. All these methods are implemented in TDML.

Special types of mining: When we want to find whether a particular rule is satisfied by a data set, we can use targeted itemset mining. Stream mining performs rule identification on the fly and does not store the transactions in the database. Distributed and partition mining [10] are also supported under this category.

3. TDML: Transaction Data Mining Language

The language has been designed for mining transaction databases. The syntax of TDML is defined in an extended BNF grammar, where "[ ]" represents zero or one occurrence, "{ }" represents zero or more occurrences, and words in Arial font represent keywords.

<TDML> ::= <TDML Statement>; { <TDML Statement> }
<TDML Statement> ::=
    mining <ARM/SPM/target/reduce>
    from files=<f1,f2,...,fn> |
        [buffers=<b1,b2,...,bn>] |
        [systems=<s1,s2,...,sn>]
    with support=<s1,s2,...,sn> |
        conf=<c1,c2,...,cn> |
        [parents=<p1,p2,...,pm>] |
        [levels=<no>] |
        [itemsets=<i1,i2,...,im>] |
        [inputs=<i1,i2,...,im>] |
        [minrow=<r1>, maxrow=<r2>, mincol=<c1>]
    [save files=<f1,f2,...,fn> |
        buffers=<b1,b2,...,bn>]
    [using <ddm/gen/mlevel/mdim/online/incr/merge/stream/partition>]

Each part of the statement is explained below.

1. mining <ARM/SPM/target/reduce>

The mining types supported in TDML are Association Rule Mining (ARM) [1], Sequential Pattern Mining (SPM) [3,6], target mining, in which rules are verified as to whether they occur in the database or not, and "reduce", a preprocessing statement in which unwanted rows or columns that do not contribute to the knowledge are removed and stored in a buffer from which they can be retrieved for future corrections [9].

2. from files=<f1,f2,...,fn> |
    [buffers=<b1,b2,...,bn>] |
    [systems=<s1,s2,...,sn>]

Here the transaction databases are stored in flat files, and the files are specified with syntax like files="mydat.txt". In the case of distributed mining, more transaction files are specified. For efficiency, the results of a process can be stored in buffers to be used in later steps; we explain these concepts with examples in the next section. Systems specify the system names where the flat files and buffers are located in the case of distributed data mining.

3. with support=<s1,s2,...,sn> |
    conf=<c1,c2,...,cn> |
    [parents=<p1,p2,...,pm>] |
    [levels=<no>] |
    [itemsets=<i1,i2,...,im>] |
    [inputs=<i1,i2,...,im>] |
    [minrow=<r1>, maxrow=<r2>, mincol=<c1>]

This portion specifies interestingness measures like support and confidence [1]. More than one value can be given in the case of online mining. Parent hierarchical information can be given in text files with syntax like parents="p1.txt". In the case of multilevel mining, the number of levels of abstraction is specified; for example, levels=3 means there are at most 3 levels. For targeted itemset mining we have the itemsets parameter; more rules can be checked by specifying more itemsets. In the case of stream mining, the stream data are obtained through inputs "i1, i2 … im". When preprocessing is done, irrelevant rows are removed through the minrow and maxrow parameters; similarly, columns that fall below the threshold mincol are removed. Only the parameters relevant to each case need be used.

4. [save files=<f1,f2,...,fn> |
    buffers=<b1,b2,...,bn>]

The results of the mining process can be stored in flat files, named either as files or as buffers. Buffers are used for efficient processing in later steps, whereas files are used when the whole result is to be reused later.
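The clause structure defined by the grammar above is easy to split mechanically. The sketch below is a hypothetical Python illustration of that first parsing step, not the actual TDML compiler (which the paper implements in Java); it assumes the clause keywords do not occur inside file names:

```python
import re

# Clause keywords taken from the TDML grammar above.
CLAUSES = ("mining", "from", "with", "save", "using")

def parse_tdml(statement):
    """Split one TDML statement into a {clause: body} dictionary."""
    pattern = r'\b(' + '|'.join(CLAUSES) + r')\b'   # e.g. r'\b(mining|from|...)\b'
    parts = re.split(pattern, statement.strip().rstrip(';'))
    # re.split with a capturing group interleaves keywords and bodies:
    # parts[1::2] are keywords, parts[2::2] the text following each keyword.
    return {key: body.strip() for key, body in zip(parts[1::2], parts[2::2])}

q = 'mining ARM from files="mydat.txt" with support=40% conf=70% save buffers="mybuf.txt";'
clauses = parse_tdml(q)
# clauses["mining"] -> 'ARM', clauses["save"] -> 'buffers="mybuf.txt"'
```

A real implementation would go on to validate each body against the alternatives the grammar allows for that clause.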

5. [using <ddm/gen/mlevel/mdim/online/incr/merge/stream/partition>]

The using parameter specifies which type of mining is to be performed; any one of the parameters may be used. If none is given, bitmap processing [2] is used to find association rules in an Apriori-like way. The various types supported are:

ddm – Distributed data mining
gen – Generalized data mining
mlevel – Multilevel data mining
mdim – Multidimensional data mining
online – Online mining
incr – Incremental mining
merge – Merge mining
stream – Stream mining
partition – Partition mining

4. Syntax for Data Mining Primitives

In this section we discuss how the syntax is used with respect to the data mining primitives.

4.1. Syntax for task relevant data specification

The first step in defining a data mining task is the specification of the task relevant data, that is, the data on which mining is to be performed. This involves specifying the transaction database, conditions for selecting the relevant data, the relevant attributes or dimensions for exploration, and instructions regarding the ordering or grouping of the data retrieved. TDML provides clauses for the specification of such information as follows:

• from files=<f1,f2,…,fn>
  buffers=<b1,b2,…,bn>
  systems=<s1,s2,….,sm>

The from clause specifies the transaction database files, intermediate buffers and, in the case of distributed systems, system names.

• save files=<f1,f2,…,fn>
  buffers=<b1,b2,…,bn>

The save clause specifies where the resultant rules must be stored: files gives the file names where the data are stored, and buffers the places where the resultant rules are stored.

4.2. Syntax for kind of knowledge to be mined

Here the different types of knowledge to be mined, such as association rules and sequential patterns, are specified using the following syntax.

• mining <ARM/SPM>
• using <ddm/gen/mlevel/mdim>

ARM specifies the knowledge to be mined as association rules and SPM as sequential patterns. Suitable data files must be selected using the "from" and "save" options as illustrated above. The other kinds of data mining tasks, such as distributed, generalized, multilevel and multidimensional data mining, are specified with the "using" clause.

4.3. Syntax for concept hierarchy specification

A concept hierarchy is specified using the "with" clause option as illustrated below.

• with parents=<p1,p2,…,pm>

Since there may be more than one hierarchy attached to items, we have provided options to specify more than one.

4.4. Syntax for interestingness measure specification

Interestingness is specified using parameters like support and confidence. This is specified using the "with" clause.

• with support=<s1,s2,…,sm>
  conf=<c1,c2,…,cm>

Since in the case of online mining we can specify more than one support and confidence, to be applied iteratively one after another, we have provided for multiple supports and confidences.

4.5. Syntax for specifying dynamic databases

Dynamic databases are supported through the incremental, online and merge mining [7] options specified in the "using" clause.

• using <incr/online/merge>

4.6. Syntax for specifying special mining cases

Target mining is an interesting case in which we check whether a given rule is present in the database or not. We use the "mining" clause as illustrated below.

• mining <target>

Another special mining case is preprocessing of the database, also called transaction reduction [9].

• mining <reduce>
• with minrow=<r1>, maxrow=<r2>, mincol=<c1>
• save files=<f1,f2,…,fn>

The mining clause specifies that the transaction reduction method is to be applied. The with clause specifies the threshold parameters minrow, maxrow and mincol, which give the minimum number of rows, maximum number of rows and minimum number of columns by which the transaction database can be pruned. The resultant file is specified in the "save" clause.

5. Examples using TDML

Here the transaction database is available in a flat file called mydat.txt. The resulting rules are stored in the buffer "mybuf.txt". The threshold parameters support and confidence are set to 40% and 70% respectively. The rules are identified using an Apriori-like bitmap processing algorithm [6,9,11].

Simple association rule mining

mining ARM
from files="mydat.txt"
with support=40%
conf=70%
save buffers="mybuf.txt"

Generalized association rule mining uses a concept hierarchy given in a parents file such as "p1.txt". Parent nodes are added to the database items and the whole set of rules is identified. Here again, bitmap processing is used extensively to improve performance.

Generalized data mining

mining ARM
from files="mydat.txt"
with support=40%
conf=70%
parents="p1.txt"
save buffers="b1.txt"
using gen

In distributed data mining there are n computer systems, each with its own sub-database and buffers. Files are processed simultaneously on all systems and the results are passed as buffers to a central node, where they are merged and stored in a buffer. Network communication is used between the nodes and the server.

Distributed data mining

mining ARM
from files="d1.txt, d2.txt, d3.txt, d4.txt"
buffers="b1.txt, b2.txt, b3.txt, b4.txt"
systems="sys1, sys2, sys3, sys4"
with support=40%
conf=70%
save buffers="b.txt"
using ddm

In multilevel data mining, each item represents several levels of abstraction, and rules can be generated at level 1, level 2, and so on. Multilevel mining is enabled with the mlevel option of the "using" clause together with the levels parameter.

Multidimensional data mining handles both categorical and quantitative attributes. The file has to be organized so that all categorical attributes come first and quantitative attributes next; Boolean attributes are arranged at the end.

When the database grows and we want to find the rules of the incremental database together with the whole database, we use incremental data mining. We specify the incremental database first and then the buffer of the whole database; the result is stored in a new buffer.

Incremental data mining

mining ARM
from files="i1.txt"
buffers="w1.txt"
with support=40%
conf=70%
save buffers="w2.txt"
using incr

Online data mining applies the threshold parameters dynamically to find new rules, and can be used with any number of threshold parameters. An example is illustrated below.

Online data mining

mining ARM
from files="f1.txt"
buffers="b1.txt"
with support="40%, 50%, 60%, 70%"
conf="60%, 70%, 80%, 90%"
save buffers="b2.txt, b3.txt, b4.txt, b5.txt"
using online
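The online option above re-evaluates rules for each support/confidence pair without re-reading the database. Below is a minimal Python sketch of the idea, assuming itemset counts have been buffered from a single scan; the data layout and function names are ours, not the paper's:

```python
from itertools import combinations

def buffer_counts(transactions):
    """Single scan of the database: buffer occurrence counts of all 1- and 2-itemsets."""
    counts = {}
    for t in transactions:
        for k in (1, 2):
            for combo in combinations(sorted(t), k):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return counts

def online_rules(counts, n, thresholds):
    """Re-derive rules a -> b for each (min_support, min_confidence) pair
    from the buffered counts, without rescanning the n transactions."""
    results = []
    for min_sup, min_conf in thresholds:
        rules = []
        for pair, cnt in counts.items():
            if len(pair) != 2 or cnt / n < min_sup:
                continue
            for a in pair:
                (b,) = pair - {a}
                conf = cnt / counts[frozenset({a})]
                if conf >= min_conf:
                    rules.append((a, b, conf))
        results.append(rules)
    return results

db = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 3}, {1, 2, 3}]
counts = buffer_counts(db)
# 6 rules survive (0.4, 0.7); none survive (0.6, 0.8)
per_threshold = online_rules(counts, len(db), [(0.4, 0.7), (0.6, 0.8)])
```

Each (support, confidence) pair in the lists is applied in turn to the same buffered counts, which is what lets the thresholds be changed "on the fly" [7].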

To merge the rule sets of several databases we use merge data mining. Here the buffers, not the actual databases, are merged. An example is given below.

Merge mining

mining ARM
from buffers="b1.txt, b2.txt, b3.txt, b4.txt"
with support=40%
conf=70%
save buffers="b.txt"
using merge

Partition data mining is similar to distributed data mining; the difference is that the partitions need not be on different systems and may reside on a single system.

Stream mining is a newer area of data mining in which inputs arrive for service at different times. We process the data as and when they arrive and store the result in a buffer. Since many itemsets may arrive within a small interval of time, we have used a multi-threaded application.

Stream mining

mining ARM
from buffers="b.txt"
with support=40%
conf=70%
input="i1.stream"
save buffers="b.txt"
using stream

Targeted itemset mining is used to identify whether a rule exists in a database or not; either a single rule or multiple rules can be checked.

As a transaction reduction tool, we use the minimum rows, maximum rows and minimum columns as input parameters. We compute row sums and column sums and use these thresholds to remove the rows and columns that fall outside the given bounds. The removed rows and columns are stored for future corrections. An example is given below.

mining reduce
from files="f.txt"
with minrow=5, maxrow=50, mincol=100
save files="f2.txt"

All the examples given above illustrate the process of mining association rules from transaction databases; TDML can be used in the same way for sequential pattern mining [3,6]. The compiler of the language implements each such mining process using various functions. We have developed an integrated development environment called TIDE (Transactional IDE), in which TDML programs can be developed and deployed.

6. Implementation

TDML has been implemented with Java as the development language. At present it works on transaction databases, which are simply flat files; in the future it will be extended to relational databases and data warehouses. For testing purposes, synthetic transaction databases were generated with the dataset generation program of the development environment. TDML development by the user is supported in TIDE, a menu-driven IDE in which TDML can be programmed either through interactive selection of menus or as a user-typed program in the editor. The applications of TDML include market basket data, bio-informatics and other areas where a user-friendly programming environment is required.

The number of transactions in the experiments ranged from 100K to 500K, with the number of items fixed at 100. The datasets were generated with the TIDE data generation utility. The TDML program was written for the "mining reduce" option of the syntax. The results are shown in Table 1.

Table 1. Performance of TDML compared with Apriori and FP-growth

Number of transactions | Apriori (secs.) | FP-growth (secs.) | TDML (secs.)
100K                   | 211             | 54                | 52
200K                   | 424             | 102               | 101
300K                   | 593             | 153               | 155
400K                   | 1057            | 198               | 226
500K                   | 1113            | 258               | 265

More experiments have been done in an isolated manner for each such option, and further experiments will be done in a more integrated manner.
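The "mining reduce" option measured in the experiments prunes by row and column sums, as described in Section 5. The following Python sketch is an illustrative reconstruction under that description; the actual implementation is in Java, and the names here are ours:

```python
def reduce_transactions(matrix, minrow, maxrow, mincol):
    """Transaction reduction: drop rows whose item count lies outside
    [minrow, maxrow] and columns whose support count is below mincol.
    Removed rows and columns are returned as a buffer for future corrections."""
    kept_rows, removed_rows = [], []
    for row in matrix:                       # matrix: 0/1 transaction rows
        if minrow <= sum(row) <= maxrow:
            kept_rows.append(row)
        else:
            removed_rows.append(row)
    ncols = len(matrix[0])
    col_sums = [sum(r[c] for r in kept_rows) for c in range(ncols)]
    keep_cols = [c for c in range(ncols) if col_sums[c] >= mincol]
    removed_cols = [c for c in range(ncols) if col_sums[c] < mincol]
    reduced = [[r[c] for c in keep_cols] for r in kept_rows]
    return reduced, removed_rows, removed_cols

m = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0]]
reduced, rows_buf, cols_buf = reduce_transactions(m, minrow=1, maxrow=2, mincol=2)
# rows with sums 3 and 0 are buffered; only column 0 has support >= 2
```

Mining then runs on the smaller matrix, and the buffered rows and columns remain available if the thresholds are later relaxed.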

7. Conclusion and future work

Developing a data mining language for relational databases is relatively simple compared with developing one for other kinds of databases, such as transaction databases [8]. Moreover, the kinds of knowledge usually mined do not include sequential pattern mining. We have developed TDML, which addresses both of these problems. A limitation of this work is that the user must be familiar with the knowledge to be mined, with the types of association rule mining and sequential pattern mining, and with the syntax and its combinations. To mitigate this limitation, we have developed an integrated environment which helps the TDML user. Applications include market basket data, bio-informatics and other areas where a user-friendly development environment is required. As future work, we would like to expand the knowledge to be mined to include classification, clustering and evolution analysis, to provide support for relational databases and data warehouses, and to make the underlying functions available to the user as tailor-made utilities.

8. References

[1] R. Agrawal, R. Srikant, "Fast Algorithms for Mining Association Rules", Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994, pp. 487-499.

[2] Show-Jane Yen and Arbee L. P. Chen, "A Graph-Based Approach for Discovering Various Types of Association Rules", IEEE Transactions on Knowledge and Data Engineering, Vol. 13, No. 5, 2001, pp. 839-845.

[3] R. Agrawal, R. Srikant, "Mining Sequential Patterns", Proc. 11th Int'l Conf. Data Eng., 1995.

[4] Jiawei Han, Micheline Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, San Francisco, 2001.

[5] Willi Klosgen and Jan M. Zytkow, "Handbook on Data Mining and Knowledge Discovery", Oxford University Press, 2002.

[6] Jay Ayres, Johannes Gehrke, Tomi Yiu, Jason Flannick, "Sequential Pattern Mining Using a Bitmap Representation", KDD'02, 2002, pp. 429-435.

[7] Walid G. Aref, Mohamed G. Elfeky, Ahmed K. Elmagarmid, "Incremental, Online, and Merge Mining of Partial Periodic Patterns in Time-Series Databases", IEEE TKDE, Vol. 16, No. 3, 2004, pp. 332-342.

[8] Jiawei Han et al., "DMQL: A Data Mining Query Language for Relational Databases", Research Report, Simon Fraser University, Canada, 1996.

[9] Muthukumar A, Nadarajan R, "An Efficient Approach to Mining Association Rules in Transaction Databases", Indian Journal of Information Science and Technology, Irofis Publications, 2005, pp. 10-16.

[10] Muthukumar A, Nadarajan R, "Efficient and Scalable Partition-Based Algorithms for Mining Association Rules", AOIJ Journal, Vol. 19, 2006.

[11] Muthukumar A, Nadarajan R, "Association Rule Mining of Bitmap Transactions Using Transaction Reduction, Modified Graph and Partitioned Approaches", Journal of Computer Science, Karpagam Publication, 2007, pp. 281-293.

