Fourth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2007)
0-7695-2874-0/07 $25.00 © 2007
knowledge presentation, such as rules, tables, charts, graphs and cubes [4,5].

Maintenance of data: Data are dynamic in nature and grow with time. We need not rerun the whole process each time to determine knowledge. We can use an incremental mining method to add the knowledge from the incremental database to the existing knowledge. Similarly, when the support count or confidence levels need to be changed, they alone can be changed and rule patterns can be identified on the fly [7]. This is online mining. When different databases are mined at different times, their results can be merged and new knowledge can be identified through the merge mining process. All these methods are implemented in TDML.

Special types of mining: When we want to find whether a particular rule is satisfied by a data set, we can use target itemset mining. Stream mining performs rule identification on the fly and does not store the transactions in the database. Distributed and partition mining [10] are also supported under this category.

3. TDML: Transaction Data Mining Language

The language has been designed for mining transaction databases. The syntax of TDML is defined in an extended BNF grammar, where “[ ]” represents zero or one occurrence, “{ }” represents zero or more occurrences, and words in Arial font represent keywords.

<TDML> ::= <TDML Statement>; { <TDML Statement> }
<TDML Statement> ::=
    mining <ARM/SPM/target/reduce>

    from files=<f1,f2,...,fn> |
         [buffers=<b1,b2,...,bn>] |
         [systems=<s1,s2,...,sn>]

    with support=<s1,s2,...,sn> |
         conf=<c1,c2,...,cn> |
         [parents=<p1,p2,...,pm>] |
         [levels=<no>] |
         [itemsets=<i1,i2,...,im>] |
         [inputs=<i1,i2,...,im>] |
         [minrow=<r1>, maxrow=<r2>, mincol=<c1>]
    [using <ddm/gen/mlevel/mdim/online/incr/merge/stream/partition>]

1. mining <ARM/SPM/target/reduce>

The mining types supported in TDML are Association Rule Mining (ARM) [1], Sequential Pattern Mining (SPM) [3,6], target mining, in which rules are checked for whether they occur in the database, and “reduce”, a preprocessing statement in which rows or columns that do not contribute to the knowledge are removed and stored in a buffer from which they can be retrieved for future corrections [9].

2. from files=<f1,f2,...,fn> |
        [buffers=<b1,b2,...,bn>] |
        [systems=<s1,s2,...,sn>]

Here the transaction databases are stored in flat files, and the files are specified with syntax like files="mydat.txt". In the case of distributed mining, more than one transaction file is specified. For efficiency, the results of a process are stored in buffers, which can be used in later runs. We will explain these concepts using examples in the next section. Systems specifies the names of the systems where the flat files and buffers are located in the case of distributed data mining.

3. with support=<s1,s2,...,sn> |
        conf=<c1,c2,...,cn> |
        [parents=<p1,p2,...,pm>] |
        [levels=<no>] |
        [itemsets=<i1,i2,...,im>] |
        [inputs=<i1,i2,...,im>] |
        [minrow=<r1>, maxrow=<r2>, mincol=<c1>]

This portion specifies interestingness measures such as support and confidence [1]. More than one value can be given in the case of online mining. Parent (hierarchy) information can be given in text files with syntax like parents="p1.txt". In the case of multilevel mining, several levels of abstraction are specified; for example, levels=3 means that there are at most 3 levels. For target itemset mining we have the itemsets parameter; more rules can be checked by specifying more itemsets. In the case of stream mining, the stream data are obtained through inputs “i1, i2 … im”. When preprocessing is done, the irrelevant rows are removed using the minrow and maxrow parameters. Similarly, the columns that fall below the threshold mincol are removed. Only the parameters relevant to the chosen mining type need to be used.

The results of the mining process can be stored in flat files, named either as files or as buffers. Buffers are used for efficient processing in future runs, whereas files are used when the whole result is to be reused later.
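The clause structure of the grammar (mining, from, with, save, using) lends itself to a simple keyword-driven parse. The following Python sketch is illustrative only and is not the TDML compiler: the function name parse_statement, the layout of the returned dictionary, and the decision to split only double-quoted values on commas are all assumptions made for this example.

```python
import re

# Clause keywords taken from the grammar in the text.
CLAUSES = ("mining", "from", "with", "save", "using")

# A token is either key=value (value quoted or bare) or a standalone word.
_TOKEN = re.compile(
    r'(?P<key>\w+)\s*=\s*(?:"(?P<quoted>[^"]*)"|(?P<bare>[^\s,]+))'
    r'|(?P<word>\w+)'
)


def parse_statement(text):
    """Parse one TDML-style statement into a dict keyed by clause name.

    Hypothetical helper: "mining" and "using" map to a single word, while
    "from", "with" and "save" map to {parameter: [values]}.
    """
    # Normalize typographic quotes and drop a trailing terminator.
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    text = text.strip().rstrip(".;")

    parsed = {}
    clause = None
    for m in _TOKEN.finditer(text):
        word = m.group("word")
        if word is not None:
            if word.lower() in CLAUSES:
                clause = word.lower()
                parsed[clause] = None if clause in ("mining", "using") else {}
            elif clause in ("mining", "using"):
                parsed[clause] = word
            else:
                raise ValueError(f"unexpected word {word!r}")
            continue
        # key=value parameters are only valid inside from/with/save.
        if clause is None or not isinstance(parsed[clause], dict):
            raise ValueError(f"parameter {m.group('key')!r} outside from/with/save")
        if m.group("quoted") is not None:
            values = [v.strip() for v in m.group("quoted").split(",") if v.strip()]
        else:
            values = [m.group("bare")]
        parsed[clause][m.group("key")] = values
    return parsed


statement = '''
mining ARM
from files="d1.txt, d2.txt"
with support=40%
     conf=70%
save buffers="b.txt"
using ddm
'''
print(parse_statement(statement))
```

The sketch does not validate parameter names against the grammar or check which parameters belong to which mining type; a real front end would need both.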
5. [using <ddm/gen/mlevel/mdim/online/incr/merge/stream/partition>]

The using parameter specifies which type of knowledge is to be mined. Only one of the options is used at a time. If none is given, bitmap processing [2] is used to find association rules, like Apriori. The supported types are given below.

    ddm – Distributed data mining
    gen – Generalized data mining
    mlevel – Multilevel data mining
    mdim – Multidimensional data mining
    online – Online mining
    incr – Incremental mining
    merge – Merge mining
    stream – Stream mining
    partition – Partition mining

4. Syntax for Data Mining Primitives

In this section we discuss how the syntax can be used with respect to data mining primitives.

4.1. Syntax for task relevant data specification

The first step in defining a data mining task is the specification of the task relevant data, that is, the data on which mining is to be performed. This involves specifying the transaction database, conditions for selecting the relevant data, the relevant attributes or dimensions for exploration, and instructions regarding the ordering or grouping of the data retrieved. TDML provides clauses for the specification of such information as follows:

• from files=<f1,f2,…,fn>
       buffers=<b1,b2,…,bn>
       systems=<s1,s2,…,sm>

The from clause specifies the transaction database files, the intermediate buffers and, in the case of distributed systems, the system names.

• save files=<f1,f2,…,fn>
       buffers=<b1,b2,…,bn>

The save clause specifies where the resultant rules must be stored. Files specifies the names of the files where the data are stored, and buffers specifies where the resultant rules are stored.

4.2. Syntax for kind of knowledge to be mined

Here the different types of knowledge to be mined, such as association rules and sequential patterns, are specified using the following syntax.

• mining <ARM/SPM>
• using <ddm/gen/mlevel/mdim>

ARM specifies the knowledge to be mined as association rules and SPM specifies it as sequential patterns. Suitable data files must be selected using the “from” and “save” options as illustrated above. The other kinds of data mining tasks, such as distributed, generalized, multilevel and multidimensional data mining, are specified with the “using” clause.

4.3. Syntax for mining concept hierarchy specification

The concept hierarchy is specified using an option of the “with” clause as illustrated below.

• with parents=<p1,p2,…,pm>

Since more than one hierarchy may be attached to the items, we have provided the option to specify more than one.

4.4. Syntax for interestingness measure specification

Interestingness is specified using parameters like support and confidence. These are given in the “with” clause.

• with support=<s1,s2,…,sm>
       conf=<c1,c2,…,cm>

Since in online mining we can specify more than one support and confidence value and work through them iteratively, one after another, we allow multiple supports and multiple confidences.

4.5. Syntax for specifying dynamic databases

Dynamic databases are supported through the incremental, online and merge mining [7] options specified in the “using” clause.

• using <incr/online/merge>

4.6. Syntax for specifying special mining cases

Target mining is an interesting case in which we check whether a given rule is present in the database or not. We use the “mining” clause as illustrated below.
• mining <target>

Another special mining case is preprocessing of the database. This is also called transaction reduction [9].

• mining <reduce>
• with minrow=<r1>, maxrow=<r2>, mincol=<c1>
• save files=<f1,f2,…,fn>

The mining clause specifies that the transaction reduction method is used. The with clause specifies the threshold parameters minrow, maxrow and mincol. They represent the minimum number of rows, the maximum number of rows and the minimum number of columns using which the transaction database can be pruned. The resultant file is specified in the “save” clause.

5. Examples using TDML

systems and the result is passed as buffers to a central node. These results are merged and stored in a buffer. Network communication is used to communicate between the nodes and the server.

Distributed Data Mining

    mining ARM
    from files="d1.txt, d2.txt, d3.txt, d4.txt"
         buffers="b1.txt, b2.txt, b3.txt, b4.txt"
         systems="sys1, sys2, sys3, sys4"
    with support=40%
         conf=70%
    save buffers="b.txt"
    using ddm

In multilevel data mining, each item represents several levels of abstraction. The rules can be generated at level 1, level 2, and so on. To enable multilevel mining, the syntax below can be used.
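Independently of the TDML syntax, the notion of abstraction levels can be made concrete with a small Python sketch. It is not TDML and not the authors' implementation: the child-to-parent map PARENTS, the item names, and the convention that level 1 is the most general level are assumptions chosen for illustration, loosely modelled on what a parents file might encode. The map is assumed to be acyclic and the transaction list non-empty.

```python
from collections import Counter

# Hypothetical child -> parent map; top-level items have no entry.
PARENTS = {
    "skim milk": "milk",
    "whole milk": "milk",
    "milk": "dairy",
    "wheat bread": "bread",
    "bread": "bakery",
}


def ancestors(item, parents):
    """Return the path from the most general ancestor down to the item."""
    path = [item]
    while path[-1] in parents:
        path.append(parents[path[-1]])
    return path[::-1]


def generalize(transaction, parents, level):
    """Rewrite each item to its ancestor at the given level (1 = most general)."""
    out = set()
    for item in transaction:
        path = ancestors(item, parents)
        # Items whose path is shorter than the requested level stay as they are.
        out.add(path[min(level, len(path)) - 1])
    return out


def item_support(transactions, parents, level):
    """Fraction of transactions that contain each item at the chosen level."""
    counts = Counter()
    for t in transactions:
        counts.update(generalize(t, parents, level))
    n = len(transactions)
    return {item: c / n for item, c in counts.items()}


transactions = [{"skim milk", "wheat bread"}, {"whole milk"}, {"wheat bread"}]
for level in (1, 2, 3):
    print(level, item_support(transactions, PARENTS, level))
```

Items that are infrequent at the most specific level can become frequent once rewritten to a more general level, which is why the level at which rules are generated matters.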
using online

To merge the rules from a number of databases, we use merge data mining. Here the buffers are merged, not the actual databases. The example is given below.

As a transaction reduction tool, we use minimum rows, maximum rows and minimum columns as input parameters. We find the row sums and column sums and use these parameters to remove the rows and columns that are out of bounds. We store the removed rows and columns for future corrections. The example is given below.

    mining reduce
    from files="f.txt"
    with minrow=5, maxrow=50, mincol=100
    save files="f2.txt".

All the examples given above are meant to simplify the process of mining association rules using transaction databases. In the same way, TDML can be used for sequential pattern mining [3,6]. The compiler of this language implements each such mining process using various functions. We have developed an integrated development environment called TIDE (Transactional IDE), in which TDML programs can be developed and deployed.

200K 424 102 101
300K 593 153 155
400K 1057 198 226
500K 1113 258 265

More experiments have been done in an isolated manner for each such option, and further experiments will be done in a more integrated manner.

7. Conclusion and future work

Developing a data mining language for relational databases is relatively simpler than developing one for other
databases like transactional databases [8]. Also, the kind of knowledge to be mined normally does not include sequential pattern mining. We have developed TDML, which solves the above two problems. A limitation of this work is that the user should be familiar with the knowledge to be mined and with the types of association rule mining and sequential pattern mining. The user also needs to be familiar with the syntax and the ways in which its clauses can be combined. To reduce this limitation, we have developed an integrated environment which helps the TDML user. Applications include market basket data, bioinformatics and other areas where a user-friendly development environment is required. As future work, we would like to expand the knowledge to be mined to include classification, clustering and evolutionary analysis. We would also like to provide support for relational databases and data warehouses, and to make the functions available to users so that they can use them as a tailor-made utility.

8. References

[1] R. Agrawal, R. Srikant, “Fast Algorithms for Mining Association Rules”, Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994, pp. 487-499.

[2] Show-Jane Yen and Arbee L.P. Chen, “A graph based approach for discovering various types of association rules”, IEEE Transactions on Knowledge and Data Engineering, Vol. 13, No. 5, 2001, pp. 839-845.

[3] R. Agrawal, R. Srikant, “Mining Sequential Patterns”, Proc. 11th Int’l Conf. Data Eng., 1995.

[4] Jiawei Han, Micheline Kamber, “Data Mining: Concepts and Techniques”, Morgan Kaufmann Publishers, San Francisco, 2001.

[5] Willi Klösgen and Jan M. Zytkow, “Handbook of Data Mining and Knowledge Discovery”, Oxford University Press, 2002.

[6] Jay Ayres, Johannes Gehrke, Tomi Yiu, Jason Flannick, “Sequential Pattern Mining Using a Bitmap Representation”, KDD’02, 2002, pp. 429-435.

[7] Walid G. Aref, Mohamed G. Elfeky, Ahmed K. Elmagarmid, “Incremental, Online, and Merge Mining of Partial Periodic Patterns in Time-Series Databases”, IEEE TKDE, Vol. 16, No. 3, 2004, pp. 332-342.

[8] Jiawei Han et al., “DMQL: A Data Mining Query Language for Relational Databases”, Research Report, Simon Fraser University, Canada, 1996.

[9] Muthukumar A, Nadarajan R, “An Efficient Approach to Mining Association Rules in Transaction Databases”, Indian Journal of Information Science and Technology, Irofis Publications, 2005, pp. 10-16.

[10] Muthukumar A, Nadarajan R, “Efficient and Scalable partition based algorithms for mining association rule mining”, AOIJ Journal, Vol. 19, 2006.

[11] Muthukumar A, Nadarajan R, “Association rule mining of bit map transaction using transaction reduction, modified graph and partitioned approaches”, Journal of Computer Science, Karpagam publication, 2007, pp. 281-293.