ÉVA KOVÁCS
Cluj-Napoca, 2007
Contents
Chapter 1. Introduction
1.1. Definitions
1.2. Description of the process of data mining
1.3. Preprocessing and transformation of data for data mining
1.4. Operations in data mining
1.5. Clustering methods of the database
1.6. Predictive modeling methods
1.7. Link analysis methods
1.8. Structure of the thesis
Chapter 2
2.1. Clustering methods
2.2. Classification methods
2.3. Feature selection methods
2.4. Conclusions
Chapter 3. Development of Clustering Methods with Clustering with Prototype Entity Selection
3.1. Clustering with Prototype Entity Selection
3.2. Clustering with Prototype Entity Selection Algorithm
3.3. Optimized CPES using variable radius
3.4. CPES Algorithm with variable radius
3.5. K-means optimized by CPES
3.6. Experimental study
3.7. Results of the experimental study for CPES
3.8. Results of the experimental study for CPES with variable radius
3.9. Results of the experimental study for K-means with CPES
3.10. Analysis of the results
3.11. Conclusions
Annexes
Chapter 1. Introduction
The development of hardware during the last two decades has made it possible to store large amounts of data on computers. The exact volume of these data is impossible to state; all estimations are mere suppositions. Researchers from the University of California, Berkeley have calculated that approximately one exabyte (1 million terabytes) of data is generated and stored each year. Real-time data mining of databases is one of the main research areas in databases, and the volume of databases and their constant growth is an important problem in data mining.

Data mining is a new discipline in development which uses the resources and ideas of several fields. The abstract role of data mining is to discover new and useful information in databases. Data mining techniques develop models, structures, regularities, etc., in large databases. The models discovered in databases can be characterized according to accuracy, precision, interpretability and expressivity.

Knowledge can be defined in several ways, for example:
- a general term used to describe an object, idea, condition, situation or another fact that can be a number, a letter or a symbol; it can be a chart, an image and/or alphanumeric characters;
- elements of information that can be processed or produced by a computer;
- facts, known things, from which one can draw conclusions.

There are several definitions of data mining; we mention only a few of them:
- data mining consists of the extraction of predictive information hidden in a large database;
- data mining is the process through which advantageous models are discovered in databases;
- data mining is a rapidly growing field that combines the methods of databases, statistics, supervised learning and other related fields in order to extract useful information from existing data;
- data mining is the non-trivial process which identifies new, valid, useful and interpretable models in existing data.
Algorithms of data mining can be categorized according to the representation of the models, the input data and the field in which the algorithm is used. The model can be represented by decision trees, regression functions, associations or others. The most common classification of data mining algorithms divides them into four operations: predictive modeling, database clustering, link analysis and deviation detection [6][7][8][9]. Data mining is an iterative process which contains several steps, beginning with the understanding and definition of the problem and ending with the analysis of the results and the application of a strategy for the use of the results [14]. A data mining process is illustrated in Fig. 1.
Chapter 3. Development of Clustering Methods with Clustering with Prototype Entity Selection
This chapter presents three original clustering methods: Clustering with Prototype Entity Selection (CPES) [16][19], Clustering with Prototype Entity Selection with variable radius [17], and K-means with Clustering with Prototype Entity Selection [18]. First the methods are presented, then the experimental studies, followed by a comparison of the results with other clustering methods from the specialized literature. The CPES method is used in data mining both as an independent clustering process and in combination with K-means, a clustering algorithm often used in data mining.
The core cycle of the algorithm is: for each object $x_i$, an object $x_j$ is selected from the neighbourhood of its representative and, if $x_j$ has a higher fitness value, we set

cluster($x_i$) = $x_j$

until there are no more changes in the clusters.

Then the steps of the algorithm are described:

1. Constants. The fitness function of an object $x_i$ is

$f(x_i) = \sum_{k=1}^{n} \frac{1}{d(x_i, x_k) + A}$    (1)

where $d$ is the distance function and the average distance among the objects is

$d_{av} = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} d(x_i, x_j)}{n(n-1)}$    (2)

The constant $A$ and the radius $r$ of the neighbourhood are

$A = \frac{d_{av}}{n}$    (3)

$r = \frac{d_{av}}{2}$    (4)

2. After initialization $n$ clusters are defined; each object has its own cluster, namely cluster($x_i$) = $x_i$ for each $i = 1, \ldots, n$, where cluster($x_i$) is the cluster $x_i$ belongs to.

3. The cycle is repeated until the stop condition is satisfied, i.e. until there are no more changes in the values of cluster($x_i$), $i = 1, \ldots, n$, from one step to the next.

3.1. For each cluster($x_i$) an object is selected from the neighbourhood of radius $r$ of cluster($x_i$), i.e. an object cluster($x_j$) with

cluster($x_j$) $\in V(\mathrm{cluster}(x_i), r)$    (5)

For the selection, proportional selection is used, and the chosen object is marked cluster($x_p$).

3.2. The values of the function $f$ are compared for the objects cluster($x_i$) and cluster($x_p$), and the object with the higher value of the fitness function $f$ is kept. If the condition $f(\mathrm{cluster}(x_i)) < f(\mathrm{cluster}(x_p))$ is true, then the value of cluster($x_i$) is replaced by cluster($x_p$). If the condition is not true, then the value of cluster($x_i$) remains unchanged.
In each step of the algorithm the number of distinct clusters diminishes. In the end there will be only m distinct values of cluster(x_i); these are marked c_j, j = 1,…,m. These distinct values can be considered prototype entities for their clusters, and the objects cluster around c_j. The number of clusters thus obtained is equal to m, and for each object the cluster it belongs to is determined. By the CPES method a clean clustering is obtained, with an optimal number of clusters for the input data. If the analyst is not satisfied with the accuracy of the clusters, another clustering algorithm can be performed starting from the results already obtained by the CPES method.
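The CPES steps described above can be sketched in Python. This is an illustrative reconstruction based on equations (1)-(4), not the thesis's original implementation; the function name `cpes` and the order in which objects are scanned are choices of this sketch.

```python
import numpy as np

def cpes(X, rng=None):
    """Sketch of Clustering with Prototype Entity Selection (CPES).

    Each object starts as its own cluster representative; representatives
    migrate toward fitter neighbours until no change occurs.
    """
    rng = np.random.default_rng(rng)
    n = len(X)
    # pairwise distances d(x_i, x_k)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # constants from Eqs. (2)-(4)
    d_av = d.sum() / (n * (n - 1))      # average inter-object distance
    A = d_av / n
    r = d_av / 2                        # neighbourhood radius
    # fitness of each object, Eq. (1)
    f = (1.0 / (d + A)).sum(axis=1)

    cluster = np.arange(n)              # cluster(x_i) = x_i initially
    changed = True
    while changed:                      # stop when no representative changes
        changed = False
        for i in range(n):
            c = cluster[i]
            # neighbourhood V(cluster(x_i), r), Eq. (5); always contains c
            neigh = np.flatnonzero(d[c] <= r)
            # proportional (roulette-wheel) selection by fitness
            p = f[neigh] / f[neigh].sum()
            chosen = rng.choice(neigh, p=p)
            if f[c] < f[chosen]:        # keep the fitter representative
                cluster[i] = chosen
                changed = True
    return cluster                      # prototype entity index per object
```

The m prototype entities c_j are then the distinct values of the returned array, e.g. `np.unique(cpes(X))`.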
With each step of the algorithm that is performed, the value of the radius increases, so the migration among clusters will be smaller than in the case of a radius with fixed value. A scheme is used to increase the value of the radius, defined by the following equation:

$r_{k+1} = \frac{r_k}{\alpha}$    (6)

where $\alpha$ is a positive constant smaller than 1.
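The increasing-radius scheme of equation (6) can be illustrated with a short sketch; the name of the update constant (written `alpha` here) and the concrete starting values are illustrative assumptions only.

```python
def radius_schedule(r0, alpha, steps):
    """Yield r_0, r_1, ..., r_steps with r_{k+1} = r_k / alpha, 0 < alpha < 1.

    Because alpha is smaller than 1, the radius grows with each step, so
    representatives migrate less and less as the algorithm proceeds.
    """
    r = r0
    for _ in range(steps + 1):
        yield r
        r = r / alpha

# e.g. starting radius 2.0 and alpha = 0.8 give a strictly growing sequence
radii = list(radius_schedule(2.0, 0.8, 3))
```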
By using the CPES method we obtain both the optimal number of clusters for the data set proposed for data mining, m, and the prototype entities c_j, j = 1,…,m. These data will be used as input data for K-means. The K-means method will not start with an arbitrary selection of a representative object for each cluster; instead it will use the prototype entities c_j, j = 1,…,m, obtained by the CPES algorithm. These prototype entities c_j represent the means, the centres, of the clusters. As they are local maxima of the fitness function, there will be no massive migration among the clusters formed during the first step, so the K-means algorithm will need only a short running time. Because the number of clusters is generated by the CPES method, and not specified by the user, an optimal number of clusters will be obtained. In what follows the K-means method is carried out as usual: each object is assigned to the cluster it resembles the most, the mean of each cluster is recalculated, and then each object is reassigned to the clusters according to the new means. Consequently the K-means algorithm is improved by CPES in the following respects:
- the user does not need to state the number of clusters, as it is generated by CPES;
- the algorithm does not need to choose the first representative objects at random; they are the prototype entities generated by CPES.
The described optimization was presented within the framework of the K-means method. Nevertheless it can also be applied to other clustering methods which need the number of clusters as input data, for example K-medoid.
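The combination described above can be sketched as standard K-means whose initial centres are supplied externally, e.g. the prototype entities produced by CPES. The function name `kmeans_from_prototypes` is hypothetical, and the sketch is an illustration of the idea, not the thesis implementation.

```python
import numpy as np

def kmeans_from_prototypes(X, centers, max_iter=100):
    """K-means initialised with given centres (e.g. the CPES prototype
    entities c_j) instead of randomly chosen representative objects."""
    centers = np.asarray(centers, dtype=float).copy()
    for _ in range(max_iter):
        # assign each object to the nearest centre
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # recompute each centre as the mean of its cluster
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(len(centers))
        ])
        if np.allclose(new_centers, centers):
            break                       # converged: no centre moved
        centers = new_centers
    return labels, centers
```

In this scheme one would pass `X[np.unique(cluster)]` from the CPES step as `centers`; since those prototypes already sit inside dense regions, few refinement iterations are needed.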
The algorithm can be summarized as:

$R = \emptyset$, $i = 0$, $U_0 = U$, $R_0 = C$
repeat
    while CORE($R_i$) $= \emptyset$ set $R_i = R_i \setminus \{a\}$, $a \in R_i$
    set $R = R \cup \mathrm{CORE}(R_i)$
    $U_i^{inc} = \{ x \in U_i \mid \exists y \in U_i : f(x, R_i) = f(y, R_i) \wedge f(x, D) \neq f(y, D) \}$
    $U_{i+1} = U_i \setminus U_i^{inc}$
    $i = i + 1$
until $POS_R(D) = POS_C(D)$
In the first step of the algorithm the reduct set R is initialized with the empty set of attributes and the variable i with the number 0. The second step is repeated until POS_R(D) = POS_C(D), i.e. until the set R of attributes is a reduct. In this step the core set of the set of attributes R_i is calculated. If the core of R_i is an empty set, it means that any attribute a can be removed from R_i without modifying the relation of indiscernibility: there is no indispensable attribute a in R_i, for any a ∈ R_i. Thus an attribute is removed from R_i and the core set of the new R_i is searched until the core set is no longer an empty set. This core set is added to the set R. Then the number i is increased by 1 and the new R_i and U_i are calculated for the next step of the cycle.
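The rough-set notions used above (indiscernibility classes, positive region, core) can be illustrated on a small decision table. The table contents and attribute names below are hypothetical examples for the definitions only; this is not the thesis's feature-selection implementation.

```python
from collections import defaultdict

def indiscernibility_classes(table, attrs):
    """Group object indices by their values on the given attributes."""
    classes = defaultdict(list)
    for i, row in enumerate(table):
        classes[tuple(row[a] for a in attrs)].append(i)
    return list(classes.values())

def positive_region(table, attrs, decision):
    """POS_attrs(D): objects whose indiscernibility class is consistent,
    i.e. all objects in the class share the same decision value."""
    pos = set()
    for cls in indiscernibility_classes(table, attrs):
        if len({table[i][decision] for i in cls}) == 1:
            pos.update(cls)
    return pos

def core(table, attrs, decision):
    """CORE: attributes whose removal shrinks the positive region."""
    full = positive_region(table, attrs, decision)
    return {a for a in attrs
            if positive_region(table, [b for b in attrs if b != a],
                               decision) != full}
```

For instance, on a four-object table with condition attributes "headache" and "temp" and decision "flu", `core` returns exactly the attributes that cannot be dropped without making some objects indiscernible yet differently classified.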
Chapter 5 presents original contributions to the classification of the objects of a database using elements of Rough Set Theory. A new classification method is presented, Reduct Equivalent Rule Induction (RERI), based on Rough Set Theory. Chapter 5 contains the following personal contributions:
1. Using Theorem 2 described in Chapter 4, I proposed a new classification algorithm, Reduct Equivalent Rule Induction.
2. I implemented the proposed method, Reduct Equivalent Rule Induction.
3. I tested the method proposed in Chapter 5 with 10 different data sets. These data sets were created and used by several researchers, are reference examples for classification and data mining, and belong to the UCI collection.
4. I compared the accuracy of the new RERI algorithm with other classification methods from the specialized literature, CHAID and QUEST.

Data mining is a field that can be developed in many directions. The data mining process should be as automatic as possible, without excessive intervention of the analyst. In order to achieve the highest possible degree of automation, the system must be capable of preprocessing the data correctly and optimally, so that data mining algorithms can extract useful information from it. Based both on the specialized literature and on the results obtained in this thesis, I propose some new research directions:
- optimization of clustering methods that need the number of clusters as input data;
- use of Rough Set Theory in data mining within the framework of the discovery of associations;
- use of Rough Set Theory in reducing the dimension of the data set proposed for data mining.
Selected references
[1] Abidi, S. S. R., Hoe, K. M., Goh, A., Analyzing Data Clusters: A Rough Set Approach to Extract Cluster Defining Symbolic Rules, Lecture Notes in Computer Science 2189: Advances in Intelligent Data Analysis, Fourth International Conference of Intelligent Data Analysis, Cascais, Portugal, Springer Verlag, pages 248-257, 2001.
[2] Blake, C. L., Merz, C. J., UCI Repository of machine learning databases, http://www.ics.uci.edu/~mlearn/MLRepository.html, Irvine, CA: University of California, Department of Information and Computer Science, 1998.
[3] Deogun, J. S., Raghavan, V. V., Sever, H., Rough Set Based Classification Methods and Extended Decision Tables, Third International Workshop on Rough Sets and Soft Computing, The Society for Computer Simulation, pages 302-306, 1995.
[4] Elkan, C., Using the Triangle Inequality to Accelerate k-Means, Proceedings of the Twentieth International Conference on Machine Learning (ICML'03), pages 147-153, 2003.
[5] Farnstrom, F., Lewis, J., Elkan, C., Scalability for Clustering Algorithms Revisited, ACM SIGKDD Explorations, Volume 2, Issue 1, pages 51-57, August 2000.
[6] Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., From Data Mining to Knowledge Discovery: An Overview, Advances in Knowledge Discovery and Data Mining, pages 1-43, AAAI Press / The MIT Press, 1996.
[7] Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R., Advances in Knowledge Discovery and Data Mining, AAAI Press / The MIT Press, 1996.
[8] Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., From Data Mining to Knowledge Discovery in Databases, AI Magazine 17, pages 37-54, 1996.
[9] Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., The KDD Process for Extracting Useful Knowledge from Volumes of Data, Communications of the ACM, Volume 39, No. 11, November 1996.
[10] Fisher, R. A., The use of multiple measurements in taxonomic problems, Annals of Eugenics, 7, Part II, pages 179-188, 1936.
[11] Guan, J. W., Bell, D. A., Liu, D. Y., The Rough Set Approach to Association Rule Mining, Proceedings of the Third IEEE International Conference on Data Mining, pages 529-532, 2003.
[12] Hegland, M., Data Mining: Challenges, Models, Methods and Algorithms, May 2003, http://datamining.anu.edu.au/
[13] Holte, R. C., Very Simple Classification Rules Perform Well on Most Commonly Used Datasets, Machine Learning, Vol. 11, pages 63-90, 1993.
[14] John, G. H., Enhancements to the Data Mining Process, PhD thesis, Stanford University, 1997.
[15] Kanungo, T., Mount, D. M., Netanyahu, N. S., Piatko, C. D., Silverman, R., Wu, A. Y., An Efficient k-Means Clustering Algorithm: Analysis and Implementation, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 7, pages 881-892, July 2002.
[16] Kovács, É., Ignat, I., Clustering with Prototype Entity Selection in Data Mining, Proceedings of the IEEE International Conference on Automation, Quality and Testing, Robotics, Cluj-Napoca, Romania, pages 415-419, 25-28 May 2006, ISBN 1-4244-0360-X.
[17] Kovács, É., Ignat, I., Clustering with Prototype Entity Selection with variable radius, Proceedings of the 2nd IEEE International Conference on Intelligent Computer Communication and Processing, Cluj-Napoca, Romania, pages 17-21, 1-2 September 2006, ISBN (10) 973-662-233-9.
[18] Kovács, É., Ignat, I., Clustering with Prototype Entity Selection to Improve K-Means, Advances in Intelligent Systems and Technologies, Selected Papers of the 4th European Conference on Intelligent Systems and Technologies, Iasi, Romania, pages 37-47, 20-23 September 2006, ISBN (10) 973-730-246-X.
[19] Kovács, É., Ignat, I., Clustering With Prototype Entity Selection Compared With K-Means, accepted by The Journal of Control Engineering and Applied Informatics, March 2007.
[20] Kovács, É., Ignat, I., Reduct Equivalent Feature Selection Based On Rough Set Theory, Proceedings of the microCAD International Scientific Conference, Miskolc, Hungary, pages 47-52, 22-23 March 2007, ISBN 978-963-661-742-4.
[21] Kovács, É., Ignat, I., Reduct Equivalent Rule Induction Based On Rough Set Theory, in print.
[22] Lingras, P., Rough Set Clustering for Web Mining, Proceedings of the 2002 IEEE International Conference on Fuzzy Systems, pages 12-17, Honolulu, Hawaii, May 12-17, 2002.
[23] Lingras, P., Yan, R., West, C., Comparison of conventional and rough k-means clustering, Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing, 9th International Conference, RSFDGrC 2003, Chongqing, China, pages 130-137, May 26-29, 2003.
[24] MacQueen, J., Some methods for classification and analysis of multivariate observations, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1, pages 281-297, Berkeley, California, University of California Press, 1967.
[25] Magidson, J., Vermunt, J. K., Latent class modeling as a probabilistic extension of K-means clustering, Quirk's Marketing Research Review, pages 77-80, March 2002.
[26] Magidson, J., Vermunt, J. K., Latent class models for clustering: A comparison with K-means, Canadian Journal of Marketing Research, pages 36-43, 2002.
[27] Mazlack, L., He, A., Zhu, Y., Coppock, S., A Rough Set Approach in Choosing Partitioning Attributes, Proceedings of the ISCA 13th International Conference (CAINE-2000), November 2000.
[28] Pawlak, Z., On Rough Sets, Bulletin of the European Association for Theoretical Computer Science 24, pages 94-109, 1984.
[29] Pawlak, Z., Rough Classification, International Journal of Man-Machine Studies 20, pages 469-483, 1984.
[30] Raghavan, V. V., Sever, H., The State of Rough Sets for Database Mining Applications, Proceedings of the 23rd Computer Science Conference Workshop on Rough Sets and Database Mining, pages 1-11, San Jose, CA, 1995.
[31] Ultsch, A., Clustering with SOM: U*C, Proceedings of the Workshop on Self-Organizing Maps, Paris, France, pages 75-82, 2005.
[32] Zhang, M., Yao, J. T., A Rough Sets Based Approach to Feature Selection, Proceedings of the 23rd International Conference of NAFIPS, Banff, Canada, pages 434-439, 2004.