
Web Site: www.ijettcs.org  Email: editor@ijettcs.org, editorijettcs@gmail.com  Volume 2, Issue 1, January-February 2013  ISSN 2278-6856

An Incremental Clustering Algorithm for a Multi-Viewpoint Similarity Measure


G. Sailaja 1, B. Dhanalakshmi 2, Y. Bharathi 3, C. Ramesh Reddy 4

1 Quba Engineering College, Nellore
2,3,4 Sree Rama Engineering College, Tirupathi

Abstract: In high-dimensional data, common distance measures can be influenced by noise. Existing clustering algorithms are based on partitioning, hierarchical, density-based, and grid-based approaches. All clustering methods have to assume some cluster relationship among the data objects to which they are applied, and the similarity between a pair of objects can be defined either explicitly or implicitly. In this paper, we introduce a novel multiviewpoint-based similarity measure and two related clustering methods. Our main objective is to cluster web documents, so we propose multi-viewpoint-based clustering methods with this similarity measure, using an incremental algorithm approach for clustering high-dimensional data. The major difference between a traditional dissimilarity/similarity measure and ours is that the former uses only a single viewpoint, which is the origin, while the latter utilizes many different viewpoints: objects assumed not to be in the same cluster as the two objects being measured. Using multiple viewpoints, a more informative assessment of similarity can be achieved. Experimental results demonstrate that the proposed clustering algorithm produces high-quality clusters, and topic ontology provides interpretations of news topics at different levels of abstraction.

Keywords: Clustering, multi-view learning, incremental algorithm, similarity measure.

1. INTRODUCTION
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar, in some sense, to each other than to those in other groups (clusters). It is a main task of exploratory data mining and a common technique for statistical data analysis used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics. Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to find clusters efficiently. Popular notions of clusters include groups with small distances among the cluster members, dense areas of the data space, intervals, or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization problem. The appropriate clustering algorithm and parameter settings (including values such as the distance function to use, a density threshold, or the number of expected clusters) depend on the individual data set and the intended use of the results. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and error. It is often necessary to modify preprocessing and parameters until the result achieves the desired properties.

Clustering can be considered the most important unsupervised learning problem; as with every other problem of this kind, it deals with finding structure in a collection of unlabeled data. A loose definition of clustering is the process of organizing objects into groups whose members are similar in some way. A cluster is therefore a collection of objects that are similar to one another and dissimilar to objects belonging to other clusters. We can show this with a simple graphical example.

Fig. 1: Clustering

In this case we easily identify the four clusters into which the data can be divided; the similarity criterion is distance: two or more objects belong to the same cluster if they are close according to a given distance (in this case, geometric distance). This is called distance-based clustering. Another kind of clustering is conceptual clustering: two or more objects belong to the same cluster if the cluster defines a concept common to all those objects. In other words, objects are grouped according to their fit to descriptive concepts, not according to simple similarity measures.
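As a small, hedged illustration of the distance-based clustering described above (separate from the method proposed in this paper), the following sketch uses scikit-learn's KMeans to group synthetic 2-D points by Euclidean distance, in the spirit of the four-cluster example of Fig. 1:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Four well-separated groups of 2-D points, in the spirit of Fig. 1.
X, _ = make_blobs(n_samples=200, centers=4, cluster_std=0.5, random_state=0)

# KMeans assigns each point to the cluster with the nearest (Euclidean) centroid.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(labels))  # roughly 50 points per cluster
```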

The multi-view approach to learning is one in which we have several views of the data (sometimes in a rather abstract sense), and the goal is to use the relationship between these views to alleviate the difficulty of a learning problem of interest. In this work, we explore how having two views makes the clustering problem significantly more tractable. A natural multi-view assumption is that the views are (conditionally) uncorrelated, conditioned on which mixture component generated the views. There are many natural applications for which this assumption applies. For example, we can consider multi-modal views, with one view being a video stream and the other an audio stream of a speaker; here, conditioned on the speaker identity and perhaps the phoneme (both of which could label the generating cluster), the views may be uncorrelated. A second example is the words and link structure in a document from a corpus such as Wikipedia; here, conditioned on the category of each document, the words in the document and its link structure may be uncorrelated.

An incremental clustering algorithm has also been developed by Ester et al. from a data mining perspective. Specifically, they developed an incremental version of DBSCAN, a density-based clustering algorithm. However, DBSCAN and its incremental version are partitional clustering algorithms. Our approach is more closely related to agglomerative hierarchical clustering techniques [3, 10], and can therefore be viewed as the incremental version of the more traditional bottom-up hierarchical clustering methods.

The work in this paper is motivated by investigations from the above and similar research findings. It appears to us that the nature of the similarity measure plays a very important role in the success or failure of a clustering method. Our first objective is to derive a novel method for measuring similarity between data objects in sparse, high-dimensional domains, particularly text documents. From the proposed similarity measure, we then formulate new clustering criterion functions and their respective clustering algorithms, which are fast and scalable like k-means, but are also capable of providing high-quality and consistent performance.

The remainder of this paper is organized as follows. In Section 2, we review related literature on similarity and clustering based on incremental algorithms. We then present our proposed document similarity measure in Section 3, followed by two criterion functions for document clustering and their optimization algorithms in Section 4. The conclusion is given in Section 5, and references are listed in Section 6.

Table 1: Notations

2. INCREMENTAL ALGORITHM
Our approach aims to construct a concept hierarchy with two properties: homogeneity and monotonicity. Informally, a homogeneous cluster is a set of objects with similar density. A hierarchy of clusters satisfies the monotonicity property if the density of a cluster is always higher than the density of its parent; that is, the density of clusters monotonically increases along any path in the concept hierarchy from the root to a leaf node.

A cluster hierarchy is basically a tree structure in which leaf nodes represent singleton clusters covering single data points. Each node in the tree maintains two types of information: the cluster center and the cluster density. The cluster density describes the spatial distribution of the child nodes of a node. We define a cluster's density as the average distance to the closest neighbor among the cluster's members. A natural way of obtaining these nearest-neighbor distances is to create the minimum spanning tree (MST) of the objects in the cluster. Specifically, the density representation of a node is the set of nearest-neighbor distances NDP = {d1, d2, ...}, summarized by the mean and the standard deviation of NDP. Each di in NDP is the length of an edge, measured by the distance between two nodes, of the MST structure connecting a node's child nodes. In general, the distance between two nodes, w.r.t. the nodes' cluster centers, can be measured using the Ln distance function family.

Figure 2. Restructuring operators: (a) node insertion operator, (b) hierarchy insertion operator, (c) demotion operator, (d) merging operator, and (e) splitting operator. SK, SI, and SJ are the sets of child nodes of NK, NI, and NJ, respectively; SK = SI ∪ SJ and (SI, SJ) = φ(SK), where φ is a splitting function that separates SK into two disjoint sets SI and SJ by disconnecting an edge in the cluster's MST structure. (NI, NJ) = SPLIT(φ, NK) splits NK into NI and NJ with respect to the splitting function φ.

Definition 1: Let NDP be the density representation of a cluster C. Given a lower limit LL = f and an upper limit UL = g, the cluster C is homogeneous with respect to f and g if and only if LL ≤ di ≤ UL for all di ∈ NDP.

Definition 2: Let C be a homogeneous cluster. Given a new point A, let B be the member of C that is the nearest neighbor to A, and let d be the distance from A to B. A (and B) is said to form a higher (lower) dense region in C if d < LL (d > UL, respectively).
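To make the density representation and Definitions 1-2 concrete, here is a small sketch (not the authors' code): NDP is taken as the edge lengths of the MST built over a cluster's members, and the limit functions f and g, which the text leaves abstract, are assumed for illustration to be the mean of NDP minus/plus one standard deviation.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import cdist, pdist, squareform

def density_representation(points):
    """NDP: the edge lengths of the MST built over the cluster's members."""
    dists = squareform(pdist(points))       # pairwise Euclidean distances
    mst = minimum_spanning_tree(dists)      # sparse matrix holding the MST edges
    return mst.data                         # the d_i values of NDP

def limits(ndp, k=1.0):
    """Illustrative limit functions f and g (assumed: mean -/+ k * std of NDP)."""
    return ndp.mean() - k * ndp.std(), ndp.mean() + k * ndp.std()

def classify_new_point(points, new_point):
    """Definition 2: does the new point form a higher/lower dense region in C?"""
    ndp = density_representation(points)
    ll, ul = limits(ndp)
    d = cdist(new_point[None, :], points).min()   # distance to nearest member B
    if d < ll:
        return "higher dense region"
    if d > ul:
        return "lower dense region"
    return "homogeneous fit"                      # LL <= d <= UL

# Example: a tight 2-D cluster and a far-away point.
cluster = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]])
print(classify_new_point(cluster, np.array([2.0, 2.0])))  # -> lower dense region
```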


Our approach to incorporating a new data point into a cluster hierarchy incrementally can be divided into two stages. During the first stage, the algorithm locates a node in the hierarchy that can host the new data point. The second stage performs hierarchy restructuring. This two-stage algorithm is applied on observing the third and subsequent data points; the initial hierarchy is created by merging the first two points. Locating the initial placement of a new data point during the first stage is carried out in a bottom-up fashion:
1. Find the closest point over the leaf nodes.
2. Starting from the parent of the closest leaf node, perform an upward search to locate a cluster (or create a new cluster hierarchy) that can host the new point with minimal density changes and minimal disruption of the hierarchy monotonicity.

Let N be the node being examined at the current level. The placement of a new point NJ in the hierarchy is performed according to the following rules (a schematic sketch of these rules is given after the two algorithms below):
- If LL ≤ d ≤ UL, then perform INS_NODE(N, NJ) (see Fig. 2a), where d is the distance from the new point NJ to the nearest child node of N.
- If NJ forms a higher dense region on N, and NJ forms a lower dense region on at least one of N's child nodes, then perform INS_HIERARCHY(NI, NJ) (see Fig. 2b), where NI is the child node of N closest to the new point NJ.

If none of the rules applies, the search proceeds to the next higher-level cluster. If the search process reaches the top-level cluster, a new cluster is inserted at the top level using the hierarchy insertion operator.

The second stage aims to recover any structural changes that occur after incorporating a new data point. The following algorithm describes the hierarchy restructuring process.

Algorithm Hierarchy Restructuring:
1. Let crntNode be the node that accepts the new point.
2. While crntNode is not the root:
3.   Let parentNode ← Parent(crntNode).
4.   Recover the siblings of crntNode that are misplaced.
5.   Maintain the homogeneity of crntNode.
6.   Let crntNode ← parentNode.

One of the most common problems is that a node is stranded at an upper-level cluster. In such a case, a node NJ, which is supposed to be a child node of NI, is misplaced as NI's sibling. Line 4 addresses this issue by utilizing Definition 2 to detect the problem. Specifically, a node NJ, which is the sibling of NI, is said to be misplaced as NI's sibling if and only if NJ does not form a lower dense region in NI. If such a problem is detected, we iteratively apply DEMOTE(NI, NJ) (see Fig. 2c). Homogeneity maintenance (line 5) works in a divide-and-conquer fashion: it receives a cluster N and replaces N by one or more homogeneous clusters.

Algorithm Homogeneity Maintenance(N):
1. Let an input N be the node that is being examined.
2. Repeat:
3.   Let NI and NJ be the pair of neighbors among N's child nodes with the smallest nearest distance.
4.   If NI and NJ form a higher dense region,
5.   then MERGE(NI, NJ) (see Fig. 2d),
6. until there is no higher dense region found in N during the last iteration.
7. Let MI and MJ be the pair of neighbors among N's child nodes with the largest nearest distance.
8. If MI and MJ form a lower dense region in N,
9. then let (NI, NJ) = SPLIT(φ, N) (see Fig. 2e), and
10. Call Homogeneity Maintenance(NI).
11. Call Homogeneity Maintenance(NJ).
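The following sketch illustrates only the first-stage placement rules for a single node N; it is an assumption-laden simplification, not the authors' implementation. The limits LL/UL are again assumed to be the mean minus/plus k standard deviations of the nearest-neighbor distances among N's children, and the second condition of the hierarchy-insertion rule (a lower dense region on a child node) is omitted for brevity.

```python
import numpy as np

def placement_rule(child_centers, new_point, k=1.0):
    """Decide how node N should treat a new point: 'INS_NODE', 'INS_HIERARCHY',
    or 'GO_UP' (pass the point to N's parent)."""
    # Nearest-neighbor distances among N's children: a simple stand-in for NDP.
    pair = np.linalg.norm(child_centers[:, None, :] - child_centers[None, :, :], axis=-1)
    np.fill_diagonal(pair, np.inf)
    ndp = pair.min(axis=1)
    ll, ul = ndp.mean() - k * ndp.std(), ndp.mean() + k * ndp.std()

    d = np.linalg.norm(child_centers - new_point, axis=1).min()  # nearest child of N
    if ll <= d <= ul:
        return "INS_NODE"        # the point fits N's density band (Fig. 2a)
    if d < ll:
        return "INS_HIERARCHY"   # higher dense region: push under closest child (Fig. 2b)
    return "GO_UP"               # lower dense region: examine N's parent next

centers = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(placement_rule(centers, np.array([2.0, 0.0])))  # -> INS_NODE
```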

3. MULTI-VIEWPOINT-BASED SIMILARITY


The cosine similarity can be expressed in the following form without changing its meaning:

Sim(di, dj) = cos(di − 0, dj − 0) = (di − 0)^t (dj − 0),

where 0 is the vector that represents the origin point. According to this formula, the measure takes 0 as the one and only reference point. The similarity between two documents di and dj is determined w.r.t. the angle between the two points when looking from the origin. To construct a new concept of similarity, it is possible to use more than just one point of reference. We may have a more accurate assessment of how close or distant a pair of points is if we look at them from many different viewpoints. From a third point dh, the directions and distances to di and dj are indicated, respectively, by the difference vectors (di − dh) and (dj − dh). By standing at various reference points dh to view di and dj and working on their difference vectors, we define the similarity between the two documents, belonging to a cluster Sr, as

Sim(di, dj | di, dj ∈ Sr) = (1 / (n − nr)) Σ_{dh ∈ S\Sr} Sim(di − dh, dj − dh),   (2)

where the reference points dh are the documents outside the cluster Sr that contains di and dj.
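A direct reading of this definition can be sketched as follows; the choice of the dot product of the difference vectors as the per-viewpoint similarity (i.e., the cosine of their angle scaled by their lengths) is an illustrative assumption, and the viewpoint set is simply whatever documents are passed in.

```python
import numpy as np

def mvs(di, dj, viewpoints):
    """Multi-viewpoint similarity of two document vectors.

    viewpoints: (m, dim) array of reference documents d_h, assumed to lie
    outside the cluster containing d_i and d_j.  The per-viewpoint similarity
    is taken here as the dot product of the difference vectors (illustrative)."""
    diffs_i = di - viewpoints      # d_i - d_h for every viewpoint d_h
    diffs_j = dj - viewpoints      # d_j - d_h for every viewpoint d_h
    return float(np.mean(np.sum(diffs_i * diffs_j, axis=1)))

# Tiny example with L2-normalized "documents".
docs = np.array([[1.0, 0.0, 0.0],
                 [0.9, 0.1, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0]])
docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
print(mvs(docs[0], docs[1], viewpoints=docs[2:]))
```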




The two data sets were preprocessed by stop-word removal and stemming. Moreover, we removed words that appear in fewer than two documents or in more than 99.5 percent of the total number of documents. Finally, the documents were weighted by TF-IDF and normalized to unit vectors. The full characteristics of reuters7 and k1b are presented in Fig. 3.

Fig. 4 shows the validity scores of CS and MVS on the two data sets relative to the parameter percentage. The value of percentage is set at 0.001, 0.01, 0.05, 0.1, and 0.2, ..., 1.0. According to Fig. 4, MVS is clearly better than CS for both data sets in this validity test. For example, with the k1b data set at percentage 1.0, the MVS validity score is 0.80, while that of CS is only 0.67. This indicates that, on average, when we pick any document and consider its neighborhood of size equal to its true class size, only 67 percent of that document's neighbors based on CS actually belong to its class; based on MVS, the number of valid neighbors increases to 80 percent. The validity test has illustrated the potential advantage of the new multiviewpoint-based similarity measure compared to the cosine measure.
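The validity test described above can be reproduced in outline as follows. This is a hedged reading of the procedure, since the exact scoring details are not restated in this paper: the neighborhood size is assumed to be percentage times the document's true class size, and the score is the average fraction of same-class documents in that neighborhood.

```python
import numpy as np

def validity_score(sim_matrix, labels, percentage=1.0):
    """Average fraction of same-class documents among each document's
    top-(percentage * class size) most similar neighbors."""
    labels = np.asarray(labels)
    class_sizes = {c: int(np.sum(labels == c)) for c in np.unique(labels)}
    scores = []
    for i in range(len(labels)):
        k = max(1, int(percentage * class_sizes[labels[i]]))
        order = np.argsort(-sim_matrix[i])     # most similar first
        order = order[order != i][:k]          # drop the document itself
        scores.append(np.mean(labels[order] == labels[i]))
    return float(np.mean(scores))

# Example with cosine similarity on random unit-normalized "TF-IDF" vectors.
X = np.random.rand(20, 50)
X = X / np.linalg.norm(X, axis=1, keepdims=True)
labels = np.array([0] * 10 + [1] * 10)
print(validity_score(X @ X.T, labels, percentage=1.0))
```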

Another popular graph-based clustering technique is implemented in a software package called CLUTO [19]. This method first models the documents with a nearest-neighbor graph, and then splits the graph into clusters using a min-cut algorithm. Besides the cosine measure, the extended Jaccard coefficient can also be used in this method to represent the similarity between nearest neighbors. Given non-unit document vectors ui and uj (d = u / ||u||), their extended Jaccard coefficient is

Sim_eJacc(ui, uj) = (ui^t uj) / (||ui||^2 + ||uj||^2 − ui^t uj).   (3)
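A direct implementation of the extended Jaccard coefficient in eq. (3) for non-unit document vectors:

```python
import numpy as np

def extended_jaccard(ui, uj):
    """Extended Jaccard coefficient between two (non-unit) document vectors, eq. (3)."""
    dot = float(np.dot(ui, uj))
    return dot / (float(np.dot(ui, ui)) + float(np.dot(uj, uj)) - dot)

print(extended_jaccard(np.array([1.0, 2.0, 0.0]), np.array([2.0, 1.0, 1.0])))  # 4/7
```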

Fig. 4: CS and MVS validity test.
Fig. 5: Incremental algorithm.

4. EVALUATION
We refer to our clustering framework as MVSC, meaning Clustering with Multiviewpoint-based Similarity. Subsequently, we have MVSC-IR and MVSC-IV, which are MVSC with criterion functions IR and IV, respectively. The main goal is to perform document clustering by optimizing IR and IV; for this purpose, the incremental k-way algorithm, a sequential version of k-means, is employed. Considering that the expression of IV depends only on nr and Dr, r = 1, ..., k, IV can be written in a general form.

During each refinement iteration, every document is considered for moving to the cluster that leads to the highest improvement of the criterion. If no cluster is better than the current cluster, the document is not moved. The clustering process terminates when an iteration completes without any documents being moved to new clusters. Unlike the traditional k-means, this algorithm is a stepwise optimal procedure.

To verify the advantages of our proposed methods, we evaluate their performance in experiments on document data. The objective of this section is to compare MVSC-IR and MVSC-IV with existing algorithms that also use specific similarity measures and criterion functions for document clustering. The similarity measures to be compared include Euclidean distance, cosine similarity, and the extended Jaccard coefficient.
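A schematic version of the incremental (sequential) k-way refinement described above is sketched below. The criterion function is passed in as a parameter; the actual IR and IV of MVSC are not reproduced here, and the cosine-coherence stand-in used in the example is only an assumption for illustration.

```python
import numpy as np

def incremental_kway(X, labels, k, criterion):
    """Sequentially move documents between clusters while the criterion improves.

    X:         (n, dim) document matrix.
    labels:    initial assignment, values in 0..k-1.
    criterion: callable(X, labels) -> float, larger is better (stands in for IR or IV)."""
    labels = labels.copy()
    moved = True
    while moved:
        moved = False
        for i in range(len(X)):
            best_c, best_val = labels[i], criterion(X, labels)
            for c in range(k):
                if c == labels[i]:
                    continue
                trial = labels.copy()
                trial[i] = c
                val = criterion(X, trial)
                if val > best_val:              # move only if it strictly helps
                    best_c, best_val = c, val
            if best_c != labels[i]:
                labels[i] = best_c
                moved = True
    return labels

# Stand-in criterion: total norm of each cluster's composite vector (cosine coherence).
def coherence(X, labels):
    return sum(np.linalg.norm(X[labels == c].sum(axis=0)) for c in np.unique(labels))

X = np.vstack([np.random.rand(10, 5) + off for off in (0.0, 5.0)])
X = X / np.linalg.norm(X, axis=1, keepdims=True)
print(incremental_kway(X, np.random.randint(0, 2, 20), 2, coherence))
```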


Table 2: Document data sets

5. CONCLUSION
In this paper, we propose a multiviewpoint-based similarity measuring method, named MVS. Theoretical analysis and empirical examples show that MVS is potentially more suitable for text documents than the popular cosine similarity. Based on MVS, two criterion functions, IR and IV, and their respective clustering algorithms, MVSC-IR and MVSC-IV, have been introduced. Compared with other state-of-the-art clustering methods that use different types of similarity measures, on a large number of document data sets and under different evaluation metrics, the proposed algorithms provide significantly improved clustering performance. The key contribution of this paper is the fundamental concept of similarity measured from multiple viewpoints. Future methods could make use of the same principle but define alternative forms for the relative similarity, or combine the relative similarities according to the different viewpoints in ways other than averaging. In addition, this paper focuses on partitional clustering of documents; in the future, it would also be possible to apply the proposed criterion functions to hierarchical clustering algorithms. We have analyzed existing clustering methods in a comparative study with the proposed multi-viewpoint-based similarity measuring method; the two criterion functions, IR and IV, and their clustering algorithms were compared and analyzed on a large number of document data sets.

AUTHORS

G. Sailaja received the B.Tech degree in Computer Science and Engineering from JNTU Ananthapur in 2010 and is pursuing the M.Tech at QCET (2011-2013). She participated in a national-level conference on grid computing at S.V. University, Tirupathi, and was certified in DB2 in 2013.

B. Dhanalakshmi received the M.Tech degree in Computer Science and Engineering from Nagarjuna University in 2010. At present she is working as an assistant professor at Sree Rama Engineering College. She has been dedicated to the teaching field for the last 5 years.

Y. Bharathi received the B.Tech degree in Computer Science and Engineering from JNTU Ananthapur in 2010 and is pursuing the M.Tech at SRES (2012-2013). She was certified in DB2 in 2012.

C. Ramesh Reddy received the B.Tech degree in Computer Science and Engineering from JNTU Ananthapur in 2011 and is pursuing the M.Tech at SRET (2012-2013). He was certified in DB2 in 2013.

REFERENCES
[1] I. Guyon, U. von Luxburg, and R.C. Williamson, "Clustering: Science or Art?", Proc. NIPS Workshop on Clustering Theory, 2009.
[2] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Third Edition.
[3] X. Wu, V. Kumar, J. Ross Quinlan, J. Ghosh, Q. Yang, et al., "Top 10 Algorithms in Data Mining," Knowledge and Information Systems, vol. 14, 2008.
[4] D. Achlioptas and F. McSherry, "On Spectral Learning of Mixtures of Distributions," Proc. Conf. Learning Theory (COLT), pp. 458-469, 2005.
[5] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," Proc. Second Int'l Conf. Knowledge Discovery and Data Mining, Portland, OR, AAAI Press, 1996, pp. 226-231.
[6] Y. Zhao and G. Karypis, "Criterion Functions for Document Clustering: Experiments and Analysis," technical report, Dept. of Computer Science, Univ. of Minnesota, 2002.
[7] H. Chim and X. Deng, "Efficient Phrase-Based Document Similarity for Clustering," IEEE Trans. Knowledge and Data Eng., vol. 20, no. 9, pp. 1217-1229, Sept. 2008.
[8] M. Pelillo, "What Is a Cluster? Perspectives from Game Theory," Proc. NIPS Workshop on Clustering Theory, 2009.
