The partitioning of data into clusters is an important problem with many applications.
Typically, one locates partitions using an iterative fuzzy c-means algorithm or a Fast Algorithm
of one form or another. In data mining, clustering techniques are used to group objects showing
similar characteristics within the same cluster, while objects demonstrating different
characteristics are placed in different clusters. Clustering approaches can be classified into
two categories, namely hard clustering and soft clustering. Our proposed WLI index partially
allows the existence of closely allocated centroids in the clustering results by considering
not only the minimum but also the median distance between pairs of centroids, and therefore
possesses better stability. The performance of WLI and of some existing clustering validity
indices is evaluated and compared by running the fuzzy c-means algorithm to cluster various
types of data sets, including artificial data sets, UCI data sets, and images. Experimental
results show that WLI performs more accurately and satisfactorily than the other indices. The
FCM algorithm is also tested with cluster validity indices such as the partition coefficient
and partition entropy. Validity functions typically suggest finding a trade-off between
intra-cluster and inter-cluster variability, which is of course a reasonable principle.
The latter process uses a region-based similarity representation of the image regions to decide
whether regions can be merged. The results show that the LHS and RHS distance measure reports
a higher partition coefficient and a lower partition entropy than the other distance measures.
Keywords: clustering analysis; clustering validity index; partition clustering algorithm; fuzzy c-means clustering algorithm.
Introduction
Microarray technology has made available an incredible amount of gene expression data,
driving research in several areas including the molecular basis of disease, drug discovery,
neurobiology, and others. Usually, microarray data is collected with the goal of either
discovering genes associated with some event, predicting outcomes based on gene expression, or
discovering sub-classes of diseases. While clustering has been used for decades in image
processing and pattern recognition, in recent years it has become a popular technique in genomic
studies for extracting this kind of valuable information from massive sets of gene expression
data.
Clustering applied to genes from microarray data groups together those whose expression
levels exhibit similar behavior through the samples. In this context, similarity is taken to indicate
possible co-regulation between the genes, but may also reveal other processes that relate their
expression. In other words, the application of clustering in our first goal listed above is founded
by the concept of guilty by association, where genes with similar expression across samples
are assumed to share some underlying mechanism.
Objects belonging to the same cluster are similar to each other, i.e., each cluster is
homogeneous. Each cluster should also be different from the other clusters, such that objects
belonging to one cluster differ from the objects in other clusters, i.e., different clusters
are non-homogeneous.
Clustering provides many advantages, but the two most important are that it extracts
knowledge from an existing data set and transforms it into a human-understandable structure for
further use.
The KDD process consists of several steps: data cleaning, data integration, data
selection, data transformation, data mining, pattern evaluation, and knowledge representation.
The first four steps are different forms of data preprocessing, in which data is prepared for
data mining. The data mining step is the essential step, where a data analysis technique is
applied to extract patterns or knowledge. The extracted patterns or knowledge are then evaluated
in the pattern evaluation step, and the evaluated knowledge is presented to the user in the
knowledge representation step. Basic data mining tasks are classification, regression,
time-series analysis, prediction, clustering, summarization, association rules, and sequence
discovery.
Clustering
Clustering is an unsupervised data mining technique that partitions or groups a given set
of patterns into disjoint clusters without advance knowledge of the groups or clusters. This is
done such that patterns belonging to the same cluster are alike and patterns belonging to two
different clusters differ. The clustering process can be divided into two parts: cluster
formation and cluster validation.
Mathematical model of clustering
In the context of pattern recognition theory, each object is represented by a vector of
features, called a pattern. Clustering can be defined as the process of partitioning a collection of
vectors into subgroups whose members are similar relative to some distance measure.
A clustering algorithm receives a set of vectors, and groups them based on a cost criterion or
some other optimization rule.
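As a minimal illustration of such a cost criterion, the within-cluster sum of squared distances is one common choice. The vectors, labels, and centroids below are made-up values for illustration only:

```python
import numpy as np

def within_cluster_cost(vectors, labels, centroids):
    """Sum of squared Euclidean distances from each vector to its
    assigned cluster centroid -- a typical clustering cost criterion."""
    cost = 0.0
    for x, k in zip(vectors, labels):
        diff = x - centroids[k]
        cost += float(np.dot(diff, diff))
    return cost

# Two tight groups, one around (0, 0) and one around (10, 10)
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]])
labels = [0, 0, 1, 1]
centroids = np.array([[0.5, 0.0], [10.5, 10.0]])
```

A clustering algorithm would search over assignments (and centroids) to make this cost small; swapping the two labels of the far-apart groups would raise it sharply.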
The related field of pattern classification, which involves simply assigning individual
vectors to classes, has developed a theory based on defining error criteria, designing optimal
classifiers, and learning. In comparison, clustering has historically been approached heuristically;
there has been almost no consideration of learning or optimization, and error estimation has been
handled indirectly via validation indices. Only recently has a rigorous clustering theory been
developed in the context of random sets. Although we will not go over the mathematical details
here, in this section we summarize some essential points regarding clustering error, error
estimation, and inference.
Fuzzy C-Means
In the K-means algorithm, each vector is classified as belonging to a single cluster (hard
clustering), and the centroids are updated based on the classified samples. In a variation of this
approach known as fuzzy c-means, all vectors have a degree of membership for each cluster, and
the respective centroids are calculated based on these membership degrees.
Whereas the K-means algorithm computes the average of the vectors in a cluster as the center,
fuzzy c-means finds the center as a weighted average of all points, using the membership
probabilities for each point as weights. Vectors with a high probability of belonging to the class
have larger weights, and more influence on the centroid. As with K-means clustering, the process
of assigning vectors to centroids and updating the centroids is repeated until convergence is
reached.
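The alternating procedure described above can be sketched as follows. This is a minimal illustrative implementation under standard assumptions (fuzzifier m = 2, a fixed iteration budget, made-up sample data), not the exact variant evaluated in this paper:

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, iters=100, seed=0):
    """Minimal fuzzy c-means sketch: alternate the membership update and
    the weighted-centroid update for a fixed number of iterations."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)        # memberships sum to 1 per point
    for _ in range(iters):
        W = U ** m                           # fuzzified membership weights
        centroids = (W.T @ X) / W.sum(axis=0)[:, None]
        # distances from every point to every centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        d = np.maximum(d, 1e-12)             # guard against division by zero
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
    return centroids, U

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
centroids, U = fuzzy_c_means(X, c=2)
```

For well-separated data like this, the memberships become nearly crisp and the weighted centroids end up close to the two group means.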
Hierarchical
Hierarchical clustering creates a hierarchical tree of similarities between the vectors,
called a dendrogram. The usual implementation is based on agglomerative clustering, which
initializes the algorithm by assigning each vector to its own separate cluster and defining the
distances between each cluster based on either a distance metric (e.g., Euclidean) or similarity
(e.g., correlation). Next, the algorithm merges the two nearest clusters and updates all the
distances to the newly formed cluster via some linkage method, and this is repeated until there is
only one cluster left that contains all the vectors. Three of the most common ways to update the
distances are with single, complete or average linkages.
This process does not define a single partition of the system, but a sequence of nested partitions,
where each partition contains one less cluster than the previous one. To obtain a partition
with K clusters, the process must be stopped K − 1 steps before the end.
Different linkages lead to different partitions, so the type of linkage used must be selected
according to the type of data to be clustered. For instance, complete and average linkages tend to
build compact clusters, while single linkage is capable of building clusters with more complex
shapes but is more likely to be affected by spurious data.
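A naive agglomerative pass with single linkage can be sketched as follows. This is illustrative only: practical implementations maintain and update a distance matrix rather than rescanning all cluster pairs, and the sample points are assumptions:

```python
import numpy as np

def single_linkage(points, k):
    """Agglomerative clustering with single linkage: repeatedly merge the
    two clusters whose closest members are nearest, until k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: distance between the closest pair of members
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a] += clusters[b]           # merge the two nearest clusters
        del clusters[b]
    return clusters

pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
groups = single_linkage(pts, k=2)
```

Replacing the `min` with `max` gives complete linkage, and with the mean of pairwise distances gives average linkage, which is why the three methods can produce different partitions of the same data.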
Literature Review
Data clustering is the process of dividing data elements into groups or clusters
such that items in the same class are similar and items belonging to different classes are
dissimilar. Different measures of similarity such as distance, connectivity, and intensity
may be used to place different items into clusters. The similarity measure controls how the
clusters are formed and depends on the nature of the data and the purpose of clustering the data.
Clustering techniques can be hard or soft. Clustering techniques can also be classified into
supervised clustering, which demands human interaction to decide the clustering criteria, and
unsupervised clustering, which decides the clustering criteria itself. The two types of classic
clustering techniques are defined as follows:
Algorithm
Fuzzy C-Means Algorithm
Fuzzy C-means (FCM) is a method of clustering which allows one piece of data to
belong to more than one cluster. In other words, each data point is a member of every cluster,
but with a certain degree of membership. Since every sample has some membership value, a sample
is also partially attached to the other clusters, so no cluster will be empty and no cluster will
be without any data points. The output of such an algorithm is a fuzzy clustering rather than a
hard partition. It is based on the minimization of the objective function

J_m = Σ_{i=1..N} Σ_{j=1..C} u_ij^m ‖x_i − c_j‖²,  1 ≤ m < ∞,

where u_ij is the degree of membership of x_i in cluster j, c_j is the j-th cluster center, and
m controls the degree of fuzziness.
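Assuming the standard FCM formulation, the objective is minimized by alternating the two updates below until convergence; this is the textbook result, included here for completeness:

```latex
u_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{\frac{2}{m-1}}},
\qquad
c_j = \frac{\sum_{i=1}^{N} u_{ij}^{\,m}\, x_i}{\sum_{i=1}^{N} u_{ij}^{\,m}}
```

The membership update pulls each point toward nearby centers, and the centroid update is the weighted average described earlier, with the fuzzified memberships as weights.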
[Figure: partitioning of anonymized patient data. The data is split on the dimension Zip code
at splitVal = 53711 into LHS and RHS subtrees; one side is split again at splitVal = 26, after
which every remaining branch admits no allowable cut.]
Advantages
If a table satisfies k-anonymity for some value k, then anyone who knows only the
quasi-identifier values of one individual cannot identify the record corresponding to that
individual with confidence greater than 1/k.
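This guarantee can be verified mechanically: a table is k-anonymous when every combination of quasi-identifier values occurs in at least k records. A minimal sketch, where the example table and its quasi-identifier columns are purely illustrative:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs in >= k records,
    so no individual can be pinned down with confidence above 1/k."""
    combos = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in combos.values())

table = [
    {"zip": "537**", "age": "20-30", "disease": "flu"},
    {"zip": "537**", "age": "20-30", "disease": "cold"},
    {"zip": "537**", "age": "30-40", "disease": "flu"},
    {"zip": "537**", "age": "30-40", "disease": "asthma"},
]
```

Here each (zip, age) combination appears twice, so the table is 2-anonymous but not 3-anonymous.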
Evaluation Result
Bacteria Image: The bacteria image considered for the experimentation is 120 × 142 × 3 pixels.
The goal of the algorithm is to separate the bacteria from the background efficiently.
Implementation is carried out on all images, both noiseless and corrupted with noise. Gaussian
noise is introduced at 3% intensity, and the image consists of two clusters. Various types of
noise and levels of noise percentage have been experimented with on the images to show the
performance of all the clustering methods; the clustering outcomes are shown in Fig. 1(b)-(d)
for the clustering methods FCM and the Fast Algorithm.
Execution time: TABLE II shows the outcomes for FCM and the Fast Algorithm in terms of the
convergence rate and the execution time. The FCM technique has the least execution time compared
with the other image segmentation techniques. It can be seen that the Fast method takes much more
time to execute, but has the best convergence rate, as its number of iterations is the least for
both images.