
Standard and Genetic k-means Clustering Techniques in Image Segmentation

Dariusz Małyszko (a), Sławomir T. Wierzchoń (b)
(a) Faculty of Computer Science, Technical University of Białystok, Wiejska 45A, 15-351 Białystok, Poland
malyszko@ii.pb.bialystok.pl
(b) Faculty of Mathematics, Physics and Informatics, University of Gdańsk, Wita Stwosza 57, 80-952 Gdańsk-Oliwa
(b) Institute of Computer Sciences, Polish Academy of Sciences, Ordona 21, 01-267 Warszawa
stw@ipipan.waw.pl

Abstract: Clustering, or data grouping, is a key initial procedure in image processing. This paper deals with the application of standard and genetic k-means clustering algorithms in the area of image segmentation. In order to assess and compare both versions of the k-means algorithm and its variants, appropriate procedures and software have been designed and implemented. Experimental results indicate that genetically optimized k-means algorithms are useful in image analysis, yielding segmentation results comparable to, and in some cases better than, those of the standard versions.

1 Introduction

During the last decades, growing attention has been paid to data clustering as a robust technique in data analysis. Clustering, or data grouping, is an important technique of unsupervised classification that arranges pattern data (most often vectors in a multidimensional space) into clusters (or groups). Patterns or vectors in the same cluster are similar according to predefined criteria, in contrast to distinct patterns from different clusters [1, 2].

Possible areas of application of clustering algorithms include data mining, statistical data analysis, compression, vector quantization and pattern recognition [1, 2]. Image analysis is an area where grouping data into meaningful regions (image segmentation) is the first step towards more detailed routines and procedures in computer vision and image understanding.

The clustering problem, understood as grouping input data by minimizing certain criteria, is an NP-hard combinatorial optimization task. Genetic algorithms are population-based optimization techniques that make extensive use of mechanisms met in evolution and natural genetics. For this reason, combining clustering techniques with the robustness of genetic algorithms in optimization should yield high-quality performance and results [3, 4].

Section 2 of the present paper briefly reviews a family of k-means clustering algorithms. Genetic algorithms in the context of k-means clustering techniques are outlined in Section 3. Cluster validation indices are described in Section 4. Section 5 describes the experiments performed and the results obtained. Section 6 concludes the paper and points to future research.

2 Standard k-means clustering algorithms

Clustering techniques are usually divided into two general groups: hierarchical and partitional clustering algorithms; see [1, 2, 4] for details. Hierarchical clustering techniques create a cluster tree by means of heuristic splitting or merging procedures. Partitional clustering techniques, on the other hand, divide the input data into a number of clusters specified in advance. The whole process is governed by the minimization of a certain goal function, e.g. a square error function [4].

"Center-based clustering" refers to the family of algorithms that use a number of "centers" to represent and group input data. The general iterative model for partitional center-based clustering algorithms has the following form [4, 5, 6]:
1. Data initialization by assigning some values to the cluster centers.
2. For each data point $x_i$, calculate its membership value $m(c_j \mid x_i)$ to all clusters $c_j$ and its weight $w(x_i)$.
3. For each cluster center $c_j$, recalculate its location taking into account all points $x_i$ assigned to this cluster according to the membership and weight values:

6th International Conference on Computer Information Systems and Industrial Management Applications (CISIM'07)
0-7695-2894-5/07 $20.00 © 2007
$$ c_j = \frac{\sum_{i=1}^{n} m(c_j \mid x_i)\, w(x_i)\, x_i}{\sum_{i=1}^{n} m(c_j \mid x_i)\, w(x_i)} $$
4. Repeat steps 2 and 3 until some termination criteria are met.

Standard k-means clustering algorithms require that the number of clusters k be determined in advance. Additionally, the results (segmentations) obtained in a run of the k-means algorithm strongly depend on the selection of the initial cluster centers.

2.2 k-means algorithm

The most important version of the partitional algorithm is the iterative k-means algorithm, in which the following objective function is minimized:

$$ KM(X, C) = \sum_{i=1}^{n} \min_{j \in \{1 \ldots k\}} \| x_i - c_j \|^2 \qquad (1) $$

Here $w(x_i) = 1$ for all i, and the membership function is defined according to the "winner takes all" rule, i.e. an object belongs to the class with the nearest center.

2.3 Harmonic k-means algorithm

Here the harmonic mean of the distances from each object to the centers of gravity of all classes is optimized [5]:

$$ KHM(X, C) = \sum_{i=1}^{n} \frac{k}{\sum_{j=1}^{k} \frac{1}{\| x_i - c_j \|^p}} \qquad (2) $$

where p is a parameter with the value p ≥ 2. Zhang [7] proposes the value 3.5 as yielding the best results. Membership and weight functions are calculated as described in [5] and [7].

2.4 Fuzzy k-means algorithm

A fuzzy partition of the input data allows multiple cluster assignments. Therefore, the optimized objective function has the following form:

$$ FKM(X, C) = \sum_{i=1}^{n} \sum_{j=1}^{k} m_{ij}^{r} \| x_i - c_j \|^2 \qquad (3) $$

Details can be found in [4, 5]. The value of parameter r should be constrained to r ≥ 1. Larger values of r make the algorithm more "fuzzy".

3 Genetic k-means clustering algorithms

The application of genetic algorithms in the area of cluster analysis takes advantage of the extensive optimum search capabilities of genetic algorithms. The general genetic procedure for determining the best k cluster centers consists of setting the parameters (number of clusters), population initialization, initial population fitness calculation, and repeated selection, cross-over and mutation operations until termination criteria are met [4].

3.1 Genetically optimized k-means clustering algorithms

For genetic k-means (GKM) and its variants (GKHM, GFKM), the number of clusters and other algorithm-specific parameter values must be selected first. Next, the population is initialized with randomly created cluster centers. From the initial population, new populations are created in subsequent iterations by the operations of selection, cross-over and mutation. For every solution in the population, a fitness value is calculated according to the specific fitness function as described in Section 2. Solutions with high fitness values enter the mating pool. The process is repeated until termination criteria are met. Some implementation details are given below.

Chromosomes
Chromosomes represent solutions consisting of the centers of k clusters; each cluster center is a d-dimensional vector of values in the range between 0 and 255 representing the intensity of a gray or color component.

Population initialization and fitness computation
The cluster centers are initialized randomly to k d-dimensional points with values in the range 0 to 255. A fitness value is calculated for each chromosome in the population according to the rules given in Section 2.

Selection
The selection operation tries to choose the best-suited chromosomes from the parent population; these enter the mating pool and, after the cross-over and mutation operations, create the child chromosomes of the child population. Most frequently, genetic algorithms make use of tournament selection, which selects into the mating pool the best individual from a predefined number of randomly chosen population chromosomes. This process is repeated for each parental chromosome.
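The tournament selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the list-based chromosome representation and the `minimize` flag are hypothetical, and treating a lower objective value as the better fitness is an assumption based on the objective functions of Section 2 being minimized.

```python
import random

def tournament_select(population, fitness, tournament_size=5,
                      minimize=True, rng=random):
    """Return the best chromosome among randomly drawn contestants.

    `fitness[i]` is the fitness of `population[i]`. With minimize=True
    (an assumption of this sketch) a lower value counts as better.
    """
    # Draw distinct contestant indices at random.
    contestants = rng.sample(range(len(population)), tournament_size)
    pick = min if minimize else max
    winner = pick(contestants, key=lambda idx: fitness[idx])
    return population[winner]

def build_mating_pool(population, fitness, tournament_size=5,
                      minimize=True, rng=random):
    """Run one tournament per parental chromosome to fill the mating pool."""
    return [tournament_select(population, fitness, tournament_size,
                              minimize, rng)
            for _ in population]
```

A larger tournament size increases selection pressure, since weak chromosomes are less likely to be the best of a larger random sample.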

Crossover
The crossover operation is a probabilistic process that exchanges information between two parent chromosomes during the formation of two child chromosomes. Typically, a one-point or two-point cross-over operation is used. According to [4], a crossover rate of 0.9 - 1.0 yields the best results.

Mutation
The mutation operation is applied to each created child chromosome with a given probability pm. After the cross-over operation, child chromosomes that undergo mutation flip the value of the chosen bit or change the value of the chosen byte to another value in the range from 0 to 255. Typically, the mutation probability is set in the range 0.05 - 0.1 [4].

Termination criterion
The termination criterion determines when the algorithm completes execution and the final results are presented to the user. It should take into account specific requirements. Most often the algorithm terminates after a predefined number of iterations. Other possible termination conditions for the k-means algorithms depend on the degree of population diversity or on reaching a situation in which no further cluster reassignment takes place.

4 Cluster validation

Data clustering is an unsupervised process that ultimately requires some sort of quality evaluation of the generated clusters. This requirement can be satisfied by using cluster validity indices [8, 9]. In general, three distinct approaches to cluster validity are possible. The first approach relies on external criteria that investigate the existence of some predefined structure in the clustered data set. The second approach makes use of internal criteria, and the clustering results are evaluated by quantities describing the data set, such as the proximity matrix. Approaches based on internal and external criteria make use of statistical tests, and their disadvantage is high computational cost. The third approach makes use of relative criteria and relies on finding the best clustering scheme that meets certain assumptions and requires predefined input parameter values. The most commonly used indices are the Dunn index, the Davies-Bouldin index, the S_Dbw index and the Quantization Error. A detailed description of relative cluster validity methods is given in [9, 10].

Dunn index
The Dunn index is a well-known validity index that recognizes compact and well-separated clusters. A detailed description is given in [6, 9, 10]. The value of the Dunn index should be maximized. Bezdek and Pal [11] generalized Dunn's index by considering five different measures of distance between clusters and three different measures of cluster diameter.

Davies-Bouldin index
The Davies-Bouldin index measures the average similarity between each cluster and the cluster most similar to it. The similarity is defined as the ratio of the sum of within-cluster scatter to between-cluster separation. The objective is to minimize this index [6].

S_Dbw index
The S_Dbw index, proposed by Turi [10], consists of two terms assessing cluster scattering and cluster density. The first term describes the average scattering of the clusters and is a measure of their compactness. The second term evaluates the density of the area between two clusters. The value of the S_Dbw index should be minimized. A detailed description of the S_Dbw index can be found in [6, 9, 10].

Quantization Error
The Quantization Error measures the average distance between points and their cluster centers. Consult [6] for details.

5 Experimental results

The experimental input consisted of three data sets: 1D, 2D and 3D images, presented in Fig. 1. The 1D image is the Lena image with gray-scale pixel values. The 2D image presents a built-up region, shown in Fig. 1 (b). The 3D image is a full RGB image with three color channels. The parameter r in the fuzzy k-means algorithm was set to 2.0, similar to the solution in [4]. The parameter p in the harmonic k-means algorithm was set to 3.5, as this value yields the best results [7]. All genetic versions of the algorithms use tournament selection with tournament size 5. The mutation rate pm was set to 0.05. All selected solutions are subjected to one-point crossover with probability 1.0, as suggested in [4]. After the crossover and mutation operations, cluster center vectors are sorted in ascending order relative to the first coordinate of the d-dimensional vector of cluster centers. In all experiments, the number of clusters was fixed at k = 6 and the population size at n = 40 chromosomes. In [4] the authors suggest n = 75 as the upper limit of the population size. The chosen parameter values are based on both recommended values and empirical studies.

Experiment I

The clustering performance of three standard algorithms, k-means, harmonic k-means and fuzzy k-means (SKM, SKHM, SFKM), and of their genetic versions (GKM, GKHM, GFKM) was compared. The input data consisted of the three data sets described earlier: the 1D, 2D and 3D images. For each of the selected algorithms, five separate trials were conducted. Every trial started with creating an initial population of 40 solutions in the case of the standard version, or 40 chromosomes in the case of the genetic version. Next, this initial population was iterated 40 times through the steps required for the selected version of the algorithm, as described in Section 2. The number of trials, solutions (chromosomes) and iterations was chosen experimentally. After each trial was completed, in addition to the fitness value of the best data grouping, the following statistics of the best grouping were computed and stored: Dunn index (DI), Davies-Bouldin index (DBI), S_Dbw index (SDBI) and Quantization Error (QE). The selected indices are presented as the minimal (Fitness, DBI, SDBI, QE) and maximal (DI) values in the trial. For each algorithm, the best and average fitness and cluster validity indices from the five trials are presented in Tab. 1.

Experiment II
The objective of this experiment was to determine how many iterations of the genetic versions of the k-means algorithms are required to obtain a good grouping of image data. For this purpose, standard (SKM, SKHM, SFKM) and genetic (GKM, GKHM, GFKM) initial populations were created and 200 iterations were performed on the 3D input data. The convergence of the fitness value for each generation was investigated. The results are presented in Tab. 2. Fig. 1 presents the best fitness values during 200 iterations of the standard and genetic k-means algorithms.
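The fitness values tracked in these experiments are the objective functions (1)-(3) of Section 2. A minimal pure-Python sketch of how they might be evaluated for one set of cluster centers follows. The function names and the `eps` guard are hypothetical, and the fuzzy membership formula is the standard fuzzy c-means one, which is an assumption here because the paper defers the membership definition to [4, 5]. That formula needs r > 1.

```python
import math

def sq_dist(x, c):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, c))

def km_fitness(points, centers):
    """Eq. (1): sum of squared distances to the nearest center."""
    return sum(min(sq_dist(x, c) for c in centers) for x in points)

def khm_fitness(points, centers, p=3.5, eps=1e-12):
    """Eq. (2): sum over points of k / sum_j (1 / ||x_i - c_j||^p)."""
    k = len(centers)
    total = 0.0
    for x in points:
        # eps avoids division by zero when a point coincides with a center.
        inv = sum(1.0 / max(math.sqrt(sq_dist(x, c)), eps) ** p
                  for c in centers)
        total += k / inv
    return total

def fkm_memberships(x, centers, r=2.0, eps=1e-12):
    """Standard fuzzy c-means memberships of x (assumed form, needs r > 1)."""
    d2 = [max(sq_dist(x, c), eps) for c in centers]
    expo = 1.0 / (r - 1.0)
    return [1.0 / sum((d2j / d2l) ** expo for d2l in d2) for d2j in d2]

def fkm_fitness(points, centers, r=2.0):
    """Eq. (3): sum_i sum_j m_ij^r * ||x_i - c_j||^2."""
    total = 0.0
    for x in points:
        m = fkm_memberships(x, centers, r)
        total += sum((mj ** r) * sq_dist(x, c) for mj, c in zip(m, centers))
    return total
```

The defaults p = 3.5 and r = 2.0 mirror the parameter values reported for the experiments.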

TABLE 1 BEST AND AVERAGE FITNESS VALUES AND CLUSTER VALIDITY INDICES (DI, DBI, SDBI, QE) OF THE K-MEANS POPULATIONS IN 5 TRIALS. THE FIRST NUMBER IS THE AVERAGE VALUE, THE SECOND NUMBER IS THE BEST VALUE

Algorithm  Fitness               Dunn index     DB index         S_Dbw index     QE

1D Image
SKM        269342 / 269180       3.262 / 3.444  2.844 / 2.827    0.245 / 0.241   6.737 / 6.270
GKM        277293 / 271668       2.987 / 3.943  3.073 / 2.801    0.249 / 0.218   6.954 / 6.758
SKHM       299036 / 292040       2.572 / 2.991  3.74764 / 3.391  0.319 / 0.297   7.547 / 7.343
GKHM       282006 / 274301       3.376 / 3.778  3.123 / 3.034    0.280 / 0.247   7.060 / 6.947
SFKM       289171 / 289048       2.762 / 2.762  3.071 / 3.070    0.276 / 0.274   6.970 / 6.965
GFKM       295286 / 292360       2.732 / 3.654  3.075 / 2.866    0.260 / 0.226   6.952 / 6.823

2D Image
SKM        1067250 / 1065094     2.444 / 2.505  4.640 / 4.619    0.246 / 0.243   16.121 / 16.050
GKM        1080935 / 1048642     2.141 / 2.600  4.809 / 4.345    0.251 / 0.203   15.648 / 14.412
SKHM       71516812 / 71338562   2.616 / 3.122  4.520 / 4.492    0.287 / 0.282   17.507 / 17.453
GKHM       7315226 / 70814149    2.293 / 2.401  5.099 / 4.870    0.264 / 0.230   16.637 / 16.168
SFKM       1179384 / 1178901     1.951 / 1.958  5.714 / 5.698    0.298 / 0.294   16.313 / 16.310
GFKM       1181146 / 1135226     1.863 / 1.994  5.255 / 5.076    0.2702 / 0.250  15.634 / 14.906

3D Image
SKM        656244 / 654987       1.415 / 1.937  5.181 / 5.123    0.201 / 0.199   17.517 / 17.476
GKM        688075 / 663667       1.517 / 2.376  5.617 / 5.231    0.225 / 0.192   18.097 / 17.149
SKHM       47691016 / 47575118   1.663 / 2.030  5.194 / 5.161    0.230 / 0.205   18.222 / 18.166
GKHM       50784988 / 47519559   1.306 / 1.470  5.598 / 4.950    0.236 / 0.181   18.348 / 17.909
SFKM       765776 / 765727       1.605 / 1.623  6.358 / 6.341    0.210 / 0.209   19.653 / 19.650
GFKM       801289 / 727402       1.290 / 1.875  6.868 / 5.769    0.291 / 0.219   18.699 / 17.197
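Two of the validity indices reported in Table 1 are simple enough to sketch directly. The code below is an illustration only: the paper gives no formulas and refers to [6, 9, 10], so the definitions used here are assumptions. The Dunn index follows Dunn's original form (smallest distance between points of different clusters divided by the largest cluster diameter), and the Quantization Error is taken as the mean over clusters of the mean point-to-center distance. All function names are hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def dunn_index(clusters):
    """Smallest between-cluster distance over largest cluster diameter.

    `clusters` is a list of lists of points. Higher values indicate
    compact, well-separated clusters. Undefined if every cluster has
    zero diameter (division by zero).
    """
    min_sep = min(dist(a, b)
                  for i, ci in enumerate(clusters)
                  for cj in clusters[i + 1:]
                  for a in ci for b in cj)
    max_diam = max(dist(a, b) for c in clusters for a in c for b in c)
    return min_sep / max_diam

def quantization_error(clusters, centers):
    """Mean over non-empty clusters of the mean point-to-center distance."""
    per_cluster = [sum(dist(x, c) for x in pts) / len(pts)
                   for pts, c in zip(clusters, centers) if pts]
    return sum(per_cluster) / len(per_cluster)
```

The pairwise loops in `dunn_index` are quadratic in the number of points, so for whole images a practical implementation would need subsampling or a cheaper distance measure.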

TABLE 2 FITNESS VALUE AND CLUSTER VALIDITY INDICES (DI, DBI, SDBI, QE) SELECTED AFTER 200 ITERATIONS OF K-MEANS BASED ALGORITHMS FOR 3D IMAGE DATA

Algorithm  Dunn index  DB index  S_Dbw index  QE       Fitness

SKM        1.3672      5.1822    0.2001       17.4943  656263
GKM        1.4477      6.3828    0.229        19.1233  704167
SKHM       1.5945      5.1239    0.2062       17.7216  4590386
GKHM       1.2585      5.7627    0.3604       19.2266  5166128
SFKM       1.7847      6.1957    0.2098       19.711   765227
GFKM       1.3975      5.1113    0.177        32.8215  790179

[Figure: plot of fitness (vertical axis, 0 to 1800000) against iteration (horizontal axis)]

Fig. 1. Fitness values of the best solutions in genetic k-means algorithm (upper line) and standard k-means algorithm (lower line) during 200 iterations.

Exemplary segmentations
After the k-means clustering algorithm completes execution, the required cluster centers are obtained. The input images are then segmented in order to partition them into meaningful regions. Segmentation quality can be assessed and compared for particular clustering techniques. Fig. 3 presents two exemplary segmentations, of the 1D Lena image (Fig. 1a) and of the 3D image (Fig. 1c), obtained in a run of the genetic KM algorithm. Pixels assigned to a given cluster are displayed in the mean color of all the pixels belonging to that cluster.

6. Conclusion and summary

The results obtained in the experiments suggest that genetic versions of k-means clustering techniques are as robust as the standard versions. Segmentation results showed that, in the long run, both types of techniques applied to image clustering lead to comparable fitness values, with slightly better values in the case of the standard versions of the k-means algorithms (see Fig. 1), although some authors (for example [3]) report the opposite. However, this observation is similar to the results presented in [4]. Standard versions of the k-means algorithms seem to be better at finding high fitness solutions. At the same time, the results obtained by the standard and genetic versions relative to the validity indices are also comparable. During an extensive search of the solution space, genetic versions of the k-means algorithms most often find solutions with slightly worse fitness values (see Fig. 1) but at the same time with exceptionally good values of individual validity indices. Further investigation into this matter could be a starting point for improving k-means based image clustering techniques.

Acknowledgement
This work was supported by Białystok Technical University grant S/WI/5/03.

[Figure panels: 1a, 1b, 1c]
Fig. 2. Image input data: 1D (a), 2D (b) and 3D (c)

[Figure panels: 3a, 3b]
Fig. 3. Exemplary segmentations of 1D image (3a) and 3D image (3b)

References

[1] R. Xu, D. Wunsch II, "Survey of clustering algorithms", IEEE Transactions on Neural Networks, 16, 2005, 645-678.

[2] A. Jain, M. Murty, P. Flynn, "Data clustering: A review", ACM Computing Surveys, 31, 1999, 264-323.

[3] U. Maulik, S. Bandyopadhyay, "Genetic algorithm-based clustering technique", Pattern Recognition, 33, 2000, 1455-1465.

[4] O. Hall, I. Barak, J.C. Bezdek, "Clustering with a genetically optimized approach", IEEE Trans. Evo. Computation, 3, 1999, 103-112.

[5] G. Hamerly, C. Elkan, "Alternatives to the k-means algorithm that find better clusterings", Proc. of the ACM Conference on Information and Knowledge Management, CIKM-2002, 2002, 600-607.

[6] G.H. Omran, A. Salman, A.P. Engelbrecht, "Dynamic clustering using particle swarm optimization with application in image segmentation", Pattern Anal Applic., 8, 2006, 332-344.

[7] B. Zhang, "Generalized k-harmonic means - Boosting in unsupervised learning", Technical Report HLP-2000-137, Hewlett-Packard Labs, 2000.

[8] M. Halkidi, M. Vazirgiannis, I. Batistakis, "Quality scheme assessment in the clustering process", Proc. of the 4th European Conf. on Principles of Data Mining and Knowledge Discovery, LNCS 1910, 2000, 265-267.

[9] M. Halkidi et al., "Clustering validity checking methods: Part II", SIGMOD Rec., 31, No. 3, 2002, 19-27.

[10] R.H. Turi, "Clustering-based color image segmentation", PhD Thesis, Monash University, Australia, 2001.

[11] J.C. Bezdek, N.R. Pal, "Some new indexes of cluster validity", IEEE Trans. Sys. Man. Cyb., 28, 1998, 301-315.
