
J Intell Inf Syst (2012) 38:321-341
DOI 10.1007/s10844-011-0158-3
Data clustering using bacterial foraging optimization

Miao Wan · Lixiang Li · Jinghua Xiao · Cong Wang · Yixian Yang
Received: 10 May 2010 / Revised: 16 March 2011 / Accepted: 17 March 2011 / Published online: 9 April 2011
© Springer Science+Business Media, LLC 2011
Abstract Clustering divides data into meaningful or useful groups (clusters) without any prior knowledge. It is a key technique in data mining and has become an important issue in many fields. This article presents a new clustering algorithm based on an analysis of the mechanism of Bacterial Foraging (BF). It is an optimization methodology for the clustering problem in which a group of bacteria forage and converge to certain positions, which serve as the final cluster centers, by minimizing the fitness function. The quality of this approach is evaluated on several well-known benchmark data sets. Compared with the popular k-means algorithm, an ACO-based algorithm and a PSO-based clustering technique, experimental results show that the proposed algorithm is an effective clustering technique and can handle data sets with various cluster sizes, densities and multiple dimensions.
Keywords Data mining · Data clustering · Bacterial foraging optimization · Optimization based clustering
M. Wan (corresponding author) · L. Li · C. Wang · Y. Yang
Information Security Center, State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, P.O. Box 145, Beijing 100876, China
e-mail: wanmiao120@163.com
M. Wan · L. Li · C. Wang · Y. Yang
Key Laboratory of Network and Information Attack & Defence Technology of MOE, Beijing University of Posts and Telecommunications, Beijing 100876, China
J. Xiao
School of Science, Beijing University of Posts and Telecommunications, Beijing 100876, China

1 Introduction
Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters) (Jain et al. 1999). In the past fifty years, much attention has been focused on the problem of clustering from both the theoretical and the practical point of view. The problem has been addressed in diverse areas such as pattern recognition, data analysis, image processing, economic science (especially market research) and biology. The study of new clustering algorithms is therefore an important issue in research fields including data mining, machine learning, statistics, and biology.
In recent years, different clustering algorithms have been proposed, such as partitioning (MacQueen 1967; Ng and Han 1994), hierarchical (Guha et al. 1998), density-based (Hinneburg and Keim 1998), grid-based (Sheikholeslami et al. 1998) and model-based (Dempster et al. 1977) methods. The partitioning approach constructs different partitions based on some criterion. In hard partitional clustering, each pattern belongs to one and only one cluster. Fuzzy clustering (Bezdek 1981; Zhang and Leung 2004) extends this notion so that each pattern may belong to all clusters with a degree of membership. Apart from the above techniques, kernel k-means and spectral clustering have both been used to identify clusters that are non-linearly separable in input space (Dhillon et al. 2005, 2007; Filippone et al. 2008).
The k-means algorithm (MacQueen 1967) is the most popular approach because of its simplicity, efficiency and low computational cost. However, since criterion functions for clustering are usually non-convex and nonlinear, traditional approaches, especially the standard k-means algorithm, are sensitive to initialization and easily trapped in local optima. As the number and dimensionality of data sets increase, finding solutions to the criterion functions becomes an NP-hard problem. Some variants of the standard k-means method provide a fast, local search strategy for this problem (Arthur and Vassilvitskii 2007; Kanungo et al. 2004). Because of the importance of clustering strategies in many fields, global optimization methods such as genetic algorithms (GA), ant colony optimization (ACO) and particle swarm optimization (PSO) have been applied to solve clustering problems (Hruschka et al. 2006; Handl et al. 2006; Shelokar et al. 2004; van der Merwe and Engelbrecht 2003; Li et al. 2006; Wan et al. 2010). When solving clustering problems, these algorithms start from an initial population or position and explore the solution space through a number of iterations to reach a near-optimal solution.
The behavior of social insects, such as finding the best food source, building an optimal nest structure, brooding, protecting the larvae and guarding, shows intelligence at the swarm level (Englebrecht 2002). Foraging is one such social behavior and can be modelled as an optimization process in which an animal seeks to maximize the energy intake per unit time spent foraging. This view led Passino to develop a new optimization algorithm inspired by the social foraging behavior of Escherichia coli (E. coli) bacteria, named Bacterial Foraging (BF) (Passino 2002). BF is gaining importance in optimization and has been successfully applied to engineering problems such as optimal controller design (Passino 2002; Kim et al. 2007), antenna array systems (Guney and Basbug 2008), active power filter synthesis (Mishra and Bhende 2007), and the learning of artificial neural networks (Kim and Cho 2005). Mathematical modelling, modification, and adaptation of the algorithm may form a major part of future research on BF. As data clustering can be seen as a process of function optimization, BF, with its global search capability, may be applied to solve clustering problems.
In this paper we propose a new clustering algorithm, called BF-C, that groups data using the optimization property of bacterial foraging behavior. Rather than a high-speed local search, BF-C is a global optimization-based algorithm, which provides a new point of view for solving NP-hard clustering problems. It is also a brand-new application of Bacterial Foraging. In our algorithm, no centroid or center needs to be selected in the initial step. Moreover, in order to overcome the drawbacks of traditional algorithms, the proposed algorithm pursues a tripartite objective: (a) to find a high-quality approximation to the optimal clustering solution; (b) to perform well on high-dimensional data; (c) to be insensitive to clusters of different size and density.
The rest of this paper is organized as follows. Section 2 gives the background on optimization based clustering and the BF algorithm. Section 3 describes the whole process of the proposed BF-C algorithm in detail. Section 4 briefly introduces three other clustering algorithms used for comparison and presents four measures for evaluating algorithm performance. Section 5 presents the experimental comparisons and discusses the results. Finally, conclusions and future work are given in Section 6.
2 Background
2.1 Optimization based clustering
Clustering is a data mining technique which classifies objects into groups (clusters) without any prior knowledge. The common clustering problem can be formally stated as follows. Given a sample data set $X = \{x_1, x_2, \ldots, x_n\}$, determine a partition of the objects into $K$ clusters $C_1, C_2, \ldots, C_K$ which satisfies:
$$\bigcup_{i=1}^{K} C_i = X; \qquad C_i \cap C_j = \emptyset,\ i, j = 1, 2, \ldots, K,\ i \neq j; \qquad C_i \neq \emptyset,\ i = 1, 2, \ldots, K. \tag{1}$$
Mathematically, cluster $C_i$ can be determined by:
$$C_i = \{x_j \mid \|x_j - z_i\| \le \|x_j - z_p\|,\ x_j \in X\},\quad p \neq i,\ p = 1, 2, \ldots, K; \qquad z_i = \frac{1}{|C_i|} \sum_{x_j \in C_i} x_j,\quad i = 1, 2, \ldots, K, \tag{2}$$
where $\|\cdot\|$ denotes the distance between any two data points in the sample set and $z_i$ is the center of cluster $C_i$, represented by the average (mean) of all the points in the cluster.
A clustering criterion must be adopted. The most commonly used criterion in clustering tasks is the Sum of Squared Error (SSE) (Tan et al. 2006):
$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x_j \in C_i} \|x_j - z_i\|^2. \tag{3}$$
For each data point in the given set, the error is the distance to the nearest cluster center. The general objective of clustering is to obtain the partition which, for a fixed number of clusters, minimizes the squared error. Thus, the clustering problem is converted into a search for $K$ centers $z_1, z_2, \ldots, z_K$ that minimize the sum of the distances between every sample $x_i$ and its closest center. This can be treated as a function optimization problem with SSE as the objective function.
2.2 The bacterial foraging (BF) algorithm
The BF algorithm (Passino 2002) is a new stochastic global search technique based on the foraging behavior of the E. coli bacteria present in the human intestine. Ideas from bacterial foraging can be used to solve non-gradient optimization problems through three processes: chemotaxis, reproduction, and elimination-dispersal. Generally, as a group, E. coli bacteria try to find food and avoid harmful phenomena during foraging and, after a certain period, recover and return to some standard behavior in a homogeneous medium. An E. coli bacterium can move in two different ways, tumbling and swimming, and it alternates between these two modes throughout its lifetime. This alternation between the two modes, called chemotactic steps, moves the bacterium in random directions and enables it to search for nutrients. After the bacterium has collected a given amount of nutrients, it can self-reproduce and divide into two. The bacterial population can also be changed (e.g., killed or dispersed) by the local environment.
A BF optimization algorithm can be explained as follows. Given a $D$-dimensional search space $\mathbb{R}^D$, find the minimum of the objective function $J(\theta)$, $\theta \in \mathbb{R}^D$, where we have neither measurements nor an analytical description of the gradient $\nabla J(\theta)$. Here, ideas from bacterial foraging are used to solve this non-gradient optimization problem. Let $\{\theta^i(j) \mid i = 1, 2, \ldots, S\}$ represent the position of each member in the population of $S$ bacteria at the $j$th chemotactic step. Let $C(i) > 0$ ($i = 1, 2, \ldots, S$) denote the basic chemotactic step size taken in the random direction specified by the tumble. To represent a tumble, a unit-length random direction, say $\phi(j)$, is generated; the same direction is used in the swim phase that follows the tumble. The position of bacterium $i$ is therefore updated in one step as:
$$\theta^i(j + 1) = \theta^i(j) + C(i)\,\phi(j). \tag{4}$$
If $J(\theta^i(j + 1)) < J(\theta^i(j))$, another step is taken in the same direction. Swimming continues as long as it reduces the objective function, but only up to a maximum number of steps, $N_s$. After $N_c$ chemotactic steps, a reproduction step is taken: the $S_r$ healthiest bacteria (half of the population) each split into two bacteria, which are placed at the same location. Finally, each bacterium in the population is subjected to an elimination-dispersal process with probability $p_{ed}$.
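The tumble-and-swim update of Eq. (4) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `unit_direction` and `chemotactic_step`, the use of `random.Random`, and the way the unit direction is built (normalising a vector with entries drawn uniformly on [-1, 1]) are choices made here for the example.

```python
import math
import random


def unit_direction(dim, rng):
    """Tumble: draw a vector with entries on [-1, 1] and scale it to unit length."""
    while True:
        delta = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(d * d for d in delta))
        if norm > 0.0:  # retry in the (practically impossible) all-zero case
            return [d / norm for d in delta]


def chemotactic_step(theta, cost, step_size, max_swim, rng):
    """One tumble move (Eq. 4) followed by up to max_swim swim moves.

    Returns the new position and the number of moves actually taken.
    """
    phi = unit_direction(len(theta), rng)
    last_cost = cost(theta)
    # The tumble move is always taken.
    current = [t + step_size * p for t, p in zip(theta, phi)]
    moves = 1
    # Keep swimming in the same direction while the cost keeps decreasing.
    for _ in range(max_swim):
        current_cost = cost(current)
        if current_cost >= last_cost:
            break
        last_cost = current_cost
        current = [c + step_size * p for c, p in zip(current, phi)]
        moves += 1
    return current, moves
```

For example, `chemotactic_step([3.0, -2.0], lambda v: sum(x * x for x in v), 0.1, 4, random.Random(0))` moves the point between one and five steps of length 0.1 along a single random direction.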
3 Proposed methodology: the BF-C algorithm
In this section we describe in detail how bacterial foraging optimization solves the general clustering problem.
3.1 The BF based clustering (BF-C) algorithm
As mentioned in Section 2.1, clustering tasks can be considered as optimization problems. First, the fitness function must be specified. Here we choose the SSE in (3) as the function $J$ to be minimized in BF-C:
$$J(w, z) = \sum_{c=1}^{K} \sum_{t=1}^{n} \sum_{d=1}^{D} w_{tc}\,(x_{td} - z_{cd})^2, \tag{5}$$
where $D$ is the dimension of the search space, $w$ is a weight matrix of size $n \times K$, and $w_{tc}$ is the weight associating data point $x_t$ with cluster $c$, assigned as
$$w_{tc} = \begin{cases} 1 & \text{if } x_t \text{ is labelled to cluster } c, \\ 0 & \text{otherwise,} \end{cases} \qquad t = 1, \ldots, n,\ c = 1, \ldots, K.$$
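The fitness function (5) can be evaluated literally as a triple sum once the hard weights are known. The sketch below is illustrative only: the helper names `hard_weights` and `fitness` and the four-point toy data set are assumptions introduced here, and each point is labelled to its nearest center as in (2).

```python
def hard_weights(data, centers):
    """Build the n x K weight matrix: w[t][c] = 1 if x_t is nearest to center c."""
    weights = []
    for x in data:
        d2 = [sum((xd - zd) ** 2 for xd, zd in zip(x, z)) for z in centers]
        nearest = d2.index(min(d2))  # ties go to the lowest cluster index
        weights.append([1 if c == nearest else 0 for c in range(len(centers))])
    return weights


def fitness(data, centers, weights):
    """Evaluate J(w, z) of Eq. (5) as a sum over clusters, points and dimensions."""
    total = 0.0
    for c, z in enumerate(centers):
        for t, x in enumerate(data):
            for d in range(len(x)):
                total += weights[t][c] * (x[d] - z[d]) ** 2
    return total


# Toy data: two pairs of points, each pair at squared distance 1 from its center.
data = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
centers = [[0.0, 1.0], [10.0, 1.0]]
weights = hard_weights(data, centers)
print(fitness(data, centers, weights))
```

With hard weights only one term per point survives, so the value coincides with the SSE of (3); for the toy data it is 4.0.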
Algorithm 1 introduces the proposed BF-C algorithm. In BF-C, a population of $S$ bacteria is generated for each center, so $S \times K$ bacteria change positions through foraging behaviors in search of the minimum cost. A virtual bacterium is actually one trial solution (it may be called a search agent) that moves on the functional surface to locate the global optimum. Initially, $S$ data points are randomly drawn from $X$ as bacteria for each center $z_c$ (line 1 in Algorithm 1). Then, for every bacterium $i$, the chemotaxis process starts (lines 4-20 in Algorithm 1). All the bacteria update their positions over $N_c$ iterations. The agents first tumble in a unit-length random direction ($\Delta(i)/\sqrt{\Delta^T(i)\Delta(i)}$, where $\Delta(i) \in \mathbb{R}^D$ is a random vector with each element a random number on $[-1, 1]$) with a basic chemotaxis step size $C(i)$ (line 6 in Algorithm 1), and then swim to minimize the objective function $J$ for up to a maximum number of steps, $N_s$ (lines 9-18 in Algorithm 1). The chemotaxis process is nested in a reproduction loop of $N_{re}$ steps (lines 3, 21-23 in Algorithm 1), which is in turn enclosed in an elimination-dispersal loop of length $N_{ed}$, during which a fraction $p_{ed}$ of the bacteria is dispersed at random (lines 2, 24-25 in Algorithm 1). All the bacteria converge to certain places in the search space after the iteration process, and their final positions are taken as the required centers. All the data are then allocated according to (2) to the clusters represented by these final centers, and every data object is assigned the corresponding cluster label (lines 26-33 in Algorithm 1).
3.2 Guidelines for algorithm parameter setting
The bacterial foraging optimization algorithm requires the initialization of a variety of parameters, and Passino (2002) gave a set of guidelines for parameter choices in BF. Since BF-C builds on the basic idea of BF, these guidelines apply to BF-C as well.
In BF-C, the size of the bacteria population $S$ should be picked first. Enlarging $S$ increases the computing time but makes the optimum easier to find. Next, there is a three-layer optimization loop in BF-C with a total size of $N_{iter} = N_c \times N_{re} \times N_{ed}$. The larger $N_{iter}$ is, the better the optimization progress is, but also the more
Algorithm 1 The BF-C Algorithm
Require: Data set, $X = \{x_1, x_2, \ldots, x_n\}$; Cluster number, $K$.
Ensure: Clusters: $\{C_1, C_2, \ldots, C_K\}$.
1: Initialize $K$ centers for $C_1, C_2, \ldots, C_K$: Generate $S$ data $\{b_c^1, b_c^2, \ldots, b_c^S\}$ from $X$ randomly as the positions of bacteria for each cluster center $z_c$ ($c = 1, 2, \ldots, K$).
2: for $l = 1 : N_{ed}$ do
3:   for $k = 1 : N_{re}$ do
4:     for $j = 1 : N_c$ do
5:       for $i = 1 : S$ do
6:         $b_c^i(j + 1, k, l) = b_c^i(j, k, l) + C(i)\,\Delta(i)/\sqrt{\Delta^T(i)\Delta(i)}$
7:         Calculate $J(i, j, k, l)$ with current $b_c^i(j, k, l)$
8:         $J_{last} = J(i, j, k, l)$
9:         while $m < N_s$ do
10:          $m = m + 1$
11:          if $J(i, j + 1, k, l) < J_{last}$ then
12:            $b_c^i(j + 1, k, l) = b_c^i(j + 1, k, l) + C(i)\,\Delta(i)/\sqrt{\Delta^T(i)\Delta(i)}$
13:            $J_{last} = J(i, j + 1, k, l)$
14:            $z_c(j, k, l) = b_c^i(j + 1, k, l)$
15:          else
16:            $m = N_s$
17:          end if
18:        end while
19:      end for
20:    end for
21:    $J^i_{health} = \sum_{j=1}^{N_c + 1} J(i, j, k, l)$
22:    Reproduce($X$, $J_{health}$)
23:  end for
24:  Elimination-dispersal($X$, $p_{ed}$)
25: end for
26: for $t = 1 : n$ do
27:   for $c = 1 : K$ do
28:     Calculate distance $d_c = \|x_t - z_c\|$
29:   end for
30:   $d = \{d_1, d_2, \ldots, d_K\}$
31:   Find the position $p$ of $\min(d)$
32:   $C_p$.add($x_t$)
33: end for
computational complexity is. If $N_{iter}$ is too small, the algorithm can more easily get trapped in a local minimum. The bacteria then swim in random directions for up to $N_s$ steps. Large values of $N_s$ let the bacteria move further in different directions and tend to give better results, at the price of more computation. If $p_{ed}$ is large, the algorithm can degrade to random exhaustive search. If, however, it
is chosen appropriately, it can help the algorithm jump out of local optima and into a global optimum. Finally, $C(i)$ is the only parameter that appears in the update rule (4) and can be seen as a type of step size for the BF optimization algorithm. A biologically motivated value can be chosen; however, such values may not be the best for an engineering application (Passino 2002). If the $C(i)$ values are too large and the optimum lies in a valley with steep edges, the search will tend to jump out of the valley, or it may simply miss possible local minima by swimming through them without stopping. On the other hand, if the $C(i)$ values are too small, convergence can be slow, but if the search finds a local minimum it will typically not deviate too far from it. In Section 5, we set up experiments to investigate the parameters of BF-C.
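To make the roles of $S$, $N_c$, $N_s$, $N_{re}$, $N_{ed}$, $p_{ed}$ and $C(i)$ concrete, the three nested loops can be sketched as follows. This is a simplified reading of the method, not the authors' Matlab code, and it rests on several assumptions that go beyond the text: each bacterium is taken to hold one candidate position per cluster center, health is the cost accumulated after each chemotactic step, dispersed bacteria are re-drawn from the data, the best configuration seen so far is returned, and the name `bfc_sketch` and its default values are hypothetical.

```python
import math
import random


def sse(data, centers):
    """Sum of squared distances from every point to its nearest center (Eq. 3)."""
    return sum(
        min(sum((x - z) ** 2 for x, z in zip(point, center)) for center in centers)
        for point in data
    )


def tumble(dim, rng):
    """Unit-length random direction built from entries drawn on [-1, 1]."""
    while True:
        delta = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(d * d for d in delta))
        if norm > 0.0:
            return [d / norm for d in delta]


def shift(centers, directions, step):
    """Move every center by `step` along its own direction; returns new lists."""
    return [[z + step * d for z, d in zip(center, direction)]
            for center, direction in zip(centers, directions)]


def bfc_sketch(data, k, s=10, n_c=20, n_s=4, n_re=4, n_ed=2,
               p_ed=0.25, step=0.1, seed=0):
    if s % 2 != 0:
        raise ValueError("s must be even so reproduction keeps the population size")
    rng = random.Random(seed)
    dim = len(data[0])
    # Assumption: bacterium i holds one candidate position per cluster center.
    population = [rng.sample(data, k) for _ in range(s)]
    best = min(population, key=lambda b: sse(data, b))
    best, best_cost = [c[:] for c in best], sse(data, best)
    for _ in range(n_ed):                      # elimination-dispersal loop
        for _ in range(n_re):                  # reproduction loop
            health = [0.0] * s
            for _ in range(n_c):               # chemotaxis loop
                for i in range(s):
                    last = sse(data, population[i])
                    directions = [tumble(dim, rng) for _ in range(k)]
                    population[i] = shift(population[i], directions, step)
                    for _ in range(n_s):       # swim while the cost keeps falling
                        cost = sse(data, population[i])
                        if cost >= last:
                            break
                        last = cost
                        population[i] = shift(population[i], directions, step)
                    cost = sse(data, population[i])
                    health[i] += cost
                    if cost < best_cost:
                        best, best_cost = population[i], cost
            # Reproduction: the half with the lowest accumulated cost splits in two.
            ranked = sorted(range(s), key=lambda i: health[i])
            population = [[c[:] for c in population[i]]
                          for i in ranked[: s // 2] for _ in range(2)]
        # Elimination-dispersal: relocate each bacterium with probability p_ed.
        for i in range(s):
            if rng.random() < p_ed:
                population[i] = rng.sample(data, k)
    return best, best_cost
```

Tracking the best configuration is a convenience of this sketch; the text instead takes the final positions of the converged bacteria as the centers.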
4 Cluster validity and compared methods
One of the most important issues in cluster analysis is the evaluation of clustering results to find the partitioning that best fits the underlying data. The procedure of evaluating the results of a clustering algorithm is known as cluster validity. Furthermore, in order to show the superiority of the proposed clustering algorithm, some existing methods are selected for comparison during cluster validation.
4.1 Cluster validity
Two kinds of cluster validity approaches are used in this article. The first is based on external criteria, which evaluate the results of the proposed BF-C algorithm by comparison with the pre-specified class label information of the data set. The second is based on internal criteria, which evaluate the clustering results of the BF-C algorithm without any prior knowledge of the data sets. Two external validity measures, Rand and Jaccard (Theodoridis and Koutroumbas 2006), and two internal validity measures, Beta (Pal et al. 2000) and the Distance index, are used to evaluate the performance of the BF-C algorithm and the methods it is compared with.
Rand coefficient (R): It determines the degree of similarity between the known correct cluster structure and the results obtained by a clustering algorithm (Theodoridis and Koutroumbas 2006). It is defined as
$$R = \frac{SS + DD}{SS + SD + DS + DD}. \tag{6}$$
SS, SD, DS and DD represent the numbers of possible pairs of data points where
SS: both data points belong to the same cluster and the same group;
SD: both data points belong to the same cluster but different groups;
DS: both data points belong to different clusters but the same group;
DD: both data points belong to different clusters and different groups.
Note that if there are $N$ data points in a data set, $M = SS + SD + DS + DD$, where $M$ is the total number of possible data pairs and equals $N(N - 1)/2$.
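The four pair counts and the Rand coefficient (6) can be computed by enumerating all pairs of points. The sketch below is only an illustration: the function names and the six-point labelling are assumptions made for the example, with `clusters` holding the labels produced by an algorithm and `groups` the known class labels.

```python
from itertools import combinations


def pair_counts(clusters, groups):
    """Count SS, SD, DS and DD over all N(N-1)/2 pairs of data points."""
    ss = sd = ds = dd = 0
    for a, b in combinations(range(len(clusters)), 2):
        same_cluster = clusters[a] == clusters[b]
        same_group = groups[a] == groups[b]
        if same_cluster and same_group:
            ss += 1
        elif same_cluster:
            sd += 1
        elif same_group:
            ds += 1
        else:
            dd += 1
    return ss, sd, ds, dd


def rand_coefficient(clusters, groups):
    """Rand coefficient R of Eq. (6)."""
    ss, sd, ds, dd = pair_counts(clusters, groups)
    return (ss + dd) / (ss + sd + ds + dd)


clusters = [0, 0, 0, 1, 1, 1]  # labels found by a clustering algorithm
groups = [0, 0, 1, 1, 1, 1]    # known class labels
print(pair_counts(clusters, groups), rand_coefficient(clusters, groups))
```

For this labelling the counts are SS = 4, SD = 2, DS = 3 and DD = 6, which sum to M = 6 · 5 / 2 = 15, so R = 10/15.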
The value of $R$ lies in the range [0, 1]; the higher the value of $R$, the better the clustering.
Jaccard coefficient (J): It is the same as the Rand coefficient except that it excludes DD, and is defined as
$$J = \frac{SS}{SS + SD + DS}. \tag{7}$$
The value of $J$ lies in the interval [0, 1]. The higher the value of $J$, the better the clustering performance.
Beta index ($\beta$): It computes the ratio of total variation to within-class variation (Pal et al. 2000), and is defined as
$$\beta = \frac{\sum_{i=1}^{C} \sum_{j=1}^{n_i} (X_{ij} - \bar{X})^2}{\sum_{i=1}^{C} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i)^2}, \tag{8}$$
where $\bar{X}$ is the mean of all the data points and $\bar{X}_i$ is the mean of the data points that belong to cluster $C_i$; $X_{ij}$ is the $j$th data point of the $i$th cluster and $n_i$ is the number of data points in cluster $C_i$. Since the numerator of $\beta$ is constant for a given data set, the value of $\beta$ depends on the denominator only. The denominator decreases with homogeneity in the formed clusters. Therefore, for a given data set, the higher the value of $\beta$, the better the clustering (Pal et al. 2000). Note that $(X_{ij} - \bar{X})$ can be calculated as the Euclidean distance between the two vectors $X_{ij}$ and $\bar{X}$.
Distance index ($Dis = Intra/Inter$): It computes the ratio of the average intra-cluster distance to the average inter-cluster distance. The intra-cluster distance is the distance between a point and its cluster center. We take the average of all of these distances and call it Intra, defined as
$$\mathit{Intra} = \frac{1}{n} \sum_{i=1}^{K} \sum_{x_j \in C_i} \|x_j - z_i\|^2, \tag{9}$$
where $n$ is the total number of objects in the data set. The inter-cluster distance between two clusters is defined as the distance between their centers. We calculate the average of all of these distances as follows:
$$\mathit{Inter} = \frac{1}{K} \sum_{i=1}^{K-1} \sum_{j=i+1}^{K} \|z_i - z_j\|^2. \tag{10}$$
A good clustering method should produce clusters with high intra-class similarity and low inter-class similarity. The clustering results can therefore be measured by combining the average intra-cluster distance (Intra) and the average inter-cluster distance (Inter) as a ratio:
$$\mathit{Dis} = \frac{\mathit{Intra}}{\mathit{Inter}}. \tag{11}$$
Therefore, we want to minimize the value of Dis.
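The two internal measures can be checked on a small example. The following sketch is illustrative only: the function names and the toy clusters are assumptions, clusters are passed as lists of points, cluster centers are taken to be the cluster means, and Inter is normalised by $K$ exactly as written in (10).

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return [sum(col) / len(points) for col in zip(*points)]


def beta_index(clusters):
    """Beta index of Eq. (8): total variation over within-cluster variation."""
    all_points = [p for cluster in clusters for p in cluster]
    grand_mean = centroid(all_points)
    total = sum(sq_dist(p, grand_mean) for p in all_points)
    within = sum(sq_dist(p, centroid(cluster)) for cluster in clusters for p in cluster)
    return total / within


def distance_index(clusters):
    """Dis = Intra / Inter of Eqs. (9)-(11), with cluster means as centers."""
    centers = [centroid(cluster) for cluster in clusters]
    n = sum(len(cluster) for cluster in clusters)
    k = len(clusters)
    intra = sum(sq_dist(p, z) for cluster, z in zip(clusters, centers) for p in cluster) / n
    inter = sum(sq_dist(centers[i], centers[j])
                for i in range(k - 1) for j in range(i + 1, k)) / k
    return intra / inter


clusters = [[[0.0, 0.0], [0.0, 2.0]], [[10.0, 0.0], [10.0, 2.0]]]
print(beta_index(clusters), distance_index(clusters))
```

For the toy clusters the total variation is 104 and the within-cluster variation is 4, giving a Beta index of 26; Intra is 1 and Inter is 50, giving Dis = 0.02.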
4.2 Methods for comparison
To demonstrate the superiority of the proposed BF-C algorithm, we select some existing clustering techniques for comparison. First, we choose the k-means algorithm (MacQueen 1967) because it is the best-known conventional clustering technique. The k-means algorithm is a partition-based clustering approach (see Algorithm 2) and has been widely applied for decades.
Algorithm 2 The k-means Clustering Algorithm
Require: Data set, $X = \{x_1, x_2, \ldots, x_n\}$; Cluster number, $K$.
Ensure: Clusters: $\{C_1, C_2, \ldots, C_K\}$.
1: Initialize $K$ centers for $C_1, C_2, \ldots, C_K$: Randomly select $K$ data points from $X$ as the initial centroid vectors.
2: repeat
3:   Assign each data point to its closest centroid and form $K$ clusters by (2).
4:   Recompute the centroid for each cluster.
5: until Centroid vectors do not change.
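The steps of Algorithm 2 translate almost line by line into Python. The version below is a minimal sketch rather than a reference implementation: the names, the `seed` and `max_iter` arguments and the rule that an empty cluster keeps its previous center are assumptions added here to make the example self-contained and safe to run.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return [sum(col) / len(points) for col in zip(*points)]


def kmeans(data, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: pick K distinct data points as the initial centroids.
    centers = [p[:] for p in rng.sample(data, k)]
    for _ in range(max_iter):
        # Step 3: assign each point to its closest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in data]
        # Step 4: recompute the centroid of each cluster.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(data, labels) if lab == c]
            # Assumption: an empty cluster keeps its previous center.
            new_centers.append(centroid(members) if members else centers[c])
        # Step 5: stop when the centroid vectors no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


data = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 0.0], [11.0, 0.0], [12.0, 0.0]]
centers, labels = kmeans(data, 2)
print(sorted(centers), labels)
```

On this well-separated toy data the two groups are recovered from any initial pair of points; on harder data the result depends on the initialization, which is the sensitivity discussed in Section 1.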
Moreover, as a global optimization-based methodology, the BF-C algorithm is compared with ant-based clustering (Handl et al. 2006) and the PSO-based clustering technique (van der Merwe and Engelbrecht 2003). Ant colony optimization (ACO) (Dorigo and Maniezzo 1996) was designed to solve optimization problems by emulating ants' behavior of laying pheromone on the ground while moving. Handl et al. (2006) presented an instance of ACO for clustering which returns an explicit partitioning of the data by an automatic process. The ACO algorithm imitates these mechanisms by choosing solutions based on pheromones and updating the pheromones based on solution quality (shown in Algorithm 3).
Particle swarm optimization (PSO) (Kennedy and Eberhart 1995) is a population-based algorithm. It is a global optimization method that simulates bird flocking or fish schooling behavior to achieve a self-evolving system. The clustering approach using PSO automatically searches for the centers of the K groups in a data set by optimizing the objective function (see Algorithm 4). In Section 5, we set up a series of experiments to compare the BF-C, k-means, ACO-based and PSO-based clustering algorithms.
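The core of the PSO-based approach is the velocity and position update of a particle, where a particle is the concatenation of its K centroid vectors. The sketch below shows one such update in the standard PSO form, in which the position is advanced with the newly computed velocity. It is an illustration under assumptions: the function name `pso_update`, the flat-list representation and the independent random factors per coordinate are choices made here, and the defaults w = 0.72 and c1 = c2 = 1.49 are the values reported in Section 5.2.5.

```python
import random


def pso_update(position, velocity, personal_best, global_best, rng,
               w=0.72, c1=1.49, c2=1.49):
    """One PSO step for a single particle stored as a flat list of coordinates."""
    new_position, new_velocity = [], []
    for x, v, p, g in zip(position, velocity, personal_best, global_best):
        r1, r2 = rng.random(), rng.random()  # assumption: fresh factors per coordinate
        nv = w * v + c1 * r1 * (p - x) + c2 * r2 * (g - x)
        new_velocity.append(nv)
        new_position.append(x + nv)          # move with the updated velocity
    return new_position, new_velocity


rng = random.Random(0)
position = [1.0, 2.0]      # e.g. one 2-dimensional centroid
velocity = [0.5, -1.0]
new_position, new_velocity = pso_update(position, velocity, position, position, rng)
print(new_position, new_velocity)
```

When the particle already sits at both its personal best and the global best, as in the usage lines, the attraction terms vanish and the velocity is simply damped by the inertia weight w.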
5 Experiments
In this section, we present several simulation experiments in Matlab to illustrate in detail the superiority and feasibility of the proposed approach.
Algorithm 3 The ACO-based Clustering Algorithm
Require: Data set, $X = \{x_1, x_2, \ldots, x_n\}$; Cluster number, $K$.
Ensure: Clusters: $\{C_1, C_2, \ldots, C_K\}$.
1: Initialize pheromones. Randomly scatter data items on the toroidal grid, and generate positions of $R$ ants randomly from the data space for each center.
2: for $j = 1 : iter_{max}$ do
3:   for $i = 1 : R$ do
4:     Let each data item belong to one cluster with the probability threshold $q$
5:     Calculate the objective function $J(i, j)$ with current centers
6:     $J_{last} = J(i, j)$
7:     Construct solution $S_i$ using pheromone trail
8:     Calculate new cluster center; Calculate $J(i, j + 1)$ with current centers
9:     if $J(i, j + 1) < J_{last}$ then
10:      $S_i(j + 1) = S_i(j)$ // Save the best solution among the $R$ solutions found.
11:    end if
12:  end for
13:  Update the pheromone level on all data according to the best solution.
14:  $\{z_1, z_2, \ldots, z_K\} = S_b$ // Update cluster centers by the cluster center values of the best solution.
15: end for
16: for $t = 1 : n$ do
17:   for $c = 1 : K$ do
18:     Calculate distance $d_c = \|x_t - z_c\|$
19:   end for
20:   $d = \{d_1, d_2, \ldots, d_K\}$
21:   Find the position $p$ of $\min(d)$
22:   $C_p$.add($x_t$)
23: end for
5.1 Data source
Two different types of benchmark data sets are used: two synthetic data sets (Handl and Knowles 2008) that permit the modulation of specific data properties, and five real data sets provided by the UCI Machine Learning Repository (UCI Machine Learning Repository 2007). Both synthetic data sets follow $x$-dimensional normal distributions $N(\mu, \sigma)$ from which the data items of the $y$ different clusters are drawn. The sample size $s$ of each cluster, the mean vector $\mu$ and the vector of standard deviations $\sigma$ are themselves randomly determined using uniform distributions over fixed ranges (with $s \in [50, 450]$, $\mu_i \in [-10, 10]$ and $\sigma_i \in [0, 5]$). Consequently, the clusters in each data set have different sizes and densities. The first one, which we call 2D-4C, is a 2-dimensional data set arranged in $([-20, 20], [-12, 8])$ and contains 4 clusters with 528, 348, 272 and 424 instances each
Algorithm 4 The PSO-based Clustering Algorithm
Require: Data set, $X = \{x_1, x_2, \ldots, x_n\}$; Cluster number, $K$.
Ensure: Clusters: $\{C_1, C_2, \ldots, C_K\}$.
1: Initialize the position $M$ and velocity $v$ of $S$ particles randomly, in which each single particle $M_i$ ($i = 1, 2, \ldots, S$) contains $K$ randomly generated centroid vectors: $M_i = \{m_{i1}, m_{i2}, \ldots, m_{iK}\}$.
2: for $j = 1 : iter_{max}$ do
3:   for $i = 1 : S$ do
4:     Calculate the objective function $J(i, j)$ with current $M_i(j)$
5:     $J_{last} = J(i, j)$
6:     $v_i(j + 1) = w\,v_i(j) + c_1\,rand()\,(P_i(j) - M_i(j)) + c_2\,rand()\,(P_g - M_i(j))$
7:     $M_i(j + 1) = M_i(j) + v_i(j + 1)$
8:     Calculate $J(i, j + 1)$ with current $M_i(j + 1)$
9:     if $J(i, j + 1) < J_{last}$ then
10:      $P_i(j + 1) = M_i(j + 1)$ // $P_i$ represents the local best position, the best position found so far for particle $i$.
11:    else
12:      $P_i(j + 1) = P_i(j)$
13:    end if
14:  end for
15:  Update the global best position $P_g$: Select the best $P_i$ from $\{P_1, P_2, \ldots, P_S\}$ as $P_g$. // $P_g$ represents the global best position in the neighborhood of each particle.
16:  $\{z_1, z_2, \ldots, z_K\} = P_g$
17: end for
18: for $t = 1 : n$ do
19:   for $c = 1 : K$ do
20:     Calculate distance $d_c = \|x_t - z_c\|$
21:   end for
22:   $d = \{d_1, d_2, \ldots, d_K\}$
23:   Find the position $p$ of $\min(d)$
24:   $C_p$.add($x_t$)
25: end for
(see Fig. 1). The second data set, named 10D-4C, contains a total of 1,289 items spread over 4 clusters based on 10 different features. All 5 data sets from UCI that we employ in our experiments are well-known databases that can easily be found in the data mining and pattern recognition literature. The Iris data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant and can be treated as a cluster in the experiments. Each instance has 4 features representing sepal length, sepal width, petal length and petal width, respectively. The Wine data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. This set contains 3 clusters and has 59,
Fig. 1 The original 2-dimensional data distribution in space
71 and 48 instances in the respective clusters. The Glass data set has 214 instances describing 6 classes of glass based on 9 features. The Zoo data set is a simple database containing 101 animal instances with 16 Boolean-valued attributes, classified into 7 categories. The Ionosphere data set contains 351 radar returns with 34 continuous features, collected by a system in Goose Bay, Labrador. This system consists of a phased array of 16 high-frequency antennas with a total transmitted power on the order of 6.4 kilowatts. The targets were free electrons in the ionosphere. Good radar returns are those showing evidence of some type of structure in the ionosphere. Bad returns are those that do not; their signals pass through the ionosphere. The data points in all 5 data sets are scattered in high-dimensional spaces. The data sets used in our study are summarized in Table 1.
Table 1 Summary of data sets
Data set      Instances   Features/dimensions   Clusters
2D-4C         1,572       2                     4
10D-4C        1,289       10                    4
Iris          150         4                     3
Wine          178         13                    3
Glass         214         9                     6
Zoo           101         16                    7
Ionosphere    351         34                    2
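The recipe for the synthetic data described in this subsection (cluster sizes, means and standard deviations drawn uniformly from fixed ranges, then points drawn from per-dimension normal distributions) can be sketched as follows. This is not the Handl and Knowles generator and will not reproduce 2D-4C or 10D-4C; the function names, the seed and the independence of the dimensions are assumptions made for the illustration.

```python
import random


def make_cluster(dim, rng):
    """Draw one cluster: size on [50, 450], mean on [-10, 10], std on [0, 5]."""
    size = rng.randint(50, 450)
    mean = [rng.uniform(-10.0, 10.0) for _ in range(dim)]
    std = [rng.uniform(0.0, 5.0) for _ in range(dim)]
    # Assumption: dimensions are sampled independently of each other.
    return [[rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(size)]


def make_dataset(dim, n_clusters, seed=0):
    """Return the points of all clusters together with their true labels."""
    rng = random.Random(seed)
    data, labels = [], []
    for c in range(n_clusters):
        points = make_cluster(dim, rng)
        data.extend(points)
        labels.extend([c] * len(points))
    return data, labels


data, labels = make_dataset(dim=2, n_clusters=4)
print(len(data), sorted(set(labels)))
```

Because sizes and spreads are drawn independently per cluster, the resulting clusters differ in both size and density, which is the property the synthetic benchmarks are meant to exercise.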
5.2 Parameter investigation
Parameter selection is an important part of optimization-based approaches. In this subsection we present the results of our investigation into the impact of some key parameters, based on the guidelines in Section 3.2, and assign initial values to them.
5.2.1 Chemotaxis step size C(i)
In BF, $C(i)$ is the size of the chemotaxis step and can be initialized with biologically motivated values. However, a biologically motivated value may not be the best for an engineering application (Passino 2002), so it should be chosen according to our data clustering tasks. Figure 2 illustrates the relationship between the objective function and the number of chemotactic steps $N_c$ for different $C(i)$. Figure 2 shows that the smaller the chemotaxis step size $C(i)$, the faster the objective function converges. Since SSE reaches its smallest value at $C(i) = 0.1$, we select 0.1 as the value of $C(i)$ for the proposed BF-C algorithm in the following tasks.
5.2.2 Chemotactic step Nc and swim step Ns
Large values of $N_c$ result in many chemotactic steps and, hopefully, more optimization progress, but also in more computational complexity. Figure 3 presents the relationship between the objective function and the number of chemotactic steps
Fig. 2 Performance of BF-C for Iris data with different C(i) (SSE versus Nc, for C(i) = 0.05, 0.1, 0.15 and 0.2)
Fig. 3 Performance values for the five different swim step sizes for Nc from 1 to 100 (SSE versus Nc, for Ns = 2, 3, 4, 5 and 6)
$N_c$ for different lifetimes $N_s$ of the bacteria. As is evident, the smaller the swim step $N_s$, the faster the objective function converges. Figure 3 also shows that BF-C converges to the smallest SSE at $N_s = 4$ and $N_s = 6$, and that the objective function converges faster at $N_s = 4$. We thus choose $N_s = 4$ and $N_c = 100$ in our data clustering tasks.
5.2.3 Reproduction step Nre and elimination-dispersal step Ned
If $N_c$ is large enough, the value of $N_{re}$ affects how the algorithm ignores bad regions and focuses on good ones. If $N_{re}$ is too small, the algorithm may converge prematurely; however, larger values of $N_{re}$ clearly increase computational complexity. A low value of $N_{ed}$ means that the algorithm will not rely on random elimination-dispersal events to find favorable regions. A high value increases computational complexity but allows the bacteria to look in more regions for good nutrient concentrations. Figures 4 and 5 depict the values of the objective function (SSE) and the corresponding elapsed times in experiments with $N_{re}$ from 2 to 6 and $N_{ed}$ from 1 to 5. Figures 4 and 5 show that the larger $N_{re}$ or $N_{ed}$ is, the more slowly BF-C converges. Moreover, SSE changes only slightly beyond $N_{re} = 4$ and $N_{ed} = 2$, while the elapsed times increase significantly. Based on these results, we choose $N_{re} = 4$ and $N_{ed} = 2$ in our applications.
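As a quick check on the cost of these choices, the loop size $N_{iter}$ defined in Section 3.2 can be evaluated with the values selected so far. The second figure is an upper bound that assumes every chemotactic step uses all $N_s$ swim moves, which is a worst case introduced here for illustration.

```latex
N_{iter} = N_c \times N_{re} \times N_{ed} = 100 \times 4 \times 2 = 800,
\qquad
N_{iter} \times N_s = 800 \times 4 = 3200 .
```

Each bacterium therefore performs 800 chemotactic steps and at most 3,200 swim moves per run.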
Fig. 4 SSE and the computing time of BF-C for Iris data with different Nre
5.2.4 Elimination-dispersal probability ped
In BF, if $p_{ed}$ is large, the algorithm can degrade to random exhaustive search. However, an appropriate choice of $p_{ed}$ can help the algorithm jump out of local optima and into a global optimum. Figure 6 shows the relationship between objective
Fig. 5 SSE and the computing time of BF-C for Iris data with different Ned
Fig. 6 Performance values for the eight different elimination-dispersal probabilities (SSE versus ped, for ped from 0.05 to 0.4)
function values and different ped. Apparently, BF-C attains the smallest SSE value at ped = 0.25.

5.2.5 Other parameters

For PSO, we use 50 particles and set w = 0.72 and c1 = c2 = 1.49. These values were chosen to ensure good convergence (van den Bergh 2002). For ACO, the authors have designed some techniques to set the parameters for optimal performance (Handl et al. 2006). In our implementation we follow the same settings and choose 10 ants and 1,000 iteration steps. For BF-C, based on the investigations in the previous subsections, we choose S = 50, Nc = 100, Ns = 4, Nre = 4, Ned = 2 and ped = 0.25.

5.3 Results and analysis

For all the results reported, the average values of the performance indices over 30 simulations and their standard deviations (shown in brackets) are given for each data set. Euclidean distance is used to measure the distance between data points. Each performance value is followed by the rank of the corresponding algorithm (from 1 to 3). Table 2 summarizes the clustering results obtained by the k-means, ACO, PSO and the proposed BF-C algorithms for the different data sets. From the clustering results shown in Table 2, and according to the properties of the data sets described in Table 1, the following conclusions can be drawn:

(1) It is apparent that, in terms of the external validity measures (Rand and Jaccard indices), the performance of the proposed BF-C algorithm is better for most of the
Table 2 Values of performance measures by the k-means, ACO, PSO and the proposed BF-C algorithms (standard deviations in brackets)

| Data set | Method | Rand | Jaccard | β | Dis = Intra/Inter | Time (s) |
|---|---|---|---|---|---|---|
| 2D-4C | k-means | 0.8636 (0.010365) | 0.8021 (0.017803) | 11.1319c (0.684979) | 0.01079 (0.289137) | 0.23058a (0.011049) |
| 2D-4C | PSO | 0.9941a (0.0061857) | 0.9778a (0.0094362) | 12.565b (0.669946) | 0.00998a (0.598628) | 10.87969 (0.419201) |
| 2D-4C | ACO | 0.9916c (0.031725) | 0.9558c (0.0107) | 1.3874 (0.051462) | 0.01012c (0.004398) | 10.19844c (1.74367) |
| 2D-4C | BF-C | 0.9920b (0.0027578) | 0.9702b (0.0101823) | 13.249a (0.329451) | 0.01006b (0.002997) | 6.7922b (0.27472) |
| 10D-4C | k-means | 0.8946c (0.03401184) | 0.7203c (0.0138685) | 2.264b (0.052886) | 0.0973 (0.080985) | 0.069018a (0.023524) |
| 10D-4C | PSO | 0.8763 (0.0393151) | 0.6924 (0.0893076) | 2.1989c (0.06987) | 0.09693c (0.434701) | 27.71563 (0.919202) |
| 10D-4C | ACO | 0.9239b (0.050278) | 0.76142b (0.035419) | 1.1102 (0.034022) | 0.096711b (0.062165) | 19.2719c (5.57154) |
| 10D-4C | BF-C | 0.9319a (0.011764) | 0.8187a (0.00219203) | 2.2968a (0.040214) | 0.09202a (0.044709) | 18.28282b (0.902037) |
| Iris | k-means | 0.8737c (0.135340) | 0.6823c (0.096661) | 7.8405c (0.602076) | 0.02422c (0.02267) | 0.00625a (0.004941) |
| Iris | PSO | 0.9195b (0.0427095) | 0.7828b (0.0717713) | 8.3579b (0.374798) | 0.02243b (0.005974) | 6.753125 (0.07683) |
| Iris | ACO | 0.8254 (0.008045) | 0.6547 (0.046406) | 1.6159 (0.53276) | 0.03104 (0.02067) | 5.5256c (0.48719) |
| Iris | BF-C | 0.9341a (0.0103238) | 0.8180a (0.0248901) | 9.1295a (0.369183) | 0.02111a (0.004837) | 2.9344b (0.058962) |
| Wine | k-means | 0.7170c (0.00675452) | 0.4127 (0.00306349) | 7.3745c (0.398942) | 0.02652b (0.002768) | 0.008235a (0.077548) |
| Wine | PSO | 0.7307b (0.0118794) | 0.4312b (0.01004092) | 7.6108b (0.358704) | 0.02727c (0.003944) | 19.92031 (0.180339) |
| Wine | ACO | 0.683959 (0.0107) | 0.424734c (0.007082) | 1.012184 (0.0061833) | 0.030469 (0.008435) | 11.0644c (0.3288) |
| Wine | BF-C | 0.7516a (0.00289913) | 0.4494a (0.00282842) | 7.9366a (0.270582) | 0.0259a (0.001703) | 7.86721b (0.140304) |
| Glass | k-means | 0.7047b (0.0122868) | 0.2676b (0.029821) | 3.1188b (0.211435) | 0.03647b (0.018903) | 0.034375a (0.07548) |
| Glass | PSO | 0.5409 (0.0636369) | 0.1902 (0.0543058) | 2.4245c (0.127317) | 0.04097 (0.03715) | 15.29531 (0.699914) |
| Glass | ACO | 0.6353c (0.0395196) | 0.2699c (0.0358161) | 1.009839 (0.004898) | 0.037285c (0.026172) | 13.9375c (0.40873) |
| Glass | BF-C | 0.7376a (0.0127279) | 0.2765a (0.0208597) | 3.5644a (0.072933) | 0.03171a (0.007819) | 11.62502b (0.203714) |
| Zoo | k-means | 0.7998 (0.0484368) | 0.3758 (0.116199) | 4.1048c (0.151947) | 0.01821 (0.003507) | 0.01875a (0.0010546) |
| Zoo | PSO | 0.8525c (0.372645) | 0.4768c (0.0714178) | 4.6966b (0.26326) | 0.01757c (0.009615) | 44.23438 (3.770243) |
| Zoo | ACO | 0.8829b (0.0231966) | 0.6867b (0.0541532) | 1.02699 (0.0168503) | 0.01691b (0.0034002) | 17.6563c (0.75973) |
| Zoo | BF-C | 0.9210a (0.0311569) | 0.6977a (0.0448654) | 5.9665a (0.093465) | 0.01538a (0.003154) | 11.38752b (0.552599) |
| Ionosphere | k-means | 0.5877c (0.0012882) | 0.4323c (0.0043657) | 1.3405c (0.07535433) | 0.75862c (0.0738499) | 0.046875a (0.02578) |
| Ionosphere | PSO | 0.5921b (0.0013718) | 0.4261 (0.0067175) | 1.3516b (0.04071) | 0.75771b (0.081232) | 45.1c (1.094375) |
| Ionosphere | ACO | 0.5398 (0.000967) | 0.5384a (0.0015278) | 1.3114 (0.0034079) | 0.76174 (0.036454) | 53.6571 (4.6563) |
| Ionosphere | BF-C | 0.5989a (0.00120208) | 0.44390b (0.00314662) | 1.3528a (0.042282) | 0.75413a (0.023309) | 36.39376b (0.476011) |

a Rank 1; b Rank 2; c Rank 3
data sets (namely 10D-4C, Iris, Wine, Glass and Zoo, and Rand for Ionosphere), whereas PSO gives the better result for the 2D-4C data set and ACO attains the best Jaccard value for Ionosphere. These results show that, with the help of global and chaotic search, the proposed BF-C methodology can reach the global optimal solutions, which overcomes a shortcoming of the k-means algorithm. Meanwhile, as an optimization-based clustering algorithm, BF-C comes closer to the optimal points and exhibits better convergence than the ACO and PSO techniques.
(2) Furthermore, for the Glass and Zoo data, the proposed BF-C approach yields an evident improvement in Rand and Jaccard values over the other three algorithms. Considering the descriptions of the Glass and Zoo data sets, we can conclude that the BF-C algorithm is much more effective for multi-cluster data sets and is superior for clusters of different scales.
(3) The proposed BF-C algorithm achieves the best β index, and the ACO-based clustering algorithm gets the smallest (worst) values, for all data sets. It is clear that the BF-C algorithm is well suited to data sets without any prior information, which will be helpful in real-life clustering applications.
(4) Of the intra-cluster and inter-cluster distances, the former measures how compact the clusters are, with little deviation from the cluster centers, while the latter measures the separation between different clusters. The Dis index is the ratio of the intra-cluster distance to the inter-cluster distance and should be minimized. By this criterion, the BF-C algorithm most often finds clusters with a smaller Dis value than the k-means algorithm and the ant-based and PSO-based approaches, although the PSO algorithm performs best for the 2D-4C data set.
(5) The standard deviations of the different measures obtained by the different methods are shown in brackets. The stability of BF-C over the different data sets can be seen from its smaller standard deviations of the Rand, Jaccard and Dis indices. ACO has smaller standard deviations of β, but this is due to its poor performance on β. The k-means algorithm gives a smaller standard deviation of computing time because of its low computational complexity. These comparisons indicate that the results of the BF-C algorithm vary less across experiments, and that BF-C is a more stable clustering technique than the k-means, PSO and ACO-based clustering algorithms.
(6) The CPU (execution) times, in seconds, needed by the algorithms are also given in the table for comparison. All the experiments were performed on a Dell machine with an Intel Core(TM)2 Duo CPU (2.53 GHz clock speed, 2 GB memory) running Windows XP. The algorithms were implemented in Matlab. It is apparent that the k-means algorithm performs best in terms of running time but mostly gives the worst quality. BF-C is the second fastest algorithm and performs best in quality. Thus, each swarm-based algorithm adds a refined search to k-means. In many cases, however, we are more concerned with the quality of a solution, especially when handling high-dimensional massive data. Since swarm-based algorithms start with a group of points, there is implied parallelism in these approaches. We can therefore also improve the implementation of BF-C to decrease the elapsed time, for example by running the algorithm on distributed processors.
(7) In the last rows of Table 2, the four algorithms perform close to each other, although BF-C shows a slight improvement over the k-means algorithm in Rand, β and Dis. That is to say, the three heuristic algorithms used in this paper (PSO, ACO and BF-C) are not clearly superior to k-means on the Ionosphere data set, although they do better on the other data sets. We thus hold the view that there is no algorithm that is superior to all other algorithms over all kinds of data sets. This can be explained by the No Free Lunch Theorem for optimization (Wolpert and Macready 1997): "...if some algorithm's performance is superior to that of another algorithm over some set of optimization problems, then the reverse must be true over the set of all other optimization problems."
(8) On average, for most of the data sets and in terms of all the cluster validity measures, BF-C either outperforms the other methods or is close to the best results produced by the other algorithms.
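As a reminder of what the external indices in the comparison above measure, the following sketch computes the Rand and Jaccard indices by pair counting, together with a Dis-style ratio. The Rand and Jaccard formulas are the standard pair-counting definitions. The `dis_index` function is only an assumed reading of the intra/inter ratio (mean distance of each point to its own center, divided by the smallest distance between two centers); the exact definitions used for Table 2 are not restated in this section, and all function names are illustrative.

```python
import math
from itertools import combinations


def pair_counts(truth, pred):
    """Classify every pair of points by agreement between two labelings.

    a: same class and same cluster      b: same class, different cluster
    c: different class, same cluster    d: different class and cluster
    """
    a = b = c = d = 0
    for i, j in combinations(range(len(truth)), 2):
        same_class = truth[i] == truth[j]
        same_cluster = pred[i] == pred[j]
        if same_class and same_cluster:
            a += 1
        elif same_class:
            b += 1
        elif same_cluster:
            c += 1
        else:
            d += 1
    return a, b, c, d


def rand_index(truth, pred):
    """Fraction of pairs on which the two labelings agree."""
    a, b, c, d = pair_counts(truth, pred)
    return (a + d) / (a + b + c + d)


def jaccard_index(truth, pred):
    """Like Rand, but pairs separated in both labelings (d) are ignored."""
    a, b, c, _ = pair_counts(truth, pred)
    return a / (a + b + c)


def dis_index(points, labels, centers):
    """ASSUMED definition: mean point-to-own-center distance over the
    minimum center-to-center distance; smaller means tighter, better
    separated clusters."""
    intra = sum(math.dist(x, centers[l]) for x, l in zip(points, labels)) / len(points)
    inter = min(math.dist(u, v) for u, v in combinations(centers, 2))
    return intra / inter
```

For example, with true classes `[0, 0, 1, 1]` and predicted clusters `[0, 0, 1, 2]`, five of the six pairs agree, so the Rand index is 5/6, while the Jaccard index is 1/2 because only one of the two same-class pairs is kept together.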
To sum up, the BF-C algorithm is an effective method that achieves satisfactory results for complicated data sets (data sets with different sizes and multiple clusters). These properties make BF-C suitable for solving clustering problems in real life.
6 Conclusion and future work

This paper presents an efficient clustering algorithm based on Bacterial Foraging optimization. The clustering problem is converted into that of seeking the center of each cluster by optimizing the objective function. Numerical experimental comparisons show that the proposed BF-C algorithm can achieve high quality on multi-dimensional real data sets and can detect clusters with different shapes and densities, multiple clusters or isolated points, with encouraging results. There is a wide variety of fruitful research directions. Data clustering can be applied to numerous real-life problems. In the near future we will focus on detecting clusters of Web users with the help of the BF-C algorithm. We also plan to apply our methodology to DNA sequence clustering, which has become an important research area in bioinformatics.
Acknowledgements We would like to thank the editor and all the reviewers for their great support of our work. Our study is also supported by the National Basic Research Program of China (973 Program) (2007CB311203), the National Natural Science Foundation of China (Grant No. 60805043, 60821001), the Beijing Natural Science Foundation (Grant No. 4092029), the Huo Ying-Dong Education Foundation of China (Grant No. 121062), and the Foundation for the Author of National Excellent Doctoral Dissertation of PR China (FANEDD) (Grant No. 200951).
References

Arthur, D., & Vassilvitskii, S. (2007). k-means++: The advantages of careful seeding. In N. Bansal, K. Pruhs, & C. Stein (Eds.), Proc. of the eighteenth annual ACM-SIAM symposium on discrete algorithms, SODA (pp. 1027–1035).
Bezdek, J. C. (1981). Pattern recognition with fuzzy objective function algorithms (pp. 95–107). New York: Plenum Press.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1), 1–38.
Dhillon, I. S., Guan, Y., & Kulis, B. (2005). A unified view of kernel k-means, spectral clustering and graph partitioning. Technical Report TR-04-25, UTCS.
Dhillon, I. S., Guan, Y., & Kulis, B. (2007). Weighted graph cuts without eigenvectors: A multilevel approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(11), 1944–1957.
Dorigo, M., & Maniezzo, V. (1996). Ant system: Optimization by a colony of cooperating agents. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 26(1), 29–41.
Engelbrecht, A. P. (2002). Computational intelligence: An introduction. New York: Wiley.
Filippone, M., Camastra, F., Masulli, F., & Rovetta, S. (2008). A survey on spectral and kernel methods for clustering. Pattern Recognition, 41(1), 176–190.
Guha, S., Rastogi, R., & Shim, K. (1998). Cure: An efficient clustering algorithm for large databases. In Proceedings of ACM SIGMOD conference on management of data (pp. 73–84).
Guney, K., & Basbug, S. (2008). Interference suppression of linear antenna arrays by amplitude-only control using a bacterial foraging algorithm. Progress in Electromagnetics Research, 79, 475–497.
Handl, J., & Knowles, J. (2008). Cluster generators: Synthetic data for the evaluation of clustering algorithms. http://dbkgroup.org/handl/generators/.
Handl, J., Knowles, J., & Dorigo, M. (2006). Ant-based clustering and topographic mapping. Artificial Life, 12(1), 35–62.
Hinneburg, A., & Keim, D. (1998). An efficient approach to clustering in large multimedia databases with noise. In Proceedings of the 4th international conference on knowledge discovery and data mining (KDD-98) (pp. 58–65).
Hruschka, E., Campello, R., & de Castro, L. (2006). Evolving clusters in gene-expression data. Information Sciences, 176(13), 1898–1927.
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 264–323.
Kanungo, T., Mount, D. M., Netanyahu, N. S., Piatko, C. D., Silverman, R., & Wu, A. Y. (2004). A local search approximation algorithm for k-means clustering. Computational Geometry, 28(2–3), 89–112.
Kennedy, J., & Eberhart, R. (1995). Particle swarm optimization. In Proceedings of the IEEE international joint conference on neural networks (ICW) (Vol. 4, pp. 1942–1948). Perth, Australia.
Kim, D. H., Abraham, A., & Cho, J. H. (2007). A hybrid genetic algorithm and bacterial foraging approach for global optimization. Information Sciences, 177(18), 3918–3937.
Kim, D. H., & Cho, J. H. (2005). Bacterial foraging based neural network fuzzy learning (pp. 2030–2036). IICAI.
Li, L., Yang, Y., Peng, H., & Wang, X. (2006). An optimization method inspired by chaotic ant behavior. International Journal of Bifurcation and Chaos, 16, 2351–2364.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley symposium on mathematical statistics and probability (pp. 281–297).
Mishra, S., & Bhende, C. N. (2007). Bacterial foraging technique-based optimized active power filter for load compensation. IEEE Transactions on Power Delivery, 22(1), 457–465.
Ng, R. T., & Han, J. (1994). Efficient and effective clustering methods for spatial data mining. In Proceedings of the 20th international conference on very large data bases conference (pp. 144–155).
Pal, S. K., Ghosh, A., & Uma Shankar, B. (2000). Segmentation of remotely sensed images with fuzzy thresholding and quantitative evaluation. International Journal on Remote Sensing, 21(11), 2269–2300.
Passino, K. M. (2002). Biomimicry of bacterial foraging for distributed optimization and control. IEEE Control Systems Magazine, 22(3), 52–67.
Sheikholeslami, G., Chatterjee, S., & Zhang, A. D. (1998). WaveCluster: A multi-resolution clustering approach for very large spatial databases. In Proceedings of the 24th international conference on very large data bases (pp. 428–439).
Shelokar, P. S., Jayaraman, V. K., & Kulkarni, B. D. (2004). An ant colony approach for clustering. Analytica Chimica Acta, 509, 187–195.
Tan, P.-N., Steinbach, M., & Kumar, V. (2006). Introduction to data mining. Reading, MA: Addison-Wesley.
Theodoridis, S., & Koutroumbas, K. (2006). Pattern recognition (3rd ed.). New York: Academic.
UCI Machine Learning Repository (2007). http://archive.ics.uci.edu/ml/index.html. Univ. of California, Irvine, Dept. of Information and Computer Science, Center for Machine Learning and Intelligent Systems.
van den Bergh, F. (2002). An analysis of particle swarm optimizers. PhD Thesis, Department of Computer Science, University of Pretoria, Pretoria, South Africa.
van der Merwe, D. W., & Engelbrecht, A. P. (2003). Data clustering using particle swarm optimization. In Proceedings of IEEE congress on evolutionary computation (pp. 215–220).
Wan, M., Li, L., Xiao, J., Yang, Y., Wang, C., & Guo, X. (2010). CAS based clustering algorithm for web users. Nonlinear Dynamics, 61(3), 347–361.
Wolpert, D. H., & Macready, W. G. (1997). No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1), 67–82.
Zhang, J., & Leung, Y. (2004). Improved possibilistic C-means clustering algorithms. IEEE Transactions on Fuzzy Systems, 12(2), 209–217.
