
PROJECT 2

K-MEANS CLUSTERING
& THE IRIS PLANT DATASET
Monique Kirkman-Bey
ELEN-857: Advanced Pattern Recognition Methods

OCTOBER 7, 2015


Abstract
In this work, the K-means clustering algorithm is applied to Fisher's Iris Plant Dataset. The dataset is
known to include 3 classes of Iris plant data - Setosa, Virginica, and Versicolor - one of which is linearly
separable from the other two. To assess the capabilities of the clustering algorithm, it is applied to the
dataset with a varied number of initial centers and stopping thresholds. It will be shown that the K-means
clustering algorithm is capable of perfectly separating the Setosa dataset from the other two, as
expected, and is able to achieve acceptable recognition of the other two plant species.

Methodology
The K-means algorithm is an unsupervised algorithm that attempts to cluster data into groups based on
a chosen similarity measure. In this work, the similarity measure of choice is Euclidean distance. To
create the clusters, the K-means algorithm iteratively implements the following steps:
1. Initialize - Initialize the center of each cluster.
2. Distribute data points - Assign each data point to the cluster whose center is the smallest
distance from the data point.
3. Compute new cluster centers - Set the position of each cluster center to the mean of all data
points belonging to that cluster.
4. Compare new centers to old centers - If the new centers are the same as the old centers, then
the algorithm converges. The clusters and centers computed in step 3 are the final clusters and
centers. If they are not the same, return to step 2.
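In symbols, with Euclidean distance as the similarity measure and $z_k$ denoting the center of cluster
$C_k$, steps 2 and 3 amount to

$$C_k = \{\, x_n : \|x_n - z_k\| \le \|x_n - z_j\| \ \forall j \,\}, \qquad z_k^{\text{new}} = \frac{1}{|C_k|} \sum_{x_n \in C_k} x_n.$$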
In reality, it may not always be possible to find centers that do not change from iteration to iteration. In
other words, this algorithm may not always converge to a fixed solution; some datasets may lead to
centers that oscillate between two values, for example. So, to avoid an infinite loop when iterating
through the algorithm, a threshold is used as an additional stopping condition. When the new cluster
centers are identified at the end of each iteration, the amount of change in the centers is also taken into
consideration. This is done by measuring the distance between the new centers and the old centers. If this
distance is less than the threshold, the algorithm is considered to have converged.
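Stated in the same notation, the threshold-based rule stops the algorithm at iteration $t$ when every
center satisfies $\|z_k^{(t+1)} - z_k^{(t)}\| < \theta$, where $\theta$ is the chosen threshold (0.01 or
0.1 in this work). As discussed in the next section, the implementation in Appendix A applies an
analogous test to the change in the norm of the sample-to-center distance matrix.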

Experimental Setup
This project was broken up into two tasks. First, the K-means algorithm was coded into a general
function so that the number of centers and the threshold value could be easily varied. Next, a shell
script was created to call the function iteratively for each of the three k values and two threshold
values. The results of each run were saved to a cell array.
The K-means code from homework 2 was modified into a more general K-means function, which can be
reviewed in Appendix A. In addition to making the function more general, the convergence test step was
modified to include the threshold as a stopping condition for the algorithm. The function takes the
following inputs: data, number of centers, and threshold value. Given these parameters, the Kmeans.m
function returns two cell arrays: the centers per iteration and the assigned classes per iteration. More
in-depth consideration of the pertinent steps is presented below.
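For example, a single run on the Iris feature matrix x (loaded as in Appendix B) might look like the
sketch below; the variable names here are illustrative and not part of the function itself.

[centers, clusters] = Kmeans(x, 3, 0.01); %3 centers, threshold of 0.01
finalCenters = centers{end};  %centers at convergence
finalClasses = clusters{end}; %cluster assignments at convergence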


Initializing the centers

An important consideration in the K-means algorithm is the choice of initial centers. Since this is an
unsupervised algorithm, I chose to use sample values as the initial centers. However, instead of using
the first k sample values, I decided to choose the centers at random for each run. This was achieved
using the code snippet below.

Z = x(randi([1,numSamples],centers,1),:); %k initial centers
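Note that randi draws indices with replacement, so two initial centers can coincide by chance. A sketch
of an alternative (not the approach used in this work) that guarantees k distinct initial samples would
use randperm instead:

Z = x(randperm(numSamples,centers),:); %k distinct initial centers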

Implementing the threshold as a stopping condition

To avoid an infinite loop when the Kmeans function is called, step 4 of the algorithm was modified to
include the threshold as a stopping condition. First, the new centers are compared to the old centers;
if they match exactly, the algorithm has converged. If they do not match, the norm of the sample-to-center
distance matrix, computed with the norm function in step 2, is compared against its value from the
previous iteration. If the change is less than the threshold, the algorithm is also considered converged.
This is summarized in the code snippet below.
case 4 %Step 4: compare new centers to old centers
    if isequal(Z_iter{m+1},Z_iter{m}) %if new center = current center
        NotEq = 0; %algorithm converges
        break;
    elseif m == 1 %if 1st iter, no previous distance, just proceed
        m = m+1;
        step = 2;
    else %if not 1st iter and Z ~= Znew
        %check stopping conditions
        if abs(Dist_iter{m} - Dist_iter{m-1}) < threshold
            NotEq = 0; %algorithm converges
            break;
        else
            m = m+1;  %new iteration
            step = 2; %go back to step 2
        end
    end
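The quantity being thresholded above is the change in the norm of the full sample-to-center distance
matrix between iterations, which tracks the overall distortion rather than the center movement directly.
An equivalent-in-spirit test based on center movement (a sketch, not the code used in this work) would be:

if norm(Z_iter{m+1} - Z_iter{m}) < threshold %total center movement
    NotEq = 0; %algorithm converges
end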

Confusion Matrix
After calling the Kmeans function, the confusion matrix was generated for each simulation using the
MATLAB confusionmat function. The results can be seen in the subsequent section.
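Each matrix is produced by a call of the form shown below (see the shell in Appendix B), so the rows
correspond to the true species labels and the columns to the cluster indices assigned by K-means.
Because K-means assigns cluster labels arbitrarily, the column ordering varies from run to run.

CONF = confusionmat(y, clusters{end}); %rows: true species, columns: assigned clusters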

Results
As described in the experimental setup, K-means clustering was applied to the dataset six times: once
for each of the three k values at each of the two threshold values. The results are broken into two
groups and presented based on the chosen threshold value.


Threshold = 0.01
Below, the confusion matrices for the three different choices of the number of initial centers are shown.
In each simulation, the stopping threshold was set to 0.01.
Table 1. K = 2 Confusion Matrix

              1    2
Setosa       50    0
Versicolor    3   47
Virginica     0   50

Table 2. K = 3 Confusion Matrix

              1    2    3
Setosa        0   50    0
Versicolor    2    0   48
Virginica    36    0   14

Table 3. K = 4 Confusion Matrix

              1    2    3    4
Setosa       50    0    0    0
Versicolor    0    0   21   29
Virginica     0   23   26    1

It can be seen that in each case, the Setosa plant species was easily separated from the others.
However, the Versicolor and Virginica datasets were not as easily distinguished from each other as they
were from the Setosa. It is interesting to see that when there were just 2 centers, the Virginica
dataset was perfectly separated from the Setosa. The Versicolor, however, straddles the two clusters:
it is mostly grouped with the Virginica dataset, but a few samples (3 of 50) were clustered with the
Setosa plants.

Threshold = 0.1
Below, the confusion matrices for the three different choices of the number of initial centers are shown.
In each simulation, the stopping threshold was set to 0.1.
Table 4. K = 2 Confusion Matrix

              1    2
Setosa        0   50
Versicolor   47    3
Virginica    50    0

Table 5. K = 3 Confusion Matrix

              1    2    3
Setosa        0   50    0
Versicolor   47    0    3
Virginica    14    0   36

Table 6. K = 4 Confusion Matrix

              1    2    3    4
Setosa       50    0    0    0
Versicolor    0   25   25    0
Virginica     0   17    1   32

Increasing the threshold did not have much of an impact on the final confusion matrices. Although there
is some shifting of the data points, as evidenced by the values shown in the tables, the overall
clustering results are quite similar. In all three runs, the Setosa species was perfectly separated from
the other two species, which, on average, could not be perfectly separated from each other. However,
when there are two centers, the Setosa and Virginica sets are again easily separated from one another,
while the Versicolor is split (unevenly) between the two clusters.

Conclusions
Using MATLAB on a personal computer, the K-means algorithm was applied to the Iris plant dataset. It
was shown that the Setosa dataset was perfectly classified in each case. The other two species -
Versicolor and Virginica - were not as easily separated from each other as they were from the Setosa
plant. These results held after randomly selecting the initial centers, varying the number of centers,
and manipulating the stopping threshold. Since these results are typical for the Iris plant dataset, and
the K-means clustering algorithm reached them consistently, K-means was shown to be a reliable
clustering method.


Appendix A - Kmeans.m
%Monique Kirkman-Bey
%This function takes in a dataset and a number of clusters and returns
%the clustered data after applying the K-means algorithm
%inputs: data, number of clusters, threshold
%outputs: cluster centers per iteration, cluster assignments per iteration
function [clusCenters,clusData] = Kmeans(x,cen,t)
centers = cen;                      %number of centers
threshold = t;                      %stopping threshold
numSamples = size(x,1);             %number of samples
sampleLength = size(x,2);           %dimension of samples
Dist = zeros(numSamples,centers);   %array to hold distances
Class = zeros(numSamples,1);        %array to hold classes
Znew = zeros(centers,sampleLength); %array to hold new centers

%step 1: initialize the centers
Z = x(randi([1,numSamples],centers,1),:); %k initial centers
m = 1;
Z_iter{m} = Z; %save Z values
step = 2;
NotEq = true;

while NotEq
    switch step
        case 2 %step 2: distribute samples to clusters
            Z = Z_iter{m}; %grab current centers
            for k = 1:centers %for each center
                for N = 1:numSamples %for each sample
                    %compute distance between sample and center
                    Dist(N,k) = norm(x(N,:)-Z(k,:));
                end
            end
            Dist_iter{m} = norm(Dist); %norm of the full distance matrix
            for N = 1:numSamples %for each sample
                minDist = min(Dist(N,:)); %get min dist for sample
                [~,j] = find(Dist(N,:) == minDist); %index of min
                Class(N) = j(1); %index = class, save index/class
            end
            Class_Iter{m} = Class; %save class assignments for iteration m
            step = 3;
        case 3 %step 3: compute new centers
            for k = 1:centers %for each center
                C = find(Class == k); %find all samples in class
                zt = zeros(1,sampleLength); %running sum over the class
                for i = 1:size(C,1) %for every sample in class
                    zt = zt + x(C(i),:); %add sample to sum
                end
                Znew(k,:) = zt/size(C,1); %center = sample mean
            end
            Z_iter{m+1} = Znew; %save next centers
            step = 4;
        case 4 %step 4: compare new centers to old centers
            if isequal(Z_iter{m+1},Z_iter{m}) %if new center = current center
                NotEq = 0; %algorithm converges
                break;
            elseif m == 1 %if 1st iter, no previous distance, just proceed
                m = m+1;
                step = 2;
            else %if not 1st iter and Z ~= Znew
                %check stopping conditions
                if abs(Dist_iter{m} - Dist_iter{m-1}) < threshold
                    NotEq = 0; %algorithm converges
                    break;
                else
                    m = m+1;  %new iteration
                    step = 2; %go back to step 2
                end
            end
    end
end
clusCenters = Z_iter;    %return cluster centers per iteration
clusData = Class_Iter;   %return cluster assignments per iteration
end


Appendix B - Proj2Shell.m
%Monique Kirkman-Bey
%Pattern Recognition Project 2 Shell
%October 6, 2015
%This program loads the Iris dataset, then iteratively calls the Kmeans
%function to implement the K-means clustering algorithm for k values of
%2, 3, and 4 cluster centers, using different threshold values.
clear;
iris = csvread('iris.csv'); %load dataset
x = iris(:,1:end-1);        %get all but the class value
y = iris(:,end);            %true class labels
%set simulation parameters
k = [2 3 4];    %numbers of centers
t = [0.01 0.1]; %stopping thresholds
for i = 1:size(k,2)
    for j = 1:size(t,2)
        [centers, clusters] = Kmeans(x,k(i),t(j));
        CONF{i,j} = confusionmat(y,clusters{end});
    end
end
