
SS-4302: Artificial Intelligence

STUDENT'S SOLUTION

ASSIGNMENT 1: CLUSTERING WITH R


In this task we follow a step-by-step tutorial on how to perform data mining on a dataset
using the R program.

1. Data Preparation
For the assignment, we are told to use the mtcars dataset. The mtcars dataset is as follows:

I store the mtcars data in a variable called "mydata":


mydata <- mtcars
Next, we adjust the dataset by removing missing data and scaling the variables so that they can
be compared later:
mydata <- na.omit(mydata) # listwise deletion of missing
mydata <- scale(mydata) # standardize variables
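As a quick check (my own addition, not part of the tutorial), we can confirm that the scaling worked: after scale(), every column should have mean 0 and standard deviation 1.
round(colMeans(mydata), 10)   # column means are (numerically) zero
apply(mydata, 2, sd)          # column standard deviations are all 1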

2. Determine the number of clusters


The next step is to decide how many clusters are needed for the clustering information to be
useful. Choosing the number of clusters is especially important for the k-means partitioning
algorithm used later.

One way to decide is to plot the within-groups sum of squares against the number of clusters. We can
do this in R with:
wss <- (nrow(mydata)-1)*sum(apply(mydata, 2, var))   # total within-group SS for 1 cluster
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers=i)$withinss)
plot(1:15, wss, type="b",
     xlab="Number of Clusters", ylab="Within groups sum of squares")
The resulting graph will be:

From the graph we can determine a suitable number of clusters for the dataset using the Elbow
Method: we look for the point where adding another cluster no longer reduces the within-groups sum
of squares by much (the point where the curve forms an 'elbow') and take that as the optimum
number of clusters. On our graph, that point is at 5 clusters.
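As a rough numerical cross-check of the elbow (a sketch I added, not part of the tutorial), we can look at how much the within-groups sum of squares drops with each additional cluster; the drops become small once we pass the elbow.
wss_drop <- -diff(wss)   # reduction in WSS gained by each additional cluster
round(wss_drop, 2)       # large drops early on, small drops after the elbow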

3. Clustering Algorithm
1. Partitioning with kmeans()
We can create the clusters with the k-means algorithm in R as follows:
# K-Means Cluster Analysis
fitk <- kmeans(mydata, 5) # 5 cluster solution
# get cluster means
aggregate(mydata,by=list(fitk$cluster),FUN=mean)
# append cluster assignment
mydata_k <- data.frame(mydata, fitk$cluster)   # kept separate so 'mydata' stays as the scaled variables for the later steps

Note that we set the number of clusters to be generated to 5, according to the graph we obtained
earlier. 'fitk' is the variable that stores our cluster data.
We can see the cluster assignment of each observation with:
fitk$cluster
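A quick way (my own addition) to see the size of each group is to tabulate the assignments:
table(fitk$cluster)   # number of cars assigned to each of the 5 clusters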

2. Hierarchical Agglomerative with hclust()


In R, we use the hclust() function to create the clusters for the Hierarchical Agglomerative algorithm:
# Ward Hierarchical Clustering
d <- dist(mydata, method = "euclidean")   # distance matrix
fitha <- hclust(d, method = "ward")       # note: recent R versions call this method "ward.D"
plot(fitha)                               # display dendrogram
groups <- cutree(fitha, k = 5)            # cut tree into 5 clusters
# draw dendrogram with red borders around the 5 clusters
rect.hclust(fitha, k = 5, border = "red")
'fitha' is the variable that stores the cluster data. This will also draw a dendrogram with red boxes
marking the 5 clusters:

We can also show the cluster groups, just as we did with k-means, using:
cutree(fitha, k = 5) #cutting the tree to 5 clusters
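To see how this hierarchical solution lines up with the k-means solution (a small cross-check I added, not part of the tutorial), we can cross-tabulate the two label vectors:
groups <- cutree(fitha, k = 5)                 # hierarchical cluster labels
table(kmeans = fitk$cluster, hier = groups)    # how the two clusterings overlap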

We can also show a dendrogram with p-values for each hierarchical cluster. We do this using
the pvclust package. However, this package clusters the columns rather than the rows, while our
'mydata' is organised by rows, so before using the package we must transpose mydata with:
mydatat <- t(mydata)
Then:

# Ward Hierarchical Clustering with Bootstrapped p-values
library(pvclust)
fitha2 <- pvclust(mydatat, method.hclust = "ward",
                  method.dist = "euclidean")
plot(fitha2)   # dendrogram with p-values
# add rectangles around groups highly supported by the data
pvrect(fitha2, alpha = .95)
This produces the graph:
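If we want a list of the strongly supported clusters rather than a picture (my own addition; pvpick() is provided by the pvclust package), we can use:
pvpick(fitha2, alpha = .95)   # clusters whose AU p-value is at least 0.95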

3. Model Based with mclust()


The mclust package in R is used to generate model-based clusters:
# Model Based Clustering
library(mclust)
fitmb <- Mclust(mydata)
plot(fitmb)      # plot results
summary(fitmb)   # display the best model
'fitmb' stores the cluster data, and plot(fitmb) generates 4 different graphs:

We can also check the classification of each observation, as before, using:
fitmb$classification

Note that the algorithm decides that only two clusters are needed for our data. However, we can
'force' Mclust() to generate any number of clusters we want. I set it to 5 clusters so we can
compare it with our previous two algorithms:
fitmb5 <- Mclust(mydata, G = 5)
fitmb5$classification
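As a small extra check (my own addition, not part of the tutorial), we can cross-tabulate this forced 5-cluster model-based solution against the k-means solution to see how much the two agree:
table(mclust5 = fitmb5$classification, kmeans = fitk$cluster)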

4. Plotting cluster Solutions


Now let us look at how our clusters appear in a cluster plot and in a centroid plot. In the code
below, 'draw' is the vector of cluster assignments we want to plot.
# Cluster Plot against 1st 2 principal components
# vary parameters for most readable graph
library(cluster)
clusplot(mydata, draw, color = TRUE, shade = TRUE,
         labels = 2, lines = 0)
# Centroid Plot against 1st 2 discriminant functions
library(fpc)
plotcluster(mydata, draw)
When draw = fitk$cluster (k-means)

When draw = cutree(fitha, k = 5) (Hierarchical Agglomerative)

When draw = fitmb$classification (Model based, 2 clusters)

When draw = fitmb5$classification (Model based, forced to 5 clusters)

5. Validating cluster Solutions


Finally, we can also compare the similarity of two cluster solutions, or get the statistics of a single cluster solution, using:
# comparing 2 cluster solutions
library(fpc)
dmat <- dist(mydata)        # dissimilarity (distance) matrix of the scaled data
cluster.stats(dmat, a, b)   # where a and b are cluster assignment vectors
I set a variable for each cluster solution as follows:

a <- fitk$cluster             # k-means
b <- cutree(fitha, k = 5)     # Hierarchical, cut to 5 clusters
c <- fitmb$classification     # Model based
d <- fitmb5$classification    # Model based, forced to 5 clusters
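As a sketch of how a pairwise comparison can be read (my own addition; corrected.rand and vi are fields returned by fpc's cluster.stats()), we can store one comparison and look at two agreement measures:
cs_ab <- cluster.stats(dmat, a, alt.clustering = b)   # compare k-means with the hierarchical solution
cs_ab$corrected.rand   # adjusted Rand index: closer to 1 means stronger agreement
cs_ab$vi               # variation of information: closer to 0 means stronger agreement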

6. Investigation
Using the within sum of square graph, how do you determine the number of clusters?

As I mentioned in step 2, we can use the within-groups sum of squares graph to find the number of
clusters that gives a good result. Using the Elbow Method, we look for the point on the graph where
adding another cluster no longer changes the value much (the point where the curve forms an
'elbow') and take that as the optimum number of clusters. On our graph, that point is at 5 clusters.
Are the solutions from K-means always the same? Why?
No. In the k-means algorithm, the k initial cluster centres are selected randomly each time it is run,
so running the same algorithm twice on the same dataset with the same k will most probably generate
different (though similar) results. This is known as the initialization problem.
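A short sketch of how this is usually handled in practice (my own addition, not part of the assignment): fixing the random seed makes a run reproducible, and the nstart argument of kmeans() tries several random initializations and keeps the best one.
set.seed(1234)                                   # make the random initialization reproducible
fitk_stable <- kmeans(mydata, 5, nstart = 25)    # 25 random starts, best solution kept
table(fitk_stable$cluster)                       # cluster sizes of the chosen solution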
Compare the clustering results. Explain your findings.
Using the information from step 5, I ran cluster.stats() on the 4 clustering results obtained earlier:
k-means with 5 clusters, Hierarchical Agglomerative cut at 5 clusters, Model Based with 2 clusters,
and Model Based forced to 5 clusters. Recall that:


a <- fitk$cluster             # k-means
b <- cutree(fitha, k = 5)     # Hierarchical, cut to 5 clusters
c <- fitmb$classification     # Model based
d <- fitmb5$classification    # Model based, forced to 5 clusters
To analyze each clustering and compare them, I looked at the statistics of each one:
cluster.stats(dmat, a)
cluster.stats(dmat, b)
cluster.stats(dmat, c)
cluster.stats(dmat, d)

The results are shown and summarized in the table on the next page. The rows marked with a star
are the properties I used to compare and analyze the clustering algorithms.
Cluster Size: the more similar the cluster sizes, the better the quality of the result. For a 5-cluster
solution, k-means has the most similar sizes across all 5 clusters.
Average Distance/Average Between: refers to the distance between points of different clusters
(inter-cluster distance). The higher this distance, the better the quality. From the table, the highest
average goes to Hierarchical Agglomerative (2.485788) and the lowest to k-means (2.463476).
Average Within: refers to the distance between data points within the same cluster (intra-cluster
distance). The smaller this distance, the better the quality. From the table, the lowest average goes
to the Model Based algorithm with 5 clusters (2.135135) and the highest to the Model Based
algorithm with 2 clusters (2.354167).
Entropy: measures how the observations are distributed across the clusters; it is computed from the
proportion of data points falling in each cluster. In this comparison I take lower entropy as the better
result. The lowest entropy according to the table is the Model Based algorithm with 2 clusters
(0.6931472), while the highest is k-means (1.548269).
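For reference, here is a short sketch (my own addition) of how a clustering's entropy can be computed by hand from the cluster proportions; as far as I can tell, this is the quantity cluster.stats() reports, using natural logarithms:
p <- table(a) / length(a)   # proportion of observations in each k-means cluster
-sum(p * log(p))            # entropy of the cluster size distribution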


What are your personal thoughts about the three different clustering algorithms?


For the k-means algorithm, the data is partitioned into groups in a single level, so it has good
runtime performance. Because of this it seems to be a quick and efficient way to cluster a large
dataset. However, to make sure the result is useful, the right number of clusters must be chosen
carefully, so the quality of the result is not as good as with the other available algorithms.
Hierarchical Agglomerative clustering takes longer because it builds the clusters over more than one
level, so for a very large dataset it will take more time. But the quality of the clustered data is better
than with k-means.
For the Model Based algorithm, I feel it produces more of an approximate clustering of the dataset.
During this lab it decided that 2 clusters were best, but we saw from the sum of squares graph that 5
is the better choice of cluster number. So the data quality is not as good as Hierarchical
Agglomerative.

