Alka Arora
I.A.S.R.I., Library Avenue, Pusa, New Delhi-110 012
alkak@iasri.res.in
1. Introduction
Clustering algorithms map data items into clusters such that homogeneous data items fall in the
same group. Objects are grouped on the basis of similarity, so that intra-cluster similarity is
high and inter-cluster similarity is low. Unlike classification and prediction, which analyze
class-labelled data objects, clustering analyzes data objects without class labels and tries to
generate such labels. Many clustering algorithms are available in the literature; the choice of
an appropriate algorithm depends on the data type and the desired results. We focus here on
hierarchical clustering algorithms.
1.1 Hierarchical Algorithms
A hierarchical method creates a hierarchical decomposition of the data objects in the form of a
tree-like diagram called a dendrogram. There are two approaches to building a cluster
hierarchy. The agglomerative approach, also called the bottom-up approach, starts with each
object forming a separate group and successively merges the groups closest to one another until
all the groups are merged into one. The divisive approach, also called the top-down approach,
starts with all the objects in the same cluster and successively splits clusters into smaller
ones until each object is in a cluster of its own.
The process flow of the agglomerative hierarchical clustering method is given below:
1. Convert the object features to a distance matrix.
2. Set each object as a cluster (thus, if we have 6 objects, we will have 6 clusters in the
beginning).
3. Iterate until the number of clusters is 1:
Merge the two closest clusters.
Update the distance matrix.
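The process flow above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure SAS Enterprise Miner uses: the function name `agglomerative`, the six two-dimensional points, and the choice of Euclidean distance with a minimum-distance (single linkage) update are all assumptions made for the example.

```python
from math import dist  # Euclidean distance, Python 3.8+


def agglomerative(points):
    """Naive agglomerative clustering; returns the list of merges.

    Assumes Euclidean distance and a single-linkage (minimum) update.
    """
    n = len(points)

    # Step 1: convert object features to a distance matrix,
    # stored as a dict keyed by (smaller id, larger id).
    d = {(i, j): dist(points[i], points[j])
         for i in range(n) for j in range(i + 1, n)}

    # Step 2: every object starts as its own cluster.
    clusters = {i: [i] for i in range(n)}
    merges = []
    next_id = n

    # Step 3: iterate until only one cluster is left.
    while len(clusters) > 1:
        # Merge the two closest clusters.
        (a, b), height = min(d.items(), key=lambda kv: kv[1])
        members = clusters.pop(a) + clusters.pop(b)
        merges.append((sorted(members), height))

        # Update the distance matrix: keep entries not involving a or b,
        # then add distances from every remaining cluster to the new one.
        new_d = {k: v for k, v in d.items()
                 if a not in k and b not in k}
        for c in clusters:
            to_a = d[(min(c, a), max(c, a))]
            to_b = d[(min(c, b), max(c, b))]
            new_d[(c, next_id)] = min(to_a, to_b)
        d = new_d
        clusters[next_id] = members
        next_id += 1
    return merges


# Six hypothetical objects, so the loop starts with six clusters.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0),
          (5.0, 6.5), (10.0, 0.0), (10.0, 2.0)]
merges = agglomerative(points)
for members, height in merges:
    print(members, round(height, 2))
```

Each recorded merge corresponds to one junction of the dendrogram, and the merge distance gives the height at which the two branches join.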
First, a distance matrix is computed using any valid distance measure between pairs of objects.
The choice of which clusters to merge is then determined by a linkage criterion, which is a
function of the pairwise distances between observations. Commonly used linkage criteria are
listed below:
Clustering using SAS E-Miner
Complete Linkage: The maximum distance between elements of the two clusters.
Single Linkage: The minimum distance between elements of the two clusters.
Average Linkage (UPGMA): The mean distance between elements of the two clusters.
Ward's method: This method is distinct from the other methods because it uses an
analysis of variance approach to evaluate the distances between clusters. In short, it
attempts to minimize the Sum of Squares (SS) of any two (hypothetical) clusters
that can be formed at each step. In general, this method is regarded as very efficient;
however, it tends to create clusters of small size.
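To make the four criteria concrete, the sketch below computes each of them for two small clusters. The function names and the sample points are hypothetical, and Euclidean distance is assumed. For Ward's method the sketch uses the standard identity for the increase in the within-cluster sum of squares caused by a merge, which is the quantity the method minimizes at each step.

```python
from math import dist  # Euclidean distance, Python 3.8+


def complete_linkage(a, b):
    """Maximum distance between an element of a and an element of b."""
    return max(dist(p, q) for p in a for q in b)


def single_linkage(a, b):
    """Minimum distance between an element of a and an element of b."""
    return min(dist(p, q) for p in a for q in b)


def average_linkage(a, b):
    """Mean of all pairwise distances between a and b (UPGMA)."""
    return sum(dist(p, q) for p in a for q in b) / (len(a) * len(b))


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(xs) / len(cluster) for xs in zip(*cluster))


def ward_cost(a, b):
    """Increase in the within-cluster sum of squares if a and b merge."""
    weight = len(a) * len(b) / (len(a) + len(b))
    return weight * dist(centroid(a), centroid(b)) ** 2


# Two hypothetical clusters of two points each.
a = [(0.0, 0.0), (0.0, 2.0)]
b = [(3.0, 0.0), (3.0, 2.0)]

print("single  :", single_linkage(a, b))
print("complete:", complete_linkage(a, b))
print("average :", average_linkage(a, b))
print("ward    :", ward_cost(a, b))
```

In an agglomerative run, the pair of clusters with the smallest value of the chosen criterion is the pair merged at that step.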
2. Example
A baseball manager wants to identify and group players on the team who are very similar with
respect to several statistics of interest. Note that there is no response variable in this example.
The manager simply wants to identify different groups of players. The manager also wants to
learn what differentiates players in one group from players in a different group. The data are
located in the DMABASE data set in the SAMPSIO library. The following table describes the key
variables.
Table 1. Descriptions of Selected Variables in the DMABASE Data Set
Name       Model Role   Measurement Level   Description
NAME       ID           Nominal             Player Name
TEAM       Rejected     Nominal             Team at the end of 1986
POSITION   Rejected     Nominal             Positions played in 1986
LEAGUE     Rejected     Binary              League at the end of 1986
DIVISION   Rejected     Binary              Division at the end of 1986
NO_ATBAT   Input        Interval            Times at Bat in 1986
NO_HITS    Input        Interval            Hits in 1986
NO_HOME    Input        Interval            Home Runs in 1986
NO_RUNS    Input        Interval            Runs in 1986
NO_RBI     Input        Interval            RBIs in 1986
NO_BB      Input        Interval            Walks in 1986
YR_MAJOR   Input        Interval            Years in the Major Leagues
CR_ATBAT   Input        Interval            Career Times at Bat
CR_HITS    Input        Interval            Career Hits
CR_HOME    Input        Interval            Career Home Runs
CR_RUNS    Input        Interval            Career Runs
CR_RBI     Input        Interval            Career RBIs
CR_BB      Input        Interval            Career Walks
NO_OUTS    Input        Interval            Put Outs in 1986
NO_ASSTS   Input        Interval            Assists in 1986
NO_ERROR   Input        Interval            Errors in 1986
SALARY     Rejected     Interval            1987 Salary in Thousands
LOGSALAR   Input        Interval            Log of 1987 Salary in Thousands
For this example, set the model role for TEAM, POSITION, LEAGUE, DIVISION, and
SALARY to rejected. SALARY is rejected because its information is already stored in
LOGSALAR in the data set.
2.1. Setting Up the Clustering Parameters
Step 1: Add the diagram.
Step 2: Add the data source to the diagram.
Step 3: Select the Clustering node from the Explore tab.
Step 4: Select the options for clustering (left frame, lower window):
1. Open the Clustering node.
2. Select the Clusters tab.
3. Select Selection Criterion in the Number of Clusters section.
4. Type 3 for the Maximum Number of Clusters.
5. Run the model.
6. Right-click the node and view the results.
Generating several cluster solutions is fairly easy, but interpreting a particular cluster solution
can be extremely challenging. In some cases no easy or useful cluster interpretation is possible.
Since clusters naturally partition the population into mutually exclusive sets, they may provide
some benefit even if a convenient interpretation is not readily available.
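One common starting point for interpreting a cluster solution is to compare the mean of each input variable across clusters. The sketch below illustrates the idea only: the function name `cluster_profiles` is hypothetical, and the rows, cluster labels, and values are made up for the example rather than taken from the DMABASE data or from Enterprise Miner output.

```python
from collections import defaultdict

# Hypothetical rows: (cluster label, {variable name: value}).
# The numbers are invented for illustration; they are NOT DMABASE data.
rows = [
    (1, {"NO_HITS": 150, "YR_MAJOR": 3}),
    (1, {"NO_HITS": 130, "YR_MAJOR": 5}),
    (2, {"NO_HITS": 60, "YR_MAJOR": 14}),
    (2, {"NO_HITS": 80, "YR_MAJOR": 12}),
    (3, {"NO_HITS": 40, "YR_MAJOR": 2}),
]


def cluster_profiles(rows):
    """Return {cluster label: {variable: mean value within that cluster}}."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for label, values in rows:
        counts[label] += 1
        for name, value in values.items():
            sums[label][name] += value
    return {
        label: {name: total / counts[label] for name, total in by_var.items()}
        for label, by_var in sums.items()
    }


profiles = cluster_profiles(rows)
for label in sorted(profiles):
    print(label, profiles[label])
```

Variables whose means differ sharply between clusters are natural candidates for describing what distinguishes one group of players from another.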