CLUSTERING USING SAS E-MINER

Alka Arora
I.A.S.R.I., Library Avenue, Pusa, New Delhi-110 012
alkak@iasri.res.in

1. Introduction
Clustering algorithms map data items into clusters such that homogeneous data items fall in the
same group. Objects are grouped on the basis of similarity so that intra-cluster similarity is high
and inter-cluster similarity is low. Unlike classification and prediction, which analyze
class-labelled data objects, clustering analyzes data objects without class labels and tries to
generate such labels. Many clustering algorithms are available in the literature; the choice of an
appropriate algorithm depends on the data type and the desired results. Here we focus on
hierarchical clustering algorithms.

1.1 Hierarchical Algorithms
A hierarchical method creates a hierarchical decomposition of the data objects in the form of a
tree-like diagram called a dendrogram. There are two approaches to building a cluster hierarchy.
The agglomerative approach, also called the bottom-up approach, starts with each object forming a
separate group and successively merges the groups closest to one another until all the groups are
merged into one. The divisive approach, also called the top-down approach, starts with all the
objects in the same cluster and successively splits clusters into smaller ones until each object
is in its own cluster.


The process flow of the agglomerative hierarchical clustering method is given below:
1. Convert object features to a distance matrix.
2. Set each object as a cluster (thus if we have 6 objects, we will have 6 clusters in the
beginning).
3. Iterate until the number of clusters is 1:
   a. Merge the two closest clusters.
   b. Update the distance matrix.
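
The steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure SAS E-Miner runs internally: the names `euclidean` and `agglomerate` and the `linkage` parameter are hypothetical, Euclidean distance is assumed as the distance measure, and instead of updating a cluster-level matrix in place the sketch recomputes cluster distances from the object-level matrix, which is simpler but slower.

```python
from itertools import combinations
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors (assumed measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerate(points, linkage=min):
    """Merge clusters until one remains and return the merge history.

    `linkage` reduces the pairwise object distances between two clusters to a
    single number: min gives single linkage, max gives complete linkage.
    """
    n = len(points)
    # Step 1: distance matrix between all pairs of objects.
    dist = [[euclidean(points[i], points[j]) for j in range(n)] for i in range(n)]
    # Step 2: every object starts as its own cluster (a tuple of object indices).
    clusters = [(i,) for i in range(n)]
    history = []

    def cluster_distance(pair):
        a, b = pair
        return linkage(dist[i][j] for i in a for j in b)

    # Step 3: repeat until a single cluster is left.
    while len(clusters) > 1:
        # Find and merge the two closest clusters.
        a, b = min(combinations(clusters, 2), key=cluster_distance)
        height = cluster_distance((a, b))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(tuple(sorted(a + b)))
        history.append((a, b, height))
    return history
```

Calling `agglomerate(points)` on a list of numeric tuples returns one `(cluster_a, cluster_b, distance)` entry per merge, which is the information a dendrogram displays.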
First, a distance matrix is computed using any valid distance measure between pairs of objects. The
choice of which clusters to merge is determined by a linkage criterion, which is a function of the
pairwise distances between observations. Commonly used linkage criteria are given below:

Complete Linkage: The distance between two clusters is the maximum distance between an element
of one cluster and an element of the other.

Single Linkage: The distance between two clusters is the minimum distance between an element of
one cluster and an element of the other.

Average Linkage (UPGMA): The distance between two clusters is the mean distance between
elements of one cluster and elements of the other.

Ward's method: This method is distinct from the other methods because it uses an analysis of
variance approach to evaluate the distances between clusters. In short, this method attempts
to minimize the Sum of Squares (SS) of any two (hypothetical) clusters that can be formed at
each step. In general, this method is regarded as very efficient; however, it tends to create
clusters of small size.
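
The four criteria can be written as small Python functions, which may make the differences easier to see. This is a sketch under stated assumptions: clusters are plain lists of numeric tuples, Euclidean distance is assumed, all function names are hypothetical, and `ward_cost` uses one common formulation of Ward's method (the increase in within-cluster sum of squares caused by a merge).

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors (assumed measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise(cluster_a, cluster_b):
    """All distances between an element of cluster_a and an element of cluster_b."""
    return [euclidean(p, q) for p in cluster_a for q in cluster_b]


def complete_linkage(cluster_a, cluster_b):
    return max(pairwise(cluster_a, cluster_b))


def single_linkage(cluster_a, cluster_b):
    return min(pairwise(cluster_a, cluster_b))


def average_linkage(cluster_a, cluster_b):
    distances = pairwise(cluster_a, cluster_b)
    return sum(distances) / len(distances)


def sum_of_squares(cluster):
    """Sum of squared deviations of the cluster's points from its centroid."""
    dims = len(cluster[0])
    centroid = [sum(p[k] for p in cluster) / len(cluster) for k in range(dims)]
    return sum((p[k] - centroid[k]) ** 2 for p in cluster for k in range(dims))


def ward_cost(cluster_a, cluster_b):
    """Increase in total within-cluster SS if the two clusters were merged."""
    merged = cluster_a + cluster_b
    return sum_of_squares(merged) - sum_of_squares(cluster_a) - sum_of_squares(cluster_b)
```

At each agglomerative step the pair of clusters with the smallest value of the chosen function is merged.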

2. Example
A baseball manager wants to identify and group players on the team who are very similar with
respect to several statistics of interest. Note that there is no response variable in this example.
The manager simply wants to identify different groups of players. The manager also wants to
learn what differentiates players in one group from players in a different group. The data are
in the DMABASE data set in the SAMPSIO library. The following table describes the key
variables.

Table 1. Descriptions of Selected Variables in the DMABASE Data Set

Name       Model Role   Measurement Level   Description
NAME       ID           Nominal             Player Name
TEAM       Rejected     Nominal             Team at the end of 1986
POSITION   Rejected     Nominal             Positions played in 1986
LEAGUE     Rejected     Binary              League at the end of 1986
DIVISION   Rejected     Binary              Division at the end of 1986
NO_ATBAT   Input        Interval            Times at Bat in 1986
NO_HITS    Input        Interval            Hits in 1986
NO_HOME    Input        Interval            Home Runs in 1986
NO_RUNS    Input        Interval            Runs in 1986
NO_RBI     Input        Interval            RBIs in 1986
NO_BB      Input        Interval            Walks in 1986
YR_MAJOR   Input        Interval            Years in the Major Leagues
CR_ATBAT   Input        Interval            Career Times at Bat
CR_HITS    Input        Interval            Career Hits
CR_HOME    Input        Interval            Career Home Runs
CR_RUNS    Input        Interval            Career Runs
CR_RBI     Input        Interval            Career RBIs
CR_BB      Input        Interval            Career Walks
NO_OUTS    Input        Interval            Put Outs in 1986
NO_ASSTS   Input        Interval            Assists in 1986
NO_ERROR   Input        Interval            Errors in 1986
SALARY     Rejected     Interval            1987 Salary in Thousands
LOGSALAR   Input        Interval            Log of 1987 Salary in Thousands

For this example, set the model role for TEAM, POSITION, LEAGUE, DIVISION, and SALARY to
rejected. SALARY is rejected because this information is already stored in LOGSALAR in the
data set.
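
The role assignments from Table 1 can be written down as a plain mapping to check which variables actually enter the clustering. This is a Python illustration only, not SAS E-Miner syntax; the names `ROLES` and `input_variables` are hypothetical.

```python
# Model roles copied from Table 1 (variable name -> model role).
ROLES = {
    "NAME": "ID",
    "TEAM": "Rejected", "POSITION": "Rejected",
    "LEAGUE": "Rejected", "DIVISION": "Rejected",
    "NO_ATBAT": "Input", "NO_HITS": "Input", "NO_HOME": "Input",
    "NO_RUNS": "Input", "NO_RBI": "Input", "NO_BB": "Input",
    "YR_MAJOR": "Input",
    "CR_ATBAT": "Input", "CR_HITS": "Input", "CR_HOME": "Input",
    "CR_RUNS": "Input", "CR_RBI": "Input", "CR_BB": "Input",
    "NO_OUTS": "Input", "NO_ASSTS": "Input", "NO_ERROR": "Input",
    "SALARY": "Rejected",   # the same information is carried by LOGSALAR
    "LOGSALAR": "Input",
}


def input_variables(roles):
    """Return the variables whose model role is Input, in table order."""
    return [name for name, role in roles.items() if role == "Input"]
```

With these roles, `input_variables(ROLES)` lists the interval variables used for clustering, including LOGSALAR but not SALARY.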

2.1. Setting Up the Clustering Parameters
Step 1: Add the diagram.
Step 2: Add the data source to the diagram.
Step 3: Select Clustering from the Explore tab.
Step 4: Select the options for clustering (left frame, lower window).

1. Open the Clustering node.
2. Select the Clusters tab.
3. Select Selection Criterion in the Number of Clusters section.
4. Type 3 for the Maximum Number of Clusters.
5. Run the Model.
6. Right-click the Clustering node and view the results.



Generating several cluster solutions is fairly easy, but interpreting a particular cluster solution
can be extremely challenging. In some cases no easy or useful cluster interpretation is possible.
Since clusters naturally partition the population into mutually exclusive sets, they may provide
some benefit even if a convenient interpretation is not readily available.
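
One simple way to start interpreting a cluster solution is to compare the mean of each input variable across clusters. The sketch below assumes the cluster assignments have been exported alongside the data; the function name `cluster_profiles` and the row layout (one dictionary per player) are hypothetical and are not part of SAS E-Miner.

```python
from collections import defaultdict


def cluster_profiles(rows, labels):
    """Return {cluster label: {variable: mean value}}.

    `rows` is a list of dictionaries mapping variable names to numeric values,
    and `labels` gives the cluster assigned to each row, in the same order.
    """
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)

    profiles = {}
    for label, members in grouped.items():
        variables = members[0].keys()
        profiles[label] = {
            var: sum(m[var] for m in members) / len(members) for var in variables
        }
    return profiles
```

Variables whose means differ strongly between clusters are natural candidates for describing what differentiates players in one group from players in another.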
