
Cluster Analysis with Statistica

Summary of the Help
Cluster Analysis
Joining (Tree Clustering)
Advanced Tab
Select the Advanced tab of the Cluster Analysis: Joining (Tree Clustering) dialog to
access the options described here.
Variables. Click the Variables button to display the standard variable selection dialog.
Note that STATISTICA interprets the selected variables as dimensions if Cases (rows) is
selected in the Cluster box (see below); if Variables (columns) is selected in the Cluster
box, the selected variables will be interpreted as objects.
Input file. "he Input file box contains two options# aw data and !istance matri".
Ra data. $f you select aw data then STATISTICA expects a standard raw data file as
input.
!istance matri". $f you select !istance matri" the input matrix may either be a
correlation matrix or a distance (dissimilarity) matrix with numbers indicating the
distances or dissimilarities between ob!ects. STATISTICA will automatically determine
the contents of the matrix (i.e. whether it contains correlations or dissimilarities see
%atrix file format). $f the input matrix is a correlation matrix (which indicates the
similarity and closeness between ob!ects) it is converted to distances before the analysis
begins; specifically all correlations are transformed as &'(earson r.
Note that if your Input file consists of correlation coefficients only (e.g., from a
published source) and no means, standard deviations, or N are available, you may simply
assume standardized data (mean = 0, standard deviation = 1) and an N of, for example,
100 (N must be greater than the number of variables in the analysis). You will first need
to add these four cases (Means, Std.Dev., No.Cases, and Matrix; see Matrix file format) to your
spreadsheet before you can run the analysis. Of course, in the results, the descriptive
statistics for each variable are not meaningful in that case; however, the cluster analysis
can be performed based on the correlation coefficients alone.
Cluster. "he Cluster box contains two options# Variables (columns) and Cases (rows)#
"he option you select determines how STATISTICA interprets the selected Variables.
Note that the Cluster box is only available if aw data is selected as the Input file.
Variables (columns). $f Variables (Columns) is selected STATISTICA interprets the
selected Variables (see above) as ob!ects.
Cases (ros). $f Cases (rows) is selected STATISTICA interprets the selected Variables
as dimensions.
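For readers who want to check this rows-versus-columns logic outside STATISTICA, here is a minimal Python sketch (toy array invented for illustration; scipy assumed available): clustering cases means computing distances between rows, while clustering variables means computing distances between columns.

import numpy as np
from scipy.spatial.distance import pdist

# Toy data: 4 cases (rows) measured on 3 variables (columns).
X = np.array([[1.0, 2.0, 3.0],
              [1.1, 1.9, 3.2],
              [9.0, 8.0, 7.0],
              [8.8, 8.1, 7.1]])

# Clustering cases (rows): variables act as dimensions,
# so distances are computed between rows (4*3/2 = 6 pairs).
case_dist = pdist(X, metric='euclidean')

# Clustering variables (columns): cases act as dimensions,
# so transpose first and compute distances between columns (3 pairs).
var_dist = pdist(X.T, metric='euclidean')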
Amalgamation (linkage) rule. There are 7 different amalgamation rules available in
the Amalgamation (linkage) rule box: Single Linkage, Complete Linkage, Unweighted
pair-group average, Weighted pair-group average, Unweighted pair-group centroid,
Weighted pair-group centroid (median), and Ward's method. The default rule is Single
Linkage (also called the "method of the nearest neighbors").
One of the main parameters that guides the joining (tree-clustering) process is the
linkage rule, that is, the rule that determines when two clusters are to be joined (linked
or amalgamated). For a detailed description of amalgamation rules, see Joining (Tree
Clustering) Introductory Overview - amalgamation or linkage rules.
!istance measure. "here are 2 different distance measures that can be computed from
aw data# S+uared ,uclidean distances ,uclidean distances City(bloc$ (-an'attan)
distances C'ebyc'ev distance metric .ower: S&-(A/S("(y)
p
)
&3r
.ercent disagreement
and 0(.earson r.
"he !oining algorithm starts by first computing a matrix of distances between the
ob!ects that are to be clustered. 0or a detailed description of these distances refer to
1oining ("ree Clustering) $ntroductory -verview ' distance measures.
$f !istance matri" is selected as the Input file then !issimilarities from matri" is
automatically selected in the !istance measure box. $f the input matrix is a correlation
matrix then the correlations (which denote the degree of similarity) will be transformed
to dissimilarities (& ' r).
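The same 1 - r transformation is easy to reproduce by hand; a minimal sketch with numpy (the 3 x 3 correlation matrix is invented for illustration):

import numpy as np

# Toy 3 x 3 correlation matrix (symmetric, with ones on the diagonal).
R = np.array([[1.00, 0.20, 0.30],
              [0.20, 1.00, 0.10],
              [0.30, 0.10, 1.00]])

# Correlations measure similarity, so 1 - r turns them into
# dissimilarities: perfectly correlated objects end up at distance 0.
D = 1.0 - R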
Power distance parameters. If the Power distances option is selected in the Distance
measure box, specify the two parameters p and r for the power distance in these boxes.
Batch processing/reporting. If you select the Batch processing/reporting check box,
STATISTICA automatically performs the analysis (after you click the OK button) and
sends the entire output from the analysis to a workbook, individual windows, and/or to a
report (depending on the options selected in the Analysis/Graph Output Manager).
(atri" )ile )ormat
STATISTICA4s statistical matrix files (e.g. Correlation Covariances Similarities and
5issimilarities) can be used in the modules that support the matrix input file format
(e.g. -ultiple egression Canonical Correlation eliability and Item Analysis
Cluster Analysis -ultidimensional Scaling 5actor Analysis etc.). 6y default matrix
spreadsheet will be saved with the default file extension .sm". %ost STATISTICA
modules will read both full and loer triangular matrices. 7owever in order
for STATISTICA to recogni)e the file as a matrix file the file must meet the following
conditions#
"he number of cases (rows) * the number of variables (columns) * +.
"he matrix must be a s8uare matrix and the case names should be the same as
the variable names.
"he last four cases contain the following case names and information#
(eans. "he mean of each variable is given in this row; this case can be left empty (i.e.
do not enter anything in this row) for Similarities and 5issimilarities matrices.
Std.!ev. "he standard deviation of each variable is given in this row; this case can be
left empty (i.e. do not enter anything in this row) for Similarities and 5issimilarities
matrices.
,o.Cases. "his re-uired number is the number of cases from which the matrix was
produced not the number of cases (rows of data) in this matrix file.
(atri". "his re-uired number represents the type of matrix file; & * Correlation 9 *
Similarities . / !issimilarities and : * Covariance.
,ote0 ;hen entering these last four cases into the matrix file manually be sure to spell
t&e case names e"actly as they appear above (i.e. %eans Std.5ev. No.Cases and
%atrix).
1"amples of Correlation (atri" )iles0
<ar & <ar 9 <ar =
<ar & &.++ .9+ .=+
<ar 9 .9+ &.++ .&+
<ar = .=+ .&+ &.++
%eans &9 && &+
Std. 5ev. = > 9
No. Cases >+
%atrix &
1"amples of 2oer Triangular Correlation (atri" )iles0
<ar & <ar 9 <ar =
<ar & &.++
<ar 9 .9+ &.++
<ar = .=+ .&+ &.++
%eans &9 && &+
Std. 5ev. = > 9
No. Cases >+
%atrix &
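The layout above can also be assembled programmatically before bringing it into STATISTICA. A hedged pandas sketch follows; the matrix spreadsheet format itself is STATISTICA-specific, so this only builds the square matrix plus the four required cases and writes a plain CSV file (the file name is invented):

import pandas as pd

names = ['Var 1', 'Var 2', 'Var 3']
m = pd.DataFrame([[1.00, 0.20, 0.30],
                  [0.20, 1.00, 0.10],
                  [0.30, 0.10, 1.00]],
                 index=names, columns=names)

# The four required trailing cases, spelled exactly as described above.
m.loc['Means'] = [12, 11, 10]
m.loc['Std.Dev.'] = [3, 5, 2]
m.loc['No.Cases'] = [50, None, None]  # cases behind the matrix, first column only
m.loc['Matrix'] = [1, None, None]     # 1 = correlation matrix

m.to_csv('correlation_matrix.csv')    # import into STATISTICA afterwards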
Joining (Tree Clustering)
Distance Measures
"he !oining or tree clustering method uses the dissimilarities or distances between
ob!ects when forming the clusters. "hese distances can be based on a single dimension
or multiple dimensions. 0or example if we were to cluster fast foods we could take
into account the number of calories they contain their price sub!ective ratings of taste
etc. "he most straightforward way of computing distances between ob!ects in a multi'
dimensional space is to compute ?uclidean distances. $f we had a two' or three'
dimensional space this measure is the actual geometric distance between ob!ects in the
space (i.e. as if measured with a ruler). 7owever the !oining algorithm does not /care/
whether the distances that are /fed/ to it are actual real distances or some other derived
measure of distance that is more meaningful to the researcher; and it is up to the
researcher to select the right method for his3her specific application. "he Cluster
Analysis module will compute various types of distance measures or the user can
compute a matrix of distances him or herself and directly use it in the procedure.
Euclidean distance. This is probably the most commonly chosen type of distance. It
simply is the geometric distance in the multidimensional space. It is computed as:
distance(x,y) = [Σi (xi - yi)²]^½
Note that Euclidean (and squared Euclidean) distances are computed from raw data, and
not from standardized data. This is how it is usually computed, and this method has
certain advantages (e.g., the distance between any two objects is not affected by the
addition of new objects to the analysis, which may be outliers). However, the distances
can be greatly affected by differences in scale among the dimensions from which the
distances are computed. For example, if one of the dimensions denotes a measured
length in centimeters, and you then convert it to millimeters (by multiplying the values
by 10), the resulting Euclidean or squared Euclidean distances (computed from multiple
dimensions) can be greatly affected, and consequently, the results of cluster analyses
may be very different. Of course, you can implement any desired standardization or
scaling using the data management features of STATISTICA.
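This scale sensitivity is easy to demonstrate outside STATISTICA; a minimal Python sketch (invented two-object, two-dimension example; scipy assumed available) shows how rescaling one dimension changes the Euclidean distance:

import numpy as np
from scipy.spatial.distance import pdist

# Two objects measured on two dimensions: length (cm) and weight (kg).
X = np.array([[10.0, 2.0],
              [12.0, 3.0]])
print(pdist(X, metric='euclidean'))     # sqrt(2^2 + 1^2) ~ 2.236

# Re-express length in millimeters: the first dimension now dominates.
X_mm = X * np.array([10.0, 1.0])
print(pdist(X_mm, metric='euclidean'))  # sqrt(20^2 + 1^2) ~ 20.025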
Squared Euclidean distance. One may want to square the standard Euclidean distance
in order to place progressively greater weight on objects that are further apart. This
distance is computed as (see also the note in the previous paragraph):
distance(x,y) = Σi (xi - yi)²
City-block (Manhattan) distance. This distance is simply the average difference
across dimensions. In most cases, this distance measure yields results similar to the
simple Euclidean distance. However, note that in this measure, the effect of single large
differences (outliers) is dampened (since they are not squared). The city-block distance
is computed as:
distance(x,y) = Σi |xi - yi|
Chebychev distance. This distance measure may be appropriate in cases when one
wants to define two objects as "different" if they are different on any one of the
dimensions. The Chebychev distance is computed as:
distance(x,y) = Maximum|xi - yi|
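For reference, these measures have direct counterparts in scipy (the metric names below are scipy's, not STATISTICA's); a brief sketch on an invented pair of objects:

import numpy as np
from scipy.spatial.distance import pdist

X = np.array([[1.0, 4.0, 2.0],
              [3.0, 1.0, 2.0]])          # two objects, three dimensions

print(pdist(X, metric='sqeuclidean'))    # (1-3)^2 + (4-1)^2 + 0 = 13
print(pdist(X, metric='cityblock'))      # |1-3| + |4-1| + 0   = 5
print(pdist(X, metric='chebyshev'))      # max(2, 3, 0)        = 3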
Power distance. Sometimes one may want to increase or decrease the progressive
weight that is placed on dimensions on which the respective objects are very different.
This can be accomplished via the power distance. The power distance is computed as:
distance(x,y) = (Σi |xi - yi|^p)^(1/r)
where r and p are user-defined parameters. A few example calculations may
demonstrate how this measure "behaves." Parameter p controls the progressive weight
that is placed on differences on individual dimensions; parameter r controls the
progressive weight that is placed on larger differences between objects. If r and p are
equal to 2, then this distance is equal to the Euclidean distance.
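Because the general p ≠ r case is not a stock scipy metric (scipy's Minkowski distance corresponds to the special case p = r), a small hand-rolled sketch is shown here; the p = r = 2 case is checked against the ordinary Euclidean distance:

import numpy as np
from scipy.spatial.distance import euclidean

def power_distance(x, y, p, r):
    # distance(x,y) = ( sum_i |x_i - y_i|**p ) ** (1/r)
    return np.sum(np.abs(x - y) ** p) ** (1.0 / r)

x = np.array([1.0, 4.0, 2.0])
y = np.array([3.0, 1.0, 2.0])

print(power_distance(x, y, p=2, r=2))  # ~3.606, equals the Euclidean distance
print(euclidean(x, y))                 # ~3.606
print(power_distance(x, y, p=4, r=2))  # larger differences weighted more heavily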
Percent disagreement. This measure is particularly useful if the data for the
dimensions included in the analysis are categorical in nature. This distance is computed
as:
distance(x,y) = (Number of xi ≠ yi)/i
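Since this is just the proportion of mismatching dimensions, scipy's hamming metric computes the same quantity for integer-coded categories; a toy sketch:

import numpy as np
from scipy.spatial.distance import pdist

# Two objects described by four categorical dimensions (coded as integers).
X = np.array([[0, 1, 2, 1],
              [0, 2, 2, 0]])

print(pdist(X, metric='hamming'))  # 2 mismatches out of 4 dimensions -> 0.5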
For an overview of the other two methods of clustering, see Two-way Joining and k-
means Clustering.
Joining (Tree Clustering)
Amalgamation or Linkage Rules
At the first step, when each object represents its own cluster, the distances between
those objects are defined by the chosen distance measure. However, once several
objects have been linked together, how do we determine the distances between those
new clusters? In other words, we need a linkage or amalgamation rule to determine
when two clusters are sufficiently similar to be linked together. There are various
possibilities: for example, we could link two clusters together when any two objects in
the two clusters are closer together than the respective linkage distance. Put another
way, we use the "nearest neighbors" across clusters to determine the distances between
clusters; this method is called single linkage. This rule produces "stringy" types of
clusters, that is, clusters "chained together" by only single objects that happen to be
close together. Alternatively, we may use the neighbors across clusters that are furthest
away from each other; this method is called complete linkage. There are numerous other
linkage rules that have been proposed, and the Cluster Analysis module offers a wide
choice of them.
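Outside STATISTICA, the same family of rules is available in scipy's hierarchical clustering routines; a minimal sketch on invented data (scipy's method names, noted in the comments, only roughly correspond to the labels used in this module):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (5, 2)),    # one clump of objects
               rng.normal(8, 1, (5, 2))])   # a second, well-separated clump

d = pdist(X, metric='euclidean')            # condensed distance matrix

# scipy's rough counterparts of the rules described below:
# 'single'   - single linkage            'complete' - complete linkage
# 'average'  - unweighted pair-group average (UPGMA)
# 'weighted' - weighted pair-group average   (WPGMA)
# 'centroid' - unweighted pair-group centroid (UPGMC)
# 'median'   - weighted pair-group centroid / median (WPGMC)
# 'ward'     - Ward's method
Z = linkage(d, method='single')
print(fcluster(Z, t=2, criterion='maxclust'))  # cut the tree into two clusters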
Single linkage (nearest neighbor). As described above, in this
method the distance between two clusters is determined by the distance of the two
closest objects (nearest neighbors) in the different clusters. This rule will, in a sense,
string objects together to form clusters, and the resulting clusters tend to represent long
"chains."
Complete linkage (furthest neighbor). In this method, the distances
between clusters are determined by the greatest distance between any two objects in the
different clusters (i.e., by the "furthest neighbors"). This method usually performs quite
well in cases when the objects actually form naturally distinct "clumps." If the clusters
tend to be somehow elongated or of a "chain" type nature, then this method is
inappropriate.
Unweighted pair-group average. In this method, the distance between
two clusters is calculated as the average distance between all pairs of objects in the two
different clusters. This method is also very efficient when the objects form natural
distinct "clumps;" however, it performs equally well with elongated, "chain" type
clusters. Note that in their book, Sneath and Sokal (1973) introduced the abbreviation
UPGMA to refer to this method as unweighted pair-group method using arithmetic
averages.
Weighted pair-group average. This method is identical to the unweighted pair-group
average method, except that in the computations, the size of the respective clusters (i.e.,
the number of objects contained in them) is used as a weight. Thus, this method (rather
than the previous method) should be used when the cluster sizes are suspected to be
greatly uneven. Note that in their book, Sneath and Sokal (1973) introduced the
abbreviation WPGMA to refer to this method as weighted pair-group method using
arithmetic averages.
Unweighted pair-group centroid. The centroid of a cluster is the
average point in the multidimensional space defined by the dimensions. In a sense, it is
the center of gravity for the respective cluster. In this method, the distance between two
clusters is determined as the difference between centroids. Sneath and Sokal (1973) use
the abbreviation UPGMC to refer to this method as unweighted pair-group method
using the centroid average.
Weighted pair-group centroid (median). This method is identical to the
previous one, except that weighting is introduced into the computations to take into
consideration differences in cluster sizes (i.e., the number of objects contained in them).
Thus, when there are (or one suspects there to be) considerable differences in cluster
sizes, this method is preferable to the previous one. Sneath and Sokal (1973) use the
abbreviation WPGMC to refer to this method as weighted pair-group method using the
centroid average.
Ward's method. This method is distinct from all other methods because it uses an
analysis of variance approach to evaluate the distances between clusters. In short, this
method attempts to minimize the Sum of Squares (SS) of any two (hypothetical)
clusters that can be formed at each step. Refer to Ward (1963) for details concerning this
method. In general, this method is regarded as very efficient; however, it tends to create
clusters of small size.
For an overview of the other two methods of clustering, see Two-way Joining and k-
means Clustering.
k-means Clustering
Introductory Overview
"his method of clustering is very different from the 1oining ("ree Clustering) and "wo'
way 1oining methods. Suppose that you already have hypotheses concerning the number
of clusters in your cases or variables. ,ou may want to /tell/ the computer to form
exactly = clusters that are to be as distinct as possible. "his is the type of research
8uestion that can be addressed by the k'means clustering algorithm. $n general the k'
means method will produce exactly $ different clusters of greatest possible distinction.
1"ample. $n the physical fitness example (see "wo'way 1oining) the medical
researcher may have a /hunch/ from clinical experience that her heart patients fall
basically into three different categories with regard to physical fitness. She might
wonder whether this intuition can be 8uantified that is whether a k'means cluster
analysis of the physical fitness measures would indeed produce the three clusters of
patients as expected. $f so the means on the different measures of physical fitness for
each cluster would represent a 8uantitative way of expressing the researcher4s
hypothesis or intuition (i.e. patients in cluster & are high on measure & low on measure
9 etc.).
Computations. Computationally, you may think of this method as analysis of variance
(ANOVA) "in reverse." The program will start with k random clusters, and then move
objects between those clusters with the goal to (1) minimize variability within clusters
and (2) maximize variability between clusters. This is analogous to "ANOVA in
reverse" in the sense that the significance test in ANOVA evaluates the between-group
variability against the within-group variability when computing the significance test for
the hypothesis that the means in the groups are different from each other. In k-means
clustering, the program tries to move objects (e.g., cases) in and out of groups (clusters)
to get the most significant ANOVA results. (Because, among other results, the ANOVA
results are part of the standard output from a k-means clustering analysis, you may want
to refer to ANOVA/MANOVA to learn more about that method.)
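A minimal numpy sketch of the textbook iteration conveys the idea; note that this is the classic Lloyd-style variant, not necessarily the exact object-relocation algorithm STATISTICA uses (data and k are invented):

import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.5, (20, 2)) for m in (0, 4, 8)])  # 3 clumps
k = 3

# Start with k clusters seeded at randomly chosen objects.
centers = X[rng.choice(len(X), size=k, replace=False)]
for _ in range(100):
    # Assign each object to its nearest cluster center ...
    labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(axis=2), axis=1)
    # ... then move each center to the mean of the objects assigned to it.
    new_centers = np.array([X[labels == j].mean(axis=0)
                            if np.any(labels == j) else centers[j]
                            for j in range(k)])
    if np.allclose(new_centers, centers):
        break  # no object moved: within-cluster variability is at a (local) minimum
    centers = new_centers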
Interpretation of results. Usually, as the result of a k-means clustering analysis, we
would examine the means for each cluster on each dimension to assess how distinct our
k clusters are. Ideally, we would obtain very different means for most, if not all,
dimensions used in the analysis. The magnitude of the F values from the analysis of
variance performed on each dimension is another indication of how well the respective
dimension discriminates between clusters.
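Given a set of cluster assignments, these per-dimension F values can be reproduced with a one-way ANOVA on each dimension; a short sketch using scipy (data and cluster labels invented for illustration):

import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(m, 1.0, (15, 3)) for m in (0, 3, 6)])  # toy data
labels = np.repeat([0, 1, 2], 15)                                # toy cluster labels

for dim in range(X.shape[1]):
    groups = [X[labels == j, dim] for j in np.unique(labels)]
    F, p = f_oneway(*groups)
    print(f'dimension {dim}: F = {F:.1f}')  # larger F = better discrimination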
