
Cluster Analysis With SPSS

I have never had research data for which cluster analysis was a technique I thought appropriate for analyzing the data, but just for fun I have played around with cluster analysis. I created a data file where the cases were faculty in the Department of Psychology at East Carolina University in the month of November, 2005. The variables are:

Name - Although faculty salaries are public information under North Carolina state law, I thought it best to assign each case a fictitious name.
Salary - annual salary in dollars, from the university report available in OneStop.
FTE - full time equivalent work load for the faculty member.
Rank - where 1 = adjunct, 2 = visiting, 3 = assistant, 4 = associate, 5 = professor.
Articles - number of published scholarly articles, excluding things like comments in newsletters, abstracts in proceedings, and the like. The primary source for these data was the faculty member's online vita. When that was not available, the data in the University's Academic Publications Database were used, after eliminating duplicate entries.
Experience - number of years working as a full time faculty member in a Department of Psychology. If the faculty member did not have employment information on his or her web page, then other online sources were used. For example, from the publications database I could estimate the year of first employment as being the year of first publication.

In the data file but not used in the cluster analysis are also:

ArticlesAPD - number of published articles as listed in the university's Academic Publications Database. There were a lot of errors in this database, but I tried to correct them (for example, by adjusting for duplicate entries).
Sex - I inferred biological sex from physical appearance.

I have saved, annotated, and placed online the statistical output from the analysis. You may wish to look at it while reading through this document.

Conducting the Analysis

Start by bringing ClusterAnonFaculty.sav into SPSS. Now click Analyze, Classify, Hierarchical Cluster. Identify Name as the variable by which to label cases and Salary, FTE, Rank, Articles, and Experience as the variables. Indicate that you want to cluster cases rather than variables and want to display both statistics and plots.

ClusterAnalysis-SPSS.doc

Click Statistics and indicate that you want to see an Agglomeration schedule. Click Continue.

Click Plots and indicate that you want a Dendrogram and an Icicle plot with 2, 3, and 4 cluster solutions. Click Continue.


Click Method and indicate that you want to use the Between-groups linkage method of clustering, squared Euclidean distances, and variables standardized to z scores. Click Continue.

Click Save and indicate that you want to save, for each case, the cluster to which the case is assigned for 2, 3, and 4 cluster solutions. Click Continue, OK.


SPSS starts by standardizing all of the variables to mean 0, variance 1. This results in all the variables being on the same scale and being equally weighted. In the first step SPSS computes for each pair of cases the squared Euclidean distance between the cases. This is quite simply

$$\sum_{i=1}^{v} (X_i - Y_i)^2 ,$$

the sum across variables (from $i = 1$ to $v$) of the squared difference between the score on variable $i$ for the one case ($X_i$) and the score on variable $i$ for the other case ($Y_i$). The two cases which are separated by the smallest squared Euclidean distance are identified and then classified together into the first cluster. At this point there is one cluster with two cases in it.

Next SPSS recomputes the squared Euclidean distances between each entity (case or cluster) and each other entity. When one or both of the compared entities is a cluster, SPSS computes the average squared Euclidean distance between members of the one entity and members of the other entity. The two entities with the smallest squared Euclidean distance are classified together. SPSS then recomputes the squared Euclidean distances between each entity and each other entity, and the two with the smallest squared Euclidean distance are classified together. This continues until all of the cases have been clustered into one big cluster.

Look at the Agglomeration Schedule. On the first step SPSS clustered case 32 with 33. The squared Euclidean distance between these two cases is 0.000. At stages 2-4 SPSS creates three more clusters, each containing two cases. At stage 5 SPSS adds case 39 to the cluster that already contains cases 37 and 38. By the 43rd stage all cases have been clustered into one entity.

Look at the Vertical Icicle. For the two cluster solution you can see that one cluster consists of ten cases (Boris through Willy, followed by a column with no X's). These were our adjunct (part-time) faculty (excepting one), and the second cluster consists of everybody else.


For the three cluster solution you can see the cluster of adjunct faculty and the others split into two. Deanna through Mickey were our junior faculty and Lawrence through Rosalyn our senior faculty.

For the four cluster solution you can see that one case (Lawrence) forms a cluster of his own.

Look at the dendrogram. It displays essentially the same information that is found in the agglomeration schedule but in graphic form.

Look back at the data sheet. You will find three new variables. CLU2_1 is cluster membership for the two cluster solution, CLU3_1 for the three cluster solution, and CLU4_1 for the four cluster solution. Remove the variable labels and then label the values for CLU2_1 and CLU3_1.
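The clustering procedure described above (standardize each variable, compute squared Euclidean distances, repeatedly merge the two closest entities using the average between-group distance, then save cluster membership) can be sketched in a few lines of Python. This is only an illustration of the logic, not SPSS's own code. The six cases and their values are made up, and the function names are hypothetical.

```python
from itertools import combinations
from statistics import mean, stdev

def zscores(column):
    """Standardize one variable to mean 0, variance 1 (sample SD, n - 1)."""
    m, s = mean(column), stdev(column)
    return [(x - m) / s for x in column]

def sq_euclidean(a, b):
    """Sum across variables of the squared difference between two cases."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def average_linkage(cases, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    The distance between two clusters is the mean squared Euclidean
    distance over all pairs with one member in each cluster.
    Returns (clusters, schedule); clusters hold case indices.
    """
    clusters = [[i] for i in range(len(cases))]
    schedule = []
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            d = mean(sq_euclidean(cases[i], cases[j])
                     for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        schedule.append((clusters[a][:], clusters[b][:], d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, schedule

def membership(clusters, n_cases):
    """One cluster number per case, like a saved membership variable."""
    labels = [0] * n_cases
    for number, members in enumerate(clusters, start=1):
        for i in members:
            labels[i] = number
    return labels

# Made-up data: salary in thousands and article counts for six cases.
salary = [30, 32, 31, 70, 72, 71]
articles = [1, 2, 0, 20, 22, 21]
cases = list(zip(zscores(salary), zscores(articles)))

clusters, schedule = average_linkage(cases, n_clusters=2)
clu2 = membership(clusters, len(cases))
print(clu2)
for left, right, distance in schedule:
    print(left, "+", right, "at distance", round(distance, 3))
```

The printed schedule plays the role of the agglomeration schedule, and `clu2` plays the role of CLU2_1. Recomputing every pairwise average at each step is slow, but it keeps the sketch close to the verbal description.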

Let us see how the two clusters in the two cluster solution differ from one another on the variables that were used to cluster them.


The output shows that the adjunct cluster has lower mean salary, FTE, rank, number of published articles, and years of experience. Now compare the three clusters from the three cluster solution. Use One-Way ANOVA and ask for plots of group means.

The plots of means show nicely the differences between the clusters.

Predicting Salary from FTE, Rank, Publications, and Experience

Now, just for fun, let us try a little multiple regression. We want to see how faculty salaries are related to FTEs, rank, number of published articles, and years of experience.


Ask for part and partial correlations and for Casewise diagnostics for All cases.
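As a reminder of what part and partial correlations are, here is a small Python sketch for the simplest case: one predictor of interest and one control variable. The regression in this handout has four predictors, so this is a simplification, and the data below are made up. With a single control variable the part (semipartial) correlation is (r_y1 - r_y2 r_12) / sqrt(1 - r_12^2), and the partial correlation divides the same numerator by sqrt((1 - r_y2^2)(1 - r_12^2)).

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Zero-order Pearson correlation between two lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def part_and_partial(y, x1, x2):
    """Part and partial correlation of x1 with y, controlling for x2."""
    r_y1 = pearson_r(y, x1)
    r_y2 = pearson_r(y, x2)
    r_12 = pearson_r(x1, x2)
    numerator = r_y1 - r_y2 * r_12
    part = numerator / sqrt(1 - r_12 ** 2)
    partial = numerator / sqrt((1 - r_y2 ** 2) * (1 - r_12 ** 2))
    return part, partial

# Hypothetical data for six faculty: salary in thousands, rank, articles.
salary = [20, 50, 52, 60, 85, 80]
rank = [1, 3, 3, 4, 5, 5]
articles = [0, 2, 6, 10, 25, 18]

zero_order = pearson_r(salary, articles)
part, partial = part_and_partial(salary, articles, rank)
print("zero-order r:", round(zero_order, 3))
print("part r:", round(part, 3), "partial r:", round(partial, 3))
```

Comparing the zero-order value with the part and partial values shows how much of a predictor's relationship with salary remains after the other predictor is taken into account.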

The output shows that each of our predictors has a medium to large positive zero-order correlation with salary, but only FTE and rank have significant partial effects. In the Casewise Diagnostics table you are given for each case the standardized residual (I think that any whose absolute value exceeds 1 is worthy of inspection by the persons who set faculty salaries), the actual salary, the salary predicted by the model, and the difference, in $, between actual salary and predicted salary.

If you split the file by sex and repeat the regression analysis you will see some interesting differences between the model for women and the model for men. The partial effect of rank is much greater for women than for men. For men the partial effect of articles is positive and significant, but for women it is negative. That is, among our female faculty, the partial effect of publication is to lower one's salary.
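The four columns of the Casewise Diagnostics table can be reproduced by hand. The Python sketch below fits a least-squares regression and lists, for each case, the standardized residual, the actual value, the predicted value, and the raw residual. It assumes the standardized residual is the raw residual divided by the standard error of estimate. The data are invented and only two predictors are used, so treat it as an illustration rather than a reproduction of the SPSS output.

```python
from math import sqrt

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: bring up the row with the largest entry.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def casewise_diagnostics(y, predictors):
    """Per case: (standardized residual, actual, predicted, residual)."""
    rows = [[1.0] + list(p) for p in zip(*predictors)]  # leading 1 = intercept
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    coefs = solve(xtx, xty)  # normal equations: (X'X) b = X'y
    predicted = [sum(c * v for c, v in zip(coefs, r)) for r in rows]
    residuals = [yi - p for yi, p in zip(y, predicted)]
    # Standard error of estimate: sqrt(SS residual / (n - coefficients)).
    see = sqrt(sum(e ** 2 for e in residuals) / (len(y) - k))
    return [(e / see, yi, p, e) for e, yi, p in zip(residuals, y, predicted)]

# Hypothetical data for six faculty: salary in thousands, rank, articles.
salary = [20, 50, 52, 60, 85, 80]
rank = [1, 3, 3, 4, 5, 5]
articles = [0, 2, 6, 10, 25, 18]

table = casewise_diagnostics(salary, [rank, articles])
for case, (z, actual, pred, resid) in enumerate(table, start=1):
    flag = " <- inspect" if abs(z) > 1 else ""
    print(case, round(z, 2), actual, round(pred, 1), round(resid, 1), flag)
```

The flag uses the rule of thumb given above, an absolute standardized residual greater than 1.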

Karl L. Wuensch
East Carolina University
Department of Psychology
Greenville, NC 27858-4353
United Snakes of America
October, 2007

