K Means Handout

Cluster Analysis
What is Cluster Analysis?
Cluster analysis is a statistical technique used to group cases (individuals or objects) into homogeneous subgroups based on responses to variables. PASW (SPSS) 17.0 offers three clustering procedures: two-step, k-means, and hierarchical.
K-means clustering allows you to select the number of clusters, and the procedure can be used with moderate to large datasets. The k-means clustering algorithm assigns each case to the cluster whose mean is the smallest distance from the case. This is an iterative process that stops once the cluster means do not change much in successive steps.
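The assign-then-update loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the PASW implementation: the function names, the tolerance `tol`, and the four toy cases are assumptions made up for the example.

```python
import math

def nearest(case, centers):
    """Index of the center with the smallest Euclidean distance to the case."""
    return min(range(len(centers)), key=lambda j: math.dist(case, centers[j]))

def kmeans(cases, centers, max_iter=10, tol=1e-6):
    """Alternate assignment and mean updates until the centers barely move."""
    centers = [list(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each case joins the cluster with the closest mean.
        labels = [nearest(case, centers) for case in cases]
        # Update step: each center becomes the mean of its current members.
        new_centers = []
        for j, old in enumerate(centers):
            members = [c for c, lab in zip(cases, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(old)  # an empty cluster keeps its old center
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:  # means no longer change much: stop
            break
    return labels, centers

# Hypothetical toy data: two obvious groups in two dimensions.
cases = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centers = kmeans(cases, centers=[(0.0, 0.0), (10.0, 10.0)])
print(labels)   # [0, 0, 1, 1]
print(centers)  # [[0.0, 0.5], [10.0, 10.5]]
```

Here the first pass moves each center to the mean of its two cases, and the second pass finds no further movement, so the loop stops.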
K-Means Clustering
As an example of k-means clustering, a sample PASW 17.0 dataset was used: telco_extra.sav, telecommunications provider data that has 14 continuous variables. The continuous variables have already been standardized, with a mean of 0 and standard deviation of 1, to allow for the different units in which the variables were measured. This analysis will cluster customers by their service usage patterns.
In PASW 17.0, go to Analyze > Classify > K-Means Cluster.
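As an aside on the standardization mentioned above: a variable is standardized by subtracting its mean and dividing by its standard deviation. The sketch below assumes the sample standard deviation (n - 1 in the denominator) and uses made-up usage figures; it does not read telco_extra.sav.

```python
import statistics

def standardize(values):
    """Return z-scores: (value - mean) / sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample SD, an assumption for this sketch
    return [(v - mean) / sd for v in values]

# Hypothetical monthly usage figures in arbitrary units.
minutes = [120.0, 300.0, 45.0, 210.0]
z = standardize(minutes)

print(round(statistics.fmean(z), 6))  # mean of the z-scores, approximately 0
print(round(statistics.stdev(z), 6))  # standard deviation of the z-scores, approximately 1
```

Once every variable is on this common scale, no single variable dominates the Euclidean distances simply because it was measured in larger units.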
Next, the K-Means Cluster Analysis menu appears. Select the Standardized log-long distance through Standardized log-wireless and Standardized multiple lines through Standardized electronic billing variables and place them in the Variables box.
Label Cases by. Optional; place a variable here to label cases.
Number of Clusters. You have to specify the number of clusters you want. For this example, type 3 in the box.
Method. The default, "Iterate and classify," is an iterative process in which the cluster means are recomputed as cases are added to or deleted from a cluster. Cases are then classified based on the updated cluster centers. With the "Classify only" method, cases are classified based on the initial cluster centers, which are not iteratively recomputed. For this example, Iterate and classify is chosen.
Cluster Centers. You can draw initial cluster centers from a file (Read initial) or you can save the final cluster centers (Write final). For this example, we are not using either option.
Click the Iterate button; the K-Means Cluster Analysis: Iterate box appears. Change Maximum Iterations to 20. Click Continue.
Maximum Iterations. Sets the maximum number of iterations.
Convergence Criterion. The default terminates once the largest change in means of any cluster is less than 2% of the minimum distance between initial cluster centers.
Use running means. If this box is checked, cluster centers will be updated after each case is classified, instead of after all of the cases are classified.
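The convergence rule described above can be written out as a small check. This is a sketch of the stated rule only: the function names and the three initial centers are hypothetical, and the 0.02 proportion is taken from the 2% figure in the text.

```python
import math
from itertools import combinations

def convergence_threshold(initial_centers, criterion=0.02):
    """Criterion times the minimum distance between any two initial centers."""
    min_gap = min(math.dist(a, b) for a, b in combinations(initial_centers, 2))
    return criterion * min_gap

def has_converged(old_centers, new_centers, threshold):
    """True when the largest movement of any cluster center is below the threshold."""
    largest_change = max(math.dist(o, n) for o, n in zip(old_centers, new_centers))
    return largest_change < threshold

# Hypothetical initial centers; the closest pair is 5.0 apart.
initial = [(0.0, 0.0), (3.0, 4.0), (10.0, 0.0)]
threshold = convergence_threshold(initial)

small_move = [(0.05, 0.0), (3.0, 4.0), (10.0, 0.0)]
large_move = [(0.5, 0.0), (3.0, 4.0), (10.0, 0.0)]
print(has_converged(initial, small_move, threshold))  # True
print(has_converged(initial, large_move, threshold))  # False
```

With a minimum gap of 5.0, the threshold works out to about 0.1, so a center that moves 0.05 counts as converged while one that moves 0.5 does not. In practice the loop also stops when Maximum Iterations is reached, whichever comes first.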
Click Options in the K-Means Cluster Analysis dialog box. Check Initial cluster centers, ANOVA table, Cluster information for each case, and Exclude cases pairwise. Click Continue. Click OK.
Initial cluster centers. Prints the initial variable means for each cluster in the output.
ANOVA table. ANOVA F tests are conducted for each variable to indicate how well the variable discriminates between clusters.
Cluster information for each case. Prints each case's final cluster assignment and the Euclidean distance between the case and the cluster center in the output.
Missing Values. The default is listwise deletion. For this example, there are many missing values because most customers did not subscribe to all services, so excluding cases pairwise maximizes the information you can obtain from the data.
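To illustrate why pairwise exclusion retains more information, the sketch below contrasts the two approaches on a case with one missing value. It assumes that pairwise exclusion means computing the distance from the variables that are present; the handout does not spell out the exact computation, so treat the function and the sample values as hypothetical.

```python
import math

def distance_available(case, center):
    """Euclidean distance using only the variables the case actually has."""
    pairs = [(x, c) for x, c in zip(case, center) if x is not None]
    if not pairs:
        return None  # nothing to compare on
    return math.sqrt(sum((x - c) ** 2 for x, c in pairs))

def is_complete(case):
    """Listwise deletion keeps a case only if no variable is missing."""
    return all(x is not None for x in case)

# Hypothetical customer with no value for the second service.
case = (3.0, None, 4.0)
center = (0.0, 9.0, 0.0)

print(is_complete(case))                 # False: dropped under listwise deletion
print(distance_available(case, center))  # 5.0: still usable under pairwise exclusion
```

Under listwise deletion this customer would be removed from the analysis entirely, which is costly when most customers are missing at least one service variable.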
K-Means Clustering Interpretation
The Initial Cluster Centers table shows the first step in k-means clustering: finding the k initial centers.
The Iteration History table shows the number of iterations that were needed before the cluster centers no longer changed substantially.
The Cluster Membership table gives you the cluster each case belongs to and the Euclidean distance of each case to its cluster center. Below is a printout of the first and last 10 cases. Visual inspection of distances is necessary to check for outliers that may not adequately reflect the population.
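When there are too many cases to scan by eye, one option is to sort the cases by their distance to the assigned center and look at the largest ones first. The sketch below is only an aid to that inspection; the cases, labels, and centers are made up, and what counts as an outlier remains a judgment call.

```python
import math

def largest_distances(cases, labels, centers, top=3):
    """Return (distance, case index, cluster) rows, farthest cases first."""
    rows = [
        (math.dist(case, centers[lab]), i, lab)
        for i, (case, lab) in enumerate(zip(cases, labels))
    ]
    return sorted(rows, reverse=True)[:top]

# Hypothetical membership: case 2 sits far from the center of its cluster.
cases = [(0.0, 0.0), (0.0, 1.0), (6.0, 8.0), (10.0, 10.0)]
labels = [0, 0, 0, 1]
centers = [(0.0, 0.0), (10.0, 10.0)]

farthest = largest_distances(cases, labels, centers, top=2)
print(farthest)  # [(10.0, 2, 0), (1.0, 1, 0)]
```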
The Final Cluster Centers table below allows you to describe the clusters by the variables. For example, customers in Cluster 1 tend to purchase a lot of services, as evidenced by values above the mean for all variables. Customers in Cluster 2 tend to purchase the "calling" services, shown by positive values for the four calling services (caller ID, call waiting, call forwarding, and 3-way calling). Customers in Cluster 3 tend to spend very little and do not purchase many services; they have negative values on most of the variables.
The Distances between Final Cluster Centers table shows the Euclidean distances between the final cluster centers. Greater distances between clusters mean there are greater dissimilarities.
Clusters 1 and 3 have the greatest dissimilarities.
Cluster 2 is equally similar to Clusters 1 and 3.
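A table of this kind is simply the matrix of pairwise Euclidean distances between the centers. The sketch below uses three hypothetical two-variable centers chosen to reproduce the same pattern; they are not the values from the telco_extra.sav output.

```python
import math

def center_distances(centers):
    """Symmetric matrix of Euclidean distances between cluster centers."""
    return [[math.dist(a, b) for b in centers] for a in centers]

# Hypothetical final centers for Clusters 1, 2, and 3.
centers = [(3.0, 4.0), (0.0, 0.0), (-3.0, -4.0)]
d = center_distances(centers)

for row in d:
    print(row)
# [0.0, 5.0, 10.0]
# [5.0, 0.0, 5.0]
# [10.0, 5.0, 0.0]
```

The largest entry (10.0) is between Clusters 1 and 3, and Cluster 2 is the same distance (5.0) from each of the other two.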
The ANOVA table indicates which variables contribute the most to your cluster solution. Variables with large mean square errors provide the least help in differentiating between clusters. For example, long distance and calling card had the two highest mean square errors (and lowest F statistics); therefore, the two variables were not as helpful as the other variables in forming and differentiating clusters.
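The relationship between the mean square error and the F statistic can be seen in a small one-way ANOVA computed by hand. This is a sketch with invented values for two variables across three clusters; the function name is hypothetical and the numbers are not from the handout's output.

```python
def f_statistic(values, labels):
    """One-way ANOVA for one variable: returns (F, mean square error)."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    # Between-cluster variation: how far the cluster means sit from the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
    # Within-cluster (error) variation: spread of cases around their own cluster mean.
    ss_error = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
    ms_between = ss_between / (k - 1)
    ms_error = ss_error / (n - k)
    return ms_between / ms_error, ms_error

labels = [0, 0, 1, 1, 2, 2]
separating = [1.0, 2.0, 5.0, 6.0, 9.0, 10.0]  # clusters differ clearly on this variable
overlapping = [1.0, 9.0, 2.0, 8.0, 3.0, 10.0]  # clusters overlap heavily on this one

f_sep, mse_sep = f_statistic(separating, labels)
f_ovr, mse_ovr = f_statistic(overlapping, labels)
print(f_sep, mse_sep)                        # 64.0 0.5
print(round(f_ovr, 3), round(mse_ovr, 3))    # 0.06 24.833
```

The overlapping variable has the larger mean square error and the smaller F, which is the pattern the text describes for long distance and calling card. Because the clusters were formed to maximize differences among cases, these F values are best read descriptively rather than as formal significance tests.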
The Number of Cases in each Cluster table illustrates the split of cases into clusters. A large number of cases were assigned to the third cluster, which is the least profitable group.