
24/6/2015

Lesson 7: Principal Components Analysis (PCA)

Introduction
Sometimes data are collected on a large number of variables from a single population. As an
example, consider the Places Rated dataset below.
Example: Places Rated
In the Places Rated Almanac, Boyer and Savageau rated 329 communities according to the following
nine criteria:
1. Climate and Terrain
2. Housing
3. Health Care & the Environment
4. Crime
5. Transportation
6. Education
7. The Arts
8. Recreation
9. Economics
Note that within the dataset, except for housing and crime, the higher the score the better. For housing
and crime, the lower the score the better. Where some communities might do better in the arts, other
communities might be rated better in other areas such as having a lower crime rate and good educational
opportunities.
Objective
With a large number of variables, the dispersion matrix may be too large to study and interpret properly.
There would be too many pairwise correlations between the variables to consider. Graphical display of
data may also not be of particular help in case the dataset is very large. With 12 variables, for example,
there will be more than 200 three-dimensional scatterplots to be studied!
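The count quoted above can be checked directly. The following is a small sketch using only Python's standard library; the variable names are illustrative and not part of the lesson.

```python
from math import comb

p = 12  # number of variables in the example above

pairwise_correlations = comb(p, 2)   # distinct pairs of variables
three_d_scatterplots = comb(p, 3)    # distinct triples of variables

print(pairwise_correlations, three_d_scatterplots)  # 66 220
```

So 12 variables already give 66 pairwise correlations and 220 possible three-dimensional scatterplots.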
To interpret the data in a more meaningful form, it is therefore necessary to reduce the number of
variables to a few, interpretable linear combinations of the data. Each linear combination will correspond
to a principal component.
(There is another very useful data reduction technique called Factor Analysis, which will be taken up in a
subsequent lesson.)

Learning objectives & outcomes
Upon completion of this lesson, you should be able to do the following:
Carry out a principal components analysis using SAS and Minitab
Assess how many principal components should be considered in an analysis
https://onlinecourses.science.psu.edu/stat505/book/export/html/49

Interpret principal component scores. Be able to describe a subject with a high or low score
Determine when a principal component analysis may be based on the variance-covariance matrix,
and when the correlation matrix should be used
Understand how principal component scores may be used in further analyses.

7.1 Principal Component Analysis (PCA)
Procedure
Suppose that we have a random vector X.
\(\textbf{X}=\left(\begin{array}{c}X_1\\X_2\\\vdots\\X_p\end{array}\right)\)
with population variance-covariance matrix
\(\text{var}(\textbf{X})=\Sigma=\left(\begin{array}{cccc}\sigma^2_1&\sigma_{12}&\dots
&\sigma_{1p}\\\sigma_{21}&\sigma^2_2&\dots&\sigma_{2p}\\\vdots&\vdots&\ddots&\vdots
\\\sigma_{p1}&\sigma_{p2}&\dots&\sigma^2_p\end{array}\right)\)
Consider the linear combinations
\(\begin{array}{lll}Y_1&=&e_{11}X_1+e_{12}X_2+\dots+e_{1p}X_p\\Y_2&=&
e_{21}X_1+e_{22}X_2+\dots+e_{2p}X_p\\&&\vdots\\Y_p&=&e_{p1}X_1+e_{p2}X_2+
\dots+e_{pp}X_p\end{array}\)
Each of these can be thought of as a linear regression, predicting \(Y_i\) from \(X_1, X_2, \dots, X_p\). There is no
intercept, but \(e_{i1}, e_{i2}, \dots, e_{ip}\) can be viewed as regression coefficients.
Note that \(Y_i\) is a function of our random data, and so is also random. Therefore it has a population
variance
\[\text{var}(Y_i)=\sum_{k=1}^{p}\sum_{l=1}^{p}e_{ik}e_{il}\sigma_{kl}=
\mathbf{e}'_i\Sigma\mathbf{e}_i\]
Moreover, \(Y_i\) and \(Y_j\) will have a population covariance
\[\text{cov}(Y_i,Y_j)=\sum_{k=1}^{p}\sum_{l=1}^{p}e_{ik}e_{jl}\sigma_{kl}=
\mathbf{e}'_i\Sigma\mathbf{e}_j\]
Here the coefficients \(e_{ij}\) are collected into the vector
\(\mathbf{e}_i=\left(\begin{array}{c}e_{i1}\\e_{i2}\\\vdots\\e_{ip}\end{array}\right)\)
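The double sums and the quadratic forms above are the same quantity written two ways. The sketch below checks this numerically with NumPy; the covariance matrix `Sigma` and the coefficient vectors are made-up illustrative values, not taken from the lesson.

```python
import numpy as np

# Hypothetical 3-variable variance-covariance matrix (illustrative values only).
Sigma = np.array([[4.0, 1.2, 0.5],
                  [1.2, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])

# Two arbitrary coefficient vectors e_i and e_j (each has unit length).
e_i = np.array([0.6, 0.8, 0.0])
e_j = np.array([0.0, 0.6, 0.8])

# Matrix form: var(Y_i) = e_i' Sigma e_i and cov(Y_i, Y_j) = e_i' Sigma e_j.
var_Yi = e_i @ Sigma @ e_i
cov_YiYj = e_i @ Sigma @ e_j

# Double-sum form, written exactly as in the formulas above.
p = Sigma.shape[0]
var_double_sum = sum(e_i[k] * e_i[l] * Sigma[k, l]
                     for k in range(p) for l in range(p))
cov_double_sum = sum(e_i[k] * e_j[l] * Sigma[k, l]
                     for k in range(p) for l in range(p))

print(var_Yi, cov_YiYj)
```

Both forms agree, which is why the rest of the lesson can work with the compact matrix notation.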
First Principal Component (PCA1): \(Y_1\)
The first principal component is the linear combination of x-variables that has maximum variance (among all
linear combinations), so it accounts for as much variation in the data as possible.

Specifically, we will define coefficients \(e_{11}, e_{12}, \dots, e_{1p}\) for that component in such a way that its
variance is maximized, subject to the constraint that the sum of the squared coefficients is equal to one.
This constraint is required so that a unique answer may be obtained.
More formally, select \(e_{11}, e_{12}, \dots, e_{1p}\) that maximize
\[\text{var}(Y_1)=\sum_{k=1}^{p}\sum_{l=1}^{p}e_{1k}e_{1l}\sigma_{kl}=
\mathbf{e}'_1\Sigma\mathbf{e}_1\]
subject to the constraint that
\[\mathbf{e}'_1\mathbf{e}_1=\sum_{j=1}^{p}e^2_{1j}=1\]
Second Principal Component (PCA2): \(Y_2\)
The second principal component is the linear combination of x-variables that accounts for as much of the
remaining variation as possible, with the constraint that the correlation between the first and second component is 0.

Select \(e_{21}, e_{22}, \dots, e_{2p}\) that maximize the variance of this new component...
\[\text{var}(Y_2)=\sum_{k=1}^{p}\sum_{l=1}^{p}e_{2k}e_{2l}\sigma_{kl}=
\mathbf{e}'_2\Sigma\mathbf{e}_2\]
subject to the constraint that the sums of squared coefficients add up to one,
\[\mathbf{e}'_2\mathbf{e}_2=\sum_{j=1}^{p}e^2_{2j}=1\]
along with the additional constraint that these two components will be uncorrelated with one another.
\[\text{cov}(Y_1,Y_2)=\sum_{k=1}^{p}\sum_{l=1}^{p}e_{1k}e_{2l}\sigma_{kl}=
\mathbf{e}'_1\Sigma\mathbf{e}_2=0\]
All subsequent principal components have this same property: they are linear combinations that account for as
much of the remaining variation as possible, and they are not correlated with the other principal components.

We will do this in the same way with each additional component. For instance:
ith Principal Component (PCAi): \(Y_i\)
We select \(e_{i1}, e_{i2}, \dots, e_{ip}\) that maximize
\[\text{var}(Y_i)=\sum_{k=1}^{p}\sum_{l=1}^{p}e_{ik}e_{il}\sigma_{kl}=
\mathbf{e}'_i\Sigma\mathbf{e}_i\]
subject to the constraint that the sums of squared coefficients add up to one, along with the additional
constraint that this new component will be uncorrelated with all the previously defined components.
\(\mathbf{e}'_i\mathbf{e}_i=\sum_{j=1}^{p}e^2_{ij}=1\)
\(\text{cov}(Y_1,Y_i)=\sum_{k=1}^{p}\sum_{l=1}^{p}e_{1k}e_{il}\sigma_{kl}=
\mathbf{e}'_1\Sigma\mathbf{e}_i=0\),
\(\text{cov}(Y_2,Y_i)=\sum_{k=1}^{p}\sum_{l=1}^{p}e_{2k}e_{il}\sigma_{kl}=
\mathbf{e}'_2\Sigma\mathbf{e}_i=0\),
\(\vdots\)
\(\text{cov}(Y_{i-1},Y_i)=\sum_{k=1}^{p}\sum_{l=1}^{p}e_{i-1,k}e_{il}\sigma_{kl}=
\mathbf{e}'_{i-1}\Sigma\mathbf{e}_i=0\)
Therefore all principal components are uncorrelated with one another.

7.2 How do we find the coefficients?
How do we find the coefficients \(e_{ij}\) for a principal component?
The solution involves the eigenvalues and eigenvectors of the variance-covariance matrix.
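Why eigenvectors appear can be sketched with a Lagrange multiplier. This is a standard argument that the lesson does not spell out, and the multiplier \(\lambda\) is introduced here only for the derivation. To maximize \(\mathbf{e}'_1\Sigma\mathbf{e}_1\) subject to \(\mathbf{e}'_1\mathbf{e}_1=1\), form
\[L(\mathbf{e}_1,\lambda)=\mathbf{e}'_1\Sigma\mathbf{e}_1-\lambda\left(\mathbf{e}'_1\mathbf{e}_1-1\right)\]
Setting the derivative with respect to \(\mathbf{e}_1\) to zero gives
\[2\Sigma\mathbf{e}_1-2\lambda\mathbf{e}_1=\mathbf{0}\quad\Longrightarrow\quad\Sigma\mathbf{e}_1=\lambda\mathbf{e}_1\]
so any maximizer must be an eigenvector of \(\Sigma\). For such a vector the variance is
\[\mathbf{e}'_1\Sigma\mathbf{e}_1=\lambda\mathbf{e}'_1\mathbf{e}_1=\lambda\]
which is largest when \(\lambda\) is the largest eigenvalue.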
Solution:
We are going to let \(\lambda_1\) through \(\lambda_p\) denote the eigenvalues of the variance-covariance matrix. These are
ordered so that \(\lambda_1\) is the largest eigenvalue and \(\lambda_p\) is the smallest.
\(\lambda_1\ge\lambda_2\ge\dots\ge\lambda_p\)
We are also going to let the vectors
\(\mathbf{e}_1,\mathbf{e}_2,\dots,\mathbf{e}_p\)
denote the corresponding eigenvectors. It turns out that the elements of these eigenvectors will be the
coefficients of our principal components.
The variance for the ith principal component is equal to the ith eigenvalue.
\(\text{var}(Y_i)=\text{var}(e_{i1}X_1+e_{i2}X_2+\dots+e_{ip}X_p)=\lambda_i\)
Moreover, the principal components are uncorrelated with one another.
\(\text{cov}(Y_i,Y_j)=0\)
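These two properties can be checked numerically. The sketch below uses NumPy's `numpy.linalg.eigh` on a made-up covariance matrix (the values of `Sigma` are illustrative assumptions) and confirms that the covariance matrix of the components is diagonal with the eigenvalues on the diagonal.

```python
import numpy as np

# Hypothetical variance-covariance matrix (illustrative values only).
Sigma = np.array([[4.0, 1.2, 0.5],
                  [1.2, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])

# eigh handles symmetric matrices and returns eigenvalues in ascending order,
# so reorder to get lambda_1 >= lambda_2 >= ... >= lambda_p.
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
lam = eigvals[order]
E = eigvecs[:, order]          # column i holds the eigenvector e_i

# With Y = E'X, the covariance matrix of Y is E' Sigma E.
cov_Y = E.T @ Sigma @ E

print(np.round(lam, 4))
print(np.round(cov_Y, 4))
```

The off-diagonal entries of `cov_Y` are zero up to rounding error, and its diagonal reproduces `lam`.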
The variance-covariance matrix may be written as a function of the eigenvalues and their corresponding
eigenvectors. This is determined by using the Spectral Decomposition Theorem. This will become useful
later when we investigate topics under factor analysis.
Spectral Decomposition Theorem
The variance-covariance matrix can be written as the sum over the p eigenvalues, each multiplied by the
product of the corresponding eigenvector and its transpose, as shown in the first expression below:
\[\begin{array}{lll}\Sigma&=&\sum_{i=1}^{p}\lambda_i\mathbf{e}_i\mathbf{e}_i'\\&\cong&
\sum_{i=1}^{k}\lambda_i\mathbf{e}_i\mathbf{e}_i'\end{array}\]
The second expression is a useful approximation if \(\lambda_{k+1},\lambda_{k+2},\dots,
\lambda_{p}\) are small. We might approximate \(\Sigma\) by
\[\sum_{i=1}^{k}\lambda_i\mathbf{e}_i\mathbf{e}_i'\]
Again, this will become more useful when we talk about factor analysis.
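The decomposition and its truncated version can be illustrated numerically. This is a minimal sketch with a made-up matrix `Sigma`; the helper `spectral_sum` is a hypothetical name introduced only for this example.

```python
import numpy as np

# Hypothetical variance-covariance matrix (illustrative values only).
Sigma = np.array([[4.0, 1.2, 0.5],
                  [1.2, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])

eigvals, eigvecs = np.linalg.eigh(Sigma)   # ascending order
lam = eigvals[::-1]                        # descending eigenvalues
E = eigvecs[:, ::-1]                       # matching eigenvectors (columns)

def spectral_sum(lam, E, k):
    """Sum of the first k terms lambda_i * e_i e_i'."""
    return sum(lam[i] * np.outer(E[:, i], E[:, i]) for i in range(k))

full = spectral_sum(lam, E, 3)      # all p = 3 terms: reproduces Sigma
approx = spectral_sum(lam, E, 2)    # first k = 2 terms only

# The spectral norm of the error equals the first omitted eigenvalue.
error = np.linalg.norm(Sigma - approx, 2)
print(np.round(error, 4), np.round(lam[2], 4))
```

The approximation error is governed by the omitted eigenvalues, which is why the truncation is reasonable only when they are small.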
Earlier in the course we defined the total variation of X as the trace of the variance-covariance matrix, or
if you like, the sum of the variances of the individual variables. This is also equal to the sum of the
eigenvalues, as shown below:
\(\begin{array}{lll}trace(\Sigma)&=&\sigma^2_1+\sigma^2_2+\dots+\sigma^2_p\\&=&
\lambda_1+\lambda_2+\dots+\lambda_p\end{array}\)
This will give us an interpretation of the components in terms of the amount of the full variation
explained by each component. The proportion of variation explained by the ith principal component is
then going to be defined to be the eigenvalue for that component divided by the sum of the eigenvalues.
In other words, the ith principal component explains the following proportion of the total variation:
\[\frac{\lambda_i}{\lambda_1+\lambda_2+\dots+\lambda_p}\]
A related quantity is the proportion of variation explained by the first k principal components. This would
be the sum of the first k eigenvalues divided by the total variation.
\[\frac{\lambda_1+\lambda_2+\dots+\lambda_k}{\lambda_1+\lambda_2+\dots+\lambda_p}\]
Naturally, if the proportion of variation explained by the first k principal components is large, then not
much information is lost by considering only the first k principal components.
Why It May Be Possible to Reduce Dimensions
When we have correlations (multicollinearity) between the x-variables, the data may more or less fall on a line or
plane in a lower number of dimensions. For instance, imagine a plot of two x-variables that have a nearly perfect
correlation. The data points will fall close to a straight line. That line could be used as a new (one-dimensional)
axis to represent the variation among data points. As another example, suppose that we have verbal, math, and
total SAT scores for a sample of students. We have three variables, but really (at most) two dimensions to the
data because total = verbal + math, meaning the third variable is completely determined by the first two. The
reason for saying "at most" two dimensions is that if there is a strong correlation between verbal and math, then it
may be possible that there is only one true dimension to the data.
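The SAT example can be simulated to see the lost dimension appear as a zero eigenvalue. The scores below are randomly generated under assumed means, spreads and correlation; they are not real SAT data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Simulated (hypothetical) scores: math is correlated with verbal.
verbal = rng.normal(500, 100, n)
math_score = 500 + 0.7 * (verbal - 500) + rng.normal(0, 70, n)
total = verbal + math_score          # exactly determined by the other two

X = np.column_stack([verbal, math_score, total])
S = np.cov(X, rowvar=False)

# Eigenvalues of the sample covariance matrix, largest first.
eigvals = np.sort(np.linalg.eigvalsh(S))[::-1]
proportion = eigvals / eigvals.sum()

print(np.round(proportion, 4))
```

The third eigenvalue is zero up to floating-point error, so the three variables span at most two dimensions, and the first component carries most of the variation because verbal and math are correlated.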

Note
All of this is defined in terms of the population variance-covariance matrix \(\Sigma\), which is unknown.
However, we may estimate \(\Sigma\) by the sample variance-covariance matrix, which is given by the standard
formula here:
\[\textbf{S}=\frac{1}{n-1}\sum_{i=1}^{n}(\mathbf{X}_i-\bar{\textbf{x}})(\mathbf{X}_i-
\bar{\textbf{x}})'\]
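The formula for \(\textbf{S}\) translates directly into a few lines of NumPy. The data matrix below is randomly generated for illustration; `numpy.cov` with `rowvar=False` uses the same \(n-1\) divisor and serves as a cross-check.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))     # hypothetical data: n = 50 rows, p = 4 variables

n = X.shape[0]
xbar = X.mean(axis=0)            # sample mean vector
centered = X - xbar              # rows are X_i - xbar

# S = 1/(n-1) * sum_i (X_i - xbar)(X_i - xbar)'
S = centered.T @ centered / (n - 1)

print(np.allclose(S, np.cov(X, rowvar=False)))
```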
Procedure
Compute the eigenvalues \(\hat{\lambda}_1,\hat{\lambda}_2,\dots,\hat{\lambda}_p\) of the sample
variance-covariance matrix S, and the corresponding eigenvectors \(\hat{\mathbf{e}}_1,
\hat{\mathbf{e}}_2,\dots,\hat{\mathbf{e}}_p\).
Then we will define our estimated principal components using the eigenvectors as our coefficients:
\(\begin{array}{lll}\hat{Y}_1&=&\hat{e}_{11}X_1+\hat{e}_{12}X_2+\dots+\hat{e}_{1p}X_p
\\\hat{Y}_2&=&\hat{e}_{21}X_1+\hat{e}_{22}X_2+\dots+\hat{e}_{2p}X_p\\&&\vdots\\
\hat{Y}_p&=&\hat{e}_{p1}X_1+\hat{e}_{p2}X_2+\dots+\hat{e}_{pp}X_p\\\end{array}\)
Generally, we only retain the first k principal components. Here we must balance two conflicting desires:
1. To obtain the simplest possible interpretation, we want k to be as small as possible. If we can explain
most of the variation just by two principal components then this would give us a much simpler
description of the data. However, the smaller k is, the smaller the amount of variation explained by the
first k components.
2. To avoid loss of information, we want the proportion of variation explained by the first k principal
components to be large, ideally as close to one as possible, i.e., we want
\[\frac{\hat{\lambda}_1+\hat{\lambda}_2+\dots+\hat{\lambda}_k}{\hat{\lambda}_1+
\hat{\lambda}_2+\dots+\hat{\lambda}_p}\cong1\]

7.3 Example: Places Rated
We will use the Places Rated Almanac data (Boyer and Savageau) which rates 329 communities
according to nine criteria:
1. Climate and Terrain
2. Housing
3. Health Care & Environment
4. Crime
5. Transportation
6. Education
7. The Arts
8. Recreation
9. Economics
Notes:
The data for many of the variables are strongly skewed to the right.
The log transformation was used to normalize the data.
Using SAS
Using Minitab
The SAS program places.sas will implement the principal component procedures:
When you examine the output, the first thing that SAS does is to give us summary information.
There are 329 observations representing the 329 communities in our dataset and 9 variables. This is
followed by simple statistics that report the means and standard deviations for each variable.
Below this is the variance-covariance matrix for the data. You should be able to see that the variance
reported for climate is 0.01289.
What we really need to draw our attention to here are the eigenvalues of the variance-covariance
matrix. In the SAS output the eigenvalues are listed in ranked order from largest to smallest. These values have
been copied into Table 1 below for discussion.

Data Analysis:
Step 1: We examine the eigenvalues to determine how many principal components should be considered:
Table 1. Eigenvalues, and the proportion of variation explained by the principal components.
Component   Eigenvalue   Proportion   Cumulative
1           0.3775       0.7227       0.7227
2           0.0511       0.0977       0.8204
3           0.0279       0.0535       0.8739
4           0.0230       0.0440       0.9178
5           0.0168       0.0321       0.9500
6           0.0120       0.0229       0.9728
7           0.0085       0.0162       0.9890
8           0.0039       0.0075       0.9966
9           0.0018       0.0034       1.0000
Total       0.5225

If you take all of these eigenvalues and add them up, you get the total variance of 0.5223 (the rounded
values shown in Table 1 sum to 0.5225).
The proportion of variation explained by each eigenvalue is given in the third column. For example,
0.3775 divided by 0.5223 gives 0.7227 (up to rounding), or about 72% of the variation is explained by this first
eigenvalue. The cumulative percentage explained is obtained by adding the successive proportions of
variation explained to obtain the running total. For instance, 0.7227 plus 0.0977 equals 0.8204, and so
forth. Therefore, about 82% of the variation is explained by the first two eigenvalues together.
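The proportion and cumulative columns can be recomputed from the eigenvalue column. The sketch below uses the rounded eigenvalues from Table 1, so the results may differ from the SAS output in the fourth decimal place.

```python
import numpy as np

# Eigenvalues as printed in Table 1 (rounded to four decimals).
eigenvalues = np.array([0.3775, 0.0511, 0.0279, 0.0230, 0.0168,
                        0.0120, 0.0085, 0.0039, 0.0018])

total = eigenvalues.sum()              # total variation
proportion = eigenvalues / total       # share explained by each component
cumulative = np.cumsum(proportion)     # running total

for i, (lam, prop, cum) in enumerate(zip(eigenvalues, proportion, cumulative), 1):
    print(f"{i}  {lam:.4f}  {prop:.4f}  {cum:.4f}")
```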
Next we need to look at successive differences between the eigenvalues. Subtracting the second
eigenvalue, 0.051, from the first eigenvalue, 0.377, we get a difference of 0.326. The difference between
the second and third eigenvalues is 0.0232; the next difference is 0.0049. Subsequent differences are
even smaller. A sharp drop from one eigenvalue to the next may serve as another indicator of how many
eigenvalues to consider.
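The successive differences can be computed in one step. This sketch again uses the rounded eigenvalues from Table 1; the name `drops` is illustrative.

```python
import numpy as np

# Eigenvalues as printed in Table 1 (rounded to four decimals).
eigenvalues = np.array([0.3775, 0.0511, 0.0279, 0.0230, 0.0168,
                        0.0120, 0.0085, 0.0039, 0.0018])

# drops[i] is the decrease from eigenvalue i+1 to eigenvalue i+2.
drops = eigenvalues[:-1] - eigenvalues[1:]

# Position (1-based) of the component after which the sharpest drop occurs.
sharpest_after = int(np.argmax(drops)) + 1

print(np.round(drops, 4))
print(sharpest_after)
```

The largest drop by far follows the first component, which matches the differences worked out above.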
The first three principal components explain 87% of the variation. This is an acceptably large percentage.
An alternative method to determine the number of principal components is to look at a scree plot.
With the eigenvalues ordered from largest to smallest, a scree plot is the plot of \(\hat{\lambda}_i\) versus i. The
number of components is determined at the point beyond which the remaining eigenvalues are all
relatively small and of comparable size. The following plot is made in Minitab.

The scree plot for the variables without standardization (covariance matrix)
As you see, we could have stopped at the second principal component, but we continued till the third
component. Relatively speaking, the contribution of the third component is small compared to the second
component.
Step 2: Next, we will compute the principal component scores. For example, the first principal
component can be computed using the elements of the first eigenvector:
\(\begin{array}{lll}\hat{Y}_1&=&0.0351\times(\text{climate})+0.0933\times(\text{housing})+
0.4078\times(\text{health})\\&&+0.1004\times(\text{crime})+0.1501\times(\text{transportation})
+0.0321\times(\text{education})\\&&+0.8743\times(\text{arts})+0.1590\times(\text{recreation})+
0.0195\times(\text{economy})\end{array}\)
In order to complete this formula and compute the principal component for the individual community of
interest, plug in that community's values for each of these variables. A fairly standard procedure is, rather
than using the raw data here, to use the difference between the variables and their sample means. This is
known as translation of the random variables. Translation does not affect the interpretations because the
variances of the original variables are the same as those of the translated variables.
Magnitudes of the coefficients give the contributions of each variable to that component. However, the
magnitudes of the coefficients also depend on the variances of the corresponding variables.
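Computing scores from translated (mean-centered) data can be sketched as follows. The data are randomly generated stand-ins, not the Places Rated values, and names such as `E_hat` and `scores` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: 100 observations on 3 correlated variables.
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.5]])
X = rng.normal(size=(100, 3)) @ mixing

xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)

# Eigenvalues and eigenvectors of S, ordered from largest to smallest.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
lam_hat = eigvals[order]
E_hat = eigvecs[:, order]

# Column i of scores holds Y_i-hat for every observation,
# computed from the centered variables.
scores = (X - xbar) @ E_hat

print(np.round(scores.mean(axis=0), 10))
print(np.round(scores.var(axis=0, ddof=1), 4), np.round(lam_hat, 4))
```

The centered scores have mean zero, and their sample variances equal the eigenvalues, as stated in Section 7.2.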

7.4 Interpretation of the Principal Components
Step 3: To interpret each component, we must compute the correlations between the original data for
each variable and each principal component.
These correlations are obtained using the correlation procedure. In the variable statement we will include
the first three principal components, "prin1, prin2, and prin3", in addition to all nine of the original
variables. We will use these correlations between the principal components and the original variables to
interpret these principal components.
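The same correlations can be obtained without a separate correlation procedure, using the standard identity \(\text{corr}(X_k,\hat{Y}_i)=\hat{e}_{ik}\sqrt{\hat{\lambda}_i}/s_k\), which is not stated in this lesson. The sketch below checks the identity against directly computed correlations on randomly generated stand-in data.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: 100 observations on 3 correlated variables.
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.5]])
X = rng.normal(size=(100, 3)) @ mixing

S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
lam_hat, E_hat = eigvals[order], eigvecs[:, order]
scores = (X - X.mean(axis=0)) @ E_hat

# Identity: entry [k, i] is e_ik * sqrt(lambda_i) / s_k.
s = np.sqrt(np.diag(S))
corr_formula = E_hat * np.sqrt(lam_hat) / s[:, None]

# Direct computation: correlate each variable with each component.
corr_all = np.corrcoef(np.hstack([X, scores]), rowvar=False)
corr_direct = corr_all[:3, 3:]

print(np.round(corr_formula, 3))
```

Rows correspond to the original variables and columns to the components, the same layout as the correlation table in this section.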
Because of standardization, all principal components will have mean 0. The standard deviation is also
given for each of the components and these will be the square root of the eigenvalue.
More important for our current purposes are the correlations between the principal components and the
original variables. These have been copied into the following table. You will also note that if you look at
the principal components themselves, there is zero correlation between the components.
                  Principal Component
Variable          1        2        3
Climate           0.190    0.017    0.207
Housing           0.544    0.020    0.204
Health            0.782   -0.605    0.144
Crime             0.365    0.294    0.585
Transportation    0.585    0.085    0.234
Education         0.394    0.273    0.027
Arts              0.985    0.126    0.111
Recreation        0.520    0.402    0.519
Economy           0.142    0.150    0.239

Interpretation of the principal components is based on finding which variables are most strongly
correlated with each component, i.e., which of these numbers are large in magnitude, the farthest from
zero in either the positive or negative direction. Which numbers we consider to be large or small is of course
a subjective decision. You need to determine at what level the correlation value will be of importance.
Here a correlation value above 0.5 is deemed important. These larger correlations are in boldface in the
table above.
We will now interpret the principal component results with respect to the value that we have deemed
significant.
First Principal Component Analysis - PCA1
The first principal component is strongly correlated with five of the original variables. The first principal
component increases with increasing Arts, Health, Transportation, Housing and Recreation scores. This
suggests that these five criteria vary together. If one increases, then the remaining four also tend to increase. This
component can be viewed as a measure of the quality of Arts, Health, Transportation, and Recreation,
and the lack of quality in Housing (recall that high values for Housing are bad). Furthermore, we see that
the first principal component correlates most strongly with the Arts. In fact, we could state that, based on
the correlation of 0.985, this principal component is primarily a measure of the Arts. It would follow
that communities with high values would tend to have a lot of arts available, in terms of theaters,
orchestras, etc., whereas communities with small values would have very few of these types of
opportunities.
Second Principal Component Analysis - PCA2
The second principal component increases with only one of the values, decreasing Health. This
component can be viewed as a measure of how unhealthy the location is in terms of available health care
including doctors, hospitals, etc.
Third Principal Component Analysis - PCA3
The third principal component increases with increasing Crime and Recreation. This suggests that places
with high crime also tend to have better recreation facilities.
To complete the analysis we often would like to produce a scatter plot of the component scores.
In looking at the program, you will see a gplot procedure at the bottom where we are plotting the second
component against the first component. A similar plot can also be prepared in Minitab, but is not shown
here.

Each dot in this plot represents one community. So if you were looking at the red dot out by itself to the
right, you may conclude that this particular dot has a very high value for the first principal component, and
we would expect this community to have high values for the Arts, Health, Housing, Transportation and
Recreation. Whereas if you look at the red dot at the left of the spectrum, you would expect it to have low
values for each of those variables.
The top dot in blue has a high value for the second component. So you would expect that this community
would be lousy for Health. And conversely, if you were to look at the blue dot on the bottom, the
corresponding community would have high values for Health.
Further analyses may include:
Scatter plots of principal component scores. In the present context, we may wish to identify the
locations of each point in the plot to see if places with high levels of a given component tend to be
clustered in a particular region of the country, while sites with low levels of that component are
clustered in another region of the country.
Principal components are often treated as dependent variables for regression and analysis of
variance.

7.5 Alternative: Standardize the Variables
In the previous example we looked at principal components analysis applied to the raw data. In our
earlier discussion we noted that if the raw data are used, principal component analysis will tend to give
more emphasis to those variables that have higher variances than to those variables that have very low
variances. In effect, the results of the analysis will depend on what units of measurement are used to
measure each variable. That would imply that a principal component analysis should only be used with
the raw data if all variables have the same units of measure. And even in this case, only if you wish to
give those variables which have higher variances more weight in the analysis.
A unique example of this type of implementation might be in an ecological setting where you are looking
at counts of different species of organisms at a number of different sample sites. Here, one may want to
give more weight to the more common species that are observed. By analysing the raw data you will tend
to find that more common species will also show higher variances and will be given more emphasis. If
you were to do a principal component analysis on standardized counts, all species would be weighted
equally regardless of how abundant they are, and hence you may find some very rare species entering in
as significant contributors in the analysis. This may or may not be desirable. These types of decisions
need to be made with the scientific foundation and questions in mind.
Summary
The results of principal component analysis depend on the scales at which the variables are
measured.
Variables with the highest sample variances will tend to be emphasized in the first few principal
components.
Principal component analysis using the variance-covariance matrix should only be considered if all of the
variables have the same units of measurement.
If the variables either have different units of measurement (e.g., pounds, feet, gallons, etc.), or if we wish
each variable to receive equal weight in the analysis, then the variables should be standardized before a
principal components analysis is carried out. Standardize each variable by subtracting its mean and
dividing by its standard deviation:
\[Z_{ij}=\frac{X_{ij}-\bar{x}_j}{s_j}\]
where
\(X_{ij}\) = Data for variable j in sample unit i
\(\bar{x}_{j}\) = Sample mean for variable j
\(s_j\) = Sample standard deviation for variable j
We will now perform the principal component analysis using the standardized data.
Note: the variance-covariance matrix of the standardized data is equal to the correlation matrix for the
unstandardized data. Therefore, principal component analysis using the standardized data is equivalent to
principal component analysis using the correlation matrix.
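The equivalence in the note is easy to confirm numerically. The sketch below standardizes randomly generated data whose columns are deliberately on very different scales (an assumption made for illustration) and compares the two matrices.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data with columns on very different scales.
X = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 100.0])

# Z_ij = (X_ij - xbar_j) / s_j, using the sample standard deviation (n - 1).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

cov_Z = np.cov(Z, rowvar=False)      # variance-covariance matrix of standardized data
R = np.corrcoef(X, rowvar=False)     # correlation matrix of the raw data

print(np.allclose(cov_Z, R))
print(round(float(np.trace(cov_Z)), 6))
```

The trace of `cov_Z` equals the number of variables, since each standardized variable has variance one.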
Principal Component Analysis Procedure
The principal components are first calculated by obtaining the eigenvalues for the correlation matrix:
\(\hat{\lambda}_1,\hat{\lambda}_2,\dots,\hat{\lambda}_p\)
Here these denote the eigenvalues of the sample correlation matrix R, with corresponding
eigenvectors
\(\mathbf{\hat{e}}_1,\mathbf{\hat{e}}_2,\dots,\mathbf{\hat{e}}_p\)
Then the estimated principal component scores are calculated using formulas similar to before, but
instead of using the raw data we will use the standardized data in the formulae below:
\(\begin{array}{lll}\hat{Y}_1&=&\hat{e}_{11}Z_1+\hat{e}_{12}Z_2+\dots+\hat{e}_{1p}Z_p
\\\hat{Y}_2&=&\hat{e}_{21}Z_1+\hat{e}_{22}Z_2+\dots+\hat{e}_{2p}Z_p\\&&\vdots\\
\hat{Y}_p&=&\hat{e}_{p1}Z_1+\hat{e}_{p2}Z_2+\dots+\hat{e}_{pp}Z_p\\\end{array}\)
The rest of the procedure and the interpretations are as discussed before.

7.6 Example: Places Rated after Standardization
The previous analysis is repeated after standardizing the variables.
Using SAS
Using Minitab
The SAS program places1.sas will implement the principal component procedures using the
standardized data:

The output begins with descriptive information including the means and standard deviations for the
individual variables being presented.
This is followed by the Correlation Matrix for the data. For example, the correlation between the
housing and climate data was only 0.273. No hypothesis tests are presented for whether these correlations
are equal to zero. We will use this correlation matrix instead to obtain our eigenvalues and
eigenvectors.

We need to focus on the eigenvalues of the correlation matrix that correspond to each of the principal
components. In this case, total variation of the standardized variables is going to be equal to p, the
number of variables. After standardization each variable has variance equal to one, and the total variation
is the sum of these variances; in this case the total variation will be 9.
The eigenvalues of the correlation matrix are given in the second column in the table below. Note also
the proportion of variation explained by each of the principal components, as well as the cumulative
proportion of the variation explained.
Step 1
Examine the eigenvalues to determine how many principal components should be considered:
Component   Eigenvalue   Proportion   Cumulative
1           3.2978       0.3664       0.3664
2           1.2136       0.1348       0.5013
3           1.1055       0.1228       0.6241
4           0.9073       0.1008       0.7249
5           0.8606       0.0956       0.8205
6           0.5622       0.0625       0.8830
7           0.4838       0.0538       0.9368
8           0.3181       0.0353       0.9721
9           0.2511       0.0279       1.0000

The first principal component explains about 37% of the variation. Furthermore, the first four principal
components explain 72%, while the first five principal components explain 82% of the variation.
Compare these proportions with those obtained using non-standardized variables. This analysis is going
to require a larger number of components to explain the same amount of variation as the original analysis
using the variance-covariance matrix. This is not unusual.
In most cases, the required cut-off is pre-specified, i.e., how much of the variation is to be explained is
predetermined. For instance, I might state that I would be satisfied if I could explain 70% of the variation. If
we do this, then we would select as many components as necessary to get up to 70% of the variation. This
would be one approach. This type of judgment is arbitrary and hard to make if you are not experienced
with these types of analysis. The goal to some extent also depends on the type of problem at hand.
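The pre-specified cut-off rule can be written as a small function. This is a sketch using the eigenvalues from the table above; `components_needed` is a hypothetical helper name, not something provided by SAS or Minitab.

```python
import numpy as np

# Eigenvalues of the correlation matrix, as listed in the table above.
eigenvalues = np.array([3.2978, 1.2136, 1.1055, 0.9073, 0.8606,
                        0.5622, 0.4838, 0.3181, 0.2511])

def components_needed(eigenvalues, threshold):
    """Smallest k whose cumulative proportion of variation reaches threshold."""
    cumulative = np.cumsum(eigenvalues) / np.sum(eigenvalues)
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(eigenvalues))   # guard against rounding at threshold = 1

k70 = components_needed(eigenvalues, 0.70)
k80 = components_needed(eigenvalues, 0.80)

print(k70, k80)  # 4 5
```

A 70% target leads to four components and an 80% target to five, in line with the cumulative column.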
Another approach would be to plot the differences between the ordered values and look for a break or a
sharp drop. The only sharp drop that is noticeable in this case is after the first component. One might,
based on this, select only one component. However, one component is probably too few, particularly
because we have only explained 37% of the variation. Consider the scree plot based on the standardized
variables.

The scree plot for standardized variables (correlation matrix)
Step 2
Next, we can compute the principal component scores using the eigenvectors. This is a formula for the
first principal component:
\(\begin{array}{lll}\hat{Y}_1&=&0.158\times Z_{\text{climate}}+0.384\times Z_{\text{housing}}+
0.410\times Z_{\text{health}}\\&&+0.259\times Z_{\text{crime}}+0.375\times
Z_{\text{transportation}}+0.274\times Z_{\text{education}}\\&&+0.474\times Z_{\text{arts}}+
0.353\times Z_{\text{recreation}}+0.164\times Z_{\text{economy}}\end{array}\)
And remember, this is now going to be a function, not of the raw data but of the standardized data.
The magnitudes of the coefficients give the contributions of each variable to that component. Since the
data have been standardized, they do not depend on the variances of the corresponding variables.
Step 3
Next, we can look at the coefficients for the principal components. In this case, since the data are
standardized, within a column the relative magnitude of those coefficients can be directly assessed. Each
column here corresponds with a column in the output of the program labeled Eigenvectors.

                  Principal Component
Variable          1        2        3        4        5
Climate           0.158    0.069    0.800    0.377    0.041
Housing           0.384    0.139    0.080    0.197   -0.580
Health            0.410    0.372    0.019    0.113    0.030
Crime             0.259    0.474    0.128    0.042    0.692
Transportation    0.375    0.141    0.141   -0.430    0.191
Education         0.274   -0.452    0.241    0.457    0.224
Arts              0.474    0.104    0.011    0.147    0.012
Recreation        0.353    0.292    0.042   -0.404    0.306
Economy           0.164    0.540   -0.507    0.476    0.037

Interpretation of the principal components is based on finding which variables are most strongly
correlated with each component. In other words, we need to decide which numbers are large within each
column. In the first column we will decide that Health and Arts are large. This is very arbitrary. Other
variables might have also been included as part of this first principal component.
Component Summaries
First Principal Component Analysis - PCA1
The first principal component is a measure of the quality of Health and the Arts, and to some extent
Housing, Transportation and Recreation. Health increases with increasing values in the Arts. If any of
these variables goes up, so do the remaining ones. They are all positively related as they all have positive
signs.
Second Principal Component Analysis - PCA2
The second principal component is a measure of the severity of crime, the quality of the economy, and
the lack of quality in education. Crime and Economy increase with decreasing Education. Here we can
see that cities with high levels of crime and good economies also tend to have poor educational systems.
Third Principal Component Analysis - PCA3
The third principal component is a measure of the quality of the climate and poorness of the economy.
Climate increases with decreasing Economy. The inclusion of economy within this component will add a
bit of redundancy within our results. This component is primarily a measure of climate, and to a lesser
extent the economy.
Fourth Principal Component Analysis - PCA4
The fourth principal component is a measure of the quality of education and the economy and the
poorness of the transportation network and recreational opportunities. Education and Economy increase
with decreasing Transportation and Recreation.
Fifth Principal Component Analysis - PCA5
The fifth principal component is a measure of the severity of crime and the quality of housing. Crime
increases with decreasing housing.

7.7 Once the Components Have Been Calculated
One can interpret these component by component. One method of deciding how many components to
include is to choose only those that give unambiguous results, i.e., where no variable appears in two different
columns as a significant contribution.
Note that the primary purpose of this analysis is descriptive; it is not hypothesis testing! So your
decision in many respects needs to be made based on what provides you with a good, concise description
of the data.
We have to make a decision as to what is an important correlation, not necessarily from a statistical
hypothesis testing perspective, but from, in this case, an urban-sociological perspective. You have to
decide what is important in the context of the problem at hand. This decision may differ from discipline
to discipline. In some disciplines such as sociology and ecology the data tend to be inherently 'noisy', and
in this case you would expect 'messier' interpretations. If you are looking in a discipline such as
engineering where everything has to be precise, you might put higher demands on the analysis. You
would want to have very high correlations. Principal components analysis is mostly implemented in
sociological and ecological types of applications as well as in marketing research.
As before, you can plot the principal components against one another and explore where the data
for certain observations lie.
Sometimes the principal component scores will be used as explanatory variables in a regression.
Sometimes in regression settings you might have a very large number of potential explanatory variables
to work with. And you may not have much of an idea as to which ones you might think are important.
What you might do is to perform a principal components analysis first and then perform a regression
predicting the variable of interest from the principal components themselves. The nice thing about this
analysis is that the regression coefficients will be independent of one another, since the components are
uncorrelated with one another. In this case you can actually say how much of the variation in the variable of
interest is explained by each of the individual components. This is something that you cannot normally
do in multiple regression.
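This property of regressing on principal component scores can be illustrated with a short simulation. Everything below (the data, the response `y`, and the coefficient values used to generate it) is made up for the sketch.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200

# Hypothetical correlated explanatory variables and response.
mixing = np.array([[1.0, 0.8, 0.2],
                   [0.0, 1.0, 0.5],
                   [0.0, 0.0, 1.0]])
X = rng.normal(size=(n, 3)) @ mixing
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

Xc = X - X.mean(axis=0)
yc = y - y.mean()

# Principal component scores (eigh is ascending, so reverse the columns).
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
scores = Xc @ eigvecs[:, ::-1]

# Multiple regression of y on all components at once.
beta_joint, *_ = np.linalg.lstsq(scores, yc, rcond=None)

# Separate simple regression of y on each component.
beta_single = (scores * yc[:, None]).sum(axis=0) / (scores ** 2).sum(axis=0)

# Share of the variation in y attributed to each component.
sst = yc @ yc
r2_each = beta_single ** 2 * (scores ** 2).sum(axis=0) / sst
resid = yc - scores @ beta_joint
r2_total = 1.0 - (resid @ resid) / sst

print(np.round(r2_each, 4), round(float(r2_total), 4))
```

Because the scores are uncorrelated, each coefficient is the same whether the components are fitted jointly or one at a time, and the per-component shares add up to the overall R-squared.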
One of the problems that we have with this analysis is that, because of all of the numbers involved, the
analysis is not as 'clean' as one would like. For example, in looking at the second and third components,
the economy is considered to be significant for both of those components. As you can see, this will lead
to an ambiguous interpretation in our analysis.
An alternative method of data reduction is Factor Analysis, where factor rotations are used to reduce the
complexity and obtain a cleaner interpretation of the data.

7.8 Summary
In this lesson we learned about:
The definition of a principal components analysis
How to interpret the principal components
How to select the number of principal components to be considered
How to choose between doing the analysis based on the variance-covariance matrix or the
correlation matrix.
Look for this lesson's homework problems that will give you a chance to put what you have learned to
use...