
We hear about correlations every day. Various health outcomes are correlated with socioeconomic status. Iodine supplementation in infants is correlated with higher IQ. Models are everywhere as well. An object falling for t seconds moves .5gt^2 meters. You can calculate correlation and build approximate models using several techniques, but the simplest and most popular technique by far is linear regression. Let's see how it works!

County Health Rankings
For our examples, we'll use the County Health Rankings. Specifically, we'll be looking at two datasets in this example: Years of Potential Life Lost and Additional Measures.

Years of potential life lost (YPLL) is an early mortality measure. It measures, across 100,000 people, the total number of years below the age of 75 that a 100,000-person group loses. For example, if a person dies at age 73, they contribute 2 years to this sum. If they die at age 77, they contribute 0 years to the sum. The YPLL for each 100,000 people, averaged across counties in the United States, is between 8000 and 9000 depending on the year. The file ypll.csv contains per-county YPLLs for the United States in 2011.

The additional measures (found in additional_measures_cleaned.csv) contain all sorts of fun measures per county, ranging from the percentage of people in the county with diabetes to the population of the county.

We're going to see which of the additional measures correlate strongly with our mortality measure, and build predictive models for county mortality rates given these additional measures.
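To make the YPLL definition above concrete, here is a minimal sketch of the per-person calculation. The helper name years_lost and the ages are made up for illustration; the real datasets report the rate already aggregated per county.

```python
# Hypothetical helper illustrating the YPLL definition:
# years lost before age 75, never negative.
CUTOFF_AGE = 75

def years_lost(age_at_death, cutoff=CUTOFF_AGE):
    """Years of potential life lost by one person."""
    return max(0, cutoff - age_at_death)

# Invented ages at death for a small group of people.
ages_at_death = [73, 77, 50, 75]
contributions = [years_lost(age) for age in ages_at_death]
total_ypll = sum(contributions)

print(contributions)  # [2, 0, 25, 0]
print(total_ypll)     # 27
```

Anyone who dies at 75 or later contributes nothing, which is why the measure captures early mortality rather than overall mortality.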

Loading the Rankings
The two .csv files we've given you (ypll.csv and additional_measures_cleaned.csv) went through quite a bit of scrubbing already. You can read our notes on the process if you're interested.

We need to perform some data cleaning and filtering when loading the data. There is a column called Unreliable that will be marked if we shouldn't trust the YPLL data. We want to ignore those. Also, some of the rows won't contain data for some of the additional measures. For example, Yakutat, Alaska doesn't have a value for % child illiteracy. We want to skip those rows. Finally, there is a row per state that summarizes the state's statistics. It has an empty value for the county column and we want to ignore those rows since we are doing a county-by-county analysis. Here's a function, read_csv, that will read the desired columns from one of the csv files.
import csv

def read_csv(file_name, cols, check_reliable):
    rows = {}  # maps "statename__countyname" to the values of the columns in cols
    with open(file_name, newline='') as csv_file:
        for row in csv.DictReader(csv_file):
            if check_reliable and row['Unreliable'] == "x":  # discard unreliable data
                continue
            if row['County'] == "":  # ignore the first entry for each state
                continue
            rname = "%s__%s" % (row['State'], row['County'])
            try:  # if a row[col] is empty, float(row[col]) throws an exception
                rows[rname] = [float(row[col]) for col in cols]
            except (ValueError, TypeError):
                pass  # skip counties with a missing value
    return rows

The function takes as input the csv filename, an array of column names to extract, and whether or not it should check and discard unreliable data. It returns a dictionary mapping each state/county to the values of the columns specified in cols. It handles all of the dirty data: data marked unreliable, state-only data, and missing columns.

When we call read_csv multiple times with different csv files, a row that is dropped in one csv file may be kept in another.



We need to do what database folks call a join between the dict objects returned from read_csv so that only the counties present in both dictionaries will be considered.

We wrote a function called get_arrs that retrieves data from the YPLL and Additional Measures datasets. It takes the arguments dependent_cols, which is a list of column names to extract from ypll.csv, and independent_cols, which is a list of column names to extract from additional_measures_cleaned.csv. This function performs the join for you.
import numpy

def get_arrs(dependent_cols, independent_cols):
    ypll = read_csv("../datasets/county_health_rankings/ypll.csv", dependent_cols, True)
    measures = read_csv("../datasets/county_health_rankings/additional_measures_cleaned.csv", independent_cols, False)

    ypll_arr = []
    measures_arr = []
    for key, value in ypll.items():
        if key in measures:  # join ypll and measures if county is in both
            ypll_arr.append(value[0])
            measures_arr.append(measures[key])
    return (numpy.array(ypll_arr), numpy.array(measures_arr))
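The join idea can be illustrated on toy dictionaries shaped like the ones read_csv returns. This is only a sketch: the county keys and numbers below are invented, and it uses a set intersection rather than the loop in get_arrs.

```python
# Toy stand-ins for the dictionaries read_csv returns (values are invented).
ypll = {
    "Alaska__Yakutat": [9000.0],
    "Ohio__Adams": [8500.0],
    "Texas__Travis": [6000.0],
}
measures = {
    "Ohio__Adams": [12.1, 30000.0],
    "Texas__Travis": [8.3, 55000.0],
    "Utah__Cache": [6.0, 47000.0],
}

# Inner join: keep only the keys present in both dictionaries.
shared_keys = sorted(set(ypll) & set(measures))
joined = [(key, ypll[key][0], measures[key]) for key in shared_keys]

for key, ypll_rate, values in joined:
    print(key, ypll_rate, values)
```

Counties that appear in only one dictionary (here the Alaska and Utah entries) drop out, which is what keeps the two resulting arrays aligned row by row.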

We return numpy arrays (matrices) with rows corresponding to counties and columns corresponding to the columns we read from the spreadsheet. We can finally call the get_arrs function to load the desired columns from each file.


# column names must match the headers in the csv files exactly
dependent_cols = ["YPLL Rate"]
independent_cols = ["Population", "< 18", "65 and over", "African American",
                    "Female", "Rural", "%Diabetes", "HIV rate",
                    "Physical Inactivity", "mental health provider rate",
                    "median household income", "% high housing costs",
                    "% Free lunch", "% child Illiteracy", "% Drive Alone"]

ypll_arr, measures_arr = get_arrs(dependent_cols, independent_cols)
print(ypll_arr.shape)
print(measures_arr[:, 6].shape)

Phew. That sucked. Let's look at the data!

Look at a Scatterplot
Like we did during hypothesis testing, our first step is to look at the data to identify correlations. The best visualization to identify correlations is a scatterplot, since that shows us the relationship between two potentially related variables like ypll and % diabetes.

Let's start by looking at scatterplots of ypll versus three potentially correlated variables: % of a community that has diabetes, % of the community under the age of 18, and median income.
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(6, 8))

subplot = fig.add_subplot(311)
subplot.scatter(measures_arr[:, 6], ypll_arr, color="#1f77b4")  # :,6 means all rows of "diabetes"
subplot.set_title("ypll vs. % of population with diabetes")

subplot = fig.add_subplot(312)
subplot.scatter(measures_arr[:, 1], ypll_arr, color="#1f77b4")  # 1 = age
subplot.set_title("ypll vs. % population less than 18 years of age")

subplot = fig.add_subplot(313)
subplot.scatter(measures_arr[:, 10], ypll_arr, color="#1f77b4")  # 10 = income
subplot.set_title("ypll vs. median household income")

plt.savefig('figures/threescatters.png', format='png')
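The scatterplot code indexes the matrix with expressions such as measures_arr[:,6], which is numpy slicing. As a minimal sketch of that indexing on a made-up matrix (the array arr and its values are invented for illustration):

```python
import numpy

# A made-up 4x3 matrix: 4 rows (think counties), 3 columns (think measures).
arr = numpy.array([[ 0,  1,  2],
                   [10, 11, 12],
                   [20, 21, 22],
                   [30, 31, 32]])

col = arr[:, 1]        # all rows, column 1
first_rows = arr[:3]   # rows 0, 1 and 2 (row 3 is excluded)
middle = arr[1:3, 0]   # rows 1 and 2 of column 0
single = arr[3, 2]     # one element: row 3, column 2

print(col.shape)
print(first_rows.shape)
print(middle)
print(single)
```

Selecting a single column this way gives a 1-dimensional array with one entry per row, which is the shape the scatter calls expect.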

What's measures_arr[:,6]? That's a numpy-supported syntax to extract a subset of a matrix. The first argument specifies which rows to extract. It can be a number (like 3), a python slice (:3 means the rows from 0 up to but not including 3, while 3:5 means rows 3 and 4), or :, which means all of the rows. The second argument specifies which columns to extract. In this case it is 6, which is the 7th column (remember, it's 0-indexed).

Your plots should look something like this:

We picked these three examples because they show visual evidence of three forms of correlation:
- In the first plot, we can see that when the percentage of people in a county with diabetes is higher, so is the mortality rate (YPLL): evidence of a positive correlation.
- The second plot looks like a blob. It's hard to see a relationship between mortality and the fraction of people under the age of 18 in a community.
- The final plot shows evidence of negative correlation. Counties with higher median incomes appear to have lower mortality rates.

Exercise: Look at scatterplots of other variables vs. YPLL. We found the percent of children eligible for school lunch to be alarmingly correlated with YPLL!

Your First Regression
It's time we turn the intuition from our scatterplots into math! We'll do this using the ols module, which stands for ordinary least squares regression. Let's run a regression for YPLL vs. % Diabetes.
import ols

model = ols.ols(ypll_arr, measures_arr[:, 6], "YPLL Rate", ["% Diabetes"])  # 6 = diabetes
model.summary()

The ols script in dataiap/day3/ implements a method called ols() that takes four arguments:
1. A 1-dimensional numpy array containing the values of the dependent variable (e.g., YPLL)
2. A 2-dimensional numpy array where each column contains the values of an independent variable. In this case the only independent variable is % Diabetes, so the matrix has the same shape as ypll_arr.
3. The label for the first argument
4. A list of labels, one for each independent variable in the second argument

As you can see, running the regression is simple, but interpreting the output is tougher. Here's the output of model.summary() for the YPLL vs. % Diabetes regression:

======================================================================
Dependent Variable: YPLL Rate
Method: Least Squares
Date: Fri, 23 Dec 2011
Time: 13:48:11
# obs: 2209
# variables: 2
======================================================================
variable       coefficient     std. Error      t-statistic     prob.
======================================================================
const          585.126403      169.746288      3.447064        0.000577
% Diabetes     782.976320      16.290678       48.062846       0.000000
======================================================================
Models stats                         Residual stats
======================================================================
R-squared            0.511405        Durbin-Watson stat   1.951279
Adjusted R-squared   0.511184        Omnibus stat         271.354997
F-statistic          2310.037134     Prob(Omnibus stat)   0.000000
Prob (F-statistic)   0.000000        JB stat              559.729657
Log likelihood       -19502.794993   Prob(JB)             0.000000
AIC criterion        17.659389       Skew                 0.752881
BIC criterion        17.664550       Kurtosis             4.952933
======================================================================

Let's interpret this:
- First, let's verify the statistical significance, to make sure the relationship is unlikely to have arisen by chance and that the regression is meaningful. In this case, Prob (F-statistic), which is under Models stats, is something very close to 0, which is less than .05 or .01. That is: we have statistical significance, and we can safely interpret the rest of the data.
- The coefficients (called betas) help us understand what line best fits the data, in case we want to build a predictive model. In this case const is 585.13, and % Diabetes has a coefficient of 782.98. Thus, the line (y = mx + b) that best predicts YPLL from % Diabetes is: YPLL = (782.98 * % Diabetes) + 585.13.
- To understand how well the line/model we've built from the data helps predict the data, we look at R-squared. This value ranges from 0 (none of the change in YPLL is predicted by the above equation) to 1 (100% of the change in YPLL is predicted by the above equation). In our case, 51% of the variation in YPLL can be predicted by a linear equation on % Diabetes. That's a reasonably strong correlation.
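For intuition about where the const, coefficient, and R-squared entries come from, here is a minimal sketch of a one-variable least-squares fit using numpy alone. It is not the ols module's implementation; the function name fit_line and the data points are invented for illustration.

```python
import numpy

def fit_line(x, y):
    """Least-squares fit of y = m*x + b; returns (b, m, r_squared)."""
    x = numpy.asarray(x, dtype=float)
    y = numpy.asarray(y, dtype=float)
    # Design matrix with a column of ones for the constant term.
    design = numpy.column_stack([numpy.ones_like(x), x])
    (b, m), _, _, _ = numpy.linalg.lstsq(design, y, rcond=None)
    predicted = design.dot(numpy.array([b, m]))
    ss_res = numpy.sum((y - predicted) ** 2)   # variation the line misses
    ss_tot = numpy.sum((y - y.mean()) ** 2)    # total variation in y
    return b, m, 1.0 - ss_res / ss_tot

# Invented data that loosely resembles YPLL vs. % diabetes.
x = [6.0, 8.0, 9.0, 11.0, 13.0]
y = [5300.0, 7100.0, 7900.0, 9300.0, 11100.0]
b, m, r2 = fit_line(x, y)
print("const: %.2f  slope: %.2f  R-squared: %.4f" % (b, m, r2))
```

The constant b plays the role of const in the summary, m plays the role of the % Diabetes coefficient, and the last value is the fraction of variation in y that the line accounts for.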

Putting this all together, we've just discovered that, without knowing the YPLL of a community, we can take data on the percentage of people affected by diabetes, and roughly reconstruct 51% of the YPLL's characteristics.

If you want to use the information in your regression to do more than print a large table, you can access the data individually:


print("p-value", model.Fpv)
print("coefficients", model.b)
print("R-squared and adjusted R-squared:", model.R2, model.R2adj)

To better visualize the model we've built, we can also plot the line we've calculated through the scatterplot we built before:
fig = plt.figure(figsize=(6, 4))

subplot = fig.add_subplot(111)
subplot.scatter(measures_arr[:, 6], ypll_arr, color="#1f77b4")  # 6 = diabetes
subplot.set_title("ypll vs. % of population with diabetes")

def best_fit(m, b, x):  # calculates y = mx + b
    return m * x + b

# model.b[0] is the constant term, model.b[1] is the slope
line_ys = [best_fit(model.b[1], model.b[0], x) for x in measures_arr[:, 6]]
subplot.plot(measures_arr[:, 6], line_ys, color="#ff7f0e")

plt.savefig('figures/scatterline.png', format='png')

That should result in a plot that looks something like

We can see that our line slopes upward (the beta coefficient in front of the % Diabetes term is positive), indicating a positive correlation.

Exercise: Run the correlations for percentage of population under 18 years of age and median household income.

We got statistically significant results for all of these tests. Median household income is negatively correlated (the slope beta is -.13), and explains a good portion of YPLL (R-squared is .48). Remember that we saw a blob in the scatterplot for percentage of population under 18. The regression backs this up: the R-squared of .005 suggests little predictive power of YPLL.

Exercise: Plot the lines calculated from the regression for each of these independent variables. Do they fit the models?

Exercise: Run the correlation for % of children eligible for school lunches. Is it significant? Positively or negatively correlated? How does this R-squared value compare to the ones we just calculated?

Explaining R-squared
R-squared roughly tells us how well the linear model (the line) we get from a linear regression explains the dependent variable.

R-squared values have several interpretations, but one of them is as the square of a value called the Pearson Correlation Coefficient. That last link has a useful picture of the correlation coefficient that shows you the value of R for different kinds of data.

Squaring R makes it always nonnegative and changes its asymptotic properties, but the same trends (being near 0 or near 1) still apply.
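The link between R-squared and the Pearson correlation coefficient can be checked numerically. The sketch below uses invented, seeded random data and numpy only; for a regression with a single independent variable, the squared correlation and the regression's R-squared should agree up to rounding.

```python
import numpy

rng = numpy.random.RandomState(0)          # fixed seed so the run is repeatable
x = rng.uniform(0, 10, size=200)           # invented independent variable
y = 3.0 * x + rng.normal(0, 4, size=200)   # linear trend plus noise

# Pearson correlation coefficient R between x and y.
pearson_r = numpy.corrcoef(x, y)[0, 1]

# R-squared from a fitted line: 1 - (residual variation / total variation).
slope, intercept = numpy.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)
r_squared = 1.0 - numpy.sum(residuals ** 2) / numpy.sum((y - y.mean()) ** 2)

print("R = %.4f, R^2 = %.4f, regression R-squared = %.4f"
      % (pearson_r, pearson_r ** 2, r_squared))
```

Note that R keeps the sign of the relationship (negative for a downward-sloping line), while R-squared discards it.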

Running Multiple Variables
So far, we've been able to explain about 50% of the variance in YPLL using our additional measures data. Can we do better? What if we combine information from multiple measures? That's called a multiple regression, and we already have all the tools we need to do it! Let's combine household income, % Diabetes, and percentage of the population under 18 into one regression.
# column names must match the headers in the csv files exactly
dependent_cols = ["YPLL Rate"]
independent_cols = ["< 18", "%Diabetes", "median household income"]
ypll_arr, measures_arr = get_arrs(dependent_cols, independent_cols)
model = ols.ols(ypll_arr, measures_arr, "YPLL Rate", independent_cols)
print("p-value", model.Fpv)
print("coefficients", model.b)
print("R-squared and adjusted R-squared:", model.R2, model.R2adj)

We got the following output:

p-value 1.11022302463e-16
coefficients [ 4.11471809e+03  1.30775027e+02  5.16355557e+02 -8.76770577e-02]
R-squared and adjusted R-squared: 0.583249144589 0.582809842914

So we're still significant, and can read the rest of the output. A read of the beta coefficients suggests the best linear combination of all of these variables is YPLL = 4115 + 131*(% under 18) + 516*(% Diabetes) - 0.0877*(median household income).

Because there are multiple independent variables in this regression, we should look at the adjusted R-squared value, which is .583. This value penalizes you for needlessly adding variables to the regression that don't give you more information about YPLL. Anyway, check out that R-squared: nice! That's larger than the R-squared value for any one of the regressions we ran on their own! We can explain more of YPLL with these variables.

Exercise: Try combining other variables. What's the largest adjusted R-squared you can achieve? We can reach .715 by an excessive use of variables. Can you replicate that?
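The penalty behind adjusted R-squared can be written down directly. The sketch below uses the standard textbook formula for a model with a constant term; whether the ols module computes it in exactly this way is an assumption, and the function name adjusted_r_squared is made up for illustration.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared for a model with a constant term.

    n_predictors counts the independent variables, not the constant.
    """
    penalty = (n_obs - 1.0) / (n_obs - n_predictors - 1.0)
    return 1.0 - (1.0 - r_squared) * penalty

# Check against the single-variable YPLL vs. % Diabetes summary:
# R-squared 0.511405 with 2209 observations and one independent variable.
single = adjusted_r_squared(0.511405, 2209, 1)
print(round(single, 6))

# Adding predictors without improving R-squared lowers the adjusted value.
print(adjusted_r_squared(0.511405, 2209, 10) < single)
```

The first value comes out very close to the 0.511184 that the summary reports, which suggests the module uses the same formula. With thousands of counties the penalty per extra variable is small, so it only matters when a new variable adds almost nothing.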

Eliminate Free Lunches, Save the Planet
At some point in performing a regression and testing for a correlation, you will be tempted to come up with solutions to problems the regression has not identified. For example, we noticed that the percentage of children eligible for free lunch is pretty strongly correlated with the mortality rate in a community. How can we use this knowledge to lower the mortality rate?

ALERT, ALERT, ALERT!!! The question at the end of the last paragraph jumped from a question of correlation to a question of causation.

It would be far-fetched to think that increasing or decreasing the number of children eligible for school lunches would increase or decrease the mortality rate in any significant way. What the correlation likely means is that there is a third variable, such as available healthcare, nutrition options, or overall prosperity of a community, that is correlated with both school lunch eligibility and the mortality rate. That's a variable policymakers might have control over, and if we somehow improved outcomes on that third variable, we'd see both school lunch eligibility and the mortality rate go down.

Remember: correlation means two variables move together, not that one moves the other.

We've hit the point that if you're stressed for time, you can jump ahead to the closing remarks. Realize, however, that there's still mind-blowing stuff ahead, and if you have time you should read it!
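The third-variable explanation can be simulated. In this sketch all data is invented: a hidden prosperity variable drives both lunch eligibility and mortality, and neither of the two affects the other. The helper residual_after is a made-up name for removing a variable's linear influence.

```python
import numpy

rng = numpy.random.RandomState(1)   # fixed seed for repeatability
n = 5000

# Invented data: a hidden third variable drives both observed measures.
prosperity = rng.normal(size=n)
lunch_eligibility = -prosperity + 0.5 * rng.normal(size=n)
mortality = -prosperity + 0.5 * rng.normal(size=n)

def residual_after(y, x):
    """What is left of y after removing its best linear fit on x."""
    slope, intercept = numpy.polyfit(x, y, 1)
    return y - (slope * x + intercept)

raw_corr = numpy.corrcoef(lunch_eligibility, mortality)[0, 1]
controlled_corr = numpy.corrcoef(
    residual_after(lunch_eligibility, prosperity),
    residual_after(mortality, prosperity))[0, 1]

print("correlation ignoring prosperity:        %.2f" % raw_corr)
print("correlation controlling for prosperity: %.2f" % controlled_corr)
```

The two measures are strongly correlated when prosperity is ignored, yet almost uncorrelated once its influence is removed from both, even though by construction changing one would never change the other.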

Nonlinearity
Is finding a bunch of independent variables and performing linear regression against some dependent variable the best we can do to model our data? Nope! Linear regression gives us the best line to fit through the data, but there are cases where the interaction between two variables is nonlinear. In these cases, the scatterplots we built before matter quite a bit!

Take gravity for example. Say we measured the distance an object fell in a certain amount of time, and had a bit of noise to our measurement. Below, we'll simulate that activity by generating the time-distance relationship that we learned in high school (displacement = .5gt^2). Imagine we record the displacement of a ball as we drop it, storing the time and displacement measurements in timings and displacements.
timings = list(range(1, 100))
displacements = [4.9 * t * t for t in timings]

A scatterplot of the data looks like a parabola, which won't fit lines very well! We can transform this data by squaring the time values.
sq_timings = [t * t for t in timings]

fig = plt.figure()

subplot = fig.add_subplot(211)
subplot.scatter(timings, displacements, color="#1f77b4")
subplot.set_title("original measurements (parabola)")

subplot = fig.add_subplot(212)
subplot.scatter(sq_timings, displacements, color="#1f77b4")
subplot.set_title("squared time measurements (line)")

plt.savefig('figures/parabolalinearized.png', format='png')

Here are scatterplots of the original and transformed datasets. You can see that squaring the time values turned the plot into a more linear one.

Exercise: Perform a linear regression on the original and transformed data. Are they all significant? What's the R-squared value of each? Which model would you prefer? Does the coefficient of the transformed value mean anything to you?

For those keeping score at home, we got R-squared of .939 and 1.00 for the original and squared timings, which means we were able to perfectly match the data after transformation. Note that in the case of the squared timings, the equation we end up with is displacement = 4.9 * time^2 (the coefficient was 4.9), which is the exact formula we had for gravity. Awesome!

Exercise: Can you improve the R-squared values by transformation in the county health rankings? Try taking the log of the population, a common technique for making data that is bunched up spread out more. To understand what the log transform did, take a look at a scatterplot.

Log-transforming population got us from R-squared = .026 to R-squared = .097.

Linear regression, scatterplots, and variable transformation can get you a long way. But sometimes, you just can't figure out the right transformation to perform even though there's a visible relationship in the data. In those cases, more complex techniques like nonlinear least squares can fit all sorts of nonlinear functions to the data.
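The gravity exercise above can be checked with numpy alone. This is a sketch that uses numpy.polyfit in place of the ols module, so it reports only the slope and R-squared rather than significance; the helper name r_squared_of_line is made up for illustration.

```python
import numpy

timings = numpy.arange(1, 100, dtype=float)
displacements = 4.9 * timings ** 2

def r_squared_of_line(x, y):
    """Slope and R-squared of the best straight line through (x, y)."""
    slope, intercept = numpy.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    r_squared = 1.0 - numpy.sum(residuals ** 2) / numpy.sum((y - y.mean()) ** 2)
    return slope, r_squared

slope_raw, r2_raw = r_squared_of_line(timings, displacements)
slope_sq, r2_sq = r_squared_of_line(timings ** 2, displacements)

print("original timings: R-squared = %.3f" % r2_raw)
print("squared timings:  R-squared = %.3f, slope = %.2f" % (r2_sq, slope_sq))
```

Squaring the timings leaves a relationship that a straight line can describe exactly, so the fitted slope recovers the 4.9 used to generate the data.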

Where to go from here

Today you've swallowed quite a bit. You learned about significance testing to support or reject high-likelihood meaningful hypotheses. You learned about the T-Test to help you compare two communities on whom you've measured data. You then learned about regression and correlation, for identifying variables that change together. From here, there are several directions to grow.
- A more general form of the T-Test is an ANOVA, where you can identify differences among more than two groups, and control for known differences between items in each dataset.
- The T-Test is one of many tests of statistical significance.
- The concept of statistical significance testing comes from a frequentist view of the world. Another view is the Bayesian approach, if mathematical controversy is your thing.
- Logistic regression, and more generally classification, can take a bunch of independent variables and map them onto binary values. For example, you could take all of the additional measures for an individual and predict whether they will die before the age of 75.
- Machine learning and data mining are fields that assume statistical significance (you collect boatloads of data) and develop algorithms to classify, cluster, and otherwise find patterns in the underlying datasets.

MIT OpenCourseWare http://ocw.mit.edu

Resource: How to Process, Analyze and Visualize Data


Adam Marcus and Eugene Wu

The following may not correspond to a particular course on MIT OpenCourseWare, but has been provided by the author as an individual learning resource.

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
