
How to Choose the Best Regression Model


Choosing the correct linear regression model can be difficult. After all, the world and how it works is complex. Trying to model it with only a sample doesn't make it any easier. In this post, I'll review some common statistical methods for selecting models, describe complications you may face, and provide some practical advice for choosing the best regression model.
It starts when a researcher wants to mathematically describe the relationship between some predictors and the response variable. The research team tasked to investigate typically measures many variables but includes only some of them in the model. The analysts try to eliminate the variables that are not related and include only those with a true relationship. Along the way, the analysts consider many possible models.
They strive to achieve a Goldilocks balance with the number of predictors they include.
Too few: An underspecified model tends to produce biased estimates.
Too many: An overspecified model tends to have less precise estimates.
Just right: A model with the correct terms has no bias and the most precise estimates.
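The trade-off in this list can be illustrated with a small simulation. The sketch below is only an illustration in Python with NumPy (not Minitab), and every number in it is an assumption: a true model y = 1 + 2*x1 + 3*x2 + noise, a predictor x2 that is correlated with x1, and an irrelevant predictor x3 that nearly duplicates x1. It refits three models many times and tracks the coefficient on x1.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 50, 500
true_b1 = 2.0  # assumed true coefficient on x1
estimates = {"too few": [], "just right": [], "too many": []}

for _ in range(reps):
    x1 = rng.normal(size=n)
    x2 = 0.7 * x1 + 0.5 * rng.normal(size=n)  # real predictor, correlated with x1
    x3 = x1 + 0.3 * rng.normal(size=n)        # irrelevant, nearly duplicates x1
    y = 1.0 + true_b1 * x1 + 3.0 * x2 + rng.normal(size=n)

    designs = {
        "too few": [x1],             # omits x2 (underspecified)
        "just right": [x1, x2],      # the correct terms
        "too many": [x1, x2, x3],    # adds x3 (overspecified)
    }
    for name, cols in designs.items():
        X = np.column_stack([np.ones(n)] + cols)
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        estimates[name].append(beta[1])  # coefficient on x1

# (average error, spread) of the x1 coefficient across the simulated samples
summary = {name: (np.mean(b) - true_b1, np.std(b)) for name, b in estimates.items()}
for name, (bias, sd) in summary.items():
    print(f"{name:10s} bias = {bias:+.2f}, sd = {sd:.2f}")
```

Under these assumptions, the underspecified model should show a clear bias, and the overspecified model should show a wider spread than the correct one.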

Statistical Methods for Finding the Best Regression Model
For a good regression model, you want to include the variables that you are specifically testing along with other variables that affect the response in order to avoid biased results. Minitab statistical software offers statistical measures and procedures that help you specify your regression model. I'll review the common methods, but please do follow the links to read my more detailed posts about each.
Adjusted R-squared and Predicted R-squared: Generally, you choose the models that have higher adjusted and predicted R-squared values. These statistics are designed to avoid a key problem with regular R-squared: it increases every time you add a predictor and can trick you into specifying an overly complex model.
The adjusted R-squared increases only if the new term improves the model more than would be expected by chance, and it can also decrease with poor-quality predictors.
The predicted R-squared is a form of cross-validation, and it can also decrease. Cross-validation determines how well your model generalizes to other data sets by partitioning your data.
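To make these three statistics concrete, here is a minimal sketch in Python with NumPy. The function name `r_squared_summary` and the simulated data are hypothetical. It assumes the usual textbook definitions: adjusted R-squared as 1 - (1 - R²)(n - 1)/(n - k - 1) for k predictors, and predicted R-squared as 1 - PRESS/SST, where PRESS is the sum of squared leave-one-out prediction errors.

```python
import numpy as np

def r_squared_summary(X, y):
    """Return (R2, adjusted R2, predicted R2) for an OLS fit with an intercept.

    X holds the predictor columns only; the intercept is added here.
    """
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sse = resid @ resid
    sst = np.sum((y - y.mean()) ** 2)
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    # Leverage of each observation: diagonal of the hat matrix A (A'A)^-1 A'
    h = np.sum((A @ np.linalg.inv(A.T @ A)) * A, axis=1)
    # PRESS: squared leave-one-out errors, obtained without refitting n times
    press = np.sum((resid / (1 - h)) ** 2)
    pred_r2 = 1 - press / sst
    return r2, adj_r2, pred_r2

rng = np.random.default_rng(0)
n = 40
x = rng.normal(size=(n, 1))
noise_cols = rng.normal(size=(n, 5))  # predictors unrelated to y
y = 2.0 + 1.5 * x[:, 0] + rng.normal(size=n)

base = r_squared_summary(x, y)
padded = r_squared_summary(np.column_stack([x, noise_cols]), y)
print("one real predictor:    R2=%.3f adj=%.3f pred=%.3f" % base)
print("plus 5 noise columns:  R2=%.3f adj=%.3f pred=%.3f" % padded)
```

Regular R-squared cannot go down when the noise columns are added, while the adjusted and predicted versions are free to drop.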
P-values for the predictors: In regression, low p-values indicate terms that are statistically significant. "Reducing the model" refers to the practice of including all candidate predictors in the model, and then systematically removing the term with the highest p-value one by one until you are left with only significant predictors.
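The reduction procedure just described can be sketched as a loop. This is a hypothetical illustration in Python with NumPy, not Minitab's implementation. The helper names, the simulated data, and the 0.05 cutoff are assumptions. To stay dependency-free it uses a large-sample normal approximation for the p-values; an exact version would use the t distribution (for example `scipy.stats.t.sf`).

```python
import math
import numpy as np

def approx_p_values(A, y):
    """Two-sided p-values for each coefficient (normal approximation)."""
    n, p = A.shape
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    mse = resid @ resid / (n - p)
    se = np.sqrt(mse * np.diag(np.linalg.inv(A.T @ A)))
    t = beta / se
    return np.array([math.erfc(abs(ti) / math.sqrt(2)) for ti in t])

def backward_eliminate(X, y, names, alpha=0.05):
    """Drop the least significant predictor until all remaining p <= alpha."""
    keep = list(range(X.shape[1]))
    while keep:
        A = np.column_stack([np.ones(len(y)), X[:, keep]])
        p = approx_p_values(A, y)[1:]  # skip the intercept
        worst = int(np.argmax(p))
        if p[worst] <= alpha:
            break  # every remaining term is significant
        del keep[worst]  # remove the term with the highest p-value
    return [names[i] for i in keep]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=200)  # only x0 and x1 matter
selected = backward_eliminate(X, y, ["x0", "x1", "x2", "x3", "x4"])
print(selected)
```

An irrelevant predictor can still survive by chance, which is one reason the result should be checked against theory rather than accepted automatically.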
Stepwise regression and Best subsets regression: These are two automated procedures that can identify useful predictors during the exploratory stages of model building. With best subsets regression, Minitab provides Mallows' Cp, which is a statistic specifically designed to help you manage the tradeoff between precision and bias.
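As a rough illustration of best subsets with Mallows' Cp, the sketch below fits every non-empty subset of a small set of candidate predictors. It is written in Python with NumPy and is not Minitab's procedure. The function names and data are hypothetical, and it assumes the common definition Cp = SSE_p / MSE_full - n + 2p, where p counts the parameters in the subset model including the intercept. Subsets whose Cp is small and close to p are the usual candidates.

```python
import itertools
import numpy as np

def sse_for(X, y, subset):
    """Residual sum of squares for an intercept plus the given columns."""
    A = np.column_stack([np.ones(len(y)), X[:, list(subset)]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return resid @ resid

def best_subsets_cp(X, y):
    """Return (Cp, subset) pairs for every non-empty subset, lowest Cp first."""
    n, k = X.shape
    mse_full = sse_for(X, y, range(k)) / (n - k - 1)
    results = []
    for size in range(1, k + 1):
        for subset in itertools.combinations(range(k), size):
            p = size + 1  # parameters, including the intercept
            cp = sse_for(X, y, subset) / mse_full - n + 2 * p
            results.append((cp, subset))
    return sorted(results)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 2.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=100)

ranked = best_subsets_cp(X, y)
for cp, subset in ranked[:3]:
    print(f"columns {subset}: Cp = {cp:.2f}")
```

Exhaustive search doubles in cost with each added candidate, so this brute-force version only suits a handful of predictors.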

Real World Complications
Great, there are a variety of statistical methods to help us choose the best model. Unfortunately, there also are a number of potential complications. Don't worry, I'll provide some practical advice!
The best model can be only as good as the variables measured by the study. The results for the variables you include in the analysis can be biased by the significant variables that you don't include. Read about an example of omitted variable bias.
Your sample might be unusual, either by chance or by data collection methodology. False positives and false negatives are part of the game when working with samples.
P-values can change based on the specific terms in the model. In particular, multicollinearity can sap significance and make it difficult to determine the role of each predictor.
If you assess enough models, you will find variables that appear to be significant but are only correlated by chance. This form of data mining can make random data appear significant. A low predicted R-squared is a good warning sign of this problem.
P-values, predicted and adjusted R-squared, and Mallows' Cp can suggest different models.
Stepwise regression and best subsets regression are great tools and can get you close to the correct model. However, studies have found that they generally don't pick the correct model.
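The data mining complication in this list is easy to reproduce. The sketch below, in Python with NumPy, is purely illustrative and all of its settings are assumptions: 30 observations, 100 candidate predictors that are pure noise, and a response that is unrelated to all of them. It keeps the five candidates most correlated with the response, then compares regular and predicted R-squared for the resulting model.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_candidates, n_keep = 30, 100, 5
X = rng.normal(size=(n, n_candidates))
y = rng.normal(size=n)  # unrelated to every column of X

# "Mine" the data: keep the candidates most correlated with y
corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_candidates)])
chosen = np.argsort(-np.abs(corr))[:n_keep]

A = np.column_stack([np.ones(n), X[:, chosen]])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta
sst = np.sum((y - y.mean()) ** 2)
r2 = 1 - resid @ resid / sst

# Predicted R-squared from leave-one-out (PRESS) residuals
h = np.sum((A @ np.linalg.inv(A.T @ A)) * A, axis=1)
pred_r2 = 1 - np.sum((resid / (1 - h)) ** 2) / sst

print(f"regular R-squared:   {r2:.3f}")
print(f"predicted R-squared: {pred_r2:.3f}")
```

Because the predictors were picked after looking at the response, the regular R-squared typically looks respectable even though nothing real is being modeled, and the predicted R-squared comes out lower.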

Recommendations for Finding the Best Regression Model
Choosing the correct regression model is as much a science as it is an art. Statistical methods can help point you in the right direction, but ultimately you'll need to incorporate other considerations.
Theory
Research what others have done and incorporate those findings into constructing your model. Before beginning the regression analysis, develop an idea of what the important variables are along with their relationships, coefficient signs, and effect magnitudes. Building on the results of others makes it easier both to collect the correct data and to specify the best regression model without the need for data mining.
Theoretical considerations should not be discarded based solely on statistical measures. After you fit your model, determine whether it aligns with theory and possibly make adjustments. For example, based on theory, you might include a predictor in the model even if its p-value is not significant. If any of the coefficient signs contradict theory, investigate and either change your model or explain the inconsistency.
Complexity
You might think that complex problems require complex models, but many studies show that simpler models generally produce more precise predictions. Given several models with similar explanatory ability, the simplest is most likely to be the best choice. Start simple, and only make the model more complex as needed. The more complex you make your model, the more likely it is that you are tailoring the model to your data set specifically, and generalizability suffers.
Verify that added complexity actually produces narrower prediction intervals. Check the predicted R-squared and don't mindlessly chase a high regular R-squared!
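One way to run that check by hand is to compare the standard error of a new prediction under a simple and a complex model, since the width of a prediction interval is that standard error times a t critical value. The sketch below uses Python with NumPy. The function name, the data, the polynomial degrees, and the new point x0 = 0.95 are all assumptions made for illustration. It uses the standard formula s * sqrt(1 + x0'(X'X)^-1 x0).

```python
import numpy as np

def prediction_se(x, y, x0, degree):
    """Standard error of a new prediction at x0 for a polynomial fit.

    Returns (standard error, leverage term x0'(X'X)^-1 x0).
    """
    A = np.vander(x, degree + 1, increasing=True)  # columns 1, x, x^2, ...
    a0 = np.vander(np.array([x0]), degree + 1, increasing=True)[0]
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    s2 = resid @ resid / (len(y) - degree - 1)  # residual variance estimate
    leverage = a0 @ np.linalg.solve(A.T @ A, a0)
    return np.sqrt(s2 * (1 + leverage)), leverage

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=25)
y = 1.0 + 2.0 * x + 0.5 * rng.normal(size=25)  # the truth is a straight line

se_simple, lev_simple = prediction_se(x, y, x0=0.95, degree=1)
se_complex, lev_complex = prediction_se(x, y, x0=0.95, degree=6)
print(f"straight line: se = {se_simple:.3f}, leverage term = {lev_simple:.3f}")
print(f"degree 6:      se = {se_complex:.3f}, leverage term = {lev_complex:.3f}")
```

Extra terms can only raise the leverage term, so the complex model yields a narrower interval only if it lowers the residual variance by enough to compensate.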
Residual Plots
As you evaluate models, check the residual plots because they can help you avoid inadequate models and help you adjust your model for better results. For example, the bias in underspecified models can show up as patterns in the residuals, such as the need to model curvature. The simplest model that produces random residuals is a good candidate for being a relatively precise and unbiased model.
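The curvature example can be checked numerically as well as visually. In the sketch below (Python with NumPy), the data are simulated under an assumed quadratic relationship, and the helper name `residuals` is hypothetical. A straight-line fit leaves residuals that track x squared, and adding the squared term removes the pattern. In practice you would also plot the residuals against the fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 60)
y = 1.0 + 2.0 * x + 1.5 * x**2 + rng.normal(size=x.size)  # assumed true curve

def residuals(columns, y):
    """Residuals from an OLS fit of y on an intercept plus the given columns."""
    A = np.column_stack([np.ones(len(y))] + columns)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return y - A @ beta

resid_linear = residuals([x], y)            # underspecified: no squared term
resid_quadratic = residuals([x, x**2], y)   # includes the curvature

# Correlation between the residuals and the omitted squared term
curvature_linear = np.corrcoef(resid_linear, x**2)[0, 1]
curvature_quadratic = np.corrcoef(resid_quadratic, x**2)[0, 1]
print(f"linear model:    corr(residuals, x^2) = {curvature_linear:.3f}")
print(f"quadratic model: corr(residuals, x^2) = {curvature_quadratic:.3f}")
```

A strong correlation for the linear model is the numeric counterpart of a U-shaped residual plot.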
In the end, no single measure can tell you which model is the best. Statistical methods don't understand the underlying process or subject area. Your knowledge is a crucial part of the process!
If you're learning about regression, read my regression tutorial!
*The image of Rodin's The Thinker was taken by Flickr user innoxius and licensed under CC BY 2.0.
