
Tutorial:

Regression 201
This is the third entry in our regression analysis and modeling series. In this tutorial, we continue the analysis discussion we started earlier by leveraging a more advanced technique, influential data analysis, to help us improve the model and, as a result, the reliability of the forecast.

Again, we will use a sample dataset gathered from 20 different salespersons. The regression model attempts to explain and predict the weekly sales for each person (dependent variable) using two explanatory variables: intelligence (IQ) and extroversion.

Data Preparation
Similar to what we did in our earlier tutorial, we organize our sample data by placing the values of each variable in a separate column and each observation in a separate row.

Next, we introduce the mask. The mask is a Boolean array (0, 1) that chooses which variables are included in (or excluded from) the analysis.

Initially, at the top of the table, let's insert the mask cells array, each with a value of 1 (i.e. included). The array is shown highlighted below:

In this example, we have 20 observations and two independent (explanatory) variables. The response, or dependent variable, is the weekly sales.
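Outside the spreadsheet, the mask idea can be sketched in a few lines of Python with NumPy. This is only an illustration of the concept, not NumXL's implementation: the helper name `apply_mask` is hypothetical, and the numbers are made up rather than taken from the tutorial's dataset.

```python
import numpy as np

def apply_mask(X, mask):
    """Keep only the columns of X whose mask entry is 1 (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    keep = np.asarray(mask).astype(bool)
    if keep.shape[0] != X.shape[1]:
        raise ValueError("the mask needs one entry per explanatory variable")
    return X[:, keep]

# Made-up values for illustration only (columns: IQ, extroversion).
X_demo = np.array([[100.0, 15.0],
                   [110.0, 20.0],
                   [ 95.0, 12.0]])

both_vars = apply_mask(X_demo, [1, 1])   # both variables included
only_extro = apply_mask(X_demo, [0, 1])  # IQ excluded, extroversion kept
```

Setting a mask entry to 0 removes that variable from the analysis without touching the data itself, which is why the tutorial keeps the mask in its own row above the table.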

Process
Now we are ready to conduct our regression analysis. First, select an empty cell in your worksheet where you wish the output to be generated, then locate and click on the regression icon in the NumXL tab (or toolbar).

Regression 201 Tutorial

Spider Financial Corp, 2013

Now the Regression Wizard will appear.

Select the cells range for the response/dependent variable values (i.e. weekly sales). Select the cells range for the explanatory (independent) variables values. For Variables (X) Mask, select the cells at the top of the data table (Boolean array).

Notes:
1. The cells range includes (optional) the heading (Label) cell, which would be used in the output tables where it references those variables.
2. The explanatory variables (i.e. X) are already grouped by columns (each column represents a variable), so we don't need to change that.
3. By default, the output cells range is set to the current selected cell in your worksheet.

Please note that, once we select the X and Y cells range, the Options, Forecast and Missing Values tabs become available (enabled).

Next, select the Options tab.


Initially, the tab is set to the following values:
- The regression intercept/constant is left blank. This indicates that the regression intercept will be estimated by the regression. To set the regression intercept to a fixed value (e.g. zero (0)), enter it there.
- The significance level (a.k.a. α) is set to 5%.
- In the output section, the most common regression analysis is selected.
- For auto-modeling, check this option.
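To make the intercept option concrete, here is a minimal Python sketch of the two cases: an intercept estimated from the data versus one fixed in advance. It assumes ordinary least squares via NumPy; the function name `fit_ols` and the toy numbers are ours, and nothing here reflects how NumXL computes its results internally.

```python
import numpy as np

def fit_ols(X, y, intercept=None):
    """Least-squares fit (hypothetical helper).

    intercept=None  -> the intercept is estimated along with the slopes.
    intercept=value -> the intercept is held at that value; only slopes are fit.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if intercept is None:
        A = np.column_stack([np.ones(len(y)), X])  # prepend a column of ones
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        return coef[0], coef[1:]
    # Fixed intercept: move it to the left-hand side and fit the slopes only.
    slopes, *_ = np.linalg.lstsq(X, y - intercept, rcond=None)
    return float(intercept), slopes

# Toy data that lies exactly on y = 2 + 3x (illustration only).
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([5.0, 8.0, 11.0, 14.0])

b0_est, slopes_est = fit_ols(X_demo, y_demo)               # intercept estimated
b0_fix, slopes_fix = fit_ols(X_demo, y_demo, intercept=2)  # intercept fixed at 2
```

Leaving the field blank in the wizard corresponds to the first call; entering a number corresponds to the second.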

Now, click the Missing Values tab.


In this tab, you can select an approach to handle missing values in the dataset (X and Y). By default, any missing value found in X or in Y in any observation would exclude the observation from the analysis. This treatment is a good approach for our analysis, so let's leave it unchanged.

Now, click OK to generate the output tables.
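The default treatment described above (exclude any observation with a missing value in X or Y) can be sketched as follows. We assume missing cells are represented as NaN, which is our stand-in for Excel's #N/A; the helper name `drop_incomplete_rows` and the numbers are hypothetical.

```python
import numpy as np

def drop_incomplete_rows(X, y):
    """Exclude every observation that has a missing value in X or in y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    complete = ~np.isnan(X).any(axis=1) & ~np.isnan(y)
    return X[complete], y[complete], complete

# Made-up values; NaN plays the role of #N/A.
X_demo = np.array([[100.0, 15.0],
                   [np.nan, 20.0],   # missing X value -> row excluded
                   [ 95.0, 12.0],
                   [105.0, 18.0]])
y_demo = np.array([2500.0, 3000.0, np.nan, 2800.0])  # third y is missing

X_used, y_used, used = drop_incomplete_rows(X_demo, y_demo)
```

The returned Boolean array `used` records which observations entered the analysis, which is handy later when an observation is deliberately voided.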

To assess the influence that each observation exerts on our model, we calculate a couple of statistical measures: leverage and Cook's distance.


Select the cell next to the response variable. In the formula bar, type in the MLR_FITTED function, then click the fx button.

The Function Wizard pops up. Select the input cells range, mask, and a Return type of 4 for the leverage statistics. Click OK.

MLR_FITTED returns an array of values, but you will initially only see the first value. To display the full array, select all the cells below (to the end of the sample). Press F2, then press CTRL+SHIFT+ENTER to copy the array formula.


Now, to calculate the Cook's distance, select the cell next to Leverage and repeat the same steps, but with the return type = 5.
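For readers who want to see what these two columns contain, here is a sketch using the standard textbook definitions: the leverage h_i is the i-th diagonal element of the hat matrix, and Cook's distance is D_i = e_i^2 / (p s^2) * h_i / (1 - h_i)^2, where e_i is the residual, p the number of estimated coefficients (including the intercept) and s^2 the residual variance. We assume NumPy and a model with an intercept; the function name `influence_measures` is hypothetical and the data are synthetic, so this should not be read as NumXL's own code.

```python
import numpy as np

def influence_measures(X, y):
    """Return (leverage, Cook's distance) for an OLS fit with an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    A = np.column_stack([np.ones(n), X])   # design matrix with intercept
    p = A.shape[1]                          # number of estimated coefficients

    # Leverage: diagonal of the hat matrix, computed from the thin QR factor.
    Q, _ = np.linalg.qr(A)
    leverage = np.sum(Q**2, axis=1)

    # Residuals and residual variance of the fitted model.
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    s2 = resid @ resid / (n - p)

    cooks = resid**2 / (p * s2) * leverage / (1.0 - leverage)**2
    return leverage, cooks

# Synthetic data (20 observations, 2 explanatory variables), illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 2))
y_demo = 1.0 + X_demo @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=20)

leverage_demo, cooks_demo = influence_measures(X_demo, y_demo)
```

A useful sanity check is that the leverage values always sum to p (here 3: the intercept plus two slopes), so the average leverage is p/n.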

Analysis
Now that we have the leverage and Cook's distance statistics, let's interpret their findings.

1. Leverage Statistics (H)


Leverage statistics measure the distance of an observation from the center of the data. In our example, the intelligence and extroversion values for Salesman 11 are furthest from the average. Does this mean Salesman 11 is an outlier? Does this mean he exerts influence on the calculation of the regression coefficients?

[Figure: Leverage (H) for observations 1 to 20; vertical axis from 0% to 40%]

To examine this assumption, let's remove Salesman 11 from our input data and examine the resulting regression. To do so, just insert an #N/A value in any input variable of this observation.

[Regression output tables: (Full dataset) vs. (Omitting Salesman #11)]

Dropping observation 11 made things at best the same as earlier. We opted to restore this observation to the sample.

In sum, the leverage statistics do not necessarily imply an outlier, but merely a distant observation with few neighbors.
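A small constructed example shows why high leverage alone does not make an observation influential. The sketch below uses made-up numbers (not the sales data): four clustered points and one distant point that sits exactly on the same line. For a one-variable regression the leverage has the closed form h_i = 1/n + (x_i - mean(x))^2 / Sxx, which we use here; the variable names are ours.

```python
import numpy as np

# Made-up data: four clustered points plus one far-away point on the same line.
x = np.array([1.0, 2.0, 3.0, 4.0, 20.0])
noise = np.array([0.1, -0.1, -0.1, 0.1, 0.0])  # chosen so the fit stays y = 1 + 2x
y = 2.0 * x + 1.0 + noise

n, p = len(x), 2                       # p counts the intercept and the slope
A = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ coef

# Leverage for simple regression: distance of x_i from the mean of x.
sxx = np.sum((x - x.mean())**2)
leverage = 1.0 / n + (x - x.mean())**2 / sxx

# Cook's distance combines the residual with the leverage.
s2 = resid @ resid / (n - p)
cooks = resid**2 / (p * s2) * leverage / (1.0 - leverage)**2
```

The distant point has by far the largest leverage (about 0.98), yet its Cook's distance is essentially zero because the fitted line passes through it: it is far from its neighbors but fully consistent with them.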

2. Cook's Distance (D)

The Cook's distance corrects for weakness in the leverage statistics, and is thus more indicative of influential data. Furthermore, there are a few heuristics for the threshold values of Cook's distance to detect an influential datum. For our analysis, we often use 4/N as a threshold (which translates to 20% for the 20 observations in our dataset).

[Figure: Cook's Distance (D) for observations 1 to 20; vertical axis from 0% to 70%]

Using the threshold or just looking at the plot, we detect that Salesman 16 exerts the highest influence on our regression, so let's void this observation (by setting #N/A in one of the input variables).
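The 4/N rule of thumb is simple enough to sketch in code. The helper below is hypothetical (`flag_influential` is our name), it assumes voided observations appear as NaN, and the Cook's distance values are invented so that one observation clearly stands out; they are not the tutorial's actual figures.

```python
import numpy as np

def flag_influential(cooks):
    """Apply the 4/N heuristic; N counts only the non-missing observations.

    Returns (threshold, 1-based indices of observations above the threshold).
    """
    d = np.asarray(cooks, dtype=float)
    valid = ~np.isnan(d)
    n = int(np.count_nonzero(valid))
    if n == 0:
        raise ValueError("no valid Cook's distance values")
    threshold = 4.0 / n
    filled = np.where(valid, d, 0.0)   # missing entries can never be flagged
    return threshold, np.flatnonzero(filled > threshold) + 1

# Invented values: 20 observations, with number 16 well above the rest.
cooks_demo = np.full(20, 0.05)
cooks_demo[15] = 0.6
threshold, flagged = flag_influential(cooks_demo)

# After voiding observation 16 (NaN stands in for #N/A), N drops to 19.
cooks_after = np.full(20, 0.05)
cooks_after[15] = np.nan
threshold_after, flagged_after = flag_influential(cooks_after)
```

With 20 observations the threshold is 4/20 = 20%; once one observation is voided it becomes 4/19, or roughly 21%.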

Note that the leverage statistics and Cook's distance return #N/A for this missing value.

Let's now examine the regression statistics before and after we dropped the sixteenth observation.

[Regression output tables: (Full Dataset) vs. (Without Salesman #16)]

As you may already have noticed, the regression improved significantly on every dimension (e.g. R-squared, standard error, etc.). Salesman #16 seems to be an influential outlier, so we'll drop him.
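The same kind of before-and-after comparison can be reproduced on a toy example. In this sketch we plant a single outlier in otherwise perfectly linear, made-up data and compare the fit with and without it. The data are deliberately artificial (a real sample would not fit perfectly once the outlier is gone), and `fit_stats` is a hypothetical helper of ours.

```python
import numpy as np

def fit_stats(x, y):
    """Fit y = a + b*x by least squares and report a few summary statistics."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    sse = resid @ resid                   # residual sum of squares
    sst = np.sum((y - y.mean())**2)       # total sum of squares
    return {
        "intercept": coef[0],
        "slope": coef[1],
        "r_squared": 1.0 - sse / sst,
        "std_error": np.sqrt(sse / (len(x) - 2)),
    }

# Made-up data on the line y = 1 + 2x, with one planted outlier at the far end.
x_demo = np.arange(1.0, 11.0)
y_demo = 2.0 * x_demo + 1.0
y_demo[9] = 5.0   # the last observation is pulled far below the line

full = fit_stats(x_demo, y_demo)
keep = np.arange(len(x_demo)) != 9        # drop the planted outlier
trimmed = fit_stats(x_demo[keep], y_demo[keep])
```

In the full sample the outlier drags the slope well below 2 and inflates the standard error; once it is dropped, the slope returns to 2 and R-squared rises.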

To help explain what makes an observation influential, let's examine the extroversion vs. weekly sales graph below:

We draw the linear trend as a proxy for our regression model. The black (circle) data point represents Salesman 16. Its location (extroversion and weekly sales value) is pulling the regression (dashed) line toward it, affecting the value of the regression slope and intercept.

Dropping this observation releases the regression line, adjusting it to better fit the remaining points.

Let's take another look at the Cook's distance plot (without Salesman 16, and with a threshold of 4/19 ≈ 21%).


The Cook's distance values for the different points are distributed somewhat uniformly, and we may stop there.

Note: Bear in mind that our threshold rule is merely a heuristic (rule of thumb), and should not be taken rigidly, but rather as a guideline.

Conclusion
In this tutorial, we have shown that excluding observation #16 is beneficial to our modeling efforts, as it exerts significant influence on our coefficient calculation.

Next, using the remaining 19 observations, let's recalculate (SHIFT+F9) the regression statistics, ANOVA, residuals diagnosis, stepwise regression, etc.


The optimal set of the input variables is the same as earlier. Let's drop the intelligence variable (by setting its value to 0 in the mask), and recalculate.

The regression error is $307 (vs. $332 before we removed Salesman #16).


The final question we may ask ourselves: is the regression stable over the sample dataset? We will take this up in the next issue.
