Regression explained in simple terms

A Vijay Gupta Publication


SPSS for Beginners Vijay Gupta 2000. All rights reside with author.
vjbooks.net
Copyright 2000 Vijay Gupta
Published by VJBooks Inc.
All rights reserved. No part of this book may be used or reproduced in any form or by any means, or stored in a database or retrieval system, without prior written permission of the publisher except in the case of brief quotations embodied in reviews, articles, and research papers. Making copies of any part of this book for any purpose other than personal use is a violation of United States and international copyright laws. For information contact Vijay Gupta at vgupta1000@aol.com.
You can reach the author at vgupta1000@aol.com.
Library of Congress Catalog No.: Pending
ISBN: Pending
First year of printing: 2000
Date of this copy: April 22, 2000
This book is sold as is, without warranty of any kind, either express or implied, respecting the contents of this book, including but not limited to implied warranties for the book's quality, performance, merchantability, or fitness for any particular purpose. Neither the author, the publisher and its dealers, nor distributors shall be liable to the purchaser or any other person or entity with respect to any liability, loss, or damage caused or alleged to be caused directly or indirectly by the book.
Publisher: VJBooks Inc.
Editor: Vijay Gupta
Author: Vijay Gupta
About the Author
Vijay Gupta has taught statistics and econometrics to graduate students at Georgetown University. A Georgetown University graduate with a Masters degree in economics, he has a vision of making the tools of econometrics and statistics easily accessible to professionals and graduate students.
In addition, he has assisted the World Bank and other organizations with statistical analysis, design of international investments, cost-benefit and sensitivity analysis, and training and troubleshooting in several areas.
He is currently working on:
- a package of SPSS Scripts "Making the Formatting of Output Easy"
- a manual on Word
- a manual for Excel
- a tutorial for E-Views
- an Excel add-in "Tools for Enriching Excel's Data Analysis Capacity"
Expect them to be available during fall 2000. Early versions can be downloaded from www.vgupta.com.
LINEAR REGRESSION
Interpretation of regression output is discussed in section 1 [1]. Our approach might conflict with practices you have employed in the past, such as always looking at the R-square first. As a result of our vast experience in using and teaching econometrics, we are firm believers in our approach. You will find the presentation to be quite simple - everything is in one place and displayed in an orderly manner.
The acceptance (as being reliable/true) of regression results hinges on diagnostic checking for the breakdown of classical assumptions [2]. If there is a breakdown, then the estimation is unreliable, and thus the interpretation from section 1 is unreliable. The table in section 2 succinctly lists the various possible breakdowns and their implications for the reliability of the regression results [3].
Why is the result not acceptable unless the assumptions are met? The reason is that the strong statements inferred from a regression (i.e. - "an increase in one unit of the value of variable X causes an increase in the value of variable Y by 0.21 units") depend on the presumption that the variables used in a regression, and the residuals from the regression, satisfy certain statistical properties. These are expressed in the properties of the distribution of the residuals (that explains why so many of the diagnostic tests shown in sections 3-4 and the corrective methods are based on the use of the residuals). If these properties are satisfied, then we can be confident in our interpretation of the results.
The above statements are based on complex formal mathematical proofs. Please check your textbook if you are curious about the formal foundations of the statements.
Section 3 provides a brief schema for checking for the breakdown of classical assumptions. The testing usually involves informal (graphical) and formal (distribution-based hypothesis tests like the F and T) testing, with the latter involving the running of other regressions and computing of variables.
[1] Even though interpretation precedes checking for the breakdown of classical assumptions, it is good practice to first check for the breakdown of classical assumptions (section 4), then to correct for the breakdowns, and then, finally, to interpret the results of a regression analysis.
[2] We will use the phrase "Classical Assumptions" often. Check your textbook for details about these assumptions. In simple terms, regression is a statistical method. The fact that this generic method can be used for so many different types of models and in so many different fields of study hinges on one area of commonality - the model rests on the bedrock of the solid foundations of well-established and proven statistical properties/theorems. If the specific regression model is in concordance with the certain assumptions required for the use of these properties/theorems, then the generic regression results can be inferred. The classical assumptions constitute these requirements.
[3] If you find any breakdown(s) of the classical assumptions, then you must correct for it by taking appropriate measures. Chapter 8 looks into these measures. After running the "corrected" model, you again must perform the full range of diagnostic checks for the breakdown of classical assumptions. This process will continue until you no longer have a serious breakdown problem, or the limitations of data compel you to stop.

1. Interpretation of regression results
Assume you want to run a regression of wage on age, work experience, education, gender, and a dummy for sector of employment (whether employed in the public sector).
wage = function(age, work experience, education, gender, sector)
or, as your textbook will have it,
$\text{wage} = \beta_1 + \beta_2\,\text{age} + \beta_3\,\text{work experience} + \beta_4\,\text{education} + \beta_5\,\text{gender} + \beta_6\,\text{sector}$
Always look at the model fit ("ANOVA") first. Do not make the mistake of looking at the R-square before checking the goodness of fit.
Significance of the model ("Did the model explain the deviations in the dependent variable?")
The last column shows the goodness of fit of the model. The lower this number, the better the fit. Typically, if "Sig" is greater than 0.05, we conclude that our model could not fit the data [4].
ANOVA (a)
Model 1       Sum of Squares   df      Mean Square   F         Sig.
Regression    54514.39         5       10902.88      414.262   .000 (b)
Residual      52295.48         1987    26.319
Total         106809.9         1992
a. Dependent Variable: WAGE
b. Independent Variables: (Constant), WORK_EX, EDUCATION, GENDER, PUB_SEC, AGE
The F is comparing the two models below:
1. $\text{wage} = \beta_1 + \beta_2\,\text{age} + \beta_3\,\text{work experience} + \beta_4\,\text{education} + \beta_5\,\text{gender} + \beta_6\,\text{sector}$
2. $\text{wage} = \beta_1$
(In formal terms, the F is testing the hypothesis: $\beta_2 = \beta_3 = \beta_4 = \beta_5 = \beta_6 = 0$.)
If the F is not significant, then we cannot say that model 1 is any better than model 2. The implication is obvious - the use of the independent variables has not assisted in predicting the dependent variable.
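As a rough sketch of where the F in the ANOVA table comes from, the snippet below recomputes it from the sums of squares and degrees of freedom reported in this example. The function name and the variable names are ours, not SPSS's, and the sample size of 1993 is an assumption inferred from the "Total" degrees of freedom (1992).

```python
def f_statistic(ess, rss, k, n):
    """F = (ESS / k) / (RSS / (n - k - 1)) for a model with k slope coefficients."""
    mean_square_regression = ess / k
    mean_square_residual = rss / (n - k - 1)
    return mean_square_regression / mean_square_residual


# Values read off the ANOVA table of this example.
ESS = 54514.39  # row "Regression"
RSS = 52295.48  # row "Residual"
K = 5           # number of independent variables (df of "Regression")
N = 1993        # assumed sample size, so that the "Total" df is N - 1 = 1992

F = f_statistic(ESS, RSS, K, N)
print(round(F, 2))
```

The result should agree, up to rounding, with the F shown in the table; the "Sig" value is the probability of seeing an F this large if model 2 were the true model.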
Sum of squares
The TSS (Total Sum of Squares) is the total deviations in the dependent variable. The aim of the regression is to explain these deviations (by finding the best betas, that is, those that minimize the sum of the squares of the part left unexplained).
The ESS (Explained Sum of Squares) is the amount of the TSS that could be explained by the model. The R-square, shown in the next table, is the ratio ESS/TSS. It captures the percent of deviation from the mean in the dependent variable that could be explained by the model.
The RSS is the amount that could not be explained (TSS minus ESS).
In the previous table, the column "Sum of Squares" holds the values for TSS, ESS, and RSS. The row "Total" is TSS (106809.9 in the example), the row "Regression" is ESS (54514.39 in the example), and the row "Residual" contains the RSS (52295.48 in the example).

[4] If Sig < .01, then the model is significant at 99%, if Sig < .05, then the model is significant at 95%, and if Sig < .1, the model is significant at 90%. Significance implies that we can accept the model. If Sig > .1 then the model was not significant (a relationship could not be found) or "R-square is not significantly different from zero."
Adjusted R-square
Measures the proportion of the variance in the dependent variable (wage) that was explained by variations in the independent variables. In this example, the "Adjusted R-Square" shows that 50.9% of the variance was explained.
R-square
Measures the proportion of the variation in the dependent variable (wage) that was explained by variations in the independent variables. In this example, the "R-Square" tells us that 51% of the variation (and not the variance) was explained.
Model Summary (a, b)
Model   Variables Entered                                 Variables Removed   R Square   Adjusted R Square   Std. Error of the Estimate
1       WORK_EX, EDUCATION, GENDER, PUB_SEC, AGE (c, d)   .                   .510       .509                5.1302
a. Dependent Variable: WAGE
b. Method: Enter
c. Independent Variables: (Constant), WORK_EX, EDUCATION, GENDER, PUB_SEC, AGE
d. All requested variables entered.
Std Error of Estimate
Std error of the estimate measures the dispersion of the dependent variable's estimate around its mean (in this example, the "Std. Error of the Estimate" is 5.13). Compare this to the mean of the "Predicted" values of the dependent variable. If the Std. Error is more than 10% of the mean, it is high.
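The three summary numbers above can all be recovered from the sums of squares in the ANOVA table. The sketch below shows the arithmetic; the function name is ours, and the sample size (1993) is again an assumption inferred from the "Total" degrees of freedom.

```python
import math


def fit_summary(ess, rss, n, k):
    """Return (R-square, adjusted R-square, std. error of the estimate)."""
    tss = ess + rss                                 # total sum of squares
    r2 = ess / tss                                  # share of variation explained
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)   # penalized for k regressors
    see = math.sqrt(rss / (n - k - 1))              # root of the residual mean square
    return r2, adj_r2, see


# ESS, RSS from the ANOVA table; n = 1993 is assumed, k = 5 independent variables.
r2, adj_r2, see = fit_summary(54514.39, 52295.48, 1993, 5)
print(round(r2, 3), round(adj_r2, 3), round(see, 2))
```

Because the adjustment factor (n - 1)/(n - k - 1) is always above one, the adjusted R-square can never exceed the R-square, which is why adding an irrelevant variable can lower it.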
The reliability of individual coefficients
The table "Coefficients" provides information on the confidence with which we can support each coefficient estimate (see the columns "T" and "Sig."). If the value in "Sig." is less than 0.05, then we can assume that the estimate in column "B" can be asserted as true with a 95% level of confidence [5]. Always interpret the "Sig" value first. If this value is more than 0.1, then the coefficient estimate is not reliable because it has "too" much dispersion/variance.
The individual coefficients
The table "Coefficients" provides information on the effect of individual variables (the "Estimated Coefficients" or "beta" - see column "B") on the dependent variable.

[5] If the value is greater than 0.05 but less than 0.1, we can only assert the veracity of the value in "B" with a 90% level of confidence. If "Sig" is above 0.1, then the estimate in "B" is unreliable and is said to not be statistically significant. The confidence intervals provide a range of values within which we can assert with a 95% level of confidence that the estimated coefficient in "B" lies. For example, "The coefficient for age lies in the range .091 and .145 with a 95% level of confidence, while the coefficient for gender lies in the range -2.597 and -1.463 at a 95% level of confidence."
Coefficients (a)
                Unstandardized Coefficients                        95% Confidence Interval for B
Model 1         B         Std. Error    t         Sig.     Lower Bound    Upper Bound
(Constant)      -1.820    .420          -4.339    .000     -2.643         -.997
AGE             .118      .014          8.635     .000     .091           .145
EDUCATION       .777      .025          31.622    .000     .729           .825
GENDER          -2.030    .289          -7.023    .000     -2.597         -1.463
PUB_SEC         1.741     .292          5.957     .000     1.168          2.314
WORK_EX         .100      .017          5.854     .000     .067           .134
a. Dependent Variable: WAGE
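To see how the columns of the "Coefficients" table hang together, the sketch below rebuilds the T and the confidence bounds from "B" and "Std. Error". It assumes the large-sample critical value of 1.96 for a two-sided 95% interval; SPSS uses the exact T distribution, so the last digit may differ slightly, as it also will because the table inputs are rounded.

```python
Z_95 = 1.96  # assumed large-sample critical value for a two-sided 95% interval


def t_ratio(b, se):
    """T = estimate divided by its standard error."""
    return b / se


def confidence_interval(b, se, crit=Z_95):
    """Lower and upper bound: estimate minus/plus crit standard errors."""
    return b - crit * se, b + crit * se


# (B, Std. Error) pairs copied from the "Coefficients" table.
coefficients = {"AGE": (0.118, 0.014), "GENDER": (-2.030, 0.289)}

intervals = {}
for name, (b, se) in coefficients.items():
    intervals[name] = confidence_interval(b, se)
    low, high = intervals[name]
    print(name, round(t_ratio(b, se), 2), round(low, 3), round(high, 3))
```

An interval that does not contain zero, as for both variables here, corresponds to a "Sig." below 0.05.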
Plot of residual versus predicted dependent variable
This is the plot for the standardized predicted variable and the standardized residuals. The pattern in this plot indicates the presence of mis-specification [6] and/or heteroskedasticity [7].
[Figure: Scatterplot. Dependent Variable: WAGE. Axis: Regression Standardized Predicted Value.]

[6] This includes the problems of incorrect functional form, omitted variable, or a mis-measured independent variable. A formal test such as the RESET Test is required to conclusively prove the existence of mis-specification. Review your textbook for the step-by-step description of the RESET test.
[7] A formal test like the White's Test is necessary to conclusively prove the existence of heteroskedasticity. Review your textbook for the step-by-step description of the White's test.
Plot of residuals versus independent variables
The definite positive pattern indicates [8] the presence of heteroskedasticity caused, at least in part, by the variable education.
[Figure: Partial Residual Plot. Dependent Variable: WAGE.]
The plot of age and the residual has no pattern [9], which implies that no heteroskedasticity is caused by this variable.
[Figure: Partial Residual Plot. Dependent Variable: WAGE.]

[8] A formal test like the White's Test is required to conclusively prove the existence and structure of heteroskedasticity.
[9] Sometimes these plots may not show a pattern. The reason may be the presence of extreme values that widen the scale of one or both of the axes, thereby "smoothing out" any patterns. If you suspect this has happened, as would be the case if most of the graph area were empty save for a few dots at the extreme ends of the graph, then rescale the axes. This is true for all scatter graphs.
Plots of the residuals
The histogram and the P-P plot of the residual suggest that the residual is probably normally distributed [10]. You can also use other tests to check for normality.
[Figure: Normal P-P Plot of Regression Standardized Residual. Dependent Variable: WAGE. Axis: Expected Cum Prob. Annotation: The thick curve should lie close to the diagonal.]
[Figure: Histogram. Dependent Variable: WAGE. Axis: Frequency. Std. Dev = 1.00, Mean = 0.00, N = 1993.00. Annotation: Idealized Normal Curve. In order to meet the classical assumptions, the residuals should, roughly, follow this curve's shape.]

[10] The residuals should be distributed normally. If not, then some classical assumption has been violated.
Regression output interpretation guidelines

Name Of Statistic/Chart: Sig.-F
What Does It Measure Or Indicate? Whether the model as a whole is significant. It tests whether R-square is significantly different from zero.
Critical Values:
- below .01 for 99% confidence in the ability of the model to explain the dependent variable
- below .05 for 95% confidence in the ability of the model to explain the dependent variable
- below 0.1 for 90% confidence in the ability of the model to explain the dependent variable
Comment: The first statistic to look for in SPSS output. If Sig.-F is insignificant, then the regression as a whole has failed. No more interpretation is necessary (although some statisticians disagree on this point). You must conclude that the "Dependent variable cannot be explained by the independent/explanatory variables." The next steps could be rebuilding the model, using more data points, etc.

Name Of Statistic/Chart: RSS, ESS & TSS
What Does It Measure Or Indicate? The main function of these values lies in calculating test statistics like the F-test, etc.
Critical Values: The ESS should be high compared to the TSS (the ratio equals the R-square). Note for interpreting the SPSS table, column "Sum of Squares": "Total" = TSS, "Regression" = ESS, and "Residual" = RSS.
Comment: If the R-squares of two models are very similar or rounded off to zero or one, then you might prefer to use the F-test formula that uses RSS and ESS.

Name Of Statistic/Chart: SE of Regression
What Does It Measure Or Indicate? The standard error of the estimated (predicted) dependent variable.
Critical Values: There is no critical value. Just compare the std. error to the mean of the predicted dependent variable. The former should be small (<10%) compared to the latter.
Comment: You may wish to comment on the SE, especially if it is too large or small relative to the mean of the predicted/estimated values of the dependent variable.

Name Of Statistic/Chart: R-Square
What Does It Measure Or Indicate? Proportion of variation in the dependent variable that can be explained by the independent variables.
Critical Values: Between 0 and 1. A higher value is better.
Comment: This often mis-used value should serve only as a summary measure of Goodness of Fit. Do not use it blindly as a criterion for model selection.

Name Of Statistic/Chart: Adjusted R-square
What Does It Measure Or Indicate? Proportion of variance in the dependent variable that can be explained by the independent variables, or R-square adjusted for # of independent variables.
Critical Values: Below 1. A higher value is better.
Comment: Another summary measure of Goodness of Fit. Superior to R-square because it is sensitive to the addition of irrelevant variables.

Name Of Statistic/Chart: T-Ratios
What Does It Measure Or Indicate? The reliability of our estimate of the individual beta.
Critical Values: Look at the p-value (in the column "Sig."); it must be low:
- below .01 for 99% confidence in the value of the estimated coefficient
- below .05 for 95% confidence in the value of the estimated coefficient
- below .1 for 90% confidence in the value of the estimated coefficient
Comment: For a one-tailed test (at 95% confidence level), the critical value is (approximately) 1.65 for testing if the coefficient is greater than zero and (approximately) -1.65 for testing if it is below zero.

Name Of Statistic/Chart: Confidence Interval for beta
What Does It Measure Or Indicate? The 95% confidence band for each beta estimate.
Critical Values: The upper and lower values give the 95% confidence limits for the coefficient.
Comment: Any value within the confidence interval cannot be rejected (as the true value) at 95% degree of confidence.

Name Of Statistic/Chart: Charts: Scatter of predicted dependent variable and residual
What Does It Measure Or Indicate? Mis-specification and/or heteroskedasticity.
Critical Values: There should be no discernible pattern. If there is a discernible pattern, then do the RESET and/or DW test for mis-specification or the White's test for heteroskedasticity.
Comment: Extremely useful for checking for breakdowns of the classical assumptions, i.e. - for problems like mis-specification and/or heteroskedasticity. At the top of this table, we mentioned that the F-statistic is the first output to interpret. Some may argue that the ZPRED-ZRESID plot is more important (their rationale will become apparent as you read through the rest of this chapter and chapter 8).

Name Of Statistic/Chart: Charts: plots of residuals against independent variables
What Does It Measure Or Indicate? Heteroskedasticity.
Critical Values: There should be no discernible pattern. If there is a discernible pattern, then perform White's test to formally check.
Comment: Common in cross-sectional data. If a partial plot has a pattern, then that variable is a likely candidate for the cause of heteroskedasticity.

Name Of Statistic/Chart: Charts: Histograms of residuals
What Does It Measure Or Indicate? Provides an idea about the distribution of the residuals.
Critical Values: The distribution should look like a normal distribution.
Comment: A good way to observe the actual behavior of our residuals and to observe any severe problem in the residuals (which would indicate a breakdown of the classical assumptions).
Problems caused by breakdown of classical assumptions
The fact that we can make bold statements on causality from a regression hinges on the classical linear model. If its assumptions are violated, then we must re-specify our analysis and begin the regression anew. It is very unsettling to realize that a large number of institutions, journals, and faculties allow this fact to be overlooked.
When using the table below, remember the ordering of the severity of an impact.
- The worst impact is a bias in the F (then the model can't be trusted)
- A second disastrous impact is a bias in the betas (the coefficient estimates are unreliable)
- Compared to the above, biases in the standard errors and T are not so harmful (these biases only affect the reliability of our confidence about the variability of an estimate, not the reliability about the value of the estimate itself)
Summary of impact of a breakdown of a classical assumption on the reliability with which regression output can be interpreted

Violation | Impact: F | R² | Std error (of estimate) | Std error (of β) | T | Count of violations
Measurement error in dependent variable | B B 2
Measurement error in independent variable | B B B B B B 4
Irrelevant variable | B B 2
Omitted variable | B B B B B B 4
Incorrect functional form | B B B B B B 4
Heteroskedasticity | B  B B B 2
Collinearity | B B 2
Simultaneity Bias | B B B B B B 4

Legend for understanding the table
The statistic is still reliable and unbiased.
X: The statistic is biased, and thus cannot be relied upon.
Upward bias in estimation.
Downward bias in estimation.
Diagnostics
This section lists some methods of detecting breakdowns of the classical assumptions.
Why is the result not acceptable unless the assumptions are met? The reason is simple - the strong statements inferred from a regression (e.g. - "an increase in one unit of the value of variable X causes an increase of the value of variable Y by 0.21 units") depend on the presumption that the variables used in a regression, and the residuals from that regression, satisfy certain statistical properties. These are expressed in the properties of the distribution of the residuals. That explains why so many of the diagnostic tests shown in sections 7.4-7.5 and their relevant corrective methods, shown in this chapter, are based on the use of the residuals. If these properties are satisfied, then we can be confident in our interpretation of the results. The above statements are based on complex, formal mathematical proofs. Please refer to your textbook if you are curious about the formal foundations of the statements.
With experience, you should develop the habit of doing the diagnostics before interpreting the model's significance, explanatory power, and the significance and estimates of the regression coefficients. If the diagnostics show the presence of a problem, you must first correct the problem and then interpret the model. Remember that the power of a regression analysis (after all, it is extremely powerful to be able to say that "data shows that X causes Y by this slope factor") is based upon the fulfillment of certain conditions that are specified in what have been dubbed the "classical" assumptions.
Refer to your textbook for a comprehensive listing of methods and their detailed descriptions.
If a formal [11] diagnostic test confirms the breakdown of an assumption, then you must attempt to correct for it. This correction usually involves running another regression on a transformed version of the original model, with the exact nature of the transformation being a function of the classical regression assumption that has been violated [12].
[11] Usually, a "formal" test uses a hypothesis testing approach. This involves the use of testing against distributions like the T, F, or Chi-Square. An "informal" test typically refers to a graphical test.
[12] Don't worry if this line confuses you at present - its meaning and relevance will become apparent as you read through this chapter.

Collinearity [13]
[13] Also called Multicollinearity.
Collinearity between variables is always present. A problem occurs if the degree of collinearity is high enough to bias the estimates.
Note: Collinearity means that two or more of the independent/explanatory variables in a regression have a linear relationship. This causes a problem in the interpretation of the regression results. If the variables have a close linear relationship, then the estimated regression coefficients and T-statistics may not be able to properly isolate the unique effect/role of each variable and the confidence with which we can presume these effects to be true. The close relationship of the variables makes this isolation difficult. Our explanation may not satisfy a statistician, but we hope it conveys the fundamental principle of collinearity.
Summary measures for testing and detecting collinearity include:
- Running bivariate and partial correlations (see section 5.3). A bivariate or partial correlation coefficient greater than 0.8 (in absolute terms) between two variables indicates the presence of significant collinearity between them.
- Collinearity is indicated if the R-square is high (greater than 0.75 [14]) and only a few T-values are significant.
- Check your textbook for more on collinearity diagnostics.
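The first summary measure above can be sketched in a few lines. The data below are made up purely for illustration (they are not the wage data set of this chapter), and the helper names are ours; the 0.8 cut-off is the rule of thumb quoted in the text.

```python
import math


def pearson_r(x, y):
    """Bivariate (Pearson) correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def flag_collinear(r, threshold=0.8):
    """Rule of thumb from the text: |r| above the threshold signals collinearity."""
    return abs(r) > threshold


# Hypothetical values: older respondents tend to have more work experience.
age = [25, 30, 35, 40, 45, 50]
work_ex = [2, 8, 11, 18, 22, 27]

r = pearson_r(age, work_ex)
print(round(r, 3), flag_collinear(r))
```

A flag here is only a warning sign; whether the collinearity actually harms the regression still has to be judged from the T-values and the R-square, as the second measure suggests.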
Mis-specification
Mis-specification of the regression model is the most severe problem that can befall an econometric analysis. Unfortunately, it is also the most difficult to detect and correct.
Note: Mis-specification covers a list of problems. These problems can cause moderate or severe damage to the regression analysis. Of graver importance is the fact that most of these problems are caused not by the nature of the data/issue, but by the modeling work done by the researcher. It is of the utmost importance that every researcher realize that the responsibility of correctly specifying an econometric model lies solely with them. A proper specification includes determining curvature (linear or not), functional form (whether to use logs, exponentials, or squared variables), and the accuracy of measurement of each variable, etc.
Mis-specification can be of several types: incorrect functional form, omission of a relevant independent variable, and/or measurement error in the variables. Sections 7.4.c to 7.4.f list a few summary methods for detecting mis-specification. Refer to your textbook for a comprehensive listing of methods and their detailed descriptions.
Simultaneity bias
Simultaneity bias may be seen as a type of mis-specification. This bias occurs if one or more of the independent variables is actually dependent on other variables in the equation. For example, we are using a model that claims that income can be explained by investment and education. However, we might believe that investment, in turn, is explained by income. If we were to use a simple model in which income (the dependent variable) is regressed on investment and education (the independent variables), then the specification would be incorrect because investment would not really be "independent" to the model - it is affected by income. Intuitively, this is a problem because the simultaneity implies that the residual will have some relation with the variable that has been incorrectly specified as "independent" - the residual is capturing (more in a metaphysical than formal mathematical sense) some of the unmodeled reverse relation between the "dependent" and "independent" variables.

[14] Some books advise using 0.8.
Incorrect functional form
If the correct relation between the variables is non-linear but you use a linear model and do not transform the variables, then the results will be biased.
Why should an incorrect functional form lead to severe problems? Regression is based on finding coefficients that minimize the "sum of squared residuals." Each residual is the difference between the predicted value (the regression line) of the dependent variable and the realized value in the data. If the functional form is incorrect, then each point on the regression "line" is incorrect because the line is based on an incorrect functional form. A simple example: assume Y has a log relation with X (a log curve represents their scatter plot) but a linear relation with "Log X." If we regress Y on X (and not on "Log X"), then the estimated regression line will have a systematic tendency for a bias because we are fitting a straight line on what should be a curve. The residuals will be calculated from the incorrect "straight" line and will be wrong. If they are wrong, then the entire analysis will be biased because everything hinges on the use of the residuals.
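The log example can be made concrete with a small sketch. The data are artificial (an exact log relation with no noise, chosen by us for illustration), and the helper names are ours. Regressing Y on X leaves residuals with a clear pattern, while regressing Y on "Log X" leaves none.

```python
import math


def ols_line(x, y):
    """Intercept and slope of the least-squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return mean_y - slope * mean_x, slope


def residuals(x, y):
    """Realized value minus the value predicted by the fitted line."""
    intercept, slope = ols_line(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


X = list(range(1, 21))
Y = [2.0 + 3.0 * math.log(x) for x in X]           # hypothetical exact log relation

wrong_form = residuals(X, Y)                        # straight line forced on a curve
right_form = residuals([math.log(x) for x in X], Y) # linear in "Log X"

print([round(e, 2) for e in wrong_form])
print(max(abs(e) for e in right_form))
```

With the wrong form the residuals are negative at both ends of the X range and positive in the middle, which is exactly the kind of pattern the residual plots are meant to reveal.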
Listed below are methods of detecting incorrect functional forms:
- Perform a preliminary visual test. Any pattern in a plot of the predicted variable and the residuals implies mis-specification (and/or heteroskedasticity) due to the use of an incorrect functional form or due to omission of a relevant variable.
- If the visual test indicates a problem, perform a formal diagnostic test like the RESET test or the DW test.
- Check the mathematical derivation (if any) of the model.
- Determine whether any of the scatter plots have a non-linear pattern. If so, is the pattern log, square, etc?
- The nature of the distribution of a variable may provide some indication of the transformation that should be applied to it. For example, section 3.2 showed that wage is non-normal but that its log is normal. This suggests re-specifying the model by using the log of wage instead of wage.
- Check your textbook for more methods.
Omitted variable
Not including a variable that actually plays a role in explaining the dependent variable can bias the regression results. Methods of detection [15] include:
- Perform a preliminary visual test. Any pattern in the plot of the predicted variable and the residuals implies mis-specification (and/or heteroskedasticity) due to the use of an incorrect functional form or due to the omission of a relevant variable.
- If the visual test indicates a problem, perform a formal diagnostic test such as the RESET test.
- Apply your intuition, previous research, hints from preliminary bivariate analysis, etc. For example, in the model we ran, we believe that there may be an omitted variable bias because of the absence of two crucial variables for wage determination - whether the labor is unionized and the professional sector of work (medicine, finance, retail, etc.).
- Check your textbook for more methods.

[15] The first three tests are similar to those for Incorrect Functional form.
Inclusion of an irrelevant variable
This mis-specification occurs when a variable that is not actually relevant to the model is included [16]. To detect the presence of irrelevant variables:
- Examine the significance of the T-statistics. If the T-statistic is not significant at the 10% level (usually if T < 1.64 in absolute terms), then the variable may be irrelevant to the model.
Measurement error
This is not a very severe problem if it only afflicts the dependent variable, but it may bias the T-statistics. Methods of detecting this problem include:
- Knowledge about problems/mistakes in data collection
- There may be a measurement error if the variable you are using is a proxy for the actual variable you intended to use. In our example, the wage variable includes the monetized values of the benefits received by the respondent. But this is a subjective monetization by respondents and is probably undervalued. As such, we can guess that there is probably some measurement error.
- Check your textbook for more methods
Measurement errors causing problems can be easily understood. Omitted variable bias is a bit more complex. Think of it this way - the deviations in the dependent variable are in reality partly explained by the variable that has been omitted. Because the variable has been omitted, the algorithm will, mistakenly, apportion what should have been explained by that variable to the other variables, thus creating the error(s). Remember: our explanations are too informal and probably incorrect by strict mathematical proof for use in an exam. We include them here to help you understand the problems a bit better.
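The "apportioning" described above can be seen in a toy example. The numbers are invented by us for illustration: Y depends on two variables, but only the first is included in the regression, so its slope absorbs part of the effect of the omitted one.

```python
def simple_slope(x, y):
    """Slope of the least-squares line of y on a single regressor x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


# Hypothetical data: the true model is y = 2*x1 + 3*x2, with x1 and x2 correlated.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.0, 3.0, 2.0, 4.0]
y = [2.0 * a + 3.0 * b for a, b in zip(x1, x2)]

biased_slope = simple_slope(x1, y)   # regression that omits x2
spillover = simple_slope(x1, x2)     # how strongly x2 moves with x1

print(biased_slope, 2.0 + 3.0 * spillover)
```

The estimated slope on x1 equals its true effect (2) plus the effect of the omitted variable (3) times the slope of x2 on x1. If the two regressors were unrelated, the spillover would be zero and the omission would not bias the slope.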
Ceteros!edasticity
9eteroskedasti$ity i#plies that the varian$es =i.e. 8 the dispersion around the e!pe$ted #ean of
7ero? of the residuals are not $onstant% but that they are different for different observations. 3his
$auses a proble#/ if the varian$es are une&ual% then the relative reliability of ea$h observation
=used in the regression analysis? is une&ual. 3he larger the varian$e% the lower should be the
i#portan$e =or weight? atta$hed to that observation. As you will see in se$tion C.2% the $orre$tion
for this proble# involves the downgrading in relative i#portan$e of those observations with higher
varian$e. 3he proble# is #ore apparent when the value of the varian$e has so#e relation to one or
#ore of the independent variables. 8ntuitively, this is a problem because the distribution of the
residuals should have no relation with any of the variables (a basic assumption of the classical
model).
1ete$tion involves two steps/
-ooking for patterns in the plot of the predi$ted dependent variable and the residual
[16] By dropping it, we improve the reliability of the T-statistics of the other variables (which are relevant to the model).
But, we may be causing a far more serious problem - an omitted variable! An insignificant T is not necessarily a bad
thing - it is the result of a "true" model. Trying to remove variables to obtain only significant T-statistics is bad practice.
If the graphical inspection hints at heteroskedasticity, you must conduct a formal test like
White's test. Section 7.5 teaches you how to conduct a White's test [17].
Checking formally for heteroskedasticity: White's test
White's test is usually used as a test for heteroskedasticity. In this test, a regression of the
squares of the residuals is run on the variables suspected of causing the heteroskedasticity, their
squares, and cross products.
$$(\text{residuals})^2 = b_0 + b_1\,\text{educ} + b_2\,\text{work\_ex} + b_3\,(\text{educ})^2 + b_4\,(\text{work\_ex})^2 + b_5\,(\text{educ} \times \text{work\_ex})$$
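The right-hand side of this auxiliary regression can be built from the two original variables. This is a minimal sketch: the function name `white_regressors` and the sample values are our own, chosen only to show which five columns go into the regression (the constant is left to the regression routine).

```python
def white_regressors(educ, work_ex):
    """For each observation return [educ, work_ex, educ^2, work_ex^2, educ*work_ex]."""
    return [[e, w, e ** 2, w ** 2, e * w] for e, w in zip(educ, work_ex)]

# Hypothetical values for three respondents.
educ = [12, 16, 8]
work_ex = [5, 2, 20]

for row in white_regressors(educ, work_ex):
    print(row)
```

The squared residuals from the original regression would then be regressed on these five columns.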
Model Summary (a)

Variables Entered                                        R Square   Adjusted R Square   Std. Error of the Estimate
SQ_WORK, SQ_EDUC, EDU_WORK, Work Experience, EDUCATION   .037       .035                .2102

a. Dependent Variable: SQ_RES
White's Test
Calculate n × R²: R² = 0.037, n = 2016. Thus, n × R² = 0.037 × 2016 = 74.6.
Compare this value with the critical value of χ²(k), where k is the number of regressors in the
auxiliary regression above (here k = 5: educ, work_ex, their squares, and their cross product).
(χ² is the symbol for the Chi-Square distribution.)
χ²(5) = 11.07, obtained from the χ² table (for 95% confidence). As n × R² > χ²(5), the hypothesis
of constant variance is rejected, so the test indicates heteroskedasticity.
Note: Please refer to your textbook for further information regarding the interpretation of
White's test. If you have not encountered the Chi-Square distribution/test before, there is no need
to panic! The same rules apply for testing using any distribution - the T, F, Z, or Chi-Square. First,
calculate the required value from your results. Here the required value is the sample size ("n")
multiplied by the R-square. You must determine whether this value is higher than that in the
standard table for the relevant distribution (here the Chi-Square) at the recommended level of
confidence (usually 95%) for the appropriate degrees of freedom (for White's test, this equals
the number of regressors in the auxiliary regression, excluding the constant) in the table for the
distribution (which you will find in the back of most econometrics/statistics textbooks). If the
former is higher, then the hypothesis is rejected. For White's test the hypothesis is that the
variances are constant, so rejection implies that the test found a problem; if the hypothesis is not
rejected, the test could not find a problem [18].
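The arithmetic of the test can be written out in a few lines. This is a sketch only: the function names are our own, and the critical value is typed in by hand from a Chi-Square table. We assume the standard form of White's test, with 5 degrees of freedom (one per regressor in the auxiliary regression) and the 95% table value of 11.07.

```python
def white_statistic(n_obs, r_squared):
    """Test statistic: sample size times the R-square of the auxiliary regression."""
    return n_obs * r_squared

def hypothesis_rejected(statistic, critical_value):
    """Reject constant variance when the statistic exceeds the table value."""
    return statistic > critical_value

n_obs = 2016          # sample size reported in the text
r_squared = 0.037     # R-square from the Model Summary table
critical_value = 11.07  # assumption: Chi-Square table, 5 degrees of freedom, 95%

statistic = white_statistic(n_obs, r_squared)
print("n * R-square =", round(statistic, 1))
print("constant variance rejected:", hypothesis_rejected(statistic, critical_value))
```

Passing the critical value as an argument keeps the decision rule the same for any table value you look up.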
[17] Other tests: Park, Glejser, Goldfeld-Quandt. Refer to your textbook for a comprehensive listing of methods and their
detailed descriptions.
[18] We use the phraseology "Confidence Level of 95%." Many professors may frown upon this, instead preferring to use
"Significance Level of 5%." Also, our explanation is simplistic. Do not use it in an exam! Instead, refer to the chapter
on "Hypothesis Testing" or "Confidence Intervals" in your textbook. A clear understanding of these concepts is essential.
ANSWERS TO CONCEPTUAL QUESTIONS ON REGRESSION ANALYSIS
1. Why is the regression method you use called "Least Squares"? Can you justify the use of such a
method?
Ans: The method minimises the sum of the squares of the residuals. The formulas for obtaining the
estimates of the beta coefficients, std errors, etc. are all based on this principle.
Yes, we can justify the use of such a method: the aim is to minimise the error in our prediction of
the dependent variable, and by minimising the residuals we are doing just that. By using the
"squares" we are precluding the problem of signs, thereby giving positive and negative prediction
errors the same importance.
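The point about signs can be checked numerically. This is a sketch with made-up data and helper names of our own (`fit_line`, `sum_sq_resid`); it uses the textbook formulas for a regression with one independent variable.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return mean_y - slope * mean_x, slope

def sum_sq_resid(x, y, intercept, slope):
    """Sum of squared residuals for the line intercept + slope * x."""
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

# Hypothetical data.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(x, y)
best = sum_sq_resid(x, y, intercept, slope)

# A flat line through the mean of y: its raw residuals cancel out to zero,
# yet it predicts badly. Squaring removes the cancelling.
mean_y = sum(y) / len(y)
raw_sum_flat = sum(b - mean_y for b in y)
squared_sum_flat = sum_sq_resid(x, y, mean_y, 0.0)

print("least-squares line:", round(intercept, 3), round(slope, 3))
print("sum of squared residuals, best line:", round(best, 3))
print("flat line: raw sum", round(raw_sum_flat, 6), "squared sum", round(squared_sum_flat, 3))
```

Any other line, including the fitted line shifted up or down, gives a larger sum of squared residuals than the least-squares line.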
2. The classical assumptions mostly hinge on the properties of the residuals. Why should it be so?
Ans: this is linked to question 1. The estimation method is based on minimising the sum of the
squared residuals. As such, all the powerful inferences we draw from the results [like R², betas, T,
F, etc.] are based on assumed properties of the residuals. Any deviations from these assumptions
can cause major problems.
3. Prior to running a regression, you have to create a model. What are the important steps and
considerations in creating such a model?
Ans: the most important consideration is the theory you want to test/support/refute using the
estimation. The theory may be based on theoretical/analytical research and derivations, previous
work by others, intuition, etc.
The limitations of data may constrain the model.
Scatter plots may provide an indication of the transformations needed.
Correlations may tell you about the possibility of collinearity and the modeling ramifications
thereof.
4. What role may correlations, scatter plots and other bivariate and multivariate analysis play in the
specification of a regression model?
Ans: prior to any regression analysis, it is essential to run some descriptives and basic bivariate
and multivariate analysis. Based on the inferences from these, you may want to construct a model
which can answer the questions raised by the initial analysis and/or can incorporate the insights
from the initial analysis.
5. After running a regression, in what order should you interpret the results and why?
Ans: first, check for the breakdown of classical assumptions [collinearity, heteroskedasticity,
etc.]. Then, once you are sure that no major problem is present, interpret the results in roughly the
following order: Sig F, Adj R², Std error of estimate, Sig-T, beta, Confidence interval of beta.
6. In the regression results, are we confident about the coefficient estimates? If not, what additional
information [statistic] are we using to capture our degree of un-confidence about the estimate we
obtain?
Ans: no, we are not confident about the estimated coefficient. The std error is being used to
capture this. The T scales the level of confidence for all variables on to a uniform scale.
7. Each estimated coefficient has a distribution of its own. What kind of distribution does each beta
have? In this distribution, where do the estimate and the std error fit in?
Ans: Each beta is distributed as a T-distribution with mean equal to the estimated beta coefficient
and std error equal to the estimated std error of the estimated beta.
8. How is a T-statistic related to what you learned about Z-scores? Why cannot the Z-score be used
instead of the T-statistic?
Ans: the T is used because we have an estimated std error and not the actual std error. If we
had the actual std error for a beta, then we could use the Z-score.
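The way the T puts every coefficient on a uniform scale (answers 6 to 8) can be shown with a short calculation. This is a sketch under our own assumptions: the function `t_statistic` and the two coefficients are hypothetical, and the hypothesized value of zero is the usual default.

```python
def t_statistic(beta_hat, std_error, hypothesized=0.0):
    """Distance of the estimate from the hypothesized value, in estimated std errors."""
    return (beta_hat - hypothesized) / std_error

# Two hypothetical coefficients measured in very different units.
t_small_units = t_statistic(0.5, 0.25)
t_large_units = t_statistic(500.0, 250.0)

print(t_small_units, t_large_units)
```

Both coefficients are two estimated std errors away from zero, so they receive the same T even though their raw sizes differ by a factor of 1000.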
9. In the regression results, what possible problems does a pattern in a plot of the predicted values
and the residuals tell you about?
Ans: omitted variable, incorrect functional form and/or heteroskedasticity.
10. In the regression results, what possible problem does a pattern in a plot of an independent
variable and the residuals tell you about?
Ans: heteroskedasticity and the possible transformation needed to correct for it.
11. Intuitively, why should a scatter plot of the residuals and a variable X indicate the presence of
heteroskedasticity if there is a pattern in the scatter plot?
Ans: The residuals are assumed to be randomly distributed with a zero mean and constant
variance. If you have heteroskedasticity, but are still assuming a zero mean, the implication is that
"as the values of X increase, the residual changes in a certain manner". This implies that the
variance of the residuals is changing.
12. In plain English, what are the numerator and denominator of the F-statistic in terms of the sum of
squares?
Ans: the numerator measures the increase in the explanatory power of the model as a result of
adding more variables. The denominator measures the percentage of deviation in Y that cannot be
explained by the model. So the F gives you "the impact of additional variables on explaining the
unexplained deviations in Y".
13. In an F-test, the degrees of freedom of the numerator equals the number of restrictions. Intuitively,
why should it be so?
Ans: The number of restrictions equals the difference in number of explanatory variables across
the two models being compared. The degrees of freedom reflect the fact that the difference in the
explanatory power is what is being captured by the numerator.
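Answers 12 and 13 can be tied together in one formula. This is a sketch that assumes the usual textbook form of the F-statistic for comparing a restricted and an unrestricted model; the function name, argument names and the numbers in the example are our own.

```python
def f_statistic(rss_restricted, rss_unrestricted, num_restrictions,
                n_obs, n_params_unrestricted):
    """F = [(RSS_r - RSS_u) / q] / [RSS_u / (n - k)].

    RSS is the residual (unexplained) sum of squares, q the number of
    restrictions, n the sample size and k the number of estimated
    parameters in the unrestricted model.
    """
    numerator = (rss_restricted - rss_unrestricted) / num_restrictions
    denominator = rss_unrestricted / (n_obs - n_params_unrestricted)
    return numerator / denominator

# Hypothetical example: adding 2 variables lowers the unexplained sum of
# squares from 120 to 100, with 54 observations and 4 estimated parameters.
f_value = f_statistic(120.0, 100.0, 2, 54, 4)
print(f_value)
```

The numerator is divided by the number of restrictions because it averages the gain in explanatory power over the variables that were added.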
14. The classical linear breakdown problems are:
heteroskedasticity,
collinearity,
autocorrelation,
measurement errors,
misspecification of functional form,
omitted variable and irrelevant variable.
What problems does each breakdown cause?
Ans: The problems are biases in the estimates of:
the beta estimates: omitted variable, incorrect functional form, measurement
error in an independent variable
the std errors of the betas: all
the T-statistics: all
the std error of the estimate: all except heteroskedasticity
the Sig-F-statistic: all except heteroskedasticity, collinearity, irrelevant variable
FLOW DIAGRAM FOR REGRESSION ANALYSIS
[Figure: Regression schematic flow chart, 1999 Vijay Gupta. The steps recoverable from the chart are listed below.]
Problem/issue: clearly define the issue and the results being sought.
Method: OLS, MLE, Logit, Time Series, etc.
Model: [Y = a + b X] or other.
Final dataset: read data into software program; prepare data for analysis.
Descriptives, correlations, scatter charts (histogram to see distribution, normality tests, correlations).
Interpretation: obtain some intuitive understanding of the data series, identify outliers, etc.
Run regression.
Test for breakdown of classical assumptions (for linear regression): heteroskedasticity, autocorrelation, misspecification (omitted variable, irrelevant variable, incorrect functional form), collinearity, simultaneity bias, measurement error.
No breakdown: diagnostics, hypothesis tests.
Breakdown: create a new model (see list below); may need to create new variables.
    Heteroskedasticity: WLS
    Autocorrelation: GLS, ARIMA
    Simultaneity bias: IV, 2SLS, 3SLS
    Collinearity: IV, dropping a variable
Running correct model and method.