OLS Data Analysis in R
Dino Christenson & Scott Powell
Ohio State University
November 20, 2007
Introduction to R Outline
I. Data Description
II. Data Analysis
i. Command functions
ii. Hand-rolling
III. OLS Diagnostics & Graphing
IV. Functions and loops
V. Moving forward
11/20/2007
Christenson & Powell: Intro to R
Data Analysis: Descriptive Stats
Other Useful Commands
sum
mean
var
sd
range
min
max
median
di
cor
summary
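As a quick illustration of the commands listed above, here is a minimal sketch. The `income` and `educ` vectors are made-up values for illustration only; they are not the data used in these slides.

```r
# Hypothetical data: made-up values, not the data set used in the slides
income <- c(31, 42, 38, 55, 47, 60, 36)
educ   <- c(10, 12, 11, 16, 14, 17, 11)

total  <- sum(income)       # adds up all elements
avg    <- mean(income)      # arithmetic mean
spread <- sd(income)        # standard deviation, i.e. sqrt(var(income))
lo_hi  <- range(income)     # c(min(income), max(income))
mid    <- median(income)    # middle value of the sorted vector
r      <- cor(income, educ) # Pearson correlation between two vectors

summary(income)             # min, quartiles, mean and max in one call
```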
First, let's take a look at our code for the hand-rolled OLS estimator
The Holy Grail: $(X'X)^{-1}X'Y$
We need a single matrix of independent variables
The cbind() command takes the individual variable vectors and combines them into one x-variable matrix
A 1 is included as the first element to account for the constant.
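The original code for this step was shown as an image and did not survive extraction. The following is a minimal sketch of the idea, assuming the built-in `mtcars` data as a stand-in for the slides' data set; the variable choices (`mpg`, `wt`, `hp`) are illustrative only.

```r
# mtcars ships with R; it stands in for the slides' data (an assumption)
y <- mtcars$mpg
x <- cbind(1, mtcars$wt, mtcars$hp)   # the leading 1 is the constant
colnames(x) <- c("const", "wt", "hp")

# Hand-rolled OLS: beta-hat = (X'X)^-1 X'Y
beta <- solve(t(x) %*% x) %*% t(x) %*% y
beta
```

Here `t()` transposes, `%*%` is matrix multiplication, and `solve()` inverts the matrix.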
To find the standard errors, we need to compute both the variance of the residuals and the cov matrix of the x's.
The sqrt of the diagonal elements of this var-cov matrix will give us the standard errors.
Other test statistics can be easily computed.
View the standard errors.
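A minimal sketch of these steps, again assuming `mtcars` as stand-in data (the slides' own code was an image and is not recoverable). All object names here are illustrative.

```r
# Stand-in data (assumption): mtcars instead of the slides' data set
y <- mtcars$mpg
x <- cbind(1, mtcars$wt, mtcars$hp)

xtx_inv <- solve(t(x) %*% x)
beta    <- xtx_inv %*% t(x) %*% y

e  <- y - x %*% beta               # residuals
n  <- nrow(x)
k  <- ncol(x)                      # number of coefficients, constant included
s2 <- sum(e^2) / (n - k)           # variance of the residuals

vcv <- s2 * xtx_inv                # var-cov matrix of the coefficients
se  <- sqrt(diag(vcv))             # standard errors

# Other test statistics follow directly
t_stat <- as.vector(beta) / se
p_val  <- 2 * pt(abs(t_stat), df = n - k, lower.tail = FALSE)

cbind(estimate = as.vector(beta), se, t_stat, p_val)
```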
Data Analysis: Regression
Other Useful Commands
lm - Linear Model
lme - Mixed Effects
glm - Generalized lm
multinom - Multinomial Logit
anova
optim - General Optimizer
OLS Diagnostics in R
Post-estimation diagnostics are key to data analysis
We want to make sure we estimated the proper model
Besides, Irfan will hurt you if you neglect to do this
Furthermore, diagnostics allow us the opportunity to show off some of R's graphs
R's real strength is that it has virtually unlimited graphing capabilities
Of course, such strength on R's part depends on your knowledge of both R and statistics
Still, with just some basics we can do some cool graphs
OLS Diagnostics in R
What could be unjustifiably driving our data?
Outlier: unusual observation
Leverage: ability to change the slope of the regression line
Influence: the combined impact of strong leverage and outlier status
According to John Fox, influence = leverage * outliers
Our measure of leverage is the $h_i$ or hat value
It's just the predicted values written in terms of $h_i$
Where $H_{ij}$ is the contribution of observation $Y_i$ to the fitted value $Y_j$
If $h_{ij}$ is large, then the ith observation has a significant impact on the jth fitted value
So, skipping the formulas, we know that the larger the hat value the greater the leverage of that observation
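For readers who do want the formula, the hat values are the diagonal of the hat matrix $H = X(X'X)^{-1}X'$, which maps observed Y to fitted Y. A minimal sketch, assuming `mtcars` as stand-in data and illustrative object names:

```r
# Stand-in data (assumption): mtcars instead of the slides' data set
x <- cbind(1, mtcars$wt, mtcars$hp)
y <- mtcars$mpg

H <- x %*% solve(t(x) %*% x) %*% t(x)   # hat matrix: fitted values = H %*% y
h <- diag(H)                            # hat values, one per observation

# Compare with R's built-in hatvalues()
fit <- lm(mpg ~ wt + hp, data = mtcars)
all.equal(h, unname(hatvalues(fit)))

sum(h)   # the hat values sum to the number of coefficients, ncol(x)
```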
Calculate the average hat value
avg.mod1 <- ncol(x)/nrow(x)
But a picture is worth a hundred numbers?
Graph the hat values with lines for the average, twice the avg (large samples) and three times the avg (small samples) hat values
plot(hatvalues(ols.model1))
abline(h=1*(ncol(x))/nrow(x))
abline(h=2*(ncol(x))/nrow(x))
abline(h=3*(ncol(x))/nrow(x))
identify(hatvalues(ols.model1))
identify lets us select the data points in the new graph
State #2 is over twice the avg
Nothing above three times
[Figure: index plot of hatvalues(ols.model1) against Index, with reference lines and labeled points]
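Since `identify()` needs an interactive graphics window, a non-interactive alternative is to flag high-leverage points by rule. This is a sketch under assumptions: `mtcars` stands in for the slides' data, and `fit` plays the role of `ols.model1`.

```r
pdf(NULL)  # null device so the sketch also runs non-interactively; omit in a live session

# Stand-in model (assumption): mtcars instead of the slides' data set
fit <- lm(mpg ~ wt + hp, data = mtcars)
h   <- hatvalues(fit)
avg <- length(coef(fit)) / nrow(mtcars)   # average hat value, as ncol(x)/nrow(x)

plot(h, ylab = "hat value")
abline(h = c(1, 2, 3) * avg, lty = 1:3)   # average, twice and three times the average

# Observations above twice the average, by name
high <- names(h)[h > 2 * avg]
high

invisible(dev.off())
```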
Again, let's plot them with lines for 2 & -2
States 2 and 3 appear to be outliers, or darn close
We should definitely take a look at what makes these states unusual
Perhaps there is a mistake in data entry
Perhaps the model is misspecified in terms of functional form (forthcoming) or omitted vars
Maybe you can throw out your bad observation
If you must include the bad observation, try robust regression
[Figure: index plot of rstudent(ols.model1) against Index, with labeled points]
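The plotting code for this slide did not survive extraction. A plausible sketch, by analogy with the hat-value plot, is given here. It assumes `mtcars` as stand-in data, and `fit` plays the role of `ols.model1`.

```r
pdf(NULL)  # null device so the sketch also runs non-interactively; omit in a live session

# Stand-in model (assumption): mtcars instead of the slides' data set
fit <- lm(mpg ~ wt + hp, data = mtcars)
rs  <- rstudent(fit)                 # studentized residuals

plot(rs, ylab = "rstudent(fit)")
abline(h = c(-2, 2), lty = 2)        # reference lines at 2 and -2

# Observations outside the band, by name
outliers <- names(rs)[abs(rs) > 2]
outliers

invisible(dev.off())
```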
Cook's D gives a kind of summary for each observation's influence
$D_i = \frac{E_i'^2}{k+1} \times \frac{h_i}{1-h_i}$
If Cook's D is greater than $4/(n-k-1)$, then the observation is said to exert undue influence
Let's just plot it
plot(cookd(ols.model1))
abline(h=4/(nrow(x)-ncol(x)))
identify(cookd(ols.model1))
States 2 and (maybe) 3 are in the trouble zone
[Figure: index plot of cookd(ols.model1) against Index, with labeled points]
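The formula can be checked by hand against base R. Note that this sketch uses `cooks.distance()` from base R's stats package rather than `cookd()`, which came from an add-on package at the time of these slides. `mtcars` is stand-in data and the object names are illustrative.

```r
# Stand-in model (assumption): mtcars instead of the slides' data set
fit <- lm(mpg ~ wt + hp, data = mtcars)

h     <- hatvalues(fit)
e_std <- rstandard(fit)          # standardized residuals
p     <- length(coef(fit))       # k + 1: the regressors plus the constant

# Cook's D: squared standardized residual scaled by leverage
D <- (e_std^2 / p) * (h / (1 - h))
all.equal(D, cooks.distance(fit))

# Rule-of-thumb cutoff 4/(n - k - 1)
cutoff  <- 4 / (nrow(mtcars) - p)
flagged <- names(D)[D > cutoff]
flagged
```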
For a host of measures of influence, including dfbetas and dffits
influence.measures(ols.model1)
dfbeta gives the influence of an observation on the coefficients, or the change in an iv's coefficient caused by deleting a single observation
Simple commands for partial regression plots can be found on Fox's website
Is our data distributed normally?
Was it correct to use a linear model?
Use a quantile plot (qq plot) to check
qq.plot(ols.model1, distribution="norm")
Plots the empirical quantiles of the studentized residuals against the quantiles of a normal distribution
Looking for obs on a straight line
In R it is simple to plot the error bands as well
Deviation requires us to transform our variables
The problems are again 2 and 13, with 3, 22 and 14 bordering on trouble this time around
[Figure: QQ plot of Studentized Residuals(ols.model1) against norm Quantiles, with labeled points]
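A similar plot can be drawn with base R alone. This sketch swaps in `qqnorm()` and `qqline()` for the `qq.plot()` call above, so it draws no error bands. `mtcars` is stand-in data, and `fit` plays the role of `ols.model1`.

```r
pdf(NULL)  # null device so the sketch also runs non-interactively; omit in a live session

# Stand-in model (assumption): mtcars instead of the slides' data set
fit <- lm(mpg ~ wt + hp, data = mtcars)
rs  <- rstudent(fit)

# Studentized residuals against normal quantiles
qq <- qqnorm(rs, ylab = "Studentized residuals")
qqline(rs)   # reference line through the first and third quartiles

invisible(dev.off())
```

Points that stray far from the reference line are the ones to inspect.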
A simple density plot of the studentized residuals helps to determine the nature of our data
The apparent deviation from the normal curve is not severe, but there certainly seems to be a slight negative skew
[Figure: density.default(x = rstudent(ols.model1)), N = 22, Bandwidth = 0.4217]
We can also easily look for heteroskedasticity
Plotting the residuals against the fitted values and the continuous independent variables lets us examine our statistical model for the presence of unbalanced error variance
par(mfrow=c(2,2))
plot(resid(ols.model1)~fitted.values(ols.model1))
plot(resid(ols.model1)~income)
plot(resid(ols.model1)~presvote)
plot(resid(ols.model1)~pressup)
[Figure: four panels of resid(ols.model1) against fitted.values(ols.model1), income, presvote and pressup]
library(lmtest)
bptest(ols.model1) will give you the Breusch-Pagan test stat
gqtest(ols.model1) will give you the Goldfeld-Quandt test stat
hmctest(ols.model1) will give you the Harrison-McCabe test stat
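In keeping with the hand-rolled theme, the studentized (Koenker) form of the Breusch-Pagan statistic can be sketched with base R alone: regress the squared residuals on the regressors and take n times the R-squared. This is a sketch under assumptions, with `mtcars` as stand-in data. It should correspond to the default studentized form reported by `bptest()`, but that correspondence is not verified here.

```r
# Stand-in model (assumption): mtcars instead of the slides' data set
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Auxiliary regression of squared residuals on the same regressors
e2  <- resid(fit)^2
aux <- lm(e2 ~ wt + hp, data = mtcars)

bp_stat <- nrow(mtcars) * summary(aux)$r.squared   # n * R^2
bp_df   <- length(coef(aux)) - 1                   # regressors, constant excluded
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(statistic = bp_stat, df = bp_df, p.value = bp_p)
```

A small p-value points toward non-constant error variance.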
Let's look at the condition index from the perturb library
library(perturb)
colldiag(ols.model1)
Issue here is the largest condition index
If it is larger than 30, Houston, we have...
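For intuition, condition indexes in the Belsley style can be sketched with base R: scale each column of the design matrix (constant included) to unit length, then divide the largest singular value by each singular value. This is an illustrative sketch on stand-in `mtcars` data; it is assumed, not verified here, to match the scaling that `colldiag()` applies by default.

```r
# Stand-in design matrix (assumption): mtcars instead of the slides' data set
x <- cbind(const = 1, wt = mtcars$wt, hp = mtcars$hp)

# Scale every column to unit length (no centering)
x_scaled <- sweep(x, 2, sqrt(colSums(x^2)), "/")

# Condition indexes: largest singular value over each singular value
d          <- svd(x_scaled)$d
cond_index <- max(d) / d
cond_index

max(cond_index) > 30   # TRUE would signal worrying collinearity
```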
My favorite shortcut command to get you four essential diagnostic plots after you run your model
plot(ols.model1, which=1:4)
Now you have no excuse not to run some diagnostics!
Btw, look at the high residuals in the rvf plot for 14, 13 and 3, suggesting outliers
[Figure: four diagnostic panels from plot(ols.model1, which=1:4): Residuals vs Fitted, Normal Q-Q, Scale-Location, Cook's distance]
Loops
for loops are the most common and the only type of loop we will look at today.
The first loop command at the right shows simple loop iteration.
Loops
However, we can also see how loops can be a little more useful.
The second example at right (although inefficient) calculates the mean of income
Note how the index accesses elements of the income vector.
Loops and Monte Carlo
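The loop code shown "at right" on the original slides was an image and is not recoverable. The two examples described might look roughly like the following sketch, where the `income` values are made up for illustration.

```r
# Example 1: simple loop iteration
for (i in 1:5) {
  print(i)
}

# Example 2: the mean of income, computed the slow way
income <- c(31, 42, 38, 55, 47, 60, 36)   # hypothetical values, not the slides' data

total <- 0
for (i in seq_along(income)) {
  total <- total + income[i]   # the index i picks out one element at a time
}
loop_mean <- total / length(income)

loop_mean
mean(income)   # the built-in gives the same answer without a loop
```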
Functions
Now we will make our own linear regression function using our hand-rolled OLS code
Functions require inputs (which are the objects to be utilized) and arguments (which are the commands that the function performs)
The actual estimation procedure does not change. However, some changes are made.
Functions
First, we have to tell R that we are creating a function. We'll name it ols.
This lets us generalize the procedure to multiple objects.
Second, we have to tell the function what we want returned or what we want the output to look like.
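The slides' own function was shown as an image and is not recoverable. The following is one possible sketch of such a function, not the authors' code. The argument names, the returned list and the use of `mtcars` as test data are all assumptions for illustration.

```r
# Hypothetical hand-rolled OLS function (illustrative, not the authors' code)
ols <- function(y, x) {
  x       <- cbind(1, x)                     # add the constant
  xtx_inv <- solve(t(x) %*% x)
  beta    <- as.vector(xtx_inv %*% t(x) %*% y)
  e       <- y - x %*% beta                  # residuals
  s2      <- sum(e^2) / (nrow(x) - ncol(x))  # variance of the residuals
  se      <- sqrt(diag(s2 * xtx_inv))        # standard errors
  # Tell the function what to return
  list(coefficients = beta, se = se, t = beta / se)
}

# Stand-in data (assumption): mtcars instead of the slides' data set
out <- ols(mtcars$mpg, cbind(mtcars$wt, mtcars$hp))
out

# Check against the standard lm function
ref <- summary(lm(mpg ~ wt + hp, data = mtcars))$coefficients
ref
```

Returning a list keeps the coefficients, standard errors and t statistics together, so the output can be compared directly with the `lm` summary table.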
Functions
OLS: Hand-rolled vs Function
Functions
Implementing our new function ols, we get precisely the output that we asked for.
We can check this against the results produced by the standard lm function.
Favorite Resources
Invaluable Resources online
The R manuals http://cran.r-project.org/manuals.html
Fox's slides http://socserv.mcmaster.ca/jfox/Courses/R-course/index.html
Faraway's book http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf
Andersen's ICPSR lectures using R http://socserv.mcmaster.ca/andersen/icpsr.html
Arai's guide http://people.su.se/~ma/R_intro/
UCLA notes http://www.ats.ucla.edu/stat/SPLUS/default.htm
Keele's intro guide http://www.polisci.ohio-state.edu/faculty/lkeele/RIntro.pdf
Great R books
Verzani's book http://www.amazon.com/Using-Introductory-Statistics-John-Verzani/dp/1584884509
Maindonald and Braun's book http://www.amazon.com/Data-Analysis-Graphics-Using-R/dp/0521813360