
Published on STAT897D (https://onlinecourses.science.psu.edu/stat857)

5.1 Ridge Regression
Motivation: too many predictors
It is not unusual to see the number of input variables greatly exceed the number of observations, e.g. in microarray data analysis or environmental pollution studies.
With many predictors, fitting the full model without penalization will result in large prediction intervals, and the LS regression estimator may not be unique.
Motivation: ill-conditioned X
Because the LS estimates depend upon \((X'X)^{-1}\), we would have problems in computing \(\hat{\beta}_{LS}\) if \(X'X\) were singular or nearly singular.
In those cases, small changes to the elements of \(X\) lead to large changes in \((X'X)^{-1}\).
The least squares estimator \(\hat{\beta}_{LS}\) may provide a good fit to the training data, but it may not fit the test data well.

Ridge Regression
One way out of this situation is to abandon the requirement of an unbiased estimator.
We assume only that the X's and Y have been centered so that we have no need for a constant term in the regression:
X is an n by p matrix with centered columns,
Y is a centered n-vector.
Hoerl and Kennard (1970) proposed that potential instability in the LS estimator
\[\hat{\beta} = (X'X)^{-1}X'Y,\]
could be improved by adding a small constant value \(\lambda\) to the diagonal entries of the matrix \(X'X\) before taking its inverse.
The result is the ridge regression estimator
\[\hat{\beta}^{ridge} = (X'X + \lambda I_p)^{-1}X'Y.\]

Ridge regression places a particular form of constraint on the parameters (the \(\beta\)'s): \(\hat{\beta}^{ridge}\) is chosen to minimize the penalized sum of squares:
\[\sum_{i=1}^{n}\left(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2 + \lambda \sum_{j=1}^{p}\beta_j^2,\]
which is equivalent to minimization of \(\sum_{i=1}^{n}\left(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2\) subject to \(\sum_{j=1}^{p}\beta_j^2 < c\) for some \(c > 0\), i.e. constraining the sum of the squared coefficients.

Therefore, ridge regression puts further constraints on the parameters, the \(\beta_j\)'s, in the linear model. In this case, instead of just minimizing the residual sum of squares we also have a penalty term on the \(\beta\)'s. This penalty term is \(\lambda\) (a pre-chosen constant) times the squared norm of the \(\beta\) vector. This means that if the \(\beta_j\)'s take on large values, the optimization function is penalized. We would prefer smaller \(\beta_j\)'s, or \(\beta_j\)'s that are close to zero, to keep the penalty term small.
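To make the estimator concrete, here is a minimal NumPy sketch (not part of the original lesson): ridge_estimate is a hypothetical helper that evaluates \((X'X + \lambda I_p)^{-1}X'Y\) on simulated centered data, with lam standing in for \(\lambda\).

```python
import numpy as np

def ridge_estimate(X, y, lam):
    """Ridge estimator (X'X + lam*I_p)^{-1} X'y for centered X and y."""
    p = X.shape[1]
    # Solve the penalized normal equations rather than forming an explicit inverse.
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Illustrative simulated data (assumption: 50 observations, 5 predictors)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
X -= X.mean(axis=0)                       # center the columns
beta_true = np.array([2.0, -1.0, 0.0, 0.5, 1.5])
y = X @ beta_true + rng.normal(size=50)
y -= y.mean()                             # center the response

beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
beta_ridge = ridge_estimate(X, y, lam=10.0)
print(beta_ls)       # ordinary least squares fit
print(beta_ridge)    # coefficients shrunk towards zero
```

Using np.linalg.solve here is only a numerical convenience; it computes the same quantity as the closed-form expression with the explicit inverse.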

Geometric Interpretation of Ridge Regression:

The ellipses correspond to the contours of the residual sum of squares (RSS): the inner ellipse has smaller RSS, and RSS is minimized at the ordinary least squares (OLS) estimates.
For \(p = 2\), the constraint in ridge regression corresponds to a circle, \(\sum_{j=1}^{2}\beta_j^2 < c\).
We are trying to minimize the ellipse size and the circle simultaneously in ridge regression. The ridge estimate is given by the point at which the ellipse and the circle touch.
There is a trade-off between the penalty term and the RSS. A large \(\beta\) might give you a smaller residual sum of squares but it will push the penalty term higher. This is why you might actually prefer smaller \(\beta\)'s with a worse residual sum of squares. From an optimization perspective, the penalty term is equivalent to a constraint on the \(\beta\)'s. The function is still the residual sum of squares but now you constrain the norm of the \(\beta_j\)'s to be smaller than some constant c. There is a correspondence between \(\lambda\) and c. The larger \(\lambda\) is, the more you prefer the \(\beta_j\)'s close to zero. In the extreme case when \(\lambda = 0\), you would simply be doing normal linear regression. At the other extreme, as \(\lambda\) approaches infinity, you set all the \(\beta\)'s to zero.

Properties of Ridge Estimator:
\(\hat{\beta}^{ls}\) is an unbiased estimator of \(\beta\); \(\hat{\beta}^{ridge}\) is a biased estimator of \(\beta\).
For orthogonal covariates, \(X'X = nI_p\), \(\hat{\beta}^{ridge} = \frac{n}{n+\lambda}\hat{\beta}^{ls}\). Hence, in this case, the ridge estimator always produces shrinkage towards 0. \(\lambda\) controls the amount of shrinkage.
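This identity is easy to verify numerically. The sketch below is an illustrative check (the QR-based design and all numerical settings are assumptions, not part of the lesson): it manufactures a design with \(X'X = nI_p\) and confirms that the ridge estimate equals \(\frac{n}{n+\lambda}\) times the least squares estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 100, 3, 25.0

# Build a design with exactly orthogonal columns so that X'X = n * I_p
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))   # Q has orthonormal columns
X = np.sqrt(n) * Q                             # now X'X = n * I_p
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)
y -= y.mean()

beta_ls = np.linalg.solve(X.T @ X, X.T @ y)
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Ridge equals the least squares estimate scaled by n / (n + lambda)
print(np.allclose(beta_ridge, (n / (n + lam)) * beta_ls))   # True
```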
An important concept in shrinkage is the "effective" degrees of freedom associated with a set of parameters. In a ridge regression setting:
1. If we choose \(\lambda = 0\), we have p parameters (since there is no penalization).
2. If \(\lambda\) is large, the parameters are heavily constrained and the degrees of freedom will effectively be lower, tending to 0 as \(\lambda \to \infty\).
The effective degrees of freedom associated with \(\beta_1, \beta_2, \ldots, \beta_p\) is defined as
\[df(\lambda) = tr\left(X(X'X + \lambda I_p)^{-1}X'\right) = \sum_{j=1}^{p}\frac{d_j^2}{d_j^2 + \lambda},\]
where the \(d_j\) are the singular values of \(X\). Notice that \(\lambda = 0\), which corresponds to no shrinkage, gives \(df(\lambda) = p\) (as long as \(X'X\) is non-singular), as we would expect.
There is a 1:1 mapping between \(\lambda\) and the degrees of freedom, so in practice one may simply pick the effective degrees of freedom that one would like associated with the fit, and solve for \(\lambda\).
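A small sketch of this idea (the helpers effective_df and lambda_for_df, the bisection tolerances, and the simulated data are all illustrative assumptions): compute \(df(\lambda)\) from the singular values of X, then invert the monotone map numerically to recover the \(\lambda\) that gives a desired effective degrees of freedom.

```python
import numpy as np

def effective_df(X, lam):
    """df(lambda) = sum_j d_j^2 / (d_j^2 + lambda), with d_j the singular values of X."""
    d = np.linalg.svd(X, compute_uv=False)
    return np.sum(d**2 / (d**2 + lam))

def lambda_for_df(X, target_df, hi=1e8, tol=1e-10):
    """Bisection: df(lambda) is monotone decreasing in lambda, so the root is unique."""
    lo, hi_ = 0.0, hi
    while hi_ - lo > tol:
        mid = (lo + hi_) / 2
        if effective_df(X, mid) > target_df:
            lo = mid            # still too many effective parameters; increase lambda
        else:
            hi_ = mid
    return (lo + hi_) / 2

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 5))
X -= X.mean(axis=0)

print(effective_df(X, 0.0))           # equals p = 5 when lambda = 0
lam = lambda_for_df(X, target_df=3.0)
print(lam, effective_df(X, lam))      # df(lam) is approximately 3.0 by construction
```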
As an alternative to a user-chosen \(\lambda\), cross-validation is often used in choosing \(\lambda\): we select the \(\lambda\) that yields the smallest cross-validation prediction error.
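For example, a bare-bones K-fold cross-validation over a grid of \(\lambda\) values might look like the sketch below; the fold construction, the grid, and the simulated data are illustrative choices, not the course's prescribed procedure.

```python
import numpy as np

def ridge_estimate(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_error(X, y, lam, k=5, seed=0):
    """Average squared prediction error over k folds for a given lambda."""
    n = X.shape[0]
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    errs = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        beta = ridge_estimate(X[train], y[train], lam)
        errs.append(np.mean((y[fold] - X[fold] @ beta) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 8)); X -= X.mean(axis=0)
y = X @ rng.normal(size=8) + rng.normal(size=60); y -= y.mean()

# Pick the lambda on the grid with the smallest cross-validation prediction error
grid = np.logspace(-3, 3, 25)
best_lam = min(grid, key=lambda lam: cv_error(X, y, lam))
print(best_lam)
```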
The intercept \(\beta_0\) has been left out of the penalty term because \(Y\) has been centered. Penalization of the intercept would make the procedure depend on the origin chosen for \(Y\).

Since the ridge estimator is linear, it is straightforward to calculate the variance-covariance matrix
\[var(\hat{\beta}^{ridge}) = \sigma^2(X'X + \lambda I_p)^{-1}X'X(X'X + \lambda I_p)^{-1}.\]
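A short sketch of this calculation (ridge_covariance is a hypothetical helper, and \(\sigma^2\) is treated as known purely for illustration): it evaluates the sandwich formula above and checks that \(\lambda = 0\) recovers the usual least squares covariance \(\sigma^2(X'X)^{-1}\).

```python
import numpy as np

def ridge_covariance(X, lam, sigma2):
    """var(beta_ridge) = sigma^2 (X'X + lam I)^{-1} X'X (X'X + lam I)^{-1}."""
    p = X.shape[1]
    A_inv = np.linalg.inv(X.T @ X + lam * np.eye(p))
    return sigma2 * A_inv @ (X.T @ X) @ A_inv

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 4)); X -= X.mean(axis=0)

V_ridge = ridge_covariance(X, lam=5.0, sigma2=1.0)
V_ls = 1.0 * np.linalg.inv(X.T @ X)                        # lam = 0 gives the LS covariance
print(np.allclose(ridge_covariance(X, 0.0, 1.0), V_ls))    # True
print(np.trace(V_ridge) < np.trace(V_ls))                  # ridge variance is smaller
```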

A Bayesian Formulation
Consider the linear regression model with normal errors:
\[Y_i = \sum_{j=1}^{p} X_{ij}\beta_j + \epsilon_i,\]
where the \(\epsilon_i\) are i.i.d. normal errors with mean 0 and known variance \(\sigma^2\).

Since \(\lambda\) is applied to the squared norm of the \(\beta\) vector, people often standardize all of the covariates to make them have a similar scale. Assume \(\beta_j\) has the prior distribution \(\beta_j \sim_{iid} N(0, \sigma^2/\lambda)\). A large value of \(\lambda\) corresponds to a prior that is more tightly concentrated around zero, and hence leads to greater shrinkage towards zero.
The posterior is \(\beta|Y \sim N\left(\hat{\beta}, \sigma^2(X'X + \lambda I_p)^{-1}X'X(X'X + \lambda I_p)^{-1}\right)\), where \(\hat{\beta} = \hat{\beta}^{ridge} = (X'X + \lambda I_p)^{-1}X'Y\), confirming that the posterior mean (and mode) of the Bayesian linear model corresponds to the ridge regression estimator.
Whereas the least squares solutions \(\hat{\beta}^{ls} = (X'X)^{-1}X'Y\) are unbiased if the model is correctly specified, ridge solutions are biased, \(E(\hat{\beta}^{ridge}) \neq \beta\). However, at the cost of bias, ridge regression reduces the variance, and thus might reduce the mean squared error (MSE):
\[MSE = \text{Bias}^2 + \text{Variance}.\]
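The bias-variance trade-off can be seen in a small Monte Carlo experiment. In the sketch below (all simulation settings are arbitrary choices made for illustration), \(Y\) is regenerated repeatedly for a fixed design, and the squared bias, variance, and MSE of the least squares and ridge estimators are compared.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, sigma, lam = 30, 10, 3.0, 20.0
X = rng.normal(size=(n, p)); X -= X.mean(axis=0)
beta = rng.normal(size=p)

def bias_var_mse(estimates, truth):
    """Squared bias, variance, and MSE summed over the coefficients."""
    bias2 = np.sum((estimates.mean(axis=0) - truth) ** 2)
    var = np.sum(estimates.var(axis=0))
    return bias2, var, bias2 + var

ls_est, ridge_est = [], []
for _ in range(2000):
    y = X @ beta + sigma * rng.normal(size=n)
    ls_est.append(np.linalg.solve(X.T @ X, X.T @ y))
    ridge_est.append(np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y))

print(bias_var_mse(np.array(ls_est), beta))     # bias near 0, large variance
print(bias_var_mse(np.array(ridge_est), beta))  # some bias, much smaller variance, often smaller MSE
```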

More Geometric Interpretations (optional)
Inputs are centered first.
Consider the fitted response
\[\hat{y}^{ridge} = X\hat{\beta}^{ridge} = X(X'X + \lambda I)^{-1}X'Y = UD(D^2 + \lambda I)^{-1}DU'Y = \sum_{j=1}^{p} \textbf{u}_j \frac{d_j^2}{d_j^2 + \lambda}\textbf{u}_j'Y,\]

where \(\textbf{u}_j\) are the normalized principal components of X.
Ridge regression shrinks the coordinates with respect to the orthonormal basis formed by the principal components.
Coordinates with respect to principal components with smaller variance are shrunk more.
Instead of using \(X = (X_1, X_2, \ldots, X_p)\) as the predicting variables, use the new input matrix \(\tilde{X} = UD\). Then for the new inputs:
\[\hat{\beta}^{ridge}_j = \frac{d_j^2}{d_j^2 + \lambda}\hat{\beta}^{ls}_j \quad \text{and} \quad Var(\hat{\beta}_j) = \frac{\sigma^2}{d_j^2},\]
where \(\sigma^2\) is the variance of the error term \(\epsilon\) in the linear model.


The shrinkage factor given by ridge regression is:
\[\frac{d_j^2}{d_j^2 + \lambda}.\]
We saw this in the previous formula. The larger \(\lambda\) is, the more the projection is shrunk in the direction of \(\textbf{u}_j\). Coordinates with respect to the principal components with a smaller variance are shrunk more.
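To make the shrinkage factors tangible, the following sketch (again NumPy, with illustrative simulated data) computes the ridge fitted values directly from the closed form and then reproduces them by shrinking each principal-component coordinate of \(Y\) by \(d_j^2/(d_j^2 + \lambda)\).

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(40, 5)); X -= X.mean(axis=0)
y = rng.normal(size=40); y -= y.mean()
lam, p = 10.0, X.shape[1]

# Ridge fitted values directly from the closed form X (X'X + lam I)^{-1} X'y
yhat_direct = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Same fit via the SVD: shrink each principal-component coordinate by d_j^2 / (d_j^2 + lam)
U, d, Vt = np.linalg.svd(X, full_matrices=False)
shrinkage = d**2 / (d**2 + lam)
yhat_svd = U @ (shrinkage * (U.T @ y))

print(np.allclose(yhat_direct, yhat_svd))   # True
print(shrinkage)                            # directions with smaller d_j are shrunk more
```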
Let's take a look at this geometrically.

This interpretation will become convenient when we compare it to principal components regression, where instead of doing shrinkage, we either shrink the direction closer to zero or we don't shrink at all. We will see this in the "Dimension Reduction Methods" lesson.
Source URL: https://onlinecourses.science.psu.edu/stat857/node/155
