
Published on STAT897D (https://onlinecourses.science.psu.edu/stat857)

5.1 Ridge Regression
Motivation: too many predictors
It is not unusual to see the number of input variables greatly exceed the number of observations, e.g. in microarray data analysis or environmental pollution studies.
With many predictors, fitting the full model without penalization will result in large prediction intervals, and the LS regression estimator may not be unique.
Motivation: ill-conditioned X
Because the LS estimates depend upon \((X'X)^{-1}\), we would have problems in computing \(\hat{\beta}_{LS}\) if \(X'X\) were singular or nearly singular.
In those cases, small changes to the elements of \(X\) lead to large changes in \((X'X)^{-1}\).
The least squares estimator \(\hat{\beta}_{LS}\) may provide a good fit to the training data, but it may not fit the test data well.

Ridge Regression
One way out of this situation is to abandon the requirement of an unbiased estimator.
We assume only that the X's and Y have been centered so that we have no need for a constant term in the regression:
X is an n by p matrix with centered columns,
Y is a centered n-vector.
Hoerl and Kennard (1970) proposed that potential instability in the LS estimator
\[\hat{\beta} = (X'X)^{-1}X'Y,\]
could be improved by adding a small constant value \(\lambda\) to the diagonal entries of the matrix \(X'X\) before taking its inverse.
The result is the ridge regression estimator
\[\hat{\beta}^{ridge} = (X'X + \lambda I_p)^{-1}X'Y.\]

Ridge regression places a particular form of constraint on the parameters (the \(\beta\)'s): \(\hat{\beta}^{ridge}\) is chosen to minimize the penalized sum of squares:
\[\sum_{i=1}^{n}\left(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2 + \lambda \sum_{j=1}^{p}\beta_j^2,\]
which is equivalent to minimization of \(\sum_{i=1}^{n}\left(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2\) subject to \(\sum_{j=1}^{p}\beta_j^2 < c\) for some \(c > 0\), i.e. constraining the sum of the squared coefficients.

Therefore, ridge regression puts further constraints on the parameters, the \(\beta_j\)'s, in the linear model. In this case, instead of just minimizing the residual sum of squares we also have a penalty term on the \(\beta\)'s. This penalty term is \(\lambda\) (a pre-chosen constant) times the squared norm of the \(\beta\) vector. This means that if the \(\beta_j\)'s take on large values, the optimization function is penalized. We would prefer smaller \(\beta_j\)'s, or \(\beta_j\)'s that are close to zero, to keep the penalty term small.
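To make the estimator concrete, here is a minimal NumPy sketch (not part of the original lesson): ridge_estimate is a hypothetical helper that evaluates \((X'X + \lambda I_p)^{-1}X'Y\) on simulated centered data, with lam standing in for \(\lambda\).

```python
import numpy as np

def ridge_estimate(X, y, lam):
    """Ridge estimator (X'X + lam*I_p)^{-1} X'y for centered X and y."""
    p = X.shape[1]
    # Solve the penalized normal equations rather than forming an explicit inverse.
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Illustrative simulated data (assumption: 50 observations, 5 predictors)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
X -= X.mean(axis=0)                       # center the columns
beta_true = np.array([2.0, -1.0, 0.0, 0.5, 1.5])
y = X @ beta_true + rng.normal(size=50)
y -= y.mean()                             # center the response

beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
beta_ridge = ridge_estimate(X, y, lam=10.0)
print(beta_ls)       # ordinary least squares fit
print(beta_ridge)    # coefficients shrunk towards zero
```

Using np.linalg.solve here is only a numerical convenience; it computes the same quantity as the closed-form expression with the explicit inverse.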

Geometric Interpretation of Ridge Regression:

The ellipses correspond to the contours of the residual sum of squares (RSS): the inner ellipse has smaller RSS, and RSS is minimized at the ordinary least squares (OLS) estimates.
For \(p = 2\), the constraint in ridge regression corresponds to a circle, \(\sum_{j=1}^{2}\beta_j^2 < c\).
We are trying to minimize the ellipse size and the circle simultaneously in ridge regression. The ridge estimate is given by the point at which the ellipse and the circle touch.
There is a trade-off between the penalty term and the RSS. A large \(\beta\) might give you a smaller residual sum of squares but it will push the penalty term higher. This is why you might actually prefer smaller \(\beta\)'s with a worse residual sum of squares. From an optimization perspective, the penalty term is equivalent to a constraint on the \(\beta\)'s. The function is still the residual sum of squares but now you constrain the norm of the \(\beta_j\)'s to be smaller than some constant c. There is a correspondence between \(\lambda\) and c. The larger \(\lambda\) is, the more you prefer the \(\beta_j\)'s close to zero. In the extreme case when \(\lambda = 0\), you would simply be doing normal linear regression. At the other extreme, as \(\lambda\) approaches infinity, you set all the \(\beta\)'s to zero.

Properties of Ridge Estimator:
\(\hat{\beta}^{ls}\) is an unbiased estimator of \(\beta\); \(\hat{\beta}^{ridge}\) is a biased estimator of \(\beta\).
For orthogonal covariates, \(X'X = nI_p\), \(\hat{\beta}^{ridge} = \frac{n}{n+\lambda}\hat{\beta}^{ls}\). Hence, in this case, the ridge estimator always produces shrinkage towards 0. \(\lambda\) controls the amount of shrinkage.
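This identity is easy to verify numerically. The sketch below is an illustrative check (the QR-based design and all numerical settings are assumptions, not part of the lesson): it manufactures a design with \(X'X = nI_p\) and confirms that the ridge estimate equals \(\frac{n}{n+\lambda}\) times the least squares estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 100, 3, 25.0

# Build a design with exactly orthogonal columns so that X'X = n * I_p
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))   # Q has orthonormal columns
X = np.sqrt(n) * Q                             # now X'X = n * I_p
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)
y -= y.mean()

beta_ls = np.linalg.solve(X.T @ X, X.T @ y)
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Ridge equals the least squares estimate scaled by n / (n + lambda)
print(np.allclose(beta_ridge, (n / (n + lam)) * beta_ls))   # True
```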
An important concept in shrinkage is the "effective" degrees of freedom associated with a set of parameters. In a ridge regression setting:
1. If we choose \(\lambda = 0\), we have p parameters (since there is no penalization).
2. If \(\lambda\) is large, the parameters are heavily constrained and the degrees of freedom will effectively be lower, tending to 0 as \(\lambda \to \infty\).
The effective degrees of freedom associated with \(\beta_1, \beta_2, \ldots, \beta_p\) is defined as
\[df(\lambda) = tr\left(X(X'X + \lambda I_p)^{-1}X'\right) = \sum_{j=1}^{p}\frac{d_j^2}{d_j^2 + \lambda},\]
where the \(d_j\) are the singular values of \(X\). Notice that \(\lambda = 0\), which corresponds to no shrinkage, gives \(df(\lambda) = p\) (as long as \(X'X\) is non-singular), as we would expect.
There is a 1:1 mapping between \(\lambda\) and the degrees of freedom, so in practice one may simply pick the effective degrees of freedom that one would like associated with the fit, and solve for \(\lambda\).
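A small sketch of this idea (the helpers effective_df and lambda_for_df, the bisection tolerances, and the simulated data are all illustrative assumptions): compute \(df(\lambda)\) from the singular values of X, then invert the monotone map numerically to recover the \(\lambda\) that gives a desired effective degrees of freedom.

```python
import numpy as np

def effective_df(X, lam):
    """df(lambda) = sum_j d_j^2 / (d_j^2 + lambda), with d_j the singular values of X."""
    d = np.linalg.svd(X, compute_uv=False)
    return np.sum(d**2 / (d**2 + lam))

def lambda_for_df(X, target_df, hi=1e8, tol=1e-10):
    """Bisection: df(lambda) is monotone decreasing in lambda, so the root is unique."""
    lo, hi_ = 0.0, hi
    while hi_ - lo > tol:
        mid = (lo + hi_) / 2
        if effective_df(X, mid) > target_df:
            lo = mid            # still too many effective parameters; increase lambda
        else:
            hi_ = mid
    return (lo + hi_) / 2

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 5))
X -= X.mean(axis=0)

print(effective_df(X, 0.0))           # equals p = 5 when lambda = 0
lam = lambda_for_df(X, target_df=3.0)
print(lam, effective_df(X, lam))      # df(lam) is approximately 3.0 by construction
```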
As an alternative to a user-chosen \(\lambda\), cross-validation is often used in choosing \(\lambda\): we select the \(\lambda\) that yields the smallest cross-validation prediction error.
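For example, a bare-bones K-fold cross-validation over a grid of \(\lambda\) values might look like the sketch below; the fold construction, the grid, and the simulated data are illustrative choices, not the course's prescribed procedure.

```python
import numpy as np

def ridge_estimate(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_error(X, y, lam, k=5, seed=0):
    """Average squared prediction error over k folds for a given lambda."""
    n = X.shape[0]
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    errs = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        beta = ridge_estimate(X[train], y[train], lam)
        errs.append(np.mean((y[fold] - X[fold] @ beta) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 8)); X -= X.mean(axis=0)
y = X @ rng.normal(size=8) + rng.normal(size=60); y -= y.mean()

# Pick the lambda on the grid with the smallest cross-validation prediction error
grid = np.logspace(-3, 3, 25)
best_lam = min(grid, key=lambda lam: cv_error(X, y, lam))
print(best_lam)
```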
The intercept \(\beta_0\) has been left out of the penalty term because \(Y\) has been centered. Penalization of the intercept would make the procedure depend on the origin chosen for \(Y\).

Since the ridge estimator is linear, it is straightforward to calculate the variance-covariance matrix
\[var(\hat{\beta}^{ridge}) = \sigma^2(X'X + \lambda I_p)^{-1}X'X(X'X + \lambda I_p)^{-1}.\]
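A short sketch of this calculation (ridge_covariance is a hypothetical helper, and \(\sigma^2\) is treated as known purely for illustration): it evaluates the sandwich formula above and checks that \(\lambda = 0\) recovers the usual least squares covariance \(\sigma^2(X'X)^{-1}\).

```python
import numpy as np

def ridge_covariance(X, lam, sigma2):
    """var(beta_ridge) = sigma^2 (X'X + lam I)^{-1} X'X (X'X + lam I)^{-1}."""
    p = X.shape[1]
    A_inv = np.linalg.inv(X.T @ X + lam * np.eye(p))
    return sigma2 * A_inv @ (X.T @ X) @ A_inv

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 4)); X -= X.mean(axis=0)

V_ridge = ridge_covariance(X, lam=5.0, sigma2=1.0)
V_ls = 1.0 * np.linalg.inv(X.T @ X)                        # lam = 0 gives the LS covariance
print(np.allclose(ridge_covariance(X, 0.0, 1.0), V_ls))    # True
print(np.trace(V_ridge) < np.trace(V_ls))                  # ridge variance is smaller
```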

A Bayesian Formulation
Consider the linear regression model with normal errors:
\[Y_i = \sum_{j=1}^{p} X_{ij}\beta_j + \epsilon_i,\]
where the \(\epsilon_i\) are i.i.d. normal errors with mean 0 and known variance \(\sigma^2\).

Since \(\lambda\) is applied to the squared norm of the \(\beta\) vector, people often standardize all of the covariates to make them have a similar scale. Assume \(\beta_j\) has the prior distribution \(\beta_j \sim_{iid} N(0, \sigma^2/\lambda)\). A large value of \(\lambda\) corresponds to a prior that is more tightly concentrated around zero, and hence leads to greater shrinkage towards zero.
The posterior is \(\beta|Y \sim N\left(\hat{\beta}, \sigma^2(X'X + \lambda I_p)^{-1}X'X(X'X + \lambda I_p)^{-1}\right)\), where \(\hat{\beta} = \hat{\beta}^{ridge} = (X'X + \lambda I_p)^{-1}X'Y\), confirming that the posterior mean (and mode) of the Bayesian linear model corresponds to the ridge regression estimator.
Whereas the least squares solutions \(\hat{\beta}^{ls} = (X'X)^{-1}X'Y\) are unbiased if the model is correctly specified, ridge solutions are biased, \(E(\hat{\beta}^{ridge}) \neq \beta\). However, at the cost of bias, ridge regression reduces the variance, and thus might reduce the mean squared error (MSE):
\[MSE = \text{Bias}^2 + \text{Variance}.\]
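The bias-variance trade-off can be seen in a small Monte Carlo experiment. In the sketch below (all simulation settings are arbitrary choices made for illustration), \(Y\) is regenerated repeatedly for a fixed design, and the squared bias, variance, and MSE of the least squares and ridge estimators are compared.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, sigma, lam = 30, 10, 3.0, 20.0
X = rng.normal(size=(n, p)); X -= X.mean(axis=0)
beta = rng.normal(size=p)

def bias_var_mse(estimates, truth):
    """Squared bias, variance, and MSE summed over the coefficients."""
    bias2 = np.sum((estimates.mean(axis=0) - truth) ** 2)
    var = np.sum(estimates.var(axis=0))
    return bias2, var, bias2 + var

ls_est, ridge_est = [], []
for _ in range(2000):
    y = X @ beta + sigma * rng.normal(size=n)
    ls_est.append(np.linalg.solve(X.T @ X, X.T @ y))
    ridge_est.append(np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y))

print(bias_var_mse(np.array(ls_est), beta))     # bias near 0, large variance
print(bias_var_mse(np.array(ridge_est), beta))  # some bias, much smaller variance, often smaller MSE
```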

More Geometric Interpretations (optional)
Inputs are centered first.
Consider the fitted response
\[\hat{y}^{ridge} = X\hat{\beta}^{ridge} = X(X'X + \lambda I)^{-1}X'Y = UD(D^2 + \lambda I)^{-1}DU'Y = \sum_{j=1}^{p} \textbf{u}_j \frac{d_j^2}{d_j^2 + \lambda}\textbf{u}_j'Y,\]

where \(\textbf{u}_j\) are the normalized principal components of X.
Ridge regression shrinks the coordinates with respect to the orthonormal basis formed by the principal components.
Coordinates with respect to principal components with smaller variance are shrunk more.
Instead of using \(X = (X_1, X_2, \ldots, X_p)\) as the predicting variables, use the new input matrix \(\tilde{X} = UD\). Then for the new inputs:
\[\hat{\beta}^{ridge}_j = \frac{d_j^2}{d_j^2 + \lambda}\hat{\beta}^{ls}_j \quad \text{and} \quad Var(\hat{\beta}_j) = \frac{\sigma^2}{d_j^2},\]
where \(\sigma^2\) is the variance of the error term \(\epsilon\) in the linear model.


The shrinkage factor given by ridge regression is:
\[\frac{d_j^2}{d_j^2 + \lambda}.\]
We saw this in the previous formula. The larger \(\lambda\) is, the more the projection is shrunk in the direction of \(\textbf{u}_j\). Coordinates with respect to the principal components with a smaller variance are shrunk more.
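To make the shrinkage factors tangible, the following sketch (again NumPy, with illustrative simulated data) computes the ridge fitted values directly from the closed form and then reproduces them by shrinking each principal-component coordinate of \(Y\) by \(d_j^2/(d_j^2 + \lambda)\).

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(40, 5)); X -= X.mean(axis=0)
y = rng.normal(size=40); y -= y.mean()
lam, p = 10.0, X.shape[1]

# Ridge fitted values directly from the closed form X (X'X + lam I)^{-1} X'y
yhat_direct = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Same fit via the SVD: shrink each principal-component coordinate by d_j^2 / (d_j^2 + lam)
U, d, Vt = np.linalg.svd(X, full_matrices=False)
shrinkage = d**2 / (d**2 + lam)
yhat_svd = U @ (shrinkage * (U.T @ y))

print(np.allclose(yhat_direct, yhat_svd))   # True
print(shrinkage)                            # directions with smaller d_j are shrunk more
```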
Let's take a look at this geometrically.

This interpretation will become convenient when we compare it to principal components regression, where instead of doing shrinkage, we either shrink the direction closer to zero or we don't shrink at all. We will see this in the "Dimension Reduction Methods" lesson.
Source URL: https://onlinecourses.science.psu.edu/stat857/node/155
