CHAPTER 3

Improving the way neural networks learn

When a golf player is first learning to play golf, they usually spend most of their time developing a basic swing. Only gradually do they develop other shots, learning to chip, draw and fade the ball, building on and modifying their basic swing. In a similar way, up to now we've focused on understanding the backpropagation algorithm. It's our "basic swing", the foundation for learning in most work on neural networks. In this chapter I explain a suite of techniques which can be used to improve on our vanilla implementation of backpropagation, and so improve the way our networks learn.
The techniques we'll develop in this chapter include: a better choice of cost function, known as the cross-entropy cost function; four so-called "regularization" methods (L1 and L2 regularization, dropout, and artificial expansion of the training data), which make our networks better at generalizing beyond the training data; a better
method for initializing the weights in the network; and a set of heuristics to help choose good hyper-parameters for the network. I'll also overview several other techniques in less depth. The discussions are largely independent of one another, and so you may jump ahead if you wish. We'll also implement many of the techniques in running code, and use them to improve the results obtained on the handwriting classification problem studied in Chapter 1.
Of course, we're only covering a few of the many, many techniques which have been developed for use in neural nets. The philosophy is that the best entree to the plethora of available techniques is in-depth study of a few of the most important. Mastering those important techniques is not just useful in its own right, but will also deepen your understanding of what problems can arise when you use neural networks. That will leave you well prepared to quickly pick up other techniques, as you need them.

The cross-entropy cost function
Most of us find it unpleasant to be wrong. Soon after beginning to learn the piano I gave my first performance before an audience. I was nervous, and began playing the piece an octave too low. I got confused, and couldn't continue until someone pointed out my error. I was very embarrassed. Yet while unpleasant, we also learn quickly when we're decisively wrong. You can bet that the next time I played before an audience I played in the correct octave! By contrast, we learn more slowly when our errors are less well

defined.

Ideally, we hope and expect that our neural networks will learn fast from their errors. Is this what happens in practice? To answer this question, let's look at a toy example. The example involves a neuron with just one input:

We'll train this neuron to do something ridiculously easy: take the input 1 to the output 0. Of course, this is such a trivial task that we could easily figure out an appropriate weight and bias by hand, without using a learning algorithm. However, it turns out to be illuminating to use gradient descent to attempt to learn a weight and bias. So let's take a look at how the neuron learns.
To make things definite, I'll pick the initial weight to be 0.6 and the initial bias to be 0.9. These are generic choices used as a place to begin learning, I wasn't picking them to be special in any way. The initial output from the neuron is 0.82, so quite a bit of learning will be needed before our neuron gets near the desired output, 0.0. Click on "Run" in the bottom right corner below to see how the neuron learns an output much closer to 0.0. Note that this isn't a pre-recorded animation, your browser is actually computing the gradient, then using the gradient to update the weight and bias, and displaying the result. The learning rate is η = 0.15, which turns out to be slow enough that we can follow what's happening, but fast enough that we can get substantial learning in just a few seconds. The cost is the quadratic cost function, C, introduced back in Chapter 1. I'll remind you of the exact form of the cost function shortly, so there's no need to go and dig up the definition. Note that you can run the animation multiple times by clicking on "Run" again.
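
The experiment is easy to reproduce outside the browser, too. The following is a minimal sketch in plain Python, not part of the book's code (the function names are mine): it implements the gradient descent step just described for a single sigmoid neuron with one input, using the quadratic cost.

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(w, b, eta=0.15, epochs=300, x=1.0, y=0.0):
    # Gradient descent on the quadratic cost C = (y - a)^2 / 2
    for _ in range(epochs):
        a = sigmoid(w * x + b)
        # dC/dw = (a - y) sigma'(z) x and dC/db = (a - y) sigma'(z),
        # where sigma'(z) = a (1 - a)
        w -= eta * (a - y) * a * (1 - a) * x
        b -= eta * (a - y) * a * (1 - a)
    return sigmoid(w * x + b)

print(train(0.6, 0.9))  # fast learning from an initial output of 0.82
print(train(2.0, 2.0))  # much slower early learning from an initial output of 0.98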

As you can see, the neuron rapidly learns a weight and bias that drives down the cost, and gives an output from the neuron of about 0.09. That's not quite the desired output, 0.0, but it is pretty good. Suppose, however, that we instead choose both the starting weight and the starting bias to be 2.0. In this case the initial output is 0.98, which is very badly wrong. Let's look at how the neuron learns to output 0 in this case. Click on "Run" again:

Although this example uses the same learning rate (η = 0.15), we can see that learning starts out much more slowly. Indeed, for the first 150 or so learning epochs, the weights and biases don't change much at all. Then the learning kicks in and, much as in our first example, the neuron's output rapidly moves closer to 0.0.

This behaviour is strange when contrasted to human learning. As I said at the beginning of this section, we often learn fastest when we're badly wrong about something. But we've just seen that our artificial neuron has a lot of difficulty learning when it's badly wrong, far more difficulty than when it's just a little wrong. What's more, it turns out that this behaviour occurs not just in this toy model, but in more general networks. Why is learning so slow? And can we find a way of avoiding this slowdown?
To understand the origin of the problem, consider that our neuron learns by changing the weight and bias at a rate determined by the partial derivatives of the cost function, ∂C/∂w and ∂C/∂b. So saying "learning is slow" is really the same as saying that those partial derivatives are small. The challenge is to understand why they are small. To understand that, let's compute the partial derivatives. Recall that we're using the quadratic cost function, which, from Equation (6), is given by

    C = \frac{(y-a)^2}{2},

where a is the neuron's output when the training input x = 1 is used, and y = 0 is the corresponding desired output. To write this more explicitly in terms of the weight and bias, recall that a = σ(z), where z = wx + b. Using the chain rule to differentiate with respect to the weight and bias we get

    \frac{\partial C}{\partial w} = (a-y)\sigma'(z) x = a \sigma'(z), \qquad (55)
    \frac{\partial C}{\partial b} = (a-y)\sigma'(z) = a \sigma'(z), \qquad (56)

where I have substituted x = 1 and y = 0. To understand the behaviour of these expressions, let's look more closely at the σ′(z) term on the right-hand side. Recall the shape of the σ function:

[Figure: graph of the sigmoid function, rising from 0.0 to 1.0, with a flat region at each end.]

We can see from this graph that when the neuron's output is close to 1, the curve gets very flat, and so σ′(z) gets very small. Equations (55) and (56) then tell us that ∂C/∂w and ∂C/∂b get very small. This is the origin of the learning slowdown. What's more, as we shall see a little later, the learning slowdown occurs for essentially the same reason in more general neural networks, not just the toy example we've been playing with.
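
To put rough numbers on the slowdown, here's a quick check, my own illustration rather than the book's, of how small σ′(z) is at the two starting points used above:

import math

def sigma_prime(z):
    a = 1.0 / (1.0 + math.exp(-z))
    return a * (1 - a)

# w = 0.6, b = 0.9 gives z = 0.6 * 1 + 0.9 = 1.5, output ~0.82
# w = b = 2.0 gives z = 2.0 * 1 + 2.0 = 4.0, output ~0.98
print(sigma_prime(1.5))  # ~0.149
print(sigma_prime(4.0))  # ~0.018, roughly eight times smaller

So at the badly-wrong starting point the gradient, and hence the speed of learning, is nearly an order of magnitude smaller, which is exactly the sluggish start we saw in the second animation.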

Introducing the cross-entropy cost function
How can we address the learning slowdown? It turns out that we can solve the problem by replacing the quadratic cost with a different cost function, known as the cross-entropy. To understand the cross-entropy, let's move a little away from our super-simple toy model. We'll suppose instead that we're trying to train a neuron with several input variables, x₁, x₂, …, corresponding weights w₁, w₂, …, and a bias, b:

The output from the neuron is, of course, a = σ(z), where z = Σⱼ wⱼxⱼ + b is the weighted sum of the inputs. We define the cross-entropy cost function for this neuron by

    C = -\frac{1}{n} \sum_x \left[ y \ln a + (1-y) \ln(1-a) \right], \qquad (57)

where n is the total number of items of training data, the sum is over all training inputs, x, and y is the corresponding desired output.
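
Translated into code, the cost is a direct transcription of the formula. Here's a hedged NumPy sketch, my own helper rather than the book's network2.py, where np.nan_to_num guards against the case where a is exactly 0 or 1:

import numpy as np

def cross_entropy_cost(a, y):
    # Average of -[y ln a + (1 - y) ln(1 - a)] over the training examples;
    # nan_to_num converts the 0 * log(0) = nan case into 0
    return -np.mean(np.nan_to_num(y * np.log(a) + (1 - y) * np.log(1 - a)))

# A confident, correct output costs almost nothing; a confident,
# wrong output is heavily penalized:
print(cross_entropy_cost(np.array([0.99]), np.array([1.0])))  # ~0.01
print(cross_entropy_cost(np.array([0.01]), np.array([1.0])))  # ~4.6
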
It's not obvious that the expression (57) fixes the learning slowdown problem. In fact, frankly, it's not even obvious that it makes sense to call this a cost function! Before addressing the learning slowdown, let's see in what sense the cross-entropy can be interpreted as a cost function.

Two properties in particular make it reasonable to interpret the cross-entropy as a cost function. First, it's non-negative, that is, C > 0. To see this, notice that: (a) all the individual terms in the sum in (57) are negative, since both logarithms are of numbers in the range 0 to 1; and (b) there is a minus sign out the front of the sum.
Second, if the neuron's actual output is close to the desired output for all training inputs, x, then the cross-entropy will be close to zero*. To see this, suppose for example that y = 0 and a ≈ 0 for some input x. This is a case when the neuron is doing a good job on that input. We see that the first term in the expression (57) for the cost vanishes, since y = 0, while the second term is just −ln(1−a) ≈ 0. A similar analysis holds when y = 1 and a ≈ 1. And so the contribution to the cost will be low provided the actual output is close to the desired output.

*To prove this I will need to assume that the desired outputs y are all either 0 or 1. This is usually the case when solving classification problems, for example, or when computing Boolean functions. To understand what happens when we don't make this assumption, see the exercises at the end of this section.
Summing up, the cross-entropy is positive, and tends toward zero as the neuron gets better at computing the desired output, y, for all training inputs, x. These are both properties we'd intuitively expect for a cost function. Indeed, both properties are also satisfied by the quadratic cost. So that's good news for the cross-entropy. But the cross-entropy cost function has the benefit that, unlike the quadratic cost, it avoids the problem of learning slowing down. To see this, let's compute the partial derivative of the cross-entropy cost with respect to the weights. We substitute a = σ(z) into (57), and apply the chain rule twice, obtaining:

    \frac{\partial C}{\partial w_j} = -\frac{1}{n} \sum_x \left( \frac{y}{\sigma(z)} - \frac{1-y}{1-\sigma(z)} \right) \frac{\partial \sigma}{\partial w_j}
                                    = -\frac{1}{n} \sum_x \left( \frac{y}{\sigma(z)} - \frac{1-y}{1-\sigma(z)} \right) \sigma'(z) x_j.

Putting everything over a common denominator and simplifying this becomes:

    \frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_x \frac{\sigma'(z) x_j}{\sigma(z)(1-\sigma(z))} (\sigma(z) - y).

Using the definition of the sigmoid function, σ(z) = 1/(1+e^{−z}), and a little algebra we can show that σ′(z) = σ(z)(1−σ(z)). I'll ask you to verify this in an exercise below, but for now let's accept it as given. We see that the σ′(z) and σ(z)(1−σ(z)) terms cancel in the equation just above, and it simplifies to become:

    \frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_x x_j (\sigma(z) - y). \qquad (61)

This is a beautiful expression. It tells us that the rate at which the weight learns is controlled by σ(z) − y, i.e., by the error in the output. The larger the error, the faster the neuron will learn. This is just what we'd intuitively expect. In particular, it avoids the learning slowdown caused by the σ′(z) term in the analogous equation for the quadratic cost, Equation (55). When we use the cross-entropy, the σ′(z) term gets canceled out, and we no longer need worry about it being small. This cancellation is the special miracle ensured by the cross-entropy cost function. Actually, it's not really a miracle. As we'll see later, the cross-entropy was specially chosen to have just this property.

In a similar way, we can compute the partial derivative for the bias. I won't go through all the details again, but you can easily verify that

    \frac{\partial C}{\partial b} = \frac{1}{n} \sum_x (\sigma(z) - y).

Again, this avoids the learning slowdown caused by the σ′(z) term in the analogous equation for the quadratic cost, Equation (56).
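
It's easy to check the simplified gradient numerically. The sketch below is mine, not from the book: it compares the expression (1/n) Σₓ x(σ(z) − y) against a finite-difference derivative of the cross-entropy for a neuron with a single weight.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, x, y):
    a = sigmoid(w * x + b)
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

w, b = 0.6, 0.9
x = np.array([1.0, 0.5, -0.3])   # three training inputs
y = np.array([0.0, 1.0, 0.0])    # desired outputs

# The gradient from the text: dC/dw = (1/n) sum_x x (sigma(z) - y)
analytic = np.mean(x * (sigmoid(w * x + b) - y))

# Finite-difference estimate of the same derivative
eps = 1e-6
numeric = (cost(w + eps, b, x, y) - cost(w - eps, b, x, y)) / (2 * eps)

print(analytic, numeric)  # the two agree to many decimal places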

Exercise
- Verify that σ′(z) = σ(z)(1−σ(z)).
Let's return to the toy example we played with earlier, and explore what happens when we use the cross-entropy instead of the quadratic cost. To re-orient ourselves, we'll begin with the case where the quadratic cost did just fine, with starting weight 0.6 and starting bias 0.9. Press "Run" to see what happens when we replace the quadratic cost by the cross-entropy:

Unsurprisingly, the neuron learns perfectly well in this instance, just as it did earlier. And now let's look at the case where our neuron got stuck before (link, for comparison), with the weight and bias both starting at 2.0:

Success! This time the neuron learned quickly, just as we hoped. If you observe closely you can see that the slope of the cost curve was much steeper initially than the initial flat region on the corresponding curve for the quadratic cost. It's that steepness which the cross-entropy buys us, preventing us from getting stuck just when we'd expect our neuron to learn fastest, i.e., when the neuron starts out badly wrong.

I didn't say what learning rate was used in the examples just illustrated. Earlier, with the quadratic cost, we used η = 0.15. Should we have used the same learning rate in the new examples? In fact, with the change in cost function it's not possible to say precisely what it means to use the "same" learning rate; it's an apples and oranges comparison. For both cost functions I simply experimented to find a learning rate that made it possible to see what is going on. If you're still curious, despite my disavowal, here's the lowdown: I used η = 0.005 in the examples just given.

You might object that the change in learning rate makes the graphs above meaningless. Who cares how fast the neuron learns, when our choice of learning rate was arbitrary to begin with?! That objection misses the point. The point of the graphs isn't about the absolute speed of learning. It's about how the speed of learning changes. In particular, when we use the quadratic cost learning is slower when the neuron is unambiguously wrong than it is later on, as the neuron gets closer to the correct output; while with the cross-entropy learning is faster when the neuron is unambiguously wrong. Those statements don't depend on how the learning rate is set.

We've been studying the cross-entropy for a single neuron. However, it's easy to generalize the cross-entropy to many-neuron multi-layer networks. In particular, suppose y = y₁, y₂, … are the desired values at the output neurons, i.e., the neurons in the final layer, while a^L_1, a^L_2, … are the actual output values. Then we define the cross-entropy by

    C = -\frac{1}{n} \sum_x \sum_j \left[ y_j \ln a^L_j + (1-y_j) \ln(1-a^L_j) \right]. \qquad (63)

This is the same as our earlier expression, Equation (57), except now we've got the Σⱼ summing over all the output neurons. I won't explicitly work through a derivation, but it should be plausible that using the expression (63) avoids a learning slowdown in many-neuron networks. If you're interested, you can work through the derivation in the problem below.
When should we use the cross-entropy instead of the quadratic cost? In fact, the cross-entropy is nearly always the better choice, provided the output neurons are sigmoid neurons. To see why, consider that when we're setting up the network we usually initialize the weights and biases using some sort of randomization. It may happen that those initial choices result in the network being decisively wrong for some training input; that is, an output neuron will have saturated near 1, when it should be 0, or vice versa. If we're using the quadratic cost that will slow down learning. It won't stop learning completely, since the weights will continue learning from other training inputs, but it's obviously undesirable.

Exercises
- One gotcha with the cross-entropy is that it can be difficult at first to remember the respective roles of the ys and the as. It's easy to get confused about whether the right form is −[y ln a + (1−y) ln(1−a)] or −[a ln y + (1−a) ln(1−y)]. What happens to the second of these expressions when y = 0 or 1? Does this problem afflict the first expression? Why or why not?

- In the single-neuron discussion at the start of this section, I argued that the cross-entropy is small if σ(z) ≈ y for all training inputs. The argument relied on y being equal to either 0 or 1. This is usually true in classification problems, but for other problems (e.g., regression problems) y can sometimes take values intermediate between 0 and 1. Show that the cross-entropy is still minimized when σ(z) = y for all training inputs. When this is the case the cross-entropy has the value:

      C = -\frac{1}{n} \sum_x \left[ y \ln y + (1-y) \ln(1-y) \right].

  The quantity −[y ln y + (1−y) ln(1−y)] is sometimes known as the binary entropy.

Problems
- Many-layer multi-neuron networks In the notation introduced in the last chapter, show that for the quadratic cost the partial derivative with respect to weights in the output layer is

      \frac{\partial C}{\partial w^L_{jk}} = \frac{1}{n} \sum_x a^{L-1}_k (a^L_j - y_j) \sigma'(z^L_j).

  The term σ′(z^L_j) causes a learning slowdown whenever an output neuron saturates on the wrong value. Show that for the cross-entropy cost the output error δ^L for a single training example x is given by

      \delta^L = a^L - y.

  Use this expression to show that the partial derivative with respect to the weights in the output layer is given by

      \frac{\partial C}{\partial w^L_{jk}} = \frac{1}{n} \sum_x a^{L-1}_k (a^L_j - y_j). \qquad (67)

  The σ′(z^L_j) term has vanished, and so the cross-entropy avoids the problem of learning slowdown, not just when used with a single neuron, as we saw earlier, but also in many-layer multi-neuron networks. A simple variation on this analysis holds also for the biases. If this is not obvious to you, then you should work through that analysis as well.
- Using the quadratic cost when we have linear neurons in the output layer Suppose that we have a many-layer multi-neuron network. Suppose all the neurons in the final layer are linear neurons, meaning that the sigmoid activation function is not applied, and the outputs are simply a^L_j = z^L_j. Show that if we use the quadratic cost function then the output error δ^L for a single training example x is given by

      \delta^L = a^L - y.

  Similarly to the previous problem, use this expression to show that the partial derivatives with respect to the weights and biases in the output layer are given by

      \frac{\partial C}{\partial w^L_{jk}} = \frac{1}{n} \sum_x a^{L-1}_k (a^L_j - y_j),
      \frac{\partial C}{\partial b^L_j} = \frac{1}{n} \sum_x (a^L_j - y_j).

  This shows that if the output neurons are linear neurons then the quadratic cost will not give rise to any problems with a learning slowdown. In this case the quadratic cost is, in fact, an appropriate cost function to use.

Using the cross-entropy to classify MNIST digits
The cross-entropy is easy to implement as part of a program which learns using gradient descent and backpropagation. We'll do that later in the chapter, developing an improved version of our earlier program for classifying the MNIST handwritten digits, network.py. The new program is called network2.py, and incorporates not just the cross-entropy, but also several other techniques developed in this chapter*. For now, let's look at how well our new program classifies MNIST digits. As was the case in Chapter 1, we'll use a network with 30 hidden neurons, and we'll use a mini-batch size of 10. We set the learning rate to η = 0.5** and we train for 30 epochs. The interface to network2.py is slightly different than network.py, but it should still be clear what is going on. You can, by the way, get documentation about network2.py's interface by using commands such as help(network2.Network.SGD) in a Python shell.

*The code is available on GitHub.

**In Chapter 1 we used the quadratic cost and a learning rate of η = 3.0. As discussed above, it's not possible to say precisely what it means to use the "same" learning rate when the cost function is changed. For both cost functions I experimented to find a learning rate that provides near-optimal performance, given the other hyper-parameter choices. There is, incidentally, a very rough general heuristic for relating the learning rate for the cross-entropy and the quadratic cost. As we saw earlier, the gradient terms for the quadratic cost have an extra σ′ = σ(1−σ) term in them. Suppose we average this over values for σ, ∫₀¹ dσ σ(1−σ) = 1/6. We see that (very roughly) the quadratic cost learns an average of 6 times slower, for the same learning rate. This suggests that a reasonable starting point is to divide the learning rate for the quadratic cost by 6. Of course, this argument is far from rigorous, and shouldn't be taken too seriously. Still, it can sometimes be a useful starting point.
>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
>>> import network2
>>> net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
>>> net.large_weight_initializer()
>>> net.SGD(training_data, 30, 10, 0.5, evaluation_data=test_data,
... monitor_evaluation_accuracy=True)

Note, by the way, that the net.large_weight_initializer() command is used to initialize the weights and biases in the same way as described in Chapter 1. We need to run this command because later in this chapter we'll change the default weight initialization in our networks. The result from running the above sequence of commands is a network with 95.49 percent accuracy. This is pretty close to the result we obtained in Chapter 1, 95.42 percent, using the quadratic cost.
Let's look also at the case where we use 100 hidden neurons, the cross-entropy, and otherwise keep the parameters the same. In this case we obtain an accuracy of 96.82 percent. That's a substantial improvement over the results from Chapter 1, where we obtained a classification accuracy of 96.59 percent, using the quadratic cost. That may look like a small change, but consider that the error rate has dropped from 3.41 percent to 3.18 percent. That is, we've eliminated about one in fourteen of the original errors. That's quite a handy improvement.
It's encouraging that the cross-entropy cost gives us similar or better results than the quadratic cost. However, these results don't conclusively prove that the cross-entropy is a better choice. The reason is that I've put only a little effort into choosing hyper-parameters such as learning rate, mini-batch size, and so on. For the improvement to be really convincing we'd need to do a thorough job optimizing such hyper-parameters. Still, the results are encouraging, and reinforce our earlier theoretical argument that the cross-entropy is a better choice than the quadratic cost.

This, by the way, is part of a general pattern that we'll see through this chapter and, indeed, through much of the rest of the book. We'll develop a new technique, we'll try it out, and we'll get "improved" results. It is, of course, nice that we see such improvements. But the interpretation of such improvements is always problematic. They're only truly convincing if we see an improvement after putting tremendous effort into optimizing all the other hyper-parameters. That's a great deal of work, requiring lots of computing power, and we're not usually going to do such an exhaustive investigation. Instead, we'll proceed on the basis of informal tests like those done above. Still, you should keep in mind that such tests fall short of definitive proof, and remain alert to signs that the arguments are breaking down.
By now, we've discussed the cross-entropy at great length. Why go to so much effort when it gives only a small improvement to our MNIST results? Later in the chapter we'll see other techniques, notably regularization, which give much bigger improvements. So why so much focus on cross-entropy? Part of the reason is that the cross-entropy is a widely-used cost function, and so is worth understanding well. But the more important reason is that neuron saturation is an important problem in neural nets, a problem we'll return to repeatedly throughout the book. And so I've discussed the cross-entropy at length because it's a good laboratory to begin understanding neuron saturation and how it may be addressed.

What does the cross-entropy mean? Where does it come from?
Our discussion of the cross-entropy has focused on algebraic analysis and practical implementation. That's useful, but it leaves unanswered broader conceptual questions, like: what does the cross-entropy mean? Is there some intuitive way of thinking about the cross-entropy? And how could we have dreamed up the cross-entropy in the first place?
Let's begin with the last of these questions: what could have motivated us to think up the cross-entropy in the first place? Suppose we'd discovered the learning slowdown described earlier, and understood that the origin was the σ′(z) terms in Equations (55) and (56). After staring at those equations for a bit, we might wonder if it's possible to choose a cost function so that the σ′(z) term disappeared. In that case, the cost C = Cₓ for a single training example x would satisfy

    \frac{\partial C}{\partial w_j} = x_j (a - y), \qquad (71)
    \frac{\partial C}{\partial b} = (a - y). \qquad (72)

If we could choose the cost function to make these equations true, then they would capture in a simple way the intuition that the greater the initial error, the faster the neuron learns. They'd also eliminate the problem of a learning slowdown. In fact, starting from these equations we'll now show that it's possible to derive the form
of the cross-entropy, simply by following our mathematical noses. To see this, note that from the chain rule we have

    \frac{\partial C}{\partial b} = \frac{\partial C}{\partial a} \sigma'(z).

Using σ′(z) = σ(z)(1−σ(z)) = a(1−a) the last equation becomes

    \frac{\partial C}{\partial b} = \frac{\partial C}{\partial a} a(1-a).

Comparing to Equation (72) we obtain

    \frac{\partial C}{\partial a} = \frac{a - y}{a(1-a)}.

Integrating this expression with respect to a gives

    C = -[y \ln a + (1-y) \ln(1-a)] + \text{constant},

for some constant of integration. This is the contribution to the cost from a single training example, x. To get the full cost function we must average over training examples, obtaining

    C = -\frac{1}{n} \sum_x [y \ln a + (1-y) \ln(1-a)] + \text{constant},

where the constant here is the average of the individual constants for each training example. And so we see that Equations (71) and (72) uniquely determine the form of the cross-entropy, up to an overall constant term. The cross-entropy isn't something that was miraculously pulled out of thin air. Rather, it's something that we could have discovered in a simple and natural way.
What about the intuitive meaning of the cross-entropy? How should we think about it? Explaining this in depth would take us further afield than I want to go. However, it is worth mentioning that there is a standard way of interpreting the cross-entropy that comes from the field of information theory. Roughly speaking, the idea is that the cross-entropy is a measure of surprise. In particular, our neuron is trying to compute the function x → y = y(x). But instead it computes the function x → a = a(x). Suppose we think of a as our neuron's estimated probability that y is 1, and 1−a is the estimated probability that the right value for y is 0. Then the cross-entropy measures how "surprised" we are, on average, when we learn the true value for y. We get low surprise if the output is what we expect, and high surprise if the output is unexpected. Of course, I haven't said exactly what "surprise" means, and so this perhaps seems like empty verbiage. But in fact there is a precise information-theoretic way of saying what is meant by surprise. Unfortunately, I don't know of a good, short, self-contained discussion of this subject that's available online. But if you want to dig deeper, then Wikipedia contains a brief summary that will get you started down the right track. And the details can be filled in by working through the materials about the Kraft inequality in chapter 5 of the book about information theory by Cover and Thomas.

Problem
- We've discussed at length the learning slowdown that can occur when output neurons saturate, in networks using the quadratic cost to train. Another factor that may inhibit learning is the presence of the xⱼ term in Equation (61). Because of this term, when an input xⱼ is near to zero, the corresponding weight wⱼ will learn slowly. Explain why it is not possible to eliminate the xⱼ term through a clever choice of cost function.

Softmax
In this chapter we'll mostly use the cross-entropy cost to address the problem of learning slowdown. However, I want to briefly describe another approach to the problem, based on what are called softmax layers of neurons. We're not actually going to use softmax layers in the remainder of the chapter, so if you're in a great hurry, you can skip to the next section. However, softmax is still worth understanding, in part because it's intrinsically interesting, and in part because we'll use softmax layers in Chapter 6, in our discussion of deep neural networks.
The idea of softmax is to define a new type of output layer for our neural networks. It begins in the same way as with a sigmoid layer, by forming the weighted inputs* z^L_j = Σₖ w^L_jk a^{L−1}_k + b^L_j. However, we don't apply the sigmoid function to get the output. Instead, in a softmax layer we apply the so-called softmax function to the z^L_j. According to this function, the activation a^L_j of the jth output neuron is

    a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}}, \qquad (78)

where in the denominator we sum over all the output neurons.

*In describing the softmax we'll make frequent use of notation introduced in the last chapter. You may wish to revisit that chapter if you need to refresh your memory about the meaning of the notation.
If you're not familiar with the softmax function, Equation (78) may look pretty opaque. It's certainly not obvious why we'd want to use this function. And it's also not obvious that this will help us address the learning slowdown problem. To better understand Equation (78), suppose we have a network with four output neurons, and four corresponding weighted inputs, which we'll denote z^L_1, z^L_2, z^L_3, and z^L_4. Shown below are adjustable sliders showing possible values for the weighted inputs, and a graph of the corresponding output activations. A good place to start exploration is by using the bottom slider to increase z^L_4:

[Interactive sliders: for example, z^L_1 = 2.5, z^L_2 = −1, z^L_3 = 3.2, z^L_4 = 0.5 give output activations a^L_1 = 0.315, a^L_2 = 0.009, a^L_3 = 0.633, a^L_4 = 0.043.]

As you increase z^L_4, you'll see an increase in the corresponding output activation, a^L_4, and a decrease in the other output activations. Similarly, if you decrease z^L_4 then a^L_4 will decrease, and all the other output activations will increase. In fact, if you look closely, you'll see that in both cases the total change in the other activations exactly compensates for the change in a^L_4. The reason is that the output activations are guaranteed to always sum up to 1, as we can prove using Equation (78) and a little algebra:

    \sum_j a^L_j = \frac{\sum_j e^{z^L_j}}{\sum_k e^{z^L_k}} = 1.

As a result, if a^L_4 increases, then the other output activations must decrease by the same total amount, to ensure the sum over all activations remains 1. And, of course, similar statements hold for all the other activations.
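
In code, the softmax function is only a couple of lines. Here's a sketch of my own; the one wrinkle, subtracting max(z) before exponentiating, guards against overflow and doesn't change the result, since the shift cancels between numerator and denominator:

import numpy as np

def softmax(z):
    # a_j = exp(z_j) / sum_k exp(z_k), computed stably
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

# The weighted inputs from the sliders above:
a = softmax(np.array([2.5, -1.0, 3.2, 0.5]))
print(a)         # ~[0.315, 0.009, 0.633, 0.043]
print(a.sum())   # 1.0, a probability distribution
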
Equation (78) also implies that the output activations are all positive, since the exponential function is positive. Combining this with the observation in the last paragraph, we see that the output from the softmax layer is a set of positive numbers which sum up to 1. In other words, the output from the softmax layer can be thought of as a probability distribution.

The fact that a softmax layer outputs a probability distribution is rather pleasing. In many problems it's convenient to be able to interpret the output activation a^L_j as the network's estimate of the probability that the correct output is j. So, for instance, in the MNIST classification problem, we can interpret a^L_j as the network's estimated probability that the correct digit classification is j.

By contrast, if the output layer was a sigmoid layer, then we certainly couldn't assume that the activations formed a probability distribution. I won't explicitly prove it, but it should be plausible that the activations from a sigmoid layer won't in general form a probability distribution. And so with a sigmoid output layer we don't have such a simple interpretation of the output activations.

Exercise
- Construct an example showing explicitly that in a network with a sigmoid output layer, the output activations a^L_j won't always sum to 1.
We're starting to build up some feel for the softmax function and the way softmax layers behave. Just to review where we're at: the exponentials in Equation (78) ensure that all the output activations are positive. And the sum in the denominator of Equation (78) ensures that the softmax outputs sum to 1. So that particular form no longer appears so mysterious: rather, it is a natural way to ensure that the output activations form a probability distribution. You can think of softmax as a way of rescaling the z^L_j, and then squishing them together to form a probability distribution.

Exercises
- Monotonicity of softmax Show that ∂a^L_j/∂z^L_k is positive if j = k and negative if j ≠ k. As a consequence, increasing z^L_j is guaranteed to increase the corresponding output activation, a^L_j, and will decrease all the other output activations. We already saw this empirically with the sliders, but this is a rigorous proof.

- Non-locality of softmax A nice thing about sigmoid layers is that the output a^L_j is a function of the corresponding weighted input, a^L_j = σ(z^L_j). Explain why this is not the case for a softmax layer: any particular output activation a^L_j depends on all the weighted inputs.

Problem
- Inverting the softmax layer Suppose we have a neural network with a softmax output layer, and the activations a^L_j are known. Show that the corresponding weighted inputs have the form z^L_j = ln a^L_j + C, for some constant C that is independent of j.
The learning slowdown problem: We've now built up considerable familiarity with softmax layers of neurons. But we haven't yet seen how a softmax layer lets us address the learning slowdown problem. To understand that, let's define the log-likelihood cost function. We'll use x to denote a training input to the network, and y to denote the corresponding desired output. Then the log-likelihood cost associated to this training input is

    C \equiv -\ln a^L_y.

So, for instance, if we're training with MNIST images, and input an image of a 7, then the log-likelihood cost is −ln a^L_7. To see that this makes intuitive sense, consider the case when the network is doing a good job, that is, it is confident the input is a 7. In that case it will estimate a value for the corresponding probability a^L_7 which is close to 1, and so the cost −ln a^L_7 will be small. By contrast, when the network isn't doing such a good job, the probability a^L_7 will be smaller, and the cost −ln a^L_7 will be larger. So the log-likelihood cost behaves as we'd expect a cost function to behave.
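
In code the log-likelihood cost is a one-liner. A sketch (again my own, with made-up weighted inputs for illustration):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def log_likelihood_cost(a, y):
    # C = -ln a_y, where y is the index of the desired output
    return -np.log(a[y])

# Ten invented weighted inputs for an MNIST-style output layer,
# with the largest at index 7:
a = softmax(np.array([0.2, -1.3, 0.1, 3.0, -0.5, 0.9, 0.0, 4.2, 0.3, 1.1]))
print(log_likelihood_cost(a, 7))  # small, since the network favours the 7
print(log_likelihood_cost(a, 3))  # larger, for a less probable digit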

What about the learning slowdown problem? To analyze that, recall that the key to the learning slowdown is the behaviour of the quantities ∂C/∂w^L_jk and ∂C/∂b^L_j. I won't go through the derivation explicitly (I'll ask you to do it in the problems, below), but with a little algebra you can show that*

    \frac{\partial C}{\partial b^L_j} = a^L_j - y_j, \qquad (81)
    \frac{\partial C}{\partial w^L_{jk}} = a^{L-1}_k (a^L_j - y_j). \qquad (82)

*Note that I'm abusing notation here, using y in a slightly different way to the last paragraph. In the last paragraph we used y to denote the desired output from the network, e.g., output a "7" if an image of a 7 was input. But in the equations which follow I'm using y to denote the vector of output activations which corresponds to 7, that is, a vector which is all 0s, except for a 1 in the 7th location.

These equations are the same as the analogous expressions obtained in our earlier analysis of the cross-entropy. Compare, for example, Equation (82) to Equation (67). It's the same equation, albeit in the latter I've averaged over training instances. And, just as in the earlier analysis, these expressions ensure that we will not encounter a learning slowdown. In fact, it's useful to think of a softmax output layer with log-likelihood cost as being quite similar to a sigmoid output layer with cross-entropy cost.
Given this similarity, should you use a sigmoid output layer and cross-entropy, or a softmax output layer and log-likelihood? In fact, in many situations both approaches work well. Through the remainder of this chapter we'll use a sigmoid output layer, with the cross-entropy cost. Later, in Chapter 6, we'll sometimes use a softmax output layer, with log-likelihood cost. The reason for the switch is to make some of our later networks more similar to networks found in certain influential academic papers. As a more general point of principle, softmax plus log-likelihood is worth using whenever you want to interpret the output activations as probabilities. That's not always a concern, but can be useful with classification problems (like MNIST) involving disjoint classes.

Problems
- Derive Equations (81) and (82).

- Where does the "softmax" name come from? Suppose we change the softmax function so the output activations are given by

      a^L_j = \frac{e^{c z^L_j}}{\sum_k e^{c z^L_k}},

  where c is a positive constant. Note that c = 1 corresponds to the standard softmax function. But if we use a different value of c we get a different function, which is nonetheless qualitatively rather similar to the softmax. In particular, show that the output activations form a probability distribution, just as for the usual softmax. Suppose we allow c to become large, i.e., c → ∞. What is the limiting value for the output activations a^L_j? After solving this problem it should be clear to you why we think of the c = 1 function as a "softened" version of the maximum function. This is the origin of the term "softmax".
- Backpropagation with softmax and the log-likelihood cost In the last chapter we derived the backpropagation algorithm for a network containing sigmoid layers. To apply the algorithm to a network with a softmax layer we need to figure out an expression for the error δ^L_j ≡ ∂C/∂z^L_j in the final layer. Show that a suitable expression is:

      \delta^L_j = a^L_j - y_j.

  Using this expression we can apply the backpropagation algorithm to a network using a softmax output layer and the log-likelihood cost.

Overfitting and regularization
The Nobel prize-winning physicist Enrico Fermi was once asked his opinion of a mathematical model some colleagues had proposed as the solution to an important unsolved physics problem. The model gave excellent agreement with experiment, but Fermi was skeptical. He asked how many free parameters could be set in the model. "Four" was the answer. Fermi replied*: "I remember my friend Johnny von Neumann used to say, with four parameters I can fit an elephant, and with five I can make him wiggle his trunk.".

*The quote comes from a charming article by Freeman Dyson, who is one of the people who proposed the flawed model. A four-parameter elephant may be found here.

The point, of course, is that models with a large number of free parameters can describe an amazingly wide range of phenomena. Even if such a model agrees well with the available data, that
doesn't make it a good model. It may just mean there's enough freedom in the model that it can describe almost any data set of the given size, without capturing any genuine insights into the underlying phenomenon. When that happens the model will work well for the existing data, but will fail to generalize to new situations. The true test of a model is its ability to make predictions in situations it hasn't been exposed to before.

Fermi and von Neumann were suspicious of models with four parameters. Our 30 hidden neuron network for classifying MNIST digits has nearly 24,000 parameters! That's a lot of parameters. Our 100 hidden neuron network has nearly 80,000 parameters, and state-of-the-art deep neural nets sometimes contain millions or even billions of parameters. Should we trust the results?
Let's sharpen this problem up by constructing a situation where our network does a bad job generalizing to new situations. We'll use our 30 hidden neuron network, with its 23,860 parameters. But we won't train the network using all 50,000 MNIST training images. Instead, we'll use just the first 1,000 training images. Using that restricted set will make the problem with generalization much more evident. We'll train in a similar way to before, using the cross-entropy cost function, with a learning rate of η = 0.5 and a mini-batch size of 10. However, we'll train for 400 epochs, a somewhat larger number than before, because we're not using as many training examples. Let's use network2 to look at the way the cost function changes:

>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
>>> import network2
>>> net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
>>> net.large_weight_initializer()
>>> net.SGD(training_data[:1000], 400, 10, 0.5, evaluation_data=test_data,
... monitor_evaluation_accuracy=True, monitor_training_cost=True)

Using the results we can plot the way the cost changes as the network learns*:

*This and the next four graphs were generated by the program overfitting.py.

[Figure: cost on the training data, falling smoothly over epochs 200 to 399.]

This looks encouraging, showing a smooth decrease in the cost, just as we expect. Note that I've only shown training epochs 200 through 399. This gives us a nice up-close view of the later stages of learning, which, as we'll see, turns out to be where the interesting action is.
Let's now look at how the classification accuracy on the test data changes over time:

[Figure: classification accuracy on the test data over epochs 200 to 399, flattening out near epoch 280.]

Again, I've zoomed in quite a bit. In the first 200 epochs (not shown) the accuracy rises to just under 82 percent. The learning then gradually slows down. Finally, at around epoch 280 the classification accuracy pretty much stops improving. Later epochs merely see small stochastic fluctuations near the value of the accuracy at epoch 280. Contrast this with the earlier graph, where the cost associated to the training data continues to smoothly drop.
If we just look at that cost, it appears that our model is still getting "better". But the test accuracy results show the improvement is an illusion. Just like the model that Fermi disliked, what our network learns after epoch 280 no longer generalizes to the test data. And so it's not useful learning. We say the network is overfitting or overtraining beyond epoch 280.
You might wonder if the problem here is that I'm looking at the cost on the training data, as opposed to the classification accuracy on the test data. In other words, maybe the problem is that we're making an apples and oranges comparison. What would happen if we compared the cost on the training data with the cost on the test data, so we're comparing similar measures? Or perhaps we could compare the classification accuracy on both the training data and the test data? In fact, essentially the same phenomenon shows up no matter how we do the comparison. The details do change, however. For instance, let's look at the cost on the test data:

[Figure: cost on the test data, improving until around epoch 15 and worsening thereafter.]

We can see that the cost on the test data improves until around epoch 15, but after that it actually starts to get worse, even though the cost on the training data is continuing to get better. This is another sign that our model is overfitting. It poses a puzzle, though, which is whether we should regard epoch 15 or epoch 280 as the point at which overfitting is coming to dominate learning? From a practical point of view, what we really care about is improving classification accuracy on the test data, while the cost on the test data is no more than a proxy for classification accuracy. And so it makes most sense to regard epoch 280 as the point beyond which overfitting is dominating learning in our neural network.
Another sign of overfitting may be seen in the classification accuracy on the training data:

[Figure: classification accuracy on the training data, rising to 100 percent.]

The accuracy rises all the way up to 100 percent. That is, our network correctly classifies all 1,000 training images! Meanwhile, our test accuracy tops out at just 82.27 percent. So our network really is learning about peculiarities of the training set, not just recognizing digits in general. It's almost as though our network is merely memorizing the training set, without understanding digits well enough to generalize to the test set.
Overfitting is a major problem in neural networks. This is especially true in modern networks, which often have very large numbers of weights and biases. To train effectively, we need a way of detecting when overfitting is going on, so we don't overtrain. And we'd like to have techniques for reducing the effects of overfitting.
The obvious way to detect overfitting is to use the approach above, keeping track of accuracy on the test data as our network trains. If we see that the accuracy on the test data is no longer improving, then we should stop training. Of course, strictly speaking, this is not necessarily a sign of overfitting. It might be that accuracy on the test data and the training data both stop improving at the same time. Still, adopting this strategy will prevent overfitting.
In fact, we'll use a variation on this strategy. Recall that when we load in the MNIST data we load in three data sets:

>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()

Up to now we've been using the training_data and test_data, and ignoring the validation_data. The validation_data contains 10,000 images of digits, images which are different from the 50,000 images in the MNIST training set, and the 10,000 images in the MNIST test set. Instead of using the test_data to prevent overfitting, we will use the validation_data. To do this, we'll use much the same strategy as was described above for the test_data. That is, we'll compute the classification accuracy on the validation_data at the end of each epoch. Once the classification accuracy on the validation_data has saturated, we stop training. This strategy is called early stopping. Of course, in practice we won't immediately know when the accuracy has saturated. Instead, we continue training until we're confident that the accuracy has saturated*.

*It requires some judgment to determine when to stop. In my earlier graphs I identified epoch 280 as the place at which accuracy saturated. It's possible that was too pessimistic. Neural networks sometimes plateau for a while in training, before continuing to improve. I wouldn't be surprised if more learning could have occurred even after epoch 400, although the magnitude of any further improvement would likely be small. So it's possible to adopt more or less aggressive strategies for early stopping.

Why use the validation_data to prevent overfitting, rather than the test_data? In fact, this is part of a more general strategy, which is to use the validation_data to evaluate different trial choices of hyper-parameters such as the number of epochs to train for, the learning rate, the best network architecture, and so on. We use such evaluations to find and set good values for the hyper-parameters. Indeed, although I haven't mentioned it until now, that is, in part, how I arrived at the hyper-parameter choices made earlier in this book. (More on this later.)
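
To make the strategy concrete, here's what a simple form of early stopping might look like. This is my own sketch, not network2.py's interface: train_one_epoch and accuracy are hypothetical helpers, and the no-improvement-for-ten-epochs rule is just one of the more or less aggressive stopping criteria mentioned in the footnote above.

def train_with_early_stopping(net, training_data, validation_data,
                              max_epochs=400, patience=10):
    # Stop once validation accuracy has failed to improve for
    # `patience` consecutive epochs.  train_one_epoch(net, data) and
    # accuracy(net, data) are assumed (hypothetical) helpers.
    best_acc, best_epoch = 0.0, 0
    for epoch in range(max_epochs):
        train_one_epoch(net, training_data)
        acc = accuracy(net, validation_data)
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            break  # accuracy looks saturated
    return best_acc
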
Of course, that doesn't in any way answer the question of why we're using the validation_data to prevent overfitting, rather than the test_data. Instead, it replaces it with a more general question, which is why we're using the validation_data rather than the test_data to set good hyper-parameters? To understand why, consider that when setting hyper-parameters we're likely to try many different choices for the hyper-parameters. If we set the hyper-parameters based on evaluations of the test_data it's possible we'll end up overfitting our hyper-parameters to the test_data. That is, we may end up finding hyper-parameters which fit particular peculiarities of the test_data, but where the performance of the network won't generalize to other data sets. We guard against that by figuring out the hyper-parameters using the validation_data. Then, once we've got the hyper-parameters we want, we do a final evaluation of accuracy using the test_data. That gives us confidence that our results on the test_data are a true measure of how well our neural network generalizes. To put it another way, you can think of the validation data as a type of training data that helps us learn good hyper-parameters. This approach to finding good hyper-parameters is sometimes known as the hold out method, since the validation_data is kept apart or "held out" from the training_data.

Now, in practice, even after evaluating performance on the test_data we may change our minds and want to try another approach, perhaps a different network architecture, which will involve finding a new set of hyper-parameters. If we do this, isn't there a danger we'll end up overfitting to the test_data as well? Do we need a potentially infinite regress of data sets, so we can be confident our results will generalize? Addressing this concern fully is a deep and difficult problem. But for our practical purposes, we're not going to worry too much about this question. Instead, we'll plunge ahead, using the basic hold out method, based on the training_data, validation_data, and test_data, as described above.

We've been looking so far at overfitting when we're just using 1,000 training images. What happens when we use the full training set of 50,000 images? We'll keep all the other parameters the same (30 hidden neurons, learning rate 0.5, mini-batch size of 10), but train using all 50,000 images for 30 epochs. Here's a graph showing the results for the classification accuracy on both the training data and the test data. Note that I've used the test data here, rather than the validation data, in order to make the results more directly comparable with the earlier graphs.

[Figure: training and test accuracy over 30 epochs, with a much smaller gap between the two curves.]
As you can see, the accuracy on the test and training data remain much closer together than when we were using 1,000 training examples. In particular, the best classification accuracy of 97.86 percent on the training data is only 2.53 percent higher than the 95.33 percent on the test data. That's compared to the 17.73 percent gap we had earlier! Overfitting is still going on, but it's been greatly reduced. Our network is generalizing much better from the training data to the test data. In general, one of the best ways of reducing overfitting is to increase the size of the training data. With enough training data it is difficult for even a very large network to overfit. Unfortunately, training data can be expensive or difficult to acquire, so this is not always a practical option.

Regularization
Increasing the amount of training data is one way of reducing overfitting. Are there other ways we can reduce the extent to which overfitting occurs? One possible approach is to reduce the size of our network. However, large networks have the potential to be more powerful than small networks, and so this is an option we'd only adopt reluctantly.
Fortunately, there are other techniques which can reduce overfitting, even when we have a fixed network and fixed training data. These are known as regularization techniques. In this section I describe one of the most commonly used regularization techniques, a technique sometimes known as weight decay or L2 regularization. The idea of L2 regularization is to add an extra term to the cost function, a term called the regularization term. Here's the regularized cross-entropy:

    C = -\frac{1}{n} \sum_{xj} \left[ y_j \ln a^L_j + (1-y_j) \ln(1-a^L_j) \right] + \frac{\lambda}{2n} \sum_w w^2. \qquad (85)

The first term is just the usual expression for the cross-entropy. But we've added a second term, namely the sum of the squares of all the weights in the network. This is scaled by a factor λ/2n, where λ > 0 is known as the regularization parameter, and n is, as usual, the size of our training set. I'll discuss later how λ is chosen. It's also worth noting that the regularization term doesn't include the biases. I'll also come back to that below.
Of course, it's possible to regularize other cost functions, such as the quadratic cost. This can be done in a similar way:

    C = \frac{1}{2n} \sum_x \|y - a^L\|^2 + \frac{\lambda}{2n} \sum_w w^2. \qquad (86)

In both cases we can write the regularized cost function as

    C = C_0 + \frac{\lambda}{2n} \sum_w w^2, \qquad (87)

where C₀ is the original, unregularized cost function.
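
In code, the regularization term is one line on top of whatever the unregularized cost C₀ was. A minimal sketch of my own, not the book's code:

import numpy as np

def regularized_cost(c0, weights, lmbda, n):
    # C = C0 + (lmbda / 2n) * sum of the squares of all the weights;
    # `weights` is a list of weight matrices, one per layer
    return c0 + (lmbda / (2.0 * n)) * sum(np.sum(w ** 2) for w in weights)

weights = [np.random.randn(30, 784), np.random.randn(10, 30)]
print(regularized_cost(0.5, weights, lmbda=5.0, n=50000))
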
Intuitively, the effect of regularization is to make it so the network prefers to learn small weights, all other things being equal. Large weights will only be allowed if they considerably improve the first part of the cost function. Put another way, regularization can be viewed as a way of compromising between finding small weights and minimizing the original cost function. The relative importance of the two elements of the compromise depends on the value of λ: when λ is small we prefer to minimize the original cost function, but when λ is large we prefer small weights.
Now, it's really not at all obvious why making this kind of compromise should help reduce overfitting! But it turns out that it does. We'll address the question of why it helps in the next section. But first, let's work through an example showing that regularization really does reduce overfitting.
To construct such an example, we first need to figure out how to apply our stochastic gradient descent learning algorithm in a regularized neural network. In particular, we need to know how to compute the partial derivatives ∂C/∂w and ∂C/∂b for all the weights and biases in the network. Taking the partial derivatives of Equation (87) gives

    \frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} w,
    \frac{\partial C}{\partial b} = \frac{\partial C_0}{\partial b}.

The ∂C₀/∂w and ∂C₀/∂b terms can be computed using backpropagation, as described in the last chapter. And so we see that it's easy to compute the gradient of the regularized cost function: just use backpropagation, as usual, and then add (λ/n)w to the partial derivative of all the weight terms. The partial derivatives with respect to the biases are unchanged, and so the gradient descent learning rule for the biases doesn't change from the usual rule:
    b \rightarrow b - \eta \frac{\partial C_0}{\partial b}.

The learning rule for the weights becomes:

    w \rightarrow w - \eta \frac{\partial C_0}{\partial w} - \frac{\eta \lambda}{n} w
      = \left(1 - \frac{\eta \lambda}{n}\right) w - \eta \frac{\partial C_0}{\partial w}.

This is exactly the same as the usual gradient descent learning rule, except we first rescale the weight w by a factor 1 − ηλ/n. This rescaling is sometimes referred to as weight decay, since it makes the weights smaller. At first glance it looks as though this means the weights are being driven unstoppably toward zero. But that's not right, since the other term may lead the weights to increase, if so doing causes a decrease in the unregularized cost function.
Okay, that's how gradient descent works. What about stochastic gradient descent? Well, just as in unregularized stochastic gradient descent, we can estimate ∂C₀/∂w by averaging over a mini-batch of m training examples. Thus the regularized learning rule for stochastic gradient descent becomes (c.f. Equation (20))

    w \rightarrow \left(1 - \frac{\eta \lambda}{n}\right) w - \frac{\eta}{m} \sum_x \frac{\partial C_x}{\partial w},

where the sum is over training examples x in the mini-batch, and Cₓ is the (unregularized) cost for each training example. This is exactly the same as the usual rule for stochastic gradient descent, except for the 1 − ηλ/n weight decay factor. Finally, and for completeness, let me state the regularized learning rule for the biases. This is, of course, exactly the same as in the unregularized case (c.f. Equation (21)),

    b \rightarrow b - \frac{\eta}{m} \sum_x \frac{\partial C_x}{\partial b},

where the sum is over training examples x in the mini-batch.
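
Expressed in code, one regularized update step looks like the sketch below. This is my own notation, not a quote from network2.py: nabla_w is assumed to hold the sums Σₓ ∂Cₓ/∂w computed by backpropagation over the mini-batch.

def update_weights(weights, nabla_w, eta, lmbda, n, m):
    # w -> (1 - eta * lmbda / n) w - (eta / m) * sum_x dC_x/dw,
    # where n is the size of the training set and m the mini-batch size
    return [(1 - eta * lmbda / n) * w - (eta / m) * nw
            for w, nw in zip(weights, nabla_w)]

The biases would be updated with the usual unregularized rule.
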
Let's see how regularization changes the performance of our neural network. We'll use a network with 30 hidden neurons, a mini-batch size of 10, a learning rate of 0.5, and the cross-entropy cost function. However, this time we'll use a regularization parameter of λ = 0.1. Note that in the code, we use the variable name lmbda, because lambda is a reserved word in Python, with an unrelated meaning. I've also used the test_data again, not the validation_data. Strictly speaking, we should use the validation_data, for all the reasons we discussed earlier. But I decided to use the test_data because it makes the results more directly comparable with our earlier, unregularized results. You can easily change the code to use the validation_data instead, and you'll find that it gives similar results.

>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
>>> import network2
>>> net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
>>> net.large_weight_initializer()
>>> net.SGD(training_data[:1000], 400, 10, 0.5,
... evaluation_data=test_data, lmbda=0.1,
... monitor_evaluation_cost=True, monitor_evaluation_accuracy=True,
... monitor_training_cost=True, monitor_training_accuracy=True)

The cost on the training data decreases over the whole time, much as it did in the earlier, unregularized case*:

*This and the next two graphs were produced with the program overfitting.py.

[Figure: cost on the 1,000-image training set, decreasing over 400 epochs.]

But this time the accuracy on the test_data continues to increase for the entire 400 epochs:

[Figure: test accuracy climbing to about 87 percent over 400 epochs.]

Clearly, the use of regularization has suppressed overfitting. What's more, the accuracy is considerably higher, with a peak classification accuracy of 87.1 percent, compared to the peak of 82.27 percent obtained in the unregularized case. Indeed, we could almost certainly get considerably better results by continuing to train past 400 epochs. It seems that, empirically, regularization is causing our network to generalize better, and considerably reducing the effects of overfitting.

What happens if we move out of the artificial environment of just having 1,000 training images, and return to the full 50,000 image training set? Of course, we've seen already that overfitting is much less of a problem with the full 50,000 images. Does regularization help any further? Let's keep the hyper-parameters the same as before: 30 epochs, learning rate 0.5, mini-batch size of 10. However, we need to modify the regularization parameter. The reason is because the size n of the training set has changed from n = 1,000 to n = 50,000, and this changes the weight decay factor 1 − ηλ/n. If we continued to use λ = 0.1 that would mean much less weight decay, and thus much less of a regularization effect. We compensate by changing to λ = 5.0.
Okay, let's train our network, stopping first to re-initialize the weights:

>>> net.large_weight_initializer()
>>> net.SGD(training_data, 30, 10, 0.5,
... evaluation_data=test_data, lmbda=5.0,
... monitor_evaluation_accuracy=True, monitor_training_accuracy=True)

We obtain the results:

[Figure: training and test accuracy on the full 50,000-image training set, over 30 epochs.]

There's lots of good news here. First, our classification accuracy on the test data is up, from 95.49 percent when running unregularized, to 96.49 percent. That's a big improvement. Second, we can see that the gap between results on the training and test data is much narrower than before, running at under a percent. That's still a significant gap, but we've obviously made substantial progress reducing overfitting.
Finally, let's see what test classification accuracy we get when we use 100 hidden neurons and a regularization parameter of λ = 5.0. I won't go through a detailed analysis of overfitting here, this is purely for fun, just to see how high an accuracy we can get when we use our new tricks: the cross-entropy cost function and L2 regularization.

>>> net = network2.Network([784, 100, 10], cost=network2.CrossEntropyCost)
>>> net.large_weight_initializer()
>>> net.SGD(training_data, 30, 10, 0.5, lmbda=5.0,
... evaluation_data=validation_data,
... monitor_evaluation_accuracy=True)

The final result is a classification accuracy of 97.92 percent on the validation data. That's a big jump from the 30 hidden neuron case. In fact, tuning just a little more, to run for 60 epochs at η = 0.1 and λ = 5.0 we break the 98 percent barrier, achieving 98.04 percent classification accuracy on the validation data. Not bad for what turns out to be 152 lines of code!
I've described regularization as a way to reduce overfitting and to increase classification accuracies. In fact, that's not the only benefit. Empirically, when doing multiple runs of our MNIST networks, but with different (random) weight initializations, I've found that the unregularized runs will occasionally get "stuck", apparently caught in local minima of the cost function. The result is that different runs sometimes provide quite different results. By contrast, the regularized runs have provided much more easily replicable results.

Why is this going on? Heuristically, if the cost function is unregularized, then the length of the weight vector is likely to grow, all other things being equal. Over time this can lead to the weight vector being very large indeed. This can cause the weight vector to get stuck pointing in more or less the same direction, since changes due to gradient descent only make tiny changes to the direction, when the length is long. I believe this phenomenon is making it hard for our learning algorithm to properly explore the weight space, and consequently harder to find good minima of the cost function.

Why does regularization help reduce overfitting?
We've seen empirically that regularization helps reduce overfitting. That's encouraging but, unfortunately, it's not obvious why regularization helps! A standard story people tell to explain what's going on is along the following lines: smaller weights are, in some sense, lower complexity, and so provide a simpler and more powerful explanation for the data, and should thus be preferred. That's a pretty terse story, though, and contains several elements that perhaps seem dubious or mystifying. Let's unpack the story and examine it critically. To do that, let's suppose we have a simple data set for which we wish to build a model:

10
9
8
7
6

5
4
3
2
1
0
0

Implicitly, we're studying some real-world phenomenon here, with x and y representing real-world data. Our goal is to build a model which lets us predict y as a function of x. We could try using neural networks to build such a model, but I'm going to do something even simpler: I'll try to model y as a polynomial in x. I'm doing this instead of using neural nets because using polynomials will make things particularly transparent. Once we've understood the polynomial case, we'll translate to neural networks. Now, there are ten points in the graph above, which means we can find a unique 9th-order polynomial y = a_0 x^9 + a_1 x^8 + … + a_9 which fits the data exactly. Here's the graph of that polynomial*:

*I won't show the coefficients explicitly, although they are easy to find using a routine such as Numpy's polyfit. You can view the exact form of the polynomial in the source code for the graph if you're curious. It's the function p(x) defined starting on line 14 of the program which produces the graph.


[Graph: the unique 9th-order polynomial, passing exactly through all ten data points]

That provides an exact fit. But we can also get a good fit using the linear model y = 2x:

[Graph: the linear model y = 2x, a good but not exact fit to the ten points]
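Incidentally, if you'd like to experiment with fits like these, they're easy to reproduce using Numpy's polyfit routine. Here's a minimal sketch; the ten data points below are invented for illustration, since I haven't listed the actual data used in the graphs:

import numpy as np

# Made-up data roughly following y = 2x, plus a little noise.
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])
y = 2*x + np.array([0.2, -0.3, 0.1, 0.4, -0.2, 0.3, -0.1, 0.2, -0.4, 0.1])

p9 = np.polyfit(x, y, 9)  # the unique 9th-order polynomial: exact fit
p1 = np.polyfit(x, y, 1)  # the linear model: approximate fit

print np.max(np.abs(np.polyval(p9, x) - y))  # essentially zero
print p1  # the fitted slope is close to 2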

Which of these is the better model? Which is more likely to be true? And which model is more likely to generalize well to other examples of the same underlying real-world phenomenon?

These are difficult questions. In fact, we can't determine with certainty the answer to any of the above questions, without much more information about the underlying real-world phenomenon. But let's consider two possibilities: (1) the 9th-order polynomial is, in fact, the model which truly describes the real-world phenomenon, and the model will therefore generalize perfectly; (2) the correct model is y = 2x, but there's a little additional noise due to, say, measurement error, and that's why the model isn't an exact fit.
It's not a priori possible to say which of these two possibilities is
correct. (Or, indeed, if some third possibility holds.) Logically, either could be true. And it's not a trivial difference. It's true that on the data provided there's only a small difference between the two models. But suppose we want to predict the value of y corresponding to some large value of x, much larger than any shown on the graph above. If we try to do that there will be a dramatic difference between the predictions of the two models, as the 9th-order polynomial model comes to be dominated by the x^9 term, while the linear model remains, well, linear.
One point of view is to say that in science we should go with the simpler explanation, unless compelled not to. When we find a simple model that seems to explain many data points we are tempted to shout "Eureka!" After all, it seems unlikely that a simple explanation should occur merely by coincidence. Rather, we suspect that the model must be expressing some underlying truth about the phenomenon. In the case at hand, the model y = 2x + noise seems much simpler than y = a_0 x^9 + a_1 x^8 + …. It would be surprising if that simplicity had occurred by chance, and so we suspect that y = 2x + noise expresses some underlying truth. In this point of view, the 9th-order model is really just learning the effects of local noise. And so while the 9th-order model works perfectly for these particular data points, the model will fail to generalize to other data points, and the noisy linear model will have greater predictive power.
Let's see what this point of view means for neural networks. Suppose our network mostly has small weights, as will tend to happen in a regularized network. The smallness of the weights means that the behaviour of the network won't change too much if we change a few random inputs here and there. That makes it difficult for a regularized network to learn the effects of local noise in the data. Think of it as a way of making it so single pieces of evidence don't matter too much to the output of the network. Instead, a regularized network learns to respond to types of evidence which are seen often across the training set. By contrast, a network with large weights may change its behaviour quite a bit in response to small changes in the input. And so an unregularized network can use large weights to learn a complex model that carries a lot of information about the noise in the training data. In a nutshell, regularized networks are constrained to build relatively
simple models based on patterns seen often in the training data, and are resistant to learning peculiarities of the noise in the training data. The hope is that this will force our networks to do real learning about the phenomenon at hand, and to generalize better from what they learn.

With that said, this idea of preferring simpler explanations should make you nervous. People sometimes refer to this idea as "Occam's Razor", and will zealously apply it as though it has the status of some general scientific principle. But, of course, it's not a general scientific principle. There is no a priori logical reason to prefer simple explanations over more complex explanations. Indeed, sometimes the more complex explanation turns out to be correct.
Let me describe two examples where more complex explanations have turned out to be correct. In the 1940s the physicist Marcel Schein announced the discovery of a new particle of nature. The company he worked for, General Electric, was ecstatic, and publicized the discovery widely. But the physicist Hans Bethe was skeptical. Bethe visited Schein, and looked at the plates showing the tracks of Schein's new particle. Schein showed Bethe plate after plate, but on each plate Bethe identified some problem that suggested the data should be discarded. Finally, Schein showed Bethe a plate that looked good. Bethe said it might just be a statistical fluke. Schein: "Yes, but the chance that this would be statistics, even according to your own formula, is one in five." Bethe: "But we have already looked at five plates." Finally, Schein said: "But on my plates, each one of the good plates, each one of the good pictures, you explain by a different theory, whereas I have one hypothesis that explains all the plates, that they are [the new particle]." Bethe replied: "The sole difference between your and my explanations is that yours is wrong and all of mine are right. Your single explanation is wrong, and all of my multiple explanations are right." Subsequent work confirmed that Nature agreed with Bethe, and Schein's particle is no more*.

*The story is related by the physicist Richard Feynman in an interview with the historian Charles Weiner.

As a second example, in 1859 the astronomer Urbain Le Verrier observed that the orbit of the planet Mercury doesn't have quite the shape that Newton's theory of gravitation says it should have. It was a tiny, tiny deviation from Newton's theory, and several of the explanations proffered at the time boiled down to saying that
Newton's theory was more or less right, but needed a tiny alteration. In 1916, Einstein showed that the deviation could be explained very well using his general theory of relativity, a theory radically different to Newtonian gravitation, and based on much more complex mathematics. Despite that additional complexity, today it's accepted that Einstein's explanation is correct, and Newtonian gravity, even in its modified forms, is wrong. This is in part because we now know that Einstein's theory explains many other phenomena which Newton's theory has difficulty with. Furthermore, and even more impressively, Einstein's theory accurately predicts several phenomena which aren't predicted by Newtonian gravity at all. But these impressive qualities weren't entirely obvious in the early days. If one had judged merely on the grounds of simplicity, then some modified form of Newton's theory would arguably have been more attractive.
There are three morals to draw from these stories. First, it can be quite a subtle business deciding which of two explanations is truly "simpler". Second, even if we can make such a judgment, simplicity is a guide that must be used with great caution! Third, the true test of a model is not simplicity, but rather how well it does in predicting new phenomena, in new regimes of behaviour.

With that said, and keeping the need for caution in mind, it's an empirical fact that regularized neural networks usually generalize better than unregularized networks. And so through the remainder of the book we will make frequent use of regularization. I've included the stories above merely to help convey why no one has yet developed an entirely convincing theoretical explanation for why regularization helps networks generalize. Indeed, researchers continue to write papers where they try different approaches to regularization, compare them to see which works better, and attempt to understand why different approaches work better or worse. And so you can view regularization as something of a kludge. While it often helps, we don't have an entirely satisfactory systematic understanding of what's going on, merely incomplete heuristics and rules of thumb.
There's a deeper set of issues here, issues which go to the heart of science. It's the question of how we generalize. Regularization may give us a computational magic wand that helps our networks
generalize better, but it doesn't give us a principled understanding of how generalization works, nor of what the best approach is*.

*These issues go back to the problem of induction, famously discussed by the Scottish philosopher David Hume in "An Enquiry Concerning Human Understanding" (1748). The problem of induction has been given a modern machine learning form in the no-free-lunch theorem (link) of David Wolpert and William Macready (1997).

This is particularly galling because in everyday life, we humans generalize phenomenally well. Shown just a few images of an elephant a child will quickly learn to recognize other elephants. Of course, they may occasionally make mistakes, perhaps confusing a rhinoceros for an elephant, but in general this process works remarkably accurately. So we have a system, the human brain, with a huge number of free parameters. And after being shown just one or a few training images that system learns to generalize to other images. Our brains are, in some sense, regularizing amazingly well! How do we do it? At this point we don't know. I expect that in years to come we will develop more powerful techniques for regularization in artificial neural networks, techniques that will ultimately enable neural nets to generalize well even from small data sets.
In fact, our networks already generalize better than one might a priori expect. A network with 100 hidden neurons has nearly 80,000 parameters. We have only 50,000 images in our training data. It's like trying to fit an 80,000th degree polynomial to 50,000 data points. By all rights, our network should overfit terribly. And yet, as we saw earlier, such a network actually does a pretty good job generalizing. Why is that the case? It's not well understood. It has been conjectured* that "the dynamics of gradient descent learning in multilayer nets has a `self-regularization' effect". This is exceptionally fortunate, but it's also somewhat disquieting that we don't understand why it's the case. In the meantime, we will adopt the pragmatic approach and use regularization whenever we can. Our neural networks will be the better for it.

*In Gradient-Based Learning Applied to Document Recognition, by Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner (1998).
Let me conclude this section by returning to a detail which I left unexplained earlier: the fact that L2 regularization doesn't constrain the biases. Of course, it would be easy to modify the regularization procedure to regularize the biases. Empirically, doing this often doesn't change the results very much, so to some extent it's merely a convention whether to regularize the biases or not. However, it's worth noting that having a large bias doesn't make a neuron sensitive to its inputs in the same way as having large weights. And so we don't need to worry about large biases enabling
our network to learn the noise in our training data. At the same time, allowing large biases gives our networks more flexibility in behaviour; in particular, large biases make it easier for neurons to saturate, which is sometimes desirable. For these reasons we don't usually include bias terms when regularizing.

Other techniques for regularization
There are many regularization techniques other than L2 regularization. In fact, so many techniques have been developed that I can't possibly summarize them all. In this section I briefly describe three other approaches to reducing overfitting: L1 regularization, dropout, and artificially increasing the training set size. We won't go into nearly as much depth studying these techniques as we did earlier. Instead, the purpose is to get familiar with the main ideas, and to appreciate something of the diversity of regularization techniques available.
L1 regularization: In this approach we modify the unregularized cost function by adding the sum of the absolute values of the weights:

C = C_0 + (λ/n) Σ_w |w|.   (95)
Intuitively, this is similar to L2 regularization, penalizing large weights, and tending to make the network prefer small weights. Of course, the L1 regularization term isn't the same as the L2 regularization term, and so we shouldn't expect to get exactly the same behaviour. Let's try to understand how the behaviour of a network trained using L1 regularization differs from a network trained using L2 regularization.
To do that, we'll look at the partial derivatives of the cost function. Differentiating (95) we obtain:

∂C/∂w = ∂C_0/∂w + (λ/n) sgn(w),   (96)

where sgn(w) is the sign of w, that is, +1 if w is positive, and −1 if w is negative. Using this expression, we can easily modify backpropagation to do stochastic gradient descent using L1 regularization. The resulting update rule for an L1 regularized network is

w → w' = w − (ηλ/n) sgn(w) − η ∂C_0/∂w,   (97)

where, as per usual, we can estimate ∂C_0/∂w using a mini-batch average, if we wish. Compare that to the update rule for L2 regularization (c.f. Equation (93)),

w → w' = w (1 − ηλ/n) − η ∂C_0/∂w.

In both expressions the effect of regularization is to shrink the weights. This accords with our intuition that both kinds of regularization penalize large weights. But the way the weights shrink is different. In L1 regularization, the weights shrink by a constant amount toward 0. In L2 regularization, the weights shrink by an amount which is proportional to w. And so when a particular weight has a large magnitude, |w|, L1 regularization shrinks the weight much less than L2 regularization does. By contrast, when |w| is small, L1 regularization shrinks the weight much more than L2 regularization. The net result is that L1 regularization tends to concentrate the weight of the network in a relatively small number of high-importance connections, while the other weights are driven toward zero.
I've glossed over an issue in the above discussion, which is that the partial derivative ∂C/∂w isn't defined when w = 0. The reason is that the function |w| has a sharp "corner" at w = 0, and so isn't differentiable at that point. That's okay, though. What we'll do is just apply the usual (unregularized) rule for stochastic gradient descent when w = 0. That should be okay: intuitively, the effect of regularization is to shrink weights, and obviously it can't shrink a weight which is already 0. To put it more precisely, we'll use Equations (96) and (97) with the convention that sgn(0) = 0. That gives a nice, compact rule for doing stochastic gradient descent with L1 regularization.
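In code, that rule is a one-line change to gradient descent. Here's a sketch, written as a variant of the Network.update_mini_batch method from network2.py (the program developed in full later in this chapter); it's an illustration of rule (97), not part of network2.py itself, and it leans on the handy fact that Numpy's np.sign(0) is 0, exactly the convention we just adopted:

    def update_mini_batch_l1(self, mini_batch, eta, lmbda, n):
        # Identical to network2.py's update_mini_batch, except that the
        # weight update follows the L1 rule (97) rather than the L2 rule.
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        # L1: shrink each weight by the constant eta*(lmbda/n)*sgn(w).
        self.weights = [w - eta*(lmbda/n)*np.sign(w)
                        - (eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b - (eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]

A problem at the end of this chapter asks you to carry this modification through and test it on MNIST, so you may prefer to treat this sketch as a hint rather than a finished solution.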
Dropout: Dropout is a radically different technique for regularization. Unlike L1 and L2 regularization, dropout doesn't rely on modifying the cost function. Instead, in dropout we modify the network itself. Let me describe the basic mechanics of how
dropout works, before getting into why it works, and what the results are.
Suppose we're trying to train a network:

[Diagram: a small feedforward network]

In particular, suppose we have a training input x and corresponding desired output y. Ordinarily, we'd train by forward-propagating x through the network, and then backpropagating to determine the contribution to the gradient. With dropout, this process is modified. We start by randomly (and temporarily) deleting half the hidden neurons in the network, while leaving the input and output neurons untouched. After doing this, we'll end up with a network along the following lines. Note that the dropout neurons, i.e., the neurons which have been temporarily deleted, are still ghosted in:

[Diagram: the same network with half the hidden neurons temporarily deleted]

We forward-propagate the input x through the modified network, and then backpropagate the result, also through the modified network. After doing this over a mini-batch of examples, we update the appropriate weights and biases. We then repeat the process, first restoring the dropout neurons, then choosing a new random
subset of hidden neurons to delete, estimating the gradient for a different mini-batch, and updating the weights and biases in the network.
By repeating this process over and over, our network will learn a set of weights and biases. Of course, those weights and biases will have been learnt under conditions in which half the hidden neurons were dropped out. When we actually run the full network that means that twice as many hidden neurons will be active. To compensate for that, we halve the weights outgoing from the hidden neurons.
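Here's a minimal sketch of those mechanics for a single hidden layer. It's not taken from network2.py; the function name and the 50 percent dropout rate are just for illustration:

import numpy as np

def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

def dropout_hidden_activations(w, b, x, training):
    # ``w`` and ``b`` are the hidden layer's weights and biases,
    # ``x`` the input activations.
    if training:
        # Randomly (and temporarily) delete half the hidden neurons.
        mask = np.random.binomial(1, 0.5, size=b.shape)
        return mask*sigmoid(np.dot(w, x)+b)
    else:
        # Running the full network: twice as many hidden neurons are
        # active, so we halve their outgoing influence.  Halving the
        # activations here is equivalent to halving the weights going
        # out of the hidden layer.
        return 0.5*sigmoid(np.dot(w, x)+b)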
This dropout procedure may seem strange and ad hoc. Why would we expect it to help with regularization? To explain what's going on, I'd like you to briefly stop thinking about dropout, and instead imagine training neural networks in the standard way (no dropout). In particular, imagine we train several different neural networks, all using the same training data. Of course, the networks may not start out identical, and as a result after training they may sometimes give different results. When that happens we could use some kind of averaging or voting scheme to decide which output to accept. For instance, if we have trained five networks, and three of them are classifying a digit as a "3", then it probably really is a "3". The other two networks are probably just making a mistake. This kind of averaging scheme is often found to be a powerful (though expensive) way of reducing overfitting. The reason is that the different networks may overfit in different ways, and averaging may help eliminate that kind of overfitting.
What's this got to do with dropout? Heuristically, when we dropout different sets of neurons, it's rather like we're training different neural networks. And so the dropout procedure is like averaging the effects of a very large number of different networks. The different networks will overfit in different ways, and so, hopefully, the net effect of dropout will be to reduce overfitting.
A related heuristic explanation for dropout is given in one of the earliest papers to use the technique*: "This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons." In other words, if we think of our network as a model which is making predictions, then we can think of dropout as a way of making sure that the model is robust to the loss of any individual piece of evidence. In this, it's somewhat similar to L1 and L2 regularization, which tend to reduce weights, and thus make the network more robust to losing any individual connection in the network.

*ImageNet Classification with Deep Convolutional Neural Networks, by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton (2012).
Of course, the true measure of dropout is that it has been very successful in improving the performance of neural networks. The original paper* introducing the technique applied it to many different tasks. For us, it's of particular interest that they applied dropout to MNIST digit classification, using a vanilla feedforward neural network along lines similar to those we've been considering. The paper noted that the best result anyone had achieved up to that point using such an architecture was 98.4 percent classification accuracy on the test set. They improved that to 98.7 percent accuracy using a combination of dropout and a modified form of L2 regularization. Similarly impressive results have been obtained for many other tasks, including problems in image and speech recognition, and natural language processing. Dropout has been especially useful in training large, deep networks, where the problem of overfitting is often acute.

*Improving neural networks by preventing co-adaptation of feature detectors, by Geoffrey Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov (2012). Note that the paper discusses a number of subtleties that I have glossed over in this brief introduction.
Artificially expanding the training data: We saw earlier that our MNIST classification accuracy dropped down to percentages in the mid-80s when we used only 1,000 training images. It's not surprising that this is the case, since less training data means our network will be exposed to fewer variations in the way human beings write digits. Let's try training our 30 hidden neuron network with a variety of different training data set sizes, to see how performance varies. We train using a mini-batch size of 10, a learning rate η = 0.5, a regularization parameter λ = 5.0, and the cross-entropy cost function. We will train for 30 epochs when the full training data set is used, and scale up the number of epochs proportionally when smaller training sets are used. To ensure the weight decay factor remains the same across training sets, we will use a regularization parameter of λ = 5.0 when the full training data set is used, and scale λ down proportionally when smaller training sets are used*.

*This and the next two graphs are produced with the program more_data.py.

[Graph: classification accuracy as a function of training set size]

As you can see, the classification accuracies improve considerably as we use more training data. Presumably this improvement would continue still further if more data was available. Of course, looking at the graph above it does appear that we're getting near saturation. Suppose, however, that we redo the graph with the training set size plotted logarithmically:

[Graph: the same accuracies, with training set size plotted on a logarithmic scale]

It seems clear that the graph is still going up toward the end. This suggests that if we used vastly more training data, say, millions or even billions of handwriting samples, instead of just 50,000, then we'd likely get considerably better performance, even from this very small network.
Obtaining more training data is a great idea. Unfortunately, it can be expensive, and so is not always possible in practice. However, there's another idea which can work nearly as well, and that's to
artificially expand the training data. Suppose, for example, that we take an MNIST training image of a five,

[Image: an MNIST training image of a five]

and rotate it by a small amount, let's say 15 degrees:

[Image: the same five, rotated by 15 degrees]

It's still recognizably the same digit. And yet at the pixel level it's quite different to any image currently in the MNIST training data. It's conceivable that adding this image to the training data might help our network learn more about how to classify digits. What's more, obviously we're not limited to adding just this one image. We can expand our training data by making many small rotations of all the MNIST training images, and then using the expanded training data to improve our network's performance.
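Here's a rough sketch of how such an expansion might be coded, assuming SciPy is available and that the training data consists of the (784, 1) image vectors produced by mnist_loader; the function name and the use of a single 15-degree rotation are just illustrative:

import numpy as np
from scipy import ndimage

def expand_with_rotations(training_data, angle=15.0):
    # Return the original (x, y) pairs together with copies in which
    # each 28x28 MNIST image has been rotated by ``angle`` degrees.
    expanded = list(training_data)
    for x, y in training_data:
        image = x.reshape(28, 28)
        rotated = ndimage.rotate(image, angle, reshape=False)
        expanded.append((rotated.reshape(784, 1), y))
    return expanded

In practice you'd likely use several small angles, rather than a single rotation, and combine rotations with the translations and skews described below.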
This idea is very powerful and has been widely used. Let's look at some of the results from a paper* which applied several variations of the idea to MNIST. One of the neural network architectures they considered was along similar lines to what we've been using, a feedforward network with 800 hidden neurons and using the cross-entropy cost function. Running the network with the standard MNIST training data they achieved a classification accuracy of 98.4 percent on their test set. But then they expanded the training data, using not just rotations, as I described above, but also translating and skewing the images. By training on the expanded data set they increased their network's accuracy to 98.9 percent. They also experimented with what they called "elastic distortions", a special type of image distortion intended to emulate the random oscillations found in hand muscles. By using the elastic distortions to expand the data they achieved an even higher accuracy, 99.3 percent. Effectively, they were broadening the experience of their network by exposing it to the sort of variations that are found in real handwriting.

*Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis, by Patrice Simard, Dave Steinkraus, and John Platt (2003).
Variations on this idea can be used to improve performance on
many learning tasks, not just handwriting recognition. The general principle is to expand the training data by applying operations that reflect real-world variation. It's not difficult to think of ways of doing this. Suppose, for example, that you're building a neural network to do speech recognition. We humans can recognize speech even in the presence of distortions such as background noise. And so you can expand your data by adding background noise. We can also recognize speech if it's sped up or slowed down. So that's another way we can expand the training data. These techniques are not always used; for instance, instead of expanding the training data by adding noise, it may well be more efficient to clean up the input to the network by first applying a noise reduction filter. Still, it's worth keeping the idea of expanding the training data in mind, and looking for opportunities to apply it.

Exercise

As discussed above, one way of expanding the MNIST training data is to use small rotations of training images. What's a problem that might occur if we allow arbitrarily large rotations of training images?
An aside on big data and what it means to compare classification accuracies: Let's look again at how our neural network's accuracy varies with training set size:

[Graph: the neural network's classification accuracy as a function of training set size, repeated from above]

Suppose that instead of using a neural network we use some other machine learning technique to classify digits. For instance, let's try using the support vector machines (SVM) which we met briefly
back in Chapter 1. As was the case in Chapter 1, don't worry if you're not familiar with SVMs, we don't need to understand their details. Instead, we'll use the SVM supplied by the scikit-learn library. Here's how SVM performance varies as a function of training set size. I've plotted the neural net results as well, to make comparison easy*:

[Graph: SVM and neural network classification accuracy as functions of training set size]

*This graph was produced with the program more_data.py (as were the last few graphs).

Probably the first thing that strikes you about this graph is that our neural network outperforms the SVM for every training set size. That's nice, although you shouldn't read too much into it, since I just used the out-of-the-box settings from scikit-learn's SVM, while we've done a fair bit of work improving our neural network. A more subtle but more interesting fact about the graph is that if we train our SVM using 50,000 images then it actually has better performance (94.48 percent accuracy) than our neural network does when trained using 5,000 images (93.24 percent accuracy). In other words, more training data can sometimes compensate for differences in the machine learning algorithm used.
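If you'd like to reproduce the SVM side of the comparison, here's roughly how it can be set up. This is a sketch using scikit-learn's default settings, and it assumes the mnist_loader.load_data function from the book's code repository, which returns the images and digit labels as plain Numpy arrays:

import mnist_loader
from sklearn import svm

training_data, validation_data, test_data = mnist_loader.load_data()
clf = svm.SVC()  # out-of-the-box settings, as in the text
# Training on all 50,000 images is slow; try a slice such as
# training_data[0][:5000] first if you're impatient.
clf.fit(training_data[0], training_data[1])
predictions = clf.predict(test_data[0])
num_correct = sum(int(p == y) for p, y in zip(predictions, test_data[1]))
print "%s / %s correct" % (num_correct, len(test_data[1]))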
Something even more interesting can occur. Suppose we're trying to solve a problem using two machine learning algorithms, algorithm A and algorithm B. It sometimes happens that algorithm A will outperform algorithm B with one set of training data, while algorithm B will outperform algorithm A with a different set of training data. We don't see that above (it would require the two graphs to cross), but it does happen*. The correct response to the question "Is algorithm A better than algorithm B?" is really: "What training data set are you using?"

*Striking examples may be found in Scaling to very very large corpora for natural language disambiguation, by Michele Banko and Eric Brill (2001).


All this is a caution to keep in mind, both when doing development, and when reading research papers. Many papers focus on finding new tricks to wring out improved performance on standard benchmark data sets. "Our whiz-bang technique gave us an improvement of X percent on standard benchmark Y" is a canonical form of research claim. Such claims are often genuinely interesting, but they must be understood as applying only in the context of the specific training data set used. Imagine an alternate history in which the people who originally created the benchmark data set had a larger research grant. They might have used the extra money to collect more training data. It's entirely possible that the "improvement" due to the whiz-bang technique would disappear on a larger data set. In other words, the purported improvement might be just an accident of history. The message to take away, especially in practical applications, is that what we want is both better algorithms and better training data. It's fine to look for better algorithms, but make sure you're not focusing on better algorithms to the exclusion of easy wins getting more or better training data.

Problem

(Research problem) How do our machine learning algorithms perform in the limit of very large data sets? For any given algorithm it's natural to attempt to define a notion of asymptotic performance in the limit of truly big data. A quick-and-dirty approach to this problem is to simply try fitting curves to graphs like those shown above, and then to extrapolate the fitted curves out to infinity. An objection to this approach is that different approaches to curve fitting will give different notions of asymptotic performance. Can you find a principled justification for fitting to some particular class of curves? If so, compare the asymptotic performance of several different machine learning algorithms.
Summing up: We've now completed our dive into overfitting and regularization. Of course, we'll return again to the issue. As I've mentioned several times, overfitting is a major problem in neural networks, especially as computers get more powerful, and we have the ability to train larger networks. As a result there's a pressing need to develop powerful regularization techniques to reduce overfitting, and this is an extremely active area of current work.
Weight initialization
When we create our neural networks, we have to make choices for the initial weights and biases. Up to now, we've been choosing them according to a prescription which I discussed only briefly back in Chapter 1. Just to remind you, that prescription was to choose both the weights and biases using independent Gaussian random variables, normalized to have mean 0 and standard deviation 1. While this approach has worked well, it was quite ad hoc, and it's worth revisiting to see if we can find a better way of setting our initial weights and biases, and perhaps help our neural networks learn faster.
It turns out that we can do quite a bit better than initializing with normalized Gaussians. To see why, suppose we're working with a network with a large number, say 1,000, of input neurons. And let's suppose we've used normalized Gaussians to initialize the weights connecting to the first hidden layer. For now I'm going to concentrate specifically on the weights connecting the input neurons to the first neuron in the hidden layer, and ignore the rest of the network:

[Diagram: the 1,000 input neurons feeding a single hidden neuron]

We'll suppose for simplicity that we're trying to train using a training input x in which half the input neurons are on, i.e., set to 1, and half the input neurons are off, i.e., set to 0. The argument which follows applies more generally, but you'll get the gist from this special case. Let's consider the weighted sum z = Σ_j w_j x_j + b of inputs to our hidden neuron. 500 terms in this sum vanish, because the corresponding input x_j is zero. And so z is a sum over a total of 501 normalized Gaussian random variables, accounting for the 500 weight terms and the 1 extra bias term. Thus z is itself distributed as a Gaussian with mean zero and standard deviation √501 ≈ 22.4.
That is, z has a very broad Gaussian distribution, not sharply peaked at all:

[Graph: a very flat Gaussian, spread over roughly −30 to 30, with peak height only about 0.02]

In particular, we can see from this graph that it's quite likely that |z| will be pretty large, i.e., either z ≫ 1 or z ≪ −1. If that's the case then the output σ(z) from the hidden neuron will be very close to either 1 or 0. That means our hidden neuron will have saturated. And when that happens, as we know, making small changes in the weights will make only absolutely miniscule changes in the activation of our hidden neuron. That miniscule change in the activation of the hidden neuron will, in turn, barely affect the rest of the neurons in the network at all, and we'll see a correspondingly miniscule change in the cost function. As a result, those weights will only learn very slowly when we use the gradient descent algorithm*.

*We discussed this in more detail in Chapter 2, where we used the equations of backpropagation to show that weights input to saturated neurons learned slowly.

It's similar to the problem we discussed earlier in this chapter, in which output neurons which saturated on the wrong value caused learning to slow down. We addressed that earlier problem with a clever choice of cost function. Unfortunately, while that helped with saturated output neurons, it does nothing at all for the problem with saturated hidden neurons.
I've been talking about the weights input to the first hidden layer. Of course, similar arguments apply also to later hidden layers: if the weights in later hidden layers are initialized using normalized Gaussians, then activations will often be very close to 0 or 1, and learning will proceed very slowly.
Is there some way we can choose better initializations for the weights and biases, so that we don't get this kind of saturation, and so avoid a learning slowdown? Suppose we have a neuron with n_in input weights. Then we shall initialize those weights as Gaussian random variables with mean 0 and standard deviation 1/√n_in. That is, we'll squash the Gaussians down, making it less likely that our neuron will saturate. We'll continue to choose the bias as a Gaussian with mean 0 and standard deviation 1, for reasons I'll return to in a moment. With these choices, the weighted sum
z = Σ_j w_j x_j + b will again be a Gaussian random variable with mean 0, but it'll be much more sharply peaked than it was before. Suppose, as we did earlier, that 500 of the inputs are zero and 500 are 1. Then it's easy to show (see the exercise below) that z has a Gaussian distribution with mean 0 and standard deviation √(3/2) ≈ 1.22. This is much more sharply peaked than before, so much so that even the graph below understates the situation, since I've had to rescale the vertical axis, when compared to the earlier graph:
[Graph: a sharply peaked Gaussian on the same −30 to 30 axes; the vertical axis now reaches 0.4]

Such a neuron is much less likely to saturate, and correspondingly much less likely to have problems with a learning slowdown.
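If you'd like a quick empirical check of the two standard deviations quoted above (roughly 22.4 and 1.22), here's a small simulation sketch. It samples z directly rather than building a network, and isn't part of the book's code:

import numpy as np

n_in = 1000
x = np.zeros((n_in,))
x[:500] = 1.0  # half the inputs on, half off
for scale in [1.0, 1.0/np.sqrt(n_in)]:  # old, then new, initialization
    zs = [np.dot(np.random.randn(n_in)*scale, x) + np.random.randn()
          for _ in xrange(10000)]
    print np.std(zs)  # about 22.4, then about 1.22

The exercise below asks you to derive the second value, √(3/2), analytically.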

Exercise

Verify that the standard deviation of z = Σ_j w_j x_j + b in the paragraph above is √(3/2). It may help to know that: (a) the variance of a sum of independent random variables is the sum of the variances of the individual random variables; and (b) the variance is the square of the standard deviation.
I stated above that we'll continue to initialize the biases as before, as Gaussian random variables with a mean of 0 and a standard deviation of 1. This is okay, because it doesn't make it too much more likely that our neurons will saturate. In fact, it doesn't much matter how we initialize the biases, provided we avoid the problem with saturation. Some people go so far as to initialize all the biases to 0, and rely on gradient descent to learn appropriate biases. But since it's unlikely to make much difference, we'll continue with the same initialization procedure as before.

Let's compare the results for both our old and new approaches to weight initialization, using the MNIST digit classification task. As before, we'll use 30 hidden neurons, a mini-batch size of 10, a regularization parameter λ = 5.0, and the cross-entropy cost function. We will decrease the learning rate slightly from η = 0.5 to 0.1, since that makes the results a little more easily visible in the graphs. We can train using the old method of weight initialization:
>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
>>> import network2
>>> net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
>>> net.large_weight_initializer()
>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
... evaluation_data=validation_data,
... monitor_evaluation_accuracy=True)

We can also train using the new approach to weight initialization. This is actually even easier, since network2's default way of initializing the weights is using this new approach. That means we can omit the net.large_weight_initializer() call above:
>>> net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
... evaluation_data=validation_data,
... monitor_evaluation_accuracy=True)

Plotting the results*, we obtain:

[Graph: classification accuracy on the validation data for the old and new approaches to weight initialization, 30 hidden neurons]

*The program used to generate this and the next graph is weight_initialization.py.
In both cases, we end up with a classification accuracy somewhat over 96 percent. The final classification accuracy is almost exactly the same in the two cases. But the new initialization technique brings us there much, much faster. At the end of the first epoch of training the old approach to weight initialization has a classification
accuracy under 87 percent, while the new approach is already almost 93 percent. What appears to be going on is that our new approach to weight initialization starts us off in a much better regime, which lets us get good results much more quickly. The same phenomenon is also seen if we plot results with 100 hidden neurons:

[Graph: the same comparison, with 100 hidden neurons]

In this case, the two curves don't quite meet. However, my experiments suggest that with just a few more epochs of training (not shown) the accuracies become almost exactly the same. So on the basis of these experiments it looks as though the improved weight initialization only speeds up learning, it doesn't change the final performance of our networks. However, in Chapter 4 we'll see examples of neural networks where the long-run behaviour is significantly better with the 1/√n_in weight initialization. Thus it's not only the speed of learning which is improved, it's sometimes also the final performance.
The 1/√n_in approach to weight initialization helps improve the way our neural nets learn. Other techniques for weight initialization have also been proposed, many building on this basic idea. I won't review the other approaches here, since 1/√n_in works well enough for our purposes. If you're interested in looking further, I recommend looking at the discussion on pages 14 and 15 of a 2012 paper by Yoshua Bengio*, as well as the references therein.

*Practical Recommendations for Gradient-Based Training of Deep Architectures, by Yoshua Bengio (2012).

Problem
Connecting regularization and the improved method of weight initialization: L2 regularization sometimes automatically gives us something similar to the new approach to weight initialization. Suppose we are using the old approach to weight initialization. Sketch a heuristic argument that: (1) supposing λ is not too small, the first epochs of training will be dominated almost entirely by weight decay; (2) provided ηλ ≪ n the weights will decay by a factor of exp(−ηλ/m) per epoch; and (3) supposing λ is not too large, the weight decay will tail off when the weights are down to a size around 1/√n, where n is the total number of weights in the network. Argue that these conditions are all satisfied in the examples graphed in this section.

Handwriting recognition revisited: the code
Let's implement the ideas we've discussed in this chapter. We'll develop a new program, network2.py, which is an improved version of the program network.py we developed in Chapter 1. If you haven't looked at network.py in a while then you may find it helpful to spend a few minutes quickly reading over the earlier discussion. It's only 74 lines of code, and is easily understood.
As was the case in network.py, the star of network2.py is the Network class, which we use to represent our neural networks. We initialize an instance of Network with a list of sizes for the respective layers in the network, and a choice for the cost to use, defaulting to the cross-entropy:
class Network(object):

    def __init__(self, sizes, cost=CrossEntropyCost):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.default_weight_initializer()
        self.cost = cost

The first couple of lines of the __init__ method are the same as in network.py, and are pretty self-explanatory. But the next two lines are new, and we need to understand what they're doing in detail. Let's start by examining the default_weight_initializer method. This makes use of our new and improved approach to weight
initialization. As we've seen, in that approach the weights input to a neuron are initialized as Gaussian random variables with mean 0 and standard deviation 1 divided by the square root of the number of connections input to the neuron. Also in this method we'll initialize the biases, using Gaussian random variables with mean 0 and standard deviation 1. Here's the code:
    def default_weight_initializer(self):
        self.biases = [np.random.randn(y, 1) for y in self.sizes[1:]]
        self.weights = [np.random.randn(y, x)/np.sqrt(x)
                        for x, y in zip(self.sizes[:-1], self.sizes[1:])]

To understand the code, it may help to recall that np is the Numpy library for doing linear algebra. We'll import Numpy at the beginning of our program. Also, notice that we don't initialize any biases for the first layer of neurons. We avoid doing this because the first layer is an input layer, and so any biases would not be used. We did exactly the same thing in network.py.
Complementing the default_weight_initializer we'll also include a large_weight_initializer method. This method initializes the weights and biases using the old approach from Chapter 1, with both weights and biases initialized as Gaussian random variables with mean 0 and standard deviation 1. The code is, of course, only a tiny bit different from the default_weight_initializer:
    def large_weight_initializer(self):
        self.biases = [np.random.randn(y, 1) for y in self.sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(self.sizes[:-1], self.sizes[1:])]

I've included the large_weight_initializer method mostly as a convenience to make it easier to compare the results in this chapter to those in Chapter 1. I can't think of many practical situations where I would recommend using it!
The second new thing in Network's __init__ method is that we now initialize a cost attribute. To understand how that works, let's look at the class we use to represent the cross-entropy cost*:

class CrossEntropyCost(object):

    @staticmethod
    def fn(a, y):
        return np.sum(np.nan_to_num(-y*np.log(a)-(1-y)*np.log(1-a)))

    @staticmethod
    def delta(z, a, y):
        return (a-y)

*If you're not familiar with Python's static methods you can ignore the @staticmethod decorators, and just treat fn and delta as ordinary methods. If you're curious about details, all @staticmethod does is tell the Python interpreter that the method which follows doesn't depend on the object in any way. That's why self isn't passed as a parameter to the fn and delta methods.


Let's break this down. The first thing to observe is that even though the cross-entropy is, mathematically speaking, a function, we've implemented it as a Python class, not a Python function. Why have I made that choice? The reason is that the cost plays two different roles in our network. The obvious role is that it's a measure of how well an output activation, a, matches the desired output, y. This role is captured by the CrossEntropyCost.fn method. (Note, by the way, that the np.nan_to_num call inside CrossEntropyCost.fn ensures that Numpy deals correctly with the log of numbers very close to zero.) But there's also a second way the cost function enters our network. Recall from Chapter 2 that when running the backpropagation algorithm we need to compute the network's output error, δ^L. The form of the output error depends on the choice of cost function: different cost function, different form for the output error. For the cross-entropy the output error is, as we saw in Equation (66),

δ^L = a^L − y.

For this reason we define a second method, CrossEntropyCost.delta, whose purpose is to tell our network how to compute the output error. And then we bundle these two methods up into a single class containing everything our networks need to know about the cost function.
In a similar way, network2.py also contains a class to represent the quadratic cost function. This is included for comparison with the results of Chapter 1, since going forward we'll mostly use the cross-entropy. The code is just below. The QuadraticCost.fn method is a straightforward computation of the quadratic cost associated to the actual output, a, and the desired output, y. The value returned by QuadraticCost.delta is based on the expression (30) for the output error for the quadratic cost, which we derived back in Chapter 2.
class QuadraticCost(object):

    @staticmethod
    def fn(a, y):
        return 0.5*np.linalg.norm(a-y)**2

    @staticmethod
    def delta(z, a, y):
        return (a-y) * sigmoid_prime(z)

We've now understood the main differences between network2.py and network.py. It's all pretty simple stuff. There are a number of smaller changes, which I'll discuss below, including the

implementation of L2 regularization. Before getting to that, let's look at the complete code for network2.py. You don't need to read all the code in detail, but it is worth understanding the broad structure, and in particular reading the documentation strings, so you understand what each piece of the program is doing. Of course, you're also welcome to delve as deeply as you wish! If you get lost, you may wish to continue reading the prose below, and return to the code later. Anyway, here's the code:
"""network2.py
~~~~~~~~~~~~~~
Animprovedversionofnetwork.py,implementingthestochastic
gradientdescentlearningalgorithmforafeedforwardneuralnetwork.
Improvementsincludetheadditionofthecrossentropycostfunction,
regularization,andbetterinitializationofnetworkweights.Note
thatIhavefocusedonmakingthecodesimple,easilyreadable,and
easilymodifiable.Itisnotoptimized,andomitsmanydesirable
features.
"""
####Libraries
#Standardlibrary
importjson
importrandom
importsys
#Thirdpartylibraries
importnumpyasnp

####Definethequadraticandcrossentropycostfunctions
classQuadraticCost(object):
@staticmethod
deffn(a,y):
"""Returnthecostassociatedwithanoutput``a``anddesiredoutput
``y``.
"""
return0.5*np.linalg.norm(ay)**2
@staticmethod
defdelta(z,a,y):
"""Returntheerrordeltafromtheoutputlayer."""
return(ay)*sigmoid_prime(z)

classCrossEntropyCost(object):
@staticmethod
deffn(a,y):
"""Returnthecostassociatedwithanoutput``a``anddesiredoutput
``y``.Notethatnp.nan_to_numisusedtoensurenumerical
stability.Inparticular,ifboth``a``and``y``havea1.0
inthesameslot,thentheexpression(1y)*np.log(1a)
returnsnan.Thenp.nan_to_numensuresthatthatisconverted
tothecorrectvalue(0.0).
"""

http://neuralnetworksanddeeplearning.com/chap3.html

58/92

5/31/2016

Neuralnetworksanddeeplearning

returnnp.sum(np.nan_to_num(y*np.log(a)(1y)*np.log(1a)))
@staticmethod
defdelta(z,a,y):
"""Returntheerrordeltafromtheoutputlayer.Notethatthe
parameter``z``isnotusedbythemethod.Itisincludedin
themethod'sparametersinordertomaketheinterface
consistentwiththedeltamethodforothercostclasses.
"""
return(ay)

####MainNetworkclass
classNetwork(object):
def__init__(self,sizes,cost=CrossEntropyCost):
"""Thelist``sizes``containsthenumberofneuronsintherespective
layersofthenetwork.Forexample,ifthelistwas[2,3,1]
thenitwouldbeathreelayernetwork,withthefirstlayer
containing2neurons,thesecondlayer3neurons,andthe
thirdlayer1neuron.Thebiasesandweightsforthenetwork
areinitializedrandomly,using
``self.default_weight_initializer``(seedocstringforthat
method).
"""
self.num_layers=len(sizes)
self.sizes=sizes
self.default_weight_initializer()
self.cost=cost
defdefault_weight_initializer(self):
"""InitializeeachweightusingaGaussiandistributionwithmean0
andstandarddeviation1overthesquarerootofthenumberof
weightsconnectingtothesameneuron.Initializethebiases
usingaGaussiandistributionwithmean0andstandard
deviation1.
Notethatthefirstlayerisassumedtobeaninputlayer,and
byconventionwewon'tsetanybiasesforthoseneurons,since
biasesareonlyeverusedincomputingtheoutputsfromlater
layers.
"""
self.biases=[np.random.randn(y,1)foryinself.sizes[1:]]
self.weights=[np.random.randn(y,x)/np.sqrt(x)
forx,yinzip(self.sizes[:1],self.sizes[1:])]
deflarge_weight_initializer(self):
"""InitializetheweightsusingaGaussiandistributionwithmean0
andstandarddeviation1.Initializethebiasesusinga
Gaussiandistributionwithmean0andstandarddeviation1.
Notethatthefirstlayerisassumedtobeaninputlayer,and
byconventionwewon'tsetanybiasesforthoseneurons,since
biasesareonlyeverusedincomputingtheoutputsfromlater
layers.
Thisweightandbiasinitializerusesthesameapproachasin
Chapter1,andisincludedforpurposesofcomparison.It
willusuallybebettertousethedefaultweightinitializer
instead.
"""
self.biases=[np.random.randn(y,1)foryinself.sizes[1:]]
self.weights=[np.random.randn(y,x)
forx,yinzip(self.sizes[:1],self.sizes[1:])]

    def feedforward(self, a):
        """Return the output of the network if ``a`` is input."""
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a)+b)
        return a

    def SGD(self, training_data, epochs, mini_batch_size, eta,
            lmbda=0.0,
            evaluation_data=None,
            monitor_evaluation_cost=False,
            monitor_evaluation_accuracy=False,
            monitor_training_cost=False,
            monitor_training_accuracy=False):
        """Train the neural network using mini-batch stochastic gradient
        descent.  The ``training_data`` is a list of tuples ``(x, y)``
        representing the training inputs and the desired outputs.  The
        other non-optional parameters are self-explanatory, as is the
        regularization parameter ``lmbda``.  The method also accepts
        ``evaluation_data``, usually either the validation or test
        data.  We can monitor the cost and accuracy on either the
        evaluation data or the training data, by setting the
        appropriate flags.  The method returns a tuple containing four
        lists: the (per-epoch) costs on the evaluation data, the
        accuracies on the evaluation data, the costs on the training
        data, and the accuracies on the training data.  All values are
        evaluated at the end of each training epoch.  So, for example,
        if we train for 30 epochs, then the first element of the tuple
        will be a 30-element list containing the cost on the
        evaluation data at the end of each epoch.  Note that the lists
        are empty if the corresponding flag is not set.

        """
        if evaluation_data: n_data = len(evaluation_data)
        n = len(training_data)
        evaluation_cost, evaluation_accuracy = [], []
        training_cost, training_accuracy = [], []
        for j in xrange(epochs):
            random.shuffle(training_data)
            mini_batches = [
                training_data[k:k+mini_batch_size]
                for k in xrange(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(
                    mini_batch, eta, lmbda, len(training_data))
            print "Epoch %s training complete" % j
            if monitor_training_cost:
                cost = self.total_cost(training_data, lmbda)
                training_cost.append(cost)
                print "Cost on training data: {}".format(cost)
            if monitor_training_accuracy:
                accuracy = self.accuracy(training_data, convert=True)
                training_accuracy.append(accuracy)
                print "Accuracy on training data: {} / {}".format(
                    accuracy, n)
            if monitor_evaluation_cost:
                cost = self.total_cost(evaluation_data, lmbda, convert=True)
                evaluation_cost.append(cost)
                print "Cost on evaluation data: {}".format(cost)
            if monitor_evaluation_accuracy:
                accuracy = self.accuracy(evaluation_data)
                evaluation_accuracy.append(accuracy)
                print "Accuracy on evaluation data: {} / {}".format(
                    self.accuracy(evaluation_data), n_data)
            print
        return evaluation_cost, evaluation_accuracy, \
            training_cost, training_accuracy

    def update_mini_batch(self, mini_batch, eta, lmbda, n):
        """Update the network's weights and biases by applying gradient
        descent using backpropagation to a single mini batch.  The
        ``mini_batch`` is a list of tuples ``(x, y)``, ``eta`` is the
        learning rate, ``lmbda`` is the regularization parameter, and
        ``n`` is the total size of the training data set.

        """
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [(1-eta*(lmbda/n))*w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]

    def backprop(self, x, y):
        """Return a tuple ``(nabla_b, nabla_w)`` representing the
        gradient for the cost function C_x.  ``nabla_b`` and
        ``nabla_w`` are layer-by-layer lists of numpy arrays, similar
        to ``self.biases`` and ``self.weights``."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = (self.cost).delta(zs[-1], activations[-1], y)
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        # Note that the variable l in the loop below is used a little
        # differently to the notation in Chapter 2 of the book.  Here,
        # l = 1 means the last layer of neurons, l = 2 is the
        # second-last layer, and so on.  It's a renumbering of the
        # scheme in the book, used here to take advantage of the fact
        # that Python can use negative indices in lists.
        for l in xrange(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

    def accuracy(self, data, convert=False):
        """Return the number of inputs in ``data`` for which the neural
        network outputs the correct result.  The neural network's
        output is assumed to be the index of whichever neuron in the
        final layer has the highest activation.

        The flag ``convert`` should be set to False if the data set is
        validation or test data (the usual case), and to True if the
        data set is the training data.  The need for this flag arises
        due to differences in the way the results ``y`` are
        represented in the different data sets.  In particular, it
        flags whether we need to convert between the different
        representations.  It may seem strange to use different
        representations for the different data sets.  Why not use the
        same representation for all three data sets?  It's done for
        efficiency reasons -- the program usually evaluates the cost
        on the training data and the accuracy on other data sets.
        These are different types of computations, and using different
        representations speeds things up.  More details on the
        representations can be found in
        mnist_loader.load_data_wrapper.

        """
        if convert:
            results = [(np.argmax(self.feedforward(x)), np.argmax(y))
                       for (x, y) in data]
        else:
            results = [(np.argmax(self.feedforward(x)), y)
                       for (x, y) in data]
        return sum(int(x == y) for (x, y) in results)

    def total_cost(self, data, lmbda, convert=False):
        """Return the total cost for the data set ``data``.  The flag
        ``convert`` should be set to False if the data set is the
        training data (the usual case), and to True if the data set is
        the validation or test data.  See comments on the similar (but
        reversed) convention for the ``accuracy`` method, above.

        """
        cost = 0.0
        for x, y in data:
            a = self.feedforward(x)
            if convert: y = vectorized_result(y)
            cost += self.cost.fn(a, y)/len(data)
        cost += 0.5*(lmbda/len(data))*sum(
            np.linalg.norm(w)**2 for w in self.weights)
        return cost

    def save(self, filename):
        """Save the neural network to the file ``filename``."""
        data = {"sizes": self.sizes,
                "weights": [w.tolist() for w in self.weights],
                "biases": [b.tolist() for b in self.biases],
                "cost": str(self.cost.__name__)}
        f = open(filename, "w")
        json.dump(data, f)
        f.close()

#### Loading a Network
def load(filename):
    """Load a neural network from the file ``filename``.  Returns an
    instance of Network.

    """
    f = open(filename, "r")
    data = json.load(f)
    f.close()
    cost = getattr(sys.modules[__name__], data["cost"])
    net = Network(data["sizes"], cost=cost)
    net.weights = [np.array(w) for w in data["weights"]]
    net.biases = [np.array(b) for b in data["biases"]]
    return net

#### Miscellaneous functions
def vectorized_result(j):
    """Return a 10-dimensional unit vector with a 1.0 in the j'th position
    and zeroes elsewhere.  This is used to convert a digit (0...9)
    into a corresponding desired output from the neural network.

    """
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e
def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))

One of the more interesting changes in the code is to include L2 regularization. Although this is a major conceptual change, it's so trivial to implement that it's easy to miss in the code. For the most part it just involves passing the parameter lmbda to various methods, notably the Network.SGD method. The real work is done in a single line of the program, the fourth-last line of the Network.update_mini_batch method. That's where we modify the gradient descent update rule to include weight decay. But although the modification is tiny, it has a big impact on results!
This is, by the way, common when implementing new techniques in neural networks. We've spent thousands of words discussing regularization. It's conceptually quite subtle and difficult to understand. And yet it was trivial to add to our program! It occurs surprisingly often that sophisticated techniques can be implemented with small changes to code.
Another small but important change to our code is the addition of several optional flags to the stochastic gradient descent method, Network.SGD. These flags make it possible to monitor the cost and accuracy either on the training_data or on a set of evaluation_data which can be passed to Network.SGD. We've used these flags often earlier in the chapter, but let me give an example of how it works, just to remind you:
>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
>>> import network2
>>> net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
>>> net.SGD(training_data, 30, 10, 0.5,
... lmbda=5.0,
... evaluation_data=validation_data,
... monitor_evaluation_accuracy=True,
... monitor_evaluation_cost=True,
... monitor_training_accuracy=True,
... monitor_training_cost=True)

Here, we're setting the evaluation_data to be the validation_data. But we could also have monitored performance on the test_data or any other data set. We also have four flags telling us to monitor the cost and accuracy on both the evaluation_data and the training_data.
Those flags are False by default, but they've been turned on here in order to monitor our Network's performance. Furthermore, network2.py's Network.SGD method returns a four-element tuple representing the results of the monitoring. We can use this as follows:
>>> evaluation_cost, evaluation_accuracy, \
... training_cost, training_accuracy = net.SGD(training_data, 30, 10, 0.5,
... lmbda=5.0,
... evaluation_data=validation_data,
... monitor_evaluation_accuracy=True,
... monitor_evaluation_cost=True,
... monitor_training_accuracy=True,
... monitor_training_cost=True)

So, for example, evaluation_cost will be a 30-element list containing the cost on the evaluation data at the end of each epoch. This sort of information is extremely useful in understanding a network's behaviour. It can, for example, be used to draw graphs showing how the network learns over time. Indeed, that's exactly how I constructed all the graphs earlier in the chapter. Note, however, that if any of the monitoring flags are not set, then the corresponding element in the tuple will be the empty list.
Other additions to the code include a Network.save method, to save Network objects to disk, and a function to load them back in again later. Note that the saving and loading is done using JSON, not Python's pickle or cPickle modules, which are the usual way we save and load objects to and from disk in Python. Using JSON requires more code than pickle or cPickle would. To understand why I've used JSON, imagine that at some time in the future we decided to change our Network class to allow neurons other than sigmoid neurons. To implement that change we'd most likely change the attributes defined in the Network.__init__ method. If we've simply pickled the objects that would cause our load function to fail. Using JSON to do the serialization explicitly makes it easy to ensure that old Networks will still load.
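For instance, saving and later restoring a trained network looks like this; the file name here is just an example:

>>> net.save("net.json")
>>> import network2
>>> net2 = network2.load("net.json")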
There are many other minor changes in the code for network2.py, but they're all simple variations on network.py. The net result is to expand our 74-line program to a far more capable 152 lines.

Problems

Modify the code above to implement L1 regularization, and use
L1 regularization to classify MNIST digits using a 30 hidden neuron network. Can you find a regularization parameter that enables you to do better than running unregularized?

Take a look at the Network.cost_derivative method in network.py. That method was written for the quadratic cost. How would you rewrite the method for the cross-entropy cost? Can you think of a problem that might arise in the cross-entropy version? In network2.py we've eliminated the Network.cost_derivative method entirely, instead incorporating its functionality into the CrossEntropyCost.delta method. How does this solve the problem you've just identified?

How to choose a neural network's hyper-parameters?

Up until now I haven't explained how I've been choosing values for hyper-parameters such as the learning rate, η, the regularization parameter, λ, and so on. I've just been supplying values which work pretty well. In practice, when you're using neural nets to attack a problem, it can be difficult to find good hyper-parameters. Imagine, for example, that we've just been introduced to the MNIST problem, and have begun working on it, knowing nothing at all about what hyper-parameters to use. Let's suppose that by good fortune in our first experiments we choose many of the hyper-parameters in the same way as was done earlier this chapter: 30 hidden neurons, a mini-batch size of 10, training for 30 epochs using the cross-entropy. But we choose a learning rate η = 10.0 and regularization parameter λ = 1000.0. Here's what I saw on one such run:
>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
>>> import network2
>>> net = network2.Network([784, 30, 10])
>>> net.SGD(training_data, 30, 10, 10.0, lmbda=1000.0,
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)
Epoch 0 training complete
Accuracy on evaluation data: 1030 / 10000
Epoch 1 training complete
Accuracy on evaluation data: 990 / 10000
Epoch 2 training complete
Accuracy on evaluation data: 1009 / 10000
...

Epoch 27 training complete
Accuracy on evaluation data: 1009 / 10000
Epoch 28 training complete
Accuracy on evaluation data: 983 / 10000
Epoch 29 training complete
Accuracy on evaluation data: 967 / 10000

Our classification accuracies are no better than chance! Our network is acting as a random noise generator!
"Well,that'seasytofix,"youmightsay,"justdecreasethelearning
rateandregularizationhyperparameters".Unfortunately,you
don'taprioriknowthosearethehyperparametersyouneedto
adjust.Maybetherealproblemisthatour30hiddenneuron
networkwillneverworkwell,nomatterhowtheotherhyper
parametersarechosen?Maybewereallyneedatleast100hidden
neurons?Or300hiddenneurons?Ormultiplehiddenlayers?Ora
differentapproachtoencodingtheoutput?Maybeournetworkis
learning,butweneedtotrainformoreepochs?Maybethemini
batchesaretoosmall?Maybewe'ddobetterswitchingbacktothe
quadraticcostfunction?Maybeweneedtotryadifferentapproach
toweightinitialization?Andsoon,onandonandon.It'seasyto
feellostinhyperparameterspace.Thiscanbeparticularly
frustratingifyournetworkisverylarge,orusesalotoftraining
data,sinceyoumaytrainforhoursordaysorweeks,onlytogetno
result.Ifthesituationpersists,itdamagesyourconfidence.Maybe
neuralnetworksarethewrongapproachtoyourproblem?Maybe
youshouldquityourjobandtakeupbeekeeping?
InthissectionIexplainsomeheuristicswhichcanbeusedtoset
thehyperparametersinaneuralnetwork.Thegoalistohelpyou
developaworkflowthatenablesyoutodoaprettygoodjobsetting
hyperparameters.Ofcourse,Iwon'tcovereverythingabouthyper
parameteroptimization.That'sahugesubject,andit'snot,inany
case,aproblemthatisevercompletelysolved,noristhereuniversal
agreementamongstpractitionersontherightstrategiestouse.
There'salwaysonemoretrickyoucantrytoekeoutabitmore
performancefromyournetwork.Buttheheuristicsinthissection
shouldgetyoustarted.
Broad strategy: When using neural networks to attack a new problem the first challenge is to get any non-trivial learning, i.e., for the network to achieve results better than chance. This can be surprisingly difficult, especially when confronting a new class of problem. Let's look at some strategies you can use if you're having this kind of trouble.

Suppose, for example, that you're attacking MNIST for the first time. You start out enthusiastic, but are a little discouraged when your first network fails completely, as in the example above. The way to go is to strip the problem down. Get rid of all the training and validation images except images which are 0s or 1s. Then try to train a network to distinguish 0s from 1s. Not only is that an inherently easier problem than distinguishing all ten digits, it also reduces the amount of training data by 80 percent, speeding up training by a factor of 5. That enables much more rapid experimentation, and so gives you more rapid insight into how to build a good network.
You can further speed up experimentation by stripping your network down to the simplest network likely to do meaningful learning. If you believe a [784, 10] network can likely do better-than-chance classification of MNIST digits, then begin your experimentation with such a network. It'll be much faster than training a [784, 30, 10] network, and you can build back up to the latter.

You can get another speed up in experimentation by increasing the frequency of monitoring. In network2.py we monitor performance at the end of each training epoch. With 50,000 images per epoch, that means waiting a little while (about ten seconds per epoch, on my laptop, when training a [784, 30, 10] network) before getting feedback on how well the network is learning. Of course, ten seconds isn't very long, but if you want to trial dozens of hyper-parameter choices it's annoying, and if you want to trial hundreds or thousands of choices it starts to get debilitating. We can get feedback more quickly by monitoring the validation accuracy more often, say, after every 1,000 training images. Furthermore, instead of using the full 10,000-image validation set to monitor performance, we can get a much faster estimate using just 100 validation images. All that matters is that the network sees enough images to do real learning, and to get a pretty good rough estimate of performance. Of course, our program network2.py doesn't currently do this kind of monitoring. But as a kludge to achieve a similar effect for the purposes of illustration, we'll strip down our training data to just the first 1,000 MNIST training images. Let's try it and see what happens. (To keep the code below simple I haven't implemented the idea of using only 0 and 1 images. Of course, that can be done with just a little more work.)
>>> net = network2.Network([784, 10])
>>> net.SGD(training_data[:1000], 30, 10, 10.0, lmbda=1000.0, \
... evaluation_data=validation_data[:100], \
... monitor_evaluation_accuracy=True)
Epoch 0 training complete
Accuracy on evaluation data: 10 / 100
Epoch 1 training complete
Accuracy on evaluation data: 10 / 100
Epoch 2 training complete
Accuracy on evaluation data: 10 / 100
...

We're still getting pure noise! But there's a big win: we're now getting feedback in a fraction of a second, rather than once every ten seconds or so. That means you can more quickly experiment with other choices of hyper-parameter, or even conduct experiments trialling many different choices of hyper-parameter nearly simultaneously.

In the above example I left λ = 1000.0, as we used earlier. But since we changed the number of training examples we should really change λ to keep the weight decay the same. That means changing λ to 20.0. If we do that then this is what happens:
>>> net = network2.Network([784, 10])
>>> net.SGD(training_data[:1000], 30, 10, 10.0, lmbda=20.0, \
... evaluation_data=validation_data[:100], \
... monitor_evaluation_accuracy=True)
Epoch 0 training complete
Accuracy on evaluation data: 12 / 100
Epoch 1 training complete
Accuracy on evaluation data: 14 / 100
Epoch 2 training complete
Accuracy on evaluation data: 25 / 100
Epoch 3 training complete
Accuracy on evaluation data: 18 / 100
...

Ahah! We have a signal. Not a terribly good signal, but a signal nonetheless. That's something we can build on, modifying the hyper-parameters to try to get further improvement. Maybe we guess that our learning rate needs to be higher. (As you perhaps realize, that's a silly guess, for reasons we'll discuss shortly, but please bear with me.) So to test our guess we try dialing η up to 100.0:

>>> net = network2.Network([784, 10])
>>> net.SGD(training_data[:1000], 30, 10, 100.0, lmbda=20.0, \
... evaluation_data=validation_data[:100], \
... monitor_evaluation_accuracy=True)
Epoch 0 training complete
Accuracy on evaluation data: 10 / 100
Epoch 1 training complete
Accuracy on evaluation data: 10 / 100
Epoch 2 training complete
Accuracy on evaluation data: 10 / 100
Epoch 3 training complete
Accuracy on evaluation data: 10 / 100
...

That's no good! It suggests that our guess was wrong, and the problem wasn't that the learning rate was too low. So instead we try dialing η down to η = 1.0:

>>> net = network2.Network([784, 10])
>>> net.SGD(training_data[:1000], 30, 10, 1.0, lmbda=20.0, \
... evaluation_data=validation_data[:100], \
... monitor_evaluation_accuracy=True)
Epoch 0 training complete
Accuracy on evaluation data: 62 / 100
Epoch 1 training complete
Accuracy on evaluation data: 42 / 100
Epoch 2 training complete
Accuracy on evaluation data: 43 / 100
Epoch 3 training complete
Accuracy on evaluation data: 61 / 100
...

That's better! And so we can continue, individually adjusting each hyper-parameter, gradually improving performance. Once we've explored to find an improved value for η, then we move on to find a good value for λ. Then experiment with a more complex architecture, say a network with 10 hidden neurons. Then adjust the values for η and λ again. Then increase to 20 hidden neurons. And then adjust other hyper-parameters some more. And so on, at each stage evaluating performance using our held-out validation data, and using those evaluations to find better and better hyper-parameters. As we do so, it typically takes longer to witness the impact due to modifications of the hyper-parameters, and so we can gradually decrease the frequency of monitoring.
This all looks very promising as a broad strategy. However, I want to return to that initial stage of finding hyper-parameters that enable a network to learn anything at all. In fact, even the above discussion conveys too positive an outlook. It can be immensely frustrating to work with a network that's learning nothing. You can tweak hyper-parameters for days, and still get no meaningful response. And so I'd like to re-emphasize that during the early stages you should make sure you can get quick feedback from experiments. Intuitively, it may seem as though simplifying the problem and the architecture will merely slow you down. In fact, it speeds things up, since you much more quickly find a network with a meaningful signal. Once you've got such a signal, you can often get rapid improvements by tweaking the hyper-parameters. As with many things in life, getting started can be the hardest thing to do.

Okay, that's the broad strategy. Let's now look at some specific recommendations for setting hyper-parameters. I will focus on the learning rate, η, the L2 regularization parameter, λ, and the mini-batch size. However, many of the remarks apply also to other hyper-parameters, including those associated to network architecture, other forms of regularization, and some hyper-parameters we'll meet later in the book, such as the momentum co-efficient.
Learning rate: Suppose we run three MNIST networks with three different learning rates, η = 0.025, η = 0.25 and η = 2.5, respectively. We'll set the other hyper-parameters as for the experiments in earlier sections, running over 30 epochs, with a mini-batch size of 10, and with λ = 5.0. We'll also return to using the full 50,000 training images. Here's a graph showing the behaviour of the training cost as we train*:

[Graph: training cost versus epoch for the three learning rates η = 0.025, η = 0.25, and η = 2.5.]

*The graph was generated by multiple_eta.py.
With η = 0.025 the cost decreases smoothly until the final epoch. With η = 0.25 the cost initially decreases, but after about 20 epochs it is near saturation, and thereafter most of the changes are merely small and apparently random oscillations. Finally, with η = 2.5 the cost makes large oscillations right from the start. To understand the reason for the oscillations, recall that stochastic gradient descent is supposed to step us gradually down into a valley of the cost function:

[Figure: gradient descent steps descending into a valley of the cost function.]

However, if η is too large then the steps will be so large that they may actually overshoot the minimum, causing the algorithm to climb up out of the valley instead. That's likely* what's causing the cost to oscillate when η = 2.5. When we choose η = 0.25 the initial steps do take us toward a minimum of the cost function, and it's only once we get near that minimum that we start to suffer from the overshooting problem. And when we choose η = 0.025 we don't suffer from this problem at all during the first 30 epochs. Of course, choosing η so small creates another problem, namely, that it slows down stochastic gradient descent. An even better approach would be to start with η = 0.25, train for 20 epochs, and then switch to η = 0.025. We'll discuss such variable learning rate schedules later. For now, though, let's stick to figuring out how to find a single good value for the learning rate, η.

*This picture is helpful, but it's intended as an intuition-building illustration of what may go on, not as a complete, exhaustive explanation. Briefly, a more complete explanation is as follows: gradient descent uses a first-order approximation to the cost function as a guide to how to decrease the cost. For large η, higher-order terms in the cost function become more important, and may dominate the behaviour, causing gradient descent to break down. This is especially likely as we approach minima and quasi-minima of the cost function, since near such points the gradient becomes small, making it easier for higher-order terms to dominate behaviour.
With this picture in mind, we can set η as follows. First, we estimate the threshold value for η at which the cost on the training data immediately begins decreasing, instead of oscillating or increasing. This estimate doesn't need to be too accurate. You can estimate the order of magnitude by starting with η = 0.01. If the cost decreases during the first few epochs, then you should successively try η = 0.1, 1.0, … until you find a value for η where the cost oscillates or increases during the first few epochs. Alternately, if the cost oscillates or increases during the first few epochs when η = 0.01, then try η = 0.001, 0.0001, … until you find a value for η where the cost decreases during the first few epochs. Following this procedure will give us an order-of-magnitude estimate for the threshold value of η. You may optionally refine your estimate, to pick out the largest value of η at which the cost decreases during the first few epochs, say η = 0.5 or η = 0.2 (there's no need for this to be super-accurate). This gives us an estimate for the threshold value of η.

Obviously, the actual value of η that you use should be no larger than the threshold value. In fact, if the value of η is to remain usable over many epochs then you likely want to use a value for η that is smaller, say, a factor of two below the threshold. Such a choice will typically allow you to train for many epochs, without causing too much of a slowdown in learning.

In the case of the MNIST data, following this strategy leads to an estimate of 0.1 for the order of magnitude of the threshold value of η. After some more refinement, we obtain a threshold value η = 0.5. Following the prescription above, this suggests using η = 0.25 as our value for the learning rate. In fact, I found that using η = 0.5 worked well enough over 30 epochs that for the most part I didn't worry about using a lower value of η.
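As an illustration, here's a minimal sketch of that order-of-magnitude search (my own illustrative code, not part of network2.py; it assumes a hypothetical helper run_few_epochs(eta) that trains briefly and returns True if the training cost decreased):

def estimate_eta_threshold(run_few_epochs, eta=0.01):
    # Search by factors of 10 for the largest order of magnitude
    # at which the training cost still decreases.
    if run_few_epochs(eta):
        while run_few_epochs(eta * 10):
            eta *= 10
        return eta
    else:
        while not run_few_epochs(eta):
            eta /= 10
        return eta

A value around a factor of two below the returned threshold is then a reasonable working choice for η.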

This all seems quite straightforward. However, using the training cost to pick η appears to contradict what I said earlier in this section, namely, that we'd pick hyper-parameters by evaluating performance using our held-out validation data. In fact, we'll use validation accuracy to pick the regularization hyper-parameter, the mini-batch size, and network parameters such as the number of layers and hidden neurons, and so on. Why do things differently for the learning rate? Frankly, this choice is my personal aesthetic preference, and is perhaps somewhat idiosyncratic. The reasoning is that the other hyper-parameters are intended to improve the final classification accuracy on the test set, and so it makes sense to select them on the basis of validation accuracy. However, the learning rate is only incidentally meant to impact the final classification accuracy. Its primary purpose is really to control the step size in gradient descent, and monitoring the training cost is the best way to detect if the step size is too big. With that said, this is a personal aesthetic preference. Early on during learning the training cost usually only decreases if the validation accuracy improves, and so in practice it's unlikely to make much difference which criterion you use.
Use early stopping to determine the number of training epochs: As we discussed earlier in the chapter, early stopping means that at the end of each epoch we should compute the classification accuracy on the validation data. When that stops improving, terminate. This makes setting the number of epochs very simple. In particular, it means that we don't need to worry about explicitly figuring out how the number of epochs depends on the other hyper-parameters. Instead, that's taken care of automatically. Furthermore, early stopping also automatically prevents us from overfitting. This is, of course, a good thing, although in the early stages of experimentation it can be helpful to turn off early stopping, so you can see any signs of overfitting, and use it to inform your approach to regularization.
To implement early stopping we need to say more precisely what it means that the classification accuracy has stopped improving. As we've seen, the accuracy can jump around quite a bit, even when the overall trend is to improve. If we stop the first time the accuracy decreases then we'll almost certainly stop when there are more improvements to be had. A better rule is to terminate if the best classification accuracy doesn't improve for quite some time. Suppose, for example, that we're doing MNIST. Then we might elect to terminate if the classification accuracy hasn't improved during the last ten epochs. This ensures that we don't stop too soon, in response to bad luck in training, but also that we're not waiting around forever for an improvement that never comes.
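Here's a minimal sketch of what a no-improvement-in-ten rule might look like, wrapped around two hypothetical helpers, train_epoch() for one pass of stochastic gradient descent, and accuracy(validation_data) for the validation score (the scaffolding is my own illustrative code, not network2.py's):

best_accuracy = 0
epochs_since_improvement = 0
while epochs_since_improvement < 10:
    train_epoch()
    acc = accuracy(validation_data)
    if acc > best_accuracy:
        # A new best: reset the patience counter.
        best_accuracy = acc
        epochs_since_improvement = 0
    else:
        epochs_since_improvement += 1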
This no-improvement-in-ten rule is good for initial exploration of MNIST. However, networks can sometimes plateau near a particular classification accuracy for quite some time, only to then begin improving again. If you're trying to get really good performance, the no-improvement-in-ten rule may be too aggressive about stopping. In that case, I suggest using the no-improvement-in-ten rule for initial experimentation, and gradually adopting more lenient rules, as you better understand the way your network trains: no-improvement-in-twenty, no-improvement-in-fifty, and so on. Of course, this introduces a new hyper-parameter to optimize! In practice, however, it's usually easy to set this hyper-parameter to get pretty good results. Similarly, for problems other than MNIST, the no-improvement-in-ten rule may be much too aggressive or not nearly aggressive enough, depending on the details of the problem. However, with a little experimentation it's usually easy to find a pretty good strategy for early stopping.

We haven't used early stopping in our MNIST experiments to date. The reason is that we've been doing a lot of comparisons between different approaches to learning. For such comparisons it's helpful to use the same number of epochs in each case. However, it's well worth modifying network2.py to implement early stopping:
Problem

- Modify network2.py so that it implements early stopping using a no-improvement-in-n epochs strategy, where n is a parameter that can be set.

- Can you think of a rule for early stopping other than no-improvement-in-n? Ideally, the rule should compromise between getting high validation accuracies and not training too long. Add your rule to network2.py, and run three experiments comparing the validation accuracies and number of epochs of training to no-improvement-in-10.
Learning rate schedule: We've been holding the learning rate η constant. However, it's often advantageous to vary the learning rate. Early on during the learning process it's likely that the weights are badly wrong. And so it's best to use a large learning rate that causes the weights to change quickly. Later, we can reduce the learning rate as we make more fine-tuned adjustments to our weights.
How should we set our learning rate schedule? Many approaches are possible. One natural approach is to use the same basic idea as early stopping. The idea is to hold the learning rate constant until the validation accuracy starts to get worse. Then decrease the learning rate by some amount, say a factor of two or ten. We repeat this many times, until, say, the learning rate is a factor of 1,024 (or 1,000) times lower than the initial value. Then we terminate.
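In code, such a schedule might look something like the following sketch (again my own illustrative scaffolding, reusing the hypothetical train_epoch and accuracy helpers from the early-stopping sketch above, with train_epoch now taking the current learning rate):

eta = 0.5                     # initial learning rate
initial_eta = eta
best_accuracy = 0
epochs_since_improvement = 0
while eta > initial_eta / 1024:
    train_epoch(eta)
    acc = accuracy(validation_data)
    if acc > best_accuracy:
        best_accuracy = acc
        epochs_since_improvement = 0
    else:
        epochs_since_improvement += 1
    if epochs_since_improvement >= 10:
        eta /= 2              # halve the rate when progress stalls
        epochs_since_improvement = 0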
A variable learning schedule can improve performance, but it also opens up a world of possible choices for the learning schedule. Those choices can be a headache: you can spend forever trying to optimize your learning schedule. For first experiments my suggestion is to use a single, constant value for the learning rate. That'll get you a good first approximation. Later, if you want to obtain the best performance from your network, it's worth experimenting with a learning schedule, along the lines I've described*.

*A readable recent paper which demonstrates the benefits of variable learning rates in attacking MNIST is Deep, Big, Simple Neural Nets Excel on Handwritten Digit Recognition, by Dan Claudiu Cireşan, Ueli Meier, Luca Maria Gambardella, and Jürgen Schmidhuber (2010).

Exercise

- Modify network2.py so that it implements a learning schedule that: halves the learning rate each time the validation accuracy satisfies the no-improvement-in-10 rule; and terminates when the learning rate has dropped to 1/128 of its original value.
The regularization parameter, λ: I suggest starting initially with no regularization (λ = 0.0), and determining a value for η, as above. Using that choice of η, we can then use the validation data to select a good value for λ. Start by trialling λ = 1.0*, and then increase or decrease by factors of 10, as needed to improve performance on the validation data. Once you've found a good order of magnitude, you can fine tune your value of λ. That done, you should return and re-optimize η again.

*I don't have a good principled justification for using this as a starting value. If anyone knows of a good principled discussion of where to start with λ, I'd appreciate hearing it (mn@michaelnielsen.org).

Exercise

- It's tempting to use gradient descent to try to learn good values for hyper-parameters such as λ and η. Can you think of an obstacle to using gradient descent to determine λ? Can you think of an obstacle to using gradient descent to determine η?
How I selected hyper-parameters earlier in this book: If you use the recommendations in this section you'll find that you get values for η and λ which don't always exactly match the values I've used earlier in the book. The reason is that the book has narrative constraints that have sometimes made it impractical to optimize the hyper-parameters. Think of all the comparisons we've made of different approaches to learning, e.g., comparing the quadratic and cross-entropy cost functions, comparing the old and new methods of weight initialization, running with and without regularization, and so on. To make such comparisons meaningful, I've usually tried to keep hyper-parameters constant across the approaches being compared (or to scale them in an appropriate way). Of course, there's no reason for the same hyper-parameters to be optimal for all the different approaches to learning, so the hyper-parameters I've used are something of a compromise.

As an alternative to this compromise, I could have tried to optimize the heck out of the hyper-parameters for every single approach to learning. In principle that'd be a better, fairer approach, since then we'd see the best from every approach to learning. However, we've made dozens of comparisons along these lines, and in practice I found it too computationally expensive. That's why I've adopted the compromise of using pretty good (but not necessarily optimal) choices for the hyper-parameters.
Mini-batch size: How should we set the mini-batch size? To answer this question, let's first suppose that we're doing online learning, i.e., that we're using a mini-batch size of 1.

The obvious worry about online learning is that using mini-batches which contain just a single training example will cause significant errors in our estimate of the gradient. In fact, though, the errors turn out to not be such a problem. The reason is that the individual gradient estimates don't need to be super-accurate. All we need is an estimate accurate enough that our cost function tends to keep decreasing. It's as though you are trying to get to the North Magnetic Pole, but have a wonky compass that's 10-20 degrees off each time you look at it. Provided you stop to check the compass frequently, and the compass gets the direction right on average, you'll end up at the North Magnetic Pole just fine.

Based on this argument, it sounds as though we should use online learning. In fact, the situation turns out to be more complicated than that. In a problem in the last chapter I pointed out that it's possible to use matrix techniques to compute the gradient update for all examples in a mini-batch simultaneously, rather than looping over them. Depending on the details of your hardware and linear algebra library this can make it quite a bit faster to compute the gradient estimate for a mini-batch of (for example) size 100, rather than computing the mini-batch gradient estimate by looping over the 100 training examples separately. It might take (say) only 50 times as long, rather than 100 times as long.
Now, at first it seems as though this doesn't help us that much. With our mini-batch of size 100 the learning rule for the weights looks like:

w → w′ = w − (η/100) ∑_x ∇C_x,

where the sum is over training examples in the mini-batch. This is versus

w → w′ = w − η∇C_x

for online learning. Even if it only takes 50 times as long to do the mini-batch update, it still seems likely to be better to do online learning, because we'd be updating so much more frequently. Suppose, however, that in the mini-batch case we increase the learning rate by a factor 100, so the update rule becomes

w → w′ = w − η ∑_x ∇C_x.

That's a lot like doing 100 separate instances of online learning with a learning rate of η. But it only takes 50 times as long as doing a single instance of online learning. Of course, it's not truly the same as 100 instances of online learning, since in the mini-batch the ∇C_x's are all evaluated for the same set of weights, as opposed to the cumulative learning that occurs in the online case. Still, it seems distinctly possible that using the larger mini-batch would speed things up.
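To make the speed argument concrete, here's a small numpy sketch (illustrative only) comparing a looped computation with a single matrix multiplication over a mini-batch of 100 inputs. The matrix version computes exactly the same weighted inputs, but typically much faster:

import time
import numpy as np

W = np.random.randn(30, 784)   # weights for a 784-to-30 layer
batch = [np.random.randn(784, 1) for _ in range(100)]

start = time.time()
looped = [np.dot(W, x) for x in batch]      # one example at a time
print("loop:   {:.4f}s".format(time.time() - start))

X = np.hstack(batch)           # 784 x 100 matrix, one column per example
start = time.time()
vectorized = np.dot(W, X)      # all 100 examples at once
print("matrix: {:.4f}s".format(time.time() - start))

# Both approaches produce the same weighted inputs:
assert np.allclose(np.hstack(looped), vectorized)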
With these factors in mind, choosing the best mini-batch size is a compromise. Too small, and you don't get to take full advantage of the benefits of good matrix libraries optimized for fast hardware. Too large and you're simply not updating your weights often enough. What you need is to choose a compromise value which maximizes the speed of learning. Fortunately, the choice of mini-batch size at which the speed is maximized is relatively independent of the other hyper-parameters (apart from the overall architecture), so you don't need to have optimized those hyper-parameters in order to find a good mini-batch size. The way to go is therefore to use some acceptable (but not necessarily optimal) values for the other hyper-parameters, and then trial a number of different mini-batch sizes, scaling η as above. Plot the validation accuracy versus time (as in, real elapsed time, not epoch!), and choose whichever mini-batch size gives you the most rapid improvement in performance. With the mini-batch size chosen you can then proceed to optimize the other hyper-parameters.

Of course, as you've no doubt realized, I haven't done this optimization in our work. Indeed, our implementation doesn't use the faster approach to mini-batch updates at all. I've simply used a mini-batch size of 10 without comment or explanation in nearly all examples. Because of this, we could have sped up learning by reducing the mini-batch size. I haven't done this, in part because I wanted to illustrate the use of mini-batches beyond size 1, and in part because my preliminary experiments suggested the speedup would be rather modest. In practical implementations, however, we would most certainly implement the faster approach to mini-batch updates, and then make an effort to optimize the mini-batch size, in order to maximize our overall speed.
Automated techniques: I've been describing these heuristics as though you're optimizing your hyper-parameters by hand. Hand-optimization is a good way to build up a feel for how neural networks behave. However, and unsurprisingly, a great deal of work has been done on automating the process. A common technique is grid search, which systematically searches through a grid in hyper-parameter space. A review of both the achievements and the limitations of grid search (with suggestions for easily-implemented alternatives) may be found in a 2012 paper* by James Bergstra and Yoshua Bengio. Many more sophisticated approaches have also been proposed. I won't review all that work here, but do want to mention a particularly promising 2012 paper which used a Bayesian approach to automatically optimize hyper-parameters*. The code from the paper is publicly available, and has been used with some success by other researchers.

*Random search for hyper-parameter optimization, by James Bergstra and Yoshua Bengio (2012).

*Practical Bayesian optimization of machine learning algorithms, by Jasper Snoek, Hugo Larochelle, and Ryan Adams.
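To give the flavor of the random-search alternative Bergstra and Bengio advocate, here's a minimal sketch using our network2.py interface (the sampling ranges, the short training runs, and the use of the final evaluation accuracy as a score are my own illustrative choices):

import random
import network2

def random_search(training_data, validation_data, trials=20):
    best = None
    for _ in range(trials):
        # Sample hyper-parameters log-uniformly over several orders
        # of magnitude, rather than stepping through a fixed grid.
        eta = 10 ** random.uniform(-3, 1)
        lmbda = 10 ** random.uniform(-2, 2)
        net = network2.Network([784, 30, 10],
                               cost=network2.CrossEntropyCost)
        _, accuracy, _, _ = net.SGD(
            training_data[:1000], 5, 10, eta, lmbda=lmbda,
            evaluation_data=validation_data[:100],
            monitor_evaluation_accuracy=True)
        score = accuracy[-1]   # accuracy after the final epoch
        if best is None or score > best[0]:
            best = (score, eta, lmbda)
    return best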
Summing up: Following the rules-of-thumb I've described won't give you the absolute best possible results from your neural network. But it will likely give you a good start and a basis for further improvements. In particular, I've discussed the hyper-parameters largely independently. In practice, there are relationships between the hyper-parameters. You may experiment with η, feel that you've got it just right, then start to optimize for λ, only to find that it's messing up your optimization for η. In practice, it helps to bounce backward and forward, gradually closing in on good values. Above all, keep in mind that the heuristics I've described are rules of thumb, not rules cast in stone. You should be on the lookout for signs that things aren't working, and be willing to experiment. In particular, this means carefully monitoring your network's behaviour, especially the validation accuracy.
Therearemany,manypaperssettingout(sometimes
contradictory)recommendationsforhowtoproceed.However,
thereareafewparticularlyusefulpapersthatsynthesizeanddistill
outmuchofthislore.YoshuaBengiohasa2012paper*thatgives

*Practicalrecommendationsforgradientbased

somepracticalrecommendationsforusingbackpropagationand

(2012).

trainingofdeeparchitectures,byYoshuaBengio

gradientdescenttotrainneuralnetworks,includingdeepneural
nets.BengiodiscussesmanyissuesinmuchmoredetailthanIhave,
includinghowtodomoresystematichyperparametersearches.
Anothergoodpaperisa1998paper*byYannLeCun,LonBottou,
GenevieveOrrandKlausRobertMller.Boththesepapersappear
inanextremelyuseful2012bookthatcollectsmanytricks
commonlyusedinneuralnets*.Thebookisexpensive,butmanyof
thearticleshavebeenplacedonlinebytheirrespectiveauthors
http://neuralnetworksanddeeplearning.com/chap3.html

*EfficientBackProp,byYannLeCun,Lon
Bottou,GenevieveOrrandKlausRobertMller
(1998)

*NeuralNetworks:TricksoftheTrade,editedby
GrgoireMontavon,GeneviveOrr,andKlaus

79/92

5/31/2016

Neuralnetworksanddeeplearning

with,onepresumes,theblessingofthepublisher,andmaybe

RobertMller.

locatedusingasearchengine.
One thing that becomes clear as you read these articles and, especially, as you engage in your own experiments, is that hyper-parameter optimization is not a problem that is ever completely solved. There's always another trick you can try to improve performance. There is a saying common among writers that books are never finished, only abandoned. The same is also true of neural network optimization: the space of hyper-parameters is so large that one never really finishes optimizing, one only abandons the network to posterity. So your goal should be to develop a workflow that enables you to quickly do a pretty good job on the optimization, while leaving you the flexibility to try more detailed optimizations, if that's important.

The challenge of setting hyper-parameters has led some people to complain that neural networks require a lot of work when compared with other machine learning techniques. I've heard many variations on the following complaint: "Yes, a well-tuned neural network may get the best performance on the problem. On the other hand, I can try a random forest [or SVM or insert your own favorite technique] and it just works. I don't have time to figure out just the right neural network." Of course, from a practical point of view it's good to have easy-to-apply techniques. This is particularly true when you're just getting started on a problem, and it may not be obvious whether machine learning can help solve the problem at all. On the other hand, if getting optimal performance is important, then you may need to try approaches that require more specialist knowledge. While it would be nice if machine learning were always easy, there is no a priori reason it should be trivially simple.

Other techniques

Each technique developed in this chapter is valuable to know in its own right, but that's not the only reason I've explained them. The larger point is to familiarize you with some of the problems which can occur in neural networks, and with a style of analysis which can help overcome those problems. In a sense, we've been learning how to think about neural nets. Over the remainder of this chapter I briefly sketch a handful of other techniques. These sketches are less in-depth than the earlier discussions, but should convey some feeling for the diversity of techniques available for use in neural networks.

Variations on stochastic gradient descent

Stochastic gradient descent by backpropagation has served us well in attacking the MNIST digit classification problem. However, there are many other approaches to optimizing the cost function, and sometimes those other approaches offer performance superior to mini-batch stochastic gradient descent. In this section I sketch two such approaches, the Hessian and momentum techniques.
Hessian technique: To begin our discussion it helps to put neural networks aside for a bit. Instead, we're just going to consider the abstract problem of minimizing a cost function C which is a function of many variables, w = w₁, w₂, …, so C = C(w). By Taylor's theorem, the cost function can be approximated near a point w by

C(w + Δw) = C(w) + ∑_j (∂C/∂w_j) Δw_j + (1/2) ∑_jk Δw_j (∂²C/∂w_j ∂w_k) Δw_k + …

We can rewrite this more compactly as

C(w + Δw) = C(w) + ∇C · Δw + (1/2) Δwᵀ H Δw + …,

where ∇C is the usual gradient vector, and H is a matrix known as the Hessian matrix, whose jk-th entry is ∂²C/∂w_j ∂w_k. Suppose we approximate C by discarding the higher-order terms represented by … above,

C(w + Δw) ≈ C(w) + ∇C · Δw + (1/2) Δwᵀ H Δw.   (105)

Using calculus we can show that the expression on the right-hand side can be minimized* by choosing

Δw = −H⁻¹∇C.

*Strictly speaking, for this to be a minimum, and not merely an extremum, we need to assume that the Hessian matrix is positive definite. Intuitively, this means that the function C looks like a valley locally, not a mountain or a saddle.

Provided (105) is a good approximate expression for the cost function, then we'd expect that moving from the point w to w + Δw = w − H⁻¹∇C should significantly decrease the cost
function. That suggests a possible algorithm for minimizing the cost:

- Choose a starting point, w.
- Update w to a new point w′ = w − H⁻¹∇C, where the Hessian H and ∇C are computed at w.
- Update w′ to a new point w″ = w′ − H′⁻¹∇′C, where the Hessian H′ and ∇′C are computed at w′.
- And so on.

In practice, (105) is only an approximation, and it's better to take smaller steps. We do this by repeatedly changing w by an amount Δw = −ηH⁻¹∇C, where η is known as the learning rate.
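Here's an illustrative numpy sketch of that damped Newton-style update, applied to a toy quadratic cost C(w) = (1/2)wᵀAw − bᵀw, whose gradient Aw − b and Hessian A we can write down exactly (the toy cost, values, and step count are my own choices):

import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])  # positive-definite Hessian
b = np.array([1.0, 1.0])

def grad(w):
    # Gradient of (1/2) w^T A w - b^T w
    return np.dot(A, w) - b

w = np.zeros(2)
eta = 0.5
H_inv = np.linalg.inv(A)   # for this toy cost the Hessian is constant
for step in range(10):
    w = w - eta * np.dot(H_inv, grad(w))   # Δw = -η H⁻¹ ∇C
print(w)   # converges toward the minimizer, A⁻¹ b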
This approach to minimizing a cost function is known as the Hessian technique or Hessian optimization. There are theoretical and empirical results showing that Hessian methods converge on a minimum in fewer steps than standard gradient descent. In particular, by incorporating information about second-order changes in the cost function it's possible for the Hessian approach to avoid many pathologies that can occur in gradient descent. Furthermore, there are versions of the backpropagation algorithm which can be used to compute the Hessian.

If Hessian optimization is so great, why aren't we using it in our neural networks? Unfortunately, while it has many desirable properties, it has one very undesirable property: it's very difficult to apply in practice. Part of the problem is the sheer size of the Hessian matrix. Suppose you have a neural network with 10⁷ weights and biases. Then the corresponding Hessian matrix will contain 10⁷ × 10⁷ = 10¹⁴ entries. That's a lot of entries! And that makes computing H⁻¹∇C extremely difficult in practice. However, that doesn't mean that it's not useful to understand. In fact, there are many variations on gradient descent which are inspired by Hessian optimization, but which avoid the problem with overly large matrices. Let's take a look at one such technique, momentum-based gradient descent.
Momentum-based gradient descent: Intuitively, the advantage Hessian optimization has is that it incorporates not just information about the gradient, but also information about how the gradient is changing. Momentum-based gradient descent is based on a similar intuition, but avoids large matrices of second derivatives. To understand the momentum technique, think back to our original picture of gradient descent, in which we considered a ball rolling down into a valley. At the time, we observed that gradient descent is, despite its name, only loosely similar to a ball falling to the bottom of a valley. The momentum technique modifies gradient descent in two ways that make it more similar to the physical picture. First, it introduces a notion of "velocity" for the parameters we're trying to optimize. The gradient acts to change the velocity, not (directly) the "position", in much the same way as physical forces change the velocity, and only indirectly affect position. Second, the momentum method introduces a kind of friction term, which tends to gradually reduce the velocity.

Let's give a more precise mathematical description. We introduce velocity variables v = v₁, v₂, …, one for each corresponding w_j variable*. Then we replace the gradient descent update rule w → w′ = w − η∇C by

v → v′ = μv − η∇C   (107)
w → w′ = w + v′.   (108)

*In a neural net the w_j variables would, of course, include all weights and biases.

In these equations, μ is a hyper-parameter which controls the amount of damping or friction in the system. To understand the meaning of the equations it's helpful to first consider the case where μ = 1, which corresponds to no friction. When that's the case, inspection of the equations shows that the "force" ∇C is now modifying the velocity, v, and the velocity is controlling the rate of change of w. Intuitively, we build up the velocity by repeatedly adding gradient terms to it. That means that if the gradient is in (roughly) the same direction through several rounds of learning, we can build up quite a bit of steam moving in that direction. Think, for example, of what happens if we're moving straight down a slope:

[Figure: gradient steps moving straight down a slope, building up speed.]
With each step the velocity gets larger down the slope, so we move more and more quickly to the bottom of the valley. This can enable the momentum technique to work much faster than standard gradient descent. Of course, a problem is that once we reach the bottom of the valley we will overshoot. Or, if the gradient should change rapidly, then we could find ourselves moving in the wrong direction. That's the reason for the μ hyper-parameter in (107). I said earlier that μ controls the amount of friction in the system; to be a little more precise, you should think of 1 − μ as the amount of friction in the system. When μ = 1, as we've seen, there is no friction, and the velocity is completely driven by the gradient ∇C. By contrast, when μ = 0 there's a lot of friction, the velocity can't build up, and Equations (107) and (108) reduce to the usual equation for gradient descent, w → w′ = w − η∇C. In practice, using a value of μ intermediate between 0 and 1 can give us much of the benefit of being able to build up speed, but without causing overshooting. We can choose such a value for μ using the held-out validation data, in much the same way as we select η and λ.

I've avoided naming the hyper-parameter μ up to now. The reason is that the standard name for μ is badly chosen: it's called the momentum co-efficient. This is potentially confusing, since μ is not at all the same as the notion of momentum from physics. Rather, it is much more closely related to friction. However, the term momentum co-efficient is widely used, so we will continue to use it.
A nice thing about the momentum technique is that it takes almost no work to modify an implementation of gradient descent to incorporate momentum. We can still use backpropagation to compute the gradients, just as before, and use ideas such as sampling stochastically chosen mini-batches. In this way, we can get some of the advantages of the Hessian technique, using information about how the gradient is changing. But it's done without the disadvantages, and with only minor modifications to our code. In practice, the momentum technique is commonly used, and often speeds up learning.
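As a concrete illustration, here's a sketch of how the weight update inside a method like network2.py's update_mini_batch might change to use momentum (my own sketch, not code from network2.py; nabla_w stands for the gradients computed by backpropagation, as in earlier chapters, and mu is the new momentum co-efficient hyper-parameter):

# Velocities, initialized to zero alongside the weights, e.g.:
# self.velocities = [np.zeros(w.shape) for w in self.weights]

# Plain gradient descent step, for comparison:
# self.weights = [w - (eta/len(mini_batch))*nw
#                 for w, nw in zip(self.weights, nabla_w)]

# Momentum-based step, following Equations (107) and (108):
self.velocities = [mu*v - (eta/len(mini_batch))*nw
                   for v, nw in zip(self.velocities, nabla_w)]
self.weights = [w + v
                for w, v in zip(self.weights, self.velocities)]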

Exercise

- What would go wrong if we used μ > 1 in the momentum technique?
- What would go wrong if we used μ < 0 in the momentum technique?

Problem

- Add momentum-based stochastic gradient descent to network2.py.

Other approaches to minimizing the cost function: Many other approaches to minimizing the cost function have been developed, and there isn't universal agreement on which is the best approach. As you go deeper into neural networks it's worth digging into the other techniques, understanding how they work, their strengths and weaknesses, and how to apply them in practice. A paper I mentioned earlier* introduces and compares several of these techniques, including conjugate gradient descent and the BFGS method (see also the closely related limited-memory BFGS method, known as L-BFGS). Another technique which has recently shown promising results* is Nesterov's accelerated gradient technique, which improves on the momentum technique. However, for many problems, plain stochastic gradient descent works well, especially if momentum is used, and so we'll stick to stochastic gradient descent through the remainder of this book.

*Efficient BackProp, by Yann LeCun, Léon Bottou, Genevieve Orr and Klaus-Robert Müller (1998).

*See, for example, On the importance of initialization and momentum in deep learning, by Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton (2012).

Other models of artificial neuron

Up to now we've built our neural networks using sigmoid neurons. In principle, a network built from sigmoid neurons can compute any function. In practice, however, networks built using other model neurons sometimes outperform sigmoid networks. Depending on the application, networks based on such alternate models may learn faster, generalize better to test data, or perhaps do both. Let me mention a couple of alternate model neurons, to give you the flavor of some variations in common use.

Perhaps the simplest variation is the tanh (pronounced "tanch") neuron, which replaces the sigmoid function by the hyperbolic tangent function. The output of a tanh neuron with input x, weight vector w, and bias b is given by

tanh(w · x + b),

where tanh is, of course, the hyperbolic tangent function. It turns out that this is very closely related to the sigmoid neuron. To see this, recall that the tanh function is defined by

tanh(z) ≡ (eᶻ − e⁻ᶻ) / (eᶻ + e⁻ᶻ).

With a little algebra it can easily be verified that

σ(z) = (1 + tanh(z/2)) / 2,   (111)

that is, tanh is just a rescaled version of the sigmoid function. We can also see graphically that the tanh function has the same shape as the sigmoid function,
[Figure: graph of the tanh function, which ranges from −1 to 1 and has the same sigmoidal shape as σ.]
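Incidentally, the relationship in Equation (111) is easy to check numerically before proving it algebraically; here's a quick sketch (illustrative only):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)
# sigma(z) should equal (1 + tanh(z/2)) / 2 everywhere:
print(np.allclose(sigmoid(z), (1 + np.tanh(z/2)) / 2))  # True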

One difference between tanh neurons and sigmoid neurons is that the output from tanh neurons ranges from −1 to 1, not 0 to 1. This means that if you're going to build a network based on tanh neurons you may need to normalize your outputs (and, depending on the details of the application, possibly your inputs) a little differently than in sigmoid networks.

Similar to sigmoid neurons, a network of tanh neurons can, in principle, compute any function* mapping inputs to the range −1 to 1. Furthermore, ideas such as backpropagation and stochastic gradient descent are as easily applied to a network of tanh neurons as to a network of sigmoid neurons.

*There are some technical caveats to this statement for both tanh and sigmoid neurons, as well as for the rectified linear neurons discussed below. However, informally it's usually fine to think of neural networks as being able to approximate any function to arbitrary accuracy.

Exercise

- Prove the identity in Equation (111).
Which type of neuron should you use in your networks, the tanh or sigmoid? A priori the answer is not obvious, to put it mildly! However, there are theoretical arguments and some empirical evidence to suggest that the tanh sometimes performs better*. Let me briefly give you the flavor of one of the theoretical arguments for tanh neurons. Suppose we're using sigmoid neurons, so all activations in our network are positive. Let's consider the weights w^{l+1}_{jk} input to the j-th neuron in the (l+1)-th layer. The rules for backpropagation (see here) tell us that the associated gradient will be a^l_k δ^{l+1}_j. Because the activations are positive the sign of this gradient will be the same as the sign of δ^{l+1}_j. What this means is that if δ^{l+1}_j is positive then all the weights w^{l+1}_{jk} will decrease during gradient descent, while if δ^{l+1}_j is negative then all the weights w^{l+1}_{jk} will increase during gradient descent. In other words, all weights to the same neuron must either increase together or decrease together. That's a problem, since some of the weights may need to increase while others need to decrease. That can only happen if some of the input activations have different signs. That suggests replacing the sigmoid by an activation function, such as tanh, which allows both positive and negative activations. Indeed, because tanh is symmetric about zero, tanh(−z) = −tanh(z), we might even expect that, roughly speaking, the activations in hidden layers would be equally balanced between positive and negative. That would help ensure that there is no systematic bias for the weight updates to be one way or the other.

*See, for example, Efficient BackProp, by Yann LeCun, Léon Bottou, Genevieve Orr and Klaus-Robert Müller (1998), and Understanding the difficulty of training deep feedforward networks, by Xavier Glorot and Yoshua Bengio (2010).
How seriously should we take this argument? While the argument is suggestive, it's a heuristic, not a rigorous proof that tanh neurons outperform sigmoid neurons. Perhaps there are other properties of the sigmoid neuron which compensate for this problem? Indeed, for many tasks the tanh is found empirically to provide only a small or no improvement in performance over sigmoid neurons. Unfortunately, we don't yet have hard-and-fast rules to know which neuron types will learn fastest, or give the best generalization performance, for any particular application.
Another variation on the sigmoid neuron is the rectified linear neuron or rectified linear unit. The output of a rectified linear unit with input x, weight vector w, and bias b is given by

max(0, w · x + b).

Graphically, the rectifying function max(0, z) looks like this:

[Figure: graph of max(0, z), which is zero for z < 0 and rises linearly for z ≥ 0.]

Obviously such neurons are quite different from both sigmoid and tanh neurons. However, like the sigmoid and tanh neurons, rectified linear units can be used to compute any function, and they can be trained using ideas such as backpropagation and stochastic gradient descent.

When should you use rectified linear units instead of sigmoid or tanh neurons? Some recent work on image recognition* has found considerable benefit in using rectified linear units through much of the network. However, as with tanh neurons, we do not yet have a really deep understanding of when, exactly, rectified linear units are preferable, nor why. To give you the flavor of some of the issues, recall that sigmoid neurons stop learning when they saturate, i.e., when their output is near either 0 or 1. As we've seen repeatedly in this chapter, the problem is that σ′ terms reduce the gradient, and that slows down learning. Tanh neurons suffer from a similar problem when they saturate. By contrast, increasing the weighted input to a rectified linear unit will never cause it to saturate, and so there is no corresponding learning slowdown. On the other hand, when the weighted input to a rectified linear unit is negative, the gradient vanishes, and so the neuron stops learning entirely. These are just two of the many issues that make it non-trivial to understand when and why rectified linear units perform better than sigmoid or tanh neurons.

*See, for example, What is the Best Multi-Stage Architecture for Object Recognition?, by Kevin Jarrett, Koray Kavukcuoglu, Marc'Aurelio Ranzato and Yann LeCun (2009), Deep Sparse Rectifier Neural Networks, by Xavier Glorot, Antoine Bordes, and Yoshua Bengio (2011), and ImageNet Classification with Deep Convolutional Neural Networks, by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton (2012). Note that these papers fill in important details about how to set up the output layer, cost function, and regularization in networks using rectified linear units. I've glossed over all these details in this brief account. The papers also discuss in more detail the benefits and drawbacks of using rectified linear units. Another informative paper is Rectified Linear Units Improve Restricted Boltzmann Machines, by Vinod Nair and Geoffrey Hinton (2010), which demonstrates the benefits of using rectified linear units in a somewhat different approach to neural networks.
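A tiny sketch of my own makes the saturation contrast concrete: the sigmoid's derivative σ′(z) = σ(z)(1 − σ(z)) shrinks toward zero for large |z|, while the rectified linear unit's derivative stays at 1 for any positive weighted input, and is 0 for negative input:

import math

def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def relu_prime(z):
    return 1.0 if z > 0 else 0.0

for z in [-10.0, -1.0, 1.0, 10.0]:
    # The sigmoid gradient nearly vanishes at |z| = 10;
    # the ReLU gradient doesn't for z > 0, but is 0 for z < 0.
    print(z, sigmoid_prime(z), relu_prime(z))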
I've painted a picture of uncertainty here, stressing that we do not yet have a solid theory of how activation functions should be chosen. Indeed, the problem is harder even than I have described, for there are infinitely many possible activation functions. Which is the best for any given problem? Which will result in a network which learns fastest? Which will give the highest test accuracies? I am surprised how little really deep and systematic investigation has been done of these questions. Ideally, we'd have a theory which tells us, in detail, how to choose (and perhaps modify on the fly) our activation functions. On the other hand, we shouldn't let the lack of a full theory stop us! We have powerful tools already at hand, and can make a lot of progress with those tools. Through the remainder of this book I'll continue to use sigmoid neurons as our go-to neuron, since they're powerful and provide concrete illustrations of the core ideas about neural nets. But keep in the back of your mind that these same ideas can be applied to other types of neuron, and that there are sometimes advantages in doing so.

On stories in neural networks

Question: How do you approach utilizing and researching machine learning techniques that are supported almost entirely empirically, as opposed to mathematically? Also in what situations have you noticed some of these techniques fail?

Answer: You have to realize that our theoretical tools are very weak. Sometimes, we have good mathematical intuitions for why a particular technique should work. Sometimes our intuition ends up being wrong [...] The questions become: how well does my method work on this particular problem, and how large is the set of problems on which it works well.

- Question and answer with neural networks researcher Yann LeCun
Once, attending a conference on the foundations of quantum mechanics, I noticed what seemed to me a most curious verbal habit: when talks finished, questions from the audience often began with "I'm very sympathetic to your point of view, but [...]". Quantum foundations was not my usual field, and I noticed this style of questioning because at other scientific conferences I'd rarely or never heard a questioner express their sympathy for the point of view of the speaker. At the time, I thought the prevalence of the question suggested that little genuine progress was being made in quantum foundations, and people were merely spinning their wheels. Later, I realized that assessment was too harsh. The speakers were wrestling with some of the hardest problems human minds have ever confronted. Of course progress was slow! But there was still value in hearing updates on how people were thinking, even if they didn't always have unarguable new progress to report.

You may have noticed a verbal tic similar to "I'm very sympathetic [...]" in the current book. To explain what we're seeing I've often fallen back on saying "Heuristically, [...]", or "Roughly speaking, [...]", following up with a story to explain some phenomenon or other. These stories are plausible, but the empirical evidence I've presented has often been pretty thin. If you look through the research literature you'll see that stories in a similar style appear in many research papers on neural nets, often with thin supporting evidence. What should we think about such stories?
In many parts of science, especially those parts that deal with simple phenomena, it's possible to obtain very solid, very reliable evidence for quite general hypotheses. But in neural networks there are large numbers of parameters and hyper-parameters, and extremely complex interactions between them. In such extraordinarily complex systems it's exceedingly difficult to establish reliable general statements. Understanding neural networks in their full generality is a problem that, like quantum foundations, tests the limits of the human mind. Instead, we often make do with evidence for or against a few specific instances of a general statement. As a result those statements sometimes later need to be modified or abandoned, when new evidence comes to light.
One way of viewing this situation is that any heuristic story about neural networks carries with it an implied challenge. For example, consider the statement I quoted earlier, explaining why dropout works*: "This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons." This is a rich, provocative statement, and one could build a fruitful research program entirely around unpacking the statement, figuring out what in it is true, what is false, what needs variation and refinement. Indeed, there is now a small industry of researchers who are investigating dropout (and many variations), trying to understand how it works, and what its limits are. And so it goes with many of the heuristics we've discussed. Each heuristic is not just a (potential) explanation, it's also a challenge to investigate and understand in more detail.

*From ImageNet Classification with Deep Convolutional Neural Networks by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton (2012).
Of course, there is not time for any single person to investigate all these heuristic explanations in depth. It's going to take decades (or longer) for the community of neural networks researchers to develop a really powerful, evidence-based theory of how neural networks learn. Does this mean you should reject heuristic explanations as unrigorous, and not sufficiently evidence-based? No! In fact, we need such heuristics to inspire and guide our thinking. It's like the great age of exploration: the early explorers sometimes explored (and made new discoveries) on the basis of beliefs which were wrong in important ways. Later, those mistakes were corrected as we filled in our knowledge of geography. When you understand something poorly, as the explorers understood geography, and as we understand neural nets today, it's more important to explore boldly than it is to be rigorously correct in every step of your thinking. And so you should view these stories as a useful guide to how to think about neural nets, while retaining a healthy awareness of the limitations of such stories, and carefully keeping track of just how strong the evidence is for any given line of reasoning. Put another way, we need good stories to help motivate and inspire us, and rigorous in-depth investigation in order to uncover the real facts of the matter.
In academic work, please cite this book as: Michael A. Nielsen, "Neural Networks and Deep Learning", Determination Press, 2015

Last update: Fri Jan 22 14:09:50 2016

This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License. This means you're free to copy, share, and build on this book, but not to sell it. If you're interested in commercial use, please contact me.
