CHAPTER 3

Improving the way neural networks learn

When a golf player is first learning to play golf, they usually spend most of their time developing a basic swing. Only gradually do they develop other shots, learning to chip, draw and fade the ball, building on and modifying their basic swing. In a similar way, up to now we've focused on understanding the backpropagation algorithm. It's our "basic swing", the foundation for learning in most work on neural networks. In this chapter I explain a suite of techniques which can be used to improve on our vanilla implementation of backpropagation, and so improve the way our networks learn.
The techniques we'll develop in this chapter include: a better choice of cost function, known as the cross-entropy cost function; four so-called "regularization" methods (L1 and L2 regularization, dropout, and artificial expansion of the training data), which make our networks better at generalizing beyond the training data; a better
method for initializing the weights in the network; and a set of heuristics to help choose good hyper-parameters for the network. I'll also overview several other techniques in less depth. The discussions are largely independent of one another, and so you may jump ahead if you wish. We'll also implement many of the techniques in running code, and use them to improve the results obtained on the handwriting classification problem studied in Chapter 1.
Of course, we're only covering a few of the many, many techniques which have been developed for use in neural nets. The philosophy is that the best entree to the plethora of available techniques is in-depth study of a few of the most important. Mastering those important techniques is not just useful in its own right, but will also deepen your understanding of what problems can arise when you use neural networks. That will leave you well prepared to quickly pick up other techniques, as you need them.

The cross-entropy cost function
Most of us find it unpleasant to be wrong. Soon after beginning to learn the piano I gave my first performance before an audience. I was nervous, and began playing the piece an octave too low. I got confused, and couldn't continue until someone pointed out my error. I was very embarrassed. Yet while unpleasant, we also learn quickly when we're decisively wrong. You can bet that the next time I played before an audience I played in the correct octave! By contrast, we learn more slowly when our errors are less well

defined.

Ideally, we hope and expect that our neural networks will learn fast from their errors. Is this what happens in practice? To answer this question, let's look at a toy example. The example involves a neuron with just one input:

We'll train this neuron to do something ridiculously easy: take the input 1 to the output 0. Of course, this is such a trivial task that we could easily figure out an appropriate weight and bias by hand, without using a learning algorithm. However, it turns out to be illuminating to use gradient descent to attempt to learn a weight and bias. So let's take a look at how the neuron learns.
To make things definite, I'll pick the initial weight to be 0.6 and the initial bias to be 0.9. These are generic choices used as a place to begin learning, I wasn't picking them to be special in any way. The initial output from the neuron is 0.82, so quite a bit of learning will be needed before our neuron gets near the desired output, 0.0. Click on "Run" in the bottom right corner below to see how the neuron learns an output much closer to 0.0. Note that this isn't a pre-recorded animation, your browser is actually computing the gradient, then using the gradient to update the weight and bias, and displaying the result. The learning rate is η = 0.15, which turns out to be slow enough that we can follow what's happening, but fast enough that we can get substantial learning in just a few seconds. The cost is the quadratic cost function, C, introduced back in Chapter 1. I'll remind you of the exact form of the cost function shortly, so there's no need to go and dig up the definition. Note that you can run the animation multiple times by clicking on "Run" again.
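
The experiment is easy to reproduce outside the browser, too. The following is a minimal sketch in plain Python, not part of the book's code (the function names are mine): it implements the gradient descent step just described for a single sigmoid neuron with one input, using the quadratic cost.

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(w, b, eta=0.15, epochs=300, x=1.0, y=0.0):
    # Gradient descent on the quadratic cost C = (y - a)^2 / 2
    for _ in range(epochs):
        a = sigmoid(w * x + b)
        # dC/dw = (a - y) sigma'(z) x and dC/db = (a - y) sigma'(z),
        # where sigma'(z) = a (1 - a)
        w -= eta * (a - y) * a * (1 - a) * x
        b -= eta * (a - y) * a * (1 - a)
    return sigmoid(w * x + b)

print(train(0.6, 0.9))  # fast learning from an initial output of 0.82
print(train(2.0, 2.0))  # much slower early learning from an initial output of 0.98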

As you can see, the neuron rapidly learns a weight and bias that drives down the cost, and gives an output from the neuron of about 0.09. That's not quite the desired output, 0.0, but it is pretty good. Suppose, however, that we instead choose both the starting weight and the starting bias to be 2.0. In this case the initial output is 0.98, which is very badly wrong. Let's look at how the neuron learns to output 0 in this case. Click on "Run" again:

Although this example uses the same learning rate (η = 0.15), we can see that learning starts out much more slowly. Indeed, for the first 150 or so learning epochs, the weights and biases don't change much at all. Then the learning kicks in and, much as in our first example, the neuron's output rapidly moves closer to 0.0.

This behaviour is strange when contrasted to human learning. As I said at the beginning of this section, we often learn fastest when we're badly wrong about something. But we've just seen that our artificial neuron has a lot of difficulty learning when it's badly wrong, far more difficulty than when it's just a little wrong. What's more, it turns out that this behaviour occurs not just in this toy model, but in more general networks. Why is learning so slow? And can we find a way of avoiding this slowdown?
To understand the origin of the problem, consider that our neuron learns by changing the weight and bias at a rate determined by the partial derivatives of the cost function, ∂C/∂w and ∂C/∂b. So saying "learning is slow" is really the same as saying that those partial derivatives are small. The challenge is to understand why they are small. To understand that, let's compute the partial derivatives. Recall that we're using the quadratic cost function, which, from Equation (6), is given by

    C = \frac{(y-a)^2}{2},

where a is the neuron's output when the training input x = 1 is used, and y = 0 is the corresponding desired output. To write this more explicitly in terms of the weight and bias, recall that a = σ(z), where z = wx + b. Using the chain rule to differentiate with respect to the weight and bias we get

    \frac{\partial C}{\partial w} = (a-y)\sigma'(z) x = a \sigma'(z), \qquad (55)
    \frac{\partial C}{\partial b} = (a-y)\sigma'(z) = a \sigma'(z), \qquad (56)

where I have substituted x = 1 and y = 0. To understand the behaviour of these expressions, let's look more closely at the σ′(z) term on the right-hand side. Recall the shape of the σ function:

[Figure: graph of the sigmoid function, rising from 0.0 to 1.0, with a flat region at each end.]

We can see from this graph that when the neuron's output is close to 1, the curve gets very flat, and so σ′(z) gets very small. Equations (55) and (56) then tell us that ∂C/∂w and ∂C/∂b get very small. This is the origin of the learning slowdown. What's more, as we shall see a little later, the learning slowdown occurs for essentially the same reason in more general neural networks, not just the toy example we've been playing with.
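
To put rough numbers on the slowdown, here's a quick check, my own illustration rather than the book's, of how small σ′(z) is at the two starting points used above:

import math

def sigma_prime(z):
    a = 1.0 / (1.0 + math.exp(-z))
    return a * (1 - a)

# w = 0.6, b = 0.9 gives z = 0.6 * 1 + 0.9 = 1.5, output ~0.82
# w = b = 2.0 gives z = 2.0 * 1 + 2.0 = 4.0, output ~0.98
print(sigma_prime(1.5))  # ~0.149
print(sigma_prime(4.0))  # ~0.018, roughly eight times smaller

So at the badly-wrong starting point the gradient, and hence the speed of learning, is nearly an order of magnitude smaller, which is exactly the sluggish start we saw in the second animation.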

Introducing the cross-entropy cost function
How can we address the learning slowdown? It turns out that we can solve the problem by replacing the quadratic cost with a different cost function, known as the cross-entropy. To understand the cross-entropy, let's move a little away from our super-simple toy model. We'll suppose instead that we're trying to train a neuron with several input variables, x₁, x₂, …, corresponding weights w₁, w₂, …, and a bias, b:

The output from the neuron is, of course, a = σ(z), where z = Σⱼ wⱼxⱼ + b is the weighted sum of the inputs. We define the cross-entropy cost function for this neuron by

    C = -\frac{1}{n} \sum_x \left[ y \ln a + (1-y) \ln(1-a) \right], \qquad (57)

where n is the total number of items of training data, the sum is over all training inputs, x, and y is the corresponding desired output.
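
Translated into code, the cost is a direct transcription of the formula. Here's a hedged NumPy sketch, my own helper rather than the book's network2.py, where np.nan_to_num guards against the case where a is exactly 0 or 1:

import numpy as np

def cross_entropy_cost(a, y):
    # Average of -[y ln a + (1 - y) ln(1 - a)] over the training examples;
    # nan_to_num converts the 0 * log(0) = nan case into 0
    return -np.mean(np.nan_to_num(y * np.log(a) + (1 - y) * np.log(1 - a)))

# A confident, correct output costs almost nothing; a confident,
# wrong output is heavily penalized:
print(cross_entropy_cost(np.array([0.99]), np.array([1.0])))  # ~0.01
print(cross_entropy_cost(np.array([0.01]), np.array([1.0])))  # ~4.6
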
It's not obvious that the expression (57) fixes the learning slowdown problem. In fact, frankly, it's not even obvious that it makes sense to call this a cost function! Before addressing the learning slowdown, let's see in what sense the cross-entropy can be interpreted as a cost function.

Two properties in particular make it reasonable to interpret the cross-entropy as a cost function. First, it's non-negative, that is, C > 0. To see this, notice that: (a) all the individual terms in the sum in (57) are negative, since both logarithms are of numbers in the range 0 to 1; and (b) there is a minus sign out the front of the sum.
Second, if the neuron's actual output is close to the desired output for all training inputs, x, then the cross-entropy will be close to zero*. To see this, suppose for example that y = 0 and a ≈ 0 for some input x. This is a case when the neuron is doing a good job on that input. We see that the first term in the expression (57) for the cost vanishes, since y = 0, while the second term is just −ln(1−a) ≈ 0. A similar analysis holds when y = 1 and a ≈ 1. And so the contribution to the cost will be low provided the actual output is close to the desired output.

*To prove this I will need to assume that the desired outputs y are all either 0 or 1. This is usually the case when solving classification problems, for example, or when computing Boolean functions. To understand what happens when we don't make this assumption, see the exercises at the end of this section.
Summing up, the cross-entropy is positive, and tends toward zero as the neuron gets better at computing the desired output, y, for all training inputs, x. These are both properties we'd intuitively expect for a cost function. Indeed, both properties are also satisfied by the quadratic cost. So that's good news for the cross-entropy. But the cross-entropy cost function has the benefit that, unlike the quadratic cost, it avoids the problem of learning slowing down. To see this, let's compute the partial derivative of the cross-entropy cost with respect to the weights. We substitute a = σ(z) into (57), and apply the chain rule twice, obtaining:

    \frac{\partial C}{\partial w_j} = -\frac{1}{n} \sum_x \left( \frac{y}{\sigma(z)} - \frac{1-y}{1-\sigma(z)} \right) \frac{\partial \sigma}{\partial w_j}
                                    = -\frac{1}{n} \sum_x \left( \frac{y}{\sigma(z)} - \frac{1-y}{1-\sigma(z)} \right) \sigma'(z) x_j.

Putting everything over a common denominator and simplifying this becomes:

    \frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_x \frac{\sigma'(z) x_j}{\sigma(z)(1-\sigma(z))} (\sigma(z) - y).

Using the definition of the sigmoid function, σ(z) = 1/(1+e^{−z}), and a little algebra we can show that σ′(z) = σ(z)(1−σ(z)). I'll ask you to verify this in an exercise below, but for now let's accept it as given. We see that the σ′(z) and σ(z)(1−σ(z)) terms cancel in the equation just above, and it simplifies to become:

    \frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_x x_j (\sigma(z) - y). \qquad (61)

This is a beautiful expression. It tells us that the rate at which the weight learns is controlled by σ(z) − y, i.e., by the error in the output. The larger the error, the faster the neuron will learn. This is just what we'd intuitively expect. In particular, it avoids the learning slowdown caused by the σ′(z) term in the analogous equation for the quadratic cost, Equation (55). When we use the cross-entropy, the σ′(z) term gets canceled out, and we no longer need worry about it being small. This cancellation is the special miracle ensured by the cross-entropy cost function. Actually, it's not really a miracle. As we'll see later, the cross-entropy was specially chosen to have just this property.

In a similar way, we can compute the partial derivative for the bias. I won't go through all the details again, but you can easily verify that

    \frac{\partial C}{\partial b} = \frac{1}{n} \sum_x (\sigma(z) - y).

Again, this avoids the learning slowdown caused by the σ′(z) term in the analogous equation for the quadratic cost, Equation (56).
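
It's easy to check the simplified gradient numerically. The sketch below is mine, not from the book: it compares the expression (1/n) Σₓ x(σ(z) − y) against a finite-difference derivative of the cross-entropy for a neuron with a single weight.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, x, y):
    a = sigmoid(w * x + b)
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

w, b = 0.6, 0.9
x = np.array([1.0, 0.5, -0.3])   # three training inputs
y = np.array([0.0, 1.0, 0.0])    # desired outputs

# The gradient from the text: dC/dw = (1/n) sum_x x (sigma(z) - y)
analytic = np.mean(x * (sigmoid(w * x + b) - y))

# Finite-difference estimate of the same derivative
eps = 1e-6
numeric = (cost(w + eps, b, x, y) - cost(w - eps, b, x, y)) / (2 * eps)

print(analytic, numeric)  # the two agree to many decimal places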

Exercise
- Verify that σ′(z) = σ(z)(1−σ(z)).
Let's return to the toy example we played with earlier, and explore what happens when we use the cross-entropy instead of the quadratic cost. To re-orient ourselves, we'll begin with the case where the quadratic cost did just fine, with starting weight 0.6 and starting bias 0.9. Press "Run" to see what happens when we replace the quadratic cost by the cross-entropy:

Unsurprisingly, the neuron learns perfectly well in this instance, just as it did earlier. And now let's look at the case where our neuron got stuck before (link, for comparison), with the weight and bias both starting at 2.0:

Success! This time the neuron learned quickly, just as we hoped. If you observe closely you can see that the slope of the cost curve was much steeper initially than the initial flat region on the corresponding curve for the quadratic cost. It's that steepness which the cross-entropy buys us, preventing us from getting stuck just when we'd expect our neuron to learn fastest, i.e., when the neuron starts out badly wrong.

I didn't say what learning rate was used in the examples just illustrated. Earlier, with the quadratic cost, we used η = 0.15. Should we have used the same learning rate in the new examples? In fact, with the change in cost function it's not possible to say precisely what it means to use the "same" learning rate; it's an apples and oranges comparison. For both cost functions I simply experimented to find a learning rate that made it possible to see what is going on. If you're still curious, despite my disavowal, here's the lowdown: I used η = 0.005 in the examples just given.

You might object that the change in learning rate makes the graphs above meaningless. Who cares how fast the neuron learns, when our choice of learning rate was arbitrary to begin with?! That objection misses the point. The point of the graphs isn't about the absolute speed of learning. It's about how the speed of learning changes. In particular, when we use the quadratic cost learning is slower when the neuron is unambiguously wrong than it is later on, as the neuron gets closer to the correct output; while with the cross-entropy learning is faster when the neuron is unambiguously wrong. Those statements don't depend on how the learning rate is set.

We've been studying the cross-entropy for a single neuron. However, it's easy to generalize the cross-entropy to many-neuron multi-layer networks. In particular, suppose y = y₁, y₂, … are the desired values at the output neurons, i.e., the neurons in the final layer, while a^L_1, a^L_2, … are the actual output values. Then we define the cross-entropy by

    C = -\frac{1}{n} \sum_x \sum_j \left[ y_j \ln a^L_j + (1-y_j) \ln(1-a^L_j) \right]. \qquad (63)

This is the same as our earlier expression, Equation (57), except now we've got the Σⱼ summing over all the output neurons. I won't explicitly work through a derivation, but it should be plausible that using the expression (63) avoids a learning slowdown in many-neuron networks. If you're interested, you can work through the derivation in the problem below.
When should we use the cross-entropy instead of the quadratic cost? In fact, the cross-entropy is nearly always the better choice, provided the output neurons are sigmoid neurons. To see why, consider that when we're setting up the network we usually initialize the weights and biases using some sort of randomization. It may happen that those initial choices result in the network being decisively wrong for some training input; that is, an output neuron will have saturated near 1, when it should be 0, or vice versa. If we're using the quadratic cost that will slow down learning. It won't stop learning completely, since the weights will continue learning from other training inputs, but it's obviously undesirable.

Exercises
- One gotcha with the cross-entropy is that it can be difficult at first to remember the respective roles of the ys and the as. It's easy to get confused about whether the right form is −[y ln a + (1−y) ln(1−a)] or −[a ln y + (1−a) ln(1−y)]. What happens to the second of these expressions when y = 0 or 1? Does this problem afflict the first expression? Why or why not?

- In the single-neuron discussion at the start of this section, I argued that the cross-entropy is small if σ(z) ≈ y for all training inputs. The argument relied on y being equal to either 0 or 1. This is usually true in classification problems, but for other problems (e.g., regression problems) y can sometimes take values intermediate between 0 and 1. Show that the cross-entropy is still minimized when σ(z) = y for all training inputs. When this is the case the cross-entropy has the value:

      C = -\frac{1}{n} \sum_x \left[ y \ln y + (1-y) \ln(1-y) \right].

  The quantity −[y ln y + (1−y) ln(1−y)] is sometimes known as the binary entropy.

Problems
- Many-layer multi-neuron networks In the notation introduced in the last chapter, show that for the quadratic cost the partial derivative with respect to weights in the output layer is

      \frac{\partial C}{\partial w^L_{jk}} = \frac{1}{n} \sum_x a^{L-1}_k (a^L_j - y_j) \sigma'(z^L_j).

  The term σ′(z^L_j) causes a learning slowdown whenever an output neuron saturates on the wrong value. Show that for the cross-entropy cost the output error δ^L for a single training example x is given by

      \delta^L = a^L - y.

  Use this expression to show that the partial derivative with respect to the weights in the output layer is given by

      \frac{\partial C}{\partial w^L_{jk}} = \frac{1}{n} \sum_x a^{L-1}_k (a^L_j - y_j). \qquad (67)

  The σ′(z^L_j) term has vanished, and so the cross-entropy avoids the problem of learning slowdown, not just when used with a single neuron, as we saw earlier, but also in many-layer multi-neuron networks. A simple variation on this analysis holds also for the biases. If this is not obvious to you, then you should work through that analysis as well.
- Using the quadratic cost when we have linear neurons in the output layer Suppose that we have a many-layer multi-neuron network. Suppose all the neurons in the final layer are linear neurons, meaning that the sigmoid activation function is not applied, and the outputs are simply a^L_j = z^L_j. Show that if we use the quadratic cost function then the output error δ^L for a single training example x is given by

      \delta^L = a^L - y.

  Similarly to the previous problem, use this expression to show that the partial derivatives with respect to the weights and biases in the output layer are given by

      \frac{\partial C}{\partial w^L_{jk}} = \frac{1}{n} \sum_x a^{L-1}_k (a^L_j - y_j),
      \frac{\partial C}{\partial b^L_j} = \frac{1}{n} \sum_x (a^L_j - y_j).

  This shows that if the output neurons are linear neurons then the quadratic cost will not give rise to any problems with a learning slowdown. In this case the quadratic cost is, in fact, an appropriate cost function to use.

Using the cross-entropy to classify MNIST digits
The cross-entropy is easy to implement as part of a program which learns using gradient descent and backpropagation. We'll do that later in the chapter, developing an improved version of our earlier program for classifying the MNIST handwritten digits, network.py. The new program is called network2.py, and incorporates not just the cross-entropy, but also several other techniques developed in this chapter*. For now, let's look at how well our new program classifies MNIST digits. As was the case in Chapter 1, we'll use a network with 30 hidden neurons, and we'll use a mini-batch size of 10. We set the learning rate to η = 0.5** and we train for 30 epochs. The interface to network2.py is slightly different than network.py, but it should still be clear what is going on. You can, by the way, get documentation about network2.py's interface by using commands such as help(network2.Network.SGD) in a Python shell.

*The code is available on GitHub.

**In Chapter 1 we used the quadratic cost and a learning rate of η = 3.0. As discussed above, it's not possible to say precisely what it means to use the "same" learning rate when the cost function is changed. For both cost functions I experimented to find a learning rate that provides near-optimal performance, given the other hyper-parameter choices. There is, incidentally, a very rough general heuristic for relating the learning rate for the cross-entropy and the quadratic cost. As we saw earlier, the gradient terms for the quadratic cost have an extra σ′ = σ(1−σ) term in them. Suppose we average this over values for σ, ∫₀¹ dσ σ(1−σ) = 1/6. We see that (very roughly) the quadratic cost learns an average of 6 times slower, for the same learning rate. This suggests that a reasonable starting point is to divide the learning rate for the quadratic cost by 6. Of course, this argument is far from rigorous, and shouldn't be taken too seriously. Still, it can sometimes be a useful starting point.
>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
>>> import network2
>>> net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
>>> net.large_weight_initializer()
>>> net.SGD(training_data, 30, 10, 0.5, evaluation_data=test_data,
... monitor_evaluation_accuracy=True)

Note, by the way, that the net.large_weight_initializer() command is used to initialize the weights and biases in the same way as described in Chapter 1. We need to run this command because later in this chapter we'll change the default weight initialization in our networks. The result from running the above sequence of commands is a network with 95.49 percent accuracy. This is pretty close to the result we obtained in Chapter 1, 95.42 percent, using the quadratic cost.
Let's look also at the case where we use 100 hidden neurons, the cross-entropy, and otherwise keep the parameters the same. In this case we obtain an accuracy of 96.82 percent. That's a substantial improvement over the results from Chapter 1, where we obtained a classification accuracy of 96.59 percent, using the quadratic cost. That may look like a small change, but consider that the error rate has dropped from 3.41 percent to 3.18 percent. That is, we've eliminated about one in fourteen of the original errors. That's quite a handy improvement.
It's encouraging that the cross-entropy cost gives us similar or better results than the quadratic cost. However, these results don't conclusively prove that the cross-entropy is a better choice. The reason is that I've put only a little effort into choosing hyper-parameters such as learning rate, mini-batch size, and so on. For the improvement to be really convincing we'd need to do a thorough job optimizing such hyper-parameters. Still, the results are encouraging, and reinforce our earlier theoretical argument that the cross-entropy is a better choice than the quadratic cost.

This, by the way, is part of a general pattern that we'll see through this chapter and, indeed, through much of the rest of the book. We'll develop a new technique, we'll try it out, and we'll get "improved" results. It is, of course, nice that we see such improvements. But the interpretation of such improvements is always problematic. They're only truly convincing if we see an improvement after putting tremendous effort into optimizing all the other hyper-parameters. That's a great deal of work, requiring lots of computing power, and we're not usually going to do such an exhaustive investigation. Instead, we'll proceed on the basis of informal tests like those done above. Still, you should keep in mind that such tests fall short of definitive proof, and remain alert to signs that the arguments are breaking down.
By now, we've discussed the cross-entropy at great length. Why go to so much effort when it gives only a small improvement to our MNIST results? Later in the chapter we'll see other techniques, notably regularization, which give much bigger improvements. So why so much focus on cross-entropy? Part of the reason is that the cross-entropy is a widely-used cost function, and so is worth understanding well. But the more important reason is that neuron saturation is an important problem in neural nets, a problem we'll return to repeatedly throughout the book. And so I've discussed the cross-entropy at length because it's a good laboratory to begin understanding neuron saturation and how it may be addressed.

What does the cross-entropy mean? Where does it come from?
Our discussion of the cross-entropy has focused on algebraic analysis and practical implementation. That's useful, but it leaves unanswered broader conceptual questions, like: what does the cross-entropy mean? Is there some intuitive way of thinking about the cross-entropy? And how could we have dreamed up the cross-entropy in the first place?
Let's begin with the last of these questions: what could have motivated us to think up the cross-entropy in the first place? Suppose we'd discovered the learning slowdown described earlier, and understood that the origin was the σ′(z) terms in Equations (55) and (56). After staring at those equations for a bit, we might wonder if it's possible to choose a cost function so that the σ′(z) term disappeared. In that case, the cost C = Cₓ for a single training example x would satisfy

    \frac{\partial C}{\partial w_j} = x_j (a - y), \qquad (71)
    \frac{\partial C}{\partial b} = (a - y). \qquad (72)

If we could choose the cost function to make these equations true, then they would capture in a simple way the intuition that the greater the initial error, the faster the neuron learns. They'd also eliminate the problem of a learning slowdown. In fact, starting from these equations we'll now show that it's possible to derive the form
of the cross-entropy, simply by following our mathematical noses. To see this, note that from the chain rule we have

    \frac{\partial C}{\partial b} = \frac{\partial C}{\partial a} \sigma'(z).

Using σ′(z) = σ(z)(1−σ(z)) = a(1−a) the last equation becomes

    \frac{\partial C}{\partial b} = \frac{\partial C}{\partial a} a(1-a).

Comparing to Equation (72) we obtain

    \frac{\partial C}{\partial a} = \frac{a - y}{a(1-a)}.

Integrating this expression with respect to a gives

    C = -[y \ln a + (1-y) \ln(1-a)] + \text{constant},

for some constant of integration. This is the contribution to the cost from a single training example, x. To get the full cost function we must average over training examples, obtaining

    C = -\frac{1}{n} \sum_x [y \ln a + (1-y) \ln(1-a)] + \text{constant},

where the constant here is the average of the individual constants for each training example. And so we see that Equations (71) and (72) uniquely determine the form of the cross-entropy, up to an overall constant term. The cross-entropy isn't something that was miraculously pulled out of thin air. Rather, it's something that we could have discovered in a simple and natural way.
What about the intuitive meaning of the cross-entropy? How should we think about it? Explaining this in depth would take us further afield than I want to go. However, it is worth mentioning that there is a standard way of interpreting the cross-entropy that comes from the field of information theory. Roughly speaking, the idea is that the cross-entropy is a measure of surprise. In particular, our neuron is trying to compute the function x → y = y(x). But instead it computes the function x → a = a(x). Suppose we think of a as our neuron's estimated probability that y is 1, and 1−a is the estimated probability that the right value for y is 0. Then the cross-entropy measures how "surprised" we are, on average, when we learn the true value for y. We get low surprise if the output is what we expect, and high surprise if the output is unexpected. Of course, I haven't said exactly what "surprise" means, and so this perhaps seems like empty verbiage. But in fact there is a precise information-theoretic way of saying what is meant by surprise. Unfortunately, I don't know of a good, short, self-contained discussion of this subject that's available online. But if you want to dig deeper, then Wikipedia contains a brief summary that will get you started down the right track. And the details can be filled in by working through the materials about the Kraft inequality in chapter 5 of the book about information theory by Cover and Thomas.

Problem
- We've discussed at length the learning slowdown that can occur when output neurons saturate, in networks using the quadratic cost to train. Another factor that may inhibit learning is the presence of the xⱼ term in Equation (61). Because of this term, when an input xⱼ is near to zero, the corresponding weight wⱼ will learn slowly. Explain why it is not possible to eliminate the xⱼ term through a clever choice of cost function.

Softmax
In this chapter we'll mostly use the cross-entropy cost to address the problem of learning slowdown. However, I want to briefly describe another approach to the problem, based on what are called softmax layers of neurons. We're not actually going to use softmax layers in the remainder of the chapter, so if you're in a great hurry, you can skip to the next section. However, softmax is still worth understanding, in part because it's intrinsically interesting, and in part because we'll use softmax layers in Chapter 6, in our discussion of deep neural networks.
The idea of softmax is to define a new type of output layer for our neural networks. It begins in the same way as with a sigmoid layer, by forming the weighted inputs* z^L_j = Σₖ w^L_jk a^{L−1}_k + b^L_j. However, we don't apply the sigmoid function to get the output. Instead, in a softmax layer we apply the so-called softmax function to the z^L_j. According to this function, the activation a^L_j of the jth output neuron is

    a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}}, \qquad (78)

where in the denominator we sum over all the output neurons.

*In describing the softmax we'll make frequent use of notation introduced in the last chapter. You may wish to revisit that chapter if you need to refresh your memory about the meaning of the notation.
If you're not familiar with the softmax function, Equation (78) may look pretty opaque. It's certainly not obvious why we'd want to use this function. And it's also not obvious that this will help us address the learning slowdown problem. To better understand Equation (78), suppose we have a network with four output neurons, and four corresponding weighted inputs, which we'll denote z^L_1, z^L_2, z^L_3, and z^L_4. Shown below are adjustable sliders showing possible values for the weighted inputs, and a graph of the corresponding output activations. A good place to start exploration is by using the bottom slider to increase z^L_4:

[Interactive sliders: for example, z^L_1 = 2.5, z^L_2 = −1, z^L_3 = 3.2, z^L_4 = 0.5 give output activations a^L_1 = 0.315, a^L_2 = 0.009, a^L_3 = 0.633, a^L_4 = 0.043.]

As you increase z^L_4, you'll see an increase in the corresponding output activation, a^L_4, and a decrease in the other output activations. Similarly, if you decrease z^L_4 then a^L_4 will decrease, and all the other output activations will increase. In fact, if you look closely, you'll see that in both cases the total change in the other activations exactly compensates for the change in a^L_4. The reason is that the output activations are guaranteed to always sum up to 1, as we can prove using Equation (78) and a little algebra:

    \sum_j a^L_j = \frac{\sum_j e^{z^L_j}}{\sum_k e^{z^L_k}} = 1.

As a result, if a^L_4 increases, then the other output activations must decrease by the same total amount, to ensure the sum over all activations remains 1. And, of course, similar statements hold for all the other activations.
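
In code, the softmax function is only a couple of lines. Here's a sketch of my own; the one wrinkle, subtracting max(z) before exponentiating, guards against overflow and doesn't change the result, since the shift cancels between numerator and denominator:

import numpy as np

def softmax(z):
    # a_j = exp(z_j) / sum_k exp(z_k), computed stably
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

# The weighted inputs from the sliders above:
a = softmax(np.array([2.5, -1.0, 3.2, 0.5]))
print(a)         # ~[0.315, 0.009, 0.633, 0.043]
print(a.sum())   # 1.0, a probability distribution
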
Equation (78) also implies that the output activations are all positive, since the exponential function is positive. Combining this with the observation in the last paragraph, we see that the output from the softmax layer is a set of positive numbers which sum up to 1. In other words, the output from the softmax layer can be thought of as a probability distribution.

The fact that a softmax layer outputs a probability distribution is rather pleasing. In many problems it's convenient to be able to interpret the output activation a^L_j as the network's estimate of the probability that the correct output is j. So, for instance, in the MNIST classification problem, we can interpret a^L_j as the network's estimated probability that the correct digit classification is j.

By contrast, if the output layer was a sigmoid layer, then we certainly couldn't assume that the activations formed a probability distribution. I won't explicitly prove it, but it should be plausible that the activations from a sigmoid layer won't in general form a probability distribution. And so with a sigmoid output layer we don't have such a simple interpretation of the output activations.

Exercise
- Construct an example showing explicitly that in a network with a sigmoid output layer, the output activations a^L_j won't always sum to 1.
We're starting to build up some feel for the softmax function and the way softmax layers behave. Just to review where we're at: the exponentials in Equation (78) ensure that all the output activations are positive. And the sum in the denominator of Equation (78) ensures that the softmax outputs sum to 1. So that particular form no longer appears so mysterious: rather, it is a natural way to ensure that the output activations form a probability distribution. You can think of softmax as a way of rescaling the z^L_j, and then squishing them together to form a probability distribution.

Exercises
- Monotonicity of softmax Show that ∂a^L_j/∂z^L_k is positive if j = k and negative if j ≠ k. As a consequence, increasing z^L_j is guaranteed to increase the corresponding output activation, a^L_j, and will decrease all the other output activations. We already saw this empirically with the sliders, but this is a rigorous proof.

- Non-locality of softmax A nice thing about sigmoid layers is that the output a^L_j is a function of the corresponding weighted input, a^L_j = σ(z^L_j). Explain why this is not the case for a softmax layer: any particular output activation a^L_j depends on all the weighted inputs.

Problem
- Inverting the softmax layer Suppose we have a neural network with a softmax output layer, and the activations a^L_j are known. Show that the corresponding weighted inputs have the form z^L_j = ln a^L_j + C, for some constant C that is independent of j.
The learning slowdown problem: We've now built up considerable familiarity with softmax layers of neurons. But we haven't yet seen how a softmax layer lets us address the learning slowdown problem. To understand that, let's define the log-likelihood cost function. We'll use x to denote a training input to the network, and y to denote the corresponding desired output. Then the log-likelihood cost associated to this training input is

    C \equiv -\ln a^L_y.

So, for instance, if we're training with MNIST images, and input an image of a 7, then the log-likelihood cost is −ln a^L_7. To see that this makes intuitive sense, consider the case when the network is doing a good job, that is, it is confident the input is a 7. In that case it will estimate a value for the corresponding probability a^L_7 which is close to 1, and so the cost −ln a^L_7 will be small. By contrast, when the network isn't doing such a good job, the probability a^L_7 will be smaller, and the cost −ln a^L_7 will be larger. So the log-likelihood cost behaves as we'd expect a cost function to behave.
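
In code the log-likelihood cost is a one-liner. A sketch (again my own, with made-up weighted inputs for illustration):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def log_likelihood_cost(a, y):
    # C = -ln a_y, where y is the index of the desired output
    return -np.log(a[y])

# Ten invented weighted inputs for an MNIST-style output layer,
# with the largest at index 7:
a = softmax(np.array([0.2, -1.3, 0.1, 3.0, -0.5, 0.9, 0.0, 4.2, 0.3, 1.1]))
print(log_likelihood_cost(a, 7))  # small, since the network favours the 7
print(log_likelihood_cost(a, 3))  # larger, for a less probable digit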

What about the learning slowdown problem? To analyze that, recall that the key to the learning slowdown is the behaviour of the quantities ∂C/∂w^L_jk and ∂C/∂b^L_j. I won't go through the derivation explicitly (I'll ask you to do it in the problems, below), but with a little algebra you can show that*

    \frac{\partial C}{\partial b^L_j} = a^L_j - y_j, \qquad (81)
    \frac{\partial C}{\partial w^L_{jk}} = a^{L-1}_k (a^L_j - y_j). \qquad (82)

*Note that I'm abusing notation here, using y in a slightly different way to the last paragraph. In the last paragraph we used y to denote the desired output from the network, e.g., output a "7" if an image of a 7 was input. But in the equations which follow I'm using y to denote the vector of output activations which corresponds to 7, that is, a vector which is all 0s, except for a 1 in the 7th location.

These equations are the same as the analogous expressions obtained in our earlier analysis of the cross-entropy. Compare, for example, Equation (82) to Equation (67). It's the same equation, albeit in the latter I've averaged over training instances. And, just as in the earlier analysis, these expressions ensure that we will not encounter a learning slowdown. In fact, it's useful to think of a softmax output layer with log-likelihood cost as being quite similar to a sigmoid output layer with cross-entropy cost.
Given this similarity, should you use a sigmoid output layer and cross-entropy, or a softmax output layer and log-likelihood? In fact, in many situations both approaches work well. Through the remainder of this chapter we'll use a sigmoid output layer, with the cross-entropy cost. Later, in Chapter 6, we'll sometimes use a softmax output layer, with log-likelihood cost. The reason for the switch is to make some of our later networks more similar to networks found in certain influential academic papers. As a more general point of principle, softmax plus log-likelihood is worth using whenever you want to interpret the output activations as probabilities. That's not always a concern, but can be useful with classification problems (like MNIST) involving disjoint classes.

Problems
- Derive Equations (81) and (82).

- Where does the "softmax" name come from? Suppose we change the softmax function so the output activations are given by

      a^L_j = \frac{e^{c z^L_j}}{\sum_k e^{c z^L_k}},

  where c is a positive constant. Note that c = 1 corresponds to the standard softmax function. But if we use a different value of c we get a different function, which is nonetheless qualitatively rather similar to the softmax. In particular, show that the output activations form a probability distribution, just as for the usual softmax. Suppose we allow c to become large, i.e., c → ∞. What is the limiting value for the output activations a^L_j? After solving this problem it should be clear to you why we think of the c = 1 function as a "softened" version of the maximum function. This is the origin of the term "softmax".
- Backpropagation with softmax and the log-likelihood cost In the last chapter we derived the backpropagation algorithm for a network containing sigmoid layers. To apply the algorithm to a network with a softmax layer we need to figure out an expression for the error δ^L_j ≡ ∂C/∂z^L_j in the final layer. Show that a suitable expression is:

      \delta^L_j = a^L_j - y_j.

  Using this expression we can apply the backpropagation algorithm to a network using a softmax output layer and the log-likelihood cost.

Overfitting and regularization
The Nobel prize-winning physicist Enrico Fermi was once asked his opinion of a mathematical model some colleagues had proposed as the solution to an important unsolved physics problem. The model gave excellent agreement with experiment, but Fermi was skeptical. He asked how many free parameters could be set in the model. "Four" was the answer. Fermi replied*: "I remember my friend Johnny von Neumann used to say, with four parameters I can fit an elephant, and with five I can make him wiggle his trunk.".

*The quote comes from a charming article by Freeman Dyson, who is one of the people who proposed the flawed model. A four-parameter elephant may be found here.

The point, of course, is that models with a large number of free parameters can describe an amazingly wide range of phenomena. Even if such a model agrees well with the available data, that
doesn't make it a good model. It may just mean there's enough freedom in the model that it can describe almost any data set of the given size, without capturing any genuine insights into the underlying phenomenon. When that happens the model will work well for the existing data, but will fail to generalize to new situations. The true test of a model is its ability to make predictions in situations it hasn't been exposed to before.

Fermi and von Neumann were suspicious of models with four parameters. Our 30 hidden neuron network for classifying MNIST digits has nearly 24,000 parameters! That's a lot of parameters. Our 100 hidden neuron network has nearly 80,000 parameters, and state-of-the-art deep neural nets sometimes contain millions or even billions of parameters. Should we trust the results?
Let's sharpen this problem up by constructing a situation where our network does a bad job generalizing to new situations. We'll use our 30 hidden neuron network, with its 23,860 parameters. But we won't train the network using all 50,000 MNIST training images. Instead, we'll use just the first 1,000 training images. Using that restricted set will make the problem with generalization much more evident. We'll train in a similar way to before, using the cross-entropy cost function, with a learning rate of η = 0.5 and a mini-batch size of 10. However, we'll train for 400 epochs, a somewhat larger number than before, because we're not using as many training examples. Let's use network2 to look at the way the cost function changes:

>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
>>> import network2
>>> net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
>>> net.large_weight_initializer()
>>> net.SGD(training_data[:1000], 400, 10, 0.5, evaluation_data=test_data,
... monitor_evaluation_accuracy=True, monitor_training_cost=True)

Using the results we can plot the way the cost changes as the network learns*:

*This and the next four graphs were generated by the program overfitting.py.

[Figure: cost on the training data, falling smoothly over epochs 200 to 399.]

This looks encouraging, showing a smooth decrease in the cost, just as we expect. Note that I've only shown training epochs 200 through 399. This gives us a nice up-close view of the later stages of learning, which, as we'll see, turns out to be where the interesting action is.
Let's now look at how the classification accuracy on the test data changes over time:

[Figure: classification accuracy on the test data over epochs 200 to 399, flattening out near epoch 280.]

Again, I've zoomed in quite a bit. In the first 200 epochs (not shown) the accuracy rises to just under 82 percent. The learning then gradually slows down. Finally, at around epoch 280 the classification accuracy pretty much stops improving. Later epochs merely see small stochastic fluctuations near the value of the accuracy at epoch 280. Contrast this with the earlier graph, where the cost associated to the training data continues to smoothly drop.
If we just look at that cost, it appears that our model is still getting "better". But the test accuracy results show the improvement is an illusion. Just like the model that Fermi disliked, what our network learns after epoch 280 no longer generalizes to the test data. And so it's not useful learning. We say the network is overfitting or overtraining beyond epoch 280.
You might wonder if the problem here is that I'm looking at the cost on the training data, as opposed to the classification accuracy on the test data. In other words, maybe the problem is that we're making an apples and oranges comparison. What would happen if we compared the cost on the training data with the cost on the test data, so we're comparing similar measures? Or perhaps we could compare the classification accuracy on both the training data and the test data? In fact, essentially the same phenomenon shows up no matter how we do the comparison. The details do change, however. For instance, let's look at the cost on the test data:

[Figure: cost on the test data, improving until around epoch 15 and worsening thereafter.]

We can see that the cost on the test data improves until around epoch 15, but after that it actually starts to get worse, even though the cost on the training data is continuing to get better. This is another sign that our model is overfitting. It poses a puzzle, though, which is whether we should regard epoch 15 or epoch 280 as the point at which overfitting is coming to dominate learning? From a practical point of view, what we really care about is improving classification accuracy on the test data, while the cost on the test data is no more than a proxy for classification accuracy. And so it makes most sense to regard epoch 280 as the point beyond which overfitting is dominating learning in our neural network.
Another sign of overfitting may be seen in the classification accuracy on the training data:

[Figure: classification accuracy on the training data, rising to 100 percent.]

The accuracy rises all the way up to 100 percent. That is, our network correctly classifies all 1,000 training images! Meanwhile, our test accuracy tops out at just 82.27 percent. So our network really is learning about peculiarities of the training set, not just recognizing digits in general. It's almost as though our network is merely memorizing the training set, without understanding digits well enough to generalize to the test set.
Overfitting is a major problem in neural networks. This is especially true in modern networks, which often have very large numbers of weights and biases. To train effectively, we need a way of detecting when overfitting is going on, so we don't overtrain. And we'd like to have techniques for reducing the effects of overfitting.
The obvious way to detect overfitting is to use the approach above, keeping track of accuracy on the test data as our network trains. If we see that the accuracy on the test data is no longer improving, then we should stop training. Of course, strictly speaking, this is not necessarily a sign of overfitting. It might be that accuracy on the test data and the training data both stop improving at the same time. Still, adopting this strategy will prevent overfitting.
In fact, we'll use a variation on this strategy. Recall that when we load in the MNIST data we load in three data sets:

>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()

Up to now we've been using the training_data and test_data, and ignoring the validation_data. The validation_data contains 10,000 images of digits, images which are different from the 50,000 images in the MNIST training set, and the 10,000 images in the MNIST test set. Instead of using the test_data to prevent overfitting, we will use the validation_data. To do this, we'll use much the same strategy as was described above for the test_data. That is, we'll compute the classification accuracy on the validation_data at the end of each epoch. Once the classification accuracy on the validation_data has saturated, we stop training. This strategy is called early stopping. Of course, in practice we won't immediately know when the accuracy has saturated. Instead, we continue training until we're confident that the accuracy has saturated*.

*It requires some judgment to determine when to stop. In my earlier graphs I identified epoch 280 as the place at which accuracy saturated. It's possible that was too pessimistic. Neural networks sometimes plateau for a while in training, before continuing to improve. I wouldn't be surprised if more learning could have occurred even after epoch 400, although the magnitude of any further improvement would likely be small. So it's possible to adopt more or less aggressive strategies for early stopping.

Why use the validation_data to prevent overfitting, rather than the test_data? In fact, this is part of a more general strategy, which is to use the validation_data to evaluate different trial choices of hyper-parameters such as the number of epochs to train for, the learning rate, the best network architecture, and so on. We use such evaluations to find and set good values for the hyper-parameters. Indeed, although I haven't mentioned it until now, that is, in part, how I arrived at the hyper-parameter choices made earlier in this book. (More on this later.)
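
To make the strategy concrete, here's what a simple form of early stopping might look like. This is my own sketch, not network2.py's interface: train_one_epoch and accuracy are hypothetical helpers, and the no-improvement-for-ten-epochs rule is just one of the more or less aggressive stopping criteria mentioned in the footnote above.

def train_with_early_stopping(net, training_data, validation_data,
                              max_epochs=400, patience=10):
    # Stop once validation accuracy has failed to improve for
    # `patience` consecutive epochs.  train_one_epoch(net, data) and
    # accuracy(net, data) are assumed (hypothetical) helpers.
    best_acc, best_epoch = 0.0, 0
    for epoch in range(max_epochs):
        train_one_epoch(net, training_data)
        acc = accuracy(net, validation_data)
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            break  # accuracy looks saturated
    return best_acc
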
Of course, that doesn't in any way answer the question of why we're using the validation_data to prevent overfitting, rather than the test_data. Instead, it replaces it with a more general question, which is why we're using the validation_data rather than the test_data to set good hyper-parameters? To understand why, consider that when setting hyper-parameters we're likely to try many different choices for the hyper-parameters. If we set the hyper-parameters based on evaluations of the test_data it's possible we'll end up overfitting our hyper-parameters to the test_data. That is, we may end up finding hyper-parameters which fit particular peculiarities of the test_data, but where the performance of the network won't generalize to other data sets. We guard against that by figuring out the hyper-parameters using the validation_data. Then, once we've got the hyper-parameters we want, we do a final evaluation of accuracy using the test_data. That gives us confidence that our results on the test_data are a true measure of how well our neural network generalizes. To put it another way, you can think of the validation data as a type of training data that helps us learn good hyper-parameters. This approach to finding good hyper-parameters is sometimes known as the hold out method, since the validation_data is kept apart or "held out" from the training_data.

Now, in practice, even after evaluating performance on the test_data we may change our minds and want to try another approach, perhaps a different network architecture, which will involve finding a new set of hyper-parameters. If we do this, isn't there a danger we'll end up overfitting to the test_data as well? Do we need a potentially infinite regress of data sets, so we can be confident our results will generalize? Addressing this concern fully is a deep and difficult problem. But for our practical purposes, we're not going to worry too much about this question. Instead, we'll plunge ahead, using the basic hold out method, based on the training_data, validation_data, and test_data, as described above.

We've been looking so far at overfitting when we're just using 1,000 training images. What happens when we use the full training set of 50,000 images? We'll keep all the other parameters the same (30 hidden neurons, learning rate 0.5, mini-batch size of 10), but train using all 50,000 images for 30 epochs. Here's a graph showing the results for the classification accuracy on both the training data and the test data. Note that I've used the test data here, rather than the validation data, in order to make the results more directly comparable with the earlier graphs.

[Figure: training and test accuracy over 30 epochs, with a much smaller gap between the two curves.]
As you can see, the accuracy on the test and training data remain much closer together than when we were using 1,000 training examples. In particular, the best classification accuracy of 97.86 percent on the training data is only 2.53 percent higher than the 95.33 percent on the test data. That's compared to the 17.73 percent gap we had earlier! Overfitting is still going on, but it's been greatly reduced. Our network is generalizing much better from the training data to the test data. In general, one of the best ways of reducing overfitting is to increase the size of the training data. With enough training data it is difficult for even a very large network to overfit. Unfortunately, training data can be expensive or difficult to acquire, so this is not always a practical option.

Regularization
Increasing the amount of training data is one way of reducing overfitting. Are there other ways we can reduce the extent to which overfitting occurs? One possible approach is to reduce the size of our network. However, large networks have the potential to be more powerful than small networks, and so this is an option we'd only adopt reluctantly.
Fortunately, there are other techniques which can reduce overfitting, even when we have a fixed network and fixed training data. These are known as regularization techniques. In this section I describe one of the most commonly used regularization techniques, a technique sometimes known as weight decay or L2 regularization. The idea of L2 regularization is to add an extra term to the cost function, a term called the regularization term. Here's the regularized cross-entropy:

    C = -\frac{1}{n} \sum_{xj} \left[ y_j \ln a^L_j + (1-y_j) \ln(1-a^L_j) \right] + \frac{\lambda}{2n} \sum_w w^2. \qquad (85)

The first term is just the usual expression for the cross-entropy. But we've added a second term, namely the sum of the squares of all the weights in the network. This is scaled by a factor λ/2n, where λ > 0 is known as the regularization parameter, and n is, as usual, the size of our training set. I'll discuss later how λ is chosen. It's also worth noting that the regularization term doesn't include the biases. I'll also come back to that below.
Of course, it's possible to regularize other cost functions, such as the quadratic cost. This can be done in a similar way:

    C = \frac{1}{2n} \sum_x \|y - a^L\|^2 + \frac{\lambda}{2n} \sum_w w^2. \qquad (86)

In both cases we can write the regularized cost function as

    C = C_0 + \frac{\lambda}{2n} \sum_w w^2, \qquad (87)

where C₀ is the original, unregularized cost function.
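
In code, the regularization term is one line on top of whatever the unregularized cost C₀ was. A minimal sketch of my own, not the book's code:

import numpy as np

def regularized_cost(c0, weights, lmbda, n):
    # C = C0 + (lmbda / 2n) * sum of the squares of all the weights;
    # `weights` is a list of weight matrices, one per layer
    return c0 + (lmbda / (2.0 * n)) * sum(np.sum(w ** 2) for w in weights)

weights = [np.random.randn(30, 784), np.random.randn(10, 30)]
print(regularized_cost(0.5, weights, lmbda=5.0, n=50000))
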
Intuitively, the effect of regularization is to make it so the network prefers to learn small weights, all other things being equal. Large weights will only be allowed if they considerably improve the first part of the cost function. Put another way, regularization can be viewed as a way of compromising between finding small weights and minimizing the original cost function. The relative importance of the two elements of the compromise depends on the value of λ: when λ is small we prefer to minimize the original cost function, but when λ is large we prefer small weights.
Now, it's really not at all obvious why making this kind of compromise should help reduce overfitting! But it turns out that it does. We'll address the question of why it helps in the next section. But first, let's work through an example showing that regularization really does reduce overfitting.
To construct such an example, we first need to figure out how to apply our stochastic gradient descent learning algorithm in a regularized neural network. In particular, we need to know how to compute the partial derivatives ∂C/∂w and ∂C/∂b for all the weights and biases in the network. Taking the partial derivatives of Equation (87) gives

    \frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} w,
    \frac{\partial C}{\partial b} = \frac{\partial C_0}{\partial b}.

The ∂C₀/∂w and ∂C₀/∂b terms can be computed using backpropagation, as described in the last chapter. And so we see that it's easy to compute the gradient of the regularized cost function: just use backpropagation, as usual, and then add (λ/n)w to the partial derivative of all the weight terms. The partial derivatives with respect to the biases are unchanged, and so the gradient descent learning rule for the biases doesn't change from the usual rule:
    b \rightarrow b - \eta \frac{\partial C_0}{\partial b}.

The learning rule for the weights becomes:

    w \rightarrow w - \eta \frac{\partial C_0}{\partial w} - \frac{\eta \lambda}{n} w
      = \left(1 - \frac{\eta \lambda}{n}\right) w - \eta \frac{\partial C_0}{\partial w}.

This is exactly the same as the usual gradient descent learning rule, except we first rescale the weight w by a factor 1 − ηλ/n. This rescaling is sometimes referred to as weight decay, since it makes the weights smaller. At first glance it looks as though this means the weights are being driven unstoppably toward zero. But that's not right, since the other term may lead the weights to increase, if so doing causes a decrease in the unregularized cost function.
Okay, that's how gradient descent works. What about stochastic gradient descent? Well, just as in unregularized stochastic gradient descent, we can estimate ∂C₀/∂w by averaging over a mini-batch of m training examples. Thus the regularized learning rule for stochastic gradient descent becomes (c.f. Equation (20))

    w \rightarrow \left(1 - \frac{\eta \lambda}{n}\right) w - \frac{\eta}{m} \sum_x \frac{\partial C_x}{\partial w},

where the sum is over training examples x in the mini-batch, and Cₓ is the (unregularized) cost for each training example. This is exactly the same as the usual rule for stochastic gradient descent, except for the 1 − ηλ/n weight decay factor. Finally, and for completeness, let me state the regularized learning rule for the biases. This is, of course, exactly the same as in the unregularized case (c.f. Equation (21)),

    b \rightarrow b - \frac{\eta}{m} \sum_x \frac{\partial C_x}{\partial b},

where the sum is over training examples x in the mini-batch.
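
Expressed in code, one regularized update step looks like the sketch below. This is my own notation, not a quote from network2.py: nabla_w is assumed to hold the sums Σₓ ∂Cₓ/∂w computed by backpropagation over the mini-batch.

def update_weights(weights, nabla_w, eta, lmbda, n, m):
    # w -> (1 - eta * lmbda / n) w - (eta / m) * sum_x dC_x/dw,
    # where n is the size of the training set and m the mini-batch size
    return [(1 - eta * lmbda / n) * w - (eta / m) * nw
            for w, nw in zip(weights, nabla_w)]

The biases would be updated with the usual unregularized rule.
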
Let's see how regularization changes the performance of our neural network. We'll use a network with 30 hidden neurons, a mini-batch size of 10, a learning rate of 0.5, and the cross-entropy cost function. However, this time we'll use a regularization parameter of λ = 0.1. Note that in the code, we use the variable name lmbda, because lambda is a reserved word in Python, with an unrelated meaning. I've also used the test_data again, not the validation_data. Strictly speaking, we should use the validation_data, for all the reasons we discussed earlier. But I decided to use the test_data because it makes the results more directly comparable with our earlier, unregularized results. You can easily change the code to use the validation_data instead, and you'll find that it gives similar results.

>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
>>> import network2
>>> net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
>>> net.large_weight_initializer()
>>> net.SGD(training_data[:1000], 400, 10, 0.5,
... evaluation_data=test_data, lmbda=0.1,
... monitor_evaluation_cost=True, monitor_evaluation_accuracy=True,
... monitor_training_cost=True, monitor_training_accuracy=True)

The cost on the training data decreases over the whole time, much as it did in the earlier, unregularized case*:

*This and the next two graphs were produced with the program overfitting.py.

[Figure: cost on the 1,000-image training set, decreasing over 400 epochs.]

But this time the accuracy on the test_data continues to increase for the entire 400 epochs:

[Figure: test accuracy climbing to about 87 percent over 400 epochs.]

Clearly, the use of regularization has suppressed overfitting. What's more, the accuracy is considerably higher, with a peak classification accuracy of 87.1 percent, compared to the peak of 82.27 percent obtained in the unregularized case. Indeed, we could almost certainly get considerably better results by continuing to train past 400 epochs. It seems that, empirically, regularization is causing our network to generalize better, and considerably reducing the effects of overfitting.

What happens if we move out of the artificial environment of just having 1,000 training images, and return to the full 50,000 image training set? Of course, we've seen already that overfitting is much less of a problem with the full 50,000 images. Does regularization help any further? Let's keep the hyper-parameters the same as before: 30 epochs, learning rate 0.5, mini-batch size of 10. However, we need to modify the regularization parameter. The reason is because the size n of the training set has changed from n = 1,000 to n = 50,000, and this changes the weight decay factor 1 − ηλ/n. If we continued to use λ = 0.1 that would mean much less weight decay, and thus much less of a regularization effect. We compensate by changing to λ = 5.0.
Okay, let's train our network, stopping first to re-initialize the weights:

>>> net.large_weight_initializer()
>>> net.SGD(training_data, 30, 10, 0.5,
... evaluation_data=test_data, lmbda=5.0,
... monitor_evaluation_accuracy=True, monitor_training_accuracy=True)

We obtain the results:

[Figure: training and test accuracy on the full 50,000-image training set, over 30 epochs.]

There's lots of good news here. First, our classification accuracy on the test data is up, from 95.49 percent when running unregularized, to 96.49 percent. That's a big improvement. Second, we can see that the gap between results on the training and test data is much narrower than before, running at under a percent. That's still a significant gap, but we've obviously made substantial progress reducing overfitting.
Finally, let's see what test classification accuracy we get when we use 100 hidden neurons and a regularization parameter of λ = 5.0. I won't go through a detailed analysis of overfitting here, this is purely for fun, just to see how high an accuracy we can get when we use our new tricks: the cross-entropy cost function and L2 regularization.

>>> net = network2.Network([784, 100, 10], cost=network2.CrossEntropyCost)
>>> net.large_weight_initializer()
>>> net.SGD(training_data, 30, 10, 0.5, lmbda=5.0,
... evaluation_data=validation_data,
... monitor_evaluation_accuracy=True)

The final result is a classification accuracy of 97.92 percent on the validation data. That's a big jump from the 30 hidden neuron case. In fact, tuning just a little more, to run for 60 epochs at η = 0.1 and λ = 5.0 we break the 98 percent barrier, achieving 98.04 percent classification accuracy on the validation data. Not bad for what turns out to be 152 lines of code!
I've described regularization as a way to reduce overfitting and to increase classification accuracies. In fact, that's not the only benefit. Empirically, when doing multiple runs of our MNIST networks, but with different (random) weight initializations, I've found that the unregularized runs will occasionally get "stuck", apparently caught in local minima of the cost function. The result is that different runs sometimes provide quite different results. By contrast, the regularized runs have provided much more easily replicable results.

Why is this going on? Heuristically, if the cost function is unregularized, then the length of the weight vector is likely to grow, all other things being equal. Over time this can lead to the weight vector being very large indeed. This can cause the weight vector to get stuck pointing in more or less the same direction, since changes due to gradient descent only make tiny changes to the direction, when the length is long. I believe this phenomenon is making it hard for our learning algorithm to properly explore the weight space, and consequently harder to find good minima of the cost function.

Why does regularization help reduce overfitting?
We've seen empirically that regularization helps reduce overfitting. That's encouraging but, unfortunately, it's not obvious why regularization helps! A standard story people tell to explain what's going on is along the following lines: smaller weights are, in some sense, lower complexity, and so provide a simpler and more powerful explanation for the data, and should thus be preferred. That's a pretty terse story, though, and contains several elements that perhaps seem dubious or mystifying. Let's unpack the story and examine it critically. To do that, let's suppose we have a simple data set for which we wish to build a model:

10
9
8
7
6

5
4
3
2
1
0
0

Implicitly, we're studying some real-world phenomenon here, with x and y representing real-world data. Our goal is to build a model which lets us predict y as a function of x. We could try using neural networks to build such a model, but I'm going to do something even simpler: I'll try to model y as a polynomial in x. I'm doing this instead of using neural nets because using polynomials will make things particularly transparent. Once we've understood the polynomial case, we'll translate to neural networks. Now, there are ten points in the graph above, which means we can find a unique 9th-order polynomial y = a_0 x^9 + a_1 x^8 + … + a_9 which fits the data exactly. Here's the graph of that polynomial*:

*I won't show the coefficients explicitly, although they are easy to find using a routine such as Numpy's polyfit. You can view the exact form of the polynomial in the source code for the graph if you're curious. It's the function p(x) defined starting on line 14 of the program which produces the graph.


[Graph: the unique 9th-order polynomial, passing exactly through all ten data points]

That provides an exact fit. But we can also get a good fit using the linear model y = 2x:

[Graph: the linear model y = 2x, a good but not exact fit to the ten points]
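Incidentally, if you'd like to experiment with fits like these, they're easy to reproduce using Numpy's polyfit routine. Here's a minimal sketch; the ten data points below are invented for illustration, since I haven't listed the actual data used in the graphs:

import numpy as np

# Made-up data roughly following y = 2x, plus a little noise.
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])
y = 2*x + np.array([0.2, -0.3, 0.1, 0.4, -0.2, 0.3, -0.1, 0.2, -0.4, 0.1])

p9 = np.polyfit(x, y, 9)  # the unique 9th-order polynomial: exact fit
p1 = np.polyfit(x, y, 1)  # the linear model: approximate fit

print np.max(np.abs(np.polyval(p9, x) - y))  # essentially zero
print p1  # the fitted slope is close to 2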

Which of these is the better model? Which is more likely to be true? And which model is more likely to generalize well to other examples of the same underlying real-world phenomenon?

These are difficult questions. In fact, we can't determine with certainty the answer to any of the above questions, without much more information about the underlying real-world phenomenon. But let's consider two possibilities: (1) the 9th-order polynomial is, in fact, the model which truly describes the real-world phenomenon, and the model will therefore generalize perfectly; (2) the correct model is y = 2x, but there's a little additional noise due to, say, measurement error, and that's why the model isn't an exact fit.
It's not a priori possible to say which of these two possibilities is
correct. (Or, indeed, if some third possibility holds.) Logically, either could be true. And it's not a trivial difference. It's true that on the data provided there's only a small difference between the two models. But suppose we want to predict the value of y corresponding to some large value of x, much larger than any shown on the graph above. If we try to do that there will be a dramatic difference between the predictions of the two models, as the 9th-order polynomial model comes to be dominated by the x^9 term, while the linear model remains, well, linear.
One point of view is to say that in science we should go with the simpler explanation, unless compelled not to. When we find a simple model that seems to explain many data points we are tempted to shout "Eureka!" After all, it seems unlikely that a simple explanation should occur merely by coincidence. Rather, we suspect that the model must be expressing some underlying truth about the phenomenon. In the case at hand, the model y = 2x + noise seems much simpler than y = a_0 x^9 + a_1 x^8 + …. It would be surprising if that simplicity had occurred by chance, and so we suspect that y = 2x + noise expresses some underlying truth. In this point of view, the 9th-order model is really just learning the effects of local noise. And so while the 9th-order model works perfectly for these particular data points, the model will fail to generalize to other data points, and the noisy linear model will have greater predictive power.
Let's see what this point of view means for neural networks. Suppose our network mostly has small weights, as will tend to happen in a regularized network. The smallness of the weights means that the behaviour of the network won't change too much if we change a few random inputs here and there. That makes it difficult for a regularized network to learn the effects of local noise in the data. Think of it as a way of making it so single pieces of evidence don't matter too much to the output of the network. Instead, a regularized network learns to respond to types of evidence which are seen often across the training set. By contrast, a network with large weights may change its behaviour quite a bit in response to small changes in the input. And so an unregularized network can use large weights to learn a complex model that carries a lot of information about the noise in the training data. In a nutshell, regularized networks are constrained to build relatively
simple models based on patterns seen often in the training data, and are resistant to learning peculiarities of the noise in the training data. The hope is that this will force our networks to do real learning about the phenomenon at hand, and to generalize better from what they learn.

With that said, this idea of preferring simpler explanations should make you nervous. People sometimes refer to this idea as "Occam's Razor", and will zealously apply it as though it has the status of some general scientific principle. But, of course, it's not a general scientific principle. There is no a priori logical reason to prefer simple explanations over more complex explanations. Indeed, sometimes the more complex explanation turns out to be correct.
Let me describe two examples where more complex explanations have turned out to be correct. In the 1940s the physicist Marcel Schein announced the discovery of a new particle of nature. The company he worked for, General Electric, was ecstatic, and publicized the discovery widely. But the physicist Hans Bethe was skeptical. Bethe visited Schein, and looked at the plates showing the tracks of Schein's new particle. Schein showed Bethe plate after plate, but on each plate Bethe identified some problem that suggested the data should be discarded. Finally, Schein showed Bethe a plate that looked good. Bethe said it might just be a statistical fluke. Schein: "Yes, but the chance that this would be statistics, even according to your own formula, is one in five." Bethe: "But we have already looked at five plates." Finally, Schein said: "But on my plates, each one of the good plates, each one of the good pictures, you explain by a different theory, whereas I have one hypothesis that explains all the plates, that they are [the new particle]." Bethe replied: "The sole difference between your and my explanations is that yours is wrong and all of mine are right. Your single explanation is wrong, and all of my multiple explanations are right." Subsequent work confirmed that Nature agreed with Bethe, and Schein's particle is no more*.

*The story is related by the physicist Richard Feynman in an interview with the historian Charles Weiner.

As a second example, in 1859 the astronomer Urbain Le Verrier observed that the orbit of the planet Mercury doesn't have quite the shape that Newton's theory of gravitation says it should have. It was a tiny, tiny deviation from Newton's theory, and several of the explanations proffered at the time boiled down to saying that
Newton's theory was more or less right, but needed a tiny alteration. In 1916, Einstein showed that the deviation could be explained very well using his general theory of relativity, a theory radically different to Newtonian gravitation, and based on much more complex mathematics. Despite that additional complexity, today it's accepted that Einstein's explanation is correct, and Newtonian gravity, even in its modified forms, is wrong. This is in part because we now know that Einstein's theory explains many other phenomena which Newton's theory has difficulty with. Furthermore, and even more impressively, Einstein's theory accurately predicts several phenomena which aren't predicted by Newtonian gravity at all. But these impressive qualities weren't entirely obvious in the early days. If one had judged merely on the grounds of simplicity, then some modified form of Newton's theory would arguably have been more attractive.
There are three morals to draw from these stories. First, it can be quite a subtle business deciding which of two explanations is truly "simpler". Second, even if we can make such a judgment, simplicity is a guide that must be used with great caution! Third, the true test of a model is not simplicity, but rather how well it does in predicting new phenomena, in new regimes of behaviour.

With that said, and keeping the need for caution in mind, it's an empirical fact that regularized neural networks usually generalize better than unregularized networks. And so through the remainder of the book we will make frequent use of regularization. I've included the stories above merely to help convey why no one has yet developed an entirely convincing theoretical explanation for why regularization helps networks generalize. Indeed, researchers continue to write papers where they try different approaches to regularization, compare them to see which works better, and attempt to understand why different approaches work better or worse. And so you can view regularization as something of a kludge. While it often helps, we don't have an entirely satisfactory systematic understanding of what's going on, merely incomplete heuristics and rules of thumb.
There's a deeper set of issues here, issues which go to the heart of science. It's the question of how we generalize. Regularization may give us a computational magic wand that helps our networks
generalize better, but it doesn't give us a principled understanding of how generalization works, nor of what the best approach is*.

*These issues go back to the problem of induction, famously discussed by the Scottish philosopher David Hume in "An Enquiry Concerning Human Understanding" (1748). The problem of induction has been given a modern machine learning form in the no-free-lunch theorem (link) of David Wolpert and William Macready (1997).

This is particularly galling because in everyday life, we humans generalize phenomenally well. Shown just a few images of an elephant a child will quickly learn to recognize other elephants. Of course, they may occasionally make mistakes, perhaps confusing a rhinoceros for an elephant, but in general this process works remarkably accurately. So we have a system, the human brain, with a huge number of free parameters. And after being shown just one or a few training images that system learns to generalize to other images. Our brains are, in some sense, regularizing amazingly well! How do we do it? At this point we don't know. I expect that in years to come we will develop more powerful techniques for regularization in artificial neural networks, techniques that will ultimately enable neural nets to generalize well even from small data sets.
In fact, our networks already generalize better than one might a priori expect. A network with 100 hidden neurons has nearly 80,000 parameters. We have only 50,000 images in our training data. It's like trying to fit an 80,000th degree polynomial to 50,000 data points. By all rights, our network should overfit terribly. And yet, as we saw earlier, such a network actually does a pretty good job generalizing. Why is that the case? It's not well understood. It has been conjectured* that "the dynamics of gradient descent learning in multilayer nets has a `self-regularization' effect". This is exceptionally fortunate, but it's also somewhat disquieting that we don't understand why it's the case. In the meantime, we will adopt the pragmatic approach and use regularization whenever we can. Our neural networks will be the better for it.

*In Gradient-Based Learning Applied to Document Recognition, by Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner (1998).
Let me conclude this section by returning to a detail which I left unexplained earlier: the fact that L2 regularization doesn't constrain the biases. Of course, it would be easy to modify the regularization procedure to regularize the biases. Empirically, doing this often doesn't change the results very much, so to some extent it's merely a convention whether to regularize the biases or not. However, it's worth noting that having a large bias doesn't make a neuron sensitive to its inputs in the same way as having large weights. And so we don't need to worry about large biases enabling
our network to learn the noise in our training data. At the same time, allowing large biases gives our networks more flexibility in behaviour; in particular, large biases make it easier for neurons to saturate, which is sometimes desirable. For these reasons we don't usually include bias terms when regularizing.

Other techniques for regularization
There are many regularization techniques other than L2 regularization. In fact, so many techniques have been developed that I can't possibly summarize them all. In this section I briefly describe three other approaches to reducing overfitting: L1 regularization, dropout, and artificially increasing the training set size. We won't go into nearly as much depth studying these techniques as we did earlier. Instead, the purpose is to get familiar with the main ideas, and to appreciate something of the diversity of regularization techniques available.
L1 regularization: In this approach we modify the unregularized cost function by adding the sum of the absolute values of the weights:

C = C_0 + (λ/n) Σ_w |w|.   (95)
Intuitively, this is similar to L2 regularization, penalizing large weights, and tending to make the network prefer small weights. Of course, the L1 regularization term isn't the same as the L2 regularization term, and so we shouldn't expect to get exactly the same behaviour. Let's try to understand how the behaviour of a network trained using L1 regularization differs from a network trained using L2 regularization.
To do that, we'll look at the partial derivatives of the cost function. Differentiating (95) we obtain:

∂C/∂w = ∂C_0/∂w + (λ/n) sgn(w),   (96)

where sgn(w) is the sign of w, that is, +1 if w is positive, and −1 if w is negative. Using this expression, we can easily modify backpropagation to do stochastic gradient descent using L1 regularization. The resulting update rule for an L1 regularized network is

w → w' = w − (ηλ/n) sgn(w) − η ∂C_0/∂w,   (97)

where, as per usual, we can estimate ∂C_0/∂w using a mini-batch average, if we wish. Compare that to the update rule for L2 regularization (c.f. Equation (93)),

w → w' = w (1 − ηλ/n) − η ∂C_0/∂w.

In both expressions the effect of regularization is to shrink the weights. This accords with our intuition that both kinds of regularization penalize large weights. But the way the weights shrink is different. In L1 regularization, the weights shrink by a constant amount toward 0. In L2 regularization, the weights shrink by an amount which is proportional to w. And so when a particular weight has a large magnitude, |w|, L1 regularization shrinks the weight much less than L2 regularization does. By contrast, when |w| is small, L1 regularization shrinks the weight much more than L2 regularization. The net result is that L1 regularization tends to concentrate the weight of the network in a relatively small number of high-importance connections, while the other weights are driven toward zero.
I've glossed over an issue in the above discussion, which is that the partial derivative ∂C/∂w isn't defined when w = 0. The reason is that the function |w| has a sharp "corner" at w = 0, and so isn't differentiable at that point. That's okay, though. What we'll do is just apply the usual (unregularized) rule for stochastic gradient descent when w = 0. That should be okay: intuitively, the effect of regularization is to shrink weights, and obviously it can't shrink a weight which is already 0. To put it more precisely, we'll use Equations (96) and (97) with the convention that sgn(0) = 0. That gives a nice, compact rule for doing stochastic gradient descent with L1 regularization.
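In code, that rule is a one-line change to gradient descent. Here's a sketch, written as a variant of the Network.update_mini_batch method from network2.py (the program developed in full later in this chapter); it's an illustration of rule (97), not part of network2.py itself, and it leans on the handy fact that Numpy's np.sign(0) is 0, exactly the convention we just adopted:

    def update_mini_batch_l1(self, mini_batch, eta, lmbda, n):
        # Identical to network2.py's update_mini_batch, except that the
        # weight update follows the L1 rule (97) rather than the L2 rule.
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        # L1: shrink each weight by the constant eta*(lmbda/n)*sgn(w).
        self.weights = [w - eta*(lmbda/n)*np.sign(w)
                        - (eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b - (eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]

A problem at the end of this chapter asks you to carry this modification through and test it on MNIST, so you may prefer to treat this sketch as a hint rather than a finished solution.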
Dropout: Dropout is a radically different technique for regularization. Unlike L1 and L2 regularization, dropout doesn't rely on modifying the cost function. Instead, in dropout we modify the network itself. Let me describe the basic mechanics of how
dropout works, before getting into why it works, and what the results are.
Suppose we're trying to train a network:

[Diagram: a small feedforward network]

In particular, suppose we have a training input x and corresponding desired output y. Ordinarily, we'd train by forward-propagating x through the network, and then backpropagating to determine the contribution to the gradient. With dropout, this process is modified. We start by randomly (and temporarily) deleting half the hidden neurons in the network, while leaving the input and output neurons untouched. After doing this, we'll end up with a network along the following lines. Note that the dropout neurons, i.e., the neurons which have been temporarily deleted, are still ghosted in:

[Diagram: the same network with half the hidden neurons temporarily deleted]

We forward-propagate the input x through the modified network, and then backpropagate the result, also through the modified network. After doing this over a mini-batch of examples, we update the appropriate weights and biases. We then repeat the process, first restoring the dropout neurons, then choosing a new random
subset of hidden neurons to delete, estimating the gradient for a different mini-batch, and updating the weights and biases in the network.
By repeating this process over and over, our network will learn a set of weights and biases. Of course, those weights and biases will have been learnt under conditions in which half the hidden neurons were dropped out. When we actually run the full network that means that twice as many hidden neurons will be active. To compensate for that, we halve the weights outgoing from the hidden neurons.
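Here's a minimal sketch of those mechanics for a single hidden layer. It's not taken from network2.py; the function name and the 50 percent dropout rate are just for illustration:

import numpy as np

def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

def dropout_hidden_activations(w, b, x, training):
    # ``w`` and ``b`` are the hidden layer's weights and biases,
    # ``x`` the input activations.
    if training:
        # Randomly (and temporarily) delete half the hidden neurons.
        mask = np.random.binomial(1, 0.5, size=b.shape)
        return mask*sigmoid(np.dot(w, x)+b)
    else:
        # Running the full network: twice as many hidden neurons are
        # active, so we halve their outgoing influence.  Halving the
        # activations here is equivalent to halving the weights going
        # out of the hidden layer.
        return 0.5*sigmoid(np.dot(w, x)+b)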
This dropout procedure may seem strange and ad hoc. Why would we expect it to help with regularization? To explain what's going on, I'd like you to briefly stop thinking about dropout, and instead imagine training neural networks in the standard way (no dropout). In particular, imagine we train several different neural networks, all using the same training data. Of course, the networks may not start out identical, and as a result after training they may sometimes give different results. When that happens we could use some kind of averaging or voting scheme to decide which output to accept. For instance, if we have trained five networks, and three of them are classifying a digit as a "3", then it probably really is a "3". The other two networks are probably just making a mistake. This kind of averaging scheme is often found to be a powerful (though expensive) way of reducing overfitting. The reason is that the different networks may overfit in different ways, and averaging may help eliminate that kind of overfitting.
What's this got to do with dropout? Heuristically, when we dropout different sets of neurons, it's rather like we're training different neural networks. And so the dropout procedure is like averaging the effects of a very large number of different networks. The different networks will overfit in different ways, and so, hopefully, the net effect of dropout will be to reduce overfitting.
A related heuristic explanation for dropout is given in one of the earliest papers to use the technique*: "This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons." In other words, if we think of our network as a model which is making predictions, then we can think of dropout as a way of making sure that the model is robust to the loss of any individual piece of evidence. In this, it's somewhat similar to L1 and L2 regularization, which tend to reduce weights, and thus make the network more robust to losing any individual connection in the network.

*ImageNet Classification with Deep Convolutional Neural Networks, by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton (2012).
Of course, the true measure of dropout is that it has been very successful in improving the performance of neural networks. The original paper* introducing the technique applied it to many different tasks. For us, it's of particular interest that they applied dropout to MNIST digit classification, using a vanilla feedforward neural network along lines similar to those we've been considering. The paper noted that the best result anyone had achieved up to that point using such an architecture was 98.4 percent classification accuracy on the test set. They improved that to 98.7 percent accuracy using a combination of dropout and a modified form of L2 regularization. Similarly impressive results have been obtained for many other tasks, including problems in image and speech recognition, and natural language processing. Dropout has been especially useful in training large, deep networks, where the problem of overfitting is often acute.

*Improving neural networks by preventing co-adaptation of feature detectors, by Geoffrey Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov (2012). Note that the paper discusses a number of subtleties that I have glossed over in this brief introduction.
Artificially expanding the training data: We saw earlier that our MNIST classification accuracy dropped down to percentages in the mid-80s when we used only 1,000 training images. It's not surprising that this is the case, since less training data means our network will be exposed to fewer variations in the way human beings write digits. Let's try training our 30 hidden neuron network with a variety of different training data set sizes, to see how performance varies. We train using a mini-batch size of 10, a learning rate η = 0.5, a regularization parameter λ = 5.0, and the cross-entropy cost function. We will train for 30 epochs when the full training data set is used, and scale up the number of epochs proportionally when smaller training sets are used. To ensure the weight decay factor remains the same across training sets, we will use a regularization parameter of λ = 5.0 when the full training data set is used, and scale λ down proportionally when smaller training sets are used*.

*This and the next two graphs are produced with the program more_data.py.

[Graph: classification accuracy as a function of training set size]

As you can see, the classification accuracies improve considerably as we use more training data. Presumably this improvement would continue still further if more data was available. Of course, looking at the graph above it does appear that we're getting near saturation. Suppose, however, that we redo the graph with the training set size plotted logarithmically:

[Graph: the same accuracies, with training set size plotted on a logarithmic scale]

It seems clear that the graph is still going up toward the end. This suggests that if we used vastly more training data, say, millions or even billions of handwriting samples, instead of just 50,000, then we'd likely get considerably better performance, even from this very small network.
Obtaining more training data is a great idea. Unfortunately, it can be expensive, and so is not always possible in practice. However, there's another idea which can work nearly as well, and that's to
artificially expand the training data. Suppose, for example, that we take an MNIST training image of a five,

[Image: an MNIST training image of a five]

and rotate it by a small amount, let's say 15 degrees:

[Image: the same five, rotated by 15 degrees]

It's still recognizably the same digit. And yet at the pixel level it's quite different to any image currently in the MNIST training data. It's conceivable that adding this image to the training data might help our network learn more about how to classify digits. What's more, obviously we're not limited to adding just this one image. We can expand our training data by making many small rotations of all the MNIST training images, and then using the expanded training data to improve our network's performance.
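Here's a rough sketch of how such an expansion might be coded, assuming SciPy is available and that the training data consists of the (784, 1) image vectors produced by mnist_loader; the function name and the use of a single 15-degree rotation are just illustrative:

import numpy as np
from scipy import ndimage

def expand_with_rotations(training_data, angle=15.0):
    # Return the original (x, y) pairs together with copies in which
    # each 28x28 MNIST image has been rotated by ``angle`` degrees.
    expanded = list(training_data)
    for x, y in training_data:
        image = x.reshape(28, 28)
        rotated = ndimage.rotate(image, angle, reshape=False)
        expanded.append((rotated.reshape(784, 1), y))
    return expanded

In practice you'd likely use several small angles, rather than a single rotation, and combine rotations with the translations and skews described below.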
This idea is very powerful and has been widely used. Let's look at some of the results from a paper* which applied several variations of the idea to MNIST. One of the neural network architectures they considered was along similar lines to what we've been using, a feedforward network with 800 hidden neurons and using the cross-entropy cost function. Running the network with the standard MNIST training data they achieved a classification accuracy of 98.4 percent on their test set. But then they expanded the training data, using not just rotations, as I described above, but also translating and skewing the images. By training on the expanded data set they increased their network's accuracy to 98.9 percent. They also experimented with what they called "elastic distortions", a special type of image distortion intended to emulate the random oscillations found in hand muscles. By using the elastic distortions to expand the data they achieved an even higher accuracy, 99.3 percent. Effectively, they were broadening the experience of their network by exposing it to the sort of variations that are found in real handwriting.

*Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis, by Patrice Simard, Dave Steinkraus, and John Platt (2003).
Variations on this idea can be used to improve performance on
many learning tasks, not just handwriting recognition. The general principle is to expand the training data by applying operations that reflect real-world variation. It's not difficult to think of ways of doing this. Suppose, for example, that you're building a neural network to do speech recognition. We humans can recognize speech even in the presence of distortions such as background noise. And so you can expand your data by adding background noise. We can also recognize speech if it's sped up or slowed down. So that's another way we can expand the training data. These techniques are not always used; for instance, instead of expanding the training data by adding noise, it may well be more efficient to clean up the input to the network by first applying a noise reduction filter. Still, it's worth keeping the idea of expanding the training data in mind, and looking for opportunities to apply it.

Exercise

As discussed above, one way of expanding the MNIST training data is to use small rotations of training images. What's a problem that might occur if we allow arbitrarily large rotations of training images?
An aside on big data and what it means to compare classification accuracies: Let's look again at how our neural network's accuracy varies with training set size:

[Graph: the neural network's classification accuracy as a function of training set size, repeated from above]

Suppose that instead of using a neural network we use some other machine learning technique to classify digits. For instance, let's try using the support vector machines (SVM) which we met briefly
back in Chapter 1. As was the case in Chapter 1, don't worry if you're not familiar with SVMs, we don't need to understand their details. Instead, we'll use the SVM supplied by the scikit-learn library. Here's how SVM performance varies as a function of training set size. I've plotted the neural net results as well, to make comparison easy*:

[Graph: SVM and neural network classification accuracy as functions of training set size]

*This graph was produced with the program more_data.py (as were the last few graphs).

Probably the first thing that strikes you about this graph is that our neural network outperforms the SVM for every training set size. That's nice, although you shouldn't read too much into it, since I just used the out-of-the-box settings from scikit-learn's SVM, while we've done a fair bit of work improving our neural network. A more subtle but more interesting fact about the graph is that if we train our SVM using 50,000 images then it actually has better performance (94.48 percent accuracy) than our neural network does when trained using 5,000 images (93.24 percent accuracy). In other words, more training data can sometimes compensate for differences in the machine learning algorithm used.
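If you'd like to reproduce the SVM side of the comparison, here's roughly how it can be set up. This is a sketch using scikit-learn's default settings, and it assumes the mnist_loader.load_data function from the book's code repository, which returns the images and digit labels as plain Numpy arrays:

import mnist_loader
from sklearn import svm

training_data, validation_data, test_data = mnist_loader.load_data()
clf = svm.SVC()  # out-of-the-box settings, as in the text
# Training on all 50,000 images is slow; try a slice such as
# training_data[0][:5000] first if you're impatient.
clf.fit(training_data[0], training_data[1])
predictions = clf.predict(test_data[0])
num_correct = sum(int(p == y) for p, y in zip(predictions, test_data[1]))
print "%s / %s correct" % (num_correct, len(test_data[1]))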
Something even more interesting can occur. Suppose we're trying to solve a problem using two machine learning algorithms, algorithm A and algorithm B. It sometimes happens that algorithm A will outperform algorithm B with one set of training data, while algorithm B will outperform algorithm A with a different set of training data. We don't see that above (it would require the two graphs to cross), but it does happen*. The correct response to the question "Is algorithm A better than algorithm B?" is really: "What training data set are you using?"

*Striking examples may be found in Scaling to very very large corpora for natural language disambiguation, by Michele Banko and Eric Brill (2001).


All this is a caution to keep in mind, both when doing development, and when reading research papers. Many papers focus on finding new tricks to wring out improved performance on standard benchmark data sets. "Our whiz-bang technique gave us an improvement of X percent on standard benchmark Y" is a canonical form of research claim. Such claims are often genuinely interesting, but they must be understood as applying only in the context of the specific training data set used. Imagine an alternate history in which the people who originally created the benchmark data set had a larger research grant. They might have used the extra money to collect more training data. It's entirely possible that the "improvement" due to the whiz-bang technique would disappear on a larger data set. In other words, the purported improvement might be just an accident of history. The message to take away, especially in practical applications, is that what we want is both better algorithms and better training data. It's fine to look for better algorithms, but make sure you're not focusing on better algorithms to the exclusion of easy wins getting more or better training data.

Problem

(Research problem) How do our machine learning algorithms perform in the limit of very large data sets? For any given algorithm it's natural to attempt to define a notion of asymptotic performance in the limit of truly big data. A quick-and-dirty approach to this problem is to simply try fitting curves to graphs like those shown above, and then to extrapolate the fitted curves out to infinity. An objection to this approach is that different approaches to curve fitting will give different notions of asymptotic performance. Can you find a principled justification for fitting to some particular class of curves? If so, compare the asymptotic performance of several different machine learning algorithms.
Summing up: We've now completed our dive into overfitting and regularization. Of course, we'll return again to the issue. As I've mentioned several times, overfitting is a major problem in neural networks, especially as computers get more powerful, and we have the ability to train larger networks. As a result there's a pressing need to develop powerful regularization techniques to reduce overfitting, and this is an extremely active area of current work.
Weight initialization
When we create our neural networks, we have to make choices for the initial weights and biases. Up to now, we've been choosing them according to a prescription which I discussed only briefly back in Chapter 1. Just to remind you, that prescription was to choose both the weights and biases using independent Gaussian random variables, normalized to have mean 0 and standard deviation 1. While this approach has worked well, it was quite ad hoc, and it's worth revisiting to see if we can find a better way of setting our initial weights and biases, and perhaps help our neural networks learn faster.
It turns out that we can do quite a bit better than initializing with normalized Gaussians. To see why, suppose we're working with a network with a large number, say 1,000, of input neurons. And let's suppose we've used normalized Gaussians to initialize the weights connecting to the first hidden layer. For now I'm going to concentrate specifically on the weights connecting the input neurons to the first neuron in the hidden layer, and ignore the rest of the network:

[Diagram: the 1,000 input neurons feeding a single hidden neuron]

We'll suppose for simplicity that we're trying to train using a training input x in which half the input neurons are on, i.e., set to 1, and half the input neurons are off, i.e., set to 0. The argument which follows applies more generally, but you'll get the gist from this special case. Let's consider the weighted sum z = Σ_j w_j x_j + b of inputs to our hidden neuron. 500 terms in this sum vanish, because the corresponding input x_j is zero. And so z is a sum over a total of 501 normalized Gaussian random variables, accounting for the 500 weight terms and the 1 extra bias term. Thus z is itself distributed as a Gaussian with mean zero and standard deviation √501 ≈ 22.4.
That is, z has a very broad Gaussian distribution, not sharply peaked at all:

[Graph: a very flat Gaussian, spread over roughly −30 to 30, with peak height only about 0.02]

In particular, we can see from this graph that it's quite likely that |z| will be pretty large, i.e., either z ≫ 1 or z ≪ −1. If that's the case then the output σ(z) from the hidden neuron will be very close to either 1 or 0. That means our hidden neuron will have saturated. And when that happens, as we know, making small changes in the weights will make only absolutely miniscule changes in the activation of our hidden neuron. That miniscule change in the activation of the hidden neuron will, in turn, barely affect the rest of the neurons in the network at all, and we'll see a correspondingly miniscule change in the cost function. As a result, those weights will only learn very slowly when we use the gradient descent algorithm*.

*We discussed this in more detail in Chapter 2, where we used the equations of backpropagation to show that weights input to saturated neurons learned slowly.

It's similar to the problem we discussed earlier in this chapter, in which output neurons which saturated on the wrong value caused learning to slow down. We addressed that earlier problem with a clever choice of cost function. Unfortunately, while that helped with saturated output neurons, it does nothing at all for the problem with saturated hidden neurons.
I've been talking about the weights input to the first hidden layer. Of course, similar arguments apply also to later hidden layers: if the weights in later hidden layers are initialized using normalized Gaussians, then activations will often be very close to 0 or 1, and learning will proceed very slowly.
Is there some way we can choose better initializations for the weights and biases, so that we don't get this kind of saturation, and so avoid a learning slowdown? Suppose we have a neuron with n_in input weights. Then we shall initialize those weights as Gaussian random variables with mean 0 and standard deviation 1/√n_in. That is, we'll squash the Gaussians down, making it less likely that our neuron will saturate. We'll continue to choose the bias as a Gaussian with mean 0 and standard deviation 1, for reasons I'll return to in a moment. With these choices, the weighted sum
z = Σ_j w_j x_j + b will again be a Gaussian random variable with mean 0, but it'll be much more sharply peaked than it was before. Suppose, as we did earlier, that 500 of the inputs are zero and 500 are 1. Then it's easy to show (see the exercise below) that z has a Gaussian distribution with mean 0 and standard deviation √(3/2) ≈ 1.22. This is much more sharply peaked than before, so much so that even the graph below understates the situation, since I've had to rescale the vertical axis, when compared to the earlier graph:
[Graph: a sharply peaked Gaussian on the same −30 to 30 axes; the vertical axis now reaches 0.4]

Such a neuron is much less likely to saturate, and correspondingly much less likely to have problems with a learning slowdown.
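If you'd like a quick empirical check of the two standard deviations quoted above (roughly 22.4 and 1.22), here's a small simulation sketch. It samples z directly rather than building a network, and isn't part of the book's code:

import numpy as np

n_in = 1000
x = np.zeros((n_in,))
x[:500] = 1.0  # half the inputs on, half off
for scale in [1.0, 1.0/np.sqrt(n_in)]:  # old, then new, initialization
    zs = [np.dot(np.random.randn(n_in)*scale, x) + np.random.randn()
          for _ in xrange(10000)]
    print np.std(zs)  # about 22.4, then about 1.22

The exercise below asks you to derive the second value, √(3/2), analytically.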

Exercise

Verify that the standard deviation of z = Σ_j w_j x_j + b in the paragraph above is √(3/2). It may help to know that: (a) the variance of a sum of independent random variables is the sum of the variances of the individual random variables; and (b) the variance is the square of the standard deviation.
I stated above that we'll continue to initialize the biases as before, as Gaussian random variables with a mean of 0 and a standard deviation of 1. This is okay, because it doesn't make it too much more likely that our neurons will saturate. In fact, it doesn't much matter how we initialize the biases, provided we avoid the problem with saturation. Some people go so far as to initialize all the biases to 0, and rely on gradient descent to learn appropriate biases. But since it's unlikely to make much difference, we'll continue with the same initialization procedure as before.

Let's compare the results for both our old and new approaches to weight initialization, using the MNIST digit classification task. As before, we'll use 30 hidden neurons, a mini-batch size of 10, a regularization parameter λ = 5.0, and the cross-entropy cost function. We will decrease the learning rate slightly from η = 0.5 to 0.1, since that makes the results a little more easily visible in the graphs. We can train using the old method of weight initialization:
>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
>>> import network2
>>> net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
>>> net.large_weight_initializer()
>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
... evaluation_data=validation_data,
... monitor_evaluation_accuracy=True)

We can also train using the new approach to weight initialization. This is actually even easier, since network2's default way of initializing the weights is using this new approach. That means we can omit the net.large_weight_initializer() call above:
>>> net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
... evaluation_data=validation_data,
... monitor_evaluation_accuracy=True)

Plotting the results*, we obtain:

[Graph: classification accuracy on the validation data for the old and new approaches to weight initialization, 30 hidden neurons]

*The program used to generate this and the next graph is weight_initialization.py.
In both cases, we end up with a classification accuracy somewhat over 96 percent. The final classification accuracy is almost exactly the same in the two cases. But the new initialization technique brings us there much, much faster. At the end of the first epoch of training the old approach to weight initialization has a classification
accuracy under 87 percent, while the new approach is already almost 93 percent. What appears to be going on is that our new approach to weight initialization starts us off in a much better regime, which lets us get good results much more quickly. The same phenomenon is also seen if we plot results with 100 hidden neurons:

[Graph: the same comparison, with 100 hidden neurons]

In this case, the two curves don't quite meet. However, my experiments suggest that with just a few more epochs of training (not shown) the accuracies become almost exactly the same. So on the basis of these experiments it looks as though the improved weight initialization only speeds up learning, it doesn't change the final performance of our networks. However, in Chapter 4 we'll see examples of neural networks where the long-run behaviour is significantly better with the 1/√n_in weight initialization. Thus it's not only the speed of learning which is improved, it's sometimes also the final performance.
The 1/√n_in approach to weight initialization helps improve the way our neural nets learn. Other techniques for weight initialization have also been proposed, many building on this basic idea. I won't review the other approaches here, since 1/√n_in works well enough for our purposes. If you're interested in looking further, I recommend looking at the discussion on pages 14 and 15 of a 2012 paper by Yoshua Bengio*, as well as the references therein.

*Practical Recommendations for Gradient-Based Training of Deep Architectures, by Yoshua Bengio (2012).

Problem
Connecting regularization and the improved method of weight initialization: L2 regularization sometimes automatically gives us something similar to the new approach to weight initialization. Suppose we are using the old approach to weight initialization. Sketch a heuristic argument that: (1) supposing λ is not too small, the first epochs of training will be dominated almost entirely by weight decay; (2) provided ηλ ≪ n the weights will decay by a factor of exp(−ηλ/m) per epoch; and (3) supposing λ is not too large, the weight decay will tail off when the weights are down to a size around 1/√n, where n is the total number of weights in the network. Argue that these conditions are all satisfied in the examples graphed in this section.

Handwriting recognition revisited: the code
Let's implement the ideas we've discussed in this chapter. We'll develop a new program, network2.py, which is an improved version of the program network.py we developed in Chapter 1. If you haven't looked at network.py in a while then you may find it helpful to spend a few minutes quickly reading over the earlier discussion. It's only 74 lines of code, and is easily understood.
As was the case in network.py, the star of network2.py is the Network class, which we use to represent our neural networks. We initialize an instance of Network with a list of sizes for the respective layers in the network, and a choice for the cost to use, defaulting to the cross-entropy:
class Network(object):

    def __init__(self, sizes, cost=CrossEntropyCost):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.default_weight_initializer()
        self.cost = cost

The first couple of lines of the __init__ method are the same as in network.py, and are pretty self-explanatory. But the next two lines are new, and we need to understand what they're doing in detail. Let's start by examining the default_weight_initializer method. This makes use of our new and improved approach to weight
initialization. As we've seen, in that approach the weights input to a neuron are initialized as Gaussian random variables with mean 0 and standard deviation 1 divided by the square root of the number of connections input to the neuron. Also in this method we'll initialize the biases, using Gaussian random variables with mean 0 and standard deviation 1. Here's the code:
    def default_weight_initializer(self):
        self.biases = [np.random.randn(y, 1) for y in self.sizes[1:]]
        self.weights = [np.random.randn(y, x)/np.sqrt(x)
                        for x, y in zip(self.sizes[:-1], self.sizes[1:])]

To understand the code, it may help to recall that np is the Numpy library for doing linear algebra. We'll import Numpy at the beginning of our program. Also, notice that we don't initialize any biases for the first layer of neurons. We avoid doing this because the first layer is an input layer, and so any biases would not be used. We did exactly the same thing in network.py.
Complementing the default_weight_initializer we'll also include a large_weight_initializer method. This method initializes the weights and biases using the old approach from Chapter 1, with both weights and biases initialized as Gaussian random variables with mean 0 and standard deviation 1. The code is, of course, only a tiny bit different from the default_weight_initializer:
    def large_weight_initializer(self):
        self.biases = [np.random.randn(y, 1) for y in self.sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(self.sizes[:-1], self.sizes[1:])]

I've included the large_weight_initializer method mostly as a convenience to make it easier to compare the results in this chapter to those in Chapter 1. I can't think of many practical situations where I would recommend using it!
The second new thing in Network's __init__ method is that we now initialize a cost attribute. To understand how that works, let's look at the class we use to represent the cross-entropy cost*:

class CrossEntropyCost(object):

    @staticmethod
    def fn(a, y):
        return np.sum(np.nan_to_num(-y*np.log(a)-(1-y)*np.log(1-a)))

    @staticmethod
    def delta(z, a, y):
        return (a-y)

*If you're not familiar with Python's static methods you can ignore the @staticmethod decorators, and just treat fn and delta as ordinary methods. If you're curious about details, all @staticmethod does is tell the Python interpreter that the method which follows doesn't depend on the object in any way. That's why self isn't passed as a parameter to the fn and delta methods.


Let's break this down. The first thing to observe is that even though the cross-entropy is, mathematically speaking, a function, we've implemented it as a Python class, not a Python function. Why have I made that choice? The reason is that the cost plays two different roles in our network. The obvious role is that it's a measure of how well an output activation, a, matches the desired output, y. This role is captured by the CrossEntropyCost.fn method. (Note, by the way, that the np.nan_to_num call inside CrossEntropyCost.fn ensures that Numpy deals correctly with the log of numbers very close to zero.) But there's also a second way the cost function enters our network. Recall from Chapter 2 that when running the backpropagation algorithm we need to compute the network's output error, δ^L. The form of the output error depends on the choice of cost function: different cost function, different form for the output error. For the cross-entropy the output error is, as we saw in Equation (66),

δ^L = a^L − y.

For this reason we define a second method, CrossEntropyCost.delta, whose purpose is to tell our network how to compute the output error. And then we bundle these two methods up into a single class containing everything our networks need to know about the cost function.
In a similar way, network2.py also contains a class to represent the quadratic cost function. This is included for comparison with the results of Chapter 1, since going forward we'll mostly use the cross-entropy. The code is just below. The QuadraticCost.fn method is a straightforward computation of the quadratic cost associated to the actual output, a, and the desired output, y. The value returned by QuadraticCost.delta is based on the expression (30) for the output error for the quadratic cost, which we derived back in Chapter 2.
class QuadraticCost(object):

    @staticmethod
    def fn(a, y):
        return 0.5*np.linalg.norm(a-y)**2

    @staticmethod
    def delta(z, a, y):
        return (a-y) * sigmoid_prime(z)

We've now understood the main differences between network2.py and network.py. It's all pretty simple stuff. There are a number of smaller changes, which I'll discuss below, including the

implementation of L2 regularization. Before getting to that, let's look at the complete code for network2.py. You don't need to read all the code in detail, but it is worth understanding the broad structure, and in particular reading the documentation strings, so you understand what each piece of the program is doing. Of course, you're also welcome to delve as deeply as you wish! If you get lost, you may wish to continue reading the prose below, and return to the code later. Anyway, here's the code:
"""network2.py
~~~~~~~~~~~~~~
Animprovedversionofnetwork.py,implementingthestochastic
gradientdescentlearningalgorithmforafeedforwardneuralnetwork.
Improvementsincludetheadditionofthecrossentropycostfunction,
regularization,andbetterinitializationofnetworkweights.Note
thatIhavefocusedonmakingthecodesimple,easilyreadable,and
easilymodifiable.Itisnotoptimized,andomitsmanydesirable
features.
"""
####Libraries
#Standardlibrary
importjson
importrandom
importsys
#Thirdpartylibraries
importnumpyasnp

####Definethequadraticandcrossentropycostfunctions
classQuadraticCost(object):
@staticmethod
deffn(a,y):
"""Returnthecostassociatedwithanoutput``a``anddesiredoutput
``y``.
"""
return0.5*np.linalg.norm(ay)**2
@staticmethod
defdelta(z,a,y):
"""Returntheerrordeltafromtheoutputlayer."""
return(ay)*sigmoid_prime(z)

classCrossEntropyCost(object):
@staticmethod
deffn(a,y):
"""Returnthecostassociatedwithanoutput``a``anddesiredoutput
``y``.Notethatnp.nan_to_numisusedtoensurenumerical
stability.Inparticular,ifboth``a``and``y``havea1.0
inthesameslot,thentheexpression(1y)*np.log(1a)
returnsnan.Thenp.nan_to_numensuresthatthatisconverted
tothecorrectvalue(0.0).
"""

http://neuralnetworksanddeeplearning.com/chap3.html

58/92

5/31/2016

Neuralnetworksanddeeplearning

returnnp.sum(np.nan_to_num(y*np.log(a)(1y)*np.log(1a)))
@staticmethod
defdelta(z,a,y):
"""Returntheerrordeltafromtheoutputlayer.Notethatthe
parameter``z``isnotusedbythemethod.Itisincludedin
themethod'sparametersinordertomaketheinterface
consistentwiththedeltamethodforothercostclasses.
"""
return(ay)

####MainNetworkclass
classNetwork(object):
def__init__(self,sizes,cost=CrossEntropyCost):
"""Thelist``sizes``containsthenumberofneuronsintherespective
layersofthenetwork.Forexample,ifthelistwas[2,3,1]
thenitwouldbeathreelayernetwork,withthefirstlayer
containing2neurons,thesecondlayer3neurons,andthe
thirdlayer1neuron.Thebiasesandweightsforthenetwork
areinitializedrandomly,using
``self.default_weight_initializer``(seedocstringforthat
method).
"""
self.num_layers=len(sizes)
self.sizes=sizes
self.default_weight_initializer()
self.cost=cost
defdefault_weight_initializer(self):
"""InitializeeachweightusingaGaussiandistributionwithmean0
andstandarddeviation1overthesquarerootofthenumberof
weightsconnectingtothesameneuron.Initializethebiases
usingaGaussiandistributionwithmean0andstandard
deviation1.
Notethatthefirstlayerisassumedtobeaninputlayer,and
byconventionwewon'tsetanybiasesforthoseneurons,since
biasesareonlyeverusedincomputingtheoutputsfromlater
layers.
"""
self.biases=[np.random.randn(y,1)foryinself.sizes[1:]]
self.weights=[np.random.randn(y,x)/np.sqrt(x)
forx,yinzip(self.sizes[:1],self.sizes[1:])]
deflarge_weight_initializer(self):
"""InitializetheweightsusingaGaussiandistributionwithmean0
andstandarddeviation1.Initializethebiasesusinga
Gaussiandistributionwithmean0andstandarddeviation1.
Notethatthefirstlayerisassumedtobeaninputlayer,and
byconventionwewon'tsetanybiasesforthoseneurons,since
biasesareonlyeverusedincomputingtheoutputsfromlater
layers.
Thisweightandbiasinitializerusesthesameapproachasin
Chapter1,andisincludedforpurposesofcomparison.It
willusuallybebettertousethedefaultweightinitializer
instead.
"""
self.biases=[np.random.randn(y,1)foryinself.sizes[1:]]
self.weights=[np.random.randn(y,x)
forx,yinzip(self.sizes[:1],self.sizes[1:])]

    def feedforward(self, a):
        """Return the output of the network if ``a`` is input."""
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a)+b)
        return a

    def SGD(self, training_data, epochs, mini_batch_size, eta,
            lmbda=0.0,
            evaluation_data=None,
            monitor_evaluation_cost=False,
            monitor_evaluation_accuracy=False,
            monitor_training_cost=False,
            monitor_training_accuracy=False):
        """Train the neural network using mini-batch stochastic gradient
        descent.  The ``training_data`` is a list of tuples ``(x, y)``
        representing the training inputs and the desired outputs.  The
        other non-optional parameters are self-explanatory, as is the
        regularization parameter ``lmbda``.  The method also accepts
        ``evaluation_data``, usually either the validation or test
        data.  We can monitor the cost and accuracy on either the
        evaluation data or the training data, by setting the
        appropriate flags.  The method returns a tuple containing four
        lists: the (per-epoch) costs on the evaluation data, the
        accuracies on the evaluation data, the costs on the training
        data, and the accuracies on the training data.  All values are
        evaluated at the end of each training epoch.  So, for example,
        if we train for 30 epochs, then the first element of the tuple
        will be a 30-element list containing the cost on the
        evaluation data at the end of each epoch.  Note that the lists
        are empty if the corresponding flag is not set.

        """
        if evaluation_data: n_data = len(evaluation_data)
        n = len(training_data)
        evaluation_cost, evaluation_accuracy = [], []
        training_cost, training_accuracy = [], []
        for j in xrange(epochs):
            random.shuffle(training_data)
            mini_batches = [
                training_data[k:k+mini_batch_size]
                for k in xrange(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(
                    mini_batch, eta, lmbda, len(training_data))
            print "Epoch %s training complete" % j
            if monitor_training_cost:
                cost = self.total_cost(training_data, lmbda)
                training_cost.append(cost)
                print "Cost on training data: {}".format(cost)
            if monitor_training_accuracy:
                accuracy = self.accuracy(training_data, convert=True)
                training_accuracy.append(accuracy)
                print "Accuracy on training data: {} / {}".format(
                    accuracy, n)
            if monitor_evaluation_cost:
                cost = self.total_cost(evaluation_data, lmbda, convert=True)
                evaluation_cost.append(cost)
                print "Cost on evaluation data: {}".format(cost)
            if monitor_evaluation_accuracy:
                accuracy = self.accuracy(evaluation_data)
                evaluation_accuracy.append(accuracy)
                print "Accuracy on evaluation data: {} / {}".format(
                    self.accuracy(evaluation_data), n_data)
            print
        return evaluation_cost, evaluation_accuracy, \
            training_cost, training_accuracy

    def update_mini_batch(self, mini_batch, eta, lmbda, n):
        """Update the network's weights and biases by applying gradient
        descent using backpropagation to a single mini batch.  The
        ``mini_batch`` is a list of tuples ``(x, y)``, ``eta`` is the
        learning rate, ``lmbda`` is the regularization parameter, and
        ``n`` is the total size of the training data set.

        """
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [(1-eta*(lmbda/n))*w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]

    def backprop(self, x, y):
        """Return a tuple ``(nabla_b, nabla_w)`` representing the
        gradient for the cost function C_x.  ``nabla_b`` and
        ``nabla_w`` are layer-by-layer lists of numpy arrays, similar
        to ``self.biases`` and ``self.weights``."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = (self.cost).delta(zs[-1], activations[-1], y)
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        # Note that the variable l in the loop below is used a little
        # differently to the notation in Chapter 2 of the book.  Here,
        # l = 1 means the last layer of neurons, l = 2 is the
        # second-last layer, and so on.  It's a renumbering of the
        # scheme in the book, used here to take advantage of the fact
        # that Python can use negative indices in lists.
        for l in xrange(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

    def accuracy(self, data, convert=False):
        """Return the number of inputs in ``data`` for which the neural
        network outputs the correct result.  The neural network's
        output is assumed to be the index of whichever neuron in the
        final layer has the highest activation.

        The flag ``convert`` should be set to False if the data set is
        validation or test data (the usual case), and to True if the
        data set is the training data.  The need for this flag arises
        due to differences in the way the results ``y`` are
        represented in the different data sets.  In particular, it
        flags whether we need to convert between the different
        representations.  It may seem strange to use different
        representations for the different data sets.  Why not use the
        same representation for all three data sets?  It's done for
        efficiency reasons -- the program usually evaluates the cost
        on the training data and the accuracy on other data sets.
        These are different types of computations, and using different
        representations speeds things up.  More details on the
        representations can be found in
        mnist_loader.load_data_wrapper.

        """
        if convert:
            results = [(np.argmax(self.feedforward(x)), np.argmax(y))
                       for (x, y) in data]
        else:
            results = [(np.argmax(self.feedforward(x)), y)
                       for (x, y) in data]
        return sum(int(x == y) for (x, y) in results)

    def total_cost(self, data, lmbda, convert=False):
        """Return the total cost for the data set ``data``.  The flag
        ``convert`` should be set to False if the data set is the
        training data (the usual case), and to True if the data set is
        the validation or test data.  See comments on the similar (but
        reversed) convention for the ``accuracy`` method, above.

        """
        cost = 0.0
        for x, y in data:
            a = self.feedforward(x)
            if convert: y = vectorized_result(y)
            cost += self.cost.fn(a, y)/len(data)
        cost += 0.5*(lmbda/len(data))*sum(
            np.linalg.norm(w)**2 for w in self.weights)
        return cost

    def save(self, filename):
        """Save the neural network to the file ``filename``."""
        data = {"sizes": self.sizes,
                "weights": [w.tolist() for w in self.weights],
                "biases": [b.tolist() for b in self.biases],
                "cost": str(self.cost.__name__)}
        f = open(filename, "w")
        json.dump(data, f)
        f.close()

#### Loading a Network
def load(filename):
    """Load a neural network from the file ``filename``.  Returns an
    instance of Network.

    """
    f = open(filename, "r")
    data = json.load(f)
    f.close()
    cost = getattr(sys.modules[__name__], data["cost"])
    net = Network(data["sizes"], cost=cost)
    net.weights = [np.array(w) for w in data["weights"]]
    net.biases = [np.array(b) for b in data["biases"]]
    return net

#### Miscellaneous functions
def vectorized_result(j):
    """Return a 10-dimensional unit vector with a 1.0 in the j'th position
    and zeroes elsewhere.  This is used to convert a digit (0...9)
    into a corresponding desired output from the neural network.

    """
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e
def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))

One of the more interesting changes in the code is to include L2 regularization. Although this is a major conceptual change, it's so trivial to implement that it's easy to miss in the code. For the most part it just involves passing the parameter lmbda to various methods, notably the Network.SGD method. The real work is done in a single line of the program, the fourth-last line of the Network.update_mini_batch method. That's where we modify the gradient descent update rule to include weight decay. But although the modification is tiny, it has a big impact on results!
This is, by the way, common when implementing new techniques in neural networks. We've spent thousands of words discussing regularization. It's conceptually quite subtle and difficult to understand. And yet it was trivial to add to our program! It occurs surprisingly often that sophisticated techniques can be implemented with small changes to code.
Another small but important change to our code is the addition of several optional flags to the stochastic gradient descent method, Network.SGD. These flags make it possible to monitor the cost and accuracy either on the training_data or on a set of evaluation_data which can be passed to Network.SGD. We've used these flags often earlier in the chapter, but let me give an example of how it works, just to remind you:
>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
>>> import network2
>>> net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
>>> net.SGD(training_data, 30, 10, 0.5,
... lmbda=5.0,
... evaluation_data=validation_data,
... monitor_evaluation_accuracy=True,
... monitor_evaluation_cost=True,
... monitor_training_accuracy=True,
... monitor_training_cost=True)

Here, we're setting the evaluation_data to be the validation_data. But we could also have monitored performance on the test_data or any other data set. We also have four flags telling us to monitor the cost and accuracy on both the evaluation_data and the training_data.
Those flags are False by default, but they've been turned on here in order to monitor our Network's performance. Furthermore, network2.py's Network.SGD method returns a four-element tuple representing the results of the monitoring. We can use this as follows:
>>> evaluation_cost, evaluation_accuracy, \
... training_cost, training_accuracy = net.SGD(training_data, 30, 10, 0.5,
... lmbda=5.0,
... evaluation_data=validation_data,
... monitor_evaluation_accuracy=True,
... monitor_evaluation_cost=True,
... monitor_training_accuracy=True,
... monitor_training_cost=True)

So, for example, evaluation_cost will be a 30-element list containing the cost on the evaluation data at the end of each epoch. This sort of information is extremely useful in understanding a network's behaviour. It can, for example, be used to draw graphs showing how the network learns over time. Indeed, that's exactly how I constructed all the graphs earlier in the chapter. Note, however, that if any of the monitoring flags are not set, then the corresponding element in the tuple will be the empty list.
Other additions to the code include a Network.save method, to save Network objects to disk, and a function to load them back in again later. Note that the saving and loading is done using JSON, not Python's pickle or cPickle modules, which are the usual way we save and load objects to and from disk in Python. Using JSON requires more code than pickle or cPickle would. To understand why I've used JSON, imagine that at some time in the future we decided to change our Network class to allow neurons other than sigmoid neurons. To implement that change we'd most likely change the attributes defined in the Network.__init__ method. If we've simply pickled the objects that would cause our load function to fail. Using JSON to do the serialization explicitly makes it easy to ensure that old Networks will still load.
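For instance, saving and later restoring a trained network looks like this; the file name here is just an example:

>>> net.save("net.json")
>>> import network2
>>> net2 = network2.load("net.json")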
There are many other minor changes in the code for network2.py, but they're all simple variations on network.py. The net result is to expand our 74-line program to a far more capable 152 lines.

Problems

Modify the code above to implement L1 regularization, and use
L1 regularization to classify MNIST digits using a 30 hidden neuron network. Can you find a regularization parameter that enables you to do better than running unregularized?

Take a look at the Network.cost_derivative method in network.py. That method was written for the quadratic cost. How would you rewrite the method for the cross-entropy cost? Can you think of a problem that might arise in the cross-entropy version? In network2.py we've eliminated the Network.cost_derivative method entirely, instead incorporating its functionality into the CrossEntropyCost.delta method. How does this solve the problem you've just identified?

How to choose a neural network's hyper-parameters?

Up until now I haven't explained how I've been choosing values for hyper-parameters such as the learning rate, η, the regularization parameter, λ, and so on. I've just been supplying values which work pretty well. In practice, when you're using neural nets to attack a problem, it can be difficult to find good hyper-parameters. Imagine, for example, that we've just been introduced to the MNIST problem, and have begun working on it, knowing nothing at all about what hyper-parameters to use. Let's suppose that by good fortune in our first experiments we choose many of the hyper-parameters in the same way as was done earlier this chapter: 30 hidden neurons, a mini-batch size of 10, training for 30 epochs using the cross-entropy. But we choose a learning rate η = 10.0 and regularization parameter λ = 1000.0. Here's what I saw on one such run:
>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
>>> import network2
>>> net = network2.Network([784, 30, 10])
>>> net.SGD(training_data, 30, 10, 10.0, lmbda=1000.0,
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)
Epoch 0 training complete
Accuracy on evaluation data: 1030 / 10000
Epoch 1 training complete
Accuracy on evaluation data: 990 / 10000
Epoch 2 training complete
Accuracy on evaluation data: 1009 / 10000
...

Epoch 27 training complete
Accuracy on evaluation data: 1009 / 10000
Epoch 28 training complete
Accuracy on evaluation data: 983 / 10000
Epoch 29 training complete
Accuracy on evaluation data: 967 / 10000

Our classification accuracies are no better than chance! Our network is acting as a random noise generator!
"Well,that'seasytofix,"youmightsay,"justdecreasethelearning
rateandregularizationhyperparameters".Unfortunately,you
don'taprioriknowthosearethehyperparametersyouneedto
adjust.Maybetherealproblemisthatour30hiddenneuron
networkwillneverworkwell,nomatterhowtheotherhyper
parametersarechosen?Maybewereallyneedatleast100hidden
neurons?Or300hiddenneurons?Ormultiplehiddenlayers?Ora
differentapproachtoencodingtheoutput?Maybeournetworkis
learning,butweneedtotrainformoreepochs?Maybethemini
batchesaretoosmall?Maybewe'ddobetterswitchingbacktothe
quadraticcostfunction?Maybeweneedtotryadifferentapproach
toweightinitialization?Andsoon,onandonandon.It'seasyto
feellostinhyperparameterspace.Thiscanbeparticularly
frustratingifyournetworkisverylarge,orusesalotoftraining
data,sinceyoumaytrainforhoursordaysorweeks,onlytogetno
result.Ifthesituationpersists,itdamagesyourconfidence.Maybe
neuralnetworksarethewrongapproachtoyourproblem?Maybe
youshouldquityourjobandtakeupbeekeeping?
InthissectionIexplainsomeheuristicswhichcanbeusedtoset
thehyperparametersinaneuralnetwork.Thegoalistohelpyou
developaworkflowthatenablesyoutodoaprettygoodjobsetting
hyperparameters.Ofcourse,Iwon'tcovereverythingabouthyper
parameteroptimization.That'sahugesubject,andit'snot,inany
case,aproblemthatisevercompletelysolved,noristhereuniversal
agreementamongstpractitionersontherightstrategiestouse.
There'salwaysonemoretrickyoucantrytoekeoutabitmore
performancefromyournetwork.Buttheheuristicsinthissection
shouldgetyoustarted.
Broad strategy: When using neural networks to attack a new problem the first challenge is to get any non-trivial learning, i.e., for the network to achieve results better than chance. This can be surprisingly difficult, especially when confronting a new class of problem. Let's look at some strategies you can use if you're having this kind of trouble.

Suppose, for example, that you're attacking MNIST for the first time. You start out enthusiastic, but are a little discouraged when your first network fails completely, as in the example above. The way to go is to strip the problem down. Get rid of all the training and validation images except images which are 0s or 1s. Then try to train a network to distinguish 0s from 1s. Not only is that an inherently easier problem than distinguishing all ten digits, it also reduces the amount of training data by 80 percent, speeding up training by a factor of 5. That enables much more rapid experimentation, and so gives you more rapid insight into how to build a good network.
You can further speed up experimentation by stripping your network down to the simplest network likely to do meaningful learning. If you believe a [784, 10] network can likely do better-than-chance classification of MNIST digits, then begin your experimentation with such a network. It'll be much faster than training a [784, 30, 10] network, and you can build back up to the latter.

You can get another speed up in experimentation by increasing the frequency of monitoring. In network2.py we monitor performance at the end of each training epoch. With 50,000 images per epoch, that means waiting a little while (about ten seconds per epoch, on my laptop, when training a [784, 30, 10] network) before getting feedback on how well the network is learning. Of course, ten seconds isn't very long, but if you want to trial dozens of hyper-parameter choices it's annoying, and if you want to trial hundreds or thousands of choices it starts to get debilitating. We can get feedback more quickly by monitoring the validation accuracy more often, say, after every 1,000 training images. Furthermore, instead of using the full 10,000-image validation set to monitor performance, we can get a much faster estimate using just 100 validation images. All that matters is that the network sees enough images to do real learning, and to get a pretty good rough estimate of performance. Of course, our program network2.py doesn't currently do this kind of monitoring. But as a kludge to achieve a similar effect for the purposes of illustration, we'll strip down our training data to just the first 1,000 MNIST training images. Let's try it and see what happens. (To keep the code below simple I haven't implemented the idea of using only 0 and 1 images. Of course, that can be done with just a little more work.)
>>> net = network2.Network([784, 10])
>>> net.SGD(training_data[:1000], 30, 10, 10.0, lmbda=1000.0, \
... evaluation_data=validation_data[:100], \
... monitor_evaluation_accuracy=True)
Epoch 0 training complete
Accuracy on evaluation data: 10 / 100
Epoch 1 training complete
Accuracy on evaluation data: 10 / 100
Epoch 2 training complete
Accuracy on evaluation data: 10 / 100
...

We're still getting pure noise! But there's a big win: we're now getting feedback in a fraction of a second, rather than once every ten seconds or so. That means you can more quickly experiment with other choices of hyper-parameter, or even conduct experiments trialling many different choices of hyper-parameter nearly simultaneously.

In the above example I left λ = 1000.0, as we used earlier. But since we changed the number of training examples we should really change λ to keep the weight decay the same. That means changing λ to 20.0. If we do that then this is what happens:
>>> net = network2.Network([784, 10])
>>> net.SGD(training_data[:1000], 30, 10, 10.0, lmbda=20.0, \
... evaluation_data=validation_data[:100], \
... monitor_evaluation_accuracy=True)
Epoch 0 training complete
Accuracy on evaluation data: 12 / 100
Epoch 1 training complete
Accuracy on evaluation data: 14 / 100
Epoch 2 training complete
Accuracy on evaluation data: 25 / 100
Epoch 3 training complete
Accuracy on evaluation data: 18 / 100
...

Ahah! We have a signal. Not a terribly good signal, but a signal nonetheless. That's something we can build on, modifying the hyper-parameters to try to get further improvement. Maybe we guess that our learning rate needs to be higher. (As you perhaps realize, that's a silly guess, for reasons we'll discuss shortly, but please bear with me.) So to test our guess we try dialing η up to 100.0:

>>> net = network2.Network([784, 10])
>>> net.SGD(training_data[:1000], 30, 10, 100.0, lmbda=20.0, \
... evaluation_data=validation_data[:100], \
... monitor_evaluation_accuracy=True)
Epoch 0 training complete
Accuracy on evaluation data: 10 / 100
Epoch 1 training complete
Accuracy on evaluation data: 10 / 100
Epoch 2 training complete
Accuracy on evaluation data: 10 / 100
Epoch 3 training complete
Accuracy on evaluation data: 10 / 100
...

That's no good! It suggests that our guess was wrong, and the problem wasn't that the learning rate was too low. So instead we try dialing η down to η = 1.0:

>>> net = network2.Network([784, 10])
>>> net.SGD(training_data[:1000], 30, 10, 1.0, lmbda=20.0, \
... evaluation_data=validation_data[:100], \
... monitor_evaluation_accuracy=True)
Epoch 0 training complete
Accuracy on evaluation data: 62 / 100
Epoch 1 training complete
Accuracy on evaluation data: 42 / 100
Epoch 2 training complete
Accuracy on evaluation data: 43 / 100
Epoch 3 training complete
Accuracy on evaluation data: 61 / 100
...

That's better! And so we can continue, individually adjusting each hyper-parameter, gradually improving performance. Once we've explored to find an improved value for η, then we move on to find a good value for λ. Then experiment with a more complex architecture, say a network with 10 hidden neurons. Then adjust the values for η and λ again. Then increase to 20 hidden neurons. And then adjust other hyper-parameters some more. And so on, at each stage evaluating performance using our held-out validation data, and using those evaluations to find better and better hyper-parameters. As we do so, it typically takes longer to witness the impact due to modifications of the hyper-parameters, and so we can gradually decrease the frequency of monitoring.
This all looks very promising as a broad strategy. However, I want to return to that initial stage of finding hyper-parameters that enable a network to learn anything at all. In fact, even the above discussion conveys too positive an outlook. It can be immensely frustrating to work with a network that's learning nothing. You can tweak hyper-parameters for days, and still get no meaningful response. And so I'd like to re-emphasize that during the early stages you should make sure you can get quick feedback from experiments. Intuitively, it may seem as though simplifying the problem and the architecture will merely slow you down. In fact, it speeds things up, since you much more quickly find a network with a meaningful signal. Once you've got such a signal, you can often get rapid improvements by tweaking the hyper-parameters. As with many things in life, getting started can be the hardest thing to do.

Okay, that's the broad strategy. Let's now look at some specific recommendations for setting hyper-parameters. I will focus on the learning rate, η, the L2 regularization parameter, λ, and the mini-batch size. However, many of the remarks apply also to other hyper-parameters, including those associated to network architecture, other forms of regularization, and some hyper-parameters we'll meet later in the book, such as the momentum co-efficient.
Learning rate: Suppose we run three MNIST networks with three different learning rates, η = 0.025, η = 0.25 and η = 2.5, respectively. We'll set the other hyper-parameters as for the experiments in earlier sections, running over 30 epochs, with a mini-batch size of 10, and with λ = 5.0. We'll also return to using the full 50,000 training images. Here's a graph showing the behaviour of the training cost as we train*:

[Graph: training cost versus epoch for the three learning rates η = 0.025, η = 0.25, and η = 2.5.]

*The graph was generated by multiple_eta.py.
With η = 0.025 the cost decreases smoothly until the final epoch. With η = 0.25 the cost initially decreases, but after about 20 epochs it is near saturation, and thereafter most of the changes are merely small and apparently random oscillations. Finally, with η = 2.5 the cost makes large oscillations right from the start. To understand the reason for the oscillations, recall that stochastic gradient descent is supposed to step us gradually down into a valley of the cost function:

[Figure: gradient descent steps descending into a valley of the cost function.]

However, if η is too large then the steps will be so large that they may actually overshoot the minimum, causing the algorithm to climb up out of the valley instead. That's likely* what's causing the cost to oscillate when η = 2.5. When we choose η = 0.25 the initial steps do take us toward a minimum of the cost function, and it's only once we get near that minimum that we start to suffer from the overshooting problem. And when we choose η = 0.025 we don't suffer from this problem at all during the first 30 epochs. Of course, choosing η so small creates another problem, namely, that it slows down stochastic gradient descent. An even better approach would be to start with η = 0.25, train for 20 epochs, and then switch to η = 0.025. We'll discuss such variable learning rate schedules later. For now, though, let's stick to figuring out how to find a single good value for the learning rate, η.

*This picture is helpful, but it's intended as an intuition-building illustration of what may go on, not as a complete, exhaustive explanation. Briefly, a more complete explanation is as follows: gradient descent uses a first-order approximation to the cost function as a guide to how to decrease the cost. For large η, higher-order terms in the cost function become more important, and may dominate the behaviour, causing gradient descent to break down. This is especially likely as we approach minima and quasi-minima of the cost function, since near such points the gradient becomes small, making it easier for higher-order terms to dominate behaviour.
With this picture in mind, we can set η as follows. First, we estimate the threshold value for η at which the cost on the training data immediately begins decreasing, instead of oscillating or increasing. This estimate doesn't need to be too accurate. You can estimate the order of magnitude by starting with η = 0.01. If the cost decreases during the first few epochs, then you should successively try η = 0.1, 1.0, … until you find a value for η where the cost oscillates or increases during the first few epochs. Alternately, if the cost oscillates or increases during the first few epochs when η = 0.01, then try η = 0.001, 0.0001, … until you find a value for η where the cost decreases during the first few epochs. Following this procedure will give us an order-of-magnitude estimate for the threshold value of η. You may optionally refine your estimate, to pick out the largest value of η at which the cost decreases during the first few epochs, say η = 0.5 or η = 0.2 (there's no need for this to be super-accurate). This gives us an estimate for the threshold value of η.

Obviously, the actual value of η that you use should be no larger than the threshold value. In fact, if the value of η is to remain usable over many epochs then you likely want to use a value for η that is smaller, say, a factor of two below the threshold. Such a choice will typically allow you to train for many epochs, without causing too much of a slowdown in learning.

In the case of the MNIST data, following this strategy leads to an estimate of 0.1 for the order of magnitude of the threshold value of η. After some more refinement, we obtain a threshold value η = 0.5. Following the prescription above, this suggests using η = 0.25 as our value for the learning rate. In fact, I found that using η = 0.5 worked well enough over 30 epochs that for the most part I didn't worry about using a lower value of η.
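As an illustration, here's a minimal sketch of that order-of-magnitude search (my own illustrative code, not part of network2.py; it assumes a hypothetical helper run_few_epochs(eta) that trains briefly and returns True if the training cost decreased):

def estimate_eta_threshold(run_few_epochs, eta=0.01):
    # Search by factors of 10 for the largest order of magnitude
    # at which the training cost still decreases.
    if run_few_epochs(eta):
        while run_few_epochs(eta * 10):
            eta *= 10
        return eta
    else:
        while not run_few_epochs(eta):
            eta /= 10
        return eta

A value around a factor of two below the returned threshold is then a reasonable working choice for η.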

This all seems quite straightforward. However, using the training cost to pick η appears to contradict what I said earlier in this section, namely, that we'd pick hyper-parameters by evaluating performance using our held-out validation data. In fact, we'll use validation accuracy to pick the regularization hyper-parameter, the mini-batch size, and network parameters such as the number of layers and hidden neurons, and so on. Why do things differently for the learning rate? Frankly, this choice is my personal aesthetic preference, and is perhaps somewhat idiosyncratic. The reasoning is that the other hyper-parameters are intended to improve the final classification accuracy on the test set, and so it makes sense to select them on the basis of validation accuracy. However, the learning rate is only incidentally meant to impact the final classification accuracy. Its primary purpose is really to control the step size in gradient descent, and monitoring the training cost is the best way to detect if the step size is too big. With that said, this is a personal aesthetic preference. Early on during learning the training cost usually only decreases if the validation accuracy improves, and so in practice it's unlikely to make much difference which criterion you use.
Use early stopping to determine the number of training epochs: As we discussed earlier in the chapter, early stopping means that at the end of each epoch we should compute the classification accuracy on the validation data. When that stops improving, terminate. This makes setting the number of epochs very simple. In particular, it means that we don't need to worry about explicitly figuring out how the number of epochs depends on the other hyper-parameters. Instead, that's taken care of automatically. Furthermore, early stopping also automatically prevents us from overfitting. This is, of course, a good thing, although in the early stages of experimentation it can be helpful to turn off early stopping, so you can see any signs of overfitting, and use it to inform your approach to regularization.
To implement early stopping we need to say more precisely what it means that the classification accuracy has stopped improving. As we've seen, the accuracy can jump around quite a bit, even when the overall trend is to improve. If we stop the first time the accuracy decreases then we'll almost certainly stop when there are more improvements to be had. A better rule is to terminate if the best classification accuracy doesn't improve for quite some time. Suppose, for example, that we're doing MNIST. Then we might elect to terminate if the classification accuracy hasn't improved during the last ten epochs. This ensures that we don't stop too soon, in response to bad luck in training, but also that we're not waiting around forever for an improvement that never comes.
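Here's a minimal sketch of what a no-improvement-in-ten rule might look like, wrapped around two hypothetical helpers, train_epoch() for one pass of stochastic gradient descent, and accuracy(validation_data) for the validation score (the scaffolding is my own illustrative code, not network2.py's):

best_accuracy = 0
epochs_since_improvement = 0
while epochs_since_improvement < 10:
    train_epoch()
    acc = accuracy(validation_data)
    if acc > best_accuracy:
        # A new best: reset the patience counter.
        best_accuracy = acc
        epochs_since_improvement = 0
    else:
        epochs_since_improvement += 1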
This no-improvement-in-ten rule is good for initial exploration of MNIST. However, networks can sometimes plateau near a particular classification accuracy for quite some time, only to then begin improving again. If you're trying to get really good performance, the no-improvement-in-ten rule may be too aggressive about stopping. In that case, I suggest using the no-improvement-in-ten rule for initial experimentation, and gradually adopting more lenient rules, as you better understand the way your network trains: no-improvement-in-twenty, no-improvement-in-fifty, and so on. Of course, this introduces a new hyper-parameter to optimize! In practice, however, it's usually easy to set this hyper-parameter to get pretty good results. Similarly, for problems other than MNIST, the no-improvement-in-ten rule may be much too aggressive or not nearly aggressive enough, depending on the details of the problem. However, with a little experimentation it's usually easy to find a pretty good strategy for early stopping.

We haven't used early stopping in our MNIST experiments to date. The reason is that we've been doing a lot of comparisons between different approaches to learning. For such comparisons it's helpful to use the same number of epochs in each case. However, it's well worth modifying network2.py to implement early stopping:
Problem

- Modify network2.py so that it implements early stopping using a no-improvement-in-n epochs strategy, where n is a parameter that can be set.

- Can you think of a rule for early stopping other than no-improvement-in-n? Ideally, the rule should compromise between getting high validation accuracies and not training too long. Add your rule to network2.py, and run three experiments comparing the validation accuracies and number of epochs of training to no-improvement-in-10.
Learning rate schedule: We've been holding the learning rate η constant. However, it's often advantageous to vary the learning rate. Early on during the learning process it's likely that the weights are badly wrong. And so it's best to use a large learning rate that causes the weights to change quickly. Later, we can reduce the learning rate as we make more fine-tuned adjustments to our weights.
How should we set our learning rate schedule? Many approaches are possible. One natural approach is to use the same basic idea as early stopping. The idea is to hold the learning rate constant until the validation accuracy starts to get worse. Then decrease the learning rate by some amount, say a factor of two or ten. We repeat this many times, until, say, the learning rate is a factor of 1,024 (or 1,000) times lower than the initial value. Then we terminate.
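In code, such a schedule might look something like the following sketch (again my own illustrative scaffolding, reusing the hypothetical train_epoch and accuracy helpers from the early-stopping sketch above, with train_epoch now taking the current learning rate):

eta = 0.5                     # initial learning rate
initial_eta = eta
best_accuracy = 0
epochs_since_improvement = 0
while eta > initial_eta / 1024:
    train_epoch(eta)
    acc = accuracy(validation_data)
    if acc > best_accuracy:
        best_accuracy = acc
        epochs_since_improvement = 0
    else:
        epochs_since_improvement += 1
    if epochs_since_improvement >= 10:
        eta /= 2              # halve the rate when progress stalls
        epochs_since_improvement = 0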
A variable learning schedule can improve performance, but it also opens up a world of possible choices for the learning schedule. Those choices can be a headache: you can spend forever trying to optimize your learning schedule. For first experiments my suggestion is to use a single, constant value for the learning rate. That'll get you a good first approximation. Later, if you want to obtain the best performance from your network, it's worth experimenting with a learning schedule, along the lines I've described*.

*A readable recent paper which demonstrates the benefits of variable learning rates in attacking MNIST is Deep, Big, Simple Neural Nets Excel on Handwritten Digit Recognition, by Dan Claudiu Cireşan, Ueli Meier, Luca Maria Gambardella, and Jürgen Schmidhuber (2010).

Exercise

- Modify network2.py so that it implements a learning schedule that: halves the learning rate each time the validation accuracy satisfies the no-improvement-in-10 rule; and terminates when the learning rate has dropped to 1/128 of its original value.
The regularization parameter, λ: I suggest starting initially with no regularization (λ = 0.0), and determining a value for η, as above. Using that choice of η, we can then use the validation data to select a good value for λ. Start by trialling λ = 1.0*, and then increase or decrease by factors of 10, as needed to improve performance on the validation data. Once you've found a good order of magnitude, you can fine tune your value of λ. That done, you should return and re-optimize η again.

*I don't have a good principled justification for using this as a starting value. If anyone knows of a good principled discussion of where to start with λ, I'd appreciate hearing it (mn@michaelnielsen.org).

Exercise

- It's tempting to use gradient descent to try to learn good values for hyper-parameters such as λ and η. Can you think of an obstacle to using gradient descent to determine λ? Can you think of an obstacle to using gradient descent to determine η?
How I selected hyper-parameters earlier in this book: If you use the recommendations in this section you'll find that you get values for η and λ which don't always exactly match the values I've used earlier in the book. The reason is that the book has narrative constraints that have sometimes made it impractical to optimize the hyper-parameters. Think of all the comparisons we've made of different approaches to learning, e.g., comparing the quadratic and cross-entropy cost functions, comparing the old and new methods of weight initialization, running with and without regularization, and so on. To make such comparisons meaningful, I've usually tried to keep hyper-parameters constant across the approaches being compared (or to scale them in an appropriate way). Of course, there's no reason for the same hyper-parameters to be optimal for all the different approaches to learning, so the hyper-parameters I've used are something of a compromise.

As an alternative to this compromise, I could have tried to optimize the heck out of the hyper-parameters for every single approach to learning. In principle that'd be a better, fairer approach, since then we'd see the best from every approach to learning. However, we've made dozens of comparisons along these lines, and in practice I found it too computationally expensive. That's why I've adopted the compromise of using pretty good (but not necessarily optimal) choices for the hyper-parameters.
Mini-batch size: How should we set the mini-batch size? To answer this question, let's first suppose that we're doing online learning, i.e., that we're using a mini-batch size of 1.

The obvious worry about online learning is that using mini-batches which contain just a single training example will cause significant errors in our estimate of the gradient. In fact, though, the errors turn out to not be such a problem. The reason is that the individual gradient estimates don't need to be super-accurate. All we need is an estimate accurate enough that our cost function tends to keep decreasing. It's as though you are trying to get to the North Magnetic Pole, but have a wonky compass that's 10-20 degrees off each time you look at it. Provided you stop to check the compass frequently, and the compass gets the direction right on average, you'll end up at the North Magnetic Pole just fine.

Based on this argument, it sounds as though we should use online learning. In fact, the situation turns out to be more complicated than that. In a problem in the last chapter I pointed out that it's possible to use matrix techniques to compute the gradient update for all examples in a mini-batch simultaneously, rather than looping over them. Depending on the details of your hardware and linear algebra library this can make it quite a bit faster to compute the gradient estimate for a mini-batch of (for example) size 100, rather than computing the mini-batch gradient estimate by looping over the 100 training examples separately. It might take (say) only 50 times as long, rather than 100 times as long.
Now, at first it seems as though this doesn't help us that much. With our mini-batch of size 100 the learning rule for the weights looks like:

w → w′ = w − (η/100) ∑_x ∇C_x,

where the sum is over training examples in the mini-batch. This is versus

w → w′ = w − η∇C_x

for online learning. Even if it only takes 50 times as long to do the mini-batch update, it still seems likely to be better to do online learning, because we'd be updating so much more frequently. Suppose, however, that in the mini-batch case we increase the learning rate by a factor 100, so the update rule becomes

w → w′ = w − η ∑_x ∇C_x.

That's a lot like doing 100 separate instances of online learning with a learning rate of η. But it only takes 50 times as long as doing a single instance of online learning. Of course, it's not truly the same as 100 instances of online learning, since in the mini-batch the ∇C_x's are all evaluated for the same set of weights, as opposed to the cumulative learning that occurs in the online case. Still, it seems distinctly possible that using the larger mini-batch would speed things up.
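To make the speed argument concrete, here's a small numpy sketch (illustrative only) comparing a looped computation with a single matrix multiplication over a mini-batch of 100 inputs. The matrix version computes exactly the same weighted inputs, but typically much faster:

import time
import numpy as np

W = np.random.randn(30, 784)   # weights for a 784-to-30 layer
batch = [np.random.randn(784, 1) for _ in range(100)]

start = time.time()
looped = [np.dot(W, x) for x in batch]      # one example at a time
print("loop:   {:.4f}s".format(time.time() - start))

X = np.hstack(batch)           # 784 x 100 matrix, one column per example
start = time.time()
vectorized = np.dot(W, X)      # all 100 examples at once
print("matrix: {:.4f}s".format(time.time() - start))

# Both approaches produce the same weighted inputs:
assert np.allclose(np.hstack(looped), vectorized)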
With these factors in mind, choosing the best mini-batch size is a compromise. Too small, and you don't get to take full advantage of the benefits of good matrix libraries optimized for fast hardware. Too large and you're simply not updating your weights often enough. What you need is to choose a compromise value which maximizes the speed of learning. Fortunately, the choice of mini-batch size at which the speed is maximized is relatively independent of the other hyper-parameters (apart from the overall architecture), so you don't need to have optimized those hyper-parameters in order to find a good mini-batch size. The way to go is therefore to use some acceptable (but not necessarily optimal) values for the other hyper-parameters, and then trial a number of different mini-batch sizes, scaling η as above. Plot the validation accuracy versus time (as in, real elapsed time, not epoch!), and choose whichever mini-batch size gives you the most rapid improvement in performance. With the mini-batch size chosen you can then proceed to optimize the other hyper-parameters.

Of course, as you've no doubt realized, I haven't done this optimization in our work. Indeed, our implementation doesn't use the faster approach to mini-batch updates at all. I've simply used a mini-batch size of 10 without comment or explanation in nearly all examples. Because of this, we could have sped up learning by reducing the mini-batch size. I haven't done this, in part because I wanted to illustrate the use of mini-batches beyond size 1, and in part because my preliminary experiments suggested the speedup would be rather modest. In practical implementations, however, we would most certainly implement the faster approach to mini-batch updates, and then make an effort to optimize the mini-batch size, in order to maximize our overall speed.
Automated techniques: I've been describing these heuristics as though you're optimizing your hyper-parameters by hand. Hand-optimization is a good way to build up a feel for how neural networks behave. However, and unsurprisingly, a great deal of work has been done on automating the process. A common technique is grid search, which systematically searches through a grid in hyper-parameter space. A review of both the achievements and the limitations of grid search (with suggestions for easily-implemented alternatives) may be found in a 2012 paper* by James Bergstra and Yoshua Bengio. Many more sophisticated approaches have also been proposed. I won't review all that work here, but do want to mention a particularly promising 2012 paper which used a Bayesian approach to automatically optimize hyper-parameters*. The code from the paper is publicly available, and has been used with some success by other researchers.

*Random search for hyper-parameter optimization, by James Bergstra and Yoshua Bengio (2012).

*Practical Bayesian optimization of machine learning algorithms, by Jasper Snoek, Hugo Larochelle, and Ryan Adams.
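To give the flavor of the random-search alternative Bergstra and Bengio advocate, here's a minimal sketch using our network2.py interface (the sampling ranges, the short training runs, and the use of the final evaluation accuracy as a score are my own illustrative choices):

import random
import network2

def random_search(training_data, validation_data, trials=20):
    best = None
    for _ in range(trials):
        # Sample hyper-parameters log-uniformly over several orders
        # of magnitude, rather than stepping through a fixed grid.
        eta = 10 ** random.uniform(-3, 1)
        lmbda = 10 ** random.uniform(-2, 2)
        net = network2.Network([784, 30, 10],
                               cost=network2.CrossEntropyCost)
        _, accuracy, _, _ = net.SGD(
            training_data[:1000], 5, 10, eta, lmbda=lmbda,
            evaluation_data=validation_data[:100],
            monitor_evaluation_accuracy=True)
        score = accuracy[-1]   # accuracy after the final epoch
        if best is None or score > best[0]:
            best = (score, eta, lmbda)
    return best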
Summing up: Following the rules-of-thumb I've described won't give you the absolute best possible results from your neural network. But it will likely give you a good start and a basis for further improvements. In particular, I've discussed the hyper-parameters largely independently. In practice, there are relationships between the hyper-parameters. You may experiment with η, feel that you've got it just right, then start to optimize for λ, only to find that it's messing up your optimization for η. In practice, it helps to bounce backward and forward, gradually closing in on good values. Above all, keep in mind that the heuristics I've described are rules of thumb, not rules cast in stone. You should be on the lookout for signs that things aren't working, and be willing to experiment. In particular, this means carefully monitoring your network's behaviour, especially the validation accuracy.
Therearemany,manypaperssettingout(sometimes
contradictory)recommendationsforhowtoproceed.However,
thereareafewparticularlyusefulpapersthatsynthesizeanddistill
outmuchofthislore.YoshuaBengiohasa2012paper*thatgives

*Practicalrecommendationsforgradientbased

somepracticalrecommendationsforusingbackpropagationand

(2012).

trainingofdeeparchitectures,byYoshuaBengio

gradientdescenttotrainneuralnetworks,includingdeepneural
nets.BengiodiscussesmanyissuesinmuchmoredetailthanIhave,
includinghowtodomoresystematichyperparametersearches.
Anothergoodpaperisa1998paper*byYannLeCun,LonBottou,
GenevieveOrrandKlausRobertMller.Boththesepapersappear
inanextremelyuseful2012bookthatcollectsmanytricks
commonlyusedinneuralnets*.Thebookisexpensive,butmanyof
thearticleshavebeenplacedonlinebytheirrespectiveauthors
http://neuralnetworksanddeeplearning.com/chap3.html

*EfficientBackProp,byYannLeCun,Lon
Bottou,GenevieveOrrandKlausRobertMller
(1998)

*NeuralNetworks:TricksoftheTrade,editedby
GrgoireMontavon,GeneviveOrr,andKlaus

79/92

5/31/2016

Neuralnetworksanddeeplearning

with,onepresumes,theblessingofthepublisher,andmaybe

RobertMller.

locatedusingasearchengine.
One thing that becomes clear as you read these articles and, especially, as you engage in your own experiments, is that hyper-parameter optimization is not a problem that is ever completely solved. There's always another trick you can try to improve performance. There is a saying common among writers that books are never finished, only abandoned. The same is also true of neural network optimization: the space of hyper-parameters is so large that one never really finishes optimizing, one only abandons the network to posterity. So your goal should be to develop a workflow that enables you to quickly do a pretty good job on the optimization, while leaving you the flexibility to try more detailed optimizations, if that's important.

The challenge of setting hyper-parameters has led some people to complain that neural networks require a lot of work when compared with other machine learning techniques. I've heard many variations on the following complaint: "Yes, a well-tuned neural network may get the best performance on the problem. On the other hand, I can try a random forest [or SVM or insert your own favorite technique] and it just works. I don't have time to figure out just the right neural network." Of course, from a practical point of view it's good to have easy-to-apply techniques. This is particularly true when you're just getting started on a problem, and it may not be obvious whether machine learning can help solve the problem at all. On the other hand, if getting optimal performance is important, then you may need to try approaches that require more specialist knowledge. While it would be nice if machine learning were always easy, there is no a priori reason it should be trivially simple.

Other techniques

Each technique developed in this chapter is valuable to know in its own right, but that's not the only reason I've explained them. The larger point is to familiarize you with some of the problems which can occur in neural networks, and with a style of analysis which can help overcome those problems. In a sense, we've been learning how to think about neural nets. Over the remainder of this chapter I briefly sketch a handful of other techniques. These sketches are less in-depth than the earlier discussions, but should convey some feeling for the diversity of techniques available for use in neural networks.

Variations on stochastic gradient descent

Stochastic gradient descent by backpropagation has served us well in attacking the MNIST digit classification problem. However, there are many other approaches to optimizing the cost function, and sometimes those other approaches offer performance superior to mini-batch stochastic gradient descent. In this section I sketch two such approaches, the Hessian and momentum techniques.
Hessian technique: To begin our discussion it helps to put neural networks aside for a bit. Instead, we're just going to consider the abstract problem of minimizing a cost function C which is a function of many variables, w = w₁, w₂, …, so C = C(w). By Taylor's theorem, the cost function can be approximated near a point w by

C(w + Δw) = C(w) + ∑_j (∂C/∂w_j) Δw_j + (1/2) ∑_jk Δw_j (∂²C/∂w_j ∂w_k) Δw_k + …

We can rewrite this more compactly as

C(w + Δw) = C(w) + ∇C · Δw + (1/2) Δwᵀ H Δw + …,

where ∇C is the usual gradient vector, and H is a matrix known as the Hessian matrix, whose jk-th entry is ∂²C/∂w_j ∂w_k. Suppose we approximate C by discarding the higher-order terms represented by … above,

C(w + Δw) ≈ C(w) + ∇C · Δw + (1/2) Δwᵀ H Δw.   (105)

Using calculus we can show that the expression on the right-hand side can be minimized* by choosing

Δw = −H⁻¹∇C.

*Strictly speaking, for this to be a minimum, and not merely an extremum, we need to assume that the Hessian matrix is positive definite. Intuitively, this means that the function C looks like a valley locally, not a mountain or a saddle.

Provided (105) is a good approximate expression for the cost function, then we'd expect that moving from the point w to w + Δw = w − H⁻¹∇C should significantly decrease the cost
function. That suggests a possible algorithm for minimizing the cost:

- Choose a starting point, w.
- Update w to a new point w′ = w − H⁻¹∇C, where the Hessian H and ∇C are computed at w.
- Update w′ to a new point w″ = w′ − H′⁻¹∇′C, where the Hessian H′ and ∇′C are computed at w′.
- And so on.

In practice, (105) is only an approximation, and it's better to take smaller steps. We do this by repeatedly changing w by an amount Δw = −ηH⁻¹∇C, where η is known as the learning rate.
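Here's an illustrative numpy sketch of that damped Newton-style update, applied to a toy quadratic cost C(w) = (1/2)wᵀAw − bᵀw, whose gradient Aw − b and Hessian A we can write down exactly (the toy cost, values, and step count are my own choices):

import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])  # positive-definite Hessian
b = np.array([1.0, 1.0])

def grad(w):
    # Gradient of (1/2) w^T A w - b^T w
    return np.dot(A, w) - b

w = np.zeros(2)
eta = 0.5
H_inv = np.linalg.inv(A)   # for this toy cost the Hessian is constant
for step in range(10):
    w = w - eta * np.dot(H_inv, grad(w))   # Δw = -η H⁻¹ ∇C
print(w)   # converges toward the minimizer, A⁻¹ b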
This approach to minimizing a cost function is known as the Hessian technique or Hessian optimization. There are theoretical and empirical results showing that Hessian methods converge on a minimum in fewer steps than standard gradient descent. In particular, by incorporating information about second-order changes in the cost function it's possible for the Hessian approach to avoid many pathologies that can occur in gradient descent. Furthermore, there are versions of the backpropagation algorithm which can be used to compute the Hessian.

If Hessian optimization is so great, why aren't we using it in our neural networks? Unfortunately, while it has many desirable properties, it has one very undesirable property: it's very difficult to apply in practice. Part of the problem is the sheer size of the Hessian matrix. Suppose you have a neural network with 10⁷ weights and biases. Then the corresponding Hessian matrix will contain 10⁷ × 10⁷ = 10¹⁴ entries. That's a lot of entries! And that makes computing H⁻¹∇C extremely difficult in practice. However, that doesn't mean that it's not useful to understand. In fact, there are many variations on gradient descent which are inspired by Hessian optimization, but which avoid the problem with overly large matrices. Let's take a look at one such technique, momentum-based gradient descent.
Momentum-based gradient descent: Intuitively, the advantage Hessian optimization has is that it incorporates not just information about the gradient, but also information about how the gradient is changing. Momentum-based gradient descent is based on a similar intuition, but avoids large matrices of second derivatives. To understand the momentum technique, think back to our original picture of gradient descent, in which we considered a ball rolling down into a valley. At the time, we observed that gradient descent is, despite its name, only loosely similar to a ball falling to the bottom of a valley. The momentum technique modifies gradient descent in two ways that make it more similar to the physical picture. First, it introduces a notion of "velocity" for the parameters we're trying to optimize. The gradient acts to change the velocity, not (directly) the "position", in much the same way as physical forces change the velocity, and only indirectly affect position. Second, the momentum method introduces a kind of friction term, which tends to gradually reduce the velocity.

Let's give a more precise mathematical description. We introduce velocity variables v = v₁, v₂, …, one for each corresponding w_j variable*. Then we replace the gradient descent update rule w → w′ = w − η∇C by

v → v′ = μv − η∇C   (107)
w → w′ = w + v′.   (108)

*In a neural net the w_j variables would, of course, include all weights and biases.

In these equations, μ is a hyper-parameter which controls the amount of damping or friction in the system. To understand the meaning of the equations it's helpful to first consider the case where μ = 1, which corresponds to no friction. When that's the case, inspection of the equations shows that the "force" ∇C is now modifying the velocity, v, and the velocity is controlling the rate of change of w. Intuitively, we build up the velocity by repeatedly adding gradient terms to it. That means that if the gradient is in (roughly) the same direction through several rounds of learning, we can build up quite a bit of steam moving in that direction. Think, for example, of what happens if we're moving straight down a slope:

[Figure: gradient steps moving straight down a slope, building up speed.]
With each step the velocity gets larger down the slope, so we move more and more quickly to the bottom of the valley. This can enable the momentum technique to work much faster than standard gradient descent. Of course, a problem is that once we reach the bottom of the valley we will overshoot. Or, if the gradient should change rapidly, then we could find ourselves moving in the wrong direction. That's the reason for the μ hyper-parameter in (107). I said earlier that μ controls the amount of friction in the system; to be a little more precise, you should think of 1 − μ as the amount of friction in the system. When μ = 1, as we've seen, there is no friction, and the velocity is completely driven by the gradient ∇C. By contrast, when μ = 0 there's a lot of friction, the velocity can't build up, and Equations (107) and (108) reduce to the usual equation for gradient descent, w → w′ = w − η∇C. In practice, using a value of μ intermediate between 0 and 1 can give us much of the benefit of being able to build up speed, but without causing overshooting. We can choose such a value for μ using the held-out validation data, in much the same way as we select η and λ.

I've avoided naming the hyper-parameter μ up to now. The reason is that the standard name for μ is badly chosen: it's called the momentum co-efficient. This is potentially confusing, since μ is not at all the same as the notion of momentum from physics. Rather, it is much more closely related to friction. However, the term momentum co-efficient is widely used, so we will continue to use it.
A nice thing about the momentum technique is that it takes almost no work to modify an implementation of gradient descent to incorporate momentum. We can still use backpropagation to compute the gradients, just as before, and use ideas such as sampling stochastically chosen mini-batches. In this way, we can get some of the advantages of the Hessian technique, using information about how the gradient is changing. But it's done without the disadvantages, and with only minor modifications to our code. In practice, the momentum technique is commonly used, and often speeds up learning.
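As a concrete illustration, here's a sketch of how the weight update inside a method like network2.py's update_mini_batch might change to use momentum (my own sketch, not code from network2.py; nabla_w stands for the gradients computed by backpropagation, as in earlier chapters, and mu is the new momentum co-efficient hyper-parameter):

# Velocities, initialized to zero alongside the weights, e.g.:
# self.velocities = [np.zeros(w.shape) for w in self.weights]

# Plain gradient descent step, for comparison:
# self.weights = [w - (eta/len(mini_batch))*nw
#                 for w, nw in zip(self.weights, nabla_w)]

# Momentum-based step, following Equations (107) and (108):
self.velocities = [mu*v - (eta/len(mini_batch))*nw
                   for v, nw in zip(self.velocities, nabla_w)]
self.weights = [w + v
                for w, v in zip(self.weights, self.velocities)]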

Exercise

- What would go wrong if we used μ > 1 in the momentum technique?
- What would go wrong if we used μ < 0 in the momentum technique?

Problem

- Add momentum-based stochastic gradient descent to network2.py.

Other approaches to minimizing the cost function: Many other approaches to minimizing the cost function have been developed, and there isn't universal agreement on which is the best approach. As you go deeper into neural networks it's worth digging into the other techniques, understanding how they work, their strengths and weaknesses, and how to apply them in practice. A paper I mentioned earlier* introduces and compares several of these techniques, including conjugate gradient descent and the BFGS method (see also the closely related limited-memory BFGS method, known as L-BFGS). Another technique which has recently shown promising results* is Nesterov's accelerated gradient technique, which improves on the momentum technique. However, for many problems, plain stochastic gradient descent works well, especially if momentum is used, and so we'll stick to stochastic gradient descent through the remainder of this book.

*Efficient BackProp, by Yann LeCun, Léon Bottou, Genevieve Orr and Klaus-Robert Müller (1998).

*See, for example, On the importance of initialization and momentum in deep learning, by Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton (2012).

Other models of artificial neuron

Up to now we've built our neural networks using sigmoid neurons. In principle, a network built from sigmoid neurons can compute any function. In practice, however, networks built using other model neurons sometimes outperform sigmoid networks. Depending on the application, networks based on such alternate models may learn faster, generalize better to test data, or perhaps do both. Let me mention a couple of alternate model neurons, to give you the flavor of some variations in common use.

Perhaps the simplest variation is the tanh (pronounced "tanch") neuron, which replaces the sigmoid function by the hyperbolic tangent function. The output of a tanh neuron with input x, weight vector w, and bias b is given by

tanh(w · x + b),

where tanh is, of course, the hyperbolic tangent function. It turns out that this is very closely related to the sigmoid neuron. To see this, recall that the tanh function is defined by

tanh(z) ≡ (eᶻ − e⁻ᶻ) / (eᶻ + e⁻ᶻ).

With a little algebra it can easily be verified that

σ(z) = (1 + tanh(z/2)) / 2,   (111)

that is, tanh is just a rescaled version of the sigmoid function. We can also see graphically that the tanh function has the same shape as the sigmoid function,
[Figure: graph of the tanh function, which ranges from −1 to 1 and has the same sigmoidal shape as σ.]
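Incidentally, the relationship in Equation (111) is easy to check numerically before proving it algebraically; here's a quick sketch (illustrative only):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)
# sigma(z) should equal (1 + tanh(z/2)) / 2 everywhere:
print(np.allclose(sigmoid(z), (1 + np.tanh(z/2)) / 2))  # True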

One difference between tanh neurons and sigmoid neurons is that the output from tanh neurons ranges from −1 to 1, not 0 to 1. This means that if you're going to build a network based on tanh neurons you may need to normalize your outputs (and, depending on the details of the application, possibly your inputs) a little differently than in sigmoid networks.

Similar to sigmoid neurons, a network of tanh neurons can, in principle, compute any function* mapping inputs to the range −1 to 1. Furthermore, ideas such as backpropagation and stochastic gradient descent are as easily applied to a network of tanh neurons as to a network of sigmoid neurons.

*There are some technical caveats to this statement for both tanh and sigmoid neurons, as well as for the rectified linear neurons discussed below. However, informally it's usually fine to think of neural networks as being able to approximate any function to arbitrary accuracy.

Exercise

- Prove the identity in Equation (111).
Which type of neuron should you use in your networks, the tanh or sigmoid? A priori the answer is not obvious, to put it mildly! However, there are theoretical arguments and some empirical evidence to suggest that the tanh sometimes performs better*. Let me briefly give you the flavor of one of the theoretical arguments for tanh neurons. Suppose we're using sigmoid neurons, so all activations in our network are positive. Let's consider the weights w^{l+1}_{jk} input to the j-th neuron in the (l+1)-th layer. The rules for backpropagation (see here) tell us that the associated gradient will be a^l_k δ^{l+1}_j. Because the activations are positive the sign of this gradient will be the same as the sign of δ^{l+1}_j. What this means is that if δ^{l+1}_j is positive then all the weights w^{l+1}_{jk} will decrease during gradient descent, while if δ^{l+1}_j is negative then all the weights w^{l+1}_{jk} will increase during gradient descent. In other words, all weights to the same neuron must either increase together or decrease together. That's a problem, since some of the weights may need to increase while others need to decrease. That can only happen if some of the input activations have different signs. That suggests replacing the sigmoid by an activation function, such as tanh, which allows both positive and negative activations. Indeed, because tanh is symmetric about zero, tanh(−z) = −tanh(z), we might even expect that, roughly speaking, the activations in hidden layers would be equally balanced between positive and negative. That would help ensure that there is no systematic bias for the weight updates to be one way or the other.

*See, for example, Efficient BackProp, by Yann LeCun, Léon Bottou, Genevieve Orr and Klaus-Robert Müller (1998), and Understanding the difficulty of training deep feedforward networks, by Xavier Glorot and Yoshua Bengio (2010).
How seriously should we take this argument? While the argument is suggestive, it's a heuristic, not a rigorous proof that tanh neurons outperform sigmoid neurons. Perhaps there are other properties of the sigmoid neuron which compensate for this problem? Indeed, for many tasks the tanh is found empirically to provide only a small or no improvement in performance over sigmoid neurons. Unfortunately, we don't yet have hard-and-fast rules to know which neuron types will learn fastest, or give the best generalization performance, for any particular application.
Another variation on the sigmoid neuron is the rectified linear neuron or rectified linear unit. The output of a rectified linear unit with input x, weight vector w, and bias b is given by

max(0, w · x + b).

Graphically, the rectifying function max(0, z) looks like this:

[Figure: graph of max(0, z), which is zero for z < 0 and rises linearly for z ≥ 0.]

Obviously such neurons are quite different from both sigmoid and tanh neurons. However, like the sigmoid and tanh neurons, rectified linear units can be used to compute any function, and they can be trained using ideas such as backpropagation and stochastic gradient descent.

When should you use rectified linear units instead of sigmoid or tanh neurons? Some recent work on image recognition* has found considerable benefit in using rectified linear units through much of the network. However, as with tanh neurons, we do not yet have a really deep understanding of when, exactly, rectified linear units are preferable, nor why. To give you the flavor of some of the issues, recall that sigmoid neurons stop learning when they saturate, i.e., when their output is near either 0 or 1. As we've seen repeatedly in this chapter, the problem is that σ′ terms reduce the gradient, and that slows down learning. Tanh neurons suffer from a similar problem when they saturate. By contrast, increasing the weighted input to a rectified linear unit will never cause it to saturate, and so there is no corresponding learning slowdown. On the other hand, when the weighted input to a rectified linear unit is negative, the gradient vanishes, and so the neuron stops learning entirely. These are just two of the many issues that make it non-trivial to understand when and why rectified linear units perform better than sigmoid or tanh neurons.

*See, for example, What is the Best Multi-Stage Architecture for Object Recognition?, by Kevin Jarrett, Koray Kavukcuoglu, Marc'Aurelio Ranzato and Yann LeCun (2009), Deep Sparse Rectifier Neural Networks, by Xavier Glorot, Antoine Bordes, and Yoshua Bengio (2011), and ImageNet Classification with Deep Convolutional Neural Networks, by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton (2012). Note that these papers fill in important details about how to set up the output layer, cost function, and regularization in networks using rectified linear units. I've glossed over all these details in this brief account. The papers also discuss in more detail the benefits and drawbacks of using rectified linear units. Another informative paper is Rectified Linear Units Improve Restricted Boltzmann Machines, by Vinod Nair and Geoffrey Hinton (2010), which demonstrates the benefits of using rectified linear units in a somewhat different approach to neural networks.
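A tiny sketch of my own makes the saturation contrast concrete: the sigmoid's derivative σ′(z) = σ(z)(1 − σ(z)) shrinks toward zero for large |z|, while the rectified linear unit's derivative stays at 1 for any positive weighted input, and is 0 for negative input:

import math

def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def relu_prime(z):
    return 1.0 if z > 0 else 0.0

for z in [-10.0, -1.0, 1.0, 10.0]:
    # The sigmoid gradient nearly vanishes at |z| = 10;
    # the ReLU gradient doesn't for z > 0, but is 0 for z < 0.
    print(z, sigmoid_prime(z), relu_prime(z))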
I've painted a picture of uncertainty here, stressing that we do not yet have a solid theory of how activation functions should be chosen. Indeed, the problem is harder even than I have described, for there are infinitely many possible activation functions. Which is the best for any given problem? Which will result in a network which learns fastest? Which will give the highest test accuracies? I am surprised how little really deep and systematic investigation has been done of these questions. Ideally, we'd have a theory which tells us, in detail, how to choose (and perhaps modify on the fly) our activation functions. On the other hand, we shouldn't let the lack of a full theory stop us! We have powerful tools already at hand, and can make a lot of progress with those tools. Through the remainder of this book I'll continue to use sigmoid neurons as our go-to neuron, since they're powerful and provide concrete illustrations of the core ideas about neural nets. But keep in the back of your mind that these same ideas can be applied to other types of neuron, and that there are sometimes advantages in doing so.

On stories in neural networks

Question: How do you approach utilizing and researching machine learning techniques that are supported almost entirely empirically, as opposed to mathematically? Also in what situations have you noticed some of these techniques fail?

Answer: You have to realize that our theoretical tools are very weak. Sometimes, we have good mathematical intuitions for why a particular technique should work. Sometimes our intuition ends up being wrong [...] The questions become: how well does my method work on this particular problem, and how large is the set of problems on which it works well.

- Question and answer with neural networks researcher Yann LeCun
Once, attending a conference on the foundations of quantum mechanics, I noticed what seemed to me a most curious verbal habit: when talks finished, questions from the audience often began with "I'm very sympathetic to your point of view, but [...]". Quantum foundations was not my usual field, and I noticed this style of questioning because at other scientific conferences I'd rarely or never heard a questioner express their sympathy for the point of view of the speaker. At the time, I thought the prevalence of the question suggested that little genuine progress was being made in quantum foundations, and people were merely spinning their wheels. Later, I realized that assessment was too harsh. The speakers were wrestling with some of the hardest problems human minds have ever confronted. Of course progress was slow! But there was still value in hearing updates on how people were thinking, even if they didn't always have unarguable new progress to report.

You may have noticed a verbal tic similar to "I'm very sympathetic [...]" in the current book. To explain what we're seeing I've often fallen back on saying "Heuristically, [...]", or "Roughly speaking, [...]", following up with a story to explain some phenomenon or other. These stories are plausible, but the empirical evidence I've presented has often been pretty thin. If you look through the research literature you'll see that stories in a similar style appear in many research papers on neural nets, often with thin supporting evidence. What should we think about such stories?
In many parts of science, especially those parts that deal with simple phenomena, it's possible to obtain very solid, very reliable evidence for quite general hypotheses. But in neural networks there are large numbers of parameters and hyper-parameters, and extremely complex interactions between them. In such extraordinarily complex systems it's exceedingly difficult to establish reliable general statements. Understanding neural networks in their full generality is a problem that, like quantum foundations, tests the limits of the human mind. Instead, we often make do with evidence for or against a few specific instances of a general statement. As a result those statements sometimes later need to be modified or abandoned, when new evidence comes to light.
One way of viewing this situation is that any heuristic story about neural networks carries with it an implied challenge. For example, consider the statement I quoted earlier, explaining why dropout works*: "This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons." This is a rich, provocative statement, and one could build a fruitful research program entirely around unpacking the statement, figuring out what in it is true, what is false, what needs variation and refinement. Indeed, there is now a small industry of researchers who are investigating dropout (and many variations), trying to understand how it works, and what its limits are. And so it goes with many of the heuristics we've discussed. Each heuristic is not just a (potential) explanation, it's also a challenge to investigate and understand in more detail.

*From ImageNet Classification with Deep Convolutional Neural Networks by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton (2012).
Of course, there is not time for any single person to investigate all these heuristic explanations in depth. It's going to take decades (or longer) for the community of neural networks researchers to develop a really powerful, evidence-based theory of how neural networks learn. Does this mean you should reject heuristic explanations as unrigorous, and not sufficiently evidence-based? No! In fact, we need such heuristics to inspire and guide our thinking. It's like the great age of exploration: the early explorers sometimes explored (and made new discoveries) on the basis of beliefs which were wrong in important ways. Later, those mistakes were corrected as we filled in our knowledge of geography. When you understand something poorly, as the explorers understood geography, and as we understand neural nets today, it's more important to explore boldly than it is to be rigorously correct in every step of your thinking. And so you should view these stories as a useful guide to how to think about neural nets, while retaining a healthy awareness of the limitations of such stories, and carefully keeping track of just how strong the evidence is for any given line of reasoning. Put another way, we need good stories to help motivate and inspire us, and rigorous in-depth investigation in order to uncover the real facts of the matter.
In academic work, please cite this book as: Michael A. Nielsen, "Neural Networks and Deep Learning", Determination Press, 2015

Last update: Fri Jan 22 14:09:50 2016

This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License. This means you're free to copy, share, and build on this book, but not to sell it. If you're interested in commercial use, please contact me.
