Data Science Stack Exchange
Why do cost functions use the square error?
I'm just getting started with some machine learning, and until now I have been dealing with linear regression over one variable.

I have learnt that there is a hypothesis, which is:

$$h_\theta(x) = \theta_0 + \theta_1 x$$

To measure the error we take the difference $h_\theta(x^{(i)}) - y^{(i)}$ for all $i$ from 1 to $m$. Hence we calculate the sum over this difference and then calculate the average by multiplying the sum by $\frac{1}{m}$. So far, so good. This would result in:

$$\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)$$

But this is not what has been suggested. Instead the course suggests to take the square value of the difference, and to multiply by $\frac{1}{2m}$. So the formula is:

$$\frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

Why is that? Why do we use the square function here, and why do we multiply by $\frac{1}{2m}$ instead of $\frac{1}{m}$?
machine-learning linear-regression
asked Feb 10 at 21:52
Golo Roden
133 4
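To make the difference between the two candidate cost functions concrete, here is a small Python sketch (the toy data and helper names are illustrative, not from the course). It shows why the unsquared average would be a poor objective: errors of opposite sign cancel, and the cost can be driven below zero by arbitrarily bad parameters.

```python
# Comparing the unsquared cost (1/m) * sum(h(x_i) - y_i) with the squared
# cost (1/(2m)) * sum((h(x_i) - y_i)**2) on toy data.

def unsquared_cost(theta0, theta1, xs, ys):
    m = len(xs)
    return sum((theta0 + theta1 * x) - y for x, y in zip(xs, ys)) / m

def squared_cost(theta0, theta1, xs, ys):
    m = len(xs)
    return sum(((theta0 + theta1 * x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # exactly y = 2x

# The true fit (theta0=0, theta1=2) gives zero cost under both definitions.
print(unsquared_cost(0.0, 2.0, xs, ys))  # 0.0
print(squared_cost(0.0, 2.0, xs, ys))    # 0.0

# But the unsquared cost *rewards* wildly wrong parameters: pushing theta0
# toward -infinity drives it below zero, while the squared cost correctly
# penalizes the same parameters.
print(unsquared_cost(-100.0, 2.0, xs, ys))  # -100.0 (lower than the true fit!)
print(squared_cost(-100.0, 2.0, xs, ys))    # 5000.0 (higher, as it should be)
```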
2 Answers
Your loss function here would not work because it incentivizes putting both $\theta_0$ and $\theta_1$ to $-\infty$.
One justification for using the squared error is that it relates to Gaussian noise. Suppose the residual, which measures the error, is the sum of many small independent noise terms. By the Central Limit Theorem, it should be distributed normally. So, we want to pick $\theta$ where this noise distribution (the things your model cannot explain) has the smallest variance. This corresponds to the least squares loss.
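A quick way to see this connection numerically (a sketch with synthetic data and a simple grid search, all assumed for illustration): build residuals as sums of many small independent terms, then check that the slope minimizing the squared error is exactly the slope maximizing the Gaussian log-likelihood.

```python
# Least squares as Gaussian maximum likelihood, on synthetic data.
import math
import random

random.seed(0)
xs = [i / 10 for i in range(1, 51)]
# y = 3x plus noise built from many small independent terms (CLT-style).
ys = [3 * x + sum(random.uniform(-0.05, 0.05) for _ in range(20)) for x in xs]

def sse(theta):
    # Sum of squared errors for the model y = theta * x.
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys))

def gaussian_loglik(theta, sigma=0.1):
    # log N(y | theta*x, sigma^2), summed over the dataset.
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (y - theta * x) ** 2 / (2 * sigma ** 2)
               for x, y in zip(xs, ys))

grid = [2.5 + k * 0.001 for k in range(1001)]  # candidate slopes in [2.5, 3.5]
best_sse = min(grid, key=sse)
best_mle = max(grid, key=gaussian_loglik)
print(best_sse == best_mle)  # the two criteria pick the same slope
```

The agreement is not a coincidence: the log-likelihood is a negative constant times the squared error plus a constant, so its maximizer is the squared error's minimizer.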
Regarding your second question, the 1/2 does not matter, and actually, the $m$ doesn't matter either :). The optimal value of $\theta$ would remain the same in both cases. The 1/2 is put in so that when you take the derivative, the expression is prettier, because the 2 cancels out the 2 from the square term. The $m$ is useful if you solve this problem with gradient descent. Then your gradient is the sum of $m$ terms divided by $m$, so it is like an average over your points. This lets you handle all sizes of datasets, so your gradients won't overflow as you scale up the dataset.
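Both points can be checked directly (a sketch with toy data; the helper names are my own): the derivative of $\frac{1}{2m}\sum(\theta x - y)^2$ comes out as $\frac{1}{m}\sum(\theta x - y)\,x$ with no stray factor of 2, and rescaling the cost by any positive constant leaves the minimizer unchanged.

```python
# The 1/2 cancels the 2 from the square; positive rescaling preserves argmin.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.0]

def cost(theta):
    # J(theta) = (1/(2m)) * sum((theta*x - y)^2), with theta0 fixed at 0.
    m = len(xs)
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def analytic_grad(theta):
    # The "pretty" derivative: (1/m) * sum((theta*x - y) * x).
    m = len(xs)
    return sum((theta * x - y) * x for x, y in zip(xs, ys)) / m

def numeric_grad(theta, h=1e-6):
    # Central finite difference to check the analytic form.
    return (cost(theta + h) - cost(theta - h)) / (2 * h)

print(abs(analytic_grad(1.5) - numeric_grad(1.5)) < 1e-6)  # the forms agree

# Rescaling the cost (dropping the 1/2 and the 1/m, i.e. multiplying by 2m)
# changes the gradient's magnitude but not the minimizer:
thetas = [k * 0.001 for k in range(3000)]
argmin_half = min(thetas, key=cost)
argmin_scaled = min(thetas, key=lambda t: 2 * len(xs) * cost(t))
print(argmin_half == argmin_scaled)  # same optimal theta
```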
It's also used to maintain consistency with future equations where you'll add regularization terms. The slides I linked here don't do it, but you can see that if you do, the regularization parameter will not depend on the dataset size $m$, so it'll be more interpretable.
edited Feb 11 at 2:04 · answered Feb 11 at 1:18
Harsh
201 6

(http://datascience.stackexchange.com/questions/10188/why-do-cost-functions-use-the-square-error, retrieved 9/23/2016)
The 1/2 coefficient is merely for convenience; it makes the derivative, which is the function actually being optimized, look nicer. The 1/m is more fundamental; it suggests that we are interested in the mean squared error. This allows you to make fair comparisons when changing the sample size, and prevents overflow. So-called "stochastic" optimizers use a subset of the dataset ($m' < m$). When you introduce a regularizer (an additive term to the objective function), using the 1/m factor allows you to use the same coefficient for the regularizer regardless of the sample size.

As for the question of why the square and not simply the difference: don't you want underestimates to be penalized similarly to overestimates? Squaring eliminates the effect of the sign of the error. Taking the absolute value ($L_1$ norm) does too, but its derivative is undefined at the origin, so it requires more sophistication to use. The $L_1$ norm has its uses, so keep it in mind, and perhaps ask the teacher if (s)he's going to cover it.
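The contrast between the squared and absolute penalties can be sketched in a few lines (toy values assumed): both treat over- and under-estimates symmetrically, but the $L_1$ penalty's slope jumps from $-1$ to $+1$ at zero error, with no well-defined value exactly at the origin, which is why optimizers need extra care (e.g. subgradients) to use it.

```python
# L2 vs L1 penalties: both are symmetric in the sign of the error, but only
# L2 has a smooth derivative at zero error.

def l2(err):
    return err ** 2

def l1(err):
    return abs(err)

# Symmetric: an overestimate of +0.5 and an underestimate of -0.5 cost the same.
print(l2(0.5) == l2(-0.5))  # True
print(l1(0.5) == l1(-0.5))  # True

def numeric_slope(f, e, h=1e-6):
    # Central finite difference approximation of the penalty's slope at error e.
    return (f(e + h) - f(e - h)) / (2 * h)

# L2's slope passes smoothly through 0 at zero error; L1's slope is close to
# +1 just right of zero and -1 just left of it, so it has no single value at 0.
print(numeric_slope(l2, 0.0))
print(numeric_slope(l1, 0.001), numeric_slope(l1, -0.001))
```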
edited Feb 11 at 0:00 · answered Feb 10 at 23:28
Emre
5,038 1 9 21
That the $L_2$ norm arises from an inner product makes a huge amount of machinery available for $L_2$ which is not available for other norms. - Steven Gubkin Feb 11 at 5:59