Why do cost functions use the square error?

I'm just getting started with some machine learning, and until now I have been dealing with linear regression over one variable.

I have learnt that there is a hypothesis, which is:

$$h_\theta(x) = \theta_0 + \theta_1 x$$

To find out good values for the parameters $\theta_0$ and $\theta_1$ we want to minimize the difference between the calculated result and the actual result of our test data. So we subtract

$$h_\theta(x^{(i)}) - y^{(i)}$$

for all $i$ from $1$ to $m$. Hence we calculate the sum over this difference and then calculate the average by multiplying the sum by $\frac{1}{m}$. So far, so good. This would result in:

$$\frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$$

But this is not what has been suggested. Instead the course suggests to take the square value of the difference, and to multiply by $\frac{1}{2m}$. So the formula is:

$$\frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

Why is that? Why do we use the square function here, and why do we multiply by $\frac{1}{2m}$ instead of $\frac{1}{m}$?
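For reference, here is a minimal sketch (with made-up toy data) checking numerically that the $\frac{1}{m}$ and $\frac{1}{2m}$ versions of the squared cost are minimized by the same parameters:

```python
import numpy as np

# Made-up toy data for a one-variable linear regression.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
m = len(x)

def cost(theta0, theta1, scale):
    """Sum of squared residuals, scaled by 1/m or 1/(2m)."""
    residuals = (theta0 + theta1 * x) - y
    return scale * np.sum(residuals ** 2)

# Grid-search both variants: the minimizing (theta0, theta1) pair is
# identical; only the cost values differ by the constant factor 2.
grid = [(t0, t1) for t0 in np.linspace(-1, 1, 41)
                 for t1 in np.linspace(0, 3, 61)]
best_m  = min(grid, key=lambda t: cost(*t, scale=1 / m))
best_2m = min(grid, key=lambda t: cost(*t, scale=1 / (2 * m)))
print(best_m == best_2m)  # True: a constant factor doesn't move the minimum
```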

machine-learning  linear-regression

asked Feb 10 at 21:52 by Golo Roden

2 Answers

Your loss function here would not work because it incentivizes putting both $\theta_0$ and $\theta_1$ to $-\infty$.


Let's call $r(x, y) = h_\theta(x) - y$ the residual (as is often done). You're asking to make it as small as possible, i.e. $h$ as negative as possible, since $y$, the ground truth, is just a constant. The squared error sidesteps this issue because it forces $h(x)$ and $y$ to match: $(u - v)^2$ is minimized when $u = v$, if possible, and is always $> 0$ otherwise, because it's a square.
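A minimal sketch of that failure mode, with toy data made up for illustration:

```python
import numpy as np

# Toy data (made up for illustration): a perfect fit exists at
# theta0 = 0, theta1 = 1.
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

def mean_residual(theta0, theta1):
    """The unsquared cost from the question: mean of h(x) - y."""
    return np.mean((theta0 + theta1 * x) - y)

def mean_squared(theta0, theta1):
    return np.mean(((theta0 + theta1 * x) - y) ** 2)

# Driving theta0 toward -infinity makes the unsquared cost arbitrarily
# negative, even though the fit is terrible; the squared cost blows up.
for theta0 in [0.0, -10.0, -1000.0]:
    print(theta0, mean_residual(theta0, 1.0), mean_squared(theta0, 1.0))
```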

One justification for using the squared error is that it relates to Gaussian noise. Suppose the residual, which measures the error, is the sum of many small independent noise terms. By the Central Limit Theorem, it should be distributed normally. So we want to pick $\theta$ where this noise distribution (the things your model cannot explain) has the smallest variance. This corresponds to the least squares loss.
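To spell out that step: if $y^{(i)} = h_\theta(x^{(i)}) + \epsilon^{(i)}$ with $\epsilon^{(i)} \sim N(0, \sigma^2)$ independent, then maximizing the likelihood of the data is the same as minimizing the sum of squared residuals:

$$\arg\max_\theta \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{\left(y^{(i)} - h_\theta(x^{(i)})\right)^2}{2\sigma^2}\right) = \arg\min_\theta \sum_{i=1}^{m} \left(y^{(i)} - h_\theta(x^{(i)})\right)^2.$$

Taking the log turns the product into a sum and drops the constants that don't depend on $\theta$.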

Let $R$ be a random variable that follows a Gaussian distribution $N(0, \sigma)$. Then the variance of $R$ is $E[R^2] = \sigma^2$. See here.

Regarding your second question: the 1/2 does not matter, and actually the $m$ doesn't matter either :). The optimal value of $\theta$ would remain the same in both cases. The 1/2 is put in so that when you take the derivative, the expression is prettier, because the 2 cancels out the 2 from the square term. The $m$ is useful if you solve this problem with gradient descent: then your gradient is the sum of $m$ terms divided by $m$, so it is like an average over your points. This lets you handle all sizes of datasets, so your gradients won't overflow if you scale up the dataset.
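Concretely, with $J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$, the derivative is

$$\frac{\partial J}{\partial \theta_1} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x^{(i)},$$

and similarly for $\theta_0$ without the trailing $x^{(i)}$: the 2 from the square cancels the 1/2, and the $\frac{1}{m}$ makes the gradient an average over your points.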

It's also used to maintain consistency with future equations where you'll add regularization terms. The slides I linked here don't do it, but you can see that if you do, the regularization parameter $\lambda$ will not depend on the dataset size $m$, so it'll be more interpretable.
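For example, a ridge-style regularized objective (a standard form, not taken from the linked slides) would be

$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda \sum_{j} \theta_j^2\right],$$

where the same $\lambda$ works across dataset sizes because the whole bracket is divided by $m$.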

edited Feb 11 at 2:04, answered Feb 11 at 1:18 by Harsh

The 1/2 coefficient is merely for convenience; it makes the derivative, which is the function actually being optimized, look nicer. The 1/m is more fundamental: it suggests that we are interested in the mean squared error. This allows you to make fair comparisons when changing the sample size, and prevents overflow. So-called "stochastic" optimizers use a subset of the dataset ($m' < m$). When you introduce a regularizer (an additive term to the objective function), using the 1/m factor allows you to use the same coefficient for the regularizer regardless of the sample size.
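A minimal sketch of that point, with hypothetical data: because each cost is averaged over its own sample size, the full-batch and minibatch costs land on the same scale and can be compared directly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: y is roughly 2x plus noise.
m = 10_000
x = rng.uniform(0, 1, size=m)
y = 2 * x + rng.normal(0, 0.1, size=m)

def mse(theta1, xs, ys):
    """Mean squared error: the 1/len(xs) makes it sample-size independent."""
    return np.mean((theta1 * xs - ys) ** 2)

# A "stochastic" estimate on a subset m' < m is on the same scale as the
# full-data cost, so the two are directly comparable.
idx = rng.choice(m, size=100, replace=False)
print(mse(2.0, x, y))            # full batch
print(mse(2.0, x[idx], y[idx]))  # minibatch estimate, roughly equal
```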

As for the question of why the square and not simply the difference: don't you want underestimates to be penalized similarly to overestimates? Squaring eliminates the effect of the sign of the error. Taking the absolute value (L1 norm) does too, but its derivative is undefined at the origin, so it requires more sophistication to use. The L1 norm has its uses, so keep it in mind, and perhaps ask the teacher if (s)he's going to cover it.
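A small sketch of that difference: the gradient of the squared loss passes smoothly through zero and shrinks with the residual, while the gradient of the absolute loss jumps from $-1$ to $+1$ at the origin (NumPy's `sign` arbitrarily returns 0 there).

```python
import numpy as np

r = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])  # residuals h(x) - y

# d/dr of r^2 is 2r: smooth, and it shrinks as the residual shrinks.
grad_l2 = 2 * r

# d/dr of |r| is sign(r): constant magnitude, undefined at r = 0
# (np.sign picks 0 there, one of many valid subgradients).
grad_l1 = np.sign(r)

print(grad_l2)  # [-4. -1.  0.  1.  4.]
print(grad_l1)  # [-1. -1.  0.  1.  1.]
```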

edited Feb 11 at 0:00, answered Feb 10 at 23:28 by Emre

In addition to differentiability, the $L^2$ norm is unique among the $L^p$ norms in that it is a Hilbert space. The fact that the norm arises from an inner product makes a huge amount of machinery available for $L^2$ which is not available for other norms.
Steven Gubkin, Feb 11 at 5:59
