Data Science Stack Exchange
Why do cost functions use the square error?
I'm just getting started with some machine learning, and until now I have been dealing with linear regression over one variable.

I have learnt that there is a hypothesis, which is:

$$h_\theta(x) = \theta_0 + \theta_1 x$$

To measure the error we take the difference $h_\theta(x^{(i)}) - y^{(i)}$ for all $i$ from 1 to $m$. Hence we calculate the sum over this difference and then calculate the average by multiplying the sum by $\frac{1}{m}$. So far, so good. This would result in:

$$\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)$$

But this is not what has been suggested. Instead the course suggests to take the square value of the difference, and to multiply by $\frac{1}{2m}$. So the formula is:

$$\frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

Why is that? Why do we use the square function here, and why do we multiply by $\frac{1}{2m}$ instead of $\frac{1}{m}$?
machine-learning linear-regression
asked Feb 10 at 21:52
Golo Roden
133 4
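To make the difference between the two candidate cost functions concrete, here is a small Python sketch (the toy data and helper names are illustrative, not from the course). It shows why the unsquared average would be a poor objective: errors of opposite sign cancel, and the cost can be driven below zero by arbitrarily bad parameters.

```python
# Comparing the unsquared cost (1/m) * sum(h(x_i) - y_i) with the squared
# cost (1/(2m)) * sum((h(x_i) - y_i)**2) on toy data.

def unsquared_cost(theta0, theta1, xs, ys):
    m = len(xs)
    return sum((theta0 + theta1 * x) - y for x, y in zip(xs, ys)) / m

def squared_cost(theta0, theta1, xs, ys):
    m = len(xs)
    return sum(((theta0 + theta1 * x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # exactly y = 2x

# The true fit (theta0=0, theta1=2) gives zero cost under both definitions.
print(unsquared_cost(0.0, 2.0, xs, ys))  # 0.0
print(squared_cost(0.0, 2.0, xs, ys))    # 0.0

# But the unsquared cost *rewards* wildly wrong parameters: pushing theta0
# toward -infinity drives it below zero, while the squared cost correctly
# penalizes the same parameters.
print(unsquared_cost(-100.0, 2.0, xs, ys))  # -100.0 (lower than the true fit!)
print(squared_cost(-100.0, 2.0, xs, ys))    # 5000.0 (higher, as it should be)
```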
2 Answers
Your loss function here would not work because it incentivizes putting both $\theta_0$ and $\theta_1$ to $-\infty$.
One justification for using the squared error is that it relates to Gaussian noise. Suppose the residual, which measures the error, is the sum of many small independent noise terms. By the Central Limit Theorem, it should be distributed normally. So, we want to pick $\theta$ where this noise distribution (the things your model cannot explain) has the smallest variance. This corresponds to the least squares loss.
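A quick way to see this connection numerically (a sketch with synthetic data and a simple grid search, all assumed for illustration): build residuals as sums of many small independent terms, then check that the slope minimizing the squared error is exactly the slope maximizing the Gaussian log-likelihood.

```python
# Least squares as Gaussian maximum likelihood, on synthetic data.
import math
import random

random.seed(0)
xs = [i / 10 for i in range(1, 51)]
# y = 3x plus noise built from many small independent terms (CLT-style).
ys = [3 * x + sum(random.uniform(-0.05, 0.05) for _ in range(20)) for x in xs]

def sse(theta):
    # Sum of squared errors for the model y = theta * x.
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys))

def gaussian_loglik(theta, sigma=0.1):
    # log N(y | theta*x, sigma^2), summed over the dataset.
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (y - theta * x) ** 2 / (2 * sigma ** 2)
               for x, y in zip(xs, ys))

grid = [2.5 + k * 0.001 for k in range(1001)]  # candidate slopes in [2.5, 3.5]
best_sse = min(grid, key=sse)
best_mle = max(grid, key=gaussian_loglik)
print(best_sse == best_mle)  # the two criteria pick the same slope
```

The agreement is not a coincidence: the log-likelihood is a negative constant times the squared error plus a constant, so its maximizer is the squared error's minimizer.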
Regarding your second question, the 1/2 does not matter, and actually, the $m$ doesn't matter either :). The optimal value of $\theta$ would remain the same in both cases. The 1/2 is put in so that when you take the derivative, the expression is prettier, because the 2 cancels out the 2 from the square term. The $m$ is useful if you solve this problem with gradient descent. Then your gradient is the sum of $m$ terms divided by $m$, so it is like an average over your points. This lets you handle all sizes of datasets, so your gradients won't overflow as you scale up the dataset.
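Both points can be checked directly (a sketch with toy data; the helper names are my own): the derivative of $\frac{1}{2m}\sum(\theta x - y)^2$ comes out as $\frac{1}{m}\sum(\theta x - y)\,x$ with no stray factor of 2, and rescaling the cost by any positive constant leaves the minimizer unchanged.

```python
# The 1/2 cancels the 2 from the square; positive rescaling preserves argmin.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.0]

def cost(theta):
    # J(theta) = (1/(2m)) * sum((theta*x - y)^2), with theta0 fixed at 0.
    m = len(xs)
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def analytic_grad(theta):
    # The "pretty" derivative: (1/m) * sum((theta*x - y) * x).
    m = len(xs)
    return sum((theta * x - y) * x for x, y in zip(xs, ys)) / m

def numeric_grad(theta, h=1e-6):
    # Central finite difference to check the analytic form.
    return (cost(theta + h) - cost(theta - h)) / (2 * h)

print(abs(analytic_grad(1.5) - numeric_grad(1.5)) < 1e-6)  # the forms agree

# Rescaling the cost (dropping the 1/2 and the 1/m, i.e. multiplying by 2m)
# changes the gradient's magnitude but not the minimizer:
thetas = [k * 0.001 for k in range(3000)]
argmin_half = min(thetas, key=cost)
argmin_scaled = min(thetas, key=lambda t: 2 * len(xs) * cost(t))
print(argmin_half == argmin_scaled)  # same optimal theta
```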
It's also used to maintain consistency with future equations where you'll add regularization terms. The slides I linked here don't do it, but you can see that if you do, the regularization parameter will not depend on the dataset size $m$, so it'll be more interpretable.
edited Feb 11 at 2:04 · answered Feb 11 at 1:18
Harsh
201 6

(http://datascience.stackexchange.com/questions/10188/why-do-cost-functions-use-the-square-error, retrieved 9/23/2016)
The 1/2 coefficient is merely for convenience; it makes the derivative, which is the function actually being optimized, look nicer. The 1/m is more fundamental; it suggests that we are interested in the mean squared error. This allows you to make fair comparisons when changing the sample size, and prevents overflow. So-called "stochastic" optimizers use a subset of the dataset ($m' < m$). When you introduce a regularizer (an additive term to the objective function), using the 1/m factor allows you to use the same coefficient for the regularizer regardless of the sample size.

As for the question of why the square and not simply the difference: don't you want underestimates to be penalized similarly to overestimates? Squaring eliminates the effect of the sign of the error. Taking the absolute value ($L_1$ norm) does too, but its derivative is undefined at the origin, so it requires more sophistication to use. The $L_1$ norm has its uses, so keep it in mind, and perhaps ask the teacher if (s)he's going to cover it.
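The contrast between the squared and absolute penalties can be sketched in a few lines (toy values assumed): both treat over- and under-estimates symmetrically, but the $L_1$ penalty's slope jumps from $-1$ to $+1$ at zero error, with no well-defined value exactly at the origin, which is why optimizers need extra care (e.g. subgradients) to use it.

```python
# L2 vs L1 penalties: both are symmetric in the sign of the error, but only
# L2 has a smooth derivative at zero error.

def l2(err):
    return err ** 2

def l1(err):
    return abs(err)

# Symmetric: an overestimate of +0.5 and an underestimate of -0.5 cost the same.
print(l2(0.5) == l2(-0.5))  # True
print(l1(0.5) == l1(-0.5))  # True

def numeric_slope(f, e, h=1e-6):
    # Central finite difference approximation of the penalty's slope at error e.
    return (f(e + h) - f(e - h)) / (2 * h)

# L2's slope passes smoothly through 0 at zero error; L1's slope is close to
# +1 just right of zero and -1 just left of it, so it has no single value at 0.
print(numeric_slope(l2, 0.0))
print(numeric_slope(l1, 0.001), numeric_slope(l1, -0.001))
```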
edited Feb 11 at 0:00 · answered Feb 10 at 23:28
Emre
5,038 1 9 21
That the $L_2$ norm arises from an inner product makes a huge amount of machinery available for $L_2$ which is not available for other norms. - Steven Gubkin Feb 11 at 5:59