Statistical Inference
Metadata
1: Populations, Samples, Estimates and Repeated Sampling
2: Point Estimation and Interval Estimation
3: Results From Probability Theory
http://iimk.ac.in/gsdl/cgibin/library?e=d000000statis00000prompt10401l1en5020about0003100110utfZz800&cl=CL1&d=HASHe00909ac46143070d8f732.3>=1 1/40
4/2/2017 2:PointEstimationandIntervalEstimation
4: One Continuous Variable
5: Comparing Means of Two Continuous Variables
6: Inference For Count Data
7: More On Correlation And Regression
8: Summary
Metadata
Title: Statistical Inference
Author: Surfsata
Publisher: Surfstat
Date: 1999
1: Populations, Samples, Estimates and Repeated Sampling
STATISTICAL INFERENCE
POPULATIONS, SAMPLES, ESTIMATES AND REPEATED SAMPLING
Statistical inference is the use of probability theory to make inferences about a population from sample data.
Suppose we want to estimate the characteristics of a population, such as the average weight of all 30-year-old women in Australia, or the percentage of voters in N.S.W. who think the Government is doing a good job to control inflation.
In practice we cannot obtain data from every member of the population. Instead, we obtain data from a sample and use the results to make inferences about the population.
Definitions
Population: collection of all subjects or objects of interest (not necessarily people).
Sample: subset of the population used to make inferences about the characteristics of the population.
Population parameter: numerical characteristic of a population, a fixed and usually unknown quantity.
e.g. average weight, $\mu$, of all 30-year-old women in Australia,
e.g. % of voters, p, in N.S.W. who think the Government is doing a good job to control inflation.
Data: values measured or recorded on the sample.
Sample statistic: numerical characteristic of the sample data such as the mean, proportion or variance. It can be used to provide estimates of the corresponding population parameters.
e.g. % of voters in an opinion poll in Sydney who think the Government is doing a good job to control inflation.
Different samples give different values for sample statistics. By taking many different samples and calculating a sample statistic for each sample (e.g. the sample mean), you could then draw a histogram of all the sample means. A statistic from a sample or randomised experiment can be regarded as a random variable and the histogram is an approximation to its probability distribution. The term sampling distribution is used to describe this distribution, i.e. how the statistic (regarded as a random variable) varies if random samples are repeatedly taken from the population.
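The repeated-sampling idea can be sketched in a few lines of Python. This is only an illustration: the population below is simulated (10,000 hypothetical heights drawn from a Normal distribution with mean 165 and standard deviation 5), and the function name sample_means is an assumption, not something from the text.

```python
import random
import statistics

def sample_means(population, n, repeats, seed=0):
    """Draw `repeats` simple random samples of size n and return their means."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.sample(population, n)) for _ in range(repeats)]

# Hypothetical population (assumption for illustration only).
pop_rng = random.Random(1)
population = [pop_rng.gauss(165, 5) for _ in range(10_000)]

means = sample_means(population, n=100, repeats=2000)
pop_mean = statistics.fmean(population)

# The sample means scatter around the population mean; their spread is the
# standard deviation of the (approximate) sampling distribution.
print("population mean:        ", round(pop_mean, 2))
print("average of sample means:", round(statistics.fmean(means), 2))
print("SD of sample means:     ", round(statistics.stdev(means), 2))
```

A histogram of the values in means would approximate the sampling distribution of the sample mean for samples of size 100.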
Bias
If the sampling distribution is known then the ability of the sample statistic to estimate the corresponding population parameter can be determined.
In particular, the sampling distribution determines the expected value and variance of the sample statistic. If the expected value of the statistic is equal to the population parameter, the estimator is unbiased. If the variance of the statistic is 'small' and it is also unbiased then an observed statistic is likely to be close to the population parameter.
Bias = distance between the population parameter and the expected value of the sample statistic
Sample statistics can therefore be classified as shown in the following diagrams.
1. Estimates have low bias because their average is near the population parameter, but have high variability because they are widely spread and a single sample value could be far from the parameter.
2. Estimates have bias because the expected value is not equal to the parameter. They also have high variability because they are widely spread out.
3. In this case the estimates are biased because all of them are systematically higher than the population parameter. The sample statistics have, however, low variability because they are all close together.
4. In this case the estimates have both low bias and low variability. Experimental design aims to simultaneously reduce bias and variability by producing a sampling distribution as shown in 4.
In general
sample statistic = population parameter + bias + chance variation
Inferences about the characteristics of a population are based on data from a sample.
If the sample is not representative of the population being studied, the sample statistic may be biased, so you cannot use it to make valid inferences about the population parameter.
To minimise bias the sample should be chosen by random sampling from a list of all individuals in the relevant population. This list is called the sampling frame. It is essential.
For a simple random sample the individuals are chosen in such a way that each individual in the sampling frame has an equal chance of being selected. This may involve using computer-generated random numbers to select the sample.
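As a rough sketch of selecting a simple random sample with computer-generated random numbers, the following Python uses the standard library's random.sample, which draws without replacement so that every individual in the frame has the same chance of selection. The sampling frame here (5,000 made-up identifiers) and the function name are assumptions for illustration.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Select n distinct individuals from the sampling frame, each equally likely."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

# Hypothetical sampling frame: a list of 5,000 identifiers (assumption).
frame = [f"person_{i:04d}" for i in range(1, 5001)]

sample = simple_random_sample(frame, n=50, seed=42)
print(len(sample), "individuals selected, for example:", sample[:3])
```

Fixing the seed makes the selection reproducible, which can be useful for auditing how a sample was drawn.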
Example: Health survey conducted in the Hunter Region
Study population: all residents of the lower Hunter Region (Newcastle, Lake Macquarie, Port Stephens, Maitland, Cessnock) aged 25-69 years.
Sampling frame: electoral roll (note: some bias is introduced here: younger persons (<35 years) and migrants are less likely to be on the roll).
Sample selection: sample chosen using computer-generated random numbers so each person on the electoral roll in this age group has a 1 in 100 chance of selection.
Actual sample: those who responded to the request to participate in the study.
Non-respondents may differ from the respondents in many ways (e.g. being less healthy) and this could lead to bias in estimates of the proportion of smokers, average weight, etc.
Example: Heights of women, 25-29
Suppose you measured the heights of a random sample of 100 women aged 25-29 years and calculated
sample mean $\bar{x}$ = 165 cms
sample standard deviation s = 5 cms
What can you conclude about the heights of all women in this population aged 25-29 years?
Supposing any bias is negligibly small, the population mean is approximately 165 cms. But how close is the approximation? How might the estimate have varied if a different random sample had been selected?
In parameter estimation, we assume that the distribution of the variable we are interested in is adequately described by a distribution with one or more (unknown) parameters. We attempt to estimate the population parameter using the sample data. To emphasise the difference between sample and population, the parameters we wish to estimate are called population parameters. Estimates of the population parameters obtained from a sample are called sample statistics (or sample estimates).
2: Point Estimation and Interval Estimation
POINT ESTIMATION AND INTERVAL ESTIMATION
In any estimation problem, we need to obtain both a point estimate and an interval estimate. The point estimate is our best guess of the true value of the parameter, while the interval estimate gives a measure of accuracy of that point estimate by providing an interval that contains plausible values.
Sample Mean
When the variable of interest is quantitative, the sample mean $\bar{x}$ provides a point estimate of the unknown population mean $\mu$. When the variable has a binomial distribution, the sample proportion $\hat{p}$ is a point estimate of the unknown population proportion p.
Confidence Interval
Confidence intervals are frequently used as interval estimates. Articles in the research literature commonly report 95% confidence intervals (95% CI). The 95% CI is calculated in such a way that under repeated sampling it will contain the true population parameter 95% of the time. The figure 95% is a measure of the confidence you have that the interval contains the true population parameter. Confidence intervals may be computed for any level of confidence: 90% or 99% confidence intervals are sometimes used. We will examine this topic in more detail after a brief overview of some results from probability theory which are useful.
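The "95% of the time under repeated sampling" statement can be checked by simulation. The sketch below assumes Normal data with a known standard deviation and uses the interval sample mean plus or minus 1.96 times sigma divided by the square root of n, which is the known-sigma interval derived in section 4; the parameter values and the function name coverage are illustrative assumptions.

```python
import math
import random
import statistics

def coverage(mu, sigma, n, repeats, z=1.96, seed=0):
    """Fraction of repeated samples whose interval xbar +/- z*sigma/sqrt(n) contains mu."""
    rng = random.Random(seed)
    half_width = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(repeats):
        xbar = statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        if xbar - half_width <= mu <= xbar + half_width:
            hits += 1
    return hits / repeats

# Illustrative values (assumptions): true mean 165, known SD 5, samples of 100.
rate = coverage(mu=165, sigma=5, n=100, repeats=2000)
print("proportion of intervals containing the true mean:", rate)
```

The printed proportion should be close to 0.95; any single interval either contains the true mean or it does not, and the 95% describes the long-run behaviour of the procedure.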
3: Results From Probability Theory
RESULTS FROM PROBABILITY THEORY
Since sampling theory (and hence statistical inference) as used in this course relies on the concept of repeated samples, we consider three results from probability theory:
law of averages
law of large numbers
Central Limit Theorem
Recall Kerrich's coin tossing experiment
In 10,000 tosses of a coin you'd expect the number of heads (#heads) to approximately equal the number of tails
so #heads $\approx \tfrac{1}{2}$ #tosses
Fig. 1 shows that (#heads $- \tfrac{1}{2}$ #tosses) can become large in absolute terms as the number of tosses increases
Fig. 2 shows that in relative terms
(% of heads $-$ 50%) $\to$ 0
as #tosses increases
You can think of this as
#heads = $\tfrac{1}{2}$ #tosses + chance error
where chance error becomes large in absolute terms but small as a % of #tosses as #tosses increases.
Figure 1. Kerrich's coin tossing experiment. A plot of the 'chance error' (number of heads minus half the number of tosses) against the number of tosses. As the number of tosses goes up, the size of the chance error tends to go up. The horizontal axis is not drawn to scale.
Figure 2. The chance error expressed as a percentage of the number of tosses. As the number of tosses goes up, this percentage gets smaller. In other words, the chance error gets smaller relative to the number of tosses. The horizontal axis is not drawn to scale.
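The behaviour described in the two figure captions can be reproduced with a small simulation. This is a sketch with simulated tosses, not Kerrich's actual data; the function name chance_errors and the checkpoint values are assumptions, and the checkpoints must be given in increasing order.

```python
import random

def chance_errors(checkpoints, seed=0):
    """Toss a fair coin and record the chance error at each checkpoint.

    Returns rows of (tosses, heads - tosses/2, error as % of tosses).
    """
    rng = random.Random(seed)
    heads = 0
    tosses = 0
    rows = []
    for target in checkpoints:
        while tosses < target:
            heads += rng.random() < 0.5  # True counts as 1 head
            tosses += 1
        error = heads - tosses / 2
        rows.append((tosses, error, 100 * error / tosses))
    return rows

rows = chance_errors([10, 100, 1000, 10_000])
for tosses, error, percent in rows:
    print(f"{tosses:>6} tosses: chance error {error:+7.1f} ({percent:+.2f}% of tosses)")
```

Across different seeds the absolute chance error tends to grow with the number of tosses while the percentage shrinks, which is the distinction the two figures draw.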
Law of Averages
The Law of Averages states that an average result for n independent trials converges to a limit as n increases.
The law of averages does not work by compensation. A run of heads is just as likely to be followed by a head as by a tail because the outcomes of successive tosses are independent events.
We can prove this result using results from the binomial distribution.
Let the random variable X be the number of heads in n tosses.
The expected value of the proportion of heads, X/n, is 0.5 with variance $(0.5)^2/n$, which goes to zero as $n \to \infty$.
Law of Large Numbers
If $X_1, X_2, \ldots, X_n$ are independent random variables all with the same probability distribution with expected value $\mu$ and variance $\sigma^2$ then
$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$
is very likely to become very close to $\mu$ as n becomes very large.
Coin tossing is a simple example.
Thus the law of averages is an informal version of the law of large numbers.
The law of large numbers says that
$E(\bar{X}) = \mu$
and $\mathrm{var}(\bar{X})$ becomes very small as n increases.
In fact, using the result var(X + Y) = var(X) + var(Y) if X and Y are independent, then
$\mathrm{var}(\bar{X}) = \sigma^2/n$
In particular, if the repeated independent measurements are all Normally distributed, that is
$X_i \sim N(\mu, \sigma^2)$
then their mean also has the Normal distribution
$\bar{X} \sim N(\mu, \sigma^2/n)$
The Central Limit Theorem says that $\bar{X}$ is approximately Normally distributed even if the original measurements were not Normally distributed.
Central Limit Theorem
The distribution of sample means $\bar{X}$ approaches
$N(\mu, \sigma^2/n)$
regardless of the shape of the probability distributions of $X_1, X_2, \ldots$
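A quick way to see the Central Limit Theorem at work is to average draws from a clearly non-Normal distribution. The sketch below uses the exponential distribution with mean 1 and standard deviation 1 (an arbitrary choice for illustration, not one made in the text) and checks how often the sample mean falls within 1.96 standard errors of the true mean, which would be about 95% if the mean were exactly Normal.

```python
import math
import random
import statistics

def mean_of_skewed_sample(n, rng):
    """Mean of n independent draws from an exponential distribution (mean 1, SD 1)."""
    return statistics.fmean(rng.expovariate(1.0) for _ in range(n))

rng = random.Random(0)
n = 50
means = [mean_of_skewed_sample(n, rng) for _ in range(4000)]

standard_error = 1.0 / math.sqrt(n)  # sigma / sqrt(n) with sigma = 1
inside = sum(abs(m - 1.0) <= 1.96 * standard_error for m in means) / len(means)

print("average of sample means:", round(statistics.fmean(means), 3))
print("SD of sample means:     ", round(statistics.stdev(means), 3))
print("share within 1.96 SE:   ", round(inside, 3))
```

Even though single exponential values are strongly skewed, the means of samples of 50 are centred on 1 with spread close to 1/sqrt(50), roughly 0.14.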
4: One Continuous Variable
ONE CONTINUOUS VARIABLE
Sampling Distribution of the Sample Mean
Suppose we have a sample of n observations of a continuous random variable X (e.g. height, cost, temperature).
Let $\mu = E(X)$ be the population mean, and $\sigma = \sqrt{\mathrm{var}(X)}$ be the population standard deviation of X.
Usually both $\mu$ and $\sigma$ are unknown, and we want primarily to estimate $\mu$.
From the sample we can calculate
$\bar{x}$ = sample mean
s = sample standard deviation
The sample mean is an estimate of $\mu$, but how accurate is it?
We know the approximate sampling distribution of the random variable $\bar{X}$ from the Central Limit Theorem. Suppose many different samples of the same size are obtained by repeatedly sampling from a population,
for each sample $\bar{x}$ is calculated and
a histogram of the $\bar{x}$ values is drawn.
The shape of the histogram depends on the size n of the sample, and approximates the sampling distribution.
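Properties of the Sampling Distribution
i. Sampling distributions of $\bar{X}$ have the same expected value, regardless of sample size n, equal to $\mu$.
ii. Even if the distribution of the X values was not Normal, as n increases the distribution for $\bar{X}$ becomes more like the Normal distribution; this is the Central Limit Theorem.
iii. As n increases, the distribution of $\bar{X}$ becomes narrower, that is, the $\bar{x}$'s cluster more tightly around $\mu$. In fact the variance is inversely proportional to n:
$\mathrm{var}(\bar{X}) = \sigma^2/n$
The sampling distribution of $\bar{X}$ is (approx) $N(\mu, \sigma^2/n)$.
This follows from the previous result on the variance of combinations of independent random variables, since
$\mathrm{var}(\bar{X}) = \mathrm{var}\left(\frac{X_1 + \cdots + X_n}{n}\right) = \frac{1}{n^2}\left[\mathrm{var}(X_1) + \cdots + \mathrm{var}(X_n)\right] = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$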
Confidence Limits for the Population Mean when $\sigma$ is known
To calculate probabilities for $\bar{X}$ use
$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1)$
e.g. from the table for Z ~ N(0,1),
P(Z < 1) = .8413
so P(-1 < Z < 1) = .8413 - (1 - .8413) = .6826.
So with probability 0.6826, -1 < Z < 1. Equivalently, with probability .6826,
$-1 < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < 1$.
We want to rearrange these inequalities so that $\mu$ alone is in the middle. If you rearrange a statement that has a certain probability of being true, the new statement still has that same probability.
First, multiply each term by $\sigma/\sqrt{n}$ to obtain
$-\sigma/\sqrt{n} < \bar{X} - \mu < \sigma/\sqrt{n}$
Multiply each term by -1 (this means the inequality signs are also reversed)
$\sigma/\sqrt{n} > \mu - \bar{X} > -\sigma/\sqrt{n}$
Add $\bar{X}$ to each term
$\bar{X} + \sigma/\sqrt{n} > \mu > \bar{X} - \sigma/\sqrt{n}$
Rearrange this to obtain
$\bar{X} - \sigma/\sqrt{n} < \mu < \bar{X} + \sigma/\sqrt{n}$
This tells us that, for 68.3% of all samples, the random interval between
$L = \bar{X} - \sigma/\sqrt{n}$ and $U = \bar{X} + \sigma/\sqrt{n}$
will contain the population mean $\mu$.
L is called the lower 68% confidence limit for $\mu$ and U is called the upper 68% confidence limit for $\mu$.
More usually 90% or 95% or 99% confidence limits are used.
E.g. from tables for N(0,1)
P(Z < 1.96) = 0.975
Hence P(-1.96 < Z < 1.96) = 0.95
This can be rearranged as before to obtain 95% confidence limits for $\mu$:
$\bar{X} \pm 1.96\,\sigma/\sqrt{n}$
In practice, $\sigma$ is usually unknown, but it can be estimated by the sample standard deviation, s. If this is done, the t-distribution must be used instead of the standard normal.
Health survey example continued:
Sample of n = 100 women aged 25-29 years
sample mean $\bar{x}$ = 165 cms
sample standard deviation s = 5 cms
For now, we will pretend that the population SD is known to be exactly 5. A more accurate method must take account of the fact that the value 5 is only an estimate. This method is based on the t-distribution instead of the normal distribution. The difference is small when the sample is large, as here.
95% confidence limits for the population mean, $\mu$, are
$\bar{x} \pm 1.96\,\sigma/\sqrt{n} = 165 \pm 1.96 \times 5/\sqrt{100} = 165 \pm 0.98$
i.e. the 95% confidence interval for $\mu$ is (164, 166) cms.
Hence, plausible values for $\mu$ are 164-166 cms, or with 95% confidence the true study population mean height of women aged 25-29 years lies between 164 and 166 cms.
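The arithmetic for this interval can be sketched as follows, treating the population standard deviation as known (as the example does). The function name is an assumption; NormalDist comes from Python's standard statistics module and supplies the Normal quantile in place of a printed table.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(xbar, sigma, n, level=0.95):
    """Known-sigma confidence interval: xbar +/- z * sigma / sqrt(n)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for level = 0.95
    half_width = z * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width

low, high = mean_confidence_interval(xbar=165, sigma=5, n=100)
print(f"95% CI for the mean: ({low:.2f}, {high:.2f}) cms")
```

Changing level to 0.90 or 0.99 gives the other commonly used intervals; a t-based interval would need t quantiles, which the standard library does not provide.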
5: Comparing Means of Two Continuous Variables
COMPARING MEANS OF TWO CONTINUOUS VARIABLES
Introduction
Example: From the MINITAB handbook, page 101
"A shoe company wanted to compare two materials (A and B) for use on the soles of boys' shoes. We could design an experiment to compare the two materials in two ways. One design might be to recruit ten boys (or more if our budget allowed) and give five of the boys shoes with material A and give five boys shoes with material B. Then after a suitable length of time, say three months, we could measure the wear on each boy's shoes. This would lead to independent samples. Now, you would expect a certain variability among ten boys; some boys wear out shoes much faster than others. A problem arises if this variability is large. It might completely hide an important difference between the two materials.
An alternative design, a paired design, attempts to remove some of this variability from the analysis so we can see more clearly any differences between the materials we are studying. Again, suppose we started with the same ten boys, but this time had each boy test both materials. There are several ways we could do this. Each boy could wear material A for three months, then material B for a second three months. Or we could give each boy a special pair of shoes with the sole on one shoe made from material A and the other from material B. This latter procedure produced the data in the table below:"
MTB > DotPlot 'materA' 'materB';
SUBC> Same.
[Dot plots of materA and materB on a common axis running from 6.0 to 13.5]
MTB > desc c1 c2
            N     MEAN   MEDIAN   TRMEAN    STDEV   SEMEAN
materA     10   10.630   10.750   10.675    2.451    0.775
materB     10   11.040   11.250   11.225    2.518    0.796
              MIN      MAX       Q1       Q3
materA      6.600   14.300    8.650   13.225
materB      6.400   14.200    9.175   13.700
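The first design results in two independent samples, while the second design contains dependent samples. Different methods of analysis are appropriate in each case.
Independent Samples
Suppose we have data from two samples, of sizes $n_1$ and $n_2$, which come from two populations which may or may not have the same means:
R.V. for sample data              Population parameters
$X_1, X_2, \ldots, X_{n_1}$       $\mu_X, \sigma_X$
$Y_1, Y_2, \ldots, Y_{n_2}$       $\mu_Y, \sigma_Y$
To compare the population means $\mu_X$ and $\mu_Y$ you could find a confidence interval for $(\mu_X - \mu_Y)$ or test the null hypothesis
$H_0: \mu_X = \mu_Y$, i.e. $\mu_X - \mu_Y = 0$, against the alternative hypothesis
$H_1: \mu_X \ne \mu_Y$, i.e. $\mu_X - \mu_Y \ne 0$
To estimate $\mu_X - \mu_Y$ use the difference between the sample means $(\bar{X} - \bar{Y})$.
To obtain the sampling distribution of $(\bar{X} - \bar{Y})$ assume that X and Y are independent.
Using results on the sampling distribution of a mean we obtain
$\bar{X} \sim N(\mu_X, \sigma_X^2/n_1)$, $\bar{Y} \sim N(\mu_Y, \sigma_Y^2/n_2)$
Then
$E(\bar{X} - \bar{Y}) = \mu_X - \mu_Y$ and $\mathrm{var}(\bar{X} - \bar{Y}) = \sigma_X^2/n_1 + \sigma_Y^2/n_2$
if X and Y are independent.
Also $(\bar{X} - \bar{Y})$ is approximately Normally distributed so that
$\bar{X} - \bar{Y} \sim N(\mu_X - \mu_Y,\ \sigma_X^2/n_1 + \sigma_Y^2/n_2)$
or $Z = \frac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{\sqrt{\sigma_X^2/n_1 + \sigma_Y^2/n_2}} \sim N(0,1)$
In practice $\sigma_X$ and $\sigma_Y$ are usually unknown so you would use the sample standard deviations $s_X$ and $s_Y$ to estimate $\sigma_X$ and $\sigma_Y$. There are two cases to consider.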
Case 1 (t-test)
If we assume that $\sigma_X^2 = \sigma_Y^2 = \sigma^2$ then a good estimate for $\sigma^2$ is the pooled sample variance
$s_{pooled}^2 = \frac{(n_1 - 1)s_X^2 + (n_2 - 1)s_Y^2}{n_1 + n_2 - 2}$
where $s_X^2$ and $s_Y^2$ were calculated using denominators $(n_1 - 1)$ and $(n_2 - 1)$.
Then the following t statistic can be used to construct a 95% C.I. or p-values.
$t = \frac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{s_{pooled}\sqrt{1/n_1 + 1/n_2}}$
with $(n_1 + n_2 - 2)$ degrees of freedom.
Case 2 (Welch's test)
If $\sigma_X^2$ and $\sigma_Y^2$ are unlikely to be equal you should use
$t = \frac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{\sqrt{s_X^2/n_1 + s_Y^2/n_2}}$
with $\nu$ degrees of freedom where
$\nu = \frac{\left(s_X^2/n_1 + s_Y^2/n_2\right)^2}{\frac{(s_X^2/n_1)^2}{n_1 - 1} + \frac{(s_Y^2/n_2)^2}{n_2 - 1}}$
When doing calculations without a computer, it can be time consuming to calculate the degrees of freedom for Welch's t-test. In this situation, the t-distribution with degrees of freedom equal to the smaller of $n_1 - 1$ and $n_2 - 1$ can be used to find critical values.
As the sample sizes increase, probability values based on this assumption become more accurate.
NOTE: this method will not be used in this course. In this course, we will assume that Case 1 applies, i.e. that the population variances are equal, $\sigma_X^2 = \sigma_Y^2$, and use a pooled estimate of variance.
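To give a feel for the Case 2 degrees of freedom, here is a small sketch that evaluates the standard Welch-Satterthwaite approximation from summary statistics and compares it with the conservative hand-calculation rule (the smaller of n1 - 1 and n2 - 1). The function name is an assumption; the inputs are the standard deviations and sample sizes from the MINITAB description of the shoe data above.

```python
def welch_degrees_of_freedom(s_x, n_x, s_y, n_y):
    """Welch-Satterthwaite approximate degrees of freedom from sample SDs and sizes."""
    v_x = s_x ** 2 / n_x  # estimated variance of the first sample mean
    v_y = s_y ** 2 / n_y  # estimated variance of the second sample mean
    return (v_x + v_y) ** 2 / (v_x ** 2 / (n_x - 1) + v_y ** 2 / (n_y - 1))

df_welch = welch_degrees_of_freedom(2.451, 10, 2.518, 10)
df_conservative = min(10 - 1, 10 - 1)
df_pooled = 10 + 10 - 2

print("Welch df:       ", round(df_welch, 2))
print("conservative df:", df_conservative)
print("pooled df:      ", df_pooled)
```

When the two sample standard deviations and sample sizes are similar, as here, the Welch value is close to the pooled value of n1 + n2 - 2, while the hand rule is deliberately cautious.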
Example: Boys' shoes
For boys' shoes
$\bar{x}$ = 10.63, $s_X$ = 2.451, $n_1$ = 10
$\bar{y}$ = 11.04, $s_Y$ = 2.518, $n_2$ = 10
If we assume that the two samples of shoe wear measurements were independent, then the following analysis could be done. In fact, since the data in the two samples are not independent, because the same boys were used each time, the correct analysis would be to use a paired samples method.
Test $H_0: \mu_X = \mu_Y$ versus $H_1: \mu_X \ne \mu_Y$
Assume $\sigma_X = \sigma_Y$ and assume independent samples and so use
$s_{pooled}^2 = \frac{9 \times 2.451^2 + 9 \times 2.518^2}{18} = 6.174$, so $s_{pooled} = 2.485$
Note $s_{pooled}$ lies between $s_X$ and $s_Y$ as required.
To calculate the test statistic, assume $H_0$ is true. If $H_0$ is true then $\mu_X - \mu_Y = 0$ so the observed value of the t statistic is
$t = \frac{\bar{x} - \bar{y}}{s_{pooled}\sqrt{1/n_1 + 1/n_2}} = \frac{10.63 - 11.04}{2.485\sqrt{1/10 + 1/10}} = \frac{-0.41}{1.111} = -0.37$
This t statistic has 10 + 10 - 2 = 18 degrees of freedom.
The p-value is the probability of observing the value -0.37, or an even more extreme value, in either direction
p = Pr($t_{18}$ < -0.37 or $t_{18}$ > 0.37)
= 2 P($t_{18}$ > 0.37)
> 2 P($t_{18}$ > 0.688) (0.688 is the nearest tabulated value)
= 2 $\times$ 0.25 = 0.5, so p > 0.5
As the p-value is greater than 0.05 (i.e. it is not small) do not reject $H_0$. The difference between the sample means is not statistically significant.
Conclusion: The data are consistent with $H_0: \mu_X = \mu_Y$,
i.e. there is no evidence that the average wear is different for materials A and B.
Remarks
The MINITAB command is "TWOSAMPLE". (Welch's test is used if the subcommand POOLED is not included.)
MTB > twosample c1 c2;
SUBC> pooled.
TWOSAMPLE T FOR materA VS materB
            N     MEAN    STDEV   SE MEAN
materA     10    10.63     2.45      0.78
materB     10    11.04     2.52      0.80
95 PCT CI FOR MU materA - MU materB: (-2.75, 1.93)
TTEST MU materA = MU materB (VS NE): T = -0.37  P = 0.72  DF = 18
POOLED STDEV = 2.49
The 95% C.I. for $\mu_X - \mu_Y$ can be constructed in the usual way:
Find the value c such that Pr(-c < $t_{18}$ < c) = 0.95.
Table 2 gives a value of c = 2.101 for t with 18 d.f.
95% CI = $(\bar{x} - \bar{y}) \pm 2.101 \times s_{pooled}\sqrt{1/n_1 + 1/n_2}$
= (10.63 - 11.04) $\pm$ 2.101 $\times$ 2.485 $\times \sqrt{1/10 + 1/10}$
= -0.41 $\pm$ 2.33
= (-2.74, 1.92)
The value 0 is contained in this interval and hence the data are consistent with $\mu_X - \mu_Y = 0$ and so provide no evidence for a difference in the population means.
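The pooled calculation can be reproduced from the summary statistics alone. The sketch below is illustrative: the function name is an assumption, and the critical value 2.101 is simply the tabulated value quoted in the text for 18 degrees of freedom, because Python's standard library has no t-distribution.

```python
import math

def pooled_t_statistic(xbar, s_x, n_x, ybar, s_y, n_y):
    """Pooled two-sample t statistic under H0: equal population means."""
    df = n_x + n_y - 2
    s_pooled = math.sqrt(((n_x - 1) * s_x ** 2 + (n_y - 1) * s_y ** 2) / df)
    se_diff = s_pooled * math.sqrt(1 / n_x + 1 / n_y)
    return (xbar - ybar) / se_diff, df, s_pooled, se_diff

t, df, s_pooled, se_diff = pooled_t_statistic(10.63, 2.451, 10, 11.04, 2.518, 10)

T_CRIT_18 = 2.101  # tabulated value from the text for 18 d.f. (not computed here)
diff = 10.63 - 11.04
ci = (diff - T_CRIT_18 * se_diff, diff + T_CRIT_18 * se_diff)

print(f"t = {t:.2f} on {df} d.f., pooled SD = {s_pooled:.3f}")
print(f"95% CI for the difference in means: ({ci[0]:.2f}, {ci[1]:.2f})")
```

The small difference from the MINITAB interval (-2.75, 1.93) presumably arises because MINITAB works from the unrounded data rather than the rounded summary values used here.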
6: Inference For Count Data
INFERENCE FOR COUNT DATA
Sampling Distribution of a Sample Proportion
Suppose you are interested in a population in which every individual can be classified into one of two categories, e.g.
manufactured items: defective, OK
experimental animals: dead, alive
exam results: pass, fail
In general, we call these "success" and "failure". We want to arrive at conclusions about p, the proportion of successes in the population, using information from a sample.
Suppose you take a random sample of n individuals. Let the random variable X be the number of successes in the sample. A reasonable model would be to assume that X has the binomial distribution, X ~ binomial(n, p), where usually n, the sample size, is known but p is not.
If X ~ binomial(n, p) then E(X) = np and var(X) = npq where q = 1 - p.
Since $\hat{p} = X/n$ is a function of X, and n is known, we have
$E(\hat{p}) = E(X)/n = p$ and $\mathrm{var}(\hat{p}) = \mathrm{var}(X)/n^2 = pq/n$
so the standard error of $\hat{p}$ is
$SE(\hat{p}) = \sqrt{pq/n}$
The binomial distribution of X can be approximated by the Normal distribution N(np, npq), provided n > 20 and np > 5.
Hence $\hat{p}$ is approximately $N(p, pq/n)$
and $Z = \frac{\hat{p} - p}{\sqrt{pq/n}}$ is approximately N(0,1). This gives us a basis for making inferences about p: constructing confidence intervals and performing hypothesis tests.
Confidence Intervals and Tests for Proportions
To find a 95% CI for p, we first need to find c such that Pr(-c < Z < c) = 0.95. Using Table 1, c = 1.96. This can be written as:
P(-1.96 < Z < 1.96) = 0.95 where
$Z = \frac{\hat{p} - p}{SE(\hat{p})}$.
so $\Pr\left(-1.96 < \frac{\hat{p} - p}{SE(\hat{p})} < 1.96\right) = 0.95$.
This can be rearranged to obtain the 95% confidence limits for p. A 95% CI for p, the true population proportion, is $\hat{p} \pm 1.96\,SE(\hat{p})$
where $\hat{p}$ is the observed sample proportion X/n.
Example: Confidence interval
To predict the outcome of a referendum an opinion poll is conducted. A random sample of 552 people on the electoral roll are asked "will you vote for ....?" Their answers are recorded as either "yes" or "other" ("other" includes "no", "don't know", etc).
If 239 people say "yes", find a 95% confidence interval for the true population proportion of "yes" votes.
The proportion of 'yes's in the sample is
$\hat{p}$ = 239/552 = 0.433
so $\hat{q}$ = 1 - 0.433 = 0.567 and the estimated standard error is
$SE(\hat{p}) = \sqrt{\hat{p}\hat{q}/n} = \sqrt{0.433 \times 0.567/552} = 0.021$
So a 95% confidence interval for the population proportion p is given by
$\hat{p} \pm 1.96\sqrt{\hat{p}\hat{q}/n} = 0.433 \pm 1.96 \times 0.021$, i.e. (0.39, 0.47).
Thus with 95% confidence, the true proportion of "yes" votes lies between 39% and 47%.
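The referendum calculation can be sketched in Python as below. The function name is an assumption, and the interval is the Normal-approximation interval described in the text (sample proportion plus or minus 1.96 estimated standard errors), not an exact binomial interval.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Normal-approximation confidence interval for a population proportion."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # estimated standard error
    return p_hat, se, (p_hat - z * se, p_hat + z * se)

p_hat, se, (low, high) = proportion_ci(239, 552)
print(f"p-hat = {p_hat:.3f}, SE = {se:.3f}, 95% CI = ({low:.2f}, {high:.2f})")
```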
Example: Sample size
In most Australian national elections, about 50% of votes are cast for the Liberal/National Coalition and 50% for the Australian Labor Party. How large a sample do you need to estimate the vote to within 3% with 95% confidence?
Let p be the proportion of Liberal/National votes.
If we take the term $1.96\,SE(\hat{p})$ as the 3% (or 0.03) error, we need
$1.96\,SE(\hat{p}) = .03$
so $SE(\hat{p})$ = 0.03/1.96 = 0.0153
$SE(\hat{p}) = \sqrt{pq/n} = \sqrt{0.5 \times 0.5/n} = 0.5/\sqrt{n}$
Hence, $0.5/\sqrt{n}$ = 0.0153
and $\sqrt{n}$ = 32.67,
which gives n = $(32.67)^2$ = 1067. This is the reason why most opinion polls use samples of about 1000 subjects.
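Solving 1.96 times the square root of pq/n equal to the margin for n gives n = 1.96 squared times pq divided by the margin squared. A minimal sketch follows (the function name is an assumption). It rounds up to the next whole person, so it reports 1068 where the text rounds 1067.1 to 1067.

```python
import math

def sample_size_for_margin(margin, p=0.5, z=1.96):
    """Smallest n for which z * sqrt(p * (1 - p) / n) is at most the margin."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

n_required = sample_size_for_margin(0.03)
print("sample size for a 3% margin at 95% confidence:", n_required)
```

Using p = 0.5 is the most cautious choice, since pq is largest there; a proportion further from 50% would need a somewhat smaller sample for the same margin.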
Example: Hypothesis Test
A company claims to have 40% of the market for some product. You conduct a survey and find 38 out of 112 buyers (i.e. 34%) purchased this brand. Are these data consistent with the company's claim or is your survey result of 34% significantly different from the company's claim of 40%?
Let X be the number of buyers who choose this brand.
Assume X ~ Binomial(n, p) with n = 112 since the buyer either purchases this brand or not (binomial).
Hypothesis: p = 0.4 (i.e. the true population proportion is 40%).
If this is true then X ~ Binomial(112, 0.4) and so E(X) = np = 112 $\times$ 0.4 = 44.8
Now to calculate the p-value for the test, we must calculate the probability of obtaining the observed value of 38, or an even more extreme value, in either direction from the expected value of 44.8.
Values of X that are either less than or equal to 38, or greater than or equal to 51.6 (= 44.8 + 6.8), are as extreme or more extreme than the observed value, so
p-value = $P(X \le 38 \text{ or } X \ge 51.6) = P(X \le 38) + P(X \ge 52)$
This could be calculated using the exact binomial formula to obtain the result 0.2102.
Alternatively, since n is large, we might use the Normal approximation to the Binomial distribution
X approx ~ N(np, npq)
np = 112 $\times$ .4 = 44.8
npq = 112 $\times$ .4 $\times$ .6 = 26.88
i.e. X ~ approx N(44.8, 26.88)
p-value = $P(X < 38.5 \text{ or } X > 51.5)$
= $P\left(Z \le \frac{38.5 - 44.8}{\sqrt{26.88}} \text{ or } Z \ge \frac{51.5 - 44.8}{\sqrt{26.88}}\right)$
= $P(Z \le -1.215 \text{ or } Z \ge 1.292)$
= .1122 + (1 - .9015)
= .210
This p-value is greater than 0.05, so if the conventional significance level of 0.05 is chosen the result is "not statistically significant". The p-value is not small so we do not reject $H_0$ but instead conclude that the data are consistent with the claim that the true population proportion of buyers purchasing this brand is 40%. The sample proportion differed from this, but not by a statistically significant amount.
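Both routes to the p-value can be sketched in Python: the exact binomial sum over values at least as far from np as the observed count, and the Normal approximation with the same half-unit continuity correction used in the text. The function names are assumptions, and the sketch assumes the observed count differs from np.

```python
import math
from statistics import NormalDist

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def two_sided_p_values(x_obs, n, p):
    """Exact and Normal-approximation two-sided p-values for H0: proportion = p."""
    expected = n * p
    distance = abs(x_obs - expected)
    # Largest count at or below expected - distance, smallest at or above expected + distance.
    k_low = math.floor(expected - distance + 1e-9)
    k_high = math.ceil(expected + distance - 1e-9)
    exact = sum(binomial_pmf(k, n, p) for k in range(n + 1) if k <= k_low or k >= k_high)
    approx = NormalDist(mu=expected, sigma=math.sqrt(n * p * (1 - p)))
    normal = approx.cdf(k_low + 0.5) + (1 - approx.cdf(k_high - 0.5))
    return exact, normal

exact_p, normal_p = two_sided_p_values(38, 112, 0.4)
print(f"exact binomial p-value:       {exact_p:.4f}")
print(f"Normal approximation p-value: {normal_p:.4f}")
```

Both values are well above 0.05, in line with the conclusion that the survey result is consistent with the 40% claim.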
Confidence Interval for the Difference Between Two Proportions
Example: City/country market survey
In a market survey it is found that 51 of 198 (26%) of people in cities used brand 'A' of a product and 26 of 145 (18%) of country people use it. Can we conclude that brand 'A' really is more popular in cities or could this difference have occurred by chance, i.e. due to sampling variability?
In general we may wish to compare the population proportions based on data from samples
                          Group 1                     Group 2
population proportions    $p_1$                       $p_2$
sample proportions        $\hat{p}_1 = X_1/n_1$       $\hat{p}_2 = X_2/n_2$
standard errors           $\sqrt{p_1 q_1/n_1}$        $\sqrt{p_2 q_2/n_2}$
We compare $p_1$ and $p_2$ by making inferences about their difference $(p_1 - p_2)$. This is estimated by $\hat{p}_1 - \hat{p}_2$.
To obtain the sampling distribution for $\hat{p}_1 - \hat{p}_2$ we use the formulas for expected values and variances of functions of random variables.
$E(\hat{p}_1 - \hat{p}_2) = E(\hat{p}_1) - E(\hat{p}_2) = p_1 - p_2$
$\mathrm{var}(\hat{p}_1 - \hat{p}_2) = p_1 q_1/n_1 + p_2 q_2/n_2$
If $p_1$, $q_1$, $p_2$ and $q_2$ are replaced by sample estimates then
$SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\hat{p}_1\hat{q}_1/n_1 + \hat{p}_2\hat{q}_2/n_2}$
Also, provided both samples are large (say $n_1$ > 20 and $n_2$ > 20) and neither $p_1$ nor $p_2$ is very near 0 or 1, then the sampling distribution of $\hat{p}_1 - \hat{p}_2$ is approximately Normal.
So
$Z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{SE(\hat{p}_1 - \hat{p}_2)}$ is approximately N(0,1)
Hence a 95% confidence interval for $(p_1 - p_2)$ is given by
$(\hat{p}_1 - \hat{p}_2) \pm 1.96\,SE(\hat{p}_1 - \hat{p}_2)$
For the data in the example above
$\hat{p}_1$ = 51/198 = 0.2576, $\hat{p}_2$ = 26/145 = 0.1793
so a 95% confidence interval is
(0.2576 - 0.1793) $\pm$ 1.96 $\times \sqrt{0.2576 \times 0.7424/198 + 0.1793 \times 0.8207/145}$
i.e. (-0.01, 0.17)
Thus the data suggest that, with 95% confidence, the true population difference in proportions who favour brand 'A' between cities and country areas is between -1% and 17%.
In particular, since this interval includes zero, the data are consistent with there being no difference in preferences (they are also consistent with differences of 10% or 17% or -1%, etc). The null hypothesis of "no difference" cannot be rejected and we cannot conclude that there is any real difference in preferences.
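The city/country interval can be checked with a few lines of Python. The function name is an assumption; the method is the Normal-approximation interval for a difference of two independent proportions described above.

```python
import math

def difference_of_proportions_ci(x1, n1, x2, n2, z=1.96):
    """Normal-approximation CI for p1 - p2 from two independent samples."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff, se, (diff - z * se, diff + z * se)

diff, se_diff, (low, high) = difference_of_proportions_ci(51, 198, 26, 145)
print(f"difference = {diff:.4f}, SE = {se_diff:.4f}")
print(f"95% CI = ({low:.2f}, {high:.2f}); contains zero: {low < 0 < high}")
```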
7: More On Correlation And Regression
MORE ON CORRELATION AND REGRESSION
Use of Correlation and Regression
Data for one group or sample consist of two continuous measurements on each subject (or object)
Data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.
Plot y against x.
Correlation and Correlation coefficient
Correlation measures the extent of linear association between x and y. The "correlation coefficient" is given by:
$r = \frac{s_{xy}}{s_x s_y}$, where $s_{xy} = \frac{1}{n-1}\sum (x_i - \bar{x})(y_i - \bar{y})$, $s_x^2 = \frac{1}{n-1}\sum (x_i - \bar{x})^2$ and $s_y^2 = \frac{1}{n-1}\sum (y_i - \bar{y})^2$
In calculating a correlation, x and y are treated interchangeably. Both measurements are assumed to be subject to random variation.
Example: 20 female students in STAT101
The data below are the heights (cm) and weights (kg) of 20 female students taking STAT101
ROW   fht   fwt
  1   167    60
  2   164    65
  3   170    64
  4   163    47
  5   152    46
  6   160    57
  7   170    57
  8   160    55
  9   157    55
 10   170    65
 11   150    50
 12   156    46
 13   168    60
 14   159    55
 15   160    50
 16   172    69
 17   175    56
 18   169    56
 19   169    72
 20   156    56
gplot c2 c1
correl c2 c1
Correlation of fwt and fht = 0.673
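The correlation reported by MINITAB can be recomputed directly from the 20 height and weight pairs in the table. The sketch below uses the usual sample correlation formula (sum of cross-products divided by the square root of the product of the two sums of squares); the function and variable names are assumptions.

```python
import math

def correlation(x, y):
    """Sample correlation coefficient r between paired measurements x and y."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Heights (cm) and weights (kg) of the 20 students, in row order.
fht = [167, 164, 170, 163, 152, 160, 170, 160, 157, 170,
       150, 156, 168, 159, 160, 172, 175, 169, 169, 156]
fwt = [60, 65, 64, 47, 46, 57, 57, 55, 55, 65,
       50, 46, 60, 55, 50, 69, 56, 56, 72, 56]

r = correlation(fht, fwt)
print("correlation of fwt and fht:", round(r, 3))
```

Swapping the two arguments gives the same value, which reflects the statement that x and y are treated interchangeably in a correlation.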
For regression, on the other hand, one variable y is regarded as an outcome of, or response to, the other variable x. The two variables are not interchangeable.
We find the equation for a straight line "y = a + b.x" which summarises the relationship between them. Here y is the outcome, response or dependent variable (output), while x is the predictor, explanatory or independent variable (input).
Often values of x are chosen by the investigator and are not random, whereas y values are measurements which are treated as random variables taken from distributions that depend on x.
Example: Health expenditure growth in Australia 1994-5
The following table shows the rate of growth (%) of health expenditure per person in Australia at constant 1984-5 prices.
(Source: Aust Inst. of Health, "Health Expenditure", Info. Bull. No 6, May 1991)
Year     1983-4   84-5   85-6   86-7   87-8   88-9
Growth     4.6     2.7    4.1    2.3    1.8    2.0
Denote the years by x = 1 for 1983-4, x = 2 for 1984-5 and so on.
MTB > plot c4 c3
[Scatter plot of growth (vertical axis, about 2.0 to 4.0) against year (horizontal axis, 1.0 to 6.0)]
The least squares estimates of a and b are
$\hat{b} = s_{xy}/s_x^2$ and $\hat{a} = \bar{y} - \hat{b}\bar{x}$
($s_{xy}$ and $s_x$ are defined above)
You can calculate $\hat{a}$ and $\hat{b}$ and then predict the y value for any given x
$\hat{y} = \hat{a} + \hat{b}x$
$\hat{y}$ is called the fitted value.
Analysis of Variance
Analysis of variance is a technique for measuring the importance of a regression, and for estimating how much unexplained variation in Y is left after allowing for X.
The total (overall) variation among measured y values, ignoring the x's, is $\sum (y_i - \bar{y})^2$. The variation of y values from the line is less, and is given by $\sum (y_i - \hat{y}_i)^2$. The regression line is precisely the line that makes this second quantity as small as possible. It can be shown that
$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2$
Total Variation = explained variation + unexplained variation
Sometimes (e.g. by MINITAB) these values are shown in an Analysis of Variance (ANOVA) table:
Source of variation    Sum of Squares (SS)
Regression             $\sum (\hat{y}_i - \bar{y})^2$    <- "explained variation"
Error                  $\sum (y_i - \hat{y}_i)^2$        <- "unexplained variation"
Total                  $\sum (y_i - \bar{y})^2$
Coefficient of Determination
This is a measure of how well the line fits the data.
It can be shown that
coefficient of determination = $\frac{\text{explained variation}}{\text{total variation}} = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2}$
= $r^2$ = square of the correlation coefficient.
So 100 $r^2$ is the percentage of variation "explained" by the regression line.
If the points all lie exactly on the line, i.e. if $y_i = \hat{y}_i$, then the "unexplained" (or error) variation is zero so the coefficient of determination is one.
If $\hat{y}_i = \bar{y}$, so the slope of the line $\hat{b}$ is zero (i.e. no regression effect), then the coefficient of determination is zero.
In general: 0 $\le$ coefficient of determination $\le$ 1.
Example on health expenditure continued
MTB > brief 3
(note: specifies amount of output; see Minitab HELP or handbook)
MTB > regress c4 1 c3
(note: this regresses the dependent (y) variable, growth, on "1" predictor (x) variable, year)
The regression equation is
growth = 4.67 - 0.500 year
(note: the relationship between growth and year. On average, growth declined by 1/2% per year)
Predictor      Coef     Stdev   t-ratio       p
Constant     4.6667    0.7171      6.51   0.003
year        -0.5000    0.1841     -2.72   0.053
s = 0.7703   R-sq = 64.8%   R-sq(adj) = 56.0%
(note: s is the estimated standard deviation of the data about the regression line. 64.8% of variation in growth is explained by the fitted equation)
Analysis of Variance
SOURCE        DF       SS       MS      F       p
Regression     1   4.3750   4.3750   7.37   0.053
Error          4   2.3733   0.5933
Total          5   6.7483
(note: Fit is the value predicted by the regression equation)
Obs.   year   growth     Fit   Stdev.Fit*   Residual   St.Resid*
  1    1.00    4.600   4.167       0.557       0.433       0.82
  2    2.00    2.700   3.667       0.419      -0.967      -1.49
  3    3.00    4.100   3.167       0.328       0.933       1.34
  4    4.00    2.300   2.667       0.328      -0.367      -0.53
  5    5.00    1.800   2.167       0.419      -0.367      -0.57
  6    6.00    2.000   1.667       0.557       0.333       0.63
* Not treated in this course.
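The fitted line, the sums of squares and R-sq in the MINITAB output can be reproduced with a short least squares sketch. The function name and return layout are assumptions; the data are the six growth rates with years coded 1 to 6 as in the example.

```python
def least_squares(x, y):
    """Fit y = a + b*x by least squares and return a, b and the sums of squares."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = ybar - b * xbar
    fitted = [a + b * xi for xi in x]
    ss_total = sum((yi - ybar) ** 2 for yi in y)
    ss_error = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_regression = ss_total - ss_error
    return a, b, ss_regression, ss_error, ss_total

year = [1, 2, 3, 4, 5, 6]
growth = [4.6, 2.7, 4.1, 2.3, 1.8, 2.0]

a, b, ss_reg, ss_err, ss_tot = least_squares(year, growth)
r_squared = ss_reg / ss_tot

print(f"growth = {a:.2f} + ({b:.3f}) year")
print(f"SS regression = {ss_reg:.4f}, SS error = {ss_err:.4f}, SS total = {ss_tot:.4f}")
print(f"R-sq = {100 * r_squared:.1f}%")
```

The standard errors, t-ratios and p-values in the MINITAB table need the t-distribution and are not reproduced in this sketch.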
STATISTICAL INFERENCE
8: Summary
SUMMARY
How To Choose The Appropriate Method For Statistical Inference
1. Are the data continuous or categorical?
2. How many sets of measurements?
one: 1 variable measured on 1 sample
two: 1 variable measured on 2 (independent or dependent) samples
two: 2 variables measured on 1 sample
Sets of measurements
Scale          1 sample,               2 samples,              1 sample,
               1 variable              1 variable              2 variables
continuous     z or t                  z or t                  correlation or regression
categorical    binomial or $\chi^2$    binomial or $\chi^2$    $\chi^2$
1 sample/group, 1 variable, continuous scale
Model: assume $X \sim N(\mu, \sigma^2)$
To make inferences about $\mu$
use $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1)$ if $\sigma$ is known
OR
use $t = \frac{\bar{X} - \mu}{s/\sqrt{n}} \sim t_{n-1}$ if $\sigma$ is estimated by s.
2 samples/groups, 1 variable, continuous scale
1 sample/group, 2 variables, both continuous
(i) Are both variables random (e.g. measurements or responses) or were the values of one variable chosen by the investigator?
(ii) Association or prediction?
1 sample, 1 variable, categorical
2 samples/groups, 1 variable, categorical
1 sample/group, 2 variables, categorical
Data are frequencies in a contingency table; use the $\chi^2$ test.