STA2, iST2 FS 2017 Chapter 13: Samples and Surveys

Chapter_13_14.tns
Car of the Year
In any particular year millions of new cars are bought.
[Diagram: "All the people who bought a new car in 2011" alongside "Results from the people who completed a survey"]
Question: How trustworthy is this result?

Populations and Samples
- The Population is the entire collection of interest.
- A Sample is a subset of the population.
[Diagram: "Population of all the people who bought a new car" alongside "Sample of people who bought a car and completed the survey"]
Question: How representative is the sample of the population?

Representative and Bias
- A sample is Representative if it reflects the same mix as in the entire population.
- Samples that distort the mix of the population are said to have a Bias.
- A Literary Bias: In 1936 The Literary Digest magazine sent out 10 million questionnaires to readers and potential readers: 2 million were returned. They predicted that Landon would defeat Roosevelt.
- BUT Roosevelt defeated Landon in a landslide!
- Soon after, the magazine went out of existence.

Gallup Poll
Question: What went wrong?
- 1936 was the middle of the Great Depression.
- Not everyone was able to afford a telephone and a magazine subscription.
- Therefore the sample was biased towards wealthy people.
- Wealthy people tended to vote Republican.
Question: What went right?
- George Gallup correctly predicted the election even with a much smaller sample.
- He also correctly predicted the results of the Digest's poll!
- Gallup is still going strong today.

An Experiment in Randomness
- Representative samples are difficult to construct.
- The best method is to use a Random selection method.
- Choose randomly six numbers from the given 42, in any order, without replacement. Then choose randomly one lucky number out of 6.

Humans versus Machines!
Question: Did you choose at least two consecutive numbers?
The probability that at least two numbers are consecutive is 56%.

randSamp(seq(i,i,1,42),6,1)

Conclusion: Humans are not good at being random!

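The 56% figure can be checked by counting. The sketch below is in Python rather than the TI-Nspire commands used on the slides, and the function names are illustrative assumptions. It counts the draws of six numbers from 42 with no two adjacent, which is C(42 - 6 + 1, 6) = C(37, 6), and then confirms the result by simulation.

```python
import random
from math import comb

def prob_consecutive_exact(total=42, picks=6):
    """Exact probability that a random draw contains at least two consecutive numbers."""
    # Ways to pick `picks` numbers from 1..total with no two adjacent.
    no_adjacent = comb(total - picks + 1, picks)
    return 1 - no_adjacent / comb(total, picks)

def has_consecutive(numbers):
    """True if any two of the numbers differ by exactly 1."""
    ordered = sorted(numbers)
    return any(b - a == 1 for a, b in zip(ordered, ordered[1:]))

def prob_consecutive_simulated(trials=100_000, total=42, picks=6, seed=1):
    """Estimate the same probability from random draws without replacement."""
    rng = random.Random(seed)
    hits = sum(
        has_consecutive(rng.sample(range(1, total + 1), picks))
        for _ in range(trials)
    )
    return hits / trials

exact = prob_consecutive_exact()
simulated = prob_consecutive_simulated()
print(f"exact: {exact:.4f}, simulated: {simulated:.4f}")
```

The exact value is about 0.557, in line with the 56% quoted above.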
Randomization
[Diagram: Population, Randomization, Sample]
Question: Do you trust that this sample is representative of the population?
Answer: It depends on the randomization.

Inferential Statistics
Given [image: sample], what is inside [image: population]?
Inferential Statistics is the use of sample statistics to make deductions about population parameters.

Sample Sizes
- Let N be the population size: either unknown or else too large for each member to be reachable.
- Let n be the sample size; we assume that it is much smaller than N.
- A Surprising Property: Larger populations do not require larger samples!

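One way to see why larger populations do not require larger samples is the standard error of a sample mean under sampling without replacement, which includes the finite population correction sqrt((N - n) / (N - 1)). The Python sketch below uses illustrative values for sigma and n that are assumptions, not figures from the slides.

```python
from math import sqrt

def standard_error(sigma, n, N):
    """Standard error of the mean for an SRS of size n drawn without replacement from N."""
    fpc = sqrt((N - n) / (N - 1))  # finite population correction
    return sigma / sqrt(n) * fpc

sigma, n = 4.0, 1000  # illustrative values only
for N in (10_000, 1_000_000, 100_000_000):
    print(f"N = {N:>11,}: standard error = {standard_error(sigma, n, N):.4f}")
```

Once N is much larger than n the correction is close to 1, so the precision depends almost entirely on n.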
Simple Random Sample
- A Simple Random Sample (SRS) is obtained by a procedure that makes every sample of size n from the population equally likely.
- A Sampling Frame is a list of items from which to draw the sample. Ideally this list would be the population.
- Given a sampling frame, we can create an SRS using random numbers generated from a computer. We demonstrate this with Volunteers.

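As a rough illustration of drawing an SRS from a sampling frame, here is a minimal Python sketch. The frame of 30 volunteer labels and the helper name `simple_random_sample` are hypothetical; the slides' own demonstration uses the Volunteers file on the calculator.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n items without replacement; every subset of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

# Hypothetical sampling frame: a list of 30 volunteer labels.
frame = [f"volunteer_{i:02d}" for i in range(1, 31)]
srs = simple_random_sample(frame, 5, seed=2017)
print(srs)
```

Fixing the seed makes the draw reproducible, which is convenient in class but should be dropped for a real survey.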
Identifying the Sampling Frame
Obtaining a suitable sampling frame is difficult. Consider an election poll:
What we HAVE: [image]
What we WANT: A list of all the people who WILL vote
The people who do vote tend not to form an SRS of the voters' list.

Hypothetical Populations
Some populations do not exist! Consider a farming experiment:
[Diagram: (image) + (image) ⇒ 10% BIGGER]
Question: If 300 such oranges exist, is this the population?
No! The population is the collection of all potential oranges to come.

Estimating Parameters
Population parameters: $\mu$, $\sigma^2$, $p$
Sample statistics: $\bar{x}$, $s^2$, $\hat{p}$
- Given $\bar{x}$, what can we say about $\mu$?
- Given $\hat{p}$, what can we say about $p$?
- In General: Given a sample statistic, what can we say about the population parameter?

Sampling Variation
- We want to know about some population parameter.
- We take a sample and measure the appropriate statistic.
- We select SRSs of size 5 from Similar Diamonds:

sample1:=randSamp(price,5,1)
sample2:=randSamp(price,5,1)
mean(sample1)
mean(sample2)

- Different samples lead to different values of this statistic.

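The same idea can be sketched in Python. The `price` list below is randomly generated stand-in data, since the Similar Diamonds file is not reproduced here; only the pattern of drawing two samples of size 5 and comparing their means mirrors the slide.

```python
import random
from statistics import mean

rng = random.Random(13)

# Hypothetical prices standing in for the Similar Diamonds data.
price = [round(rng.uniform(500, 5000), 2) for _ in range(200)]

# Two simple random samples of size 5, each drawn without replacement.
sample1 = rng.sample(price, 5)
sample2 = rng.sample(price, 5)

print("mean(sample1) =", round(mean(sample1), 2))
print("mean(sample2) =", round(mean(sample2), 2))
```

Both means estimate the same population mean, yet they will generally differ from each other and from `mean(price)`.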
Sampling Distribution
- Hence the statistic can be viewed as a random variable.
- The Sampling Distribution is the probability distribution of this random variable.
- Understanding sampling distributions for particular statistics will help us make inferences about the appropriate population parameters.

We Know Nothing!
BUY: Data is stored: how much, what etc.
NO BUY: We know nothing!
How many did not buy?
Why did they not buy?
- Nothing suitable
- Wrong size/colour
- Prices too high
In such circumstances a shop can organise an exit survey.

Exit Surveys
- Every survey needs a clear objective.
- Identify the population and the parameters of interest.
  Population: Customers who do not buy
  Parameters: Proportions for each of the reasons
- The sampling frame doesn't exist, yet someone will need to interview a random selection of shoppers who do not buy.
- To be reliable, the nonresponses also need to be recorded!

Alternative Sampling Methods
- Simple Random Sample: Selects individuals randomly.
- Stratified Random Sample: Selects individuals randomly within subsets of similar items (strata).
- Cluster Sampling: Divides the population into groups (clusters), often geographic, and randomly selects whole clusters.
- Census: Selects all individuals of the population.
- Voluntary Response: Selects individuals who volunteer to participate.
- Convenience Samples: Selects individuals who are readily available.

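To make the difference from a simple random sample concrete, here is a hedged Python sketch of a stratified random sample. The shopper records, the two strata, and the helper `stratified_sample` are all hypothetical; it draws a separate SRS of fixed size inside each stratum.

```python
import random
from collections import defaultdict

def stratified_sample(items, stratum_of, per_stratum, seed=None):
    """Group items into strata, then draw an SRS of `per_stratum` items from each."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)
    return {name: rng.sample(members, per_stratum) for name, members in strata.items()}

# Hypothetical shoppers labelled by how they shop (the stratum).
shoppers = [(i, "online" if i % 3 == 0 else "in_store") for i in range(60)]
chosen = stratified_sample(shoppers, stratum_of=lambda s: s[1], per_stratum=4, seed=5)

for name, members in chosen.items():
    print(name, members)
```

Every stratum is guaranteed to be represented, which a single SRS of the same total size does not guarantee.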
Checklist for Surveys
- Does the sampling frame match the population?
- What is the rate of nonresponse?
- How was the question worded?
- Did the interviewer affect the results?
- Does survivor bias affect the survey?

Sampling Distribution of the Mean
GPS Chips
Random sampling to test the process

HALT Testing
To test chips, Highly Accelerated Life Tests (HALT) are set up.
A HALT has 15 stages. When a chip fails, the stage at which it failed is noted down. If a chip survives all stages it gets a score of 16.

HALT Scores
- HALT testing monitors the manufacturing process.
- If every chip were exactly the same, it would be easy to see when the process is malfunctioning.
BUT there is variation among the HALT scores for the chips.
- Instead of testing single chips, random samples are tested.
- If the sample fails the test, then we want to conclude that there is something wrong with the process.

HALT Data, $X$
- When the process is known to be operating correctly, we have mean $\mu = 7$ and standard deviation $\sigma = 4$.
- Data is collected over 21 days and stored in HALT.
- Let $X$ be the random variable representing the HALT score for a chip.
- Is it enough to know that some chips are performing badly to conclude that the process is malfunctioning?

Sample Mean Data, $\bar{X}$
- The data shows that each day a sample of size 20 was HALT tested.
- Let $\bar{X}$ be the random variable representing the mean HALT score for these daily samples.

xbar:=seq(mean(iffn(day=i,halt,_)),i,1,21)

- It's expected that a single chip might score low, but it is unlikely that a whole sample scores badly.

The Benefits of Averaging
1. mean(halt)        6.94
   mean(xbar)        6.94
   The means of the two distributions are the same.
2. stdevsamp(halt)   4.24
   stdevsamp(xbar)   1.19
   The standard deviation of $\bar{X}$ is smaller than the standard deviation of $X$.
3. Although the shape for $X$ was non-distinct, the shape of $\bar{X}$ is bell-shaped.

Normality
- If $X$ is Normal, then $\bar{X}$ is normal.
- If $X$ is Non-normal, then by the Central Limit Theorem $\bar{X}$ is asymptotically normal, for samples of a large enough size.

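The Central Limit Theorem can be illustrated with a small simulation. The Python sketch below is an assumption-laden illustration rather than part of the course material: it takes an exponential population (strongly right-skewed) and compares the skewness of single observations with the skewness of means of samples of size 30.

```python
import random
from statistics import mean, pstdev

def skewness(values):
    """Population skewness: the average cubed z-score."""
    m, s = mean(values), pstdev(values)
    return mean(((v - m) / s) ** 3 for v in values)

rng = random.Random(28)

def sample_means(n, reps=5000):
    """Means of `reps` independent samples of size n from an exponential population."""
    return [mean(rng.expovariate(1.0) for _ in range(n)) for _ in range(reps)]

skew_single = skewness(sample_means(1))   # the raw, skewed population
skew_mean30 = skewness(sample_means(30))  # means of samples of size 30

print(f"skewness, n = 1:  {skew_single:.2f}")
print(f"skewness, n = 30: {skew_mean30:.2f}")
```

The skewness of the sample means is much closer to 0, the value for a normal distribution, than the skewness of the individual observations.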
Sample Sizes
We will keep this simple:
- For symmetric distributions a sample size of 20-25 is sufficient.
- For skewed distributions sample sizes need to be somewhat larger.

Normal Models
Let $X$ be a random variable with mean $\mu$ and standard deviation $\sigma$. A sample of size $n$ can be considered thus:

$$\bar{X} = \frac{1}{n}X_1 + \cdots + \frac{1}{n}X_n$$

If $n$ is large enough, then $\bar{X}$ will be normal such that:
- $E(\bar{X}) = \frac{1}{n}E(X_1) + \cdots + \frac{1}{n}E(X_n) = \frac{n \cdot \mu}{n} = \mu$
- $\mathrm{Var}(\bar{X}) = \frac{1}{n^2}\mathrm{Var}(X_1) + \cdots + \frac{1}{n^2}\mathrm{Var}(X_n) = \frac{n \cdot \sigma^2}{n^2} = \frac{\sigma^2}{n}$ (using the independence of $X_1, \dots, X_n$)
- $\mathrm{SD}(\bar{X}) = \frac{\sigma}{\sqrt{n}}$

Standard Error of the Mean
$\mathrm{SD}(\bar{X})$ measures the variation of the value of the sample mean from sample to sample. It is often called the Standard Error of the Mean and is denoted $\sigma_{\bar{X}}$.
Hence $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$.
1. If $\sigma$ can be decreased, then $\sigma_{\bar{X}}$ decreases.
2. If $n$ is increased, then $\sigma_{\bar{X}}$ decreases.

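As a quick numerical check of the formula, the Python sketch below evaluates the standard error for the HALT value of sigma = 4 at a few sample sizes. The sample sizes 5 and 80 are added for comparison and are assumptions; only n = 20 appears in the slides.

```python
from math import sqrt

def standard_error_of_mean(sigma, n):
    """Standard error of the sample mean: sigma divided by the square root of n."""
    return sigma / sqrt(n)

sigma = 4.0  # HALT standard deviation when the process runs correctly
for n in (5, 20, 80):
    print(f"n = {n:>2}: standard error = {standard_error_of_mean(sigma, n):.3f}")
```

Because of the square root, quadrupling the sample size only halves the standard error.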
If $X \sim (\mu, \sigma^2)$, then $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$ for large enough $n$.
In file Sampling we have some raw data.

mean(data)                                   6.99
stDevPop(data)                               4.04
samp1:=randSamp(data,20)                     {13.2, 4.2, 1.9...
mean(samp1)                                  6.96
stDevSamp(samp1)                             5.08

xbar:=seq(mean(randSamp(data,20)),i,1,500)   {7.3, 8.3,...
mean(xbar)                                   6.79
stDevSamp(xbar)                              0.881
stDevPop(data)/√20                           0.883

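For readers without the calculator file, the session above can be imitated in Python. This is only a sketch: `data` is randomly generated stand-in data (integer scores from 1 to 16), not the contents of the Sampling file, and samples are drawn with replacement here; switch to `rng.sample` for draws without replacement.

```python
import random
from math import sqrt
from statistics import mean, pstdev, stdev

rng = random.Random(32)

# Hypothetical raw data standing in for the Sampling file.
data = [rng.randint(1, 16) for _ in range(2000)]

# One sample of size 20 and its summary statistics.
samp1 = rng.choices(data, k=20)
print("mean(samp1)  =", round(mean(samp1), 2))
print("stdev(samp1) =", round(stdev(samp1), 2))

# 500 sample means, each from a fresh sample of size 20.
xbar = [mean(rng.choices(data, k=20)) for _ in range(500)]

# The standard error predicted by sigma / sqrt(n).
predicted = pstdev(data) / sqrt(20)

print("mean(data)   =", round(mean(data), 2))
print("mean(xbar)   =", round(mean(xbar), 2))
print("stdev(xbar)  =", round(stdev(xbar), 3))
print("predicted SE =", round(predicted, 3))
```

The mean of `xbar` should sit close to `mean(data)`, and its standard deviation close to the predicted standard error.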
