
Examination Date: 2007-04-26

Statistics for Business and Economics


Module 3: Statistical survey methodology

Name:

..........................................

Personal code number: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Sal (hall): 5
Time of examination: 9.00-13.00, OP
Aid: Pocket calculator. Formulae are handed out.

Write your answers on loose paper, except for the multiple choice questions, where you mark your answer directly. Evaluation of the exercises is done by the teacher. For the grade pass, 50% of the maximum mark is required. To pass with distinction, 75% is required. Note that an omitted or incomplete explanation leads to a reduction of marks.

Exercise   1   2   3   4   5   6   Sum
Points    10   9  10   6   5  10    50

Break a leg! /J
Note: Examination with plausible solutions
Author: Joakim Malmdin, 2007-04-30

Exercise 1, 10p
These statements are either true or false.

1) Stratification may produce a smaller bound on the error of estimation, B. This is especially true when the strata are heterogeneous. True False

2) A drawback with systematic sampling is the risk of periodicity. True False

3) If our findings are statistically significant they are too unusual to often occur just by chance. True False

4) The margin of error includes only random sampling error. True False

5) A census is a sample survey that attempts to include the entire population in the sample. True False

6) The use of a control group in an experiment allows us to control the effects of lurking variables. True False

7) Clusters should be as heterogeneous as possible within, and one cluster should look very much like another in order for the economic advantages of cluster sampling to pay off. True False

8) The finite population correction (or correction factor) takes into account the fact that an estimate based on a sample n = 10 from a population of N = 20 000 items contains more information about the population than a sample of n = 10 from a population of N = 20. True False

9) Reliability has to do with the quality of measurement. In its everyday sense, reliability is the repeatability of your measures. True False

10) We usually favour stem-plots when we have a small number of observations, and histograms for larger amounts of data. True False

Exercise 2, 9p
For each of the following sample situations, identify the target population and the frame used. Comment on the coverage, and also identify the sampling technique used and any shortcomings. Finally, suggest another way of doing the investigation with the same target population.

1. A sociologist is interested in determining the extent to which tenth-graders in the USA are self-motivated. A sample of four high schools in Large City is taken and all tenth-graders in each school are interviewed. 3p

Answer [very short]:
Target population: Tenth-graders in the USA
Frame: High schools in Large City
Coverage: Undercoverage
Technique: Cluster sampling of high schools
Shortcomings: Bias due to nonsampling errors (undercoverage)
Suggestion: The risk of bias arises because only high schools from one city are selected. If the target population is tenth-graders in the USA, the frame has to include all tenth-graders. One way of doing this is to randomly select cities and then schools within each city.

2. The host of a local radio talk show in London wonders if people who are actively religious are happier than those who are not. He asks the listeners to call in, and the station receives calls from 48 listeners who voice their opinions. 3p

Answer [very short]:
Target population: London citizens (possibly others too, but unspecified)
Frame: Listeners to the show
Coverage: Undercoverage (possibly overcoverage)
Technique: Voluntary response
Shortcomings: Bias due to sampling errors (voluntary response) and nonsampling errors (undercoverage)
Suggestion: Basing an investigation on voluntary answers (self-selection) is not a good idea. A more serious attempt to reach a trustworthy result would be to randomly select London citizens, for example from the telephone book, i.e. simple random sampling.

3. Every tenth person passing outside the library at Umeå University between 2 pm and 4 pm on the first day of a term is asked whether he or she prefers written or oral exams this term. 3p

Answer [very short]:
Target population: Students at Umeå University in a specific term
Frame: People passing by the library between 2 pm and 4 pm on the first day of a specific term
Coverage: Over- and undercoverage
Technique: Systematic sampling
Shortcomings: Bias due to nonsampling errors (undercoverage)
Suggestion: The risk of bias in this survey is connected to the choice of time and place. We could instead use lists of all students accepted for studies at the University in the term in question and wait until registration is complete. Systematic or simple random sampling could be used.

Exercise 3, 10p

[Figure 1: Newspapers sold displayed in two graphs (a bar chart and a pie chart) for News for you, Daily words, World in words and Time for news; bar chart axis: newspapers sold, 0 to 50000]

a) In these two graphs (Figure 1) the sales numbers of the four biggest newspapers in News Island are displayed. Choose the most proper graph and give two reasons for your choice. 3p

Answer: The bar chart is the most proper graph. 1) The four newspapers do not sum up to a whole (the pie chart gives the impression that the whole population is represented). 2) It is easier to compare the bars in order to see the difference in the number of papers sold.

b) Can we use the graph in Figure 2? Explain. 2p

Answer: No. With nominal data (the four newspapers) it does not make sense to connect the categories with a line.

c) Explain why a pictogram would have been improper to use. 2p

Answer: A pictogram (of newspapers) would be misleading, since both the height and the width must be increased in order to avoid distortion, i.e. it is not only the height of the picture that gets larger, but also the width and thereby the area. [An alternative to this is to keep the same width regardless of height, but then the pictures are distorted.]

d) Describe the data set displayed in Figure 3. 3p

[Figure 2: Newspapers sold displayed with combining line (line graph of Count, 0 to 50000, across the four newspapers)]

Answer: This is a boxplot, which is a graph of the five-number summary. All observations (except one outlier) lie between approximately 18 and 50 (the minimum and the maximum), the range covered by the whiskers. The interquartile range stretches from approximately 30 (first quartile) to 45 (third quartile) and constitutes the box, which contains 50% of the observations. The measure of central location, the median, is also a measure of relative standing, and its value is approximately 37, meaning that half of the observations are larger than 37 and half are smaller. Since the difference between the first and second quartiles is approximately equal to the difference between the second and third quartiles, a good guess is that the distribution is approximately symmetric.

[Figure 3: Home run counts (boxplot titled "Barry Bonds 19 home run counts"; axis from 20 to 80)]
Exercise 4, 6p
A forester wants to estimate the total number of farm acres planted in trees for a state. Since the number of acres of trees varies considerably with the size of the farm, she decides to stratify on farm sizes. The 240 farms in the state are placed in one of four categories according to size. A stratified random sample of 40 farms, selected by using proportional allocation, yields the results shown in Table 1 on number of acres planted in trees.

Table 1: Acres of trees on farms

Stratum I, 0-200 acres (N1 = 86, n1 = 14):
97, 67, 42, 125, 25, 92, 105, 86, 27, 43, 45, 59, 53, 21

Stratum II, 201-400 acres (N2 = 72, n2 = 12):
125, 155, 67, 96, 256, 47, 310, 236, 220, 352, 142, 190

Stratum III, 401-600 acres (N3 = 52, n3 = 9):
142, 256, 310, 440, 495, 510, 320, 396, 196

Stratum IV, over 600 acres (N4 = 30, n4 = 5):
167, 655, 220, 540, 780

[Figure 4: Acres of trees (stratified random sample of 40 farms, shown for Stratum I to Stratum IV; y-axis: acres of trees, 0 to 800)]

a) Estimate the total number of acres of trees on farms in the state by using the information given in Table 1. 3p

Answer: Use the formulae in A.2.1 and A.5 to find the answer. First calculate the mean value of the random variable of interest (Y = number of farm acres planted in trees) for each stratum:

$$\bar{Y}_1 = 63.3571, \quad \bar{Y}_2 = 183, \quad \bar{Y}_3 = 340.5555, \quad \bar{Y}_4 = 472.4$$

Now we have that

$$\bar{Y}_{st} = \frac{1}{N}\sum_{i=1}^{L} N_i \bar{Y}_i = \frac{1}{240}\sum_{i=1}^{4} N_i \bar{Y}_i = \frac{1}{240}\left(86 \cdot 63.3571 + 72 \cdot 183 + 52 \cdot 340.5555 + 30 \cdot 472.4\right) = 210.43999$$

and the total is estimated to

$$\hat{\tau} = N\bar{Y}_{st} = 240 \cdot 210.43999 = 50\,505.6 \approx 50\,506$$

The total number of acres of trees on farms in the state, $\tau$, is estimated to 50 506.

b) Place a bound on the error of estimation. 3p

Answer: The margin of error is given by the formula in A.6,

$$B = 2\sqrt{\hat{V}(\hat{\theta})}$$

where $\hat{\theta}$ (the estimate of interest) here is $\hat{\tau}$ (i.e. $N\bar{Y}_{st}$), so we have to estimate the variance of $\bar{Y}_{st}$ in order to estimate the margin of error of the total. We use the formula in A.2.1 to estimate the variance of $\bar{Y}_{st}$:

$$\hat{V}(\bar{Y}_{st}) = \frac{1}{N^2}\sum_{i=1}^{L} N_i^2 \left(\frac{N_i - n_i}{N_i}\right)\frac{s_i^2}{n_i}$$

where the estimated variance in each stratum is

$$s_1^2 = 1071.786, \quad s_2^2 = 9054.182, \quad s_3^2 = 16794.28, \quad s_4^2 = 72376.3$$

The estimated variance of $\bar{Y}_{st}$ is then

$$\hat{V}(\bar{Y}_{st}) = \frac{1}{240^2}\left(474\,035.6366 + 3\,259\,505.52 + 4\,172\,445.564 + 10\,856\,445\right) = 325.7366618$$

The variance of the estimated total is now estimated according to the formula in A.5:

$$\hat{V}(N\bar{Y}_{st}) = N^2\,\hat{V}(\bar{Y}_{st}) = 240^2 \cdot 325.7366618 = 18\,762\,431.72$$

and, finally, we calculate the margin of error of $\hat{\tau}$ to

$$B = 2\sqrt{18\,762\,431.72} = 8663.1245 \approx 8663$$

The margin of error of $\hat{\tau}$ is 8663.
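The calculations in a) and b) can be checked with a short script. This is only a sketch using the data in Table 1 and the formulae in A.2.1, A.5 and A.6; the function name `stratified_total` and the dictionary layout are our own choices for illustration, not part of the examination.

```python
from math import sqrt
from statistics import mean, variance

# Table 1: stratum -> (N_i, sampled acres of trees)
strata = {
    "I": (86, [97, 67, 42, 125, 25, 92, 105, 86, 27, 43, 45, 59, 53, 21]),
    "II": (72, [125, 155, 67, 96, 256, 47, 310, 236, 220, 352, 142, 190]),
    "III": (52, [142, 256, 310, 440, 495, 510, 320, 396, 196]),
    "IV": (30, [167, 655, 220, 540, 780]),
}


def stratified_total(strata):
    """Return (estimated total, bound B) for a stratified random sample.

    Hypothetical helper: follows A.2.1 (mean and its variance),
    A.5 (total) and A.6 (margin of error).
    """
    N = sum(N_i for N_i, _ in strata.values())
    # A.2.1: stratum means weighted by stratum size
    mean_st = sum(N_i * mean(y) for N_i, y in strata.values()) / N
    # A.2.1: variance of the stratified mean, with the finite
    # population correction (N_i - n_i) / N_i in every stratum
    var_mean_st = sum(
        N_i**2 * ((N_i - len(y)) / N_i) * variance(y) / len(y)
        for N_i, y in strata.values()
    ) / N**2
    total = N * mean_st              # A.5
    var_total = N**2 * var_mean_st   # A.5
    return total, 2 * sqrt(var_total)  # A.6


total, bound = stratified_total(strata)
print(round(total, 1), round(bound, 1))
```

The printed values should agree with the hand calculation, up to rounding of the intermediate results (about 50 505.6 for the total and about 8663.1 for the bound).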


Exercise 5, 5p
Suppose we are interested in determining the average daily sales (income) for a chain of grocery stores. In Figure 5 we see the true sales figures for the last 12 days.

[Figure 5: Daily sales (x-axis: Days; y-axis: Income)]

a) Suppose we want to sample days in order to estimate the average daily sales. Comment upon the use of systematic sampling. 2p

Answer: There is periodicity, since there seem to be peak sales every second or every third day. The effectiveness of a 1-in-k sample depends on the value we choose for k. The risk is that we over- or underestimate the parameter of interest. [We could change the random starting point several times in order to reduce the possibility of choosing observations from the same relative position in a periodic population. Note that the corresponding terminology in time series analysis is seasonal variation, which refers to systematic patterns that occur over short repetitive calendar periods (with a duration of less than one year).]

b) Another task is to estimate the average number of customers per grocery store for the chain. The 300 stores are listed in 50 geographical clusters of 6 each, and a simple random sample of three clusters is selected (see Table 2). 3p

Table 2: Number of customers

Cluster 1: 34, 56, 78, 56, 100, 87
Cluster 2: 47, 212, 220, 34, 68, 90
Cluster 3: 98, 67, 88, 99, 29, 58

Answer: Use the formula in A.4.1 to estimate the population mean, $\mu$. The random variable of interest is defined as Y = number of customers per grocery store, with the sample mean

$$\bar{Y} = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} m_i}$$

where $m_1 = m_2 = m_3 = m$ (the number of elements in each cluster) and $y_i$ is the total of all observations in the ith cluster. Here the total sample size is equal to $nm$ elements. We get

$$\bar{Y} = \frac{\sum_{i=1}^{n} y_i}{nm} = \frac{411 + 671 + 439}{3(6)} = 84.5$$

The average number of customers per grocery store is estimated to 84.5.
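The cluster estimate can be verified in the same way. The sketch below uses the data in Table 2 and the formula in A.4.1. The function names are hypothetical, and the variance function goes one step further than the exercise asks, so treat that part as an illustration of A.4.1 rather than as part of the solution.

```python
# Table 2: customers per store in the three sampled clusters
clusters = [
    [34, 56, 78, 56, 100, 87],
    [47, 212, 220, 34, 68, 90],
    [98, 67, 88, 99, 29, 58],
]


def cluster_mean(clusters):
    """A.4.1: sum of the cluster totals divided by the sum of the cluster sizes."""
    totals = [sum(c) for c in clusters]
    sizes = [len(c) for c in clusters]
    return sum(totals) / sum(sizes)


def cluster_mean_variance(clusters, n_population_clusters, avg_cluster_size):
    """A.4.1: estimated variance of the cluster mean (not asked for in the exam)."""
    N = n_population_clusters
    n = len(clusters)
    y_bar = cluster_mean(clusters)
    # squared deviations of the cluster totals from y_bar * m_i
    ss = sum((sum(c) - y_bar * len(c)) ** 2 for c in clusters) / (n - 1)
    return (N - n) / (N * n * avg_cluster_size**2) * ss


estimate = cluster_mean(clusters)
# 50 clusters of 6 stores each in the population, as stated in the exercise
var_estimate = cluster_mean_variance(clusters, 50, 6)
print(estimate, round(var_estimate, 3))
```

Because all clusters have the same size here, the estimate is simply the mean of the 18 sampled stores.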


Exercise 6, 10p
Fill in the right answer.

a) The drawback of a web-survey with voluntary answers is that it... 2p
   may be biased.
   costs too much.
   is a very simple random sample.

b) What do we call the distribution in Figure 6? 2p
   unimodal.
   bimodal.
   multimodal.

[Figure 6: Distribution of X (histogram; x-axis: X, 0 to 100; y-axis: Frequency)]

c) When performing a significance test we would like to know about the sample size. Why? 2p
   The p-value depends on the sample size.
   The sample size depends on the p-value.
   The true value of the parameter depends on the sample size.

d) We can describe the overall pattern of a histogram or stem-plot by giving its shape, centre, and... 2p
   spread.
   height.
   stem.

e) When the respondent has not responded to any of the questions we call this... 2p
   random sampling error.
   undercoverage.
   nonresponse error.


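A Formulae for estimation

A.1 Srs (simple random sampling)

A.1.1 The mean and the variance of the mean

$$\hat{\mu} = \frac{\sum_{i=1}^{n} y_i}{n}$$

$$\hat{V}(\hat{\mu}) = \frac{s^2}{n}\left(\frac{N-n}{N}\right)$$

where N is the size of the population, n the size of the sample, and

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{\mu})^2}{n-1}$$

A.1.2 The proportion and the variance of the proportion

$$\hat{p} = \frac{\sum_{i=1}^{n} y_i}{n}$$

$$\hat{V}(\hat{p}) = \frac{\hat{p}\,\hat{q}}{n-1}\left(\frac{N-n}{N}\right)$$

where $\hat{q} = 1 - \hat{p}$.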
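To see what the factor (N - n)/N does in practice, the following sketch applies the formulae for the mean in A.1.1 to one and the same sample drawn from a small and from a large population. The sample values and the function name `srs_mean_and_variance` are made up for illustration only.

```python
from statistics import mean, variance


def srs_mean_and_variance(sample, population_size):
    """A.1.1: sample mean and its estimated variance under Srs."""
    n = len(sample)
    fpc = (population_size - n) / population_size  # finite population correction
    return mean(sample), variance(sample) / n * fpc


# Hypothetical sample of n = 10 observations
sample = [12, 15, 9, 14, 11, 13, 10, 16, 12, 8]

mean_small, var_small = srs_mean_and_variance(sample, 20)
mean_large, var_large = srs_mean_and_variance(sample, 20_000)
print(mean_small, var_small)  # N = 20: half of the population is sampled
print(mean_large, var_large)  # N = 20 000: the correction is close to 1
```

The point estimate is the same in both cases, but the estimated variance is about half as large when the ten observations make up half of the population.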

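A.2 Strs (stratified random sampling)

L = number of strata, $N_i$ = number of sampling units in stratum i, and $n_i$ the number of sampled units.

A.2.1 The mean and the variance of the mean

$$\hat{\mu}_{st} = \frac{1}{N}\sum_{i=1}^{L} N_i \hat{\mu}_i$$

$$\hat{V}(\hat{\mu}_{st}) = \frac{1}{N^2}\sum_{i=1}^{L} N_i^2 \left(\frac{N_i - n_i}{N_i}\right)\frac{s_i^2}{n_i}$$

where $\hat{\mu}_i$ and $s_i^2$ are the sample mean and the sample variance in stratum i.

A.2.2 The proportion and the variance of the proportion

$$\hat{p}_{st} = \frac{1}{N}\sum_{i=1}^{L} N_i \hat{p}_i$$

$$\hat{V}(\hat{p}_{st}) = \frac{1}{N^2}\sum_{i=1}^{L} N_i^2\,\hat{V}(\hat{p}_i)$$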

A.2.3 Neyman allocation

$$n_i = n\,\frac{N_i \sigma_i}{\sum_{k=1}^{L} N_k \sigma_k}$$

assuming that the cost per observation is equal for all strata. $\sigma_i$ is the standard deviation in stratum i (often estimated with $s_i$).

A.2.4 Proportional allocation

$$n_i = n\,\frac{N_i}{\sum_{k=1}^{L} N_k} = n\,\frac{N_i}{N}$$

assuming that the cost per observation and the standard deviation are equal for all strata.
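The two allocation rules are easy to compare numerically. The sketch below implements A.2.3 and A.2.4 as plain functions (the names are our own) and applies proportional allocation to the stratum sizes from Exercise 4, where n = 40 farms were sampled from N = 240.

```python
def proportional_allocation(n, stratum_sizes):
    """A.2.4: n_i = n * N_i / N (unrounded)."""
    N = sum(stratum_sizes)
    return [n * N_i / N for N_i in stratum_sizes]


def neyman_allocation(n, stratum_sizes, stratum_sds):
    """A.2.3: n_i = n * N_i * sigma_i / sum_k(N_k * sigma_k) (unrounded)."""
    weights = [N_i * sd for N_i, sd in zip(stratum_sizes, stratum_sds)]
    total_weight = sum(weights)
    return [n * w / total_weight for w in weights]


sizes = [86, 72, 52, 30]  # N_1, ..., N_4 from Exercise 4
allocation = [round(x) for x in proportional_allocation(40, sizes)]
print(allocation)  # [14, 12, 9, 5], the sample sizes used in Table 1
```

Note that rounding each n_i separately does not guarantee that the rounded values sum to n in general; here they happen to do so. With equal standard deviations in all strata, Neyman allocation reduces to proportional allocation.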

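A.3 Sys (systematic sampling)

A.3.1 The mean and the variance of the mean

$$\hat{\mu}_{sy} = \frac{\sum_{i=1}^{n} y_i}{n}$$

Assuming a randomly ordered population we have

$$\hat{V}(\hat{\mu}_{sy}) = \frac{s^2}{n}\left(\frac{N-n}{N}\right)$$

A.3.2 The proportion and the variance of the proportion

$$\hat{p}_{sy} = \frac{\sum_{i=1}^{n} y_i}{n}$$

$$\hat{V}(\hat{p}_{sy}) = \frac{\hat{p}_{sy}\,\hat{q}_{sy}}{n-1}\left(\frac{N-n}{N}\right)$$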
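The assumption of a randomly ordered population matters. The sketch below draws 1-in-k systematic samples from a made-up series with a peak every third day; the data and the function name `systematic_sample` are purely hypothetical and only meant to illustrate the risk of periodicity.

```python
def systematic_sample(population, k, start):
    """Return every k-th element, beginning at index start (0 <= start < k)."""
    if not 0 <= start < k:
        raise ValueError("start must satisfy 0 <= start < k")
    return population[start::k]


# Hypothetical daily sales for 12 days with a peak every third day
sales = [150, 60, 70] * 4
true_mean = sum(sales) / len(sales)

# k = 3 coincides with the period: each start hits the same relative position
for start in range(3):
    sample = systematic_sample(sales, 3, start)
    print(start, sum(sample) / len(sample))

# k = 4 does not coincide with the period
sample_k4 = systematic_sample(sales, 4, 0)
mean_k4 = sum(sample_k4) / len(sample_k4)
print(true_mean, mean_k4)
```

With k = 3 the sample mean is 150, 60 or 70 depending on the starting point, none of which is close to the population mean, whereas k = 4 picks one day from each position in the cycle. In a real survey the starting point would be chosen at random, for instance with `random.randrange(k)`.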

A.4 Clus (cluster sampling)

N = the number of clusters in the population, n = the number of clusters selected in a simple random sample, $m_i$ = the number of elements in cluster i, $\bar{m}$ = the average cluster size for the sample, M = the number of elements in the population, $\bar{M}$ = the average cluster size for the population ($M/N$), $y_i$ = the total of all observations in the ith cluster.

A.4.1 The mean and the variance of the mean

$$\hat{\mu} = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} m_i}$$

$$\hat{V}(\hat{\mu}) = \frac{N-n}{N n \bar{M}^2}\cdot\frac{\sum_{i=1}^{n} (y_i - \hat{\mu}\,m_i)^2}{n-1}$$

A.4.2 The proportion and the variance of the proportion

$$\hat{p} = \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} m_i}$$

where $a_i$ denotes the total number of elements in cluster i that possess the characteristic of interest.

$$\hat{V}(\hat{p}) = \frac{N-n}{N n \bar{M}^2}\cdot\frac{\sum_{i=1}^{n} (a_i - \hat{p}\,m_i)^2}{n-1}$$

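A.5 The total and the variance of the total

$$\hat{\tau} = N\hat{\mu}$$

$$\hat{V}(N\hat{\mu}) = N^2\,\hat{V}(\hat{\mu})$$

A.6 Sample size and the margin of error

Sample size required to estimate $\mu$ with margin of error B when using Srs or Sys:

$$n = \frac{N\sigma^2}{(N-1)D + \sigma^2} \quad \text{where} \quad D = \frac{B^2}{4}$$

The margin of error is

$$B = 2\sqrt{\hat{V}(\hat{\theta})}$$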
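As an illustration of the sample size formula, the sketch below computes n for a made-up population. The values of N, the variance and B are hypothetical, and so is the function name; the result is rounded up since the sample size must be a whole number.

```python
from math import ceil


def sample_size_for_mean(N, sigma2, B):
    """A.6: sample size needed to estimate the mean with margin of error B.

    sigma2 is the (assumed or previously estimated) population variance.
    """
    D = B**2 / 4
    return ceil(N * sigma2 / ((N - 1) * D + sigma2))


# Hypothetical values: N = 1000 units, variance 100
n_wide = sample_size_for_mean(1000, 100, B=2)
n_narrow = sample_size_for_mean(1000, 100, B=1)
print(n_wide, n_narrow)
```

Halving the margin of error from 2 to 1 roughly triples the required sample size in this example, rather than quadrupling it, because the population is finite.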
