
Introduction to Descriptive Statistics

2014
Great takes on Statistics

"There are three ways to not tell the truth: lies, damned lies, and statistics." …

"Statistics show that of those who contract the habit of eating, very few survive." …

"The statistics on sanity are that one out of every four Americans is suffering from some form of mental illness. Think of your three best friends. If they're okay, then it's you." …

"Statistics are used much like a drunk uses a lamppost: for support, not illumination." …

"I can prove anything by statistics except the truth." …

Common terms in Statistics

Variable
Population
Data
Sample
Parameter
Statistics
Descriptive Statistics
Inference?

Central dogma of Statistics

[Diagram: Population and Sample, connected by Probability, Descriptive Statistics and Inferential Statistics]
Concept of Frequency and Basic Exploratory Analysis

Frequency distribution: a mathematical function showing the number of instances in which a variable takes each of its possible values.

[Figure: Histogram]

Cross Tab: Cross tabulation (or crosstabs for short) is a statistical process that summarizes categorical data to create a contingency table.

[Figure: Conversion of Continuous Variable to Discrete/ Categorical Variable]
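The three ideas on this slide (frequency counts, a cross tabulation, and binning a continuous variable into categories) can be sketched in a few lines of Python. The survey records, commuting times and bin boundaries below are hypothetical illustrations, not data from the slides.

```python
from collections import Counter

# Hypothetical survey records (not from the slides): (gender, preferred flavor)
records = [
    ("F", "vanilla"), ("M", "chocolate"), ("F", "chocolate"),
    ("M", "vanilla"), ("F", "vanilla"), ("M", "chocolate"),
]

# Frequency distribution: how many times each flavor occurs
flavor_freq = Counter(flavor for _, flavor in records)

# Cross tabulation: how many times each (gender, flavor) pair occurs
crosstab = Counter(records)


def bin_label(minutes):
    """Convert a continuous commuting time into a category (assumed cut-offs)."""
    if minutes < 15:
        return "short"
    if minutes < 45:
        return "medium"
    return "long"


# Hypothetical continuous variable, converted to a categorical one
commute_minutes = [5.5, 12.0, 30.2, 44.9, 60.0]
commute_bins = Counter(bin_label(m) for m in commute_minutes)

print(flavor_freq)
print(crosstab)
print(commute_bins)
```

A histogram is the same frequency count drawn as bars, one bar per bin.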
Data and Variables

Types: Categorical

Random: Discrete, Continuous

NOIR: Nominal, Ordinal, Interval, Ratio

Random Vs. Traditional Variables


Exercise!

Percentage scores on a Math exam


Letter grades on an English essay

Flavors of yogurt
Instructors classified as : Easy, Difficult or Impossible
Employee evaluations classified as : Excellent, Average, Poor
Religions
Political parties
Commuting times to school
Years (AD) of important historical events
Ages (in years) of statistics students
Ice cream flavor preference
Amount of money in savings accounts
Students classified by their reading ability : Above average, Below average, Normal
Descriptive Statistics

• Descriptive statistics are a collection of measurements of two things: location and variability.

• Location tells you the central value of your variable

• Variability refers to the spread of the data from the center value

• Statistics is basically the study of what causes variability in the data.

Location Variability
Mean Variance
Mode Standard deviation
Median Range
Central Tendency

Mean, Median & Mode

A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data.

Mean is the simple average of a data set

Median is the middle of a ranked distribution

Mode is the most common data value
Mean

Calls per day

Associate 1 120
Associate 2 125
Associate 3 130
Associate 4 132
Associate 5 135
Associate 6 135
Associate 7 137
Associate 8 148
Associate 9 150
Associate 10 150
Associate 11 150
Associate 12 155
Associate 13 155
Associate 14 155
Associate 15 157
Associate 16 157
Associate 17 158
Associate 18 160
Associate 19 160
Associate 20 165

Mean
= sum(x)/n
= (120+125+…+165)/20
= 2934/20 = 146.7, or about 147 calls a day
Median

Calls per day

Associate 1 120
Associate 2 125
Associate 3 130
Associate 4 132
Associate 5 135
Associate 6 135
Associate 7 137
Associate 8 148
Associate 9 150
Associate 10 150
Associate 11 150
Associate 12 155
Associate 13 155
Associate 14 155
Associate 15 157
Associate 16 157
Associate 17 158
Associate 18 160
Associate 19 160
Associate 20 165

Median
= middle of dist. when ranked
= (10th + 11th value) /2
= (150 + 150) /2
= 150 calls a day
Mode

Calls per day


Associate 1 120
Associate 2 125
Associate 3 130
Associate 4 132
Associate 5 135
Associate 6 135
Associate 7 137
Associate 8 148
Associate 9 150
Associate 10 150
Associate 11 150
Associate 12 155
Associate 13 155
Associate 14 155
Associate 15 157
Associate 16 157
Associate 17 158
Associate 18 160
Associate 19 160
Associate 20 165

Mode
= the most commonly recurring data point
= 150 and 155 calls a day (each occurs three times, so this data set is bimodal)
Percentile/ Quartile

Calls per day

A percentile is the value of a variable below which a certain percent of observations fall. For example, the 20th percentile is the value (or score) below which ~20 percent of the observations may be found.

The quartiles of a set of values are the three points that divide the data set into four equal groups, each representing a fourth of the population being sampled.

Associate 1 120
Associate 2 125
Associate 3 130
Associate 4 132
Associate 5 135
Associate 6 135
Associate 7 137
Associate 8 148
Associate 9 150
Associate 10 150
Associate 11 150
Associate 12 155
Associate 13 155
Associate 14 155
Associate 15 157
Associate 16 157
Associate 17 158
Associate 18 160
Associate 19 160
Associate 20 165

25th Percentile, or the 1st quartile: 135 (at Associates 5 and 6)
75th Percentile, or the 3rd quartile: 157 (at Associates 15 and 16)
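As a quick check of the mean, median, mode and quartile slides, the calls-per-day figures can be run through Python's standard `statistics` module. This is only a sketch: quartile conventions differ between tools, and the default "exclusive" method of `statistics.quantiles` is an assumption here, not something the slides specify.

```python
import statistics

# Calls per day for Associates 1 to 20, as listed on the slides
calls = [120, 125, 130, 132, 135, 135, 137, 148, 150, 150,
         150, 155, 155, 155, 157, 157, 158, 160, 160, 165]

mean_calls = statistics.mean(calls)      # sum of values divided by n
median_calls = statistics.median(calls)  # average of the 10th and 11th ranked values
modes = statistics.multimode(calls)      # every value tied for most frequent

# Three cut points that split the ranked data into four groups
q1, q2, q3 = statistics.quantiles(calls, n=4)

print(mean_calls, median_calls, modes, q1, q2, q3)
```

Note that `multimode` returns every value tied for the highest count, which matters for this data set because two values share the top frequency.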
Weighted Average/ Rolling Average
Measures of Central Tendency by Level of Measurement

Level of Measurement    Measures of Central Tendency
nominal                 Mode
ordinal                 Median, Mode
Interval/ ratio         Mean, Median, Mode
Measures of Dispersion
Consider Teams 1 and 2 below

Team 1
Member            1   2   3   4   5
Height (inches)   72  73  76  76  78

Team 2
Member            1   2   3   4   5
Height (inches)   67  72  76  76  84

• What is the same for both teams?

• What is different?
Another way to look at Dispersion

• Both distributions have same mean but what is the difference?


Measures of Dispersion

Definition:

• Measures of dispersion describe the data spread or how far the measurements are
from the center
• The more similar the scores are to each other, the lower the measure of
dispersion will be
• The less similar the scores are to each other, the higher the measure of
dispersion will be
• In general, the more spread out a distribution is, the larger the measure of
dispersion will be
Distributions which are highly dispersed
• Consider the following distributions

• Bollywood movie collections
• Stock prices in an index
• Test scores
• Hitting bulls-eye on a dart-board
• Income
Measures of Dispersion

• Range
• Interquartile deviation
• Mean absolute deviation (MAD)
• Variance/ Standard deviation
• Gini Coefficient
Range

• Definition: simplest measure of variation, denoting the difference between the highest and lowest number in a dataset

Range = Maximum - Minimum

• Disadvantages
• Ignores the way data is distributed
• Uses only two data points
• Highly sensitive to outliers, since it depends entirely on the two extreme values

Team 1
Member            1   2   3   4   5
Height (inches)   72  73  76  76  78

Team 2
Member            1   2   3   4   5
Height (inches)   67  72  76  76  84

What is the range for the datasets?
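A short sketch of the range calculation for the two teams, using the heights from the table. The helper name `value_range` is just an illustrative choice.

```python
# Heights in inches for the two teams shown on the slide
team_1 = [72, 73, 76, 76, 78]
team_2 = [67, 72, 76, 76, 84]


def value_range(data):
    """Range = Maximum - Minimum."""
    return max(data) - min(data)


def mean(data):
    """Simple average, used to show the teams share the same center."""
    return sum(data) / len(data)


print(mean(team_1), value_range(team_1))
print(mean(team_2), value_range(team_2))
```

Both teams have the same mean height, yet their ranges differ, which is the point of the comparison.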


Interquartile Deviation

Definition: The inter-quartile range is a measure that indicates the extent to which the central 50% of
values within the dataset are dispersed. It is based upon, and related to, the median.

Interquartile range: Q3 – Q1

[Figure: ranked data split by Q1, Q2 and Q3 into four groups, each holding 25% of the values]

Advantages
• Provides a clearer picture of the overall dataset by removing/ignoring the outlying values.

Disadvantages
• Like the range however, the inter-quartile range is a measure of dispersion that is based upon only two values from the dataset
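The sketch below computes the interquartile range for the calls-per-day data from the earlier slides and shows that an extreme value changes the range but not the IQR. The outlier of 500 is a hypothetical value added for illustration, and the quartile method is the default of `statistics.quantiles`, which is an assumption rather than the slides' own convention.

```python
import statistics

calls = [120, 125, 130, 132, 135, 135, 137, 148, 150, 150,
         150, 155, 155, 155, 157, 157, 158, 160, 160, 165]


def iqr(data):
    """Interquartile range: Q3 - Q1."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    return q3 - q1


def value_range(data):
    """Range = Maximum - Minimum."""
    return max(data) - min(data)


# Hypothetical outlier: replace the largest value with 500
calls_with_outlier = calls[:-1] + [500]

print(iqr(calls), value_range(calls))
print(iqr(calls_with_outlier), value_range(calls_with_outlier))
```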
Mean Absolute Deviation

Definition: the mean of the absolute values of the deviations from the mean. This measure is called the mean absolute deviation and is denoted by MAD. In the formula that follows, the notation mean(x) denotes the mean of the x measurements.

MAD = sum( |x - mean(x)| ) / n

where n is the size of the sample.

Disadvantages
• Consider these two datasets, described by their deviations from the mean. Both give a MAD of 4, though the 2nd dataset is more spread out.

Dataset 1 deviations: +4, +4, -4, -4
Dataset 2 deviations: +7, +1, -6, -2

MAD (dataset 1) = ( |4| + |4| + |-4| + |-4| ) / 4 = (4+4+4+4) / 4 = 4
MAD (dataset 2) = ( |7| + |1| + |-6| + |-2| ) / 4 = (7+1+6+2) / 4 = 4
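To make the disadvantage concrete, here is a small sketch that computes the MAD for the two sets of deviations above and compares it with the population standard deviation. Treating the deviations themselves as the data (so the mean is 0) is an assumption made only to keep the example short.

```python
import statistics

# Deviations from the mean for the two datasets on the slide
dataset_1 = [4, 4, -4, -4]
dataset_2 = [7, 1, -6, -2]


def mean_absolute_deviation(data):
    """MAD = sum(|x - mean(x)|) / n."""
    center = sum(data) / len(data)
    return sum(abs(x - center) for x in data) / len(data)


mad_1 = mean_absolute_deviation(dataset_1)
mad_2 = mean_absolute_deviation(dataset_2)

# The standard deviation squares each deviation, so it separates the two sets
sd_1 = statistics.pstdev(dataset_1)
sd_2 = statistics.pstdev(dataset_2)

print(mad_1, mad_2, sd_1, sd_2)
```

The MAD is identical for both sets, while the standard deviation is larger for the second, more spread-out set.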
Variance/ Standard Deviation
Definition of variance
The variance indicates how close to or far from the mean are most of the cases for a particular variable.
The smaller the value of the variance, the more the cases are concentrated around the value of the
mean; the larger the value of the variance, the more spread out away from the mean are the cases.

Calculation

1. Calculate the difference of each data point from the mean
Deviation of x = x – mean
2. What is the problem when we take the sum of deviations? (The positive and negative deviations cancel, so the sum is always zero.)
3. Square the deviations and sum them
4. Divide by the no. of observations

Variance = 24/5 = 4.8
Variance/Standard Deviation

Why squared deviations?

• Absolute values do not have nice mathematical properties


• Squares eliminate the negatives

Result:
• Increasing contribution to the variance as you go farther from the mean

• Variance is expressed in squared units, so its size is hard to interpret on its own. But if you "standardize" that value by taking its square root, you can talk about any variance (i.e. deviation) in equivalent terms, in the original units

Advantages/Disadvantages

• The variance is not the simplest or easiest to understand measure of dispersion for interval/ratio
variables
• The reason why statisticians prefer it is because it provides an excellent basis for some very
important multivariate statistics
Variance/Standard Deviation

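Standard Deviation Calculation

1. Score (in the units that are meaningful)
2. Mean
3. Each score's deviation from the mean
4. Square that deviation
5. Sum all the squared deviations (Sum of Squares)
6. Divide by n-1 for a sample (or by n when the data are the whole population)
7. Square root – now the value is in the units we started with!!!

Standard deviation = sqrt(24/5) = 2.19 (this divides by n = 5, as in the variance calculation; dividing by n-1 = 4 gives sqrt(24/4) = 2.45)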
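The steps above can be sketched in Python. The slides do not show which data produce the sum of squares of 24; the Team 1 heights from the dispersion slides give exactly that value, so they are used here as an assumed input.

```python
import math
import statistics

# Assumed data: Team 1 heights, whose squared deviations sum to 24
heights = [72, 73, 76, 76, 78]

n = len(heights)
mean = sum(heights) / n                       # step 2
deviations = [x - mean for x in heights]      # step 3
squared = [d * d for d in deviations]         # step 4
sum_of_squares = sum(squared)                 # step 5

population_variance = sum_of_squares / n      # divide by n
sample_variance = sum_of_squares / (n - 1)    # divide by n - 1

population_sd = math.sqrt(population_variance)  # step 7, back in inches
sample_sd = math.sqrt(sample_variance)

print(sum(deviations), sum_of_squares)
print(population_variance, round(population_sd, 2))
print(sample_variance, round(sample_sd, 2))

# The standard library offers both conventions directly
assert math.isclose(population_sd, statistics.pstdev(heights))
assert math.isclose(sample_sd, statistics.stdev(heights))
```

The raw deviations sum to zero, which is why they are squared before averaging.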
Gini-coefficient

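Definition:

• Used as a measure of inequality, for example in studies of income and poverty

Calculation

• The Gini coefficient is calculated as the area A divided by the sum of areas A and B: Gini = A / (A + B)

[Figure: Lorenz curve. A is the area between the line of perfect equality and the Lorenz curve; B is the area under the Lorenz curve]

Disadvantage
• not additive across groups, i.e. the total Gini of a society is not equal to the sum of the Ginis for its sub-groups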
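A minimal sketch of the A / (A + B) calculation, assuming the usual Lorenz-curve construction (cumulative share of the total plotted against cumulative share of the population, with the area under it estimated by the trapezoid rule). The income lists in the usage lines are hypothetical.

```python
def gini(values):
    """Gini coefficient A / (A + B), computed from the Lorenz curve of `values`."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total <= 0:
        raise ValueError("need at least one value and a positive total")

    # Area B under the Lorenz curve, using one trapezoid per observation
    area_b = 0.0
    previous_share = 0.0
    running = 0.0
    for x in xs:
        running += x
        share = running / total
        area_b += (previous_share + share) / (2 * n)
        previous_share = share

    # A + B is the triangle under the line of perfect equality (area 1/2)
    area_a = 0.5 - area_b
    return area_a / 0.5


print(gini([5, 5, 5, 5]))    # everyone has the same income
print(gini([0, 0, 0, 10]))   # one person holds everything
```

With this discrete construction, the most unequal split among n people gives (n - 1)/n rather than exactly 1.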
Measures of Central Tendency and Dispersion by Level
of Measurement

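Level of Measurement: nominal
  Measures of Central Tendency: Mode
  Measures of Dispersion: Percent distribution

Level of Measurement: ordinal
  Measures of Central Tendency: Median, Mode
  Measures of Dispersion: Min and max, Range, Percentiles, Percent distribution

Level of Measurement: Interval/ ratio
  Measures of Central Tendency: Mean, Median, Mode
  Measures of Dispersion: Variance, Standard deviation, Min and max, Range, Percentiles, Percent distribution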
Scatter Plot

• A scatter diagram is a tool for analyzing relationships between two variables.
• One variable is plotted on the horizontal axis and the other is plotted on the vertical axis.
• The pattern of their intersecting points can graphically show relationship patterns.
• Most often a scatter diagram is used to check whether two variables appear to be correlated.
• While the diagram shows relationships, it does not by itself prove that one variable causes the other.
Concept of Correlation

A statistic that quantifies a relation between two variables

Falls between -1.00 and 1.00

The absolute value of the number (not the sign) indicates the strength of the relation; the sign indicates its direction
Types of relations
Positive Correlation

• Association between variables such that high scores on one variable tend to have high scores on the other variable
• A direct relation between the variables

Negative Correlation

• Association between variables such that high scores on one variable tend to have low scores on the other variable
• An inverse relation between the variables
Correlation

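Formula of Correlation Coefficient

$$r_{xy} = \frac{N\sum XY - \sum X \sum Y}{\sqrt{\left[N\sum X^2 - \left(\sum X\right)^2\right]\left[N\sum Y^2 - \left(\sum Y\right)^2\right]}}$$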


Smoking and Lung Capacity

Example: investigate relationship between cigarette smoking and lung capacity

Data: sample group response data on smoking habits, and measured lung capacities, respectively

N    Cigarettes (X)    Lung Capacity (Y)
1    0                 45
2    5                 42
3    10                33
4    15                31
5    20                29
Smoking and Lung Capacity

• Observe that as smoking exposure goes up, corresponding lung capacity goes down
• Variables covary inversely
• Covariance and Correlation quantify relationship

[Figure: scatter plot of Lung Capacity (Y) against Smoking (yrs), with r_xy = -0.96]

• r_xy = -0.96 indicates a very strong negative linear relationship in this sample: the more a person smokes, the lower the lung capacity tends to be
• Greater smoking exposure is associated with a greater likelihood of lung damage
Calculation for Correlation

Cigs (X)    X^2    XY      Y^2     Cap (Y)
0           0      0       2025    45
5           25     210     1764    42
10          100    330     1089    33
15          225    465     961     31
20          400    580     841     29

∑ = 50      750    1585    6680    180


Calculation for Correlation

$$r_{xy} = \frac{5(1585) - 50(180)}{\sqrt{\left[5(750) - 50^2\right]\left[5(6680) - 180^2\right]}}$$

$$r_{xy} = \frac{7925 - 9000}{\sqrt{(3750 - 2500)(33400 - 32400)}}$$

$$r_{xy} = \frac{-1075}{\sqrt{(1250)(1000)}}$$

$$r_{xy} = -0.9615$$
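The same calculation can be reproduced in Python from the five data pairs, building the column sums and then the coefficient. Variable names such as `sum_xy` are illustrative; the computational form of the formula is the one the worked numbers above follow.

```python
import math

# Smoking and lung capacity data from the slides
cigs = [0, 5, 10, 15, 20]      # X
cap = [45, 42, 33, 31, 29]     # Y

n = len(cigs)
sum_x = sum(cigs)
sum_y = sum(cap)
sum_x2 = sum(x * x for x in cigs)
sum_y2 = sum(y * y for y in cap)
sum_xy = sum(x * y for x, y in zip(cigs, cap))

# r = [N*sum(XY) - sum(X)*sum(Y)] / sqrt{[N*sum(X^2) - sum(X)^2][N*sum(Y^2) - sum(Y)^2]}
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r_xy = numerator / denominator

print(sum_x, sum_x2, sum_xy, sum_y2, sum_y)
print(numerator, round(r_xy, 4))
```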
Covariance

• Variables may change in relation to each other

• Covariance measures how much the movement in one variable predicts the movement in a corresponding variable
Covariance

• Variables that covary inversely, like smoking and lung capacity, tend to appear on opposite sides of the group means

When smoking is above its group mean, lung capacity tends to be below its group mean.

• Average product of deviation measures extent to which variables covary, the degree of linkage between them

• Similar to variance, for theoretical reasons, average is typically computed using (N - 1), not N. Thus,

$$S_{xy} = \frac{1}{N-1}\sum_{i=1}^{N}\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)$$
Calculation of Covariance

Cigs (X)    Lung Cap (Y)
0           45
5           42
10          33
15          31
20          29

Mean: 10    36

Evaluation yields,

$$S_{xy} = \frac{1}{4}(-215) = -53.75$$
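A short sketch of the covariance calculation for the same data, following the (N - 1) formula. The final lines divide the covariance by the two sample standard deviations, which is the standard way to connect covariance to the correlation coefficient; that link is added here for illustration and is not spelled out on the slide.

```python
import statistics

cigs = [0, 5, 10, 15, 20]      # X
cap = [45, 42, 33, 31, 29]     # Y

n = len(cigs)
mean_x = sum(cigs) / n
mean_y = sum(cap) / n

# Sum of the products of paired deviations from the two means
sum_products = sum((x - mean_x) * (y - mean_y) for x, y in zip(cigs, cap))

# Sample covariance: divide by N - 1
s_xy = sum_products / (n - 1)

# Dividing by both sample standard deviations rescales it to lie in [-1, 1]
r_xy = s_xy / (statistics.stdev(cigs) * statistics.stdev(cap))

print(mean_x, mean_y, sum_products, s_xy, round(r_xy, 4))
```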
Introduction to Normal distribution

http://www.youtube.com/watch?v=dr1DynUzjq0&list=PLCkLQOAPOtT2H1hJRUxUYOxThRwfVI9jI
Normal distribution
Characteristics of a normal distribution

• Bell shaped symmetrical frequency distribution curve

• Many natural phenomena follow a normal distribution (IQ scores, height variation within population, variation in quality of manufactured goods etc.)

• Characterized by mean and standard deviation

• Extremely large values and extremely small values are rare and occur near the tail ends. Most frequent values are clustered around the mean and fall off smoothly on either side

• Infinite no. of values, never touches the x-axis

• The area under the curve is 1

• The probability of a value falling in any interval is given by the area under the curve over that interval, not by the height of the curve at a single point
Normal distribution

• Approximately 68% of the area under a normal curve lies within 1 standard deviation of the mean (mean ± 1 standard deviation).

• Approximately 95% of the area lies within 2 standard deviations of the mean

• Approximately 99.7% lies within 3 standard deviations of the mean.
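These three percentages can be checked numerically with `statistics.NormalDist` from the Python standard library (available from Python 3.8). The helper name `area_within` is an illustrative choice.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def area_within(k):
    """Area under the curve between k standard deviations below and above the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(area_within(k), 4))
```

Because the proportions depend only on distance from the mean measured in standard deviations, the same figures hold for any normal distribution.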
What is a Standard Normal Distribution?

• Same as a normal distribution, but the standard deviation is 1 and the mean is 0
• We often use the standard normal distribution as a result
– "Bell-shaped"
– Mean of 0
– Standard deviation of 1
– Possesses an infinite number of possible values
Application of Normal distribution

Z score
• If we know the population mean and population standard deviation, for any value of X we can compute a z-score by subtracting the population mean and dividing the result by the population standard deviation

$$z = \frac{X - \mu}{\sigma}$$
Normal distribution

• Z-score tells us how far above or below the mean a value is in terms of standard deviations
• It is a linear transformation of the original scores
• Multiplication (or division) of and/or addition to (or subtraction from) X by a constant
• Relationship of the observations to each other remains the same
• Z = (X - μ)/σ
• X = σZ + μ
Normal or not? There are statistical ways to check rather than guessing….

• Performance rankings
• Heights of men
• Exercise time
• Weekly wages
• Waiting time for traffic
• Sales incentive
Applications of Normal Distribution

• Normal distribution has well-understood statistical properties

• Once we know the shape of the distribution, we can use the properties of normal distribution to estimate probabilities

• Useful for statistical inferences and hypothesis testing

Central Limit Theorem

Given a population with ANY distribution (with finite variance):

Taking random samples of size n from that distribution

The sample means will be (approximately) normally distributed, provided n is large enough
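The Central Limit Theorem can be illustrated with a small simulation. The sketch below draws samples from an exponential distribution, which is strongly skewed, and looks at the distribution of the sample means. The choice of distribution, the sample size of 50, the 2,000 repetitions and the seed are all assumptions made for the illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 50     # n: observations per sample
NUM_SAMPLES = 2000   # how many sample means to collect


def sample_mean(n):
    """Mean of n draws from an exponential distribution (mean 1, sd 1)."""
    return sum(rng.expovariate(1.0) for _ in range(n)) / n


sample_means = [sample_mean(SAMPLE_SIZE) for _ in range(NUM_SAMPLES)]

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)

# Theory: centered on the population mean, with sd = sigma / sqrt(n)
expected_sd = 1.0 / SAMPLE_SIZE ** 0.5

# For a normal curve, about 68% of values fall within one sd of the mean
within_one_sd = sum(
    abs(m - mean_of_means) <= sd_of_means for m in sample_means
) / NUM_SAMPLES

print(round(mean_of_means, 3), round(sd_of_means, 3), round(expected_sd, 3))
print(round(within_one_sd, 3))
```

Even though individual draws are far from bell-shaped, the sample means cluster symmetrically around the population mean.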
Not-so Normal Distributions

Not all distributions are as symmetric and beautiful as the Normal distribution
Shapes and Peaks of Distribution

Skewness
• Measure of symmetry (or rather, the lack of it)
• Is income a positive or a negative skew distribution?
• Which is higher in both cases - mean, median, mode?

Kurtosis
• Denotes the sharpness/peakedness of the peak
• Leptokurtic – e.g. Poisson distribution
• Platykurtic – e.g. Binomial distribution with p close to 0.5
Other distribution examples
Binomial

Examples: Tossing a coin, Throwing dice

Characteristics
• The number of observations n is fixed
• Each observation is independent
• Each observation represents one of two outcomes ("success" or "failure")
• The probability of "success" p is the same for each outcome

Poisson

Examples: No of calls at work, Waiting time in a queue

Characteristics
• Discrete probability distribution for the counts of events that occur randomly in a given interval of time (or space). No. of trials are not fixed
• Used as model for no. of events in a specific time period in which the number of successes is recorded.
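The two distributions can be written out directly from their standard probability mass functions. In the sketch below, the coin-toss setup (10 tosses, p = 0.5) and the call rate of 2 per interval are hypothetical parameters chosen for illustration.

```python
import math


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(exactly k events in an interval when events occur at average rate lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


# Binomial: number of heads in 10 tosses of a fair coin (fixed n, two outcomes)
p_five_heads = binomial_pmf(5, n=10, p=0.5)

# Poisson: number of calls in an interval, averaging 2 calls (no fixed n)
p_no_calls = poisson_pmf(0, lam=2.0)
p_three_calls = poisson_pmf(3, lam=2.0)

print(round(p_five_heads, 4), round(p_no_calls, 4), round(p_three_calls, 4))
```

The binomial needs both n and p, while the Poisson is described by a single rate, which reflects the "trials not fixed" point above.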
Sampling

"Statistical designs always involve compromises between the desirable and the possible."
Leslie Kish
Why Sampling ?

• Save costs: Less expensive to study the sample than the population.
• Save time: Less time needed to study the sample than the population .
• Accuracy: Since sampling is done with care and studies are conducted by skilled and qualified
interviewers, the results are expected to be accurate

• Complete – No missing units ; No duplication


• Main considerations (can lead to higher sampling error)
• Past experience, likely response rates, likely differences in the population
• Budget, timings, analysis of sub-groups

• Cannot be perfect (some distortion likely)

Precision
Cost
Sampling – Process

Sampling Process

1. Defining the 'Target Population'
2. Decide – Sample or Census?
3. Developing a sampling frame
4. Specifying sample method
5. Determining sample size
6. Implement

Limitations of Sampling

• Demands more rigid control in undertaking sample operation.

• Minority and smallness in number of sub-groups often render study to be suspected.

• Accuracy level may be affected when data is subjected to weighting.

• Sample results are good approximations at best.
Sample Characteristics

Unit
Element

• Register of electors
• Postcode address file
• Telephone book
• database of customers
• patient list
• pupil register
• hospital records
• census
Sampling

Sampling
• Probability
• Non-Probability
Non- Probability Sampling

Non-Probability Sampling
• Availability Sampling
• Quota Sampling
• Purposive Sampling
• Snowball Sampling
Non-probability Sampling Techniques – Summing up

Convenience (Accidental/Haphazard)
  Advantages: Least expensive, Least time-consuming, Most convenient (Ease of access)
  Disadvantages: Selection bias, Sample not for descriptive or causal research

Judgmental / Purposive (Qualitative)
  Advantages: Low cost, Convenient, Not time-consuming
  Disadvantages: Does not allow generalization, Subjective, Not for projections to population but decisions arrived at looking at the proportions, 'Being representative' is secondary to Quality of response

Quota Sampling
  Advantages: Sample can be controlled for certain characteristics (reads)
  Disadvantages: Selection bias, No assurance of representativeness

Snowball Sampling
  Advantages: Can estimate rare characteristics
  Disadvantages: Time-consuming
Probability Sampling

Probability Sampling
• Simple Random Sampling
• Stratified Random Sampling
• Systematic Sampling
• Cluster Sampling
Examples of Probability Sampling

Suppose you were interested in investigating the link between the family of origin and income and your particular
interest is in comparing incomes of Hispanic and Non-Hispanic respondents. For statistical reasons, you decide that you
need at least 1,000 non-Hispanics and 1,000 Hispanics. Hispanics comprise around 6 or 7% of the population

Let's suppose your sampling frame is a large city's telephone book that has 2,000,000 entries. To take a SRS, you need to
associate each entry with a number and choose n= 200 numbers from N= 2,000,000

Suppose you wanted to study dance club and bar employees in NYC with a sample of n = 600. Yet there is no list of these
employees from which to draw a simple random sample. Suppose you obtained a list of all bars/clubs in NYC

Say that you're interested in how job satisfaction varies by race among a group of employees at a firm. To explore this
issue, we need to create a sample of the employees of the firm. However, the employee population at this particular firm
is predominantly white
Systematic sampling – Quasi Random sampling

1. Similar to simple random sampling with one exception.

2. Only one random number is needed throughout the entire sampling process.

3. To use systematic sampling, a researcher needs:

a. a sampling frame of the population; and

b. a skip interval calculated as follows:

Skip interval = population list size / sample size required

4. Elements are selected using the skip interval

5. When the ordering of the elements is related to the characteristic of interest, systematic sampling increases the representativeness of the sample; otherwise it behaves like SRS
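The procedure above can be sketched as follows: compute the skip interval, pick one random start within the first interval, then take every k-th element. The frame of 100 numbered members and the function name `systematic_sample` are hypothetical.

```python
import random


def systematic_sample(frame, sample_size, rng):
    """Select `sample_size` elements from `frame` using a fixed skip interval."""
    # Skip interval = population list size / sample size required
    skip = len(frame) // sample_size
    if skip < 1:
        raise ValueError("sample size cannot exceed the size of the frame")
    # The only random number needed: a starting point within the first interval
    start = rng.randrange(skip)
    return [frame[start + i * skip] for i in range(sample_size)]


# Hypothetical sampling frame: 100 members numbered 0 to 99
frame = list(range(100))
rng = random.Random(7)  # fixed seed so the run is repeatable

sample = systematic_sample(frame, sample_size=20, rng=rng)
print(sample)
```

If the frame has a periodic pattern whose period matches the skip interval, every selected element will share the same position in the cycle, so the ordering of the list deserves a check before this method is used.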


Stratified sampling

3 step process:
• Step 1- Divide the population into homogeneous, mutually exclusive and collectively exhaustive subgroups or strata using a stratification variable(s)
• Step 2- Elements are selected from each stratum by a random procedure, usually SRS.
• Step 3- Form the final sample by consolidating all sample elements chosen in step 2.

• The elements within a stratum should be as homogeneous as possible, but the elements in different strata should be as heterogeneous as possible.
• Finally, the variables should decrease the cost of the stratification process by being easy to measure and apply.

Stratified samples can be:
• Proportionate
• Disproportionate (relative stratum size and standard deviation)
Stratified Random Sampling

- population divided into strata, then random sampling from within each stratum

• Location (geographies, dispersion)
• Age (demographics)
• Religion
• Social Class (demographics, SEC)
• Lifestyle, Lifecycle
• Behaviour (past, present)
• Attitudes

[Figure: example age strata: 15-25 yrs, 26-35 yrs, 36-50 yrs]
Selection of a proportionate Stratified Sample

To select a proportionate stratified sample of 20 members of the Cosmopolitan Club which has 100 members belonging to three language groups i.e., English (E), Hindi (H) and French (F)

Step 1: Identify each member from the list by his or her respective language groups
00 (E ) 15 (H) 30 (E ) 45 (E ) 60 ( F ) 75 (E ) 90 ( F )
01 (E ) 16 (E ) 31 (E ) 46 ( F ) 61 (H) 76 (E ) 91 (E )
02 ( F ) 17 ( F ) 32 (E ) 47 (H) 62 (H) 77 (H) 92 (H)
03 (E ) 18 ( F ) 33 (H) 48 (E ) 63 (E ) 78 (H) 93 (E )
04 (E ) 19 (H) 34 (E ) 49 (E ) 64 (E ) 79 (E ) 94 (E )
05 (E ) 20 (H) 35 (H) 50 (E ) 65 ( F ) 80 (H) 95 ( F )
06 (H) 21 ( F ) 36 (E ) 51 (H) 66 (H) 81 (E ) 96 (E )
07 (H) 22 (E ) 37 (E ) 52 ( F ) 67 (E ) 82 (E ) 97 (E )
08 (E ) 23 ( F ) 38 ( F ) 53 (H) 68 (H) 83 (H) 98 (H)
09 (E ) 24 (E ) 39 ( F ) 54 (E ) 69 (E ) 84 ( F ) 99 (E )
10 (H) 25 (H) 40 (E ) 55 (E ) 70 (E ) 85 (E )
11 (E ) 26 (E ) 41 ( F ) 56 (H) 71 (E ) 86 (E )
12 ( F ) 27 (H) 42 ( F ) 57 (E ) 72 (H) 87 (H)
13 (H) 28 ( F ) 43 (E ) 58 (H) 73 (E ) 88 ( F )
14 (E ) 29 (E ) 44 (H) 59 (H) 74 ( F ) 89 (E )
Selection of a proportionate Stratified Sample

Step 2: Sub-divide the population into three homogeneous groups or language strata

English Language Stratum (50) | Hindi Language Stratum (30) | French Language Stratum (20)
(first five columns)          | (next three columns)        | (last two columns)
00 22 40 64 82 06 35 66 02 42
01 24 43 67 85 07 44 68 12 46
03 26 45 69 86 10 47 72 17 52
04 29 48 70 89 13 51 77 18 60
05 30 49 71 91 15 53 78 21 65
08 31 50 73 93 19 56 80 23 74
09 32 54 75 94 20 58 83 28 84
11 34 55 76 96 25 59 87 38 88
14 36 57 79 97 27 61 92 39 90
16 37 63 81 99 33 62 98 41 95

Sampling fraction = n / N = (20/100) = 0.2

Proportionate allocation: English 50 × 0.2 = 10; Hindi 30 × 0.2 = 6; French 20 × 0.2 = 4

Random sampling (random table)
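A minimal sketch of the proportionate allocation and the within-stratum random draw. The stratum sizes come from the example above; the member labels (such as "E00") are hypothetical stand-ins for the club's membership numbers, and Python's random number generator stands in for the random number table.

```python
import random

# Stratum sizes from the Cosmopolitan Club example
stratum_sizes = {"English": 50, "Hindi": 30, "French": 20}

# Hypothetical member labels for each stratum, e.g. "E00" ... "E49"
strata = {
    name: [f"{name[0]}{i:02d}" for i in range(size)]
    for name, size in stratum_sizes.items()
}


def proportionate_stratified_sample(strata, total_sample_size, rng):
    """Draw a simple random sample from each stratum in proportion to its size."""
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Sampling fraction n / N applied to the stratum size
        k = round(len(members) * total_sample_size / population_size)
        sample[name] = rng.sample(members, k)
    return sample


rng = random.Random(2014)  # fixed seed so the run is repeatable
sample = proportionate_stratified_sample(strata, total_sample_size=20, rng=rng)

for name, chosen in sample.items():
    print(name, len(chosen), sorted(chosen))
```

Rounding works cleanly here because every stratum size times 0.2 is a whole number; with other sizes the rounded allocations may need adjusting so they still add up to the required total.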


Cluster sampling

• A two-step-process:
– Step 1- Defined population is divided into number of mutually exclusive and collectively
exhaustive sub population groups or clusters;
– Step 2- Select an independent simple random sample of clusters.

Cluster Sampling
• One-Stage Sampling
• Two-Stage Sampling
• Multistage Sampling
  – Simple Cluster Sampling
  – Probability Proportionate to Size Sampling

A two-step area cluster sample (sampling several clusters) is preferable to a one-step (selecting only one cluster) sample unless the clusters are absolutely homogeneous
One Stage – Two stage Cluster Sampling

Consider the same Cosmopolitan Club example involving 100 club members:
Step 1: Sub-divide the club members into 5 clusters, each containing 20 members

Cluster No.   English Language    Hindi Language    French Language

1             00 22 40 64 82      06 35 66          02 42
              01 24 43 67 85      07 44 68          12 46
2             03 26 45 69 86      10 47 72          17 52
              04 29 48 70 89      13 51 77          18 60
3             05 30 49 71 91      15 53 78          21 65
              08 31 50 73 93      19 56 80          23 74
4             09 32 54 75 94      20 58 83          28 84
              11 34 55 76 96      25 59 87          38 88
5             14 36 57 79 97      27 61 92          39 90
              16 37 63 81 99      33 62 98          41 95

Step 2: Select one of the 5 clusters. If cluster 4 is selected, then all its elements are selected.

Step 3: In a two-stage cluster sampling, the researcher may randomly select 4 members from each of the five
clusters or the researcher may select 2 clusters out of 5 and then sample randomly within selected clusters
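The one-stage and two-stage procedures can be sketched with the five clusters from the table. The function names and the seed are illustrative assumptions; member numbers are kept as strings so that leading zeros are preserved.

```python
import random

# The ten table rows; each cluster is made of two consecutive rows (20 members)
rows = """
00 22 40 64 82 06 35 66 02 42
01 24 43 67 85 07 44 68 12 46
03 26 45 69 86 10 47 72 17 52
04 29 48 70 89 13 51 77 18 60
05 30 49 71 91 15 53 78 21 65
08 31 50 73 93 19 56 80 23 74
09 32 54 75 94 20 58 83 28 84
11 34 55 76 96 25 59 87 38 88
14 36 57 79 97 27 61 92 39 90
16 37 63 81 99 33 62 98 41 95
""".split("\n")
rows = [line.split() for line in rows if line.strip()]

clusters = {c: rows[2 * (c - 1)] + rows[2 * (c - 1) + 1] for c in range(1, 6)}


def one_stage(clusters, rng):
    """Pick one cluster at random and take every member in it."""
    chosen = rng.choice(sorted(clusters))
    return chosen, list(clusters[chosen])


def two_stage(clusters, n_clusters, per_cluster, rng):
    """Pick clusters at random, then sample members at random within each."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {c: rng.sample(clusters[c], per_cluster) for c in chosen}


rng = random.Random(5)  # fixed seed so the run is repeatable
cluster_id, members = one_stage(clusters, rng)
four_from_each = two_stage(clusters, n_clusters=5, per_cluster=4, rng=rng)
two_clusters = two_stage(clusters, n_clusters=2, per_cluster=10, rng=rng)

print(cluster_id, members)
print(four_from_each)
print(two_clusters)
```

All three designs return 20 members, but they differ in how many clusters are visited, which is where the cost savings of cluster sampling come from.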
Cluster sampling …. Contd.

Stratified Sampling
• Sub-groups / strata, mutually exclusive & exhaustive
• Within stratum, elements are homogeneous.
• High degree of heterogeneity between strata.
• Less sampling error.
• Objective - to increase precision.

Cluster Sampling
• Sub-population or clusters, mutually exclusive & exhaustive
• Within cluster, elements are heterogeneous.
• Between clusters - high degree of homogeneity.
• More prone to sampling error.
• Objective - increase sampling efficiency by decreasing cost

AREA SAMPLING - common form of cluster sampling where clusters consist of geographic areas (districts,
housing blocks or townships). Could be one-stage, two-stage, or multi-stage.

Step 1: Determine the geographic area to be surveyed, identify its subdivisions (ten blocks)

Step 2: Decide on the use of one-step, two-step or multi-step cluster sampling.

Step 3: Using random numbers, select the housing blocks (units) to be sampled. Select any 4 blocks randomly.

Step 4: In each of the chosen housing block identify a random starting point and follow the right hand rule.
Probability Sampling Techniques – Summing up

SRS (Simple random sampling)
  Advantages: Easily understood, Results projectable, Requires minimum knowledge of the population to be sampled
  Disadvantages: Difficult to construct sampling frame, Expensive (difficulty in reach – geographic spread), Lower precision, No assurance of representativeness (Target population)

Systematic Random
  Advantages: Can increase representativeness, Less costly and easier to implement than SRS (only one random point chosen), Sampling frame not necessary, can be used even without the knowledge of the Element
  Disadvantages: Can decrease representativeness if no ordered pattern, Kth person may have a periodical order

Stratified Random
  Advantages: Include all sub-populations, differs from Quota sampling as elements are chosen randomly rather than on convenience or judgment, More precision
  Disadvantages: Difficult to select relevant stratification variables, need details for all population members (SRS), Not feasible to stratify on many variables (ideally not more than two), Expensive

Cluster Sampling
  Advantages: Easy to implement, Don't need details of the entire population, Cost effective
  Disadvantages: Imprecise (unequal probability of selection, fewer Sampling points), Increases complexity to compute and interpret results
To end it all
The mean is a measure of location,
The center of a population.
If at random a score you drew,
The mean's the most likely score you'd view.

You could compute the mean in your slumber.
Sum the scores and divide by the number.
At the mean sample scores converge;
From the mean these scores diverge.
Near the mean the scores are many.
In the tails, there's hardly any.

To measure a distribution's variation,
From the mean find each score's deviation.
Each difference of D score now you square.
Sum all D scores, all scores' share.
Now this sum divide by N.
That's V, the variance, then.

The square root of V is called S.D.,
The gauge of a trait's variability.
We've found two moments of a distribution,
Developed from each score's contribution.

Picturing a universe, try to see,
Its center's the mean; its orbit, S.D.
