Introduction To Descriptive Statistics 2014

Great takes on Statistics
"There are three ways to not tell the truth: lies, damned lies, and statistics." …
"Statistics show that of those who contract the habit of eating, very few survive." …
"The statistics on sanity are that one out of every four Americans is suffering from some form of mental illness. Think of your three best friends. If they're okay, then it's you." …
"Statistics are used much like a drunk uses a lamppost: for support, not illumination." …
"I can prove anything by statistics except the truth." …

Common terms in Statistics
• Variable
• Population
• Sample
• Data
• Parameter
• Statistics
• Descriptive Statistics
• Inference?
Central dogma of Statistics
• Population to Sample: Probability
• Sample: Descriptive Statistics
• Sample to Population: Inferential Statistics
Concept of Frequency and Basic Exploratory Analysis
Frequency distribution (HISTOGRAM): a mathematical function showing the number of instances in which a variable takes each of its possible values.
Cross Tab: Cross tabulation (or crosstabs for short) is a statistical process that summarizes categorical data to create a contingency table.
Conversion of Continuous Variable to Discrete/ Categorical Variable
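The three ideas on this slide (frequency counts, a cross tab, and converting a continuous variable into categories) can be sketched with the Python standard library. The commute times, travel modes, and band cut-offs below are made-up illustration data, not taken from the slides.

```python
from collections import Counter

# Hypothetical data: commute time in minutes and travel mode for 8 people.
commute_minutes = [12, 25, 47, 33, 8, 51, 29, 18]
travel_mode = ["bus", "car", "car", "bus", "walk", "car", "bus", "walk"]


def to_band(minutes):
    """Convert a continuous value into a categorical band (assumed cut-offs)."""
    if minutes < 20:
        return "short"
    if minutes < 40:
        return "medium"
    return "long"


bands = [to_band(m) for m in commute_minutes]

# Frequency distribution: how often each band occurs.
band_counts = Counter(bands)

# Cross tabulation: counts for every (travel mode, band) combination.
crosstab = Counter(zip(travel_mode, bands))

print(band_counts)
print(crosstab)
```

A `Counter` keyed by pairs is the simplest contingency table; missing combinations simply count as zero.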
Data and Variables
Types: Categorical, Discrete, Continuous
NOIR: Nominal, Ordinal, Interval, Ratio
Random Vs. Traditional Variables

Exercise!
Percentage scores on a Math exam

Letter grades on an English essay
Flavors of yogurt
Instructors classified as : Easy, Difficult or Impossible
Employee evaluations classified as : Excellent, Average, Poor
Religions
Political parties
Commuting times to school
Years (AD) of important historical events
Ages (in years) of statistics students
Ice cream flavor preference
Amount of money in savings accounts
Students classified by their reading ability : Above average, Below average, Normal
Descriptive Statistics
• Descriptive statistics are a collection of measurements of two things: location and variability.
• Location tells you the central value of your variable
• Variability refers to the spread of the data from the center value
• Statistics is basically the study of what causes variability in the data.
Location: Mean, Mode, Median
Variability: Variance, Standard deviation, Range
Central Tendency
Mean, Median & Mode
A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data.
• Mean is the simple average of a data set
• Median is the middle of a ranked distribution
• Mode is the most common data value
Mean
Mean = (Σx)/n
= (120+125+…+165)/20
= 2934/20 = 146.7 ≈ 147 calls a day
Calls per day

Associate 1 120
Associate 2 125
Associate 3 130
Associate 4 132
Associate 5 135
Associate 6 135
Associate 7 137
Associate 8 148
Associate 9 150
Associate 10 150
Associate 11 150
Associate 12 155
Associate 13 155
Associate 14 155
Associate 15 157
Associate 16 157
Associate 17 158
Associate 18 160
Associate 19 160
Associate 20 165
Median
Median = middle of dist. when ranked
= (10th + 11th value) /2
= (150 + 150) /2
= 150 calls a day
Calls per day

Associate 1 120
Associate 2 125
Associate 3 130
Associate 4 132
Associate 5 135
Associate 6 135
Associate 7 137
Associate 8 148
Associate 9 150
Associate 10 150
Associate 11 150
Associate 12 155
Associate 13 155
Associate 14 155
Associate 15 157
Associate 16 157
Associate 17 158
Associate 18 160
Associate 19 160
Associate 20 165
Mode
Calls per day

Associate 1 120
Associate 2 125
Associate 3 130
Associate 4 132
Associate 5 135
Associate 6 135
Associate 7 137
Associate 8 148
Associate 9 150
Associate 10 150
Associate 11 150
Associate 12 155
Associate 13 155
Associate 14 155
Associate 15 157
Associate 16 157
Associate 17 158
Associate 18 160
Associate 19 160
Associate 20 165
Mode
= the most commonly recurring data point
= 150 and 155 calls a day (each value occurs three times, so this data set is bimodal)
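The three measures for the calls-per-day table can be checked with Python's standard `statistics` module. This is only a quick verification sketch; the variable names are illustrative.

```python
import statistics

# Calls per day for Associates 1 to 20, as listed in the table.
calls = [120, 125, 130, 132, 135, 135, 137, 148, 150, 150,
         150, 155, 155, 155, 157, 157, 158, 160, 160, 165]

mean_calls = statistics.mean(calls)      # sum of values / number of values
median_calls = statistics.median(calls)  # average of the 10th and 11th ranked values
modes = statistics.multimode(calls)      # every value tied for the highest count

print(mean_calls, median_calls, modes)
```

`multimode` is used instead of `mode` because it reports all tied values, which matters when more than one value shares the highest frequency.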
Percentile/ Quartile
A percentile is the value of a variable below which a certain percent of observations fall. For example, the 20th percentile is the value (or score) below which ~20 percent of the observations may be found.
The quartiles of a set of values are the three points that divide the data set into four equal groups, each representing a fourth of the population being sampled.
Calls per day

Associate 1 120
Associate 2 125
Associate 3 130
Associate 4 132
Associate 5 135  <- 25th Percentile (or, the 1st quartile)
Associate 6 135
Associate 7 137
Associate 8 148
Associate 9 150
Associate 10 150
Associate 11 150
Associate 12 155
Associate 13 155
Associate 14 155
Associate 15 157  <- 75th Percentile (or, the 3rd quartile)
Associate 16 157
Associate 17 158
Associate 18 160
Associate 19 160
Associate 20 165
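As a rough check of the quartiles marked in the table, the sketch below uses `statistics.quantiles` from the Python standard library. Quartile conventions differ between textbooks and tools; the default "exclusive" method is assumed here and may give slightly different values on other data sets.

```python
import statistics

# Calls per day for Associates 1 to 20, already ranked.
calls = [120, 125, 130, 132, 135, 135, 137, 148, 150, 150,
         150, 155, 155, 155, 157, 157, 158, 160, 160, 165]

# n=4 asks for the three cut points that split the data into four groups.
q1, q2, q3 = statistics.quantiles(calls, n=4)

print(q1, q2, q3)
```

The middle cut point is the median, so the quartiles extend the same "rank the data, then cut" idea used for the median.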
Weighted Average/ Rolling Average
Measures of Central Tendency by Level of Measurement
Level of Measurement | Measures of Central Tendency
Nominal | Mode
Ordinal | Median, Mode
Interval/ ratio | Mean, Median, Mode
Measures of Dispersion
Consider Teams 1 and 2 below
Team 1
Member:          1   2   3   4   5
Height (inches): 72  73  76  76  78
Team 2
Member:          1   2   3   4   5
Height (inches): 67  72  76  76  84
• What is same among both the teams?

• What is different?
Another way to look at Dispersion
• Both distributions have same mean but what is the difference?

Definition:
• Measures of dispersion describe the data spread or how far the measurements are
from the center
• The more similar the scores are to each other, the lower the measure of
dispersion will be
• The less similar the scores are to each other, the higher the measure of
dispersion will be
• In general, the more spread out a distribution is, the larger the measure of
dispersion will be
Distributions which are highly dispersed
• Consider the following distributions
• Stock prices in an index
• Bollywood movie collections
• Test scores
• Hitting bulls-eye on a dart-board
• Income
Measures of Dispersion
• Range
• Interquartile deviation
• Mean absolute deviation (MAD)
• Variance/ Standard deviation
• Gini Coefficient
Range
• Definition: simplest measure of variation denoting difference between highest and lowest number in dataset
Range = Maximum - Minimum
• Disadvantages
• Ignores the way data is distributed
• Uses only two data points
• Highly sensitive to outliers, since it depends only on the two extreme values
Team 1
Member:          1   2   3   4   5
Height (inches): 72  73  76  76  78
Team 2
Member:          1   2   3   4   5
Height (inches): 67  72  76  76  84
What is the range for the datasets?
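One way to answer the question is a two-line helper in Python. The function name `value_range` is just an illustrative choice.

```python
# Heights in inches for the two teams shown above.
team_1 = [72, 73, 76, 76, 78]
team_2 = [67, 72, 76, 76, 84]


def value_range(values):
    """Range = maximum - minimum."""
    return max(values) - min(values)


range_1 = value_range(team_1)
range_2 = value_range(team_2)

# Both teams share the same mean, yet their ranges differ.
mean_1 = sum(team_1) / len(team_1)
mean_2 = sum(team_2) / len(team_2)

print(range_1, range_2, mean_1, mean_2)
```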

Interquartile Deviation
Definition: The inter-quartile range is a measure that indicates the extent to which the central 50% of values within the dataset are dispersed. It is based upon, and related to, the median.
Interquartile range: Q3 – Q1
Q1 Q2 Q3 Q4
25% 25% 25% 25%
Advantages/Disadvantages
• Provides a clearer picture of the overall dataset by removing/ignoring the outlying values.
• Like the range, however, the inter-quartile range is a measure of dispersion that is based upon only two values from the dataset
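A small sketch of Q3 – Q1 applied to the two team-height data sets from the earlier slides, using `statistics.quantiles` with its default "exclusive" method. That method is an assumption; other quartile conventions can give somewhat different numbers on such small samples.

```python
import statistics

team_1 = [72, 73, 76, 76, 78]
team_2 = [67, 72, 76, 76, 84]


def interquartile_range(values):
    """Return Q3 - Q1 using the default quartile method of the statistics module."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


iqr_1 = interquartile_range(team_1)
iqr_2 = interquartile_range(team_2)

print(iqr_1, iqr_2)
```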
Mean Absolute Deviation
Definition: mean of the absolute values of the deviations. This measure is called mean absolute deviation and is denoted by MAD. Note that in the formula that follows, the notation mean(x) denotes the mean of the x measurements.
MAD = sum( |x - mean(x)| ) / n

where n is the size of the sample.
Disadvantages
• Consider these two datasets, shown as deviations from the mean: (+4, +4, -4, -4) and (+7, +1, -6, -2). Both give MAD = 4, though the 2nd dataset is more spread out.
(|4| + |4| + |-4| + |-4|) / 4 = (4+4+4+4) / 4 = 4
(|7| + |1| + |-6| + |-2|) / 4 = (7+1+6+2) / 4 = 4
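The MAD formula and the two data sets from the disadvantage example can be reproduced as follows. Treating the deviations themselves as the data (both sets have mean 0) is an assumption made for illustration; the standard deviation is printed alongside to show that it does separate the two sets.

```python
import statistics


def mean_absolute_deviation(values):
    """MAD = sum(|x - mean(x)|) / n."""
    center = statistics.fmean(values)
    return sum(abs(x - center) for x in values) / len(values)


set_a = [4, 4, -4, -4]
set_b = [7, 1, -6, -2]

mad_a = mean_absolute_deviation(set_a)
mad_b = mean_absolute_deviation(set_b)

# Population standard deviation, for comparison with MAD.
sd_a = statistics.pstdev(set_a)
sd_b = statistics.pstdev(set_b)

print(mad_a, mad_b, sd_a, sd_b)
```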
Variance/ Standard Deviation
Definition of variance
The variance indicates how close to or far from the mean are most of the cases for a particular variable.
The smaller the value of the variance, the more the cases are concentrated around the value of the
mean; the larger the value of the variance, the more spread out away from the mean are the cases.
Calculation
1. Calculate difference of each data point from mean
Deviation of x = x – mean
2. What is the problem when we take sum of deviations? (Positive and negative deviations cancel, so the sum is always zero.)
3. Square deviations and sum
4. Divide by no. of observations
Variance = 24/5 = 4.8
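The four steps can be traced in a few lines of Python. The Team 1 heights from the earlier slides are used because they reproduce the 24/5 figure; that the slide was built on this data set is an assumption.

```python
# Team 1 heights in inches (assumed to be the data behind the 24/5 result).
heights = [72, 73, 76, 76, 78]

mean_height = sum(heights) / len(heights)

# Step 1: deviation of each data point from the mean.
deviations = [x - mean_height for x in heights]

# Step 2: the plain sum of deviations is zero, so it cannot measure spread.
sum_of_deviations = sum(deviations)

# Step 3: square the deviations and sum them.
sum_of_squares = sum(d ** 2 for d in deviations)

# Step 4: divide by the number of observations.
variance = sum_of_squares / len(heights)

print(sum_of_deviations, sum_of_squares, variance)
```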
Variance/Standard Deviation
Why squared deviations?
• Absolute values do not have nice mathematical properties
• Squares eliminate the negatives
Result:
• Increasing contribution to the variance as you go farther from the mean
• Variance is somewhat arbitrary. But if you could "standardize" that value, you could talk about any variance (i.e. deviation) in equivalent terms
Advantages/Disadvantages
• The variance is not the simplest or easiest-to-understand measure of dispersion for interval/ratio variables
• Statisticians prefer it because it provides an excellent basis for some very important multivariate statistics
Variance/Standard Deviation
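Standard Deviation Calculation
1. Score (in the units that are meaningful)
2. Mean
3. Each score's deviation from the mean
4. Square that deviation
5. Sum all the squared deviations (Sum of Squares)
6. Divide by n-1 for a sample (or by n when the data are the whole population)
7. Square root – now the value is in the units we started with!
Result: √(24/5) = 2.19, i.e. the square root of a variance of 24/5 obtained by dividing by n = 5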
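The difference between dividing by n and by n-1 can be seen with the standard `statistics` module. The Team 1 heights are assumed as the data because they reproduce the 2.19 figure when dividing by n.

```python
import statistics

# Team 1 heights in inches (assumed data set for the 2.19 result).
heights = [72, 73, 76, 76, 78]

# Population standard deviation: sum of squares divided by n.
population_sd = statistics.pstdev(heights)

# Sample standard deviation: sum of squares divided by n - 1.
sample_sd = statistics.stdev(heights)

print(round(population_sd, 2), round(sample_sd, 2))
```

The n-1 version is always a little larger; the gap shrinks as the number of observations grows.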
Gini-coefficient
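Definition:
• Used as a measure of poverty and inequality
Calculation
• The Gini coefficient is calculated as the area A divided by the sum of areas A and B, i.e. G = A / (A + B)
• In the usual Lorenz-curve diagram, A is the area between the line of perfect equality and the Lorenz curve, and B is the area under the Lorenz curve
Disadvantage
• Not additive across groups, i.e. the total Gini of a society is not equal to the sum of the Ginis for its sub-groups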
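A minimal sketch of the A / (A + B) calculation, assuming the usual Lorenz-curve construction (cumulative share of income against cumulative share of people) and a trapezoid approximation for area B. The function name and the sample incomes are hypothetical.

```python
def gini(values):
    """Gini coefficient as A / (A + B), with B the area under the Lorenz curve."""
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    if n == 0 or total <= 0:
        raise ValueError("need at least one value and a positive total")

    area_b = 0.0
    previous_share = 0.0
    running_total = 0.0
    for value in ordered:
        running_total += value
        share = running_total / total
        # Trapezoid between two consecutive points of the Lorenz curve.
        area_b += (previous_share + share) / 2 / n
        previous_share = share

    area_a = 0.5 - area_b  # area between the equality line and the curve
    return area_a / (area_a + area_b)


equal_incomes = gini([1, 1, 1, 1])    # everyone has the same income
one_has_all = gini([0, 0, 0, 1])      # one person holds everything

print(equal_incomes, one_has_all)
```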
Measures of Central Tendency and Dispersion by Level of Measurement
Level of Measurement | Measures of Central Tendency | Measures of Dispersion
Nominal | Mode | Percent distribution
Ordinal | Median, Mode | Min and max, Range, Percentiles, Percent distribution
Interval/ ratio | Mean, Median, Mode | Variance, Standard deviation, Min and max, Range, Percentiles, Percent distribution
Scatter Plot
• A scatter diagram is a tool for analyzing relationships between two variables.
• One variable is plotted on the horizontal axis and the other is plotted on the vertical axis.
• The pattern of their intersecting points can graphically show relationship patterns.
• Most often a scatter diagram is used to look for evidence for or against a correlation.
• While the diagram shows relationships, it does not by itself prove that one variable causes the other.
Concept of Correlation
• A statistic that quantifies a relation between two variables
• Falls between -1.00 and 1.00
• The value of the number (not the sign) indicates the strength of the relation
Types of relations
Positive Correlation
• Association between variables such that high scores on one variable tend to have high scores on the other variable
• A direct relation between the variables
Negative Correlation
• Association between variables such that high scores on one variable tend to have low scores on the other variable
• An inverse relation between the variables
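Correlation
Formula of Correlation Coefficient
$$r_{xy} = \frac{N\sum XY - \sum X \sum Y}{\sqrt{\left[N\sum X^2 - \left(\sum X\right)^2\right]\left[N\sum Y^2 - \left(\sum Y\right)^2\right]}}$$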

Smoking and Lung Capacity
Example: investigate relationship between cigarette smoking and lung capacity
Data: sample group response data on smoking habits, and measured lung capacities, respectively
N | Cigarettes (X) | Lung Capacity (Y)
1 | 0 | 45
2 | 5 | 42
3 | 10 | 33
4 | 15 | 31
5 | 20 | 29
Smoking and Lung Capacity
• Observe that as smoking exposure goes up, corresponding lung capacity goes down
• Variables covary inversely
• Covariance and Correlation quantify relationship
[Scatter plot: Lung Capacity (Y) against Smoking (yrs)]
r_xy = −0.96
• r_xy = −0.96 indicates a very strong inverse association: in this sample, heavier smokers almost always have diminished lung capacity
• Greater smoking exposure implies greater likelihood of lung damage
Calculation for Correlation
Cigs (X) | X² | XY | Y² | Cap (Y)
0 | 0 | 0 | 2025 | 45
5 | 25 | 210 | 1764 | 42
10 | 100 | 330 | 1089 | 33
15 | 225 | 465 | 961 | 31
20 | 400 | 580 | 841 | 29
∑ = 50 | 750 | 1585 | 6680 | 180

Calculation for Correlation
$$r_{xy} = \frac{5(1585) - 50(180)}{\sqrt{\left[5(750) - 50^2\right]\left[5(6680) - 180^2\right]}}$$
$$= \frac{7925 - 9000}{\sqrt{(3750 - 2500)(33400 - 32400)}}$$
$$r_{xy} = \frac{-1075}{\sqrt{(1250)(1000)}}$$
$$r_{xy} = -0.9615$$
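The same arithmetic can be reproduced from the raw smoking and lung-capacity data with the sums-based formula used in the calculation above. Variable names are illustrative.

```python
import math

cigarettes = [0, 5, 10, 15, 20]      # X
lung_capacity = [45, 42, 33, 31, 29]  # Y

n = len(cigarettes)
sum_x = sum(cigarettes)
sum_y = sum(lung_capacity)
sum_xy = sum(x * y for x, y in zip(cigarettes, lung_capacity))
sum_x2 = sum(x * x for x in cigarettes)
sum_y2 = sum(y * y for y in lung_capacity)

# Numerator and denominator of the correlation coefficient.
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))

r_xy = numerator / denominator
print(round(r_xy, 4))
```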
Covariance
• Variables may change in relation to each other
• Covariance measures how much the movement in one variable predicts the movement in a corresponding variable
Covariance
• Variables that covary inversely, like smoking and lung capacity, tend to appear on opposite sides of the group means. When smoking is above its group mean, lung capacity tends to be below its group mean.
• Average product of deviation measures extent to which variables covary, the degree of linkage between them
• Similar to variance, for theoretical reasons, average is typically computed using (N - 1), not N. Thus,
$$S_{xy} = \frac{1}{N-1}\sum_{i=1}^{N}\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)$$
Calculation of Covariance
Cigs (X) | Lung Cap (Y)
0 | 45
5 | 42
10 | 33
15 | 31
20 | 29
Mean: 10 | 36
Evaluation yields,
$$S_{xy} = \frac{1}{4}(-215) = -53.75$$
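The covariance evaluation can be traced step by step in Python, using the (N - 1) divisor from the formula. Variable names are illustrative.

```python
cigarettes = [0, 5, 10, 15, 20]      # X
lung_capacity = [45, 42, 33, 31, 29]  # Y

n = len(cigarettes)
mean_x = sum(cigarettes) / n
mean_y = sum(lung_capacity) / n

# Product of the two deviations for each observation.
products = [(x - mean_x) * (y - mean_y)
            for x, y in zip(cigarettes, lung_capacity)]

sum_of_products = sum(products)
covariance = sum_of_products / (n - 1)

print(mean_x, mean_y, sum_of_products, covariance)
```

The negative sign reflects the inverse relation: above-average smoking pairs with below-average lung capacity.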
Introduction to Normal distribution
http://www.youtube.com/watch?v=dr1DynUzjq0&list=PLCkLQOAPOtT2H1hJRUxUYOxThRwfVI9jI
Normal distribution
Characteristics of a normal distribution
• Bell shaped symmetrical frequency distribution curve
• Many natural phenomena follow normal distribution (IQ scores, height variation within population, variation in quality of manufactured goods etc.)
• Characterized by mean and standard deviation
• Extremely large values and extremely small values are rare and occur near the tail ends. Most frequent values are clustered around the mean and fall off smoothly on either side
• Infinite no. of values, never touches the x-axis
• The area under the curve is 1
• The probability of a value falling in any interval is the area under the curve over that interval; the height of the curve gives the relative likelihood (density) at that place
Normal distribution
• Approximately 68 percent of the area under a normal curve lies within 1 standard deviation + and – the mean.
• Approximately 95% of the area lies within 2 standard deviations + and – the mean
• Approximately 99.7% lies within 3 standard deviations + and – the mean.
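These three percentages can be checked with `statistics.NormalDist` from the Python standard library (available from Python 3.8). The helper name `area_within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def area_within(k):
    """Area under the normal curve within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


within_1 = area_within(1)
within_2 = area_within(2)
within_3 = area_within(3)

print(round(within_1, 3), round(within_2, 3), round(within_3, 3))
```

Because the areas depend only on the number of standard deviations, the same figures hold for any mean and standard deviation.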
What is a Standard Normal Distribution?
• Same as a normal distribution, but the standard deviation is 1 and the mean is 0
• We often use the standard normal distribution as a result
– "Bell-shaped"
– Mean of 0
– Standard deviation of 1
– Possesses an infinite number of possible values
Application of Normal distribution
Z score
• If we know the population mean and population standard deviation, for any value of X we can compute a z-score by subtracting the population mean and dividing the result by the population standard deviation
$$z = \frac{X - \mu}{\sigma}$$
Normal distribution
• Z-score tells us how far above or below the mean a value is in terms of standard deviations
• It is a linear transformation of the original scores
• Multiplication (or division) of and/or addition to (or subtraction from) X by a constant
• Relationship of the observations to each other remains the same
• Z = (X − μ)/σ
• X = σZ + μ
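The two conversions can be written as a pair of small functions. The population mean of 100 and standard deviation of 15 below are hypothetical values chosen only to make the arithmetic easy to follow.

```python
def z_score(x, mu, sigma):
    """Z = (X - mu) / sigma: distance from the mean in standard deviations."""
    return (x - mu) / sigma


def from_z(z, mu, sigma):
    """X = sigma * Z + mu: convert a z-score back to the original units."""
    return sigma * z + mu


# Hypothetical population parameters, for illustration only.
mu, sigma = 100, 15

z = z_score(130, mu, sigma)       # 130 is two standard deviations above the mean
original = from_z(z, mu, sigma)   # converting back recovers the raw score

print(z, original)
```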
Normal or not? There are statistical ways to check rather than guessing….
• Performance rankings
• Heights of men
• Exercise time
• Weekly wages
• Waiting time for traffic
• Sales incentive
Applications of Normal Distribution
• Normal distribution lends itself to statistical properties
• Once we know the shape of the distribution, we can use the properties of normal distribution to estimate probabilities
• Useful for statistical inferences and hypothesis testing
Central Limit Theorem
Given a population with ANY distribution (with finite variance): taking random samples of size n from that distribution, the sample means will be (approximately) normally distributed when n is sufficiently large
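A small simulation can make the Central Limit Theorem tangible. The sketch below draws from an exponential distribution with mean 1 (a strongly skewed population chosen purely for illustration), with a sample size of 50, 2000 repetitions, and a fixed seed; all of these are assumptions.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 50
NUM_SAMPLES = 2000


def sample_mean(n):
    """Mean of n draws from a skewed population (exponential, mean 1)."""
    return statistics.fmean([rng.expovariate(1.0) for _ in range(n)])


sample_means = [sample_mean(SAMPLE_SIZE) for _ in range(NUM_SAMPLES)]

# The sample means centre on the population mean (1.0) ...
mean_of_means = statistics.fmean(sample_means)
# ... with spread close to sigma / sqrt(n) = 1 / sqrt(50).
spread_of_means = statistics.stdev(sample_means)

print(mean_of_means, spread_of_means)
```

Plotting a histogram of `sample_means` would show a roughly bell-shaped curve even though the underlying population is far from normal.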
Not-so Normal Distributions
Not all distributions are symmetric and beautiful as the Normal distribution
Shapes and Peaks of Distribution
Skewness
• Measure of symmetry
• Is income a positive or a negative skew distribution
• Which is higher in both cases - mean, median, mode?
Kurtosis
• Denotes the sharpness/peakedness of the peak
• Leptokurtic – Poisson distribution
• Platykurtic – Binomial distribution
Other distribution examples
Binomial
Examples: Tossing a coin, Throwing dice
Characteristics
• The number of observations n is fixed
• Each observation is independent
• Each observation represents one of two outcomes ("success" or "failure")
• The probability of "success" p is the same for each outcome
Poisson
Examples: No of calls at work, Waiting time in a queue
Characteristics
• Discrete probability distribution for the counts of events that occur randomly in a given interval of time (or space). No. of trials are not fixed
• Used as model for no. of events in a specific time period in which the number of successes is recorded.
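The two distributions can be compared through their probability formulas, which the slides do not spell out; the standard textbook forms are assumed here. The coin-toss and calls-per-hour figures are illustrative.

```python
import math


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials, success probability p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def poisson_pmf(k, rate):
    """P(exactly k events in an interval when the average count is `rate`)."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


# Binomial: exactly 5 heads in 10 tosses of a fair coin.
five_heads = binomial_pmf(5, 10, 0.5)

# Poisson: no calls in an hour when 2 calls per hour arrive on average (assumed rate).
no_calls = poisson_pmf(0, 2.0)

print(five_heads, no_calls)
```

Note the structural difference: the binomial needs a fixed number of trials n, while the Poisson only needs an average rate.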
Sampling
"Statistical designs always involve compromises between the desirable and the possible."
Leslie Kish
Why Sampling ?
• Save costs: Less expensive to study the sample than the population.
• Save time: Less time needed to study the sample than the population .
• Accuracy: Since sampling is done with care and studies are conducted by skilled and qualified
interviewers, the results are expected to be accurate
• Complete – No missing units ; No duplication

• Main considerations (can lead to higher sampling error)
• Past experience, likely response rates, likely differences in the population
• Budget, timings, analysis of sub-groups
• Cannot be perfect (some distortion likely)
Precision vs. Cost
Sampling – Process
Sampling Process
1. Defining the 'Target Population'
2. Decide – Sample or Census?
3. Developing a sampling Frame
4. Specifying Sample Method
5. Determining Sample Size
6. Implement
Limitations of Sampling
• Demands more rigid control in undertaking sample operation.
• Minority and smallness in number of sub-groups often render study to be suspected.
• Accuracy level may be affected when data is subjected to weighting.
• Sample results are good approximations at best.
Sample Characteristics
Unit, Element
• Register of electors
• Postcode address file
• Telephone book
• Database of customers
• Patient list
• Pupil register
• Hospital records
• Census
Sampling
• Probability Sampling
• Non-Probability Sampling
Non-Probability Sampling
• Availability Sampling
• Quota Sampling
• Purposive Sampling
• Snowball Sampling
Non-probability Sampling Techniques – Summing up
Convenience (Accidental/Haphazard)
  Advantages: Least expensive, Least time-consuming, Most convenient (Ease of access)
  Disadvantages: Selection bias, Sample not for descriptive or causal research
Judgmental / Purposive (Qualitative)
  Advantages: Low cost, Convenient, Not time-consuming
  Disadvantages: Does not allow generalization, Subjective, Not for projections to population but decisions arrived at looking at the proportions, 'Being representative' is secondary to Quality of response
Quota Sampling
  Advantages: Sample can be controlled for certain characteristics (reads)
  Disadvantages: Selection bias, No assurance of representativeness
Snowball Sampling
  Advantages: Can estimate rare characteristics
  Disadvantages: Time-consuming
Probability Sampling
• Simple Random Sampling
• Systematic Sampling
• Stratified Random Sampling
• Cluster Sampling
Examples of Probability Sampling
• Suppose you were interested in investigating the link between the family of origin and income and your particular interest is in comparing incomes of Hispanic and Non-Hispanic respondents. For statistical reasons, you decide that you need at least 1,000 non-Hispanics and 1,000 Hispanics. Hispanics comprise around 6 or 7% of the population
• Let's suppose your sampling frame is a large city's telephone book that has 2,000,000 entries. To take a SRS, you need to associate each entry with a number and choose n= 200 numbers from N= 2,000,000
• Suppose you wanted to study dance club and bar employees in NYC with a sample of n = 600. Yet there is no list of these employees from which to draw a simple random sample. Suppose you obtained a list of all bars/clubs in NYC
• Say that you're interested in how job satisfaction varies by race among a group of employees at a firm. To explore this issue, we need to create a sample of the employees of the firm. However, the employee population at this particular firm is predominantly white
Systematic sampling – Quasi Random sampling
1. Similar to simple random sampling with one exception.
2. Only one random number is needed throughout the entire sampling process.
3. To use systematic sampling, a researcher needs:
a. a sampling frame of the population; and
b. a skip interval calculated as follows:
Skip interval = population list size / sample size required
4. Elements are selected using the skip interval
5. When the ordering of the elements is related to the characteristic of interest, systematic sampling increases the representativeness of the sample; otherwise it behaves like simple random sampling (SRS)
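The skip-interval procedure can be sketched as follows. The frame of 100 numbered elements, the sample size of 20, and the function name are assumptions for illustration; the sketch also assumes the list size divides evenly by the sample size.

```python
import random


def systematic_sample(frame, sample_size, rng):
    """Pick every k-th element after one random start, k = list size / sample size."""
    skip_interval = len(frame) // sample_size
    start = rng.randrange(skip_interval)  # the only random number needed
    return [frame[start + i * skip_interval] for i in range(sample_size)]


rng = random.Random(7)          # fixed seed so the run is repeatable
frame = list(range(100))        # hypothetical sampling frame of 100 elements

sample = systematic_sample(frame, 20, rng)
print(sample)
```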

Stratified sampling
3 step process:
• Step 1- Divide the population into homogeneous, mutually exclusive and collectively exhaustive subgroups or strata using a stratification variable(s)
• Step 2- Elements are selected from each stratum by a random procedure, usually SRS.
• Step 3- Form the final sample by consolidating all sample elements chosen in step 2.
Choosing strata:
• The elements within a stratum should be as homogeneous as possible, but the elements in different strata should be as heterogeneous as possible.
• Finally, the variables should decrease the cost of the stratification process by being easy to measure and apply.
Stratified samples can be:
• Proportionate
• Disproportionate (relative stratum size and standard deviation)
Stratified Random Sampling
- population divided into strata, then random sampling from within each stratum
• Location (geographies, dispersion)
• Age (demographics), e.g. 15-25 yrs, 26-35 yrs, 36-50 yrs
• Religion
• Social Class (demographics, SEC)
• Lifestyle, Lifecycle
• Behaviour (past, present)
• Attitudes
Selection of a proportionate Stratified Sample
To select a proportionate stratified sample of 20 members of the Cosmopolitan Club which has 100 members belonging to three language groups i.e., English (E), Hindi (H) and French (F)
Step 1: Identify each member from the list by his or her respective language groups
00 (E ) 15 (H) 30 (E ) 45 (E ) 60 ( F ) 75 (E ) 90 ( F )
01 (E ) 16 (E ) 31 (E ) 46 ( F ) 61 (H) 76 (E ) 91 (E )
02 ( F ) 17 ( F ) 32 (E ) 47 (H) 62 (H) 77 (H) 92 (H)
03 (E ) 18 ( F ) 33 (H) 48 (E ) 63 (E ) 78 (H) 93 (E )
04 (E ) 19 (H) 34 (E ) 49 (E ) 64 (E ) 79 (E ) 94 (E )
05 (E ) 20 (H) 35 (H) 50 (E ) 65 ( F ) 80 (H) 95 ( F )
06 (H) 21 ( F ) 36 (E ) 51 (H) 66 (H) 81 (E ) 96 (E )
07 (H) 22 (E ) 37 (E ) 52 ( F ) 67 (E ) 82 (E ) 97 (E )
08 (E ) 23 ( F ) 38 ( F ) 53 (H) 68 (H) 83 (H) 98 (H)
09 (E ) 24 (E ) 39 ( F ) 54 (E ) 69 (E ) 84 ( F ) 99 (E )
10 (H) 25 (H) 40 (E ) 55 (E ) 70 (E ) 85 (E )
11 (E ) 26 (E ) 41 ( F ) 56 (H) 71 (E ) 86 (E )
12 ( F ) 27 (H) 42 ( F ) 57 (E ) 72 (H) 87 (H)
13 (H) 28 ( F ) 43 (E ) 58 (H) 73 (E ) 88 ( F )
14 (E ) 29 (E ) 44 (H) 59 (H) 74 ( F ) 89 (E )
Selection of a proportionate Stratified Sample
Step 2: Sub-divide the population into three homogeneous groups or language strata
English Language Stratum (50) | Hindi Language Stratum (30) | French Language Stratum (20)
00 22 40 64 82 06 35 66 02 42
01 24 43 67 85 07 44 68 12 46
03 26 45 69 86 10 47 72 17 52
04 29 48 70 89 13 51 77 18 60
05 30 49 71 91 15 53 78 21 65
08 31 50 73 93 19 56 80 23 74
09 32 54 75 94 20 58 83 28 84
11 34 55 76 96 25 59 87 38 88
14 36 57 79 97 27 61 92 39 90
16 37 63 81 99 33 62 98 41 95
Sampling fraction = n / N = (20/100) = 0.2
Proportionate 50 *0.2 = 10 30*0.2 = 6 20*0.2 = 4
Random sampling (random table)
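The proportionate allocation (10, 6 and 4 members) and the random draw within each stratum can be sketched like this. The stratum sizes come from the example; the member labels such as "E00" are hypothetical stand-ins for the club's member numbers, and a seeded random generator replaces the random number table.

```python
import random

stratum_sizes = {"English": 50, "Hindi": 30, "French": 20}
sample_size = 20
population_size = sum(stratum_sizes.values())

# Hypothetical member labels for each stratum (e.g. "E00" ... "E49").
strata = {name: [f"{name[0]}{i:02d}" for i in range(size)]
          for name, size in stratum_sizes.items()}

# Proportionate allocation: stratum size times the sampling fraction n / N.
# Integer arithmetic is exact here; other sizes may need a rounding rule.
allocation = {name: size * sample_size // population_size
              for name, size in stratum_sizes.items()}

rng = random.Random(0)  # fixed seed so the run is repeatable
sample = {name: rng.sample(members, allocation[name])
          for name, members in strata.items()}

print(allocation)
print(sample)
```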

Cluster sampling
• A two-step-process:
– Step 1- Defined population is divided into number of mutually exclusive and collectively
exhaustive sub population groups or clusters;
– Step 2- Select an independent simple random sample of clusters.
Cluster Sampling
• One-Stage Sampling
• Two-Stage Sampling: Simple Cluster Sampling, Probability Proportionate to Size Sampling
• Multistage Sampling
A two-step area cluster sample (sampling several clusters) is preferable to a one-step (selecting only one cluster) sample unless the clusters are absolutely homogeneous
One Stage – Two stage Cluster Sampling
Consider the same Cosmopolitan Club example involving 100 club members:
Step 1: Sub-divide the club members into 5 clusters, each containing 20 members
Cluster No. | English Language | Hindi Language | French Language
1 | 00 22 40 64 82 | 06 35 66 | 02 42
1 | 01 24 43 67 85 | 07 44 68 | 12 46
2 | 03 26 45 69 86 | 10 47 72 | 17 52
2 | 04 29 48 70 89 | 13 51 77 | 18 60
3 | 05 30 49 71 91 | 15 53 78 | 21 65
3 | 08 31 50 73 93 | 19 56 80 | 23 74
4 | 09 32 54 75 94 | 20 58 83 | 28 84
4 | 11 34 55 76 96 | 25 59 87 | 38 88
5 | 14 36 57 79 97 | 27 61 92 | 39 90
5 | 16 37 63 81 99 | 33 62 98 | 41 95
Step 2: Select one of the 5 clusters. If cluster 4 is selected, then all its elements are selected.
Step 3: In a two-stage cluster sampling, the researcher may randomly select 4 members from each of the five
clusters or the researcher may select 2 clusters out of 5 and then sample randomly within selected clusters
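One-stage and two-stage selection can be contrasted in a short sketch. For brevity the five clusters are filled with consecutive member numbers rather than the exact club listing above, which is an assumption, as are the seed and the choice of 2 clusters with 4 members each.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable

# Five clusters of 20 members each (hypothetical membership, numbered 0 to 99).
clusters = {number: list(range((number - 1) * 20, number * 20))
            for number in range(1, 6)}

# One-stage: pick one cluster at random and keep all of its members.
chosen_cluster = rng.choice(sorted(clusters))
one_stage_sample = clusters[chosen_cluster]

# Two-stage: pick 2 of the 5 clusters, then sample 4 members within each.
chosen_pair = rng.sample(sorted(clusters), 2)
two_stage_sample = [member
                    for number in chosen_pair
                    for member in rng.sample(clusters[number], 4)]

print(chosen_cluster, one_stage_sample)
print(chosen_pair, two_stage_sample)
```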
Cluster sampling …. Contd.
Stratified Sampling | Cluster Sampling
Sub groups / Strata | Sub-population or clusters
Mutually exclusive & exhaustive | Mutually exclusive & exhaustive
Within stratum, elements are homogeneous. | Within cluster, elements are heterogeneous.
High degree of heterogeneity between strata. | Between clusters - high degree of homogeneity.
Less sampling error. | More prone to sampling error.
Objective - to increase precision. | Objective - increase sampling efficiency by decreasing cost
AREA SAMPLING - common form of cluster sampling where clusters consist of geographic areas (districts,
housing blocks or townships). Could be one-stage, two-stage, or multi-stage.
Step 1: Determine the geographic area to be surveyed,
identify its subdivisions (ten blocks)
Step 2: Decide on the use of one-step, two-step or multi-step cluster sampling.
Step 3: Using random numbers, select the housing blocks (units) to be sampled.
Select any 4 blocks randomly.
Step 4: In each of the chosen housing block identify a random starting point and follow the right hand rule.
Probability Sampling Techniques – Summing up
SRS (Simple random sampling)
  Advantages: Easily understood, Results projectable, Requires minimum knowledge of the population to be sampled
  Disadvantages: Difficult to construct sampling frame, Expensive (difficulty in reach – geographic spread), Lower precision, No assurance of representativeness (Target population)
Systematic Random
  Advantages: Can increase representativeness, Less costly and easier to implement than SRS (only one random point chosen), Sampling frame not necessary (can be used even without the knowledge of the Element)
  Disadvantages: Can decrease representativeness if no ordered pattern, Kth person may have a periodical order
Stratified Random
  Advantages: Include all sub-populations, Differs from Quota sampling as elements are chosen randomly rather than on convenience or judgment, More precision
  Disadvantages: Difficult to select relevant stratification variables, Need details for all population members (SRS), Not feasible to stratify on many variables (ideally not more than two), Expensive
Cluster Sampling
  Advantages: Easy to implement, Don't need details of the entire population, Cost effective
  Disadvantages: Imprecise (unequal probability of selection, fewer Sampling points), Increases complexity to compute and interpret results
To end it all
The mean is a measure of location,
The center of a population.
If at random a score you drew,

The mean's the most likely score you'd view.
You could compute the mean in your slumber.

Sum the scores and divide by the number.
At the mean sample scores converge;
From the mean these scores diverge.
Near the mean the scores are many.
In the tails, there's hardly any.
To measure a distribution's variation,

From the mean find each score's deviation.
Each difference of D score now you square.
Sum all D scores, all scores' share.
Now this sum divide by N.
That's V, the variance, then.
The square root of V is called S.D.,

The gauge of a trait's variability.
We've found two moments of a distribution,
Developed from each score's contribution.
Picturing a universe, try to see,

Its center's the mean; its orbit, S.D.
