2014
Great takes on Statistics
"There are three ways to not tell the truth: lies, damned lies, and statistics." …
"Statistics show that of those who contract the habit of eating, very few survive." …
"The statistics on sanity are that one out of every four Americans is suffering from some form of mental illness. Think of your three best friends. If they're okay, then it's you." …
[Diagram labels: Variable, Population, Sample, Data, Parameter, Statistics, Descriptive Statistics, Inference?]
Central dogma of Statistics
[Diagram labels: Population, Sample, Probability, Descriptive Statistics, Inferential Statistics]
Concept of Frequency and Basic Exploratory Analysis
Types: Categorical; Random (Discrete, Continuous)
NOIR: Nominal, Ordinal, Interval, Ratio
Flavors of yogurt
Instructors classified as : Easy, Difficult or Impossible
Employee evaluations classified as : Excellent, Average, Poor
Religions
Political parties
Commuting times to school
Years (AD) of important historical events
Ages (in years) of statistics students
Ice cream flavor preference
Amount of money in savings accounts
Students classified by their reading ability : Above average, Below average, Normal
Descriptive Statistics
Variability refers to the spread of the data from the center value
Location (Central Tendency)    Variability
Mean                           Variance
Mode                           Standard deviation
Median                         Range
99.94
Mean, Median & Mode
Mean is the simple average of a data set
Median is the middle of a ranked distribution
Mode is the most common data value
Level of Measurement    Measures of Central Tendency
Nominal                 Mode
Ordinal                 Median, Mode
Interval/ratio          Mean, Median, Mode
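As a rough illustration of the table above, the sketch below computes only the measures that are meaningful at each level of measurement. The three small datasets and the helper name central_tendency are hypothetical and were invented for this example; they are not part of the original slides.

```python
from statistics import mean, median, mode

# Measures that are meaningful at each level, as listed in the table above.
ALLOWED = {
    "nominal": ["mode"],
    "ordinal": ["median", "mode"],
    "interval/ratio": ["mean", "median", "mode"],
}
MEASURES = {"mean": mean, "median": median, "mode": mode}


def central_tendency(values, level):
    """Return only the measures that make sense for the given level."""
    return {name: MEASURES[name](values) for name in ALLOWED[level]}


# Hypothetical data, invented for illustration only.
flavors = ["vanilla", "mango", "vanilla", "plain"]  # nominal
ratings = [1, 2, 3, 3, 3]  # ordinal codes: 1=Poor, 2=Average, 3=Excellent
ages = [19, 20, 20, 21, 25]  # ratio

print(central_tendency(flavors, "nominal"))
print(central_tendency(ratings, "ordinal"))
print(central_tendency(ages, "interval/ratio"))
```

Coding the ordinal categories as ranks is one possible convention; the median is then read back as a category, not as a quantity.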
Measures of Dispersion
Consider Teams 1 and 2 below
Team member               1   2   3   4   5
Team 1 height (inches)   72  73  76  76  78
Team 2 height (inches)   67  72  76  76  84
Definition:
• Measures of dispersion describe the data spread or how far the measurements are
from the center
• The more similar the scores are to each other, the lower the measure of
dispersion will be
• The less similar the scores are to each other, the higher the measure of
dispersion will be
• In general, the more spread out a distribution is, the larger the measure of
dispersion will be
Distributions which are highly dispersed
• Consider the following distributions
Stock prices in an index
Bollywood movie collections
Test scores
Hitting bulls-eye on a dart-board
Income
Measures of Dispersion
• Range
• Interquartile range
• Mean absolute deviation (MAD)
• Variance/Standard deviation
• Gini coefficient
Range
• Disadvantages
• Ignores the way data is distributed
• Uses only two data values (the minimum and the maximum)
• Highly sensitive to outliers
Team member               1   2   3   4   5
Team 1 height (inches)   72  73  76  76  78
Team 2 height (inches)   67  72  76  76  84
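A minimal sketch of the range for the two teams above. The helper name value_range and the extra 90-inch player are hypothetical; the latter is added only to show how a single extreme value changes the result.

```python
# Heights in inches, taken from the two teams in the table above.
team1 = [72, 73, 76, 76, 78]
team2 = [67, 72, 76, 76, 84]


def value_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)


print(value_range(team1))  # 6
print(value_range(team2))  # 17

# Hypothetical: one very tall player joins Team 1.
team1_with_outlier = team1 + [90]
print(value_range(team1_with_outlier))  # 18
```

Only the two extreme values enter the calculation, so one unusual observation can triple the range while the rest of the data stays the same.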
Definition: The inter-quartile range is a measure that indicates the extent to which the central 50% of
values within the dataset are dispersed. It is based upon, and related to, the median.
[Figure: data divided into quarters Q1, Q2, Q3, Q4]
Advantages/Disadvantages
• Provides a clearer picture of the overall dataset by ignoring the outlying values.
• Like the range, however, the inter-quartile range is a measure of dispersion that is based upon only two values from the dataset (the first and third quartiles)
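The inter-quartile range can be sketched with Python's standard statistics module, using the team heights from the earlier slides. Note that several quartile conventions exist; the default "exclusive" method of statistics.quantiles is used here as an assumption, and other conventions may give slightly different numbers.

```python
from statistics import quantiles

# Heights in inches, from the Team 1 / Team 2 example.
team1 = [72, 73, 76, 76, 78]
team2 = [67, 72, 76, 76, 84]


def iqr(values):
    """Inter-quartile range: third quartile minus first quartile."""
    q1, _median, q3 = quantiles(values, n=4)  # default method="exclusive"
    return q3 - q1


print(iqr(team1))  # 4.5
print(iqr(team2))  # 10.5
```

Both teams share the same median (76), yet the IQR shows that the middle of Team 2 is more spread out.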
Mean Absolute Deviation
Definition: The mean absolute deviation is the mean of the absolute values of the deviations from the mean, and is denoted by mad. In the formula that follows, the notation mean(x) denotes the mean of the x measurements.
\[ \mathrm{mad} = \frac{1}{n} \sum_{i=1}^{n} \lvert x_i - \mathrm{mean}(x) \rvert \]
Disadvantages
• Consider these two datasets. Both give a MAD of 4, though the 2nd dataset is more spread out
Deviations, dataset 1: +4, +4, -4, -4
Deviations, dataset 2: +7, +1, -6, -2
Dataset 1: (|4| + |4| + |-4| + |-4|) / 4 = (4+4+4+4) / 4 = 4
Dataset 2: (|7| + |1| + |-6| + |-2|) / 4 = (7+1+6+2) / 4 = 4
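The point above can be checked numerically. In this sketch the raw values are hypothetical: they were chosen around an assumed mean of 10 so that their deviations match the two sets shown on the slide. The standard deviation is printed alongside to show that it does separate the two datasets.

```python
from statistics import mean, pstdev


def mean_absolute_deviation(values):
    """Mean of the absolute deviations from the mean."""
    centre = mean(values)
    return sum(abs(x - centre) for x in values) / len(values)


# Hypothetical raw values with mean 10, reproducing the slide's deviations.
data1 = [14, 14, 6, 6]   # deviations +4, +4, -4, -4
data2 = [17, 11, 4, 8]   # deviations +7, +1, -6, -2

print(mean_absolute_deviation(data1))  # 4.0
print(mean_absolute_deviation(data2))  # 4.0

# The population standard deviation is larger for the second dataset.
print(pstdev(data1), pstdev(data2))
```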
Variance/ Standard Deviation
Definition of variance
The variance indicates how close to or far from the mean are most of the cases for a particular variable.
The smaller the value of the variance, the more the cases are concentrated around the value of the
mean; the larger the value of the variance, the more spread out away from the mean are the cases.
Calculation
1. Calculate the difference of each data point from the mean: deviation of x = x – mean
2. What is the problem when we take the sum of deviations? (The positive and negative deviations cancel, so the sum is always zero.)
3. Square the deviations and sum them
4. Divide by the no. of observations
Variance = 24/5 = 4.8
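The four steps can be traced in a few lines of Python. The Team 1 heights from the earlier slide are used here because they reproduce the slide's result of 24/5; that the slide used exactly this dataset is an assumption.

```python
from math import sqrt

# Team 1 heights in inches (assumed to be the data behind "=24/5").
heights = [72, 73, 76, 76, 78]
n = len(heights)
mean_height = sum(heights) / n  # 75.0

# Step 1: deviation of each data point from the mean.
deviations = [x - mean_height for x in heights]

# Step 2: the deviations always sum to zero, so their sum says nothing about spread.
print(sum(deviations))  # 0.0

# Step 3: square the deviations and sum them.
squared_sum = sum(d ** 2 for d in deviations)  # 24.0

# Step 4: divide by the number of observations.
variance = squared_sum / n
std_dev = sqrt(variance)

print(variance)           # 4.8
print(round(std_dev, 2))  # 2.19
```

Dividing by n gives the population variance, as on the slide; a sample variance would divide by n - 1 instead.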
Variance/Standard Deviation
Result:
• Increasing contribution to the variance as you go farther from the mean
• Variance is somewhat arbitrary. But if you could “standardize” that value, you could talk about any variance (i.e. deviation) in equivalent terms
Advantages/Disadvantages
• The variance is not the simplest or easiest to understand measure of dispersion for interval/ratio
variables
• Statisticians prefer it because it provides an excellent basis for some very important multivariate statistics
Variance/Standard Deviation
Standard deviation = √(24/5) = √4.8 ≈ 2.19
Definition:
Calculation
Disadvantage (Gini coefficient)
• Not additive across groups, i.e. the total Gini of a society is not equal to the sum of the Ginis for its sub-groups
Measures of Central Tendency and Dispersion by Level of Measurement
Positive Correlation
Association between variables such that high scores on one variable tend to go with high scores on the other variable
A direct relation between the variables
Negative Correlation
Association between variables such that high scores on one variable tend to
have low scores on the other variable
An inverse relation between the variables
Correlation
Covariance and Correlation quantify relationship
[Scatter plot: Lung Capacity (vertical axis) against Smoking (yrs) (horizontal axis)]
rxy = -0.96 indicates a very strong negative relationship: a smoker is almost certain to have diminished lung capacity
Greater smoking exposure implies greater likelihood of lung damage
Calculation for Correlation
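Cigs (X)    X²     XY      Y²      Cap (Y)
0           0      0       2025    45
5           25     210     1764    42
10          100    330     1089    33
15          225    465     961     31
20          400    580     841     29
Sum: 50     750    1585    6680    180
\[ r_{xy} = \frac{5(1585) - 50(180)}{\sqrt{\left[5(750) - 50^2\right]\left[5(6680) - 180^2\right]}} \]
\[ r_{xy} = \frac{7925 - 9000}{\sqrt{(3750 - 2500)(33400 - 32400)}} = \frac{-1075}{\sqrt{(1250)(1000)}} \]
\[ r_{xy} = -0.9615 \]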
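The same calculation can be sketched in Python from the five (X, Y) pairs in the table. The variable names are illustrative only.

```python
from math import sqrt

# Data from the table: smoking exposure (X) and lung capacity (Y).
cigs = [0, 5, 10, 15, 20]
cap = [45, 42, 33, 31, 29]

n = len(cigs)
sum_x = sum(cigs)                               # 50
sum_y = sum(cap)                                # 180
sum_xy = sum(x * y for x, y in zip(cigs, cap))  # 1585
sum_x2 = sum(x * x for x in cigs)               # 750
sum_y2 = sum(y * y for y in cap)                # 6680

# Computational formula for the Pearson correlation coefficient.
numerator = n * sum_xy - sum_x * sum_y          # -1075
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r_xy = numerator / denominator

print(round(r_xy, 4))  # -0.9615
```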
Covariance
Variables that covary inversely, like smoking and lung capacity, tend to appear on
opposite sides of the group means
When smoking is above its group mean, lung capacity tends to be below its group
mean.
The average product of the deviations measures the extent to which the variables covary, i.e. the degree of linkage between them
\[ \bar{X} = 10, \qquad \bar{Y} = 36 \]
Evaluation yields,
\[ S_{xy} = \frac{1}{4}(-215) = -53.75 \]
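A short sketch of the covariance evaluation, using the smoking and lung-capacity values from the correlation table. It also shows, as a cross-check, that dividing the covariance by the product of the two sample standard deviations gives the correlation coefficient.

```python
from statistics import mean, stdev

# Data from the correlation table: smoking exposure (X) and lung capacity (Y).
cigs = [0, 5, 10, 15, 20]
cap = [45, 42, 33, 31, 29]

mean_x = mean(cigs)  # 10
mean_y = mean(cap)   # 36

# Sum of the products of paired deviations from the group means.
product_sum = sum((x - mean_x) * (y - mean_y) for x, y in zip(cigs, cap))

# Sample covariance: divide by n - 1 (here 4).
s_xy = product_sum / (len(cigs) - 1)

print(product_sum)  # -215
print(s_xy)         # -53.75

# Cross-check: correlation = covariance / (s_x * s_y).
r_xy = s_xy / (stdev(cigs) * stdev(cap))
print(round(r_xy, 4))  # -0.9615
```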
Introduction to Normal distribution
http://www.youtube.com/watch?v=dr1DynUzjq0&list=PLCkLQOAPOtT2H1hJRUxUYOxThRwfVI9jI
Normal distribution
Characteristics of a normal distribution
Z score
• If we know the population mean and population standard deviation, for
any value of X we can compute a z-score by subtracting the population
mean and dividing the result by the population standard deviation
\[ z = \frac{X - \mu}{\sigma} \]
Normal distribution
• Z-score tells us how far above or below the mean a value is in terms of
standard deviations
• It is a linear transformation of the original scores
• A linear transformation is multiplication (or division) of and/or addition to (or subtraction from) X by a constant
• Relationship of the observations to each other remains the same
• Z = (X − μ)/σ
• X = σZ + μ
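The two formulas above can be sketched as a pair of small functions. The population mean of 70 and standard deviation of 4 are hypothetical values chosen for illustration; the function names are also illustrative.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma


def from_z(z, mu, sigma):
    """Inverse transformation: recover the original score from its z-score."""
    return z * sigma + mu


# Hypothetical population parameters, for illustration only.
mu, sigma = 70, 4

scores = [62, 70, 78]
z_values = [z_score(x, mu, sigma) for x in scores]
print(z_values)  # [-2.0, 0.0, 2.0]

# The transformation is linear, so the ordering of the observations is unchanged,
# and converting back returns the original scores.
print([from_z(z, mu, sigma) for z in z_values])  # [62.0, 70.0, 78.0]
```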
Normal or not? There are statistical ways to check
rather than guessing….
Performance rankings
Heights of men
Exercise time
Weekly wages
Waiting time for traffic
Sales incentive
Applications of Normal Distribution
• Normal distribution
lends itself to
statistical properties
Not all distributions are symmetric and beautiful as the Normal distribution
Shapes and Peaks of Distribution
Skewness Kurtosis
Tossing a coin
No. of calls at work
Waiting time in a queue
Throwing dice
Characteristics (binomial distribution)
• The number of observations n is fixed
• Each observation is independent
• Each observation represents one of two outcomes ("success" or "failure")
• The probability of "success" p is the same for each outcome
Characteristics (Poisson distribution)
• Discrete probability distribution for the counts of events that occur randomly in a given interval of time (or space). The no. of trials is not fixed
• Used as a model for the no. of events in a specific time period in which the number of successes is recorded.
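The two sets of characteristics above describe the binomial distribution (fixed n, two outcomes, constant p) and the Poisson distribution (counts of events in an interval). A minimal sketch of their probability mass functions follows; the rate of 3 calls per hour is a hypothetical value chosen for illustration.

```python
from math import comb, exp, factorial


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(exactly k events in an interval when the average count is lam)."""
    return lam ** k * exp(-lam) / factorial(k)


# Tossing a fair coin 10 times: probability of exactly 5 heads.
print(binomial_pmf(5, 10, 0.5))  # 0.24609375

# Hypothetical: calls arrive at an average of 3 per hour.
# Probability of receiving no calls in a given hour.
print(poisson_pmf(0, 3))
```

In the binomial case k can never exceed n, whereas a Poisson count has no fixed upper limit, which mirrors the "no. of trials not fixed" characteristic.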
Sampling
• Save costs: Less expensive to study the sample than the population.
• Save time: Less time needed to study the sample than the population .
• Accuracy: Since sampling is done with care and studies are conducted by skilled and qualified
interviewers, the results are expected to be accurate
Precision
Cost
Sampling – Process
Implement
Sample Characteristics
Register of electors
Telephone book
Database of customers
Patient list
Pupil register
Hospital records
Census
Unit
Element
Sampling
• Probability Sampling
• Non-Probability Sampling
Non-Probability Sampling
Advantages Disadvantages
Probability Sampling
• Simple Random Sampling
• Systematic Sampling
• Stratified Random Sampling
• Cluster Sampling
Examples of Probability Sampling
Suppose you were interested in investigating the link between the family of origin and income and your particular
interest is in comparing incomes of Hispanic and Non-Hispanic respondents. For statistical reasons, you decide that you
need at least 1,000 non-Hispanics and 1,000 Hispanics. Hispanics comprise around 6 or 7% of the population
Let's suppose your sampling frame is a large city's telephone book that has 2,000,000 entries. To take a SRS, you need to
associate each entry with a number and choose n= 200 numbers from N= 2,000,000
Suppose you wanted to study dance club and bar employees in NYC with a sample of n = 600. Yet there is no list of these
employees from which to draw a simple random sample. Suppose you obtained a list of all bars/clubs in NYC
Say that you're interested in how job satisfaction varies by race among a group of employees at a firm. To explore this
issue, we need to create a sample of the employees of the firm. However, the employee population at this particular firm
is predominantly white
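For the telephone-book scenario above, a simple random sample can be sketched as follows. Each entry is identified by its position 0 to N-1, which is an assumption about how the entries would be numbered, and the seed is arbitrary and used only to make the example repeatable.

```python
import random

N = 2_000_000  # entries in the sampling frame (telephone book)
n = 200        # required sample size

rng = random.Random(42)  # arbitrary seed, for a repeatable illustration

# Draw n distinct entry numbers; every entry has the same chance of selection.
selected = rng.sample(range(N), n)

print(len(selected))          # 200
print(sorted(selected)[:5])   # the five smallest selected entry numbers
```

Sampling without replacement guarantees that no entry is chosen twice.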
Systematic sampling – Quasi Random sampling
2. Only one random number is needed throughout the entire sampling process.
5. When the ordering of the elements is related to the characteristic of interest, systematic
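The point that only one random number is needed can be sketched as follows: a random starting position is drawn once, and every k-th element is taken after it. The population size of 100 and sample size of 20 are assumed values for illustration, and the function name is hypothetical.

```python
import random


def systematic_sample(population_size, sample_size, rng=random):
    """Return the positions selected by systematic sampling."""
    k = population_size // sample_size  # sampling interval
    start = rng.randrange(k)            # the only random number drawn
    return [start + i * k for i in range(sample_size)]


# Assumed sizes, for illustration only.
positions = systematic_sample(100, 20, random.Random(7))

print(positions[:4])   # first four selected positions, 5 apart
print(len(positions))  # 20
```

Because the positions are equally spaced, any periodic pattern in the ordering of the list carries straight into the sample.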
• Step 2- Elements are selected from each by being easy to measure and apply.
- population divided into strata, then random sampling from within each stratum
To select a proportionate stratified sample of 20 members of the Cosmopolitan Club, which has 100 members belonging to three language groups, i.e., English (E), Hindi (H) and French (F):
Step 1: Identify each member from the list by his or her respective language groups
00 (E)   15 (H)   30 (E)   45 (E)   60 (F)   75 (E)   90 (F)
01 (E)   16 (E)   31 (E)   46 (F)   61 (H)   76 (E)   91 (E)
02 (F)   17 (F)   32 (E)   47 (H)   62 (H)   77 (H)   92 (H)
03 (E)   18 (F)   33 (H)   48 (E)   63 (E)   78 (H)   93 (E)
04 (E)   19 (H)   34 (E)   49 (E)   64 (E)   79 (E)   94 (E)
05 (E)   20 (H)   35 (H)   50 (E)   65 (F)   80 (H)   95 (F)
06 (H)   21 (F)   36 (E)   51 (H)   66 (H)   81 (E)   96 (E)
07 (H)   22 (E)   37 (E)   52 (F)   67 (E)   82 (E)   97 (E)
08 (E)   23 (F)   38 (F)   53 (H)   68 (H)   83 (H)   98 (H)
09 (E)   24 (E)   39 (F)   54 (E)   69 (E)   84 (F)   99 (E)
10 (H)   25 (H)   40 (E)   55 (E)   70 (E)   85 (E)
11 (E)   26 (E)   41 (F)   56 (H)   71 (E)   86 (E)
12 (F)   27 (H)   42 (F)   57 (E)   72 (H)   87 (H)
13 (H)   28 (F)   43 (E)   58 (H)   73 (E)   88 (F)
14 (E)   29 (E)   44 (H)   59 (H)   74 (F)   89 (E)
Selection of a proportionate Stratified Sample
Step 2: Sub-divide the population into three homogeneous groups or language strata
English Language Stratum (50)    Hindi Language Stratum (30)    French Language Stratum (20)
00 22 40 64 82                   06 35 66                       02 42
01 24 43 67 85                   07 44 68                       12 46
03 26 45 69 86                   10 47 72                       17 52
04 29 48 70 89                   13 51 77                       18 60
05 30 49 71 91                   15 53 78                       21 65
08 31 50 73 93                   19 56 80                       23 74
09 32 54 75 94                   20 58 83                       28 84
11 34 55 76 96                   25 59 87                       38 88
14 36 57 79 97                   27 61 92                       39 90
16 37 63 81 99                   33 62 98                       41 95
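One way to complete the selection is sketched below: each stratum contributes members in proportion to its share of the club, and members are then drawn at random within each stratum. The member numbers are copied from the strata above; the function name and the seed are hypothetical, and simple rounding of the quotas is an assumption that happens to work for these stratum sizes.

```python
import random

# Member numbers by language stratum, copied from the table above.
strata = {
    "English": [0, 1, 3, 4, 5, 8, 9, 11, 14, 16, 22, 24, 26, 29, 30, 31, 32,
                34, 36, 37, 40, 43, 45, 48, 49, 50, 54, 55, 57, 63, 64, 67,
                69, 70, 71, 73, 75, 76, 79, 81, 82, 85, 86, 89, 91, 93, 94,
                96, 97, 99],
    "Hindi": [6, 7, 10, 13, 15, 19, 20, 25, 27, 33, 35, 44, 47, 51, 53, 56,
              58, 59, 61, 62, 66, 68, 72, 77, 78, 80, 83, 87, 92, 98],
    "French": [2, 12, 17, 18, 21, 23, 28, 38, 39, 41, 42, 46, 52, 60, 65, 74,
               84, 88, 90, 95],
}


def proportionate_stratified_sample(strata, sample_size, rng):
    """Draw from each stratum in proportion to its share of the population."""
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        quota = round(sample_size * len(members) / total)  # proportional share
        sample[name] = rng.sample(members, quota)
    return sample


sample = proportionate_stratified_sample(strata, 20, random.Random(1))
print({name: len(chosen) for name, chosen in sample.items()})
# {'English': 10, 'Hindi': 6, 'French': 4}
```

With strata of 50, 30 and 20 members, a sample of 20 splits into 10, 6 and 4; for stratum sizes that do not divide evenly, the rounded quotas may need adjusting so that they add up to the target.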
• A two-step-process:
– Step 1- Defined population is divided into number of mutually exclusive and collectively
exhaustive sub population groups or clusters;
– Step 2- Select an independent simple random sample of clusters.
Cluster Sampling
Consider the same Cosmopolitan Club example involving 100 club members:
Step 1: Sub-divide the club members into 5 clusters, each containing 20 members
Step 2: Select one of the 5 clusters. If cluster 4 is selected, then all its elements are selected.
Step 3: In a two-stage cluster sampling, the researcher may randomly select 4 members from each of the five
clusters or the researcher may select 2 clusters out of 5 and then sample randomly within selected clusters
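The three steps can be sketched as follows. How the 100 members are assigned to clusters is not stated, so consecutive member numbers are assumed here; the within-cluster sample size of 10 in the last variant and the seed are also assumptions made for illustration.

```python
import random

rng = random.Random(0)  # arbitrary seed, for a repeatable illustration

# Step 1: divide the 100 members into 5 clusters of 20 (consecutive numbers assumed).
members = list(range(100))
clusters = [members[i:i + 20] for i in range(0, 100, 20)]

# Step 2 (one-stage): pick one cluster at random and take all of its elements.
one_stage = list(rng.choice(clusters))

# Step 3 (two-stage), first variant: 4 random members from each of the 5 clusters.
two_stage_all = [m for cluster in clusters for m in rng.sample(cluster, 4)]

# Step 3 (two-stage), second variant: pick 2 of the 5 clusters,
# then sample randomly within each (10 per cluster is an assumed size).
two_stage_some = [m for cluster in rng.sample(clusters, 2)
                  for m in rng.sample(cluster, 10)]

print(len(one_stage), len(two_stage_all), len(two_stage_some))  # 20 20 20
```

All three designs yield 20 members, but they differ in how many clusters have to be visited, which is usually what drives the cost.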
Cluster sampling …. Contd.
AREA SAMPLING - common form of cluster sampling where clusters consist of geographic areas (districts,
housing blocks or townships). Could be one-stage, two-stage, or multi-stage.
Step 3: Using random numbers, select the housing blocks (units) to be sampled.
Step 4: In each of the chosen housing block identify a random starting point and follow the right hand rule.
Probability Sampling Techniques – Summing up
Advantages Disadvantages