
Introduction to Basic Statistical Methodology

(with an emphasis on biomedical applications, using R)



Lecture Notes, Spring 2014

Ismor Fischer, Ph.D.


































Statistics plays a role in determining whether sources of variation
between groups and/or within groups are simply the effects of random
chance, or are attributable to genuine, nonrandom differences.

[Cover images, illustrating natural variation: variation in littermates; orchids of the genus Phalaenopsis; Darwin's finches; the highly variable Kallima paralekta (Malayan Dead Leaf Butterfly); biodiversity in Homo sapiens; variability in female forms of Papilio dardanus (Mocker Swallowtail), Madagascar, where the tailed specimen is male but the tailless female morphs are mimics of different poisonous butterfly species; Australian boulder opals; the normal curve X ~ N(μ, σ).]



To the memory of my late wife,

~ Carla Michele Blum ~

the sweetest and wisest person I ever met,

taken far too young...

Mar 21, 1956 - July 6, 2006
Introduction to Basic Statistical Methods
Note: Underlined headings are active webpage links!
0. Course Preliminaries
Course Description
A Brief Overview of Statistics

1. Introduction
1.1 Motivation: Examples and Applications
1.2 The Classical Scientific Method and Statistical Inference
1.3 Definitions and Examples
1.4 Some Important Study Designs in Medical Research
1.5 Problems

2. Exploratory Data Analysis and Descriptive Statistics
2.1 Examples of Random Variables and Associated Data Types
2.2 Graphical Displays of Sample Data
Dotplots, Stemplots,
Histograms: Absolute Frequency, Relative Frequency, Density
2.3 Summary Statistics
Measures of Center: Mode, Median, Mean,... (+ Shapes of Distributions)
Measures of Spread: Range, Quartiles, Variance, Standard Deviation
2.4 Summary: Parameters vs. Statistics, Expected Values, Bias, Chebyshev's Inequality
2.5 Problems
3. Theory of Probability
3.1 Basic Ideas, Definitions, and Properties
3.2 Conditional Probability and Independent Events (with Applications)
3.3 Bayes' Formula
3.4 Applications
Diagnostic: Sensitivity, Specificity, Predictive Power, ROC curves
Epidemiological: Odds Ratios, Relative Risk
3.5 Problems
4. Classical Probability Distributions
4.1 Discrete Models: Binomial Distribution, Poisson Distribution, ...
4.2 Continuous Models: Normal Distribution, ...
4.3 Problems

5. Sampling Distributions and the Central Limit Theorem
5.1 Motivation
5.2 Formal Statement and Examples
5.3 Problems
6. Statistical Inference and Hypothesis Testing
6.1 One Sample
6.1.1 Mean (Z- and t-tests, Type I and II Error, Power & Sample Size)
6.1.2 Variance (Chi-squared Test)
6.1.3 Proportion (Z-test)
6.2 Two Samples
6.2.1 Means (Independent vs. Paired Samples, Nonparametric tests)
6.2.2 Variances (F-test, Levene Test)
6.2.3 Proportions (Z-test, Chi-squared Test, McNemar Test)
Applications: Case-Control Studies, Test of Association
and Test of Homogeneity of Odds Ratios, Mantel-Haenszel
Estimate of Summary Odds Ratio
6.3 Several Samples
6.3.1 Proportions (Chi-squared Test)
6.3.2 Variances (Bartlett's Test, etc.)
6.3.3 Means (ANOVA, F-test, Multiple Comparisons)
6.4 Problems

7. Correlation and Regression
7.1 Motivation
7.2 Linear Correlation and Regression (+ Least Squares Approximation)
7.3 Extensions of Simple Linear Regression
Transformations (Power, Logarithmic, ...)
Multilinear Regression (ANOVA, Model Selection, Drug-Drug Interaction)
Logistic Regression (Dose-Response Curves)
7.4 Problems

8. Survival Analysis
8.1 Survival Functions and Hazard Functions
8.2 Estimation: Kaplan-Meier Product-Limit Formula
8.3 Statistical Inference: Log-Rank Test
8.4 Linear Regression: Cox Proportional Hazards Model
8.5 Problems

APPENDIX

A1. Basic Reviews
Logarithms
Perms & Combos

A2. Geometric Viewpoint
Mean and Variance
ANOVA
Least Squares Approximation

A3. Statistical Inference
Mean, One Sample
Means & Proportions, One & Two Samples
General Parameters & FORMULA TABLES





A4. Regression Models
Power Law Growth
Exponential Growth
Multilinear Regression
Logistic Regression
Example: Newton's Law of Cooling

A5. Statistical Tables
Z-distribution
t-distribution
Chi-squared distribution
F-distribution (in progress...)


Even genetically identical organisms, such as these inbred mice,
can exhibit a considerable amount of variation in physical and/or
behavioral characteristics, due to random epigenetic differences
in their development. But statistically, how large must such
differences be in order to reject random chance as their sole
cause, and accept that an alternative mechanism is responsible?
Source: Nature Genetics, November 2, 1999.
Ismor Fischer, 7/20/2010
i
Course Description for
Introduction to Basic Statistical Methodology

Ismor Fischer, UW Dept of Statistics, UW Dept of Biostatistics and Medical Informatics

Objective: The overall goal of this course is to provide students with an overview of fundamental
statistical concepts, and a practical working knowledge of the basic statistical techniques they are likely
to encounter in applied research and literature review contexts, with some basic programming in R.
An asterisk (*) indicates a topic only relevant to Biostatistics courses. Lecture topics include:

I. Introduction. General ideas, interpretation, and terminology: population, random sample, random
variable, empirical data, etc. Describing the formal steps of the classical scientific method (hypothesis,
experiment, observation, analysis, and conclusion) to determine if sources of variation in a system are
genuinely significant or due to random chance effects. General study design considerations: prospective
(e.g., randomized clinical trials, cohort studies) versus retrospective (e.g., case-control studies).*

II. Exploratory Data Analysis and Descriptive Statistics. Classification of data: numerical (continuous,
discrete) and categorical (nominal, including binary, and ordinal). Graphical displays of data: tables,
histograms, stemplots, boxplots, etc. Summary Statistics: measures of center (sample mean, median,
mode), measures of spread (sample range, variance, standard deviation, quantiles), etc., of both grouped
and ungrouped data. Distributional summary using Chebyshev's Inequality.

III. Probability Theory. Basic definitions: experiment, outcomes, sample space, events, probability. Basic
operations on events and their probabilities, including conditional probability, independent events.
Specialized concepts include diagnostic tests (sensitivity and specificity, Bayes' Theorem, ROC curves),
relative risk and odds ratios in case-control studies.*

IV. Probability Distributions and Densities. Probability tables, probability histograms and probability
distributions corresponding to discrete random variables, with emphasis on the classical Binomial and
Poisson models. Probability densities and probability distributions corresponding to continuous random
variables, with emphasis on the classical Normal (a.k.a. Gaussian) model.

V. Sampling Distributions and the Central Limit Theorem. Motivation, formal statement, and examples.

VI. Statistical Inference. Formulation of null and alternative hypotheses, and associated Type I and Type II
errors. One- and two-sided hypothesis testing methods for population parameters (mostly, means and
proportions) for one sample or two samples (independent or dependent), large (Z-test) or small (t-test).
Light treatment of hypothesis testing for population variances (χ²-test for one, F-test for two).
Specifically, for a specified significance level, calculation of confidence intervals, acceptance/rejection
regions, and p-values, and their application and interpretation. Power and sample size calculations.
Brief discussion of nonparametric (Wilcoxon) tests. Multiple comparisons: ANOVA tables for means,
χ² and McNemar tests on contingency tables for proportions. Mantel-Haenszel Method for multiple
2 × 2 tables (i.e., Test of Homogeneity → Summary Odds Ratio → Test of Association).*

VII. Linear Regression. Plots of scattergrams of bivariate numerical data, computation of sample
correlation coefficient r, and associated inference. Calculation and applications of corresponding least
squares regression line, and associated inferences. Evaluation of fit via coefficient of determination r²
and residual plot. Additional topics include: transformations (logarithmic and others), logistic regression
(e.g., dose-response curves), multilinear regression (including a brief discussion of drug-drug
interaction*, ANOVA formulation and model selection techniques).

VIII. Survival Analysis.* Survival curves, hazard functions, Kaplan-Meier Product-Limit Estimator, Log-
Rank Test, Cox Proportional Hazards Regression Model.
Ismor Fischer, 1/4/2011 ii

In complex dynamic systems such as biological organisms, how is it possible to
distinguish genuine, or "statistically significant," sources of variation
from purely random chance effects? Why is it important to do so?
Consider the following three experimental scenarios:

In a clinical trial designed to test the efficacy of a new drug, participants are randomized to
either a control arm (e.g., a standard drug or placebo) or a treatment arm, and carefully
monitored over time. After the study ends, the two groups are then compared to determine
if the differences between them are statistically significant or not.

In a longitudinal study of a cohort of individuals, the strength of association between a
disease such as COPD (Chronic Obstructive Pulmonary Disease) or lung cancer, and
exposure to a potential risk factor such as smoking, is estimated and determined to be
statistically significant.

By formulating an explicit mathematical model, an investigator wishes to describe how
much variation in a response variable, such as mean survival time after disease diagnosis in
a group of individuals, can be deterministically explained in terms of one or more
statistically significant predictor variables with which it is correlated.

This first course is an introduction to the basic but powerful techniques of statistical analysis:
techniques which formally implement the fundamental principles of the classical scientific
method in the general context of biomedical applications. How to:

1. formulate a hypothesis about some characteristic of a variable quantity measured on a
population (e.g., mean cholesterol level, proportion of treated patients who improve),

2. classify different designs of experiment that generate appropriate sample data
(e.g., randomized clinical trials, cohort studies, case-control studies),

3. investigate ways to explore, describe and summarize the resulting empirical observations
(e.g., visual displays, numerical statistics),

4. conduct a rigorous statistical analysis (e.g., by comparing the empirical results with a
known reference obtained from Probability Theory), and finally,

5. infer a conclusion (i.e., whether or not the original hypothesis is rejected) and
corresponding interpretation (e.g., whether or not there exists a genuine treatment effect).

These important biostatistical techniques form a major component in much of the currently
active research that is conducted in the health sciences, such as the design of safe and effective
pharmaceuticals and medical devices, epidemiological studies, patient surveys, and many other
applications. Lecture topics and exams will include material on:

Exploratory Data Analysis of Random Samples

Probability Theory and Classical Population Distributions

Statistical Inference and Hypothesis Testing

Regression Models

Survival Analysis
Ismor Fischer, 1/4/2011 iii
A Brief Overview of Statistics

Statistics is a quantitative discipline that allows objective general statements to be made about a
population of units (e.g., people from Wisconsin), from specific data, either numerical (e.g., weight in
pounds) or categorical (e.g., overweight / normal weight / underweight), taken from a random sample.
It parallels and implements the fundamental steps of the classical scientific method: (1) the formulation
of a testable null hypothesis for the population, (2) the design of an experiment specifically intended to
test this hypothesis, (3) the performance of which results in empirical observations, (4) subsequent
analysis and interpretation of the generated data set, and finally, (5) conclusion about the hypothesis.

Specifically, a reproducible scientific study requires an explicit measurable quantity, known as a
random variable (e.g., IQ, annual income, cholesterol level, etc.), for the population. This variable has
some ideal probability distribution of values in the population, for example, a bell curve (see figure),
which in turn has certain population characteristics, a.k.a. parameters, such as a numerical center
and spread. A null hypothesis typically conjectures a fixed numerical value (or sometimes, just a
largest or smallest numerical bound) for a specific parameter of that distribution. (In this example, its
center, as measured by the population mean IQ, is hypothesized to be 100.) After being visually
displayed by any of several methods (e.g., a histogram; see figure), empirical data can then be
numerically summarized via sample characteristics, a.k.a. statistics, that estimate these parameters
without bias. (Here, the sample mean IQ is calculated to be 117.) Finally, in a process known as
statistical inference, the original null hypothesis is either rejected or retained, based on whether or not
the difference between these two values (117 − 100 = 17) is statistically significant at some pre-
specified significance level (say, a 5% Type I error rate). If this difference is not significant (i.e., is
due to random chance variation alone), then the data tend to support the null hypothesis. However, if the
difference is significant (i.e., genuine, not due to random chance variation alone), then the data tend to
refute the null hypothesis, and it is rejected in favor of a complementary alternative hypothesis.

Formally, this decision is reached via the computation of any or all of three closely related quantities:

1) Confidence Interval = the observed sample statistic (117), plus or minus a margin of error.
This interval is so constructed as to contain the hypothesized parameter value (100) with a pre-
specified high probability (say, 95%), the confidence level. If it does not, then the null is rejected.

2) Acceptance Region = the hypothesized parameter value (100), plus or minus a margin of error.
This is constructed to contain the sample statistic (117), again at a pre-specified confidence level
(say, 95%). If it does not, then the null hypothesis is rejected.

3) p-value = a measure of how probable it is to obtain the observed sample statistic (117) or worse,
assuming that the null hypothesis is true, i.e., that the conjectured value (100) is really the true value
of the parameter. (Thus, the smaller the p-value, the less probable that the sample data support the
null hypothesis.) This tail probability (0%-100%) is formally calculated using a test statistic, and
compared with the significance level (see above) to arrive at a decision about the null hypothesis.
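For concreteness, all three quantities can be computed in a few lines of R for the IQ illustration above. The standard deviation (15) and sample size (25) below are purely hypothetical, since the overview does not specify them; this is only a sketch of the arithmetic, not part of the original example.

    mu0   <- 100                # hypothesized population mean
    xbar  <- 117                # observed sample mean
    sigma <- 15; n <- 25        # hypothetical values, for illustration only
    se    <- sigma / sqrt(n)    # standard error of the sample mean
    z     <- qnorm(0.975)       # multiplier for a 95% confidence level

    xbar + c(-1, 1) * z * se                 # 1) confidence interval around 117
    mu0  + c(-1, 1) * z * se                 # 2) acceptance region around 100
    2 * (1 - pnorm(abs(xbar - mu0) / se))    # 3) two-sided p-value

With these (hypothetical) inputs, the interval misses 100, the acceptance region misses 117, and the p-value is tiny; all three criteria lead to the same decision, as described above.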

Moreover, an attempt is sometimes made to formulate a mathematical model of a desired population
response variable (e.g., lung cancer) in terms of one or more predictor (or explanatory) variables
(e.g., smoking) with which it has some nonzero correlation, using sample data. Regression techniques
can be used to calculate such a model, as well as to test its validity.

This course will introduce the fundamental statistical methods that are used in all quantitative fields.
Material will include the different types of variable data and their descriptions, working the appropriate
statistical tests for a given hypothesis, and how to interpret the results accordingly in order to formulate a
valid conclusion for the population of interest. This will provide sufficient background to conduct basic
statistical analyses, understand the basic statistical content of published journal articles and other
scientific literature, and investigate more specialized statistical techniques if necessary.
Ismor Fischer, 1/4/2011 iv

[Diagram (the scientific method, IQ example):
POPULATION, with Random Variable X = IQ score, having an ideal distribution of values, and Null Hypothesis (about a parameter): Mean μ = 100.
Experiment to test hypothesis → RANDOM SAMPLE → Observations → Analysis of empirically-generated data (e.g., via a histogram), yielding the Statistic (estimate of parameter): Mean x̄ = 117.
Statistical Inference → Conclusion: Does the experimental evidence tend to support or refute the null hypothesis?]
1. Introduction







1.1 Motivation

1.2 Classical Scientific Method

1.3 Definitions and Examples

1.4 Medical Study Designs

1.5 Problems




Ismor Fischer, 5/29/2012 1.1-1
[Figure: axis of X = Survival (months), marked 10 to 40, indicating the population mean survival time.]
1. Introduction

1.1 Motivation: Examples and Applications

Is there a statistically significant difference in survival time between cancer
patients on a new drug treatment, and the standard treatment population?

An experimenter may have suspicions, but how are they formally tested? Select a
random sample of cancer patients, and calculate their mean survival time.

Design issues ~
How do we randomize? For that matter, why do we randomize? (Bias)

What is a statistically significant difference, and how do we detect it?

How large should the sample be, in order to detect such a difference if there is one?



Sample mean survival time = 27.0




Analysis issues ~

Is a mean difference of 2 months statistically significant, or possibly just due to
random variation? Can we formally test this, and if so, how?

Interpretation in context?

Similar problems arise in all fields where quantitative data analysis is required.

Ismor Fischer, 5/29/2012 1.1-2
[Figure: At time t (with t = 0 at launch), a projectile launched at angle of elevation θ with initial speed v0 is at the point P(x, y), where x = (v0 cos θ) t and y = (v0 sin θ) t − (1/2) g t²: DETERMINISTIC OUTCOMES. By contrast, a coin toss (Heads or Tails?) is a RANDOM OUTCOME, requiring Probability! Photo: DTW Terminal A fountain, http://www.metroairport.com]

Question: How can we prove objective statements about the
behavior of a given system, when random variation is present?


Example: Toss a small mass into space.



1. Hypothesis H0: The coin is fair (i.e., unbiased).

#tosses:    1  2  3  4  5  6  7  ...  n
outcome = ( H  T  H  H  T  T  H  ...  T )
#Heads:     1  1  2  3  3  3  4  ...  X

Definition: P(Heads) = lim_{n→∞} X/n   (= 0.5, if the coin is fair)

However, it is not possible to apply this formal definition in practice,
because we cannot toss the coin an infinite number of times. So....
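Although an infinite number of tosses is impossible, the limiting relative frequency can at least be illustrated by simulation. The following R sketch (an illustration, not part of the original notes) plots the running proportion of Heads for a simulated fair coin; the proportion settles near 0.5 as n grows.

    set.seed(1)                            # for reproducibility
    n      <- 10000
    tosses <- rbinom(n, 1, 0.5)            # 1 = Heads, 0 = Tails
    running.prop <- cumsum(tosses) / (1:n) # X/n after each toss
    plot(running.prop, type = "l", ylim = c(0, 1),
         xlab = "n = number of tosses", ylab = "X/n = proportion of Heads")
    abline(h = 0.5, lty = 2)               # limiting value, if the coin is fair
    tail(running.prop, 1)                  # final proportion, close to 0.5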
Answer: In principle, the result of an individual random
outcome may be unpredictable, but long-term statistical
patterns and trends can often be determined.
[From the projectile example: t_final = 2 v0 sin θ / g,   x_final = v0² sin 2θ / g,   y_max = v0² sin²θ / (2g).]
Ismor Fischer, 5/29/2012 1.1-3

2. Experiment: Generate a random sample of n = 100 independent tosses.









1  2  3  4  5  6  7  ...  100

3. Observation: outcome = (T T H T H H H ... T)

Exercise: How many such possible outcomes are there?

Let the random variable X = #Heads in this experiment: {0, 1, 2, ..., 100}.

[Comment: A nonrandom variable is one whose value is determined, thus free
of any experimental measurement variation, such as the solution of the
algebraic equation 3X + 7 = 11, or X = #eggs in a standard one-dozen carton,
or #wheels on a bicycle.]


4. Analysis: Compare the observed empirical data with the theoretical
prediction for X (using probability), assuming the hypothesis is true.
That is,


Expected #Heads: E[X] = 50





versus





Observed #Heads: {0, 1, 2, ..., 100}
Future Issue: If the hypothesis is indeed false,
then how large must the sample size n be, in
order to detect a genuine difference the vast
majority of the time? This relates to the
power of the experiment.

Is the difference
statistically significant?
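Regarding the "Future Issue" above, the power of this experiment against a specific alternative can be estimated by simulation. The sketch below is an illustration only (the 60% figure is invented); it supposes the coin actually lands Heads 60% of the time and applies the rejection rule X ≤ 39 or X ≥ 61 developed on page 1.1-4.

    set.seed(1)
    X <- rbinom(10000, 100, 0.6)    # 10,000 replications of 100 tosses of a 60% coin
    mean(X <= 39 | X >= 61)         # estimated power of the test against p = 0.6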

Ismor Fischer, 5/29/2012 1.1-4
[Figure: plot of the p-value (vertical axis, 0.00 to 1.00) against X = observed #Heads (horizontal axis, 0 to 100), peaking at X = 50. The significance level α = 0.05 and confidence level 1 − α = 0.95 are marked; H0 is accepted in the central region around X = 50 and rejected in the two outer regions.]
Again, assuming the hypothesis is true, P(X = 50) = 0.0796.*

P(Observed, given Expected) = p-value:

P(X ≤ 49 or X ≥ 51) = 0.9204    Accept H0
Likewise, P(X ≤ 48 or X ≥ 52) = 0.7644
P(X ≤ 47 or X ≥ 53) = 0.6173
P(X ≤ 46 or X ≥ 54) = 0.4841
P(X ≤ 45 or X ≥ 55) = 0.3682
P(X ≤ 44 or X ≥ 56) = 0.2713
P(X ≤ 43 or X ≥ 57) = 0.1933
P(X ≤ 42 or X ≥ 58) = 0.1332
P(X ≤ 41 or X ≥ 59) = 0.0886
P(X ≤ 40 or X ≥ 60) = 0.0569
P(X ≤ 39 or X ≥ 61) = 0.0352
P(X ≤ 38 or X ≥ 62) = 0.0210
P(X ≤ 37 or X ≥ 63) = 0.0120
P(X ≤ 36 or X ≥ 64) = 0.0066
. . . . . . . .
P(X = 0 or X = 100) = 0.0000    Reject H0


5. Conclusion:
Suppose H0 is true, i.e., the coin is fair, and we wish to guarantee that there is,
at worst, a 5% probability of erroneously concluding that the coin is unfair or,
equivalently, a 95% probability of correctly concluding that the coin is indeed fair.
This will be the case if we do not reject H0 unless X ≤ 39 or X ≥ 61.
* Of the 2^100 possible outcomes of this experiment, only (100 choose 50) of them have exactly 50 Heads; the ratio is 0.0796, or about 8%.
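The probabilities in the chart above come from the Binomial distribution (Chapter 4), and can be reproduced in R as a quick check:

    n <- 100
    dbinom(50, n, 0.5)              # P(X = 50), approximately 0.0796
    for (k in 1:14) {               # two-sided tails P(X <= 50 - k  or  X >= 50 + k)
      p <- pbinom(50 - k, n, 0.5) + (1 - pbinom(50 + k - 1, n, 0.5))
      cat("P(X <=", 50 - k, " or X >=", 50 + k, ") =", round(p, 4), "\n")
    }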
Ismor Fischer, 5/29/2012 1.2-1

1.2 The Classical Scientific Method and Statistical Inference

The whole of science is nothing more than a refinement of everyday thinking.

- Albert Einstein
[Flowchart:

Population of units, with Random Variable X and a Hypothesis (about X), leading to two parallel tracks:

EXPERIMENT: a Random Sample (empirical data) of n = #observations, with sample values x1, x2, x3, ..., xn. (What actually happens this time, regardless of hypothesis.)

THEORY: a Mathematical Theorem (formal proof). "Proof: If Hypothesis (about X), then Conclusion (about X). QED" (What ideally must follow, if hypothesis is true.)

Analysis: Observed vs. Expected, under Hypothesis. Is the difference statistically significant? Or just due to random chance variation alone?

Decision: Accept or Reject Hypothesis.]
Ismor Fischer, 5/29/2012 1.2-2

Example: Population of individuals

EXPERIMENT: Random Sample (empirical data) of n = 2500 individuals, each answering Yes/No. Suppose the random variable X = # "Yes" = 300, i.e., estimated prevalence = 300/2500 = 0.12, or 12%. (What actually happens this time, regardless of hypothesis.)

THEORY: Mathematical Theorem (formal proof). Hypothesis: The prevalence (proportion) of a certain disease is 10%. If the Hypothesis of 10% prevalence is true, then the expected value of X would be 250 out of a random sample of 2500. (What ideally must follow, if hypothesis is true.)

Moreover, under these conditions, it can (and later will) be mathematically proved that the probability of obtaining a sample result that is as, or more, extreme than 12% is only .00043 (the p-value), or less than one-twentieth of one percent. EXTREMELY RARE!!! Thus, our sample evidence is indeed statistically significant; it tends to strongly refute the original Hypothesis.

Decision: Reject Hypothesis. Based on our sample, the prevalence of this disease in the population is significantly higher than 10%, around 12%.
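The text does not yet show how the .00043 figure is computed; for reference, here is a hedged R sketch of two standard ways to obtain such a one-sided tail probability. The z-approximation reproduces a value of about .00043; the exact Binomial tail is of the same (very small) order of magnitude.

    # Probability of observing 300 or more "Yes" responses out of n = 2500,
    # if the true prevalence were really 10% (expected count 250):
    z <- (0.12 - 0.10) / sqrt(0.10 * 0.90 / 2500)   # z is about 3.33
    1 - pnorm(z)                                    # normal approximation, about 0.00043
    1 - pbinom(299, size = 2500, prob = 0.10)       # exact Binomial tail probability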
Ismor Fischer, 5/17/2013 1.3-1
[Goldilocks Principle illustration:
POPULATION = swimming pool
Random Variable: X = Water Temperature (°F)
(Informal) Null Hypothesis H0: (The mean of) X is okay for swimming (e.g., μ = 80°F).
(Informal) Experiment: Select a "random sample" by sticking in a foot and swishing the water around.
(Informal) Analysis: Determine if the difference between the observed temperature and the expected temperature under H0 is significant.
Conclusion: If not, then accept H0: Jump in! If so, then reject H0: Go jogging instead.]

1.3 Definitions and Examples

Definition: A random variable, usually denoted by X, Y, Z, ..., is a rule that assigns a
number to each outcome of an experiment. (Examples: X = mass, pulse rate, gender)

Definition: Statistics is a collection of formal computational techniques that are
designed to test and derive a (reject or accept) conclusion about a null hypothesis
for a random variable defined on a population, based on experimental data taken
from a random sample.

Example: Blood sample taken from a patient for medical testing purposes, and
results compared with ideal reference values, to see if differences are significant.

Example: Goldilocks Principle (see the swimming-pool illustration at the top of this page)
















The following example illustrates the general approach used in formal hypothesis testing.

Example: United States criminal justice system

Null Hypothesis H0: Defendant is innocent.

The burden of proof is on the prosecution to collect enough empirical evidence to try to
reject this hypothesis, beyond a reasonable doubt (i.e., at some significance level).

[Illustration for the Goldilocks example: Too Cold: Reject H0 | OK: Accept H0 | Too Hot: Reject H0]


[Photos: Casey Anthony, ACQUITTED (H0 accepted), July 5, 2011; Jodi Arias, CONVICTED (H0 rejected), May 8, 2013.]
Ismor Fischer, 5/17/2013 1.3-2

Example: Pharmaceutical Application

Phase III Randomized Clinical Trial (RCT)

Used to compare drug vs. placebo, new treatment vs. standard
treatment, etc., via randomization (to eliminate bias) of participants to
either a treatment arm or control arm. Moreover, randomization is
often blind (i.e., masked), and implemented by computer, especially
in multicenter collaborative studies. Increasing use of the Internet!
Standard procedure used by FDA to approve pharmaceuticals and other
medical treatments for national consumer population.
Random Variable: X = cholesterol level (mg/dL)

[Diagram: The POPULATION is divided into a Drug group (population mean μ1) and a Placebo group (population mean μ2). RANDOM SAMPLES of size n1 (Drug) and n2 (Placebo) are drawn, yielding sample means x̄1 = 225 and x̄2 = 240, so that x̄1 − x̄2 = −15.]

Null Hypothesis H0: There is no difference in population mean cholesterol levels between the two groups, i.e., μ1 − μ2 = 0.

Is the mean difference statistically significant (e.g., at the α = .05 level)?
If so, then reject H0. There is evidence of a genuine treatment difference!
If not, then accept H0. There is not enough evidence of a genuine treatment difference. More study needed?
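A minimal R sketch of how such a two-group comparison might eventually be carried out is shown below. The cholesterol values are simulated purely for illustration (the sample sizes and standard deviation are invented), and the formal two-sample methods themselves are developed in Chapter 6.

    set.seed(1)
    drug    <- rnorm(50, mean = 225, sd = 30)   # hypothetical treatment arm (n1 = 50)
    placebo <- rnorm(50, mean = 240, sd = 30)   # hypothetical control arm  (n2 = 50)
    t.test(drug, placebo)    # tests H0: mu1 - mu2 = 0 (Welch two-sample t-test)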
Ismor Fischer, 5/29/2012 1.4-1

1.4 Some Important Study Designs in Medical Research

I. OBSERVATIONAL (no intervention)

A. LONGITUDINAL (over some period of time)

1. Retrospective (backward-looking)

Case-Control Study: Identifies present disease with past exposure to risk factors.

2. Prospective (forward-looking)

Cohort Study: Classically, follows a cohort of subjects forward in time.

Example: Framingham Heart Study to identify CVD risk factors, ongoing since 1948.

B. CROSS-SECTIONAL (at some fixed time)

Survey: Acquires self-reported information from a group of participants.

Prevalence Study: Determines the proportion of a specific disease in a given population.

II. EXPERIMENTAL (intervention)

Randomized Clinical Trial (RCT): Randomly assigns patients to either a
treatment group (e.g., new drug) or control group (e.g., standard drug or placebo),
and follows each through time.

[Timelines:
Cohort (prospective): PRESENT → FUTURE. Given: Exposed (E+) and Unexposed (E−). Investigate: Association with D+ and D−.
Case-Control (retrospective): PRESENT → PAST. Given: Cases (D+) and Controls (D−). Investigate: Association with E+ and E−.]
[RCT schematic: Patients satisfying inclusion criteria → RANDOMIZE → Treatment Arm or Control Arm → At end of study, compare via statistical analysis.]
Ismor Fischer, 5/29/2012 1.4-2

Phases of a Clinical Trial
In vitro biochemical and pharmacological research, including any computer
simulations.

Pre-clinical testing of in vivo animal models to determine safety and potential to
fight a specific disease. Typically takes 3-4 years. Successful pass rate is very low:
only about one in a thousand compounds (roughly 0.1%).

PHASE I. First stage of human testing, contingent upon FDA approval,
including protocol evaluation by an Institutional Review Board (IRB) ethics
committee. Determines safety and side effects as dosage is incrementally
increased to maximum tolerated dose (MTD) that can be administered without
serious toxicity. Typically involves very few (≈ 12, but sometimes more)
healthy volunteers, lasting several months to a year. Phase I pass rate is
approximately 70%.

PHASE II. Determines possible effectiveness of treatment. Typically involves
several (≈ 14-30, but sometimes more) afflicted patients who have either
received previous treatment, or are untreatable otherwise. Lasts from several
months to two years. Only approximately 30% of all experimental drugs tested
successfully pass both Phases I and II.

PHASE III. Classical randomized clinical trial (although most Phase II are
randomized as well) that compares patients randomly assigned to a new
treatment versus those treated with a control (standard treatment or placebo).
Large-scale experiment involving several hundred to several thousand patients,
lasting several years. Seventy to 90 percent of drugs that enter Phase III studies
successfully complete testing. FDA review and approval for public marketing
can take from six months to two years.

PHASE IV. Post-marketing monitoring. Randomized controlled studies often
designed with several objectives: 1) to evaluate long term safety, efficacy and
quality of life after the treatment is licensed or in common use, 2) to investigate
special patient populations not previously studied (e.g., pediatric or geriatric),
3) to determine the cost-effectiveness of a drug therapy relative to other
traditional and new therapies.


Total time from lab development to marketing: 10-15 years


Ismor Fischer, 2/8/2014
Solutions / 1.5-1

1.5 Solutions

1. X = 38 Heads in n = 100 tosses corresponds to a p-value = .021, which is less than α = .05;
hence in this case we are able to reject the null hypothesis, and conclude that the coin is not
fair, at this significance level. However, p = .021 is greater than α = .01; hence we are unable
to reject the null hypothesis of fairness, at this significance level. We tentatively accept
(or, at least, do not outright reject) that the coin is fair, at this level. (The coin may indeed be
biased, but this empirical evidence is not sufficient to show it.) Thus, lowering the
significance level at the outset means that, based on the sample data, we will be able to reject
the null hypothesis less often on average, resulting in a more conservative test.


2.
(a) If the coin is known to be fair, then all 2^10 outcomes are equally likely; the probability of
any one of them occurring is the same (namely, 1/2^10)!
(b) However, if the coin is not known to be fair, then Outcomes 1, 2, and 3 (each with X = 5
Heads and n − X = 5 Tails, regardless of the order in which they occur) all provide the
best possible evidence in support of the hypothesis that the coin is unbiased. Outcome 4,
with X = 7 Heads, is next. And finally, Outcome 5, with all X = 10 Heads, provides the
worst possible evidence that the coin is fair.


3. The issue here is one of sample size, and statistical power: the ability to detect a significant
difference from the expected value, if one exists. In this case, a total of X = 18 Heads out of
n = 50 tosses yields a p-value = 0.0649, which is just above the α = .05 significance level.
Hence, the evidence in support of the hypothesis that the coin is fair is somewhat borderline.
This suggests that perhaps the sample size of n = 50 may not be large enough to detect a
genuine difference, even if there is one. If so, then a larger sample size might generate more
statistical power. In this experiment, obtaining X = 36 Heads out of n = 100 tosses is indeed
sufficient evidence to reject the hypothesis that the coin is fair.
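As a quick check (not required by the problem), both p-values cited in this solution can be reproduced directly in R from the Binomial distribution:

    pbinom(18, 50, 0.5) + (1 - pbinom(31, 50, 0.5))     # P(X <= 18 or X >= 32) = 0.0649
    pbinom(36, 100, 0.5) + (1 - pbinom(63, 100, 0.5))   # P(X <= 36 or X >= 64) = 0.0066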

Ismor Fischer, 2/8/2014
Solutions / 1.5-2

4. R exercise

(a) If the population ages are uniformly distributed between 0 and 100 years, then via
symmetry, the mean age would correspond to the midpoint, or 50 years.
(b) The provided R code generates a random sample of n = 500 ages from a population
between 0 and 100 years old. The R command mean(my.sample) should typically
give a value fairly close to the population mean of 50 (but see part (d)).
(c) The histogram below is typical. The frequencies indicate the number of individuals in
each age group of the sample, and correspond to the heights of the rectangles. In this
sample, there are:
94 individuals between 0 and 20 years old, i.e., 18.8%,
98 individuals between 20 and 40 years old, i.e., 19.6%,
105 individuals between 40 and 60 years old, i.e., 21.0%,
100 individuals between 60 and 80 years old, i.e., 20.0%,
103 individuals between 80 and 100 years old, i.e. 20.6%.
If the population is uniformly distributed, we would expect the sample frequencies
to be about the same in each of the five intervals, and indeed, that is the case; we
can see that each interval contains about one-hundred individuals (i.e., 20%).
Ismor Fischer, 2/8/2014
Solutions / 1.5-3

(d) Most results should be generally similar to (b) and (c) (in particular, the sample
means fairly close to the population mean of 50), but there is a certain nontrivial
amount of variability, due to the presence of outliers. For example, if by chance a
particular sample should consist of unusually many older individuals, it is quite possible
that the mean age would be shifted to a value that is noticeably larger than 50. This is
known as "skewed to the right" or "positive skew." Similarly, a sample containing
many younger individuals might be "skewed to the left" or "negatively skewed."

(e) The histogram below displays a simulated distribution of the means of many (in this
case, 2000) samples, each sample having n = 500 ages. Notice how much tighter
(i.e., less variability) the graph is around 50, than any of those in (c). The reason is that
it is much more common for a random sample to contain a relatively small number
of outliers (whose contribution is damped out when all the ages are averaged)
than for a random sample to contain a relatively large number of outliers (whose
contribution is sizeable enough to skew the average). Thus, the histogram is rather
bell-shaped: highly peaked around 50, but with tails that taper off left and right.
[Annotations on the histogram: Very rarely will a random sample have mostly low values, resulting in its average << 50. Very rarely will a random sample have mostly high values, resulting in its average >> 50.]


Ismor Fischer, 2/8/2014
Solutions / 1.5-4

5. The following is typical output (copy-and-paste) directly from R. Comments are in blue.

(a) > prob = 0.5
> tosses = rbinom(100, 1, prob)   This returns a random sequence of 100 single tosses.*
> tosses   # view the sequence
  [1] 1 1 0 1 1 0 1 1 0 0 1 0 1 1 1 1 1 0 0 0 0 1 0 0 0 1 1 0 1 0 0 0 1 0 1 1 0
 [38] 1 1 1 1 0 1 1 0 1 0 0 0 0 1 1 0 1 0 1 1 1 1 0 1 1 0 1 0 0 1 1 1 1 0 0 1 0
 [75] 1 1 1 1 0 0 0 1 1 1 1 0 1 1 0 0 1 1 1 1 0 1 0 1 1 0

> sum(tosses)   # count the number of Heads
[1] 58

* Note: rbinom(1, 100, prob) just generates the number of Heads (not the actual sequence) in
1 run of 100 random tosses, in this case, 58.

This simulation of 100 random tosses of a fair coin produced 58 Heads. According to
the chart on page 1-4, the corresponding p-value = 0.1332. That is, if the coin is fair
(as here), then in 100 tosses, there is an expected 13.32% probability of obtaining 8 (or
more) Heads away from 50. This is above the 5% significance level, hence consistent
with the coin being fair. Had it been below (i.e., rarer than) 5%, it would have been
inconsistent with the coin being fair, and we would be forced to conclude that the coin is
indeed biased. Alas, in multiple runs, this would eventually happen just by chance!
(See the outliers in the graphs below.)
(b)
> X = rbinom(500, 100, prob)
This command generates the number of Heads in each of 500 runs of 100 tosses, as stated.
> sort(X)   This command sorts the 500 numbers just found in increasing order (not shown).
> table(X)
Produces a frequency table for X = # Heads, i.e., 35 Heads occurred twice, 36 twice, etc.
X
35 36 37 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 66
 2  2  2  8  7  9 13 15 24 23 30 38 41 35 41 41 33 27 21 31 16  8 10  9  1  6  3  1  2  1

> summary(X)   This is often referred to as the five number summary:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  35.00   46.00   50.00   49.53   53.00   66.00

Notice that the mean ≈ median (suggesting that this may be close to a more-or-less
symmetric distribution; see page 2-14 in the notes) ≈ 50, both of which you might expect
to see in 100 tosses of an unbiased coin, as confirmed in the three graphs below.

Ismor Fischer, 2/8/2014
Solutions / 1.5-5
Histogram

Stemplot Dotplot

35 | 00
36 | 00
37 | 00
38 |
39 | 00000000
40 | 0000000
41 | 000000000
42 | 0000000000000
43 | 000000000000000
44 | 000000000000000000000000
45 | 00000000000000000000000
46 | 000000000000000000000000000000
47 | 00000000000000000000000000000000000000
48 | 00000000000000000000000000000000000000000
49 | 00000000000000000000000000000000000
50 | 00000000000000000000000000000000000000000
51 | 00000000000000000000000000000000000000000
52 | 000000000000000000000000000000000
53 | 000000000000000000000000000
54 | 000000000000000000000
55 | 0000000000000000000000000000000
56 | 0000000000000000
57 | 00000000
58 | 0000000000
59 | 000000000
60 | 0
61 | 000000
62 | 000
63 | 0
64 | 00
65 |
66 | 0


(c) The sample proportions obtained from this experiment are quite close to the
theoretical p-values we expect to see, if the coin is fair.

      lower upper  prop    p-values (from chart)
 [1,]    49    51  0.918   0.9204
 [2,]    48    52  0.766   0.7644
 [3,]    47    53  0.618   0.6173
 [4,]    46    54  0.488   0.4841
 [5,]    45    55  0.386   0.3682
 [6,]    44    56  0.278   0.2713
 [7,]    43    57  0.198   0.1933
 [8,]    42    58  0.152   0.1332
 [9,]    41    59  0.106   0.0886
[10,]    40    60  0.070   0.0569
[11,]    39    61  0.054   0.0352
[12,]    38    62  0.026   0.0210
[13,]    37    63  0.020   0.0120
[14,]    36    64  0.014   0.0066
[15,]    35    65  0.006
[16,]    34    66  0.002   etc.

From this point on, all proportions are 0.

Since these values are comparable, it seems that we have reasonably strong confirmation that the coin is indeed unbiased.

[Annotation on the graphs in (b): Note the outliers!]
Ismor Fischer, 2/8/2014
Solutions / 1.5-6

(d) > prob = runif(1, min = 0, max = 1)   This selects a random probability for Heads.
> tosses <- rbinom(100, 1, prob)
> sum(tosses)   # count the number of Heads
[1] 62

This simulation of 100 random tosses of a coin with a randomly chosen bias produced 62 Heads, which
corresponds to a p-value = .021 < α = .05. Hence, based on this sample evidence, we may
reject the hypothesis that the coin is fair; the result is statistically significant at the
α = .05 level. Graphs are similar to above, centered about the mean (see below).

> table(X)
X
46 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 77 78 79 80
 1  2  1  1  6  6  6  5 12 16 22 31 27 28 44 50 42 45 39 29 18 14 16 15  4  4 11  2  1  1  1

> summary(X)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  46.00   61.00   64.00   64.31   67.00   80.00

According to these data, the mean number of Heads is 64.31 out of 100 tosses; hence
the estimated probability of Heads is 0.6431. The actual probability that R used here is

> prob
[1] 0.6412175
Ismor Fischer, 2/8/2014
Solutions / 1.5-7

6.
(a)













(b)
P(2 ≤ X ≤ 12) = 1, because the event 2 ≤ X ≤ 12 comprises the entire sample space.
P(2 ≤ X ≤ 6 or 8 ≤ X ≤ 12) = P(X = 2) + P(X = 3) + P(X = 4) + P(X = 5) + P(X = 6) +
P(X = 8) + P(X = 9) + P(X = 10) + P(X = 11) + P(X = 12),
or, 1 − P(X = 7) = 1 − 6/36 = 30/36 = 0.83333. Likewise,
P(2 ≤ X ≤ 5 or 9 ≤ X ≤ 12) = 20/36 = 0.55556
P(2 ≤ X ≤ 4 or 10 ≤ X ≤ 12) = 12/36 = 0.33333
P(2 ≤ X ≤ 3 or 11 ≤ X ≤ 12) = 6/36 = 0.16667
P(X ≤ 2 or X ≥ 12) = 2/36 = 0.05556
P(X ≤ 1 or X ≥ 13) = 0, because neither the event X ≤ 1 nor X ≥ 13 can occur.

X = Sum    Probability
    2      1/36 = 0.02778
    3      2/36 = 0.05556
    4      3/36 = 0.08333
    5      4/36 = 0.11111
    6      5/36 = 0.13889
    7      6/36 = 0.16667
    8      5/36 = 0.13889
    9      4/36 = 0.11111
   10      3/36 = 0.08333
   11      2/36 = 0.05556
   12      1/36 = 0.02778
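The entire table can also be generated in R by enumerating all 36 equally likely rolls; this is a quick check, not part of the original solution.

    sums <- outer(1:6, 1:6, "+")    # all 36 (die 1, die 2) outcomes, summed
    table(sums) / 36                # probabilities of X = 2, 3, ..., 12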
Ismor Fischer, 2/8/2014
Solutions / 1.5-8

7. Absolutely not. That both sets of measurements average to 50.0 grams indicates that they have
the same accuracy, but Scale A has much less variability in its readings than Scale B, so it
has much greater precision. This experiment suggests that if many more measurements were
taken, those of A would show a much higher density of them centered around 50 g than B,
whose distribution of values would show much more spread around 50 g. Variability
determines reliability, a major factor in quality control of services and manufactured products.





[Figure: two distributions, A and B, both centered at 50 g. Measurements obtained from the A distribution are much more tightly clustered around their center than those of the B distribution.]
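In R, the point is immediate from the sample standard deviations of the two sets of measurements given in the problem:

    A <- c(49.8, 50.0, 50.2)
    B <- c(49.0, 50.0, 51.0)
    c(mean(A), mean(B))    # both 50.0 g: equal accuracy
    c(sd(A), sd(B))        # 0.2 g vs. 1.0 g: Scale A is far more precise (reliable)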
Ismor Fischer, 1/9/2014
1.5-1
1.5 Problems

In this section, we use some of the terminology that was introduced in this chapter, most of which
will be formally defined and discussed in later sections of these notes.

1. Suppose that n = 100 tosses of a coin result in X = 38 Heads. What can we conclude about
the fairness of the coin at the α = .05 significance level? At the α = .01 level? (Use the
chart given on page 1.1-4.)

2.
(a) Suppose that a given coin is known to be fair or unbiased (i.e., the probability of
Heads is 0.5 per toss). In an experiment, the coin is to be given n = 10 independent tosses,
resulting in exactly one out of 2^10 possible outcomes. Rank the following five outcomes in
order of which has the highest probability of occurrence, to which has the lowest.
Outcome 1: (H H T H T T T H T H)

Outcome 2: (H T H T H T H T H T)

Outcome 3: (H H H H H T T T T T)

Outcome 4: (H T H H H T H T H H)

Outcome 5: (H H H H H H H H H H)

(b) Suppose now that the bias of the coin is not known. Rank these outcomes in order of
which provides the best evidence in support of the hypothesis that the coin is fair, to
which provides the best evidence against it.

3. Let X = Number of Heads in n = 50 random, independent tosses of a fair coin. Then the
expected value is E[X] = 25, and the corresponding p-values for this experiment can be
obtained by the following probability calculations (for which you are not yet responsible).

P(X ≤ 24 or X ≥ 26) = 0.8877

P(X ≤ 23 or X ≥ 27) = 0.6718

P(X ≤ 22 or X ≥ 28) = 0.4799

P(X ≤ 21 or X ≥ 29) = 0.3222

P(X ≤ 20 or X ≥ 30) = 0.2026

P(X ≤ 19 or X ≥ 31) = 0.1189

P(X ≤ 18 or X ≥ 32) = 0.0649

P(X ≤ 17 or X ≥ 33) = 0.0328

P(X ≤ 16 or X ≥ 34) = 0.0153

P(X ≤ 15 or X ≥ 35) = 0.0066

P(X ≤ 14 or X ≥ 36) = 0.0026

P(X ≤ 13 or X ≥ 37) = 0.0009

P(X ≤ 12 or X ≥ 38) = 0.0003

P(X ≤ 11 or X ≥ 39) = 0.0001

P(X ≤ 10 or X ≥ 40) = 0.0000

. . . . . .

P(X ≤ 0 or X ≥ 50) = 0.0000

Now suppose that this experiment is conducted twice, and X = 18 Heads are obtained both
times. According to this chart, the p-value = 0.0649 each time, which is above the α = .05
significance level; hence, both times, we conclude that the sample evidence seems to
support the hypothesis that the coin is fair. However, the two experiments taken together
imply that in this random sequence of n = 100 independent tosses, X = 36 Heads are
obtained. According to the chart on page 1.1-4, the corresponding p-value = 0.0066, which
is much less than α = .05, suggesting that the combined sample evidence tends to refute the
hypothesis that the coin is fair. Give a brief explanation for this apparent discrepancy.
Ismor Fischer, 1/9/2014
1.5-2


NOTE: Please read the bottom of Getting Started with R regarding its use in HW problems, such
as 1.5/4 below. Answer questions in all parts, especially those involving the output, and indicate!

4. In this problem, we will gain some more fundamental practice with the R
programming language. Some of the terms and concepts may appear unfamiliar, but we
will formally define them later. For now, just use basic intuition.
[R Tip: At the prompt (>), repeatedly pressing the up arrow on your keyboard will
step through your previous commands in reverse order.]
(a) First, consider a uniformly distributed (i.e., evenly scattered) population of ages
between 0 and 100 years. What is the mean age of this population? (Use intuition.)
Let us simulate such a population, by generating an arbitrarily large (say one million)
vector of random numbers between 0 and 100 years. Type, or copy and paste
population = runif(1000000, 0, 100)
at the prompt (>) in the R console, and hit Enter.
Let us now select a single random sample of n = 500 values from this population via
rand = sample(population, 500)
then sort them from lowest to highest, and round them to two decimal places:
my.sample = round(sort(rand), 2)
Type my.sample to view the sample you just generated. (You do not need to turn this in.)
(b) Compute the mean age of my.sample. How does it compare with the mean found in (a)?
(c) The R command hist graphs a frequency histogram of your data. Moreover,
?hist gives many options under Usage for this command. As an example, graph:
hist(my.sample, breaks = 5, xlab = "Ages", border = "blue",
labels = T)
Include and interpret the resulting graph. Does it reasonably reflect the uniformly-
distributed population? Explain.
(d) Repeat (b) and (c) several more times using different samples of n = 500 data values.
How do these sample mean ages compare with the population mean age in (a)?
(e) Suppose many random samples of size n = 500 values are averaged, as in (d). Graph their
histogram via the R code below, and offer a reasonable explanation for the resulting shape.

vec.means = NULL
for (i in 1:2000)
{vec.means[i] = mean(sample(population, 500))}
hist(vec.means, xlab = "Mean Ages", border = "darkgreen")
The idea behind
this problem will
be important in
Chapter 5.

Ismor Fischer, 1/9/2014
1.5-3

5. In this problem, we will use the R programming language to simulate n = 100 random
tosses of a coin. (Remember that most such problems are linked to the Rcode folder.)
(a) First, assume the coin is fair or unbiased (i.e., the probability of Heads is 0.5 per toss),
and use the Binomial distribution to generate a random sequence of n = 100
independent tosses; each outcome is coded as Heads = 1 and Tails = 0.

prob = 0.5    (*)
tosses = rbinom(100, 1, prob)
tosses # view the sequence
sum(tosses) # count the number of Heads

From the chart on page 1.1-4, calculate the p-value of this experiment. At the = 0.05
significance level, does the outcome of this experiment tend to support or reject the
hypothesis that the coin is fair? Repeat the experiment several times.

(b) Suppose we run this experiment 500 times, and count the number of Heads each time.
Let us view the results, and display some summary statistics,

X = rbinom(500, 100, prob)
sort(X)
table(X)
summary(X)

as well as graph them, using each of the following methods, one at a time.

stripchart(X, method = "stack", ylim = range(0, 100), pch = 19)
# Dotplot
stem(X, scale = 2) # Stemplot

hist(X) # Histogram

Comment on how these graphs compare to what you would expect to see from a fair coin.

(c) How do the sample proportions obtained compare with the theoretical probabilities on page 1.1-4?

lower = 49:0
upper = 51:100

prop = NULL
for (k in 1:50) {less.eq <- which(X <= lower[k])
greater.eq <- which(X >= upper[k])
v <- c(less.eq, greater.eq)
prop <- c(prop, length(v)/500)}

cbind(lower, upper, prop)

(d) Suppose now that the coin may be biased. Replace line (*) above with the following R
code, and repeat parts (a) and (b).

prob = runif(1, min = 0, max = 1)

Also, estimate the probability of Heads from the data. Check against the true value of prob.
Ismor Fischer, 1/9/2014
1.5-4

6. Suppose we roll two distinct dice, each die having 6 faces. Thus there are
6^2 = 36 possible combinations of outcomes for the pair. For any given roll,
define the random variable X = Sum, so X can only take on the integer
values in the set S = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}.
(a) How many ways can each of these values theoretically occur on a single roll? For
example, there are three possible ways of rolling X = 4. (What are they?) Hence, if the
dice are fair, we may express the probability of this event occurring as being equal to
3/36. (The mathematical notion of probability will be formalized later.) Making this
assumption that the dice are indeed fair, and using the same logic as just outlined,
complete the following table. (The case X = 4 has been done for you.)

X = Sum Probability
2
3
4 3/36
5
6
7
8
9
10
11
12

(b) It should be reasonable that, by inspection, the expected value of X written E[X] is 7.
(Again, this will be formally shown later.) Using your table, calculate the probabilities of
each of the following events, again assuming that the dice are fair. Please show all work.

Probability of rolling 0 or more away from 7, i.e., P(2 ≤ X ≤ 12) = 1 (Why?)

Probability of rolling 1 or more away from 7, i.e., P(2 ≤ X ≤ 6 or 8 ≤ X ≤ 12) =

Probability of rolling 2 or more away from 7, i.e., P(2 ≤ X ≤ 5 or 9 ≤ X ≤ 12) =

Probability of rolling 3 or more away from 7, i.e., P(2 ≤ X ≤ 4 or 10 ≤ X ≤ 12) =

Probability of rolling 4 or more away from 7, i.e., P(2 ≤ X ≤ 3 or 11 ≤ X ≤ 12) =

Probability of rolling 5 or more away from 7, i.e., P(X ≤ 2 or X ≥ 12) =

Probability of rolling 6 or more away from 7, i.e., P(X ≤ 1 or X ≥ 13) = 0 (Why?)


Ismor Fischer, 1/9/2014
1.5-5

7. Suppose an experimenter wishes to evaluate the reliability of two weighing scales, by
measuring a known 50-gram mass three times on each scale, and comparing the results.
Scale A gives measurements of 49.8, 50.0 and 50.2 grams. Scale B gives measurements of
49.0, 50.0, and 51.0 grams. The average in both cases is 50.0 grams, so should the
experimenter conclude that both scales are equally precise, on the basis of these data?
Explain.



Synopsis of 1.1

At the exact center of Concourse A of Detroit Metro airport, lies a striking work of kinetic art. An
enormous black granite slab is covered with a continuously flowing layer of water; recessed in the
slab are numerous jets, each of which is capable of projecting a stream of water into the air at an
exact speed and angle. When all of the jets are activated simultaneously, the result is a beautiful
array of intersecting parabolic arcs at different positions, heights, and widths. But there is more...
the jets are also set on different timers that are activated intermittently, shooting small bursts of
water in rapid succession. The effect of these lifelike streamlets of water, leaping and porpoising
over and under each other as they follow their individual parabolic paths, resembles an elegantly
choreographed ballet, and is quite a sight.

Besides being aesthetically pleasing to watch, the piece is important for another reason. It is a
tribute to classical physics; the equations of motion of a projectile (neglecting air resistance) have
been well known for centuries. Classical Newtonian mechanics predicts that such an object will
follow the path of a specific parabola which depends on the initial speed and angle, and finding its
equation is a standard first-year calculus exercise. The equations also allow us to determine other
quantities as well, such as the total time taken to traverse this parabola, the maximum height the
projectile reaches, and the speed and downrange distance at the time of impact. In other words,
this system has no (or very little) variability; it can be mathematically modeled in a very precise
way. Despite the apparent complexity of the artwork, everything about its motion is completely
determined from initial conditions. In fact, one could argue that it is precisely this predictability
that makes the final structure possible to construct, and so visually appealing. There would
probably be nothing special about watching jets of water spouting randomly.

However, for most complex dynamical systems, this is an exception rather than the rule. Random
variability lurks everywhere. If, as in the previous scenario, the projectile is a coin rather than a
stream of water, then none of this mathematical analysis predicts whether the coin will land on
heads or tails. That outcome is not determined solely by initial conditions; in fact, even if it were
possible to rewind the tape and start from exactly the same initial conditions, we would not
necessarily obtain the same outcome of this experiment. The culprit responsible for this strange
phenomenon is random chance (or random variation), an intrinsic property of the universe that can
never be entirely eliminated. Analyzing systems that yield random outcomes requires an entirely
different approach, one that involves an understanding of the nature of the probability of an
event occurring over time.

If every individual in a given population were exactly the same age, there would be zero variation
in that variable. But, of course, in reality, this is rarely the case, and there is usually a substantial
amount of variation in ages. To say that the mean age of the population is 45 years old, for
example, offers no clue about this variation; everyone could (theoretically) be 45 years old, or ages
could range widely, from infants to geriatrics. Random variation in a formal experiment can be
introduced through biases, measurement errors, etc. The phrase "practice makes perfect" can be
more formally interpreted as "practice reduces the amount of variability in the desired outcome,
making it increasingly precise." It's what turns a good athlete into a great athlete (or an average
student into an excellent student). One definition of the field of statistics is a formal way to detect,
model, and interpret genuine information in a system, in the presence of random variation.
A classic example (in the biological sciences, especially) is to compare the amount of variation
between the treatment arm of a study (say, individuals on a new, investigational drug) with the
corresponding control arm (say, individuals on a standard treatment or placebo), while
simultaneously adjusting for the amount of variation in individual responses within each group.



But exactly how is this accomplished? As a simple illustration, suppose we wish to test the claim
that half of all the individuals in a certain population are male, and half are female. This scenario
can be modeled by the coin toss example in the notes, which introduces some of the main
concepts and terminology used in statistical methodology, and covered in much more detail later.
Specifically, we can translate this problem into the equivalent one of trying to determine if a
particular coin is fair (i.e., unbiased), that is, if the probability of obtaining heads or tails on a
single toss is the same, namely 50%. To test this hypothesis, we can design a simple experiment:
generate a random sample of outcomes by tossing the coin a large number of times, say 100, and
count the resulting number of heads. If the hypothesis is indeed true, then the expected value of
this number would be 50. Of course, if our experiment were to result in exactly 50 heads, then that
would constitute the strongest possible evidence we could hope to obtain in support of the original
hypothesis. However, because random variation plays a role here, it is certainly possible to obtain
a number close to 50 heads, yet the coin may still be fair, i.e., the hypothesis is still true. The
question now is, "How far from 50 do we have to be, before the evidence suggests that the coin is
indeed not fair, and thus the hypothesis should be rejected?"

To answer this, we turn to the formal mathematics of Probability Theory, which can be used to
predict, via a formula, the theoretical probability of obtaining any specified number of heads in
100 tosses of a fair coin. For instance, if the coin is fair¹, then it can be mathematically shown that
there is a 92% theoretical probability of obtaining an experimental sample that is at least 1 head
away from the expected value of 50. Likewise, it can be shown that if the coin is fair², then there
is a 76.4% mathematical probability of obtaining a sample that is at least 2 heads away from 50.
At least 3 heads away from 50, and the probability drops to 61.7%. In a similar fashion, we can
compute the probability of obtaining a sample that is at least any prescribed number of heads away
from 50, and observe that this probability (called the p-value of the sample) decreases rapidly, as
we drift farther and farther away from 50. At some point, the probability becomes so low, that it
suggests the assumption that the coin is fair is indeed not true, and should be rejected, based on the
sample evidence. But where should we draw the line? Typically, this significance level denoted
by , the Greek symbol alpha is set between 1% and 10% for many applications, with 5%
being a very common choice. At the 5% significance level, it turns out that if the coin is fair, the
sample can be as much as 10 heads away from 50, but no more than that. That is, in 100 random
tosses of a fair coin, the probability (i.e., p-value) of obtaining a sample with more than 10 heads
away from 50, is less (i.e., rarer) than 5%, a result that would be considered statistically
significant. Consequently, this evidence would tend to refute the hypothesis that the coin is fair.


¹ and we are not saying for certain that it is or isn't
² and again, we are not saying for certain that it is or isn't
Synopsis of 1.2, 1.3
In a criminal court trial, the claim that the defendant is "presumed innocent" (unless proven
guilty) can be taken as an example of a hypothesis. In other words, innocence is to be assumed;
it is (supposedly) not the defense attorney's job to prove it. Rather, the burden of proof lies with
the prosecution, who must provide strong enough evidence (i.e., it must have sufficient "power")
to convince a jury that the hypothesis should be rejected, "beyond a reasonable doubt" (i.e., with a
high level of confidence). (In a context such as this, we can never be 100% certain.)³ This overall
conservative approach to jurisprudence is deliberately intended to err on the side of caution; it is
generally considered more serious to jail an innocent man (i.e., reject the hypothesis if it is true,
what we will eventually come to know as a "Type 1 error") than to set free a guilty one (i.e., retain
the hypothesis if it is false, what we will eventually come to know as a "Type 2 error").

This approach is also used in the sciences. A hypothesis is either rejected or retained, based upon
whether empirical evidence tends to refute or support it, respectively, via a formal procedure
known as the classical scientific method, outlined below. We first define a population of
interest (usually thought of as being arbitrarily large or infinite, for simplicity, and consisting of
distinct, individual units, such as the residents of a particular area), and a specific measurable
quantity that we wish to consider in that population, e.g., Age in years. Naturally, such a
quantity varies randomly from individual to individual, hence is referred to as a random
variable, and usually denoted by a generic capital letter such as X, Y, Z, etc.⁴ In practice, it is
usually not possible to measure the variable for everyone in the population, but we can imagine
that, in principle, its values would take on some "true" distribution (such as a bell curve, for
example), which includes a population mean (i.e., average), although its exact value would most
probably be unknown to us. But it is precisely this population mean value, e.g., mean age, that
we wish to say something about, via the scientific method referred to as statistical inference:

1. Formulate a hypothesis, such as "The population mean age is 40 years old."

2. Design an experiment to test this hypothesis. In a case like this, for example, select a
random sample of individuals from the population. The number of individuals in the
sample is known as the sample size, and usually denoted by the generic symbol n.
(How to choose the value of n judiciously is the subject of a future topic.)

3. Measure the variable X (in this case, age) on all individuals in the sample, resulting in n
empirical observations, denoted generically by the sample data values {x1, x2, x3, ..., xn}.

4. Average these n values, and conduct a formal analysis to see whether this sample mean
suggests a statistically significant difference⁵ from the hypothesized value of 40 years.

5. Infer a "reject" or "accept" conclusion about the original population hypothesis.

3
Arguably, absolute certainty can be achieved only in an abstract context, such as when formally proving a
mathematical theorem in an axiomatic logical framework. For example, it is not feasible to verify definitively the
statement that the sum of any two odd numbers is even by checking a large finite number of examples, no matter
how many (which would technically only provide empirical evidence, albeit overwhelming), but it can be formally
proved using simple laws of algebra, once formal terms are defined.
4
Note that not all variables in a population are necessarily random, such as the variable X = the number of eggs in an
individual carton selected from a huge truckload of standard one dozen cartons, namely X = 12.
5
That is, a difference greater than what one would expect just from random chance. This analysis is the critical step.
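As a preview of how steps 2 through 5 might play out in R, here is a minimal sketch. The "population"
below is simulated, so every number in it is hypothetical, and the formal analysis shown (a one-sample
t-test) is not developed until Chapter 6:

    set.seed(1)                                       # for reproducibility
    population <- rnorm(100000, mean = 42, sd = 12)   # imagined "true" ages (hypothetical)
    n <- 50
    x <- sample(population, n)        # step 2: random sample of size n
    mean(x)                           # step 4: sample mean
    t.test(x, mu = 40)                # formal analysis (one-sample t-test, Chapter 6)
    # Step 5: reject or retain "population mean age = 40," depending on whether the
    # reported p-value falls below the chosen significance level.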
Synopsis of 1.4















The chart, reproduced in outline form below, is a highly simplified summary of some of the main
study designs used in biomedical research. As a general rule, experimental designs that test the
efficacy of some investigational treatment for patients are of primary interest to physicians and
other health care professionals. The gold standard of such tests is the randomized clinical trial
which, classically, randomly⁶ assigns each participant (of which there may be thousands) to one of
two arms of the study, either treatment or control (i.e., standard treatment or placebo), and later
compares the results of the two groups. Though expensive and time-consuming, clinical trials are
the most important tool used by the FDA to approve medical treatments for the public consumer
market. However, they are clearly not suitable for epidemiological investigations into such issues
as disease prevalence in populations, or the association between diseases and their potential risk
factors, such as lung cancer and smoking. These questions can be addressed with surveys and
longitudinal studies, specifically, case-control, where previous exposure status of participants
currently with disease (cases) and without disease (controls) is determined from medical records,
tumor registries, etc., and cohort, where currently exposed and unexposed groups are followed
over time, and their disease status compared at the end of the study. Two of the largest and
best-known cohort studies are the Framingham Heart Study (ongoing since 1948) and the
Nurses' Health Study.

⁶ This is usually done with mathematical algorithms that allow computers to generate "pseudorandom" numbers.
Advanced schemes exist for more complex scenarios, such as adaptive randomization for ongoing patient
recruitment during the study, block randomization for multicenter collaborative studies among several institutions,
etc. The entire purpose of randomizing is to minimize any source of systematic bias in the selection process.


[Study design classification chart, in outline form:

    Intervention?
      Yes -> Experimental: Randomized Clinical Trials (RCT)
      No  -> Observational:
               Cross-sectional (at a fixed time): surveys, prevalence studies, etc.
               Longitudinal (over time):
                 Retrospective (backward): Case-Control studies
                 Prospective (forward):    Cohort studies                            ]
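As a small illustration of the randomization described in footnote 6, here is a minimal R sketch,
using hypothetical participant labels and only the simplest 1:1 allocation (none of the advanced
schemes mentioned above):

    set.seed(2014)                              # pseudorandom, but reproducible
    ids <- paste0("patient_", 1:20)             # hypothetical participant labels
    arm <- sample(rep(c("treatment", "control"), each = 10))   # balanced 1:1 allocation
    data.frame(ids, arm)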

2. Exploratory Data Analysis and
Descriptive Statistics







2.1 Random Variables and Data Types

2.2 Graphical Displays of Sample Data

2.3 Summary Statistics

2.4 Summary

2.5 Problems




Ismor Fischer, 5/29/2012 2.1-1
X

2. Exploratory Data Analysis & Descriptive Statistics

2.1 Examples of Random Variables & Associated Data Types

NUMERICAL (Quantitative measurements)

Continuous: X =Length, Area, Volume, Temp,
Time elapsed, pH, Mass of tumor


Discrete: X =Shoe size, #weeks till death,
Time displayed, Rx dose, #tumors

CATEGORICAL (Qualitative attributes)

Nominal: X =Color (1 =Red, 2 =Green, 3 =Blue),
ID #, Zip Code, Type of tumor

Ordinal: X = Dosage (1 =Low, 2 =Med, 3 =High),
Year (2000, 2001, 2002, ),
Stage of tumor (I, II, III, IV),
Alphabet (01 =A, 02 =B, , 26 =Z)

Random variables are important in experiments because they ensure objective
reproducibility (i.e., verifiability, replicability) of results.

Example:





In any given study, the researcher must first decide what percentage of replicated experiments
should, in principle, obtain results that correctly agree (specifically, accept a true hypothesis),
and what percentage should incorrectly disagree (specifically, reject a true hypothesis), allowing
for random variation.

Confidence Level:   1 − α = 0.90, 0.95, 0.99 are common choices
Significance Level:     α = 0.10, 0.05, 0.01, the corresponding error rates
[Margin figures: a continuous variable X takes values along an interval; a discrete variable X
takes values in separate steps; ordinal categories 1 < 2 < 3 are ranked; nominal categories
1, 2, 3 are unranked.]

Special Case: Binary

        X = 1, "Success"     or     X = 0, "Failure"
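For reference, here is a minimal R sketch (not from the notes) showing how each data type above
might be stored; the variable names and values are hypothetical:

    age      <- c(23.4, 31.0, 18.9)                          # numerical: continuous
    n.tumors <- c(0L, 2L, 1L)                                # numerical: discrete
    color    <- factor(c("Red", "Green", "Blue"))            # categorical: nominal
    dosage   <- factor(c("Low", "High", "Med"),
                       levels = c("Low", "Med", "High"),
                       ordered = TRUE)                       # categorical: ordinal
    str(list(age, n.tumors, color, dosage))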
Ismor Fischer, 1/7/2013 2.2-1

2.2 Graphical Displays of Sample Data

Dotplots, Stem-and-Leaf Diagrams (Stemplots), Histograms, Boxplots, Bar Charts,
Pie Charts, Pareto Diagrams,

Example: Random variable X =Age (years) of individuals at Memorial Union.

Consider the following sorted random sample of n =20 ages:

{18, 19, 19, 19, 20, 21, 21, 23, 24, 24, 26, 27, 31, 35, 35, 37, 38, 42, 46, 59}


Dotplot

Comment: Uses all of the values. Simple, but crude; does not summarize the data.




Stemplot
Stem Leaves
Tens Ones
1 8 9 9 9
2 0 1 1 3 4 4 6 7
3 1 5 5 7 8
4 2 6
5 9

Comment: Uses all of the values more effectively. Grouping summarizes the data better.

[Dotplot of the 20 ages along the X (Age) axis, from 18 to 59.]
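Both displays can be reproduced with base R commands; this is a minimal sketch, independent of the
posted R code for Problem 2.5/1:

    ages <- c(18, 19, 19, 19, 20, 21, 21, 23, 24, 24, 26, 27, 31, 35, 35, 37, 38, 42, 46, 59)
    stripchart(ages, method = "stack", pch = 16, xlab = "Age (years)")   # dotplot
    stem(ages)                                                           # stem-and-leaf diagram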
Ismor Fischer, 1/7/2013 2.2-2

Histograms

   Class Interval    Frequency (# occurrences)
     [10, 20)              4
     [20, 30)              8
     [30, 40)              5
     [40, 50)              2
     [50, 60)              1
                         n = 20

[Frequency Histogram of the grouped ages, with bar heights 4, 8, 5, 2, 1.]
Ismor Fischer, 1/7/2013 2.2-3





   Class Interval    Absolute Frequency     Relative Frequency
                     (# occurrences)        (Frequency / n)
     [10, 20)              4                  4/20 = 0.20
     [20, 30)              8                  8/20 = 0.40
     [30, 40)              5                  5/20 = 0.25
     [40, 50)              2                  2/20 = 0.10
     [50, 60)              1                  1/20 = 0.05
                         n = 20              20/20 = 1.00

[Relative Frequency Histogram, with bar heights 0.20, 0.40, 0.25, 0.10, 0.05.]

Relative frequencies are always between 0 and 1, and their sum is always = 1 !
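A minimal R sketch of the same tabulation, assuming the base hist() function (class intervals of
the form [a, b) require right = FALSE):

    ages   <- c(18, 19, 19, 19, 20, 21, 21, 23, 24, 24, 26, 27, 31, 35, 35, 37, 38, 42, 46, 59)
    breaks <- seq(10, 60, by = 10)
    h <- hist(ages, breaks = breaks, right = FALSE, plot = FALSE)
    h$counts                      # absolute frequencies: 4 8 5 2 1
    h$counts / length(ages)       # relative frequencies: 0.20 0.40 0.25 0.10 0.05
    plot(h, freq = TRUE, main = "Frequency Histogram", xlab = "Age (years)")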
Ismor Fischer, 1/7/2013 2.2-4

Often, it is of interest to determine the total relative frequency, up to a certain value. For
example, we see here that 0.60 of the age data are under 30 years, 0.85 are under 40 years,
etc. The resulting cumulative distribution, which always increases monotonically from
0 to 1, can be represented by the discontinuous step function or staircase function in
the first graph below. By connecting the right endpoints of the steps, we obtain a
continuous polygonal graph called the ogive (pronounced o-jive), shown in the second
graph. This has the advantage of approximating the rate at which the cumulative
distribution increases within the intervals. For example, suppose we wish to know the
median age, i.e., the age that divides the values into equal halves, above and below. It is
clear from the original data that 25 does this job, but if data are unavailable, we can still
estimate it from the ogive. Imagine drawing a flat line from 0.5 on the vertical axis until it
hits the graph, then straight down to the horizontal Age axis somewhere in the interval
[20, 30); it is this value we seek. But the cumulative distribution up to 20 years is 0.2, and
up to 30 years is 0.6 a rise of 0.4 in 10 years, or 0.04 per year, on average. To reach 0.5
from 0.2 an increase of 0.3 would thus require a ratio of 0.3 / 0.04 =7.5 years from 20
years, or 27.5 years. Medians and other percentiles will be addressed in the next section.
   Class Interval   Absolute Frequency   Relative Frequency   Cumulative Relative Frequency
                    (# occurrences)      (Frequency / n)
     [0, 10)              0                   0.00              0.00
     [10, 20)             4                   0.20              0.20 = 0.00 + 0.20
     [20, 30)             8                   0.40              0.60 = 0.20 + 0.40
     [30, 40)             5                   0.25              0.85 = 0.60 + 0.25
     [40, 50)             2                   0.10              0.95 = 0.85 + 0.10
     [50, 60)             1                   0.05              1.00 = 0.95 + 0.05
                        n = 20                1.00
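The median estimate of 27.5 years can be reproduced from the cumulative relative frequencies;
a minimal R sketch:

    rel.freq <- c(0.00, 0.20, 0.40, 0.25, 0.10, 0.05)   # classes [0,10), [10,20), ..., [50,60)
    cumsum(rel.freq)                                    # 0.00 0.20 0.60 0.85 0.95 1.00
    # Median: the class [20, 30) carries the cumulative distribution from 0.20 up to 0.60,
    # so interpolate within it:
    20 + (0.5 - 0.2) / (0.6 - 0.2) * 10                 # 27.5 years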
Ismor Fischer, 1/7/2013 2.2-5

Problem! Suppose that all ages 30 and older are lumped into a single class interval:

{18, 19, 19, 19, 20, 21, 21, 23, 24, 24, 26, 27, 31, 35, 35, 37, 38, 42, 46, 59}

   Class Interval    Absolute Frequency     Relative Frequency
                     (# occurrences)        (Frequency / n)
     [10, 20)              4                  4/20 = 0.20
     [20, 30)              8                  8/20 = 0.40
     [30, 60)              8                  8/20 = 0.40
                         n = 20              20/20 = 1.00

[Relative Frequency Histogram, with bar heights 0.20, 0.40, 0.40. If this outlier (59) were
larger, the histogram would be even more distorted!]
Ismor Fischer, 1/7/2013 2.2-6

Remedy: Let Area of each class rectangle = Relative Frequency,
        i.e., Height of rectangle × Class Width = Relative Frequency.

        Therefore,  Density = Relative Frequency / Class Width.

   Class Interval          Absolute Frequency   Relative Frequency   Density
                           (# occurrences)      (Frequency / n)      (Rel Freq / Class Width)
     [10, 20); width = 10        4                4/20 = 0.20         0.20/10 = 0.02
     [20, 30); width = 10        8                8/20 = 0.40         0.40/10 = 0.04
     [30, 60); width = 30        8                8/20 = 0.40         0.40/30 = 0.0133...
                               n = 20            20/20 = 1.00

[Density Histogram, with rectangle heights (densities) 0.02, 0.04, 0.0133 and rectangle areas
(relative frequencies) 0.20, 0.40, 0.40. Total Area = 1!]
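A minimal R sketch of the density histogram; with unequal class widths, hist() with freq = FALSE
plots exactly the densities in the table above:

    ages <- c(18, 19, 19, 19, 20, 21, 21, 23, 24, 24, 26, 27, 31, 35, 35, 37, 38, 42, 46, 59)
    hist(ages, breaks = c(10, 20, 30, 60), right = FALSE, freq = FALSE,
         main = "Density Histogram", xlab = "Age (years)")
    # Rectangle heights are 0.02, 0.04, 0.0133..., and the total area is 1.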

Ismor Fischer, 8/5/2012 2.3-1

2.3 Summary Statistics Measures of Center and Spread

[Schematic:

 POPULATION: numerical random variable X, with some distribution (discrete or continuous).
   True center = ???    True spread = ???
   These are parameters (population characteristics): unknown, fixed numerical values,
   usually denoted by Greek letters, e.g., θ (theta).

 SAMPLE, size n:
   statistics (sample characteristics): known (or computable) numerical values obtained from
   sample data; estimators of the parameters; usually denoted by corresponding Roman letters.

 Measures of center: mode, median, mean
 Measures of spread: range, variance, standard deviation

 Statistical inference uses the sample statistics to draw conclusions about the unknown
 population parameters.]
Ismor Fischer, 8/5/2012 2.3-2

Measures of Center
For a given numerical random variable X, assume that a random sample {x1, x2, ..., xn} has been
selected, and sorted from lowest to highest values, i.e., x1 ≤ x2 ≤ ... ≤ xn.

sample median = the numerical "middle" value, in the sense that half the data values are
                smaller, half are larger.

    If n is odd, take the value in position # (n + 1)/2.
    If n is even, take the average of the two closest neighboring data values,
    left (position # n/2) and right (position # n/2 + 1).

Comments:

 The sample median is robust (insensitive) with respect to the presence of outliers.

 More generally, we can also define quartiles (Q1 = 25% cutoff, Q2 = 50% cutoff = median,
  Q3 = 75% cutoff), or percentiles (a.k.a. quantiles), which divide the data values into any
  given p% vs. (100 − p)% split. Example: SAT scores

sample mode = the data value with the largest frequency (f_max)

 Comment: The sample mode is robust to outliers.

If present, repeated sample data values can be neatly consolidated in a frequency table,
vis-a-vis the corresponding dotplot. (If a value xi is not repeated, then its fi = 1.)

   k distinct data    absolute frequency    relative frequency
   values of X        of xi                 of xi
       xi                  fi                 f(xi) = fi / n
       x1                  f1                 f(x1)
       x2                  f2                 f(x2)
       ...                 ...                ...
       xk                  fk                 f(xk)
                           n                  1

[Figures: a sorted sample x1 ≤ x2 ≤ ... ≤ xn with the median splitting the data into 50% / 50%,
and a dotplot of the k distinct values x1, x2, ..., xk with frequencies f1, f2, ..., fk, whose tallest
column (f_max) marks the mode and whose balance point marks the mean.]
Ismor Fischer, 8/5/2012 2.3-3

Example: n = 12 random sample values of X = "Body Temperature" (°F):

  {98.5, 98.6, 98.6, 98.6, 98.6, 98.6, 98.9, 98.9, 98.9, 99.1, 99.1, 99.2}

      xi       fi      f(xi)
     98.5       1      1/12
     98.6       5      5/12
     98.9       3      3/12
     99.1       2      2/12
     99.2       1      1/12
             n = 12      1

sample median = (98.6 + 98.9)/2 = 98.75 °F   (six data values on either side)

sample mode = 98.6 °F

sample mean = (1/12) [ (98.5)(1) + (98.6)(5) + (98.9)(3) + (99.1)(2) + (99.2)(1) ]

   or,       = (98.5)(1/12) + (98.6)(5/12) + (98.9)(3/12) + (99.1)(2/12) + (99.2)(1/12) = 98.8 °F

sample mean = the weighted average of all the data values

Comments:

The sample mean is the center of mass, or balance point, of the data values.

The sample mean is sensitive to outliers. One common remedy for this

Trimmed mean: Compute the sample mean after deleting a predetermined
number or percentage of outliers from each end of the data set, e.g., 10%
trimmed mean. Robust to outliers by construction.

    x̄ = (1/n) Σ_{i=1}^{k} xi fi ,       where fi is the absolute frequency of xi

      = Σ_{i=1}^{k} xi f(xi) ,          where f(xi) = fi / n is the relative frequency of xi

[Figures: a dotplot of the body-temperature frequencies (1, 5, 3, 2, 1 at 98.5, 98.6, 98.9, 99.1,
99.2 °F) balancing about the mean, and an illustration of a 10% trimmed mean, with 10% of the
data deleted from each end of the distribution.]
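A minimal R sketch of these measures of center for the body-temperature sample (which.max() on
the frequency table is just one convenient way to pick out the mode):

    temps <- c(98.5, rep(98.6, 5), rep(98.9, 3), 99.1, 99.1, 99.2)
    median(temps)                         # 98.75
    names(which.max(table(temps)))        # "98.6"  (sample mode, from the frequency table)
    mean(temps)                           # 98.8
    weighted.mean(c(98.5, 98.6, 98.9, 99.1, 99.2), w = c(1, 5, 3, 2, 1))   # same, from the table
    mean(temps, trim = 0.10)              # 10% trimmed mean (robust to outliers)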
Ismor Fischer, 8/5/2012 2.3-4
Grouped Data ~ Suppose the original values had been lumped into categories.

Example: Recall the grouped Memorial Union age data set:

   xi    Class Interval    Frequency fi    Relative Frequency fi/n    Density (Rel Freq / Class Width)
   15      [10, 20)             4                 0.20                        0.02
   25      [20, 30)             8                 0.40                        0.04
   45      [30, 60)             8                 0.40                        0.0133...
                              n = 20              1.00

group mean: Same formula as above, with xi = midpoint of the i-th class interval.

    x̄_group = (1/20) [ (15)(4) + (25)(8) + (45)(8) ] = 31.0 years

Exercise: Compare this value with the ungrouped sample mean x̄ = 29.2 years.

group median (& other quantiles):

By definition, the median Q divides the data set into equal halves, i.e., 0.50 above and below.
In this example, it must therefore lie in the class interval [20, 30), and divide the 0.40 area of
the corresponding class rectangle in the density histogram into a left piece of area 0.30 and a
right piece of area 0.10. Since the 0.10 strip is 1/4 of that area, it proportionally follows that Q
must lie at 1/4 of the class width 30 − 20 = 10, or 2.5, from the right endpoint of 30. That is,
Q = 30 − 2.5, or Q = 27.5 years. (Check that the ungrouped median = 25 years.)

[Density Histogram, with heights 0.02, 0.04, 0.0133 and the median Q marked inside [20, 30),
splitting the 0.40 area into 0.30 and 0.10.]
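A minimal R sketch of the grouped calculations above (the median line uses the interpolation
idea developed on the next two pages):

    midpts   <- c(15, 25, 45)
    freqs    <- c(4, 8, 8)
    sum(midpts * freqs) / sum(freqs)                # group mean = 31.0 years
    # Group median by linear interpolation inside [20, 30):
    a <- 20; b <- 30
    F.low <- 0.20; F.high <- 0.60                   # cumulative rel. freq. at a and at b
    a + (0.5 - F.low) / (F.high - F.low) * (b - a)  # 27.5 years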
Ismor Fischer, 8/5/2012 2.3-5

Formal approach ~

First, identify which class interval [a, b) contains the desired quantile Q (e.g., median,
quartile, etc.), and determine the respective left and right areas A and B into which it
divides the corresponding class rectangle. Equating proportions for

        Density = (A + B) / (b − a),

we obtain

        Density = A / (Q − a) = B / (b − Q),

from which it follows that

        Q = a + A / Density     or     Q = b − B / Density     or     Q = (Ab + Ba) / (A + B).

For example, in the grouped Memorial Union age data, we have a = 20, b = 30, and
A = 0.30, B = 0.10. Substituting these values into any of the equivalent formulas above
yields the median Q2 = 27.5.

Exercise: Now that Q2 is found, use the formula again to find the first and third quartiles
Q1 and Q3, respectively.

Note also from above, we obtain the useful formulas

        A = Density × (Q − a)          B = Density × (b − Q)

for calculating the areas A and B, when a value of Q is given! This can be used when
finding the area between two quantiles Q1 and Q2. (See next page for another way.)

[Figure: one class rectangle over [a, b), with the quantile Q splitting its area into a left
piece A and a right piece B; the height of the rectangle is the Density.]
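The "formal approach" can be wrapped in a small helper; a minimal sketch (the function name is
made up for illustration), applied to the median found above:

    group.quantile <- function(a, b, A, B) a + A / ((A + B) / (b - a))   # Q = a + A / Density
    group.quantile(a = 20, b = 30, A = 0.30, B = 0.10)                   # median Q2 = 27.5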

Ismor Fischer, 8/5/2012 2.3-6

Alternative approach ~ First, form the cumulative relative frequency column:

  Class Interval     Frequency fi   Relative Frequency fi/n   Cumulative Relative Frequency
                                                              Fi = f1/n + f2/n + ... + fi/n
      I0                  0                 0                        0
      I1                  f1                f1/n                     F1
      I2                  f2                f2/n                     F2
      ...                 ...               ...                      ...
      Ii                  fi                fi/n                     F_low  < 0.5
      I_{i+1} = [a, b)    f_{i+1}           f_{i+1}/n                F_high > 0.5    <-- Q = ? in here
      ...                 ...               ...                      ...
      Ik                  fk                fk/n                     1
                          n                 1

Next, identify F_low and F_high, which bracket 0.5, and let [a, b) be the class interval of
the latter. Then

        Q = a + [(0.5 − F_low) / (F_high − F_low)] (b − a)
   or
        Q = b − [(F_high − 0.5) / (F_high − F_low)] (b − a).

Again, in the grouped Memorial Union age data, we have a = 20, b = 30, F_low = 0.2, and
F_high = 0.6 (why?). Substituting these values into either formula yields the median Q2 = 27.5.

To find Q1, replace the 0.5 in the formula by 0.25; to find Q3, replace the 0.5 in the formula
by 0.75, etc.

Conversely, if a quantile Q in an interval [a, b) is given, then we can solve for the cumulative
relative frequency F(Q) up to that quantile value:

        F(Q) = F(a) + [(F(b) − F(a)) / (b − a)] (Q − a).

It follows that the relative frequency (i.e., area) between two quantiles Q1 and Q2 is equal to
the difference between their cumulative relative frequencies: F(Q2) − F(Q1).

Ismor Fischer, 8/5/2012 2.3-7

Shapes of Distributions

Symmetric distributions correspond to values that are spread equally about a center.

        mean = median

Examples: (Drawn for smoothed histograms of a random variable X.)

        uniform        triangular        bell-shaped

Note: An important special case of the bell-shaped curve is the normal distribution,
a.k.a. Gaussian distribution. Example: X = IQ score

Otherwise, if more outliers of X occur on one side of the median than the other, the
corresponding distribution will be skewed in that direction, forming a tail.

        skewed to the left (negatively skewed):    mean < median
        skewed to the right (positively skewed):   median < mean

Examples: X = calcium level (mg),  X = serum cholesterol level (mg/dL)

Furthermore, distributions can also be classified according to the number of peaks:

        unimodal        bimodal        multimodal
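A minimal simulation sketch of the mean/median relationship under skewness (the distributions
below are illustrative only, not tied to the calcium or cholesterol examples):

    set.seed(3)
    sym   <- rnorm(10000, mean = 100, sd = 15)     # roughly bell-shaped
    right <- rexp(10000, rate = 1/50)              # skewed to the right
    c(mean(sym),   median(sym))                    # approximately equal
    c(mean(right), median(right))                  # mean pulled above the median by the right tail
    hist(right, breaks = 50, main = "Positively skewed", xlab = "x")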
Ismor Fischer, 8/5/2012 2.3-8

Measures of Spread

Again assume that a numerical random sample {x1, x2, ..., xn} has been selected, and
sorted from lowest to highest values, i.e., x1 ≤ x2 ≤ ... ≤ xn.

sample range = xn − x1   (highest value − lowest value)

Comments:

 Uses only the two most extreme values. Very crude estimator of spread.

 The sample range is extremely sensitive to outliers. One common remedy...

 Interquartile range (IQR) = Q3 − Q1. Robust to outliers by construction.
  (The quartiles Q1, Q2, Q3 split the sorted data into four pieces of 25% each.)

 If the original data are grouped into k class intervals [a1, a2), [a2, a3), ..., [ak, a_{k+1}),
  then the group range = a_{k+1} − a1. A similar calculation holds for group IQR.

Example: The "Body Temperature" data set has a sample range = 99.2 − 98.5 = 0.7 °F.

  {98.5, 98.6, 98.6, 98.6, 98.6, 98.6, 98.9, 98.9, 98.9, 99.1, 99.1, 99.2}

      xi       fi
     98.5       1
     98.6       5
     98.9       3
     99.1       2
     99.2       1
             n = 12
Ismor Fischer, 8/5/2012 2.3-9
For a much less crude measure of spread that uses all the data, first consider the following

Definition: xi − x̄ = individual deviation of the i-th sample data value from the sample mean

      xi      xi − x̄     fi
     98.5      −0.3       1
     98.6      −0.2       5
     98.9      +0.1       3
     99.1      +0.3       2
     99.2      +0.4       1
                       n = 12
              (here x̄ = 98.8)

Naively, an estimate of the spread of the data values might be calculated as the average of
these n = 12 individual deviations from the mean. However, this will always yield zero!

FACT:    Σ_{i=1}^{k} (xi − x̄) fi = 0,    i.e., the sum of the deviations is always zero.

Check: In this example, the sum = (−0.3)(1) + (−0.2)(5) + (+0.1)(3) + (+0.3)(2) + (+0.4)(1) = 0.

Exercise: Prove this general fact algebraically.

Interpretation: The sample mean is the center of mass, or "balance point," of the data values.

[Figure: dotplot of the body-temperature data, with frequencies 1, 5, 3, 2, 1 balancing about
the mean 98.8 °F.]
Ismor Fischer, 8/5/2012 2.3-10

Best remedy: To make them non-negative, square the deviations before summing.

sample variance             s² = [1 / (n − 1)] Σ_{i=1}^{k} (xi − x̄)² fi

sample standard deviation   s = +√(s²)

(Note: s² is not on the same scale as the data values; s is on the same scale as the data values.)

Example:

      xi      xi − x̄    (xi − x̄)²    fi
     98.5      −0.3       +0.09        1
     98.6      −0.2       +0.04        5
     98.9      +0.1       +0.01        3
     99.1      +0.3       +0.09        2
     99.2      +0.4       +0.16        1
                                    n = 12

Then

    s² = (1/11) [ (0.09)(1) + (0.04)(5) + (0.01)(3) + (0.09)(2) + (0.16)(1) ] = 0.06 (°F)²,

so that s = √0.06 = 0.245 °F. "Body Temp" has a small amount of variance.

Comments:

 s² = [1 / (n − 1)] Σ (xi − x̄)² fi has the important, frequently-recurring form SS/df, where
  SS = Sum of Squares (sometimes also denoted Sxx) and df = degrees of freedom = n − 1,
  since the n individual deviations have a single constraint. (Namely, their sum must equal zero.)

 The same formulas are used for grouped data, with x̄_group, and xi = class interval midpoint.

 Exercise: Compute s for the grouped and ungrouped Memorial Union age data.

 A related measure of spread is the absolute deviation, defined as (1/n) Σ |xi − x̄| fi,
  but its statistical properties are not as well-behaved as the standard deviation.

 Also, see Appendix > Geometric Viewpoint > Mean and Variance, for a way to understand
  the sum of squares formula via the Pythagorean Theorem (!), as well as a useful alternate
  computational formula for the sample variance.
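A minimal R sketch confirming the variance and standard deviation just computed:

    temps <- c(98.5, rep(98.6, 5), rep(98.9, 3), 99.1, 99.1, 99.2)
    sum((temps - mean(temps))^2) / (length(temps) - 1)   # SS/df = 0.06
    var(temps)                                           # same, via the built-in function
    sd(temps)                                            # 0.245 (same scale as the data)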
Ismor Fischer, 8/5/2012 2.3-11

Typical Grouped Data Exam Problem

   Age Intervals    Frequencies
     [0, 18)             -
     [18, 24)           208
     [24, 30)           156
     [30, 40)           104
     [40, 60)            52
                        520

Given the sample frequency table of age intervals shown above, answer the following.

1. Sketch the density histogram. (See Lecture Notes, page 2.2-6)
2. Sketch the graph of the cumulative distribution. (page 2.2-4)
3. What proportion of the sample is under 36 yrs old? (pages 2.3-5 bottom, 2.3-6 bottom)
4. What proportion of the sample is under 45 yrs old? (same)
5. What proportion of the sample is between 36 and 45 yrs old? (same)
6. Calculate the values of the following grouped summary statistics.
    Quartiles Q1, Q2, Q3 and IQR (pages 2.3-4 to 2.3-6)
    Mean (page 2.3-4)
    Variance (page 2.3-10, second comment on bottom)
    Standard deviation (same)

Solutions at http://www.stat.wisc.edu/~ifischer/Grouped_Data_Sols.pdf
Ismor Fischer, 5/29/2012 2.4-1
2.4 Summary   (Compare with first page of 2.3.)

[Schematic:

 POPULATION: numerical random variable X, with some distribution (discrete or continuous).
   Parameters:  Mean μ (mu),  Variance σ²,  Standard Deviation σ (sigma).

 SAMPLE, size n (with its density histogram of relative frequencies):
   Statistics:  Mean x̄,  Variance s²,  Standard Deviation s.

 Statistical inference: estimators of the parameters μ, σ², and σ can be calculated via the
 corresponding statistics x̄, s², and s.]















Comments:

 The population mean μ and variance σ² are defined in terms of expected value:

        μ = E[X] = Σ_{all x} x f(x),        σ² = E[(X − μ)²] = Σ_{all x} (x − μ)² f(x)

  if X is discrete (with corresponding integration formulas if X is continuous), where
  f(x) is the probability of value x occurring in the population, i.e., P(X = x). Later...

 If n is used instead of n − 1 in the denominator of s², the expected value is always less
  than σ². Consistent under- (or over-) estimation of a parameter by a statistic is called
  bias. The formulas given for the sample mean and variance are unbiased estimators.
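A minimal simulation sketch of the bias comment (the "true" variance σ² = 4 below is assumed
purely for illustration):

    set.seed(4)
    sigma2 <- 4; n <- 5; reps <- 10000
    biased   <- replicate(reps, { x <- rnorm(n, 0, sqrt(sigma2)); sum((x - mean(x))^2) / n })
    unbiased <- replicate(reps, { x <- rnorm(n, 0, sqrt(sigma2)); var(x) })   # divides by n - 1
    c(mean(biased), mean(unbiased))    # roughly 3.2 vs. 4.0: dividing by n underestimates sigma^2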
Ismor Fischer, 5/29/2012 2.4-2

Chebyshev's Inequality

Whatever the shape of the distribution, at least 75% of the values lie within 2 standard
deviations of the mean, at least 89% lie within 3 standard deviations, etc. More generally,
at least

        (1 − 1/k²) × 100%

of the values lie within k standard deviations of the mean. (Note that k > 1, but it need not
be an integer!)

[Figure: a distribution with the intervals μ ± 1σ, μ ± 2σ, μ ± 3σ marked; at least 75% of the
values lie within μ ± 2σ, and at least 89% within μ ± 3σ.  Pafnuty Chebyshev (1821-1894).]

Exercise: Suppose that a population of individuals has a mean age of μ = 40 years, and
standard deviation of σ = 10 years. At least how much of the population is between 20 and
60 years old? Between 15 and 65 years old? What symmetric age interval about the mean is
guaranteed to contain at least half the population?

Note: If the distribution is bell-shaped, then approximately 68% lie within μ ± 1σ,
approximately 95% lie within μ ± 2σ, and approximately 99.7% lie within μ ± 3σ. For other
multiples of σ, percentages can be obtained via software or tables. Much sharper than
Chebyshev's general result, which can be overly conservative, this can be used to check
whether a distribution is reasonably bell-shaped for use in subsequent testing procedures.
(Later...)
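A minimal R sketch comparing Chebyshev's guaranteed lower bound with the bell-shaped percentages:

    k <- c(1.5, 2, 3)
    chebyshev <- 1 - 1/k^2                   # guaranteed lower bounds, any distribution
    normal    <- pnorm(k) - pnorm(-k)        # exact percentages for a normal distribution
    round(rbind(k, chebyshev, normal), 3)
    # e.g., k = 2: Chebyshev guarantees at least 75%, while a normal curve gives about 95%.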
Ismor Fischer, 2/8/2014 Solutions / 2.5-1
2.5 Solutions

1. Implementing various R commands reproduces exact (or similar) results to those in the
notes, plus more. Remember that you can always type help(command) for detailed
online information. In particular, help(par)yields much information on options used
in R plotting commands. For a general HTML browser interface, type help.start().

Full solution in progress


2. Data Types

(a) Amount of zinc: It is given that the coins are composed of a 5% alloy of tin and zinc.
Therefore, the exact proportion of zinc must correspond to some value in the interval
(0, 0.05); hence this variable is numerical, and in particular, continuous.

Image on reverse: There are only two possibilities, either wheat stalks or the Lincoln
Memorial, thus categorical: nominal: binary.

Year minted: Any of {1946, 1947, 1948, , 1961, 1962}. As these numbers purely
represent an ordered sequence of labels, this variable would typically be classified as
categorical: ordinal. (Although, an argument can technically be made that each of
these numbers represents the quantity of years passed since year 0, and hence is
numerical: discrete. However, this interpretation of years as measurements is not
the way they are normally used in most practical applications, including this one.)

City minted: Denver, San Francisco, or Philadelphia. Hence, this variable is
categorical: nominal: not binary, as there are more than two unordered categories.

Condition: Clearly, these classifications are not quantities, but a list of labels with a
definite order, so categorical: ordinal.

(b) Out of 1000 coins dropped, the number of heads face-up can be any integer from 0 to
    1000, i.e., {0, 1, 2, ..., 999, 1000}, hence numerical: discrete. It follows that the
    proportion of heads face-up can be any fraction from {0/1000, 1/1000, 2/1000, ..., 999/1000, 1000/1000},
    i.e., {0, .001, .002, ..., .999, 1} in decimal format, hence is also numerical: discrete.
    (However, for certain practical applications, this may be approximately modeled by
    the continuous interval [0, 1], for convenience.)

Ismor Fischer, 2/8/2014 Solutions / 2.5-2

3. Dotplots


All of these samples have the same mean value x̄ = 4. However, the sample variances
(28/6, 10/6, 2/6, and 0, respectively), and hence the standard deviations (2.160, 1.291,
0.577, and 0, respectively) become progressively smaller, until there is literally no
variation at all in the last sample of equal data values. This behavior is consistent with
the dotplots, whose shapes exhibit progressively smaller spread and hence
progressively higher peak concentrations about the mean as the standard deviation
decreases.

Ismor Fischer, 2/8/2014 Solutions / 2.5-3

4. The following properties can be formally proved using algebra. (Exercise)

(a) If the same constant b is added to every value of a data set {x1, x2, x3, ..., xn}, then the
    entire distribution is shifted by exactly that amount b, i.e., {x1 + b, x2 + b, ..., xn + b}.
    Therefore, the mean also changes by b (i.e., from x̄ to x̄ + b), but the amount of spread
    does not change. That is, the variance and standard deviation are unchanged (as a
    simple calculation will verify). In general, for any dataset x and constant b, it follows that

        mean(x + b) = mean(x) + b   and   var(x + b) = var(x),   so that   sd(x + b) = sd(x),

    i.e., the mean of the shifted data is x̄ + b, while s²_{x+b} = s²_x and s_{x+b} = s_x.

(b) If every data value of {x1, x2, x3, ..., xn} is multiplied by a nonzero constant a, then the
    distribution becomes {ax1, ax2, ax3, ..., axn}. Therefore, the mean is multiplied by this
    amount a as well (i.e., mean = a x̄), but the variance (which is on the scale of the square
    of the data) is multiplied by a², which is positive, no matter what the sign of a. Its square
    root, the standard deviation, is therefore multiplied by √(a²) = |a|, the absolute value of a,
    always positive. In general, if a is any constant, then

        mean(a x) = a mean(x)   and   var(a x) = a² var(x),   so that   sd(a x) = |a| sd(x),

    i.e., the mean of the rescaled data is a x̄, while s²_{ax} = a² s²_x and s_{ax} = |a| s_x.

    In particular, if a = −1, then the mean changes sign, but the variance, and hence the
    standard deviation, remain the same positive values that they were before. That is,
    the mean becomes −x̄, while s²_{−x} = s²_x and s_{−x} = s_x.

    [Margin figure: "Add b" shifts the dotplot from x to x + b; "Multiply by a" rescales x to ax.]

5.
            Sun   Mon   Tues   Wed   Thurs   Fri   Sat
   Week 1   +8    +8    +8     +5    +3      +3     0
   Week 2    0    −3    −3     −5    −8      −8    −8

(a) For Week 1, the mean temperature is

        x̄ = [3(8) + 1(5) + 2(3) + 1(0)] / 7 = 5,

    and the variance is

        s² = [3(8 − 5)² + 1(5 − 5)² + 2(3 − 5)² + 1(0 − 5)²] / (7 − 1) = 60/6 = 10.   (s = √10)

(b) Note that the Week 2 temperatures are the negative values of the Week 1 temperatures.
    Therefore, via the result in 4(b), the Week 2 mean temperature is −5, while the variance
    is exactly the same, 10.   (s = √10)

    Check:  x̄ = [3(−8) + 1(−5) + 2(−3) + 1(0)] / 7 = −5

            s² = [3(−8 + 5)² + 1(−5 + 5)² + 2(−3 + 5)² + 1(0 + 5)²] / (7 − 1) = 60/6 = 10



Ismor Fischer, 2/8/2014 Solutions / 2.5-4

6. (a) self-explanatory

(b) sum(x.vals)/5 and mean(x.vals) will both yield identical results, xbar.
(c) sum((x.vals xbar)^2)/4 and var(x.vals) will both yield identical
results, s.sqrd.
(d) sqrt(s.sqrd) and sd(x.vals) will both yield identical results.
7. The numerators of the z-values are simply the deviations of the original x-values from their
   mean x̄; hence their sum = 0 (even after dividing each of them by the same standard
   deviation s_x of the x-values), so it follows that z̄ = 0. Moreover, since the denominators of
   the z-values are all the same constant s_x, it follows that the new standard deviation is equal
   to s_x divided by s_x, i.e., s_z = 1. In other words, subtracting the sample mean x̄ from
   each x_i results in deviations x_i − x̄ that are centered around a new mean of 0.
   Dividing them by their own standard deviation s_x results in standardized deviations
   z_i that have a new standard deviation of s_x / s_x = 1. (This is informal. See Problem 2.5/4
   above for the formal mathematical details.)
8.
(a) If the two classes are pooled together to form n1 + n2 = 50 students, then the first class
    of n1 = 20 students contributes a relative frequency of 2/5 toward the combined score,
    while the second class of n2 = 30 students contributes a relative frequency of 3/5
    toward the combined score. Hence, the weighted average is equal to

        (2/5)(90) + (3/5)(80) = 84.

    More generally, the formula for the grand mean of two groups having means x̄ and ȳ,
    respectively, is

        (n1 x̄ + n2 ȳ) / (n1 + n2).

(b) If two classes have the same mean score, then so will the combined classes, regardless
    of their respective sizes. (You may have to think about why this is true, if it is not
    apparent to you. For example, what happens to the grand mean formula above if
    x̄ = ȳ?) However, calculating the combined standard deviation is a bit more subtle.
    Recall that the sample variance is given by s² = SS/df, where the sum of squares
    SS = Σ (x_i − x̄)² f_i, and degrees of freedom df = n − 1. For the first class, we are
    told that s1 = 7 and n1 = 24, so that 49 = SS1/23, or SS1 = 1127. Similarly, for the
    second class, we are told that s2 = 10 and n2 = 44, so that 100 = SS2/43, or SS2 = 4300.
    Combining the two, we would have a large sample of size n1 + n2 = 68, whose values,
    say z_i, consist of both x_i and y_i scores, with a mean value equal to the same value of
    x̄ and ȳ (via the comments above). Denoting this common value by c, we obtain

        SS_both = Σ (z_i − c)² f_i = Σ (x_i − c)² f_i + Σ (y_i − c)² f_i = SS1 + SS2 = 1127 + 4300,

    i.e., SS_both = 5427, and df_both = 67. Thus, s²_both = SS_both / df_both = 5427/67 = 81,
    so that s_both = 9.
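A minimal R check of this calculation:

    n1 <- 24; s1 <- 7
    n2 <- 44; s2 <- 10
    SS1 <- s1^2 * (n1 - 1)                 # 1127
    SS2 <- s2^2 * (n2 - 1)                 # 4300
    sqrt((SS1 + SS2) / (n1 + n2 - 1))      # combined standard deviation = 9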

Ismor Fischer, 2/8/2014 Solutions / 2.5-5
9. Note: Because the data are given in grouped form, all numerical calculations and
   resulting answers are necessarily approximations of the true ungrouped values.

   Midpoint   Age Group             Frequency   Relative Frequency   Density
      5       [0, 10);  width = 10       9         9/90 = 0.1         0.1/10 = 0.010
     17.5     [10, 25); width = 15      27        27/90 = 0.3         0.3/15 = 0.020
     45       [25, 65]; width = 40      54        54/90 = 0.6         0.6/40 = 0.015
                                        90        90/90 = 1.0

(a) group mean = (1/90) [(5)(9) + (17.5)(27) + (45)(54)] = 32.75 years, or 32 years, 9 months

    group variance = (1/89) [(5 − 32.75)²(9) + (17.5 − 32.75)²(27) + (45 − 32.75)²(54)]
                   = 239.4733,

    group standard deviation = √239.4733 = 15.475 years, or 15 years, 5.7 months

(b) Relative Frequency Histogram and (c) Density Histogram:
    [bar heights 0.1, 0.3, 0.6 for the relative frequency histogram, and 0.010, 0.020, 0.015
     for the density histogram, over the three class intervals.]
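A minimal R check of part (a):

    midpts <- c(5, 17.5, 45)
    freqs  <- c(9, 27, 54)
    n      <- sum(freqs)
    gmean  <- sum(midpts * freqs) / n                       # 32.75 years
    gvar   <- sum((midpts - gmean)^2 * freqs) / (n - 1)     # 239.4733
    c(gmean, gvar, sqrt(gvar))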

Ismor Fischer, 2/8/2014 Solutions / 2.5-6
(d) The age interval [15, 25) contains 2/3 of the 30%, or 20%, of the sample values found
    in the interval [10, 25). Likewise, the remaining interval [25, 35) contains 1/4 of the
    60%, or 15%, found in the interval [25, 65). Therefore, the interval [15, 35) contains
    20% + 15%, or 35%, of the sample.

    [Histogram sketch: the relative frequencies .10, .20 within [10, 25) and .15, .45 within
    [25, 65) illustrate the split.]

(e) Quartiles are computed similarly. The median Q2 divides the total area into equal
    halves, and so must be one-sixth of the way inside the last interval [25, 65), i.e.,
    Q2 = 31 years, 8 months. After that, the remaining areas are halved, so Q1 coincides
    with the midpoint of [10, 25), i.e., Q1 = 17 years, 6 months, and Q3 with the midpoint
    of [Q2, 65), i.e., Q3 = 48 years, 4 months.

    [Histogram sketch: Q1 = 17.5, Q2 = 31 2/3, Q3 = 48 1/3 split the total area into four
    pieces of .25 each.]

(f) Range = 65 − 0 = 65 years,
    IQR = Q3 − Q1 = 48 yrs, 4 mos − 17 yrs, 6 mos = 30 years, 10 months
Ismor Fischer, 2/8/2014 Solutions / 2.5-7

10.
(a) The number of possible combinations is equal to the number of possible rearrangements
    of x objects (the "ones") among n objects. This is the well-known combinatorial symbol
    "n-choose-x":

        C(n, x) = n! / [x! (n − x)!].

    (See the Basic Reviews section of the Appendix.)

(b) Relative frequency table: Each y_i = 1 or 0; the former occurs with a frequency of x
    times, the latter with a frequency of (n − x) times. Therefore, y_i = 1 corresponds to a
    relative frequency of p = x/n, so that y_i = 0 corresponds to a relative frequency of 1 − p.

         y_i     frequency f_i     relative frequency f(y_i)
          1           x                     p
          0         n − x                  1 − p
                      n                     1

(c) Clearly, the sum of all the y_i values is equal to x (the number of "ones"), so the mean is
    ȳ = x/n = p. Or, from the table, ȳ = (1)(p) + (0)(1 − p) = p.

(d) We have

        s²_y = [1/(n − 1)] [ (1 − p)²(x) + (0 − p)²(n − x) ]
             = [n/(n − 1)] [ (1 − p)²(p) + (0 − p)²(1 − p) ]      (recall that x = n p)
             = [n/(n − 1)] p (1 − p).


Ismor Fischer, 2/8/2014 Solutions / 2.5-8

11.
(a) Given {10, 10, 10, ..., 10, 60, 60, 60, ..., 60}, where half the values are 10 and half the
    values are 60, it clearly follows that...

         x_i     f(x_i)
         10       0.5
         60       0.5

    sample mean x̄ = (10)(0.5) + (60)(0.5) = 35, and
    sample median = (10 + 60)/2 = 35 as well.
    (This is an example of a symmetric distribution.)

(b) Given only grouped data however, we have...

         Class Interval                       Relative Frequency
         [0, 20);   midpt = 10, width = 20          0.5
         [20, 100); midpt = 60, width = 80          0.5
                                                     1

    sample mean x̄_group = (10)(0.5) + (60)(0.5) = 35 as above, and

    sample group median = 20, since it is that value which divides the grouped data into
    equal halves, clearly very different from the true median found in (a).

    Because the density histogram is so constructed that its total area = 1, it can be
    interpreted as a physical system of aligned rectangular weights, whose total mass = 1.
    The fact that the deviations from the mean sum to 0 can be interpreted as saying that,
    from the mean, all of the negative (i.e., to the left) and positive (i.e., to the right)
    horizontal forces cancel out exactly, and the system is at perfect equilibrium there. That
    is, the mean is the "balance point" or center of mass of the system. (This is the reason
    it is called a density histogram, for by definition, density of physical matter = amount of
    mass per unit volume, area, or in this case, width.) This property is not true of the other
    histogram, whose rectangular heights, not areas, measure the relative frequencies,
    and therefore sum to 1; hence there is no analogous physical interpretation for the mean.

    [Sketches: the relative frequency histogram and the density histogram of the grouped
    data, each with the group sample mean (35) and group sample median (20) marked.]
Ismor Fischer, 2/8/2014 Solutions / 2.5-9

12. The easiest (and most efficient) way to solve this is to first choose the notation judiciously.
    Recall that we define d_i = x_i − x̄ to be the i-th deviation of a value x_i from the mean x̄,
    and that they must sum to zero. As the mean is given as x̄ = 80, and three of the four quiz
    scores are equal, we may therefore represent them as {x1, x2, x3, x4}, where

        x1 = 80 + d,   x2 = 80 + d,   x3 = 80 + d,   and   x4 = 80 − 3d.

    Hence, the variance would be given by

        s² = [d² + d² + d² + (3d)²] / (4 − 1) = 12d²/3 = 4d²,

    so the standard deviation is s = 2|d|. Because s = 10 (given), it follows that d = ±5, whereby
    the quiz scores can be either {85, 85, 85, 65} or {75, 75, 75, 95}. Both sets satisfy the
    conditions that x̄ = 80 and s = 10. [Note: Other notation would still yield the same answers
    (if solved correctly, of course), but the subsequent calculations might be much messier.]


13. Straightforward algebra.



Ismor Fischer, 9/21/2014 2.5-1
2.5 Problems

1. Follow the instructions in the posted R code folder
(http://www.stat.wisc.edu/~ifischer/Intro_Stat/Lecture_Notes/Rcode/) for this problem, to
reproduce the results that appear in the lecture notes for the Memorial Union age data.

2. A numismatist (coin collector) has a large collection of pennies minted between the years
1946-1962, when they were made of bronze: 95% copper, and 5% tin and zinc. (Today,
pennies have a 97.5% zinc core; the remaining 2.5% is a very thin layer of copper plating.)
The year the coin was minted appears on the obverse side (i.e., heads), sometimes with a
letter below it, indicating the city where it was minted: D (Denver), S (San Francisco), or
none (Philadelphia). Before 1959, a pair of wheat stalks was depicted on the reverse side
(i.e., tails); starting from that year, this image was changed to the Lincoln Memorial.
The overall condition of the coin follows a standard grading scale Poor (PO or PR), Fair
(FA or FR), About Good (AG), Good (G), Very Good (VG), Fine (F), Very Fine (VF),
Extremely Fine (EF or XF), Almost Uncirculated (AU), and Uncirculated or Mint State
(MS) which determines the coins value.

(a) Using this information, classify each of the following variables as either numerical
(specify continuous or discrete) or categorical (specify nominal: binary, nominal:
not binary, or ordinal).

Amount of zinc Image on reverse Year minted City minted Condition

(b) Suppose the collector accidentally drops 1000 pennies. Repeat the instructions in (a)
for the variables

Number of heads face-up Proportion of heads face-up

3. Sketch a dotplot (by hand) of the distribution of values for each of the data sets below, and
calculate the mean, variance, and standard deviation of each.

U: 1, 2, 3, 4, 5, 6, 7

X: 2, 3, 4, 4, 4, 5, 6

Y: 3, 4, 4, 4, 4, 4, 5

Z: 4, 4, 4, 4, 4, 4, 4

What happens to the mean, variance, and standard deviation, as we progress from one
data set to the next? What general observations can you make about the relationship
between the standard deviation, and the overall shape of the corresponding distribution?
In simple terms, why should this be so?

4. Useful Properties of Mean, Variance, Standard Deviation

(a) Suppose that a constant b is added to every value of a data set {x1, x2, x3, ..., xn}, to
    produce a new data set {x1 + b, x2 + b, x3 + b, ..., xn + b}. Exactly how are the mean,
    variance, and standard deviation affected, and why? (Hint: Think of the dotplot.)

(b) Suppose that every value in a data set {x1, x2, x3, ..., xn} is multiplied by a nonzero
    constant a to produce a new data set {ax1, ax2, ax3, ..., axn}. Exactly how are the
    mean, variance, and standard deviation affected, and why? Don't forget that a (and
    for that matter, b above) can be negative! (Hint: Think of the dotplot.)


Ismor Fischer, 9/21/2014 2.5-2

5. During a certain winter in Madison, the variable X = "Temperature at noon" (°F) is
   measured every day over two consecutive weeks, as shown below.

            Sun   Mon   Tues   Wed   Thurs   Fri   Sat
   Week 1   +8    +8    +8     +5    +3      +3     0
   Week 2    0    −3    −3     −5    −8      −8    −8

   (a) Calculate the sample mean temperature x̄ and sample variance s² for Week 1.

   (b) Without performing any further calculations, determine the mean temperature x̄ and
       sample variance s² for Week 2. [Hint: Compare the Week 2 temperatures with those
       of Week 1, and use the result found in 4(b).] Confirm by explicitly calculating.

6. A little practice using R: First, type the command pop = 1:100 to generate a simulated
population of integers from 1 to 100, and view them (read the intro to R to see how).
(a) Next, type the command x.vals = sample(pop, 5, replace = T) to
generate a random sample of n = 5 values from this population, and view them.
Calculate, without R, their sample mean x̄, variance s², and standard deviation s.
Show all work!
(b) Use R to calculate the sample mean in two ways: first, via the sum command, then
via the mean command. Do the two answers agree with each other? Do they agree
with (a)? If so, label this value xbar. Include a copy of the R output in your work.
(c) Use R to calculate the sample variance in two ways: first, via the sum command, then
via the var command. Do the two answers agree with each other? Do they agree with
(a)? If so, label this value s.sqrd. Include a copy of the R output in your work.
(d) Use R to calculate the sample standard deviation in two ways: first, via the sqrt
command, then via the sd command. Do the two answers agree with each other? Do
they agree with (a)? Include a copy of the R output in your work.


7. (You may want to refer to the Rcode folder for this problem.) First pick n = 5 numbers at random,
   {x1, x2, x3, x4, x5}, and calculate their sample mean x̄ and standard deviation s_x.

   (a) Compute the deviations from the mean x_i − x̄ for i = 1, 2, 3, 4, 5, and confirm that
       their sum = 0.

   (b) Now divide each of these individual deviations by the standard deviation s_x. These new
       values {z1, z2, z3, z4, z5} are called standardized values, i.e., z_i = (x_i − x̄) / s_x, for
       i = 1, 2, 3, 4, 5. Calculate their mean z̄ and standard deviation s_z. Repeat several times.
       What do you notice?

   (c) Why are these results not surprising? (Hint: See problem 4.)

   (The idea behind this problem will be important in Chapter 4.)

Ismor Fischer, 9/21/2014 2.5-3

8. (a) The average score of a class of n1 = 20 students on an exam is x̄1 = 90.0, while the
       average score of another class of n2 = 30 students on the same exam is x̄2 = 80.0. If
       the two classes are pooled together, what is their combined average score on the exam?

   (b) Suppose two other classes, one with n1 = 24 students, the other with n2 = 44 students,
       have the same mean score, but with standard deviations s1 = 7.0 and s2 = 10.0,
       respectively. If these two classes are pooled together, what is their combined standard
       deviation on the exam? (Hint: Think about how sample standard deviation is calculated.)


9. (Hint: See page 2.3-11) A random sample of n = 90 people is grouped according to age in
the frequency table below:

Age Group Frequency
[0, 10) 9
[10, 25) 27
[25, 65] 54

(a) Calculate the group mean age and group standard deviation. Express in years and months.

(b) Construct a relative frequency histogram.

(c) Construct a density histogram.

(d) What percentage of the sample falls between 15 and 35 years old?

(e) Calculate the group quartile ages Q1, Q2, Q3. Express in terms of years and months.

(f) Calculate the range and the interquartile range. Express in terms of years and months.


10. For any x = 0, 1, 2, ..., n, consider a data set {y1, y2, y3, ..., yn} consisting entirely of x
    "ones" and (n − x) "zeros," in any order. For example, {1, 1, ..., 1, 0, 0, ..., 0}, with x ones
    followed by (n − x) zeros. Also denote the sample proportion of ones by p = x/n.

    (a) How many such possible data sets can there be?

    (b) Construct a relative frequency table for such a data set.

    (c) Show that the sample mean ȳ = p.

    (d) Show that the sample variance s²_y = [n/(n − 1)] p (1 − p).


Ismor Fischer, 9/21/2014 2.5-4

11.
(a) Consider the sample data {10, 10, 10, ..., 10, 60, 60, 60, ..., 60}, where half the values
    are 10 and half the values are 60. Complete the following relative frequency table
    for this sample, and calculate the sample mean x̄ and sample median.

         x_i     f(x_i)
         10
         60

(b) Suppose the original dataset is unknown, and only given in grouped form, with each
    of the two class intervals shown below containing half the values.

         Class Interval     Relative Frequency
         [0, 20)
         [20, 100)

     Complete this relative frequency table, and calculate the group sample mean x̄_group
      and group sample median. How do these compare with the values found in (a)?

     Sketch the relative frequency histogram.

     Sketch the density histogram.

     Label the group sample mean and median in each of the two histograms. In which
      histogram does the mean more accurately represent the "balance point" of the data,
      and why?


12. By the end of the semester, Merriman forgets the scores he received on the four quizzes
    (each worth 100 points) he took in a certain course. He only remembers that their average
    score was 80 points, standard deviation 10 points, and that 3 out of the 4 scores were the
    same. From this information, compute all four missing quiz scores. [Hint: Recall that the
    i-th deviation of a value x_i from the mean x̄ is defined as d_i = x_i − x̄, so that
    x_i = x̄ + d_i for i = 1, 2, 3, 4. Then use the given information.]

    Note: There are two possible solutions to this problem. Find them both.

Ismor Fischer, 9/21/2014 2.5-5

13. Linear Interpolation   (A generalization of the method used on page 2.3-6.)

    If software is unavailable for computations, this is an old technique to estimate values
    which are in-between tabulated entries. It is based on the idea that over a small interval,
    a continuous function can be approximated by a linear one, i.e., constant slope.

         Column A    Column B
           a1           b1
           x            y         (with v1 = x − a1 and w1 = y − b1 above this row,
           a2           b2         and v2 = a2 − x and w2 = b2 − y below it)

    Given two successive entries a1 and a2 in the first column of a table, with corresponding
    successive entries b1 and b2, respectively, in the second column. For a given x value
    between a1 and a2, we wish to approximate the corresponding y value between b1 and b2,
    or vice versa. Then, assuming equal proportions, we have

        (y − b1) / (x − a1) = (b2 − b1) / (a2 − a1).

    Show that this relation implies that y can be written as a weighted average of b1 and b2.
    In particular,

        y = (v1 b2 + v2 b1) / (v1 + v2),

    where the weights are given by the differences v1 = x − a1 and v2 = a2 − x. Similarly,

        x = (w1 a2 + w2 a1) / (w1 + w2),

    where the weights are given by the differences w1 = y − b1 and w2 = b2 − y.

3. Theory of Probability







3.1 Basic Definitions and Properties

3.2 Conditional Probability and Independent Events

3.3 Bayes Formula

3.4 Applications

3.5 Problems




Ismor Fischer, 5/29/2012 3.1-1












3. Probability Theory

3.1 Basic Ideas, Definitions, and Properties

POPULATION = Unlimited supply of five types of fruit, in equal proportions.
        O1 = Macintosh apple
        O2 = Golden Delicious apple
        O3 = Granny Smith apple
        O4 = Cavendish (supermarket) banana
        O5 = Plantain banana


Experiment 1: Randomly select one fruit from this population, and record its type.

Sample Space: The set S of all possible elementary outcomes of an experiment.

        S = {O1, O2, O3, O4, O5}          #(S) = 5

Event: Any subset of a sample space S. (Elementary outcomes = "simple events.")

        A = "Select an apple."   = {O1, O2, O3}     #(A) = 3
        B = "Select a banana."   = {O4, O5}         #(B) = 2


   Event    P(Event)
     A      3/5 = 0.6
     B      2/5 = 0.4
            5/5 = 1.0

P(A) = 0.6   "The probability of randomly selecting an apple is 0.6."
P(B) = 0.4   "The probability of randomly selecting a banana is 0.4."

[Figure: as the number of trials of the experiment increases (e.g., outcomes A, B, B, A, A, A, ...),
the running relative frequency #(Event)/#(trials) of event A fluctuates but settles toward 0.6,
and that of event B toward 0.4.]
Ismor Fischer, 5/29/2012 3.1-2

General formulation may be facilitated with the use of a Venn diagram:

[Venn diagram: an "Experiment" generates the Sample Space S = {O1, O2, ..., Ok}, #(S) = k,
drawn as a region containing the elementary outcomes O1, O2, ..., Ok; the event
A = {O1, O2, ..., Om} is a sub-region containing the first m of them.]

Event A = {O1, O2, ..., Om} ⊆ S,   #(A) = m ≤ k

Definition: The probability of event A, denoted P(A), is the "long-run" relative
frequency with which A is expected to occur, as the experiment is repeated indefinitely.

Fundamental Properties of Probability

For any event A = {O1, O2, ..., Om} in a sample space S,

1.  0 ≤ P(A) ≤ 1

2.  P(A) = P(O1) + P(O2) + P(O3) + ... + P(Om) = Σ_{i=1}^{m} P(Oi)

    Special Cases:

        P(∅) = 0
        P(S) = Σ_{i=1}^{k} P(Oi) = 1    ("certainty")

3.  If all the elementary outcomes of S are equally likely, i.e.,

        P(O1) = P(O2) = ... = P(Ok) = 1/k,

    then

        P(A) = #(A) / #(S) = m/k.

    Example: P(A) = 3/5 = 0.6,   P(B) = 2/5 = 0.4

Ismor Fischer, 5/29/2012 3.1-3
New Events from Old Events

Experiment 2: Select a card at random from a standard deck (and replace).

Sample Space: S = {all 52 cards}        #(S) = 52

Events:  A = "Select a 2."   = {2♣, 2♠, 2♥, 2♦}       #(A) = 4
         B = "Select a ♥."   = {A♥, 2♥, ..., K♥}       #(B) = 13

Probabilities: Since all elementary outcomes are equally likely, it follows that

        P(A) = #(A)/#(S) = 4/52     and     P(B) = #(B)/#(S) = 13/52.

(1) complement:   A^c = "not A" = {All outcomes that are in S, but not in A.}

        P(A^c) = 1 − P(A)

    Example: A^c = "Select either an A, 3, 4, ..., or K."   P(A^c) = 1 − 4/52 = 48/52.

    Example: Experiment = Toss a coin once.

        Events:  A = {Heads},   A^c = {Tails}

        Probabilities:
          Fair coin:    P(A) = 0.5    P(A^c) = 1 − 0.5 = 0.5
          Biased coin:  P(A) = 0.7    P(A^c) = 1 − 0.7 = 0.3

[Figure: the 52-card deck laid out by suit, with the four 2's circled as event A and the
thirteen ♥ cards circled as event B.]
Ismor Fischer, 5/29/2012 3.1-4



(2) intersection:  A ∩ B = "A and B" = {All outcomes in S that A and B share in common.}
                         = {All outcomes that result when events A and B occur simultaneously.}

    Example: A ∩ B = "Select a 2 and a ♥" = {2♥},   P(A ∩ B) = 1/52.

    Definition: Two events A and B are said to be disjoint, or mutually exclusive,
    if they cannot occur simultaneously, i.e., A ∩ B = ∅, hence P(A ∩ B) = 0.

    Example: A = "Select a 2" and C = "Select a 3" are disjoint events.

    Exercise: Are A = {2^4, 3^4, 4^4, 5^4, ...} and B = {2^6, 3^6, 4^6, 5^6, ...} disjoint?
    If not, find A ∩ B.

(3) union:  A ∪ B = "A or B" = {All outcomes in S that are either in A or B, inclusive.}

        P(A ∪ B) = P(A) + P(B) − P(A ∩ B),   where P(A ∩ B) = 0 if A and B are disjoint.

    Example: A ∪ B = "Select either a 2 or a ♥" has probability
             P(A ∪ B) = 4/52 + 13/52 − 1/52 = 16/52.

    Example: A ∪ C = "Select either a 2 or a 3" has probability
             P(A ∪ C) = 4/52 + 4/52 − 0 = 8/52.

[Venn diagram: overlapping events A and B inside the sample space S.]
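A minimal R sketch of the union rule for the card example, using simple set operations (the suit
codes C, S, H, D are just labels chosen for this sketch):

    suits <- c("C", "S", "H", "D")
    ranks <- c("A", 2:10, "J", "Q", "K")
    deck  <- paste0(rep(ranks, times = 4), rep(suits, each = 13))   # 52 cards
    A <- deck[grepl("^2", deck)]           # the four 2's
    B <- deck[grepl("H$", deck)]           # the thirteen cards of one suit
    length(union(A, B)) / length(deck)     # 16/52, matching P(A) + P(B) - P(A and B)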
Ismor Fischer, 5/29/2012 3.1-5
Note: Formula (3) extends to n ≥ 3 disjoint events in a straightforward manner:

(4)   P(A1 ∪ A2 ∪ ... ∪ An) = P(A1) + P(A2) + ... + P(An).

Question: How is this formula modified if the n events are not necessarily disjoint?

Example: Take n = 3 events. Then

        P(A ∪ B ∪ C) = P(A) + P(B) + P(C)
                       − P(A ∩ B) − P(A ∩ C) − P(B ∩ C)
                       + P(A ∩ B ∩ C).

[Venn diagram: three overlapping events A, B, C inside the sample space S.]

Exercise: For S = {January, ..., December}, verify this formula for the three events
A = "Has 31 days," B = "Name ends in r," and C = "Name begins with a vowel."

Exercise: A single tooth is to be randomly selected for a certain dental procedure. Draw a
Venn diagram to illustrate the relationships between the three following events:
A = "upper jaw," B = "left side," and C = "molar," and indicate all corresponding
probabilities. Calculate the probability that all of these three events, A and B and C,
occur. Calculate the probability that none of these three events occur. Calculate the
probability that exactly one of these three events occurs. Calculate the probability that
exactly two of these three events occur. (Think carefully.) Assume equal likelihood in
all cases.

[Figure: a dental chart of the upper and lower jaws, showing incisors, canines, premolars,
and molars on the left and right sides.]

The three set operations union, intersection, and complement can be unified via...

DeMorgan's Laws:      (A ∪ B)^c = A^c ∩ B^c          (A ∩ B)^c = A^c ∪ B^c

Exercise: Using a Venn diagram, convince yourself that these statements are true in
general. Then verify them for a specific example, e.g., A = "Pick a picture card" and
B = "Pick a black card."

Ismor Fischer, 5/29/2012 3.1-6

Slight Detour...

Suppose that out of the last n = 40 races, a certain racing horse won x = 25, and lost
the remaining n − x = 15. Based on these statistics, we can calculate the following
probability estimates for future races:

        P(Win)  ≈ x/n = 25/40 = 5/8 = 0.625 = p
        P(Lose) ≈ 1 − x/n = 15/40 = 3/8 = 0.375 = 1 − p = q

        Odds of winning = P(Win)/P(Lose) = (5/8)/(3/8) = 5/3,   i.e., "5 to 3."

        (Out of every 8 races, the horse wins 5 and loses 3, on average.)

Definition: For any event A, let P(A) = p, thus P(A^c) = q = 1 − p. The odds of event A
= p/q = p/(1 − p), i.e., the probability that A does occur, divided by the probability that it
does not occur. (In the preceding example, A = "Win" with probability p = 5/8.)

Note that if odds = 1, then A and A^c are equally likely to occur. If odds > 1 (likewise,
< 1), then the probability that A occurs is greater (likewise, less) than the probability
that it does not occur.

Example: Suppose the probability of contracting a certain disease in a particular
group of high risk individuals is P(D+) = 0.75, so that the probability of being
disease-free is P(D−) = 0.25. Then the odds of contracting the disease in this group is
equal to 0.75/0.25 = 3 (or "3 to 1").* Likewise, if in a reference group of low risk
individuals, the prevalence of the same disease is only P(D+) = 0.02, so that P(D−) =
0.98, then their odds = 0.02/0.98 = 1/49 (≈ 0.0204). As its name suggests, the
corresponding odds ratio between the two groups is defined as the ratio of their
respective odds, i.e., 3 / (1/49) = 147. That is, the odds of the high-risk group contracting
the disease are 147 times larger than the odds of the low-risk reference group. (Odds
ratios have nice properties, and are used extensively in epidemiological studies.)

* That is, within this group, the probability of disease is three times larger than the probability of no disease.
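A minimal R sketch of the odds and odds-ratio calculations above (odds() is a made-up helper name):

    odds <- function(p) p / (1 - p)
    odds(0.75)                 # high-risk group: 3
    odds(0.02)                 # low-risk group: about 0.0204 (= 1/49)
    odds(0.75) / odds(0.02)    # odds ratio = 147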

Ismor Fischer, 5/29/2012 3.2-1

3.2 Conditional Probability and Independent Events

Using population-based health studies to estimate probabilities relating potential risk
factors to a particular disease, evaluate efficacy of medical diagnostic and screening
tests, etc.

Example: Events:   A = "lung cancer"     B = "smoker"

                               Disease Status
                        Lung cancer (A)   No lung cancer (A^c)
   Smoker   Yes (B)          0.12               0.04              0.16
            No  (B^c)        0.03               0.81              0.84
                             0.15               0.85              1.00

Probabilities:   P(A) = 0.15     P(B) = 0.16     P(A ∩ B) = 0.12

Definition:  Conditional Probability of Event A, given Event B (where P(B) ≠ 0):

        P(A | B) = P(A ∩ B) / P(B)

Here, P(A | B) = 0.12 / 0.16 = 0.75  >>  0.15 = P(A).

Comments:

 P(B | A) = P(B ∩ A) / P(A) = 0.12 / 0.15 = 0.80, so P(A | B) ≠ P(B | A) in general.

 General formula can be rewritten:   P(A ∩ B) = P(A | B) P(B)        IMPORTANT

   Example:  P(Angel barks) = 0.1
             P(Brutus barks) = 0.2
             P(Angel barks | Brutus barks) = 0.3

             Therefore, P(Angel and Brutus bark) = (0.3)(0.2) = 0.06.

[Venn diagram: events A and B inside S, with P(A ∩ B) = 0.12, P(A^c ∩ B) = 0.04,
P(A ∩ B^c) = 0.03, and P(A^c ∩ B^c) = 0.81.]
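A minimal R check of the two conditional probabilities:

    p.AB <- 0.12; p.A <- 0.15; p.B <- 0.16
    p.AB / p.B     # P(A | B) = 0.75
    p.AB / p.A     # P(B | A) = 0.80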
Ismor Fischer, 5/29/2012 3.2-2

Example: Suppose that two balls are to be randomly drawn, one after another,
from a container holding four red balls and two green balls. Under the scenario
of sampling without replacement, calculate the probabilities of the events
A =First ball is red, B =Second ball is red, and A B =First ball is red
AND second ball is red. (As an exercise, list the 6 5 =30 outcomes in the
sample space of this experiment, and use brute force to solve this problem.)











This type of problem, known as an urn model, can be solved with the use of
a tree diagram, where each branch of the tree represents a specific event,
conditioned on a preceding event. The product of the probabilities of all such
events along a particular sequence of branches is equal to the corresponding
intersection probability, via the previous formula. In this example, we obtain the
following values:

[Urn: four red balls R1, R2, R3, R4 and two green balls G1, G2. Tree diagram, 1st draw → 2nd draw:

    P(A) = 4/6                      P(A^c) = 2/6
      P(B | A)   = 3/5                P(B | A^c)   = 4/5
      P(B^c | A) = 2/5                P(B^c | A^c) = 1/5

  giving the intersection probabilities

    P(A ∩ B)   = (4/6)(3/5) = 12/30        P(A^c ∩ B)   = (2/6)(4/5) = 8/30
    P(A ∩ B^c) = (4/6)(2/5) =  8/30        P(A^c ∩ B^c) = (2/6)(1/5) = 2/30  ]

We can calculate the probability P(B) by adding the two boxed values above, i.e.,

    P(B) = P(A ∩ B) + P(A^c ∩ B) = 12/30 + 8/30 = 20/30,  or  P(B) = 2/3.

This last formula, which can be written as  P(B) = P(B | A) P(A) + P(B | A^c) P(A^c),
can be extended to more general situations, where it is known as the Law of Total
Probability, and is a useful tool in Bayes' Theorem (next section).
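The tree-diagram results can also be confirmed by simulation in R (a sketch; the seed and
number of replications are arbitrary):

  set.seed(1)
  urn   <- c(rep("R", 4), rep("G", 2))                       # 4 red, 2 green
  draws <- replicate(100000, sample(urn, 2, replace = FALSE))

  A <- draws[1, ] == "R"      # first ball is red
  B <- draws[2, ] == "R"      # second ball is red

  mean(B)          # ≈ 2/3, the Law of Total Probability result
  mean(A & B)      # ≈ 12/30 = 2/5
  mean(B[A])       # ≈ P(B | A) = 3/5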
Suppose event C = "coffee drinker."

                                    Lung cancer (A)   No lung cancer (A^c)
    Coffee    Yes (C)                    0.06               0.34              0.40
    Drinker   No  (C^c)                  0.09               0.51              0.60
                                         0.15               0.85              1.00

Probabilities:   P(A) = 0.15     P(C) = 0.40     P(A ∩ C) = 0.06

Therefore,  P(A | C)  =  P(A ∩ C) / P(C)  =  0.06 / 0.40  =  0.15  =  P(A),

i.e., the occurrence of event C gives no information about the probability of event A.

[Venn diagram: A = lung cancer, C = coffee drinker, in S, with P(A ∩ C) = 0.06,
P(A ∩ C^c) = 0.09, P(A^c ∩ C) = 0.34, P(A^c ∩ C^c) = 0.51.]

Definition:  Two events A and B are said to be statistically independent if either:

    (1) P(A | B) = P(A), i.e., P(B | A) = P(B),

  or equivalently,

    (2) P(A ∩ B) = P(A) P(B).

Exercise: Prove that if events B and C are statistically independent, then so are
each of the following:  • B and "Not C"   • "Not B" and C   • "Not B" and "Not C".
Hint: Let P(B) = b, P(C) = c, and construct a 2 × 2 probability table.

Summary

  A, B disjoint       If either event occurs, then the other cannot occur:  P(A ∩ B) = 0.

  A, B independent    If either event occurs, this gives no information about the other:
                      P(A ∩ B) = P(A) P(B).

Example: A = "Select a 2" and B = "Select a card of a given suit" are not disjoint events, because
A ∩ B = {the 2 of that suit} ≠ ∅. However, P(A ∩ B) = 1/52 = 1/13 × 1/4 = P(A) P(B); hence
they are independent events. Can two disjoint events ever be independent? Why?
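Numerically, the independence check amounts to comparing each intersection probability with the
product of the marginals; in R (a sketch):

  isTRUE(all.equal(0.06, 0.15 * 0.40))   # coffee drinking & lung cancer: TRUE  -> independent
  isTRUE(all.equal(0.12, 0.15 * 0.16))   # smoking & lung cancer:         FALSE -> dependent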

A VERY IMPORTANT AND USEFUL FACT: It can be shown that for
any event A, all of the elementary properties of probability P(A) covered in
the notes extend to conditional probability P(A | B), for any other event B.
For example, since we know that

    P(A1 ∪ A2) = P(A1) + P(A2) − P(A1 ∩ A2)

for any two events A1 and A2, it is also true that

    P(A1 ∪ A2 | B) = P(A1 | B) + P(A2 | B) − P(A1 ∩ A2 | B)   for any other event B.

As another example, since we know that P(A^c) = 1 − P(A), it therefore also
follows that P(A^c | B) = 1 − P(A | B).

Exercise: Prove these two statements. (Hint: Sketch a Venn diagram.)

HOWEVER, there is one important exception! We know that if A and B are
two independent events, then P(A ∩ B) = P(A) P(B). But this does not
extend to conditional probabilities! In particular, if C is any other event, then
P(A ∩ B | C) ≠ P(A | C) P(B | C) in general. The following example illustrates
this, for three events A, B, and C:

[Venn diagram: three events A, B, C, with region probabilities
.10, .20, .05, .05, .05, .15, .20, and .20.]

Exercise:
Confirm that P(A ∩ B) = P(A) P(B), but P(A ∩ B | C) ≠ P(A | C) P(B | C).

In other words, two events that may be independent in a general population
may not necessarily be independent in a particular subgroup of that population.

More on Conditional Probability and Independent Events
Another example from epidemiology

[Two stylized Venn diagrams: left, S = population with A = lung cancer and B = obese;
right, S = population with A = lung cancer and C = smoker.]

Suppose that, in a certain study population, we wish to investigate the prevalence of lung cancer
(A), and its associations with obesity (B) and cigarette smoking (C), respectively. From the first
of the two stylized Venn diagrams above, by comparing the scales drawn, observe that the
proportion of the size of the intersection A ∩ B (green) relative to event B (blue + green), is about
equal to the proportion of the size of event A (yellow + green) relative to the entire population S.
That is,

    P(A ∩ B) / P(B)  =  P(A) / P(S).

(As an exercise, verify this equality for the following probabilities: yellow =.09, green =.07,
blue =.37, white =.47, to two decimals, before reading on.) In other words, the probability that a
randomly chosen person from the obese subpopulation has lung cancer, is equal to the probability
that a randomly chosen person from the general population has lung cancer (.16). This equation
can be equivalently expressed as
P(A | B) = P(A),

since the left side is conditional probability by definition, and P(S) =1 in the denominator of
the right side. In this form, the equation clearly conveys the interpretation that knowledge of
event B (obesity) yields no information about event A (lung cancer). In this example, lung cancer
is equally probable (.16) among the obese as it is among the general population, so knowing that
a person is obese is completely unrevealing with respect to having lung cancer. Events A and B
that are related in this way are said to be independent. Note that they are not disjoint!

In the second diagram however, the relative size of A ∩ C (orange) to C (red + orange) is larger
than the relative size of A (yellow + orange) to the whole population S, so P(A | C) ≠ P(A), i.e.,
events A and C are dependent. Here, as is true in general, the probability of lung cancer is
indeed influenced by whether a person is randomly selected from among the general population
or the smoking subset, where it is much higher. Statistically, lung cancer would be a rare disease
in the U.S., if not for cigarettes (although it is on the rise among nonsmokers).

Application: Are Blood Antibodies Independent?
An example of conditional probability in human genetics
(Adapted from Rick Chappell, Ph.D., UW Dept. of Biostatistics & Medical Informatics)

Background: The surfaces of human red blood cells (erythrocytes) are coated with antigens
that are classified into four disjoint blood types: O, A, B, and AB. Each type is associated
with blood serum antibodies for the other types, that is,

Type O blood contains both A and B antibodies.
(This makes Type O the universal donor, but capable of receiving only Type O.)
Type A blood contains only B antibodies.
Type B blood contains only A antibodies.
Type AB blood contains neither A nor B antibodies.
(This makes Type AB the universal recipient, but capable of donating only to Type AB.)

In addition, blood is also classified according to the presence (+) or absence (−) of Rh factor
(found predominantly in rhesus monkeys, and to varying degree in human populations; they
are important in obstetrics). Hence there are eight distinct blood groups corresponding to this
joint classification system: O+, O−, A+, A−, B+, B−, AB+, AB−. According to the American
Red Cross, the U.S. population has the following blood group relative frequencies:

                           Rh factor
    Blood Type         +         −       Totals
        O            .384      .077       .461
        A            .323      .065       .388
        B            .094      .017       .111
        AB           .032      .007       .039
      Totals         .833      .166       .999


From these values (and from the background information above), we can calculate the
following probabilities:

    P(A antibodies) = P(Type O or B)          P(B antibodies) = P(Type O or A)
                    = P(O) + P(B)                             = P(O) + P(A)
                    = .461 + .111                             = .461 + .388
                    = .572                                    = .849

    P(B antibodies and Rh+) = P(Type O+ or A+)
                            = P(O+) + P(A+)
                            = .384 + .323
                            = .707


Using these calculations, we can answer the following.

Question: Is having A antibodies independent of having B antibodies?

Solution: We must check whether or not

    P(A and B antibodies)  =  P(A antibodies) × P(B antibodies),
i.e.,
    P(Type O)  =?  (.572)(.849),
or
    .461  vs.  .486

This indicates near independence of the two events; there does exist a slight
dependence. The dependence would be much stronger if America were
composed of two disjoint (i.e., non-interbreeding) groups: Type A (with B
antibodies only) and Type B (with A antibodies only), and no Type O (with
both A and B antibodies). Since this is evidently not the case, the implication is
that either these traits evolved before humans spread out geographically, or they
evolved later but the populations became mixed in America.


Question: Is having B antibodies independent of Rh+?

Solution: We must check whether or not

    P(B antibodies and Rh+)  =  P(B antibodies) × P(Rh+),
that is,
    .707  =  (.849)(.833),

which is true, so we have exact independence of these events. These traits
probably predate diversification in humans (and were not differentially selected
for since).
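These antibody calculations are easy to reproduce in R from the Red Cross table (a sketch; the
object names are ours):

  blood <- matrix(c(.384, .077,
                    .323, .065,
                    .094, .017,
                    .032, .007),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(type = c("O", "A", "B", "AB"), Rh = c("+", "-")))

  P.Aab <- sum(blood[c("O", "B"), ])    # P(A antibodies) = .572
  P.Bab <- sum(blood[c("O", "A"), ])    # P(B antibodies) = .849
  P.Rh  <- sum(blood[, "+"])            # P(Rh+) = .833

  sum(blood["O", ]);  P.Aab * P.Bab            # .461 vs. .486  (close, but not equal)
  sum(blood[c("O", "A"), "+"]);  P.Bab * P.Rh  # .707 vs. .707  (independent)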


Exercises:

•  Is having A antibodies independent of Rh+?
•  Find P(A antibodies | B antibodies) and P(B antibodies | A antibodies). Conclusions?
•  Is "Blood Type" independent of "Rh factor"? (Do a separate calculation for
   each blood type: O, A, B, AB, and each Rh factor: +, −.)

3.3 Bayes Formula

Suppose that, for a certain population of individuals, we are interested in
comparing sleep disorders, in particular the occurrence of event A = "Apnea",
between M = "Males" and F = "Females."


Also assume that we know the following information:

    prior probabilities:        P(M) = 0.4,   P(F) = 0.6
    conditional probabilities:  P(A | M) = 0.8  (80% of males have apnea)
                                P(A | F) = 0.3  (30% of females have apnea)

Given here are the conditional probabilities of having apnea within each
respective gender, but these are not necessarily the probabilities of interest. We
actually wish to calculate the probability of each gender, given A. That is, the
posterior probabilities P(M | A) and P(F | A).

To do this, we first need to reconstruct P(A) itself from the given information.
[Venn diagram: sample space S = "Adults under 50," partitioned into M and F,
with event A = apnea overlapping both. Corresponding tree diagram, branching
first on gender and then on apnea status:

    P(A ∩ M)   = P(A | M) P(M)          P(A ∩ F)   = P(A | F) P(F)
    P(A^c ∩ M) = P(A^c | M) P(M)        P(A^c ∩ F) = P(A^c | F) P(F)

    P(A) = P(A | M) P(M) + P(A | F) P(F)  ]
[Venn diagram: S partitioned into M and F, with A = apnea;
P(A ∩ M) = 0.32, P(A^c ∩ M) = 0.08, P(A ∩ F) = 0.18, P(A^c ∩ F) = 0.42.]

So, given A, the posterior probabilities are

    P(M | A) = P(M ∩ A) / P(A) = P(A | M) P(M) / [P(A | M) P(M) + P(A | F) P(F)]
             = (0.8)(0.4) / [(0.8)(0.4) + (0.3)(0.6)] = 0.32 / 0.50 = 0.64

and

    P(F | A) = P(F ∩ A) / P(A) = P(A | F) P(F) / [P(A | M) P(M) + P(A | F) P(F)]
             = (0.3)(0.6) / [(0.8)(0.4) + (0.3)(0.6)] = 0.18 / 0.50 = 0.36


Thus, the additional information that a
randomly selected individual has apnea (an event with probability 50%; why?) increases
the likelihood of being male from a prior
probability of 40% to a posterior probability
of 64%, and likewise, decreases the likelihood
of being female from a prior probability of
60% to a posterior probability of 36%. That
is, knowledge of event A can alter a prior
probability P(B) to a posterior probability
P(B | A), of some other event B.


Exercise: Calculate and interpret the posterior probabilities P(M | A^c) and P(F | A^c)
as above, using the prior probabilities (and conditional probabilities) given.

More formally, consider any event A, and two complementary events B1 and B2
(e.g., M and F) in a sample space S. How do we express the posterior
probabilities P(B1 | A) and P(B2 | A) in terms of the conditional probabilities
P(A | B1) and P(A | B2), and the prior probabilities P(B1) and P(B2)?

Bayes' Formula for posterior probabilities P(Bi | A) in terms of prior probabilities P(Bi), i = 1, 2:

    P(Bi | A)  =  P(Bi ∩ A) / P(A)  =  P(A | Bi) P(Bi) / [ P(A | B1) P(B1) + P(A | B2) P(B2) ]
In general, consider an event A, and events B1, B2, …, Bn, disjoint and exhaustive.

[Venn diagram: sample space S partitioned into B1, B2, …, Bn, with event A intersecting each
piece: A ∩ B1, A ∩ B2, …, A ∩ Bn. Tree diagram: first branch on the prior probabilities
P(B1), P(B2), …, P(Bn); second branch on P(A | Bi) and P(A^c | Bi), yielding the intersection
probabilities P(A ∩ Bi) = P(A | Bi) P(Bi). Portrait: Reverend Thomas Bayes, 1702–1761.]

Law of Total Probability:

    P(A)  =  Σ_{j=1}^{n}  P(A | Bj) P(Bj)

Bayes' Formula (general version)

For i = 1, 2, …, n, the posterior probabilities are

    P(Bi | A)  =  P(Bi ∩ A) / P(A)  =  P(A | Bi) P(Bi)  /  Σ_{j=1}^{n} P(A | Bj) P(Bj).
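A compact way to carry out Bayes' Formula in R; the function name bayes_posterior is ours
(a sketch, not a library routine). The usage line reproduces the sleep-apnea example above.

  bayes_posterior <- function(prior, likelihood) {
    joint <- likelihood * prior    # P(A ∩ B_i) = P(A | B_i) P(B_i)
    joint / sum(joint)             # divide by P(A), via the Law of Total Probability
  }

  bayes_posterior(prior = c(M = 0.4, F = 0.6), likelihood = c(0.8, 0.3))
  #    M    F
  # 0.64 0.36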

3.4 Applications

Evidence-Based Medicine: Screening Tests and Disease Diagnosis

Clinical tests are frequently used in medicine and epidemiology to diagnose or screen
for the presence (T+) or absence (T−) of a particular condition, such as pregnancy or
disease. Definitive disease status (either D+ or D−) is often subsequently determined
by means of a "gold standard," such as data resulting from follow-up, invasive
radiographic or surgical procedures, or autopsy. Different measures of the test's
merit can then be estimated via various conditional probabilities. For instance, the
sensitivity or "true positive rate" of the test is defined as the probability that a
randomly selected individual has a positive test result, given that he/she actually has
the disease. Other terms are defined similarly; the following example, using a
random sample of n = 200 patients, shows how they are estimated from the data.

                                    Disease Status
                          Diseased (D+)      Nondiseased (D−)
  Test    Positive (T+)     16 (= TP)            9 (= FP)           25
  Result  Negative (T−)      4 (= FN)          171 (= TN)          175
                             20                  180               200

  True Positive rate  = P(T+ | D+)            False Positive rate = P(T+ | D−)
     Sensitivity = 16/20 = .80                   1 − specificity = 9/180 = .05

  False Negative rate = P(T− | D+)            True Negative rate  = P(T− | D−)
     1 − sensitivity = 4/20 = .20                Specificity = 171/180 = .95

[Venn diagram: sample partitioned by D+ / D− and T+ / T−, showing the four cells
T+ ∩ D+, T+ ∩ D−, T− ∩ D+, T− ∩ D−.]

In order to be able to apply this test to the general population, we need accurate
estimates of its predictive values of a positive and negative test, PV+ = P(D+ | T+)
and PV− = P(D− | T−), respectively. We can do this via the basic definition

    P(B | A)  =  P(B ∩ A) / P(A)

which, when applied to our context, becomes

    P(D+ | T+) = P(D+ ∩ T+) / P(T+)    and    P(D− | T−) = P(D− ∩ T−) / P(T−),

often written   PV+ = TP / (TP + FP)   and   PV− = TN / (FN + TN).

Here,  PV+ = 16/25 = 0.64  and  PV− = 171/175 = 0.977.

However, a more accurate determination is possible, with the use of

Bayes' Formula:   P(B | A)  =  P(A | B) P(B) / [ P(A | B) P(B) + P(A | B^c) P(B^c) ]

which, when applied to our context, becomes

    P(D+ | T+) = P(T+ | D+) P(D+) / [ P(T+ | D+) P(D+) + P(T+ | D−) P(D−) ],

i.e.,  PV+ = (Sensitivity)(Prevalence) / [ (Sensitivity)(Prevalence) + (False Positive rate)(1 − Prevalence) ]

and

    P(D− | T−) = P(T− | D−) P(D−) / [ P(T− | D−) P(D−) + P(T− | D+) P(D+) ],

i.e.,  PV− = (Specificity)(1 − Prevalence) / [ (Specificity)(1 − Prevalence) + (False Negative rate)(Prevalence) ].

All the ingredients are obtainable from the table calculations, except for the
baseline prevalence of the disease in the population, P(D+), which is usually
grossly overestimated by the corresponding sample-based value, in this case,
20/200 =.10. We must look to outside published sources and references for a
more accurate estimate of this figure.


Suppose that we are able to determine the prior probabilities:

P(D+) = .04  and therefore,  P(D−) = .96.

Then, substituting, we obtain the following posterior probabilities:

PV+ = (.80)(.04) / [ (.80)(.04) + (.05)(.96) ] = .40     and     PV− = (.95)(.96) / [ (.95)(.96) + (.20)(.04) ] = .99.

Therefore, a positive test result increases the probability of having this disease from
4% to 40%; a negative test result increases the probability of not having the disease
from 96% to 99%. Hence, this test is extremely specific for the disease (i.e., low
false positive rate), but is not very sensitive to its presence (i.e., high false negative
rate). A physician may wish to use a screening test with higher sensitivity (i.e., low
false negative rate). However, such tests also sometimes have low specificity (i.e.,
high false positive rate), e.g., MRI screening for breast cancer. An ideal test
generally has both high sensitivity and high specificity (e.g., mammography), but such
tests are often expensive. Typically, health insurance companies favor tests with three criteria:
cheap, fast, and easy, e.g., Fecal Occult Blood Test (FOBT) vs. colonoscopy.
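The two predictive-value formulas translate directly into R; the function below is an illustrative
sketch (its name is ours), checked against the numbers above.

  predictive_values <- function(sens, spec, prev) {
    ppv <- sens * prev / (sens * prev + (1 - spec) * (1 - prev))
    npv <- spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
    c(PV.pos = ppv, PV.neg = npv)
  }

  predictive_values(sens = 0.80, spec = 0.95, prev = 0.04)   # 0.40 and 0.99 (approximately)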

["FUITA Procedure" cartoon: a comparison of colorectal screening options by cost ("High cost,"
"Low cost," "No cost!"), the last "overwhelmingly preferred by most insurance companies."
Caption: Patient-obtained fecal smears are analyzed for presence of blood in stool, a possible
sign of colorectal cancer. High false positive rate (e.g., bleeding hemorrhoid).]

Evidence-Based Medicine: Receiver Operating Characteristic (ROC) Curves
Originally developed in the electronic communications field for displaying "Signal-to-Noise
Ratio" (SNR), these graphical objects are used when numerical cutoff values are used to
determine T+ versus T−.

Example: Using blood serum markers in a screening test (T) for detecting fetal
Downs syndrome (D) and other abnormalities, as maternal age changes.

Triple Test: Uses three maternal
serum markers (alpha-fetoprotein,
unconjugated oestriol, and human
gonadotrophin) to calculate a woman's
individual risk of having a Down
syndrome pregnancy.

[Figure: ROC curve for the Triple Test, with points marked at maternal ages 20, 25, 30, 35
(sensitive, but not specific) through age 40 (specific, but not sensitive), and an "optimal
cutoff" in between. The diagonal line, where "True + rate = False + rate" and
"True − rate = False − rate," corresponds to a nondiscriminatory test (AUC = 0.5);
the IDEAL TEST reaches the upper-left corner (AUC = 1).]

The True Positive rate (from 0 to 1) of the test is graphed against its False Positive
rate (from 0 to 1), for a range of age levels, and approximated by a curve contained
in the unit square. The farther this graph lies above the diagonal (i.e., the closer it
comes to the ideal level of 1), the better the test. This is often measured by the
Area Under Curve (AUC), which has a maximum value of 1, the total area of the
unit square. Often in practice, the curve is simply the corresponding polygonal
graph (as shown), and AUC can be numerically estimated by the Trapezoidal Rule.
(It can also be shown that this value corresponds to the probability that a random
pregnancy can be correctly classified as Down, using this screening test.) Illustrated
below are the ROC curves corresponding to three different Down syndrome
screening tests; although their relative superiorities are visually suggestive, formal
comparison is commonly performed by a modified version of the Wilcoxon Rank
Sum Test (covered later).

[Figure: ROC curves for three Down syndrome screening tests; the curve labels include
"Triple + dimeric inhibin A (DIA)."]
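The Trapezoidal Rule estimate of AUC is a one-line computation in R. The (FPR, TPR) points below
are made up purely for illustration; they are not taken from any of the screening tests shown.

  fpr <- c(0, 0.05, 0.15, 0.30, 0.60, 1)    # hypothetical false positive rates, in order
  tpr <- c(0, 0.40, 0.65, 0.80, 0.93, 1)    # hypothetical true positive rates

  auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)   # Trapezoidal Rule
  auc    # 0.5 = nondiscriminatory test, 1 = ideal test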


Further Applications: Relative Risk and Odds Ratios
Measuring degrees of association between disease (D) and exposure (E) to a
potential risk (or protective) factor, using a prospective cohort study:

From the resulting data, various probabilities can be estimated. Approximately,

                                    Disease Status
                          Diseased (D+)      Nondiseased (D−)
  Risk    Exposed (E+)        p11                  p12             p11 + p12
  Factor  Unexposed (E−)      p21                  p22             p21 + p22
                           p11 + p21            p12 + p22              1

P(D+ | E+) = P(D+ ∩ E+) / P(E+) = p11 / (p11 + p12)        P(D− | E+) = P(D− ∩ E+) / P(E+) = p12 / (p11 + p12)

P(D+ | E−) = P(D+ ∩ E−) / P(E−) = p21 / (p21 + p22)        P(D− | E−) = P(D− ∩ E−) / P(E−) = p22 / (p21 + p22)

Odds of disease, given exposure     =  P(D+ | E+) / P(D− | E+)  =  [p11 / (p11 + p12)] / [p12 / (p11 + p12)]  =  p11 / p12

Odds of disease, given no exposure  =  P(D+ | E−) / P(D− | E−)  =  [p21 / (p21 + p22)] / [p22 / (p21 + p22)]  =  p21 / p22

Odds Ratio:   OR  =  [P(D+ | E+) / P(D− | E+)]  ÷  [P(D+ | E−) / P(D− | E−)]
                  =  (p11 / p12) ÷ (p21 / p22)  =  (p11 p22) / (p12 p21),  the so-called "cross product ratio."

Comment: If OR = 1, then odds, given exposure = odds, given no exposure, i.e.,
no association exists between disease D and exposure E. What if OR > 1 or OR < 1?

Relative Risk:   RR  =  P(D+ | E+) / P(D+ | E−)  =  [p11 / (p11 + p12)] / [p21 / (p21 + p22)]
                     =  p11 (p21 + p22) / [p21 (p11 + p12)]

Comment: RR directly measures the effect of exposure on disease, but OR has
better statistical properties. However, if the disease is rare in the population, i.e.,
if p11 ≈ 0 and p21 ≈ 0, then  RR = p11 (p21 + p22) / [p21 (p11 + p12)]  ≈  (p11 p22) / (p12 p21)  =  OR.
[Timeline: in a prospective cohort study, subjects are classified at PRESENT as Exposed (E+)
or Unexposed (E−), and followed into the FUTURE to investigate the association with D+ and D−.]

Recall our earlier example of investigating associations between lung cancer and
the potential risk factors of smoking and coffee drinking. First consider the former:

                                      Lung Cancer
                          Diseased (D+)      Nondiseased (D−)
  Smoking  Exposed (E+)        .12                 .04              .16
           Not Exposed (E−)    .03                 .81              .84
                               .15                 .85             1.00

P(D+ | E+) = P(D+ ∩ E+) / P(E+) = .12 / .16 = 3/4 ;  therefore,  P(D− | E+) = .04 / .16 = 1/4.
  A random smoker has a 3 out of 4 (i.e., 75%) probability of having lung cancer;
  a random smoker has a 1 out of 4 (i.e., 25%) probability of not having lung cancer.

Therefore, the odds of the disease, given exposure,  =  P(D+ | E+) / P(D− | E+)  =  (3/4) / (1/4),  or  .12 / .04,  =  3.
  The probability that a random smoker has lung cancer is 3 times greater than the probability that
  he/she does not have it.

P(D+ | E−) = P(D+ ∩ E−) / P(E−) = .03 / .84 = 1/28 ;  therefore,  P(D− | E−) = .81 / .84 = 27/28.
  A random nonsmoker has a 1 out of 28 (i.e., 3.6%) probability of having lung cancer;
  a random nonsmoker has a 27 out of 28 (i.e., 96.4%) probability of not having lung cancer.

Therefore, the odds of the disease, given no exposure,  =  P(D+ | E−) / P(D− | E−)  =  (1/28) / (27/28),  or  .03 / .81,  =  1/27.
  The probability that a random nonsmoker has lung cancer is 1/27 (≈ .037) times the probability
  that he/she does not have it. Or equivalently,
  the probability that a random nonsmoker does not have lung cancer is 27 times greater than the
  probability that he/she does have it.

Odds Ratio:  OR = odds(D | E+) / odds(D | E−) = 3 / (1/27),  or the cross product ratio  (.12)(.81) / [(.04)(.03)],  =  81.
  The odds of having lung cancer among smokers are 81 times greater than
  the odds of having lung cancer among nonsmokers.

Relative Risk:  RR = P(D+ | E+) / P(D+ | E−) = (3/4) / (1/28),  or the cross product ratio  (.12)(.84) / [(.16)(.03)],  =  21.
  The probability of having lung cancer among smokers is 21 times greater than
  the probability of having lung cancer among nonsmokers.

The findings that OR >> 1 and RR >> 1 suggest a strong association between lung
cancer and smoking. (But how do we formally show that this is significant? Later…)
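Both measures can be computed from any 2 × 2 table of joint probabilities in R (a sketch; the
function name or_rr is ours), here checked against the smoking example:

  or_rr <- function(p) {            # rows: E+, E-;  columns: D+, D-
    OR <- (p[1, 1] * p[2, 2]) / (p[1, 2] * p[2, 1])             # cross product ratio
    RR <- (p[1, 1] / sum(p[1, ])) / (p[2, 1] / sum(p[2, ]))     # P(D+|E+) / P(D+|E-)
    c(OR = OR, RR = RR)
  }

  smoking <- matrix(c(.12, .04,
                      .03, .81), nrow = 2, byrow = TRUE)
  or_rr(smoking)      # OR = 81, RR = 21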

Now consider measures of association between lung cancer and caffeine consumption.

                                       Lung Cancer
                           Diseased (D+)      Nondiseased (D−)
  Caffeine  Exposed (E+)        .06                 .34              .40
            Not Exposed (E−)    .09                 .51              .60
                                .15                 .85             1.00

P(D+ | E+) = P(D+ ∩ E+) / P(E+) = .06 / .40 = .15 ;  therefore,  P(D− | E+) = .34 / .40 = .85.
  A random caffeine consumer has a 15% probability of having lung cancer;
  a random caffeine consumer has an 85% probability of not having lung cancer.

NOTE: P(D+ | E+) = .15 = P(D+), so D+ and E+ are independent events!

Therefore, the odds of the disease, given exposure,  =  P(D+ | E+) / P(D− | E+)  =  .15 / .85,  or  .06 / .34,  =  .176.
  The probability that a random caffeine consumer has lung cancer is .176 times the probability
  that he/she does not have it.

P(D+ | E−) = P(D+ ∩ E−) / P(E−) = .09 / .60 = .15 ;  therefore,  P(D− | E−) = .51 / .60 = .85.
  A random caffeine non-consumer has a 15% probability of having lung cancer;
  a random caffeine non-consumer has an 85% probability of not having lung cancer.

Therefore, the odds of the disease, given no exposure,  =  P(D+ | E−) / P(D− | E−)  =  .15 / .85,  or  .09 / .51,  =  .176.
  The probability that a random caffeine non-consumer has lung cancer is .176 times the probability
  that he/she does not have it.

Odds Ratio:  OR = odds(D | E+) / odds(D | E−) = .176 / .176,  or the cross product ratio  (.06)(.51) / [(.34)(.09)],  =  1.
  The odds of having lung cancer among caffeine consumers are equal to
  the odds of having lung cancer among caffeine non-consumers.

Relative Risk:  RR = P(D+ | E+) / P(D+ | E−) = .15 / .15,  or the cross product ratio  (.06)(.60) / [(.40)(.09)],  =  1.
  The probability of having lung cancer among caffeine consumers is equal to
  the probability of having lung cancer among caffeine non-consumers.

NOTE: The findings that OR = 1 and RR = 1 are to be expected, since D+ and E+ are independent!
Thus, no association exists between lung cancer and caffeine consumption. (In truth, there actually is a
spurious association, since many coffee drinkers also smoke, which commonly leads to lung cancer.
In this context, smoking is a variable that confounds the association between lung cancer and caffeine,
and should be adjusted for. For a well-known example of a study where this was not done carefully
enough, with substantial consequences, see MacMahon B., Yen S., Trichopoulos D., et al., "Coffee and
Cancer of the Pancreas," New England Journal of Medicine, March 12, 1981; 304: 630–33.)

Adjusting for Age (and other confounders)

Once again, consider the association between lung cancer and smoking in the earlier
example. A legitimate argument can be made that the reason for such a high
relative risk (RR = 21) is that age is a confounder that was not adequately taken
into account in the study. That is, there is a naturally higher risk of many cancers as
age increases, regardless of smoking status, so... How do you tease apart the effects
of age versus smoking on the disease? The answer is to adjust, or standardize,
for age. First, recall that relative risk RR = P(D+ | E+) / P(D+ | E−) by definition, i.e., we are
confining our attention only to individuals with disease (D+), and measuring the
effect of exposure (E+ vs. E−). Therefore, we can restrict our analysis to the two
cells in the first column of the previous 2 × 2 table. However, suppose now that the
probability estimates are stratified on age, as shown:

Disease D+, by exposure and age stratum:

  E+ (exposed):
    Age      ni+ = #(E+)    xi+ = #(D+ ∩ E+)    pi+ = P(D+ | E+) = xi+ / ni+
    50–59        250              5                  5/250 = .02
    60–69        150             15                 15/150 = .10
    70–79        100             40                 40/100 = .40
    Total     n+ = 500         x+ = 60             p+ = 60/500 = .12   (as before)

  E− (unexposed):
    Age      ni− = #(E−)    xi− = #(D+ ∩ E−)    pi− = P(D+ | E−) = xi− / ni−
    50–59        300              3                  3/300 = .01
    60–69        200              8                  8/200 = .04
    70–79        100              7                  7/100 = .07
    Total     n− = 600         x− = 18             p− = 18/600 = .03   (as before)



For each age stratum (i = 1, 2, 3),

  ni+ = # individuals in the study who were exposed (E+), regardless of disease status
  ni− = # individuals in the study who were not exposed (E−), regardless of disease status

  xi+ = # of exposed individuals (E+), with disease (D+)
  xi− = # of unexposed individuals (E−), with disease (D+)

Therefore,

  pi+ = xi+ / ni+ = proportion of exposed individuals (E+), with disease (D+)
  pi− = xi− / ni− = proportion of unexposed individuals (E−), with disease (D+)

From this information, we can imagine a combined table of age strata for D+:

    Age      ni = ni+ + ni−      pi+      pi−
    50–59         550            .02      .01
    60–69         350            .10      .04
    70–79         200            .40      .07
    Total       n = 1100

Now, to estimate the age-adjusted numerator P(D+ | E+) of RR, we calculate the
weighted average of the proportions pi+, using their corresponding combined
sample sizes ni as the weights. That is,

    P(D+ | E+) = Σ ni pi+ / Σ ni = [ (550)(.02) + (350)(.10) + (200)(.40) ] / (550 + 350 + 200)
               = 126 / 1100 ≈ 0.1145

and similarly, the age-adjusted denominator P(D+ | E−) of RR is estimated by the
weighted average of the proportions pi−, again using the same combined sample
sizes ni as the weights:

    P(D+ | E−) = Σ ni pi− / Σ ni = [ (550)(.01) + (350)(.04) + (200)(.07) ] / (550 + 350 + 200)
               = 33.5 / 1100 ≈ 0.0305

whereby we obtain

    RR_adj = P(D+ | E+) / P(D+ | E−) = 126 / 33.5 = 3.76.

Note that in this example, there is a substantial difference between the adjusted and
unadjusted risks. The same ideas extend to the age-adjusted odds ratio OR_adj.
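The age adjustment is just a pair of weighted averages, which R computes directly (a sketch,
using the stratum counts above):

  n.plus  <- c(250, 150, 100);  x.plus  <- c(5, 15, 40)    # exposed (E+) strata
  n.minus <- c(300, 200, 100);  x.minus <- c(3,  8,  7)    # unexposed (E-) strata

  w <- n.plus + n.minus                  # combined stratum sizes, used as weights
  RR.adj <- weighted.mean(x.plus / n.plus, w) / weighted.mean(x.minus / n.minus, w)
  RR.adj                                 # 126/33.5 = 3.76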
3.5 Solutions
1. Let events A = "Live to age 60", B = "Live to age 70", C = "Live to age 80"; note that event
C is a subset of B, and that B is a subset of A, i.e., they are nested: C ⊆ B ⊆ A. We are given
that P(A) = 0.90, P(B | A) = 0.80, and P(C | B) = 0.75. Therefore, by the general formula
P(E ∩ F) = P(E | F) P(F), we have

    P(B) = P(B ∩ A) = P(B | A) P(A) = (0.80)(0.90) = 0.72       (see Note below)

    P(C) = P(C ∩ B) = P(C | B) P(B) = (0.75)(0.72) = 0.54

    P(C | A) = P(C ∩ A) / P(A) = P(C) / P(A) = 0.54 / 0.90 = 0.60

Note: If event C occurs, then event B must have occurred. If event B occurs, then event A
must have occurred. Thus, the event A in the intersection of B and A is redundant, etc.


2. A = "Angel barks", B = "Brutus barks";  P(A) = 0.1, P(B) = 0.2, P(A | B) = 0.3  ⇒  P(A ∩ B) = 0.06

(a) Because P(A) = 0.1 is not equal to P(A | B) = 0.3, the events A and B are not independent!
    Or, equivalently, P(A ∩ B) = 0.06 is not equal to P(A) P(B) = (0.1)(0.2) = 0.02.

(b)
 •  P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = 0.1 + 0.2 − 0.06 = 0.24
 •  Via DeMorgan's Law:  P(A^c ∩ B^c) = 1 − P(A ∪ B) = 1 − 0.24 = 0.76
 •  P(A ∩ B^c) = P(A) − P(A ∩ B) = 0.1 − 0.06 = 0.04
 •  P(A^c ∩ B) = P(B) − P(A ∩ B) = 0.2 − 0.06 = 0.14
 •  P(A ∩ B^c) + P(A^c ∩ B) = 0.04 + 0.14 = 0.18,  or,  P(A ∪ B) − P(A ∩ B) = 0.24 − 0.06 = 0.18
 •  P(B | A) = P(B ∩ A) / P(A) = 0.06 / 0.1 = 0.6
 •  P(B^c | A) = P(B^c ∩ A) / P(A) = 0.04 / 0.1 = 0.4,  or more simply,  1 − P(B | A) = 1 − 0.6 = 0.4
 •  P(A | B^c) = P(A ∩ B^c) / P(B^c) = P(A ∩ B^c) / [1 − P(B)] = 0.04 / 0.8 = 0.05

2 × 2 probability table:

                 A        A^c
     B          0.06     0.14       0.20 = P(B)
     B^c        0.04     0.76       0.80 = P(B^c)
                0.10     0.90       1.00
              = P(A)   = P(A^c)

[Venn diagram: P(A ∩ B^c) = 0.04, P(A ∩ B) = 0.06, P(A^c ∩ B) = 0.14, P(A^c ∩ B^c) = 0.76.]

3. Urn Model: Events A = "First ball is red" and B = "Second ball is red." In the sampling
without replacement case illustrated, it was calculated that, reduced to lowest terms, P(A) = 4/6
= 2/3, P(B) = 2/3, and P(A ∩ B) = 12/30 = 2/5. Since P(A ∩ B) = 2/5 ≠ 4/9 = (2/3)(2/3) =
P(A) P(B), it follows that the two events A and B are not statistically independent. This
should be intuitively consistent; as this population is small, the probability that event A occurs
nontrivially affects that of event B, if the unit is not replaced after the first draw. However, in the
sampling with replacement scenario, this is not the case. For, as illustrated below, P(A) = 4/6 =
2/3, P(B) = 24/36 = 2/3, and P(A ∩ B) = 16/36 = 4/9. Hence, P(A ∩ B) = 4/9 = (2/3)(2/3) =
P(A) P(B), and so it follows that events A and B are indeed statistically independent.


4. First note that, in this case, A ⊆ B (event A is a subset of event B), that is, if A occurs, then B
occurs! (See Venn diagram.) In addition, the given information provides us with the following
conditional probabilities: P(A | B) = 0.75, P(B^c | A^c) = 0.80. Expanding these out via the usual
formulas, we obtain, respectively,

    0.75 = P(A | B) = P(A ∩ B) / P(B) = P(A) / P(B),   i.e.,   P(A) = 0.75 P(B)
and
    0.80 = P(B^c | A^c) = P(B^c ∩ A^c) / P(A^c) = P(B^c) / P(A^c) = [1 − P(B)] / [1 − P(A)],
           i.e.,   P(A) = 1.25 P(B) − 0.25

upon simplification. Since the left-hand sides of these two equations are identical, it follows that
the right-hand sides are equal, i.e., 1.25 P(B) − 0.25 = 0.75 P(B), and solving yields P(B) = 0.5.
Hence, there is a 50% probability that any students come to the office hour.

Plugging this value back into either one of these equations yields P(A) = 0.375. Hence, there is a
37.5% probability that any students arrive within the first fifteen minutes of the office hour.
[Tree diagram for Solution 3, sampling with replacement: P(A) = 4/6, P(A^c) = 2/6;
P(B | A) = P(B | A^c) = 4/6 and P(B^c | A) = P(B^c | A^c) = 2/6; hence
P(A ∩ B) = 16/36, P(A ∩ B^c) = 8/36, P(A^c ∩ B) = 8/36, P(A^c ∩ B^c) = 4/36,
and P(B) = 24/36.]
[Venn diagram for Solution 4: the student population, with A ⊆ B;
P(A) = 0.375, P(B ∩ A^c) = 0.125, P(B^c) = 0.5.]


5.
                             Cancer stage (A)
                           1       2       3       4
   Income    Low (1)      0.05    0.10    0.15    0.20     0.5
   Level     Middle (2)   0.03    0.06    0.09    0.12     0.3
   (B)       High (3)     0.02    0.04    0.06    0.08     0.2
                          0.1     0.2     0.3     0.4      1.0

(a) Recall that one definition of statistical independence of A and B is P(A ∩ B) = P(A) P(B).
In particular then, the first cell entry P(A = 1 ∩ B = 1) = P(A = 1) P(B = 1) =
(0.1)(0.5) = 0.05, i.e., the product of the 1st column marginal times the 1st row marginal.
In a similar fashion, the cell value in the intersection of the i-th row (i = 1, 2, 3) and j-th
column (j = 1, 2, 3, 4) is equal to the product of the i-th row marginal probability times the
j-th column marginal probability, which allows us to complete the entire table easily, as
shown. By definition, this property is only true for independent events (!!!), and is
fundamental to the derivation of the "expected value" formulas used in the Chi-squared
Test (sections 6.2.3 and 6.3.1). (See the R sketch following this solution.)

(b) By construction, we have

    P(Low | Stage 1) = 0.05/0.1 = 0.5,  P(Low | Stage 2) = 0.10/0.2 = 0.5,
    P(Low | Stage 3) = 0.15/0.3 = 0.5,  P(Low | Stage 4) = 0.20/0.4 = 0.5,   and P(Low) = 0.5

    P(Mid | Stage 1) = 0.03/0.1 = 0.3,  P(Mid | Stage 2) = 0.06/0.2 = 0.3,
    P(Mid | Stage 3) = 0.09/0.3 = 0.3,  P(Mid | Stage 4) = 0.12/0.4 = 0.3,   and P(Mid) = 0.3

    P(High | Stage 1) = 0.02/0.1 = 0.2, P(High | Stage 2) = 0.04/0.2 = 0.2,
    P(High | Stage 3) = 0.06/0.3 = 0.2, P(High | Stage 4) = 0.08/0.4 = 0.2,  and P(High) = 0.2

(c) Also,

    P(Stage 1 | Low) = 0.05/0.5 = 0.1,   P(Stage 2 | Low) = 0.10/0.5 = 0.2,
    P(Stage 3 | Low) = 0.15/0.5 = 0.3,   P(Stage 4 | Low) = 0.20/0.5 = 0.4

    P(Stage 1 | Mid) = 0.03/0.3 = 0.1,   P(Stage 2 | Mid) = 0.06/0.3 = 0.2,
    P(Stage 3 | Mid) = 0.09/0.3 = 0.3,   P(Stage 4 | Mid) = 0.12/0.3 = 0.4

    P(Stage 1 | High) = 0.02/0.2 = 0.1,  P(Stage 2 | High) = 0.04/0.2 = 0.2,
    P(Stage 3 | High) = 0.06/0.2 = 0.3,  P(Stage 4 | High) = 0.08/0.2 = 0.4

    and P(Stage 1) = 0.1, P(Stage 2) = 0.2, P(Stage 3) = 0.3, P(Stage 4) = 0.4, respectively.

(d) It was shown in the "Lung cancer" versus "Coffee drinker" example that these two
events are independent in the study population; the 2 × 2 table is reproduced below.

                          Lung Cancer
                         Yes      No
    Coffee     Yes       0.06    0.34     0.40
    Drinker    No        0.09    0.51     0.60
                         0.15    0.85     1.00

The probability in the first cell ("Yes" for both events), 0.06, is indeed equal to
(0.40)(0.15), the product of its row and column marginal sums (i.e., "Yes" for one
event, times "Yes" for the other event), and likewise for the probabilities in all the other
cells.

Note that this is not true of the 2 × 2 "Lung Cancer" versus "Smoking" table.
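A quick check of part (a) in R (a sketch): under independence, the whole table is the outer
product of the marginal probabilities.

  row.marg <- c(Low = 0.5, Middle = 0.3, High = 0.2)
  col.marg <- c(stage1 = 0.1, stage2 = 0.2, stage3 = 0.3, stage4 = 0.4)

  outer(row.marg, col.marg)    # reproduces the completed 3 x 4 table above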
[Venn diagram for Solution 6: regions A ∩ B = c, A only = a − c, B only = b − c,
neither = 1 − a − b + c; numerically 0.36, 0.04, 0.09, and 0.51, respectively.]

6. The given information can be written as conditional probabilities:

    P(A | B) = 0.8,   P(B | A) = 0.9,   P(B^c | A^c) = 0.85

We are asked to find the value of P(A^c | B^c). First, let P(A) = a, P(B) = b, and P(A ∩ B) = c.
Then all of the events in the Venn diagram can be labeled as shown. Using the definition of
conditional probability  P(E | F) = P(E ∩ F) / P(F),  we have

    c / b = 0.8,     c / a = 0.9,     (1 − a − b + c) / (1 − a) = 0.85.

Algebraically solving these three equations with three unknowns yields a = 0.40, b = 0.45,
c = 0.36, as shown. Therefore,

    P(A^c | B^c) = P(A^c ∩ B^c) / P(B^c) = 0.51 / 0.55 = 0.927.

7. Let events A, B, and C represent the occurrence of each symptom, respectively. The given
information can be written as:

    P(A) = P(B) = P(C) = 0.6
    P(A ∩ B | C) = 0.45, and similarly, P(A ∩ C | B) = 0.45, P(B ∩ C | A) = 0.45 as well.
    P(A | B ∩ C) = 0.75, and similarly, P(B | A ∩ C) = 0.75, P(C | A ∩ B) = 0.75 as well.

(a) We are asked to find P(A ∩ B ∩ C). It follows from the definition of conditional probability
    that P(A ∩ B ∩ C) = P(A ∩ B | C) P(C) which, via the first two statements above,
    = (0.45)(0.6) = 0.27. (The two other equations yield the same value.)

(b) Again, via conditional probability, we have P(A ∩ B ∩ C) = P(A | B ∩ C) P(B ∩ C) which, via
    the third statement above and part (a), can be written as 0.27 = 0.75 P(B ∩ C), so that
    P(B ∩ C) = 0.36. So P(A^c ∩ B ∩ C) = 0.36 − 0.27 = 0.09, and likewise for the others,
    P(A ∩ B^c ∩ C) and P(A ∩ B ∩ C^c). (See Venn diagram.) Hence,
    P(Two or three) = (3 × 0.09) + 0.27 = 0.54.

(c) From (b), P(Exactly two) = 3 × 0.09 = 0.27.

(d) From (a) and (c), it follows that P(A ∩ B^c ∩ C^c) = 0.6 − (0.27 + 0.09 + 0.09) = 0.15, and
    likewise for the others, P(A^c ∩ B ∩ C^c) and P(A^c ∩ B^c ∩ C). Hence 3 × 0.15 = 0.45.

(e) From (b), (c), and (d), we see that P(A ∪ B ∪ C) = 0.27 + 3(0.09) + 3(0.15) = 0.99, so that
    P(A^c ∩ B^c ∩ C^c) = 1 − 0.99 = 0.01. (See Venn diagram.)

(f) Working with A and B for example, we have P(A) = P(B) = 0.6 from the given, and
    P(A ∩ B) = 0.36 from part (b). Since it is true that 0.36 = 0.6 × 0.6, it does indeed follow
    that P(A ∩ B) = P(A) P(B), i.e., events A and B are statistically independent.

[Venn diagram: P(A ∩ B ∩ C) = 0.27; each of the three "exactly two" regions = 0.09;
each of the three "exactly one" regions = 0.15; outside all three = 0.01.]

[Venn diagram for Solution 8: A ∩ B ∩ C = .54, A ∩ B ∩ C^c = .07, A ∩ B^c ∩ C = .03,
A ∩ B^c ∩ C^c = .001, A^c ∩ B ∩ C = .06, A^c ∩ B ∩ C^c = .13, A^c ∩ B^c ∩ C = .12,
A^c ∩ B^c ∩ C^c = .049.]

8. With events A = "Accident," B = "Berkeley visited," and C = "Chelsea visited," the given
statements can be translated into mathematical notation as follows:

  i.   P(B ∩ C) = P(B) P(C)
  ii.  P(B) = .80
  iii. P(C) = .75

Therefore, substituting ii and iii into i yields P(B ∩ C) = (.8)(.75), i.e., P(B ∩ C) = .60.   (purple + gray)
Furthermore, it also follows from statistical independence that

    P(B only)          = P(B ∩ C^c)   = (.8)(1 − .75),      i.e., P(B ∩ C^c)   = .20     (blue + green)
    P(C only)          = P(B^c ∩ C)   = (1 − .8)(.75),      i.e., P(B^c ∩ C)   = .15     (red + orange)
    P(Neither B nor C) = P(B^c ∩ C^c) = (1 − .8)(1 − .75),  i.e., P(B^c ∩ C^c) = .05     (yellow + white)

  iv.  P(A | B ∩ C) = .90, which implies P(A ∩ B ∩ C) = P(A | B ∩ C) P(B ∩ C) = (.9)(.6),
       i.e., P(A ∩ B ∩ C) = .54, hence P(A^c ∩ B ∩ C) = .06.
  v.   P(A | B ∩ C^c) = .35, which implies P(A ∩ B ∩ C^c) = P(A | B ∩ C^c) P(B ∩ C^c) = (.35)(.2),
       i.e., P(A ∩ B ∩ C^c) = .07, hence P(A^c ∩ B ∩ C^c) = .13.
  vi.  P(A | B^c ∩ C) = .20, which implies P(A ∩ B^c ∩ C) = P(A | B^c ∩ C) P(B^c ∩ C) = (.2)(.15),
       i.e., P(A ∩ B^c ∩ C) = .03, hence P(A^c ∩ B^c ∩ C) = .12.
  vii. P(A | B^c ∩ C^c) = .02, which implies P(A ∩ B^c ∩ C^c) = P(A | B^c ∩ C^c) P(B^c ∩ C^c) = (.02)(.05),
       i.e., P(A ∩ B^c ∩ C^c) = .001, hence P(A^c ∩ B^c ∩ C^c) = .049.
[Venn diagram for Solution 9: P(A ∩ B) = .495, P(A only) = .33, P(B only) = .165, P(neither) = .01.]

9. The given information tells us the following.

  (i)   P(A ∪ B) = .99
  (ii)  P(B | A) = .60, which implies that P(B ∩ A) = .6 P(A)
  (iii) P(A | B) = .75, which implies that P(A ∩ B) = .75 P(B)

Because the left-hand sides of (ii) and (iii) are the same, it follows that .6 P(A) = .75 P(B), or

  (iv) P(B) = .8 P(A).

Now, substituting (ii) and (iv) into the general relation P(A ∪ B) = P(A) + P(B) − P(A ∩ B) gives

    .99 = P(A) + .8 P(A) − .6 P(A),

or .99 = 1.2 P(A), i.e., P(A) = .825. Thus, P(B) = .66 via (iv), and P(B ∩ A) = .495 via (ii). The
two events A and B are certainly not independent, which can be seen any one of three ways:

  •  P(A | B) = .75, from (iii), is not equal to P(A) = .825 just found;
  •  P(B | A) = .60, from (ii), is not equal to P(B) = .66 just found;
  •  P(A ∩ B) = .495 is not equal to P(A) P(B) = (.825)(.66) = .5445.

10. Switch! It is tempting to believe that it makes no difference, since once a zonk door has been
opened and supposedly ruled out, the probability of winning the car should then be equally
likely (i.e., 1/2) between each of the two doors remaining. However, it is important to
remember that the host does not eliminate one of the original three doors at random, but always
(i.e., with probability 1) a door other than the one chosen, and known (to him) to contain a
zonk. Rather than discarding it, this nonrandom choice conveys useful information, namely, if
indeed that had been the door originally chosen, then not switching would certainly have
resulted in losing. As exactly one of the other doors also contains a zonk, the same argument
can be applied to that door as well, whichever it is. Thus, as it would only succeed if the
winning door was chosen, the strategy of not switching would result in losing two out of
three times, on average.

This very surprising and counterintuitive result can be represented via the following table.
Suppose that, for the sake of argument, Door 1 contains the car, and Doors 2 and 3 contain
goats, as shown.




  If contestant chooses:    Door 1                Door 2              Door 3
  then host reveals:        Door 2 or Door 3      Door 3              Door 2
                            (at random)           (not at random)     (not at random)

  Switch?   Yes             LOSE                  WIN                 WIN        P(Win | Switch) = 2/3,  P(Lose | Switch) = 1/3
            No              WIN                   LOSE                LOSE       P(Win | Stay)   = 1/3,  P(Lose | Stay)   = 2/3


Much mathematical literature has been devoted to the Monty Hall Problem (which has a
colorful history) and its numerous variations. In addition, many computer programs exist on
the Internet (e.g., using Java applets), that numerically simulate the Monty Hall Problem, and in
so doing, verify that the values above are indeed correct. Despite this however, many people
(including more than a few professional mathematicians and statisticians) heatedly debate the
solution in favor of the powerfully intuitive, but incorrect, "switching doesn't matter" answer.
Strange but true...
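One such simulation, written in R as a sketch (the function name and number of trials are arbitrary):

  set.seed(1)
  play <- function(switch.door) {
    car    <- sample(1:3, 1)                    # door hiding the car
    choice <- sample(1:3, 1)                    # contestant's initial pick
    # host opens a door that is neither the contestant's choice nor the car
    open <- if (choice == car) sample(setdiff(1:3, car), 1) else setdiff(1:3, c(choice, car))
    if (switch.door) choice <- setdiff(1:3, c(choice, open))
    choice == car                               # TRUE if the contestant wins
  }

  mean(replicate(10000, play(switch.door = TRUE)))    # ≈ 2/3
  mean(replicate(10000, play(switch.door = FALSE)))   # ≈ 1/3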

[Venn diagram for Solution 11: A ∩ B ∩ C = 0.09; A ∩ B only = 0.10; A ∩ C only = 0.11;
B ∩ C only = 0.12; A only = 0.13; B only = 0.14; C only = 0.15; outside all three = 0.16.]

11. (a) We know that for any two events E and F, P(E ∪ F) = P(E) + P(F) − P(E ∩ F). Hence,

    P(A ∪ B) = P(A) + P(B) − P(A ∩ B),  i.e.,  0.69 = P(A) + P(B) − 0.19,  or  P(A) + P(B) = 0.88.
Likewise,
    P(A ∪ C) = P(A) + P(C) − P(A ∩ C),  i.e.,  0.70 = P(A) + P(C) − 0.20,  or  P(A) + P(C) = 0.90,
and
    P(B ∪ C) = P(B) + P(C) − P(B ∩ C),  i.e.,  0.71 = P(B) + P(C) − 0.21,  or  P(B) + P(C) = 0.92.

Solving these three simultaneous equations yields P(A) = 0.43, P(B) = 0.45, P(C) = 0.47.

(b) Events E and F are statistically independent if P(E ∩ F) = P(E) P(F). Hence,
    P(A ∩ C ∩ B) = P(A ∩ C) P(B) = (0.20)(0.45),  i.e.,  P(A ∩ B ∩ C) = 0.09,
from which the entire Venn diagram can be reconstructed from the triple intersection out,
using the information above.

(a) Sensitivity = P(T+ | D+) = P(T+ ∩ D+) / P(D+) = 302/481 = 0.628

    Specificity = P(T− | D−) = P(T− ∩ D−) / P(D−) = 372/452 = 0.823

(b) If the prior probability is P(D+) = 0.10, then via Bayes' Law, the posterior probability is

    P(D+ | T+) = P(T+ | D+) P(D+) / [ P(T+ | D+) P(D+) + P(T+ | D−) P(D−) ]
               = (0.628)(0.10) / [ (0.628)(0.10) + (0.177)(0.90) ] = 0.283

(c) P(D− | T−) = P(T− | D−) P(D−) / [ P(T− | D−) P(D−) + P(T− | D+) P(D+) ]
               = (0.823)(0.90) / [ (0.823)(0.90) + (0.372)(0.10) ] = 0.952

Comment: There are many potential reasons for low predictive value of a positive test,
despite high sensitivity (i.e., true positive rate). One possibility is very low prevalence of
the disease in the population (i.e., if P(D+) ≈ 0 in the numerator, then P(D+ | T+) will
consequently be small, in general), as in the previous two problems. Other possibilities
include health conditions other than the intended one that might also result in a positive
test, or that the test might be inaccurate in a large subgroup of the population for some
reason. Often, two or more different tests are combined (such as biopsy) in order to
obtain a more accurate diagnosis.


12. Odds Ratio and Relative Risk

                                     Disease Status
                          Diseased (D+)      Nondiseased (D−)
  Risk    Exposed (E+)        p11                  p12             p11 + p12
  Factor  Unexposed (E−)      p21                  p22             p21 + p22
                           p11 + p21            p12 + p22              1

In a cohort study design,

    OR = (odds of disease, given exposure) / (odds of disease, given no exposure)
       = [P(D+ | E+) / P(D− | E+)] ÷ [P(D+ | E−) / P(D− | E−)]
       = (p11 / p12) ÷ (p21 / p22)  =  (p11 p22) / (p12 p21).

In a case-control study design,

    OR = (odds of exposure, given disease) / (odds of exposure, given no disease)
       = [P(E+ | D+) / P(E− | D+)] ÷ [P(E+ | D−) / P(E− | D−)]
       = (p11 / p21) ÷ (p12 / p22)  =  (p11 p22) / (p21 p12).

Both of these quantities agree, so the odds ratio can be used in either type of longitudinal
study, although the interpretation must be adjusted accordingly. This is not true of the
relative risk, which is only defined for cohort studies. (However, it is possible to estimate it
using Bayes' Law, provided one has an accurate estimate of the disease prevalence.)

13. OR = (273)(7260) / [(2641)(716)] = 1.048. The odds of previous use of oral contraceptives,
given breast cancer, are 1.048 times the odds of previous use of oral contraceptives, given no
breast cancer. That is, the odds of previous use of oral contraceptives are approximately 5% greater
among breast cancer cases than cancer-free controls. (Note: Whether or not this odds ratio of
1.048 is significantly different from 1 is the subject of statistical inference and hypothesis
testing...)


14. OR = (31)(4475) / [(1594)(65)] = 1.339. The odds of breast cancer, given age at first birth
≥ 25 years old, are 1.339 times the odds of breast cancer, given age at first birth < 25 years old.
That is, the odds of breast cancer among women who first gave birth when they were 25 or older
are approximately 1/3 greater than those who first gave birth when they were under 25. (Again,
whether or not this odds ratio of 1.339 is significantly different from 1 is to be tested.)

RR = (31)(4540) / [(1625)(65)] = 1.332. The probability of breast cancer, given age at first birth
≥ 25 years old, is 1.332 times the probability of breast cancer, given age at first birth < 25 years
old. That is, the probability of breast cancer among women who first gave birth when they
were 25 or older is approximately 1/3 greater than those who first gave birth when they were
under 25.

15. Events:  A = "Aspirin use,"  B1 = "GI bleeding,"  B2 = "Primary stroke,"  B3 = "CVD"

    Prior probabilities:       P(B1) = 0.2,   P(B2) = 0.3,   P(B3) = 0.5
    Conditional probabilities: P(A | B1) = 0.09,   P(A | B2) = 0.04,   P(A | B3) = 0.02

(a) Therefore, the posterior probabilities are, respectively,

    P(B1 | A) = (0.09)(0.2) / [ (0.09)(0.2) + (0.04)(0.3) + (0.02)(0.5) ] = 0.018 / 0.040 = 0.45

    P(B2 | A) = (0.04)(0.3) / [ (0.09)(0.2) + (0.04)(0.3) + (0.02)(0.5) ] = 0.012 / 0.040 = 0.30

    P(B3 | A) = (0.02)(0.5) / [ (0.09)(0.2) + (0.04)(0.3) + (0.02)(0.5) ] = 0.010 / 0.040 = 0.25

(b) The probability of gastrointestinal bleeding (B1) increases from 20% to 45%, in the
event of aspirin use (A); similarly, the probability of primary stroke (B2) remains
constant at 30%, and the probability of cardiovascular disease (B3) decreases from 50%
to 25%, in the event of aspirin use. Therefore, although it occurs the least often among
the three given vascular conditions, gastrointestinal bleeding occurs in the highest
overall proportion among the patients who used aspirin in this study. Furthermore,
although it occurs the most often among the three conditions, cardiovascular disease
occurs in the lowest overall proportion among the patients who used aspirin in this
study, suggesting a protective effect. Lastly, as the prior probability P(B2) and
posterior probability P(B2 | A) are equal (0.30), the two corresponding events "Aspirin
use" and "Primary stroke" are statistically independent. Hence, the event that a
patient has a primary stroke conveys no information about aspirin use, and vice versa
(although aspirin does have a protective effect against secondary stroke). The
following Venn diagram shows the relations among these events, drawn approximately
to scale.

[Venn diagram: B1, B2, B3 partition the sample space; within A, the intersections with
B1, B2, B3 have probabilities 0.018, 0.012, 0.010, and the remaining portions of B1, B2, B3
have probabilities 0.182, 0.288, 0.490, respectively.]
16. Events:  S = "Five-year survival,"  S^c = "Death within five years";
             T = "Treatment,"  T^c = "No Treatment"

    Prior probability:               P(S) = 0.4,      P(S^c) = 1 − P(S) = 0.6
    Given conditional probabilities: P(T | S) = 0.8,  P(T^c | S) = 1 − P(T | S) = 0.2
                                     P(T | S^c) = 0.3,  P(T^c | S^c) = 1 − P(T | S^c) = 0.7

    Joint probabilities (rows T, T^c; columns S, S^c):
        P(T ∩ S)   = 0.32,   P(T ∩ S^c)   = 0.18
        P(T^c ∩ S) = 0.08,   P(T^c ∩ S^c) = 0.42

Posterior probabilities (via Bayes' Formula):

(a) P(S | T) = P(T | S) P(S) / [ P(T | S) P(S) + P(T | S^c) P(S^c) ]
             = (0.8)(0.4) / [ (0.8)(0.4) + (0.3)(0.6) ] = 0.32 / 0.50 = 0.64

    P(S | T^c) = P(T^c | S) P(S) / [ P(T^c | S) P(S) + P(T^c | S^c) P(S^c) ]
               = (0.2)(0.4) / [ (0.2)(0.4) + (0.7)(0.6) ] = 0.08 / 0.50 = 0.16

Given treatment (T), the probability of five-year survival (S) increases from a prior of 0.40
to a posterior of 0.64. Moreover, given no treatment (T^c), the probability of five-year
survival (S) decreases from a prior of 0.40 to a posterior of 0.16. Hence, in this
population, treatment is associated with a four-fold increase in the probability of five-year
survival. (This is the relative risk.) Note, however, that this alone may not be enough to
recommend treatment. Other factors, such as adverse side effects and quality of life issues,
are legitimate patient concerns to be decided individually.

(b) Odds of survival, given treatment    = P(S | T) / P(S^c | T)     = 0.64 / (1 − 0.64) = 1.778

    Odds of survival, given no treatment = P(S | T^c) / P(S^c | T^c) = 0.16 / (1 − 0.16) = 0.190

    Odds Ratio = 1.778 / 0.190 = 9.33

17. Let P(A) = a, P(B) = b, P(A ∩ B) = c, as shown. Then it follows that

  (1) 0 ≤ c ≤ a ≤ 1  and  0 ≤ c ≤ b ≤ 1,
as well as
  (2) 0 ≤ a + b − c ≤ 1.

Therefore, the quantity in question equals | c − ab | in this notation. It thus suffices to show
that  −1/4 ≤ ab − c ≤ +1/4.  From (2), we see that

    ab − c  ≤  a(1 − a + c) − c  =  a − a² − (1 − a)c  ≤  a − a²  ≤  1/4.

Clearly, this inequality is sharp when c = 0 and a = 1/2, i.e., when P(A ∩ B) = 0 (e.g., A and B
are disjoint) and P(A) = 1/2. Moreover, because its definition is symmetric in A and B, it must
also follow that P(B) = 1/2. (See first figure below.) Furthermore, from (1),

    ab − c  ≥  (c)(c) − c  =  c² − c  ≥  −1/4.

This inequality is sharp when a = b = c = 1/2, i.e., when P(A) = P(B) = P(A ∩ B) = 1/2,
which implies that A = B, both having probability 1/2. (See second figure.)

[Figures: a general Venn diagram with regions a − c, c, b − c; a first figure showing two
disjoint events A and B, each with probability 1/2; and a second figure showing A = B,
with probability 1/2.]
[For Solution 18, the two Venn diagrams and corresponding 2 × 2 probability tables:

 (a)  A only = .21,  A ∩ B = .14,  B only = .26,  neither = .39
                            Treatment A
                           Yes      No
      Treatment B  Yes     0.14    0.26     0.40
                   No      0.21    0.39     0.60
                           0.35    0.65     1.00

 (b)  A only = .35,  A ∩ B = .14,  B only = .40,  neither = .11
                            Treatment A
                           Yes      No
      Treatment B  Yes     0.14    0.40     0.54
                   No      0.35    0.11     0.46
                           0.49    0.51     1.00  ]

18.

(a) Given: P(A) = .35, P(B) = .40, P(A ∩ B) = .14.
Then P(A only) = P(A ∩ B^c) = .35 − .14 = .21,  P(B only) = P(A^c ∩ B) = .40 − .14 = .26,
and P(Neither) = P(A^c ∩ B^c) = 1 − (.21 + .14 + .26) = 1 − .61 = .39, as shown in the first
Venn diagram above. Since P(A ∩ B) = .14 and P(A) P(B) = (.35)(.40) = .14 as well, it
follows that the two treatments are indeed statistically independent in this population.

    P(A or B) = .61 (calculated above);    P(A xor B) = .21 + .26,  or  .61 − .14,  = .47

(b) Given: P(A only) = .35, P(B only) = .40, P(A ∩ B) = .14.
Then P(Neither) = P(A^c ∩ B^c) = 1 − (.35 + .14 + .40) = 1 − .89 = .11, as shown in the
second Venn diagram above. Since P(A ∩ B) = .14 and P(A) P(B) = (.49)(.54) = .2646, it
follows that the two treatments are not statistically independent in this population.

    P(A or B) = .89 (calculated above);    P(A xor B) = .35 + .40,  or  .89 − .14,  = .75

19. Let events A = "Adult," B = "Male," C = "White." We are told that

  (1) P(A ∩ B | C) = 0.3, i.e., P(A ∩ B ∩ C) / P(C) = 0.3, so that P(A ∩ B ∩ C) = 0.3 P(C),

  (2) P(A ∩ C | B) = 0.4, i.e., P(A ∩ B ∩ C) / P(B) = 0.4, so that P(A ∩ B ∩ C) = 0.4 P(B),

and finally,

  (3) P(A | B ∩ C) = 0.5, i.e., P(A ∩ B ∩ C) / P(B ∩ C) = 0.5, so that P(A ∩ B ∩ C) = 0.5 P(B ∩ C).

Since the left-hand sides of all three equations are the same, it follows that all the right-hand
sides are equal as well.

(a) Therefore, equating (1) and (3) yields
    0.5 P(B ∩ C) = 0.3 P(C),  i.e.,  P(B ∩ C) / P(C) = 0.3/0.5,  or by definition,  P(B | C) = 0.6, i.e., 60%,
and
(b) equating (2) and (3) yields
    0.5 P(B ∩ C) = 0.4 P(B),  i.e.,  P(B ∩ C) / P(B) = 0.4/0.5,  or by definition,  P(C | B) = 0.8, i.e., 80%.

20. Again, let events A = "Adult," B = "Male," C = "White." We are here told that

    P(B | A) = .1,   P(C | B) = .2,   P(A | C) = .3,
    P(A | B) = .4,   P(B | C) = .5,   P(C | A) = ?

However, it is true that

    P(A | B) P(B | C) P(C | A)  =  P(B | A) P(C | B) P(A | C)

because

    [P(A ∩ B)/P(B)] [P(B ∩ C)/P(C)] [P(C ∩ A)/P(A)]  =  [P(B ∩ A)/P(A)] [P(C ∩ B)/P(B)] [P(A ∩ C)/P(C)],

since the numerators of each side are simply rearrangements of one another, as likewise are the
denominators. Therefore,

    (.4)(.5) P(C | A)  =  (.1)(.2)(.3),

i.e., P(C | A) = .03, or 3%.

21. The Shell Game

(a) With 20 shells, the probability of winning exactly one game is 1/20, or .05; therefore, the
    probability of losing exactly one game is .95. Thus (reasonably assuming independence
    between game outcomes), the probability of losing all n games is equal to (.95)^n, from
    which it follows that the probability of not losing all n games, i.e., P(winning at least
    one game), is equal to 1 − (.95)^n.

    In order for this probability to be greater than .5, i.e., 1 − (.95)^n > .5, it must be true
    that (.95)^n < .5, or n > log(.5)/log(.95) = 13.51, so n ≥ 14 games. As n → ∞, it follows that
    (.95)^n → 0, so that P(win at least one game) = 1 − (.95)^n → 1 (certainty).

(b) Using the same logic as above with n shells, the probability of winning exactly one game
    is 1/n; therefore, the probability of losing exactly one game is 1 − 1/n. Thus (again, tacitly
    assuming independence between game outcomes), the probability of losing all n games is
    equal to (1 − 1/n)^n, from which it follows that the probability of not losing all n games,
    i.e., P(win at least one game), is equal to 1 − (1 − 1/n)^n, which approaches 1 − e^(−1) ≈ .632
    as n → ∞.
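A two-line check of these limits in R (a sketch):

  1 - 0.95^c(13, 14)            # part (a): 0.487, 0.512 -- first exceeds .5 at n = 14
  n <- c(20, 100, 10000)
  1 - (1 - 1/n)^n               # part (b): 0.642, 0.634, 0.632 -> approaches 1 - exp(-1)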

22. In progress

23. Recall that  RR = p/q  and  OR = [p/(1 − p)] / [q/(1 − q)] = (p/q) × [(1 − q)/(1 − p)],
with p = P(D+ | E+) and q = P(D+ | E−).

The case RR = 1 is trivial, for then p = q, hence OR = 1 as well; this corresponds to the case
of no association.

Suppose RR > 1. Then p > q, which implies (1 − q)/(1 − p) > 1, or
(p/q) < (p/q) × [(1 − q)/(1 − p)], i.e., RR < OR.
Thus we have 1 < RR < OR. For the case RR < 1, simply reverse all the inequalities.

24. Let event A = "perfect square" = {1², 2², 3², …, (10³)²}; then P(A) = 10³/10⁶ = 0.001.
Likewise, let B = "perfect cube" = {1³, 2³, 3³, …, (10²)³}; then P(B) = 10²/10⁶ = 0.0001.
Thus, A ∩ B = "perfect sixth power" = {1⁶, 2⁶, 3⁶, …, 10⁶}; hence P(A ∩ B) = 10/10⁶ = 0.00001.
Therefore, P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = 0.001 + 0.0001 − 0.00001 = 0.00109.



25. We construct a counterexample as follows. Let 0 < ε < 1 be a fixed but arbitrary number.
Also, with no loss of generality, assume the first toss results in 1 (i.e., Heads), as shown.
Now suppose that the next n₀ − 1 tosses all result in 0 (i.e., Tails), where n₀ > 1/ε. It then
follows that the proportion of Heads in the first n₀ tosses is 1/n₀ < ε, i.e., arbitrarily close to
0. Now suppose that the next n₁ − n₀ tosses all result in 1 (i.e., Heads), where n₁ > (n₀ + 1)/ε.
It then follows that the proportion of Heads in the first n₁ tosses is (1 + n₁ − n₀)/n₁ > 1 − ε, i.e.,
arbitrarily close to 1. By continuing to attach sufficiently large blocks of zeros and ones in
this manner (choosing each subsequent n₂, n₃, … large enough that the running proportion of
Heads drops below ε, then rises above 1 − ε, alternately), an infinite sequence is
generated that does not converge, but forever oscillates between values which come
arbitrarily close to 0 and 1, respectively.

  Tosses (cumulative):   1         n₀           n₁             n₂             n₃          …
  Block length:          1       n₀ − 1       n₁ − n₀        n₂ − n₁        n₃ − n₂
  Outcomes in block:     1       000…0        111…1          000…0          111…1        etc.

  # Heads, X:         X₀ = 1 (after n₀ tosses)   X₁ = 1 + n₁ − n₀   X₂ = 1 + n₁ − n₀   X₃ = 1 + n₁ − n₀ + n₃ − n₂

  Proportion X/n:     1/n₀ < ε      (1 + n₁ − n₀)/n₁ > 1 − ε      (1 + n₁ − n₀)/n₂ < ε      (1 + n₁ − n₀ + n₃ − n₂)/n₃ > 1 − ε

Exercise: Prove that the block endpoints must satisfy n_{k+1} / n_k > (1 − ε)/ε for k = 0, 1, 2, …
Hint: By construction, the proportions of Heads at n_k and at n_{k+1} lie on opposite sides of the
interval (ε, 1 − ε); compare the corresponding counts of Heads at the two stages.

26. In progress
27. Label the empty cells as shown.

        .01      x     .02
         y       ?      z       .50
        .03      w     .04
                .60              1

It then follows that:

  (1) .01 + x + .02 + y + ? + z + .03 + w + .04 = 1,  i.e.,  x + y + z + w + ? = .90
  (2) x + ? + w = .60
  (3) y + ? + z = .50

Adding equations (2) and (3) together yields x + y + z + w + 2? = 1.10. Subtracting
equation (1) from this yields ? = .20.
28. In progress
29. Careful calculation shows that P(A) = a, P(B) = b, P(C) = c, and P(A ∩ B) = ab, P(A ∩ C) = ac,
P(B ∩ C) = bc, so that the events are indeed pairwise independent. However, the triple intersection
P(A ∩ B ∩ C) = d, an arbitrary value. Thus P(A ∩ B ∩ C) ≠ P(A) P(B) P(C), unless d = abc. In that
case, the Venn diagram simplifies to the following unsurprising form.

[Venn diagram (d = abc): regions abc, ab(1 − c), a(1 − b)c, (1 − a)bc, a(1 − b)(1 − c),
(1 − a)b(1 − c), (1 − a)(1 − b)c, and (1 − a)(1 − b)(1 − c).]

30. Bar Bet
(a) Absolutely not! To see why, let us start with the simpler scenario of drawing four cards
from a fair deck, with replacement. In this case, all cards have an equal likelihood of
being selected (namely, 1/52). This being the case, and the fact that there are 12 face
cards in a standard deck, it follows that the probability of selecting a face card is 12/52,
and the outcome of any selection is statistically independent of any other selection. To
calculate the probability of at least one face card, we can subtract the probability of the
complement no face cards from 1. That is, 1 the probability of picking 4 non-face
cards: 1 (40/52)
4
= 0.65.
Now suppose we modify the scenario to selecting n = 4 cards, without replacement.
Unlike the above, the probability of selecting a face card now changes with every draw,
making the outcomes statistically dependent. Since the number of cards decreases by one
with each draw, the probability of picking all 4 non-face cards is no longer simply (40/52)⁴ = (40/52)(40/52)(40/52)(40/52), but rather (40/52)(39/51)(38/50)(37/49).

Therefore, the probability of picking ≥ 1 face card = 1 − (40/52)(39/51)(38/50)(37/49) = 0.6624. This means that I will win the bet approximately 2 out of 3 times! Counterintuitive perhaps, but true nonetheless.

(b) No, you should still not take the bet. Using the same logic with n = 3 draws, the probability of picking at least one face card = 1 − (40/52)(39/51)(38/50) = 0.5529. Thus, I still enjoy roughly a 5% advantage over "even money" (i.e., 50%). On average, I will win about 11 out of every 20 games played, and come out one dollar ahead.

(c) The R simulation should be consistent with the result found in part (a), namely, that the proportion of wins ≈ 0.6624, and therefore the proportion of losses ≈ 0.3376.
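The posted Rcode itself is not reproduced here, but a minimal simulation along the following lines (our own sketch, with an arbitrary seed) gives results consistent with part (a):

    # Estimate P(at least one face card in 4 cards drawn without replacement)
    set.seed(42)                                  # arbitrary seed, for reproducibility
    deck <- rep(c("face", "other"), c(12, 40))    # 12 face cards, 40 non-face cards
    wins <- replicate(100000, {
      hand <- sample(deck, 4)                     # draw 4 cards without replacement
      any(hand == "face")                         # TRUE if at least one face card
    })
    mean(wins)       # proportion of wins, close to 0.6624
    1 - mean(wins)   # proportion of losses, close to 0.3376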

Note: For those who remember combinatorics, another way to arrive at this value is the following: There are C(52, 4) ways of randomly selecting 4 cards from the deck of 52. Of this number, there are C(40, 4) ways of randomly selecting 4 non-face cards. The ratio of the two, C(40, 4) / C(52, 4), yields the same value as the four-factor product above.
Ismor Fischer, 9/21/2014 3.5-1
3.5 Problems

1. In a certain population of males, the following longevity probabilities are determined.

P(Live to age 60) = 0.90

P(Live to age 70, given live to age 60) = 0.80

P(Live to age 80, given live to age 70) = 0.75

From this information, calculate the following probabilities.

P(Live to age 70)

P(Live to age 80)

P(Live to age 80, given live to age 60)

2. Refer to the barking dogs problem in section 3.2.

(a) Are the events Angel barks and Brutus barks statistically independent?
(b) Calculate each of the following probabilities.

P(Angel barks OR Brutus barks)

P(NEITHER Angel barks NOR Brutus barks), i.e.,
P(Angel does not bark AND Brutus does not bark)

P(Only Angel barks) i.e.,
P(Angel barks AND Brutus does not bark)

P(Only Brutus barks) i.e.,
P(Angel does not bark AND Brutus barks)

P(Exactly one dog barks)

P(Brutus barks | Angel barks)

P(Brutus does not bark | Angel barks)

P(Angel barks | Brutus does not bark)

Also construct a Venn diagram, and a 2 × 2 probability table, including marginal sums.

3. Referring to the urn model in section 3.2, are the events A = First ball is red and
B = Second ball is red independent in this sampling without replacement scenario? Does this
agree with your intuition? Rework this problem in the sampling with replacement scenario.

4. After much teaching experience, Professor F has come up with a conjecture about office hours:
There is a 75% probability that a random student arrives to a scheduled office hour within the
first 15 minutes (event A), from among those students who come at all (event B). Furthermore,
there is an 80% probability that no students will come to the office hour, given that no students
arrive within the first 15 mins. Answer the following. (Note: Some algebra may be involved.)

(a) Calculate P(B), the probability that any students come to the office hour.

(b) Calculate P(A), the probability that any students arrive in the first 15 mins of the office hour.

(c) Sketch a Venn diagram, and label all probabilities in it.
Ismor Fischer, 9/21/2014 3.5-2

5. Suppose that, in a certain population of cancer patients having similar ages, lifestyles, etc., two
categorical variables I = Income (Low, Middle, High) and J = Disease stage (1, 2, 3, 4) have
probabilities corresponding to the column and row marginal sums in the 3 × 4 table shown.


                                Cancer stage
                          1       2       3       4
    Income     Low                                        0.5
    Level      Middle                                     0.3
               High                                       0.2
                         0.1     0.2     0.3     0.4      1.0

(a) Suppose I and J are statistically independent.


Complete all entries in the table.
(b) For each row i = 1, 2, 3, calculate the following conditional probabilities, across the columns
j = 1, 2, 3, 4:

P(Low Inc | Stage 1), P(Low Inc | Stage 2), P(Low Inc | Stage 3), P(Low Inc | Stage 4)
P(Mid Inc | Stage 1), P(Mid Inc | Stage 2), P(Mid Inc | Stage 3), P(Mid Inc | Stage 4)
P(High Inc | Stage 1), P(High Inc | Stage 2), P(High Inc | Stage 3), P(High Inc | Stage 4)

Confirm that, for j = 1, 2, 3, 4:

    P(Low Income | Stage j) are all equal to the unconditional row probability P(Low Income).
    P(Mid Income | Stage j) are all equal to the unconditional row probability P(Mid Income).
    P(High Income | Stage j) are all equal to the unconditional row probability P(High Income).

That is, P(Income i | Stage j) = P(Income i). Is this consistent with the information in (a)? Why?

(c) Now for each column j = 1, 2, 3, 4, compute the following conditional probabilities, down the
rows i = 1, 2, 3:

P(Stage 1 | Low Inc), P(Stage 2 | Low Inc), P(Stage 3 | Low Inc), P(Stage 4 | Low Inc),
P(Stage 1 | Mid Inc), P(Stage 2 | Mid Inc), P(Stage 3 | Mid Inc), P(Stage 4 | Mid Inc),
P(Stage 1 | High Inc), P(Stage 2 | High Inc), P(Stage 3 | High Inc), P(Stage 4 | High Inc).

Likewise confirm that, for i = 1, 2, 3:

    P(Stage 1 | Income i) are all equal to the unconditional column probability P(Stage 1).
    P(Stage 2 | Income i) are all equal to the unconditional column probability P(Stage 2).
    P(Stage 3 | Income i) are all equal to the unconditional column probability P(Stage 3).
    P(Stage 4 | Income i) are all equal to the unconditional column probability P(Stage 4).

That is, P(Stage j | Income i) = P(Stage j). Is this consistent with the information in (a)? Why?

Technically, we have only defined statistical independence for events, but it can be formally extended to general random
variables in a natural way. For categorical variables such as these, every category (viewed as an event) in I, is statistically
independent of every category (viewed as an event) in J, and vice versa.
Ismor Fischer, 9/21/2014 3.5-3

6. A certain medical syndrome is usually associated with two overlapping sets of symptoms, A and B.
Suppose it is known that:

If B occurs, then A occurs with probability 0.80 .
If A occurs, then B occurs with probability 0.90 .
If A does not occur, then B does not occur with probability 0.85 .

Find the probability that A does not occur if B does not occur. (Hint: Use a Venn diagram; some
algebra may also be involved.)

7. The progression of a certain disease is typically characterized by the onset of up to three distinct
symptoms, with the following properties:

Each symptom occurs with 60% probability.

If a single symptom occurs, there is a 45% probability that the two other symptoms will also occur.

If any two symptoms occur, there is a 75% probability that the remaining symptom will also occur.

Answer each of the following. (Hint: Use a Venn diagram.)

(a) What is the probability that all three symptoms will occur?

(b) What is the probability that at least two symptoms occur?

(c) What is the probability that exactly two symptoms occur?

(d) What is the probability that exactly one symptom occurs?

(e) What is the probability that none of the symptoms occurs?

(f) Is the event that a symptom occurs statistically independent of the event that any other
symptom occurs?

8. I have a nephew Berkeley and niece Chelsea (true) who, when very young, would occasionally
visit their Uncle Ismor on weekends (also true). Furthermore,

i. Berkeley and Chelsea visited independently of one another.
ii. Berkeley visited with probability 80%.
iii. Chelsea visited with probability 75%.
However, it often happened that some object in his house especially if it was fragile
accidentally broke during such visits (not true). Furthermore,

iv. The probability of such an accident occurring, given that both children visited, was 90%.
v. The probability of such an accident occurring, given that only Berkeley visited, was 35%.
vi. The probability of such an accident occurring, given that only Chelsea visited, was 20%.
vii. The probability of such an accident occurring, given that neither child visited, was 2%.
Sketch and label a Venn diagram for events A = Accident, B = Berkeley visited, and C = Chelsea
visited. (Hint: The Exercise on page 3.2-3 might be useful.)
Ismor Fischer, 9/21/2014 3.5-4

9. At a certain meteorological station, data are being collected about the behavior of
thunderstorms, using two lightning rods A and B. It is determined that, during a typical storm,
there is a 99% probability that lightning will strike at least one of the rods. Moreover, if A is
struck, there is a 60% probability that B will also be struck, whereas if B is struck, there is a 75%
probability that A will also be struck. Calculate the probability of each of the following events.
(Hint: See PowerPoint section 3.2, slide 28.)
Both rods A and B are struck by lightning

Rod A is struck by lightning

Rod B is struck by lightning
Are the two events A is struck and B is struck statistically independent? Explain.

10. The Monty Hall Problem (simplest version)

Between 1963 and 1976, a popular game show called
Lets Make A Deal aired on network television, starring
charismatic host Monty Hall, who would engage in deals
small games of chance with randomly chosen studio
audience members (usually dressed in outrageous costumes)
for cash and prizes. One of these games consisted of first
having a contestant pick one of three closed doors, behind one
of which was a big prize (such as a car), and behind the other
two were zonk prizes (often a goat, or some other farm
animal). Once a selection was made, Hall who knew what
was behind each door would open one of the other doors
that contained a zonk. At this point, Hall would then offer
the contestant a chance to switch their choice to the other
closed door, or stay with their original choice, before finally
revealing the contestants chosen prize.

Question: In order to avoid getting "zonked," should the optimal strategy for the contestant be to switch, stay, or does it not make a difference?




Ismor Fischer, 9/21/2014 3.5-5

11.
(a) Given the following information about three events A, B, and C.

    P(A ∪ B) = 0.69        P(A ∩ B) = 0.19
    P(A ∪ C) = 0.70        P(A ∩ C) = 0.20
    P(B ∪ C) = 0.71        P(B ∩ C) = 0.21

Find the values of P(A), P(B), and P(C).
(b) Suppose it is also known that the two events A C and B are statistically independent.
Sketch a Venn diagram for events A, B, and C.

12. Recall that in a prospective cohort study, exposure (E+ or E−) is given, so that the odds ratio is defined as

    OR = (odds of disease, given exposure) / (odds of disease, given no exposure)
       = [ P(D+ | E+) / P(D− | E+) ] ÷ [ P(D+ | E−) / P(D− | E−) ].

Recall that in a retrospective case-control study, disease status (D+ or D−) is given; in this case, the corresponding odds ratio is defined as

    OR = (odds of exposure, given disease) / (odds of exposure, given no disease)
       = [ P(E+ | D+) / P(E− | D+) ] ÷ [ P(E+ | D−) / P(E− | D−) ].

Show algebraically that these two definitions are mathematically equivalent, so that the same "cross product ratio" calculation can be used in either a cohort or case-control study, as the following two problems demonstrate. (Recall the definition of conditional probability.)



13. Under construction


Ismor Fischer, 9/21/2014 3.5-6

14. Under construction

.
15. An observational study investigates the connection between aspirin use and three vascular
conditions gastrointestinal bleeding, primary stroke, and cardiovascular disease using a group
of patients exhibiting these disjoint conditions with the following prior probabilities:
P(GI bleeding) = 0.2, P(Stroke) = 0.3, and P(CVD) = 0.5, as well as with the following
conditional probabilities: P(Aspirin | GI bleeding) = 0.09, P(Aspirin | Stroke) = 0.04, and
P(Aspirin | CVD) = 0.02.

(a) Calculate the following posterior probabilities: P(GI bleeding | Aspirin), P(Stroke | Aspirin),
and P(CVD | Aspirin).

(b) Interpret: Compare the prior probability of each category with its corresponding posterior
probability. What conclusions can you draw? Be as specific as possible.

16. On the basis of a retrospective study, it is determined (from hospital records, tumor registries, and
death certificates) that the overall five-year survival (event S) of a particular form of cancer in a
population has a prior probability of P(S) = 0.4. Furthermore, the conditional probability of
having received a certain treatment (event T) among the survivors is given by P(T | S) = 0.8, while
the conditional probability of treatment among the non-survivors is only P(T | Sᶜ) = 0.3.

(a) A cancer patient is uncertain about whether or not to undergo this treatment, and consults with
her oncologist, who is familiar with this study. Compare the prior probability of overall
survival given above with each of the following posterior probabilities, and interpret in context.

Survival among treated individuals, P(S | T)

Survival among untreated individuals, P(S | Tᶜ).

(b) Also calculate the following.

Odds of survival, given treatment

Odds of survival, given no treatment

Odds ratio of survival for this disease


17.

Recall that two events A and B are statistically independent if P(A ∩ B) = P(A) P(B). It therefore follows that the difference

    P(A ∩ B) − P(A) P(B)

is a measure of how far from statistical independence any two arbitrary events A and B are. Prove that | P(A ∩ B) − P(A) P(B) | ≤ 1/4. When is the inequality sharp? (That is, when is equality achieved?)

[Sidebar for Problem 16: a 5-year timeline runs from PAST to PRESENT. Given: Survivors (S) vs. Non-survivors (Sᶜ), with P(S) = 0.4; Treatment (T): P(T | S) = 0.8, P(T | Sᶜ) = 0.3.]

WARNING! This problem is not for the mathematically timid.
Ismor Fischer, 9/21/2014 3.5-7

18. First, recall that, for any two events A and B, the union A ∪ B defines the "inclusive or," i.e., "Either A occurs, or B occurs, or both."

Now, consider the event "Only A," i.e., "Event A occurs, and event B does not occur," defined as the intersection A ∩ Bᶜ, also denoted as the difference A − B. Likewise, "Only B" = "B and not A" = Aᶜ ∩ B = B − A. Using these, we can define "xor," the so-called "exclusive or," i.e., "Either A occurs, or B occurs, but not both," as the union (A − B) ∪ (B − A), or equivalently, (A ∪ B) − (A ∩ B). This is also sometimes referred to as the symmetric difference between A and B, denoted A Δ B. (See the two regions corresponding to the highlighted formulas in the Venn diagram below.)

(a) Suppose that two treatment regimens A and B exist for a certain medical condition. It is
reported that 35% of the total patient population receives Treatment A, 40% receives
Treatment B, and 14% receives both treatments. Construct the corresponding Venn diagram
and 2 × 2 probability table. Are the two treatments A and B statistically independent of
one another?

Calculate P(A or B), and P(A xor B).

(b) Suppose it is discovered that an error was made in the original medical report, and it is
actually the case that 35% of the population receives only Treatment A, 40% receives only
Treatment B, and 14% receives both treatments. Construct the corresponding Venn diagram
and 2 × 2 probability table. Are the two treatments A and B statistically independent of
one another?

Calculate P(A or B), and P(A xor B).

[Venn diagram: two overlapping circles A and B; the highlighted regions are A − B = A ∩ Bᶜ and B − A = Aᶜ ∩ B.]
Ismor Fischer, 9/21/2014 3.5-8

19. Three of the most common demographic variables used in epidemiological studies are age, sex,
and race. Suppose it is known that, in a certain population,
30% of whites are men, 40% of males are white men, 50% of white males are men.
(a) What percentage of whites are male? Formally justify your answer!
(b) What percentage of males are white? Formally justify your answer!
Hint: Follow the same notation as the example in section 3.2, slide 24, of the PowerPoint slides.

20. In another epidemiological study, it is known that, for a certain population,
10% of adults are men, 20% of males are white, 30% of whites are adults
40% of males are men, 50% of whites are male.
What percentage of adults are white?
Hint: Find a connection between the products P(A | B) P(B | C) P(C | A) and P(B | A) P(C | B) P(A | C).

21. The Shell Game. In the traditional version, a single pea is placed under one of three walnut
half-shells in full view of an observer. The shells are then quickly shuffled into a new random
arrangement, and the observer then guesses which shell contains the pea. If the guess is correct,
the observer wins.

(a) For the sake of argument, suppose there are 20 half-shells instead of three,
and the observer plays the game a total of n times. What is the probability
that he/she will guess correctly at least once out of those n times? How
large must n be, in order to guarantee that the probability of winning is over
50%? What happens to the probability as n ?

(b) Now suppose there are n half-shells, and the observer plays the game a total of n times.
What is the probability that he/she will guess correctly at least once out of those n times?
What happens to this probability as n ?

Hint (for both parts): First calculate the probability of losing all n times.

22. (a) By definition, two events A and B are statistically independent if and only if P(A | B) = P(A). Prove mathematically that two events A and B are independent if and only if P(A | B) = P(A | Bᶜ).
[Hint: Let P(A) = a, P(B) = b, P(A ∩ B) = c, and use either a Venn diagram or a 2 × 2 table.]

(b) More generally, let events A, B₁, B₂, …, Bₙ be defined as in Bayes' Theorem. Prove that:

    A and B₁ are independent, A and B₂ are independent, …, A and Bₙ are independent
    if and only if P(A | B₁) = P(A | B₂) = ⋯ = P(A | Bₙ).

[Hint: Use the Law of Total Probability.]
Ismor Fischer, 9/21/2014 3.5-9

23. Prove that the relative risk RR is always between 1 and the odds ratio OR. (Note there are three
possible cases to consider: RR < 1, RR = 1, and RR > 1.)

24. Consider the following experiment. Pick a random integer from 1 to one million (10⁶). What is the probability that it is either a perfect square (1, 4, 9, 16, …) or a perfect cube (1, 8, 27, 64, …)?

25. As defined at the beginning of this chapter, the probability of "Heads" of a coin is formally identified with

    lim (n → ∞)  X(n)/n     (when that limiting value exists),

where n = # tosses, and X = # Heads in those n tosses. Show by a mathematical counterexample that in fact, this limit need not necessarily exist. That is, provide an explicit sequence of Heads and Tails (or ones and zeros) for which the ratio X(n)/n does not converge to a unique finite value, as n increases.

26. Warning: These may not be quite as simple as they look.
(a) Consider two independent events A and B. Suppose A occurs with probability 60%, while
B only occurs with probability 30%. Calculate the probability that B occurs, i.e., P(B).

(b) Consider two independent events C and D. Suppose they both occur together with
probability 72%, while there is a 2% probability that neither event occurs. Calculate the
probabilities P(C) and P(D).

27. Solve for the middle cell probability (?) in the following partially-filled probability table.

        .01         .02
               ?            .50
        .03         .04
               .60

28. How far away can a prior probability be from its posterior probabilities?
Consider two events A and B, and let P(A | B) = p and P(A | Bᶜ) = q be fixed probabilities. If p = q, then A and B are statistically independent (see problem 22 above), and thus the prior probability P(B) coincides with its corresponding posterior probabilities P(B | A) and P(B | Aᶜ) exactly, yielding a minimum value of 0 for the absolute differences | P(B) − P(B | A) | and | P(B) − P(B | Aᶜ) |.

In terms of p and q (with p ≠ q), what must P(B) be for the maximum absolute differences to occur, and what are their respective values?


Ismor Fischer, 9/21/2014 3.5-10

29. Let A, B, and C be three pairwise-independent events, that is, A and B are independent, B and C
are independent, and A and C are independent. It does not necessarily follow that P(A ∩ B ∩ C) = P(A) P(B) P(C), as the following Venn diagram illustrates. Provide the details.



















30. Bar Bet
(a) Suppose I ask you to pick any four cards at random from a deck of 52, without replacement,
and bet you one dollar that at least one of the four is a face card (i.e., Jack, Queen, or King).
Should you take the bet? Why? (Hint: See how the probability of this event compares to 50%.
If this is too hard, try it with replacement first.)








(b) What if the bet involves picking three cards at random instead of four? Should you take the
bet then? Why?
(c) Refer to the posted Rcode folder for this part. Please answer all questions.


[Venn diagram for Problem 29: circles A, B, C with region probabilities a(1 − b − c) + d, b(1 − a − c) + d, c(1 − a − b) + d for "A only," "B only," "C only"; ab − d, ac − d, bc − d for the pairwise-only overlaps; d in the center; and 1 − a − b − c + ab + ac + bc − d outside.]
4. Classical Probability Distributions







4.1 Discrete Models

4.2 Continuous Models

4.3 Summary Chart

4.4 Problems




Ismor Fischer, 5/29/2012 4.1-1

4. Classical Probability Distributions

4.1 Discrete Models

FACT: Random variables can be used to define events that involve measurement!

Experiment 3a: Roll one fair die... Discrete random variable X =value obtained

Sample Space: S = {1, 2, 3, 4, 5, 6} #(S) =6

Because the die is fair, each of the six faces has an equally likely probability of
occurring, i.e., 1/6. The probability distribution for X can be defined by a so-called
probability mass function (pmf) f(x), organized in a probability table, and displayed
via a corresponding probability histogram, as shown.


















Comment on notation: The event "X = 4" has probability P(X = 4) = 1/6.

Translation: "The probability of rolling a 4 is 1/6."

Likewise for the other probabilities P(X =1), P(X =2),, P(X =6) in this example.
A mathematically succinct way to write such probabilities is by the notation P(X =x),
where x =1, 2, 3, 4, 5, 6. In general therefore, since this depends on the value of x,
we can also express it as a mathematical function of x (specifically, the pmf; see
above), written f(x). Thus the two notations are synonymous and interchangeable.
The previous example could just as well have been written f(4) =1/6.

Event Probability
x f(x) = P(X = x)
1 1/6
2 1/6
3 1/6
4 1/6
5 1/6
6 1/6
1
Uniform Distribution

[Probability histogram: six bars of equal height 1/6 over the values x = 1, 2, 3, 4, 5, 6; total area = 1.]
Ismor Fischer, 5/29/2012 4.1-2

Experiment 3b: Roll two distinct, fair dice. Outcome =(Die 1, Die 2)

Sample Space: S = {(1, 1), …, (6, 6)}     #(S) = 6² = 36


Discrete random variable X =Sum of the two dice (2, 3, 4, , 12).

Events: X =2 = {(1, 1)} #(X =2) = 1
X =3 = {(1, 2), (2, 1)} #(X =3) = 2
X =4 = {(1, 3), (2, 2), (3, 1)} #(X =4) = 3
X =5 = {(1, 4), (2, 3), (3, 2), (4, 1)} #(X =5) = 4
X =6 = {(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)} #(X =6) = 5
X =7 = {(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)} #(X =7) = 6
X =8 = {(2, 6), (3, 5), (4, 4), (5, 3), (6, 2)} #(X =8) = 5
X =9 = {(3, 6), (4, 5), (5, 4), (6, 3)} #(X =9) = 4
X =10 = {(4, 6), (5, 5), (6, 4)} #(X =10) = 3
X =11 = {(5, 6), (6, 5)} #(X =11) = 2
X =12 = {(6, 6)} #(X =12) = 1

Recall that, by definition, each event "X = x" (where x = 2, 3, 4, …, 12) corresponds to a specific subset of outcomes from the sample space (of ordered pairs, in this case). Because we are still assuming equal likelihood of each die face appearing, the probabilities of these events can be easily calculated by the shortcut formula P(A) = #(A) / #(S). Question for later: What if the dice are "loaded" (i.e., biased)?

(1, 1) (1, 2) (1, 3) (1, 4) (1, 5) (1, 6)
(2, 1) (2, 2) (2, 3) (2, 4) (2, 5) (2, 6)
(3, 1) (3, 2) (3, 3) (3, 4) (3, 5) (3, 6)
(4, 1) (4, 2) (4, 3) (4, 4) (4, 5) (4, 6)
(5, 1) (5, 2) (5, 3) (5, 4) (5, 5) (5, 6)
(6, 1) (6, 2) (6, 3) (6, 4) (6, 5) (6, 6)
Ismor Fischer, 5/29/2012 4.1-3
[Probability histogram for X = sum of two dice: bars of heights 1/36, 2/36, 3/36, 4/36, 5/36, 6/36, 5/36, 4/36, 3/36, 2/36, 1/36 over x = 2, 3, …, 12; total area = 1.]

Again, the probability distribution for X can be organized in a probability table,
and displayed via a probability histogram, both of which enable calculations to be
done easily:


x f(x) = P(X = x)
2 1/36
3 2/36
4 3/36
5 4/36
6 5/36
7 6/36
8 5/36
9 4/36
10 3/36
11 2/36
12 1/36
1

P(X = 7 or X = 11)                            Note that "X = 7" and "X = 11" are disjoint!
    = P(X = 7) + P(X = 11)                    via Formula (3) above
    = 6/36 + 2/36 = 8/36

P(5 ≤ X ≤ 8)
    = P(X = 5 or X = 6 or X = 7 or X = 8)
    = P(X = 5) + P(X = 6) + P(X = 7) + P(X = 8)
    = 4/36 + 5/36 + 6/36 + 5/36 = 20/36

P(X < 10) = 1 − P(X ≥ 10)                     via Formula (1) above
    = 1 − [ P(X = 10) + P(X = 11) + P(X = 12) ]
    = 1 − [ 3/36 + 2/36 + 1/36 ] = 1 − 6/36 = 30/36
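These calculations are easy to check in R by enumerating the 36 outcomes; the following is a small sketch of our own (not part of the notes):

    # Distribution of X = sum of two fair dice, by enumeration
    rolls <- expand.grid(die1 = 1:6, die2 = 1:6)   # the 36 equally likely ordered pairs
    X <- rolls$die1 + rolls$die2
    fx <- table(X) / 36                            # probability table f(x)
    fx
    sum(fx[c("7", "11")])                          # P(X = 7 or X = 11) = 8/36
    sum(fx[as.character(5:8)])                     # P(5 <= X <= 8) = 20/36
    1 - sum(fx[as.character(10:12)])               # P(X < 10) = 30/36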

Exercise: How could event E =Roll doubles be characterized in terms of a
random variable? (Hint: Let Y =Difference between the two dice.)

Ismor Fischer, 5/29/2012 4.1-4

The previous example motivates the important topic of...

Discrete Probability Distributions

In general, suppose that all of the distinct population values of a discrete random variable X are sorted in increasing order: x₁ < x₂ < x₃ < ⋯, with corresponding probabilities of occurrence f(x₁), f(x₂), f(x₃), …  Formally then, we have the following.

Definition: f(x) is a probability distribution function for the discrete random variable X if, for all x,

    f(x) ≥ 0   AND   Σ (over all x) f(x) = 1.

In this case, f(x) = P(X = x), the probability that the value x occurs in the population.

The cumulative distribution function (cdf) is defined as, for all x,

    F(x) = P(X ≤ x) = Σ (over all xᵢ ≤ x) f(xᵢ) = f(x₁) + f(x₂) + ⋯ + f(x).

Therefore, F is piecewise constant, increasing from 0 to 1.

Furthermore, for any two population values a < b, it follows that

    P(a ≤ X ≤ b) = Σ (x = a to b) f(x) = F(b) − F(a⁻),

where a⁻ is the value just preceding a in the sorted population.

Exercise: Sketch the cdf F(x) for Experiments 3a and 3b above.

[Figures: the probability histogram of f(x), with bars of heights f(x₁), f(x₂), f(x₃), … at the values x₁ < x₂ < x₃ < ⋯ and total area = 1, and the corresponding cdf F(x), a step function increasing from 0 to 1 with values F(x₁), F(x₂), F(x₃), … at those points.]
Ismor Fischer, 5/29/2012 4.1-5

Population Parameters μ and σ²  (vs. Sample Statistics x̄ and s²)

population mean μ = the expected value of the random variable X
                  = the "arithmetic average" of all the population values

    If X is a discrete numerical random variable, then
        μ = E[X] = Σ x f(x),   where f(x) = P(X = x), the probability of x.

Compare this with the "relative frequency" definition of sample mean given in §2.3.

population variance σ² = the expected value of the squared deviation of the random variable X from its mean μ

    If X is a discrete numerical random variable, then
        σ² = E[(X − μ)²] = Σ (x − μ)² f(x).
    Equivalently,*
        σ² = E[X²] − μ² = Σ x² f(x) − μ²,
    where f(x) = P(X = x), the probability of x.

Compare the first with the definition of sample variance given in §2.3. (The second is the analogue of the "alternate computational formula.") Of course, the population standard deviation σ is defined as the square root of the variance.

*Exercise: Algebraically expand the expression (X − μ)², and use the properties of expectation given below.

Properties of Mathematical Expectation

1. For any constant c, it follows that E[cX] = c E[X].

2. For any two random variables X and Y, it follows that
       E[X + Y] = E[X] + E[Y]   and, via Property 1,   E[X − Y] = E[X] − E[Y].

Any operator on variables satisfying 1 and 2 is said to be linear.
Ismor Fischer, 5/29/2012 4.1-6
Experiment 4: Two populations, where the daily number of calories consumed is designated by X₁ and X₂, respectively.

Population 1 Probability Table

      x       f₁(x)
    2300       0.1
    2400       0.2
    2500       0.3
    2600       0.4

Mean(X₁) = μ₁ = (2300)(0.1) + (2400)(0.2) + (2500)(0.3) + (2600)(0.4) = 2500 cals
Var(X₁) = σ₁² = (−200)²(0.1) + (−100)²(0.2) + (0)²(0.3) + (+100)²(0.4) = 10000 cals²

Population 2 Probability Table

      x       f₂(x)
    2200       0.2
    2300       0.3
    2400       0.5

Mean(X₂) = μ₂ = (2200)(0.2) + (2300)(0.3) + (2400)(0.5) = 2330 cals
Var(X₂) = σ₂² = (−130)²(0.2) + (−30)²(0.3) + (70)²(0.5) = 6100 cals²

[Probability histograms: Population 1 has bars 10%, 20%, 30%, 40% over 2300, 2400, 2500, 2600; Population 2 has bars 20%, 30%, 50% over 2200, 2300, 2400.]
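These weighted sums are easy to verify in R; the following is a small sketch of our own (vector names are arbitrary):

    # Check the Population 1 and Population 2 parameters above
    x1 <- c(2300, 2400, 2500, 2600); f1 <- c(0.1, 0.2, 0.3, 0.4)
    x2 <- c(2200, 2300, 2400);       f2 <- c(0.2, 0.3, 0.5)
    mu1 <- sum(x1 * f1); mu1                 # 2500
    sum((x1 - mu1)^2 * f1)                   # 10000
    mu2 <- sum(x2 * f2); mu2                 # 2330
    sum((x2 - mu2)^2 * f2)                   # 6100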
Ismor Fischer, 5/29/2012 4.1-7
Summary (Also refer back to §2.4 - Summary)

POPULATION: Discrete random variable X

  Probability Table (and corresponding Probability Histogram):

      x      f(x) = P(X = x)
      x₁     f(x₁)
      x₂     f(x₂)
      ⋮       ⋮
             [column total = 1]

  Parameters:
      μ = E[X] = Σ x f(x)
      σ² = E[(X − μ)²] = Σ (x − μ)² f(x),   or   σ² = E[X²] − μ² = Σ x² f(x) − μ²

SAMPLE, size n

  Relative Frequency Table (and corresponding Density Histogram):

      x      f(x) = freq(x)/n
      x₁     f(x₁)
      x₂     f(x₂)
      ⋮       ⋮
      xₖ     f(xₖ)
             [column total = 1]

  Statistics:
      x̄ = Σ x f(x)
      s² = [n/(n − 1)] Σ (x − x̄)² f(x),   or   s² = [n/(n − 1)] [ Σ x² f(x) − x̄² ]

Note: X̄ and S² can be shown to be unbiased estimators of μ and σ², respectively. That is, E[X̄] = μ and E[S²] = σ². (In fact, they are MVUE.)
Ismor Fischer, 5/29/2012 4.1-8

~ Some Advanced Notes on General Parameter Estimation ~

Suppose that θ is a fixed population parameter (e.g., μ), and θ̂ is a sample-based estimator (e.g., X̄). Consider all the random samples of a given size n, and the resulting sampling distribution of θ̂ values. Formally define the following:

  Mean (of θ̂) = E[θ̂], the expected value of θ̂.

  Bias = E[θ̂] − θ, the difference between the expected value of θ̂ and the target parameter θ.

  Variance (of θ̂) = E[ (θ̂ − E[θ̂])² ], the expected value of the squared deviation of θ̂ from its mean E[θ̂]; or equivalently,* E[θ̂²] − (E[θ̂])².

  Mean Squared Error (MSE) = E[ (θ̂ − θ)² ], the expected value of the squared difference between the estimator θ̂ and the target parameter θ.

Exercise: Prove* that MSE = Variance + Bias².

Comment: A parameter estimator θ̂ is defined to be unbiased if E[θ̂] = θ, i.e., Bias = 0. In this case, MSE = Variance, so that if θ̂ minimizes MSE, it then follows that it has the smallest variance of any estimator. Such a highly desirable estimator is called MVUE (Minimum Variance Unbiased Estimator). It can be shown that the estimators X̄ and S² (of μ and σ², respectively) are MVUE, but finding such an estimator θ̂ for a general parameter θ can be quite difficult in practice. Often, one must settle for either not having minimum variance or having a small amount of bias.

* using the basic properties of mathematical expectation given earlier

[Sidebar figure: POPULATION parameter θ vs. SAMPLE statistic θ̂, with a vector interpretation: writing a = E[θ̂] − θ, b = θ̂ − E[θ̂], and c = θ̂ − θ, we have c = a + b, drawn as a right triangle with legs a and b and hypotenuse c, so that E[c²] = E[a²] + E[b²], i.e., MSE = Bias² + Variance.]
Ismor Fischer, 5/29/2012 4.1-9

Related (but not identical) to this is the idea that, of all linear combinations c₁x₁ + c₂x₂ + ⋯ + cₙxₙ of the data {x₁, x₂, …, xₙ} (such as X̄, with c₁ = c₂ = ⋯ = cₙ = 1/n) which are also unbiased, the one that minimizes MSE is called BLUE (Best Linear Unbiased Estimator). It can be shown that, in addition to being MVUE (as stated above), X̄ is also BLUE. To summarize,

    MVUE gives:  Min Variance among all unbiased estimators
                 ⟹ Min Variance among linear unbiased estimators
                 = Min MSE among linear unbiased estimators (since MSE = Var + Bias²),
                 given by BLUE (by def).

The Venn diagram below depicts these various relationships.

[Venn diagram: overlapping sets "Unbiased," "Linear," "Minimum MSE," and "Minimum Variance." X̄ lies in the region of minimum variance among linear unbiased estimators (BLUE); X̄ and S² both lie in the region of minimum variance among all unbiased estimators (MVUE).]

Comment: If MSE → 0 as n → ∞, then θ̂ is said to have mean square convergence to θ. This in turn implies "convergence in probability" (via Markov's Inequality, also used in proving Chebyshev's Inequality), i.e., θ̂ is a consistent estimator of θ.
Ismor Fischer, 5/29/2012 4.1-10
Experiment 4 - revisited: Recall the previous example, where X₁ and X₂ represent the daily number of calories consumed in two populations, respectively.

Population 1                                       Population 2
      x       f₁(x)                                      x       f₂(x)
    2300       0.1                                     2200       0.2
    2400       0.2                                     2300       0.3
    2500       0.3                                     2400       0.5
    2600       0.4
Mean(X₁) = μ₁ = 2500 cals                          Mean(X₂) = μ₂ = 2330 cals
Var(X₁) = σ₁² = 10000 cals²                        Var(X₂) = σ₂² = 6100 cals²

Case 1: First suppose that X₁ and X₂ are statistically independent, as shown in the joint probability distribution given in the table below. That is, each cell probability is equal to the product of the corresponding row and column marginal probabilities. For example, P(X₁ = 2300 ∩ X₂ = 2200) = .02, but this is equal to the product of the column marginal P(X₁ = 2300) = .1 with the row marginal P(X₂ = 2200) = .2. Note that the marginal distributions for X₁ and X₂ remain the same as above, as can be seen from the column totals for X₁ and, respectively, the row totals for X₂.

                                 X₁ = # calories for Pop 1
                              2300    2400    2500    2600
    X₂ = # calories   2200     .02     .04     .06     .08      .20
    for Pop 2         2300     .03     .06     .09     .12      .30
                      2400     .05     .10     .15     .20      .50
                               .10     .20     .30     .40     1.00

Ismor Fischer, 5/29/2012 4.1-11

Now imagine that we wish to compare the two populations, by considering the probability distribution of the calorie difference D = X₁ − X₂ between them. (The sum S = X₁ + X₂ is similar, and left as an exercise.)

As an example, there are two possible ways that D = 300 can occur, i.e., two possible outcomes corresponding to the event "D = 300": either A = "X₁ = 2500 and X₂ = 2200" or B = "X₁ = 2600 and X₂ = 2300," that is, A ∪ B. For its probability, recall that P(A ∪ B) = P(A) + P(B) − P(A ∩ B). However, events A and B are disjoint, for they cannot both occur simultaneously, so that the last term is P(A ∩ B) = 0. Thus, P(A ∪ B) = P(A) + P(B), with P(A) = .06 and P(B) = .12 from the joint distribution.

    Event D = d    Sample space outcomes (X₁, X₂)                   Probability (from joint distribution)
    D = −100       (2300, 2400)                                     .05
    D = 0          (2300, 2300), (2400, 2400)                       .13 = .03 + .10
    D = +100       (2300, 2200), (2400, 2300), (2500, 2400)         .23 = .02 + .06 + .15
    D = +200       (2400, 2200), (2500, 2300), (2600, 2400)         .33 = .04 + .09 + .20
    D = +300       (2500, 2200), (2600, 2300)                       .18 = .06 + .12
    D = +400       (2600, 2200)                                     .08

Mean(D) = μ_D = (−100)(.05) + (0)(.13) + (100)(.23) + (200)(.33) + (300)(.18) + (400)(.08) = 170 cals,
    i.e., μ_D = μ₁ − μ₂   (Check this!)

Var(D) = σ_D² = (−270)²(.05) + (−170)²(.13) + (−70)²(.23) + (30)²(.33) + (130)²(.18) + (230)²(.08) = 16100 cals²,
    i.e., σ_D² = σ₁² + σ₂²   (Check this!)
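Under the independence assumption of Case 1, this entire distribution can also be generated in R from the two marginal tables; the following is a sketch of our own, not part of the course code:

    # Distribution of D = X1 - X2 when X1 and X2 are independent
    x1 <- c(2300, 2400, 2500, 2600); f1 <- c(0.1, 0.2, 0.3, 0.4)
    x2 <- c(2200, 2300, 2400);       f2 <- c(0.2, 0.3, 0.5)
    joint <- outer(f1, f2)                    # cell probabilities f1(x1) * f2(x2)
    diffs <- outer(x1, x2, "-")               # corresponding values of D
    fD <- tapply(as.vector(joint), as.vector(diffs), sum)
    fD                                        # .05 .13 .23 .33 .18 .08 over d = -100, ..., +400
    d <- as.numeric(names(fD))
    muD <- sum(d * fD); muD                   # 170
    sum((d - muD)^2 * fD)                     # 16100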
Ismor Fischer, 5/29/2012 4.1-12

Case 2: Now assume that X₁ and X₂ are not statistically independent, as given in the joint probability distribution table below.

                                 X₁ = # calories for Pop 1
                              2300    2400    2500    2600
    X₂ = # calories   2200     .01     .03     .07     .09      .20
    for Pop 2         2300     .02     .05     .10     .13      .30
                      2400     .07     .12     .13     .18      .50
                               .10     .20     .30     .40     1.00

The events "D = d" and the corresponding sample space of outcomes remain unchanged, but the last column of probabilities has to be recalculated, as shown. This results in a slightly different probability histogram (Exercise) and parameter values.

    Event D = d    Sample space outcomes (X₁, X₂)                   Probability (from joint distribution)
    D = −100       (2300, 2400)                                     .07
    D = 0          (2300, 2300), (2400, 2400)                       .14 = .02 + .12
    D = +100       (2300, 2200), (2400, 2300), (2500, 2400)         .19 = .01 + .05 + .13
    D = +200       (2400, 2200), (2500, 2300), (2600, 2400)         .31 = .03 + .10 + .18
    D = +300       (2500, 2200), (2600, 2300)                       .20 = .07 + .13
    D = +400       (2600, 2200)                                     .09

Mean(D) = μ_D = (−100)(.07) + (0)(.14) + (100)(.19) + (200)(.31) + (300)(.20) + (400)(.09) = 170 cals,
    i.e., μ_D = μ₁ − μ₂ still.

Var(D) = σ_D² = (−270)²(.07) + (−170)²(.14) + (−70)²(.19) + (30)²(.31) + (130)²(.20) + (230)²(.09) = 18500 cals²

It seems that "the mean of the difference is equal to the difference in the means" still holds, even when the two populations are dependent. But the variance of the difference is no longer necessarily equal to the sum of the variances (here, 18500 ≠ 10000 + 6100 = 16100), as it is with independent populations.
Ismor Fischer, 5/29/2012 4.1-13

These examples illustrate a general principle that can be rigorously proved with mathematics.

GENERAL FACT ~

    Mean(X + Y) = Mean(X) + Mean(Y)   and   Mean(X − Y) = Mean(X) − Mean(Y).

    In addition, if X and Y are independent random variables,
    Var(X + Y) = Var(X) + Var(Y)   and   Var(X − Y) = Var(X) + Var(Y).

Comments:

•  These formulas actually apply to both discrete and continuous variables (next section).
•  The difference relations will play a crucial role in §6.2 - "Two Samples" inference.
•  If X and Y are dependent, then the two bottom relations regarding the variance also involve an additional term, Cov(X, Y), the population covariance between X and Y. See problems 4.3/29 and 4.3/30 for details.
•  The variance relation can be interpreted visually via the Pythagorean Theorem, which illustrates an important geometric connection, expanded in the Appendix.

[Figure: a right triangle with legs σ_X and σ_Y and hypotenuse σ_D, illustrating Var(D) = Var(X) + Var(Y) for D = X − Y with X, Y independent.]













Certain discrete distributions (or discrete models) occur so frequently in practice, that
their properties have been well-studied and applied in many different scenarios. For
instance, suppose it is known that a certain population consists of 45% males (and thus
55% females). If a random sample of 250 individuals is to be selected, then what is the
probability of obtaining exactly 100 males? At most 100 males? At least 100 males?
What is the expected number of males? This is the subject of the next topic:



Ismor Fischer, 5/29/2012 4.1-14




POPULATION = Women diagnosed
with breast cancer in Dane County,
1996-2000

Among other things, this study
estimated that the rate of breast cancer
in situ (BCIS), which is diagnosed
almost exclusively via mammogram, is
approximately 12-13%. That is, for any
individual randomly selected from this
population, we have a binary variable

1, with probability 0.12
BCIS
0, with probability 0.88.

=

In a random sample of 100 n = breast
cancer diagnoses, let

X =#BCIS cases (0,1,2, ,100) .

Questions:

How can we model the probability
distribution of X, and under what
assumptions?

Probabilities of events, such as
( 0), P X = ( 20), P X = ( 20), P X
etc.?

Mean #BCIS cases =?

Standard deviation of #BCIS cases =?


Full article available online at this link.
Ismor Fischer, 5/29/2012 4.1-15

Binomial Distribution (Paradigm model =coin tosses)







(H H H H H) (H H T H H) (H T H H H) (H T T H H) (T H H H H) (T H T H H) (T T H H H) (T T T H H)
(H H H H T) (H H T H T) (H T H H T) (H T T H T) (T H H H T) (T H T H T) (T T H H T) (T T T H T)
(H H H T H) (H H T T H) (H T H T H) (H T T T H) (T H H T H) (T H T T H) (T T H T H) (T T T T H)
(H H H T T) (H H T T T) (H T H T T) (H T T T T) (T H H T T) (T H T T T) (T T H T T) (T T T T T)


Random Variable: X = #Heads in n =5 independent tosses (0, 1, 2, 3, 4, 5)

Events:   "X = 0" = Exercise            #(X = 0) = C(5, 0) = 1
          "X = 1" = Exercise            #(X = 1) = C(5, 1) = 5
          "X = 2" = Exercise            #(X = 2) = C(5, 2) = 10
          "X = 3" = see above           #(X = 3) = C(5, 3) = 10
          "X = 4" = Exercise            #(X = 4) = C(5, 4) = 5
          "X = 5" = Exercise            #(X = 5) = C(5, 5) = 1

Recall: For x = 0, 1, 2, …, n, the combinatorial symbol C(n, x), read "n-choose-x," is defined as the value n! / [x! (n − x)!], and counts the number of ways of rearranging x objects among n objects. See Appendix > Basic Reviews > Perms & Combos for details.

Note: C(n, r) is computed via the mathematical function nCr on most calculators.

Binary random variable:                          Probability:
    Y = 1, "Success" (Heads)                     with P(Success) = π
        0, "Failure" (Tails)                     with P(Failure) = 1 − π

Experiment: n = 5 independent coin tosses

Sample Space:  S = {(H H H H H), …, (T T T T T)},   #(S) = 2⁵ = 32
Ismor Fischer, 5/29/2012 4.1-16
Probabilities:

First assume the coin is fair (π = 0.5, so 1 − π = 0.5), i.e., equally likely elementary outcomes H and T on a single trial. In this case, the probability of any event A above can thus be easily calculated via P(A) = #(A) / #(S).

    x     P(X = x) = C(5, x) (1/2)⁵
    0     1/32  = 0.03125
    1     5/32  = 0.15625
    2     10/32 = 0.31250
    3     10/32 = 0.31250
    4     5/32  = 0.15625
    5     1/32  = 0.03125

[Probability histogram: total area = 1.]



Now consider the case where the coin is biased (e.g., π = 0.7, so 1 − π = 0.3). Calculating P(X = x) for x = 0, 1, 2, 3, 4, 5 means summing P(all its outcomes).

Example: P(X = 3) =            (each outcome, via independence of H, T)

      P(H H H T T) = (0.7)(0.7)(0.7)(0.3)(0.3) = (0.7)³(0.3)²
    + P(H H T H T) = (0.7)(0.7)(0.3)(0.7)(0.3) = (0.7)³(0.3)²
    + P(H H T T H) = (0.7)(0.7)(0.3)(0.3)(0.7) = (0.7)³(0.3)²
    + P(H T H H T) = (0.7)(0.3)(0.7)(0.7)(0.3) = (0.7)³(0.3)²
    + P(H T H T H) = (0.7)(0.3)(0.7)(0.3)(0.7) = (0.7)³(0.3)²
    + P(H T T H H) = (0.7)(0.3)(0.3)(0.7)(0.7) = (0.7)³(0.3)²
    + P(T H H H T) = (0.3)(0.7)(0.7)(0.7)(0.3) = (0.7)³(0.3)²
    + P(T H H T H) = (0.3)(0.7)(0.7)(0.3)(0.7) = (0.7)³(0.3)²
    + P(T H T H H) = (0.3)(0.7)(0.3)(0.7)(0.7) = (0.7)³(0.3)²
    + P(T T H H H) = (0.3)(0.3)(0.7)(0.7)(0.7) = (0.7)³(0.3)²

    (summing, via disjoint outcomes)

    = C(5, 3) (0.7)³ (0.3)²
Ismor Fischer, 5/29/2012 4.1-17

Hence, we similarly have

    x     P(X = x)
    0     C(5, 0) (0.7)⁰ (0.3)⁵ = 0.00243
    1     C(5, 1) (0.7)¹ (0.3)⁴ = 0.02835
    2     C(5, 2) (0.7)² (0.3)³ = 0.13230
    3     C(5, 3) (0.7)³ (0.3)² = 0.30870
    4     C(5, 4) (0.7)⁴ (0.3)¹ = 0.36015
    5     C(5, 5) (0.7)⁵ (0.3)⁰ = 0.16807

Example: Suppose that a certain medical procedure is known to have a 70% successful recovery rate (assuming independence). In a random sample of n = 5 patients, the probability that three or fewer patients will recover is:

    Method 1: P(X ≤ 3) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3)
                       = 0.00243 + 0.02835 + 0.13230 + 0.30870 = 0.47178

    Method 2: P(X ≤ 3) = 1 − [ P(X = 4) + P(X = 5) ]
                       = 1 − [ 0.36015 + 0.16807 ] = 1 − 0.52822 = 0.47178

Example: The mean number of patients expected to recover is:

    μ = E[X] = 0(0.00243) + 1(0.02835) + 2(0.13230) + 3(0.30870) + 4(0.36015) + 5(0.16807) = 3.5 patients.

This makes perfect sense for n = 5 patients with a π = 0.7 recovery probability, i.e., their product. In the probability histogram above, the "balance point" fulcrum indicates the mean value of 3.5.
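For readers using R, the built-in functions dbinom() and pbinom() reproduce these values directly; a short sketch of our own:

    # Binomial check for n = 5 trials with success probability 0.7
    dbinom(0:5, size = 5, prob = 0.7)   # 0.00243, 0.02835, 0.13230, 0.30870, 0.36015, 0.16807
    pbinom(3, size = 5, prob = 0.7)     # P(X <= 3) = 0.47178
    sum(0:5 * dbinom(0:5, 5, 0.7))      # mean = n * pi = 3.5 patients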

[Probability histogram for X ~ Bin(5, 0.7): P(X = x) = C(5, x)(0.7)ˣ(0.3)⁵⁻ˣ; total area = 1; the balance point (fulcrum) sits at the mean 3.5.]
Ismor Fischer, 5/29/2012 4.1-18

General formulation:

The Binomial Distribution

Let the discrete random variable X = # Successes in n independent Bernoulli trials (0, 1, 2, …, n), each having constant probability P(Success) = π, and hence P(Failure) = 1 − π. Then the probability of obtaining any specified number of successes x = 0, 1, 2, …, n is given by:

    P(X = x) = C(n, x) πˣ (1 − π)^(n − x).

We say that X has a Binomial Distribution, denoted X ~ Bin(n, π). Furthermore, the mean μ = nπ, and the standard deviation σ = √[nπ(1 − π)].

Example: Suppose that a certain spontaneous medical condition affects 1% (i.e., π = 0.01) of the population. Let X = number of affected individuals in a random sample of n = 300. Then X ~ Bin(300, 0.01), i.e., the probability of obtaining any specified number x = 0, 1, 2, …, 300 of affected individuals is:

    P(X = x) = C(300, x) (0.01)ˣ (0.99)^(300 − x).

The mean number of affected individuals is μ = nπ = (300)(0.01) = 3 expected cases, with a standard deviation of σ = √[(300)(0.01)(0.99)] = 1.723 cases.
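A quick check of this example in R (our own sketch):

    # X ~ Bin(300, 0.01)
    dbinom(0:5, size = 300, prob = 0.01)   # P(X = 0), ..., P(X = 5)
    300 * 0.01                              # mean = 3 expected cases
    sqrt(300 * 0.01 * 0.99)                 # standard deviation = 1.723 cases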

Probability Table for the Binomial Distribution

    x        f(x) = C(n, x) πˣ (1 − π)^(n − x)
    0        C(n, 0) π⁰ (1 − π)ⁿ
    1        C(n, 1) π¹ (1 − π)^(n − 1)
    2        C(n, 2) π² (1 − π)^(n − 2)
    etc.     etc.
    n        C(n, n) πⁿ (1 − π)⁰
             [column total = 1]

Exercise: In order to be a valid distribution, the sum of these probabilities must equal 1. Prove it.

Hint: First recall the Binomial Theorem: How do you expand the algebraic expression (a + b)ⁿ for any n = 0, 1, 2, 3, …? Then replace a with π, and b with 1 − π. Voilà!
Ismor Fischer, 5/29/2012 4.1-19

Comments:

The assumption of independence of the trials is absolutely critical! If not satisfied i.e.,
if the success probability of one trial influences that of another then the Binomial
Distribution model can fail miserably. (Example: X =number of children in a particular
school infected with the flu) The investigator must decide whether or not independence
is appropriate, which is often problematic. If violated, then the correlation structure
between the trials may have to be considered in the model.

As in the preceding example, if the sample size n is very large, then the computation of C(n, x) for x = 0, 1, 2, …, n can be intensive and impractical. An approximation to the Binomial Distribution exists, when n is large and π is small, via the Poisson Distribution (coming up).

Note that the standard deviation σ = √[nπ(1 − π)] depends on the value of π. (Later…)

Ismor Fischer, 5/29/2012 4.1-20

How can we estimate the parameter π, using a sample-based statistic π̂?

Example: If, in a sample of n = 50 randomly selected individuals, X = 36 are female, then the statistic π̂ = X/n = 36/50 = 0.72 is an estimate of the true probability that a randomly selected individual from the population is female. The probability of selecting a male is therefore estimated by 1 − π̂ = 0.28.

[Diagram: POPULATION with binary random variable Y = 1 ("Success," with probability π) or 0 ("Failure," with probability 1 − π). Experiment: n independent trials yield a SAMPLE (y₁, y₂, y₃, …, yₙ) of 0s and 1s. Let X = y₁ + y₂ + ⋯ + yₙ = # Successes in n trials ~ Bin(n, π), so n − X = # Failures. Therefore, dividing by n, X/n = proportion of Successes in n trials = p = π̂ (= ȳ, as well), and hence q = 1 − p = proportion of Failures in n trials.]


Ismor Fischer, 5/29/2012 4.1-21
Poisson Distribution (Models rare events)

Discrete Random Variable:

    X = # occurrences of a (rare) event E, in a given interval of time or space, of size T.  (0, 1, 2, 3, …)

Assume:

1. All the occurrences of E are independent in the interval.

2. The mean number of expected occurrences of E in the interval is proportional to T, i.e., μ = λT. This constant of proportionality λ is called the rate of the resulting Poisson process.

Then:

The Poisson Distribution

The probability of obtaining any specified number x = 0, 1, 2, … of occurrences of event E is given by:

    P(X = x) = e^(−μ) μˣ / x!

where e = 2.71828… (Euler's number). We say that X has a Poisson Distribution, denoted X ~ Poisson(μ). Furthermore, the mean is μ = λT, and the variance is σ² = λT also.

Examples: # bee-sting fatalities per year, # spontaneous cancer remissions per year, # accidental needle-stick HIV cases per year, hemocytometer cell counts
Ismor Fischer, 5/29/2012 4.1-22
Example (see above): Again suppose that a certain spontaneous medical condition E affects 1% (i.e., λ = 0.01) of the population. Let X = number of affected individuals in a random sample of T = 300. As before, the mean number of expected occurrences of E in the sample is μ = λT = (0.01)(300) = 3 cases. Hence X ~ Poisson(3), and the probability that any number x = 0, 1, 2, … of individuals are affected is given by:

    P(X = x) = e⁻³ 3ˣ / x!

which is a much easier formula to work with than the previous one. This fact is sometimes referred to as the Poisson approximation to the Binomial Distribution, when T (respectively, n) is large, and λ (respectively, π) is small. Note that in this example, the variance is also σ² = 3, so that the standard deviation is σ = √3 = 1.732, very close to the exact Binomial value.

    x       Binomial  P(X = x) = C(300, x)(0.01)ˣ(0.99)^(300−x)     Poisson  P(X = x) = e⁻³ 3ˣ / x!
    0       0.04904                                                 0.04979
    1       0.14861                                                 0.14936
    2       0.22441                                                 0.22404
    3       0.22517                                                 0.22404
    4       0.16888                                                 0.16803
    5       0.10099                                                 0.10082
    6       0.05015                                                 0.05041
    7       0.02128                                                 0.02160
    8       0.00787                                                 0.00810
    9       0.00258                                                 0.00270
    10      0.00076                                                 0.00081
    etc.    ≈ 0                                                     ≈ 0

[Probability histograms: both distributions have total area = 1.]
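The entire comparison table can be reproduced with R's dbinom() and dpois() functions; a short sketch of our own:

    # Binomial vs. Poisson approximation, as in the table above
    x <- 0:10
    round(dbinom(x, size = 300, prob = 0.01), 5)   # exact Binomial probabilities
    round(dpois(x, lambda = 3), 5)                 # Poisson(3) approximation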















Ismor Fischer, 5/29/2012 4.1-23

Why is the Poisson Distribution a good approximation to the Binomial Distribution, for large n and small π?

Rule of Thumb: n ≥ 20 and π ≤ 0.05; excellent if n ≥ 100 and π ≤ 0.1.

Let f_Bin(x) = C(n, x) πˣ (1 − π)^(n − x) and f_Poisson(x) = e^(−μ) μˣ / x!, where μ = nπ. We wish to show formally that, for fixed μ, and x = 0, 1, 2, …, we have:

    lim (n → ∞, π → 0)  f_Bin(x) = f_Poisson(x).

Proof: By elementary algebra, it follows that

    f_Bin(x) = C(n, x) πˣ (1 − π)^(n − x)
             = [ n! / (x! (n − x)!) ] πˣ (1 − π)ⁿ (1 − π)^(−x)
             = (1/x!) [ n(n − 1)(n − 2) ⋯ (n − x + 1) ] πˣ (1 − nπ/n)ⁿ (1 − π)^(−x)
             = (1/x!) [ n(n − 1)(n − 2) ⋯ (n − x + 1) / nˣ ] (nπ)ˣ (1 − nπ/n)ⁿ (1 − π)^(−x)
             = (1/x!) [ (n/n)((n − 1)/n)((n − 2)/n) ⋯ ((n − x + 1)/n) ] μˣ (1 − μ/n)ⁿ (1 − π)^(−x)
             = (1/x!) [ 1 (1 − 1/n)(1 − 2/n) ⋯ (1 − (x − 1)/n) ] μˣ (1 − μ/n)ⁿ (1 − π)^(−x).

As n → ∞ and π → 0 (with μ = nπ fixed),

    1 (1 − 1/n)(1 − 2/n) ⋯ (1 − (x − 1)/n)  →  1,
    (1 − μ/n)ⁿ  →  e^(−μ),
    (1 − π)^(−x)  →  1,

so that f_Bin(x) → (1/x!) · 1 · μˣ · e^(−μ) · 1 = e^(−μ) μˣ / x! = f_Poisson(x).   QED

[Portrait: Siméon Poisson (1781-1840).]
Ismor Fischer, 5/29/2012 4.1-24

Classical Discrete Probability Distributions

Binomial (probability of finding x successes and n − x failures in n independent trials)
    X = # successes (each with probability π) in n independent Bernoulli trials, n = 1, 2, 3, …
    f(x) = P(X = x) = C(n, x) πˣ (1 − π)^(n − x),   x = 0, 1, 2, …, n

Negative Binomial (probability of needing x independent trials to find k successes)
    X = # independent Bernoulli trials for k successes (each with probability π), k = 1, 2, 3, …
    f(x) = P(X = x) = C(x − 1, k − 1) πᵏ (1 − π)^(x − k),   x = k, k + 1, k + 2, …

    Geometric: X = # independent Bernoulli trials for k = 1 success
    f(x) = P(X = x) = π (1 − π)^(x − 1),   x = 1, 2, 3, …

Hypergeometric (modification of the Binomial to sampling without replacement from small finite populations, relative to n)
    X = # successes in n random trials taken from a population of size N containing d successes, n > N/10
    f(x) = P(X = x) = C(d, x) C(N − d, n − x) / C(N, n),   x = 0, 1, 2, …, d

Multinomial (generalization of the Binomial to k categories, rather than just two)
    For i = 1, 2, 3, …, k, Xᵢ = # outcomes in category i (each with probability πᵢ), in n independent Bernoulli trials, n = 1, 2, 3, …, with π₁ + π₂ + π₃ + ⋯ + πₖ = 1
    f(x₁, x₂, …, xₖ) = P(X₁ = x₁, X₂ = x₂, …, Xₖ = xₖ) = [ n! / (x₁! x₂! ⋯ xₖ!) ] π₁^x₁ π₂^x₂ ⋯ πₖ^xₖ,
        xᵢ = 0, 1, 2, …, n with x₁ + x₂ + ⋯ + xₖ = n

Poisson (limiting case of the Binomial, with n → ∞ and π → 0, such that nπ = μ, fixed)
    X = # occurrences of a rare event (i.e., π → 0) among many (i.e., n large), with fixed mean μ = nπ
    f(x) = P(X = x) = e^(−μ) μˣ / x!,   x = 0, 1, 2, …
Ismor Fischer, 5/22/2013 4.2-1

4.2 Continuous Models

Horseshoe Crab (Limulus polyphemus)

Not true crabs, but closely related to spiders and scorpions.

Living fossils existed since Carboniferous Period, 350 mya.

Found primarily on Atlantic coast, with the highest concentration in
Delaware Bay, where males and the much larger females congregate in
large numbers on the beaches for mating, and subsequent egg-laying.

Pharmaceutical (and many other scientific) contributions!
Blue hemolymph (due to copper-based hemocyanin molecule) contains
amebocytes, which produce a clotting agent that reacts with endotoxins
found in the outer membrane of Gram-negative bacteria. Several East
Coast companies have developed the Limulus Amebocyte Lysate
(LAL) assay, used to detect bacterial contamination of drugs and
medical implant devices, etc. Equal amounts of LAL reagent and test
solution are mixed together, incubated at 37C for one hour, then
checked to see if gelling has occurred. Simple, fast, cheap, sensitive,
uses very small amounts, and does not harm the animals probably.
(Currently, a moratorium exists on their harvesting, while population
studies are ongoing)

Photo courtesy of Bill Hall, bhall@udel.edu. Used with permission.
Ismor Fischer, 5/22/2013 4.2-2


Continuous Random Variable:   X = Length (inches) of adult horseshoe crabs

Sample 1                                               Sample 2
n = 25; lengths measured to nearest inch               n = 1000; lengths measured to nearest inch
e.g., 10 in [12, 16), 6 in [16, 20), 9 in [20, 24)     e.g., 180 in [12, 14), 240 in [14, 16), etc.

[Density histograms: Sample 1 has relative frequencies 0.40, 0.24, 0.36 over [12, 16), [16, 20), [20, 24); Sample 2 has finer bars with relative frequencies 0.18, 0.24, 0.16, 0.12, 0.20, 0.08, 0.02 over two-inch intervals.]

Examples:   P(16 ≤ X < 20) = 0.24                      P(16 ≤ X < 20) = 0.16 + 0.12 = 0.28

In the limit as n → ∞, the population distribution of X can be characterized by a continuous density curve, and formally described by a density function f(x) ≥ 0. Thus,

    P(a ≤ X < b) = ∫ₐᵇ f(x) dx = area under the density curve from a to b.

[Figure: a bimodal density curve for X over roughly [12, 24], males being smaller on average and females larger on average, with the area from a to b shaded and total area ∫ f(x) dx = 1.]
Ismor Fischer, 5/22/2013 4.2-3

Definition: f(x) is a probability density function for the continuous random variable X if, for all x,

    f(x) ≥ 0   AND   ∫ (over −∞ to +∞) f(x) dx = 1.

The cumulative distribution function (cdf) is defined as, for all x,

    F(x) = P(X ≤ x) = ∫ (over −∞ to x) f(t) dt.

Therefore, F increases monotonically and continuously from 0 to 1. Furthermore,

    P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx = F(b) − F(a).   FTC!!!!

[Figures: the density curve f(x), with the shaded area ∫ (−∞ to x) f(t) dt up to x and total area 1, and the cdf F(x) rising from 0 to 1.]

The cumulative probability that X is less than or equal to some value x, i.e., P(X ≤ x), is characterized by: (1) the area under the graph of f up to x, or (2) the height of the graph of F at x. But note: f(x) NO LONGER corresponds to the probability P(X = x) [which = 0, since X is here continuous], as it does for discrete X.
Ismor Fischer, 5/22/2013 4.2-4

Example 1: Uniform density

This is the trivial constant function over some fixed interval [a, b]. That is, f(x) = 1/(b − a) for a ≤ x ≤ b (and f(x) = 0 otherwise). Clearly, the two criteria for being a valid density function are met: it is non-negative, and the (rectangular) area under its graph is equal to its base (b − a) times its height 1/(b − a), which is indeed 1. Moreover, for any value of x in the interval [a, b], the (rectangular) area under the graph up to x is equal to its base (x − a) times its height 1/(b − a). That is, the cumulative distribution function (cdf) is given by

    F(x) = (x − a)/(b − a),

the graph of which is a straight line connecting the left endpoint (a, 0) to the right endpoint (b, 1).

[[Note: Since f(x) = 0 outside the interval [a, b], the area beneath it contributes nothing to F(x) there; hence F(x) = 0 if x < a, and F(x) = 1 if x > b. Observe that, indeed, F increases monotonically and continuously from 0 to 1; the graphs show f(x) = 1/5 and F(x) = (x − 1)/5 over the interval [1, 6], i.e., a = 1, b = 6. Compare this example with the discrete version in section 3.1.]]

Thus, for example, the probability P(2.6 ≤ X ≤ 3.8) is equal to the (rectangular) area under f(x) over that interval, or in terms of F(x), simply equal to the difference between the heights F(3.8) − F(2.6) = (3.8 − 1)/5 − (2.6 − 1)/5 = 0.56 − 0.32 = 0.24.
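A one-line check of this uniform probability in R, using the built-in cdf punif() (our own sketch):

    punif(3.8, min = 1, max = 6) - punif(2.6, min = 1, max = 6)   # 0.24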

Ismor Fischer, 5/22/2013 4.2-5

Example 2: Power density (A special case of the Beta density: β = 1)

For any fixed p > 0, let f(x) = p x^(p − 1) for 0 < x < 1 (else, f(x) = 0). This is a valid density function, since f(x) ≥ 0 and

    ∫ (over −∞ to +∞) f(x) dx = ∫₀¹ p x^(p − 1) dx = [ xᵖ ]₀¹ = 1.

The corresponding cdf is therefore

    F(x) = ∫ (over −∞ to x) f(t) dt = ∫₀ˣ p t^(p − 1) dt = [ tᵖ ]₀ˣ = xᵖ   on [0, 1].

(And, as above, F(x) = 0 if x < 0, and F(x) = 1 if x > 1.) Again observe that F indeed increases monotonically and continuously from 0 to 1, regardless of f; see the graphs for p = 1/2, 3/2, 3, where f(x) = (1/2) x^(−1/2), (3/2) x^(1/2), 3x² and F(x) = x^(1/2), x^(3/2), x³, respectively. (Note: p = 1 corresponds to the uniform density on [0, 1].)
Ismor Fischer, 5/22/2013 4.2-6
Example 3: Cauchy density

The function f(x) = (1/π) · 1/(1 + x²) for −∞ < x < +∞ is a legitimate density function, since it satisfies the two criteria above: f(x) ≥ 0 AND ∫ (over −∞ to +∞) f(x) dx = 1. (Verify it!) The cdf is therefore

    F(x) = ∫ (over −∞ to x) f(t) dt = ∫ (over −∞ to x) (1/π) · 1/(1 + t²) dt = (1/π) arctan x + 1/2,   for −∞ < x < +∞.

Thus, for instance, P(0 ≤ X ≤ 1) = F(1) − F(0) = [ (1/π)(π/4) + 1/2 ] − [ (1/π)(0) + 1/2 ] = 1/4.
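A quick numerical check in R, either by integrating the density directly or by using the built-in Cauchy cdf pcauchy() (our own sketch):

    integrate(function(x) (1/pi) / (1 + x^2), lower = 0, upper = 1)   # 0.25
    pcauchy(1) - pcauchy(0)                                           # same value, via the built-in cdf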

Example 4: Exponential density

For any a > 0 fixed, f(x) = a e^(−ax) for x ≥ 0 (and = 0 for x < 0) is a valid density function, since it satisfies the two criteria. (Details are left as an exercise.) The corresponding cdf is given by

    F(x) = ∫ (over −∞ to x) f(t) dt = ∫₀ˣ a e^(−at) dt = 1 − e^(−ax),   for x ≥ 0 (and = 0 otherwise).

[Graphs: the case a = 1 is shown, with density e^(−x) and cdf 1 − e^(−x).]

Thus, for instance, P(X ≤ 2) = F(2) = 1 − e^(−2) = 0.8647, and
P(0.5 ≤ X ≤ 2) = F(2) − F(0.5) = (1 − e^(−2)) − (1 − e^(−0.5)) = 0.8647 − 0.3935 = 0.4712.
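A quick check in R with the built-in exponential cdf pexp(), whose rate argument corresponds to the a above (our own sketch):

    pexp(2, rate = 1)                         # 0.8647
    pexp(2, rate = 1) - pexp(0.5, rate = 1)   # 0.4712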
Ismor Fischer, 5/22/2013 4.2-7

Exercise: (Another special case of the Beta density.) Sketch the graph of f(x) = 6x(1 − x) for 0 ≤ x ≤ 1 (and = 0 elsewhere); show that it is a valid density function. Find the cdf F(x), and sketch its graph. Calculate P(… ≤ X ≤ …).

Exercise: Sketch the graph of f(x) = eˣ / (eˣ + 1)² for −∞ < x < +∞, and show that it is a valid density function. Find the cdf F(x), and sketch its graph. Find the quartiles. Calculate P(0 ≤ X ≤ 1).

If X is a continuous numerical random variable with density function f(x), then the population mean is given by the "first moment"

    μ = E[X] = ∫ (over −∞ to +∞) x f(x) dx,

and the population variance is given by the "second moment about the mean"

    σ² = E[(X − μ)²] = ∫ (over −∞ to +∞) (x − μ)² f(x) dx,

or equivalently,

    σ² = E[X²] − μ² = ∫ (over −∞ to +∞) x² f(x) dx − μ².

(Compare these continuous formulas with those for discrete X.)

Thus, for the exponential density, μ = ∫₀^∞ x a e^(−ax) dx = 1/a, via integration by parts. The calculation of σ² is left as an exercise.

Exercise: Sketch the graph of f(x) = (2/π) · 1/√(1 − x²) for 0 ≤ x < 1 (and 0 elsewhere); show that it is a valid density function. Find the cdf F(x), and sketch its graph. Calculate P(X … ), and find the mean.

Exercise: What are the mean and variance of the power density?

Exercise: What is the mean of the Cauchy density?

[Portrait: Augustin-Louis Cauchy (1789-1857), captioned "Be careful! It's not as easy as it appears..."]
Ismor Fischer, 5/22/2013 4.2-8
[Probability histogram for the flea's jump length X: bars of heights 7/28, 6/28, 5/28, 4/28, 3/28, 2/28, 1/28 over x = 0, 1, 2, 3, 4, 5, 6.]

Example:
Crawling Ants and Jumping Fleas

Consider two insects on a (six-inch) ruler: a flea, who makes only discrete
integer jumps (X), and an ant, who crawls along continuously and can stop
anywhere (Y).

1. Let the discrete random variable X =length jumped (0, 1, 2, 3, 4, 5, or 6 inches)
by the flea. Suppose that the flea is tired, so is less likely to make a large jump
than a small (or no) jump, according to the following probability distribution
(or mass) function f(x) = P(X =x), and corresponding probability histogram.




















The total probability is P(0 ≤ X ≤ 6) = 1, as it should be.

    P(3 ≤ X ≤ 6) = 4/28 + 3/28 + 2/28 + 1/28 = 10/28

    P(0 ≤ X < 3) = 7/28 + 6/28 + 5/28 = 18/28, or
                 = 1 − P(3 ≤ X ≤ 6) = 1 − 10/28 = 18/28
    P(0 ≤ X ≤ 3) = 18/28 + 4/28 = 22/28, because P(X = 3) = 4/28       (Not equal to the previous one!)

Exercise: Confirm that the flea jumps a mean length of μ = 2 inches.

Exercise: Sketch a graph of the cumulative distribution function F(x) = P(X ≤ x), similar to that of §2.2 in these notes.

Probability Table

x f(x) =P(X = x)

0 7/28
1 6/28
2 5/28
3 4/28
4 3/28
5 2/28
6 1/28

1
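A short R sketch that encodes this probability table and verifies the calculations above (including the mean of 2 inches and the cumulative distribution F(x)):

# Flea jump distribution
x <- 0:6
f <- c(7, 6, 5, 4, 3, 2, 1) / 28    # probability mass function
sum(f)                              # total probability = 1
sum(f[x >= 3])                      # P(3 <= X <= 6) = 10/28
sum(f[x < 3])                       # P(0 <= X < 3)  = 18/28
sum(x * f)                          # mean jump length = 2 inches
cumsum(f)                           # cumulative distribution F(x), x = 0, ..., 6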

2. Let the continuous random variable Y =length crawled (any value in the
interval [0, 6] inches) by the ant. Suppose that the ant is tired, so is less
likely to crawl a long distance than a short (or no) distance, according to the
following probability density function f(y), and its corresponding graph, the
probability density curve. (Assume that f =0 outside of the given interval.)





The total probability is P(0 ≤ Y ≤ 6) = ½ (6)(1/3) = 1, as it should be.

P(3 ≤ Y ≤ 6) = ½ (3)(1/6) = 1/4   (Could also use calculus.)

P(0 ≤ Y < 3) = 1 − P(3 ≤ Y ≤ 6) = 1 − 1/4 = 3/4

P(0 ≤ Y ≤ 3) = 3/4 also (equal this time!), because P(Y = 3) = 0. Why?

Exercise: Confirm that the ant crawls a mean length of μ = 2 inches.

Exercise: Find the cumulative distribution function F(y), and sketch its graph.


[Density curve: f(y) = (6 − y)/18 for 0 ≤ y ≤ 6 (and = 0 elsewhere), decreasing linearly from height 1/3 at y = 0 to 0 at y = 6; total area = 1.]
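The analogous check for the ant's continuous density uses numerical integration, for example:

# Ant crawl distribution: f(y) = (6 - y)/18 on [0, 6]
f <- function(y) (6 - y) / 18
integrate(f, 0, 6)$value                     # total probability = 1
integrate(f, 3, 6)$value                     # P(3 <= Y <= 6) = 1/4
integrate(function(y) y * f(y), 0, 6)$value  # mean crawl length = 2 inches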
[Figures: two normal density curves, one with mean μ = 98.6 and small σ, the other with mean μ = 100 and large σ; each has total area ∫_{−∞}^{+∞} f(x) dx = 1, mean μ, standard deviation σ, and left and right tails indicated. Portrait: Johann Carl Friedrich Gauss, 1777–1855.]
An extremely important bell-shaped continuous population distribution

Normal Distribution (a.k.a. Gaussian Distribution): X ~ N(, )


f(x) = 1/(σ√(2π)) · e^(−½((x − μ)/σ)²),   −∞ < x < +∞
Examples:















X = Body Temp (°F)          X = IQ score (discrete!)

(Here π = 3.14159... and e = 2.71828... .)
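The two example curves can be sketched in R with the built-in normal density dnorm(). The means below come from the figures (98.6 and 100); the standard deviations are illustrative guesses only, chosen to give a "small" and a "large" spread.

# Sketch the two example bell curves (sd values are illustrative assumptions)
curve(dnorm(x, mean = 98.6, sd = 0.7), from = 96, to = 101,
      xlab = "X = Body Temp (deg F)", ylab = "f(x)")
curve(dnorm(x, mean = 100, sd = 15), from = 55, to = 145,
      xlab = "X = IQ score", ylab = "f(x)")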
Example: Two exams are given in a statistics course, both resulting in class scores
that are normally distributed. The first exam distribution has a mean of 80.7 and a
standard deviation of 3.5 points. The second exam distribution has a mean of 82.8
and a standard deviation of 4.5 points. Carla receives a score of 87 on the first
exam, and a score of 90 on the second exam. Which of her two exam scores
represents the better effort, relative to the rest of the class?


















The Z-score tells how many standard deviations the X-score lies from the mean μ.

x-score = 87: z-score = (87 − 80.7)/3.5 = 1.8          x-score = 90: z-score = (90 − 82.8)/4.5 = 1.6

The first exam (z = 1.8) is therefore the higher relative score.

Z-score Transformation

X ~ N(μ, σ)   →   Z = (X − μ)/σ ~ N(0, 1), the Standard Normal Distribution
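In R, the comparison is a one-line calculation for each exam:

# Z-scores for the two exam scores
(87 - 80.7) / 3.5    # first exam:  z = 1.8
(90 - 82.8) / 4.5    # second exam: z = 1.6  (so the first exam is the better relative effort)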
Example: X =Age (years) of UW-Madison third-year undergraduate population

Assume: X ~ N(20, 1.25),
i.e., X is normally distributed with mean μ = 20 yrs, and s.d. σ = 1.25 yrs.

Suppose that an individual from this population is randomly selected. Then

P(X < 20) = 0.5 (via symmetry)

P(X < 19) = P(Z < (19 − 20)/1.25) = P(Z < −0.8) = 0.2119 (via table or software)








Therefore
P(19 ≤ X < 20) = P(X < 20) − P(X < 19) = 0.5000 − 0.2119 = 0.2881

Likewise,
P(19 ≤ X < 19.5)    = 0.3446 − 0.2119 = 0.1327
P(19 ≤ X < 19.05)   = 0.2236 − 0.2119 = 0.0118
P(19 ≤ X < 19.005)  = 0.2130 − 0.2119 = 0.0012
P(19 ≤ X < 19.0005) = 0.2120 − 0.2119 = 0.0001
...
P(X = 19.00000) = 0, since X is continuous!
How do we check this? And what do we do if it's not true, or we can't tell? Later...
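These table lookups can also be reproduced with the R command pnorm, which accepts the mean and standard deviation directly:

# X ~ N(20, 1.25): reproduce the probabilities above
pnorm(19, mean = 20, sd = 1.25)              # P(X < 19)         = 0.2119
pnorm(20, 20, 1.25) - pnorm(19, 20, 1.25)    # P(19 <= X < 20)   = 0.2881
pnorm(19.5, 20, 1.25) - pnorm(19, 20, 1.25)  # P(19 <= X < 19.5) = 0.1327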
Two Related Questions
1. Given X ~ N(μ, σ). What is the probability that a randomly selected individual from the population falls within one standard deviation (i.e., μ ± 1σ) of the mean μ? Within two standard deviations (μ ± 2σ)? Within three (μ ± 3σ)?

Solution: We solve this by transforming to the tabulated standard normal distribution Z ~ N(0, 1), via the formula Z = (X − μ)/σ, i.e., X = μ + Zσ.

P(μ − 1σ ≤ X ≤ μ + 1σ) = P(−1 ≤ Z ≤ +1) = P(Z ≤ +1) − P(Z ≤ −1) = 0.8413 − 0.1587 = 0.6827

P(μ − 2σ ≤ X ≤ μ + 2σ) = P(−2 ≤ Z ≤ +2) = P(Z ≤ +2) − P(Z ≤ −2) = 0.9772 − 0.0228 = 0.9545

Likewise, P(μ − 3σ ≤ X ≤ μ + 3σ) = P(−3 ≤ Z ≤ +3) = 0.9973.

These so-called empirical guidelines can be used as an informal check to see if sample-generated data derive from a population that is normally distributed. For if so, then 68%, or approximately 2/3, of the data should lie within one standard deviation s of the mean x̄; approximately 95% should lie within two standard deviations 2s of the mean x̄, etc. Other quantiles can be checked similarly. Superior methods also exist...
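These three benchmark probabilities follow directly from the standard normal cdf; in R:

# 68 - 95 - 99.7 rule from the standard normal cdf
pnorm(1) - pnorm(-1)    # within 1 standard deviation:  0.6827
pnorm(2) - pnorm(-2)    # within 2 standard deviations: 0.9545
pnorm(3) - pnorm(-3)    # within 3 standard deviations: 0.9973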


See my homepage to view a "ball drop" computer simulation of the normal distribution (requires Java):
http://www.stat.wisc.edu/~ifischer
2. Given X ~ N(μ, σ). What symmetric interval about the mean μ contains 90% of the population distribution? 95%? 99%? General formulation?

Solution: Again, we can answer this question for the standard normal distribution Z ~ N(0, 1), and transform back to X ~ N(μ, σ), via the formula Z = (X − μ)/σ, i.e., X = μ + Zσ.

The value z.05 = 1.645 satisfies P(−z.05 ≤ Z ≤ z.05) = 0.90, or equivalently, P(Z ≤ −z.05) = P(Z ≥ z.05) = 0.05. Hence, the required interval is μ − 1.645σ ≤ X ≤ μ + 1.645σ.

The value z.025 = 1.960 satisfies P(−z.025 ≤ Z ≤ z.025) = 0.95, or equivalently, P(Z ≤ −z.025) = P(Z ≥ z.025) = 0.025. Hence, the required interval is μ − 1.960σ ≤ X ≤ μ + 1.960σ.

The value z.005 = 2.575 satisfies P(−z.005 ≤ Z ≤ z.005) = 0.99, or equivalently, P(Z ≤ −z.005) = P(Z ≥ z.005) = 0.005. Hence, the required interval is μ − 2.575σ ≤ X ≤ μ + 2.575σ.

In general...

Def: The critical value z_{α/2} satisfies P(−z_{α/2} ≤ Z ≤ z_{α/2}) = 1 − α, or equivalently, the tail probabilities P(Z ≤ −z_{α/2}) = P(Z ≥ z_{α/2}) = α/2. Hence, the required interval satisfies P(μ − z_{α/2} σ ≤ X ≤ μ + z_{α/2} σ) = 1 − α.
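The tabulated critical values can be recovered with the standard normal quantile function qnorm(); for example:

# Critical values z_{alpha/2} (upper alpha/2 points of N(0, 1))
qnorm(0.95)     # z_.05  = 1.645   (90% interval)
qnorm(0.975)    # z_.025 = 1.960   (95% interval)
qnorm(0.995)    # z_.005 = 2.576   (rounded to 2.575 above; 99% interval)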
Normal Approximation to the Binomial Distribution
(continuous) (discrete)

Example: Suppose that it is estimated that 20% (i.e., π = 0.2) of a certain population has diabetes. Out of n = 100 randomly selected individuals, what is the probability that

(a) exactly X =10 are diabetics? X =15? X =20? X =25? X =30?

Assuming that the occurrence of diabetes is independent among the
individuals in the population, we have X ~Bin(100, 0.2). Thus, the values
of P(X =x) are calculated in the following probability table and histogram.


x      P(X = x) = C(100, x) (0.2)^x (0.8)^(100−x)

10     C(100, 10) (0.2)^10 (0.8)^90 = 0.00336
15     C(100, 15) (0.2)^15 (0.8)^85 = 0.04806
20     C(100, 20) (0.2)^20 (0.8)^80 = 0.09930
25     C(100, 25) (0.2)^25 (0.8)^75 = 0.04388
30     C(100, 30) (0.2)^30 (0.8)^70 = 0.00519

[Probability histogram of X ~ Bin(100, 0.2), centered near μ = 20.]


(b) X ≤ 10 are diabetics? X ≤ 15? X ≤ 20? X ≤ 25? X ≤ 30?

Method 1: Directly sum the exact binomial probabilities to obtain P(X ≤ x).

For instance, the cumulative probability
P(X ≤ 10) = C(100, 0)(0.2)^0(0.8)^100 + C(100, 1)(0.2)^1(0.8)^99 + C(100, 2)(0.2)^2(0.8)^98 + ... + C(100, 10)(0.2)^10(0.8)^90 = 0.00570.
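Both the individual probabilities in the table above and this cumulative sum are single commands in R:

# Exact binomial calculations for X ~ Bin(100, 0.2)
dbinom(c(10, 15, 20, 25, 30), size = 100, prob = 0.2)   # P(X = x) for the tabled values
pbinom(10, size = 100, prob = 0.2)                      # P(X <= 10) = 0.00570 (Method 1)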
Method 2: Despite the skew, X ~ N(μ, σ), approximately (a consequence of the Central Limit Theorem, §5.2), with mean μ = nπ and standard deviation σ = √(nπ(1 − π)). Hence,

Z = (X − μ)/σ ~ N(0, 1)    becomes    Z = (X − nπ)/√(nπ(1 − π)) ~ N(0, 1).

In this example,
μ = nπ = (100)(0.2) = 20, and
σ = √(nπ(1 − π)) = √(100(0.2)(0.8)) = 4.

So, approximately, X ~ N(20, 4); thus Z = (X − 20)/4 ~ N(0, 1).

For instance, P(X ≤ 10) ≈ P(Z ≤ (10 − 20)/4) = P(Z ≤ −2.5) = 0.00621.

The following table compares the two methods for finding P(X ≤ x).

x
Binomial
(exact)
Normal
(approximation)
Normal
(with correction)
10 0.00570 0.00621 0.00877
15 0.12851 0.10565 0.13029
20 0.55946 0.50000 0.54974
25 0.91252 0.89435 0.91543
30 0.99394 0.99379 0.99567

Comment: The normal approximation to the binomial generally works well, provided nπ ≥ 15 and n(1 − π) ≥ 15. A modification exists, which adjusts for the difference between the discrete and continuous distributions:

Z = (X − nπ ± 0.5) / √(nπ(1 − π)) ~ N(0, 1)

where the continuity correction factor is equal to +0.5 for P(X ≤ x), and −0.5 for P(X ≥ x). In this example, the corrected formula becomes

Z = (X − 20 ± 0.5)/4 ~ N(0, 1).
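A short R sketch reproducing the three columns of the comparison table, with and without the continuity correction:

# Exact binomial vs. normal approximation for P(X <= x)
x  <- c(10, 15, 20, 25, 30)
n  <- 100; p <- 0.2
mu <- n * p; sigma <- sqrt(n * p * (1 - p))    # mu = 20, sigma = 4
exact     <- pbinom(x, n, p)                   # exact binomial
approx    <- pnorm((x - mu) / sigma)           # normal approximation
corrected <- pnorm((x - mu + 0.5) / sigma)     # with continuity correction (+0.5 for "<=")
round(cbind(x, exact, approx, corrected), 5)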
Exercise: Recall the preceding section, where a spontaneous medical condition affects 1% (i.e., π = 0.01) of the population, and X = number of affected individuals in a random sample of n = 300. Previously, we calculated the probability P(X = x) for x = 0, 1, ..., 300. We now ask for the more meaningful cumulative probability P(X ≤ x), for x = 0, 1, 2, 3, 4, ... Rather than summing the exact binomial (or the approximate Poisson) probabilities as in Method 1 above, adopt the technique in Method 2, both with continuity correction and without. Compare these values with the exact binomial sums.


A Word about Probability Zero Events
(Much Ado About Nothing?)

Exactly what does it mean to say that an event E has zero probability of occurrence, i.e., P(E) = 0? A common, informal interpretation of this statement is that the event "cannot happen" and, in many cases, this is indeed true. For example, if X = "Sum of two dice," then "X = −4," "X = 5.7," and "X = 13" all have probability zero because they are impossible outcomes of this experiment, i.e., they are not in the sample space {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}.

However, in a formal mathematical sense, this interpretation is too restrictive. For
example, consider the following scenario: Suppose that k people participate in a
lottery; each individual holds one ticket with a unique integer from the sample space
{1, 2, 3, , k}. The winner is determined by a computer that randomly selects one of
these k integers with equal likelihood. Hence, the probability that a randomly selected
individual wins is equal to 1/k. The larger the number k of participants, the smaller the
probability 1/k that any particular person will win. Now, for the sake of argument,
suppose that there is an infinite number of participants; a computer randomly selects
one integer from the sample space {1, 2, 3, }. The probability that a randomly
selected individual wins is therefore less than 1/k for any k, i.e., arbitrarily small,
hence =0.* But by design, someone must win the lottery, so probability zero does
not necessarily translate into the event cannot happen. So what does it mean?

Recall that the formal, classical definition of the probability P(E) of any event E is the mathematical limiting value of the ratio #(E occurs) / #trials, as #trials → ∞. That is, the
fraction of the number of times that the event occurs to the total number of
experimental trials, as the experiment is repeated indefinitely. If, in principle, this
ratio becomes arbitrarily small after sufficiently many trials, then such an ever-
increasingly rare event E is formally identified with having probability zero (such as,
perhaps, the random toss of a coin under ordinary conditions resulting in it landing on
edge, rather than on heads or tails).

* Similarly, any event consisting of a finite subset of an infinite sample space of possible
outcomes (such as the event of randomly selecting a single particular value from a
continuous interval), has a mathematical probability of zero.
Classical Continuous Probability Densities
(The t and F distributions will be handled separately.)

Uniform:  f(x) = 1/(b − a),  a ≤ x ≤ b.  Consequently, F(x) = (x − a)/(b − a).

Normal:  For σ > 0,  f(x) = 1/(σ√(2π)) · e^(−½((x − μ)/σ)²),  −∞ < x < +∞.

Log-Normal:  For σ > 0,  f(x) = 1/(σ√(2π)) · (1/x) · e^(−½((ln x − μ)/σ)²),  x ≥ 0.

Gamma:  For α > 0, β > 0,  f(x) = 1/(β^α Γ(α)) · x^(α−1) e^(−x/β),  x ≥ 0.

Chi-Squared:  For ν = 1, 2, ...,  f(x) = 1/(2^(ν/2) Γ(ν/2)) · x^(ν/2 − 1) e^(−x/2),  x ≥ 0.

Exponential:  f(x) = (1/β) e^(−x/β),  x ≥ 0.  Thus, F(x) = 1 − e^(−x/β).

Weibull:  For α > 0, β > 0,  f(x) = (α/β) x^(α−1) e^(−x^α/β),  x ≥ 0.  Thus, F(x) = 1 − e^(−x^α/β).

Beta:  For α > 0, β > 0,  f(x) = 1/B(α, β) · x^(α−1) (1 − x)^(β−1),  0 ≤ x ≤ 1.

Notes on the Gamma and Beta Functions

Def:  Γ(α) = ∫_0^∞ x^(α−1) e^(−x) dx
Thm:  Γ(α) = (α − 1) Γ(α − 1); therefore, Γ(α) = (α − 1)!, if α = 1, 2, 3, ...
Thm:  Γ(1/2) = √π
Def:  B(α, β) = ∫_0^1 x^(α−1) (1 − x)^(β−1) dx
Thm:  B(α, β) = Γ(α) Γ(β) / Γ(α + β)
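Each density in this chart has a built-in family of R functions, with prefixes d (density), p (cdf), q (quantiles), and r (random generation). Note, however, that R's parameterizations do not always match the notation above; for instance, dgamma() and dweibull() use shape/rate or shape/scale arguments, so the correspondence with (α, β) here should be checked (e.g., ?dgamma) before use. A brief illustrative sketch:

# Built-in densities (argument conventions may differ from the chart above)
dunif(0.3, min = 0, max = 1)           # Uniform
dnorm(1.2, mean = 0, sd = 1)           # Normal
dlnorm(2, meanlog = 0, sdlog = 1)      # Log-Normal
dgamma(2, shape = 3, scale = 1.5)      # Gamma
dchisq(2, df = 4)                      # Chi-Squared
dexp(2, rate = 0.5)                    # Exponential (rate = 1/beta)
dweibull(2, shape = 1.5, scale = 2)    # Weibull
dbeta(0.4, shape1 = 2, shape2 = 3)     # Beta
gamma(0.5)^2                           # Gamma function: Gamma(1/2)^2 = pi
beta(2, 3)                             # Beta function: B(2, 3) = 1/12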
4.3 Summary Chart

POPULATION RANDOM VARIABLES (NUMERICAL)

Discrete
• Type: All distinct population values can be ordered and individually listed: {x₁, x₂, x₃, ...}; the x-values have gaps.
• Examples: X = Sum of two dice (2, 3, 4, ..., 12); X = Shoe size (..., 8, 8½, 9, 9½, ...); X = # Successes in n Bernoulli trials (0, 1, 2, ..., n) ~ Binomial distribution.
• Table: Probability Table, listing each value x with its probability mass function value f(x) ≥ 0; the values f(x₁), f(x₂), ... sum to 1.
• Graphical displays: Probability Histogram (total area = 1); Cumulative Distribution F(x) = P(X ≤ x) = Σ_{x_i ≤ x} f(x_i), a step function rising from 0 to 1.
• Characteristics (parameters): Mean μ = Σ_{all x} x f(x); Variance σ² = Σ_{all x} (x − μ)² f(x); Probability P(X = c) = f(c); Area from a to b: P(a ≤ X ≤ b) = Σ_{a ≤ x ≤ b} f(x).

Continuous
• Type: Population is interval-valued, thus all of its values cannot be so listed; the x-values run along a continuous scale of real numbers.
• Examples: X = pH, Length, Area, Volume, Mass, Temp, etc.; X ~ Normal distribution.
• Table: None*
• Graphical displays: Density Curve, with density function f(x) ≥ 0 and total area = 1 under its graph; Cumulative Distribution F(x) = P(X ≤ x) = ∫_{−∞}^x f(t) dt, rising continuously from 0 to 1.
• Characteristics (parameters): Mean μ = ∫_{−∞}^{+∞} x f(x) dx; Variance σ² = ∫_{−∞}^{+∞} (x − μ)² f(x) dx; Probability P(X = c) = 0; Area from a to b: P(a ≤ X ≤ b) = ∫_a^b f(x) dx.
* If X is a discrete random variable, then for any value x in the population, f(x) corresponds to the probability that x occurs, i.e., P(X = x). However, if X is a continuous population variable, this is false, since P(X = x) = 0. In this case, f(x) is its density function, and probability corresponds to the area under its graph up to x. Formally, this is defined in terms of the cumulative distribution F(x) = P(X ≤ x), which rises continuously and monotonically from 0 to 1, as x increases. It is the values of this function that are often tabulated for selected (i.e., discretized) values of x, as in the case of the standard normal distribution. F(x) is defined the same way for discrete variables, but it is only piecewise continuous, i.e., its graph is a "step" or "staircase" function. Similarly, in a random sample, f(x) measures the relative frequency of x, and the cumulative distribution is defined the same way, F̂(x) = P̂(X ≤ x), where P̂ denotes the sample proportion. But since it is data-based, F̂(x) is known as the empirical distribution, and likewise has a stepwise graph from 0 to 1.

RANDOM SAMPLE
• Data: Either n observed data values {x₁, x₂, ..., xₙ}, selected from either type of population above, individually ordered and listed; or, if some values occur multiple times, then list only the k distinct values, together with each of their corresponding frequencies {f₁, f₂, ..., f_k}, where f₁ + f₂ + ... + f_k = n.
• Table: Relative Frequency Table, listing the distinct values x₁, x₂, ..., x_k with their relative frequencies f(x₁), f(x₂), ..., f(x_k) ≥ 0, which sum to 1.
• Graphical displays: Density Histogram (total area = 1); Empirical Distribution F̂(x) = P̂(X ≤ x) = Σ_{x_i ≤ x} f(x_i), a stepwise graph rising from 0 to 1.
• Statistics: Mean x̄ = Σ_{all x} x f(x); Variance s² = [n/(n − 1)] Σ_{all x} (x − x̄)² f(x); Proportion P̂(X = c) = f(c); Area from a to b: P̂(a ≤ X ≤ b) = Σ_{a ≤ x ≤ b} f(x).
4.4 Problems

1. Patient noncompliance is one of many potential sources of bias in medical studies. Consider a
study where patients are asked to take 2 tablets of a certain medication in the morning, and 2
tablets at bedtime. Suppose however, that patients do not always fully comply and take both
tablets at both times; it can also occur that only 1 tablet, or even none, are taken at either of these
times.

(a) Explicitly construct the sample space S of all possible daily outcomes for a randomly selected
patient.

(b) Explicitly list the outcomes in the event that a patient takes at least one tablet at both times,
and calculate its probability, assuming that the outcomes are equally likely.

(c) Construct a probability table and corresponding probability histogram for the random variable
X =the daily total number of tablets taken by a random patient.

(d) Calculate the daily mean number of tablets taken.

(e) Suppose that the outcomes are not equally likely, but vary as follows:

#tablets AM probability PM probability
0 0.1 0.2
1 0.3 0.3
2 0.6 0.5

Rework parts (b)-(d) using these probabilities. Assume independence between AM and PM.


2. A statistician's teenage daughter withdraws a certain amount of money X from an ATM every so
often, using a method that is unknown to him: she randomly spins a circular wheel that is equally
divided among four regions, each containing a specific dollar amount, as shown.

Bank statements reveal that over the past n =80 ATM transactions, $10 was withdrawn thirteen
times, $20 sixteen times, $30 nineteen times, and $40 thirty-two times. For this sample, construct
a relative frequency table, and calculate the average amount x̄ withdrawn per transaction, and the variance s².

Suppose this process continues indefinitely. Construct a probability table, and calculate the expected amount withdrawn per transaction, and the variance σ². (Verify that, for this sample, s² and σ² happen to be equal.)



3. A youngster finds a broken clock, on which the hour and minute hands can be randomly spun at
the same time, independently of one another. Each hand can land in any one of the twelve equal
areas below, resulting in elementary outcomes in the form of ordered pairs (hour hand, minute
hand), e.g., (7, 11), as shown.

Let the simple events A =hour hand lands on 7 and B =minute hand lands on 11.

(a) Calculate each of the following probabilities. Show all work!

P(A and B)

P(A or B)

(b) Let the discrete random variable X =the product of the two numbers spun. List all the
elementary outcomes that belong to the event C =X =36 and calculate its probability P(C).

(c) After playing for a little while, some of the numbers fall off, creating new areas, as shown. For
example, the configuration below corresponds to the ordered pair (9, 12). Now calculate P(C).


4. An amateur game player throws darts at the dartboard shown below, with each target area worth
the number of points indicated. However, because of the players inexperience, all of the darts
hit random points that are uniformly distributed on the dartboard.

(a) Let X =points obtained per throw. What is the sample space S of this experiment?

(b) Calculate the probability of each outcome in S. (Hint: The area of a circle is πr².)

(c) What is the expected value of X, as darts are repeatedly thrown at the dartboard at random?

(d) What is the standard deviation of X?

Suppose that, if the total number of points in three independent random throws is exactly 100, the player wins a prize. With what probability does this occur? (Hint: For the random variable T = total points in three throws, calculate the probability of each ordered triple outcome (X₁, X₂, X₃) in the event "T = 100".)


5. Compare this problem with 2.5/10!

Consider the binary population variable Y = 1 with probability π, and Y = 0 with probability 1 − π (see figure).

(a) Construct a probability table for this random variable.

(b) Show that the population mean μ_Y = π.

(c) Show that the population variance σ²_Y = π(1 − π).

Note that π controls both the mean and the variance!

[Figure for Problem 4 above: dartboard with concentric target regions worth 10, 20, 30, 40, 50 points. Figure for Problem 5: binary POPULATION split between Y = 1 (probability π) and Y = 0 (probability 1 − π).]
6. SLOT MACHINE


Wheel 1 Wheel 2 Wheel 3





A casino slot machine consists of three wheels, each with images of three types of fruit: apples,
bananas, and cherries. When a player pulls the handle, the wheels spin independently of one
another, until each one stops at a random image displayed in its window, as shown above. Thus,
the sample space S of possible outcomes consists of the 27 ordered triples shown below, where
events A =Apple, B =Banana, and C =Cherries.

(a) Complete the individual tables above, and use them to construct the probability table
(including the outcomes) for the discrete random variable X =#Apples that are displayed
when the handle is pulled. Show all work. (Hint: To make calculations easier, express
probabilities as fractions reduced to lowest terms, instead of as decimals.)










Outcome Probability
A
B
C
Outcome Probability
A
B
C
Outcome Probability
A
B
C
X Outcomes Probability f(x)




(A A A), (A A B), (A A C), (A B A), (A B B), (A B C), (A C A), (A C B), (A C C)
(B A A), (B A B), (B A C), (B B A), (B B B), (B B C), (B C A), (B C B), (B C C)
(C A A), (C A B), (C A C), (C B A), (C B B), (C B C), (C C A), (C C B), (C C C)

(b) Sketch the corresponding probability histogram of X. Label all relevant features.

(c) Calculate the mean and variance
2
of X. Show all work.

(d) Similar to X =#Apples, define random variables Y =#Bananas and Z =#Cherries
displayed in one play. The player wins if all three displayed images are of the same fruit.
Using these variables, calculate the probability of a win. Show all work.

(e) Suppose it costs one dollar to play this game once. The result is that either the player loses the
dollar, or if the player wins, the slot machine pays out ten dollars in coins. If the player
continues to play this game indefinitely, should he/she expect to win money, lose money, or
neither, in the long run? If win or lose money, how much per play? Show all work.


7. Formally prove that each of the following is a valid density function. [Note: This is a rigorous
mathematical exercise.]

(a) f_Bin(x) = C(n, x) π^x (1 − π)^(n−x),   x = 0, 1, 2, ..., n

(b) f_Poisson(x) = e^(−λ) λ^x / x!,   x = 0, 1, 2, ...

(c) f_Normal(x) = 1/(σ√(2π)) · e^(−½((x − μ)/σ)²),   −∞ < x < +∞


8. Formally prove each of the following, using the appropriate expected value definitions.
[Note: As the preceding problem, this is a rigorous mathematical exercise.]


(a) If X ~ Bin(n, π), then μ = nπ and σ² = nπ(1 − π).

(b) If X ~ Poisson(λ), then μ = λ and σ² = λ.

(c) If X ~ N(μ, σ), then E[X] = μ and E[(X − μ)²] = σ².


9. For any p > 0, sketch the graph of f(x) = p x^(−p−1) for x ≥ 1 (and f(x) = 0 for x < 1), and formally show that it is a valid density function. Then show the following.

• If p > 2, then f(x) has finite mean μ and finite variance σ².

• If 1 < p ≤ 2, then f(x) has finite mean μ, but infinite (i.e., undefined) variance.

• If 0 < p ≤ 1, then f(x) has infinite (i.e., undefined) mean μ (and hence undefined variance).

[Note: As with the preceding problems, this is a rigorous mathematical exercise.]


10. This is a subtle problem that illustrates an important difference
between the normal distribution and many other distributions, the
binomial in particular. Consider a large group of populations of
males and females, such as all Wisconsin counties, and suppose that
the random variable Y =Age (years) is normally distributed in all
of them, each with some mean , and some variance
2
.
Clearly, there is no direct relationship between any and its
corresponding
2
, as we range continuously from county to county.
(In fact, it is not unreasonable to assume that although the means may
be different, the variances which, recall, are measures of spread
might all be the same (or similar) throughout the counties. This is
known as equivariance, a concept that we will revisit in Chapter 6.)

Suppose that, instead of age, we are now concerned with the different proportion of males from
one county to another, i.e., (Male) P = . If we intend to select a random sample of n =100
individuals from each county, then the random variable X =Number of males in each sample is
binomially distributed, i.e., X ~Bin(100, ), for 0 1 . Answer each of the following.
If a county has no males, compute the mean , and variance
2
.

If a county has all males, compute the mean , and variance
2
.

If a county has males and females in equal proportions, compute the mean , and variance
2
.

Sketch an accurate graph of
2
on the vertical axis, versus on the horizontal axis, for n =100
and 0 1 , as we range continuously from county to county. Conclusions?
Note: Also see related problem 4.4/5.


11. Imagine that a certain disease occurs in a large population in such a way that the probability of a
randomly selected individual having the disease remains constant at π = .008, independent of any other randomly selected individual having the disease. Suppose now that a sample of n = 500
individuals is to be randomly selected from this population. Define the discrete random variable
X =the number of diseased individuals, capable of assuming any value in the set {0, 1, 2, ,
500} for this sample.

(a) Calculate the probability distribution function f(x) = P(X =x) the probability that the
number of diseased individuals equals x for x =0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. Do these
computations two ways: first, using the Binomial Distribution and second, using the Poisson
Distribution, and arrange these values into a probability table. (For the sake of comparison,
record at least five decimal places.)
Tip: Use the functions dbinom and dpois in R.

x Binomial Poisson
0

1

2

3

4

5

6

7

8

9

10

etc. etc. etc.

(b) Using either the Binomial or Poisson Distribution, what is the mean number of diseased
individuals to be expected in the sample, and what is its probability? How does this
probability compare with the probabilities of other numbers of diseased individuals?

(c) Suppose that, after sampling n =500 individuals, you find that X =10 of them actually have
this disease. Before performing any formal statistical tests, what assumptions if any might
you suspect have been violated in this scenario? What is the estimate π̂ of the probability of disease, based on the data of this sample?
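For the Tip above, the basic syntax is as follows (shown here for a single value x = 3 only):

# Example syntax for part (a), at x = 3
dbinom(3, size = 500, prob = 0.008)    # exact Binomial probability P(X = 3)
dpois(3, lambda = 500 * 0.008)         # Poisson approximation, with mean n*pi = 4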

12. The uniform density function given in the notes has median and mean μ = 3.5, by inspection.
Calculate the variance.


13.
(a) Let f(x) = x/8 for 0 ≤ x ≤ 4, and = 0 elsewhere, as shown below left.

• Confirm that f(x) is indeed a density function.

• Determine the formula for the cumulative distribution function F(x) = P(X ≤ x), and sketch its graph. Recall that F(x) corresponds to the area under the density curve f(x) up to and including the value x, and therefore must increase monotonically and continuously from 0 to 1, as x increases.

• Using F(x), calculate the probabilities P(X ≤ 1), P(X ≥ 3), and P(1 ≤ X ≤ 3).

• Using F(x), calculate the quartiles Q₁, Q₂, and Q₃.

(b) Repeat (a) for the function f(x) = x/6 for 0 ≤ x ≤ 2, and = 1/3 for 2 < x ≤ 4 (and = 0 elsewhere), as shown below right.
























14. Define the piecewise uniform function f(x) = 1/8 for 1 ≤ x < 3, and = 1/4 for 3 ≤ x ≤ 6 (and = 0 elsewhere). Prove that this is a valid density function, sketch the cdf F(x), and find the median, mean, and variance.

15. Suppose that the continuous random variable X =age of juniors at the UW-Iwanagoeechees
campus is symmetrically distributed about its mean, but piecewise linear as illustrated, rather
than being a normally distributed bell curve.

For an individual selected at random from this population, calculate each of the following.

(a) Verify by direct computation that P(18 ≤ X ≤ 22) = 1, as it should be.
[Hint: Recall that the area of a triangle = ½ (base × height).]

(b) P(18 ≤ X < 18.5)

(c) P(18.5 < X ≤ 19)

(d) P(19.5 < X < 20.5)

(e) What symmetric interval about the mean contains exactly half the population values?
Express in terms of years and months.

16. Suppose that in a certain population of adult males, the variable X = total serum cholesterol level (mg/dL) is found to be normally distributed, with mean μ = 220 and standard deviation σ = 40.
For an individual selected at random, what is the probability that his cholesterol level is

(a) under 190? under 210? under 230? under 250?

(b) over 240? over 270? over 300? over 330?

(c) Using the R command pnorm, redo parts (a) and (b). [Type ?pnorm for syntax help.
Ex: pnorm(q=190, mean=220, sd=40), or more simply, pnorm(190, 220, 40)]

(d) over 250, given that it is over 240? [Tip: See the last question in (a), and the first in (b).]

(e) between 214 and 276?

(f) between 202 and 238?

(g) Eighty-eight percent of men have a cholesterol level below what value? Hint: First find the approximate critical value of z that satisfies P(Z ≤ +z) = 0.88, then change back to X.

(h) Using the R command qnorm, redo (g). [Type ?qnorm for syntax help.]

(i) What symmetric interval about the mean contains exactly half the population values?
Hint: First find the approximate critical value of z that satisfies P(−z ≤ Z ≤ +z) = 0.5, then change back to X.

[Figure for Problem 15: symmetric, piecewise-linear density f(x) over 18 ≤ X ≤ 22, centered at μ = 20, with marked height 1/3.]
Submit a
copy of the
output, and
clearly show
agreement
of your
answers!
Submit a
copy of the
output, and
clearly show
agreement
of your
answer!
M ~N(10, 2.5)
F ~N(16, 5)
17. A population biologist is studying a certain species of lizard, whose sexes appear alike, except
for size. It is known that in the adult male population, length M is normally distributed with mean μ_M = 10.0 cm and standard deviation σ_M = 2.5 cm, while in the adult female population, length F is normally distributed with mean μ_F = 16.0 cm and standard deviation σ_F = 5.0 cm.





























(a) Suppose that a single adult specimen of length 11 cm is captured at random, and its sex
identified as either a larger-than-average male, or a smaller-than-average female.

Calculate the probability that a randomly selected adult male is as large as, or larger
than, this specimen.

Calculate the probability that a randomly selected adult female is as small as, or smaller
than, this specimen.

Based on this information, which of these two events is more likely?

(b) Repeat part (a) for a second captured adult specimen, of length 12 cm.

(c) Repeat part (a) for a third captured adult specimen, of length 13 cm.


18. Consider again the male and female lizard populations in the previous problem.

(a) Answer the following.
Calculate the probability that the length of a randomly selected adult male falls between
the two population means, i.e., between 10 cm and 16 cm.
Calculate the probability that the length of a randomly selected adult female falls between
the two population means, i.e., between 10 cm and 16 cm.
(b) Suppose it is known that males are slightly less common than females; in particular, males
comprise 40% of the lizard population, and females 60%. Further suppose that the length
of a randomly selected adult specimen of unknown sex falls between the two population
means, i.e., between 10 cm and 16 cm.
Calculate the probability that it is a male.
Calculate the probability that it is a female.

Hint: Use Bayes Theorem.

19. Bob spends the majority of a certain evening in his favorite drinking establishment.
Eventually, he decides to spend the rest of the night at the house of one of his two friends, each
of whom lives ten blocks away in opposite directions. However, being a bit intoxicated, he
engages in a so-called random walk of n =10 blocks where, at the start of each block, he
first either turns and faces due west with probability 0.4, or independently, turns and faces due
east with probability 0.6, before continuing. Using this information, answer the following.

Hint: Let the discrete random variable X =number of east turns in n =10 blocks.
(0, 1, 2, 3, , 10)

(a) Calculate the probability that he ends up at Al's house.

(b) Calculate the probability that he ends up at Carl's house.

(c) Calculate the probability that he ends up back where he started.

(d) How far, and in which direction, from where he started is he expected to end up, on average?
(Hint: Combine the expected number of east and west turns.) With what probability does this
occur?

West East
Als
house
Carls
house

ECKS BAR

20.
(a) Let X = # Heads in n = 100 tosses of a fair coin (i.e., π = 0.5). Write, but DO NOT EVALUATE, an expression to calculate the probability P(X ≤ 45 or X ≥ 55).
(b) In R, type ?dbinom, and scroll down to Examples, where P(45 < X < 55) is computed
for X Binomial(100,0.5). Copy, paste, and run the single line of code given, and use it
to calculate the probability in (a).
(c) How does this compare with the corresponding probability on page 1.1-4?

21.
(a) How much overlap is there between the bell curves Z ~ N(0, 1) and X ~ N(2, 1)? (Take μ = 2 in the figure below.) That is, calculate the probability that a randomly selected population value is either in the upper tail of N(0, 1), or in the lower tail of N(2, 1).
Hint: Where on the horizontal axis do the two curves cross in this case?

(b) Suppose X ~ N(μ, 1) for a general μ; see figure. How close to 0 does the mean μ have to be, in order for the overlap between the two distributions to be equal to 20%? 50%? 80%?

[Figure: two unit-variance bell curves, Z ~ N(0, 1) and X ~ N(μ, 1), with overlapping tails.]

22. Consider the two following modified Cauchy distributions.

(a) Truncated Cauchy: f(x) = (2/π) · 1/(1 + x²) for −1 ≤ x ≤ +1 (and f(x) = 0 otherwise). Show that this is a valid density function, and sketch its graph. Find the cdf F(x), and sketch its graph. Find the mean and variance.

(b) One-sided Cauchy: f(x) = (2/π) · 1/(1 + x²) for x ≥ 0 (and f(x) = 0 otherwise). Show that this is a valid density function, and sketch its graph. Find the cdf F(x), and sketch its graph. Find the median. Does the mean exist?


23. Suppose that the random variable X = time-to-failure (yrs) of a standard model of a medical implant device is known to follow a uniform distribution over ten years, and therefore corresponds to the density function f₁(x) = 0.1 for 0 ≤ x ≤ 10 (and zero otherwise). A new model of the same implant device is tested, and determined to correspond to a time-to-failure density function f₂(x) = 0.009x² − 0.08x + 0.2 for 0 ≤ x ≤ 10 (and zero otherwise). See figure.

[Figure: graphs of f₁(x) and f₂(x) on [0, 10].]

(a) Verify that f₁(x) and f₂(x) are indeed legitimate density functions.

(b) Determine and graph the corresponding cumulative distribution functions F₁(x) and F₂(x).

(c) Calculate the probability that each model fails within the first five years of operation.

(d) Calculate the median failure time of each model.

(e) How do F₁(x) and F₂(x) compare? In particular, is one model always superior during the entire ten years, or is there a time in 0 < x < 10 when a switch occurs in which model outperforms the other, and if so, when (and which model) is it? Be as specific as possible.
24. Suppose that a certain random variable X follows a Poisson distribution with mean λ₁ cases, i.e., X₁ ~ Poisson(λ₁), in the first year, then independently, follows a Poisson distribution with mean λ₂ cases, i.e., X₂ ~ Poisson(λ₂), in the second year. Then it should seem intuitively correct that the sum X₁ + X₂ follows a Poisson distribution with mean λ₁ + λ₂ cases, i.e., X₁ + X₂ ~ Poisson(λ₁ + λ₂), over the entire two-year period. Formally prove that this is indeed true. (In other words, the sum of two Poisson variables is also a Poisson variable.)

25. [Note: The result of the previous problem might be useful for part (e).] Suppose the occurrence
of a rare disease in a certain population is known to follow a Poisson distribution, with an average of λ = 2.3 cases per year. In a typical year, what is the probability that

(a) no cases occur?
(b) exactly one case occurs?
(c) exactly two cases occur?
(d) three or more cases occur?
(e) Answer (a)-(d) for a typical two-year period. (Assume independence from year to year.)
(f) Use the function dpois in R to redo (a), (b), and (c), and include the output as part of your
submitted assignment, clearly showing agreement of your answers.
(g) Use the function ppois in R to redo (d), and include the output as part of your submitted
assignment, clearly showing agreement of your answer.

26.
(a) Population 1 consists of individuals whose ages are uniformly distributed from 0 to 50 years old.
What is the mean age of the population?
What proportion of the population is between 30 and 50 years old?

(b) Population 2 consists of individuals whose ages are uniformly distributed from 50 to 90 years old.
What is the mean age of the population?
What proportion of the population is between 50 and 80 years old?

(c) Suppose the two populations are combined into a single population.
What is the mean age of the population?
What proportion of the population is between 30 and 80 years old?
27. Let X be a discrete random variable on a population, with corresponding probability mass function f(x), i.e., P(X = x). Then recall that the population mean, or expectation, of X is defined as

Mean(X) = μ_X = E[X] = Σ_{all x} x f(x),

and the population variance of X is defined as

Var(X) = σ²_X = E[(X − μ_X)²] = Σ_{all x} (x − μ_X)² f(x).

(NOTE: Also recall that if X is a continuous random variable with density function f(x), all of the definitions above, as well as those that follow, can be modified simply by replacing the summation sign by an integral symbol over all population values x. For example, Mean(X) = μ_X = E[X] = ∫_{−∞}^{+∞} x f(x) dx, etc.)

Now suppose we have two such random variables X and Y, with corresponding joint distribution function f(x, y), i.e., P(X = x, Y = y). Then in addition to the individual means μ_X, μ_Y and variances σ²_X, σ²_Y above,† we can also define the population covariance between X and Y:

Cov(X, Y) = σ_XY = E[(X − μ_X)(Y − μ_Y)] = Σ_{all x} Σ_{all y} (x − μ_X)(y − μ_Y) f(x, y).
Example: A sociological study investigates a certain population of married couples, with random variables X = number of husband's former marriages (0, 1, or 2) and Y = number of wife's former marriages (0 or 1). Suppose that the joint probability table is given below.

X = # former marriages
(Husbands)

0 1 2
Y = # former
marriages
(Wives)
0 .19 .20 .01 .40
1 .01 .10 .49 .60
.20 .30 .50 1.00

For instance, the probability f(0, 0) = P(X = 0, Y = 0) = .19, i.e., neither spouse was previously married in 19% of this population of married couples. Similarly, f(2, 1) = P(X = 2, Y = 1) = .49, i.e., in 49% of this population, the husband was married twice before, and the wife once before, etc.

† The individual distribution functions f_X(x) for X, and f_Y(y) for Y, correspond to the so-called marginal distributions of the joint distribution f(x, y), as will be seen in the upcoming example.

From their joint distribution above, we can read off the marginal distributions of X and Y:

   X    f_X(x)          Y    f_Y(y)
   0    0.2             0    0.4
   1    0.3             1    0.6
   2    0.5                  1.0
        1.0
from which we can compute the corresponding population means and population variances:

μ_X = (0)(0.2) + (1)(0.3) + (2)(0.5), i.e., μ_X = 1.3

μ_Y = (0)(0.4) + (1)(0.6), i.e., μ_Y = 0.6

σ²_X = (0 − 1.3)²(0.2) + (1 − 1.3)²(0.3) + (2 − 1.3)²(0.5), i.e., σ²_X = 0.61

σ²_Y = (0 − 0.6)²(0.4) + (1 − 0.6)²(0.6), i.e., σ²_Y = 0.24.

But now, we can also compute the population covariance between X and Y, using their joint distribution:

σ_XY = (0 − 1.3)(0 − 0.6)(.19) + (1 − 1.3)(0 − 0.6)(.20) + (2 − 1.3)(0 − 0.6)(.01)
     + (0 − 1.3)(1 − 0.6)(.01) + (1 − 1.3)(1 − 0.6)(.10) + (2 − 1.3)(1 − 0.6)(.49),

i.e., σ_XY = 0.30.
(A more meaningful context for the covariance will be discussed in Chapter 7.)
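All of these quantities can be verified numerically from the joint probability table; for example, a brief R sketch (rows of the matrix indexed by y = 0, 1 and columns by x = 0, 1, 2):

# Means, variances, and covariance from the joint distribution
joint <- rbind(c(.19, .20, .01),    # row y = 0
               c(.01, .10, .49))    # row y = 1
x <- 0:2; y <- 0:1
fx <- colSums(joint); fy <- rowSums(joint)     # marginal distributions of X and Y
muX <- sum(x * fx); muY <- sum(y * fy)         # 1.3 and 0.6
varX <- sum((x - muX)^2 * fx)                  # 0.61
varY <- sum((y - muY)^2 * fy)                  # 0.24
covXY <- sum(outer(y - muY, x - muX) * joint)  # 0.30
c(muX, muY, varX, varY, covXY)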

(a) Recall that two events A and B are statistically independent if P(A ∩ B) = P(A) P(B). Therefore, in this context, two discrete random variables X and Y are statistically independent if, for all population values x and y, we have P(X = x, Y = y) = P(X = x) P(Y = y). That is, f(x, y) = f_X(x) f_Y(y), i.e., the joint probability distribution is equal to the product of the marginal distributions. However, it then follows from the covariance definition above, that

σ_XY = Σ_{all x} Σ_{all y} (x − μ_X)(y − μ_Y) f_X(x) f_Y(y) = [Σ_{all x} (x − μ_X) f_X(x)] · [Σ_{all y} (y − μ_Y) f_Y(y)] = 0,

since each of the two factors in this product is the sum of the deviations of the variable from its respective mean, hence = 0. Consequently, we have the important property that

If X and Y are statistically independent, then Cov(X, Y) = 0.

Verify that this statement is true for the joint probability table below.

X = # former marriages
(Husbands)

0 1 2
Y = # former
marriages
(Wives)
0 .08 .12 .20 .40
1 .12 .18 .30 .60
.20 .30 .50 1.00

That is, first confirm that X and Y are statistically independent, by showing that each cell
probability is equal to the product of the corresponding row marginal and column marginal
probabilities (as in Chapter 3). Then, using the previous example as a guide, compute the
covariance, and show that it is equal to zero.

(b) The converse of the statement in (a), however, is not necessarily true! For the table below,
show that Cov(X, Y) =0, but X and Y are not statistically independent.

X = # former marriages
(Husbands)

0 1 2
Y = # former
marriages
(Wives)
0 .13 .02 .25 .40
1 .07 .28 .25 .60
.20 .30 .50 1.00

28. Using the joint distribution f(x, y), we can also define the sum X + Y and difference X − Y of two discrete random variables in a natural way, as follows:

X + Y = {x + y | x ∈ X, y ∈ Y}        X − Y = {x − y | x ∈ X, y ∈ Y}

That is, the variable X + Y consists of all possible sums x + y, where x comes from the population distribution of X, and y comes from the population distribution of Y. Likewise, the variable X − Y consists of all possible differences x − y, where x comes from the population distribution of X, and y comes from the population distribution of Y. The following important statements can then be easily proved, from the algebraic properties of mathematical expectation given in the notes.
(Exercise)









Example (contd): Again consider the first joint probability table in the previous problem:


X = # former marriages
(Husbands)

0 1 2
Y = # former
marriages
(Wives)
0 .19 .20 .01 .40
1 .01 .10 .49 .60
.20 .30 .50 1.00

We are particularly interested in studying D = X Y, the difference between these two variables.
As before, we reproduce their respective marginal distributions below. In order to construct a
probability table for D, we must first list all the possible (x, y) ordered-pair outcomes in the
sample space, but use the joint probability table to calculate the corresponding probability values:

   X    f_X(x)      Y    f_Y(y)       D = X − Y    Outcomes          f(d)
   0    0.2         0    0.4          −1           (0, 1)            .01
   1    0.3         1    0.6           0           (0, 0), (1, 1)    .29 = .19 + .10
   2    0.5              1.0           1           (1, 0), (2, 1)    .69 = .20 + .49
        1.0                            2           (2, 0)            .01
                                                                     1.00
I.  (A) Mean(X + Y) = Mean(X) + Mean(Y)
    (B) Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)

II. (A) Mean(X − Y) = Mean(X) − Mean(Y)
    (B) Var(X − Y) = Var(X) + Var(Y) − 2 Cov(X, Y)

We are now able to compute the population mean and variance of the variable D:

μ_D = (−1)(.01) + (0)(.29) + (1)(.69) + (2)(.01), i.e., μ_D = 0.7

σ²_D = (−1 − 0.7)²(.01) + (0 − 0.7)²(.29) + (1 − 0.7)²(.69) + (2 − 0.7)²(.01), i.e., σ²_D = 0.25

To verify properties II(A) and II(B) above, we can use the calculations already done in the previous problem, i.e., μ_X = 1.3, μ_Y = 0.6, σ²_X = 0.61, σ²_Y = 0.24, and σ_XY = 0.30.

Mean(X − Y) = 0.7 = 1.3 − 0.6 = Mean(X) − Mean(Y)

Var(X − Y) = 0.25 = 0.61 + 0.24 − 2(0.30) = Var(X) + Var(Y) − 2 Cov(X, Y)

Using this example as a guide, verify properties II(A) and II(B) for the tables in part (a) and part
(b) of the previous problem. These properties are extremely important, and will be used in 6.2.

29. On his way to work every morning, Bob first takes the bus from his house, exits near his
workplace, and walks the remaining distance. His time spent on the bus (X) is a random
variable that follows a normal distribution, with mean μ = 20 minutes, and standard deviation σ = 2 minutes, i.e., X ~ N(20, 2). Likewise, his walking time (Y) is also a random variable that follows a normal distribution, with mean μ = 10 minutes, and standard deviation σ = 1.5 minutes, i.e., Y ~ N(10, 1.5). Find the probability that Bob arrives at his workplace in 35
minutes or less. [Hint: Total time =X + Y ~N(?, ?). Recall the General Fact on page
4.1-13, which is true for both discrete and continuous random variables.]







30. The arrival time of my usual morning bus (B) is normally distributed, with a mean ETA at 8:00
AM, and a standard deviation of 4 minutes. My arrival time (A) at the bus stop is also normally
distributed, with a mean ETA at 7:50 AM, and a standard deviation of 3 minutes.
(a) With what probability can I expect to catch the bus? (Hint: What is the distribution of the
random variable X = A B, and what must be true about X in the event that I catch the bus?)

(b) On average, how much earlier should I arrive, if I expect to catch the bus with 99%
probability?

X ~N(20, 2)
Y ~N(10, 1.5)

31. Discrete vs. Continuous
(a) Discrete: General. Imagine a flea starting from initial position X =0, only able to move by
making integer jumps X =1, X =2, X =3, X =4, X =5, or X =6, according to the following
probability table and corresponding probability histogram.

x f(x)
0 .05
1 .10
2 .20
3 .30
4 .20
5 .10
6 .05



Confirm that P(0 ≤ X ≤ 6) = 1, i.e., this is indeed a legitimate probability distribution.
Calculate the probability P(2 ≤ X ≤ 4).
Determine the mean μ and standard deviation σ of this distribution.

(b) Discrete: Binomial. Now imagine a flea starting from initial position X =0, only able to
move by making integer jumps X =1, X =2, , X =6, according to a binomial distribution,
with =0.5. That is, X ~Bin(6, 0.5).

x f(x)
0
1
2
3
4
5
6



Complete the probability table above, and confirm that P(0 ≤ X ≤ 6) = 1.
Calculate the probability P(2 ≤ X ≤ 4).
Determine the mean μ and standard deviation σ of this distribution.

(c) Continuous: General. Next imagine an ant starting
from initial position X = 0, able to move by
crawling to any position in the interval [0, 6],
according to the following probability density
curve.

f(x) = x/9 for 0 ≤ x < 3, and = (6 − x)/9 for 3 ≤ x ≤ 6 (and = 0 elsewhere).

Confirm that P(0 ≤ X ≤ 6) = 1, i.e., this is indeed a legitimate probability density.
Calculate the probability P(2 ≤ X ≤ 4).
What distance is the ant able to pass only 2% of the time? That is, P(X ≥ ?) = .02.



(d) Continuous: Normal. Finally, imagine an ant
starting from initial position X =0, able to move by
crawling to any position in the interval [0, 6],
according to the normal probability curve, with
mean =3, and standard deviation =1. That is,
X ~N(3, 1).






Calculate the probability P(2 ≤ X ≤ 4).
What distance is the ant able to pass only 2% of the time? That is, P(X ≥ ?) = .02.

32. Temporary place-holder during SIBS to be deleted




33.
(a) The ages of employees in a certain workplace are normally distributed. It is known that 80% of the workers are under 65 years old, and 67% are under 55 years old. What percentage of the workers are under 45 years old? (Hint: First find μ and σ by calculating the z-scores.)

(b) Suppose it is known that the wingspan X of the males of a certain bat species is normally distributed with some mean μ and standard deviation σ, i.e., X ~ N(μ, σ), while the wingspan Y of the females is normally distributed with the same mean μ, but standard deviation twice that of the males, i.e., Y ~ N(μ, 2σ). It is also known that 80% of the males have a wingspan less than a certain amount m. What percentage of the females have a wingspan less than this same amount m? (Hint: Calculate the z-scores.)




5. Sampling Distributions and the
Central Limit Theorem







5.1 Motivation

5.2 Formal Statement and Examples

5.3 Problems




5. Sampling Distributions and the Central Limit Theorem
5.1 Motivation
[Diagram: POPULATION = U.S. Adult Males, with random variable X = Height (inches). The population distribution of X has mean μ_X = 70 and standard deviation σ_X = 4; individual heights include rare short outliers (x << 70), typical values (x ≈ 70), and rare tall outliers (x >> 70). Random samples (all of size n) yield sample means whose sampling distribution of X̄ is also centered at 70, but with standard deviation σ_X̄ = 4/√n; samples consisting mostly of short outliers (x̄ << 70) or mostly of tall outliers (x̄ >> 70) are extremely rare, while samples whose mean is near the population mean (x̄ ≈ 70, with a few short and tall outliers) are extremely typical.]
5.2 Formal Statement and Examples

Sampling Distribution of a Normal Variable

Given a random variable X. Suppose that the population distribution of X is known to be normal, with mean μ and variance σ², that is, X ~ N(μ, σ). Then, for any sample size n, it follows that the sampling distribution of X̄ is normal, with mean μ and variance σ²/n, that is, X̄ ~ N(μ, σ/√n).

Comments:

• σ/√n is called the standard error of the mean, denoted SEM, or more simply, s.e.

• The corresponding Z-score transformation formula is Z = (X̄ − μ)/(σ/√n) ~ N(0, 1).

Example: Suppose that the ages X of a certain population are normally distributed, with mean μ = 27.0 years, and standard deviation σ = 12.0 years, i.e., X ~ N(27, 12).
The probability that the age of a single randomly selected individual is less than 30 years is
P(X < 30) = P(Z < (30 − 27)/12) = P(Z < 0.25) = 0.5987.

Now consider all random samples of size n = 36 taken from this population. By the above, their mean ages X̄ are also normally distributed, with mean μ = 27 yrs as before, but with standard error σ/√n = 12 yrs/√36 = 2 yrs. That is, X̄ ~ N(27, 2).

The probability that the mean age of a single sample of n = 36 randomly selected individuals is less than 30 years is
P(X̄ < 30) = P(Z < (30 − 27)/2) = P(Z < 1.5) = 0.9332.

In this population, the probability that the average age of 36 random people is under 30 years old is much greater than the probability that the age of one random person is under 30 years old.

Exercise: Compare the two probabilities of being under 24 years old.

Exercise: Compare the two probabilities of being between 24 and 30 years old.
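Both probabilities are immediate with pnorm(), using the population standard deviation for the individual and the standard error for the sample mean:

# Individual vs. sample mean, for X ~ N(27, 12) and n = 36
pnorm(30, mean = 27, sd = 12)              # P(X < 30)    = 0.5987
pnorm(30, mean = 27, sd = 12 / sqrt(36))   # P(Xbar < 30) = 0.9332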
If X ~ N(μ, σ) approximately, then X̄ ~ N(μ, σ/√n) approximately. (The larger the value of n, the better the approximation.) In fact, more is true...

IMPORTANT GENERALIZATION:

The Central Limit Theorem

Given any random variable X, discrete or continuous, with finite mean μ and finite variance σ². Then, regardless of the shape of the population distribution of X, as the sample size n gets larger, the sampling distribution of X̄ becomes increasingly closer to normal, with mean μ and variance σ²/n; that is, X̄ ~ N(μ, σ/√n), approximately. More formally,

Z = (X̄ − μ)/(σ/√n) → N(0, 1) as n → ∞.
Intuitively perhaps, there is less variation between different sample mean values, than
there is between different population values. This formal result states that, under very
general conditions, the sampling variability is usually much smaller than the population
variability, as well as gives the precise form of the limiting distribution of the statistic.

What if the population standard deviation σ is unknown? Then it can be replaced by the sample standard deviation s, provided n is large. That is, X̄ ~ N(μ, s/√n) approximately, if n ≥ 30 or so, for most distributions (... but see example below). Since the value s/√n is a sample-based estimate of the true standard error s.e., it is commonly denoted ŝ.e.

Because the mean μ_X̄ of the sampling distribution is equal to the mean μ_X of the population distribution, i.e., E[X̄] = μ_X, we say that X̄ is an unbiased estimator of μ_X. In other words, the sample mean is an unbiased estimator of the population mean. A biased sample estimator is a statistic θ̂ whose expected value either consistently overestimates or underestimates its intended population parameter θ.
Many other versions of CLT exist, related to so-called Laws of Large Numbers.


Example: Consider a(n infinite) population of paper notes, 50% of which are
blank, 30% are ten-dollar bills, and the remaining 20% are twenty-dollar bills.


Experiment 1: Randomly select a single note from the population.

Random variable: X =$ amount obtained





   x     f(x) = P(X = x)
   0     .5
   10    .3
   20    .2

Mean μ_X = E[X] = (.5)(0) + (.3)(10) + (.2)(20) = $7.00

Variance σ²_X = E[(X − μ_X)²] = (.5)(0 − 7)² + (.3)(10 − 7)² + (.2)(20 − 7)² = 61

Standard deviation σ_X = $7.81

Experiment 2: Each of n =2 people randomly selects a note, and split the winnings.

Random variable: X =$ sample mean amount obtained per person

x 0 5 10 5 10 15 10 15 20
(x
1
, x
2
) (0, 0) (0, 10) (0, 20) (10, 0) (10, 10) (10, 20) (20, 0) (20, 10) (20, 20)
Probability
.5 .5
= 0.25
.5 .3
=0.15
.5 .2
=0.10
.3 .5
=0.15
.3 .3
=0.09
.3 .2
=0.06
.2 .5
=0.10
.2 .3
=0.06
.2 .2
=0.04








x


f ( x ) = P( X = x )
0 .25
5 .30 =.15 +.15
10 .29 =.10 +.09 +.10
15 .12 =.06 +.06
20 .04



Mean μ_X̄ = (.25)(0) + (.30)(5) + (.29)(10) + (.12)(15) + (.04)(20) = $7.00 = μ_X !!

Variance σ²_X̄ = (.25)(0 − 7)² + (.30)(5 − 7)² + (.29)(10 − 7)² + (.12)(15 − 7)² + (.04)(20 − 7)²
             = 30.5 = 61/2 = σ²_X / n !!

Standard deviation σ_X̄ = $5.52 = σ_X / √n !!

Experiment 3: Each of n =3 people randomly selects a note, and split the winnings.

Random variable: X =$ sample mean amount obtained per person

x 0 3.33 6.67 3.33 6.67 10 6.67 10 13.33
(x
1
, x
2
, x
3
) (0, 0, 0) (0, 0, 10) (0, 0, 20) (0, 10, 0) (0, 10, 10) (0, 10, 20) (0, 20, 0) (0, 20, 10) (0, 20, 20)
Probability
.5 .5 .5
=0.125
.5 .5 .3
=0.075
.5 .5 .2
=0.050
.5 .3 .5
=0.075
.5 .3 .3
=0.045
.5 .3 .2
=0.030
.5 .2 .5
=0.050
.5 .2 .3
=0.030
.5 .2 .2
=0.020

3.33 6.67 10 6.67 10 13.33 10 13.33 16.67
(10, 0, 0) (10, 0, 10) (10, 0, 20) (10, 10, 0) (10, 10, 10) (10, 10, 20) (10, 20, 0) (10, 20, 10) (10, 20, 20)
.3 .5 .5
=0.075
.3 .5 .3
=0.045
.3 .5 .2
=0.030
.3 .3 .5
=0.045
.3 .3 .3
=0.027
.3 .3 .2
=0.018
.3 .2 .5
=0.030
.3 .2 .3
=0.018
.3 .2 .2
=0.012


6.67 10 13.33 10 13.33 16.67 13.33 16.67 20
(20, 0, 0) (20, 0, 10) (20, 0, 20) (20, 10, 0) (20, 10, 10) (20, 10, 20) (20, 20, 0) (20, 20, 10) (20, 20, 20)
.2 .5 .5
=0.050
.2 .5 .3
=0.030
.2 .5 .2
=0.020
.2 .3 .5
=0.030
.2 .3 .3
=0.018
.2 .3 .2
=0.012
.2 .2 .5
=0.020
.2 .2 .3
=0.012
.2 .2 .2
=0.008






x


f ( x ) = P( X = x )
0.00 .125
3.33
.225 =.075 +.075 +.075
6.67
.285 =.050 +.045 +.050 +
.045 +.045 +.050
10.00
.207 =.030 +.030 +.030 +.027
+.030 +.030 +.030
13.33
.114 =.020 +.018 +.018 +
.020 +.018 +.020
16.67
.036 =.012 +.012 +.012
20.00 .008

Mean μ_X̄ = Exercise = $7.00 = μ_X !!!

Variance σ²_X̄ = Exercise = 20.333 = 61/3 = σ²_X / n !!!

Standard deviation σ_X̄ = $4.51 = σ_X / √n !!!



The tendency toward a normal distribution becomes stronger as the sample size
n gets larger, despite the mild skew in the original population values. This is
an empirical consequence of the Central Limit Theorem.
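This tendency can be seen with a short simulation in the style of the R code in section 5.3: draw many random samples of size n from the $0 / $10 / $20 note population above, and examine the histogram, mean, and variance of the resulting sample means. (The sample size n and the number of replications below are arbitrary illustrative choices.)

# Simulate the sampling distribution of the mean for the paper-note population
notes <- c(0, 10, 20); probs <- c(.5, .3, .2)
n <- 30                                    # try n = 2, 3, 30, ...
xbars <- replicate(10000, mean(sample(notes, n, replace = TRUE, prob = probs)))
hist(xbars, freq = FALSE, main = paste("Sample means, n =", n))
c(mean(xbars), var(xbars), 61 / n)         # compare var(xbars) with sigma^2 / n = 61/n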

For most such distributions, n 30 or so is sufficient for a reasonable
normal approximation to the sampling distribution. In fact, if the
distribution is symmetric, then convergence to a bell curve can often be
seen for much lower n, say only n = 5 or 6. Recall also, from the first
result in this section, that if the population is normally distributed (with
known ), then so will be the sampling distribution, for any n.

BUT BEWARE....

However, if the population distribution of X is highly skewed, then the sampling
distribution of X can be highly skewed as well (especially if n is not very large),
i.e., relying on CLT can be risky! (Although, sometimes using a transformation, such as ln(X) or √X, can restore a bell shape to the values. Later...)

Example: The two graphs on the bottom of this page are simulated sampling
distributions for the highly skewed population shown below. Both are density
histograms based on the means of 1000 random samples; the first corresponds to
samples of size n =30, the second to n =100. Note that skew is still present!




Population Distribution

5.3 Problems

1. Computer Simulation of Distributions

(a) In Problem 4.4/8(c), it was formally proved that if Z follows a standard normal
distribution i.e., has density function
2
/ 2
1
( )
2
z
z e


= then its expected value
is given by the mean = 0. A practical understanding of this interpretation can be
achieved via empirical computer simulation. For concreteness, suppose the random
variable Z = Temperature (C) ~ N(0, 1). Let us consider a single sample of n = 400
randomly generated z-value temperature measurements from a frozen lake, and calculate
its mean temperature z , via the following R code.

# Generate and display one random sample.
sample <- rnorm(400)
sort(sample)

Upon inspection, it should be apparent that there is some variation among these z-values.

# Compare density histogram of sample against
# population distribution Z ~ N(0, 1).
hist(sample, freq = F)
curve(dnorm(x), lwd = 2, col = "darkgreen", add = T)

# Calculate and display sample mean.
mean(sample)

This sample mean z̄ should be fairly close to the actual expected value in the population, μ = 0 (likewise, sd(sample) should be fairly close to σ = 1), but it is only generated from a single sample. To obtain an even better estimate of μ, consider, say, 500 samples, each containing n = 400 randomly generated z-values. Then average each sample to find its mean temperature, and obtain {z̄₁, z̄₂, z̄₃, ..., z̄₅₀₀}.

# Generate and display 500 random sample means.
zbars <- NULL
for (s in 1:500) {sample <- rnorm(400)
zbars <- c(zbars, mean(sample))}
sort(zbars)

Upon inspection, it should be apparent that there is little variation among these z̄-values.

# Compare density histogram of sample means against
# sampling distribution Z-bar ~ N(0, 0.05).
hist(zbars, freq = F)
curve(dnorm(x, 0, 0.05), lwd = 2, col = "darkgreen", add = T)

# Calculate and display mean of the sample means.
mean(zbars)

This value should be extremely close to the mean μ = 0, because there is much less variation about μ in the sampling distribution than in the population distribution. (In fact, via the Central Limit Theorem, the standard deviation is now only σ/√n = 1/√400 = 0.05. Check this value against sd(zbars).)
Ismor Fischer, 9/21/2014 5.3-2

(b) Contrast the preceding example with the following. A random variable X is said to follow a standard Cauchy (pronounced "ko-shee") distribution if it has the density function

	f_Cauchy(x) = (1/π) · 1/(1 + x²),  for −∞ < x < +∞,

as illustrated.

First, as in Problem 4.4/7, formally prove that this is indeed a valid density function.

However, as in Problem 4.4/8, formally prove, using the appropriate expected value definition, that the mean μ in fact does not exist!

Informally, there are too many outliers in both tails to allow convergence to a single mean value μ. To obtain a better appreciation of this subtle point, we once again rely on computer simulation.

# Generate and display one random sample.
sample <- rcauchy(400)
sort(sample)

Upon inspection, it should be apparent that there is much variation among these x-values.

# Compare density histogram of sample against
# population distribution X ~ Cauchy.
hist(sample, freq = F)
curve(dcauchy(x), lwd = 2, col = "darkgreen", add = T)

# Calculate and display sample mean.
mean(sample)

This sample mean x̄ is not necessarily close to an expected value in the population, nor are the means {x̄₁, x̄₂, x̄₃, ..., x̄₅₀₀} of even 500 random samples:

# Generate and display 500 random sample means.
xbars <- NULL
for (s in 1:500) {sample <- rcauchy(400)
xbars <- c(xbars, mean(sample))}
sort(xbars)

Upon inspection, it should be apparent that there is still much variation among these x̄-values.

# Compare density histogram of sample means against
# sampling distribution X-bar ~ Cauchy.
hist(xbars, freq = F)
curve(dcauchy(x), lwd = 2, col = "darkgreen", add = T)

# Calculate and display mean of the sample means.
mean(xbars)

Indeed, it can be shown that X̄ itself follows a Cauchy distribution as well, i.e., the Central Limit Theorem fails! Gathering more data does not yield convergence to a mean μ.

Ismor Fischer, 9/21/2014 5.3-3

2. For which functions in Problem 4.4/9 does the Central Limit Theorem hold / fail?

3. Refer to Problem 4.4/16.

(a) Suppose that a random sample of n = 36 males is to be selected from this population, and the sample mean cholesterol level calculated. As in part (f), what is the probability that this sample mean value is between 202 and 238?

(b) How large a sample size n is necessary to guarantee that 80% of the sample mean values are within 5 mg/dL of the mean of their distribution? (Hint: First find the value of z that satisfies P(−z ≤ Z ≤ +z) = 0.8, then change back to X̄, and solve for n.)

4. Suppose that each of the four experiments in Problem 4.4/31 is to be performed n = 9 times, and the nine resulting distances averaged. Estimate the probability P(2 ≤ X̄ ≤ 4) for each of (a), (b), (c), and (d). [Note: Use the Central Limit Theorem, and the fact that, for (c), the mean and variance are μ = 3 and σ² = 1.5, respectively.]

5. Bob suddenly remembers that today is Valentines Day, and rushes into a nearby florist to
buy his girlfriend some (overpriced) flowers. There he finds a large urn containing a
population of differently colored roses in roughly equal numbers, but with different prices:
yellow roses cost $1 each, pink roses cost $2 each, and red roses cost $6 each. As he is in
a hurry, he simply selects a dozen roses at random, and brings them up to the counter.

(a) Lowest and highest costs = ? How much money can Bob expect to pay, on average?

(b) What is the approximate probability that he will have to pay no more than $45?
Assume the Central Limit Theorem holds.

(c) Simulate this in R: From many random samples (each with a dozen values) selected
from a population of the prices listed above, calculate the proportion whose totals are
no more than $45. How does this compare with your answer in (b)?
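For part (c), a minimal simulation sketch along these lines could be used (the prices and sample size are those given in the problem; the number of simulated samples is an arbitrary choice):

	# Population of rose prices (yellow $1, pink $2, red $6), in roughly equal numbers
	prices <- c(1, 2, 6)

	# Simulate many dozens of randomly selected roses and record each total cost
	totals <- replicate(10000, sum(sample(prices, size = 12, replace = TRUE)))

	# Proportion of samples whose total cost is no more than $45
	mean(totals <= 45)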

6. A geologist manages a large museum collection of minerals, whose mass (in grams) is
known to be normally distributed, with some mean and standard deviation . She
knows that 60% of the minerals have mass less than a certain amount m, and needs to
select a random sample of n = 16 specimens for an experiment. With what probability
will their average mass be less than the same amount m? (Hint: Calculate the z-scores.)

7. Refer to Prob 4.4/5. Here, we sketch how formally applying the Central Limit Theorem to a binary variable yields the normal approximation to the binomial distribution (section 4.2). First, define the binary variable

	Y = 1, with probability π;  Y = 0, with probability 1 − π,

and the discrete variable X = #(Y = 1) in a random sample of n Bernoulli trials, so that X ~ Bin(n, π).

(a) Using the results of Problem 4.4/5 for μ_Y and σ²_Y, apply the Central Limit Theorem to the variable Ȳ.

(b) Why is it true that X/n = Ȳ? [Hint: Why is #(Y = 1) the same as the sum of the Y values?] Use this fact along with (a) to conclude that, indeed, X ≈ N(μ_X, σ_X).*

* Recall what the mean μ_X and standard deviation σ_X of the Binomial distribution are.
Ismor Fischer, 9/21/2014 5.3-4

8. Imagine performing the following experiment in principle. We are conducting a
socioeconomic survey of an arbitrarily large population of households, each of which
owns a certain number of cars X = 0, 1, 2, or 3, as illustrated below. For simplicity, let
us assume that the proportions of these four types of household are all equal (although
this restriction can be relaxed).


Select n = 1 household at random from this population, and record its corresponding
value X = 0, 1, 2, or 3. By the equal likelihood assumption above, each of these four
elementary outcomes has the same probability of being selected (1/4), therefore the
resulting uniform distribution of population values is given by:

x	f(x) = P(X = x)
0	1/4
1	1/4
2	1/4
3	1/4
	Total = 1

From this, construct the probability histogram of the population distribution of X
values, on the graph paper below. Remember that the total area of a probability
histogram = 1!

Next, draw an ordered random sample of n = 2 households, and compute the mean number of cars X̄. (For example, if the first household has 2 cars, and the second household has 3 cars, then the mean for this sample is 2.5 cars.) There are 4² = 16 possible samples of size n = 2; they are listed below. For each such sample, calculate and record its corresponding mean X̄; the first two have been done for you. As above, construct the corresponding probability table and probability histogram of this sampling distribution of X̄ values, on the graph paper below. Remember that the total area of a probability histogram = 1; this fact must be reflected in your graph! Repeat this process for the 4³ = 64 samples of size n = 3, and answer the following questions. (An R sketch for enumerating these samples appears after the illustration below.)
[Illustration: an arbitrarily large population of households, each owning X = 0, 1, 2, or 3 cars, in roughly equal proportions]
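As a computational check on the by-hand enumeration requested above, here is a minimal R sketch (the population values 0, 1, 2, 3 with equal probability are those of the problem):

	# Population values (number of cars per household), all equally likely
	cars <- 0:3

	# All 16 ordered samples of size n = 2, and the sampling distribution of the mean
	samples2 <- expand.grid(first = cars, second = cars)
	table(rowMeans(samples2)) / nrow(samples2)

	# All 64 ordered samples of size n = 3
	samples3 <- expand.grid(first = cars, second = cars, third = cars)
	table(rowMeans(samples3)) / nrow(samples3)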
Ismor Fischer, 9/21/2014 5.3-5
[Marginal prompts: How do the X̄ values behave? Compare X̄ versus X. Put it together.]

1. Comparing these three distributions, what can generally be observed about their overall shapes, as the sample size n increases?

2. Using the expected value formula

	μ = Σ_all x [ x · f(x) ],

calculate the mean μ_X of the population distribution of X. Similarly, calculate the mean μ_X̄ of the sampling distribution of X̄, for n = 2. Similarly, calculate the mean μ_X̄ of the sampling distribution of X̄, for n = 3. Conclusions?

3. Using the expected value formula

	σ² = Σ_all x [ (x − μ)² · f(x) ],

calculate the variance σ²_X of the population distribution of X. Similarly, calculate the variance σ²_X̄ of the sampling distribution of X̄, for n = 2. Similarly, calculate the variance σ²_X̄ of the sampling distribution of X̄, for n = 3. Conclusions?

4. Suppose now that we have some arbitrarily large study population, and a general random variable X having an approximately symmetric distribution, with some mean μ_X and standard deviation σ_X. As you did above, imagine selecting all random samples of a moderately large, fixed size n from this population, and calculate all of their sample means X̄. Based partly on your observations in questions 1-3, answer the following.

(a) In general, how would the means X̄ of most typical random samples be expected to behave, even if some of them do contain a few outliers, especially if the size n of the samples is large? Why? Explain briefly and clearly.

(b) In general, how then would these two large collections (the set of all sample mean values X̄, and the set of all the original population values X) compare with each other, especially if the size n of the samples is large? Why? Explain briefly and clearly.

(c) What effect would this have on the overall shape, mean μ_X̄, and standard deviation σ_X̄, of the sampling distribution of X̄, as compared with the shape, mean μ_X, and standard deviation σ_X, of the population distribution of X? Why? Explain briefly and clearly.
Ismor Fischer, 9/21/2014 5.3-6
SAMPLES, n = 2

Draw:	1st	2nd	Mean X̄
1.	0	0	0.0
2.	0	1	0.5
3. 0 2
4. 0 3
5. 1 0
6. 1 1
7. 1 2
8. 1 3
9. 2 0
10. 2 1
11. 2 2
12. 2 3
13. 3 0
14. 3 1
15. 3 2
16. 3 3
SAMPLES, n = 3

Draw:	1st	2nd	3rd	Mean X̄
1.	0	0	0	0.00
2.	0	0	1	0.33
3. 0 0 2
4. 0 0 3
5. 0 1 0
6. 0 1 1
7. 0 1 2
8. 0 1 3
9. 0 2 0
10. 0 2 1
11. 0 2 2
12. 0 2 3
13. 0 3 0
14. 0 3 1
15. 0 3 2
16. 0 3 3
17. 1 0 0
18. 1 0 1
19. 1 0 2
20. 1 0 3
21. 1 1 0
22. 1 1 1
23. 1 1 2
24. 1 1 3
25. 1 2 0
26. 1 2 1
27. 1 2 2
28. 1 2 3
29. 1 3 0
30. 1 3 1
31. 1 3 2
32. 1 3 3
33. 2 0 0
34. 2 0 1
35. 2 0 2
36. 2 0 3
37. 2 1 0
38. 2 1 1
39. 2 1 2
40. 2 1 3
41. 2 2 0
42. 2 2 1
43. 2 2 2
44. 2 2 3
45. 2 3 0
46. 2 3 1
47. 2 3 2
48. 2 3 3
49. 3 0 0
50. 3 0 1
51. 3 0 2
52. 3 0 3
53. 3 1 0
54. 3 1 1
55. 3 1 2
56. 3 1 3
57. 3 2 0
58. 3 2 1
59. 3 2 2
60. 3 2 3
61. 3 3 0
62. 3 3 1
63. 3 3 2
64. 3 3 3



Ismor Fischer, 9/21/2014 5.3-7
[Graph paper for the Probability Histogram of the Population Distribution of X; vertical axis marked 1/4, 2/4, 3/4; horizontal axis marked 0, 1, 2, 3]
Ismor Fischer, 9/21/2014 5.3-8
[Graph paper for the Probability Histogram of the Sampling Distribution of X̄, n = 2; vertical axis marked 2/16, 4/16, ..., 12/16; horizontal axis marked 0, 0.5, 1, 1.5, 2, 2.5, 3]
Ismor Fischer, 9/21/2014 5.3-9
[Graph paper for the Probability Histogram of the Sampling Distribution of X̄, n = 3; vertical axis marked 8/64, 16/64, ..., 48/64; horizontal axis marked 0, 0.33, 0.67, 1, 1.33, 1.67, 2, 2.33, 2.67, 3]
6. Statistical Inference and
Hypothesis Testing







6.1 One Sample

6.1.1 Mean

6.1.2 Variance

6.1.3 Proportion

6.2 Two Samples

6.2.1 Means

6.2.2 Variances

6.2.3 Proportions

6.3 Several Samples

6.3.1 Proportions

6.3.2 Variances

6.3.3 Means

6.4 Problems




Ismor Fischer, 1/8/2014 6.1-1

6. Statistical Inference and Hypothesis Testing

6.1 One Sample

6.1.1 Mean

STUDY POPULATION = Cancer patients on new drug treatment

Random Variable: X = Survival time (months)

Assume X ~ N(μ, σ), with unknown mean μ, but known (?) σ = 6 months.

[Population Distribution of X: bell curve with spread σ = 6]  What can be said about the mean μ of this study population?

RANDOM SAMPLE, n = 64:  {x₁, x₂, x₃, x₄, x₅, ..., x₆₄}

[Sampling Distribution of X̄: bell curve centered at x̄, with spread σ/√n = 6/√64 = 0.75 mos]

The sample mean x̄ is called a point estimate of the population mean μ.
Ismor Fischer, 1/8/2014 6.1-2
[Standard normal curve: central area 0.95 between −z.025 = −1.960 and z.025 = 1.960, with 0.025 in each tail]

Objective 1: Parameter Estimation ~ Calculate an interval estimate of μ, centered at the point estimate x̄, that contains μ with a high probability, say 95%. (Hence, 1 − α = 0.95, so that α = 0.05.)

That is, for any random sample, solve for d:

	P(X̄ − d ≤ μ ≤ X̄ + d) = 0.95,

i.e., via some algebra,

	P(μ − d ≤ X̄ ≤ μ + d) = 0.95.

But recall that Z = (X̄ − μ) / (σ/√n) ~ N(0, 1). Therefore,

	P( −d/(σ/√n) ≤ Z ≤ +d/(σ/√n) ) = 0.95.

Hence,

	d/(σ/√n) = z.025  ⟹  d = z.025 · σ/√n = (1.96)(0.75 months) = 1.47 months,

the 95% margin of error. For future reference, call this equation (★).

[Figure: sampling distribution of X̄, centered at μ with standard deviation σ/√n = 0.75 mos; the interval from x̄ − d to x̄ + d is marked]
Ismor Fischer, 1/8/2014 6.1-3
[Figure: confidence intervals x̄ ± 1.47 for different samples, e.g., (24.53, 27.47) around x̄ = 26, and (25.53, 28.47) around x̄ = 27, under X̄ ~ N(μ, 0.75)]

95% Confidence Interval for μ

	( x̄ − z.025 · σ/√n ,  x̄ + z.025 · σ/√n )	← 95% Confidence Limits

where the critical value z.025 = 1.96. Therefore, the margin of error (and thus, the size of the confidence interval) remains the same, from sample to sample.

Example:

Sample	Mean x̄	95% CI
1	26.0 mos	(26 − 1.47, 26 + 1.47) = (24.53, 27.47)
2	27.0 mos	(27 − 1.47, 27 + 1.47) = (25.53, 28.47)
...	...	...
etc.

Interpretation: Based on Sample 1, the true mean μ of the new treatment population is between 24.53 and 27.47 months, with 95% confidence. Based on Sample 2, the true mean μ is between 25.53 and 28.47 months, with 95% confidence, etc. The ratio (# CIs that contain μ) / (Total # CIs) → 0.95, as more and more samples are chosen, i.e., "The probability that a random CI contains the population mean μ is equal to 0.95." In practice however, the common (but technically incorrect) interpretation is that the probability that a fixed CI (such as the ones found above) contains μ is 95%. In reality, the parameter μ is constant; once calculated, a single fixed confidence interval either contains it or not.
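As a quick check of these numbers, here is a minimal R sketch of the confidence-interval calculation (using the known σ = 6 and n = 64 from this example):

	# Known population SD, sample size, and confidence level (from the example)
	sigma <- 6; n <- 64; alpha <- 0.05

	# 95% margin of error: z_.025 * sigma / sqrt(n)
	moe <- qnorm(1 - alpha/2) * sigma / sqrt(n)   # approximately 1.47

	# 95% confidence intervals for Samples 1 and 2
	c(26 - moe, 26 + moe)   # (24.53, 27.47)
	c(27 - moe, 27 + moe)   # (25.53, 28.47)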
Ismor Fischer, 1/8/2014 6.1-4
[Standard normal curve N(0, 1): central area 1 − α between −z_{α/2} and z_{α/2}, with α/2 in each tail]

For any significance level α (and hence confidence level 1 − α), we similarly define the

	(1 − α) × 100% Confidence Interval for μ:   ( x̄ − z_{α/2} · σ/√n ,  x̄ + z_{α/2} · σ/√n )

where z_{α/2} is the critical value that divides the area under the standard normal distribution N(0, 1) as shown. Recall that for α = 0.10, 0.05, 0.01 (i.e., 1 − α = 0.90, 0.95, 0.99), the corresponding critical values are z.05 = 1.645, z.025 = 1.960, and z.005 = 2.576, respectively. The quantity z_{α/2} · σ/√n is the two-sided margin of error.

Therefore, as the significance level α decreases (i.e., as the confidence level 1 − α increases), it follows that the margin of error increases, and thus the corresponding confidence interval widens. Likewise, as the significance level α increases (i.e., as the confidence level 1 − α decreases), it follows that the margin of error decreases, and thus the corresponding confidence interval narrows.

Exercise: Why is it not realistic to ask for a 100% confidence interval (i.e., certainty)?

Exercise: Calculate the 90% and 99% confidence intervals for Samples 1 and 2 in the preceding example, and compare with the 95% confidence intervals.

[Figure: nested 90%, 95%, and 99% confidence intervals centered at x̄]
Ismor Fischer, 1/8/2014 6.1-5

We are now in a position to be able to conduct Statistical Inference on the population, via a formal process known as Hypothesis Testing.

Objective 2a: Hypothesis Testing ~ How does this new treatment compare with a control treatment? In particular, how can we use a confidence interval to decide this?

STANDARD POPULATION = Cancer patients on standard drug treatment

Technical Notes: Although this is drawn as a bell curve, we don't really care how the variable X is distributed in this population, as long as it is normally distributed in the study population of interest, an assumption we will learn how to check later, from the data. Likewise, we don't really care about the value of the standard deviation σ of this population, only σ of the study population. However, in the absence of other information, it is sometimes assumed (not altogether unreasonably) that the two are at least comparable in value. And if this is indeed a standard treatment, it has presumably been around for a while and given to many patients, during which time much data has been collected, and thus very accurate parameter estimates have been calculated. Nevertheless, for the vast majority of studies, it is still relatively uncommon that this is the case; in practice, very little if any information is known about any population standard deviation σ. In lieu of this value, σ is usually well-estimated by the sample standard deviation s with little change, if the sample is sufficiently large, but small samples present special problems. These issues will be dealt with later; for now, we will simply assume that the value of σ is known.

Random Variable: X = Survival time (months)

Suppose X is known to have mean μ = 25 months.

[Population Distribution of X: bell curve centered at 25, with spread σ = 6]  How does this compare with the mean μ of the study population?
Ismor Fischer, 1/8/2014 6.1-6

Hence, let us consider the situation where, before any sampling is done, it is actually the experimenter's intention to see if there is a statistically significant difference between the unknown mean survival time μ of the new treatment population, and the known mean survival time of 25 months of the standard treatment population. (See page 1-1!) That is, the sample data will be used to determine whether or not to reject the formal

	Null Hypothesis  H0: μ = 25
	versus the
	Alternative Hypothesis  HA: μ ≠ 25	(two-sided alternative: either μ < 25 or μ > 25)

at the α = 0.05 significance level (i.e., the 95% confidence level).

Sample 1: The 95% CI = (24.53, 27.47) does contain μ = 25. Therefore, the data support H0, and we cannot reject it at the α = .05 level. Based on this sample, the new drug does not result in a mean survival time that is significantly different from 25 months. Further study?

Sample 2: The 95% CI = (25.53, 28.47) does not contain μ = 25. Therefore, the data do not support H0, and we can reject it at the α = .05 level. Based on this sample, the new drug does result in a mean survival time that is significantly different from 25 months. A genuine treatment effect.

[Null Distribution: X̄ ~ N(25, 0.75)]

In general…

	Null Hypothesis  H0: μ = μ0	(i.e., no significant difference exists)
	versus the
	Alternative Hypothesis  HA: μ ≠ μ0	(two-sided alternative: either μ < μ0 or μ > μ0)

Decision Rule: If the (1 − α) × 100% confidence interval contains the value μ0, then the difference is not statistically significant; accept the null hypothesis at the α level of significance. If it does not contain the value μ0, then the difference is statistically significant; reject the null hypothesis in favor of the alternative at the α significance level.
Ismor Fischer, 1/8/2014 6.1-7
[Figure: Null distribution X̄ ~ N(25, 0.75), with central acceptance region for H0 (area 0.95) between 23.53 and 26.47, and two rejection regions (area 0.025 each) in the tails; Sample 1 (x̄ = 26) falls inside the acceptance region, Sample 2 (x̄ = 27) falls in the rejection region]

Objective 2b: Calculate which sample mean values x̄ will lead to rejecting or not rejecting (i.e., accepting or retaining) the null hypothesis.

From equation (★) above, and the calculated margin of error = 1.47, we have

	P(μ − 1.47 ≤ X̄ ≤ μ + 1.47) = 0.95.

Now, IF the null hypothesis H0: μ = 25 is indeed true, then substituting this value gives

	P(23.53 ≤ X̄ ≤ 26.47) = 0.95.

Interpretation: If the mean survival time x̄ of a random sample of n = 64 patients is between 23.53 and 26.47, then the difference from 25 is not statistically significant (at the α = .05 significance level), and we retain the null hypothesis. However, if x̄ is either less than 23.53, or greater than 26.47, then the difference from 25 will be statistically significant (at α = .05), and we reject the null hypothesis in favor of the alternative. More specifically, if the former, then the result is significantly lower than the standard treatment average (i.e., new treatment is detrimental!); if the latter, then the result is significantly higher than the standard treatment average (i.e., new treatment is beneficial).

In general…

	(1 − α) × 100% Acceptance Region for H0: μ = μ0:   ( μ0 − z_{α/2} · σ/√n ,  μ0 + z_{α/2} · σ/√n )

Decision Rule: If the (1 − α) × 100% acceptance region contains the value x̄, then the difference is not statistically significant; accept the null hypothesis at the α significance level. If it does not contain the value x̄, then the difference is statistically significant; reject the null hypothesis in favor of the alternative at the α significance level.
Ismor Fischer, 1/8/2014 6.1-8
[Figure: Null distribution X̄ ~ N(μ0, σ/√n), with central acceptance region for H0 (area 1 − α) and two rejection regions (area α/2 each)]

Error Rates Associated with Accepting / Rejecting a Null Hypothesis (vis-à-vis Neyman-Pearson)

If μ = μ0 (H0 true):

	Confidence Level:	P(Accept H0 | H0 true) = 1 − α
	Significance Level:	P(Reject H0 | H0 true) = α	(Type I Error)

Likewise, if μ = μ1 (H0 false):

	Type II Error:	P(Accept H0 | H0 false) = β
	Power:	P(Reject H0 | HA: μ = μ1) = 1 − β

[Figure: the null distribution X̄ ~ N(μ0, σ/√n) under H0: μ = μ0, alongside the alternative distribution X̄ ~ N(μ1, σ/√n) under HA: μ = μ1]
Ismor Fischer, 1/8/2014 6.1-9

Objective 2c: How probable is my experimental result, if the null hypothesis is true?

Consider a sample mean value x̄. Again assuming that the null hypothesis H0: μ = μ0 is indeed true, calculate the p-value of the sample = the probability that any random sample mean is this far away or farther, in the direction of the alternative hypothesis. That is, how significant is the decision about H0, at level α?

Sample 1 (x̄ = 26):
	p-value	= P(X̄ ≤ 24 or X̄ ≥ 26)
		= P(X̄ ≤ 24) + P(X̄ ≥ 26)
		= 2 P(X̄ ≥ 26)
		= 2 P(Z ≥ (26 − 25)/0.75)
		= 2 P(Z ≥ 1.333)
		= 2 × 0.0912 = 0.1824 > 0.05 = α

Sample 2 (x̄ = 27):
	p-value	= P(X̄ ≤ 23 or X̄ ≥ 27)
		= P(X̄ ≤ 23) + P(X̄ ≥ 27)
		= 2 P(X̄ ≥ 27)
		= 2 P(Z ≥ (27 − 25)/0.75)
		= 2 P(Z ≥ 2.667)
		= 2 × 0.0038 = 0.0076 < 0.05 = α

(Recall that Z = 1.96 is the α = .05 cutoff z-score!)

Decision Rule: If the p-value of the sample is greater than the significance level α, then the difference is not statistically significant; accept the null hypothesis at this level. If the p-value is less than α, then the difference is statistically significant; reject the null hypothesis in favor of the alternative at this level.

Guide to statistical significance of p-values for α = .05:

	p ≤ .001	extremely strong	(Reject H0)
	p ≈ .005	strong
	p ≈ .01	moderate
	p ≈ .05	borderline
	.10 ≤ p ≤ 1	not significant	(Accept H0)

[Figures: null distribution X̄ ~ N(25, 0.75) with the two-tailed areas 0.0912 + 0.0912 (Sample 1) and 0.0038 + 0.0038 (Sample 2) shaded, relative to the 0.025 + 0.025 rejection regions]

Test Statistic:   Z = (X̄ − μ0) / (σ/√n)  ~  N(0, 1)
Ismor Fischer, 1/8/2014 6.1-10

Summary of findings: Even though the data from both samples suggest a generally longer mean survival time among the new treatment population over the standard treatment population, the formal conclusions and interpretations are different. Based on Sample 1 patients (x̄ = 26), the difference between the mean survival time of the study population, and the mean survival time of 25 months of the standard population, is not statistically significant, and may in fact simply be due to random chance. Based on Sample 2 patients (x̄ = 27) however, the difference between the mean survival time of the study population, and the mean survival time of 25 months of the standard population, is indeed statistically significant, on the longer side. Here, the increased survival times serve as empirical evidence of a genuine, beneficial treatment effect of the new drug.

Comment: For the sake of argument, suppose that a third sample of patients is selected, and to the experimenter's surprise, the sample mean survival time is calculated to be only x̄ = 23 months. Note that the p-value of this sample is the same as Sample 2, with x̄ = 27 months, namely, 0.0076 < 0.05 = α. Therefore, as far as inference is concerned, the formal conclusion is the same, namely, reject H0: μ = 25 months. However, the practical interpretation is very different! While we do have statistical significance as before, these patients survived considerably shorter than the standard average, i.e., the treatment had an unexpected effect of decreasing survival times, rather than increasing them. (This kind of unanticipated result is more common than you might think, especially with investigational drugs, which is one reason for formal hypothesis testing, before drawing a conclusion.)























[Margin note: α = .05. If p-value < α, then reject H0; significance! ... But interpret it correctly! A higher α makes it easier to reject (less conservative); a lower α makes it harder to reject (more conservative).]
Ismor Fischer, 1/8/2014 6.1-11

Modification: Consider now the (unlikely?) situation where the experimenter knows that the new drug will not result in a mean survival time that is significantly less than 25 months, and would specifically like to determine if there is a statistically significant increase. That is, he/she formulates the following one-sided null hypothesis to be rejected, and complementary alternative:

	Null Hypothesis  H0: μ ≤ 25
	versus the
	Alternative Hypothesis  HA: μ > 25	(right-tailed alternative)

at the α = 0.05 significance level (i.e., the 95% confidence level).

In this case, the acceptance region for H0 consists of sample mean values x̄ that are less than the null-value μ0 = 25, plus the one-sided margin of error = z_α · σ/√n = z.05 · 6/√64 = (1.645)(0.75) = 1.234, hence x̄ < 26.234. Note that α replaces α/2 here! (Here, Z = 1.645 is the α = .05 cutoff z-score. Why?)

Sample 1 (x̄ = 26):	p-value = P(X̄ ≥ 26) = P(Z ≥ 1.333) = 0.0912 > 0.05 = α	(accept)

Sample 2 (x̄ = 27):	p-value = P(X̄ ≥ 27) = P(Z ≥ 2.667) = 0.0038 < 0.05 = α	(fairly strong rejection)

Note that these one-sided p-values are exactly half of their corresponding two-sided p-values found above, potentially making the null hypothesis easier to reject. However, there are subtleties that arise in one-sided tests that do not arise in two-sided tests…

[Figures: null distribution X̄ ~ N(25, 0.75) with the one-sided rejection region (area 0.05) to the right of 26.234; Sample 1 at 26 falls in the acceptance region, Sample 2 at 27 falls in the rejection region]
Ismor Fischer, 1/8/2014 6.1-12

Consider again the third sample of patients, whose sample mean is unexpectedly calculated to be only x̄ = 23 months. Unlike the previous two samples, this evidence is in strong agreement with the null hypothesis H0: μ ≤ 25 that the mean survival time is 25 months or less. This is confirmed by the p-value of the sample, whose definition (recall above) is the probability that any random sample mean is this far away or farther, in the direction of the alternative hypothesis, which, in this case, is the right-sided HA: μ > 25. Hence,

	p-value = P(X̄ ≥ 23) = P(Z ≥ −2.667) = 1 − 0.0038 = 0.9962 >> 0.05 = α,

which, as just observed informally, indicates a strong acceptance of the null hypothesis.

Exercise: What is the one-sided p-value if the sample mean x̄ = 24 mos? Conclusions?

A word of caution: One-sided tests are less conservative than two-sided tests, and should be used sparingly, especially when it is a priori unknown if the mean response is likely to be significantly larger or smaller than the null-value μ0, e.g., testing the effect of a new drug. More appropriate to use when it can be clearly assumed from the circumstances that the conclusion would only be of practical significance if μ is either higher or lower (but not both) than some tolerance or threshold level μ0, e.g., toxicity testing, where only higher levels are of concern.

[Figure: X̄ ~ N(25, 0.75) with x̄ = 23 marked; the area to its right, 0.9962, is the one-sided p-value, compared with the 0.05 rejection region beyond 26.234]

SUMMARY: To test any null hypothesis for one mean μ, via the p-value of a sample...

Step I: Draw a picture of a bell curve, centered at the null value μ0.
Step II: Calculate your sample mean x̄, and plot it on the horizontal X̄ axis.
Step III: From x̄, find the area(s) in the direction(s) of HA (<, >, or both tails), by first transforming x̄ to a z-score, and using the z-table. This is your p-value. SEE NEXT PAGE!
Step IV: Compare p with the significance level α. If <, reject H0. If >, retain H0.
Step V: Interpret your conclusion in the context of the given situation!
Ismor Fischer, 1/8/2014 6.1-13
P-VALUES MADE EASY

Def: Suppose a null hypothesis H0 about a population mean μ is to be tested, at a significance level α (= .05, usually), using a known sample mean x̄ from an experiment. The p-value of the sample is the probability that a general random sample yields a mean X̄ that differs from the hypothesized null value μ0 by an amount which is as large as or larger than the difference between our known x̄ value and μ0.

Thus, a small p-value (i.e., < α) indicates that our sample provides evidence against the null hypothesis, and we may reject it; the smaller the p-value, the stronger the rejection, and the more statistically significant the finding. A p-value > α indicates that our sample does not provide evidence against the null hypothesis, and so we may not reject it. Moreover, a large p-value (i.e., ≈ 1) indicates empirical evidence in support of the null hypothesis, and we may retain, or even accept it. Follow these simple steps:

STEP 1. From your sample mean x̄, calculate the standardized z-score = (x̄ − μ0) / (σ/√n), where σ/√n is the standard error.

STEP 2. What form is your alternative hypothesis?

	HA: μ < μ0 (1-sided, left)......... p-value = tabulated entry corresponding to z-score
		= left shaded area, whether z < 0 or z > 0

	HA: μ > μ0 (1-sided, right)...... p-value = 1 − tabulated entry corresponding to z-score
		= right shaded area, whether z < 0 or z > 0

	HA: μ ≠ μ0 (2-sided)
		If z-score is negative....... p-value = 2 × tabulated entry corresponding to z-score
			= 2 × left-tailed shaded area
		If z-score is positive........ p-value = 2 × (1 − tabulated entry corresponding to z-score)
			= 2 × right-tailed shaded area

STEP 3.
	If the p-value is less than α (= .05, usually), then REJECT NULL HYPOTHESIS:
	EXPERIMENTAL RESULT IS STATISTICALLY SIGNIFICANT AT THIS LEVEL!

	If the p-value is greater than α (= .05, usually), then RETAIN NULL HYPOTHESIS:
	EXPERIMENTAL RESULT IS NOT STATISTICALLY SIGNIFICANT AT THIS LEVEL!

STEP 4. IMPORTANT - Interpret results in context. (Note: For many, this is the hardest step of all!)

Example: Toxic levels of arsenic in drinking water? Test H0: μ ≤ 10 ppb (safe) vs. HA: μ > 10 ppb (unsafe), at α = .05. Assume X ~ N(μ, σ), with σ = 1.6 ppb. A sample of n = 64 readings that average to x̄ = 10.1 ppb would have a z-score = 0.1/0.2 = 0.5, which corresponds to a p-value = 1 − 0.69146 = 0.30854 > .05, hence not significant; toxicity has not been formally shown. (Unsafe levels are x̄ ≥ 10.33 ppb. Why?)
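A one-line check of this example in R (a sketch; the values are those given above):

	# One-sided (right-tailed) Z-test p-value for the arsenic example
	z <- (10.1 - 10) / (1.6 / sqrt(64))   # z = 0.5
	1 - pnorm(z)                          # p-value, approximately 0.31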


Ismor Fischer, 1/8/2014 6.1-14


P-VALUES MADE SUPER EASY

STATBOT 301, MODEL Z
Subject: basic calculation of p-values for Z-TEST

CALCULATE the Test Statistic from H0:   z-score = (x̄ − μ0) / (σ/√n)

Then check the direction of the alternative hypothesis:

	HA: μ < μ0  →  p-value = table entry
	HA: μ > μ0  →  p-value = 1 − table entry
	HA: μ ≠ μ0  →  if the z-score is negative: p-value = 2 × (table entry);
		if the z-score is positive: p-value = 2 × (1 − table entry)

Remember that the Z-table corresponds to the cumulative area to the left of any z-score.
Ismor Fischer, 1/8/2014 6.1-15
Power and Sample Size Calculations

Recall: X = survival time (mos) ~ N(μ, σ), with σ = 6 (given). Testing null hypothesis H0: μ = 25 (versus the 2-sided alternative HA: μ ≠ 25), at the α = .05 significance level. Also recall that, by definition, power = 1 − β = P(Reject H0 | H0 is false, i.e., μ ≠ 25). Indeed, suppose that the mean survival time of new treatment patients is actually suspected to be HA: μ = 28 mos. In this case, what is the resulting power to distinguish the difference and reject H0, using a sample of n = 64 patients (as in the previous examples)?

[Figures: the null distribution X̄ ~ N(25, 0.75), with acceptance region (23.53, 26.47) for H0: μ = 25 and rejection regions of area .025 in each tail; and the alternative distribution X̄ ~ N(28, 0.75), with the boundary 28 − 0.75 z_β marked]

These diagrams compare the null distribution for μ = 25, with the alternative distribution corresponding to μ = 28 in the rejection region of the null hypothesis. By definition, β = P(Accept H0 | HA: μ = 28), and its complement, the power to distinguish these two distributions from one another, is 1 − β = P(Reject H0 | HA: μ = 28), as shown by the gold-shaded areas below. However, the left-tail component of this area is negligible, leaving the remaining right-tail area equal to 1 − β by itself, approximately. Hence, this corresponds to the critical value z_β in the standard normal Z-distribution (see inset), which transforms back to 28 − 0.75 z_β in the X̄-distribution. Comparing this boundary value in both diagrams, we see that

	(★★)  28 − 0.75 z_β = 26.47,

and solving yields z_β = 2.04. Thus, β = 0.0207, and the associated power = 1 − β = 0.9793, or ≈ 98%. Hence, we would expect to be able to detect significance ≈ 98% of the time, using 64 patients.
Ismor Fischer, 1/8/2014 6.1-16

General Formulation:

Procurement of drug samples for testing purposes, or patient recruitment for clinical trials, can be extremely time-consuming and expensive. How to determine the minimum sample size n required to reject the null hypothesis H0: μ = μ0, in favor of an alternative value HA: μ = μ1, with a desired power 1 − β, at a specified significance level α? (And conversely, how to determine the power 1 − β for a given sample size n, as above?)

		H0 true	H0 false
Reject H0	Type I error: probability = α (significance level)	probability = 1 − β (power)
Accept H0	probability = 1 − α (confidence level)	Type II error: probability = β (1 − power)

That is, confidence level = 1 − α = P(Accept H0: μ = μ0 | H0 is true),
and power = 1 − β = P(Reject H0: μ = μ0 | HA: μ = μ1).

[Figure: the null distribution X̄ ~ N(μ0, σ/√n) and the alternative distribution X̄ ~ N(μ1, σ/√n), showing the areas 1 − α, α/2, β, and 1 − β relative to the acceptance region boundary μ0 + z_{α/2} σ/√n]
Ismor Fischer, 1/8/2014 6.1-17

[Figure: two pairs of null and alternative distributions, with means μ0 and μ1; the pair whose means are farther apart is easier to distinguish from each other, based solely on sample data, than the closer pair]

Hence (compare with (★★) above),

	μ1 − z_β · σ/√n = μ0 + z_{α/2} · σ/√n.

Solving for n yields the following.

In order to be able to detect a statistically significant difference (at level α) between the null population distribution having mean μ0, and an alternative population distribution having mean μ1, with a power of 1 − β, we require a minimum sample size of

	n = [ (z_{α/2} + z_β) / Δ ]²,

where Δ = |μ1 − μ0| / σ is the scaled difference between μ0 and μ1.

(Note: Remember that, as we defined it, z_β is always ≥ 0, and has area β to its right.)

Comments:

• This formula corresponds to a two-sided hypothesis test. For a one-sided test, simply replace α/2 by α. Recall that if α = .05, then z.025 = 1.960 and z.05 = 1.645.

• If σ is not known, then it can be replaced above by s, the sample standard deviation, provided the resulting sample size turns out to be n ≥ 30, to be consistent with the CLT. However, if the result is n < 30, then add 2 to compensate. [Modified from: Lachin, J. M. (1981), Introduction to sample size determination and power analysis for clinical trials. Controlled Clinical Trials, 2(2), 93-113.]

What affects sample size, and how? With all other values being equal…

• As power 1 − β increases, n increases; as 1 − β decreases, n decreases.

• As the difference Δ decreases, n increases; as Δ increases, n decreases.

Exercise: Also show that n increases...
	• as σ increases. [Hint: It may be useful to draw a picture, similar to above.]
	• as α decreases. [Hint: It may be useful to recall that α is the Type I Error rate, or equivalently, that 1 − α is the confidence level.]
Ismor Fischer, 1/8/2014 6.1-18

Examples: Recall that in our study, μ0 = 25 months, σ = 6 months.

Suppose we wish to detect a statistically significant difference (at level α = .05, so z.025 = 1.960) between this null distribution, and an alternative distribution having…

• μ1 = 28 months, with 90% power (1 − β = .90, β = .10, z.10 = 1.282). Then the scaled difference Δ = |28 − 25| / 6 = 0.5, and

	n = [ (1.960 + 1.282) / 0.5 ]² = 42.04,  so n ≥ 43 patients.

• μ1 = 28 months, with 95% power (1 − β = .95, β = .05, z.05 = 1.645). Then,

	n = [ (1.960 + 1.645) / 0.5 ]² = 51.98,  so n ≥ 52 patients.

• μ1 = 27 months, with 95% power (so again, z.05 = 1.645). Then Δ = |27 − 25| / 6 = 0.333, and

	n = [ (1.960 + 1.645) / 0.333 ]² = 116.96,  so n ≥ 117 patients.
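These calculations are easy to script; here is a minimal R sketch of the sample-size formula above (the function name min.n is ours, not part of any package, and the small-sample "+2" correction is ignored):

	# Minimum sample size for a two-sided Z-test with scaled difference Delta
	min.n <- function(Delta, alpha = 0.05, power = 0.90) {
	  n <- ((qnorm(1 - alpha/2) + qnorm(power)) / Delta)^2
	  ceiling(n)          # round up to the next whole patient
	}

	min.n(Delta = 0.5, power = 0.90)   # 43
	min.n(Delta = 0.5, power = 0.95)   # 52
	min.n(Delta = 1/3, power = 0.95)   # 117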


Table of Sample Sizes* for Two-Sided Tests (α = .05)

Δ	80%	85%	90%	95%	99%	(Power)
0.1	785	898	1051	1300	1838
0.125	503	575	673	832	1176
0.15	349	400	467	578	817
0.175	257	294	344	425	600
0.2	197	225	263	325	460
0.25	126	144	169	208	294
0.3	88	100	117	145	205
0.35	65	74	86	107	150
0.4	50	57	66	82	115
0.45	39	45	52	65	91
0.5	32	36	43	52	74
0.6	24	27	30	37	52
0.7	19	21	24	29	38
0.8	15	17	19	23	31
0.9	12	14	15	19	25
1.0	10	11	13	15	21

* Shaded cells indicate that 2 was added to compensate for small n.
Ismor Fischer, 1/8/2014 6.1-19
Power Curves: a visual way to relate power and sample size.

[Figure: curves of power 1 − β versus the scaled difference Δ = |μ1 − μ0| / σ (from Δ = 0 to 1.0), for sample sizes n = 10, 20, 30, 100]

Question: Why is power not equal to 0 if Δ = 0?
Ismor Fischer, 1/8/2014 6.1-20

Comments:

• Due to time and/or budget constraints for example, a study may end before the optimal sample size is reached. Given the current value of n, the corresponding power can then be determined by the graph above, or computed exactly via the following formula:

	Power = 1 − β = P(Z ≤ −z_{α/2} + Δ√n),

where the z-score −z_{α/2} + Δ√n can be +, −, or 0.

Example: As in the original study, let α = .05, Δ = |28 − 25| / 6 = 0.5, and n = 64. Then the z-score = −1.96 + 0.5·√64 = 2.04, so power = 1 − β = P(Z ≤ 2.04) = 0.9793, or ≈ 98%. The probability of committing a Type 2 error = β = 0.0207, or ≈ 2%. See page 6.1-15. (An R sketch of this calculation appears below.)

Exercise: How much power exists if the sample size is n = 25? 16? 9? 4? 1?

• Generally, a minimum of 80% power is acceptable for reporting purposes.

• Note: Larger sample size ⟹ longer study time ⟹ longer wait for results. In clinical trials and other medical studies, formal protocols exist for early study termination.

• Also, to achieve a target sample size, practical issues must be considered (e.g., parking, meals, bed space,…). Moreover, may have to recruit many more individuals due to eventual censoring (e.g., move-aways, noncompliance,…) or death. $$$$$$$ issues…

• Research proposals must have power and sample size calculations in their methods section, in order to receive institutional approval, support, and eventual journal publication.

[Figure: N(0, 1) with area 0.9793 to the left of 2.04 and 0.0207 to the right]
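A minimal R sketch of the power formula above, reproducing the ≈ 98% figure for this study (the function name power.z is ours):

	# Power of a two-sided Z-test: P(Z <= -z_{alpha/2} + Delta * sqrt(n))
	power.z <- function(Delta, n, alpha = 0.05) {
	  pnorm(-qnorm(1 - alpha/2) + Delta * sqrt(n))
	}

	power.z(Delta = 0.5, n = 64)                               # approximately 0.9793
	sapply(c(25, 16, 9, 4, 1), function(n) power.z(0.5, n))    # for the Exercise above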
Ismor Fischer, 1/8/2014 6.1-21
[Densities:  N(0, 1): φ(z) = (1/√(2π)) e^(−z²/2);
 t_{n−1}: f(t) = Γ(n/2) / [√((n−1)π) Γ((n−1)/2)] · (1 + t²/(n−1))^(−n/2).
 Note that the critical value t_{n−1, α/2} lies farther out than z_{α/2}.]

Small Samples: Student's t-distribution

Recall that, vis-à-vis the Central Limit Theorem:  X ~ N(μ, σ)  ⟹  X̄ ~ N(μ, σ/√n), for any n.

Test statistic:

• σ known:	Z = (X̄ − μ) / (σ/√n) ~ N(0, 1).	(s.e. = σ/√n)

• σ unknown, n ≥ 30:	Z = (X̄ − μ) / (s/√n) ~ N(0, 1) approximately.	(estimated s.e. = s/√n)

• σ unknown, n < 30:	T = (X̄ − μ) / (s/√n) ~ t_{n−1}.	Note: Can use for n ≥ 30 as well.

Student's t-distribution, with ν = n − 1 degrees of freedom, df = 1, 2, 3, …

(Due to William S. Gossett (1876 - 1937), Guinness Brewery, Ireland, anonymously publishing under the pseudonym "Student" in 1908.)

• df = 1 is also known as the Cauchy distribution.

• As df → ∞, it follows that T ~ t_df → Z ~ N(0, 1).
Ismor Fischer, 1/8/2014 6.1-22
Example: Again recall that in our study, the variable X = survival time was assumed to be normally distributed among cancer patients, with σ = 6 months. The null hypothesis H0: μ = 25 months was tested with a random sample of n = 64 patients; a sample mean of x̄ = 27.0 months was shown to be statistically significant (p = .0076), i.e., sufficient evidence to reject the null hypothesis, suggesting a genuine difference, at the α = .05 level.

Now suppose that σ is unknown and, like μ, must also be estimated from sample data. Further suppose that the sample size is small, say n = 25 patients, with which to test the same null hypothesis H0: μ = 25, versus the two-sided alternative HA: μ ≠ 25, at the α = .05 significance level. Imagine that a sample mean x̄ = 27.4 months, and a sample standard deviation s = 6.25 months, are obtained. The greater mean survival time appears promising. However…

	s.e. = s/√n = 6.25 mos / √25 = 1.25 months	(> s.e. = 0.75 months, previously)

	critical value = t_{24, .025} = 2.064

	Margin of Error = (2.064)(1.25 mos) = 2.58 months	(> 1.47 months, previously)

So…

• 95% Confidence Interval for μ = (27.4 − 2.58, 27.4 + 2.58) = (24.82, 29.98) months, which does contain the null value μ = 25 ⟹ Accept H0… No significance shown!

• 95% Acceptance Region for H0 = (25 − 2.58, 25 + 2.58) = (22.42, 27.58) months, which does contain the sample mean x̄ = 27.4 ⟹ Accept H0… No significance shown!

• p-value = 2 P(X̄ ≥ 27.4) = 2 P(T₂₄ ≥ (27.4 − 25)/1.25) = 2 P(T₂₄ ≥ 1.92) = 2(0.0334) = 0.0668, which is greater than α = .05 ⟹ Accept H0... No significance shown!

Why? The inability to reject is a typical consequence of small sample size, thus low power!

Also see Appendix > Statistical Inference > Mean, One Sample for more info and many more examples on this material.

[Figures: the t₂₄ distribution with critical values ±2.064 enclosing 0.95; the acceptance region (22.42, 27.58) around μ = 25, with x̄ = 27.4 inside it and two-tailed areas 0.0334 + 0.0334]
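A minimal R sketch reproducing these small-sample numbers from the summary statistics (n = 25, x̄ = 27.4, s = 6.25):

	# One-sample t-test quantities computed from summary statistics
	n <- 25; xbar <- 27.4; s <- 6.25; mu0 <- 25; alpha <- 0.05

	se    <- s / sqrt(n)                       # 1.25
	tcrit <- qt(1 - alpha/2, df = n - 1)       # 2.064
	c(xbar - tcrit * se, xbar + tcrit * se)    # 95% CI: (24.82, 29.98)

	tstat <- (xbar - mu0) / se                 # 1.92
	2 * pt(-abs(tstat), df = n - 1)            # two-sided p-value, approximately 0.067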
Ismor Fischer, 1/8/2014 6.1-23

Example: A very simplified explanation of how fMRI works

Functional Magnetic Resonance Imaging (fMRI) is one technique of visually mapping areas
of the human cerebral cortex in real time. First, a three-dimensional computer-generated
image of the brain is divided into cube-shaped voxels (i.e., volume elements analogous
to square picture elements, or pixels, in a two-dimensional image), about 2-4 mm on a
side, each voxel containing thousands of neurons. While the patient is asked to concentrate
on a specific mental task, increased cerebral blood flow releases oxygen to activated
neurons at a greater rate than to inactive ones (the so-called hemodynamic response), and
the resulting magnetic resonance signal can be detected. In one version, each voxel signal
is compared with the mean of its neighboring voxels; if there is a statistically significant
difference in the measurements, then the original voxel is assigned one of several colors,
depending on the intensity of the signal (e.g., as determined by the p-value); see figures.

Suppose the variable X = Cerebral Blood Flow (CBF) typically follows a normal
distribution with mean = 0.5 ml/g/min at baseline. Further, suppose that the n = 6
neighbors surrounding a particular voxel (i.e., front and back, left and right, top and bottom)
yields a sample mean of x = 0.767 ml/g/min, and sample standard deviation of s = 0.082
ml/g/min. Calculate the two-sided p-value of this sample (using baseline as the null
hypothesis for simplicity), and determine what color should be assigned to the central voxel,
using the scale shown.













Solution: X = Cerebral Blood Flow (CBF) is normally distributed, H0: μ = 0.5 ml/g/min

	n = 6,  x̄ = 0.767 ml/g/min,  s = 0.082 ml/g/min

As the population standard deviation σ is unknown, and the sample size n is small, the t-test on df = 6 − 1 = 5 degrees of freedom is appropriate. Using standard error estimate

	s.e. = s/√n = 0.082 ml/g/min / √6 = 0.03348 ml/g/min

yields

	p-value = 2 P(X̄ ≥ 0.767) = 2 P(T₅ ≥ (0.767 − 0.5)/0.03348) = 2 P(T₅ ≥ 7.976) = 2(.00025) = .0005.

This is strongly significant at any reasonable level α. According to the scale, the voxel should be assigned the color RED.

Color scale:
	p ≥ .05	gray
	.01 ≤ p < .05	green
	.005 ≤ p < .01	yellow
	.001 ≤ p < .005	orange
	p < .001	red
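The p-value can be checked in R:

	# Two-sided one-sample t-test p-value for the fMRI voxel example
	tstat <- (0.767 - 0.5) / (0.082 / sqrt(6))   # approximately 7.98
	2 * pt(-abs(tstat), df = 5)                  # approximately 0.0005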
Ismor Fischer, 1/8/2014 6.1-24


STATBOT 301, MODEL T
Subject: basic calculation of p-values for T-TEST

CALCULATE the Test Statistic from H0:   t-score = (x̄ − μ0) / (s/√n)

	ALTERNATIVE HYPOTHESIS
t-score	HA: μ < μ0	HA: μ ≠ μ0	HA: μ > μ0
+	1 − table entry	2 × table entry	table entry
−	table entry for |t-score|	2 × table entry for |t-score|	1 − table entry for |t-score|

Remember that the T-table corresponds to the area to the right of a positive t-score.
Ismor Fischer, 1/8/2014 6.1-25
Each of these 25
areas represents
.04 of the total.


Checks for normality ~ Is the ongoing assumption that the sample data come
from a normally-distributed population reasonable?

Quantiles: As we have already seen, 68% within 1 s.d. of mean, 95% within 2
s.d. of mean, 99.7% within 3 s.d. of mean, etc. Other percentiles can also be
checked informally, or more formally via...

Normal Scores Plot: The graph of the quantiles of the n ordered (low-to-high)
observations, versus the n known z-scores that divide the total area under N(0, 1)
equally (representing an ideal sample from the standard normal distribution), should
resemble a straight line. Highly skewed data would generate a curved plot. Also
known as a probability plot or Q-Q plot (for Quantile-Quantile), this is a popular
method.

Example: Suppose n = 24 ages (years). Calculate the .04 quantiles of the sample, and
plot them against the 24 known (i.e., theoretical) .04 quantiles of the standard
normal distribution (below).























{1.750, 1.405, 1.175, 0.994, 0.842, 0.706, 0.583, 0.468, 0.358, 0.253, 0.151, 0.050,
+0.050, +0.151, +0.253, +0.358, +0.468, +0.583, +0.706, +0.842, +0.994, +1.175, +1.405, +1.750}
Ismor Fischer, 1/8/2014 6.1-26

Sample 1:

{6, 8, 11, 12, 15, 17, 20, 20, 21, 23, 24, 24, 26, 28, 29, 30, 31, 32, 34, 37, 40, 41, 42, 45}

The Q-Q plot of this sample (see first graph, below) reveals a more or less linear trend
between the quantiles, which indicates that it is not unreasonable to assume that these
data are derived from a population whose ages are indeed normally distributed.


Sample 2:

{6, 6, 8, 8, 9, 10, 10, 10, 11, 11, 13, 16, 20, 21, 23, 28, 31, 32, 36, 38, 40, 44, 47, 50}

The Q-Q plot of this sample (see second graph, below) reveals an obvious deviation
from normality. Moreover, the general concave up nonlinearity seems to suggest that
the data are positively skewed (i.e., skewed to the right), and in fact, this is the case.
Applying statistical tests that rely on the normality assumption to data sets that are not
so distributed could very well yield erroneous results!
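A sketch of how these Q-Q plots can be produced in R, using the two samples listed above:

	# Ages from Sample 1 (approximately normal) and Sample 2 (right-skewed)
	sample1 <- c(6, 8, 11, 12, 15, 17, 20, 20, 21, 23, 24, 24,
	             26, 28, 29, 30, 31, 32, 34, 37, 40, 41, 42, 45)
	sample2 <- c(6, 6, 8, 8, 9, 10, 10, 10, 11, 11, 13, 16,
	             20, 21, 23, 28, 31, 32, 36, 38, 40, 44, 47, 50)

	# Normal scores (Q-Q) plots, with reference lines
	par(mfrow = c(1, 2))
	qqnorm(sample1, main = "Sample 1"); qqline(sample1)
	qqnorm(sample2, main = "Sample 2"); qqline(sample2)

	# A formal check, e.g., the Shapiro-Wilk test
	shapiro.test(sample1); shapiro.test(sample2)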

Formal tests for normality include:

Anderson-Darling

Shapiro-Wilk

Lilliefors (a special case of Kolmogorov-Smirnov)

Ismor Fischer, 1/8/2014 6.1-27

Remedies for non-normality ~ What can be done if the normality
assumption is violated, or difficult to verify (as in a very small sample)?

Transformations: Functions such as Y = √X or Y = ln(X) can transform a positively-skewed variable X into a normally distributed variable Y. (These functions spread out small values, and squeeze together large values. In the latter case, the original variable X is said to be log-normal.)

Exercise: Sketch separately the dotplot of X, and the dotplot of Y = ln(X) (to two decimal places), and compare. (A short R sketch for generating these dotplots follows the table below.)

X	Y = ln(X)	Frequency
1		1
2		2
3		3
4		4
5		5
6		5
7		4
8		4
9		3
10		3
11		3
12		2
13		2
14		2
15		2
16		1
17		1
18		1
19		1
20		1
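A minimal R sketch for this exercise (expanding the frequency table above into raw data, then comparing dotplots of X and ln(X)):

	# Rebuild the raw data from the frequency table above
	x.values <- 1:20
	freq     <- c(1, 2, 3, 4, 5, 5, 4, 4, 3, 3, 3, 2, 2, 2, 2, 1, 1, 1, 1, 1)
	x <- rep(x.values, times = freq)

	# Dotplots (strip charts) of X and of Y = ln(X)
	par(mfrow = c(2, 1))
	stripchart(x, method = "stack", main = "X")
	stripchart(round(log(x), 2), method = "stack", main = "Y = ln(X)")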




Nonparametric Tests: Statistical tests (on the median, rather than the mean) that are
free of any assumptions on the underlying distribution of the population random
variable. Slightly less powerful than the corresponding parametric tests, tedious to
carry out by hand, but their generality makes them very useful, especially for small
samples where normality can be difficult to verify.

Sign Test (crude), Wilcoxon Signed Rank Test (preferred)



Ismor Fischer, 1/8/2014 6.1-28

GENERAL SUMMARY

Step-by-Step Hypothesis Testing: One Sample Mean, H0: μ = μ0 vs. HA: μ ≠ μ0

Is the random variable approximately normally distributed (or mildly skewed)?

• No → Use a transformation, or a nonparametric test, e.g., Wilcoxon Signed Rank Test.

• Yes → Is σ known?

	• Yes → Use Z-test (with σ):	Z = (X̄ − μ0) / (σ/√n) ~ N(0, 1)

	• No, or don't know → Is n ≥ 30?

		• Yes → Use Z-test or t-test (with σ ≈ s):	Z = (X̄ − μ0) / (s/√n) ~ N(0, 1)	(used most often in practice)

		• No → Use t-test (with σ ≈ s):	T = (X̄ − μ0) / (s/√n) ~ t_{n−1}

CONTINUE…
Ismor Fischer, 1/8/2014 6.1-29

p-value: How do I know in which direction to move, to find the p-value? See STATBOT, page 6.1-14 (Z) and page 6.1-24 (T), or…

[Table: for a Z- or T_df-score (positive or negative), the p-value is the area accumulated to the left of the score for the 1-sided alternative HA: μ < μ0, to the right of the score for HA: μ > μ0, and in both tails for the 2-sided alternative HA: μ ≠ μ0.]

The p-value of an experiment is the probability (hence always between 0 and 1) of
obtaining a random sample with an outcome that is as, or more, extreme than the
one actually obtained, if the null hypothesis is true.
Starting from the value of the test statistic (i.e., z-score or t-score), the p-value is
computed in the direction of the alternative hypothesis (either <, >, or both), which
usually reflects the investigators belief or suspicion, if any.
If the p-value is small, then the sample data provides evidence that tends to refute
the null hypothesis; in particular, if the p-value is less than the significance level ,
then the null hypothesis can be rejected, and the result is statistically significant at
that level. However, if the p-value is greater than , then the null hypothesis is
retained; the result is not statistically significant at that level. Furthermore, if the
p-value is large (i.e., close to 1), then the sample data actually provides evidence
that tends to support the null hypothesis.
Ismor Fischer, 1/8/2014 6.1-30

6.1.2 Variance

Given:	Null Hypothesis H0: σ² = σ0²  (constant value)
	versus Alternative Hypothesis HA: σ² ≠ σ0²  (two-sided alternative: either σ² < σ0² or σ² > σ0²)

[Setup: Population Distribution ~ N(μ, σ); from a sample of size n, calculate s².]

Test statistic:

	χ² = (n − 1) s² / σ0²  ~  χ²_{n−1}

Sampling Distribution of χ²:

	Chi-Squared Distribution, with ν = n − 1 degrees of freedom, df = 1, 2, 3, …

	Density:  f_ν(x) = 1 / (2^(ν/2) Γ(ν/2)) · x^(ν/2 − 1) e^(−x/2)

[Figure: chi-squared density curves for ν = 1, 2, ..., 7]

Note that the chi-squared distribution is not symmetric, but skewed to the right. We will not pursue the details for finding an acceptance region and confidence intervals for σ² here. But this distribution will appear again, in the context of hypothesis testing for equal proportions.
Ismor Fischer, 1/8/2014 6.1-31
[Margin note: Illustration of the bell curves N(π, √(π(1 − π)/n)) for n = 100, as the proportion π ranges from 0 to 1. Rather than being fixed at a constant value, the spread (s.e.) is smallest when π is close to 0 or 1 (i.e., when "success" in the population is either very rare or very common), and is maximum when π = 0.5 (i.e., when both "success" and "failure" are equally likely); the s.e. values are .03, .04, .046, .049, .05, .049, .046, .04, .03 at π = 0.1, 0.2, ..., 0.9. Also see Problem 4.4/10. This property of nonconstant variance has further implications; see Logistic Regression in section 7.3.]

6.1.3 Proportion

Problem! The expression for the standard error involves the very parameter π upon which we are performing statistical inference. (This did not happen with inference on the mean μ, where the standard error is s.e. = σ/√n, which does not depend on μ.)

POPULATION

Binary random variable:
	Y = 1 ("Success"), with probability π
	Y = 0 ("Failure"), with probability 1 − π

Experiment: n independent trials

SAMPLE

Random Variable: X = # Successes ~ Bin(n, π)

Recall: Assuming n ≥ 30, nπ ≥ 15, and n(1 − π) ≥ 15,

	X ~ N( nπ, √(nπ(1 − π)) ), approximately	(see §4.2).

Therefore, dividing by n…

	π̂ = X/n ~ N( π, √(π(1 − π)/n) ), approximately,

where √(π(1 − π)/n) is the standard error (s.e.).
Ismor Fischer, 1/8/2014 6.1-32

Example: Refer back to the coin toss example of section 1.1, where a random sample of n = 100 independent trials is performed in order to acquire information about the probability P(Heads) = π. Suppose that X = 64 Heads are obtained. Then the sample-based point estimate of π is calculated as π̂ = X/n = 64/100 = 0.64. To improve this to an interval estimate, we can compute the…

Is the coin fair at the α = .05 level?   Null Hypothesis H0: π = 0.5  vs.  Alternative Hypothesis HA: π ≠ 0.5

95% Confidence Interval for π

	95% limits = 0.64 ± z.025 √((0.64)(0.36)/100) = 0.64 ± 1.96 (.048)	[estimated s.e. = .048]
	95% CI = (0.546, 0.734), which contains the true value of π, with 95% confidence.

	As the 95% CI does not contain the null-value π = 0.5, H0 can be rejected at the α = .05 level, i.e., the coin is not fair.

95% Acceptance Region for H0: π = 0.50

	95% limits = 0.50 ± z.025 √((0.50)(0.50)/100) = 0.50 ± 1.96 (.050)	[null s.e.₀ = .050]
	95% AR = (0.402, 0.598)

	As the 95% AR does not contain the sample proportion π̂ = 0.64, H0 can be rejected at the α = .05 level, i.e., the coin is not fair.

In general…

	(1 − α) × 100% Acceptance Region for H0: π = π0:
		( π0 − z_{α/2} √(π0(1 − π0)/n) ,  π0 + z_{α/2} √(π0(1 − π0)/n) )

	(1 − α) × 100% Confidence Interval for π:
		( π̂ − z_{α/2} √(π̂(1 − π̂)/n) ,  π̂ + z_{α/2} √(π̂(1 − π̂)/n) )
Ismor Fischer, 1/8/2014 6.1-33
[Figure: Null distribution π̂ ~ N(0.5, .05), with acceptance region (0.402, 0.598) (central area 0.95, tails 0.025 each); 0.5 is not in the 95% CI (0.546, 0.734), and π̂ = 0.64 is not in the acceptance region; the two-tailed areas 0.0026 + 0.0026 give the p-value]

p-value = 2 P(π̂ ≥ 0.64) = 2 P(Z ≥ (0.64 − 0.50)/.050) = 2 P(Z ≥ 2.8) = 2(.0026) = .0052

As p << α = .05, H0 can be strongly rejected at this level, i.e., the coin is not fair.

Test Statistic:   Z = (π̂ − π0) / √(π0(1 − π0)/n)  ~  N(0, 1)
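A sketch of this test in R, both "by hand" and via the built-in prop.test (which uses a continuity-corrected chi-squared statistic, so its p-value differs slightly from the hand calculation above):

	# Z-test for a single proportion, using the null standard error
	x <- 64; n <- 100; pi0 <- 0.5
	z <- (x/n - pi0) / sqrt(pi0 * (1 - pi0) / n)   # 2.8
	2 * pnorm(-abs(z))                             # approximately 0.0051

	# Built-in alternative (continuity-corrected chi-squared version)
	prop.test(x, n, p = 0.5)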
Ismor Fischer, 1/8/2014 6.1-34

Comments:

• A continuity correction factor of 0.5/n may be added to the numerator of the Z test statistic above, in accordance with the normal approximation to the binomial distribution; see §4.2 of these Lecture Notes. (The n in the denominator is there because we are here dealing with the proportion of successes π̂ = X/n, rather than just the number of successes X.)

Power and sample size calculations are similar to those of inference for the mean, and
will not be pursued here.


IMPORTANT
See Appendix > Statistical Inference > General Parameters and FORMULA TABLES.

and Appendix > Statistical Inference > Means and Proportions, One and Two Samples.


Ismor Fischer, 5/29/2012 6.2-1

6.2 Two Samples

6.2.1 Means

First assume that the samples are randomly selected from two populations that are independent, i.e., no relation exists between individuals of one population and the other, relative to the random variable, or any lurking or confounding variables that might have an effect on this variable.

Model: Phase III Randomized Clinical Trial (RCT)

Measuring the effect of treatment (e.g., drug) versus control (e.g., placebo) on a response variable X, to determine if there is any significant difference between them.

Then the CLT gives:

	X̄1 − X̄2 ~ N( μ1 − μ2 ,  √(σ1²/n1 + σ2²/n2) )
Comments:

• Recall from §4.1: If Y1 and Y2 are independent, then Var(Y1 − Y2) = Var(Y1) + Var(Y2).

• If n1 = n2, the samples are said to be (numerically) balanced.

• The null hypothesis H0: μ1 − μ2 = 0 can be replaced by H0: μ1 − μ2 = μ0 if necessary, in order to compare against a specific constant difference μ0 (e.g., 10 cholesterol points), with the corresponding modifications below.

• The standard error s.e. = √(σ1²/n1 + σ2²/n2) can be replaced by its estimate s.e. = √(s1²/n1 + s2²/n2), provided n1 ≥ 30, n2 ≥ 30.
[Schematic: Independent samples (vs. Dependent: paired, matched)

	Control Arm:	Assume X1 ~ N(μ1, σ1); sample of size n1;  X̄1 ~ N(μ1, σ1/√n1)
	Treatment Arm:	Assume X2 ~ N(μ2, σ2); sample of size n2;  X̄2 ~ N(μ2, σ2/√n2)

	H0: μ1 − μ2 = 0  (There is no difference in mean response between the two populations.)]
Ismor Fischer, 5/29/2012 6.2-2
Test Statistic:

	Z = [ (X̄1 − X̄2) − μ0 ] / √( s1²/n1 + s2²/n2 )  ~  N(0, 1)


Example: X = cholesterol level (mg/dL)

Test H0: μ1 − μ2 = 0 vs. HA: μ1 − μ2 ≠ 0 for significance at the α = .05 level.

	Placebo:	n1 = 80,	x̄1 = 240,	s1² = 1200
	Drug:	n2 = 60,	x̄2 = 229,	s2² = 600
	x̄1 − x̄2 = 11

	s1²/n1 = 1200/80 = 15,   s2²/n2 = 600/60 = 10

	s.e. = √( s1²/n1 + s2²/n2 ) = √25 = 5

95% Confidence Interval for μ1 − μ2

	95% limits = 11 ± (1.96)(5) = 11 ± 9.8	(margin of error)
	95% CI = (1.2, 20.8), which does not contain 0 ⟹ Reject H0. Drug works!

95% Acceptance Region for H0: μ1 − μ2 = 0

	95% limits = 0 ± (1.96)(5) = ±9.8	(margin of error)
	95% AR = (−9.8, +9.8), which does not contain 11 ⟹ Reject H0. Drug works!

p-value = 2 P(X̄1 − X̄2 ≥ 11) = 2 P(Z ≥ (11 − 0)/5) = 2 P(Z ≥ 2.2) = 2(.0139) = .0278 < .05 = α ⟹ Reject H0. Drug works!
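A sketch of this calculation in R from the summary statistics (values as given in the example):

	# Two-sample Z-test (large samples) from summary statistics
	n1 <- 80; xbar1 <- 240; s1sq <- 1200
	n2 <- 60; xbar2 <- 229; s2sq <- 600

	se   <- sqrt(s1sq/n1 + s2sq/n2)          # 5
	diff <- xbar1 - xbar2                    # 11

	c(diff - 1.96 * se, diff + 1.96 * se)    # 95% CI: (1.2, 20.8)
	2 * pnorm(-abs(diff / se))               # two-sided p-value, approximately 0.028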
	(1 − α) × 100% Acceptance Region for H0: μ1 − μ2 = μ0:
		( μ0 − z_{α/2} √(s1²/n1 + s2²/n2) ,  μ0 + z_{α/2} √(s1²/n1 + s2²/n2) )

	(1 − α) × 100% Confidence Interval for μ1 − μ2:
		( (x̄1 − x̄2) − z_{α/2} √(s1²/n1 + s2²/n2) ,  (x̄1 − x̄2) + z_{α/2} √(s1²/n1 + s2²/n2) )
Ismor Fischer, 5/29/2012 6.2-3




[Figure: Null distribution X̄1 − X̄2 ~ N(0, 5), with acceptance region (−9.8, +9.8) (central area 0.95, tails 0.025 each); 0 is not in the 95% CI (1.2, 20.8), and the observed difference 11 is not in the acceptance region; the two-tailed areas 0.0139 + 0.0139 give the p-value]

Ismor Fischer, 5/29/2012 6.2-4

Small samples: What if n
1
<30 and/or n
2
<30? Then use the t-distribution, provided

H
0
:
1
2
=
2
2
(equivariance, homoscedasticity)

Technically, this requires a formal test using the F-distribution; see next section ( 6.2.2).
However, an informal criterion is often used:

1
4
< F =
s
1
2
s
2
2 < 4 .

If equivariance is accepted, then the common value of
1
2
and
2
2
can be estimated
by the weighted mean of s
1
2
and s
2
2
, the pooled sample variance:

s
pooled
2
=
df
1
s
1
2
+ df
2
s
2
2

df
1
+df
2
, where df
1
=n
1
1 and df
2
=n
2
1,
i.e.,
s
pooled
2
=
(n
1
1) s
1
2
+ (n
2
1) s
2
2

n
1
+n
2
2
=
SS
df
.

Therefore, in this case, we have s.e. =

1
2
n
1
+

2
2
n
2
estimated by

s.e. =
s
pooled
2
n
1
+
s
pooled
2
n
2

i.e.,

s.e. = s
pooled
2

\

|
.
|
| 1
n
1
+
1
n
2


= s
pooled

1
n
1
+
1
n
2
.

If equivariance (but not normality) is rejected, then an approximate t-test can be used, with the approximate degrees of freedom df given by

df ≈ ( s₁²/n₁ + s₂²/n₂ )²  /  [ (s₁²/n₁)²/(n₁ − 1) + (s₂²/n₂)²/(n₂ − 1) ].

This is known as the Smith-Satterthwaite Test. (Also used is the Welch Test.)
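In R, both versions are available through t.test(); the data vectors below are hypothetical, purely for illustration:

# Approximate (Welch/Satterthwaite) vs. pooled-variance two-sample t-tests
x <- c(232, 258, 241, 219, 247, 226, 251, 238)            # group 1 (hypothetical)
y <- c(208, 190, 221, 243, 187, 201, 199, 214, 226, 195)  # group 2 (hypothetical)
t.test(x, y, var.equal = FALSE)   # Welch test (the default); uses the df formula above
t.test(x, y, var.equal = TRUE)    # pooled-variance t-test, assuming equivariance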

Ismor Fischer, 5/29/2012 6.2-5

Example: X = cholesterol level (mg/dL); the sample data appear in the table further below.

Test H₀: μ₁ − μ₂ = 0 vs. Hₐ: μ₁ − μ₂ ≠ 0 for significance at the α = .05 level.



Pooled Variance

s²_pooled = [ (8 − 1)(775) + (10 − 1)(1175) ] / (8 + 10 − 2) = 16000/16 = 1000,  on df = 16.

Note that s²_pooled = 1000 is indeed between the variances s₁² = 775 and s₂² = 1175.


Standard Error

s.e. = √[ 1000 (1/8 + 1/10) ] = √225 = 15

Critical Value:  t_{16, .025} = 2.120,  so Margin of Error = (2.120)(15) = 31.8

              Placebo          Drug
              n₁ = 8           n₂ = 10
              x̄₁ = 230         x̄₂ = 200
              s₁² = 775        s₂² = 1175

              x̄₁ − x̄₂ = 30

F = s₁²/s₂² = 0.66, which is between 0.25 and 4.
Equivariance accepted  ⟹  t-test
Ismor Fischer, 5/29/2012 6.2-6
Test Statistic

T = [ (X̄₁ − X̄₂) − μ₀ ] / √[ s²_pooled (1/n₁ + 1/n₂) ]  ~  t_df,   where df = n₁ + n₂ − 2

95% Confidence Interval for μ₁ − μ₂

95% limits = 30 ± 31.8  (margin of error)
95% CI = (−1.8, 61.8), which contains 0  ⟹  Accept H₀.

95% Acceptance Region for H₀: μ₁ − μ₂ = 0

95% limits = 0 ± 31.8  (margin of error)
95% AR = (−31.8, +31.8), which contains 30  ⟹  Accept H₀.



p-value = 2 P( X̄₁ − X̄₂ ≥ 30 ) = 2 P( T₁₆ ≥ (30 − 0)/15 ) = 2 P(T₁₆ ≥ 2.0) = 2(.0314) = .0628 > .05 = α

⟹  Accept H₀.


Once again, low sample size implies low power to reject the null hypothesis. The tests do not show significance, and we cannot conclude that the drug works, based on the data from these small samples. Perhaps a larger study is indicated…
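The same summary-statistic arithmetic can be checked in R; a minimal sketch using the values above:

# Pooled-variance two-sample t-test from summary statistics
n1 <- 8;  xbar1 <- 230; s1sq <- 775      # Placebo
n2 <- 10; xbar2 <- 200; s2sq <- 1175     # Drug

sp2 <- ((n1 - 1)*s1sq + (n2 - 1)*s2sq) / (n1 + n2 - 2)   # pooled variance = 1000
se  <- sqrt(sp2 * (1/n1 + 1/n2))                         # standard error = 15
tt  <- (xbar1 - xbar2) / se                              # test statistic = 2.0
df  <- n1 + n2 - 2                                       # = 16
2 * pt(-abs(tt), df)                                     # two-sided p-value = 0.0628
(xbar1 - xbar2) + c(-1, 1) * qt(.975, df) * se           # 95% CI = (-1.8, 61.8)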
(1 − α) × 100% Confidence Interval for μ₁ − μ₂

( (x̄₁ − x̄₂) − t_{df, α/2} √[ s²_pooled (1/n₁ + 1/n₂) ] ,  (x̄₁ − x̄₂) + t_{df, α/2} √[ s²_pooled (1/n₁ + 1/n₂) ] ),  where df = n₁ + n₂ − 2

(1 − α) × 100% Acceptance Region for H₀: μ₁ − μ₂ = μ₀

( μ₀ − t_{df, α/2} √[ s²_pooled (1/n₁ + 1/n₂) ] ,  μ₀ + t_{df, α/2} √[ s²_pooled (1/n₁ + 1/n₂) ] ),  where df = n₁ + n₂ − 2
Ismor Fischer, 5/29/2012 6.2-7

Now consider the case where the two samples are dependent. That is, each observation in the first sample is paired, or matched, in a natural way with a corresponding observation in the second sample.

Examples:

Individuals may be matched on characteristics such as age, sex, race, and/or
other variables that might confound the intended response.

Individuals may be matched on personal relations such as siblings (similar
genetics, e.g., twin studies), spouses (similar environment), etc.

Observations may be connected physically (e.g., left arm vs. right arm), or
connected in time (e.g., before treatment vs. after treatment).


Calculate the difference dᵢ = xᵢ − yᵢ of each matched pair of observations, thereby forming a single collapsed sample {d₁, d₂, d₃, …, d_n}, and apply the appropriate one-sample Z- or t-test to the equivalent null hypothesis H₀: μ_D = 0.
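A minimal R sketch of this reduction, using hypothetical before/after measurements on n = 8 subjects:

# Paired-sample t-test: collapse the pairs to differences, then one-sample test
x <- c(244, 236, 252, 229, 247, 231, 240, 258)   # e.g., before treatment (hypothetical)
y <- c(232, 228, 239, 235, 230, 227, 233, 249)   # e.g., after treatment  (hypothetical)
d <- x - y
t.test(d, mu = 0)                 # one-sample t-test on the differences
t.test(x, y, paired = TRUE)       # equivalent shortcut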

[Schematic: Assume X ~ N(μ₁, σ₁) and Y ~ N(μ₂, σ₂). From the two samples of size n, pair x₁ with y₁, x₂ with y₂, …, x_n with y_n, and subtract to obtain a single sample of differences, also of size n:

d₁ = x₁ − y₁,  d₂ = x₂ − y₂,  d₃ = x₃ − y₃,  …,  d_n = x_n − y_n.

Then D = X − Y ~ N(μ_D, σ_D), where μ_D = μ₁ − μ₂, so the original null hypothesis H₀: μ₁ − μ₂ = 0 is equivalent to H₀: μ_D = 0.]
Ismor Fischer, 5/29/2012 6.2-8

Checks for normality
include normal scores plot (probability plot, Q-Q plot), etc., just as with one sample.


Remedies for non-normality
include transformations (e.g., logarithmic or square root), or nonparametric tests.

Independent Samples: Wilcoxon Rank Sum Test (=Mann-Whitney U Test)

Dependent Samples: Sign Test, Wilcoxon Signed Rank Test
(just as with one sample)
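Both nonparametric tests are built into R; a sketch on hypothetical data:

# Nonparametric alternatives (hypothetical data, for illustration)
x <- c(244, 236, 252, 229, 247, 231, 240, 258)
y <- c(232, 228, 239, 235, 230, 227, 233, 249)
wilcox.test(x, y)                  # Wilcoxon Rank Sum (= Mann-Whitney U) Test, independent samples
wilcox.test(x, y, paired = TRUE)   # Wilcoxon Signed Rank Test, paired samples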

Ismor Fischer, 5/29/2012 6.2-9

Step-by-Step Hypothesis Testing: Two Sample Means, H₀: μ₁ − μ₂ = 0 vs. Hₐ: μ₁ − μ₂ ≠ 0

Independent or Paired?

- Paired: Compute dᵢ = x₁ᵢ − x₂ᵢ for each i = 1, 2, …, n. Then calculate the
  sample mean  d̄ = (1/n) Σᵢ₌₁ⁿ dᵢ   and   sample variance  s_d² = [1/(n − 1)] Σᵢ₌₁ⁿ (dᵢ − d̄)²,
  and GO TO One Sample Mean testing of H₀: μ_D = 0, section 6.1.1.

- Independent: Are X₁ and X₂ approximately normally distributed (or mildly skewed)?

  - No: Use a transformation, or a nonparametric test, e.g., Wilcoxon Rank Sum Test.

  - Yes: Are σ₁, σ₂ known?

    - Yes: Use the Z-test (with σ₁, σ₂):
      Z = [ (X̄₁ − X̄₂) − μ₀ ] / √( σ₁²/n₁ + σ₂²/n₂ )

    - No, or don't know: Are n₁ ≥ 30 and n₂ ≥ 30?

      - Yes: Use the Z-test or t-test (with σ₁ ≈ s₁, σ₂ ≈ s₂); GO TO PAGE 6.1-28:
        Z ≈ T = [ (X̄₁ − X̄₂) − μ₀ ] / √( s₁²/n₁ + s₂²/n₂ )

      - No: Check equivariance, σ₁² = σ₂²: compute F = s₁²/s₂². Is 1/4 < F < 4?

        - Yes: Use the t-test (with σ₁² = σ₂² ≈ s²_pooled):
          T = [ (X̄₁ − X̄₂) − μ₀ ] / √[ s²_pooled (1/n₁ + 1/n₂) ],
          where s²_pooled = [ (n₁ − 1)s₁² + (n₂ − 1)s₂² ] / (n₁ + n₂ − 2), on df = n₁ + n₂ − 2.

        - No: Use an approximate t-test, e.g., Satterthwaite Test.
Ismor Fischer, 5/29/2012 6.2-10

6.2.2 Variances

Suppose X₁ ~ N(μ₁, σ₁) and X₂ ~ N(μ₂, σ₂), with independent groups: from a sample of size n₁ calculate s₁², and from a sample of size n₂ calculate s₂².

Null Hypothesis H₀: σ₁² = σ₂²
versus
Alternative Hypothesis Hₐ: σ₁² ≠ σ₂²

Test Statistic

F = s₁² / s₂²  ~  F_{ν₁, ν₂},  where ν₁ = n₁ − 1 and ν₂ = n₂ − 1 are the corresponding numerator and denominator degrees of freedom, respectively.

Formal test: Reject H₀ if the F-statistic is significantly different from 1.

Informal criterion: Accept H₀ if the F-statistic is between 0.25 and 4.

The F-distribution with ν₁ and ν₂ degrees of freedom has density

f(x) = [ 1 / B(ν₁/2, ν₂/2) ] (ν₁/ν₂)^{ν₁/2} x^{ν₁/2 − 1} ( 1 + (ν₁/ν₂) x )^{−(ν₁ + ν₂)/2}.

[Figure: F-distribution density curves for ν₁ = 20 and ν₂ = 40, 30, 20, 10, 5.]

Comment: Another test, more robust to departures from the normality assumption than the F-test, is Levene's Test, a t-test of the absolute deviations of each sample. It can be generalized to more than two samples (see section 6.3.2).
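In R, the F-test for two variances is var.test(); the vectors here are hypothetical (simulated), for illustration only:

# F-test of H0: sigma1^2 = sigma2^2 (requires the raw data, not just summaries)
x <- rnorm(25, mean = 100, sd = 10)    # hypothetical sample 1
y <- rnorm(30, mean = 100, sd = 15)    # hypothetical sample 2
var.test(x, y)                         # F = var(x)/var(y), on (24, 29) degrees of freedom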
Ismor Fischer, 5/29/2012 6.2-11

6.2.3 Proportions

POPULATION: Binary random variable I₁ = 1 or 0, with P(I₁ = 1) = π₁ and P(I₁ = 0) = 1 − π₁; likewise, binary random variable I₂ = 1 or 0, with P(I₂ = 1) = π₂ and P(I₂ = 0) = 1 − π₂.

INDEPENDENT SAMPLES, n₁ ≥ 30 and n₂ ≥ 30:

Random Variable X₁ = #(I₁ = 1) ~ Bin(n₁, π₁). Recall (assuming n₁π₁ ≥ 15 and n₁(1 − π₁) ≥ 15) that, approximately,
π̂₁ = X₁/n₁ ~ N( π₁, √[ π₁(1 − π₁)/n₁ ] ).

Random Variable X₂ = #(I₂ = 1) ~ Bin(n₂, π₂). Recall (assuming n₂π₂ ≥ 15 and n₂(1 − π₂) ≥ 15) that, approximately,
π̂₂ = X₂/n₂ ~ N( π₂, √[ π₂(1 − π₂)/n₂ ] ).

Therefore, approximately,

π̂₁ − π̂₂ ~ N( π₁ − π₂ , √[ π₁(1 − π₁)/n₁ + π₂(1 − π₂)/n₂ ] ),

where the second parameter is the standard error s.e. Confidence intervals are computed in the usual way, using the estimate

s.e. = √[ π̂₁(1 − π̂₁)/n₁ + π̂₂(1 − π̂₂)/n₂ ],

as follows:
Ismor Fischer, 5/29/2012 6.2-12
(1 − α) × 100% Confidence Interval for π₁ − π₂

( (π̂₁ − π̂₂) − z_{α/2} √[ π̂₁(1 − π̂₁)/n₁ + π̂₂(1 − π̂₂)/n₂ ] ,  (π̂₁ − π̂₂) + z_{α/2} √[ π̂₁(1 − π̂₁)/n₁ + π̂₂(1 − π̂₂)/n₂ ] )

Unlike the one-sample case, the same estimate for the standard error can also be used in computing the acceptance region for the null hypothesis H₀: π₁ − π₂ = π₀, as well as the test statistic for the p-value, provided the null value π₀ ≠ 0. HOWEVER, if testing for equality between two proportions via the null hypothesis H₀: π₁ − π₂ = 0, then their common value should be estimated by the more stable weighted mean of π̂₁ and π̂₂, the pooled sample proportion:

π̂_pooled = (X₁ + X₂) / (n₁ + n₂) = (n₁ π̂₁ + n₂ π̂₂) / (n₁ + n₂).

Substituting yields

s.e.₀ = √[ π̂_pooled(1 − π̂_pooled)/n₁ + π̂_pooled(1 − π̂_pooled)/n₂ ],  i.e.,  s.e.₀ = √[ π̂_pooled(1 − π̂_pooled) (1/n₁ + 1/n₂) ].

Hence:

Test Statistic for H₀: π₁ − π₂ = 0

Z = [ (π̂₁ − π̂₂) − 0 ] / √[ π̂_pooled(1 − π̂_pooled) (1/n₁ + 1/n₂) ]  ~  N(0, 1)

(1 − α) × 100% Acceptance Region for H₀: π₁ − π₂ = 0

( 0 − z_{α/2} √[ π̂_pooled(1 − π̂_pooled) (1/n₁ + 1/n₂) ] ,  0 + z_{α/2} √[ π̂_pooled(1 − π̂_pooled) (1/n₁ + 1/n₂) ] )
Ismor Fischer, 5/29/2012 6.2-13
              PT + Supplement      PT only
              n₁ = 400             n₂ = 320
              X₁ = 332             X₂ = 244

H₀: π₁ − π₂ = 0

[Figure 1: The observed difference π̂₁ − π̂₂ = 0.0675 on the null distribution N(0, 0.03) corresponds to Z = 2.25 on the standard normal distribution N(0, 1); each tail area beyond is .0122.]

Example: Consider a group of 720 patients who undergo physical therapy for arthritis.
A daily supplement of glucosamine and chondroitin is given to n
1
=400 of them in
addition to the physical therapy; after four weeks of treatment, X
1
= 332 show
measurable signs of improvement (increased ROM, etc.). The remaining n
2
=320
patients receive physical therapy only; after four weeks, X
2
=244 show improvement.
Does this difference represent a statistically significant treatment effect? Calculate the
p-value, and form a conclusion at the =.05 significance level.


H₀: π₁ − π₂ = 0  vs.  Hₐ: π₁ − π₂ ≠ 0  at α = .05

π̂₁ = 332/400 = 0.83,  π̂₂ = 244/320 = 0.7625,  so  π̂₁ − π̂₂ = 0.0675

π̂_pooled = (332 + 244)/(400 + 320) = 576/720 = 0.8,  and thus 1 − π̂_pooled = 144/720 = 0.2

s.e.₀ = √[ (0.8)(0.2)(1/400 + 1/320) ] = 0.03

Therefore, p-value = 2 P( π̂₁ − π̂₂ ≥ 0.0675 ) = 2 P( Z ≥ (0.0675 − 0)/0.03 ) = 2 P(Z ≥ 2.25) = 2(.0122) = .0244.

Conclusion: As this value is smaller than α = .05, we can reject the null hypothesis that the two proportions are equal. There does indeed seem to be a moderately significant treatment difference between the two groups.
Ismor Fischer, 5/29/2012 6.2-14


Exercise: Instead of H₀: π₁ − π₂ = 0 vs. Hₐ: π₁ − π₂ ≠ 0, test the null hypothesis for a 5% difference, i.e., H₀: π₁ − π₂ = .05 vs. Hₐ: π₁ − π₂ ≠ .05, at α = .05. [Note that the pooled proportion π̂_pooled is no longer appropriate to use in the expression for the standard error under the null hypothesis, since H₀ is not claiming that the two proportions π₁ and π₂ are equal (to a common value); see the notes above.] Conclusion?


Exercise: Instead of H₀: π₁ − π₂ = 0 vs. Hₐ: π₁ − π₂ ≠ 0, test the one-sided null hypothesis H₀: π₁ − π₂ ≤ 0 vs. Hₐ: π₁ − π₂ > 0 at α = .05. Conclusion?

Exercise: Suppose that in a second experiment, n₁ = 400 patients receive a new drug that targets B-lymphocytes, while the remaining n₂ = 320 receive a placebo, both in addition to physical therapy. After four weeks, X₁ = 376 and X₂ = 272 show improvement, respectively. Formally test the null hypothesis of equal proportions at the α = .05 level. Conclusion?

Exercise: Finally, suppose that in a third experiment, n₁ = 400 patients receive magnet therapy, while the remaining n₂ = 320 do not, both in addition to physical therapy. After four weeks, X₁ = 300 and X₂ = 240 show improvement, respectively. Formally test the null hypothesis of equal proportions at the α = .05 level. Conclusion?


See

Appendix > Statistical Inference > General Parameters and FORMULA TABLES.



IMPORTANT!
Ismor Fischer, 5/29/2012 6.2-15

Alternate Method: Chi-Squared (χ²) Test

As before, let the binary variable I = 1 for improvement, I = 0 for no improvement, with probability π and 1 − π, respectively. Now define a second binary variable J = 1 for the PT + Drug group, and J = 0 for the PT only group. Thus, there are four possible disjoint events: I = 0 and J = 0, I = 0 and J = 1, I = 1 and J = 0, and I = 1 and J = 1. The number of times these events occur in the random sample can be arranged in a 2 × 2 contingency table that consists of four cells (NW, NE, SW, and SE) as demonstrated below, and compared with their corresponding expected values based on the null hypothesis.

Observed Values                     Group (J)
                          PT + Drug     PT only
Status   Improvement          332          244       576
(I)      No Improvement        68           76       144
                              400          320       720
(row marginal totals on the right; column marginal totals at the bottom)

versus

Expected Values = (Column total × Row total) / (Total Sample Size n),  under H₀: π₁ = π₂ [estimated by π̂_pooled = 576/720 = 0.8]

Expected Values                     Group (J)
                           PT + Drug        PT only
Status   Improvement    400×576/720 = 320.0   320×576/720 = 256.0     576
(I)      No Improvement 400×144/720 =  80.0   320×144/720 =  64.0     144
                                      400.0                 320.0     720

Note: "Chi" is pronounced "kye."

Informal reasoning: Consider the first cell, improvement in the 400 patients of the PT + Drug group. The null hypothesis conjectures that the probability of improvement is equal in both groups, and this common value is estimated by the pooled proportion 576/720. Hence, the expected number (under H₀) of improved patients in the PT + Drug group is 400 × 576/720, etc.

Note that, by construction, under H₀: 320/400 = 256/320 = 576/720, the pooled proportion.
Ismor Fischer, 5/29/2012 6.2-16
Test Statistic for H₀: π₁ − π₂ = 0

χ² = Σ_{all cells} (Obs − Exp)² / Exp  ~  χ²₁

Ideally, if all the observed values = all the expected values, then this statistic would = 0, and the corresponding p-value = 1. As it is,

χ² = (332 − 320)²/320 + (244 − 256)²/256 + (68 − 80)²/80 + (76 − 64)²/64 = 5.0625 on 1 df.

Therefore, the p-value = P( χ²₁ ≥ 5.0625 ) = .0244, as before. Reject H₀.

[Figure 2: χ²₁ distribution; the area to the right of 5.0625 is .0244.]

Note that 5.0625 = (±2.25)², i.e., χ²₁ = Z². The two test statistics are mathematically equivalent! (Compare Figures 1 and 2.)

Comments:

• The Chi-squared Test is valid, provided the Expected Values are ≥ 5. (Otherwise, the χ² score is inflated.) For small expected values in a 2 × 2 table, defer to Fisher's Exact Test.

• Chi-squared statistic with the Yates continuity correction, to reduce spurious significance:

χ² = Σ_{all cells} ( |Obs − Exp| − 0.5 )² / Exp

• The Chi-squared Test is strictly for the two-sided H₀: π₁ − π₂ = 0 vs. Hₐ: π₁ − π₂ ≠ 0. It cannot be modified to a one-sided test, or to H₀: π₁ − π₂ = π₀ vs. Hₐ: π₁ − π₂ ≠ π₀.
Ismor Fischer, 5/29/2012 6.2-17

How could we solve this problem using R? The code (which can be shortened a bit):

# Lines preceded by the pound sign are read as comments,
# and ignored by R.

# The following set of commands builds the 2-by-2 contingency table,
# column by column (with optional headings), and displays it as
# output (my boldface).

Tx.vs.Control = matrix(c(332, 68, 244, 76), ncol = 2, nrow = 2,
dimnames = list("Status" = c("Improvement", "No Improvement"),
"Group" = c("PT + Drug", "PT")))

Tx.vs.Control

Group
Status PT + Drug PT
Improvement 332 244
No Improvement 68 76


# A shorter alternative that outputs a simpler table:

Improvement = c(332, 244)
No_Improvement = c(68, 76)
Tx.vs.Control = rbind(Improvement, No_Improvement)

Tx.vs.Control

[,1] [,2]
Improvement 332 244
No_Improvement 68 76


# The actual Chi-squared Test itself. Since using a correction
# factor is the default, the F option specifies that no such
# factor is to be used in this example.

chisq.test(Tx.vs.Control, correct = F)

Pearson's Chi-squared test

data: Tx.vs.Control
X-squared = 5.0625, df = 1, p-value = 0.02445


Note how the output includes the Chi-squared test statistic, degrees of freedom, and
p-value, all of which agree with our previous manual calculations.
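For completeness, the same analysis can also be run directly on the counts with prop.test(), which should report the same chi-squared statistic and p-value, along with a confidence interval for π₁ − π₂:

# Equivalent two-proportion test from the raw counts
prop.test(x = c(332, 244), n = c(400, 320), correct = F)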
Ismor Fischer, 5/29/2012 6.2-18
Application: Case-Control Study Design

Determines if an association exists between disease D and risk factor exposure E.
TIME: PRESENT ← PAST. Given: Cases (D+) and Controls (D−). Investigate: relation with E+ and E−.

Chi-Squared Test of H₀: π_{E+ | D+} = π_{E+ | D−}

Randomly select a sample of cases and controls, and categorize each member according to whether or not he/she was exposed to the risk factor: n₁ cases (D+) and n₂ controls (D−). For each case (D+), there are 2 disjoint possibilities for exposure, E+ or E−; likewise for each control (D−).

              D+    D−
      E+       a     b
      E−       c     d

Calculate the χ²₁ statistic:

(a + b + c + d) (ad − bc)²  /  [ (a + c)(b + d)(a + b)(c + d) ]

McNemar's Test of H₀: π_{E+ | D+} = π_{E+ | D−}

Match each case with a corresponding control on age, sex, race, and any other confounding variables that may affect the outcome. Note that this requires a balanced sample: n₁ = n₂ = n. For each matched case-control ordered pair (D+, D−), there are 4 disjoint possibilities for exposure: E+ and E+ (concordant pair), E+ and E− (discordant pair), E− and E+ (discordant pair), E− and E− (concordant pair).

                     Control (D−)
                      E+    E−
   Case (D+)   E+      a     b
               E−      c     d

Calculate the χ²₁ statistic:

(b − c)² / (b + c)
See Appendix > Statistical Inference > Means and Proportions, One and Two Samples.
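A minimal sketch of the matched design in R, with hypothetical counts of the four pair types (a, b, c, d as in the table above):

# McNemar's test on matched case-control pairs; only the discordant counts b and c matter
paired.tab <- matrix(c(30, 15, 25, 50), nrow = 2,
                     dimnames = list("Case (D+)"    = c("E+", "E-"),
                                     "Control (D-)" = c("E+", "E-")))
mcnemar.test(paired.tab, correct = F)    # chi-squared = (b - c)^2 / (b + c) on 1 df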
Ismor Fischer, 5/29/2012 6.2-19

To quantify the strength of association between D and E, we turn to the notion of

Odds Ratios Revisited

Recall:

Case-Control Studies:  OR = odds(Exposure | Disease) / odds(Exposure | No Disease) = [ P(E+ | D+) / P(E− | D+) ] / [ P(E+ | D−) / P(E− | D−) ]

Cohort Studies:  OR = odds(Disease | Exposure) / odds(Disease | No Exposure) = [ P(D+ | E+) / P(D− | E+) ] / [ P(D+ | E−) / P(D− | E−) ]

H₀: OR = 1  ⟺  No association exists between D and E.
versus
Hₐ: OR ≠ 1  ⟺  An association exists between D and E.

From a SAMPLE of size n, arranged as

              D+    D−
      E+       a     b
      E−       c     d

the odds ratio is estimated by ÔR = ad/bc.

Alas, the probability distribution of the odds ratio ÔR is distinctly skewed to the right. However, its natural logarithm, ln(ÔR), is approximately normally distributed, which makes it more useful for conducting the Test of Association above. Namely,

(1 − α) × 100% Confidence Limits for ln(OR):   ln(ÔR) ± z_{α/2} s.e.,  where  s.e. = √( 1/a + 1/b + 1/c + 1/d ),

and exponentiating gives the (1 − α) × 100% Confidence Limits for OR:   e^{ ln(ÔR) ± z_{α/2} s.e. }.
Ismor Fischer, 5/29/2012 6.2-20
Examples: Test H₀: OR = 1 versus Hₐ: OR ≠ 1 at the α = .05 significance level, for each of the two samples below. Note that both yield the same estimated odds ratio, ÔR = 2.56.

Sample 1:        D+    D−
        E+        8    10
        E−       10    32          ÔR = (8)(32) / [(10)(10)] = 2.56

Sample 2:        D+    D−
        E+       40    50
        E−       50   160          ÔR = (40)(160) / [(50)(50)] = 2.56

[Figures: the resulting 95% confidence intervals for OR, (0.79, 8.30) and (1.52, 4.32), shown on a number line around the point estimate 2.56 and the null value 1.]





For Sample 1:  ln(2.56) = 0.94,  s.e. = √( 1/8 + 1/10 + 1/10 + 1/32 ) ≈ 0.6,  so the 95% Margin of Error = (1.96)(0.6) = 1.176.

95% Confidence Interval for ln(OR) = ( 0.94 − 1.176, 0.94 + 1.176 ) = ( −0.236, 2.116 ),
and so the 95% Confidence Interval for OR = ( e^{−0.236}, e^{2.116} ) = (0.79, 8.30).

Conclusion: As this interval does contain the null value OR = 1, we cannot reject the hypothesis of non-association at the 5% significance level.





For Sample 2:  ln(2.56) = 0.94,  s.e. = √( 1/40 + 1/50 + 1/50 + 1/160 ) ≈ 0.267,  so the 95% Margin of Error = (1.96)(0.267) = 0.523.

95% Confidence Interval for ln(OR) = ( 0.94 − 0.523, 0.94 + 0.523 ) = ( 0.417, 1.463 ),
and so the 95% Confidence Interval for OR = ( e^{0.417}, e^{1.463} ) = (1.52, 4.32).

Conclusion: As this interval does not contain the null value OR = 1, we can reject the hypothesis of non-association at the 5% level. With 95% confidence, the odds of disease are between 1.52 and 4.32 times higher among the exposed than the unexposed.
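These confidence limits are easy to reproduce in R; a minimal sketch for Sample 2:

# Approximate 95% CI for an odds ratio via the log transformation (cell counts a, b, c, d)
a <- 40; b <- 50; cc <- 50; d <- 160            # 'cc' used to avoid masking R's c() function
OR <- (a*d) / (b*cc)                            # = 2.56
se <- sqrt(1/a + 1/b + 1/cc + 1/d)              # standard error of ln(OR)
exp( log(OR) + c(-1, 1) * qnorm(.975) * se )    # roughly (1.52, 4.32)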

Comments:

• If any of a, b, c, or d = 0, then use  s.e. = √( 1/(a + 0.5) + 1/(b + 0.5) + 1/(c + 0.5) + 1/(d + 0.5) ).

• If OR < 1, this suggests that exposure might have a protective effect, e.g., daily calcium supplements (yes/no) and osteoporosis (yes/no).

Ismor Fischer, 5/29/2012 6.2-21

Summary Odds Ratio

Combining 2 × 2 tables corresponding to distinct strata.

Examples:

(1)  Males:                Females:               All (pooled):
         D+   D−               D+   D−                D+   D−
     E+  10   50           E+  10   10           E+   20   60
     E−  10  150           E−  60   60           E−   70  210
     ÔR₁ = 3               ÔR₂ = 1               ÔR = 1

(2)  Males:                Females:               All (pooled):
         D+   D−               D+   D−                D+   D−
     E+  80   20           E+  10   20           E+   90   40
     E−  20   10           E−  20   80           E−   40   90
     ÔR₁ = 2               ÔR₂ = 2               ÔR = 5.0625

(3)  Males:                Females:               All (pooled):
         D+   D−               D+   D−                D+   D−
     E+  60  100           E+  50   10           E+  110  110
     E−  10   50           E− 100   60           E−  110  110
     ÔR₁ = 3               ÔR₂ = 3               ÔR = 1

These examples illustrate the phenomenon known as Simpson's Paradox. Ignoring a confounding variable (e.g., gender) may obscure an association that exists within each stratum but is not observed in the pooled data, and so the confounder must be adjusted for. When is it acceptable to combine data from two or more such strata? How is the summary odds ratio OR_summary estimated? And how is it tested for association?
Ismor Fischer, 5/29/2012 6.2-22
In general:

Stratum 1:                         Stratum 2:
         D+    D−                           D+    D−
     E+  a₁    b₁                       E+  a₂    b₂
     E−  c₁    d₁                       E−  c₂    d₂
     ÔR₁ = a₁d₁ / (b₁c₁)                ÔR₂ = a₂d₂ / (b₂c₂)

I.   Calculate the estimates of OR₁ and OR₂ for each stratum, as shown.

II.  Can the strata be combined? Conduct a Breslow-Day (Chi-squared) Test of Homogeneity for H₀: OR₁ = OR₂.

III. If accepted, calculate the Mantel-Haenszel Estimate of OR_summary:

     ÔR_MH = ( a₁d₁/n₁ + a₂d₂/n₂ ) / ( b₁c₁/n₁ + b₂c₂/n₂ ).

IV.  Finally, conduct a Test of Association for the combined strata, H₀: OR_summary = 1, either via a confidence interval or via a special χ²-test (shown below).

Example:

Males:                             Females:
         D+    D−                           D+    D−
     E+  10    20                       E+  40    50
     E−  30    90                       E−  60    90
     ÔR₁ = 1.5                          ÔR₂ = 1.2

Assuming that the Test of Homogeneity H₀: OR₁ = OR₂ is conducted and accepted,

ÔR_MH = [ (10)(90)/150 + (40)(90)/240 ] / [ (20)(30)/150 + (50)(60)/240 ] = (6 + 15)/(4 + 12.5) = 21/16.5 = 1.273.

Exercise: Show algebraically that ÔR_MH is a weighted average of ÔR₁ and ÔR₂.
Ismor Fischer, 5/29/2012 6.2-23

To conduct a formal Chi-squared Test of Association H₀: OR_summary = 1, we calculate, for the 2 × 2 contingency table in each stratum i = 1, 2, …, s:

                 D+     D−      row totals
        E+       aᵢ     bᵢ         R₁ᵢ
        E−       cᵢ     dᵢ         R₂ᵢ
  column totals  C₁ᵢ    C₂ᵢ         nᵢ

Observed # diseased among the exposed: aᵢ,  vs.  Expected # diseased: E₁ᵢ = R₁ᵢ C₁ᵢ / nᵢ (exposed) and E₂ᵢ = R₂ᵢ C₁ᵢ / nᵢ (not exposed),

with Variance  Vᵢ = R₁ᵢ R₂ᵢ C₁ᵢ C₂ᵢ / [ nᵢ² (nᵢ − 1) ].

Therefore, summing over all strata i = 1, 2, …, s, we obtain the following:

Observed total, Diseased:   Exposed O₁ = Σ aᵢ,   Not Exposed O₂ = Σ cᵢ
Expected total, Diseased:   Exposed E₁ = Σ E₁ᵢ,  Not Exposed E₂ = Σ E₂ᵢ
Total Variance:             V = Σ Vᵢ

and the formal test statistic for significance is given by

χ² = (O₁ − E₁)² / V  ~  χ²₁.

This formulation will appear again in the context of the Log-Rank Test in the area of Survival Analysis (section 8.3).

Example (cont'd):

For stratum 1 (males),  E₁₁ = (30)(40)/150 = 8  and  V₁ = (30)(120)(40)(110) / [150² (149)] = 4.725.

For stratum 2 (females),  E₁₂ = (90)(100)/240 = 37.5  and  V₂ = (90)(150)(100)(140) / [240² (239)] = 13.729.

Therefore, O₁ = 50, E₁ = 45.5, and V = 18.454, so that χ² = (4.5)² / 18.454 = 1.097 on 1 degree of freedom, from which it follows that the null hypothesis H₀: OR_summary = 1 cannot be rejected at the α = .05 significance level, i.e., there is not enough empirical evidence to conclude that an association exists between disease D and exposure E.
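R bundles steps III and IV into a single function, mantelhaen.test(); the sketch below enters the two strata of the example as a 2 × 2 × 2 array (rows E+/E−, columns D+/D−) and should reproduce the common odds ratio estimate 1.273 and the χ² statistic 1.097, up to rounding:

# Mantel-Haenszel common odds ratio and test of association across strata
strata <- array(c(10, 30, 20, 90,      # males:   a1, c1, b1, d1
                  40, 60, 50, 90),     # females: a2, c2, b2, d2
                dim = c(2, 2, 2))
mantelhaen.test(strata, correct = F)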

Comment: This entire discussion on Odds Ratios OR can be modified to Relative Risk RR (defined only for a cohort study), with the following changes:

s.e. = √( 1/a − 1/R₁ + 1/c − 1/R₂ ),

as well as b replaced with the row marginal R₁, and d replaced with the row marginal R₂, in all other formulas. [Recall, for instance, that the estimate of OR is ad/bc, whereas the estimate of RR is (a/R₁)/(c/R₂) = aR₂/(R₁c), etc.]
Ismor Fischer, 5/29/2012 6.3-1

6.3 Several Samples

6.3.1 Proportions

General formulation

Consider several fixed (i.e., nonrandom) populations, say j = 1, 2, 3, …, c, where every individual in each population can have one of several random responses, i = 1, 2, 3, …, r (e.g., the previous example had c = 2 treatment groups and r = 2 possible improvement responses: Yes or No). Formally, let I and J be two general categorical variables, with r and c categories, respectively. Thus, there is a total of r × c possible disjoint outcomes; namely, an individual in population j (= 1, 2, …, c) corresponds to some response i (= 1, 2, …, r). With this in mind, let π_ij = the probability of this outcome. We wish to test the null hypothesis that, for each response category i, the probabilities π_ij are equal over all the population categories j. That is, the populations are homogeneous with respect to the proportions of individuals having the same responses:

H₀:  π₁₁ = π₁₂ = π₁₃ = ⋯ = π₁c    and
     π₂₁ = π₂₂ = π₂₃ = ⋯ = π₂c    and
     ⋯                             and
     π_r1 = π_r2 = π_r3 = ⋯ = π_rc
⟺  There is no association between (categories of) I and (categories of) J.

versus

Hₐ:  At least one of these equalities is false, i.e., π_ij ≠ π_ik for some i.
⟺  There is an association between (categories of) I and (categories of) J.

Much as before, we can construct an r × c contingency table of n observed values, where r = # rows and c = # columns.

                        Categories of J
                    1     2     3    ⋯    c
Categories   1    O₁₁   O₁₂   O₁₃   ⋯   O₁c    R₁
of I         2    O₂₁   O₂₂   O₂₃   ⋯   O₂c    R₂
             3    O₃₁   O₃₂   O₃₃   ⋯   O₃c    R₃
             ⋮     ⋮     ⋮     ⋮          ⋮     ⋮
             r    O_r1  O_r2  O_r3  ⋯   O_rc   R_r
                  C₁    C₂    C₃    ⋯   C_c     n
Ismor Fischer, 5/29/2012 6.3-2
[Figure: χ² density curves for ν = 1, 2, 3, 4, 5, 6, 7 degrees of freedom.]
For i = 1, 2, …, r and j = 1, 2, …, c, the following are obtained:

Observed Values  O_ij = #(I = i, J = j),  whole numbers ≥ 0

Expected Values  E_ij = Rᵢ Cⱼ / n,  real numbers (i.e., with decimals) ≥ 0,

where the row marginals  Rᵢ = O_i1 + O_i2 + O_i3 + ⋯ + O_ic,
and the column marginals  Cⱼ = O_1j + O_2j + O_3j + ⋯ + O_rj.














Comments:

• The Chi-squared Test is valid, provided 80% or more of the E_ij are ≥ 5. For small expected values, lumping categories together increases the numbers in the corresponding cells. Example: The five age categories 18-24, 25-39, 40-49, 50-64, and 65+ in a study might be lumped into three categories 18-39, 40-64, and 65+ if appropriate. Caution: Categories should be deemed contextually meaningful before using χ².

• Remarkably, the same Chi-squared statistic can be applied in different scenarios, including tests of different null hypotheses H₀ on the same contingency table, as shown in the following examples.

• If Z₁, Z₂, …, Z_d are independent N(0, 1) random variables, then Z₁² + Z₂² + ⋯ + Z_d² ~ χ²_d.
Test Statistic

χ² = Σ_{all i, j} (O_ij − E_ij)² / E_ij  ~  χ²_df,   where ν = df = (r − 1)(c − 1)
Ismor Fischer, 5/29/2012 6.3-3

Example: Suppose that a study, similar to the previous one, compares r = 4 improvement responses of c = 3 fixed groups of n = 600 patients: one group of 250 receives physical therapy alone, a second group of 200 receives an over-the-counter supplement in addition to physical therapy, and a third group of 150 receives a prescription medication in addition to physical therapy. The 4 × 3 contingency table of observed values is generated below.


Observed Values                  Treatment Group (J)
                        PT + Rx    PT + OTC    PT only
Improvement   None          6          14         40      60
Status (I)    Minor         9          30         81     120
              Moderate     15          60        105     180
              Major       120          96         24     240
                          150         200        250     600
(random row marginal totals on the right; fixed column marginal totals at the bottom)


Upon inspection, it seems obvious that there are clear differences, but determining whether or not these differences are statistically significant requires a formal test. For instance, consider the null hypothesis that there is no significant difference in each improvement response rate across the treatment populations, i.e., for each improvement category i (= 1, 2, 3, 4) in I, the probabilities π_ij over all treatment categories j (= 1, 2, 3) in J are equal. That is, explicitly,

H₀: Treatment populations are homogeneous with respect to each response.

π_None in PT+Rx  = π_None in PT+OTC  = π_None in PT only     and
π_Minor in PT+Rx = π_Minor in PT+OTC = π_Minor in PT only    and
π_Mod in PT+Rx   = π_Mod in PT+OTC   = π_Mod in PT only      and
π_Major in PT+Rx = π_Major in PT+OTC = π_Major in PT only


If the null hypothesis is true, then the expected table would consist of the values below,

Expected Values                  Treatment Group (J)
                        PT + Rx    PT + OTC    PT only
Improvement   None         15          20         25      60
Status (I)    Minor        30          40         50     120
              Moderate     45          60         75     180
              Major        60          80        100     240
                          150         200        250     600
Ismor Fischer, 5/29/2012 6.3-4

because in this case,

15/150 = 20/200 = 25/250   ( = pooled proportion π̂_None = 60/600 ),    true since all = 0.1
30/150 = 40/200 = 50/250   ( = pooled proportion π̂_Minor = 120/600 ),  true since all = 0.2
45/150 = 60/200 = 75/250   ( = pooled proportion π̂_Mod = 180/600 ),    true since all = 0.3
60/150 = 80/200 = 100/250  ( = pooled proportion π̂_Major = 240/600 ),  true since all = 0.4.

If the null hypothesis is rejected based on the data, then the alternative is that at least one of its four statements is false. For that corresponding improvement category, one of the three treatment populations is significantly different from the others. This is referred to as a Chi-squared Test of Homogeneity, and is performed in the usual way. (Exercise)

Let us consider a slightly different scenario which, for the sake of simplicity, has the same observed values as above. Suppose now we start with a single population, where every individual can have one of several random responses i = 1, 2, 3, …, r corresponding to one categorical variable I (such as improvement status, as before), AND one of several random responses j = 1, 2, 3, …, c corresponding to another categorical variable J (such as, perhaps, the baseline symptoms of their arthritis):

Observed Values              Baseline Disease Status (J)
                         Mild     Moderate     Severe
Improvement   None          6          14         40      60
Status (I)    Minor         9          30         81     120
              Moderate     15          60        105     180
              Major       120          96         24     240
                          150         200        250     600
(random row marginal totals on the right; random column marginal totals at the bottom)


In other words, unlike the previous scenario, where there was only one random response for each individual per population, here there are two random responses for each individual in a single fixed population. With this in mind, the probability π_ij (see the first page of this section) is defined differently: namely, as the conditional probability that an individual corresponds to a response i (= 1, 2, …, r), given that he/she corresponds to a response j (= 1, 2, …, c). Hence, in this scenario, the null hypothesis translates to:
Ismor Fischer, 5/29/2012 6.3-5


π_None | Mild  = π_None | Moderate  = π_None | Severe     and
π_Minor | Mild = π_Minor | Moderate = π_Minor | Severe    and
π_Mod | Mild   = π_Mod | Moderate   = π_Mod | Severe      and
π_Major | Mild = π_Major | Moderate = π_Major | Severe


However, interpreting this in context, each row now states that the improvement status variable I (= None, Minor, Mod, Major) is not affected by the baseline disease status variable J (= Mild, Moderate, Severe). This implies that for each i = 1, 2, 3, 4, the events I = i and J = j (j = 1, 2, 3) are statistically independent, and hence, by definition, the common value of the conditional probabilities P(I = i | J = j) in each row is equal to the corresponding unconditional probability P(I = i) for that row, namely, π_None, π_Minor, π_Mod, and π_Major, respectively. It then also follows that P(I = i ∩ J = j) = P(I = i) P(J = j).*

The left-hand intersection probability in this equation is simply the expected value E_ij / n; the right-hand side is the product of (Row marginal Rᵢ / n) × (Column marginal Cⱼ / n), and so we obtain the familiar formula E_ij = Rᵢ Cⱼ / n. Thus, the previous table of expected values and subsequent calculations are exactly the same for this so-called Chi-squared Test of Independence:

H₀: The two responses are statistically independent in this population.

Furthermore, because both responses I and J are independent, we can also characterize this null hypothesis by the symmetric statement that the baseline disease status variable J (= Mild, Moderate, Severe) is not affected by the improvement status variable I (= None, Minor, Mod, Major). That is, the common value of the conditional probabilities P(J = j | I = i) in each column is equal to the corresponding unconditional probability P(J = j) for that column, i.e., π_Mild, π_Moderate, and π_Severe, respectively:

π_Mild | None     = π_Mild | Minor     = π_Mild | Mod     = π_Mild | Major      ( = π_Mild )        and
π_Moderate | None = π_Moderate | Minor = π_Moderate | Mod = π_Moderate | Major  ( = π_Moderate )    and
π_Severe | None   = π_Severe | Minor   = π_Severe | Mod   = π_Severe | Major    ( = π_Severe )

* We have used several results here. Recall that, by definition, two events A and B are said to be statistically independent if P(A | B) = P(A), or equivalently, P(A ∩ B) = P(A) P(B). Also see Problems 3-5 and 3-22(b) for related ideas.
Ismor Fischer, 5/29/2012 6.3-6

In particular, this would yield the following:

15/60 = 30/120 = 45/180 = 60/240    ( = π̂_Mild = 150/600 ),      true, since all = 1/4.
20/60 = 40/120 = 60/180 = 80/240    ( = π̂_Moderate = 200/600 ),  true, since all = 1/3.
25/60 = 50/120 = 75/180 = 100/240   ( = π̂_Severe = 250/600 ),    true, since all = 5/12.

That is, the independence between I and J can also be interpreted in this equivalent form.


Hence, the same Chi-squared statistic  χ² = Σ_{all cells} (Obs − Exp)²/Exp  on df = (r − 1)(c − 1) is used for both types of hypothesis test! The exact interpretation depends on the design of the experiment, i.e., whether two or more populations are being compared for homogeneity with respect to a single response, or whether any two responses are independent of one another in a single population. However, as the application of the Chi-squared test is equally valid in either scenario, the subtle distinction between them is often blurred in practice. MORAL: In general, if the null hypothesis is rejected in either scenario, then there is an association between the two categorical variables I and J.

Exercise: Conduct (both versions of) the Chi-squared Test for this 4 × 3 table.

One way to code this in R:

# Input
None = c(6, 14, 40)
Minor = c(9, 30, 81)
Moderate = c(15, 60, 105)
Major = c(120, 96, 24)
Improvement = rbind(None, Minor, Moderate, Major)


# Output
Improvement
chisq.test(Improvement, correct = F)
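
# If desired, the table of expected values E_ij = R_i * C_j / n under the null
# hypothesis can also be pulled out of the fitted object (a small optional
# addition to the code above):
chisq.test(Improvement, correct = F)$expected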


Ismor Fischer, 5/29/2012 6.3-7
Goodness-of-Fit Test

H₀: π₁ = π₁₀,  π₂ = π₂₀,  π₃ = π₃₀,  …,  π_k = π_k₀  (specified null values)

For i = 1, 2, 3, …, k = # groups, and n = sample size:

Observed Values  Oᵢ;    Expected Values  Eᵢ = n πᵢ₀

Test Statistic

χ² = Σᵢ₌₁ᵏ (Oᵢ − Eᵢ)² / Eᵢ  ~  χ²_df,   where ν = df = k − 1
As a final application, consider one of the treatment categories alone, say PT + Rx, written below as a row, for convenience.

PT + Rx Observed Values:    None   Minor   Moderate   Major
                              6      9        15       120      n = 150

Suppose we wish to test the null hypothesis that there is no significant difference in improvement responses, i.e., the probabilities of all the improvement categories are equal. That is, H₀: π_None = π_Minor = π_Moderate = π_Major (thus, = 0.25 each). Therefore, under this null hypothesis (and changing notation slightly), these n = 150 patients should be equally divided into the k = 4 response categories, i.e., H₀: For this treatment category, the responses follow a uniform distribution (= n/k), as illustrated.

PT + Rx Expected Values:    None   Minor   Moderate   Major
                            37.5    37.5     37.5      37.5     n = 150

Of course, even a cursory comparison of these two distributions strongly suggests that there is indeed a significant difference. Remarkably, the same basic test statistic can be used in this Chi-squared Goodness-of-Fit Test. The degrees of freedom is equal to one less than k, the number of response categories being compared; in this case, df = 3.

In general, this test can be applied to determine if data follow other probability distributions as well. For example, suppose it is more realistic to believe that the null distribution is not uniform, but skewed, i.e., H₀: π_None = .10, π_Minor = .20, π_Moderate = .30, π_Major = .40. Then the observed values above would instead be compared with

PT + Rx Expected Values:    None   Minor   Moderate   Major
                             15      30       45        60      n = 150

In general, the test statistic is  χ² = Σᵢ₌₁ᵏ (Oᵢ − Eᵢ)² / Eᵢ  ~  χ²_{k−1},  as stated at the top of this page.
Exercise: Conduct this test for the PT + Rx data given, under both null hypotheses.
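One way to set up both versions in R (a sketch; chisq.test() uses equal probabilities by default):

# Goodness-of-fit tests for the PT + Rx row
obs <- c(6, 9, 15, 120)                        # None, Minor, Moderate, Major
chisq.test(obs)                                # uniform null: pi = 0.25 each
chisq.test(obs, p = c(.10, .20, .30, .40))     # skewed null distribution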
Ismor Fischer, 5/29/2012 6.3-8

The Birds and the Bees
An Application of the Chi-squared Test to Basic Genetics

Inherited biological traits among humans (e.g., right- or left- handedness) and other organisms are
transmitted from parents to offspring via unit factors called genes, discrete regions of DNA that are
located on chromosomes, which are tightly coiled within the nucleus of a cell. Most human cells normally
contain 46 chromosomes, arranged in 23 pairs (diploid); hence, two copies of each gene. Each copy can
be either dominant (say, A = right-handedness) or recessive (a = left-handedness) for a given trait. The
trait that is physically expressed in the organism, i.e., its phenotype, is determined by which of the three possible combinations of pairs AA, Aa, aa of these two alleles A and a occurs in its genes, i.e., its genotype, and by its interactions with environmental factors: AA is homozygous dominant for right-
handedness, Aa is heterozygous dominant (or hybrid) for right-handedness, and aa is homozygous
recessive for left-handedness. However, reproductive cells (gametes: egg and sperm cells) only have
23 chromosomes, thus a single copy of each gene (haploid). When male and female parents reproduce,
the zygote receives one gene copy either A or a from each parental gamete, restoring diploidy in the
offspring. With two traits, say handedness and eye color (B = brown, b = blue), there are nine possible
genotypes: AABB, AABb, AAbb, AaBB, AaBb, Aabb, aaBB, aaBb, aabb, resulting in four possible
phenotypes. (AaBb is known as a dihybrid.)

According to Mendel's Law of Independent Assortment, segregation of the alleles of one allelic pair during gamete formation is independent of the segregation of the alleles of another allelic pair. Therefore, a homozygous dominant parent AABB has gametes AB, and a homozygous recessive parent aabb has gametes ab; crossing them consequently results in all dihybrid AaBb offspring in the so-called F₁ (or first filial) generation, having gametes AB, Ab, aB, and ab, as shown below.




[Diagram:  Parental Genotypes   AABB × aabb
           Parental Gametes     AB  and  ab
           F₁ Genotype          AaBb
           F₁ Gametes           AB, Ab, aB, ab]
Ismor Fischer, 5/29/2012 6.3-9

It follows that further crossing two such AaBb genotypes results in expected genotype frequencies in the F₂ (second filial) generation that follow a 9:3:3:1 ratio, shown in the 4 × 4 Punnett square below.

Phenotypes                          Expected Frequencies
1 = Right-handed, Brown-eyed        9/16 = 0.5625
2 = Right-handed, Blue-eyed         3/16 = 0.1875
3 = Left-handed, Brown-eyed         3/16 = 0.1875
4 = Left-handed, Blue-eyed          1/16 = 0.0625

For example, in a random sample of n = 400 such individuals, the expected phenotypic values under the null hypothesis H₀: π₁ = 0.5625, π₂ = 0.1875, π₃ = 0.1875, π₄ = 0.0625 are as follows.

Expected Values:     1      2      3      4
                    225     75     75     25       n = 400

These would be compared with the observed values, say

Observed Values:     1      2      3      4
                    234     67     81     18       n = 400

via the Chi-squared Goodness-of-Fit Test:

χ² = (+9)²/225 + (−8)²/75 + (+6)²/75 + (−7)²/25 = 3.653  on df = 3.

Because this is less than the α = .05 Chi-squared critical value of 7.815, the p-value is greater than .05 (its exact value = 0.301), and hence the data provide evidence in support of the 9:3:3:1 ratio in the null hypothesis, at the α = .05 significance level. If this model had been rejected, however, then this would suggest a possible violation of the original assumption of independent assortment of allelic pairs. This is indeed the case in genetic linkage, where the two genes are located in close proximity to one another on the same chromosome.

If two alleles A and a occur with respective frequencies p and q (= 1 − p) in a population, then observed genotype frequencies can be compared with those expected from the Hardy-Weinberg Law (namely p² for AA, 2pq for Aa, and q² for aa) via a similar Chi-squared Test.
F₂ Genotypes                        Female Gametes
                      AB          Ab          aB          ab
Male       AB      AABB (1)    AABb (1)    AaBB (1)    AaBb (1)
Gametes    Ab      AABb (1)    AAbb (2)    AaBb (1)    Aabb (2)
           aB      AaBB (1)    AaBb (1)    aaBB (3)    aaBb (3)
           ab      AaBb (1)    Aabb (2)    aaBb (3)    aabb (4)

(The number in parentheses indicates the resulting phenotype, 1-4, as listed above.)
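A quick check of this goodness-of-fit calculation in R (a sketch using the observed counts above):

# Testing the 9:3:3:1 Mendelian ratio
obs <- c(234, 67, 81, 18)
chisq.test(obs, p = c(9, 3, 3, 1) / 16)    # X-squared = 3.653, df = 3, p-value = 0.301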

Ismor Fischer, 5/29/2012 6.3-10

6.3.2 Variances

Consider k independent, normally-distributed groups X₁ ~ N(μ₁, σ₁), X₂ ~ N(μ₂, σ₂), …, X_k ~ N(μ_k, σ_k). We wish to conduct a formal test for equivariance, or homogeneity of variances.

Null Hypothesis H₀: σ₁² = σ₂² = σ₃² = ⋯ = σ_k²
versus
Alternative Hypothesis Hₐ: At least one of the σᵢ² is different from the others.

Test Statistic

F = s²_max / s²_min  ~  F_{ν₁, ν₂},  where ν₁ and ν₂ are the corresponding numerator and denominator degrees of freedom, respectively.

Formal test: Reject H₀ if the F-statistic is significantly > 1.

[Figure: F-distribution density curves for ν₁ = 20 and ν₂ = 40, 30, 20, 10, 5.]

Comments:

• Other tests: Levene (see §6.2.2), Hartley, Cochran, Bartlett, and Scheffé.

• For what follows (ANOVA), moderate heterogeneity of variances is permissible, especially with large, approximately equal sample sizes n₁, n₂, …, n_k. Hence this test is often not even performed in practice, unless the sample variances s₁², s₂², ..., s_k² appear to be greatly unequal.
Ismor Fischer, 5/29/2012 6.3-11

6.3.3 Means

Assume we have k independent, equivariant, normally-distributed groups X₁ ~ N(μ₁, σ₁), X₂ ~ N(μ₂, σ₂), …, X_k ~ N(μ_k, σ_k), e.g., corresponding to different treatments. We wish to compare the treatment means with each other in order to determine if there is a significant difference among any of the groups. Hence

H₀: μ₁ = μ₂ = ⋯ = μ_k   (There is no difference in treatment means, i.e., no treatment effect.)
vs.
Hₐ: There is at least one treatment mean μᵢ that is different from the others.

The total variation in this system can be decomposed into two disjoint sources:

• variation between the groups (via a treatment s² measure)
• variation within the groups (as measured by s²_pooled).

If the former is significantly larger than the latter (i.e., if the ratio is significantly > 1), then there must be a genuine treatment effect, and the null hypothesis can be rejected.












Recall (from the comment at the end of §2.3) that the sample variance has the general form

s² = Σ (xᵢ − x̄)² / (n − 1) = Sum of Squares / degrees of freedom = SS/df.

That is, SS = (n − 1) s². Using this fact, the powerful technique of Analysis of Variance (ANOVA) separates the total variation of the system into its two disjoint sources (known as partitioning sums of squares), so that a formal test statistic can then be formulated, and a decision regarding the null hypothesis ultimately reached. However, in order to apply this, it is necessary to make the additional assumption of equivariance, i.e., σ₁² = σ₂² = σ₃² = ⋯ = σ_k², testable using the methods of the preceding section.

Ismor Fischer, 5/29/2012 6.3-12
[Figure 1: t₄ distribution, showing the central 0.95 area, the two .025 tails, and the observed test statistic ±4.8 with tail areas .0043 each.]

Example: For simplicity, take k = 2 balanced samples, say of size n₁ = 3 and n₂ = 3, from two independent, normally distributed populations:

X₁: {50, 53, 71}       X₂: {1, 4, 25}
   (x₁₁, x₁₂, x₁₃)        (x₂₁, x₂₂, x₂₃)

The null hypothesis H₀: μ₁ = μ₂ is to be tested against the alternative Hₐ: μ₁ ≠ μ₂ at the α = .05 level of significance, as usual. In this case, the difference in magnitudes between the two samples appears to be sufficiently substantial that significance seems evident, despite the small sample sizes.

The following summary statistics are an elementary exercise:

x̄₁ = 58,  x̄₂ = 10;   s₁² = 129,  s₂² = 171.

Therefore,

s²_pooled = [ (3 − 1)(129) + (3 − 1)(171) ] / [ (3 − 1) + (3 − 1) ] = 600/4 = 150   ( = SS_Error / df_Error ).

We are now in a position to carry out formal testing of the null hypothesis.

Method 1. (Old way: two-sample t-test) In order to use the t-test, we must first verify equivariance, σ₁² = σ₂². The computed sample variances of 129 and 171 are certainly sufficiently close that this condition is reasonably satisfied. (Or, check that the ratio 129/171 is between 0.25 and 4.) Now, recall from the formula for the standard error that

s.e. = √[ 150 (1/3 + 1/3) ] = 10.

Hence,

p-value = 2 P( X̄₁ − X̄₂ ≥ 48 ) = 2 P( T₄ ≥ (48 − 0)/10 ) = 2 P(T₄ ≥ 4.8) = 2(.0043) = .0086 < .05,

so the null hypothesis is (strongly) rejected; a significant difference exists at this level.

Also, the grand mean is calculated as:

x̄ = (50 + 53 + 71 + 1 + 4 + 25) / (3 + 3) = [ 3(58) + 3(10) ] / 6 = 34.
Ismor Fischer, 5/29/2012 6.3-13

Method 2. (New way: ANOVA F-test) We first calculate three Sums of Squares (SS) that measure the variation of the system and its two component sources, along with their associated degrees of freedom (df).

1. Total Sum of Squares = sum of the squared deviations of each observation x_ij from the grand mean x̄.

SS_Total = (50 − 34)² + (53 − 34)² + (71 − 34)² + (1 − 34)² + (4 − 34)² + (25 − 34)² = 4056
df_Total = (3 + 3) − 1 = 5

2. Treatment Sum of Squares = sum of the squared deviations of each group mean x̄ᵢ from the grand mean x̄.

Motivation: In order to measure pure treatment effect, imagine two ideal groups with no within-group variation, i.e., replace each sample value by its sample mean x̄ᵢ:

X₁: {58, 58, 58}     X₂: {10, 10, 10}

SS_Trt = (58 − 34)² + (58 − 34)² + (58 − 34)² + (10 − 34)² + (10 − 34)² + (10 − 34)²
       = 3 (58 − 34)² + 3 (10 − 34)² = 3456

df_Trt = 1.  Reason: As with any deviations, these must satisfy a single constraint: namely, their sum = 3(58 − 34) + 3(10 − 34) = 0. Hence their degrees of freedom = one less than the number of treatment groups (k = 2).

3. Error Sum of Squares = sum of the squared deviations of each observation x_ij from its group mean x̄ᵢ.

SS_Error = (50 − 58)² + (53 − 58)² + (71 − 58)² + (1 − 10)² + (4 − 10)² + (25 − 10)² = 600
df_Error = (3 − 1) + (3 − 1) = 4

Note that  SS_Total = SS_Trt + SS_Error  and  df_Total = df_Trt + df_Error.

Ismor Fischer, 5/29/2012 6.3-14
ANOVA Table

Source      df     SS      MS = SS/df                F = MS_Trt / MS_Err     p-value
Treatment    1    3456    3456  ( = s²_between )            23.04              .0086
Error        4     600     150  ( = s²_within )
Total        5    4056

[Figure 2: F₁,₄ distribution; the area to the right of the test statistic 23.04 is .0086.]

The F₁,₄-score of 23.04 is certainly much greater than 1 (the expected value under the null hypothesis of no treatment difference), and is in fact greater than 7.71, the F₁,₄ critical value for α = .05. Hence the small p-value, and significance is established. In fact, the ratio SS_Trt / SS_Total = 3456/4056 = 0.852 indicates that 85.2% of the total variation in response is due to the treatment effect!





















Comment: Note that 23.04 = (±4.8)², i.e., F₁,₄ = t₄². In general, F₁,df = t_df² for any df. Hence the two tests are mathematically equivalent to each other. Compare Figures 1 and 2.
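The whole ANOVA table can be generated in R from the raw data; a minimal sketch for this example:

# One-way ANOVA on the two groups above
y     <- c(50, 53, 71, 1, 4, 25)
group <- factor(rep(c("X1", "X2"), each = 3))
summary(aov(y ~ group))     # F = 23.04 on (1, 4) df, p-value = 0.0086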
Ismor Fischer, 5/29/2012 6.3-15
General ANOVA formulation

Consider now the general case of k independent, normally-distributed, equivariant groups.

Treatment Groups     X₁ ~ N(μ₁, σ₁)    X₂ ~ N(μ₂, σ₂)    ⋯    X_k ~ N(μ_k, σ_k)
Sample Sizes         n₁                n₂                ⋯    n_k       (n₁ + n₂ + ⋯ + n_k = n)
Group Means          x̄₁                x̄₂                ⋯    x̄_k
Group Variances      s₁²               s₂²               ⋯    s_k²

Grand Mean           x̄ = ( n₁ x̄₁ + n₂ x̄₂ + ⋯ + n_k x̄_k ) / n

Pooled Variance      s²_within = [ (n₁ − 1) s₁² + (n₂ − 1) s₂² + ⋯ + (n_k − 1) s_k² ] / (n − k)

Null Hypothesis H₀: μ₁ = μ₂ = ⋯ = μ_k   (No treatment difference exists.)
Alternative Hyp. Hₐ: μᵢ ≠ μⱼ for some i ≠ j   (A treatment difference exists.)

Source      df        SS                            MS             F-statistic                          p-value
Treatment   k − 1     Σᵢ₌₁ᵏ nᵢ (x̄ᵢ − x̄)²            s²_between     F = MS_Trt/MS_Err ~ F_{k−1, n−k}     0 ≤ p ≤ 1
Error       n − k     Σᵢ₌₁ᵏ (nᵢ − 1) sᵢ²             s²_within
Total       n − 1     Σ_{all i, j} (x_ij − x̄)²

Comments:

• This is referred to as the overall F-test of significance. If the null hypothesis is rejected, then (the mean value of at least) one of the treatment groups is different from the others. But which one(s)?

• Nonparametric form of ANOVA: Kruskal-Wallis Test

• See Appendix > Geometric Viewpoint > ANOVA

Ismor Fischer, 5/29/2012 6.3-16

Multiple Comparison Procedures
How do we decide which groups (if any) are significantly different from the others?
Pairwise comparisons between the two means of individual groups can be t-tested. But
how do we decide which pairs to test, and why should it matter?


A Priori Analysis (Planned Comparisons, decided before any data are collected)

The investigator wishes to perform pairwise t-test comparisons on a fixed number m of specific groups of interest, chosen for scientific or other theoretical reasons.

Example: Group 1 = control, and each experimental group 2, …, k is to be compared with it separately (e.g., testing mean annual crop yields of different seed types against a standard seed type). Then there are m = k − 1 pairwise comparisons, with corresponding null hypotheses H₀: μ₁ = μ₂, H₀: μ₁ = μ₃, H₀: μ₁ = μ₄, …, H₀: μ₁ = μ_k.


A Posteriori (or Post Hoc) Analysis (Unplanned Comparisons, after data are collected)

Data-mining, data-dredging, "fishing expedition," etc. Unlike the above, this should be used only if the ANOVA overall F-test is significant.

Example: Suppose it is decided to compare all possible pairs among Groups 1, …, k, i.e., H₀: μᵢ = μⱼ for all i ≠ j. Then there will be m = "k choose 2" = k(k − 1)/2 such t-tests. For example, if k = 5 groups, then m = 10 pairwise comparisons. Though computationally intensive perhaps, these t-tests pose no problem for a computer.

However…

[Diagram: the k treatment groups X₁, X₂, …, X_k under H₀: μ₁ = μ₂ = ⋯ = μ_k, with a pairwise t-test between every pair of groups.]
Ismor Fischer, 5/29/2012 6.3-17

With a large number m of such comparisons, there is an increased probability of finding a spurious significance (i.e., making a Type 1 error) between two groups, just by chance.

Exercise: Show that this probability = 1 − (1 − α)^m, which goes to 1 as m gets large. The graph for α = .05 is shown below. (Also see Problem 3-21, The Shell Game.)





In m t-test comparisons, the probability of finding at least one significant p-value at the α = .05 level is 1 − (.95)^m, which approaches certainty as m grows. Note that if m = 14, this probability is already greater than 50%.
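A two-line check of this curve in R:

# Probability of at least one spurious "significant" result in m independent tests at alpha = .05
m <- 1:50
round(1 - (1 - 0.05)^m, 3)      # e.g., the 14th entry is about 0.512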







How do we reduce this risk? Various methods exist…

• Bonferroni correction - Lower the significance level of each t-test from α to α* = α/m. (But use the overall ANOVA MS_Error term for s²_pooled.)
  Example: As above, if α = .05 and m = 10 t-tests, then make α* = .05/10 = .005 for each.
• The overall Type 1 error rate remains unchanged.
• Each individual t-test is more conservative, hence less chance of spurious rejection.
• However, the Bonferroni correction can be overly conservative, failing to reject differences known to be statistically significant, e.g., via the ANOVA overall F-test. A common remedy for this is the Holm-Bonferroni correction, in which the α* values are allowed to become slightly larger (i.e., less conservative) with each successive t-test.
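Both corrections are available in base R; a minimal sketch on hypothetical data:

# Pairwise t-tests with multiplicity adjustment of the p-values (hypothetical data)
y     <- c(rnorm(10, 100), rnorm(10, 105), rnorm(10, 112))   # response
group <- factor(rep(c("A", "B", "C"), each = 10))            # group labels
pairwise.t.test(y, group, p.adjust.method = "bonferroni")
pairwise.t.test(y, group, p.adjust.method = "holm")
# Or adjust an existing set of raw p-values directly (values here are illustrative):
p.adjust(c(.004, .013, .021, .048), method = "holm")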
Other methods include:
Fisher's Least Significant Difference (LSD) Test
Tukey's Honest Significant Difference (HSD) Test
Newman-Keuls Test
Ismor Fischer, 5/29/2012 6.3-18

Without being aware of this phenomenon, a researcher might be tempted to report a
random finding as being evidence of a genuine statistical significance, when in fact it
might simply be an artifact of conducting a large number of individual experiments.
Such a result should be regarded as the starting point of more rigorous investigation…

Famous example ~ Case-control study involving coffee and pancreatic cancer
[Photo caption: Former chair, Harvard School of Public Health]
Ismor Fischer, 5/29/2012 6.3-19

The findings make it to the media

First public reaction:
PANIC?!!
Do we have to stop drinking coffee??


Second public reaction:
"Hold on… Coffee has been around for a long time, and so have cancer studies. This is the first time any connection like this has ever been reported. I'll keep it in mind, but let's just wait and see…"


Ismor Fischer, 5/29/2012 6.3-20
Scientific doubts are very quickly raised.

Many sources of BIAS exist, including (but not limited to):
For convenience, cases (with pancreatic cancer) were
chosen from a group of patients hospitalized by the same
physicians who had diagnosed and hospitalized the
controls (with non-cancerous diseases of the digestive
system). Therefore, investigators who interviewed
patients about their coffee consumption history knew in
advance who did or did not have pancreatic cancer,
possibly introducing unintentional selection bias.
Also, either on their own, or on advice from their
physicians, patients with noncancerous gastrointestinal
illness frequently stop drinking coffee, thereby biasing
the proportion of coffee drinkers away from the control
group, who are to be compared to the cases with cancer.
Investigators were fishing for any association between
pancreatic cancer and multiple possible risk factors
including coffee, tea, alcohol, pipe smoking, and cigar
smoking (while adjusting for cigarette smoking history,
since this is a known confounding variable for
pancreatic cancer) but they did not Bonferroni correct!
Publication bias: Many professional research journals
prefer only to publish articles that result in positive
(i.e., statistically significant) study outcomes, rather than
negative ones. (This may be changing, somewhat.)


For more info, see http://www.stat.wisc.edu/~ifischer/Intro_Stat/Lecture_Notes/6_-_Statistical_Inference/BIAS.pdf
and on its source website http://www.medicine.mcgill.ca/epidemiology/pai/teaching.htm.
Ismor Fischer, 5/29/2012 6.3-21

Results could not be replicated by others, including the original investigators, in subsequent studies. Eventual consensus: No association.

Moral: You can't paint a bull's-eye around an arrow after it's been fired at a target.

To date, no association has been found between coffee and pancreatic cancer, or any other life-threatening medical illness.

"Coffee is a substance in search of a disease." (Old adage)
Ismor Fischer, 5/20/2014 6.4-1

6.4 Problems

NOTE: Before starting these problems, it might be useful to review pages 1.3-1 and 2.1-1.

1. Suppose that a random sample of n = 102 children is selected from the population of newborn infants in Mexico. The probability that a child in this population weighs at most 2500 grams is presumed to be π = 0.15. Calculate the probability that thirteen or fewer of the infants weigh at most 2500 grams, using

(a) the exact binomial distribution (Tip: Use the function pbinom in R),
(b) the normal approximation to the binomial distribution (with continuity correction).

Suppose we wish to test the null hypothesis H₀: π = 0.15 versus the alternative Hₐ: π ≠ 0.15, and that in this random sample of n = 102 children, we find thirteen whose weights are under 2500 grams. Use this information to decide whether or not to reject H₀ at the α = .05 significance level, and interpret your conclusion in context.

(c) Calculate the p-value, using the normal approximation to the binomial with continuity correction. (Hint: See (b).) Also compute the 95% confidence interval.
(d) Calculate the exact p-value, via the function binom.test in R.


2. A new "smart pill" is tested on n = 36 individuals randomly sampled from a certain population whose IQ scores are known to be normally distributed, with mean μ = 100 and standard deviation σ = 27. After treatment, the sample mean IQ score is calculated to be x̄ = 109.9, and a two-sided test of the null hypothesis H₀: μ = 100 versus the alternative hypothesis Hₐ: μ ≠ 100 is performed, to see if there is any statistically significant difference from the mean IQ score of the original population. Using this information, answer the following.

(a) Calculate the p-value of the sample.

(b) Fill in the following table, concluding with the decision either to reject or not reject the null hypothesis H₀ at the given significance level α.

Significance Level α    Confidence Level 1 − α    Confidence Interval    Decision about H₀
       .10
       .05
       .01

(c) Extend these observations to more general circumstances. Namely, as the significance level α decreases, what happens to the ability to reject a null hypothesis? Explain why this is so, in terms of the p-value and generated confidence intervals.

Ismor Fischer, 5/20/2014 6.4-2

3. Consider the distribution of serum cholesterol levels for all 20- to 74-year-old males living in the United States. The mean of this population is 211 mg/dL, and the standard deviation is 46.0 mg/dL. In a study of a subpopulation of such males who smoke and are hypertensive, it is assumed (not unreasonably) that the distribution of serum cholesterol levels is normally distributed, with unknown mean μ, but with the same standard deviation σ as the original population.

(a) Formulate the null hypothesis and complementary alternative hypothesis for testing whether the unknown mean serum cholesterol level of the subpopulation of hypertensive male smokers is equal to the known mean serum cholesterol level of 211 mg/dL of the general population of 20- to 74-year-old males.
(b) In the study, a random sample of size n = 12 hypertensive smokers was selected, and found to have a sample mean cholesterol level of x̄ = 217 mg/dL. Construct a 95% confidence interval for the true mean cholesterol level of this subpopulation.
(c) Calculate the p-value of this sample, at the α = .05 significance level.
(d) Based on your answers in parts (b) and (c), is the null hypothesis rejected in favor of the alternative hypothesis, at the α = .05 significance level? Interpret your conclusion: What exactly has been demonstrated, based on the empirical evidence?
(e) Determine the 95% acceptance region and complementary rejection region for the null hypothesis. Is this consistent with your findings in part (d)? Why?


4. Consider a random sample of ten children selected from a population of infants receiving antacids
that contain aluminum, in order to treat peptic or digestive disorders. The distribution of plasma
aluminum levels is known to be approximately normal; however its mean and standard deviation
are not known. The mean aluminum level for the sample of n = 10 infants is found to be
x = 37.20 g/l and the sample standard deviation is s = 7.13 g/l. Furthermore, the mean plasma
aluminum level for the population of infants not receiving antacids is known to be only 4.13 g/l.
(a) Formulate the null hypothesis and complementary alternative hypothesis, for a two-sided test
of whether the mean plasma aluminum level of the population of infants receiving antacids is
equal to the mean plasma aluminum level of the population of infants not receiving antacids.
(b) Construct a 95% confidence interval for the true mean plasma aluminum level of the population
of infants receiving antacids.

(c) Calculate the p-value of this sample (as best as possible), at the = .05 significance level.
(d) Based on your answers in parts (b) and (c), is the null hypothesis rejected in favor of the
alternative hypothesis, at the = .05 significance level? I nterpret your conclusion: What
exactly has been demonstrated, based on the empirical evidence?
(e) With the knowledge that significantly elevated plasma aluminum levels are toxic to human
beings, reformulate the null hypothesis and complementary alternative hypothesis, for the
appropriate one-sided test of the mean plasma aluminum levels. With the same sample data as
above, how does the new p-value compare with that found in part (c), and what is the resulting
conclusion and interpretation?
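
FYI: The hand calculations in parts (b)-(e) can be verified with a short R sketch such as the one
below (ours, not the author's); since only summary statistics are given, the t-based interval and
p-values are computed directly rather than with t.test().

   mu0 <- 4.13; n <- 10; xbar <- 37.20; s <- 7.13
   se <- s / sqrt(n)
   xbar + c(-1, 1) * qt(0.975, df = n - 1) * se    # 95% CI for mu
   t.stat <- (xbar - mu0) / se
   2 * pt(-abs(t.stat), df = n - 1)                # two-sided p-value, part (c)
   pt(t.stat, df = n - 1, lower.tail = FALSE)      # one-sided p-value, part (e)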
Ismor Fischer, 5/20/2014 6.4-3

5. Refer to Problem 4.4/2.

(a) Suppose we wish to formally test the null hypothesis H0: μ = 25 against the alternative
HA: μ ≠ 25, at the α = .05 significance level, by using the random sample of n = 80 given.

Calculate the p-value, and verify that in fact, this sample leads to an incorrect conclusion.
[[Hint: Use the Central Limit Theorem to approximate the sampling distribution of X̄
with the normal distribution N(μ, σ/√n).]] Which type of error (Type I or Type II) is
committed here, and why?

(b) Now suppose we wish to formally test the null hypothesis H0: μ = 27 against the specific
alternative HA: μ = 25, at the α = .05 significance level, using the same random sample of
n = 80 trials.

How much power exists (i.e., what is the probability) of inferring the correct conclusion?

Calculate the p-value, and verify that, once again, this sample in fact leads to an incorrect
conclusion. [[Use the same hint as in part (a).]] Which type of error (Type I or Type II)
is committed here, and why?

6. Two physicians are having a disagreement about the effectiveness of chicken soup in relieving
common cold symptoms. While both agree that the number of symptomatic days generally
follows a normal distribution, physician A claims that "most colds last about a week; chicken
soup makes no difference," whereas physician B argues that it does. They decide to settle the
matter by performing a formal two-sided test of the null hypothesis H0: μ = 7 days, versus the
alternative HA: μ ≠ 7 days.

(a) After treating a random sample of n = 16 cold patients with chicken soup, they calculate a
mean number of symptomatic days x̄ = 5.5, and standard deviation s = 3.0 days. Using
either the 95% confidence interval or the p-value (or both), verify that the null hypothesis
cannot be rejected at the α = .05 significance level.

(b) Physician A is delighted, but can predict physician B's rebuttal: "The sample size was too
small! There wasn't enough power to detect a statistically significant difference between
μ = 7 days, and say μ = 5 days, even if there was one present!" Calculate the minimum
sample size required in order to achieve at least 99% power of detecting such a genuine
difference, if indeed one actually exists. (Note: Use s to estimate σ.)

(c) Suppose that, after treating a random sample of n = 49 patients, they calculate the mean
number of symptomatic days x̄ = 5.5 (as before), and standard deviation s = 2.8 days. Using
either the 95% confidence interval or the p-value (or both), verify that the null hypothesis
can now be rejected at the α = .05 significance level.

FYI: The long-claimed ability of chicken soup (sometimes referred to as "Jewish penicillin")
to combat colds has actually been the subject of several well-known published studies,
starting with a 1978 seminal paper written by researchers at Mount Sinai Hospital in NYC.
The heat does serve to break up chest congestion, but it turns out that there are many other
surprising cold-fighting benefits, far beyond just that. Who knew? Evidently Mama.
See http://well.blogs.nytimes.com/2007/10/12/the-science-of-chicken-soup/.
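
FYI: For part (b), the built-in R function power.t.test() gives an approximate check of the hand
calculation (a sketch only; it uses the t distribution, so its answer may differ slightly from a
Z-based formula):

   power.t.test(delta = 2, sd = 3.0, sig.level = 0.05, power = 0.99,
                type = "one.sample", alternative = "two.sided")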
Ismor Fischer, 5/20/2014 6.4-4

7. Toxicity Testing. [Tip: See page 6.1-28] According to the EPA (Environmental Protection
Agency), drinking water can contain no more than 10 ppb (parts per billion) of arsenic, in order
to be considered safe for human consumption. (This is known as the Maximum Contaminant
Level, or MCL.) Suppose that the concentration X of arsenic in a typical water source is known
to be normally distributed, with an unknown mean μ and standard deviation σ. A random sample
of n = 121 independent measurements is to be taken, from which the sample mean x̄ and sample
standard deviation s are calculated, and used in formal hypothesis testing. The following sample
data for four water sources are obtained:

    Source 1:  x̄ = 11.43 ppb,  s = 5.5 ppb
    Source 2:  x̄ =  8.57 ppb,  s = 5.5 ppb
    Source 3:  x̄ =  9.10 ppb,  s = 5.5 ppb
    Source 4:  x̄ = 10.90 ppb,  s = 5.5 ppb

(a) For each water source, answer the following questions to test the null hypothesis H0: μ = 10 ppb,
vs. the two-sided alternative hypothesis HA: μ ≠ 10 ppb, at the α = .05 significance level.

(i) Just by intuitive inspection, i.e., without first conducting any formal calculations, does
this sample mean suggest that the water might be safe, or unsafe, to drink? Why??

(ii) Calculate the p-value of this sample (to the closest entries of the appropriate table), and
use it to draw a formal conclusion about whether or not the null hypothesis can be
rejected in favor of the alternative, at the α = .05 significance level.

(iii) Interpret: According to your findings, is the result statistically significant? That is,
Is the water unsafe to drink? Does this agree with your informal reasoning in (i)?

(b) For the hypothesis test in (a), what is the two-sided 5% rejection region for this H0?
Is it consistent with your findings?

(c) One-sided hypothesis tests can be justifiably used in some contexts, such as situations where
one direction (either < or >) is impossible (for example, a human knee cannot flex backwards),
or irrelevant, as in toxicity testing here. We are really not concerned if the mean μ is
significantly below 10 ppb, only above. With this in mind, repeat the instructions in (a) above,
to test the left-sided null hypothesis H0: μ ≤ 10 ppb (i.e., safe) versus the right-sided
alternative HA: μ > 10 ppb (i.e., unsafe) at the α = .05 significance level.

(d) Suppose a fifth water source yields x̄ = 10.6445 ppb and s = 5.5 ppb. Repeat part (c).

(e) For the hypothesis test in (c), what is the exact cutoff ppb level for x̄, above which we can
conclude that the water is unsafe? (Compare Sources 4 and 5, for example.) That is, what is
the one-sided 5% rejection region for this H0? Is it consistent with your findings?

(f) Summarize these results, and make some general conclusions regarding advantages and
disadvantages of using a one-sided test, versus a two-sided test, in this context.
[Hint: Compare the practical results in (a) and (c) for Source 4, for example.]
Ismor Fischer, 5/20/2014 6.4-2

8. Do the Exercise on page 6.1-20.


9.
(a) In R, type the following command to generate a data set called x of 1000 random values.
x = rf(1000, 5, 20)
Obtain a graph of its frequency histogram by typing hist(x). Include this graph as part
of your submitted homework assignment. (Do not include the 1000 data values!)
Next construct a normal q-q plot by typing qqnorm(x, pch = 19). Include this plot
as part of your submitted homework assignment.
(b) Now define a new data set called y by taking the (natural) logarithm of x.
y = log(x)
Obtain a graph of its frequency histogram by typing hist(y). Include this graph as part
of your submitted homework assignment. (Do not include the 1000 data values!)
Then construct a normal q-q plot by typing qqnorm(y, pch = 19). Include this plot
as part of your submitted homework assignment.
(c) Summarize the results in (a) and (b). In particular, from their respective histograms and q-q
plots, what general observation can be made regarding the distributions of x and
y = log(x)? (Hint: See pages 6.1-25 through 6.1-27.)
Ismor Fischer, 5/20/2014 6.4-3

10. In this problem, assume that population cholesterol level is normally distributed.

(a) Consider a small clinical trial, designed to measure the efficacy of a new cholesterol-
lowering drug against a placebo. A group of six high-cholesterol patients is randomized to
either a treatment arm or a control arm, resulting in two numerically balanced samples of
n1 = n2 = 3 patients each, in order to test the null hypothesis H0: μ1 = μ2 vs. the alternative
HA: μ1 ≠ μ2. Suppose that the data below are obtained.

        Placebo    Drug
          220       180
          240       200
          290       220

Obtain the 95% confidence interval for μ1 − μ2, and the p-value of the data, and use each to
decide whether or not to reject H0 at the α = .05 significance level. Conclusion?

(b) Now imagine that the same drug is tested using another pilot study, with a different design.
Serum cholesterol levels of n = 3 patients are measured at the beginning of the study, then re-
measured after a six-month treatment period on the drug, in order to test the null hypothesis
H0: μ1 = μ2 versus the alternative HA: μ1 ≠ μ2. Suppose that the data below are obtained.

        Baseline    End of Study
          220            180
          240            200
          290            220

Obtain the 95% confidence interval for μ1 − μ2, and the p-value of the data, and use each to
decide whether or not to reject H0 at the α = .05 significance level. Conclusion?

(c) Compare and contrast these two study designs and their results.

(d) Redo (a) and (b) using R (see hint). Show agreement between your answers and the output.

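FYI: One possible R sketch for part (d), assuming equal variances in the two-sample case:

   placebo <- c(220, 240, 290);  drug <- c(180, 200, 220)
   t.test(placebo, drug, var.equal = TRUE)          # (a) independent samples, pooled t-test
   baseline <- c(220, 240, 290); end <- c(180, 200, 220)
   t.test(baseline, end, paired = TRUE)             # (b) paired t-test on the same 3 patients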
Ismor Fischer, 5/20/2014 6.4-4

11. In order to determine whether children with cystic fibrosis have a normal level of iron in their
blood on average, a study is performed to detect any significant difference in mean serum iron
levels between this population and the population of healthy children, both of which are
approximately normally distributed with unknown standard deviations. A random sample of
n1 = 9 healthy children has mean serum iron level x̄1 = 18.9 μmol/l and standard deviation
s1 = 5.9 μmol/l; a sample of n2 = 13 children with cystic fibrosis has mean serum iron level
x̄2 = 11.9 μmol/l and standard deviation s2 = 6.3 μmol/l.

(a) Formulate the null hypothesis and complementary alternative hypothesis, for testing
whether the mean serum iron level μ1 of the population of healthy children is equal to the
mean serum iron level μ2 of children with cystic fibrosis.

(b) Construct the 95% confidence interval for the mean serum iron level difference μ1 − μ2.

(c) Calculate the p-value for this experiment, under the null hypothesis.

(d) Based on your answers in parts (b) and (c), is the null hypothesis rejected in favor of the
alternative hypothesis, at the α = .05 significance level? Interpret your conclusion: What
exactly has been demonstrated, based on the sample evidence?
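
FYI: Because only summary statistics are given, an R check must assemble the pooled two-sample
t interval and test directly; a sketch (ours), assuming equal variances as in § 6.2.1:

   n1 <- 9;  x1 <- 18.9; s1 <- 5.9
   n2 <- 13; x2 <- 11.9; s2 <- 6.3
   sp2 <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2)   # pooled variance
   se  <- sqrt(sp2 * (1/n1 + 1/n2));  df <- n1 + n2 - 2
   (x1 - x2) + c(-1, 1) * qt(0.975, df) * se     # 95% CI for mu1 - mu2
   2 * pt(-abs((x1 - x2) / se), df)              # two-sided p-value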


12. Methylphenidate is a drug that is widely used in the treatment of attention deficit disorder
(ADD). As part of a crossover study, ten children between the ages of 7 and 12 who suffered
from this disorder were assigned to receive the drug and ten were given a placebo. After a fixed
period of time, treatment was withdrawn from all 20 children and, after a "washout" period of
no treatment for either group, subsequently resumed after switching the treatments between the
two groups. Measures of each child's attention and behavioral status, both on the drug and on
the placebo, were obtained using an instrument called the Parent Rating Scale. Distributions of
these scores are approximately normal with unknown means and standard deviations. In
general, lower scores indicate an increase in attention. It is found that the random sample of
n = 20 children enrolled in the study has a sample mean attention rating score of x̄methyl = 10.8
and standard deviation smethyl = 2.9 when taking methylphenidate, and mean rating score
x̄placebo = 14.0 and standard deviation splacebo = 4.8 when taking the placebo.

(a) Calculate the 95% confidence interval for μplacebo, the mean attention rating score of the
population of children taking the placebo.

(b) Calculate the 95% confidence interval for μmethyl, the mean attention rating score of the
population of children taking the drug.

(c) Comparing these two confidence intervals side-by-side, develop an informal conclusion
about the efficacy of methylphenidate, based on this experiment. Why can this not be used
as a formal test of the hypothesis H0: μplacebo = μmethyl, vs. the alternative HA: μplacebo ≠ μmethyl,
at the α = .05 significance level? (Hint: See next problem.)

Ismor Fischer, 5/20/2014 6.4-5

13. A formal hypothesis test for two-sample means using the confidence interval for μ1 − μ2
is generally NOT equivalent to an informal side-by-side comparison of the individual
confidence intervals for μ1 and μ2 for detecting overlap between them.

(a) Suppose that two population random variables X1 and X2 are normally distributed, each
with standard deviation σ = 50. We wish to test the null hypothesis H0: μ1 = μ2 versus the
alternative HA: μ1 ≠ μ2, at the α = .05 significance level. Two independent, random
samples are selected, each of size n = 100, and it is found that the corresponding means are
x̄1 = 215 and x̄2 = 200, respectively. Show that even though the two individual 95%
confidence intervals for μ1 and μ2 overlap, the formal 95% confidence interval for the mean
difference μ1 − μ2 does not contain the value 0, and hence the null hypothesis can be
rejected. (See middle figure below.)

(b) In general, suppose that X1 ~ N(μ1, σ) and X2 ~ N(μ2, σ), with equal σ (for simplicity).
In order to test the null hypothesis H0: μ1 = μ2 versus the two-sided alternative HA: μ1 ≠ μ2
at the α significance level, two random samples are selected, each of the same size n (for
simplicity), resulting in corresponding means x̄1 and x̄2, respectively. Let CI(μ1) and CI(μ2)
be the respective 100(1 − α)% confidence intervals, and let

        d = |x̄1 − x̄2| / (z_α/2 · σ/√n).

(Note that the denominator is simply the margin of error for the confidence intervals.) Also let
CI(μ1 − μ2) be the 100(1 − α)% confidence interval for the true mean difference μ1 − μ2. Prove:

• If d < √2, then 0 ∈ CI(μ1 − μ2) (i.e., accept H0), and CI(μ1) ∩ CI(μ2) ≠ ∅ (i.e., overlap).

• If √2 < d < 2, then 0 ∉ CI(μ1 − μ2) (i.e., reject H0), but CI(μ1) ∩ CI(μ2) ≠ ∅ (i.e., overlap)!

• If d > 2, then 0 ∉ CI(μ1 − μ2) (i.e., reject H0), and CI(μ1) ∩ CI(μ2) = ∅ (i.e., no overlap).

[Figure: three panels, one for each case above. Each panel shows the individual confidence
intervals centered at x̄1 and x̄2 (top), and the confidence interval for the difference, centered at
x̄1 − x̄2, together with the reference value 0 (bottom).]
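
FYI: A quick numerical illustration of part (a) in R (our sketch); note that the two individual
intervals overlap, while the interval for the difference excludes 0, and that here d lies between
√2 and 2, the middle case above.

   sigma <- 50; n <- 100; x1 <- 215; x2 <- 200
   me <- qnorm(0.975) * sigma / sqrt(n)       # margin of error of each individual 95% CI
   c(x1 - me, x1 + me)                        # 95% CI for mu1
   c(x2 - me, x2 + me)                        # 95% CI for mu2  (overlaps the CI for mu1)
   (x1 - x2) + c(-1, 1) * qnorm(0.975) * sigma * sqrt(2/n)   # 95% CI for mu1 - mu2
   abs(x1 - x2) / me                          # the quantity d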
Ismor Fischer, 5/20/2014 6.4-6

14. Z-tests and Chi-squared Tests

(a) Test of Independence (1 population, 2 random responses). Imagine that a marketing
research study surveys a random sample of n = 2000 consumers about their responses
regarding two brands (A and B) of a certain product, with the following observed results.

                                     Do You Like Brand B?
                                        Yes        No
    Do You Like      Yes                335        915       1250
    Brand A?         No                 165        585        750
                                        500       1500       2000

First consider the null hypothesis H0: πA|B = πA|Bᶜ, that is, in this consumer population,
"The probability of liking A, given that B is liked, is equal to the probability of liking A, given
that B is not liked."

⇔ "There is no association between liking A and liking B."

⇔ "Liking A and liking B are independent of each other."
[Why? See Problem 3.5/22(a).]

Calculate the point estimate π̂A|B − π̂A|Bᶜ. Determine the Z-score of this sample (and
thus whether or not H0 is rejected at α = .05). Conclusion?

Now consider the null hypothesis H0: πB|A = πB|Aᶜ, that is, in this consumer population,
"The probability of liking B, given that A is liked, is equal to the probability of liking B, given
that A is not liked."

⇔ "There is no association between liking B and liking A."

⇔ "Liking B and liking A are independent of each other."

Calculate the point estimate π̂B|A − π̂B|Aᶜ. Determine the Z-score of this sample (and
thus whether or not H0 is rejected at α = .05). How does it compare with the previous
Z-score? Conclusion?

Compute the Chi-squared score. How does it compare with the preceding Z-scores?
Conclusion?

Ismor Fischer, 5/20/2014 6.4-7

(b) Test of Homogeneity (2 populations, 1 random response). Suppose that, for the sake of
simplicity, the same data are obtained in a survey that compares the probability of liking
Brand A between two populations.

                                       City 1    City 2
    Do You Like      Yes                335        915       1250
    Brand A?         No                 165        585        750
                                        500       1500       2000

Here, the null hypothesis is H0: πA|City 1 = πA|City 2, that is,
"The probability of liking A in the City 1 population is equal to the probability of liking A in
the City 2 population."

⇔ "City 1 and City 2 populations are homogeneous with respect to liking A."

⇔ "There is no association between city and liking A."

How do these corresponding Z and Chi-squared test statistics compare with those in (a)?
Conclusion?
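
FYI: A sketch (ours) for checking the Z- and Chi-squared scores; prop.test() and chisq.test()
with correct = FALSE give the uncorrected statistics that match the hand formulas, and the
chi-squared value should equal the square of the Z-score.

   tab <- matrix(c(335, 915,
                   165, 585), nrow = 2, byrow = TRUE)
   prop.test(x = c(335, 915), n = c(500, 1500), correct = FALSE)   # compares 335/500 vs. 915/1500
   chisq.test(tab, correct = FALSE)                                # same statistic; also applies to (b)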

Ismor Fischer, 5/20/2014 6.4-8

15. Consider the following 2 × 2 contingency table taken from a retrospective case-control study
that investigates the proportion of diabetes sufferers among acute myocardial infarction (heart
attack) victims in the Navajo population residing in the United States.

                              MI
                        Yes       No      Total
    Diabetes   Yes       46       25        71
               No        98      119       217
    Total               144      144       288

(a) Conduct a Chi-squared Test for the null hypothesis H0: πDiabetes|MI = πDiabetes|No MI versus the
alternative HA: πDiabetes|MI ≠ πDiabetes|No MI. Determine whether or not we can reject the null
hypothesis at the α = .01 significance level. Interpret your conclusion: At the α = .01
significance level, what exactly has been demonstrated about the proportion of diabetics among
the two categories of heart disease in this population?

(b) In the study design above, the 144 victims of myocardial infarction (cases) and the 144
individuals free of heart disease (controls) were actually age- and gender-matched. The
members of each case-control pair were then asked whether they had ever been diagnosed with
diabetes. Of the 46 individuals who had experienced MI and who were diabetic, it turned out
that 9 were paired with diabetics and 37 with non-diabetics. Of the 98 individuals who had
experienced MI but who were not diabetic, it turned out that 16 were paired with diabetics and
82 with non-diabetics. Therefore, each cell in the resulting 2 × 2 contingency table below
corresponds to the combination of responses for age- and gender-matched case-control pairs,
rather than individuals.

                                     MI
                          Diabetes     No Diabetes     Totals
    No MI   Diabetes          9            16             25
            No Diabetes      37            82            119
    Totals                   46            98            144

Conduct a McNemar Test for the null hypothesis H0: "The number of (diabetic, MI case -
non-diabetic, non-MI control) pairs is equal to the number of (non-diabetic, MI case -
diabetic, non-MI control) pairs, who have been matched on age and gender," or more
succinctly, H0: "There is no association between diabetes and myocardial infarction in the
Navajo population, adjusting for age and gender." Determine whether or not we can reject the
null hypothesis at the α = .01 significance level. Interpret your conclusion: At the α = .01
significance level, what exactly has been demonstrated about the association between diabetes
and myocardial infarction in this population?

(c) Why does the McNemar Test only consider discordant case-control pairs? Hint: What, if
anything, would a concordant pair (i.e., either both individuals in a MI case - No MI control
pair are diabetic, or both are non-diabetic) reveal about a diabetes-MI association, and why?

(d) Redo this problem with R, using chisq.test and mcnemar.test.
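
FYI: One way to carry out part (d); a sketch (continuity corrections are turned off here so that the
statistics match the hand formulas; compare also with the default correct = TRUE):

   unmatched <- matrix(c(46, 25,
                         98, 119), nrow = 2, byrow = TRUE)   # rows: Diabetes Yes/No; cols: MI Yes/No
   chisq.test(unmatched, correct = FALSE)

   pairs <- matrix(c(9, 16,
                     37, 82), nrow = 2, byrow = TRUE)        # rows: No-MI control; cols: MI case
   mcnemar.test(pairs, correct = FALSE)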



Ismor Fischer, 5/20/2014 6.4-9

16. The following data are taken from a study that attempts to determine whether the use of
electronic fetal monitoring ("exposure") during labor affects the frequency of caesarian
section deliveries ("disease"). Of the 5824 infants included in the study, 2850 were
electronically monitored during labor and 2974 were not. Results are displayed in the 2 × 2
contingency table below.

                             Caesarian Delivery
                             Yes         No        Totals
    EFM          Yes         358        2492        2850
    Exposure     No          229        2745        2974
    Totals                   587        5237        5824


(a) Calculate a point estimate for the population odds ratio OR, and interpret.

(b) Compute a 95% confidence interval for the population odds ratio OR.

(c) Based on your answer in part (b), show that the null hypothesis H0: OR = 1 can be rejected in
favor of the alternative HA: OR ≠ 1, at the α = .05 significance level. Interpret this
conclusion: What exactly has been demonstrated about the association between electronic
fetal monitoring and caesarian section delivery? Be precise.

(d) Does this imply that electronic monitoring somehow causes a caesarian delivery? Can the
association possibly be explained any other way? If so, how?
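
FYI: The calculations in (a)-(c) can be checked with a few lines of R (our sketch), using the usual
large-sample standard error of the log odds ratio (compare with the formula given in the notes):

   n11 <- 358; n12 <- 2492; n21 <- 229; n22 <- 2745
   OR.hat <- (n11 * n22) / (n12 * n21)                  # point estimate of OR
   se.log <- sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)        # estimated s.e. of ln(OR)
   exp(log(OR.hat) + c(-1, 1) * qnorm(0.975) * se.log)  # 95% CI for OR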

Ismor Fischer, 5/20/2014 6.4-10

17. The following data come from two separate studies, both conducted in San Francisco, that
investigate various risk factors for epithelial ovarian cancer.

    Study 1: Disease Status
                                   Cancer    No Cancer    Total
    Term            None             31          93         124
    Pregnancies     One or More      80         379         459
                    Total           111         472         583

    Study 2: Disease Status
                                   Cancer    No Cancer    Total
    Term            None             39          74         113
    Pregnancies     One or More     149         465         614
                    Total           188         539         727

(a) Compute point estimates ÔR1 and ÔR2 of the respective odds ratios OR1 and OR2 of the two
studies, and interpret.

(b) In order to determine whether or not we may combine information from the two tables, it is
first necessary to conduct a Test of Homogeneity on the null hypothesis H0: OR1 = OR2, vs.
the alternative HA: OR1 ≠ OR2, by performing the following steps.

Step 1: First, calculate l1 = ln(ÔR1) and l2 = ln(ÔR2), in the usual way.

Step 2: Next, using the definition of the estimated standard error ŝ.e. given in the notes,
        calculate the weights

            w1 = 1 / (ŝ.e.1)²   and   w2 = 1 / (ŝ.e.2)².

Step 3: Compute the weighted mean of l1 and l2:

            L = (w1 l1 + w2 l2) / (w1 + w2).

Step 4: Finally, calculate the test statistic

            χ² = w1 (l1 − L)² + w2 (l2 − L)²,

        which follows an approximate χ² distribution, with 1 degree of freedom.

Step 5: Use this information to show that the null hypothesis cannot be rejected at the α = .05
        significance level, and that the information from the two tables may therefore be
        combined.

(c) Hence, calculate the Mantel-Haenszel estimate of the summary odds ratio:

        ÔRsummary = [ (a1 d1 / n1) + (a2 d2 / n2) ] / [ (b1 c1 / n1) + (b2 c2 / n2) ].





Ismor Fischer, 5/20/2014 6.4-11

(d) To compute a 95% confidence interval for the summary odds ratio ORsummary, we must first
verify that the sample sizes in the two studies are large enough to ensure that the method used
is valid.

Step 1: Verify that the expected number of observations of the (i, j)th cell in the first table, plus
        the expected number of observations of the corresponding (i, j)th cell in the second
        table, is greater than or equal to 5, for i = 1, 2 and j = 1, 2. Recall that the expected
        number of the (i, j)th cell is given by Eij = Ri Cj / n.

Step 2: By its definition, the quantity L computed in part (b) is a weighted mean of log-odds
        ratios, and already represents a point estimate of ln(ORsummary). The estimated
        standard error of L is given by

            ŝ.e.(L) = 1 / √(w1 + w2).

Step 3: From these two values in Step 2, construct a 95% confidence interval for ln(ORsummary),
        and exponentiate it to derive a 95% confidence interval for ORsummary itself.

(e) Also compute the value of the Chi-squared test statistic for ORsummary given at the end of § 6.2.3.

(f) Use the confidence interval in (d), and/or the χ²1 statistic in (e), to perform a Test of
Association of the null hypothesis H0: ORsummary = 1, versus the alternative HA: ORsummary ≠ 1,
at the α = .05 significance level. Interpret your conclusion: What exactly has been
demonstrated about the association between the number of term pregnancies and the odds of
developing epithelial ovarian cancer? Be precise.

(g) Redo this problem in R, using the code found in the link below, and compare results.
http://www.stat.wisc.edu/~ifischer/Intro_Stat/Lecture_Notes/Rcode/
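
FYI: Independently of the posted code, R's built-in mantelhaen.test() provides a rough cross-check
of parts (c)-(f): it reports a Mantel-Haenszel estimate of the summary odds ratio, a confidence
interval, and a test statistic (computed with formulas that may differ slightly from those above).

   # 2 x 2 x 2 array: rows = term pregnancies (None, One or More),
   # columns = disease status (Cancer, No Cancer), layers = Study 1, Study 2
   studies <- array(c(31, 80, 93, 379,      # Study 1 (filled column by column)
                      39, 149, 74, 465),    # Study 2
                    dim = c(2, 2, 2))
   mantelhaen.test(studies, correct = FALSE)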


Ismor Fischer, 5/20/2014 6.4-12

18.
(a) Suppose a survey determines the political orientation of 60 men in a certain community:

             Left    Middle    Right
    Men       12       18        30       60

Among these men, calculate the proportion belonging to each political category. Then
show that a Chi-squared Test of the null hypothesis of equal proportions

    H0: πLeft | Men = πMid | Men = πRight | Men

leads to its rejection at the α = .05 significance level. Conclusion?

(b) Suppose the survey also determines the political orientation of 540 women in the same
community:

             Left    Middle    Right
    Women    108      162       270      540

Among these women, calculate the proportion belonging to each political category. How do
these proportions compare with those in (a)? Show that a Chi-squared Test of the null
hypothesis of equal proportions

    H0: πLeft | Women = πMid | Women = πRight | Women

leads to its rejection at the α = .05 significance level. Conclusion?

(c) Suppose the two survey results are combined:

             Left    Middle    Right
    Men       12       18        30       60
    Women    108      162       270      540
             120      180       300      600

Among the individuals in each gender (i.e., row), the proportions belonging to each
political category (i.e., column) of course match those found in (a) and (b), respectively.
Therefore, show that a Chi-squared Test of the null hypothesis of equal proportions

    H0: πLeft | Men = πLeft | Women  AND  πMid | Men = πMid | Women  AND  πRight | Men = πRight | Women

leads to a 100% acceptance at the α = .05 significance level. Conclusion?

NOTE: The closely-resembling null hypothesis

    H0: πMen | Left = πWomen | Left  AND  πMen | Mid = πWomen | Mid  AND  πMen | Right = πWomen | Right

tests for equal proportions of men and women within each political category, which is
very different from the above. Based on the sample proportions (0.1 vs. 0.9), it is likely to be
rejected, but each column would need to be formally tested by a separate Goodness-of-Fit test.
Ismor Fischer, 5/20/2014 6.4-13

(d) Among the individuals in each political category (i.e., column), calculate the proportion of
men, and show that they are all equal to each other.

Among the individuals in each political category (i.e., column), calculate the proportion of
women, and show that they are all equal to each other.

Therefore, show that a Chi-squared Test of the null hypothesis of equal proportions

    H0: πMen | Left = πMen | Mid = πMen | Right  AND  πWomen | Left = πWomen | Mid = πWomen | Right

also leads to a 100% acceptance at the α = .05 significance level. Conclusion?

MORAL: There is more than one type of null hypothesis on proportions to which the Chi-
squared Test can be applied.



19. In a random sample of n = 1200 consumers who are surveyed about their ice cream flavor
preferences, 416 indicate that they prefer vanilla, 419 prefer chocolate, and 365 prefer
strawberry.

(a) Conduct a Chi-squared Goodness-of-Fit Test of the null hypothesis of equal proportions
H0: πVanilla = πChocolate = πStrawberry of flavor preferences, at the α = .05 significance level.

        Vanilla    Chocolate    Strawberry
          416         419          365         1200

(b) Suppose that the sample of n = 1200 consumers is equally divided between males and
females, yielding the results shown below. Conduct a Chi-squared Test of the null
hypothesis that flavor preference is not associated with gender, at the α = .05 level.

Vanilla Chocolate Strawberry Totals
Males 200 190 210 600
Females 216 229 155 600
Totals 416 419 365 1200

(c) Redo (a) and (b) with R, using chisq.test. Show agreement with your calculations!
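
FYI: A sketch (ours) of part (c):

   flavors <- c(Vanilla = 416, Chocolate = 419, Strawberry = 365)
   chisq.test(flavors)                 # (a) goodness-of-fit; default null is equal proportions

   tab <- rbind(Males = c(200, 190, 210), Females = c(216, 229, 155))
   colnames(tab) <- c("Vanilla", "Chocolate", "Strawberry")
   chisq.test(tab)                     # (b) test of association between gender and flavor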

Ismor Fischer, 5/20/2014 6.4-14

20. In the late 1980s, the pharmaceutical company Upjohn received approval from the Food and
Drug Administration to market Rogaine™, a 2% minoxidil solution, for the treatment of
androgenetic alopecia ("male pattern hair loss"). Upjohn's advertising campaign for Rogaine
included the results of a double-blind randomized clinical trial, conducted with 1431 patients in
27 centers across the United States. The results of this study at the end of four months are
summarized in the 2 × 5 contingency table below, where the two row categories represent the
treatment arm and control arm respectively, and each column represents a response category,
the degree of hair growth reported. [Source: Ronald L. Iman, A Data-Based Approach to
Statistics, Duxbury Press]

                                   Degree of Hair Growth
                  No        New       Minimal    Moderate    Dense
                  Growth    Vellus    Growth     Growth      Growth     Total
    Rogaine       301       172       178          58          5         714
    Placebo       423       150       114          29          1         717
    Total         724       322       292          87          6        1431

(a) Conduct a Chi-squared Test of the null hypothesis H0: πRogaine = πPlacebo versus the
alternative hypothesis HA: πRogaine ≠ πPlacebo across the five hair growth categories. (That is,
H0: πNoGrowth | Rogaine = πNoGrowth | Placebo and πNewVellus | Rogaine = πNewVellus | Placebo and ...
and πDenseGrowth | Rogaine = πDenseGrowth | Placebo.) Infer whether or not we can reject the null
hypothesis at the α = .01 significance level. Interpret in context: At the α = .01 significance
level, what exactly has been demonstrated about the efficacy of Rogaine versus placebo?

(b) Form a 2 × 2 contingency table by combining the last four columns into a single column
labeled "Growth." Conduct a Chi-squared Test for the null hypothesis H0: πRogaine = πPlacebo
versus the alternative HA: πRogaine ≠ πPlacebo between the resulting "No Growth" versus "Growth"
binary response categories. (That is, H0: πGrowth | Rogaine = πGrowth | Placebo.) Infer whether or
not we can reject the null hypothesis at the α = .01 significance level. Interpret in context:
At the α = .01 significance level, what exactly has been demonstrated about the efficacy of
Rogaine versus placebo?

(c) Calculate the p-value using a two-sample Z-test of the null hypothesis in part (b), and show
that the square of the corresponding z-score is equal to the Chi-squared test statistic found in
(b). Verify that the same conclusion about H0 is reached, at the α = .01 significance level.

(d) Redo this problem with R, using chisq.test. Show agreement with your calculations!
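
FYI: One way to carry out part (d); a sketch:

   growth <- matrix(c(301, 172, 178, 58, 5,
                      423, 150, 114, 29, 1), nrow = 2, byrow = TRUE,
                    dimnames = list(c("Rogaine", "Placebo"),
                                    c("None", "NewVellus", "Minimal", "Moderate", "Dense")))
   chisq.test(growth)                                       # (a) 2 x 5 table
   collapsed <- cbind(None = growth[, 1], Growth = rowSums(growth[, -1]))
   chisq.test(collapsed, correct = FALSE)                   # (b); uncorrected, so it equals z^2 in (c)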

Ismor Fischer, 5/20/2014 6.4-15

21. Male patients with coronary artery disease were recruited from three different medical centers
(the Johns Hopkins University School of Medicine, the Rancho Los Amigos Medical
Center, and the St. Louis University School of Medicine) to investigate the effects of carbon
monoxide exposure. One of the baseline characteristics considered in the study was
pulmonary lung function, as measured by X = "Forced Expiratory Volume in one second," or
FEV1. The data are summarized below.

                Johns Hopkins          Rancho Los Amigos       St. Louis
                n1 = 21                n2 = 16                 n3 = 23
                x̄1 = 2.63 liters       x̄2 = 3.03 liters        x̄3 = 2.88 liters
                s1² = 0.246 liters²    s2² = 0.274 liters²     s3² = 0.248 liters²

Based on histograms of the raw data (not shown), it is reasonable to assume that the FEV1
measurements of the three populations from which these samples were obtained are each
approximately normally distributed, i.e., X1 ~ N(μ1, σ1), X2 ~ N(μ2, σ2), and X3 ~ N(μ3, σ3).
Furthermore, because the three sample variances are so close in value, it is reasonable to
assume equivariance of the three populations, that is, σ1² = σ2² = σ3². With these assumptions,
answer the following.

(a) Compute the pooled estimate of the common variance σ² within groups via the formula

    s²within = MSError = SSError / dfError = [ (n1 − 1) s1² + (n2 − 1) s2² + ... + (nk − 1) sk² ] / (n − k).

(b) Compute the grand mean of the k = 3 groups via the formula

    x̄ = (n1 x̄1 + n2 x̄2 + ... + nk x̄k) / n,   where the combined sample size n = n1 + n2 + ... + nk.

From this, calculate the estimate of the variance between groups via the formula

    s²between = MSTreatment = SSTreatment / dfTreatment
              = [ n1 (x̄1 − x̄)² + n2 (x̄2 − x̄)² + ... + nk (x̄k − x̄)² ] / (k − 1).

(c) Using this information, construct a complete ANOVA table, including the F-statistic, and
corresponding p-value, relative to .05 (i.e., < .05, > .05, or = .05). Infer whether or not we can
reject H0: μ1 = μ2 = μ3, at the α = .05 level of significance. Interpret in context: Exactly what
has been demonstrated about the baseline FEV1 levels of the three groups?
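
FYI: Since only summary statistics are given, the ANOVA quantities in (a)-(c) can be assembled
directly in R; a sketch (ours):

   n <- c(21, 16, 23); xbar <- c(2.63, 3.03, 2.88); s2 <- c(0.246, 0.274, 0.248)
   k <- length(n); N <- sum(n)
   grand <- sum(n * xbar) / N                             # grand mean
   MS.error <- sum((n - 1) * s2) / (N - k)                # pooled within-group variance
   MS.treat <- sum(n * (xbar - grand)^2) / (k - 1)        # between-group variance
   F.stat <- MS.treat / MS.error
   pf(F.stat, df1 = k - 1, df2 = N - k, lower.tail = FALSE)   # p-value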

Ismor Fischer, 5/20/2014 6.4-16

22. Generalization of Problem 2.5/8
(a) Suppose a random sample of size n1 has a mean x̄1 and variance s1², and a second random
sample of size n2 has a mean x̄2 and variance s2². If the two samples are combined into a
single sample, then algebraically express its mean x̄Total and variance s²Total in terms of the
preceding variables. (Hint: If you think of this in the right way, it's easier than it looks.)

(b) In a study of the medical expenses at a particular hospital, it is determined from a sample
of 4000 patients that a certain laboratory procedure incurs a mean cost of $30, with a
standard deviation of $10. It is realized however, that these values inadvertently excluded
1000 patients for whom the cost was $0. When these patients are included in the study,
what are the adjusted values of the mean and standard deviation of the cost?

23.
(a) For a generic 2 × 2 contingency table such as the one
shown, prove that the Chi-squared test statistic reduces to

        χ²1 = n (ad − bc)² / (R1 R2 C1 C2).

             a     b   |  R1
             c     d   |  R2
            C1    C2   |  n

(b) Suppose that a z-test of two equal proportions results in the generic sample values shown
in this table. Prove that the square of the z-score is equal to the Chi-squared score in (a).

24. Problem 5.3/1 illustrates one way that the normal and t distributions differ, as similar as their
graphs may appear (drawn to scale, below). Essentially, any t-curve has heavier tails than the
bell curve, indicating a higher density of outliers in the distribution. (So much higher, in fact,
that for the t-distribution with 1 degree of freedom the mean does not even exist!) Another way
to see this is to check the t-distribution for normality, via a Q-Q plot. The posted R code for this
problem graphs such a plot for a standard normal distribution (with predictable results), and for
a t-distribution with 1 degree of freedom (a.k.a. the Cauchy distribution). Run this code five
times each, and comment on the results!

The two curves were drawn with the following commands:

curve(dnorm(x), -3, 3, lwd = 2, col = "darkgreen")                        # N(0, 1)
curve(dt(x, 1), -3, 3, ylim = range(0, .4), lwd = 2, col = "darkgreen")   # t with 1 df

Ismor Fischer, 5/20/2014 6.4-17

25.
(a) In R, type the following command to generate a data set called x of 1000 random values.
x = rf(1000, 5, 20)
Obtain a graph of its frequency histogram by typing hist(x). Include this graph as part of
your submitted homework assignment. (Do not include the 1000 data values!)
(b) Next construct a normal q-q plot by typing the following.
qqnorm(x, pch = 19)
qqline(x)
Include this plot as part of your submitted homework assignment.
Now define a new data set called y by taking the (natural) logarithm of x.
y = log(x)
Obtain a graph of its frequency histogram by typing hist(y). Include this graph as part of
your submitted homework assignment. (Do not include the 1000 data values!)
Then construct a normal q-q plot by typing the following.
qqnorm(y, pch = 19)
qqline(y)
Include this plot as part of your submitted homework assignment.
(c) Summarize the results in (a) and (b). In particular, from their respective histograms and q-q
plots, what general observation can be made regarding the distributions of x and y = log(x)?
(Hint: See pages 6.1-25 through 6.1-27.)
26. Refer to the posted Rcode folder for this problem. Please answer all questions.
27. Refer to the posted Rcode folder for this problem. Please answer all questions.


7. Correlation and Regression







7.1 Motivation

7.2 Linear Correlation and Regression

7.3 Extensions of Simple Linear Regression

7.4 Problems




Ismor Fischer, 5/29/2012 7.1-1

7. Correlation and Regression

7.1 Motivation



POPULATION

Random Variables X, Y: numerical (Contrast with § 6.3.1.)

How can the association between X and Y (if any exists) be

    1) characterized and measured?

    2) mathematically modeled via an equation, i.e., Y = f(X)?

Recall:

    μX = Mean(X) = E[X]                        μY = Mean(Y) = E[Y]

    σX² = Var(X) = E[(X − μX)²]                σY² = Var(Y) = E[(Y − μY)²]

Definition: Population Covariance of X, Y

    σXY = Cov(X, Y) = E[(X − μX)(Y − μY)]

    Equivalently,*  σXY = E[XY] − μX μY

SAMPLE, size n

Recall:

    x̄ = (1/n) Σ xi                             ȳ = (1/n) Σ yi

    sx² = [1/(n − 1)] Σ (xi − x̄)²              sy² = [1/(n − 1)] Σ (yi − ȳ)²

Definition: Sample Covariance of X, Y

    sxy = [1/(n − 1)] Σ (xi − x̄)(yi − ȳ)

Note: Whereas sx² ≥ 0 and sy² ≥ 0, sxy is unrestricted in sign.

* Exercise: Algebraically expand the expression (X − μX)(Y − μY), and use the properties of
mathematical expectation given in § 3.1. This motivates an alternate formula for sxy.
Ismor Fischer, 5/29/2012 7.1-2

For the sake of simplicity, let us assume that the predictor variable X is
nonrandom (i.e., deterministic), and that the response variable Y is random.
(Although, the subsequent techniques can be extended to random X as well.)

Example: X =fat (grams), Y =cholesterol level (mg/dL)

Suppose the following sample of n = 5 data pairs (i.e., points) is obtained and
graphed in a scatterplot, along with some accompanying summary statistics:

    X      60     70     80     90    100            x̄ = 80        sx² = 250
    Y     210    200    220    280    290            ȳ = 240       sy² = 1750





















Sample Covariance

    sxy = [1/(5 − 1)] [ (60 − 80)(210 − 240) + (70 − 80)(200 − 240) + (80 − 80)(220 − 240) +
                        (90 − 80)(280 − 240) + (100 − 80)(290 − 240) ] = 600


As the name implies, the variance measures the extent to which a single variable
varies (about its mean). Similarly, the covariance measures the extent to which
two variables vary (about their individual means), with respect to each other.
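
FYI: These summary statistics are easily verified in R (a quick sketch):

   x <- c(60, 70, 80, 90, 100)        # fat (grams)
   y <- c(210, 200, 220, 280, 290)    # cholesterol level (mg/dL)
   mean(x); mean(y)                   # 80 and 240
   var(x); var(y)                     # 250 and 1750
   cov(x, y)                          # sample covariance s_xy = 600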
Ismor Fischer, 5/29/2012 7.1-3

Ideally, if there is no association of any kind between two variables X and Y (as
in the case where they are independent), then a scatterplot would reveal no
organized structure, and covariance = 0; e.g., X = adult head size, Y = IQ.
Clearly, in a case such as this, the variable X is not a good predictor of the
response Y. Likewise, if the variables X = age, Y = body temperature (°F) are
measured in a group of healthy individuals, then the resulting scatterplot would
consist of data points that are very nearly lined up horizontally (i.e., zero slope),
reflecting a constant mean response value of Y = 98.6°F, regardless of age X.
Here again, covariance = 0 (or nearly so); X is not a good predictor of the
response Y. See figures.



However, in the preceding fat vs. cholesterol example, there is a clear
positive trend exhibited in the scatterplot. Overall, it seems that as X
increases, Y increases, and inversely, as X decreases, Y decreases. The simplest
mathematical object that has this property is a straight line with positive slope,
and so a linear description can be used to capture such first-order properties of
the association between X and Y. The two questions we now ask are

1) How can we measure the strength of the linear association between X and Y?

Answer: Linear Correlation Coefficient

2) How can we model the linear association between X and Y, essentially via an
equation of the form Y =mX +b?

Answer: Simple Linear Regression

Caution: The covariance can equal zero under other conditions as well; see Exercise in the next section.
[Figures: two scatterplots. Left: Y = IQ score vs. X = Head Circumference, showing no
organized structure. Right: Y = Body Temp (°F) vs. X = Age, with the points lined up nearly
horizontally at the level 98.6.]
Ismor Fischer, 5/29/2012 7.1-4


Y

|

X

=

7
0


Y

|

X

=

8
0


Y

|

X

=

9
0


Y

|

X

=

1
0
0


Y

|

X

=

6
0


We can consider n =5 subpopulations,
each of whose cholesterol levels Y are
normally distributed, and whose
means are conditioned on X =60, 70,
80, 90, 100 fat grams, respectively.




Before moving on to the next section, some important details are necessary in
order to provide a more formal context for this type of problem. In our example,
the response variable of interest is cholesterol level Y, which presumably has some
overall probability distribution in the study population. The mean cholesterol level
of this population can therefore be denoted μY (or, recall, the expectation E[Y]) and
estimated by the grand mean ȳ = 240. Note that no information about X is used.

Now we seek to characterize the relation (if any) between cholesterol level Y and
fat intake X in this population, based on a random sample using n = 5 fat intake
values (i.e., x1 = 60, x2 = 70, x3 = 80, x4 = 90, x5 = 100). Each of these fixed xi
values can be regarded as representing a different amount of fat grams consumed
by a subpopulation of individuals, whose cholesterol levels Y, conditioned on that
value of X = xi, are assumed to be normally distributed. The conditional mean
cholesterol level of each of these distributions could therefore be denoted μY|X=xi
(equivalently, the conditional expectation E[Y | X = xi]) for i = 1, 2, 3, 4, 5. (See
figure; note that, in addition, we will assume that the variances within groups are
all equal (to σ²), and that they are independent of one another.) If no relation
between X and Y exists, we would expect to see no organized variation in Y as X
changes, and all of these conditional means would either be uniformly scattered
around, or exactly equal to, the unconditional mean μY; recall the discussion on
the preceding page. But if there is a true relation between X and Y, then it becomes
important to characterize and model the resulting (nonzero) variation.
Ismor Fischer, 5/29/2012 7.2-1

7.2 Linear Correlation and Regression

POPULATION

Random Variables X, Y: numerical

Definition: Population Linear Correlation Coefficient of X, Y

    ρ = σXY / (σX σY)          FACT:  −1 ≤ ρ ≤ +1

SAMPLE, size n

Definition: Sample Linear Correlation Coefficient of X, Y

    ρ̂ = r = sxy / (sx sy)      FACT:  −1 ≤ r ≤ +1

Example:  r = 600 / √(250 × 1750) = 0.907  ⇒  strong, positive linear correlation

FACT: Any set of data points (xi, yi), i = 1, 2, ..., n, having r > 0 (likewise, r < 0) is said
to have a positive linear correlation (likewise, negative linear correlation).
The linear correlation can be strong, moderate, or weak, depending on the
magnitude. The closer r is to +1 (likewise, −1), the more strongly the points
follow a straight line having some positive (likewise, negative) slope. The closer
r is to 0, the weaker the linear correlation; if r = 0, then EITHER the points are
uncorrelated (see § 7.1), OR they are correlated, but nonlinearly (e.g., Y = X²).

Exercise: Draw a scatterplot of the following n = 7 data points, and compute r.

    (−3, 9), (−2, 4), (−1, 1), (0, 0), (1, 1), (2, 4), (3, 9)

Ismor Fischer, 5/29/2012 7.2-2
(Pearson's) Sample Linear Correlation Coefficient   r = sxy / (sx sy)

[Figure: a number line of r values from −1 to +1, marked at −1, −0.8, −0.5, 0, +0.5, +0.8, +1.
Values of r near ±1 (beyond about ±0.8) indicate strong, values around ±0.5 moderate, and
values near 0 weak linear correlation; r = 0 corresponds to uncorrelated data.
  Negative linear correlation (r < 0): as X increases, Y decreases; as X decreases, Y increases.
  Positive linear correlation (r > 0): as X increases, Y increases; as X decreases, Y decreases.]

Some important exceptions to the typical cases above:

  • r = 0, but X and Y are correlated, nonlinearly
  • r > 0 in each of the two individual subgroups, but r < 0 when combined
  • r > 0, only due to the effect of one influential outlier; if removed,
    then the data are uncorrelated (r = 0)

[Figure: three scatterplots illustrating these exceptions.]
Ismor Fischer, 5/29/2012 7.2-3
Statistical Inference for ρ

Suppose we now wish to conduct a formal test of

    Null Hypothesis     H0: ρ = 0    "There is no linear correlation between X and Y."
        vs.
    Alternative Hyp.    HA: ρ ≠ 0    "There is a linear correlation between X and Y."

Test Statistic

    T = r √(n − 2) / √(1 − r²)   ~   t_{n−2}

Example:  p-value = 2 P( T3 ≥ .907 √3 / √(1 − (.907)²) ) = 2 P(T3 ≥ 3.733) = 2 (.017) = .034

As p < α = .05, the null hypothesis of no linear correlation can be rejected at this level.
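
FYI: In R, cor() and cor.test() reproduce r and this t-based test (up to rounding):

   x <- c(60, 70, 80, 90, 100); y <- c(210, 200, 220, 280, 290)
   cor(x, y)          # r = 0.907
   cor.test(x, y)     # t = 3.73 on 3 df, two-sided p-value = .034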

Comments:

• Defining the numerator sums of squares Sxx = (n − 1) sx², Syy = (n − 1) sy², and
Sxy = (n − 1) sxy, the correlation coefficient can also be written as r = Sxy / √(Sxx Syy).

• The general null hypothesis H0: ρ = ρ0 requires a more complicated Z-test, which
first applies the so-called Fisher transformation, and will not be presented here.

• The assumption on X and Y is that their joint distribution is bivariate normal, which
is difficult to check fully in practice. However, a consequence of this assumption is
that X and Y are linearly uncorrelated (i.e., ρ = 0) if and only if X and Y are
independent. That is, it overlooks the possibility that X and Y might have a
nonlinear correlation. The moral: ρ (and therefore the Pearson sample linear
correlation coefficient r calculated above) only captures the strength of linear
correlation. A more sophisticated measure, the multiple correlation coefficient,
can detect nonlinear correlation, or correlation in several variables. Also, the
nonparametric Spearman rank-correlation coefficient can be used as a substitute.

• Correlation does not imply causation! (E.g., X = children's foot size is indeed
positively correlated with Y = IQ score, but is this really cause-and-effect????) The
ideal way to establish causality is via a well-designed randomized clinical trial, but
this is not always possible, or even desirable. (E.g., X = smoking vs. Y = lung cancer)


Ismor Fischer, 5/29/2012 7.2-4
Simple Linear Regression and the Method of Least Squares

If a linear association exists between variables X and Y, then it can be written as

        Y   =   β0 + β1 X   +   ε
    Response = (Linear) Model + Error

where X is the predictor variable (or explanatory variable), and the k = 2 parameters β0 and β1
are the regression coefficients.

The sample-based estimator of the response is

        Ŷ = β̂0 + β̂1 X = b0 + b1 X,     with intercept b0 and slope b1.

That is, given the response vector Y, we wish to find the linear estimate Ŷ that
makes the magnitude of the difference ε̂ = Y − Ŷ as small as possible.
Ismor Fischer, 5/29/2012 7.2-5
    Y = β0 + β1 X + ε
    Ŷ = β̂0 + β̂1 X

How should we define the line that "best" fits the data, and obtain its coefficients β̂0 and β̂1?

For any line, the errors εi, i = 1, 2, ..., n, can be estimated by the residuals

    ε̂i = ei = yi − ŷi ,     i.e.,  residual = observed response − fitted response.

[Figure: scatterplot of the data points (xi, yi) with a fitted line; each residual ei = yi − ŷi is the
vertical distance from the observed point (xi, yi) to the corresponding fitted point (xi, ŷi); the
residuals e1, e2, e3, ..., en are marked, and the center of mass (x̄, ȳ) lies on the line.]

The least squares regression line  Ŷ = b0 + b1 X  is the unique line that minimizes the
Error (or Residual) Sum of Squares

    SSError = Σ ei² = Σ (yi − ŷi)²,    summed over i = 1, ..., n.

Its coefficients are given by

    Slope:   β̂1 = b1 = sxy / sx²            Intercept:   β̂0 = b0 = ȳ − b1 x̄

Example (cont'd):   Slope b1 = 600 / 250 = 2.4       Intercept b0 = 240 − (2.4)(80) = 48

Therefore, the least squares regression line is given by the equation  Ŷ = 48 + 2.4 X.
Ismor Fischer, 5/29/2012 7.2-6
Scatterplot, Least Squares Regression Line, and Residuals

[Figure: scatterplot of the five data points with the fitted line Ŷ = 48 + 2.4 X; the residuals
+18, −16, −20, +16, +2 are marked as vertical segments.]

    predictor values xi                          60     70     80     90    100
    observed responses yi                       210    200    220    280    290
    fitted (predicted) responses ŷi             192    216    240    264    288
    residuals ei = yi − ŷi                      +18    −16    −20    +16     +2

Note that the sum of the residuals is equal to zero. But the sum of their squares,

    Σ ei² = SSError = (+18)² + (−16)² + (−20)² + (+16)² + (+2)² = 1240,

is, by construction, the smallest such value of all possible regression lines that could
have been used to estimate the data. Note also that the "center of mass" (80, 240)
lies on the least squares regression line.

Example: The population cholesterol level corresponding to x* = 75 fat grams is
estimated by  ŷ = 48 + 2.4(75) = 228 mg/dL. But how precise is this value?


(Later...)
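
(FYI: The same example in R. The function lm() returns the least squares coefficients, fitted
values, and residuals shown above; this sketch is included only as a check.)

   x <- c(60, 70, 80, 90, 100); y <- c(210, 200, 220, 280, 290)
   model <- lm(y ~ x)
   coef(model)                        # intercept 48 and slope 2.4
   fitted(model)                      # 192, 216, 240, 264, 288
   residuals(model)                   # +18, -16, -20, +16, +2
   sum(residuals(model)^2)            # SS_Error = 1240
   predict(model, newdata = data.frame(x = 75))    # estimated response 228 at x* = 75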

Ismor Fischer, 5/29/2012 7.2-7

Statistical Inference for β0 and β1

It is possible to test for significance of the intercept parameter β0 and slope parameter
β1 of the least squares regression line, using the following:

    Test Statistics

        For β0:   T = [(b0 − β0) / s_e] √( n Sxx / (Sxx + n x̄²) )   ~   t_{n−2}

        For β1:   T = (b1 − β1) / (s_e / √Sxx)   ~   t_{n−2}

    (1 − α) × 100% Confidence Limits

        For β0:   b0 ± t_{n−2, α/2} · s_e √( 1/n + x̄²/Sxx )

        For β1:   b1 ± t_{n−2, α/2} · s_e / √Sxx

where s_e² = SSError / (n − 2) is the so-called standard error of estimate, and Sxx = (n − 1) sx².
(Note: s_e² is also written as MSE or MSError, the mean square error of the
regression; see ANOVA below.)

Example: Calculate the p-value of the slope parameter β1, under

    Null Hypothesis     H0: β1 = 0    "There is no linear association between X and Y."
        vs.
    Alternative Hyp.    HA: β1 ≠ 0    "There is a linear association between X and Y."

First, s_e² = 1240 / 3 = 413.333, so s_e = 20.331. And Sxx = (4)(250) = 1000. So

    p-value = 2 P( T3 ≥ [(2.4 − 0) / 20.331] √1000 ) = 2 P(T3 ≥ 3.733) = 2 (.017) = .034

As p < α = .05, the null hypothesis of no linear association can be rejected at this level.

Note that the T-statistic (3.733), and hence the resulting p-value (.034), is identical to
the test of significance of the linear correlation coefficient H0: ρ = 0 conducted above!

Exercise: Calculate the 95% confidence interval for β1, and use it to test H0: β1 = 0.

Ismor Fischer, 5/29/2012 7.2-8

Confidence and Prediction Intervals

Recall that, from the discussion in the previous section, a regression problem such as this
may be viewed in the formal context of starting with n normally-distributed populations,
each having a conditional mean μY|X=xi, i = 1, 2, ..., n. From this, we then obtain a
linear model that allows us to derive an estimate of the response variable via
Ŷ = b0 + b1 X, for any value X = x* (with certain restrictions to be discussed later), i.e.,
ŷ = b0 + b1 x*. There are two standard possible interpretations for this "fitted value." First,
ŷ can be regarded simply as a predicted value of the response variable Y, for a
randomly selected individual from the specific normally-distributed population
corresponding to X = x*, and can be improved via a so-called prediction interval.

    (1 − α) × 100% Prediction Limits for Y at X = x*

        (b0 + b1 x*)  ±  t_{n−2, α/2} · s_e √( 1 + 1/n + (x* − x̄)²/Sxx )

[Figure: the normal distribution of Y at X = x*, centered at μY|X=x*, with the 95% prediction
interval around ŷ = b0 + b1 x*; this interval contains the true response value Y with 95%
probability (central area .95, with .025 in each tail).]

Example (α = .05): 95% Prediction Bounds

      X      fit        Lower        Upper
     60      192      110.1589     273.8411
     70      216      142.2294     289.7706
     80      240      169.1235     310.8765
     90      264      190.2294     337.7706
    100      288      206.1589     369.8411

Exercise: Confirm that the 95% prediction interval for ŷ = 228 (when x* = 75) is
(156.3977, 299.6023).

Ismor Fischer, 5/29/2012 7.2-9

The second interpretation is that ŷ can be regarded as a point estimate of the conditional
mean μY|X=x* of this population, and can be improved via a confidence interval.

    (1 − α) × 100% Confidence Limits for μY|X=x*

        (b0 + b1 x*)  ±  t_{n−2, α/2} · s_e √( 1/n + (x* − x̄)²/Sxx )

[Figure: the normal distribution of Y at X = x*, with the 95% confidence interval around
ŷ = b0 + b1 x*; this interval contains the true conditional mean μY|X=x* with 95% probability.
Note that it is narrower than the corresponding prediction interval above.]

Exercise: Confirm that the 95% confidence interval for ŷ = 228 (when x* = 75) is
(197.2133, 258.6867).

Note: Both approaches are based on the fact that there is, in principle, variability in the
coefficients b0 and b1 themselves, from one sample of n data points to another. Thus, for
fixed x*, the object ŷ = b0 + b1 x* can actually be treated as a random variable in its own
right, with a computable sampling distribution.

Also, we define the general conditional mean μY|X (i.e., the conditional expectation
E[Y | X]) as μY|X=x* (i.e., E[Y | X = x*]) for all appropriate x*, rather than a specific one.

Ismor Fischer, 5/29/2012 7.2-10
Example (α = .05): 95% Confidence Bounds

      X      fit        Lower        Upper
     60      192      141.8827     242.1173
     70      216      180.5617     251.4383
     80      240      211.0648     268.9352
     90      264      228.5617     299.4383
    100      288      237.8827     338.1173

[Figure: scatterplot of the data with the fitted regression line, the 95% confidence intervals at
each X drawn as vertical segments, and the upper and lower 95% confidence bands.]

Comments:

• Note that, because individual responses have greater variability than mean
responses (recall the Central Limit Theorem, for example), we expect prediction
intervals to be wider than the corresponding confidence intervals, and indeed, this
is the case. The two formulas differ by a term of "1 +" in the standard error of the
former, resulting in a larger margin of error.

• Note also from the formulas that both types of interval are narrowest when x* = x̄,
and grow steadily wider as x* moves farther away from x̄. (This is evident in the
graph of the 95% confidence intervals above.) Great care should be taken if x* is
outside the domain of sample values! For example, when fat grams x = 0, the
linear model predicts an unrealistic cholesterol level of ŷ = 48, and the margin of
error is uselessly large. The linear model is not a good predictor there.
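
FYI: The prediction and confidence bounds tabulated above can be reproduced with predict(),
for example:

   x <- c(60, 70, 80, 90, 100); y <- c(210, 200, 220, 280, 290)
   model <- lm(y ~ x)
   new <- data.frame(x = c(60, 70, 80, 90, 100, 75))
   predict(model, newdata = new, interval = "prediction", level = 0.95)
   predict(model, newdata = new, interval = "confidence", level = 0.95)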
Ismor Fischer, 5/29/2012 7.2-11

ANOVA Formulation

As with comparison of multiple treatment means (§ 6.3.3), regression can also be
interpreted in the general context of analysis of variance. That is, because

    Response = Model + Error,

it follows that the total variation in the original response data can be partitioned into a
source of variation due to the model, plus a source of variation for whatever remains.
We now calculate the three Sums of Squares (SS) that measure the variation of the
system and its two component sources, and their associated degrees of freedom (df).

1. Total Sum of Squares = sum of the squared deviations of each observed response
value yi from the mean response value ȳ.

    SSTotal = (210 − 240)² + (200 − 240)² + (220 − 240)² + (280 − 240)² + (290 − 240)² = 7000

    dfTotal = 5 − 1 = 4        Reason: n data values − 1

Note that, by definition, sy² = SSTotal / dfTotal = 7000 / 4 = 1750, as given in the beginning of
this example in § 7.1.

2. Regression Sum of Squares = sum of the squared deviations of each fitted response
value ŷi from the mean response value ȳ.

    SSReg = (192 − 240)² + (216 − 240)² + (240 − 240)² + (264 − 240)² + (288 − 240)² = 5760

    dfReg = 1        Reason: As the regression model is linear, its degrees of freedom =
    one less than the k = 2 parameters we are trying to estimate (β0 and β1).

3. Error Sum of Squares = sum of the squared deviations of each observed response
yi from its corresponding fitted response ŷi (i.e., the sum of the squared residuals).

    SSError = (210 − 192)² + (200 − 216)² + (220 − 240)² + (280 − 264)² + (290 − 288)² = 1240

    dfError = 5 − 2 = 3        Reason: n data values − k regression parameters in model

    SSTotal = SSReg + SSError                dfTotal = dfReg + dfError
Ismor Fischer, 5/29/2012 7.2-12
    Null Hypothesis     H0: β1 = 0    "There is no linear association between X and Y."
        vs.
    Alternative Hyp.    HA: β1 ≠ 0    "There is a linear association between X and Y."

ANOVA Table

    Source        df       SS       MS = SS/df      F = MSReg / MSErr      p-value
    Regression     1      5760      5760                13.94               .034
    Error          3      1240       413.333
    Total          4      7000

The test statistic F = 13.94 is compared with an F distribution having 1 and 3 degrees of
freedom (F1, 3); the p-value .034 is the corresponding right-tail area.
According to this F-test, we can reject H0: β1 = 0 at the α = .05 significance level, which is
consistent with our earlier findings.

Comment: Again, note that 13.94 = (3.733)², i.e., F1, 3 = t3²  ⇒  equivalent tests.
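
FYI: The ANOVA table above agrees (up to rounding) with R's anova() output for the fitted model:

   x <- c(60, 70, 80, 90, 100); y <- c(210, 200, 220, 280, 290)
   anova(lm(y ~ x))     # SS_Reg = 5760, SS_Error = 1240, F = 13.94, p = .034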
Ismor Fischer, 5/29/2012 7.2-13

How well does the model fit? Out of a total response variation of 7000, the linear
regression model accounts for 5760, with the remaining 1240 unaccounted for
(perhaps explainable by a better model, or simply due to random chance). We can
therefore assess how well the model fits the data by calculating the ratio
SSReg / SSTotal = 5760 / 7000 = 0.823. That is, 82.3% of the total response variation is due to
the linear association between the variables, as determined by the least squares regression
line, with the remaining 17.7% unaccounted for. (Note: This does NOT mean that
82.3% of the original data points lie on the line. This is clearly false; from the
scatterplot, it is clear that none of the points lies on the regression line!)

Moreover, note that 0.823 = (0.907)² = r², the square of the correlation coefficient
calculated before! This relation is true in general...

    Coefficient of Determination

        r² = SSReg / SSTotal = 1 − SSErr / SSTotal

    This value (always between 0 and 1) indicates the proportion of total response
    variation that is accounted for by the least squares regression model.

Comment: In practice, it is tempting to over-rely on the coefficient of
determination as the sole indicator of linear fit to a data set. As with the correlation
coefficient r itself, a reasonably high r² value is suggestive of a linear trend, or a
strong linear component, but should not be used as the definitive measure.

Exercise: Sketch the n = 5 data points (X, Y)

    (0, 0), (1, 1), (2, 4), (3, 9), (4, 16)

in a scatterplot, and calculate the coefficient of determination r² in two ways:

1. By squaring the linear correlation coefficient r.

2. By explicitly calculating the ratio SSReg / SSTotal from the regression line.

Show agreement of your answers, and that, despite a value of r² very close to 1, the
exact association between X and Y is actually a nonlinear one. Compare the linear
estimate of Y when X = 5, with its exact value.

Also see Appendix > Geometric Viewpoint > Least Squares Approximation.
Ismor Fischer, 5/29/2012 7.2-14
0

1
~N(0, )
2
~N(0, )
n
~N(0, )
| | |

x
1
x
2
x
n


Regression Diagnostics Checking the Assumptions

True Responses:
0 1
Y X = + +
Model


0 1 i i
y x = + +
i
, i =1, 2, ..., n
Fitted Responses:
0 1

Y b b aa Xa = +
0 1

,
i i
y b b x aa = + i =1, 2, ..., n
Residuals:

Y Y = ,

i i i i
e y y = = i =1, 2, ..., n

1. The model is correct.

Perhaps a better word is useful, since correctness is difficult to establish without a
theoretical justification, based on known mathematical and scientific principles.

Check: Scatterplot(s) for general behavior, r
2
1, overall balance of simplicity vs.
complexity of model, and robustness of response variable explanation.

2. Errors ε_i are independent of each other, i = 1, 2, …, n.

This condition is equivalent to the assumption that the responses y_i are independent of one another. Alas, it is somewhat problematic to check in practice; formal statistical tests are limited. Often, but not always, it is implicit in the design of the experiment. Other times, errors (and hence, responses) may be autocorrelated with each other. Example: Y = systolic blood pressure (mm Hg) at times t = 0 and t = 1 minute later. Specialized time-series techniques exist for these cases, but are not pursued here.

3. Errors ε_i are normally distributed with mean 0, and equal variances σ1² = σ2² = ⋯ = σn² (= σ²), i.e., ε_i ~ N(0, σ), i = 1, 2, …, n.

This condition is equivalent to the original normality assumption on the responses y_i. Informally, if for each fixed x_i, the true response y_i is normally distributed with mean μ_{Y | X = x_i} and variance σ², i.e., y_i ~ N(μ_{Y | X = x_i}, σ), then the error ε_i that remains upon subtracting out the true model value β0 + β1 x_i (see boxed equation above) turns out also to be normally distributed, with mean 0 and the same variance σ², i.e., ε_i ~ N(0, σ). Formal details are left to the mathematically brave to complete.



Response  =  Model  +  Error
Ismor Fischer, 5/29/2012 7.2-15

Check: Residual plot (residuals e_i vs. fitted values ŷ_i) for a general random appearance, evenly distributed about zero. (Can also check the normal probability plot.)






















Typical residual plots that violate Assumptions 1-3 (left to right): nonlinearity, dependent errors, increasing variance, omitted predictor.

Nonlinear trend can often be described with a polynomial regression model, e.g., Y = β0 + β1 X + β2 X² + ⋯. If a residual plot resembles the last figure, this is a possible indication that more than one predictor variable may be necessary to explain the response, e.g., Y = β0 + β1 X1 + β2 X2 + ⋯, i.e., multiple linear regression.
Nonconstant variance can be handled by Weighted Least Squares (WLS), rather than the Ordinary Least Squares (OLS) used above, or by using a transformation of the data, which can also alleviate nonlinearity, as well as violations of the third assumption that the errors are normally distributed.
Ismor Fischer, 5/29/2012 7.2-16
[Scatterplot of the data below with fitted regression line Ŷ = 12.1 + 4.7 X; photo of Sadie.]

Example: Regress Y = human age (years) on X = dog age (years), based on the following n = 20 data points, for adult dogs 23-34 lbs.:

   X:   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18   19   20
   Y:  15  21  27  32  37  42  46  51  55  59  63  67  71  76  80  85  91  97  103  111

























Residuals:
Min 1Q Median 3Q Max
-2.61353 -1.57124 0.08947 1.16654 4.87143

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.06842 0.87794 13.75 5.5e-11 ***
X 4.70301 0.07329 64.17 < 2e-16 ***

Multiple R-Squared: 0.9956, Adjusted R-squared: 0.9954
F-statistic: 4118 on 1 and 18 degrees of freedom,
p-value: 0
Ismor Fischer, 5/29/2012 7.2-17

The residual plot exhibits a clear nonlinear trend, despite the excellent fit of the linear model. It is possible to take this into account using, say, a cubic (i.e., third-degree) polynomial, but this then raises the question: How complicated should we make the regression model?
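For reference, here is a minimal R sketch (not part of the original notes) that refits the dog-age data above and draws the residual plot in which this curvature is visible:

# Dog age vs. human age data from the example above
X <- 1:20
Y <- c(15, 21, 27, 32, 37, 42, 46, 51, 55, 59,
       63, 67, 71, 76, 80, 85, 91, 97, 103, 111)

fit <- lm(Y ~ X)
summary(fit)                       # compare with the output shown above

# Residual plot: residuals e_i versus fitted values
plot(fitted(fit), resid(fit), pch = 19,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)             # the curved (nonlinear) pattern is apparent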















My assistant and I, thinking hard about regression models.
Ismor Fischer, 5/29/2012 7.3-1
[Figure: graphs of the power functions Y = X^β, grouped by shape parameter: β > 1 (e.g., X², X³, X⁴), 0 < β < 1 (e.g., X^(1/2), X^(1/3), X^(1/4)), and β < 0.]

7.3 Extensions of Simple Linear Regression











Power Laws:   Y = α X^β

Ismor Fischer, 5/29/2012 7.3-2

If a scatterplot exhibits evidence of a monotonic nonlinear trend, then it may be possible to improve the regression model by first transforming the data according to one of the power functions above, depending on its overall shape.

If Y = α X^β, then it follows that log(Y) = log(α) + β log(X),

i.e.,   V = β0 + β1 U.

[See Appendix > Basic Reviews > Logarithms for properties of logarithms.]

That is, if X and Y have a power law association, then log(X) and log(Y) have a linear association. Therefore, such (X, Y) data are often replotted on a log-log (U, V) scale in order to bring out the linear trend. The linear regression coefficients of the transformed data are then computed, and backsolved for the original parameters α and β. Algebraically, any logarithmic base can be used, but it is customary to use natural logarithms ln, that is, base e = 2.71828… Thus, if Y = α X^β, then V = β0 + β1 U, where V = ln(Y), U = ln(X), and the parameters β0 = ln(α) and β1 = β, so that the scale parameter α = e^β0, and the shape parameter β = β1. However…

Comment: This description of the retransformation is not quite complete. For, recall that linear regression assumes the true form of the response as V = β0 + β1 U + ε. (The random error term ε is estimated by the least squares minimum SS_Error = Σ_{i=1}^{n} e_i².) Therefore, exponentiating both sides, the actual relationship between X and Y is given by Y = α X^β e^ε. Hence (see section 7.2), the conditional expectation is E[Y | X] = α X^β E[e^ε], where E[e^ε] is the mean of the exponentiated errors ε_i, and is thus estimated by the sample mean of the exponentiated residuals e_i. Consequently, the estimate of the original scale parameter is more accurately given by

   α̂ = e^b0 × (1/n) Σ_{i=1}^{n} e^(e_i).

(The estimate of the original shape parameter remains β̂ = b1.)

In this context, the expression (1/n) Σ_{i=1}^{n} e^(e_i) is called a smearing factor, introduced to reduce bias during the retransformation process. Note that, ideally, if all the residuals e_i = 0 (i.e., the model fits exactly), then (because e^0 = 1) it follows that the smearing factor = 1. This will be the case in most of the "rigged" examples in this section, for the sake of simplicity. The often-cited reference below contains information on smearing estimators for other transformations.

Duan, N. (1983) Smearing estimate: a nonparametric retransformation method.
Journal of the American Statistical Association, 78, 605-610.
Ismor Fischer, 5/29/2012 7.3-3

Example: This example is modified from a pharmaceutical research paper,
Allometric Scaling of Xenobiotic Clearance: Uncertainty versus Universality by
Teh-Min Hu and William L. Hayton, that can be found at the URL
http://www.aapsj.org/view.asp?art=ps030429, and which deals with different
rates of metabolic clearance of various substances in mammals. (A xenobiotic is
any organic compound that is foreign to the organism under study. In some
situations, this is loosely defined to include naturally present compounds
administered by alternate routes or at unnatural concentrations.) In one part of
this particular study, n = 6 mammals were considered: mouse, rat, rabbit,
monkey, dog and human. Let X =body weight (kg) and the response Y =
clearance rate of some specific compound. Suppose the following ideal data
were generated (consistent with the spirit of the article's conclusions):

X .02 .25 2.5 5 14 70
Y 5.318 35.355 198.82 334.4 723.8 2420.0





















Ismor Fischer, 5/29/2012 7.3-4

Solving for the least squares regression line yields the following standard output.

Residuals:
1 2 3 4 5 6
-102.15 -79.83 8.20 59.96 147.60 -33.78

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 106.800 49.898 2.14 0.099
X 33.528 1.707 19.64 3.96e-05

Residual standard error: 104.2 on 4 degrees of freedom
Multiple R-Squared: 0.9897, Adjusted R-squared: 0.9872
F-statistic: 385.8 on 1 and 4 degrees of freedom,
p-value: 3.962e-005


Ismor Fischer, 5/29/2012 7.3-5
[Log-log scale plot on this page, with fitted line V̂ = 4.605 + 0.75 U.]

The residual plot, as well as a visual inspection of the linear fit, would seem to indicate that model improvement is possible, despite the high r² value. The overall shape is suggestive of a power law relation Y = α X^β with 0 < β < 1.


Transforming to a log-log scale produces the following data and regression line.

   U = ln X:   −3.912   −1.386    0.916    1.609    2.639    4.248
   V = ln Y:    1.671    3.565    5.292    5.812    6.585    7.792

Residuals:
1 2 3 4 5 6
-2.469e-05 -1.944e-06 -1.938e-06 6.927e-05 2.244e-05 -6.313e-05

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.605e+00 2.097e-05 219568 <2e-16 ***
U 7.500e-01 7.602e-06 98657 <2e-16 ***

Residual standard error: 4.976e-05 on 4 degrees of freedom
Multiple R-Squared: 1, Adjusted R-squared: 1
F-statistic: 9.733e+009 on 1 and 4 degrees of freedom,
p-value: 0


Ismor Fischer, 5/29/2012 7.3-6
Ŷ = 100 X^(3/4)





















The residuals are all within about 10⁻⁴ of 0; this is clearly a much better fit to the data. Transforming back to the original X, Y variables from the regression line

   ln(Ŷ) = 4.605 + 0.75 ln(X),

we obtain   Ŷ = e^(4.605 + 0.75 ln(X)) = e^(4.605) e^(0.75 ln(X)) = 100 X^(0.75).




That is, the variables follow a power law relation with exponent β = 3/4, illustrating a result known as Kleiber's Law of quarter-power scaling. See Appendix > Regression Models > Power Law Growth for more examples and information.
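A minimal R sketch of the whole procedure for the clearance data above (not part of the original notes; with these "rigged" data the smearing factor is essentially 1):

# Body weight (kg) and clearance rate data from the example above
X <- c(0.02, 0.25, 2.5, 5, 14, 70)
Y <- c(5.318, 35.355, 198.82, 334.4, 723.8, 2420.0)

# Fit the log-log (power law) model  V = b0 + b1 U
U <- log(X); V <- log(Y)
fit <- lm(V ~ U)
b0 <- coef(fit)[1]; b1 <- coef(fit)[2]

# Back-transform, including Duan's smearing factor (mean of the exponentiated residuals)
smear     <- mean(exp(resid(fit)))
alpha.hat <- exp(b0) * smear     # estimated scale parameter (about 100 here)
beta.hat  <- b1                  # estimated shape parameter (about 0.75 here)

# Fitted power-law curve superimposed on the original scatterplot
plot(X, Y, pch = 19)
curve(alpha.hat * x^beta.hat, add = TRUE, col = "red")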
Ismor Fischer, 5/29/2012 7.3-7
[Figure: exponential curves Y = e^(βX) for β > 0 (e.g., e^(0.5X), e^(X), e^(2X)) and for β < 0 (e.g., e^(−0.5X), e^(−X), e^(−2X)).]

Logarithmic Transformation:   Y = α e^(βX)   (Assume α > 0.)
















In some systems, the response variable Y grows (β > 0) or decays (β < 0) exponentially in X. That is, each unit increase in X results in a new response value Y that is a constant multiple (either > 1 or < 1, respectively) of the previous response value. A typical example is unrestricted cell division where, under ideal conditions, the number of cells Y at the end of every time period X is twice the number at the previous period. (The resulting explosion in the number of cells helps explain why patients with bacterial infections need to remain on their full ten-day regimen of antibiotics, even if they feel recovered sooner.) The half-life of a radioactive isotope is a typical example of exponential decay.

In general, if Y = α e^(βX), then ln(Y) = ln(α) + β X,
i.e.,   V = β0 + β1 X.

That is, X and ln(Y) have a linear association, and the model itself is said to be log-linear. Therefore, the responses are often replotted on a semilog scale, i.e., ln(Y) versus X, in order to bring out the linear trend. As before, the linear regression coefficients of the transformed data are then computed, and backsolved for estimates of the scale parameter α = e^b0 and shape parameter β = b1.

Also see Appendix > Regression Models > Exponential Growth and
Appendix > Regression Models > Example - Newton's Law of Cooling.
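As an illustration only (the data below are made up, not taken from the notes), a short R sketch of fitting an exponential growth model on a semilog scale:

# Hypothetical example: exponential growth, Y = alpha * exp(beta * X), plus noise
set.seed(1)
X <- 0:10
Y <- 3 * exp(0.4 * X) * exp(rnorm(11, 0, 0.05))   # assumed "true" alpha = 3, beta = 0.4

# Semilog (log-linear) fit:  ln(Y) = b0 + b1 * X
fit <- lm(log(Y) ~ X)
alpha.hat <- exp(coef(fit)[1])    # estimated scale parameter
beta.hat  <- coef(fit)[2]         # estimated shape parameter

plot(X, Y, pch = 19)
curve(alpha.hat * exp(beta.hat * x), add = TRUE, col = "red")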

Comment: Recall that the square root and logarithm functions also serve to transform positively skewed data closer to being normally distributed. Caution: If any of the values are ≤ 0, then add a constant value (e.g., +1) uniformly to all of the values, before attempting to take their square root or logarithm!!!
Ismor Fischer, 5/29/2012 7.3-8

Multiple Linear Regression

Suppose we now have k − 1 independent explanatory variables X1, X2, …, X(k−1) (numerical or categorical) to predict a single continuous response variable Y. Then the regression setup Response = Model + Error becomes:

Y =  β0 + β1 X1 + β2 X2 + β3 X3 + ⋯ + β(k−1) X(k−1)           main effect terms
     + β11 X1² + β22 X2² + ⋯ + β(k−1)(k−1) X(k−1)²             quadratic terms (if any)
     + β25 X2 X5 + β68 X6 X8 + ⋯                               two-way interaction terms (if any)
     + β147 X1 X4 X7 + ⋯                                       three-way interaction terms (if any)
     + ⋯

For simplicity, first consider the general additive model, i.e., main effects only.

Question 1: How are the estimates of the regression coefficients obtained?

Answer: Least Squares Approximation (LS), which follows the same principle of minimizing the residual sum of squares SS_Error. However, this leads to a set of complicated "normal equations," best formulated via matrix algebra, and solved numerically by a computer. See figure below for two predictors.
[Figure: for two predictors, the fitted regression plane Ŷ = b0 + b1 X1 + b2 X2 in (X1, X2, Y)-space, showing a true response y_i, its fitted response ŷ_i above the predictor point (x1i, x2i), the residual e_i = y_i − ŷ_i, and the centroid (x̄1, x̄2, ȳ).]
Ismor Fischer, 5/29/2012 7.3-9

Question 2: Which predictor variables among X1, X2, …, X(k−1) are the most important for modeling the response variable? That is, which regression coefficients βj are statistically significant?

Answer: This raises the issue of model selection, one of the most important
problems in the sciences. There are two basic stepwise procedures: forward
selection (FS) and backward elimination (BE) (as well as widely used hybrids
of these methods (FB)). The latter is a bit easier to conceptualize, and the steps
are outlined below.

Model Selection: Backward Elimination (BE)

Step 0. In a procedure that is extremely similar to that for multiple comparison of k treatment means (§6.3.3), first conduct an overall F-test of the full model

   Y = β0 + β1 X1 + β2 X2 + ⋯ + β(k−1) X(k−1),

by constructing an ANOVA table:

                    Sum of Squares              Mean Squares    Test Statistic
   Source     df     SS                          MS = SS/df      F = MS_Reg / MS_Err    p-value
   Regression k − 1  Σ_{i=1}^{n} (ŷ_i − ȳ)²      MS_Reg          F ~ F(k−1, n−k)        0 ≤ p ≤ 1
   Error      n − k  Σ_{i=1}^{n} (y_i − ŷ_i)²    MS_Err
   Total      n − 1  Σ_{i=1}^{n} (y_i − ȳ)²





If and only if the null hypothesis is (hopefully) rejected, it then becomes necessary to determine which of the predictor variables correspond to statistically significant regression coefficients. (Note that this is analogous to determining which of the k treatment group means are significantly different from the others, in multiple comparisons.)

Null Hypothesis H0:  β1 = β2 = ⋯ = β(k−1) = 0      There is no linear association between the response Y and any of the predictors X1, …, X(k−1).

Alternative Hyp. HA:  βj ≠ 0 for some j            There is a linear association between the response Y and at least one predictor Xj.
Ismor Fischer, 5/29/2012 7.3-10

Example ~

Step 0. Conduct an overall F-test of significance (via ANOVA) of the full model

   Y = β0 + β1 X1 + β2 X2 + β3 X3 + β4 X4.

Step 1. t-test of each coefficient:

   H0: β1 = 0      H0: β2 = 0      H0: β3 = 0      H0: β4 = 0
   p-values:   p1 < .05    p2 < .05    p3 > .05    p4 < .05
   Reject H0       Reject H0       Accept H0       Reject H0

Step 2. Are all the p-values significant (i.e., < α = .05)? If not, then...

Step 3. Delete the predictor with the largest p-value (here X3), and recompute new coefficients for the reduced model

   Y = β0 + β1 X1 + β2 X2 + β4 X4.

Repeat Steps 1-3 as necessary, until all p-values are significant.

Step 4. Check feasibility of the final reduced (fitted) model

   Ŷ = b0 + b1 X1 + b2 X2 + b4 X4,

and interpret.
Ismor Fischer, 5/29/2012 7.3-11

Comment: The steps outlined above extend to much more general models, including interaction terms, binary predictors (e.g., in women's breast cancer risk assessment, let X = 1 if a first-order relative (mother, sister, daughter) was ever affected, X = 0 if not), binary response (e.g., Y = 1 if cancer occurs, Y = 0 if not), multiple responses, etc. The overall goal is to construct a parsimonious model based on the given data, i.e., one that achieves a balance between the level of explanation of the response, and the number of predictor variables. A good model will not have so few variables that it is overly simplistic, yet not too many that its complexity makes it difficult to interpret and form general conclusions. There is a voluminous amount of literature on regression methods for specialized applications; some of these topics are discussed below, but a thorough treatment is far beyond the scope of this basic introduction.
In more detail, the Backward Elimination steps are:

Step 1. For each coefficient (j = 1, 2, …, k − 1), calculate the associated p-value from the test statistic t-ratio = (bj − 0) / s.e.(bj) ~ t(n−k), corresponding to the null hypothesis H0: βj = 0, versus the alternative HA: βj ≠ 0. (Note: The mathematical expression for the standard error s.e.(bj) is quite complicated, and best left to statistical software for evaluation.)

Step 2. Are all the p-values corresponding to the regression coefficients significant at (i.e., less than) level α?

If No → Step 3. Select the single least significant coefficient at level α (i.e., the largest p-value, indicating strongest acceptance of the null hypothesis βj = 0), and delete only that corresponding term βj Xj from the model. Refit the original data to the new model without the deleted term. That is, recompute the remaining regression coefficients from scratch. Repeat Steps 1-3 until all surviving coefficients are significant (i.e., all p-values < α), then go to Step 4.

If Yes → Step 4. Evaluate how well the final reduced model fits; check the multiple r² value, residual plots, "reality check," etc. It is also possible to conduct a formal Lack-of-Fit Test, which involves repeated observations y_ij at predictor value x_i; the minimized residual sum of squares SS_Error can then be further partitioned into SS_Pure + SS_Lack-of-Fit, and a formal F-test of significance conducted for the appropriateness of the linear model.
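A minimal R sketch of one backward-elimination pass (illustrative only; the data are simulated here, since the notes do not give a numerical example):

# Hypothetical data: Y depends on X1, X2, X4 but not on X3
set.seed(42)
n  <- 50
X1 <- rnorm(n); X2 <- rnorm(n); X3 <- rnorm(n); X4 <- rnorm(n)
Y  <- 10 + 2*X1 - 1.5*X2 + 0.8*X4 + rnorm(n)
dat <- data.frame(Y, X1, X2, X3, X4)

full <- lm(Y ~ X1 + X2 + X3 + X4, data = dat)
summary(full)                  # Step 0: overall F-test; Step 1: t-ratios and p-values

# Step 3: X3 typically has the largest (nonsignificant) p-value -- drop it and refit
reduced <- update(full, . ~ . - X3)
summary(reduced)               # repeat Steps 1-3 until all remaining p-values < alpha

# R's step() automates a similar stepwise search, but uses AIC rather than p-values
# step(full, direction = "backward")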
Ismor Fischer, 5/29/2012 7.3-12
[Interaction diagram for the additive model below:
   at the LOW level X2 = 0:     Ŷ = 120 + 0.5 X1
   at the HIGH level X2 = 20:   Ŷ = 125 + 0.5 X1
The change in response with respect to X2 is constant (5 mm Hg), independent of X1: no interaction between X1 and X2 on Y.]

Interaction Terms

Consider the following example. We wish to study the effect of two continuous predictor variables, say X1 = Drug 1 dosage (0-10 mg) and X2 = Drug 2 dosage (0-20 mg), on a response variable Y = systolic blood pressure (mm Hg). Suppose that, based on empirical data using different dose levels, we obtain the following additive multilinear regression model, consisting of main effects only:

   Y = 120 + 0.5 X1 + 0.25 X2,     0 ≤ X1 ≤ 10,  0 ≤ X2 ≤ 20.

Rather than attempting to visualize this planar response surface in three dimensions, we can better develop intuition into the relationships between the three variables by projecting it into a two-dimensional interaction diagram, and seeing how the response varies as each predictor is tuned from low to high.





















First consider the effect of Drug 1 alone on systolic blood pressure, i.e., X2 = 0. As Drug 1 dosage is increased from a low level of X1 = 0 mg to a high level of X1 = 10 mg, the blood pressure increases linearly, from Ŷ = 120 mm Hg to Ŷ = 125 mm Hg. Now consider the effect of adding Drug 2, eventually at X2 = 20 mg. Again, as Drug 1 dosage is increased from a low level of X1 = 0 mg to a high level of X1 = 10 mg, blood pressure increases linearly, from Ŷ = 125 mm Hg to Ŷ = 130 mm Hg. The change in blood pressure remains constant, thereby resulting in two parallel lines, indicating no interaction between the two drugs on the response.
Ismor Fischer, 5/29/2012 7.3-13
[Interaction diagram for the model with interaction below:
   at the LOW level X2 = 0:     Ŷ = 120 + 0.5 X1
   at the HIGH level X2 = 20:   Ŷ = 125 + 2.5 X1
The change in response with respect to X2 now depends on X1 (at X1 = 0: ΔŶ = 5 mm Hg; at X1 = 10: ΔŶ = 25 mm Hg): interaction between X1 and X2 on Y.]

However, suppose instead that the model includes a statistically significant (i.e., p-value < α) interaction term:

   Y = 120 + 0.5 X1 + 0.25 X2 + 0.1 X1 X2,     0 ≤ X1 ≤ 10,  0 ≤ X2 ≤ 20.

This has the effect of changing the response surface from a plane to a
hyperbolic paraboloid, shaped somewhat like a saddle.




















Again, at the Drug 2 low dosage level X2 = 0, systolic blood pressure linearly increases by 5 mm Hg as Drug 1 is increased from X1 = 0 to X1 = 10, exactly as before. But now, at the Drug 2 high dosage level X2 = 20, a different picture emerges. For as Drug 1 dosage is increased from a low level of X1 = 0 mg to a high level of X1 = 10 mg, blood pressure linearly increases from Ŷ = 125 mm Hg to a hypertensive Ŷ = 150 mm Hg, a much larger difference of 25 mm Hg!

Hence the two resulting lines are not parallel, indicating a significant drug-drug interaction on the response.

Exercise: Draw the interaction diagram corresponding to the model

   Ŷ = 120 + 0.5 X1 + 0.25 X2 − 0.1 X1 X2.

Comment: As a rule, if an explanatory variable Xj is not significant as a main effect, but is a factor in a statistically significant interaction term, it is nevertheless retained as a main effect in the final model. This convention is known as the Hierarchical Principle.
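A short R sketch (for illustration; the responses are generated from the blood-pressure interaction model above, with a little added noise) showing how an interaction term is specified and displayed:

set.seed(1)
# Crossed grid of dose levels for the two drugs
X1 <- rep(c(0, 5, 10), times = 3)
X2 <- rep(c(0, 10, 20), each  = 3)
Y  <- 120 + 0.5*X1 + 0.25*X2 + 0.1*X1*X2 + rnorm(9, sd = 0.5)

fit <- lm(Y ~ X1 * X2)           # X1 * X2 expands to X1 + X2 + X1:X2
summary(fit)                     # the X1:X2 row estimates the interaction coefficient (0.1)

# Interaction diagram: mean response vs. X1, one trace per level of X2
interaction.plot(x.factor = factor(X1), trace.factor = factor(X2), response = Y)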
Ismor Fischer, 5/29/2012 7.3-14

These ideas also appear in another form. Consider the example of constructing a
simple linear regression model for the response variable Y =height (in.) on
the single predictor variable X =weight (lbs.) for individuals of a particular
age group. A reasonably positive correlation might be expected, and after
obtaining sample observations, the following scatterplot may result, with
accompanying least squares regression line.


However, suppose it is the case that the sample is actually composed of two
distinct subgroups, which are more satisfactorily modeled by separate, but
parallel, regression lines, as in the examples shown below.



[Figures: left, a single scatterplot of Y vs. X with one overall fitted line; right, the same data split into Males and Females and fitted by the separate but parallel lines Ŷ = 52 + 0.1 X and Ŷ = 48 + 0.1 X.]
Ismor Fischer, 5/29/2012 7.3-15

It is possible to fit both parallel lines to a single multiple linear model
simultaneously, by introducing a binary variable that, in this case, codes for
gender. Let M =1 if Male, and M =0 if Female. Then the model

Y = 48 + 0.1 X + 4 M

incorporates both the (continuous) numerical variable X, as well as the (binary)
categorical variable M, as predictors for the response.

However, if the simple linear regression lines are not parallel, then it becomes necessary to include an interaction term, just as before. For example, the model

   Y = 48 + 0.1 X + 4 M + 0.2 M X

becomes Ŷ = 48 + 0.1 X if M = 0, and Ŷ = 52 + 0.3 X if M = 1. These lines have unequal slopes (0.1 and 0.3), hence are not parallel.

More generally then, categorical data can also be used as predictors of response, by introducing "dummy," or indicator, variables in the model. Specifically, for each disjoint category i = 1, 2, 3, …, k, let I_i = 1 if category i, and 0 otherwise. For example, for the k = 4 categories of blood type, we have

   I1 = 1 if Type A, 0 otherwise
   I2 = 1 if Type B, 0 otherwise
   I3 = 1 if Type AB, 0 otherwise
   I4 = 1 if Type O, 0 otherwise.

Note that I1 + I2 + ⋯ + Ik = 1, so there is collinearity among these k variables; hence, just as in multiple comparisons, there are k − 1 degrees of freedom. (Therefore, only this many indicator variables should be retained in the model; adding the last does not supply new information.) As before, a numerical response Y_i for each of the categories can then be modeled by combining main effects and possible interactions of numerical and/or indicator variables.
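In R, a categorical predictor declared as a factor is automatically expanded into k − 1 indicator variables. A minimal sketch (with hypothetical data) of the parallel-lines and unequal-slopes models above:

set.seed(2)
# Hypothetical sample: weight X, gender M (1 = Male, 0 = Female), height Y
X <- runif(40, 100, 220)
M <- rep(c(1, 0), each = 20)
Y <- 48 + 0.1*X + 4*M + rnorm(40, sd = 1.5)

fit1 <- lm(Y ~ X + M)            # parallel lines: common slope, separate intercepts
coef(fit1)

fit2 <- lm(Y ~ X * M)            # unequal slopes: adds the M-by-X interaction term
coef(fit2)

# With a k-level factor (e.g., blood type), R creates the k - 1 indicators itself:
# lm(Y ~ bloodtype), where bloodtype <- factor(c("A", "B", "AB", "O", ...))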

But what if the response Y itself is categorical, e.g., binary?

Ismor Fischer, 5/29/2012 7.3-16
Logistic Regression

Suppose we wish to model a binary response variable Y, i.e., Y = 1 (Success) with probability π, and Y = 0 (Failure) with probability 1 − π, in terms of a predictor variable X. This problem gives rise to several difficulties, as the following example demonstrates.

Example: If you live long enough, you will need surgery. Imagine that we wish to use the continuous variable X = Age as a predictor for the binary variable Y = "Ever had major surgery" (1 = Yes, 0 = No). If we naively attempt to use simple linear regression however, the resulting model contains relatively little predictive value for the response (either 0 or 1), since it attains all continuous values from −∞ to +∞; see figure below.

This is even more problematic if there are several people of the same age X, with some having had major surgery (i.e., Y = 1), but the others not (i.e., Y = 0). Possibly, a better approach might be to replace the response Y (either 0 or 1) with its probability π in the model. This would convert the binary variable to a continuous variable, but we still have two problems. First, we are restricted to the finite interval 0 ≤ π ≤ 1. And second, although π̂ is approximately normally distributed, its variance is not constant (see §6.1.3), in violation of one of the assumptions on least squares regression models stated in §7.2.
Y
0
X =Age

Y =
0

+
1

X


| | | | | | | | |

10 20 30 40 50 60 70 80 90
| | | | | | | | |

10 20 30 40 50 60 70 80 90
1
= P(Y =1)
0
X =Age

=
0

+
1

X
Ismor Fischer, 5/29/2012 7.3-17

One solution to the first problem is to transform the probability π, using a continuous link function g(π), which takes on values from −∞ to +∞ as π ranges from 0 to 1. The function usually chosen for this purpose is the log-odds, or logit (pronounced "low-jit"): g(π) = ln[π / (1 − π)]. Thus, the model is given by

   ln[π / (1 − π)] = b0 + b1 X,     or equivalently,     π = 1 / (1 + e^(−b0 − b1 X)).
This reformulation does indeed put the estimate π̂ between 0 and 1, but with the constant variance assumption violated, the technique of least squares approximation does not give the best fit here. For example, consider the following artificial data:

   X:   0      1      2      3      4
   π:   0.01   0.01   0.50   0.99   0.99


Least squares approximation gives the regression parameter estimates b0 = −5.514 and b1 = 2.757, resulting in the dotted graph shown. However, a closer fit is obtained by using the technique of Maximum Likelihood Estimation (MLE), actually a generalization of least squares approximation, and best solved by computer software. The MLE coefficients are b0 = −7.072 and b1 = 3.536, resulting in the solid graph shown.

[Figure: π = P(Y = 1) versus X, showing the two fitted logistic curves (dotted: least squares; solid: MLE).]
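A minimal R sketch of fitting the logistic model by MLE with glm(). Note that the notes give only the proportions at each X, not the underlying group sizes; the group size n below is an assumption, and the fitted coefficients depend on it.

X <- c(0, 1, 2, 3, 4)
p <- c(0.01, 0.01, 0.50, 0.99, 0.99)   # observed proportions of "Success" at each X
n <- rep(100, 5)                       # assumed number of subjects per X (not given in the notes)

# MLE logistic fit on grouped binomial data: cbind(# successes, # failures)
fit <- glm(cbind(n * p, n * (1 - p)) ~ X, family = binomial)
coef(fit)                              # estimates of b0 and b1

# Fitted curve: pi.hat = 1 / (1 + exp(-(b0 + b1*X)))
plot(X, p, pch = 19, ylab = "proportion")
curve(plogis(coef(fit)[1] + coef(fit)[2] * x), add = TRUE)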
Ismor Fischer, 5/29/2012 7.3-18

Comments:

This is known as the S-shaped, sigmoid, or logistic curve, and appears in a wide variety of applications. See Appendix > Regression Models > Logistic Growth for an example involving restricted population growth. (Compare with unrestricted exponential growth, discussed earlier.)

It is often of interest to determine the median response level, that is, the value of the predictor variable X for which a 50% response level is achieved. Hence, if π̂ = 0.5, then b0 + b1 X = ln[0.5 / (1 − 0.5)] = 0, so X = −b0 / b1.
Exercise: Prove that the median response corresponds to the point of
inflection (i.e., change in concavity) of any general logistic curve.

Other link functions sometimes used for binary responses are the probit
(pronounced pro-bit) and tobit (pronounced toe-bit) functions, which
have similar properties to the logit. The technique of using link functions is
part of a larger regression theory called Generalized Linear Models.

Since the method of least squares is not used for the best fit, the traditional coefficient of determination r² as a measure of model fitness does not exist! However, several analogous "pseudo-r²" formulas have been defined (Efron, McFadden, Cox & Snell, others), but must be interpreted differently.

Another way to deal with the nonconstant variance of proportions, which does not require logistic regression, is to work with the variance-stabilizing transformation arcsin √π̂, a technique that we do not pursue here.

To compare regression models: Wald Test, Likelihood Ratio Test, Akaike
Information Criterion (AIC), Bayesian Information Criterion (BIC).

Polytomous regression is used if the response Y has more than two categories.

Ismor Fischer, 5/29/2012 7.3-19

The logit function can be modeled by more than one predictor variable via multilinear logistic regression, using selection techniques as described above (except that MLE for the coefficients must be used instead of LS). For instance,

   ln[π / (1 − π)] = b0 + b1 X1 + b2 X2 + ⋯ + b(k−1) X(k−1).

In particular, suppose that one of these variables, say X1, is binary. Then, as its category level changes from X1 = 0 to X1 = 1, the right-hand amount changes exactly by its coefficient b1. The corresponding amount of change of the left side is equal to the difference in the two log-odds which, via a basic property of logarithms, is equal to the logarithm of the odds ratio between the two categories. Thus, the odds ratio itself can be estimated by e^b1.

Example: Suppose that, in a certain population of individuals 50+ years old, it is found that the probability π = P(Lung cancer) is modeled by

   ln[π / (1 − π)] = −6 + .05 X1 + 4.3 X2

where the predictors are X1 = Age (years), and X2 = Smoker (1 = Yes, 0 = No); note that X1 is numerical, but X2 is binary. Thus, for example, a 50-year-old nonsmoker would correspond to X1 = 50 and X2 = 0 in the model, which yields −3.5 on the right hand side for the log-odds of this group. (Solving for the actual probability itself gives π̂ = 1 / (1 + e^(3.5)) = .03.) We can take this value as a baseline for the population. Likewise, for a 50-year-old smoker, the only difference would be to have X2 = 1 in the model, to indicate the change in smoking status from baseline. This would yield +0.8 for the log-odds (corresponding to π̂ = 1 / (1 + e^(−0.8)) = 0.69). Thus, taking the difference gives

   log-odds(Smokers) − log-odds(Nonsmokers) = 0.8 − (−3.5)

i.e.,

   log[odds(Smokers) / odds(Nonsmokers)] = 0.8 + 3.5,   or   log(OR) = 4.3,   so that   OR = e^(4.3) = 73.7,

quite large. That the exponent 4.3 is also the coefficient of X2 in the model is not a coincidence, as stated above. Moreover, OR = 73.7 here, for any age X1!
Recall that log(A / B) = log A − log B.

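A quick numerical check of this example in R, using only the model coefficients given above:

# Log-odds from the fitted model:  ln(odds) = -6 + 0.05*Age + 4.3*Smoker
logodds <- function(age, smoker) -6 + 0.05*age + 4.3*smoker

plogis(logodds(50, 0))                 # baseline probability for a 50-year-old nonsmoker (about .03)
plogis(logodds(50, 1))                 # probability for a 50-year-old smoker (about 0.69)

# Odds ratio, smokers vs. nonsmokers, at any age
exp(logodds(50, 1) - logodds(50, 0))   # = exp(4.3) = 73.7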
Ismor Fischer, 5/29/2012 7.3-20
Pharmaceutical Application: Dose-Response Curves

Example: Suppose that, in order to determine its efficacy, a certain drug is
administered in subtoxic dosages X (mg) of 90 mg increments to a large group of
patients. For each patient, let the binary variable Y =1 if improvement is
observed, Y =0 if there is no improvement. The proportion of improved
patients is recorded at each dosage level, and the following data are obtained.

   X:   90     180    270    360    450
   π:   0.10   0.20   0.60   0.80   0.90

The logistic regression model (as computed via MLE) is

   ln[π̂ / (1 − π̂)] = −3.46662 + 0.01333 X,     i.e.,     π̂ = 1 / (1 + e^(3.46662 − 0.01333 X)),

and the following graph is obtained.





















The median dosage is X = 3.46662 / 0.01333 = 260.0 mg. That is, above this dosage level, more patients are improving than not improving.


Ismor Fischer, 1/11/2014 7.4-1

7.4 Problems

1. In Problem 4.4/29, it was shown that important relations exist between population means, variances, and covariance. Specifically, we have the following formulas:

   I.  (A) μ(X+Y) = μX + μY        (B) σ²(X+Y) = σX² + σY² + 2 σXY
   II. (A) μ(X−Y) = μX − μY        (B) σ²(X−Y) = σX² + σY² − 2 σXY

In this problem, we verify that these properties are also true for sample means, variances, and covariance in two examples. For data values {x1, x2, …, xn} and {y1, y2, …, yn}, recall that:

   x̄ = (1/n) Σ_{i=1}^{n} x_i,    s_x² = (1/(n−1)) Σ_{i=1}^{n} (x_i − x̄)²,
   ȳ = (1/n) Σ_{i=1}^{n} y_i,    s_y² = (1/(n−1)) Σ_{i=1}^{n} (y_i − ȳ)².

Now suppose that each value x_i from the first sample is paired with exactly one corresponding value y_i from the second sample. That is, we have the set of n ordered pairs of data {(x1, y1), (x2, y2), …, (xn, yn)}, with sample covariance given by

   s_xy = (1/(n−1)) Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ).

Furthermore, we can label the pairwise sum x + y as the dataset (x1 + y1, x2 + y2, …, xn + yn), and likewise for the pairwise difference x − y. It can be shown (via basic algebra, or Appendix A2) that, for any such dataset of ordered pairs, the corresponding sample formulas hold (note that these generalize the properties found in Problem 2.5/4):

   I.  (A) mean(x + y) = x̄ + ȳ    (B) s²(x+y) = s_x² + s_y² + 2 s_xy
   II. (A) mean(x − y) = x̄ − ȳ    (B) s²(x−y) = s_x² + s_y² − 2 s_xy

For the following ordered data pairs, verify that the formulas in I and II hold. (In R, use mean, var, and cov.) Also, sketch the scatterplot.

   x:   0    6   12   18
   y:   3    3    5    9

Repeat for the following dataset. Notice that the values of x_i and y_i are the same as before, but the correspondence between them is different!

   x:   0    6   12   18
   y:   3    9    3    5
Ismor Fischer, 1/11/2014 7.4-2

2. Expiration dates that establish the shelf lives of pharmaceutical products are determined from
stability data in drug formulation studies. In order to measure the rate of decomposition of a
particular drug, it is stored under various conditions of temperature, humidity, light intensity,
etc., and assayed for intact drug potency at FDA-recommended time intervals of every three
months during the first year. In this example, the assay Y (mg) of a certain 500 mg tablet
formulation is determined at time X (months) under ambient storage conditions.

   X:    0     3     6     9    12
   Y:  500   490   470   430   350

(a) Graph these data points (x_i, y_i) in a scatterplot, and calculate the sample correlation coefficient r = s_xy / (s_x s_y). Classify the correlation as positive or negative, and as weak, moderate, or strong.

(b) Determine the equation of the least squares regression line for these data points, and include a 95% confidence interval for the slope β1.

(c) Sketch a graph of this line on the same set of axes as part (a); also calculate and plot the fitted response values ŷ_i and the residuals e_i = y_i − ŷ_i on this graph.

(d) Complete an ANOVA table for this linear regression, including the F-ratio and
corresponding p-value.

(e) Calculate the value of the coefficient of determination r², using the two following equivalent ways (and showing agreement of your answers), and interpret this quantity as a measure of fit of the regression line to the data, in a brief, clear explanation.

   via squaring the correlation coefficient r = s_xy / (s_x s_y) found in (a),
   via the ratio r² = SS_Regression / SS_Total of sums of squares found in (d).

(f) Test the null hypothesis of no linear association between X and Y, either by using your answer in (a) on H0: ρ = 0, or equivalently, by using your answers in (b) and/or (d) on H0: β1 = 0.

(g) Calculate a point estimate of the mean potency when X = 6 months. Judging from the
data, is this realistic? Determine a 95% confidence interval for this value.

(h) The FDA recommends that the expiration date should be defined as that time when a drug
contains 90% of the labeled potency. Using this definition, calculate the expiration date
for this tablet formulation. Judging from the data, is this realistic?

(i) The residual plot of this model shows evidence of a nonlinear trend. (Check this!) In order to obtain a better regression model, first apply the linear transformations Xtilde = X / 3 and Ytilde = 510 − Y, then try fitting an exponential curve Ytilde ≈ α e^(β · Xtilde). Use this model to determine the expiration date. Judging from the data, is this realistic?

Ismor Fischer, 1/11/2014 7.4-3

(j) Redo this problem using the following R code:

# See help(lm) or help(lsfit), and help(plot.lm) for details.

# Compute Correlation Coefficient and Scatterplot
X <- c(0, 3, 6, 9, 12)
Y <- c(500, 490, 470, 430, 350)
cor(X, Y)
plot(X, Y, xlab = "X = Months", ylab = "Y = Assay (mg)", pch=19)

# Least Squares Fit, Regression Line Plot, ANOVA F-test
regline <- lm(Y ~ X)
summary(regline)
abline(regline, col = "blue")
# Exercise: Why does the p-value of 0.02049 appear twice?

# Estimate Mean Potency at 6 Months
new <- data.frame(X = 6)
predict(regline, new, interval = "confidence")

# Residual Plot
resids <- round(resid(regline), 2)
plot(regline, which = 1, id.n = 5, labels.id = resids, pch=19)

# Log-Transformed Linear Regression
Xtilde <- X / 3
Ytilde <- 510 - Y
V <- log(Ytilde)
plot(Xtilde, V, xlab = "Xtilde", ylab = "ln(Ytilde)", pch=19)
regline.transf <- lm(V ~ Xtilde)
summary(regline.transf)
abline(regline.transf, col = "red")

# Plot Transformed Model
coeffs <- coefficients(regline.transf)
scale <- exp(coeffs[1])
shape <- coeffs[2]
Yhat <- function(X)(510 - scale * exp(shape * X / 3))
plot(X, Y, xlab = "X = Months", ylab = "Y = Assay (mg)", pch=19)
curve(Yhat, col = "red", add = TRUE)




Ismor Fischer, 1/11/2014 7.4-4

3. A Third Transformation. Suppose that two continuous variables X and Y are negatively correlated via the nonlinear relation Y = 1 / (αX + β), for some parameters α and β. This is algebraically equivalent to the relation 1/Y = αX + β, which can then be solved via simple linear regression. Use this reciprocal transformation on the data and corresponding scatterplot below, to sketch a new scatterplot, and solve for sample-based estimates of the parameters α and β. (Hint: Finding the parameter values in this example should be straightforward, and not require any least squares regression formulas.) Express the original response Y in terms of X.

   X:    0    1    2    3    4    5            X:     0    1    2    3    4    5
   Y:   60   30   20   15   12   10            1/Y:

4. For this problem, recall that in simple linear regression, we have the following definitions:

   b1 = s_xy / s_x²,     MS_Err = SS_Err / (n − 2),     r² = SS_Reg / SS_Tot = 1 − SS_Err / SS_Tot,
   SS_Tot = (n − 1) s_y²,     and     S_xx = (n − 1) s_x².

(a) Formally prove that the T-score = r √(n − 2) / √(1 − r²) for testing the null hypothesis H0: ρ = 0 is equal to the T-score = (b1 − 0) / √(MS_Err / S_xx) for testing the null hypothesis H0: β1 = 0.

(b) Formally prove that, in simple linear regression (where df_Reg = 1), the square of the T-score = (b1 − 0) / √(MS_Err / S_xx) is equal to the F-ratio = MS_Reg / MS_Err for testing the null hypothesis H0: β1 = 0.


Ismor Fischer, 1/11/2014 7.4-5

5. In a study of binge eating disorders among dieters, the average weights (Y) of a group of
overweight women of similar ages and lifestyles are measured at the end of every two months
(X) over an eight month period. The resulting data values, some accompanying summary
statistics, and the corresponding scatterplot, are shown below.

   X:    0    2    4    6    8        x̄ = 4       s_x² = 10
   Y:  200  190  210  180  220        ȳ = 200     s_y² = 250

(a) Compute the sample covariance s
xy
between the variables X and Y.

(b) Compute the sample correlation coefficient r between the variables X and Y. Use it to
classify the linear correlation as positive or negative, and as strong, moderate, or weak.

(c) Determine the equation of the least squares regression line for these data. Sketch a graph
of this line on the scatterplot provided above. Please label clearly!

(d) Also calculate the fitted response values ŷ_i, and plot the residuals e_i = y_i − ŷ_i, on this same graph. Please label clearly!

(e) Calculate the coefficient of determination r², and interpret its value in the context of evaluating the fit of this linear model to the sample data. Be as clear as possible.

(f) Interpretation: Evaluate the overall adequacy of the linear model to these data, using as
much evidence as possible. In particular, refer to at least two formal linear regression
assumptions which may or may not be satisfied here, and why.

Ismor Fischer, 1/11/2014 7.4-6

6. A pharmaceutical company wishes to evaluate the results Y of a new drug assay procedure,
performed on n = 5 drug samples of different, but known potency X. In a perfect error-free
assay, the two sets of values would be identical, thus resulting in the ideal calibration line
Y = X, i.e., Y = 0 + 1X. However, experimental variability generates the results shown below,
along with some accompanying summary statistics: the sample means, variances, and
covariance, respectively.

   X (mg):   30   40   50   60   70        x̄ = 50     s_x² = 250     s_xy = 260
   Y (mg):   32   39   53   65   71        ȳ = 52     s_y² = 275


(a) Graph these data points (x
i
, y
i
) in a scatterplot.

(b) Compute the sample correlation coefficient r. Use it to determine whether or not X and Y
are linearly correlated; if so, classify as positive or negative, and as weak, moderate, or
strong.

(c) Determine the equation of the least squares regression line for these data. Sketch a graph of this line on the same set of axes as part (a). Also calculate and plot the fitted response values ŷ_i and the residuals e_i = y_i − ŷ_i, on this same graph.

(d) Using all of this information, complete the following ANOVA table for this simple linear regression model. (Hints: SS_Total and df_Total can be obtained from s_y² given above; SS_Error = residual sum of squares, and df_Error = n − 2.) Show all work.
Error
= n 2.) Show all work.

Source df SS MS F-ratio p-value
Regression

Error

Total


(e) Construct a 95% confidence interval for the slope β1.

(f) Use the p-value in (d) and the 95% confidence interval in (e) to test whether the null hypothesis H0: β1 = 0 can be rejected in favor of the alternative HA: β1 ≠ 0, at the α = .05 significance level. Interpret your answer: What exactly has been demonstrated about any association that might exist between X and Y? Be precise.

(g) Use the 95% confidence interval in (e) to test whether the null hypothesis H0: β1 = 1 can be rejected in favor of the alternative HA: β1 ≠ 1, at the α = .05 significance level. Interpret your answer in context: What exactly has been demonstrated about the new drug assay procedure? Be precise.


7. Refer to the posted Rcode folder for this problem. Please answer all questions.

SS
Total
df
Total

8. Survival Analysis







8.1 Survival Functions and Hazard Functions

8.2 Estimation: Kaplan-Meier Formula

8.3 Inference: Log-Rank Test

8.4 Regression: Cox Proportional Hazards Model

8.5 Problems




Ismor Fischer, 5/29/2012 8.1-1
8. Survival Analysis

8.1 Definition: Survival Function

Survival Analysis is also known as Time-to-Event Analysis, Time-to-Failure
Analysis, or Reliability Analysis (especially in the engineering disciplines), and
requires specialized techniques.

Examples: Event

Cancer surgery, radiotherapy, chemotherapy Death

Cancer remission Cancer recurrence

Coronary artery bypass surgery Heart attack or death,
whichever comes first

Topical application of skin rash ointment Disappearance of
symptoms


However, such longitudinal data can be censored, i.e., the event may not occur
before the end of the study. Patients can be lost to follow-up (e.g., moved away,
non-compliant, choose a different treatment, etc.), as shown in the diagram below.







[Diagram: follow-up timelines for three patients between "Study begins" and "Study ends"; symbols in the legend mark Death and Censored observations.]
Ismor Fischer, 5/29/2012 8.1-2

POPULATION

Define a continuous random variable T = time-to-event, or in this context, survival time (until death). From this, construct the survival function

   S(t) = P(T > t) = Probability of surviving beyond time t.

The graph of the survival function is the survival curve.

Properties: For all t ≥ 0,

   0 ≤ S(t) ≤ 1;
   S(0) = 1, and S(t) monotonically decreases to 0 as t gets larger;
   S(t) is continuous.

Examples: S(t) = e^(−ct) (c > 0),   1 / (1 + t),   …

Note that the probability of death occurring in the interval [a, b] is P(a ≤ T ≤ b) = P(T > a) − P(T > b) = S(a) − S(b).

[Figure: a survival curve S(t) decreasing from 1, with the values S(a) and S(b) marked at times a < b.]
Ismor Fischer, 5/29/2012 8.1-3

SAMPLE: How can we estimate S(t), using a cohort of n individuals?

For simplicity, assume no censoring for now.

Life Table Method: Suppose that, at the end of every month (week, year, etc.), we record the current number of deaths d_t so far, or equivalently, the current number of survivors n_t = n − d_t, over the duration of the study. At these values t = 1, 2, 3, ..., define

   Ŝ(t) = n_t / n = 1 − d_t / n,

and linear in between.


Example: Twelve-month cohort study of n = 10 patients

   Patient   Survival Time (months)        Month    d_t    n_t    Ŝ(t)
      1            3*                        1       0     10     1.0
      2            5                         2       0     10     1.0
      3            6                         3       0     10     1.0
      4            6                         4       1      9     0.9
      5            7                         5       1      9     0.9
      6            8                         6       2      8     0.8
      7            8                         7       4      6     0.6
      8            8                         8       5      5     0.5
      9           10                         9       8      2     0.2
     10           12                        10       8      2     0.2
                                            11       9      1     0.1
   * Patient 1 died in month 4, etc.        12       9      1     0.1









Ismor Fischer, 5/29/2012 8.1-4





[Graph: the life-table estimate Ŝ(t) versus time (months 0-12), passing through the values 1.0, 1.0, 1.0, 0.9, 0.9, 0.8, 0.6, 0.5, 0.2, 0.2, 0.1, 0.1 at months 1-12, linear in between.]

Disadvantage: This method is based on calendar times, not cohort times of death, thereby wasting much information. A more efficient method can be developed that is based on the observed times of death of the patients.
Ismor Fischer, 5/29/2012 8.2-1

8.2 Estimation: Kaplan-Meier Product-Limit Formula



Let t1, t2, t3, … denote the actual times of death of the n individuals in the cohort. Also let d1, d2, d3, … denote the number of deaths that occur at each of these times, and let n1, n2, n3, … be the corresponding number of patients remaining in the cohort. Note that n2 = n1 − d1, n3 = n2 − d2, etc. Then, loosely speaking, S(t2) = P(T > t2) = Probability of surviving beyond time t2 depends conditionally on S(t1) = P(T > t1) = Probability of surviving beyond time t1. Likewise, S(t3) = P(T > t3) = Probability of surviving beyond time t3 depends conditionally on S(t2) = P(T > t2) = Probability of surviving beyond time t2, etc. By using this recursive idea, we can iteratively build a numerical estimate Ŝ(t) of the true survival function S(t). Specifically,



For any time t ∈ [0, t1), we have S(t) = P(T > t) = Probability of surviving beyond time t = 1, because no deaths have as yet occurred. Therefore, for all t in this interval, let Ŝ(t) = 1.

Recall (see §3.2): For any two events A and B, P(A and B) = P(A) P(B | A).

Let A = "survive to time t1" and B = "survive from time t1 to beyond some time t before t2." Having both events occur is therefore equivalent to the event "A and B" = "survive to beyond time t before t2," i.e., T > t. Hence, the following holds.

For any time t ∈ [t1, t2), we have

   S(t) = P(T > t) = P(survive in [0, t1)) × P(survive in [t1, t] | survive in [0, t1)),

i.e., Ŝ(t) = 1 × (n1 − d1)/n1, or Ŝ(t) = 1 − d1/n1. Similarly,

For any time t ∈ [t2, t3), we have

   S(t) = P(T > t) = P(survive in [t1, t2)) × P(survive in [t2, t] | survive in [t1, t2)),

i.e., Ŝ(t) = (1 − d1/n1) × (n2 − d2)/n2, or Ŝ(t) = (1 − d1/n1)(1 − d2/n2), etc.

[Timeline: 0, t1, t2, t3, …]
Ismor Fischer, 5/29/2012 8.2-2

In general, for t ∈ [t_j, t_{j+1}), j = 1, 2, 3, …, we have

   Ŝ(t) = (1 − d1/n1)(1 − d2/n2) ⋯ (1 − d_j/n_j) = ∏_{i=1}^{j} (1 − d_i/n_i).

This is known as the Kaplan-Meier estimator of the survival function S(t). (Theory developed in 1950s, but first implemented with computers in 1970s.) Note that it is not continuous, but only piecewise-continuous (actually, piecewise-constant, or a step function).

Comment: The Kaplan-Meier estimator Ŝ(t) can be regarded as a point estimate of the survival function S(t) at any time t. In a manner similar to that discussed in §7.2, we can construct 95% confidence intervals around each of these estimates, resulting in a pair of confidence bands that brackets the graph. To compute the confidence intervals, Greenwood's Formula gives an asymptotic estimate of the standard error of Ŝ(t) for large groups.

[Schematic of the estimator:
   times of death:   0      t1            t2              t3    …
   # deaths:         0      d1            d2              d3    …
   # survivors:      n1 = n − 0      n2 = n1 − d1      n3 = n2 − d2    …
   with Ŝ(t) = 1 on [0, t1);  1 − d1/n1 on [t1, t2);  (1 − d1/n1)(1 − d2/n2) on [t2, t3);  ….]
Ismor Fischer, 5/29/2012 8.2-3
Example (cont'd): Twelve-month cohort study of n = 10 patients

   Patient   t_i (months)     Interval [t_i, t_{i+1})   n_i = # at risk at t_i⁻   d_i = # deaths at t_i   1 − d_i/n_i    Ŝ(t)
      1          3.2          [0, 3.2)                        10                         0                  1.00         1.0
      2          5.5          [3.2, 5.5)                 10 − 0 = 10                     1                  0.90         0.9
      3          6.7          [5.5, 6.7)                 10 − 1 =  9                     1                  0.89         0.8
      4          6.7          [6.7, 7.9)                  9 − 1 =  8                     2                  0.75         0.6
      5          7.9          [7.9, 8.4)                  8 − 2 =  6                     1                  0.83         0.5
      6          8.4          [8.4, 10.3)                 6 − 1 =  5                     3                  0.40         0.2
      7          8.4          [10.3, 12)                  5 − 3 =  2                     1                  0.50         0.1
      8          8.4          Study Ends                  2 − 1 =  1                     0                  1.00         0.1
      9         10.3
     10         alive                                                        (t_i⁻ denotes a time just prior to t_i)

[Graph: the Kaplan-Meier step function Ŝ(t) versus time (months), dropping from 1.0 to 0.9, 0.8, 0.6, 0.5, 0.2, and 0.1 at the death times 3.2, 5.5, 6.7, 7.9, 8.4, and 10.3, respectively.]
Ismor Fischer, 5/29/2012 8.2-4
[Graph: the Kaplan-Meier estimate Ŝ(t) for the censored data below, stepping down through 1.000, 0.900, 0.675, 0.270, and 0.135 at times 3.2, 6.7, 8.4, and 10.3 months; censored observations are marked with ×.]

Exercise: Prove algebraically that, assuming no censored observations (as in the preceding example), the Kaplan-Meier estimator can be written simply as Ŝ(t) = n_{i+1} / n for t ∈ [t_i, t_{i+1}), i = 0, 1, 2, … Hint: Use mathematical induction; recall that n_{i+1} = n_i − d_i.

In light of this, now assume that the data consists of censored observations as well, so that n_{i+1} = n_i − d_i − c_i.

Example (cont'd):

   Patient   t_i (months)     Interval [t_i, t_{i+1})   n_i = # at risk at t_i⁻   d_i = # deaths   c_i = # censored   1 − d_i/n_i    Ŝ(t)
      1          3.2          [0, 3.2)                          10                     0               0              1.00         1.000
      2          5.5*         [3.2, 6.7)                 10 − 0 − 0 = 10               1               1              0.90         0.900
      3          6.7          [6.7, 8.4)                 10 − 1 − 1 =  8               2               1              0.75         0.675
      4          6.7          [8.4, 10.3)                 8 − 2 − 1 =  5               3               0              0.40         0.270
      5          7.9*         [10.3, 12)                  5 − 3 − 0 =  2               1               0              0.50         0.135
      6          8.4          Study Ends                  2 − 1 − 0 =  1               0               0              1.00         0.135
      7          8.4
      8          8.4
      9         10.3
     10         alive              *censored

Exercise: What would the corresponding changes be to the Kaplan-Meier estimator Ŝ(t) if Patient 10 died at the very end of the study?
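These estimates can be reproduced in R with the survival package (included with the standard R distribution); a minimal sketch for the censored example above:

library(survival)

# Death/censoring times and status (1 = death, 0 = censored; patient 10 alive at 12 months)
time   <- c(3.2, 5.5, 6.7, 6.7, 7.9, 8.4, 8.4, 8.4, 10.3, 12.0)
status <- c(1,   0,   1,   1,   0,   1,   1,   1,   1,    0)

km <- survfit(Surv(time, status) ~ 1)
summary(km)        # survival estimates 0.900, 0.675, 0.270, 0.135 at the death times
plot(km, mark.time = TRUE, xlab = "Time (months)", ylab = "S(t)")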
Ismor Fischer, 5/29/2012 8.2-5
Hazard Functions

Suppose we have a survival function S(t) = P(T > t), where T = survival time, and some Δt > 0. We wish to calculate the conditional probability of death in the interval [t, t + Δt), given survival to time t:

   P(Die in [t, t + Δt) | Survive beyond t) = P(t ≤ T < t + Δt) / P(T > t) = [S(t) − S(t + Δt)] / S(t).

Therefore, dividing by Δt,

   P(t ≤ T < t + Δt | T > t) / Δt = [1 / S(t)] × [S(t) − S(t + Δt)] / Δt.

Now, take the limit of both sides as Δt → 0:

   h(t) = −S′(t) / S(t) = −d[ln S(t)] / dt,     equivalently     S(t) = e^(−∫_0^t h(u) du).

This is the hazard function (or hazard rate, failure rate), and roughly characterizes the instantaneous probability of dying at time t, in the above mathematical limiting sense. It is always ≥ 0 (Why? Hint: What signs are S(t) and S′(t), respectively?), but can be > 1, hence is not a true probability in a mathematically rigorous sense.

Exercise: Suppose two hazard functions are linearly combined to form a third hazard function: c1 h1(t) + c2 h2(t) = h3(t), for any constants c1, c2 ≥ 0. What is the relationship between their corresponding log-survival functions ln S1(t), ln S2(t), and ln S3(t)?

Its integral, ∫_0^t h(u) du, is the cumulative hazard rate, denoted H(t), and increases (since H′(t) = h(t) ≥ 0). Note also that H(t) = −ln S(t), and so S(t) = e^(−H(t)).

[Figure: a survival curve S(t), with the values S(t) and S(t + Δt) marked.]
Ismor Fischer, 5/29/2012 8.2-6

Examples: (Also see last page of §4.2!)

If the hazard function is constant for t ≥ 0, i.e., h(t) = λ > 0, then it follows that the survival function is S(t) = e^(−λt), i.e., the exponential model. Shown here is λ = 1.

More realistically perhaps, suppose the hazard takes the form of a more general power function, i.e., h(t) = λγ t^(γ−1), for scale parameter λ > 0 and shape parameter γ > 0, for t ≥ 0. Then S(t) = e^(−λ t^γ), i.e., the Weibull model, an extremely versatile and useful model with broad applications to many fields. The case λ = 1, γ = 2 is illustrated below.














Exercise: Suppose that, for argument's sake, a population is modeled by the decreasing hazard function h(t) = 1 / (t + c) for t ≥ 0, where c > 0 is some constant. Sketch the graph of the survival function S(t), and find the median survival time.
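For illustration, a short R sketch plotting the survival curves of the two models just described, with the parameter values used in the notes' figures:

# Exponential model: constant hazard h(t) = 1, so S(t) = exp(-t)
curve(exp(-x), from = 0, to = 4, xlab = "t", ylab = "S(t)")

# Weibull model with scale 1 and shape 2: h(t) = 2t, so S(t) = exp(-t^2)
curve(exp(-x^2), from = 0, to = 4, add = TRUE, lty = 2)
legend("topright", legend = c("exponential", "Weibull (shape 2)"), lty = 1:2)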

Ismor Fischer, 5/29/2012 8.3-1

8.3 Statistical Inference: Log-Rank Test

Suppose that we wish to compare the survival curves S1(t) and S2(t) of two groups, e.g., breast cancer patients with chemotherapy versus without.

Null Hypothesis H0:  S1(t) = S2(t) for all t     Survival probability is equal in both groups.

[Figures: the POPULATION survival curves S1(t) and S2(t), and the corresponding SAMPLE Kaplan-Meier estimates Ŝ1(t) and Ŝ2(t).]
Ismor Fischer, 5/29/2012 8.3-2

To conduct a formal test of the null hypothesis, we construct a 2 × 2 contingency table for each interval [t_i, t_{i+1}), where i = 0, 1, 2, …

                 Dead    Alive            Observed # deaths   Expected # deaths          Variance
   Group 1        a_i     b_i    R1i            a_i            E1i = R1i C1i / n_i       V_i = R1i R2i C1i C2i / [n_i² (n_i − 1)]
   Group 2        c_i     d_i    R2i            c_i            E2i = R2i C1i / n_i
                  C1i     C2i    n_i

Therefore, summing over all intervals i = 0, 1, 2, …, we obtain

   Observed total deaths:   Group 1: O1 = Σ a_i       Group 2: O2 = Σ c_i
   Expected total deaths:   Group 1: E1 = Σ E1i       Group 2: E2 = Σ E2i
   Total Variance:          V = Σ V_i

In effect, the contingency tables are combined in the same way as in any cohort study. In particular, an estimate of the summary odds ratio can be calculated via the general Mantel-Haenszel formula

   OR = Σ (a_i d_i / n_i) / Σ (b_i c_i / n_i)

(see §6.2.3), with an analogous interpretation in terms of group survival. The formal test for significance relies on the corresponding log-rank statistic:

   χ² = (O1 − E1)² / V  ~  χ²(1),

although a slightly less cumbersome alternative is the (approximate) test statistic

   χ² = (O1 − E1)² / E1 + (O2 − E2)² / E2  ~  χ²(1).






[Figure: illustration of two Kaplan-Meier survival curves Ŝ1(t) and Ŝ2(t) that are not significantly different from one another.]
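In R, the log-rank test is carried out by survdiff() in the survival package; a minimal sketch with hypothetical data for two groups:

library(survival)

# Hypothetical survival data for two treatment groups (status: 1 = death, 0 = censored)
time   <- c(3, 5, 7, 9, 12,  2, 4, 4, 6, 8)
status <- c(1, 1, 0, 1, 0,   1, 1, 1, 0, 1)
group  <- rep(c("Group 1", "Group 2"), each = 5)

fit <- survfit(Surv(time, status) ~ group)     # Kaplan-Meier curve for each group
plot(fit, lty = 1:2, xlab = "Time", ylab = "S(t)")
survdiff(Surv(time, status) ~ group)           # log-rank chi-squared statistic on 1 df, with p-value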

Ismor Fischer, 5/29/2012 8.4-1

8.4 Regression: Cox Proportional Hazards Model

Suppose we wish to model the hazard function h(t) for a population, in terms of explanatory variables or covariates X1, X2, X3, …, Xm. That is,

   h(t) = h(t; X1, X2, X3, …, Xm),

so that all the individuals corresponding to one set of covariate values have a different hazard function from all the individuals corresponding to some other set of covariate values.

Assume initially that h has the general form h(t) = h0(t) × C(X1, X2, X3, …, Xm).

Example: In a population of 50-year-old males, X1 = smoking status (0 = No, 1 = Yes), X2 = # pounds overweight, X3 = # hours of exercise per week. Consider

   h(t) = .02 t e^(X1 + 0.3 X2 − 0.5 X3).

If X1 = 0, X2 = 0, X3 = 0, then h0(t) = .02 t. This is the baseline hazard. (Therefore, the corresponding survival function is S0(t) = e^(−.01 t²). Why?)

If X1 = 1, X2 = 10 lbs, X3 = 2 hrs/wk, then h(t) = .02 t e^3 = .02 t (20.1) = .402 t. (Therefore, the corresponding survival function is S(t) = e^(−.201 t²). Why?)

Thus, the proportion of hazards h(t) / h0(t) = e^3 (= 20.1), i.e., constant for all time t.

[Figure: the baseline hazard h0(t) and the proportional hazard h(t) = 20.1 h0(t), plotted against t.]
Ismor Fischer, 5/29/2012 8.4-2
Furthermore, notice that this hazard function can be written as

   h(t) = .02 t (e^X1) (e^(0.3 X2)) (e^(−0.5 X3)).

Hence, with all other covariates being equal, we have the following properties.

If X1 is changed from 0 to 1, then the net effect is that of multiplying the hazard function by a constant factor of e^1 ≈ 2.72. Similarly,

If X2 is increased to X2 + 1, then the net effect is that of multiplying the hazard function by a constant factor of e^0.3 ≈ 1.35. And finally,

If X3 is increased to X3 + 1, then the net effect is that of multiplying the hazard function by a constant factor of e^(−0.5) ≈ 0.61. (Note that this is less than 1, i.e., beneficial to survival.)


In general, the hazard function given by the form

   h(t) = h0(t) e^(β1 X1 + β2 X2 + ⋯ + βm Xm),

where h0(t) is the baseline hazard function, is called the Cox Proportional Hazards Model, and can be rewritten as the equivalent linear regression problem:

   ln[ h(t) / h0(t) ] = β1 X1 + β2 X2 + ⋯ + βm Xm.

The constant proportions assumption is empirically verifiable. Once again, the
regression coefficients are computationally intensive, and best left to a computer.
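A minimal R sketch of fitting a Cox model with coxph() from the survival package (the data here are simulated for illustration, with covariates patterned on the example above):

library(survival)
set.seed(3)

# Hypothetical cohort: survival time, event indicator, and three covariates
n      <- 100
X1     <- rbinom(n, 1, 0.4)             # smoking status (0/1)
X2     <- round(runif(n, 0, 30))        # pounds overweight
X3     <- round(runif(n, 0, 5))         # hours of exercise per week
time   <- rexp(n, rate = 0.05 * exp(0.7*X1 + 0.03*X2 - 0.2*X3))
status <- rbinom(n, 1, 0.8)             # a portion of the observations are censored

fit <- coxph(Surv(time, status) ~ X1 + X2 + X3)
summary(fit)        # estimated coefficients beta_j, and hazard ratios exp(beta_j)
cox.zph(fit)        # diagnostic check of the constant-proportions (proportional hazards) assumption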

Comment: There are many practical extensions of the methods in this section, including techniques for hazards modeling when the constant proportions assumption is violated, when the covariates X1, X2, X3, …, Xm are time-dependent, i.e.,

   ln[ h(t) / h0(t) ] = β1 X1(t) + β2 X2(t) + ⋯ + βm Xm(t),

when patients continue to be recruited after the study begins, etc. Survival Analysis remains a very open area of active research.

Ismor Fischer, 7/18/2013 8.4-1

8.5 Problems

1. Displayed below are the survival times in months since diagnosis for 10 AIDS patients
suffering from concomitant esophageal candidiasis, an infection due to Candida yeast, and
cytomegalovirus, a herpes infection that can cause serious illness.

   Patient     t_i (months)
1 0.5*
2 1.0
3 1.0
4 1.0
5 2.0
6 5.0*
7 8.0*
8 9.0
9 10.0*
10 12.0*
*censored

(a) Construct the Kaplan-Meier product-limit estimator of the survival function S(t), and
sketch its graph.

(b) Calculate the estimated 1-month and 2-month survival probabilities, respectively.

(c) Redo part (a) with R, using survfit.


2. For any constants a > 1, b > 0, graph the hazard function h(t) = a − b / (t + b), for t ≥ 0. Find and graph the corresponding survival function S(t). What happens to each function as b → 0? As b → ∞?

Ismor Fischer, 7/18/2013 8.4-2

3. In a cancer research journal article, authors conduct a six-year trial involving a small sample of n = 5 patients who are on a certain aggressive treatment, and present the survival data in the form of a computer-generated table shown below, at left. (The last patient is alive at the end of the trial.)

   Patient   Survival time (mos.)        Time interval    n_i    c_i    d_i    1 − d_i/n_i    Ŝ(t)
     001          36.0
     002          48.0*
     003          60.0
     004          60.0
     005          72.0 (alive)
   *censored                             (right-hand columns to be completed in part (a))

(a) Using the Kaplan-Meier product-limit formula, complete the table of estimated survival probabilities shown at right.

(b) From part (a), sketch the Kaplan-Meier survival curve Ŝ(t) corresponding to this sample. Label all relevant features.


Ismor Fischer, 7/18/2013 8.4-3

Suppose that, from subsequently larger studies, it is determined that the true survival curve corresponding to this population can be modeled by the function

   S(t) = e^(−.00004 t^2.5),   for t ≥ 0,

as shown. Use this Weibull model to answer each of the following.












(c) Calculate the probability of surviving beyond three years.

(d) Compute the median survival time for this population.

(e) Determine the hazard function ( ) h t , and sketch its graph below.

(f) Calculate the hazard rate at three years.









Ismor Fischer, 7/18/2013 8.4-4
[Graphs: the hazard functions h1(t) and h2(t) described in Problem 4 below.]

4.




















(a) Suppose that a certain population of individuals has a constant hazard function h
1
(t) = 0.03
for all time t > 0, as shown in the first graph above. For the variable T = survival time
(years), determine the survival function S
1
(t), and sketch its graph on the set of axes below.

(b) Suppose that another population of individuals has a piecewise constant hazard function given by

      h2(t) = 0.02 for 0 ≤ t ≤ 5,   and   0.04 for t > 5,

as shown in the second graph above. For the variable T = survival time (years), determine the survival function S2(t), and sketch its graph on the same set of axes below.

(c) For each population, use the corresponding survival function S(t) = P(T > t) to calculate
each of the following. Show all work.
	                                    Population 1    Population 2
	P(T > 4)
	P(T > 5)
	P(T > 6)
	Odds of survival after 5 years
	Median survival time t_med,
	  i.e., when P(T > t_med) = 0.5

(d) Is there a finite time t* > 0 when the two populations have equal survival probabilities P(T > t*)?
If so, calculate its value, and the value of this common survival probability.


5. (Hint: See page 8.2-6 in the notes.) A population of children having a certain disease suffers
from a high but rapidly decreasing infant mortality rate during the first year of life, followed by
death due to random causes between the ages of one and six years old, and finally, steadily
increasing mortality as individuals approach adolescence and beyond. Suppose that the
associated hazard function h(t) is known to be well-modeled by a so-called "bathtub curve,"
whose definition and graph are given below.

	h(t) =  (3 - 2t)/20,  for 0 ≤ t < 1
	        1/20,         for 1 ≤ t < 6
	        t/120,        for t ≥ 6

	[Graph of h(t), showing the characteristic bathtub shape.]

(a) Find and use R to sketch the graph of the corresponding survival function

	S(t) = P(T > t) = e^(-H(t)),

where the cumulative hazard function is given by H(t) = ∫_0^t h(s) ds.

(b) Calculate each of the following.
	P(T > 1)      P(T > 6)      P(T > 12)      P(T > 6 | T > 1)      Median survival time

(c) From the cumulative distribution function F(t) = P(T ≤ t), find and use R to sketch the
graph of the corresponding density function f(t).

R tip: To graph a function f(x) in the interval [a, b], first define foo = function(x)(expression in terms
of x), then use the command plot(foo, from = a, to = b, ...) with optional graphical parameters
col = "red" (or another color name), lty = 1 (line type), lwd = 2 (line width), etc.; type help(par) for more details.
To add the graph of a second function g(x), say goo = function(x)(...), to an existing plot, type
plot(goo, from = b, to = c, ..., add = TRUE).
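For instance, a minimal illustration of this tip, using an arbitrary function (not the one in Problem 5):

	# Graph f(x) = exp(-x/10) on [0, 12], then overlay g(x) = x/20 on the same axes.
	foo <- function(x) exp(-x/10)
	goo <- function(x) x/20

	plot(foo, from = 0, to = 12, col = "blue", lwd = 2, xlab = "x", ylab = "f(x), g(x)")
	plot(goo, from = 0, to = 12, col = "red", lty = 2, add = TRUE)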


Appendix
A1. Basic Reviews
A2. Geometric Viewpoint
A3. Statistical Inference
A4. Regression Models
A5. Statistical Tables
A1. Basic Reviews







A1.1 Logarithms

A1.2 Permutations and Combinations


A1. Basic Reviews
Logarithms

What are they?

In a word, exponents.

The logarithm (base 10) of a specified positive number is the exponent to which the
base 10 needs to be raised, in order to obtain that specified positive number. In effect, it is
the reverse (or more correctly, inverse) process of raising 10 to an exponent.

Example: The logarithm (base 10) of 100 is equal to 2, because 10^2 = 100,

or, in shorthand notation, log_10 100 = 2.

Likewise,	log_10 10000 = 4, because 10^4 = 10000
	log_10 1000 = 3, because 10^3 = 1000
	log_10 100 = 2, because 10^2 = 100
	log_10 10 = 1, because 10^1 = 10
	log_10 1 = 0, because 10^0 = 1
	log_10 0.1 = -1, because 10^(-1) = 1/10^1 = 0.1
	log_10 0.01 = -2, because 10^(-2) = 1/10^2 = 0.01
	log_10 0.001 = -3, because 10^(-3) = 1/10^3 = 0.001

etc.

How do you take the logarithm of a specified number that is between powers of 10?

In the old days, this would be done with the aid of a lookup table or slide rule (for those of
us who are old enough to remember slide rules). Today, scientific calculators are
equipped with a button labeled log, log_10, or INV 10^x.

Examples: To five decimal places,

	log_10 3 = 0.47712, because (check this) 10^0.47712 = 3.
	log_10 5 = 0.69897, because (check this) 10^0.69897 = 5.
	log_10 9 = 0.95424, because (check this) 10^0.95424 = 9.
	log_10 15 = 1.17609, because (check this) 10^1.17609 = 15.

There are several relations we can observe here that extend to general properties of logarithms.

First, notice that the values for log_10 3 and log_10 5 add up to the value for log_10 15.
This is not an accident; it is a direct consequence of 3 × 5 = 15, together with the algebraic
law of exponents 10^s × 10^t = 10^(s + t), and the fact that logarithms are exponents by
definition. (Exercise: Fill in the details.) In general, we have

Property 1:    log_10 (AB) = log_10 A + log_10 B

that is, the sum of the logarithms of two numbers is equal to the logarithm of their product.
For example, taking A = 3 and B = 5 yields log_10 (15) = log_10 3 + log_10 5.

Another relation to notice from these examples is that the value for log_10 9 is exactly double
the value for log_10 3. Again, not a coincidence, but a direct consequence of 3^2 = 9, together
with the algebraic law of exponents (10^s)^t = 10^(s t), and the fact that logarithms are exponents
by definition. (Exercise: Fill in the details.) In general, we have

Property 2:    log_10 (A^B) = B log_10 A

that is, the logarithm of a number raised to a power is equal to the power times the logarithm of
the original number. For example, taking A = 3 and B = 2 yields log_10 (3^2) = 2 (log_10 3).
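A quick numerical check of Properties 1 and 2 in R (log10 is the base-10 logarithm):

	log10(3) + log10(5)   # 1.176091
	log10(15)             # 1.176091  (Property 1)
	2 * log10(3)          # 0.9542425
	log10(9)              # 0.9542425 (Property 2)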

There are other properties of logarithms, but these are the most important for our purposes. In
particular, we can combine these properties in the following way. Suppose that two variables
X and Y are related by the general form

	Y = α X^β

for some constants α and β. Then, taking log_10 of both sides,

	log_10 Y = log_10 (α X^β)

or, by Property 1,

	log_10 Y = log_10 α + log_10 (X^β)

and by Property 2,

	log_10 Y = log_10 α + β log_10 X.

Relabeling V = log_10 Y, U = log_10 X, β_0 = log_10 α, and β_1 = β, this reads V = β_0 + β_1 U.

In other words, if there exists a power law relation between two variables X and Y, then there
exists a simple linear relation between their logarithms. For this reason, scatterplots of two
such related variables X and Y are often replotted on a log-log scale. More on this later...
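As a brief illustration in R (simulated data, with made-up constants α = 2 and β = 1.5), fitting a
straight line to the log-transformed values recovers the exponent:

	set.seed(1)
	X <- seq(1, 100, length.out = 50)
	Y <- 2 * X^1.5 * exp(rnorm(50, sd = 0.05))   # power law with slight multiplicative noise

	plot(log10(X), log10(Y))                     # roughly a straight line
	fit <- lm(log10(Y) ~ log10(X))
	coef(fit)    # intercept near log10(2) = 0.301, slope near 1.5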

Additional comments:

log_10 is an operation on positive numbers; you must have the logarithm of something.
(This is analogous to square roots; you must have the square root of something in order to
have a value. The disembodied symbol √ is meaningless without a number inside;
similarly with log_10.)

There is nothing special about using base 10. In principle, we
could use any positive base b (provided b ≠ 1, which causes a
problem). Popular choices are b = 10 (resulting in the so-called
common logarithms above), b = 2 (sometimes denoted by lg),
and finally, b = 2.71828... (resulting in natural logarithms, denoted
by ln). This last peculiar choice is sometimes referred to as e
and is known as Euler's constant. (Leonhard Euler, 1707-1783, pronounced
"oiler," was a Swiss mathematician. This constant e arises in a
variety of applications, including the formula for the density function
of a normal distribution, described in a previous lecture.) There is a
special formula for converting logarithms (using any base b) back to
common logarithms (i.e., base 10), for calculator use. For any
positive number a, and base b as described above,

	log_b a = log_10 a / log_10 b

Logarithms are particularly useful in calculating physical processes that grow or decay
exponentially. For example, suppose that at time t = 0, we have N = 1 cell in a culture,
and that it continually divides in two in such a way that the entire population doubles its
size every hour. At the end of t = 1 hour, there are N = 2 cells; at time t = 2 hours, there
are N = 2^2 = 4 cells; at time t = 3 hours, there are N = 2^3 = 8 cells, etc. Clearly, at time t,
there will be N = 2^t cells in culture (exponential growth). Question: At what time t will
there be 500000 (half a million) cells in the culture? The solution can be written as
t = log_2 500000, which can be rewritten via the change of base formula above (for
calculator use) as t = log_10 500000 / log_10 2 = 5.69897 / 0.30103 = 18.93 hours, or
about 18 hours, 56 minutes. (Check: 2^18.93 = 499456.67, which represents an error of only
about 0.1% from 500000; the discrepancy is due to roundoff error.)
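The same calculation in R, as a check:

	log10(500000) / log10(2)   # 18.93157 hours
	log(500000, base = 2)      # the same value, computed directly
	2^18.93                    # approximately 499457, as noted above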

Other applications where logarithms are used include the radioactive isotope dating of
fossils and artifacts (exponential decay), determining the acidity or alkalinity of chemical
solutions (pH = -log_10 [H+], the "power of hydrogen"), and the Richter scale, a measure
of earthquake intensity as defined by the log_10 of the quake's seismic wave amplitude.
(Hence an earthquake of magnitude 6 is ten times more powerful than a magnitude 5
quake, which in turn is ten times more powerful than one of magnitude 4, etc.)





Supplement: What Is This Number Called e, Anyway?


The symbol e stands for Eulers constant, and is a fundamental mathematical constant
(like ), extremely important for various calculus applications. It is usually defined as

	e = lim (1 + 1/n)^n   as n → ∞.

Exercise: Evaluate this expression for n = 1, 10, 100, 1000, ..., 10^6.
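In R, for instance:

	n <- 10^(0:6)
	data.frame(n = n, value = (1 + 1/n)^n)   # the values increase toward 2.718282...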

It can be shown, via rigorous mathematical proof, that the limiting value formally exists,
and converges to the value 2.718281828459... Another common expression for e is the
infinite series

	e = 1 + 1/1! + 1/2! + 1/3! + ... + 1/n! + ...

Exercise: Add a few terms of this series. How do the convergence rates of the two
expressions compare?

The reason for its importance: Of all possible bases b, it is this constant e = 2.71828... that
has the most natural calculus properties. Specifically, if f(x) = b^x, then it can be
mathematically proved that its derivative is f'(x) = b^x (ln b). (Remember that ln b = log_e b.)
For example, the function f(x) = 10^x has as its derivative f'(x) = 10^x (ln 10) = 10^x (2.3026);
see Figure 1. The constant multiple of 2.3026, though necessary, is something of a
nuisance. On the other hand, if b = e, that is, if f(x) = e^x, then f'(x) = e^x (ln e) = e^x (1) = e^x,
i.e., itself! See Figure 2. This property makes calculations involving base e much easier.

	[Figure 1. y = 10^x, with tangent slopes m_tan = 10^x ln 10 shown at several points.
	 Figure 2. y = e^x, with tangent slopes m_tan = e^x shown at the corresponding points.]
A1. Basic Reviews
PERMUTATIONS and COMBINATIONS...
or HOW TO COUNT


Question 1: Suppose we wish to arrange n =5 people {a, b, c, d, e}, standing side by side, for
a portrait. How many such distinct portraits (permutations) are possible?

a b c d e
Example:





Solution: There are 5 possible choices for which person stands in the first position (either a, b, c,
d, or e). For each of these five possibilities, there are 4 possible choices left for who is in the next
position. For each of these four possibilities, there are 3 possible choices left for the next position,
and so on. Therefore, there are 5 × 4 × 3 × 2 × 1 = 120 distinct permutations. See Table 1.

This number, 5 × 4 × 3 × 2 × 1 (or equivalently, 1 × 2 × 3 × 4 × 5), is denoted by the symbol 5!
and read "5 factorial", so we can write the answer succinctly as 5! = 120.

In general,

FACT 1: The number of distinct PERMUTATIONS of n objects is "n factorial", denoted by
	n! = 1 × 2 × 3 × ... × n, or equivalently,
	   = n × (n - 1) × (n - 2) × ... × 2 × 1.

Examples:	6! = 6 × 5 × 4 × 3 × 2 × 1
		   = 6 × 5!
		   = 6 × 120 (by previous calculation)
		   = 720

	3! = 3 × 2 × 1 = 6
	2! = 2 × 1 = 2
	1! = 1
	0! = 1, BY CONVENTION (It may not be obvious why, but there are good
	mathematical reasons for it.)

Here, every different ordering counts as a distinct permutation. For instance, the ordering
(a,b,c,d,e) is distinct from (c,e,a,d,b), etc.
Question 2: Now suppose we start with the same n =5 people {a, b, c, d, e}, but we wish to
make portraits of only k = 3 of them at a time. How many such distinct portraits are possible?

a b c
Example:





Solution: By using exactly the same reasoning as before, there are 5 × 4 × 3 = 60 permutations.

See Table 2 for the explicit list!

Note that this is technically NOT considered a factorial (since we don't go all the way down to 1),
but we can express it as a ratio of factorials:

	5 × 4 × 3 = [5 × 4 × 3 × (2 × 1)] / (2 × 1) = 5! / 2! .

In general,

FACT 2: The number of distinct PERMUTATIONS of n objects, taken k at a time, is given by the ratio

	n! / (n - k)! = n × (n - 1) × (n - 2) × ... × (n - k + 1).


Question 3: Finally suppose that instead of portraits (permutations), we wish to form
committees (combinations) of k =3 people from the original n =5. How many such distinct
committees are possible?


Example:












c
b
a
Again, as above, every different
ordering counts as a distinct
permutation. For instance, the
ordering (a,b,c) is distinct from
(c,a,b), etc.
Now, every different ordering does
NOT count as a distinct combination.
For instance, the committee {a,b,c} is
the same as the committee {c,a,b}, etc.

Solution: This time the reasoning is a little subtler. From the previous calculation, we know that

	# of permutations of k = 3 from n = 5 is equal to 5! / 2! = 60.

But now, all the ordered permutations of any three people (and there are 3! = 6 of them, by FACT 1)
will collapse into one single unordered combination, e.g., {a, b, c}, as illustrated. So...

	# of combinations of k = 3 from n = 5 is equal to 5! / 2!, divided by 3!, i.e., 60 / 6 = 10.

See Table 3 for the explicit list!

This number, 5! / (3! 2!), is given the compact notation C(5, 3), read "5 choose 3", and corresponds to the
number of ways of selecting 3 objects from 5 objects, regardless of their order. Hence C(5, 3) = 10.

In general,

FACT 3: The number of distinct COMBINATIONS of n objects, taken k at a time, is given by the ratio

	n! / [k! (n - k)!] = [n × (n - 1) × (n - 2) × ... × (n - k + 1)] / k! .

This quantity is usually written as C(n, k) (the binomial coefficient, displayed in many texts as n over k
inside large parentheses), and read "n choose k".

Examples:

	C(5, 3) = 5! / (3! 2!) = 10, just done. Note that this is also equal to C(5, 2) = 5! / (2! 3!) = 10.

	C(8, 2) = 8! / (2! 6!) = (8 × 7 × 6!) / (2! 6!) = (8 × 7) / 2 = 28. Note that this is equal to
	C(8, 6) = 8! / (6! 2!) = 28.

	C(15, 1) = 15! / (1! 14!) = (15 × 14!) / (1! 14!) = 15. Note that this is equal to C(15, 14) = 15. Why?

	C(7, 7) = 7! / (7! 0!) = 1. (Recall that 0! = 1.) Note that this is equal to C(7, 0) = 1. Why?

Observe that it is neither necessary nor advisable to compute the factorials of large numbers directly.
For instance, 8! = 40320, but by writing it instead as 8 × 7 × 6!, we can cancel 6!, leaving only 8 × 7
above. Likewise, 14! cancels out of 15!, leaving only 15, so we avoid having to compute 15!, etc.
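These counting functions are built into R; a quick check of the examples above:

	factorial(5)                  # 120  orderings of 5 objects
	factorial(5) / factorial(2)   # 60   permutations of 5 objects taken 3 at a time
	choose(5, 3)                  # 10   combinations ("5 choose 3")
	choose(8, 2)                  # 28
	choose(15, 14)                # 15
	choose(7, 0)                  # 1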
Remark: C(n, k) is sometimes called a combinatorial symbol or binomial coefficient (in
connection with a fundamental mathematical result called the Binomial Theorem; you may also
recall the related Pascal's Triangle). The previous examples also show that binomial coefficients
possess a useful symmetry, namely, C(n, k) = C(n, n - k). For example, C(5, 3) = 5! / (3! 2!), but this
is clearly the same as C(5, 2) = 5! / (2! 3!). In other words, the number of ways of choosing 3-person
committees from 5 people is equal to the number of ways of choosing 2-person committees from 5 people.
A quick way to see this without any calculating is through the insight that every choice of a 3-
person committee from a collection of 5 people leaves behind a 2-person committee, so the total
number of both types of committee must be equal (10).

Exercise: List all the ways of choosing 2 objects from 5, say {a, b, c, d, e}, and check these
claims explicitly. That is, match each pair with its complementary triple in the list of Table 3.



A Simple Combinatorial Application

Suppose you toss a coin n =5 times in a row. How many ways can you end up with k =3 heads?

Solution: The answer can be obtained by calculating the number of ways of rearranging 3 objects
among 5; it only remains to determine whether we need to use permutations or combinations.
Suppose, for example, that the 3 heads occur in the first three tosses, say a, b, and c, as shown
below. Clearly, rearranging these three letters in a different order would not result in a different
outcome. Therefore, different orderings of the letters a, b, and c should not count as distinct
permutations, and likewise for any other choice of three letters among {a, b, c, d, e}. Hence, there
are C(5, 3) = 10 ways of obtaining k = 3 heads in n = 5 independent successive tosses.
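As a quick R cross-check (not part of the original exercise), connecting this count with the binomial
probability model covered earlier:

	choose(5, 3)                      # 10 ways to place 3 heads among 5 tosses
	dbinom(3, size = 5, prob = 0.5)   # 10 * (1/2)^5 = 0.3125, the probability of exactly 3 heads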

Exercise: Let H denote heads, and T denote tails. Using these symbols, construct the explicit
list of 10 combinations. (Suggestion: Arrange this list of H/T sequences in alphabetical order.
You should see that in each case, the three H positions match up exactly with each ordered triple in
the list of Table 3. Why?)


a b c d e

Table 1 Permutations of {a, b, c, d, e}

These are the 5! =120 ways of arranging 5 objects, in such a way that all the different orders count
as being distinct.


a b c d e b a c d e c a b d e d a b c e e a b c d
a b c e d b a c e d c a b e d d a b e c e a b d c
a b d c e b a d c e c a d b e d a c b e e a c b d
a b d e c b a d e c c a d e b d a c e b e a c d b
a b e c d b a e c d c a e b d d a e b c e a d b c
a b e d c b a e d c c a e d b d a e c b e a d c b
a c b d e b c a d e c b a d e d b a c e e b a c d
a c b e d b c a e d c b a e d d b a e c e b a d c
a c d b e b c d a e c b d a e d b c a e e b c a d
a c d e b b c d e a c b d e a d b c e a e b c d a
a c e b d b c e a d c b e a d d b e a c e b d a c
a c e d b b c e d a c b e d a d b e c a e b d c a
a d b c e b d a c e c d a b e d c a b e e c a b d
a d b e c b d a e c c d a e b d c a e b e c a d b
a d c b e b d c a e c d b a e d c b a e e c b a d
a d c e b b d c e a c d b e a d c b e a e c b d a
a d e b c b d e a c c d e a d d c e a b e c d a b
a d e c b b d e c a c d e d a d c e b a e c d b a
a e b c d b e a c d c e a b d d e a b c e d a b c
a e b d c b e a d c c e a d b d e a c b e d a c b
a e c b d b e c a d c e b a d d e b a c e d b a c
a e c d b b e c d a c e b d a d e b c a e d b c a
a e d b c b e d a c c e d a b d e c a b e d c a b
a e d c b b e d c a c e d b a d e c b a e d c b a
Table 2 Permutations of {a, b, c, d, e}, taken 3 at a time

These are the 5! / 2! = 60 ways of arranging 3 objects among 5, in such a way that different orders of
any triple count as being distinct, e.g., the 3! = 6 permutations of (a, b, c), shown below.


a b c b a c c a b d a b e a b
a b d b a d c a d d a c e a c
a b e b a e c a e d a e e a d
a c b b c a c b a d b a e b a
a c d b c d c b d d b c e b c
a c e b c e c b e d b e e b d
a d b b d a c d a d c a e c a
a d c b d c c d b d c b e c b
a d e b d e c d e d c e e c d
a e b b e a c e a d e a e d a
a e c b e c c e b d e b e d b
a e d b e d c e d d e c e d c




Table 3 Combinations of {a, b, c, d, e}, taken 3 at a time

If different orders of the same triple are not counted as being distinct, then their six permutations are
lumped as one, e.g., {a, b, c}. Therefore, the total number of combinations is 1/6 of the original 60,
or 10. Notationally, we express this as 1/3! of the original 5! / 2!, i.e., 5! / (3! 2!), or more neatly, as
C(5, 3). These C(5, 3) = 10 combinations are listed below.


a b c
a b d
a b e
a c d
a c e
a d e
b c d
b c e
b d e
c d e
A2. Geometric Viewpoint







A2.1 Mean and Variance

A2.2 ANOVA

A2.3 Least Squares Approximation


A2. Statistics from a Geometric Viewpoint

Mean and Variance

Many of the concepts we will encounter can be unified in a very elegant geometric way, which
yields additional insight and understanding. If you relate to visual ideas, then you might benefit
from reading this. First, recall some basic facts from elementary vector analysis:

For any two column vectors v = (v_1, v_2, ..., v_n)^T and w = (w_1, w_2, ..., w_n)^T in R^n, the standard
Euclidean dot product v · w is defined as v^T w = Σ_{i=1}^{n} v_i w_i, hence is a scalar. Technically, the
dot product is a special case of a more general mathematical object known as an inner product,
denoted by ⟨v, w⟩, and these notations are often used interchangeably. The length, or norm, of a
vector v can therefore be characterized as ||v|| = √⟨v, v⟩ = √(Σ_{i=1}^{n} v_i²), and the included angle θ
between two vectors v and w can be calculated via the formula

	cos θ = ⟨v, w⟩ / (||v|| ||w||),   0 ≤ θ ≤ π.

From this relation, it is easily seen that two vectors v and w are orthogonal (i.e., θ = π/2), written
v ⊥ w, if and only if their dot product is equal to zero, i.e., ⟨v, w⟩ = 0.

Now suppose we have n random sample observations {x_1, x_2, x_3, ..., x_n}, with mean x̄. As shown
below, let x be the vector consisting of these n data values, and x̄ be the vector composed solely of
x̄. Note that x̄ is simply a scalar multiple of the vector 1 = (1, 1, 1, ..., 1)^T. Finally, let x - x̄ be
the vector difference; therefore its components are the individual deviations between the
observations and the overall mean. (It's useful to think of x̄ as a sample taken from an "ideal"
population that responds exactly the same way to some treatment, hence there is no variation; x is
the sample of actual responses, and x - x̄ measures the "error" between them.)

	x = (x_1, x_2, x_3, ..., x_n)^T      x̄ = (x̄, x̄, x̄, ..., x̄)^T      x - x̄ = (x_1 - x̄, x_2 - x̄, x_3 - x̄, ..., x_n - x̄)^T


Recall that the sum of the individual deviations is equal to zero, i.e., Σ_{i=1}^{n} (x_i - x̄) = 0, or in vector
notation, the dot product 1 · (x - x̄) = 0. Therefore, 1 ⊥ (x - x̄), and the three vectors above form
a right triangle.

Let the scalars a, b, and c represent the lengths of the corresponding vectors, respectively. That is,

	a = ||x - x̄|| = √( Σ_{i=1}^{n} (x_i - x̄)² ),   b = ||x̄|| = √( Σ_{i=1}^{n} x̄² ) = √( n x̄² ),   c = ||x|| = √( Σ_{i=1}^{n} x_i² ).

Therefore, a², b², and c² are all sums of squares, denoted by

	SS_Error = a² = Σ_{i=1}^{n} (x_i - x̄)²,
	SS_Treatment = b² = n x̄²   [which, via algebra, = (1/n)(Σ_{i=1}^{n} x_i)²],
	SS_Total = c² = Σ_{i=1}^{n} x_i².

Now via the Pythagorean Theorem, we have c² = b² + a², referred to in this context as a
partitioning of sums of squares:

	SS_Total = SS_Treatment + SS_Error.

Note also that, by definition, the sample variance is

	s² = SS_Error / (n - 1),

and that combining both of these boxed equations yields the equivalent alternate formula

	s² = [ SS_Total - SS_Treatment ] / (n - 1),

i.e.,

	s² = (1/(n - 1)) [ Σ_{i=1}^{n} x_i²  -  (1/n)(Σ_{i=1}^{n} x_i)² ].     IMPORTANT FORMULA!

This formula is computationally more convenient, because it only requires one subtraction rather
than n; however, it is less enlightening.

Exercise: Verify that SS_Total = SS_Treatment + SS_Error for the sample data values {3, 8, 17, 20, 32},
and calculate s² both ways, showing equality. Be especially careful about roundoff error!
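A quick numerical check of these identities in R, using a small made-up sample (so that the exercise
above is left to the reader):

	x <- c(2, 4, 9, 13)
	n <- length(x)

	SS.Total     <- sum(x^2)               # 270
	SS.Treatment <- n * mean(x)^2          # 196
	SS.Error     <- sum((x - mean(x))^2)   # 74
	SS.Total - (SS.Treatment + SS.Error)   # 0, as claimed

	SS.Error / (n - 1)                        # 24.667
	(sum(x^2) - (1/n) * sum(x)^2) / (n - 1)   # 24.667, the alternate formula
	var(x)                                    # R's built-in sample variance, same value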

A2. Statistics from a Geometric Viewpoint

Analysis of Variance

The technique of multiple comparison of treatment means via ANOVA can be viewed very elegantly,
from a purely geometric perspective. Again, recall some basic facts from elementary vector analysis:

For any two column vectors v = (v_1, v_2, ..., v_n)^T and w = (w_1, w_2, ..., w_n)^T in R^n, the standard Euclidean
dot product v · w is defined as v^T w = Σ_{i=1}^{n} v_i w_i, hence is a scalar. Technically, the dot product is a
special case of a more general mathematical object known as an inner product, denoted by ⟨v, w⟩, and
these notations are often used interchangeably. The length, or norm, of a vector v can therefore be
characterized as ||v|| = √⟨v, v⟩ = √(Σ_{i=1}^{n} v_i²), and the included angle θ between two vectors v and w can be
calculated via the formula

	cos θ = ⟨v, w⟩ / (||v|| ||w||),   0 ≤ θ ≤ π.

From this relation, it is easily seen that two vectors v and w are orthogonal (i.e., θ = π/2), written v ⊥ w,
if and only if their dot product is equal to zero, i.e., ⟨v, w⟩ = 0.

Now suppose we have sample data from k treatment groups of sizes n_1, n_2, ..., n_k, respectively, which we
organize in vector form as follows:

	Treatment 1:  y_1 = (y_11, y_12, y_13, ..., y_{1 n_1})^T
	Treatment 2:  y_2 = (y_21, y_22, y_23, ..., y_{2 n_2})^T
	  . . . . .
	Treatment k:  y_k = (y_k1, y_k2, y_k3, ..., y_{k n_k})^T

	Group Means:      ȳ_1, ȳ_2, ..., ȳ_k
	Group Variances:  s_1², s_2², ..., s_k²

	Grand Mean:  ȳ = (n_1 ȳ_1 + n_2 ȳ_2 + ... + n_k ȳ_k) / n,
	where n = n_1 + n_2 + ... + n_k is the combined sample size.

	Pooled Variance:  s²_within groups = [ (n_1 - 1) s_1² + (n_2 - 1) s_2² + ... + (n_k - 1) s_k² ] / (n - k)

Now, for Treatment column i = 1, 2, ..., k and row j = 1, 2, ..., n_i, it is clear from simple algebra that

	y_ij - ȳ = (ȳ_i - ȳ) + (y_ij - ȳ_i).

Therefore, for each Treatment i = 1, 2, ..., k, we have the n_i-dimensional column vector identity

	y_i - ȳ 1 = (ȳ_i - ȳ) 1 + (y_i - ȳ_i 1),

where the n_i-dimensional vector 1 = (1, 1, ..., 1)^T. Hence, vertically stacking these k columns produces a
vector identity in R^n, with stacked columns (y_1 - ȳ 1, ..., y_k - ȳ 1), ((ȳ_1 - ȳ) 1, ..., (ȳ_k - ȳ) 1), and
(y_1 - ȳ_1 1, ..., y_k - ȳ_k 1), or, more succinctly,

	u = v + w.

But the two vectors v and w are orthogonal, since they have a zero dot product:

	v^T w = Σ_{i=1}^{k} (ȳ_i - ȳ) 1^T (y_i - ȳ_i 1) = Σ_{i=1}^{k} (ȳ_i - ȳ) Σ_{j=1}^{n_i} (y_ij - ȳ_i) = 0,

because the inner sum is the sum of the deviations of the y_ij values in Treatment i from their group mean ȳ_i.
Therefore, the three vectors u, v and w form a right triangle, as shown.

	[Figure: right triangle formed by the Total vector u, the Treatment vector v, and the Error vector w.]

So by the Pythagorean Theorem,

	||u||² = ||v||² + ||w||²

or, in statistical notation,

	SS_Total = SS_Trt + SS_Error

where

	SS_Total = ||u||² = Σ_{i=1}^{k} ||y_i - ȳ 1||² = Σ_{i=1}^{k} Σ_{j=1}^{n_i} (y_ij - ȳ)² = Σ_{all i,j} (y_ij - ȳ)²
		(the sum of the squared deviations of each observation from the grand mean),

	SS_Trt = ||v||² = Σ_{i=1}^{k} ||(ȳ_i - ȳ) 1||² = Σ_{i=1}^{k} Σ_{j=1}^{n_i} (ȳ_i - ȳ)² = Σ_{i=1}^{k} n_i (ȳ_i - ȳ)²
		(the sum of the squared deviations of each group mean from the grand mean),

	SS_Error = ||w||² = Σ_{i=1}^{k} ||y_i - ȳ_i 1||² = Σ_{i=1}^{k} Σ_{j=1}^{n_i} (y_ij - ȳ_i)² = Σ_{i=1}^{k} (n_i - 1) s_i²
		(the sum of the squared deviations of each observation from its group mean).

The resulting ANOVA table for the null hypothesis H_0: μ_1 = μ_2 = ... = μ_k is given by:

	Source      df      SS                             MS                    F-statistic        p-value
	Treatment   k - 1   Σ_{i=1}^{k} n_i (ȳ_i - ȳ)²     s²_between groups     F ~ F_{k-1, n-k}   0 ≤ p ≤ 1
	Error       n - k   Σ_{i=1}^{k} (n_i - 1) s_i²     s²_within groups
	Total       n - 1   Σ_{all i,j} (y_ij - ȳ)²


One final note about multiple treatment comparisons... We may also express the problem via the
following equivalent formulation: For each Treatment column i = 1, 2, ..., k and row j = 1, 2, ..., n_i,
the (i, j)-th response y_ij differs from its true group mean μ_i by a random error amount ε_ij. At the
same time however, the true group mean μ_i itself differs from the true grand mean μ by a random
amount τ_i, appropriately called the i-th treatment effect. That is,

	y_ij = μ_i + ε_ij,	Null Hypothesis  H_0: μ_1 = μ_2 = ... = μ_k

i.e.,

	y_ij = μ + τ_i + ε_ij,	Null Hypothesis  H_0: τ_1 = τ_2 = ... = τ_k = 0,

where μ_i is estimated by ȳ_i, and μ is estimated by ȳ.

In words, this so-called model equation says that each individual response can be formulated as the sum
of the grand mean plus its group treatment effect (the two of these together sum to its group mean), and
an individual error term. The null hypothesis that all of the group means are equal to each other
translates to the equivalent null hypothesis that all of the group treatment effects are equal to zero.

This expression of the problem as "response = model + error" is extremely useful, and will appear
again, in the context of regression models.
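In R, the entire decomposition is produced by aov(); a minimal sketch with made-up data for k = 3
treatment groups (not an example from the notes):

	y     <- c(23, 25, 21, 30, 32, 29, 27, 26, 28)
	group <- factor(rep(c("A", "B", "C"), each = 3))

	fit <- aov(y ~ group)
	summary(fit)    # df, SS, MS, the F-statistic, and the p-value, as in the ANOVA table above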
A2. Statistics from a Geometric Viewpoint

Least Squares Approximation

The concepts of linear correlation and least squares regression can be viewed very elegantly, from a pure
geometric perspective. Again, recall some basic background facts from elementary vector analysis:

For any two column vectors v = (v_1, v_2, ..., v_n)^T and w = (w_1, w_2, ..., w_n)^T in R^n, the standard Euclidean
dot product v · w is defined as v^T w = Σ_{i=1}^{n} v_i w_i, hence is a scalar. Technically, the dot product is a
special case of a more general mathematical object known as an inner product, denoted by ⟨v, w⟩, and
these notations are often used interchangeably. The length, or norm, of a vector v can therefore be
characterized as ||v|| = √⟨v, v⟩ = √(Σ_{i=1}^{n} v_i²), and the included angle θ between two vectors v and w can be
calculated via the formula

	cos θ = ⟨v, w⟩ / (||v|| ||w||),   0 ≤ θ ≤ π.

From this relation, it is easily seen that two vectors v and w are orthogonal (i.e., θ = π/2), written v ⊥ w,
if and only if their dot product is equal to zero, i.e., ⟨v, w⟩ = 0. More generally, the orthogonal projection
of the vector v onto the vector w is given by the formula shown in the figure below. (Think of it
informally as the "shadow" vector that v casts in the direction of w.)

Why are orthogonal projections so important? Suppose we are given any vector v (in a general inner
product space), and a plane (or more precisely, a linear subspace) not containing v. Of all the vectors u
in this plane, we wish to find a vector v̂ that comes closest to v, in some formal mathematical sense.
The Best Approximation Theorem asserts that, under such very general conditions, such a vector does
indeed exist, and is uniquely determined by the orthogonal projection of v onto this plane. Moreover, the
resulting error e = v - v̂ is smallest possible, with ||e||² = ||v||² - ||v̂||², via the Pythagorean Theorem.

	[Figure: proj_w v = (⟨v, w⟩ / ||w||²) w, a scalar multiple of w. Of all the vectors u in the plane, the
	one that minimizes the length ||v - u|| (thin dashed line) is the orthogonal projection v̂. Therefore,
	v̂ is the least squares approximation to v, yielding the least squares error ||e||² = ||v||² - ||v̂||².]

Now suppose we are given n data points (x_i, y_i), i = 1, 2, ..., n, obtained from two variables X and Y.
Define the following vectors in n-dimensional Euclidean space R^n:

	0 = (0, 0, 0, ..., 0)^T,    1 = (1, 1, 1, ..., 1)^T,
	x = (x_1, x_2, x_3, ..., x_n)^T,   x̄ = (x̄, x̄, x̄, ..., x̄)^T,  so that  x - x̄ = (x_1 - x̄, x_2 - x̄, x_3 - x̄, ..., x_n - x̄)^T,
	y = (y_1, y_2, y_3, ..., y_n)^T,   ȳ = (ȳ, ȳ, ȳ, ..., ȳ)^T,  so that  y - ȳ = (y_1 - ȳ, y_2 - ȳ, y_3 - ȳ, ..., y_n - ȳ)^T.

The centered data vectors x - x̄ and y - ȳ are crucial to our analysis. For observe that, by definition,

	||x - x̄||² = (n - 1) s_x²,    ||y - ȳ||² = (n - 1) s_y²,    and    ⟨x - x̄, y - ȳ⟩ = (n - 1) s_xy.

Now, note that ⟨1, x - x̄⟩ = Σ_{i=1}^{n} (x_i - x̄) = 0, therefore 1 ⊥ (x - x̄); likewise, 1 ⊥ (y - ȳ) as well.


See the figure below, showing the geometric relationships between the vector y - ȳ and the plane
spanned by the orthogonal basis vectors 1 and x - x̄.

	[Figure: the vector y - ȳ, its orthogonal projection ŷ - ȳ onto the plane spanned by 1 and x - x̄
	(with origin 0), and the error e = y - ŷ.]

Also, from a previous formula, we see that the general angle θ between these two vectors is given by

	cos θ = ⟨x - x̄, y - ȳ⟩ / (||x - x̄|| ||y - ȳ||) = (n - 1) s_xy / √[ (n - 1) s_x² (n - 1) s_y² ] = s_xy / (s_x s_y) = r,

i.e., the sample linear correlation coefficient! Therefore, this ratio r measures the cosine of the angle
between the vectors x - x̄ and y - ȳ, and hence is always between -1 and +1. But what is its exact
connection with the original vectors x and y?

IF the vectors x and y are exactly linearly correlated, then by definition, it must hold that

	y = b_0 1 + b_1 x

for some constants b_0 and b_1, and conversely. A little elementary algebra (take the mean of both sides,
then subtract the two equations from one another) shows that this is equivalent to the statement

	y - ȳ = b_1 (x - x̄),   with   b_0 = ȳ - b_1 x̄.

That is, the vector y - ȳ is a scalar multiple of the vector x - x̄, and therefore must lie not only in the
plane, but along the line spanned by x - x̄ itself. If the scalar multiple b_1 > 0, then y - ȳ must point in
the same direction as x - x̄; hence r = cos 0 = +1, and the linear correlation is positive. If b_1 < 0, then
these two vectors point in opposite directions, hence r = cos π = -1, and the linear correlation is negative.
However, if these two vectors are orthogonal, then r = cos(π/2) = 0, and there is no linear correlation
between x and y.

More generally, if the original vectors x and y are not exactly linearly correlated (that is, -1 < r < +1),
then the vector y - ȳ does not lie in the plane. The unique vector ŷ - ȳ that does lie in the plane which
best approximates it in the least squares sense is its orthogonal projection onto the vector x - x̄,
computed by the formula given above:

	ŷ - ȳ = [ ⟨y - ȳ, x - x̄⟩ / ||x - x̄||² ] (x - x̄) = [ (n - 1) s_xy / ((n - 1) s_x²) ] (x - x̄),

i.e.,   Linear Model:   ŷ - ȳ = b_1 (x - x̄),   with   b_1 = s_xy / s_x².
Furthermore, via the Pythagorean Theorem,

	||y - ȳ||² = ||ŷ - ȳ||² + ||y - ŷ||²

or, in statistical notation,   SS_Total = SS_Reg + SS_Error.

Finally, from this, we also see that the ratio

	SS_Reg / SS_Total = ||ŷ - ȳ||² / ||y - ȳ||² = cos² θ,

i.e., the coefficient of determination is SS_Reg / SS_Total = r², where r is the correlation coefficient.
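A brief numerical illustration of these identities in R (the data are made up):

	x <- c(1, 3, 4, 6, 8, 9, 11, 14)
	y <- c(2, 4, 4, 5, 7, 8, 9, 12)

	cov(x, y) / var(x)            # b_1 = s_xy / s_x^2
	cov(x, y) / (sd(x) * sd(y))   # r = s_xy / (s_x s_y), same as cor(x, y)

	fit  <- lm(y ~ x)
	yhat <- fitted(fit)
	SS.Reg   <- sum((yhat - mean(y))^2)
	SS.Total <- sum((y - mean(y))^2)
	SS.Reg / SS.Total             # equals cor(x, y)^2, the coefficient of determination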

Exercise: Derive the previous formulas s_{x±y}² = s_x² ± 2 s_xy + s_y². (Hint: Use the Law of Cosines.)



Remark: In this analysis, we have seen how the familiar formulas of linear regression follow easily and
immediately from orthogonal approximation on vectors. With slightly more generality, interpreting
vectors abstractly as functions f(x), it is possible to develop the formulas that are used in Fourier series.
A3. Statistical Inference







A3.1 Mean, One Sample

A3.2 Means and Proportions, One and Two Samples

A3.3 General Parameters and FORMULA TABLES


A3. Statistical Inference

Population Mean μ of a Random Variable X,
with known standard deviation σ, and a random sample of size n ¹


Before selecting a random sample, the experimenter first decides on each of the following:

	Null Hypothesis  H_0: μ = μ_0  (the conjectured value of the true mean)

	Alternative Hypothesis ²  H_A: μ ≠ μ_0  (that is, either μ < μ_0 or μ > μ_0)

	Significance Level  α = P(Reject H_0 | H_0 true) = P(Type I error) = .05, usually; therefore,
	Confidence Level  1 - α = P(Accept H_0 | H_0 true) = .95, usually

and calculates each of these following:

	Standard Error  σ/√n, the standard deviation of X̄; this is then used to calculate

	Margin of Error  z_{α/2} σ/√n, where the critical value z_{α/2} is computed via its definition:
	Z ~ N(0, 1), P(-z_{α/2} < Z < z_{α/2}) = 1 - α, i.e., by tail-area symmetry,
	P(Z < -z_{α/2}) = P(Z > z_{α/2}) = α/2.  Note: If α = .05, then z_{.025} = 1.96.













	[Figure 1. The null sampling distribution of the sample mean, X̄ ~ N(μ_0, σ/√n), showing the
	acceptance region for H_0 (within one margin of error of μ_0), the rejection region for H_0, and
	an observed sample mean x̄ in the rejection region; note that p < α (tail areas p/2 on each side).]

¹ If σ is unknown, but n ≥ 30, then estimate σ by the sample standard deviation s.
  If n < 30, then use the t-distribution instead of the standard normal Z-distribution.

² The two-sided alternative is illustrated here. Some formulas may have to be modified
  slightly if a one-sided alternative is used instead. (Discussed later)


After selecting a random sample, the experimenter next calculates the statistic

	Sample Mean  x̄ = point estimate of μ

then calculates any or all of the following:

	(1 - α) × 100% Confidence Interval: the interval centered at x̄, such that P(μ inside) = 1 - α.

		C.I. = (x̄ - margin of error, x̄ + margin of error) = interval estimate of μ

		Decision Rule: At the (1 - α) × 100% confidence level,
		if μ_0 is contained in the C.I., then accept H_0;
		if μ_0 is not in the C.I., then reject H_0 in favor of H_A.

	(1 - α) × 100% Acceptance Region: the interval centered at μ_0, such that P(x̄ inside) = 1 - α.

		A.R. = (μ_0 - margin of error, μ_0 + margin of error)

		Decision Rule:
		if x̄ is in the acceptance region, then accept H_0;
		if x̄ is not in the acceptance region (i.e., is in the rejection region), then reject H_0 in
		favor of H_A.  SEE FIGURE 1!

	p-value: a measure of how significantly our sample mean differs from the null hypothesis.

		p = the probability of obtaining a random sample mean that is AS, or MORE, extreme
		than the value of x̄ actually obtained, assuming the null hypothesis H_0: μ = μ_0 is true
		  = P(obtaining a sample mean on either side of μ_0, as far away as, or farther than, x̄ is)
		  = P( Z ≤ -|x̄ - μ_0| / (σ/√n) ) + P( Z ≥ +|x̄ - μ_0| / (σ/√n) ),

		the left-sided area plus the right-sided area cut off by both x̄ and its symmetrically
		reflected value through μ_0. (NOTE: by symmetry, can multiply the amount of area in
		one tail by 2.)

		Decision Rule:
		if p < α, then reject H_0 in favor of H_A; the difference between x̄ and μ_0 is
		statistically significant.  SEE FIGURE 1!
		if p > α, then accept H_0; the difference between x̄ and μ_0 is not statistically significant.

For a one-sided hypothesis test, the preceding formulas must be modified. The decision to reject H_0 in
favor of H_A depends on the probability of a sample mean being either significantly larger, or
significantly smaller, than the value μ_0 (always following the direction of the alternative H_A), but not
both, as in a two-sided test. Previous remarks about σ and s, as well as z and t, still apply.

Hypotheses (Case 1)  H_0: μ ≤ μ_0 vs. H_A: μ > μ_0, right-sided alternative

	Confidence Interval = ( x̄ - z_α (σ/√n), +∞ )

	Acceptance Region = ( -∞, μ_0 + z_α (σ/√n) )

	p-value = P(obtaining a sample mean that is equal to, or larger than, x̄)
	        = P( Z ≥ (x̄ - μ_0) / (σ/√n) ), the right-sided area cut off by x̄

	Decision Rule:
	if p < α, then x̄ is in the rejection region for H_0; x̄ is significantly larger than μ_0;
	if p > α, then x̄ is in the acceptance region for H_0; x̄ is not significantly larger than μ_0.

	[Figure 2. Illustration of a sample mean x̄ in the right-sided rejection region; note that p < α.]

Hypotheses (Case 2)  H_0: μ ≥ μ_0 vs. H_A: μ < μ_0, left-sided alternative

	Confidence Interval = ( -∞, x̄ + z_α (σ/√n) )

	Acceptance Region = ( μ_0 - z_α (σ/√n), +∞ )

	p-value = P(obtaining a sample mean that is equal to, or smaller than, x̄)
	        = P( Z ≤ (x̄ - μ_0) / (σ/√n) ), the left-sided area cut off by x̄

	Decision Rule:
	if p < α, then x̄ is in the rejection region for H_0; x̄ is significantly smaller than μ_0;
	if p > α, then x̄ is in the acceptance region for H_0; x̄ is not significantly smaller than μ_0.

	[Figure 3. Illustration of a sample mean x̄ in the left-sided rejection region; note that p < α.]














Examples

Given: Assume that the random variable X = IQ score is normally distributed in a certain study
population, with standard deviation σ = 30.0, but with unknown mean μ.

Conjecture a null hypothesis H_0: μ = 100 vs. the (two-sided) alternative hypothesis H_A: μ ≠ 100.
(THEORY: the hypothesized mean value 100 is compared with EXPERIMENT: the mean x̄ of
random sample data.)

Question: Do we accept or reject H_0 at the 5% (i.e., α = .05) significance level, and how strong is
our decision, relative to this 5%?

Suppose statistical inference is to be based on random sample data of size n = 400 individuals.

	[Figure 4. Normal distribution of X = IQ score, N(100, 30), under the conjectured null
	hypothesis H_0: μ = 100.]

Procedure: The Decision Rule will depend on calculation of the following quantities. First,

	Margin of Error = Critical Value × Standard Error
	                = z_{α/2} × σ/√n
	                = 1.96 × 30/√400     (for α = .05, the standard normal distribution N(0, 1) gives
	                                      z_{.025} = 1.96, with tail areas .025 on each side and .95 between)
	                = 1.96 × 1.5
	                = 2.94

then,

	Acceptance Region for x̄: All values between 100 ± 2.94, i.e., (97.06, 102.94).

	[Figure 5. Null Distribution: the sampling distribution of X̄ under the null hypothesis H_0: μ = 100
	is N(100, 1.5); compare with Figure 4 above. The acceptance region (97.06, 102.94) contains .95
	of the area, with .025 in each tail beyond it.]

Sample #1: Suppose it is found that x̄ = 105 (or 95). As shown in Figure 5, this value lies
far inside the α = .05 rejection region for the null hypothesis H_0 (i.e., true mean μ = 100).

In particular, we can measure exactly how significantly our sample evidence differs from the null
hypothesis, by calculating the probability (that is, the area under the curve) that a random sample mean
X̄ will be as far away, or farther, from μ = 100 as x̄ = 105 is, on either side. Hence, this corresponds
to the combined total area contained in the tails to the left and right of 95 and 105 respectively, and it is
clear from Figure 5 that this value will be much smaller than the combined shaded area of .05 shown.
This can be checked by a formal computation:

	p-value = P(X̄ ≤ 95) + P(X̄ ≥ 105), by definition
	        = 2 P(X̄ ≤ 95), by symmetry
	        = 2 P(Z ≤ -3.33), since (95 - 100) / 1.5 = -3.33, to two places
	        = 2 × .0004, via tabulated entry for N(0, 1) tail areas
	        = .0008 << .05, a statistically significant difference

As observed, our p-value of .08% is much smaller than the accepted 5% probability of committing a
Type I error (i.e., the α = .05 significance level) initially specified. Therefore, as suggested above, this
sample evidence indicates a strong rejection of the null hypothesis at the .05 level.

As a final method of verifying this decision, we may also calculate the sample-based

	95% Confidence Interval for μ: All values between 105 ± 2.94, i.e., (102.06, 107.94).

By construction, this interval should contain the true value of the mean μ, with 95% confidence.
Because μ = 100 is clearly outside the interval, this shows a reasonably strong rejection at α = .05.

Similarly, we can experiment with means x̄ that result from other random samples.

Sample #2: Suppose it is now found that x̄ = 103 (or likewise, 97). This sample mean is
closer to the hypothesized mean μ = 100 than that of the previous sample, hence it is somewhat stronger
evidence in support of H_0. However, Figure 5 illustrates that 103 is only very slightly larger than the
rejection region endpoint 102.94, thus we technically have a borderline rejection of H_0 at α = .05.
In addition, we can see that the combined left and right tail areas will total only slightly less than the .05
significance level. Proceeding as above,

	p-value = P(X̄ ≤ 97) + P(X̄ ≥ 103), by definition
	        = 2 P(X̄ ≤ 97), by symmetry
	        = 2 P(Z ≤ -2), since (97 - 100) / 1.5 = -2
	        = 2 × .0228, via tabulated entry of N(0, 1) tail areas
	        = .0456 < .05, a borderline significant difference

Finally, additional insight may be gained via the

	95% Confidence Interval for μ: All values between 103 ± 2.94, i.e., (100.06, 105.94).

As before, this interval should contain the true value of the mean μ with 95% confidence, by definition.
Because μ = 100 is just outside the interval, this shows a borderline rejection at α = .05.

Sample #3: Suppose now x̄ = 101 (or 99). The difference between this sample mean and
the hypothesized mean μ = 100 is much less significant, hence this is quite strong evidence in support
of H_0. Figure 5 illustrates that 101 is clearly in the acceptance region of H_0 at α = .05. Furthermore,

	p-value = P(X̄ ≤ 99) + P(X̄ ≥ 101), by definition
	        = 2 P(X̄ ≤ 99), by symmetry
	        = 2 P(Z ≤ -0.67), since (99 - 100) / 1.5 = -0.67, to two places
	        = 2 × .2514, via tabulated entry of N(0, 1) tail areas
	        = .5028 >> .05, not a statistically significant difference

	95% Confidence Interval for μ: All values between 101 ± 2.94, i.e., (98.06, 103.94).

As before, this interval should contain the true value of μ with 95% confidence, by definition. Because
μ = 100 is clearly inside, this too indicates acceptance of H_0 at the α = .05 level.

Other Samples: As an exercise, show that if x̄ = 100.3, then p = .8414; if x̄ = 100.1, then
p = .9442, etc. From these examples, it is clear that the closer the random sample mean x̄ gets to the
hypothesized value of the true mean μ, the stronger the empirical evidence is for that hypothesis, and
the higher the p-value. (Of course, the maximum value of any probability is 1.)

Next suppose that, as before, X = IQ score is normally distributed, with σ = 30.0, and that statistical
inference for μ is to be based on random samples of size n = 400, at the α = .05 significance level. But
perhaps we now wish to test specifically for significantly higher than average IQ in our population,
by seeing if we can reject the null hypothesis H_0: μ ≤ 100, in favor of the (right-sided) alternative
hypothesis H_A: μ > 100, via sample data.

Proceeding as before (with the appropriate modifications), we have

	Margin of Error = Critical Value × Standard Error
	                = z_α × σ/√n
	                = 1.645 × 1.5     (for α = .05, the one-sided critical value from N(0, 1) is z_{.05} = 1.645)
	                = 2.4675

	[Figure 6. Null distribution N(100, 1.5): the one-sided acceptance region lies below
	100 + 2.4675 = 102.4675 and contains .95 of the area; the right tail beyond it contains .05.]

	Acceptance Region for x̄: All values below 100 + 2.4675, i.e., x̄ < 102.4675.

Samples: As in the first example, suppose that x̄ = 105, which is clearly in the rejection
region. The corresponding p-value is P(X̄ ≥ 105), i.e., the single right-tailed area only, or .0004,
exactly half the two-sided p-value calculated before. (Of course, this leads to an even stronger
rejection of H_0 at the α = .05 level than before.) Likewise, if, as in the second sample, x̄ = 103, the
corresponding p-value = .0228 < .05, a moderate rejection. The sample mean x̄ = 101 is in the
acceptance region, with a right-sided p-value = .2514 > .05. Clearly, x̄ = 100 corresponds to p = .5
exactly; x̄ = 99 corresponds to p = .7486 >> .05, and as sample means continue to decrease to the
left, the corresponding p-values continue to increase toward 1, as empirical evidence in support of the
null hypothesis H_0: μ ≤ 100 continues to grow stronger.

A3. Statistical Inference

Hypothesis Testing for One Mean μ

POPULATION: Assume the random variable X ~ N(μ, σ). ^(1a)

	Testing H_0: μ = μ_0 vs. H_A: μ ≠ μ_0

ONE SAMPLE: Test Statistic (with s replacing σ in the standard error σ/√n):

	(X̄ - μ_0) / (s/√n)  ~  Z ~ N(0, 1), if n ≥ 30;   t_{n-1}, if n < 30.

1a
Normality can be verified empirically by checking quantiles (such as 68%, 95%, 99.7%),
stemplot, normal scores plot, and/or Lilliefors Test. If the data turn out not to be normally
distributed, things might still be OK due to the Central Limit Theorem, provided n ≥ 30.
Otherwise, a transformation of the data can sometimes restore a normal distribution.

1b
When X
1
and X
2
are not close to being normally distributed (or more to the point, when their
difference X
1
X
2
is not), or not known, a common alternative approach in hypothesis testing
is to use a nonparametric test, such as a Wilcoxon Test. There are two types: the Rank
Sum Test (or Mann-Whitney Test) for independent samples, and the Signed Rank Test
for paired sample data. Both use test statistics based on an ordered ranking of the data, and are
free of distributional assumptions on the random variables.

² If the sample sizes are large, the test statistic follows a standard normal Z-distribution (via the
Central Limit Theorem), with standard error = √( σ_1²/n_1 + σ_2²/n_2 ). If the sample sizes are
small, the test statistic does not follow an exact t-distribution, as in the single sample case,
unless the two population variances σ_1² and σ_2² are equal. (Formally, this requires a separate
test of how significantly the sample statistic s_1²/s_2², which follows an F-distribution, differs
from 1. An informal rule of thumb is to accept equivariance if this ratio is between 0.25 and 4.
Other, formal tests, such as Levene's Test, can also be used.) In this case, the two samples
can be pooled together to increase the power of the t-test, and the common value of their equal
variances estimated. However, if the two variances cannot be assumed to be equal, then
approximate t-tests, such as Satterthwaite's Test, should be used. Alternatively, a
Wilcoxon Test is frequently used instead; see footnote 1b above.

Hypothesis Testing for Two Means μ_1 vs. μ_2

POPULATION: Random variable X defined on two groups (arms): assume X_1 ~ N(μ_1, σ_1), X_2 ~ N(μ_2, σ_2). ^(1a, 1b)

	Testing H_0: μ_1 - μ_2 = μ_0, the mean of X_1 - X_2  (Note: μ_0 = 0, frequently)

TWO SAMPLES

Independent samples ²

	n_1 ≥ 30, n_2 ≥ 30.  Test Statistic (σ_1², σ_2² replaced by s_1², s_2² in the standard error):

		Z = [ (X̄_1 - X̄_2) - μ_0 ] / √( s_1²/n_1 + s_2²/n_2 )  ~  N(0, 1)

	n_1 < 30, n_2 < 30, with σ_1² = σ_2².  Test Statistic (σ_1², σ_2² replaced by s_pooled² in the standard error):

		T = [ (X̄_1 - X̄_2) - μ_0 ] / √( s_pooled² (1/n_1 + 1/n_2) )  ~  t_df,   df = n_1 + n_2 - 2,

		where s_pooled² = [ (n_1 - 1) s_1² + (n_2 - 1) s_2² ] / df.

	n_1 < 30, n_2 < 30, with σ_1² ≠ σ_2².  Must use an approximate t-test, such as Satterthwaite's.

	Note that the Wilcoxon (= Mann-Whitney) Rank Sum Test may be used as an alternative.

Paired samples

	Since the data are naturally matched by design, the pairwise differences constitute a single
	collapsed sample. Therefore, apply the appropriate one-sample test to the random variable
	D = X_1 - X_2 (hence D̄ = X̄_1 - X̄_2), having mean μ = μ_1 - μ_2; s = the sample standard
	deviation of the D-values.

	Note that the Wilcoxon Signed Rank Test may be used as an alternative.
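In R, these comparisons are carried out with t.test() and wilcox.test(); a minimal sketch with
made-up data for two independent arms:

	x1 <- c(12, 15, 11, 14, 16, 13)
	x2 <- c(18, 17, 20, 16, 19, 21)

	t.test(x1, x2, var.equal = TRUE)   # pooled two-sample t-test (assumes equal variances)
	t.test(x1, x2)                     # Welch/Satterthwaite approximate t-test (the default)
	wilcox.test(x1, x2)                # nonparametric Wilcoxon rank sum alternative
	# For paired data, add paired = TRUE, e.g. t.test(x1, x2, paired = TRUE).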

Hypothesis Testing for One Proportion π

POPULATION: Binary random variable Y, with P(Y = 1) = π.

	Testing H_0: π = π_0 vs. H_A: π ≠ π_0

ONE SAMPLE: If n is large ³, then the standard error is √( π (1 - π) / n ), with a N(0, 1) distribution.

	For confidence intervals, replace π by its point estimate π̂ = X / n, where X = #(Y = 1) = # successes
	in the sample.

	For acceptance regions and p-values, replace π by π_0, i.e.,

	Test Statistic:  Z = ( π̂ - π_0 ) / √( π_0 (1 - π_0) / n )  ~  N(0, 1)

	If n is small, then the above approximation does not apply, and computations are performed
	directly on X, using the fact that it is binomially distributed. That is, X ~ Bin(n, π). Messy by hand...

³ In this context, "large" is somewhat subjective and open to interpretation. A typical criterion
is to require that the mean number of successes nπ, and the mean number of failures n(1 - π),
in the sample(s) be sufficiently large, say greater than or equal to 10 or 15. (Other,
less common, criteria are also used.)


Hypothesis Testing for Two Proportions π_1 vs. π_2

POPULATION: Binary random variable Y defined on two groups (arms), P(Y_1 = 1) = π_1, P(Y_2 = 1) = π_2.

	Testing H_0: π_1 - π_2 = 0 vs. H_A: π_1 - π_2 ≠ 0

TWO SAMPLES

Independent samples, large ³

	Standard error = √( π_1 (1 - π_1)/n_1 + π_2 (1 - π_2)/n_2 ), with a N(0, 1) distribution.

	For confidence intervals, replace π_1, π_2 by the point estimates π̂_1, π̂_2.

	For acceptance regions and p-values, replace π_1, π_2 by the pooled estimate of their common
	value under the null hypothesis, π̂_pooled = (X_1 + X_2) / (n_1 + n_2), i.e.,

	Test Statistic:  Z = [ (π̂_1 - π̂_2) - 0 ] / √( π̂_pooled (1 - π̂_pooled) (1/n_1 + 1/n_2) )  ~  N(0, 1)

	Alternatively, can use a Chi-squared (χ²) Test.

Paired samples, large: McNemar's Test (a matched form of the χ² Test).

Independent samples, small: Fisher's Exact Test (messy; based on the hypergeometric distribution of X).

Paired samples, small: Ad hoc techniques; not covered.
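In R, these tests correspond to prop.test(), chisq.test(), fisher.test(), and mcnemar.test(); a minimal
sketch with hypothetical counts:

	prop.test(x = c(30, 45), n = c(100, 120), correct = FALSE)       # two-sample test of proportions

	fisher.test(matrix(c(30, 70, 45, 75), nrow = 2, byrow = TRUE))   # small samples: exact test on the 2 x 2 table

	mcnemar.test(matrix(c(20, 5, 12, 33), nrow = 2))                 # paired binary data: 2 x 2 table of pairs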


A3. Statistical Inference

Hypothesis Testing for General Population Parameters

POPULATION:  Null Hypothesis  H_0: θ = θ_0

	(Here θ is a generic parameter of interest, e.g., μ, π, σ² in the one-sample case; μ_1 - μ_2,
	π_1 - π_2, σ_1²/σ_2² in the two-sample case, of a random variable X. θ_0 is a conjectured value
	of the parameter in the null hypothesis. In the two-sample case for means and proportions, this
	value is often chosen to be zero if, as in a clinical trial, we are attempting to detect any
	statistically significant difference between the two groups (at some predetermined significance
	level α). For the ratio of variances between two groups, this value is usually one, to test for
	equivariance.)

SAMPLE: Once a suitable random sample (or two or more, depending on the application) has been selected, the
observed data can be used to compute a point estimate θ̂ that approximates the parameter θ above.
For example, for single sample estimates, we take μ̂ = x̄, π̂ = p, σ̂² = s²; for two samples, take
μ̂_1 - μ̂_2 = x̄_1 - x̄_2, π̂_1 - π̂_2 = p_1 - p_2, σ̂_1²/σ̂_2² = s_1²/s_2². This sample-based statistic is then used to
test the null hypothesis in a procedure known as statistical inference. The fundamental question:
At some pre-determined significance level α, does the sample estimator θ̂ provide sufficient
experimental evidence to reject the null hypothesis that the parameter value is equal to θ_0, i.e., is there a
statistically significant difference between the two? If not, then this can be interpreted as having
evidence in support of the null hypothesis, and we can tentatively accept it, bearing further empirical
evidence; see THE BIG PICTURE. In order to arrive at the correct decision rule for the mean(s) and
proportion(s) [subtleties exist in the case of the variance(s)], we need to calculate the following object(s):

	Confidence Interval endpoints = θ̂ ± critical value × standard error   [the "margin of error"]
	(If θ_0 is inside, then accept the null hypothesis. If θ_0 is outside, then reject the null hypothesis.)

	Acceptance Region endpoints = θ_0 ± critical value × standard error
	(If θ̂ is inside, then accept the null hypothesis. If θ̂ is outside, then reject the null hypothesis.)

	Test Statistic = (θ̂ - θ_0) / standard error, which is used to calculate the p-value of the experiment.
	(If p-value > α, then accept the null hypothesis. If p-value < α, then reject the null hypothesis.)

The appropriate critical values and standard errors can be computed from the following tables,
assuming that the variable X is normally distributed. (Details can be found in previous notes.)
MARGIN OF ERROR = (Critical Value) × (Standard Error)

ONE SAMPLE

Mean μ ²
	Column 1 (Null Hypothesis): H_0: μ = μ_0
	Column 2 (Point Estimate): μ̂ = x̄ = Σ x_i / n
	Column 3 (Critical Value, 2-sided ¹): n ≥ 30: t_{n-1, α/2} or z_{α/2};  n < 30: t_{n-1, α/2} only
	Column 4 (Standard Error estimate ²): s / √n (any n)

Proportion π
	Column 1 (Null Hypothesis): H_0: π = π_0
	Column 2 (Point Estimate): π̂ (= p) = X / n, where X = # successes
	Column 3 (Critical Value, 2-sided ¹): n ≥ 30: z_{α/2} ~ N(0, 1);  n < 30: use X ~ Bin(n, π) (not explicitly covered)
	Column 4 (Standard Error estimate ²): n ≥ 30 (also n π̂ ≥ 15 and n(1 - π̂) ≥ 15):
		for Confidence Interval: √( π̂ (1 - π̂) / n );  for Acceptance Region and p-value: √( π_0 (1 - π_0) / n )

TWO INDEPENDENT SAMPLES  (for Two Paired Samples, see footnote 3)

Means μ_1, μ_2 ²
	Column 1 (Null Hypothesis): H_0: μ_1 - μ_2 = 0
	Column 2 (Point Estimate): x̄_1 - x̄_2
	Column 3 (Critical Value, 2-sided ¹): n_1, n_2 ≥ 30: t_{n_1+n_2-2, α/2} or z_{α/2};
		n_1, n_2 < 30: Is σ_1² = σ_2²? (informal check: 1/4 < s_1²/s_2² < 4?)
		Yes: t_{n_1+n_2-2, α/2};  No: Satterthwaite's Test
	Column 4 (Standard Error estimate ²): n_1, n_2 ≥ 30: √( s_1²/n_1 + s_2²/n_2 );
		n_1, n_2 < 30: √( s_pooled² (1/n_1 + 1/n_2) ),
		where s_pooled² = [ (n_1 - 1) s_1² + (n_2 - 1) s_2² ] / (n_1 + n_2 - 2)

Proportions π_1, π_2
	Column 1 (Null Hypothesis): H_0: π_1 - π_2 = 0
	Column 2 (Point Estimate): π̂_1 - π̂_2
	Column 3 (Critical Value, 2-sided ¹): n_1, n_2 ≥ 30: z_{α/2} (or use the Chi-squared Test);
		n_1, n_2 < 30: Fisher's Exact Test (not explicitly covered)
	Column 4 (Standard Error estimate ²): n_1, n_2 ≥ 30 (see criteria above):
		for Confidence Interval: √( π̂_1 (1 - π̂_1)/n_1 + π̂_2 (1 - π̂_2)/n_2 );
		for Acceptance Region and p-value: √( π̂_pooled (1 - π̂_pooled)(1/n_1 + 1/n_2) ),
		where π̂_pooled = (X_1 + X_2) / (n_1 + n_2)

k SAMPLES (k ≥ 2)   Null Hypothesis  H_0: θ_1 = θ_2 = ... = θ_k

	Means (H_0: μ_1 = μ_2 = ... = μ_k):  Independent samples: F-test (ANOVA);
		Dependent samples: Repeated Measures, Blocks (not covered)
	Proportions (H_0: π_1 = π_2 = ... = π_k):  Independent samples: Chi-squared Test;
		Dependent samples: other techniques (not covered)

¹ For 1-sided hypothesis tests, replace α/2 by α.
² For Mean(s): If normality is established, use the true standard error if σ is known (either σ/√n or
  √( σ_1²/n_1 + σ_2²/n_2 )) with the Z-distribution. If normality is not established, then use a
  transformation, or a nonparametric Wilcoxon Test on the median(s).
³ For Paired Means: Apply the appropriate one-sample test to the pairwise differences D = X - Y.
  For Paired Proportions: Apply McNemar's Test, a matched version of the 2 × 2 Chi-squared Test.

HOW TO USE THESE TABLES

The preceding page consists of three tables that are divided into general statistical inference formulas for hypothesis tests
of means and proportions, for one sample, two samples, and k ≥ 2 samples, respectively. The first two tables for 2-sided
Z- and t-tests can be used to calculate the margin of error = (critical value) × (standard error) for acceptance/rejection
regions and confidence intervals. Column 1 indicates the general form of the null hypothesis H₀ for the relevant
parameter value, Column 2 shows the form of the sample-based parameter estimate (a.k.a. statistic), Column 3 shows
the appropriate distribution and corresponding critical value, and Column 4 shows the corresponding standard error
estimate (if the exact standard error is unknown).

Pay close attention to the footnotes in the tables, and refer back to previous notes for details and examples!
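
As a concrete illustration, the following R sketch works through one row of the tables, the one-sample mean case, by hand; the data vector and conjectured mean here are hypothetical, and in practice t.test() carries out the same computations in a single call.

# Hypothetical sample data and conjectured mean (for illustration only)
x     <- c(23.1, 25.4, 22.8, 26.0, 24.3, 25.1, 23.7, 24.9)
mu0   <- 24                           # Column 1: null value in H0: mu = mu0
alpha <- 0.05

n    <- length(x)
xbar <- mean(x)                       # Column 2: point estimate
crit <- qt(1 - alpha/2, df = n - 1)   # Column 3: critical value (small sample)
se   <- sd(x) / sqrt(n)               # Column 4: estimated standard error

xbar + c(-1, 1) * crit * se           # 95% confidence limits
mu0  + c(-1, 1) * crit * se           # acceptance region for H0
t.score <- (xbar - mu0) / se          # test statistic
2 * pt(-abs(t.score), df = n - 1)     # two-sided p-value

t.test(x, mu = mu0)                   # same analysis in one call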



To calculate:                                              To reject H₀, ask:

  Confidence Limits:    Column 2 ± (Column 3)(Column 4)    Is Column 1 outside?
  Acceptance Region:    Column 1 ± (Column 3)(Column 4)    Is Column 2 outside?
  Test Statistic:       (Column 2 − Column 1) / Column 4   Is the p-value < α?
                        (Z-score for large samples, T-score for small samples)

Two-sided alternative    H₀: θ = θ₀  vs.  H_A: θ ≠ θ₀

  p-value = 2 P(Z > |Z-score|), or equivalently, 2 P(Z < −|Z-score|), for large samples
          = 2 P(T_df > |T-score|), or equivalently, 2 P(T_df < −|T-score|), for small samples

The p-value lies on the scale from 0 to 1: reject H₀ if it falls below the significance level α, and accept H₀ if it falls above it. Example: with α = .05, a rough strength-of-evidence scale is p ≈ .001, extremely significant; p ≈ .005, strongly significant; p ≈ .01, moderately significant; p ≈ .05, borderline significant; p ≈ .10 or larger, not significant.
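
In R, the two-sided p-values defined above can be computed directly from the observed score; the particular scores and degrees of freedom below are placeholders.

z <- 2.17                     # an observed Z-score (large-sample case)
2 * pnorm(-abs(z))            # = 2 P(Z < -|Z-score|) = 2 P(Z > |Z-score|)

t.score <- 2.17; df <- 13     # an observed T-score and its degrees of freedom
2 * pt(-abs(t.score), df)     # small-sample (t) version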


One-sided test*, Right-tailed alternative    H₀: θ ≤ θ₀  vs.  H_A: θ > θ₀

To calculate:                                              To reject H₀, ask:

  Confidence Interval:  Column 2 − (Column 3)(Column 4)    Is Column 1 outside?
  Acceptance Region:    Column 1 + (Column 3)(Column 4)    Is Column 2 outside?
  Test Statistic:       (Column 2 − Column 1) / Column 4   Is the p-value < α?
                        (Z-score for large samples, T-score for small samples)

  p-value = P(Z > Z-score), for large samples
          = P(T_df > T-score), for small samples

One-sided test*, Left-tailed alternative     H₀: θ ≥ θ₀  vs.  H_A: θ < θ₀

To calculate:                                              To reject H₀, ask:

  Confidence Interval:  Column 2 + (Column 3)(Column 4)    Is Column 1 outside?
  Acceptance Region:    Column 1 − (Column 3)(Column 4)    Is Column 2 outside?
  Test Statistic:       (Column 2 − Column 1) / Column 4   Is the p-value < α?
                        (Z-score for large samples, T-score for small samples)

  p-value = P(Z < Z-score), for large samples
          = P(T_df < T-score), for small samples

* The formulas in the tables are written for 2-sided tests only, and must be modified for 1-sided tests, by
changing α/2 to α. Also, recall that the p-value is always determined by the direction of the corresponding
alternative hypothesis (either < or > in a 1-sided test, both in a 2-sided test).
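
The same recipes apply to the two-sample rows of the tables. As a sketch, here is the two-sample Z-test for proportions in R, using the pooled standard error for the p-value; the counts are hypothetical.

X1 <- 42; n1 <- 100           # hypothetical successes / sample size, group 1
X2 <- 30; n2 <- 100           # hypothetical successes / sample size, group 2

p1 <- X1 / n1;  p2 <- X2 / n2
p.pooled <- (X1 + X2) / (n1 + n2)                # pooled estimate under H0: pi1 = pi2
se0 <- sqrt(p.pooled * (1 - p.pooled) * (1/n1 + 1/n2))
z   <- (p1 - p2) / se0                           # test statistic for H0: pi1 - pi2 = 0

2 * pnorm(-abs(z))                               # two-sided p-value
pnorm(z, lower.tail = FALSE)                     # right-tailed p-value (H_A: pi1 > pi2)

prop.test(c(X1, X2), c(n1, n2), correct = FALSE) # equivalent chi-squared test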



THE BIG PICTURE

STATISTICS AND THE SCIENTIFIC METHOD

If, over time, a particular null hypothesis is continually accepted (as in a statistical meta-analysis of
numerous studies, for example), then it may eventually become formally recognized as an established
scientific fact. When sufficiently many such interrelated facts are collected and the connections
between them understood in a coherently structured way, the resulting organized body of truths is often
referred to as a scientific theory such as the Theory of Relativity, the Theory of Plate Tectonics, or
the Theory of Natural Selection. It is the ultimate goal of a scientific theory to provide an objective
description of some aspect, or natural law, of the physical universe, such as the Law of Gravitation,
Laws of Thermodynamics, Mendel's Laws of Genetic Inheritance, etc.


[Image credit: http://www.nasa.gov/vision/universe/starsgalaxies/hubble_UDF.html]


A4. Regression Models







A4.1 Power Law Growth

A4.2 Exponential Growth

A4.3 Logistic Growth

A4.4 Example: Newton's Law of Cooling

A4. Regression Models

Power Law Growth

The technique of transforming data, especially using logarithms, is extremely valuable.
Many physical systems involve two variables X and Y that are known (or suspected) to
obey a power law relation, where Y is proportional to X raised to a power, i.e., Y = α X^β
for some fixed constants α and β. Examples include the relation L = 1.4 A^0.6 that exists
between river length L and the area A that it drains, inverse square laws such as the
gravitational attraction F = G m₁ m₂ / r² between two masses separated by a distance r,
earthquake frequency versus intensity, the frequency of global mass extinction events over
geologic time, comet brightness vs. distance to the sun, economic trends, language patterns,
and numerous others.

As mentioned before, in these cases, both variables X and Y are often transformed by
means of a logarithm. The resulting data are replotted on a log-log scale, where a linear
model is then fit (the algebraic details were presented in the basic review of logarithms):

log₁₀ Y = β₀ + β₁ log₁₀ X ,

and the original power law relation can be recovered via the formulas

α = 10^β₀ ,   β = β₁ .

As a simple example, suppose we are examining the relation between V = Volume (cm³)
and A = Surface Area (cm²) of various physical objects. For the sake of simplicity, let us
confine our investigation to sample data of solid cubes of n = 10 different sizes:

V    1     8    27    64   125   216   343   512   729   1000
A    6    24    54    96   150   216   294   384   486    600

Note the nonlinear scatterplot in Figure 1. If we take the common logarithm of both
variables, the rescaled log-log plot reveals a strong linear correlation; see Figure 2. This
is strong evidence that there is a power law relation between the original variables, i.e.,
A = α V^β.

log₁₀ V   0.000   0.903   1.431   1.806   2.097   2.334   2.535   2.709   2.863   3.000
log₁₀ A   0.778   1.380   1.732   1.982   2.176   2.334   2.468   2.584   2.687   2.778

Therefore, a linear model will be a much better fit for these transformed data points than
for the original data points. Solving for the regression coefficients in the usual way
(Exercise), we find that the least squares regression line is given by

log₁₀ A = 0.778 + 0.667 log₁₀ V .

We can now estimate the original coefficients: α = 10^0.778 = 6, and β = 0.667 = 2/3,
approximately. Therefore, the required power law relation is A = 6 V^(2/3). This should
come as no surprise, because the surface area of a cube (which has six square faces) is
given by A = 6 s², and the volume is given by V = s³, where s is the length of one side of the
cube. Hence, eliminating the s, we see that A = 6 V^(2/3) for solid cubes. If we had chosen to
work with spheres instead, only the constant of proportionality would have changed
slightly (to ∛(36π) ≈ 4.836); the power would remain unchanged at β = 2/3. (Here, V = (4/3) π r³
and A = 4 π r², where r is the radius.) This illustrates a basic principle of mechanics: since
the volume of any object is roughly proportional to the cube of its length (say, V ∝ L³),
and the surface area is proportional to its square (say, A ∝ L²), what follows is the general
power relation that A ∝ V^(2/3).
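
A brief R sketch reproduces this log-log fit for the cube data above (the variable names are just labels):

V <- c(1, 8, 27, 64, 125, 216, 343, 512, 729, 1000)   # volume (cm^3)
A <- c(6, 24, 54, 96, 150, 216, 294, 384, 486, 600)   # surface area (cm^2)

fit <- lm(log10(A) ~ log10(V))   # linear model on the log-log scale
coef(fit)                        # intercept ~ 0.778, slope ~ 0.667

alpha <- 10^coef(fit)[1]         # recover alpha = 10^intercept ~ 6
beta  <- coef(fit)[2]            # beta = slope ~ 2/3

plot(log10(V), log10(A)); abline(fit)   # log-log scatterplot with fitted line (compare Figure 2)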

Comment. In a biomechanical application of power law scaling, consider the relation
between the metabolic rate Y of organisms (as measured by the amount of surface area heat
dissipation per unit time), and their body mass M (generally proportional to the volume).
From the preceding argument, one might naively expect that, as a general rule, Y ∝ M^(2/3).
However, this has been shown not to be the case. From systematic measurements of the
correlation between these two variables (first done in 1932 by Max Kleiber), it was shown
that a more accurate power relation is given by Y ∝ M^(3/4), known as Kleiber's Law. Since
that time, quarter-power scaling has been shown to exist everywhere in biology, from
respiratory rates (∝ M^(−1/4)) to tree trunk and human aorta diameters (∝ M^(3/8)). Exactly why
this is so universal is something of a major mystery, but it seems related to an area of
mathematics known as fractal geometry. Since 1997, much research has been devoted to
describing general models that explain the origin and prevalence of quarter-power scaling in
nature, which is considered by some to be perhaps the single most pervasive theme
underlying all biological diversity. (Santa Fe Institute Bulletin, Volume 12, No. 2.)





[Figure 1: scatterplot of the original (V, A) data, showing the nonlinear relation]

[Figure 2: log-log plot of the transformed data, showing a strong linear correlation]

A4. Regression Models

Exponential Growth

Consider a (somewhat idealized) example of how to use a logarithm transformation on
exponential growth data. Assume we start with an initial population of 100 cells in
culture, and they grow under ideal conditions, exactly doubling their numbers once every
hour. Let X = time (hours), Y = population size; suppose we obtain the following data.

X:    0     1     2     3     4

Y:  100   200   400   800  1600

A scatterplot reveals typical exponential (a.k.a. geometric) growth; see Figure 1. A linear
fit of these data points (X, Y) will not be a particularly good model for it, but there is
nothing to prevent us, either statistically or mathematically, from proceeding this way.
Their least squares regression line (also shown in Figure 1) is given by the equation

Y = −100 + 360 X,

with a coefficient of determination r² = 0.871. (Exercise: Verify these claims.)

Although r² is fairly close to 1, there is nothing scientifically compelling about this
model; there is certainly nothing natural or enlightening about the regression coefficients
−100 and 360 in the context of this particular application. This illustrates the drawback
of relying on r² as the sole indicator of the fit of the linear model.

One alternative approach is to take the logarithm (we will use the common logarithm, base
10) of the response variable Y, which is possible to do since Y takes positive values, in
an attempt to put the population size on the same scale as the time variable X. This gives

log₁₀(Y):   2.0   2.3   2.6   2.9   3.2

Notice that the transformed response variable increases with a constant slope (+0.3) for
every one-hour increase in time, the hallmark of linear behavior. Therefore, since the
points (X, log₁₀ Y) are collinear, their least squares regression line is given by the equation

log₁₀(Ŷ) = 2 + 0.3 X.   (Verify this by computing the regression coefficients.)

Given this, we can now solve for the population size directly. Inverting the logarithm,

Ŷ = 10^(2 + 0.3 X) = 10² · 10^(0.3 X)   (via a law of exponents),

i.e.,

Ŷ = 100 · 2^X .

This exponential growth model is a much better fit to the data; see Figure 2. In fact, it's
exact (check it for X =0, 1, 2, 3, 4), and makes intuitively reasonable sense for this
application. The population size Y at any time X, is equal to the initial population size of
100, times 2 raised to the X power, since it doubles in size every hour. This is an
example of unrestricted exponential growth. The technique of logistic regression applies
to restricted exponential growth models, among other things.
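
The same comparison can be carried out in R in a few lines; this sketch simply reproduces the two fits described above.

X <- 0:4
Y <- c(100, 200, 400, 800, 1600)

lm(Y ~ X)                  # naive straight-line fit: intercept -100, slope 360
fit <- lm(log10(Y) ~ X)    # linear fit on the log scale
coef(fit)                  # intercept 2, slope log10(2) = 0.301 (0.3 rounded, as above)

10^fitted(fit)             # back-transformed fitted values: 100 * 2^X, i.e. 100 200 400 800 1600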


[Figure 1: scatterplot of the (X, Y) data with the least squares regression line Y = −100 + 360 X]

[Figure 2: the same data with the fitted exponential growth model Ŷ = 100 · 2^X]

A4. Regression Models

Logistic Growth

Consider the more realistic situation of restricted population growth. As with the previous
unrestricted case, the population initially grows exponentially, as resources are plentiful. Eventually,
however, various factors (such as competition for diminishing resources, stress due to overcrowding,
disease, predation, etc.) act to reduce the population growth rate. The population size continues to
increase, but at an ever-slower rate. Ultimately it approaches (but may never actually reach) an
asymptotically stable value, the carrying capacity, that represents a theoretical maximum limit to
the population size under these conditions.

Consider the following (idealized) data set, for a population with a carrying capacity of 900 organisms.
Note how the growth slows down and levels off with time.

X: 0 1 2 3 4

Y: 100 300 600 800 873

We wish to model this growth rate via regression, taking into account the carrying capacity. We first
convert the Y-values to proportions (π) that survive out of 900, by dividing.

π:   0.11   0.33   0.67   0.89   0.97

Next we transform to the peculiar-looking link function log₁₀ [π / (1 − π)]. (Note: In practice, the
natural logarithm base e = 2.71828… is normally used, for good reasons, but here we use the
common logarithm base 10. The final model for π will not depend on the particular base used.)

log₁₀ [π / (1 − π)]:   −0.9   −0.3   0.3   0.9   1.5

Notice how for every +1 increase in X, there is a corresponding constant increase (+0.6) in the
transformed variable log₁₀ [π / (1 − π)], indicating linear behavior. Hence, the fitted linear model is exact:

log₁₀ [π / (1 − π)] = −0.9 + 0.6 X .

Solving algebraically (details omitted) yields

π̂ = (0.125)(4^X) / [1 + (0.125)(4^X)] ,

which simplifies to π̂ = 4^X / (8 + 4^X), i.e., π̂ = 1 / [1 + 8 (4^(−X))]. (Multiply by 900 to get the fitted Ŷ.)

Exercise: Calculate the fitted values of this model for X = 0, 1, 2, 3, 4, and compare with the
original data values.

Formally, π is the probability P(S = 1), where the binary variable S = 1 indicates survival, and S = 0 indicates death.
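
The following R sketch reproduces this fit and checks the Exercise; note that a standard logistic regression in R (glm with family = binomial) would use the natural-log link instead, but, as noted above, the final fitted π does not depend on the base.

X <- 0:4
Y <- c(100, 300, 600, 800, 873)
K <- 900                           # carrying capacity

p <- Y / K                         # observed proportions pi
lm(log10(p / (1 - p)) ~ X)         # intercept ~ -0.9, slope ~ +0.6

pihat <- 1 / (1 + 8 * 4^(-X))      # fitted model derived above
round(K * pihat)                   # fitted population sizes: 100 300 600 800 873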

The S-shaped graph of this relation is the classical logistic curve, or logit (pronounced "low-jit"); see
figure. Besides restricted population growth, it also describes many other phenomena that behave
similarly, such as dose-response in pharmacokinetics, and the learning curve in psychology.






A4. Regression Models

A Modeling Example: Newton's Law of Cooling

Suppose that a beaker containing hot liquid is placed in a room of ambient temperature 70°F, and
allowed to cool. Its temperature Y (°F) is recorded every ten minutes over a period of time (X),
yielding the n = 5 measurements shown below, along with some accompanying summary statistics:

X:    0    10    20    30    40
Y:  150   110    90    80    75

x̄ = 20,   ȳ = 101,   s_x² = 250,   s_y² = 930,   s_xy = −450

Using simple linear regression, the least-squares regression line is given by Ŷ = 137 − 1.8 X,
which has a reasonably high coefficient of determination r² = 0.871, indicating an acceptable fit.
(You should check these claims on your own; in fact, this is one of my old exam problems!)
However, it is clear from the scatterplot that the linear model does not capture the curved nature of
the relationship between time X and temperature Y, as the latter decreases very rapidly in the early
minutes, then more slowly later on. Therefore, curvilinear regression might produce better
models. In particular, using polynomial regression, we may fit a quadratic model (i.e., parabola),
or a higher degree polynomial, to the data. Using elementary functions other than polynomials can
also produce suitable alternative least-squares regression models, as shown in the figures below.

All these models are reasonably good fits, and are potentially useful within the limits of the data,
especially if we have no additional information about the theoretical dynamics between X and Y.
However, in certain instances, it is possible to derive a formal mathematical relationship between
the variables of interest, starting from known fundamental scientific principles. For example, the
behavior of this system is governed by a principle known as Newton's Law of Cooling, which
states that at any given time, the rate of change of the temperature of the liquid is proportional to
the difference between the temperature of the liquid and the ambient temperature. In calculus
notation, this statement translates to a first-order ordinary differential equation, and
corresponding initial condition at time zero:

dY/dX = k (Y − a),   Y(0) = Y₀ .

Here, k < 0 is the constant of proportionality (negative, because the temperature Y is decreasing),
a is the known ambient temperature, and Y₀ is the given initial temperature of the liquid. The
unique solution of this initial value problem (IVP) is given by the following formula:

Y = a + (Y₀ − a) e^(k X) .

In this particular example, we have Y₀ = 150°F and a = 70°F. Furthermore, with the given data,
the constant of proportionality turns out to be precisely k = −(ln 2)/10, so there is exact agreement with

Y = 70 + 80 (2^(−X/10)) .

More importantly, note that as the time variable X grows large, the temperature variable Y in this
decaying exponential model asymptotically approaches the ambient temperature of a = 70°F at
equilibrium, as expected in practice. The other models do not share this physically realistic property.
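
A short R check of these claims, using the data above:

X <- c(0, 10, 20, 30, 40)
Y <- c(150, 110, 90, 80, 75)

lm(Y ~ X)                              # straight-line fit: Yhat = 137 - 1.8 X

k    <- -log(2) / 10                   # constant of proportionality k = -(ln 2)/10
Yhat <- 70 + (150 - 70) * exp(k * X)   # Y = a + (Y0 - a) e^(kX)
Yhat                                   # 150 110 90 80 75: agrees exactly with the data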


[Figures: scatterplots of the data with six fitted regression models and their coefficients of determination]

Linear:       Y = 137 − 1.8 X                        r² = 0.871
Quadratic:    Y = 148.429 − 4.086 X + 0.057 X²       r² = 0.994
Square root:  Y = 148.733 − 12.279 √X                r² = 0.991
Logarithmic:  Y = 152.086 − 20.288 ln(X + 1)         r² = 0.985
Reciprocal:   Y = 58.269 + 935.718 / (X + 10)        r² = 0.991
Exponential:  Y = 70 + 80 (2^(−X/10))                r² = 1.000

A5. Statistical Tables







A5.1 Z-distribution

A5.2 T-distribution

A5.3 Chi-squared distribution


Cumulative Probabilities of the Standard Normal Distribution N(0, 1)

Each entry is the left-sided area P(Z ≤ z-score) for the given z-score. The first block tabulates the
negative z-scores, from −4.26 up to −0.01 (read each z-score in that block as negative); the second
block tabulates the z-scores from 0.00 to +4.34.

4.26 0.00001
4.25 0.00001
4.24 0.00001
4.23 0.00001
4.22 0.00001
4.21 0.00001
4.20 0.00001
4.19 0.00001
4.18 0.00001
4.17 0.00002
4.16 0.00002
4.15 0.00002
4.14 0.00002
4.13 0.00002
4.12 0.00002
4.11 0.00002
4.10 0.00002
4.09 0.00002
4.08 0.00002
4.07 0.00002
4.06 0.00002
4.05 0.00003
4.04 0.00003
4.03 0.00003
4.02 0.00003
4.01 0.00003
4.00 0.00003
3.99 0.00003
3.98 0.00003
3.97 0.00004
3.96 0.00004
3.95 0.00004
3.94 0.00004
3.93 0.00004
3.92 0.00004
3.91 0.00005
3.90 0.00005
3.89 0.00005
3.88 0.00005
3.87 0.00005
3.86 0.00006
3.85 0.00006
3.84 0.00006
3.83 0.00006
3.82 0.00007
3.81 0.00007
3.80 0.00007
3.79 0.00008
3.78 0.00008
3.77 0.00008
3.76 0.00008
3.75 0.00009
3.74 0.00009
3.73 0.00010
3.72 0.00010
3.71 0.00010
3.70 0.00011
3.69 0.00011
3.68 0.00012
3.67 0.00012
3.66 0.00013
3.65 0.00013
3.64 0.00014
3.63 0.00014
3.62 0.00015
3.61 0.00015
3.60 0.00016
3.59 0.00017
3.58 0.00017
3.57 0.00018
3.56 0.00019
3.55 0.00019
3.54 0.00020
3.53 0.00021
3.52 0.00022
3.51 0.00022
3.50 0.00023
3.49 0.00024
3.48 0.00025
3.47 0.00026
3.46 0.00027
3.45 0.00028
3.44 0.00029
3.43 0.00030
3.42 0.00031
3.41 0.00032
3.40 0.00034
3.39 0.00035
3.38 0.00036
3.37 0.00038
3.36 0.00039
3.35 0.00040
3.34 0.00042
3.33 0.00043
3.32 0.00045
3.31 0.00047
3.30 0.00048
3.29 0.00050
3.28 0.00052
3.27 0.00054
3.26 0.00056
3.25 0.00058
3.24 0.00060
3.23 0.00062
3.22 0.00064
3.21 0.00066
3.20 0.00069
3.19 0.00071
3.18 0.00074
3.17 0.00076
3.16 0.00079
3.15 0.00082
3.14 0.00084
3.13 0.00087
3.12 0.00090
3.11 0.00094
3.10 0.00097
3.09 0.00100
3.08 0.00104
3.07 0.00107
3.06 0.00111
3.05 0.00114
3.04 0.00118
3.03 0.00122
3.02 0.00126
3.01 0.00131
3.00 0.00135
2.99 0.00139
2.98 0.00144
2.97 0.00149
2.96 0.00154
2.95 0.00159
2.94 0.00164
2.93 0.00169
2.92 0.00175
2.91 0.00181
2.90 0.00187
2.89 0.00193
2.88 0.00199
2.87 0.00205
2.86 0.00212
2.85 0.00219
2.84 0.00226
2.83 0.00233
2.82 0.00240
2.81 0.00248
2.80 0.00256
2.79 0.00264
2.78 0.00272
2.77 0.00280
2.76 0.00289
2.75 0.00298
2.74 0.00307
2.73 0.00317
2.72 0.00326
2.71 0.00336
2.70 0.00347
2.69 0.00357
2.68 0.00368
2.67 0.00379
2.66 0.00391
2.65 0.00402
2.64 0.00415
2.63 0.00427
2.62 0.00440
2.61 0.00453
2.60 0.00466
2.59 0.00480
2.58 0.00494
2.57 0.00508
2.56 0.00523
2.55 0.00539
2.54 0.00554
2.53 0.00570
2.52 0.00587
2.51 0.00604
2.50 0.00621
2.49 0.00639
2.48 0.00657
2.47 0.00676
2.46 0.00695
2.45 0.00714
2.44 0.00734
2.43 0.00755
2.42 0.00776
2.41 0.00798
2.40 0.00820
2.39 0.00842
2.38 0.00866
2.37 0.00889
2.36 0.00914
2.35 0.00939
2.34 0.00964
2.33 0.00990
2.32 0.01017
2.31 0.01044
2.30 0.01072
2.29 0.01101
2.28 0.01130
2.27 0.01160
2.26 0.01191
2.25 0.01222
2.24 0.01255
2.23 0.01287
2.22 0.01321
2.21 0.01355
2.20 0.01390
2.19 0.01426
2.18 0.01463
2.17 0.01500
2.16 0.01539
2.15 0.01578
2.14 0.01618
2.13 0.01659
2.12 0.01700
2.11 0.01743
2.10 0.01786
2.09 0.01831
2.08 0.01876
2.07 0.01923
2.06 0.01970
2.05 0.02018
2.04 0.02068
2.03 0.02118
2.02 0.02169
2.01 0.02222
2.00 0.02275
1.99 0.02330
1.98 0.02385
1.97 0.02442
1.96 0.02500
1.95 0.02559
1.94 0.02619
1.93 0.02680
1.92 0.02743
1.91 0.02807
1.90 0.02872
1.89 0.02938
1.88 0.03005
1.87 0.03074
1.86 0.03144
1.85 0.03216
1.84 0.03288
1.83 0.03362
1.82 0.03438
1.81 0.03515
1.80 0.03593
1.79 0.03673
1.78 0.03754
1.77 0.03836
1.76 0.03920
1.75 0.04006
1.74 0.04093
1.73 0.04182
1.72 0.04272
1.71 0.04363
1.70 0.04457
1.69 0.04551
1.68 0.04648
1.67 0.04746
1.66 0.04846
1.65 0.04947
1.64 0.05050
1.63 0.05155
1.62 0.05262
1.61 0.05370
1.60 0.05480
1.59 0.05592
1.58 0.05705
1.57 0.05821
1.56 0.05938
1.55 0.06057
1.54 0.06178
1.53 0.06301
1.52 0.06426
1.51 0.06552
1.50 0.06681
1.49 0.06811
1.48 0.06944
1.47 0.07078
1.46 0.07215
1.45 0.07353
1.44 0.07493
1.43 0.07636
1.42 0.07780
1.41 0.07927
1.40 0.08076
1.39 0.08226
1.38 0.08379
1.37 0.08534
1.36 0.08691
1.35 0.08851
1.34 0.09012
1.33 0.09176
1.32 0.09342
1.31 0.09510
1.30 0.09680
1.29 0.09853
1.28 0.10027
1.27 0.10204
1.26 0.10383
1.25 0.10565
1.24 0.10749
1.23 0.10935
1.22 0.11123
1.21 0.11314
1.20 0.11507
1.19 0.11702
1.18 0.11900
1.17 0.12100
1.16 0.12302
1.15 0.12507
1.14 0.12714
1.13 0.12924
1.12 0.13136
1.11 0.13350
1.10 0.13567
1.09 0.13786
1.08 0.14007
1.07 0.14231
1.06 0.14457
1.05 0.14686
1.04 0.14917
1.03 0.15151
1.02 0.15386
1.01 0.15625
1.00 0.15866
0.99 0.16109
0.98 0.16354
0.97 0.16602
0.96 0.16853
0.95 0.17106
0.94 0.17361
0.93 0.17619
0.92 0.17879
0.91 0.18141
0.90 0.18406
0.89 0.18673
0.88 0.18943
0.87 0.19215
0.86 0.19489
0.85 0.19766
0.84 0.20045
0.83 0.20327
0.82 0.20611
0.81 0.20897
0.80 0.21186
0.79 0.21476
0.78 0.21770
0.77 0.22065
0.76 0.22363
0.75 0.22663
0.74 0.22965
0.73 0.23270
0.72 0.23576
0.71 0.23885
0.70 0.24196
0.69 0.24510
0.68 0.24825
0.67 0.25143
0.66 0.25463
0.65 0.25785
0.64 0.26109
0.63 0.26435
0.62 0.26763
0.61 0.27093
0.60 0.27425
0.59 0.27760
0.58 0.28096
0.57 0.28434
0.56 0.28774
0.55 0.29116
0.54 0.29460
0.53 0.29806
0.52 0.30153
0.51 0.30503
0.50 0.30854
0.49 0.31207
0.48 0.31561
0.47 0.31918
0.46 0.32276
0.45 0.32636
0.44 0.32997
0.43 0.33360
0.42 0.33724
0.41 0.34090
0.40 0.34458
0.39 0.34827
0.38 0.35197
0.37 0.35569
0.36 0.35942
0.35 0.36317
0.34 0.36693
0.33 0.37070
0.32 0.37448
0.31 0.37828
0.30 0.38209
0.29 0.38591
0.28 0.38974
0.27 0.39358
0.26 0.39743
0.25 0.40129
0.24 0.40517
0.23 0.40905
0.22 0.41294
0.21 0.41683
0.20 0.42074
0.19 0.42465
0.18 0.42858
0.17 0.43251
0.16 0.43644
0.15 0.44038
0.14 0.44433
0.13 0.44828
0.12 0.45224
0.11 0.45620
0.10 0.46017
0.09 0.46414
0.08 0.46812
0.07 0.47210
0.06 0.47608
0.05 0.48006
0.04 0.48405
0.03 0.48803
0.02 0.49202
0.01 0.49601
z-score   P(Z ≤ z-score)
0.00 0.50000
+0.01 0.50399
+0.02 0.50798
+0.03 0.51197
+0.04 0.51595
+0.05 0.51994
+0.06 0.52392
+0.07 0.52790
+0.08 0.53188
+0.09 0.53586
+0.10 0.53983
+0.11 0.54380
+0.12 0.54776
+0.13 0.55172
+0.14 0.55567
+0.15 0.55962
+0.16 0.56356
+0.17 0.56749
+0.18 0.57142
+0.19 0.57535
+0.20 0.57926
+0.21 0.58317
+0.22 0.58706
+0.23 0.59095
+0.24 0.59483
+0.25 0.59871
+0.26 0.60257
+0.27 0.60642
+0.28 0.61026
+0.29 0.61409
+0.30 0.61791
+0.31 0.62172
+0.32 0.62552
+0.33 0.62930
+0.34 0.63307
+0.35 0.63683
+0.36 0.64058
+0.37 0.64431
+0.38 0.64803
+0.39 0.65173
+0.40 0.65542
+0.41 0.65910
+0.42 0.66276
+0.43 0.66640
+0.44 0.67003
+0.45 0.67364
+0.46 0.67724
+0.47 0.68082
+0.48 0.68439
+0.49 0.68793
+0.50 0.69146
+0.51 0.69497
+0.52 0.69847
+0.53 0.70194
+0.54 0.70540
+0.55 0.70884
+0.56 0.71226
+0.57 0.71566
+0.58 0.71904
+0.59 0.72240
+0.60 0.72575
+0.61 0.72907
+0.62 0.73237
+0.63 0.73565
+0.64 0.73891
+0.65 0.74215
+0.66 0.74537
+0.67 0.74857
+0.68 0.75175
+0.69 0.75490
+0.70 0.75804
+0.71 0.76115
+0.72 0.76424
+0.73 0.76730
+0.74 0.77035
+0.75 0.77337
+0.76 0.77637
+0.77 0.77935
+0.78 0.78230
+0.79 0.78524
+0.80 0.78814
+0.81 0.79103
+0.82 0.79389
+0.83 0.79673
+0.84 0.79955
+0.85 0.80234
+0.86 0.80511
+0.87 0.80785
+0.88 0.81057
+0.89 0.81327
+0.90 0.81594
+0.91 0.81859
+0.92 0.82121
+0.93 0.82381
+0.94 0.82639
+0.95 0.82894
+0.96 0.83147
+0.97 0.83398
+0.98 0.83646
+0.99 0.83891
+1.00 0.84134
+1.01 0.84375
+1.02 0.84614
+1.03 0.84849
+1.04 0.85083
+1.05 0.85314
+1.06 0.85543
+1.07 0.85769
+1.08 0.85993
+1.09 0.86214
+1.10 0.86433
+1.11 0.86650
+1.12 0.86864
+1.13 0.87076
+1.14 0.87286
+1.15 0.87493
+1.16 0.87698
+1.17 0.87900
+1.18 0.88100
+1.19 0.88298
+1.20 0.88493
+1.21 0.88686
+1.22 0.88877
+1.23 0.89065
+1.24 0.89251
+1.25 0.89435
+1.26 0.89617
+1.27 0.89796
+1.28 0.89973
+1.29 0.90147
+1.30 0.90320
+1.31 0.90490
+1.32 0.90658
+1.33 0.90824
+1.34 0.90988
+1.35 0.91149
+1.36 0.91309
+1.37 0.91466
+1.38 0.91621
+1.39 0.91774
+1.40 0.91924
+1.41 0.92073
+1.42 0.92220
+1.43 0.92364
+1.44 0.92507
+1.45 0.92647
+1.46 0.92785
+1.47 0.92922
+1.48 0.93056
+1.49 0.93189
+1.50 0.93319
+1.51 0.93448
+1.52 0.93574
+1.53 0.93699
+1.54 0.93822
+1.55 0.93943
+1.56 0.94062
+1.57 0.94179
+1.58 0.94295
+1.59 0.94408
+1.60 0.94520
+1.61 0.94630
+1.62 0.94738
+1.63 0.94845
+1.64 0.94950
+1.65 0.95053
+1.66 0.95154
+1.67 0.95254
+1.68 0.95352
+1.69 0.95449
+1.70 0.95543
+1.71 0.95637
+1.72 0.95728
+1.73 0.95818
+1.74 0.95907
+1.75 0.95994
+1.76 0.96080
+1.77 0.96164
+1.78 0.96246
+1.79 0.96327
+1.80 0.96407
+1.81 0.96485
+1.82 0.96562
+1.83 0.96638
+1.84 0.96712
+1.85 0.96784
+1.86 0.96856
+1.87 0.96926
+1.88 0.96995
+1.89 0.97062
+1.90 0.97128
+1.91 0.97193
+1.92 0.97257
+1.93 0.97320
+1.94 0.97381
+1.95 0.97441
+1.96 0.97500
+1.97 0.97558
+1.98 0.97615
+1.99 0.97670
+2.00 0.97725
+2.01 0.97778
+2.02 0.97831
+2.03 0.97882
+2.04 0.97932
+2.05 0.97982
+2.06 0.98030
+2.07 0.98077
+2.08 0.98124
+2.09 0.98169
+2.10 0.98214
+2.11 0.98257
+2.12 0.98300
+2.13 0.98341
+2.14 0.98382
+2.15 0.98422
+2.16 0.98461
+2.17 0.98500
+2.18 0.98537
+2.19 0.98574
+2.20 0.98610
+2.21 0.98645
+2.22 0.98679
+2.23 0.98713
+2.24 0.98745
+2.25 0.98778
+2.26 0.98809
+2.27 0.98840
+2.28 0.98870
+2.29 0.98899
+2.30 0.98928
+2.31 0.98956
+2.32 0.98983
+2.33 0.99010
+2.34 0.99036
+2.35 0.99061
+2.36 0.99086
+2.37 0.99111
+2.38 0.99134
+2.39 0.99158
+2.40 0.99180
+2.41 0.99202
+2.42 0.99224
+2.43 0.99245
+2.44 0.99266
+2.45 0.99286
+2.46 0.99305
+2.47 0.99324
+2.48 0.99343
+2.49 0.99361
+2.50 0.99379
+2.51 0.99396
+2.52 0.99413
+2.53 0.99430
+2.54 0.99446
+2.55 0.99461
+2.56 0.99477
+2.57 0.99492
+2.58 0.99506
+2.59 0.99520
+2.60 0.99534
+2.61 0.99547
+2.62 0.99560
+2.63 0.99573
+2.64 0.99585
+2.65 0.99598
+2.66 0.99609
+2.67 0.99621
+2.68 0.99632
+2.69 0.99643
+2.70 0.99653
+2.71 0.99664
+2.72 0.99674
+2.73 0.99683
+2.74 0.99693
+2.75 0.99702
+2.76 0.99711
+2.77 0.99720
+2.78 0.99728
+2.79 0.99736
+2.80 0.99744
+2.81 0.99752
+2.82 0.99760
+2.83 0.99767
+2.84 0.99774
+2.85 0.99781
+2.86 0.99788
+2.87 0.99795
+2.88 0.99801
+2.89 0.99807
+2.90 0.99813
+2.91 0.99819
+2.92 0.99825
+2.93 0.99831
+2.94 0.99836
+2.95 0.99841
+2.96 0.99846
+2.97 0.99851
+2.98 0.99856
+2.99 0.99861
+3.00 0.99865
+3.01 0.99869
+3.02 0.99874
+3.03 0.99878
+3.04 0.99882
+3.05 0.99886
+3.06 0.99889
+3.07 0.99893
+3.08 0.99896
+3.09 0.99900
+3.10 0.99903
+3.11 0.99906
+3.12 0.99910
+3.13 0.99913
+3.14 0.99916
+3.15 0.99918
+3.16 0.99921
+3.17 0.99924
+3.18 0.99926
+3.19 0.99929
+3.20 0.99931
+3.21 0.99934
+3.22 0.99936
+3.23 0.99938
+3.24 0.99940
+3.25 0.99942
+3.26 0.99944
+3.27 0.99946
+3.28 0.99948
+3.29 0.99950
+3.30 0.99952
+3.31 0.99953
+3.32 0.99955
+3.33 0.99957
+3.34 0.99958
+3.35 0.99960
+3.36 0.99961
+3.37 0.99962
+3.38 0.99964
+3.39 0.99965
+3.40 0.99966
+3.41 0.99968
+3.42 0.99969
+3.43 0.99970
+3.44 0.99971
+3.45 0.99972
+3.46 0.99973
+3.47 0.99974
+3.48 0.99975
+3.49 0.99976
+3.50 0.99977
+3.51 0.99978
+3.52 0.99978
+3.53 0.99979
+3.54 0.99980
+3.55 0.99981
+3.56 0.99981
+3.57 0.99982
+3.58 0.99983
+3.59 0.99983
+3.60 0.99984
+3.61 0.99985
+3.62 0.99985
+3.63 0.99986
+3.64 0.99986
+3.65 0.99987
+3.66 0.99987
+3.67 0.99988
+3.68 0.99988
+3.69 0.99989
+3.70 0.99989
+3.71 0.99990
+3.72 0.99990
+3.73 0.99990
+3.74 0.99991
+3.75 0.99991
+3.76 0.99992
+3.77 0.99992
+3.78 0.99992
+3.79 0.99992
+3.80 0.99993
+3.81 0.99993
+3.82 0.99993
+3.83 0.99994
+3.84 0.99994
+3.85 0.99994
+3.86 0.99994
+3.87 0.99995
+3.88 0.99995
+3.89 0.99995
+3.90 0.99995
+3.91 0.99995
+3.92 0.99996
+3.93 0.99996
+3.94 0.99996
+3.95 0.99996
+3.96 0.99996
+3.97 0.99996
+3.98 0.99997
+3.99 0.99997
+4.00 0.99997
+4.01 0.99997
+4.02 0.99997
+4.03 0.99997
+4.04 0.99997
+4.05 0.99997
+4.06 0.99998
+4.07 0.99998
+4.08 0.99998
+4.09 0.99998
+4.10 0.99998
+4.11 0.99998
+4.12 0.99998
+4.13 0.99998
+4.14 0.99998
+4.15 0.99998
+4.16 0.99998
+4.17 0.99998
+4.18 0.99999
+4.19 0.99999
+4.20 0.99999
+4.21 0.99999
+4.22 0.99999
+4.23 0.99999
+4.24 0.99999
+4.25 0.99999
+4.26 0.99999
+4.27 0.99999
+4.28 0.99999
+4.29 0.99999
+4.30 0.99999
+4.31 0.99999
+4.32 0.99999
+4.33 0.99999
+4.34 0.99999
Right-sided area: P(Z ≥ z-score) = 1 − Left-sided area

Interval area: P(a ≤ Z ≤ b) = P(Z ≤ b) − P(Z ≤ a)

Note: To linearly interpolate for in-between values, solve
(z_high − z_low)(P_between − P_low) = (z_between − z_low)(P_high − P_low)
for either z_between or P_between, whichever is required, given the other.
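
In R, pnorm() and qnorm() return these areas and z-scores directly (to more decimal places than the table), so interpolation is rarely needed; for example:

pnorm(1.96)               # left-sided area P(Z <= 1.96)  = 0.9750
pnorm(-2.17)              # left-sided area P(Z <= -2.17) = 0.0150
1 - pnorm(1.28)           # right-sided area P(Z >= 1.28)
pnorm(1.5) - pnorm(-0.5)  # interval area P(-0.5 <= Z <= 1.5)
qnorm(0.975)              # z-score whose left-sided area is 0.975, i.e. 1.960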
T-scores corresponding to selected right-tailed probabilities (areas) of the t_df-distribution

[Note that, for any fixed df, t-scores > z-scores. As df → ∞, t-scores → z-scores (i.e., the last row).]

df 0.5 0.25 0.10 0.05 0.025 0.010 0.005 0.0025 0.001 0.0005 0.00025

1 0 1.000 3.078 6.314 12.706 31.821 63.657 127.321 318.309 636.619 1273.239
2 0 0.816 1.886 2.920 4.303 6.965 9.925 14.089 22.327 31.599 44.705
3 0 0.765 1.638 2.353 3.182 4.541 5.841 7.453 10.215 12.924 16.326
4 0 0.741 1.533 2.132 2.776 3.747 4.604 5.598 7.173 8.610 10.306
5 0 0.727 1.476 2.015 2.571 3.365 4.032 4.773 5.893 6.869 7.976
6 0 0.718 1.440 1.943 2.447 3.143 3.707 4.317 5.208 5.959 6.788
7 0 0.711 1.415 1.895 2.365 2.998 3.499 4.029 4.785 5.408 6.082
8 0 0.706 1.397 1.860 2.306 2.896 3.355 3.833 4.501 5.041 5.617
9 0 0.703 1.383 1.833 2.262 2.821 3.250 3.690 4.297 4.781 5.291
10 0 0.700 1.372 1.812 2.228 2.764 3.169 3.581 4.144 4.587 5.049
11 0 0.697 1.363 1.796 2.201 2.718 3.106 3.497 4.025 4.437 4.863
12 0 0.695 1.356 1.782 2.179 2.681 3.055 3.428 3.930 4.318 4.716
13 0 0.694 1.350 1.771 2.160 2.650 3.012 3.372 3.852 4.221 4.597
14 0 0.692 1.345 1.761 2.145 2.624 2.977 3.326 3.787 4.140 4.499
15 0 0.691 1.341 1.753 2.131 2.602 2.947 3.286 3.733 4.073 4.417
16 0 0.690 1.337 1.746 2.120 2.583 2.921 3.252 3.686 4.015 4.346
17 0 0.689 1.333 1.740 2.110 2.567 2.898 3.222 3.646 3.965 4.286
18 0 0.688 1.330 1.734 2.101 2.552 2.878 3.197 3.610 3.922 4.233
19 0 0.688 1.328 1.729 2.093 2.539 2.861 3.174 3.579 3.883 4.187
20 0 0.687 1.325 1.725 2.086 2.528 2.845 3.153 3.552 3.850 4.146
21 0 0.686 1.323 1.721 2.080 2.518 2.831 3.135 3.527 3.819 4.110
22 0 0.686 1.321 1.717 2.074 2.508 2.819 3.119 3.505 3.792 4.077
23 0 0.685 1.319 1.714 2.069 2.500 2.807 3.104 3.485 3.768 4.047
24 0 0.685 1.318 1.711 2.064 2.492 2.797 3.091 3.467 3.745 4.021
25 0 0.684 1.316 1.708 2.060 2.485 2.787 3.078 3.450 3.725 3.996
26 0 0.684 1.315 1.706 2.056 2.479 2.779 3.067 3.435 3.707 3.974
27 0 0.684 1.314 1.703 2.052 2.473 2.771 3.057 3.421 3.690 3.954
28 0 0.683 1.313 1.701 2.048 2.467 2.763 3.047 3.408 3.674 3.935
29 0 0.683 1.311 1.699 2.045 2.462 2.756 3.038 3.396 3.659 3.918
30 0 0.683 1.310 1.697 2.042 2.457 2.750 3.030 3.385 3.646 3.902
31 0 0.682 1.309 1.696 2.040 2.453 2.744 3.022 3.375 3.633 3.887
32 0 0.682 1.309 1.694 2.037 2.449 2.738 3.015 3.365 3.622 3.873
33 0 0.682 1.308 1.692 2.035 2.445 2.733 3.008 3.356 3.611 3.860
34 0 0.682 1.307 1.691 2.032 2.441 2.728 3.002 3.348 3.601 3.848
35 0 0.682 1.306 1.690 2.030 2.438 2.724 2.996 3.340 3.591 3.836
36 0 0.681 1.306 1.688 2.028 2.434 2.719 2.990 3.333 3.582 3.826
37 0 0.681 1.305 1.687 2.026 2.431 2.715 2.985 3.326 3.574 3.815
38 0 0.681 1.304 1.686 2.024 2.429 2.712 2.980 3.319 3.566 3.806
39 0 0.681 1.304 1.685 2.023 2.426 2.708 2.976 3.313 3.558 3.797
40 0 0.681 1.303 1.684 2.021 2.423 2.704 2.971 3.307 3.551 3.788
41 0 0.681 1.303 1.683 2.020 2.421 2.701 2.967 3.301 3.544 3.780
42 0 0.680 1.302 1.682 2.018 2.418 2.698 2.963 3.296 3.538 3.773
43 0 0.680 1.302 1.681 2.017 2.416 2.695 2.959 3.291 3.532 3.765
44 0 0.680 1.301 1.680 2.015 2.414 2.692 2.956 3.286 3.526 3.758
45 0 0.680 1.301 1.679 2.014 2.412 2.690 2.952 3.281 3.520 3.752
46 0 0.680 1.300 1.679 2.013 2.410 2.687 2.949 3.277 3.515 3.746
47 0 0.680 1.300 1.678 2.012 2.408 2.685 2.946 3.273 3.510 3.740
48 0 0.680 1.299 1.677 2.011 2.407 2.682 2.943 3.269 3.505 3.734
49 0 0.680 1.299 1.677 2.010 2.405 2.680 2.940 3.265 3.500 3.728
50 0 0.679 1.299 1.676 2.009 2.403 2.678 2.937 3.261 3.496 3.723
df 0.5 0.25 0.10 0.05 0.025 0.01 0.005 0.0025 0.001 0.0005 0.00025

51 0 0.679 1.298 1.675 2.008 2.402 2.676 2.934 3.258 3.492 3.718
52 0 0.679 1.298 1.675 2.007 2.400 2.674 2.932 3.255 3.488 3.713
53 0 0.679 1.298 1.674 2.006 2.399 2.672 2.929 3.251 3.484 3.709
54 0 0.679 1.297 1.674 2.005 2.397 2.670 2.927 3.248 3.480 3.704
55 0 0.679 1.297 1.673 2.004 2.396 2.668 2.925 3.245 3.476 3.700
56 0 0.679 1.297 1.673 2.003 2.395 2.667 2.923 3.242 3.473 3.696
57 0 0.679 1.297 1.672 2.002 2.394 2.665 2.920 3.239 3.470 3.692
58 0 0.679 1.296 1.672 2.002 2.392 2.663 2.918 3.237 3.466 3.688
59 0 0.679 1.296 1.671 2.001 2.391 2.662 2.916 3.234 3.463 3.684
60 0 0.679 1.296 1.671 2.000 2.390 2.660 2.915 3.232 3.460 3.681
61 0 0.679 1.296 1.670 2.000 2.389 2.659 2.913 3.229 3.457 3.677
62 0 0.678 1.295 1.670 1.999 2.388 2.657 2.911 3.227 3.454 3.674
63 0 0.678 1.295 1.669 1.998 2.387 2.656 2.909 3.225 3.452 3.671
64 0 0.678 1.295 1.669 1.998 2.386 2.655 2.908 3.223 3.449 3.668
65 0 0.678 1.295 1.669 1.997 2.385 2.654 2.906 3.220 3.447 3.665
66 0 0.678 1.295 1.668 1.997 2.384 2.652 2.904 3.218 3.444 3.662
67 0 0.678 1.294 1.668 1.996 2.383 2.651 2.903 3.216 3.442 3.659
68 0 0.678 1.294 1.668 1.995 2.382 2.650 2.902 3.214 3.439 3.656
69 0 0.678 1.294 1.667 1.995 2.382 2.649 2.900 3.213 3.437 3.653
70 0 0.678 1.294 1.667 1.994 2.381 2.648 2.899 3.211 3.435 3.651
71 0 0.678 1.294 1.667 1.994 2.380 2.647 2.897 3.209 3.433 3.648
72 0 0.678 1.293 1.666 1.993 2.379 2.646 2.896 3.207 3.431 3.646
73 0 0.678 1.293 1.666 1.993 2.379 2.645 2.895 3.206 3.429 3.644
74 0 0.678 1.293 1.666 1.993 2.378 2.644 2.894 3.204 3.427 3.641
75 0 0.678 1.293 1.665 1.992 2.377 2.643 2.892 3.202 3.425 3.639
76 0 0.678 1.293 1.665 1.992 2.376 2.642 2.891 3.201 3.423 3.637
77 0 0.678 1.293 1.665 1.991 2.376 2.641 2.890 3.199 3.421 3.635
78 0 0.678 1.292 1.665 1.991 2.375 2.640 2.889 3.198 3.420 3.633
79 0 0.678 1.292 1.664 1.990 2.374 2.640 2.888 3.197 3.418 3.631
80 0 0.678 1.292 1.664 1.990 2.374 2.639 2.887 3.195 3.416 3.629
81 0 0.678 1.292 1.664 1.990 2.373 2.638 2.886 3.194 3.415 3.627
82 0 0.677 1.292 1.664 1.989 2.373 2.637 2.885 3.193 3.413 3.625
83 0 0.677 1.292 1.663 1.989 2.372 2.636 2.884 3.191 3.412 3.623
84 0 0.677 1.292 1.663 1.989 2.372 2.636 2.883 3.190 3.410 3.622
85 0 0.677 1.292 1.663 1.988 2.371 2.635 2.882 3.189 3.409 3.620
86 0 0.677 1.291 1.663 1.988 2.370 2.634 2.881 3.188 3.407 3.618
87 0 0.677 1.291 1.663 1.988 2.370 2.634 2.880 3.187 3.406 3.617
88 0 0.677 1.291 1.662 1.987 2.369 2.633 2.880 3.185 3.405 3.615
89 0 0.677 1.291 1.662 1.987 2.369 2.632 2.879 3.184 3.403 3.613
90 0 0.677 1.291 1.662 1.987 2.368 2.632 2.878 3.183 3.402 3.612
91 0 0.677 1.291 1.662 1.986 2.368 2.631 2.877 3.182 3.401 3.610
92 0 0.677 1.291 1.662 1.986 2.368 2.630 2.876 3.181 3.399 3.609
93 0 0.677 1.291 1.661 1.986 2.367 2.630 2.876 3.180 3.398 3.607
94 0 0.677 1.291 1.661 1.986 2.367 2.629 2.875 3.179 3.397 3.606
95 0 0.677 1.291 1.661 1.985 2.366 2.629 2.874 3.178 3.396 3.605
96 0 0.677 1.290 1.661 1.985 2.366 2.628 2.873 3.177 3.395 3.603
97 0 0.677 1.290 1.661 1.985 2.365 2.627 2.873 3.176 3.394 3.602
98 0 0.677 1.290 1.661 1.984 2.365 2.627 2.872 3.175 3.393 3.601
99 0 0.677 1.290 1.660 1.984 2.365 2.626 2.871 3.175 3.392 3.600
100 0 0.677 1.290 1.660 1.984 2.364 2.626 2.871 3.174 3.390 3.598
120 0 0.677 1.289 1.658 1.980 2.358 2.617 2.860 3.160 3.373 3.578
140 0 0.676 1.288 1.656 1.977 2.353 2.611 2.852 3.149 3.361 3.564
160 0 0.676 1.287 1.654 1.975 2.350 2.607 2.846 3.142 3.352 3.553
180 0 0.676 1.286 1.653 1.973 2.347 2.603 2.842 3.136 3.345 3.545
200 0 0.676 1.286 1.653 1.972 2.345 2.601 2.839 3.131 3.340 3.539
∞ 0 0.674 1.282 1.645 1.960 2.326 2.576 2.807 3.090 3.291 3.481

Chi-squared scores (χ²-scores) corresponding to selected right-tailed probabilities (areas) of the χ²_df distribution



df 1 0.5 0.25 0.10 0.05 0.025 0.010 0.005 0.0025 0.0010 0.0005 0.00025

1 0 0.455 1.323 2.706 3.841 5.024 6.635 7.879 9.141 10.828 12.116 13.412
2 0 1.386 2.773 4.605 5.991 7.378 9.210 10.597 11.983 13.816 15.202 16.588
3 0 2.366 4.108 6.251 7.815 9.348 11.345 12.838 14.320 16.266 17.730 19.188
4 0 3.357 5.385 7.779 9.488 11.143 13.277 14.860 16.424 18.467 19.997 21.517
5 0 4.351 6.626 9.236 11.070 12.833 15.086 16.750 18.386 20.515 22.105 23.681
6 0 5.348 7.841 10.645 12.592 14.449 16.812 18.548 20.249 22.458 24.103 25.730
7 0 6.346 9.037 12.017 14.067 16.013 18.475 20.278 22.040 24.322 26.018 27.692
8 0 7.344 10.219 13.362 15.507 17.535 20.090 21.955 23.774 26.124 27.868 29.587
9 0 8.343 11.389 14.684 16.919 19.023 21.666 23.589 25.462 27.877 29.666 31.427
10 0 9.342 12.549 15.987 18.307 20.483 23.209 25.188 27.112 29.588 31.420 33.221
11 0 10.341 13.701 17.275 19.675 21.920 24.725 26.757 28.729 31.264 33.137 34.977
12 0 11.340 14.845 18.549 21.026 23.337 26.217 28.300 30.318 32.909 34.821 36.698
13 0 12.340 15.984 19.812 22.362 24.736 27.688 29.819 31.883 34.528 36.478 38.390
14 0 13.339 17.117 21.064 23.685 26.119 29.141 31.319 33.426 36.123 38.109 40.056
15 0 14.339 18.245 22.307 24.996 27.488 30.578 32.801 34.950 37.697 39.719 41.699
16 0 15.338 19.369 23.542 26.296 28.845 32.000 34.267 36.456 39.252 41.308 43.321
17 0 16.338 20.489 24.769 27.587 30.191 33.409 35.718 37.946 40.790 42.879 44.923
18 0 17.338 21.605 25.989 28.869 31.526 34.805 37.156 39.422 42.312 44.434 46.508
19 0 18.338 22.718 27.204 30.144 32.852 36.191 38.582 40.885 43.820 45.973 48.077
20 0 19.337 23.828 28.412 31.410 34.170 37.566 39.997 42.336 45.315 47.498 49.632
21 0 20.337 24.935 29.615 32.671 35.479 38.932 41.401 43.775 46.797 49.011 51.173
22 0 21.337 26.039 30.813 33.924 36.781 40.289 42.796 45.204 48.268 50.511 52.701
23 0 22.337 27.141 32.007 35.172 38.076 41.638 44.181 46.623 49.728 52.000 54.217
24 0 23.337 28.241 33.196 36.415 39.364 42.980 45.559 48.034 51.179 53.479 55.722
25 0 24.337 29.339 34.382 37.652 40.646 44.314 46.928 49.435 52.620 54.947 57.217
26 0 25.336 30.435 35.563 38.885 41.923 45.642 48.290 50.829 54.052 56.407 58.702
27 0 26.336 31.528 36.741 40.113 43.195 46.963 49.645 52.215 55.476 57.858 60.178
28 0 27.336 32.620 37.916 41.337 44.461 48.278 50.993 53.594 56.892 59.300 61.645
29 0 28.336 33.711 39.087 42.557 45.722 49.588 52.336 54.967 58.301 60.735 63.104
30 0 29.336 34.800 40.256 43.773 46.979 50.892 53.672 56.332 59.703 62.162 64.555
31 0 30.336 35.887 41.422 44.985 48.232 52.191 55.003 57.692 61.098 63.582 65.999
32 0 31.336 36.973 42.585 46.194 49.480 53.486 56.328 59.046 62.487 64.995 67.435
33 0 32.336 38.058 43.745 47.400 50.725 54.776 57.648 60.395 63.870 66.403 68.865
34 0 33.336 39.141 44.903 48.602 51.966 56.061 58.964 61.738 65.247 67.803 70.289
35 0 34.336 40.223 46.059 49.802 53.203 57.342 60.275 63.076 66.619 69.199 71.706
36 0 35.336 41.304 47.212 50.998 54.437 58.619 61.581 64.410 67.985 70.588 73.118
37 0 36.336 42.383 48.363 52.192 55.668 59.893 62.883 65.739 69.346 71.972 74.523
38 0 37.335 43.462 49.513 53.384 56.896 61.162 64.181 67.063 70.703 73.351 75.924
39 0 38.335 44.539 50.660 54.572 58.120 62.428 65.476 68.383 72.055 74.725 77.319
40 0 39.335 45.616 51.805 55.758 59.342 63.691 66.766 69.699 73.402 76.095 78.709
41 0 40.335 46.692 52.949 56.942 60.561 64.950 68.053 71.011 74.745 77.459 80.094
42 0 41.335 47.766 54.090 58.124 61.777 66.206 69.336 72.320 76.084 78.820 81.475
43 0 42.335 48.840 55.230 59.304 62.990 67.459 70.616 73.624 77.419 80.176 82.851
44 0 43.335 49.913 56.369 60.481 64.201 68.710 71.893 74.925 78.750 81.528 84.223
45 0 44.335 50.985 57.505 61.656 65.410 69.957 73.166 76.223 80.077 82.876 85.591
46 0 45.335 52.056 58.641 62.830 66.617 71.201 74.437 77.517 81.400 84.220 86.954
47 0 46.335 53.127 59.774 64.001 67.821 72.443 75.704 78.809 82.720 85.560 88.314
48 0 47.335 54.196 60.907 65.171 69.023 73.683 76.969 80.097 84.037 86.897 89.670
49 0 48.335 55.265 62.038 66.339 70.222 74.919 78.231 81.382 85.351 88.231 91.022
50 0 49.335 56.334 63.167 67.505 71.420 76.154 79.490 82.664 86.661 89.561 92.371
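
In R, the corresponding quantile and distribution functions reproduce these table entries; for example (compare the df = 20 row of the t-table and the df = 10 and df = 12 rows of the chi-squared table):

qt(0.025, df = 20, lower.tail = FALSE)       # 2.086
qt(0.05,  df = 200, lower.tail = FALSE)      # 1.653, already close to z = 1.645
qchisq(0.05, df = 10, lower.tail = FALSE)    # 18.307
pchisq(21.026, df = 12, lower.tail = FALSE)  # 0.05: right-tailed area of 21.026 with df = 12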
VOLUME 3: NO. 1 JANUARY 2006
Identifying Geographic Disparities in the
Early Detection of Breast Cancer Using a
Geographic Information System
ORIGINAL RESEARCH

Jane A. McElroy, PhD, Patrick L. Remington, MD, Ronald E. Gangnon, PhD, Luxme Hariharan,
LeAnn D. Andersen, MS
Suggested citation for this article: McElroy JA,
Remington PL, Gangnon RE, Hariharan L, Andersen
LD. Identifying geographic disparities in the early detec-
tion of breast cancer using a geographic information sys-
tem. Prev Chronic Dis [serial online] 2006 Jan [date
cited]. Available from: URL: http://www.cdc.gov/pcd/
issues/2006/jan/05_0065.htm.
PEER REVIEWED
Abstract
Introduction
Identifying communities with lower rates of mammogra-
phy screening is a critical step to providing targeted
screening programs; however, population-based data nec-
essary for identifying these geographic areas are limited.
This study presents methods to identify geographic dis-
parities in the early detection of breast cancer.
Methods
Data for all women residing in Dane County, Wisconsin,
at the time of their breast cancer diagnosis from 1981
through 2000 (N = 4769) were obtained from the Wisconsin
Cancer Reporting System (Wisconsin's tumor registry) by
ZIP code of residence. Hierarchical logistic regression mod-
els for disease mapping were used to identify geographic
differences in the early detection of breast cancer.
Results
The percentage of breast cancer cases diagnosed in situ
(excluding lobular carcinoma in situ) increased from 1.3%
in 1981 to 11.9% in 2000. This increase, reflecting increas-
ing mammography use, occurred sooner in Dane County
than in Wisconsin as a whole. From 1981 through 1985,
the proportion of breast cancer diagnosed in situ in Dane
County was universally low (2%–3%). From 1986 through
1990, urban and suburban ZIP codes had significantly
higher rates (10%) compared with rural ZIP codes (5%).
From 1991 through 1995, mammography screening had
increased in rural ZIP codes (7% of breast cancer diag-
nosed in situ). From 1996 through 2000, mammography
use was fairly homogeneous across the entire county
(13%–14% of breast cancer diagnosed in situ).
Conclusion
The percentage of breast cancer cases diagnosed in situ
increased in the state and in all areas of Dane County from
1981 through 2000. Visual display of the geographic differ-
ences in the early detection of breast cancer demonstrates
the diffusion of mammography use across the county over
the 20-year period.
Introduction
Geographic differences in health status and use of health
services have been reported in the United States and inter-
nationally (1), including stage of breast cancer incidence and
mammography screening practices (2). Early diagnosis of
breast cancer through mammography screening improves
breast cancer treatment options and may reduce mortality
(3,4), yet many women in the United States are not routine-
ly screened according to recommended guidelines (5).
Needs assessment to account for noncompliance with
breast cancer screening recommendations has focused on
personal factors related to participation, including the bar-
riers women perceive (6), the role of physicians (7), and the
role of services such as mobile vans (8) and insurance cov-
erage (9). Evaluations of the effectiveness of interventions
directed at patients, communities, and special populations
have also provided important information about mammog-
raphy use (10). However, little attention has been paid to
geographic location, except to focus on inner-city and rural
disparities in mammography use (11,12).
The purpose of this study was to identify geographic dis-
parities in the early detection of breast cancer using cancer
registry data. This information can be used to identify
areas where increased mammography screening is needed
and to understand the diffusion of innovation in an urban
or a rural setting.
Cancer registry data were used for these analyses.
Validity of the use of these data rests on the correlation
between the percentage of breast cancer diagnosed in situ
and mammography screening rates; breast cancer in situ
(BCIS) (excluding lobular carcinoma in situ [13-15]) is the
earliest stage of localized breast cancer and is diagnosed
almost exclusively by mammography (16). In the 1970s,
before widespread use of mammography, BCIS represent-
ed less than 2% of breast cancer cases in the United States
(15). A nationwide community-based breast cancer screen-
ing program showed that among populations of women
screened regularly, the stage distribution of diagnosed
cases was skewed to earlier stages, with BCIS accounting
for more than 35% (17). Trends in the relative frequency of
BCIS are closely correlated with trends in mammography
use (reflected in data from surveys of mammography
providers in Wisconsin) and with trends in self-reported
mammography use (reflected in data from the Behavioral
Risk Factor Surveillance System) (18-20).
In Wisconsin, either a physician can refer a patient for
screening or a woman can self-refer. More than 60% of
the mammography imaging facilities in the state accept
self-referrals (21). Since 1989, Wisconsin state law has
mandated health insurance coverage for women aged 45
to 65 years, and Medicare covers mammography screen-
ing for eligible women (22). In Wisconsin, the
Department of Health and Family Services provides a
toll-free number through which women can contact more
than 400 service providers (22). Finally, several programs
such as the Wisconsin Well Woman Program, which is
funded by the Centers for Disease Control and
Prevention, provide free or low-cost screening to under-
served women.
Methods
Study population
All female breast cancer cases diagnosed from 1981
through 2000 were identified by the Wisconsin Cancer
Reporting System (WCRS). The WCRS was established in
1976 as mandated by Wisconsin state statute to collect
cancer incidence data on Wisconsin residents. In compli-
ance with state law, hospitals and physicians are required
to report cancer cases to the WCRS (within 6 months of
initial diagnosis for hospitals and within 3 months for
physicians, through their clinics). Variables obtained from
the WCRS included histology (International Classification
of Diseases for Oncology, 2nd Edition [ICD-O-2] codes),
stage (0 = in situ, 1 = localized, 2–5 = regional, 7 = distant,
and 9 = unstaged), year of diagnosis, county of residence
at time of diagnosis, and number of incident cases in 5-
year age groups by ZIP code for all breast cancer cases
among women. ZIP codes and counties of residence, self-
reported by the women with diagnosed breast cancer, are
provided to the WCRS. Only ZIP codes verified for Dane
County by the U.S. Postal Service were included in the
data set (n = 37). The ZIP code was the smallest area unit
available for WCRS incidence data.
Study location and characteristics
Dane County is located in south central Wisconsin. The
population of the county in 1990 was 367,085, with 20% of
the population living in rural areas (23); approximately
190,000 people lived in Madison, the second largest city in
Wisconsin and home to the University of Wisconsin. The
37 unique ZIP codes in Dane County incorporate 60 cities,
villages, and towns (Figure 1).
Data analysis
We determined the percentage of breast cancer cases
diagnosed as BCIS in Wisconsin and Dane County over
time and by ZIP codes for Dane County. For ZIP codes
that encompassed areas beyond the borders of Dane
County, only women who reported their county of resi-
dence as Dane were included in the analysis. The per-
centage of BCIS by ZIP code was mapped using 1996 ZIP
code boundary files. For 17 breast cancer cases in which
the women's ZIP codes no longer existed, each ZIP code
was reassigned to the ZIP code in the same location.
We used analytic methods to estimate rates of early
breast cancer detection by ZIP code. Because of small num-
bers of BCIS cases in each ZIP code, a well-characterized
statistical method was used to stabilize the prediction of
rates by borrowing information from neighboring ZIP
codes (24). This is done by using Bayesian hierarchical
logistic regression models to estimate ZIP-code–specific
effects on percentage of breast cancer cases diagnosed in
situ (excluding lobular carcinoma in situ). ZIP-code–specif-
ic effects (log odds ratios) were modeled as a Gaussian con-
ditional autoregression (CAR) (25). Using the CAR model,
one assumes that the log odds ratio for one ZIP code is
influenced by the average log odds ratio for its neighbors.
The conditional standard deviation of the CAR model, the
free parameter which controls the smoothness of the map,
was given a uniform prior (24).
For each time period, two CAR models were fitted. The
first model included age group as the only covariate. Age
group effects were modeled using an exchangeable normal
prior. The standard deviation of this distribution was given
a uniform prior. The second model included additional ZIP-
code–level covariates. Potential covariates were urban or
rural status, education, median household income, marital
status, employment status, and commuting time from the
Summary Tape File 3 of the 1990 U.S. Census of
Population and Housing (23). Census data from 1990 were
used because 1990 is the midpoint of the years included in
these analyses (1981–2000). Urban or rural status was
defined as percentage of women living in each of the four
census classifications: urban inside urbanized area, urban
outside of urbanized area, rural farm, and rural nonfarm
for each ZIP code. Education was defined as percentage of
women in each ZIP code aged 25 years and older with less
than a high school diploma. Median household income for
each ZIP code was based on self-reported income. Marital
status was defined as women aged 25 years and older in
each ZIP code who had never been married. Employment
status was defined as percentage of women aged 16 years
and older in each ZIP code who worked in 1989. Full-time
employment variable was defined as percentage of women
25 years and older in each ZIP code who worked at least 40
hours per week. Commuting time was divided into five cat-
egories of percentage of female workers in each ZIP code:
worked at home, commuted 1 to 14 minutes, commuted 15
to 29 minutes, commuted 30 to 44 minutes, and commuted
45 minutes or more. Age was defined as age at diagnosis.
These potential covariates were initially screened using
forward stepwise logistic regression models, which includ-
ed ZIP code as an exchangeable (nonspatially structured)
random effect. Covariates included in the best model
selected using Schwarz's Bayesian Information Criterion
(BIC) (26) were used in the second covariate-adjusted
model. The covariate effects and the intercept were given
posterior priors.
Posterior estimates of the age-adjusted percentage of
BCIS for each ZIP code in each time period were obtained
from the CAR model. Posterior medians were used as
point estimates of the parameters, and 95% posterior
credible intervals were obtained. Analyses were per-
formed using WinBUGS software (27).

Figure 1. Map of Dane County, Wisconsin, showing capital city of Madison,
major lakes, active mammogram facilities, and percentage of area classified
as urban by ZIP code, using 1996 ZIP code boundaries and 1990 census
data. Inset map shows location of Dane County within the state.

Covariate screening was performed using SAS software, version 8 (SAS
Institute Inc, Cary, NC). ZIP-code–specific estimates were
mapped using ESRI 3.2 ArcView software
(Environmental Systems Research Institute, Redwood,
Calif) and 1996 ZIP code boundary files to display the
data.
As an empirical check on our mapping, we fitted a
regression model to the BCIS rates by ZIP code. The
dependent variable was BCIS rates (using the posterior
estimates of age-adjusted percentage of BCIS), and the
independent variable in the model was linear distance
from the University of Wisconsin Comprehensive Cancer
Center (UWCCC), located in Madison, to the centroid of
each ZIP code.
Results
A total of 4769 breast cancer cases were reported in
Dane County from 1981 through 2000: 825 from 1981
through 1985, 1119 from 1986 through 1990, 1239 from
1991 through 1995, and 1586 from 1996 through 2000.
Percentage of cases in situ varied by age group from a high
of 18% among women aged 45 to 49 years to a low of 0%
among women aged 20 to 24 years. From the mid 1980s,
the age group most frequently diagnosed with BCIS was
women aged 45 to 49. In contrast, women aged 20 to 34
and older than 84 were the least often (<2%) diagnosed
with BCIS (data not shown). Based on the 1990 U.S. cen-
sus, the total female population (aged 18 years and older)
in Dane County was 145,974; 60% of the female population
had more than a high school degree, and 15% of the female
population aged 25 and older had never married.
In Dane County, the percentage of BCIS increased from
1.3% in 1981 to 11.9% in 2000. For the state, the percent-
age of BCIS increased from 1.5% in 1981 to 12.8% in 2000.
From 1981 to 1993, Dane County had a higher percentage
of BCIS diagnosis than the state as a whole. By the mid-
1990s, the percentage of BCIS among breast cancer cases
in Dane County was similar to the percentage in the state
(Figure 2). Similar results are seen when mapping the
observed data (maps not shown).

Figure 2. Smoothed trends in percentage of breast cancer cases diagnosed
in situ (excluding lobular carcinoma in situ), Dane County, Wisconsin, and
Wisconsin, 1981–2000. Data point for Dane County, 1980, was estimated
from Andersen et al (28).

Figure 3. Model-based estimates of age-adjusted percentage of breast can-
cer cases diagnosed in situ during four 5-year periods, by ZIP code, Dane
County, Wisconsin, 1981–2000. BCIS indicates breast cancer in situ.
Figure 3 shows model-based estimates of the age-adjust-
ed percentage of BCIS diagnosis by ZIP code in Dane
County during four 5-year periods. These maps demon-
strate the increase in the percentage of cases diagnosed as
BCIS noted in Figure 2. These maps also show that the
increase in the percentage of BCIS was not uniform across
Dane County. From 1981 through 1985, the entire county
had uniformly low rates of BCIS (2%–3%). From 1986
through 1990, urban ZIP codes had markedly higher rates
of BCIS (approximately 12%) compared with rural ZIP
codes (approximately 5%). From 1991 through 1995, use of
mammography screening had begun to increase in the
rural ZIP codes (with a 7% rate of BCIS), although the
rates of BCIS remained higher in urban ZIP codes (12%).
From 1996 through 2000, mammography screening was
fairly universal across the county, with BCIS rates of 13%
to 14%. Similar patterns were observed from models that
adjusted for additional covariates of marital status and
education (data not shown).
From 1981 through 1985, there was no significant rela-
tionship between distance from UWCCC and the rate of
BCIS (P = .27). From 1986 through 1990 and from 1991
through 1995, there was strong evidence of an inverse rela-
tionship between distance from UWCCC and the rate of
BCIS (i.e., the closer to UWCCC, the higher the BCIS rate
[P < .001] for both periods). From 1996 through 2000, there
was a nonsignificant inverse relationship between distance
from UWCCC and the rate of BCIS (P = .07).
Discussion
The frequency of BCIS diagnosis increased substan-
tially in Wisconsin and in Dane County from 1981
through 2000. This increase in percentage of BCIS
among diagnosed breast cancer cases is consistent with
increases in self-reported mammography use,
Wisconsin Medicare claims for mammography, and the
number of medical imaging centers in Wisconsin (21).
However, progress in mammography screening was not
uniform across Dane County, and this lack of uniformi-
ty represents a classic case of diffusion of innovation.
Early adopters of mammography use lived in and near
the city of Madison. We can speculate that Madison
embodies one characteristic that accelerates the diffu-
sion process: namely, a more highly educated popula-
tion living in a university community with a strong
medical presence. One predictor of mammography use
is education: women who are more educated are more
likely to ask their physician for a referral or to
self-refer (29), and the strongest predictor of mammog-
raphy use is physician referral (30). Furthermore,
physicians are more likely to have chosen to live in the
Madison area instead of a more rural location because
they value the opportunity for regular contact with the
medical school and the medical community (31).
Consequently, a greater number of interpersonal net-
works and more information exchange among physi-
cians about adoption of this innovation might have
occurred earlier in the Madison medical community
than in the more rural areas of the county (32).
Although median household income by ZIP code was not
a predictor of mammography use in our study, the amount
of disposable income by individuals, which is not captured
by this variable, might also have been an important factor
for early adopters (33,34). In a national study of mammog-
raphy use, income was a significant predictor of repeat
screening in 1987 but not in 1990 (35). In the mid-1980s,
few insurance plans covered mammography screening.
Therefore, women of higher socioeconomic status (SES)
would have been more likely to be able to pay the cost of
the mammogram. Efforts to reduce costs, such as a 1987
statewide promotional campaign sponsored by the
American Cancer Society, still required a $50 copay from
women who were able to self-refer for a mammogram (36).
As the use of this technology diffused outward, increas-
ing numbers of women living in suburban and rural areas
surrounding Madison elected to get a mammogram. From
1996 through 2000, the geographic disparity in mammog-
raphy use was muted, although the eastern corridor of
Dane County still had slightly lower rates of BCIS than
other parts of the county. The reasons for persistent dis-
parity in this region of Dane County are unclear: it is
unlikely to be because of proximity to mammography
screening facilities, nor are the ZIP-code-level SES measures
such as percentage unemployed, household income,
percentage below poverty level, or education level statisti-
cally different from the western corridor of Dane County.
Differences in the trends of early detection of breast can-
cer within Dane County suggest that progress in mam-
mography screening has not been uniform across the
county. From 1996 through 2000, while more than 14% of
age-adjusted breast cancer cases were diagnosed as BCIS
in Madison, fewer than 6% of age-adjusted breast cancer
cases were diagnosed as BCIS in a few outlying and more
rural areas of Dane County, reflecting lower mammogra-
phy use by residents in this area. The results of an earli-
er analysis of these data were shared with local health
department staff in rural Dane County who were working
to increase early detection efforts through outreach edu-
cation and referrals to providers. As suggested by
Andersen et al, strategies to improve mammography use
include improving access to primary care physicians,
increasing the number of mammography facilities located
in rural areas, and increasing outreach efforts by a net-
work of public health professionals promoting screening in
their community (28). In addition, pointing out the varia-
tions in care may lead to improvements, since the first step
toward change is identifying a problem. With identification
of particular areas of need, resources can be garnered
toward alleviating the disparity.
Persistent disparities in mammography use after
adjusting for community level of educational attainment
and marital status were found. Other studies have found
that patients with cancer living in census tracts with
lower median levels of educational attainment are diag-
nosed in later disease stages than are patients in tracts
with higher median levels of education (29). Studies have
also shown that one predictor for getting a mammogram
is being married (37).
This study demonstrates the use of percentage of BCIS
as a tool for comparing population-based mammography
screening rates in different geographic areas. Using cancer
incidence data to monitor population-based rates of breast
cancer screening is possible throughout the nation,
because data from population-based cancer registries are
now widely available, often by ZIP code or census tract.
This method permits comparison of mammography screen-
ing rates among geographic areas smaller than areas used
in many previous studies of geographic variation in the
early detection of breast cancer (2).
The method described in this article can be used to com-
plement other ways to assess the quality of health care in
communities, such as the Health Plan Employer Data and
Information Set (HEDIS), created by the National
Committee for Quality Assurance. HEDIS addresses over-
all rates in managed care but does not include the under-
insured or fee-for-service populations particularly at risk
for inadequate screening (34). Cancer registry data are
population based; therefore, using cancer registry data is
not only effective but also economical and efficient for out-
reach specialists and health providers.
A potential weakness in this method is the representa-
tiveness of the statewide tumor registry. However, the
WCRS has been evaluated by the North American
Association of Central Cancer Registries and was given its
gold standard for quality, completeness, and timeliness in
1995 and 1996, the first 2 years this standard was recog-
nized (38). Completeness estimates are a general measure
of accuracy. The WCRS participated in national audits
that measured completeness in 1987, 1992, and 1996 as
well as one formal study in 1982. Overall, the quality of
the data improved slightly after 1994 when clinics and
neighboring state data-sharing agreements were imple-
mented (oral communication, Laura Stephenson, WCRS,
July 2005). In addition, the tumor registry has used stan-
dard methods for classifying tumor stage (e.g., in situ)
throughout the entire period of the study. Incidence data
from data sources of lesser quality or completeness than
the WCRS would need to be carefully evaluated for use in
this type of analysis.
Another limitation of this type of analysis is our use of
BCIS as a proxy for mammography screening practices.
Undoubtedly, some diagnoses of BCIS result from diag-
nostic mammograms, but reported use of screening mam-
mograms by individuals and medical facilities correlates
strongly with percentage of BCIS over time, particularly
ductal carcinoma in situ (18-20). Furthermore, we chose to
exclude lobular carcinoma in situ from our BCIS category
because this condition is often opportunistic (13-15).
A third limitation, which would be found in any type of
geographic analysis, rests on the accuracy of the assign-
ment of participants to the proper location. For area analy-
sis (e.g., ZIP code, county), this legitimate concern is ame-
liorated by using tools to check ZIP codes and county
assignments for correctness. For this study, women diag-
nosed with breast cancer provided their addresses, includ-
ing county of residence, to their medical facilities. These
addresses were forwarded to the WCRS, where quality-
control checks were implemented to validate ZIP code and
county assignments. For example, lists of ZIP codes and
their county codes were cross-referenced to the ZIP codes
and county codes of the addresses provided by the women
diagnosed with breast cancer. Inaccuracies were corrected
by the WCRS (oral communication, Laura Stephenson,
WCRS, January 2005).
Although there has been significant improvement in
breast cancer screening across the state and county, this
study demonstrates that the improvement has not been
uniform. The maps clearly indicate for program direc-
tors and policy makers the areas where further outreach
and research should be conducted. More specifically,
this type of analysis can be used to identify specific
areas (such as ZIP codes) within a community (such as
a county) with varying rates of early-stage breast can-
cer. Using this method, public health professionals can
provide population-level data to all health care providers
to target interventions to improve the early detection of
breast cancer in other counties in Wisconsin and other
states. Finally, this type of analysis is useful for compre-
hensive cancer control efforts and can be conducted for
other cancers with effective screening methods, such as
colorectal cancer.
Acknowledgments
The authors are grateful to Dr Larry Hanrahan and
Mark Bunge for advice and Laura Stephenson of the
WCRS for assistance with data.
This study was supported by National Cancer Institute
grant U01CA82004.
Author Information
Corresponding Author: Jane A. McElroy, PhD,
Comprehensive Cancer Center, 610 Walnut St, 307
WARF, Madison, WI 53726. Telephone: 608-265-8780.
E-mail: jamcelroy@wisc.edu.
Author Affiliations: Patrick L. Remington, MD,
Comprehensive Cancer Center and Department of
Population Health Sciences, University of Wisconsin,
Madison, Wis; Ronald E. Gangnon, PhD, Department of
Population Health Sciences and Biostatistics and Medical
Informatics, University of Wisconsin, Madison, Wis;
Luxme Hariharan, Department of Molecular Biology,
University of Wisconsin, Madison, Wis; LeAnn D.
Andersen, MS, Department of Population Health Sciences,
University of Wisconsin, Madison, Wis.
References
1. Coughlin SS, Uhler RJ, Bobo JK, Caplan L. Breast
cancer screening practices among women in the
United States, 2000. Cancer Causes Control
2004;15:159-70.
2. Roche LM, Skinner R, Weinstein RB. Use of a geo-
graphic information system to identify and character-
ize areas with high proportions of distant stage breast
cancer. J Public Health Manag Pract 2002;8:26-32.
3. Lee CH. Screening mammography: proven benefit,
continued controversy. Radiol Clin North Am
2002;40:395-407.
4. Nystrom L, Andersson I, Bjurstam N, Frisell J,
Nordenskjold B, Rutqvist LE. Long-term effects of
mammography screening: updated overview of the
Swedish randomised trials. Lancet 2002;359:909-19.
5. Mandelblatt J, Saha S, Teutsch S, Hoerger T, Siu AL,
Atkins D, et al. The cost-effectiveness of screening
mammography beyond age 65 years: a systematic
review for the U.S. Preventive Services Task Force.
Ann Intern Med 2003;139:835-42.
6. Rimer BK. Understanding the acceptance of mam-
mography by women. Annals of Behavioral Medicine
1992;14:197-203.
7. Fox SA, Stein JA. The effect of physician-patient com-
munication on mammography utilization by different
ethnic groups. Med Care 1991;29:1065-82.
8. Haiart DC, McKenzie L, Henderson J, Pollock W,
McQueen DV, Roberts MM, et al. Mobile breast
screening: factors affecting uptake, efforts to increase
response and acceptability. Public Health
1990;104:239-47.
9. Thompson GB, Kessler LG, Boss LP. Breast cancer
screening legislation in the United States. Am J Public
Health 1989;79:1541-3.
10. Zapka JG, Harris DR, Hosmer D, Costanza ME, Mas
E, Barth R. Effect of a community health center inter-
vention on breast cancer screening among Hispanic
American women. Health Serv Res 1993;28:223-35.
11. Breen N, Figueroa JB. Stage of breast and cervical
cancer diagnosis in disadvantaged neighborhoods: a
prevention policy perspective. Am J Prev Med
1996;12:319-26.
12. Andersen MR, Yasui Y, Meischke H, Kuniyuki A,
Etzioni R, Urban N. The effectiveness of mammogra-
phy promotion by volunteers in rural communities.
Am J Prev Med 2000;18:199-207.
13. Li CI, Anderson BO, Daling JR, Moe RE. Changing
incidence of lobular carcinoma in situ of the breast.
Breast Cancer Res Treat 2002;75:259-68.
14. Millikan R, Dressler L, Geradts J, Graham M. The
need for epidemiologic studies of in-situ carcinoma of
the breast. Breast Cancer Res Treat 1995;35:65-77.
15. Ernster VL, Barclay J, Kerlikowske K, Grady D,
Henderson C. Incidence of and treatment for ductal
carcinoma in situ of the breast. JAMA 1996;275:913-8.
16. Claus EB, Stowe M, Carter D. Breast carcinoma in
situ: risk factors and screening patterns. J Natl Cancer
Inst 2001;93:1811-7.
17. May DS, Lee NC, Richardson LC, Giustozzi AG, Bobo
JK. Mammography and breast cancer detection by
race and Hispanic ethnicity: results from a national
program (United States). Cancer Causes Control
2000;11:697-705.
18. Lantz P, Bunge M, Remington PL. Trends in mam-
mography in Wisconsin. Wis Med J 1990;89:281-2.
19. Lantz PM, Remington PL, Newcomb PA.
Mammography screening and increased incidence of
breast cancer in Wisconsin. J Natl Cancer Inst
1991;83:1540-6.
20. Bush DS, Remington PL, Reeves M, Phillips JL. In
situ breast cancer correlates with mammography use,
Wisconsin: 1980-1992. Wis Med J 1994;93:483-4.
21. Propeck PA, Scanlan KA. Breast imaging trends in
Wisconsin. WMJ 2000;99:42-6.
22. Fowler BA. Variability in mammography screening
legislation across the states. J Womens Health Gend
Based Med 2000;9:175-84.
23. U.S. Census Bureau. 1990 Summary Tape File 3 (STF
3)- sample data. Census of Population and Housing.
Washington (DC): American Factfinder;1990.
Available from: URL:
http://www.census.gov/main/www/cen1990.html.
24. Gelman A. Prior distributions for variance parameters
in hierarchical models. New York: Columbia
University; 2004.
25. Besag J, York J, Mollié A. Bayesian image restoration
with applications in spatial statistics. Annals of the
Institute of Statistical Mathematics 1991;43:1-20.
26. Schwarz G. Estimating the dimension of a model.
Annals of Statistics 1978;6:461-4.
27. Spiegelhalter D, Thomas A, Best N, Lunn D.
WinBUGS: Bayesian inference using Gibbs Sampling.
London (UK): MRC Biostatistics Unit; 2004.
28. Andersen LD, Remington PL, Trentham-Dietz A,
Robert S. Community trends in the early detection of
breast cancer in Wisconsin, 1980-1998. Am J Prev Med
2004;26:51-5.
29. Wells BL, Horm JW. Stage at diagnosis in breast can-
cer: race and socioeconomic factors. Am J Public
Health 1992;82:1383-5.
30. Simon MS, Gimotty PA, Coombs J, McBride S,
Moncrease A, Burack RC. Factors affecting participa-
tion in a mammography screening program among
members of an urban Detroit health maintenance
organization. Cancer Detect Prev 1998;22:30-8.
31. Cooper JK, Heald K, Samuels M, Coleman S. Rural or
urban practice: factors influencing the location deci-
sion of primary care physicians. Inquiry 1975;12:18-
25.
32. Rogers EM. Diffusion of innovations. New York (NY):
Free Press; 2003.
33. Calle EE, Flanders WD, Thun MJ, Martin LM.
Demographic predictors of mammography and Pap
smear screening in US women. Am J Public Health
1993;83:53-60.
34. Bradley CJ, Given CW, Roberts C. Disparities in can-
cer diagnosis and survival. Cancer 2001;91:178-88.
35. Zapka JG, Hosmer D, Costanza ME, Harris DR,
Stoddard A. Changes in mammography use: economic,
need, and service factors. Am J Public Health
1992;82:1345-51.
36. Remington PL, Lantz PM. Using a population-based
cancer reporting system to evaluate a breast cancer
detection and awareness program. CA Cancer J Clin
1992;42:367-71.
37. Lannin DR, Mathews HF, Mitchell J, Swanson MS,
Swanson FH, Edwards MS. Influence of socioeconomic
and cultural factors on racial differences in late-stage
presentation of breast cancer. JAMA 1998;279:1801-7.
38. Chen VW, Wu XC, Andrews PA. Cancer in North
America, 1991-1995, Vol I: Incidence. Sacramento
(CA): North American Association of Central Cancer
Registries; 1999.
Case studies of bias in real life epidemiologic studies

Bias File 2. Should we stop drinking coffee? The story of coffee and pancreatic cancer

Compiled by
Madhukar Pai, MD, PhD
Jay S Kaufman, PhD

Department of Epidemiology, Biostatistics & Occupational Health
McGill University, Montreal, Canada
madhukar.pai@mcgill.ca & jay.kaufman@mcgill.ca

THIS CASE STUDY CAN BE FREELY USED FOR EDUCATIONAL PURPOSES WITH DUE CREDIT
The story
Brian MacMahon (1923 - 2007) was a British-American epidemiologist who
chaired the Department of Epidemiology at Harvard from 1958 until 1988. In
1981, he published a paper in the New England Journal of Medicine, a case-
control study on coffee drinking and pancreatic cancer [MacMahon B, et al.,
1981]. The study concluded that "coffee use might account for a substantial
proportion of the cases of this disease in the United States." According to
some reports, after this study came out, MacMahon stopped drinking coffee
and replaced coffee with tea in his office. This publication provoked a storm
of protest from coffee drinkers and industry groups, with coverage in the
New York Times, Time magazine and Newsweek. Subsequent studies,
including one by MacMahon's group, failed to confirm the association. So,
what went wrong and why?
The study
From the original abstract:
We questioned 369 patients with histologically proved cancer of the pancreas and 644 control patients
about their use of tobacco, alcohol, tea, and coffee. There was a weak positive association between
pancreatic cancer and cigarette smoking, but we found no association with use of cigars, pipe tobacco,
alcoholic beverages, or tea. A strong association between coffee consumption and pancreatic cancer was
evident in both sexes. The association was not affected by controlling for cigarette use. For the sexes
combined, there was a significant dose-response relation (P approximately 0.001); after adjustment for
cigarette smoking, the relative risk associated with drinking up to two cups of coffee per day was 1.8
(95% confidence limits, 1.0 to 3.0), and that with three or more cups per day was 2.7 (1.6 to 4.7). This
association should be evaluated with other data; if it reflects a causal relation between coffee drinking
and pancreatic cancer, coffee use might account for a substantial proportion of the cases of this disease
in the United States.
The bias
The MacMahon study had several problems, and several experts have debated these in various journals,
but a widely recognized bias was related to control selection. A nice, easy-to-read explanation can be
found in the Gordis text [Gordis L, 2009], and a 1981 paper by Feinstein and colleagues drew attention to this problem.
Controls in the MacMahon study were selected from a group of patients hospitalized by the same
physicians who had diagnosed and hospitalized the cases. The idea was to make the selection
process of cases and controls similar. It was also logistically easier to get controls using this method.
However, as the exposure factor was coffee drinking, it turned out that patients seen by the physicians
who diagnosed pancreatic cancer often had gastrointestinal disorders and were thus advised not to
drink coffee (or had chosen to reduce coffee drinking themselves). So, this led to the selection of
controls with a higher prevalence of gastrointestinal disorders, and these controls had an unusually low
odds of exposure (coffee intake). This in turn may have led to a spurious positive association between
coffee intake and pancreatic cancer that could not be subsequently confirmed.
This problem can be explored further using causal diagrams. Since the study used a case-control design,
cases were sampled from the source population with higher frequency than the controls, which is
represented by the directed arrow between
pancreatic cancer and recruitment into study in
Figure 1. However, controls were selected by being
hospitalized by the same doctors who treated the
cases. If they were not hospitalized for pancreatic
cancer, they must have been hospitalized for some
other disease, which gave them a higher
representation of GI tract disease than observed in
the source population. Patients with GI tract disease
may have been discouraged from drinking coffee,
which gave controls a lower prevalence of exposure
than seen in the source population. This is shown in
Figure 1 as a directed arc from GI tract disease to coffee and to recruitment into study.
Collider stratification bias occurs when one
conditions (in the design or the analysis) on a
common child of two parents. In this case,
restricting the observations to people recruited into
the study (Figure 2) changes the correlation
structure so that it is no longer the same as in the
source population. Specifically, pancreatic cancer
and GI tract diseases may be uncorrelated in the
general population. However, among patients
hospitalized by the doctors who had admitted
patients with pancreatic cancer, the ones who didn't
have pancreatic disease were more likely to have
something else: a GI tract disease. Therefore, restriction to the population of the doctors who
hospitalized the cases induces a negative correlation between these two diseases in the data set.
Figure 3 shows a graph of the data for the study
population, as opposed to the source population.
Restriction to the subjects recruited from the
hospital has created a correlation between GI tract
disease and pancreatic cancer. Since GI tract disease
lowers exposure, an unblocked backdoor path is now
opened, which leads to confounding of the estimated
exposure effect (shown with a dashed line and a
question mark). Specifically, since the induced
correlation is negative, and the effect of GI tract
disease on coffee is negative, the exposure estimate
for coffee on pancreatic cancer will be biased upward (Vander Stoep et al 1999).
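The mechanism is easy to reproduce in a toy simulation. The sketch below is our own illustration with made-up probabilities, not the study's data: coffee is given no effect on pancreatic cancer, GI tract disease discourages coffee drinking, and both diseases raise the chance of being recruited; restricting to recruited subjects then yields a spuriously elevated odds ratio for coffee.

# Toy simulation (assumed probabilities) of the selection/collider bias described above.
set.seed(1)
n          <- 1e6
gi_disease <- rbinom(n, 1, 0.10)
coffee     <- rbinom(n, 1, ifelse(gi_disease == 1, 0.40, 0.80))  # GI disease discourages coffee
cancer     <- rbinom(n, 1, 0.005)                                # independent of coffee by construction
recruited  <- rbinom(n, 1, ifelse(cancer == 1, 0.90,
                            ifelse(gi_disease == 1, 0.10, 0.01)))

odds_ratio <- function(exposure, outcome) {
  tab <- table(exposure, outcome)
  (tab["1", "1"] * tab["0", "0"]) / (tab["1", "0"] * tab["0", "1"])
}

odds_ratio(coffee, cancer)                                   # about 1: no true association
odds_ratio(coffee[recruited == 1], cancer[recruited == 1])   # > 1: induced by the selection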


The lesson
Control selection is a critical element of case-control studies, and even the best among us can make
erroneous choices. Considerable thought needs to go into this critical step in study design. As Rothman
et al. emphasize in their textbook (Modern Epidemiology, 2008), the two important rules for control
selection are:
1. Controls should be selected from the same population - the source population (i.e. study base) - that
gives rise to the study cases. If this rule cannot be followed, there needs to be solid evidence that the
population supplying controls has an exposure distribution identical to that of the population that is the
source of cases, which is a very stringent demand that is rarely demonstrable.
2. Within strata of factors that will be used for stratification in the analysis, controls should be selected
independently of their exposure status, in that the sampling rate for controls should not vary with
exposure.
A more general concern than the issue of control selection in case-control studies is the problem of
selection bias (Hernán et al 2004). Whenever the epidemiologist conditions statistically (e.g. by
stratification, exclusion or adjustment) on a factor affected by exposure and affected by outcome, a
spurious correlation will occur in the study data-set that does not reflect an association in the real world
from which the data were drawn. If there is already a non-null association between exposure and
outcome, it can be shifted upwards or downwards by this form of bias.
Sources and suggested readings*
1. MacMahon B, Yen S, Trichopoulos D, et al. Coffee and cancer of the pancreas. N Engl J Med 1981;304:630-633.
2. Schmeck HM. Critics say coffee study was flawed. New York Times, June 30, 1981.
3. Gordis L. Epidemiology. Saunders, 2008.
4. Feinstein A, et al. Coffee and Pancreatic Cancer. The Problems of Etiologic Science and Epidemiologic Case-Control Research. JAMA 1981;246:957-961.
5. Rothman K, Greenland S, Lash T. Modern epidemiology. Lippincott Williams & Wilkins, 3rd edition, 2008.
6. Coffee and Pancreatic Cancer. An Interview With Brian MacMahon. EpiMonitor, April/May, 1981.
7. Vander Stoep A, Beresford SA, Weiss NS. A didactic device for teaching epidemiology students how to anticipate the effect of a third factor on an exposure-outcome relation. Am J Epidemiol 1999 Jul 15;150(2):221.
8. Hernán MA, Hernández-Díaz S, Robins JM. A structural approach to selection bias. Epidemiology 2004 Sep;15(5):615-25.

Image credit: Epidemiology: July 2004 - Volume 15 - Issue 4 - pp 504-508

*From this readings list, the most relevant papers are enclosed.

CRITICS SAY COFFEE STUDY WAS FLAWED
By HAROLD M. SCHMECK JR.
Published: June 30, 1981
THERE were flaws in a study showing links between coffee drinking
and a common form of cancer, several medical scientists and
physicians said in letters published in the latest issue of The New
England Journal of Medicine.
In March, the journal carried a report showing statistical links
between coffee drinking and cancer of the pancreas, the fourth most
common cause of cancer deaths among Americans.
''This otherwise excellent paper may be flawed in one critical way,'' said a letter from Dr.
Steven Shedlofsky of the Veterans Administration Hospital in White River Junction, Vt.
He questioned the comparison of pancreatic cancer patients with persons hospitalized for
noncancerous diseases of the digestive system.
Such patients, he noted, might be expected to give up coffee drinking because of their
illness. This, he argued, would tilt the proportion of coffee drinkers away from the
''control'' group who were being compared with the cancer patients. Amplifying the letter
in an interview, Dr. Shedlofsky said many patients with digestive diseases give up coffee
because they believe it aggravates their discomfort, and others do so because their
doctors have advised them to.
Dr. Thomas C. Chalmers, president of the Mount Sinai Medical Center and dean of its
medical school, commented that the investigators who questioned patients on their
prehospitalization coffee habits knew in advance which ones had cancer. This could have
introduced unintentional bias in the results, Dr. Chalmers asserted.
Among the comments from other physicians were these: the question of whether
noncancerous illness might have kept the control patients from drinking coffee was
raised; a correspondent pointed out the problem inherent in trying to judge coffee
consumption simply by asking about typical daily consumption before hospitalization;
and another noted the possible role of other health habits that are closely related to coffee
drinking. These habits included cigarette smoking and the use of sugar, milk, cream or
nondairy ''creamers'' with the coffee.
The authors of the original report, led by Dr. Brian MacMahon of the Harvard School of
Public Health, defended their study against all of the comments. They agreed that
concern was ''reasonable'' over the large number of patients in their control group who
had gastrointestinal disorders. But they said the association between coffee drinking and
cancer of the pancreas was present in all the control groups.
The introduction of unintentional bias was unlikely, they said, because the study team
had no hypotheses about coffee when it began the study. Coffee drinking only emerged as
statistically important when most of the data had already been gathered, they said.
Differences Between Sexes
The study showed no difference in risk between men who said they drank only about two
cups of coffee a day and those who drank much more. Among women, however, the risk
seemed to be related to the amount consumed. Some of the physicians who commented
on the study considered the lack of a dose effect in men puzzling and a cause of doubt
concerning the overall implications of the study.
In their original report, Dr. MacMahon and his colleagues treated their evidence
cautiously, asserting that further studies were needed to determine whether coffee
drinking was actually a factor in causing the cancers. If it is a matter of cause and effect,
they said, and if the findings apply to the nation as a whole, coffee drinking might be a
factor in slightly more than half of the roughly 20,000 cases a year of that form of cancer
in the United States.
Coffee industry spokesmen, who were critical of the report when it was published in
March, estimate that more than half of Americans over the age of 10 drink coffee.
Coffee and Pancreatic Cancer
The Problems of Etiologic Science and Epidemiologic Case-Control Research

THE RECENT report that coffee may cause pancreatic cancer1 was presented in a pattern that has become distressingly familiar. The alleged carcinogen is a commonly used product. The report was given widespread publicity before the supporting evidence was available for appraisal by the scientific community, and the public received renewed fear and uncertainty about the cancerous hazards lurking in everyday life.

The research on coffee and pancreatic cancer was done with the case-control technique that has regularly been used in epidemiologic circumstances where the more scientifically desirable forms2 of clinical investigation (a randomized controlled trial or a suitably performed observational cohort study) are either impossible or unfeasible. In case-control studies, the investigators begin at the end, rather than at the beginning, of the cause-effect pathway. The cases are selected from persons in whom the target disease has already developed. The controls are selected from persons in whom that disease has not been noted. The cases and controls are then investigated in a backward temporal direction, with inquiries intended to determine antecedent exposure to agents that may have caused the disease. If the ratio of antecedent exposure to a particular agent is higher in the cases than in the controls, and if the associated mathematical calculations are "statistically significant," the agent is suspected of having caused the disease.
In the recently reported study1 of coffee and pancreatic cancer, the investigators began by assembling records for 578 cases of patients with "histologic diagnoses of cancer of the exocrine pancreas." The investigators next created two "control" groups, having other diagnoses. The cases and controls were then interviewed regarding antecedent exposure to tobacco, alcohol, tea, and coffee. When the data were analyzed for groups demarcated according to gender and quantity of coffee consumption, the calculated relative-risk ratios for pancreatic cancer were the values shown in Table 1.

From these and other features of the statistical analysis, the investigators concluded that "a strong association between coffee consumption and pancreatic cancer was evident in both sexes." The conclusions were presented with the customary caveats about the need for more research and with the customary restraints shown in such expressions as "coffee use might [our italics] account for a substantial proportion" of pancreatic cancers. Nevertheless, the impression was strongly conveyed that coffee had been indicted as a carcinogen.

Although the major public attention has been given to the "Results" and "Discussion" sections of the published report, readers concerned with scientific standards of evidence will want to focus on the "Methods." The rest of this commentary contains a review of pertinent principles of case-control methodology, together with a critique of the way these principles were applied in the coffee-pancreatic cancer study to formulate a hypothesis, assemble the case and control groups, collect the individual data, and interpret the results.
Scientific Hypotheses and 'Fishing Expeditions'

Most case-control studies are done to check the hypothesis that the target disease has been caused by a specified suspected agent, but after the cases and controls are assembled the investigators can also collect data about many other possible etiologic agents. The process of getting and analyzing data for these other agents is sometimes called a "fishing expedition," but the process seems entirely reasonable. If we do not know what causes a disease, we might as well check many different possibilities. On the other hand, when an unsuspected agent yields a positive result, so that the causal hypothesis is generated by the data rather than by the investigator, the results of the fishing expedition require cautious interpretation. Many scientists would not even call the positive association a "hypothesis" until the work has been reproduced in another investigation.
The investigators who found a positive association between coffee consumption and pancreatic cancer have been commendably forthright in acknowledging that they were looking for something else. When the original analyses showed nothing substantial to incriminate the two principal suspects, tobacco and alcohol, the exploration of alternative agents began. The investigators do not state how many additional agents were examined besides tea and coffee, but tea was exonerated in the subsequent analyses, while coffee yielded a positive result. The investigators suggest that this result is consistent with coffee-as-carcinogen evidence that had appeared in a previous case-control study3 of pancreatic cancer. In fact, however, coffee was not indicted in that previous study. The previous investigators found an elevated risk ratio for only decaffeinated coffee, and they drew no conclusion about it, having found elevated risks for several other phenomena that led to the decision that pancreatic cancer had a nonspecific multifactorial etiology. Thus, the new hypothesis that coffee may cause pancreatic cancer not only arises from a "fishing expedition," but also contradicts the results found in previous research.

Table 1. Relative-Risk Ratios According to Gender and Quantity of Coffee Consumption

              Coffee Consumption, Cups per Day
                0      1-2      3-4      >=5
  Men          1.0     2.6      2.3      2.6
  Women        1.0     1.6      3.3      3.1

From the Robert Wood Johnson Clinical Scholars Program, Yale University School of Medicine, New Haven, Conn (Drs Feinstein and Horwitz); the Cooperative Studies Program Support Center, Veterans Administration Hospital, West Haven, Conn (Dr Feinstein); the McGill Cancer Center, McGill University (Dr Spitzer); and the Kellogg Center for Advanced Studies in Primary Care, Montreal General Hospital (Drs Spitzer and Battista), Montreal. Reprint requests to Robert Wood Johnson Clinical Scholar Program, Yale University School of Medicine, 333 Cedar St, Box 3333, New Haven, CT 06510 (Dr Feinstein).
Selection and Retention of Cases and Controls
Because the investigators begin at the end of the causal pathway and must explore it with a reversal of customary scientific logic, the selection of cases and controls is a crucial feature of case-control studies. Both groups are chosen according to judgments made by the investigators. The decisions about the cases are relatively easy. They are commonly picked from a registry or some other listing that will provide the names of persons with the target disease. For the controls, who do not have the target disease, no standard method of selection is available, and they have come from an extraordinarily diverse array of sources. The sources include death certificates, tumor registries, hospitalized patients, patients with specific categories of disease, patients hospitalized on specific clinical services, other patients of the same physicians, random samples of geographically defined communities, people living in "retirement" communities, neighbors of the cases, or personal friends of the cases.

One useful way of making these decisions less arbitrary is to choose cases and controls according to the same principles of eligibility and observation that would be used in a randomized controlled trial of the effects of the alleged etiologic agent. In such a trial, a set of admission criteria would be used for demarcating persons to be included (or excluded) in the group who are randomly assigned to be exposed or non-exposed to the agent. Special methods would then be used to follow the members of the exposed and non-exposed groups thereafter, and to examine them for occurrence of the target disease. Those in whom this disease develops would become the cases, and all other people would be the controls. When cases and controls are chosen for a case-control study, the selection can be made from persons who would have been accepted for admission to such a randomized trial and who have been examined with reasonably similar methods of observation. As a scientific set of guidelines for choosing eligible patients, the randomized-trial principles could also help avoid or reduce many of the different forms of bias that beset case-control studies.
Among these difficulties are several biases to be discussed later, as well as other problems such as clinical susceptibility bias, surveillance bias, detection bias, and "early death" bias, which are beyond the scope of this discussion and have been described elsewhere.4-9

The randomized-trial principles can also help illuminate the problems created and encountered by the investigators in the study of coffee and pancreatic cancer. In a randomized trial, people without pancreatic cancer would be assigned either to drink or not to drink coffee. Anyone with clinical contraindications against coffee drinking or indications for it (whatever they might be) would be regarded as ineligible and not admitted. Everyone who did enter the trial, however, would thereafter be included in the results as the equivalent of either a case, if later found to have pancreatic cancer, or a control. The cases would be "incidence cases," with newly detected pancreatic cancer, whose diagnoses would be verified by a separate panel of histological reviewers. All of the other admitted persons would eventually be classified as unaffected "controls," no matter what ailments they acquired, as long as they did not have pancreatic cancer. If large proportions of the potential cases and controls were lost to follow-up, the investigators would perform detailed analyses to show that the remaining patients resembled those who were lost, thus providing reasonable assurance that the results were free from migration bias.2
In the coffee-pancreatic cancer study, the source of the cases was a list of 578 patients with "histologic diagnoses of cancer of the exocrine pancreas." The histologic material was apparently not obtained and reviewed; and the authors do not indicate whether the patients were newly diagnosed "incidence cases," or "prevalence cases" who had been diagnosed at previous admissions. Regardless of the incidence-prevalence distinction, however, the published data are based on only 369 (64%) of the 578 patients who were identified as potential cases. Most of the "lost" patients were not interviewed, with 98 potential cases being too sick or already dead when the interviewer arrived. The investigators report no data to indicate whether the "lost" cases were otherwise similar to those who were retained.

In choosing the control group, the investigators made several arbitrary decisions about whom to admit or exclude. The source of the controls was "all other patients who were under the care of the same physician in the same hospital at the time of an interview with a patient with pancreatic cancer." From this group, the investigators then excluded anyone with any of the following diagnoses: diseases of the pancreas; diseases of the hepatobiliary tract; cardiovascular disease; diabetes mellitus; respiratory cancer; bladder cancer; or peptic ulcer. Since none of these patients would have been excluded as nonpancreatic-cancer controls if they acquired these diseases after entry into a randomized trial of coffee drinking, their rejection in this case-control study is puzzling. The investigators give no reasons for excluding patients with "diseases of the pancreas or hepatobiliary tract." The reason offered for the other rejections is that the patients had "diseases known to be associated with smoking or alcohol consumption." The pertinence of this stipulation for a study of coffee is not readily apparent.
Since the investigators do not state how many potential controls were eliminated, the proportionate impact of the exclusions cannot be estimated. The remaining list of eligible control patients, however, contained 1,118 people, of whom only a little more than half (644 patients) became the actual control group used for analyses. Most of the "lost" controls were not interviewed because of death, early discharge, severity of illness, refusal to participate, and language problems. Of the 700 interviewed controls, 56 were subsequently excluded because they were non-white, foreign, older than 79 years, or "unreliable." No data are offered to demonstrate that the 644 actual controls were similar to the 474 "eligible" controls who were not included.

The many missing controls and missing interviews could have led to exclusion biases10 whose effects cannot be evaluated in this study. The investigators have also given no attention to the impact of selective hospitalization bias, perceived by Berkson and empirically demonstrated by Roberts et al,6 that can sometimes falsely elevate relative-risk ratios in a hospital population to as high as 17 times their true value in the general population. For example, in a hospitalized population, Roberts et al found a value of 5.0 for the relative-risk ratio of arthritic and rheumatic complaints in relation to laxative usage; but in the general population that contained the hospitalized patients, the true value was 1.5. Whatever may have been the effects of selective hospitalization in the current study (including the possibility of having masked real effects of tobacco and alcohol), the way that the cases and controls were chosen made the study particularly vulnerable to the type of bias described in the next section.
Protopathic Bias in Cases and Controls

"Protopathic" refers to early disease. A protopathic problem occurs if a person's exposure to a suspected etiologic agent is altered because of the early manifestations of a disease, and if the altered (rather than the original) level of exposure is later associated with that disease. By producing changes in a person's life-style or medication, the early manifestations of a disease can create a bias unique to case-control studies.12 In a randomized trial or observational cohort study, the investigator begins with each person's baseline state and follows it to the subsequent outcome. If exposure to a suspected etiologic agent is started, stopped, or altered during this pathway, the investigator can readily determine whether the change in exposure took place before or after occurrence of the outcome. In a case-control study, however, the investigator beginning with an outcome cannot be sure whether it preceded or followed changes in exposure to the suspected agent. If the exposure was altered because the outcome had already occurred and if the timing of this change is not recognized by the investigator, the later level of exposure (or non-exposure) may be erroneously linked to the outcome event.

For example, in circumstances of ordinary medical care, women found to have benign breast disease might be told by their physicians to avoid or stop any form of estrogen therapy. If such women are later included as cases in a case-control study of etiologic factors in benign breast disease, the antecedent exposure to estrogens will have been artifactually reduced in the case group. Oral contraceptives or other forms of estrogen therapy may then be found to exert a fallacious "protection" against the development of benign breast disease.
The problem of protopathic bias will occur in a case-control study if the amount of previous exposure to the suspected etiologic agent was preferentially altered, either upward or downward, because of clinical manifestations that represented early effects of the same disease that later led to the patient's selection as either a case or control. The bias is particularly likely to arise if the preferential decisions about exposure were made in opposite directions in the cases and controls. The coffee-pancreatic cancer study was particularly susceptible to this type of bi-directional bias. The customary intake of coffee may have been increased by members of the pancreatic-cancer case group who were anxious about vague abdominal symptoms that had not yet become diagnosed or even regarded as "illness." Conversely, control patients with such gastrointestinal ailments as regional enteritis or dyspepsia may have been medically advised to stop or reduce their drinking of coffee. With a strict set of admission criteria, none of these patients would be chosen as cases or controls, because the use of the alleged etiologic agent would have been previously altered by the same ailment that led to the patient's selection for the study.

This problem of protopathic bias is a compelling concern in the investigation under review here. Because so many potential control patients were excluded, the remaining control group contained many people with gastrointestinal diseases for which coffee drinking may have been previously reduced or eliminated. Of the 644 controls, 249 (39%) had one of the following diagnoses: cancer of the stomach, bowel, or rectum; colitis, enteritis, or diverticulitis; bowel obstruction, adhesions, or fistula; gastritis; or "other gastroenterologic conditions." If coffee drinking is really unrelated to pancreatic cancer, but if many of these 249 patients had premonitory symptoms that led to a cessation or reduction in coffee drinking "before the current illness was evident," the subsequent distortions could easily produce a false-positive association.
The existence of this type of bias could have been revealed or prevented if the investigators had obtained suitable data. All that was needed during the interview with each case or control patient was to ask about duration of coffee drinking, changes in customary pattern of consumption, and reasons for any changes. Unfortunately, since coffee was not a major etiologic suspect in the research, this additional information was not solicited.

After the available data were analyzed, when the investigators became aware of a possible problem, they tried to minimize its potential importance by asserting that "although the majority of control patients in our series had chronic disease, pancreatic cancer is itself a chronic disease, and in theory it would seem as likely as any other disorder to induce a change in coffee [consumption]." This assertion does not address the point at issue. The bias under discussion arises from changes in exposure status because of the early clinical manifestations of a disease, not from the chronic (or acute) characteristics of the conditions under comparison.
The investigators also claimed that "it is inconceivable that this bias would account for the total difference between cases and controls." The conception is actually quite easy. To make the demonstration clear, let us eliminate gender distinctions and coffee quantification in the investigators' Table 4, which can then be converted into a simple fourfold table (Table 2). In this table, the odds ratio, which estimates the relative-risk ratio, is (347/20)/(555/88) = 2.75, which is the same magnitude as the relative risks cited by the investigators.
Table 2. Status of Study Subjects According to Coffee Consumption

                         Cases    Controls
  Coffee-drinkers          347         555
  Non-coffee-drinkers       20          88
  Total                    367         643

Table 3. Hypothetical* Status of Study Subjects Shown in Table 2

                         Cases    Controls
  Coffee-drinkers          330         573
  Non-coffee-drinkers       37          70
  Total                    367         643

*Based on estimate that 5% of coffee-drinkers in case group were previously non-coffee-drinkers and that 20% of non-coffee-drinkers in control group ceased coffee consumption because of symptoms.

Let us now assume that 5% of the coffee-drinker cases were formerly non-coffee-drinkers. If so, 17 people in the case group would be transferred downward from the coffee drinkers to the nondrinkers. Although 249 members of the control group had gastrointestinal conditions that might have led to a cessation of coffee consumption, let us conservatively estimate that only 20% of the 88 controls listed in the non-coffee-drinkers category were previous coffee-drinkers who had stopped because of symptoms. If so, 18 of the non-coffee-drinking controls should move upward into the coffee-drinking group. With these reclassifications, the adjusted fourfold table would be as presented in Table 3. For this new table, the odds ratio is (330/37)/(573/70) = 1.09, and the entire positive association vanishes.
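Feinstein's arithmetic can be checked directly; the short R sketch below reproduces it. The odds_ratio helper and the explicit rounding (5% of 347 is about 17, 20% of 88 is about 18) are our additions for illustration.

# Fourfold-table odds ratios from Tables 2 and 3.
odds_ratio <- function(a, b, c, d) (a / c) / (b / d)  # a, b = exposed cases, controls; c, d = unexposed

odds_ratio(347, 555, 20, 88)          # Table 2 as reported: 2.75

moved_cases    <- round(0.05 * 347)   # 17 cases reclassified as former nondrinkers
moved_controls <- round(0.20 * 88)    # 18 controls who had stopped because of symptoms
odds_ratio(347 - moved_cases, 555 + moved_controls,
           20 + moved_cases,  88 - moved_controls)   # Table 3: about 1.09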
Acquisition of Basic Data

All of the difficulties just described arise as consequences of basic decisions made in choosing cases and controls. After these decisions are completed, the case-control investigator acquires information about each person's antecedent exposure. This information becomes the basic research data, analogous to the description of each patient's outcome in a randomized controlled trial. The information about exposure should therefore be collected with thorough scientific care, using impeccable criteria to achieve accuracy, and, when necessary, using objective (or "blinded") methods to prevent biased observations.

These scientific requirements are seldom fulfilled in epidemiologic research. The primary data about exposure are verified so infrequently in case-control studies that prominent epidemiologists have begun to make public pleas for improved scientific standards and methods. In the few instances where efforts have been made to confirm recorded data, to repeat interviews at a later date,16 or to check the agreement of data obtained from different sources, the investigators have encountered discrepancies of major magnitude. In one of these studies,17 when the agent of exposure (occupation as a fisherman) was confirmed, the original numbers of exposed people were reduced by 17%. Had these numbers not been corrected, the study would have produced misleading conclusions.
Although
errors of similar
magnitude
could
easily
have
occurred in the
coffee-pancreatic
cancer
investigation,
the
investigators
did not
publish
even a brief text of the actual
questions
used for the
interviews,
and no efforts are
mentioned to check the
quality
of the data that were
obtained
in the
single
interview with
each
patient. Family
members or friends were not asked to confirm the
patients' answers;
the information was not checked
against previous records;
and no
patients
were reinter-
viewed after the
original interrogation
to see whether
subsequent responses agreed
with what
was said
previous
ly. Although
a verification of each
interview is difficult to
achieve in a
large study,
the scientific
quality
of the data
could
have been checked in a selected
sample.
Because of
the
high
likelihood of
the
protopathic
bias
noted
earlier,
the
quality
of the
coffee-drinking
data is a
major problem
in
the
study
under review. The
investiga
tors state that "the
questions
on tea and coffee were
limited to the number of
cups
consumed in a
typical day
before the current illness was evident." This
approach
would not
produce
reliable
data,
since it
does not indicate
what and when was a
"typical day,"
who decided what was
the "time
before the
current illness was
evident,"
or who
determined which of the
patient's symptoms
were the
first
manifestation of
"illness" either for
pancreatic
cancer or
for the
diverse
diseases contained in the control
group.
Although
the
investigators acknowledge
the
possibility
that
"patients
reduced their coffee
consumption
because
of
illness," nothing
was done to check
this
possibility
or to
check the alternative
possibility
that other
patients may
have
increased
their
customary amounts of coffee drinking.
In addition to no
questions
about
changes
in coffee
consumption,
the
patients
were also asked
nothing
about
duration.
Thus,
a
patient
who had started
drinking
four
cups
a
day
in the
past year
would have been classified as
having exactly
the same
exposure
as a
patient
who had
been
drinking
four
cups
a
day
for 30
years.
The Problem of
Multiple
Contrasts
When
multiple
features of two
groups
are tested for
"statistically significant" differences,
one
or more of
those
features
may
seem
"significant" purely by
chance. This
multiple-contrast problem
is
particularly likely
to arise
during
a
"fishing expedition."
In the
customary test of
statistical
significance,
the
investigator
contrasts the
results for a
single
feature in two
groups.
The result of
this
single-feature two-group contrast is declared significant if the P value falls below a selected
boundary,
which
is called the α level. Because α is
commonly
set at .05,
medical literature has become
replete
with statements
that
say
"the results are
statistically significant
at P<.05."
For a
single two-group
contrast at an α level of
.05,
the
investigator
has one chance in 20
(which
can also be
expressed
as
contrary
odds of 19 to
1)
of
finding
a
false-positive
result if the contrasted
groups
are
really
similar.
For the
large
series of features that receive
two-group
contrasts
during
a
"fishing expedition," however,
statistical significance cannot be decided according to the same α
level used
for a
single
contrast. For
example,
in the
coffee-pancreatic
cancer
study,
the cases and controls
were divided for
two-group
contrasts of such individual
exposures (or non-exposures)
as
cigars, pipes, cigarettes,
alcohol, tea,
and coffee.
(If
other
agents
were also
checked,
the results are not
mentioned.)
With at least six such
two-group contrasts,
the random
chance
of
finding
a
single false-positive
association where none
really
exists
is
no
longer
.05. If the characteristics are
mutually independent, the chance is at least .26 [= 1 - (.95)^6]. Consequently,
when six different
agents
are checked in the same
study,
the odds
against finding
a
spurious positive
result are
reduced from 19 to 1 and become less than 3 to 1
[=.74/.26].
To
guard against
such
spurious
conclusions
during
multiple contrasts,
the
customary
statistical
strategy
is to
make
stringent
demands on the size of the P value
required
for
"significance."
Instead of
being
set at the
customary
value of
.05,
the α level is
substantially
lowered.
Statisticians do not
agree
on the most desirable formula
for
determining
this lowered
boundary,
but a
frequent
procedure
is to divide the
customary
α level
by k,
where k
is the number of
comparisons.18 Thus,
in the current
study,
containing
at least six
comparisons,
the decisive level of α would be set at no
higher
than .05/6=.008.
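The two numerical claims in this passage (the chance of at least one spurious "significant" result among six independent contrasts, and the corresponding Bonferroni-type threshold) are easy to verify. A brief R check (our illustration, not part of the original article):

alpha <- 0.05
k     <- 6                                  # number of two-group contrasts examined

fwer <- 1 - (1 - alpha)^k                   # chance of at least one false positive: about .26
(1 - fwer) / fwer                           # odds against a spurious positive: under 3 to 1
alpha / k                                   # Bonferroni-adjusted level: .05/6, about .008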
In the
published report,
the
investigators
make no
comment about this
multiple-contrast problem
and
they
do not seem to have considered it in their
analyses.
In one
of the
results,
a P value is cited as
"<.001,"
but most of the
cogent
data for relative risks are
expressed
in "95%
confidence
intervals,"
which were calculated with α = .05.
Many
of those intervals would become
expanded
to include
the value of
1, thereby losing
"statistical
significance,"
if α were re-set at the
appropriate
level of .008 or lower.
Comment
The
foregoing
discussion has been confined to the main
reasons for
doubting
the
reported
association between
coffee and
pancreatic
cancer. Readers who are interested
in
evaluating
other
features of the
study
can check its
constituent methods
by referring
to the criteria listed in
several
published proposals8-10
of scientific standards for
case-control research.
A
separate problem,
to be mentioned
only
in
passing,
is
the
appropriateness
of
forming
conclusions and extensive
ly diffusing
results from a
study
in which the
hypothesis
develops
as an
analytic surprise
in the data. Scientists and
practitioners
in the
field
of human health face difficult
dilemmas about the risks and benefits of their activities.
The old
principle
of
avoiding
harm whenever
possible
holds true whether a
person
or a
population
is at risk.
Whether to shout "Fire!" in a crowded theater is a
difficult
decision,
even if a fire is
clearly
evident. The risk
of
harm seems
especially likely
if
such shouts are raised
when the evidence of a blaze is inconclusive or
meager.
Aside from
puzzled
medical
practitioners
and a confused
lay public,
another
possible
victim is the
developing
science of chronic disease
epidemiology.
Its
credibility
can
withstand
only
a limited number of false alarms.
Because the
epidemiologic
case-control
study
is a necessary, currently irreplaceable research mechanism in etiologic science, its procedures
and
operating paradigms
need
major improvements
in scientific
quality.
In the evaluation of cause-effect relationships
for
therapeutic agents,
the
experimental
scientific
principles
of a randomized
trial
have sometimes
required huge sample
sizes and
massive
efforts that
have made the trials become an
"indispensable
ordeal."" In the evaluation of cause-effect
relationships
for
etiologic agents,
the
case-control technique
has eliminated the "ordeal" of a randomized
controlled trial
by allowing
smaller
sample sizes,
the
analysis
of natural events and
data,
and a reversed
observational direction. Since the use of scientific
principles
remains
"indispensable," however,
the
development
and
application
of suitable scientific standards in case-
control research is a
prime challenge
in chronic disease
epidemiology today.
The current
methodologic
difficulties
arise because
case-control
investigators, having recognized
that etiologic agents cannot be
assigned
with
experimental
designs,
and
having necessarily
abandoned the randomization
principle
in order to work with
naturally occurring
events and
data,
have also abandoned
many
other scientific
principles
that are
part
of the
experimental
method and
that could be
employed
in observational research. The
verification and
suitably
unbiased
acquisition
of basic raw
data
regarding diagnoses
and
exposures
do not
require
randomized
trials;
and the
patients
admitted to an
observational
study
can be selected in accordance with the
same
eligibility
criteria and the same
subsequent diagnostic
procedures
that would have been used in a randomized
trial.20 These scientific
experimental principles, however,
are still
frequently disregarded
in case-control
research,
despite
the celebrated
warning
of the
distinguished
British statistician,
Sir Austin Bradford Hill.21 In
discussing
the use of observational substitutes for
experimental
trials,
he said that the
investigator
"must have the
experimental approach firmly
in mind" and must work "in
such a
way
as to
fulfill,
as far as
possible, experimental
requirements."
Alvan R. Feinstein, MD
Ralph I. Horwitz, MD
Walter O. Spitzer, MD
Renaldo N. Battista, MD
1. MacMahon
B,
Yen
S, Trichopoulos D,
et al: Coffee and cancer of the
pancreas.
N
Engl
J Med
1981;304:630-633.
2. Feinstein AR: Clinical biostatistics: XLVIII. Efficacy of different
research structures in
preventing
bias in the
analysis
of causation. Clin
Pharmacol Ther
1979;26:129-141.
3. Lin
RS, Kessler II: A multifactorial model for
pancreatic
cancer in
man. JAMA
1981;245:147-152.
4. Berkson J: Limitations of the
application
of four-fold tables to
hospital data. Biometrics Bull
1946;2:47-53.
5. Neyman
J: Statistics: Servant of all sciences. Science
1955;122:401.
6. Roberts
RS, Spitzer WO, Delmore
T,
et al: An
empirical
demonstration of Berkson's bias. J Chronic Dis
1978;31:119-128.
7. Horwitz
RI,
Feinstein AR: Methodologic standards and
contradictory
results in case-control research. Am J Med
1979;66:556-564.
8. Feinstein AR:
Methodologic problems
and standards in case-control
research. J Chronic Dis
1979;32:35-41.
9. Sackett DL: Bias in
analytic
research. J Chronic Dis
1979;32:51-63.
10. Horwitz
RI,
Feinstein
AR,
Stewart KR: Exclusion bias and the false
relationship
of
reserpine/breast cancer, abstracted. Clin Res
1981;29:563.
11. Horwitz
RI,
Feinstein
AR,
Stremlau JR: Alternative data sources
and
discrepant
results in case-control studies of
estrogens
and endometrial
cancer. Am J
Epidemiol 1980;111:389-394.
12. Horwitz
RI,
Feinstein AR: The
problem
of
`protopathic
bias' in
case-control studies. Am J Med
1980;68:255-258.
13. Gordis L:
Assuring the
quality
of
questionnaire
data in
epidemiologic
research. Am J
Epidemiol 1979;109:21-24.
14. Chambers
LW, Spitzer WO,
Hill
GB,
et al:
Underreporting
of cancer
in medical surveys: A source of
systematic
error in cancer research. Am J
Epidemiol 1976;104:141-145.
15. Chambers
LW, Spitzer
WO: A method of
estimating
risk for
occupational
factors
using multiple
data sources: The Newfoundland
lip
cancer
study. Am J Public Health
1977;67:176-179.
16. Klemetti
A,
Saxen L:
Prospective
versus
retrospective approach
in
the search for environmental causes of malformations. Am J Public Health
1967;57:2071-2075.
17. Spitzer WO, Hill
GB,
Chambers
LW,
et al: The
occupation
of
fishing
as a risk factor in cancer of the
lip.
N
Engl
J Med
1975;293:419-424.
18. Brown BW
Jr,
Hollander M: Statistics: A Biomedical Introduction.
New
York,
John
Wiley
& Sons
Inc, 1977, pp 231-234.
19. Fredrickson DS: The field trial: Some
thoughts
on the
indispensable
ordeal. Bull NY Acad Med
1968;44:985-993.
20. Horwitz
RI,
Feinstein AR: A new research
method, suggesting that
anticoagulants
reduce
mortality in
patients
with
myocardial
infarction.
Clin Pharmacol Ther
1980;27:258.
21. Hill AB: Observation and
experiment.
N
Engl
J Med
1953;248:995-1001.
ORIGINAL ARTICLE
A Structural Approach to Selection Bias
Miguel A. Hernán, Sonia Hernández-Díaz, and James M. Robins
Abstract: The term selection bias encompasses various biases in
epidemiology. We describe examples of selection bias in case-
control studies (eg, inappropriate selection of controls) and cohort
studies (eg, informative censoring). We argue that the causal struc-
ture underlying the bias in each example is essentially the same:
conditioning on a common effect of 2 variables, one of which is
either exposure or a cause of exposure and the other is either the
outcome or a cause of the outcome. This structure is shared by other
biases (eg, adjustment for variables affected by prior exposure). A
structural classification of bias distinguishes between biases resulting from conditioning on common effects (selection bias) and those resulting from the existence of common causes of exposure and outcome (confounding). This classification also leads to a unified approach to adjust for selection bias.
(Epidemiology 2004;15:615-625)
Epidemiologists apply the term selection bias to many
biases, including bias resulting from inappropriate selec-
tion of controls in case-control studies, bias resulting from
differential loss-to-follow up, incidence-prevalence bias, vol-
unteer bias, healthy-worker bias, and nonresponse bias.
As discussed in numerous textbooks,1-5 the common
consequence of selection bias is that the association between
exposure and outcome among those selected for analysis
differs from the association among those eligible. In this
article, we consider whether all these seemingly heteroge-
neous types of selection bias share a common underlying
causal structure that justifies classifying them together. We
use causal diagrams to propose a common structure and show
how this structure leads to a unified statistical approach to
adjust for selection bias. We also show that causal diagrams
can be used to differentiate selection bias from what epide-
miologists generally consider confounding.
CAUSAL DIAGRAMS AND ASSOCIATION
Directed acyclic graphs (DAGs) are useful for depicting
causal structure in epidemiologic settings.6-12 In fact, the structure of bias resulting from selection was first described in the DAG literature by Pearl13 and by Spirtes et al.14
A DAG is
composed of variables (nodes), both measured and unmeasured,
and arrows (directed edges). A causal DAG is one in which 1)
the arrows can be interpreted as direct causal effects (as defined
in Appendix A.1), and 2) all common causes of any pair of
variables are included on the graph. Causal DAGs are acyclic
because a variable cannot cause itself, either directly or through
other variables. The causal DAG in Figure 1 represents the
dichotomous variables L (being a smoker), E (carrying matches
in the pocket), and D (diagnosis of lung cancer). The lack of an
arrow between E and D indicates that carrying matches does not
have a causal effect (causative or preventive) on lung cancer, ie,
the risk of D would be the same if one intervened to change the
value of E.
Besides representing causal relations, causal DAGs
also encode the causal determinants of statistical associations.
In fact, the theory of causal DAGs specifies that an associa-
tion between an exposure and an outcome can be produced by
the following 3 causal structures13,14:
1. Cause and effect: If the exposure E causes the outcome D,
or vice versa, then they will in general be associated.
Figure 2 represents a randomized trial in which E (anti-
retroviral treatment) prevents D (AIDS) among HIV-
infected subjects. The (associational) risk ratio ARR_{ED} differs from 1.0, and this association is entirely attribut-
able to the causal effect of E on D.
2. Common causes: If the exposure and the outcome share a
common cause, then they will in general be associated
even if neither is a cause of the other. In Figure 1, the
common cause L (smoking) results in E (carrying
matches) and D (lung cancer) being associated, ie, again,
ARR_{ED} ≠ 1.0.
3. Common effects: An exposure E and an outcome D that
have a common effect C will be conditionally associated if
the association measure is computed within levels of the
common effect C, ie, the stratum-specific ARR_{ED|C} will differ from 1.0, regardless of whether the crude (equivalently, marginal, or unconditional) ARR_{ED} is 1.0. More
generally, a conditional association between E and D will
occur within strata of a common effect C of 2 other
variables, one of which is either exposure or a cause of
exposure and the other is either the outcome or a cause of
the outcome. Note that E and D need not be uncondition-
ally associated simply because they have a common effect.
In the Appendix we describe additional, more complex,
structural causes of statistical associations.
That causal structures (1) and (2) imply a crude asso-
ciation accords with the intuition of most epidemiologists.
We now provide intuition for why structure (3) induces a
conditional association. (For a formal justification, see refer-
ences 13 and 14.) In Figure 3, the genetic haplotype E and
smoking D both cause coronary heart disease C. Nonetheless,
E and D are marginally unassociated (ARR_{ED} = 1.0) because
neither causes the other and they share no common cause. We
now argue heuristically that, in general, they will be condi-
tionally associated within levels of their common effect C.
Suppose that the investigators, who are interested in
estimating the effect of haplotype E on smoking status D,
restricted the study population to subjects with heart disease
(C = 1). The square around C in Figure 3 indicates that they
are conditioning on a particular value of C. Knowing that a
subject with heart disease lacks haplotype E provides some
information about her smoking status because, in the absence
of E, it is more likely that another cause of C such as D is
present. That is, among people with heart disease, the pro-
portion of smokers is increased among those without the
haplotype E. Therefore, E and D are inversely associated
conditionally on C 1, and the conditional risk ratio
ARR
EDC1
is less than 1.0. In the extreme, if E and D were
the only causes of C, then among people with heart disease,
the absence of one of them would perfectly predict the
presence of the other.
As another example, the DAG in Figure 4 adds to the
DAG in Figure 3 a diuretic medication M whose use is a
consequence of a diagnosis of heart disease. E and D are also
associated within levels of M because M is a common effect
of E and D.
There is another possible source of association between
2 variables that we have not discussed yet. As a result of
sampling variability, 2 variables could be associated by
chance even in the absence of structures (1), (2), or (3).
Chance is not a structural source of association because
chance associations become smaller with increased sample
size. In contrast, structural associations remain unchanged.
To focus our discussion on structural rather than chance
associations, we assume we have recorded data in every
subject in a very large (perhaps hypothetical) population of
interest. We also assume that all variables are perfectly
measured.
A CLASSIFICATION OF BIASES ACCORDING TO
THEIR STRUCTURE
We will say that bias is present when the association
between exposure and outcome is not in its entirety the result
of the causal effect of exposure on outcome, or more pre-
cisely when the causal risk ratio (CRR_{ED}), defined in Appendix A.1, differs from the associational risk ratio (ARR_{ED}). In
an ideal randomized trial (ie, no confounding, full adherence
to treatment, perfect blinding, no losses to follow up) such as
the one represented in Figure 2, there is no bias and the
association measure equals the causal effect measure.
Because nonchance associations are generated by struc-
tures (1), (2), and (3), it follows that biases could be classified
on the basis of these structures:
1. Cause and effect could create bias as a result of reverse
causation. For example, in many case-control studies, the
outcome precedes the exposure measurement. Thus, the
association of the outcome with measured exposure could
in part reflect bias attributable to the outcome's effect on measured exposure.7,8
Examples of reverse causation bias
include not only recall bias in case-control studies, but
also more general forms of information bias like, for
example, when a blood parameter affected by the presence
of cancer is measured after the cancer is present.
2. Common causes: In general, when the exposure and out-
come share a common cause, the association measure
differs from the effect measure. Epidemiologists tend to use the term confounding to refer to this bias.
[FIGURE 1. Common cause L of exposure E and outcome D. FIGURE 2. Causal effect of exposure E on outcome D. FIGURE 3. Conditioning on a common effect C of exposure E and outcome D. FIGURE 4. Conditioning on a common effect M of exposure E and outcome D.]
3. Conditioning on common effects: We propose that this
structure is the source of those biases that epidemiologists
refer to as selection bias. We argue by way of example.
EXAMPLES OF SELECTION BIAS
Inappropriate Selection of Controls in a
Case-Control Study
Figure 5 represents a case-control study of the effect of
postmenopausal estrogens (E) on the risk of myocardial
infarction (D). The variable C indicates whether a woman in
the population cohort is selected for the case-control study
(yes 1, no 0). The arrow from disease status D to
selection C indicates that cases in the cohort are more likely
to be selected than noncases, which is the defining feature of
a case-control study. In this particular case-control study,
investigators selected controls preferentially among women
with a hip fracture (F), which is represented by an arrow from
F to C. There is an arrow from E to F to represent the
protective effect of estrogens on hip fracture. Note Figure 5 is
essentially the same as Figure 3, except we have now elab-
orated the causal pathway from E to C.
In a case-control study, the associational exposure-disease odds ratio (AOR_{ED|C=1}) is by definition conditional on having been selected into the study (C = 1). If subjects
with hip fracture F are oversampled as controls, then the
probability of control selection depends on a consequence F
of the exposure (as represented by the path from E to C
through F) and inappropriate control selection bias will
occur (eg, AOR_{ED|C=1} will differ from 1.0, even when, like in Figure 5, the exposure has no effect on the disease). This
bias arises because we are conditioning on a common effect
C of exposure and disease. A heuristic explanation of this
bias follows. Among subjects selected for the study, controls
are more likely than cases to have had a hip fracture. There-
fore, because estrogens lower the incidence of hip fractures,
a control is less likely to be on estrogens than a case, and
hence AOR_{ED|C=1} is greater than 1.0, even though the exposure does not cause the outcome. Identical reasoning would explain that the expected AOR_{ED|C=1} would be greater than the causal OR_{ED} even had the causal OR_{ED} differed from 1.0.
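A rough simulation of this case-control example is sketched below in R (made-up parameter values, not from the article): E has no effect on D but protects against hip fracture, and controls with a fracture are oversampled, so the odds ratio among the selected comes out above 1.0.

set.seed(2)
n   <- 1e6
E   <- rbinom(n, 1, 0.3)                            # postmenopausal estrogens
D   <- rbinom(n, 1, 0.02)                           # myocardial infarction, NOT affected by E
hip <- rbinom(n, 1, ifelse(E == 1, 0.02, 0.06))     # hip fracture: E is protective

cases    <- which(D == 1)                           # all cases enter the study
noncases <- which(D == 0)
p.sel    <- ifelse(hip[noncases] == 1, 0.20, 0.01)  # controls oversampled among hip fractures
controls <- noncases[rbinom(length(noncases), 1, p.sel) == 1]

case.exp  <- sum(E[cases]);    case.unexp <- sum(1 - E[cases])
ctrl.exp  <- sum(E[controls]); ctrl.unexp <- sum(1 - E[controls])
(case.exp / case.unexp) / (ctrl.exp / ctrl.unexp)   # odds ratio among the selected: above 1.0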
Berkson's Bias
Berkson15 pointed out that 2 diseases (E and D) that are
unassociated in the population could be associated among
hospitalized patients when both diseases affect the probability
of hospital admission. By taking C in Figure 3 to be the
indicator variable for hospitalization, we recognize that Berk-
son's bias comes from conditioning on the common effect C
of diseases E and D. As a consequence, in a case-control
study in which the cases were hospitalized patients with
disease D and controls were hospitalized patients with disease
E, an exposure R that causes disease E would appear to be a
risk factor for disease D (ie, Fig. 3 is modified by adding
factor R and an arrow from R to E). That is, AOR_{RD|C=1} would differ from 1.0 even if R does not cause D.
Differential Loss to Follow Up in Longitudinal
Studies
[FIGURE 5. Selection bias in a case-control study. See text for details. FIGURE 6. Selection bias in a cohort study. See text for details.]
Figure 6a represents a follow-up study of the effect of antiretroviral therapy (E) on AIDS (D) risk among HIV-infected patients. The greater the true level of immunosup-
pression (U), the greater the risk of AIDS. U is unmeasured.
If a patient drops out from the study, his AIDS status cannot
be assessed and we say that he is censored (C = 1). Patients
with greater values of U are more likely to be lost to follow
up because the severity of their disease prevents them from
attending future study visits. The effect of U on censoring is
mediated by presence of symptoms (fever, weight loss, diar-
rhea, and so on), CD4 count, and viral load in plasma, all
summarized in the (vector) variable L, which could or could
not be measured. The role of L, when measured, in data
analysis is discussed in the next section; in this section, we
take L to be unmeasured. Patients receiving treatment are at
a greater risk of experiencing side effects, which could lead
them to dropout, as represented by the arrow from E to C. For
simplicity, assume that treatment E does not cause D and so
there is no arrow from E to D (CRR_{ED} = 1.0). The square around C indicates that the analysis is restricted to those patients who did not drop out (C = 0). The associational risk (or rate) ratio ARR_{ED|C=0} differs from 1.0. This differential
loss to follow-up bias is an example of bias resulting from
structure (3) because it arises from conditioning on the
censoring variable C, which is a common effect of exposure
E and a cause U of the outcome.
An intuitive explanation of the bias follows. If a treated
subject with treatment-induced side effects (and thereby at a
greater risk of dropping out) did in fact not drop out (C = 0),
then it is generally less likely that a second cause of dropping
out (eg, a large value of U) was present. Therefore, an inverse
association between E and U would be expected. However, U
is positively associated with the outcome D. Therefore, re-
stricting the analysis to subjects who did not drop out of this
study induces an inverse association (mediated by U) between
exposure and outcome, ie, ARR_{ED|C=0} is not equal to 1.0.
Figure 6a is a simple transformation of Figure 3 that
also represents bias resulting from structure (3): the associa-
tion between D and C resulting from a direct effect of D on
C in Figure 3 is now the result of U, a common cause of D
and C. We now present 3 additional structures (Figs. 6b-d),
which could lead to selection bias by differential loss to
follow up.
Figure 6b is a variation of Figure 6a. If prior treatment
has a direct effect on symptoms, then restricting the study to
the uncensored individuals again implies conditioning on the
common effect C of the exposure and U thereby introducing
a spurious association between treatment and outcome. Fig-
ures 6a and 6b could depict either an observational study or an
experiment in which treatment E is randomly assigned, because
there are no common causes of E and any other variable. Thus,
our results demonstrate that randomized trials are not free of
selection bias as a result of differential loss to follow up because
such selection occurs after the randomization.
Figures 6c and d are variations of Figures 6a and b,
respectively, in which there is a common cause U* of E and
another measured variable. U* indicates unmeasured life-
style/personality/educational variables that determine both
treatment (through the arrow from U* to E) and either
attitudes toward attending study visits (through the arrow
from U* to C in Fig. 6c) or threshold for reporting symptoms
(through the arrow from U* to L in Fig. 6d). Again, these 2
are examples of bias resulting from structure (3) because the
bias arises from conditioning on the common effect C of both
a cause U* of E and a cause U of D. This particular bias has
been referred to as M bias.12 The bias caused by differential loss to follow up in Figures 6a-d is also referred to as bias
due to informative censoring.
Nonresponse Bias/Missing Data Bias
The variable C in Figures 6a-d can represent missing
data on the outcome for any reason, not just as a result of loss
to follow up. For example, subjects could have missing data
because they are reluctant to provide information or because
they miss study visits. Regardless of the reasons why data on
D are missing, standard analyses restricted to subjects with
complete data (C = 0) will be biased.
Volunteer Bias/Self-selection Bias
Figures 6a-d can also represent a study in which C is agreement to participate (yes = 1, no = 0), E is cigarette
smoking, D is coronary heart disease, U is family history of
heart disease, and U* is healthy lifestyle. (L is any mediator
between U and C such as heart disease awareness.) Under any
of these structures, there would be no bias if the study
population was a representative (ie, random) sample of the
target population. However, bias will be present if the study
is restricted to those who volunteered or elected to participate
(C = 1). Volunteer bias cannot occur in a randomized study
in which subjects are randomized (ie, exposed) only after
agreeing to participate, because none of Figures 6a-d can
represent such a trial. Figures 6a and b are eliminated because
exposure cannot cause C. Figures 6c and d are eliminated
because, as a result of the random exposure assignment, there
cannot exist a common cause of exposure and any other
variable.
Healthy Worker Bias
Figures 6a-d can also describe a bias that could arise
when estimating the effect of a chemical E (an occupational
exposure) on mortality D in a cohort of factory workers. The
underlying unmeasured true health status U is a determinant
of both death (D) and of being at work (C). The study is
restricted to individuals who are at work (C = 1) at the time
of outcome ascertainment. (L could be the result of blood
tests and a physical examination.) Being exposed to the
chemical is a predictor of being at work in the near future,
either directly (eg, exposure can cause disabling asthma), like
Hernan et al Epidemiology Volume 15, Number 5, September 2004
2004 Lippincott Williams & Wilkins 618
in Figures 6a and b, or through a common cause U* (eg,
certain exposed jobs are eliminated for economic reasons and
the workers laid off) like in Figures 6c and d.
This healthy worker bias is an example of bias
resulting from structure (3) because it arises from condition-
ing on the censoring variable C, which is a common effect of
(a cause of) exposure and (a cause of) the outcome. However,
the term healthy worker bias is also used to describe the
bias that occurs when comparing the risk in certain group of
workers with that in a group of subjects from the general
population. This second bias can be depicted by the DAG in
Figure 1 in which L represents health status, E represents
membership in the group of workers, and D represents the
outcome of interest. There are arrows from L to E and D
because being healthy affects job type and risk of subsequent
outcome, respectively. In this case, the bias is caused by
structure (1) and would therefore generally be considered to
be the result of confounding.
These examples lead us to propose that the term selec-
tion bias in causal inference settings be used to refer to any
bias that arises from conditioning on a common effect as in
Figure 3 or its variations (Figs. 4-6).
In addition to the examples given here, DAGs have
been used to characterize various other selection biases. For
example, Robins7 explained how certain attempts to elimi-
nate ascertainment bias in studies of estrogens and endome-
trial cancer could themselves induce bias
16
; Hernan et al.
8
discussed incidenceprevalence bias in case-control studies
of birth defects; and Cole and Hernan
9
discussed the bias that
could be introduced by standard methods to estimate direct
effects.
17,18
In Appendix A.2, we provide a nal example: the
bias that results from the use of the hazard ratio as an effect
measure. We deferred this example to the appendix because
of its greater technical complexity. (Note that standard DAGs
do not represent effect modification or interactions be-
tween variables, but this does not affect their ability to
represent the causal structures that produce bias, as more
fully explained in Appendix A.3).
To demonstrate the generality of our approach to se-
lection bias, we now show that a bias that arises in longitu-
dinal studies with time-varying exposures19 can also be
understood as a form of selection bias.
Adjustment for Variables Affected by Previous
Exposure (or its causes)
Consider a follow-up study of the effect of antiretroviral therapy (E) on viral load at the end of follow up (D = 1 if detectable, D = 0 otherwise) in HIV-infected subjects. The greater a subject's unmeasured true immunosuppression level (U), the greater her viral load D and the lower the CD4 count L (low = 1, high = 0). Treatment increases CD4 count, and the presence of low CD4 count (a proxy for the true level of immunosuppression) increases the probability of receiving treatment. We assume that, in truth but unknown to the data analyst, treatment has no causal effect on the outcome D. The DAGs in Figures 7a and b represent the first 2 time points of the study. At time 1, treatment E_1 is decided after observing the subject's risk factor profile L_1. (E_0 could be decided after observing L_0, but the inclusion of L_0 in the DAG would not essentially alter our main point.) Let E be the sum of E_0 and E_1. The cumulative exposure variable E can therefore take 3 values: 0 (if the subject is not treated at any time), 1 (if the subject is treated at time one only or at time 2 only), and 2 (if the subject is treated at both times). Suppose the analyst's interest lies in comparing the risk had all subjects been always treated (E = 2) with that had all subjects never been treated (E = 0), and that the causal risk ratio is 1.0 (CRR_{ED} = 1, when comparing E = 2 vs. E = 0).
To estimate the effect of E without bias, the analyst needs to be able to estimate the effect of each of its components E_0 and E_1 simultaneously and without bias.17 As we will see, this is not possible using standard methods, even when data on L_1 are available, because lack of adjustment for L_1 precludes unbiased estimation of the causal effect of E_1 whereas adjustment for L_1 by stratification (or, equivalently, by conditioning, matching, or regression adjustment) precludes unbiased estimation of the causal effect of E_0.
[FIGURE 7. Adjustment for a variable affected by previous exposure.]
Unlike previous structures, Figures 7a and 7b contain a common cause of the (component E_1 of) exposure E and the outcome D, so one needs to adjust for L_1 to eliminate
confounding. The standard approach to confounder control is stratification: the associational risk ratio is computed in each level of the variable L_1. The square around the node L_1 denotes that the associational risk ratios (ARR_{ED|L_1=0} and ARR_{ED|L_1=1}) are conditional on L_1. Examples of stratification-based methods are a Mantel-Haenszel stratified analysis or regression models (linear, logistic, Poisson, Cox, and so on) that include the covariate L_1. (Not including interaction terms between L_1 and the exposure in a regression model is equivalent to assuming homogeneity of ARR_{ED|L_1=0} and ARR_{ED|L_1=1}.) To calculate ARR_{ED|L_1=l}, the data analyst has to select (ie, condition on) the subset of the population with value L_1 = l. However, in this example, the process of choosing this subset results in selection on a variable L_1 affected by (a component E_0 of) exposure E and thus can result in bias as we now describe.
Although stratification is commonly used to adjust for confounding, it can have unintended effects when the association measure is computed within levels of L_1 and in addition L_1 is caused by or shares causes with a component E_0 of E. Among those with low CD4 count (L_1 = 1), being on treatment (E_0 = 1) makes it more likely that the person is severely immunodepressed; among those with a high level of CD4 (L_1 = 0), being off treatment (E_0 = 0) makes it more likely that the person is not severely immunodepressed. Thus, the side effect of stratification is to induce an association between prior exposure E_0 and U, and therefore between E_0 and the outcome D. Stratification eliminates confounding for E_1 at the cost of introducing selection bias for E_0. The net bias for any particular summary of the time-varying exposure that is used in the analysis (cumulative exposure, average exposure, and so on) depends on the relative magnitude of the confounding that is eliminated and the selection bias that is created. In summary, the associational (conditional) risk ratio ARR_{ED|L_1} could be different from 1.0 even if the exposure history has no effect on the outcome of any subjects.
Conditioning on confounders L_1 which are affected by previous exposure can create selection bias even if the confounder is not on a causal pathway between exposure and outcome. In fact, no such causal pathway exists in Figures 7a and 7b. On the other hand, in Figure 7c the confounder L_1 for subsequent exposure E_1 lies on a causal pathway from earlier exposure E_0 to an outcome D. Nonetheless, conditioning on L_1 still results in selection bias. Were the potential for selection bias not present in Figure 7c (e.g., were U not a common cause of L_1 and D), the association of cumulative exposure E with the outcome D within strata of L_1 could be an unbiased estimate of the direct effect18 of E not through L_1 but still would not be an unbiased estimate of the overall effect of E on D, because the effect of E_0 mediated through L_1 is not included.
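A small simulation in the spirit of Figure 7a illustrates this trade-off (an R sketch with made-up parameter values; it is not the authors' analysis). Here E_0 is randomized and has no effect on D, D depends only on U, and L_1 is affected by both E_0 and U; the crude E_0-D risk ratio is about 1.0, but the stratum-specific ratios within levels of L_1 are not.

set.seed(3)
n  <- 1e6
U  <- rbinom(n, 1, 0.5)                            # unmeasured immunosuppression
E0 <- rbinom(n, 1, 0.5)                            # first treatment (randomized here)
L1 <- rbinom(n, 1, plogis(-1 + 2.5 * U - 2 * E0))  # low CD4: raised by U, lowered by E0
E1 <- rbinom(n, 1, plogis(-1 + 1.5 * L1))          # later treatment depends on L1
D  <- rbinom(n, 1, plogis(-2 + 3 * U))             # outcome depends on U only

risk.ratio <- function(outcome, exposure)
  mean(outcome[exposure == 1]) / mean(outcome[exposure == 0])

risk.ratio(D, E0)                                  # crude: approximately 1.0
risk.ratio(D[L1 == 1], E0[L1 == 1])                # within L1 = 1: above 1.0 (selection bias)
risk.ratio(D[L1 == 0], E0[L1 == 0])                # within L1 = 0: also above 1.0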
ADJUSTING FOR SELECTION BIAS
Selection bias can sometimes be avoided by an ade-
quate design such as by sampling controls in a manner to
ensure that they will represent the exposure distribution in the
population. Other times, selection bias can be avoided by
appropriately adjusting for confounding by using alternatives
to stratification-based methods (see subsequently) in the pres-
ence of time-dependent confounders affected by previous
exposure.
However, appropriate design and confounding adjust-
ment cannot immunize studies against selection bias. For ex-
ample, loss to follow up, self-selection, and, in general, missing
data leading to bias can occur no matter how careful the
investigator. In those cases, the selection bias needs to be
explicitly corrected in the analysis, when possible.
Selection bias correction, as we briefly describe, could sometimes be accomplished by a generalization of inverse probability weighting20-23 estimators for longitudinal studies.
Consider again Figures 6a-d and assume that L is measured.
Inverse probability weighting is based on assigning a weight
to each selected subject so that she accounts in the analysis
not only for herself, but also for those with similar charac-
teristics (ie, those with the same values of L and E) who were
not selected. The weight is the inverse of the probability of
her selection. For example, if there are 4 untreated women,
age 40-45 years, with CD4 count < 500, in our cohort study,
and 3 of them are lost to follow up, then these 3 subjects do
not contribute to the analysis (ie, they receive a zero weight),
whereas the remaining woman receives a weight of 4. In
other words, the (estimated) conditional probability of re-
maining uncensored is 1/4 = 0.25, and therefore the (estimated) weight for the uncensored subject is 1/0.25 = 4.
Inverse probability weighting creates a pseudopopulation in
which the 4 subjects of the original population are replaced
by 4 copies of the uncensored subject.
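The weight calculation in this example can be written in a couple of lines of R (a sketch with hypothetical variable names; the general pattern with a fitted censoring model is only indicated in the comments).

# The 4-women stratum described above: 3 are censored, 1 remains
uncensored <- c(0, 0, 0, 1)                 # 1 = still in the study at outcome ascertainment
p.uncens   <- mean(uncensored)              # estimated Pr(uncensored | L, E) in this stratum = 0.25
ifelse(uncensored == 1, 1 / p.uncens, 0)    # weights 0 0 0 4: one woman stands in for all four

# More generally, with a data frame d containing censoring indicator C, exposure E, and covariates L:
#   fit <- glm(I(C == 0) ~ E + L, family = binomial, data = d)
#   d$w <- ifelse(d$C == 0, 1 / predict(fit, type = "response"), 0)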
The effect measure based on the pseudopopulation, in
contrast to that based on the original population, is unaffected
by selection bias provided that the outcome in the uncensored
subjects truly represents the unobserved outcomes of the
censored subjects (with the same values of E and L). This
provision will be satisfied if the probability of selection (the
denominator of the weight) is calculated conditional on E and
on all additional factors that independently predict both
selection and the outcome. Unfortunately, one can never be
sure that these additional factors were identified and recorded
in L, and thus the causal interpretation of the resulting
adjustment for selection bias depends on this untestable
assumption.
One might attempt to remove selection bias by stratification (ie, by estimating the effect measure conditional on
the L variables) rather than by weighting. Stratification could
yield unbiased conditional effect measures within levels of L
Hernan et al Epidemiology Volume 15, Number 5, September 2004
2004 Lippincott Williams & Wilkins 620
under the assumptions that all relevant L variables were
measured and that the exposure does not cause or share a
common cause with any variable in L. Thus, stratification
would work (ie, it would provide an unbiased conditional
effect measure) under the causal structures depicted in Fig-
ures 6a and c, but not under those in Figures 6b and d. Inverse
probability weighting appropriately adjusts for selection bias
under all these situations because this approach is not based
on estimating effect measures conditional on the covariates L,
but rather on estimating unconditional effect measures after
reweighting the subjects according to their exposure and their
values of L.
Inverse probability weighting can also be used to adjust for the confounding of later exposure E_1 by L_1, even when exposure E_0 either causes L_1 or shares a common cause with L_1 (Figs. 7a-7c), a situation in which stratification fails. When using inverse probability weighting to adjust for confounding, we model the probability of exposure or treatment given past exposure and past L so that the denominator of a subject's weight is, informally, the subject's conditional probability of receiving her treatment history. We therefore refer to this method as inverse-probability-of-treatment weighting.22
One limitation of inverse probability weighting is that
all conditional probabilities (of receiving certain treatment or
censoring history) must be different from zero. This would
not be true, for example, in occupational studies in which the
probability of being exposed to a chemical is zero for those
not working. In these cases, g-estimation19 rather than inverse
probability weighting can often be used to adjust for selection
bias and confounding.
The use of inverse probability weighting can provide
unbiased estimates of causal effects even in the presence of
selection bias because the method works by creating a pseu-
dopopulation in which censoring (or missing data) has been
abolished and in which the effect of the exposure is the same
as in the original population. Thus, the pseudopopulation
effect measure is equal to the effect measure had nobody been
censored. For example, Figure 8 represents the pseudopopula-
tion corresponding to the population of Figure 6a when the
weights were estimated conditional on L and E. The censor-
ing node is now lower-case because it does not correspond to
a random variable but to a constant (everybody is uncensored
in the pseudopopulation). This interpretation is desirable
when censoring is the result of loss to follow up or nonre-
sponse, but questionably helpful when censoring is the result
of competing risks. For example, in a study aimed at estimat-
ing the effect of certain exposure on the risk of Alzheimer's
disease, we might not wish to base our effect estimates on a
pseudopopulation in which all other causes of death (cancer,
heart disease, stroke, and so on) have been removed, because
it is unclear even conceptually what sort of medical interven-
tion would produce such a population. Another more prag-
matic reason is that no feasible intervention could possibly
remove just one cause of death without affecting the others as
well.24
DISCUSSION
The terms confounding and selection bias are used
in multiple ways. For instance, the same phenomenon is some-
times named confounding by indication by epidemiologists
and selection bias by statisticians/econometricians. Others use
the term selection bias when confounders are unmeasured.
Sometimes the distinction between confounding and selection
bias is blurred in the term selection confounding.
We elected to refer to the presence of common causes
as confounding and to refer to conditioning on common
effects as selection bias. This structural definition provides
a clearcut classification of confounding and selection bias,
even though it might not coincide perfectly with the tradi-
tional, often discipline-specific, terminologies. Our goal,
however, was not to be normative about terminology, but
rather to emphasize that, regardless of the particular terms
chosen, there are 2 distinct causal structures that lead to these
biases. The magnitude of both biases depends on the strength
of the causal arrows involved.12,25 (When 2 or more common
effects have been conditioned on, an even more general
formulation of selection bias is useful. For a brief discussion,
see Appendix A.4.)
The end result of both structures is the same: noncom-
parability (also referred to as lack of exchangeability) be-
tween the exposed and the unexposed. For example, consider
a cohort study restricted to firefighters that aims to estimate
the effect of being physically active (E) on the risk of heart
disease (D) (as represented in Fig. 9). For simplicity, we have
assumed that, although unknown to the data analyst, E does
not cause D. Parental socioeconomic status (L) affects the
risk of becoming a firefighter (C) and, through childhood diet, of heart disease (D). Attraction toward activities that involve physical activity (an unmeasured variable U) affects the risk of becoming a firefighter and of being physically active (E). U does not affect D, and L does not affect E. According to our terminology, there is no confounding because there are no common causes of E and D. Thus, if our study population had been a random sample of the target population, the crude associational risk ratio ARR_{ED} would have been equal to the causal risk ratio CRR_{ED} of 1.0.
[FIGURE 8. Causal diagram in the pseudopopulation created by inverse probability weighting. FIGURE 9. The firefighters study.]
However, in a study restricted to firefighters, the crude ARR_{ED} and CRR_{ED} would differ because conditioning on a common effect C of causes of exposure and outcome induces selection bias resulting in noncomparability of the exposed and unexposed firefighters. To the
study investigators, the distinction between confounding
and selection bias is moot because, regardless of nomen-
clature, they must stratify on L to make the exposed and
the unexposed firefighters comparable. This example dem-
onstrates that a structural classication of bias does not
always have consequences for either the analysis or inter-
pretation of a study. Indeed, for this reason, many epide-
miologists use the term confounder for any variable L on
which one has to stratify to create comparability, regard-
less of whether the (crude) noncomparability was the result
of conditioning on a common effect or the result of a
common cause of exposure and disease.
There are, however, advantages of adopting a structural
or causal approach to the classication of biases. First, the
structure of the problem frequently guides the choice of
analytical methods to reduce or avoid the bias. For example,
in longitudinal studies with time-dependent confounding,
identifying the structure allows us to detect situations in
which stratication-based methods would adjust for con-
founding at the expense of introducing selection bias. In those
cases, inverse probability weighting or g-estimation are better
alternatives. Second, even when understanding the structure
of bias does not have implications for data analysis (like in
the reghters study), it could still help study design. For
example, investigators running a study restricted to firefighters should make sure that they collect information on joint risk factors for the outcome and for becoming a firefighter.
Third, selection bias resulting from conditioning on preexpo-
sure variables (eg, being a firefighter) could explain why
certain variables behave as confounders in some studies but
not others. In our example, parental socioeconomic status
would not necessarily need to be adjusted for in studies not
restricted to firefighters. Finally, causal diagrams enhance
communication among investigators because they can be
used to provide a rigorous, formal definition of terms such as
selection bias.
ACKNOWLEDGMENTS
We thank Stephen Cole and Sander Greenland for their helpful
comments.
REFERENCES
1. Rothman KJ, Greenland S. Modern Epidemiology, 2nd ed. Philadelphia: Lippincott-Raven; 1998.
2. Szklo M, Nieto FJ. Epidemiology: Beyond the Basics. Gaithersburg, MD: Aspen; 2000.
3. MacMahon B, Trichopoulos D. Epidemiology: Principles & Methods, 2nd ed. Boston: Little, Brown and Co; 1996.
4. Hennekens CH, Buring JE. Epidemiology in Medicine. Boston: Little, Brown and Co; 1987.
5. Gordis L. Epidemiology. Philadelphia: WB Saunders Co; 1996.
6. Greenland S, Pearl J, Robins JM. Causal diagrams for epidemiologic research. Epidemiology. 1999;10:37-48.
7. Robins JM. Data, design, and background knowledge in etiologic inference. Epidemiology. 2001;11:313-320.
8. Hernán MA, Hernández-Díaz S, Werler MM, et al. Causal knowledge as a prerequisite for confounding evaluation: an application to birth defects epidemiology. Am J Epidemiol. 2002;155:176-184.
9. Cole SR, Hernán MA. Fallibility in the estimation of direct effects. Int J Epidemiol. 2002;31:163-165.
10. Maclure M, Schneeweiss S. Causation of bias: the episcope. Epidemiology. 2001;12:114-122.
11. Greenland S, Brumback BA. An overview of relations among causal modeling methods. Int J Epidemiol. 2002;31:1030-1037.
12. Greenland S. Quantifying biases in causal models: classical confounding versus collider-stratification bias. Epidemiology. 2003;14:300-306.
13. Pearl J. Causal diagrams for empirical research. Biometrika. 1995;82:669-710.
14. Spirtes P, Glymour C, Scheines R. Causation, Prediction, and Search. Lecture Notes in Statistics 81. New York: Springer-Verlag; 1993.
15. Berkson J. Limitations of the application of fourfold table analysis to hospital data. Biometrics. 1946;2:47-53.
16. Greenland S, Neutra RR. An analysis of detection bias and proposed corrections in the study of estrogens and endometrial cancer. J Chronic Dis. 1981;34:433-438.
17. Robins JM. A new approach to causal inference in mortality studies with a sustained exposure period - application to the healthy worker survivor effect [published errata appear in Mathematical Modelling. 1987;14:917-921]. Mathematical Modelling. 1986;7:1393-1512.
18. Robins JM, Greenland S. Identifiability and exchangeability for direct and indirect effects. Epidemiology. 1992;3:143-155.
19. Robins JM. Causal inference from complex longitudinal data. In: Berkane M, ed. Latent Variable Modeling and Applications to Causality. Lecture Notes in Statistics 120. New York: Springer-Verlag; 1997:69-117.
20. Horvitz DG, Thompson DJ. A generalization of sampling without replacement from a finite universe. J Am Stat Assoc. 1952;47:663-685.
21. Robins JM, Finkelstein DM. Correcting for noncompliance and dependent censoring in an AIDS clinical trial with inverse probability of censoring weighted (IPCW) log-rank tests. Biometrics. 2000;56:779-788.
22. Hernán MA, Brumback B, Robins JM. Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. Epidemiology. 2000;11:561-570.
23. Robins JM, Hernán MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000;11:550-560.
24. Greenland S. Causality theory for policy uses of epidemiologic measures. In: Murray CJL, Salomon JA, Mathers CD, et al., eds. Summary Measures of Population Health. Cambridge, MA: Harvard University Press/WHO; 2002.
25. Walker AM. Observation and Inference: An Introduction to the Methods of Epidemiology. Newton Lower Falls: Epidemiology Resources Inc; 1991.
26. Greenland S. Absence of confounding does not correspond to collapsibility of the rate ratio or rate difference. Epidemiology. 1996;7:498-501.
APPENDIX
A.1. Causal and Associational Risk Ratio
For a given subject, E has a causal effect on D if the subject's value of D had she been exposed differs from the value of D had she remained unexposed. Formally, letting D_{i,e=1} and D_{i,e=0} be subject i's (counterfactual or potential) outcomes when exposed and unexposed, respectively, we say there is a causal effect for subject i if D_{i,e=1} ≠ D_{i,e=0}. Only one of the counterfactual outcomes can be observed for each subject (the one corresponding to his observed exposure), ie, D_{i,e} = D_i if E_i = e, where D_i and E_i represent subject i's observed outcome and exposure. For a population, we say that there is no average causal effect (preventive or causative) of E on D if the average of D would remain unchanged whether the whole population had been treated or untreated, ie, when Pr(D_{e=1} = 1) = Pr(D_{e=0} = 1) for a dichotomous D. Equivalently, we say that E does not have a causal effect on D if the causal risk ratio is one, ie, CRR_{ED} = Pr(D_{e=1} = 1)/Pr(D_{e=0} = 1) = 1.0. For an extension of counterfactual theory and methods to complex longitudinal data, see reference 19.
In a DAG, CRR_{ED} = 1.0 is represented by the lack of a directed path of arrows originating from E and ending on D as, for example, in Figure 5. We shall refer to a directed path of arrows as a causal path. On the other hand, in Figure 5, CRR_{EC} ≠ 1.0 because there is a causal path from E to C through F. The lack of a direct arrow from E to C implies that E does not have a direct effect on C (relative to the other variables on the DAG), ie, the effect is wholly mediated through other variables on the DAG (ie, F).
For a population, we say that there is no association between E and D if the average of D is the same in the subset of the population that was exposed as in the subset that was unexposed, ie, when Pr(D = 1|E = 1) = Pr(D = 1|E = 0) for a dichotomous D. Equivalently, we say that E and D are unassociated if the associational risk ratio is 1.0, ie, ARR_{ED} = Pr(D = 1|E = 1) / Pr(D = 1|E = 0) = 1.0. The associational risk ratio can always be estimated from observational data. We say that there is bias when the causal risk ratio in the population differs from the associational risk ratio, ie, CRR_{ED} ≠ ARR_{ED}.
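A small counterfactual simulation in R (our illustration, with made-up parameters) makes these definitions concrete: both potential outcomes are generated for every subject, so CRR_{ED} can be computed directly and compared with ARR_{ED} when exposure and outcome share a common cause L.

set.seed(4)
n  <- 1e6
L  <- rbinom(n, 1, 0.5)                     # common cause of E and D
E  <- rbinom(n, 1, plogis(-1 + 2 * L))      # exposure depends on L
D1 <- rbinom(n, 1, plogis(-2 + 1.5 * L))    # potential outcome D_{e=1}
D0 <- rbinom(n, 1, plogis(-2 + 1.5 * L))    # potential outcome D_{e=0}: same law, so no causal effect
D  <- ifelse(E == 1, D1, D0)                # observed outcome (consistency)

mean(D1) / mean(D0)                         # CRR_{ED}: approximately 1.0
mean(D[E == 1]) / mean(D[E == 0])           # ARR_{ED}: above 1.0, so bias (here, confounding by L)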
A.2. Hazard Ratios as Effect Measures
The causal DAG in Appendix Figure 1a describes a randomized study of the effect of surgery E on death at times 1 (D_1) and 2 (D_2). Suppose the effect of exposure on D_1 is protective. Then the lack of an arrow from E to D_2 indicates that, although the exposure E has a direct protective effect (decreases the risk of death) at time 1, it has no direct effect on death at time 2. That is, the exposure does not influence the survival status at time 2 (D_2) of any subject who would survive past time 1 when unexposed (and thus when exposed). Suppose further that U is an unmeasured haplotype that decreases the subject's risk of death at all times. The associational risk ratios ARR_{ED_1} and ARR_{ED_2} are unbiased measures of the effect of E on death at times 1 and 2, respectively. (Because of the absence of confounding, ARR_{ED_1} and ARR_{ED_2} equal the causal risk ratios CRR_{ED_1} and CRR_{ED_2}, respectively.) Note that, even though E has no direct effect on D_2, ARR_{ED_2} (or, equivalently, CRR_{ED_2}) will be less than 1.0 because it is a measure of the effect of E on total mortality through time 2.
Consider now the time-specific associational hazard (rate) ratio as an effect measure. In discrete time, the hazard of death at time 1 is the probability of dying at time 1 and thus is the same as ARR_{ED_1}. However, the hazard at time 2 is the probability of dying at time 2 among those who survived past time 1. Thus, the associational hazard ratio at time 2 is then ARR_{ED_2 | D_1 = 0}. The square around D_1 in Appendix Figure 1a indicates this conditioning. Exposed survivors of time 1 are less likely than unexposed survivors of time 1 to have the protective haplotype U (because exposure can explain their survival) and therefore are more likely to die at time 2. That is, conditional on D_1 = 0, exposure is associated with a higher mortality at time 2. Thus, the hazard ratio at time 1 is less than 1.0, whereas the hazard ratio at time 2 is greater than 1.0, ie, the hazards have crossed. We conclude that the hazard ratio at time 2 is a biased estimate of the direct effect of exposure on mortality at time 2. The bias is selection bias arising from conditioning on a common effect D_1 of exposure and of U (which is a cause of D_2), which opens the noncausal (ie, associational) path E → D_1 ← U → D_2 between E and D_2 [13]. In the survival analysis literature, an unmeasured cause of death that is marginally unassociated with exposure, such as U, is often referred to as a frailty.
In contrast to this, the conditional hazard ratio at D_2 given U, ARR_{ED_2 | D_1 = 0, U}, is equal to 1.0 within each stratum of U, because the path E → D_1 ← U → D_2 between E and D_2 is now blocked by conditioning on the noncollider U. Thus, the conditional hazard ratio correctly indicates the absence of a direct effect of E on D_2. The fact that the unconditional hazard ratio ARR_{ED_2 | D_1 = 0} differs from the common stratum-specific hazard ratios of 1.0, even though U is independent of E, shows the noncollapsibility of the hazard ratio [26].
Appendix Figure 1. Effect of exposure on survival.
Unfortunately, the unbiased measure ARR_{ED_2 | D_1 = 0, U} of the direct effect of E on D_2 cannot be computed because U is unobserved. In the absence of data on U, it is impossible to know whether exposure has a direct effect on D_2. That is, the data cannot determine whether the true causal DAG generating the data was that in Appendix Figure 1a versus that in Appendix Figure 1b.
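This crossing of the hazards can be reproduced with a short simulation in R. The risk values below are hypothetical, chosen only so that the data-generating process respects the structure of Appendix Figure 1a (E randomized, U unmeasured and protective at both times, and no arrow from E to D_2):

    set.seed(42)
    n  <- 1e6
    E  <- rbinom(n, 1, 0.5)                     # randomized exposure (surgery)
    U  <- rbinom(n, 1, 0.5)                     # unmeasured protective haplotype (the "frailty")
    D1 <- rbinom(n, 1, 0.60 - 0.30*E - 0.25*U)  # death at time 1: E and U both protective
    D2 <- ifelse(D1 == 1, NA,                   # death at time 2 depends on U only (no E -> D2 arrow)
                 rbinom(n, 1, 0.60 - 0.50*U))

    risk <- function(d, rows) mean(d[rows], na.rm = TRUE)

    # Risk ratio at time 1 (unconfounded, hence also causal): protective, well below 1
    risk(D1, E == 1) / risk(D1, E == 0)

    # Hazard ratio at time 2, ie, risk ratio among survivors of time 1 (D1 == 0): above 1
    risk(D2, E == 1 & D1 == 0) / risk(D2, E == 0 & D1 == 0)

    # Within each stratum of U the time-2 ratio is approximately 1 (no direct effect of E)
    risk(D2, E == 1 & D1 == 0 & U == 1) / risk(D2, E == 0 & D1 == 0 & U == 1)
    risk(D2, E == 1 & D1 == 0 & U == 0) / risk(D2, E == 0 & D1 == 0 & U == 0)

With these illustrative numbers, the time-1 risk ratio is well below 1.0, the time-2 hazard ratio among survivors exceeds 1.0, and the U-stratum-specific time-2 ratios are approximately 1.0, exactly as the argument above predicts.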
A.3. Effect Modification and Common Effects in DAGs
Although an arrow on a causal DAG represents a direct effect, a standard causal DAG does not distinguish a harmful effect from a protective effect. Similarly, a standard DAG does not indicate the presence of effect modification. For example, although Appendix Figure 1a implies that both E and U affect death D_1, the DAG does not distinguish among the following 3 qualitatively distinct ways that U could modify the effect of E on D_1:
1. The causal effect of exposure E on mortality D_1 is in the same direction (ie, harmful or beneficial) in both stratum U = 1 and stratum U = 0.
2. The direction of the causal effect of exposure E on mortality D_1 in stratum U = 1 is the opposite of that in stratum U = 0 (ie, there is a qualitative interaction between U and E).
3. Exposure E has a causal effect on D_1 in one stratum of U but no causal effect in the other stratum, eg, E only kills subjects with U = 0.
Because standard DAGs do not represent interaction, it follows that it is not possible to infer from a DAG the direction of the conditional association between 2 marginally independent causes (E and U) within strata of their common effect D_1. For example, suppose that, in the presence of an undiscovered background factor V that is unassociated with E or U, having either E = 1 or U = 1 is sufficient and necessary to cause death (an "or" mechanism), but that neither E nor U causes death in the absence of V. Then, among those who died by time 1 (D_1 = 1), E and U will be negatively associated, because an unexposed subject (E = 0) is more likely to have had U = 1: the absence of exposure increases the chance that U was the cause of death. (Indeed, the logarithm of the conditional odds ratio OR_{UE | D_1 = 1} will approach minus infinity as the population prevalence of V approaches 1.0.) Although this "or" mechanism was the only explanation given in the main text for the conditional association of independent causes within strata of a common effect, other possibilities exist. For example, suppose that, in the presence of the undiscovered background factor V, having both E = 1 and U = 1 is sufficient and necessary to cause death (an "and" mechanism), and that neither E nor U causes death in the absence of V. Then, among those who die by time 1, those who had been exposed (E = 1) are more likely to have the haplotype (U = 1), ie, E and U are positively correlated. A standard DAG such as that in Appendix Figure 1a fails to distinguish the case of E and U interacting through an "or" mechanism from the case of an "and" mechanism.
Although conditioning on common effect D_1 always induces a conditional association between independent causes E and U in at least one of the 2 strata of D_1 (say, D_1 = 1), there is a special situation under which E and U remain conditionally independent within the other stratum (say, D_1 = 0). This situation occurs when the data follow a multiplicative survival model, that is, when the probability Pr[D_1 = 0 | U = u, E = e] of survival (ie, D_1 = 0) given E and U is equal to a product g(u) h(e) of functions of u and e. The multiplicative model Pr[D_1 = 0 | U = u, E = e] = g(u) h(e) is equivalent to the model that assumes the survival ratio Pr[D_1 = 0 | U = u, E = e] / Pr[D_1 = 0 | U = 0, E = 0] does not depend on u and is equal to h(e). (Note that if Pr[D_1 = 0 | U = u, E = e] = g(u) h(e), then Pr[D_1 = 1 | U = u, E = e] = 1 − g(u) h(e) does not follow a multiplicative mortality model. Hence, when E and U are conditionally independent given D_1 = 0, they will be conditionally dependent given D_1 = 1.)
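This property can be verified numerically in R; the fragment below uses one hypothetical choice of g and h (chosen so that all products g(u) h(e) are valid probabilities) and shows that the U-E odds ratio equals 1.0 in the stratum D_1 = 0 but not in the stratum D_1 = 1:

    # Multiplicative survival model: Pr(D1 = 0 | U = u, E = e) = g(u) * h(e)
    g  <- c("0" = 0.60, "1" = 0.90)   # survival factor for U = 0, 1 (hypothetical)
    h  <- c("0" = 0.70, "1" = 0.95)   # survival factor for E = 0, 1 (hypothetical)
    pU <- 0.5; pE <- 0.5              # marginal probabilities of U = 1 and E = 1

    joint       <- expand.grid(U = 0:1, E = 0:1)
    joint$pUE   <- ifelse(joint$U == 1, pU, 1 - pU) * ifelse(joint$E == 1, pE, 1 - pE)
    joint$surv  <- g[as.character(joint$U)] * h[as.character(joint$E)]
    joint$p_D0  <- joint$pUE * joint$surv          # P(U, E, D1 = 0)
    joint$p_D1  <- joint$pUE * (1 - joint$surv)    # P(U, E, D1 = 1)

    # Odds ratio between U and E within a stratum of D1 (1.0 = conditional independence);
    # cells are ordered (U=0,E=0), (U=1,E=0), (U=0,E=1), (U=1,E=1)
    or <- function(p) (p[1] * p[4]) / (p[2] * p[3])
    or(joint$p_D0)   # = 1: E and U independent given survival (D1 = 0)
    or(joint$p_D1)   # != 1: E and U dependent given death (D1 = 1)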
Biologically, this multiplicative survival model will hold when E and U affect survival through totally independent mechanisms, in such a way that U cannot possibly modify the effect of E on D_1, and vice versa. For example, suppose that the surgery E affects survival through the removal of a tumor, whereas the haplotype U affects survival through increasing levels of low-density lipoprotein cholesterol resulting in an increased risk of heart attack (whether or not a tumor is present), and that death by tumor and death by heart attack are independent in the sense that they do not share a common cause. In this scenario, we can consider 2 cause-specific mortality variables: death from tumor D_1A and death from heart attack D_1B. The observed mortality variable D_1 is equal to 1 (death) when either D_1A or D_1B is equal to 1, and D_1 is equal to 0 (survival) when both D_1A and D_1B equal 0. We assume the measured variables are those in Appendix Figure 1a, so data on the underlying cause of death are not recorded. Appendix Figure 2 is an expansion of Appendix Figure 1a that represents this scenario (variable D_2 is not represented because it is not essential to the current discussion).
Appendix Figure 2. Multiplicative survival model.
Because D_1 = 0 implies both D_1A = 0 and D_1B = 0, conditioning on observed survival (D_1 = 0) is equivalent to simultaneously conditioning on D_1A = 0 and D_1B = 0 as well. As a consequence, we find by applying d-separation [13] to Appendix Figure 2 that E and U are conditionally independent given D_1 = 0, ie, the path between E and U through the conditioned-on collider D_1 is blocked by conditioning on the noncolliders D_1A and D_1B [8]. On the other hand, conditioning on D_1 = 1 does not imply conditioning on any specific values of D_1A and D_1B, as the event D_1 = 1 is compatible with 3 possible unmeasured events: D_1A = 1 and D_1B = 1, D_1A = 1 and D_1B = 0, and D_1A = 0 and D_1B = 1. Thus, the path between E and U through the conditioned-on collider D_1 is not blocked, and thus E and U are associated given D_1 = 1.
What is interesting about Appendix Figure 2 is that, by adding the unmeasured variables D_1A and D_1B, which functionally determine the observed variable D_1, we have created an annotated DAG that succeeds in representing both the conditional independence between E and U given D_1 = 0 and their conditional dependence given D_1 = 1. As far as we are aware, this is the first time such a conditional independence structure has been represented on a DAG.
If E and U affect survival through a common mechanism, then there will exist an arrow either from E to D_1B or from U to D_1A, as shown in Appendix Figure 3a. In that case, the multiplicative survival model will not hold, and E and U will be dependent within both strata of D_1. Similarly, if the causes D_1A and D_1B are not independent because of a common cause V, as shown in Appendix Figure 3b, the multiplicative survival model will not hold, and E and U will again be dependent within both strata of D_1.
In summary, conditioning on a common effect always induces an association between its causes, but this association could be restricted to certain levels of the common effect.
A.4. Generalizations of Structure (3)
Consider Appendix Figure 4a, representing a study restricted to firefighters (F = 1). E and D are unassociated among firefighters because the path E-F-A-C-D is blocked by C. If we then stratify on the covariate C, as in Appendix Figure 4b, E and D are conditionally associated among firefighters in a given stratum of C; yet C is neither caused by E nor by a cause of E. This example demonstrates that our previous formulation of structure (3) is insufficiently general to cover examples in which we have already conditioned on another variable F before conditioning on C. Note that one could try to argue that our previous formulation works by insisting that the set (F, C) of all variables conditioned on be regarded as a single supervariable, and then applying our previous formulation with this supervariable in place of C. This fix-up fails because it would require E and D to be conditionally associated within joint levels of the supervariable (F, C) in Appendix Figure 4c as well, which is not the case.
However, a general formulation that works in all settings is the following. A conditional association between E and D will occur within strata of a common effect C of 2 other variables, one of which is either the exposure or statistically associated with the exposure, and the other of which is either the outcome or statistically associated with the outcome.
Clearly, our earlier formulation is implied by the new formulation and, furthermore, the new formulation gives the correct results for both Appendix Figures 4b and 4c. A drawback of this new formulation is that it is not stated purely in terms of causal structures, because it makes reference to (possibly noncausal) statistical associations. It is in fact possible to provide a fully general formulation in terms of causal structures, but it is not simple, so we do not give it here; see references 13 and 14.
Appendix Figure 3. Multiplicative survival model does not hold.
Appendix Figure 4. Conditioning on 2 variables.
Chapter 1 ~ Things You Should Know

1. The difference between population and sample.

2. The informal notion of a random variable X and its distribution of values in a population.

3. The difference between parameters and statistics.

4. Statistics follows the Classical Scientific Method of correctly designing, conducting, and analyzing the random sample outcomes of an experiment, in order to test a specific null hypothesis on a population, and infer a formal conclusion.

5. The informal notion of significance level (say, α = .05) and confidence level (then 1 − α = .95).

6. The informal notion that the corresponding confidence interval gives the lowest and highest estimates for a population mean μ, based on a sample mean x̄. It can then be used to test any null hypothesis for μ.

7. The informal notion that if some null hypothesis for a population mean μ is true, then its corresponding acceptance region gives the expected lowest and highest estimates for a random sample mean x̄. (Outside of that is the corresponding rejection region.)

8. The informal notion that if the null hypothesis is true, then the p-value of a particular sample outcome measures the probability of obtaining that outcome (or one farther from the null value). Hence, a low p-value indicates that the sample provides evidence against the null hypothesis. (Formally, if the p-value is less than the predetermined significance level (say, α = .05), then the null hypothesis can be rejected in favor of a complementary alternative hypothesis.) A short R illustration of items 5-8 appears at the end of this chapter summary.

9. The different types of medical study design.

For Biostatistics courses only (i.e., not Stat 301).



[Schematic: POPULATION (usually arbitrarily large or infinite), described by parameters, a.k.a. population characteristics, e.g., mean μ; SAMPLE (always finite), described by statistics, a.k.a. sample characteristics, e.g., mean x̄.]
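The following minimal R sketch ties items 5-8 together; the data values and the null value μ = 25 are hypothetical, chosen only for illustration:

    x <- c(23.1, 25.4, 24.8, 26.0, 22.9, 25.7, 24.3, 23.8, 26.5, 24.9)   # hypothetical sample

    # 95% confidence interval for the population mean mu, and a two-sided t-test
    # of the null hypothesis H0: mu = 25 at significance level alpha = .05
    result <- t.test(x, mu = 25, conf.level = 0.95)

    result$conf.int   # lowest and highest estimates for mu (item 6)
    result$p.value    # p-value of the observed sample outcome (item 8);
                      # reject H0 if this is less than alpha = .05

The acceptance region of item 7 is the analogous interval, centered at the null value 25 rather than at the sample mean x̄.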
Chapter 2 ~ Things You Should Know

1. The different types of random variable on a population; how to classify data (see the classification scheme at the end of this chapter summary).

2. Graph numerical sample data, especially histograms (frequency, relative frequency, density) and the cumulative distribution function (cdf).

3. Calculate Summary Statistics for grouped and ungrouped sample data:
   Measures of Center: mode, median, and mean x̄
   Quartiles: Q1, Q2 (= median), Q3, and other percentiles
   Measures of Spread: range, Interquartile Range (IQR), variance s², and standard deviation s
   Understand the behaviors these have with respect to outliers, skew, etc. (A short R illustration of items 2 and 3 appears at the end of this chapter summary.)

4. Calculate the proportion of a sample (especially grouped data) between two given values a and b.
Classification of random variables / data types:

  Numerical (Quantitative)
    Continuous. Ex: Foot length
    Discrete. Ex: Shoe size

  Categorical (Qualitative)
    Nominal (unranked). Ex: Zip code
      Binary (Dichotomous): 2 categories. Ex: Sex (M = 0, F = 1)
      Non-binary (Non-dichotomous): 3 or more categories. Ex: Race (White = 1, Hisp = 2, ...)
    Ordinal (ranked). Ex: Alphabet (A = 1, ..., Z = 26)
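A brief R illustration of items 2 and 3, using a small hypothetical sample:

    x <- c(2, 3, 3, 4, 5, 5, 5, 6, 7, 9, 12)       # hypothetical ungrouped sample data

    median(x); mean(x)                             # measures of center
    range(x); IQR(x); var(x); sd(x)                # measures of spread
    quantile(x, probs = c(0.25, 0.50, 0.75))       # quartiles Q1, Q2, Q3
    hist(x, freq = FALSE)                          # density histogram
    plot(ecdf(x))                                  # cumulative distribution function (cdf)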
Chapter 3 ~ Things You Should Know

1. Basic Definitions

How to represent the sample space S of all experimental outcomes via a Venn diagram, and two or more events (E, F, ...) as subsets of S.

Example: Experiment = Randomly pick a single playing card from a standard deck (and replace).
S = {A♠, 2♠, ..., K♠, A♥, 2♥, ..., K♥, A♦, 2♦, ..., K♦, A♣, 2♣, ..., K♣}
A = "Pick an Ace" = {A♠, A♥, A♦, A♣}
B = "Pick a Black card" = {A♠, 2♠, ..., K♠, A♣, 2♣, ..., K♣}
C = "Pick a Clubs card" = {A♣, 2♣, ..., K♣}
D = "Pick a Diamonds card" = {A♦, 2♦, ..., K♦}

2. Basic Definition and Properties of Probability (for any two events E and F)

The general notion of probability P(E) of an event E as the limiting value of its long-run relative frequency, (# times event E occurs) / (# experimental trials), as the experimental trials are repeated indefinitely.

0 ≤ P(E) ≤ 1

P(E) = (# outcomes in E) / (# outcomes in S) ONLY IF the outcomes are equally likely.

Example (cont'd): P(A) = 4/52 and P(B) = 26/52 IF the deck is fair, i.e., P(each card) = 1/52.

Complement: E^c = "Not E"; P(E^c) = 1 − P(E)  (Complement Rule)

Intersection: E ∩ F = "E and F"
Example (cont'd): A ∩ B = {A♠, A♣}, so that P(A ∩ B) = 2/52.
Special Case: E and F are disjoint, or mutually exclusive, if E ∩ F = ∅, i.e., P(E ∩ F) = 0.
Example (cont'd): With C = Clubs and D = Diamonds above, C ∩ D = ∅, so that P(C ∩ D) = 0.

Union: E ∪ F = "E or F"; P(E ∪ F) = P(E) + P(F) − P(E ∩ F)  (Addition Rule)
Example (cont'd): P(A ∪ B) = 4/52 + 26/52 − 2/52 = 28/52
How to construct and use a 2 × 2 probability table for two events E and F:

              E              E^c
  F       P(E ∩ F)       P(E^c ∩ F)       P(F)
  F^c     P(E ∩ F^c)     P(E^c ∩ F^c)     P(F^c)
          P(E)           P(E^c)           1

3. Conditional Probability (of any event E, given any event F)

P(E | F) = P(E ∩ F) / P(F), which can be rewritten as P(E ∩ F) = P(E | F) P(F)  (Multiplication Rule)
This latter formula can be expanded into a full tree diagram, where successive branch probabilities are multiplied together to yield intersection probabilities.

Special Case: E and F are statistically independent if either of the following conditions holds:
  o P(E | F) = P(E), or likewise, P(F | E) = P(F)
  o P(E ∩ F) = P(E) P(F)  (from above)

Example (cont'd): P(A ∩ B) = 1/26 is indeed equal to the product of P(A) = 1/13 times P(B) = 1/2, so events A = "Pick an Ace" and B = "Pick a Black card" are statistically independent (but not disjoint, since A ∩ B = {A♠, A♣})!

Example: E and F below are statistically independent because each cell probability is equal to the product of its corresponding row and column marginal probabilities (e.g., 0.28 = 0.7 × 0.4, etc.), but events G and H are not, i.e., they are statistically dependent. (The two 2 × 2 tables appear after the Bayes' Rule summary below.)
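A quick R check of this example (the cell probabilities are those of the two 2 × 2 tables shown after the Bayes' Rule summary below):

    EF <- matrix(c(0.28, 0.12, 0.42, 0.18), nrow = 2,
                 dimnames = list(c("F", "Fc"), c("E", "Ec")))
    GH <- matrix(c(0.15, 0.25, 0.55, 0.05), nrow = 2,
                 dimnames = list(c("H", "Hc"), c("G", "Gc")))

    indep_check <- function(tab) {
      expected <- outer(rowSums(tab), colSums(tab))   # product of marginal probabilities
      all.equal(as.numeric(tab), as.numeric(expected))
    }
    indep_check(EF)   # TRUE: E and F are statistically independent
    indep_check(GH)   # not TRUE (a reported difference): G and H are dependent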





4. Bayes' Rule

The events B_1, B_2, ..., B_n are disjoint and exhaustive, and A is any event.

STEP 1. Given: Prior Probabilities P(B_1), P(B_2), ..., P(B_n), with P(B_1) + P(B_2) + ... + P(B_n) = 1.

STEP 2. Given: Conditional Probabilities P(A | B_1), P(A | B_2), ..., P(A | B_n).

STEP 3. Then the Posterior Probabilities P(B_1 | A), P(B_2 | A), ..., P(B_n | A), which also sum to 1, are obtained via the formula

    P(B_i | A) = P(A | B_i) P(B_i) / P(A),   i = 1, 2, ..., n,

where, by the Law of Total Probability, P(A) = Σ_{j=1}^{n} P(A | B_j) P(B_j).

Finally, compare each prior to its corresponding posterior. INTERPRET IN CONTEXT!


Tables for the independence example in section 3 above:

           E       E^c
  F      0.28    0.42    0.70
  F^c    0.12    0.18    0.30
         0.40    0.60    1.00

           G       G^c
  H      0.15    0.55    0.70
  H^c    0.25    0.05    0.30
         0.40    0.60    1.00
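A minimal R sketch of these three steps, using hypothetical priors and conditional probabilities for n = 3 events B_1, B_2, B_3:

    prior <- c(B1 = 0.5, B2 = 0.3, B3 = 0.2)       # P(B_i), must sum to 1
    cond  <- c(B1 = 0.10, B2 = 0.40, B3 = 0.80)    # P(A | B_i)

    pA        <- sum(cond * prior)                 # Law of Total Probability: P(A)
    posterior <- cond * prior / pA                 # P(B_i | A), Bayes' Rule

    pA
    rbind(prior, posterior)                        # compare each prior to its posterior
    sum(posterior)                                 # check: posteriors sum to 1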
Chebyshev's Inequality
Now that the mean and standard deviation have been formally defined for any population
distribution (with the exception of a very few special cases), is it possible to relate them in any
way? At first glance, it may appear that the answer is no. For example, if the mean age of a
certain population is known to be μ = 40 years, that tells us nothing about the standard deviation σ. At one extreme, it could be the case that nearly everyone is very close to 40 (i.e., σ is small), or at the other extreme, the ages could vary widely from infants to the elderly (i.e., σ is large). Likewise, knowing that the standard deviation of a population is, say, σ = 10 years, tells us
absolutely nothing about the mean age of that population. So it is perhaps somewhat surprising
that there is in fact a relation of sorts between the two. While it may not be possible to derive
one directly from the other, there is still something we can say, albeit very general.
A well-known theorem proved by the Russian mathematician Chebyshev (pronounced just as it
appears, although there are numerous spelling variations of his name) loosely states that
no matter what the shape of the population distribution (e.g., bell, skewed, bimodal, etc.), at least 3/4 (or 0.75) of the population values lie within plus or minus two standard deviations of the mean μ. That is, the event that the population value (say, age) of a randomly selected individual lies between the lower bound of μ − 2σ and the upper bound of μ + 2σ has probability at least 75%. Furthermore, at least 8/9 (or 0.89) of the population values lie within plus or minus three standard deviations of the mean μ. That is, the event that the population value of a randomly selected individual lies between the lower bound of μ − 3σ and the upper bound of μ + 3σ has probability at least 89%. In fact, in general, for any number k > 1, at least (1 − 1/k²) of the population values lie within plus or minus k standard deviations of the mean μ. That is, the interval between the lower bound of μ − kσ and the upper bound of μ + kσ captures (1 − 1/k²) or more of the population values. (So, for instance, 1 − 1/k² is equal to 3/4 when k = 2, and is equal to 8/9 when k = 3, thus confirming the claims above. At least how much of the population is captured within k = 2.5 standard deviations of the mean μ? Answer: 0.84)
However, the generality of Chebyshev's Inequality (i.e., no assumptions are made on the shape of the distribution) is also something of a drawback, for, although true, it is far too general to be of practical use, and is therefore mainly of theoretical interest. For most realistic distributions, the probabilities considered above are much higher than the very general bounds provided by Chebyshev. For example, we will see that any bell curve captures exactly 68.3% of the population values within one standard deviation of its mean μ. (Note that Chebyshev's Inequality states nothing useful for the case k = 1.) Similarly, any bell curve captures exactly 95.4% of the population values within two standard deviations of its mean μ. (For k = 2, Chebyshev's Inequality states only that this probability is at least 75%... true, but very conservative, when compared with the actual value.) Likewise, any bell curve captures exactly 99.7% of the population values within three standard deviations of its mean μ. (For k = 3, Chebyshev's Inequality states only that this probability is at least 89%... again, true, but conservative.)
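The comparison can be reproduced in R; chebyshev() below is the bound 1 − 1/k², and bell() is the exact probability for a normal (bell-shaped) distribution:

    chebyshev <- function(k) 1 - 1/k^2             # lower bound on P(mu - k*sigma < X < mu + k*sigma)
    bell      <- function(k) pnorm(k) - pnorm(-k)  # exact probability for a normal distribution

    k <- c(1, 2, 2.5, 3)
    round(cbind(k, chebyshev = chebyshev(k), bell.curve = bell(k)), 3)
    #      k chebyshev bell.curve
    #      1     0.000      0.683
    #      2     0.750      0.954
    #    2.5     0.840      0.988
    #      3     0.889      0.997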

Intro Stat HW LECTURE NOTES Problem Sets

Each HW assignment consists of at least one set of required problems from the textbook, AND at least one set of problems from the Lecture Notes (numbered sets I, II, ... are shown below in BLUE). The
Suggested problems are not to be turned in, but are there for additional practice. Solutions will be posted here.

0. READ: Getting Started with R
I. 1.5 - Problems Introduction
Required: 1, 2, 3, 4, 7
Suggested: 5, 6
II. 2.5 - Problems Exploratory Data Analysis
Required: 2, 3, 4, 6, 7, 8, 9
Suggested: 1, 11, 13
III. 3.5 - Problems Probability Theory
Required: 1, 2, 7, 8, 11, 15, 16(a), 19, 30 DO ANY FIVE PROBLEMS

Suggested: 3, 6, 9, 10, 18, 20, 21(a), 24, 27
IVa. 4.4 - Problems Discrete Models
Required: 1, 2, 19, 25
Suggested: 3, 11
IVb. 4.4 - Problems Continuous Models
Required: 13(a), 15, 16, 17, 18, 21, 29, 30, 31, 33 DO ANY FIVE PROBLEMS
Suggested: 11, 13(b), 26, 32
V. 5.3 - Problems Sampling Distributions, Central Limit Theorem
Required: 3, 4, 5, 6
Suggested: 1, 8
VIa. 6.4 - Problems Hypothesis Testing: One Mean (Large Samples)
Required: 2, 3, 5, 8, 25
VIb. 6.4 - Problems Hypothesis Testing: One Mean (Small Samples)
Required: 4, 6, 26
VIc. 6.4 - Problems Hypothesis Testing: One Proportion
Required: 1
VId. 6.4 - Problems Hypothesis Testing: Two Means
Required: 10 [see hint for (d)], 11, 27
VIe. 6.4 - Problems Hypothesis Testing: Proportions
Required: 14, 19
Suggested: 18, 20
VIf. 6.4 - Problems Hypothesis Testing: ANOVA
Required: 21
VII. 7.4 - Problems Linear Correlation and Regression
Required: 5, 6, 7
Suggested: 2, 3

AAPS PharmSci 2001; 3 (4) article 29 (http://www.pharmsci.org/).
Allometric Scaling of Xenobiotic Clearance: Uncertainty versus Universality
Submitted: February 21, 2001; Accepted: November 7, 2001; Published: November 21, 2001
Teh-Min Hu and William L. Hayton
Division of Pharmaceutics, College of Pharmacy, The Ohio State University, 500 W. 12th Ave. Columbus, OH 43210-1291

ABSTRACT Statistical analysis and Monte Carlo
simulation were used to characterize uncertainty in
the allometric exponent (b) of xenobiotic clearance
(CL). CL values for 115 xenobiotics were from
published studies in which at least 3 species were
used for the purpose of interspecies comparison of
pharmacokinetics. The b value for each xenobiotic
was calculated along with its confidence interval
(CI). For 24 xenobiotics (21%), there was no
correlation between log CL and log body weight. For
the other 91 cases, the mean ± standard deviation of the b values was 0.74 ± 0.16; range, 0.29 to 1.2.
Most (81%) of these individual b values did not
differ from either 0.67 or 0.75 at P = 0.05. When CL
values for the subset of 91 substances were
normalized to a common body weight coefficient (a),
the b value for the 460 adjusted CL values was 0.74;
the 99% CI was 0.71 to 0.76, which excluded 0.67.
Monte Carlo simulation indicated that the wide range
of observed b values could have resulted from
random variability in CL values determined in a
limited number of species, even though the
underlying b value was 0.75. From the normalized
CL values, 4 xenobiotic subgroups were examined:
those that were (i) protein, and those that were (ii)
eliminated mainly by renal excretion, (iii) by
metabolism, or (iv) by renal excretion and
metabolism combined. All subgroups except (ii)
showed a b value not different from 0.75. The b
value for the renal excretion subgroup (21
xenobiotics, 105 CL values) was 0.65, which
differed from 0.75 but not from 0.67.
KEYWORDS: allometric scaling, body-weight
exponent, clearance, metabolism, metabolic rate,
pharmacokinetics, Monte Carlo simulation, power
law


Corresponding Author: William L. Hayton; Division of
Pharmaceutics, College of Pharmacy, The Ohio State
University, 500 W. 12th Ave. Columbus, OH 43210-
1291;Telephone: 614-292-1288; Facsimile: 614-292-7766; E-
mail: hayton@dendrite.pharmacy.ohio-state.edu
INTRODUCTION
Biological structures and processes ranging from
cellular metabolism to population dynamics are
affected by the size of the organism [1,2]. Although the sizes of mammalian species span 7 orders of magnitude, interspecies similarities in structural, physiological, and biochemical attributes result in an empirical power law (the allometric equation) that characterizes the dependency of biological variables on body mass:

    Y = a BW^b,

where Y is the dependent biological variable of interest, a is a normalization constant known as the allometric coefficient, BW is the body weight, and b is the allometric exponent. The power function can be transformed into a linear function:

    Log Y = Log a + b (Log BW),

and a and b can be estimated from the intercept and slope of a linear regression analysis. The magnitude of b characterizes the rate of change of a biological variable subjected to a change of body mass and reflects the geometric and dynamic constraints of the body [3,4].
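As an illustration of this estimation step, the R sketch below fits the linearized allometric equation to hypothetical CL versus BW values (invented for illustration, not taken from Table 1):

    # Hypothetical clearance (CL, mL/min) and body weight (BW, kg) values for 5 species
    BW <- c(0.02, 0.25, 2.5, 14, 70)
    CL <- c(5.2, 31, 190, 700, 2600)

    fit <- lm(log10(CL) ~ log10(BW))      # Log CL = Log a + b (Log BW)
    coef(fit)                             # intercept = Log a, slope = b
    10^coef(fit)[1]                       # allometric coefficient a
    confint(fit)["log10(BW)", ]           # 95% confidence interval for the exponent b
    summary(fit)$r.squared                # coefficient of determination r^2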
Although allometric scaling of physiological parameters has been a century-long endeavor, no consensus has been reached as to whether a universal scaling exponent exists. In particular, discussion has centered on whether the basal metabolic rate scales as the 2/3 or 3/4 power of the body mass [1,2,3-9].
Allometric scaling has been applied in
pharmacokinetics for approximately 2 decades. The
major interest has been prediction of
pharmacokinetic parameters in man from parameter
values determined in animals [10-15]. Clearance has
been the most studied parameter, as it determines
the drug-dosing rate. In most cases, the
pharmacokinetics of a new drug was studied in
several animal species, and the allometric
relationship between pharmacokinetic parameters
and body weight was determined using linear
regression of the log-transformed data. One or more
of the following observations apply to most such
studies: (i) Little attention was given to uncertainty
in the a and b values; although the correlation
coefficient was frequently reported, the confidence
intervals of the a and b values were infrequently
addressed. (ii) The a and b values were used for
interspecies extrapolation of pharmacokinetics
without analysis of the uncertainty in the predicted
parameter values. (iii) The b value of clearance was
compared with either the value 2/3 from "surface
law" or 3/4 from "Kleiber's law" and the allometric
scaling of basal metabolic rate.
This paper addresses the possible impact of the
uncertainty in allometric scaling parameters on
predicted pharmacokinetic parameter values. We
combined a statistical analysis of the allometric
exponent of clearance from 115 xenobiotics and a
Monte Carlo simulation to characterize the
uncertainty in the allometric exponent for clearance
and to investigate whether a universal exponent
may exist for the scaling of xenobiotic clearance.
MATERIALS AND METHODS
Data collection and statistical analysis
Clearance (CL) and BW data for 115 substances were collected from published studies in which at least 3 animal species were used for the purpose of interspecies comparison of pharmacokinetics [16-90]. A total of 18 species (16 mammals, 2 birds), with body weights spanning a 10^4-fold range, were involved (Table 1). Previously published studies generally did not
control or standardize across species the (i) dosage,
(ii) numbers of individuals studied per species, (iii)
principal investigator, (iv) blood sampling regime,
or (v) gender.
Table 1. Allometric Scaling Parameters Obtained from Linear
Regressions of the Log-Log-Transformed CL versus BW Data
of 115 Xenobiotics (a: allometric coefficient; b: allometric
exponent) (Table located at the end of article).
Linear regression was performed on the log-transformed data according to the equation Log CL = Log a + b (Log BW). Values for a and b were obtained from the intercept and the slope of the regression, along with the coefficient of determination (r^2). Statistical inferences about b were performed in the following form:

    H_0: b = β_i  versus  H_1: b ≠ β_i,   i = 0, 1, 2,

where β_0 = 0, β_1 = 2/3, and β_2 = 3/4, respectively. The 95% and 99% confidence intervals (CI) were also calculated for each b value. In addition, the CL values for each individual xenobiotic were normalized so that all compounds had the same a value. Linear regression analysis was applied to the pooled, normalized CL versus BW data for the 91 xenobiotics that showed statistically significant correlation between log CL and log BW in Table 1.
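In R, these tests can be carried out from a fitted log-log regression; the sketch below reuses the hypothetical lm object fit from the Introduction sketch and computes the usual t-statistic for the slope against each null value:

    b_hat <- coef(summary(fit))["log10(BW)", "Estimate"]
    se_b  <- coef(summary(fit))["log10(BW)", "Std. Error"]
    for (b0 in c(0, 2/3, 3/4)) {                     # H0: b = 0, 2/3, 3/4
      t_stat <- (b_hat - b0) / se_b
      p_val  <- 2 * pt(-abs(t_stat), df = fit$df.residual)
      cat("H0: b =", round(b0, 3), "  p-value =", round(p_val, 4), "\n")
    }
    confint(fit, "log10(BW)", level = 0.99)          # 99% CI for b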
Monte Carlo simulation
The power function CL = a BW^b was used to generate a set of error-free CL versus BW data. The values for BW were 0.02, 0.25, 2.5, 5, 14, and 70 kg, which represented the body weights of mouse, rat, rabbit, monkey, dog, and human, respectively. The values of a and b used in the simulation were 100 and 0.75, respectively. Random error was added to the calculated CL values, assuming a normal distribution of error with either a 20% or a 30% coefficient of variation (CV), using the function RANDOM in Mathematica 4.0 (Wolfram Research, Champaign, IL). The b and r values were obtained by applying linear regression analyses to the log-log-transformed, error-containing CL versus BW data using the Mathematica function REGRESS. Ten scenarios with a variety of sampling regimens, covering different numbers of animal species (3-6) and various body weight ranges (5.6- to 3500-fold), were simulated (n = 100 per scenario). The simulations mimicked the sampling patterns commonly adopted in published interspecies pharmacokinetic studies.
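The original simulations were performed in Mathematica; the R fragment below is only a rough analogue of one scenario (all 6 species, 20% CV), using multiplicative normally distributed error as a stand-in for the error model described above:

    set.seed(1)
    BW <- c(0.02, 0.25, 2.5, 5, 14, 70)    # mouse, rat, rabbit, monkey, dog, human (kg)
    a <- 100; b <- 0.75; cv <- 0.20        # assigned true values and coefficient of variation

    one_run <- function() {
      CL  <- a * BW^b * (1 + rnorm(length(BW), 0, cv))   # error-free CL plus random error
      fit <- lm(log10(CL) ~ log10(BW))
      c(b = unname(coef(fit)[2]), r = cor(log10(CL), log10(BW)))
    }

    sims <- replicate(100, one_run())                    # n = 100 per scenario
    rowMeans(sims)                                       # mean simulated b and r
    quantile(sims["b", ], c(0.025, 0.975))               # spread of the 100 b estimates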
RESULTS
The allometric scaling parameters and their
statistics are listed in Table 1 . Of 115 compounds,
24 (21%) showed no correlation between clearance
and body weight; in other words, there was a lack
of statistical significance for the regression (P >
0.05). This generally occurred when only 3 species
were used. Among the remaining 91 cases, the mean ± standard deviation of the b values was 0.74 ± 0.16, with a wide range from 0.29 to 1.2 (Figure 1). The frequency distribution of the b values
appeared to be Gaussian. The mean significantly
differed from 0.67 (P < 0.001) but not from 0.75.
When the b value of each substance was tested
AAPS PharmSci 2001; 3 (4) article 29 (http://www.pharmsci.org/).
3
statistically against both 0.67 and 0.75, the majority
of the cases (81% and 98% at the level of
significance equal to 0.05 and 0.01, respectively)
failed to reject the null hypotheses raised against
both values (Table 1); in other words, individual b
values did not differ from either 0.67 or 0.75. The wide 95% and 99% CIs for b highlighted the uncertainty associated with the determination of b values in most studies.
The 10 animal groups studied by Monte Carlo
simulation had mean b values (n = 100 per
simulation) close to the assigned true value, 0.75
(Table 2). However, the 95% CI in the majority of
the scenarios failed to distinguish the expected
value 0.75 from 0.67. Only Scenario 3 at the level
of 20% CV excluded the possibility that b was 0.67
with 95% confidence. When the experimental error
was set at 30% CV, none of the simulations
distinguished between b values of 0.67 and 0.75
with 95% confidence. The mean r values ranged
from 0.925 to 0.996, suggesting that the simulated experiments with a 20% or a 30% CV in experimental error were not particularly noisy. The
frequency distributions of b values are shown in
Figure 2 .


Figure 1. The frequency distribution of the b values for the 91 xenobiotics that showed statistically significant correlation between log clearance (CL) and log body weight (BW) in Table 1. The frequency of the b values from 0.2 to 1.2, at an interval of 0.1, was plotted against the midpoint of each interval of b values. The dotted line represents a fitted Gaussian distribution curve. SD = standard deviation. [Histogram: allometric exponent (0.0 to 1.5) versus frequency (0 to 30); mean = 0.74, SD = 0.16, with a fitted normal curve.]
Table 2. Simulated b Values in Different Scenarios with Varied Body Weight Ranges

  Scenario*  Species                   Range**  b (95% CI), 20% CV†  b (95% CI), 30% CV†  r, 20% CV‡  r, 30% CV‡
  1          ms, rt, rb                125      0.75 (0.63-0.87)     0.74 (0.53-0.95)     0.996       0.986
  2          ms, rt, rb, mk            250      0.74 (0.64-0.84)     0.74 (0.58-0.91)     0.994       0.988
  3          ms, rt, rb, mk, dg        700      0.75 (0.67-0.83)     0.75 (0.62-0.88)     0.996       0.990
  4          ms, rt, rb, mk, dg, hm    3500     0.75 (0.69-0.81)     0.75 (0.62-0.88)     0.996       0.989
  5          rt, rb, mk                20       0.76 (0.57-0.94)     0.72 (0.29-1.2)      0.992       0.954
  6          rt, rb, mk, dg            56       0.75 (0.60-0.88)     0.73 (0.50-0.95)     0.990       0.968
  7          rt, rb, mk, dg, hm        280      0.75 (0.65-0.85)     0.76 (0.58-0.93)     0.992       0.980
  8          rb, mk, dg                5.6      0.80 (0.50-1.1)      0.74 (0.23-1.3)      0.974       0.925
  9          rb, mk, dg, hm            28       0.74 (0.58-0.90)     0.75 (0.47-1.0)      0.987       0.971
  10         mk, dg, hm                14       0.74 (0.50-0.98)     0.73 (0.44-1.0)      0.988       0.969

* ms: mouse, 0.02 kg; rt: rat, 0.25 kg; rb: rabbit, 2.5 kg; mk: monkey, 5 kg; dg: dog, 14 kg; hm: human, 70 kg.
** Range = maximum body weight/minimum body weight in each scenario.
† The mean b value with 95% confidence interval (in parentheses) was obtained from 100 simulations in which linear regression analyses were applied to the log-log-transformed CL versus BW data with either a 20% or a 30% coefficient of variation (CV) in clearance.
‡ The mean correlation coefficient (r) of linear regression from 100 simulated experiments per scenario.

[Figure 2 panels: frequency distributions of the simulated b values (x-axis: exponent, 0.35 to 1.35; y-axis: frequency, 0 to 100; gray bars: 20% CV, black bars: 30% CV), one panel for each species combination: mouse/rat/rabbit; mouse/rat/rabbit/monkey; mouse/rat/rabbit/monkey/dog; mouse/rat/rabbit/monkey/dog/human; rat/rabbit/monkey; rat/rabbit/monkey/dog; rat/rabbit/monkey/dog/human; rabbit/monkey/dog; rabbit/monkey/dog/human; monkey/dog/human. See the Figure 2 caption below.]


Figure 2 (previous page). The frequency distribution of the
simulated b values in the 10 scenarios where the number of
animal species and the range of body weight were varied. The b
values were obtained by applying linear regression analyses on
the log-log-transformed, error-containing clearance (CL) versus
body weight (BW) data with either a 20% (gray) or a 30% (black)
coefficient of variation (CV) in CL.
Figure 3. The relationship between normalized clearances (CL_normalized) and body weights (BW) for the 91 xenobiotics (n = 460) that showed statistically significant correlation between log CL and log BW in Table 1. The relationship follows the equation: log CL_normalized = 0.74 log BW + 0.015, r^2 = 0.917. The 99% confidence interval of the regression slope was 0.71 to 0.76. The different colors represent different subgroups of xenobiotics: red, protein; blue, xenobiotics that were eliminated mainly (>70%) by renal excretion; green, xenobiotics that were eliminated mainly (>70%) by metabolism; black, xenobiotics that were eliminated by both renal excretion and metabolism. The result for each subgroup can be viewed in the Web version by moving the cursor to each symbol legend.

Table 3. Summary of the Statistical Results in Figure 3

  Group*    No. of Xenobiotics   No. of Data Points   Slope, b   95% CI       99% CI
  1         9                    41                   0.78       0.73-0.83    0.72-0.84
  2         21                   105                  0.65       0.62-0.69    0.61-0.70
  3         39                   203                  0.75       0.72-0.78    0.70-0.79
  4         22                   111                  0.76       0.71-0.81    0.70-0.82
  Overall   91                   460                  0.74       0.72-0.76    0.71-0.76

Note: CI = confidence interval.
* Group 1 = protein; Group 2 = xenobiotics that were eliminated mainly by renal excretion; Group 3 = xenobiotics that were eliminated mainly by extensive metabolism; Group 4 = xenobiotics that were eliminated by both renal excretion and nonrenal metabolism.


Figure 3 shows the relationship between
normalized clearances and body weights (n = 460)
for the 91 xenobiotics that showed a statistically
significant correlation in Table 1 . The regression
slope was 0.74, and the 99% CI was 0.71 to 0.76.
The normalized clearances were divided into four
groups: 9 proteins (Group 1, n = 41), 21
compounds eliminated mainly via renal excretion
(Group 2, n = 105), 39 compounds eliminated
mainly via extensive metabolism (Group 3, n =
203), and 22 compounds eliminated by both renal
excretion and metabolism (Group 4, n = 111)
(Figure 3). The summary of the regression results
appears in Table 3 . While Groups 1, 3, and 4 had
b values close to 0.75 and significantly different
from 0.67 (P < 0.001), Group 2 had a b value close
to 0.67 and significantly different from 0.75 (P <
0.001).
DISCUSSION
Successful prediction of human clearance values
using allometric scaling and clearance values
measured in animals depends heavily on the
accuracy of the b value. Retrospective analysis of
published results for 115 substances indicated that
the commonly used experimental designs result in
considerable uncertainty for this parameter (Table
1).
CL values for 24 of the substances listed in Table 1
failed to follow the allometric equation at the 95%
confidence level. The failures appeared to result
from the following factors: (i) Only 3 species were
studied in 16 cases, which severely limited the
robustness of the statistics. In the remaining 8 failed
cases, 1 or more of the following occurred: (ii) the species were studied in different labs in 3 cases, (iii) small (n = 2) or unequal (n = 2-10) numbers of animals per species were studied in 4 cases, (iv) different dosages among species were used in 2 cases, and (v) high interspecies variability in UDP-glucuronosyltransferase activity was proposed in 1 case [75]. The failure of these 24 cases to follow the allometric equation therefore appeared, for the most part, to result from deficiencies in experimental design; in other words, failure of detection rather than failure of the particular substance's CL to follow the allometric relationship.
[Figure 3 plot: Normalized CL (0.001 to 1000) versus Body Weight (0.001 to 10,000 kg), both on logarithmic axes.]
How well did allometry applied to animal CL
values predict the human CL value? One indication
is how close the human CL value fell to the fitted
line. Of the 91 substances that followed the
allometric equation, 68 included human as 1 of the
species. In 41 cases, the human CL value fell below
the line, and in 27 cases it fell above (Figure 4).
The mean deviation was only 0.62%, and the
majority of deviations were less than 50%. It
therefore appeared that for most of the 68
substances studied with human as one of the
species, the human CL value did not deviate
systematically or extraordinarily from the fitted
allometric equation. The tendency, noted by others [10,12], for the human CL value to be lower than that predicted from animal CL values was therefore not apparent in this large data set.


Figure 4. The deviation between the fitted and the observed human clearance (CL) for 68 xenobiotics. The fitted human CL of each xenobiotic was obtained by applying linear regression to the log-log-transformed CL versus BW data from different animal species, including human. The deviation was calculated as 100 × (CL_observed − CL_fitted)/CL_fitted. The mean deviation was 0.62%.
The b values for the 91 substances that followed
the allometric equation appeared to be normally
distributed around a mean value of 0.74, but the
range of values was quite broad (Figure 1).
Although impossible to answer definitively with
these data, the question of whether there is a
"universal" b value is of interest. Does the
distribution shown in Figure 1 reflect a universal
value with deviation from the mean due to
measurement errors, or are there different b values
for the various mechanisms involved in clearance?
The Monte Carlo simulations indicated that
introduction of modest amounts of random error in
CL determinations (Figure 2) resulted in a
distribution of b values not unlike that shown in
Figure 1 . This result supported the possibility that
a universal b value operates and that the range of
values seen in Table 1 resulted from random error
in CL determination coupled with the uncertainty
that accrued from use of a limited number of
species. However, examination of subsets of the 91
substances segregated by elimination pathway
showed a b value around 0.75, except for
substances cleared primarily by the kidneys; the b
value for this subgroup was 0.65 (see below), and
the CI excluded a value larger than 0.70.
The central tendency of the b values is of interest,
particularly given the recent interest in the question
of whether basal metabolic rate scales with a b value of 0.67 or 0.75 [3,4,8,9]. When examined
individually, the 95% CI of the b values for most of
the 91 substances included both values, although
the mean for all the b values tended toward 0.75.
So that all CL values could be viewed together, a
normalization process was used that assumed a
common a value for all 91 substances, and CL
values were adjusted accordingly (Figure 3). Fit of
the allometric equation to this data set gave a b
value of 0.74, and its CI included 0.75 and
excluded 0.67. Normalized CL values were
randomly scattered about the line, with one
exception: In the body weight range 20 to 50 kg
(dog, minipig, sheep, and goat), the normalized CL
values generally fell above the line.
The 91 substances were segregated by molecular
size (protein) and by major elimination pathway
(renal excretion, metabolism, combination of both)
(Figure 3). With the exception of the renal
excretion subgroup, the normalized CL values for
the subgroups showed b values similar to the
combined group and their CIs included 0.75 and
excluded 0.67 (Table 3). The renal excretion
subgroup (21 substances and 105 CL values),
however, showed a b value of 0.65 with a CI that
excluded 0.75. This result was surprising as it
appeared to contradict b values of 0.77 reported for
both mammalian glomerular filtration rate and effective renal plasma flow [91-93], although it was consistent with a b value of 0.66 reported for intraspecies scaling of inulin-based glomerular filtration rate in humans [94] and with a b value of 0.69 for scaling creatinine clearance [95].
Whether the metabolic rate scales to the 2/3 or the
3/4 power of body weight has been the subject of
debate for many years. No consensus has been
reached. The surface law that suggested a
proportional relationship between the metabolic
rate and the body surface area was first
conceptualized in the 19th century. It has gained
support from empirical data [6,96] as well as statistical [6,9] and theoretical [6,97] results. In 1932, Kleiber's empirical analysis led to the 3/4-power law, which has recently been generalized as the quarter-power law by West et al [3,4]. Different theoretical analyses based on nutrient-supply networks [3,8] and 4-dimensional biology [4] all suggested that the quarter-power law is the universal scaling law in biology [98]. However, the claim of universality was challenged by Dodds et al [9], whose statistical and theoretical reanalyses could not exclude 0.67 as the scaling exponent of the basal metabolic rate.
The logic behind the pursuit of a universal law for
the scaling of energy metabolism across animal
species is mainly based on the assumption that an
optimal design of structure and function operates
across animal species [3,4,8,99-101]. Given the fact that
all mammals use the same energy source (oxygen)
and energy transport systems (cardiovascular,
pulmonary) and given the possibility that
evolutionary force may result in a design principle
that optimizes energy metabolism systems across
species, the existence of such a law might be
possible. However, available data and analyses
have not led to a conclusion.
A large body of literature data has indicated that the
allometric scaling relationship applies to the
clearance of a variety of xenobiotics. It has been
speculated that xenobiotic clearance is related to
metabolic rate, and clearance b values have
frequently been compared with either 0.67 or 0.75.
The b values obtained from the scaling of clearance
for a variety of xenobiotics tended to be scattered.
Our analysis indicated that the b values generally fell within a broad range between 0 and 1, or even higher. The scatter of b values may have resulted from the uncertainty that accrues in a regression analysis of a limited number of data points, as discussed above. In addition, the scatter may reflect variability in pharmacokinetic properties among different xenobiotics. This variability makes prediction of the b value extremely difficult and makes any discussion of the "universality" of the b value even less tenable. From the pharmacokinetic point of view, the lack of a unique b value for all drugs may be considered the norm; in this regard, uncertainty and variability are themselves the universal phenomenon. To determine whether a unique b value exists for the scaling of CL, a more rigorous experimental design is needed to control the uncertainty that may otherwise obscure the conclusion. Although a study that included CL data for a variety of drugs and a range of animal species comparable in scope to the basal metabolic rate literature might settle the question, such a study would be extremely unrealistic. Therefore, from the
perspective of pharmacokinetics where the drug is
the center of discussion, it is almost impossible to
address whether the b value of CL tended to be
dominated by 1 or 2 values. However, from the
perspective of physiology where the function of a
body is of interest, systematic analysis of currently
available data in interspecies scaling of CL may
provide some insight into the interspecies scaling of
energy metabolism. The rationale behind this line
of reasoning was that the elimination of a
xenobiotic from a body is a manifestation of
physiological processes such as blood flow and
oxygen consumption. Interestingly, the two
competitive exponent values, but not others, in
theorizing the scaling of energy metabolism
reappeared in our analysis. The value 0.75 appeared
to be the central tendency of the b values for the CL
of most compounds, except for that of drugs whose
elimination was mainly via kidney.
CONCLUSION
Whether allometric scaling could be used for the prediction of the first-time-in-man dose has been debated [102,103]. Figure 4 shows that a reasonable error range can be achieved when human CL is predicted from the animal data for some drugs.
However, the success shown in the retrospective
analysis does not necessarily warrant success in
prospective applications. As indicated by our analyses of the uncertainty in b values, and as illustrated in Bonate and Howard's commentary [102],
caution is needed when allometric scaling is
applied in a prospective manner. In addition, the
use of a deterministic equation in predicting
individual CL data may be questionable because
the intersubject variability cannot be accounted for.
Nevertheless, allometric scaling could be an
alternative tool, if the mean CL for a population is
to be estimated and if the uncertainty is adequately
addressed. When the uncertainty in the
determination of a b value is relatively large, a
fixed-exponent approach might be feasible. In this
regard, 0.75 might be used for substances that are
eliminated mainly by metabolism or by metabolism
and excretion combined, whereas 0.67 might apply
for drugs that are eliminated mainly by renal
excretion.
ACKNOWLEDGEMENTS
Teh-Min Hu is supported by a fellowship from
National Defense Medical Center, Taipei, Taiwan.
REFERENCES
1. Schmidt-Nielsen K. Scaling: Why Is Animal Size So Important?
Cambridge: Cambridge University Press, 1983.
2. Calder WA III. Size, Function and Life History. Cambridge, MA:
Harvard University Press, 1984.
3. West GB, Brown JH, Enquist BJ. A general model for the origin of
allometric scaling laws in biology. Science. 1997;276:122-126.
4. West GB, Brown JH, Enquist BJ. The fourth dimension of life:
Fractal geometry and allometric scaling of organisms. Science.
1999;284:1677-1679.
5. Kleiber M. Body size and metabolism. Hilgardia. 1932;6:315-353.
6. Heusner AA. Energy metabolism and body size. I. Is the 0.75 mass
exponent of Kleiber's equation a statistical artifact? Respir Physiol.
1982;48:1-12.
7. Feldman HA, McMahon TA. The 3/4 mass exponent for energy
metabolism is not a statistical artifact. Respir Physiol. 1983;52:149-
163.
8. Banavar JR, Maritan A, Rinaldo A. Size and form in efficient
transportation networks. Nature. 1999;399:130-132.
9. Dodds PS, Rothman DH, Weitz JS. Re-examination of the 3/4-
law of metabolism. J Theor Biol. 2001;209:9-27.
10.Boxenbaum H. Interspecies scaling, allometry, physiological time,
and the ground plan of pharmacokinetics. J Pharmacokin Biopharm.
1982;10:201-227.
11.Sawada Y, Hanano M, Sugiyama Y, Iga T. Prediction of disposition
of beta-lactam antibiotics in humans from pharmacokinetic parameters
in animals. J Pharmacokin Biopharm. 1984;12:241-261.
12.Mordenti J. Man versus beast: Pharmacokinetic scaling in
mammals. J Pharm Sci. 1986;75:1028-1040.
13.Mahmood I, Balian JD. Interspecies scaling: Prediction clearance of
drugs in humans. Three different approaches. Xenobiotica.
1996;26:887-895.
14.Feng MR, Lou X, Brown RR, Hutchaleelaha A. Allometric
pharmacokinetic scaling: Towards the prediction of human oral
pharmacokinetics. Pharm Res. 2000;17:410-418.
15.Mahmood I. Interspecies scaling of renally secreted drugs. Life Sci.
1998;63:2365-2371.
16.McGovren SP, Williams MG, Stewart JC. Interspecies comparison
of acivicin pharmacokinetics. Drug Metab Dispo. 1988;16:18-22.
17.Brazzell RK, Park YH, Wooldridge CB, et al. Interspecies
comparison of the pharmacokinetics of aldose reductase inhibitors.
Drug Metab Dispos. 1990;18:435-440.
18.Bjorkman S, Redke F. Clearance of fentanyl, alfentanil,
methohexitone, thiopentone and ketamine in relation to estimated
hepatic blood flow in several animal species: Application to prediction
of clearance in man. J Pharm Pharmacol. 2000;52:1065-1074.
19.Cherkofsky SC. 1-Aminocyclopropanecarboxylic acid: Mouse to
man interspecies pharmacokinetic comparisons and allometric
relationships. J Pharm Sci. 1995;84:1231-1235.
20.Robbie G, Chiou WL. Elucidation of human amphotericin B
pharmacokinetics: Identification of a new potential factor affecting
interspecies pharmacokinetic scaling. Pharm Res. 1998;15:1630-1636.
21.Paxton JW, Kim SN, Whitfield LR. Pharmacokinetic and toxicity
scaling of the antitumor agents amsacrine and CI-921, a new analogue,
in mice, rats, rabbits, dogs, and humans. Cancer Res. 1990;50:2692-
2697.
22.GreneLerouge NAM, Bazin-Redureau MI, Debray M, Schermann
JM. Interspecies scaling of clearance and volume of distribution for
digoxin-specific Fab. Toxicol Appl Pharmacol. 1996;138:84-89.
23.Lave T, Dupin S, Schmidt C, Chou RC, Jaeck D, Coassolo PH.
Integration of in vitro data into allometric scaling to predict hepatic
metabolic clearance in man: Application to 10 extensively metabolized
drugs. J Pharm Sci. 1997;86:584-590.
24.Bazin-Redureau M, Pepin S, Hong G, Debray M, Scherrmann JM.
Interspecies scaling of clearance and volume of distribution for horse
antivenom F(ab')2. Toxicol Appl Pharmacol. 1998;150:295-300.
25.Lashev LD, Pashov DA, Marinkov TN. Interspecies differences in
the pharmacokinetics of kanamycin and apramycin. Vet Res Comm.
1992;16:293-300.
26.Patel BA, Boudinot FD, Schinazi RF, Gallo JM, Chu CK.
Comparative pharmacokinetics and interspecies scaling of 3'-azido-3'-deoxythymidine (AZT) in several mammalian species. J Pharmacobio-
Dyn. 1990;13:206-211.
27.Kurihara A, Naganuma H, Hisaoka M, Tokiwa H, Kawahara Y.
Prediction of human pharmacokinetics of panipenem-betamipron, a
new carbapenem, from animal data. Antimicrob Ag Chemother.
1992;36:1810-1816.
28.Mehta SC, Lu DR. Interspecies pharmacokinetic scaling of BSH in
mice, rats, rabbits, and humans. Biopharm Drug Dispos. 1995;16:735-
744.
29.Bonati M, Latini R, Tognoni G. Interspecies comparison of in vivo
caffeine pharmacokinetics in man, monkey, rabbit, rat, and mouse.
Drug Metab Rev. 1984-85;15:1355-1383.
30.Kaye B, Brearley CJ, Cussans NJ, Herron M, Humphrey MJ,
Mollatt AR. Formation and pharmacokinetics of the active drug
candoxatrilat in mouse, rat, rabbit, dog and man following
AAPS PharmSci 2001; 3 (4) article 29 (http://www.pharmsci.org/).
9
administration of the prodrug candoxatril. Xenobiotica. 1997;27:1091-
1102.
31.Mordenti J, Chen SA, Moore JA, Ferraiolo BL, Green JD.
Interspecies scaling of clearance and volume of distribution data for
five therapeutic proteins. Pharm Res. 1991;8:1351-1359.
32.Sawada Y, Hanano M, Sugiyama Y, Iga T. Prediction of the
disposition of β-lactam antibiotics in humans from pharmacokinetic
parameters in animals. J Pharmacokinet Biopharm. 1984;12:241-261.
33.Matsushita H, Suzuki H, Sugiyama Y, et al. Prediction of the
pharmacokinetics of cefodizime and cefotetan in humans from
pharmacokinetic parameters in animals. J Pharmacobio-Dyn.
1990;13:602-611.
34.Mordenti J. Pharmacokinetic scale-up: Accurate prediction of
human pharmacokinetic profiles from animal data. J Pharm Sci.
1985;74:1097-1099.
35.Feng MR, Loo J, Wright J. Disposition of the antipsychotic agent
CI-1007 in rats, monkeys, dogs, and human cytochrome p450 2D6
extensive metabolizers: Species comparison and allometric scaling.
Drug Metab Dispos. 1998;26:982-988.
36.Hildebrand M. Inter-species extrapolation of pharmacokinetic data
of three prostacyclin-mimetics. Prostaglandins. 1994;48:297-312.
37.Ericsson H, Tholander B, Bjorkman JA, Nordlander M, Regardh
CG. Pharmacokinetics of new calcium channel antagonist clevidipine
in the rat, rabbit, and dog and pharmacokinetic/pharmacodynamic
relationship in anesthetized dogs. Drug Metab Dispo. 1999;27:558-564.
38.Sangalli L, Bortolotti A, Jiritano L, Bonati M. Cyclosporine
pharmacokinetics in rats and interspecies comparison in dogs, rabbits,
rats, and humans. Drug Metab Dispo. 1998;16:749-753.
39.Kim SH, Kim WB, Lee MG. Interspecies pharmacokinetic scaling
of a new carbapenem, DA-1131, in mice, rats, rabbits and dogs, and
prediction of human pharmacokinetics. Biopharm Drug Dispos.
1998;19:231-235.
40.Klotz U, Antonin K-H, Bieck PR. Pharmacokinetics and plasma
binding of diazepam in man, dog, rabbit, guinea pig and rat. J
Pharmacol Exp Ther. 1976;199:67-73.
41.Kaul S, Daudekar KA, Schilling BE, Barbhaiya RH. Toxicokinetics
of 2,3-deoxythymidine, stavudine (D4T). Drug Metab Dispos.
1999;27:1-12.
42.Sanwald-Ducray P, Dow J. Prediction of the pharmacokinetic
parameters of reduced-dolasetron in man using in vitro-in vivo and
interspecies allometric scaling. Xenobiotica. 1997;27:189-201.
43.Kawakami J, Yamamoto K, Sawada Y, Iga T. Prediction of brain
delivery of ofloxacin, a new quinolone, in the human from animal data.
J Pharmacokinet Biopharm. 1994;22:207-227.
44.Tsunekawa Y, Hasegawa T, Nadai M, Takagi K, Nabeshima T.
Interspecies differences and scaling for the pharmacokinetics of
xanthine derivatives. J Pharm Pharmacol. 1992;44:594-599.
45.Bregante MA, Saez P, Aramayona JJ, et al. Comparative
pharmacokinetics of enrofloxacin in mice, rats, rabbits, sheep, and
cows. Am J Vet Res. 1999;60:1111-1116.
46.Duthu GS. Interspecies correlation of the pharmacokinetics of
erythromycin, oleandomycin, and tylosin. J Pharm Sci. 1995;74:943-
946.
47.Efthymiopoulos C, Battaglia R, Strolin Benedetti M. Animal
pharmacokinetics and interspecies scaling of FCE 22101, a penem
antibiotic. J Antimicrob Chemother. 1991;27:517-526.
48.Jezequel SG. Fluconazole: Interspecies scaling and allometric
relationships of pharmacokinetic properties. J Pharm Pharmacol.
1994;46:196-199.
49.Segre G, Bianchi E, Zanolo G. Pharmacokinetics of flunoxaprofen
in rats, dogs, and monkeys. J Pharm Sci. 1988;77:670-673.
50.Khor SP, Amyx H, Davis ST, Nelson D, Baccanari DP, Spector T.
Dihydropyrimidine dehydrogenase inactivation and 5-fluorouracil
pharmacokinetics: Allometric scaling of animal data, pharmacokinetics
and toxicodynamics of 5-fluorouracil in humans. Cancer Chemother
Pharmacol. 1997;39:233-238.
51.Clark B, Smith DA. Metabolism and excretion of a chromone
carboxylic acid (FPL 52757) in various animal species. Xenobiotica.
1982;12:147-153.
52.Nakajima Y, Hattori K, Shinsei M, et al. Physiologically-based
pharmacokinetic analysis of grepafloxacin. Biol Pharm Bull.
2000;23:1077-1083.
53.Baggot JD. Application of interspecies scaling to the bispyridinium
oxime HI-6. Am J Vet Res. 1994;55:689-691.
54.Lave T, Levet-Trafit B, Schmitt-Hoffmann AH, et al. Interspecies
scaling of interferon disposition and comparison of allometric scaling
with concentration-time transformations. J Pharm Sci. 1995;84:1285-
1290.
55.Sakai T, Hamada T, Awata N, Watanabe J. Pharmacokinetics of an
antiallergic agent, 1-(2-ethoxyethyl)-2-(hexahydro-4-methyl-1H-1,4-
diazepin-1-yl)-1H-benzimidazole difumarate (KG-2413) after oral
administration: Interspecies differences in rats, guinea pigs and dogs. J
Pharmacobio-Dyn. 1989;12:530-536.
56.Lave T, Saner A, Coassolo P, Brandt R, Schmitt-Hoffman AH,
Chou RC. Animal pharmacokinetics and interspecies scaling from
animals to man of lamifiban, a new platelet aggregation inhibitor. J
Pharm Pharmacol. 1996;48:573-577.
57.Richter WF, Gallati H, Schiller CD. Animal pharmacokinetics of
the tumor necrosis factor receptor-immunoglobulin fusion protein
lenercept and their extrapolation to humans. Drug Metab Dispos.
1999;27:21-25.
58.Lapka R, Rejholec V, Sechser T, Peterkova M, Smid M.
Interspecies pharmacokinetic scaling of metazosin, a novel alpha-
adrenergic antagonist. Biopharm Drug Dispos. 1989;10:581-589.
59.Ahr H-J, Boberg M, Brendel E, Krause HP, Steinke W.
Pharmacokinetics of miglitol: Absorption, distribution, metabolism,
and excretion following administration to rats, dogs, and man. Arzneim
Forsch. 1997;47:734-745.
60.Siefert HM, Domdey-Bette A, Henninger K, Hucke F, Kohlsdorfer
C, Stass HH. Pharmacokinetics of the 8-methoxyquinolone,
moxifloxacin: A comparison in humans and other mammalian species.
J Antimicrob Chemother. 1999;43 (Suppl. B):69-76.
61.Lave T, Portmann R, Schenker G, et al. Interspecies
pharmacokinetic comparisons and allometric scaling of napsagatran, a
low molecular weight thrombin inhibitor. J Pharm Pharmacol.
1999;51:85-91.
62.Higuchi S, Shiobara Y. Comparative pharmacokinetics of
nicardipine hydrochloride, a new vasodilator, in various species.
Xenobiotica. 1980;10:447-454.
63.Mitsuhashi Y, Sugiyama Y, Ozawa S, et al. Prediction of ACNU
plasma concentration-time profiles in humans by animal scale-up.
Cancer Chemother Pharmacol. 1990;27:20-26.
64.Yoshimura M, Kojima J, Ito T, Suzuki J. Pharmacokinetics of
nipradilol (K-351), a new antihypertensive agent. I. Studies on
interspecies variation in laboratory animals. J Pharmacobio-Dyn.
1985;8:738-750.
65.Gombar CT, Harrington GW, Pylypiw HM Jr, et al. Interspecies
scaling of the pharmacokinetics of N-nitrosodimethylamine. Cancer
Res. 1990;50:4366-4370.
66.Mukai H, Watanabe S, Tsuchida K, Morino A. Pharmacokinetics of
NS-49, a phenethylamine class α1A-adrenoceptor agonist, at therapeutic
doses in several animal species and interspecies scaling of its
pharmacokinetic parameters. Int J Pharm. 1999;186:215-222.
67.Owens SM, Hardwick WC, Blackall D. Phencyclidine
pharmacokinetic scaling among species. J Pharmacol Exp Ther.
1987;242:96-101.
68.Ishigami M, Saburomaru K, Niino K, et al. Pharmacokinetics of
procaterol in the rat, rabbit, and beagle dog. Arzneim Forsch.
1979;29:266-270.
69.Khor AP, McCarthy K, DuPont M, Murray K, Timony G.
Pharmacokinetics, pharmacodynamics, allometry, and dose selection of
rPSGL-Ig for phase I trial. J Pharmacol Exp Ther. 2000;293:618-624.
70.Mordenti J, Osaka G, Garcia K, Thomsen K, Licko V, Meng G.
Pharmacokinetics and interspecies scaling of recombinant human factor
VIII. Toxicol Appl Pharmacol. 1996;136:75-78.
71.Coassolo P, Fischli W, Clozel J-P, Chou RC. Pharmacokinetics of
remikiren, a potent orally active inhibitor of human renin, in rat, dog,
and primates. Xenobiotica. 1996;26:333-345.
72.Widman M, Nilsson LB, Bryske B, Lundstrom J. Disposition of
remoxipride in different species. Arzneim Forsch. 1993;43:287-297.
73.Lashev L, Pashov D, Kanelov I. Species specific pharmacokinetics
of rolitetracycline. J Vet Med A. 1995;42:201-208.
74.Herault JP, Donat F, Barzu T, et al. Pharmacokinetic study of three
synthetic AT-binding pentasaccharides in various animal species-
extrapolation to humans. Blood Coagul Fibrinol. 1997;8:161-167.
75.Ward KW, Azzarano LM, Bondinell WE, et al. Preclinical
pharmacokinetics and interspecies scaling of a novel vitronectin
receptor antagonist. Drug Metab Dispos. 1999;27:1232-1241.
76.Lin C, Gupta S, Loebenberg D, Cayen MN. Pharmacokinetics of an
everninomicin (SCH 27899) in mice, rats, rabbits, and cynomolgus
monkeys following intravenous administration. Antimicrob Ag
Chemother. 2000;44:916-919.
77.Chung M, Radwanski E, Loebenberg D, et al. Interspecies
pharmacokinetic scaling of Sch 34343. J Antimicrob Chemother.
1985;15 (Suppl. C):227-233.
78.Hinderling PH, Dilea C, Koziol T, Millington G. Comparative
kinetics of sematilide in four species. Drug Metab Dispos. 1993;21:662-669.
79.Walker DK, Ackland MJ, James GC, et al. Pharmacokinetics and
metabolism of sildenafil in mouse, rat, rabbit, dog, and man.
Xenobiotica. 1999;29:297-310.
80.Brocks DR, Freed MI, Martin DE, et al. Interspecies
pharmacokinetics of a novel hematoregulatory peptide (SK&F 107647)
in rats, dogs, and oncologic patients. Pharm Res. 1996;13:794-797.
81.Cosson VF, Fuseau E, Efthymiopoulos C, Bye A. Mixed effect
modeling of sumatriptan pharmacokinetics during drug development. I:
Interspecies allometric scaling. J Pharmacokin Biopharm. 1997;25:149-
167.
82.Leusch A, Troger W, Greischel A, Roth W. Pharmacokinetics of the
M1-agonist talsaclidine in mouse, rat, rabbit, and monkey, and
extrapolation to man. Xenobiotica. 2000;30:797-813.
83.van Hoogdalem EJ, Soeishi Y, Matsushima H, Higuchi S.
Disposition of the selective α1A-adrenoceptor antagonist tamsulosin in
humans: Comparison with data from interspecies scaling. J Pharm Sci.
1997;86:1156-1161.
84.Cruze CA, Kelm GR, Meredith MP. Interspecies scaling of
tebufelone pharmacokinetic data and application to preclinical
toxicology. Pharm Res. 1995;12:895-901.
85.Gaspari F, Bonati M. Interspecies metabolism and pharmacokinetic
scaling of theophylline disposition. Drug Metab Rev. 1990;22:179-207.
86.Davi H, Tronquet C, Calx J, et al. Disposition of tiludronate (Skelid)
in animals. Xenobiotica. 1999;29:1017-1031.
87.Pahlman I, Kankaanranta S, Palmer L. Pharmacokinetics of
tolterodine, a muscarinic receptor antagonist, in mouse, rat and dog.
Arzneim Forsch. 2001;51:134-144.
88.Tanaka E, Ishikawa A, Horie T. In vivo and in vitro trimethadione
oxidation activity of the liver from various animal species including
mouse, hamster, rat, rabbit, dog, monkey and human. Human Exp
Toxicol. 1999;18:12-16.
89.Izumi T, Enomoto S, Hosiyama K, et al. Prediction of the human
pharmacokinetics of troglitazone, a new and extensively metabolized
antidiabetic agent, after oral administration, with an animal scale-up
approach. J Pharmacol Exp Ther. 1996;277:1630-1641.
90.Grindel JM, O'Neil PG, Yorgey KA, et al. The metabolism of
zomepirac sodium I. Disposition in laboratory animals and man. Drug
Metab Dispos. 1980;8:343-348.
91.Singer MA, Morton AR. Mouse to elephant: Biological scaling and
Kt/V. Am J Kidney Dis. 2000;35:306-309.
92.Singer MA. Of mice and men and elephants: Metabolic rate sets
glomerular filtration rate. Am J Kidney Dis. 2001;37:164-178.
93.Edwards NA. Scaling of renal functions in mammals. Comp
Biochem Physiol. 1975;52A:63-66.
94.Hayton WL. Maturation and growth of renal function: Dosing
renally cleared drugs in children. AAPS PharmSci. 2000;2(1), article
3.
95.Adolph EF. Quantitative relations in the physiological constituents
of mammals. Science. 1949;109:579-585.
96.Rubner M. Über den Einfluss der Körpergrösse auf Stoff- und Kraftwechsel. Z Biol. 1883;19:535-562.
97.Heusner A. Energy metabolism and body size. II. Dimensional
analysis and energetic non-similarity. Resp Physiol. 1982;48:13-25.
98.West GB. The origin of universal scaling laws in biology. Physica
A. 1999;263:104-113.
99.Murray CD. The physiological principle of minimum work. I. The
vascular system and the cost of blood volume. Proc Natl Acad Sci U S
A. 1926;12:207-214.
100. Cohn DL. Optimal systems: I. The vascular system. Bull Math
Biophys. 1954;16:59-74.
101. Cohn DL. Optimal systems: II. The vascular system. Bull Math
Biophys. 1955;17:219-227.
102. Bonate PL, Howard D. Prospective allometric scaling: Does the
emperor have clothes? J Clin Pharmacol. 2000;40:665-670.
103. Mahmood I. Critique of prospective allometric scaling: Does the
emperor have clothes? J Clin Pharmacol. 2000;40:671-674.



Table 1. Allometric Scaling Parameters Obtained from Linear Regressions of the Log-Log-Transformed CL versus BW Data of 115 Xenobiotics (a: allometric coefficient; b: allometric exponent)

Compounds    a    b    r^2 (i)    P (ii)    95% CI of b    99% CI of b    Species (vii)    Ref
Acivin 3.9 0.57 0.976 *** 0.45 - 0.70 (iii) 0.37 - 0.78 ms, rt, mk, dg, hm 16
AL01567 0.41 0.93 0.834 * 0.17 - 1.7 n.d. rt, mk, dg, cz, hm 17
AL01576 0.36 1.1 0.955 ** 0.75 - 1.4 (iv) 0.54 - 1.6 rt, mk, cz, hm 17
AL01750 0.39 0.98 0.829 * 0.16 - 1.8 n.d. rt, dg, mk, cz 17
Alfentanil 47 0.75 0.975 *** 0.59 - 0.92 0.48 - 1.0 rt, rb, dg, sh 18
1-Aminocyclopropanecarboxylate 2.6 0.72 0.902 * 0.28 - 1.2 n.d. ms, rt, mk, hm 19
Amphotericin B 0.94 0.84 0.988 *** 0.77 - 0.91 (v) 0.74 - 0.94 (iv) ms, rt, rb, dg, hm 20
Amsacrine 38 0.46 0.906 * 0.19 - 0.73 n.d. ms, rt, rb, dg, hm 21
Anti-digoxin Fab 1.0 0.67 0.992 0.06 n.d. (vi) n.d. ms, rt, rb 22
Antipyrine 6.9 0.57 0.716 0.15 n.d. n.d. rt, rb, dg, hm 23
Antivenom Fab2 0.033 0.53 0.990 0.06 n.d. n.d. ms, rt, rb 24
Apramycin 2.8 0.80 0.924 ** 0.38 - 1.2 0.028 - 1.6 sh, rb, ck, pn 25
AZT 26 0.96 0.982 ** 0.72 - 1.2 (iv) 0.52 - 1.4 ms, rt, mk, dg, hm 26
Betamipron 16 0.69 0.975 *** 0.53 - 0.84 0.43 - 0.94 ms, gp, rt, rb, mk, dg 27
Bosentan 25 0.56 0.663 * 0.006 - 1.1 n.d. ms, mt, rt, rb, hm 23
BSH 2.1 0.68 0.945 * 0.028 - 0.18 n.d. ms, rt, rb, hm 28
Caffeine 6.3 0.74 0.981 ** 0.55 - 0.93 0.39 - 1.1 ms, rt, rb, mk, hm 29
Candoxatrilat 9.6 0.66 0.986 *** 0.52 - 0.81 0.39 - 0.93 ms, rt, rb, dg, hm 30
CD4-IgG 0.10 0.74 0.959 * 0.27 - 1.2 n.d. rt, rb, mk, hm 31
Cefazolin 4.5 0.68 0.975 *** 0.52 - 0.83 0.43 - 0.93 ms, rt, rb, dg, mk, hm 32
Cefmetazole 12 0.59 0.917 ** 0.35 - 0.84 0.18 - 1.0 ms, rt, rb, dg, mk, hm 32
Cefodizime 1.5 1.0 0.926 ** 0.48 - 1.5 0.047 - 1.9 ms, rt, rb, dg, mk 33
Cefoperazone 6.7 0.57 0.823 * 0.20 - 0.94 n.d. ms, rt, rb, dg, mk, hm 32
Cefotetan 6.3 0.53 0.849 ** 0.22 - 0.84 0.016 - 1.0 ms, rt, rb, dg, mk, hm 32
Cefpiramide 4.1 0.40 0.589 0.07 n.d. n.d. ms, rt, rb, dg, mk, hm 32
Ceftizoxime 11 0.57 0.986 ** 0.37 - 0.78 0.10 - 1.1 ms, rt, mk, dg 34
CI-1007 35 0.90 0.998 * 0.44 - 1.4 n.d. rt, mk, dg 35
CI-921 15 0.51 0.830 * 0.085 - 0.93 n.d. ms, rt, rb, dg, hm 21
Cicaprost 37 0.83 0.956 *** 0.59 - 1.1 0.42 - 1.2 ms, rt, rb, mk, pg, hm 36
Clevidipine 288 0.84 0.985 0.07 n.d. n.d. rt, rb, dg 37


Table 1. (continued)


Compounds    a    b    r^2 (i)    P (ii)    95% CI of b    99% CI of b    Species (vii)    Ref
Cyclosporin 5.8 0.99 0.931 * 0.17 - 1.8 n.d. rt, rb, dg, hm 38
DA-1131 11 0.81 0.995 *** 0.71 - 0.93 (iv) 0.61 - 1.0 ms, rt, rb, dg, hm 39
Diazepam 89 0.2 0.135 0.5 n.d. n.d. rt, gp, rb, dg, hm 40
Didanosine 33 0.76 0.971 ** 0.52 - 1.0 0.32 - 1.2 ms, rt, mk, dg, hm 41
Dolasetron 74 0.73 0.950 * 0.22 - 1.2 n.d. rt, mk, dg, hm 42
Enoxacin 36 0.43 0.874 * 0.13 - 0.73 (iii) n.d. ms, rt, mk, dg, hm 43
Enprofylline 6.0 0.72 0.852 ** 0.30 - 1.1 0.028 - 1.4 ms, rt, gp, rb, dg, hm 44
Enrofloxacin 23 0.77 0.972 ** 0.53 - 1.0 0.33 - 1.2 ms, rt, rb, sh, cw 45
Eptaloprost 115 0.83 0.985 0.08 n.d. n.d. rt, mk, hm 36
Erythromycin 37 0.66 0.966 *** 0.49 - 0.83 0.37 - 0.94 ms, rt, rb, dg, hm, cw 46
FCE22101 11 0.76 0.909 * 0.027 - 1.5 n.d. rt, rb, mk, dg 47
Fentanyl 60 0.88 0.990 0.06 n.d. n.d. rt, dg, pg 18
Fluconazole 1.2 0.70 0.992 *** 0.63 - 0.77 0.58 - 0.82 ms, rt, gp, rb, ct, dg, hm 48
Flunoxaprofen 0.98 1.0 0.925 0.2 n.d. n.d. rt, dg, mk 49
5-Fluorouracil 7.6 0.74 0.991 ** 0.52 - 0.95 0.24 - 1.2 ms, rt, dg, hm 50
FPL-52757 0.91 0.62 0.973 ** 0.43 - 0.81 0.28 - 0.97 rt, rb, mk, dg, hm 51
Grepafloxacin 15 0.64 0.886 0.06 n.d. n.d. rt, rb, mk, dg 52
HI-6 9.8 0.76 0.972 *** 0.61 - 0.91 0.53 - 0.99 ms, rt, rb, mk, dg, sh, hm 53
Iloprost 48 0.85 0.970 *** 0.64 - 1.1 0.51 - 1.2 ms, rt, rb, dg, pg, hm 36
Interferon 3.7 0.71 0.980 ** 0.52 - 0.90 0.36 - 1.1 ms, rt, rb, dg, mk 54
Kanamycin 2.9 0.81 0.970 *** 0.61 - 1.0 0.48 - 1.1 sh, gt, rb, ck, pn 25
Ketamine 119 0.56 0.632 0.1 n.d. n.d. rt, rb, pg 18
KG-2413 610 1.1 0.741 0.3 n.d. n.d. rt, gp, dg 55
Lamifiban 6.1 0.88 0.887 0.2 n.d. n.d. rt, dg, mk 56
Lamivudine 15 0.75 0.991 ** 0.53 - 0.97 0.24 - 1.3 rt, mk, dg, hm 41
Lenercept 0.0079 1.1 0.998 ** 0.90 - 1.2 (v) 0.71 - 1.4 (iv) rt, rb, mk, dg 57
Lomefloxacin 10 0.79 0.992 *** 0.66 - 0.92 0.56 - 1.0 ms, rt, mk, dg, hm 46
Metazosin 11 0.29 0.973 * 0.15 - 0.44 n.d. ms, rt, rb, hm 58
Methohexitone 73 0.86 0.997 * 0.26 - 1.5 n.d. rt, rb, dg 18
Mibefradil 62 0.62 0.923 ** 0.29 - 0.95 0.018 - 1.2 rt, mt, rb, dg, hm 23
Midazolam 67 0.68 0.850 * 0.15 - 1.2 n.d. rt, rb, dg, pg, hm 23



Table 1. (continued)

Compounds    a    b    r^2 (i)    P (ii)    95% CI of b    99% CI of b    Species (vii)    Ref
Miglitol 7.4 0.64 0.998 * 0.31 - 0.97 n.d. rt, dg, hm 59
Mofarotene 14 0.84 0.983 ** 0.51 - 1.2 n.d. ms, rt, dg, hm 23
Moxalactam 5.0 0.66 0.992 *** 0.58 - 0.74 (iii) 0.53 - 0.79 ms, rt, rb, dg, mk, hm 32
Moxifloxacin 20 0.56 0.949 *** 0.38 - 0.74 (iii) 0.26 - 0.86 ms, rt, mk, dg 60
Napsagatran 50 0.74 0.842 0.08 n.d. n.d. rt, rb, dg, mk 61
Nicardipine 69 0.55 0.962 *** 0.40 - 0.70 (iii) 0.30 - 0.80 rt, dg, mk, hm 62
Nimustine 42 0.83 0.968 ** 0.55 - 1.1 0.32 - 1.3 ms, rt, rb, dg, hm 63
Nipradilol 59 0.66 0.796 * 0.047 - 1.3 n.d. rt, rb, mk, dg 64
N-Nitrosodimethylamine 59 0.93 0.972 *** 0.75 - 1.1 (iv) 0.65 - 1.2 ms, hr, rt, rb, mk, dg, pg 65
Norfloxacin 81 0.77 0.893 * 0.28 - 1.3 n.d. ms, rt, mk, dg, hm 43
NS-49 14 0.64 0.994 0.05 n.d. n.d. rt, rb, dg 66
Ofloxacin 7.5 0.64 0.946 * 0.17 - 1.1 n.d. rt, mk, dg, hm 43
Oleandomycin 30 0.69 0.996 ** 0.55 - 0.83 0.36 - 1.0 ms, rt, dg, hm 46
Panipenem 12 0.61 0.977 *** 0.48 - 0.74 (iii) 0.39 - 0.82 ms, gp, rt, rb, mk, dg 27
Pefloxacin 13 0.57 0.910 * 0.24 - 0.90 n.d. ms, rt, mk, dg, hm 43
Phencyclidine 52 0.64 0.891 ** 0.33 - 0.95 0.12 - 1.1 ms, rt, pn, mk, dg, hm 67
Procaterol 29 0.80 0.992 0.06 n.d. n.d. rt, rb, dg 68
Propranolol 98 0.64 0.81 0.10 n.d. n.d. rt, rb, dg, hm 23
P-selectin glycoprotein ligand-1 0.0060 0.93 0.939 ** 0.49 - 1.4 0.13 - 1.7 ms, rt, mk, pg 69
Recombinant CD4 3.4 0.65 0.995 ** 0.50 - 0.79 0.31 - 0.98 rt, rb, mk, hm 31
Recombinant growth hormone 6.8 0.71 0.995 ** 0.55 - 0.87 0.34 - 1.1 ms, rt, mk, hm 31
Recombinant human factor VIII 0.16 0.71 0.999 * 0.45 - 0.97 n.d. ms, rt, hm 70
Relaxin 6.0 0.80 0.992 *** 0.66 - 0.93 0.55 - 1.0 ms, rt, rb, mk, hm 31
Remikiren 50 0.67 0.898 * 0.26 - 1.1 n.d. rt, dg, mt, mk, 71
Remoxipride 29 0.42 0.710 0.07 n.d. n.d. ms, rt, hs, dg, hm 72
Ro 24-6173 69 0.64 0.976 * 0.33 - 0.95 n.d. rt, rb, dg, hm 23
Rolitetracycline 11 0.89 0.989 *** 0.72 - 1.1 (iv) 0.58 - 1.2 rb, pg, pn, ck 73
Sanorg 32701 0.35 0.87 0.979 0.09 n.d. n.d. rt, rb, bb 74
SB-265123 15 0.80 0.812 0.1 n.d. n.d. ms, rt, mk, dg 75
Sch 27899 0.78 0.62 0.966 * 0.27 - 0.98 n.d. ms, rt, rb, mk 76
Sch 34343 13 0.77 0.924 *** 0.51 - 1.0 0.37 - 1.2 ms, rt, mk, rb, dg, hm 77





Table 1. (continued)

Compounds    a    b    r^2 (i)    P (ii)    95% CI of b    99% CI of b    Species (vii)    Ref
Sematilide 20 0.66 0.982 ** 0.39 - 0.94 0.034 - 1.3 rt, rb, dg, hm 78
Sildenafil 28 0.66 0.999 *** 0.59 - 0.73 (iii) 0.51 - 0.81 ms, rt, dg, hm 79
SK&F107647 7.2 0.63 0.964 0.1 n.d. n.d. rt, dg, hm 80
SR 80027 0.10 0.53 0.990 0.06 n.d. n.d. rt, rb, bb 74
SR90107A 0.68 0.55 0.978 * 0.30 - 0.79 n.d. rt, rb, bb, hm 74
Stavudine 19 0.84 0.993 *** 0.71 - 0.97 (iv) 0.60 - 1.1 ms, rt, mk, rb, hm 41
Sumatriptan 32 0.84 0.973 * 0.42 - 1.3 n.d. rt, rb, dg, hm 81
Talsaclidine 37 0.63 0.971 * 0.30 - 0.97 n.d. ms, rt, mk, hm 82
Tamsulosin 61 0.59 0.993 0.05 n.d. n.d. rt, rb, dg 83
Tebufelone 31 0.79 0.963 * 0.32 - 1.3 n.d. rt, mk, dg, hm 84
Theophylline 1.9 0.81 0.950 *** 0.64 - 0.98 0.57 - 1.1 rt, gp, rb, ct, pg, hs, hm 85
Thiopentone 3.5 1.0 0.874 ** 0.57 - 1.4 0.32 - 1.7 rt, rb, dg, sh 18
Tiludronate 1.5 0.56 0.977 ** 0.40 - 0.71 0.27 - 0.84 ms, rt, rb, dg, bb 86
Tissue-plasminogen activator 17 0.84 0.986 *** 0.72 - 0.95 (iv) 0.66 - 1.0 ms, hs, rt, rb, mk, dg, hm 23
Tolcapone 12 0.65 0.927 * 0.095 - 1.2 n.d. rt, rb, dg, hm 27
Tolterodine 62 0.62 0.978 * 0.34 - 0.90 n.d. ms, rt, dg, hm 87
Tosufloxacin 64 0.80 0.919 * 0.36 - 1.24 n.d. ms, rt, mk, dg, hm 43
Trimethadione 4.1 0.70 0.942 *** 0.50 - 0.90 0.39 - 1.0 ms, hs, rt, rb, dg, mk, hm 88
Troglitazone 12 0.81 0.988 ** 0.54 - 1.1 0.19 - 1.4 ms, rt, mk, dg 89
Tylosin 54 0.69 0.993 0.053 n.d. n.d. rt, dg, cw 48
Zalcitabine 15 0.82 0.983 *** 0.62 - 1.0 0.45 - 1.2 ms, rt, ct, mk, hm 41
Zidovudine 26 0.95 0.981 ** 0.71 - 1.2 (iv) 0.51 - 1.4 ms, rt, mk, dg, hm 41
Zomepirac 1.6 1.2 0.902 ** 0.63 - 1.7 0.28 - 2.0 ms, rt, rb, hs, mk, hm 90

Note: CL = clearance, BW = body weight, CI = confidence interval.
(i) Coefficient of determination.
(ii) Statistical testing against b = 0: P < 0.05 (*); P < 0.01 (**); P < 0.001 (***).
(iii) Excluding b = 0.75.
(iv) Excluding b = 0.67.
(v) Excluding both b = 0.75 and b = 0.67.
(vi) n.d.: not determined because of a lack of correlation between CL and BW at the significance level α = 0.05 (column 6) and α = 0.01 (column 7).
(vii) rt, rat; rb, rabbit; bb, baboon; mk, monkey; dg, dog; hm, human; ms, mouse; cz, chimpanzee; sh, sheep; ck, chicken; pn, pigeon; gp, guinea pig; pg, pig; ct, cat; cw, cow; gt, goat; mt, marmoset; hs, hamster.
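
For readers working in R, the sketch below shows, with made-up clearance and body-weight values (not data from any of the studies cited above), how the quantities reported in each row of Table 1 arise: a linear regression of log(CL) on log(BW) yields the allometric coefficient a (from the intercept), the exponent b (the slope), r^2, the P-value for testing b = 0, and the confidence intervals of b.

# Minimal sketch (hypothetical data) of the log-log regression behind Table 1.
BW <- c(0.02, 0.25, 3.0, 12, 70)        # body weights (kg): assumed values, mouse ... human
CL <- c(0.9, 7.5, 55, 170, 700)         # clearances (mL/min): made-up values

fit <- lm(log10(CL) ~ log10(BW))         # log10(CL) = log10(a) + b * log10(BW)
b <- coef(fit)[2]                        # allometric exponent b (slope)
a <- 10^coef(fit)[1]                     # allometric coefficient a (back-transformed intercept)
summary(fit)$r.squared                   # r^2
summary(fit)$coefficients[2, 4]          # P-value for testing b = 0
confint(fit, level = 0.95)[2, ]          # 95% CI of b
confint(fit, level = 0.99)[2, ]          # 99% CI of b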


Measures of Center

If X is a random variable (e.g., age) defined on a specific population, it has a certain theoretical distribution
of values in that population, with definite characteristics. One of these is a "center" around which the
distribution is located; another is a "spread" corresponding to the amount of its variability. (There are some
distributions, such as the Cauchy distribution, for which this is not the case, but they are infrequently
encountered in practice.) Moreover, these two objects are often independent of one another: knowing one
gives no information about the other. (Again, there are some important exceptions.) Of course, "center" and
"spread" are vague, informal terms that require clarification. Furthermore, even with precise definitions
(given later), it is usually impossible to measure every individual in the population, so these population
characteristics, or parameters, are typically unknown quantities, even though they exist. However, we can
begin to approach an understanding of their meanings, using various estimates based on random sample data.
These parameter estimators, the so-called sample characteristics or statistics, are entirely computable from
the data values, hence known. (They will, of course, differ from sample to sample, but let us assume a single
random sample for now.)

Suppose the collection {x_1, x_2, x_3, ..., x_n} represents a random sample of n measurements of the variable X.
For the sake of simplicity, we will also assume that these data values are sorted from low to high.
(Duplicates may also occur; two individuals could be the same age, for example.) There are three main
measures of center used in practice, each representing what might be thought of as an estimate of a
"typical" value in the population. These are listed below with some of their most basic properties. (The
most common measure of spread, the sample standard deviation, will be discussed in lecture.)

sample mode
This is simply that data value which occurs most often, i.e., has the largest frequency. It gives
some information, but is rather crude. A distribution with exactly one mode is called
unimodal (such as the bell curve); a bimodal distribution has two modes, at least locally,
and could be thought of as two unimodal distributions (which could have unequal heights)
that are blended together. This suggests that the sample consists mostly of two distinct
subgroups which differ in the variable being measured, e.g., the ages of infants and geriatrics.

sample median
This is the value that divides the dataset into two equally-sized halves. That is, half the data
values are below the median, and half are above. As the data have been sorted, this is
particularly easy to find. If the sample size n is odd, there will be exactly one data value
located at the exact middle (in position number (n + 1)/2). However, this will not be the case if
n is even, so here the median is defined as the average of the two data values that bracket the
exact middle (the value in position n/2 to its immediate left, and the value in position n/2 + 1
to its immediate right).

The median is most useful as a measure of center if there are so-called outliers in the data,
i.e., extreme values. (Again, there is a formal definition, which we will not pursue here.)
For instance, in a dataset of company employee salaries that happens to include the CEO, the
median would be a more accurate representative of a typical salary than, say, the average.

sample mean
The calculation and properties of this most common measure of center will be discussed in
detail in lecture. In a perfectly symmetric distribution (such as a bell curve), the mean and
median are exactly equal to each other. However, the presence of outliers at either end of the
distribution will tend to pull the mean in that direction, while having little or no effect on
the median. The result is an asymmetric distribution with a negatively skewed tail (or "skewed
to the left") if mean < median, or a positively skewed tail (or "skewed to the right") if
mean > median.
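
As a quick illustration of these three measures, and of the effect of a single outlier, here is a minimal R sketch using made-up salary data (in $1000s); R has no built-in function for the sample mode, so one is improvised from a frequency table.

# Hypothetical salaries, with and without one extreme value (the CEO).
salaries <- c(38, 41, 41, 45, 47, 52, 60)
with.ceo <- c(salaries, 400)

mean(salaries);  median(salaries)          # mean and median are close
mean(with.ceo);  median(with.ceo)          # the mean is pulled far to the right; the median barely moves

# Sample mode via a frequency table: the value(s) with the largest frequency.
freq <- table(salaries)
as.numeric(names(freq)[freq == max(freq)])  # mode = 41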


A Few Words on Mathematical Notation

Every grade school pupil knows that the average (i.e., sample mean) of a set of values is computed by
adding them, and then dividing this sum by the number of values in the set. As mathematical procedures
such as this become more complex, however, it becomes necessary to devise a succinct way to express
them, rather than writing them out in words. Proper mathematical notation allows us to do this. First, as
above, if we express a generic set of n values as {x_1, x_2, x_3, ..., x_n}, then their sum can be written as
x_1 + x_2 + x_3 + ... + x_n. [Note: The ellipsis (...) indicates the x-values that are in between, but not explicitly
written.] But even this shorthand eventually becomes too cumbersome, so mathematicians have created
a standard way to rewrite an expression like this, using so-called sigma notation. The sum
x_1 + x_2 + x_3 + ... + x_n can be abstractly, but more conveniently, expressed as

    Σ_{i=1}^{n} x_i ,

which we now dissect. The symbol Σ is the uppercase Greek letter sigma (equivalent to the letter S in
English) and stands for summation (i.e., addition). The objects being summed are the values
x_1, x_2, x_3, ..., x_n, or, for short, the generic symbol x_i as the index i ranges over 1, 2, 3, ..., n. Thus, the
first term of the sum is x_i when i = 1, or simply x_1. Likewise, the second term of the sum is x_i when i = 2,
or x_2, and so on. This pattern continues until the last term of the summation, which is x_i when i = n, or x_n.
Hence, the formula written above would literally be read as "the sum of x-sub-i, as i ranges from 1 to n." (If
the context is clear, some of the symbols are sometimes dropped for convenience, as in Σ_1^n x_i, or even
just Σ x.) Therefore, the mean is equal to this sum divided by n, that is, (Σ_{i=1}^{n} x_i) / n. And since
division by n is equivalent to multiplication by its reciprocal 1/n (e.g., dividing by 3 is the same as
multiplying by 1/3), this last expression can also be written as (1/n) Σ_{i=1}^{n} x_i. This is the quantity to be
calculated for the sample mean, x̄.
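
In R, this sum-and-divide recipe can be checked directly. The short sketch below (with made-up values) writes the sample mean three equivalent ways and spells out the summation as an explicit loop over the index i.

# Hypothetical data values.
x <- c(2.1, 3.5, 3.5, 4.0, 6.2)
n <- length(x)

sum(x) / n            # (x_1 + ... + x_n) / n
(1/n) * sum(x)        # the same quantity, written as (1/n) times the sum
mean(x)               # R's built-in sample mean

# The summation itself, written as a loop over i = 1, ..., n:
s <- 0
for (i in 1:n) s <- s + x[i]
s                     # equals sum(x)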



Finally, because repeated values can arise, each x_i comes naturally equipped with a frequency, labeled f_i,
equal to the number of times it occurs in the original dataset of n values. Thus, for example, if the value x_3
actually occurs 5 times, then its corresponding frequency is f_3 = 5. (If, say, x_3 is not repeated, then its
frequency is f_3 = 1, for it occurs only once.) A related concept is the relative frequency of x_i, which is
defined as the ratio f_i / n. In order to emphasize that this ratio explicitly depends on (or, to say it
mathematically, is a function of) the value x_i, it is often customary to symbolize f_i / n with the alternative
notation f(x_i), read "f of x-sub-i." So, to summarize: f_i is the absolute frequency of x_i, but f(x_i) = f_i / n is
the relative frequency of x_i. They look very similar, but they are not the same, so try not to confuse them.
[Peeking ahead: Later, f(x_i) will denote the probability that the value x_i occurs in a population, which
is a direct extension of the concept of relative frequency for a sample data value x_i.]
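
A short R sketch (again with made-up values) showing absolute and relative frequencies side by side:

# Hypothetical ages, with repeats.
ages <- c(21, 23, 23, 23, 25, 25, 30)
n <- length(ages)

f <- table(ages)        # absolute frequencies f_i
f
f / n                   # relative frequencies f(x_i) = f_i / n
prop.table(f)           # the same relative frequencies, via a built-in convenience function
sum(prop.table(f))      # relative frequencies always sum to 1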

Sample Quartiles
We have seen that the sample median of a data set {x_1, x_2, x_3, ..., x_n}, sorted in increasing order, is a
value that divides it in such a way that exactly half (i.e., 50%) of the sample observations fall below
the median, and exactly half (50%) are above it.
If the sample size n is odd, then precisely one of the data values will lie at the exact center; this
value is located at position (n + 1)/2 in the data set, and corresponds to the median.
If the sample size n is even, however, then the exact middle of the data set will fall between two
values, located at positions n/2 and n/2 + 1. In this case, it is customary to define the median as
the average of the two values, which lies midway between them.

Sample quartiles are defined similarly: they divide the data set into quarters (i.e., 25% each). The first
quartile, designated Q1, marks the cutoff below which lies the lowest 25% of the sample. Likewise,
the second quartile Q2 marks the cutoff between the second-lowest 25% and the second-highest 25% of
the sample; note that this coincides with the sample median! Finally, the third quartile Q3 marks the
cutoff above which lies the highest 25% of the sample.
This procedure of ranking data is not limited to quartiles. For example, if we wanted to divide a
sample into ten intervals of 10% each, the cutoff points would be known as sample deciles. In
general, the cutoff values that divide a data set into any given proportion(s) are known as sample
quantiles or sample percentiles. For example, receiving an exam score in the 90th percentile
means that 90% of the scores are below it, and 10% are above.
For technical reasons, the strict definitions of quartiles and other percentiles follow rigorous
mathematical formulas; however, these formulas can differ slightly from one reference to another. As
a consequence, different statistical computing packages frequently output slightly different values
from one another. On the other hand, these differences are usually very minor, especially for very
large data sets.
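
The following R sketch illustrates this point: quantile() implements several competing percentile definitions through its "type" argument, and they need not agree exactly, especially on a small sample.

# Hypothetical sorted sample of size 8.
x <- c(2, 4, 7, 8, 11, 12, 15, 20)

quantile(x, probs = c(.25, .50, .75), type = 7)   # R's default definition
quantile(x, probs = c(.25, .50, .75), type = 6)   # another common definition; Q1 and Q3 differ slightly
median(x)                                         # the median agrees with Q2 in either case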
Exercise 1 (not required): Using the R code given below, generate and view a random sample of
n = 40 positive values, and find the quartiles via the so-called five number summary that is output.

# Create and view a sorted sample, rounded to 3 decimal places.
x = round(sort(rchisq(40, 1)), 3)
print(x)
y = rep(0, 40)

# Plot it along the real number line.
plot.new()
plot(x, y, pch = 19, cex = .5, xlim = range(0, 1 + max(x)),
     ylim = range(0, 0.01), ylab = "", axes = F)
axis(1)

# Identify the quartiles.
summary(x)

# Plot the median Q2 (with a filled red circle).
Q2 = summary(x)[3]
points(Q2, 0, col = "red", pch = 19)

# Plot the first quartile Q1 (with a filled blue circle).
Q1 = summary(x)[2]
points(Q1, 0, col = "blue", pch = 19)

# Plot the third quartile Q3 (with a filled green circle).
Q3 = summary(x)[5]
points(Q3, 0, col = "green", pch = 19)

Exercise 2 (not required): Using the same sample, sketch and interpret
boxplot(x, pch = 19)
identifying all relevant features. From the results of these two exercises, what can you conclude
about the general shape of this distribution?
NOTE: Finding the approximate quartiles (or other percentiles) of grouped data can be a little
more challenging. Refer to the Lecture Notes, pages 2.3-4 to 2.3-6, and especially 2.3-11.
To illustrate the idea of estimating quartiles from the density histogram of grouped data, let us
consider a previously posted exam problem (Fall 2013).

First, we find the median Q2, i.e., the value on the X-axis that divides the total area of 1
into .50 area on either side of it. By inspection, the cumulative area below the left
endpoint 4 is equal to .10 + .20 = .30, which is too small. Likewise, the cumulative area below the
right endpoint 12 is .10 + .20 + .30 = .60, which is too big. Therefore, in order to have .50 area
both below and above it, Q2 must lie in the third class interval [4, 12), in such a way that its
corresponding rectangle of .30 area is split into left and right sub-areas of .20 + .10, respectively.

[Figure: density histogram with class areas .10, .20, .30, .25, .15. The rectangle over [4, 12) has
height (density) .30/8 = .0375, with endpoints a = 4 and b = 12, and is split at Q2 into a left sub-area
A = .20 and a right sub-area B = .10, so that the total area below Q2 is .10 + .20 + .20 = .50 and the
total area above Q2 is .10 + .25 + .15 = .50.]

Now just focus on this particular rectangle, and use any of the three boxed formulas on page 2.3-5 of the
Lecture Notes, with the quantities labeled above. For example, the third formula (which I think is
easiest) yields

    Q2 = (A·b + B·a)/(A + B) = [(.2)(12) + (.1)(4)] / (.2 + .1) = 2.8/.3 ≈ 9.333.
The other quartiles are computed similarly. For example, the first quartile Q1 is the cutoff
for the lowest 25% of the area. By the same logic, this value must lie in the second class
interval [2, 4), and splits its corresponding rectangle of .20 area into left and right sub-areas
of .15 + .05, respectively.

[Figure: the rectangle over [2, 4) has height (density) .20/2 = 0.1, with endpoints a = 2 and b = 4, and
is split at Q1 into a left sub-area A = .15 and a right sub-area B = .05; together with the .10 area
below 2, the total area below Q1 is .10 + .15 = .25.]

Therefore,

    Q1 = (A·b + B·a)/(A + B) = [(.15)(4) + (.05)(2)] / (.15 + .05) = .7/.2 = 3.5.
Likewise, the third quartile Q3 is the cutoff for the highest 25% of the area. By the same
logic as before, this value must lie in the fourth class interval [12, 22), and splits its
corresponding rectangle of .25 area into left and right sub-areas of .15 + .10, respectively.

[Figure: the rectangle over [12, 22) has height (density) .25/10 = .025, with endpoints a = 12 and
b = 22, and is split at Q3 into a left sub-area A = .15 and a right sub-area B = .10; this right sub-area
B = .10, together with the .15 area above 22, makes up the highest 25% of the sample.]

Therefore,

    Q3 = (A·b + B·a)/(A + B) = [(.15)(22) + (.10)(12)] / (.15 + .10) = 4.5/.25 = 18.

Estimating a sample proportion between two known quantile values is done in pretty much the same
way, except in reverse, using the formulas at the bottom of the same page, 2.3-5. For example,
the same problem asks us to estimate the sample proportion in the interval [9, 30). This interval
is the disjoint union of the subintervals [9, 12), [12, 22), and [22, 30).

The first subinterval [9, 12) splits the corresponding rectangle of area .30 over the class
interval [4, 12) into unknown left and right sub-areas A and B, respectively. [Figure: a = 4, Q = 9,
b = 12, density = .0375.] Since it is the right sub-area B we want, we use the formula
B = (b − Q) × Density = (12 − 9) × .0375 = .1125.

The next subinterval [12, 22) contains the entire corresponding rectangular area of .25.

The last subinterval [22, 30) splits the corresponding rectangle of area .15 over the class
interval [22, 34) into unknown left and right sub-areas A and B, respectively. [Figure: a = 22,
Q = 30, b = 34, density = .0125.] In this case, it is the left sub-area A that we want, so we use
A = (Q − a) × Density = (30 − 22) × .0125 = .10.

Adding these three areas together yields our answer: .1125 + .25 + .10 = .4625.

Page 2.3-6 gives a way to calculate quartiles, etc., from the cumulative distribution function
(cdf) table, without using the density histogram.
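
The interpolation above can also be checked numerically. The R sketch below encodes the class boundaries and areas from this exam problem (the lower boundary of the first class is not stated, so 0 is assumed here) and reproduces Q1, Q2, Q3, as well as the proportion in [9, 30).

# Class boundaries and relative-frequency areas from the exam problem
# (the first boundary, 0, is an assumption).
breaks <- c(0, 2, 4, 12, 22, 34)
areas  <- c(.10, .20, .30, .25, .15)      # total = 1
dens   <- areas / diff(breaks)            # density = area / class width

# Interpolated quantile: find the class containing cumulative area p, then
# move into that class just far enough to accumulate exactly p.
grouped.quantile <- function(p) {
  cum <- c(0, cumsum(areas))              # cumulative area below each boundary
  k <- max(which(cum <= p))               # class whose left boundary has cumulative area <= p
  breaks[k] + (p - cum[k]) / dens[k]
}

grouped.quantile(.25)    # Q1 = 3.5
grouped.quantile(.50)    # Q2 = 9.333
grouped.quantile(.75)    # Q3 = 18

# Proportion in [9, 30), built from the same densities:
(12 - 9) * dens[3] + areas[4] + (30 - 22) * dens[5]    # = .4625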