
Study Guide – Biostatistics Page 1

35% of PrevMed Exam (with EPI)


DESCRIPTIVE STATISTICS
Measurement Scales
Nominal: names or labels: numbers, if present, act only as labels
e.g.: ABO blood groups; gender – CATEGORICAL, Qualitative
Ordinal: natural ranking: numbers, if present, represent rank ordering
Differences in rank have no meaning – can be continuous or discrete
e.g.: Pain scale; min-mod-severe; and nonsmoker-former-smoker, avg=misleading
Interval: numeric data with arbitrary 0
Equal differences between numbers represent equal amounts
Ratios have no meaning due to arbitrary zero
e.g.: Temperature in °C
Ratio: numeric data with true 0 = CONTINUOUS
Differences as well as ratios have meaning due to true zero value
e.g.: Heart rate in beats per minute; °K, BP in mmHg
Types of Data
Qualitative: non-numeric - e.g., sex, race, disease severity
Quantitative: numeric - Discrete: counts or frequencies, e.g.: #office visits
Continuous: along a continuum = size in cm; age in years; time
Measures of Central Tendency
Mean ( X̄ ): arithmetic average - The preferred measure of central tendency
Geometric Mean: uses logarithm to compare ratios
Median: middle value of sorted data, may be best when data is skewed
Mode: value that occurs most often - Rarely used

Measures of Spread
Range: minimum to maximum values. Interquartile Range (IQR): 25th to 75th percentile (Q3 − Q1), i.e., the range of the middle half of the data
Variance (s²): average squared difference of each data value from the mean
s² = Σ(X − X̄)² / (n − 1)        n = # data values        X̄ = mean
Standard Deviation (s): square root of variance, SD = √Variance
MOST commonly used measure of spread (unlike SE, it does not shrink as n grows).
If n is large and data are symmetric / unimodal:
Then Mean ± 1s = 68% of data
Mean ± 2s = 95% of data
Mean ± 3s = 99% of data
Standard Error: Describes the variability of a sample statistic, used primarily for
constructing confidence intervals.        SE = SD / √n
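The measures above can be sketched with Python's standard library; the sample values below are made up for illustration (note how the outlier 12 pulls the mean above the median).

```python
import math
import statistics

data = [4, 5, 5, 6, 7, 8, 12]  # hypothetical data, skewed by the 12

mean = statistics.mean(data)      # arithmetic average
median = statistics.median(data)  # middle value; less affected by the 12
mode = statistics.mode(data)      # most frequent value
s = statistics.stdev(data)        # sample SD, divides by (n - 1)
se = s / math.sqrt(len(data))     # standard error of the mean = SD / sqrt(n)

print(mean, median, mode)
```
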
Sources of Variability
Biologic: Inter-individual: different people vary
Intra-individual: same individual varies over time
Measurement: Inter-observer: 2 observers have different values (kappa)
Intra-observer: same observer varies over time
Analytical: mechanical or laboratory error


INFERENTIAL STATISTICS: Generalizations about a population based on data from a
representative group.

Probability: for an event that can either occur or not occur. ∩ = “AND”, U = “OR”
P(A) = probability that event A occurs
0 ≤ P(A) ≤ 1 for any event A

P (AUB) = P(A or B) = P(A) + P(B) - P(A∩B)

P (A∩B) = P(A and B) = P(B) * P(A│B)

P(A│B) = P(A if B) = P(A∩B) / P(B)
Mutual exclusivity: Venn Diagram = 2 non-intersecting circles, events cannot occur together
E.g.: A = hearts B = diamonds
P (AUB) = P(A) + P(B) - P(A∩B) = P(A) + P(B) - 0 = P(A) + P(B)
P (A∩B) = P(B) * P(A│B) = P(B) * 0 = 0

Independence: Events are independent if the occurrence of one has no effect on the other
P(A│B) = P(A) and P(B│A) = P(B)
Therefore;
P (A∩B) = P(B) * P(A│B) = P(B) P(A)
MULTIPLICATION RULE
P(A∩B∩…∩K) = P(A) * P(B) * … * P(K) (if independent events)

Complementary: Events are mutually exclusive and together contain all the outcomes in the sample space
If 2 events A and Ā are complementary: then P(A) = 1 – P( Ā )

Conditional Probability: The conditional probability that the event A occurs given that the
event B has occurred:

P(A│B) = P(A and B) / P(B)
(Chart) Under mutual exclusivity the numerator is 0; under independence it reduces to P(A)


Application of Probability to Epidemiology:

With Disease (D+) Without Disease (D-) Total


Positive Test (T+) a b a+b
Negative Test (T-) c d c+d
Total a+c b+d a+b+c+d

Sensitivity: a / (a+c) P(T+│D+)

Specificity: d / (b+d) P (T-│D-)

Positive Predictive Value: a / (a+b) P(D+│T+)

Negative Predictive Value: d / (c+d) P(D-│T-)

Prevalence: (a+c) / (a+b+c+d) P(D+)

Bayes’ Theorem:

P(A│B) = P(A) * P(B│A) / [P(A) * P(B│A) + P(notA) * P(B│notA)]

PPV = P(D+│T+) = (after applying Bayes’ theorem) =
Prevalence(Sensitivity) / [Prevalence(Sensitivity) + (1 - Prevalence)(1 - Specificity)]

EXAMPLE: Virtual colonoscopy has a sensitivity of 90% and a specificity of 96%. What is the
positive predictive value (PPV) given a prevalence of 5%?

P(D+│T+) = P(D+) * P(T+│D+) / [P(D+) * P(T+│D+) + P(D-) * P(T+│D-)]
         = Prev(Sens) / [Prev(Sens) + (1 - Prev)(1 - Spec)]

PPV = .05(.90) / [.05(.90) + (1 - .05)(1 - .96)] = .045 / (.045 + .038) = .045 / .083 ≈ 0.54
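The same worked example as a quick script, using the Bayes form PPV = prev·sens / [prev·sens + (1 − prev)(1 − spec)]; the NPV line follows the analogous formula.

```python
prev, sens, spec = 0.05, 0.90, 0.96  # prevalence, sensitivity, specificity

# PPV: true positives over all positives
ppv = (prev * sens) / (prev * sens + (1 - prev) * (1 - spec))

# NPV: true negatives over all negatives
npv = ((1 - prev) * spec) / ((1 - prev) * spec + prev * (1 - sens))

print(round(ppv, 3))  # ~0.542: even a good test has modest PPV at 5% prevalence
```

At low prevalence, most positives are false positives, which is why PPV stays near 50% here despite 90%/96% test characteristics.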


Normal (Gaussian) Probability Distribution: Continuous data


Properties:
Bell-shaped
σ = standard deviation
μ = mean (= median = mode)
total area under = 1
extends to infinity in both directions
area under the curve corresponds to probability
Completely described if 2 parameters known: μ and σ

Changing mean →→ changes location, not shape of curve (i.e. moves left or right)
Changing standard deviation →→ changes shape, not location of curve (flat, tall)
Increasing standard deviation →→ curve flattens

Area under the curve


Mean ± 1s = 0.6827
Mean ± 2s = 0.9545
Mean ± 3s = 0.9973

SAMPLING DISTRIBUTIONS:
Sampling distribution of the mean: Normal population: Repeated sampling from a normal
distribution with mean μ and SD σ gives a normal distribution of sample means
SE (standard error) of the mean = σx̄ = σ / √n (σ, the square root of the variance, over √n)
Areas under the curve for the sampling distribution of the mean can be standardized
because it has a normal distribution:
Z = (X̄ − μ) / (σ / √n)

Sampling distribution of the mean: Non-normal population: If sample size is large (more
than 30), sampling distribution holds due to Central Limit Theorem
Central Limit Theorem: If a population has a finite mean (μ) and finite variance (σ2), then the
distribution of sample means derived from repeated sampling from this distribution approaches
the normal distribution with sample mean μ and variance σ2/n as the sample size increases (>30).

The MEANS from MULTIPLE samples of same pop will have normal distribution (vs. data)

Standard Normal (Z) Distribution: Normal distribution with μ =0 and σ = 1. Area under
curve = 1, area under any portion of curve/distribution = probability of observing value there.
Z-Table allows calculation of area under curve between any 2 points on x-axis


Standardization: Allows computation of area (probability) under any normal


distribution using table z-distribution. Simply RE-CALCULATE in standard Z units
Z (z-score) = (X - μ) / σ; just need to know what you are measuring (direction), value, mean and
standard deviation. Example: What is prob of BP > 135 in pop with mean = 120, SD = 10? Data =
normal, but with horizontal axis 80 – 160. Z = (135 - 120)/10 = 1.5. P(Z > 1.5) = 0.0668 (from table)
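The BP example can be checked without a Z-table using the stdlib error function, since for a standard normal P(Z > z) = 0.5·(1 − erf(z/√2)).

```python
import math

def upper_tail(x, mu, sigma):
    """P(X > x) for a normal distribution with mean mu and SD sigma."""
    z = (x - mu) / sigma
    return 0.5 * (1 - math.erf(z / math.sqrt(2)))

p = upper_tail(135, mu=120, sigma=10)  # z = (135 - 120)/10 = 1.5
print(round(p, 4))  # ~0.0668, matching the Z-table value
```
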

(Figure: normal curve with area α/2 in each tail)
The wider the interval, the MORE confident
we are that it contains the mean.
The narrower = less confident,
but MORE PRECISE (smaller range)

T-Statistic: Continuous unimodal distribution of INFINITE range, symmetric around 0. A bit


more flat than a normal curve, but approaches normality with larger sample sizes.
Only ONE parameter affects the curve: degrees of freedom (df), vs. Z with μ (left/right) and
σ (tall/flat). df = n - 1; as df → ∞, t → Z. Can be used on smaller samples (less than 30)

t = (X̄ - μ0) / (s / √n)        Confidence Interval: X̄ ± t(1-α/2, df) * (s / √n)
Estimating a Sample: Take a sample, select the sample statistic, calculate sampling
distribution, estimate desired parameter (like example above with Z). Can calculate a
“point estimate” or an “interval estimate” (like the example above).

Confidence Interval: (Estimator) +/- (reliability coefficient) * (standard error)


The range that will include, with a stated probability, the actual population
parameter estimated from a sample. Z is estimated at 1.96 for a 95% confidence interval. This is
NOT probability, as the value either is or isn’t in the interval (0 or 1). So, there is 95%
confidence that the interval contains the population mean.
P[ X̄ - 1.96(σ/√n) < μ < X̄ + 1.96(σ/√n) ] = 0.95        General form: X̄ ± z(1-α/2) * (σ/√n)

Confidence Interval Width can be decreased by:


Reduce confidence level (increase α)
Reduce SD (use stratified sampling)


Increase n (take larger sample)


Confidence Interval Estimate
95% CI = mean ± 1.96 SE
90% CI = mean ± 1.645 SE
99% CI = mean ± 2.576 SE
SE = SD/√n
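A minimal sketch of the 95% CI recipe above (estimator ± reliability coefficient × SE); the mean, SD, and n are hypothetical.

```python
import math

mean, sd, n = 120.0, 10.0, 100  # hypothetical sample summary

se = sd / math.sqrt(n)  # SE = SD / sqrt(n)
z = 1.96                # reliability coefficient for 95% confidence
ci_low, ci_high = mean - z * se, mean + z * se

print(ci_low, ci_high)  # (118.04, 121.96)
```

Swapping z = 1.96 for 1.645 or 2.576 gives the 90% and 99% intervals listed above.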

Confidence Interval when μ not known and n is small (n<30)


Central Limit Theorem doesn’t hold
CI will have a Student’s t-distribution
Characteristics of Student’s t-distribution
Symmetric about zero
Parameter = degree of freedom; df = n – 1
As df approaches infinity: t-distrib approaches z-distrib
95% CI = X̄ ± t(.975, n-1) * SD/√n

Confidence Interval for Binomial Proportion p


95% CI = p̂ ± 1.96 √(p̂q̂/n), where q̂ = 1 - p̂        General form: p̂ ± z(1-α/2) √(p̂(1-p̂)/n)

Sample Size calculation


The 95% CI half-width (margin of error) is E = 1.96 SD / √n
Solve for n: n = (1.96 SD / E)²
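Solving the CI-width relation for n: if E is the desired half-width of the 95% CI, then E = 1.96·SD/√n rearranges to n = (1.96·SD/E)². The SD and E values below are hypothetical.

```python
import math

sd = 10.0  # assumed population SD
e = 2.0    # desired half-width of the 95% CI

# Round up: a fractional subject is not possible
n = math.ceil((1.96 * sd / e) ** 2)
print(n)  # 97 subjects needed
```
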

Binomial Probability Distribution: used for distributions of NOMINAL data.


Requires a DEFINITE number of trials (n), Each trial results in 1 of 2 possible outcomes
Probability assigned to each of 2 outcomes is constant from trial to trial
p= P(success) q=(1-p)=P(failure)
Each trial is independent of the others = coin flipping
Formula: P(x successes in n trials) = [n! / (x!(n-x)!)] p^x (1-p)^(n-x)

μ = np        σ² = npq

Confidence Interval: 95% CI = p̂ ± z.975 √(p̂q̂/n)
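The binomial formula can be written directly with stdlib tools (math.comb gives n!/(x!(n−x)!)); the coin-flip example is hypothetical.

```python
import math

def binom_pmf(x, n, p):
    """P(x successes in n independent trials with success probability p)."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

# Hypothetical example: probability of exactly 7 heads in 10 fair coin flips.
p = binom_pmf(7, 10, 0.5)
print(round(p, 4))  # ~0.1172

# Mean and variance match the formulas mu = np and sigma^2 = npq.
n, prob = 10, 0.5
mu, var = n * prob, n * prob * (1 - prob)
```
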

Poisson Probability Distribution: Occurrences are independent, probability proportional to


time interval. Requires a factor of “time”. The average number of outcomes (the parameter λ) in
a trial remains constant from trial to trial and trials must not overlap. An infinite number of
outcomes can occur in the unit of time or space.

Formula: P(x occurrences of an event in a unit of time or space) = e^(-λ) λ^x / x!


Characteristics of Poisson Distribution
Space →→ random mixing of agent into medium


Time →→ events occur independently in time


λ = mean of Poisson = variance of Poisson
Contrast with Normal: mean = mode = median
Contrast with Binomial: mean = np, Variance = npq
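The Poisson formula above, P(x) = e^(−λ)·λ^x/x!, as a short sketch; the clinic scenario and λ = 2 are hypothetical.

```python
import math

def poisson_pmf(x, lam):
    """P(x occurrences per unit time/space) when the mean rate is lam."""
    return math.exp(-lam) * lam**x / math.factorial(x)

# Hypothetical example: a clinic averages 2 no-shows per day (lambda = 2).
p0 = poisson_pmf(0, 2)  # probability of zero no-shows on a given day
print(round(p0, 4))     # ~0.1353 (= e**-2)
```

Recall the contrast noted above: for the Poisson, the mean and variance are both λ, whereas the binomial has mean np and variance npq.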

HYPOTHESIS TESTING
Ho = Null hypotheses
Must contain =, ≥, or ≤
Investigator usually wants to reject Ho (find Ho to be false)
Ha = Alternative hypothesis
Disagrees with Ho
Usually thought to be true by the investigator
Outcomes of hypothesis testing
Ho is actually true and it is accepted
Ho is actually false and it is rejected
Ho is actually true and it is rejected (Type I Error) (reject truth)
Ho is actually false and it is accepted (Type II Error) (accept false)

Ho Null Hypothesis
Actually True Actually False
Decision Accept Ho Correct Type II Error
Reject Ho Type I Error Correct

If reject Ho → → can make a Type I error (not a Type II error)


If accept Ho → → can make a Type II error (not a Type I error)
α and β Errors
α = P(Type I Error) = P(reject Ho when Ho is true) = significance level of test,
controlled by setting alpha at .05 or thereabouts
β = P(Type II Error) = P(fail to reject Ho when Ho is false) = 1 - power of test
Decrease Type II errors by increasing sample size
Power of test = 1 – β


P-value:
Not a test statistic (it is a measure of probability)
Significance level at which the observed value of the test statistic would just be
significant
Probability that the observed difference from Ho is due to chance alone when Ho is true

1-tailed upper vs. 1-tailed lower vs. 2-tailed test


Determined by what we think the mean actually is
1-tailed test upper– mean is known greater than a constant
1-tailed test lower– mean is known less than a constant
Most medical tests are 1-tailed
2-tailed test – if don’t know direction

Neyman-Pearson Hypothesis Testing: Fix α


Determine hypothesis: Ho and Ha
Fix α at a small level to minimize probability of Type I error
Usually, α < .05 or α < .01
Ignore β – inversely related to α
To decrease β for a given α → increase sample size
Take a sample from a population
Calculate test statistic (z, t, F, or χ2)
Inspect test statistic to determine if we accept or reject Ho
Determine p-value from test statistic
If p < α (.05 or determined α level) → reject Ho
Compute critical region
Reject Ho if test statistic value is outside critical region
If p < .05 – reject Ho


Statistic Inference for 2 Means or 2 Proportions


Estimate the difference in the 2 population means (independent samples), e.g. drug vs control
Confidence Interval (CI): Is a mean point estimate ± z (or t) * SE
(Estimator) +/- (reliability coefficient) * (standard error)
SE comes from sampling distribution
If CI does not contain 0 →→ reject Ho
If CI contains 0 →→ “not enough info to say the means differ”
2 - Sample t-test (must assume a NORMAL distribution), INDEPENDENT
t = difference in means / SE
Paired t-test: Used for data that is not independent, i.e. paired together.
Examples include “pre/post”, “baseline / treatment”, Matched, Twins
Consider the difference between the individual pairs
Compute statistic based on paired differences (as if have 1 sample)
CI (95%) = d̄ ± t(.975, n-1 df) Sd / √n        t = d̄ / SE        d̄ = mean difference, SE = Sd/√n
If normality cannot be assumed: Wilcoxon signed rank test
Wilcoxon rank sum test: Non-parametric version of 2-sample-t test.
Normality cannot be assumed for independent samples. Can use for ordinal data.
Data is ranked smallest to largest, Compares sum of ranks
Ho: both populations have the same location
Ha: populations have different locations
Results in a p-value
Wilcoxon Signed Rank Test: Non-parametric version of paired-t test
Non-normal, non-independent.
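The paired t statistic above (t = d̄ / SE of the differences) can be sketched as follows; the before/after readings are hypothetical, and getting a p-value would still require a t-table or a stats library.

```python
import math
import statistics

before = [140, 152, 138, 145, 160]  # hypothetical pre-treatment values
after = [132, 148, 135, 139, 151]   # same subjects, post-treatment

d = [b - a for b, a in zip(before, after)]  # per-pair differences
dbar = statistics.mean(d)                   # mean difference
sd = statistics.stdev(d)                    # SD of the differences
se = sd / math.sqrt(len(d))                 # SE = Sd / sqrt(n)

t = dbar / se  # compare to t with n - 1 = 4 degrees of freedom
print(round(dbar, 1), round(t, 2))
```

Note how only the differences enter the computation, exactly as the "treat the pairs as one sample" step above describes.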

Difference in 2 Binomial Proportions


Confidence interval = (p1 – p2) ± z .975 √ ((p1q1/n1) + (p2q2/n2))
q = 1-p
Hypothesis test for 2 (unpaired) binomial proportions p1 and p2
Chi-Square Test – Better for binomial than z / t
Degrees of Freedom = (# rows – 1) x (# columns – 1)
# of Successes # of Failures Total
Group 1 A B A+B
Group 2 C D C+D
Total A+C B+D N

Χ2 = [N*(AD – BC)2] / [(A+B) (C+D) (A+C) (B+D)]

To use Chi-Square, data must be discrete, mutually exclusive, and each expected cell count > 5
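The shortcut 2×2 formula above, Χ² = N(AD − BC)² / [(A+B)(C+D)(A+C)(B+D)], with hypothetical cell counts:

```python
# Hypothetical 2x2 table: successes/failures in groups 1 and 2
a, b = 30, 20  # group 1
c, d = 15, 35  # group 2
n = a + b + c + d

chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
print(round(chi2, 2))  # compare to chi-square with (2-1)(2-1) = 1 df; > 3.84 -> p < .05
```
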

Yates Correction for continuity:


The binomial distribution is discontinuous, Correction for continuity
Makes little difference if cell numbers are large, Reduces size of Chi square statistic (Χ2)
Reduces chance of finding statistically significant difference = more conservative

Fisher’s Exact Test: when sample sizes are small (boxes in chart are less than 5)
Allows calculation of exact p-value, For 2x2 tables with small cell size


Expected cell size = (row total × column total) / N
For A: (A+B) × (A+C) / N
For B: (A+B) × (B+D) / N

McNemar’s Chi-Square Test: 2 Paired Binomial Proportions

McNemar Chi Square = (b - c)² / (b + c)
ANALYSIS OF VARIANCE (ANOVA): One continuous outcome variable and one categorical variable
with at least 3 groups. Independent, normally distributed. Use F statistic to differentiate groups.
EXAMPLE: Does cholesterol vary by race? Study groups = black, white, Asian, other
Ho: μB = μW = μA = μO
Have to compare each of the populations to each other, which is tedious, but done by a
computer. Results can be used to find which populations are different from the others.

STATISTICAL INFERENCE OF LINEAR AND OTHER MODELS


Pearson’s Correlation Coefficient: Measures correspondence b/w continuous variables
Does NOT imply cause / effect. Determines if association exists and quantifies strength
Plots of bivariate normal. “Mysterious, misunderstood and misleading”
ρ = (Pearson’s) population correlation coefficient, unitless, ranges from -1 to +1
ρ = 0 is a perfectly symmetrical plot = no correlation
ρ = -1 plot accumulates data from the upper left to lower right ( \ )
ρ = +1 plot accumulates data from lower left to upper right ( / )
if sample comes from bivariate normal distribution, the scatter diagram should
have a circular or elliptical pattern

r = “sample” correlation coefficient, estimate of ρ


+1 → strong positive correlation (linear)
-1 → strong negative correlation (linear)
0 → no correlation. Highly affected by outliers
r2 = Strength of Association
Proportion of variation in y explained by x (or vice versa)
Test statistic (t)
t = r √ [(n – 2)/(1 – r2 )] with n-2 degrees of freedom
Use the test statistic (t) to determine p-value, but p-value not that important.
Look at r², if = .8 (80%), that means 80% of variation is explained by the other variable.
r² < 0.4 = weak, r² > 0.7 = strong
Spearman Rank Correlation Coefficient: Non-parametric version of Pearson’s

Linear Regression: Attempts to predict 1 variable from one or more others. Applies a model to


explain an observed relationship and then estimates the parameters of that model.

Simple linear model: y = α + βx        α = Y-intercept        β = slope        y = dependent variable


Uses least squares curve fitting for the simple linear model


Assumptions:
Linearity: data should follow a straight line
Homogeneity of variance: variability of y should be the same at each x
Normality: values of y at each x should be normally distributed

Multiple Regression: more than 1 independent (x) variable


Logistic Regression: dependent variable is categorical
Use regression with caution: very different scatter plots (e.g., Anscombe’s quartet) can share the same regression line…

Kappa Statistic (κ): Measure of agreement. Used for Inter/Intra-rater reliability


Interpretation: κ = 1: perfect agreement
κ = 0: no agreement better than chance
κ = neg: worse than chance agreement (rare)
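A minimal sketch of Cohen's kappa for two raters on a 2×2 agreement table, using κ = (observed agreement − chance agreement) / (1 − chance agreement); the counts are hypothetical (rows = rater 1, columns = rater 2).

```python
# Hypothetical agreement table: both raters classify 100 films as pos/neg
table = [[40, 10],
         [5, 45]]

n = sum(sum(row) for row in table)
observed = (table[0][0] + table[1][1]) / n  # proportion where raters agree

# Chance agreement from each rater's marginal totals
row1, row2 = sum(table[0]), sum(table[1])
col1 = table[0][0] + table[1][0]
col2 = table[0][1] + table[1][1]
chance = (row1 * col1 + row2 * col2) / n**2

kappa = (observed - chance) / (1 - chance)
print(round(kappa, 3))
```
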

SURVIVAL ANALYSIS: Follows a group of people over time. Contribution of each subject
weighted by time. 4 person years = one subject for 4 years, or a combination of subjects for 4
years.

Survival analysis used vs. other statistics because patients followed for variable lengths of time
and information tends to be incomplete (people are lost to follow-up).

Five requirements for a “Life Table”: 1) Time at beginning of interval


2) Length of time interval
3) # at beginning of interval
4) Probability of dying during the interval
5) # lost to follow-up

Kaplan-Meier Life Table: 6 columns, Last one = “cumulative survival rate at time t”
This can be made into a survival curve. These curves can be compared with various methods
Log Rank Test (Peto’s), Gehan’s Generalized Wilcoxon Test, Cox-Mantel Test
Peto’s Generalized Wilcoxon Test, Cox’s F-Test, Mantel-Haenszel Chi Square Test
(So, it sucks Mantel Peto’s Cox to compare survival curves)
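A minimal Kaplan-Meier product-limit sketch: cumulative survival drops at each death by a factor of (1 − deaths/at-risk), while censored subjects (lost to follow-up) simply leave the risk set. The (event_time, died) pairs are hypothetical; died = 0 means censored, and event times are assumed distinct.

```python
# Hypothetical follow-up data: (time, 1 = died / 0 = censored)
subjects = [(2, 1), (3, 0), (4, 1), (5, 1), (6, 0), (7, 1)]

at_risk = len(subjects)
survival = 1.0
curve = []  # (time, cumulative survival rate at time t)
for time, died in sorted(subjects):
    if died:
        survival *= (at_risk - 1) / at_risk  # one death among those at risk
        curve.append((time, round(survival, 3)))
    at_risk -= 1  # censored subjects leave the risk set without an event

print(curve)
```

This is the "cumulative survival rate at time t" column of the Kaplan-Meier life table; plotting it as a step function gives the survival curve.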


Cox Proportional Hazards Model Regression: Used to compare survival between
two groups that vary in age, sex, disease severity or other variables. Uses multiple-regression
techniques combined with survival analysis.

For Every Statistical Test:


Ho
Type I error
Type II error
α
β
Significance level
Power
Critical region
P-value

Statistical Analysis Issues and Difficulties


Statistical significance is different from practical or clinical importance
SMALL studies can often MISS important differences
(Meta Analysis of many small studies can be misleading)
LARGE studies can make trivial differences appear important
Testing many different variables or subgroups leads to problems in interpreting results
(By chance 5% results will show significance (when not significant))
Multiple bivariate statistical tests over treatment groups, dose levels, or time can lead to
confusing results (ANOVA better than multiple paired t tests) – increases Type I error

Repeated analysis of the data as they accrue over time and stopping the experiment or
trial when statistical significance is reached: leads to incorrect conclusions

Try not to use p values ALONE


Different types of data require different types of testing
Pay attention to sample size
Absence of proof does not prove absence
Always know what assumptions your test requires


1st Variable | 2nd Variable | Example | Appropriate Test(s) of Significance

Continuous | Continuous | Age + Syst BP | Pearson Correlation Coeff; Linear Regression
Continuous | Ordinal | Age + Satisfaction | Spearman Correlation Coeff; ANOVA (possibly)
Continuous | Dichotomous, Unpaired | Syst BP + Gender | Student’s t-test
Continuous | Dichotomous, Paired | Δ Syst BP Before / After Treatment | Paired t-test
Continuous | Nominal | Hemoglobin level + Blood type | ANOVA (F-test)
Ordinal | Ordinal | Correlation: Satisfaction + Illness severity | Spearman Correlation Coeff
Ordinal | Dichotomous, Unpaired | Satisfaction + Gender | Mann-Whitney U test
Ordinal | Dichotomous, Paired | Δ Satisfaction Before / After Treatment | Wilcoxon matched-pairs signed-ranks test
Ordinal | Nominal | Satisfaction + Ethnicity | Kruskal-Wallis test
Dichotomous | Dichotomous, Unpaired | Success / Failure + Treated / Untreated Groups | Chi-Square test; Fisher’s Exact test
Dichotomous | Dichotomous, Paired | Δ Success / Failure Before / After Treatment | McNemar Chi-Square test
Dichotomous | Nominal | Success / Failure + Blood type | Chi-Square test
Nominal | Nominal | Ethnicity + Blood Type | Chi-Square test


Goal, by Type of Data: Normal | Non-Normal (rank, score, or measure) | Binomial (2 possible outcomes) | Survival Time

Univariable, describe one group:
    Mean, SD | Median, interquartile range | Proportion | Kaplan-Meier survival curve
Univariable, compare to a population / hypothetical value:
    One-sample t test | Wilcoxon test | Chi-square or Binomial test** | —
Bivariable, independent groups:
    2-sample t test | Mann-Whitney test | Fisher’s test (chi-square for large samples) | Log-rank test or Mantel-Haenszel*
Bivariable, dependent / paired:
    Paired t test | Wilcoxon test | McNemar’s test | Conditional proportional hazards regression*
Multivariable, independent, 3+ groups:
    One-way ANOVA | Kruskal-Wallis test | Chi-square test | Cox proportional hazard regression**
Multivariable, dependent, 3+ groups:
    Repeated-measures ANOVA | Friedman test | Cochrane Q** | Conditional proportional hazards regression**
Association between two variables:
    Pearson correlation | Spearman correlation | Contingency coefficients** | —
Predict value from another measured variable:
    Simple linear regression or Nonlinear regression | Nonparametric regression** | Simple logistic regression* | Cox proportional hazard regression*
Predict value from several measured or binomial variables:
    Multiple linear regression* or Multiple nonlinear regression** | — | Multiple logistic regression* | Cox proportional hazard regression*
