
Descriptive Statistics

A variable is any characteristic of a person or thing whose value can vary from individual to individual.
Frequency is the number of times a particular value of a variable occurs

Types of variables:
- Nominal (non-ordered attributes)
- Ordinal (ordered categories)
- Ranks
- Numerical, discrete (integer counts)
- Numerical, continuous (measured quantities)
- Numerical, ratio (derived quantities)

Describing the Data with Numbers (p19)
You can summarize results using measures of central tendency and dispersion.

Measure of Central Tendency
A measure of central tendency is the typical value for the data. Where is the
center or high point of the distribution?

[Concept map: Statistics splits into Descriptive statistics (summarization: central tendency, variation, slope; display: tables, graphs, charts) and Inferential statistics (testing: p-value; estimation: point estimate, confidence interval).]
Arithmetic Mean or Mean (AM, X-bar): the average, calculated for interval and ratio data.
-Useful for interval/ratio data
-Good for symmetric distributions
-Easily affected by outliers
-The mean is similar to the median for symmetric data
Median: the value such that half of the data points fall above it and half below it.
First rank the values in order (ascending or descending). Used for ordinal data.
Sometimes used with interval and ratio data if the data aren't distributed symmetrically.
-Used regardless of shape of distribution
-Aka resistant measures (not influenced by outliers)
Mode: the most frequently occurring category. If two categories have same or
almost the same frequency, the data are called bimodal or even trimodal (more
rare), or multimodal. Used for nominal data.

Measure of Dispersion
A measure of dispersion refers to how closely the data cluster around the measure
of central tendency.

Index of Dispersion: the simplest approach is just stating how many categories were used; applies to nominal data or to ordinal data comprised of named, ordered categories. A better measure of dispersion for nominal and ordinal data is the index of dispersion (D):
D = k(N² - Σfᵢ²) / [N²(k - 1)]
where k is the number of categories, N is the total number of observations, and fᵢ is the frequency in category i.
If all values fall under one category: D = 0
If all values are equally distributed among the k categories: D ≈ 1
D is lower when the observations are concentrated in fewer of the categories.
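A minimal sketch of the D formula above in Python (the category frequencies are made up for illustration):

def index_of_dispersion(freqs):
    # D = k(N^2 - sum of f_i^2) / (N^2 * (k - 1))
    k = len(freqs)
    N = sum(freqs)
    return k * (N**2 - sum(f**2 for f in freqs)) / (N**2 * (k - 1))

print(index_of_dispersion([25, 0, 0, 0]))     # everything in one category -> 0.0
print(index_of_dispersion([10, 10, 10, 10]))  # evenly spread over 4 categories -> 1.0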
Range: the length of an observed interval. It is the difference between the highest and lowest values; used for numeric ordinal data. The range is always one number: not reported as 120-102, but as 18. Disadvantages of the range are that it depends on sample size and is greatly affected by outliers. When reporting the extremes, say the minimum to maximum is ? to ?, not the range is ? to ?.
Interquartile Range or Midspread: the difference between the upper quartile (Q_U, the median of the upper half) and the lower quartile (Q_L, the median of the lower half); it comprises the middle 50% of the data. Used with ordinal, interval and ratio data. The data need to be rank-ordered first. Expressed as Q1 and Q3 are 12 and 15 (compute the length if needed), not the IQR is 12, 15.
Variations on a Range: same idea as quartiles, but using different intervals.
Narrower intervals fall closer to the median but contain less of the data.
The Mean Deviation: the sum of the deviations of a set of numbers around the mean always equals zero, so it cannot serve as a measure of spread on its own. You could use the absolute values of the differences, but absolute values are difficult to work with algebraically.
The Variance and Standard Deviation: the variance (s², σ²) is the average of the squared deviations of the individual values from the mean. This gives a result in squared units, which is not directly interpretable, so we take the square root to get the standard deviation (SD, s, σ).

(For the population: σ² = Σ(Xᵢ - μ)² / N.)

The closer the numbers cluster around the mean, the smaller s will be. If we add a constant to every number, the variance (and hence the SD) does not change. If we multiply every number by a constant c, the variance is multiplied by c² (and the SD by |c|).
Var(X) = E[(X - E(X))²]
*Sample variance is not the same as the variance of a random variable. The sample variance is calculated as the sum of the squared deviations around the sample mean divided by (n - 1).

*The SD, not the variance, is used to describe variation, because it is in the same units as the data.
*Google: the SD is only really useful for normal distributions; use the IQR for skewed distributions.
- Rule 1: if X1 and X2 are independent random variables, then var(X1 + X2) = var(X1) + var(X2), and mean(X + Y) = mean(X) + mean(Y).
- Rule 2: if X1 is a random variable and a is a constant, then var(a·X1) = a²·var(X1).
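A small simulation sketch of Rules 1 and 2 (the distributions and the constant are arbitrary, chosen only for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(10, 2, size=100_000)    # independent random variables
y = rng.normal(-3, 5, size=100_000)

# Rule 1: the variance of a sum of independent variables
print(np.var(x + y), np.var(x) + np.var(y))   # both close to 4 + 25 = 29

# Rule 2: multiplying by a constant scales the variance by a^2
a = 3
print(np.var(a * x), a**2 * np.var(x))        # both close to 9 * 4 = 36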

The Coefficient of Variation (CV, V): used to compare the spread of measurements from, for example, different labs. Used for ratio-level data.
CV = SD / X-bar
Yields a pure number without units. Advantage: multiplying the values by a constant multiplies both X-bar and the SD, so the CV is unchanged. Disadvantage: adding a constant increases X-bar but does not affect the SD, so increasing the mean decreases the CV. It is therefore useful for ratio-level data, where you cannot indiscriminately add a constant, but should not be used with interval-level data, where 0 is arbitrary and adding constants doesn't change the meaning of the scale.

*there is no measure of spread for nominal data

Skewness and Kurtosis
Used to describe the distribution for interval and ratio data.

Skewness: refers to the symmetry of the curve. Skewed right (positive skew) is when the right tail is longer (bulk of the data on the left). Skew = 0: no skew; skew > 0: positive skew; skew < 0: negative skew.
If skewness < -1 or > +1: highly skewed
If -1 < skewness < +1: approximately symmetric

Kurtosis: refers to how flat or peaked a curve is.
-Leptokurtic (peaked, kurtosis > 0 or > 3, heavier tails)
-Mesokurtic (normal distribution, kurtosis = 3; some programs subtract 3 so it is reported as 0)
-Platykurtic (flat, kurtosis < 0 or < 3, lighter tails)
*Kurtosis doesn't affect the variance of the distribution.








Skewness, kurtosis and variance are independent!

Normal Curve
Assessing normality (meeting one requirement doesn't necessarily mean the data are normal):
-The normal curve extends beyond two SDs on either side of the mean.
-For data with only positive numbers, if the mean is less than twice the SD, there is skewness present (can be used when there are many groups, each with a different mean value of the variable).
-If the SD increases as the mean increases, the data are skewed.
-Skewness and kurtosis should both be less than 2.
-Shapiro-Wilk Normality Test: assesses the data against a normal population (p should be greater than 0.05 to be considered normal).
-Anderson-Darling Normality Test: assesses the data against a number of different distributions (p should be greater than 0.05 to be considered normal).

*Loosely, probability density curve is the limiting case of a histogram as the
number of observations tends to infinity

*Is the standard deviation divided by 2 so that you get one SD on each side of the mean? Is the SD always equal on both sides of the mean, even for skewed distributions? For skewed data the SD is still defined but is not very informative; calculate the IQR instead.

Box Plots (aka Box and Whisker Plot):
-The ends of the box are Q_U and Q_L, and the middle 50% of the cases fall within the range of scores defined by the box.
-The variability of the data is reflected in the length of the box.
-The median gives us an estimate of central tendency.
-The median tells us about skewness:
 if the median is closer to Q_U: negatively skewed
 if the median is closer to Q_L: positively skewed
-Whiskers:
Inner fences: one step (1.5 times the IQR) beyond each quartile, i.e. Q_L - 1.5·IQR and Q_U + 1.5·IQR (and/or where the z-score lies outside (-2, 2) for symmetric distributions). If a datum point falls exactly at one step, then the whisker is drawn one step from Q_U. If not, the whisker is drawn to the largest observed value that is still less than one step away from Q_U.
When the distribution is symmetrical, the whiskers are about the same length.

Outer fences: two steps (3.0 times the IQR) beyond each quartile, i.e. Q_L - 3·IQR and Q_U + 3·IQR (and/or where the z-score lies outside (-3, 3) for symmetric distributions).

In a symmetrical distribution about 95% of the data fall within the inner fences and about 99% within the outer fences.

Data that fall between the inner and outer fences are outliers, and data beyond the outer fences are far outliers.

Figure 1 shows the 5 number summary statistics (min, max, median, Q1, Q3)

-It retains some information about the data
-Provides information about the
 Centre of the distribution (the median, M)
 Spread of the distribution (Q1, Q3, Q1 - 1.5·IQR, Q3 + 1.5·IQR, outliers)
 Overall shape of the distribution
-Useful for interval/continuous data
-Useful for detecting outliers: values beyond (Q1 - 1.5·IQR, Q3 + 1.5·IQR)

Stem-and-Leaf Plot:
Each row is a stem (leading digit) and trailing digits form the leaves.
-It retains all information about the data
-Provides information about the
Centre of the distribution
Spread of the distribution
Overall shape of the distribution
-Useful for interval/continuous data

Histogram
-Horizontal axis is a continuous scale of all units in grouped classes, and vertical
axis represents frequency in that class.
-It retains some information about the data
-Provides information about the
Centre of the distribution
Spread of the distribution
Overall shape of the distribution
-Useful for interval/continuous data

Bar Chart
-Bars of equal widths representing categories of a nominal variable, and height
corresponds to frequency
-It retains all information about the data
-Provides information about the
Centre of the distribution
-Useful for nominal/categorical data

Criteria used for deciding which measure of central tendency and dispersion to use:
Sufficiency: how much of the data is used? E.g. the mean is more sufficient than the mode, and the SD is more sufficient than the range.
Unbiasedness: does the average of the estimates from an infinite number of samples approximate the (population) parameter we're interested in?
Efficiency: how closely do the estimates from a large number of samples cluster about their mean? (The SD is more efficient than the IQR.)
Robustness: to what degree are the statistics affected by outliers? The median is more robust than the mean. This is a disadvantage of the median for relatively symmetric distributions, but an advantage for skewed data, since it more accurately reflects the bulk of the data. So although the mean is the most useful measure of central tendency, it is not ideal for skewed distributions.















The geometric mean is better than the arithmetic mean for exponentially growing data, because the AM overestimates the typical value. However, since the GM is based on the nth root of a product (or on logarithms), it does not work when any value is 0 or negative.

The harmonic mean is used when we want the average sample size of a number of groups, each with a different number of subjects. It always yields the lowest value of the three means and is the most conservative.

Figure 2: The mode is always at the peak; for skewed data the median and then the mean lie further toward the tail.
If all the numbers are the same, AM = GM = HM. As the spread among the numbers increases, the three means diverge, with AM > GM > HM.
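A quick illustration of AM > GM > HM on made-up, exponentially growing values (Python 3.8+ standard library):

from statistics import mean, geometric_mean, harmonic_mean

data = [2, 4, 8, 16, 32]           # exponentially growing values
print(mean(data))                   # 12.4
print(geometric_mean(data))         # 8.0
print(harmonic_mean(data))          # ~5.16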

Side note:
The median cuts the data into 2 halves, two tertiles cut it into 3 thirds, three quartiles (Q1, median, Q3) cut it into 4 quarters, quintiles into fifths, and so on.

Variance is the square of SD with squared units
Variation is the general concept of variability

-Replace 'ranged' by 'varied', 'range' by 'vary', or 'ranging' by 'varying' (verbs)
-Replace 'range' by 'interval' (nouns)
-Replace 'parameter' by 'variable'
 *A parameter is a characteristic of a population distribution of a variable, not another name for a variable
-If confidence intervals contain negative numbers, use 'to', ',', or ';' as delimiters (not a dash, which can be read as a minus sign)
-Avoid 'at least 5 to 10'; use 'at least 5', since 10 is above 5
-Avoid 'range from 6-20'; use 'vary from 6 to 20'
-Avoid claiming 'first paper' when no credible literature search is presented
-Start numbers less than 1 with a 0; i.e., 0.53 not .53.

Notations:
X_i: the random variable for the ith person
x_i: a specific data point, the value of a variable for one subject (p19)
N: number of subjects in the entire sample
n: number of subjects in the sample for a group when there are 2 or more groups (*may be specific to this book, check with prof)
Σ: sum (upper case sigma)
σ_x (or σ): SD of the individual observations X_i in the population
σ²: variance
μ: population mean
X-bar: sample mean of a variable
σ/√n: SE of the sample mean (i.e. the square root of the variance of the sample mean)


Normal Distribution
Aka Bell Curve and Gaussian Distribution
-With normally distributed data, the mean and variance are not dependent on each other.
-Data from a large number of people (e.g. 1000) will approximate a normal curve.
-The distribution of the means of a large number of samples of reasonable size will always be approximately normal, even if the data themselves are not normally distributed: the Central Limit Theorem.

Figure 3: The sampling distribution of the mean is centred on μ (the population mean) and has variance σ²/n.
If we draw equally sized samples from a non-normal distribution, the distribution of the means of these samples will still be approximately normal, as long as the samples are large enough.

'Large': it depends on how close to or far from normal the shape of the population is. A general rule is n > 30.
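A minimal simulation sketch of the Central Limit Theorem (the exponential population and sample size are arbitrary choices, not course data):

import numpy as np

rng = np.random.default_rng(1)

# Population: exponential with mean 2 (clearly skewed, not normal).
# Draw 5,000 samples of size n = 30 and keep each sample's mean.
sample_means = rng.exponential(scale=2.0, size=(5_000, 30)).mean(axis=1)

print(sample_means.mean())   # close to the population mean, 2.0
print(sample_means.std())    # close to sigma/sqrt(n) = 2/sqrt(30) ~ 0.37
# A histogram of sample_means looks roughly bell-shaped even though the data are skewed.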

A standard score, abbreviated z or Z, is a way of expressing any raw score in terms of SD units (the standard normal distribution): z = (x - X-bar) / SD.

-A score that is equal to the mean has a z-score of 0.
-The average deviation of scores about their mean is 0, whether raw scores or z-scores (the z-scores add up to 0).
-z-scores will always have a mean of 0.0 and an SD of 1.0.
-We don't always have to use the mean and SD of the sample to calculate the z-scores. We can take them from another sample or from the population. If we use a population mean and SD, for example, then it is possible that all the z-scores would be positive.


Properties of a true normal curve
-Mean=median=mode
-Skew=0
-Kurtosis=0
-The curve approaches the X-axis asymptotically
-The relative frequencies in the tails tend to 0
-Most of the action takes place between the lines labeled -3 and +3
-Slightly over 95% of curve falls between -2 and +2

E.g. if the mean = 30.83 and SD = 14.08, then about 68% of the nurses emptied between (30.83 - 14.08) and (30.83 + 14.08). And about 95% dumped between (30.83 - [2*14.08]) and (30.83 + [2*14.08]). Those who cleaned more than (30.83 + [2*14.08]) worked harder than about 97% of their mates.

For a z-score table: if the tabled area is 0.5 when z = 0, then the table gives the area to the left of z. If the tabled area is 0 when z = 0, then the table gives the area between the mean and z.







Probability

Probability deals with the relative likelihood that a certain event will or will not
occur, relative to some other events.

Deriving probabilities:
Empirically aka Relative Frequencies:
Based on past experiences, provided that the circumstances have not changed.
E.g. probability of survival of cancer patients.

Theoretically aka Axiomatic Approach (Classical Approach):
Based on theory of probability.

A probability must lie between 0 (cannot occur) and 1 (must occur).

Mutually Exclusive Events:
Two events, X and Y, are mutually exclusive if the occurrence of one precludes
the occurrence of the other

- Additive law: Pr(X or Y) = Pr(X) + Pr(Y)
 o Vs. the general rule Pr(X or Y) = P(X) + P(Y) - P(X and Y)
- Pr(X and Y) = 0
- Pr(X or Y or ...) = 1 when the mutually exclusive events are exhaustive

Conditionally Dependent Events:
Two events, X and Y, are conditionally dependent if the outcome of Y depends on
X, or X depends on Y.

E.g. 11.11% (4/36) is the probability of rolling two dice and getting a sum of 5 (unconditional). If one die has already been rolled and shows a 1, then the probability that the sum will be 5 is 1/6, or 16.67% (conditional).

- Multiplicative law: Pr(X and Y) = Pr(X) x Pr(Y | X)
Or P(X|Y) = P(X and Y)/P(Y)

Independent Events
They are neither mutually exclusive nor conditionally dependent. Often independent events are mistaken for conditionally dependent ones (e.g. believing you are 'on a roll' after rolling a good number in a casino).

Pr(X and Y) = Pr(X)·Pr(Y)

E.g. rolling a 6 and rolling an even number on the same die are not independent events. But they are independent if they refer to two different dice.

Permutations (when order matters)
nPr = n! / (n - r)!

Combinations (when order does not matter)
nCr = n! / [r!(n - r)!]

*Can be abbreviated as nCr or written as the binomial coefficient (n over r).
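A quick check of the two formulas above using Python's standard library (Python 3.8+, arbitrary n = 10 and r = 3):

import math

# Permutations: ordered arrangements of r items chosen from n
print(math.perm(10, 3))   # 720 = 10!/(10-3)!

# Combinations: unordered selections of r items from n
print(math.comb(10, 3))   # 120 = 10!/(3!*7!)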

The Law of At Least One
Pr(At Least One) = 1 - Pr(none)

Binomial Distribution
Non-continuous variables, when there are only two possible outcomes. (binary
data)

The binomial distribution shows the probability of different outcomes for a series
of random events, each of which can have only one of two values.


*More generally, with n trials, each with probability p of being a success, the probability of a particular sequence of r successes and (n - r) failures is p^r·(1 - p)^(n-r). Multiplying by the number of such sequences, nCr, gives the binomial probability of r successes.

Binomial Expansion:
Used, for example, to find the probability of getting something wrong exactly 7 out of 10 times. If you want to know the probability of getting something wrong at least 7 out of 10 times, you calculate the cumulative probability of those outcomes (r = 7, 8, 9, 10).

-As p gets closer to 0.5, the graph becomes more symmetric and the skew decreases (the data are spread more)
-When p = 0.5, the graph is perfectly symmetric
-When p is less than 0.5 the distribution is skewed to the right; when p is more than 0.5 the distribution is skewed to the left
-There is a different binomial distribution for every combination of n and p
-As n increases, so does the spread in scores (when p is unchanged). When n is 30 or more, we don't need the binomial expansion to figure out probabilities; we approximate the binomial distribution using the normal curve. This can also be done if p = 0.5 and n > 10, but when p is not 0.5 it is risky.

Mean = np
Variance = npq
SD = sqrt(npq)

If you want to know the probability of exactly 5 of the 15 patients developing an infection, you use the binomial expansion formula. If you want to use the normal distribution to approximate it, you calculate the z-scores using 4.5 and 5.5 as your lower and upper limits (because the normal distribution deals with continuous variables) and find the area in between.
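A sketch of the exact binomial probability versus the normal approximation with continuity correction, for the example of 5 infections out of 15 patients (the per-patient probability p = 0.2 is an assumed value, not given in the notes):

from scipy import stats
import math

n, p, r = 15, 0.2, 5

# Exact binomial probability of exactly r successes
exact = stats.binom.pmf(r, n, p)

# Normal approximation with continuity correction (area between 4.5 and 5.5)
mu = n * p
sd = math.sqrt(n * p * (1 - p))
approx = stats.norm.cdf(5.5, mu, sd) - stats.norm.cdf(4.5, mu, sd)

print(exact, approx)   # the two values are reasonably close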

Bernoulli Distribution
A special case of the binomial when n = 1 (binomial(1, p)).
If Xi ~ Bernoulli(p):
Mean = p, estimated by ΣXi/n, i.e. the proportion of Xi equal to 1.
Variance = E[(X - p)²] = p(1 - p).

Poisson Distribution

-When you have a large number of trials
-When probability of success is small
-Can be used as an approximation of the binomial
-Describes count data

For X ~ Poisson(λ):
X has mean and variance both equal to λ.

The Poisson starts to look normal when λ is large, and the binomial when both np and n(1-p) are large.

Notations:
n: total number of objects in the set
r: the number of objects we're looking for
p: the probability on each try of the outcome of interest occurring
q: (1-p)

Inferential Statistics

Inferential statistics do not describe the data; rather, they are used to determine the probability (likelihood) that a conclusion based on analysis of data from a sample is true.

The sample describes those individuals who are in the study; the population describes the hypothetical (and usually infinite) number of people to whom you wish to generalize.

Estimates depend on:
1. The extent to which individual values differ from the average (SD)
2. The sample size

Notations
Alpha (α): Type I error
Beta (β): Type II error
Delta (δ): difference
Pi (π): proportion
Mu (μ): mean
Sigma (σ): SD

-Null hypothesis (H0, the pessimistic one) and alternate hypothesis (H1)

-You can neither prove nor disprove the null hypothesis; statistics is concerned with inference, i.e. inductive logic.
-Retain the null until there is considerable evidence against it. Thus H0 and H1 are not treated equally.
-If the null is true, the sample means will fall symmetrically on either side of the population mean, and the mean of the means will be the same as the overall population mean. The width of this distribution is the SE(M).
-The distribution of the means will be tighter around the population mean than the individual values are (because big and small values cancel out).
-As the sample gets larger, the sample mean will get closer to the population mean.

The less dispersed the original observations (smaller SD), the tighter the sample means will be around the population mean. Hence, the width of the distribution of means, called the Standard Error of the Mean (SE_M), is directly proportional to the SD and inversely proportional to the square root of the sample size:
SE(M) = σ / √n

SD vs SE(M)
SD reflects how close the individual observations come to the sample mean, and the SE(M) reflects, for a given sample size, how close the sample means of repeated samples will come to the population mean.

We need to find out how far away from the population mean, in standard error units, the mean of the treated sample is, in order to calculate the probability that a result this extreme happens by chance:

z = (X-bar - μ) / (σ / √n)

-With a smaller sample but keeping all values the same, you may not reject the
null.
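A small simulation sketch of SE(M) = σ/√n (the μ, σ and n values are arbitrary):

import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n = 100, 15, 25

# Draw many samples of size n and look at the spread of their means
means = rng.normal(mu, sigma, size=(10_000, n)).mean(axis=1)

print(means.std())            # ~3.0
print(sigma / np.sqrt(n))     # 3.0, i.e. sigma/sqrt(n)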

p-Value
A p-value is the probability of observing a result as or more extreme than the
result actually observed, given that the null hypothesis is true.

- A p-value is always with respect to a given null hypothesis
- A p-value also depends on your alternative hypothesis.

Alpha
Alpha is the probability of concluding that the sample came from the H1 distribution (i.e. concluding that there is a significant difference) when in fact it came from the H0 distribution.

-Multiple testing increases your risk of type I errors
Power
Power = 1 - β

It is the probability of concluding that the sample came from the H1 distribution (i.e., concluding there is a significant difference) when it really did come from the H1 distribution (there is a difference).

I.e., it is the likelihood that we could detect a difference that was really there.

To calculate it, first determine the centre of the distribution of sample means under H1. The width of that distribution is the same as that of H0, the SE(M).

Beta
Beta is the probability of concluding that the sample came from the H0 distribution (i.e., concluding that there is no significant difference) when in fact it came from the H1 distribution (there really is a difference).

I.e., what is the likelihood that we could get a sample mean of 170 or less, assuming that H1 is true?

I.e., the probability that the sample mean would be small enough that we would miss the difference and declare that there was none (a Type II error).

             H0 true                   H1 true
Retain H0    correct (prob = 1 - α)    Type II error (prob = β)
Reject H0    Type I error (prob = α)   correct (power = 1 - β)

Type 1 Error: Rejecting the null hypothesis when it is in fact true.
Type 2 Error: Failing to reject the null hypothesis when it is in fact false.

-Ideally we want β to be 0.15-0.20 so that the power of the study is 80-85%. Power is easily affected by sample size: results may look promising but fail to reach statistical significance in a small study. It is also affected by the magnitude of the difference between the groups, the variability within the groups, and α.
-The larger the difference, the smaller the probability of a Type II error.
-A smaller alpha and a smaller sample size both increase the chance of a Type II error.

Signal to Noise Ratio

-Signal: observed difference between groups
-Noise: variability in the measure between individuals within the group

Nearly all statistical tests are based on a signal-to-noise ratio, where the signal is
the important relationship and the noise is a measure of individual variation.

-If the signal is large enough compared to the noise, it is reasonable to conclude that the signal reflects a real effect.

Two Tails Versus One Tail
-A one-tailed test specifies direction of the difference of interest in advance (used
to test directional hypotheses).
-A two-tailed test is a test of any difference between groups, regardless of the
direction of the difference (indifferent of direction).
*Two-tailed tests are used nearly all the time, so that differences in either direction (e.g. harmful side effects, or the drug doing worse) can be reported rather than dismissed as chance.

**The z-value for beta is always based on a one-sided test; the one- vs two-sided distinction only applies to alpha, because the tail of the H1 distribution overlaps that of H0 on only one side. (p.59)

Confidence Intervals

A 95% confidence interval is an interval such that, if the experiment were
repeated a large number of times, the true parameter value would lie in the
confidence interval 95% of the time.

95% of the time we take a sample, the CI will cover (contain) the actual
population mean .


Incorrect to say: 'there is a 95% chance that the population mean is between 9.9 and 10.3.'

If the CI (for a difference) doesn't contain 0, this implies there is a statistically significant difference between the populations.
-Greater variance yields wider CIs (less precise estimates of the parameters)
-As the level of confidence decreases, the width of the corresponding interval decreases
-Say 'fail to reject the null' (better than 'accept the null') because you are acknowledging that there may well be an effect, but there is insufficient evidence to demonstrate it yet
-Note that the CI does not depend on any hypothesis being tested
-In fact there may not be a hypothesis test involved at all

Lower bound: we want to find out where the population mean would have to be so that only 2.5% of the sample means (for samples of size n) would be greater than the obtained sample mean.
Upper bound: there is a 2.5% chance of seeing a sample mean of X-bar or less if the true population mean were this value.
-Evidently, if the 95% CI does contain the parameter value specified by the null hypothesis, then the effect is not significant at the 0.05 level (fail to reject H0).
-If the parameter value for the null hypothesis is less than the lower bound, then the probability of seeing a sample mean of X-bar (or larger) is lower than 0.025, and so the result is significant.
-If the upper bound is greater than the parameter value of our alternate hypothesis, then the probability of observing a sample mean of X-bar (or less) with a population mean of H1 is greater than 0.025.


*When 0 is in the CI, you fail to reject the null. You need to be looking at a CI for the difference of two things, though; for example, the mean of the treated sample minus the mean of the untreated population. If the CI for that difference encompasses 0, then there may not be a difference, and so you fail to reject the null. (P.S. Don't 'accept' it, because there may also be a difference, as indicated by the rest of the CI.)
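A minimal sketch of a 95% CI for a mean using the t distribution (the data values are made up):

import numpy as np
from scipy import stats

data = np.array([9.8, 10.2, 10.1, 9.9, 10.4, 10.0, 10.3, 9.7])

mean = data.mean()
se = stats.sem(data)          # SD / sqrt(n)
df = len(data) - 1

# 95% CI for the mean (t is used because the SD is estimated from a small sample)
low, high = stats.t.interval(0.95, df, loc=mean, scale=se)
print(round(low, 2), round(high, 2))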

Statistical Significance versus Clinical Importance
(Derivation of the 95% CI for the mean:
P(-1.96 ≤ (X-bar - μ)/(σ/√n) ≤ 1.96) = 0.95
P(μ - 1.96·σ/√n ≤ X-bar ≤ μ + 1.96·σ/√n) = 0.95
P(X-bar - 1.96·σ/√n ≤ μ ≤ X-bar + 1.96·σ/√n) = 0.95)

When you have a small sample size, a larger difference is required to reach statistical
significance.

Statistical significance is a necessary precondition for a consideration of clinical importance
but says nothing about the actual magnitude of the effect.

In order to determine clinical importance:
-not difficult if the outcome is in units we can understand
-if dealing with point-scales, then we need an index: the Effect Size

Effect size:
-d type: differences between groups (no upper limit; usually the maximum is about 1.0 though)
-r type: proportion of variance in one variable that is explained by another variable (upper limit = 1.0)
-If the sample size is really large, the magnitude of the difference needed to achieve statistical significance is small (p.58). Therefore, it is always important to ask what the size of the sample was.
-The critical value (CV) is 1.96 SEs to the right of the null mean. This distance is called z_α (the z value corresponding to the alpha error).

(CV - 100) / (σ/√n) = z_α
(105 - CV) / (σ/√n) = z_β
δ / (σ/√n) = z_α + z_β

n = [(z_α + z_β)·σ / δ]²

δ/σ = the effect size; it is like a z-score, and it tells you how big the difference is in SD units.
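A sketch of the sample-size formula above (δ = 5 follows the 100-vs-105 example; σ = 15 and 80% power are assumed values, not from the notes):

from scipy import stats

alpha, power = 0.05, 0.80
delta, sigma = 5, 15

z_alpha = stats.norm.ppf(1 - alpha / 2)   # two-sided alpha
z_beta = stats.norm.ppf(power)            # z corresponding to beta = 0.20

n = ((z_alpha + z_beta) * sigma / delta) ** 2
print(round(n))     # required n by the formula above (~71)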

-A p-value shouldn't be reported to more than 2 decimal places unless your sample size is larger than 1000.

*Do the critical values for z_α and z_β have to be equal? See p.59; based on the graph they look roughly equal.

Post-Hoc Tests (of power)
A post-hoc power calculation tells you whether you had the power to detect the difference you actually saw; if you had, the result would likely have been statistically significant.
Chi-Squared Test
The observed counts in each cell follow a Poisson distribution (mean = variance). When your sample size is large,

O_jk ~ approx N(E_jk, E_jk)

So,

(O_jk - E_jk) / √E_jk ~ approx N(0, 1)

Since there are 4 cells in the (2x2) table, we combine the 4 test statistics:

χ² = Σ_j Σ_k (O_jk - E_jk)²/E_jk
   = (O_11 - E_11)²/E_11 + (O_12 - E_12)²/E_12 + (O_21 - E_21)²/E_21 + (O_22 - E_22)²/E_22

This test statistic follows a chi-squared distribution on 1 df (χ²(1)).
Yates Continuity Correction


Used if the expected cell count in any of the cells is less than 5, because the p-
value would otherwise be too small (chi-square test is too liberal).


Fisher's Exact Test
An alternative to the Yates continuity correction.

If we doubled all the values in a table, we would expect the chi-squared statistic to double and the p-value to decrease, even though the association is the same.

-p-values measure how likely it is that the observed difference (or one more extreme) is attributable to chance. They do not measure how large the difference is.
-Therefore, we report a measure of association in addition to the p-value for the chi-squared test.

Yates-corrected statistic:
Instead of χ² = Σ_j Σ_k (O_jk - E_jk)²/E_jk,
use χ² = Σ_j Σ_k (|O_jk - E_jk| - 0.5)²/E_jk.

Measure of Association:

           Outcome present   Outcome absent   Total
Group 1    a                 b                a+b
Group 2    c                 d                c+d
Total      a+c               b+d              N

Risk difference: a/(a+b) - c/(c+d)
Relative risk: [a/(a+b)] / [c/(c+d)]
Odds ratio: ad/(bc)

Odds is the probability of an event happening divided by the probability of the
event not happening.
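A small sketch computing the three measures of association from the a/b/c/d layout above (made-up counts):

a, b = 30, 70    # group 1: outcome present / absent
c, d = 18, 82    # group 2: outcome present / absent

risk1 = a / (a + b)
risk2 = c / (c + d)

risk_difference = risk1 - risk2          # 0.30 - 0.18 = 0.12
relative_risk = risk1 / risk2            # ~1.67
odds_ratio = (a * d) / (b * c)           # ~1.95

print(risk_difference, relative_risk, odds_ratio)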

Analysis of an rxc Table
df= (r-1)*(c-1)

McNemar's Test
For paired proportions.

df = 1
OR = discordant pairs (in the expected direction) / discordant pairs (in the opposite direction)
E.g., the odds of a case having the problem are 7 times those of a control.

The T-test

But we don't know the SD, so we estimate it. As a result, the distribution is no longer normal; it is more diffuse (fatter tails). This distribution is known as Student's t distribution, and the fatness of the tails depends on the df.



[Figure: the standard normal density compared with the t density, which has fatter tails. With σ known, (Ȳ - 3562)/(σ/√n) ~ N(0, 1); when σ is replaced by the estimate σ̂, (Ȳ - 3562)/(σ̂/√n) follows the t distribution.]

-As sample size approaches infinity, the t-distribution begins to look normal
because the estimate of SD is more precise.

One Sample T-test

t = (Ȳ - μ0) / (σ̂/√n), with df = n - 1, where σ̂² = Σ_i (Y_i - Ȳ)² / (n - 1)
Two-Sample T-test
- Can be used with one or two groups, whereas the z-test is mainly used to
compare one group to an arbitrary or population value.
- The SD is unknown for the t-test, but the population SD, σ, is given to us for the z-test.
- Therefore, t-test estimates both means and SD, which introduces a
dependency on sample size.
- If the null hypothesis is true, then the difference in sample means
o Has mean zero
o Has a Normal distribution if sample size is sufficiently large




Assumptions:
1. Sample means are Normally distributed
2. All observations are independent of one another
3. Variances in the two groups are equal




-The pooled estimate of the SD should lie between the SD estimates for each group.
-It is weighted by the df.
*We use n - 1 when calculating the SD of a sample to compensate for bias, since the data deviate less from the sample mean than they do from the population mean.

-We calculate the SE of the difference, SE(d), by combining the error of the two estimated means:

SE(d) = √(s1²/n1 + s2²/n2), which with the pooled estimate equals σ̂·√(1/n1 + 1/n2)

-The distribution of differences is centered on zero with an SD of SE(d).

-t converges to z, and both critical values are 1.96 when alpha = 0.05 and the sample is large. However, t is larger for small samples, so we require a larger relative difference to achieve significance.
-The probability of observing a sample difference this large or larger is the area in the right and left tails of the curve.
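A minimal sketch of the two-sample t-test in scipy on made-up groups, showing both the pooled version and the Welch/Satterthwaite version mentioned later (for unequal variances):

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
treated = rng.normal(105, 15, size=20)
control = rng.normal(100, 15, size=25)

# Pooled (standard) two-sample t-test: assumes equal variances
t_pooled, p_pooled = stats.ttest_ind(treated, control, equal_var=True)

# Welch / Satterthwaite version: does not assume equal variances
t_welch, p_welch = stats.ttest_ind(treated, control, equal_var=False)

print(t_pooled, p_pooled)
print(t_welch, p_welch)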

Degrees of Freedom
The number of unique pieces of information in a set of data, given that we know the sum. It is df = n - 1.

(Pooled two-sample t-test formulas:
σ̂² = [(n_T - 1)·s_T² + (n_C - 1)·s_C²] / (n_T + n_C - 2)
(Ȳ_T - Ȳ_C) / [σ̂·√(1/n_T + 1/n_C)] ~ t with (n_T + n_C - 2) df)

After estimating the sample mean, the df is how many unique pieces of information you have left to estimate your standard deviation.

For a t-test:
df = (n1 - 1) + (n2 - 1) = n1 + n2 - 2

Extended t-test
Used for 2 groups of unequal sample size.

σ̂ = √{[(n1 - 1)·s1² + (n2 - 1)·s2²] / (n1 + n2 - 2)}
t = (X-bar1 - X-bar2) / SE(d), where SE(d) = σ̂·√(1/n1 + 1/n2)

Pooled vs Separate Variance Estimates
So far, we have been able to pool the two samples and estimate the SD and variance of the population (the pooled estimate). This works because the two samples are drawn from the same population and hence have the same mean and SD.

When the variances cannot be assumed equal, compute the t-test that doesn't weight the two estimates together. The trade-off is that the df are calculated differently and turn out to be much closer to the smaller of the two samples (a harmonic-mean-like adjustment), so it is harder to reach statistical significance.

SE = √(s1²/n1 + s2²/n2)

The Calibrated Eyeball

CI = X-bar ± t_(α/2) · SE
*Use t, not z, when the sample size is relatively small.
Works for two independent groups; doesn't work for more than 2 groups or for two measurements on the same individual.
1. If the error bars don't overlap, the groups differ at p ≤ 0.01.
2. If the amount of overlap is less than half of one CI, the significance level is ≤ 0.05.
3. If the overlap is more than about 50% of the CI, the difference is not statistically significant.

Effect Size
ES = (X-bar1 - X-bar2) / SD
-Expressed in SD units.
-The ES is referred to as a standardized mean difference:
 -Cohen's d: uses the pooled SD from both groups
 -Glass's Δ: uses the SD of the control group only

**Do we use the SD or the SE(M)? On page 74 I am not sure what they used, but it isn't the same as the SE(M) from p.72.

-d is used more, although we don't know which is better.
-The advantage of d is that it uses data from both groups, which increases the accuracy of the estimate. The disadvantage is that the intervention may change not only the mean but also the SD.

Criteria for d (works for both d and Δ):
ES = 0.2: small; negligible practical/clinical significance
ES = 0.5: moderate; moderate practical/clinical significance
ES = 0.8: large; crucial practical/clinical importance

If you come across a naked t-value without an ES, you can calculate d using:
d = t·√(1/n1 + 1/n2)
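A quick sketch of recovering d from a naked t-value with the formula above (the t and group sizes are made up):

import math

t, n1, n2 = 2.4, 20, 25
d = t * math.sqrt(1 / n1 + 1 / n2)
print(round(d, 2))    # ~0.72, moderate-to-large by the criteria above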

Sample Size and Power (see Tables D & E)

n = 2[(z_α + z_β)·σ / δ]²   (or is it the SE? see Q.3B)

If power is low, we conclude that the study was too small and that the negative results were probably a Type II error.

Paired T-test
-Used when you cannot assume independence between the values.

To test for equality of variances (between two independent groups), we use Levene's test:
H0: var(Y_Ti) = var(Y_Ci)
H1: var(Y_Ti) ≠ var(Y_Ci)

If Levene's test has a p-value > 0.05: retain the null, assume the variances are equal, and use the standard t-test.
If the p-value < 0.05: reject the null hypothesis of equal variances and use the Satterthwaite or Welch t-test (a modified version of the t-test).






D-bar is the sample mean of the differences in weights.

μ_d = the mean of the D_i, i.e. the mean of the differences in weights
σ_d = the standard deviation of the D_i, i.e. the SD of the differences in weights

D-bar ~ N(μ_d, σ_d²/n)

D-bar = (1/372)·Σ_i D_i = (1/372)·Σ_i (Y_2i - Y_1i)   (372 pairs in this example)

t = D-bar / (σ̂_d/√n)
df = n - 1
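A minimal paired t-test sketch on made-up before/after measurements, showing that it is equivalent to a one-sample t-test on the differences:

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
before = rng.normal(80, 10, size=30)
after = before - rng.normal(2, 3, size=30)   # e.g. weights after treatment

# Paired t-test works on the differences D_i = after_i - before_i
t_stat, p_value = stats.ttest_rel(after, before)
print(t_stat, p_value)

# Same t by hand: mean difference divided by its standard error
d = after - before
print(d.mean() / (d.std(ddof=1) / np.sqrt(len(d))))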
One-Way ANOVA

SS(between) = n·Σ_k (X-bar_.k - X-bar_..)²
df = k - 1

If the groups are of different sample sizes, use the harmonic mean of the sample sizes [2 / (1/n1 + 1/n2)].

SS(within) = Σ_i Σ_k (X_ik - X-bar_.k)²
df = k(n - 1)

SS(total) = SS(between) + SS(within)
          = Σ_i Σ_k (X_ik - X-bar_..)²
df = nk - 1 = df(between) + df(within)

*The equations look different if n differs between groups; the computer handles this.

We could directly look at the signal-to-noise ratio, BUT:
-The more groups I have, the larger the between-group SS
-The more observations within groups, the larger the within-group SS
Therefore we scale each SS by its df.

Calculating Mean Squares
Divide each SS by its df.
-It is a measure of average deviation.

F-ratio
The signal-to-noise ratio:

F = Mean Square (between) / Mean Square (within)

-Under the null hypothesis, this ratio follows an F distribution with df(between) and df(within).
-The F-ratio is never zero, because the SS(between) is never exactly 0. In the absence of a difference, the expected mean square between, E(MS_bet), is exactly equal to the error (within-group) variance, σ²_err.
-If no variance exists within groups, then the difference between the sample means is equal to the difference between the population means (the between-group component is n·σ²_bet).

Therefore,
E(MS_bet) = σ²_err + n·σ²_bet

When the true variance between groups drops out (σ²_bet = 0), the ratio equals 1, because the mean square within groups is the error value.

Assumptions
- Data are normally distributed
- Homogeneity of variance among groups
- Observations are independent of one another
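A minimal one-way ANOVA sketch on made-up groups; scipy's f_oneway returns the F-ratio and its p-value:

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
g1 = rng.normal(10, 2, size=15)
g2 = rng.normal(11, 2, size=15)
g3 = rng.normal(13, 2, size=15)

f_stat, p_value = stats.f_oneway(g1, g2, g3)
print(f_stat, p_value)     # df(between) = 2, df(within) = 42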

Unequal Sized Groups
Grand mean is a weighted average of group means
Y-bar = (n1·X-bar1 + n2·X-bar2 + n3·X-bar3) / (n1 + n2 + n3)











(One-way ANOVA sums of squares in summation notation:
Total: Σ_k Σ_i (Y_ik - Ȳ)²
Between: Σ_k n_k·(Ȳ_k - Ȳ)²
Within: Σ_k Σ_i (Y_ik - Ȳ_k)²)

Post Hoc Tests
Post-hoc tests are essentially t-tests, modified so as to control the type I error.
Use post-hoc tests only if your ANOVA is statistically significant
Select a post-hoc test to use a priori and use only that test

Bonferroni Correction

Divide your desired significance level (0.05) by the number of comparisons you are going to make (k); this gives you your new significance level.
OR
Multiply the p-values by the number of tests done (in this case 3).

ANOVA vs T-test
*The calculated F-value for a one-way ANOVA on two groups is equal to the square of the t-value; you get the same p-value.

The t-test is more flexible with regard to unequal variances and to calculating CIs for differences in means.

Two-Way ANOVA

Interested in the mean of an interval or ratio outcome that may be associated
with two categorical factors.

It allows you to test whether the association between a factor and an outcome depends on the level of another factor.

Tests for the effect of each factor separately.

Full factorial design: when all possible combinations are included in the design (VS
fractional factorial design).

Notations
Ctrl: μ0
Group A: μ_A = μ0 + τ_A
Group B: μ_B = μ0 + τ_B
Group A+B: μ_AB = μ0 + τ_A + τ_B + τ_AB (interaction)

Hypotheses
1. H0: τ_A = 0 vs H1: τ_A ≠ 0
2. H0: τ_B = 0 vs H1: τ_B ≠ 0
3. H0: τ_AB = 0 vs H1: τ_AB ≠ 0

We are going to assume the same number of people will receive each
combination of factors (balanced design).

The concepts apply to unbalanced design.

Group Means
Ȳ = (1/N)·Σ_i Σ_j Σ_k Y_ijk, where N = nJK is the total sample size
Ȳ_j = (1/(Kn))·Σ_i Σ_k Y_ijk   (mean for level j of the first factor)
Ȳ_k = (1/(Jn))·Σ_i Σ_j Y_ijk   (mean for level k of the second factor)
Ȳ_jk = (1/n)·Σ_i Y_ijk         (cell mean)

Sums of Square (within, error)
The differences between the individual values and their group (cell) mean:
SS_within = Σ_j Σ_k Σ_i (Y_ijk - Ȳ_jk)²
df_error = N - JK

Sums of Square (total)
SS_total = Σ_j Σ_k Σ_i (Y_ijk - Ȳ)²
df = N - 1

Other SS
Intuitively, since we have 3 hypotheses to test (with 2 factors), we need 3 more SS.

Sums of Square (A)
SS_A = nK·Σ_j (Ȳ_j - Ȳ)²
The j subscript indexes the rows (the J levels of factor A), and K is the number of columns. It is the squared deviations between the group means for A and the grand mean.
df = J - 1
F = MS_A / MS_error

Sums of Square (B)
SS_B = nJ·Σ_k (Ȳ_k - Ȳ)²
The k subscript indexes the columns (the K levels of factor B), and J is the number of rows. It is the squared deviations between the group means for B and the grand mean.
df = K - 1
F = MS_B / MS_error


Sums of Square (interaction)
The degree to which the combined effects differ from the sum of the individual effects:

SS_interaction = n·Σ_j Σ_k (Ȳ_jk - Ȳ_j - Ȳ_k + Ȳ)²
              = n·Σ_j Σ_k [(Ȳ_jk - Ȳ) - (Ȳ_j - Ȳ) - (Ȳ_k - Ȳ)]²

df = (J - 1)(K - 1)
F = MS_AB / MS_error

If the interaction is insignificant, you can do a post-hoc test.
-One-Way and Two-Way ANOVA assume that the observations are independent.
Otherwise you should use a repeated measures ANOVA.

Correlation and Regression (Linear)
Regression and correlation techniques can be used to examine the nature and
strength of the relationship between two variables.

Pearson's Correlation Coefficient (r)
-Used to measure the degree or strength of a linear relationship between two
continuous variables
-Varies from -1 to 1 with both extremes indicating strong negative or positive
relationship, respectively
If r = 1, there is a perfect direct (positive) linear association between the two variables.
If r = -1, there is a perfect inverse (negative) linear association between the two variables.
If r = 0, there is no linear association between the two variables.
*Values closer to -1 or +1 represent stronger negative or positive linear associations.
*Values > 0.5 or < -0.5 are considered strong.
-r is a point estimate of the population correlation coefficient, rho (ρ).
-It has no units.
-r can be adversely affected by outliers; therefore, an r without a scatter plot may be misleading.
-Correlation does not imply cause and effect.

Assumptions of r
-Both variables are continuous (i.e. measured on an interval scale)
-Both variables are Normally distributed
-There is a linear relationship between the variables
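A minimal sketch of Pearson's r in scipy (x and y are simulated, linearly related variables):

import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.normal(50, 10, size=100)
y = 0.5 * x + rng.normal(0, 5, size=100)     # linearly related plus noise

r, p_value = stats.pearsonr(x, y)
print(r, p_value)     # r close to +0.7 here; p tests H0: rho = 0
print(r**2)           # the coefficient of determination, r^2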
Part II
Regression: Determining the Form of a Relationship
-To verify the association or relationship between a single outcome variable and one or more explanatory variables (explanatory variables are the variables on the right side of the equation)
-To confirm results from other studies
-To reduce a large number of variables to a smaller set of variables
-To develop scoring rules for risk assessment
-To develop a model to predict future unobserved responses (outcomes) given a set of predictor variables

Y_i = α + β·X_i + ε_i

Y_i: the dependent variable (known)
α: the intercept coefficient (unknown)
β: the slope coefficient (unknown)
X_i: the predictor variable (known, assumed error-free)
ε_i: the random error (unknown; the quantity we minimize)

This is a linear model because X enters linearly (it is not squared or cubed).

Assumptions
-Existence: The relationship between dependent and independent variables exist
-Linearity: The relationship is linear over the spectrum of values studied
-Independence: For given values of explanatory variable x, the y-values are
independent of each other
-Normality: The errors are normally distributed
-Constant variance: the distribution of the errors is normal with constant
variance.

Assessing the Assumptions
-Existence:
 -biological plausibility of the relationship
 -R² (known as the coefficient of determination) is a measure of the percent of variability explained by the model, i.e. the goodness-of-fit of the model (aka the percent of variability in the outcome accounted for by the regression line)
 -determines the magnitude of the effect
 -between 0 and 1 (never negative)
 E.g. if R² = 0.89, then the regression line explains 89% of the variation in y, and the strength of the linear relationship between x and y is r = 0.94.

-Linearity:
-Use scatter plots to visually assess linear relationship
-Add non-linear terms to the model and assess their statistical significance
or model improvement

-Independence of y-values (given the x-values)
-Data collection procedures (eg clustered data, time series data, etc)

-Normality
 -Use a qqplot (normal probability plot) of the residuals. Residuals are the deviations from the regression line (the difference between the observed and the fitted y-value for a given x-value).
 -Ideally: the normality assumption is valid if the graph shows a straight line.

-Constant Variance
-Use plots of residuals vs fitted values/x-values
-Ideally: Constant variance assumption is valid if the graph shows a random
pattern

Method of Least Squares
The line drawn is called the least-squares line; it is the best-fitting line.
The general equation for a straight line is y = α + βx.

Notes:
-A positive correlation coefficient means the slope β is positive
-The slope is the rate of change in y for a one-unit change in x
-The regression line and r can be sensitive to outliers and missing data
-Scatter plots are used to describe a linear relationship, but correlation coefficients / simple regression analyses measure it
-The coefficient of determination (r²) is a more accurate measure of the strength of a relationship than the absolute value of r. For example, an r of .2 does not indicate a relationship that is twice as strong as an r of .1; in fact, it indicates a relationship that is four times as strong. That is, the r² for .2 is .04, whereas the r² for .1 is only .01.
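A minimal simple-linear-regression sketch using scipy's linregress (x and y are simulated; the true intercept and slope are 3.0 and 0.8):

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=50)
y = 3.0 + 0.8 * x + rng.normal(0, 1, size=50)

result = stats.linregress(x, y)
print(result.intercept, result.slope)   # estimates of alpha and beta
print(result.rvalue**2)                 # R^2, the coefficient of determination
print(result.pvalue)                    # Wald test of H0: beta = 0
print(result.stderr)                    # se(beta), used for t = beta / se(beta)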

Statistical Significance
The statistical significance of the relationship or association can be assessed in two ways:

1. t-Statistic (Wald Test)
H0: β = 0
HA: β ≠ 0
t = b / se(b), with df = n - 2
*(n - 2 because we estimate two parameters, the intercept and the slope)

se(r) = √[(1 - r²) / (n - 2)]

Example sums of squares:
SST = Σ(yᵢ - ȳ)² = Σyᵢ² - (Σyᵢ)²/n = 183891.7
SSR = Σ(ŷᵢ - ȳ)² = b₁²·Σ(xᵢ - x̄)² = 0.64² × 395541.7 = 162013.9
SSE = SST - SSR = 183891.7 - 162013.9 = 21877.8

2. F-Statistic (ANOVA)

Source of Variation      Sum of Squares    DF         Mean Square         F-Ratio
Regression (explained)   SSR               k          MSR = SSR/k         F = MSR/MSE
Error (unexplained)      SSE = SST - SSR   n - k - 1  MSE = SSE/(n-k-1)
Total                    SST               n - 1

Multiple Linear Regression

Assumptions
Independence of explanatory variables
Existence: The relationship between dependent and independent variables exist
Linearity: The relationship is linear over the spectrum of values studied
Independence of the response values: For given values of explanatory variable x,
the y-values are independent of each other
Normality: For given values of the explanatory variables, the y-values are normally
distributed
Constant variance: the distribution of y-values has equal variance at each value of
x.

