σ²: variance
μ: population mean
X-bar: sample mean of a variable
σ/√n: SE of the sample mean (i.e. square root of the variance of the sample mean)
Normal Distribution
Aka Bell Curve and Gaussian Distribution
-With normally distributed data, the mean and variance are not dependent on
each other.
-Data from a large number of people (e.g. 1000) will often approximate a normal curve
-The distribution of the means of a large number of samples of reasonable size will be approximately normal even if the data themselves are not normally distributed
Central Limit Theorem
Figure 3: Here, μ is the population mean (estimated by the sample mean) and σ²/n is the variance of the sample mean.
If we draw equally sized samples from a non-normal distribution, the distribution of the means of these samples will still be approximately normal, as long as the samples are large enough.
Large: it depends on how close the shape of the population is to the normal. The general rule of thumb is n > 30.
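The claim above can be checked with a small simulation. This is only a sketch: the exponential population (mean 1, SD 1), the sample size of 30 and the number of samples are assumptions chosen for illustration, not values from the notes.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

def sample_means(n, n_samples=2000):
    """Draw n_samples samples of size n from a skewed (exponential)
    population and return the mean of each sample."""
    return [statistics.fmean([random.expovariate(1.0) for _ in range(n)])
            for _ in range(n_samples)]

means = sample_means(n=30)
mean_of_means = statistics.fmean(means)  # expected near the population mean, 1.0
sd_of_means = statistics.stdev(means)    # expected near sigma / sqrt(n) = 1 / sqrt(30)
print(round(mean_of_means, 3), round(sd_of_means, 3))
```

Even though the population is strongly skewed, the sample means cluster symmetrically around the population mean with a spread close to σ/√n.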
A standard score, abbreviated as z or Z, is a way of expressing any raw score in terms of SD units (standard normal distribution): z = (raw score - mean) / SD.
-A score that is equal to the mean has a Z-score of 0.
-The average deviation of scores about their mean is 0, whether raw score or Z-score (the z-scores add up to 0).
-Z-scores will always have a mean of 0.0 and an SD of 1.0
-We don't always have to use the mean and SD of the sample to calculate z-scores. We can take them from another sample or the population. If we use a population mean and SD, for example, then it's possible that all the z-scores would be positive.
Properties of a true normal curve
-Mean=median=mode
-Skew=0
-Kurtosis=0
-The curve approaches the X-axis asymptotically
-Relative frequencies tend to 0
-Most of the action takes place between the lines labeled -3 and +3
-Slightly over 95% of the curve falls between -2 and +2
E.g. if the mean = 30.83 and SD = 14.08, then 68% of the nurses emptied between (30.83 - 14.08) and (30.83 + 14.08), and 95% dumped between (30.83 - [2*14.08]) and (30.83 + [2*14.08]). Those who cleaned more than (30.83 + [2*14.08]) worked harder than about 97.5% of their mates.
For a z-score table: if the area is 0.5 when z=0, then the table gives the area under the curve to the left of z. If the area is 0 when z=0, then it gives the area between the mean and z.
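Both table styles can be reproduced from the standard normal CDF. The sketch below uses `math.erf` from the Python standard library; the function names are illustrative, and the mean and SD are the ones from the nurses example above.

```python
import math

def area_left(z):
    """Area under the standard normal curve to the left of z
    (the table style where z = 0 gives 0.5)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def area_mean_to_z(z):
    """Area between the mean (z = 0) and z
    (the table style where z = 0 gives 0)."""
    return abs(area_left(z) - 0.5)

mean, sd = 30.83, 14.08
raw = mean + 2 * sd          # a raw score two SDs above the mean
z = (raw - mean) / sd        # its standard score

within_1_sd = area_left(1) - area_left(-1)
within_2_sd = area_left(2) - area_left(-2)
print(round(within_1_sd, 4), round(within_2_sd, 4), round(area_left(z), 4))
```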
Probability
Probability deals with the relative likelihood that a certain event will or will not
occur, relative to some other events.
Deriving probabilities:
Empirically aka Relative Frequencies:
Based on past experiences, provided that the circumstances have not changed.
E.g. probability of survival of cancer patients.
Theoretically aka Axiomatic Approach (Classical Approach):
Based on theory of probability.
A probability must lie between 0 (cannot occur) and 1 (must occur).
Mutually Exclusive Events:
Two events, X and Y, are mutually exclusive if the occurrence of one precludes
the occurrence of the other
- Additive law: Pr(X or Y) = Pr(X) + Pr(Y)
o Vs. the general case: Pr(X or Y) = Pr(X) + Pr(Y) - Pr(X and Y)
- Pr(X and Y) = 0
- Pr(X or Y or ...) = 1 when the mutually exclusive events cover all possible outcomes
Conditionally Dependent Events:
Two events, X and Y, are conditionally dependent if the outcome of Y depends on
X, or X depends on Y.
E.g. 11.11% (4/36) is the probability of rolling two dice and getting a sum of 5 (unconditional). If one die is rolled and yields a 1, then the probability of rolling a sum of 5 is 1/6 or 16.67% (conditional).
- Multiplicative law: Pr(X and Y) = Pr(X) x Pr(Y | X)
Or P(X|Y) = P(X and Y)/P(Y)
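The dice example can be verified by counting outcomes. This is a minimal sketch using exact fractions; the helper `pr` and the event names are made up for illustration.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two dice.
rolls = list(product(range(1, 7), repeat=2))

def pr(event, given=lambda r: True):
    """Pr(event | given), computed by counting outcomes."""
    conditioned = [r for r in rolls if given(r)]
    hits = sum(1 for r in conditioned if event(r))
    return Fraction(hits, len(conditioned))

def sum_is_5(r):
    return r[0] + r[1] == 5

def first_is_1(r):
    return r[0] == 1

p_sum5 = pr(sum_is_5)                      # unconditional probability
p_sum5_given_1 = pr(sum_is_5, first_is_1)  # conditional on the first die being 1
p_joint = pr(lambda r: sum_is_5(r) and first_is_1(r))

# Multiplicative law: Pr(X and Y) = Pr(X) * Pr(Y | X)
assert p_joint == pr(first_is_1) * p_sum5_given_1
print(p_sum5, p_sum5_given_1, p_joint)
```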
Independent Events
They are neither mutually exclusive nor conditionally dependent. Often independent events are mistaken for conditionally dependent ones (e.g. rolling a good number in a casino, being "on a roll").
Pr(X and Y) = Pr(X)*Pr(Y)
E.g. Rolling a 6 and rolling an even number on one die are not independent events. But they are independent if they happen on 2 different dice.
Permutations (when order matters): nPr = n! / (n-r)!
Combinations (when order does not matter): nCr = n! / [r! (n-r)!]
*Can be abbreviated as (n r), read "n choose r"
The Law of At Least One
Pr(At Least One) = [1 - Pr(none)]
Binomial Distribution
Non-continuous variables, when there are only two possible outcomes. (binary
data)
The binomial distribution shows the probability of different outcomes for a series
of random events, each of which can have only one of two values.
*More generally, with n trials, each with probability p of being a success, the probability of a particular sequence of r successes and n-r failures is p^r (1-p)^(n-r).
Binomial Expansion:
Pr(r successes in n trials) = nCr * p^r * (1-p)^(n-r)
Used to find, for example, the probability of getting something wrong exactly 7 out of 10 times. If you want to know the probability of getting something wrong at least 7 out of 10 times, you calculate the cumulative probability of these outcomes (r=7, 8, 9, 10).
-As p gets closer to 0.5, the graph becomes more symmetric and skew decreases (data spread more)
-When p=0.5, the graph is perfectly symmetric
-When p is less than 0.5 the distribution is skewed to the right; when p is more than 0.5 the distribution is skewed to the left.
-There is a different binomial distribution for every combination of n and p
-As n increases, so does the spread in scores (when p is unchanged). When n is 30 or more, we don't need to use the equation for binomial expansion to figure out probabilities; we approximate the binomial distribution using the normal curve. The approximation can also be used if p=0.5 and n>10, but when p is not 0.5 it is risky.
Mean = np
Variance = npq
SD = sqrt(npq)
If you want to know the probability of exactly 5 of the 15 patients developing an infection, you use the binomial expansion formula. If you want to use the normal distribution to approximate it, you calculate the z-scores using 4.5 and 5.5 as your lower and upper limits (because normal distributions use continuous variables) and calculate the area in between.
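The exact and approximate calculations can be compared side by side. In this sketch the infection probability p = 0.3 is a hypothetical value (the notes do not give one), and the "wrong 7 out of 10 times" case assumes p = 0.5 purely for illustration.

```python
import math

def binom_pmf(r, n, p):
    """Probability of exactly r successes in n trials (binomial expansion)."""
    return math.comb(n, r) * p ** r * (1 - p) ** (n - r)

def norm_cdf(x, mu, sd):
    """Area to the left of x under a normal curve with mean mu and SD sd."""
    return 0.5 * (1 + math.erf((x - mu) / (sd * math.sqrt(2))))

n, p = 15, 0.3                      # p is hypothetical
exact = binom_pmf(5, n, p)          # exactly 5 of 15 patients

# Normal approximation: mean = np, SD = sqrt(npq), limits 4.5 and 5.5.
mu, sd = n * p, math.sqrt(n * p * (1 - p))
approx = norm_cdf(5.5, mu, sd) - norm_cdf(4.5, mu, sd)

# "At least 7 out of 10": cumulative probability for r = 7, 8, 9, 10.
at_least_7_of_10 = sum(binom_pmf(r, 10, 0.5) for r in range(7, 11))
print(round(exact, 4), round(approx, 4), round(at_least_7_of_10, 4))
```

With n this small and p away from 0.5 the two answers differ noticeably in the second decimal place, which is the risk the notes mention.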
Bernoulli Distribution
Is a special case of the binomial when n=1 (binomial(1, p)).
If Xi ~ Bernoulli(p):
Mean = p, estimated by (sum of Xi)/n, i.e. the number of Xi equal to 1 divided by n
Variance = E[(X-p)²] = p(1-p)
Poisson Distribution
-When you have a large number of trials
-When probability of success is small
-Can be used as an approximation of the binomial
-Describes count data
For X ~ Poisson(λ)
X has a mean and variance = λ
The Poisson distribution starts to look normal when λ is large, and the binomial does as well when both np and n(1-p) are large.
Notations:
n: total number of objects in the set
r: the number of objects we're looking for
p: the probability on each try of the outcome of interest occurring
q: (1-p)
Inferential Statistics
Does not describe data; rather, it is used to determine the probability (likelihood) that a conclusion based on analysis of data from a sample is true.
The sample describes those individuals who are in the study; the population describes the hypothetical (and usually infinite) number of people to whom you wish to generalize
Estimates depend on:
1. The extent to which individual values differ from the average (SD)
2. The sample size
Notations
Alpha   α   Type I error
Beta    β   Type II error
Delta   Δ   Difference
Pi      π   Proportion
Mu      μ   Mean
Sigma   σ   SD
-Null hypothesis (H0, pessimistic) and alternate hypothesis (H1)
-You can neither prove nor disprove the null hypothesis; statistics is concerned with inference, i.e. inductive logic.
-Retain the null until there is considerable evidence against it. Thus, H0 and H1 are not treated equally.
-If the null is true, the sample means will fall symmetrically on either side of the population mean, and the mean of the means will be the same as the overall population mean. The SD of this distribution is the SE(M).
-The distribution of the means will be tighter around the population mean than will be the individual observations (because big and small can cancel out).
-As the sample gets larger, the mean will get closer to the population mean.
The less dispersed the original observations (smaller SD), the tighter the means will be to the population mean. Hence, the width of the distribution of means, called the Standard Error of the Mean (SE(M)), is directly proportional to the SD and inversely proportional to the square root of the sample size.
SE(M) = σ / √n
SD vs SE(M)
SD reflects how close individual observations come to the sample mean, and the SE(M) reflects, for a given sample size, how close the sample means of repeated samples will come to the population mean.
We need to find out how far away from the population mean, in SE units, the mean of the treated sample is, in order to calculate the area, i.e. the probability that it happens by chance.
z = (X-bar - μ) / (σ / √n)
-With a smaller sample but keeping all values the same, you may not reject the
null.
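A short sketch of the z-test for a sample mean, showing how the same difference can be significant with a larger sample and not with a smaller one. All numbers (population mean 100, SD 15, sample mean 105) are hypothetical.

```python
import math

def z_test(sample_mean, pop_mean, pop_sd, n):
    """Return SE(M), z, and the two-tailed p-value for a sample mean."""
    se = pop_sd / math.sqrt(n)                 # SE(M) = sigma / sqrt(n)
    z = (sample_mean - pop_mean) / se
    p_two_tailed = math.erfc(abs(z) / math.sqrt(2))
    return se, z, p_two_tailed

# Same means and SD, two different sample sizes.
se_36, z_36, p_36 = z_test(105, 100, 15, n=36)
se_9, z_9, p_9 = z_test(105, 100, 15, n=9)
print(round(z_36, 2), round(p_36, 4))   # larger sample
print(round(z_9, 2), round(p_9, 4))     # smaller sample
```

With n = 36 the z-value exceeds 1.96 and the null is rejected at the 0.05 level; with n = 9 it does not.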
p-Value
A p-value is the probability of observing a result as or more extreme than the
result actually observed, given that the null hypothesis is true.
- A p-value is always with respect to a given null hypothesis
- A p-value also depends on your alternative hypothesis.
Alpha
Alpha is the probability of concluding that the sample came from the H1 distribution (i.e. concluding that there is a significant difference) when in fact it came from the H0 distribution.
-Multiple testing increases your risk of type I errors
Power
= 1 - Beta
The probability of concluding that the sample came from the H1 distribution (i.e., concluding there is a significant difference), when it really came from the H1 distribution (there is a difference).
I.e., it is the likelihood that we could detect a difference that was there.
First determine the center of the distribution of sample means under H1. The width of the distribution is the same as that of H0, the SE(M).
Beta
Beta is the probability of concluding that the sample came from the H0 distribution (i.e., concluding that there is no significant difference), when in fact it came from the H1 distribution (there really is a difference).
I.e., what is the likelihood that we could get a sample mean of 170 or less, assuming that H1 is true?
I.e., the probability that the sample mean would be small enough that we would miss it, and declare that there was no difference (Type II error).
             H0 true                   H1 true
Retain H0    1-α                       Type II error, Prob = β
Reject H0    Type I error, Prob = α    1-β
Type 1 Error: Rejecting the null hypothesis when it is in fact true.
Type 2 Error: Failing to reject the null hypothesis when it is in fact false.
-Ideally we want β to be 0.15-0.20 so that the power of the study is 80-85%. Power is easily affected by sample size: results may look promising but not reach statistical significance. It is also affected by the magnitude of the difference between groups, the variability within the groups, and α.
-The larger the difference, the smaller the probability of a type II error
-A smaller alpha and a smaller sample size increase the chance of a type II error
Signal to Noise Ratio
-Signal: observed difference between groups
-Noise: variability in the measure between individuals within the group
Nearly all statistical tests are based on a signal-to-noise ratio, where the signal is
the important relationship and the noise is a measure of individual variation.
-If the signal is large enough compared to the noise, it is reasonable to conclude that the signal has some effect
Two Tails Versus One Tail
-A one-tailed test specifies direction of the difference of interest in advance (used
to test directional hypotheses).
-A two-tailed test is a test of any difference between groups, regardless of the
direction of the difference (indifferent of direction).
*Two-tailed tests are used nearly all the time so as to be able to report differences in the unexpected direction (e.g. side effects), rather than having to say that the drug being 80% worse arose just by chance.
**The choice between one and two tails applies only to alpha. The z-value for beta is always based on a one-sided test, because the tail of H1 only overlaps that of H0 on one side (p.59).
Confidence Intervals
A 95% confidence interval is an interval such that, if the experiment were
repeated a large number of times, the true parameter value would lie in the
confidence interval 95% of the time.
95% of the time we take a sample, the CI will cover (contain) the actual population mean μ.
Incorrect to say: "there is a 95% chance that the population mean is between 9.9 and 10.3".
If the CI for a difference doesn't contain 0, this implies there is a statistically significant difference between the populations.
-Greater levels of variance yield larger CI (less precise estimates of parameters)
-As the level of confidence decreases, the size of the corresponding interval will
decrease.
-Use "fail to reject the null" (better than "accept the null") because you are acknowledging that there may well be an effect but we have insufficient evidence to prove it, yet.
-Note that the CI does not depend on any hypothesis being tested.
-In fact there may not be a hypothesis test involved at all.
Lower bound: We want to find out where the population mean would have to be
so that 2.5% of the sample means (for n) would be greater than the obtained
sample mean.
Upper bound: there is a 2.5% chance of seeing a sample mean of X-bar or less if
the true population mean was this value.
-Evidently, if the 95% CI does contain the parameter value of the null hypothesis, then the effect is not significant at the 0.05 level (fail to reject H0).
-If the parameter value for the null hypothesis is less than the lower bound, then the probability of seeing a sample mean of value X-bar is lower than 0.025, and so the result is significant.
-If the upper bound is greater than the parameter of our alternate hypothesis, then the probability of observing a sample mean of X-bar with a population mean of H1 is greater than 0.025.
*When 0 is in the CI, you fail to reject the null. You need to be looking at a CI for the difference of two things though, for example the mean of the treated sample minus the mean of the untreated population. If the CI for that encompasses 0, then there may not be a difference and so you fail to reject the null. (P.S. Don't accept it, because there may also be a difference, as indicated by the CI.)
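As a rough illustration of how a 95% CI relates to the significance test, the sketch below builds the interval X-bar ± 1.96 σ/√n and checks whether the null value falls inside it. The numbers (sample mean 105, SD 15, null value 100) are hypothetical.

```python
import math

def ci_mean(xbar, sd, n, z=1.96):
    """95% confidence interval for a mean when the SD is known."""
    half_width = z * sd / math.sqrt(n)
    return xbar - half_width, xbar + half_width

def contains(interval, value):
    low, high = interval
    return low <= value <= high

null_value = 100
ci_large = ci_mean(105, 15, n=36)   # narrower interval
ci_small = ci_mean(105, 15, n=9)    # wider interval

# If the CI excludes the null value, the result is significant at 0.05.
significant_large = not contains(ci_large, null_value)
significant_small = not contains(ci_small, null_value)
print(ci_large, significant_large)
print(ci_small, significant_small)
```

The interval itself is computed without reference to any hypothesis; the null value only enters when the interval is used as a test.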
Statistical Significance versus Clinical Importance
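95% confidence interval for the population mean:
$$P\left(\mu - 1.96\frac{\sigma}{\sqrt{n}} \le \bar{x} \le \mu + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95$$
$$P\left(-\bar{x} - 1.96\frac{\sigma}{\sqrt{n}} \le -\mu \le -\bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95$$
$$P\left(\bar{x} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95$$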
When you have a small sample size, a larger difference is required to reach statistical
significance.
Statistical significance is a necessary precondition for a consideration of clinical importance
but says nothing about the actual magnitude of the effect.
In order to determine clinical importance
-not difficult if the outcome is in units we can understand
-if dealing with point-scales then we need an index: Effect Size
Effect size:
-d type: Differences between groups (no upper limit; usually the max is 1.0 though)
-r type: Proportion of variance in one variable that is explained by another variable (upper limit = 1.0)
-If the sample size is really large, the magnitude of the difference needed to achieve statistical
significance is small (p.58). Therefore, it is always important to ask what the size of the sample
was.
-Critical value (CV) is 1.96 SEs to the right of the null mean. The distance is called z_α (the z value corresponding to the alpha error).
(CV - 100) / (σ / √n) = z_α
(105 - CV) / (σ / √n) = z_β
Δ / (σ / √n) = z_α + z_β
n = [(z_α + z_β) σ / Δ]²
Δ/σ = Effect size; it's like the z-score, and it tells you how big the difference is in SD units.
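The sample size formula n = [(z_α + z_β) σ / Δ]² is easy to evaluate directly. In this sketch the difference Δ = 5 comes from the 100 versus 105 example above, while σ = 15, z_α = 1.96 (two-tailed alpha of 0.05) and z_β = 0.84 (power of about 80%) are assumed values.

```python
import math

def sample_size(delta, sd, z_alpha=1.96, z_beta=0.84):
    """Sample size for a one-sample z-test, rounded up to a whole subject."""
    return math.ceil(((z_alpha + z_beta) * sd / delta) ** 2)

effect_size = 5 / 15                 # delta / sigma, the difference in SD units
n_needed = sample_size(delta=5, sd=15)
n_half_delta = sample_size(delta=2.5, sd=15)   # halving delta roughly quadruples n
print(n_needed, n_half_delta)
```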
-p-values shouldn't be reported to more than 2 decimal places unless your sample size is larger than 1000.
*Do the CVs for z_α and z_β have to be equal? See p.59. Based on the graph they look kind of equal.
Post-Hoc tests
Tells you whether you had the power to detect the difference you actually saw; if you did, the result would likely have been statistically significant
Chi-Squared Test
The observed counts in each cell follow a Poisson Distribution (mean = variance).
When your sample size is large,
$$O_{jk} \sim \text{approx } N(E_{jk}, E_{jk})$$
So,
$$\frac{O_{jk} - E_{jk}}{\sqrt{E_{jk}}} \sim \text{approx } N(0, 1)$$
Since there are 4 cells in the table, we combine the 4 test statistics:
$$\chi^2 = \sum_{j=1}^{2}\sum_{k=1}^{2}\frac{(O_{jk} - E_{jk})^2}{E_{jk}} = \frac{(O_{11} - E_{11})^2}{E_{11}} + \frac{(O_{12} - E_{12})^2}{E_{12}} + \frac{(O_{21} - E_{21})^2}{E_{21}} + \frac{(O_{22} - E_{22})^2}{E_{22}}$$
This test statistic follows a chi-squared distribution on 1 df, $\chi^2_{(1)}$.
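The chi-squared statistic for a 2x2 table can be sketched as follows. The observed counts are hypothetical, and the expected counts are taken as (row total x column total) / grand total, the usual choice under independence. For 1 df the p-value can be obtained from the normal curve, since a chi-squared variable on 1 df is a squared standard normal.

```python
import math

def chi_square_2x2(table):
    """Return (chi2, p) for a 2x2 table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for j in range(2):
        for k in range(2):
            expected = row_totals[j] * col_totals[k] / grand_total
            chi2 += (table[j][k] - expected) ** 2 / expected
    # Upper-tail probability of chi-squared on 1 df.
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

observed = [[20, 30],
            [30, 20]]   # hypothetical counts
chi2, p = chi_square_2x2(observed)
print(round(chi2, 3), round(p, 4))
```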
$$\frac{\bar{Y} - 3562}{\sigma / \sqrt{n}}$$
-As sample size approaches infinity, the t-distribution begins to look normal
because the estimate of SD is more precise.
One Sample T-test
Two-Sample T-test
- Can be used with one or two groups, whereas the z-test is mainly used to
compare one group to an arbitrary or population value.
- SD is unknown for the t-test but the population SD, σ, is given to us for the z-test.
- Therefore, t-test estimates both means and SD, which introduces a
dependency on sample size.
- If the null hypothesis is true, then the difference in sample means
o Has mean zero
o Has a Normal distribution if sample size is sufficiently large
$$t = \frac{\bar{Y} - \mu_0}{\hat{\sigma} / \sqrt{n}} \sim t_{(n-1)}$$
$$\hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2$$
Assumptions:
1. Sample means are Normally distributed
2. All observations are independent of one another
3. Variances in the two groups are equal
-The estimate SD from above should lie between the estimates of SD for each group.
-Weighted by df
*We use n-1 when calculating SD for a sample to accommodate for bias since the
data will deviate less from the sample mean than they will from the population
mean.
-We calculate the SE(d) by summing the error of the two estimated means
SE(d) = √[(s1² + s2²) / n] = sigma-hat * √(1/n1 + 1/n2)
-The distribution of differences is centered on zero with an SD of SE(d).
-t converges to z as the sample size grows, and the critical values are both 1.96 when alpha=0.05 (two-tailed). However, t is larger for small samples, so we require a larger relative difference to achieve significance.
-The probability of observing a sample difference large enough is the area in the
right and left tail of the standard curve
Degrees of Freedom
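The number of unique pieces of information in a set of data given that we know the sum. It is df = n - 1.
Pooled variance estimate and two-sample t statistic (T = treated, C = control):
$$\hat{\sigma}^2 = \frac{(n_T - 1)\hat{\sigma}_T^2 + (n_C - 1)\hat{\sigma}_C^2}{n_T + n_C - 2}$$
$$\frac{\bar{Y}_T - \bar{Y}_C}{\hat{\sigma}\sqrt{\frac{1}{n_T} + \frac{1}{n_C}}} \sim t_{(n_T + n_C - 2)}$$
After estimating the sample mean, how many unique pieces of information did you use to estimate your standard deviation?
For a t-test:
df = [(n1 - 1) + (n2 - 1)] = (n1 + n2 - 2)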
Extended t-test
Used for 2 groups of unequal sample size
Sigma-hat = √{[(n1-1)s1² + (n2-1)s2²] / (n1 + n2 - 2)}
t = (X-bar1 - X-bar2) / [sigma-hat * √(1/n1 + 1/n2)] = (X-bar1 - X-bar2) / SE(d)
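A minimal sketch of the pooled two-sample t-test for groups of unequal size. The two small data sets are hypothetical; only the t-value and df are computed, since looking up the p-value needs a t table (or a statistics library, which this sketch deliberately avoids).

```python
import math
import statistics

def pooled_t(x1, x2):
    """Return (t, df, sigma_hat) for a pooled-variance two-sample t-test."""
    n1, n2 = len(x1), len(x2)
    v1, v2 = statistics.variance(x1), statistics.variance(x2)  # n-1 denominators
    # Pooled SD: the two variances weighted by their df.
    sigma_hat = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    se_d = sigma_hat * math.sqrt(1 / n1 + 1 / n2)
    t = (statistics.fmean(x1) - statistics.fmean(x2)) / se_d
    return t, n1 + n2 - 2, sigma_hat

treated = [5, 7, 9, 11, 8]   # hypothetical data
control = [4, 6, 5, 7]       # hypothetical data
t, df, sigma_hat = pooled_t(treated, control)
print(round(t, 3), df, round(sigma_hat, 3))
```

The pooled SD lands between the two group SDs, as the notes say it should.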
Pooled vs Separate Variance Estimates
So far, we have been able to pool the two samples and estimate the SD and variance of the population (pooled estimate). This works because the two samples are drawn from the same population and hence have the same mean and SD.
When the two groups have clearly different SDs, pooling is no longer justified. Therefore, compute the t-test that doesn't weight the two estimates together. The trade-off is that the df are calculated differently and turn out to be much closer to the size of the smaller of the two samples (harmonic mean). Therefore, it is harder to get statistical significance.
SE = √[(s1² / n1) + (s2² / n2)]
The Calibrated Eyeball
CI = X-bar +/- t_(α/2) * SE
*use t not z when sample size is relatively small
For two independent groups; doesn't work for more than 2 groups or for two measurements on the same individual.
1. If the error bars don't overlap, the groups are significantly different at p<=0.01.
2. If the amount of overlap is less than half of the CI, the significance level is <=0.05.
3. If the overlap is more than about 50% of the CI, the difference is not statistically significant.
Effect Size
ES = (X-bar1 - X-bar2) / SD
-Expressed in SD units.
-ES is referred to as the standardized mean difference:
-Cohen's d: uses the pooled SD from both groups
-Glass's Δ: uses the SD only from the control group
**Do we use SD or SE(M)? On page 74, I am not sure what they used, but it isn't the same as the SE(M) from p.72.
-d is used more, although we don't know which is better.
-Advantage of d is that it uses data from both groups, which increases the accuracy of the estimate. The disadvantage is that the intervention may change not only the mean but also the SD.
Criteria (works for d and Δ):
ES = 0.2 small; negligible practical/clinical significance
ES = 0.5 moderate; moderate practical/clinical significance
ES = 0.8 large; crucial practical/clinical importance
If you come across a naked t-value without an ES, you can calculate d using:
d = t * √(1/n1 + 1/n2)
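The two effect size indices and the t-to-d conversion can be sketched together. The data are hypothetical and the function names are illustrative; the last step confirms that d recovered from a "naked" pooled t-value matches d computed directly.

```python
import math
import statistics

treated = [5, 7, 9, 11, 8]   # hypothetical data
control = [4, 6, 5, 7]       # hypothetical data

def pooled_sd(a, b):
    """SD pooled over both groups, weighted by df."""
    na, nb = len(a), len(b)
    ss = (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    return math.sqrt(ss / (na + nb - 2))

def cohens_d(a, b):
    """Mean difference in units of the pooled SD."""
    return (statistics.fmean(a) - statistics.fmean(b)) / pooled_sd(a, b)

def glass_delta(a, control_group):
    """Mean difference in units of the control group's SD only."""
    diff = statistics.fmean(a) - statistics.fmean(control_group)
    return diff / statistics.stdev(control_group)

def d_from_t(t_value, n1, n2):
    """Recover d from a reported pooled t-value."""
    return t_value * math.sqrt(1 / n1 + 1 / n2)

d = cohens_d(treated, control)
delta = glass_delta(treated, control)
n1, n2 = len(treated), len(control)
t = (statistics.fmean(treated) - statistics.fmean(control)) / (
    pooled_sd(treated, control) * math.sqrt(1 / n1 + 1 / n2))
d_back = d_from_t(t, n1, n2)
print(round(d, 3), round(delta, 3), round(d_back, 3))
```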
Sample Size and Power (see table D & E)
n = 2[(z_α + z_β) σ / Δ]² (σ, OR IS IT SE (Q.3B)?)
If power is low, we conclude that the study was too small and that the negative
results were probably a type II error.
Paired T-test
-When you cannot assume independence between values
To test for equality of variances, we use Levene's test
H0: var(Y_Ti) = var(Y_Ci)
H1: var(Y_Ti) ≠ var(Y_Ci)
If Levene's test has a p-value > 0.05: retain the null, assume variances are equal and use the standard t-test
Or, if the p-value < 0.05: reject the null hypothesis of equal variances; need to use the Satterthwaite or Welch t-test (modified version of the t-test).
D-bar is the sample mean of differences in weights
μ_d = mean of the D_i, i.e. mean of differences in weights
σ_d = std deviation of the D_i, i.e. std deviation of differences in weights
df = n-1
$$\bar{D} \sim N(\mu_d, \sigma_d^2 / n)$$
$$\bar{D} = \frac{1}{372}\sum_{i=1}^{372} D_i = \frac{1}{372}\sum_{i=1}^{372}(Y_{2i} - Y_{1i})$$
$$t = \frac{\bar{D}}{\sigma_d / \sqrt{n}}$$
One-Way ANOVA
SS(between) = n * sum_k (X-bar.k - X-bar..)²
df = k - 1
If the groups are of different sample size, use the harmonic mean of sample sizes [2 / (1/n1 + 1/n2)]
SS(within) = sum_i sum_k (X_ik - X-bar.k)²
df = k(n-1)
SS(total) = SS(between) + SS(within) = sum_i sum_k (X_ik - X-bar..)²
df = nk-1 = df(between) + df(within)
*The equation looks different if n differs between groups; the computer fixes the problem.
We could directly look at the signal to noise ratio,
BUT:
-The more groups I have, the larger the between-group SS
-The more observations within groups, the larger the within-group SS
Therefore
Calculating Mean Squares
By dividing each SS by its df.
-It is a measure of average deviation.
F-ratio
Signal:noise ratio;
F = Mean Square Between / Mean Square Within
-Under the null hypothesis, this ratio has an F distribution with df(between) and df(within).
-F-ratio is never zero because the SS(between) is never 0. In the absence of a difference, the expected mean square (between), E(MS_bet), is exactly equal to the variance (within), σ²_err.
-If no variance exists within groups, then the difference between sample means is equal to the difference between population means (= nσ²_bet).
Therefore,
E(MS_bet) = σ²_err + nσ²_bet
When the true variance between groups drops out, the ratio equals 1, because the mean square within groups is the error variance.
Assumptions
- Data are normally distributed
- Homogeneity of variance among groups
- Observations are independent of one another
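The sums of squares, mean squares and F-ratio for a one-way ANOVA can be sketched in a few lines. The three groups below are hypothetical; the function weights each group mean by its own size, so it also handles unequal group sizes.

```python
import statistics

def one_way_anova(groups):
    """Return SS, df, MS and F for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2
                     for g in groups)
    ss_within = sum((x - statistics.fmean(g)) ** 2 for g in groups for x in g)
    ss_total = sum((x - grand_mean) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n_total - k
    ms_between = ss_between / df_between   # signal
    ms_within = ss_within / df_within      # noise
    return {
        "ss_between": ss_between, "ss_within": ss_within, "ss_total": ss_total,
        "df_between": df_between, "df_within": df_within,
        "F": ms_between / ms_within,
    }

groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]   # hypothetical data
result = one_way_anova(groups)
print(result)
```

Dividing each SS by its df before forming the ratio is what removes the dependence on the number of groups and observations.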
Unequal Sized Groups
Grand mean is a weighted average of group means
Y-bar = (n1*X-bar1 + n2*X-bar2 + n3*X-bar3) / (n1 + n2 + n3)
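Sums of squares with unequal group sizes:
$$SS_{Total} = \sum_{k=1}^{K}\sum_{i=1}^{n_k}(Y_{ik} - \bar{Y})^2$$
$$SS_{Between} = \sum_{k=1}^{K} n_k(\bar{Y}_k - \bar{Y})^2$$
$$SS_{Within} = \sum_{k=1}^{K}\sum_{i=1}^{n_k}(Y_{ik} - \bar{Y}_k)^2$$
Post Hoc Tests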
Post-hoc tests are essentially t-tests, modified so as to control the type I error.
Use post-hoc tests only if your ANOVA is statistically significant
Select a post-hoc test to use a priori and use only that test
Bonferroni Correction
Divide your desired significance level (0.05) by the number of comparisons you are going to make (k); this gives you your new significance level.
OR
Multiply the p-values by the number of tests done (in this case 3)
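Both forms of the Bonferroni correction lead to the same decisions, as this sketch shows. The three p-values are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Apply the Bonferroni correction in both equivalent forms."""
    k = len(p_values)
    threshold = alpha / k                               # new significance level
    adjusted = [min(1.0, p * k) for p in p_values]      # p-values times k, capped at 1
    by_threshold = [p <= threshold for p in p_values]
    by_adjusted = [a <= alpha for a in adjusted]
    return threshold, adjusted, by_threshold, by_adjusted

p_values = [0.01, 0.02, 0.04]   # hypothetical p-values from 3 comparisons
threshold, adjusted, by_threshold, by_adjusted = bonferroni(p_values)
print(round(threshold, 4), [round(a, 2) for a in adjusted], by_threshold)
```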
Anova vs T-test
*The calculated F-value for a one-way ANOVA on two groups is equal to the square of the t-statistic. You get the same p-value.
The t-test is more flexible with regard to unequal variances and calculating CIs for differences in means.
Two-Way ANOVA
Interested in the mean of an interval or ratio outcome that may be associated
with two categorical factors.
It allows you to test whether the association between a factor and an outcome depends on the level of another factor.
Tests for the effect of each factor separately.
Full factorial design: when all possible combinations are included in the design (VS
fractional factorial design).
Notations
Ctrl: μ_0
Group A: μ_A = μ_0 + β_A
Group B: μ_B = μ_0 + β_B
Group A+B: μ_AB = μ_0 + β_A + β_B + β_AB
Hypotheses
1. H0: β_A = 0 vs H1: β_A ≠ 0
2. H0: β_B = 0 vs H1: β_B ≠ 0
3. H0: β_AB = 0 vs H1: β_AB ≠ 0
We are going to assume the same number of people will receive each
combination of factors (balanced design).
The concepts apply to unbalanced design.
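Group Means
$$\bar{Y} = \frac{1}{N}\sum_{i=1}^{n}\sum_{j=1}^{J}\sum_{k=1}^{K} Y_{ijk}, \text{ where } N = n \times J \times K = \text{total sample size}$$
$$\bar{Y}_{j\cdot} = \frac{1}{Kn}\sum_{i=1}^{n}\sum_{k=1}^{K} Y_{ijk}$$
$$\bar{Y}_{\cdot k} = \frac{1}{Jn}\sum_{i=1}^{n}\sum_{j=1}^{J} Y_{ijk}$$
$$\bar{Y}_{jk} = \frac{1}{n}\sum_{i=1}^{n} Y_{ijk}$$
Sums of Square (within, error)
Difference between individual values and their group mean.
$$SS_{Within} = \sum_{j=1}^{J}\sum_{k=1}^{K}\sum_{i=1}^{n}(Y_{ijk} - \bar{Y}_{jk})^2$$
df_error = N-JK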
Sums of Square (total)
$$SS_{Total} = \sum_{j=1}^{J}\sum_{k=1}^{K}\sum_{i=1}^{n}(Y_{ijk} - \bar{Y})^2$$
df = N-1
Other SS
Intuitively, since we have 3 hypotheses to test (with 2 factors), we need 3 SS.
Sums of Square (A)
The A subscript is for the row (equivalent to j), and K is the number of columns. It is the squared deviations between the group means for A and the grand mean.
$$SS_A = nK\sum_{j=1}^{J}(\bar{Y}_{j\cdot} - \bar{Y})^2$$
df = J-1
F = MS_A / MS_error
Sums of Square (B)
The B subscript is for the column (equivalent to k), and J is the number of rows. It is the squared deviations between the group means for B and the grand mean.
$$SS_B = nJ\sum_{k=1}^{K}(\bar{Y}_{\cdot k} - \bar{Y})^2$$
df = K-1
F = MS_B / MS_error
Sums of Square (interaction)
The degree to which combined effects differ from the sum of the individual effects
$$SS_{interaction} = n\sum_{j=1}^{J}\sum_{k=1}^{K}\left[(\bar{Y}_{jk} - \bar{Y}) - (\bar{Y}_{j\cdot} - \bar{Y}) - (\bar{Y}_{\cdot k} - \bar{Y})\right]^2 = n\sum_{j=1}^{J}\sum_{k=1}^{K}(\bar{Y}_{jk} - \bar{Y}_{j\cdot} - \bar{Y}_{\cdot k} + \bar{Y})^2$$
df = (J-1)(K-1)
F = MS_AB / MS_error
If the interaction is not significant, you can do a post-hoc test.
-One-Way and Two-Way ANOVA assume that the observations are independent. Otherwise you should use a repeated measures ANOVA.
Correlation and Regression (Linear)
Regression and correlation techniques can be used to examine the nature and strength of the relationship between two variables.
Pearson's Correlation Coefficient (r)
-Used to measure the degree or strength of a linear relationship between two continuous variables
-Varies from -1 to 1, with the extremes indicating a perfect negative or positive relationship, respectively
If r=1, there is a perfect direct linear association between the two variables
If r=-1, there is a perfect inverse linear association between the two variables
If r=0, there is no linear association between the two variables.
*Values closer to -1 or 1 represent strong negative or positive linear associations, respectively
*Values >0.5 and <-0.5 are considered strong
-It is a point estimate of the population correlation coefficient, rho (ρ)
-No units
-r can be adversely affected by outliers; therefore, an r without a scatter plot may
be misleading
-Correlation does not imply cause and effect
Assumptions of r
-Both variables are continuous (ie. measured on an interval scale)
-Both variables are Normally distributed
-There is a linear relationship between the variables
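Pearson's r can be computed from the sums of squares and cross-products. This is a minimal sketch with hypothetical data; the second call illustrates the r = 1 case of a perfect direct linear association.

```python
import math
import statistics

def pearson_r(x, y):
    """Pearson's correlation coefficient for paired observations."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]          # hypothetical data
y = [2, 4, 5, 4, 5]          # hypothetical data
r = pearson_r(x, y)
r_perfect = pearson_r(x, [2 * a + 1 for a in x])   # points exactly on a line
print(round(r, 4), round(r_perfect, 4))
```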
Part II
Regression: Determining the Form of a Relationship
-To verify the association or relationship between a single variable and one or more explanatory variables (explanatory variables: variables on the right side of the equation)
-To confirm results from other studies
-To reduce a large number of variables to a smaller set of variables
-To develop scoring rules for risk assessment
-To develop a model to predict future unobserved responses (outcomes) given a
set of predictor variables
$$Y_i = \alpha + \beta X_i + \varepsilon_i$$
Y_i: dependent variable (known)
α: intercept coefficient (unknown)
β: slope coefficient (unknown)
X_i: predictor variable (known, error-free)
ε_i: random error (unknown; minimize)
This is linear because the X is linear (not squared/cubed).
Assumptions
-Existence: The relationship between dependent and independent variables exists
-Linearity: The relationship is linear over the spectrum of values studied
-Independence: For given values of explanatory variable x, the y-values are
independent of each other
-Normality: The errors are normally distributed
-Constant variance: the distribution of the errors is normal with constant
variance.
Assessing the Assumptions
-Existence:
-biological plausibility of the relationship
-R², known as the coefficient of determination, is a measure of the percent of variability explained by the model; goodness-of-fit of the model (aka percent of variability in the outcome accounted for by the regression line).
-Determines magnitude of the effect
-Between 0-1 (never negative)
E.g. If R²=0.89 then the regression line explains 89% of the variation in y, and the strength of the linear relationship between x and y is r = √0.89 ≈ 0.94.
-Linearity:
-Use scatter plots to visually assess linear relationship
-Add non-linear terms to the model and assess their statistical significance or model improvement
-Independence of y-values (given the x-values)
-Data collection procedures (e.g. clustered data, time series data, etc.)
-Normality
-Use qqplot: normal probability plot
-Ideally: Normality assumption is valid if the graph shows a straight line
-Normal probability plot of the residuals. Residuals are deviations from the regression line (difference between observed and calculated y-value for a certain x-value).
-Constant Variance
-Use plots of residuals vs fitted values/x-values
-Ideally: Constant variance assumption is valid if the graph shows a random
pattern
Method of Least Squares
The line drawn is called the least-squares line
Used to find the best-fitting line
General equation for a straight line is y = α + βx
Notes:
-Positive correlation coefficient means β is positive
-Slope is the change in y for a one-unit change in x
-Regression line and r can be sensitive to outliers and missing data
-Scatter plots are used to describe a linear relationship, but correlation coefficients/simple regression analyses measure it
- The coefficient of determination (r²) is a more accurate measure of the strength of a relationship than the absolute value of r. For example, an r of .2 does not indicate a relationship that is twice as strong as an r of .1. In fact, an r of .2 indicates a relationship that is four times as strong as an r of .1. That is, the r² for .2 is .04 whereas the r² for .1 is only .01.
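The least-squares line and its sums of squares can be sketched as follows. The data are hypothetical; `a` and `b` stand for the estimated intercept and slope, and SST, SSR and SSE are the total, explained and unexplained sums of squares, with R² = SSR/SST.

```python
import statistics

def least_squares(x, y):
    """Fit y = a + b*x by least squares and return the fit summary."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx                 # slope
    a = my - b * mx               # intercept: the line passes through (mx, my)
    sst = sum((yi - my) ** 2 for yi in y)                       # total
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # residual
    ssr = sst - sse                                              # explained
    return {"a": a, "b": b, "sst": sst, "ssr": ssr, "sse": sse,
            "r2": ssr / sst}

x = [1, 2, 3, 4, 5]   # hypothetical data
y = [2, 4, 5, 4, 5]   # hypothetical data
fit = least_squares(x, y)
print({name: round(value, 4) for name, value in fit.items()})
```

For a single predictor, the explained sum of squares also equals b² times the sum of squared deviations of x, which is a handy check on the arithmetic.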
Statistical Significance
The statistical significance of the relationship or association can be assessed in two ways:
1. T-Statistic (Wald Test)
H0: β = 0
HA: β ≠ 0
*df = n-2 because we are estimating x and y
se(r) = √[(1 - r²) / (n - 2)]
2. F-Statistic (ANOVA)
$$SST = \sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2 / n = 183891.7$$
$$SSR = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 = b^2\sum_{i=1}^{n}(x_i - \bar{x})^2 = 0.64^2 \times 395541.7 = 162013.9$$
$$SSE = SST - SSR = 21877.8$$
Source of Variation       Sum of Squares     DF       Mean Square           F-Ratio
Regression (explained)    SSR                k        MSR = SSR/k           F = MSR/MSE
Error (unexplained)       SSE = SST-SSR      n-k-1    MSE = SSE/(n-k-1)
Total                     SST                n-1
Multiple Linear Regression
Assumptions
Independence of explanatory variables
Existence: The relationship between dependent and independent variables exists
Linearity: The relationship is linear over the spectrum of values studied
Independence of the response values: For given values of explanatory variable x, the y-values are independent of each other
Normality: For given values of the explanatory variables, the y-values are normally distributed
Constant variance: the distribution of y-values has equal variance at each value of x.