One-Way Independent Samples Analysis of Variance
If we are interested in the relationship between a categorical IV and a continuous DV, the one-way analysis of variance (ANOVA) may be a suitable inferential technique. If the IV had only two levels (groups), we could just as well do a t-test, but the ANOVA allows us to have 2 or more categories. The null hypothesis tested is that μ1 = μ2 = ... = μk, that is, all k treatment groups have identical population means on the DV. The alternative hypothesis is that at least two of the population means differ from one another.
We start out by making two assumptions:
- Each of the k populations is normally distributed and
- Homogeneity of variance - each of the populations has the same variance, the IV does not
affect the variance in the DV. Thus, if the populations differ from one another they differ in
location (central tendency, mean).
The model we employ here states that each score on the DV has two components:
- the effect of the treatment (the IV, Groups) and
- error, which is anything else that affects the DV scores, such as individual differences among subjects, errors in measurement, and other extraneous variables.

That is, Yij = μ + τj + eij, or, Yij - μ = τj + eij.

The difference between the grand mean (μ) and the DV score of subject number i in group number j, Yij, is equal to the effect of being in treatment group number j, τj, plus error, eij. [Note that I am using i as the subscript for subject # and j for group #]
Computing ANOVA Statistics From Group Means and Variances, Equal n.
Let us work with the following contrived data set. We have randomly assigned five students to
each of four treatment groups, A, B, C, and D. Each group receives a different type of instruction in
the logic of ANOVA. After instruction, each student is given a 10 item multiple-choice test. Test
scores (# items correct) follow:
Group Scores Mean
A 1 2 2 2 3 2
B 2 3 3 3 4 3
C 6 7 7 7 8 7
D 7 8 8 8 9 8
Now, do these four samples differ enough from each other to reject the null hypothesis that
type of instruction has no effect on mean test performance? First, we use the sample data to
estimate the amount of error variance in the scores in the population from which the samples were
randomly drawn. That is variance (differences among scores) that is due to anything other than the
IV. One simple way to do this, assuming that you have an equal number of scores in each sample, is
to compute the average within group variance:

MSE = (s1² + s2² + ... + sk²) / k.

The among groups sum of squares is SSA = Σ nj(Mj - GM)², which simplifies to the computational form SSA = Σ(Tj²/nj) - G²/N, where Tj is the sum of the scores in group j and G is the grand sum of all N scores. With unequal sample sizes, pool the group variances weighting by sample size, pj = nj/N: MSE = Σ pj sj².
2. Obtain the Among Groups SS, Σ nj(Mj - GM)².

The GM = Σ pj Mj = .2556(4.85) + .2331(4.61) + .2707(4.61) + .2406(4.38) = 4.616.

Among Groups SS = 34(4.85 - 4.616)² + 31(4.61 - 4.616)² + 36(4.61 - 4.616)² + 32(4.38 - 4.616)² = 3.646.

With 3 df, MSA = 1.215, and F(3, 129) = 2.814, p = .042.
3. Before you get excited about this significant result, notice that the sample variances are not homogeneous. There is a negative correlation between sample mean and sample variance, due to a ceiling effect as the mean approaches its upper limit, 5. The ratio of the largest to the smallest variance is .793²/.360² = 4.852, which is significant beyond the .01 level with Hartley's maximum F-ratio statistic (a method for testing the null hypothesis that the variances are homogeneous). Although the sample sizes are close enough to equal that we might not worry about violating the homogeneity of variance assumption, for instructional purposes let us make some corrections for the heterogeneity of variance.
4. Box (1954, see our textbook) tells us the critical (.05) value for our F on this problem is somewhere between F(1, 30) = 4.17 and F(3, 129) = 2.675. Unfortunately our F falls in that range, so we don't know whether or not it is significant.
5. The Welch procedure (see the formulae in our textbook) is now our last resort, since we cannot transform the raw data (which we do not have).

W1 = 34 / .360² = 262.35, W2 = 31 / .715² = 60.638, W3 = 36 / .688² = 76.055, and W4 = 32 / .793² = 50.887.

The weighted grand mean is X' = Σ Wj Mj / Σ Wj = [262.35(4.85) + 60.638(4.61) + 76.055(4.61) + 50.887(4.38)] / [262.35 + 60.638 + 76.055 + 50.887] = 2125.44 / 449.93 = 4.724.
The numerator of F'' = Σ Wj(Mj - X')² / (k - 1) = [262.35(4.85 - 4.724)² + 60.638(4.61 - 4.724)² + 76.055(4.61 - 4.724)² + 50.887(4.38 - 4.724)²] / 3 = 3.988.
The denominator of F'' = 1 + [2(k - 2)/(k² - 1)] Σ [1/(nj - 1)][1 - Wj/ΣWj]²
= 1 + (4/15){(1/33)[1 - 262.35/449.93]² + (1/30)[1 - 60.638/449.93]² + (1/35)[1 - 76.055/449.93]² + (1/31)[1 - 50.887/449.93]²}
= 1 + (4/15)(.07532) = 1.020.

Thus, F'' = 3.988 / 1.020 = 3.910. Note that this F'' is greater than our standard F. Why? Well, notice that each group's contribution to the numerator is inversely related to its variance, thus increasing the contribution of Group 1, which had a mean far from the Grand Mean and a small variance.
We are not done yet; we still need to compute the adjusted error degrees of freedom: df'' = (k² - 1) / [3 Σ [1/(nj - 1)][1 - Wj/ΣWj]²] = 15 / [3(.07532)] = 66.38. Thus, F''(3, 66) = 3.910, p = .012.
Directional Hypotheses
I have never seen published research where the authors used ANOVA and employed a
directional test, but it is possible. Suppose you were testing the following directional hypotheses:
H0: The classification variable is not related to the outcome variable in the way specified in the alternative hypothesis
H1: μ1 > μ2 > μ3
The one-tailed p value that you obtain with the traditional F test tells you the probability of getting sample means as (or more) different from one another, in any order, as were those you obtained, were the truth that the population means are identical. Were the null true, the probability of your correctly predicting the order of the differences in the sample means is 1/k!, where k is the number of groups. By application of the multiplication rule of probability, the probability of your getting sample means as different from one another as they were, and in the order you predicted, is the one-tailed p divided by k!. If k is three, you take the one-tailed p and divide it by 3 x 2 = 6, a one-sixth tailed test. I know, that sounds strange. Lots of luck convincing the reviewers of your manuscript that you actually PREdicted the order of the means. They will think that you POSTdicted them.
Fixed vs. Random vs. Mixed Effects ANOVA
As in correlation/regression analysis, the IV in ANOVA may be fixed or random. If it is fixed,
the researcher has arbitrarily (based on es opinion, judgement, or prejudice) chosen k values of the
IV. E will restrict es generalization of the results to those k values of the IV. E has defined the
population of IV values in which e is interested as consisting of only those values e actually used,
thus, e has used the entire population of IV values. For example, I give subjects 0, 1, or 3 beers and
measure reaction time. I can draw conclusions about the effects of 0, 1, or 3 beers, but not about 2
beers, 4 beers, 10 beers, etc.
With a random effects IV, one randomly obtains levels of the IV, so the actual levels used
would not be the same if you repeated the experiment. For example, I decide to study the effect of
dose of phenylpropanolamine upon reaction time. I have my computer randomly select ten dosages
from a uniform distribution of dosages from zero to 100 units of the drug. I then administer those 10
dosages to my subjects, collect the data, and do the analyses. I may generalize across the entire
range of values (doses) from which I randomly selected my 10 values, even (by interpolation or
extrapolation) to values other than the 10 I actually employed.
[Two scatter plots: Score (0 to 10) plotted against Group (0 to 5).]
In a factorial ANOVA, one with more than one IV, you may have a mixed effects ANOVA -
one where one or more IVs is fixed and one or more is random.
Statistically, our one-way ANOVA does actually have two IVs, but one is sort of hidden. The
hidden IV is SUBJECTS. Does who the subject is affect the score on the DV? Of course it does, but
we count such effects as error variance in the one-way independent samples ANOVA. Subjects is a
random effects variable, or at least we pretend it is, since we randomly selected subjects from the
population of persons (or other things) to which we wish to generalize our results. In fact, if there is
not at least one random effects IV in your research, you don't need ANOVA or any other inferential
statistic. If all of your IVs are fixed, your data represent the entire population, not a random sample
therefrom, so your descriptive statistics are parameters and you need not infer what you already
know for sure.
ANOVA as a Regression Analysis: Eta-Squared and Omega-Squared
The ANOVA is really just a special case of a regression analysis. It can be represented as a
multiple regression analysis, with one dichotomous "dummy variable" for each treatment degree of
freedom (more on this in another lesson). It can also be represented as a bivariate, curvilinear
regression.
Here is a scatter plot for our ANOVA
data. Since the numbers used to code our
groups are arbitrary (the independent
variable being qualitative), I elected to use
the number 1 for Group A, 2 for Group D, 3
for Group C and 4 for Group B. Note that I
have used blue squares to plot the points
with a frequency of three and red triangles to
plot those with a frequency of one. The blue
squares are also the group means. I have
placed on the plot the linear regression line
predicting score from group. The regression
falls far short of significance, with the
SSRegression being only 1, for an r² of 1/138 = .007.
We could improve the fit of our
regression line to the data by removing the
restriction that it be a straight line, that is, by doing
a curvilinear regression. A quadratic regression line is based on a polynomial where the independent variables are Group and Group-squared, that is, Ŷ = a + b1X + b2X², more on this when we cover trend analysis. A quadratic function allows us one bend in the curve. Here is a plot of our data with a quadratic regression line.
Eta-squared (η²) is a curvilinear correlation coefficient. To compute it, one first uses a curvilinear equation to predict values of Y|X. You then compute the SSError as the sum of squared residuals between actual Y and predicted Y, that is, SSE = Σ(Y - Ŷ)². As usual,
SSTotal = Σ(Y - GM)², where GM is the grand mean, the mean of all scores in all groups. The SSRegression is then SSTotal - SSError. Eta squared is then SSRegression / SSTotal, the proportion of the SSTotal that is due to the curvilinear association with X. For our quadratic regression (which is highly significant), SSRegression = 126, η² = .913.
We could improve the fit a bit more by going to a cubic polynomial model (which adds Group-cubed to the quadratic model, allowing a second bending of the curve). Here is our scatter plot with the cubic regression line. Note that the regression line runs through all of the group means. This will always be the case when we have used a polynomial model of order = K - 1, where K = the number of levels of our independent variable. A cubic model has order = 3, since it includes three powers of the independent variable (Group, Group-squared, and Group-cubed). The SSRegression for the cubic model is 130, η² = .942. Please note that this SSRegression is exactly the same as that we computed earlier as the ANOVA SSAmong Groups. We have demonstrated that a polynomial regression with order = K - 1 is identical to the traditional one-way ANOVA.
Take a look at my document T = ANOVA = Regression.
Strength of Effect Estimates: Proportions of Variance Explained
We can employ η² as a measure of the magnitude of the effect of our ANOVA independent variable without doing the polynomial regression. We simply find SSAmongGroups / SSTotal from our ANOVA source table. This provides a fine measure of the strength of effect of our independent variable in our sample data, but it generally overestimates the population η². My programs Conf-Interval-R2-Regr.sas and CI-R2-SPSS.zip will compute an exact confidence interval about eta-squared. For our data η² = 130/138 = .94. A 95% confidence interval for the population parameter extends from .84 to .96. It might be better to report a 90% confidence interval here, more on that soon.
One well-known alternative is omega-squared, ω², which estimates the proportion of the variance in Y in the population which is due to variance in X:

ω² = [SSAmong - (K - 1)MSError] / [SSTotal + MSError].

For our data, ω² = [130 - (3).5] / [138 + .5] = .93.
Benchmarks for η²:
- .01 (1%) is small but not trivial
- .06 is medium
- .14 is large
A Word of Caution. Rosenthal has found that most psychologists misinterpret strength of effect estimates such as r² and ω². Rosenthal (1990, American Psychologist, 45, 775-777) used an example where a treatment (a small daily dose of aspirin) lowered patients' death rate so much that the researchers stopped the research prematurely and told the participants who were in the control condition to start taking a baby aspirin every day. So, how large was the effect of the baby aspirin? As an odds ratio it was 1.83, that is, the odds of a heart attack were 1.83 times higher in the placebo group than in the aspirin group. As a proportion of variance explained the effect size was .0011 (about one tenth of one percent).
One solution that has been proposed for dealing with r²-like statistics is to report their square root instead. For the aspirin study, we would report r = .033 (but that still sounds small to me).

Also, keep in mind that anything that artificially lowers error variance, such as using homogeneous subjects and highly controlled laboratory conditions, artificially inflates r², ω², etc. Thus, under highly controlled conditions, one can obtain a very high ω² even if outside the laboratory the IV accounts for almost none of the variance in the DV. In the field those variables held constant in the lab may account for almost all of the variance in the DV.
What Confidence Coefficient Should I Employ for η² and RMSSE?
If you want the confidence interval to be equivalent to the ANOVA F test of the effect (which employs a one-tailed, upper tailed, probability) you should employ a confidence coefficient of (1 - 2α). For example, for the usual .05 criterion of statistical significance, use a 90% confidence interval, not 95%. Please see my document Confidence Intervals for Squared Effect Size Estimates in ANOVA: What Confidence Coefficient Should be Employed?
Strength of Effect Estimates: Standardized Differences Among Means
When dealing with differences between or among group means, I generally prefer strength of effect estimators that rely on the standardized difference between means (rather than proportions of variance explained). We have already seen such estimators when we studied two-group designs (Hedges' g), but how can we apply this approach when we have more than two groups?
My favorite answer to this question is that you should just report estimates of Cohen's d for those contrasts (differences between means or sets of means) that are of most interest, that is, which are most relevant to the research questions you wish to address. Of course, I am also of the opinion that we would often be better served by dispensing with the ANOVA in the first place and proceeding directly to making those contrasts of interest without doing the ANOVA.
There is, however, another interesting suggestion. We could estimate the average value of Cohen's d for the groups in our research. There are several ways we could do this. We could, for example, estimate d for every pair of means, take the absolute values of those estimates, and then average them.
James H. Steiger (2004: Psychological Methods, 9, 164-182) has proposed the use of RMSSE (root mean square standardized effect) in situations like this. Here is how the RMSSE is calculated:

RMSSE = sqrt{ [1/(k - 1)] Σ(j=1 to k) [(Mj - GM)/sqrt(MSE)]² },

where k is the number of groups, Mj is a group mean, GM is the overall (grand) mean, and the standardizer is the pooled standard deviation, the square root of the within groups mean square, MSE (note that we are assuming homogeneity of variances). Basically what we are doing here is averaging the values of (Mj - GM)/SD, having squared them first (to avoid them summing to zero), dividing by among groups degrees of freedom (k - 1) rather than k, and then taking the square root to get back to un-squared (standard deviation) units.
Since the standardizer (sqrt of MSE) is constant across groups, we can simplify the expression above to

RMSSE = sqrt{ [1/(k - 1)] [Σ(Mj - GM)²] / MSE }.
For our original set of data, the sum of the squared deviations between group means and grand mean is (2-5)² + (3-5)² + (7-5)² + (8-5)² = 26. Notice that this is simply the among groups sum of squares (130) divided by n (5). Accordingly, RMSSE = sqrt{ [1/(4 - 1)] (26/.5) } = 4.16, a Godzilla-sized average standardized difference between group means.
We can place a confidence interval about our estimate of the average standardized difference
between group means. To do so we shall need the NDC program from Steiger's page at
http://www.statpower.net/Content/NDC/NDC.exe . Download and run that exe. Ask for a 90% CI and
give the values of F and df:
Click COMPUTE.
You are given the CI for lambda, the noncentrality parameter:
Now we transform this confidence interval to a confidence interval for RMSSE with the following transformation (applied to each end of the CI):

RMSSE = sqrt{ λ / [(k - 1)n] }.

For the lower boundary, this yields sqrt[120.6998 / (3)5] = 2.837, and for the upper boundary sqrt[436.3431 / (3)5] = 5.393. That is,
our estimate of the effect size is between King Kong-sized and beyond Godzilla-sized.
Steiger noted that a test of the null hypothesis that the population RMSSE is zero is equivalent to the standard ANOVA F test if the confidence interval is constructed with 100(1 - 2α)% confidence. For example, if the ANOVA were conducted with .05 as the criterion of statistical significance, then an equivalent confidence interval for the RMSSE should be at 90% confidence -- the RMSSE cannot be negative, after all. If the 90% confidence interval includes 0, then the ANOVA F falls short of significance; if it excludes 0, then the ANOVA F is significant.
Power Analysis
One-way ANOVA power analysis is detailed in our textbook. The effect size may be specified in terms of Στj², the sum of the squared treatment effects:

φ' = sqrt[ (Στj² / k) / σ²error ].

Cohen used the symbol f for this same statistic, and considered an f of .10 to represent a small effect, .25 a medium effect, and .40 a large effect. In terms of percentage of variance explained (η²), small is 1%, medium is 6%, and large is 14%.

For example, suppose that I wish to test the null hypothesis that for GRE-Q, the population means for undergraduates intending to major in social psychology, clinical psychology, and experimental psychology are all equal. I decide that the minimum nontrivial effect size is if each mean differs from the next by 20 points (about 1/5 σ). For example, means of 480, 500, and 520. The Στj² is then 20² + 0² + 20² = 800. Next we compute φ'. Assuming that the σ is about 100, φ' = sqrt[(800/3)/10000] = 0.163.

Suppose we have 11 subjects in each group. φ = φ' x sqrt(n) = .163 x sqrt(11) = .54. Treatment df = 2, error df = 3(11 - 1) = 30. From the noncentral F table in our textbook, for φ = .50, dft = 2, dfe = 30, α = .05, β = 90%, thus power = 10%.
How many subjects would be needed to raise power to 70%? β = .30. Go to the table, assuming that you will need enough subjects so that dfe = infinity. For β = .30, φ = 1.6. Now, n = (φ²)(k)(σe²) / Στj² = (1.6)²(3)(100)² / 800 = 96. Now, 96 subjects per group would give you, practically speaking, infinite df. If N came out so low that dfe < 30, you would re-do the analysis with a downwards-adjusted dfe.
One can define an effect size in terms of η². For example, if η² = 10%, then

φ' = sqrt[ η² / (1 - η²) ] = sqrt[ .10 / (1 - .10) ] = .33.
Suppose I had 6 subjects in each of four groups. If I employed an alpha-criterion of .05, how large [in terms of % variance in the DV accounted for by variance in the IV] would the effect need be for me to have a 90% chance of rejecting the null hypothesis? From the table, for dft = 3, dfe = 20, φ = 2.0 for β = .13, and φ = 2.2 for β = .07. By linear interpolation, for β = .10, φ = 2.0 + (3/6)(.2) = 2.1. φ' = φ / sqrt(n) = 2.1 / sqrt(6) = 0.857.
η² = φ'² / (1 + φ'²) = .857² / (1 + .857²) = 0.42, a very large effect!
Do note that this method of power analysis does not ignore the effect of error df, as did the methods employed in Chapter 8. If you were doing small sample power analyses for independent t-tests, you should use the methods shown here (with k = 2), which will give the correct power figures (since F = t², t's power must be the same as F's).
Make it easy on yourself. Use G*Power to do the power analysis.
APA-Style Summary Statement
Teaching method significantly affected the students' test scores, F(3, 16) = 86.66, MSE = 0.50, p < .001, η² = .942, 95% CI [.858, .956]. As shown in Table 1, .
Copyright 2012, Karl L. Wuensch - All rights reserved.
CI-Eta2-Alpha
Confidence Intervals for Squared Effect Size Estimates in ANOVA: What
Confidence Coefficient Should be Employed?
If you want the confidence interval to be equivalent to the ANOVA F test of the
effect (which employs a one-tailed, upper tailed, probability) you should employ a
confidence coefficient of (1 - 2). For example, for the usual .05 criterion of statistical
significance, use a 90% confidence interval, not 95%. This is illustrated below.
A two-way independent samples ANOVA was conducted and produced this
output:
Dependent Variable: PulseIncrease
Sum of
Source DF Squares Mean Square F Value Pr > F
Model 3 355.95683 118.65228 3.15 0.0249
Error 380 14295.21251 37.61898
Corrected Total 383 14651.16933
R-Square Coeff Var Root MSE pulse Mean
0.024295 190.8744 6.133431 3.213333
Source DF Anova SS Mean Square F Value Pr > F
Gender 1 186.0937042 186.0937042 4.95 0.0267
Image 1 63.6027042 63.6027042 1.69 0.1943
Gender*Image 1 106.2604167 106.2604167 2.82 0.0936
Eta-square and a corresponding 95% Confidence Interval will be computed for each effect. To put a confidence interval on the η² we need to compute an adjusted F. To adjust the F we first compute an adjusted error term. For the main effect of gender,

MSE = (SSTotal - SSEffect) / (dfTotal - dfEffect) = (14651 - 186.09) / (383 - 1) = 37.867.

In effect we are putting back into the error term all of the variance accounted for by other effects in our model. Now the adjusted F(1, 382) = MSGender / MSE = 186.09 / 37.867 = 4.914.
For main effects, one can also get the adjusted F by simply doing a one way
ANOVA with only the main effect of interest in the model:
proc ANOVA data=Katie; class Gender;
model PulseIncrease = Gender;
Dependent Variable: PulseIncrease
Sum of
Source DF Squares Mean Square F Value Pr > F
Model 1 186.09370 186.09370 4.91 0.0272
Error 382 14465.07563 37.86669
Corrected Total 383 14651.16933
R-Square Coeff Var Root MSE PulseIncrease Mean
0.012702 191.5018 6.153592 3.213333
Source DF Anova SS Mean Square F Value Pr > F
Gender 1 186.0937042 186.0937042 4.91 0.0272
Now use this adjusted F with the SAS or SPSS program for putting a confidence interval on R².
DATA ETA;
*****************************************************************************
*********************************
Construct Confidence Interval for Eta-Squared
*****************************************************************************
*********************************;
F= 4.914 ;
df_num = 1 ;
df_den = 382;
ncp_lower = MAX(0,fnonct (F,df_num,df_den,.975));
ncp_upper = MAX(0,fnonct (F,df_num,df_den,.025));
eta_squared = df_num*F/(df_den + df_num*F);
eta2_lower = ncp_lower / (ncp_lower + df_num + df_den + 1);
eta2_upper = ncp_upper / (ncp_upper + df_num + df_den + 1);
output; run; proc print; var eta_squared eta2_lower eta2_upper;
title 'Confidence Interval on Eta-Squared'; run;
-------------------------------------------------------------------------------------------------
Confidence Interval on Eta-Squared
Obs    eta_squared    eta2_lower    eta2_upper
1      0.012700       0             0.043552
SASLOG
NOTE: Invalid argument to function FNONCT at line 57 column 19.
F=4.914 df_num=1 df_den=382 ncp_lower=0 ncp_upper=17.485492855 eta_squared=0.0127004968
eta2_lower=0 eta2_upper=0.0435519917 _ERROR_=1 _N_=1
NOTE: Mathematical operations could not be performed at the following places. The results of the
operations have been set to missing values.
Each place is given by: (Number of times) at (Line):(Column).
Do not be concerned about this note. You will get it every time your CI includes zero -- the iterative
procedure bumps up against the wall at value = 0.
Notice that the confidence interval includes the value 0 even though the effect of
gender is significant at the .027 level. What is going on here? I think the answer can be
found in Steiger (2004).
Example 10: Consider a test of the hypothesis that the RMSSE (as defined in Equation 12) in an ANOVA is zero. This hypothesis test is one-sided because the RMSSE cannot be negative. To use a two-sided confidence interval to test this hypothesis at the .05 significance level, one should examine the 100(1 - 2α)% = 90% confidence interval for the RMSSE. If the confidence interval excludes zero, the null hypothesis will be rejected. This hypothesis test is equivalent to the standard ANOVA F test.
Well, R² (and η²) cannot be less than zero either. Accordingly, one can argue
that when putting a CI on an ANOVA effect that has been tested with the traditional .05
criterion of significance, that CI should be a 90% CI, not a 95% CI.
ncp_lower = MAX(0,fnonct (F,df_num,df_den,.95));
ncp_upper = MAX(0,fnonct (F,df_num,df_den,.05));
------------------------------------------------------------------------------------------------
Confidence Interval on Eta-Squared
Obs    eta_squared    eta2_lower    eta2_upper
1      0.012700       .000743843    0.037453
The 90% CI does not include zero. Let us try another case. Suppose you
obtained F(2, 97) = 3.09019. The obtained value of F here is exactly equal to the critical
value of F for alpha = .05.
F= 3.09019 ;
df_num = 2 ;
df_den = 97;
ncp_lower = MAX(0,fnonct (F,df_num,df_den,.95));
ncp_upper = MAX(0,fnonct (F,df_num,df_den,.05));
------------------------------------------------------------------------------------------------
Confidence Interval on Eta-Squared
eta_ eta2_ eta2_
Obs squared lower upper
1 0.059899 2.1519E-8 0.13743
Notice that the 90% CI does exclude zero, but barely. A 95% CI would include
zero.
Reference
Steiger, J. H. (2004). Beyond the F test: Effect size confidence intervals and tests of
close fit in the analysis of variance and contrast analysis. Psychological Methods, 9,
164-182.
Karl L. Wuensch, Dept. of Psychology, East Carolina Univ., Greenville, NC USA
September, 2009
Homogeneity of Variance Tests For Two or More Groups
We covered this topic for two-group designs earlier. Basically, one transforms
the scores so that between groups variance in the scores reflects differences in
variance rather than differences in means. Then one does a t test on the transformed
scores. If there are three or more groups, simply replace the t test with an ANOVA.
See the discussion in the Engineering Statistics Handbook. Levene suggested
transforming the scores by subtracting the within-group mean from each score and then
either taking the absolute value of each deviation or squaring each deviation. Both
versions are available in SAS. Brown and Forsythe recommended using absolute
deviations from the median or from a trimmed mean. Their Monte Carlo research
indicated that the trimmed mean was the best choice when the populations were heavy
in their tails and the median was the best choice when the populations were skewed.
The Brown and Forsythe method using the median is available in SAS. It would not be
very difficult to program SAS to use the trimmed means. O'Brien's test is also available
in SAS.
I provide here SAS code to illustrate homogeneity of variance tests. The data
are the gear data from the Engineering Statistics Handbook.
options pageno=min nodate formdlim='-';
title 'Homogeneity of Variance Tests';
title2 'See http://www.itl.nist.gov/div898/handbook/eda/section3/eda35a.htm';
run;
data Levene;
input Batch N; Do I=1 to N; Input GearDiameter @@; output; end;
cards;
1 10
1.006 0.996 0.998 1.000 0.992 0.993 1.002 0.999 0.994 1.000
2 10
0.998 1.006 1.000 1.002 0.997 0.998 0.996 1.000 1.006 0.988
3 10
0.991 0.987 0.997 0.999 0.995 0.994 1.000 0.999 0.996 0.996
4 10
1.005 1.002 0.994 1.000 0.995 0.994 0.998 0.996 1.002 0.996
5 10
0.998 0.998 0.982 0.990 1.002 0.984 0.996 0.993 0.980 0.996
6 10
1.009 1.013 1.009 0.997 0.988 1.002 0.995 0.998 0.981 0.996
7 10
0.990 1.004 0.996 1.001 0.998 1.000 1.018 1.010 0.996 1.002
8 10
0.998 1.000 1.006 1.000 1.002 0.996 0.998 0.996 1.002 1.006
9 10
1.002 0.998 0.996 0.995 0.996 1.004 1.004 0.998 0.999 0.991
10 10
0.991 0.995 0.984 0.994 0.997 0.997 0.991 0.998 1.004 0.997
*****************************************************************************
;
proc GLM data=Levene; class Batch;
model GearDiameter = Batch / ss1;
means Batch / hovtest=levene hovtest=BF hovtest=obrien;
title; run;
*****************************************************************************
;
proc GLM data=Levene; class Batch;
model GearDiameter = Batch / ss1;
means Batch / hovtest=levene(type=ABS); run;
Here are parts of the statistical output, with annotations:
Levene's Test for Homogeneity of GearDiameter Variance
ANOVA of Squared Deviations from Group Means
Sum of Mean
Source DF Squares Square F Value Pr > F
Batch 9 5.755E-8 6.394E-9 2.50 0.0133
Error 90 2.3E-7 2.556E-9
With the default Levene's test (using squared deviations), the groups differ significantly
in variances.
O'Brien's Test for Homogeneity of GearDiameter Variance
ANOVA of O'Brien's Spread Variable, W = 0.5
Sum of Mean
Source DF Squares Square F Value Pr > F
Batch 9 7.105E-8 7.894E-9 2.22 0.0279
Error 90 3.205E-7 3.562E-9
Also significant with O'Brien's Test.
Brown and Forsythe's Test for Homogeneity of GearDiameter Variance
ANOVA of Absolute Deviations from Group Medians
But not significant with the Brown & Forsythe test using absolute deviations from within-
group medians.
Sum of Mean
Source DF Squares Square F Value Pr > F
Batch 9 0.000227 0.000025 1.71 0.0991
Error 90 0.00133 0.000015
-------------------------------------------------------------------------------------------------
SAS will only let you do one Levene test per invocation of PROC GLM, so I ran
GLM a second time to get the Levene test with absolute deviations. As you can see
below, the difference in variances is significant with this test.
Levene's Test for Homogeneity of GearDiameter Variance
ANOVA of Absolute Deviations from Group Means
Sum of Mean
Source DF Squares Square F Value Pr > F
Batch 9 0.000241 0.000027 2.16 0.0322
Error 90 0.00112 0.000012
The One-Way ANOVA procedure in PASW also provides a test of homogeneity
of variance, as shown below.
Test of Homogeneity of Variances
GearDiameter
Levene Statistic df1 df2 Sig.
2.159 9 90 .032
Notice that the Levene test provided by PASW is that using absolute deviations from within-group means. The Brown-Forsythe test offered as an option is not their test of equality of variances; it is a robust test of differences among means, like the Welch test.
Return to Wuensch's Statistics Lessons Page
Karl L. Wuensch
May, 2010.
Omega-Squared.doc
Dear 6430 students,
We have discussed omega-squared as a less biased (than is eta-squared)
estimate of the proportion of variance explained by the treatment variable in
the population from which our sample data could be considered to be random.
Earlier this semester we discussed a very similar statistic, r-squared, and I
warned you about how this statistic can be inflated by high levels of extraneous
variable control. The same caution applies to eta-squared and omega-squared.
Here is a comment I posted to EDSTAT-L on this topic a few years back:
------------------------------------------------------------------------------
Date: Mon, 11 Oct 93 11:27:23 EDT
From: "Karl L. Wuensch" <PSWUENSC@ecuvm1>
To: Multiple recipients of list <edstat-l@jse.stat.ncsu.edu>
Subject: Omega-squared (was P Value)
Josh, backon@vms.huji.ac.il, noted:
>We routinely run omega squared on our data. Omega squared is one of the most
>frequently applied methods in estimating the proportion of the dependent
>variable accounted for by an independent variable, and is used to confirm the
>strength of association between variables in a population. ............
Omega-squared can also be misinterpreted. If the treatment is evaluated in
circumstances (the laboratory) where the influence of extraneous variables
(other variables that influence the dependent variable) is eliminated, then the
omega-squared will be inflated relative to the proportion of the variance in the
dependent variable due to the treatment in a (real) population where those
extraneous variables are not eliminated. Thus, a treatment that really accounts
for a trivial amount of the variance in the dependent variable out there in the
real world can produce a large omega-squared when computed from data collected
in the laboratory. To a great extent both P and omega-squared measure the
extent to which the researcher has been able to eliminate "error variance" when
collecting the data (but P is also greatly influenced by sample size).
Imagine that all your subjects were clones of one another with identical
past histories. All are treated in exactly the same way, except that for half
of them you clapped your hands in their presence ten minutes before measuring
whatever the dependent variable is. Because the subjects differ only on whether
or not you clapped your hands in their presence, if such clapping has any effect
at all, no matter how small, it accounts for 100% of the variance in your
sample. If the population to which you wish to generalize your results is not
one where most extraneous variance has been eliminated, your omega-squared may
be a gross overestimate of the magnitude of the effect. Do note that this
problem is not unique to omega-squared. Were you to measure the magnitude of
the effect as being the between groups difference in means divided by the within
groups standard deviation the same potential for inflation of effect would
exist.
Karl L. Wuensch, Dept. of Psychology, East Carolina Univ.
Greenville, NC 27858-4353, phone 919-757-6800, fax 919-757-6283
Bitnet Address: PSWUENSC@ECUVM1
Internet Address: PSWUENSC@ECUVM.CIS.ECU.EDU
========================================================================
Sender: edstat-l@jse.stat.ncsu.edu
From: Joe H Ward <joeward@tenet.edu>
Karl --- good comment!! My early research days were spent in an R-squared,
Omega-squared, Factor Analysis environment. My own observations say: "BEWARE
of those correlation-type indicators!!!" --- Joe
Joe Ward 167 East Arrowhead Dr.
San Antonio, TX 78228-2402 Phone: 210-433-6575 joeward@tenet.edu
MultComp.doc
One-Way Multiple Comparisons Tests
Error Rates
The error rate per comparison, α_pc, is the probability of making a Type I error on a
single comparison, assuming the null hypothesis is true.
The error rate per experiment, α_PE, is the expected number of Type I errors made
when making c comparisons, assuming that each of the null hypotheses is true. It is
equal to the sum of the per comparison alphas. If the per comparison alphas are
constant, then α_PE = c·α_pc.
The familywise error rate, α_fw, is the probability of making one or more Type I
errors in a family of c comparisons, assuming that each of the c null hypotheses is true.
If the comparisons are independent of one another (orthogonal), then
α_fw = 1 - (1 - α_pc)^c. For our example problem, evaluating four different teaching
methods, if we were to compare each treatment mean with each other treatment mean,
c would equal 6. If we were to assume those 6 comparisons to be independent of each
other (they are not), then α_fw = 1 - .95^6 = .26.
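The handout works these numbers by hand; as an illustrative check only, the familywise formula can be sketched in a few lines of Python:

```python
# Familywise error rate for c independent comparisons,
# each tested at per-comparison alpha: 1 - (1 - alpha)^c.
def familywise_alpha(alpha_pc, c):
    return 1 - (1 - alpha_pc) ** c

# Six pairwise comparisons among four means, each at alpha = .05:
print(round(familywise_alpha(0.05, 6), 2))  # 0.26, as in the text
```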
Multiple t tests
One could just use multiple t-tests to make each comparison desired, but one runs
the risk of greatly inflating the familywise error rate (the probability of making one or
more Type I errors in a family of c comparisons) when doing so. One may use a series
of protected t-tests in this situation. This procedure requires that one first do an
omnibus ANOVA involving all k groups. If the omnibus ANOVA is not significant, one
stops and no additional comparisons are done. If that ANOVA is significant, one makes
all the comparisons one wishes using t-tests. If you have equal sample sizes and
homogeneity of variance, you can use t = (M_i - M_j) / √(2·MSE / n), which pools the
error variance across all k groups, giving you N - k degrees of freedom. If you have
homogeneity of variance but unequal n's, use t = (M_i - M_j) / √(MSE(1/n_i + 1/n_j)).
MSE is the error mean square from the omnibus ANOVA. If you had heterogeneous
variances, you would need to compute separate variances t-tests, with adjusted df.
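The pooled-error t is simple arithmetic; here is an illustrative Python sketch (the handout itself works in SAS or by hand), using the teaching-methods example values MSE = .5 and n = 5 per group:

```python
from math import sqrt

def protected_t(mean_i, mean_j, mse, n_i, n_j):
    # Pooled-error t for one pairwise comparison following a significant
    # omnibus ANOVA (Fisher's LSD); MSE comes from the omnibus ANOVA,
    # so the df are N - k.
    return (mean_i - mean_j) / sqrt(mse * (1 / n_i + 1 / n_j))

# Group C (M = 7) versus Group A (M = 2):
print(round(protected_t(7, 2, 0.5, 5, 5), 2))  # 11.18, matching the text
```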
The procedure just discussed (protected t-tests) is commonly referred to as Fisher's
LSD test. LSD stands for Least Significant Difference. If you were making
comparisons for several pairs of means, and n was the same in each sample, you
could compute the smallest difference between a pair of means that would be significant.
The sum of squares for a contrast is SS_ψ = ψ̂² / Σ(c_j² / n_j), which, with equal n's,
simplifies to SS_ψ = n·ψ̂² / Σc_j². Each
contrast will have only one treatment df, so the contrast MS is the same as the contrast
SS. To get an F for the contrast just divide it by an appropriate MSE (usually that which
would be obtained were one to do an omnibus ANOVA on all k treatment groups).
For our example problem, suppose we want to compare combined groups C and D
with combined groups A and B. The A, B, C, D means are 2, 3, 7, 8, and the
coefficients are -.5, -.5, +.5, +.5. The contrast ψ̂ = -.5(2) - .5(3) + .5(7) + .5(8) = 5.
Note that the value of the contrast is quite simply the difference between the mean of
combined groups C and D (7.5) and the mean of combined groups A and B (2.5).
SS_ψ = 5(5²) / (.25 + .25 + .25 + .25) = 5(25) / 1 = 125.
A confidence interval for the contrast is ψ̂ ± t_crit·s_ψ, where
s_ψ = √(MSE·Σ(c_j² / n_j)). With equal sample sizes, this simplifies to
s_ψ = √(MSE·Σc_j² / n).
When one is constructing multiple confidence intervals, one can use Bonferroni to
adjust the per contrast alpha. Such intervals have been called simultaneous or joint
confidence intervals. For the contrast above, s_ψ = √(.5 / 5) = .3162. With no
adjustment of the per-comparison alpha, and df = 16, a 95% confidence interval is
5 ± 2.12(.3162), which extends from 4.33 to 5.67.
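The contrast value, its standard error, and the unadjusted interval can be checked with a short Python sketch (illustrative only; not part of the handout's SAS workflow):

```python
from math import sqrt

# AB-vs-CD contrast: means 2, 3, 7, 8; MSE = .5; n = 5 per group;
# two-tailed t critical on 16 df is 2.12.
means = [2, 3, 7, 8]
coefs = [-0.5, -0.5, 0.5, 0.5]
n, mse, t_crit = 5, 0.5, 2.12

psi = sum(c * m for c, m in zip(coefs, means))    # contrast value
se = sqrt(mse * sum(c ** 2 for c in coefs) / n)   # standard error of contrast
lo, hi = psi - t_crit * se, psi + t_crit * se
print(round(psi, 2), round(se, 4), round(lo, 2), round(hi, 2))
# 5.0 0.3162 4.33 5.67
```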
A population standardized contrast, d = ψ / σ, can be estimated by ψ̂ / s,
where s is the standard deviation of just one of the groups being compared (Glass's Δ),
the pooled standard deviation of the two groups being compared (Hedges' g), or the
pooled standard deviation of all of the groups (the square root of the MSE). For the
contrast above, g = 5 / √.5 = 7.07, a whopper effect.
SAS and other statistical software can be used to obtain the F for a specified
contrast. Having obtained a contrast F from your computer program, you can compute
g = √(F·Σ(c_j² / n_j)). For our contrast, g = √(250(.25 + .25 + .25 + .25) / 5) = 7.07.
An approximate confidence interval for a standardized contrast d can be
computed simply by taking the confidence interval for the contrast and dividing its
endpoints by the pooled standard deviation (square root of MSE). In this case the
confidence interval amounts to g ± t_crit·s_g, where s_g = √(Σ(c_j² / n_j)). For our
contrast, s_g = √((.25 + .25 + .25 + .25) / 5) = .447, and a 95% confidence interval is
7.07 ± 2.12(.447), running from 6.12 to 8.02. More simply, we take the unstandardized
confidence interval, which runs from 4.33 to 5.67, and divide each end by the standard
deviation (.707) and obtain 6.12 to 8.02.
At http://www.psy.unsw.edu.au/research/PSY.htm one can obtain PSY: A
Program for Contrast Analysis, by Kevin Bird, Dusan Hadzi-Pavlovic, and Andrew
Isaac. This program computes unstandardized and approximate standardized
confidence intervals for contrasts with between-subjects and/or within-subjects factors.
It will also compute simultaneous confidence intervals. Contrast coefficients are
provided as integers, and the program converts them to standard weights. For an
example of the use of the PSY program, see my document PSY: A Program for
Contrast Analysis.
An exact confidence interval for a standardized contrast involving
independent samples can be computed with my SAS program Conf_Interval-
Contrast.sas. Enter the contrast t (the square root of the contrast F, 15.81 for our
contrast), the df (16), the sample sizes (5, 5, 5, 5), and the standard contrast
coefficients (-.5, -.5, .5, .5) and run the program. You obtain a confidence interval that
extends from 4.48 to 9.64. Notice that this confidence interval is considerably wider
than that obtained by the earlier approximation.
One can also use η² or partial η² as a measure of the strength of a contrast,
and use my program Conf-Interval-R2-Regr.sas to construct a CI. For η², simply take
the SS_contrast and divide by the SS_Total. For our contrast, that yields
η² = 125/138 = .9058. To get the confidence interval for η² we need to compute a
modified contrast F, adding to the error term all variance not included in the contrast
and all degrees of freedom not included in the contrast.
Source SS df MS F
Teaching Method 130 3 43.33 86.66
AB vs. CD 125 1 125 250
Error 8+5=13 16+2=18 13/18=.722 173.077
Total 138 19
F(1, 18) = SS_contrast / [(SS_Total - SS_contrast) / (df_Total - df_contrast)]
= 125 / [(138 - 125) / (19 - 1)] = 125 / (13/18) = 173.077.
Feed that F and df to my SAS program and you obtain an η² of .9058 with a
confidence interval that extends from .78 to .94.
Alternatively, one can compute a partial η² as
SS_Contrast / (SS_Contrast + SS_Error) = 125 / (125 + 8) = .93985. Notice that this
excludes from the denominator all variance that is explained by differences among the
groups that are not captured by the tested contrast.
Source SS df MS F
Teaching Method 130 3 43.33 86.66
AB vs. CD 125 1 125 250
Error 8 16 0.50
Total 138 19
For partial η², enter the contrast F(1, 16) = 250 into my program and you obtain
partial η² = .93985 with a confidence interval extending from .85 to .96.
Orthogonal Contrasts
One can construct k - 1 orthogonal (independent) contrasts involving k means. If I
consider a_j to represent the contrast coefficients applied for one contrast and b_j those
for another, for the contrasts to be orthogonal it must be true that Σ(a_j·b_j / n_j) = 0.
If you have equal sample sizes, this simplifies to Σa_j·b_j = 0. Consider the following
set of contrast coefficients involving groups A, B, C, D, and E and equal sample sizes.
A B C D E
+.5 +.5 -1/3 -1/3 -1/3
+1 -1 0 0 0
0 0 +1 -.5 -.5
0 0 0 +1 -1
If we computed a SS for each of these contrasts and summed those SS, the sum
would equal the treatment SS which would be obtained in an omnibus ANOVA on the k
groups. This is beautiful, but not necessarily practical. The comparisons you make
should be meaningful, whether or not they form an orthogonal set.
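Checking a set of coefficients for mutual orthogonality is mechanical; here is an illustrative Python sketch applied to the four contrasts above (equal n assumed):

```python
from itertools import combinations

def orthogonal(a, b, ns=None, tol=1e-9):
    # Two contrasts are orthogonal when sum(a_j * b_j / n_j) = 0;
    # with equal n this reduces to sum(a_j * b_j) = 0.
    if ns is None:
        ns = [1] * len(a)
    return abs(sum(x * y / n for x, y, n in zip(a, b, ns))) < tol

sets = [[0.5, 0.5, -1/3, -1/3, -1/3],
        [1, -1, 0, 0, 0],
        [0, 0, 1, -0.5, -0.5],
        [0, 0, 0, 1, -1]]
print(all(orthogonal(a, b) for a, b in combinations(sets, 2)))  # True
```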
Studentized Range Procedures
There are a number of procedures available to make a posteriori (post hoc,
unplanned) multiple comparisons. When one will compare each group mean with each
other group mean, k(k - 1)/2 comparisons, one widely used procedure is the Student-
Newman-Keuls procedure. As is generally the case, this procedure adjusts downwards
the per comparison alpha to keep the alpha familywise at a specified value. It is a
layer technique, adjusting alpha downwards more when comparing extremely different
means than when comparing closer means, thus correcting for the tendency to
capitalize on chance by comparing extreme means, yet making it somewhat easier
(compared to non-layer techniques) to get significance when comparing less extreme
means.
To conduct a Student-Newman-Keuls (SNK) analysis:
a. Put the means in ascending order of magnitude.
b. r is the number of means spanned by a given comparison.
c. Start with the most extreme means (the lowest vs. the highest), where r = k.
d. Compute q with this formula: q = (M_i - M_j) / √(MSE / n), assuming equal
sample sizes and homogeneity of variance. MSE is the error mean square from an
overall ANOVA on the k groups. Do note that the SNK, and multiple comparison tests
in general, were developed as an alternative to the omnibus ANOVA. One is not
required to do the ANOVA first, and if one does do the ANOVA first it does not need
to be significant for one to do the SNK or most other multiple comparison procedures.
e. If the computed q equals or exceeds the tabled critical value for the studentized
range statistic, q(r, df), the two means compared are significantly different; you move
on to step g. The df is the df for MSE.
f. If q was not significant, stop and, if you have done an omnibus ANOVA and it
was significant, conclude that only the extreme means differ from one another.
g. If the q on the outermost layer is significant, next test the two pairs of means
spanning (k - 1) means. Note that r, and, thus the critical value of q, has decreased.
From this point on, underline all pairs not significantly different from one another, and
do not test any other pairs whose means are both underlined by the same line.
h. If there remain any pairs to test, move down to the next layer, etc. etc.
i. Any means not underlined by the same line are significantly different from one
another.
For our sample problem, which was presented in the previous handout, One-Way
Independent Samples Analysis of Variance, with alpha at .01:
1.
Group A B C D
Mean 2 3 7 8
2. n = 5; denominator = √(MSE / n) = √(.5 / 5) = 0.316
3. A vs D: r = 4, df = 16, q = (8 - 2) / .316 = 18.99, p < .01
4. r = 3
a. A vs C: q = (7 - 2) / .316 = 15.82, p < .01
b. B vs D: q = (8 - 3) / .316 = 15.82, p < .01
5. r = 2, q.01 = 4.13
a. A vs B: q = (3 - 2) / .316 = 3.16, p > .01
b. B vs C: q = (7 - 3) / .316 = 12.66, p < .01
c. C vs D: q = (8 - 7) / .316 = 3.16, p > .01
6.
Group A B C D
Mean 2 3 7 8
Means for A and B share an underline, as do means for C and D; means not
sharing an underline differ significantly.
What if there are unequal sample sizes? One solution is to use the harmonic
mean sample size computed across all k groups. That is, ñ = k / Σ(1/n_j). Another
solution is to compute for each comparison made the harmonic mean sample size of
the two groups involved in that comparison, that is, ñ = 2 / (1/n_i + 1/n_j). With the
first solution the effect of n (bigger n, more power) is spread out across groups. With
the latter solution comparisons involving groups with larger sample sizes will have
more power than those with smaller sample sizes.
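Both harmonic-mean solutions reduce to the same function; an illustrative Python sketch (the sample sizes 10, 20, 30, 40 below are hypothetical, not from the handout's example):

```python
def harmonic_n(*ns):
    # Harmonic mean sample size: the number of groups divided by the
    # sum of the reciprocal n's.
    return len(ns) / sum(1 / n for n in ns)

# Across all k groups, or for just the two groups in one comparison
# (hypothetical n's of 10, 20, 30, 40):
print(round(harmonic_n(10, 20, 30, 40), 2), round(harmonic_n(10, 40), 2))
# 19.2 16.0
```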
If you have disparate variances, you should compute a q that is very similar to the
separate variances t-test earlier studied. The formula is:
q = (M_i - M_j) / √((s_i²/n_i + s_j²/n_j) / 2) = t·√2,
where t is the familiar separate variances t-test. This procedure is known as the
Games and Howell procedure.
When using this unpooled variances q one should also adjust the degrees of
freedom downwards exactly as done with Satterthwaite's solution previously discussed.
Consult our textbook for details on how the SNK can have a familywise alpha that is
greatly inflated if the omnibus null hypothesis is only partly true.
Relationship Between q And Other Test Statistics
The studentized range statistic is closely related to t and to F. If one computes the
pooled-across-k-groups t, as done with Fisher's LSD, then q = t·√2. If one computes
an F from a planned comparison, then q = √(2F). For example, for the A vs C
comparison with our sample problem: t = (7 - 2) / √(2(.5)/5) = 5 / .447 = 11.18, and
q = 11.18·√2 = 15.82.
8
The contrast coefficients to compare A with C would be .5, 0, -.5, 0.
The contrast SS_ψ = n(Σa_j·M_j)² / Σa_j²
= 5(.5(2) + 0(3) - .5(7) + 0(8))² / (.25 + 0 + .25 + 0) = 5(6.25) / .5 = 62.5.
F = 62.5 / .5 = 125, and q = √(2(125)) = 15.82.
Tukey's (a) Honestly Significant Difference Test
This test is applied in exactly the same way that the Student-Newman-Keuls is, with
the exception that r is set at k for all comparisons. This test is more conservative (less
powerful) than the Student-Newman-Keuls.
Tukey's (b) Wholly Significant Difference Test
This test is a compromise between the Tukey (a) and the Newman-Keuls. For each
comparison, the critical q is set at the mean of the critical q were a Tukey (a) being
done and the critical q were a Newman-Keuls being done.
Ryan's Procedure (REGWQ)
This procedure, the Ryan / Einot and Gabriel / Welsch procedure, is based on the q
statistic, but adjusts the per comparison alpha in such a way (Howell provides details in
our text book) that the familywise error rate is maintained at the specified value (unlike
with the SNK) but power will be greater than with the Tukey (a). I recommend its use
with four or more groups. With three groups the REGWQ is identical to the SNK, and,
as you know, I recommend Fisher's procedure when you have three groups. With four
or more groups I recommend the REGWQ, but you can't do it by hand; you need a
computer (SAS and SPSS will do it).
Other Procedures
Dunn's Test (The Bonferroni t)
Since the familywise error rate is always less than or equal to the error rate per
experiment, α_fw ≤ c·α_pc, an inequality known as the Bonferroni inequality, one can be
sure that alpha familywise does not exceed some desired maximum value by using an
adjusted alpha per comparison that equals the desired maximum alpha familywise
divided by c, that is, α_pc = α_fw / c.
g = (M_i - M_j) / s_pooled, a very large effect.
An easier way to get the pooled standard deviation is to conduct an ANOVA
relating the test variable to the grouping variable. Here is SPSS output from such an
analysis:
Basic Concepts
We have already studied the one-way independent-samples ANOVA, which is used
when we have one categorical independent variable and one continuous dependent
variable. Research designs with more than one independent variable are much more
interesting than those with only one independent variable. When we have two categorical
independent variables (with nonexperimental research, these are better referred to as
factors, predictors, grouping variables, or classification variables), and one continuous
dependent variable (with nonexperimental research these are better referred to as criterion
variables, outcome variables, or response variables), with all combinations of levels of the first
independent variable with levels of the second independent variable (or factor) consider the
following design: we have measures of drunkenness for each of four groupsparticipants
given neither alcohol nor a barbiturate, participants given a vodka screwdriver but no
barbiturate, participants given a barbiturate tablet but not alcohol, and participants given both
alcohol and the barbiturate. We have a 2 x 2 factorial design, factor A being dose of alcohol
and factor B being dose of barbiturate. Suppose that our participants were some green alien
creatures that showed up at our party last week, and that we obtained the following means:
Alcohol
Barbiturate none one marginal
none 00 10 05
one 20 30 25
marginal 10 20 15
The 2 x 2 = 4 group means (0, 10, 20, 30) are called cell means. I can average cell
means to obtain marginal means, which reflect the effect of one factor ignoring the other
factor. For example, for factor A, ignoring B, participants who drank no alcohol averaged
(0 + 20) / 2 = 10 on our drunkenness scale; those who did drink averaged (10 + 30) / 2 = 20.
From such marginal means one can compute the main effect of a factor, its effect ignoring the
other factor. For factor A, that main effect is (20 - 10) = 10: participants who drank alcohol
averaged 10 units more drunk than those who didn't. For factor B, the main effect is
(25 - 5) = 20: the barbiturate tablet produced 20 units of drunkenness, on the average.
A simple main effect is the effect of one factor at a specified level of the other factor.
For example, the simple main effect of the vodka screwdriver for participants who took no
barbiturate is (10 - 0) = 10. For participants who did take a barbiturate, taking alcohol also
made them (30 - 20) = 10 units more drunk. In this case, the simple main effect of A at level 1
The dependent variable is what I have called the Executive Male Attitude. High
scores on this variable indicate that the respondent endorses statements such as
husbands should make all of the important decisions; a wife should do whatever her
husband wants; and only the husband should decide about major purchases.
------------------------------------------------------------------------------------------------------------
Additive Effect of Gender and Culture on Executive Male Attitude
The first graph illustrates hypothetical results in which men endorse this attitude more
than do women and in which the attitude is stronger in one culture than in the other.
Notice that there are main effects of both gender and culture, but no interaction.
------------------------------------------------------------------------------------------------------------
Interactive Effect of Gender and Culture on Executive Male Attitude
Faculty in ECU's Department of Psychology (Rosina Chia, John Childers, and
myself) have actually conducted research on this topic. Here is a graph illustrating the
actual effects found when comparing students here at ECU with university students in
Taiwan. Note the striking interaction -- while our male students are more likely to
endorse this attitude than are our female students, in Taiwan it is the female students
who strongly endorse the traditional Executive Male Attitude!
We shall test the hypotheses in factorial ANOVA in essentially the same way we
tested the one hypothesis in a one-way ANOVA. I shall assume that our samples are
strictly independent, not correlated with each other. The total sum of squares in the
dependent/outcome variable will be partitioned into two sources, a Cells or Model SS
[our model is Y = effect of level of A + effect of level of B + effect of interaction + error +
grand mean] and an Error SS.
The SS_Cells reflects the effect of the combination of the two grouping variables.
The SS_error reflects the effects of all else. If the cell sample sizes are all equal, then we
can simply partition the SS_Cells into three orthogonal (independent, nonoverlapping)
parts: SS_A, the effect of grouping variable A ignoring grouping variable B; SS_B, the
effect of B ignoring A; and SS_AxB, the interaction.
If the cell sample sizes are not equal, the design is nonorthogonal, that is, the
factors are correlated with one another, and the three effects (A, B, and A x B) overlap
with one another. Such overlap is a thorny problem which we shall avoid for now by
insisting on having equal cell n's. If you have unequal cell n's you may consider
randomly discarding a few scores to obtain equal n's or you may need to learn about
the statistical procedures available for dealing with such nonorthogonal designs.
Suppose that we are investigating the effects of gender and smoking history
upon the ability to smell an odourant thought to be involved in sexual responsiveness.
One grouping variable is the participant's gender, male or female. The other grouping
variable is the participant's smoking history: never smoked, smoked 2 packs a day for
10 years but have now stopped, stopped less than 1 month ago, between 1 month and
2 years ago, between 2 and 7 years ago, or between 7 and 12 years ago. Suppose we
have 10 participants in each cell, and we obtain the following cell and marginal totals
(with some means in parentheses):
SMOKING HISTORY
GENDER never < 1m 1 m - 2 y 2 y - 7 y 7 y - 12 y marginal
Male 300 200 220 250 280 1,250 (25)
Female 600 300 350 450 500 2,200 (44)
marginal 900 (45) 500 (25) 570 (28.5) 700 (35) 780 (39) 3,450
SS_Gender = (1,250² + 2,200²) / 50 - 3,450² / 100 = 128,050 - 119,025 = 9,025
SS_Smoke = (900² + 500² + 570² + 700² + 780²) / 20 - 119,025 = 124,165 - 119,025 = 5,140
Since the SS_Cells reflects the combined effect of Gender and Smoking History,
both their main effects and their interaction, we can compute the SS_interaction as a
residual: SS_interaction = SS_Cells - SS_Gender - SS_Smoke
= 15,405 - 9,025 - 5,140 = 1,240.
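The whole partitioning follows from the cell totals in the table above; an illustrative Python sketch of the arithmetic (the handout works it by hand):

```python
# Sums of squares from cell totals, 10 scores per cell (smell study).
male = [300, 200, 220, 250, 280]      # cell totals for men, by smoking level
female = [600, 300, 350, 450, 500]    # cell totals for women
n, N = 10, 100
grand = sum(male) + sum(female)       # 3,450
cm = grand ** 2 / N                   # correction for the mean, 119,025

ss_cells = sum(t ** 2 for t in male + female) / n - cm
ss_gender = (sum(male) ** 2 + sum(female) ** 2) / 50 - cm
ss_smoke = sum((m + f) ** 2 for m, f in zip(male, female)) / 20 - cm
ss_interaction = ss_cells - ss_gender - ss_smoke
print(ss_cells, ss_gender, ss_smoke, ss_interaction)
# 15405.0 9025.0 5140.0 1240.0
```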
As in the one-way design, total df = N - 1, and main effects df = number of levels
minus one, that is, (a - 1) and (b - 1). Interaction df is the product of main effect df,
(a - 1)(b - 1). Error df = (a)(b)(n - 1).
Mean squares are SS / df, and F ratios are obtained by dividing effect mean squares
by the error MS. Results are summarized in this source table:
Source SS df MS F p
A-gender 9025 1 9025 75.84 <.001
B-smoking history 5140 4 1285 10.80 <.001
AxB interaction 1240 4 310 2.61 .041
Error 10,710 90 119
Total 26,115 99
Analysis of Simple Main Effects
The finding of a significant interaction is often followed by testing of the simple
main effects of one factor at each level of the other. Let us first compare the two
genders at each level of smoking history, following our general rule for the
computation of effect sums of squares (note that each simple effect has its own CM):
Simple Main Effect of Gender at Each Level of Smoking History
SS_Gender for never smokers = (300² + 600²) / 10 - 900² / 20 = 4,500
SS_Gender, stopped < 1 m = (200² + 300²) / 10 - 500² / 20 = 500
SS_Gender, stopped 1 m - 2 y = (220² + 350²) / 10 - 570² / 20 = 845
SS_Gender, stopped 2 y - 7 y = (250² + 450²) / 10 - 700² / 20 = 2,000
SS_Gender, stopped 7 y - 12 y = (280² + 500²) / 10 - 780² / 20 = 2,420
Please note that the sum of the simple main effects SS for A (across levels of B)
will always equal the sum of SS_A and the SS_interaction: 4,500 + 500 + 845 + 2,000 + 2,420
= 10,265 = 9,025 + 1,240. Since gender is a two-level factor, each of these simple main
effects has 1 df, so MS = SS. For each we compute an F on 1, 90 df by dividing by the
MSE from the overall ANOVA:
Smoking History
never < 1m 1 m - 2 y 2 y - 7 y 7 y - 12 y
F(1, 90) 37.82 4.20 7.10 16.81 20.34
p <.001 .043 .009 <.001 <.001
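The simple-effect SS and F values can be reproduced from the cell totals; an illustrative Python sketch:

```python
# Simple main effect of Gender at each level of smoking history, tested
# against the pooled error term (MSE = 119, 90 df) from the overall ANOVA.
male = [300, 200, 220, 250, 280]      # cell totals, 10 scores per cell
female = [600, 300, 350, 450, 500]
mse, n = 119, 10

results = []
for m, f in zip(male, female):
    ss = (m ** 2 + f ** 2) / n - (m + f) ** 2 / (2 * n)  # SS = MS (1 df)
    results.append((round(ss), round(ss / mse, 2)))      # (SS, F)
print(results)
# [(4500, 37.82), (500, 4.2), (845, 7.1), (2000, 16.81), (2420, 20.34)]
```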
Some recommend using a MSE computed using only the scores involved in the
simple main effect being tested, that is, using individual error terms for each simple
main effect. This is especially recommended when the assumption of homogeneity of
variance is suspect. When I contrived these data I did so in such a way that there is
absolute homogeneity of variance: each cell has a variance of 119, so I used the MSE
from the overall ANOVA, the pooled error term.
The results indicate that the gender difference is significant at the .05 level at
every level of smoking history, but the difference is clearly greater at levels 1, 4, and 5
(those who have never smoked or quit over 2 years ago) than at levels 2 and 3 (recent
smokers).
Simple Main Effect of Smoking History at Each Level of Gender
Is smoking history significantly related to olfactory ability within each gender? Let
us test the simple main effects of smoking history at each level of gender:
SS_Smoking history for men = (300² + 200² + 220² + 250² + 280²) / 10 - 1,250² / 50 = 680
SS_Smoking history for women = (600² + 300² + 350² + 450² + 500²) / 10 - 2,200² / 50 = 5,700
Note that SS_B at A1 + SS_B at A2 = 680 + 5,700 = 6,380 = SS_B + SS_AxB
= 5,140 + 1,240. Since B has 5 levels, each of these simple main effects has 4 df, so the mean
squares are 680 / 4 = 170 for men, 5,700 / 4 = 1,425 for women. Smoking history has a
significant simple main effect for women, F(4, 90) = 11.97, p < .001, but not for men,
F(4, 90) = 1.43, p = .23.
Multiple comparisons
Since smoking history had a significant simple main effect for the women, we
might want to make some comparisons involving the five means in that simple main
effect. Rather than make all possible (10) pairwise comparisons, I elect to make only 4
comparisons: the never smoked group versus each other group. Although there is a
special procedure for the case where one (control) group is compared to each other
group, the Dunnett test, I shall use the Bonferroni t test instead. Holding familywise
alpha at .05 or less, my criterion to reject each null hypothesis becomes
α_pc = .05 / c = .05 / 4 = .0125. I will need the help of SAS to get the exact p values.
Here is a little SAS program that will obtain the p values for the t scores below:
options formdlim='-' pageno=min nodate;
data p;
T1 = 2*PROBT(-6.149, 90); T2 = 2*PROBT(-5.125, 90); T3 = 2*PROBT(-3.075, 90);
T4 = 2*PROBT(-2.050, 90);
proc print; run;
Notice that I entered each t score as a negative value, and then gave the df. Since
PROBT returns a one-tailed p, I multiplied by 2. The output from SAS is:
Obs T1 T2 T3 T4
1 2.1036E-8 .000001688 .002787218 0.043274
The denominator for each t will be √(119(1/10 + 1/10)) = 4.8785. The computed t
scores and p values are then:
Never Smoked vs Quit t(90) = (M_i - M_j) / 4.8785 p Significant?
< 1 m (60 - 30) / 4.8785 = 6.149 < .001 yes
1 m - 2 y (60 - 35) / 4.8785 = 5.125 < .001 yes
2 y - 7 y (60 - 45) / 4.8785 = 3.075 .0028 yes
7 y - 12 y (60 - 50) / 4.8785 = 2.050 .0433 no
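The t statistics themselves (though not the exact p values, which the handout gets from SAS's PROBT) can be checked with an illustrative Python sketch; each |t| is then judged against the Bonferroni criterion α_pc = .0125:

```python
from math import sqrt

# t for never-smoked women (M = 60) versus each ex-smoker group,
# using the pooled MSE = 119 and n = 10 per cell.
mse, n = 119, 10
se = sqrt(mse * (1 / n + 1 / n))     # common denominator, 4.8785
control = 60
others = [30, 35, 45, 50]            # means for < 1m, 1m-2y, 2y-7y, 7y-12y
ts = [round((control - m) / se, 2) for m in others]
print(round(se, 4), ts)
# 4.8785 [6.15, 5.12, 3.07, 2.05]
```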
As you can see, female ex-smokers' olfactory ability was significantly less than
that of women who never smoked for every group except the group that had stopped
smoking 7 to 12 years ago.
If the interaction were not significant (and sometimes, even if it were) we would
likely want to make multiple comparisons involving the marginal means of significant
factors with more than two levels. Let us do so for smoking history, again using the
Bonferroni t-test. I should note that, in actual practice, I would probably use the
REGWQ test. Since I shall be making ten comparisons, my adjusted per comparison
alpha will be, for a maximum familywise error rate of .05, .05/10 = .005. Again, I rely on
SAS to obtain the exact p values.
options formdlim='-' pageno=min nodate;
data p;
T12 = 2*PROBT(-5.80, 90); T13 = 2*PROBT(-4.78, 90); T14 = 2*PROBT(-2.90, 90);
T15 = 2*PROBT(-1.74, 90); T23 = 2*PROBT(-1.01, 90); T24 = 2*PROBT(-2.90, 90);
T25 = 2*PROBT(-4.06, 90); T34 = 2*PROBT(-1.88, 90); T35 = 2*PROBT(-3.04, 90);
T45 = 2*PROBT(-1.16, 90);
proc print; run;
Obs T12 T13 T14 T15 T23
1 9.7316E-8 .000006793 .004688757 0.085277 0.31520
Obs T24 T25 T34 T35 T45
1 .004688757 .000104499 0.063343 .003097811 0.24912
Level i vs j t Significant?
1 vs 2 (45 - 25) / √(119(1/20 + 1/20)) = 20 / 3.4496 = 5.80 yes
1 vs 3 (45 - 28.5) / 3.4496 = 4.78 yes
1 vs 4 (45 - 35) / 3.4496 = 2.90 yes
1 vs 5 (45 - 39) / 3.4496 = 1.74 no
2 vs 3 (28.5 - 25) / 3.4496 = 1.01 no
2 vs 4 (35 - 25) / 3.4496 = 2.90 yes
2 vs 5 (39 - 25) / 3.4496 = 4.06 yes
3 vs 4 (35 - 28.5) / 3.4496 = 1.88 no
3 vs 5 (39 - 28.5) / 3.4496 = 3.04 yes
4 vs 5 (39 - 35) / 3.4496 = 1.16 no
Note that the n's are 20 because 20 scores went into each mean.
Smoking History < 1 m 1 m - 2 y 2 y - 7 y 7 y - 12 y never
Mean 25^A 28.5^AB 35^BC 39^CD 45^D
Note. Means sharing a superscript are not significantly different from one another.
Contrasts in Factorial ANOVA
One can create contrasts in factorial ANOVA just as in one-way ANOVA. For
example, in a 2 x 2 ANOVA one contrast is that known as the main effect of the one
factor, another contrast is that known as the main effect of the other factor, and a third
contrast is that known as the interaction between the two factors. For effects (main or
interaction) with more than one df, the effect can be broken down into a set of
orthogonal one df contrasts.
The coefficients for an interaction contrast must be doubly centered in the sense
that the coefficients must sum to zero in every row and every column of the a x b matrix.
For example, consider a 2 x 2 ANOVA. The interaction has only one df, so there is only
one contrast available.
Coefficients Means
B1 B2 B1 B2
A1 1 -1 M11 M12
A2 -1 1 M21 M22
This contrast is M11 - M12 - M21 + M22. From one perspective, this contrast is the
combined cells on one diagonal (M11 + M22) versus the combined cells on the other
diagonal (M21 + M12). From another perspective, it is (M11 - M12) - (M21 - M22), that is,
the simple main effect of B at A1 versus the simple main effect of B at A2. From another
perspective it is (M11 - M21) - (M12 - M22), that is, the simple main effect of A at B1
versus the simple main effect of A at B2. All of this is illustrated in my program ANOVA-
Interact2x2.sas.
Now consider a 2 x 3 design. The interaction has two df and can be broken
down into two orthogonal interaction contrasts. For example, consider the contrast
coefficients in the table below:
        A x B, 12 vs 3            A x B, 1 vs 2
        B1     B2     B3          B1     B2     B3
A1       1      1     -2           1     -1      0
A2      -1     -1      2          -1      1      0
The contrast on the left side of the table compares the simple main effect of A at combined levels 1 and 2 of B with the simple main effect of A at level 3 of B. From another perspective, it compares the simple main effect of (combined B1 and B2) versus B3 at A1 with that same effect at A2. Put another way, it is the A x B interaction with levels 1 and 2 of B combined.
The contrast on the right side of the table compares the simple main effect of A at level 1 of B with the simple main effect of A at level 2 of B. From another perspective, it compares the simple main effect of B1 versus B2 (excluding level 3 of B) at A1 with that same effect at A2. Put another way, it is the A x B interaction with level 3 of B excluded.
If we had reason to want the coefficients on the left side of the table above to be
a standard set of weights, we would divide each by 2.
        A x B, 12 vs 3
        B1      B2      B3
A1      .5      .5      -1
A2     -.5     -.5       1
My program ANOVA-Interact2x3.sas illustrates the computation of these
interaction contrasts and more.
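The doubly-centered requirement and the application of an interaction contrast to cell means can be sketched in Python (the handout's own ANOVA-Interact2x3.sas does this in SAS). The cell means below are made up for illustration only.

```python
def contrast_value(coeffs, means):
    """Apply an a x b matrix of contrast coefficients to cell means.

    Verifies that the coefficients are doubly centered: they must sum
    to zero in every row and in every column.
    """
    for row in coeffs:
        assert abs(sum(row)) < 1e-9, "row does not sum to zero"
    for col in zip(*coeffs):
        assert abs(sum(col)) < 1e-9, "column does not sum to zero"
    return sum(c * m for crow, mrow in zip(coeffs, means)
               for c, m in zip(crow, mrow))

# the "A x B, 12 vs 3" contrast from the 2 x 3 table above
coeffs = [[ 1,  1, -2],
          [-1, -1,  2]]
# hypothetical cell means: rows = A1, A2; columns = B1, B2, B3
means = [[10, 12, 20],
         [11, 13, 15]]
print(contrast_value(coeffs, means))
```

A nonzero value indicates that the A x B interaction, with levels 1 and 2 of B combined, is present in these (hypothetical) means.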
Standardized Contrasts
As in one-way designs, one can compute standardized contrasts. Rex B. Kline
(Chapter 7 of Beyond Significance Testing, 2004, American Psychological Association)
notes that there is much disagreement regarding how to compute standardized
contrasts with data from a multifactorial design, and opines that
1. such estimates should be comparable to those that would be obtained from a one-
way design, and
2. changing the number of factors in the design should not necessarily change the
effect size estimates.
Adding factors to a design is, IMHO, not different from adding covariates. Should the additional variance explained by added factors be excluded from the denominator of the standardized contrast g? Imagine a 2 x 2 design, where A is type of therapy, B is sex of patient, and Y is post-treatment wellness. You want to compute g for the effect of type of therapy. The MSE excludes variance due to sex, but in the population of interest sex may naturally account for some of the variance in wellness, so using the root mean square error as the standardizer will underestimate the population standard deviation. It may be desirable to pool the SS_within-cells, SS_B, and SS_AxB to form an appropriate standardizer in a case like this. I'd just drop B and AxB from the model, run a one-way ANOVA, and use the root mean square error from that as the standardizer.
Kline argues that when a factor like sex is naturally variable in both the
population of interest and the sample then variance associated with it should be
included in the denominator of g. While I agree with this basic idea, I am not entirely
satisfied with it. Such a factor may be associated with more or less of the variance in
the sample than it is in the population of interest. In experimental research the
distribution of such a factor can be quite different in the experiment than it is in the
population of interest. For example, in the experiment there may be approximately
equal numbers of clients assigned to each of three therapies, but in the natural world
patients may be given the one therapy much more often than the others.
Now suppose that you are looking at the simple main effects of A (therapy) at
levels of B (sex). Should the standardizer be computed within-sex, in which case the
standardizer for men would differ from that for women, or should the standardizer be
pooled across sexes? Do you want each g to estimate d in a single-sex population, or
do you want a g for men that can be compared with the g for women without having to
consider the effect of the two estimators having different denominators?
Magnitude of Effect
Eta-squared and omega-squared can be computed for each effect in the model. With omega-squared, substitute the effect's df for the term (k-1) in the formula we used for the one-way design.
For the interaction, η² = 1,240/26,115 = .047, and ω² = (1,240 - 4(119)) / (26,115 + 119) = 764/26,234 = .029.
For Gender, η² = 9,025/26,115 = .346, and ω² = (9,025 - 1(119)) / 26,234 = 8,906/26,234 = .339.
For Smoking History, η² = 5,140/26,115 = .197, and ω² = (5,140 - 4(119)) / 26,234 = 4,664/26,234 = .178.
Gender clearly accounts for the greatest portion of the variance in ability to detect the scent, but smoking history also accounts for a great deal. Of course, were we to analyze the data from only the female participants, excluding the male participants (for whom the effect of smoking history was smaller and nonsignificant), the ω² for smoking history would be much larger.
Partial Eta-Squared. The value of η² for any one effect can be influenced by the number and magnitude of other effects in the model. For example, if we conducted our research on only women, the total variance in the criterion variable would be reduced by the elimination of the effects of gender and the Gender x Smoking interaction. If the effect of smoking remained unchanged, then the ratio SS_Smoking / SS_Total would be increased. One attempt to correct for this is to compute a partial eta-squared, partial η² = SS_Effect / (SS_Effect + SS_Error). In other words, the question answered by partial eta-squared is this: Of the variance that is not explained by other effects in the model, what proportion is explained by this effect?
For the interaction, partial η² = 1,240 / (1,240 + 10,710) = .104.
For gender, partial η² = 9,025 / (9,025 + 10,710) = .457.
For smoking history, partial η² = 5,140 / (5,140 + 10,710) = .324.
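The three effect-size measures follow directly from the sums of squares; here is a minimal Python sketch using the values from this example (SS_Total = 26,115, SS_Error = 10,710, MSE = 119).

```python
def eta_sq(ss_effect, ss_total):
    """Proportion of the total variance associated with the effect."""
    return ss_effect / ss_total

def partial_eta_sq(ss_effect, ss_error):
    """Proportion of (effect + error) variance associated with the effect."""
    return ss_effect / (ss_effect + ss_error)

def omega_sq(ss_effect, df_effect, mse, ss_total):
    """Less biased estimate; subtracts the df * MSE expected by chance."""
    return (ss_effect - df_effect * mse) / (ss_total + mse)

SS_TOTAL, SS_ERROR, MSE = 26_115, 10_710, 119
print(round(eta_sq(9_025, SS_TOTAL), 3))            # gender eta-squared
print(round(partial_eta_sq(9_025, SS_ERROR), 3))    # gender partial eta-squared
print(round(omega_sq(9_025, 1, MSE, SS_TOTAL), 3))  # gender omega-squared
```

Passing in the interaction or smoking-history SS and df reproduces the other values in the text.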
Notice that the partial eta-squared values are considerably larger than the eta-squared or omega-squared values. Clearly this statistic can be used to make a small effect look moderate in size or a moderate-sized effect look big. It is even possible to get partial eta-squared values that sum to greater than 1. That makes me a little uncomfortable. Even more discomforting, many researchers have incorrectly reported partial eta-squared as being regular eta-squared. Pierce, Block, and Aguinis (2004) found articles in prestigious psychological journals in which this error was made. Apparently the authors of these articles (which appeared in Psychological Science and other premier journals) were not disturbed by the fact that the values they reported indicated that they had accounted for more than 100% of the variance in the outcome variable; in one case, the authors claimed to have explained 204%. Oh my.
Confidence Intervals for η² and Partial η²
As was the case with one-way ANOVA, one can use my program Conf-Interval-R2-Regr.sas to put a confidence interval about eta-squared or partial eta-squared. You will, however, need to compute an adjusted F when putting the confidence interval on eta-squared, the F that would be obtained were all other effects excluded from the model. Note that I have computed 90% confidence intervals, not 95%. See this document.
Confidence Intervals for η². For the effect of gender, MSE = (SS_Total - SS_Gender) / (df_Total - df_Gender) = (26,115 - 9,025) / (99 - 1) = 174.39, and F = 9,025 / 174.39 = 51.752 on 1, 98 df. 90% CI [.22, .45].
For the effect of smoking, MSE = (26,115 - 5,140) / (99 - 4) = 220.79, and F = 1,285 / 220.79 = 5.82 on 4, 95 df. 90% CI [.005, .15].
For the effect of the interaction, MSE = (26,115 - 1,240) / (99 - 4) = 261.84, and F = 310 / 261.84 = 1.184 on 4, 95 df. 90% CI [.000, .17].
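The adjusted-F arithmetic (the F one would feed to Conf-Interval-R2-Regr.sas when putting a CI on η² rather than partial η²) can be sketched in Python with the SS values from this example.

```python
def adjusted_f(ss_effect, df_effect, ss_total, df_total):
    """F computed as if all other effects were excluded from the model.

    Returns (MS_effect, adjusted MSE, adjusted F); the denominator df
    is df_total - df_effect.
    """
    mse = (ss_total - ss_effect) / (df_total - df_effect)
    ms_effect = ss_effect / df_effect
    return ms_effect, mse, ms_effect / mse

ms, mse, f = adjusted_f(9_025, 1, 26_115, 99)   # gender
print(round(f, 2))                              # F on 1, 98 df
ms, mse, f = adjusted_f(5_140, 4, 26_115, 99)   # smoking history
print(round(f, 2))                              # F on 4, 95 df
```

Because the main effects' SS are folded back into the error term, the adjusted F for a small effect (like the interaction here) can be much smaller than the F from the full factorial model.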
Notice that the CI for the interaction includes zero, even though the interaction
was statistically significant. The reason for this is that the F testing the significance of
the interaction used a MSE that excluded variance due to the two main effects, but that
variance was included in the standardizer for our confidence interval.
Confidence Intervals for Partial η². If you give the program the unadjusted values for F and df, you get confidence intervals for partial eta-squared. Here they are for our data:
Gender: .33, .55
Smoking: .17, .41
Interaction: .002, .18 (notice that this CI does not include zero)
Eta-Squared or Partial Eta-Squared? Which one should you use? I am more
comfortable with eta-squared, but can imagine some situations where the use of partial
eta-squared might be justified. Kline (2004) has argued that when a factor like sex is
naturally variable in both the population of interest and the sample, then variance
associated with it should be included in the denominator of the strength of effect
estimator, but when a factor is experimental and not present in the population of
interest, then the variance associated with it may reasonably be excluded from the
denominator.
For example, suppose you are investigating the effect of experimentally
manipulated A (you create a lesion in the nucleus spurious of some subjects but not of
others) and subject characteristic B (sex of the subject). Experimental variable A does
not exist in the real-world population, subject characteristic B does. When estimating
the strength of effect of Experimental variable A, the effect of B should remain in the
denominator, but when estimating the strength of effect of subject characteristic B it
may be reasonable to exclude A (and the interaction) from the denominator.
To learn more about this controversial topic, read Chapter 7 in Kline (2004). You
can find my notes taken from this book on my Stat Help page, Beyond Significance
Testing.
Perhaps another example will help illustrate the difference between eta-squared and partial eta-squared. Here we have sums of squares for a two-way orthogonal ANOVA in which SS_A = SS_B = SS_AxB = SS_Error = 25. Eta-squared is η² = SS_Effect / (SS_A + SS_B + SS_AxB + SS_Error) = 25/100 = .25 for every effect. Eta-squared answers the question: of all the variance in Y, what proportion is (uniquely) associated with this effect?
Partial eta-squared is partial η² = SS_Effect / (SS_Effect + SS_Error) = 25/50 = .50 for every effect. Partial eta-squared answers the question: of the variance in Y that is not associated with any of the other effects in the model, what proportion is associated with this effect? Put another way, if all of the other effects were nil, what proportion of the variance in Y would be associated with this effect?
Notice that the values of eta-squared sum to .75, the full model eta-squared. The values of partial eta-squared sum to 1.5. Hot damn, we explained 150% of the variance!
Once you have covered multiple regression, you should compare the difference
between eta-squared and partial eta-squared with the difference between squared
semipartial correlation coefficients and squared partial correlation coefficients.
η² for Simple Main Effects. We have seen that the effect of smoking history is significant for the women, but how large is the effect among the women? From the cell sums and uncorrected sums of squares given on pages 1 and 2, one can compute the total sum of squares for the women; it is 11,055. We already computed the sum of squares for smoking history for the women, 5,700. Accordingly, η² = 5,700/11,055 = .52. Recall that this η² was only .20 for men and women together.
When using my Conf-Interval-R2-Regr.sas with simple effects one should use an F that was computed with an individual error term rather than with a pooled error term. If you use the data on pages 1 and 2 to conduct a one-way ANOVA for the effect of smoking history using only the data from the women, you obtain an F of 11.97 on (4, 45) degrees of freedom. Because there was absolute homogeneity of variance in my contrived data, this is the same value of F obtained with the pooled error term, but notice that the df are less than they were with the pooled error term. With this F and df my program gives you a 90% CI of [.29, .60]. For the men, η² = .11, 90% CI [0, .20].
Assumptions
The assumptions of the factorial ANOVA are essentially the same as those made
for the one-way design. We assume that in each of the a x b populations the dependent
variable is normally distributed and that variance in the dependent variable is constant
across populations.
Advantages of Factorial ANOVA
The advantages of the factorial design include:
1. Economy - if you wish to study the effects of 2 (or more) factors on the same
dependent variable, you can more efficiently use your participants by running them
in 1 factorial design than in 2 or more 1-way designs.
2. Power - if both factors A and B are going to be determinants of variance in your participants' dependent variable scores, then the factorial design should have a smaller error term (denominator of the F-ratio) than would a one-way ANOVA on just one factor. The variance due to B and due to AxB is removed from the MSE in the factorial design, which should increase the F for factor A (and thus increase power) relative to a one-way analysis where that B and AxB variance would be included in the error variance.
Consider the partitioning of the sums of squares illustrated to the right. SS_B = 15 and SSE = 85. Suppose there are two levels of B (an experimental manipulation) and a total of 20 cases. MS_B = 15, MSE = 85/18 = 4.722. The F(1, 18) = 15/4.722 = 3.176, p = .092. Woe to us, the effect of our experimental treatment has fallen short of statistical significance.
Now suppose that the subjects here consist of both men and women and that the
sexes differ on the dependent variable. Since sex is not included in the model,
variance due to sex is error variance, as is variance due to any interaction between
sex and the experimental treatment.
Let us see what happens if we include sex and the interaction in the model. SS_Sex = 25, SS_B = 15, SS_Sex*B = 10, and SSE = 50. Notice that the SSE has been reduced by removing from it the effects of sex and the interaction. The MS_B is still 15, but the MSE is now 50/16 = 3.125 and the F(1, 16) = 15/3.125 = 4.80, p = .044. Notice that excluding the variance due to sex and the interaction has reduced the error variance enough that now the main effect of the experimental treatment is significant.
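The arithmetic behind this power advantage can be sketched in Python, using the SS and df values from the illustration above.

```python
def f_ratio(ss_effect, df_effect, ss_error, df_error):
    """F for an effect given its SS and df and the error SS and df."""
    return (ss_effect / df_effect) / (ss_error / df_error)

# B tested against an error term that still contains sex and sex x B variance
f_one_way = f_ratio(15, 1, 85, 18)
print(round(f_one_way, 2))    # p = .092 in the text

# B tested after SS_Sex = 25 and SS_Sex*B = 10 are removed from the error term
f_factorial = f_ratio(15, 1, 50, 16)
print(round(f_factorial, 2))  # p = .044 in the text
```

The numerator is unchanged; only shrinking the MSE (at the cost of a few error df) raises the F.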
Of course, you could achieve the same reduction in error by simply holding the one factor constant in your experiment, for example, using only participants of one gender, but that would reduce your experiment's external validity (generalizability of effects across various levels of other variables). For example, if you used only male participants you would not know whether or not your effects generalize to female participants.
3. Interactions - if the effect of A does not generalize across levels of B, then including B in a factorial design allows you to study how A's effect varies across levels of B, that is, how A and B interact in jointly affecting the dependent variable.
Example of How to Write-Up the Results of a Factorial ANOVA
Results
Participants were given a test of their ability to detect the scent of a chemical
thought to have pheromonal properties in humans. Each participant had been classified
into one of five groups based on his or her smoking history. A 2 x 5, Gender x Smoking
History, ANOVA was employed, using a .05 criterion of statistical significance and a
MSE of 119 for all effects tested. There were significant main effects of gender, F(1, 90) = 75.84, p < .001, η² = .346, 90% CI [.22, .45], and smoking history, F(4, 90) = 10.80, p < .001, η² = .197, 90% CI [.005, .15], as well as a significant interaction between gender and smoking history, F(4, 90) = 2.61, p = .041, η² = .047, 90% CI [.00, .17]. As shown in Table 1, women were better able to detect this scent than were men,
and smoking reduced ability to detect the scent, with recovery of function being greater
the longer the period since the participant had last smoked.
Table 1. Mean ability to detect the scent.

                           Smoking History
Gender     < 1 m   1 m - 2 y   2 y - 7 y   7 y - 12 y   never   Marginal
Male        20        22          25           28         30       25
Female      30        35          45           50         60       44
Marginal    25        28          35           39         45
The significant interaction was further investigated with tests of the simple main
effect of smoking history. For the men, the effect of smoking history fell short of
statistical significance, F(4, 90) = 1.43, p = .23, η² = .113, 90% CI [.00, .20]. For the women, smoking history had a significant effect on ability to detect the scent, F(4, 90) = 11.97, p < .001, η² = .516, 90% CI [.29, .60]. This significant simple main effect was
followed by a set of four contrasts. Each group of female ex-smokers was compared
with the group of women who had never smoked. The Bonferroni inequality was
employed to cap the familywise error rate at .05 for this family of four comparisons. It
was found that the women who had never smoked had a significantly better ability to
detect the scent than did women who had quit smoking one month to seven years
earlier, but the difference between those who never smoked and those who had
stopped smoking more than seven years ago was too small to be statistically significant.
Please note that you could include your ANOVA statistics in a source table
(referring to it in the text of your results section) rather than presenting them as I have
done above. Also, you might find it useful to present the cell means in an interaction
plot rather than in a table of means. I have presented such an interaction plot below.
[Interaction plot: Mean Ability to Detect the Scent (y-axis, 10 to 60) by Smoking History (x-axis: < 1 m, 1 m - 2 y, 2 y - 7 y, 7 y - 12 y, never), with separate lines for Female and Male participants.]
Reference
Pierce, C. A., Block, R. A., & Aguinis, H. (2004). Cautionary note on reporting eta-squared values from multifactor designs. Educational and Psychological Measurement, 64, 916-924.
Copyright 2011, Karl L. Wuensch - All rights reserved.
Triv-Int.doc
Main Effects That Participate in Significant but Trivial Interactions
Some persons opine that one should never interpret a main effect when it
participates in a significant interaction. I disagree. One may have good reasons to
ignore an interaction. For example, the interaction may be statistically significant but of
trivial magnitude. One of my graduate students investigated how the degree of altruism
shown in social interactions is a function of the degree of kinship between the parties
involved in the interaction. He had reason to believe that there are cultural differences
in the relationship between degree of kinship and amount of altruism shown. His
primary analysis was a Culture x Kinship ANOVA. The kinship variable was a within-
subjects variable, and the sample sizes were large, so he had so much power that even
trivial effects could be detected and labeled statistically significant. The student was
overjoyed when both the main effect of degree of kinship and the interaction between
kinship and culture were significant beyond the .001 level. But look at the results that I
have plotted below (after reflecting the cell means -- originally, high scores indicated low
altruism, but that makes things confusing). From the plot, it is pretty clear to me that the
level of altruism increases with degree of kinship in both cultures, and that the shape of
this function in the Americans is not much different than it is in the Chinese. I asked the
student to report a magnitude of effect estimate for each effect. He chose to employ
η². For the main effect of degree of kinship, the η² was an enormous .75. For the interaction it was a trivial .004. Significant or not, I argue that the interaction is so small in magnitude that it can, even should be, ignored.
[Plot: mean altruism (y-axis, 10 to 260) by relationship to the other party (enemy, stranger, townsman, 2nd cousin, 1st kin), with separate lines for American and Chinese participants.]
CM = (ΣY)²/N = 11,250²/90 = 1,406,250.
SS_Cells = Σ(T_ij²/n_ij) - CM = 1550²/10 + 2200²/20 + 2700²/20 + 4800²/40 - CM = 1,422,750 - 1,406,250 = 16,500.
SS_School = Σ(T_i²/n_i) - CM = (1550 + 2200)²/(10 + 20) + (2700 + 4800)²/(20 + 40) - CM = 0.
SS_Gender = Σ(T_j²/n_j) - CM = (1550 + 2700)²/(10 + 20) + (2200 + 4800)²/(20 + 40) - CM = 12,500.
SS_School x Gender = SS_Cells - SS_School - SS_Gender = 16,500 - 0 - 12,500 = 4,000.
SS_error = SS_TOT - SS_Cells = 81,000 - 16,500 = 64,500.
Source SS df MS F p
School 0 1 0 0.0 1.000
Gender 12500 1 12500 16.6 < .001
Interaction 4000 1 4000 5.3 .024
Error 64500 86 750
Total 81000 89
Interaction Analysis:
SS_Gender at School 1 = 1550²/10 + 2200²/20 - (1550 + 2200)²/(10 + 20) = 13,500.
F(1, 86) = 13500 / 750 = 18, p < .001.
SS_Gender at School 2 = 2700²/20 + 4800²/40 - (2700 + 4800)²/(20 + 40) = 3,000.
F(1, 86) = 3000 / 750 = 4, p = .049.
Significant gender effects at both schools, but a greater difference between male
students and female students at School 1 than at School 2.
------------------------------------ OR -------------------------------------
SS_School for male students = 1550²/10 + 2700²/20 - (1550 + 2700)²/(10 + 20) = 2,666.6.
F(1, 86) = 2666.6 / 750 = 3.5, p = .06.
SS_School for female students = 2200²/20 + 4800²/40 - (2200 + 4800)²/(20 + 40) = 1,333.3.
F(1, 86) = 1333.3 / 750 = 1.7, p = .19.
Nonsignificant school differences for each gender, but trends in opposite
directions [Sch 1 > Sch 2 for male students, Sch 1 < Sch 2 for female students]
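The simple-main-effect sums of squares above all follow one pattern, which can be sketched in Python from the cell totals and ns given earlier.

```python
def simple_effect_ss(cells):
    """SS for a simple main effect from (cell total, n) pairs.

    E.g., gender at School 1 uses the male and female cells of School 1:
    sum of T^2/n over the cells, minus (grand total)^2 / (total n).
    """
    grand_total = sum(total for total, n in cells)
    grand_n = sum(n for total, n in cells)
    return sum(total ** 2 / n for total, n in cells) - grand_total ** 2 / grand_n

print(simple_effect_ss([(1550, 10), (2200, 20)]))   # gender at School 1
print(simple_effect_ss([(2700, 20), (4800, 40)]))   # gender at School 2
```

Dividing each SS by the pooled MSE of 750 reproduces the F values in the text.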
Traditional Unweighted Means ANOVA
One simple way to weight the cell means equally involves using the harmonic mean. In this case we compute Ñ = k / Σ(1/n_i).
For the data set Int (School x Gender), retain the previous sums and ns.
Ñ = 4 / (1/10 + 1/20 + 1/20 + 1/40) = 17.7.
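The harmonic mean sample size can be sketched in Python with the four cell ns from this data set.

```python
def harmonic_n(ns):
    """Harmonic mean sample size, N-tilde = k / sum(1/n_i)."""
    return len(ns) / sum(1.0 / n for n in ns)

print(round(harmonic_n([10, 20, 20, 40]), 2))   # approx 17.78
```

Note that the harmonic mean (17.78) is well below the arithmetic mean of the cell ns (22.5); it down-weights the influence of the large cells.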
We now adjust cell totals by multiplying cell means (M) by the harmonic sample size: Adjusted cell total = Ñ M.

                 Male      Female    Marginal Total
School 1         2755.5    1955.5    4711.1
School 2         2400      2133.3    4533.3
Marginal Total   5155.5    4088.8    9244.4
CM = (ΣX)² / (Ñ × #cells) = 9,244.4² / (4 × 17.7) = 1,201,777.7.
SS_School = Σ T_i²/(2Ñ) - CM = (4711.1² + 4533.3²) / (2 × 17.7) - CM = 444.4.
SS_Gender = Σ T_j²/(2Ñ) - CM = (5155.5² + 4088.8²) / (2 × 17.7) - CM = 16,000.
SS_Cells = Σ T_ij²/Ñ - CM = (2755.5² + 1955.5² + 2400² + 2133.3²) / 17.7 - CM = 20,444.4.
SS_School x Gender = SS_Cells - SS_School - SS_Gender = 20,444.4 - 444.4 - 16,000 = 4,000.
To find the SSE, find for each cell SS = ΣX² - (ΣX)²/n and then sum these across cells.
Assume the below cell sums and ns.

         School 1              School 2
         Male      Female      Male      Female
ΣX       1,550     2,200       2,700     4,800
ΣX²      248,000   256,000     379,000   604,250
n        10        20          20        40
SS11 = 248,000 - 1550²/10 = 7,750. SS12 = 256,000 - 2200²/20 = 14,000.
SS21 = 379,000 - 2700²/20 = 14,500. SS22 = 604,250 - 4800²/40 = 28,250.
The sum = SSE = 64,500. The MSE = the weighted average of the cell variances.
Source        SS       df   MS       F       p
School        444.4    1    444.4    0.59    .44
Gender        16,000   1    16,000   21.30   < .001
Interaction   4,000    1    4,000    5.30    .024
Error         64,500   86   750
Gender Interaction Analysis
SS_B at A_i = Σ_j T̃_ij²/Ñ - T̃_i²/(bÑ), where the T̃ are the adjusted cell and marginal totals and b is the number of levels of B.
SS_Gender at School 1 = (2755.5² + 1955.5²) / 17.7 - 4711.1² / (2 × 17.7) = 18,000.
SS_Gender at School 2 = (2400² + 2133.3²) / 17.7 - 4533.3² / (2 × 17.7) = 2,000.
SS_Gender at School 1 + SS_Gender at School 2 = SS_Gender + SS_School x Gender:
18,000 + 2,000 = 20,000 = 16,000 + 4,000.
F1 = 18000 / 750 = 24, p < .001. F2 = 2000 / 750 = 2.6, p = .11.
There is a significant gender difference at School 1, but not at School 2.
----------------- Or, School Interaction Analysis ----------------------
SS_School for male students = (2755.5² + 2400²) / 17.7 - 5155.5² / (2 × 17.7) = 3,555.5.
SS_School for female students = (1955.5² + 2133.3²) / 17.7 - 4088.8² / (2 × 17.7) = 888.8.
SS_School for males + SS_School for females = SS_School + SS_School x Gender:
3,555.5 + 888.8 = 4,444.4 = 444.4 + 4,000.
F_men = 3555.5 / 750 = 4.74, p = .032. F_women = 888.8 / 750 = 1.185, p = .28.
There is a significant school difference for men but not for women.
Reversal Paradox
We have seen that the School x Gender interaction present in the body weight data (from page 412 of the 3rd edition of Howell) results in there being no main effect of school if we use unweighted means, but a (small) main effect being indicated if we use weighted means. When we modified one cell mean to remove the interaction, choice of weighting method no longer affected the magnitude of the main effects. The cell frequencies in Howell's data were proportional, making school and gender orthogonal (independent).
Let me show you a strange thing that can happen when the cell frequencies are
not proportional.
                      Gender
         Male           Female         Marginal Means
School   M      n       M      n       weighted   unweighted
1        150    60      110    40      134        130
2        160    10      120    90      124        140
Note that there is no interaction, but that the cell frequencies indicate that gender
is correlated with school (School 1 has a higher proportion of male students than does
School 2). Weighted means indicate that body weight at School 1 exceeds that at
School 2, but unweighted means indicate that body weight at School 2 exceeds that at
School 1. Both make sense. School 1 has a higher mean body weight than School 2
because School 1 has a higher proportion of male students than does School 2, and
men weigh more than women. But the men at School 2 weigh more than do the men at
School 1 and the women at School 2 weigh more than do the women at School 1.
A reversal paradox occurs when two variables are positively related in aggregated data but negatively related within each level of a third variable (or are negatively related in the aggregate and positively related within each level of the third variable). Please read Messick and van de Geer's article on the reversal paradox (Psychol. Bull., 90, 582-593).
We have a reversal paradox here - in the aggregated data (weighted marginal means),
students at School 1 weigh more than do those at School 2, but within each Gender,
students at School 2 weigh more than those at School 1.
Copyright 2010, Karl L. Wuensch - All rights reserved.
Trend2.doc
Two-Way Independent Samples Trend Analysis
Imagine that we are continuing our earlier work (from the handout "One-
Way Independent Samples ANOVA with SAS") evaluating the effectiveness of
the drug Athenopram HBr. This time we have data from three different groups.
The groups differ with respect to the psychiatric condition for which the drug is
being employed. We wish to determine whether the dose-response curve is the
same across all three groups.
Download and run the file Trend2.sas from my SAS programs page. The
contrived data (created with SAS' normal random number generator) are within
the program. We have 100 scores (20 at each of the five doses) in each
diagnostic group. Our design is Diagnosis x Dose, 3 x 5. Diagnosis is a
qualitative variable, Dose is quantitative. Our dependent variable, as before, is a
measure of the patients' psychological illness after two months of
pharmacotherapy.
In the data step I compute the powers of the Dose variable necessary to conduct the analysis as a polynomial regression. If I had used an input statement of "INPUT DOSE DIAGNOS $ ILLNESS," I would have required 300 data lines, one for each participant, from "0 A 83" to "40 C 120." I only needed 30 data lines (two per cell) with the do loop I employed.
PROC MEANS and PROC PLOT are used to create a plot of the dose-
response curve for each diagnostic group, with the plotting symbols being the
letter representing diagnostic group. You should edit your output file in Word to
connect the plotted means with line segments. Look at that plot. The plot for group A is clearly quadratic, while those for groups B and C are largely linear, with some quadratic thrown in.
The first invocation of PROC GLM is used to conduct a standard 3 x 5
factorial ANOVA. Note that all three effects are significant. The interaction here
is not only significant, but also large in magnitude, with an η² of .14. Clearly we need to investigate this interaction.
The second invocation of PROC GLM obtains trend components for the
Dose variable and for the interaction between Dose and Diagnosis. Look at the
output. Sum the SS for the four trends for the main effect of Dose. You should
get the SS
Dose
from the previous analysis. The trends for the main effect of Dose
are an orthogonal partitioning of the SS for the main effect of Dose. Sum the SS
for the trend components of the interaction between Dose and Diagnosis. You
should get the SS
Dose x Diagnosis
from the previous analysis. The trends for the
interaction are an orthogonal partitioning of the SS for the interaction. We see
that the effect of Dose differs significantly across the diagnostic groups with
respect to its linear, quadratic, and cubic trends. Given these results, we should
look at the linear, quadratic, and cubic trends in the simple effects (the effect of
dose in each of the diagnostic groups).
ω² with both fixed and random effects, but others use the symbol ω² with fixed effects and the symbol ρ (the intraclass correlation coefficient) with random effects.
Statistically, there are three classification variables (factors in the language of ANOVA) in Lori's design. Using the variable names in her SPSS data file, they are CONDTN (FTF or CM), TnNo2 (team number), and Subjects. The Subjects variable is nested within the Team variable and the Team variable is nested within the Condtn variable -- each subject served on only one team and each team served in only one of the experimental conditions.
Lori concluded, correctly I believe, that the editor wanted her to conduct a one-way random effects ANOVA using process satisfaction (PrSat) as the dependent variable and teams (TnNo2) as the classification variable, and then report ω² as an estimate of the proportion of the variance in process satisfaction that is explained by
differences among the teams. Do note that in treating Teams as a random rather
than a fixed factor we are asserting that our teams represent a random sample from a
population of teams to which we wish to generalize our results. Put another way, we
are not just interested in the 40 teams on which we have data but rather on the entire
population of teams from which our sample of teams could be considered to be a
random sample. Of course, we are also treating Subjects as a random factor,
pretending that they really represent a random sample from the population of subjects
that is of interest to us. The experimental factor (Condtn) is a fixed factor -- we did
not randomly choose two experimental conditions from a population of conditions, we
deliberately chose these two conditions, the only two conditions in which we are
interested -- that is, on this factor we have the entire population of interest.
One-Way Random Effects ANOVA, ω² = .425, η² = .562
ω² = [(MS_A - MS_E)/n] / [MS_E + (MS_A - MS_E)/n], where MS_A is the mean square among teams, MS_E is the error (within teams) mean square, and n is the number of scores in each team. In the seventh edition of Howell's Methods text only two-way ANOVA is included, but he provides a link to a table of expected mean squares for three-way designs.
Now I obtain the random effects ANOVA, using my preferred statistical program,
SAS. Here is the procedural code I used, followed by the output:
proc glm; class tm_no2; model prsat = tm_no2; random tm_no2 / test; run;
------------------------------------------------------------------------------------------------
The SAS System 2
The GLM Procedure
Dependent Variable: PRSAT
Sum of
Source DF Squares Mean Square F Value Pr > F
Model 39 71.1840000 1.8252308 3.96 <.0001
Error 120 55.3600000 0.4613333
Corrected Total 159 126.5440000
R-Square Coeff Var Root MSE PRSAT Mean
0.562524 17.92125 0.679215 3.790000
Source DF Type III SS Mean Square F Value Pr > F
TM_NO2 39 71.18400000 1.82523077 3.96 <.0001
------------------------------------------------------------------------------------------------
The SAS System 4
The GLM Procedure
Tests of Hypotheses for Random Model Analysis of Variance
Dependent Variable: PRSAT
Source DF Type III SS Mean Square F Value Pr > F
TM_NO2 39 71.184000 1.825231 3.96 <.0001
Error: MS(Error) 120 55.360000 0.461333
------------------------------------------------------------------------------------------------
For a one-way ANOVA, the basic analysis for the random effect model is identical to that for the fixed effect model. The computation of the ω² does differ, however, between the fixed effect model and the random effect model. Computation of the ω² for this analysis is:

ω² = [(MS_A - MS_E)/n] / [MS_E + (MS_A - MS_E)/n] = [(1.825 - 0.461)/4] / [0.461 + (1.825 - 0.461)/4] = 0.341/0.802 = .425.
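The arithmetic above is easy to check with a short script; a minimal sketch using the mean squares from the SAS output:

```python
def omega_sq_random(ms_a, ms_e, n):
    """Omega-squared for a one-way random effects ANOVA.

    ms_a: mean square among groups; ms_e: within-group (error) mean square;
    n: number of scores per group (equal n assumed).
    """
    var_a = (ms_a - ms_e) / n       # estimated variance component for the factor
    return var_a / (var_a + ms_e)   # proportion of total variance due to the factor

# Values from the team analysis above: MS_A = 1.825, MS_E = 0.461, n = 4
print(round(omega_sq_random(1.825, 0.461, 4), 3))  # 0.425
```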
Omega-squared is considered to be superior to eta-squared for estimating the proportion of variance accounted for by an ANOVA factor. Eta-squared tends to overestimate that proportion -- but eta-squared is certainly easier to compute:

η² = SS_Effect / SS_Total = 71.184/126.544 = .563.

Notice that this is the R² reported by SAS.
One-Way Fixed Effects ANOVA, ω² = .419, η² = .563
For pedagogical purposes, I'll show the computation of ω² treating teams as a fixed variable. The estimated treatment variance is

(a - 1)(MS_A - MS_E)/(na) = (40 - 1)(1.825 - 0.461)/((4)(40)) = 0.332.

The term a is the number of levels of the team variable. The estimated total variance is equal to the estimated treatment variance plus the estimated error variance, 0.332 + MS_E = 0.332 + 0.461 = 0.793. Accordingly, ω² = .332/.793 = .419.
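The fixed-effects version can be sketched the same way (values again from the SAS output above):

```python
def omega_sq_fixed(ms_a, ms_e, n, a):
    """Omega-squared treating the factor as fixed.

    a: number of levels of the factor; n: number of scores per level.
    """
    var_treat = (a - 1) * (ms_a - ms_e) / (n * a)  # estimated treatment variance
    return var_treat / (var_treat + ms_e)          # treatment / estimated total variance

print(round(omega_sq_fixed(1.825, 0.461, 4, 40), 3))  # 0.419
```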
Two-Way Mixed Effects ANOVA
One should keep in mind that the ω² for teams, as computed above, includes the effect of the experimental treatment, Condtn -- that is, we have estimated the variance among the teams, but part of that variance is due to the fact that the two experimental groups of teams were treated differently, and the other part of it is due to other differences among teams (error, effects of extraneous variables, reflected in differences among subjects' scores within teams).
If one wanted to determine the variance of teams after excluding variance due to
the experimental condition, a mixed factorial ANOVA, Condtn x Teams (nested within
Condtn), could be conducted. Here is the SAS code and output for such an analysis,
treating Condtn as fixed and Teams as random:
proc glm; class condtn tm_no2; model prsat = condtn tm_no2(condtn);
random tm_no2(condtn) / test; run;
------------------------------------------------------------------------------------------------
The SAS System 6
The GLM Procedure
Dependent Variable: PRSAT
Sum of
Source DF Squares Mean Square F Value Pr > F
Model 39 71.1840000 1.8252308 3.96 <.0001
Error 120 55.3600000 0.4613333
Corrected Total 159 126.5440000
R-Square Coeff Var Root MSE PRSAT Mean
0.562524 17.92125 0.679215 3.790000
Source DF Type III SS Mean Square F Value Pr > F
CONDTN 1 33.48900000 33.48900000 72.59 <.0001
TM_NO2(CONDTN) 38 37.69500000 0.99197368 2.15 0.0009
------------------------------------------------------------------------------------------------
The SAS System 8
The GLM Procedure
Tests of Hypotheses for Mixed Model Analysis of Variance
Dependent Variable: PRSAT PRSAT
Source DF Type III SS Mean Square F Value Pr > F
CONDTN 1 33.489000 33.489000 33.76 <.0001
Error 38 37.695000 0.991974
Error: MS(TM_NO2(CONDTN))
Source DF Type III SS Mean Square F Value Pr > F
TM_NO2(CONDTN) 38 37.695000 0.991974 2.15 0.0009
Error: MS(Error) 120 55.360000 0.461333
Note that the SS for Teams from the previous analysis, 71.184, has been
partitioned into a SS for Condtn, 33.489, and a SS for Teams within conditions, 37.695.
The Total SS (126.544) is comprised of the SS for Condtn, 33.489, plus the SS for
Teams within conditions, 37.695, plus the SS for Subjects within teams within
conditions, 55.36. Do note that the analysis above uses the variance for teams within
conditions as the error variance for testing the effect of Condtn.
The η² for the entire effect of teams is 71.184/126.544 = .563. That part due to the experimental manipulation is 33.489/126.544 = .265.
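The partitioning is easy to verify from the sums of squares in the output above:

```python
# Sums of squares from the mixed-model SAS output:
# Condtn, Teams(Condtn), and Subjects within Teams within Condtn
ss_condtn, ss_teams_within, ss_subjects = 33.489, 37.695, 55.360

ss_teams = ss_condtn + ss_teams_within   # SS for teams ignoring condition
ss_total = ss_teams + ss_subjects        # corrected total SS

print(round(ss_teams, 3))             # 71.184
print(round(ss_total, 3))             # 126.544
print(round(ss_teams / ss_total, 3))  # 0.563, eta-squared for the whole effect of teams
print(round(ss_condtn / ss_total, 3)) # 0.265, the part due to the manipulation
```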
References
Howell, D. C. (2007). Statistical methods for psychology (6th ed.). Belmont, CA: Thomson Wadsworth.
Howell, D. C. (2010). Statistical methods for psychology (7th ed.). Belmont, CA: Cengage Wadsworth.
Copyright 2010, Karl L. Wuensch - All rights reserved.
ANOVA_Flow.doc
The Factorial ANOVA Is Done, Now What Do I Do?
After conducting a factorial ANOVA, one typically inspects the results of that ANOVA and then decides what additional analyses
are needed. It is often recommended that this take place in a top-down fashion, inspecting the highest-order interaction term first and
then moving down to interactions of the next lower order, and so on until reaching the main effects.
Two basic principles are:
If an interaction is significant, conduct tests of simple (conditional) effects to help explain the interaction, and
Effects which do not participate in higher-order interactions are easier to interpret than are those that do.
Consider a three-way analysis. If the triple interaction, AxBxC is significant, one might decide to test the simple (conditional)
interaction of AxB at each level of C. If the AxB interaction at C=1 is significant, one might then decide to test the simple, simple
(doubly conditional) main effects of A at each level of B for those cells where C=1. If the AxB interaction at C=2 is not significant, then
one is likely to want to look at the (simple main) effects of A and of B for those cells where C=2.
If the triple interaction is not significant, one next looks at the two-way interactions. Suppose that AxB is significant but the other two
interactions are not. The significant AxB interaction might then be followed by tests of the simple main effects of A at each level of B.
For each significant simple main effect of A, when there are more than two levels of A, one might want to conduct pairwise comparisons
or more complex contrasts among A's marginal means for the data at the specified level of B. Since the main effect of C does not
participate in any significant interactions, it can now be more simply interpreted -- if there are more than two levels of C, one might want
to conduct pairwise comparisons or more complex contrasts involving the marginal means of C.
In some situations one might be justified in interpreting main effects even when they do participate in significant interactions,
especially when those interactions are monotonic. For example, even though AxB is significant, if the direction of the effect of A is the
same at all levels of B, there may be some merit in talking about the main effect of A, ignoring B.
The most important thing to keep in mind is that the contrasts that are made (interactions, simple interactions, main effects,
simple main effects, contrasts involving marginal means, and so on) should be contrasts that help you answer questions of interest about
the data. My presentation here has been rather abstract, treating A, B, and C as generic factors. When A, B, and C are particular
variables, the recommendations given here may or may not make good sense. When they do not make good sense, do not follow them
-- make the comparisons that do make good sense!
B. J. White, graduate student in PSYC 6431 in the Spring of 2002, prepared the following ANOVA Flow Chart based on the
generic recommendations made above. Thanks, B.J.
LeastSq.doc
Least Squares Analyses of Variance and Covariance
One-Way ANOVA
Read Sections 1 and 2 in Chapter 16 of Howell. Run the program ANOVA1-LS.sas, which can be found on my SAS programs page. The data here are from Table 16.1 of Howell.
Dummy Variable Coding. Look at the values of X1-X3 in the data in the Data
Dummy section of the program file. X1 codes whether or not an observation is from
Group 1 (0 = no, 1 = yes), X2 whether or not it is from Group 2, and X3 whether or not it
is from Group 3. Only k-1 (4-1) dummy variables are needed, since an observation that is not in any of the first k-1 groups must be in the kth group. The dummy variable coding matrix is thus:
Group X1 X2 X3
1 1 0 0
2 0 1 0
3 0 0 1
4 0 0 0
For each dummy variable the partial coefficients represent a contrast between its group and the reference group (the one coded with all 0s); that is, X1's partials code Group 1 vs. Group 4, X2 codes Group 2 vs. Group 4, and X3 codes Group 3 vs. Group 4.
Look at the correlations among the Xs and note that with equal ns the
off-diagonal correlations are constant. Now look at the output from the regression
analysis. Note that the omnibus F of 4.455 is the same that would be obtained from a
traditional ANOVA. Also note the following about the partial statistics:
The intercept is the mean of the reference group.
For each X the b is the difference between its group's mean and the mean of the reference group. For example, the b for X1 is the mean for Group 1 minus the mean for Group 4, (8 - 6.33) = 1.67.
Do note that only Group 3 differs significantly from the reference group.
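These properties of dummy-variable coding are easy to verify with a small regression; a sketch with made-up scores (not Howell's data):

```python
import numpy as np

# Hypothetical scores for four groups; Group 4 is the reference group
groups = {1: [7, 8, 9], 2: [4, 5, 6], 3: [1, 2, 3], 4: [5, 6, 7]}

# Design matrix: intercept plus dummy variables X1-X3 (Group 4 coded 0, 0, 0)
rows, y = [], []
for g, scores in groups.items():
    for score in scores:
        rows.append([1, int(g == 1), int(g == 2), int(g == 3)])
        y.append(score)

b = np.linalg.lstsq(np.array(rows, float), np.array(y, float), rcond=None)[0]
# b[0] is the reference group's mean (6); each remaining coefficient is that
# group's mean minus the reference mean: 8-6 = 2, 5-6 = -1, 2-6 = -4
print(b.round(3))
```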
The F for any effect can be obtained by comparing the full model with a reduced model from which the terms coding that effect have been deleted:

F = [(R²_full - R²_reduced)/(f - r)] / [(1 - R²_full)/(N - f - 1)],

where f and r are the numbers of predictors in the full and reduced models.
The Model: B output is for a reduced model with the three terms coding the main effect of B deleted. You find the SS and η² for B by subtracting the appropriate reduced model statistics from the full model statistics. The Model: A output is for a reduced model with the one term coding the main effect of A deleted. Use this output to obtain the SS and η² for the main effect of A. Construct a source table and then compare the output of PROC ANOVA with the source table you obtained by comparing reduced effects-coded models with the full effects-coded model. The CLASS statement in PROC ANOVA and PROC GLM simply tells SAS which independent variables need to be dummy coded.
Nonorthogonal Analysis. ANOV2-LS-UnEq.sas uses the unequal ns data from
Table 16.5 of Howell. The coding scheme is the same as in the previous analysis.
Obtain sums-of-squares for A, B, and AxB in the same way as you did in the previous
analysis and you will have done an Overall and Spiegel Method I analysis. Do note that
the sums-of-squares do not sum to the total SS, since we have excluded variance that is
ambiguous. Each effect is partialled for every other effect. If you will compare your
results from such an analysis with those provided by the TYPE III SS computed by
PROC GLM you will see that they are identical.
Analysis of Covariance
Read Sections 16.5 through 16.11 in Howell and Chapter 6 in Tabachnick and
Fidell. As explained there, the ANCOV is simply a least-squares ANOVA where the
covariate or covariates are entered into the model prior to the categorical independent
variables so that the effect of each categorical independent variable is adjusted for the
covariate(s). Do note the additional assumptions involved in ANCOV (that each covariate has a linear relationship with the outcome variable and that the slope for that relationship does not change across levels of the categorical predictor variable(s)).
Carefully read Howell's cautions about interpreting analyses of covariance when subjects have not been randomly assigned to treatment groups. Run the programs ANCOV1.sas and ANCOV2.sas.
One-Way ANCOV. I am not going to burden you with doing ANCOV with PROC REG; I think you already have the basic idea of least-squares analyses mastered. Look at
ANCOV1.sas and its output. These data were obtained from Figure 2 in the article,
"Relationships among models of salary bias," by M. H. Birnbaum (1985, American
Psychologist, pp. 862-866) and are said to be representative of data obtained in various
studies of sex bias in faculty salaries. I did double the sample size from that displayed in
the plot from which I harvested the data. We can imagine that we have data from three different departments' faculty members: the professor's Gender (1 = male, 2 = female), an objective measure of the professor's QUALIFICations (a composite of things like number of publications, ratings of instruction, etc.), and SALARY (in thousands of 1985 dollars).
The data are plotted, using the symbol for gender as the plotting symbol. The plot
suggests three lines, one for each department (salaries being highest in the business
department and lowest in the sociology department), but that is not our primary interest.
Do note that salaries go up as qualifications go up. Also note that the Ms tend to be
plotted higher and more to the right than the Fs.
PROC ANOVA does two simple ANOVAs, one on the qualifications data (later to
be used as a covariate) and one on the salary data. Both are significant. This is going to
make the interpretation of the ANCOV difficult, since we will be adjusting group means
on the salary variable to remove the effect of the qualifications variable (the covariate),
but the groups differ on both. The interpretation would be more straightforward if the
groups did not differ on the covariate, in which case adjusting for the covariate would
simply reduce the error term, providing for a more powerful analysis. The error SS (1789.3) from the analysis on the covariate is that which Howell calls SS_e(c) when discussing making comparisons between pairs of adjusted means.
The first invocation of PROC GLM is used to test the homogeneity of regression
assumption. PROC ANOVA does not allow any continuous effects (such as a
continuous covariate). The model statement includes (when the bar notation is
expanded) the interaction term, Qualific*Gender. Some computing time is saved by
asking for only sequential (SS1) sums of squares. Were Qualific*Gender significant, we
would have a significant violation of the homogeneity of regression assumption (the
slopes of the lines for predicting salary from qualifications would differ significantly
between genders), which would, I opine, be a very interesting finding in its own right.
The second invocation of PROC GLM is used to obtain the slopes for predicting salary from qualifications within each level of Gender: QUALIFIC(GENDER). We
already know that these two slopes do not differ significantly, but I do find it interesting
that the slope for the male faculty is higher than that for the female faculty.
The third invocation of PROC GLM is used to do the Analysis of Covariance. The
correlation between salary and qualifications is significant (Type I p < .0001 -- evaluating
qualifications unadjusted for gender) and the genders do differ significantly after
adjusting for qualifications. The Estimate given for qualifications is the common (across
genders) slope used to adjust salary scores in both groups. The LSMEANS are
estimates of what the group means would be if the groups did not differ on qualifications.
If you have more than two groups, you will probably want to use the PDIFF option, for
example, LSMEANS GROUP / PDIFF. The matrix of p-values produced with the PDIFF option is for pairwise comparisons between adjusted means (with no adjustment of per-comparison alpha). You can adjust the alpha-criterion downwards (Bonferroni,
Sidak) if you are worried about familywise error rates.
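The adjustment behind LSMEANS can be sketched directly; a minimal example with hypothetical numbers (not the salary data):

```python
def adjusted_mean(group_dv_mean, group_cov_mean, grand_cov_mean, slope):
    """Covariate-adjusted (least-squares) mean for one group: slide the group's
    DV mean along the common within-group slope to the grand covariate mean."""
    return group_dv_mean - slope * (group_cov_mean - grand_cov_mean)

# Hypothetical group: DV mean 60, covariate mean 12; grand covariate mean 10;
# common within-group slope 2.5
print(adjusted_mean(60, 12, 10, 2.5))  # 55.0
```

Because this group is above average on the covariate, its mean is adjusted downward.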
We can estimate the magnitude of effect of gender with an eta-squared statistic, the ratio of the gender sum of squares to the total sum of squares, 268.364 / 3537 = .076. This is equivalent to the increase in R² when we add gender to a model for predicting salary from the covariate(s). The Proc Corr shows that r for predicting salary from qualifications is .60193. Proc GLM shows that the R² for predicting salary from qualifications and gender is .438189. Accordingly, eta-squared = .438189 - .60193² = .076. If men and women were equally qualified, 7.6% of the differences in salaries would
be explained by gender. Look back at the ANOVA comparing the genders on salary.
The eta-squared there was .306. If we ignore qualifications, 30.6% of the differences in
salaries is explained by gender (which is confounded with qualifications and other
unknown variables).
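The R²-increment arithmetic is easy to verify:

```python
r_cov = 0.60193      # r: salary predicted from qualifications alone (Proc Corr)
r2_full = 0.438189   # R-squared: salary from qualifications plus gender (Proc GLM)

eta_sq = r2_full - r_cov ** 2   # increase in R-squared when gender is added
print(round(eta_sq, 3))  # 0.076
```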
We could also estimate the magnitude of the effect with the standardized difference d, the difference between the adjusted group means divided by the pooled standard deviation.
Our results indicate that even when we statistically adjust for differences in
qualifications, men receive a salary significantly higher than that of women. This would
seem to be pretty good evidence of bias against women, but will the results look the
same if we view them from a different perspective? Look at the last invocation of PROC
GLM. Here we compared the genders on qualifications after removing the effect of
salary. The results indicate that when we equate the groups on salary the mean
qualifications of the men is significantly greater than that of the women. That looks like
bias too, but in the opposite direction. ANCOV is a slippery thing, especially when
dealing with data from a confounded design where the covariate is correlated not only
with the dependent variable but with the independent variable as well.
Two-Way ANCOV. Look at ANCOV2.sas and its output. The data are from Table
16.11 in Howell. The program is a straightforward extension of ANCOV1.sas to a
two-way design. First PROC ANOVA is used to evaluate treatment effects on the covariate (DISTRACT) and on the dependent variable (ERRORS). Were the design
unbalanced (unequal ns) you would need to use PROC GLM with Type III
sums-of-squares here. The model SS, 6051.8, from the ANOVA on the covariate is the SS_cells(c) from Howell's discussion of comparing adjusted means. The error SS from the same analysis, 54,285.8, is Howell's SS_e(c), and the SS_Smoke, 730, is Howell's SS_g(c).
PROC GLM is first used to test homogeneity of regression within cells and treatments. The DISTRACT*TASK F tests the null hypothesis that the slope for predicting ERRORS from DISTRACT is the same for all three types of task. The DISTRACT*SMOKE F tests slopes across smoking groups. The DISTRACT*TASK*SMOKE F tests the null that the slope is the same in every cell of the two-way design. Howell did not extract the DISTRACT*TASK and DISTRACT*SMOKE terms from the error term and he did not test them, although in the third edition of his text (p. 562) he admitted that a good case could be made for testing those effects (he wanted their 3 df in the error term). Our analysis indicates that we have no problem with heterogeneity of regression across cells, but notice that there is heterogeneity of regression across tasks and across smoking groups. PROC GLM is next used to obtain
the slopes for each cell. Ignore the "Biased" estimates for within-treatment slopes. Although these slopes do not differ enough across cells to produce significant heterogeneity of regression, inspection of the slopes shows why the DISTRACT*TASK effect was significant. Look at how high the slopes are for the cognitive task as compared to the other two tasks. Clearly the number of errors increased more rapidly with participants' level of distractibility with the cognitive task than with the other tasks,
especially for those nicotine junkies who had been deprived of their drug. You can also
see the (smaller) DISTRACT*SMOKE effect, with the slopes for the delayed smokers
(smokers who had not had a smoke in three hours) being larger than for the other
participants.
The next GLM does the ANCOV. Note that DISTRACT is significantly correlated with
ERRORS (p < .001, Type I SS). Remember that the Type I SS reported here does not
adjust the first term in the model (the covariate) for the later terms in the model. Howell
prefers to adjust the covariate for the other effects in the model, so he uses SPSS
unique (same as SAS Type III) SS to test the covariate. The common slope used to
adjust scores is 0.2925. TASK, SMOKE, and TASK*SMOKE all have significant effects
after we adjust for the covariate (Type III SS). Since the interaction is significant, we
need to do some simple main effects analyses. Look first at the adjusted cell means. If
you look at Figure 16.5 in Howell, you will see the interaction quite clearly. The effect of
smoking group is clearly greater with the cognitive task than with the other two tasks (for
which the lines are nearly flat). The very large main effect of type of task is obvious in
that plot too, with errors being much more likely with the cognitive task than with the
other two tasks.
If we ignore the interaction and look at the comparisons between marginal means
(using the PDIFF output, and not worrying about familywise error), we see that, for the
type of task variable, there were significantly more errors with the cognitive task than with
the other two types of tasks. On the smoking variable, we see that the nonsmokers
made significantly fewer errors than did those in the two groups of smokers.
The simple main effects analysis done with the data from the pattern recognition
task shows that the smoking groups did not differ significantly. The Type I SS for SMOKE gives us a test of the effect of smoking history ignoring the covariate, while the Type III SS gives us the test after adjusting for the covariate (an ANCOVA). The slope used to
adjust the scores on the pattern recognition test is 0.085, notably less than the 0.293
used in the factorial ANCOV.
When we look at the analysis of the data from the cognitive task, we see that the
smoking groups differ significantly whether we adjust for the covariate or not. The
nonsmokers made significantly fewer errors than did the participants in both the smoking
groups. Notice that the small difference between the two smoking groups using the
means as adjusted in the factorial analysis virtually disappears when using the
adjustment from this analysis, where the slope used for adjusting scores (0.537) is
notably more than it was with the factorial ANCOV or with the other two tasks. This is
due to the DISTRACT*TASK interaction which Howell chose to ignore, but we detected.
Finally, with the driving task, we see that the smoking groups differ significantly,
with the active smokers making significantly fewer errors than did the delayed smokers
and the nonsmokers. I guess the stimulant properties of nicotine are of some value
when driving.
Controlling Familywise Error When Using PDIFF
If the comparisons being made involve only three means, I recommend Fisher's procedure: that is, do not adjust the p values, but require that the main effect be statistically significant; if it is not, none of the pairwise differences are significant. If the
comparisons involve more than three means, you can tell SAS to adjust the p values to
control familywise error. For example, LSMEANS smoke / PDIFF ADJUST=TUKEY; would
apply a Tukey adjustment. Other adjustments available include BONferroni, SIDAK,
DUNNETT, and SCHEFFE.
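The two simplest of these adjustments can be sketched in a few lines; the formulas below are the usual Bonferroni (p times k, capped at 1) and Sidak (1 - (1 - p)^k) corrections for k comparisons:

```python
def bonferroni(p, k):
    """Bonferroni-adjusted p for one of k comparisons (capped at 1)."""
    return min(1.0, p * k)

def sidak(p, k):
    """Sidak-adjusted p for one of k comparisons."""
    return 1.0 - (1.0 - p) ** k

# Three pairwise comparisons among three means, raw p = .02
print(round(bonferroni(0.02, 3), 4))  # 0.06
print(round(sidak(0.02, 3), 4))       # 0.0588
```

Sidak is slightly less conservative than Bonferroni, as the example shows.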
References and Recommended Readings
Birnbaum, M. H. (1985). Relationships among models of salary bias. American Psychologist, 40, 862-866.
Howell, D. C. (2010). Statistical methods for psychology (7th ed.). Belmont, CA: Cengage Wadsworth. ISBN-10: 0-495-59784-8. ISBN-13: 978-0-495-59784-1.
Huck, S. W., & McLean, R. A. (1975). Using a repeated measures ANOVA to analyze the data from a pretest-posttest design: A potentially confusing task. Psychological Bulletin, 82, 511-518.
Maxwell, S. E., Delaney, H. D., & Dill. (1984). Another look at ANCOVA versus blocking. Psychological Bulletin, 95, 136-147.
Maxwell, S. E., Delaney, H. D., & Manheimer, J. M. (1985). ANOVA of residuals and ANCOVA: Correcting an illusion by using model comparisons and graphs. Journal of Educational and Behavioral Statistics, 10, 197-209. doi: 10.3102/10769986010003197
Rausch, J. R., Maxwell, S. E., & Kelley, K. (2003). Analytic methods for questions pertaining to a randomized pretest, posttest, follow-up design. Journal of Clinical Child and Adolescent Psychology, 32, 467-486.
Tabachnick, B. G., & Fidell, L. S. (2007). Using multivariate statistics (5th ed.). Boston: Allyn & Bacon. ISBN-10: 0205459382. ISBN-13: 9780205459384.
Example of Presentation of Results from One-Way ANCOV
The Pretest-Posttest x Groups Design: How to Analyze the Data
Matching and ANCOV with Confounded Variables
Copyright 2010 Karl L. Wuensch - All rights reserved.
Pretest-Posttest-ANCOV.doc
The Pretest-Posttest x Groups Design: How to Analyze the Data
You could ignore the pretest scores and simply compare the groups on the
posttest scores, but there is probably a good reason you collected the pretest scores in
the first place (such as a desire to enhance power), so I'll dismiss that option.
To illustrate the analyses I shall use the AirportSearch data, available at
http://core.ecu.edu/psyc/wuenschk/SPSS/SPSS-Data.htm . Do see the Description of
the data.
Mixed Factorial ANOVA
Treat the Pretest-Posttest contrast as a within-subjects factor and the groups as a
between-subjects factor. Since the within-subjects factor has only one degree of
freedom, the multivariate-approach results will be identical to the univariate-approach
results and sphericity will not be an issue.
Here is SPSS syntax and output.
GLM post pre BY race
/WSFACTOR=PostPre 2 Polynomial
/METHOD=SSTYPE(3)
/CRITERIA=ALPHA(.05)
/WSDESIGN=PostPre
/DESIGN=race.
Tests of Within-Subjects Contrasts
Measure:MEASURE_1
Source PostPre
Type III Sum of
Squares df Mean Square F Sig.
PostPre Linear 288.364 1 288.364 84.676 .000
PostPre * race Linear 76.364 1 76.364 22.424 .000
Error(PostPre) Linear 180.491 53 3.405
Tests of Between-Subjects Effects
Measure:MEASURE_1
Transformed Variable:Average
Source
Type III Sum of
Squares df Mean Square F Sig.
Intercept 1330.135 1 1330.135 307.006 .000
race 254.063 1 254.063 58.640 .000
Error 229.628 53 4.333
It is the interaction term that is of most interest in this analysis, and it is
significant. It indicates that the pre-post difference is not the same for Arab travelers as
it is for Caucasian travelers. To further investigate the interaction, one can compare the
groups on pretest only and posttest only and/or compare posttest with pretest
separately for the two groups. I'll do the latter here.
SORT CASES BY race.
SPLIT FILE SEPARATE BY race.
T-TEST PAIRS=post WITH pre (PAIRED)
/CRITERIA=CI(.9500)
/MISSING=ANALYSIS.
Arab Travelers
Paired Samples Statistics
a
Mean N Std. Deviation Std. Error Mean
Pair 1 Post-9-11 7.67 21 3.512 .766
Pre-9-11 2.62 21 1.161 .253
a. race = Arab
Paired Samples Correlations
a
N Correlation Sig.
Pair 1 Post-9-11 & Pre-9-11 21 .065 .778
a. race = Arab
Paired Samples Test
a
Paired Differences
95% Confidence Interval of the
Difference
t df Sig. (2-tailed)
Lower Upper
Pair 1 Post-9-11 - Pre-9-11 3.397 6.698 6.379 20 .000
a. race = Arab
As you can see, the pre-post difference was significant for the Arab travelers.
Caucasian Travelers
Paired Samples Statistics
a
Mean N Std. Deviation Std. Error Mean
Pair 1 Post-9-11 2.82 34 1.290 .221
Pre-9-11 1.21 34 1.572 .270
a. race = Caucasian
Paired Samples Correlations
a
N Correlation Sig.
Pair 1 Post-9-11 & Pre-9-11 34 .287 .099
a. race = Caucasian
Paired Samples Test
a
Paired Differences
95% Confidence Interval of the
Difference
t df Sig. (2-tailed)
Lower Upper
Pair 1 Post-9-11 - Pre-9-11 1.016 2.219 5.473 33 .000
a. race = Caucasian
As you can see, the (smaller) pre-post difference is significant for Caucasian
travelers too. We conclude that both groups were searched more often after 9/11, but
that the increase in searches was greater for Arab travelers than for Caucasian
travelers. I should also note that the pre-post correlations are not very impressive here.
Simple Comparison of the Difference Scores
We could simply compute Post minus Pre difference scores and then compare
the two groups on those difference scores. Here is the output from exactly such a
comparison.
COMPUTE Diff=post-pre.
VARIABLE LABELS Diff 'Post Minus Pre'.
EXECUTE.
Group Statistics
race N Mean Std. Deviation Std. Error Mean
Post Minus Pre Arab 21 5.0476 3.62596 .79125
Caucasian 34 1.6176 1.72354 .29558
Independent Samples Test
t-test for Equality of Means
t df Sig. (2-tailed)
Post Minus Pre Equal variances assumed 4.735 53 .000
Equal variances not assumed 4.061 25.669 .000
The increase in number of searches is significantly greater for Arab travelers
than for Caucasian travelers. Notice that the value of t is 4.735 on 53 df. An ANOVA
on the same contrast would yield an F with one df in the numerator and the same error
df. The value of F would be the square of the value of t. When you square our t here
you get F(1, 53) = 22.42. The one-tailed p for this F is identical to the two-tailed p for
the t. Yes, with ANOVA one properly employs a one-tailed p to evaluate a
nondirectional hypothesis. Look back at the source table for the mixed factorial
ANOVA. Notice that the F(1, 53) for the interaction term is 22.42. Is this mere
coincidence? No, it is a demonstration that a t test on difference scores is
absolutely equivalent to the test of the interaction term in a 2 x 2 mixed factorial
ANOVA. Many folks find this hard to believe, but it is easy to demonstrate, as I have
done above. Try it with any other Pre-Post x Two Groups design if you are not yet
convinced.
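The equivalence is easy to check numerically; a small simulation with hypothetical data, computing the interaction SS and its error term directly from the difference scores:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical pre/post scores for two independent groups
pre1, post1 = rng.normal(2, 1, 21), rng.normal(7, 3, 21)
pre2, post2 = rng.normal(1, 1, 34), rng.normal(3, 1, 34)
d1, d2 = post1 - pre1, post2 - pre2   # difference scores

# Pooled-variance t test comparing the groups on the difference scores
t, _ = stats.ttest_ind(d1, d2, equal_var=True)

# Interaction F from the 2 x 2 mixed ANOVA, computed from the same differences
n1, n2 = len(d1), len(d2)
grand = (d1.sum() + d2.sum()) / (n1 + n2)
ss_int = (n1 * (d1.mean() - grand) ** 2 + n2 * (d2.mean() - grand) ** 2) / 2
ss_err = (((d1 - d1.mean()) ** 2).sum() + ((d2 - d2.mean()) ** 2).sum()) / 2
F = ss_int / (ss_err / (n1 + n2 - 2))

print(np.isclose(t ** 2, F))  # True
```

Whatever data you feed it, the squared t equals the interaction F.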
The logical next step here would be to test, for each group, whether or not the
mean difference score differs significantly from zero. Here are such tests:
SORT CASES BY race.
SPLIT FILE SEPARATE BY race.
T-TEST
/TESTVAL=0
/MISSING=ANALYSIS
/VARIABLES=Diff
/CRITERIA=CI(.95).
One-Sample Test
a
Test Value = 0
t df Sig. (2-tailed) Mean Difference
95% Confidence Interval of the
Difference
Lower Upper
Post Minus Pre 6.379 20 .000 5.04762 3.3971 6.6981
a. race = Arab
One-Sample Test
a
Test Value = 0
t df Sig. (2-tailed) Mean Difference
95% Confidence Interval of the
Difference
Lower Upper
Post Minus Pre 5.473 33 .000 1.61765 1.0163 2.2190
a. race = Caucasian
Please do notice that the values of t, df, and p here are identical to those
obtained earlier with pre-post correlated samples t tests.
Analysis of Covariance
Here we treat the pretest scores as a covariate and the posttest scores as the outcome variable. Please note that this involves the assumption that the relationship
between pretest and posttest is linear and that the slope is identical in both groups.
These assumptions are easily evaluated. For example, the regression line for predicting posttest scores from pretest scores in the Arab group is PostTest = 7.148 + .198(PreTest), a slope of .198; for the Caucasian group it is PostTest = 2.539 + .236(PreTest), a slope of .236.
UNIANOVA post BY race WITH pre
/METHOD=SSTYPE(3)
/INTERCEPT=INCLUDE
/EMMEANS=TABLES(race) WITH(pre=MEAN)
/CRITERIA=ALPHA(.05)
/DESIGN=pre race.
Tests of Between-Subjects Effects
Dependent Variable:Post-9-11
Source
Type III Sum of
Squares df Mean Square F Sig.
Corrected Model 310.064
a
2 155.032 27.231 .000
Intercept 437.205 1 437.205 76.795 .000
pre 5.563 1 5.563 .977 .327
race 214.378 1 214.378 37.655 .000
Error 296.045 52 5.693
Total 1807.000 55
Corrected Total 606.109 54
a. R Squared = .512 (Adjusted R Squared = .493)
Estimated Marginal Means
race
Dependent Variable:Post-9-11
race Mean Std. Error
95% Confidence Interval
Lower Bound Upper Bound
Arab 7.469
a
.558 6.350 8.588
Caucasian 2.946
a
.427 2.088 3.803
a. Covariates appearing in the model are evaluated at the following
values: Pre-9-11 = 1.75.
We conclude that the two groups differ significantly on posttest scores after adjusting for the pretest scores. Notice that for these data the F for the effect of interest is larger with the ANCOV than in the mixed factorial ANOVA. In other words, the ANCOV had more power.
Which Analysis Should I Use?
If the difference scores are intrinsically meaningful (generally this will involve pretest and posttest both having been measured on the same metric), the simple comparison of the groups on mean difference scores is appealing and, as I have shown earlier, is mathematically identical to the mixed factorial. The ANCOV, however, generally has more power.
Huck and McLean (1975) addressed the issue of which type of analysis to use for the pretest-posttest control group design. They did assume that assignment to groups was random. They explained that it is the interaction term that is of interest if the mixed factorial ANOVA is employed and that a simple t test comparing the groups on pre-post difference scores is absolutely equivalent to such an ANOVA. They pointed out that the t test and the mixed factorial ANOVA are equivalent to an ANCOV (with pretest as covariate) if the linear correlation between pretest and posttest is positive and perfect. They argued that the ANCOV is preferred over the others due to the fact that it generally will have more power and can easily be adapted to resolve problems such as heterogeneity of regression (the groups differ with respect to slope for predicting the posttest from the pretest) and nonlinearity of the relationship between pretest and posttest.
Maxwell, Delaney, and Dill (1984) noted that under some conditions the ANCOV
is more powerful, under other conditions it is less powerful. Other points they made
include:
The mixed factorial ANOVA may employ data from a randomized blocks design
(where subjects have been matched/blocked up on one or more variables
thought to be well associated with the outcome variable) and then, within blocks,
randomly assigned to treatment groups, or it may employ data where no random
assignment was employed (as in my example, where subjects were not randomly
assigned a race/ethnicity). This matters. The randomized blocks design equates
the groups on the blocking variables.
If you can obtain scores on the concomitant variable (here the pretest) prior to
assigning subjects to groups, matching/blocking the subjects on that concomitant
variable and then randomly assigning subjects to treatment groups will enhance
power relative to ignoring the concomitant variable when assigning subjects to
treatment groups. Even with the randomized blocks design, one can use
ANCOV rather than mixed ANOVA for the analysis.
If the relationship between the concomitant variable and the outcome variable is
not linear, the ANCOV is problematic. You may want to consider transformations
to straighten up the (monotonic) nonlinear relationship or polynomial regression
analysis.
References and Recommended Readings
Huck, S. W., & McLean, R. A. (1975). Using a repeated measures ANOVA to
analyze the data from a pretest-posttest design: A potentially confusing task.
Psychological Bulletin, 82, 511-518.
Maxwell, S. E., Delaney, H. D., & Dill, C. A. (1984). Another look at ANCOVA
versus Blocking. Psychological Bulletin, 95, 136-147.
Rausch, J. R., Maxwell, S. E., & Kelley, K. (2003). Analytic methods for
questions pertaining to a randomized pretest, posttest, follow-up design. Journal
of Clinical Child and Adolescent Psychology, 32, 467-486.
Links
AERA-D Discussion: Don Burrill commenting on a study with pretest and
posttest but not random assignment.
Wuensch's Stats Lessons
o Least Squares Analyses of Variance and Covariance
Karl L. Wuensch
August, 2009
ws-anova.doc
An Introduction to Within-Subjects Analysis of Variance
In ANOVA a factor is either a between-subjects factor or a within-subjects factor. When the
factor is between-subjects the data are from independent samples, one sample of
outcome/dependent variable scores for each level of the factor. With such independent samples we
expect no correlation between the scores at any one level of the factor and those at any other level of
the factor. A within-subjects or repeated measures factor is one where we expect to have
correlated samples, because each subject is measured (on the dependent variable) at each level of
the factor.
The Data for this Lesson. An example of a within-subjects design is the migraine-headache
study described by Howell (Statistical Methods for Psychology, 6th ed., 2007, Table 14.3). The
dependent variable is duration of headaches (hours per week), measured five times. The
within-subjects factor is Weeks, when the measurement was taken, during the third or fourth week of
baseline recording (levels 1 and 2 of Week) or during the fourth, fifth, or sixth week of relaxation
training (levels 3, 4, and 5 of Week). The resulting five samples of scores are clearly not independent
samples; each is based on the same nine subjects. Since we expect the effect of individual differences
among subjects to exist across levels of the Week factor, we expect that the scores at each level of
Week will be positively correlated with those at each other level of Week; for example, we expect that
those who reported the greatest durations during the level 1 week will also tend to report the greatest
durations during the level 3 week.
Crossed and Nested Factors. When each subject is measured at each level of a factor we
say that Subjects is crossed with that factor. For our headache example, Subjects is crossed with
Week. Mathematically we treat Subjects as a factor, so we have a Week x Subjects factorial design
with only one score in each cell (each subject is measured once and only once at each level of Week).
In ANOVA each factor is either crossed with or nested within each other factor. When one
factor is nested within another then knowing the level of the nested factor tells you the level of the other
factor. The Subjects factor is said to be nested within between-subjects factors. For example, if I
randomly assigned ten subjects to each of five experimental groups I know that subjects 1-10 are at
level one of the between-subjects factor, 11-20 at level two, etc. If you ask me at what level of the
between-subjects factor is the score that is at level 35 of the Subjects factor, I can answer "four." If
the experimental factor were within-subjects (each subject tested in each of the five experimental
conditions) and you asked me, "This score is from subject number 5; at what level of the experimental
factor was it obtained?", I could not tell you.
Order Effects and Counterbalancing. Suppose that our within-subjects factor is not Week,
but rather some experimental manipulation, for example, the color of the computer screen (gray,
green, white, blue, or black) upon which I present the material the subject is to learn. Each subject is
tested with each color. A big problem with such a design is that the order of presentation of the
experimental conditions may confound the results. For example, were I to test each subject first with
the gray screen, then green, then white, then blue, and lastly black, the results (how well the subject
learned the material that was presented, such as a list of paired associates) might be contaminated by
practice effects (subjects get better at the task as time passes), fatigue effects (subjects get tired of
it all as time passes), and other such order effects. While one may ameliorate such problems by being
t = (M_i - M_j) / SQRT(MS_error(1/n_i + 1/n_j)), on 32 degrees of freedom, p < .01.
This is the same formula used for multiple comparisons involving a between-subjects factor,
except that the error MS is the interaction between Subjects and the Within-subjects factor. If you
wanted qs instead of ts (for example, doing a Student-Newman-Keuls analysis), you would just
multiply the obtained t by SQRT(2). For example, for Week 2 versus Week 3, t =
(22-9.33)/SQRT(7.2(1/9 + 1/9)) = 10.02, q = 10.02 * SQRT(2) = 14.16.
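The arithmetic for that comparison can be checked in a few lines of Python, using the rounded means (22 and 9.33), the MS error of 7.2, and n = 9 per week, as in the text:

```python
import math

# Pooled-error comparison of Week 2 versus Week 3; the error MS here is the
# Subjects x Week interaction MS (7.2 on 32 df).
ms_error, n = 7.2, 9
m2, m3 = 22.0, 9.33
t = (m2 - m3) / math.sqrt(ms_error * (1 / n + 1 / n))
q = t * math.sqrt(2)   # convert the t to a Studentized-range q
print(round(t, 2))     # 10.02
```

The q comes out a shade over 14.16; the small discrepancy from the text is just rounding of the means.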
Keppel (Design and Analysis, 1973, pages 408-421) recommends using individual rather than
pooled error terms and computes an F rather than a t. An individual error term estimates error
variance using only the scores for the two conditions being compared rather than all of the scores in all
conditions. Using Keppel's method on the Week 2 versus Week 3 data I obtained a contrast SS of 722
and error SS of 50, for an F(1, 8) = 722/6.25 = 115.52.
Within-Subjects Analysis with SAS
On Karl's SAS Programs page is the file WS-ANOVA.SAS; run it and save the program and
output. The data are within the program.
Univariate Data Format. The first data step has the data in a univariate setup. Notice that
there are 5 lines of data for each subject, one line for each week. The format is Subject #, Week #,
score on outcome variable, new line.
Here are the data as they appear on Dave Howell's web site:
Subject Wk1 Wk2 Wk3 Wk4 Wk5
1 21 22 8 6 6
2 20 19 10 4 4
3 17 15 5 4 5
4 25 30 13 12 17
5 30 27 13 8 6
6 19 27 8 7 4
7 26 16 5 2 5
8 17 18 8 1 5
9 26 24 14 8 9
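If you want to verify SAS's output by hand, the whole univariate partition can be reproduced from this table with standard-library Python. This sketch computes the Week, Subjects, and error (Week x Subjects) sums of squares and the omnibus F:

```python
# Univariate within-subjects partition for the migraine data above:
# a Week x Subjects layout with one score per cell.
data = [
    [21, 22,  8,  6,  6],
    [20, 19, 10,  4,  4],
    [17, 15,  5,  4,  5],
    [25, 30, 13, 12, 17],
    [30, 27, 13,  8,  6],
    [19, 27,  8,  7,  4],
    [26, 16,  5,  2,  5],
    [17, 18,  8,  1,  5],
    [26, 24, 14,  8,  9],
]
n, k = len(data), len(data[0])          # 9 subjects, 5 weeks
grand = sum(sum(row) for row in data)
cm = grand ** 2 / (n * k)               # correction for the mean
ss_total = sum(x * x for row in data for x in row) - cm
week_sums = [sum(row[j] for row in data) for j in range(k)]
ss_week = sum(s * s for s in week_sums) / n - cm
ss_subj = sum(sum(row) ** 2 for row in data) / k - cm
ss_error = ss_total - ss_week - ss_subj          # Week x Subjects interaction
f = (ss_week / (k - 1)) / (ss_error / ((n - 1) * (k - 1)))
print(round(f, 2))   # 85.04
```

The pieces match the values used later in this lesson: SS Weeks = 2449.2, SS error = 230.4, MS error = 7.2.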
The first invocation of PROC ANOVA does the analysis on the data in univariate setup.
proc anova; class subject week; model duration = subject week;
Since the model statement does not include the Subject x Week interaction, that interaction is
used as the error term, which is appropriate. We conclude that mean duration of headaches changed
significantly across the five weeks, F(4, 32) = 85.04, MSE = 7.2, p < .001.
Multivariate Data Format. The second data step has the data in multivariate format. There is
only one line of data for each subject: Subject number followed by outcome variable scores for each of
the five weeks.
Compare Week 2 with Week 3. The treatment started on the third week, so this would seem
to be an important contrast. The second ANOVA is a one-way within-subjects ANOVA using only the
Week 2 and Week 3 data.
proc anova; model week2 week3 = / nouni; repeated week 2 / nom;
The basic syntax for the model statement is this: On the left side list the variables and on the
right side list the groups (we have no groups). The nouni option stops SAS from reporting univariate
ANOVAs testing the null hypothesis that the population means for Week 2 and Week 3 are zero. The
repeated week 2 tells SAS that week is a repeated measures dimension with 2 levels. The nom
option stops SAS from reporting multivariate output.
Note that the F(1, 8) obtained is the 115.52 obtained earlier, by hand, using Keppel's method
(individual error terms).
proc means mean t prt; var d23 week1-week5;
In the data step I created a difference score, d23, coding the difference between Week 2 and
Week 3. The Means procedure provides a correlated t-test comparing Week 2 with Week 3 by testing
the null hypothesis that the appropriate difference-score has a mu of zero. Note that the square root of
the F just obtained equals this correlated t, 10.75. When doing pairwise comparisons Keppel's method
simplifies to a correlated t-test. I also obtained mean duration of headaches by week.
The easiest way to do pairwise comparisons for a within-subjects factor is to compute
difference-scores for each comparison and therefrom a correlated t for each comparison. If you
want to control familywise error rate (alpha), use the Bonferroni or the Sidak inequality to adjust
downwards your per-comparison alpha, or convert your ts into qs for procedures using the
Studentized range statistic, or square the ts to obtain Fs and use the Scheffe procedure to adjust the
critical F. The adjusted Scheffe critical F is simply (w-1) times the unadjusted critical F for the
within-subjects effect, where w is the number of levels of the within-subjects factor. If you want to do
Dunnett's test, just take the obtained correlated ts to Dunnett's table. Of course, all these methods
could also be applied to the ts computed with Howell's (pooled error) formula.
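Here is the difference-score shortcut applied to the Week 2 versus Week 3 comparison from Howell's data above; the square of the correlated t reproduces the individual-error-term F(1, 8):

```python
import math
import statistics

# Week 2 and Week 3 durations for the nine subjects (from the table above).
week2 = [22, 19, 15, 30, 27, 27, 16, 18, 24]
week3 = [8, 10, 5, 13, 13, 8, 5, 8, 14]
d = [a - b for a, b in zip(week2, week3)]   # difference scores
t = statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))
print(round(t, 2), round(t ** 2, 2))        # 10.75 115.52
```

The 115.52 is exactly Keppel's F with the individual error term, confirming that the pairwise contrast reduces to a correlated t test.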
proc anova;model week1-week5= / nouni;repeated week 5 profile / summary printe;
The final ANOVA in the SAS program does the overall within-subjects ANOVA. It also does a
profile analysis, comparing each mean with the next mean, with individual error terms. Notice that
data from all five weeks are used in this analysis. The profile and summary options cause SAS to
contrast each week's mean with the mean of the following week and report the results in ANOVA tables.
The printe option provides a test of sphericity (and a bunch of other stuff to ignore).
Under Sphericity Tests, Orthogonal Components you find Mauchly's test of sphericity.
Significance of this test would indicate that the sphericity assumption has been violated. We have no
such problem with these data.
Under MANOVA test criteria no week effect are the results of the multivariate analysis.
Under Univariate Tests of Hypotheses is the univariate-approach analysis. Notice that we get the
same F etc. that we got with the earlier analysis with the data in univariate format.
SAS also gives us values of epsilon for both the Greenhouse-Geisser correction and the
Huynh-Feldt correction. These are corrections for violation of the assumption of sphericity. When
one of these has a value of 1 or more and Mauchly's test of sphericity is not significant we clearly do
not need to make any correction. The G-G correction is more conservative (less power) than the H-F
correction. If both the G-G and the H-F epsilons are near or above .75, it is probably best to use the H-F.
If we were going to apply the G-G or H-F correction, we would multiply both numerator and
denominator degrees of freedom by epsilon. SAS provides three p values, one with no adjustment,
one with the G-G adjustment, and one with the H-F adjustment. If we had applied the G-G adjustment
here, we could report the results like this: A one-way, repeated measures ANOVA was employed to
evaluate the change in duration of headaches across the five weeks. Degrees of freedom were
adjusted according to Greenhouse and Geisser to correct for any violation of the assumption of
sphericity. Duration of headaches changed significantly across the weeks, F(2.7, 21.9) = 85.04, MSE =
7.2, p < .001.
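A sketch of that df adjustment; the epsilon value here (.684) is inferred from the reported F(2.7, 21.9) rather than read from SAS output, so treat it as an assumption:

```python
# Multiply both df of the unadjusted F(4, 32) by epsilon.
epsilon = 0.684        # assumed; read the actual G-G epsilon from the SAS output
df_num = 4 * epsilon
df_den = 32 * epsilon
print(round(df_num, 1), round(df_den, 1))   # 2.7 21.9
```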
Under Analysis of Variance of Contrast Variables are the results of the profile analysis. Look
at CONTRAST VARIABLE: WEEK.2; this is the contrast between Week 2 and Week 3. For some
reason that escapes me, SAS reports contrast and error SS and MS that are each twice those obtained
when I do the contrasts by hand or with separate ANOVAs in SAS, but the F, df, and p are identical to
those produced by other means, so that is not a big deal.
For Week 2 vs Week 3 the F(1, 8) reported in the final analysis is 1444/12.5 = 115.52. When
we made this same contrast with a separate ANOVA the F was computed as 722/6.25 = 115.52.
Same F, same outcome, but doubled MS treatment and error.
If we were going to modify the contrast results to use a pooled error term, we would need to be
careful computing the contrast F. For Week 2 versus Week 3 the correct numerator is 722, not 1444,
to obtain a pooled F(1, 32) = 722/7.2 = 100.28. Do note that taking the square root of this F gives
10.01, within rounding error of the pooled-error t computed with Howells method.
Multivariate versus Univariate Approach
Notice that when the data are in the multivariate layout, SAS gives us both a multivariate
approach analysis (Manova Test Criteria) and a univariate approach analysis (Univariate Tests). The
multivariate approach has the distinct advantage of not requiring a sphericity assumption. With the
univariate approach one can adjust the degrees of freedom by multiplying them by epsilon, to
correct for violation of the sphericity assumption. We shall cover the multivariate approach analysis in
much greater detail later.
Omnibus Effect Size Estimates
We have partitioned the total sum of squares into three components: Weeks, subjects, and the
Weeks x Subjects interaction (error). We could compute eta-squared by dividing the sum of squares
for weeks by the total sum of squares. That would yield 2449.2/3166.3 = .774. An alternative is
partial eta-squared, in which the sum of squares for subjects is removed from the denominator. That
is,
partial eta-squared = SS_Conditions / (SS_Conditions + SS_Error) = 2449.2 / (2449.2 + 230.4) = .914.
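Both estimates follow directly from the sums of squares reported above (the subjects SS, 486.7, is obtained here by subtraction from the total):

```python
# Omnibus effect sizes for the within-subjects analysis of the migraine data.
ss_weeks, ss_subjects, ss_error = 2449.2, 486.7, 230.4
ss_total = ss_weeks + ss_subjects + ss_error        # 3166.3
eta_sq = ss_weeks / ss_total                        # classical eta-squared
partial_eta_sq = ss_weeks / (ss_weeks + ss_error)   # subjects SS removed
print(round(eta_sq, 3), round(partial_eta_sq, 3))   # 0.774 0.914
```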
Factorial ANOVA With One or More Within-Subjects Factors:
The Univariate Approach
AxBxS Two-Way Repeated Measures
CLASS A B S; MODEL Y=A|B|S;
TEST H=A E=AS;
TEST H=B E=BS;
TEST H=AB E=ABS;
MEANS A|B;
Ax(BxS) Mixed (B Repeated)
CLASS A B S; MODEL Y=A|B|S(A);
TEST H=A E=S(A);
TEST H=B AB E=BS(A);
MEANS A|B;
AxBx(CxS) Three-Way Mixed (C Repeated)
CLASS A B C S; MODEL Y=A|B|C|S(A B);
TEST H=A B AB E=S(A B);
TEST H=C AC BC ABC E=CS(A B);
MEANS A|B|C;
Ax(BxCxS) Mixed (B and C Repeated)
CLASS A B C S; MODEL Y=A|B|C|S(A);
TEST H=A E=S(A);
TEST H=B AB E=BS(A);
TEST H=C AC E=CS(A);
TEST H=BC ABC E=BCS(A);
MEANS A|B|C;
AxBxCxS All Within
CLASS A B C S; MODEL Y=A|B|C|S;
TEST H=A E=AS;
TEST H=B E=BS;
TEST H=C E=CS;
TEST H=AB E=ABS;
TEST H=AC E=ACS;
TEST H=BC E=BCS;
TEST H=ABC E=ABCS;
MEANS A|B|C;
Higher-Order Mixed or Repeated Model
Expand as needed, extrapolating from the above. Here is a general rule for finding the error term
for an effect: If the effect contains only between-subjects factors, the error term is
Subjects(nested within one or more factors). For any effect that includes one or more within-
subjects factors the error term is the interaction between Subjects and those one or more
within-subjects factors.
Copyright 2008, Karl L. Wuensch - All rights reserved.
MAN_RM1.doc
The Multivariate Approach to the One-Way Repeated Measures ANOVA
Analyses of variance which have one or more repeated measures/within
subjects factors have a SPHERICITY ASSUMPTION (the standard error of the
difference between pairs of means is constant across all pairs of means at one level of
the repeated factor versus another level of the repeated factor). Howell discusses
compound symmetry, a somewhat more restrictive assumption. There are
adjustments (of degrees of freedom) to correct for violation of the sphericity
assumption, but at a cost of lower power. A better solution might be a multivariate
approach to repeated measures designs, which does not have such a sphericity
assumption.
Consider the first experiment in Karl Wuensch's doctoral dissertation (see the
article, Fostering house mice onto rats and deer mice: Effects on response to species odors, Animal
Learning and Behavior, 20, 253-258). Wild-strain house mice were at birth
cross-fostered onto house-mouse (Mus), deer mouse (Peromyscus) or rat (Rattus)
nursing mothers. Ten days after weaning each subject was tested in an apparatus
that allowed it to enter tunnels scented with clean pine shavings or with shavings
bearing the scent of Mus, Peromyscus, or Rattus. One of the variables measured was
how long each subject spent in each of the four tunnels during a twenty minute test.
The data are in the file TUNNEL4b.DAT and a program to do the analysis in
MAN_RM1.SAS, both available on my web pages. Run the program. Time spent in
each tunnel is coded in variables T_clean, T_Mus, T_Pero, and T_Rat. TT_clean,
TT_Mus, TT_Pero, and TT_Rat are these same variables after a square root
transformation to normalize the within-cell distributions and to reduce heterogeneity of
variance.
proc anova; model TT_clean TT_mus TT_pero TT_rat = / nouni;
repeated scent 4 contrast(1) / summary printe;
proc means; var T_clean -- T_Rat;
Note that PROC ANOVA includes no CLASS statement and the MODEL
statement includes no grouping variable (since we have no between subjects factor).
The model statement does identify the multiple dependent variables, TT_clean,
TT_Mus, TT_Pero, and TT_Rat, and includes the NOUNI option to suppress irrelevant
output. The REPEATED statement indicates that we want a repeated measures
analysis, with SCENT being the name we give to the 4-level repeated factor
represented by the four transformed time variates. CONTRAST(1) indicates that
these four variates are to be transformed into three sets of difference scores, each
representing the difference between the subject's score on the 1st variate (tt_clean)
and one of the other variates; that is, clean versus Mus, clean versus Peromyscus,
and clean versus Rattus. I chose clean as the comparison variable for all others.
The noncentrality parameter is λ = n·m·f²/(1 - ρ), but G*Power is set up for us to
enter as Effect size f² the quantity m·f²/(1 - ρ) = 3(.01)/(1 - .79) = .143.
Boot up G*Power. Click Tests, Other F-Tests. Enter Effect size f² = 0.143,
Alpha = 0.05, N = 64, Numerator df = 2, and Denominator df = 128. Click Calculate.
G*Power shows that power = .7677.
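The conversion can be wrapped in a small helper; here m is the number of levels of the repeated factor, f2 the conventional Cohen effect size f², rho the mean correlation among the repeated measures, and epsilon the sphericity correction (1 when none is applied). The function name is mine, not G*Power's:

```python
def gpower_f2(m, f2, rho, epsilon=1.0):
    """Effect size f^2 as this old G*Power routine expects it."""
    return epsilon * m * f2 / (1 - rho)

print(round(gpower_f2(3, 0.01, 0.79), 3))                  # 0.143
# With an epsilon correction of .5 and rho = .45 (values used below):
print(round(gpower_f2(3, 0.0625, 0.45, epsilon=0.5), 2))   # 0.17
```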
Note that I have used, as my estimate of ρ, the mean of the three values observed
by Sheri. This may, or may not, be reasonable. Uncorrected, her numerator df = 2 and
her denominator df = 72. Corrected with epsilon, her Effect size f² = .0275, numerator
df = 1, and denominator df = 36. I enter these values into G*Power and obtain power
= .1625. Sheri needs more data, or needs to hope for a larger effect size. If she
assumes a medium sized effect, then her Effect size f² = ε·m·f²/(1 - ρ) =
.5(3)(.0625)/(1 - .45) = .17 and power jumps to .67.
The big problem here is the small value of ρ in Sheri's data; she is going to
need more data to get good power. With typical repeated measures data, ρ is larger,
and we can do well with relatively small sample sizes.
Multivariate Approach
The multivariate approach analysis does not require sphericity, and, when
sphericity is lacking, is usually more powerful than is the univariate analysis with
Greenhouse-Geisser or Huynh-Feldt correction. Refer to the G*Power online
instructions, Other F-Tests, Repeated Measures, Multivariate approach.
Since there are three groups, the numerator df = 2. The denominator df = n - p + 1,
where n is the number of cases and p is the number of dependent variables in the
MANOVA (one less than the number of levels of the repeated factor). For Sheri,
denominator df = 36 - 2 + 1 = 35.
For a small effect size, we need Effect size f² = 3(.01)/(1 - .45) = .055. As you
can see, G*Power tells me power is .2083, a little better than it was with the univariate
test corrected for lack of sphericity.
So, how many cases would Sheri need to raise her power to .80? This G*Power
routine will not solve for N directly, so you need to guess until you get it right. On each
guess you need to change the input values of N and denominator df. After a few
guesses, I found that Sheri needs 178 cases to get 80% power to detect a small effect.
A Simpler Approach
Ultimately, in most cases, one's primary interest is going to be focused on
comparisons between pairs of means. Why not just find the number of cases necessary
to have the desired power for those comparisons? With repeated measures designs I
generally avoid using a pooled error term for such comparisons. In other words, I use
simple correlated t tests for each such comparison. How many cases would Sheri
need to have an 80% chance of detecting a small effect, d = .2?
First we adjust the value of d to take into account the value of ρ. I shall use her
weakest link, the correlation of .27: d_Diff = d / SQRT(2(1 - ρ12)) =
.2 / SQRT(2(1 - .27)) = .166. Notice that the value of d went down after adjustment.
Usually ρ will exceed .5 and the adjusted d will be greater than the unadjusted d.
The approximate sample size needed is n = (2.8 / d_Diff)² = (2.8/.166)² = 285. I
checked this with G*Power. Click Tests, Other T tests. For Effect size f enter .166.
N = 285 and df = n - 1 = 284. Select two-tailed. Click Calculate. G*Power confirms
that power = 80%.
When N is small, G*Power will show that you need a larger N than indicated by
the approximation. Just feed values of N and df to G*Power until you find the N that
gives you the desired power.
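The two-step approximation just described is easy to script; note that using the rounded d_Diff of .166, as the text does, gives the reported N of 285:

```python
import math

# Adjust d for the correlation between the paired scores, then approximate the
# N needed for 80% power (two-tailed, alpha = .05) with the 2.8 rule of thumb.
d, rho = 0.2, 0.27
d_diff = d / math.sqrt(2 * (1 - rho))          # about .166
n = (2.8 / round(d_diff, 3)) ** 2              # about 284.5; round up
print(round(d_diff, 3), math.ceil(n))          # 0.166 285
```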
Copyright 2005, Karl L. Wuensch - All rights reserved.
Two x Two Within-Subject ANOVA Interaction = Correlated t on Difference Scores
Petra Schweinhardt at McGill University is planning research involving a 2 x 2
within-subjects ANOVA. Each case has four measurements, PostX1, PreX1, PostX2
and PreX2. X1 and X2 are different experimental treatments, Pre is the dependent
variable measured prior to administration of X, and Post is the dependent variable
following administration of X. Order effects are controlled by counterbalancing.
Petra wants to determine how many cases she needs to have adequate power to
detect the Time (Post versus Pre) by X (1 versus 2) interaction. She is using G*Power
3.1, and it is not obvious to her (or to me) how to do this. She suggested ANOVA,
Repeated Measures, within factors, but I think some tweaking would be necessary.
My first thought is that the interaction term in such a 2 x 2 ANOVA might be
equivalent to a t test between difference scores (I know for sure this is the case with
independent samples). To test this hunch, I contrived this data set:
Diff1 and Diff2 are Post-Pre difference scores for X1 and X2. Next I conducted
the usual 2 x 2 within-subjects ANOVA with these data:
COMPUTE Diff1=PostX1-PreX1.
EXECUTE.
COMPUTE Diff2=PostX2-PreX2.
EXECUTE.
GLM PostX1 PostX2 PreX1 PreX2
/WSFACTOR=Time 2 Polynomial X 2 Polynomial
/METHOD=SSTYPE(3)
/CRITERIA=ALPHA(.05)
/WSDESIGN=Time X Time*X.
Source Type III Sum of Squares df Mean Square F Sig.
Time 45.375 1 45.375 24.200 .004
Error(Time) 9.375 5 1.875
X 5.042 1 5.042 14.756 .012
Error(X) 1.708 5 .342
Time * X 1.042 1 1.042 1.404 .289
Error(Time*X) 3.708 5 .742
Next I conducted a correlated t test comparing the difference scores.
T-TEST PAIRS=Diff1 WITH Diff2 (PAIRED)
/CRITERIA=CI(.9500)
/MISSING=ANALYSIS.
Paired Samples Correlations
N Correlation Sig.
Pair 1 Diff1 & Diff2 6 .650 .163
Paired Samples Test
Paired Differences
                          Mean    Std. Deviation   Std. Error Mean
Pair 1  Diff1 - Diff2   -.83333          1.72240            .70317
95% Confidence Interval of the Difference: Lower = -2.64088, Upper = .97422
t = -1.185, df = 5, Sig. (2-tailed) = .289
As you can see, the correlated t test on the difference scores is absolutely
equivalent to the interaction test in the ANOVA. The square of the t ((-1.185)² = 1.404)
is equal to the interaction F and the p values are identical.
Having established this equivalence, my suggestion is that the required sample
size be determined as if one were simply doing a correlated t test. There are all sorts of
issues involving how to define effect sizes for within-subjects effects, but I shall not
address those here.
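The equivalence holds for any data set, as this sketch shows on hypothetical scores (Petra's contrived data are not reproduced above):

```python
import math
import random
import statistics

# For a 2 x 2 within-subjects design, the interaction contrast for subject i is
# L_i = (PostX1_i - PreX1_i) - (PostX2_i - PreX2_i), i.e., Diff1_i - Diff2_i.
random.seed(1)
n = 6
pre_x1 = [random.gauss(10, 2) for _ in range(n)]
post_x1 = [x + random.gauss(3, 1) for x in pre_x1]
pre_x2 = [random.gauss(10, 2) for _ in range(n)]
post_x2 = [x + random.gauss(2, 1) for x in pre_x2]
L = [(a - b) - (c - d) for a, b, c, d in zip(post_x1, pre_x1, post_x2, pre_x2)]

# Interaction F from the contrast (coefficients +1, -1, -1, +1; sum of c^2 = 4):
ss_int = sum(L) ** 2 / (4 * n)
ss_err = (sum(x * x for x in L) - sum(L) ** 2 / n) / 4
F = ss_int / (ss_err / (n - 1))

# Correlated t comparing the two sets of difference scores:
t = statistics.mean(L) / (statistics.stdev(L) / math.sqrt(n))
assert math.isclose(t ** 2, F)   # the squared t equals the interaction F
```

Whatever data are fed in, t² and the interaction F agree, which is why the sample-size question can be answered with the simpler paired-t machinery.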
G*Power shows me that Petra would need 54 cases to have a 95% chance of
detecting a medium-sized effect using the usual 5% criterion of statistical significance.
t tests - Means: Difference between two dependent means (matched pairs)
Analysis: A priori: Compute required sample size
Input: Tail(s) = Two
Effect size dz = 0.5
α err prob = 0.05
Power (1-β err prob) = 0.95
Output: Noncentrality parameter = 3.6742346
Critical t = 2.0057460
Df = 53
Total sample size = 54
Actual power = 0.9502120
We should be able to get this same result using the ANOVA, Repeated
Measures, within factors analysis in G*Power, as Petra suggested, and, in fact, we do:
F tests - ANOVA: Repeated measures, within factors
Analysis: A priori: Compute required sample size
Input: Effect size f = 0.25
α err prob = 0.05
Power (1-β err prob) = 0.95
Number of groups = 1
Repetitions = 2
Corr among rep measures = 0.5
Nonsphericity correction = 1
Output: Noncentrality parameter = 13.5000000
Critical F = 4.0230170
Numerator df = 1.0000000
Denominator df = 53.0000000
Total sample size = 54
Actual power = 0.9502120
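The agreement is no accident: the two noncentrality parameters describe the same test. A quick check (using λ = N·m·f²/(1 - ρ) for the one-group repeated measures F and δ = dz·√N for the paired t; δ² = λ):

```python
import math

N, dz = 54, 0.5
f, m, rho = 0.25, 2, 0.5
delta = dz * math.sqrt(N)            # paired-t noncentrality, 3.6742...
lam = N * m * f ** 2 / (1 - rho)     # repeated measures F noncentrality, 13.5
assert math.isclose(delta ** 2, lam)
```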
Karl L. Wuensch, Dept. of Psychology, East Carolina University, Greenville, NC.
August, 2009
Return to Wuensch's Stats Lessons Page
MAN_1W1B.doc
The A X (B X S) ANOVA: A Multivariate Approach
The A X (B X S) mixed ANOVA, where factor A is between/among subjects and factor
B is a repeated measures/within subjects factor, has a sphericity assumption (the same
assumption we discussed earlier when studying one-way repeated measures ANOVA). Our
example of such a design will be the first experiment in my dissertation, the same example we
used for the one-way analysis, but this time we shall not ignore the between subjects Nursing
groups variable. Run the program MAN_1W1B.sas from my SAS programs page. Variable
NURS is the Nursing Group variable, identifying the species of the subject's foster mother:
Mus, Peromyscus, or Rattus.
Note that nurs is identified in PROC ANOVA's CLASS statement (nurs is a
classification, categorical variable) and in the MODEL statement (nurs is a between subjects
independent variable). The MEANS statement is used to obtain means on each of the four
variates and to do LSD comparisons between nursing groups on each variate. Please note
that we had equal cell sizes in our data set. If we had unequal cell sizes, we would employ
PROC GLM, Type III sums of squares, instead of PROC ANOVA.
Simple Effects of the Between-Subjects Factor at Each Level of the Within-
Subjects Factor. Since I did not employ the nouni keyword, the first output given is the
simple effects of nurs at each level of Scent. These analyses indicate that the nursing groups
do not differ significantly from one another on time spent in the clean tunnel, F(2, 33) = 0.13, p
= .88, the Mus-scented tunnel, F(2, 33) = 1.39, p = .26, or the Peromyscus-scented tunnel F(2,
33) = 1.2, p = .31, but do on the Rattus-scented tunnel, F(2, 33) = 12.86, p < .0001. The LSD
comparisons, later in the output, show that the rat-reared mice spent significantly more time in
the rat-scented tunnel than did the other two groups of mice, which did not differ significantly
from each other. Do note that SAS has used individual error terms, one for each level of
Scent. In his chapter on repeated measures ANOVA, Howell explains how you could use a
possibly more powerful pooled error term instead.
Mauchly's criterion (W = .4297) indicates we have a serious lack of sphericity, so if we
were to take the univariate approach analysis, we would need to adjust the degrees of
freedom for both effects that involve the within-subjects factor, scent. If you compare this
analysis with the one-way analysis we previously did you will see that the univariate SS for
scent remains unchanged, but the error SS is reduced, due to the Scent x Nurs interaction being
removed from the error term.
The multivariate approach, which does not require sphericity, gives us significant
effects for both repeated effects. See Manova Test Criteria and Exact F Statistics for the
Hypothesis of no scent Effect. This tests the null hypothesis that the profile is flat when
collapsed across the groups; that is, a plot with mean time on the ordinate and scent of tunnel
We have already covered the one-way repeated measures design and the A x (B
x S) design. I shall not present computation of the (A x B x S) totally within-subjects
two-way design, since it is a simplification of the (A x B x C x S) design that I shall
address. If you need to do an (A x B x S) ANOVA, just drop the Factor C from the (A x
B x C x S) design.
A X B X (C X S) ANOVA
In this design only one factor, C, is crossed with subjects (is a within-subjects
factor), while the other two factors, A and B, are between-subjects factors.
Howell (page 495 of the 5th edition of Statistical Methods for Psychology)
presented a set of data with two between-subjects factors (Gender and Group) and one
within-subjects factor (Time). One group of adolescents attended a behavioral skills
training (BST) program designed to teach them how to avoid HIV infection. The other
group attended a traditional educational program (EC). The dependent variable which
we shall analyze is a measure of the frequency with which the participants used
condoms during intercourse. This variable is measured at four times: Prior to the
treatment, immediately after completion of the program, six months after completion of
the program, and 12 months after completion of the program.
SAS
Obtain the data file, MAN_1W2B.dat, from my StatData page and the program
file, MAN_1W2B.SAS, from my SAS programs page. Note that the first variable is
Gender, then Group number, then dependent variable scores at each level (pretest,
posttest, 6 month follow-up, and 12 month follow-up) of the within-subjects factor, Time.
Group|Gender factorially combines the two between-subjects factors, and Time 4
indicates that the variables Pretest, Posttest, FU6, and FU12 represent the 4-level
within-subjects factor, Time.
By not specifying NOUNI I had SAS compute Group|Gender univariate
ANOVAs on Pretest, Posttest, FU6, and FU12. These provide the simple interaction
tests of Group x Gender at each level of Time that we might use to follow up a significant
triple interaction, but our triple interaction is not significant. However, our Time x Group
interaction is significant, so we can use these univariate ANOVAs for simple main
effects tests of Group at each level of Time. Note that the groups differ significantly
only at the time of the 6 month follow-up, when the BST participants used condoms
more frequently (M = 18.8) than did the EC participants (M = 8.6).
An analysis may be doubly multivariate in at least two different ways. First, a
set of noncommensurate (not measured on the same scale) dependent variables
may be administered at two or more different times. For example, I measure
subjects' blood pressure, heart rate, cholesterol level, and percent body fat. Subjects
participate in a month-long cardiac fitness program or in some placebo activity. I
measure the four dependent variables just before the program starts, just after it ends,
a month after it ends, and a year after it ends. I have a Group x Time mixed design
with multiple dependent variables. I take the multivariate approach with respect to the
Time variable (to avoid the sphericity assumption), and I have multiple dependent
variables, so I have a doubly multivariate design. For each effect (Group x Time, Time,
Group) I obtain a test on an optimally weighted combination of the four dependent
variables (weighted to maximize that effect). If (and only if) the test from the doubly
multivariate analysis (which simultaneously analyzes all four dependent variables) is
significant, I then conduct tests of that effect on each dependent variable (one at a
time). This procedure may provide some protection against inflation of alpha with
multiple dependent variables. Suppose that only the time effect was significant. I
would then conduct singly multivariate analyses on each dependent variable, ignoring
the group and Group x Time tests in those analyses.
A second sort of doubly multivariate design exists when two or more
noncommensurate sets of commensurate dependent measures are obtained at
one time. Experiment 1 of Karl's dissertation, which we used as an example for the
multivariate approach to the (A x S) and the A x (B x S) ANOVAs, will serve as an
example. In addition to measuring how much time each subject spent in each of four
differently scented tunnels (one set of commensurate variates), I measured how many
visits each subject made to each tunnel (a second set) and each subject's latency to
first entry of each tunnel (a third set).
Obtain from my SPSS Data Page TUNNEL4b.sav. Bring it into SPSS. Copy the
following syntax to the syntax editor and run it:
manova v_clean to v_rat t_clean to t_rat L_clean to L_rat
by nurs(1,3) / wsfactors = scent(4) /
contrast(scent)=helmert /
measure = visits time latency /
print=transform signif(univ hypoth) error(sscp) / design .
The order of the variates in the MANOVA statement must be: the variate
representing the first dependent variable at level 1 of the within-subjects factor; the
variate representing the first dependent variable at level 2 of the within-subjects factor;
. . . the variate for the first dv at the last level of the within factor; the variate for the
second dv at level 1 of the within factor; and so on.
When one wishes to determine whether two or more groups differ significantly on
one or more optimally weighted linear combinations (canonical variates or discriminant
functions) of two or more normally distributed dependent variables, a one-way multivariate
analysis of variance (MANOVA) is appropriate. We have already studied Discriminant Function
Analysis, which is mathematically equivalent to a one-way MANOVA, so we are just
shifting perspective here.
A Manipulation Check for Michelle Plaster's Thesis
Download Plaster.dat from my StatData page and MANOVA1.sas from my SAS
Programs page. Run the program. The data are from Michelle Plaster's thesis, which
you may wish to look over (in Joyner or in the Psychology Department). The analyses
she did are very similar to, but not identical to, those we shall do for instructional
purposes. Male participants were shown a picture of one of three young women. Pilot
work had indicated that one woman was beautiful, another of average physical
attractiveness, and the third unattractive. Participants rated the woman they saw on
each of twelve attributes. Those ratings are our dependent variables. To simplify the
analysis, we shall deal with only the ratings of physical attractiveness (PHYATTR),
happiness (HAPPY), INDEPENdence, and SOPHISTication. The purpose of the
research was to investigate the effect of the defendant's physical attractiveness (and
some other variables which we shall ignore here) upon the sentence recommended in a
simulated jury trial. The MANOVA on the ratings served as a check on the
effectiveness of our manipulation of the physical attractiveness (PA) independent
variable.
Screening the Data
The overall means and standard deviations on the dependent variables were
obtained so that I could standardize them and then compute scores on the canonical
variates. Although you cannot see it in this output, there is heterogeneity of variance
on the physical attractiveness variable. The Fmax is 4.62. With n's approximately
equal, I'm not going to worry about it. If Fmax were larger and sample sizes less
nearly equal, I
might try data transformations to stabilize the variance or I might randomly discard
scores to produce equal n's and thus greater robustness.
One should look at the distributions of the dependent variables within each cell.
I have done so, (using SAS), but have not provided you with the statistical output. Such
output tends to fill many pages, and I generally do not print it, I just take notes from the
screen. Tests of significance employed in MANOVA do assume that the dependent
variables and linear combinations of the dependent variables are normal within each
cell. With large sample sizes (or small samples with approximately equal n's), the tests
are fairly robust to violations of this normality assumption. If λ is the eigenvalue for a
root, then θ = λ/(1 + λ), and Wilks' lambda can be computed from the thetas. For our
data, for the first root, θ = 1.767/2.767 = 0.6386, and, for the second root, θ =
.168/1.168 = 0.1438. To obtain Wilks' lambda, subtract each theta from one and then
find the product of these differences: Λ = (1 - .6386)(1 - .1438) = .309.
Pillai's Trace is the sum of all the thetas, .6386 + .1438 = .782.
Roy's greatest characteristic root is simply the eigenvalue for the first root.
Roy's gcr should be the most powerful test when the first root is much larger than the
other roots.
Each of these statistics' significance levels is approximated using F.
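The arithmetic above can be sketched in a few lines of Python. The two eigenvalues (1.767 and .168) are the roots reported in the text; everything else follows from them.

```python
# Computing the MANOVA test statistics from the eigenvalues of the roots.
# The two eigenvalues below are the roots reported in the text above.
eigenvalues = [1.767, 0.168]

# theta = lambda / (1 + lambda) for each root
thetas = [lam / (1 + lam) for lam in eigenvalues]

# Wilks' lambda: the product of (1 - theta) across the roots
wilks = 1.0
for theta in thetas:
    wilks *= (1 - theta)

# Pillai's trace: the sum of the thetas
pillai = sum(thetas)

# Roy's greatest characteristic root: the eigenvalue for the first root
roy = max(eigenvalues)

print(round(thetas[0], 4), round(thetas[1], 4))  # 0.6386 0.1438
print(round(wilks, 3))                           # 0.309
print(round(pillai, 3))                          # 0.782
```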
Univariate ANOVAs on the Canonical Variates and the Original Variables
Now, look back at the data step of our program. I used the total-sample means
and standard deviations to standardize the variables into Z_scores. I then computed,
for each subject, canonical variate scores, using the total-sample standardized
canonical correlation coefficients. CV1 is canonical variate 1, CV2 is canonical variate
2. I computed these canonical variate scores so I could use them as outcome variables
in univariate analyses of variance. I used PROC ANOVA to do univariate ANOVAs with
Fisher's LSD contrasts on these two canonical variates and also, for the benefit of
those who just cannot understand canonical variates, on each of the observed
variables.
Please note that for each canonical variate:
- If you take the SS among groups and divide by the SS within groups, you get the
eigenvalue reported earlier for that root.
- If you take the ratio of SS among groups to SS total, you get the root's squared
canonical correlation.
- The group means are the group centroids that would be reported with a
discriminant function analysis.
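The first two of those identities are easy to check numerically. The canonical variate scores below are made up for illustration; only the algebraic relationships matter.

```python
# A numerical check, with hypothetical canonical variate scores for three
# groups, of the identities: eigenvalue = SS-among / SS-within, and
# squared canonical correlation = SS-among / SS-total.
groups = {
    "beautiful":    [2.1, 1.8, 2.4, 1.9],
    "average":      [0.2, -0.1, 0.3, 0.0],
    "unattractive": [-1.9, -2.2, -1.8, -2.1],
}

all_scores = [s for g in groups.values() for s in g]
grand_mean = sum(all_scores) / len(all_scores)

# SS among groups: n_j * (group mean - grand mean)^2, summed over groups
ss_among = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
               for g in groups.values())
# SS within groups: squared deviations of scores from their own group mean
ss_within = sum((s - sum(g) / len(g)) ** 2
                for g in groups.values() for s in g)
# SS total: squared deviations from the grand mean
ss_total = sum((s - grand_mean) ** 2 for s in all_scores)

eigenvalue = ss_among / ss_within       # the root's eigenvalue
r2_canonical = ss_among / ss_total      # the root's squared canonical correlation

# The two are related by r2 = eigenvalue / (1 + eigenvalue)
print(round(eigenvalue, 2), round(r2_canonical, 4))
```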
I have never seen anyone else compute canonical scores like this and then do
pairwise comparisons on them, but it seems sensible to me, and such analysis has
been accepted in manuscripts I have published. As an example of this procedure, see
my summary of an article on Mock Jurors' Insanity Defense Verdict Selections.
Note that on the "beauty" canonical variate, the beautiful group's mean is
significantly higher than the other two means, which are not significantly different from
one another. On the happily independent variate, the average group is significantly
higher than the other two groups.
The univariate ANOVAs reveal that our manipulation significantly affected every
one of the ratings variables, with the effect on the physical attractiveness ratings being
very large (η² = .63). Compared to the other groups, the beautiful woman was rated
significantly more attractive and sophisticated, and the unattractive woman was rated
significantly less independent and happy.
Unsophisticated Use of MANOVA
Unsophisticated users of MANOVA usually rely on the Univariate F-tests to try
to interpret a significant MANOVA. These are also the same users who believe that the
purpose of a MANOVA is to guard against inflation of alpha across several DVs.
They promise themselves that they will not even look at the univariate ANOVAs for any
effect which is not significant in the MANOVA. The logic is essentially the same as that
in Fisher's LSD protected test for making pairwise comparisons between means
following a significant effect in ANOVA (this procedure has a lousy reputation -- not
conservative enough -- these days, but is actually a good procedure when only three
means are being compared; the procedure can also be generalized to other sorts of
analyses, most appropriately those involving effects with only two degrees of freedom).
In fact, this "protection" is afforded only if the overall null hypothesis is true (none of the
outcome variables is associated with the grouping variable), not if some outcome
variables are associated with the grouping variable and others are not. If one outcome
variable is very strongly associated with the grouping variable and another very weakly
(for practical purposes, zero effect), the probability of finding the very weak effect to be
"significant" is unacceptably high.
Were we so unsophisticated as to take such a "MANOVA-protected univariate
tests" approach, we would note that the MANOVA was significant, that univariate tests
on physically attractive, happy, independent, and sophisticated were significant, and
then we might do some pairwise comparisons between groups on each of these
"significant" dependent variables. I have included such univariate tests in my second
invocation of PROC ANOVA. Persons who are thinking of using MANOVA because
they have an obsession about inflated familywise alpha and they think MANOVA
somehow protects against this should consider another approach -- application of a
Bonferroni adjustment to multiple univariate ANOVAs. That is, dispense with the
MANOVA, do the several univariate ANOVAs, and evaluate each ANOVA using an
adjusted criterion of significance equal to the familywise alpha you can live with divided
by the number of tests in the family. For example, if I am willing to accept only a .05
probability of rejecting one or more of four true null hypotheses (four dependent
variables, four univariate ANOVAs), I use an adjusted criterion of .05/4 = .0125. For
each of the ANOVAs I declare the effect to be significant only if its p ≤ .0125. This will,
of course, increase the probability of making a Type II error (which is already the most
likely sort of error), so I am not fond of making the Bonferroni adjustment of the per-
comparison criterion of statistical significance. Read more on this in my document
Familywise Alpha and the Boogey Men.
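The Bonferroni decision rule just described amounts to very little arithmetic. The p values below are hypothetical, chosen only to illustrate the rule.

```python
# Bonferroni-adjusted univariate ANOVAs in place of a MANOVA.
# The p values are hypothetical, just to illustrate the decision rule.
familywise_alpha = 0.05
p_values = {"phyattr": 0.0001, "happy": 0.020, "indepen": 0.004, "sophist": 0.030}

# Adjusted criterion: familywise alpha divided by the number of tests
adjusted_criterion = familywise_alpha / len(p_values)   # .05 / 4 = .0125

for dv, p in p_values.items():
    verdict = "significant" if p <= adjusted_criterion else "not significant"
    print(f"{dv}: p = {p}, {verdict} at the adjusted criterion {adjusted_criterion}")
```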
While I have somewhat disparaged the protected test use of MANOVA, I
must confess that I sometimes employ it, especially in complex factorial designs where,
frankly, interpreting the canonical variates for higher order effects is a bit too
challenging for me. With a complex factorial analysis (usually from one of my
colleagues, as I usually have too much sense to embark on such ambitious projects) I
may simply note which of the effects are significant in the MANOVA and then use
univariate analyses to investigate those (and only those) effects in each of the
dependent variables. Aside from any protection against inflating familywise alpha, the
main advantage of this approach is that it may impress some reviewers of your
manuscript, multivariate analyses being popular these days.
The unsophisticated approach just described ignores the correlations among the
outcome variables and ignores the dimensions (canonical variates, discriminant
functions) upon which MANOVA found the groups to differ. This unsophisticated user
may miss what is really going on -- it is quite possible for none of the univariate tests to
be significant, but the MANOVA to be significant.
Relationship between MANOVA and DFA
I also conducted a discriminant function analysis on these data just to show you
the equivalence of MANOVA and DFA. Please note that the eigenvalues, canonical
correlations, loadings, and canonical coefficients are identical to those obtained with the
MANOVA.
SPSS MANOVA
SPSS has two routines for doing multivariate analysis of variance, the GLM routine
and the MANOVA routine. Let us first consider the GLM routine. Go to my SPSS Data
Page and download the SPSS data file, Plaster.sav. Bring it into SPSS and click
Analyze, General Linear Model, Multivariate. Move our outcome variables (phyattr,
happy, indepen, and sophist) into the dependent variables box and the grouping
variable (PA, manipulated physical attractiveness) into the fixed factor box.
Under Options, select descriptive statistics and homogeneity tests. Now, look at
the output. Note that you do get Box's M, which is not available in SAS. You also get
the tests of the significance of all roots simultaneously tested and Roy's test of only the
first root, as in SAS. Levene's test is used to test the significance of group differences
in variance on the original variables. Univariate ANOVAs are presented, as are basic
descriptive statistics. Missing, regrettably, are any canonical statistics, and these are not
available even if you select every optional statistic available with this routine. What a
bummer. Apparently the folks at SPSS have decided that people who can only point
and click will never do a truly multivariate analysis of variance, that they will only use what I
have called the unsophisticated approach, so they have made the canonical statistics
unavailable.
Fortunately, the canonical statistics are available from the SPSS MANOVA
routine. While this routine was once available on a point-and-click basis, it is now
available only by syntax -- that is, you must enter the program statements in plain text,
just like you would in SAS, and then submit those statements via the SPSS Syntax
Editor. Point your browser to my SPSS programs page and download the file
MANOVA1.sps to your hard drive or diskette. From the Windows Explorer or My
Computer, double click on the file. The SPSS Syntax Editor opens with the program
displayed. Edit the syntax file so that it points to the correct location of the Plaster.sav
file and then click on Run, All. The output will appear.
I did not ask SPSS to print the variance/covariance matrices for each cell, but I
did get the determinants of these matrices, which are used in Box's M. You may think
of the determinants as being indices of the generalized variance within a
variance/covariance matrix. For each cell, for our 4 dependent variable design, that
matrix is a 4 x 4 matrix with variances on the main diagonal and covariances (between
each pair of DVs) off the main diagonal. Box's M tests the null hypothesis that the
variance/covariance matrices in the population are identical across cells. If this null
hypothesis is false, the pooled variance/covariance matrix used by SPSS is
inappropriate. Box's M is notoriously powerful, and one generally doesn't worry unless
p < .001 and sample sizes are unequal. Using Pillai's trace (rather than Wilks' lambda) may
improve the robustness of the test in these circumstances. One can always randomly
discard scores to produce equal n's. Since our n's are nearly equal, I'll just use Pillai's
trace and not discard any scores.
Look at the pooled within cells (thus eliminating any influence of the grouping
variable) correlation matrix; it is now referred to as the WITHIN+RESIDUAL correlation
matrix. The dependent variables are generally correlated with one another. Bartlett's
test of sphericity tests the null hypothesis that in the population the correlation matrix
for the outcome variables is an identity matrix (each r_ij = 0). That is clearly not the case
with our variables. Bartlett's test is based on the determinant of the within-cells
correlation matrix. If the determinant is small, the null hypothesis is rejected. If the
determinant is very small (zero to several decimal places), then at least one of the
variables is nearly a perfect linear combination of the other variables. This creates a
problem known as multicollinearity. With multicollinearity your solution is suspect --
another random sample from the same population would be likely to produce quite
different results. When this problem arises, I recommend that you delete one or more
of the variables. If one of your variables is a perfect linear combination of the others
(for example, were you to include as variables SAT-Total, SAT-Math, and SAT-Verbal), the
analysis would crash, due to the singularity of a matrix which needs to be inverted as
part of the solution. If you have a multicollinearity problem but just must retain all of the
outcome variables, you can replace those variables with principal component scores.
See Principal Components Discriminant Function Analysis for an example.
For our data, SPSS tells us that the determinant of the within cells correlation
matrix has a log of -.37725. Using my calculator, e^(-.37725) = .686 -- that is,
the determinant is about .686, not near zero, so apparently no problem with multicollinearity.
Another way to check on multicollinearity is to compute the R² between each variable
and all the others (or 1 - R², the tolerance). This will help you identify which variable you
might need to delete to avoid the multicollinearity problem. I used SAS PROC
FACTOR to get the R²s, which are:
Prior Communality Estimates: SMC
PHYATTR HAPPY INDEPEN SOPHIST
0.136803 0.270761 0.133796 0.289598
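Both diagnostics can be sketched with NumPy on a hypothetical 4 x 4 within-cells correlation matrix (the values below are invented; the point is the computation, which uses the standard identity SMC_i = 1 - 1/(R⁻¹)_ii).

```python
import numpy as np

# Multicollinearity diagnostics on a hypothetical within-cells correlation
# matrix: its determinant, and each variable's squared multiple correlation
# (SMC) with the other variables, obtained from the inverse of the matrix.
R = np.array([
    [1.00, 0.30, 0.25, 0.20],
    [0.30, 1.00, 0.35, 0.40],
    [0.25, 0.35, 1.00, 0.30],
    [0.20, 0.40, 0.30, 1.00],
])

det = np.linalg.det(R)   # a determinant near zero would signal multicollinearity

# SMC_i = 1 - 1 / (R^-1)_ii ; tolerance is 1 - SMC
smc = 1 - 1 / np.diag(np.linalg.inv(R))
tolerance = 1 - smc

print(round(det, 3))
print(np.round(smc, 3))
```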
Note that the MANOVA routine has given us all of the canonical statistics that we
are likely to need to interpret our multivariate results. If you compare the canonical
coefficients and loadings to those obtained with SAS, you will find that each SPSS
coefficient equals minus one times the SAS coefficient. While SAS constructed
canonical variates measuring physical attractiveness (CV1) and
happiness/independence (CV2), SPSS constructed canonical variates measuring
physical unattractiveness and unhappiness/dependence. The standardized coefficients
presented by SPSS are computed using within group statistics.
Copyright 2008 Karl L. Wuensch - All rights reserved.
MANOVA2.doc
Factorial MANOVA
A factorial MANOVA may be used to determine whether or not two or more categorical
grouping variables (and their interactions) significantly affect optimally weighted linear
combinations of two or more normally distributed outcome variables. We have already
studied one-way MANOVA, and we previously expanded one-way ANOVA to factorial
ANOVA, so we should be well prepared to expand one-way MANOVA to factorial MANOVA.
The normality and homogeneity of variance assumptions we made for the factorial ANOVA
apply for the factorial MANOVA also, as does the homogeneity of dispersion matrices
assumption (variance/covariance matrices do not differ across cells) we made in one-way
MANOVA.
Obtain MANOVA2.sas from my SAS Programs page and run it. The data are from the
same thesis that provided us the data for our one-way MANOVA, but this time there are two
dependent variables: YEARS, length of sentence given the defendant by the mock-juror
subject, and SEVERITY, a rating of how serious the subject thought the defendant's crime
was. The PA independent variable was a physical attractiveness manipulation: the female
defendant presented to the mock jurors was beautiful, average, or ugly. The second
independent variable was CRIME, the type of crime the defendant committed, a burglary
(theft of items from the victim's room) or a swindle (conned the male victim).
Multivariate Interactions (pages 6 & 7 of the listing)
As in univariate factorial ANOVA, we shall generally inspect effects from higher order
down to main effects. For our 3 x 2 design, the PA X CRIME effect is the highest order
effect. We had some reason to expect this effect to be significant; others have found that
beautiful defendants get lighter sentences than do ugly defendants unless the crime was one
in which the defendant's beauty played a role (such as enabling her to more easily con our
male victim). We have had little luck replicating this interaction, and it has no significant
multivariate effect here (but Roy's greatest root, which tests only the first root, is nearly
significant, p = .07). We shall, however, for instructional purposes, assume that the
interaction is significant. What do we do following a significant multivariate interaction?
The unsophisticated way to investigate a multivariate interaction is first to determine
which of the dependent variables have significant univariate effects from that interaction.
Suppose that YEARS does and SERIOUS does not. We could then do univariate simple
main effects analysis (or, if we were dealing with a triple or even higher order interaction,
simple interaction analysis) on that dependent variable. This unsophisticated approach
totally ignores the correlations among the dependent variables and the optimally weighted
linear combinations of them that MANOVA worked so hard to obtain.
At first thought, it might seem reasonable to follow a significant multivariate PA X
CRIME interaction with multivariate simple main effects analysis, that is, do two one-way
MANOVAs: multivariate effect of PA upon YEARS and SERIOUS using data from the
burglary level of CRIME only, and another using data from the swindle level of CRIME only
(or alternatively, three one-way MANOVAs, one at each level of PA, with IV=CRIME). There
is, however, a serious complication with this strategy.
For each variable, you must decide whether it is for practical purposes
categorical (only a few values are possible) or continuous (many values are
possible). K = the number of values of the variable.
If both variables are categorical, go to the section Both Variables
Categorical on this page.
If both of the variables are continuous, go to the section Two Continuous
Variables on this page.
If one variable (I'll call it X) is categorical and the other (Y) is continuous,
go to the section Categorical X, Continuous Y on page 2.
Both Variables Categorical
The Pearson chi-square is appropriately used to test the null hypothesis
that two categorical (K ≥ 2) variables are independent of one another.
If each variable is dichotomous (K = 2), the phi coefficient (φ) is also
appropriate. If you can assume that each of the dichotomous variables
measures a normally distributed underlying construct, the tetrachoric
correlation coefficient is appropriate.
Two Continuous Variables
The Pearson product moment correlation coefficient (r) is used to
measure the strength of the linear association between two continuous variables.
To do inferential statistics on r you need to assume normality (and
homoscedasticity) in (across) X, Y, (X|Y), and (Y|X).
Linear regression analysis has less restrictive assumptions (no
assumptions on X, the fixed variable) for doing inferential statistics, such as
testing the hypothesis that the slope of the regression line for predicting Y from X
is zero in the population.
The Spearman rho is used to measure the strength of the monotonic
association between two continuous variables. It is no more than a Pearson r
computed on ranks and its significance can be tested just like r.
Kendall's tau coefficient (τ), which is based on the number of inversions
(across X) in the rankings of Y, can also be used with rank data, and its
significance can be tested.
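A minimal pure-Python sketch of the first two coefficients, on invented data chosen to be monotonic but not linear, shows that Spearman rho really is just a Pearson r computed on ranks.

```python
# Pearson r, and Spearman rho computed as Pearson r on ranks
# (hypothetical data; no tied values assumed).
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def ranks(v):
    # rank 1 = smallest value; assumes no ties
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 7.5, 9.9]   # monotonic increasing, not linear

r = pearson_r(x, y)                  # linear association: about .97 here
rho = pearson_r(ranks(x), ranks(y))  # 1.0 for any monotonic increasing y

print(round(r, 3), rho)
```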
I. What is Measurement?
A. Strict definition: any method by which a unique and reciprocal
correspondence is established between all or some of the magnitudes of a
kind and all or some of the numbers...
1. Magnitude: a particular amount of an attribute
2. Attribute: a measurable property
B. Example:
length
|---Y---Y--Y----Y----------------->
numbers
|---+---+--+----+----------------->
0   2   4  5    8
C. Loose definition: the assignment of numerals to objects or events
according to rules (presumably any rules). A numeral is any symbol other
than a word.
II. Scales of Measurement
A. Nominal Scale of Measurement
1. The function of nominal scales is to classify - numerals (symbols, such as
0, 1, A, Q, #, Z) are arbitrarily assigned to name objects/events or
qualitative classes of objects/events.
2. Does not meet the criteria of the strict definition of measurement.
3. Given two measurements, I can determine whether they are the same or
not, but I may not be able to tell whether the one object/event has more or
less of the measured attribute than does the other.
4. For example, I ask each member of the class to take all of es (his or her)
paper money, write es Banner ID number on each bill, and then put it all
in a paper bag I pass around class. I shake the bag well and pull out two
bills. From the Banner ID numbers on them I can tell whether they belong
to the same person or not.
B. Ordinal Scale of Measurement
1. The data here have the characteristics of a nominal scale and more.
When the objects/events are arranged in serial order with respect to the
property measured, that order is identical to the order of the
measurements.
Let: m(O_i) = our measurement of the amount of some attribute that object i has
t(O_i) = the true amount of that attribute that object i has
For a measuring scale to be ordinal, the following two criteria must be met:
1. m(O_1) ≠ m(O_2) only if t(O_1) ≠ t(O_2). If two measurements are unequal, the true
magnitudes (amounts of the measured attribute) are unequal.
2. m(O_1) > m(O_2) only if t(O_1) > t(O_2). If measurement 1 is larger than measurement
2, then object 1 has more of the measured attribute than object 2.
If the relationship between the Truth and our Measurements is positive monotonic
[whenever T increases, M increases], these criteria are met.
For a measuring scale to be interval, the above criteria must be met and also a third
criterion:
3. The measurement is some positive linear function of the Truth. That is, letting X_i stand
for t(O_i) to simplify the notation: m(O_i) = a + bX_i, b > 0.
Thus, we may say that t(O_1) - t(O_2) = t(O_3) - t(O_4) if m(O_1) - m(O_2) = m(O_3) - m(O_4),
since the latter is (a + bX_1) - (a + bX_2) = (a + bX_3) - (a + bX_4), so bX_1 - bX_2 = bX_3 - bX_4.
Thus, X_1 - X_2 = X_3 - X_4.
In other words, a difference of y units on the measuring scale represents the same true
amount of the attribute being measured regardless of where on the scale the measurements
are taken. Consider the following data on how long it takes a runner to finish a race:
Runner:            Joe   Lou     Sam   Bob   Wes
Rank:              1     2       3     4     5
(True) Time (sec): 60    60.001  65    75    80
Note that a linear function is monotonic, but a monotonic function is not necessarily linear. A
linear function has a constant slope. The slope of a monotonic function has a constant sign
(always positive or always negative) but not necessarily a constant magnitude.
In addition to the three criteria already mentioned, a fourth criterion is necessary for a scale
to be ratio:
4. a = 0, that is, m(O_i) = bX_i; that is, a true zero point.
If m(O) = 0, then 0 = bX, thus X must = 0 since b > 0. In other words, if an
object has a measurement of zero, it has absolutely none of the measured
attribute.
For ratio data, m(O_1) / m(O_2) = bX_1 / bX_2 = X_1 / X_2.
Thus, we can interpret the ratios of ratio measurements as the ratio of the true
magnitudes.
For nonratio, interval data, m(O_1) / m(O_2) = (a + bX_1) / (a + bX_2) ≠ X_1 / X_2, since
a ≠ 0.
When you worked Gas Law problems in high school [for example, volume of gas held
constant, pressure of gas at 10 degrees Celsius given, what is pressure if temperature is
raised to 20 degrees Celsius] you had to convert from Celsius to Kelvin because you were
working with ratios, but Celsius is not ratio [20 degrees Celsius is not twice as hot as 10
degrees Celsius], Kelvin is.
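The Gas Law example works out in a few lines: ratios taken on the Celsius scale (interval, a ≠ 0) are meaningless, while ratios on the Kelvin scale (ratio, true zero) reflect the true magnitudes.

```python
# Ratios are meaningful only on a ratio scale: convert Celsius (interval)
# to Kelvin (ratio, true zero) before forming the ratio.
def celsius_to_kelvin(c):
    return c + 273.15

t_low_c, t_high_c = 10.0, 20.0
naive_ratio = t_high_c / t_low_c                 # 2.0, but meaningless
true_ratio = celsius_to_kelvin(t_high_c) / celsius_to_kelvin(t_low_c)

print(naive_ratio)           # 2.0
print(round(true_ratio, 3))  # 1.035
```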
Additional Readings
See the two articles in BlackBoard, Articles, Scales of Measurement
Ratio versus Interval Scales of Measurement -- a graphical explanation of the
difference
PSYC 2101: Howell Chapters 1 & 2, a document from my undergraduate class,
includes material on scales of measurement and other very basic concepts.
Copyright 2010, Karl L. Wuensch - All rights reserved.
Ratio versus Interval Data
Imagine that you are vacationing in Canada, where, like in most of the world,
they use the Celsius scale of temperature. When you get up in the morning you tune in
the weather station and the forecaster says "The low this morning was 10 degrees, but
the forecast high this afternoon is 20 degrees, twice as hot." Well, 20 to 10 surely does
sound like a 2 to 1 ratio, as illustrated in the plot below:
However, the 0 value here is not a true zero; it does not represent the absence of
molecular motion. The true (absolute) zero point on the Celsius scale is -273, as shown
here:
The first plot is basically what I have elsewhere called a Gee Whiz plot: a big
chunk of the ordinate (vertical axis) was left out, creating the misperception of a much
larger difference in the height of the two bars.
To find the ratio of the two temperatures, you need to convert to a ratio scale, such
as the Kelvin scale. In degrees Kelvin the two temperatures are 283 and 293, and the
ratio is not 2 but rather 293/283 = 1.035. As a chart:
For those of you more familiar with the Fahrenheit scale, our two temperatures
are 50 and 68. The Fahrenheit scale, like the Celsius scale, is interval, not ratio, as the
zero is not true. The ratio of 68 to 50 is 1.36, also meaningless in this context.
Karl L. Wuensch, East Carolina University, Greenville, NC. May, 2010
Descript.doc
Descriptive Statistics
I. Frequency Distribution: a tallying of the number of times (frequency) each
score value (or interval of score values) is represented in a group of scores.
A. Ungrouped: frequency of each score value is given
Cumulative Cumulative
Statophobia Frequency Percent Frequency Percent
-----------------------------------------------------------------
0 9 1.51 9 1.51
1 17 2.86 26 4.37
2 15 2.52 41 6.89
3 35 5.88 76 12.77
4 38 6.39 114 19.16
5 93 15.63 207 34.79
6 67 11.26 274 46.05
7 110 18.49 384 64.54
8 120 20.17 504 84.71
9 47 7.90 551 92.61
10 43 7.23 594 99.83
11 1 0.17 595 100.00
B. Grouped: total range of scores divided into several (usually equal in width)
intervals, with frequency given for each interval.
1. Summarizes data, but involves loss of info, possible distortion
2. Usually 5-20 intervals
Nucophobia   Frequency   Percent   Cumulative Percent
0-9               13        2.1           2.1
10-19             17        2.8           4.9
20-29             30        4.9           9.8
30-39             45        7.3          17.1
40-49             36        5.9          23.0
50-59            144       23.5          46.5
60-69            129       21.0          67.5
70-79             81       13.2          80.8
80-89             57        9.3          90.0
90-100            61       10.0         100.0
Total            613      100.0
C. Percent: the percentage of scores at a given value or interval of values.
s_y² = Σ(Y - M)² / (N - 1) = SS / (N - 1)
3. s is a relatively unbiased estimate of population standard deviation
s = √s²
4. Since in a bell-shaped (normal) distribution nearly all of the scores fall
within plus or minus 3 standard deviations from the mean, when you have
a moderately sized sample of scores from such a distribution the standard
deviation should be approximately one-sixth of the range.
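That rule of thumb is easy to check by simulation; this sketch uses made-up normal scores with mean 50 and SD 10.

```python
# Checking the range rule of thumb on simulated normal scores
# (mean 50, SD 10); the SD should be roughly one-sixth of the range.
import random

random.seed(1)
scores = [random.gauss(50, 10) for _ in range(1000)]

mean = sum(scores) / len(scores)
sd = (sum((y - mean) ** 2 for y in scores) / (len(scores) - 1)) ** 0.5

rough_sd = (max(scores) - min(scores)) / 6
print(round(sd, 1), round(rough_sd, 1))  # both in the neighborhood of 10
```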
I. Example Calculations
Y     (Y-M)   (Y-M)²      z
5      +2       4       1.265
4      +1       1       0.633
3       0       0       0.000
2      -1       1      -0.633
1      -2       4      -1.265
Sum   15        0      10       0
Mean   3        0       2       0
Notice that the sum of the deviations of scores from their mean is zero (as
always). If you find the mean of the squared deviations, 10/5 = 2, you have
the variance, assuming that these five scores represent the entire population.
The population standard deviation is √2 = 1.414. Usually we shall consider
the data we have to be a sample. In this case the sample variance is 10/4 =
2.5 and the sample standard deviation is √2.5 = 1.581.
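The computations above can be verified with Python's statistics module, which offers both the population (divide by N) and sample (divide by N - 1) versions:

```python
import statistics as stats

scores = [5, 4, 3, 2, 1]

pop_var = stats.pvariance(scores)   # SS/N: 10/5 = 2
pop_sd = stats.pstdev(scores)       # sqrt(2), about 1.414
samp_var = stats.variance(scores)   # SS/(N-1): 10/4 = 2.5
samp_sd = stats.stdev(scores)       # sqrt(2.5), about 1.581

print(pop_var, round(pop_sd, 3), samp_var, round(samp_sd, 3))
```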
Notice that for this distribution the mean is 3 and the median is also 3. The
distribution is perfectly symmetric. Watch what happens when I replace the
score of 5 with a score of 40.
Distribution of Y: 40, 4, 3, 2, 1
Median = 3, Mean = 10. The mean is drawn in the direction of the (positive)
skew. The mean is somewhat deceptive here: notice that 80% of the scores
are below average (below the mean). That sounds fishy, eh?
V. Standard Scores
A. Take the scores from a given distribution and change them such that the new
distribution has a standard mean and a standard deviation
B. This transformation does not change the shape of the distribution
C. Z - Scores: a mean of 0, standard deviation of 1
Z = (Y - M) / s, the number of standard deviations the score is above or below the
mean. In the table above, I have computed z for each score by subtracting
the sample mean and dividing by the sample standard deviation.
D. Standard Score = Standard Mean + (z score)(Standard Standard Deviation).
Examples
Suzie Cueless has a z score of -2 on a test of intelligence. We want to
change this score to a standard IQ score, where the mean is 100 and the
standard deviation is 15. Suzie's IQ = 100 - (2)(15) = 70.
Gina Genius has a z score of +2.5 on a test of verbal aptitude. We want
to change this score to the type of standard score that is used on the SAT
tests, where the mean is 500 and the standard deviation is 100. Gina's
SAT-Verbal score is 500 + (2.5)(100) = 750.
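The standard-score formula in D can be written as a one-line function; the two conversions below reproduce the IQ and SAT examples:

```python
def standard_score(z, new_mean, new_sd):
    """Standard Score = Standard Mean + (z score)(Standard SD)."""
    return new_mean + z * new_sd

print(standard_score(-2.0, 100, 15))    # IQ scale: 70.0
print(standard_score(2.5, 500, 100))    # SAT scale: 750.0
```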
Return to Wuensch's Page of Statistics Lessons
Exercises Involving Descriptive Statistics
Copyright 2010, Karl L. Wuensch - All rights reserved.
The Three Quarter Rule
GeeWhiz.doc
[Figure: bar graph of Sales in $K, ordinate 0 to 25, for salespersons Baylen, Jackson, Jones, Smith, and Stern.]
It is recommended that one make the height of the highest point on the ordinate
about 3/4 of the length of the abscissa. Below is a simple sales plot for 5 salespersons
following that rule. Take a look at it and get a feel for how much these five differ in
sales.
[Figure: the same sales data, ordinate 0 to 25, with the width increased relative to the height.]
Now here is a plot of the same data, but with the width increased relative to the height.
This gives the perception that the salespersons differ less.
Below, on the left, is a plot of the same data, but now I have increased the
height relative to the width. Notice that this creates the perception that the
salespersons differ from one another more. On the right, I have applied a "Gee-Whiz,"
leaving out a big chunk of the lower portion of the ordinate, which makes the differences
among the salespersons appear greater still.
[Figures: left, the same sales data with height increased relative to width, ordinate 0 to 25; right, a "Gee-Whiz" version with the ordinate truncated to run from 17 to 23.]
[Figures: two renditions of the Reagan tax graph, plotting "Their Bill" versus "Our Bill" from 1982 to 1986; one version has no numbers on the ordinate, the other adds ordinate values 999 through 1006.]
Here is my rendition of a graph used by Ronald Reagan on July 27, 1981. It was
published in the New York Times, and elsewhere. His graph was better done than
mine, but mine captures many of the little "tricks" he used. The graph was designed to
show that the Republican (true blue) tax plan would save you money compared to the
Democratic (in the red) plan, over time. "Your Taxes" makes it personal. Notice that
there are no numbers on the ordinate, but a big attention-catching dollar sign is there.
The Republican plan is "Our" plan (yours and mine), and the Democratic plan is "Their"
plan (the bad guys). It looks like the Democratic plan would cost us a little less for a
couple of years, but then a lot more thereafter. But without any numbers on the
ordinate, we can't really make a fair comparison between the two plans. Might this
graph be a "Gee-Whiz?"
[Graph labels: "YOUR TAXES"; "ANNUAL FAMILY INCOME = $20,000"; a large dollar sign on the ordinate.]
Here I have added some numbers to the ordinate. If these were the correct numbers (I
do not know what the correct numbers are), then this graph is clearly a "Gee-Whiz", and
the difference between the two plans is trivial, only a few dollars.
Here is an ad placed by Quaker Oats. Gee Whiz! The graph makes it look like there is
a dramatic drop in cholesterol, but notice that the ordinate starts at 196. The drop
across four weeks is from 209 to 199. That is a drop of 10/209 = 4.8%.
Document revised October, 2006.
Return to Karl Wuensch's Stats Lessons Page
eda.doc
Exploratory Data Analysis (EDA)
John Tukey has developed a set of procedures collectively known as EDA. Two
of these procedures that are especially useful for producing initial displays of data are:
1. the Stem-and-Leaf Display, and 2. the Box-and-Whiskers Plot.
To illustrate EDA, consider the following set of pulse rates from 96 people:
66 60 64 64 64 76 82 70 60 78
92 82 90 70 62 60 68 68 70 70
68 76 68 72 98 60 80 80 104 70
92 80 90 64 78 60 60 70 66 76
70 52 74 78 70 68 66 80 62 56
58 68 60 48 78 86 68 90 76 70
94 90 64 68 68 80 70 72 80 60
68 99 60 74 56 86 64 86 64 68
76 74 70 77 80 72 88 94 78 70
78 78 55 62 74 58
Stem and Leaf Display
You first decide how wide each row (class interval) will be. I decided on an
interval width of 5, that is, I'll group together on one row all scores of 40-44; on another,
45-49; on another, 50-54, etc. I next wrote down the leading digits (most significant
digits) for each interval, starting with the lowest. These make up the stem of the display.
Next I tallied each score, placing its trailing digit (rightmost, least significant digit) in
the appropriate row to the right of the stem. These digits (each one representing one
score) make up the leaves of the display. Here is how the display looks now:
4 8
5 2
5 68658
6 0444020040020400442
6 68888686888888
7 0000200040002440204
7 6868688667888
8 220000000
8 6668
9 20200404
9 89
10 4
Notice that the leaves in each row are in the order that I encountered them when
reading the unordered raw data in rows from left to right. A more helpful display is one
in which the leaves within each row have been put in order.
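A stem-and-leaf display of this kind is easy to sketch in code. The version below (an illustration with made-up pulse rates, not the full data set above) uses an interval width of 5 and sorts the leaves within each row:

```python
from collections import defaultdict

def stem_and_leaf(scores, width=5):
    """One row per interval of the given width; leaves are the trailing digits."""
    rows = defaultdict(list)
    for score in sorted(scores):
        rows[score - score % width].append(score % 10)
    return [f"{start // 10:>3} | " + "".join(str(leaf) for leaf in rows[start])
            for start in sorted(rows)]

for line in stem_and_leaf([48, 52, 55, 56, 56, 58, 58, 60, 60, 64, 104]):
    print(line)
```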
Skewness
In everyday language, the terms skewed and askew are used to refer to
something that is out of line or distorted on one side. When referring to the shape of
frequency or probability distributions, skewness refers to asymmetry of the distribution.
A distribution with an asymmetric tail extending out to the right is referred to as
positively skewed or skewed to the right, while a distribution with an asymmetric tail
extending out to the left is referred to as negatively skewed or skewed to the left.
Skewness can range from minus infinity to positive infinity.
Karl Pearson (1895) first suggested measuring skewness by standardizing the
difference between the mean and the mode, that is, sk = (μ - mode) / σ. Population modes
are not well estimated from sample modes, but one can estimate the difference
between the mean and the mode as being three times the difference between the mean
and the median (Stuart & Ord, 1994), leading to the following estimate of skewness:
est. sk = 3(M - median) / s. Many statisticians use this measure but with the 3 eliminated,
that is, sk = (M - median) / s. This statistic ranges from -1 to +1. Absolute values above
0.2 indicate great skewness (Hildebrand, 1986).
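The (M - median)/s estimator is simple to compute; here is a minimal sketch (the samples below are made up for illustration):

```python
import statistics as stats

def median_skewness(data, multiplier=1):
    """Pearson's median skewness, (M - median)/s; pass multiplier=3 for the older form."""
    return multiplier * (stats.mean(data) - stats.median(data)) / stats.stdev(data)

print(median_skewness([1, 2, 3, 4, 5]))              # symmetric, so 0.0
print(round(median_skewness([1, 2, 2, 3, 10]), 3))   # positively skewed sample
```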
Skewness has also been defined with respect to the third moment about the
mean: γ₁ = Σ(X - μ)³ / (nσ³), which is simply the expected value of the distribution of cubed z
scores. Skewness measured in this way is sometimes referred to as Fisher's
skewness. When the deviations from the mean are greater in one direction than in the
other direction, this statistic will deviate from zero in the direction of the larger
deviations. From sample data, Fisher's skewness is most often estimated by:
g₁ = nΣz³ / ((n - 1)(n - 2)). For large sample sizes (n > 150), g₁ may be distributed
approximately normally, with a standard error of approximately √(6/n). While one could
use this sampling distribution to construct confidence intervals for or tests of hypotheses
about γ₁, there is rarely any value in doing so.
The most commonly used measures of skewness (those discussed here) may
produce surprising results, such as a negative value when the shape of the distribution
appears skewed to the right. There may be superior alternative measures not
commonly used (Groeneveld & Meeden, 1984).
Kurtosis
Kurtosis has been defined with respect to the fourth moment about the mean:
β₂ = Σ(X - μ)⁴ / (nσ⁴), the expected value of the distribution of Z scores which have
been raised to the 4th power.
β₂ is often referred to as Pearson's kurtosis, and β₂ - 3 (often symbolized with γ₂)
as kurtosis excess or Fisher's kurtosis, even though it was
Pearson who defined kurtosis as β₂ - 3. An unbiased estimator for γ₂ is
g₂ = n(n + 1)ΣZ⁴ / ((n - 1)(n - 2)(n - 3)) - 3(n - 1)² / ((n - 2)(n - 3)). For large
sample sizes (n > 1000), g₂ may be
distributed approximately normally, with a standard error of approximately √(24/n)
(Snedecor & Cochran, 1967). While one could use this sampling distribution to
construct confidence intervals for or tests of hypotheses about γ₂, there is rarely any
value in doing so.
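Assuming z scores computed with the sample standard deviation (N - 1 in the denominator), the g₁ and g₂ formulas above can be coded directly; this Python sketch parallels what the SAS programs mentioned later in this handout do:

```python
import statistics as stats

def g1_g2(data):
    """Sample estimates of Fisher's skewness (g1) and kurtosis excess (g2)."""
    n = len(data)
    m, s = stats.mean(data), stats.stdev(data)   # s divides SS by n - 1
    z = [(x - m) / s for x in data]
    sum_z3 = sum(v ** 3 for v in z)
    sum_z4 = sum(v ** 4 for v in z)
    g1 = n * sum_z3 / ((n - 1) * (n - 2))
    g2 = (n * (n + 1) * sum_z4 / ((n - 1) * (n - 2) * (n - 3))
          - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    return g1, g2

g1, g2 = g1_g2([1, 2, 3, 4, 5])   # a symmetric toy sample: g1 is zero
print(round(g1, 6), round(g2, 6))
```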
Pearson (1905) introduced kurtosis as a measure of how flat the top of a
symmetric distribution is when compared to a normal distribution of the same variance.
He referred to more flat-topped distributions (γ₂ < 0) as platykurtic, less flat-topped
distributions (γ₂ > 0) as leptokurtic, and equally flat-topped distributions as mesokurtic
(γ₂ ≈ 0). Kurtosis is actually more influenced by scores in the tails of the distribution
than scores in the center of a distribution (DeCarlo, 1997). Accordingly, it is often
appropriate to describe a leptokurtic distribution as fat in the tails and a platykurtic
distribution as thin in the tails.
Student (1927, Biometrika, 19, 160) published a cute description of kurtosis,
which I quote here: "Platykurtic curves have shorter tails than the normal curve of error
and leptokurtic longer tails. I myself bear in mind the meaning of the words by the
above memoria technica, where the first figure represents platypus and the second
kangaroos, noted for lepping." See Student's drawings.
Moors (1986) demonstrated that β₂ = Var(Z²) + 1. Accordingly, it may be best to
treat kurtosis as the extent to which scores are dispersed away from the shoulders of a
distribution, where the shoulders are the points where Z² = 1, that is, Z = ±1. Balanda
and MacGillivray (1988) wrote it is best to define kurtosis vaguely as the location- and
scale-free movement of probability mass from the shoulders of a distribution into its
centre and tails. If one starts with a normal distribution and moves scores from the
shoulders into the center and the tails, keeping variance constant, kurtosis is increased.
The distribution will likely appear more peaked in the center and fatter in the tails, like a
Laplace distribution (γ₂ = 3) or Student's t with few degrees of freedom (γ₂ = 6/(df - 4)).
Starting again with a normal distribution, moving scores from the tails and the
center to the shoulders will decrease kurtosis. A uniform distribution certainly has a flat
top, with γ₂ = -1.2, but γ₂ can reach a minimum value of -2 when two score values are
equally probable and all other score values have probability zero (a rectangular U
distribution, that is, a binomial distribution with n = 1, p = .5). One might object that the
rectangular U distribution has all of its scores in the tails, but closer inspection will
reveal that it has no tails, and that all of its scores are in its shoulders, exactly one
standard deviation from its mean. Values of g₂ less than that expected for a uniform
distribution (-1.2) may suggest that the distribution is bimodal (Darlington, 1970), but
bimodal distributions can have high kurtosis if the modes are distant from the shoulders.
One leptokurtic distribution we shall deal with is Students t distribution. The
kurtosis of t is infinite when df < 5, 6 when df = 5, 3 when df = 6. Kurtosis decreases
further (towards zero) as df increase and t approaches the normal distribution.
Kurtosis is usually of interest only when dealing with approximately symmetric
distributions. Skewed distributions are always leptokurtic (Hopkins & Weeks, 1990).
Among the several alternative measures of kurtosis that have been proposed (none of
which has often been employed), is one which adjusts the measurement of kurtosis to
remove the effect of skewness (Blest, 2003).
There is much confusion about how kurtosis is related to the shape of
distributions. Many authors of textbooks have asserted that kurtosis is a measure of the
peakedness of distributions, which is not strictly true.
It is easy to confuse low kurtosis with high variance, but distributions with
identical kurtosis can differ in variance, and distributions with identical variances can
differ in kurtosis. Here are some simple distributions that may help you appreciate that
kurtosis is, in part, a measure of tail heaviness relative to the total variance in the
distribution (remember the σ⁴ in the denominator).
Table 1.
Kurtosis for 7 Simple Distributions Also Differing in Variance
X freq A freq B freq C freq D freq E freq F freq G
05 20 20 20 10 05 03 01
10 00 10 20 20 20 20 20
15 20 20 20 10 05 03 01
Kurtosis -2.0 -1.75 -1.5 -1.0 0.0 1.33 8.0
Variance 25 20 16.6 12.5 8.3 5.77 2.27
Platykurtic Leptokurtic
When I presented these distributions to my colleagues and graduate students
and asked them to identify which had the least kurtosis and which the most, all said A
has the most kurtosis, G the least (excepting those who refused to answer). But in fact
A has the least kurtosis (-2 is the smallest possible value of kurtosis) and G the most.
The trick is to do a mental frequency plot where the abscissa is in standard deviation
units. In the maximally platykurtic distribution A, which initially appears to have all its
scores in its tails, no score is more than one σ away from the mean; that is, it has no
tails! In the leptokurtic distribution G, which seems only to have a few scores in its tails,
one must remember that those scores (5 & 15) are much farther away from the mean
(3.3 σ) than are the 5s & 15s in distribution A. In fact, in G nine percent of the scores
are more than three σ from the mean, much more than you would expect in a
mesokurtic distribution (like a normal distribution), thus G does indeed have fat tails.
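The mental plot can be checked by computing β₂, the mean of z⁴, directly from the Table 1 frequencies:

```python
def gamma2(values, freqs):
    """Population kurtosis excess: mean of z**4, minus 3."""
    n = sum(freqs)
    mean = sum(v * f for v, f in zip(values, freqs)) / n
    var = sum(f * (v - mean) ** 2 for v, f in zip(values, freqs)) / n
    beta2 = sum(f * (v - mean) ** 4 for v, f in zip(values, freqs)) / n / var ** 2
    return beta2 - 3

x = [5, 10, 15]
print(round(gamma2(x, [20, 0, 20]), 3))   # distribution A: -2.0
print(round(gamma2(x, [1, 20, 1]), 3))    # distribution G: 8.0
```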
If you were to ask SAS to compute kurtosis on the A scores in Table 1, you
would get a value less than -2.0, less than the lowest possible population kurtosis.
Why? SAS assumes your data are a sample and computes the g₂ estimate of
population kurtosis, which can fall below -2.0.
Sune Karlsson, of the Stockholm School of Economics, has provided me with the
following modified example which holds the variance approximately constant, making it
quite clear that a higher kurtosis implies that there are more extreme observations (or
that the extreme observations are more extreme). It is also evident that a higher
kurtosis also implies that the distribution is more single-peaked (this would be even
more evident if the sum of the frequencies was constant). I have highlighted the rows
representing the shoulders of the distribution so that you can see that the increase in
kurtosis is associated with a movement of scores away from the shoulders.
Table 2.
Kurtosis for Seven Simple Distributions Not Differing in Variance
X Freq. A Freq. B Freq. C Freq. D Freq. E Freq. F Freq. G
-6.6 0 0 0 0 0 0 1
-0.4 0 0 0 0 0 3 0
1.3 0 0 0 0 5 0 0
2.9 0 0 0 10 0 0 0
3.9 0 0 20 0 0 0 0
4.4 0 20 0 0 0 0 0
5 20 0 0 0 0 0 0
10 0 10 20 20 20 20 20
15 20 0 0 0 0 0 0
15.6 0 20 0 0 0 0 0
16.1 0 0 20 0 0 0 0
17.1 0 0 0 10 0 0 0
18.7 0 0 0 0 5 0 0
20.4 0 0 0 0 0 3 0
26.6 0 0 0 0 0 0 1
Kurtosis -2.0 -1.75 -1.5 -1.0 0.0 1.33 8.0
Variance 25 25.1 24.8 25.2 25.2 25.0 25.1
While it is unlikely that a behavioral researcher will be interested in questions that
focus on the kurtosis of a distribution, estimates of kurtosis, in combination with other
information about the shape of a distribution, can be useful. DeCarlo (1997) described
several uses for the g₂ statistic. When considering the shape of a distribution of scores,
it is useful to have at hand measures of skewness and kurtosis, as well as graphical
displays. These statistics can help one decide which estimators or tests should perform
best with data distributed like those on hand. High kurtosis should alert the researcher
to investigate outliers in one or both tails of the distribution.
Tests of Significance
Some statistical packages (including SPSS) provide both estimates of skewness
and kurtosis and standard errors for those estimates. One can divide the estimate by
its standard error to obtain a z test of the null hypothesis that the parameter is zero (as
would be expected in a normal population), but I generally find such tests of little use.
One may do an eyeball test on measures of skewness and kurtosis when deciding
whether or not a sample is normal enough to use an inferential procedure that
assumes normality of the population(s). If you wish to test the null hypothesis that the
sample came from a normal population, you can use a chi-square goodness of fit test,
comparing observed frequencies in ten or so intervals (from lowest to highest score)
with the frequencies that would be expected in those intervals were the population
normal. This test has very low power, especially with small sample sizes, where the
normality assumption may be most critical. Thus you may think your data close enough
to normal (not significantly different from normal) to use a test statistic which assumes
normality when in fact the data are too distinctly non-normal to employ such a test, the
nonsignificance of the deviation from normality resulting only from low power due to the
small sample size. SAS PROC UNIVARIATE will test such a null hypothesis for you, using
the more powerful Kolmogorov-Smirnov statistic (for larger samples) or the Shapiro-Wilk
statistic (for smaller samples). These have very high power, especially with large
sample sizes, in which case the normality assumption may be less critical for the test
statistic whose normality assumption is being questioned. These tests may tell you that
your sample differs significantly from normal even when the deviation from normality is
not large enough to cause problems with the test statistic which assumes normality.
SAS Exercises
Go to my StatData page and download the file EDA.dat. Go to my SAS-
Programs page and download the program file g1g2.sas. Edit the program so that the
INFILE statement points correctly to the folder where you located EDA.dat and then run
the program, which illustrates the computation of g₁ and g₂. Look at the program. The
raw data are read from EDA.dat and PROC MEANS is then used to compute g₁ and g₂.
The next portion of the program uses PROC STANDARD to convert the data to z
scores. PROC MEANS is then used to compute g₁ and g₂ on the z scores. Note that
standardization of the scores has not changed the values of g₁ and g₂. The next portion
of the program creates a data set with the z scores raised to the 3rd and the 4th powers.
The final step of the program uses these powers of z to compute g₁ and g₂ using the
formulas presented earlier in this handout. Note that the values of g₁ and g₂ are the
same as obtained earlier from PROC MEANS.
Go to my SAS-Programs page and download and run the file Kurtosis-
Uniform.sas. Look at the program. A DO loop and the UNIFORM function are used to
create a sample of 500,000 scores drawn from a uniform population which ranges from
0 to 1. PROC MEANS then computes mean, standard deviation, skewness, and
kurtosis. Look at the output. Compare the obtained statistics to the expected values for
the following parameters of a uniform distribution that ranges from a to b:
Parameter Expected Value
Mean (a + b) / 2
Standard Deviation √((b - a)² / 12)
Skewness 0
Kurtosis -1.2
Go to my SAS-Programs page and download and run the file Kurtosis-T.sas,
which demonstrates the effect of sample size (degrees of freedom) on the kurtosis of
the t distribution. Look at the program. Within each section of the program a DO loop is
used to create 500,000 samples of N scores (where N is 10, 11, 17, or 29), each drawn
from a normal population with mean 0 and standard deviation 1. PROC MEANS is then
used to compute Students t for each sample, outputting these t scores into a new data
set. We shall treat this new data set as the sampling distribution of t. PROC MEANS is
then used to compute the mean, standard deviation, and kurtosis of the sampling
distributions of t. For each value of degrees of freedom, compare the obtained statistics
with their expected values.
Mean: 0
Standard Deviation: √(df / (df - 2))
Kurtosis: 6 / (df - 4)
Download and run my program Kurtosis_Beta2.sas. Look at the program.
Each section of the program creates one of the distributions from Table 1 above and
then converts the data to z scores, raises the z scores to the fourth power, and
computes β₂ as the mean of z⁴. Subtract 3 from each value of β₂ and then compare the
resulting γ₂ to the value given in Table 1.
Download and run my program Kurtosis-Normal.sas. Look at the program. DO
loops and the NORMAL function are used to create 100,000 samples, each with 1,000
scores drawn from a normal population with mean 0 and standard deviation 1. PROC
MEANS creates a new data set with the g₁ and the g₂ statistics for each sample. PROC
MEANS then computes the mean and standard deviation (standard error) for skewness
and kurtosis. Compare the values obtained with those expected: 0 for the means, and
√(6/n) and √(24/n) for the standard errors.
References
Balanda, K. P., & MacGillivray, H. L. (1988). Kurtosis: A critical review. The American
Statistician, 42, 111-119.
Blest, D.C. (2003). A new measure of kurtosis adjusted for skewness. Australian & New
Zealand Journal of Statistics, 45, 175-179.
Darlington, R.B. (1970). Is kurtosis really "peakedness?" The American Statistician, 24(2),
19-22.
DeCarlo, L. T. (1997). On the meaning and use of kurtosis. Psychological Methods, 2, 292-307.
Groeneveld, R.A. & Meeden, G. (1984). Measuring skewness and kurtosis. The Statistician, 33,
391-399.
Hildebrand, D. K. (1986). Statistical thinking for behavioral scientists. Boston: Duxbury.
Hopkins, K.D. & Weeks, D.L. (1990). Tests for normality and measures of skewness and
kurtosis: Their place in research reporting. Educational and Psychological Measurement, 50,
717-729.
Loether, H. L., & McTavish, D. G. (1988). Descriptive and inferential statistics: An
introduction (3rd ed.). Boston: Allyn & Bacon.
Moors, J.J.A. (1986). The meaning of kurtosis: Darlington reexamined. The American
Statistician, 40, 283-284.
Pearson, K. (1895) Contributions to the mathematical theory of evolution, II: Skew variation in
homogeneous material. Philosophical Transactions of the Royal Society of London, 186,
343-414.
Pearson, K. (1905). Das Fehlergesetz und seine Verallgemeinerungen durch Fechner und
Pearson. A Rejoinder. Biometrika, 4, 169-212.
Snedecor, G.W. & Cochran, W.G. (1967). Statistical methods (6th ed.). Ames, IA: Iowa
State University Press.
Stuart, A. & Ord, J.K. (1994). Kendall's advanced theory of statistics. Vol. 1: Distribution
theory (6th ed.). London: Edward Arnold.
Wuensch, K. L. (2005). Kurtosis. In B. S. Everitt & D. C. Howell (Eds.), Encyclopedia of
statistics in behavioral science (pp. 1028 - 1029). Chichester, UK: Wiley.
Wuensch, K. L. (2005). Skewness. In B. S. Everitt & D. C. Howell (Eds.), Encyclopedia of
statistics in behavioral science (pp. 1855 - 1856). Chichester, UK: Wiley.
Links
http://core.ecu.edu/psyc/wuenschk/StatHelp/KURTOSIS.txt -- a log of email discussions on
the topic of kurtosis, most of them from the EDSTAT list.
http://core.ecu.edu/psyc/WuenschK/docs30/Platykurtosis.jpg -- distribution of final grades in
PSYC 2101 (undergrad stats), Spring, 2007.
Kurtosis slide show with histograms of the distributions presented in Table 2 above
Data from Table 2, in SPSS format
Copyright 2011, Karl L. Wuensch - All rights reserved.
Return to My Statistics Lessons Page
Normal-30.docx
The Normal Distribution
The normal, or Gaussian, distribution has played a prominent role in statistics. It
was originally investigated by persons interested in gambling or in the distribution of
errors made by persons observing astronomical events. It is still very important to
behavioral statisticians because:
1. Many variables are distributed approximately as the bell-shaped normal curve.
2. Many of the inferential procedures (the so-called parametric tests) we shall learn
assume that variables are normally distributed.
3. Even when a variable is not normally distributed, a distribution of sample sums or
means on that variable will be approximately normally distributed if sample size is
large.
4. Most of the special probability distributions we shall study approach a normal
distribution under some conditions.
5. The mathematics of the normal curve are well known and relatively simple. One can
find the probability that a score randomly sampled from a normal distribution will fall
within the interval a to b by integrating the normal probability density function (pdf)
from a to b. This is equivalent to finding the area under the curve between a and b,
assuming a total area of one.
Here is the probability density function known as the normal curve. F(Y) is the
probability density, aka the height of the curve at value Y.
F(Y) = (1 / (σ√(2π))) e^(-(Y - μ)² / (2σ²))
Notice that there are only two parameters in this pdf: the mean and the standard
deviation. Everything else on the right-hand side is a constant. If you know that a
distribution is normal and you know its mean and standard deviation, you know
everything there is to know about it. Normal distributions differ only with respect to their
means and their standard deviations.
Those who have not mastered integral calculus need not worry about integrating
the normal curve. You can use the computer to do it for you or make use of the normal
curve table in our textbook. This table is based on the standard normal curve (z),
which has a mean of 0 and a variance of 1. To use this table, one must first convert raw
scores to z-scores. A z-score is the number of standard deviations (σ or s) a score is
above or below the mean of a reference distribution: Z = (Y - μ) / σ.
For example, suppose we wish to know the percentile rank (PR, the percentage
of scores at or below a given score value) of a score of 85 on an IQ test with μ = 100,
σ = 15. Z = (85 - 100)/15 = -1.00. We then either integrate the normal curve from minus
infinity to minus one or go the table. On page 695 find the row with Z = 1.00 (ignore the
minus sign for now). Draw a curve, locate mean and -1.00 on the curve and shade the
area you want (the lower tail). The entry under Smaller Portion is the answer, .1587 or
15.87%.
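Rather than the printed table, one can integrate the standard normal curve with Python's math.erf, since Φ(z) = (1 + erf(z/√2))/2:

```python
from math import erf, sqrt

def phi(z):
    """Proportion of a standard normal distribution falling at or below z."""
    return (1 + erf(z / sqrt(2))) / 2

z = (85 - 100) / 15                  # IQ of 85
print(round(phi(z), 4))              # the "Smaller Portion": 0.1587
print(round(phi(1.0), 4))            # IQ of 115, "Larger Portion": 0.8413
print(round(2 * phi(1.96) - 1, 2))   # middle proportion between -1.96 and +1.96: 0.95
```

Python 3.8 and later also provide statistics.NormalDist(100, 15).cdf(85) for the same purpose.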
Suppose IQ = 115, Z = +1.00. Now the answer is under Larger Portion,
84.13%.
What percentage of persons have IQs between 85 and 130? The Z-scores are
-1.00 and +2.00. Between the -1.00 and the mean are 34.13%, with another 47.72
between the mean and +2.00, for a total of 81.85%.
What percentage have IQs between 115 and 130 ? The Z-scores are +1.00 and
+2.00. 97.72% are below +2.00, 84.13% are below +1.00, so the answer is 97.72 -
84.13 = 13.59%.
What score marks off the lower 10% of IQ scores ? Now we look in the Smaller
Portion column to find .1000 . The closest we can get is .1003 , which has Z = 1.28 .
We could do a linear interpolation between 1.28 and 1.29 to be more precise. Since we
are below the mean, Z = -1.28. What IQ has a Z of -1.28? X = μ + Zσ, so
IQ = 100 + (-1.28)(15) = 100 - 19.2 = 80.8.
What scores mark off the middle 50% of IQ scores? There will be 25% between
the mean and each Z-score, so we look for .2500. The closest Z is 0.67, so the middle
50% is between Z = -0.67 and Z = +0.67, which is IQ = 100 - (.67)(15) to 100 + (.67)(15)
= 90 through 110.
You should memorize the following important Z-scores:
The MIDDLE __ % FALL BETWEEN PLUS OR MINUS Z =
50 ------------------------- 0.67
68 ------------------------- 1.00
90 ------------------------- 1.645
95 ------------------------- 1.96
98 ------------------------- 2.33
99 ------------------------- 2.58
There are standard score systems (where raw scores are transformed to have a
preset mean and standard deviation) other than Z. For example, SAT scores and GRE
scores are generally reported with a system having μ = 500, σ = 100. A math SAT
score of 600 means that Z = (600 - 500)/100 = +1.00, just like an IQ of 115 means that
Z = (115 - 100)/15 = +1.00. Converting to Z allows one to compare the relative
standings of scores from distributions with different means and standard deviations.
Thus, you can compare apples with oranges, provided you have first converted to Z. Be
careful, however, when doing this, because the two reference groups may differ. For
example, a math SAT of 600 is not really equivalent to an IQ of 115, because the
persons who take SAT tests come from a population different from (brighter than) the
group with which IQ statistics were normed. (Also, math SAT and IQ tests measure
somewhat different things.)
Copyright 2011, Karl L. Wuensch - All rights reserved.
Designs.doc
An Introduction to Research Design
Bivariate Experimental Research
Let me start by sketching a simple picture of a basic bivariate (focus on two
variables) research paradigm.
IV stands for independent variable (also called the treatment), DV for
dependent variable, and EV for extraneous variable. In experimental research
we manipulate the IV and observe any resulting change in the DV. Because we are
manipulating it experimentally, the IV will probably assume only a very few values,
maybe as few as two. The DV may be categorical or may be continuous. The EVs are
variables other than the IV which may affect the DV. To be able to detect the effect of
the IV upon the DV, we must be able to control the EVs.
Consider the following experiment. I go to each of 100 classrooms on campus.
At each, I flip a coin to determine whether I will assign the classroom to Group 1 (level 1
of the IV) or to Group 2. The classrooms are my experimental units or subjects. In
psychology, when our subjects are humans, we prefer to refer to them as participants,
or respondents, but in statistics, the use of the word subjects is quite common, and I
shall use it as a generic term for experimental units. For subjects assigned to Group
1, I turn the room's light switch off. For Group 2 I turn it on. My DV is the brightness of
the room, as measured by a photographic light meter. EVs would include factors such
as time of day, season of the year, weather outside, condition of the light bulbs in the
room, etc.
Parametric statistical inference may take the form of:
1. Estimation: on the basis of sample data we estimate the value of some
parameter of the population from which the sample was randomly drawn.
2. Hypothesis Testing: We test the null hypothesis that a specified parameter
(I shall use θ to stand for the parameter being estimated) of the population has a
specified value.
One must know the sampling distribution of the estimator (the statistic used to
estimate θ; I shall use θ̂ to stand for the statistic used to estimate θ) to make full use
of the estimator. The sampling distribution of a statistic is the distribution that would be
obtained if you repeatedly drew samples of a specified size from a specified population
and computed θ̂ on each sample. In other words, it is the probability distribution of a
statistic.
Desirable Properties of Estimators Include:
1. Unbiasedness: θ̂ is an unbiased estimator of θ if its expected value equals
the value of the parameter being estimated, that is, if the mean of its sampling
distribution is θ. The sample mean and sample variance are unbiased estimators of the
population mean and population variance (but sample standard deviation is not an
unbiased estimator of population standard deviation).
For a discrete variable X, E(X), the expected value of X, is E(X) = ΣPᵢXᵢ. For
example, if 50% of the bills in a pot are one-dollar bills, 30% are two-dollar bills, 10%
are five-dollar bills, 5% are ten-dollar bills, 3% are twenty-dollar bills, and 2% are fifty-
dollar bills, the expected value for the value of what you get when you randomly select
one bill is .5(1) + .3(2) + .1(5) + .05(10) + .03(20) + .02(50) = $3.70. For a continuous
variable the basic idea of an expected value is the same as for a discrete variable, but a
little calculus is necessary to sum up the infinite number of products of P
i
(actually,
probability density) and X
i
.
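The expected-value computation above is easy to check with a few lines of code (a sketch; the values and probabilities are those of the pot-of-bills example):

```python
# E(X) = sum of X_i * P(X_i), using the pot-of-bills example above.
values = [1, 2, 5, 10, 20, 50]           # dollar value of each type of bill
probs = [.50, .30, .10, .05, .03, .02]   # probability of drawing each type

expected_value = sum(x * p for x, p in zip(values, probs))
print(round(expected_value, 2))  # 3.7
```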
Please note that the sample mean is an unbiased estimator of the population
mean, and the sample variance, s² = SS / (N − 1), is an unbiased estimator of the
population variance, σ². If we computed the estimator s² with N rather than (N − 1) in
the denominator, then the estimator would be biased. SS is the sum of the squared
deviations of scores from their mean, SS = Σ(Y − Mᵧ)².
The sample standard deviation is not, however, an unbiased estimator of the
population standard deviation (though it is the least biased estimator available to us). Consider
a hypothetical sampling distribution for the sample variance where half of the samples
have s² = 2 and half have s² = 4. Since the sample variance is unbiased, the
population variance must be the expected value of the sample variances,
σ² = .5(2) + .5(4) = 3, so σ = √3 ≈ 1.732. The expected value of the sample standard
deviations, however, is E(s) = .5(√2) + .5(√4) ≈ 1.707, which is less than 1.732.
a. σ(θ̂) is the standard error, the standard deviation of the sampling distribution of θ̂.
b. A 95% CI will extend from θ̂ − 1.96 σ(θ̂) to θ̂ + 1.96 σ(θ̂) if the sampling
distribution is normal. We would be 95% confident that our estimate-interval included
the true value of the estimated parameter (if we drew a very large number of samples,
95% of them would have θ̂ ± 1.96 σ(θ̂) intervals which would in fact include the true value of θ). If
CC = .95, a fair bet would be placing 19:1 odds in favor of the CI containing θ.
c. The value of Z will be 2.58 for a 99% CI, 2.33 for a 98% CI, 1.645 for a 90%
CI, 1.00 for a 68% CI, 2/3 for a 50% CI, other values obtainable from the normal curve
table.
d. Consider this very simple case. We know that a population of IQ scores is
normally distributed and has a σ of 15. We have randomly sampled one score and it is
110. When N = 1, the standard error of the mean, σ(M) = σ/√N, equals the population σ.
Thus, a 95% CI would be 110 ± 1.96(15). That is, we are 95% confident that μ is between 80.6
and 139.4.
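As a quick sketch (not part of the original handout), the IQ example can be reproduced in Python; `z = 1.96` is the two-tailed critical value used above:

```python
import math

def ci_mean(mean, sigma, n, z=1.96):
    """CI for mu when the population sigma is known: mean +/- z * sigma/sqrt(n).
    With n = 1 the standard error of the mean is just sigma, as in the example."""
    se = sigma / math.sqrt(n)
    return mean - z * se, mean + z * se

lo, hi = ci_mean(110, 15, 1)       # a single IQ score of 110, sigma = 15
print(round(lo, 1), round(hi, 1))  # 80.6 139.4
```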
Hypothesis Testing
A second type of inferential statistics is hypothesis testing. For parametric
hypothesis testing one first states a null hypothesis (H₀). The H₀ specifies that some
parameter has a particular value or has a value in a specified range of values. For
nondirectional hypotheses, a single value is stated, for example, μ = 100. For
directional hypotheses, a value of less than or equal to (or greater than or equal to)
some specified value is hypothesized, for example, μ ≤ 100.
The alternative hypothesis (H₁) is the antithetical complement of the H₀. If the
H₀ is μ = 100, the H₁ is μ ≠ 100. If H₀ is μ ≤ 100, H₁ is μ > 100. H₀: μ ≥ 100 implies
H₁: μ < 100. The H₀ and the H₁ are mutually exclusive and exhaustive: one, but not
both, must be true.
Very often the behavioral scientist wants to reject the H₀, for example that
μ ≤ 100 for IQ, hoping to show that the H₀ is false and assert
the H₁. We measure how well the data fit the H₀, given our a priori criterion for α. For an α of .05 or less this will be the
most extreme 5% of the normal curve, split into the two tails, 2.5% in each tail. The
rejection region would then include all values of Z less than or equal to −1.96 or greater
than or equal to +1.96. The nonrejection region would include all values of Z greater
than −1.96 but less than +1.96. The value of the test statistic at the boundary between
the nonrejection and the rejection regions is the critical value. Now we compute the
test statistic and locate it on the sampling distribution. If it falls in the rejection region
we conclude that p is less than or equal to our a priori criterion for α and we reject the
H₀. If it falls in the nonrejection region we conclude that p is greater than our a priori
criterion for α and we do not reject the H₀.
If you report an exact p, such as p = .0198, the reader can make such decisions,
even with α set at .01 or less.
Imagine that our p came out to be .057. Although we would not reject the H₀, it might
be misleading to simply report p > .05, which could lead readers not only to retain the H₀ but to
assert its truth or near truth. Using the traditional method, one would simply report p
> .05 and readers could not discriminate between the case when p = .057 and
that when p = .95.
Please notice that we could have decided whether or not to reject the H₀ on the
basis of the 95% CI we constructed earlier. Since our CC was 95%, we were using an α
of (1 − CC) = .05. Our CI for μ extended from 80.6 to 139.4, which does not include the
hypothesized value of 145. Since we are 95% confident that the true value of μ is
between 80.6 and 139.4, we can also be at least 95% confident (5% α) that μ is not 145
(or any other value less than 80.6 or more than 139.4), and reject the H₀. If our CI
included the value of μ hypothesized in the H₀, as it would if the H₀ were μ = 100,
then we could not reject the H₀.
The CI approach does not give you a p value with which quickly to assess the
likelihood that a Type I error was made. It does, however, give you a CI, which
hypothesis testing does not. I suggest that you give your readers both p and a (1 − α) CI
as well as your decision regarding rejection or nonrejection of the H₀.
You now know that α is the probability of rejecting a H₀ given that it is really
true, and that β is the probability of failing to reject the H₀ given that it is really false.
The lower one sets the criterion for α, the larger β will be, ceteris paribus, so one
should not just set α very low and think e has no chance of making any errors.
Possible Outcomes of Hypothesis Testing (and Their Conditional Probabilities)
IMHO, the null hypothesis is almost always wrong. Think of the alternative
hypothesis as being the signal that one is trying to detect. That signal typically is the
existence of a relationship between two things (events, variables, or linear combinations
of variables). Typically that thing really is there, but there may be too much noise
(variance from extraneous variables and other error) to detect it with confidence, or the
signal may be too weak to detect (like listening for the sound of a pin dropping) unless
almost all noise is eliminated.
                                      The True Hypothesis Is
Decision                        The H₁                  The H₀
Reject H₀, Assert H₁            correct decision        Type I error
                                (power)                 (α)
Retain H₀, Do not assert H₁     Type II error           correct decision
                                (β)                     (1 − α)
Think of the truth state as being two non-overlapping universes. You can be in
only one universe at a time, but may be confused about which one you are in
now.
You might be in the universe where the null hypothesis is true (very unlikely, but
you can imagine being there). In that universe there are only two possible
outcomes: you make a correct decision (do not detect the signal) or a Type I
error (detect a signal that does not exist). You cannot make a Type II error in
this universe.
You might be in the universe where the alternative hypothesis is correct, the
signal you seek to detect really is there. In that universe there are only two
possible outcomes: you make a correct decision or you make a Type II error.
You cannot make a Type I error in that universe.
Beta is the conditional probability of making a Type II error, failing to reject a
false null hypothesis. That is, if the null hypothesis is false (the signal you seek
to find is really there), β is the probability that you will fail to reject the null (you
will not detect the signal).
Power is the conditional probability of correctly rejecting a false null hypothesis.
That is, if the signal you seek to detect is really there, power is the probability
that you will detect it.
Power is greater with
o larger a priori alpha (increasing P(Type I error) also): if you
change how low p must get before you reject the null, you also change
beta and power.
o smaller sampling distribution variance (produced by larger sample size (n)
or smaller population variance): less noise.
o greater difference between the actual value of the tested parameter and
the value specified by the null hypothesis: a stronger signal.
o one-tailed tests (if the predicted direction, specified in the alternative
hypothesis, is correct): paying attention to the likely source of the signal.
o some types of tests (the t test) than others (the sign test): like a better sensory
system.
o some research designs (matched subjects) under some conditions
(matching variable correlated with the DV).
Suppose you are setting out to test a hypothesis. You want to know the
unconditional probability of making an error (Type I or Type II). That probability
depends, in large part, on the probability of being in the one or the other
universe, that is, on the probability of the null hypothesis being true. This
unconditional error probability is equal to α P(H₀ true) + β P(H₁ true).
The 2 x 2 matrix above is a special case of what is sometimes called a confusion
matrix. The reference to confusion has nothing to do with the fact that this matrix
confuses some students. Rather, it refers to the confusion inherent in predicting into
which of two (or more) categories an event falls or will fall. Substituting the language of
signal detection theory for that of hypothesis testing, our confusion matrix becomes:
                              Is the Signal Really There?
Prediction                    Signal is there           Signal is not there
Signal is there               True Positive (Hit)       False Positive
                              (power)                   (α)
Signal is not there           False Negative (Miss)     True Negative
                              (β)                       (1 − α)
Relative Seriousness of Type I and Type II Errors
Imagine that you are testing an experimental drug that is supposed to reduce
blood pressure, but is suspected of inducing cancer. You administer the drug to 10,000
rodents. Since you know that the tumor rate in these rodents is normally 10%, your H₀
is that the tumor rate in drug-treated rodents is 10% or less. That is, the H₀ is that the
drug is safe, it does not increase the cancer rate. The H₁ is that the drug does induce
cancer, that the tumor rate in treated rodents is greater than 10%. [Note that the H₀
always includes an =, but the H₁ never does.] A Type II error (failing to reject the H₀
of safety when the drug really does cause cancer) seems more serious than a Type I
error (rejecting the H₀ of safety when the drug really is safe). Remember that β is the
conditional probability of retaining the H₀ given that it is
false, and Power = 1 − β.
Now suppose we are testing the drug's effect on blood pressure. The H₀ is that
the mean decrease in blood pressure after giving the drug (pre-treatment BP minus
post-treatment BP) is less than or equal to zero (the drug does not reduce BP). The H₁
is that the mean decrease is greater than zero (the drug does reduce BP). Now a Type
I error (claiming the drug reduces BP when it actually does not) is clearly more
dangerous than a Type II error (not finding the drug effective when indeed it is), again
assuming that there are other effective treatments and ignoring things like your boss's
threat to fire you if you don't produce results that support es desire to market the drug.
You would want to set the criterion for α relatively low here.
Directional and Nondirectional Hypotheses
Notice that in these last two examples the H
. When we did a nondirectional test, this was the probability which we doubled
prior to comparing to the criterion for . Since we are now doing a one-tailed test, we
do not double the probability. Not doubling the probability gives us more power, since p
is more likely to be less than or equal to our -criterion if we dont need to double p
before comparing it to . In fact, we could reject the H
might be a Type II
error. Given the Publish or Perish atmosphere at many institutions, researchers may
bias (consciously or not) data collection and analysis. There is also a file drawer
problem. Imagine that each of 20 researchers is independently testing the same true
H₀. Each uses an α-criterion of .05. By chance, we would expect one of the 20 falsely
to reject the H₀. That one would joyfully mail es results off to be published. The other
19 would likely stick their nonsignificant results in a file drawer rather than an
envelope, or, if they did mail them off, they would likely be dismissed as being Type II
errors and would not be published, especially if the current Zeitgeist favored rejection of
that H₀. Keep in mind also that there are H₀ hypotheses
that are almost true, and rejecting them may be as serious as rejecting absolutely true H₀
hypotheses.
Return to Wuensch's Stats Lessons Page
Recommended Reading
Read More About Exact p Values
Much Confusion About p
The History of the .05 Criterion of Statistical Significance
The Most Dangerous Equation: σ(M) = σ/√n
Copyright 2010, Karl L. Wuensch - All rights reserved.
power1.doc
An Introduction to Power Analysis, N = 1
1.
H₀: μ ≤ 100    H₁: μ > 100    N = 1, σ = 15, Normal Distribution
For α = .10, Z critical = 1.28, X critical = 100 + 1.28(15) = 119.2; therefore, we reject H₀
if X ≥ 119.2.
Our chance of rejecting the H₀ is only 10% if the H₀ is true. What if the H₀ is wrong?
Suppose that the true μ is 110. Power = P(X ≥ 119.2 | μ = 110) = P(Z ≥ (119.2 − 110)/15)
= P(Z ≥ .61) = 27%. With the H₀ wrong by
10 points, 2/3 σ, we would have only a 27% chance of rejecting the H₀.
----------------------------------------------------------------------------------------------------------------------------------------
2.
Raise α to .20: Z critical = .84    X critical = 100 + .84(15) = 112.6
Power = P(Z ≥ (112.6 − 110) / 15) = P(Z ≥ .17) = 43%
Increasing α raises power, but at the expense of making Type I errors more
likely.
----------------------------------------------------------------------------------------------------------------------------------------
3.
Compare Nondirectional and Directional Hypotheses
A. Nondirectional: H₀: μ = 100, H₁: μ ≠ 100; with α = .10, reject H₀ if |Z| ≥ 1.645
X critical = 100 + 1.645(15) = 124.68 or 100 − 1.645(15) = 75.32
Assume that the true μ = 110
Power = P(X ≥ 124.68 OR X ≤ 75.32)
P(X ≥ 124.68) = P(Z ≥ (124.68 − 110) / 15) = P(Z ≥ .98) = .1635
P(X ≤ 75.32) = P(Z ≤ (75.32 − 110) / 15) = P(Z ≤ −2.31) = .0104
------
Power = 17%
----------------------------------------------------------------------------------------------------------------------------------------
B. Directional, correct prediction by H₁
See Example 1 on the first page; Power = 27%
C. Directional, incorrect prediction by H₁
H₀: μ ≥ 100    H₁: μ < 100
Z crit = −1.28    X crit = 100 − 1.28(15) = 80.8
Power* = P(X ≤ 80.8 | μ = 110) = P(Z ≤ (80.8 − 110) / 15) = P(Z ≤ −1.95) = 3%
Thus, if you can correctly predict the direction of the difference between the truth and the H₀,
directional hypotheses have a higher probability of rejecting the H₀ than do nondirectional hypotheses.
If you can't, nondirectional hypotheses are more likely to result in rejection of the H₀.
*One could argue that the probability here represented as Power is not Power at all, since the H₀
(μ ≥ 100) is in fact true (μ = 110). Power is the probability of rejecting H₀ given that H₀ is false.
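The power computations in these examples can be verified with a short script (a sketch; the normal CDF is built from `math.erf` rather than looked up in a table):

```python
import math

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu0, mu_true, sigma = 100, 110, 15  # values from the examples above (N = 1)

# Example 1: one-tailed, alpha = .10, reject H0 if X >= 100 + 1.28(15) = 119.2
power_one_tailed = 1 - phi((119.2 - mu_true) / sigma)
print(round(power_one_tailed, 2))  # 0.27

# Example A: nondirectional, alpha = .10, reject H0 if |Z| >= 1.645
hi = mu0 + 1.645 * sigma  # 124.675
lo = mu0 - 1.645 * sigma  # 75.325
power_two_tailed = (1 - phi((hi - mu_true) / sigma)) + phi((lo - mu_true) / sigma)
print(round(power_two_tailed, 2))  # 0.17
```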
Random Variable
A random variable is a real-valued function defined on a sample space.
The sample space is the set of all distinct outcomes possible for an experiment.
Function: two sets (well-defined collections of objects) whose members are paired so that each
member of the one set (the domain) is paired with one and only one member of the other set
(the range), although elements of the range may be paired with more than one element of the
domain.
The domain is the sample space; the range is a set of real numbers. A random variable is the
set of pairs created by pairing each possible experimental outcome with one and only one real
number.
Examples: a.) the outcome of rolling a die: ⚀ = 1, ⚁ = 2, ⚂ = 3, etc. (each outcome is paired
with only one number, and vice versa); b.) ⚀ = 1, ⚁ = 2, ⚂ = 1, etc. (each outcome is paired
with only one number (odd vs. even), but not vice versa); c.) the weight of each student in my
statistics class.
Probability and Probability Experiments
A probability experiment is a well-defined act or process that leads to a single well-defined
outcome. Example: toss a coin (H or T); roll a die.
The probability of an event, P(A), is the fraction of times that event will occur in an indefinitely
long series of trials of the experiment. This may be estimated:
Empirically: conduct the experiment many times and compute P(A) = N(A) / N(total), the sample
relative frequency of A. Roll a die 1000 times; even numbers appear 510 times; P(even) =
510/1000 = .51 or 51%.
Rationally or Analytically: make certain assumptions about the probabilities of the
elementary events included in outcome A and compute probability by rules of probability.
Assume each event 1, 2, 3, 4, 5, 6 on a die is equally likely. The sum of the probabilities of all
possible events must equal one. Then P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6. P(even)
= P(2) + P(4) + P(6) = 1/6 + 1/6 + 1/6 = 1/2 (addition rule) or 50%.
Subjectively: a measure of an individual's degree of belief assigned to a given event in
whatever manner. I think that the probability that ECU will win its opening game of the season
is 1/3 or 33%. This means I would accept 2:1 odds against ECU as a fair bet (if I bet $1 on
ECU and they win, I get $2 in winnings).
Independence, Mutual Exclusion, and Mutual Exhaustion
Two events are independent iff (if and only if) the occurrence or non-occurrence of the one
has no effect on the occurrence or non-occurrence of the other.
Two events are mutually exclusive iff the occurrence of the one precludes occurrence of the
other (both cannot occur simultaneously on any one trial).
Two or more events are mutually exhaustive iff, taken together, they include all of the possible
outcomes.
Now suppose I decide that order is not important, that is, I don't care whether the chocolate is
atop the vanilla or vice versa, etc. This is a combinations problem. Since the number of ways of
arranging X objects is X!, I simply divide the number of permutations by X!. That is, I find
N! / ((N − X)! X!) = 10! / (6! 4!) = (10 · 9 · 8 · 7 · 6!) / (6! · 4 · 3 · 2 · 1) = 210.
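The same count is available in Python's standard library; the sketch below mirrors the divide-the-permutations-by-X! argument:

```python
import math

n, x = 10, 4  # 10 flavors, 4 scoops, order irrelevant

# number of permutations divided by x! equals n! / ((n - x)! * x!)
combinations = math.perm(n, x) // math.factorial(x)
print(combinations)  # 210
assert combinations == math.comb(n, x)  # the library computes the same count
```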
Suppose I am assigning ID numbers to the employees at my ice cream factory. Each employee
will have a two-digit number. How many different two-digit numbers can I generate? A rule I can
apply is C^L, where C is the number of different characters available (10) and L is the length of the ID
number. There are 10² = 100 different ID numbers, but you already knew that. Suppose I decide to
use letters instead. There are 26 different characters in the alphabet we use, so there are 26² = 676
different two-character ID strings. Now suppose I decide to use both letters (A through Z) and numerals (0
through 9). Now there are 36² = 1,296 different two-character ID strings. Now suppose I decide to
use one- and two-character strings. Since there are 36 different one-character strings, I have 1,296 +
36 = 1,332 different ID strings of not more than two characters.
If I up the maximum number of characters to three, that gives me an additional 36³ = 46,656
strings, for a total of 46,656 + 1,332 = 47,988 different strings of one to three characters. Up it to one
to four characters and I have 1,679,616 + 47,988 = 1,727,604. Up it to one to five characters and you
get 60,466,176 + 1,727,604 = 62,193,780. Up it to one to six characters and we have 2,176,782,336
+ 62,193,780 = 2,238,976,116 different strings of from one to six characters. I guess it will be a while
until tinyurl needs to go to seven-character strings for their shortened URLs.
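The running totals above are easy to regenerate (a sketch assuming the 36-character alphabet used in the text):

```python
chars = 36  # A through Z plus 0 through 9

total = 0
for length in range(1, 7):
    total += chars ** length
    print(length, total)  # cumulative count of strings of length 1 .. length
```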
Probability FAQ - answers to frequently asked questions.
Return to Wuensch's Stats Lessons
Copyright 2012, Karl L. Wuensch - All rights reserved.
Binomial.docx
Testing Hypotheses with the Binomial Probability Distribution
A Binomial Experiment:
consists of n identical trials.
each trial results in one of two outcomes, a success or a failure.
the probabilities of success ( p ) and of failure ( q = 1 - p ) are constant across
trials.
trials are independent, not affected by the outcome of other trials.
Y is the number of successes in n trials.
P(Y = y) = [n! / (y! (n − y)!)] pʸ qⁿ⁻ʸ
P(Y = y) may also be determined by reference to a binomial table.
The binomial distribution has:
o mean μ = np
o variance σ² = npq
Testing Hypotheses
State null and alternative hypotheses
o the null hypothesis specifies the value of some population parameter. For
example, p = .5 (two-tailed, nondirectional; this coin is fair) or p ≤ .25
(one-tailed, directional; the student is merely guessing on a 4-choice multiple-
choice test).
o the alternative hypothesis, which the researcher often wants to support, is
the antithetical complement of the null. For example, p ≠ .5 (two-tailed,
the coin is biased) or p > .25 (one-tailed, the student is not merely
guessing, e knows the tested material).
Specify the sampling (probability) distribution and the test statistic (Y). Example:
the binomial distribution describes the probability that a single sample of n trials
would result in (Y = y) successes (if assumptions of binomial are true).
Set alpha at a level determined by how great a risk of a Type I error (falsely
rejecting a true null) you are willing to take. Traditional values of alpha are .05
and .01.
o For a one-tailed test
H₀: p ≤ .5; H₁: p > .5; n = 25, Y = 18;
P(Y ≥ 18 | n = 25, p = .5) = .022
APA-style summary: Mothers were allowed to smell two
articles of infants' clothing and asked to pick the one which
was their infant's. They were successful in doing so 72% of
the time, significantly more often than would be expected by
chance, exact binomial p (one-tailed) = .022.
H₀: p ≥ .5; H₁: p < .5; n = 25, Y = 18;
P(Y ≤ 18 | n = 25, p = .5) = .993 (note that the direction of
the P(Y ≤ y) matches that of H₁)
o For a two-tailed test
Compute a one-tailed P and double it.
H₀: p = .5; H₁: p ≠ .5; n = 25, Y = 18;
2P(Y ≥ 18) = 2(.022) = .044
H₀: p = .5; H₁: p ≠ .5; n = 25, Y = 7;
2P(Y ≤ 7) = 2(.022) = .044 (the direction of the P(Y ≤ y) is that which
gives the smaller p value; P(Y ≥ 7) = .993 and 2(.993) = 1.986,
obviously not a possible p).
If p ≤ alpha, reject H₀.
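The exact binomial p values used above can be computed directly (a sketch using only the standard library):

```python
from math import comb

def binom_p_upper(y, n, p=0.5):
    """Exact P(Y >= y | n, p), the one-tailed upper binomial probability."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(y, n + 1))

# Mothers identifying their infant's clothing: Y = 18 successes in n = 25 trials
p_one = binom_p_upper(18, 25)
print(round(p_one, 3))  # 0.022
# Two-tailed p: double the one-tailed value. This prints 0.043; the text's
# .044 comes from doubling the already-rounded .022.
print(round(2 * p_one, 3))  # 0.043
```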
If one has a relatively large sample (large enough to use a normal approximation
of the binomial parameter p), then one can construct a confidence interval about one's
estimate of the population proportion by using the following formula: p ± Z(α/2) √(pq/n). For
example, suppose we wish to estimate the proportion of persons who would vote for a
guilty verdict in a particular sexual harassment case. We shall use the data from a
study by Egbert, Moore, Wuensch, and Castellow (1992, Journal of Social Behavior
and Personality, 7: 569-579). Of 160 mock jurors of both sexes, 105 voted guilty and
55 voted not guilty. Our point estimate of the population proportion is simply our
sample proportion, 105 / 160 = .656. Is n large enough (given p and q) to use our
normal approximation, that is, is np ± 2√(npq) (which is essentially a 95% confidence
interval for the number of successes) within 0 and n? If we construct a 95% confidence
interval for p and the interval is within 0 and 1, then the normal approximation is OK. For a
95% confidence interval we compute:
.656 ± 1.96 √(.656(.344)/160) = .656 ± .074 → .582, .730.
Suppose we look at the proportions separately for female and male jurors.
Among the 80 female jurors, 58 (72.5%) voted guilty. For a 95% confidence interval we
compute: .725 ± 1.96 √(.725(.275)/80) = .725 ± .098 → .627, .823.
Among the 80 male jurors, 47 (58.8%) voted guilty. For a 95% confidence
interval we compute: .588 ± 1.96 √(.588(.412)/80) = .588 ± .108 → .480, .696. Do notice
that the confidence interval for the male jurors overlaps the confidence interval for the
female jurors.
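These three intervals can be reproduced with a small helper (a sketch; tiny last-digit differences from the text arise because the text rounds the sample proportion to three decimals before computing):

```python
import math

def prop_ci(successes, n, z=1.96):
    """Normal-approximation CI for a proportion: p-hat +/- z * sqrt(pq/n)."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return round(p - half, 3), round(p + half, 3)

print(prop_ci(105, 160))  # all jurors, near (.582, .730)
print(prop_ci(58, 80))    # female jurors: (0.627, 0.823)
print(prop_ci(47, 80))    # male jurors, near (.480, .696)
```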
There are several online calculators that will construct a confidence interval
around a proportion or percentage. Try the one at
http://www.dimensionresearch.com/resources/calculators/conf_prop.html . In Step 1
select the desired degree of confidence (95%). In Step 2 enter the total sample size.
In Step 3 enter the number of successes or the percentage of successes. Click
Calculate and you get the confidence interval for the percentage. If you prefer a
Bayesian approach, try the calculator at
http://www.causascientia.org/math_stat/ProportionCI.html .
The probability density function defining the chi-square distribution is given in the
chapter on Chi-square in Howell's text. Do not fear, we shall not have to deal directly
with that formula. You should know, however, that given that function, the mean of the
chi-square distribution is equal to its degrees of freedom and the variance is twice the
mean.
The chi-square distribution is closely related to the normal distribution. Imagine
that you have a normal population. Sample one score from the normal population and
compute Z² = (Y − μ)² / σ². Record that Z² and then sample another score, compute and
record another Z², repeating this process an uncountably large number of times. The
resulting distribution is a chi-square distribution on one degree of freedom.
Now, sample two scores from that normal distribution. Convert each into Z² and
then sum the two scores. Record the resulting sum. Repeat this process an
uncountably large number of times and you have constructed the chi-square distribution
on two degrees of freedom. If you used three scores in each sample, you would have
chi-square on three degrees of freedom. In other words,
χ²(n) = Σᵢ₌₁ⁿ Zᵢ² = Σ(Y − μ)² / σ².
Now, from the definition of variance, you know that the
numerator of this last expression, Σ(Y − μ)², is the sum of squares, the numerator of
the ratio we call a variance (sum of squares divided by n). From sample data we
estimate the population variance with the sample sum of squares divided by degrees of
freedom, (n − 1). That is, s² = Σ(Y − Ȳ)² / (n − 1). Multiplying both sides of this expression by (n
− 1), we see that Σ(Y − Ȳ)² = (n − 1)s². Taking our chi-square formula and substituting
(n − 1)s² for Σ(Y − Ȳ)², we obtain χ² = (n − 1)s² / σ², which can be useful for testing null
hypotheses about variances. You could create a chi-square distribution using this
modified formula: for chi-square on (n − 1) degrees of freedom, sample n scores from a
normal distribution, compute the sum of squares of that sample, divide by the known
population variance, and record the result. Repeat this process an uncountably large
number of times.
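The construction described above is easy to simulate (a sketch; with a fixed seed the sample mean and variance come out close to df and 2·df, the values given for the distribution's mean and variance):

```python
import random

random.seed(42)

def chisq_draw(df):
    """One draw from chi-square(df): the sum of df squared standard normals."""
    return sum(random.gauss(0, 1) ** 2 for _ in range(df))

df, reps = 10, 20000
draws = [chisq_draw(df) for _ in range(reps)]
mean = sum(draws) / reps
var = sum((x - mean) ** 2 for x in draws) / reps
print(round(mean, 1), round(var, 1))  # close to 10 and 20
```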
Given that the chi-square distribution is a sum of squared z-scores, and knowing
what you know about the standard normal distribution (mean and median are zero), for
chi-square on one df, what is the most probable value of chi-square (0)? What is the
smallest possible value (0)? Is the distribution skewed? In what direction (positive)?
Now, consider chi-square on 10 degrees of freedom. The only way you could get
a chi-square of zero is if each of the 10 squared z-scores were exactly zero. While zero
is still the most likely value for z from the standard normal distribution, it is not likely that
all 10 z-scores would be exactly zero.
Suppose we wish to test the null hypothesis that the variance in the heights of
male high school varsity basketball players is at least as great as that of the general
population of adult men, σ² = 6.25, based on a sample of N = 31 players:
H₀: σ² ≥ 6.25    H₁: σ² < 6.25
We compute the sample variance and find it to be 4.55. We next compute the
value of the test statistic, chi-square. [If we were repeatedly to sample 31 scores from
a normal population with a variance of 6.25 and on each compute (N − 1)S² / 6.25, we
would obtain the chi-square distribution on 30 df.]
χ² = df(S²) / σ², where df = N − 1
χ² = 30(4.55) / 6.25 = 21.84
The expected value of the chi-square (the mean of the sampling distribution)
were the null hypothesis true is its degrees of freedom, 30. Our computed chi-square
is less than that, but is it enough less than that for us to be confident in rejecting the null
hypothesis? We now need to obtain the p value. Since our alternative hypothesis
specified a < sign, we need to find P(χ² < 21.84 | df = 30). We go to the chi-square table,
which is a one-tailed, upper-tailed table (in Howell). For 30 df, 21.84 falls between
20.60, which marks off the upper 90%, and 24.48, which marks off the upper 75%.
Thus, the upper-tailed p is .75 < p < .90. But we need a lower-tailed p, given our
alternative hypothesis. To obtain the desired lower-tailed p, we simply subtract the
upper-tailed p from unity, obtaining .10 < p < .25. [If you integrate the chi-square
distribution you obtain the exact p = .14.] Using the traditional .05 criterion, we are
unable to reject the null hypothesis.
Our APA-style summary reads: A one-tailed chi-square test indicated that the
heights of male high school varsity basketball players (s² = 4.55) were not significantly
less variable than those of the general population of adult men (σ² = 6.25), χ²(30, N =
31) = 21.84, p = .14. I obtained the exact p from SAS. Note that I have specified the
variable (height), the subjects (basketball players), the status of the null hypothesis (not
rejected), the nature of the test (directional), the parameter of interest (variance), the
value of the relevant sample statistic (s²), the test statistic (χ²), the degrees of freedom
and N, the computed value of the test statistic, and an exact p. The phrase not
significantly less implies that I tested directional hypotheses, but I chose to be explicit
about having conducted a one-tailed test.
For a two-tailed test of nondirectional hypotheses, one simply doubles the
one-tailed p. If the resulting two-tailed p comes out above 1.0, as it would if you
doubled the upper-tailed p from the above problem, then you need to work with the
(doubled) p from the other tail. For the above problem the two-tailed p is .20 < p < .50.
An APA summary statement would read: A two-tailed chi-square test indicated that the
variance of male high school varsity basketball players' heights (s² = 4.55) was not
significantly different from that of the general population of adult men (σ² = 6.25), χ²(30,
N = 31) = 21.84, p = .28. Note that with a nonsignificant result my use of the phrase
not significantly different implies nondirectional hypotheses.
Suppose we were testing the alternative hypothesis that the population variance
is greater than 6.25. Assume we have a sample of 101 heights of men who have been
diagnosed as having one or more of several types of pituitary dysfunction. The
obtained sample variance is 7.95, which differs from 6.25 by the same amount, 1.7, that
our previous sample variance, 4.55, did, but in the opposite direction. Given our larger
sample size this time, we should expect to have a better chance of rejecting the null
hypothesis. Our computed chi-square is 127.2, yielding an (upper-tail) p of .025 < p <
.05, enabling us to reject the null hypothesis at the .05 level. Our APA-style summary
statement reads: A one-tailed chi-square test indicated that the heights of men with
pituitary dysfunction (s² = 7.95) were significantly more variable than those of the
general population of men (σ² = 6.25), χ²(100, N = 101) = 127.2, p = .034. Since I
rejected the null hypothesis (a significant result), I indicated the direction of the
obtained effect (significantly more variable than ...). Note that if we had used
nondirectional hypotheses our two-tailed p would be .05 < p < .10 and we could not
reject the null hypothesis with the usual amount of confidence (.05 criterion for α). In
that case my APA-style summary statement would read: A two-tailed chi-square test
indicated that the variance in the heights of men with pituitary dysfunction (s² = 7.95)
was not significantly different from that of the general population of men (σ² = 6.25),
χ²(100, N = 101) = 127.2, p = .069.
We can also place confidence limits on our estimation of a population variance.
For a 100(1 − α)% confidence interval for the population variance, compute:
( (N − 1)s² / b , (N − 1)s² / a )
where a and b are the α/2 and 1 − (α/2) fractiles of the chi-square distribution on
(N − 1) df. For example, for our sample of 101 pituitary patients, for a 90% confidence
interval, the .05 fractile (the value of chi-square marking off the lower 5%) is 77.93, and
the .95 fractile is 124.34. The confidence interval is 100(7.95)/124.34, 100(7.95)/77.93,
or 6.39 to 10.20. In other words, we are 90% confident that the population variance is
between 6.39 and 10.20. Technically, the interpretation of the confidence coefficient
(90%) is this: were we to repeatedly draw random samples and for each construct a
90% confidence interval, 90% of those intervals would indeed include the true value of
the estimated parameter (in this case, the population variance).
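Both the test statistic and the interval above can be checked in a few lines (a sketch; the fractiles 77.93 and 124.34 are taken from the text rather than computed):

```python
# Chi-square test of H0: sigma^2 >= 6.25 for the basketball-player sample
n, s2, sigma2 = 31, 4.55, 6.25
chi2 = (n - 1) * s2 / sigma2
print(round(chi2, 2))  # 21.84

# 90% CI for the variance of the pituitary sample (n = 101, s2 = 7.95);
# 77.93 and 124.34 are the .05 and .95 fractiles of chi-square on 100 df
n2, s2_2 = 101, 7.95
lower = (n2 - 1) * s2_2 / 124.34
upper = (n2 - 1) * s2_2 / 77.93
print(round(lower, 2), round(upper, 2))  # 6.39 10.2
```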
Please note that the application of chi-square for tests about variances is not
robust to violation of the normality assumption made when using such applications. When a
statistic is robust to violation of one of its assumptions, one can violate that
assumption considerably and still have a valid test.
Chi-Square Approximation of the Binomial Distribution
χ²(1) = (Y − μ)² / σ², where Y is from a normal population.
Consider Y = # of successes in a binomial experiment. With n large enough that
np ± 2√(npq) falls between 0 and n, the binomial distribution should be approximately
normal. Thus,
χ²(1) = (Y − np)² / (npq),
which can be shown to equal
(Y − np)² / (np) + [(n − Y) − nq]² / (nq).
Here is a proof (the not-so-obvious algebra referred to by Howell):
(n − Y) − nq = n − Y − n(1 − p) = np − Y, and since (a − b)² = (b − a)²,
[(n − Y) − nq]² = (np − Y)² = (Y − np)².
Thus,
(Y − np)² / (np) + [(n − Y) − nq]² / (nq) = q(Y − np)² / (npq) + p(Y − np)² / (npq)
= (Y − np)² / (npq), since q + p = 1.
Substituting O₁ for the number of successes, O₂ for the number of failures, and
E₁ = np and E₂ = nq for the expected frequencies,
χ²(1) = (O₁ − E₁)² / E₁ + (O₂ − E₂)² / E₂ = Σ (O − E)² / E.
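The algebra above can be checked numerically. This quick sketch simply verifies the identity for every possible Y at one arbitrary choice of n and p:

```python
# Verify (Y - np)^2/(npq) = (Y - np)^2/np + ((n - Y) - nq)^2/nq for all Y
n, p = 20, 0.3
q = 1 - p
for Y in range(n + 1):
    lhs = (Y - n * p) ** 2 / (n * p * q)
    rhs = (Y - n * p) ** 2 / (n * p) + ((n - Y) - n * q) ** 2 / (n * q)
    assert abs(lhs - rhs) < 1e-9
```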
The Correction for Continuity (Yates Correction) When Using Chi-square to
Approximate a Binomial Probability
Suppose that we wish to test the null hypothesis that 50% of ECU students favor
tuition increases to fund the acquisition of additional computers for student use at ECU.
The data are: in a random sample of three, not a single person favors the increase.
The null hypothesis is that binomial p = .50. The two-tailed exact significance level
(using the multiplication rule of probability) is 2 × .5³ = .25.
Using the chi-square distribution to approximate this binomial probability,
χ² = Σ (O − E)² / E = (0 − 1.5)² / 1.5 + (3 − 1.5)² / 1.5 = 3.00, p = .0833, not a very good
approximation. Remember that a one-tailed p is appropriate for nondirectional
hypotheses with this test, since the computed chi-square increases with increasing
(O - E) regardless of whether O > E or O < E.
Using the chi-square distribution with Yates correction for continuity:
χ² = Σ (|O − E| − .5)² / E = (|0 − 1.5| − .5)² / 1.5 + (|3 − 1.5| − .5)² / 1.5 = 1.33,
p = .25, a much better approximation.
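The three p values above can be reproduced side by side. A hedged Python sketch (the handout itself does this by hand); it assumes SciPy ≥ 1.7 for `binomtest`:

```python
from scipy.stats import binomtest, chi2

observed = [0, 3]       # 0 favor the increase, 3 disfavor
expected = [1.5, 1.5]   # under the null, binomial p = .50

x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
p_plain = chi2.sf(x2, df=1)            # ~.0833, a poor approximation

x2_yates = sum((abs(o - e) - 0.5) ** 2 / e for o, e in zip(observed, expected))
p_yates = chi2.sf(x2_yates, df=1)      # ~.248, much closer to exact

p_exact = binomtest(0, 3, 0.5).pvalue  # exact two-tailed p = .25
print(p_plain, p_yates, p_exact)
```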
Half-Tailed Tests
Suppose that you wanted to test directional hypotheses, with the alternative
hypothesis being that fewer than 50% of ECU students favor the increased tuition. For
the binomial p you would simply not double the one-tailed P(Y ≤ 0). For a directional
chi-square, with the direction correctly predicted in the alternative hypothesis, you take
the one-tailed p that is appropriate for a nondirectional test and divide it by the number
of possible orderings of the categorical frequencies. For this problem, we could have
had more favor than disfavor or more disfavor than favor, two possible orderings. This
is really just an application of the multiplication rule of probability. The one-tailed p₁
gives you the conditional probability of obtaining results as or more discrepant with the
null than are those you obtained. The probability of correctly guessing the direction of
the outcome, p₂, is 1/2. The joint probability of getting results as unusual as those you
obtained AND in the predicted direction is p₁p₂.
One-Sixth Tailed Tests
What if there were three categories, favor, disfavor, and don't care, and you
correctly predicted that the greatest number of students would disfavor, the next
greatest number would not care, and the smallest number would favor? [The null
hypothesis from which you would compute expected frequencies would be that 1/3
favor, 1/3 disfavor, and 1/3 don't care.] In that case you would divide your one-tailed p
by 3! = 6, since there are 6 possible orderings of three things.
The basic logic of the half-tailed and sixth-tailed tests presented here was
outlined by David Howell in the fourth edition of his Statistical Methods for Psychology
text, page 155. It can be generalized to other situations, for example, a one-way
ANOVA where one predicts a particular ordering of the group means.
Multicategory One-Way Chi Square
Suppose we wish to test the null hypothesis that Karl Wuensch gives twice as
many C's as B's, twice as many B's as A's, just as many D's as B's, and just as many
F's as A's in his undergraduate statistics classes. We decide on a nondirectional test
using a .05 criterion of significance. The observed frequencies are: A: 6, B: 24, C: 50,
D: 10, F: 10. Under this null hypothesis, given a total N of 100, the expected
frequencies are: 10, 20, 40, 20, 10, and χ² = 1.6 + 0.8 + 2.5 + 5 + 0 = 9.9; df = K − 1 = 4,
p = .042. We reject the null hypothesis.
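The goodness-of-fit computation above can be sketched in one call. A hedged Python illustration (SciPy assumed; the handout does not use Python):

```python
from scipy.stats import chisquare

observed = [6, 24, 50, 10, 10]    # A, B, C, D, F
expected = [10, 20, 40, 20, 10]   # H0 ratios 1:2:4:2:1 over N = 100
result = chisquare(observed, f_exp=expected)
print(result.statistic, result.pvalue)  # 9.9 and about .042
```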
There are additional analyses you could do to determine which parts of the null
hypothesis are (significantly) wrong. For example, under the null hypothesis one
expects that 10% of the grades will be A's. Six A's were observed. You could do a
binomial test of the null hypothesis that the proportion of A's is .10. Your two-tailed p
would be two times the probability of obtaining 6 or fewer A's if n = 100 and p = .10.
As an example of another approach, you could test the hypothesis that there are twice
as many C's as B's. Restricting your attention to the 50 + 24 = 74 C's and B's, you
would expect 2/3(74) = 49.33 C's and 1/3(74) = 24.67 B's. A one df chi-square (or an
exact binomial test) could be used to test this part of the omnibus null hypothesis.
Pearson Chi-Square Test for Contingency Tables.
For the dichotomous variables A and B, consider the below joint frequency
distribution [joint frequencies in the cells, marginal frequencies in the margins]. Imagine
that your experimental units are shoes belonging to members of a commune, that
variable A is whether the shoe belongs to a woman or a man, and that variable B is
whether the shoe has or has not been chewed by the dog that lives with the commune.
One of my graduate students actually had data like these for her 6430 personal data set
years ago. The observed cell counts are in bold font.
A = Gender of Shoe Owner
B = Chewed? Female Male
Yes 10 (15) 20 (15) 30
No 40 (35) 30 (35) 70
50 50 100
We wish to test the null hypothesis that A is independent of (not correlated with) B.
The marginal probabilities of being chewed are .3 chewed, .7 not. The marginal
probabilities for gender of the owner are .5, .5.
Using the multiplication rule to find the joint probability of (A = a) and (B = b),
assuming independence of A and B (the null hypothesis), we obtain .5(.3) = .15 and
.5(.7) = .35.
Multiplying each of these joint probabilities by the total N, we obtain the expected
frequencies, which I have entered in the table in parentheses. A short cut method to get
these expected frequencies is: For each cell, multiply the row marginal frequency by
the column marginal frequency and then divide by the total table N. For example, for
the upper left cell, E = 30(50)/100 = 15.
χ² = Σ (O − E)² / E = (10 − 15)² / 15 + (20 − 15)² / 15 + (40 − 35)² / 35 + (30 − 35)² / 35
= 4.762.
Shoes owned by male members of the commune were significantly more likely to
be chewed by the dog (40%) than were shoes owned by female members of the
commune (20%), χ²(1, N = 100) = 4.762, p = .029, odds ratio = 2.67, 95% CI [1.09,
6.02].
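The table analysis above can be reproduced with a contingency-table routine. A hedged Python sketch (SciPy assumed), shown without the Yates correction to match the hand computation:

```python
from scipy.stats import chi2_contingency

#            Female  Male
table = [[10, 20],   # chewed
         [40, 30]]   # not chewed

stat, p, df, expected_counts = chi2_contingency(table, correction=False)
print(stat, p)                      # 4.762 and about .029

odds_ratio = (20 / 30) / (10 / 40)  # male odds of a chewed shoe / female odds
print(odds_ratio)                   # about 2.67
```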
Yates Correction in 2 x 2 Contingency Tables
Don't make this correction unless you find yourself in the situation of having both
sets of marginals fixed rather than random. By fixed marginals, I mean that if you were
to repeat the data collection the marginal probabilities would be exactly the same. This
is almost never the case. There is one circumstance when it would be the case:
suppose that you dichotomized two continuous variables using a median split and then
ran a 2 x 2 chi-square. On each of the dichotomous variables each marginal probability
would be .5, and that would remain unchanged if you gathered the data a second time.
Misuses of the Pearson Chi-square
Independence of Observations. The observations in a contingency table
analyzed with the chi-square statistic are assumed to be independent of one another. If
they are not, the chi-square test is not valid. A common way in which this assumption is
violated is to count subjects in more than one cell. When I was studying ethology at
Miami University I attended a paper session where a graduate student was looking at
how lizards move in response to lighting conditions. He had a big terrarium with three
environmentally different chambers. Each day he counted how many lizards were in
each chamber and he repeated this observation each night. He conducted a Time of
Day x Chamber chi-square. Since each lizard was counted more than once, this
analysis was invalid.
Inclusion of Nonoccurrences. Every subject must be counted once and only
once in your contingency table. When dealing with a dichotomous variable, an ignorant
researcher might do a one-way analysis, excluding observations at one of the levels of
the dichotomous variable. Here is the Permanent Daylight Savings Time Attitude x
Rural/Urban example in Howell.
Twenty urban residents and twenty rural residents are asked whether or not they
favor making DST permanent, rather than changing to and from it annually: 17 rural
residents favor making DST permanent, 11 urban residents do. An inappropriate
analysis is a one-way χ² with expected probability of favoring DST the same for rural as
for urban residents.
        O    E    (|O − E| − .5)² / E
Rural   17   14   .4464
Urban   11   14   .4464
χ²(1, N = 28) = 0.893, p = .35
The appropriate analysis would include those who disfavor permanent DST.
            Favor Permanent DST
Residence    No    Yes
Rural         3    17
Urban         9    11
χ²(1, N = 40) = 4.29, p = .038
Normality. For the binomial or multinomial distribution to be approximately
normal, the sample size must be fairly large. Accordingly, there may be a problem with
chi-square tests done with small cell sizes. Your computer program may warn you if
many of the expected frequencies are small. You may be able to eliminate small
expected frequencies by getting more data, collapsing across (combining) categories, or
eliminating a category. Please do note that the primary effect of having small expected
frequencies is a reduction in power. If your results are significant in spite of having
small expected frequencies, there really is no problem, other than your being less
precise when specifying the magnitude of the effect than you would be if you had more
data.
Likelihood Ratio Tests
In traditional tests of significance, one obtains a significance level by computing
the probability of obtaining results as or more discrepant with the null hypothesis than
are those which were obtained. In a likelihood ratio test the approach is a bit different.
We obtain two likelihoods: The likelihood of getting the data that we did obtain were the
null hypothesis true, and the likelihood of getting the data we got under the exact
alternative hypothesis that would make our sample data as likely as possible. For
example, if we were testing the null hypothesis that half of the students at ECU are
female, p = .5, and our sample of 100 students included 65 women, then the alternative
hypothesis would be p = .65. When the alternative likelihood is much greater than the
null likelihood, we reject the null. We shall encounter such tests when we study
log-linear models next semester, which we shall employ to conduct multidimensional
contingency table analysis (where we have more than two categorical variables in our
contingency table).
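For the binomial example just given, the likelihood ratio statistic can be sketched as the familiar G statistic, G = 2 Σ O ln(O/E), which equals twice the log of the ratio of the two likelihoods. A hedged Python illustration (SciPy assumed for the p value):

```python
from math import log
from scipy.stats import chi2

# 65 women in a sample of 100; H0: p = .5 (expected 50/50).
# The alternative uses p-hat = .65, which maximizes the likelihood of the data.
observed = [65, 35]
expected = [50, 50]
G = 2 * sum(o * log(o / e) for o, e in zip(observed, expected))
p = chi2.sf(G, df=1)
print(G, p)  # G is about 9.14; the null is rejected
```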
Strength of Effect Estimates
I find phi an appealing estimate of the magnitude of effect of the relationship
between two dichotomous variables, and Cramér's phi appealing for use with tables
where at least one of the variables has more than two levels.
Odds ratios can also be very useful. Consider the results of some of my
research on attitudes about animals (Wuensch, K. L., & Poteat, G. M. Evaluating the
morality of animal research: Effects of ethical ideology, gender, and purpose. Journal of
Social Behavior and Personality, 1998, 13, 139-150). Participants were pretending to be
members of a university research ethics committee charged with deciding whether or
not to stop a particular piece of animal research which was alleged, by an animal rights
group, to be evil. After hearing the evidence and arguments of both sides, 140 female
participants decided to stop the research and 60 decided to let it continue. That is, the
odds that a female participant would stop the research were 140/60 = 2.33. Among
male participants, 47 decided to stop the research and 68 decided to let it continue, for
odds of 47/68 = 0.69. The ratio of these two odds is 2.33 / .69 = 3.38. In other words,
the women were more than 3 times as likely as the men to decide to stop the research.
Why form ratios of odds rather than ratios of probabilities? See my document
Odds Ratios and Probability Ratios.
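The odds-ratio arithmetic above is simple enough to sketch directly:

```python
# Odds ratio from the animal-research decisions reported above
odds_women = 140 / 60   # women: stop vs continue -> about 2.33
odds_men = 47 / 68      # men: stop vs continue -> about 0.69
odds_ratio = odds_women / odds_men
print(odds_ratio)       # about 3.38
```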
The Cochran-Mantel-Haenzel Statistic
New to the 7th edition of David Howell's Statistical Methods for Psychology
(2010) is coverage of the CMH statistic (pages 157-159). Howell provides data from a
1973 study of sex discrimination in graduate admissions at UC Berkeley. In Table
6.8 are data for each of six academic departments. For each department we have the
frequencies for a 2 x 2 contingency table, sex/gender of applicant by admissions
decision. At the bottom of this table are the data for a contingency table collapsing
across departments B through F and excluding data from department A. The data from
A were excluded because the relationship between sex and decision differed notably in
this department from what it was in the other departments.
The contingency tables for departments B through F are shown below. To the
right of each I have typed in the odds ratio showing how much more likely women were
to be admitted (compared to men). Notice that none of these differs much from 1 (men
and women admitted at the same rates).
The FREQ Procedure
Table 1 of Sex by Decision
Controlling for Dept=B
Sex Decision
Frequency
Row Pct Accept Reject Total
F 17 8 25 OR = (17/8)/(353/207)
68.00 32.00 = 1.25
M 353 207 560
63.04 36.96
Total 370 215 585
Table 2 of Sex by Decision
Controlling for Dept=C
Sex Decision
Frequency
Row Pct Accept Reject Total
F 202 391 593 OR = 0.88
34.06 65.94
M 120 205 325
36.92 63.08
Total 322 596 918
Table 3 of Sex by Decision
Controlling for Dept=D
Sex Decision
Frequency
Row Pct Accept Reject Total
F 131 244 375 OR = 1.09
34.93 65.07
M 138 279 417
33.09 66.91
Total 269 523 792
-------------------------------------------------------------------------------------------------
Table 4 of Sex by Decision
Controlling for Dept=E
Sex Decision
Frequency
Row Pct Accept Reject Total
F 94 299 393 OR = 0.82
23.92 76.08
M 53 138 191
27.75 72.25
Total 147 437 584
Table 5 of Sex by Decision
Controlling for Dept=F
Sex Decision
Frequency
Row Pct Accept Reject Total
F 24 317 341 OR = 1.21
7.04 92.96
M 22 351 373
5.90 94.10
Total 46 668 714
-------------------------------------------------------------------------------------------------
The CMH statistic is designed to test the hypothesis that there is no relationship
between rows and columns when you average across two or more levels of a third
variable (departments in this case). As you can see below, the data fit well with that
null.
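The CMH statistic can be sketched by hand from the five departmental tables. A hedged Python illustration (SciPy assumed for the p value); for each stratum it accumulates the deviation of the upper-left cell from its expectation and the hypergeometric variance of that cell:

```python
from scipy.stats import chi2

# Each table: [[F_accept, F_reject], [M_accept, M_reject]]
tables = [
    [[17, 8], [353, 207]],     # Dept B
    [[202, 391], [120, 205]],  # Dept C
    [[131, 244], [138, 279]],  # Dept D
    [[94, 299], [53, 138]],    # Dept E
    [[24, 317], [22, 351]],    # Dept F
]

num = 0.0  # sum over strata of (observed - expected) for the upper-left cell
den = 0.0  # sum over strata of the hypergeometric variance of that cell
for (a, b), (c, d) in tables:
    n = a + b + c + d
    num += a - (a + b) * (a + c) / n
    den += (a + b) * (c + d) * (a + c) * (b + d) / (n ** 2 * (n - 1))

cmh = num ** 2 / den  # no continuity correction, matching SAS's table scores
p = chi2.sf(cmh, df=1)
print(cmh, p)         # about 0.125 and .724, agreeing with the SAS output
```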
Summary Statistics for Sex by Decision
Controlling for Dept
Cochran-Mantel-Haenszel Statistics (Based on Table Scores)
Statistic Alternative Hypothesis DF Value Prob
1 Nonzero Correlation 1 0.1250 0.7237
2 Row Mean Scores Differ 1 0.1250 0.7237
3 General Association 1 0.1250 0.7237
Estimates of the Common Relative Risk (Row1/Row2)
Type of Study Method Value 95% Confidence Limits
Case-Control Mantel-Haenszel 0.9699 0.8185 1.1493
(Odds Ratio) Logit 0.9689 0.8178 1.1481
The Breslow-Day test is for the null hypothesis that the odds ratios do not differ
across levels of the third variable (department). As you can see below, that null is
retained here.
Breslow-Day Test for
Homogeneity of the Odds Ratios
Chi-Square 2.5582
DF 4
Pr > ChiSq 0.6342
Total Sample Size = 3593
-------------------------------------------------------------------------------------------------
Below is a contingency table analysis on the aggregated data (collapsed across
departments B through F). As you can see, these data indicate that there is significant
sex bias against women: the odds of a woman being admitted are significantly less than
the odds of a man being admitted.
Sex Decision
Frequency
Row Pct Accept Reject Total
F 508 1259 1767 OR = 0.69
28.75 71.25
M 686 1180 1866
36.76 63.24
Total 1194 2439 3633
Statistics for Table of Sex by Decision
Statistic DF Value Prob
Chi-Square 1 26.4167 <.0001
Likelihood Ratio Chi-Square 1 26.4964 <.0001
Phi Coefficient -0.0853
Sample Size = 3633
-------------------------------------------------------------------------------------------------
If we include department A in the analysis, we see that in that department there
was considerable sex bias in favor of women.
Table 1 of Sex by Decision
Controlling for Dept=A
Sex Decision
Frequency
Row Pct Accept Reject Total
F 89 19 108 OR = 2.86
82.41 17.59
M 512 313 825
62.06 37.94
Total 601 332 933
The CMH still falls short of significance, but
The FREQ Procedure
Summary Statistics for Sex by Decision
Controlling for Dept (A through F)
Cochran-Mantel-Haenszel Statistics (Based on Table Scores)
Statistic Alternative Hypothesis DF Value Prob
1 Nonzero Correlation 1 1.5246 0.2169
2 Row Mean Scores Differ 1 1.5246 0.2169
3 General Association 1 1.5246 0.2169
Estimates of the Common Relative Risk (Row1/Row2)
Type of Study Method Value 95% Confidence Limits
Case-Control Mantel-Haenszel 1.1053 0.9431 1.2955
(Odds Ratio) Logit 1.0774 0.9171 1.2658
notice that the Breslow-Day test now tells us that the odds ratios differ significantly
across departments.
Breslow-Day Test for
Homogeneity of the Odds Ratios
Chi-Square 18.8255
DF 5
Pr > ChiSq 0.0021
Total Sample Size = 4526
Here are the data aggregated across departments A through F. Note that these
aggregated data indicate significant sex bias against women.
Table of Sex by Decision
Sex Decision
Frequency
Row Pct Accept Reject Total
F 597 1278 1875 OR = 0.58
31.84 68.16
M 1198 1493 2691
44.52 55.48
Total 1795 2771 4566
Statistics for Table of Sex by Decision
Statistic DF Value Prob
Chi-Square 1 74.4567 <.0001
Likelihood Ratio Chi-Square 1 75.2483 <.0001
Cramer's V -0.1277
Howell mentions Simpson's paradox in connection with these data. Simpson's
paradox is said to have taken place when the direction of the association between two
variables (in this case, sex and admission) is in one direction at each level of a third
variable, but when you aggregate the data (collapse across levels of the third variable)
the direction of the association changes.
We shall see Simpson's paradox (also known as a reversal paradox) in other
contexts later, including ANOVA and multiple regression. See The Reversal Paradox
(Simpson's Paradox) and the SAS code used to produce the output above.
Kappa
If you wish to compute a measure of the extent to which two judges agree when
making categorical decisions, kappa can be a useful statistic, since it corrects for the
spuriously high apparent agreement that otherwise results when marginal probabilities
differ from one another considerably.
For example, suppose that each of two persons were observing children at play
and at a designated time or interval of time determining whether or not the target child
was involved in a fight. Furthermore, if the rater decided a fight was in progress, the
target child was classified as being the aggressor or the victim. Consider the following
hypothetical data:
                        Rater 2
Rater 1     No Fight    Aggressor   Victim     marginal
No Fight    70 (54.75)  3           2          75
Aggressor   2           6 (2.08)    5          13
Victim      1           7           4 (1.32)   12
marginal    73          16          11         100
The percentage of agreement here is pretty good, (70 + 6 + 4) / 100 = 80%, but
not all is rosy here. The raters have done a pretty good job of agreeing regarding
whether the child is fighting or not, but there is considerable disagreement between the
raters with respect to whether the child is the aggressor or the victim.
Jacob Cohen developed a coefficient of agreement, kappa, that corrects the
percentage of agreement statistic for the tendency to get high values by chance alone
when one of the categories is very frequently chosen by both raters. On the main
diagonal of the table above I have entered in parentheses the number of agreements
that would be expected by chance alone given the marginal totals. Each of these
expected frequencies is computed by taking the marginal total for the column the cell is
in, multiplying it by the marginal total for the row the cell is in, and then dividing by the
total count. For example, for the No Fight-No Fight cell, (73)(75) / 100 = 54.75. Kappa
is then computed as κ = (Σ O − Σ E) / (N − Σ E), where the O's are observed frequencies
on the main diagonal, the E's are expected frequencies on the main diagonal, and N is
the total count. For our data,
κ = (70 + 6 + 4 − 54.75 − 2.08 − 1.32) / (100 − 54.75 − 2.08 − 1.32) = 21.85 / 41.85 = 0.52,
which is not so impressive.
More impressive would be these data, for which kappa is 0.82:
                        Rater 2
Rater 1     No Fight    Aggressor   Victim     marginal
No Fight    70 (52.56)  0           2          72
Aggressor   2           13 (2.40)   1          16
Victim      1           2           9 (1.44)   12
marginal    73          15          12         100
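The kappa computation generalizes easily. A hedged Python sketch (the function name `cohen_kappa` is invented here) applied to both agreement tables above:

```python
def cohen_kappa(table):
    """Cohen's kappa from a square agreement table (rows = Rater 1, cols = Rater 2)."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    observed = sum(table[i][i] for i in range(len(table)))
    expected = sum(r * c / n for r, c in zip(row_totals, col_totals))
    return (observed - expected) / (n - expected)

first = [[70, 3, 2], [2, 6, 5], [1, 7, 4]]
second = [[70, 0, 2], [2, 13, 1], [1, 2, 9]]
print(cohen_kappa(first), cohen_kappa(second))  # about 0.52 and 0.82
```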
Power Analysis
G*Power uses Cohen's w as the effect size statistic for contingency table
analysis. Here are conventional benchmarks for that statistic.
Size of effect    w     odds ratio
small             .1    1.49
medium            .3    3.45
large             .5    9
Please read the following documents:
Constructing a Confidence Interval for the Standard Deviation
Chi-Square, One- and Two-Way -- more detail on the w statistic and power analysis,
in the document linked here.
Power Analysis for a 2 x 2 Contingency Table
Power Analysis for One-Sample Test of Variance (Chi-Square)
Return to Wuensch's Stats Lessons Page
Copyright 2011, Karl L. Wuensch - All rights reserved.
Three Flavors of Chi-Square: Pearson, Likelihood Ratio, and Wald
Here is a short SAS program and annotated output.
options pageno=min nodate formdlim='-';
proc format; value yn 1='Yes' 2='No'; value ww 1='Alone' 2='Partner'; run;
data duh; input Interest WithWhom count; cards;
1 1 51
1 2 16
2 1 21
2 2 1
;
proc freq; weight count; format Interest yn. WithWhom ww.;
table Interest*WithWhom / chisq nopercent nocol relrisk; run;
proc logistic; freq count; model WithWhom = Interest; run;
--------------------------------------------------------------------------------------------------
The SAS System 1
The FREQ Procedure
Table of Interest by WithWhom
Interest WithWhom
Frequency
Row Pct Alone Partner Total
Yes 51 16 67
76.12 23.88
No 21 1 22
95.45 4.55
Total 72 17 89
Statistics for Table of Interest by WithWhom
Statistic DF Value Prob
Chi-Square (Pearson) 1 4.0068 0.0453
Likelihood Ratio Chi-Square 1 5.0124 0.0252
Notice that the relationship is significant with both the Pearson and LR Chi-Square.
WARNING: 25% of the cells have expected counts less
than 5. Chi-Square may not be a valid test.
--------------------------------------------------------------------------------------------------
The SAS System 2
The FREQ Procedure
Statistics for Table of Interest by WithWhom
Estimates of the Relative Risk (Row1/Row2)
Type of Study Value 95% Confidence Limits
Case-Control (Odds Ratio) 0.1518 0.0189 1.2189
Cohort (Col1 Risk) 0.7974 0.6781 0.9378
Cohort (Col2 Risk) 5.2537 0.7385 37.3741
Sample Size = 89
Notice that although the Pearson and LR Chi-Square statistics were significant
beyond .05, the 95% confidence interval for the odds ratio includes the value one. As you
will soon see, this is because a more conservative Chi-Square, the Wald Chi-Square, is
used in constructing that confidence interval.
Since most people are uncomfortable with odds ratios between 0 and 1, I shall
invert the odds ratio, to 6.588, with a confidence interval extending from 0.820 to 52.910.
--------------------------------------------------------------------------------------------------
The SAS System 3
The LOGISTIC Procedure
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 5.0124 1 0.0252
Score (Pearson) 4.0068 1 0.0453
Wald 3.1461 1 0.0761
The values of the Pearson and the LR Chi-Square statistics are the same as
reported with Proc Freq. Notice that here we also get the conservative Wald Chi-Square,
and it falls short of significance. The Wald Chi-square is essentially a squared t, where t =
the value of the slope in the logistic regression divided by its standard error.
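For a single dichotomous predictor, the logistic slope equals the log odds ratio and its standard error is the square root of the summed reciprocal cell counts, so the Wald chi-square and its confidence interval can be sketched directly from the 2 x 2 table. A hedged Python illustration of that relationship (not the SAS computation itself):

```python
from math import exp, log, sqrt

a, b, c, d = 51, 16, 21, 1          # cell counts from the SAS program above
log_or = log((a / b) / (c / d))     # ln(0.1518); sign depends on which odds are on top
se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
wald = (log_or / se) ** 2           # squared z for the slope
ci = (exp(log_or - 1.96 * se), exp(log_or + 1.96 * se))
print(wald)  # about 3.15, near SAS's 3.1461
print(ci)    # about (0.019, 1.22), bracketing 1, like the SAS interval
```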
--------------------------------------------------------------------------------------------------
Odds Ratio Estimates
Point 95% Wald
Effect Estimate Confidence Limits
Interest 6.588 0.820 52.900
So we should not be surprised that the confidence interval, based upon the Wald
Chi-Square statistic, does include one.
Reporting the Strength of Effect Estimates for Simple Statistical Analyses
This document was prepared as a guide for my students in Experimental
Psychology. It shows how to present the results of a few simple but common statistical
analyses. It also shows how to compute commonly employed strength of effect
estimates.
Independent Samples T
When we learned how to do t tests (see T Tests and Related Statistics: SPSS),
you compared the mean amount of weight lost by participants who completed two
different weight loss programs. Here is SPSS output from that analysis:
Group Statistics

        GROUP   N    Mean    Std. Deviation   Std. Error Mean
LOSS    1       6    22.67   4.274            1.745
        2       12   13.25   4.093            1.181
The difference in the two means is statistically significant, but how large is it?
We can express the difference in terms of within-group standard deviations, that is, we
can compute the statistic commonly referred to as Cohen's d, but more appropriately
referred to as Hedges' g. Cohen's d is a parameter; Hedges' g is the statistic we use to
estimate d.
First we need to compute the pooled standard deviation. Convert the standard
deviations to sums of squares by squaring each and then multiplying by (n − 1). For
Group 1, (5)4.274² = 91.34. For Group 2, (11)4.093² = 184.28. Now compute the
pooled standard deviation this way:
s_pooled = √[(SS₁ + SS₂) / (n₁ + n₂ − 2)] = √[(91.34 + 184.28) / 16] = 4.15.
Finally, simply standardize the difference in means:
g = (M₁ − M₂) / s_pooled = (22.67 − 13.25) / 4.15 = 2.27, a very large effect.
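The computation above can be sketched from the summary statistics alone:

```python
from math import sqrt

# Hedges' g from the SPSS Group Statistics output above
n1, m1, sd1 = 6, 22.67, 4.274
n2, m2, sd2 = 12, 13.25, 4.093

ss1 = (n1 - 1) * sd1 ** 2                     # sum of squares, group 1 (~91.34)
ss2 = (n2 - 1) * sd2 ** 2                     # sum of squares, group 2 (~184.28)
s_pooled = sqrt((ss1 + ss2) / (n1 + n2 - 2))  # ~4.15
g = (m1 - m2) / s_pooled
print(s_pooled, g)  # about 4.15 and 2.27
```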
An easier way to get the pooled standard deviation is to conduct an ANOVA
relating the test variable to the grouping variable. Here is SPSS output from such an
analysis:
A reversal paradox occurs when two variables are positively related in aggregated
data but, within each level of a third variable, they are negatively related (or negatively
related in the aggregate and positively related within each level of the third variable).
See Messick and van de Geer's article on the reversal paradox (Psychol. Bull., 90, 582-593).
Later I shall discuss the reversal paradox in the context of ANOVA and multiple
regression. Here I have an example in the context of contingency table analysis.
At Zoo Univ. 15 of 100 women (15%) applying for admission to the graduate
program in Clinical Psychology are offered admission. One of 10 men (10%) applying
to the same program is offered admission. For the Experimental Psychology program,
6 of 10 women (60%) are offered admission, 50 of 100 men (50%) are offered
admission. For the department as a whole, (15 + 6)/(100 + 10) = 19% of the female
applicants are offered admission and (1 + 50)/(10 + 100) = 46% of the male applicants
are offered admission. Assuming that male and female applicants are equally qualified,
is there evidence of gender discrimination in admissions, and, if so, against which
gender?
Program                  Female Applicants              Male Applicants
Experimental Psychology  6 of 10 offered admission      50 of 100 offered admission
                         (60%)                          (50%)
Clinical Psychology      15 of 100 offered admission    1 of 10 offered admission
                         (15%)                          (10%)
Department as a whole    21 of 110 offered admission    51 of 110 offered admission
                         (19%)                          (46%)
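The reversal in the admissions arithmetic can be verified in a few lines:

```python
# Admission rates within each program and in the aggregate
women = {"Experimental": (6, 10), "Clinical": (15, 100)}
men = {"Experimental": (50, 100), "Clinical": (1, 10)}

for program in women:
    w_rate = women[program][0] / women[program][1]
    m_rate = men[program][0] / men[program][1]
    assert w_rate > m_rate  # women fare better within every program

w_all = sum(a for a, n in women.values()) / sum(n for a, n in women.values())
m_all = sum(a for a, n in men.values()) / sum(n for a, n in men.values())
print(w_all, m_all)      # 21/110 = .19 vs 51/110 = .46
assert w_all < m_all     # yet men fare better in the aggregate
```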
See also: The Reversal Paradox (Simpson's Paradox)
σ_M = σ / √N = 15 / √25 = 3.
For example, suppose we wish to test the H₀ that μ = 100. With M = 107,
z = (M − μ) / σ_M = (107 − 100) / 3 = +2.33.
Now, P(Z ≥ +2.33) = .0099; doubling for a two-tailed test, p = .0198. Thus, we
could reject the H₀ with α = .05 but not with α = .01. With directional hypotheses, H₀
being μ ≤ 100, H₁ being μ > 100, p = .0099 and we could reject
the H₀ even at .01.
With a one-tailed test, if the direction of the effect (sample mean > or < μ) is
as specified in the alternative hypothesis, one always uses the smaller portion
column of the normal curve table to obtain the p. If the direction is opposite that
specified in the alternative hypothesis, one uses the larger portion column. For a
two-tailed test, always use the smaller portion column and double the value that
appears there.
Confidence intervals may be constructed by taking the point estimate of μ and
going out the appropriate number of standard errors. The general formula is:
CI = M − CV(σ_M) to M + CV(σ_M),
where CV = the critical value for the appropriate sampling distribution. For our sample
problem, CI₉₅ = 107 − 1.96(15/√25) to 107 + 1.96(15/√25) = 101.12 to 112.88. Once we
have this confidence interval we can decide whether or not to reject a hypothesis about
simply by determining whether the hypothesized value of falls within the confidence
interval or not. The hypothesized value of 100 does not fall within 101.12 to 112.88, so
we could reject the hypothesis that = 100 with at least 95% confidence that is, with
alpha not greater than 1 - .95 = .05.
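The z test and interval above can be sketched numerically. A hedged Python illustration (SciPy assumed) using the values from the sample problem (M = 107, μ₀ = 100, σ = 15, N = 25):

```python
from math import sqrt
from scipy.stats import norm

M, mu0, sigma, N = 107, 100, 15, 25
se = sigma / sqrt(N)                 # 3
z = (M - mu0) / se                   # +2.33
p_two = 2 * norm.sf(z)               # about .02, cf. the .0198 above
ci = (M - 1.96 * se, M + 1.96 * se)
print(z, p_two, ci)                  # CI: 101.12 to 112.88
```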
Student's t. Population Standard Deviation Not Known
One big problem with what we have done so far is knowledge of the population σ. If
we really knew σ, we would likely also know μ, and thus not need to make
inferences about μ. The assumption we made above, that σ_IQ at ECU = 15, is probably
not reasonable. Assuming that ECU tends to admit brighter persons and not persons
with low IQ, the σ_IQ at ECU should be lower than that in the general population. We
shall usually need to estimate the population σ from the same sample data we use to
test the mean. Unfortunately, sample variance, SS / (N − 1), has a positively skewed
sampling distribution. Although unbiased [the mean of the distribution of sample
variances equals the population variance], more often than not sample s² will be smaller
than population σ² and sample s smaller than population σ.
Thus, the quantity t = (M − μ) / (s / √N) will tend to be larger than
z = (M − μ) / (σ / √N). The result of all this is that the sampling distribution of the test
statistic will not be normally distributed, but will rather be distributed as Student's t, a
distribution developed by Gosset (his employer, Guinness Brewers, did not allow him to
publish under his real name). For more information on Gosset, point your browser to:
http://www-gap.dcs.st-and.ac.uk/~history/Mathematicians/Gosset.html.
The Student t-distribution is plumper in its tails (representing a greater number of
extreme scores) than is the normal curve. Because the distribution of sample variances
is more skewed with small sample sizes than when N is large, the t distribution
becomes very nearly normal when N is large.
One of the parameters going into the probability density function of t is df, degrees
of freedom. We start out with df = N and then we lose one df for each parameter we
estimate when computing the standard error. We compute the sample standard error
as s_M = s / √N. When computing the sample s we estimate the population mean,
using (Y minus sample mean) rather than (Y minus μ) to compute the sum of squares.
That one estimation cost us one df, so df = N − 1. The fewer the df, the plumper the t is
in its tails, and accordingly the greater the absolute critical value of t. With infinite df, t
has the same critical value as Z.
Here is an abbreviated table of critical values of t marking off the upper 2.5% of the
area under the curve. Notice how the critical value is very large when df are small, but
approaches 1.96 (the critical value for z) as df increase.

df              1       2      3      10     30     100    ∞
Critical Value  12.706  4.303  3.182  2.228  2.042  1.984  1.960
When df are small, a larger absolute value of computed t is required to reject the null
hypothesis. Accordingly, low df translates into low power. When df are low, sample
size will be low too, and that also reduces power.
I shall illustrate the use of Student's t for testing a hypothesis about the mean score
that my students get on the math section of the Scholastic Aptitude Test. I shall use
self-report data provided by students who took my undergraduate statistics class
between 2000 and 2004. During that five year period the national mean score on the
math SAT was 516. For North Carolina students it was 503. For the 114 students on
whom I have data, the mean is 534.78 and the standard deviation is 93.385. I shall test
the null hypothesis that the mean of the population from which my students' scores
were randomly drawn is 516. I shall employ the usual .05 criterion of statistical
significance.
s_M = 93.385/√114 = 8.746, and t = (534.78 − 516)/8.746 = 2.147. If we were doing a one-tailed test but the predicted direction were wrong, p would be 1 minus the value for the one-tailed p with direction correct, that is, .975 < p < .99. We can use PASW or SAS to get the exact p, which is, for these data, .034.
A confidence interval should also be constructed: CI = M ± CV·s_M. For CC = 95%, α = 1 − .95 = .05, leaving .025 in the upper tail. From the t table for df = 100, CV = 1.984. CI₉₅ = 534.78 ± 1.984(8.746) = 517.43 to 552.13.
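The SAT-math example above can be sketched numerically (assuming SciPy; note the code uses the exact df = 113, where the text reads the table row for df = 100, so the interval differs slightly):

```python
# One-sample t test and 95% CI for the SAT-math data.
import math
from scipy.stats import t

n, mean, sd, mu0 = 114, 534.78, 93.385, 516
se = sd / math.sqrt(n)          # standard error of the mean, about 8.746
t_obs = (mean - mu0) / se       # about 2.147
df = n - 1
p = 2 * t.sf(t_obs, df)         # exact two-tailed p, about .034
cv = t.ppf(0.975, df)           # critical value for the 95% CI
ci95 = (mean - cv * se, mean + cv * se)
print(round(t_obs, 3), round(p, 3), [round(x, 2) for x in ci95])
```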
Effect Size
When you test a hypothesis about a population mean, you should report an estimate
of (μ − μ₀): here, M − μ₀ = 534.78 − 516 = 18.78. Note that this is the numerator of the t
ratio. To get a confidence interval for this difference, just take the confidence interval
for the mean and subtract the hypothesized mean from both the lower and the upper
limits. For our SAT data, the 95% confidence interval for (μ − μ₀) is 1.43 to 36.13.
When you are dealing with data where the unit of measurement is easily understood
by most persons (such as inches, pounds, dollars, etc.), reporting an effect size in that
unit of measurement is fine. Psychologists, however, typically deal with data where the
unit of measurement is not so easily understood (such as score on a personality test).
Accordingly, it is useful to measure effect size in standard deviation units. The
standardized effect size parameter for the one-sample design is δ = (μ − μ₀)/σ. I often
refer to this parameter simply as Cohen's d, to avoid confusion with the noncentrality
parameter, also symbolized with a lower-case delta. We can estimate δ with the statistic
d = (M − μ₀)/s = 18.78/93.385 = .20.
Two-Group Research
1. We wish to know whether two groups (samples) of scores (on some continuous OV,
outcome variable) are different enough from one another to indicate that the two
populations from which they were randomly drawn are also different from one another.
2. The two groups of scores are from research units (subjects) that differ with respect to
some dichotomous GV, grouping variable (treatment).
3. We shall compute an exact significance level, p, which represents the likelihood that
our two samples would differ from each other on the OV as much (or more) as they do, if in
fact the two populations from which they were randomly sampled are identical, that is, if
the dichotomous GV has no effect on the OV mean.
Research Designs
1. In the Independent Sampling Design (also known as the between-subjects design) we
have no good reason to believe there should be correlation between scores in the one
sample and those in the other. With experimental research this is also known as the
completely randomized design -- we not only randomly select our subjects but we also
randomly assign them to groups - the assignment of any one subject to group A is
independent of the assignment of any other subject to group A or B. Of course, if our
dichotomous GV is something not experimentally manipulated, such as subjects'
sex/gender, we do not randomly assign subjects to groups, but subjects may still be in
groups in such a way that we expect no correlation between the two groups' scores.
2. In the Matched Pairs Design (also called a split-plot design, a randomized blocks
design, or a correlated samples design) we randomly select pairs of subjects, with the
subjects matched on some extraneous variable (the matching variable) thought to be well
correlated with the dependent variable. Within each pair, one subject is randomly
assigned to group A, the other to group B. Again, our dichotomous GV may not be
experimentally manipulated, but our subjects may be matched up nevertheless; for
example, GV = sex, subjects = married couples.
a. If the matching variable is in fact well correlated with the dependent variable, the
matched pairs design should provide a more powerful test (greater probability of rejecting
the null hypothesis) than will the completely randomized design. If not, it may yield a less
powerful test.
b. One special case of the matched pairs design is the Within Subjects Design
(also known as the repeated measures design). Here each subject generates two scores:
one after treatment A, one after treatment B. Treatments are counterbalanced so that
half the subjects get treatment A first, the other half receiving treatment B first, hopefully
removing order effects.
Calculation of the Related Samples t
Suppose we measure each subject's reaction time once while sober and once while drunk, and test the null hypothesis that μ_sober = μ_drunk, that alcohol doesn't affect reaction time. We create a new variable, D. For each pair we compute D = X₁ − X₂. The null hypothesis becomes μ_D = 0 and we test it exactly as we previously tested one-mean hypotheses, including using one-tailed tests if appropriate [if the alternative hypothesis is μ_D > 0, that is, μ₁ > μ₂, or if it is μ_D < 0, μ₁ < μ₂].
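The difference-score approach can be sketched numerically (the paired reaction times below are made up for illustration; assumes SciPy):

```python
# Related samples t via difference scores: D = X1 - X2 per pair, test H0: mu_D = 0.
import math
from scipy.stats import t

sober = [210, 225, 198, 240, 215, 230]   # hypothetical reaction times (ms)
drunk = [230, 240, 215, 255, 228, 245]

d_scores = [s - d for s, d in zip(sober, drunk)]
n = len(d_scores)
mean_d = sum(d_scores) / n
sd_d = math.sqrt(sum((x - mean_d) ** 2 for x in d_scores) / (n - 1))
t_obs = mean_d / (sd_d / math.sqrt(n))   # one-sample t on the D scores
p = 2 * t.sf(abs(t_obs), n - 1)
print(round(t_obs, 2), round(p, 5))
```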
The independent sampling design is more complex. The sampling distribution is
the distribution of differences between means, which has a mean equal to μ₁ − μ₂. By the
variance sum law, the standard deviation of the sampling distribution, the Standard Error
of Differences Between Means, is the square root of the variance of the sampling
distribution:
σ_{M1-M2} = √(σ²_{M1} + σ²_{M2} − 2ρ·σ_{M1}·σ_{M2})
This formula for the standard error actually applies to both matched pairs and
independent sampling designs. The ρ (rho) is the correlation between scores in population
1 and scores in population 2. In matched pairs designs this should be positive and fairly
large, assuming that the variable used to match scores is itself positively correlated with
the dependent variable. That is, pairs whose Group 1 score is high should also have their
Group 2 score high, while pairs whose Group 1 score is low should have their Group 2
score low, relative to other within-group scores. The larger the ρ, the smaller the standard
error, and thus the more powerful the analysis (the more likely we are to reject a false null
hypothesis). Fortunately there is an easier way to compute the standard error with
matched pairs, the difference score approach we used earlier.
In the independent sampling design we assume that ρ = 0, so the standard error
becomes
σ_{M1-M2} = √(σ²₁/N₁ + σ²₂/N₂), and t = (M₁ − M₂) / s_{M1-M2}.
If n₁ ≠ n₂, the pooled variances standard error requires a more elaborate formula.
Given the homogeneity of variance assumption, we can better estimate the variance of the
two populations by using all (n₁ + n₂) scores than by using the n₁ and the n₂ scores
separately. This involves pooling the sums of squares when computing the standard error:
s_{M1-M2} = √[ (SS₁ + SS₂)/(n₁ + n₂ − 2) · (1/n₁ + 1/n₂) ]
Remember, SS = s²(n − 1).
The t that you obtain is then evaluated using df = n₁ + n₂ − 2.
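As a sketch (assuming SciPy), here is the pooled standard error and t computed from group summary statistics; the GPA numbers are the Howell data used later in this document, and the result differs slightly from the t reported there because these summaries are rounded:

```python
# Pooled-variances independent samples t from summary statistics.
import math
from scipy.stats import t

def pooled_t(m1, s1, n1, m2, s2, n2):
    ss1, ss2 = s1 ** 2 * (n1 - 1), s2 ** 2 * (n2 - 1)   # SS = s**2 * (n - 1)
    se = math.sqrt((ss1 + ss2) / (n1 + n2 - 2) * (1 / n1 + 1 / n2))
    df = n1 + n2 - 2
    t_obs = (m1 - m2) / se
    return t_obs, df, 2 * t.sf(abs(t_obs), df)

t_obs, df, p = pooled_t(2.82, .83, 33, 2.24, .81, 55)   # girls vs boys GPA
print(round(t_obs, 2), df, round(p, 4))
```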
If you cannot assume homogeneity of variance, then using the pooled variances
estimate is not reasonable. Instead, compute t using the separate variances error term,
s_{M1-M2} = √(s²₁/n₁ + s²₂/n₂).
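The separate variances t can be sketched the same way; the df below use the Welch-Satterthwaite approximation (the "Satterthwaite" method shown in the SAS output later in this document), a detail the text does not spell out:

```python
# Separate variances (Welch) t with Welch-Satterthwaite approximate df.
import math
from scipy.stats import t

def welch_t(m1, s1, n1, m2, s2, n2):
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    t_obs = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_obs, df, 2 * t.sf(abs(t_obs), df)

t_obs, df, p = welch_t(2.82, .83, 33, 2.24, .81, 55)   # rounded GPA summaries
print(round(t_obs, 2), round(df, 1), round(p, 4))
```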
For the two-group design, the standardized effect size parameter is δ = (μ₁ − μ₂)/σ. Notice that we are dealing with population parameters,
not sample statistics, when computing δ. In other words, δ is not an effect size estimate.
Nevertheless, most psychologists use the letter d to report what is really an estimate of δ.
You should memorize the following benchmarks for δ: .2 is small, .5 is medium, and .8 is large.
Our estimator is d = (M₁ − M₂)/s_pooled, where the pooled standard deviation is the square root of
the within-groups mean square (from a one-way ANOVA comparing the two groups). If
you have equal sample sizes, the pooled standard deviation is s_pooled = √[.5(s₁² + s₂²)]. If
you have unequal sample sizes, s_pooled = √[Σ p_j·s_j²], where for each group s_j² is the within-group
variance and p_j = n_j/N, the proportion of the total number of scores (in both groups,
N) which are in that group (n_j). You can also compute d as d = t√[(n₁ + n₂)/(n₁n₂)], where t is the
pooled variances independent samples t comparing the two group means.
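A sketch of these estimators with the Howell GPA summary statistics used just below (plain Python, no extra libraries; the pooled t value is the one computed from these rounded summaries):

```python
# Two equivalent routes to estimated d for the two-group design.
import math

m1, s1, n1 = 2.82, .83, 33   # girls
m2, s2, n2 = 2.24, .81, 55   # boys
N = n1 + n2

# unequal-n pooling: weight each group's variance by p_j = n_j / N
s_pooled = math.sqrt((n1 / N) * s1 ** 2 + (n2 / N) * s2 ** 2)
d = (m1 - m2) / s_pooled

# the same estimate from the pooled variances t (3.222 from these rounded summaries)
t_pooled = 3.222
d_from_t = t_pooled * math.sqrt((n1 + n2) / (n1 * n2))
print(round(s_pooled, 3), round(d, 2), round(d_from_t, 2))
```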
You can use the program Conf_Interval-d2.sas to obtain the confidence interval for
the standardized difference between means. It will require that you give the sample sizes
and the values of t and df. Use the pooled variances values of t and df. Why the pooled
variances t and df? See Confidence Intervals, Pooled and Separate Variances T. Also
see Standardized Difference Between Means, Independent Samples .
6
I shall illustrate using the Howell data (participants were students in Vermont),
comparing boys' GPA with girls' GPA. Please look at the computer output. For the girls,
M = 2.82, SD = .83, n = 33, and for the boys, M = 2.24, SD = .81, n = 55.
s_pooled = √[(33/88)(.83²) + (55/88)(.81²)] = .818, and d = (2.82 − 2.24)/.818 = .71. Also, d = 3.267·√[(33 + 55)/(33·55)] = .72, where 3.267 is the pooled variances t. For the CL statistic, Z = (2.82 − 2.24)/√(.83² + .81²) = .50, which yields a lower-tailed p of .69. That is, if one boy and one girl were randomly selected, the probability that the girl would have the higher GPA is .69. If you prefer odds, the odds of the girl having the higher GPA = .69/(1 − .69) = 2.23 to 1.
Point Biserial r versus Estimated d. Each of these has its advocates.
Regardless of which you employ, you should be aware that the ratio of the two sample
sizes can have a drastic effect on the value of the point-biserial r (and the square of that
statistic, which is η²), but does not affect the value of estimated d. See Effect of n₁/n₂ on
Estimated d and r_pb.
Correlated Samples Designs. You could compute d = (M₁ − M₂)/s_Diff, where s_Diff is
the standard deviation of the difference scores, but this would artificially inflate the size of
the effect, because the correlation between conditions will probably make s_Diff smaller than
the within-conditions standard deviation. You should instead treat the data as if they were
from independent samples. If you base your effect size estimate on the correlated
samples analysis, you will overestimate the size of the effect. You cannot use my
Conf_Interval-d2.sas program to construct a confidence interval for d when the data are
from correlated samples. See my document Confidence Interval for Standardized
Difference Between Means, Related Samples for details on how to construct an
approximate confidence interval for the standardized difference between related means.
Tests of Equivalence. Sometimes we want to test the hypothesis that the size of
an effect is not different from zero by more than a trivial amount. For example, we might
wish to test the hypothesis that the effect of a generic drug is equivalent to the effect of a
brand name drug. Please read my document Tests of Equivalence and Confidence
Intervals for Effect Sizes.
Testing Variances
One may be interested in determining the effect of some treatment upon variances
instead of or in addition to its effect on means. Suppose we have two different drugs, each
thought to be a good treatment for lowering blood cholesterol. Suppose that the mean
amount of cholesterol lowering for drug A was 40 with a variance of 100 and for drug B the
mean was 42 with a variance of 400. The difference in means is trivial compared to the
difference in variances. It appears that the effect of drug A does not vary much from
subject to subject, but drug B appears to produce very great lowering of blood cholesterol
for some subjects, but none (or even elevated cholesterol) for others. At this point the
researcher should start to look for the mystery variable which interacts with drug B to
determine whether B's effect is positive or negative.
To test the null hypothesis that the treatments do not differ in effect upon variance,
that is, σ²_A = σ²_B, one may use an F-test. Simply divide the larger variance by the smaller,
obtaining an F of 400/100 = 4.0. Suppose we had 11 subjects in Group A and 9 in Group
B. The numerator (variance for B) degrees of freedom is n_B − 1 = 8, the denominator
(variance for A) df is n_A − 1 = 10. From your statistical program [in SAS, p = 2*(1 -
PROBF(4, 8, 10));] you obtain the two-tailed probability for F(8, 10) = 4, which is p =
.044.
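The same computation can be sketched in Python (assuming SciPy, where `f.sf` plays the role of 1 − PROBF):

```python
# Two-tailed F test of two variances, mirroring the SAS PROBF calls in the text.
from scipy.stats import f

F = 400 / 100                        # larger variance over smaller
p_two_tailed = 2 * f.sf(F, 8, 10)    # mirrors p = 2*(1 - PROBF(4, 8, 10)), about .044
p_wrong_direction = f.sf(0.25, 10, 8)  # mirrors 1 - PROBF(.25, 10, 8), about .98
print(round(p_two_tailed, 3), round(p_wrong_direction, 2))
```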
8
We can do one-tailed tests of directional hypotheses about the relationship
between two variances. With directional hypotheses we must put in the numerator the
variance which we predicted (in the alternative hypothesis) would be larger, even if it isn't
larger. Suppose we did predict that σ²_B > σ²_A. F(8, 10) = 4.00; we don't double p, so p =
.022. What if we had predicted that σ²_A > σ²_B? F(10, 8) = 100/400 = 0.25. Since the p
for F(x, y) equals 1 minus the p for 1/F(y, x), our p equals 1 − .022 = .98, and the null
hypothesis looks very good. If you wish, you can use SAS to verify that p = 1 -
PROBF(.25, 10, 8); returns a value of .98.
Although F is often used as I have shown you here, it has a robustness problem in
this application. It is not robust to violations of its normality assumption. There are,
however, procedures that are appropriate even if the populations are not normal. Levene
suggested that for each score you find either the square or the absolute value of its
deviation from the mean of its group, and then run a standard t-test
comparing the transformed deviations in the one group with those in the other group.
Brown and Forsythe recommended using absolute deviation from the median or a trimmed
mean. Their Monte Carlo research indicated that the trimmed mean was the best choice
when the populations were heavy in their tails and the median was the best choice when
the populations were skewed. Levene's tests can be generalized to situations involving
more than two populations: just apply an ANOVA to the transformed data. Please consult
the document Levene Test for Equality of Variances. Another alternative, O'Brien's test, is
illustrated in the 4th edition of Howell's Statistical Methods for Psychology. As he notes in
the 5th edition, it has not been included in mainstream statistical computing packages.
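SciPy's `levene` function implements exactly these choices through its `center` argument ('mean' for Levene's original proposal, 'median' for Brown and Forsythe's, 'trimmed' for the trimmed mean); the data here are made up for illustration:

```python
# Levene / Brown-Forsythe tests of equality of variances.
from scipy.stats import levene

group_a = [12, 14, 13, 15, 14, 13, 12, 14]   # small spread
group_b = [5, 22, 9, 18, 2, 25, 7, 20]       # large spread
for center in ("mean", "median", "trimmed"):
    stat, p = levene(group_a, group_b, center=center)
    print(f"{center:>7}: W = {stat:.2f}, p = {p:.4f}")
```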
To test the null hypothesis of homogeneity of variance in two related (not
independent) samples, use E. J. G. Pitman's (A note on normal correlation, Biometrika,
1939, 31, 9-12) t: t = (F − 1)√(n − 2) / [2√(F(1 − r²))], where F is the ratio of the larger to the smaller sample
variance, n is the number of pairs of scores, r is the correlation between the scores in the
one sample and the scores in the other sample, and n − 2 is the df.
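A sketch of Pitman's test (assumes SciPy; the variances, correlation, and n below are hypothetical):

```python
# Pitman's t for equality of two correlated variances.
import math
from scipy.stats import t

def pitman_t(var1, var2, r, n):
    F = max(var1, var2) / min(var1, var2)   # larger over smaller sample variance
    t_obs = (F - 1) * math.sqrt(n - 2) / (2 * math.sqrt(F * (1 - r ** 2)))
    return t_obs, 2 * t.sf(t_obs, n - 2)    # two-tailed p on n - 2 df

t_obs, p = pitman_t(400, 100, r=.6, n=30)
print(round(t_obs, 2), round(p, 5))
```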
Testing Variances Prior to Testing Means. Some researchers have adopted the
bad habit of using a test of variances to help decide whether to use a pooled t test or a
separate variances t test. This is poor practice for several reasons.
- The test of variances will have very little power when sample size is small, and thus will not detect even rather large deviations from homogeneity of variance. It is with small sample sizes that pooled t is likely least robust to the homogeneity of variance assumption.
- The test of variances will have very much power when sample size is large, and thus will detect as significant even very small differences in variance, differences that are of no concern given the pooled t test's great robustness when sample sizes are large.
- Heterogeneity of variance is often accompanied by non-normal distributions, and some tests of variances are often not robust to their normality assumption.
- Box (1953) was an early critic of testing variances prior to conducting a test of means. He wrote that "to make the preliminary test on variances is rather like putting to sea in a rowing boat to find out whether conditions are sufficiently calm for an ocean liner to leave port."
Writing an APA Style Summary for Two-Group Research Results
Using our example data, a succinct summary statement should read something like
this: "Among Vermont school-children, girls' GPA (M = 2.82, SD = .83, N = 33) was
significantly higher than boys' GPA (M = 2.24, SD = .81, N = 55), t(65.9) = 3.24, p = .002, d
= .72. A 95% confidence interval for the difference between girls' and boys' mean GPA
runs from .23 to .95 in raw score units and from .27 to 1.16 in standardized units."
Please note the following important components of the summary statement:
- The subjects are identified.
- The variables are identified: sex and GPA.
- The group means, standard deviations, and sample sizes are given.
- Rejection of the null hypothesis is indicated (the difference is significant).
- The direction of the significant effect is indicated.
- The test statistic (t) is identified, and its degrees of freedom, computed value, and p-value are reported.
- An effect size estimate is reported.
- Confidence intervals are reported for the difference between means and for d.
The style for reporting the results of a correlated t test would be the same.
If the result were not significant, we would not emphasize the direction of the
difference between the group means, unless we were testing a directional hypothesis. For
example, among school-children in Vermont, the IQ of girls (M = 101.8, SD = 12.7, N = 33)
did not differ significantly from that of boys (M = 99.3, SD = 13.2, N = 55), t(69.7) = 0.879,
p = .38, d = .19. A 95% confidence interval for the difference between girls' and boys'
mean IQ runs from -3.16 to 8.14 in raw score units and from -.24 to .62 in standardized
units.
As an example of a nonsignificant test of a directional hypothesis: As predicted, the
GPA of students who had no social problems (M = 2.47, SD = 0.89, N = 78) was greater
than that of students who did have social problems (M = 2.39, SD = .61, N = 10), but this
difference fell short of statistical significance, one-tailed t(14.6) = 0.377, p = .36, d = .09. A
95% confidence interval for the difference between mean GPA of students with no social
problems and that of students with social problems runs from -.38 to .54 in raw score units
and from -.56 to .75 in standardized units.
References
Box, G. E. P. (1953). Non-normality and tests on variance. Biometrika, 40, 318-335.
Bradley, J. V. (1982). The insidious L-shaped distribution. Bulletin of the Psychonomic
Society, 20, 85-88.
Wuensch, K. L. (2009). The standardized difference between means: Much variance in
notation. Also, differences between g and r_pb as effect size estimates. Available here.
Zimmerman, D. W. (1996). Some properties of preliminary tests of equality of variances in
the two-sample location problem. Journal of General Psychology, 123, 217-231.
Zimmerman, D. W., & Zumbo, B. D. (2009). Hazards in choosing between pooled and
separate-variances t tests. Psicológica, 30, 371-390.
Summary of Effect Size Estimates -- Lee Becker at the Univ. of Colorado, Colorado
Springs
Two Groups and One Continuous Variable
The Moments of Students t Distribution
Copyright 2012, Karl L. Wuensch - All rights reserved.
Confidence Interval for Standardized Difference Between Means, Independent Samples
Here are results from an independent samples t test. One group consists of mice (Mus
musculus) who were reared by mice, the other group consists of mice who were reared by rats
(Rattus norvegicus). The dependent variable is the difference between the number of visits
the mouse made to a tunnel that smelled like another mouse and the number of visits to a
tunnel that smelled like rat.
--------------------------------------------------------------------------------------------------
The SAS System 1
Independent Samples T-Tests on Mouse-Rat Tunnel Difference Scores
Foster Mom is a Mouse or is a Rat
The TTEST Procedure
Variable: v_diff
Mom N Mean Std Dev Std Err Minimum Maximum
Mouse 32 14.8125 9.0320 1.5966 0 31.0000
Rat 16 -1.3125 8.4041 2.1010 -17.0000 17.0000
Diff (1-2) 16.1250 8.8321 2.7043
Mom Method Mean 95% CL Mean Std Dev 95% CL Std Dev
Mouse 14.8125 11.5561 18.0689 9.0320 7.2410 12.0078
Rat -1.3125 -5.7907 3.1657 8.4041 6.2082 13.0070
Diff (1-2) Pooled 16.1250 10.6816 21.5684 8.8321 7.3393 11.0930
Diff (1-2) Satterthwaite 16.1250 10.7507 21.4993
Method Variances DF t Value Pr > |t|
Pooled Equal 46 5.96 <.0001
Satterthwaite Unequal 32.141 6.11 <.0001
Equality of Variances
Method Num DF Den DF F Value Pr > F
Folded F 31 15 1.15 0.7906
Notice that you are given a pooled variances confidence interval and a separate
variances confidence interval. These are in raw units, not standardized units.
We may get a better feel for the size of the effect if we standardize it. I have two
programs available to do this.
Program One
title 'Compute 95% Confidence Interval for d, Standardized Difference Between Two
Independent Population Means';
Data CI;
/*
Replace tttt with the computed value of the independent samples t test.
Replace dd with the degrees of freedom for the independent samples t test.
Replace n1n with the sample size for the first group.
Replace n2n with the sample size for the second group.
*/
t= 5.96 ;
df = 46 ;
n1 = 32 ;
n2 = 16 ;
***********************************************************************************;
g = t/sqrt(n1*n2/(n1+n2));
ncp_lower = TNONCT(t,df,.975);
ncp_upper = TNONCT(t,df,.025);
d_lower = ncp_lower*sqrt((n1+n2)/(n1*n2));
d_upper = ncp_upper*sqrt((n1+n2)/(n1*n2));
output; run; proc print; var g d_lower d_upper; run;
The Output
Obs g d_lower d_upper
1 1.82487 1.11164 2.52360
Notice that both sides of the confidence interval indicate that the effect is quite large.
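The same interval can be computed in Python, assuming SciPy, whose `nctdtrinc` plays the role of SAS's TNONCT (solving for the noncentrality parameter of the noncentral t):

```python
# Noncentral-t confidence interval for d, mirroring Conf_Interval-d2.sas.
import math
from scipy.special import nctdtrinc

t_obs, df, n1, n2 = 5.96, 46, 32, 16
g = t_obs / math.sqrt(n1 * n2 / (n1 + n2))      # about 1.82487
scale = math.sqrt((n1 + n2) / (n1 * n2))
d_lower = nctdtrinc(df, .975, t_obs) * scale    # noncentrality with CDF .975 at t_obs
d_upper = nctdtrinc(df, .025, t_obs) * scale    # noncentrality with CDF .025 at t_obs
print(round(g, 5), round(d_lower, 4), round(d_upper, 4))
```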
Program 2
*This program computes a CI for the effect size in
a between-subject design with two groups.
m1 and m2 are the means for the two groups
s1 and s2 are the standard deviations for the two groups
n1 and n2 are the sample sizes for the two groups
prob is the confidence level;
*Downloaded from James Algina's webpage at http://plaza.ufl.edu/algina/ ;
data;
m1=14.8125 ;
m2= -1.3125 ;
s1=9.032 ;
s2=8.4041 ;
n1=32 ;
n2=16 ;
prob=.95;
v1=s1**2;
v2=s2**2;
pvar=((n1-1)*v1+(n2-1)*v2)/(n1+n2-2);
se=sqrt(pvar*(1/n1+1/n2));
nchat=(m1-m2)/se;
es=(m1-m2)/(sqrt(pvar));
df=n1+n2-2;
ncu=TNONCT(nchat,df,(1-prob)/2);
ncl=TNONCT(nchat,df,1-(1-prob)/2);
ll=(sqrt(1/n1+1/n2))*ncl;
ul=(sqrt(1/n1+1/n2))*ncu;
output;
proc print;
title1 'll is the lower limit and ul is the upper limit';
title2 'of a confidence interval for the effect size';
var es ll ul;
run;
The Output
ll is the lower limit and ul is the upper limit
of a confidence interval for the effect size
Obs es ll ul
1 1.82572 1.11239 2.52453
The minor differences between these results and those shown earlier are due to rounding error
from the value of t.
Do it with SPSS
Karl L. Wuensch, East Carolina University, Dept. of Psychology, 3. September, 2011.
Tests of Equivalence and Confidence Intervals for
Effect Sizes
Point or sharp null hypotheses specify that a parameter has a particular value -- for
example, (μ₁ − μ₂) = 0, or ρ = 0. Such null hypotheses are highly unlikely ever to be
true. They may, however, be close to true, and it may be more useful to test range or
loose null hypotheses that state that the value of the parameter of interest is close to a
hypothetical value. For example, one might test the null hypothesis that the difference
between the effect of drug G and that of drug A is so small that the drugs are essentially
equivalent. Biostatisticians do exactly this, and they call it bioequivalence testing.
Steiger (2004) presents a simple example of bioequivalence testing. Suppose that we
wish to determine whether or not generic drug G is bioequivalent to brand name drug
B. Suppose that the FDA defines bioequivalence as bioavailability within 20% of that of
the brand name drug. Let μ₁ represent the lower limit (bioavailability 20% less than that
of the brand name drug), μ₂ the upper limit (bioavailability 20% greater than that of the
brand name drug), and μ_G the bioavailability of the generic drug. A test of
bioequivalence amounts to pitting the following two hypotheses against one another:
H_NE: μ_G < μ₁ or μ_G > μ₂ -- the drugs are not equivalent
H_E: μ₁ ≤ μ_G ≤ μ₂ -- the drugs are equivalent -- note that this is a range hypothesis
In practice, this amounts to testing two pairs of directional hypotheses:
H₀: μ_G ≤ μ₁ versus H₁: μ_G > μ₁, and H₀: μ_G ≥ μ₂ versus H₁: μ_G < μ₂.
If both of these null hypotheses are rejected, then we conclude that the drugs are
equivalent. Alternatively, we can simply construct a confidence interval for μ_G -- if the
confidence interval falls entirely within μ₁ to μ₂, then bioequivalence is established.
Steiger (2004) opines that tests of equivalence (also described as tests of close fit)
have a place in psychology too, especially when we are interested in demonstrating that
an effect is trivial in magnitude. Steiger recommends the use of confidence intervals,
dispensing with the traditional NHST procedures (computation of test statistic, p value,
decision).
Suppose, for example, that we are interested in determining whether or not two
different therapies for anorexia are equivalent. Our criterion variable will be the average
amount of weight gained during a two month period of therapy. By how much would the
groups need to differ before we would say they differ by a nontrivial amount? Suppose we
decide that a difference of less than three pounds is trivial. The hypothesis that the
difference (D) is trivial in magnitude can be evaluated with two simultaneous one-sided
tests:
H₀: D ≤ −3 versus H₁: D > −3, and H₀: D ≥ +3 versus H₁: D < +3.
After obtaining our data, we simply construct a confidence interval for the difference
in the two means. If that confidence interval is entirely enclosed within the "range of
triviality," -3 to +3, then we retain the loose null hypothesis that the two therapies are
equivalent. What if the entire confidence interval is outside the range of triviality? I
assume we would then conclude that there is a nontrivial difference between the
therapies. If part of the confidence interval is within the range of triviality and part
outside the range, then we suspend judgment and wish we had obtained more data
and/or less error variance. Of course, if the confidence interval extended into the range
of triviality but not all the way to the point of no difference then we would probably want
to conclude that there is a difference but confess that it might be trivial.
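A minimal sketch of this decision rule (assumes SciPy; the difference, standard error, and df below are hypothetical):

```python
# Equivalence decision via a confidence interval on the raw mean difference.
from scipy.stats import t

m_diff, se_diff, df = 0.8, 0.9, 60     # hypothetical difference in mean weight gain (lb)
range_of_triviality = (-3.0, 3.0)      # differences smaller than 3 lb are trivial
cv = t.ppf(0.975, df)                  # 95% CI on the raw difference
ci = (m_diff - cv * se_diff, m_diff + cv * se_diff)
trivial = range_of_triviality[0] < ci[0] and ci[1] < range_of_triviality[1]
print([round(x, 2) for x in ci], trivial)   # CI inside the range -> retain equivalence
```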
Psychologists often use instruments which produce measurements in units that are
not as meaningful as pounds and inches. For example, suppose that we are interested
in studying the relationship between political affiliation and misanthropy. We treat
political affiliation as dichotomous (Democrat or Republican) and obtain a measure of
misanthropy on a 100 point scale. The point null is that mean misanthropy in
Democrats is exactly the same as that in Republicans. While this hypothesis is highly
unlikely to be true, it could be very close to true. Can we construct a loose null
hypothesis, like we did for the anorexia therapies? What is the smallest difference
between means on the misanthropy scale that we would consider to be nontrivial? Is a
5 point difference small, medium, or large? Faced with questions like this, we often
resort to using standardized measures of effect sizes. In this case, we could use
Cohen's d, the standardized difference between means. Suppose that we decide that
the smallest difference that would be nontrivial is d = .1. All we need to do is get our
data and then construct a confidence interval for d. If that interval is totally enclosed
within the range -.1 to .1, then we conclude that affiliates of the two parties are
equivalent in misanthropy, and if the entire confidence interval is outside the range, then
we conclude that there is a nontrivial difference between the parties.
So, how do we get a confidence interval for d? Regretfully, it is not as simple as
finding the confidence interval in the raw unit of measure and then dividing the upper
and lower limits by the pooled standard deviation. Because we are estimating both
means and standard deviations, we will be dealing with noncentral distributions (see
Cumming & Finch, 2001; Fidler & Thompson, 2001; Smithson, 2001). Iterative
computations that cannot reasonably be done by hand will be required. There are, out
there on the Internet, statistical programs designed to construct confidence intervals for
standardized effect size estimates, but I think it unlikely that such confidence intervals
will be commonly used unless and until they are incorporated in major statistical
packages such as SAS, SPSS, BMDP, Minitab, and so on. I have, on my SAS Program
Page and my SPSS Program Page, programs for constructing confidence intervals for
Cohen's d.
Steiger (2004) argues that when testing for close fit, the appropriate confidence
interval for testing range hypotheses is a 100(1 − 2α)% confidence interval. For example,
with the traditional .05 criterion, use a 90% confidence interval, not a 95% confidence
interval. His argument is that the estimated effect cannot be small in both directions, so
the confidence coefficient is relaxed to provide the same amount of power that would be
obtained with a one-sided test. I am not entirely comfortable with this argument,
especially after reading the Monte Carlo work by Serlin & Zumbo (2001).
References
Cumming, G., & Finch, S. (2001). A primer on the understanding, use, and
calculation of confidence intervals that are based on central and noncentral
distributions. Educational and Psychological Measurement, 61, 532-574.
Fidler, F., & Thompson, B. (2001). Computing correct confidence intervals for
ANOVA fixed- and random-effects effect sizes. Educational and Psychological
Measurement, 61, 575-604.
Smithson, M. (2001). Correct confidence intervals for various regression effect
sizes and parameters: The importance of noncentral distributions in computing
intervals. Educational and Psychological Measurement, 61, 605-632.
Serlin, R. C., & Zumbo, B. D. (2001). Confidence intervals for directional
decisions. Retrieved from
http://edtech.connect.msu.edu/searchaera2002/viewproposaltext.asp?propID=2678
on 20 February 2005.
Steiger, J. H. (2004). Beyond the F test: Effect size confidence intervals and
tests of close fit in the analysis of variance and contrast analysis. Psychological
Methods, 9, 164-182. Retrieved from
http://www.statpower.net/Steiger%20Biblio/Steiger04.pdf on 20. February, 2005.
This document most recently revised on the 5th of April, 2012.
Confidence Interval for Standardized Difference Between Means, Related Samples
You cannot use my Conf_Interval-d2.sas program to construct a confidence interval for d
when the data are from correlated samples. With correlated samples the distributions here are
very complex, not following the noncentral t. You can construct an approximate confidence
interval, g ± Z_cc·SE, where Z_cc is the z for the desired confidence coefficient and
SE = √[ 2(1 − r₁₂)/n + g²/(2(n − 1)) ].
McGraw and Wong (1992, Psychological Bulletin, 111: 361-365) proposed an effect size statistic
which for a two group design with a continuous dependent variable is the probability that a randomly
selected score from the one population will be greater than a randomly sampled score from the
other distribution. As an example they use sexual dimorphism in height among young adult humans.
National statistics are mean = 69.7 inches (SD= 2.8 ) for men, mean = 64.3 inches (SD = 2.6) for women.
If we assume that the distributions are normal, then the probability that a randomly selected man will be
taller than a randomly selected woman is 92%, thus the CL is 92%. They argue that the CL is a statistic
more likely to be understood by statistically naive individuals than are the other available effect size
statistics. I'll reserve judgment on that (naive persons have some funny ideas about probabilities), but it
may help you get a better feel for effect sizes. I assume they use sex differences because most of us
already have a pretty good feeling for how much the sexes differ on things like height and weight.
To calculate the CL with independent samples McGraw and Wong instruct us to compute Z = (X̄₁ − X̄₂)/√(S₁² + S₂²) and then find the probability of obtaining a Z less than the computed value. For the height example, Z = (69.7 − 64.3)/√(2.8² + 2.6²) = 1.41, and P(Z < 1.41) = 92%. Alternatively, one can compute the CL from the more common effect size statistic d. If we have weighted the two samples' variances equally when computing d, that is, d̂ = (X̄₁ − X̄₂)/√[(S₁² + S₂²)/2], then we can compute the Z for the CL as d divided by √2. For the height data, d̂ = (69.7 − 64.3)/√[(2.8² + 2.6²)/2] = 2.00, and Z = 2.00/√2 = 1.41. In their article McGraw and Wong computed d with a weighted (by sample sizes) mean variance, that is, d̂ = (X̄₁ − X̄₂)/√(p₁S₁² + p₂S₂²), where pᵢ = nᵢ/N. They used this weighted mean variance even though they were comparing men with women, where in the population there are about equal numbers in both groups (for some of their variables they had much more data from men than from women). I would have weighted the two variances equally.
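The CL arithmetic above is easy to verify with a short script. This is my own sketch, not part of the original handout; it uses only Python's standard library, and the function name is mine.

```python
from math import sqrt
from statistics import NormalDist

def common_language_es(mean1, sd1, mean2, sd2):
    """McGraw & Wong's CL: the probability that a random score from
    population 1 exceeds a random score from population 2, assuming
    both populations are normal."""
    z = (mean1 - mean2) / sqrt(sd1 ** 2 + sd2 ** 2)
    return NormalDist().cdf(z)

# Height example: men (M = 69.7, SD = 2.8) vs. women (M = 64.3, SD = 2.6)
cl = common_language_es(69.7, 2.8, 64.3, 2.6)
print(round(cl, 2))  # 0.92 -- a randomly chosen man is taller 92% of the time
```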
McGraw and Wong gave hypothetical examples of group differences in IQ (SD = 15) and
computed d, CL, and four other effect size statistics (including the binomial effect size display). I
reproduce the table without the other four effect size statistics but with my addition of examples
corresponding to Cohen's small (d = .2), medium (d = .5), and large (d = .8) effect sizes.
Mean 1   Mean 2   d      CL    Odds
100      100      0.00   50%   1
98.5     101.5    0.20   56%   1.27
96.25    103.75   0.50   64%   1.78
95       105      0.67   68%   2.12
94       106      0.80   72%   2.57
90       110      1.33   83%   4.88
85       115      2.00   92%   11.5
80       120      2.67   97%   32.3
75       125      3.33   99%   99
Power is the conditional probability that one will reject the null hypothesis given
that the null hypothesis is really false.
Imagine that we are evaluating the effect of a putative memory enhancing drug.
We have randomly sampled 25 people from a population known to be normally
distributed with a μ of 100 and a σ of 15. We administer the drug, wait a reasonable
time for it to take effect, and then test our subjects' IQ. Assume that we were so
confident in our belief that the drug would either increase IQ or have no effect that we
entertained directional hypotheses. Our null hypothesis is that after administering the
drug μ ≤ 100; our alternative hypothesis is μ > 100.
These hypotheses must first be converted to exact hypotheses. Converting the
null is easy: it becomes μ = 100. The alternative is more troublesome. If we knew that
the effect of the drug was to increase IQ by 15 points, our exact alternative hypothesis
would be μ = 115, and we could compute power, the probability of correctly rejecting the
false null hypothesis given that μ is really equal to 115 after drug treatment, not 100
(normal IQ). But if we already knew how large the effect of the drug was, we would not
need to do inferential statistics.
One solution is to decide on a minimum nontrivial effect size. What is the
smallest effect that you would consider to be nontrivial? Suppose that you decide that if
the drug increases IQ by 2 or more points, then that is a nontrivial effect, but if the
mean increase is less than 2 then the effect is trivial.
Now we can test the null of μ = 100 versus the alternative of μ = 102. Look at the
figure on the following page (if you are reading this on paper, in black and white, I
recommend that you obtain an electronic copy of this document from our BlackBoard
site and open it in Word so you can see the colors). Let the left curve represent the
distribution of sample means if the null hypothesis were true, μ = 100. This sampling
distribution has a μ = 100 and a σx̄ = 15/√25 = 3. Let the right curve represent the
sampling distribution if the exact alternative hypothesis is true, μ = 102. Its μ is 102
and, assuming the drug has no effect on the variance in IQ scores, its σx̄ = 15/√25 = 3.
The red area in the upper tail of the null distribution is α. Assume we are using a
one-tailed α of .05. How large would a sample mean need be for us to reject the null?
Since the upper 5% of a normal distribution extends from 1.645 standard errors above the μ up to
positive infinity, the sample mean IQ would need be 100 + 1.645(3) = 104.935 or more
to reject the null. What are the chances of getting a sample mean of 104.935 or more if
the alternative hypothesis is correct, if the drug increases IQ by 2 points? The area
under the alternative curve from 104.935 up to positive infinity represents that
probability, which is power. Assuming the alternative hypothesis is true, that μ = 102,
the probability of rejecting the null hypothesis is the probability of getting a sample mean
of 104.935 or more in a normal distribution with μ = 102, σ = 3. Z = (104.935 − 102)/3 =
0.98, and P(Z > 0.98) = .1635. That is, power is about 16%. If the drug really does
increase IQ by an average of 2 points, we have a 16% chance of rejecting the null. If its
effect is even larger, we have a greater than 16% chance.
Suppose we consider 5 the minimum nontrivial effect size. This will separate the
null and alternative distributions more, decreasing their overlap and increasing power.
Now, Z = (104.935 − 105)/3 = −0.02, and P(Z > −0.02) = .5080, or about 51%. It is easier to
detect large effects than small effects.
Suppose we conduct a 2-tailed test, since the drug could actually decrease IQ;
α is now split into both tails of the null distribution, .025 in each tail. We shall reject the
null if the sample mean is 1.96 or more standard errors away from the μ of the null
distribution. That is, if the mean is 100 + 1.96(3) = 105.88 or more (or if it is 100 −
1.96(3) = 94.12 or less) we reject the null. The probability of that happening if the
alternative is correct (μ = 105) is: Z = (105.88 − 105)/3 = .29, P(Z > .29) = .3859, power
= about 39%. We can ignore P(Z < (94.12 − 105)/3) = P(Z < −3.63) = very, very small.
Note that our power is less than it was with a one-tailed test. If you can correctly
predict the direction of effect, a one-tailed test is more powerful than a two-tailed
test.
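The normal-curve power calculations above can be scripted. Below is a minimal sketch of my own (not the author's), reproducing the one-tailed 2-point example and the two-tailed 5-point example; like the text, it counts only the upper rejection region.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()

def power_one_sample_z(mu0, mu1, sigma, n, alpha=0.05, tails=1):
    """Power of a z test on a sample mean; only the upper rejection
    region is counted (the lower-tail contribution is negligible here)."""
    se = sigma / sqrt(n)
    cutoff = mu0 + Z.inv_cdf(1 - alpha / tails) * se  # sample mean needed to reject
    return 1 - Z.cdf((cutoff - mu1) / se)

print(round(power_one_sample_z(100, 102, 15, 25, tails=1), 3))  # ~.164, the text's .1635
print(round(power_one_sample_z(100, 105, 15, 25, tails=2), 3))  # ~.385, the text's .3859
```

The tiny differences from the text come from the text's rounding of Z to two decimals before the table lookup.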
Consider what would happen if you increased sample size to 100. Now the
σx̄ = 15/√100 = 1.5. With the null and alternative distributions less plump, they should
overlap less, increasing power. With σx̄ = 1.5, the sample mean will need be 100 +
(1.96)(1.5) = 102.94 or more to reject the null. If the drug increases IQ by 5 points,
power is: Z = (102.94 − 105)/1.5 = −1.37, P(Z > −1.37) = .9147, or between 91 and
92%. Anything that decreases the standard error will increase power. This may
be achieved by increasing the sample size or by reducing the σ of the dependent
variable. The σ of the criterion variable may be reduced by reducing the influence of
extraneous variables upon the criterion variable (eliminating noise in the criterion
variable makes it easier to detect the signal, the grouping variable's effect on the
criterion variable).
Now consider what happens if you change α. Let us reduce α to .01. Now the
sample mean must be 2.58 or more standard errors from the null μ before we reject the
null. That is, 100 + 2.58(1.5) = 103.87. Under the alternative, Z = (103.87 − 105)/1.5 =
−0.75, P(Z > −0.75) = .7734 or about 77%, less than it was with α at .05, ceteris
paribus. Reducing α reduces power.
Please note that all of the above analyses have assumed that we have used a
normally distributed test statistic, Z = (X̄ − μ)/σx̄, with effect size d = (μ₁ − μ₀)/σ. For our IQ problem with minimum
nontrivial effect size at 5 IQ points, d = (105 − 100)/15 = 1/3. We combine d with N to
get δ, the noncentrality parameter. For the one sample test, δ = d√N. For our IQ
problem with N = 25, δ = (1/3)√25 = 1.67. Once δ is obtained, power is obtained using
the power table in our textbook. For a .05 two-tailed test, power = 36% for a δ of 1.60
and 40% for a δ of 1.70. By linear interpolation, power for δ = 1.67 is 36% + .7(40% −
36%) = 38.8%, within rounding error of the result we obtained using the normal curve.
For a one-tailed test, use the column in the table with α twice its one-tailed value. For α
of .05 one-tailed, use the .10 two-tailed column. For δ = 1.67, power then = 48% +
.7(52% − 48%) = 50.8%, the same answer we got with the normal curve.
If the sample size is large enough that there will be little difference between the t
distribution and the standard normal curve, then the solution we obtain using Howell's
table is good. You can use the GPower program to fine tune the solution you get using
Howell's table.
If we were not able to reject the null hypothesis in our research on the putative IQ
drug, and our power analysis indicated about 39% power, we would be in an awkward
position. Although we could not reject the null, we also could not accept it, given that
we only had a relatively small (39%) chance of rejecting it even if it were false. We
might decide to repeat the experiment using an n large enough to allow us to accept
the null if we cannot reject it. In my opinion, if 5% is a reasonable risk for a Type I
error (α), then 5% is also a reasonable risk for a Type II error (β), so let us use power =
1 − β = 95%. From the power table, to have power = .95 with α = .05, δ is 3.60.
n = (δ/d)². For a five-point minimum IQ effect, n = (3.6/(1/3))² = 116.64. Thus, if we repeat
the experiment with 117 subjects and still cannot reject the null, we can accept the null
and conclude that the drug has no nontrivial (≥ 5 IQ points) effect upon IQ. The null
hypothesis we are accepting here is a loose null hypothesis [95 < μ < 105] rather
than a sharp null hypothesis [μ = exactly 100]. Sharp null hypotheses are probably
very rarely ever true.
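The same sample-size calculation can be done without the power table by replacing the tabled δ = 3.60 with normal-curve quantiles (z for α/2 plus z for the desired power). A sketch of my own, under that normal approximation:

```python
from math import ceil
from statistics import NormalDist

Z = NormalDist()

def n_one_sample(d, power=0.95, alpha=0.05):
    """Required N for a one-sample test (normal approximation):
    delta = z_(1 - alpha/2) + z_power, then N = (delta / d)**2."""
    delta = Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)  # ~3.605, near the tabled 3.60
    return ceil((delta / d) ** 2)

print(n_one_sample(1 / 3))  # 117, matching 116.64 rounded up to whole subjects
print(n_one_sample(0.2))    # a minimum d of .20 demands far more subjects
```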
Others could argue with your choice of the minimum nontrivial effect size. Cohen
has defined a small effect as d = .20, a medium effect as d = .50, and a large effect as d
= .80. If you defined minimum d at .20, you would need even more subjects for 95%
power.
A third approach one can take is to find the smallest effect that one could have
detected with high probability given n. If that d is small, and the null hypothesis is not
rejected, then it is accepted. For example, I used 225 subjects in the IQ enhancer
study. For power = 95%, δ = 3.60 with α at .05 two-tailed, and d = δ/√N = 3.60/√225 = 0.24. If I
can't reject the null, I accept it, concluding that if the drug has any effect, it is a small
effect, since I had a 95% chance of detecting an effect as small as .24σ. The loose null
hypothesis accepted here would be that the population μ differs from 100 by less than
.24σ.
Two Independent Samples
For the Two Group Independent Sampling Design, δ = d√(n/2), where n = the
number of scores in one group, and both groups have the same n; d = (μ₁ − μ₂)/σ, where σ
is the standard deviation in either population, assuming σ is identical in both
populations.
If n₁ ≠ n₂, use the harmonic mean sample size, ñ = 2/(1/n₁ + 1/n₂).
For a fixed total N, the harmonic mean (and thus power) is higher the more
nearly equal n₁ and n₂ are. This is one good reason to use equal n designs. Other
good reasons are computational simplicity with equal n's and greater robustness to
violation of assumptions. Try computing the effective (harmonic) sample size for 100
subjects evenly split into two groups of 50 each and compare that with the effective
sample size obtained if you split them into 10 in one group and 90 in the other.
You should be able to rearrange the above formula for δ to solve for d or for n as
required.
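The harmonic-mean comparison suggested above works out like this (my sketch, not the author's):

```python
def harmonic_n(n1, n2):
    """Effective per-group sample size for an unequal-n two-group design."""
    return 2 / (1 / n1 + 1 / n2)

print(round(harmonic_n(50, 50), 2))  # 50.0 -- an even split keeps full efficiency
print(round(harmonic_n(10, 90), 2))  # 18.0 -- a 10/90 split wastes most of the 100 cases
```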
Consider the following a priori power analysis. We wish to compare the
Advanced Psychology GRE scores of students in general psychology master's programs
with those of students in clinical psychology master's programs. We decide that we will be
satisfied if we have enough data to have an 80% chance of detecting an effect of 1/3 of
a standard deviation, employing a .05 criterion of significance. How many scores do we
need in each group, if we have the same number of scores in each group? From the
power table, we obtain the value of 2.8 for δ. n = 2(δ/d)² = 2(2.8/(1/3))² = 141 scores in
each group, a total of 282 scores.
Consider the following a posteriori power analysis. We have available only 36
scores from students in clinical programs and 48 scores from students in general
programs. What are our chances of detecting a difference of 40 points (which is that
actually observed at ECU in 1981) if we use a .05 criterion of significance and the
standard deviation is 98? The standardized effect size, d, is 40/98 = .408. The
harmonic mean sample size is ñ = 2/(1/36 + 1/48) = 41.14.
δ = d√(ñ/2) = .408√(41.14/2) = 1.85. From our power table, power is 46% (halfway between .44 and
.48).
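Replacing the power table with a normal approximation, this a posteriori example can be scripted as follows (my own sketch; power ≈ P(Z > z_crit − δ) stands in for the table lookup):

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()

def power_two_group(d, n1, n2, alpha=0.05):
    """Approximate two-tailed power for two independent groups:
    delta = d * sqrt(n_harmonic / 2), power ~ P(Z > z_crit - delta)."""
    n_h = 2 / (1 / n1 + 1 / n2)          # harmonic mean sample size
    delta = d * sqrt(n_h / 2)
    return 1 - Z.cdf(Z.inv_cdf(1 - alpha / 2) - delta)

print(round(power_two_group(40 / 98, 36, 48), 2))  # 0.46, as read from the table
```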
Correlated Samples
The correlated samples t test is mathematically equivalent to a one-sample t test
conducted on the difference scores (for each subject, score under one condition less
score under the other condition). The greater
12
, the correlation between the scores in
the one condition and those in the second condition, the smaller the standard deviation
of the difference scores and the greater the power, ceteris paribus. By the variance
sum law, the standard deviation of the difference scores is
2 1 12
2
2
2
1
2 + =
Diff
.
If we assume equal variances, this simplifies to ) 1 ( 2 =
Diff
.
The reduction in the standard error should increase power relative to an
independent samples test with the same number of scores, but the correlated t has only
half the degrees of freedom as the independent t, which causes some loss of power.
The gain in power from reducing the standard error will generally greatly exceed the
loss of power due to loss of half of the degrees of freedom, but one could actually have
less power with the correlated t than with the independent t if sample size were low and
the ρ₁₂ low. Please see my document Correlated t versus Independent t.
When conducting a power analysis for the correlated samples design, we can
take into account the effect of ρ₁₂ by computing dDiff, an adjusted value of d:
dDiff = d/√(2(1 − ρ₁₂)), where d is the effect size as computed above with
independent samples. We can then compute power via δ = dDiff√n, or the required
sample size via n = (δ/dDiff)².
Please note that using the standard deviation of the difference scores, rather than the
standard deviation of the criterion variable, as the denominator of dDiff, is simply
Howell's method of incorporating into the analysis the effect of the correlation produced
by matching. If we were computing estimated d (Hedges g) as an estimate of the
standardized effect size given the obtained results, we would use the standard deviation
of the criterion variable in the denominator, not the standard deviation of the difference
scores. I should admit that on rare occasions I have argued that, in a particular
research context, it made more sense to use the standard deviation of the difference
scores in the denominator of g.
Consider the following a priori power analysis. I am testing the effect of a new
drug on performance on a task that involves solving anagrams. I want to have enough
power to be able to detect an effect as small as 1/5 of a standard deviation (d = .2) with
95% power. I consider Type I and Type II errors equally serious and am employing a
.05 criterion of statistical significance, so I want beta to be not more than .05. I shall use
a correlated samples design (within subjects) and two conditions (tested under the
influence of the drug and not under the influence of the drug). In previous research I
have found the correlation between conditions to be approximately .8.
dDiff = .2/√(2(1 − .8)) = .3162. n = (δ/dDiff)² = (3.6/.3162)² = 130.
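Here is the same a priori calculation as a script (mine, not the author's), again substituting the normal-curve quantile sum (≈3.605) for the tabled δ = 3.60:

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()

def n_correlated(d, rho, power=0.95, alpha=0.05):
    """Required pairs for a correlated-samples design: adjust d for the
    correlation between conditions, then n = (delta / d_diff)**2."""
    d_diff = d / sqrt(2 * (1 - rho))
    delta = Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)
    return ceil((delta / d_diff) ** 2)

print(n_correlated(0.2, 0.8))  # 130 pairs, as in the worked example
```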
Consider the following a posteriori power analysis. We assume that GRE Verbal
and GRE Quantitative scores are measured on the same metric, and we wish to
determine whether persons intending to major in experimental or developmental
psychology are equally skilled in things verbal and things quantitative. If we employ a
.05 criterion of significance, and if the true size of the effect is 20 GRE points (that was
the actual population difference the last time I checked it, with quantitative > verbal),
what are our chances of obtaining significant results if we have data on 400 persons?
We shall assume that the correlation between verbal and quantitative GRE is .60 (that is
what it was for social science majors the last time I checked). We need to know what
the standard deviation is for the dependent variable, GRE score. The last time I
checked, it was 108 for verbal, 114 for quantitative. Let us just average those and use
111. σDiff = 111√(2(1 − .6)) = 99.28. dDiff = 20/99.28 = .20145. δ = dDiff√n = .20145√400 = 4.03.
From the power table, power = 98%.
Pearson r
For a Pearson Correlation Coefficient, d = the size of the correlation coefficient in
the population, and δ = d√(n − 1), where n = the number of pairs of scores in the sample.
Consider the following a priori power analysis. We wish to determine whether or
not there is a correlation between misanthropy and support for animal rights. We shall
measure these attributes with instruments that produce scores for which it is reasonable
to treat the variables as continuous. How many respondents would we need to have a
95% probability of obtaining significant results if we employed a .05 criterion of
significance and if the true value of the correlation (in the population) was 0.2?
n = (δ/d)² + 1 = (3.6/.2)² + 1 = 325.
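A quick check of the Pearson r example, using the tabled δ = 3.60 exactly as the document does (my sketch):

```python
def n_pearson(rho, delta=3.60):
    """Pairs needed to detect a population correlation rho;
    delta = 3.60 is the tabled value for 95% power at alpha = .05 two-tailed."""
    return round((delta / rho) ** 2 + 1)

print(n_pearson(0.2))  # 325 pairs
```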
Type III Errors and Three-Choice Tests
Leventhal and Huynh (Psychological Methods, 1996, 1, 278-292) note that it is
common practice, following rejection of a nondirectional null, to conclude that the
direction of difference in the population is the same as what it is in the sample. This
procedure is what they call a "directional two-tailed test." They also refer to it as a
"three-choice test" (I prefer that language), in that the three hypotheses entertained are:
parameter = null value, parameter < null value, and parameter > null value. This makes
possible a Type III error: correctly rejecting the null hypothesis, but incorrectly inferring
the direction of the effect - for example, when the population value of the tested
parameter is actually more than the null value, getting a sample value that is so much
below the null value that you reject the null and conclude that the population value is
also below the null value. The authors show how to conduct a power analysis that
corrects for the possibility of making a Type III error. See my summary at:
http://core.ecu.edu/psyc/wuenschk/StatHelp/Type_III.htm
Copyright 2011, Karl L. Wuensch - All rights reserved.
Power-Example.doc
Examples of the Use of Power Analysis in Actual Research Projects
Two Conditions, Within-Subjects Design
Here is a real-life example of an a priori power analysis done by a graduate
student in our health psychology program. I think it serves as a good example of how
power analysis is an essential part of planning research.
I have a within subjects design. I am trying to predict the necessary sample size
(smallest) necessary to achieve adequate power (0.80 is fine). My problem now is I
don't know the expected value for the correlation between baseline scores and post-test
scores. Could you give me an idea about this --- or where I would get it from?
I am using prior research to estimate how large the effect will be in my research.
The stats from that prior research are:
Baseline: Mean = 41.7; SD = 9.93; Post: Mean = 32.8; SD = 4.94
The expected value for the correlation between baseline scores and post-test
scores must be estimated. Look at my document at
http://core.ecu.edu/psyc/wuenschk/docs30/Power-N.doc, the section Correlated
Samples T the table on page 5 shows how the required number of cases to achieve
80% power differs with both size of the effect and the correlation between conditions.
If the scores will be obtained from an instrument that has been used before, you
should be able to find, in the literature or from the researchers who have used that
instrument, an estimate of its reliability (Cronbach's alpha, for example). You could then
use that as an estimate of the baseline-posttest correlation. If others have used the
same dependent variable in pre-post designs, you could estimate the pre-post
correlation for your study as being about what it was in those other studies. If you are
still striking out, you can simply estimate the correlation as having a modest value, say
.7, and then after you have started collecting data check to see what the correlation is. If it
is much less than .7, then your study will be underpowered and you know you need to
increase the sample size beyond what you expected to need unless, of course, the
data show that the effect is also large enough to be able to detect with the sample size
you will obtain.
I suggest you obtain enough data to be able to detect an effect that is only
medium in size (one-half standard deviation) or, if you expect the effect to be small but
not trivial, small in size (one-fifth standard deviation). For the stats you provided, g is
about 1.1: g = (41.7 − 32.8)/√[(9.93² + 4.94²)/2] ≈ 1.1.
If we were computing g to report as an effect-size estimate, I would probably use
the baseline SD as the standardizer, that is, report Glass's delta rather than Hedges' g.
For a medium effect and rho = .7, ddiff = .5/sqrt(2(1 − .7)) = .645. Taking that to
G*Power, we find that you need 22 cases (each measured pre and post) to get 80%
power to detect a medium-sized effect.
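The arithmetic in this example can be sketched as follows (my code, not part of the correspondence). Note that the simple normal approximation for n runs a few cases below the exact noncentral-t answer of 22 that G*Power reports:

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()

d, rho = 0.5, 0.7                 # medium effect; .7 is the assumed pre-post correlation
d_diff = d / sqrt(2 * (1 - rho))
print(round(d_diff, 3))           # 0.645, as in the example

# Normal-approximation n for 80% power at alpha = .05 two-tailed;
# exact noncentral-t software such as G*Power gives 22 cases.
delta = Z.inv_cdf(0.975) + Z.inv_cdf(0.80)
print(ceil((delta / d_diff) ** 2))  # 19 by this approximation
```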
Two Groups, Independent Samples
Sylwia Mlynarska was working on her proposal (for her Science Fair project) and
the Institutional Review Board (which has to approve the proposal before she is allowed
to start collecting data) requested that she conduct a power analysis to determine how
many respondents she need recruit to answer her questionnaire. As I was serving as
her research mentor in this matter, I assisted her with this analysis. Below is a copy of
my correspondence with her.
From: Wuensch, Karl L.
Sent: Friday, June 02, 2000 1:04 PM
To: 'Sylwia Mlynarska'
Subject: A Priori Power Analysis
Sylwia, it so happens that the question you ask concerns exactly the topic that
we are covering in my undergraduate statistics class right now, so I am going to share
my response with the class.
PSYC 2101 students, Sylwia is a sophomore at the Manhattan Center for
Science and Mathematics High School in New York City. She is researching whether or
not ethnic groups differ on attitude towards animals (animal rights), using an instrument
of my construction. She is now in the process of obtaining approval (from the high
school's Institutional Review Board) of her proposal to conduct this research. I am
assisting her long-distance. Here is my response to her question about sample sizes:
Sylwia, the more subjects you have, the greater your power. Power is the
probability that you will find a difference, assuming that one really exists. Power is also
a function of the magnitude of the difference you seek to detect. If the difference is, in
fact, large, then you don't need many subjects to have a good chance of detecting it. If
it is small, then you do. Of course, you don't really know how large the difference
between ethnic groups is, so that makes it hard to plan. We assume that you will be
satisfied if your power is 80%. That is, if there really is a difference, you have an 80%
chance of detecting it statistically. Put another way, the odds of your finding the
difference are 4 to 1 in your favor. We also assume that you will be using the traditional
5% criterion (alpha) of statistical significance.
If the difference between two ethnic groups is small (defined by Cohen as
differing by 1/5 of a standard deviation), then to have 80% power you would need to
have 393 subjects in each ethnic group. If the difference is of medium size (1/2 of a
standard deviation), then you need only 64 subjects in each ethnic group. If the
difference is large (4/5 of a standard deviation), then you only need 26 subjects per
ethnic group.
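Those per-group figures come from Cohen's tables. A normal-approximation script of my own gets close; the exact t-based values quoted above run one or two higher for the medium and large cases:

```python
from math import ceil
from statistics import NormalDist

Z = NormalDist()

def n_per_group(d, power=0.80, alpha=0.05):
    """Per-group n for two independent groups (normal approximation):
    n = 2 * ((z_crit + z_power) / d)**2."""
    z = Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)
    return ceil(2 * (z / d) ** 2)

for d in (0.2, 0.5, 0.8):
    print(d, n_per_group(d))  # 393, 63, 25 vs. the quoted 393, 64, 26
```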
It is typically difficult to get enough subjects to have a good chance of detecting a
small effect, so we generally settle for getting enough to have a good chance of
detecting a medium effect -- but if you can get high power even for small effects, there
is this advantage: If your results fall short of statistical significance (you did not detect
the difference), you can still make a strong statement, you can say that the difference, if
it exists, must be quite small. Without great power, when your result is statistically
"nonsignificant," it is quite possible that a difference is present, but your research did
just not have sufficient power to detect it (this circumstance is referred to as a Type II
error).
Post Hoc Power Analysis As Part of a Critical Evaluation of Published Research
Michelle Marvier wrote a delightful article for the American Scientist (2001,
Ecology of Transgenic Crops, 89: 160-167). She noted that before a transgenic crop
(genetically modified) receives government approval, it must be shown to be relatively
4
safe. Then she went on to discuss an actual petition to the government. Calgene Inc.
submitted a petition for approval of a variety of Bt cotton (cotton which contains genes
from a bacterium that result in it producing a toxin that kills insects which prey upon
cotton). To test this transgenic crop's effects on friendly invertebrates found in the soil around
the plants (such as earthworms), they conducted research with a sample of four
subjects. The test period only lasted 14 days, and in that period the earthworms
exposed to the transgenic cotton plants gained 29.5% less weight than did control
earthworms. The difference between the groups was not statistically significant, which
the naive consumer might interpret to mean that the transgenic cotton had no influence
on the growth of the earthworms. Of course, with a sample size of only 4, the chances
of finding an undesirable effect of the transgenic cotton, assuming that such an effect
does exist, are very small. Dr. Marvier calculated that with the small sample sizes
employed in this research, the effect of the transgenic cotton would need to be quite
large (the exposed earthworms gaining less than half as much weight as the control
earthworms) to have a good (90%) chance (power) of detecting the effect (using the
conventional .05 level of significance, which is, in this case, IMHO, too small given the
relative risks of Type I vs Type II errors).
Karl L. Wuensch, Psychology, East Carolina University, December, 2007.
Power-N.doc
Estimating the Sample Size Necessary to Have Enough Power
How much data do you need -- that is, how many subjects should you include in
your research? If you do not consider the expenses of gathering and analyzing the data
(including any expenses incurred by the subjects), the answer to this question is very
simple -- the more data the better. The more data you have, the more likely you are to
reach a correct decision and the less error there will be in your estimates of parameters
of interest. The ideal would be to have data on the entire population of interest. In that
case you would be able to make your conclusions with absolute confidence (barring any
errors in the computation of the descriptive statistics) and you would not need any
inferential statistics.
Although you may sometimes have data on the entire population of interest,
more commonly you will consider the data on hand as a random sample of the
population of interest. In this case, you will need to employ inferential statistics, and
accordingly power becomes an issue. As you already know, the more data you have,
the more power you have, ceteris paribus. So, how many subjects do you need?
Before you can answer the question "how many subjects do I need?", you will
have to answer several other questions, such as:
How much power do I want?
What is the likely size (in the population) of the effect I am trying to detect, or,
what is the smallest effect size that I would consider of importance?
What criterion of statistical significance will I employ?
What test statistic will I employ?
What is the standard deviation (in the population) of the criterion variable?
For correlated samples designs, what is the correlation (in the population)
between groups?
In my opinion, if one considers Type I and Type II errors equally serious, then
one should have enough power to make β = α. If employing the traditional .05 criterion
of statistical significance, that would mean you should have 95% power. However,
getting 95% power usually involves expenses too great for behavioral researchers --
that is, it requires getting data on many subjects.
A common convention is to try to get at least enough data to have 80% power.
So, how do you figure out how many subjects you need to have the desired amount of
power? There are several methods, including:
You could buy an expensive, professional-quality software package to do the
power analysis.
You could buy an expensive, professional-quality book on power analysis and
learn to do the calculations yourself and/or to use power tables and figures to
estimate power.
You could try to find an interactive web page on the Internet that will do the
power analysis for you. I do not have a great deal of trust in this method.
You could download and use the GPower program, which is free, not too
difficult to use, and generally reliable (this is not to say that it is error free). For
an undetermined reason, this program will not run on my laptop, but it runs fine
on all my other computers.
You could use the simple guidelines provided in Jacob Cohen's "A Power Primer"
(Psychological Bulletin, 1992, 112, 155-159).
Here are minimum sample sizes for detecting small (but not trivial), medium, and
large sized effects for a few simple designs. I have assumed that you will employ the
traditional .05 criterion of statistical significance, and I have used Cohen's guidelines for
what constitutes a small, medium, or large effect.
Chi-Square, One- and Two-Way
Effect size is computed as w = √[Σᵢ₌₁ᵏ (P₁ᵢ − P₀ᵢ)²/P₀ᵢ], where k is the number of cells, P₀ᵢ is the
population proportion in cell i under the null hypothesis, and P₁ᵢ is the population
proportion in cell i under the alternative hypothesis. For example, suppose that you
plan to analyze a 2 x 2 contingency table. You decide that the smallest effect that you
would consider to be nontrivial is one that would be expected to produce a contingency
table like this, where the experimental variable is whether the subject received a
particular type of psychotherapy or just a placebo treatment and the outcome is whether
the subject reported having benefited from the treatment or not:
Experimental Group
Outcome    Treatment    Control
Positive      55           45
Negative      45           55
For this table, w = √[4(.275 − .25)²/.25] = .10.
For each cell in the table you compute the expected frequency under the null
hypothesis (P₀ᵢ) by multiplying the number of scores in the row in which that cell falls by
the number of scores in the column in which that cell falls and then dividing by the total
number of scores in the table. Then you divide by total N again to convert the expected
frequency to an expected proportion. For the table above the expected proportion will
be the same for every cell, (100)(100)/[(200)(200)] = .25. For each cell you also compute the
expected proportion under the alternative hypothesis (P₁ᵢ) by dividing the expected
number of scores in that cell by total N. For the table above that will give you the same
proportion for every cell, 55/200 = .275 or 45/200 = .225. The squared difference
between P₁ᵢ and P₀ᵢ, divided by P₀ᵢ, is the same in each cell, .0025. Sum that across four
cells and you get .01. The square root of .01 is .10. Please note that this is also the
value of phi.
In the treatment group, 55% of the patients reported a positive outcome. In the
control group only 45% reported a positive outcome. In the treatment group the odds of
reporting a positive outcome are 55 to 45, that is, 1.2222. In the control group the odds
are 45 to 55, that is, .8182. That yields an odds ratio of 1.2222/.8182 = 1.49. That is,
the odds of reporting a positive outcome are, for the treatment group, about one and a
half times higher than they are for the control group.
What if the effect is larger, like this:
Experimental Group
Outcome    Treatment   Control
Positive       65         35
Negative       35         65
Here w = √[4(.325 - .25)²/.25] = .30. Now the odds ratio is 3.45 and the phi is .3.
Or even larger, like this:
Experimental Group
Outcome    Treatment   Control
Positive       75         25
Negative       25         75
Here w = √[4(.375 - .25)²/.25] = .50. Now the odds ratio is 9 and the phi is .5.
Cohen considered a w of .10 to constitute a small effect, .3 a medium effect, and
.5 a large effect. Note that these are the same values indicated below for a Pearson r.
The required total sample size depends on the degrees of freedom, as shown in the
table below:
Effect Size
df Small Medium Large
1 785 87 26
2 964 107 39
3 1,090 121 44
4 1,194 133 48
5 1,293 143 51
6 1,362 151 54
The correspondence between phi and odds ratios depends on the distribution
of the marginals.
More on w.
Pearson r
Cohen considered a ρ of .1 to be small, .3 medium, and .5 large. You need 783
pairs of scores for a small effect, 85 for a medium effect, and 28 for a large effect. In
terms of percentage of variance explained, small is 1%, medium is 9%, and large is
25%.
One-Sample T Test
Effect size is computed as d = (μ₁ - μ₀)/σ, the difference between the population mean
and the null-hypothesized mean in standard deviation units. A d of .2 is considered
small, .5 medium, and .8 large. For 80% power you need 196 scores for a small effect,
33 for medium, and 14 for large.
Cohen's d is not affected by the ratio of n₁ to n₂, but some alternative measures
of magnitude of effect (r_pb and η²) are. See this document.
Independent Samples T, Pooled Variances.
Effect size is computed as d = (μ₁ - μ₂)/σ. A d of .2 is considered small, .5
medium, and .8 large. Suppose that you have populations with means of 10 and 12 and
a within-group standard deviation of 10. Then d = (12 - 10)/10 = .2. If the sample
size is large enough
that there will be little difference between the t distribution and the standard normal
curve, then we can obtain the value of δ (the noncentrality parameter) from a table
found in David Howell's statistics texts. With the usual nondirectional hypotheses and a
.05 criterion of significance, δ is 2.8 for power of 80%. You can use the G*Power
program to fine-tune the solution you get using Howell's table.
I constructed the table below using Howell's table and G*Power, assuming
nondirectional hypotheses and a .05 criterion of significance.
     Small effect             Medium effect            Large effect
 d    ρ    d_Diff   n     d    ρ    d_Diff   n     d    ρ    d_Diff   n
.2   .00   .141   393    .5   .00   0.354   65    .8   .00   0.566   26
.2   .50   .200   196    .5   .50   0.500   33    .8   .50   0.800   14
.2   .75   .283   100    .5   .75   0.707   18    .8   .75   1.131    8
.2   .90   .447    41    .5   .90   1.118    8    .8   .90   1.789    4
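The entries in this table can be approximately reproduced from the formulas in the surrounding text: d_Diff = d/√(2(1 - ρ)) and n ≈ (δ/d_Diff)² with δ = 2.8. A sketch (the function names are mine):

```python
import math

DELTA = 2.8  # delta for 80% power, two-tailed alpha = .05 (from Howell's table)

def d_diff(d, rho):
    """Effect size expressed on the difference-score metric."""
    return d / math.sqrt(2 * (1 - rho))

def approx_n(d, rho, delta=DELTA):
    """Large-sample approximation to the required number of pairs."""
    return (delta / d_diff(d, rho)) ** 2

# A couple of entries from the table above:
# d = .2, rho = .50 -> d_diff = .200, n about 196
# d = .5, rho = .75 -> d_diff = .707, n about 16 (the exact answer, 18, is a
#   bit larger because the approximation understates n when n is small)
```
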
IMHO, one should not include the effect of the correlation in one's calculation of d
with correlated samples. Consider a hypothetical case. We have a physiological
measure of arousal for which the mean and standard deviation, in our population of
interest, are 50 (M) and 10 (SD). We wish to evaluate the effect of an experimental
treatment on arousal. We decide that the smallest nontrivial effect would be one of 2
points, which corresponds to a standardized effect size of d = .20.
Now suppose that the correlation is .75. The SD of the difference scores would
be 7.071, and the d_Diff would be .28. If our sample means differed by exactly 2 points,
what would be our effect size estimate? Despite d_Diff being .28, the difference is still
just 2 points, which corresponds to a d of .20 using the original group standard
deviations, so, IMHO, we should estimate d as being .20.
Now suppose that the correlation is .9. The SD of the difference scores would be
4.472, and the d_Diff would be .45, but the difference is still just two points, so we
should not claim a larger effect just because the high correlation reduced the standard
deviation of the difference scores. We should still estimate d as being .20.
Note that the correlated samples t will generally have more power than an
independent samples t, holding the number of scores constant, as long as ρ₁₂ is not
very small or negative. With a small ρ₁₂ it is possible to get less power with the
correlated t than with the independent samples t; see this illustration. The correlated
samples t has only half the df of the independent samples t, making the critical value of t
larger. In most cases the reduction in the standard error will more than offset this loss
of df. Do keep in mind that if you want to have as many scores in a between-subjects
design as you have in a within-subjects design you will need twice as many cases.
One-Way Independent Samples ANOVA
Cohen's f (effect size) is computed as f = √[ Σ (μ_j - μ)² / k ] / σ_error, where μ_j is the population
mean for a single group, μ is the grand mean, k is the number of groups, and the error
variance is the mean within-group variance. This can also be computed as f = σ_means / σ_error,
where the numerator is the standard deviation of the population means and the
denominator is the within-group standard deviation.
We assume equal sample sizes and homogeneity of variance.
Suppose that the effect size we wish to use is one where the three population
means are 480, 500, and 520, with the within-group standard deviation being 100.
Using the first formula above, f = √[ (400 + 0 + 400) / 3(100)² ] = .163. Using the second formula,
the population standard deviation of the means (with k, not k-1, in the denominator) is
16.33, so f = 16.33/100 = .163. By the way, David Howell uses the symbol φ' instead
of f.
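Both routes to f can be checked with a few lines of Python (a sketch using the example values above):

```python
import math

means = [480, 500, 520]
sd_within = 100.0
k = len(means)
grand = sum(means) / k  # 500

# First formula: f = sqrt( sum((mu_j - mu)^2) / (k * sigma_error^2) )
f1 = math.sqrt(sum((m - grand) ** 2 for m in means) / (k * sd_within ** 2))

# Second formula: f = sigma_means / sigma_error (k, not k - 1, in the denominator)
sigma_means = math.sqrt(sum((m - grand) ** 2 for m in means) / k)  # 16.33
f2 = sigma_means / sd_within

# both give f = .163
```
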
You should be familiar with η² as the treatment variance expressed as a
proportion of the total variance. If η² is the treatment variance, then 1 - η² is the error
variance. With this in mind, we can define f² = η²/(1 - η²). For correlated samples, the
effect size computed on the difference scores is d_Diff = d/√(2(1 - ρ)); this is the
quantity G*Power calls dz.
Use the following settings:
Statistical test: Means: Difference between two dependent means (matched pairs)
Type of power analysis: A priori: Compute required sample size given α, power, and
effect size
Tail(s): Two
Effect size dz: .3162
α error prob: 0.05
Power (1 - β err prob): .95
Click Calculate. You will find that you need 132 pairs of scores.
Output: Noncentrality parameter = 3.632861
Critical t = 1.978239
Df = 131
Total sample size = 132
Actual power = 0.950132
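This output can be verified against the noncentral t distribution, since the noncentrality parameter is δ = dz·√n. A sketch (assumes scipy is available):

```python
import math
from scipy import stats

dz, n, alpha = 0.3162, 132, 0.05
df = n - 1                                # 131
delta = dz * math.sqrt(n)                 # about 3.633, matching the output above
t_crit = stats.t.ppf(1 - alpha / 2, df)   # about 1.978
# Two-tailed power: mass of the noncentral t beyond the critical values.
power = (1 - stats.nct.cdf(t_crit, df, delta)) + stats.nct.cdf(-t_crit, df, delta)
# power comes out near .950
```
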
Consider the following a posteriori power analysis. We assume that GRE Verbal
and GRE Quantitative scores are measured on the same metric, and we wish to
determine whether persons intending to major in experimental or developmental
psychology are equally skilled in things verbal and things quantitative. If we employ a
.05 criterion of significance, and if the true size of the effect is 20 GRE points (that was
the actual population difference the last time I checked it, with quantitative > verbal),
what are our chances of obtaining significant results if we have data on 400 persons?
We shall assume that the correlation between verbal and quantitative GRE is .60 (that is
what it was for social science majors the last time I checked). We need to know what
the standard deviation is for the dependent variable, GRE score. The last time I
checked, it was 108 for verbal, 114 for quantitative.
Change type of power analysis to Post hoc. Set the total sample size to 400.
Click on Determine. Select "from group parameters." Set the group means to 0 and
20 (or any other two means that differ by 20), the standard deviations to 108 and 114,
and the correlation between groups to .6. Click Calculate in this window to obtain the
effect size dz, .2100539.
Click Calculate and transfer to main window to move the effect size dz to the
main window. Click Calculate in the main window to compute the power. You will see
that you have 98% power.
Pearson r
Consider the following a priori power analysis. We wish to determine whether or
not there is a correlation between misanthropy and support for animal rights. We shall
measure these attributes with instruments that produce scores for which it is reasonable
to treat the variables as continuous. How many respondents would we need to have a
95% probability of obtaining significant results if we employed a .05 criterion of
significance and if the true value of the correlation (in the population) was 0.2?
Select the following options:
Test family: t tests
Statistical test: Correlation: Point biserial model (that is, a regression analysis)
Type of power analysis: A priori: Compute required sample size given α, power, and
effect size
Tails: Two
Effect size |r|: .2
α error prob: 0.05
Power (1 - β err prob): .95
Click Calculate and you will see that you need 314 cases.
t tests - Correlation: Point biserial model
Analysis: A priori: Compute required sample size
Input: Tail(s) = Two
Effect size |r| = .2
α err prob = 0.05
Power (1 - β err prob) = 0.95
Output: Noncentrality parameter = 3.617089
Critical t = 1.967596
Df = 312
Total sample size = 314
Actual power = 0.950115
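The noncentrality parameter in this output is δ = r√n / √(1 - r²); a sketch of the check (assumes scipy is available):

```python
import math
from scipy import stats

r, n, alpha = 0.2, 314, 0.05
df = n - 2                                       # 312
delta = r * math.sqrt(n) / math.sqrt(1 - r**2)   # about 3.617, as in the output
t_crit = stats.t.ppf(1 - alpha / 2, df)          # about 1.968
power = (1 - stats.nct.cdf(t_crit, df, delta)) + stats.nct.cdf(-t_crit, df, delta)
# about .950
```
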
Check out Steiger and Fouladi's R2 program, which will do power analysis (and
more) for correlation models, including multiple correlation.
Install G Power on Your Personal Computer
If you would like to install G*Power on your Windows computer, you can
download it from Universität Düsseldorf.
Return to Wuensch's Statistics Lessons Page
February, 2012.
G*Power: One-Way Independent Samples ANOVA
See the power analysis done by hand in my document One-Way Independent
Samples Analysis of Variance. Here I shall do it with G*Power.
We want to know how much power we would have for a three-group ANOVA
where we have 11 cases in each group and the effect size in the population is
f = .163. When we did it by hand, using the table in our text book, we found power =
10%. Boot up G*Power:
Click OK. Click OK again on the next window.
Click Tests, F-Test (Anova).
Under Analysis, select Post Hoc. Enter .163 as the Effect size f, .05 as the
Alpha, 33 as the Total sample size, and 3 as the number of Groups. Click Calculate.
G*Power tells you that power = .1146.
OK, how many subjects would you need to raise power to 70%? Under Analysis,
select A Priori, under Power enter .70, and click Calculate.
G*Power advises that you need 294 cases, evenly split into three groups, that is,
98 cases per group.
Alt-X, Discard to exit G*Power.
That was easy, wasn't it?
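That result can also be reproduced with the noncentral F distribution, using λ = f²·N (a sketch; assumes scipy is available):

```python
from scipy import stats

f, N, k, alpha = 0.163, 33, 3, 0.05
df1, df2 = k - 1, N - k                    # 2 and 30
lam = f**2 * N                             # noncentrality, about .877
F_crit = stats.f.ppf(1 - alpha, df1, df2)  # about 3.32
power = 1 - stats.ncf.cdf(F_crit, df1, df2, lam)
# roughly .11, in line with the .1146 reported above
```
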
Links
Karl Wuensch's Statistics Lessons
Internet Resources for Power Analysis
Karl L. Wuensch
Dept. of Psychology
East Carolina University
Greenville, NC USA
GPower3-ANOVA-Factorial.doc
G*Power: Factorial Independent Samples ANOVA
The analysis is done pretty much the same as it is with a one-way ANOVA.
Suppose we are planning research for which an A x B, 3 x 4 ANOVA would be
appropriate. We want to have enough data to have 80% power for a medium sized
effect. The omnibus analysis will include three F tests: one with 2 df in the numerator,
one with 3, and one with 6 (the interaction). We plan on having sample size constant
across cells.
Boot up G*Power and enter the options shown below:
Remember that Cohen suggested .25 as the value of f for a medium-sized effect.
The numerator df for the main effect of Factor A is (3-1)=2. The number of groups here
is the number of cells in the factorial design, 3 x 4 = 12. When you click Calculate you
see that you need a total N of 158. That works out to 13.2 cases per cell, so bump the
N up to 14(12) = 168.
What about Factor B and the interaction? There are (4-1)=3 df for the main
effect of Factor B, and when you change the numerator df to 3 and click Calculate
again you see that you need an N of 179 to get 80% power for that effect. The
interaction has 2(3)=6 df, and when you change the numerator df to 6 and click
Calculate you see that you need an N of 225 to have 80% power to detect a medium-
sized interaction. With equal sample sizes, that means you need 19 cases per cell, 228
total N.
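The bump-up arithmetic used above (rounding a required total N up to equal cell sizes) is just a ceiling; as a sketch:

```python
import math

def equal_cell_n(total_n, n_cells):
    """Round a required total N up to equal cell sizes."""
    per_cell = math.ceil(total_n / n_cells)
    return per_cell, per_cell * n_cells

# 3 x 4 design, 12 cells:
#   main effect of A: N = 158 -> 14 per cell, 168 total
#   interaction:      N = 225 -> 19 per cell, 228 total
```
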
Clearly you are not going to have the same amount of power for each of the
three effects. If your primary interest was in the main effects, you might go with a total
N that would give you the desired power for main effects but somewhat less than that
for the interaction. If, however, you have reason to expect an interaction, you will go for
the total N of 228. How much power would that give you for the main effects?
As you can see, you would have almost 93% power for A. If you change the
numerator df to 3 you will see that you would have 89.6% power for B.
If you click the Determine button you get a second window which allows you to
select the value of f by specifying a value of η² or partial η². Suppose you want to know
what f is for an effect that explains only 1% of the total variance. You tell G*Power
that the Variance explained by special effect is .01 and Error variance is .99. Click
Calculate and you get an f of .10. Recall that earlier I told you that an f of .10 is
equivalent to an η² of .01.
If you wanted to find f for an effect that accounted for 6% of the variance, you
would enter .06 (effect) and .94 (error) and get an f of .25 (a medium-sized effect).
Wait a minute. I have ignored the fact that the error variance in the factorial
ANOVA will be reduced by an amount equal to variance explained by the other factors
in the model, and that will increase power. Suppose that I have entered Factor B into
the model primarily as a categorical covariate. From past research, I have reason to
believe that Factor B will account for about 14% of the total variance (a large effect,
equivalent to an f of .40). I have no idea whether or not the interaction will explain much
variance, so I play it safe and assume it will explain no variance. When I calculate f I
should enter .06 (effect) and .80 (error: 1, less .06 for A and another .14 for B).
G*Power gives an f of .27, which I would then use in the power analysis for Factor A.
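The f-from-variance-proportions arithmetic used in the last few paragraphs is a one-liner (a sketch; the function name is mine):

```python
import math

def f_from_proportions(effect, error):
    """Cohen's f from proportions of total variance: f = sqrt(effect / error)."""
    return math.sqrt(effect / error)

# .01 / .99 -> f = .10 (small);  .06 / .94 -> f = .25 (medium)
# .06 / .80 -> f = .27 once B's 14% is removed from the error
```
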
The 2 x 2 ANOVA: A Query From Down Under
I work at the Department of Psychology, Macquarie University in Sydney
Australia. We're currently writing up a grant proposal and have to include power
calculations. I have a very simple question about G*Power analysis for a simple
experimental 2x2 ANOVA study. For such a design, I assume that the Numerator df
would be 1 for main effects and interactions. Does this mean that, unlike Cohen's power
analysis, G*Power 3 would give the same power estimate for main effects and the
interaction in a 2x2 ANOVA? I had always assumed that interactions would be more
difficult to detect than main effects, and that seems true for all multi-factorial designs
except a 2x2.
The key is the numerator df, and, as you note, they are all the same (1) in the 2 x
2 design, so your power will be constant across effects. You should, however, consider
what will follow if you have a significant interaction. Likely you will want to test simple
main effects. When planning it is probably best to assume that you might have enough
heterogeneity of variance to warrant using individual error terms rather than a pooled
error term. In that case, the tests of simple main effects are absolutely equivalent to
independent samples t tests, on half (assuming equal sample sizes) of the total data.
For example, if you decide to settle for 80% power for detecting a medium-sized
effect, you will need 128 cases (32 per cell).
F tests - ANOVA: Fixed effects, special, main effects and interactions
Analysis: A priori: Compute required sample size
Input: Effect size f = 0.25
α err prob = 0.05
Power (1 - β err prob) = 0.80
Numerator df = 1
Number of groups = 4
Output: Noncentrality parameter = 8.0000000
Critical F = 3.9175498
Denominator df = 124
Total sample size = 128
Actual power = 0.8013621
If you wish to test the effect of one factor at each level of the other factor, with
individual error terms, still settling for 80% power for a medium-sized effect, then you
will need 128 cases for each simple effect, that is a total of 256 cases (64 per cell).
Analysis: A priori: Compute required sample size
Input: Effect size f = 0.25
α err prob = 0.05
Power (1 - β err prob) = 0.80
Numerator df = 1
Number of groups = 2
Output: Noncentrality parameter = 8.0000000
Critical F = 3.9163246
Denominator df = 126
Total sample size = 128
Actual power = 0.8014596
t tests - Means: Difference between two independent means (two groups)
Analysis: A priori: Compute required sample size
Input: Tail(s) = Two
Effect size d = 0.5
α err prob = 0.05
Power (1 - β err prob) = 0.80
Allocation ratio N2/N1 = 1
Output: Noncentrality parameter = 2.8284271
Critical t = 1.9789706
Df = 126
Sample size group 1 = 64
Sample size group 2 = 64
Total sample size = 128
Actual power = 0.8014596
G*Power: 3-Way Factorial Independent Samples ANOVA
The analysis is done pretty much the same as it is with a two-way ANOVA.
Suppose we are planning research for which an A x B x C, 2 x 2 x 3 ANOVA would be
appropriate. We want to have enough data to have 80% power for a medium sized
effect. The omnibus analysis will include seven F tests: three with one df each (A, B,
and A x B) and four with two df each (C, A x C, B x C, and A x B x C). We plan on
having sample size constant across cells.
For the tests of A, B, and A x B:
F tests - ANOVA: Fixed effects, special, main effects and interactions
Analysis: A priori: Compute required sample size
Input: Effect size f = 0.25
α err prob = 0.05
Power (1 - β err prob) = .80
Numerator df = 1
Number of groups = 12
Output: Noncentrality parameter = 8.0000000
Critical F = 3.9228794
Denominator df = 116
Total sample size = 128
Actual power = 0.8009381
Remember that Cohen suggested .25 as the value of f for a medium-sized effect.
The number of groups here is the number of cells in the factorial design, 2 x 2 x 3 =
12. When you click Calculate you see that you need a total N of 128. That works out
to 10.67 cases per cell, so bump the N up to 11(12) = 132.
For the effects with 2 df:
F tests - ANOVA: Fixed effects, special, main effects and interactions
Analysis: A priori: Compute required sample size
Input: Effect size f = 0.25
α err prob = 0.05
Power (1 - β err prob) = .80
Numerator df = 2
Number of groups = 12
Output: Noncentrality parameter = 9.8750000
Critical F = 3.0580504
Denominator df = 146
Total sample size = 158
Actual power = 0.8016972
That works out to 13.2 cases per cell, so bump the N up to 14(12) = 168.
Suppose that you anticipate obtaining a significant triple interaction and following
that with analysis of the A x B simple interactions at each level of C. Playing it
conservative by using individual error terms, you will then need, at each level of C:
F tests - ANOVA: Fixed effects, special, main effects and interactions
Analysis: A priori: Compute required sample size
Input: Effect size f = 0.25
α err prob = 0.05
Power (1 - β err prob) = .80
Numerator df = 1
Number of groups = 4
Output: Noncentrality parameter = 8.0000000
Critical F = 3.9175498
Denominator df = 124
Total sample size = 128
Actual power = 0.8013621
That is 128/4 = 32 cases for each A x B cell. Since there are three levels of C,
the total sample size needed is now 3 x 128 = 384.
Suppose the A x B interaction were to be significant at one or more of the levels of
C. You likely would then test the simple, simple main effects of A at each level of B (or
vice versa). For each such comparison (which would involve only two cells):
F tests - ANOVA: Fixed effects, special, main effects and interactions
Analysis: A priori: Compute required sample size
Input: Effect size f = 0.25
α err prob = 0.05
Power (1 - β err prob) = .80
Numerator df = 1
Number of groups = 2
Output: Noncentrality parameter = 8.0000000
Critical F = 3.9163246
Denominator df = 126
Total sample size = 128
Actual power = 0.8014596
You need 128 scores, 64 per cell. Since we have a total of 12 cells, that works
out to 768 cases. You might end up deciding that you can get by with having less
power for detecting simple effects than for detecting effects in the omnibus analysis.
Suppose you ended up with 20 scores per cell, total N = 20(12) = 240. How
much power would you have for detecting medium-sized effects in the omnibus
analysis?
For the one df effects:
F tests - ANOVA: Fixed effects, special, main effects and interactions
Analysis: Post hoc: Compute achieved power
Input: Effect size f = 0.25
α err prob = 0.05
Total sample size = 240
Numerator df = 1
Number of groups = 12
Output: Noncentrality parameter = 15.0000000
Critical F = 3.8825676
Denominator df = 228
Power (1 - β err prob) = 0.9710633
For the two df effects:
F tests - ANOVA: Fixed effects, special, main effects and interactions
Analysis: Post hoc: Compute achieved power
Input: Effect size f = 0.25
α err prob = 0.05
Total sample size = 240
Numerator df = 2
Number of groups = 12
Output: Noncentrality parameter = 15.0000000
Critical F = 3.0354408
Denominator df = 228
Power (1 - β err prob) = 0.9411531
How much power would you have if you got down to the level of comparing one
cell with one other cell:
F tests - ANOVA: Fixed effects, special, main effects and interactions
Analysis: Post hoc: Compute achieved power
Input: Effect size f = 0.25
α err prob = 0.05
Total sample size = 40
Numerator df = 1
Number of groups = 2
Output: Noncentrality parameter = 2.5000000
Critical F = 4.0981717
Denominator df = 38
Power (1 - β err prob) = 0.3379390
Power-RM-ANOVA.doc
Power Analysis for One-Way Repeated Measures ANOVA
Univariate Approach
Colleague Caren Jordan was working on a proposal and wanted to know how
much power she would have if she were able to obtain 64 subjects. The proposed
design was a three group repeated measures ANOVA. I used G*Power to obtain the
answer for her. Refer to the online instructions, Other F-Tests, Repeated Measures,
Univariate approach. We shall use n = 64, m = 3 (number of levels of the repeated
factor), numerator df = 2 (m - 1), denominator df = 128 (n times m - 1), f² = .01 (a small
effect, the within-group ratio of effect variance to error variance), and ρ = .79 (the
correlation between scores at any one level of the repeated factor and scores at any
other level of the repeated factor). Her estimate of ρ was based on the test-retest
reliability of the instrument employed.
I have used Cohen's (1992, A power primer, Psychological Bulletin, 112,
155-159) guidelines for f², which are .01 = small, .0625 = medium, and .16 = large.
The noncentrality parameter is λ = n m f² / (1 - ρ), but G*Power is set up for us to
enter as Effect size f² the quantity m f² / (1 - ρ) = 3(.01)/(1 - .79) = .143.
Boot up G*Power. Click Tests, Other F-Tests. Enter Effect size f² = 0.143,
Alpha = 0.05, N = 64, Numerator df = 2, and Denominator df = 128. Click Calculate.
G*Power shows that power = .7677.
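The quantity entered as Effect size f² is a one-liner to compute (a sketch; the function name is mine):

```python
def rm_f2_entry(m, f2, rho):
    """Effect size f^2 to enter in the old G*Power repeated-measures
    (univariate approach) routine: m * f^2 / (1 - rho)."""
    return m * f2 / (1 - rho)

# m = 3 levels, f^2 = .01 (small), rho = .79 -> about .143
```
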
Note that I have used, as my estimate of ρ, the mean of the three values observed
by Sheri. This may, or may not, be reasonable. Uncorrected, her numerator df = 2 and
her denominator df = 72. Corrected with epsilon, her Effect size f² = .0275, numerator
df = 1, and denominator df = 36. I enter these values into G*Power and obtain
power = .1625. Sheri needs more data, or needs to hope for a larger effect size. If she
assumes a medium-sized effect, then the epsilon-corrected Effect size
f² = .5(3)(.0625)/(1 - .45) = .17 and power jumps to .67.
The big problem here is the small value of ρ in Sheri's data; she is going to
need more data to get good power. With typical repeated measures data, ρ is larger,
and we can do well with relatively small sample sizes.
Multivariate Approach
The multivariate approach analysis does not require sphericity, and, when
sphericity is lacking, is usually more powerful than is the univariate analysis with
Greenhouse-Geisser or Huynh-Feldt correction. Refer to the G*Power online
instructions, Other F-Tests, Repeated Measures, Multivariate approach.
Since there are three groups, the numerator df = 2. The denominator df = n - p + 1,
where n is the number of cases and p is the number of dependent variables in the
MANOVA (one less than the number of levels of the repeated factor). For Sheri,
denominator df = 36 - 2 + 1 = 35.
For a small effect size, we need Effect size f² = 3(.01)/(1 - .45) = .055. As you
can see, G*Power tells me power is .2083, a little better than it was with the univariate
test corrected for lack of sphericity.
So, how many cases would Sheri need to raise her power to .80? This G*Power
routine will not solve for N directly, so you need to guess until you get it right. On each
guess you need to change the input values of N and denominator df. After a few
guesses, I found that Sheri needs 178 cases to get 80% power to detect a small effect.
A Simpler Approach
Ultimately, in most cases, one's primary interest is going to be focused on
comparisons between pairs of means. Why not just find the number of cases necessary
to have the desired power for those comparisons? With repeated measures designs I
generally avoid using a pooled error term for such comparisons. In other words, I use
simple correlated t tests for each such comparison. How many cases would Sheri
need to have an 80% chance of detecting a small effect, d = .2?
First we adjust the value of d to take into account the value of ρ. I shall use her
weakest link, the correlation of .27: d_Diff = d/√(2(1 - ρ₁₂)) = .2/√(2(1 - .27)) = .166.
Notice that the value of d went down after adjustment. Usually ρ will exceed .5 and the
adjusted d will be greater than the unadjusted d.
The approximate sample size needed is n = (δ/d_Diff)² = (2.8/.166)² = 285. I
checked this with G*Power. Click Tests, Other T Tests. For Effect size f enter .166.
N = 285 and df = n - 1 = 284. Select two-tailed. Click Calculate. G*Power confirms
that power = 80%.
When N is small, G*Power will show that you need a larger N than indicated by
the approximation. Just feed values of N and df to G*Power until you find the N that
gives you the desired power.
Copyright 2005, Karl L. Wuensch - All rights reserved.
Power Analysis for an ANCOV
If you add one or more covariates to your ANOVA model, and they are well
correlated with the outcome variable, then the error term will be reduced and power will
be increased. The effect of the addition of covariates can be incorporated into the
power analysis in this way:
Adjusting the effect size statistic, f, such that the adjusted f = f/√(1 - r²), where r
is the correlation between the covariate (or set of covariates) and the outcome
variable.
Reducing the error df by one for each covariate added to the model.
Consider this example. I am using an ANOVA design to compare three
experimental groups. I want to know how many cases I need to detect a small effect
(f = .1). G*Power tells me I need 1,548 cases. Ouch, that is a lot of data.
Suppose I find a covariate that I can measure prior to manipulating the
experimental variable and which is known to be correlated .7 with the dependent
variable. The adjusted f for a small effect increases to f = .1/√(1 - .49) = .14.
Now I only need 792 cases. Do note that the error df here should be 788, not
789, but that one df is not going to make much difference, as shown below.
I used the Generic F Test routine with the noncentrality parameter from the
earlier run, and I dropped the denominator df to 788. The value of the critical F
increased ever so slightly, but the power did not change at all to six decimal places.
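The adjustment itself is tiny to code (a sketch; the derivation of f/√(1 - r²) appears later in this document):

```python
import math

def f_adjusted(f, r):
    """Adjusted Cohen's f when a covariate correlates r with the DV."""
    return f / math.sqrt(1 - r**2)

# f = .10, r = .7 -> adjusted f of about .14
```
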
My Earlier Discussion of this Topic
Clinical student Natalie Cross wants to conduct a 2 x 2 x 2 ANCOV with a single
covariate. How many subjects does she need to have 80% power with alpha set at .05
if the effect is medium in size?
I shall estimate N for tests of main effects only, not interactions or simple effects.
First, I assume that no covariate is employed. Cohen's f (effect size) is
computed as f = √[ Σ (μ_j - μ)² / k ] / σ_e. Cohen's guidelines suppose that a medium-sized
difference between two groups is one that equals one half the size of the within-group
(error) standard deviation, such as when μ₁ = 10, μ₂ = 12, μ = 11, and σ = 4. This
corresponds to a value of f equal to √[ (1² + 1²) / 2 ] / 4 = 0.25, which is exactly the
value of f which Cohen has suggested corresponds to a medium-sized effect in ANOVA.
Now, how many subjects would we need? φ = f√n, where n is the number of
scores in each group. From Appendix ncF in Howell (Statistical methods for
psychology (5th ed.), 2002, Belmont, CA: Wadsworth), we need a φ of about 2, so
n = (φ/f)² = (2/.25)² = 64. This matches exactly the amount indicated in Table 2 of
Cohen's "A Power Primer" (Psychological Bulletin, 1992, 112, 155-159).
Of course, when we factor in the effect of the covariate, we will have more power
for a fixed sample size, because the error variance (the variance of the dependent
variable scores after being adjusted for the covariate) will be reduced. The larger the
correlation between the covariate and the dependent variable (or, with multiple
covariates, the multiple R between covariates and the dependent variable), the greater
the reduction of error variance. We can estimate the error variance of the adjusted
scores this way: σ_Yadj = σ_Y √(1 - r²) (see http://www.psycho.uni-duesseldorf.de/aap/projects/gpower/reference/reference_manual_07.html#t4). If we
assume that the correlation between the covariate and the dependent variable is .5,
then σ_Yadj = 4 √(1 - .25) = 3.464.
Next we adjust the value of f to take into account the reduction in error variance
due to employing the covariate. Our adjusted f is computed as √[ (1² + 1²) / 2 ] / 3.464 = 0.29.
Our required sample size is now computed as n = (2/.29)² = 48, and, equivalently via the
t-test approach (with the adjusted d = .5/√(1 - .25) = .577), n = 2δ²/d² = 2(2.80)²/(.577)² = 47.
Natalie checked the database and found that the correlation between pre and
post test data was about .7. Using this value of r, our adjusted error standard deviation
is computed as σ_Yadj = 4 √(1 - .49) = 2.857, our adjusted f as √[ (1² + 1²) / 2 ] / 2.857 = 0.35,
and our required sample size as n = (2/.35)² = 33.
How Did Karl Get That Formula for the Adjusted f?
I assume that the covariate is not related to the ANOVA factor(s), but is related to
the part of Y that is not related to the factors (that is, to the error variance).
f² = [ Σ (μ_j - μ)² / k ] / σ²_e, and the adjusted error variance is σ²_e,adj = σ²_e (1 - r²). Substituting
the adjusted error variance in the denominator, f²_adj = [ Σ (μ_j - μ)² / k ] / [ σ²_e (1 - r²) ] = f² / (1 - r²).
Accordingly, f_adj = f/√(1 - r²).
When asked to provide a reference for this adjusted f, I was at a loss, since I had
never seen it before I derived it myself. Thanks to Google, however, I have now found
the same derivation in Rogers, W. T., & Hopkins, K. D. (1988). Power estimates in the
presence of a covariate and measurement error. Educational and Psychological
Measurement, 48, 647-656. doi: 10.1177/0013164488483008
Return to Wuensch's Stats Lessons Page
Karl L. Wuensch
Department of Psychology
East Carolina University
Greenville, NC 27858 USA
24 October 2009
Power Analysis For Correlation and Regression Models
R2.exe Correlation Model
The free R2 program, from James H. Steiger and Rachel T. Fouladi, can be used to do power analysis for testing the null hypothesis that R² (bivariate or multiple) is zero in the population of interest. You can download the program and the manual here.
Unzip the files and put them in the directory/folder R2. Navigate to the R2 directory and
run (double click) the file R2.EXE. A window will appear with R2 in white on a black
background. Hit any key to continue.
Bad News: R2 will not run on Windows 7 Home Premium, which does not
support DOS. It ran on XP just fine. It might run on Windows 7 Pro.
Good News: You can get a free DOS emulator, and R2 works just fine within
the virtual DOS machine it creates. See my document DOSBox.
Consider the research published in the article: Patel, S., Long, T. E.,
McCammon, S. L., & Wuensch, K. L. (1995). Personality and emotional correlates of
self-reported antigay behaviors. Journal of Interpersonal Violence, 10, 354-366. We
had data from 80 respondents. We wished to predict self-reported antigay behaviors
from five predictor variables. Suppose we wanted to know how much power we would
have if the population
2
was .13 (a medium sized effect according to Cohen).
Enter the letter O to get the Options drop down menu. Enter the letter P for
Power Analysis. Enter the letter N to bring up the sample size data entry window.
Enter 80 and hit the enter key. Enter the letter K to bring up the number of variables data entry window. Enter 6 and hit the enter key. Enter the letter R to bring up the ρ² data entry window. Enter .13 and hit the enter key. Enter the letter A to bring up the alpha entry window. Enter .05 and hit the enter key. The window should now look like this:
Enter G to begin computing. Hit any key to display the results.
Try substituting .02 (a small effect) for ρ² and you will see power shrink to 13%.
So, how many subjects would we need to have an 80% chance of rejecting the null hypothesis if the effect were small and we used the usual .05 criterion of statistical significance? Hit the O key to get the options and then the S key to initiate sample size calculation. K = 6, A = .05, R = .02, P = .8.
Enter G to begin computing. Hit any key to display the results.
G*Power Regression Model
The R2 program is designed for correlation analysis (all variables are random), not regression analysis (Y is random but the predictors are fixed). Under most circumstances you will get similar results from R2 and G*Power. For example, suppose I ask how much power I would have for a large effect, alpha = .05, n = 5, one predictor.
Using G*Power, correlation, point biserial
t tests - Correlation: Point biserial model
Analysis: Post hoc: Compute achieved power
Input: Tail(s) = Two
Effect size |r| = 0.5
α err prob = 0.05
Total sample size = 5
Output: Noncentrality parameter δ = 1.290994
Critical t = 3.182446
Df = 3
Power (1-β err prob) = 0.151938
Equivalently, using G*Power, Multiple regression, omnibus R²:
F tests - Multiple Regression: Omnibus (R² deviation from zero)
Analysis: Post hoc: Compute achieved power
Input: Effect size f² = 0.3333333
α err prob = 0.05
Total sample size = 5
Number of predictors = 1
Output: Noncentrality parameter λ = 1.666667
Critical F = 10.127964
Numerator df = 1
Denominator df = 3
Power (1-β err prob) = 0.151938
Here G*Power uses Cohen's f² effect size statistic, which is R²/(1 − R²). For a ρ of .5, that is .25/.75 = .3333333.
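The power value reported above can be reproduced with scipy's noncentral F distribution. This is a sketch; G*Power's convention of λ = f²·N for this test is assumed, and the function name is mine.

```python
from scipy.stats import f as f_dist, ncf

def regression_power(f2, n, k, alpha=0.05):
    """Power for the omnibus test that R-squared = 0, with k predictors."""
    df1, df2 = k, n - k - 1
    crit = f_dist.ppf(1 - alpha, df1, df2)   # critical F
    return ncf.sf(crit, df1, df2, f2 * n)    # noncentrality lambda = f2 * N

print(round(regression_power(0.25 / 0.75, 5, 1), 4))  # about .1519, as G*Power reports
```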
For a correlation model, the R2 program produces the following result
Return to Wuensch's Statistics Lessons Page
Karl L. Wuensch, Dept. of Psychology, East Carolina University, October, 2011.
G*Power for Change In R² in Multiple Linear Regression
Graduate student Ruchi Patel asked me how to determine how many cases
would be needed to achieve 80% power for detecting the interaction between two
predictors in a multiple linear regression. The interaction term is simply treated as
another predictor. I assumed that she wanted enough data to have 80% power and that
there were only three predictors, X1, X2, and their interaction. Here is the analysis:
Equivalently,
The method immediately above could also be used to determine the number of cases needed to have the desired probability of detecting the increase in R² that accompanies adding to the model a block of two or more predictors.
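For that kind of analysis, G*Power's effect size statistic is f² = (R²_full − R²_reduced)/(1 − R²_full). A quick sketch, with made-up R² values for illustration:

```python
def f2_increase(r2_full, r2_reduced):
    """Cohen's f-squared for the increase in R-squared from an added block."""
    return (r2_full - r2_reduced) / (1 - r2_full)

# Hypothetical: adding a block of predictors raises R-squared from .15 to .25
print(round(f2_increase(0.25, 0.15), 3))  # 0.133
```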
Return to Wuensch's Statistics Lessons Page
PowerAnalysis_Overview.doc
An Overview of Power Analysis
Power is the conditional probability that one will reject the null hypothesis given
that the null hypothesis is really false by a specified amount and given certain other
specifications, such as sample size and criterion of statistical significance (alpha). I
shall introduce power analysis in the context of a one sample test of the mean. After
that I shall move on to statistics more commonly employed.
There are several different sorts of power analyses; see Faul, Erdfelder, Lang, & Buchner (Behavior Research Methods, 2007, 39, 175-191) for descriptions of five types that can be computed using G*Power 3. I shall focus on a priori and a posteriori power analysis.
A Priori Power Analysis. This is an important part of planning research. You determine how many cases you will need to have a good chance of detecting an effect of a specified size with the desired amount of power. See my document Estimating the Sample Size Necessary to Have Enough Power for the number of cases required to have 80% power for common designs.
A Posteriori Power Analysis. Also known as post hoc power analysis. Here you find how much power you would have if you had a specified number of cases. It is a posteriori only in the sense that you provide the number of cases, as if you had already conducted the research. Like a priori power analysis, it is best used in the planning of research; for example, I am planning on obtaining data on 100 cases, and I want to know whether or not that would give me adequate power.
Retrospective Power Analysis. Also known as observed power. There are
several types, but basically this involves answering the following question: If I were to
repeat this research, using the same methods and the same number of cases, and if the
size of the effect in the population was exactly the same as it was in the present
sample, what would be the probability that I would obtain significant results? Many
have demonstrated that this question is foolish, that the answer tells us nothing of value,
and that it has led to much mischief. See this discussion from Edstat-L. I also
recommend that you read Hoenig and Heisey (The American Statistician, 2001, 55, 19-
24). A few key points:
- Some stat packs (SPSS) give you observed power even though it is useless.
- Observed power is perfectly correlated with the value of p; that is, it provides absolutely no new information that you did not already have.
- It is useless to conduct a power analysis AFTER the research has been completed. What you should be doing is calculating confidence intervals for effect sizes.
One Sample Test of Mean
Imagine that we are evaluating the effect of a putative memory-enhancing drug. We have randomly sampled 25 people from a population known to be normally distributed with a μ of 100 and a σ of 15. We administer the drug, wait a reasonable time for it to take effect, and then test our subjects' IQ. Assume that we were so confident in our belief that the drug would either increase IQ or have no effect that we entertained directional hypotheses. Our null hypothesis is that after administering the drug μ ≤ 100; our alternative hypothesis is μ > 100.
These hypotheses must first be converted to exact hypotheses. Converting the null is easy: it becomes μ = 100. The alternative is more troublesome. If we knew that the effect of the drug was to increase IQ by 15 points, our exact alternative hypothesis would be μ = 115, and we could compute power, the probability of correctly rejecting the false null hypothesis given that μ is really equal to 115 after drug treatment, not 100 (normal IQ). But if we already knew how large the effect of the drug was, we would not need to do inferential statistics.
One solution is to decide on a minimum nontrivial effect size. What is the smallest effect that you would consider to be nontrivial? Suppose that you decide that if the drug increases IQ by 2 or more points, then that is a nontrivial effect, but if the mean increase is less than 2 then the effect is trivial.
Now we can test the null of μ = 100 versus the alternative of μ = 102. Let the left curve represent the distribution of sample means if the null hypothesis were true, μ = 100. This sampling distribution has a μ = 100 and a σ_M = 15/√25 = 3. Let the right curve represent the sampling distribution if the exact alternative hypothesis is true, μ = 102. Its μ is 102 and, assuming the drug has no effect on the variance in IQ scores, its σ_M = 15/√25 = 3.
The red area in the upper tail of the null distribution is α. Assume we are using a one-tailed α of .05. How large would a sample mean need be for us to reject the null? Since the upper 5% of a normal distribution extends from 1.645σ above the μ up to positive infinity, the sample mean IQ would need be 100 + 1.645(3) = 104.935 or more to reject the null. What are the chances of getting a sample mean of 104.935 or more if the alternative hypothesis is correct, if the drug increases IQ by 2 points? The area under the alternative curve from 104.935 up to positive infinity represents that probability, which is power. Assuming the alternative hypothesis is true, that μ = 102, the probability of rejecting the null hypothesis is the probability of getting a sample mean of 104.935 or more in a normal distribution with μ = 102, σ = 3. Z = (104.935 − 102)/3 = 0.98, and P(Z > 0.98) = .1635. That is, power is about 16%. If the drug really does increase IQ by an average of 2 points, we have a 16% chance of rejecting the null. If its effect is even larger, we have a greater than 16% chance.
Suppose we consider 5 the minimum nontrivial effect size. This will separate the null and alternative distributions more, decreasing their overlap and increasing power. Now, Z = (104.935 − 105)/3 = −0.02, P(Z > −0.02) = .5080, or about 51%. It is easier to detect large effects than small effects.
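The two one-tailed calculations above can be reproduced with scipy. A minimal sketch (the function name is mine):

```python
from scipy.stats import norm

def one_tailed_power(mu_null, mu_alt, sigma, n, alpha=0.05):
    """Power of a one-tailed, one-sample z test."""
    se = sigma / n ** 0.5
    crit_mean = mu_null + norm.ppf(1 - alpha) * se  # sample mean needed to reject
    return norm.sf((crit_mean - mu_alt) / se)

print(round(one_tailed_power(100, 102, 15, 25), 3))  # about .16
print(round(one_tailed_power(100, 105, 15, 25), 3))  # about .51
```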
Suppose we conduct a 2-tailed test, since the drug could actually decrease IQ; α is now split into both tails of the null distribution, .025 in each tail. We shall reject the null if the sample mean is 1.96 or more standard errors away from the μ of the null distribution. That is, if the mean is 100 + 1.96(3) = 105.88 or more (or if it is 100 − 1.96(3) = 94.12 or less) we reject the null. The probability of that happening if the alternative is correct (μ = 105) is: Z = (105.88 − 105)/3 = .29, P(Z > .29) = .3859, power = about 39%. We can ignore P(Z < (94.12 − 105)/3) = P(Z < −3.63) = very, very small. Note that our power is less than it was with a one-tailed test. If you can correctly predict the direction of effect, a one-tailed test is more powerful than a two-tailed test.
Consider what would happen if you increased sample size to 100. Now the σ_M = 15/√100 = 1.5. With the null and alternative distributions less plump, they should overlap less, increasing power. With σ_M = 1.5, the sample mean will need be 100 + (1.96)(1.5) = 102.94 or more to reject the null. If the drug increases IQ by 5 points, power is: Z = (102.94 − 105)/1.5 = −1.37, P(Z > −1.37) = .9147, or between 91 and 92%. Anything that decreases the standard error will increase power. This may be achieved by increasing the sample size or by reducing the σ of the dependent variable. The σ of the criterion variable may be reduced by reducing the influence of extraneous variables upon the criterion variable (eliminating noise in the criterion variable makes it easier to detect the signal, the grouping variable's effect on the criterion variable).
Now consider what happens if you change α. Let us reduce α to .01. Now the sample mean must be 2.58 or more standard errors from the null μ before we reject the null. That is, 100 + 2.58(1.5) = 103.87. Under the alternative, Z = (103.87 − 105)/1.5 = −0.75, P(Z > −0.75) = 0.7734, or about 77%, less than it was with α at .05, ceteris paribus. Reducing α reduces power.
Please note that all of the above analyses have assumed that we have used a normally distributed test statistic, Z = (M − μ)/σ_M.

d_Diff = d/√(2(1 − r)), where d is the effect size as computed above, with independent samples.
Please note that using the standard error of the difference scores, rather than the standard deviation of the criterion variable, as the denominator of d_Diff is simply a means of incorporating into the analysis the effect of the correlation produced by matching. If we were computing estimated d (Hedges' g) as an estimate of the standardized effect size given the obtained results, we would use the standard deviation of the criterion variable in the denominator, not the standard deviation of the difference scores. I should admit that on rare occasions I have argued that, in a particular research context, it made more sense to use the standard deviation of the difference scores in the denominator of g.
Consider the following a priori power analysis. I am testing the effect of a new drug on performance on a task that involves solving anagrams. I want to have enough power to be able to detect an effect as small as 1/5 of a standard deviation (d = .2) with 95% power. I consider Type I and Type II errors equally serious and am employing a .05 criterion of statistical significance, so I want beta to be not more than .05. I shall use a correlated samples design (within subjects) and two conditions (tested under the influence of the drug and not under the influence of the drug). In previous research I have found the correlation between conditions to be approximately .8.
d_Diff = d/√(2(1 − r)) = .2/√(2(1 − .8)) = .3162.
Use the following settings:
Statistical test: Means: Difference between two dependent means (matched pairs)
Type of power analysis: A priori: Compute required sample size given α, power, and effect size
Tail(s): Two
Effect size dz: .3162
α err prob: 0.05
Power (1-β err prob): .95
Click Calculate. You will find that you need 132 pairs of scores.
Output: Noncentrality parameter δ = 3.632861
Critical t = 1.978239
Df = 131
Total sample size = 132
Actual power = 0.950132
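The same answer can be obtained from the noncentral t distribution in scipy. A sketch; the search-upward-from-small-n approach and the function name are mine.

```python
from scipy.stats import nct, t as t_dist

def pairs_needed(dz, alpha=0.05, power=0.95):
    """Smallest number of pairs for a two-tailed matched-pairs t test."""
    n = 2
    while True:
        df, nc = n - 1, dz * n ** 0.5             # noncentrality = dz * sqrt(n)
        tcrit = t_dist.ppf(1 - alpha / 2, df)
        achieved = nct.sf(tcrit, df, nc) + nct.cdf(-tcrit, df, nc)
        if achieved >= power:
            return n, achieved
        n += 1

n, achieved = pairs_needed(0.3162)
print(n, round(achieved, 4))  # 132 pairs, actual power about .9501
```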
Consider the following a posteriori power analysis. We assume that GRE Verbal
and GRE Quantitative scores are measured on the same metric, and we wish to
determine whether persons intending to major in experimental or developmental
psychology are equally skilled in things verbal and things quantitative. If we employ a
.05 criterion of significance, and if the true size of the effect is 20 GRE points (that was
the actual population difference the last time I checked it, with quantitative > verbal),
what are our chances of obtaining significant results if we have data on 400 persons?
We shall assume that the correlation between verbal and quantitative GRE is .60 (that is
what it was for social science majors the last time I checked). We need to know what
the standard deviation is for the dependent variable, GRE score. The last time I
checked, it was 108 for verbal, 114 for quantitative.
Change type of power analysis to Post hoc. Set the total sample size to 400.
Click on Determine. Select from group parameters. Set the group means to 0 and
20 (or any other two means that differ by 20), the standard deviations to 108 and 114,
and the correlation between groups to .6. Click Calculate in this window to obtain the
effect size dz, .2100539.
Click Calculate and transfer to main window to move the effect size dz to the
main window. Click Calculate in the main window to compute the power. You will see
that you have 98% power.
Type III Errors and Three-Choice Tests
Leventhal and Huynh (Psychological Methods, 1996, 1, 278-292) note that it is
common practice, following rejection of a nondirectional null, to conclude that the
direction of difference in the population is the same as what it is in the sample. This
procedure is what they call a "directional two-tailed test." They also refer to it as a
"three-choice test" (I prefer that language), in that the three hypotheses entertained are:
parameter = null value, parameter < null value, and parameter > null value. This makes
possible a Type III error: correctly rejecting the null hypothesis, but incorrectly inferring
the direction of the effect - for example, when the population value of the tested
parameter is actually more than the null value, getting a sample value that is so much
below the null value that you reject the null and conclude that the population value is
also below the null value. The authors show how to conduct a power analysis that
corrects for the possibility of making a Type III error. See my summary at:
http://core.ecu.edu/psyc/wuenschk/StatHelp/Type_III.htm
Bivariate Correlation/Regression Analysis
Consider the following a priori power analysis. We wish to determine whether or
not there is a correlation between misanthropy and support for animal rights. We shall
measure these attributes with instruments that produce scores for which it is reasonable
to treat the variables as continuous. How many respondents would we need to have a
95% probability of obtaining significant results if we employed a .05 criterion of
significance and if the true value of the correlation (in the population) was 0.2?
Select the following options:
Test family: t tests
Statistical test: Correlation: Point biserial model (that is, a regression analysis)
Type of power analysis: A priori: Compute required sample size given α, power, and effect size
Tail(s): Two
Effect size |r|: .2
α err prob: 0.05
Power (1-β err prob): .95
Click Calculate and you will see that you need 314 cases.
t tests - Correlation: Point biserial model
Analysis: A priori: Compute required sample size
Input: Tail(s) = Two
Effect size |r| = .2
α err prob = 0.05
Power (1-β err prob) = 0.95
Output: Noncentrality parameter δ = 3.617089
Critical t = 1.967596
Df = 312
Total sample size = 314
Actual power = 0.950115
Check out Steiger and Fouladi's R2 program, which will do power analysis (and more) for correlation models, including multiple correlation.
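G*Power's figure of 314 cases can be reproduced with the noncentral t distribution. A sketch; G*Power's noncentrality δ = r√n/√(1 − r²) for this model is assumed, and the function name is mine.

```python
from scipy.stats import nct, t as t_dist

def n_for_correlation(r, alpha=0.05, power=0.95):
    """Smallest n for a two-tailed test of rho = 0 (noncentral t method)."""
    n = 4
    while True:
        df = n - 2
        tcrit = t_dist.ppf(1 - alpha / 2, df)
        nc = r * n ** 0.5 / (1 - r ** 2) ** 0.5   # G*Power's delta
        achieved = nct.sf(tcrit, df, nc) + nct.cdf(-tcrit, df, nc)
        if achieved >= power:
            return n
        n += 1

print(n_for_correlation(0.2))  # 314, as G*Power reports
```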
One-Way ANOVA, Independent Samples
The effect size may be specified in terms of f: f = √[Σ(μ_j − μ)²/k]/σ_error. Cohen considered an f of .10 to represent a small effect, .25 a medium effect, and .40 a large effect. In terms of percentage of variance explained (η²), small is 1%, medium is 6%, and large is 14%.
Suppose that I wish to test the null hypothesis that for GRE-Q, the population means for undergraduates intending to major in social psychology, clinical psychology, and experimental psychology are all equal. I decide that the minimum nontrivial effect size is if each mean differs from the next by 20 points (about 1/5 σ). For example, means of 480, 500, and 520. The sum of the squared deviations between group means and grand mean is then 20² + 0² + 20² = 800. Next we compute f. Assuming that the σ is about 100, f = √[(800/3)/10000] = 0.163. Suppose we have 11 cases in each group.
OK, how many subjects would you need to raise power to 70%? Under Analysis,
select A Priori, under Power enter .70, and click Calculate.
G*Power advises that you need 294 cases, evenly split into three groups, that is,
98 cases per group.
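Both the f of .163 and the 294-case figure can be checked against scipy's noncentral F distribution. A sketch; λ = f²·N (G*Power's convention) is assumed.

```python
from math import sqrt
from scipy.stats import f as f_dist, ncf

means, sigma, k = [480, 500, 520], 100.0, 3   # from the example
grand = sum(means) / k
f_eff = sqrt(sum((m - grand) ** 2 for m in means) / k) / sigma

def anova_power(f_eff, k, n_total, alpha=0.05):
    """Power of the one-way independent-samples ANOVA F test."""
    df1, df2 = k - 1, n_total - k
    crit = f_dist.ppf(1 - alpha, df1, df2)
    return ncf.sf(crit, df1, df2, f_eff ** 2 * n_total)

print(round(f_eff, 3))                        # 0.163
print(round(anova_power(f_eff, 3, 294), 2))   # about .70, as G*Power advises
```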
Analysis of Covariance
If you add one or more covariates to your ANOVA model, and they are well
correlated with the outcome variable, then the error term will be reduced and power will
be increased. The effect of the addition of covariates can be incorporated into the
power analysis in this way:
- Adjusting the effect size statistic, f, such that the adjusted f is f_adj = f/√(1 − r²), where r is the correlation between the covariate (or set of covariates) and the outcome variable.
- Reducing the error df by one for each covariate added to the model.
Consider this example. I am using an ANOVA design to compare three experimental groups. I want to know how many cases I need to detect a small effect (f = .1). G*Power tells me I need 1,548 cases. Ouch, that is a lot of data.
Suppose I find a covariate that I can measure prior to manipulating the experimental variable and which is known to be correlated .7 with the dependent variable. The adjusted f for a small effect increases to f = .1/√(1 − .49) = .14.
Now I only need 792 cases. Do note that the error df here should be 788, not
789, but that one df is not going to make much difference, as shown below.
I used the Generic F Test routine with the noncentrality parameter from the
earlier run, and I dropped the denominator df to 788. The value of the critical F
increased ever so slightly, but the power did not change at all to six decimal places.
Factorial ANOVA, Independent Samples
The analysis is done pretty much the same as it is with a one-way ANOVA.
Suppose we are planning research for which an A x B, 3 x 4 ANOVA would be
appropriate. We want to have enough data to have 80% power for a medium sized
effect. The omnibus analysis will include three F tests one with 2 df in the numerator,
one with 3, and one with 6 (the interaction). We plan on having sample size constant
across cells.
Boot up G*Power and enter the options shown below:
Remember that Cohen suggested .25 as the value of f for a medium-sized effect.
The numerator df for the main effect of Factor A is (3-1)=2. The number of groups here
is the number of cells in the factorial design, 3 x 4 = 12. When you click Calculate you
see that you need a total N of 158. That works out to 13.2 cases per cell, so bump the
N up to 14(12) = 168.
What about Factor B and the interaction? There are (4-1)=3 df for the main effect of Factor B, and when you change the numerator df to 3 and click Calculate again you see that you need an N of 179 to get 80% power for that effect. The
interaction has 2(3)=6 df, and when you change the numerator df to 6 and click
Calculate you see that you need an N of 225 to have 80% power to detect a medium-
sized interaction. With equal sample sizes, that means you need 19 cases per cell, 228
total N.
Clearly you are not going to have the same amount of power for each of the
three effects. If your primary interest was in the main effects, you might go with a total
N that would give you the desired power for main effect but somewhat less than that for
the interaction. If, however, you have reason to expect an interaction, you will go for the
total N of 228. How much power would that give you for the main effects?
As you can see, you would have almost 93% power for A. If you change the
numerator df to 3 you will see that you would have 89.6% power for B.
If you click the Determine button you get a second window which allows you to select the value of f by specifying a value of η² or partial η². Suppose you want to know what f is for an effect that explains only 1% of the total variance. You tell G*Power that the Variance explained by special effect is .01 and the Error variance is .99. Click Calculate and you get an f of .10. Recall that earlier I told you that an f of .10 is equivalent to an η² of .01.
If you wanted to find f for an effect that accounted for 6% of the variance, you
would enter .06 (effect) and .94 (error) and get an f of .25 (a medium-sized effect).
Wait a minute. I have ignored the fact that the error variance in the factorial
ANOVA will be reduced by an amount equal to variance explained by the other factors
in the model, and that will increase power. Suppose that I have entered Factor B into
the model primarily as a categorical covariate. From past research, I have reason to
believe that Factor B will account for about 14% of the total variance (a large effect,
equivalent to an f of .40). I have no idea whether or not the interaction will explain much
variance, so I play it safe and assume it will explain no variance. When I calculate f I
should enter .06 (effect) and .80 (error 1 less .06 for A and another .14 for B).
G*Power gives an f of .27, which I would then use in the power analysis for Factor A.
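The three f values in this discussion follow from the same little formula. A sketch (the function name is mine):

```python
from math import sqrt

def cohens_f(var_effect, var_error):
    """Cohen's f from the proportions of variance entered into G*Power."""
    return sqrt(var_effect / var_error)

print(round(cohens_f(0.01, 0.99), 2))  # about .10
print(round(cohens_f(0.06, 0.94), 2))  # about .25
print(round(cohens_f(0.06, 0.80), 2))  # about .27 (error reduced by Factor B)
```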
ANOVA With Related Factors
The analysis here can be done with G*Power in pretty much the same way
described earlier for independent samples. There are two new parameters that you will
need to provide:
- the value of the correlation between scores at any one level of the related factor and any other level of the repeated factor. Assuming that this correlation is constant across pairs of levels is the sphericity assumption.
- ε -- this is a correction (applied to the degrees of freedom) to adjust for violation of the sphericity assumption. The df are literally multiplied by ε, which has an upper boundary of 1. There are two common ways to estimate ε, one developed by Greenhouse and Geisser, the other by Huynh and Feldt.
Here is the setup for a one-way repeated measures or randomized blocks ANOVA
with four levels of the factor:
We need 36 cases to have 95% power to detect a medium sized effect assuming
no problem with sphericity and a .5 correlation between repeated measures. Let us see
what happens if we have a stronger correlation between repeated measures:
Very nice. I guess your stats prof wasn't kidding when she pointed out the power benefit of having strong correlations across conditions, but what if you have a problem with the sphericity assumption? Let us assume that you suspect (from previous research) that epsilon might be as low as .6.
Notice the reduction in the degrees of freedom and the associated increase in the number of cases needed.
Instead of the traditional univariate approach ANOVA, one can analyze data from
designs with related factors with the newer multivariate approach, which does not
have a sphericity assumption. G*Power will do power analysis for this approach too.
Let us see how many cases we would need with that approach using the same input
parameters as the previous example.
Contingency Table Analysis (Two-Way)
Effect size is computed as w = √[Σ_{i=1..k} (P_1i − P_0i)²/P_0i]. k is the number of cells, P_0i is the population proportion in cell i under the null hypothesis, and P_1i is the population proportion in cell i under the alternative hypothesis. Cohen's benchmarks are:
- .1 is small but not trivial
- .3 is medium
- .5 is large
When the table is 2 x 2, w is identical to φ.
Suppose we are going to employ a 2 x 4 analysis. We shall use the traditional
5% criterion of statistical significance, and we think Type I and Type II errors equally
serious, and, accordingly, we seek to have 95% power for finding an effect that is small
but not trivial. As you see below, you need a lot of data to have a lot of power when
doing contingency table analysis.
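The w formula is easy to compute directly. A sketch, using hypothetical cell proportions of my own choosing:

```python
from math import sqrt

def cohens_w(p_null, p_alt):
    """Cohen's w from null and alternative cell proportions."""
    return sqrt(sum((p1 - p0) ** 2 / p0 for p0, p1 in zip(p_null, p_alt)))

# Hypothetical 2 x 2 table: equiprobable cells under the null
print(round(cohens_w([.25, .25, .25, .25], [.30, .20, .30, .20]), 2))  # 0.2
```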
MANOVA and DFA
Each effect will have as many roots (discriminant functions, canonical variates,
weighted linear combinations of the outcome variables) as it has treatment degrees of
freedom, or it will have as many roots as there are outcome variables, whichever is
fewer. The weights maximize the ratio SS_among_groups/SS_within_groups. If you were to compute, for each case, a canonical variate score and then conduct an ANOVA comparing the groups on that canonical variate, you would get the sums of squares in the ratio above. This ratio is called the eigenvalue (λ).
Theta is defined as θ = λ/(1 + λ).
One way to describe the association between two variables is to assume that the
value of the one variable is a linear function of the value of the other variable. If this
relationship is perfect, then it can be described by the slope-intercept equation for a
straight line, Y = a + bX. Even if the relationship is not perfect, one may be able to
describe it as nonperfect linear.
Distinction Between Correlation and Regression
Correlation and regression are very closely related topics. Technically, if the X
variable (often called the independent variable, even in nonexperimental research) is
fixed, that is, if it includes all of the values of X to which the researcher wants to
generalize the results, and the probability distribution of the values of X matches that in
the population of interest, then the analysis is a regression analysis. If both the X and
the Y variable (often called the dependent variable, even in nonexperimental research)
are random, free to vary (were the research repeated, different values and sample
probability distributions of X and Y would be obtained), then the analysis is a
correlation analysis. For example, suppose I decide to study the correlation between dose of alcohol (X) and reaction time. If I arbitrarily decide to use as values of X doses of 0, 1, 2, and 3 ounces of 190 proof grain alcohol and restrict X to those values, and have equal numbers of subjects at each level of X, then I've fixed X and do a regression analysis. If I allow X to vary randomly, for example, I recruit subjects from a local bar, measure their blood alcohol (X), and then test their reaction time, then a correlation analysis is appropriate.
In actual practice, when one is using linear models to develop a way to predict Y
given X, the typical behavioral researcher is likely to say she is doing regression
analysis. If she is using linear models to measure the degree of association between X
and Y, she says she is doing correlation analysis.
Scatter Plots
One way to describe a bivariate association is to prepare a scatter plot, a plot of
all the known paired X,Y values (dots) in Cartesian space. X is traditionally plotted on
the horizontal dimension (the abscissa) and Y on the vertical (the ordinate).
If all the dots fall on a straight line with a positive slope, the relationship is
perfect positive linear. Every time X goes up one unit, Y goes up b units. If all dots
fall on a negatively sloped line, the relationship is perfect negative linear.
COV(X, Y) = SSCP/(N − 1)
A major problem with COV is that it is affected not only by degree of linear
relationship between X and Y but also by the standard deviations in X and in Y. In fact,
the maximum absolute value of COV(X,Y) is the product σ_x σ_y. Imagine that you and I
each measured the height and weight of individuals in our class and then computed the
covariance between height and weight. You use inches and pounds, but I use miles
and tons. Your numbers would be much larger than mine, so your covariance would be
larger than mine, but the strength of the relationship between height and weight should
be the same for both of our data sets. We need to standardize the unit of measure of
our variables.
Pearson r
We can get a standardized index of the degree of linear association by dividing COV by the two standard deviations, removing the effect of the two univariate standard deviations. This index is called the Pearson product moment correlation coefficient, r for short, and is defined as r = COV(X, Y)/(s_x s_y) = 4/[1.581(3.162)] = .80. Pearson r may also be defined as a mean, r = ΣZ_x Z_y/N, where the Z-scores are computed using population standard deviations, √(SS/n).
Pearson r may also be computed as r = SSCP/√(SS_x SS_y) = 16/√[(10)(40)] = .8.
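Both routes to r can be verified from the example's summary statistics (SSCP = 16, SS_x = 10, SS_y = 40, n = 5). A sketch:

```python
from math import sqrt

sscp, ss_x, ss_y, n = 16, 10, 40, 5
cov = sscp / (n - 1)                                    # 4.0
s_x, s_y = sqrt(ss_x / (n - 1)), sqrt(ss_y / (n - 1))   # 1.581, 3.162
r_from_cov = cov / (s_x * s_y)
r_from_ss = sscp / sqrt(ss_x * ss_y)
print(round(r_from_cov, 2), round(r_from_ss, 2))  # 0.8 0.8
```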
Pearson r will vary from -1 to 0 to +1. If r = +1 the relationship is perfect positive,
and every pair of X,Y scores has Zx = Zy. If r = 0, there is no linear relationship. If r =
-1, the relationship is perfect negative and every pair of X,Y scores has Zx = -Zy.
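As an illustration (Python, not part of the original handout), the three formulas for r can be checked numerically with the burger/beer data used later in this handout (X = burgers, Y = beers):

```python
import math

# Burger (X) and beer (Y) data used throughout this handout
X = [5, 4, 3, 2, 1]
Y = [8, 10, 4, 6, 2]
n = len(X)
mx, my = sum(X) / n, sum(Y) / n

# Sum of cross products of deviations, and sums of squares
SSCP = sum((x - mx) * (y - my) for x, y in zip(X, Y))   # 16
SSx = sum((x - mx) ** 2 for x in X)                     # 10
SSy = sum((y - my) ** 2 for y in Y)                     # 40

# 1. r = COV / (sx * sy), using sample (n - 1) statistics
cov = SSCP / (n - 1)
sx, sy = math.sqrt(SSx / (n - 1)), math.sqrt(SSy / (n - 1))
r1 = cov / (sx * sy)

# 2. r = mean of Zx * Zy, with Z-scores from population (n) SDs
zx = [(x - mx) / math.sqrt(SSx / n) for x in X]
zy = [(y - my) / math.sqrt(SSy / n) for y in Y]
r2 = sum(a * b for a, b in zip(zx, zy)) / n

# 3. r = SSCP / sqrt(SSx * SSy)
r3 = SSCP / math.sqrt(SSx * SSy)

print(round(r1, 3), round(r2, 3), round(r3, 3))  # 0.8 0.8 0.8
```

All three routes give the same r = .8, as they must.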
Sample r is a Biased Estimator: It Underestimates ρ
E(r) ≈ ρ - ρ(1 - ρ²) / (2(n - 1)). If the ρ were .5 and the sample size 10, the expected value of r
would be about .5 - .5(1 - .5²) / (2(9)) = .479. An early approximately unbiased estimator is
est. ρ = r[1 + (1 - r²) / (2n)]. Since then there have been several other approximately
unbiased estimators. In 1958, Olkin and Pratt proposed an even less biased estimator,
est. ρ = r[1 + (1 - r²) / (2(n - 4))]. For our correlation, the Olkin & Pratt estimator has a value of
.8[1 + (1 - .8²) / (2(5 - 4))] = .944.
There are estimators even less biased than the Olkin & Pratt estimator, but I do
not recommend them because of the complexity of calculating them and because the
bias in the Olkin & Pratt estimator is already so small. For more details, see Shieh
(2010) and Zimmerman, Zumbo, & Williams (2003).
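These bias and correction formulas are easy to verify numerically; a small Python sketch (illustrative only, function names are mine):

```python
def expected_r(rho, n):
    """Approximate expected value of sample r; it underestimates rho."""
    return rho - rho * (1 - rho ** 2) / (2 * (n - 1))

def olkin_pratt(r, n):
    """Olkin & Pratt (1958) approximately unbiased estimator of rho."""
    return r * (1 + (1 - r ** 2) / (2 * (n - 4)))

print(round(expected_r(0.5, 10), 3))  # 0.479
print(round(olkin_pratt(0.8, 5), 3))  # 0.944
```

Both values match the worked examples above.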
Sample r² is a Biased Estimator: It Overestimates ρ²
E(r²) = 1 - (n - 2)(1 - ρ²) / (n - 1). If the ρ² were .25 (or any other value) and the sample
size 2, the expected value of r² would be 1 - (2 - 2)(1 - .25) / (2 - 1) = 1. See my document
What is R² When N = p + 1 (and df = 0)?
If the ρ² were .25 and the sample size 10, the expected value of r² would be
E(r²) = 1 - (10 - 2)(1 - .25) / (10 - 1) = .333. If the ρ² were .25 and the sample size 100, the
expected value of r² would be E(r²) = 1 - (100 - 2)(1 - .25) / (100 - 1) = .258. As you can see, the
bias decreases with increasing sample size.
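The shrinking bias is easy to see by evaluating the expectation formula at a few sample sizes (an illustrative Python sketch, not from the original handout):

```python
def expected_r2(rho2, n):
    """Expected value of sample r-squared; it overestimates rho-squared."""
    return 1 - (n - 2) * (1 - rho2) / (n - 1)

for n in (2, 10, 100):
    print(n, round(expected_r2(0.25, n), 3))
# 2   1.0
# 10  0.333
# 100 0.258
```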
For a relatively unbiased estimate of the population ρ², compute the shrunken r²,
1 - (1 - r²)(n - 1) / (n - 2) = 1 - (1 - .64)(4) / 3 = .52 for our data.
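The shrunken r² computation can be sketched in a couple of lines (illustrative Python, not from the original handout):

```python
def shrunken_r2(r2, n):
    """Shrunken (adjusted) r-squared for a bivariate correlation."""
    return 1 - (1 - r2) * (n - 1) / (n - 2)

# Beer/burger data: r = .8, so r-squared = .64, with n = 5
print(round(shrunken_r2(0.64, 5), 2))  # 0.52
```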
Factors Which Can Affect the Size of r
Range restrictions. If the range of X is restricted, r will usually fall (it can rise if
X and Y are related in a curvilinear fashion and a linear correlation coefficient has
inappropriately been used). This is very important when interpreting criterion-related
validity studies, such as one correlating entrance exam scores with grades after
entrance.
Extraneous variance. Anything causing variance in Y but not in X will tend to
reduce the correlation between X and Y. For example, with a homogeneous set of
subjects all run under highly controlled conditions, the r between alcohol intake and
reaction time might be +0.95, but if subjects were very heterogeneous and testing
conditions variable, r might be only +0.50. Alcohol might still have just as strong an
effect on reaction time, but the effects of many other extraneous variables (such as
sex, age, health, time of day, day of week, etc.) upon reaction time would dilute the
apparent effect of alcohol as measured by r.
Interactions. It is also possible that the extraneous variables might interact
with X in determining Y. That is, X might have one effect on Y if Z = 1 and a different
effect if Z = 2. For example, among experienced drinkers (Z = 1), alcohol might affect
reaction time less than among novice drinkers (Z = 2). If such an interaction is not
taken into account by the statistical analysis (a topic beyond the scope of this course),
the r will likely be smaller than it otherwise would be.
Assumptions of Correlation Analysis
There are no assumptions if you are simply using the correlation coefficient to
describe the strength of linear association between X and Y in your sample. If,
however, you wish to use t or F to test hypothesis about or place a confidence interval
about your estimate of , there are assumptions.
Bivariate Normality
It is assumed that the joint distribution of X,Y is bivariate normal. To see what
such a distribution looks like, try the Java Applet at
http://ucs.kuleuven.be/java/version2.0/Applet030.html . Use the controls to change
various parameters and rotate the plot in three-dimensional space.
In a bivariate normal distribution the following will be true:
1. The marginal distribution of Y ignoring X will be normal.
2. The marginal distribution of X ignoring Y will be normal.
3. Every conditional distribution of Y|X will be normal.
4. Every conditional distribution of X|Y will be normal.
Homoscedasticity
1. The variance in the conditional distributions of Y|X is constant across values of X.
2. The variance in the conditional distributions of X|Y is constant across values of Y.
Testing H0: ρ = 0
If we have X,Y data sampled randomly from some bivariate population of
interest, we may wish to test H0: ρ = 0 with t = r√(n - 2) / √(1 - r²), with df = n - 2.
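For the beer/burger data, this t works out as follows (an illustrative Python sketch, not part of the original handout); with df = 3, this t falls short of the .05 criterion, consistent with the p = .10 reported in the APA summary later:

```python
import math

r, n = 0.8, 5
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)  # df = n - 2 = 3
print(round(t, 3))  # 2.309
```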
You should remember that we used this formula earlier to demonstrate that the
independent samples t test is just a special case of a correlation analysis: if one of the
variables is dichotomous and the other continuous, computing the (point biserial) r and
testing its significance is absolutely equivalent to conducting an independent samples t
test. Keep this in mind when someone tells you that you can make causal inferences
from the results of a t test but not from the results of a correlation analysis; the two are
mathematically identical, so it does not matter which analysis you did. What does
matter is how the data were collected. If they were collected in an experimental manner
(manipulating the independent variable) with adequate control of extraneous variables,
you can make a causal inference. If they were gathered in a nonexperimental manner,
you cannot.
Putting a Confidence Interval on R or R²
It is a good idea to place a confidence interval around the sample value of r or r²,
but it is tedious to compute by hand. Fortunately, there is now available a free program
for constructing such confidence intervals. Please read my document Putting
Confidence Intervals on R² or R.
For our beer and burger data, a 95% confidence interval for r extends from -.28
to .99.
APA-Style Summary Statement
For our beer and burger data, our APA summary statement could read like this:
The correlation between my friends' burger consumption and their beer
consumption fell short of statistical significance, r(n = 5) = .8, p = .10,
95% CI [-.28, .99]. For some strange reason, the value of the computed t is not
generally given when reporting a test of the significance of a correlation coefficient. You
might want to warn your readers that a Type II error is quite likely here, given the small
sample size. Were the result significant, your summary statement might read
something like this: Among my friends, burger consumption was significantly
related to beer consumption, ..........
Power Analysis
Power analysis for r is exceptionally simple: δ = ρ√(n - 1), assuming that df are
large enough for t to be approximately normal. Cohen's benchmarks for effect sizes for
r are: .10 is small but not necessarily trivial, .30 is medium, and .50 is large (Cohen, J.
A Power Primer, Psychological Bulletin, 1992, 112, 155-159).
For our burger-beer data, how much power would we have if the effect size was
large in the population, that is, ρ = .50? δ = .5√4 = 1.00. From our power table, using
the traditional .05 criterion of significance, we then see that power is 17%. As stated
earlier, a Type II error is quite likely here. How many subjects would we need to have
95% power to detect even a small effect? Lots: n = (δ/ρ)² + 1 = (3.6/.1)² + 1 = 1297. That is a lot of
burgers and beer! See the document R2 Power Analysis.
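The two power computations above can be sketched as follows (illustrative Python, not part of the original handout; the δ = 3.60 needed for 95% power at α = .05 comes from a power table):

```python
import math

# delta for the burger-beer data if the population effect is large
rho, n = 0.5, 5
delta = rho * math.sqrt(n - 1)
print(delta)  # 1.0

# n needed for 95% power to detect a small effect (rho = .10):
# delta = 3.60 is taken from a power table for alpha = .05.
needed = round((3.60 / 0.10) ** 2 + 1)
print(needed)  # 1297
```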
Correcting for Measurement Error in Bivariate Linear Correlations
The following draws upon the material presented in this article:
Schmidt, F. L., & Hunter, J. E. (1996). Measurement error in psychological research:
Lessons from 26 research scenarios. Psychological Methods, 1, 199-223.
When one is using observed variables to estimate the correlation between the
underlying constructs which these observed variables measure, one should correct the
correlation between the observed variables for attenuation due to measurement error.
Such a correction will give you an estimate of what the correlation is between the two
constructs (underlying variables), that is, what the correlation would be if we able to
measure the two constructs without measurement error.
Measurement error results in less than perfect values for the reliability of an
instrument. To correct for the attenuation resulting from such lack of perfect reliability,
one can apply the following correction: rXtYt = rXY / √(rXX·rYY), where
rXtYt is our estimate of the correlation between the constructs, corrected for attenuation,
rXY is the observed correlation between X and Y in our sample,
rXX is the reliability of variable X, and
rYY is the reliability of variable Y.
Here is an example from my own research:
I obtained the correlation between misanthropy and attitude towards animals for
two groups, idealists (for whom I predicted there would be only a weak correlation) and
nonidealists (for whom I predicted a stronger correlation). The observed correlation was
.02 for the idealists, .36 for the nonidealists. The reliability (Cronbach alpha) was .91 for the
attitude towards animals instrument (which had 28 items) but only .66 for the
misanthropy instrument (not surprising, given that it had only 5 items). When we correct
the observed correlation for the nonidealists, we obtain rXtYt = .36 / √(.66(.91)) = .46, a much
more impressive correlation. When we correct the correlation for the idealists, the
corrected r is only .03.
I should add that Cronbach's alpha underestimates a test's reliability, so this
correction is an over-correction. It is preferable to use maximized lambda4 as the
estimate of reliability. Using lambda4 estimates of reliability, the corrected r is
rXtYt = .36 / √(.78(.93)) = .42.
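The correction for attenuation can be sketched in a few lines (illustrative Python, not from the original handout):

```python
import math

def disattenuate(r_xy, r_xx, r_yy):
    """Correct an observed correlation for attenuation due to unreliability."""
    return r_xy / math.sqrt(r_xx * r_yy)

# Nonidealists: observed r = .36, Cronbach alpha reliabilities .66 and .91
print(round(disattenuate(0.36, 0.66, 0.91), 2))  # 0.46
# Same observed r with maximized lambda4 reliabilities .78 and .93
print(round(disattenuate(0.36, 0.78, 0.93), 2))  # 0.42
```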
Testing Other Hypotheses
H0: ρ1 = ρ2
One may also test the null hypothesis that the correlation between X and Y in
one population is the same as the correlation between X and Y in another population.
See our textbook for the statistical procedures. One interesting and controversial
application of this test is testing the null hypothesis that the correlation between IQ and
Grades in school is the same for Blacks as it is for Whites. Poteat, Wuensch, and
Gregg (1988, Journal of School Psychology: 26, 59-68) were not able to reject that null
hypothesis.
H0: ρWX = ρWY
If you wish to compare the correlation between one pair of variables with that
between a second, overlapping pair of variables (for example, when comparing the
correlation between one IQ test and grades with the correlation between a second IQ
test and grades), use Williams' procedure explained in our textbook or use Hotelling's
more traditional solution, available from Wuensch and elsewhere. It is assumed that the
correlations for both pairs of variables have been computed on the same set of
subjects. Should you get seriously interested in this sort of analysis, consult this
reference: Meng, Rosenthal, & Rubin (1992). Comparing correlated correlation
coefficients. Psychological Bulletin, 111, 172-175.
H0: ρWX = ρYZ
If you wish to compare the correlation between one pair of variables with that
between a second (nonoverlapping) pair of variables, read the article by T. E.
Raghunathan, R. Rosenthal, and D. B. Rubin (Comparing correlated but
nonoverlapping correlations, Psychological Methods, 1996, 1, 178-183).
H0: ρ = nonzero value
Our textbook also shows how to test the null hypothesis that a correlation has a
particular value (not necessarily zero) and how to place confidence limits on our
estimation of a correlation coefficient. For example, we might wish to test the null
hypothesis that in grad. school the r between IQ and Grades is +0.5 (the value most
often reported for this correlation in primary and secondary schools) and then put 95%
confidence limits on our estimation of the population ρ.
Please note that these procedures require the same assumptions made for
testing the null hypothesis that the ρ is zero. There are, however, no assumptions
necessary to use r as a descriptive statistic, to describe the strength of linear
association between X and Y in the data you have.
Spearman rho
When one's data are ranks, one may compute the Spearman correlation for
ranked data, also called the Spearman ρ, which is computed and significance-tested
exactly as is Pearson r (if n < 10, find a special table for testing the significance of the
Spearman ρ). The Spearman ρ measures the linear association between pairs of
ranks. If one's data are not ranks, but one converts the raw data into ranks prior to
computing the correlation coefficient, the Spearman ρ measures the degree of
monotonicity between the original variables. If every time X goes up, Y goes up (the
slope of the line relating X to Y is always positive) there is a perfect positive monotonic
relationship, but not necessarily a perfect linear relationship (for which the slope would
have to be constant). Consider the following data:
X 1.0 1.9 2.0 2.9 3.0 3.1 4.0 4.1 5
Y 10 99 100 999 1,000 1,001 10,000 10,001 100,000
You should run the program Spearman.sas on my SAS Programs web page. It
takes these data and transforms them into ranks and then prints out the new data. The
first page of output shows the original data, the ranked data, and also the Y variable
after a base 10 log transformation. A plot of the raw data shows a monotonic but
distinctly nonlinear relationship. A plot of X by the log of Y shows a nearly perfect linear
relationship. A plot of the ranks shows a perfect relationship. PROC CORR is then used
to compute Pearson, Spearman, and Kendall tau correlation coefficients.
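The handout's SAS program does this with PROC CORR; the same idea can be sketched in Python (illustrative only, not part of the original handout). Because Y rises every time X rises, the ranks agree perfectly and the Spearman correlation is exactly 1, while the Pearson correlation on the raw data is below 1:

```python
def ranks(values):
    """Rank the values 1..n (no ties in these data)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def pearson(X, Y):
    n = len(X)
    mx, my = sum(X) / n, sum(Y) / n
    sscp = sum((x - mx) * (y - my) for x, y in zip(X, Y))
    ssx = sum((x - mx) ** 2 for x in X)
    ssy = sum((y - my) ** 2 for y in Y)
    return sscp / (ssx * ssy) ** 0.5

X = [1.0, 1.9, 2.0, 2.9, 3.0, 3.1, 4.0, 4.1, 5]
Y = [10, 99, 100, 999, 1000, 1001, 10000, 10001, 100000]

# Pearson on the raw data: high, but less than 1 (the relation is nonlinear)
print(round(pearson(X, Y), 3))
# Pearson on the ranks, i.e., Spearman rho: exactly 1
print(pearson(ranks(X), ranks(Y)))  # 1.0
```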
How Do Behavioral Scientists Use Correlation Analyses?
1. to measure the linear association between two variables without establishing
any cause-effect relationship.
2. as a necessary (and suggestive) but not sufficient condition to establish
causality. If changing X causes Y to change, then X and Y must be correlated (but the
correlation is not necessarily linear). X and Y may, however, be correlated without X
causing Y. It may be that Y causes X. Maybe increasing Z causes increases in both X
and Y, producing a correlation between X and Y with no cause-effect relationship
between X and Y. For example, smoking cigarettes is well known to be correlated with
health problems in humans, but we cannot do experimental research on the effect of
smoking upon humans' health. Experimental research with rats has shown a causal
relationship, but we are not rats. One alternative explanation of the correlation between
smoking and health problems in humans is that there is a third variable, or constellation
of variables (genetic disposition or personality), that is causally related to both smoking
and development of health problems. That is, if you have this disposition, it causes you
to smoke and it causes you to have health problems, creating a spurious correlation
between smoking and health problems; but the disposition that caused the smoking
would have caused the health problems whether or not you smoked. No, I do not
believe this model, but the data on humans cannot rule it out.
As another example of a third variable problem, consider the strike by PATCO,
the union of air traffic controllers back during the Reagan years. The union cited
statistics that air traffic controllers had much higher than normal incidence of
stress-related illnesses (hypertension, heart attacks, drug abuse, suicide, divorce, etc.). They
said that this was caused by the stress of the job, and demanded better benefits to deal
with the stress, no mandatory overtime, rotation between high stress and low stress job
positions, etc. The government crushed the strike (fired all controllers), invoking a third
variable explanation of the observed correlation between working in air traffic control
and these illnesses. They said that the air traffic controller profession attracted persons
of a certain disposition (Type A individuals, who are perfectionists who seem always to
be under time pressure), and these individuals would get those illnesses whether they
worked in air traffic or not. Accordingly, the government said, the problem was the fault
of the individuals, not the job. Maybe the government would prefer that we hire only
Type B controllers (folks who take it easy and don't get so upset when they see two
blips converging on the radar screen)!
3. to establish an instrument's reliability: a reliable instrument is one which will
produce about the same measurements when the same objects are measured
repeatedly, in which case the scores at one time should be well correlated with the
scores at another time (and have equivalent means and variances as well).
4. to establish an instrument's (criterion-related) validity: a valid instrument is
one which measures what it says it measures. One way to establish such validity is to
show that there is a strong positive correlation between scores on the instrument and an
independent measure of the attribute being measured. For example, the Scholastic
Aptitude Test was designed to measure individuals' ability to do well in college.
Showing that scores on this test are well correlated with grades in college establishes
the test's validity.
5. to do independent groups t-tests: if the X variable, groups, is coded 0,1 (or
any other two numbers) and we obtain the r between X and Y, a significance-test of the
hypothesis that = 0 will yield exactly the same t and p as the traditional pooled-
variances independent groups t-test. In other words, the independent groups t-test is
just a special case of correlation analysis, where the X variable is dichotomous and the
Y variable is normally distributed. The r is called a point-biserial r. It can also be
shown that the 2 x 2 Pearson Chi-square test is a special case of r. When both X and Y
are dichotomous, the r is called phi (φ).
6. One can measure the correlation between Y and an optimally weighted set of
two or more Xs. Such a correlation is called a multiple correlation. A model with
multiple predictors might well predict a criterion variable better than would a model with
just a single predictor variable. Consider the research reported by McCammon, Golden,
and Wuensch in the Journal of Research in Science Education, 1988, 25, 501-510.
Subjects were students in freshman and sophomore level Physics courses (only those
courses that were designed for science majors, no general education <football physics>
courses). The mission was to develop a model to predict performance in the course.
The predictor variables were CT (the Watson-Glaser Critical Thinking Appraisal), PMA
(Thurstone's Primary Mental Abilities Test), ARI (the College Entrance Exam Board's
Arithmetic Skills Test), ALG (the College Entrance Exam Board's Elementary Algebra
Skills Test), and ANX (the Mathematics Anxiety Rating Scale). The criterion variable
was subjects' scores on course examinations. Our results indicated that we could
predict performance in the physics classes much better with a combination of these
predictors than with just any one of them. At Susan McCammon's insistence, I also
separately analyzed the data from female and male students. Much to my surprise I
found a remarkable sex difference. Among female students every one of the predictors
was significantly related to the criterion; among male students none of the predictors
was. A posteriori searching of the literature revealed that Anastasi (Psychological
Testing, 1982) had noted a relatively consistent finding of sex differences in the
predictability of academic grades, possibly due to women being more conforming and
more accepting of academic standards (better students), so that women put maximal
effort into their studies, whether or not they like the course, and accordingly they work up
to their potential. Men, on the other hand, may be more fickle, putting forth maximum
effort only if they like the course, thus making it difficult to predict their performance
solely from measures of ability.
ANOVA, which we shall cover later, can be shown to be a special case of
multiple correlation/regression analysis.
7. One can measure the correlation between an optimally weighted set of Ys
and an optimally weighted set of Xs. Such an analysis is called canonical correlation,
and almost all inferential statistics in common use can be shown to be special cases of
canonical correlation analysis. As an example of a canonical correlation, consider the
research reported by Patel, Long, McCammon, & Wuensch (Journal of Interpersonal
Violence, 1995, 10, 354-366). We had two sets of data on a group of male
college students. The one set was personality variables from the MMPI. One of these
was the PD (psychopathically deviant) scale, Scale 4, on which high scores are
associated with general social maladjustment and hostility. The second was the MF
(masculinity/femininity) scale, Scale 5, on which low scores are associated with
stereotypical masculinity. The third was the MA (hypomania) scale, Scale 9, on which
high scores are associated with overactivity, flight of ideas, low frustration tolerance,
narcissism, irritability, restlessness, hostility, and difficulty with controlling impulses.
The fourth MMPI variable was Scale K, which is a validity scale on which high scores
indicate that the subject is clinically defensive, attempting to present himself in a
favorable light, and low scores indicate that the subject is unusually frank. The second
set of variables was a pair of homonegativity variables. One was the IAH (Index of
Attitudes Towards Homosexuals), designed to measure affective components of
homophobia. The second was the SBS, (Self-Report of Behavior Scale), designed to
measure past aggressive behavior towards homosexuals, an instrument specifically
developed for this study.
Our results indicated that high scores on the SBS and the IAH were associated
with stereotypical masculinity (low Scale 5), frankness (low Scale K), impulsivity (high
Scale 9), and general social maladjustment and hostility (high Scale 4). A second
relationship found showed that having a low IAH but high SBS (not being homophobic
but nevertheless aggressing against gays) was associated with being high on Scales 5
(not being stereotypically masculine) and 9 (impulsivity). This relationship seems to
reflect a general (not directed towards homosexuals) aggressiveness; in the words of
one of my graduate students, being "an equal opportunity bully."
Links: all recommended reading (in other words, know it for the test)
Biserial and Polychoric Correlation Coefficients
Comparing Correlation Coefficients, Slopes, and Intercepts
Confidence Intervals on R² or R
Contingency Tables with Ordinal Variables
Correlation and Causation
Cronbach's Alpha and Maximized Lambda4
Inter-Rater Agreement
Phi
Residuals Plots -- how to make them and interpret them
Tetrachoric Correlation -- what it is and how to compute it.
Shieh, G. (2010). Estimation of the simple correlation coefficient. Behavior Research
Methods, 42, 906-917.
Zimmerman, D. W., Zumbo, B. D., & Williams, R. H. (2003). Bias in estimation
and hypothesis testing of correlation. Psicológica, 24, 133-158.
Copyright 2011, Karl L. Wuensch - All rights reserved.
Regr6430.docx
Bivariate Linear Regression
Bivariate Linear Regression analysis involves finding the best fitting straight line to describe a
set of bivariate data. It is based upon the linear model, Y = a + bX + e. That is, every Y score is
made up of two components: a + bX, which is the linear effect of X upon Y, the value of Y given X if
X and Y were perfectly correlated in a linear fashion, and e, which stands for error. Error is simply
the difference between the actual value of Y and that value we would predict from the best fitting
straight line. That is, e = Y - Ŷ.
Sources of Error
There are three basic sources that contribute to the error term, e.
Error in the measurement of X and or Y or in the manipulation of X (experimental research).
The influence upon Y of variables other than X (extraneous variables), including variables
that interact with X.
Any nonlinear influence of X upon Y.
The Regression Line
The best fitting straight line, or regression line, is Ŷ = a + bX.
If r² is less than one (a nonperfect correlation), then predicted Zy regresses towards the
mean relative to Zx. Regression towards the mean refers to the fact that predicted Y will be closer
to the mean on Y than is known X to the mean on X, unless the relationship between X and Y is
perfect linear.
r      Zx    Predicted Zy
1.00   2     2: Just as far (2 sd) from mean Y as X is from mean X. No regression towards the mean.
0.50   2     1: Closer to mean Y (1 sd) than X is to mean X
0.00   2     0: Regression all the way to the mean
-0.50  2    -1: Closer to mean Y (1 sd) than X is to mean X
-1.00  2    -2: Just as far (2 sd) from mean Y as X is from mean X. No regression towards the mean.
The criterion used to find the best fitting straight line is the least squares criterion. That is,
we find a and b such that Σ(Y - Ŷ)² is as small as possible. The slope is
b = SSCP / SSx = r(sy / sx); for our data, b = 16/10 = 1.6.
Subject  Burgers, X  Beers, Y    Ŷ     (Y - Ŷ)  (Y - Ŷ)²  (Ŷ - My)²
1            5          8       9.2     -1.2     1.44     10.24
2            4         10       7.6      2.4     5.76      2.56
3            3          4       6.0     -2.0     4.00      0.00
4            2          6       4.4      1.6     2.56      2.56
5            1          2       2.8     -0.8     0.64     10.24
Sum         15         30      30.0      0.0    14.40     25.60
Mean         3          6
St. Dev. 1.581      3.162
The total sum of squares for Y is SSY = 4(3.162)² = 40.
Please note that if sy = sx, as would be the case if we changed all scores on Y and X to z-scores,
then r = b. This provides a very useful interpretation of r. Pearson r is the number of
standard deviations that predicted Y changes for each one standard deviation change in X. That is,
on the average, and in sd units, how much does Y change per sd change in X.
We compute the intercept, a, the predicted value of Y when X = 0, with a = My - b·Mx. For our
data, that is a = 6 - 1.6(3) = 1.2.
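The slope, intercept, and predicted values in the table above can be checked numerically (an illustrative Python sketch, not part of the original handout):

```python
X = [5, 4, 3, 2, 1]   # burgers
Y = [8, 10, 4, 6, 2]  # beers
n = len(X)
mx, my = sum(X) / n, sum(Y) / n

SSCP = sum((x - mx) * (y - my) for x, y in zip(X, Y))  # 16
SSx = sum((x - mx) ** 2 for x in X)                    # 10

b = SSCP / SSx   # slope: 16/10 = 1.6
a = my - b * mx  # intercept: 6 - 1.6(3) = 1.2
yhat = [round(a + b * x, 1) for x in X]
print(b, round(a, 3))  # 1.6 1.2
print(yhat)            # [9.2, 7.6, 6.0, 4.4, 2.8]
```

These are exactly the Ŷ values in the table above.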
There are two different linear regression lines that we could compute with a set of bivariate
data. The one I have already presented is for predicting Y from X. We could also find the least
squares regression line for predicting X from Y. This will usually not be the same line used for
predicting Y from X, since the line that minimizes Σ(Y - Ŷ)² is generally not the same line
that minimizes Σ(X - X̂)², unless r² equals one. We can find the regression line for predicting X from Y using the same
formulas we used for finding the line to predict Y from X, but we must substitute X for Y and Y for X in
the formulas. The a and b for intercept and slope may be subscripted with y.x to indicate Y predicted
from X or with x.y to indicate X predicted from Y.
The two regression lines are coincident (are the same line) only when the correlation is perfect
(r is +1.00 or -1.00). They always intersect at the point (Mx, My). When r = 0.00, the two regression
lines are Ŷ = My and X̂ = Mx, and the error sum of squares is SSE = Σ(Y - Ŷ)² = (1 - r²)SSy. Since SSE
would then = SSy, all of the variance in Y would be error variance. If r is nonzero we can do better
than just always predicting My. The mean square error is MSE = SSE / (n - 2), and the standard
error of estimate is sY.X = √MSE. For our beer-burger data, where the
total SS for Y was 40, SSE = (1 - .64)(40) = 14.4, MSE = 14.4/3 = 4.8, and the standard error of
estimate is √4.8 = 2.191.
SSE represents the SS in Y not due to the linear association between X and Y. SSregr, the
regression sum of squares, represents the SS in Y that is due to that linear association:
SSregr = Σ(Ŷ - My)² = SSy - SSE = r²·SSy, and SSy = SSregr + SSE.
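The partitioning of SSy into regression and error components can be verified numerically (illustrative Python, not from the original handout):

```python
import math

SSy, r2, n = 40, 0.8 ** 2, 5  # beer-burger data

SSE = (1 - r2) * SSy     # error SS: 14.4
MSE = SSE / (n - 2)      # mean square error: 4.8
se_est = math.sqrt(MSE)  # standard error of estimate: about 2.191
SS_regr = r2 * SSy       # regression SS: 25.6

print(round(SSE, 1), round(MSE, 1), round(se_est, 3), round(SS_regr, 1))
# 14.4 4.8 2.191 25.6
```

Note that SSE + SSregr = 14.4 + 25.6 = 40 = SSy, as the identity requires.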
2. If we wish to estimate an individual value of Y given X we need to widen the confidence
interval to include the variance of individual values of Y given X about the mean value of Y given X,
using this formula: CI = Ŷ ± t·sY.X·√(1 + 1/n + (X - Mx)²/SSx).
3. In both cases, the value of t is obtained from a table where df = n - 2 and (1 - cc) = the
level of significance for two-tailed test in the t table in our textbook. Recall that cc is the confidence
coefficient, the degree of confidence desired.
4. Constructing confidence intervals requires the same assumptions as testing hypotheses
about the slope.
5. The confidence intervals about the regression line will be bowed, with that for predicting
individual values wider than that for predicting average values.
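As an illustration (Python, not part of the original handout), here is the interval for an individual new Y at X = 3, the mean burger count; the critical t(3) = 3.182 is taken from a t table, and the other numbers come from the beer-burger computations above:

```python
import math

n, SSx, mx = 5, 10, 3
s_yx = math.sqrt(4.8)  # standard error of estimate
t_crit = 3.182         # t(.975) with df = n - 2 = 3, from a t table

x = 3                  # predict for a person who eats 3 burgers
yhat = 1.2 + 1.6 * x   # = 6.0
half_width = t_crit * s_yx * math.sqrt(1 + 1 / n + (x - mx) ** 2 / SSx)
print(round(yhat - half_width, 2), round(yhat + half_width, 2))
# -1.64 13.64
```

The interval is so wide that it even dips below zero beers, another symptom of the tiny sample size noted earlier.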
Testing Other Hypotheses
One may test the null hypothesis that the slope for predicting Y from X is the same in one
population as it is in another. For example, is the slope of the line relating blood cholesterol level to
blood pressure the same in women as it is in men? Howell explains how to test such a hypothesis in
our textbook. Please note that this is not equivalent to a test of the null hypothesis that two
correlation coefficients are equal.
The Y.X relationships in different populations may differ from one another in terms of slope,
intercept, and/or scatter about the regression line (error, 1 - r²). [See relevant plots] There are
methods (see Kleinbaum & Kupper, Applied Regression Analysis and Other Multivariable Methods,
Duxbury, 1978, Chapter 8) to test all three of these across two or more groups. Howell restricts his
attention to tests of slope and scatter with only two groups.
Suppose that for predicting blood pressure from blood cholesterol the slope were exactly the
same for men and for women. The intercepts need not be the same (men might have higher average
blood pressure than do women at all levels of cholesterol) and the rs may differ, for example, the
effect of extraneous variables might be greater among men than among women, producing more
scatter about the regression line for men, lowering r
2
for men. Alternatively, the rs may be identical
but the slopes different. For example, suppose the scatter about the regression line was identical for
men and women but that blood pressure increased more per unit increase in cholesterol for men than
for women.
Consider this case, where the slopes differ but the correlation coefficients do not:
Group   r    sx    sy    b = r(sy / sx)
A      .5    10    20    .5(20/10) = 1
B      .5    10    100   .5(100/10) = 5
Now, consider this case, where the correlation coefficients differ but the slopes do not:
Group   b    sx    sy    r = b(sx / sy)
A       1    10    20    1(10/20) = .5
B       1    10    100   1(10/100) = .1
See these data sets which have identical slopes for predicting Y from X but different correlation
coefficients.
Return to my Statistics Lessons page.
More on Assumptions in Correlation and Regression
Copyright 2012, Karl L. Wuensch - All rights reserved.
What is R² When N = p + 1 (and df = 0)?
N = 2 = p + 1: Two variables, two cases.
N is the number of cases and p is the number of predictor variables. I shall
represent the outcome variable with Y and the predictor variables with X
i
. The data for
all variables here were randomly sampled from a population where the correlation
between each pair of variables was exactly zero.
Model Summary
Model   R        R Square   Adjusted R Square   Std. Error of the Estimate
1       1.000a   1.000      .                   .
a. Predictors: (Constant), X1
With only two data points, you can fit them perfectly with a straight line no matter
where the two points are located in Cartesian space; accordingly, r² = 1.
N = 3 = p + 1: Three cases, three variables.
Model Summary
Model   R        R Square   Adjusted R Square   Std. Error of the Estimate
1       1.000a   1.000      .                   .
a. Predictors: (Constant), X2, X1
Coefficients(a)
Model         B           Std. Error   Beta     t    Sig.
1 (Constant)  8.079E-16   .000                  .    .
  X1          .318        .000         .842     .    .
  X2          -.136       .000         -.343    .    .
a. Dependent Variable: Y
We now have three data points in three dimensional space. We can fit them
perfectly with a plane. R² = 1.
I asked SPSS to save predicted scores. As you can see, prediction is perfect:
Here I plot Y versus predicted Y.
N = 4 = p + 1: Four cases, four variables. We are in hyperspace now.
Model Summary(b)
Model   R        R Square   Adjusted R Square   Std. Error of the Estimate
1       1.000a   1.000      .                   .
a. Predictors: (Constant), X3, X1, X2
b. Dependent Variable: Y
Again, prediction is perfect.
Coefficients(a)
Model         B       Std. Error   Beta      t    Sig.
1 (Constant)  4.886   .000                   .    .
  X1          -.008   .000         -.010     .    .
  X2          -.788   .000         -1.553    .    .
  X3          .326    .000         .844      .    .
a. Dependent Variable: Y
And so on.
Bottom line, when there only as many cases as there are variables, you can
perfectly predict any one of the variables from an optimally weighted linear combination
of the others. I should note that each of the variables must be a variable (have variance
> 0), not a constant (variance = 0).
Clearly, sample R² is an overestimate of population R² when the number of
cases is the same as the number of variables. When the number of cases is not much
more than the number of variables the overestimation will be less than in the extreme
cases above, but still enough that you might want to compute adjusted R² (also known
as shrunken R²). When the number of cases is very much greater than the number of
variables, overestimation of the value of R² will be trivial.
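A quick way to see the size of the shrinkage is to compute adjusted R² directly. A minimal Python sketch (not part of the original lesson; the function name is mine), using the usual 1 - (1 - R²)(N - 1)/(N - p - 1) adjustment:

```python
def adjusted_r2(r2, n, p):
    """Wherry-style adjusted (shrunken) R-squared."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# many more cases than variables: shrinkage is trivial
print(round(adjusted_r2(0.50, 1000, 3), 4))  # 0.4985
# barely more cases than variables: severe shrinkage (it can even go negative)
print(adjusted_r2(0.50, 6, 3))  # -0.25
```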
Back to Wuensch's Stats Lessons Page
Karl L. Wuensch
East Carolina University
February, 2009.
CI-R2.docx
Putting Confidence Intervals on R² or R
Giving a confidence interval for an R or R² is a lot more informative than just
giving the sample value and a significance level. So, how does one compute a
confidence interval for R or R²?
Bivariate Correlation
Benchmarks for ρ. Again, context can be very important.
.1 is small but not trivial
.3 is medium
.5 is large
Confidence Interval for ρ, Correlation Analysis. My colleagues and I
(Wuensch, K. L., Castellow, W. A., & Moore, C. H. Effects of defendant attractiveness
and type of crime on juridic judgment. Journal of Social Behavior and Personality, 1991,
6, 713-724) asked mock jurors to rate the seriousness of a crime and also to
recommend a sentence for a defendant who was convicted of that crime. The observed
correlation between seriousness and sentence was .555, n = 318, p < .001. We treat
both variables as random. Now we roll up our sleeves and prepare to do a bunch of
tedious arithmetic.
First we apply Fisher's transformation to the observed value of r. I shall use
Greek zeta (ζ) to symbolize the transformed r:
ζ = (0.5)ln[(1 + r)/(1 - r)] = (0.5)ln(1.555/.445) = (0.5)ln(3.494) = (0.5)(1.251) = 0.626.
We compute the standard error as SE = 1/√(n - 3) = 1/√315 = .05634. We compute a 95% confidence
interval for ζ: ζ ± 1.96(.05634). This gives us a confidence interval
extending from .51557 to .73643, but it is in transformed units, so we need to
untransform it.
r = (e^2ζ - 1)/(e^2ζ + 1). At the lower boundary, that gives us
r = (e^1.031 - 1)/(e^1.031 + 1) = 1.8039/3.8039 = .474, and at the upper boundary
r = (e^1.473 - 1)/(e^1.473 + 1) = 3.3617/5.3617 = .627.
What a bunch of tedious arithmetic that involved. We need a computer program
to do it for us. My Conf-Interval-r.sas program will do it all for you.
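The Conf-Interval-r.sas program does this in SAS; as a cross-check, here is a minimal Python sketch of the same arithmetic (not part of the original lesson), reproducing the interval for r = .555, n = 318:

```python
import math

def r_confidence_interval(r, n, z_crit=1.96):
    """95% confidence interval for a correlation, via Fisher's r-to-z."""
    zeta = 0.5 * math.log((1 + r) / (1 - r))   # Fisher transformation
    se = 1 / math.sqrt(n - 3)                  # standard error of zeta
    lo_z, hi_z = zeta - z_crit * se, zeta + z_crit * se
    # untransform each boundary back to the r metric
    back = lambda z: (math.exp(2 * z) - 1) / (math.exp(2 * z) + 1)
    return back(lo_z), back(hi_z)

lo, hi = r_confidence_interval(0.555, 318)
print(round(lo, 3), round(hi, 3))  # 0.474 0.627
```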
χ² = (N - 1)r² = 3215(.324)/383.444 = 2.717, within rounding error of the Linear-by-Linear
Association reported by SPSS.
We could also test the deviation from linearity by subtracting the linear χ² from the
overall χ²: 11.752 - 2.717 = 9.035. The df are also obtained by subtraction, overall
less linear = 2 - 1 = 1. P(χ² > 9.035 | df = 1) = .0026. There is a significant deviation
from linearity.
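A sketch of this partitioning in Python (hypothetical helper code, not part of the lesson; scipy supplies the chi-square p value):

```python
from scipy.stats import chi2

# linear-by-linear association: chi-square = (N - 1) * r**2
linear = 3215 * (0.324 / 383.444)   # (N - 1) times regression SS / total SS
# deviation from linearity: overall chi-square minus linear chi-square
deviation = 11.752 - linear
p = chi2.sf(deviation, df=1)        # df = overall (2) less linear (1)
print(round(linear, 3), round(deviation, 3), round(p, 4))  # 2.717 9.035 0.0026
```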
Now let us split the file by the direction of travel. If we consider only those going
down, there is a significant overall effect of weight category but not a significant linear
effect:
Chi-Square Testsb
                               Value    df   Asymp. Sig. (2-sided)
Pearson Chi-Square             8.639a   2    .013
Likelihood Ratio               9.091    2    .011
Linear-by-Linear Association   .001     1    .973
N of Valid Cases               1362
a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 23.09.
b. direct = 2 Descending
If we consider only those going up, there is a significant linear effect and the
deviation from linearity is not significant, χ²(1, N = 1855) = 2.626, p = .105:
Chi-Square Testsb
                               Value    df   Asymp. Sig. (2-sided)
Pearson Chi-Square             9.525a   2    .009
Likelihood Ratio               10.001   2    .007
Linear-by-Linear Association   6.899    1    .009
N of Valid Cases               1855
a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 13.21.
b. direct = 1 Ascending
Return to Wuensch's Stats Lessons Page
Karl L. Wuensch, East Carolina University, October, 2010.
CompareCorrCoeff.docx
Comparing Correlation Coefficients, Slopes, and Intercepts
Two Independent Samples
H0: ρ1 = ρ2
If you want to test the null hypothesis that the correlation between X and Y in one
population is the same as the correlation between X and Y in another population, you can
use the procedure developed by R. A. Fisher in 1921 (On the probable error of a
coefficient of correlation deduced from a small sample, Metron, 1, 3-32).
First, transform each of the two correlation coefficients in this fashion:
r' = (0.5)ln[(1 + r)/(1 - r)]
Second, compute the test statistic this way:
z = (r'1 - r'2) / √[1/(n1 - 3) + 1/(n2 - 3)]
Third, obtain p for the computed z.
Consider the research reported by Wuensch, K. L., Jenkins, K. W., & Poteat, G.
M. (2002). Misanthropy, idealism, and attitudes towards animals. Anthrozoös, 15, 139-
149.
The relationship between misanthropy and support for animal rights was compared
between two different groups of persons: persons who scored high on Forsyth's measure
of ethical idealism, and persons who did not score high on that instrument. For 91
nonidealists, the correlation between misanthropy and support for animal rights was .3639.
For 63 idealists the correlation was .0205. The test statistic,
z = (.3814 - .0205)/√(1/88 + 1/60) = 2.16, p =
.031, leading to the conclusion that the correlation in nonidealists is significantly higher
than it is in idealists.
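Garbin's FZT.exe program does this computation; as a cross-check, a minimal Python sketch (function name mine, not part of the lesson) reproducing the example:

```python
import math
from scipy.stats import norm

def compare_independent_rs(r1, n1, r2, n2):
    """Fisher's z test of H0: rho1 = rho2 for two independent samples."""
    rp1 = 0.5 * math.log((1 + r1) / (1 - r1))  # transform each r
    rp2 = 0.5 * math.log((1 + r2) / (1 - r2))
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (rp1 - rp2) / se
    return z, 2 * norm.sf(abs(z))              # two-tailed p

z, p = compare_independent_rs(0.3639, 91, 0.0205, 63)
print(round(z, 2), round(p, 3))  # 2.16 0.031
```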
Calvin P. Garbin of the Department of Psychology at the University of Nebraska has
authored a dandy document Bivariate Correlation Analyses and Comparisons which is
recommended reading. His web server has been a bit schizo lately, so you might find the
link invalid, sorry. Dr. Garbin has also made available a program (FZT.exe) for conducting
this Fisher's z test. Files with the .exe extension encounter a lot of prejudice on the
Internet these days, so you might have problems with that link too. If so, try to find it on his
web sites at http://psych.unl.edu/psycrs/statpage/ and
http://psych.unl.edu/psycrs/statpage/comp.html .
The test statistic is t = (b1 - b2)/s(b1-b2), evaluated on (N - 4) degrees of freedom.
If your regression program gives you the standard error of the slope (both SAS and
SPSS do), the standard error of the difference between the two slopes is most easily
computed as s(b1-b2) = √(s²b1 + s²b2) = √(.08140² + .09594²) = .1258.
Or, if you just love doing arithmetic, you can first find the pooled residual variance,
s²y.x = (SSE1 + SSE2)/(n1 + n2 - 4) = (24.0554 + 15.6841)/150 = .2649,
and then compute the standard error of the difference between slopes as
s(b1-b2) = √[s²y.x/SSX1 + s²y.x/SSX2] = √[.2649/(90)(.6732)² + .2649/(62)(.6712)²] = .1264,
within rounding error of what we got above.
Now we can compute the test statistic: t = (.3001 - .0153)/.1258 = 2.26.
This is significant on 150 df (p = .025), so we conclude that the slope in nonidealists
is significantly higher than that in idealists.
Please note that the test on slopes uses a pooled error term. If the variance in the
dependent variable is much greater in one group than in the other, there are alternative
methods. See Kleinbaum and Kupper (1978, Applied Regression Analysis and Other
Multivariable Methods, Boston: Duxbury, pages 101 & 102) for a large (each sample
n > 25) samples test, and K & K page 192 for a reference on other alternatives to pooling.
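The slope comparison above is easy to script; a Python sketch (variable names mine, not part of the lesson) using the values reported above:

```python
import math
from scipy.stats import t as t_dist

b1, b2 = 0.3001, 0.0153          # the two slopes (nonidealists, idealists)
se_b1, se_b2 = 0.08140, 0.09594  # their standard errors from the regression output
se_diff = math.sqrt(se_b1**2 + se_b2**2)
t = (b1 - b2) / se_diff
df = 91 + 63 - 4                 # n1 + n2 - 4
p = 2 * t_dist.sf(abs(t), df)    # two-tailed p
print(round(se_diff, 4), round(t, 2), round(p, 3))  # 0.1258 2.26 0.025
```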
H0: a1 = a2
The regression lines (one for nonidealists, the other for idealists) for predicting
support for animal rights from misanthropy could also differ in intercepts. Here is how to
test the null hypothesis that the intercepts are identical:
s(a1-a2) = √( s²y.x pooled [1/n1 + 1/n2 + M1²/SSX1 + M2²/SSX2] ),
t = (a1 - a2)/s(a1-a2), on df = n1 + n2 - 4.
M1 and M2 are the means on the predictor variable for the two groups (nonidealists and
idealists). If you ever need a large sample test that does not require homogeneity of
variance, see K & K pages 104 and 105.
For our data,
s(a1-a2) = √(.2649[1/91 + 1/63 + 2.3758²/(90)(.6732)² + 2.2413²/(62)(.6712)²]) = .3024,
and t = (1.626 - 2.404)/.3024 = -2.57.
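A sketch of the intercept comparison in Python (variable names mine, not part of the lesson), using the values from the example:

```python
import math

s2_pooled = 0.2649       # pooled residual variance computed above
n1, n2 = 91, 63
M1, M2 = 2.3758, 2.2413  # group means on the predictor
ss_x1 = 90 * 0.6732**2   # SS_X = (n - 1) * SD**2
ss_x2 = 62 * 0.6712**2
se = math.sqrt(s2_pooled * (1/n1 + 1/n2 + M1**2/ss_x1 + M2**2/ss_x2))
t = (1.626 - 2.404) / se # difference between the two intercepts
print(round(se, 3), round(t, 2))  # 0.302 -2.57
```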
H0: ρay = ρby
H0: ρWX = ρYZ
If you wish to compare the correlation between one pair of variables with that
between a second (nonoverlapping) pair of variables, read the article by T. E.
Raghunathan , R. Rosenthal, and D. B. Rubin (Comparing correlated but nonoverlapping
correlations, Psychological Methods, 1996, 1, 178-183). Also, see my example,
Comparing Correlated but Nonoverlapping Correlation Coefficients .
Return to my Statistics Lessons page.
Copyright 2011, Karl L. Wuensch - All rights reserved.
InterRater.doc
East Carolina University
Department of Psychology
Inter-Rater Agreement
Psychologists commonly measure various characteristics by having a rater
assign scores to observed people, other animals, other objects, or events. When using
such a measurement technique, it is desirable to measure the extent to which two or
more raters agree when rating the same set of things. This can be treated as a sort of
reliability statistic for the measurement procedure.
Continuous Ratings, Two Judges
Let us first consider a circumstance where we are comfortable with treating the
ratings as a continuous variable. For example, suppose that we have two judges rating
the aggressiveness of each of a group of children on a playground. If the judges agree
with one another, then there should be a high correlation between the ratings given by
the one judge and those given by the other. Accordingly, one thing we can do to
assess inter-rater agreement is to correlate the two judges' ratings. Consider the
following ratings (they also happen to be ranks) of ten subjects:
Subject 1 2 3 4 5 6 7 8 9 10
Judge 1 10 9 8 7 6 5 4 3 2 1
Judge 2 9 10 8 7 5 6 4 3 1 2
These data are available in the SPSS data set IRA-1.sav at my SPSS Data
Page. I used SPSS to compute the correlation coefficients, but SAS can do the same
analyses. Here is the dialog window from Analyze, Correlate, Bivariate:
The reliability of the mean of the judges' ratings can be estimated with the
Spearman-Brown formula, j(icc)/[1 + (j - 1)icc], where j is the number of judges and icc is the
intraclass correlation coefficient. I would think this statistic appropriate when the data
for our main study involves having j judges rate each subject.
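This step is a one-liner; a Python sketch (function name mine), using the single-measure intraclass correlation of .6961 from the example later in these notes:

```python
def effective_reliability(icc, j):
    """Spearman-Brown estimate of the reliability of the mean of j judges' ratings."""
    return j * icc / (1 + (j - 1) * icc)

print(round(effective_reliability(0.6961, 3), 3))  # 0.873
```

The result agrees with the Average Measure Intraclass Correlation (.8730) in the SPSS output shown later.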
Rank Data, More Than Two Judges
When our data are rankings, we don't have to worry about differences in
magnitude. In that case, we can simply employ Spearman rho or Kendall tau if there
are only two judges or Kendall's coefficient of concordance if there are three or more
judges. Consult pages 309 - 311 in David Howell's Statistics for Psychology, 7th edition,
for an explanation of Kendall's coefficient of concordance. Run the program
Kendall-Patches.sas, from my SAS programs page, as an example of using SAS to
compute Kendall's coefficient of concordance. The data are those from Howell, page
310. Statistic 2 is the Friedman chi-square testing the null hypothesis that the patches
do not differ from one another with respect to how well they are liked. This null
hypothesis is equivalent to the hypothesis that there is no agreement among the judges
with respect to how pleasant the patches are. To convert the Friedman chi-square to
Kendall's coefficient of concordance, we simply substitute into this equation:
W = χ²F/[J(n - 1)] = 33.889/[6(7)] = .807, where J is the number of judges and n is the number of
things being ranked.
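A quick check of the arithmetic above (a Python sketch, not part of the lesson; j and n here are taken from the worked denominator, 6(7) = 42):

```python
def kendalls_w(friedman_chi2, j, n):
    """Kendall's coefficient of concordance from the Friedman chi-square."""
    return friedman_chi2 / (j * (n - 1))

print(round(kendalls_w(33.889, j=6, n=8), 3))  # 0.807
```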
If the judges gave ratings rather than ranks, you must first convert the ratings
into ranks in order to compute the Kendall coefficient of concordance. An explanation
of how to do this with SAS is presented in my document "Nonparametric Statistics." You
would, of course, need to remember that ratings could be concordant in order but not in
magnitude.
Categorical Judgments
Please re-read pages 165 and 166 in David Howell's Statistical Methods for
Psychology, 7th edition. Run the program Kappa.sas, from my SAS programs page,
as an example of using SAS to compute kappa. It includes the data from page 166 of
Howell. Note that Cohen's kappa is appropriate only when you have two judges. If you
have more than two judges you may use Fleiss' kappa.
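Cohen's kappa is also easy to compute directly; a minimal Python sketch with a hypothetical two-judge agreement table (my own data, not Howell's):

```python
def cohens_kappa(table):
    """Cohen's kappa from a square agreement table (rows = judge 1, columns = judge 2)."""
    k = len(table)
    n = sum(sum(row) for row in table)
    po = sum(table[i][i] for i in range(k)) / n           # observed agreement
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    pe = sum(rows[i] * cols[i] for i in range(k)) / n**2  # chance agreement
    return (po - pe) / (1 - pe)

print(round(cohens_kappa([[20, 5], [10, 15]]), 3))  # 0.4
```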
Return to Wuensch's Statistics Lessons Page
Copyright 2010, Karl L. Wuensch - All rights reserved.
IntraClassCorrelation.doc
East Carolina University
Department of Psychology
The Intraclass Correlation Coefficient
Read pages 495 through 498 in David Howell's Statistical Methods for
Psychology, 7th edition.
Here I shall compute the same intraclass correlation coefficient that Howell did,
treating judges as a random (rather than fixed) variable. The basic analysis is a Judges
x Subjects repeated measures ANOVA. Here are the data, within SPSS:
Click Analyze, General Linear Model, Repeated Measures.
Name the Factor judge, indicate 3 levels, click Add.
ICC = (MSsubjects - MSerror) / [MSsubjects + (j - 1)MSerror + j(MSjudges - MSerror)/n].
Howell has more rounding error in his calculations than do I.
Doing the analysis as described above has pedagogical value, but if you just want to get
the intraclass correlation coefficient with little fuss, do it this way:
Click Analyze, Scale, Reliability Analysis. Scoot all three judges into the Items box.
Click Statistics. Ask for an Intraclass correlation coefficient, Two-Way Random model,
Type = Absolute Agreement.
Continue, OK.
Here is the output. I have set the intraclass correlation coefficient in bold font and
highlighted it.
****** Method 1 (space saver) will be used for this analysis ******
Intraclass Correlation Coefficient
Two-way Random Effect Model (Absolute Agreement Definition):
People and Measure Effect Random
Single Measure Intraclass Correlation = .6961*
95.00% C.I.: Lower = .0558 Upper = .9604
F = 214.0000 DF = ( 4, 8.0) Sig. = .0000 (Test Value = .0000 )
Average Measure Intraclass Correlation = .8730
95.00% C.I.: Lower = .1480 Upper = .9864
F = 214.0000 DF = ( 4, 8.0) Sig. = .0000 (Test Value = .0000 )
*: Notice that the same estimator is used whether the interaction effect
is present or not.
Reliability Coefficients
N of Cases = 5.0 N of Items = 3
Alpha = .9953
Addendum
A correspondent at SUNY, Albany, provided me with data on the grades given by
faculty grading comprehensive examinations for doctoral students in his unit and asked
if I could provide assistance in estimating the inter-rater reliability. I computed the ICC
as well as the simple Pearson r between each rater and each other rater. The single
measure ICC was exactly equal to the mean of the Pearson r coefficients. Interesting,
and probably not mere coincidence.
See Also:
Enhancement of Reliability Analysis -- Robert A. Yaffee
Choosing an Intraclass Correlation Coefficient -- David P. Nichols
Return to Wuensch's Statistics Lessons Page
Revised March, 2010.
Pearson R and Phi
Pearson r computed on two dichotomous variables is a phi coefficient. To test
the significance of such a phi coefficient one generally uses a chi-square statistic, which
can be computed as χ² = nφ². For the contingency table presented below,
χ² = 30(.305/7.5) = 1.22 (r² is the ratio of the regression SS to the total SS). This
chi-square is evaluated on one degree of freedom. Do notice that the p value provided with
the usual test of significance for a Pearson correlation coefficient is off a bit (.285 as
compared to the .269 obtained from the chi-square).
Regression

Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .202a   .041       .006                .50690
a. Predictors: (Constant), B

ANOVAb
Model          Sum of Squares   df   Mean Square   F       Sig.
1 Regression   .305             1    .305          1.189   .285a
  Residual     7.195            28   .257
  Total        7.500            29
a. Predictors: (Constant), B
b. Dependent Variable: A
Correlations
                          B
A   Pearson Correlation   .202
    Sig. (2-tailed)       .285
    N                     30

φ = .202
Crosstabs

A * B Crosstabulation (Count)
            B
A           1.00   2.00   Total
1.00        10     5      15
2.00        7      8      15
Total       17     13     30

Chi-Square Tests
                     Value   df   Asymp. Sig. (2-sided)
Pearson Chi-Square   1.222   1    .269
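As a check on the arithmetic, phi and the chi-square can be computed directly from the cell counts above (a Python sketch, not part of the lesson):

```python
import math
from scipy.stats import chi2

# the crosstabulation above: a, b = first row; c, d = second row
a, b, c, d = 10, 5, 7, 8
n = a + b + c + d
phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
chisq = n * phi**2            # chi-square = n * phi-squared
p = chi2.sf(chisq, df=1)
print(round(phi, 3), round(chisq, 3), round(p, 3))  # 0.202 1.222 0.269
```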
Return to Wuensch's Introductory Lesson on Bivariate Linear Correlation
Residual-Plots-SPSS.doc
Producing and Interpreting Residuals Plots in SPSS
In a linear regression analysis it is assumed that the distribution of the residuals,
(Y - Ŷ), is normal.
I shall compare the Wilcoxon rank-sum statistic with the independent samples t-
test to illustrate the differences between typical nonparametric tests and their parametric
equivalents.
Independent Samples t test              Wilcoxon Rank-Sum Test
H0: μ1 = μ2                             H0: Population 1 = Population 2
Assumptions:                            None for general test, but often assume:
  Normal populations                      Equal shapes
  Homogeneity of variance                 Equal dispersions
  (but not for separate variances test)
Both tests are appropriate for determining whether or not there is a significant
association between a dichotomous variable and a continuous variable with
independent samples data. Note that with the independent samples t test the null
hypothesis focuses on the population means. If you have used the general form of the
nonparametric hypothesis (without assuming that the populations have equal shapes
and equal dispersions), rejection of that null hypothesis simply means that you are
confident that the two populations differ on one or more of location, shape, or
dispersion. If, however, we are willing to assume that the two populations have identical
shapes and dispersions, then we can interpret rejection of the nonparametric null
hypothesis as indicating that the populations differ in location. With these equal shapes
and dispersions assumptions the nonparametric test is quite similar to the parametric
test. In many ways the nonparametric tests we shall study are little more than
parametric tests on rank-transformed data. The nonparametric tests we shall study are
especially sensitive to differences in medians.
If your data indicate that the populations are not normally distributed, then a
nonparametric test may be a good alternative, especially if the populations do appear to
be of the same non-normal shape. If, however, the populations are approximately
normal but heterogeneous in variance, I would recommend a separate variances t-test
over a nonparametric test. If you cannot assume equal dispersions with the
nonparametric test, then you cannot interpret rejection of the nonparametric null
hypothesis as due solely to differences in location.
Conducting the Wilcoxon Rank-Sum Test
Rank the data from lowest to highest. If you have tied scores, assign all of them
the mean of the ranks for which they are tied. Find the sum of the ranks for each group.
If n1 = n2, then the test statistic, WS, is the smaller of the two sums of ranks. Go to the
table (starts on page 715 of Howell) and obtain the one-tailed (lower tailed) p. For a
= . This procedure does not require that you first conduct the
omnibus test, and should you first conduct the omnibus test, you may make the
Bonferroni comparisons whether or not that omnibus test is significant. Suppose that k
= 4 and you wish to make all 6 pairwise comparisons (1-2, 1-3, 1-4, 2-3, 2-4, 3-4) with a
maximum familywise alpha of .05. Your adjusted criterion is .05 divided by 6, .0083.
For each pairwise comparison you obtain an exact p, and if that exact p is less than or
equal to the adjusted criterion, you declare that difference to be significant. Do note that
the cost of such a procedure is a great reduction in power (you are trading an increased
risk of Type II error for a reduced risk of Type I error).
Here is a summary statement for the problem on page 684 of Howell: Kruskal-
Wallis ANOVA indicated that type of drug significantly affected the number of problems
solved, H(2, N = 19) = 10.36, p = .006. Pairwise comparisons made with Wilcoxon's
rank-sum test revealed that ......... Basic descriptive statistics (means, medians,
standard deviations, sample sizes) would be presented in a table.
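The Bonferroni procedure described above can be sketched with scipy's Mann-Whitney U test (equivalent to the Wilcoxon rank-sum test); the groups and scores here are hypothetical, chosen only for illustration:

```python
from itertools import combinations
from scipy.stats import mannwhitneyu

# hypothetical scores for k = 3 groups
groups = {"A": [1, 2, 2, 3, 4], "B": [3, 4, 5, 5, 6], "C": [7, 8, 8, 9, 10]}
pairs = list(combinations(groups, 2))
criterion = 0.05 / len(pairs)   # Bonferroni-adjusted criterion
results = {}
for g1, g2 in pairs:
    u, p = mannwhitneyu(groups[g1], groups[g2], alternative="two-sided")
    results[(g1, g2)] = p
    print(g1, g2, round(p, 4), "significant" if p <= criterion else "ns")
```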
Friedman's ANOVA
This test is appropriate to test the significance of the association between a
categorical variable (k ≥ 2) and a continuous variable with randomized blocks data
(related samples). While Friedman's test could be employed with k = 2, usually
Wilcoxon's signed-ranks test would be employed if there were only two groups.
Subjects have been matched (blocked) on some variable or variables thought to be
correlated with the continuous variable of primary interest. Within each block the
continuous variable scores are ranked. Within each condition (level of the categorical
variable) you sum the ranks and substitute in the formula on page 685 of Howell. As
with the Kruskal-Wallis, obtain p from chi-square on k - 1 degrees of freedom, using an
upper-tailed p for nondirectional hypotheses, adjusting it with k! for directional
hypotheses. Pairwise comparisons could be accomplished employing Wilcoxon signed-
ranks tests, with Fisher's or Bonferroni's procedure to guard against inflated familywise
alpha.
Friedman's ANOVA is closely related to Kendall's coefficient of concordance.
For the example on page 685 of Howell, the Friedman test asks whether the rankings
are the same for the three levels of visual aids. Kendall's coefficient of concordance, W,
would measure the extent to which the blocks agree in their rankings: W = χ²F/[N(k - 1)].
Here is a sample summary statement for the problem on page 685 of Howell:
Friedman's ANOVA indicated that judgments of the quality of the lectures were
significantly affected by the number of visual aids employed, χ²F(2, n = 17) = 10.94, p =
.004. Pairwise comparisons with Wilcoxon signed-ranks tests indicated that
....................... Basic descriptive statistics would be presented in a table.
Power
It is commonly opined that the primary disadvantage of the nonparametric
procedures is that they have less power than does the corresponding parametric test.
The reduction in power is not, however, great, and if the assumptions of the parametric
test are violated, then the nonparametric test may be more powerful.
Everything You Ever Wanted to Know About Six But Were Afraid to Ask
You may have noticed that the numbers 2, 3, 4, 6, 12, and 24 commonly appear
as constants in the formulas for nonparametric test statistics. This results from the fact
that the sum of the integers from 1 to n is equal to n(n + 1) / 2.
Effect Size Estimation
Please read my document Nonparametric Effect Size Estimators .
Using SAS to Compute Nonparametric Statistics
Run the program Nonpar.sas from my SAS programs page. Print the output and
the program file.
The first analysis is a Wilcoxon Rank Sum Test, using the birthweight data also
used by Howell (page 676) to illustrate this procedure. SAS gives us the sum of scores
for each group. That sum for the smaller group is the statistic which Howell calls WS
(100). Note that SAS does not report the W'S statistic (52), but it is easily computed by
hand: W'S = 152 - 100 = 52. Please remember that the test statistic which
psychologists report is the smaller of WS and W'S. SAS does report both a normal
approximation (z = 2.088, p = .037) and an exact (not approximated) p = .034. The z
differs slightly from that reported by Howell because SAS employs a correction for
continuity (reducing by .5 the absolute value of the numerator of the z ratio).
The next analysis is a Wilcoxon Matched Pairs Signed-Ranks Test using the
data from page 682 of Howell. Glucose-Saccharine difference scores are computed
and then fed to Proc Univariate. Among the many other statistics reported with Proc
Univariate, there is the Wilcoxon Signed-Ranks Test. For the data employed here, you
will see that SAS reports S = 53.5, p = .004. S, the signed-rank statistic, is the
absolute value of T - n(n + 1)/4, where T is the sum of the positive ranks or the negative
ranks.
S is the difference between the expected and the obtained sums of ranks. You
know that the sum of the ranks from 1 to n is n(n + 1)/2. Under the null hypothesis, you
expect the sum of the positive ranks to equal the sum of the negative ranks, so you
expect each of those sums of ranks to be half of n(n + 1)/2. For the data we analyzed
here, the sum of the ranks 1 through 16 = 136, and half of that is 68. The observed sum
of positive ranks is 121.5, and the observed sum of negative ranks is 14.5. The
difference between 68 and 14.5 (or between 121.5 and 68) is 53.5, the value of S
reported by SAS.
To get T from S, just subtract the absolute value of S from the expected value for
the sum of ranks, that is, T = n(n + 1)/4 - |S|. Alternatively, just report S instead of T and
be prepared to explain what S is to the ignorant psychologists who review your
manuscript.
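The S-to-T conversion is a one-liner; a Python sketch (function name mine) using the values above (n = 16 pairs, S = 53.5):

```python
def t_from_s(s, n):
    """Recover Wilcoxon's T from the signed-rank statistic S reported by SAS."""
    return n * (n + 1) / 4 - abs(s)

print(t_from_s(53.5, 16))  # 14.5, the smaller sum of ranks
```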
If you needed to conduct several signed-ranks tests, you might not want to
produce all of the output that you get by default with Proc Univariate. See my program
WilcoxonSignedRanks.sas on my SAS programs page to see how to get just the
statistics you want and nothing else.
Note that a Binomial Sign Test is also included in the output of Proc Univariate.
SAS reports M = 5, p = .0213. M is the difference between the expected number of
negative signs and the obtained number of negative signs. Since we have 16 pairs of
scores, we expect, under the null, to get 8 negative signs. We got 3 negative signs, so
M = 8 - 3 = 5. The p here is the probability of getting an event as or more unusual than 3
successes on 16 binomial trials when the probability of a success on each trial is .5.
Another way to get this probability with SAS is: Data p; p = 2*PROBBNML(.5, 16, 3);
proc print; run;
Next is a Kruskal-Wallis ANOVA, using Howells data on effect of stimulants
and depressants on problem solving (page 684). Do note that the sums and means
reported by SAS are for the ranked data. Following the overall test, I conducted
pairwise comparisons with Wilcoxon Rank Sum tests. Note how I used the subsetting
IF statement to create the three subsets necessary to do the pairwise comparisons.
The last analysis is Friedmans Rank Test for Correlated Samples, using
Howells data on the effect of visual aids on rated quality of lectures (page 685). Note
that I first had to use Proc Rank to create a data set with ranked data. Proc Freq then
provides the Friedman statistic as a Cochran-Mantel-Haenszel Statistic. One might
want to follow the overall analysis with pairwise comparisons, but I have not done so
here.
I have also provided an alternative rank analysis for the data just analyzed with
the Friedman procedure. Note that I simply conducted a factorial ANOVA on the rank
data, treating the blocking variable as a second independent variable. One advantage
of this approach is that it makes it easy to get the pairwise comparisons -- just include
the LSMEANS command with the PDIFF option. The output from LSMEANS includes
the mean ranks and a matrix of p values for tests comparing each group's mean rank
with each other group's mean rank.
References
Gaito, J. (1980). Measurement scales and statistics: Resurgence of an old
misconception. Psychological Bulletin, 87, 564-567. doi:10.1037/0033-2909.87.3.564
Howell, D. C. (2010). Statistical methods for psychology (7th ed.). Belmont, CA:
Cengage Wadsworth.
Nanna, M. J., & Sawilowsky, S. S. (1998). Analysis of Likert scale data in disability and
medical rehabilitation research. Psychological Methods, 3, 55-67.
doi:10.1037/1082-989X.3.1.55
Return to Wuensch's Statistics Lessons Page
Copyright 2012, Karl L. Wuensch - All rights reserved.
East Carolina University
Department of Psychology
Nonparametric Effect Size Estimators
As you know, the American Psychological Association now emphasizes the reporting of effect
size estimates. Since the unit of measure for most criterion variables used in psychological research
is arbitrary, standardized effect size estimates, such as Hedges' g, η², and ω², are popular. What is
one to use when the analysis has been done with nonparametric methods? This query is addressed
in the document A Call for Greater Use of Nonparametric Statistics, pages 13-15. The authors
(Leech & Onwuegbuzie) note that researchers who employ nonparametric analysis generally either
do not report effect size estimates or report parametric effect size estimates such as estimated
Cohen's d. It is, however, known that these effect size estimates are adversely affected by
departures from normality and heterogeneity of variances, so they may not be well advised for use
with the sort of data which generally motivates a researcher to employ nonparametric analysis.
There are a few nonparametric effect size estimates (see Leech & Onwuegbuzie), but they are
not well-known and they are not available in the typical statistical software package.
Remember that nonparametric procedures do not test the same null hypothesis that a
parametric t test or ANOVA tests. The nonparametric null hypothesis is that the populations being
compared are identical in all aspects -- not just in location. If you are willing to assume that the
populations do not differ in dispersion or shape, then you can interpret a significant difference as a
difference in locations. I shall assume that you are making such assumptions.
With respect to the two independent samples design (comparing means), the following might
make sense, but I have never seen them done:
A d like estimator calculated by taking the difference in group mean ranks and dividing by the
standard deviation of the ranks.
Another d like estimator calculated by taking the difference between the group median scores
and dividing by the standard deviation of the scores.
An eta-squared like estimator calculated as the squared point-biserial correlation between
groups and the ranks.
Grissom and Kim (2012) have suggested some effect size estimators for use in association
with nonparametric statistics. For the two-group independent samples design, they suggest that
one obtain the Mann-Whitney U statistic and then divide it by the product of the two sample sizes.
That is, p̂ = U/(na nb). This statistic estimates the probability that a score randomly drawn from
population a will be greater than a score randomly drawn from population b. If your stats package
does not compute U, but rather computes the Wilcoxon Rank Sum Statistic, you can get
U = W - ns(ns + 1)/2, where ns is the smaller of na and nb. If there are tied ranks, you may add to U
one half the number of ties.
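A sketch of these two formulas in Python (function names mine; the W value and sample sizes below are hypothetical, chosen only for illustration):

```python
def u_from_w(w, ns, nl):
    """Mann-Whitney U from the rank-sum W of the smaller group (ns <= nl)."""
    return w - ns * (ns + 1) / 2

def prob_superiority(u, na, nb):
    """Grissom & Kim's estimator of P(score from a > score from b)."""
    return u / (na * nb)

u = u_from_w(52, 8, 10)   # hypothetical W = 52 with n's of 8 and 10
print(u, round(prob_superiority(u, 8, 10), 3))  # 16.0 0.2
```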
For the two related samples design, associated with the Binomial Sign Test and the Wilcoxon
Signed Ranks Test, Grissom and Kim (2012) recommend PSdep, the probability that in a randomly
sampled pair of scores (one matched pair of scores) the score from Condition B (the condition which
most frequently has the higher score) will be greater than the score from Condition A (the
condition which most frequently has the lower score). When computing either the sign test or the
signed ranks test, one first finds the B-A difference scores. To obtain PSdep, one simply divides
the number of positive difference scores by the total number of matched pairs. That is,
PSdep = n+/N, where n+ is the number of positive difference scores. If there are ties, one can simply
discard the ties (reducing N) or add to the numerator one half the number of ties.
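A sketch of PSdep with hypothetical difference scores (the function and the data are mine, not Grissom and Kim's; ties are handled by discarding them):

```python
def ps_dep(differences):
    """Grissom & Kim's PS-dep from B - A difference scores; ties are discarded."""
    pos = sum(1 for d in differences if d > 0)
    ties = sum(1 for d in differences if d == 0)
    return pos / (len(differences) - ties)

diffs = [2, 5, 1, 3, -1, 4, 0, 2]  # hypothetical: 6 positive, 1 negative, 1 tie
print(round(ps_dep(diffs), 3))  # 0.857
```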
You can find SAS code for computing two nonparametric effect size estimates in the document
Robust Effect Size Estimates and Meta-Analytic Tests of Homogeneity (Hogarty & Kromrey,
SAS Users Group International Conference, Indianapolis, April, 2000).
I posted a query about nonparametric effect size estimators on EDSTAT-L and got a few
responses, which I provide here.
Leech (2002) suggested reporting nonparametric effect size indices, such as Vargha &
Delaney's A or Cliff's d. (Leech (2002). A Call for Greater Use of Nonparametric Statistics. Paper
presented at the Annual Meeting of the Mid-South Educational Research Association, Chattanooga,
TN, November 6-8.)
John Mark, Regions University.
----------------------------------------------------
See the chapter titled "Effect sizes for ordinal categorical variables" in Grissom and Kim
(2005). Effect sizes for research. Lawrence Erlbaum.
dale
If you find any good Internet resources on this topic, please do pass them on to me so I can
include them here. Thanks.
Reference
Grissom, R. J., & Kim, J. J. (2012). Effect sizes for research: Univariate and multivariate
applications (2nd ed.). New York, NY: Taylor & Francis.
Return to Dr. Wuensch's Statistics Lessons Page.
Contact Information for the Webmaster,
Dr. Karl L. Wuensch
This document most recently revised on the 6th of April, 2012.
Screening Data
Many of my students have gotten spoiled because I nearly always provide them
with clean data. By clean data I mean data which has already been screened to
remove out-of-range values, transformed to meet the assumptions of the analysis to be
done, and otherwise made ready to use. Sometimes these spoiled students have quite
a shock when they start working with dirty data sets, like those they are likely to
encounter with their thesis, dissertation, or other research projects. Cleaning up a data
file is like household cleaning jobs: it can be tedious, and few people really enjoy doing
it, but it is vitally important to do. If you don't clean your dishes, scrub the toilet, and
wash your clothes, you get sick. If you don't clean up your research data file, your data
analysis will produce sick (schizophrenic) results. You may be able to afford to pay
someone to clean your household. Paying someone to clean up your data file can be
more expensive (once you know how to do it you may be willing to do it for others, if
compensated well; I charge $200 an hour, the same price I charge for programming,
interpreting, and writing).
Missing Data
With some sorts of research it is not unusual to have cases for which there are
missing data for some but not all variables. There may or there may not be a pattern to
the missing data. The missing data may be classified as MCAR, MAR, or MNAR.
Missing Not at Random (MNAR)
Some cases are missing scores on our variable of interest, Y.
Suppose that Y is the salary of faculty members.
Missingness on Y is related to the actual value of Y.
Of course, we do not know that, since we do not know the values of Y for cases
with missing data.
For example, faculty with higher salaries may be more reluctant to provide their
income.
If we estimate mean faculty salary with the data we do have on hand it will be a
biased estimate.
There is some mechanism which is causing missingness, but we do not know
what it is.
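The bias described above is easy to demonstrate by simulation. This is a minimal Python sketch (hypothetical salary numbers, not real data): high earners are made much more likely to be missing, and the mean of the observed data underestimates the true mean.

```python
# Sketch: MNAR missingness biases the estimate of mean faculty salary.
import random

random.seed(1)
salaries = [random.gauss(60000, 10000) for _ in range(10000)]
true_mean = sum(salaries) / len(salaries)

# MNAR: the probability of being missing depends on the value of Y itself;
# here, 80% of faculty earning over 70,000 decline to report their salary.
observed = [y for y in salaries if not (y > 70000 and random.random() < 0.8)]
obs_mean = sum(observed) / len(observed)

print(round(true_mean), round(obs_mean))   # observed mean is biased downward
```

Because the mechanism depends on the unobserved values themselves, no amount of analysis of the observed data alone can remove this bias.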
Missing At Random (MAR)
Missingness on Y is not related to the true value of Y itself or is related to Y only
through its relationship with another variable or set of variables, and we have
scores on that other variable or variables for all cases.
The term multivariate statistics is appropriately used to include all statistics where
there are more than two variables simultaneously analyzed. You are already familiar with
bivariate statistics such as the Pearson product moment correlation coefficient and the
independent groups t-test. A one-way ANOVA with 3 or more treatment groups might also be
considered a bivariate design, since there are two variables: one independent variable and one
dependent variable. Statistically, one could consider the one-way ANOVA as either a bivariate
curvilinear regression or as a multiple regression with the K level categorical independent
variable dummy coded into K-1 dichotomous variables.
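That equivalence can be sketched with contrived data (a minimal Python illustration, not part of the original lesson): dummy code a K = 3 group variable into K - 1 = 2 dichotomous predictors, regress Y on them, and the regression R-squared equals the ANOVA eta-squared.

```python
# Sketch: one-way ANOVA as multiple regression with dummy-coded groups.
import numpy as np

y = np.array([1., 2, 3, 4, 5, 6, 7, 8, 9])
group = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])    # K = 3 groups of 3

# Dummy code the factor into K - 1 = 2 dichotomies (group 2 is the reference).
X = np.column_stack([np.ones_like(y),
                     (group == 0).astype(float),
                     (group == 1).astype(float)])

b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
r2 = 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()

# eta-squared from the ANOVA decomposition: SS_between / SS_total
ss_between = sum(3 * (y[group == g].mean() - y.mean()) ** 2 for g in (0, 1, 2))
eta2 = ss_between / ((y - y.mean()) ** 2).sum()
print(round(r2, 4), round(eta2, 4))   # the two are identical
```

The dummy variables span exactly the group-membership information, so the regression reproduces the group means and the two proportions of variance coincide.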
Independent vs. Dependent Variables
We shall generally continue to make use of the terms independent variable and
dependent variable, but shall find the distinction between the two somewhat blurred in
multivariate designs, especially those observational rather than experimental in nature.
Classically, the independent variable is that which is manipulated by the researcher. With such
control, accompanied by control of extraneous variables through means such as random
assignment of subjects to the conditions, one may interpret the correlation between the
dependent variable and the independent variable as resulting from a cause-effect relationship
from independent (cause) to dependent (effect) variable. Whether the data were collected by
experimental or observational means is NOT a consideration in the choice of an analytic tool.
Data from an experimental design can be analyzed with either an ANOVA or a regression
analysis (the former being a special case of the latter) and the results interpreted as
representing a cause-effect relationship regardless of which statistic was employed. Likewise,
observational data may be analyzed with either an ANOVA or a regression analysis, and the
results cannot be unambiguously interpreted with respect to causal relationship in either case.
We may sometimes find it more reasonable to refer to independent variables as
predictors, and dependent variables as response-, outcome-, or criterion-variables.
For example, we may use SAT scores and high school GPA as predictor variables when
predicting college GPA, even though we wouldn't want to say that SAT causes college GPA. In
general, the independent variable is that which one considers the causal variable, the prior
variable (temporally prior or just theoretically prior), or the variable on which one has data from
which to make predictions.
Descriptive vs. Inferential Statistics
While psychologists generally think of multivariate statistics in terms of making
inferences from a sample to the population from which that sample was randomly or
representatively drawn, sometimes it may be more reasonable to consider the data that one
has as the entire population of interest. In this case, one may employ multivariate descriptive
statistics (for example, a multiple regression to see how well a linear model fits the data) without
worrying about any of the assumptions (such as homoscedasticity and normality of conditionals
or residuals) associated with inferential statistics. That is, multivariate statistics, such as R²,
can be used as descriptive statistics. In any case, psychologists rarely ever randomly sample.
The third
was the MA (hypomania) scale, Scale 9, on which high scores are associated with overactivity,
flight of ideas, low frustration tolerance, narcissism, irritability, restlessness, hostility, and
difficulty with controlling impulses. The fourth MMPI variable was Scale K, which is a validity
scale on which high scores indicate that the subject is clinically defensive, attempting to
present himself in a favorable light, and low scores indicate that the subject is unusually frank.
The second set of variables was a pair of homonegativity variables. One was the IAH (Index of
Attitudes Towards Homosexuals), designed to measure affective components of homophobia.
The second was the SBS, (Self-Report of Behavior Scale), designed to measure past
aggressive behavior towards homosexuals, an instrument specifically developed for this study.
With luck, we can interpret the weights (or, even better, the loadings, the correlations
between each canonical variable and the variables in its set) so that each of our canonical
variates represents some underlying dimension (that is causing the variance in the observed
variables of its set). We may also think of a canonical variate as a superordinate variable,
made up of the more molecular variables in its set. After constructing the first pair of canonical
variates we attempt to construct a second pair that will explain as much as possible of the
(residual) variance in the observed variables, variance not explained by the first pair of
canonical variates. Thus, each canonical variate of the Xs is orthogonal to (independent of)
each of the other canonical variates of the Xs and each canonical variate of the Ys is
orthogonal to each of the other canonical variates of the Ys. Construction of canonical
variates continues until you can no longer extract a pair of canonical variates that accounts for a
significant proportion of the variance. The maximum number of pairs possible is the smaller of
the number of X variables or number of Y variables.
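The computation behind those canonical correlations can be sketched in Python (simulated data, not Patel's; this is not part of the original lesson): the squared canonical correlations are the eigenvalues of Ryy⁻¹RyxRxx⁻¹Rxy, and the number of pairs is the smaller of the two set sizes.

```python
# Sketch: canonical correlations from the blocks of the correlation matrix.
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 4))                         # e.g., four personality scales
Y = np.column_stack([X[:, 0] + rng.normal(size=n),  # one Y related to the Xs
                     rng.normal(size=n)])           # one Y unrelated to them

R = np.corrcoef(np.column_stack([X, Y]), rowvar=False)
Rxx, Rxy = R[:4, :4], R[:4, 4:]
Ryx, Ryy = R[4:, :4], R[4:, 4:]

# Squared canonical correlations: eigenvalues of Ryy^-1 Ryx Rxx^-1 Rxy
M = np.linalg.solve(Ryy, Ryx) @ np.linalg.solve(Rxx, Rxy)
canon_r = np.sqrt(np.clip(np.sort(np.linalg.eigvals(M).real)[::-1], 0, 1))
print(np.round(canon_r, 2))   # min(4, 2) = 2 canonical correlations, largest first
```

With these simulated data the first canonical correlation is substantial (the first Y was built from the first X) and the second hovers near zero, just as the residual-variance logic in the text predicts.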
In the Patel et al. study both of the canonical correlations were significant. The first
canonical correlation indicated that high scores on the SBS and the IAH were associated with
stereotypical masculinity (low Scale 5), frankness (low Scale K), impulsivity (high Scale 9), and
general social maladjustment and hostility (high Scale 4). The second canonical correlation
indicated that having a low IAH but high SBS (not being homophobic but nevertheless
aggressing against gays) was associated with being high on Scales 5 (not being stereotypically
masculine) and 9 (impulsivity). The second canonical variate of the homonegativity variables
seems to reflect a general (not directed towards homosexuals) aggressiveness.
PRINCIPAL COMPONENTS AND FACTOR ANALYSIS
Here we start out with one set of variables. The variables are generally correlated with
one another. We wish to reduce the (large) number of variables to a smaller number of
components or factors (I'll explain the difference between components and factors when we
study this in detail) that capture most of the variance in the observed variables. Each factor (or
component) is estimated as being a linear (weighted) combination of the observed variables.
We could extract as many factors as there are variables, but generally most of them would
contribute little, so we try to get a few factors that capture most of the variance. Our initial
extraction generally includes the restriction that the factors be orthogonal, independent of one
another.
Consider the analysis reported by Chia, Wuensch, Childers, Chuang, Cheng, Cesar-
Romero, & Nava in the Journal of Social Behavior and Personality, 1994, 9, 249-258. College
students in Mexico, Taiwan, and the US completed a 45 item Cultural Values Survey. A
principal components analysis produced seven components (each a linear combination of the
45 items) which explained in the aggregate 51% of the variance in the 45 items. We could
have explained 100% of the variance with 45 components, but the purpose of the PCA is to
explain much of the variance with relatively few components. Imagine a plot in seven
dimensional space with seven perpendicular (orthogonal) axes. Each axis represents one
component. For each variable I plot a point that represents its loading (correlation) with each
component. With luck I'll have seven clusters of dots in this hyperspace (one for each
component). I may be able to improve my solution by rotating the axes so that each one more
nearly passes through one of the clusters. I may do this by an orthogonal rotation (keeping
the axes perpendicular to one another) or by an oblique rotation. In the latter case I allow the
axes to vary from perpendicular, and as a result, the components obtained are no longer
independent of one another. This may be quite reasonable if I believe the underlying
dimensions (that correspond to the extracted components) are correlated with one another.
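The extraction step can be sketched in Python (a toy six-item example, not the 45-item Cultural Values Survey; not part of the original lesson): components are eigenvectors of the correlation matrix, a loading is the correlation between an item and a component, and each eigenvalue over the number of items gives the proportion of variance that component explains.

```python
# Sketch: principal components from the correlation matrix of six items
# built from two latent dimensions.
import numpy as np

rng = np.random.default_rng(42)
n = 300
f1, f2 = rng.normal(size=(2, n))                      # two latent dimensions
items = np.column_stack([f1 + .5 * rng.normal(size=n) for _ in range(3)] +
                        [f2 + .5 * rng.normal(size=n) for _ in range(3)])

R = np.corrcoef(items, rowvar=False)
eigval, eigvec = np.linalg.eigh(R)                    # eigh returns ascending order
order = np.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]

loadings = eigvec * np.sqrt(eigval)                   # item-component correlations
prop_var = eigval / eigval.sum()                      # proportion of variance each
print(np.round(prop_var, 2))                          # first two dominate
```

We could keep all six components and explain 100% of the variance, but here the first two capture most of it, which is exactly the data-reduction goal described above.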
With luck (or after having tried many different extractions/rotations), I'll come up with a
set of loadings that can be interpreted sensibly (that may mean finding what I expected to find).
From consideration of which items loaded well on which components, I named the components
Family Solidarity (respect for the family), Executive Male (men make decisions, women are
homemakers), Conscience (important for family to conform to social and moral standards),
Equality of the Sexes (minimizing sexual stereotyping), Temporal Farsightedness (interest in
the future and the past), Independence (desire for material possessions and freedom), and
Spousal Employment (each spouse should make decisions about his/her own job). Now, using
weighting coefficients obtained with the analysis, I computed for each subject a score that
estimated how much of each of the seven dimensions e had. These component scores were
then used as dependent variables in 3 x 2 x 2, Culture x Sex x Age (under 20 vs. over 20)
ANOVAs. US students (especially the women) stood out as being sexually egalitarian, wanting
independence, and, among the younger students, placing little importance on family solidarity.
The Taiwanese students were distinguished by scoring very high on the temporal
farsightedness component but low on the conscience component. Among Taiwanese students
the men were more sexually egalitarian than the women and the women more concerned with
independence than were the men. The Mexican students were like the Taiwanese in being
concerned with family solidarity but not with sexual egalitarianism and independence, but like
the US students in attaching more importance to conscience and less to temporal
farsightedness. Among the Mexican students the men attached more importance to
independence than did the women.
Factor analysis also plays a prominent role in test construction. For example, I factor
analyzed subjects' scores on the 21 items in Patel's SBS discussed earlier. Although the
instrument was designed to measure a single dimension, my analysis indicated that three
dimensions were being measured. The first factor, on which 13 of the items loaded well,
seemed to reflect avoidance behaviors (such as moving away from a gay, staring to
communicate disapproval of proximity, and warning gays to keep away). The second factor (six
items) reflected aggression from a distance (writing anti-gay graffiti, damaging a gay's property,
making harassing phone calls). The third factor (two items) reflected up-close aggression
(physical fighting). Despite this evidence of three factors, item analysis indicated that the
instrument performed well as a measure of a single dimension. Item-total correlations were
good for all but two items. Cronbach's alpha was .91, a value which could not be increased by
deleting from the scale any of the items. Cronbach's alpha is considered a measure of the
reliability or internal consistency of an instrument. It can be thought of as the mean of all
possible split-half reliability coefficients (correlations between scores on half of the items vs.
the other half of the items, with the items randomly split into halves) with the Spearman-Brown
correction (a correction for the reduction in the correlation due to having only half as many
items contributing to each score used in the split-half reliability correlation coefficient; reliability
tends to be higher with more items, ceteris paribus). Please read the document Cronbach's
Alpha and Maximized Lambda4. Follow the instructions there to conduct an item analysis with
SAS and with SPSS. Bring your output to class for discussion.
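If you want to see the arithmetic behind alpha laid bare, here is a minimal Python sketch (simulated items, not the SBS data; the lesson's own item analyses are done in SAS and SPSS) using the variance form of the coefficient: alpha = k/(k-1) × (1 - sum of item variances / variance of the total score).

```python
# Sketch: Cronbach's alpha from simulated item scores.
import numpy as np

rng = np.random.default_rng(7)
n, k = 200, 10                                # 200 subjects, 10 items
true_score = rng.normal(size=n)
items = true_score[:, None] + rng.normal(size=(n, k))   # item = true + error

item_vars = items.var(axis=0, ddof=1)
total_var = items.sum(axis=1).var(ddof=1)     # variance of the total score
alpha = k / (k - 1) * (1 - item_vars.sum() / total_var)
print(round(alpha, 2))
```

With ten items whose reliable variance equals their error variance, alpha lands near .90, and (ceteris paribus) adding more parallel items would push it higher, as the Spearman-Brown logic in the text implies.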
MULTIPLE REGRESSION
In a standard multiple regression we have one continuous Y variable and two or more
continuous X variables. Actually, the X variables may include dichotomous variables and/or
categorical variables that have been dummy coded into dichotomous variables. The goal is to
construct a linear model that minimizes error in predicting Y. That is, we wish to create a linear
combination of the X variables that is maximally correlated with the Y variable. We obtain
standardized regression coefficients (β weights, predicted Z_Y = β1·Z1 + β2·Z2 + ... + βp·Zp) that
represent how large an effect each X has on Y above and beyond the effect of the other Xs in
the model. We may use some a priori hierarchical structure to build the model (enter first X1,
then X2, then X3, etc., each time seeing how much adding the new X improves the model, or,
start with all Xs, then first delete X1, then delete X2, etc., each time seeing how much deletion
of an X affects the model). We may just use a statistical algorithm (one of several sorts of
stepwise selection) to build what we hope is the best model using some subset of the total
number of X variables available.
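Hierarchical entry can be sketched with contrived data (a minimal Python illustration, not part of the original lesson; the variable roles are hypothetical): fit the model at each step and watch the increment in R-squared as a new predictor is added.

```python
# Sketch: hierarchical entry of predictors, tracking the R-squared increment.
import numpy as np

def r_squared(X, y):
    """R-squared from an OLS fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ b
    return 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()

rng = np.random.default_rng(3)
n = 400
x1 = rng.normal(size=n)                     # e.g., high school grades
x2 = .6 * x1 + .8 * rng.normal(size=n)      # e.g., SATV, correlated with x1
y = x1 + .5 * x2 + rng.normal(size=n)       # criterion, e.g., college GPA

r2_step1 = r_squared(x1[:, None], y)                      # enter x1 first
r2_step2 = r_squared(np.column_stack([x1, x2]), y)        # then add x2
print(round(r2_step1, 3), round(r2_step2 - r2_step1, 3))  # increment for x2
```

Because x2 is correlated with x1, its increment at step 2 is much smaller than its zero-order relationship with y, which is exactly the "above and beyond" logic described above.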
For example, I may wish to predict college GPA from high school grades, SATV, SATQ,
score on a "why I want to go to college" essay, and quantified results of an interview with an
admissions officer. Since some of these measures are less expensive than others, I may wish
to give them priority for entry into the model. I might also give more theoretically important
variables priority. I might also include sex and race as predictors. I can also enter interactions
between variables as predictors, for example, SATM x SEX, which would be literally
represented by an X that equals the subjects SATM score times es sex code (typically 0 vs. 1
or 1 vs. 2). I may fit nonlinear models by entering transformed variables such as LOG(SATM)
or SAT². We shall explore lots of such fun stuff later.
As an example of a multiple regression analysis, consider the research reported by
McCammon, Golden, and Wuensch in the Journal of Research in Science Teaching, 1988, 25,
501-510. Subjects were students in freshman and sophomore level Physics courses (only
those courses that were designed for science majors, no general education "football physics"
courses). The mission was to develop a model to predict performance in the course. The
predictor variables were CT (the Watson-Glaser Critical Thinking Appraisal), PMA (Thurstone's
Primary Mental Abilities Test), ARI (the College Entrance Exam Board's Arithmetic Skills Test),
ALG (the College Entrance Exam Board's Elementary Algebra Skills Test), and ANX (the
Mathematics Anxiety Rating Scale). The criterion variable was subjects' scores on course
examinations. All of the predictor variables were significantly correlated with one another and
with the criterion variable. A simultaneous multiple regression yielded a multiple R of .40
(which is more impressive if you consider that the data were collected across several sections
of different courses with different instructors). Only ALG and CT had significant semipartial
correlations (indicating that they explained variance in the criterion that was not explained by
any of the other predictors). Both forward and backwards selection analyses produced a
model containing only ALG and CT as predictors. At Susan McCammon's insistence, I also
separately analyzed the data from female and male students. Much to my surprise I found a
remarkable sex difference. Among female students every one of the predictors was
significantly related to the criterion; among male students none of the predictors was. There
were only small differences between the sexes on variance in the predictors or the criterion, so
it was not a case of there not being sufficient variability among the men to support covariance
between their grades and their scores on the predictor variables. A posteriori searching of the
literature revealed that Anastasi (Psychological Testing, 1982) had noted a relatively consistent
finding of sex differences in the predictability of academic grades, possibly due to women being
more conforming and more accepting of academic standards (better students), so that women
put maximal effort into their studies, whether or not they like the course, and accordingly they
work up to their potential. Men, on the other hand, may be more fickle, putting forth maximum
effort only if they like the course, thus making it difficult to predict their performance solely from
measures of ability.
STRUCTURAL EQUATION MODELING (SEM)
This is a special form of hierarchical multiple regression analysis in which the researcher
specifies a particular causal model in which each variable affects one or more of the other
variables both directly and through its effects upon intervening variables. The less complex
models use only unidirectional paths (if X1 has an effect on X2, then X2 cannot have an effect
on X1) and include only measured variables. Such an analysis is referred to as a path
analysis. Patel's data, discussed earlier, were originally analyzed (in her thesis) with a path
analysis. The model was that the MMPI scales were noncausally correlated with one another
but had direct causal effects on both IAH and SBS, with IAH having a direct causal effect on
SBS. The path analysis was not well received by
reviewers the first journal to which we sent the
manuscript, so we reanalyzed the data with the
atheoretical canonical correlation/regression analysis
presented earlier and submitted it elsewhere.
Reviewers of that revised manuscript asked that we
supplement the canonical correlation/regression
analysis with a hierarchical multiple regression
analysis (essentially a path analysis).
In a path analysis one obtains path
coefficients, measuring the strength of each path
(each causal or noncausal link between one variable and another) and one assesses how well
the model fits the data. The arrows from e represent error variance (the effect of variables not
included in the model). One can compare two different models and determine which one better
fits the data. Our analysis indicated that the only significant paths were from MF to IAH (.40)
and from MA (.25) and IAH (.4) to SBS.
SEM can include latent variables (factors), constructs that are not directly measured
but rather are inferred from measured variables (indicators). Confirmatory factor analysis
can be considered a special case of SEM. In confirmatory factor analysis the focus is on
testing an a priori model of the factor structure of a group of measured variables. Tabachnick
and Fidell (5th edition) present an example (pages 732-749) in which the tested model
hypothesizes that intelligence in learning disabled children, as estimated by the WISC, can be
represented by two factors (possibly correlated with one another) with a particular simple
structure (relationship between the indicator variables and the factors).
The relationships between latent variables are referred to as the structural part of a
model (as opposed to the measurement part, which is the relationship between latent variables
and measured variables). As an example of SEM including latent variables, consider the
research by Greenwald and Gillmore (Journal of Educational Psychology, 1997, 89, 743-751)
on the validity of student ratings of instruction (check out my review of this research). Their
analysis indicated that when students expect to get better grades in a class they work less on
that class and evaluate the course and the instructor more favorably. The indicators (measured
variables) for the Workload latent variable were questions about how much time the students
spent on the course and how challenging it was. Relative expected grade (comparing the
grade expected in the rated course with that the student usually got in other courses) was a
more important indicator of the Expected Grade latent variable than was absolute expected
grade. The Evaluation latent variable was indicated by questions about challenge, whether or
not the student would take this course with the same instructor if e had it to do all over again,
and assorted items about desirable characteristics of the instructor and course.
Greenwalds research suggests that instructors who have lenient grading policies will
get good evaluations but will not motivate their students to work hard enough to learn as much
as they do with instructors whose less lenient grading policies lead to more work but less
favorable evaluations.
I have avoided becoming very involved with SEM. Only twice have I decided that a path
analysis was an appropriate way to analyze the data from research in which I was involved, and
only once was the path analysis accepted as being appropriate for publication. Part of my
reluctance to embrace SEM stems from my not being comfortable with the notion that showing
good fit between an observed correlation matrix and one's theoretical model really confirms
that model. It is always quite possible that an untested model would fit the data as well or
better than the tested model.
[Figure: path diagram for the Greenwald & Gillmore model, with latent variables Workload,
Expected Grade, and Evaluation, and indicators Relative Expected Grade, Absolute Expected
Grade, Hours Worked per Credit Hour, Intellectual Challenge, Take Same Instructor Again?,
and Characteristics of Instructor & Course.]
I use PROC REG and PROC IML to do path analysis, which requires that I understand
fairly well the math underlying the relatively simple models I have tested with path analysis.
Were I to do more sophisticated analyses (those including latent variables and/or bidirectional
paths), I would need to employ software specially designed to do complex SEM. Information
about such software is available at: http://core.ecu.edu/psyc/wuenschk/StructuralSoftware.htm.
DISCRIMINANT FUNCTION ANALYSIS
This is essentially a multiple regression where the Y variable is a categorical variable.
You wish to develop a set of discriminant functions (weighted combinations of the predictors)
that will enable you to predict into which group (level of the categorical variable) a subject falls,
based on es scores on the X variables (several continuous variables, maybe with some
dichotomous and/or dummy coded variables). The total possible number of discriminant
functions is one less than the number of groups, or the number of predictor variables,
whichever is less. Generally only a few of the functions will do a good job of discriminating
group membership. The second function, orthogonal to the first, uses variance not already
used by the first, the third uses the residuals from the first and second, etc. One may think of
the resulting functions as dimensions on which the groups differ, but one must remember that
the weights are chosen to maximize the discrimination among groups, not to make sense to
you. Standardized discriminant function coefficients (weights) and loadings (correlations
between discriminant functions and predictor variables) may be used to label the functions.
One might also determine how well a function separates each group from all the rest to help
label the function. It is possible to do hierarchical/stepwise analysis and factorial (more than
one grouping variable) analysis.
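The core extraction can be sketched in Python (simulated data, certainly not the IRS model; not part of the original lesson): the discriminant functions are eigenvectors of W⁻¹B, where W and B are the within- and between-group SSCP matrices, and at most min(K - 1, p) of the eigenvalues are nonzero.

```python
# Sketch: Fisher's discriminant functions as eigenvectors of W^-1 B.
import numpy as np

rng = np.random.default_rng(5)
means = np.array([[0., 0, 0], [2, 0, 0], [0, 2, 0]])   # K = 3 groups, p = 3
X = np.vstack([m + rng.normal(size=(50, 3)) for m in means])
g = np.repeat([0, 1, 2], 50)

grand = X.mean(axis=0)
W = np.zeros((3, 3))   # within-group SSCP
B = np.zeros((3, 3))   # between-group SSCP
for k in (0, 1, 2):
    d = X[g == k] - X[g == k].mean(axis=0)
    W += d.T @ d
    B += 50 * np.outer(X[g == k].mean(axis=0) - grand,
                       X[g == k].mean(axis=0) - grand)

eigval = np.sort(np.linalg.eigvals(np.linalg.solve(W, B)).real)[::-1]
print(np.round(eigval, 3))   # only min(K - 1, p) = 2 are meaningfully nonzero
```

With three groups the between-group matrix has rank two, so the third eigenvalue is zero: two functions exhaust the discriminable differences, just as the counting rule in the text says.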
As a rather nasty example, consider what the IRS does with the data they collect from
random audits of taxpayers. From each taxpayer they collect data on a number of predictor
variables (gross income, number of exemptions, amount of deductions, age, occupation, etc.)
and one classification variable, is the taxpayer a cheater (underpaid es taxes) or honest. From
these data they develop a discriminant function model to predict whether or not a return is likely
fraudulent. Next year their computers automatically test every return, and if yours fits the profile
of a cheater, you are called up for a discriminant analysis audit. Of course, the details of the
model are a closely guarded secret, since if a cheater knew the discriminant function e could
prepare his return with the maximum amount of cheating that would result in es (barely) not
being classified as a cheater.
As another example, consider the research done by Poulson, Braithwaite, Brondino, and
Wuensch (1997, Journal of Social Behavior and Personality, 12, 743-758). Subjects watched a
simulated trial in which the defendant was accused of murder and was pleading insanity. There
was so little doubt about his having killed the victim that none of the jurors voted for a verdict of
not guilty. Aside from not guilty, their verdict options were Guilty, NGRI (not guilty by reason of
insanity), and GBMI (guilty but mentally ill). Each mock juror filled out a questionnaire,
answering 21 questions (from which 8 predictor variables were constructed) about es attitudes
about crime control, the insanity defense, the death penalty, the attorneys, and es assessment
of the expert testimony, the defendant's mental status, and the possibility that the defendant
could be rehabilitated. To avoid problems associated with multicollinearity among the 8
predictor variables (they were very highly correlated with one another, and such multicollinearity
can cause problems in a multivariate analysis), the scores on the 8 predictor variables were
subjected to a principal components analysis, with the resulting orthogonal components used
as predictors in a discriminant analysis. The verdict choice (Guilty, NGRI, or GBMI) was the
criterion variable.
Both of the discriminant functions were significant. The first function discriminated
between jurors choosing a guilty verdict and subjects choosing a NGRI verdict. Believing that
the defendant was mentally ill, believing the defense's expert testimony more than the
prosecution's, being receptive to the insanity defense, opposing the death penalty, believing
that the defendant could be rehabilitated, and favoring lenient treatment were associated with
rendering a NGRI verdict. Conversely, the opposite orientation on these factors was associated
with rendering a guilty verdict. The second function separated those who rendered a GBMI
verdict from those choosing Guilty or NGRI. Distrusting the attorneys (especially the
prosecution attorney), thinking rehabilitation likely, opposing lenient treatment, not being
receptive to the insanity defense, and favoring the death penalty were associated with
rendering a GBMI verdict rather than a guilty or NGRI verdict.
MULTIVARIATE ANALYSIS OF VARIANCE, MANOVA
This is essentially a DFA turned around. You have two or more continuous Ys and one
or more categorical Xs. You may also throw in some continuous Xs (covariates, giving you a
MANCOVA, multivariate analysis of covariance). The most common application of MANOVA in
psychology is as a device to guard against inflation of familywise alpha when there are
multiple dependent variables. The logic is the same as that of the protected t-test, where an
omnibus ANOVA on your K-level categorical X must be significant before you make pairwise
comparisons among your K groups means on Y. You do a MANOVA on your multiple Ys. If it
is significant, you may go on and do univariate ANOVAs (one on each Y), if not, you stop. In a
factorial analysis, you may follow-up any effect which is significant in the MANOVA by doing
univariate analyses for each such effect.
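The omnibus test statistic for a one-way MANOVA can be sketched in Python (simulated data, not the jury-trial data; not part of the original lesson): Wilks' lambda is det(W)/det(B + W), with small values meaning the groups differ on the set of Ys jointly.

```python
# Sketch: Wilks' lambda for a one-way MANOVA with two dependent variables.
import numpy as np

rng = np.random.default_rng(11)
n_per = 40
means = np.array([[0., 0], [1, 0], [0, 1]])   # 3 groups, 2 dependent variables
Y = np.vstack([m + rng.normal(size=(n_per, 2)) for m in means])
g = np.repeat([0, 1, 2], n_per)

grand = Y.mean(axis=0)
W = np.zeros((2, 2))   # within-group (error) SSCP
B = np.zeros((2, 2))   # between-group (effect) SSCP
for k in (0, 1, 2):
    d = Y[g == k] - Y[g == k].mean(axis=0)
    W += d.T @ d
    B += n_per * np.outer(Y[g == k].mean(axis=0) - grand,
                          Y[g == k].mean(axis=0) - grand)

wilks = np.linalg.det(W) / np.linalg.det(B + W)
print(round(wilks, 3))   # well below 1 when group membership matters jointly
```

Only if this omnibus statistic is significant would one go on to the univariate ANOVAs, in the protected-test spirit described above.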
As an example, consider the MANOVA I did with data from a simulated jury trial with
Taiwanese subjects (see Wuensch, Chia, Castellow, Chuang, & Cheng, Journal of Cross-
Cultural Psychology, 1993, 24, 414-427). The same experiment had earlier been done with
American subjects. Xs consisted of whether or not the defendant was physically attractive, sex
of the defendant, type of alleged crime (swindle or burglary), culture of the defendant (American
or Chinese), and sex of subject (juror). Ys consisted of length of sentence given the
defendant, rated seriousness of the crime, and ratings on 12 attributes of the defendant. I did
two MANOVAs, one with length of sentence and rated seriousness of the crime as Ys, one with
ratings on the 12 attributes as Ys. On each I first inspected the MANOVA. For each effect
(main effect or interaction) that was significant on the MANOVA, I inspected the univariate
analyses to determine which Ys were significantly associated with that effect. For those that
were significant, I conducted follow-up analyses such as simple interaction analyses and simple
main effects analyses. A brief summary of the results follows: Female subjects gave longer
sentences for the crime of burglary, but only when the defendant was American; attractiveness
was associated with lenient sentencing for American burglars but with stringent sentencing for
American swindlers (perhaps subjects thought that physically attractive swindlers had used their
attractiveness in the commission of the crime and thus were especially deserving of
punishment); female jurors gave more lenient sentences to female defendants than to male
defendants; American defendants were rated more favorably (exciting, happy, intelligent,
sociable, strong) than were Chinese defendants; physically attractive defendants were rated
more favorably (attractive, calm, exciting, happy, intelligent, warm) than were physically
unattractive defendants; and the swindler was rated more favorably (attractive, calm, exciting,
independent, intelligent, sociable, warm) than the burglar.
In MANOVA the Ys are weighted to maximize the correlation between their linear
combination and the Xs. A different linear combination (canonical variate) is formed for each
effect (main effect or interaction; in fact, a different linear combination is formed for each
treatment df; thus, if an independent variable consists of four groups, three df, there are three
different linear combinations constructed to represent that effect, each orthogonal to the
others). Standardized discriminant function coefficients (weights for predicting X from the
Ys) and loadings (for each linear combination of Ys, the correlations between the linear
combination and the Ys themselves) may be used better to define the effects of the factors and
their interactions. One may also do a stepdown analysis where one enters the Ys in an a
priori order of importance (or based solely on statistical criteria, as in stepwise multiple
regression). At each step one evaluates the contribution of the newly added Y, above and
beyond that of the Ys already entered.
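Loadings, as defined above, are simply correlations between a linear combination of the Ys and the individual Ys themselves. Here is a small Python sketch of that computation (illustrative only, not output from any package used in these documents; the data, weights, and function name are made up, and NumPy is assumed to be available):

```python
import numpy as np

def loadings(Y, weights):
    """Correlations between a linear combination of the Ys (a canonical
    variate) and each Y itself -- the 'loadings' described above."""
    variate = Y @ weights
    return np.array([np.corrcoef(variate, Y[:, j])[0, 1]
                     for j in range(Y.shape[1])])

# Hypothetical data: 6 cases on 3 dependent variables, arbitrary weights.
rng = np.random.default_rng(0)
Y = rng.normal(size=(6, 3))
w = np.array([0.7, 0.2, -0.4])
r = loadings(Y, w)   # one loading per Y, each between -1 and +1
```

Note that if the weight vector puts all its weight on one Y, that Y's loading is exactly 1, since a variable correlates perfectly with itself.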
As an example of an analysis which uses more of the multivariate output than was used
with the example two paragraphs above, consider again the research done by Moore,
Wuensch, Hedges, and Castellow (1994, discussed earlier under the topic of log-linear
analysis). Recall that we manipulated the physical attractiveness and social desirability of the
litigants in a civil case involving sexual harassment. In each of the experiments in that study we
had subjects fill out a rating scale, describing the litigant (defendant or plaintiff) whose attributes
we had manipulated. This analysis was essentially a manipulation check, to verify that our
manipulations were effective. The rating scales were nine-point scales, for example,
Awkward Poised
1 2 3 4 5 6 7 8 9
There were 19 attributes measured for each litigant. The data from the 19 variables
were used as dependent variables in a three-way MANOVA (social desirability manipulation,
physical attractiveness manipulation, gender of subject). In the first experiment, in which the
physical attractiveness and social desirability of the defendant were manipulated, the MANOVA
produced significant effects for the social desirability manipulation and the physical
attractiveness manipulation, but no other significant effects. The canonical variate maximizing
the effect of the social desirability manipulation loaded most heavily (r > .45) on the ratings of
sociability (r = .68), intelligence (r = .66), warmth (r = .61), sensitivity (r = .50), and kindness (r =
.49). Univariate analyses indicated that compared to the socially undesirable defendant, the
socially desirable defendant was rated significantly more poised, modest, strong, interesting,
sociable, independent, warm, genuine, kind, exciting, sexually warm, secure, sensitive, calm,
intelligent, sophisticated, and happy. Clearly the social desirability manipulation was effective.
The canonical variate that maximized the effect of the physical attractiveness
manipulation loaded heavily only on the physical attractiveness ratings (r = .95), all the other
loadings being less than .35. The mean physical attractiveness ratings were 7.12 for the
physically attractive defendant and 2.25 for the physically unattractive defendant. Clearly the
physical attractiveness manipulation was effective. Univariate analyses indicated that this
manipulation had significant effects on several of the ratings variables. Compared to the
physically unattractive defendant, the physically attractive defendant was rated significantly
more poised, strong, interesting, sociable, physically attractive, warm, exciting, sexually warm,
secure, sophisticated, and happy.
In the second experiment, in which the physical attractiveness and social desirability of
the plaintiff were manipulated, similar results were obtained. The canonical variate maximizing
the effect of the social desirability manipulation loaded most heavily (r > .45) on the ratings of
intelligence (r = .73), poise (r = .68), sensitivity (r = .63), kindness (r = .62), genuineness (r =
.56), warmth (r = .54), and sociability (r = .53). Univariate analyses indicated that compared to
the socially undesirable plaintiff the socially desirable plaintiff was rated significantly more
favorably on all nineteen of the adjective scale ratings.
The canonical variate that maximized the effect of the physical attractiveness
manipulation loaded heavily only on the physical attractiveness ratings (r = .84), all the other
loadings being less than .40. The mean physical attractiveness ratings were 7.52 for the
physically attractive plaintiff and 3.16 for the physically unattractive plaintiff. Univariate
analyses indicated that this manipulation had significant effects on several of the ratings
variables. Compared to the physically unattractive plaintiff the physically attractive plaintiff was
rated significantly more poised, interesting, sociable, physically attractive, warm, exciting,
sexually warm, secure, sophisticated, and happy.
LOGISTIC REGRESSION
Logistic regression is used to predict a categorical (usually dichotomous) variable from a
set of predictor variables. With a categorical dependent variable, discriminant function analysis
is usually employed if all of the predictors are continuous and nicely distributed; logit analysis is
usually employed if all of the predictors are categorical; and logistic regression is often chosen
if the predictor variables are a mix of continuous and categorical variables and/or if they are not
nicely distributed (logistic regression makes no assumptions about the distributions of the
predictor variables). Logistic regression has been especially popular with medical research in
which the dependent variable is whether or not a patient has a disease.
For a logistic regression, the predicted dependent variable is the estimated probability
that a particular subject will be in one of the categories (for example, the probability that Suzie
Cue has the disease, given her set of scores on the predictor variables).
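That predicted probability comes from applying the logistic function to a weighted sum of the predictor scores. A minimal Python sketch of the prediction equation (the intercept and coefficients below are made up for illustration, not fitted values from any study discussed here):

```python
import math

def predicted_probability(intercept, coefs, x):
    """Logistic model: p = 1 / (1 + e^-(b0 + b1*x1 + ... + bk*xk)).
    Returns the estimated probability of membership in the target category."""
    logit = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical coefficients and scores for one subject:
p = predicted_probability(-1.0, [0.5, -0.3], [2.0, 1.0])   # about .43
```

When the logit (the weighted sum) is zero, the predicted probability is exactly .5; large positive logits push the probability toward 1, large negative logits toward 0.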
As an example of the use of logistic regression in psychological research, consider the
research done by Wuensch and Poteat and published in the Journal of Social Behavior and
Personality, 1998, 13, 139-150. College students (N = 315) were asked to pretend that they
were serving on a university research committee hearing a complaint against animal research
being conducted by a member of the university faculty. Five different research scenarios were
used: Testing cosmetics, basic psychological theory testing, agricultural (meat production)
research, veterinary research, and medical research. Participants were asked to decide
whether or not the research should be halted. An ethical inventory was used to measure
participants' idealism (persons who score high on idealism believe that ethical behavior will
always lead only to good consequences, never to bad consequences, and never to a mixture of
good and bad consequences) and relativism (persons who score high on relativism reject the
notion of universal moral principles, preferring personal and situational analysis of behavior).
Since the dependent variable was dichotomous (whether or not the respondent decided
to halt the research) and the predictors were a mixture of continuous and categorical variables
(idealism score, relativism score, participants' gender, and the scenario given), logistic
regression was employed. The scenario variable was represented by k - 1 dichotomous dummy
variables, each representing the contrast between the medical scenario and one of the other
scenarios. Idealism was negatively associated and relativism positively associated with support
for animal research. Women were much less accepting of animal research than were men.
Support for the theoretical and agricultural research projects was significantly less than that for
the medical research.
In a logistic regression, odds ratios are commonly employed to measure the strength of
the partial relationship between one predictor and the dependent variable (in the context of the
other predictor variables). It may be helpful to consider a simple univariate odds ratio first.
Among the male respondents, 68 approved continuing the research, 47 voted to stop it, yielding
odds of 68 / 47. That is, approval was 1.45 times more likely than nonapproval. Among female
respondents, the odds were 60 / 140. That is, approval was only .43 times as likely as was
nonapproval. Inverting these odds (odds less than one are difficult for some people to
comprehend), among female respondents nonapproval was 2.33 times as likely as approval.
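These odds computations are easy to verify; here is a quick Python sketch (not part of the original analysis, and the function name is mine):

```python
def odds(successes, failures):
    # odds = frequency of one outcome relative to the other outcome
    return successes / failures

# Counts from the animal-research study described above:
men_odds = odds(68, 47)      # approval vs. nonapproval among men, ~1.45
women_odds = odds(60, 140)   # approval vs. nonapproval among women, ~0.43
odds_ratio = men_odds / women_odds   # ~3.38
```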
The ratio of these odds, (68 / 47) / (60 / 140) = 3.38, indicates that approval was 3.38 times
more likely among male respondents than among female respondents.
The squared Euclidian distance between two cases is Σ(Xi - Yi)², the sum across
variables (from i = 1 to v) of the squared difference between the score on variable i for the one
case (Xi) and the score on variable i for the other case (Yi). At the next step SPSS recomputes
all the distances between entities (cases and clusters) and then groups together the two with
the smallest distance. When one or both of the entities is a cluster, SPSS computes the
averaged squared Euclidian distance between members of the one entity and members of the
other entity. This continues until all cases have been grouped into one giant cluster. It is up to
the researcher to decide when to stop this procedure and accept a solution with k clusters. K
can be any number from 1 to the number of cases.
SPSS produces both tables and graphics that help the analyst follow the process and
decide which solution to accept. I obtained 2, 3, and 4 cluster solutions. In the k = 2 solution
the one cluster consisted of all the adjunct faculty (excepting one) and the second cluster
consisted of everybody else. I compared the two clusters (using t tests) and found that,
compared to the regular faculty, the adjuncts had significantly lower salary, experience, course load, rank,
and number of publications.
In the k = 3 solution the group of regular faculty was split into two groups, with one
group consisting of senior faculty (those who have been in the profession long enough to get a
decent salary and lots of publications) and the other group consisting of junior faculty (and a
few older faculty who just never did the things that get one merit pay increases). I used plots
of means to show that the senior faculty had greater salary, experience, rank, and number of
publications than did the junior faculty.
In the k = 4 solution the group of senior faculty was split into two clusters. One cluster
consisted of the acting chair of the department (who had a salary and a number of publications
considerably higher than the others) and the other cluster consisting of the remaining senior
faculty (excepting those few who had been clustered with the junior faculty).
There are other ways of measuring the distance between clusters and other methods of
doing the clustering. For example, one can do divisive hierarchical clustering, in which one
starts out with all cases in one big cluster and then splits off cases into new clusters until every
case is a cluster all by itself.
Aziz and Zickar (2006: A cluster analysis investigation of workaholism as a syndrome,
Journal of Occupational Health Psychology, 11, 52-62) is a good example of the use of cluster
analysis with psychological data. Some have defined workaholism as being high in work
involvement, high in drive to work, and low in work enjoyment. Aziz and Zickar obtained
measures of work involvement, drive to work, and work enjoyment and conducted a cluster
analysis. One of the clusters in the three-cluster solution did look like workaholics: high in
work involvement and drive to work but low in work enjoyment. A second cluster consisted of
positively engaged workers (high on work involvement and work enjoyment) and a third
consisted of unengaged workers (low in involvement, drive, and enjoyment).
There are numerous other multivariate techniques and various modifications of those I
have briefly described here. I have, however, covered those you are most likely to encounter in
psychology. We are now ready to go into each of these in greater detail. The general Gestalt
you obtain from studying these techniques should enable you to learn other multivariate
techniques that you may encounter as you zoom through the hyperspace of multivariate
research.
Hyperlinks
Multivariate Effect Size Estimation supplemental chapter from Kline, R. B.
(2004). Beyond significance testing: Reforming data analysis methods in
behavioral research. Washington, DC: American Psychological Association.
Statistics Lessons
MANOVA, Familywise Error, and the Boogey Man
SAS Lessons
SPSS Lessons
Endnote
A high Scale 5 score indicates that the individual is more like members of the other gender
than are most people. A man with a high Scale 5 score lacks stereotypical masculine interests,
and a woman with a high Scale 5 score has interests that are stereotypically masculine. Low
Scale 5 scores indicate stereotypical masculinity in men and stereotypical femininity in women.
MMPI Scale scores are T-scores; that is, they have been standardized to mean 50, standard
deviation 10. The normative group was residents of Minnesota in the 1930s. The MMPI-2 was
normed on what should be a group more representative of US residents.
Copyright 2010 Karl L. Wuensch - All rights reserved.
Canonical Correlation
In a canonical correlation (multiple multiple correlation) one has two or more X
variables and two or more Y variables. The goal is to describe the relationships
between the two sets of variables. You find the canonical weights (coefficients) a1, a2,
a3, ..., ap to be applied to the p X variables and b1, b2, b3, ..., bm to be applied to the
m Y variables in such a way that the correlation between CVX1 and CVY1 is maximized.
CVX1 = a1X1 + a2X2 + ... + apXp. CVY1 = b1Y1 + b2Y2 + ... + bmYm. CVX1 and
CVY1 are the first canonical variates, and their correlation is the sample canonical
correlation coefficient for the first pair of canonical variates. The residuals are then
analyzed in the same fashion to find a second pair of canonical variates, CVX2 and
CVY2, whose weights are chosen to maximize the correlation between CVX2 and CVY2,
using only the variance remaining after the variance due to the first pair of canonical
variates has been removed from the original variables. This continues until a
"significance" cutoff is reached or the maximum number of pairs (which equals the
smaller of m and p) has been found.
You may think of the pairs of canonical variates as representing superordinate
constructs. For each pair this construct is estimated as a linear combination of the
variables, where the sole criterion for choosing one linear combination over another is
maximizing the correlation between the two canonical variates. The resulting constructs
may not be easily interpretable as representing an underlying dimension of interest.
Since each pair of canonical variates is calculated from the residuals of the pair(s)
extracted earlier, the resulting canonical variates are orthogonal. The underlying
dimensions in which you are interested may, however, be related to one another.
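One standard way to compute the sample canonical correlations is via QR decompositions of the centered data matrices followed by a singular value decomposition. A Python/NumPy sketch of that computation (an illustration of the arithmetic, not the output of any package mentioned in these documents; it assumes NumPy is available):

```python
import numpy as np

def canonical_correlations(X, Y):
    """Sample canonical correlations between two sets of variables,
    computed via QR + SVD on the centered data. Returns min(p, m)
    values, largest first -- one per pair of canonical variates."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(Xc)   # orthonormal basis for the X column space
    Qy, _ = np.linalg.qr(Yc)   # orthonormal basis for the Y column space
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)

# When the Ys are an exact linear transform of the Xs, both canonical
# correlations are 1 (made-up data):
X = np.array([[1., 0.], [0., 1.], [1., 1.], [2., 1.], [0., 3.]])
Y = X @ np.array([[1., 2.], [3., 4.]])
s = canonical_correlations(X, Y)
```

Note that the number of canonical correlations returned equals the smaller of p and m, matching the statement above about the maximum number of pairs.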
To learn about canonical correlation, we shall reproduce the analysis reported by
Patel, Long, McCammon, & Wuensch (Journal of Interpersonal Violence, 1995, 10, 354-
366). We had two sets of data on a group of male college students. The one set
was personality variables from the MMPI. One of these was the PD (psychopathically
deviant) scale, Scale 4, on which high scores are associated with general social
maladjustment, rebelliousness, antisocial behavior, criminal behavior, impulsive acting
out, insensitivity, hostility, and difficulties with interpersonal relationships (family, school,
and authority figures). The second was the MF (masculinity/femininity) scale, Scale 5,
on which low scores are associated with traditional masculinity - being easy-going,
cheerful, practical, coarse, adventurous, lacking insight into own motives, preferring
action to thought, overemphasizing strength and physical prowess, having a narrow
range of interests, and harboring doubts about one's own masculinity and identity. The
third was the MA (hypomania) scale, Scale 9, on which high scores are associated with
overactivity, emotional lability, flight of ideas, being easily bored, having low frustration
tolerance, narcissism, difficulty inhibiting impulses, thrill-seeking, irritability,
restlessness, and aggressiveness. The fourth MMPI variable was Scale K, which is a
validity scale on which high scores indicate that the subject is clinically defensive,
attempting to present himself in a favorable light, and low scores indicate that the subject is
being unusually frank and self-critical.
ClusterAnalysis-SPSS.docx
Cluster Analysis With SPSS
I have never had research data for which cluster analysis was a technique I
thought appropriate for analyzing the data, but just for fun I have played around with
cluster analysis. I created a data file where the cases were faculty in the Department of
Psychology at East Carolina University in the month of November, 2005. The variables
are:
Name -- Although faculty salaries are public information under North Carolina
state law, I thought it best to assign each case a fictitious name.
Salary -- annual salary in dollars, from the university report available in OneStop.
FTE -- full time equivalent work load for the faculty member.
Rank -- where 1 = adjunct, 2 = visiting, 3 = assistant, 4 = associate, 5 = professor
Articles -- number of published scholarly articles, excluding things like comments
in newsletters, abstracts in proceedings, and the like. The primary source for
these data was the faculty member's online vita. When that was not available,
the data in the University's Academic Publications Database was used, after
eliminating duplicate entries.
Experience -- number of years working as a full time faculty member in a
Department of Psychology. If the faculty member did not have employment
information on his or her web page, then other online sources were used; for
example, from the publications database I could estimate the year of first
employment as being the year of first publication.
In the data file but not used in the cluster analysis are also
ArticlesAPD -- number of published articles as listed in the university's Academic
Publications Database. There were a lot of errors in this database, but I tried to
correct them (for example, by adjusting for duplicate entries).
Sex -- I inferred biological sex from physical appearance.
I have saved, annotated, and placed online the statistical output from the
analysis. You may wish to look at it while reading through this document.
Conducting the Analysis
Start by bringing ClusterAnonFaculty.sav into SPSS. Now click Analyze,
Classify, Hierarchical Cluster. Identify Name as the variable by which to label cases
and Salary, FTE, Rank, Articles, and Experience as the variables. Indicate that you
want to cluster cases rather than variables and want to display both statistics and plots.
Click Statistics and indicate that you want to see an Agglomeration schedule with
2, 3, 4, and 5 cluster solutions. Click Continue.
Click Plots and indicate that you want a Dendrogram and a vertical Icicle plot with
2, 3, and 4 cluster solutions. Click Continue.
Click Method and indicate that you want to use the Between-groups linkage
method of clustering, squared Euclidian distances, and variables standardized to z
scores (so each variable contributes equally). Click Continue.
Click Save and indicate that you want to save, for each case, the cluster to which
the case is assigned for 2, 3, and 4 cluster solutions. Click Continue, OK.
SPSS starts by standardizing all of the variables to mean 0, variance 1. This
results in all the variables being on the same scale and being equally weighted.
In the first step SPSS computes for each pair of cases the squared Euclidian
distance between the cases. This is quite simply Σ(Xi - Yi)², the sum across
variables (from i = 1 to v) of the squared difference between the score on variable i
for the one case (Xi) and the score on variable i for the other case (Yi). The two
cases which are separated by the smallest Euclidian distance are identified and then
classified together into the first cluster. At this point there is one cluster with two
cases in it.
Next SPSS re-computes the squared Euclidian distances between each entity
(case or cluster) and each other entity. When one or both of the compared entities
is a cluster, SPSS computes the averaged squared Euclidian distance between
members of the one entity and members of the other entity. The two entities with
the smallest squared Euclidian distance are classified together. SPSS then re-
computes the squared Euclidian distances between each entity and each other
entity and the two with the smallest squared Euclidian distance are classified
together. This continues until all of the cases have been clustered into one big
cluster.
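The agglomeration procedure just described can be sketched in Python. This is an illustration of the logic (average linkage on squared Euclidian distances), not the SPSS implementation; it assumes the variables have already been standardized, and the five cases below are made up:

```python
def sq_euclid(x, y):
    # squared Euclidian distance: sum over variables of squared differences
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def average_linkage(cases, k):
    """Repeatedly merge the two closest entities (cases or clusters) until
    k clusters remain. The distance between two entities is the average
    squared Euclidian distance between their members."""
    clusters = [[i] for i in range(len(cases))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = sum(sq_euclid(cases[i], cases[j])
                        for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # merge the closest pair
        del clusters[b]
    return clusters

# Five made-up cases on two standardized variables:
cases = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
three = average_linkage(cases, 3)   # [[0, 1], [2, 3], [4]]
two = average_linkage(cases, 2)     # [[0, 1, 2, 3], [4]]
```

Running the loop all the way to k = 1 reproduces the "one giant cluster" endpoint described above; stopping earlier gives the 2, 3, or 4 cluster solutions.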
Look at the Agglomeration Schedule. On the first step SPSS clustered case 32
with 33. The squared Euclidian distance between these two cases is 0.000. At
stages 2-4 SPSS creates three more clusters, each containing two cases. At stage
5 SPSS adds case 39 to the cluster that already contains cases 37 and 38. By the
43rd stage all cases have been clustered into one entity.
Look at the Vertical Icicle. For the two cluster solution you can see that one
cluster consists of ten cases (Boris through Willy, followed by a column with no Xs).
These were our adjunct (part-time) faculty (excepting one) and the second cluster
consists of everybody else.
For the three cluster solution you can see the cluster of adjunct faculty and the
others split into two. Deanna through Mickey were our junior faculty and Lawrence
through Rosalyn our senior faculty.
For the four cluster solution you can see that one case (Lawrence) forms a
cluster of his own.
Look at the dendrogram. It displays essentially the same information that is found
in the agglomeration schedule but in graphic form.
Look back at the data sheet. You will find three new variables. CLU2_1 is
cluster membership for the two cluster solution, CLU3_1 for the three cluster
solution, and CLU4_1 for the four cluster solution. Remove the variable labels and
then label the values for CLU2_1
and CLU3_1.
Let us see how the two clusters in the two cluster solution differ from one another
on the variables that were used to cluster them.
The output shows that the adjunct cluster has lower mean salary, FTE, rank,
number of published articles, and years of experience.
Now compare the three clusters from the three cluster solution. Use One-Way
ANOVA and ask for plots of group means.
The plots of means show nicely the differences between the clusters.
Predicting Salary from FTE, Rank, Publications, and Experience
Now, just for fun, let us try a little multiple regression. We want to see how
faculty salaries are related to FTEs, rank, number of published articles, and years of
experience.
Ask for part and partial correlations and for Casewise diagnostics for All cases.
The output shows that each of our predictors has a medium to large positive
zero-order correlation with salary, but only FTE and rank have significant partial
effects. In the Casewise Diagnostics table you are given for each case the
standardized residual (I think that any whose absolute value exceeds 1 is worthy of
inspection by the persons who set faculty salaries), the actual salary, the salary
predicted by the model, and the difference, in $, between actual salary and predicted
salary.
If you split the file by sex and repeat the regression analysis you will see some
interesting differences between the model for women and the model for men. The
partial effect of rank is much greater for women than for men. For men the partial
effect of articles is positive and significant, but for women it is negative. That is,
among our female faculty, the partial effect of publication is to lower one's salary.
Clustering Variables
Cluster analysis can be used to cluster variables instead of cases. In this case
the goal is similar to that in factor analysis: to get groups of variables that are
similar to one another. Again, I have yet to use this technique in my research, but it
does seem interesting.
We shall use the same data earlier used for principal components and factor
analysis, FactBeer.sav. Start out by clicking Analyze, Classify, Hierarchical Cluster.
Scoot into the variables box the same seven variables we used in the components
and factors analysis. Under Cluster select Variables.
Click Statistics and then Continue.
Click Plots and then Continue.
Click Method and then Continue, OK.
I have saved, annotated, and placed online the statistical output from the
analysis. You may wish to look at it while reading through the remainder of this
document.
Look at the proximity matrix. It is simply the intercorrelation matrix. We start out
with each variable being an element of its own. Our first step is to combine the two
elements that are closest, that is, the two variables that are most highly correlated.
As you can see from the proximity matrix, that is color and aroma (r = .909). Now we
have six elements: one cluster and five variables not yet clustered.
In Stage 2, we cluster the two closest of the six remaining elements: size
and alcohol (r = .904). Look at the agglomeration schedule. As you can see, the
first stage involved clustering variables 5 and 6 (color and aroma), and the second
stage involved clustering variables 2 and 3 (size and alcohol).
In Stage 3, variable 7 (taste) is added to the cluster that already contains
variables 5 (color) and 6 (aroma).
In Stage 4, variable 1 (cost) is added to the cluster that already contains
variables 2 (size) and 3 (alcohol). We now have three elements: two clusters, each
with three variables, and one variable not yet clustered.
In Stage 5, the two clusters are combined, but note that they are not very similar,
the similarity coefficient being only .038. At this point we have two elements, the
reputation variable all alone and the six remaining variables clumped into one
cluster.
The remaining plots show pretty much the same as what I have illustrated with
the proximity matrix and agglomeration schedule, but in what might be a more easily
digested format.
I prefer the three cluster solution here. Do notice that reputation is not clustered
until the very last step, as it was negatively correlated with the remaining variables.
Recall that in the components and factor analyses it did load (negatively) on the two
factors (quality and cheap drunk).
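The idea of treating the intercorrelation matrix as a proximity matrix can be sketched in Python (the data below are made up to stand in for the beer ratings, with color and aroma built to covary; the function names are mine):

```python
import math

def pearson_r(x, y):
    # ordinary Pearson correlation between two variables
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def proximity_matrix(variables):
    """Intercorrelations among the variables -- the 'proximity' used when
    clustering variables. Merge the pair with the largest r first."""
    names = list(variables)
    return {(a, b): pearson_r(variables[a], variables[b])
            for a in names for b in names if a < b}

# Made-up scores on three variables for five respondents:
data = {"color": [1, 2, 3, 4, 5],
        "aroma": [2, 3, 3, 5, 5],
        "cost":  [5, 4, 4, 2, 1]}
prox = proximity_matrix(data)
closest_pair = max(prox, key=lambda pair: prox[pair])  # ('aroma', 'color')
```

Just as in the beer example, the first merge takes the two most highly correlated variables; a variable that correlates negatively with everything else (like reputation above) would not be merged until the very end.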
Karl L. Wuensch
East Carolina University
Department of Psychology
Greenville, NC 27858-4353
United Snakes of America
June, 2011
More SPSS Lessons
More Lessons on Statistics
Curvi.docx
Curvilinear Bivariate Regression
You are now familiar with linear bivariate regression analysis. What do you do if
the relationship between X and Y is curvilinear? It may be possible to get a good analysis
with our usual techniques if we first straighten up the relationship with data
transformations.
You may have a theory or model that indicates the nature of the nonlinear effect.
For example, if you had data relating the physical intensity of some stimulus to the
psychologically perceived intensity of the stimulus, Fechner's law would suggest a
logarithmic function (Stevens would suggest a power function). To straighten out this log
function all you would need to do is take the log of the physical intensity scores and then
complete the regression analysis using transformed physical intensity scores to predict
psychological intensity scores. For another example, suppose you have monthly sales
data for each of 25 consecutive months of a new business. You remember having been
taught about exponential growth curves in a business or a biology class, so you do the
regression analysis for predicting the log of monthly sales from the number of months the
firm has been in business.
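For the exponential-growth example just described, taking the log of sales makes the relationship with months exactly linear, so an ordinary regression on the transformed scores recovers the growth parameters. A Python sketch with made-up sales figures:

```python
import math

# Hypothetical monthly sales following exponential growth: sales = 100 * 1.2^month
months = list(range(1, 26))
sales = [100 * 1.2 ** m for m in months]

# month vs. log(sales) is exactly linear, since
# log(100 * 1.2^m) = log(100) + m * log(1.2)
log_sales = [math.log(s) for s in sales]

# ordinary least-squares slope and intercept for log(sales) on month
n = len(months)
mx = sum(months) / n
my = sum(log_sales) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(months, log_sales))
         / sum((x - mx) ** 2 for x in months))
intercept = my - slope * mx
# slope recovers log(1.2); intercept recovers log(100)
```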
In other cases you will have no such model, you simply discover (from the scatter
plot) that the relationship is curvilinear. Here are some suggestions for straightening up
the line, assuming that the relationship is monotonic.
A. If the curve for predicting Y from X is a negatively accelerated curve, a curve of
decreasing returns, one where the positive slope decreases as X increases, try
transforming X with the following: √X, LOG(X), 1/X. Prepare a scatter plot for each of
these and choose the one that best straightens the line (and best assists in meeting the
assumptions of any inferential statistics you are doing).
B. If the curve for predicting Y from X is a positively accelerated curve, one where
the positive slope increases as X increases, try: √Y, LOG(Y), 1/Y.
C. A polynomial model may also be fit: Ŷ = a + b1X + b2X² + b3X³. With a
quadratic, the slope for predicting Y from X changes direction once; with a cubic it
changes direction twice.
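The payoff from adding a polynomial term shows up as an increase in R². A Python sketch (assuming NumPy is available; the U-shaped data are made up to echo the ladybug example discussed below, and the function name is mine):

```python
import numpy as np

def r_squared(y, yhat):
    # proportion of variance in y explained by the predictions yhat
    ss_error = float(((y - yhat) ** 2).sum())
    ss_total = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_error / ss_total

# Made-up U-shaped data: y is an exact quadratic function of x
x = np.arange(10, dtype=float)
y = (x - 5.0) ** 2

linear = np.polyval(np.polyfit(x, y, 1), x)        # straight-line fit
quadratic = np.polyval(np.polyfit(x, y, 2), x)     # adds the squared term

r2_linear = r_squared(y, linear)        # poor fit to a U-shaped relation
r2_quadratic = r_squared(y, quadratic)  # essentially perfect here
```

The sequential logic of polynomial regression is just this comparison repeated: each added power is judged by how much it increases R² beyond the lower-order model.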
Please run the program Curvi.sas from my SAS Programs page. This provides an
example of how to do a polynomial regression with SAS. The data were obtained from
scatterplots in an article by N. H. Copp (Animal Behavior, 31, 424-430). Ladybugs tend
to form large winter aggregations, clinging to one another in large clumps, perhaps to
stay warm. In the laboratory, Copp observed, at various temperatures, how many beetles
(in groups of 100) were free (not aggregated). For each group tested, we have the
temperature at which they were tested and the number of ladybugs that were free. Note
that in the data step I create the powers of the temperature variable (temp2, temp3, and
temp4) as well as the log of the temperature variable (LogTemp) and the log of the free
variable (LogFree).
Please note that a polynomial regression analysis is a sequential analysis. One
first evaluates a linear model. Then one adds a quadratic term and decides whether or
not addition of such a term is justified. Then one adds a cubic term and decides whether
or not such an addition is justified, etc.
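The decision at each step can be made with an F ratio: divide the increase in the model SS produced by the added term by the error MS of the larger model. A tiny Python sketch, using the values reported for the ladybug data later in this handout (the function name is mine):

```python
def f_for_added_term(ss_increase, ms_error):
    """F ratio testing the increase in the model SS produced by adding
    one term (e.g., temperature squared), against the error MS of the
    larger model."""
    return ss_increase / ms_error

# Adding the quadratic term to the ladybug model raised the model SS
# by 310.09, and the error MS was 6.058:
F = f_for_added_term(310.09, 6.058)   # about 51.19
```

With one added term, this F equals the square of the t reported for that term (here, 7.15² ≈ 51).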
Proc Corr is used to evaluate the effects of ranking both variables (Spearman rho)
and the effect of a log transformation on temperature and/or on number of free ladybugs.
The output shows that none of these transformations helps.
Proc Reg is used to test five different models and to prepare scatterplots with the
regression line drawn on the plot. The VAR statement is used to list all of the variables
that will be used in the models that are specified.
The LINEAR model replicates the analysis which Copp reported. Note that there is
a strong (r² = .615) and significant (t = 7.79, p < .001) linear relationship between
temperature and number of free ladybugs. Note my use of the plot subcommand to
produce a plot with number of free ladybugs on the ordinate and temperature (Celsius) on
the abscissa. I asked that the data points be plotted with the asterisk symbol. I also
asked that a second plot, predicted number of free ladybugs (p.) versus temperature,
plotted with the symbol x, be overlaid. This results in a plot where the regression line is
indicated by the xs and the data points by the asterisks. If you look at that plot, you see
that the ladybugs did aggregate more at low temperatures than at high temperatures.
That plot also suggests that the data would be better fit with a quadratic function, one
whose slope increases as temperature either increases or decreases from the point where
the ladybugs are most aggregated.
The QUADRATIC model adds temperature squared to the model. SS1 is used to
obtain Type I (sequential) sums of squares and SCORR1(SEQTESTS) is used to obtain
squared sequential semipartial correlation coefficients and sequential tests of the predictor
variables.
The output (page 4) shows that addition of the quadratic component has increased
the model SS from 853.27 (that for the linear model) to 1163.36, an increase of 310.09.
Dividing this increase in the model SS by the error MS gives 310.09 / 6.058 = 51.19,
the F ratio testing the effect of adding temperature squared, which is shown to be a
significant effect (t = 7.15, p < .001).
Adding temperature squared increased the proportion of explained variance
from .6150 (r² for the linear model) to .8385 (R² for the quadratic model), an increase of
.2235, the squared semipartial correlation coefficient for the quadratic term.
The plot shows that aggregation of the ladybugs is greatest at about 5 to 10
degrees Celsius (the mid to upper 40s Fahrenheit). When it gets warmer than that, the
ladybugs start dispersing, but they also start dispersing when it gets cooler than that.
Perhaps ladybugs are threatened by temperatures below freezing, so the dispersal at the
coldest temperatures represents their attempt to find a warmer place to aggregate.
The CUBIC model adds temperature cubed. The output (page 6) shows that this
component is significant (t = 2.43, p = .02), but that it has not explained much more
variance in aggregation (the R² has increased by only .02281). At this point I might decide
that adding the cubic component is not justified (because it adds so little to the model),
even though it is statistically significant. The second bend in the curve provided by a
cubic model is not apparent in the plot, but there is an apparent flattening of the line at low
temperatures. It would be really interesting to see what would happen if the ladybugs
were tested at temperatures even lower than those employed by Copp.
The QUARTIC model adds temperature raised to the 4th power. The output (page
8) shows that the quartic component is not significant (t = 0.26, p = .80).
The LOG_X model shows that a log function describes the data less well than does
a simple linear function. As shown in the plot, the bend in the curve does not match that
in the data.
Below is an example of how to present results of a polynomial regression. I used
SPSS/PASW to produce the figure.
Forty groups of ladybugs (100 ladybugs per group) were tested at temperatures
ranging from -2 to 34 degrees Celsius. In each group we counted the number of ladybugs
which were free (not aggregated). A polynomial regression analysis was employed to fit
the data with an appropriate model. To be retained in the final model, a component had to
be statistically significant at the .05 level and account for at least 2% of the variance in the
number of free ladybugs. The model adopted was a cubic model, Free Ladybugs =
13.607 + .085 Temperature - .022 Temperature² + .001 Temperature³, F(3, 36) = 74.50,
p < .001, η² = .86, 90% CI [.77, .89]. Table 1 shows the contribution of each component at
the point where it entered the model. It should be noted that a quadratic model fit the data
nearly as well as did the cubic model.
Table 1.
Number of Free Ladybugs Related to Temperature
Component      SS   df     t        p    sr²
Linear        853    1  7.79   < .001   .61
Quadratic     310    1  7.15   < .001   .22
Cubic          32    1  2.43     .020   .02
As shown in Figure 1, the ladybugs were most aggregated at temperatures of 18
degrees or less. As temperatures increased beyond 18 degrees, there was a rapid rise in
the number of free ladybugs.
Current research in our laboratory is directed towards evaluating the response of
ladybugs tested at temperatures lower than those employed in the currently reported
research. It is anticipated that the ladybugs will break free of aggregations as
temperatures fall below freezing, since remaining in such a cold location could kill a
ladybug.
In its most general form, the GLM (General Linear Model) relates a set of p
predictor variables (X1 through Xp) to a set of q criterion variables (Y1 through Yq). We
shall now briefly survey two special cases of the GLM, bivariate correlation/regression
and multiple correlation/regression.
The Univariate Mean: A One Parameter (a) Model
If there is only one Y and no X, then the GLM simplifies to the computation of a
mean. We apply the least squares criterion to reduce the squared deviations
between Y and predicted Y to the smallest value possible for a linear model. The
prediction equation is Ŷ = Ȳ. Error in prediction is estimated by
s = √[ Σ(Y - Ȳ)² / (n - 1) ].
Bivariate Regression: A Two Parameter (a and b) Model
If there is only one X and only one Y, then the GLM simplifies to the simple
bivariate linear correlation/regression with which you are familiar. We apply the least
squares criterion to reduce the squared deviations between Y and predicted Y to the
smallest value possible for a linear model. That is, we find a and b such that for
Ŷ = a + bX, the Σ(Y - Ŷ)² = Σe² is as small as possible,
where e is the "error" term, the deviation of Y from predicted Y. The coefficient "a" is
the Y-intercept, the value of Y when X = 0 (the intercept was the mean of Y in the one
parameter model above), and "b" is the slope, the average amount of change in Y per
unit change in X. Error in prediction is estimated by
s(est Y) = √[ Σ(Y - Ŷ)² / (n - 2) ].
Although the model is linear, that is, specifies a straight line relationship
between X and Y, it may be modified to test nonlinear models. For example, if you
think that the function relating Y to X is quadratic, you employ the model
Y = a + b1X + b2X² + e.
It is often more convenient to work with variables that have all been standardized
to some common mean and some common SD (standard deviation) such as 0, 1 (Z-
scores). If scores are so standardized, the intercept, "a," drops out (becomes zero) and
the standardized slope, the number of standard deviations that predicted Y changes for
each change of one SD in X, is commonly referred to as β. In a bivariate regression, β
is the Pearson r. If r = 1, then each change in X of one SD is associated with a one SD
change in predicted Y.
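As a quick check of these claims (the handout's examples use SAS; this is an illustrative Python sketch with made-up data), the least-squares a and b can be computed directly, and the standardized slope can be verified to equal Pearson r:

```python
import numpy as np

# Small illustrative data set (assumed, not from the handout)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Least-squares estimates: b = cov(X, Y) / var(X), a = mean(Y) - b * mean(X)
b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
a = y.mean() - b * x.mean()

# With both variables standardized, the slope (beta) is just Pearson r
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
beta = np.cov(zx, zy, ddof=1)[0, 1]  # var(zx) = 1, so the covariance is the slope
r = np.corrcoef(x, y)[0, 1]

print(b, a, beta, r)  # beta and r are identical
```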
The multiple regression model is Ŷ = a + b1X1 + b2X2 + ... + bpXp, or,
employing standardized scores, predicted ZY = β1Z1 + β2Z2 + ... + βpZp.
Suppose that we have developed a model for predicting graduate students'
Grade Point Average. We had data from 30 graduate students on the following
variables: GPA (graduate grade point average), GREQ (score on the quantitative
section of the Graduate Record Exam, a commonly used entrance exam for graduate
programs), GREV (score on the verbal section of the GRE), MAT (score on the Miller
Analogies Test, another graduate entrance exam), and AR, the Average Rating that the
student received from 3 professors who interviewed the student prior to making
admission decisions. GPA can exceed 4.0, since this university attaches pluses and
minuses to letter grades.
Later I shall show you how to use SAS to conduct a multiple regression analysis
like this. Right now I simply want to give you an example of how to present the results
of such an analysis. You can expect to receive from me a few assignments in which I
ask you to conduct a multiple regression analysis and then present the results. I
suggest that you use the examples below as your models when preparing such
assignments.
Table 1.
Graduate Grade Point Averages Related to Criteria Used When Making
Admission Decisions (N = 30).
                              Zero-Order r
Variable      AR      MAT     GREV     GREQ      GPA       β     sr²        b
GREQ                                            .611*    .32*    .07    .0040
GREV                                   .468*    .581*     .21    .03    .0015
MAT                           .426*     .267    .604*    .32*    .07    .0209
AR                   .525*    .405*    .508*    .621*     .20    .02    .1442
Intercept = -1.738
Mean        3.57    67.00    575.3    565.3     3.31
SD          0.84     9.25     83.0     48.6     0.60               R² = .64*
*p < .05
Multiple linear regression analysis was used to develop a model for predicting
graduate students' grade point average from their GRE scores (both verbal and
quantitative), MAT scores, and the average rating the student received from a panel of
professors following that student's pre-admission interview with those professors. Basic
descriptive statistics and regression coefficients are shown in Table 1.
It is not totally unreasonable to conduct a multiple regression analysis by hand,
as long as you have only two predictor variables. Consider the analysis we did in PSYC
6430 (with the program CorrRegr.sas) predicting attitudes towards animals (AR) from
idealism and misanthropy. SAS gave us the following univariate and bivariate statistics:
Variable N Mean Std Dev Sum Minimum Maximum
ar 154 2.37969 0.53501 366.47169 1.21429 4.21429
ideal 154 3.65024 0.53278 562.13651 2.30000 5.00000
misanth 154 2.32078 0.67346 357.40000 1.00000 4.00000
Pearson Correlation Coefficients, N = 154
Prob > |r| under H0: Rho=0
ar ideal misanth
ar 1.00000 0.05312 0.22064
0.5129 0.0060
ideal 0.05312 1.00000 -0.13975
0.5129 0.0839
Let us first obtain the beta weights for a standardized regression equation,
predicted ZY = β1Z1 + β2Z2.

β1 = (ry1 - ry2 r12) / (1 - r12²) = (.05312 - .22064(-0.13975)) / (1 - (-0.13975)²) = .0856

β2 = (ry2 - ry1 r12) / (1 - r12²) = (.22064 - .05312(-0.13975)) / (1 - (-0.13975)²) = .2326
Now for the unstandardized equation,
Ŷ = a + b1X1 + b2X2, where bi = βi (sy / si) and a = Ȳ - Σ bi X̄i.

b1 = .0856(.535) / .53278 = .086          b2 = .2326(.535) / .67346 = .185

a = 2.38 - .086(3.65) - .185(2.32) = 1.637
The Multiple R²

R²y.12 = (ry1² + ry2² - 2 ry1 ry2 r12) / (1 - r12²), or, more generally,

R²y.12...p = Σ βi ryi   (for p ≥ 2 predictors)

For our data, R²y.12 = .0856(.05312) + .2326(.22064) = .0559.
Semipartials

sr1 = (ry1 - ry2 r12) / √(1 - r12²)          sr2 = (ry2 - ry1 r12) / √(1 - r12²)

pr1² = sr1² / (1 - ry2²)

More generally, pri² = sri² / (1 - R²y.12...(i)...p) for p ≥ 2, where the R² in the
denominator is for predicting Y from all of the predictors except Xi.
Tests of significance of partials

H0: Pop. β = 0.  The standard error is SEβ = √[ (1 - R²y.12) / ((1 - r12²)(N - p - 1)) ],
and t = β / SEβ, df = N - p - 1 (but easier to get with the next formula).

H0: Pop. sr = 0.  t = sr √[ (N - p - 1) / (1 - R²) ], df = N - p - 1.
t1 = .085 √[ (154 - 2 - 1) / (1 - .0559) ] = 1.07      t2 = .231 √[ 151 / (1 - .0559) ] = 2.91      df = 151
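That test can be sketched in Python (values from the example above):

```python
import math

def t_for_sr(sr, N, p, R2):
    """t testing H0: population semipartial r = 0, with df = N - p - 1."""
    return sr * math.sqrt((N - p - 1) / (1 - R2))

# Values from the attitude/idealism/misanthropy example above
N, p, R2 = 154, 2, 0.0559
t1 = t_for_sr(0.085, N, p, R2)
t2 = t_for_sr(0.231, N, p, R2)
print(round(t1, 2), round(t2, 2))
```

The second t comes out near 2.92 here rather than the hand-computed 2.91, simply because the semipartial was rounded to three decimals before being plugged in.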
A test of H0: Pop. sr = 0 is identical to a test of H0: Pop. β = 0 or a test of H0: Pop. b = 0.
Shrunken (Adjusted) R²

Shrunken R² = 1 - (1 - R²)(N - 1) / (N - p - 1) = 1 - (1 - .0559)(153) / 151 = .043.

Please re-read this document about shrunken R².
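As a one-line sketch in Python:

```python
def shrunken_r2(R2, N, p):
    """Shrunken (adjusted) R^2 = 1 - (1 - R^2)(N - 1)/(N - p - 1)."""
    return 1 - (1 - R2) * (N - 1) / (N - p - 1)

# The example above: R^2 = .0559, N = 154, p = 2 predictors
print(round(shrunken_r2(0.0559, 154, 2), 3))  # 0.043
```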
ANOVA summary table:

Source        SS             df          MS    F    p
Regression    R²·SSY         p
Error         (1 - R²)SSY    N - p - 1
Total         SSY            N - 1

F tests the null hypothesis that the population R² = 0.
For our example, SSY = (N - 1)s² = 153(.53501)² = 43.794.
SSregr = .0559(43.794) = 2.447
SSerror = 43.794 - 2.447 = 41.347
F(2, 151) = 4.47, p = .013
standard error of estimate = √MSE = 0.52
For a hierarchical analysis,
R²y.12...p = ry1² + sr²2(1) + sr²3(12) + ... + sr²p(12...p-1)
Copyright 2009, Karl L. Wuensch - All rights reserved.
Suppress.docx
Redundancy and Suppression in Trivariate Regression Analysis
Redundancy
In the behavioral sciences, one's nonexperimental data often result in the two
predictor variables being redundant with one another with respect to their covariance
with the dependent variable. Look at the ballantine above, where we are predicting
verbal SAT from family income and parental education. Area b represents the
redundancy. For each X, sri and βi will be smaller than ryi, and the sum of the squared
semipartial rs will be less than the multiple R². Because of the redundancy between
the two predictors, the sum of the squared semipartial correlation coefficients (areas a
and c, the unique contributions of each predictor), is less than the squared multiple
correlation coefficient, R²y.12 (area a + b + c).
Classical Suppression

[Ballantine diagram of Y, X1, and X2 omitted.]
Look at the ballantine above. Suppose that Y is score on a statistics
achievement test, X1 is score on a speeded quiz in statistics (slow readers don't have
enough time), and X2 is a test of reading speed. The ry1 = .38, ry2 = 0, r12 = .45.
R²y.12 = .38² / (1 - .45²) = .181, and R = .4255, greater than ry1. Adding X2 to X1
increased R² by .181 - .38² = .0366. That is, sr2² = .0366 and sr2 = .19. The sum of the
two squared semipartial rs, .181 + .0366 = .218, is greater than the R²y.12.
β1 = (.38 - 0(.45)) / (1 - .45²) = .476, greater than ry1.
β2 = (0 - .38(.45)) / (1 - .45²) = -.214.

The correlation between Y and X1 is increased (relative to ry1) by removing from X1
the irrelevant variance due to X2; the variance that is left in X1·2 is more correlated
with Y than is unpartialled X1.
r²y1 = .38² = .144, which is less than sr1² = .144 / (1 - .45²) = .181 = R²y.12.
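All of the quantities used in these suppression examples follow directly from the three correlations; a small helper (a Python sketch, not part of the original handout) reproduces the classical-suppression numbers, and the same function works for the net- and cooperative-suppression examples that follow:

```python
def trivariate(r_y1, r_y2, r_12):
    """Betas, squared semipartials, and R^2 for two standardized predictors."""
    denom = 1 - r_12 ** 2
    beta1 = (r_y1 - r_y2 * r_12) / denom
    beta2 = (r_y2 - r_y1 * r_12) / denom
    sr1_sq = (r_y1 - r_y2 * r_12) ** 2 / denom
    sr2_sq = (r_y2 - r_y1 * r_12) ** 2 / denom
    R2 = beta1 * r_y1 + beta2 * r_y2
    return beta1, beta2, sr1_sq, sr2_sq, R2

# Classical suppression example: r_y1 = .38, r_y2 = 0, r_12 = .45
beta1, beta2, sr1_sq, sr2_sq, R2 = trivariate(0.38, 0.0, 0.45)
print(round(beta1, 3), round(beta2, 3), round(sr2_sq, 4), round(R2, 3))
```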
Velicer (see Smith et al., 1992) wrote that suppression exists when a predictor's
usefulness is greater than its squared zero-order correlation with the criterion variable.
Usefulness is the squared semipartial correlation for the predictor.
Our X1 has r²Y1 = .144 -- all by itself it explains 14.4% of the variance in Y. When
added to a model that already contains X2, X1 increases the R² by .181; that is,
sr1² = .181. X1 is more useful in the multiple regression than all by itself. Likewise,
sr2² = .0366 > 0 = r²y2. That is, X2 is more useful in the multiple regression than all by
itself.
Net Suppression

[Ballantine diagram of Y, X1, and X2 omitted.]
Look at the ballantine above. Suppose Y is the amount of damage done to a
building by a fire, X1 is the severity of the fire, and X2 is the number of fire fighters sent
to extinguish the fire. The ry1 = .65, ry2 = .25, and r12 = .70.
β1 = (.65 - .25(.70)) / (1 - .70²) = .93, greater than ry1.
β2 = (.25 - .65(.70)) / (1 - .70²) = -.40.
Note that β2 has a sign opposite that of ry2. It is always the X which has the
smaller ryi which ends up with a β of opposite sign. Each β falls outside of the range
0 ↔ ryi, which is always true with any sort of suppression.
Again, the sum of the two squared semipartials is greater than is the squared
multiple correlation coefficient: R²y.12 = .505, sr1² = .505 - r²y2 = .4425,
sr2² = .505 - r²y1 = .0825, and .4425 + .0825 = .525 > .505.
Again, each predictor is more useful in the context of the other predictor than all
by itself: sr1² = .4425 > r²Y1 = .4225 and sr2² = .0825 > r²Y2 = .0625.
For our example, number of fire fighters, although slightly positively correlated
with amount of damage, functions in the multiple regression primarily as a suppressor of
variance in X1 that is irrelevant to Y (the shaded area in the Venn diagram). Removing
that irrelevant variance increases the β for X1.
Looking at it another way, treating severity of fire as the covariate, when we
control for severity of fire, the more fire fighters we send, the less the amount of damage
suffered in the fire. That is, for the conditional distributions where severity of fire is held
constant at some set value, sending more fire fighters reduces the amount of damage.
Please note that this is an example of a reversal paradox, where the sign of the
correlation between two variables in aggregated data (ignoring a third variable) is
opposite the sign of that correlation in segregated data (within each level of the third
variable). I suggest that you (re)read the article on this phenomenon by Messick and
van de Geer [Psychological Bulletin, 90, 582-593].
Cooperative Suppression
R² will be maximally enhanced when the two Xs correlate negatively with one
another but positively with Y (or positively with one another and negatively with Y) so
that when each X is partialled from the other its remaining variance is more correlated
with Y: both predictors' β, pr, and sr increase in absolute magnitude (and retain the
same sign as ryi). Each predictor suppresses variance in the other that is irrelevant to Y.
Consider this contrived example: We seek variables that predict how much the
students in an introductory psychology class will learn (Y). Our teachers are all
graduate students. X1 is a measure of the graduate student's level of mastery of
general psychology. X2 is a measure of how strongly the students agree with
statements such as "This instructor presents simple, easy to understand explanations of
the material," "This instructor uses language that I comprehend with little difficulty," etc.
Suppose that ry1 = .30, ry2 = .25, and r12 = -.35.
β1 = (.30 - .25(-.35)) / (1 - .35²) = .442.      β2 = (.25 - .30(-.35)) / (1 - .35²) = .405.

sr1 = (.30 - .25(-.35)) / √(1 - .35²) = .414.    sr2 = (.25 - .30(-.35)) / √(1 - .35²) = .379.

R²y.12 = Σ βi ryi = .30(.442) + .25(.405) = .234. Note that the sum of the squared
semipartials, .414² + .379² = .171 + .144 = .315, exceeds the squared multiple
correlation coefficient, .234.
Again, each predictor is more useful in the context of the other predictor than all
by itself: sr1² = .171 > r²Y1 = .09 and sr2² = .144 > r²Y2 = .062.
Summary
When βi falls outside the range of 0 ↔ ryi, suppression is taking place. This is
Cohen & Cohen's definition of suppression. As noted above, Velicer defined
suppression in terms of a predictor having a squared semipartial correlation coefficient
that is greater than its squared zero-order correlation with the criterion variable.

If one ryi is zero or close to zero, it is classic suppression, and the sign of the β
for the X with a nearly zero ryi will be opposite the sign of r12.

When neither X has ryi close to zero but one has a β opposite in sign from its ryi
and the other a β greater in absolute magnitude but of the same sign as its ryi, net
suppression is taking place.

If both Xs have absolute βi > ryi, but of the same sign as ryi, then cooperative
suppression is taking place.
Although unusual, beta weights can even exceed one when cooperative
suppression is present.
References
Cohen, J., & Cohen, P. (1975). Applied multiple regression/correlation for the
behavioral sciences. New York, NY: Wiley. [This handout has drawn heavily
from Cohen & Cohen.]
Smith, R. L., Ager, J. W., Jr., & Williams, D. L. (1992). Suppressor variables in
multiple regression/correlation. Educational and Psychological Measurement,
52, 17-29. doi:10.1177/001316449205200102
Wuensch, K. L. (2008). Beta weights greater than one!
http://core.ecu.edu/psyc/wuenschk/MV/multReg/Suppress-BetaGT1.doc
Copyright 2012, Karl L. Wuensch - All rights reserved.
Example of Three Predictor Multiple Regression/Correlation Analysis: Checking
Assumptions, Transforming Variables, and Detecting Suppression
The data are from Guber, D.L. (1999). Getting what you pay for: The debate over
equity in public school expenditures. Journal of Statistics Education, 7, 1-8. The
research units are the fifty states in the USA. We shall pretend they represent a
random sample from a population of interest. The criterion variable is mean SAT in the
state. The predictors are Expenditure ($ spent per student), Salary (mean salary of
teachers), and Teacher/Pupil Ratio. If we consider the predictor variables to be fixed
(the regression model), then we do not worry about the shape of the distributions of the
predictor variables. If we consider the predictor variables to be random (the correlation
model) we do. It turns out that each of the predictors has a distinct positive skewness
which can be greatly reduced by a negative reciprocal transformation.
Here are descriptive statistics and the zero-order correlations for the untransformed variables:
Descriptive Statistics

                    N   Minimum   Maximum   Skewness (SE)    Kurtosis (SE)
Expenditure        50      3.66      9.77    1.107 (.337)     1.279 (.662)
Expend_nr          50      -.27      -.10    -.109 (.337)      .009 (.662)
salary             50     25.99     50.05     .757 (.337)      .028 (.662)
Salary_nr          50      -.04      -.02     .090 (.337)     -.620 (.662)
SAT                50    844.00   1107.00     .236 (.337)    -1.309 (.662)
TeachPerPup        50     13.80     24.30    1.334 (.337)     2.583 (.662)
TeachPerPup_nr     50      -.07      -.04     .490 (.337)      .220 (.662)
Correlations (N = 50)

                 Expenditure    salary   TeachPerPup      SAT
Expenditure           1         .870**      -.371**    -.381**
salary              .870**        1          -.001     -.440**
TeachPerPup        -.371**      -.001          1          .081
SAT                -.381**     -.440**        .081          1

**. Correlation is significant at the 0.01 level (2-tailed).
Here is a regression analysis with the untransformed variables. I asked SPSS
for a plot of the standardized residuals versus the standardized predicted scores. I also
asked for a histogram of the residuals.
Model Summary: R = .458, R² = .210, adjusted R² = .158, standard error of the
estimate = 68.65. Predictors: (Constant), TeachPerPup, salary, Expenditure.
Dependent variable: SAT.
ANOVA (dependent variable SAT; predictors: (Constant), TeachPerPup, salary, Expenditure)

Source        Sum of Squares   df   Mean Square     F     Sig.
Regression        57495.745     3     19165.248   4.066   .012
Residual         216811.9      46      4713.303
Total            274307.7      49
If you compare the beta weights with the zero-order correlations, it is obvious that
we have some suppression taking place. The beta for expenditure is positive but the
zero-order correlation between SAT and expenditure was negative. For the other two
predictors the value of beta exceeds the value of their zero-order correlation with SAT.
Here is a histogram of the residuals with a normal curve superimposed:
The residuals appear to be approximately normally distributed. The plot of
standardized residuals versus standardized predicted scores will allow us visually to
check for heterogeneity of variance, nonlinear trends, and normality of the residuals
across values of the predicted variable. I have drawn in the regression line (error = 0). I
see no obvious problems here.
Coefficients (dependent variable SAT)

                       B     Std. Error    Beta        t    Sig.
(Constant)      1069.234        110.925             9.639   .000
Expenditure       16.469         22.050    .300      .747   .459
salary            -8.823          4.697   -.701    -1.878   .067
TeachPerPup        6.330          6.542    .192      .968   .338
Under the homoscedasticity assumption there should be no correlation between
the predicted scores and error variance. The vertical spread of the dots in the plot
above should not vary as we move left to right. I squared the residuals and correlated
them with the predicted values. If the residuals were increasing in variance as the
predicted values increase this correlation would be positive. It is close to zero,
confirming my eyeball conclusion that there is no problem with that fairly common sort
of heteroscedasticity.
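That screening step can be sketched in Python (the handout used SPSS; the data here are simulated and homoscedastic by construction):

```python
import numpy as np

# Simulated regression data (illustrative; homoscedastic by construction)
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 2 + 0.5 * x + rng.normal(size=100)

# Fit the regression and compute residuals
slope, intercept = np.polyfit(x, y, 1)
predicted = intercept + slope * x
residuals = y - predicted

# Correlate the squared residuals with the predicted values; a value near
# zero is consistent with homoscedasticity (no fan-shaped spread)
r_het = np.corrcoef(predicted, residuals ** 2)[0, 1]
print(round(r_het, 3))
```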
The correlation between Predicted_SAT and the squared residuals was .093,
close to zero. Now let us look at the results using the transformed data.
Correlations (N = 50)

                   Expend_nr   Salary_nr   TeachPerPup_nr      SAT
Expend_nr               1        .816**        -.425**      -.398**
Salary_nr            .816**        1             .015       -.467**
TeachPerPup_nr      -.425**      .015              1           .089
SAT                 -.398**     -.467**          .089            1

**. Correlation is significant at the 0.01 level (2-tailed).
The correlation matrix looks much like it did with the untransformed data.
The R² has increased a bit.
No major changes caused by the transformation, which is comforting. Trust me
that the residuals plots still look OK too.
I wonder what high school teachers would think about the negative relationship
between average state salary for teachers and average state SAT score? If we want
better education should we lower teacher salaries? There is an important state
characteristic that we should have but have not included in our model. Check out the
JSE article to learn what that characteristic is.
Now, can we figure out what sort of suppression is going on here?
Model Summary: R = .482, R² = .232, adjusted R² = .182, standard error of the
estimate = 67.65. Predictors: (Constant), TeachPerPup_nr, Salary_nr, Expend_nr.
Dependent variable: SAT.
ANOVA (dependent variable SAT; predictors: (Constant), TeachPerPup_nr, Salary_nr, Expend_nr)

Source        Sum of Squares   df   Mean Square     F     Sig.
Regression        63771.502     3     21257.167   4.644   .006
Residual         210536.2      46      4576.873
Total            274307.7      49
Coefficients (dependent variable SAT)

                         B     Std. Error    Beta        t    Sig.
(Constant)         850.240        130.425             6.519   .000
Expend_nr          367.276        692.442    .181      .530   .598
Salary_nr        -9823.521       4920.257   -.618    -1.997   .052
TeachPerPup_nr    1805.969       2031.564    .176      .889   .379
It looks like the expenditures variable is suppressing irrelevant variance in one or
both or a linear combination of the other two predictors. Put another way, if we hold
constant the effects of teacher salary and number of teachers per pupil, then the
relationship between expenditures and SAT goes from negative to positive. Maybe the
money is best spent on things other than hiring more teachers or better paid teachers?
Let us look at two-predictor models.
No suppression between expenditures and teacher salary.
A little bit of classical suppression here, but not dramatic.
Coefficients (dependent variable SAT)

                     Beta        r
Expend_nr            .181    -.398
Salary_nr           -.618    -.467
TeachPerPup_nr       .176     .089
Coefficients (dependent variable SAT)

                     Beta        r
Expend_nr           -.049    -.398
Salary_nr           -.428    -.467
Coefficients (dependent variable SAT)

                     Beta        r
Expend_nr           -.439    -.398
TeachPerPup_nr      -.097     .089
A little bit of cooperative suppression here, but not dramatic.
Maybe the expenditures variable is suppressing irrelevant variance in a linear
combination of teacher salary and teacher/pupil ratio. I predicted SAT from salary and
teacher/pupil ratio and saved the predicted scores as predicted23. Those predicted
scores are a linear combination of teacher salary and teacher/pupil ratio, with lower
salaries and higher teacher/pupil ratios being associated with higher SAT scores. When
I correlate predicted23 with SAT I get .477, the R for SAT predicted from salary and
teacher/pupil ratio. Watch what happens when I add the expenditures variable to the
predicted23 combination.
As you can see, the expenditures variable suppresses irrelevant variance in the
predicted23 combination of the other two predictor variables. When you hold total
amount of expenditures constant, there is an increase in the predictive value of a linear
combination of teacher salary and teacher/pupil ratio.
Karl L. Wuensch
East Carolina University, Dept. of Psychology
March, 2011
Coefficients (dependent variable SAT)

                     Beta        r
Salary_nr           -.469    -.467
TeachPerPup_nr       .096     .089
Coefficients (dependent variable SAT)

                     Beta        r
predicted23          .586     .477
Expend_nr            .122    -.398
Invert.docx
Inverting Matrices: Determinants and Matrix Multiplication
Determinants
Square matrices have determinants, which are useful in other matrix operations,
especially inversion.
For a second-order square matrix,

    A = | a11  a12 |
        | a21  a22 |

the determinant of A is |A| = a11a22 - a12a21.
Consider the following bivariate raw data matrix:
Subject # 1 2 3 4 5
X 12 18 32 44 49
Y 1 3 2 4 5
from which the following XY variance-covariance matrix is obtained:
X Y
X 256 21.5
Y 21.5 2.5
r = COVXY / (SX SY) = 21.5 / √(256 × 2.5) = .85

|A| = 256(2.5) - 21.5(21.5) = 177.75
Think of the variance-covariance matrix as containing information about the two
variables the more variable X and Y are, the more information you have. The total
amount of information you have is reduced, however, by any redundancy between X
and Y that is, to the extent that you have covariance between X and Y you have less
total information. The determinant of a matrix is sometimes called its generalized
variance, the total amount of information you have about variance in the scores, after
removing the redundancy between the variables look at how we just computed the
determinant the product of the variances (information) less the product of the
covariances (redundancy).
Now think of the information in the X scores as being represented by the width of
a rectangle, and the information in the Y scores represented by the height of the
rectangle. The area of this rectangle is the total amount of information you have.

An identity matrix has ones on its main diagonal and zeros elsewhere; for example,
the 3 × 3 identity matrix is

    1  0  0
    0  1  0
    0  0  1
With scalars, multiplication by the inverse yields the scalar identity, 1: a(1/a) = 1.
Multiplying by an inverse is equivalent to division: a(1/b) = a/b.
The inverse of a 2 × 2 matrix is

    A⁻¹ = (1 / |A|) ×  |  a22  -a12 |
                       | -a21   a11 |

which for our example is

    (1 / 177.75) ×  |   2.5  -21.5 |
                    | -21.5   256  |
Multiplying a scalar by a matrix is easy - simply multiply each matrix element by the
scalar, thus,

    A⁻¹ = |  .014064698  -.120956399 |
          | -.120956399  1.440225035 |
Now to demonstrate that AA⁻¹ = A⁻¹A = I; but multiplying matrices is not so easy.
For a 2 × 2,

    | a  b |   | w  x |     | aw+by  ax+bz |     | row1·col1  row1·col2 |
    | c  d | × | y  z |  =  | cw+dy  cx+dz |  =  | row2·col1  row2·col2 |

    | 256   21.5 |   |  .014064698  -.120956399 |     | 1  0 |
    | 21.5   2.5 | × | -.120956399  1.440225035 |  =  | 0  1 |
Third-Order Determinant and Matrix Multiplication
The determinant of a third-order square matrix,

    A = | a11  a12  a13 |
        | a21  a22  a23 |
        | a31  a32  a33 |

is

|A| = a11a22a33 + a12a23a31 + a13a21a32 - a31a22a13 - a32a23a11 - a33a21a12
Matrix multiplication for a 3 × 3:

    | a  b  c |   | r  s  t |     | ar+bu+cx  as+bv+cy  at+bw+cz |
    | d  e  f | × | u  v  w |  =  | dr+eu+fx  ds+ev+fy  dt+ew+fz |
    | g  h  i |   | x  y  z |     | gr+hu+ix  gs+hv+iy  gt+hw+iz |

That is, the element in row i, column j of the product is row i of the first matrix
times column j of the second matrix.
Isn't this fun? Aren't you glad that SAS will do matrix algebra for you? Copy the
little program below into the SAS editor and submit it.
SAS Program
Proc IML;
reset print;
XY ={
256 21.5,
21.5 2.5};
determinant = det(XY);
inverse = inv(XY);
identity = XY*inverse;
quit;
Look at the program statements. The reset print statement makes SAS display
each matrix as it is created. When defining a matrix, one puts brackets about the data
points and commas at the end of each row of the matrix.
Look at the output. The first matrix is the variance-covariance matrix from this
handout. Next is the determinant of that matrix, followed by the inverted variance-
covariance matrix. The last matrix is, within rounding error, an identity matrix, obtained
by multiplying the variance-covariance matrix by its inverse.
SAS Output
XY 2 rows 2 cols (numeric)
256 21.5
21.5 2.5
DETERMINANT 1 row 1 col (numeric)
177.75
INVERSE 2 rows 2 cols (numeric)
0.0140647 -0.120956
-0.120956 1.440225
IDENTITY 2 rows 2 cols (numeric)
1 -2.22E-16
-2.08E-17 1
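The same computations can be done in Python with numpy (a sketch mirroring the Proc IML program above):

```python
import numpy as np

# The variance-covariance matrix from this handout
XY = np.array([[256.0, 21.5],
               [21.5, 2.5]])

determinant = np.linalg.det(XY)   # 256(2.5) - 21.5(21.5) = 177.75
inverse = np.linalg.inv(XY)
identity = XY @ inverse           # the identity matrix, within rounding error

print(round(determinant, 2))
print(np.round(inverse, 6))
print(np.round(identity, 6))
```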
Copyright 2011, Karl L. Wuensch - All rights reserved.
MultReg-Matrix.docx
Using Matrix Algebra to do Multiple Regression
Before we had computers to assist us, we relied on matrix algebra to solve
multiple regressions. You have some appreciation of how much arithmetic is involved in
matrix algebra, so you can imagine how tedious the solution is. We shall use SAS to do
that arithmetic for us. Consider the research I have done involving the relationship
between a person's attitudes about animals, idealism, and misanthropy. I also have, for
the same respondents, relativism scores and gender. Below is a correlation matrix and,
in the last two rows, a table of simple descriptive statistics for these variables.
Persons who score high on the idealism dimension believe that ethical behavior
will always lead only to good consequences, never to bad consequences, and never to
a mixture of good and bad consequences. Persons who score high on the relativism
dimension reject the notion of universal moral principles, preferring personal and
situational analysis of behavior. Persons who score high on the misanthropy dimension
dislike humans. Gender was coded 1 for female, 2 for male. High scores on the
attitude variable indicate that the respondent supports animal rights and does not
support research on animals. There were 153 respondents.
idealism relativism misanthropy gender attitude
idealism 1.0000 -0.0870 -0.1395 -0.1011 0.0501
relativism -0.0870 1.0000 0.0525 0.0731 0.1581
misanthropy -0.1395 0.0525 1.0000 0.1504 0.2259
gender -0.1011 0.0731 0.1504 1.0000 -0.1158
attitude 0.0501 0.1581 0.2259 -0.1158 1.0000
mean 3.64926 3.35810 2.32157 1.18954 2.37276
standard dev. 0.53439 0.57596 0.67560 0.39323 0.52979
1. The first step is to obtain R_iy, the column vector of correlations between each X_i and Y.

ideal    0.0501
relat    0.1581
misanth  0.2259
gender  -0.1158
2. Next we obtain R_ii, the matrix of correlations among the Xs.
ideal relat misanth gender
ideal 1.0000 -0.0870 -0.1395 -0.1011
relat -0.0870 1.0000 0.0525 0.0731
misanth -0.1395 0.0525 1.0000 0.1504
gender -0.1011 0.0731 0.1504 1.0000
3. Now we invert R_ii. You don't really want to do this by hand, do you?

4. The standardized regression weights (beta weights) are obtained as β = R_ii^-1 R_iy. For our data they are .0837, .1636, .2526, and -.1573.

5. For the raw-score model, the intercept is a = Ybar - Σ b_i·Xbar_i.
6. To obtain the squared multiple correlation coefficient, R² = Σ r_iy·β_i. For our
data, that is 0.0501(.0837) + 0.1581(.1636) + 0.2259(.2526) + (-0.1158)(-.1573) =
.1053.
7. Test the significance of the R². For our data, s_y = 0.52979, n = 153, so
SS_Y = 152(0.52979)² = 42.663.
The regression sum of squares, SS_regr = R²·SS_Y = .1053(42.663) = 4.492.
The error sum of squares, SS_error = SS_Y - SS_regr = 42.663 - 4.492 = 38.171.
Source SS df MS F
Regression 4.492 4 1.123 4.353
Residual 38.171 148 0.258
Total 42.663 152
This is significant at about .002. We could go on to test the significance of
the partials and obtain partial or semipartial correlation coefficients, but frankly, that is
just more arithmetic than I can stand. Let us stop at this point. The main objective of
this handout is to help you appreciate how matrices and matrix algebra are essential
when computing multiple regressions, and I hope that I have already made that point
adequately.
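The chain of computations in this handout can be sketched in Python/NumPy (an illustration, not part of the original SAS workflow; the correlation values are those tabled above):

```python
import numpy as np

# Correlations of each predictor with attitude (R_iy) and among the
# predictors (R_ii), copied from the matrix tabled above.
r_iy = np.array([0.0501, 0.1581, 0.2259, -0.1158])   # ideal, relat, misanth, gender
R_ii = np.array([[ 1.0000, -0.0870, -0.1395, -0.1011],
                 [-0.0870,  1.0000,  0.0525,  0.0731],
                 [-0.1395,  0.0525,  1.0000,  0.1504],
                 [-0.1011,  0.0731,  0.1504,  1.0000]])

beta = np.linalg.solve(R_ii, r_iy)   # standardized weights, R_ii^-1 R_iy
R2 = r_iy @ beta                     # squared multiple correlation, about .1053

n, f = 153, 4
F = (R2 / f) / ((1 - R2) / (n - f - 1))   # about 4.35 on 4 and 148 df
```

Using `solve` rather than explicitly inverting R_ii does the same algebra more stably, which is exactly why we let the computer do this arithmetic.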
reg-diag.doc
Regression Diagnostics
Run the program RegDiag.sas, available at my SAS Programs Page. The data
are within the program. We have data on the following variables:
SpermCount sperm count for one man, gathered during copulation
Together percentage of time the man has spent with his mate recently
LastEjac time since the man's last ejaculation
We are interested in predicting sperm counts from the other two variables.
Proc Univariate is used to screen the three variables. We find an outlier on
LastEjac: a man who apparently went 168 hours without an ejaculation. We
investigate and conclude that this data point is valid, but it does cause the LastEjac
variable to be distinctly skewed. We apply a square root transformation which works
marvelously.
The multiple regression is disappointingly nonsignificant. Inspection of the
residuals, as explained below, does reveal a troublesome case that demands
investigation.
The diagnostic statistics appear on page 10 of the output. For each observation
we are given the actual score on SpermCount, the predicted SpermCount, and the
standard error of prediction. The standard error of prediction could be used to put
confidence intervals about predicted scores.
Detection of Outliers among the Independent Variables
LEVERAGE, h_i or Hat Diagonal, is used to detect outliers among the predictor
variables. It varies from 1/n to 1.0 with a mean of (p + 1)/n. Kleinbaum et al. describe
leverage as a measure of the geometric distance of the i-th predictor point
(X_i1, X_i2, ..., X_ik) from the center point (Xbar_1, Xbar_2, ..., Xbar_k) of the
predictor space. The SAS manual
cites Belsley, Kuh, and Welsch's (1980) Regression Diagnostics text, suggesting that
one investigate observations with Hat greater than 2p/n, where n is the number of
observations used to fit the model, and p is the number of parameters in the model.
They present an example with 10 observations, two predictors, and the intercept, noting
that a HAT cutoff of 0.60 should be used. Our model has three parameters, and we
have 11 observations, so our cutoff would be 2(3)/11 = .55. Observations # 5 and 7
seem worthy of investigation. Case 5 had a very high time since last ejaculation, and
case 7 had a very low time together. Investigation reveals the data to be valid.
RSTUDENT is the studentized deleted residual: each residual is standardized using
an error variance estimate, s_(i), computed with the i-th observation deleted from the
data. This prevents the i-th observation
from influencing these statistics, resulting in unusual observations being more likely to
stick out like a sore thumb. Kleinbaum et al. refer to this statistic as the jackknife
residual and note that it is distributed exactly as a t on n - k - 2 degrees of freedom, as
opposed to n - k -1 degrees of freedom for the Studentized (nondeleted) residuals. The
SAS manual (SAS/STAT Users Guide, Version 6, fourth edition, chapter on the REG
procedure) suggests that one attend to observations which have absolute values of
RSTUDENT greater than 2 (observations whose score on the dependent variable is a
large distance from the regression surface). Using that criterion, observation # 11
demands investigation: the predicted sperm count is much lower than the actual
sperm count.
Measuring the Extent to Which an Observation Influences the Location of the
Regression Surface
COOK'S D is used to measure INFLUENCE, the extent to which an observation
is affecting the location of the regression surface, a function of both its distance and its
leverage. Cook suggested that one check observations whose D is greater than the
median value of F on p and n - p degrees of freedom. David Howell (Statistical
Methods for Psychology, sixth edition, 2007, page 518) suggests investigating any D >
1.00. By Howell's criterion, observation # 11 has an influence worthy of our attention.
The Cov Ratio measures how much change there is in the determinant of the
covariance matrix of the estimates when one deletes a particular observation. The SAS
manual says Belsley et al. suggest investigating observations with ABS(Cov Ratio - 1) >
3*p/n. The Dffits statistic is very similar to Cook's D. The SAS manual says Belsley et
al. suggest investigating observations with Dffits > 2SQRT(p/n). The SAS manual
suggests a simple cutoff of 2. Dfbetas measure the influence of an observation on a
single parameter (intercept or slope). The SAS manual says Belsley et al. recommend
a general cutoff of > 2 or a size-adjusted cutoff of > 2/SQRT(n). Case number 11 is
suspect here too, with great influence on the slope for time since last ejaculation.
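The leverage computation described above can be sketched in Python/NumPy. The numbers below are contrived for illustration (they are NOT the actual RegDiag.sas data), loosely modeled on the two predictors in this lesson:

```python
import numpy as np

# Design matrix: intercept plus two hypothetical predictors
# (hours since last ejaculation, percent time together)
X = np.column_stack([
    np.ones(11),
    [12., 15., 10., 14., 60., 13., 2., 16., 11., 14., 13.],
    [40., 45., 50., 42., 44., 47., 5., 43., 46., 41., 44.],
])
H = X @ np.linalg.inv(X.T @ X) @ X.T       # the hat matrix
leverage = np.diag(H)                      # h_i for each observation

n, p = X.shape                             # n cases, p parameters (with intercept)
cutoff = 2 * p / n                         # Belsley, Kuh, & Welsch rule: 2p/n
flagged = np.where(leverage > cutoff)[0]   # cases worth investigating
```

Note that the leverages always sum to p, so a few extreme cases (like the contrived case with 60 hours since last ejaculation) necessarily pull leverage away from the rest.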
We investigate case number 11 and discover that the participant had not
followed the instructions for gathering the data. We decide to discard case number 11
and reanalyze the data. Case number 11 was, by the way, contrived by me for this
lesson, but the data for cases 1 through 10 are the actual data used in the research
presented in this article:
Baker, R. R., & Bellis, M. A. (1989). Number of sperm in human ejaculates
varies in accordance with sperm competition theory. Animal Behaviour, 37, 867-869.
Back to Wuensch's Stats Lesson Page
Copyright 2007, Karl L. Wuensch, All Rights Reserved
Stepwise.doc
Stepwise Multiple Regression
Your introductory lesson for multiple regression with SAS involved developing a
model for predicting graduate students' Grade Point Average. We had data from 30
graduate students on the following variables: GPA (graduate grade point average),
GREQ (score on the quantitative section of the Graduate Record Exam, a commonly
used entrance exam for graduate programs), GREV (score on the verbal section of the
GRE), MAT (score on the Miller Analogies Test, another graduate entrance exam), and
AR, the Average Rating that the student received from 3 professors who interviewed
em prior to making admission decisions. GPA can exceed 4.0, since this university
attaches pluses and minuses to letter grades. We used a simultaneous multiple
regression, entering all of the predictors at once. Now we shall learn how to conduct
stepwise regressions, where variables are entered and/or deleted according to
statistical criteria. Please run the program STEPWISE.SAS from my SAS Programs
page.
Forward Selection
In a forward selection analysis we start out with no predictors in the model. Each
of the available predictors is evaluated with respect to how much R² would be increased
by adding it to the model. The one which will most increase R² will be added if it meets
the statistical criterion for entry. With SAS the statistical criterion is the significance
level for the increase in the R² produced by addition of the predictor. If no predictor
meets that criterion, the analysis stops. If a predictor is added, then the second step
involves re-evaluating all of the available predictors which have not yet been entered
into the model. If any satisfy the criterion for entry, the one which most increases R² is
added. This procedure is repeated until there remain no more predictors that are
eligible for entry.
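The selection logic just described can be sketched in Python/NumPy. This is a bare-bones illustration, not SAS's implementation: to stay dependency-free it uses a fixed F-to-enter cutoff rather than SAS's SLENTRY significance level, and the tiny data set is contrived:

```python
import numpy as np

def r_squared(columns, y):
    """R^2 from an ordinary least-squares fit (intercept included)."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    return 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()

def forward_select(predictors, y, f_to_enter=4.0):
    """At each step, enter the candidate that most increases R^2,
    provided its F-to-enter exceeds the cutoff; otherwise stop."""
    entered, r2, n = [], 0.0, len(y)
    while True:
        candidates = [name for name in predictors if name not in entered]
        if not candidates:
            break
        trial_r2 = {name: r_squared([predictors[q] for q in entered + [name]], y)
                    for name in candidates}
        best = max(trial_r2, key=trial_r2.get)
        k = len(entered) + 1                     # predictors in the trial model
        F = (trial_r2[best] - r2) / ((1 - trial_r2[best]) / (n - k - 1))
        if F < f_to_enter:                       # no eligible predictor remains
            break
        entered.append(best)
        r2 = trial_r2[best]
    return entered

# Contrived illustration: y depends on x1 but not on x2, so only x1 enters.
x1 = np.arange(10.0)
x2 = np.array([2., 5., 1., 4., 0., 3., 6., 2., 5., 1.])
y = 3.0 * x1 + np.array([.01, -.01] * 5)
order = forward_select({"x1": x1, "x2": x2}, y)
```

After x1 is entered, x2's F-to-enter is tiny because it is unrelated to what is left of y, so the procedure stops with only x1 in the model.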
Look at the program. The first model (A:) asks for a forward selection analysis.
The SLENTRY= value specifies the significance level for entry into the model. The
defaults are 0.50 for forward selection and 0.15 for fully stepwise selection. I set the
entry level at .05 -- I think that is unreasonably low for a forward selection analysis, but I
wanted to show you a possible consequence of sticking with the .05 criterion.
Look at the output. The Statistics for Entry on page 1 show that all four
predictors meet the criterion for entry. The one which most increases R² is the Average
Rating, so that variable is entered. Now look at the Step 2 Statistics for Entry. The F
values there test the null hypotheses that entering a particular predictor will not change
the R² at all. Notice that all of these F values are less than they were at Step 1,
because each of the predictors is somewhat redundant with the AR variable which is
now in the model. Now look at the Step 3 Statistics for Entry. The F values there are
down again, reflecting additional redundancy with the now-entered GRE_Verbal
predictor. Neither predictor available for entry meets the criterion for entry, so the
analysis stops.
The analysis discussed in this document is appropriate when one wishes to
determine whether the linear relationship between one continuously distributed criterion
variable and one or more continuously distributed predictor variables differs across
levels of a categorical variable. For example, school psychologists often are interested
in whether the predictive validity of a test varies across different groups of children.
Poteat, Wuensch, and Gregg (1988) investigated the relationship between IQ scores
(WISC-R full scale, the predictor variable) and grades in school (the criterion variable) in
independent samples of black and white students who had been referred for special
education evaluation. Within each group (black students and white students) a linear
model for predicting grades from IQ was developed. These two models were then
compared with respect to slopes, intercepts, and scatter about the regression line.
Such an analysis, when done by a school psychologist, is commonly referred to as a
Potthoff (1966) analysis. Poteat et al. found no significant differences between the two
groups they compared, and argued that the predictive validity of the WISC-R does not
differ much between white and black students in the referred population from which the
samples were drawn.
In the simplest case, a Potthoff analysis is essentially a multiple regression
analysis of the following form: Y = a + b1·C + b2·G + b3·CG, where Y is the criterion
variable, C is the continuously distributed predictor variable, G is the dichotomous
grouping variable, and CG is the interaction between C and G. Grouping variables are
commonly dummy coded with K - 1 dichotomous variables (see Chapter 16 of Howell,
2010, for a good introduction to ANOVA and ANCOV as multiple regressions). In the
case where there are only two groups, only one such dummy variable is necessary.
I shall illustrate a Potthoff analysis using data from some of my previous research
on ethical ideology, misanthropy, and attitudes about animals. Clearly this has nothing
to do with differential predictive validity of tests used by school psychologists, but otherwise
the analysis is the same as that which school psychologists call a Potthoff analysis.
First I shall describe the source of the data.
One day as I sat in the living room watching the news on TV there was a story
about some demonstration by animal rights activists. I found myself agreeing with them
to a greater extent than I normally do. While pondering why I found their position more
appealing than usual that evening, I noted that I was also in a rather misanthropic mood
that day. Watching the evening news tends to do that to me; it reminds me of how
selfish, myopic, and ignorant humans are. It occurred to me that there might be an
association between misanthropy and support for animal rights. When evaluating the
ethical status of an action that does some harm to a nonhuman animal, I generally do a
cost/benefit analysis, weighing the benefit to humankind against the cost of harm done
to the nonhuman. When doing such an analysis, if one does not think much of
humankind (is misanthropic), e is unlikely to be able to justify harming nonhumans. To
the extent that one does not like humans, one will not be likely to think that benefits to
humans can justify doing harm to nonhumans.
f is the number of predictors in the full model, r is the number of predictors in the
reduced model. The numerator degrees of freedom is (f - r), the denominator df is
(n - f - 1). The full model MSE is identical to the pooled error variance one would use
for comparing slopes with Howell's t-test (s²_y.x on page 258).

For our data, F(2, 150) = (4.05237 - 2.13252) / [(3 - 1)(.26493)] = 3.623. For the
test of coincidence, F on 4, 745 df, p < .001.
The full model output already shows us that the slopes do not differ significantly,
since p = .1961 for the interaction term.
Testing the null hypothesis of equal intercepts,

F(2, 745) = (1927227.450 - 1926004.152) / [(5 - 3)(124.817)] = 4.900, p = .008.
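This model-comparison F can be wrapped in a small Python helper (an illustrative sketch; the sums of squares and MSE are the flounder values given above):

```python
def partial_F(ss_reg_full, ss_reg_reduced, f, r, mse_full):
    """F for comparing nested models: a full model with f predictors versus a
    reduced model with r predictors, on (f - r) and (n - f - 1) df."""
    return (ss_reg_full - ss_reg_reduced) / ((f - r) * mse_full)

# Test of equal intercepts for the flounder data, using the values above
F_intercepts = partial_F(1927227.450, 1926004.152, f=5, r=3, mse_full=124.817)
# F_intercepts is about 4.90 on 2 and 745 df
```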
Since the slopes do not differ significantly, but the intercepts do, the group
means must differ. When comparing the groups we can either ignore the covariate or
control for it. Look at the ANCOV output. The weights are significantly correlated with
the lengths (p < .001) and the locations differ significantly in lengths, after controlling for
weights (p < .001). The flounder in the sound are significantly longer than those in the
rivers.
Mean Length, Controlling for Weight

Location        Mean Length
Pamlico Sound    347.16 A
Pamlico River    341.98 B
Tar River        338.92 B

Note. Groups with the same letter in their subscripts do not differ
significantly at the .05 level.
Lastly, the ANOVA compares the groups on lengths ignoring weights. The
pattern of results differs when weight is ignored.
Mean Length, Ignoring Weight

Location        Mean Length
Pamlico River    347.29 A
Pamlico Sound    344.73 A
Tar River        296.60 B

Note. Groups with the same letter in their subscripts do not differ
significantly at the .05 level.
A Better Approach When Both Predictors are Continuous
It is usually a bad idea to categorize a continuous variable prior to analysis. For
an introduction to testing interactions between continuous predictor variables, see my
document Continuous Moderator Variables in Multiple Regression Analysis.
References
DeShon, R. P., & Alexander, R. A. (1996). Alternative procedures for testing regression
slope homogeneity when group error variances are unequal. Psychological
Methods, 1, 261-277.
Forsyth, D. R. (1980). A taxonomy of ethical ideologies. Journal of Personality and
Social Psychology, 39, 175-184.
Howell, D. C. (2007). Statistical methods for psychology (6th ed.). Belmont, CA:
Thomson Wadsworth.
Kleinbaum, D. G., & Kupper, L. L. (1978). Applied regression analysis and other
multivariable methods. Boston: Duxbury.
Poteat, G. M., Wuensch, K. L., & Gregg, N. B. (1988). An investigation of differential
prediction with the WISC-R. Journal of School Psychology, 26, 59-68.
Potthoff, R. F. (1966). Statistical aspects of the problem of biases in psychological
tests (Institute of Statistics Mimeo Series No. 479). Chapel Hill: University of North
Carolina, Department of Statistics.
- Learn How to Use SPSS to Make a Dandy Scatterplot Displaying These Results
- Return to Wuensch's Statistics Lessons Page
The url for this document is
http://core.ecu.edu/psyc/wuenschk/MV/MultReg/Potthoff.pdf.
Copyright 2011, Karl L. Wuensch, All Rights Reserved
Logistic-SPSS.docx
Binary Logistic Regression with PASW/SPSS
Logistic regression is used to predict a categorical (usually dichotomous) variable
from a set of predictor variables. With a categorical dependent variable, discriminant
function analysis is usually employed if all of the predictors are continuous and nicely
distributed; logit analysis is usually employed if all of the predictors are categorical; and
logistic regression is often chosen if the predictor variables are a mix of continuous and
categorical variables and/or if they are not nicely distributed (logistic regression makes
no assumptions about the distributions of the predictor variables). Logistic regression
has been especially popular with medical research in which the dependent variable is
whether or not a patient has a disease.
For a logistic regression, the predicted dependent variable is a function of the
probability that a particular subject will be in one of the categories (for example, the
probability that Suzie Cue has the disease, given her set of scores on the predictor
variables).
Description of the Research Used to Generate Our Data
As an example of the use of logistic regression in psychological research,
consider the research done by Wuensch and Poteat and published in the Journal of
Social Behavior and Personality, 1998, 13, 139-150. College students (N = 315) were
asked to pretend that they were serving on a university research committee hearing a
complaint against animal research being conducted by a member of the university
faculty. The complaint included a description of the research in simple but emotional
language. Cats were being subjected to stereotaxic surgery in which a cannula was
implanted into their brains. Chemicals were then introduced into the cats' brains via the
cannula and the cats given various psychological tests. Following completion of testing,
the cats' brains were subjected to histological analysis. The complaint asked that the
researcher's authorization to conduct this research be withdrawn and the cats turned
over to the animal rights group that was filing the complaint. It was suggested that the
research could just as well be done with computer simulations.
In defense of his research, the researcher provided an explanation of how steps
had been taken to assure that no animal felt much pain at any time, an explanation that
computer simulation was not an adequate substitute for animal research, and an
explanation of what the benefits of the research were. Each participant read one of five
different scenarios which described the goals and benefits of the research. They were:
- COSMETIC -- testing the toxicity of chemicals to be used in new lines of hair
care products.
- THEORY -- evaluating two competing theories about the function of a particular
nucleus in the brain.
- MEAT -- testing a synthetic growth hormone said to have the potential of
increasing meat production.
Our model is ln(ODDS) = ln[Y/(1 - Y)] = a + bX, where Y is the predicted probability of
deciding to continue the research, 1 - Y is the predicted probability of the other decision,
and X is our predictor variable, gender. Some statistical programs (such as SAS)
predict the event which is coded with
the smaller of the two numeric codes. By the way, if you have ever wondered what is
"natural" about the natural log, you can find an answer of sorts at
http://www.math.toronto.edu/mathnet/answers/answers_13.html.
Our model will be constructed by an iterative maximum likelihood procedure.
The program will start with arbitrary values of the regression coefficients and will
construct an initial model for predicting the observed data. It will then evaluate errors in
such prediction and change the regression coefficients so as to make the likelihood of the
observed data greater under the new model. This procedure is repeated until the model
converges -- that is, until the differences between the newest model and the previous
model are trivial.
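This iterative procedure can be sketched with Newton-Raphson updates in Python/NumPy (an illustration of the idea, not SPSS's actual implementation; the cell counts are reconstructed from the crosstabulation later in this handout, where 60 of 200 women and 68 of 115 men chose to continue):

```python
import numpy as np

# Gender (0 = female, 1 = male) and decision (1 = continue the research)
x = np.concatenate([np.zeros(200), np.ones(115)])
y = np.concatenate([np.ones(60), np.zeros(140), np.ones(68), np.zeros(47)])
X = np.column_stack([np.ones_like(x), x])      # design matrix: intercept, gender

beta = np.zeros(2)                             # start with arbitrary coefficients
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ beta))        # current predicted probabilities
    step = np.linalg.solve(X.T @ (X * (p * (1 - p))[:, None]),  # Hessian
                           X.T @ (y - p))                       # gradient
    beta = beta + step                         # Newton-Raphson update
    if np.abs(step).max() < 1e-8:              # changes are trivial: converged
        break

a, b = beta   # close to the SPSS estimates, a = -.847 and b = 1.217
```

After a handful of iterations the coefficient changes become trivial and the estimates match the SPSS output shown below.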
Open the data file at http://core.ecu.edu/psyc/wuenschk/SPSS/Logistic.sav.
Click Analyze, Regression, Binary Logistic. Scoot the decision variable into the
Dependent box and the gender variable into the Covariates box. The dialog box should
now look like this:
Click OK.
Look at the statistical output. We see that there are 315 cases used in the
analysis.
The Block 0 output is for a model that includes only the intercept (which SPSS
calls the constant). Given the base rates of the two decision options (187/315 = 59%
decided to stop the research, 41% decided to allow it to continue), and no other
information, the best strategy is to predict, for every case, that the subject will decide to
stop the research. Using that strategy, you would be correct 59% of the time.
Case Processing Summary

Unweighted Cases(a)                        N     Percent
Selected Cases   Included in Analysis     315     100.0
                 Missing Cases              0        .0
                 Total                    315     100.0
Unselected Cases                            0        .0
Total                                     315     100.0

a. If weight is in effect, see classification table for the total number of cases.
Under Variables in the Equation you see that the intercept-only model is
ln(odds) = -.379. If we exponentiate both sides of this expression we find that our
predicted odds [Exp(B)] = .684. That is, the predicted odds of deciding to continue the
research is .684. Since 128 of our subjects decided to continue the research and 187
decided to stop the research, our observed odds are 128/187 = .684.
Now look at the Block 1 output. Here SPSS has added the gender variable as a
predictor. Omnibus Tests of Model Coefficients gives us a Chi-Square of 25.653 on
1 df, significant beyond .001. This is a test of the null hypothesis that adding the gender
variable to the model has not significantly increased our ability to predict the decisions
made by our subjects.
Under Model Summary we see that the -2 Log Likelihood statistic is 399.913.
This statistic measures how poorly the model predicts the decisions -- the smaller
the statistic, the better the model. Although SPSS does not give us this statistic for the
model that had only the intercept, I know it to be 425.666. Adding the gender variable
reduced the -2 Log Likelihood statistic by 425.666 - 399.913 = 25.653, the χ² statistic
we just discussed in the previous paragraph. The Cox & Snell R² can be interpreted
like R² in a multiple regression, but cannot reach a maximum value of 1. The
Nagelkerke R² can reach a maximum of 1.
Classification Table(a,b)

                                   Predicted
                               decision          Percentage
Observed                       stop   continue   Correct
Step 0   decision   stop        187      0        100.0
                    continue    128      0           .0
         Overall Percentage                        59.4

a. Constant is included in the model.
b. The cut value is .500

Variables in the Equation

                     B      S.E.     Wald    df   Sig.   Exp(B)
Step 0   Constant  -.379    .115   10.919     1   .001    .684
Omnibus Tests of Model Coefficients

                  Chi-square   df   Sig.
Step 1   Step       25.653      1   .000
         Block      25.653      1   .000
         Model      25.653      1   .000
The Variables in the Equation output shows us that the regression equation is
ln(ODDS) = -.847 + 1.217·Gender.
We can now use this model to predict the odds that a subject of a given gender
will decide to continue the research. The odds prediction equation is ODDS = e^(a+bX).
If our subject is a woman (gender = 0), then ODDS = e^(-.847+1.217(0)) = e^(-.847) = 0.429.
That is, a woman is only .429 as likely to decide to continue the research as she is to
decide to stop the research. If our subject is a man (gender = 1), then
ODDS = e^(-.847+1.217(1)) = e^(.37) = 1.448. That is, a man is 1.448 times more likely to decide
to continue the research than to decide to stop the research.
We can easily convert odds to probabilities. For our women,
Y = ODDS/(1 + ODDS) = 0.429/1.429 = 0.30. That is, our model predicts that 30% of women will
decide to continue the research. For our men, Y = ODDS/(1 + ODDS) = 1.448/2.448 = 0.59. That is,
our model predicts that 59% of men will decide to continue the research.
The Variables in the Equation output also gives us the Exp(B). This is better
known as the odds ratio predicted by the model. This odds ratio can be computed by
raising the base of the natural log to the b-th power, where b is the slope from our
logistic regression equation. For our model, e^1.217 = 3.376. That tells us that the
model predicts that the odds of deciding to continue the research are 3.376 times higher
for men than they are for women. For the men, the odds are 1.448, and for the women
they are 0.429. The odds ratio is 1.448 / 0.429 = 3.376.
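These conversions are easy to verify in Python (an illustrative check using the coefficients from the SPSS output above):

```python
import math

a, b = -0.847, 1.217                     # coefficients from the logistic regression

odds_women = math.exp(a + b * 0)         # e^-.847, about 0.429
odds_men = math.exp(a + b * 1)           # e^.37, about 1.448
p_women = odds_women / (1 + odds_women)  # about .30
p_men = odds_men / (1 + odds_men)        # about .59
odds_ratio = math.exp(b)                 # Exp(B), about 3.376
```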
The results of our logistic regression can be used to classify subjects with
respect to what decision we think they will make. As noted earlier, our model leads to
the prediction that the probability of deciding to continue the research is 30% for women
and 59% for men. Before we can use this information to classify subjects, we need to
Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1          399.913(a)             .078                   .106

a. Estimation terminated at iteration number 3 because parameter estimates
changed by less than .001.

Variables in the Equation(a)

                     B      S.E.     Wald    df   Sig.   Exp(B)
Step 1   gender    1.217    .245   24.757     1   .000   3.376
         Constant  -.847    .154   30.152     1   .000    .429

a. Variable(s) entered on step 1: gender.
have a decision rule. Our decision rule will take the following form: If the probability of
the event is greater than or equal to some threshold, we shall predict that the event will
take place. By default, SPSS sets this threshold to .5. While that seems reasonable, in
many cases we may want to set it higher or lower than .5. More on this later. Using the
default threshold, SPSS will classify a subject into the Continue the Research category
if the estimated probability is .5 or more, which it is for every male subject. SPSS will
classify a subject into the Stop the Research category if the estimated probability is
less than .5, which it is for every female subject.
The Classification Table shows us that this rule allows us to correctly classify
68 / 128 = 53% of the subjects where the predicted event (deciding to continue the
research) was observed. This is known as the sensitivity of prediction, the P(correct |
event did occur), that is, the percentage of occurrences correctly predicted. We also
see that this rule allows us to correctly classify 140 / 187 = 75% of the subjects where
the predicted event was not observed. This is known as the specificity of prediction,
the P(correct | event did not occur), that is, the percentage of nonoccurrences correctly
predicted. Overall our predictions were correct 208 out of 315 times, for an overall
success rate of 66%. Recall that it was only 59% for the model with intercept only.
We could focus on error rates in classification. A false positive would be
predicting that the event would occur when, in fact, it did not. Our decision rule
predicted a decision to continue the research 115 times. That prediction was wrong 47
times, for a false positive rate of 47 / 115 = 41%. A false negative would be predicting
that the event would not occur when, in fact, it did occur. Our decision rule predicted a
decision not to continue the research 200 times. That prediction was wrong 60 times,
for a false negative rate of 60 / 200 = 30%.
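The classification rates just described follow directly from the counts in the classification table (a quick Python check of the arithmetic):

```python
# Counts from the classification table (cut value .500)
stop_stop, stop_continue = 140, 47   # observed stop: predicted stop / continue
cont_stop, cont_continue = 60, 68    # observed continue: predicted stop / continue
n = stop_stop + stop_continue + cont_stop + cont_continue   # 315

sensitivity = cont_continue / (cont_stop + cont_continue)   # 68/128, P(correct | event occurred)
specificity = stop_stop / (stop_stop + stop_continue)       # 140/187, P(correct | no event)
overall = (stop_stop + cont_continue) / n                   # 208/315
false_pos_rate = stop_continue / (stop_continue + cont_continue)  # 47/115
false_neg_rate = cont_stop / (stop_stop + cont_stop)              # 60/200
```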
It has probably occurred to you that you could have used a simple Pearson Chi-
Square Contingency Table Analysis to answer the question of whether or not there is
a significant relationship between gender and decision about the animal research. Let
us take a quick look at such an analysis. In SPSS click Analyze, Descriptive
Statistics, Crosstabs. Scoot gender into the rows box and decision into the columns
box. The dialog box should look like this:
Classification Table(a)

                                   Predicted
                               decision          Percentage
Observed                       stop   continue   Correct
Step 1   decision   stop        140     47         74.9
                    continue     60     68         53.1
         Overall Percentage                        66.0

a. The cut value is .500
Now click the Statistics box. Check Chi-Square and then click Continue.
Now click the Cells box. Check Observed Counts and Row Percentages and
then click Continue.
Back on the initial page, click OK.
In the Crosstabulation output you will see that 59% of the men and 30% of the
women decided to continue the research, just as predicted by our logistic regression.
You will also notice that the Likelihood Ratio Chi-Square is 25.653 on 1 df, the
same test of significance we got from our logistic regression, and the Pearson Chi-
Square is almost the same (25.685). If you are thinking, "Hey, this logistic regression is
nearly equivalent to a simple Pearson Chi-Square," you are correct, in this simple case.
Remember, however, that we can add additional predictor variables, and those
additional predictors can be either categorical or continuous -- you can't do that with a
simple Pearson Chi-Square.
Multiple Predictors, Both Categorical and Continuous
Now let us conduct an analysis that will better tap the strengths of logistic
regression. Click Analyze, Regression, Binary Logistic. Scoot the decision variable
into the Dependent box and gender, idealism, and relatvsm into the Covariates box.
gender * decision Crosstabulation

                                     decision
                                  stop    continue    Total
gender   Female   Count            140       60        200
                  % within gender  70.0%    30.0%     100.0%
         Male     Count             47       68        115
                  % within gender  40.9%    59.1%     100.0%
Total             Count            187      128        315
                  % within gender  59.4%    40.6%     100.0%
Chi-Square Tests

                       Value      df   Asymp. Sig. (2-sided)
Pearson Chi-Square    25.685(b)    1          .000
Likelihood Ratio      25.653       1          .000
N of Valid Cases         315

a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected
count is 46.73.
Click Options and check Hosmer-Lemeshow goodness of fit and CI for exp(B)
95%.
Continue, OK. Look at the output.
In the Block 1 output, notice that the -2 Log Likelihood statistic has dropped to
346.503, indicating that our expanded model is doing a better job at predicting decisions
than was our one-predictor model. The R² statistics have also increased.
We can test the significance of the difference between any two models, as long
as one model is nested within the other. Our one-predictor model had a
Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1          346.503(a)             .222                   .300

a. Estimation terminated at iteration number 4 because parameter estimates
changed by less than .001.
-2 Log Likelihood statistic of 399.913. Adding the ethical ideology variables (idealism
and relatvsm) produced a decrease of 53.41. This difference is a χ² on 2 df (one df for
each predictor variable).
To determine the p value associated with this χ², just click Transform,
Compute. Enter the letter p in the Target Variable box. In the Numeric Expression
box, type 1-CDF.CHISQ(53.41,2). The dialog box should look like this:
Click OK and then go to the SPSS Data Editor, Data View. You will find a new
column, p, with the value of .00 in every cell. If you go to the Variable View and set the
number of decimal points to 5 for the p variable you will see that the value of p
is .00000. We conclude that adding the ethical ideology variables significantly
improved the model, χ²(2, N = 315) = 53.41, p < .001.
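That p value can also be checked by hand: for 2 df the chi-square survival function has the closed form exp(-x/2), so SPSS's 1-CDF.CHISQ(53.41,2) reduces to a single exponential (a quick Python check):

```python
import math

# Drop in -2 Log Likelihood when the two ideology variables are added
chi2 = 399.913 - 346.503          # 53.41 on 2 df

# For df = 2, P(chi-square > x) = exp(-x/2) exactly, which is the quantity
# SPSS computes as 1-CDF.CHISQ(53.41,2).
p = math.exp(-chi2 / 2)           # about 2.5e-12, displayed by SPSS as .00000
```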
Note that our overall success rate in classification has improved from 66% to
71%.
The Hosmer-Lemeshow test evaluates the null hypothesis that there is a linear
relationship between the predictor variables and the log odds of the criterion variable.
Cases are arranged in order by their predicted probability on the criterion variable.
These ordered cases are then divided into ten groups (lowest decile [prob < .1] to
highest decile [prob > .9]). Each of these groups is then divided into two groups on the
basis of actual score on the criterion variable. This results in a 2 x 10 contingency table.
Expected frequencies are computed based on the assumption that there is a linear
relationship between the weighted combination of the predictor variables and the log
odds of the criterion variable. For the outcome = no (decision = stop for our data)
column, the expected frequencies will run from high (for the lowest decile) to low (for the
highest decile). For the outcome = yes column the frequencies will run from low to high.
Classification Table(a)

                                         Predicted
                                   decision            Percentage
          Observed               stop    continue        Correct
Step 1    decision   stop         151       36            80.7
                     continue      55       73            57.0
          Overall Percentage                              71.1
a. The cut value is .500
A chi-square statistic is computed comparing the observed frequencies with those
expected under the linear model. A nonsignificant chi-square indicates that the data fit
the model well.
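The statistic can be reproduced by hand from the observed and expected frequencies in the Contingency Table for Hosmer and Lemeshow Test that SPSS prints. A quick Python check, with the values copied from that output:

```python
# Observed/expected frequencies for decision = "stop" and decision = "continue"
# in each of the ten deciles of the Hosmer-Lemeshow contingency table.
cells = [
    (29, 29.331, 3, 2.669),
    (30, 27.673, 2, 4.327),
    (28, 25.669, 4, 6.331),
    (20, 23.265, 12, 8.735),
    (22, 20.693, 10, 11.307),
    (15, 18.058, 17, 13.942),
    (15, 15.830, 17, 16.170),
    (10, 12.920, 22, 19.080),
    (12, 9.319, 20, 22.681),
    (6, 4.241, 21, 22.759),
]
# Pearson-style chi-square summed over all 20 cells of the 10 x 2 table
hl = sum((o1 - e1) ** 2 / e1 + (o2 - e2) ** 2 / e2 for o1, e1, o2, e2 in cells)
print(round(hl, 2))  # 8.81, matching SPSS's 8.810 within rounding; df = 10 - 2 = 8
```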
Using a K > 2 Categorical Predictor
We can use a categorical predictor that has more than two levels. For our data,
the stated purpose of the research is such a predictor. While SPSS can dummy code
such a predictor for you, I prefer to set up my own dummy variables. You will need K-1
dummy variables to represent K groups. Since we have five levels of purpose of the
research, we shall need 4 dummy variables. Each of the subjects will have a score of
either 0 or 1 on each of the dummy variables. For each dummy variable a score of 0
will indicate that the subject does not belong to the group represented by that dummy
variable and a score of 1 will indicate that the subject does belong to the group
represented by that dummy variable. One of the groups will not be represented by a
dummy variable. If it is reasonable to consider one of your groups as a reference
group to which each other group should be compared, make that group the one
which is not represented by a dummy variable.
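The coding scheme described above can be sketched in a few lines of Python; the variable names match the handout's dummies, but the function itself is purely illustrative:

```python
# K-1 dummy coding for the five-level scenario variable,
# with "medical" as the reference group.
LEVELS = ["cosmetic", "theory", "meat", "veterin"]  # medical gets no dummy

def dummy_code(scenario):
    """Return the four dummy-variable scores for one subject's scenario."""
    return {level: int(scenario == level) for level in LEVELS}

print(dummy_code("cosmetic"))  # {'cosmetic': 1, 'theory': 0, 'meat': 0, 'veterin': 0}
print(dummy_code("medical"))   # all four dummies are 0: the reference group
```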
I decided that I wanted to compare each of the cosmetic, theory, meat, and
veterinary groups with the medical group, so I set up a dummy variable for each of the
groups except the medical group. Take a look at our data in the data editor. Notice
that the first subject has a score of 1 for the cosmetic dummy variable and 0 for the
other three dummy variables. That subject was told that the purpose of the research
was to test the safety of a new ingredient in hair care products. Now scoot to the
bottom of the data file. The last subject has a score of 0 for each of the four dummy
variables. That subject was told that the purpose of the research was to evaluate a
treatment for a debilitating disease that afflicts humans of college age.

Hosmer and Lemeshow Test

Step   Chi-square   df   Sig.
1        8.810       8   .359

Contingency Table for Hosmer and Lemeshow Test

             decision = stop        decision = continue
            Observed  Expected     Observed  Expected     Total
Step 1  1      29      29.331          3       2.669        32
        2      30      27.673          2       4.327        32
        3      28      25.669          4       6.331        32
        4      20      23.265         12       8.735        32
        5      22      20.693         10      11.307        32
        6      15      18.058         17      13.942        32
        7      15      15.830         17      16.170        32
        8      10      12.920         22      19.080        32
        9      12       9.319         20      22.681        32
        10      6       4.241         21      22.759        27
Click Analyze, Regression, Binary Logistic and add to the list of covariates the
four dummy variables. You should now have the decision variable in the Dependent
box and all of the other variables (but not the p value column) in the Covariates box.
Click OK.
The Block 0 Variables not in the Equation output shows how much the -2LL would drop
if a single predictor were added to the model (which already has the intercept).
Look at the output, Block 1. Under Omnibus Tests of Model Coefficients we
see that our latest model is significantly better than a model with only the intercept.
Under Model Summary we see that our R² statistics have increased again and
the -2 Log Likelihood statistic has dropped from 346.503 to 338.060. Is this drop
statistically significant? The χ² is the difference between the two -2 log likelihood
values, 8.443, on 4 df (one df for each dummy variable). Using 1-CDF.CHISQ(8.443,4),
we obtain an upper-tailed p of .0766, short of the usual standard of statistical
significance. I shall, however, retain these dummy variables, since I have an a priori
interest in the comparison made by each dummy variable.
Variables not in the Equation

                               Score   df   Sig.
Step 0  Variables  gender     25.685    1   .000
                   idealism   47.679    1   .000
                   relatvsm    7.239    1   .007
                   cosmetic     .003    1   .955
                   theory      2.933    1   .087
                   meat         .556    1   .456
                   veterin      .013    1   .909
        Overall Statistics    77.665    7   .000

Omnibus Tests of Model Coefficients

                Chi-square   df   Sig.
Step 1  Step      87.506      7   .000
        Block     87.506      7   .000
        Model     87.506      7   .000

Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1          338.060(a)             .243                  .327
a. Estimation terminated at iteration number 5 because parameter estimates changed by less than .001.
In the Classification Table, we see a small increase in our overall success rate,
from 71% to 72%.
I would like you to compute the values for Sensitivity, Specificity, False Positive
Rate, and False Negative Rate for this model, using the default .5 cutoff.
- Sensitivity -- percentage of occurrences correctly predicted
- Specificity -- percentage of nonoccurrences correctly predicted
- False Positive Rate -- percentage of predicted occurrences which are incorrect
- False Negative Rate -- percentage of predicted nonoccurrences which are incorrect
Remember that the predicted event was a decision to continue the research.
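As a worked illustration of those definitions, the four statistics can be computed in Python from the counts in the model's classification table at the .5 cutoff (152 and 35 observed stops, 54 and 74 observed continues):

```python
# Classification counts from the seven-predictor model (cutoff = .5):
# rows = observed decision, columns = predicted decision.
stop_stop, stop_cont = 152, 35  # observed "stop" predicted stop / continue
cont_stop, cont_cont = 54, 74   # observed "continue" predicted stop / continue

sensitivity = 100 * cont_cont / (cont_cont + cont_stop)  # occurrences correctly predicted
specificity = 100 * stop_stop / (stop_stop + stop_cont)  # nonoccurrences correctly predicted
false_pos = 100 * stop_cont / (stop_cont + cont_cont)    # predicted occurrences that are wrong
false_neg = 100 * cont_stop / (cont_stop + stop_stop)    # predicted nonoccurrences that are wrong

print(round(sensitivity, 1), round(specificity, 1),
      round(false_pos, 1), round(false_neg, 1))  # 57.8 81.3 32.1 26.2
```

Note that sensitivity and specificity match the 57.8% and 81.3% row percentages SPSS prints for that table.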
Under Variables in the Equation we are given regression coefficients and odds
ratios.

Classification Table(a)

                                         Predicted
                                   decision            Percentage
          Observed               stop    continue        Correct
Step 1    decision   stop         152       35            81.3
                     continue      54       74            57.8
          Overall Percentage                              71.7
a. The cut value is .500

Variables in the Equation

                                                        95.0% C.I. for EXP(B)
                       B      Wald    df   Sig.   Exp(B)    Lower    Upper
Step 1(a)  gender     1.255  20.586    1   .000    3.508    2.040    6.033
           idealism   -.701  37.891    1   .000     .496     .397     .620
           relatvsm    .326   6.634    1   .010    1.386    1.081    1.777
           cosmetic   -.709   2.850    1   .091     .492     .216    1.121
           theory    -1.160   7.346    1   .007     .314     .136     .725
           meat       -.866   4.164    1   .041     .421     .183     .966
           veterin    -.542   1.751    1   .186     .581     .260    1.298
           Constant   2.279   4.867    1   .027    9.766
a. Variable(s) entered on step 1: gender, idealism, relatvsm, cosmetic, theory, meat, veterin.

We are also given a statistic I have ignored so far, the Wald Chi-Square statistic,
which tests the unique contribution of each predictor in the context of the other
predictors -- that is, holding constant the other predictors -- that is, eliminating any
overlap between predictors. Notice that each predictor meets the conventional .05
standard for statistical significance, except for the dummy variables for cosmetic
research and for veterinary research. I should note that the Wald χ² has been criticized
for being too conservative, that is, lacking adequate power. An alternative would be to
test the significance of each predictor by eliminating it from the full model and testing
the significance of the increase in the -2 log likelihood statistic for the reduced model.
That would, of course, require that you construct p+1 models, where p is the number of
predictor variables.
Let us now interpret the odds ratios.
- The .496 odds ratio for idealism indicates that the odds of approval are more than
cut in half for each one point increase in respondents' idealism scores. Inverting this
odds ratio for easier interpretation, for each one point increase on the idealism scale
there was a doubling of the odds that the respondent would not approve the
research.
- Relativism's effect is smaller, and in the opposite direction, with a one point increase
on the nine-point relativism scale being associated with the odds of approving the
research increasing by a multiplicative factor of 1.39.
- The odds ratios of the scenario dummy variables compare each scenario except
medical to the medical scenario. For the theory dummy variable, the .314 odds ratio
means that the odds of approval of theory-testing research are only .314 times those
of medical research.
- Inverted odds ratios for the dummy variables coding the effect of the scenario
variable indicated that the odds of approval for the medical scenario were 2.38 times
higher than for the meat scenario and 3.22 times higher than for the theory scenario.
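The inversions described above are simple reciprocals of the Exp(B) values. A quick Python check (note the small discrepancy for the theory scenario, where the handout's 3.22 comes from unrounded coefficients):

```python
# Exp(B) values from the Variables in the Equation table
or_idealism = 0.496
print(round(1 / or_idealism, 2))  # 2.02: odds of NOT approving roughly double per point

# For the scenario dummies, inverting compares medical to the coded group.
or_meat = 0.421
print(round(1 / or_meat, 2))      # 2.38: medical approval odds vs. meat
or_theory = 0.314
print(round(1 / or_theory, 2))    # 3.18 from the rounded Exp(B) (3.22 from unrounded values)
```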
Let us now revisit the issue of the decision rule used to determine into which
group to classify a subject given that subject's estimated probability of group
membership. While the most obvious decision rule would be to classify the subject into
the target group if p > .5 and into the other group if p < .5, you may well want to choose
a different decision rule given the relative seriousness of making one sort of error (for
example, declaring a patient to have breast cancer when she does not) or the other sort
of error (declaring the patient not to have breast cancer when she does).
Repeat our analysis with classification done with a different decision rule. Click
Analyze, Regression, Binary Logistic, Options. In the resulting dialog window,
change the Classification Cutoff from .5 to .4.
Click Continue, OK.
Now SPSS will classify a subject into the "Continue the Research" group if the
estimated probability of membership in that group is .4 or higher, and into the "Stop the
Research" group otherwise. Take a look at the classification output and see how the
change in cutoff has changed the classification results. Fill in the table below to
compare the two models with respect to classification statistics.
Value                  Cutoff = .5   Cutoff = .4
Sensitivity
Specificity
False Positive Rate
False Negative Rate
Overall % Correct
SAS makes it much easier to see the effects of the decision rule on sensitivity
etc. Using the ctable option, one gets output like this:
------------------------------------------------------------------------------------
The LOGISTIC Procedure
Classification Table
Correct Incorrect Percentages
Prob Non- Non- Sensi- Speci- False False
Level Event Event Event Event Correct tivity ficity POS NEG
0.160 123 56 131 5 56.8 96.1 29.9 51.6 8.2
0.180 122 65 122 6 59.4 95.3 34.8 50.0 8.5
0.200 120 72 115 8 61.0 93.8 38.5 48.9 10.0
0.220 116 84 103 12 63.5 90.6 44.9 47.0 12.5
0.240 113 93 94 15 65.4 88.3 49.7 45.4 13.9
0.260 110 100 87 18 66.7 85.9 53.5 44.2 15.3
0.280 108 106 81 20 67.9 84.4 56.7 42.9 15.9
0.300 105 108 79 23 67.6 82.0 57.8 42.9 17.6
0.320 103 115 72 25 69.2 80.5 61.5 41.1 17.9
0.340 100 118 69 28 69.2 78.1 63.1 40.8 19.2
0.360 97 120 67 31 68.9 75.8 64.2 40.9 20.5
0.380 96 124 63 32 69.8 75.0 66.3 39.6 20.5
0.400 94 130 57 34 71.1 73.4 69.5 37.7 20.7
0.420 88 134 53 40 70.5 68.8 71.7 37.6 23.0
0.440 86 140 47 42 71.7 67.2 74.9 35.3 23.1
0.460 79 141 46 49 69.8 61.7 75.4 36.8 25.8
0.480 75 144 43 53 69.5 58.6 77.0 36.4 26.9
0.500 71 147 40 57 69.2 55.5 78.6 36.0 27.9
0.520 69 152 35 59 70.2 53.9 81.3 33.7 28.0
0.540 67 157 30 61 71.1 52.3 84.0 30.9 28.0
0.560 65 159 28 63 71.1 50.8 85.0 30.1 28.4
0.580 61 159 28 67 69.8 47.7 85.0 31.5 29.6
0.600 56 162 25 72 69.2 43.8 86.6 30.9 30.8
0.620 50 165 22 78 68.3 39.1 88.2 30.6 32.1
0.640 48 166 21 80 67.9 37.5 88.8 30.4 32.5
0.660 43 170 17 85 67.6 33.6 90.9 28.3 33.3
0.680 40 170 17 88 66.7 31.3 90.9 29.8 34.1
0.700 36 173 14 92 66.3 28.1 92.5 28.0 34.7
0.720 30 177 10 98 65.7 23.4 94.7 25.0 35.6
0.740 28 178 9 100 65.4 21.9 95.2 24.3 36.0
0.760 23 180 7 105 64.4 18.0 96.3 23.3 36.8
0.780 22 180 7 106 64.1 17.2 96.3 24.1 37.1
0.800 18 181 6 110 63.2 14.1 96.8 25.0 37.8
0.820 17 182 5 111 63.2 13.3 97.3 22.7 37.9
0.840 13 184 3 115 62.5 10.2 98.4 18.8 38.5
0.860 12 185 2 116 62.5 9.4 98.9 14.3 38.5
0.880 8 185 2 120 61.3 6.3 98.9 20.0 39.3
0.900 7 185 2 121 61.0 5.5 98.9 22.2 39.5
0.920 5 187 0 123 61.0 3.9 100.0 0.0 39.7
0.940 1 187 0 127 59.7 0.8 100.0 0.0 40.4
0.960 1 187 0 127 59.7 0.8 100.0 0.0 40.4
0.980 0 187 0 128 59.4 0.0 100.0 . 40.6
------------------------------------------------------------------------------------
The classification results given by SAS are a little less impressive because SAS
uses a jackknifed classification procedure. Classification results are biased when the
coefficients used to classify a subject were developed, in part, with data provided by
that same subject. SPSS' classification results do not remove such bias. With
jackknifed classification, SAS eliminates the subject currently being classified when
computing the coefficients used to classify that subject. Of course, this procedure is
computationally more intense than that used by SPSS. If you would like to learn more
about conducting logistic regression with SAS, see my document at
http://core.ecu.edu/psyc/wuenschk/MV/MultReg/Logistic-SAS.doc.
Beyond An Introduction to Logistic Regression
I have left out of this handout much about logistic regression. We could consider
logistic regression with a criterion variable with more than two levels, with that variable
being either qualitative or ordinal. We could consider testing of interaction terms. We
could consider sequential and stepwise construction of logistic models. We could talk
about detecting outliers among the cases, dealing with multicollinearity and nonlinear
relationships between predictors and the logit, and so on. If you wish to learn more
about logistic regression, I recommend, as a starting point, Chapter 10 in Using
Multivariate Statistics, 5th edition, by Tabachnick and Fidell (Pearson, 2007).
Presenting the Results
Let me close with an example of how to present the results of a logistic
regression. In the example below you will see that I have included both the multivariate
analysis (logistic regression) and univariate analysis. I assume that you all already
know how to conduct the univariate analyses I present below.
Table 1

Effect of Scenario on Percentage of Participants Voting to Allow the Research to
Continue and Participants' Mean Justification Score

Scenario      Percentage Support
Theory               31
Meat                 37
Cosmetic             40
Veterinary           41
Medical              54
As shown in Table 1, only the medical research received support from a majority
of the respondents. Overall a majority of respondents (59%) voted to stop the research.
Logistic regression analysis was employed to predict the probability that a participant
would approve continuation of the research. The predictor variables were participants'
gender, idealism, relativism, and four dummy variables coding the scenario. A test of
the full model versus a model with intercept only was statistically significant, χ²(7, N =
315) = 87.51, p < .001. The model was able correctly to classify 73% of those who
approved the research and 70% of those who did not, for an overall success rate of
71%.
Table 2 shows the logistic regression coefficient, Wald test, and odds ratio for
each of the predictors. Employing a .05 criterion of statistical significance, gender,
idealism, relativism, and two of the scenario dummy variables had significant partial
effects. The odds ratio for gender indicates that when holding all other variables
constant, a man is 3.5 times more likely to approve the research than is a woman.
Inverting the odds ratio for idealism reveals that for each one point increase on the nine-
point idealism scale there is a doubling of the odds that the participant will not approve
the research. Although significant, the effect of relativism was much smaller than that of
idealism, with a one point increase on the nine-point relativism scale being associated
with the odds of approving the research increasing by a multiplicative factor of 1.39.
The scenario variable was dummy coded using the medical scenario as the reference
group. Only the theory and the meat scenarios were approved significantly less than
the medical scenario. Inverted odds ratios for these dummy variables indicate that the
odds of approval for the medical scenario were 2.38 times higher than for the meat
scenario and 3.22 times higher than for the theory scenario.
Table 2

Logistic Regression Predicting Decision From Gender, Ideology, and Scenario

Predictor       B      Wald χ²      p      Odds Ratio
Gender         1.25     20.59    < .001       3.51
Idealism      -0.70     37.89    < .001       0.50
Relativism     0.33      6.63      .01        1.39
Scenario
  Cosmetic    -0.71      2.85      .091       0.49
  Theory      -1.16      7.35      .007       0.31
  Meat        -0.87      4.16      .041       0.42
  Veterinary  -0.54      1.75      .186       0.58
Univariate analysis indicated that men were significantly more likely to approve
the research (59%) than were women (30%), χ²(1, N = 315) = 25.68, p < .001; that
those who approved the research were significantly less idealistic (M = 5.87, SD = 1.23)
than those who didn't (M = 6.92, SD = 1.22), t(313) = 7.47, p < .001; that those who
approved the research were significantly more relativistic (M = 6.26, SD = 0.99) than
those who didn't (M = 5.91, SD = 1.19), t(313) = 2.71, p = .007; and that the omnibus
effect of scenario fell short of significance, χ²(4, N = 315) = 7.44, p = .11.
Interaction Terms
Interaction terms can be included in a logistic model. When the variables in an
interaction are continuous they probably should be centered. Consider the following
research: Mock jurors are presented with a criminal case in which there is some doubt
about the guilt of the defendant. For half of the jurors the defendant is physically
attractive, for the other half she is plain. Half of the jurors are asked to recommend a
verdict without having deliberated, the other half are asked about their recommendation
only after a short deliberation with others. The deliberating mock jurors were primed
with instructions predisposing them to change their opinion if convinced by the
arguments of others. We could use a logit analysis here, but elect to use a logistic
regression instead. The article in which these results were published is: Patry, M. W.
(2008). Attractive but guilty: Deliberation and the physical attractiveness bias.
Psychological Reports, 102, 727-733.
The data are in Logistic2x2x2.sav at my SPSS Data Page. Download the data
and bring them into SPSS. Each row in the data file represents one cell in the three-
way contingency table. Freq is the number of scores in the cell.
Tell SPSS to weight cases by Freq. Data, Weight Cases:
Analyze, Regression, Binary Logistic. Slide Guilty into the Dependent box and
Delib and Plain into the Covariates box. Highlight both Delib and Plain in the pane on
the left and then click the >a*b> box.
This creates the interaction term. It could also be created by simply creating a
new variable, Interaction = Delib*Plain.
Under Options, ask for the Hosmer-Lemeshow test and confidence intervals on
the odds ratios.
You will find that the odds ratios are .338 for Delib, 3.134 for Plain, and 0.030 for
the interaction.
Those who deliberated were less likely to suggest a guilty verdict (15%) than
those who did not deliberate (66%), but this (partial) effect fell just short of statistical
significance in the logistic regression (but a 2 x 2 chi-square would show it to be
significant).
Plain defendants were significantly more likely (43%) than physically attractive
defendants (39%) to be found guilty. This effect would fall well short of statistical
significance with a 2 x 2 chi-square.
We should not pay much attention to the main effects, given that the interaction
is powerful.
The interaction odds ratio can be simply computed, by hand, from the cell
frequencies.
- For those who did deliberate, the odds of a guilty verdict are 1/29 when the
defendant was plain and 8/22 when she was attractive, yielding a conditional
odds ratio of 0.09483.
- For those who did not deliberate, the odds of a guilty verdict are 27/8 when the
defendant was plain and 14/13 when she was attractive, yielding a conditional
odds ratio of 3.1339.
- The interaction odds ratio is simply the ratio of these conditional odds ratios --
that is, .09483/3.1339 = 0.030.

Variables in the Equation

                                                         95.0% C.I. for EXP(B)
                           Wald   df   Sig.   Exp(B)      Lower     Upper
Step 1(a)  Delib           3.697   1   .054     .338       .112     1.021
           Plain           4.204   1   .040    3.134      1.052     9.339
           Delib by Plain  8.075   1   .004     .030       .003      .338
           Constant         .037   1   .847    1.077
a. Variable(s) entered on step 1: Delib, Plain, Delib * Plain.

Plain * Guilty Crosstabulation (Delib = Yes)

                                       Guilty
                                   No      Yes      Total
Plain   Attractive  Count           22       8        30
                    % within Plain 73.3%   26.7%    100.0%
        Plain       Count           29       1        30
                    % within Plain 96.7%    3.3%    100.0%
Total               Count           51       9        60
                    % within Plain 85.0%   15.0%    100.0%
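These hand calculations can be verified in a few lines of Python, using the cell counts from the two crosstabulations:

```python
# Odds of a guilty verdict in each cell (guilty count / not-guilty count)
odds_delib_plain = 1 / 29      # deliberated, plain defendant
odds_delib_attract = 8 / 22    # deliberated, attractive defendant
odds_nodelib_plain = 27 / 8    # no deliberation, plain defendant
odds_nodelib_attract = 14 / 13 # no deliberation, attractive defendant

cond_or_delib = odds_delib_plain / odds_delib_attract       # conditional OR, deliberated
cond_or_nodelib = odds_nodelib_plain / odds_nodelib_attract # conditional OR, not deliberated
interaction_or = cond_or_delib / cond_or_nodelib            # ratio of conditional ORs

print(round(cond_or_delib, 5), round(cond_or_nodelib, 4),
      round(interaction_or, 3))  # 0.09483 3.1339 0.03, matching Delib by Plain Exp(B)
```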
Follow-up analysis shows that among those who did not deliberate the plain
defendant was found guilty significantly more often than the attractive defendant, χ²(1, N
= 62) = 4.353, p = .037, but among those who did deliberate the attractive defendant
was found guilty significantly more often than the plain defendant, χ²(1, N = 60) = 6.405,
p = .011.

Plain * Guilty Crosstabulation (Delib = No)

                                       Guilty
                                   No      Yes      Total
Plain   Attractive  Count           13      14        27
                    % within Plain 48.1%   51.9%    100.0%
        Plain       Count            8      27        35
                    % within Plain 22.9%   77.1%    100.0%
Total               Count           21      41        62
                    % within Plain 33.9%   66.1%    100.0%

Interaction Between a Dichotomous Predictor and a Continuous Predictor

Suppose that I had some reason to suspect that the effect of idealism differed
between men and women. I can create the interaction term just as shown above.
Variables in the Equation
B S.E. Wald df Sig. Exp(B)
Step 1
a
idealism -.773 .145 28.572 1 .000 .461
gender -.530 1.441 .135 1 .713 .589
gender by idealism .268 .223 1.439 1 .230 1.307
Constant 4.107 .921 19.903 1 .000 60.747
a. Variable(s) entered on step 1: idealism, gender, gender * idealism .
As you can see, the interaction falls short of significance.
Partially Standardized B Weights and Odds Ratios

The value of a predictor's B (and the associated odds ratio) is highly dependent
on the unit of measure. Suppose I am predicting whether or not an archer hits the
target. One predictor is distance to the target. Another is how much training the archer
has had. Suppose I measure distance in inches and training in years. I would not
expect much of an increase in the logit when decreasing distance by an inch, but I
would expect a considerable increase when increasing training by a year. Suppose I
measured distance in miles and training in seconds. Now I would expect a large B for
distance and a small B for training. For purposes of making comparisons between the
predictors, it may be helpful to standardize the B weights.

Suppose that a third predictor is the archer's score on a survey of political
conservatism and that a photo of Karl Marx appears on the target. The unit of measure
here is not intrinsically meaningful -- how much is a one point change in score on this
survey? Here too it may be helpful to standardize the predictors. Menard (The
American Statistician, 2004, 58, 218-223) discussed several ways to standardize B
weights. I favor simply standardizing the predictor, which can be simply accomplished
by converting the predictor scores to z scores or by multiplying the unstandardized B
weight by the predictor's standard deviation. While one could also standardize the
dichotomous outcome variable (group membership), I prefer to leave that
unstandardized.
In research here at ECU, Cathy Hall gathered data that is useful in predicting
who will be retained in our engineering program. Among the predictor variables are
high school GPA, score on the quantitative section of the SAT, and one of the Big Five
personality measures, openness to experience. Here are the results of a binary logistic
regression predicting retention from high school GPA, quantitative SAT, and openness
(you can find more detail here).
              Unstandardized           Standardized
Predictor     B       Odds Ratio      B       Odds Ratio
HS-GPA      1.296       3.656       0.510       1.665
SAT-Q       0.006       1.006       0.440       1.553
Openness    0.100       1.105       0.435       1.545
The novice might look at the unstandardized statistics and conclude that SAT-Q
and openness to experience are of little utility, but the standardized coefficients show
that not to be true. The three predictors differ little in their unique contributions to
predicting retention in the engineering program.
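The relationship between the unstandardized and standardized statistics can be sketched in Python. The implied predictor standard deviations below are back-computed from the table, so they are approximations rather than values from the original data:

```python
import math

# (unstandardized B, partially standardized B) from the table above;
# a partially standardized B is just B times the predictor's standard deviation.
predictors = {
    "HS-GPA": (1.296, 0.510),
    "SAT-Q": (0.006, 0.440),
    "Openness": (0.100, 0.435),
}
for name, (b, b_std) in predictors.items():
    sd = b_std / b  # implied predictor SD (approximate, from rounded table values)
    print(name, round(sd, 2), round(math.exp(b_std), 3))  # OR for a 1-SD increase
```

Exponentiating each standardized B reproduces the standardized odds ratios in the table (1.665, 1.553, 1.545), which are directly comparable across predictors.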
Practice Your Newly Learned Skills
Now that you know how to do a logistic regression, you should practice those
skills. I have presented below three exercises designed to give you a little practice.
Exercise 1: What is Beautiful is Good, and Vice Versa
Castellow, Wuensch, and Moore (1990, Journal of Social Behavior and
Personality, 5, 547-562) found that physically attractive litigants are favored by jurors
hearing a civil case involving alleged sexual harassment (we manipulated physical
attractiveness by controlling the photos of the litigants seen by the mock jurors). Guilty
verdicts were more likely when the male defendant was physically unattractive and
when the female plaintiff was physically attractive. We also found that jurors rated the
physically attractive litigants as more socially desirable than the physically unattractive
litigants -- that is, more warm, sincere, intelligent, kind, and so on. Perhaps the jurors
treated the physically attractive litigants better because they assumed that physically
attractive people are more socially desirable (kinder, more sincere, etc.).
Our next research project (Egbert, Moore, Wuensch, & Castellow, 1992, Journal
of Social Behavior and Personality, 7, 569-579) involved our manipulating (via character
witness testimony) the litigants' social desirability but providing mock jurors with no
information on physical attractiveness. The jurors treated litigants described as socially
desirable more favorably than they treated those described as socially undesirable.
However, these jurors also rated the socially desirable litigants as more physically
attractive than the socially undesirable litigants, despite having never seen them! Might
our jurors be treating the socially desirable litigants more favorably because they
assume that socially desirable people are more physically attractive than are socially
undesirable people?
We next conducted research in which we manipulated both the physical
attractiveness and the social desirability of the litigants (Moore, Wuensch, Hedges, &
Castellow, 1994, Journal of Social Behavior and Personality, 9, 715-730). Data from
selected variables from this research project are in the SPSS data file found at
http://core.ecu.edu/psyc/wuenschk/SPSS/Jury94.sav. Please download that file now.
You should use SPSS to predict verdict from all of the other variables. The
variables in the file are as follows:
- VERDICT -- whether the mock juror recommended a not guilty (0) or a guilty (1)
verdict -- that is, finding in favor of the defendant (0) or the plaintiff (1)
- ATTRACT -- whether the photos of the defendant were physically unattractive (0)
or physically attractive (1)
- GENDER -- whether the mock juror was female (0) or male (1)
- SOCIABLE -- the mock juror's rating of the sociability of the defendant, on a 9-
point scale, with higher representing greater sociability
- WARMTH -- ratings of the defendant's warmth, 9-point scale
- KIND -- ratings of defendant's kindness
- SENSITIV -- ratings of defendant's sensitivity
- INTELLIG -- ratings of defendant's intelligence
You should also conduct bivariate analysis (Pearson Chi-Square and
independent samples t-tests) to test the significance of the association between each
predictor and the criterion variable (verdict). You will find that some of the predictors
have significant zero-order associations with the criterion but are not significant in the
full model logistic regression. Why is that?
You should find that the sociability predictor has an odds ratio that indicates that
the odds of a guilty verdict increase as the rated sociability of the defendant increases --
but one would expect that greater sociability would be associated with a reduced
probability of being found guilty, and the univariate analysis indicates exactly that (mean
sociability was significantly higher with those who were found not guilty). How is it
possible for our multivariate (partial) effect to be opposite in direction to that indicated by
our univariate analysis? You may wish to consult the following documents to help
understand this:
Redundancy and Suppression
Simpson's Paradox
Exercise 2: Predicting Whether or Not Sexual Harassment Will Be Reported
Download the SPSS data file found at
http://core.ecu.edu/psyc/wuenschk/SPSS/Harass-Howell.sav. This file was obtained
from David Howell's site,
http://www.uvm.edu/~dhowell/StatPages/Methods/DataMethods5/Harass.dat. I have
added value labels to a couple of the variables. You should use SPSS to conduct a
logistic regression predicting the variable "reported" from all of the other variables. Here
is a brief description for each variable:
- REPORTED -- whether (1) or not (0) an incident of sexual harassment was
reported
- AGE -- age of the victim
- MARSTAT -- marital status of the victim -- 1 = married, 2 = single
- FEMinist ideology -- the higher the score, the more feminist the victim
- OFFENSUV -- offensiveness of the harassment -- higher = more offensive
I suggest that you obtain, in addition to the multivariate analysis, some bivariate
statistics, including independent samples t-tests, a Pearson chi-square contingency
table analysis, and simple Pearson correlation coefficients for all pairs of variables.
Exercise 3: Predicting Who Will Drop-Out of School
Download the SPSS data file found at
http://core.ecu.edu/psyc/wuenschk/SPSS/Dropout.sav. I simulated these data based on
the results of research by David Howell and H. R. Huessy (Pediatrics, 76, 185-190).
You should use SPSS to predict the variable "dropout" from all of the other variables.
Here is a brief description for each variable:
- DROPOUT -- whether the student dropped out of high school before graduating -
- 0 = No, 1 = Yes.
- ADDSC -- a measure of the extent to which each child had exhibited behaviors
associated with attention deficit disorder. These data were collected while the
children were in the 2nd, 4th, and 5th grades and combined into one variable in the
present data set.
- REPEAT -- did the child ever repeat a grade -- 0 = No, 1 = Yes.
- SOCPROB -- was the child considered to have had more than the usual number
of social problems in the 9th grade -- 0 = No, 1 = Yes.
I suggest that you obtain, in addition to the multivariate analysis, some bivariate
statistics, including independent samples t-tests, a Pearson chi-square contingency
table analysis, and simple Pearson correlation coefficients for all pairs of variables.
Imagine that you were actually going to use the results of your analysis to decide which
children to select as "at risk" for dropping out before graduation. Your intention is, after
identifying those children, to intervene in a way designed to make it less likely that they
will drop out. What cutoff would you employ in your decision rule?
Copyright 2009, Karl L. Wuensch - All rights reserved.
- Logistic Regression With SAS
- Why Not Let SPSS Do The Dummy Coding of Categorical Predictors?
- Statistics Lessons
- SPSS Lessons
- More Links
- Letters From Former Students -- some continue to use my online lessons when
they go on to doctoral programs
MediationModels.docx
Statistical Tests of Models That Include Mediating Variables
Consider a model that proposes that some independent variable (X) is correlated
with some dependent variable (Y) not because it exerts some direct effect upon the
dependent variable, but because it causes changes in an intervening or mediating
variable (M), and then the mediating variable causes changes in the dependent
variable. Psychologists tend to refer to the X → M → Y relationship as mediation.
Sociologists tend to speak of the indirect effect of X on Y through M.

[Path diagram: X → M and M → Y, plus the direct path X → Y]
MacKinnon, Lockwood, Hoffman, West, and Sheets (A comparison of methods
to test mediation and other intervening variable effects, Psychological Methods, 2002, 7,
83-104) reviewed 14 different methods that have been proposed for testing models that
include intervening variables. They grouped these methods into three general
approaches.
Causal Steps. This is the approach that has most directly descended from the
work of Judd, Baron, and Kenny and which has most often been employed by
psychologists. Using this approach, the criteria for establishing mediation, which are
nicely summarized by David Howell (Statistical Methods for Psychology, 6th ed., page
528) are:
X must be correlated with Y.
X must be correlated with M.
M must be correlated with Y, holding constant any direct effect of X on Y.
When the effect of M on Y is removed, X is no longer correlated with Y (complete
mediation) or the correlation between X and Y is reduced (partial mediation).
Each of these four criteria is tested separately in the causal steps method:
First you demonstrate that the zero-order correlation between X and Y (ignoring
M) is significant.
Next you demonstrate that the zero-order correlation between X and M (ignoring
Y) is significant.
The Sobel test statistic is αβ divided by its first-order standard error,
sqrt(β²sα² + α²sβ²), where α is the regression coefficient for predicting the
mediator from the independent variable, sα is the
standard error for that coefficient, β is the standardized or unstandardized partial
regression coefficient for predicting Y from M controlling for X, and sβ is the standard
error for that coefficient. Since most computer programs give the standard errors for the
unstandardized but not the standardized coefficients, I shall employ the unstandardized
coefficients in my computations (using an interactive tool found on the Internet) below.
An alternative standard error is Aroian's (1944) second-order exact solution,
sqrt(β²sα² + α²sβ² + sα²sβ²). Another alternative is Goodman's (1960) unbiased solution,
in which the rightmost addition sign becomes a subtraction sign:
sqrt(β²sα² + α²sβ² − sα²sβ²).
In his text, Dave Howell employed Goodman's solution, but he made a potentially
serious error -- for the M → Y path he employed a zero-order coefficient and standard error
when he should have employed the partial coefficient and standard error.
MacKinnon et al. gave some examples of hypotheses and models that include
intervening variables. One was that of Ajzen & Fishbein (1980), in which intentions are
hypothesized to intervene between attitudes and behavior. I shall use here an example
involving data relevant to that hypothesis. Ingram, Cope, Harju, and Wuensch
(Applying to graduate school: A test of the theory of planned behavior. Journal of Social
Behavior and Personality, 2000, 15, 215-226) tested a model which included three
independent variables (attitude, subjective norms, and perceived behavior control),
one mediator (intention), and one dependent variable (behavior). I shall simplify that
model here, dropping subjective norms and perceived behavioral control as
independent variables. Accordingly, the mediation model (with standardized path
coefficients) is:
[Path diagram: Attitude → Intention, β = .767; Intention → Behavior, β = .245;
direct effect Attitude → Behavior, β = .337.]
Let us first consider the causal steps approach:
Attitude is significantly correlated with behavior, r = .525.
Attitude is significantly correlated with intention, r = .767.
The partial effect of intention on behavior, holding attitude constant, falls short
of statistical significance, β = .245, p = .16.
The direct effect of attitude on behavior (removing the effect of intention) also
falls short of statistical significance, β = .337, p = .056.
The causal steps approach does not, here, provide strong evidence of mediation,
given lack of significance of the partial effect of intention on behavior. If sample size
were greater, however, that critical effect would, of course, be statistically significant.
Now I calculate the Sobel/Aroian/Goodman tests. The statistics which I need are
the following:
The zero-order unstandardized regression coefficient for predicting the
mediator (intention) from the independent variable (attitude). That coefficient
= .423.
The standard error for that coefficient = .046.
The partial, unstandardized regression coefficient for predicting the
dependent variable (behavior) from the mediator (intention) holding constant
the independent variable (attitude). That regression coefficient = 1.065.
The standard error for that coefficient = .751.
For Aroian's second-order exact solution,
TS = αβ / sqrt(β²sα² + α²sβ² + sα²sβ²)
   = (.423)(1.065) / sqrt((1.065)²(.046)² + (.423)²(.751)² + (.046)²(.751)²) = 1.3935.
What a tedious calculation that was. I just lost interest in showing you
how to calculate the Sobel and the Goodman statistics by hand. Let us use Kris
Preacher's dandy tool at http://quantpsy.org/sobel/sobel.htm . Just enter alpha
(a), beta (b), and their standard errors and click Calculate:
Coefficients(a)

Model 1       Unstandardized B   Std. Error   Standardized Beta       t    Sig.
(Constant)    3.390              1.519                            2.231    .030
attitude       .423               .046              .767          9.108    .000

a. Dependent Variable: intent
Coefficients(a)

Model 1       Unstandardized B   Std. Error   Standardized Beta       t    Sig.
(Constant)     .075              9.056                             .008    .993
attitude       .807               .414              .337          1.950    .056
intent        1.065               .751              .245          1.418    .162

a. Dependent Variable: behav
Even easier (with a little bit of rounding error), just provide the t statistics for
alpha and beta and click Calculate:
The results indicate (for each of the error terms) a z of about 1.40 with a p of
about .16. Again, our results do not provide strong support for the mediation
hypothesis.
MacKinnon et al. (1998) Distribution of
Suppose that I am at a party. You are on your way to the party, late. Your friend
asks you, "Do you suppose that Karl has already had too many beers?" Based on past
experience with me at such parties, your prior probability of my having had too many
beers is p(B) = 0.2. The probability that I have not had too many beers is p(not-B) = 0.8,
giving prior odds of Ω = p(B) / p(not-B) = 0.2/0.8 = 0.25 (inverting the odds, the probability
that I have not had too many beers is 4 times the probability that I have). Ω is Greek
omega.
Now, what data could we use to revise your prior probability of my having had
too many beers? How about some behavioral data. Suppose that your friend tells you
that, based on her past experience, the likelihood that I behave awfully at a party if I
have had too many beers is 30%, that is, the conditional probability p(A|B) = 0.3.
According to her, if I have not been drinking too many beers, there is only a 3% chance
of my behaving awfully, that is, the likelihood p(A|not-B) = 0.03. Drinking too many beers
raises the probability of my behaving awfully ten-fold, that is, the
likelihood ratio L = p(A|B) / p(A|not-B) = 0.30/0.03 = 10.
From the multiplication rule of probability, you know that
p(B ∩ A) = p(A)p(B|A) = p(B)p(A|B), so it follows that
p(B|A) = p(B ∩ A) / p(A).
From the addition rule, you know that p(A) = p(B ∩ A) + p(not-B ∩ A), since B and
not-B are mutually exclusive. Thus,
p(B|A) = p(B ∩ A) / [p(B ∩ A) + p(not-B ∩ A)].
From the multiplication rule you know that
p(B ∩ A) = p(B)p(A|B) and p(not-B ∩ A) = p(not-B)p(A|not-B), so
p(B|A) = p(B)p(A|B) / [p(B)p(A|B) + p(not-B)p(A|not-B)]. This is Bayes' theorem, as applied to the
probability of my having had too many beers given that I have been observed behaving
awfully. Stated in words rather than symbols:
posterior probability_i = (prior probability_i × likelihood_i) /
(prior probability_i × likelihood_i + prior probability_j × likelihood_j), and
P(H1|D) = P(H1)P(D|H1) / P(D) = P(H1)P(D|H1) / [P(H0)P(D|H0) + P(H1)P(D|H1)].
The P(H0|D) and P(H1|D) are posterior probabilities, the probability that the null or
the alternative is true given the data. The P(H0) and P(H1) are prior probabilities,
the probability that the null or the alternative is
true prior to considering the new data. The P(D|H0) and P(D|H1) are the likelihoods, the
probabilities of the data given one or the other hypothesis.
As before, the posterior odds equal the prior odds multiplied by the likelihood ratio L, that is,
P(H1|D) / P(H0|D) = [P(H1) / P(H0)] × [P(D|H1) / P(D|H0)].
In classical hypothesis testing, one considers only P(D|H0), or more precisely, the
probability of obtaining sample data as or more discrepant with the null than are those on
hand, that is, the obtained significance level, p, and if that p is small enough, one
rejects the null hypothesis and asserts the alternative hypothesis. One may mistakenly
believe e has estimated the probability that the null hypothesis is true, given the
obtained data, but e has not done so. The Bayesian does estimate the probability that
the null hypothesis is true given the obtained data, P(H0|D), and if that probability is
sufficiently small, e rejects the null hypothesis in favor of the alternative hypothesis. Of
course, how small is sufficiently small depends on an informed consideration of the
relative seriousness of making one sort of error (rejecting H0 in favor of H1) versus
another sort of error (retaining H0 rather than H1).
Suppose that we are interested in testing the following two hypotheses about the
IQ of a particular population: H0: μ = 100 versus H1: μ = 110. I consider the two
hypotheses equally likely, and dismiss all other possible values of μ, so the prior
probability of the null is .5 and the prior probability of the alternative is also .5.
I obtain a sample of 25 scores from the population of interest. I assume it is
normally distributed with a standard deviation of 15, so the standard error of the mean
is 15/5 = 3. The obtained sample mean is 107. I compute for each hypothesis
z_i = (M − μ_i) / σ_M. For H0, z = (107 − 100)/3 = 2.33. The likelihood is obtained from
the height of the normal curve at that z, which you can compute from the normal density,
e^(−z²/2) / sqrt(2·pi),
where pi is approx. 3.1416, or you can use a normal curve table, SAS, SPSS, or other
statistical program. Using PDF.NORMAL(2.333333333333,0,1) in SPSS, I obtain
.0262/2 = .0131. In the same way I obtain the likelihood p(D|H1) = .1210. The
P(D) = P(H0)P(D|H0) + P(H1)P(D|H1) = .5(.0131) + .5(.1210) = .06705. Our revised,
posterior probabilities are:
P(H0|D) = P(H0)P(D|H0) / P(D) = .5(.0131) / .06705 = .0977, and
P(H1|D) = .5(.1210) / .06705 = .9023.
Before we gathered our data we thought the two hypotheses equally likely, that
is, the odds were 1/1. Our posterior odds are .9023/.0977 = 9.24. That is, after
gathering our data we now think that H1 is more than 9 times more likely than is H0.
The likelihood ratio is P(D|H1) / P(D|H0) = .1210/.0131 = 9.24. Multiplying the prior odds ratio (1)
by the likelihood ratio gives us the posterior odds. When the prior odds = 1 (the null
and the alternative hypotheses are equally likely), then the posterior odds is equal to
the likelihood ratio. When intuitively revising opinion, humans often make the mistake
of assuming that the prior probabilities are equal.
If we are really paranoid about rejecting null hypotheses, we may still retain the
null here, even though we now think the alternative about nine times more likely.
Suppose we gather another sample of 25 scores, and this time the mean is 106. We
can use these new data to revise the posterior probabilities from the previous analysis.
For these new data, the likelihoods are p(D|H0) = .027 and p(D|H1) = .0820. Using the
posterior probabilities from the first analysis as the new priors,
P(H0|D) = .0977(.027) / [.0977(.027) + .9023(.0820)] = .00264/.07663 = .0344, and
P(H1|D) = .9023(.0820) / .07663 = .9655.
The likelihood ratio is .082/.027 = 3.037. The newly revised posterior odds is
.9655/.0344= 28.1. The prior odds 9.24, times the likelihood ratio, 3.037, also gives the
posterior odds, 9.24(3.037) = 28.1. With the posterior probability of the null at .0344,
we should now be confident in rejecting it.
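The full updating chain for both samples can be verified with a short Python sketch (my check, not part of the lesson; like the text, it takes each likelihood to be the normal ordinate divided by 2, and tiny rounding differences from the hand calculations remain):

```python
import math

def density(z):
    """Ordinate (height) of the standard normal curve at z."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

se = 15 / math.sqrt(25)              # sigma_M = 3
prior = {100: 0.5, 110: 0.5}         # H0: mu = 100, H1: mu = 110, equally likely

for sample_mean in (107, 106):       # the two successive sample means
    # likelihood of each hypothesis, halved as in the text's SPSS computation
    like = {mu: density((sample_mean - mu) / se) / 2 for mu in prior}
    p_d = sum(prior[mu] * like[mu] for mu in prior)
    prior = {mu: prior[mu] * like[mu] / p_d for mu in prior}  # posterior -> new prior

print(round(prior[100], 4), round(prior[110], 4))   # about .0344 and .9655
print(round(prior[110] / prior[100], 1))            # posterior odds, about 28
```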
The Bayesian approach seems to give us just what we want, the probability of
the null hypothesis given our data. So what's the rub? The rub is, to get that posterior
probability we have to come up with the prior probability of the null being true. If you
and I disagree on that prior probability, given the same data, we arrive at different
posterior probabilities. Bayesians are less worried about this than are traditionalists,
since the Bayesian thinks of probability as being subjective, one's degree of belief
about some hypothesis, event, or uncertain quantity. The traditionalist thinks of a
probability as being an objective quantity, the limit of the relative frequency of the event
across an uncountably large number of trials (which, of course, we can never know, but
we can estimate by rational or empirical means). Advocates of Bayesian statistics are
often quick to point out that as evidence accumulates there is a convergence of the
posterior probabilities of those who started with quite different prior probabilities.
Bayesian Confidence Intervals
In Bayesian statistics a parameter, such as μ, is thought of as a random variable
with its own distribution rather than as a constant. That distribution is thought of as
representing our knowledge about what the true value of the parameter may be, and
the mean of that distribution is our best guess for the true value of the parameter. The
wider the distribution, the greater our ignorance about the parameter. The precision
(prc) of the distribution is the inverse of its variance, so the greater the precision, the
greater our knowledge about the parameter.
Our prior distribution of the parameter may be noninformative or informative. A
noninformative prior will specify that all possible values of the parameter are equally
likely. The range of possible values may be fixed (for example, from 0 to 1 for a
proportion) or may be infinite. Such a prior distribution will be rectangular, and if the
range is not fixed, of infinite variance. An informative prior distribution specifies a
particular nonuniform shape for the distribution of the parameter, for example, a
binomial, normal, or t distribution centered at some value. When new data are
gathered, they are used to revise the prior distribution. The mean of the revised
(posterior) distribution becomes our new best guess of the exact value of the
parameter. We can construct a Bayesian confidence interval and opine that the
probability that the true value of the parameter falls within that interval is cc, where cc is
the confidence coefficient (typically 95%).
Suppose that we are interested in the mean of a population for which we confess
absolute ignorance about the value of μ prior to gathering the data, but we are willing to
assume a normal distribution. We obtain 100 scores and compute the sample mean
to be 107 and the sample variance 200. The precision of this sample result is the
inverse of its squared standard error of the mean. That is,
prc = 1 / s²_M = n / s² = 100/200 = .5. The
95% Bayesian confidence interval is identical to the traditional confidence interval, that
is, M ± 1.96 s_M = 107 ± 1.96 sqrt(200/100) = 107 ± 2.77, that is, from 104.23 to 109.77.
Now suppose that additional data become available. We have 81 scores with a
mean of 106, a variance of 243, and a precision of 81/243 = 1/3. Our prior distribution
has (from the first sample) a mean of 107 and a precision of .5. Our posterior
distribution will have a mean that is a weighted combination of the mean of the prior
distribution and that of the new sample. The weights will be based on the precisions:
M2 = [prc1 / (prc1 + prcS)] M1 + [prcS / (prc1 + prcS)] MS
   = [.5 / (.5 + .33)] 107 + [.33 / (.5 + .33)] 106 = 106.6.
The precision of the revised (posterior) distribution for μ is simply the sum of the
prior and sample precisions: prc2 = prc1 + prcS = .5 + .33 = .83. The variance of the
revised distribution is just the inverse of its precision, 1.2. Our new Bayesian
confidence interval is 106.6 ± 1.96 sqrt(1.2) = 106.6 ± 2.15, that is, from 104.46 to 108.75.
Of course, if more data come in, we revise our distribution for μ again. Each time
the width of the confidence interval will decline, reflecting greater precision, more
knowledge about μ.
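The same updating arithmetic is easy to check in Python (my sketch, not part of the lesson):

```python
import math

# First sample: n = 100, mean 107, variance 200. Second sample: n = 81, mean 106, variance 243.
prc1, m1 = 100 / 200, 107        # prior precision and mean (from the first sample)
prc_s, m_s = 81 / 243, 106       # precision and mean of the new sample

post_prc = prc1 + prc_s                                # precisions add
post_mean = (prc1 * m1 + prc_s * m_s) / post_prc       # precision-weighted mean
post_var = 1 / post_prc                                # variance = inverse of precision

half = 1.96 * math.sqrt(post_var)                      # 95% interval half-width
print(round(post_mean, 1), round(post_var, 2))
print(round(post_mean - half, 2), round(post_mean + half, 2))
```

This reproduces the posterior mean 106.6, variance 1.2, and an interval of roughly 104.46 to 108.75 (give or take rounding in the hand calculation).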
Copyright 2008 Karl L. Wuensch - All rights reserved.
dfa2.doc
Two Group Discriminant Function Analysis
In DFA one wishes to predict group membership from a set of (usually continuous)
predictor variables. In the most simple case one has two groups and p predictor variables. A
linear discriminant equation, D_i = a + b1 X1 + b2 X2 + ... + bp Xp, is constructed such that the
two groups differ as much as possible on D. That is, the weights are chosen so that were you
to compute a discriminant score (D_i) for each subject and then do an ANOVA on D, the ratio
of the between groups sum of squares to the within groups sum of squares is as large as
possible. The value of this ratio is the eigenvalue. "Eigen" can be translated from the
German as "own," "peculiar," "original," "singular," etc. Check out the page at
http://core.ecu.edu/psyc/wuenschk/StatHelp/eigenvalue.txt for a discussion of the origins of
the term eigenvalue.
Read the following article, which has been placed on reserve in Joyner:
Castellow, W. A., Wuensch, K. L., & Moore, C. H. (1990). Effects of physical attractiveness of
the plaintiff and defendant in sexual harassment judgments. Journal of Social Behavior
and Personality, 5, 547-562.
The data for this analysis are those used for the research presented in that article.
They are in the SPSS data file Harass90.sav. Download it from my SPSS-Data page and
bring it into SPSS. To do the discriminant analysis, click Analyze, Classify, Discriminant.
Place the Verdict variable into the Grouping Variable box and define the range from 1 to 2.
Place the 22 rating scale variables (D_excit through P_happy) in the Independents box. We
are using the ratings the jurors gave defendant and plaintiff to predict the verdict. Under
Statistics, ask for Means, Univariate ANOVAs, Box's M, Fisher's Coefficients, and
Unstandardized Coefficients. Under Classify, ask for Priors Computed From Group Sizes and
for a Summary Table. Under Save, ask that the discriminant scores be saved.
Now look at the output. The means show that when the defendant was judged not
guilty he was rated more favorably on all 11 scales than when he was judged guilty. When the
defendant was judged not guilty the plaintiff was rated less favorably on all 11 scales than
when a guilty verdict was returned. The Tests of Equality of Group Means show that the
groups differ significantly on every variable except plaintiff excitingness, calmness,
independence, and happiness.
The discriminant function, in unstandardized units (Canonical Discriminant Function
Coefficients), is D = -0.064 + .083 D_excit + ...... + .029 P_happy. The group centroids
(mean discriminant scores) are -0.785 for the Guilty group and 1.491 for those jurors who
decided the defendant was not guilty. High scores on the discriminant function are associated
with the juror deciding to vote not guilty.
The posterior probability of membership in group i, given the discriminant score, is
p(G_i|D) = p(G_i) p(D|G_i) / Σ(i = 1 to g) p(G_i) p(D|G_i).
Each subject's discriminant score is used to determine the posterior probabilities of
being in each of the two groups. The subject is then classified (predicted) to be in the group
with the higher posterior probability.
By default, SPSS assumes that all groups have equal prior probabilities. For two
groups, each prior = 1/2, for three, 1/3, etc. I asked SPSS to use the group relative frequencies
as priors, which should result in better classification.
Another way to classify subjects is to use Fisher's classification function
coefficients. For each subject a D is computed for each group and the subject classified into
the group for which es D is highest. To compute a subject's D1 you would multiply es scores
on the 22 rating scales by the indicated coefficients and sum them and the constant. For es
D2 you would do the same with the coefficients for Group 2. If D1 > D2 then you classify the
subject into Group 1; if D2 > D1, then you classify em into Group 2.
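The rule itself is just "compute each group's classification function and take the largest." A small Python sketch with three predictors; all coefficients and scores below are made up for illustration (the real ones would come from SPSS's Fisher's coefficients table):

```python
# Hypothetical Fisher classification function coefficients for two groups:
# a constant plus one weight per predictor (illustrative values only).
group_fns = {
    1: {"constant": -5.0, "weights": [0.8, 1.2, -0.3]},
    2: {"constant": -2.0, "weights": [0.1, 0.9, 0.6]},
}

def classify(scores):
    """Compute each group's D and classify into the group with the highest D."""
    d = {g: f["constant"] + sum(w * x for w, x in zip(f["weights"], scores))
         for g, f in group_fns.items()}
    return max(d, key=d.get)

print(classify([5, 3, 1]))   # hypothetical subject's predictor scores -> 1
print(classify([1, 1, 5]))   # -> 2
```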
The classification results table shows that we correctly classified 88% of the subjects.
To evaluate how good this is we should compare 88% with what would be expected by
chance. By just randomly classifying half into group 1 and half into group 2 you would
expect to get .5(.655) + .5(.345) = 50% correct. Given that the marginal distribution of Verdict
is not uniform, you would do better by randomly putting 65.5% into group 1 and 34.5% into
group 2 (probability matching), in which case you would expect to be correct .655(.655) +
.345(.345) = 54.8% of the time. Even better would be to probability maximize by just
placing every subject into the most likely group, in which case you would be correct 65.5% of
the time. We can do significantly better than any of these by using our discriminant function.
Assumptions: Multivariate normality of the predictors is assumed. One may hope
that large sample sizes make the DFA sufficiently robust that one does not worry about
moderate departures from normality. One also assumes that the variance-covariance
matrix of the predictor variables is the same in all groups (so we can obtain a pooled
matrix to estimate error variance). Box's M tests this assumption and indicates a problem
with our example data. For validity of significance tests, one generally does not worry about
this if sample sizes are equal, and with unequal sample sizes one need not worry unless the p
< .001. The DFA is thought to be very robust and Box's M is very sensitive. Non-normality
also tends to lower the p for Box's M. The classification procedures are not, however, as
robust as the significance tests are. One may need to transform variables or do a quadratic
DFA (SPSS won't do this) or ask that separate rather than pooled variance-covariance
matrices be used. Pillai's criterion (rather than Wilks' lambda) may provide additional robustness
for significance testing -- although not available with SPSS discriminant, this criterion is
available with SPSS MANOVA.
ANOVA on D. Conduct an ANOVA comparing the verdict groups on the discriminant
function. Then you can demonstrate that the DFA eigenvalue is equal to the ratio of the
SSbetween to SSwithin from that ANOVA and that the ratio of SSbetween to SStotal is the squared
canonical correlation coefficient from the DFA.
Correlation Between Groups and D. Correlate the discriminant scores with the
verdict variable. You will discover that the resulting point biserial correlation coefficient is the
canonical correlation from the DFA.
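Both relationships are easy to verify numerically. The discriminant scores below are made up for illustration (not the Harass90 data); the algebra is what matters:

```python
import math

# Hypothetical discriminant scores for two verdict groups (illustration only)
g1 = [-1.0, -0.5, -1.5]
g2 = [1.0, 1.5, 0.5]
scores = g1 + g2
groups = [1] * len(g1) + [2] * len(g2)

def mean(xs):
    return sum(xs) / len(xs)

gm = mean(scores)
ss_between = len(g1) * (mean(g1) - gm) ** 2 + len(g2) * (mean(g2) - gm) ** 2
ss_within = (sum((x - mean(g1)) ** 2 for x in g1)
             + sum((x - mean(g2)) ** 2 for x in g2))
ss_total = sum((x - gm) ** 2 for x in scores)

eigenvalue = ss_between / ss_within        # the DFA eigenvalue
r_squared = ss_between / ss_total          # squared canonical correlation

def pearson(x, y):
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

r_pb = pearson(scores, groups)             # point biserial of D with group
print(round(eigenvalue, 3), round(r_squared, 4), round(r_pb ** 2, 4))
# 6.0 0.8571 0.8571 -- the point biserial squared equals SSbetween/SStotal
```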
SAS: Obtain the data file Harass90.dat from my StatData page and the program
DFA2.sas from my SAS Programs Page. Run the program. This program uses SAS to do
essentially the same analysis we just did with SPSS. Look at the output from PROC REG. It
did a multiple regression to predict group membership (1, 2) from the rating scales. Notice
that the SSmodel / SSerror = the eigenvalue from the DFA, and that the SSerror / SStotal = the Wilks
lambda from the DFA. The square root of the R² equals the canonical correlation from the DFA.
The unstandardized discriminant function coefficients (raw canonical coefficients) are equal to
the standardized discriminant function coefficients (pooled within-class standardized canonical
coefficients) divided by the pooled (within-group) standard deviations.
Note also that the DFA's discriminant function coefficients are a linear transformation of
the multiple regression b weights (multiply each by 4.19395 and you get the unstandardized
discriminant function coefficients). I do not know what determines the value of this constant; I
determined it empirically for this set of data.
Return to Wuensch's Statistics Lessons Page
Copyright 2008 Karl L. Wuensch - All rights reserved.
DFA3.doc
Discriminant Function Analysis with Three or More Groups
With more than two groups one can obtain more than one discriminant function. The
first DF is that which maximally separates the groups (produces the largest ratio of
among-groups to within groups SS on the resulting D scores). The second DF, orthogonal
to the first, maximally separates the groups on variance not yet explained by the first DF.
One can find a total of K-1 (number of groups minus 1) or p (number of predictor variables)
orthogonal discriminant functions, whichever is smaller.
We shall use the data from Experiment 1 of my dissertation to illustrate a
discriminant function analysis with three groups. The analysis I reported when I published
this research was a doubly multivariate repeated measures ANOVA (see Wuensch, K. L.,
Fostering house mice onto rats and deer mice: Effects on response to species odors.
Animal Learning and Behavior, 1992, 20, 253-258). Wild-strain house mice were, at birth,
cross-fostered onto house-mouse (Mus), deer mouse (Peromyscus) or rat (Rattus) nursing
mothers. Ten days after weaning, each subject was tested in an apparatus that allowed it
to enter tunnels scented with clean pine shavings or with shavings bearing the scent of
Mus, Peromyscus, or Rattus. One of the variables measured was the number of visits to
each tunnel during a twenty minute test. Also measured were how long each subject spent
in each of the four tunnels and the latency to first visit of each tunnel. We shall use the
visits data for our discriminant function analysis.
The data are in the SPSS data file, TUNNEL4b.sav. Download it from my SPSS-
Data page. The variables in this data file are:
NURS (nursing group, 1 for Mus reared, 2 for Peromyscus reared, and 3 for Rattus
reared)
V1, V2, V3, and V4 (labeled Clean-V, Mus-V, Pero-V, and Rat-V, these are the raw
data for number of visits to the clean, Mus-scented, Peromyscus-scented, and
Rattus-scented tunnels)
V_Clean, V_Mus, V_Pero, and V_Rat (the visits data after a square root
transformation to reduce positive skewness and stabilize the variances)
T1, T2, T3, and T4 (time in seconds spent in each tunnel)
T_Clean, T_Mus, T_Pero, and T_Rat (the time data after a square root
transformation to reduce positive skewness)
L1, L2, L3, and L4 (the latency data in seconds) and
L_Clean, L_Mus, L_Pero, and L_Rat (the latency data after a log transformation to
reduce positive skewness).
For this lesson we shall use only the NURS variable and the visits variables.
Obtaining Means and Standard Deviations for the Untransformed Data
Open the TUNNEL4b.sav file in SPSS. Click Analyze, Compare Means, Means.
SPSS will do stepwise DFA. You simply specify which method you wish to employ
for selecting predictors. The most economical method is the Wilks lambda method, which
selects predictors that minimize Wilks' lambda. As with stepwise multiple regression, you
may set the criteria for entry and removal (F criteria or p criteria), or you may take the
defaults.
Imagine that you are working as a statistician for the Internal Revenue Service. You
are told that another IRS employee has developed four composite scores (X1 - X4), easily
computable from the information that taxpayers provide on their income tax returns and from
other databases to which the IRS has access. These composite scores were developed in
the hope that they would be useful for discriminating tax cheaters from other persons. To
see if these composite scores actually have any predictive validity, the IRS selects a random
sample of taxpayers and audits their returns. Based on this audit, each taxpayer is placed
into one of three groups: Group 1 is persons who overpaid their taxes by a considerable
amount, Group 2 is persons who paid the correct amount, and Group 3 is persons who
underpaid their taxes by a considerable amount. X1 through X4 are then computed for each
of these taxpayers. You are given a data file with group membership, X1, X2, X3, and X4 for
each taxpayer, with an equal number of subjects in each group. Your job is to use
discriminant function analysis to develop a pair of discriminant functions (weighted sums of
X1 through X4) to predict group membership. You use a fully stepwise selection procedure to
develop a (maybe) reduced (less than four predictors) model. You employ the WILKS
method of selecting variables to be entered or deleted, using the default p criterion for
entering and removing variables.
Your data file is DFA-STEP.sav, which is available on Karl's SPSS-Data page --
download it and then bring it into SPSS. To do the DFA, click Analyze, Classify, and
then put Group into the Grouping Variable box, defining its range from 1 to 3. Put X1
through X4 in the Independents box, and select the stepwise method.
Positive Assortative Mating on the Main Diagonal
Our research hypothesis was that individuals would prefer to mate with others similar to
themselves (in this case, of the same religion). Look at the main diagonal (upper left cell,
middle cell, lower right cell) of the Relig_S x Relig_M table. Most of the counts are in that
diagonal, which represents individuals who want mates of the same religion as their own. If we
sum the counts on the main diagonal, we see that 185 (or 92.5%) of our respondents said they
want their mates to be of the same religion as their religion. How many respondents would we
expect to be on this main diagonal if there was no correlation between Relig_S and Relig_M?
The answer to that question is simple: Just sum the expected frequencies in that main
diagonal -- in the absence of any correlation, we expect 108 (54%) of our respondents to be on
that main diagonal. Now we can employ an exact binomial test of the null hypothesis that the
proportion of respondents desiring mates with the same religion as their own is what would be
expected given independence of self religion and ideal mate religion (binomial p = .54). The
one-tailed p is the P(Y ≥ 185 | n = 200, p = .54). Back in PSYC 6430 you learned how to use
SAS to get binomial probabilities. In the little program below, I obtained the P(Y ≤ 184 | n =
200, p = .54), subtracted that from 1 to get the P(Y ≥ 185 | n = 200, p = .54), and then doubled
that to get the two-tailed significance level. The SAS output shows that p < .001.
data p;
LE184 = PROBBNML(.54, 200, 184);
GE185 = 1 - LE184;
p = 2*GE185;
proc print; run;
Obs LE184 GE185 p
1 1 0 0
You can also use SPSS to get an exact binomial probability. See my document
Obtaining Significance Levels with SPSS.
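The same exact binomial test can also be computed in Python with only the standard library (my illustration, mirroring the SAS program above; `math.comb` requires Python 3.8+):

```python
from math import comb

n, p = 200, 0.54

# P(Y >= 185): sum the exact binomial probabilities for k = 185 .. 200
ge185 = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(185, n + 1))
two_tailed = 2 * ge185

print(two_tailed < 0.001)   # True, matching the SAS result: p < .001
```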
The unpartialled chi-square on Relig_S x Gender is also significant. The column percentages in
the table make it fairly clear that this effect is due to men being much more likely than women
to have no religion.
SAS Catmod
options pageno=min nodate formdlim='-';
data Religion;
input Relig_Self Relig_Mate Sex count;
cards;
1 1 1 20.5
1 1 2 7.5
1 2 1 0.5
1 2 2 0.5
1 3 1 1.5
1 3 2 1.5
2 1 1 1.5
2 1 2 1.5
2 2 1 86.5
2 2 2 49.5
2 3 1 3.5
2 3 2 2.5
3 1 1 0.5
3 1 2 1.5
3 2 1 2.5
3 2 2 3.5
3 3 1 8.5
3 3 2 15.5
proc catmod;
weight count;
model Relig_Self*Relig_Mate*Sex = _response_;
Loglin Relig_Self|Relig_Mate|Sex;
run;
Submit this code to obtain the analysis of the saturated model in SAS.
Karl L. Wuensch, Dept. of Psychology, East Carolina University, Greenville, NC 27858 USA
March, 2012
Links
Return to Wuensch's Statistics Lessons Page
Download the SPSS Output
Log-Linear Contingency Table Analysis, Two-Way
Three-Way Nonhierarchical Log-Linear Analysis: Escalators and Obesity
Four Variable LOGIT Analysis: The 1989 Sexual Harassment Study
Log3N.doc
Three-Way Nonhierarchical Log-Linear Analysis: Escalators and Obesity
Hierarchical analyses are the norm when one is doing multidimensional log-linear
analyses. The backwards elimination tests of significance are available because each
reduced model is nested within the next more complex model. With nonhierarchical
analysis one can exclude lower-order effects that are contained within retained
higher-order effects. One might wish to evaluate two nonhierarchical models when one
is not nested within the other. One cannot test the significance of the difference
between two such nonhierarchical models, but one can assess the adequacy of fit of
each such model.
The Data and the Program
We shall use data which I captured from the article "Stairs, Escalators, and
Obesity," by Meyers et al. (Behavior Modification 4: 355-359). A copy of the article is
available within BlackBoard. The (nonhierarchical) LOGLINEAR procedure is not
available by point and click in SPSS; you must use syntax. Since I needed to issue the
loglinear command by syntax, I also issued the rest of the commands by syntax. The
syntax file is ESCALATE.SPS on my SPSS Programs page, and the data file is
ESCALATE.SAV on my SPSS Data page. Download both. After downloading, double-
click on the syntax file and PASW will boot and the syntax file will be displayed in the
syntax editor.
You will need to edit the File statement so that it points to the correct location of
the data file on your computer. Run the syntax file to produce the output. I exported
the output to an rtf document, edited it, and then converted it to a pdf document. You
can obtain the pdf document at http://core.ecu.edu/psyc/wuenschk/MV/LogLin/Log3N-
SPSS-Output.pdf.
An Initial Run with the Hiloglinear Procedure
Look at the program. With Hiloglinear I asked that the tests of partial
associations and the parameter estimates be listed. I did not ask for the frequencies or
residuals. Look at the output. The three-way interaction is significant. When the
highest-order effect is significant, one may attempt to eliminate one or more of the
lower-order effects while retaining the higher-order effect. The partial chi-squares may
suggest which effects to try deleting, and one can try deleting any effect which does not
have at least one highly significant parameter estimate.
For our data, every partial chi-square is significant, but the Weight x Device
effect has a relatively small χ², so I'll try removing it. Looking at the parameter
estimates, Weight x Device (neither parameter is significant) and Direct (not significant
at .01) appear to be candidates for removal.
Building a Reduced Model with the Loglinear Procedure
I used Loglinear to test two models, one with Weight x Device removed and one
with Direct removed. In both cases the goodness-of-fit chi-square was significant,
meaning that the reduced models do not fit the data well. This is in part due to the
great power we have with large sample sizes. We can look at the residuals to see
where the fit is poor. For the model with Weight x Device removed, none of the
standardized residuals is very large (> 2), but three are large enough to warrant
inspection (> 1). The model predicted that:
15.45 Obese folks would be observed Ascending Stairs, but only 10 were;
19.45 Obese folks would be observed Descending Stairs, but only 14 were; and
72.04 folks of normal weight would be observed Ascending Stairs, but 82 were.
For the model with Direct removed, the residuals are generally small, but two
cells have residuals worthy of some attention. For the Obese, the model predicted that:
14.3 would be observed Ascending Stairs, only 10 were, and
9.7 would be observed Descending Stairs, but 14 were.
Comparing Nested Models
When we have two models that are nested, with Model A being a subset of
Model B, with all of the effects in Model A also in Model B, then we can test the
significance of the difference between those two models. The difference
2
will equal
the Model A goodness-of-fit
2
minus the Model B goodness-of-fit
2
, with df equal to
the difference between the two models' df. We do have two such pairs, the full model
versus that with Weight x Device removed and the full model versus that with Direct
removed. Since the full model always has
2
= 0 and df = 0, the difference chi-squares
are the reduced model chi-squares, and they are significant.
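The difference test just described is easy to sketch in Python (a hypothetical helper, not part of the SPSS/SAS analyses in this lesson); it assumes SciPy is available for the chi-square survival function.

```python
from scipy.stats import chi2

def nested_model_test(gof_reduced, df_reduced, gof_full, df_full):
    """Difference chi-square test for two nested log-linear models.

    The difference between the two goodness-of-fit chi-squares is itself
    distributed as chi-square, with df equal to the difference in df.
    """
    diff_chi2 = gof_reduced - gof_full
    diff_df = df_reduced - df_full
    return diff_chi2, diff_df, chi2.sf(diff_chi2, diff_df)

# Versus the saturated (full) model, which always has chi-square = 0 on df = 0,
# the difference chi-square is just the reduced model's goodness-of-fit chi-square.
stat, df, p = nested_model_test(3.84, 1, 0.0, 0)
```

With a reduced-model goodness-of-fit χ² of 3.84 on 1 df, the difference test gives p of about .05, the usual critical boundary.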
The Triple Interaction
Now we shall try to explain the triple interaction by looking at "simple two-way"
interactions at each level of the third variable. I decided to look at the Weight x Device
interaction at each level of Direction, but could have just as well looked at Weight x
Direction at each level of Device or Device x Direction at each level of Weight.
Look at the tables produced by the first Crosstabs command. I reproduce here
the row percentages for the Stairs column.
Percentage Using Stairs Within Each Weight x Direction Combination

                    Direction
Weight        Ascending   Descending
Obese            4.7         14.7
Overweight       3.9         27.8
Normal           7.6         23.1
The Direction x Device interaction is obvious here, with many more people using
the stairs going down than going up. Were we inspecting Direction x Device at each
level of Weight, we would do three 2 x 2 Direction x Device analyses, each determining
whether the rate of stairway use was significantly higher when descending than when
ascending for a given weight category. For example, among the obese, is 14.7%
significantly different from 4.7%? I decided to look at Weight x Device interaction at
each level of Direction. Crosstabs gave us the LR χ² for Weight x Device at each
direction, and they are both significant.
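For a single two-way table, the likelihood-ratio chi-square that Crosstabs reports can be computed directly. The sketch below (Python with NumPy/SciPy, not the SPSS syntax used in this lesson) applies the usual formula, G² = 2 Σ observed × ln(observed/expected).

```python
import numpy as np
from scipy.stats import chi2

def lr_chi2(table):
    """Likelihood-ratio (G-squared) test of association for a two-way table."""
    table = np.asarray(table, dtype=float)
    # Expected counts under independence: (row total x column total) / grand total
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    mask = table > 0  # cells with zero observed count contribute nothing
    g2 = 2.0 * np.sum(table[mask] * np.log(table[mask] / expected[mask]))
    df = (table.shape[0] - 1) * (table.shape[1] - 1)
    return g2, chi2.sf(g2, df)

# A table whose rows are exactly proportional shows no association: G2 = 0.
g2, p = lr_chi2([[10, 20], [30, 60]])
```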
Breaking Up Each 3 x 2 Interaction Into Three 2 x 2 Interactions
To understand each 3 x 2 (Weight x Device) interaction better, I broke each into
three 2 x 2 interactions. If you will look at the program, you will see that I changed
WEIGHT(1,3) to WEIGHT(1,2) to get the comparison between the Obese (level 1) and
the Overweight (level 2). When ascending, they do not differ significantly on
percentage using the stairs, but when descending they do, with the overweight using
the stairs more often than do the obese.
The Obese versus Normal comparisons both fall short of significance, but just
barely. Note, in the program, how I used the TEMPORARY and the MISSING VALUES
commands to construct these comparisons. I declared the value 2 to be a missing
value for the weight variable, so when I indicated WEIGHT(1,3), the comparison was
only between weight group 1 and weight group 3. The TEMPORARY command made
this declaration of missing value status valid for only one procedure.
For Overweight versus Normal, the normal weight folks are significantly more
likely to use the stairs than are the overweight when ascending, but when descending
the overweight persons use the stairs more than do the normal weight persons, with the
difference not quite reaching statistical significance.
[Figure: Percentage Use of Staircase Rather than Escalator Among Three Weight
Groups, with separate bars for Ascending and Descending within each of the Obese,
Overweight, and Normal groups; vertical axis is percentage, 0 to 30.]
In summary, people use the stairs much less than the escalator, especially when
going up. The overweight are the least likely to use the stairs when going up, but the
most likely to use the stairs when going down. Perhaps these people know they have a
weight problem, know they need exercise, so they resolve to use the stairs more often,
but using them going up is just asking too much.
Karl L. Wuensch
Dept. of Psychology
East Carolina University
Greenville, NC 27858
November, 2009
SAS code to do the Model 1 analysis
How to get people to use the stairs -- http://www.youtube.com/watch?v=2lXh2n0aPyw
Return to Wuensch's Statistics Lessons Page
Logit.doc
Four Variable LOGIT Analysis: The 1989 Sexual Harassment Study
The Data and the Program
In the file HARASS89.sav on Karl's SPSS Data Page are cell data from a mock
jury study done by C. Moore et al. early in 1989. Download the data file. Every variable
is categorical: Verdict (1 = guilty, 2 = not guilty), Gender (1 = male, 2 = female), Plattr (1
= the plaintiff is low in physical attractiveness, 2 = high in physical attractiveness), and
Deattr (1 = defendant is low in physical attractiveness, 2 = high). The cell frequencies
are provided by the Freq variable. The female plaintiff in this civil case has accused the
male defendant of sexually harassing her. We wish to determine whether our
outcome/dependent variable, Verdict, is affected by (associated with) Plattr, Deattr,
Gender, and/or any combination of these three categorical predictor/independent
variables. Download the SPSS program file, LOGIT.sps, from Karl's SPSS Programs
Page. Edit the syntax so the Get command points correctly to the location of the data
file on your computer and then run the program.
A Screening Run with Hiloglinear
Let us first ignore the fact that we consider one of the variables a dependent
variable and do a hierarchical backwards elimination analysis. We shall pay special
attention to the effects which include our dependent variable, Verdict, in this analysis.
Note that while the one-way effects are as a group significant (due solely to the fact that
guilty verdicts were more common than not guilty), the two-way and higher-order effects
are not. This is exactly what we should expect, since most of these effects were
experimentally made zero or near zero by our random assignment of participants to
groups. We randomly assigned female and male participants to have an attractive or
an unattractive plaintiff and, independent of that assignment, to have an attractive or an
unattractive defendant, so, we made the effects involving only Gender, Plattr, and
Deattr zero or near zero.
Hiloglinear's "Tests of PARTIAL associations" indicated significant effects of
Verdict, Gender x Verdict, and Plattr x Deattr x Verdict. The estimated parameters for
Verdict and Gender x Verdict are significant, and that for Plattr x Deattr x Verdict is very
close to significance. The backwards elimination procedure led to a model that
includes the Plattr x Deattr x Verdict effect and the Gender x Verdict effect. Since this
is a hierarchical model, all lower-order effects included in the retained higher-order
effects are also retained, that is, the model also includes the effects of Plattr x Deattr,
Plattr x Verdict, Deattr x Verdict, Plattr, Deattr, Verdict, and Gender. Note that many of
these effects are effects that we experimentally made zero or near zero by our random
assignment to treatment groups. The model fits the data well, as indicated by the high
p for the goodness-of-fit χ².
A Saturated Logit Analysis
The Hiloglinear analysis was employed simply to give us some suggestions
regarding which of the effects we want to include in our Logit Analysis. In a logit
analysis we retain only effects that include the dependent variable, Verdict. The partial
association tests suggest a model with Verdict, Gender x Verdict, and
Plattr x Deattr x Verdict. We could just start out with every effect that includes Verdict
and then evaluate various reduced models, using standardized parameter estimates (Z)
to guide our selection of effects to be deleted (if an effect has no parameter with a large
absolute Z, delete it, then evaluate the reduced model). When deciding between any
two particular models, we may test the significance of their differences if and only if one
is nested within the other. Of course, each model is automatically compared to the fully
saturated model with the goodness-of-fit χ² SPSS gives us, and we don't want to accept
a model that is significantly bad in fit.
Instead of using just the three effects suggested by Hiloglinear's partial
association tests, I entered every effect containing Verdict to do a saturated logit
analysis. Note the syntax: LOGLINEAR dependent variable BY independent variables.
As always, the saturated model has perfect fit.
A Backwards Elimination Nonhierarchical Logit Analysis
I inspected the Z scores for tests of parameters in the saturated model, looking
for an effect to delete. Verdict x Plattr, with its Z of .024, was chosen. I employed
Loglinear again, leaving Verdict x Plattr out of the DESIGN statement, to evaluate the
reduced model; this analysis is not included in the program you ran. The
goodness-of-fit χ² was incremented from 0 to .00059, a trivial, nonsignificant increase,
df = 1, p = .981.
The smallest |Z| in the reduced model was -.306, for Verdict x Gender x Plattr, so
I deleted that effect, increasing the χ² to .094, which was still nonsignificantly different
from the saturated model, df = 2, p = .954. Again, this analysis is not included in the
program you ran.
I next deleted Verdict x Gender x Deattr, Z = -.431, increasing χ² to .280, still not
significantly ill fitting, df = 3, p = .964. Next I removed Verdict x Deattr, Z = -1.13,
increasing χ² to 1.567, df = 4, p = .815. Next out was Verdict x Gender x Plattr x Deattr,
Z = -1.31, χ² = 3.283, p = .656. I have omitted from the program the four models
between the saturated model and the Verdict, Verdict x Gender, Verdict x Plattr x Deattr
model, just to save paper. I made my decisions (and took notes) looking at these
models on my computer screen, not printing them.
If you look at the standardized residuals for the Verdict, Verdict x Gender, Verdict
x Plattr x Deattr model, you will see that there are no problems, not even a hint of any
problem (no standardized residual of 1 or more).
Going Too Far
The way I have been deleting effects, each model is nested within all previous
models, so I can test the significance of the difference between one model and any that
preceded it with a χ² that equals the difference between the two models' goodness-of-fit
chi-squares. The df = the number of effects deleted from the one model to obtain the
other model. The null hypothesis is that the two models fit the data equally well. Since
the .05 critical value for χ² on 1 df is 3.84, I was on the watch for an increase of this
magnitude in the goodness-of-fit χ² produced by removing one effect.
Verdict x Plattr x Deattr had a significant Z-value of -2.17 in the current model,
but since that was the smallest Z-value, I removed it. The goodness-of-fit χ² jumped a
significant 4.84, from 3.28 to 8.12. This was enough to convince me to leave
Verdict x Plattr x Deattr in the model, even though the Verdict, Verdict x Gender model
was not significantly different from the saturated model (due to large df), df = 6, p = .23.
Removal of the Verdict x Plattr x Deattr effect also resulted in increased residuals, four
of the cells having standardized residuals greater than 1.
Just to complete my compulsion, I tried one last step (not included in the
program you ran), removing Verdict x Gender, producing an ill fitting one-parameter
(Verdict) model, χ² = 15.09, df = 7, p = .035.
Our Final Model
So, we are left with a model containing Verdict, Verdict x Gender, and
Verdict x Plattr x Deattr. Do note that this is exactly the model suggested by the partial
association tests in our screening run with Hiloglinear. Now we need to interpret the
model we have selected.
Our structural model is ln(cell freq_vgpd) = μ + λ_v + λ_vg + λ_vpd. Consider the
parameter for the effect of Verdict. A value of 0 would indicate that there were equal
numbers of guilty and not guilty votes -- the odds would be 1/1 = 1, and the natural log
of 1 is 0. Our model's estimate of the parameter for Verdict = Guilty is .363, which is
significantly greater than zero.
Odds
In our sample there were 110 guilty votes and 56 not guilty, for odds = 110/56 =
1.96. Our model predicts the odds to be e^(2(.3626)) = 2.07, a pretty good estimate.
Of course, we could also predict the odds of a not guilty vote using the parameter for
Verdict = Not_Guilty: e^(2(-.3626)) = .484, not a bad estimate. (The constant 4 follows
from the four conditional odds just discussed). The log of an odds ratio is called a logit,
thus, "logit analysis."
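The arithmetic for the observed and model-predicted odds can be checked in a few lines of Python (a sketch; the parameter value .3626 is taken from the text above):

```python
import math

guilty, not_guilty = 110, 56
observed_odds = guilty / not_guilty       # 110/56, about 1.96

lam = 0.3626                              # parameter estimate for Verdict = Guilty
predicted_odds = math.exp(2 * lam)        # about 2.07
predicted_odds_ng = math.exp(2 * -lam)    # about .484 for Not Guilty

logit = math.log(observed_odds)           # the log odds, i.e., the logit
```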
Using Crosstabs to Help Interpret the Significant Effects
For the Verdict x Plattr x Deattr triple interaction I decided to inspect the Verdict x
Plattr effect at each level of Deattr. Look at the Crosstabs output. When the defendant
was unattractive, guilty verdicts were rendered more often when the plaintiff was
attractive (70%) than when she was unattractive (55%), but the difference between
these percentages fell short of significance (p = .154 by the likelihood ratio test). When
the defendant was handsome, the results were just the opposite, guilty verdicts being
more likely when the plaintiff was unattractive (77%) than when she was beautiful
(62%), but the simple effect again fell short of significance.
Some people just cannot understand how a higher-order effect can be significant
when none of its simple effects is. Although we understand this stems from the simple
effects having opposite signs, perhaps we should try looking at the interaction from
another perspective, the Verdict x Deattr effect at each level of Plattr. Look at the
Crosstabs output. When the plaintiff was unattractive, handsome defendants were
found guilty significantly more often (77%) than unattractive defendants (55%),
p = .026. When the plaintiff was beautiful, attractiveness of the defendant had no
significant effect upon the verdict, 70% versus 62%, p = .48.
[Figure: Percentage Guilty as a function of Defendant attractiveness (Unattractive,
Attractive), with separate lines for Unattractive Plaintiff and Attractive Plaintiff; vertical
axis is percentage guilty, 50 to 80.]
Next, look at the Verdict x Gender Crosstabs output. Significantly more of the
female jurors (76%) found the defendant guilty than did the male jurors (57%), p = .008.
The likelihood ratio test reported here is one that ignores all the other effects in the full
model.
The 1990 Sexual Harassment Study
The results of the 1989 sexual harassment study were never published. There
was a serious problem with the stimulus materials that made the physical attractiveness
manipulation inadequate. We never even bothered submitting that study to any
journal. We considered it a pilot study and did additional pilot work that led to better
stimulus materials. The research conducted with these better stimulus materials has
been published -- "Effects of physical attractiveness of the plaintiff and defendant in
sexual harassment judgments" by Wilbur A. Castellow, Karl L. Wuensch, & Charles H.
Moore (1990), Journal of Social Behavior and Personality, 5, 547-562.
The program file, LOGIT2.sps, along with the data file, HARASS90.sav, will
produce the logit analysis that is reported in the article. Download the files, run the
program, and look over the output until you understand the statistics reported in the
article. The results are not as complex as they were in the pilot study. The 1-way and
2-way effects are significant, but the higher order effects are not. Verdict, Defendant
Attractiveness x Verdict, and Plaintiff Attractiveness x Verdict have significant partial
chi-squares and significant parameters. The backwards elimination procedure led to a
model with Defendant Attractiveness x Verdict and Plaintiff Attractiveness x Verdict,
and, because it is a hierarchical procedure, those effects included therein, namely
Verdict, Defendant Attractiveness, and Plaintiff Attractiveness.
As explained in the article, nonhierarchical logit analysis was then used to test a
model including only effects that involved the verdict (dependent) variable. The
saturated model (all effects) produced significant parameters only for Verdict,
Defendant Attractiveness x Verdict, and Plaintiff Attractiveness x Verdict. A reduced
model containing only these three effects fit the data well, as indicated by the
nonsignificant goodness-of-fit test. All three retained parameters remained significant
in the reduced model.
The output from Crosstabs helps explain the significant effects. The effect of verdict is
due to guilty verdicts being significantly more frequent (66%) than not guilty (34%)
verdicts. The two interactions each show that physically attractive persons are favored
over physically unattractive persons.
Karl L. Wuensch Dept. of Psychology East Carolina University Greenville, NC 27858
September, 2009
http://core.ecu.edu/psyc/wuenschk/MV/LogLin/Logit-SPSS-Output.pdf --
annotated PASW output
SAS Catmod code
Return to Wuensch's Statistics Lessons Page
Reliability-Validity-Scaling.docx
A Brief Introduction to Reliability, Validity, and Scaling
Reliability
Simply put, a reliable measuring instrument is one which gives you the same
measurements when you repeatedly measure the same unchanged objects or events.
We shall briefly discuss here methods of estimating an instrument's reliability. The
theory underlying this discussion is that which is sometimes called classical
measurement theory. The foundations for this theory were developed by Charles
Spearman (1904, "General intelligence," objectively determined and measured.
American Journal of Psychology, 15, 201-293).
If a measuring instrument were perfectly reliable, then it would have a perfect
positive (r = +1) correlation with the true scores. If you measured an object or event
twice, and the true scores did not change, then you would get the same measurement
both times.
We theorize that our measurements contain random error, but that the mean
error is zero. That is, some of our measurements have errors that make them lower than
the true scores, but others have errors that make them higher than the true scores, with
the sum of the score-decreasing errors being equal to the sum of the score-increasing
errors. Accordingly, random error will not affect the mean of the measurements, but it
will increase the variance of the measurements.
Our definition of reliability is

r_XX = σ²_T / σ²_M = σ²_T / (σ²_T + σ²_E) = r²_TM

where σ²_T is the variance of the true scores, σ²_M is the variance of the
measurements, and σ²_E is the variance of the errors.
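A quick simulation (a Python/NumPy sketch, with arbitrary illustrative variances) shows that this ratio of variances matches the squared correlation between the measurements and the true scores:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true = rng.normal(0.0, 2.0, n)    # true-score variance = 4
error = rng.normal(0.0, 1.0, n)   # error variance = 1, mean error = 0
measured = true + error           # each measurement = true score + random error

theoretical = 4.0 / (4.0 + 1.0)   # sigma_T^2 / (sigma_T^2 + sigma_E^2) = .80
r_tm = np.corrcoef(true, measured)[0, 1]
# r_tm squared is (approximately) the theoretical reliability, .80
```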
The Data
Download the KJ.dat data file from my StatData page. These are the data from
the research that was reported in:
Wuensch, K. L., Jenkins, K. W., & Poteat, G. M. (2002). Misanthropy, idealism,
and attitudes towards animals. Anthrozoös, 15, 139-149.
A summary of the research can be found at Misanthropy, Idealism, and Attitudes
About Animals.
SAS
To illustrate the computation of Cronbach's alpha with SAS, I shall use the data
set we used back in PSYC 6430 when learning to do correlation/regression analysis on
SAS, KJ.dat. Columns 1 through 10 contain the participants' responses to the first ten
items, which constitute the idealism scale.
Obtain the program file, Alpha.sas, from my SAS Programs page. The simple
way to get Cronbach's alpha is to use the NOMISS and ALPHA options with PROC
CORR. I have also included an illustration of how Cronbach's alpha can be computed
from the item variances and an illustration of how to compute maximized λ4 (lambda4)
and estimated maximized λ4.
SPSS
Boot up SPSS and click File, Read Text Data. Point the wizard to the KJ.dat file.
On step one, indicate that there is no predefined format. On step two, indicate that the
data file is of FIXED WIDTH, and that there are no variable names at the top. On step
three indicate that the data start on line one, there is one line per case, and all cases
should be read. On step four you will see a screen like that atop the next page. To
indicate that the scores for item one are in column one, item two in column two, etc.,
just click between columns one and two, two and three, and so on through ten and eleven.
Coefficient alpha can be computed from the variances as

α = [n / (n - 1)] × (1 - Σσ²_I / σ²_T)

where the ratio of variances is the sum of the n item variances
divided by the total test variance. The second part of the SAS program illustrates the
application of this method for computing coefficient alpha.
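The same variance-based computation can be sketched in Python (a hypothetical helper, equivalent in intent to the SAS illustration mentioned above):

```python
import numpy as np

def cronbach_alpha(items):
    """Coefficient alpha from item variances.

    items: 2-D array, rows = respondents, columns = items.
    alpha = [n/(n-1)] * (1 - sum of item variances / total test variance)
    """
    items = np.asarray(items, dtype=float)
    n = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (n / (n - 1)) * (1 - item_vars.sum() / total_var)
```

For two perfectly parallel items (identical columns), this returns an alpha of exactly 1.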
Maximized λ4 (Lambda4)
H. G. Osburn (Coefficient alpha and related internal consistency reliability
coefficients, Psychological Methods, 2000, 5, 343-355) noted that coefficient alpha is a
lower bound to the true reliability of a measuring instrument, and that it may seriously
underestimate the true reliability. Osburn used Monte Carlo techniques to study a
variety of alternative methods of estimating reliability from internal consistency. His
conclusion was that maximized lambda4 was the most consistently accurate of the
techniques.
λ4 is computed as is coefficient alpha, but on only one pair of split-halves of the
instrument. To obtain maximized λ4, one simply computes λ4 for all possible split-
halves and then selects the largest obtained value of λ4. The problem is that the
number of possible split-halves is .5(2n)! / (n!)² for a test with 2n items. If there are only four
or five items, this is not so bad. For example, suppose we decide to use only the first
four items in the idealism instrument. Look at the third part of the SAS program. In the
data step I create the three possible split-halves and the total test score for a scale
comprised of only the first four items of the idealism measure. Proc Corr is used to
obtain an alpha coefficient of .717. Then I use Proc Corr three more times, obtaining
the correlations for each pair of split halves. Those correlations are .49581, .66198, and
.53911. When I apply the Spearman-Brown correction (taking into account that each
split half has only half as many items as does the total instrument), λ4 = 2r / (1 + r), the
values of λ4 are .6629, .7966, and .7005. The largest of these, .7966, is maximized λ4.
The average of these, .72, is Cronbach's alpha.
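The Spearman-Brown arithmetic above can be reproduced with a short Python sketch, using the three split-half correlations reported in the text:

```python
def spearman_brown(r):
    """Correct a split-half r for the halves being half-length tests."""
    return 2 * r / (1 + r)

halves = [0.49581, 0.66198, 0.53911]      # the three split-half correlations
lam4 = [spearman_brown(r) for r in halves]
max_lam4 = max(lam4)                      # maximized lambda4
alpha = sum(lam4) / len(lam4)             # mean of all split-halves = alpha
```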
If you have an even number of items, λ4 = 2[1 - (σ²_A + σ²_B) / σ²_T], where the three
variances are for half A, half B, and the total test. My program computes the
variances for each half and the total variance, and then λ4 is computed for each split-
half using these variances. The maximized λ4 is .796. I also computed coefficient alpha
as the mean of the three possible pairs of split-halves. Note that the value obtained is the
same as that reported by Proc Corr.
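The variance form of λ4 is easy to verify in Python (a sketch with a toy split):

```python
import numpy as np

def lambda4_from_variances(half_a, half_b):
    """lambda4 = 2 * [1 - (var_A + var_B) / var_Total] for one split-half."""
    a = np.asarray(half_a, dtype=float)
    b = np.asarray(half_b, dtype=float)
    var_t = (a + b).var(ddof=1)  # variance of the total test score
    return 2.0 * (1.0 - (a.var(ddof=1) + b.var(ddof=1)) / var_t)

# Two perfectly parallel halves yield lambda4 = 1 (perfect split-half reliability).
perfect = lambda4_from_variances([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0])
```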
If you have an odd number of items, use the method which actually computes r
for each split-half. For example, I had data from a 5-item misanthropy scale. I created
10 split-halves (in each, one set had only 2 items, the other had 3 items). The
correlations, with the Spearman-Brown correction, were .586, .718, .623, .765, .684,
.686, .776, .687, .453, and .625. The highest corrected correlation, .776, is the
maximized λ4. The mean of the 10 corrected correlations is Cronbach's alpha, .66.
Estimating Maximized λ4
When there are more than just a few items, there are just too many possible split-
halves to be computing λ4 on each one. For our 10-item idealism scale, there are 126
possible split-halves, so don't even think about computing λ4 for each. There are
methods for estimating maximized λ4. If you are interested in such methods, read my
document Estimating Maximized Lambda4.
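The count of distinct split-halves, .5(2n)!/(n!)², can be checked in Python:

```python
from math import factorial

def n_split_halves(k):
    """Number of distinct split-halves for a test with k = 2n items."""
    n = k // 2
    return factorial(k) // (2 * factorial(n) ** 2)

# A 4-item test has only 3 split-halves; a 10-item scale already has 126.
```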
I have archived some EDSTAT-L discussions of Cronbach's alpha. They are
available at Discussion of Cronbach's Alpha On the EDSTAT-L List.
Cronbach's Alpha When There Are Only Two Items on the Scale
A colleague, while reviewing a manuscript, asked about Cronbach's alpha for a
scale with only two items. The authors had reported the simple r between the two
items. My response was that the alpha would be higher than that because of the
Spearman-Brown correction. To verify that, I used SPSS to compute r and alpha for two
items selected from a scale that measures forgiveness. The alpha for the full scale is
well over .9.
Correlations
                             RAof3     Aof7
RAof3  Pearson Correlation   1         .679**
       Sig. (2-tailed)                 .000
       N                     468       468
There is only one possible split-half when there are only two items, and the r for
that split-half is, here, .679. Applying the Spearman-Brown correction,
alpha = 2r / (1 + r) = 2(.679) / 1.679 = .81, which is what SPSS reports:
Reliability Statistics
Cronbach's Alpha    N of Items
.807                2

Return to Wuensch's Stats Lessons Page
Copyright 2011, Karl L. Wuensch - All rights reserved.
PCA-SPSS.docx
Principal Components Analysis - PASW
In principal components analysis (PCA) and factor analysis (FA) one wishes to
extract from a set of p variables a reduced set of m components or factors that accounts
for most of the variance in the p variables. In other words, we wish to reduce a set of p
variables to a set of m underlying superordinate dimensions.
These underlying factors are inferred from the correlations among the p
variables. Each factor is estimated as a weighted sum of the p variables. The ith factor
is thus

F_i = W_i1 X_1 + W_i2 X_2 + ... + W_ip X_p

One may also express each of the p variables as a linear combination of the m
factors,

X_j = A_1j F_1 + A_2j F_2 + ... + A_mj F_m + U_j

where U_j is the variance that is unique to variable j, variance that cannot be explained
by any of the common factors.
Goals of PCA and FA
One may do a PCA or FA simply to reduce a set of p variables to m
components or factors prior to further analyses on those m factors. For example,
Ossenkopp and Mazmanian (Physiology and Behavior, 34: 935-941) had 19 behavioral
and physiological variables from which they wished to predict a single criterion variable,
physiological response to four hours of cold-restraint. They first subjected the 19
predictor variables to a FA. They extracted five factors, which were labeled Exploration,
General Activity, Metabolic Rate, Behavioral Reactivity, and Autonomic Reactivity.
They then computed for each subject scores on each of the five factors. That is, each
subject's set of scores on 19 variables was reduced to a set of scores on 5 factors.
These five factors were then used as predictors (of the single criterion) in a stepwise
multiple regression.
One may use FA to discover and summarize the pattern of intercorrelations
among variables. This is often called Exploratory FA. One simply wishes to group
together (into factors) variables that are highly correlated with one another, presumably
because they all are influenced by the same underlying dimension (factor). One may
also then operationalize (invent a way to measure) the underlying dimension by a linear
combination of the variables that contributed most heavily to the factor.
If one has a theory regarding what basic dimensions underlie an observed event,
one may engage in Confirmatory Factor Analysis. For example, if I believe that
performance on standardized tests of academic aptitude represents the joint operation
of several basically independent faculties, such as Thurstone's Verbal Comprehension,
Word Fluency, Simple Arithmetic, Spatial Ability, Associative Memory, Perceptual
Speed, and General Reasoning, rather than one global intelligence factor, then I may
wish to confirm that structure with a confirmatory factor analysis. Consider the partial
correlation between any pair of variables, X_i and X_j, with the effects of all the other
variables removed, r_ij.12..(i)..(j)..p.
A large partial correlation indicates that the variables involved share variance that
is not shared by the other variables in the data set. Kaiser's Measure of Sampling
Adequacy (MSA) for a variable X_i is the ratio of the sum of the squared simple rs
between X_i and each other X to (that same sum plus the sum of the squared partial rs
between X_i and each other X). Recall that squared rs can be thought of as variances.
MSA = Σ r²_ij / (Σ r²_ij + Σ pr²_ij)
Small values of MSA indicate that the correlations between X_i and the other
variables are unique, that is, not related to the remaining variables outside each simple
correlation.
correlation. Kaiser has described MSAs above .9 as marvelous, above .8 as
meritorious, above .7 as middling, above .6 as mediocre, above .5 as miserable, and
below .5 as unacceptable.
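Since the partial correlations can be obtained from the inverse of the correlation matrix, Kaiser's MSA is straightforward to compute. Here is a NumPy sketch (a hypothetical helper, not the SAS or SPSS output discussed here):

```python
import numpy as np

def kaiser_msa(R):
    """Kaiser's MSA for each variable in correlation matrix R.

    MSA_i = sum of squared simple r's involving X_i, divided by that sum
    plus the sum of squared partial r's (all other variables partialled out).
    """
    R = np.asarray(R, dtype=float)
    Rinv = np.linalg.inv(R)
    d = np.sqrt(np.diag(Rinv))
    partial = -Rinv / np.outer(d, d)  # partial correlations from the inverse
    np.fill_diagonal(partial, 0.0)
    simple = R - np.eye(len(R))       # simple r's with the diagonal zeroed
    num = (simple ** 2).sum(axis=1)
    return num / (num + (partial ** 2).sum(axis=1))

# Three equally correlated variables get identical, middling MSAs.
vals = kaiser_msa([[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]])
```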
The MSA option in SAS PROC FACTOR [Enter PROC FACTOR MSA;] gives
you a matrix of the partial correlations, the MSA for each variable, and an overall MSA
computed across all variables. Variables with small MSAs should be deleted prior to FA
or the data set supplemented with additional relevant variables which one hopes will be
correlated with the offending variables.
For our sample data the partial correlation matrix looks like this:
COST SIZE ALCOHOL REPUTAT COLOR AROMA TASTE
COST 1.00 .54 -.11 -.26 -.10 -.14 .11
SIZE .54 1.00 .81 .11 .50 .06 -.44
ALCOHOL -.11 .81 1.00 -.23 -.38 .06 .31
REPUTAT -.26 .11 -.23 1.00 .23 -.29 -.26
COLOR -.10 .50 -.38 .23 1.00 .57 .69
AROMA -.14 .06 .06 -.29 .57 1.00 .09
TASTE .11 -.44 .31 -.26 .69 .09 1.00
___________________________________________________________
MSA .78 .55 .63 .76 .59 .80 .68
OVERALL MSA = .67
These MSAs may not be marvelous, but they aren't low enough to make me
drop any variables (especially since I have only seven variables, already an
unrealistically low number).
The PASW output includes the overall MSA in the same table as the (useless)
Bartlett's test of sphericity.
The partial correlations (each multiplied by minus 1) are found in the Anti-Image
Correlation Matrix. On the main diagonal of this matrix are the MSAs for the individual
variables.
Anti-image Matrices
Anti-image Correlation
          cost    size    alcohol reputat color   aroma   taste
cost      .779a  -.543    .105    .256    .100    .135   -.105
size     -.543    .550a  -.806   -.109   -.495    .061    .435
alcohol   .105   -.806    .630a   .226    .381   -.060   -.310
reputat   .256   -.109    .226    .763a  -.231    .287    .257
color     .100   -.495    .381   -.231    .590a  -.574   -.693
aroma     .135    .061   -.060    .287   -.574    .801a  -.087
taste    -.105    .435   -.310    .257   -.693   -.087    .676a
a. Measures of Sampling Adequacy (MSA)
Extracting Principal Components
We are now ready to extract principal components. We shall let the computer do
most of the work, which is considerable. From p variables we can extract p
components. This will involve solving p equations with p unknowns. The variance in
the correlation matrix is repackaged into p eigenvalues. Each eigenvalue represents
the amount of variance that has been captured by one component.
Each component is a linear combination of the p variables. The first component
accounts for the largest possible amount of variance. The second component, formed
from the variance remaining after that associated with the first component has been
extracted, accounts for the second largest amount of variance, etc. The principal
components are extracted with the restriction that they are orthogonal. Geometrically
they may be viewed as dimensions in p-dimensional space where each dimension is
perpendicular to each other dimension.
KMO and Bartlett's Test
Kaiser-Meyer-Olkin Measure of Sampling Adequacy            .665
Bartlett's Test of Sphericity    Approx. Chi-Square      1637.9
                                 df                          21
                                 Sig.                      .000

Each of the p variables' variance is standardized to one. Each factor's eigenvalue
may be compared to 1 to see how much more (or less) variance it represents than does
a single variable. With p variables there is p x 1 = p variance to distribute. The principal
components extraction will produce p components which in the aggregate account for
all of the variance in the p variables. That is, the sum of the p eigenvalues will be equal
to p, the number of variables. The proportion of variance accounted for by one
component equals its eigenvalue divided by p.
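The repackaging can be seen concretely in a few lines of Python/NumPy (a sketch of mine, using a made-up four-variable correlation matrix): the eigenvalues of any p x p correlation matrix sum to p, and each eigenvalue divided by p is that component's proportion of variance.

```python
import numpy as np

# Hypothetical 4-variable correlation matrix (any valid one will do).
R = np.array([[1.0, 0.7, 0.2, 0.1],
              [0.7, 1.0, 0.3, 0.2],
              [0.2, 0.3, 1.0, 0.6],
              [0.1, 0.2, 0.6, 1.0]])
p = R.shape[0]

eigenvalues = np.linalg.eigvalsh(R)[::-1]  # sorted largest first
proportions = eigenvalues / p

# The p eigenvalues repackage all of the variance: they sum to p
# (the trace of R), and the proportions of variance sum to 1.
```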
For our beer data, here are the eigenvalues and proportions of variance for the
seven components:
Deciding How Many Components to Retain
So far, all we have done is to repackage the variance from p correlated variables
into p uncorrelated components. We probably want to have fewer than p components.
If our p variables do share considerable variance, several of the p components should
have large eigenvalues and many should have small eigenvalues. One needs to decide
how many components to retain. One handy rule of thumb is to retain only components
with eigenvalues of one or more. That is, drop any component that accounts for less
variance than does a single variable. Another device for deciding on the number of
components to retain is the scree test. This is a plot with eigenvalues on the ordinate
and component number on the abscissa. Scree is the rubble at the base of a sloping
cliff. In a scree plot, scree is those components that are at the bottom of the sloping plot
of eigenvalues versus component number. The plot provides a visual aid for deciding at
what point including additional components no longer increases the amount of variance
accounted for by a nontrivial amount. Here is the scree plot produced by PASW:
Initial Eigenvalues
Component    Total    % of Variance    Cumulative %
1            3.313        47.327           47.327
2            2.616        37.369           84.696
3             .575         8.209           92.905
4             .240         3.427           96.332
5             .134         1.921           98.252
6             .090         1.221           99.473
7             .040          .527          100.000
Extraction Method: Principal Component Analysis.
For our beer data, only the first two components have eigenvalues greater than
1. There is a big drop in eigenvalue between component 2 and component 3. On a
scree plot, components 3 through 7 would appear as scree at the base of the cliff
composed of components 1 and 2. Together components 1 and 2 account for 85% of
the total variance. We shall retain only the first two components.
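The eigenvalue-one rule is easy to apply mechanically; the scree-plot judgment, of course, stays human. A small Python sketch (mine), using the beer-data eigenvalues reported above:

```python
# Eigenvalues from the beer-data PCA reported above.
eigenvalues = [3.313, 2.616, 0.575, 0.240, 0.134, 0.090, 0.037]
p = len(eigenvalues)

# Kaiser's rule: retain components with eigenvalue >= 1, i.e., components
# accounting for at least as much variance as a single variable.
retained = [ev for ev in eigenvalues if ev >= 1.0]
n_retained = len(retained)          # 2 components for these data

# Proportion of total variance accounted for by the retained components.
cumulative = sum(retained) / p      # about .85
```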
I often find it useful to try at least three different solutions, and then decide
among them which packages the variance in a way most pleasing to me. Here I could
try a one component, a two component, and a three component solution.
Loadings, Unrotated and Rotated
Another matrix of interest is the loading matrix, also known as the factor
pattern matrix or the component matrix. The entries in this matrix, loadings, are
correlations between the components and the variables. Since the two components are
orthogonal, these correlation coefficients are also beta weights, that is,
X_j = A_1j F_1 + A_2j F_2 + U_j, thus A_1j equals the number of standard deviations that X_j
changes for each one standard deviation change in Factor 1. Here is the loading matrix
for our beer data:
[Scree plot: eigenvalue (0.0 to 3.5) on the ordinate versus component number (1 to 7) on the abscissa]
As you can see, almost all of the variables load well on the first component, all
positively except reputation. The second component is more interesting, with three large
positive loadings and three large negative loadings. Component 1 seems to reflect
concern for economy and quality versus reputation. Component 2 seems to reflect
economy versus quality.
Remember that each component represents an orthogonal (perpendicular)
dimension. Fortunately, we retained only two dimensions, so I can plot them on paper.
If we had retained more than two components, we could look at several pairwise plots
(two components at a time).
For each variable I have plotted in the vertical dimension its loading on
component 1, and in the horizontal dimension its loading on component 2. Wouldn't it
be nice if I could rotate these axes so that the two dimensions passed more nearly
through the two major clusters (COST, SIZE, ALCOHOL and COLOR, AROMA, TASTE)?
Imagine that the two axes are perpendicular wires joined at the origin (0,0) with a pin. I
rotate them, preserving their perpendicularity, so that the one axis passes through or
near the one cluster, the other through or near the other cluster. The number of
degrees by which I rotate the axes is the angle PSI. For these data, rotating the axes
-40.63 degrees has the desired effect.
Here is the loading matrix after rotation:
Component Matrixa
            Component
            1        2
COLOR      .760    -.576
AROMA      .736    -.614
REPUTAT   -.735    -.071
TASTE      .710    -.646
COST       .550     .734
ALCOHOL    .632     .699
SIZE       .667     .675
Extraction Method: Principal Component Analysis.
a. 2 components extracted.
Rotated Component Matrixa
            Component
            1        2
TASTE      .960    -.028
AROMA      .958     .010
COLOR      .952     .060
SIZE       .070     .947
ALCOHOL    .020     .942
COST      -.061     .916
REPUTAT   -.512    -.533
Extraction Method: Principal Component Analysis.
Rotation Method: Varimax with Kaiser Normalization.
a. Rotation converged in 3 iterations.
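Because only two components were retained, the rotation is just a rotation of the plane. As a check (a Python/NumPy sketch of mine, not part of the handout), multiplying the unrotated loadings by the 2 x 2 rotation matrix for psi = -40.63 degrees reproduces the rotated loadings within rounding:

```python
import numpy as np

# Unrotated loadings (Component Matrix above). Rows are
# COLOR, AROMA, REPUTAT, TASTE, COST, ALCOHOL, SIZE.
A = np.array([[ .760, -.576],
              [ .736, -.614],
              [-.735, -.071],
              [ .710, -.646],
              [ .550,  .734],
              [ .632,  .699],
              [ .667,  .675]])

psi = np.deg2rad(-40.63)
T = np.array([[np.cos(psi), -np.sin(psi)],
              [np.sin(psi),  np.cos(psi)]])

# Post-multiplying by the rotation matrix gives the rotated loadings;
# e.g., the COLOR row comes out near (.952, .06), matching the
# Rotated Component Matrix.
rotated = A @ T
```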
Number of Components in the Rotated Solution
I generally will look at the initial, unrotated, extraction and make an initial
judgment regarding how many components to retain. Then I will obtain and inspect
rotated solutions with that many, one less than that many, and one more than that many
components. I may use a "meaningfulness" criterion to help me decide which solution
to retain: if a solution leads to a component which is not well defined (has no or very
few variables loading on it) or which just does not make sense, I may decide not to
accept that solution.
One can err in the direction of extracting too many components (overextraction)
or too few components (underextraction). Wood, Tataryn, and Gorsuch (1996,
Psychological Methods, 1, 354-365) have studied the effects of under- and over-
extraction in principal factor analysis with varimax rotation. They used simulation
methods, sampling from populations where the true factor structure was known. They
found that overextraction generally led to less error (differences between the structure
of the obtained factors and that of the true factors) than did underextraction. Of course,
extracting the correct number of factors is the best solution, but it might be a good
strategy to lean towards overextraction to avoid the greater error found with
underextraction.
Wood et al. did find one case in which overextraction was especially problematic:
the case where the true factor structure is that there is only a single factor, there are
no unique variables (variables which do not share variance with others in the data set),
and where the statistician extracts two factors and employs a varimax rotation (the type
I used with our example data). In this case, they found that the first unrotated factor had
loadings close to those of the true factor, with only low loadings on the second factor.
However, after rotation, factor splitting took place for some of the variables: the
obtained solution grossly underestimated their loadings on the first factor and
overestimated them on the second factor. That is, the second factor was imaginary and
the first factor was corrupted. Interestingly, if there were unique variables in the data
set, such factor splitting was not a problem. The authors suggested that one include
unique variables in the data set to avoid this potential problem. I suppose one could do
this by including "filler" items on a questionnaire. The authors recommend using a
random number generator to create the unique variables or manually inserting into the
correlation matrix variables that have a zero correlation with all others. These unique
variables can be removed for the final analysis, after determining how many factors to
retain.
Explained Variance
The PASW output also gives the variance explained by each component after
the rotation. The variance explained is equal to the sum of squared loadings (SSL)
across variables. For component 1 that is (.76² + .74² + ... + .67²) = 3.31 = its
eigenvalue before rotation, and (.96² + .96² + ... + (-.51)²) = 3.02 after rotation. For
component 2 the SSLs are 2.62 and 2.91. After rotation component 1 accounted for
3.02/7 = 43% of the total variance and 3.02 / (3.02 + 2.91) = 51% of the variance
distributed between the two components. After rotation the two components together
account for (3.02 + 2.91)/7 = 85% of the total variance.
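These sums of squared loadings can be verified directly from the rotated loading matrix. A quick Python check (my sketch, using the rotated loadings reported above):

```python
# Rotated loadings for (TASTE, AROMA, COLOR, SIZE, ALCOHOL, COST, REPUTAT).
loadings = [(.960, -.028), (.958, .010), (.952, .060), (.070, .947),
            (.020, .942), (-.061, .916), (-.512, -.533)]

# Variance explained by a component after rotation = the sum of its
# squared loadings (SSL) across the variables.
ssl1 = sum(a ** 2 for a, b in loadings)   # about 3.02
ssl2 = sum(b ** 2 for a, b in loadings)   # about 2.91

p = len(loadings)
total_after_rotation = (ssl1 + ssl2) / p  # about .85 of the total variance
```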
The SSLs for components can be used to help decide how many components to
retain. An after rotation SSL is much like an eigenvalue. A rotated component with an
SSL of 1 accounts for as much of the total variance as does a single variable. One may
want to retain and rotate a few more components than indicated by the eigenvalue-of-1-or-
more criterion. Inspection of the retained components' SSLs after rotation should tell
you whether or not they should be retained. Sometimes a component with an
eigenvalue > 1 will have a postrotation SSL < 1, in which case you may wish to drop it
and ask for a smaller number of retained components.
You also should look at the postrotation loadings to decide how well each
retained component is defined. If only one variable loads heavily on a component, that
component is not well defined. If only two variables load heavily on a component, the
component may be reliable if those two variables are highly correlated with one another
but not with the other variables.
Naming Components
Total Variance Explained (Rotation Sums of Squared Loadings)
Component    Total    % of Variance    Cumulative %
1            3.017        43.101           43.101
2            2.912        41.595           84.696
Extraction Method: Principal Component Analysis.
Now let us look at the rotated loadings again and try to name the two
components. Component 1 has heavy loadings (>.4) on TASTE, AROMA, and COLOR
and a moderate negative loading on REPUTATION. I'd call this component
AESTHETIC QUALITY. Component 2 has heavy loadings on large SIZE, high
ALCOHOL content, and low COST and a moderate negative loading on REPUTATION.
I'd call this component CHEAP DRUNK.
Communalities
Let us also look at the SSL for each variable across factors. Such an SSL is
called a communality. This is the amount of the variable's variance that is accounted
for by the components (since the loadings are correlations between variables and
components and the components are orthogonal, a variable's communality represents
the R² of the variable predicted from the components). For our beer data the
communalities are COST, .84; SIZE, .90; ALCOHOL, .89; REPUTAT, .55; COLOR, .91;
AROMA, .92; and TASTE, .92.
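A communality is the same arithmetic applied across a row of the loading matrix instead of down a column. A sketch (mine), again from the rotated loadings above:

```python
# Rotated loadings for each variable on the two retained components.
loadings = {"TASTE": (.960, -.028), "AROMA": (.958, .010),
            "COLOR": (.952, .060), "SIZE": (.070, .947),
            "ALCOHOL": (.020, .942), "COST": (-.061, .916),
            "REPUTAT": (-.512, -.533)}

# Communality = SSL across the retained components for one variable;
# with orthogonal components this is the R-squared of the variable
# predicted from the components.
communality = {v: a ** 2 + b ** 2 for v, (a, b) in loadings.items()}
# e.g., COST comes out near .84 and REPUTAT near .55, as reported above.
```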
Orthogonal Versus Oblique Rotations
The rotation I used on these data is the VARIMAX rotation. It is the most
commonly used rotation. Its goal is to minimize the complexity of the components by
making the large loadings larger and the small loadings smaller within each component.
There are other rotational methods. QUARTIMAX rotation makes large loadings larger
and small loadings smaller within each variable. EQUAMAX rotation is a compromise
that attempts to simplify both components and variables. These are all orthogonal
rotations, that is, the axes remain perpendicular, so the components are not correlated
with one another.
It is also possible to employ oblique rotational methods. These methods do not
produce orthogonal components. Suppose you have done an orthogonal rotation and
you obtain a rotated loadings plot that looks like this:
Communalities
          Initial   Extraction
COST      1.000     .842
SIZE      1.000     .901
ALCOHOL   1.000     .889
REPUTAT   1.000     .546
COLOR     1.000     .910
AROMA     1.000     .918
TASTE     1.000     .922
Extraction Method: Principal Component Analysis.
The cluster of points midway between
axes in the upper left quadrant indicates that
a third component is present. The two
clusters in the upper right quadrant indicate
that the data would be better fit with axes
that are not orthogonal. Axes drawn
through those two clusters would not be
perpendicular to one another. We shall
return to the topic of oblique rotation later.
Return to Multivariate Analysis with PASW
Continue on to Factor Analysis with PASW
Copyright 2011, Karl L. Wuensch - All rights reserved.
Factor Analysis - PASW
First Read Principal Components Analysis.
The methods we have employed so far attempt to repackage all of the variance
in the p variables into principal components. We may wish to restrict our analysis to
variance that is common among variables. That is, when repackaging the variables'
variance we may wish not to redistribute variance that is unique to any one variable.
This is Common Factor Analysis. A common factor is an abstraction, a hypothetical
dimension that affects at least two of the variables. We assume that there is also one
unique factor for each variable, a factor that affects that variable but does not affect any
other variables. We assume that the p unique factors are uncorrelated with one another
and with the common factors. It is the variance due to these unique factors that we
shall exclude from our FA.
Iterated Principal Factors Analysis
The most common sort of FA is principal axis FA, also known as principal
factor analysis. This analysis proceeds very much like that for a PCA. We eliminate
the variance due to unique factors by replacing the 1's on the main diagonal of the
correlation matrix with estimates of the variables' communalities. Recall that a variable's
communality, its SSL across components or factors, is the amount of the variable's
variance that is accounted for by the components or factors. Since our factors will be
common factors, a variable's communality will be the amount of its variance that is
common rather than unique. The R² between a variable and all other variables is most
often used initially to estimate a variable's communality.
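That squared multiple correlation (SMC) can be pulled straight from the inverse of the correlation matrix: SMC_i = 1 - 1/(R⁻¹)_ii. A minimal Python sketch (mine, with a made-up correlation matrix standing in for the beer data):

```python
import numpy as np

# Hypothetical three-variable correlation matrix.
R = np.array([[1.0, 0.5, 0.4],
              [0.5, 1.0, 0.3],
              [0.4, 0.3, 1.0]])

# Initial communality estimate for variable i: its squared multiple
# correlation with all other variables, SMC_i = 1 - 1/inv(R)_ii.
smc = 1.0 - 1.0 / np.diag(np.linalg.inv(R))

# Principal axis factoring replaces the 1's on the main diagonal
# with these estimates before extracting factors.
R_reduced = R.copy()
np.fill_diagonal(R_reduced, smc)
```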
Using the beer data, change the extraction method to principal axis:
When you ran the FactOut.sas program, it produced an output file,
FactBeer.dat, which contained factor scores, SES, and GROUP data for each subject.
Suppose that we wished to compare the two groups on the remaining three variables
using a discriminant function analysis. Here is how to do that analysis:
First, import the data file. Click File, Read Text Data, and point to the
FactBeer.dat file. Tell the wizard that there is no predefined format, the file is delimited,
no variable names at top, data starts on line 1, each line is a case, read all cases, and
the delimiter is a space. Name the first variable AesthetQ, the second CheapDr, the
third SES, and the fourth Group. Click Analyze, Classify, Discriminant. Put Group in
the Grouping Variable box, and AesthetQ, CheapDr, and SES in the Independents box.
Enter the independents together. For Statistics, ask for Means, Univariate ANOVAs,
and Box's M. Under Classify, ask for Computation of Priors From Group Sizes, a
Summary Table, and Separate-Groups Plots.
Look at the output. The Group Statistics and Tests of Equality of Group Means
show that, compared to Group 1, Group 2 is significantly more interested in the
aesthetic quality of their beer, significantly less interested in getting a cheap drunk, and
significantly higher in SES.
The discriminant analysis produces a weighted linear combination of the three
independent variables on which the two groups differ as much as possible. This
weighted linear combination is called the discriminant function, D. If we were to
conduct an ANOVA comparing the two groups on this weighted linear combination, the
ratio of the between-groups SS to the within-groups SS would be the eigenvalue (the
value which is maximized by the weights obtained for the linear combination). The
canonical correlation is the square root of the ratio of the between-groups SS to the
total SS. The square of this quantity is the same as eta-squared in ANOVA.
Accordingly, for our data, group membership accounts for 64% of the variance in
AesthetQ, CheapDr, and SES.
Wilks' lambda is used to test the significance of this canonical correlation. The
smaller the Wilks' lambda, the smaller the p value. SPSS uses a chi-square statistic to
approximate the p value. It is significant for our data -- that is, our groups differ
significantly on an optimally weighted combination of AesthetQ, CheapDr, and SES --
or, from another perspective, using the discriminant function to predict group
membership from scores on AesthetQ, CheapDr, and SES is significantly better than
just guessing group membership. Box's M tests one of the assumptions of this test of
significance. The nonsignificant result for our data tells us that we have no problem with
the assumption that our two groups have equal variance/covariance matrices.
The standardized discriminant function coefficients (the standardized
weighting coefficients for computing D) and the loadings (in the structure matrix, the
correlations between the predictor variables and D) indicate that SES is most important
Developed by Sewall Wright, path analysis is a method employed to determine
whether or not a multivariate set of nonexperimental data fits well with a particular (a
priori) causal model. Elazar J. Pedhazur (Multiple Regression in Behavioral Research,
2nd edition, Holt, Rinehart and Winston, 1982) has a nice introductory chapter on path
analysis which is recommended reading for anyone who intends to use path analysis.
This lecture draws heavily upon the material in Pedhazur's book.
Consider the path diagram presented in Figure 1.
Figure 1
[Path diagram: variables 1 (SES) and 2 (IQ) are exogenous, with r12 = .3; path coefficients p31 = .398, p32 = .041, p41 = .009, p42 = .501, p43 = .416; error terms E3 and E4 point to the endogenous variables 3 and 4]
Each circle represents a variable. We have data on each variable for each subject. In
this diagram SES and IQ are considered to be exogenous variables -- their variance is
assumed to be caused entirely by variables not in the causal model. The connecting
line with arrows at both ends indicates that the correlation between these two variables
will remain unanalyzed because we choose not to identify one variable as a cause of
the other variable. Any correlation between these variables may actually be causal (1
causing 2 and/or 2 causing 1) and/or may be due to 1 and 2 sharing common causes.
Q = (1 - R²F) / (1 - R²R). For our
example, Q = (1 - .437)/(1 - .297) = .801. A perfect fit would give a Q of one; less than
perfect fits yield Q's less than one. Q can also be computed simply by taking the
product of the squares of the error path coefficients for the full model and dividing by
the product of the squares of the error path coefficients for the reduced model. For our
model, Q = [(.968)²(.775)²] / [(.968)²(.866)²] = .801.
Finally, we compute the test statistic, W = -(N - d) ln(Q) where N = sample
size (let us assume N = 100), d = number of overidentifying restrictions (number of
paths eliminated from the full model to yield the reduced model), and ln = natural
logarithm. For our data, W = -(100 - 1) * ln(.801) = 21.97. W is evaluated with the
chi-square distribution on d degrees of freedom. For our W, p <.0001 and we conclude
that the model does not fit the data well. If you compute W for Figure 4A or 4B you will
obtain W = 0, p = 1.000, indicating perfect fit for those models.
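The Q and W computations are simple enough to script. Here is a Python sketch (mine; the chi-square p value itself would come from a table or a stats library):

```python
import math

def goodness_of_fit_W(r2_full, r2_reduced, n, d):
    """Specification-error test for an overidentified path model.
    d = number of overidentifying restrictions (paths dropped)."""
    Q = (1 - r2_full) / (1 - r2_reduced)
    W = -(n - d) * math.log(Q)   # evaluated as chi-square on d df
    return Q, W

# The example above: R-squared full = .437, reduced = .297, N = 100, d = 1.
Q, W = goodness_of_fit_W(0.437, 0.297, 100, 1)
# Q comes out near .801 and W near 21.97, as in the text.
```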
The overidentified model (Figure 5) we tested had only one restriction (no p23),
so we could have more easily tested it by testing the significance of p23 in Figure 3B.
Our overidentified models may, however, have several such restrictions, and the
method I just presented allows us simultaneously to test all those restrictions. Consider
the overidentified model in Figure 6. The just-identified model in Figure 1 is an
appropriate model for computing R²F. Please note that the model in Figure 1A (with IQ
endogenous rather than exogenous) would not be appropriate, since that model would
include an E2 path, but our overidentified model has no E2 path (paths are drawn from
extraneous variables to endogenous variables but not to exogenous variables). In the
just-identified model E3 = sqrt(1 - R²3.12) = .911 and E4 = sqrt(1 - R²4.123) = .710. For our
overidentified model E3 = sqrt(1 - R²3.1) = .912 and E4 = sqrt(1 - R²4.23) = .710.
R²F = 1 - (.911)²(.710)² = .582. R²R = 1 - (.912)²(.710)² = .581. Q = (1 - .582)/(1 - .581)
= .998. If N = 100, W = -(100 - 2) * ln(.998) = 0.196 on 2 df (we eliminated two paths), p
= .91, so our overidentified model fits the data well.
Let us now evaluate three different overidentified models, all based on the same
correlation matrix. Variable V is a measure of the civil rights attitudes of the Voters in
116 congressional districts (N = 116). Variable C is the civil rights attitude of the
Congressman in each district. Variable P is a measure of the congressmen's
Perceptions of their constituents' attitudes towards civil rights. Variable R is a measure
of the congressmen's Roll Call behavior on civil rights. Suppose that three different a
priori models have been identified (each based on a different theory), as shown in
Figure 8.
Figure 8
Just-Identified Model
[Path diagram: variables V, P, C, R; error paths E_C = .867, E_P = .595, E_R; remaining coefficients .498, .324, .075, .560, .366, .507, .556]
For the just-identified model, R²F = 1 - (.867)²(.595)²(.507)² = .9316.
Model X
[Path diagram: variables V, P, C, R; error paths E_C = .766, E_P = .675, E_R; remaining coefficients .643, .327, .613, .510, .738]
For Model X, R²R = 1 - (.766)²(.675)²(.510)² = .9305.
Model Y
[Path diagram: variables V, P, C, R; error paths E_C = .867, E_P = .766, E_R; remaining coefficients .498, .327, .613, .643, .510]
For Model Y, R²R = 1 - (.867)²(.766)²(.510)² = .8853.
Model Z
[Path diagram: variables V, P, C, R; error paths E_C = .867, E_P = .595, E_R; remaining coefficients .498, .327, .613, .366, .510, .556]
For Model Z, R²R = 1 - (.867)²(.595)²(.510)² = .9308.
Goodness of fit: Q_X = (1 - .9316)/(1 - .9305) = .0684/.0695 = .9842.
Q_Y = .0684/(1 - .8853) = .5963.
Q_Z = .0684/(1 - .9308) = .9884.
It appears that Models X and Z fit the data better than does Model Y.
W_X = -(116 - 2) ln(.9842) = 1.816. On 2 df (two restrictions, no paths from V to
C or to R), p = .41. We do not reject the null hypothesis that Model X fits the data.
W_Y = -(116 - 2) ln(.5963) = 58.939, on 2 df, p < .0001. We do reject the null
and conclude that Model Y does not fit the data.
W_Z = -(116 - 1) ln(.9884) = 1.342, on 1 df (only one restriction, no V to R direct
path), p = .25. We do not reject the null hypothesis that Model Z fits the data. It seems
that Model Y needs some revision, but Models X and Z have passed this test.
Sometimes one can test the null hypothesis that one reduced model fits the data
as well as does another reduced model. One of the models must be nested within the
other -- that is, it can be obtained by eliminating one or more of the paths in the other.
For example, Model Y is exactly like Model Z except that the path from V to P has been
eliminated in Model Y. Thus, Y is nested within Z. To test the null, we compute
W = -(N - d) ln[(1 - M1)/(1 - M2)], where M2 is for the model that is nested within the
other, and d is the number of paths eliminated from the other model to obtain the
nested model. For Z versus Y, on d = 1 df, W = -(116 - 1) * ln[(1 - .9308)/(1 - .8853)]
= 58.112, p < .0001. We reject the null hypothesis. Removing the path from V to P
significantly reduced the fit between the model and the data.
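All of these W tests follow the same pattern, so they script easily. A Python sketch (mine), reusing the numbers from the text:

```python
import math

def W(r2_fuller, r2_nested, n, d):
    # W = -(N - d) * ln[(1 - M1)/(1 - M2)], evaluated as chi-square on d df,
    # where M2 belongs to the model nested within the other.
    return -(n - d) * math.log((1 - r2_fuller) / (1 - r2_nested))

N = 116
W_x = W(0.9316, 0.9305, N, 2)   # about 1.82,  2 df
W_y = W(0.9316, 0.8853, N, 2)   # about 58.94, 2 df
W_z = W(0.9316, 0.9308, N, 1)   # about 1.34,  1 df

# Nested comparison: Model Y (V-to-P path dropped) is nested within Model Z.
W_zy = W(0.9308, 0.8853, N, 1)  # about 58.11, 1 df
```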
Return to Wuensch's Stats Lessons Page
Copyright 2008, Karl L. Wuensch - All rights reserved.
Conducting a Path Analysis With SPSS/AMOS
Download the PATH-INGRAM.sps data file from my SPSS data page and then
bring it into SPSS. The data are those from the research that led to this publication:
Ingram, K. L., Cope, J. G., Harju, B. L., & Wuensch, K. L. (2000). Applying to graduate
school: A test of the theory of planned behavior. Journal of Social Behavior and
Personality, 15, 215-226.
Obtain the simple correlations among the variables:
Correlations
Attitude SubNorm PBC Intent Behavior
Attitude Pearson Correlation 1.000 .472 .665 .767 .525
SubNorm Pearson Correlation .472 1.000 .505 .411 .379
PBC Pearson Correlation .665 .505 1.000 .458 .496
Intent Pearson Correlation .767 .411 .458 1.000 .503
Behavior Pearson Correlation .525 .379 .496 .503 1.000
One can conduct a path analysis with a series of multiple regression analyses.
We shall test a model corresponding to Ajzen's Theory of Planned Behavior; look at
the model presented in the article cited above, which is available online. Notice that the
final variable, Behavior, has paths to it only from Intention and PBC. To find the
coefficients for those paths we simply conduct a multiple regression to predict Behavior
from Intention and PBC. Here is the output.
Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .585a   .343       .319                13.74634
a. Predictors: (Constant), PBC, Intent

ANOVAb
Model          Sum of Squares   df   Mean Square   F        Sig.
1  Regression    5611.752        2    2805.876     14.849   .000a
   Residual     10770.831       57     188.962
   Total        16382.583       59
a. Predictors: (Constant), PBC, Intent
b. Dependent Variable: Behavior
Coefficientsa
                Unstandardized Coefficients   Standardized
Model           B         Std. Error          Beta           t        Sig.
1  (Constant)   -11.346   10.420                             -1.089   .281
   Intent         1.520     .525              .350            2.894   .005
   PBC             .734     .264              .336            2.781   .007
a. Dependent Variable: Behavior
The Beta weights are the path coefficients leading to Behavior: .336 from PBC
and .350 from Intention.
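Because these path coefficients are standardized regression weights, they can also be computed directly from the correlation matrix: beta = Rxx⁻¹ rxy. A Python/NumPy check (my sketch), using the correlations printed above:

```python
import numpy as np

# Correlations among the predictors of Behavior (Intent, PBC) and
# their correlations with Behavior, from the correlation table above.
R_xx = np.array([[1.000, 0.458],
                 [0.458, 1.000]])
r_xy = np.array([0.503, 0.496])

# Standardized regression weights (the path coefficients to Behavior):
# solve R_xx @ beta = r_xy.
beta = np.linalg.solve(R_xx, r_xy)
# beta comes out near (.350, .336) for Intent and PBC, matching the output.
```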
In the model Intention has paths to it from Attitude, Subjective Norm, and
Perceived Behavioral Control, so we predict Intention from Attitude, Subjective Norm,
and Perceived Behavioral Control. Here is the output:
Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .774a   .600       .578                2.48849
a. Predictors: (Constant), PBC, SubNorm, Attitude

ANOVAb
Model          Sum of Squares   df   Mean Square   F        Sig.
1  Regression     519.799        3    173.266      27.980   .000a
   Residual       346.784       56      6.193
   Total          866.583       59
a. Predictors: (Constant), PBC, SubNorm, Attitude
b. Dependent Variable: Intent
Coefficientsa
                Unstandardized Coefficients   Standardized
Model           B        Std. Error           Beta           t        Sig.
1  (Constant)   3.906    1.828                                2.137   .037
   Attitude      .444     .064                .807            6.966   .000
   SubNorm       .029     .031                .095             .946   .348
   PBC          -.064     .059               -.126           -1.069   .290
a. Dependent Variable: Intent
The path coefficients leading to Intention are: .807 from Attitude, .095 from
Subjective Norms, and -.126 from Perceived Behavioral Control.
AMOS
Now let us use AMOS. The data file is already open in SPSS. Click Analyze,
AMOS 16. In the AMOS window which will open click File, New:
Click on the Draw observed variables icon which I have circled on the image
above. Move the cursor over into the drawing space on the right. Hold down the left
mouse button while you move the cursor to draw a rectangle. Release the mouse
button and move the cursor to another location and draw another rectangle. Annoyed
that you can't draw five rectangles of the same dimensions? Do it this way instead:
Draw one rectangle. Now click the Duplicate Objects icon, boxed in black in the
image below, point at that rectangle, hold down the left mouse button while you move to
the desired location for the second rectangle, and release the mouse button.
Draw five rectangles arranged something like this:
You can change the shape of the rectangles later, using the Change the shape
of objects tool (boxed in green in the image above), and you can move the rectangles
later using the Move objects tool (boxed in blue in the image above).
Click on the List variables in data set icon (boxed in orange in the image
above). From the window that results, drag and drop variable names to the boxes.
A more cumbersome way to do this is: Right-click the rectangle, select Object
Properties, then enter in the Object Properties window the name of the observed
variable. Close the window and enter variable names in the remaining rectangles in the
same way.
Click on the Draw paths icon (the single-headed arrow boxed in purple in the
image below) and then draw a path from Attitude to Intent (hold down the left mouse
button at the point you wish to start the path and then drag it to the ending point and
release the mouse button). Also draw paths from SubNorm to Intent, PBC to Intent,
PBC to Behavior, and Intent to Behavior.
Click on the Draw Covariances icon (the double-headed arrow boxed in purple
in the image above) and draw a path from SubNorm to Attitude. Draw another from
PBC to SubNorm and one from PBC to Attitude. You can use the Change the shape of
objects tool (boxed in green in the image above) to increase or decrease the arc of
these paths just select that tool, put the cursor on the path to be changed, hold down
the left mouse button, and move the mouse.
Click on the Add a unique variable to an existing variable icon (boxed in red in
the image above) and then move the cursor over the Intent variable and click the left
mouse button to add the error variable. Do the same to add an error variable to the
Behavior variable. Right-click the error circle leading to Intent, Select Object Properties,
and name the variable e1. Name the other error circle e2.
Click the Analysis properties icon -- to display the Analysis Properties
window. Select the Output tab and ask for the output shown below.
Click on the Calculate estimates icon . In the Save As window browse
to the desired folder and give the file a name. Click Save.
Change the Parameter Formats setting (boxed in red in the image below) to
Standardized estimates if it is not already set that way. Click the View the output path
diagram icon (boxed in red in the image below) and zap, you get the path analysis
diagram.
Click the View text icon to see extensive text output from the analysis.
The Copy to Clipboard icon (in green, above) can be used to copy the output to
another document via the clipboard. Click the Options icon (in red, above) to select
whether you want to view/copy just part of the output or all of the output.
Here are some parts of the output with my comments in green:
Variable Summary (Group number 1)
Your model contains the following variables (Group number 1)
Observed, endogenous variables
Intent
Behavior
Observed, exogenous variables
Attitude
PBC
SubNorm
Unobserved, exogenous variables
e1
e2
Variable counts (Group number 1)
Number of variables in your model: 7
Number of observed variables: 5
Number of unobserved variables: 2
Number of exogenous variables: 5
Number of endogenous variables: 2
Parameter summary (Group number 1)
Weights Covariances Variances Means Intercepts Total
Fixed 2 0 0 0 0 2
Labeled 0 0 0 0 0 0
Unlabeled 5 3 5 0 0 13
Total 7 3 5 0 0 15
Models
Default model (Default model)
Notes for Model (Default model)
Computation of degrees of freedom (Default model)
Number of distinct sample moments: 15
Number of distinct parameters to be estimated: 13
Degrees of freedom (15 - 13): 2
Result (Default model)
Minimum was achieved
Chi-square = .847
Degrees of freedom = 2
Probability level = .655
This Chi-square tests the null hypothesis that the overidentified (reduced) model
fits the data as well as does a just-identified (full, saturated) model. In a just-identified
model there is a direct path (not through an intervening variable) from each variable to
each other variable. When you delete one or more of the paths you obtain an
overidentified model. The nonsignificant Chi-square here indicates that the fit between
our overidentified model and the data is not significantly worse than the fit between the
just-identified model and the data. You can see the just-identified model here. While
one might argue that nonsignificance of this Chi-square indicates that the reduced
model fits the data well, even a well-fitting reduced model will be significantly different
from the full model if sample size is sufficiently large. A good fitting model is one that
can reproduce the original variance-covariance matrix (or correlation matrix) from the
path coefficients, in much the same way that a good factor analytic solution can
reproduce the original correlation matrix with little error.
Maximum Likelihood Estimates
Do note that the parameters are estimated by maximum likelihood (ML) methods rather
than by ordinary least squares (OLS) methods. OLS methods minimize the squared
deviations between values of the criterion variable and those predicted by the model.
ML (an iterative procedure) attempts to maximize the likelihood that obtained values of
the criterion variable will be correctly predicted.
Standardized Regression Weights: (Group number 1 - Default model)
Estimate
Intent   <--- SubNorm    .095
Intent   <--- PBC       -.126
Intent   <--- Attitude   .807
Behavior <--- Intent     .350
Behavior <--- PBC        .336
The path coefficients above match those we obtained earlier by multiple regression.
Correlations: (Group number 1 - Default model)
Estimate
Attitude <--> PBC .665
Attitude <--> SubNorm .472
PBC <--> SubNorm .505
Above are the simple correlations between exogenous variables.
Squared Multiple Correlations: (Group number 1 - Default model)
Estimate
Intent .600
Behavior .343
Above are the squared multiple correlation coefficients we saw in the two multiple
regressions.
The total effect of one variable on another can be divided into direct effects (no
intervening variables involved) and indirect effects (through one or more intervening
variables). Consider the effect of PBC on Behavior. The direct effect is .336 (the path
coefficient from PBC to Behavior). The indirect effect, through Intention, is computed as
the product of the path coefficient from PBC to Intention and the path coefficient from
Intention to Behavior, (-.126)(.350) = -.044. The total effect is the sum of the direct and
indirect effects, .336 + (-.044) = .292.
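As a check, this decomposition can be recomputed directly from the standardized coefficients in the output tables (a minimal sketch; the variable names are mine, not AMOS's):

```python
# Decomposition of effects for PBC -> Behavior, recomputed from the
# standardized coefficients reported in the AMOS output above.
p_pbc_intent = -.126      # PBC -> Intent
p_intent_behavior = .350  # Intent -> Behavior
p_pbc_behavior = .336     # PBC -> Behavior (direct)

indirect = p_pbc_intent * p_intent_behavior  # through Intent
total = p_pbc_behavior + indirect

print(round(indirect, 3))  # -0.044
print(round(total, 3))     # 0.292
```

The results match the Standardized Total and Indirect Effects tables below.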
Standardized Total Effects (Group number 1 - Default model)
SubNorm PBC Attitude Intent
Intent .095 -.126 .807 .000
Behavior .033 .292 .282 .350
Standardized Direct Effects (Group number 1 - Default model)
SubNorm PBC Attitude Intent
Intent .095 -.126 .807 .000
Behavior .000 .336 .000 .350
Standardized Indirect Effects (Group number 1 - Default model)
SubNorm PBC Attitude Intent
Intent .000 .000 .000 .000
Behavior .033 -.044 .282 .000
Model Fit Summary
CMIN
Model NPAR CMIN DF P CMIN/DF
Default model 13 .847 2 .655 .424
Saturated model 15 .000 0
Independence model 5 134.142 10 .000 13.414
NPAR is the number of parameters in the model. In the saturated (just-identified) model
there are 15 parameters: 5 variances (one for each variable) and 10 path coefficients.
For our tested (default) model there are 13 parameters; we dropped two paths. For
the independence model (one in which all of the paths have been deleted) there are five
parameters (the variances of the five variables).
CMIN is a Chi-square statistic comparing the tested model and the independence model
with the saturated model. We saw the former a bit earlier. CMIN/DF, the relative chi-
square, is an index of how much the fit of data to model has been reduced by dropping
one or more paths. One rule of thumb is to decide you have dropped too many paths if
this index exceeds 2 or 3.
RMR, GFI
Model RMR GFI AGFI PGFI
Default model 3.564 .994 .957 .133
Saturated model .000 1.000
Independence model 36.681 .471 .207 .314
RMR, the root mean square residual, is an index of the amount by which the estimated
(by your model) variances and covariances differ from the observed variances and
covariances. Smaller is better, of course.
GFI, the goodness of fit index, tells you what proportion of the variance in the sample
variance-covariance matrix is accounted for by the model. This should exceed .9 for a
good model. For the full model it will be a perfect 1. AGFI (adjusted GFI) is an
alternate GFI index in which the value of the index is adjusted for the number of
parameters in the model. The fewer the number of parameters in the model relative to
the number of data points (variances and covariances in the sample variance-
covariance matrix), the closer the AGFI will be to the GFI. With the PGFI (P is for
parsimony), the index is adjusted to reward simple models and to penalize models in
which few paths have been deleted. Note that for our data the PGFI is larger for the
independence model than for our tested model.
Baseline Comparisons
Model                NFI Delta1   RFI rho1   IFI Delta2   TLI rho2     CFI
Default model              .994       .968        1.009      1.046   1.000
Saturated model           1.000                   1.000               1.000
Independence model         .000       .000         .000       .000    .000
These goodness of fit indices compare your model to the independence model rather
than to the saturated model. The Normed Fit Index (NFI) is simply the difference
between the two models' chi-squares divided by the chi-square for the independence
model. For our data, that is (134.142 - .847)/134.142 = .994. Values of .9 or higher
(some say .95 or higher) indicate good fit. The Comparative Fit Index (CFI) uses a
similar approach (with a noncentral chi-square) and is said to be a good index for use
even with small samples. It ranges from 0 to 1, like the NFI, and .95 (or .9 or higher)
indicates good fit.
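The NFI arithmetic can be verified directly from the chi-square values in the CMIN table above (a quick sketch):

```python
# NFI = (chi-square of the independence model minus chi-square of the
# tested model) divided by the chi-square of the independence model.
chi_default = .847        # tested (default) model
chi_independence = 134.142

nfi = (chi_independence - chi_default) / chi_independence
print(round(nfi, 3))  # 0.994
```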
Parsimony-Adjusted Measures
Model PRATIO PNFI PCFI
Default model .200 .199 .200
Saturated model .000 .000 .000
Independence model 1.000 .000 .000
PRATIO is the ratio of how many paths you dropped to how many you could have
dropped (all of them). The Parsimony Normed Fit Index (PNFI), is the product of NFI
and PRATIO, and PCFI is the product of the CFI and PRATIO. The PNFI and PCFI are
intended to reward those whose models are parsimonious (contain few paths).
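These products can be recomputed from the values reported above (a small check; the degrees of freedom stand in for the counts of dropped and droppable paths):

```python
# Parsimony-adjusted indices recomputed from values reported above.
df_default, df_independence = 2, 10   # paths dropped vs. paths droppable
nfi, cfi = .994, 1.000                # from the Baseline Comparisons table

pratio = df_default / df_independence
pnfi = pratio * nfi
pcfi = pratio * cfi
print(round(pratio, 3), round(pnfi, 3), round(pcfi, 3))  # 0.2 0.199 0.2
```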
RMSEA
Model RMSEA LO 90 HI 90 PCLOSE
Default model .000 .000 .200 .693
Independence model .459 .391 .529 .000
The Root Mean Square Error of Approximation (RMSEA) estimates lack of fit compared
to the saturated model. RMSEA of .05 or less indicates good fit, and .08 or less
adequate fit. LO 90 and HI 90 are the lower and upper ends of a 90% confidence
interval on this estimate. PCLOSE is the p value testing the null that RMSEA is no
greater than .05.
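A common point-estimate formula for RMSEA, sqrt(max(chi-square - df, 0) / (df(N - 1))), reproduces both rows of the table above with N = 60. A sketch (the formula choice is my assumption; AMOS's internals may differ in detail):

```python
import math

def rmsea(chisq, df, n):
    # Point estimate: sqrt(max(chisq - df, 0) / (df * (n - 1))).
    # When chisq <= df the estimate is 0 (model fits at least as well
    # as expected by chance).
    return math.sqrt(max(chisq - df, 0) / (df * (n - 1)))

print(round(rmsea(.847, 2, 60), 3))      # 0.0   (default model)
print(round(rmsea(134.142, 10, 60), 3))  # 0.459 (independence model)
```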
HOELTER
Model                HOELTER .05   HOELTER .01
Default model                418           642
Independence model             9            11
If your sample were larger than this you would reject the null hypothesis that your model
fit the data just as well as does the saturated model.
The Just-Identified Model
Our Reduced Model
Matrix Input
AMOS will accept as input a correlation matrix (accompanied by standard
deviations and sample sizes) or a variance/covariance matrix. The SPSS syntax below
would input such a matrix:
MATRIX DATA VARIABLES=ROWTYPE_ Attitude SubNorm PBC Intent Behavior.
BEGIN DATA
N 60 60 60 60 60
SD 6.96 12.32 7.62 3.83 16.66
CORR 1
CORR .472 1
CORR .665 .505 1
CORR .767 .411 .458 1
CORR .525 .379 .496 .503 1
END DATA.
After running the syntax you would just click Analyze, AMOS, and proceed as
before. If you had the correlations but not the standard deviations, you could just
specify a value of 1 for each standard deviation. You would not be able to get the
unstandardized coefficients, but they are generally not of interest anyhow.
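If you ever need the covariance matrix itself, it can be rebuilt from the correlations and standard deviations in the MATRIX DATA block above, since cov(i, j) = r(i, j) * sd(i) * sd(j). A minimal sketch:

```python
# Rebuilding the variance/covariance matrix from the correlations and
# standard deviations given in the MATRIX DATA block above.
names = ["Attitude", "SubNorm", "PBC", "Intent", "Behavior"]
sd = [6.96, 12.32, 7.62, 3.83, 16.66]
r = [[1.000, .472, .665, .767, .525],
     [.472, 1.000, .505, .411, .379],
     [.665, .505, 1.000, .458, .496],
     [.767, .411, .458, 1.000, .503],
     [.525, .379, .496, .503, 1.000]]

cov = [[r[i][j] * sd[i] * sd[j] for j in range(5)] for i in range(5)]
print(round(cov[0][0], 2))  # 48.44, the variance of Attitude (6.96 squared)
```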
AMOS Files
Amos creates several files during the course of conducting a path analysis. Here
is what I have learned about them, mostly by trial and error.
.amw = a path diagram, with coefficients etc.
.amp = table output, with all the statistical output details. Open it with the AMOS
file manager.
.AmosOutput = looks the same as .amp, but takes up more space on the drive.
.AmosTN = thumbnail image of the path diagram
*.bk# = probably a backup file
Notes
To bring a path diagram into Word, just Edit, Copy to Clipboard, and then paste it
into Word.
If you pull up an .amw path diagram but have not specified an input data file,
you cannot alter the diagram and re-analyze the data. The .amw file includes the
coefficients etc., but not the input data.
If you input an altered data file and then call up the original .amw, you can
Calculate Estimates again and get a new set of coefficients etc. WARNING: when you
exit you will find that the old .amp and .AmosOutput have been updated with the
results of the analysis on the modified data. The original .amw file remains unaltered.
Links
Lesson by Garson at NCSU
Introduction to Path Analysis - maybe more than you want to know.
Wuensch's Stats Lessons Page
Karl L. Wuensch
Dept. of Psychology
East Carolina University
Greenville, NC 27858-4353
October, 2008
SEM-Intro.doc
An Introduction to Structural Equation Modeling (SEM)
SEM is a combination of factor analysis and multiple regression. It also goes by
the aliases causal modeling and analysis of covariance structure. Special cases of
SEM include confirmatory factor analysis and path analysis. You are already familiar
with path analysis, which is SEM with no latent variables.
The variables in SEM are measured (observed, manifest) variables (indicators)
and factors (latent variables). I think of factors as weighted linear combinations that we
have created/invented. Those who are fond of SEM tend to think of them as underlying
constructs that we have discovered.
Even though no variables may have been manipulated, variables and factors in
SEM may be classified as independent variables or dependent variables. Such
classification is made on the basis of a theoretical causal model, formal or informal.
The causal model is presented in a diagram where the names of measured variables
are within rectangles and the names of factors in ellipses. Rectangles and ellipses are
connected with lines having an arrowhead on one (unidirectional causation) or two (no
specification of direction of causality) ends.
Dependent variables are those which have one-way arrows pointing to them and
independent variables are those which do not. Dependent variables have residuals (are
not perfectly related to the other variables in the model) indicated by es (errors) pointing
to measured variables and ds pointing to latent variables.
The SEM can be divided into two parts. The measurement model is the part
which relates measured variables to latent variables. The structural model is the part
that relates latent variables to one another.
Statistically, the model is evaluated by comparing two variance/covariance
matrices. From the data a sample variance/covariance matrix is calculated. From this
matrix and the model an estimated population variance/covariance matrix is computed.
If the estimated population variance/covariance matrix is very similar to the known
sample variance/covariance matrix, then the model is said to fit the data well. A Chi-
square statistic is computed to test the null hypothesis that the model does fit the data
well. There are also numerous goodness of fit estimators designed to estimate how
well the model fits the data.
Sample Size. As with factor analysis, you should have lots of data when
evaluating a SEM. As usual, there are several rules of thumb. For a simple model, 200
cases might be adequate. When relationships among components of the model are
strong, 10 cases per estimated parameter may be adequate.
Assumptions. Multivariate normality is generally assumed. It is also assumed
that relationships between variables are linear, but powers of variables may be included
in the model to test polynomial relationships.
Problems. If one of the variables is a perfect linear combination of the other
variables, a singular matrix (which cannot be inverted) will cause the analysis to crash.
Multicollinearity can also be a problem.
An Example. Consider the model presented in Figure 14.4 of Tabachnick and
Fidell [Tabachnick, B. G., & Fidell, L. S. (2007). Using multivariate statistics (5th ed.).
Boston: Allyn & Bacon.]. There are five measurement variables (in rectangles) and two
latent variables (in ellipses). Two of the variables are considered independent (and
shaded), the others are considered dependent. From each latent variable there is a
path pointing to two indicators. From one measured variable (SenSeek) there is a path
pointing to a latent variable (SkiSat). Each measured variable has an error path leading
to it. Each latent variable has a disturbance path leading to it.
Parameters. The parameters of the model are regression coefficients for paths
between variables and variances/covariances of independent variables. Parameters
may be fixed to a certain value (usually 0 or 1) or may be estimated. In the
diagram, an asterisk (*) represents a parameter to be estimated. A "1" indicates that
the parameter has been fixed to the value 1. When two variables are not connected
by a path, the coefficient for that path is fixed at 0.
Tabachnick and Fidell used EQS to arrive at the final model displayed in their
Figure 14.5.
Model Identification. An identified model is one for which each of the
estimated parameters has a unique solution. To determine whether the model is
identified or not, compare the number of data points to the number of parameters to be
estimated. Since the input data set is the sample variance/covariance matrix, the
number of data points is the number of variances and covariances in that matrix, which
can be calculated as m(m + 1)/2, where m is the number of measured variables. For
T&F's example the number of data points is 5(6)/2 = 15.
If the number of data points equals the number of parameters to be estimated,
then the model is just-identified, or saturated. Such a model will fit the data
perfectly, and thus is of little use, although it can be used to estimate the values of the
coefficients for the paths.
If there are fewer data points than parameters to be estimated then the model is
underidentified. In this case the parameters cannot be estimated, and the
researcher needs to reduce the number of parameters to be estimated by deleting or
fixing some of them.
When the number of data points is greater than the number of parameters to be
estimated, then the model is overidentified, and the analysis can proceed.
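The counting above can be sketched in a few lines (using the 13-parameter path model from earlier in this document as the example; those counts are taken from that section):

```python
# Counting data points for a model with m measured variables and the
# resulting degrees of freedom: positive df means overidentified,
# zero means just-identified, negative means underidentified.
def data_points(m):
    return m * (m + 1) // 2  # distinct variances and covariances

m, n_parameters = 5, 13
df = data_points(m) - n_parameters
print(data_points(m), df)  # 15 2
```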
Identification of the Measurement Model. The scale of each independent
variable must be fixed to a constant (typically to 1, as in z scores) or to that of one of the
measured variables (a marker variable, one that is thought to be exceptionally well
related to this latent variable and not to other latent variables in the model). To fix
the scale to that of a measured variable one simply fixes to 1 the regression coefficient
for the path from the latent variable to the measured variable. Most often the scale of
dependent latent variables is set to that of a measured variable. The scale of
independent latent variables may be set to 1 or to the variance of a measured variable.
The measurement portion of the model will probably be identified if:
There is only one latent variable, it has at least three indicators that load on it,
and the errors of these indicators are not correlated with one another.
There are two or more latent variables, each has at least three indicators that
load on it, and the errors of these indicators are not correlated, each indicator
loads on only one factor, and the factors are allowed to covary.
There are two or more latent variables, but there is a latent variable on which
only two indicators load, the errors of the indicators are not correlated, each
indicator loads on only one factor, and none of the variances or covariances between
factors is zero.
Identification of the Structural Model. This portion of the model may be identified
if:
None of the latent dependent variables predicts another latent dependent
variable.
When a latent dependent variable does predict another latent dependent
variable, the relationship is recursive, and the disturbances are not correlated.
A relationship is recursive if the causal relationship is unidirectional (one line
pointing from the one latent variable to the other). In a nonrecursive
relationship there are two lines between a pair of variables, one pointing from
A to B and the other from B to A. Correlated disturbances are indicated by
being connected with a single line with arrowhead on each end.
When there is a nonrecursive relationship between latent dependent variables
or disturbances, spend some time with: Bollen, K.A. (1989). Structural
equations with latent variables. New York: John Wiley & Sons -- or hire an
expert in SEM.
If your model is not identified, the SEM program will throw an error and then you
must tinker with the model until it is identified.
Estimation. The analysis uses an iterative procedure to minimize the
differences between the sample variance/covariance matrix and the estimated
population variance/covariance matrix. Maximum Likelihood (ML) estimation is that most
frequently employed. Among the techniques available in the software used in this
course (SAS and AMOS), the ML and Generalized Least Squares (GLS) techniques
have fared well in Monte Carlo comparisons of techniques.
Fit. With large sample sizes, the Chi-square testing the null that the model fits
the data well may be significant even when the fit is good. Accordingly there has
been great interest in developing estimates of fit that do not rely on tests of
significance. In fact, there has been so much interest that there are dozens of such
indices of fit. Tabachnick and Fidell discuss many of these fit indices. You can also
find some discussion of them in my document Conducting a Path Analysis With
SPSS/AMOS.
Model Modification and Comparison. You may wish to evaluate two nested
models. Model R is nested within Model F if Model R can be created by deleting
one or more of the parameters from Model F. The significance of the difference in fit
can be tested with a simple Chi-square statistic. The value of this Chi-square equals
the Chi-square fit statistic for Model F minus the Chi-square statistic for Model R.
The degrees of freedom equal degrees of freedom for Model F minus degrees of
freedom for Model R. A nonsignificant Chi-square indicates that removal of the
parameters that are estimated in Model F but not in Model R did not significantly
reduce the fit of the model to the data.
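The difference test is easy to compute by hand. A stdlib-only sketch, using the chi-square values from the ski-satisfaction example later in this document (for differences of more than 1 df you would need a chi-square tail function such as scipy.stats.chi2.sf):

```python
import math

def chisq_diff_p(chi_r, df_r, chi_f, df_f):
    """p value for the chi-square difference test of Model R nested
    within Model F. The closed form erfc(sqrt(x / 2)) is valid only
    for a 1-df difference."""
    dchi, ddf = chi_r - chi_f, df_r - df_f
    assert ddf == 1, "this shortcut handles only a 1-df difference"
    return math.erfc(math.sqrt(dchi / 2))

# Reduced model chi2(4) = 8.814, fuller model chi2(3) = 2.053:
p = chisq_diff_p(8.814, 4, 2.053, 3)
print(p < .05)  # True: the extra parameter significantly improves fit
```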
The Lagrange Multiplier Test (LM) can be used to determine whether or not the
model fit would be significantly improved by estimating (rather than fixing) an
additional parameter. The Wald Test can be used to determine whether or not
deleting a parameter would significantly reduce the fit of the model. The Wald test is
available in SAS Calis, but not in AMOS. One should keep in mind that adding or
deleting a parameter will likely change the effect of adding or deleting other
parameters, so parameters should be added or deleted one at a time. It is
recommended that one add parameters before deleting parameters.
Reliability of Measured Variables. The variance in each measured variable is
assumed to stem from variance in the underlying latent variable. Classically, the
variance of a measured variable can be partitioned into true variance (that related to
the true variable) and (random) error variance. The reliability of a measured variable
is the ratio of true variance to total (true + error) variance. In SEM the reliability of a
measured variable is estimated by a squared correlation coefficient, which is the
proportion of variance in the measured variable that is explained by variance in the
latent variable(s).
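When an indicator loads on a single latent variable, that squared correlation is just the squared standardized loading. A tiny illustration (the .76 loading is a hypothetical value of my own, chosen only to show the arithmetic):

```python
# Reliability of an indicator that loads on one latent variable:
# the squared standardized loading; the remainder is error variance.
loading = .76  # hypothetical standardized loading
reliability = loading ** 2
error_variance = 1 - reliability
print(round(reliability, 2), round(error_variance, 2))  # 0.58 0.42
```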
Return to Wuensch's Stats Lessons Page
SEM with AMOS
SEM with SAS Proc Calis
Intro to SEM Garson at NC State
Karl L. Wuensch
Dept. of Psychology, East Carolina University, Greenville, NC USA
November, 2009
SEM-Ski-Amos.doc
SEM With Amos: Ski Satisfaction
The data for this example come from Tabachnick and Fidell (4th ed.). The
variance/covariance matrix is in the file SkiSat-VarCov.txt, which you should download
from my StatData page. Note that the data are different in the 5th edition of T&F.
Start by booting Amos Graphics. File, New to start a new diagram. Click File,
Data Files, File Name. Select SkiSat-VarCov.txt.
Click Open. Click View Data if you wish to peek at the data you have selected.
Click OK.
Click the Draw a Latent Variable icon once. Put the cursor
where you wish to draw the ellipse for the first latent variable and click
again. Click again once for each variable you wish to relate to the
first latent variable.
Click the Draw a Latent Variable icon again and place the
ellipse for the second latent variable. Add two observed variables
associated with this latent variable. Use Move Object to relocate the
objects as desired. If an arrow will not locate as you wish it, delete it
(X icon) and then redraw it (arrow icon).
Click File, Save As and save your model before too
much time passes; that way, if AMOS decides to nuke your
model then you can get it back from the saved .amw file. I try to
remember to save my model frequently while I am working on
it.
Click the Draw Observed Variables icon and locate an
observed variable near the second (right) latent variable.
Place an arrow going from it to the second latent variable.
Click the List Variables in Data Set icon and then drag
each variable name to the appropriate rectangle. You will find
that the rectangles are not large enough to hold the variable
names. Use the Change the Shape of Objects tool to enlarge
the rectangles.
Right click the error circle that goes to numyrs. Select Object Properties. Enter
e1 as the variable name. In the same way name the other three error circles.
Use Object Properties to name the first latent variable (that on the left) LoveSki.
Click the Parameters tab and set the variance to 1. Name the second latent variable
SkiSat, but do not fix its variance. Draw an arrow from LoveSki to SkiSat.
Click the Add a Unique Variable to an Existing Variable icon and then click the
SkiSat ellipse. Move and resize the error circle and name it d2.
Compare your diagram with that in Tabachnick and Fidell. Notice that AMOS
has fixed the coefficient from LoveSki to numyrs at 1. That is not necessary, as we
fixed the variance of LoveSki to 1. Right click that arrow and select Object Properties.
Delete the 1 under Regression Weight.
Click the Analysis Properties icon. Under the Output tab select the stats you
want, as indicated below.
To start the analysis, just click the Calculate Estimates icon.
Click Proceed with the analysis.
Click the View the output path diagram to see the path diagram with values of
the estimated parameters placed on the arrows. Notice that you can select
unstandardized or standardized estimates.
The standardized regression coefficients are printed beside each path. Beside
each dependent variable is printed the r² relating that variable to the latent variable(s).
Click the View Text icon to see much more extensive output from the analysis.
Under Notes for model: Result you see that the null that the model does fit the data
well is not rejected, χ²(4) = 8.814, p = .066.
Under Estimates you see both unstandardized and standardized regression
weights. Many of the elements of the output are hyperlinks. For example, if you click
on the .399 estimate for the standardized regression weight for SkiSat <--- senseek, you
get the message "When senseek goes up by one standard deviation, SkiSat goes up by
.399 standard deviations."
The p values in the regression weights table are for tests of the null that the
regression coefficient is zero. Those in the variances tables are for tests of the null that
the variance is zero.
In the squared multiple correlations table the .328 for SkiSat indicates that 32.8%
of the variance in that latent variable is explained by its predictors (LoveSki and
SenSeek).
Look at the standardized residual covariances table. The elements in this table
represent differences between the sample variance/covariance table and the estimated
population variance/covariance table. The residuals for two covariances are
distressingly large: SenSeek-numyrs and SenSeek-dayski. We might want to
modify the model to reduce these residuals.
Total, direct, and indirect effects have the same meaning they had in path
analysis. For example, consider the standardized direct effect of LoveSki on FoodSat:
it is zero, as there is no path connecting those two variables. The standardized indirect
effect of LoveSki on FoodSat is the product of the standardized path coefficients leading
from LoveSki to FoodSat, that is, .411(.601) = .247. Of course, the total effects equal
the sum of the direct and indirect effects.
Under Modification Indices we see that the LM test indicates that allowing
LoveSki and SenSeek to covary would reduce the goodness of fit Chi-square by about
5.57. Since this involves only one parameter, this Chi-square could be evaluated on
one degree of freedom. It is significant. That is, adding this one parameter to the
model should significantly increase the fit of the model to the data.
Under Model Fit you see values of the many fit statistics. The Comparative Fit
Index (CFI) is supposed to be good with small samples, and we certainly have a small
sample here. Its value is .919. Values greater than .95 indicate good fit. The Root
Mean Square Error of Approximation (RMSEA) is .110. Values less than .06 indicate
good fit, and values greater than .10 indicate poor fit.
Modification of the Model
Our model does not fit the data very well.
Let us try adding the parameter recommended by the LM,
the path from SenSeek to LoveSki. Edit the diagram to look like
that below. Notice that LoveSki is now a latent dependent
variable. Also notice the following changes:
LoveSki no longer has its variance fixed to 1: AMOS
warned me not to constrain its variance to 1 if I wanted to
draw a path to it from SenSeek. Accordingly, I fixed the
regression coefficient from LoveSki to NumYrs at 1, giving
LoveSki the same variance as NumYrs. I had noticed earlier that
LoveSki and NumYrs were very well correlated.
I added a disturbance for LoveSki, as it is now a latent dependent variable.
After making the indicated changes in the model, click the Calculate Estimates
icon and then view the output path diagram with standardized estimates.
[Path diagram for the modified model: SenSeek now has a path to LoveSki as well as to
SkiSat; LoveSki has indicators numyrs and dayski, SkiSat has indicators snowsat and
foodsat; errors e1-e4 and disturbances d1 and d2 are attached.]
Click the View Text icon and look at the results. The goodness of fit Chi-square
is now only 2.053 on 3 degrees of freedom. Previously it was 8.814 on 4 degrees of
freedom. The change of fit Chi-square is 8.814 2.053 = 6.761 on (4 3) = 1 degrees
of freedom. Adding the path from SenSeek to LoveSki significantly increased the fit of
the model with the data.
Notice that the standardized residual matrix no longer has any very large
elements. Among the fit indices, the CFI has increased from .919 to 1.000 and the
RMSEA has decreased from .110 to 0.000, both indicating better fit.
Return to Wuensch's Stats Lessons Page
An Introduction to Structural Equation Modeling (SEM)
SEM with SAS Proc Calis
Karl L. Wuensch
Dept. of Psychology, East Carolina University, Greenville, NC USA
October, 2008
Tender-Heartedness and Attitude About Animal Research
First Things First
Go here and download to your computer (not open in your browser) the Power
Point slide for creating the diagram.
Download the data using the SEM Tenderheartedness link on my SPSS Data
Page. The data are in the form of a variance/covariance matrix. If you are going
to use SAS Calis instead of Amos, download ARC-VarCov.dat from my StatData
Page.
Description of the Data
Four of the items on Forsyth's Ethics Position Questionnaire seem to measure
tender-heartedness. The items are:
Risks to another should never be tolerated, irrespective of how small the risks
might be.
It is never necessary to sacrifice the welfare of others.
If an action could harm an innocent other, then it should not be done.
The existence of potential harm to others is always wrong, irrespective of the
benefits to be gained.
These items were included in the idealism subscale employed in the research
reported in:
Wuensch, K. L., & Poteat, G. M. (1998). Evaluating the morality of animal
research: Effects of ethical ideology, gender, and purpose. Journal of Social
Behavior and Personality, 13, 139-150.
The subjects were also asked to answer a question about how justified a
particular case of animal research was (higher scores = more justified) and asked
whether or not that research should be allowed to continue (0 = no, 1 = yes).
Your Assignment
Your assignment is to use the data gathered in this research to test the SEM
model diagrammed below. It includes two latent variables: Tender Heartedness (high
scores = high tender heartedness) and Animal Research (high scores = favor animal
research). Obtain standardized estimates, squared multiple correlations, and estimates
of variance.
In a Word document, enter your answers to each of the following questions.
Immediately following each answer, paste in the relevant part of the text output from
Amos or SAS.
1. Is the Chi-square test (of the null that the fit is as good as that of the saturated
model -- perfect) significant?
2. Do any of the regression weights fall short of statistical significance?
3. Which three paths have the highest absolute standardized regression weights?
4. Interpret the association between Tender Heartedness and Animal Research and
specify the proportion of the variance in Animal Research that is explained by
Tender Heartedness.
5. Which of the indicators for Tender Heartedness has the highest reliability?
6. What is the relationship between the estimated reliabilities and the standardized
weights of the paths leading to the indicators from the latent variables?
7. Does the GFI indicate that the model fits the data well?
8. Does the RMSEA indicate that the model fits the data well?
9. You know that the Chi-square test will be significant even when the fit is good if
you have a sufficiently large sample. How large would the sample need be here
for the Chi-square to be significant at the .05 level?
Rather than using the diagram created by Amos, I would like you to use a
diagram I drew in PowerPoint. Open it in PowerPoint. Click on each "vari" (in a text
box) and replace "vari" with the estimated variance of the error or disturbance. Click on
each ".rr" and enter the estimated reliability of the indicator variable. Click on each "sw"
and enter the estimated standardized weight. After you have finished entering the
parameter estimates and reliabilities, File, Save As, and select a graphics format (png,
jpg, or gif). Then open your Word doc and Insert, Picture the diagram into the Word
document below your answers to the nine questions posed above.
Print the Word document and bring it to me in class on Wednesday the 18th of
November, 2009.
Karl L. Wuensch, November, 2009
CFA-AMOS.doc
Confirmatory Factor Analysis With AMOS
Please read pages 732 through 749 in
Tabachnick, B. G., & Fidell, L. S. (2007). Using multivariate statistics (5th ed.).
Boston: Allyn & Bacon. ISBN-10: 0205459382. ISBN-13: 9780205459384
(Students should already have this text from the prerequisite PSYC 7431
course).
The data for this lesson are available at T & F's data site and also from my SPSS
data page, file CFA-Wisc.sav. Download the file and bring it into SPSS and pass
it to Amos. Alternatively, you can just boot Amos Graphics, click Select data
files, and then select CFA-Wisc.sav. Minor culling has already taken place, as
described in the textbook.
Draw two latent variables, one with six indicators and one with five
indicators, like this:
Click List variables in data set and then drag the names of the measured
variables into the rectangles, as shown below.
Using Object Properties, name the errors and factors, fix both factors to variance
1, remove the fixed path coefficients, and draw a two-headed arrow like this:
Click Analysis properties and select the desired output:
Click Calculate estimates. Click View the output path diagram and View
Text.
Look at the diagram with standardized estimates. Note that the solution is the
same as that shown in T & F Figure 14.10. The error variances for the measured
variables, shown on the left in Figure 14.10, are simply 1 minus the value of R²
shown in the Amos diagram. For example, for Info, 1 - .58 = .42.
Look at the text output. The null hypothesis of good fit is rejected, but this may
be simply from having too much power. The fit indices are OK. GFI (.931) exceeds .9,
CFI (.941) does not quite reach the .95 standard, and RMSEA (.06) is between good
(.05) and adequate (.08).
The Standardized Residual Covariances are large for Comp-Pictcomp and Digit-
Coding. The Modification Indices for Covariances suggest linkage between e2 (error
in Comp) and Performance IQ; perhaps we need a path from Performance IQ to Comp.
The Modification Indices for Regression Weights suggest linkage between Comp and
(Object and Pictcomp), both of which are connected with Performance IQ. Again, this
suggests a path from Performance IQ to Comp. Let us add that path and see what
happens.
Diagram for Model 1.
Diagram for Model 2.
Look at the text output for the second model. The model fit Chi-square has
dropped from 70.236 to 60.296, a drop of 9.94, which, on one df, is significant. Adding
that path from Performance IQ to Comp has significantly improved the fit of the model.
GFI has increased from .931 to .94, CFI from .941 to .960, and RMSEA has dropped
from .06 to .05.
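The significance of that drop can be verified by hand. Here is a quick check in Python (not part of the Amos workflow; it uses the closed-form tail probability for a chi-square on 1 df, P(X > x) = erfc(sqrt(x/2))):

```python
import math

# Chi-square difference test for the two nested CFA models.
# For a chi-square variate on 1 df, P(X > x) = erfc(sqrt(x / 2)).
chisq_model_1 = 70.236
chisq_model_2 = 60.296
drop = chisq_model_1 - chisq_model_2   # 9.94, on 1 df

p = math.erfc(math.sqrt(drop / 2))
print(round(drop, 2))  # 9.94
print(p < .05)         # True: the added path significantly improved fit
```

The same tail probability could be obtained from any chi-square CDF routine.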
Notice that the path from Performance IQ to Coding is not statistically significant.
Perhaps we should just drop that variable. Drop it and see what happens.
With Coding out of the model, the goodness of fit Chi-square is no longer
significant, χ²(33) = 45.018, p = .079. GFI has increased from .94 to .952, CFI from
.960 to .974, and RMSEA has dropped from .05 to .046.
Links
CFA Using AMOS Indiana Univ.
CFA Using SAS Calis
Wuensch's Stats Lessons
Karl L. Wuensch
Dept. of Psychology, East Carolina University, Greenville, NC 27858 USA
November, 2008
Drawing an SEM Diagram in Power Point
Open Power Point.
Select blank content layout.
Display the Drawing toolbar.
Click on the rectangle icon for a measured variable or the ellipse icon for a
latent variable. Then move the cursor to where you want to draw the shape. Hold down
the left mouse button and move the mouse to modify the shape of the object. Release
the mouse button when you are satisfied.
Don't like the color inside the object? Right click on the shape, select Format
AutoShape, and select a different color or no fill. In the Format AutoShape window
you can check Default for new objects.
You can resize the object by putting the cursor on one of the little circles on its
border; the cursor will change to a double-headed arrow. Hold down the left mouse
button and drag the border in the desired direction.
You can move the object by putting the cursor on a border of the shape (not on
one of those little circles); the cursor will change to a four-headed arrow. Hold down
the left mouse button and drag the object to the desired location.
You can rotate the object by putting the cursor on the little green circle; the
cursor will change to a circular arrow. Hold down the left mouse button and move the
mouse left or right.
Want to add some text inside the object? Click on the Text Box icon, move
the cursor to the interior of the shape, and draw the text box. You can now type in the
text box and format the text as you wish.
To draw an arrow from one object to another, click on the arrow icon,
place the cursor on the border of the one object, hold down the left mouse
button, and pull the arrowhead to the border of the other object.
You can draw curved lines as well, but I find it challenging. Click
AutoShapes, Lines, Curved. Move the cursor to one desired location, hold
down the left mouse button, and drag the line to the other desired location.
Double click to stop drawing. Now click Draw and select Edit Points. Grab
the straight line in its middle and pull it to curve it. Want to put an arrowhead
or heads on it? Select the line and then click the Arrow Style icon and select
the type of arrow you want.
To enter the values of estimated parameters, just put a text box
where you want it and enter the value inside the box. Text boxes can be
rotated too, so when entering a coefficient on a diagonal path you can rotate
the text box to match the slope of the path.
If you want to duplicate an object, put the cursor on its border (see the four-headed
arrow above), right click, and select Copy. Then right click and select Paste. Then move the
copy to the new location. This is very handy if, for example, you want several objects all
of the same size and shape.
You may also group objects together so that you can manipulate them
collectively as if they were a single object. First, hold down the left mouse button at one
edge of the area to be grouped and move the mouse to select all of the objects to be
grouped. When you are satisfied that you have correctly selected the objects, release
the mouse button and click on Draw, Group. Suppose you want to make a copy of the
grouped objects. Put the cursor on the border of one of the grouped objects, right click,
select copy. Right click, select paste. Grab the copied group of objects and move it to
the desired location.
When you are satisfied that your diagram is as good as it is going to get, select
File, Save As. Save it in a graphic format (png, jpg, gif). You can insert the saved
picture into a Word document.
Here is an example of a diagram created by my colleague John Finch. I need to
ask him how he drew those curved arrows. He may know a better way to do that than I
have described here, or he may just have much better drawing skills than do I.
Back to SEM Lessons
HLM-Intro.doc
An Introduction to Hierarchical Linear Modeling
The data are those described in the following article:
Singer, J. D. (1998). Using SAS proc mixed to fit multilevel models, hierarchical
models, and individual growth models. Journal of Educational and Behavioral
Statistics, 24, 323-355.
There are data for 7,185 students (Level 1) in 160 schools (Level 2). I shall use
MathAch as the Level 1 outcome variable.
Download the data using a link at my page at
http://core.ecu.edu/psyc/wuenschk/MV/HLM/HLM.htm . I shall use the .csv file myself;
I converted it to .xls before bringing it into SAS.
Model 1: Unconditional Means, Intercepts Only
After SAS has imported the data, submit this program which will estimate
parameters for a model that includes only the outcome variable and intercepts.
title 'Model 1: Unconditional Means Model, Intercepts Only';
options formdlim='-' pageno=min nodate;
proc mixed data = hsb12 covtest noclprint;
class School;
model MathAch = / solution;
random intercept / subject = School;
run;
Level 1 Equation. Yij = β0j + eij. That is, the score of the i-th case in the j-th
school is due to the intercept for the j-th school and error for the i-th case in the j-th
school. Although I usually use a for the intercept, here I use β0 for the intercept. τ00
will be an estimate of the variance in the school intercepts; the more the schools differ
in mean MathAch, the greater this variance should be.
Level 2 Equation. β0j = γ00 + u0j. That is, the school intercepts are due to the
average intercept across schools plus the effect (on the intercept) of being in school j.
Combined Equation. Substitute γ00 + u0j (from the Level 2 equation) for β0j in
the Level 1 equation and you get Yij = γ00 + u0j + eij.
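To make the combined equation concrete, here is a small simulation of the model in Python (illustrative only: the number of schools, the group size, the true variances, and the seed are my choices, not values from the HSB data):

```python
import random

random.seed(1)

gamma_00 = 12.6   # fixed effect: average intercept across schools
tau_00 = 8.6      # true variance of the school effects u0j
sigma2 = 39.1     # true variance of the student-level errors eij

# Y_ij = gamma_00 + u_0j + e_ij
scores = []
for j in range(200):                      # 200 simulated schools
    u_0j = random.gauss(0, tau_00 ** 0.5)
    for i in range(50):                   # 50 students per school
        e_ij = random.gauss(0, sigma2 ** 0.5)
        scores.append(gamma_00 + u_0j + e_ij)

grand_mean = sum(scores) / len(scores)
print(round(grand_mean, 1))  # should land close to gamma_00 = 12.6
```

Fitting such simulated data with PROC MIXED would recover estimates near the true τ00 and σ² used above.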
Fixed Effects. These are effects that are constant across schools. They are
specified in the model statement (see the SAS code above). Since no variable follows
the = sign in model MathAch = /, the only fixed parameter is the average intercept
across schools, which SAS automatically includes in the model. This effect is
symbolized with γ00 in the boxed equations. Remember that the outcome variable is
MathAch.
Random Effects. These are effects that vary across schools, u0j and eij. I shall
estimate their variance.
Look at the Output. Under Solution for Fixed Effects we find an estimate of
the average intercept across schools, 12.637. That it differs significantly from zero is of
no interest (unless zero is a meaningful point on the scale of the outcome variable).
Under Covariance Parameter Estimates, the Random Effects, you see that
the variance in intercepts across schools is estimated to be 8.6097 and it is significantly
greater than zero (this is a one-tailed test, since a variance cannot be less than 0 unless
you have been drinking too much). This tells us that the schools differ significantly in
intercepts (means). The error variance (differences among students within schools) is
estimated to be 39.1487, also significantly greater than zero.
Intraclass Correlation. We can use this coefficient to estimate the proportion of
the variance in MathAch that is due to differences among schools. To get this
coefficient we simply take the estimated variance due to schools and divide by the sum
of that same variance plus the error variance, that is, 8.6097 / (8.6097 + 39.1487) =
18%.
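That arithmetic is easy to check. This snippet (Python, outside the SAS workflow) reproduces the intraclass correlation from the two covariance parameter estimates:

```python
tau_00 = 8.6097    # estimated variance of intercepts across schools
sigma2 = 39.1487   # estimated error variance within schools

icc = tau_00 / (tau_00 + sigma2)
print(round(icc, 4))  # 0.1803 -- about 18% of the variance lies between schools
```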
Model 2: Including a Level 2 Predictor in the Model
I shall use MeanSES (the mean SES by school) as a predictor of MathAch.
MeanSES has been centered/transformed to have a mean of zero (by subtracting the
grand mean from each score).
Level 1 Equation. Same as before.
Level 2 Equation. β0j = γ00 + γ01(MeanSESj) + u0j. That is, the school
intercepts/means are due to the average intercept across schools, the effect of being in
a school with the MeanSES of school j, and the effect of everything else (error,
extraneous variables) on which school j differs from the other schools.
Combined Equation. Substituting the right hand side of the Level 2 equation
into the Level 1 equation, we get Yij = [γ00 + γ01(MeanSESj)] + [u0j + eij]. The
parameters within the brackets on the left are fixed, those on the right are random.
SAS Code. Add this code to your program and submit it (highlight just this code
before you click the running person).
title 'Model 2: Including Effects of School (Level 2) Predictors';
title2 '-- predicting MathAch from MeanSES'; run;
proc mixed covtest noclprint;
class school;
model MathAch = MeanSES / solution ddfm = bw;
random intercept / subject = school;
run;
Notice the addition of ddfm = bw. This results in SAS using the
between/within method of computing denominator df for tests of fixed effects. Why do
this? Because Singer says so.
Look at the Output, Fixed Effects. Under Solution for Fixed Effects, we see
that the equation for predicting MathAch is 12.6495 + 5.8635(School MeanSES - Grand
MeanSES); remember that MeanSES is centered about zero. That is, for each
one point increase in a school's MeanSES, MathAch rises by 5.9 points. For a school
with average MeanSES, the predicted MathAch would be the intercept, 12.6.
Grab your calculator and divide the estimated slope for MeanSES by its standard
error, retaining all decimal points. Square the resulting value of t. You will get the value
of F reported under Type 3 Tests of Fixed Effects.
Notice that the df for the fixed effect of MeanSES = 158 (number of schools
minus 2). Without the ddfm = bw the df would have been 7025. The t distribution is
not much different with 7025 df than with 158 df, so this really would not have
mattered much.
Look at the Output, Random Effects. The value of the covariance parameter
estimate for the (error) variance within schools has changed little, but that for the
difference in intercepts/means across schools has decreased dramatically, from 8.6097
to 2.6357, a reduction of 5.974. That is, MeanSES explains 5.974/8.6097 = 69% of the
differences among schools in MathAch.
Even after accounting for variance explained by MeanSES, the MathAch scores
differ significantly across schools (z = 6.53). Our estimate of this residual variance is
2.6357. Add to that our estimate of (error) variance among students within schools
(39.1578) and we have 41.7935 units of variance not yet explained. Of that not yet
explained variance, 2.6357/41.7935 = 6.3% remains available to be explained by some
other (not yet introduced into the model) Level 2 predictor. Clearly most of the variance
not yet explained is within schools, that is, at Level 1, so let's introduce a Level 1
predictor in our next model.
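The two proportions in this paragraph can be recomputed directly from the covariance parameter estimates (a quick Python check, not part of the SAS output):

```python
tau_00_model_1 = 8.6097   # intercept variance, unconditional means model
tau_00_model_2 = 2.6357   # intercept variance after adding MeanSES
sigma2_model_2 = 39.1578  # within-school error variance, Model 2

# Proportion of the school-to-school differences explained by MeanSES.
explained = (tau_00_model_1 - tau_00_model_2) / tau_00_model_1
print(round(explained, 2))  # 0.69

# Share of the still-unexplained variance that sits at Level 2.
remaining = tau_00_model_2 / (tau_00_model_2 + sigma2_model_2)
print(round(remaining, 3))  # 0.063
```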
Model 3: Including a Level 1 Predictor in the Model
Suppose that instead of entering MeanSES into the model I entered SES, the
socio-economic-status of individual students.
Level 1 Equation. Yij = β0j + β1j(SESij) + eij. That is, a student's score is due to
the intercept/mean for his/her school, the within-school effect of SES (these
slopes may differ across schools), and error. To facilitate interpretation, I shall
subtract from each student's SES score the mean SES score for the school in
which that student is enrolled. Thus, Yij = β0j + β1j(SESij - MSESj) + eij. In the
SAS code this centered SES is represented by the variable cSES.
Level 2 Equations. Each random effect (excepting error within schools) will
require a separate Level 2 equation. Here I need one for the random intercept and one
for the random slope.
For the random intercept, β0j = γ00 + u0j. That is, the intercept for school j is the
sum of the grand intercept across schools and the effect (on intercept) of being in
school j.
For the random slope, β1j = γ10 + u1j. That is, the slope for predicting MathAch
from SES is, in school j, the grand slope (across all groups) plus the effect (on slope) of
being in school j.
Combined Equation. Substituting the right hand expressions in the Level 2
equations for the corresponding elements in the Level 1 equation yields
Yij = [γ00 + γ10(SESij - MSESj)] + [u0j + u1j(SESij - MSESj) + eij]. The fixed effects are
within the brackets on the left, the random effects to the right.
SAS Code. Add this code to your program and submit it.
title 'Model 3: Including Effects of Student-Level Predictors';
title2 '--predicting MathAch from cSES';
data hsbc; set hsb12; cSES = SES - MeanSES;
run;
proc mixed data = hsbc noclprint covtest noitprint;
class School;
model MathAch = cSES / solution ddfm = bw notest;
random intercept cSES / subject = School type = un;
run;
Note the computation of cSES, student SES centered about the mean SES for
the student's school. Just as noclprint suppresses the printing of class information,
noitprint suppresses printing of information about the iterations. Type = un indicates
you are imposing no structure, allowing all parameters to be determined by the data.
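The group-mean centering done by the cSES = SES - MeanSES data step can be sketched in Python (the school labels and SES values below are invented for illustration):

```python
# Toy data: (school, SES) pairs for two hypothetical schools.
records = [("A", 0.2), ("A", 0.6), ("A", 1.0), ("B", -0.5), ("B", 0.1)]

# Compute each school's mean SES (the MeanSES variable in the HSB data).
by_school = {}
for school, ses in records:
    by_school.setdefault(school, []).append(ses)
mean_ses = {s: sum(v) / len(v) for s, v in by_school.items()}

# cSES: each student's SES centered about his or her own school's mean.
cses = [(school, round(ses - mean_ses[school], 2)) for school, ses in records]
print(cses)  # [('A', -0.4), ('A', 0.0), ('A', 0.4), ('B', -0.3), ('B', 0.3)]
```

Centered this way, cSES sums to zero within every school, which is what makes the school intercept interpretable as the school mean.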
Look at the Output, Fixed Effects. Under Solution for Fixed Effects, see that
the estimated MathAch for a student whose SES is average for his or her school is
12.6493. The average slope, across schools, for predicting MathAch from SES is
2.1932, which is significantly different from zero.
Look at the Output, Random Effects. Under Covariance Parameter
Estimates we see that the UN(1,1) estimate is 8.6769 and is significantly greater than
zero. This is an estimate of the variance (across schools) for the first parameter, the
intercept. That it is significantly greater than zero tells us that there remains variance,
across schools, in MathAch, even after controlling for cSES.
The UN(2,1) estimates the covariance between the first parameter and the
second, that is, between the school intercepts and school slopes. This (with a two-
tailed test) falls well short of significance. There is no evidence that the slope for
predicting MathAch from cSES depends on a school's average value of MathAch.
The UN(2,2) estimates the variance in the second parameter, cSES. The
estimated variance, .694, is significantly greater than zero. In other words, the slope for
predicting MathAch from cSES differs across schools.
The unconditional means model (the first model) estimated the within-schools
variance in MathAch to be 39.1487. Our most recent model shows that within-schools
variance is 36.7006 after taking out the effect of cSES. Accordingly, cSES accounted
for 39.1487 - 36.7006 = 2.4481 units of variance, or 2.4481/39.1487 = 6.25% of the
within-school variance.
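A quick check of that variance-explained arithmetic (Python, outside the SAS workflow):

```python
sigma2_model_1 = 39.1487  # within-school variance, unconditional means model
sigma2_model_3 = 36.7006  # within-school variance after adding cSES

units = sigma2_model_1 - sigma2_model_3
proportion = units / sigma2_model_1
print(round(units, 4))       # 2.4481 units of variance accounted for by cSES
print(round(proportion, 4))  # 0.0625, i.e., 6.25% of the within-school variance
```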
Model 4: A Model with Predictors at Both Levels and All Interactions
Here I add to the model the variable sector, where 0 = public school and 1 =
Catholic school. Notice in the SAS code that the model also includes interactions
among predictors. More on this later.
SAS Code.
title 'Model 4: Model with Predictors From Both Levels and Interactions';
proc mixed data = hsbc noclprint covtest noitprint;
class School;
model mathach = MeanSES sector cSES MeanSES*Sector
MeanSES*cSES Sector*cSES MeanSES*Sector*cSES
/ solution ddfm = bw notest;
random intercept cSES / subject = School type = un;
run;
Look at the Output, Fixed Effects. MeanSES x Sector and MeanSES x Sector
x cSES are not significant. Without further comment I shall drop these from the model.
Model 5: A Model with Predictors at Both Levels and Selected Interactions
I provide more comment on this model.
Level 1 Equation. Yij = β0j + β1j(cSESij) + eij.
Level 2 Equations.
For the random intercept, β0j = γ00 + γ01(MeanSESj) + γ02(Sectorj) + u0j. That is, the
intercept/mean for a school's MathAch is due to the grand mean, the linear effect of
MeanSES, the effect of Sector, and the effect of being in school j.
For the random slope, β1j = γ10 + γ11(MeanSESj) + γ12(Sectorj) + u1j. That is, the
slope for predicting MathAch from SES, in school j, is affected by the grand slope
(across all schools), the effect of being in a school with the MeanSES that school j has,
the effect of being in a Catholic school, and the effect of everything else on which the
schools differ.
Combined Equation.
Yij = [γ00 + γ01(MeanSESj) + γ02(Sectorj) + γ10(cSESij) + γ11(MeanSESj)(cSESij) +
γ12(Sectorj)(cSESij)] + [u0j + u1j(cSESij) + eij]. Aren't you glad you remember that
algebra you learned in ninth grade?
SAS Code. Add this code to your program and submit it.
title 'Model 5: Model with Two Interactions Deleted';
title2 '--predicting mathach from meanses, sector, cses and ';
title3 'cross level interaction of meanses and sector with cses'; run;
proc mixed data = hsbc noclprint covtest noitprint;
class School;
model MathAch = MeanSES Sector cSES MeanSES*cSES Sector*cSES
/ solution ddfm = bw notest;
random intercept cSES / subject = School type = un;
proc means mean q1 q3 min max skewness kurtosis; var MeanSES Sector cSES;
run;
Look at the Output, Fixed Effects. All of the fixed effects are significant. Sector
is new to this model. The main effect of sector tells us that a one point increase in
sector is associated with a 1.2 point increase in MathAch. Since public schools were
coded with 0 and Catholic schools with 1, this means that higher MathAch is
associated with the school's being Catholic. Keep in mind that this is above and
beyond other effects in the model. Also new to this model are the interactions with
cSES. The MeanSES x cSES interaction indicates that the slopes for predicting
MathAch from cSES differ across levels of MeanSES. The Sector x cSES interaction
indicates that the slopes for predicting MathAch from cSES differ between public and
Catholic schools. Note that Singer reported that she tested for a MeanSES x Sector
interaction and a MeanSES x cSES x Sector interaction but found them not to be
significant.
I created separate regression equations for the public and the Catholic schools
by substituting 0 and 1 for the values of sector. For the public schools, that yields
MathAch = 12.11 + 5.34(MeanSES) + 1.22(0) + 2.94(cSES) + 1.04(MeanSES)(cSES) -
1.64(cSES)(0). For the Catholic schools, that yields MathAch = 12.11 + 5.34(MeanSES)
+ 1.22(1) + 2.94(cSES) + 1.04(MeanSES)(cSES) - 1.64(cSES)(1). These simplify to:
Public: 12.11 + 5.34(MeanSES) + 2.94(cSES) + 1.04(MeanSES)(cSES)
Catholic: 13.33 + 5.34(MeanSES) + 1.30(cSES) + 1.04(MeanSES)(cSES)
As you can see, MathAch is significantly higher in the Catholic schools and the effect
of cSES on MathAch is significantly greater in the public schools.
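Substituting Sector = 0 versus 1 into the fixed-effects equation can be scripted; this Python sketch (not part of the lesson's SAS workflow) just automates the algebra shown above, using the coefficients reported in the text:

```python
def predicted_mathach(mean_ses, sector, cses):
    # Model 5 fixed effects, with Sector coded 0 = public, 1 = Catholic.
    return (12.11 + 5.34 * mean_ses + 1.22 * sector + 2.94 * cses
            + 1.04 * mean_ses * cses - 1.64 * sector * cses)

# Simplified intercepts and cSES slopes at MeanSES = 0:
public_intercept = predicted_mathach(0, 0, 0)
catholic_intercept = predicted_mathach(0, 1, 0)
public_slope = predicted_mathach(0, 0, 1) - public_intercept
catholic_slope = predicted_mathach(0, 1, 1) - catholic_intercept

print(round(public_intercept, 2), round(public_slope, 2))      # 12.11 2.94
print(round(catholic_intercept, 2), round(catholic_slope, 2))  # 13.33 1.3
```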
The MeanSES x cSES interaction can be illustrated by preparing a plot of the
relationship between MathAch and cSES at each of two or three levels of MeanSES;
for example, when MeanSES is at its first quartile, its second quartile, and its third quartile.
Italassi could be used to illustrate this interaction interactively, but it is hard to move that
slider in the published article.
At the mean for sector (.493), MathAch = 12.11 + 5.34(MeanSES) + 1.22(.493) +
2.94(cSES) + 1.04(MeanSES)(cSES) - 1.64(cSES)(.493) =
12.71 + 5.34(MeanSES) + 2.13(cSES) + 1.04(MeanSES)(cSES).
At Q1 for MeanSES (-.32), MathAch = 12.71 - 1.71 + 2.13(cSES) - 0.33(cSES) =
11.00 + 1.80(cSES).
At Q2 for MeanSES (.006), MathAch = 12.71 + .03 + 2.13(cSES) + .006(cSES) =
12.74 + 2.14(cSES).
At Q3 for MeanSES (.33), MathAch = 12.71 + 1.76 + 2.13(cSES) + 0.34(cSES) =
14.47 + 2.47(cSES).
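These conditional equations, and the predicted values tabled below, can be reproduced with a short loop. As in the text, the conditional intercept and slope are rounded to two decimals before predicting (Python sketch, not part of the original SAS workflow):

```python
def conditional_equation(mean_ses, sector=.493):
    # Conditional intercept and cSES slope at a given MeanSES,
    # evaluated at the mean of Sector (.493).
    intercept = 12.11 + 5.34 * mean_ses + 1.22 * sector
    slope = 2.94 + 1.04 * mean_ses - 1.64 * sector
    return round(intercept, 2), round(slope, 2)

predictions = {}
for label, q in [("Q1", -.32), ("Q2", .006), ("Q3", .33)]:
    a, b = conditional_equation(q)
    low, high = a + b * -3, a + b * 3          # MathAch at cSES = -3 and +3
    predictions[label] = (round(low, 2), round(high, 2), round(high - low, 2))
    print(label, predictions[label])
```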
For each of these conditional regressions I shall predict MathAch at two values
of cSES (-3 and +3) and then produce an overlay plot with the three lines.
Here is the table of predicted values:

MeanSES    cSES = -3    cSES = +3    Difference
Q1         5.60         16.40        10.80
Q2         6.32         19.16        12.84
Q3         7.06         21.88        14.82
Here is a plot of the relationship between cSES and MathAch at each of three
levels of MeanSES. Notice that the slope increases as MeanSES increases.
[Figure: plot of MathAch (0 to 25) against cSES (-3 to +3), with one line for each of
MeanSES = Q1, Q2, and Q3.]
Look at the Output, Random Effects. The estimate for the differences in
intercepts across schools, UN(1,1), remains significant, but now the estimate for
the differences across schools in slope (for predicting MathAch from cSES), UN(2,2), is
small and not significant, as is the estimate for the covariance between intercepts and
slopes, UN(2,1). Perhaps I should trim the model, removing those components.
Model 6: Trimmed
In this model I remove the random effect of cSES slopes (and thus also the
covariance between those slopes and the intercepts). Because there is only one
random effect, I no longer need to use type = un.
SAS Code.
title 'Model 6: Simpler Model Without cSES Slopes';
proc mixed data = hsbc noclprint covtest noitprint;
class School;
model MathAch = MeanSES Sector cSES MeanSES*cSES Sector*cSES / solution
ddfm = bw notest;
random intercept / subject = School;
run;
data pvalue;
df = 2; p = 1 - probchi(1.1, 2);
run;
proc print data = pvalue noobs; run;
Look at the Output. The model for the fixed effects is the same in this model as it was
in the previous model. Accordingly, we can compare the two models' fit with the data by
comparing their fit statistics.
Fit Statistics

                             Model 5    Model 6 (trimmed)
-2 Res Log Likelihood        46503.7    46504.8
AIC (smaller is better)      46511.7    46508.8
AICC (smaller is better)     46511.7    46508.8
BIC (smaller is better)      46524.0    46514.9
The fit is (slightly) better in the trimmed model by the AIC, AICC, and BIC.
Trimming the model has increased the -2 Res Log Likelihood statistic by only 1.1. We
can evaluate the significance of this change with a Chi-square on 2 df (one df for each
parameter trimmed, the slope variance and the Slope x Intercept covariance). As you
can see, deleting those two parameters has not significantly affected the fit of the
model to the data.
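That likelihood-ratio comparison can be verified by hand: for 2 df the chi-square tail probability has the closed form exp(-x/2), which is exactly what the 1 - probchi(1.1, 2) step in the SAS code computes. A quick Python check:

```python
import math

# Difference in -2 Res Log Likelihood between the full and trimmed models.
change = 46504.8 - 46503.7   # 1.1, on 2 df

# For a chi-square variate on 2 df, P(X > x) = exp(-x / 2).
p = math.exp(-change / 2)
print(round(p, 3))  # 0.577 -- far from significant, so the trimming is justified
```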
Return to Wuensch's Stats Lessons Page
Karl L. Wuensch, East Carolina Univ., Dept. of Psychology, Greenville, NC 27858, USA
December, 2008
Many thanks to Dr. Cecelia Valrie, who introduced me to this topic. That said,
any mistakes in this document are mine, not hers.